Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics to all major gce operations (latency, errors) #44510

Merged
merged 1 commit into from
Apr 27, 2017

Conversation

bowei
Copy link
Member

@bowei bowei commented Apr 14, 2017

Add metrics to all major gce operations {latency, errors}

The new metrics are:

  cloudprovider_gce_api_request_duration_seconds{request, region, zone}
  cloudprovider_gce_api_request_errors{request, region, zone}
 
`request` is the specific function that is used.
`region` is the target region (Will be "<n/a>" if not applicable)
`zone` is the target zone (Will be "<n/a>" if not applicable)

Note: this fixes some issues with the previous implementation of
metrics for disks:
- Time duration tracked was of the initial API call, not the entire
  operation.
- Metrics label tuple would have resulted in many independent
  histograms stored, one for each disk. (Did not aggregate well).

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 14, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 14, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 14, 2017
@bowei
Copy link
Member Author

bowei commented Apr 14, 2017

/assign @saad-ali @nicksardo

@bowei bowei force-pushed the gce-metrics branch 5 times, most recently from 41846b2 to 9ee5e0a Compare April 15, 2017 04:07
@bowei
Copy link
Member Author

bowei commented Apr 15, 2017

@k8s-bot non-cri e2e test this

@ghost ghost removed their assignment Apr 18, 2017
@bowei
Copy link
Member Author

bowei commented Apr 18, 2017

@saad-ali -- assign to someone for review?

@gnufied
Copy link
Member

gnufied commented Apr 18, 2017

/assign @gnufied

@saad-ali
Copy link
Member

Thanks for the fixes here @bowei. @gnufied from the storage-sig will review the storage changes.

CC @brancz to review on behalf of sig-instrumentation; @kubernetes/sig-storage-pr-reviews as FYI

@gnufied
Copy link
Member

gnufied commented Apr 18, 2017

cc @brancz

@gnufied
Copy link
Member

gnufied commented Apr 19, 2017

@bowei can you explain -

"Metrics label tuple would have resulted in many independent
histograms stored, one for each disk. (Did not aggregate well)." ?

The old metrics only had namespace as dimensions and this PR replaces that with zone and node as dimension. So likely old PR will create 1 time series for 1 metric but this PR will create 2000 time series for 1 metric.

Also --

"- Time duration tracked was of the initial API call, not the entire
operation."

What you mean by entire operation? Do you mean we capture error and success metrics separately?

@bowei
Copy link
Member Author

bowei commented Apr 19, 2017

  1. If you look at the metrics created before this PR:
bowei@e2e-test-bowei-master ~ $ curl localhost:10252/metrics 2>/dev/null | grep _disk_ | grep Inf   
gce_attach_disk_duration_seconds_bucket{namespace="e2e-test-bowei-dynamic-pvc-02c15253-1f0f-11e7-8e35-42010a280002",le="+Inf"} 1
gce_attach_disk_duration_seconds_bucket{namespace="e2e-test-bowei-dynamic-pvc-1c0aba8f-1f0f-11e7-8e35-42010a280002",le="+Inf"} 1
gce_attach_disk_duration_seconds_bucket{namespace="e2e-test-bowei-dynamic-pvc-de1c20ab-1f0e-11e7-8e35-42010a280002",le="+Inf"} 1
... (many entries)

You will end up keeping a histogram for each PVC that is created.

The number of separate stats kept by Prometheus is the number of unique tuples for the given metric. For disk, I aggregate based on (zone, node), which means the metrics tracked will be the same as the number of nodes in the cluster.

After this change, we see the following, aggregating per node.

bowei@e2e-test-bowei-master ~ $ curl localhost:10252/metrics 2>/dev/null | grep cloud | grep Inf
cloudprovider_gce_disk_duration_seconds_bucket{function="attach",node="e2e-test-bowei-minion-group-3wd9",zone="us-central1-b",le="+Inf"} 1
cloudprovider_gce_disk_duration_seconds_bucket{function="attach",node="e2e-test-bowei-minion-group-52ql",zone="us-central1-b",le="+Inf"} 2
cloudprovider_gce_routes_duration_seconds_bucket{function="create",le="+Inf"} 41
  1. Most of the GCP operations that change state return immediately after the change has been submitted, but the completion of the API call is signaled by polling the API for the state change. When you attach the duration timer to the HTTP transaction context, you will be timing the API submission, not the duration of the state change. We can track completion time of the HTTP request/response but it seems to be more interesting to track the time taken by the overall state change.

@bowei
Copy link
Member Author

bowei commented Apr 19, 2017

@gnufied responded

@gnufied
Copy link
Member

gnufied commented Apr 19, 2017

@bowei I don't have a prometheus install handy but depending on metrics scrapper, some dimensions (such as node) are automatically added when metrics are collected. It is entirely possible I am wrong. can you please check?

@bowei
Copy link
Member Author

bowei commented Apr 19, 2017

@gnufied: the Prometheus scraper will add the node the metric was scraped from, which in this case will be the master node where the controller was run, not the target of the operation (e.g. the node that the PV is being mounted on).

@gnufied
Copy link
Member

gnufied commented Apr 21, 2017

@bowei I think we have bunch of parallel threads going on about metric naming and dimensions. I opened a PR for aws here - #43477 in which @justinsb proposed that we report api action as a dimension rather than part of the metric name.

You have also dropped namespace from the reported metrics. I am not sure using e2e tests is a good example since it creates lot of ephemeral namespaces.

There is also a design proposal which should be amended if we are going to change metric names - https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cloudprovider-storage-metrics.md

Can we have a quick call today sometime and discuss this offline? I would ask @justinsb to be present as well, so as we can discuss a common naming scheme for all cloudprovider metrics.

@gnufied
Copy link
Member

gnufied commented Apr 21, 2017

@bowei @justinsb does 1.00PM EST works for you guys?

@brancz
Copy link
Member

brancz commented Apr 24, 2017

Sorry I was on vacation, is there a summary of your offline discussion, or what was the outcome? Do you want me to hold off on reviewing for now?

@gnufied
Copy link
Member

gnufied commented Apr 24, 2017

Me, @chauncey(Jonathan) @bowei and @justinsb discussed this offline. Following things were finalized:

  1. As agreed previously API action name will be moved to a dimension and will not be part of metric name.
  2. The actual meaning of metrics between cloudproviders might be subtly different. For example - aws_create_disk could just mean time it took to submit the API request, whereas gce_create_disk could mean - time it took to submit the API request and waiting for API action to take effect. These low level metrics, shouldn't be used to measure relative performance of cloudprovider (namely - if creating a disk is faster in AWS or GCE). It was agreed that, it is upto cloudprovider maintainers to decide how they are going to emit metrics.
  3. For now, we are going to drop the namespace requirement from the metrics. Namespace information isn't available in most call sites anyways. It was discussed that, we may amend metric emission to pass Context objects which can hold additional dimensions.

@justinsb @bowei let me know, if I missed anything.

@bowei
Copy link
Member Author

bowei commented Apr 24, 2017

Thanks @gnufied -- that is a good summary. Are you going to update the metrics doc as well?

@bowei
Copy link
Member Author

bowei commented Apr 24, 2017

@gnufied -- oh, did you guys decide on the specific naming?

"cloudprovider_{gce,aws,...}_{disk, }"?

Or something else...

@gnufied
Copy link
Member

gnufied commented Apr 24, 2017

@bowei we didn't really had a chance to get consensus around metric name. But I have updated the design proposal kubernetes/community#507 to use cloudprovider_gce_api_duration_seconds as a name and api_request as a dimension.

@bowei
Copy link
Member Author

bowei commented Apr 26, 2017

@gnufied please take a look, I updated the metrics to fit with the discussed metrics proposal

Copy link
Member

@gnufied gnufied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small nits but looks good.

}

// Value for an unused label in the metric dimension.
const unusedLabel = "<n/a>"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename it to unusedMetricLabel ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Observe the result of a API call.
func (mc *metricContext) Observe(err error) error {
apiMetrics.latency.WithLabelValues(mc.attributes...).Observe(
time.Now().Sub(mc.start).Seconds())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time.Since(mc.start).Seconds() seems simpler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@bowei bowei force-pushed the gce-metrics branch 2 times, most recently from 3e369b2 to 862facc Compare April 27, 2017 01:00
@gnufied
Copy link
Member

gnufied commented Apr 27, 2017

@bowei lets wait for @brancz to have a look by tomorrow. It looks all good from me.

@brancz
Copy link
Member

brancz commented Apr 27, 2017

Same thing applies here as I mentioned here.

The new metrics is:

  cloudprovider_gce_api_request_duration_seconds{request, region, zone}
  cloudprovider_gce_api_request_errors{request, region, zone}

`request` is the specific function that is used.
`region` is the target region (Will be "<n/a>" if not applicable)
`zone` is the target zone (Will be "<n/a>" if not applicable)

Note: this fixes some issues with the previous implementation of
metrics for disks:
- Time duration tracked was of the initial API call, not the entire
  operation.
- Metrics label tuple would have resulted in many independent
  histograms stored, one for each disk. (Did not aggregate well).
@bowei
Copy link
Member Author

bowei commented Apr 27, 2017

@brancz @gnufied -- fixed the naming as requested

@gnufied
Copy link
Member

gnufied commented Apr 27, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, gnufied

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 44124, 44510)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants