Add metrics to all major gce operations (latency, errors) #44510
Conversation
bowei commented Apr 14, 2017 (edited)
/assign @saad-ali @nicksardo
force-pushed from 41846b2 to 9ee5e0a
@k8s-bot non-cri e2e test this
@saad-ali -- assign to someone for review?
/assign @gnufied
cc @brancz
@bowei can you explain - "Metrics label tuple would have resulted in many independent histograms stored, one for each disk"? The old metrics only had namespace as a dimension, and this PR replaces that with zone and node as dimensions. So the old metrics would likely create 1 time series per metric, but this PR will create 2000 time series per metric. Also -- "Time duration tracked was of the initial API call, not the entire operation": what do you mean by the entire operation? Do you mean we capture error and success metrics separately?
You will end up keeping a histogram for each PVC that is created. The number of separate time series kept by Prometheus is the number of unique label tuples for the given metric. For disk, I aggregate instead of keeping one histogram per disk. After this change, we see the following, aggregating per node.
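For readers following along, here is a minimal sketch (not the PR's code; the metric and label names below are made up) of why per-disk labels explode cardinality while aggregated labels stay bounded: Prometheus keeps one series per unique label tuple.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// One label tuple per disk/PVC name: cardinality grows with the number
	// of disks ever created, and the histograms do not aggregate well.
	perDisk := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "disk_op_duration_seconds_by_disk"},
		[]string{"disk"},
	)
	// One label tuple per zone: cardinality is bounded by the number of zones.
	perZone := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "disk_op_duration_seconds_by_zone"},
		[]string{"zone"},
	)

	for i := 0; i < 2000; i++ {
		perDisk.WithLabelValues(fmt.Sprintf("pvc-%d", i)).Observe(0.1) // 2000 series
		perZone.WithLabelValues("us-central1-a").Observe(0.1)          // 1 series
	}
}
```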
@gnufied responded
@bowei I don't have a Prometheus install handy, but depending on the metrics scraper, some dimensions (such as node) are automatically added when metrics are collected. It is entirely possible I am wrong. Can you please check?
@gnufied: the Prometheus scraper will add the node the metric was scraped from, which in this case will be the master node where the controller was run, not the target of the operation (e.g. the node that the PV is being mounted on).
@bowei I think we have a bunch of parallel threads going on about metric naming and dimensions. I opened a PR for AWS here - #43477 - in which @justinsb proposed that we report the API action as a dimension rather than as part of the metric name. You have also dropped namespace from the reported metrics. I am not sure using e2e tests is a good example, since they create a lot of ephemeral namespaces. There is also a design proposal which should be amended if we are going to change metric names - https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cloudprovider-storage-metrics.md. Can we have a quick call sometime today and discuss this offline? I would ask @justinsb to be present as well, so that we can discuss a common naming scheme for all cloud provider metrics.
Sorry, I was on vacation. Is there a summary of your offline discussion, or what was the outcome? Do you want me to hold off on reviewing for now?
@chauncey (Jonathan), @bowei, @justinsb and I discussed this offline. The following things were finalized:
Thanks @gnufied -- that is a good summary. Are you going to update the metrics doc as well?
@gnufied -- oh, did you guys decide on the specific naming? "cloudprovider_{gce,aws,...}_{disk, }"? Or something else...
@bowei we didn't really have a chance to reach consensus on the metric name, but I have updated the design proposal kubernetes/community#507.
@gnufied please take a look, I updated the metrics to fit with the discussed metrics proposal.
Two small nits but looks good.
// Value for an unused label in the metric dimension.
const unusedLabel = "<n/a>"
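To illustrate the intent of the sentinel (a sketch only; the `labelsFor` helper below is hypothetical, not part of this PR), dimensions that do not apply to a given call are filled with the constant so the label tuple keeps a fixed shape:

```go
package main

import "fmt"

// Value for an unused label in the metric dimension (mirrors the PR's constant).
const unusedLabel = "<n/a>"

// labelsFor is a hypothetical helper that builds {request, region, zone}
// label values, substituting the sentinel when a dimension does not apply.
func labelsFor(request, region, zone string) []string {
	if region == "" {
		region = unusedLabel
	}
	if zone == "" {
		zone = unusedLabel
	}
	return []string{request, region, zone}
}

func main() {
	fmt.Println(labelsFor("disk_insert", "", "us-central1-a")) // zonal call, no region
	fmt.Println(labelsFor("firewall_list", "", ""))            // global call, neither
}
```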
Can we rename it to unusedMetricLabel?
done
// Observe the result of an API call.
func (mc *metricContext) Observe(err error) error {
	apiMetrics.latency.WithLabelValues(mc.attributes...).Observe(
		time.Now().Sub(mc.start).Seconds())
time.Since(mc.start).Seconds() seems simpler?
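For reference, time.Since(t) is defined in the standard library as time.Now().Sub(t), so the two forms record the same duration; a tiny standalone check:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()
	time.Sleep(10 * time.Millisecond)

	// Both expressions yield a time.Duration measured from start to now.
	fmt.Println(time.Now().Sub(start).Seconds())
	fmt.Println(time.Since(start).Seconds())
}
```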
done
force-pushed from 3e369b2 to 862facc
Same thing applies here as I mentioned here.
The new metrics are:

cloudprovider_gce_api_request_duration_seconds{request, region, zone}
cloudprovider_gce_api_request_errors{request, region, zone}

`request` is the specific function that is used.
`region` is the target region (will be "<n/a>" if not applicable).
`zone` is the target zone (will be "<n/a>" if not applicable).

Note: this fixes some issues with the previous implementation of metrics for disks:
- Time duration tracked was of the initial API call, not the entire operation.
- Metrics label tuple would have resulted in many independent histograms stored, one for each disk. (Did not aggregate well.)
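A minimal sketch of how metrics with this shape could be declared and recorded with the Prometheus Go client (the variable names and the `observe` helper below are illustrative, not the PR's actual code):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Latency histogram keyed by {request, region, zone}.
	requestLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "cloudprovider_gce_api_request_duration_seconds",
			Help: "Latency of a GCE API call, by request, region and zone.",
		},
		[]string{"request", "region", "zone"},
	)
	// Error counter keyed by the same label tuple.
	requestErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "cloudprovider_gce_api_request_errors",
			Help: "Number of errors for a GCE API call, by request, region and zone.",
		},
		[]string{"request", "region", "zone"},
	)
)

// observe records one call; dimensions that do not apply use "<n/a>".
func observe(request, region, zone string, start time.Time, err error) {
	requestLatency.WithLabelValues(request, region, zone).Observe(time.Since(start).Seconds())
	if err != nil {
		requestErrors.WithLabelValues(request, region, zone).Inc()
	}
}

func main() {
	prometheus.MustRegister(requestLatency, requestErrors)

	start := time.Now()
	// ... perform the GCE API call here ...
	observe("disk_insert", "<n/a>", "us-central1-a", start, nil)
}
```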
/lgtm
Automatic merge from submit-queue (batch tested with PRs 44124, 44510)