Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inconsistent Prometheus cAdvisor metrics #51473

Merged

Conversation

bboreham
Copy link
Contributor

What this PR does / why we need it:

We need this because otherwise kubelet is exposing different sets of Prometheus metrics that randomly include or do not include containers.

See also google/cadvisor#1704; quoting here:

Prometheus requires that all metrics in the same family have the same labels, so we arrange to supply blank strings for missing labels

The function containerPrometheusLabels() conditionally adds various metric labels from container labels - pod name, image, etc. However, when it receives the metrics, Prometheus checks that all metrics in the same family have the same label set, and rejects those that do not.

Since containers are collected in (somewhat) random order, depending on which kind is seen first you get one set of metrics or the other.

Changing the container labels function to always add the same set of labels, adding "" when it doesn't have a real value, eliminates the issue in my testing.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Fixes #50151

Special notes for your reviewer:

I have made the same fix in two places. I am 98% sure the one in cadvisor_linux.go isn't used and indeed cannot be used, but have not gone fully down that rabbit-hole.

Release note:

Fix inconsistent Prometheus cAdvisor metrics

Prometheus requires that all metrics in the same family have the same
labels, so we arrange to supply blank strings for missing labels

See google/cadvisor#1704
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 28, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @bboreham. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 28, 2017
@k8s-github-robot k8s-github-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Aug 28, 2017
@yujuhong yujuhong assigned dashpole and yguo0905 and unassigned yifan-gu and yujuhong Aug 28, 2017
@dashpole
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 28, 2017
@dashpole
Copy link
Contributor

/test pull-kubernetes-federation-e2e-gce

@dashpole
Copy link
Contributor

This change looks good, although it doesnt appear to fix all issues. At least for me, it looks like diskio metrics are still not consistent. I'll have to look into that...
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2017
@dashpole
Copy link
Contributor

/assign @Random-Liu
for approval. This fixes google/cadvisor#1704

@dashpole
Copy link
Contributor

/test pull-kubernetes-federation-e2e-gce

@Random-Liu
Copy link
Member

/approve

@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Random-Liu, bboreham, dashpole

Associated issue: 50151

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 28, 2017
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

1 similar comment
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 51471, 50561, 50435, 51473, 51436)

@k8s-github-robot k8s-github-robot merged commit cc557e6 into kubernetes:master Aug 29, 2017
@sylr
Copy link
Contributor

sylr commented Aug 29, 2017

This needs to be cherry picked into release-1.7 before the next release.

@sylr
Copy link
Contributor

sylr commented Sep 7, 2017

@Random-Liu or anyone could launch the release-1.7 cherry-pick process ?

@luxas
Copy link
Member

luxas commented Sep 11, 2017

@sylr if you want to cherrypick this; just propose doing so.
Install https://github.com/github/hub and run

GITHUB_USER=sylr hack/cherry-pick-pull.sh upstream/release-1.7 51473

(also see the doc: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md)

No need to wait for others to do this.
Thanks!

@k8s-cherrypick-bot
Copy link

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@sylr
Copy link
Contributor

sylr commented Sep 11, 2017

@luxas Thanks

Note to self:

GITHUB_USER=sylr UPSTREAM_REMOTE=origin FORK_REMOTE=sylr hack/cherry_pick_pull.sh sylr/release-1.7 51473

@luxas
Copy link
Member

luxas commented Sep 11, 2017

yeah, if that are the git remote names you're using

@AndreaGiardini
Copy link

@luxas I think you need to add a release milestone to get this cherry-pick go on

@luxas
Copy link
Member

luxas commented Sep 11, 2017

@AndreaGiardini AFAIK, that's not totally mandatory, but makes it much more clear and the normal flow, so will do. Thanks!

@luxas luxas added this to the v1.7 milestone Sep 11, 2017
@wojtek-t wojtek-t added cherrypick-candidate cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. labels Sep 13, 2017
k8s-github-robot pushed a commit that referenced this pull request Sep 17, 2017
…ylr-release-1.7

Automatic merge from submit-queue.

Automated cherry pick of #51473

Cherry pick of #51473 on release-1.7.

#51473: Make Prometheus cAdvisor metrics labels consistent
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cAdvisor returns exclusive parts of metrics randomly