fix pleg relist time #45496

andyxning · 2017-05-08T15:09:22Z

This PR fix pleg reslist time. According to current implementation, we have a Healthy method periodically check the relist time. If current timestamp subtracts latest relist time is longer than relistThreshold(default is 3 minutes), we should return an error to indicate the error of runtime.

relist method is also called periodically. If runtime(docker) hung, the relist method should return immediately without updating the latest relist time. If we update latest relist time no matter runtime(docker) hung(default timeout is 2 minutes), the Healthy method will never return an error.

Kubelet PLEG updates the relist timestamp only after successfully relisting.

/cc @yujuhong @Random-Liu @dchen1107

k8s-reviewable · 2017-05-08T15:09:30Z

This change is

andyxning · 2017-05-08T15:14:06Z

/assign @derekwaynecarr

PTAL.

andyxning · 2017-05-08T15:23:51Z

@k8s-bot ok to test

yujuhong · 2017-05-08T17:00:08Z

@andyxning the original intention of the check is to indicate that kubelet is not function normally -- it's not updating its view of the containers/pods in the time frame. When that happens, kubelet will not react to changes in container states and will not be able to start new pods. That's why I wanted to surface it to the node status.

What's the reason of changing the logic?

andyxning · 2017-05-09T05:57:48Z

@yujuhong

Just for docker hung for example.

rellist will be run periodically to list pods. In case docker hung, relist will take for more than 2 minutes(default settings for common docker operation). No matter whether docker hung, we update latest relist time when we run relist.

In Healthy, we also check the period since latest relist time, if the period is longer enough we just think that docker runtime is not working properly and return an error. I think this is the intended result.

However, just as what relist does, we update latest relist time when we do a relist, even docker hung and reslit will last for more than 2 minutes, so the check for period since latest reslit time will always less than the relistThreshold(3 minutes) cause the largest period since latest relist time is 2 minutes.

What we should do is to update the latest relist time only when we successfully relist, i.e., without docker hung. If we encounter docker hung, the latest relist time may longer than relistThreshold and Healthy can do the judgement and return an error to indicate than an error happens to docker runtime.

TL;DR:
we should only update relist time when we can run relist successfully from docker runtime.

andyxning · 2017-05-12T02:48:57Z

friendly ping @yujuhong :)

andyxning · 2017-05-16T14:39:44Z

/cc @yujuhong @derekwaynecarr @dchen1107 :)

yujuhong · 2017-05-17T23:39:03Z

pkg/kubelet/pleg/generic.go

+		glog.Errorf("GenericPLEG: Unable to retrieve pods: %v", err)
+		return
+	}
+


You need to move timestamp := g.clock.Now() and the defer block for PLEGRelistLatency to before the GetPods. Otherwise, we won't be recording the right metrics.

@yujuhong Right. Will do.

Done. PTAL.

BTW, @yujuhong Does this PR can be cherry picked to 1.3/1.4/1.5/1.6?

/ping @yujuhong

@andyxning Why do you need to cherry pick this to 1.3? Are you still using it?

Yes, some clusters are still using it.

yujuhong · 2017-05-19T00:15:32Z

BTW, @yujuhong Does this PR can be cherry picked to 1.3/1.4/1.5/1.6?

I think we've stopped patching and cutting releases for 1.3. Cherrypicks for 1.4 or 1.5 are usually critical fixes.
I don't even remember whether we had docker request timeout for 1.3. If not, and if docker hangs, PLEG will not be able to recover and update the timestamp.

I understand this bug might be annoying, but kubelet also has other docker health checks. On top of that, on most platforms, the babysitter on your node will run docker ps and restart docker if the request fails. Why do you want to cherrypick to so many branches?

andyxning · 2017-05-19T00:38:14Z

I have checked the source code for v1.3.3. There is a timeout for rlist operation. I think if this pr is ok, maybe we could accept it first.

As for backport to 1.3, this is just my thought. If we do not do it in the main repo, i can do this for our own.

I understand this bug might be annoying, but kubelet also has other docker health checks. On top of that, on most platforms, the babysitter on your node will run docker ps and restart docker if the request fails. Why do you want to cherrypick to so many branches?

We use systemd to supervisor docker instead of supervisord. For systemd, it is not so easy to make a babysister.:(

yujuhong · 2017-05-19T00:48:11Z

/lgtm

I have checked the source code for v1.3.3. There is a timeout for rlist operation. I think if this pr is ok, maybe we could accept it first.

For cherrypicking, you'd need a unit test to convince the branch manager that this PR actually fixes an issue. If you can add a test, we can at least try to cherrypick this to 1.6 (and 1.5).

As for backport to 1.3, this is just my thought. If we do not do it in the main repo, i can do this for our own.

It's unlikely we'll do this, but I don't have time to verify whether this is possible. You can try asking in the dev mailing list and/or release slack channels, but I'd suggest you patch manually.

We use systemd to supervisor docker instead of supervisord. For systemd, it is not so easy to make a babysister.:(

On COS (the image we used for GKE), we started a separate systemd unit to monitor docker, just FYI: https://github.com/kubernetes/kubernetes/blob/v1.7.0-alpha.0/cluster/gce/gci/node.yaml#L43

yujuhong · 2017-05-19T00:51:18Z

Please add a release note.

andyxning · 2017-05-19T02:05:42Z

Release note:
Kubelet PLEG relist time updating only after successfully relisting

andyxning · 2017-05-19T02:06:07Z

@yujuhong Done.

yujuhong · 2017-05-19T16:50:18Z

@andyxning you need to update your original comment with the "```release-note" block. I edited that for you.

/lgtm

k8s-github-robot · 2017-05-19T16:50:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyxning, yujuhong

Needs approval from an approver in each of these OWNERS Files:

~~pkg/kubelet/OWNERS~~ [yujuhong]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

andyxning · 2017-05-19T21:58:21Z

@yujuhong Many thanks. Can we remove the do-not-merge label. :)

andyxning · 2017-05-20T07:27:48Z

/ping @yujuhong

k8s-ci-robot · 2017-05-21T10:49:00Z

@andyxning: The following test(s) failed:

Test name	Commit	Details	Rerun command
pull-kubernetes-federation-e2e-gce	`af6c040`	link	`@k8s-bot pull-kubernetes-federation-e2e-gce test this`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-github-robot · 2017-05-21T11:17:13Z

Automatic merge from submit-queue

derekwaynecarr · 2017-05-22T16:25:36Z

@sjenning -- can you pick this to our origin 1.6 release?

sjenning · 2017-05-22T16:57:41Z

@derekwaynecarr yes, i'll do it

warmchang · 2017-12-21T14:35:50Z

👍

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 8, 2017

k8s-ci-robot requested review from Random-Liu, dchen1107 and yujuhong May 8, 2017 15:09

k8s-github-robot assigned dashpole and tmrts May 8, 2017

k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-label-needed labels May 8, 2017

k8s-ci-robot assigned derekwaynecarr May 8, 2017

yujuhong assigned yujuhong and unassigned dashpole and tmrts May 8, 2017

yujuhong reviewed May 17, 2017

View reviewed changes

fix pleg relist time

af6c040

andyxning force-pushed the fix_pleg_relist_time branch from 0f738bc to af6c040 Compare May 18, 2017 03:40

k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 18, 2017

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2017

k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. labels May 19, 2017

yujuhong added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels May 19, 2017

fejta removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label May 21, 2017

k8s-github-robot merged commit c1f8fcd into kubernetes:master May 21, 2017

andyxning deleted the fix_pleg_relist_time branch May 21, 2017 11:17

sjenning mentioned this pull request May 22, 2017

UPSTREAM: 45496: fix pleg relist time openshift/origin#14282

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix pleg relist time #45496

fix pleg relist time #45496

andyxning commented May 8, 2017 •

edited by yujuhong

Loading

k8s-reviewable commented May 8, 2017

andyxning commented May 8, 2017

andyxning commented May 8, 2017

yujuhong commented May 8, 2017

andyxning commented May 9, 2017

andyxning commented May 12, 2017

andyxning commented May 16, 2017

yujuhong May 17, 2017

andyxning May 18, 2017

andyxning May 18, 2017 •

edited

Loading

andyxning May 18, 2017

Random-Liu May 19, 2017

andyxning May 19, 2017

yujuhong commented May 19, 2017

andyxning commented May 19, 2017

yujuhong commented May 19, 2017

yujuhong commented May 19, 2017

andyxning commented May 19, 2017

andyxning commented May 19, 2017

yujuhong commented May 19, 2017

k8s-github-robot commented May 19, 2017

andyxning commented May 19, 2017 •

edited

Loading

andyxning commented May 20, 2017

k8s-ci-robot commented May 21, 2017 •

edited

Loading

k8s-github-robot commented May 21, 2017

derekwaynecarr commented May 22, 2017

sjenning commented May 22, 2017

warmchang commented Dec 21, 2017

fix pleg relist time #45496

fix pleg relist time #45496

Conversation

andyxning commented May 8, 2017 • edited by yujuhong Loading

k8s-reviewable commented May 8, 2017

andyxning commented May 8, 2017

andyxning commented May 8, 2017

yujuhong commented May 8, 2017

andyxning commented May 9, 2017

andyxning commented May 12, 2017

andyxning commented May 16, 2017

yujuhong May 17, 2017

Choose a reason for hiding this comment

andyxning May 18, 2017

Choose a reason for hiding this comment

andyxning May 18, 2017 • edited Loading

Choose a reason for hiding this comment

andyxning May 18, 2017

Choose a reason for hiding this comment

Random-Liu May 19, 2017

Choose a reason for hiding this comment

andyxning May 19, 2017

Choose a reason for hiding this comment

yujuhong commented May 19, 2017

andyxning commented May 19, 2017

yujuhong commented May 19, 2017

yujuhong commented May 19, 2017

andyxning commented May 19, 2017

andyxning commented May 19, 2017

yujuhong commented May 19, 2017

k8s-github-robot commented May 19, 2017

andyxning commented May 19, 2017 • edited Loading

andyxning commented May 20, 2017

k8s-ci-robot commented May 21, 2017 • edited Loading

k8s-github-robot commented May 21, 2017

derekwaynecarr commented May 22, 2017

sjenning commented May 22, 2017

warmchang commented Dec 21, 2017

andyxning commented May 8, 2017 •

edited by yujuhong

Loading

andyxning May 18, 2017 •

edited

Loading

andyxning commented May 19, 2017 •

edited

Loading

k8s-ci-robot commented May 21, 2017 •

edited

Loading