Fix pods stuck in Pending state forever #49661

julia-stripe · 2017-07-26T21:34:04Z

What this PR does / why we need it:

This PR fixes 2 bugs introduced into the scheduler in ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223.

Together, they meant that when a pod fails to bind to a node, it can end up stuck in the Pending state indefinitely. The way this happened was:

The pod fails to bind to the node
ForgetPod fails, because the wrong *v1.Pod struct was passed into bind and therefore ForgetPod (pod instead of assumed) -- ForgetPod requires that the pod ask it to forget matches the pod in the cache
bind returns immediately after ForgetPod fails instead of correctly reporting an error, because of this line: ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223 -- this should be a glog.Errorf, not a return. The return on that line causes the function to skip the error handling, which means that the pod isn't put back onto the queue to attempt scheduling again.
Because the pod isn't put back on to the queue, the pod stays stuck in Pending forever until the scheduler is restarted.

This also adds better error handling when AssumePod fails -- it seems to me that any time pod scheduling fails, we should handle the error with sched.config.Error and put the pod back in the queue to be rescheduled instead of silently failing and leaving the pod potentially stuck in Pending forever.

It also adds 2 new unit tests, which test:

when both ForgetPod and Bind fail, the correct error is reported
when AssumePod fails, the correct error is reported
when Bind fails, the correct pod is passed into ForgetPod

Which issue this PR fixes
fixes #49314

Special notes for your reviewer:

Release note:

Fix bug where pods stuck in Pending state forever after binding fails

k8s-github-robot · 2017-07-26T21:34:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: julia-stripe
We suggest the following additional approver: timothysc

Assign the PR to them by writing /assign @timothysc in a comment when ready.

Associated issue: 49314

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

plugin/pkg/scheduler/OWNERS

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-ci-robot · 2017-07-26T21:34:11Z

Hi @julia-stripe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

bsalamat · 2017-07-26T21:54:04Z

/ok-to-test

k82cn · 2017-07-27T07:03:22Z

plugin/pkg/scheduler/scheduler.go

@@ -222,7 +225,7 @@ func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
 	if err != nil {


we should call FinishBinding just after bind as before.

the call to FinishBinding was moved in this PR a few weeks ago: #48451 -- let me know if you think it should be moved back (and why? :) ).

finishbinding will add deadline and remove the resumed pod when expired. IMO, the deadline is not only for binding successfully, but also for error handling, e.g. ForgetPod.

do you think we should revert #48451?

Give me some time to check history.

That's error-prone part, we used to close some PRs which want to enhance it. Another concern is that our UT and e2e did not catch it :(.

yup, still prefer to call FinishBinding just after bind as before. But we can not revert #48451 , we need to just log error if 'FinishBinding' failed.

Current fix did not get a chance to call 'FinishBinding' which will still leave assumed pod in cache, so the pod can not be re-scheduled. For the concern on race of #48451, I think that's fine; 'ForgetPod' is quick enough, and scheduler will correct it in next cycle if it happened.

/cc @wojtek-t @timothysc for more suggestion :).

Updated this PR to call FinishBinding right after bind and log an error instead of returning. (thank you so much for all the comments!)

Also happy to break that commit out as a separate PR since it's an independent problem & this is getting kind of big, not sure which is better

As we had created another PRs, can we simplify this PR to only include FinishBinding and ForgetPod?

k82cn · 2017-07-27T07:05:28Z

plugin/pkg/scheduler/scheduler.go

@@ -185,7 +185,7 @@ func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) {
 }

 // assume signals to the cache that a pod is already in the cache, so that binding can be asnychronous.
-func (sched *Scheduler) assume(pod *v1.Pod, host string) error {
+func (sched *Scheduler) assume(pod *v1.Pod, host string) (*v1.Pod, error) {


we should not return assumed pod, we will check current Pod with the pod in cache.

sorry, I don't understand -- the reason this is returning the assumed pod is that bind expects to get the assumed pod as an argument (bind(assumed *v1.Pod, b *v1.Binding)).

before the refactor assume and bind were part of the same function and the assumed pod got passed to ForgetPod and FinishBinding -- you can see the old code (which I believe was correct) here:

kubernetes/plugin/pkg/scheduler/scheduler.go

Lines 229 to 234 in 80f26fa

if err := sched.config.SchedulerCache.FinishBinding(&assumed); err != nil {

glog.Errorf("scheduler cache FinishBinding failed: %v", err)

}

if err != nil {

glog.V(1).Infof("Failed to bind pod: %v/%v", pod.Namespace, pod.Name)

if err := sched.config.SchedulerCache.ForgetPod(&assumed); err != nil {

wow, yes. We did not use object from cache directly; prefer to highlight copy outside assume:

assumedPod := *pod sched.assume(assumedPod) go sched.bind(assumedPod)

Done, thanks!

k82cn · 2017-07-27T07:06:05Z

/assign

Good catch!, add some comments :).

k82cn · 2017-07-27T07:08:46Z

plugin/pkg/scheduler/scheduler.go

+		sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{
+			Type:    v1.PodScheduled,
+			Status:  v1.ConditionFalse,
+			Reason:  "AssumePodFailed",


Assume is internal phase of scheduler, we should not publish it; different scheduler may have different way to handle race condition between cache & apiserver.

I'm not sure what the best way to indicate "this is most likely a bug in the scheduler" is but changed it to SchedulerError .

cblecker · 2017-07-30T00:20:49Z

@julia-stripe This is currently failing verify as plugin/pkg/scheduler/testing/fake_cache.go needs a gofmt :)

julia-stripe · 2017-07-30T18:07:02Z

oops, fixed!

can someone add the cherrypick-candidate label and release: 1.7.0? (is that the right way to flag bug fixes for the 1.7 release?)

julia-stripe · 2017-07-31T01:04:38Z

/retest

cblecker · 2017-07-31T01:15:56Z

@julia-stripe One of the scheduling maintainers should be able to assist with that once:

This PR is approved to merge into master
It's determined that this bug appears in 1.7, and the fix is unlikely to cause any regressions.

More information on the cherry-pick procedure is available here. This would fall under the section "INDIVIDUAL CHERRYPICKS".

julia-stripe · 2017-07-31T12:21:54Z

/retest

julia-stripe · 2017-07-31T13:44:56Z

/retest

julia-stripe · 2017-08-01T13:48:28Z

/retest

julia-stripe · 2017-08-02T16:50:35Z

PTAL @k82cn

Fixed all the issues you mentioned! Is there anything else? Should 589cb36 be broken out into a separate PR because it's an independent change?

julia-stripe · 2017-08-02T17:46:17Z

Also pulled one smaller change out of this in the hopes that it's easier to review: #50028

julia-stripe · 2017-08-03T19:13:41Z

Going to close this in favour of #50028 and open a separate PR to improve the error handling in scheduler.go.

julia-stripe · 2017-08-03T20:06:41Z

@k82cn Opened #50106 which is the error handling parts of this PR -- hopefully that will be easier to review.

k82cn · 2017-08-03T23:32:31Z

great, thanks very much :).

…ind-call Automatic merge from submit-queue Fix incorrect call to 'bind' in scheduler I previously submitted #49661 -- I'm not sure if that PR is too big or what, but this is an attempt at a smaller PR that makes progress on the same issue and is easier to review. **What this PR does / why we need it**: In this refactor (ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223) the scheduler code was refactored into separate `bind` and `assume` functions. When that happened, `bind` was called with `pod` as an argument. The argument to `bind` should be the assumed pod, not the original pod. Evidence that `assumedPod` is the correct argument bind and not `pod`: https://github.com/kubernetes/kubernetes/blob/80f26fa8a89ef5863cb19c71a620bb389d025166/plugin/pkg/scheduler/scheduler.go#L229-L234. (and it says `assumed` in the function signature for `bind`, even though it's not called with the assumed pod as an argument). This is an issue (and causes #49314, where pods that fail to bind to a node get stuck indefinitely) in the following scenario: 1. The pod fails to bind to the node 2. `bind` calls `ForgetPod` with the `pod` argument 3. since `ForgetPod` is expecting the assumed pod as an argument (because that's what's in the scheduler cache), it fails with an error like `scheduler cache ForgetPod failed: pod test-677550-rc-edit-namespace/nginx-jvn09 state was assumed on a different node` 4. The pod gets lost forever because of some incomplete error handling (which I haven't addressed here in the interest of making a simpler PR) In this PR I've fixed the call to `bind` and modified the tests to make sure that `ForgetPod` gets called with the correct argument (the assumed pod) when binding fails. **Which issue this PR fixes**: fixes #49314 **Special notes for your reviewer**: **Release note**: ```release-note ```

Fix pods stuck in Pending state forever

25dade8

k8s-github-robot assigned timothysc and bsalamat Jul 26, 2017

k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jul 26, 2017

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 26, 2017

k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 26, 2017

k82cn reviewed Jul 27, 2017

View reviewed changes

k8s-ci-robot assigned k82cn Jul 27, 2017

k82cn reviewed Jul 27, 2017

View reviewed changes

Change AssumePodFailed => SchedulerError

273107b

julia-stripe force-pushed the fix-pods-stuck-pending branch from c0c6a94 to 273107b Compare July 27, 2017 12:44

Make copy of pod into assumedPod more explicit

be92289

julia-stripe mentioned this pull request Jul 28, 2017

Pod that failed to bind, stuck in Pending state forever #49314

Closed

gofmt

444865b

Call FinishBinding right after Bind

589cb36

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 31, 2017

julia-stripe mentioned this pull request Aug 2, 2017

Fix incorrect call to 'bind' in scheduler #50028

Merged

julia-stripe closed this Aug 3, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix pods stuck in Pending state forever #49661

Fix pods stuck in Pending state forever #49661

julia-stripe commented Jul 26, 2017

k8s-github-robot commented Jul 26, 2017

k8s-ci-robot commented Jul 26, 2017

bsalamat commented Jul 26, 2017

k82cn Jul 27, 2017

julia-stripe Jul 27, 2017 •

edited

Loading

k82cn Jul 27, 2017 •

edited

Loading

julia-stripe Jul 27, 2017

k82cn Jul 27, 2017

k82cn Jul 31, 2017

k82cn Jul 31, 2017

julia-stripe Jul 31, 2017

julia-stripe Jul 31, 2017

k82cn Aug 3, 2017

k82cn Jul 27, 2017

julia-stripe Jul 27, 2017

k82cn Jul 27, 2017

julia-stripe Jul 27, 2017

k82cn commented Jul 27, 2017

k82cn Jul 27, 2017

julia-stripe Jul 27, 2017

cblecker commented Jul 30, 2017

julia-stripe commented Jul 30, 2017

julia-stripe commented Jul 31, 2017

cblecker commented Jul 31, 2017

julia-stripe commented Jul 31, 2017

julia-stripe commented Jul 31, 2017

julia-stripe commented Aug 1, 2017

julia-stripe commented Aug 2, 2017

julia-stripe commented Aug 2, 2017

julia-stripe commented Aug 3, 2017

julia-stripe commented Aug 3, 2017

k82cn commented Aug 3, 2017

		@@ -222,7 +225,7 @@ func (sched Scheduler) bind(assumed v1.Pod, b *v1.Binding) error {
		if err != nil {

	if err := sched.config.SchedulerCache.FinishBinding(&assumed); err != nil {
	glog.Errorf("scheduler cache FinishBinding failed: %v", err)
	}
	if err != nil {
	glog.V(1).Infof("Failed to bind pod: %v/%v", pod.Namespace, pod.Name)
	if err := sched.config.SchedulerCache.ForgetPod(&assumed); err != nil {

Fix pods stuck in Pending state forever #49661

Fix pods stuck in Pending state forever #49661

Conversation

julia-stripe commented Jul 26, 2017

k8s-github-robot commented Jul 26, 2017

k8s-ci-robot commented Jul 26, 2017

bsalamat commented Jul 26, 2017

Choose a reason for hiding this comment

julia-stripe Jul 27, 2017 • edited Loading

Choose a reason for hiding this comment

k82cn Jul 27, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k82cn commented Jul 27, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cblecker commented Jul 30, 2017

julia-stripe commented Jul 30, 2017

julia-stripe commented Jul 31, 2017

cblecker commented Jul 31, 2017

julia-stripe commented Jul 31, 2017

julia-stripe commented Jul 31, 2017

julia-stripe commented Aug 1, 2017

julia-stripe commented Aug 2, 2017

julia-stripe commented Aug 2, 2017

julia-stripe commented Aug 3, 2017

julia-stripe commented Aug 3, 2017

k82cn commented Aug 3, 2017

julia-stripe Jul 27, 2017 •

edited

Loading

k82cn Jul 27, 2017 •

edited

Loading