Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pods stuck in Pending state forever #49661

Closed

Conversation

julia-stripe
Copy link
Contributor

What this PR does / why we need it:

This PR fixes 2 bugs introduced into the scheduler in ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223.

Together, they meant that when a pod fails to bind to a node, it can end up stuck in the Pending state indefinitely. The way this happened was:

  1. The pod fails to bind to the node
  2. ForgetPod fails, because the wrong *v1.Pod struct was passed into bind and therefore ForgetPod (pod instead of assumed) -- ForgetPod requires that the pod ask it to forget matches the pod in the cache
  3. bind returns immediately after ForgetPod fails instead of correctly reporting an error, because of this line: ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223 -- this should be a glog.Errorf, not a return. The return on that line causes the function to skip the error handling, which means that the pod isn't put back onto the queue to attempt scheduling again.
  4. Because the pod isn't put back on to the queue, the pod stays stuck in Pending forever until the scheduler is restarted.

This also adds better error handling when AssumePod fails -- it seems to me that any time pod scheduling fails, we should handle the error with sched.config.Error and put the pod back in the queue to be rescheduled instead of silently failing and leaving the pod potentially stuck in Pending forever.

It also adds 2 new unit tests, which test:

  • when both ForgetPod and Bind fail, the correct error is reported
  • when AssumePod fails, the correct error is reported
  • when Bind fails, the correct pod is passed into ForgetPod

Which issue this PR fixes
fixes #49314

Special notes for your reviewer:

Release note:

Fix bug where pods stuck in Pending state forever after binding fails

@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: julia-stripe
We suggest the following additional approver: timothysc

Assign the PR to them by writing /assign @timothysc in a comment when ready.

Associated issue: 49314

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jul 26, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @julia-stripe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 26, 2017
@bsalamat
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 26, 2017
@@ -222,7 +225,7 @@ func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should call FinishBinding just after bind as before.

Copy link
Contributor Author

@julia-stripe julia-stripe Jul 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the call to FinishBinding was moved in this PR a few weeks ago: #48451 -- let me know if you think it should be moved back (and why? :) ).

Copy link
Member

@k82cn k82cn Jul 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finishbinding will add deadline and remove the resumed pod when expired. IMO, the deadline is not only for binding successfully, but also for error handling, e.g. ForgetPod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think we should revert #48451?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give me some time to check history.

That's error-prone part, we used to close some PRs which want to enhance it. Another concern is that our UT and e2e did not catch it :(.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, still prefer to call FinishBinding just after bind as before. But we can not revert #48451 , we need to just log error if 'FinishBinding' failed.

Current fix did not get a chance to call 'FinishBinding' which will still leave assumed pod in cache, so the pod can not be re-scheduled. For the concern on race of #48451, I think that's fine; 'ForgetPod' is quick enough, and scheduler will correct it in next cycle if it happened.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @wojtek-t @timothysc for more suggestion :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this PR to call FinishBinding right after bind and log an error instead of returning. (thank you so much for all the comments!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also happy to break that commit out as a separate PR since it's an independent problem & this is getting kind of big, not sure which is better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we had created another PRs, can we simplify this PR to only include FinishBinding and ForgetPod?

@@ -185,7 +185,7 @@ func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) {
}

// assume signals to the cache that a pod is already in the cache, so that binding can be asnychronous.
func (sched *Scheduler) assume(pod *v1.Pod, host string) error {
func (sched *Scheduler) assume(pod *v1.Pod, host string) (*v1.Pod, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not return assumed pod, we will check current Pod with the pod in cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I don't understand -- the reason this is returning the assumed pod is that bind expects to get the assumed pod as an argument (bind(assumed *v1.Pod, b *v1.Binding)).

before the refactor assume and bind were part of the same function and the assumed pod got passed to ForgetPod and FinishBinding -- you can see the old code (which I believe was correct) here:

if err := sched.config.SchedulerCache.FinishBinding(&assumed); err != nil {
glog.Errorf("scheduler cache FinishBinding failed: %v", err)
}
if err != nil {
glog.V(1).Infof("Failed to bind pod: %v/%v", pod.Namespace, pod.Name)
if err := sched.config.SchedulerCache.ForgetPod(&assumed); err != nil {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, yes. We did not use object from cache directly; prefer to highlight copy outside assume:

assumedPod := *pod

sched.assume(assumedPod)
go sched.bind(assumedPod)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

@k82cn
Copy link
Member

k82cn commented Jul 27, 2017

/assign

Good catch!, add some comments :).

sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{
Type: v1.PodScheduled,
Status: v1.ConditionFalse,
Reason: "AssumePodFailed",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume is internal phase of scheduler, we should not publish it; different scheduler may have different way to handle race condition between cache & apiserver.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the best way to indicate "this is most likely a bug in the scheduler" is but changed it to SchedulerError .

@julia-stripe julia-stripe force-pushed the fix-pods-stuck-pending branch from c0c6a94 to 273107b Compare July 27, 2017 12:44
@cblecker
Copy link
Member

@julia-stripe This is currently failing verify as plugin/pkg/scheduler/testing/fake_cache.go needs a gofmt :)

@julia-stripe
Copy link
Contributor Author

oops, fixed!

can someone add the cherrypick-candidate label and release: 1.7.0? (is that the right way to flag bug fixes for the 1.7 release?)

@julia-stripe
Copy link
Contributor Author

/retest

@cblecker
Copy link
Member

@julia-stripe One of the scheduling maintainers should be able to assist with that once:

  • This PR is approved to merge into master
  • It's determined that this bug appears in 1.7, and the fix is unlikely to cause any regressions.

More information on the cherry-pick procedure is available here. This would fall under the section "INDIVIDUAL CHERRYPICKS".

@julia-stripe
Copy link
Contributor Author

/retest

1 similar comment
@julia-stripe
Copy link
Contributor Author

/retest

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 31, 2017
@julia-stripe
Copy link
Contributor Author

/retest

@julia-stripe
Copy link
Contributor Author

PTAL @k82cn

Fixed all the issues you mentioned! Is there anything else? Should 589cb36 be broken out into a separate PR because it's an independent change?

@julia-stripe
Copy link
Contributor Author

Also pulled one smaller change out of this in the hopes that it's easier to review: #50028

@julia-stripe
Copy link
Contributor Author

Going to close this in favour of #50028 and open a separate PR to improve the error handling in scheduler.go.

@julia-stripe
Copy link
Contributor Author

@k82cn Opened #50106 which is the error handling parts of this PR -- hopefully that will be easier to review.

@k82cn
Copy link
Member

k82cn commented Aug 3, 2017

great, thanks very much :).

k8s-github-robot pushed a commit that referenced this pull request Aug 4, 2017
…ind-call

Automatic merge from submit-queue

Fix incorrect call to 'bind' in scheduler

I previously submitted #49661 -- I'm not sure if that PR is too big or what, but this is an attempt at a smaller PR that makes progress on the same issue and is easier to review.

**What this PR does / why we need it**:

In this refactor (ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223) the scheduler code was refactored into separate `bind` and `assume` functions. When that happened, `bind` was called with `pod` as an argument. The argument to `bind` should be the assumed pod, not the original pod. Evidence that `assumedPod` is the correct argument bind and not `pod`: https://github.com/kubernetes/kubernetes/blob/80f26fa8a89ef5863cb19c71a620bb389d025166/plugin/pkg/scheduler/scheduler.go#L229-L234. (and it says `assumed` in the function signature for `bind`, even though it's not called with the assumed pod as an argument).

This is an issue (and causes #49314, where pods that fail to bind to a node get stuck indefinitely) in the following scenario:

1. The pod fails to bind to the node
2. `bind` calls `ForgetPod` with the `pod` argument
3. since `ForgetPod` is expecting the assumed pod as an argument (because that's what's in the scheduler cache), it fails with an error like `scheduler cache ForgetPod failed: pod test-677550-rc-edit-namespace/nginx-jvn09 state was assumed on a different node`
4. The pod gets lost forever because of some incomplete error handling (which I haven't addressed here in the interest of making a simpler PR)

In this PR I've fixed the call to `bind` and modified the tests to make sure that `ForgetPod` gets called with the correct argument (the assumed pod) when binding fails.

**Which issue this PR fixes**: fixes #49314

**Special notes for your reviewer**:

**Release note**:

```release-note
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pod that failed to bind, stuck in Pending state forever
7 participants