-
Notifications
You must be signed in to change notification settings - Fork 39.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix pods stuck in Pending state forever #49661
Fix pods stuck in Pending state forever #49661
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: julia-stripe Assign the PR to them by writing Associated issue: 49314 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
Hi @julia-stripe. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/ok-to-test |
@@ -222,7 +225,7 @@ func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error { | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should call FinishBinding just after bind as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the call to FinishBinding
was moved in this PR a few weeks ago: #48451 -- let me know if you think it should be moved back (and why? :) ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finishbinding will add deadline and remove the resumed pod when expired. IMO, the deadline is not only for binding successfully, but also for error handling, e.g. ForgetPod
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think we should revert #48451?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give me some time to check history.
That's error-prone part, we used to close some PRs which want to enhance it. Another concern is that our UT and e2e did not catch it :(.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, still prefer to call FinishBinding just after bind as before. But we can not revert #48451 , we need to just log error if 'FinishBinding' failed.
Current fix did not get a chance to call 'FinishBinding' which will still leave assumed pod in cache, so the pod can not be re-scheduled. For the concern on race of #48451, I think that's fine; 'ForgetPod' is quick enough, and scheduler will correct it in next cycle if it happened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @wojtek-t @timothysc for more suggestion :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this PR to call FinishBinding
right after bind
and log an error instead of returning. (thank you so much for all the comments!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also happy to break that commit out as a separate PR since it's an independent problem & this is getting kind of big, not sure which is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we had created another PRs, can we simplify this PR to only include FinishBinding
and ForgetPod
?
plugin/pkg/scheduler/scheduler.go
Outdated
@@ -185,7 +185,7 @@ func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) { | |||
} | |||
|
|||
// assume signals to the cache that a pod is already in the cache, so that binding can be asnychronous. | |||
func (sched *Scheduler) assume(pod *v1.Pod, host string) error { | |||
func (sched *Scheduler) assume(pod *v1.Pod, host string) (*v1.Pod, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not return assumed pod, we will check current Pod with the pod in cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I don't understand -- the reason this is returning the assumed pod is that bind
expects to get the assumed pod as an argument (bind(assumed *v1.Pod, b *v1.Binding)
).
before the refactor assume and bind were part of the same function and the assumed pod got passed to ForgetPod
and FinishBinding
-- you can see the old code (which I believe was correct) here:
kubernetes/plugin/pkg/scheduler/scheduler.go
Lines 229 to 234 in 80f26fa
if err := sched.config.SchedulerCache.FinishBinding(&assumed); err != nil { | |
glog.Errorf("scheduler cache FinishBinding failed: %v", err) | |
} | |
if err != nil { | |
glog.V(1).Infof("Failed to bind pod: %v/%v", pod.Namespace, pod.Name) | |
if err := sched.config.SchedulerCache.ForgetPod(&assumed); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, yes. We did not use object from cache directly; prefer to highlight copy outside assume
:
assumedPod := *pod
sched.assume(assumedPod)
go sched.bind(assumedPod)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
/assign Good catch!, add some comments :). |
plugin/pkg/scheduler/scheduler.go
Outdated
sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{ | ||
Type: v1.PodScheduled, | ||
Status: v1.ConditionFalse, | ||
Reason: "AssumePodFailed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assume
is internal phase of scheduler, we should not publish it; different scheduler may have different way to handle race condition between cache & apiserver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the best way to indicate "this is most likely a bug in the scheduler" is but changed it to SchedulerError
.
c0c6a94
to
273107b
Compare
@julia-stripe This is currently failing verify as |
oops, fixed! can someone add the |
/retest |
@julia-stripe One of the scheduling maintainers should be able to assist with that once:
More information on the cherry-pick procedure is available here. This would fall under the section "INDIVIDUAL CHERRYPICKS". |
/retest |
1 similar comment
/retest |
/retest |
Also pulled one smaller change out of this in the hopes that it's easier to review: #50028 |
Going to close this in favour of #50028 and open a separate PR to improve the error handling in scheduler.go. |
great, thanks very much :). |
…ind-call Automatic merge from submit-queue Fix incorrect call to 'bind' in scheduler I previously submitted #49661 -- I'm not sure if that PR is too big or what, but this is an attempt at a smaller PR that makes progress on the same issue and is easier to review. **What this PR does / why we need it**: In this refactor (ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223) the scheduler code was refactored into separate `bind` and `assume` functions. When that happened, `bind` was called with `pod` as an argument. The argument to `bind` should be the assumed pod, not the original pod. Evidence that `assumedPod` is the correct argument bind and not `pod`: https://github.com/kubernetes/kubernetes/blob/80f26fa8a89ef5863cb19c71a620bb389d025166/plugin/pkg/scheduler/scheduler.go#L229-L234. (and it says `assumed` in the function signature for `bind`, even though it's not called with the assumed pod as an argument). This is an issue (and causes #49314, where pods that fail to bind to a node get stuck indefinitely) in the following scenario: 1. The pod fails to bind to the node 2. `bind` calls `ForgetPod` with the `pod` argument 3. since `ForgetPod` is expecting the assumed pod as an argument (because that's what's in the scheduler cache), it fails with an error like `scheduler cache ForgetPod failed: pod test-677550-rc-edit-namespace/nginx-jvn09 state was assumed on a different node` 4. The pod gets lost forever because of some incomplete error handling (which I haven't addressed here in the interest of making a simpler PR) In this PR I've fixed the call to `bind` and modified the tests to make sure that `ForgetPod` gets called with the correct argument (the assumed pod) when binding fails. **Which issue this PR fixes**: fixes #49314 **Special notes for your reviewer**: **Release note**: ```release-note ```
What this PR does / why we need it:
This PR fixes 2 bugs introduced into the scheduler in ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223.
Together, they meant that when a pod fails to bind to a node, it can end up stuck in the Pending state indefinitely. The way this happened was:
ForgetPod
fails, because the wrong*v1.Pod
struct was passed intobind
and thereforeForgetPod
(pod
instead ofassumed
) --ForgetPod
requires that the pod ask it to forget matches the pod in the cachebind
returns immediately after ForgetPod fails instead of correctly reporting an error, because of this line: ecb962e6585#diff-67f2b61521299ca8d8687b0933bbfb19R223 -- this should be aglog.Errorf
, not a return. The return on that line causes the function to skip the error handling, which means that the pod isn't put back onto the queue to attempt scheduling again.Pending
forever until the scheduler is restarted.This also adds better error handling when
AssumePod
fails -- it seems to me that any time pod scheduling fails, we should handle the error withsched.config.Error
and put the pod back in the queue to be rescheduled instead of silently failing and leaving the pod potentially stuck inPending
forever.It also adds 2 new unit tests, which test:
ForgetPod
andBind
fail, the correct error is reportedAssumePod
fails, the correct error is reportedBind
fails, the correct pod is passed into ForgetPodWhich issue this PR fixes
fixes #49314
Special notes for your reviewer:
Release note: