Do not snapshot scheduler cache before starting preemption #72895
Conversation
Force-pushed from 8265859 to e3f4e1e (compare)
misterikkit left a comment:
The code change looks fine, but I don't completely understand the logic change.
In the case of nominating a node for a pending pod, some cluster resources are considered reserved for that pod and will prevent other pods from being scheduled there. This change seems geared at moving where that reservation happens, but why does changing a nomination from a new node to an existing node improve behavior?
I tried to explain that when new nodes are added to a cluster, there is a chance that one is added between the start of a scheduling cycle and the start of the following preemption cycle. In such cases, the pod gets a nominated node name without any preemption, then goes back to the queue and is placed behind other pods with the same priority. When there are many pending pods, it could take minutes before the pod is retried, while those nominated resources remain unused by the other pods the scheduler is attempting. This also causes delays for the cluster autoscaler. That's why we want preemption to use the same set of nodes that the scheduling cycle used.
Why is the alternative better? The pod that triggered preemption still goes to the back of the queue, while some running pods get terminated. The resources freed up for the pending pod would still go unused while the scheduler works through the queue of equal-priority pods. That is, unless the queue has a feature to jump a pending pod to the front of the queue when its nominated resources become free?
In the case that I described, no running pod is terminated. Resources that are not marked as "nominated" are used by the next pod and don't stay unused for long.
Ah okay, that was my misunderstanding. Because the pod would otherwise not be scheduled, the reserved-but-idle resources only appear when resources become available during a scheduling cycle.
/lgtm
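For context on how a nomination reserves resources, here is a minimal, hypothetical sketch of a feasibility check that subtracts the requests of equal-or-higher-priority pods nominated to a node. The types and the `fits` helper are invented for illustration and are not the kube-scheduler's actual code.

```go
// Illustrative only: a toy feasibility check that treats a pending pod's
// nominated request as already consumed on its nominated node, which is why
// other equal-priority pods skip that node while the nominated pod waits in
// the queue.
package main

import "fmt"

type pod struct {
	name          string
	cpuRequest    int
	priority      int
	nominatedNode string // set when preemption nominates a node for the pod
}

type node struct {
	name    string
	freeCPU int
}

// fits reports whether p fits on n after subtracting the requests of pods
// with equal or higher priority that are nominated to run on n.
func fits(p pod, n node, nominated []pod) bool {
	free := n.freeCPU
	for _, np := range nominated {
		if np.nominatedNode == n.name && np.priority >= p.priority {
			free -= np.cpuRequest
		}
	}
	return free >= p.cpuRequest
}

func main() {
	n := node{name: "node-b", freeCPU: 8}
	waiting := pod{name: "pod-1", cpuRequest: 6, priority: 10, nominatedNode: "node-b"}
	next := pod{name: "pod-2", cpuRequest: 4, priority: 10}

	// pod-2 is rejected because pod-1's nominated request is reserved,
	// even though nothing is actually running on node-b yet.
	fmt.Println("pod-2 fits on node-b:", fits(next, n, []pod{waiting}))
}
```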
/retest |
Huang-Wei left a comment:
/lgtm
During one scheduling cycle (covering both the initial scheduling attempt and the subsequent preemption), the scheduler cache should be consistent, and it is snapshotted at the start of the first attempt:
`if err := g.snapshot(); err != nil {`
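To illustrate the snapshot-consistency point, here is a minimal Go sketch of a scheduler that snapshots its cache once at the start of a cycle and reuses that same snapshot in preemption. All names (`genericScheduler`, `snapshotCache`, `Schedule`, `Preempt`) are simplified stand-ins and not the real `generic_scheduler.go` API.

```go
// Minimal sketch of "snapshot once per scheduling cycle" (illustrative only).
package main

import (
	"errors"
	"fmt"
)

type nodeInfo struct {
	name    string
	freeCPU int
}

type genericScheduler struct {
	cache    map[string]nodeInfo // live cluster state, mutated as nodes/pods change
	snapshot map[string]nodeInfo // frozen view used for the current cycle
}

// snapshotCache freezes the live cache; it is called only from Schedule.
func (g *genericScheduler) snapshotCache() error {
	g.snapshot = make(map[string]nodeInfo, len(g.cache))
	for name, ni := range g.cache {
		g.snapshot[name] = ni
	}
	return nil
}

// Schedule takes the one snapshot for this cycle and searches it for a fit.
func (g *genericScheduler) Schedule(cpu int) (string, error) {
	if err := g.snapshotCache(); err != nil {
		return "", err
	}
	for name, ni := range g.snapshot {
		if ni.freeCPU >= cpu {
			return name, nil
		}
	}
	return "", errors.New("no feasible node")
}

// Preempt deliberately reuses g.snapshot instead of re-reading g.cache, so it
// sees exactly the same cluster state that Schedule just rejected.
func (g *genericScheduler) Preempt(cpu int) (string, bool) {
	for name, ni := range g.snapshot {
		if ni.freeCPU >= cpu {
			return name, true
		}
	}
	return "", false
}

func main() {
	g := &genericScheduler{cache: map[string]nodeInfo{
		"node-a": {name: "node-a", freeCPU: 1},
	}}
	if _, err := g.Schedule(4); err != nil {
		// A node joins the cluster between scheduling and preemption.
		g.cache["node-b"] = nodeInfo{name: "node-b", freeCPU: 8}
		_, nominated := g.Preempt(4)
		fmt.Println("nominated without preemption:", nominated) // false: snapshot reused
	}
}
```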
ravisantoshgudimetla left a comment:
/lgtm
Thanks for the PR and the explanation, @bsalamat.
[APPROVALNOTIFIER] This PR is APPROVED.
This pull request has been approved by: bsalamat, Huang-Wei, ravisantoshgudimetla.
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Automated cherry pick of #72895 (upstream release-1.11)
Automated cherry pick of #72895 (upstream release-1.13)
Automated cherry pick of #72895 (upstream release-1.12)
What type of PR is this?
/kind bug
What this PR does / why we need it:
In clusters with many pending pods of the same priority, a pod that gets a nominated node goes back to the queue behind all other pending pods with the same priority. This is important to avoid starving those other pods, but it can also hold the nominated resources for a long time while the scheduler works through the pods ahead of it. That scenario is unavoidable in general, but the existing implementation makes it worse: preemption updates its scheduler cache snapshot before starting the preemption work. If a node is added to the cluster, or a pod terminates, after the scheduling cycle but before the preemption logic starts, preemption may find a feasible node and nominate it without preempting any pods. The nominated pod then goes back to the queue behind other pods of similar priority, yet none of those pods can be scheduled on that node because it is reserved for the nominated pod, and with thousands of pending pods in the queue it may take minutes before the nominated pod is retried. To avoid this, the preemption logic no longer refreshes the snapshot and instead uses the same one that its corresponding scheduling cycle used.
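As a rough illustration of the difference described above, the toy simulation below contrasts the old behavior (re-snapshot before preemption) with this PR's behavior (reuse the scheduling snapshot) when a node joins the cluster mid-cycle. Everything in it (`simulate`, the node names, the CPU numbers) is hypothetical and heavily simplified.

```go
// Toy comparison of re-snapshotting vs. reusing the scheduling snapshot
// before preemption (illustrative only; the real scheduler, queue, and
// preemption logic are far more involved).
package main

import "fmt"

// simulate reports what happens to a newly added node when a pending pod
// fails scheduling and preemption runs, depending on whether preemption
// re-snapshots the cache (old behavior) or reuses the scheduling snapshot
// (this PR).
func simulate(resnapshotBeforePreempt bool) string {
	snapshotNodes := []string{"node-a"}       // what the scheduling attempt saw
	liveNodes := []string{"node-a", "node-b"} // node-b joined mid-cycle

	nodesSeenByPreemption := snapshotNodes
	if resnapshotBeforePreempt {
		nodesSeenByPreemption = liveNodes
	}

	for _, n := range nodesSeenByPreemption {
		if n == "node-b" {
			// Feasible without evicting anyone: the pod is merely nominated,
			// re-queued behind equal-priority pods, and node-b sits reserved
			// but idle until the pod is retried.
			return "pod nominated to node-b; node-b reserved but idle"
		}
	}
	// No nomination: node-b is immediately usable by the next pending pod.
	return "no nomination; next pending pod schedules onto node-b"
}

func main() {
	fmt.Println("old behavior: ", simulate(true))
	fmt.Println("with this PR:", simulate(false))
}
```

Running it prints that under the old behavior node-b ends up reserved but idle, while with this change the next pending pod can use node-b right away.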
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/sig scheduling