
Conversation

@bsalamat
Member

What type of PR is this?
/kind bug

What this PR does / why we need it:
In clusters with many pending pods of the same priority, a pod that gets a node nominated for it goes back to the queue behind all other pending pods of that priority. This is important to avoid starving other pods, but it can also hold the nominated resources for a long time while the scheduler works through the pods ahead of it. That much is inherent to the design and we cannot do much about it.

However, in the existing implementation, preemption updates its scheduler cache snapshot before starting the preemption work. If a node is added to the cluster or a pod terminates after the scheduling cycle but before preemption starts, the preemption logic may find a feasible node without preempting any pods. The node is then nominated for the pod even though nothing was preempted. The nominated pod goes back to the queue behind other pods of similar priority, none of which can be scheduled on that node because it already has a nominated pod, and with thousands of pending pods in the queue it may take minutes before the pod is retried.

To avoid this scenario, we no longer refresh the snapshot used by the preemption logic; it now uses the same snapshot as its corresponding scheduling cycle.
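For illustration only, here is a small, self-contained sketch of the behavior difference. The types and names (`cache`, `nodeInfo`, `fits`) are made up for the example and are not the actual scheduler code or the diff in this PR; the point is that preemption which re-snapshots the live cache can "succeed" on a node that appeared after the scheduling cycle, nominating the pod with zero victims, whereas reusing the cycle's snapshot avoids that.

```go
package main

import "fmt"

type nodeInfo struct {
	name    string
	freeCPU int
}

type cache struct {
	liveNodes []nodeInfo
}

// snapshot copies the live node list; the real scheduler snapshots its cache
// once at the start of each scheduling cycle.
func (c *cache) snapshot() []nodeInfo {
	out := make([]nodeInfo, len(c.liveNodes))
	copy(out, c.liveNodes)
	return out
}

// fits returns the first node with enough free CPU for the pod, if any.
func fits(podCPU int, nodes []nodeInfo) (string, bool) {
	for _, n := range nodes {
		if n.freeCPU >= podCPU {
			return n.name, true
		}
	}
	return "", false
}

func main() {
	c := &cache{liveNodes: []nodeInfo{{name: "node-a", freeCPU: 1}}}
	podCPU := 2

	// Scheduling cycle: snapshot once; the pod does not fit anywhere.
	snap := c.snapshot()
	if _, ok := fits(podCPU, snap); !ok {
		fmt.Println("scheduling failed; falling back to preemption")
	}

	// A new node joins the cluster between the scheduling cycle and preemption.
	c.liveNodes = append(c.liveNodes, nodeInfo{name: "node-b", freeCPU: 4})

	// Old behavior: preemption re-snapshots the live cache, finds node-b
	// feasible with zero victims, and the pod is merely nominated and re-queued.
	if node, ok := fits(podCPU, c.snapshot()); ok {
		fmt.Println("old behavior: nominate", node, "with no victims; node sits reserved")
	}

	// New behavior: preemption reuses the cycle's snapshot, so it only acts
	// when real victims exist; node-b stays free for the next pod in the queue.
	if _, ok := fits(podCPU, snap); !ok {
		fmt.Println("new behavior: nothing to preempt in the cycle's snapshot; node-b is usable by other pods")
	}
}
```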

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Improve efficiency of preemption logic in clusters with many pending pods.

/sig scheduling

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 14, 2019
@bsalamat bsalamat requested a review from misterikkit January 14, 2019 19:43
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 14, 2019
@bsalamat bsalamat force-pushed the no_refresh_preemption branch from 8265859 to e3f4e1e on January 14, 2019 20:08
@misterikkit misterikkit left a comment

The code change looks fine, but I don't completely understand the logic change.

When a node is nominated for a pending pod, some cluster resources are considered reserved for that pod and will prevent scheduling of other pods. This change seems geared at moving where those reserved resources happen, but why does changing a nomination from a new node to an existing node improve behavior?

@bsalamat
Member Author

This change seems geared at moving where those reserved resources happen, but why does changing a nomination from a new node to an existing node improve behavior?

I tried to explain that when new nodes are added to a cluster, there is a chance that one is added between the start of a scheduling cycle and the start of the following preemption cycle. In such cases, the pod gets a nominated node name with no preemption, and is then moved back to the queue behind other pods of the same priority. When there are many pending pods, it could take minutes before the pod is retried, while the nominated resources sit unused and unavailable to the other pods the scheduler is attempting. This also causes delays in the cluster autoscaler. That's why we want to use the same set of nodes that the scheduling cycle used.
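To spell out why the nomination blocks the other pending pods, here is a simplified sketch (not the actual feasibility-check code; the function and parameter names are made up): when the scheduler evaluates another pod against a node that has a pod nominated for it, the nominated pod's requests are counted as already consumed, so the node appears full.

```go
// Simplified sketch of the effect described above (not the real scheduler
// code): a node with a pod nominated for it is evaluated as if that pod's
// requests were already consumed, so other pending pods do not fit on it.
func fitsWithNominated(nodeFreeCPU, nominatedPodCPU, candidatePodCPU int) bool {
	// Reserve the nominated pod's CPU before checking the candidate pod.
	return nodeFreeCPU-nominatedPodCPU >= candidatePodCPU
}

// Example: a node with 4 free CPUs and a 4-CPU pod nominated for it rejects
// even a 1-CPU candidate: fitsWithNominated(4, 4, 1) == false.
```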

@misterikkit

the pod gets a nominated node name with no preemption

Why is the alternative better? The pod that triggered preemption still goes to the back of the queue, while some running pods get terminated. The resources freed up for the pending pod would still go unused while the scheduler works through the queue of equal-priority pods.

That is, unless the queue has a feature to jump a pending pod to the front of the queue when its nominated resources become free?

@bsalamat
Member Author

Why is the alternative better? The pod that triggered preemption still goes to the back of the queue, while some running pods get terminated. The resources freed up for the pending pod would still go unused while the scheduler works through the queue of equal-priority pods.

In the case that I described, no running pod is terminated. Resources that are not marked as "nominated" can be used by the next pods, so they don't stay unused for long.

@misterikkit

Ah okay, that was my misunderstanding. Because the pod would otherwise not be scheduled, reserved-but-idle resources only arise when resources become available during a scheduling cycle.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 14, 2019
@ravisantoshgudimetla
Contributor

/retest

Member

@Huang-Wei Huang-Wei left a comment

/lgtm

During one scheduling cycle (both the first scheduling attempt and the subsequent preemption), the scheduler cache should stay consistent; it is snapshotted at the first attempt:

`if err := g.snapshot(); err != nil {`

Contributor

@ravisantoshgudimetla ravisantoshgudimetla left a comment

/lgtm

Thanks for the PR and explanation @bsalamat

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1482483 into kubernetes:master Jan 15, 2019
k8s-ci-robot added a commit that referenced this pull request Jan 17, 2019
…95-upstream-release-1.11

Automated cherry pick of #72895 upstream release 1.11
k8s-ci-robot added a commit that referenced this pull request Jan 19, 2019
…95-upstream-release-1.13

Automated cherry pick of #72895 upstream release 1.13
k8s-ci-robot added a commit that referenced this pull request Jan 22, 2019
…95-upstream-release-1.12

Automated cherry pick of #72895 upstream release 1.12
