Do not snapshot scheduler cache before starting preemption #72895
Conversation
Force-pushed from 8265859 to e3f4e1e (compare)
misterikkit left a comment:
The code change looks fine, but I don't completely understand the logic change.
In the case of nominating a node for a pending pod, some cluster resources are considered reserved for that pod and will prevent other pods from being scheduled there. This change seems geared at moving where that reservation happens, but why does changing a nomination from a new node to an existing node improve behavior?
I tried to explain that when new nodes are added to a cluster, there is a chance that one is added between the start of a scheduling cycle and the start of the following preemption cycle. In such cases, the pod gets a nominated node name without any preemption, then goes back to the queue and is placed behind other pods with the same priority. When there are many pending pods, it could take minutes before the pod is retried, while those nominated resources remain unused by the other pods the scheduler is attempting. This also causes delays for the cluster autoscaler. That's why we want preemption to use the same set of nodes that the scheduling cycle used.
Why is the alternative better? The pod that triggered preemption still goes to the back of the queue, while some running pods get terminated. The resources freed up for the pending pod would still go unused while the scheduler works through the queue of equal-priority pods. That is, unless the queue has a feature to jump a pending pod to the front of the queue when its nominated resources become free?
In the case that I described, no running pod is terminated. Resources that are not marked as "nominated" are used by the next pod and don't stay unused for long.
Ah okay, that was my misunderstanding. Because the pod would otherwise not be scheduled, the reserved-but-idle resources only appear when resources become available during a scheduling cycle.
/lgtm
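For context on how a nomination reserves resources, here is a minimal, hypothetical sketch of a feasibility check that subtracts the requests of equal-or-higher-priority pods nominated to a node. The types and the `fits` helper are invented for illustration and are not the kube-scheduler's actual code.

```go
// Illustrative only: a toy feasibility check that treats a pending pod's
// nominated request as already consumed on its nominated node, which is why
// other equal-priority pods skip that node while the nominated pod waits in
// the queue.
package main

import "fmt"

type pod struct {
	name          string
	cpuRequest    int
	priority      int
	nominatedNode string // set when preemption nominates a node for the pod
}

type node struct {
	name    string
	freeCPU int
}

// fits reports whether p fits on n after subtracting the requests of pods
// with equal or higher priority that are nominated to run on n.
func fits(p pod, n node, nominated []pod) bool {
	free := n.freeCPU
	for _, np := range nominated {
		if np.nominatedNode == n.name && np.priority >= p.priority {
			free -= np.cpuRequest
		}
	}
	return free >= p.cpuRequest
}

func main() {
	n := node{name: "node-b", freeCPU: 8}
	waiting := pod{name: "pod-1", cpuRequest: 6, priority: 10, nominatedNode: "node-b"}
	next := pod{name: "pod-2", cpuRequest: 4, priority: 10}

	// pod-2 is rejected because pod-1's nominated request is reserved,
	// even though nothing is actually running on node-b yet.
	fmt.Println("pod-2 fits on node-b:", fits(next, n, []pod{waiting}))
}
```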
/retest |
Huang-Wei left a comment:
/lgtm
During one scheduling cycle (covering both the initial scheduling attempt and the subsequent preemption), the scheduler cache should be consistent, and it is snapshotted at the start of the first attempt:
`if err := g.snapshot(); err != nil {`
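To illustrate the snapshot-consistency point, here is a minimal Go sketch of a scheduler that snapshots its cache once at the start of a cycle and reuses that same snapshot in preemption. All names (`genericScheduler`, `snapshotCache`, `Schedule`, `Preempt`) are simplified stand-ins and not the real `generic_scheduler.go` API.

```go
// Minimal sketch of "snapshot once per scheduling cycle" (illustrative only).
package main

import (
	"errors"
	"fmt"
)

type nodeInfo struct {
	name    string
	freeCPU int
}

type genericScheduler struct {
	cache    map[string]nodeInfo // live cluster state, mutated as nodes/pods change
	snapshot map[string]nodeInfo // frozen view used for the current cycle
}

// snapshotCache freezes the live cache; it is called only from Schedule.
func (g *genericScheduler) snapshotCache() error {
	g.snapshot = make(map[string]nodeInfo, len(g.cache))
	for name, ni := range g.cache {
		g.snapshot[name] = ni
	}
	return nil
}

// Schedule takes the one snapshot for this cycle and searches it for a fit.
func (g *genericScheduler) Schedule(cpu int) (string, error) {
	if err := g.snapshotCache(); err != nil {
		return "", err
	}
	for name, ni := range g.snapshot {
		if ni.freeCPU >= cpu {
			return name, nil
		}
	}
	return "", errors.New("no feasible node")
}

// Preempt deliberately reuses g.snapshot instead of re-reading g.cache, so it
// sees exactly the same cluster state that Schedule just rejected.
func (g *genericScheduler) Preempt(cpu int) (string, bool) {
	for name, ni := range g.snapshot {
		if ni.freeCPU >= cpu {
			return name, true
		}
	}
	return "", false
}

func main() {
	g := &genericScheduler{cache: map[string]nodeInfo{
		"node-a": {name: "node-a", freeCPU: 1},
	}}
	if _, err := g.Schedule(4); err != nil {
		// A node joins the cluster between scheduling and preemption.
		g.cache["node-b"] = nodeInfo{name: "node-b", freeCPU: 8}
		_, nominated := g.Preempt(4)
		fmt.Println("nominated without preemption:", nominated) // false: snapshot reused
	}
}
```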
ravisantoshgudimetla left a comment:
/lgtm
Thanks for the PR and the explanation, @bsalamat.
[APPROVALNOTIFIER] This PR is APPROVED.
This pull request has been approved by: bsalamat, Huang-Wei, ravisantoshgudimetla.
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Automated cherry pick of #72895 (upstream release-1.11)
Automated cherry pick of #72895 (upstream release-1.13)
Automated cherry pick of #72895 (upstream release-1.12)
What type of PR is this?
/kind bug
What this PR does / why we need it:
In clusters with many pending pods of the same priority, a pod that gets a nominated node goes back to the queue behind all other pending pods with the same priority. This is important to avoid starving those other pods, but it can also hold the nominated resources for a long time while the scheduler works through the pods ahead of it. That scenario is unavoidable in general, but the existing implementation makes it worse: preemption updates its scheduler cache snapshot before starting the preemption work. If a node is added to the cluster, or a pod terminates, after the scheduling cycle but before the preemption logic starts, preemption may find a feasible node and nominate it without preempting any pods. The nominated pod then goes back to the queue behind other pods of similar priority, yet none of those pods can be scheduled on that node because it is reserved for the nominated pod, and with thousands of pending pods in the queue it may take minutes before the nominated pod is retried. To avoid this, the preemption logic no longer refreshes the snapshot and instead uses the same one that its corresponding scheduling cycle used.
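As a rough illustration of the difference described above, the toy simulation below contrasts the old behavior (re-snapshot before preemption) with this PR's behavior (reuse the scheduling snapshot) when a node joins the cluster mid-cycle. Everything in it (`simulate`, the node names, the CPU numbers) is hypothetical and heavily simplified.

```go
// Toy comparison of re-snapshotting vs. reusing the scheduling snapshot
// before preemption (illustrative only; the real scheduler, queue, and
// preemption logic are far more involved).
package main

import "fmt"

// simulate reports what happens to a newly added node when a pending pod
// fails scheduling and preemption runs, depending on whether preemption
// re-snapshots the cache (old behavior) or reuses the scheduling snapshot
// (this PR).
func simulate(resnapshotBeforePreempt bool) string {
	snapshotNodes := []string{"node-a"}       // what the scheduling attempt saw
	liveNodes := []string{"node-a", "node-b"} // node-b joined mid-cycle

	nodesSeenByPreemption := snapshotNodes
	if resnapshotBeforePreempt {
		nodesSeenByPreemption = liveNodes
	}

	for _, n := range nodesSeenByPreemption {
		if n == "node-b" {
			// Feasible without evicting anyone: the pod is merely nominated,
			// re-queued behind equal-priority pods, and node-b sits reserved
			// but idle until the pod is retried.
			return "pod nominated to node-b; node-b reserved but idle"
		}
	}
	// No nomination: node-b is immediately usable by the next pending pod.
	return "no nomination; next pending pod schedules onto node-b"
}

func main() {
	fmt.Println("old behavior: ", simulate(true))
	fmt.Println("with this PR:", simulate(false))
}
```

Running it prints that under the old behavior node-b ends up reserved but idle, while with this change the next pending pod can use node-b right away.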
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/sig scheduling