
activate unschedulable pods only if the node became more schedulable #71551


Merged: 1 commit merged into kubernetes:master on Dec 10, 2018

Conversation

@mlmhl (Contributor) commented Nov 29, 2018

What type of PR is this?
/kind feature

What this PR does / why we need it:

This is a new scheduler performance optimization PR that tries to fix issue #70316. Compared with the earlier PR, it checks node condition updates in a more fine-grained way: heartbeat timestamp updates of a node's conditions are ignored.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #70316

Special notes for your reviewer:

I'm not sure how many side effects this change has: it always copies the node's conditions before comparison, which may increase the scheduler's resource consumption (memory and CPU) in a large cluster. cc @bsalamat @Huang-Wei

Does this PR introduce a user-facing change?:

The scheduler only activates unschedulable pods if a node's scheduling-related properties change.
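
A minimal sketch of the condition check described above, for illustration (the helper name and the use of reflect.DeepEqual are assumptions, not necessarily the exact code in the diff):

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeConditionsChangedSketch compares node conditions with the heartbeat
// timestamps zeroed out on copies, so heartbeat-only updates are not treated
// as a scheduling-related change.
func nodeConditionsChangedSketch(newNode, oldNode *v1.Node) bool {
	strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {
		stripped := make([]v1.NodeCondition, len(conditions))
		copy(stripped, conditions)
		for i := range stripped {
			stripped[i].LastHeartbeatTime = metav1.Time{} // ignore heartbeat updates
		}
		return stripped
	}
	return !reflect.DeepEqual(strip(oldNode.Status.Conditions), strip(newNode.Status.Conditions))
}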

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 29, 2018
@mlmhl (Contributor, Author) commented Nov 29, 2018

/assign @bsalamat

@wgliang (Contributor) commented Nov 29, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Nov 29, 2018
@@ -1064,6 +1067,72 @@ func (c *configFactory) invalidateCachedPredicatesOnNodeUpdate(newNode *v1.Node,
}
}

func nodeSchedulingPropertiesChanged(newNode *v1.Node, oldNode *v1.Node) bool {
if nodeAllocatableChanged(newNode, oldNode) {
klog.V(4).Infof("Allocatable resource of node %s changed", newNode.Name)

Contributor:

Maybe just logging that something changed is not enough? In my opinion, it is necessary to print the changed values as well; otherwise this log will not have much value to the reader.

Contributor:

+1, actually I don't think you need to log anything in this function :-)

@resouer (Contributor) commented Nov 29, 2018

/assign

The idea is quite valid; I will take a round of review.

oldTaints, oldErr := helper.GetTaintsFromNodeAnnotations(oldNode.GetAnnotations())
if oldErr != nil {
// If parse old node's taint annotation failed, we assume node's taint changed.
klog.Errorf("Failed to get taints from annotation of old node %s: %v", oldNode.Name, oldErr)

Contributor:

This sounds aggressive; I think just logging the error is enough.

}

func nodeConditionsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {

Contributor:

Instead of creating a fake NodeCondition slice, it would be cleaner to just do:

Suggested change
strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {
oldConditions := make(map[v1.NodeConditionType]v1.ConditionStatus)
newConditions := make(map[v1.NodeConditionType]v1.ConditionStatus)
for _, cond := range oldNode.Status.Conditions {
oldConditions[cond.Type] = cond.Status
}
for _, cond := range newNode.Status.Conditions {
newConditions[cond.Type] = cond.Status
}

WDYT?

Contributor (Author):

This optimization LGTM; it makes the condition comparison simpler and faster.
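
For reference, a sketch of the whole function with the suggested map-based comparison applied (the final reflect.DeepEqual call is an assumption about how the two maps would be compared):

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// nodeConditionsChanged reduces each condition list to a Type -> Status map,
// which drops timestamps entirely, and then compares the two maps.
func nodeConditionsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	oldConditions := make(map[v1.NodeConditionType]v1.ConditionStatus, len(oldNode.Status.Conditions))
	newConditions := make(map[v1.NodeConditionType]v1.ConditionStatus, len(newNode.Status.Conditions))
	for _, cond := range oldNode.Status.Conditions {
		oldConditions[cond.Type] = cond.Status
	}
	for _, cond := range newNode.Status.Conditions {
		newConditions[cond.Type] = cond.Status
	}
	return !reflect.DeepEqual(oldConditions, newConditions)
}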

klog.V(4).Infof("Conditions of node %s changed", newNode.Name)
return true
}
if newNode.Spec.Unschedulable != oldNode.Spec.Unschedulable && newNode.Spec.Unschedulable == false {

Contributor:

If you wrap this line in a helper function and remove the klog lines, you get super clean code for this part. :-)
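
A sketch of that suggestion (the helper name is illustrative; it assumes the same v1 import as the sketches above):

// nodeUnschedulableChanged is true only when the node flipped from
// unschedulable to schedulable, i.e. it became more schedulable.
func nodeUnschedulableChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	return newNode.Spec.Unschedulable != oldNode.Spec.Unschedulable && !newNode.Spec.Unschedulable
}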

@resouer (Contributor) left a comment:

Just took a first round of review :-)

@@ -992,7 +992,10 @@ func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
}

c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
c.podQueue.MoveAllToActiveQueue()
// Only activate unschedulable pods if the node became more schedulable.

Member:

An optimization we can perform here is to look at the unschedulableQueue, and if there is no pod in it, skip the check for changes in the node object and call MoveAllToActiveQueue directly. This optimization will be useful in large clusters where many nodes send updates and there are no unschedulable pods in the cluster.

Contributor (Author):
Sorry, I missed this optimization; I will add it later.

@@ -1064,6 +1067,72 @@ func (c *configFactory) invalidateCachedPredicatesOnNodeUpdate(newNode *v1.Node,
}
}

func nodeSchedulingPropertiesChanged(newNode *v1.Node, oldNode *v1.Node) bool {

Member:

Can we move this to helper/util and make it public, so others can reuse it with predicates and priorities? :)

Member:
xref kubernetes-retired/kube-batch#491

@jiaxuanzhou , something like this PR seems better :)

klog.Errorf("Failed to get taints from annotation of old node %s: %v", oldNode.Name, oldErr)
return true
}
newTaints, newErr := helper.GetTaintsFromNodeAnnotations(newNode.GetAnnotations())

Member:

GetTaintsFromNodeAnnotations is only used for validation right now; if we do not cherry-pick this PR, it does not seem necessary to me.

BTW, Toleration/Taint has been GA for a while, so we should remove the annotation :)

Contributor (Author):

Yeah, it's not appropriate to use GetTaintsFromNodeAnnotations here. I'm not sure whether we need to stay compatible with old nodes that still have taint annotations; if not, removing the annotation comparison LGTM.

Member:
+1 to @k82cn's point. I don't think we need to process taints from annotations.

Contributor (Author):
Done
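
With the annotation path dropped, the taint check can compare node.Spec.Taints directly; a sketch, using reflect.DeepEqual as an illustrative choice of comparison:

// nodeTaintsChanged compares taints straight from the node spec; no
// annotation parsing is involved.
func nodeTaintsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	return !reflect.DeepEqual(newNode.Spec.Taints, oldNode.Spec.Taints)
}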

strippedConditions := make([]v1.NodeCondition, len(conditions))
copy(strippedConditions, conditions)
for i := range strippedConditions {
strippedConditions[i].LastHeartbeatTime = metav1.Time{}

Member:

It seems we only enhanced nodeConditionsChanged; why does this PR include the other parts compared to #70366? :)

Contributor (Author):

#70366 was reverted, so I included the other comparisons in this PR too.

@mlmhl (Contributor, Author) commented Dec 4, 2018

@resouer @bsalamat @k82cn All comments are addressed, PTAL :)

@@ -530,6 +537,13 @@ func (p *PriorityQueue) DeleteNominatedPodIfExists(pod *v1.Pod) {
p.lock.Unlock()
}

// HasUnschedulablePods returns true if any unschedulable pods exist in the SchedulingQueue.
func (p *PriorityQueue) HasUnschedulablePods() bool {
p.lock.Lock()

Contributor:

There is no write operation here, and p.lock is a sync.RWMutex, so I recommend using p.lock.RLock().
(FYI: https://golang.org/pkg/sync/#RWMutex)

Member:
+1. This should be a read lock.
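
A sketch of the method with the read lock applied; the unschedulableQ.pods field is an assumption about the PriorityQueue internals of that release:

// HasUnschedulablePods returns true if any unschedulable pods exist in the
// SchedulingQueue. A read lock is enough because nothing is mutated here.
func (p *PriorityQueue) HasUnschedulablePods() bool {
	p.lock.RLock()
	defer p.lock.RUnlock()
	return len(p.unschedulableQ.pods) > 0 // unschedulableQ.pods assumed as the backing map
}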

@@ -992,7 +992,10 @@ func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
}

c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
c.podQueue.MoveAllToActiveQueue()
// Only activate unschedulable pods if the node became more schedulable.
if c.podQueue.HasUnschedulablePods() && nodeSchedulingPropertiesChanged(newNode, oldNode) {

Member:
You should instead do the following:

Suggested change
if c.podQueue.HasUnschedulablePods() && nodeSchedulingPropertiesChanged(newNode, oldNode) {
if !c.podQueue.HasUnschedulablePods() || nodeSchedulingPropertiesChanged(newNode, oldNode) {
c.podQueue.MoveAllToActiveQueue()
}

It may seem odd to move pods when there is nothing in the unschedulable queue, but we should do so, because there may be a pod that the scheduler is currently processing which may be determined "unschedulable". The scheduling queue has a mechanism to retry such a pod when a "move to active queue" event happens. So what the above optimization does is: when there is no unschedulable pod in the unschedulable queue, it skips the node comparison and sends a "move to active queue" request.
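
Putting the pieces together, the node update handler would then look roughly like this sketch (error logging and scheduler cache bookkeeping elided):

func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
	oldNode, ok := oldObj.(*v1.Node)
	if !ok {
		return // real code would log the unexpected type
	}
	newNode, ok := newObj.(*v1.Node)
	if !ok {
		return
	}
	// ... scheduler cache update elided ...
	c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
	// Skip the node comparison when the unschedulable queue is empty, but
	// still move pods so that a pod the scheduler is currently failing can
	// be retried; otherwise only move pods when the node became more
	// schedulable.
	if !c.podQueue.HasUnschedulablePods() || nodeSchedulingPropertiesChanged(newNode, oldNode) {
		c.podQueue.MoveAllToActiveQueue()
	}
}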


@mlmhl mlmhl force-pushed the scheduler_optimization branch from 7d25f32 to 2fe9b14 Compare December 10, 2018 02:01
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 10, 2018
@mlmhl (Contributor, Author) commented Dec 10, 2018

@bsalamat PR rebased :)

@mlmhl (Contributor, Author) commented Dec 10, 2018

/test pull-kubernetes-e2e-gce-100-performance

@Huang-Wei (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2018
@mlmhl (Contributor, Author) commented Dec 10, 2018

/retest

@k8s-ci-robot k8s-ci-robot merged commit 698db70 into kubernetes:master Dec 10, 2018
@bsalamat (Member):

@mlmhl Please go ahead and send the backport PRs. In case you don't have the time, please let us know.

@mlmhl (Contributor, Author) commented Dec 11, 2018

@bsalamat The backport PRs have been sent for 1.13, 1.12, and 1.11, PTAL :)

k8s-ci-robot added a commit that referenced this pull request Dec 11, 2018
…upstream-release-1.13

Automated cherry pick of #71551: activate unschedulable pods only if the node became more
k8s-ci-robot added a commit that referenced this pull request Dec 12, 2018
…upstream-release-1.12

Automated cherry pick of #71551: activate unschedulable pods only if the node became more
k8s-ci-robot added a commit that referenced this pull request Dec 12, 2018
…upstream-release-1.11

Automated cherry pick of #71551: activate unschedulable pods only if the node became more
foxish added a commit that referenced this pull request Dec 14, 2018
k8s-ci-robot added a commit that referenced this pull request Dec 14, 2018
…ry-pick-of-#71551-upstream-release-1.11

Revert "Automated cherry pick of #71551: activate unschedulable pods only if the node became more"
bsalamat added a commit that referenced this pull request Jan 4, 2019
k8s-ci-robot added a commit that referenced this pull request Jan 8, 2019
…51-upstream-release-1.11

Automated cherry pick of #71551: activate unschedulable pods only if the node became more
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Process unschedulable pods on node updates more efficiently
8 participants