
activate unschedulable pods only if the node became more schedulable #71551


Merged: 1 commit merged into kubernetes:master on Dec 10, 2018

Conversation

@mlmhl (Contributor) commented Nov 29, 2018

What type of PR is this?
/kind feature

What this PR does / why we need it:

This is a new scheduler performance optimization PR that tries to fix issue #70316. Compared with the earlier PR, it checks node condition updates in a more fine-grained way: heartbeat timestamp updates of a node's conditions are ignored.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #70316

Special notes for your reviewer:

I'm not sure how many side effects this change has: it always copies the node's conditions before comparison, which may increase the scheduler's resource consumption (memory and CPU) in a large cluster. cc @bsalamat @Huang-Wei

Does this PR introduce a user-facing change?:

The scheduler only activates unschedulable pods if a node's scheduling-related properties change.
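
A minimal sketch of the condition check described above, for illustration (the helper name and the use of reflect.DeepEqual are assumptions, not necessarily the exact code in the diff):

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeConditionsChangedSketch compares node conditions with the heartbeat
// timestamps zeroed out on copies, so heartbeat-only updates are not treated
// as a scheduling-related change.
func nodeConditionsChangedSketch(newNode, oldNode *v1.Node) bool {
	strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {
		stripped := make([]v1.NodeCondition, len(conditions))
		copy(stripped, conditions)
		for i := range stripped {
			stripped[i].LastHeartbeatTime = metav1.Time{} // ignore heartbeat updates
		}
		return stripped
	}
	return !reflect.DeepEqual(strip(oldNode.Status.Conditions), strip(newNode.Status.Conditions))
}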

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 29, 2018
@mlmhl (Contributor, Author) commented Nov 29, 2018

/assign @bsalamat

@wgliang (Contributor) commented Nov 29, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Nov 29, 2018
@@ -1064,6 +1067,72 @@ func (c *configFactory) invalidateCachedPredicatesOnNodeUpdate(newNode *v1.Node,
}
}

func nodeSchedulingPropertiesChanged(newNode *v1.Node, oldNode *v1.Node) bool {
if nodeAllocatableChanged(newNode, oldNode) {
klog.V(4).Infof("Allocatable resource of node %s changed", newNode.Name)

Contributor:

Maybe just logging that something changed is not enough? In my opinion, it is necessary to print the changed values as well; otherwise this log will not have much value to the reader.

Contributor:

+1, actually I don't think you need to log anything in this function :-)

@resouer (Contributor) commented Nov 29, 2018

/assign

The idea is quite valid; I will take a round of review.

oldTaints, oldErr := helper.GetTaintsFromNodeAnnotations(oldNode.GetAnnotations())
if oldErr != nil {
// If parse old node's taint annotation failed, we assume node's taint changed.
klog.Errorf("Failed to get taints from annotation of old node %s: %v", oldNode.Name, oldErr)

Contributor:

This sounds aggressive; I think just logging the error is enough.

}

func nodeConditionsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {

Contributor:

Instead of creating a fake NodeCondition slice, it would be cleaner to just do:

Suggested change
strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {
oldConditions := make(map[v1.NodeConditionType]v1.ConditionStatus)
newConditions := make(map[v1.NodeConditionType]v1.ConditionStatus)
for _, cond := range oldNode.Status.Conditions {
oldConditions[cond.Type] = cond.Status
}
for _, cond := range newNode.Status.Conditions {
newConditions[cond.Type] = cond.Status
}

WDYT?

Contributor (Author):

This optimization LGTM; it makes the condition comparison simpler and faster.
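
For reference, a sketch of the whole function with the suggested map-based comparison applied (the final reflect.DeepEqual call is an assumption about how the two maps would be compared):

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// nodeConditionsChanged reduces each condition list to a Type -> Status map,
// which drops timestamps entirely, and then compares the two maps.
func nodeConditionsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	oldConditions := make(map[v1.NodeConditionType]v1.ConditionStatus, len(oldNode.Status.Conditions))
	newConditions := make(map[v1.NodeConditionType]v1.ConditionStatus, len(newNode.Status.Conditions))
	for _, cond := range oldNode.Status.Conditions {
		oldConditions[cond.Type] = cond.Status
	}
	for _, cond := range newNode.Status.Conditions {
		newConditions[cond.Type] = cond.Status
	}
	return !reflect.DeepEqual(oldConditions, newConditions)
}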

klog.V(4).Infof("Conditions of node %s changed", newNode.Name)
return true
}
if newNode.Spec.Unschedulable != oldNode.Spec.Unschedulable && newNode.Spec.Unschedulable == false {

Contributor:

If you wrap this line in a helper function and remove the klog lines, you get super clean code for this part. :-)
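
A sketch of that suggestion (the helper name is illustrative; it assumes the same v1 import as the sketches above):

// nodeUnschedulableChanged is true only when the node flipped from
// unschedulable to schedulable, i.e. it became more schedulable.
func nodeUnschedulableChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	return newNode.Spec.Unschedulable != oldNode.Spec.Unschedulable && !newNode.Spec.Unschedulable
}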

@resouer (Contributor) left a comment:

Just took a first round of review :-)

@@ -992,7 +992,10 @@ func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
}

c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
c.podQueue.MoveAllToActiveQueue()
// Only activate unschedulable pods if the node became more schedulable.

Member:

An optimization we can perform here is to look at the unschedulableQueue, and if there is no pod in it, skip the check for changes in the node object and call MoveAllToActiveQueue directly. This optimization will be useful in large clusters where many nodes send updates and there are no unschedulable pods in the cluster.

Contributor (Author):
Sorry, I missed this optimization; I will add it later.

@@ -1064,6 +1067,72 @@ func (c *configFactory) invalidateCachedPredicatesOnNodeUpdate(newNode *v1.Node,
}
}

func nodeSchedulingPropertiesChanged(newNode *v1.Node, oldNode *v1.Node) bool {

Member:

Can we move this to helper/util and make it public, so others can reuse it with predicates and priorities? :)

Member:
xref kubernetes-retired/kube-batch#491

@jiaxuanzhou , something like this PR seems better :)

klog.Errorf("Failed to get taints from annotation of old node %s: %v", oldNode.Name, oldErr)
return true
}
newTaints, newErr := helper.GetTaintsFromNodeAnnotations(newNode.GetAnnotations())

Member:

GetTaintsFromNodeAnnotations is only used for validation right now; if we do not cherry-pick this PR, it does not seem necessary to me.

BTW, Toleration/Taint has been GA for a while, so we should remove the annotation :)

Contributor (Author):

Yeah, it's not appropriate to use GetTaintsFromNodeAnnotations here. I'm not sure whether we need to stay compatible with old nodes that still have taint annotations; if not, removing the annotation comparison LGTM.

Member:
+1 to @k82cn's point. I don't think we need to process taints from annotations.

Contributor (Author):
Done
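
With the annotation path dropped, the taint check can compare node.Spec.Taints directly; a sketch, using reflect.DeepEqual as an illustrative choice of comparison:

// nodeTaintsChanged compares taints straight from the node spec; no
// annotation parsing is involved.
func nodeTaintsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	return !reflect.DeepEqual(newNode.Spec.Taints, oldNode.Spec.Taints)
}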

strippedConditions := make([]v1.NodeCondition, len(conditions))
copy(strippedConditions, conditions)
for i := range strippedConditions {
strippedConditions[i].LastHeartbeatTime = metav1.Time{}

Member:

It seems we only enhanced nodeConditionsChanged; why does this PR include the other parts compared to #70366? :)

Contributor (Author):

#70366 was reverted, so I included the other comparisons in this PR too.

@mlmhl (Contributor, Author) commented Dec 4, 2018

@resouer @bsalamat @k82cn All comments are addressed, PTAL :)

@@ -530,6 +537,13 @@ func (p *PriorityQueue) DeleteNominatedPodIfExists(pod *v1.Pod) {
p.lock.Unlock()
}

// HasUnschedulablePods returns true if any unschedulable pods exist in the SchedulingQueue.
func (p *PriorityQueue) HasUnschedulablePods() bool {
p.lock.Lock()

Contributor:

There is no write operation here, and p.lock is a sync.RWMutex, so I recommend using p.lock.RLock().
(FYI: https://golang.org/pkg/sync/#RWMutex)

Member:
+1. This should be a read lock.
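
A sketch of the method with the read lock applied; the unschedulableQ.pods field is an assumption about the PriorityQueue internals of that release:

// HasUnschedulablePods returns true if any unschedulable pods exist in the
// SchedulingQueue. A read lock is enough because nothing is mutated here.
func (p *PriorityQueue) HasUnschedulablePods() bool {
	p.lock.RLock()
	defer p.lock.RUnlock()
	return len(p.unschedulableQ.pods) > 0 // unschedulableQ.pods assumed as the backing map
}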

@@ -992,7 +992,10 @@ func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
}

c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
c.podQueue.MoveAllToActiveQueue()
// Only activate unschedulable pods if the node became more schedulable.
if c.podQueue.HasUnschedulablePods() && nodeSchedulingPropertiesChanged(newNode, oldNode) {

Member:
You should instead do the following:

Suggested change
if c.podQueue.HasUnschedulablePods() && nodeSchedulingPropertiesChanged(newNode, oldNode) {
if !c.podQueue.HasUnschedulablePods() || nodeSchedulingPropertiesChanged(newNode, oldNode) {
c.podQueue.MoveAllToActiveQueue()
}

It may seem odd to move pods when there is nothing in the unschedulable queue, but we should do so, because there may be a pod that the scheduler is currently processing which may be determined "unschedulable". The scheduling queue has a mechanism to retry such a pod when a "move to active queue" event happens. So what the above optimization does is: when there is no unschedulable pod in the unschedulable queue, it skips the node comparison and sends a "move to active queue" request.
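
Putting the pieces together, the node update handler would then look roughly like this sketch (error logging and scheduler cache bookkeeping elided):

func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
	oldNode, ok := oldObj.(*v1.Node)
	if !ok {
		return // real code would log the unexpected type
	}
	newNode, ok := newObj.(*v1.Node)
	if !ok {
		return
	}
	// ... scheduler cache update elided ...
	c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
	// Skip the node comparison when the unschedulable queue is empty, but
	// still move pods so that a pod the scheduler is currently failing can
	// be retried; otherwise only move pods when the node became more
	// schedulable.
	if !c.podQueue.HasUnschedulablePods() || nodeSchedulingPropertiesChanged(newNode, oldNode) {
		c.podQueue.MoveAllToActiveQueue()
	}
}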


@mlmhl mlmhl force-pushed the scheduler_optimization branch from 7d25f32 to 2fe9b14 Compare December 10, 2018 02:01
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 10, 2018
@mlmhl (Contributor, Author) commented Dec 10, 2018

@bsalamat PR rebased :)

@mlmhl (Contributor, Author) commented Dec 10, 2018

/test pull-kubernetes-e2e-gce-100-performance

@Huang-Wei (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2018
@mlmhl (Contributor, Author) commented Dec 10, 2018

/retest

@k8s-ci-robot k8s-ci-robot merged commit 698db70 into kubernetes:master Dec 10, 2018
@bsalamat (Member):

@mlmhl Please go ahead and send the backport PRs. In case you don't have the time, please let us know.

@mlmhl (Contributor, Author) commented Dec 11, 2018

@bsalamat The backport PRs have been sent for 1.13, 1.12, and 1.11, PTAL :)

k8s-ci-robot added a commit that referenced this pull request Dec 11, 2018
…upstream-release-1.13

Automated cherry pick of #71551: activate unschedulable pods only if the node became more
k8s-ci-robot added a commit that referenced this pull request Dec 12, 2018
…upstream-release-1.12

Automated cherry pick of #71551: activate unschedulable pods only if the node became more
k8s-ci-robot added a commit that referenced this pull request Dec 12, 2018
…upstream-release-1.11

Automated cherry pick of #71551: activate unschedulable pods only if the node became more
foxish added a commit that referenced this pull request Dec 14, 2018
k8s-ci-robot added a commit that referenced this pull request Dec 14, 2018
…ry-pick-of-#71551-upstream-release-1.11

Revert "Automated cherry pick of #71551: activate unschedulable pods only if the node became more"
bsalamat added a commit that referenced this pull request Jan 4, 2019
k8s-ci-robot added a commit that referenced this pull request Jan 8, 2019
…51-upstream-release-1.11

Automated cherry pick of #71551: activate unschedulable pods only if the node became more
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Process unschedulable pods on node updates more efficiently
8 participants