activate unschedulable pods only if the node became more schedulable #71551
Conversation
/assign @bsalamat
/ok-to-test
pkg/scheduler/factory/factory.go
Outdated
@@ -1064,6 +1067,72 @@ func (c *configFactory) invalidateCachedPredicatesOnNodeUpdate(newNode *v1.Node,
	}
}

func nodeSchedulingPropertiesChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	if nodeAllocatableChanged(newNode, oldNode) {
		klog.V(4).Infof("Allocatable resource of node %s changed", newNode.Name)
Maybe just logging that a change happened is not enough? In my opinion, it is necessary to also print out what actually changed; otherwise this log will not have much value to the reader.
+1, actually I don't think you need to log anything in this function :-)
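For what it's worth, once the logging is dropped, each per-field helper can collapse to a one-line comparison. A minimal sketch of the allocatable check, assuming reflect and k8s.io/api/core/v1 (as v1) are imported as in factory.go:

```go
// nodeAllocatableChanged reports whether the node's allocatable resources
// differ between the old and new node objects.
func nodeAllocatableChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	return !reflect.DeepEqual(oldNode.Status.Allocatable, newNode.Status.Allocatable)
}
```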
/assign
The idea is quite valid, I will take a round of review.
pkg/scheduler/factory/factory.go
Outdated
	oldTaints, oldErr := helper.GetTaintsFromNodeAnnotations(oldNode.GetAnnotations())
	if oldErr != nil {
		// If parse old node's taint annotation failed, we assume node's taint changed.
		klog.Errorf("Failed to get taints from annotation of old node %s: %v", oldNode.Name, oldErr)
That sounds aggressive; I think just logging the error is enough.
pkg/scheduler/factory/factory.go
Outdated
}

func nodeConditionsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {
Instead of creating a fake NodeCondition slice, it would be cleaner to just do:
-	strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {
+	oldConditions := make(map[v1.NodeConditionType]v1.ConditionStatus)
+	newConditions := make(map[v1.NodeConditionType]v1.ConditionStatus)
+	for _, cond := range oldNode.Status.Conditions {
+		oldConditions[cond.Type] = cond.Status
+	}
+	for _, cond := range newNode.Status.Conditions {
+		newConditions[cond.Type] = cond.Status
+	}
WDYT?
This optimization LGTM, it can make the condition comparison operation simpler and faster.
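A sketch of what the map-based version of nodeConditionsChanged could look like end to end (the exact merged code may differ; assumes reflect and k8s.io/api/core/v1 (as v1) are imported):

```go
// nodeConditionsChanged reports whether any condition type/status pair differs
// between the old and new node. Keying by condition type avoids copying the
// condition slices just to blank out LastHeartbeatTime.
func nodeConditionsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	if len(oldNode.Status.Conditions) != len(newNode.Status.Conditions) {
		return true
	}
	oldConditions := make(map[v1.NodeConditionType]v1.ConditionStatus, len(oldNode.Status.Conditions))
	newConditions := make(map[v1.NodeConditionType]v1.ConditionStatus, len(newNode.Status.Conditions))
	for _, cond := range oldNode.Status.Conditions {
		oldConditions[cond.Type] = cond.Status
	}
	for _, cond := range newNode.Status.Conditions {
		newConditions[cond.Type] = cond.Status
	}
	return !reflect.DeepEqual(oldConditions, newConditions)
}
```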
pkg/scheduler/factory/factory.go
Outdated
		klog.V(4).Infof("Conditions of node %s changed", newNode.Name)
		return true
	}
	if newNode.Spec.Unschedulable != oldNode.Spec.Unschedulable && newNode.Spec.Unschedulable == false {
If you wrap this line in a function and remove the klog lines, you get super clean code for this part. :-)
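For illustration, the wrapped check might look like the sketch below (the helper name is an assumption). Returning true only when Unschedulable flips to false is what makes the update count as "more schedulable":

```go
// nodeSpecUnschedulableChanged reports whether the node went from unschedulable
// to schedulable. Cordoning a node (Unschedulable flipping to true) cannot make
// pending pods schedulable, so that direction returns false.
func nodeSpecUnschedulableChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	return newNode.Spec.Unschedulable != oldNode.Spec.Unschedulable && !newNode.Spec.Unschedulable
}
```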
Just took a first round review :-)
@@ -992,7 +992,10 @@ func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
	}

	c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
	c.podQueue.MoveAllToActiveQueue()
	// Only activate unschedulable pods if the node became more schedulable.
An optimization we can perform here is to look at the unschedulable queue: if there is no pod in it, we can skip the check for changes in the node object and call MoveAllToActiveQueue directly. This optimization will be useful in large clusters where many nodes send updates and there are no unschedulable pods in the cluster.
Sorry I missed this optimization, I will add this later.
@@ -1064,6 +1067,72 @@ func (c *configFactory) invalidateCachedPredicatesOnNodeUpdate(newNode *v1.Node,
	}
}

func nodeSchedulingPropertiesChanged(newNode *v1.Node, oldNode *v1.Node) bool {
Can we move this to helper/util and make it public, so that others can reuse it with predicates and priorities? :)
xref kubernetes-retired/kube-batch#491
@jiaxuanzhou, something like this PR seems better :)
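If the helper were exported from a shared package as suggested, the aggregate check could look roughly like this; the exported name and the exact set of per-field helpers (labels, taints, etc.) are assumptions, not necessarily what this PR merges:

```go
// NodeSchedulingPropertiesChanged reports whether a node update could make
// previously unschedulable pods schedulable, by delegating to per-field checks
// that predicates and priorities could also reuse.
func NodeSchedulingPropertiesChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	return nodeSpecUnschedulableChanged(newNode, oldNode) ||
		nodeAllocatableChanged(newNode, oldNode) ||
		nodeLabelsChanged(newNode, oldNode) ||
		nodeTaintsChanged(newNode, oldNode) ||
		nodeConditionsChanged(newNode, oldNode)
}
```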
pkg/scheduler/factory/factory.go
Outdated
		klog.Errorf("Failed to get taints from annotation of old node %s: %v", oldNode.Name, oldErr)
		return true
	}
	newTaints, newErr := helper.GetTaintsFromNodeAnnotations(newNode.GetAnnotations())
GetTaintsFromNodeAnnotations is only used for validation right now; if we do not cherry-pick this PR, it doesn't seem necessary to me.
BTW, Toleration/Taint has been GA for a while, we should remove the annotation :)
Yeah, it's not appropriate to use GetTaintsFromNodeAnnotations here. I'm not sure whether we need to stay compatible with old nodes that have taint annotations; if not, removing the annotation comparison LGTM.
+1 to @k82cn's point. I don't think we need to process taints from annotations.
Done
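With the annotation path dropped, the taint check reduces to comparing node.Spec.Taints directly; a minimal sketch, assuming reflect.DeepEqual over the taint slices is acceptable:

```go
// nodeTaintsChanged reports whether the node's taints changed, looking only at
// node.Spec.Taints now that taints/tolerations are GA and the annotation form
// is no longer consulted.
func nodeTaintsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	return !reflect.DeepEqual(newNode.Spec.Taints, oldNode.Spec.Taints)
}
```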
pkg/scheduler/factory/factory.go
Outdated
		strippedConditions := make([]v1.NodeCondition, len(conditions))
		copy(strippedConditions, conditions)
		for i := range strippedConditions {
			strippedConditions[i].LastHeartbeatTime = metav1.Time{}
It seems we only enhanced nodeConditionsChanged, so why does this PR include the other parts compared to #70366? :)
#70366 was reverted, so I included the other comparisons in this PR too.
@@ -530,6 +537,13 @@ func (p *PriorityQueue) DeleteNominatedPodIfExists(pod *v1.Pod) {
	p.lock.Unlock()
}

// HasUnschedulablePods returns true if any unschedulable pods exist in the SchedulingQueue.
func (p *PriorityQueue) HasUnschedulablePods() bool {
	p.lock.Lock()
There is no write operation here, and p.lock is a sync.RWMutex, so I recommend using p.lock.RLock(). (FYI: https://golang.org/pkg/sync/#RWMutex)
+1. This should be a read lock.
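With the read lock applied, the accessor would look roughly like the sketch below (the unschedulableQ.pods field name is an assumption about the PriorityQueue internals):

```go
// HasUnschedulablePods returns true if any unschedulable pods exist in the
// SchedulingQueue. Only a read lock is needed because nothing is mutated.
func (p *PriorityQueue) HasUnschedulablePods() bool {
	p.lock.RLock()
	defer p.lock.RUnlock()
	return len(p.unschedulableQ.pods) != 0
}
```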
pkg/scheduler/factory/factory.go
Outdated
@@ -992,7 +992,10 @@ func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
	}

	c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
	c.podQueue.MoveAllToActiveQueue()
	// Only activate unschedulable pods if the node became more schedulable.
	if c.podQueue.HasUnschedulablePods() && nodeSchedulingPropertiesChanged(newNode, oldNode) {
You should instead do the following:
-	if c.podQueue.HasUnschedulablePods() && nodeSchedulingPropertiesChanged(newNode, oldNode) {
+	if !c.podQueue.HasUnschedulablePods() || nodeSchedulingPropertiesChanged(newNode, oldNode) {
 		c.podQueue.MoveAllToActiveQueue()
 	}
It may seem odd to move pods when there is nothing in the unschedulable queue, but we should do so, because there may be a pod that the scheduler is currently processing that gets determined "unschedulable". The scheduling queue has a mechanism to retry such a pod when a "move to active queue" event happens. So, what the above optimization does is that when there is no unschedulable pod in the unschedulable queue, it skips the node comparison and sends a move-to-active-queue request.
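Putting both pieces together, the tail of updateNodeInCache could end up looking roughly like this sketch (based on the discussion above, not necessarily the exact merged code):

```go
// Tail of updateNodeInCache; oldNode/newNode are the already type-asserted *v1.Node values.
c.invalidateCachedPredicatesOnNodeUpdate(newNode, oldNode)
// If the unschedulable queue is empty, skip the node comparison and move
// anyway: the pod the scheduler is processing right now may turn out to be
// unschedulable, and the queue relies on a "move to active queue" event to
// retry it.
if !c.podQueue.HasUnschedulablePods() || nodeSchedulingPropertiesChanged(newNode, oldNode) {
	c.podQueue.MoveAllToActiveQueue()
}
```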
7d25f32 to 2fe9b14
@bsalamat PR rebased :)
/test pull-kubernetes-e2e-gce-100-performance
/lgtm
/retest
@mlmhl Please go ahead and send the backport PRs. In case you don't have the time, please let us know.
@bsalamat The backport PRs have been sent to 1.13, 1.12, and 1.11, PTAL :)
…upstream-release-1.13 Automated cherry pick of #71551: activate unschedulable pods only if the node became more
…upstream-release-1.12 Automated cherry pick of #71551: activate unschedulable pods only if the node became more
…upstream-release-1.11 Automated cherry pick of #71551: activate unschedulable pods only if the node became more
…ry-pick-of-#71551-upstream-release-1.11 Revert "Automated cherry pick of #71551: activate unschedulable pods only if the node became more"
…le pods only if the node became more""
…51-upstream-release-1.11 Automated cherry pick of #71551: activate unschedulable pods only if the node became more
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a new scheduler performance optimization PR that tries to fix issue #70316. Compared with the earlier PR, this one checks node condition updates in a more fine-grained way: it ignores heartbeat timestamp updates in the node's conditions.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #70316
Special notes for your reviewer:
I'm not sure how many side effects this change has: it always copies the node's conditions before comparison, which may increase the scheduler's resource consumption (memory and CPU) in a large cluster. cc @bsalamat @Huang-Wei
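For context, the copy-based comparison this note refers to looks roughly like the sketch below, reconstructed from the diff fragments above (assumes metav1 is k8s.io/apimachinery/pkg/apis/meta/v1); the map-based variant suggested in review avoids the per-update slice copies:

```go
// nodeConditionsChanged, copy-based variant: clone both condition slices,
// blank out the heartbeat timestamps, and DeepEqual the rest.
func nodeConditionsChanged(newNode *v1.Node, oldNode *v1.Node) bool {
	strip := func(conditions []v1.NodeCondition) []v1.NodeCondition {
		strippedConditions := make([]v1.NodeCondition, len(conditions))
		copy(strippedConditions, conditions)
		for i := range strippedConditions {
			strippedConditions[i].LastHeartbeatTime = metav1.Time{}
		}
		return strippedConditions
	}
	return !reflect.DeepEqual(strip(oldNode.Status.Conditions), strip(newNode.Status.Conditions))
}
```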
Does this PR introduce a user-facing change?: