support NoSchedule taints correctly in DaemonSet controller #48189
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mikedanese. Assign the PR to them by writing an assign command in a comment. Associated issue: 48190. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files. You can indicate your approval by writing an approve command in a comment.
cc @kubernetes/sig-apps-pr-reviews @kubernetes/sig-scheduling-pr-reviews @kubernetes/kubernetes-release-managers
This is a regression in 1.6. We need to cherry-pick this there @enisoc @ethernetdan
// only emit this event if insufficient resource is the only thing
// preventing the daemon pod from scheduling
if shouldSchedule && insufficientResourceErr != nil {
	dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedPlacementReason, "failed to place pod on %q: %s", node.ObjectMeta.Name, insufficientResourceErr.Error())
How often do we retry in case of this Error?
It would be much clearer to determine that if the following return statement listed the actual variables it returns. It seems that err is nil at this point, so unless the controller needs to retry for a different error further down the code path, this won't be retried.
I think it's ok to not requeue on insufficientResourceErr as long as we are requeueing when something changes that would cause resources to be freed up. Regardless of whether something is doing that right now, this is not a change in behavior and not the bug I am trying to fix.
}
if !fit {
	predicateFails = append(predicateFails, reasons...)
}

fit, _, err = predicates.PodToleratesNodeNoExecuteTaints(pod, nil, nodeInfo)
A question here: can you please add some comments on why the taint is checked twice, at both line 1190 and line 1198?
I see, this is mainly to check whether we want to evict the pod or not, but it seems a bit strange: PodToleratesNodeTaints above already checked both taints, and here we check the NoExecute taint separately.
I think that @janetkuo's proposal is better and would involve less code change.
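To make the answer to the "why twice" question concrete: a minimal, illustrative sketch (not the PR's code) of what each of the two taint checks decides, using only the predicate helpers that appear in the diff:

```go
// daemonPodTaintFit is an illustrative helper (not part of the PR) showing why
// the taints are checked twice: the full check decides whether a new daemon
// pod may be scheduled, the NoExecute-only check decides whether a daemon pod
// already on the node may keep running.
func daemonPodTaintFit(pod *v1.Pod, nodeInfo *schedulercache.NodeInfo) (shouldSchedule, shouldContinueRunning bool, err error) {
	// All taint effects (NoSchedule and NoExecute): a violation means a new
	// daemon pod must not be placed on this node.
	shouldSchedule, _, err = predicates.PodToleratesNodeTaints(pod, nil, nodeInfo)
	if err != nil {
		return false, false, err
	}
	// NoExecute taints only: a violation means a daemon pod already running on
	// the node has to be evicted; a NoSchedule-only taint does not evict it.
	shouldContinueRunning, _, err = predicates.PodToleratesNodeNoExecuteTaints(pod, nil, nodeInfo)
	if err != nil {
		return false, false, err
	}
	return shouldSchedule, shouldContinueRunning, nil
}
```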
@@ -1190,18 +1183,24 @@ func NewPod(ds *extensions.DaemonSet, nodeName string) *v1.Pod {

 // Predicates checks if a DaemonSet's pod can be scheduled on a node using GeneralPredicates
 // and PodToleratesNodeTaints predicate
-func Predicates(pod *v1.Pod, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
+func Predicates(pod *v1.Pod, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, bool, error) {
 	var predicateFails []algorithm.PredicateFailureReason
 	critical := utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) && kubelettypes.IsCriticalPod(pod)

 	fit, reasons, err := predicates.PodToleratesNodeTaints(pod, nil, nodeInfo)
IMO, we should consider the pod's status in PodToleratesNodeTaints: if the pod is running, it tolerates the node's NoSchedule effect (even if spec.Tolerations does not say so). Or create another func for that. And I think we need to check other code paths for this case.
@gmarek, WDYT?
Please bear with me for a few more days. I promise that I'll be more responsive after 1.7 is released...
sure, np :).
@k82cn I think that is a reasonable change for PodToleratesNodeTaints to take into account whether .spec.nodeName is set on the pod (which is what I consider to be the definition of scheduled, although others may disagree), but that change wouldn't help us here. The pod here always has a nodeName set when we call Predicates and never has a status. This could be fixed to work, but it would be a larger change than what I would want to cherry-pick.
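To make the alternative concrete, here is a hypothetical sketch of a nodeName-aware taint check along the lines discussed above; it is not the real predicate and not the gist linked further down, just an illustration built from the two helpers the diff already uses:

```go
// Hypothetical variant, for illustration only: once a pod is bound to a node
// (.spec.nodeName set), NoSchedule taints no longer matter for it and only
// NoExecute taints can force it off the node.
func podToleratesNodeTaintsWhenBound(pod *v1.Pod, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	if pod.Spec.NodeName != "" {
		// Treat the pod as already scheduled: only NoExecute taints apply.
		return predicates.PodToleratesNodeNoExecuteTaints(pod, nil, nodeInfo)
	}
	// Not scheduled yet: all taint effects apply.
	return predicates.PodToleratesNodeTaints(pod, nil, nodeInfo)
}
```

As the comment above points out, this alone wouldn't help the DaemonSet controller, because the simulated pod always has nodeName set when Predicates is called.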
TL;DR: it'd be too big a change to cherry-pick for 1.7.
OK, we spoke about that, and it all depends on the definition of what 'scheduled' means. Here the Pod that's being checked is completely artificial, created specially for this simulation, with NodeName already set. One can argue that the semantics of 'NoSchedule' should prevent setting NodeName, i.e. it should be ignored if NodeName is set. But I'd propose another meaning of the word 'scheduled', i.e. 'the Node agent already knows about it'. Sadly this would mean that the only thing one can be certain of is that if a Pod doesn't have NodeName set, it's not scheduled; otherwise it can't be determined by looking at the Pod object alone. Which means that PodToleratesNodeTaints needs to just return which taints are possibly violated and depend on the caller to figure out what to do with them. Exactly as we're doing in this PR. Post 1.7 we can discuss this.
@mikedanese, I mean pod.Status.Phase is v1.PodRunning. As @gmarek suggested, let's discuss this after 1.7.
Here's the draft diff (not tested yet) that I have in mind: https://gist.github.com/k82cn/3dbd398ef2b138fee6a4c19ec3d2ddcf
It won't be anything, because this Pod is an artificial one, created for simulation.
Wow, yes. But we already get the pod of that node in nodeShouldRunDaemonPod; we can keep its status.phase and update the gist a bit.
But this would mean that everyone who does such a simulation needs to remember/know about it. Which is why I think it's not worth it, as I believe it'd introduce more bugs than it would solve.
Can we do this without reverting the change (and keep all the tests)?
- Leave Predicates as is.
- Before calling hasIntentionalPredicatesReasons, check PodToleratesNodeNoExecuteTaints: if it doesn't fit, return false, false, false immediately.
- Move the predicates.ErrTaintsTolerationsNotMatch case out of hasIntentionalPredicatesReasons and down to the for range reasons loop; this case should only set wantToRun and shouldSchedule to false (since we're sure it's not NoExecute).

(A rough sketch of this flow is given below.)
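A hypothetical sketch of that proposed flow inside nodeShouldRunDaemonPod, where newPod is the simulated daemon pod and reasons, nodeInfo, wantToRun, and shouldSchedule are that function's variables; this illustrates the proposal, not the merged code:

```go
// Sketch of the proposed flow (illustrative only).

// Check NoExecute taints up front: if the simulated daemon pod does not
// tolerate them, it should not run, be scheduled, or continue running here.
fitsNoExecute, _, err := predicates.PodToleratesNodeNoExecuteTaints(newPod, nil, nodeInfo)
if err != nil {
	return false, false, false, err
}
if !fitsNoExecute {
	return false, false, false, nil
}

// Then walk the remaining predicate failure reasons. Since NoExecute taints
// are already known to be tolerated, a taint/toleration mismatch here can only
// come from a NoSchedule taint: block scheduling a new daemon pod, but let an
// existing one keep running.
for _, r := range reasons {
	switch reason := r.(type) {
	case *predicates.PredicateFailureError:
		switch reason {
		case predicates.ErrTaintsTolerationsNotMatch:
			wantToRun, shouldSchedule = false, false
		}
	}
}
```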
…e intentional reasons." This partially reverts commit 2b311fe. We drop the changes to the DaemonSet controller but leave the test. By reverting the changes, we make it easier to return different values of shouldContinueRunning for intentional predicate failures, rather than lumping all intentional predicate failures together. The test should continue to pass after the fix.
…ller And add some unit tests.
I reverted changes to Predicates, but I don't like breaking up the switch. The function has a very clear contract (documented in a comment at the beginning, which I expanded on). Since
/test pull-kubernetes-kubemark-e2e-gce
I recalled the major concern is the function
I don't think breaking up the switch is the best way of doing that. It makes the logic very fragmented IMO. We can't treat all intentional predicates the same so
@mikedanese what do you mean by
I'm referring to the section where we simulate what would happen if we were to add a new daemon pod to the node. That looks like it would cleanly extract to a function with the signature: func (dsc *DaemonSetsController) simulate(pod, node) (reasons, error)
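For illustration, one hypothetical shape for that extraction; podsOnNode below is a made-up helper standing in for however the controller lists the pods already assigned to the node, and the real signature and body may differ:

```go
// simulate builds a NodeInfo from the pods already on the node, runs the
// DaemonSet Predicates for the artificial daemon pod, and hands the failure
// reasons back to the caller to interpret (NoSchedule vs. NoExecute, etc.).
func (dsc *DaemonSetsController) simulate(newPod *v1.Pod, node *v1.Node) ([]algorithm.PredicateFailureReason, error) {
	// Hypothetical helper: the pods currently assigned to this node.
	pods, err := dsc.podsOnNode(node.Name)
	if err != nil {
		return nil, err
	}

	nodeInfo := schedulercache.NewNodeInfo(pods...)
	if err := nodeInfo.SetNode(node); err != nil {
		return nil, err
	}

	_, reasons, err := Predicates(newPod, nodeInfo)
	if err != nil {
		return nil, err
	}
	return reasons, nil
}
```

The follow-up change referenced further down did factor a simulate function out of nodeShouldRunDaemonPod, though its exact shape is not shown here.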
I originally suggested doing this without reverting 2b311fe so that this change stays small and the 1.7 cherry-pick would be simpler. We can do code cleanup as a follow-up in master. However, reverting that commit makes cherry-picking to 1.6 easier, so I'm okay with it now.
@mikedanese got it, will try to create a PR for this later.
@mikedanese a PR #48189 for the comments above.
/retest
Automatic merge from submit-queue
Addressed comments from kubernetes#48189 (comment)
Automatic merge from submit-queue

Factored out simulate from nodeShouldRunDaemonPod. Addressed comments from #48189 (comment).

Release note: none

/sig apps
Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.
Fixes #48190
cc @kubernetes/sig-apps-pr-reviews