Skip to content

Conversation

@bsalamat
Copy link
Member

@bsalamat bsalamat commented Jan 19, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fixes a race condition between node controller applying taints for new nodes and the scheduler using new nodes for scheduling pods by adding not ready taint to new nodes at admission time.

Which issue(s) this PR fixes:

Fixes #72129

Special notes for your reviewer:
My understanding is that we do not need to add the taint in node updates, mainly because node condition updates may arrive anytime and can race with other K8s operations. For example, pods may be scheduled on nodes that may become unavailable shortly afterwards in normal condition and K8s handles those scenarios well. The bigger problem that we are trying to fix is to avoid placing pods on newly added nodes.

UPDATE: Given that tainting a node at update reduces wasted work and provides better experience with respect to scheduling pods on more reliable nodes, I added the logic to taint nodes at updates as well.

Does this PR introduce a user-facing change?:

A new `TaintNodesByCondition` admission plugin taints newly created Node objects as "not ready", to fix a race condition that could cause pods to be scheduled on new nodes before their taints were updated to accurately reflect their reported conditions. This admission plugin is enabled by default if the `TaintNodesByCondition` feature is enabled.

/sig scheduling
/sig node
/priority critical-urgent

@bsalamat bsalamat requested a review from liggitt January 19, 2019 00:34
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/node Categorizes an issue or PR as relevant to SIG Node. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 19, 2019
@bsalamat
Copy link
Member Author

cc/ @ravisantoshgudimetla @k82cn

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Jan 19, 2019
@ravisantoshgudimetla
Copy link
Contributor

For example, pods may be scheduled on nodes that may become unavailable shortly afterwards in normal condition and K8s handles those scenarios well.

Yeah, that is right, eventually the pods may get evicted from those nodes which become unavailable over time, also the pods get scheduled onto those kind of nodes - more on that scenario can be found here: #72548 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/the taint/the NotReadytaint.

the comment could be further expanded to something like the new node may receive taint from nodecontroller with some delay..

Copy link
Contributor

@ravisantoshgudimetla ravisantoshgudimetla Jan 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Probably a test to ensure we don't add this taint, if the taint already exists.

@liggitt
Copy link
Member

liggitt commented Jan 19, 2019

I think we'd want this in its own admission plugin that we could enable by default here:

defaultOnPlugins := sets.NewString(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there an existing key for the notready taint the node controller manages?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one in scheduler API, but I didn't want to create a dependency to the API only for this label.

@resouer
Copy link
Contributor

resouer commented Jan 21, 2019

@answer1991 Please take a look, this patch is critical and align with your original proposal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we may also need to point out who is expected to un-taint the node in the future (NodeLifecycleController etc).

@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 22, 2019
@bsalamat
Copy link
Member Author

Thanks @liggitt, @ravisantoshgudimetla, @resouer for your comments and reviews. I addressed all the comments. PTAL.

@bsalamat bsalamat force-pushed the fix_taint_nodes branch 3 times, most recently from 77107aa to f2b0959 Compare January 22, 2019 23:31
@resouer
Copy link
Contributor

resouer commented Jan 23, 2019

Seems still some golints errors to fix. :-)

@bsalamat
Copy link
Member Author

The bigger problem is the fact that a NotReady taint is removed in a node update causing TestNodeAuthorizer to fail. I added an experimental commit to let a change of NotReady taint to be accepted, but I don't feel good about this change.
@liggitt what would be the recommended way of addressing this issue?

@liggitt
Copy link
Member

liggitt commented Jan 24, 2019

@liggitt While I agree with you, it is not the case and I don't think we should change that here. We need to cherrypick this to older releases. I don't want to make a risky change to our tests in older releases.

then the safest alternative would seem to be to disable this admission plugin specifically for the node-e2e test, if it wants to run in a partial environment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2019
@yujuhong
Copy link
Contributor

I'd expect the node-e2e to run in a relatively realistic environment w.r.t. the controllers that interact with nodes and pods on nodes. That seems useful for flushing out unexpected interactions.

Node e2e was introduced mainly to test kubelet and things below that in different environments (e.g., various os images). The apiserver was added simply to make interacting with kubelet easier. The node e2e suite helps us identify node failures easily (and also allows us to change the kubelet configuration easily to test different features on the node).

The interaction between kubelet and controllers are tested in cluster e2e. It may be useful to have a simpler environment to test control plane on the master with nodes, but that wasn't what node e2e was created for.

then the safest alternative would seem to be to disable this admission plugin specifically for the node-e2e test, if it wants to run in a partial environment

+1 to disable this.

@bsalamat
Copy link
Member Author

Thanks, @yujuhong!

I just pushed a commit to disable the new admission plugin in node e2e tests.

Copy link
Contributor

@resouer resouer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

let's see how CI says then :-)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2019
@ravisantoshgudimetla
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2019
@resouer
Copy link
Contributor

resouer commented Jan 25, 2019

/lgtm
/hold cancel

Thanks ALL working on this issue!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2019
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

}

func addNotReadyTaint(node *api.Node) {
notReadyTaint := api.Taint{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this can be declared as singleton outside the func

@k8s-ci-robot
Copy link
Contributor

@bsalamat: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-node-e2e 9b780d3537bf488fe895f65ec0a2239949b322a2 link /test pull-kubernetes-node-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition between setting node taints and scheduling

8 participants