Conversation

@Huang-Wei
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a node is deleted and genericScheduler.cachedNodeInfoMap hasn't been re-populated yet (that population only happens inside Schedule()/Preempt()), invoking nodeInfo.Clone() on a "stripped" NodeInfo will panic.

"stripped" means nodeInfo (the pointer) in schedulerCache side is taken off:

func (cache *schedulerCache) RemoveNode(node *v1.Node) error {

and most of its contents are set to nil:

func (n *NodeInfo) RemoveNode(node *v1.Node) error {
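To make the failure mode concrete, here is a minimal, self-contained sketch (simplified stand-in types, not the actual scheduler source) of why cloning such a stripped NodeInfo panics once its resource pointers have been nil'ed out:

package main

// Simplified stand-ins for the scheduler cache types, for illustration only.
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// Clone dereferences its receiver, so it panics when called on a nil *Resource.
func (r *Resource) Clone() *Resource {
	return &Resource{MilliCPU: r.MilliCPU, Memory: r.Memory}
}

type NodeInfo struct {
	requestedResource *Resource
}

// Clone mirrors the pattern `requestedResource: n.requestedResource.Clone()`.
func (n *NodeInfo) Clone() *NodeInfo {
	return &NodeInfo{requestedResource: n.requestedResource.Clone()}
}

func main() {
	stripped := &NodeInfo{} // requestedResource is nil, as on a "stripped" entry
	_ = stripped.Clone()    // panics: invalid memory address or nil pointer dereference
}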

Which issue(s) this PR fixes:

Fixes #70450.

Special notes for your reviewer:

Generally speaking, the root cause is that internally we don't handle the cache population properly. schedulerCache watches Node Add/Update/Delete events and updates its nodeInfoMap immediately, but a scheduling cycle only calls snapshot() inside Schedule()/Preempt().

An ideal solution would be to (1) do the cache "population" in real time, or (2) reconsider whether we really need two nodeInfoMaps. We can revisit this in our next refactoring iteration. cc/ @misterikkit .

This issue is likely to occur frequently in the autoscaler case, where nodes are added and deleted frequently.

Does this PR introduce a user-facing change?:

Fix a scheduler panic due to internal cache inconsistency

/sig scheduling
/assign @bsalamat

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 15, 2018
@Huang-Wei
Member Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 15, 2018
@Huang-Wei
Member Author

/retest

Member

Which line of Clone do we think is causing the panic here? Looking at the nodeInfo.Clone() method body, I couldn't immediately spot which copy would panic on a stripped node, unless nodeInfo itself was nil.

Member Author

@justinsb from the stack trace, it's:

requestedResource: n.requestedResource.Clone(),

but honestly I can't find any code that sets requestedResource to nil or builds a NodeInfo without initializing requestedResource.

And from the trace, it seems nodeInfo itself is not nil.

Member

@Huang-Wei I think the fix is to change NodeInfo.Clone() code to check for nil before calling Clone() on requestedResource, nonzeroRequest, and allocatableResource.
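A rough sketch of that direction - reusing the simplified NodeInfo/Resource stand-in types sketched earlier in this thread, and not the actual patch - would guard each resource field before cloning it:

// Hypothetical defensive Clone that tolerates nil resource fields.
func (n *NodeInfo) Clone() *NodeInfo {
	clone := &NodeInfo{}
	if n.requestedResource != nil {
		clone.requestedResource = n.requestedResource.Clone()
	}
	// In the real NodeInfo, nonzeroRequest and allocatableResource would get
	// the same nil guard, and the remaining fields would be copied as before.
	return clone
}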

Member Author

Yes, we can do that - that's a safer way.

My point about checking whether NodeInfo.node is nil was: if "nodeInfo.node == nil" is correlated with "nodeInfo.requestedResource == nil" (we haven't been able to confirm that yet), then we could return directly, since doing calculations on a NodeInfo copy with node == nil sounds error-prone - we might just see other panics.

As of now, I think we can make both changes: (1) check whether NodeInfo.node is nil, and (2) check whether requestedResource/nonzeroRequest/allocatableResource are nil. Thoughts?

Member

I have no reason to believe that checking whether NodeInfo.node is nil helps with mitigating the crash. I would limit this to checking the resources. Alternatively, we could dig to find where we initialize a NodeInfo object without initializing the "resources".

Member

How does it check both? Besides, I am not sure if we should actually check nodeInfo.node == nil.

Member Author

func (n *NodeInfo) Node() *v1.Node {
	if n == nil {
		return nil
	}
	return n.node
}

nodeInfo.node == nil may not actually happen, since nodeInfo here is a cloned one; setting nodeInfo.node to nil can only happen in schedulerCache when it receives a RemoveNode() event, and that won't affect this clone.

So @bsalamat @ravisantoshgudimetla, I can change it to only check whether nodeInfo == nil, if that seems more reasonable to you.

Member

@bsalamat bsalamat Nov 16, 2018

I see your point Wei. You are right, but I still feel checking nodeInfo == nil is more readable for those who may be confused by nodeInfo.node == nil like myself.

Member Author

sure, will update.

Contributor

Yeah, at first I found it confusing too, but after looking at the code for a bit, I understood it.

@justinsb
Member

justinsb commented Nov 16, 2018

So a question from trying to understand:

I don't see locking around cachedNodeInfoMap (and checkNode is spawned in a goroutine here, and I believe nodeNameToInfo is == cachedNodeInfoMap).

So is it possible that we concurrently deleted the node, and thus it is nil?

Technically we should be reading the map with a lock, because concurrent map access is unsafe. However, given the likely perf impact and the slow rate of change of Node objects, my personal vote would be that (if I am correct that we should be using a lock) we not address it until 1.14. But shouldn't we handle the case where the node is not in nodeNameToInfo?

(There are likely other ways also: we could also use a copy-on-write map, or maybe concurrent reading is safe if we are only writing from a thread that holds a write lock and we never change the set of keys etc)
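For readers unfamiliar with the locking pattern being discussed, a generic sync.RWMutex-guarded map looks roughly like the following. This is illustrative only - the type and method names are made up, and this is not the scheduler's actual cache code:

package main

import (
	"fmt"
	"sync"
)

type nodeEntry struct{ name string }

// guardedMap illustrates "read lock for readers, write lock for writers".
type guardedMap struct {
	mu    sync.RWMutex
	nodes map[string]*nodeEntry
}

func (g *guardedMap) get(name string) (*nodeEntry, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	e, ok := g.nodes[name]
	return e, ok
}

func (g *guardedMap) set(name string, e *nodeEntry) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.nodes == nil {
		g.nodes = make(map[string]*nodeEntry)
	}
	g.nodes[name] = e
}

func main() {
	m := &guardedMap{}
	m.set("node-1", &nodeEntry{name: "node-1"})
	if e, ok := m.get("node-1"); ok {
		fmt.Println(e.name)
	}
}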

@Huang-Wei
Member Author

So is it possible that we concurrently deleted the node, and thus it is nil?

That's not very likely. But it reminds me of another possibility: when a new node is added, cachedNodeInfoMap may not have that entry yet. Why? Let me explain:

As I mentioned in the description of this PR, schedulerCache is always up-to-date because it uses informers to watch changes to pods/nodes/etc. But the sync from schedulerCache to cachedNodeInfoMap is not real-time - it happens in snapshot(), which is called by Schedule() and Preempt().

The issue occurs in Preempt():

  • on L251, it calls snapshot() to populate the nodeInfoMap - so far so good, since snapshot() does a bi-directional comparison and is safe

  • but on L259, it uses NodeLister to list all available nodes, and that list is up-to-date! In other words, between L251 and L259 it's possible that a new node was added to the cluster

allNodes, err := nodeLister.List()

  • and on L266, it calculates potentialNodes, which filters out nodes that failed in the previous scheduling run - and a new node certainly won't be filtered out

potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)

  • then it comes to where the panic happens: it calls selectNodesForPreemption(), in which checkNode() iterates over every potential node name and calls nodeNameToInfo[nodeName].Clone() directly (see the defensive sketch after the quoted code below):

nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, g.predicates,

checkNode := func(i int) {
	nodeName := potentialNodes[i].Name
	var metaCopy algorithm.PredicateMetadata
	if meta != nil {
		metaCopy = meta.ShallowCopy()
	}
	pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeNameToInfo[nodeName], predicates, queue, pdbs)
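A defensive sketch of that spot (continuing the quoted fragment above; not necessarily the exact change merged in this PR) would bail out when the cached map has no entry for a freshly added node:

checkNode := func(i int) {
	nodeName := potentialNodes[i].Name
	// Hypothetical guard: the cached map may not have caught up with a node
	// that was just added, so skip it rather than call methods on a nil *NodeInfo.
	if nodeNameToInfo[nodeName] == nil {
		return
	}
	var metaCopy algorithm.PredicateMetadata
	if meta != nil {
		metaCopy = meta.ShallowCopy()
	}
	pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeNameToInfo[nodeName], predicates, queue, pdbs)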

@Huang-Wei
Member Author

@ravisantoshgudimetla
Contributor

/milestone 1.13

@k8s-ci-robot
Contributor

@ravisantoshgudimetla: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.


In response to this:

/milestone 1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Huang-Wei Huang-Wei force-pushed the nodeinfo-clone-panic branch from 2068665 to 25e1f4c on November 16, 2018 20:07
@AishSundar
Contributor

/milestone v1.13

Adding to 1.13. But do note that code freeze is at 5pm today. If this is critical for 1.13 and needs to get in, please switch the priority to critical-urgent.

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 16, 2018
@Huang-Wei
Member Author

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 16, 2018
@Huang-Wei
Member Author

Huang-Wei commented Nov 16, 2018

I added a UT to simulate the panic and prove my analysis in #71063 (comment):

Running tool: /usr/local/bin/go test -timeout 30s k8s.io/kubernetes/pkg/scheduler/core -run ^TestSelectNodesForPreemption$ -v -count=1

=== RUN   TestSelectNodesForPreemption
=== RUN   TestSelectNodesForPreemption/a_pod_that_does_not_fit_on_any_machine
=== RUN   TestSelectNodesForPreemption/a_pod_that_fits_with_no_preemption
=== RUN   TestSelectNodesForPreemption/a_pod_that_fits_on_one_machine_with_no_preemption
E1116 12:10:41.760498   60306 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/asm_amd64.s:522
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/panic.go:513
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/panic.go:82
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/signal_unix.go:390
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/cache/node_info.go:437
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/core/generic_scheduler.go:990
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/core/generic_scheduler.go:909
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:65
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/asm_amd64.s:1333

and interestingly, it also points to the same line as the original stack trace :)

https://github.com/kubernetes/kubernetes/blob/25e1f4c9b7589f71a06aeb52a28b8cb92b783dc3/pkg/scheduler/cache/node_info.go#L437
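The underlying Go behavior the test exercises is easy to reproduce in isolation: looking up a missing key in a map of pointers yields nil, and a method that touches fields on that nil receiver panics. A standalone toy example (unrelated to the real scheduler types):

package main

import "fmt"

type resource struct{ milliCPU int64 }

type nodeInfo struct{ requested *resource }

// clone reads the receiver's fields, so calling it on a nil *nodeInfo panics.
func (n *nodeInfo) clone() *nodeInfo {
	return &nodeInfo{requested: &resource{milliCPU: n.requested.milliCPU}}
}

func main() {
	infos := map[string]*nodeInfo{
		"node-a": {requested: &resource{milliCPU: 100}},
	}
	fmt.Println(infos["node-a"].clone()) // fine
	// "node-b" was never added to the map, so the lookup returns nil and
	// clone() dies with "invalid memory address or nil pointer dereference".
	fmt.Println(infos["node-b"].clone())
}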

Contributor

@ravisantoshgudimetla ravisantoshgudimetla left a comment

Thanks @Huang-Wei for finding and fixing this bug.

/lgtm
/approve

/cc @bsalamat

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2018
@ravisantoshgudimetla
Contributor

@Huang-Wei - Can you please create backports to 1.12 and 1.11?

@Huang-Wei
Member Author

@randomvariable that's what I'm going to do :)

@Huang-Wei Huang-Wei force-pushed the nodeinfo-clone-panic branch from 25e1f4c to a86ba8b on November 16, 2018 21:02
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018
@ravisantoshgudimetla
Contributor

@bsalamat any other comments before I re-lgtm?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018
Member

@bsalamat bsalamat left a comment

/lgtm
/approve

Thanks, @Huang-Wei and @ravisantoshgudimetla!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei
Member Author

/retest

2 similar comments
@Huang-Wei
Member Author

/retest

@Huang-Wei
Member Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit 7e621cc into kubernetes:master Nov 17, 2018
@Huang-Wei Huang-Wei deleted the nodeinfo-clone-panic branch November 17, 2018 22:59
k8s-ci-robot added a commit that referenced this pull request Nov 28, 2018
…063-upstream-release-1.12

Automated cherry pick of #71063: fix a scheduler panic due to internal cache inconsistency
k8s-ci-robot added a commit that referenced this pull request Dec 3, 2018
…063-upstream-release-1.11

Automated cherry pick of #71063: fix a scheduler panic due to internal cache inconsistency