Conversation

@Huang-Wei
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a node is deleted and genericScheduler.cachedNodeInfoMap hasn't been re-populated yet (that population only happens inside Schedule()/Preempt()), invoking nodeInfo.Clone() on a "stripped" NodeInfo will panic.

"stripped" means nodeInfo (the pointer) in schedulerCache side is taken off:

func (cache *schedulerCache) RemoveNode(node *v1.Node) error {

and most of its contents are set to nil:

func (n *NodeInfo) RemoveNode(node *v1.Node) error {
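To make the failure mode concrete, here is a minimal, self-contained sketch (simplified stand-in types, not the actual scheduler source) of why cloning such a stripped NodeInfo panics once its resource pointers have been nil'ed out:

package main

// Simplified stand-ins for the scheduler cache types, for illustration only.
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// Clone dereferences its receiver, so it panics when called on a nil *Resource.
func (r *Resource) Clone() *Resource {
	return &Resource{MilliCPU: r.MilliCPU, Memory: r.Memory}
}

type NodeInfo struct {
	requestedResource *Resource
}

// Clone mirrors the pattern `requestedResource: n.requestedResource.Clone()`.
func (n *NodeInfo) Clone() *NodeInfo {
	return &NodeInfo{requestedResource: n.requestedResource.Clone()}
}

func main() {
	stripped := &NodeInfo{} // requestedResource is nil, as on a "stripped" entry
	_ = stripped.Clone()    // panics: invalid memory address or nil pointer dereference
}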

Which issue(s) this PR fixes:

Fixes #70450.

Special notes for your reviewer:

Generally speaking, the root cause is that internally we don't handle the cache population properly. schedulerCache watches Node Add/Update/Delete events and updates its nodeInfoMap immediately, but a scheduling cycle only calls snapshot() inside Schedule()/Preempt().

An ideal solution would be to (1) do the cache "population" in real time, or (2) reconsider whether we really need two nodeInfoMaps. We can revisit this in our next refactoring iteration. cc/ @misterikkit .

This issue is likely to occur frequently in the autoscaler case, where nodes are added and deleted frequently.

Does this PR introduce a user-facing change?:

Fix a scheduler panic due to internal cache inconsistency

/sig scheduling
/assign @bsalamat

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 15, 2018
@Huang-Wei
Member Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 15, 2018
@Huang-Wei
Member Author

/retest

Member

Which line of Clone do we think is causing the panic here? Looking at the nodeInfo.Clone() method body, I couldn't immediately spot which copy would panic on a stripped node, unless nodeInfo itself was nil.

Member Author

@justinsb from the stack trace, it's:

requestedResource: n.requestedResource.Clone(),

but honestly I can't find any code that sets requestedResource to nil or builds a NodeInfo without initializing requestedResource.

And from the trace, it seems nodeInfo itself is not nil.

Member

@Huang-Wei I think the fix is to change NodeInfo.Clone() code to check for nil before calling Clone() on requestedResource, nonzeroRequest, and allocatableResource.
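A rough sketch of that direction - reusing the simplified NodeInfo/Resource stand-in types sketched earlier in this thread, and not the actual patch - would guard each resource field before cloning it:

// Hypothetical defensive Clone that tolerates nil resource fields.
func (n *NodeInfo) Clone() *NodeInfo {
	clone := &NodeInfo{}
	if n.requestedResource != nil {
		clone.requestedResource = n.requestedResource.Clone()
	}
	// In the real NodeInfo, nonzeroRequest and allocatableResource would get
	// the same nil guard, and the remaining fields would be copied as before.
	return clone
}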

Member Author

Yes, we can do that - that's a safer way.

My point about checking whether NodeInfo.node is nil was: if "nodeInfo.node == nil" is correlated with "nodeInfo.requestedResource == nil" (we haven't been able to confirm that yet), then we could return directly, since doing calculations on a NodeInfo copy with node == nil sounds error-prone - we might just see other panics.

As of now, I think we can make both changes: (1) check whether NodeInfo.node is nil, and (2) check whether requestedResource/nonzeroRequest/allocatableResource are nil. Thoughts?

Member

I have no reason to believe that checking whether NodeInfo.node is nil helps with mitigating the crash. I would limit this to checking the resources. Alternatively, we could dig to find where we initialize a NodeInfo object without initializing the "resources".

Member

How does it check both? Besides, I am not sure if we should actually check nodeInfo.node == nil.

Member Author

func (n *NodeInfo) Node() *v1.Node {
	if n == nil {
		return nil
	}
	return n.node
}

nodeInfo.node == nil may not actually happen, since nodeInfo here is a cloned one; setting nodeInfo.node to nil can only happen in schedulerCache when it receives a RemoveNode() event, and that won't affect this clone.

So @bsalamat @ravisantoshgudimetla, I can change it to only check whether nodeInfo == nil, if that seems more reasonable to you.

Member

@bsalamat bsalamat Nov 16, 2018

I see your point Wei. You are right, but I still feel checking nodeInfo == nil is more readable for those who may be confused by nodeInfo.node == nil like myself.

Member Author

sure, will update.

Contributor

Yeah, at first I found it confusing too, but after looking at the code for a bit, I understood it.

@justinsb
Member

justinsb commented Nov 16, 2018

So a question from trying to understand:

I don't see locking around cachedNodeInfoMap (and checkNode is spawned in a goroutine here, and I believe nodeNameToInfo is == cachedNodeInfoMap).

So is it possible that we concurrently deleted the node, and thus it is nil?

Technically we should be reading the map with a lock, because concurrent map access is unsafe. However, given the likely perf impact and the slow rate of change of Node objects, my personal vote would be that (if I am correct that we should be using a lock) we not address it until 1.14. But shouldn't we handle the case where the node is not in nodeNameToInfo?

(There are likely other ways also: we could also use a copy-on-write map, or maybe concurrent reading is safe if we are only writing from a thread that holds a write lock and we never change the set of keys etc)
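For readers unfamiliar with the locking pattern being discussed, a generic sync.RWMutex-guarded map looks roughly like the following. This is illustrative only - the type and method names are made up, and this is not the scheduler's actual cache code:

package main

import (
	"fmt"
	"sync"
)

type nodeEntry struct{ name string }

// guardedMap illustrates "read lock for readers, write lock for writers".
type guardedMap struct {
	mu    sync.RWMutex
	nodes map[string]*nodeEntry
}

func (g *guardedMap) get(name string) (*nodeEntry, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	e, ok := g.nodes[name]
	return e, ok
}

func (g *guardedMap) set(name string, e *nodeEntry) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.nodes == nil {
		g.nodes = make(map[string]*nodeEntry)
	}
	g.nodes[name] = e
}

func main() {
	m := &guardedMap{}
	m.set("node-1", &nodeEntry{name: "node-1"})
	if e, ok := m.get("node-1"); ok {
		fmt.Println(e.name)
	}
}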

@Huang-Wei
Member Author

So is it possible that we concurrently deleted the node, and thus it is nil?

That's not very likely. But it reminds me of another possibility: when a new node is added, cachedNodeInfoMap may not have that entry yet. Why? Let me explain:

As I mentioned in the description of this PR, schedulerCache is always up-to-date because it uses informers to watch changes to pods/nodes/etc. But the sync from schedulerCache to cachedNodeInfoMap is not real-time - it happens in snapshot(), which is called by Schedule() and Preempt().

The issue occurs in Preempt():

  • on L251, it calls snapshot() to populate the nodeInfoMap - so far so good, since snapshot() does a bi-directional comparison and is safe

  • but on L259, it uses NodeLister to list all available nodes, and that list is up-to-date! In other words, between L251 and L259 it's possible that a new node was added to the cluster

allNodes, err := nodeLister.List()

  • and on L266, it calculates potentialNodes, which filters out nodes that failed in the previous scheduling run - and a new node certainly won't be filtered out

potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)

  • then it comes to where the panic happens: it calls selectNodesForPreemption(), in which checkNode() iterates over every potential node name and calls nodeNameToInfo[nodeName].Clone() directly (see the defensive sketch after the quoted code below):

nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, g.predicates,

checkNode := func(i int) {
	nodeName := potentialNodes[i].Name
	var metaCopy algorithm.PredicateMetadata
	if meta != nil {
		metaCopy = meta.ShallowCopy()
	}
	pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeNameToInfo[nodeName], predicates, queue, pdbs)
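A defensive sketch of that spot (continuing the quoted fragment above; not necessarily the exact change merged in this PR) would bail out when the cached map has no entry for a freshly added node:

checkNode := func(i int) {
	nodeName := potentialNodes[i].Name
	// Hypothetical guard: the cached map may not have caught up with a node
	// that was just added, so skip it rather than call methods on a nil *NodeInfo.
	if nodeNameToInfo[nodeName] == nil {
		return
	}
	var metaCopy algorithm.PredicateMetadata
	if meta != nil {
		metaCopy = meta.ShallowCopy()
	}
	pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeNameToInfo[nodeName], predicates, queue, pdbs)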

@Huang-Wei
Member Author

@ravisantoshgudimetla
Contributor

/milestone 1.13

@k8s-ci-robot
Contributor

@ravisantoshgudimetla: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.


In response to this:

/milestone 1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Huang-Wei Huang-Wei force-pushed the nodeinfo-clone-panic branch from 2068665 to 25e1f4c on November 16, 2018 20:07
@AishSundar
Contributor

/milestone v1.13

Adding to 1.13. But do note that code freeze is at 5pm today. If this is critical for 1.13 and needs to get in, please switch the priority to critical-urgent.

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 16, 2018
@Huang-Wei
Member Author

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 16, 2018
@Huang-Wei
Member Author

Huang-Wei commented Nov 16, 2018

I added a UT to simulate the panic and prove my analysis in #71063 (comment):

Running tool: /usr/local/bin/go test -timeout 30s k8s.io/kubernetes/pkg/scheduler/core -run ^TestSelectNodesForPreemption$ -v -count=1

=== RUN   TestSelectNodesForPreemption
=== RUN   TestSelectNodesForPreemption/a_pod_that_does_not_fit_on_any_machine
=== RUN   TestSelectNodesForPreemption/a_pod_that_fits_with_no_preemption
=== RUN   TestSelectNodesForPreemption/a_pod_that_fits_on_one_machine_with_no_preemption
E1116 12:10:41.760498   60306 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/asm_amd64.s:522
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/panic.go:513
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/panic.go:82
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/signal_unix.go:390
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/cache/node_info.go:437
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/core/generic_scheduler.go:990
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/core/generic_scheduler.go:909
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:65
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/asm_amd64.s:1333

and interestingly, it also points to the same line as the original stack trace :)

https://github.com/kubernetes/kubernetes/blob/25e1f4c9b7589f71a06aeb52a28b8cb92b783dc3/pkg/scheduler/cache/node_info.go#L437
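The underlying Go behavior the test exercises is easy to reproduce in isolation: looking up a missing key in a map of pointers yields nil, and a method that touches fields on that nil receiver panics. A standalone toy example (unrelated to the real scheduler types):

package main

import "fmt"

type resource struct{ milliCPU int64 }

type nodeInfo struct{ requested *resource }

// clone reads the receiver's fields, so calling it on a nil *nodeInfo panics.
func (n *nodeInfo) clone() *nodeInfo {
	return &nodeInfo{requested: &resource{milliCPU: n.requested.milliCPU}}
}

func main() {
	infos := map[string]*nodeInfo{
		"node-a": {requested: &resource{milliCPU: 100}},
	}
	fmt.Println(infos["node-a"].clone()) // fine
	// "node-b" was never added to the map, so the lookup returns nil and
	// clone() dies with "invalid memory address or nil pointer dereference".
	fmt.Println(infos["node-b"].clone())
}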

Contributor

@ravisantoshgudimetla ravisantoshgudimetla left a comment

Thanks @Huang-Wei for finding and fixing this bug.

/lgtm
/approve

/cc @bsalamat

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2018
@ravisantoshgudimetla
Contributor

@Huang-Wei - Can you please create backports to 1.12 and 1.11?

@Huang-Wei
Member Author

@randomvariable that's what I'm going to do :)

@Huang-Wei Huang-Wei force-pushed the nodeinfo-clone-panic branch from 25e1f4c to a86ba8b on November 16, 2018 21:02
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018
@ravisantoshgudimetla
Contributor

@bsalamat any other comments before I re-lgtm?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2018
Member

@bsalamat bsalamat left a comment

/lgtm
/approve

Thanks, @Huang-Wei and @ravisantoshgudimetla!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Huang-Wei
Member Author

/retest

2 similar comments
@Huang-Wei
Member Author

/retest

@Huang-Wei
Member Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit 7e621cc into kubernetes:master Nov 17, 2018
@Huang-Wei Huang-Wei deleted the nodeinfo-clone-panic branch November 17, 2018 22:59
k8s-ci-robot added a commit that referenced this pull request Nov 28, 2018
…063-upstream-release-1.12

Automated cherry pick of #71063: fix a scheduler panic due to internal cache inconsistency
k8s-ci-robot added a commit that referenced this pull request Dec 3, 2018
…063-upstream-release-1.11

Automated cherry pick of #71063: fix a scheduler panic due to internal cache inconsistency