fix a scheduler panic due to internal cache inconsistency #71063
Conversation
/priority important-soon

/retest
Which line of Clone do we think is causing the panic here? Looking at the nodeInfo.Clone() method body, I couldn't immediately spot which copy would panic on a stripped node, unless nodeInfo itself was nil.
@justinsb from the stack trace, it's:
kubernetes/pkg/scheduler/cache/node_info.go
Line 437 in 076adbf
requestedResource: n.requestedResource.Clone(),
but honestly I can't find any code related to "setting requestedResource to nil" or "building a NodeInfo without initializing requestedResource".
And from the trace, it seems nodeInfo itself is not nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Huang-Wei I think the fix is to change NodeInfo.Clone() code to check for nil before calling Clone() on requestedResource, nonzeroRequest, and allocatableResource.
Yes, we can do that - that's a safer way.
My point about checking whether NodeInfo.node is nil was: if "nodeInfo.node == nil" is related to "nodeInfo.requestedResource == nil" (we haven't been able to confirm that yet), then we can return early, since running calculations on a NodeInfo copy with node == nil sounds error prone, and we might see other panics.
As of now, I think we can make both changes: (1) check whether NodeInfo.node is nil, and (2) check whether requestedResource/nonzeroRequest/allocatableResource is nil. Thoughts?
I have no reason to believe that checking whether NodeInfo.node is nil helps mitigate the crash. I would limit this to checking the resources. Alternatively, we could dig into where we initialize a NodeInfo object without initializing the resources.
How does it check both? Besides, I am not sure if we should actually check nodeInfo.node == nil.
func (n *NodeInfo) Node() *v1.Node {
	if n == nil {
		return nil
	}
	return n.node
}

nodeInfo.node == nil may not happen, as nodeInfo here is a cloned one; setting nodeInfo.node to nil can only happen in schedulerCache when it receives a RemoveNode() event, and that won't affect this cloned one.
So @bsalamat @ravisantoshgudimetla I can change it to check only whether nodeInfo == nil, if that seems more reasonable to you.
I see your point, Wei. You are right, but I still feel checking nodeInfo == nil is more readable for those who, like me, may be confused by nodeInfo.node == nil.
sure, will update.
Yeah, at first I found it confusing too, but after looking at the code for a bit, I understood it.
So a question from trying to understand: I don't see locking around cachedNodeInfoMap (and checkNode is spawned in a goroutine here, and I believe nodeNameToInfo == cachedNodeInfoMap). So is it possible that we concurrently deleted the node, and thus it is nil? Technically we should be reading the map with a lock, because concurrent map access is unsafe. However, given the likely perf impact and the slow rate of change of Node objects, my personal vote would be (if I am correct that we should be using a lock) that we not address it until 1.14. But we should handle the case when the node is not in the map. (There are likely other ways too: we could use a copy-on-write map, or maybe concurrent reading is safe if we only write from a thread that holds a write lock and never change the set of keys, etc.)
That's not really possible. But it reminds me of another possibility: when a new node is added, cachedNodeInfoMap may not have that entry yet. Why? As I mentioned in the description of this PR, schedulerCache is always up to date, as it uses informers to watch changes to pods/nodes/etc. But the sync from schedulerCache to cachedNodeInfoMap is not realtime - it happens in snapshot(), called by Schedule() and Preempt(). The issue occurs in Preempt():
kubernetes/pkg/scheduler/core/generic_scheduler.go Lines 904 to 910 in d0c3cd1
/milestone 1.13
@ravisantoshgudimetla: You must be a member of the kubernetes/kubernetes-milestone-maintainers GitHub team to set the milestone. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(force-pushed 2068665 to 25e1f4c)
/milestone v1.13 Adding to 1.13. But do note that code freeze is at 5pm today. If this is critical for 1.13 and needs to get in, please switch the priority to critical-urgent.

/priority critical-urgent
I added a UT to simulate the panic and prove my analysis in #71063 (comment):
Running tool: /usr/local/bin/go test -timeout 30s k8s.io/kubernetes/pkg/scheduler/core -run ^TestSelectNodesForPreemption$ -v -count=1
=== RUN TestSelectNodesForPreemption
=== RUN TestSelectNodesForPreemption/a_pod_that_does_not_fit_on_any_machine
=== RUN TestSelectNodesForPreemption/a_pod_that_fits_with_no_preemption
=== RUN TestSelectNodesForPreemption/a_pod_that_fits_on_one_machine_with_no_preemption
E1116 12:10:41.760498 60306 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/asm_amd64.s:522
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/panic.go:513
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/panic.go:82
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/signal_unix.go:390
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/cache/node_info.go:437
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/core/generic_scheduler.go:990
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/pkg/scheduler/core/generic_scheduler.go:909
/Users/wei.huang1/gospace/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:65
/usr/local/Cellar/go/1.11.1/libexec/src/runtime/asm_amd64.s:1333

and interestingly it's also pointing to the same line as the original stack trace :)
ravisantoshgudimetla
left a comment
@Huang-Wei - Can you please create backports to 1.12 and 1.11?
@randomvariable that's what I'm going to do :)
(force-pushed 25e1f4c to a86ba8b)
@bsalamat any other comments before I re-lgtm?
bsalamat
left a comment
/lgtm
/approve
Thanks, @Huang-Wei and @ravisantoshgudimetla!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bsalamat, Huang-Wei, ravisantoshgudimetla. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest

/retest

/retest
…063-upstream-release-1.12 Automated cherry pick of #71063: fix a scheduler panic due to internal cache inconsistency
…063-upstream-release-1.11 Automated cherry pick of #71063: fix a scheduler panic due to internal cache inconsistency
What type of PR is this?
/kind bug
What this PR does / why we need it:
When a node is deleted and genericScheduler.cachedNodeInfoMap hasn't been repopulated yet (that happens in snapshot(), called by Schedule()/Preempt()), invoking nodeInfo.Clone() on a "stripped" NodeInfo will panic.
"Stripped" means the nodeInfo pointer on the schedulerCache side has been taken off:
kubernetes/pkg/scheduler/internal/cache/cache.go
Line 437 in e3ddaaa
And most of its content is set to nil:
kubernetes/pkg/scheduler/cache/node_info.go
Line 633 in 843a67b
Which issue(s) this PR fixes:
Fixes #70450.
Special notes for your reviewer:
Generally speaking, the root cause is that internally we didn't handle the cache population properly. schedulerCache watches Node Add/Update/Delete events and updates its nodeInfoMap immediately, but a scheduling cycle only calls snapshot() during Schedule()/Preempt(). An ideal solution is to (1) do the cache population in real time, or (2) reconsider whether we really need two nodeInfoMaps. We can revisit this in our next refactoring iteration. cc/ @misterikkit.
This issue is likely to occur frequently in the autoscaler case, where node add/delete happens frequently.
Does this PR introduce a user-facing change?:
/sig scheduling
/assign @bsalamat