Use monotonically increasing generation to prevent equivalence cache race #67308
Conversation
@kubernetes/sig-scheduling-pr-reviews
(force-pushed from 0a41fc2 to 3bbf639)
/assign @resouer @misterikkit
Just a nit
Rename to receivedInvalidationRequests
Fixed.
It seems like anybody who does a lookup will clear the flag for that predicate, causing the next updater to assume that the cache has not been invalidated.
Yes, in theory the flag should be per cache result (per predicate, per node, and per pod), but that would be too complex, and the current per-predicate-per-node flag does not have this problem: the scheduler runs predicates for nodes in parallel but processes pods sequentially, so while a goroutine is running a predicate for a given node, no other goroutine can be looking up the result for that predicate on that node.
@misterikkit Added an integration test in e3f6c61d10342cd218f8741ea403bc509c0a98e2.
/test pull-kubernetes-e2e-kops-aws
lookupResult is in the critical path during scheduling, but this delete operation requires mu.Lock() instead of mu.RLock().
We may need to evaluate the performance impact here.
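To make the locking concern concrete, here is a toy sketch (not the scheduler's actual types; names are made up) contrasting a read-only lookup, which can hold the shared read lock, with a lookup that also clears an entry and therefore needs the exclusive write lock:

```go
package ecache

import "sync"

// Cache is a toy stand-in for the equivalence cache, guarded by an RW mutex.
type Cache struct {
	mu      sync.RWMutex
	results map[string]bool
}

// Lookup only reads the map, so a shared read lock suffices and concurrent
// scheduling goroutines do not block each other.
func (c *Cache) Lookup(key string) (bool, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	fits, ok := c.results[key]
	return fits, ok
}

// LookupAndClear also deletes the entry, which mutates the map, so it must
// take the exclusive write lock and serializes every caller on the hot path.
func (c *Cache) LookupAndClear(key string) (bool, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	fits, ok := c.results[key]
	delete(c.results, key)
	return fits, ok
}
```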
With the new solution misterikkit suggested, an RW lock is not required anymore.
It's not right to read the generation in lookupResult, because invalidations may happen after the scheduler cache is snapshotted and before lookupResult runs.
Just to add some more context: I believe you when you say this is a problem, but I don't believe this PR is the right fix. We tried to fix this exact problem in #63040, so my assumption is that something about that fix needs to be corrected.

Time to eat my words. 😄 To merge this PR, I would like it to replace the incorrect fix from #63040. I also have some other comments on the code changes.
misterikkit left a comment
As I mentioned in other comments, I would like this fix to replace the fix for #62921.
In particular, I think we should:
- add atomic generation numbers to the eCache
- snapshot the current generation numbers before snapshotting the scheduler cache
- compare generation number in snapshot to the live version before writing to eCache
- "undo" the fix from #63040 and #63895
Sorry that is a lot more work than your initial approach. Let me know if you need help with it. I am also available to chat if you want to discuss the fix.
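To make the approach concrete, here is a minimal sketch of the generation idea described above. The names (NodeCache, SnapshotGeneration, Invalidate, Update) are illustrative, not the actual scheduler types, and it uses a mutex instead of atomics for simplicity:

```go
package ecache

import "sync"

// NodeCache holds cached predicate results for one node together with a
// per-predicate generation counter (illustrative sketch only).
type NodeCache struct {
	mu          sync.Mutex
	generations map[string]uint64 // live generation per predicate key
	results     map[string]bool   // cached "fits" result per predicate key
}

func NewNodeCache() *NodeCache {
	return &NodeCache{
		generations: make(map[string]uint64),
		results:     make(map[string]bool),
	}
}

// SnapshotGeneration is called before snapshotting the scheduler cache; any
// invalidation after this point bumps the live generation past the returned
// value, so the eventual write is detected as stale and dropped.
func (c *NodeCache) SnapshotGeneration(predicateKey string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.generations[predicateKey]
}

// Invalidate bumps the generation so in-flight predicate goroutines cannot
// write stale results for this predicate.
func (c *NodeCache) Invalidate(predicateKey string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.generations[predicateKey]++
	delete(c.results, predicateKey)
}

// Update writes a result only if no invalidation happened since the
// generation was snapshotted.
func (c *NodeCache) Update(predicateKey string, fits bool, snapshotted uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.generations[predicateKey] != snapshotted {
		return // an invalidation raced with this predicate run; drop the stale result
	}
	c.results[predicateKey] = fits
}
```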
I don't think a flag is sufficient, since lookupResult clears the flag, and I don't want us to make assumptions about when that may be called.
Instead, how about an int64 "generation number" which is atomically incremented?
Adding this check removes the need for the check on line 254, if !cache.IsUpToDate(nodeInfo). I think this PR should delete that check, and then remove nodeInfo from the arguments, and all the plumbing that goes with that.
See #63040
The if !cache.IsUpToDate(nodeInfo) check is removed now, but the nodeInfo argument is kept because it's used in other places.
Thanks for the clarification and the advice! I'll update this PR.
Thinking more about the solution that misterikkit outlined above last night, I am a bit worried that snapshotting the ecache might cause a noticeable slowdown. We should measure the performance impact of this solution.
Any other concerns?
bsalamat left a comment
Thanks, @cofyc and sorry for the delay. It looks generally good. I have several minor comments.
I haven't looked at the tests yet.
we should add an else statement that logs an error when a predicate key is not found.
added
pkg/scheduler/factory/factory.go (Outdated)
s/caches/cache/
fixed
pkg/scheduler/factory/factory.go (Outdated)
If remove must happen after the invalidation, it deserves a better comment.
Sorry, I found I was wrong about this part. All equivalence cache invalidations should happen after resource objects are added, updated, or deleted, so that predicate functions run against the new versions of the resource objects. I'll revert this change.
By the way, there is no need to invalidate a node's predicates when the node itself is removed, because no predicate functions will run on a deleted node (we iterate over the nodes in the scheduler cache); this is different from other resources, e.g. PVs.
I agree with you that invalidation of the eCache should happen after CUD operations on objects. The following timeline shows a scenario where the reverse order causes an issue: if invalidation happens first, we could snapshot the eCache after the invalidation and then snapshot the cache itself. A node is then removed/added/updated before we update the eCache. We would then update the eCache with stale information, based on our latest snapshot of the cache taken before the node removal/addition/update:

Invalidate eCache → Snapshot eCache → Snapshot Cache → Remove Node → Update eCache

I think you should add this to the comment to document why the order is important.
Added in 2a05122. This simply updates the comments from PR https://github.com/kubernetes/kubernetes/pull/63895/files to reflect the latest code.
There is no need to revert it now; I'll rebase when the PR is done.
(force-pushed from 73b4434 to cc3b6d9)
pkg/scheduler/factory/factory.go (Outdated)
Since we will init the NodeCache here, it seems we don't need to call g.equivalenceCache.GetNodeCache(nodeName) in generic_scheduler?
Instead, we could call something like g.equivalenceCache.LoadNodeCache() there, a read-only operation. WDYT?
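A rough sketch of the distinction being suggested here (the real signatures in pkg/scheduler/core/equivalence may differ; this only illustrates "create if missing" versus "read-only"):

```go
package ecache

import "sync"

// NodeCache stands in for the per-node predicate cache; its contents are
// elided because only the lookup pattern matters for this point.
type NodeCache struct{}

// EquivCache maps node names to their NodeCache (illustrative only).
type EquivCache struct {
	mu    sync.RWMutex
	nodes map[string]*NodeCache
}

// GetNodeCache returns the node's cache, creating it if missing, so it needs
// the exclusive write lock. If node caches are initialized when nodes are
// added, the scheduling path no longer needs this call.
func (e *EquivCache) GetNodeCache(name string) *NodeCache {
	e.mu.Lock()
	defer e.mu.Unlock()
	nc, ok := e.nodes[name]
	if !ok {
		nc = &NodeCache{}
		e.nodes[name] = nc
	}
	return nc
}

// LoadNodeCache is the read-only variant suggested above: it only looks up an
// existing entry under the shared read lock.
func (e *EquivCache) LoadNodeCache(name string) (*NodeCache, bool) {
	e.mu.RLock()
	defer e.mu.RUnlock()
	nc, ok := e.nodes[name]
	return nc, ok
}
```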
I think it's a good idea! Please check 8ece5d6.
Just added several review comments; the rest looks good. Please take a look!
stores a *NodeCache
fixed
It would be better to include comments for all of the last three members, to make other developers' lives easier.
done
(force-pushed from 6f266d7 to 8ece5d6)
pkg/scheduler/scheduler.go (Outdated)
I don't think this comment applies here.
After assuming the pod into the scheduler cache, it will invalidate the affected predicates in the equivalence cache.
https://github.com/kubernetes/kubernetes/blob/8ece5d624ded4815655104d8a12712cfe6d8743e/pkg/scheduler/scheduler.go#L355-L363
bsalamat left a comment
Thanks, @cofyc!
/lgtm
This reverts commit 17d0190.
Hi, @bsalamat
- snapshot equivalence cache generation numbers before snapshotting the scheduler cache
- skip the update when the generation does not match the live generation
- keep the node and increment its generation to invalidate it instead of deleting it
- use the predicate order ID as the key to improve performance
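A hedged illustration of the last two bullets (made-up types; the real implementation lives in pkg/scheduler/core/equivalence and differs in detail): the node entry is kept and its generation bumped instead of being deleted, and cached results are indexed by the predicate's order ID rather than its name:

```go
package ecache

import "sync"

// predicateResult is a cached "fits" outcome for one predicate on one node.
type predicateResult struct {
	valid bool
	fits  bool
}

// nodeEntry keeps its cached results in a slice indexed by the predicate's
// order ID (a small integer assigned once per predicate), avoiding string
// hashing on the scheduling hot path.
type nodeEntry struct {
	generation uint64
	results    []predicateResult
}

// Cache maps node names to their entries.
type Cache struct {
	mu    sync.Mutex
	nodes map[string]*nodeEntry
}

// InvalidateNode keeps the node's entry but bumps its generation and clears
// its results, so any in-flight update that snapshotted the old generation is
// rejected, without deleting and re-creating the entry.
func (c *Cache) InvalidateNode(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.nodes[name]
	if !ok {
		return
	}
	e.generation++
	for i := range e.results {
		e.results[i].valid = false
	}
}
```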
/retest
bsalamat left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, cofyc. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest
What this PR does / why we need it:
Use a monotonically increasing generation to prevent an equivalence cache race: invalidating the predicate cache only invalidates the latest cache, but in-flight predicate goroutines may still write stale results into it. This can happen when events occur frequently.
Which issue(s) this PR fixes:
Fixes #67260
Special notes for your reviewer:
Release note: