
Conversation

@cofyc
Member

@cofyc cofyc commented Aug 12, 2018

What this PR does / why we need it:

Use monotonically increasing generation to prevent equivalence cache race.

This happens because invalidating the predicate cache only clears the latest cached entries; in-flight predicate goroutines can still write stale results back into the cache afterwards. This can occur when events happen frequently:

| objectVersion | invalidation goroutine | predicate goroutine | cacheVersion | status |
| --- | --- | --- | --- | --- |
| v0 -> v1 | cache invalidated | begin to calculate result.v1 | cache.v0 | OK |
| v1 -> v2 | cache invalidated | begin to write result.v1 into cache | no cache | OK |
| v2 | | result.v1 written into cache | cache.v1 | cache mismatch |
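
A minimal Go sketch of the idea, with illustrative names only (eCacheEntry, cachedResult, invalidate, update are not the actual types this PR adds): every invalidation bumps a generation counter, each predicate run records the generation it started from, and a write whose recorded generation no longer matches the live one is dropped, so the stale write in the last row above never lands.

```go
package ecache

import "sync"

// Illustrative types only; the real equivalence cache in the scheduler
// differs in detail.
type cachedResult struct {
	fits bool
}

type eCacheEntry struct {
	mu         sync.Mutex
	generation uint64 // bumped on every invalidation
	result     *cachedResult
}

// invalidate clears the cached result and advances the generation so that any
// in-flight predicate run started before this point becomes stale.
func (e *eCacheEntry) invalidate() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.generation++
	e.result = nil
}

// snapshotGeneration is read before the predicate starts computing.
func (e *eCacheEntry) snapshotGeneration() uint64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.generation
}

// update stores a result only if no invalidation happened since the snapshot;
// the stale write in the last row of the table above would be dropped here.
func (e *eCacheEntry) update(r *cachedResult, snapshotted uint64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.generation != snapshotted {
		return // stale result from an in-flight goroutine
	}
	e.result = r
}
```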

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #67260

Special notes for your reviewer:

Release note:

Use monotonically increasing generation to prevent scheduler equivalence cache race.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Aug 12, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 12, 2018
@cofyc
Member Author

cofyc commented Aug 12, 2018

@kubernetes/sig-scheduling-pr-reviews
/assign @msau42

@cofyc cofyc force-pushed the fix67260 branch 2 times, most recently from 0a41fc2 to 3bbf639 Compare August 12, 2018 04:31
@msau42
Member

msau42 commented Aug 13, 2018

/assign @resouer @misterikkit

Member

@yastij yastij left a comment


Just a nit

Member

Rename to receivedInvalidationRequests

Member Author

Fixed.


It seems like anybody who does a lookup will clear the flag for that predicate, causing the next updater to assume that the cache has not been invalidated.

Member Author

@cofyc cofyc Aug 14, 2018

Yes, in theory the flag should be per cache result (per predicate, per node, and per pod),
but that would be too complex, and the current per-predicate, per-node flag does not have this problem: the scheduler runs predicates for nodes in parallel but processes pods sequentially, so while a goroutine for a given predicate on a given node is running, no other goroutine can look up the result for that predicate on that node.
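
For illustration, a minimal sketch of that concurrency model (scheduleAll and runPredicates are made-up names, not the scheduler's real code):

```go
package sketch

import "sync"

// scheduleAll illustrates the claim above: because the outer loop over pods is
// sequential and we wait for all per-node goroutines before moving on, at most
// one goroutine is ever working on a given (predicate, node) cache entry.
func scheduleAll(pods, nodes []string, runPredicates func(pod, node string)) {
	for _, pod := range pods { // pods: strictly sequential
		var wg sync.WaitGroup
		for _, node := range nodes { // nodes: evaluated in parallel for this pod
			wg.Add(1)
			go func(p, n string) {
				defer wg.Done()
				runPredicates(p, n) // sole goroutine for (predicate, n) right now
			}(pod, node)
		}
		wg.Wait() // the next pod starts only after every node has finished
	}
}
```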

@misterikkit

misterikkit commented Aug 13, 2018

Hi, have you confirmed that this error can happen? Bonus points if you can add a test that triggers the error.

The reason I ask is that we tried to address this specific problem in #63040. Perhaps the flaky test is due to an invalidation event that we aren't observing properly?

/cc @bsalamat

@k8s-ci-robot k8s-ci-robot requested a review from bsalamat August 13, 2018 21:59
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 14, 2018
@cofyc
Member Author

cofyc commented Aug 14, 2018

@misterikkit Added an integration test in e3f6c61d10342cd218f8741ea403bc509c0a98e2.
The test fails without this fix.

@cofyc
Member Author

cofyc commented Aug 14, 2018

/test pull-kubernetes-e2e-kops-aws

1 similar comment
@resouer
Contributor

resouer commented Aug 14, 2018

/test pull-kubernetes-e2e-kops-aws

Contributor

lookupResult is on the critical path during scheduling, but this delete operation requires mu.Lock() instead of mu.RLock().

We may need to evaluate the performance impact here.
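
For context, a small sketch of the locking trade-off being pointed out here (resultCache and its methods are simplified stand-ins, not the PR's actual code): many readers can share RLock, but a delete on the lookup path forces the exclusive Lock.

```go
package sketch

import "sync"

// Simplified stand-in for a result cache guarded by a sync.RWMutex.
type resultCache struct {
	mu sync.RWMutex
	m  map[string]bool
}

// lookup with a shared lock: many scheduling goroutines can run this at once.
func (c *resultCache) lookup(key string) (bool, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[key]
	return v, ok
}

// lookupAndClear must take the exclusive lock because delete mutates the map,
// which is the cost being asked about on the scheduling hot path.
func (c *resultCache) lookupAndClear(key string) (bool, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[key]
	delete(c.m, key)
	return v, ok
}
```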

Member Author

@cofyc cofyc Aug 18, 2018

With the new solution misterikkit suggested, the RW lock is no longer required.
It's not correct to read the generation in lookupResult, because invalidations may happen after the scheduler cache is snapshotted and before lookupResult runs.

@misterikkit

Just to add some more context, I believe you when you say this is a problem, but I don't believe this PR is the right fix. We tried to fix this exact problem in #63040, so my assumption is that something about that fix needs to be corrected.

@misterikkit

Just to add some more context, I believe you when you say this is a problem, but I don't believe this PR is the right fix.

Time to eat my words. 😄 As mentioned in #67260, this is an old bug whose fix was not 100% correct.

To merge this PR, I would like it to replace the incorrect fix from #63040. I also have some other comments on the code changes.

@misterikkit misterikkit left a comment

As I mentioned in other comments, I would like this fix to replace the fix for #62921.

In particular, I think we should (a minimal sketch follows below):

  • add atomic generation numbers to the eCache
  • snapshot the current generation numbers before snapshotting the scheduler cache
  • compare generation number in snapshot to the live version before writing to eCache
  • "undo" the fix from #63040 and #63895

Sorry that is a lot more work than your initial approach. Let me know if you need help with it. I am also available to chat if you want to discuss the fix.
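
A minimal sketch of that flow, under simplified assumptions (nodeCache, updateIfCurrent, scheduleOne are illustrative names, not what this PR actually adds):

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

// Illustrative shapes only; the real types in the scheduler differ.
type nodeCache struct {
	mu         sync.Mutex
	generation int64 // bumped atomically on every invalidation
	results    map[string]bool
}

func newNodeCache() *nodeCache {
	return &nodeCache{results: map[string]bool{}}
}

// invalidate only bumps the generation; writers compare against it later.
func (n *nodeCache) invalidate() {
	atomic.AddInt64(&n.generation, 1)
}

// updateIfCurrent writes a result only if no invalidation happened since the
// snapshot, replacing the flag-based and IsUpToDate-style checks.
func (n *nodeCache) updateIfCurrent(predicate string, fits bool, snapshotted int64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if atomic.LoadInt64(&n.generation) != snapshotted {
		return // stale: an invalidation raced with this predicate run
	}
	n.results[predicate] = fits
}

// scheduleOne shows the proposed ordering: snapshot the eCache generation
// first, then snapshot the scheduler cache, then compute and conditionally
// write back.
func scheduleOne(n *nodeCache, snapshotSchedulerCache func(), runPredicate func() bool) {
	snapshotted := atomic.LoadInt64(&n.generation) // 1. snapshot eCache generation
	snapshotSchedulerCache()                       // 2. snapshot scheduler cache
	fits := runPredicate()                         // 3. compute from the snapshot
	n.updateIfCurrent("SomePredicate", fits, snapshotted) // 4. drop if stale
}
```

The key point is that step 1 happens before step 2, so any invalidation that races with the predicate run is guaranteed to change the generation compared against in step 4.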


I don't think a flag is sufficient, since lookupResult clears the flag, and I don't want us to make assumptions about when that may be called.

Instead, how about an int64 "generation number" which is atomically incremented?


Adding this check removes the need for the check on line 254, `if !cache.IsUpToDate(nodeInfo)`. I think this PR should delete that check, remove nodeInfo from the arguments, and remove all the plumbing that goes with it.

See #63040

Member Author

The `if !cache.IsUpToDate(nodeInfo)` check is removed now, but the nodeInfo argument is kept because it's used in other places.

@cofyc
Member Author

cofyc commented Aug 15, 2018

Thanks for clarification and the advice! I'll update this PR.

@bsalamat
Member

Thinking more about the solution that misterikkit outlined above last night, I am a bit worried that snapshotting the eCache might cause a noticeable slowdown. We should measure the performance impact of this solution.

@cofyc
Member Author

cofyc commented Sep 14, 2018

Any other concerns?

Member

@bsalamat bsalamat left a comment

Thanks, @cofyc and sorry for the delay. It looks generally good. I have several minor comments.
I haven't looked at the tests yet.

Member

We should add an else statement that logs an error when a predicate key is not found.

Member Author

added

Member

s/caches/cache/

Member Author

fixed

Member

If remove must happen after the invalidation, it deserves a better comment.

Member Author

@cofyc cofyc Sep 16, 2018

Sorry, I was wrong about this part. All equivalence cache invalidations should happen after resource objects are added, updated, or deleted, so that predicate functions run against the new versions of the resource objects. I'll revert this change.

By the way, there is no need to invalidate a node's predicates after the node is deleted, because no predicate functions will run on a deleted node (we iterate over the nodes in the scheduler cache); this is different from other resources, e.g. PVs.

Member

I agree with you that invalidation of the eCache should happen after CUD operations on objects. The following diagram shows a scenario where the reverse order causes an issue: if invalidation happens first, we could snapshot the eCache after the invalidation and then snapshot the scheduler cache itself; a node is then removed/added/updated before we update the eCache, so we would update the eCache with stale information based on a snapshot of the cache taken before the node removal/addition/update:

------|-----------------|------------------|----------------|---------------|-------->
Invalidate eCache  Snapshot eCache   Snapshot Cache     Remove Node    Update eCache    

I think you should add this to the comment to document why the order is important.
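
For illustration, a tiny sketch of the agreed ordering (the interfaces and onNodeUpdate below are simplified stand-ins, not the scheduler's real event handlers):

```go
package sketch

type nodeObj struct{ Name string }

// Simplified stand-ins for the scheduler cache and the equivalence cache.
type schedulerCache interface {
	UpdateNode(oldNode, newNode *nodeObj) error
}

type equivCache interface {
	InvalidatePredicatesOnNode(nodeName string)
}

// onNodeUpdate applies the object change before invalidating, so any predicate
// that runs after the invalidation is guaranteed to see the new node state;
// invalidating first reopens the window shown in the timeline above.
func onNodeUpdate(sc schedulerCache, ec equivCache, oldNode, newNode *nodeObj) error {
	if err := sc.UpdateNode(oldNode, newNode); err != nil { // 1. CUD on scheduler cache
		return err
	}
	ec.InvalidatePredicatesOnNode(newNode.Name) // 2. then invalidate eCache entries
	return nil
}
```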

Member Author

@cofyc cofyc Sep 17, 2018

Added in 2a05122. This simply updates the comments from PR https://github.com/kubernetes/kubernetes/pull/63895/files to reflect the latest code.

There is no need to revert it now; I'll rebase when the PR is done.

@cofyc cofyc force-pushed the fix67260 branch 2 times, most recently from 73b4434 to cc3b6d9 Compare September 16, 2018 13:44
@cofyc
Member Author

cofyc commented Sep 16, 2018

@bsalamat Thanks, I pushed a new commit: cc3b6d9.

Contributor

Since we init the NodeCache here, it seems we don't need to call g.equivalenceCache.GetNodeCache(nodeName) in generic_scheduler.

Instead, we could call something like g.equivalenceCache.LoadNodeCache() there, which would be a read-only operation. WDYT?
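
For illustration, a sketch of the split being suggested (the Cache/NodeCache shapes and the LoadNodeCache signature are assumptions for the example, not necessarily what the final code looks like): a create-if-missing accessor for the update path and a read-only accessor for the scheduling hot path.

```go
package sketch

import "sync"

// NodeCache would hold per-node cached predicate results.
type NodeCache struct{}

// Cache is a simplified stand-in for the equivalence cache.
type Cache struct {
	mu         sync.RWMutex
	nodeCaches map[string]*NodeCache
}

func NewCache() *Cache {
	return &Cache{nodeCaches: map[string]*NodeCache{}}
}

// GetNodeCache creates the per-node cache if it does not exist yet, so it
// needs the exclusive lock.
func (c *Cache) GetNodeCache(name string) *NodeCache {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nodeCaches[name]
	if !ok {
		n = &NodeCache{}
		c.nodeCaches[name] = n
	}
	return n
}

// LoadNodeCache is the read-only variant for generic_scheduler: it never
// creates, so a shared RLock is enough on the hot path.
func (c *Cache) LoadNodeCache(name string) (*NodeCache, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	n, ok := c.nodeCaches[name]
	return n, ok
}
```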

Member Author

@cofyc cofyc Sep 17, 2018

I think it's a good idea! Please check 8ece5d6.

Contributor

@resouer resouer left a comment

Just added several review comments; the rest looks good. Please take a look!

Contributor

stores a *NodeCache

Member Author

fixed

Contributor

It would be better to include comments for the last three members, to make other developers' lives easier.

Member Author

done

@cofyc cofyc force-pushed the fix67260 branch 2 times, most recently from 6f266d7 to 8ece5d6 Compare September 17, 2018 11:46
Member

I don't think this comment applies here.

Member Author

After assuming a pod into the scheduler cache, the scheduler invalidates the affected predicates in the equivalence cache.
https://github.com/kubernetes/kubernetes/blob/8ece5d624ded4815655104d8a12712cfe6d8743e/pkg/scheduler/scheduler.go#L355-L363

Member

@bsalamat bsalamat left a comment

Thanks, @cofyc!

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 21, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2018
@cofyc
Member Author

cofyc commented Sep 22, 2018

Hi @bsalamat,
The old PR contained some temporary commits; I've squashed them into fewer commits now (the changes are the same). Could you please LGTM again? Thanks!

Yecheng Fu added 2 commits September 22, 2018 12:08
- snapshot equivalence cache generation numbers before snapshotting the scheduler cache
- skip update when generation does not match live generation
- keep the node and increment its generation to invalidate it instead of deletion
- use predicates order ID as key to improve performance
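
A rough sketch of the last bullet, with an illustrative ID table and types (not the scheduler's actual predicate registration): indexing a slice by a small integer order ID avoids hashing the predicate name on every lookup.

```go
package sketch

// predicateID is a small integer assigned to each registered predicate; the
// mapping below is illustrative only.
type predicateID int

var predicateIDs = map[string]predicateID{
	"PodFitsResources":  0,
	"MatchNodeSelector": 1,
	// ...one entry per registered predicate, assigned at startup
}

// nodeResults caches one result slot per predicate, indexed by order ID.
type nodeResults struct {
	results []*bool // nil slot means "no cached result"
}

func newNodeResults() *nodeResults {
	return &nodeResults{results: make([]*bool, len(predicateIDs))}
}

func (n *nodeResults) store(p predicateID, fits bool) {
	n.results[p] = &fits
}

func (n *nodeResults) load(p predicateID) (fits, ok bool) {
	if r := n.results[p]; r != nil {
		return *r, true
	}
	return false, false
}
```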
@cofyc
Member Author

cofyc commented Sep 22, 2018

/retest

Member

@bsalamat bsalamat left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, cofyc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 28d86ac into kubernetes:master Sep 25, 2018
@cofyc cofyc deleted the fix67260 branch May 4, 2019 07:19

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

TestVolumeBindingRescheduling out of sync; retrying flakes

8 participants