-
Notifications
You must be signed in to change notification settings - Fork 42.1k
Deep copying EndpointSlices in reconciler before modifying them #85368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deep copying EndpointSlices in reconciler before modifying them #85368
Conversation
|
picked into #85350 for testing with the cache mutation detector on |
|
still seeing crashes with this change: |
|
annotations were still changed: as were endpoints and ports |
|
I think reconcileByPortMapping can append to existing Endpoints arrays, which could have belonged to a pre-existing shared endpointslice: basically, any time we're assigning or appending to anything, we need to be sure we own it |
|
I'd recommend adding a mini cache-mutation detector step in the endpointslice unit tests that you can snapshot all the endpointslices in the store before each sync, then check if they've been mutated after running a sync |
5f46d26 to
8d467a3
Compare
8d467a3 to
a8c81ea
Compare
|
I'm running #85350 with the commit that disables endpoint slices now, fyi. If you want to pick in my three commits from that PR into this one to get mutation detection in e2e, that might be good. I do think you should add the checks to the unit tests, since those might hit scenarios the e2es would not. We also want to be careful we don't get too clever about being efficient and make this uninspectable. The more places we stick broken-apart pieces, then later append to them, the greater the chance of us missing something. |
|
@liggitt No idea how to write a cache-mutation detector, it does sound incredibly useful though, any examples of something like that? It would certainly be trivial to deep copy all EndpointSlices at the start of reconcile, so if it's important this gets in soon, I can always do that, but obviously that's not good for performance at scale. I've been focusing my efforts on anytime the list of slices to update is added to, as those slices will change. The last update I pushed catches 1 more place that results in an addition to the list of slices to update, but can't say for sure that's actually everything. Would love to get this into unit tests. |
|
simplest way would be to do something like https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/mutation_detector.go#L97-L132 (though you don't need a background goroutine in a deterministic unit test) can construct one, add all existing endpointslices in the fake client before sync, then call check after sync. |
a8c81ea to
57d3bfb
Compare
|
Hey @liggitt, thanks for the example implementation! I've added a version of that to the existing EndpointSlice controller tests. It was able to catch the problems I'd already caught in this PR, but did not find any additional issues. Would appreciate another round of review whenever you have some time. |
57d3bfb to
f81e9a4
Compare
|
/retest |
|
/test pull-kubernetes-e2e-gce-alpha-features |
|
/retest |
|
It is expected the performance tests will fail with the mutation cache checker force enabled. Now that the feature is disabled by default, you aren't exercising endpointslice in the e2e tests. I'd recommend dropping my three commits out of this PR at this point. |
ddb5aca to
7681b35
Compare
|
Thanks for clarifying, somehow I'd thought the alpha test suite I ran on this PR after adding your commits would cover that. Pulled those commits back out. |
|
what does our line/branch coverage look like from the unit tests? I still have a really hard time following the objects around the sync method's call graph to be sure there are no latent mutations. |
7681b35 to
4229b99
Compare
This PR adds tests to both |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: liggitt, robscott The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…pstream-release-1.17 Automated cherry pick of #85368: Deep copying EndpointSlices in reconciler before modifying
What type of PR is this?
/kind bug
What this PR does / why we need it:
This should ensure the EndpointSlice controller does not try to modify shared objects.
Which issue(s) this PR fixes:
Fixes #85364
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Enhancement Issue: kubernetes/enhancements#752
/sig network
/priority critical-urgent
/cc @liggitt @freehan