[WIP] Add integration test showing possible ResourceQuota race #128899
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```go
// testResourceClaimForbidden attempts to create a ResourceClaim expecting 403 Forbidden due to resource quota limits being exceeded.
func testResourceClaimForbidden(clientset clientset.Interface, namespace string, resourceClaim *resourceapi.ResourceClaim, t *testing.T) {
	_, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(context.TODO(), resourceClaim, metav1.CreateOptions{})
```
I noticed the Service version of this function wraps this check in a PollImmediate, which allows non-Forbidden errors to be retried but still fails when a nil error is returned, as we're seeing in the e2e flake. I removed the wrapping here because I want to make sure any error we get is a 403.
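For reference, a minimal sketch of what the unwrapped check looks like; the actual body in the diff may differ, and `apierrors` is assumed to be `k8s.io/apimachinery/pkg/api/errors`:

```go
// Sketch only: assert that the create fails and that the failure is specifically a
// 403 Forbidden, with no retry loop, so any other outcome fails the test immediately.
func testResourceClaimForbidden(clientset clientset.Interface, namespace string, resourceClaim *resourceapi.ResourceClaim, t *testing.T) {
	_, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(context.TODO(), resourceClaim, metav1.CreateOptions{})
	if err == nil {
		t.Fatalf("expected 403 Forbidden creating ResourceClaim %q, but the create succeeded", resourceClaim.Name)
	}
	if !apierrors.IsForbidden(err) {
		t.Fatalf("expected 403 Forbidden creating ResourceClaim %q, got: %v", resourceClaim.Name, err)
	}
}
```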
test/integration/quota/quota_test.go (outdated diff)
```go
	close(informersStarted)

	wg := new(sync.WaitGroup)
	namespaces := 500
```
500 is about as low as I can set this and still have it reliably fail for me locally. I'll play around with lowering this value for CI to see if we can make the logs easier to sift through.
```bash
KUBE_TIMEOUT=${KUBE_TIMEOUT:--timeout=600s}
LOG_LEVEL=${LOG_LEVEL:-2}
KUBE_TEST_ARGS=${KUBE_TEST_ARGS:-}
KUBE_TEST_ARGS="-run TestQuotaLimitResourceClaim"
```
The hacks in this file are just to make it faster to run this particular test in CI. Leaving this comment as another reminder to revert them before merging:

/hold
test/integration/quota/quota_test.go (outdated diff)
```go
		go func() {
			defer wg.Done()

			ns := framework.CreateNamespaceOrDie(clientset, "quota-"+utilrand.String(5), t)
```
@pohly My intent with this test was to mimic the flaky e2e test, but essentially running a bunch of those concurrently in different namespaces. Does this seem like a fair representation of that?
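Roughly, the fan-out looks like the sketch below, pieced together from the fragments above; the loop body is abbreviated and the actual test contains the full quota setup and assertions:

```go
// Sketch: run the single-namespace quota scenario many times concurrently so that
// at least some runs land inside the race window between quota accounting and admission.
wg := new(sync.WaitGroup)
namespaces := 500
for i := 0; i < namespaces; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()

		// Each goroutine works in its own randomly named namespace.
		ns := framework.CreateNamespaceOrDie(clientset, "quota-"+utilrand.String(5), t)
		_ = ns

		// ...create a ResourceQuota, create one ResourceClaim that fits within it,
		// then expect the next create to fail with 403 Forbidden (see above)...
	}()
}
wg.Wait()
```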
Looking into the failures more, I noticed all of the ones that hit the same error condition as the e2e flake (where creating the second ResourceClaim succeeds) had hit some other error earlier in the test. I just pushed a change to iron out the control flow to fail early when that happens, and I no longer see any tests hit that same failure condition.
```go
	discoveryFunc := clientset.Discovery().ServerPreferredNamespacedResources
	listerFuncForResource := generic.ListerFuncForResourceFunc(informers.ForResource)
	qc := quotainstall.NewQuotaConfigurationForControllers(listerFuncForResource)
```
I had not looked into how the quota mechanism is integrated into the apiserver. So there really is a separate controller which updates the v1.ResourceQuotaStatus? And the apiserver watches that status to make decisions about whether it should allow creating a new resource?
Such a design would have an obvious race condition:
- first resource created
- quota controller updates status, quota is now exhausted
- apiserver handles second create request before seeing that updated status -> allows it
But perhaps I am missing something?
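To make that suspected sequence concrete, here is a rough sketch of the two racing creates against a quota that allows only a single ResourceClaim in the namespace; `claimTemplate`, `ctx`, and the quota object itself are placeholders, not the exact objects in this test:

```go
// Sketch of the suspected race. The quota permits one ResourceClaim in the namespace.
first := claimTemplate.DeepCopy()
first.Name = "claim-1"
if _, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(ctx, first, metav1.CreateOptions{}); err != nil {
	t.Fatalf("unexpected error creating first claim: %v", err)
}

// If the apiserver admits this second create before it has observed the updated
// quota usage, it succeeds even though the quota should already be exhausted.
second := claimTemplate.DeepCopy()
second.Name = "claim-2"
if _, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(ctx, second, metav1.CreateOptions{}); err == nil {
	t.Logf("second ResourceClaim was admitted despite the quota; race reproduced")
}
```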
AFAICT that's all accurate. I don't see any extra synchronization that would prevent that kind of race.
I think I found a way to trigger this in a single namespace, like the e2e test, by adding the ResourceQuota to the API server's cache well after the controller manager's cache; that triggers the same failure as the flakes. Does this iteration of the test seem convincing?
It all seems pretty clear now. We could try to get confirmation from SIG Auth that the quota mechanism isn't reliable.
The fix for the E2E flake then is to turn the one-shot creation of a second ResourceClaim into an "eventually we should get an error" when creating many ResourceClaims.
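A minimal sketch of that shape, expressed here with `k8s.io/apimachinery/pkg/util/wait` rather than gomega's Eventually; `claimTemplate` is a placeholder and the actual change in #128943 may look different:

```go
// Keep creating claims until one is rejected with 403 Forbidden, which shows the quota
// is now enforced; give up after a timeout so a broken quota still fails the test.
err := wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true,
	func(ctx context.Context) (bool, error) {
		claim := claimTemplate.DeepCopy()
		claim.Name = "claim-" + utilrand.String(5)
		_, createErr := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(ctx, claim, metav1.CreateOptions{})
		switch {
		case apierrors.IsForbidden(createErr):
			return true, nil // quota enforcement observed
		case createErr != nil:
			return false, createErr // unexpected error, stop immediately
		default:
			return false, nil // create succeeded; quota hasn't caught up yet, retry
		}
	})
if err != nil {
	t.Fatalf("never observed a 403 Forbidden from the ResourceQuota: %v", err)
}
```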
Thanks for following up on Slack! I've opened a PR with the Eventually workaround: #128943
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nojnhuh

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@nojnhuh: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
I'll close this for now, but we can reference it later if needed.

/close
@nojnhuh: Closed this PR.

In response to this: "I'll close this for now, but we can reference it later if needed. /close"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
This PR adds a (failing) integration test that reliably triggers the bug causing the e2e flake described in #128410. I intend to iterate on a fix for the issue in this PR as I continue to dig into why this is failing.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: