[WIP] Add integration test showing possible ResourceQuota race #128899
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```go
// testResourceClaimForbidden attempts to create a ResourceClaim expecting 403 Forbidden due to resource quota limits being exceeded.
func testResourceClaimForbidden(clientset clientset.Interface, namespace string, resourceClaim *resourceapi.ResourceClaim, t *testing.T) {
	_, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(context.TODO(), resourceClaim, metav1.CreateOptions{})
```
I noticed the Service version of this function wraps this check in a PollImmediate, which allows non-Forbidden errors to be retried but still fails when a nil error is returned, as we're seeing in the e2e flake. I removed the wrapping here because I want to make sure any error we get is a 403.
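For reference, a minimal sketch of what the unwrapped check looks like; the actual body in the diff may differ, and `apierrors` is assumed to be `k8s.io/apimachinery/pkg/api/errors`:

```go
// Sketch only: assert that the create fails and that the failure is specifically a
// 403 Forbidden, with no retry loop, so any other outcome fails the test immediately.
func testResourceClaimForbidden(clientset clientset.Interface, namespace string, resourceClaim *resourceapi.ResourceClaim, t *testing.T) {
	_, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(context.TODO(), resourceClaim, metav1.CreateOptions{})
	if err == nil {
		t.Fatalf("expected 403 Forbidden creating ResourceClaim %q, but the create succeeded", resourceClaim.Name)
	}
	if !apierrors.IsForbidden(err) {
		t.Fatalf("expected 403 Forbidden creating ResourceClaim %q, got: %v", resourceClaim.Name, err)
	}
}
```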
test/integration/quota/quota_test.go (outdated diff)
```go
	close(informersStarted)

	wg := new(sync.WaitGroup)
	namespaces := 500
```
500 is about as low as I can set this and still have it reliably fail for me locally. I'll play around with lowering this value for CI to see if we can make the logs easier to sift through.
```bash
KUBE_TIMEOUT=${KUBE_TIMEOUT:--timeout=600s}
LOG_LEVEL=${LOG_LEVEL:-2}
KUBE_TEST_ARGS=${KUBE_TEST_ARGS:-}
KUBE_TEST_ARGS="-run TestQuotaLimitResourceClaim"
```
The hacks in this file are just to make it faster to run this particular test in CI. Leaving this comment as another reminder to revert them before merging:

/hold
test/integration/quota/quota_test.go (outdated diff)
```go
		go func() {
			defer wg.Done()

			ns := framework.CreateNamespaceOrDie(clientset, "quota-"+utilrand.String(5), t)
```
@pohly My intent with this test was to mimic the flaky e2e test, but essentially running a bunch of those concurrently in different namespaces. Does this seem like a fair representation of that?
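Roughly, the fan-out looks like the sketch below, pieced together from the fragments above; the loop body is abbreviated and the actual test contains the full quota setup and assertions:

```go
// Sketch: run the single-namespace quota scenario many times concurrently so that
// at least some runs land inside the race window between quota accounting and admission.
wg := new(sync.WaitGroup)
namespaces := 500
for i := 0; i < namespaces; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()

		// Each goroutine works in its own randomly named namespace.
		ns := framework.CreateNamespaceOrDie(clientset, "quota-"+utilrand.String(5), t)
		_ = ns

		// ...create a ResourceQuota, create one ResourceClaim that fits within it,
		// then expect the next create to fail with 403 Forbidden (see above)...
	}()
}
wg.Wait()
```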
Looking into the failures more, I noticed all of the ones that hit the same error condition as the e2e flake (where creating the second ResourceClaim succeeds) had hit some other error earlier in the test. I just pushed a change to iron out the control flow to fail early when that happens, and I no longer see any tests hit that same failure condition.
```go
	discoveryFunc := clientset.Discovery().ServerPreferredNamespacedResources
	listerFuncForResource := generic.ListerFuncForResourceFunc(informers.ForResource)
	qc := quotainstall.NewQuotaConfigurationForControllers(listerFuncForResource)
```
I had not looked into how the quota mechanism is integrated into the apiserver. So there really is a separate controller which updates the v1.ResourceQuotaStatus? And the apiserver watches that status to make decisions about whether it should allow creating a new resource?
Such a design would have an obvious race condition:
- first resource created
- quota controller updates status, quota is now exhausted
- apiserver handles second create request before seeing that updated status -> allows it
But perhaps I am missing something?
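To make that suspected sequence concrete, here is a rough sketch of the two racing creates against a quota that allows only a single ResourceClaim in the namespace; `claimTemplate`, `ctx`, and the quota object itself are placeholders, not the exact objects in this test:

```go
// Sketch of the suspected race. The quota permits one ResourceClaim in the namespace.
first := claimTemplate.DeepCopy()
first.Name = "claim-1"
if _, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(ctx, first, metav1.CreateOptions{}); err != nil {
	t.Fatalf("unexpected error creating first claim: %v", err)
}

// If the apiserver admits this second create before it has observed the updated
// quota usage, it succeeds even though the quota should already be exhausted.
second := claimTemplate.DeepCopy()
second.Name = "claim-2"
if _, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(ctx, second, metav1.CreateOptions{}); err == nil {
	t.Logf("second ResourceClaim was admitted despite the quota; race reproduced")
}
```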
AFAICT that's all accurate. I don't see any extra synchronization that would prevent that kind of race.
I think I found a way to trigger this in a single namespace, like the e2e test, by adding the ResourceQuota to the API server's cache well after the controller manager's cache; that triggers the same failure as the flakes. Does this iteration of the test seem convincing?
It all seems pretty clear now. We could try to get confirmation from SIG Auth that the quota mechanism isn't reliable.
The fix for the E2E flake then is to turn the one-shot creation of a second ResourceClaim into an "eventually we should get an error" when creating many ResourceClaims.
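A minimal sketch of that shape, expressed here with `k8s.io/apimachinery/pkg/util/wait` rather than gomega's Eventually; `claimTemplate` is a placeholder and the actual change in #128943 may look different:

```go
// Keep creating claims until one is rejected with 403 Forbidden, which shows the quota
// is now enforced; give up after a timeout so a broken quota still fails the test.
err := wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true,
	func(ctx context.Context) (bool, error) {
		claim := claimTemplate.DeepCopy()
		claim.Name = "claim-" + utilrand.String(5)
		_, createErr := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(ctx, claim, metav1.CreateOptions{})
		switch {
		case apierrors.IsForbidden(createErr):
			return true, nil // quota enforcement observed
		case createErr != nil:
			return false, createErr // unexpected error, stop immediately
		default:
			return false, nil // create succeeded; quota hasn't caught up yet, retry
		}
	})
if err != nil {
	t.Fatalf("never observed a 403 Forbidden from the ResourceQuota: %v", err)
}
```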
Thanks for following up on Slack! I've opened a PR with the Eventually workaround: #128943
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nojnhuh

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@nojnhuh: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
I'll close this for now, but we can reference it later if needed.

/close
@nojnhuh: Closed this PR.

In response to this: "I'll close this for now, but we can reference it later if needed. /close"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
This PR adds a (failing) integration test that reliably triggers the bug causing the e2e flake described in #128410. I intend to iterate on a fix for the issue in this PR as I continue to dig into why this is failing.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: