Skip to content

Conversation

@nojnhuh
Copy link
Contributor

@nojnhuh nojnhuh commented Nov 20, 2024

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

This PR adds a (failing) integration test that reliably triggers the bug causing the e2e flake described by #128410. I intend to iterate on a fix for the issue in this PR as I continue to dig into why this is failing.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 20, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 20, 2024
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 20, 2024

// testResourceClaimForbidden attempts to create a ResourceClaim expecting 403 Forbidden due to resource quota limits being exceeded.
func testResourceClaimForbidden(clientset clientset.Interface, namespace string, resourceClaim *resourceapi.ResourceClaim, t *testing.T) {
_, err := clientset.ResourceV1beta1().ResourceClaims(namespace).Create(context.TODO(), resourceClaim, metav1.CreateOptions{})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed the Service version of this function wraps this check in a PollImmediate which allows non-Forbidden errors to be retried but still fails when a nil error is returned like we're seeing in the e2e flake. I removed the wrapping here because I think we want to make sure any error we get is a 403.

close(informersStarted)

wg := new(sync.WaitGroup)
namespaces := 500
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

500 is about as low as I can make this to reliably fail for me locally. I'll play around with lowering this value for CI to see if we can make logs easier to sift through.

KUBE_TIMEOUT=${KUBE_TIMEOUT:--timeout=600s}
LOG_LEVEL=${LOG_LEVEL:-2}
KUBE_TEST_ARGS=${KUBE_TEST_ARGS:-}
KUBE_TEST_ARGS="-run TestQuotaLimitResourceClaim"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hacks in this file are just to make it faster to run the particular test in CI. Leaving this comment as another reminder to revert these:
/hold

go func() {
defer wg.Done()

ns := framework.CreateNamespaceOrDie(clientset, "quota-"+utilrand.String(5), t)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pohly My intent with this test was to mimic the flaky e2e test, but essentially running a bunch of those concurrently in different namespaces. Does this seem like a fair representation of that?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 20, 2024
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Nov 21, 2024

Looking into the failures more, I noticed all of the ones that hit the same error condition as the e2e flake (where creating the second ResourceClaim succeeds) had hit some other error earlier in the test. I just pushed a change to iron out the control flow to fail early when that happens, and I no longer see any tests hit that same failure condition.


discoveryFunc := clientset.Discovery().ServerPreferredNamespacedResources
listerFuncForResource := generic.ListerFuncForResourceFunc(informers.ForResource)
qc := quotainstall.NewQuotaConfigurationForControllers(listerFuncForResource)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not looked into how the quota mechanism is integrated into the apiserver. So there really is a separate controller which updates the v1.ResourceQuotaStatus? And the apiserver watches that status to make decisions about whether it should allow creating a new resource?

Such a design would have an obvious race condition:

  • first resource created
  • quota controller updates status, quota is now exhausted
  • apiserver handles second create request before seeing that updated status -> allows it

But perhaps I am missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT that's all accurate. I don't see any extra synchronization that would prevent that kind of race.

I think I found a way to trigger this in a single namespace like the e2e test by adding the ResourceQuota to the API server's cache well after the controller manager's cache and that triggers the same failure as the flakes. Does this iteration of the test seem convincing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It all seems pretty clear now. We could try to get confirmation from SIG Auth that the quota mechanism isn't reliable.

The fix for the E2E flake then is to turn the one-shot creation of a second ResourceClaim into an "eventually we should get an error" when creating many ResourceClaims.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for following up on Slack! I've opened a PR with the Eventually workaround: #128943

@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Nov 21, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nojnhuh
Once this PR has been reviewed and has the lgtm label, please assign smarterclayton for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

@nojnhuh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-integration 2e4775e link true /test pull-kubernetes-integration
pull-kubernetes-linter-hints 2e4775e link false /test pull-kubernetes-linter-hints
pull-kubernetes-verify-lint 2e4775e link true /test pull-kubernetes-verify-lint
pull-kubernetes-verify 2e4775e link true /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nojnhuh
Copy link
Contributor Author

nojnhuh commented Dec 9, 2024

I'll close this for now, but we can reference it later if needed.

/close

@k8s-ci-robot
Copy link
Contributor

@nojnhuh: Closed this PR.

Details

In response to this:

I'll close this for now, but we can reference it later if needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nojnhuh nojnhuh deleted the claim-quota-race branch August 27, 2025 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/apiserver area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants