Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly handle empty watch event cache #49992

Merged
merged 1 commit into from
Aug 2, 2017

Conversation

liggitt
Copy link
Member

@liggitt liggitt commented Aug 2, 2017

Fixes #49956

Introduced by ada6023 which did not adjust the oldest available resourceVersion for an empty watch event cache.

Exposed by 74b9ba3, which allowed controllers to get list results from etcd before the watch cache is ready (normally they list with resourceVersion=0 which serves the list request from the watch cache, blocking until it is ready)

When the watch cache had an empty cache of watch events, it currently allows establishing a watch as if it can deliver a watch event for its currently synced resourceVersion. This results in an off-by-one error which can result in a missed watch event.

Scenario:

bob:

  1. creates object at resourceVersion=11

sally:

  1. does a list API request, gets a list resourceVersion of 10 (just before bob creates the object)
  2. starts watch handled by watch cache at resourceVersion=10

Watch cache:

  1. initial list gets resourceVersion=11, including the item created by bob
  2. when determining the initial watch events to send to sally's watch, there are no watch events in the cache, so no initial watch events are sent.
  3. the cache listerwatcher watches etcd starting at resourceVersion=11, so future events are fed into the event cache and to sally's watch

The watch cache should have dropped sally's watch from resourceVersion=10 with a "gone" error, since it can't deliver the watch event for resourceVersion=11. This would force sally to relist (where she would get a list at resourceVersion=11) and rewatch (from resourceVersion=11)

This particularly affects tests that create CRD/TPRs and establish watches on the new types as the storage layer's watch cache is also populating for that type.

Fixed a bug in the API server watch cache, which could cause a missing watch event immediately after cache initialization.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 2, 2017
@liggitt
Copy link
Member Author

liggitt commented Aug 2, 2017

@kubernetes/sig-api-machinery-bugs

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. labels Aug 2, 2017
@liggitt liggitt added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. queue/fix labels Aug 2, 2017
@smarterclayton
Copy link
Contributor

I follow all the logic

@liggitt
Copy link
Member Author

liggitt commented Aug 2, 2017

/test pull-kubernetes-verify

@liggitt
Copy link
Member Author

liggitt commented Aug 2, 2017

on master, TestClientGoCustomResourceExample fails 20% of the time locally for me.

with this change, I cannot get it to fail (ran it ~150 times in a row before stopping the loop)

@liggitt liggitt added this to the v1.7 milestone Aug 2, 2017
@liggitt
Copy link
Member Author

liggitt commented Aug 2, 2017

@ironcladlou this fix also resolves most of the timeout issues I was seeing with other integration tests depending on watch on CRD objects (like TestMixedRelationships)

@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Aug 2, 2017
@wojtek-t wojtek-t added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Aug 2, 2017
@wojtek-t
Copy link
Member

wojtek-t commented Aug 2, 2017

@liggitt - thanks a lot for this fix

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 2, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, wojtek-t

Associated issue: 49956

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@wojtek-t
Copy link
Member

wojtek-t commented Aug 2, 2017

/retest

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Aug 2, 2017

@liggitt: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kops-aws 0df769f link /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 49992, 48861, 49267, 49356, 49886)

@k8s-github-robot k8s-github-robot merged commit 35c3a51 into kubernetes:master Aug 2, 2017
@ironcladlou
Copy link
Contributor

Awesome, thanks.

@ironcladlou
Copy link
Contributor

cc @caesarxuchao

@liggitt liggitt added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Aug 2, 2017
@liggitt liggitt deleted the debug-flake branch August 2, 2017 15:45
@wojtek-t wojtek-t added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Aug 3, 2017
k8s-github-robot pushed a commit that referenced this pull request Aug 3, 2017
…2-upstream-release-1.7

Automatic merge from submit-queue

Automated cherry pick of #49992

Cherry pick of #49992 on release-1.7.

#49992: Correctly handle empty watch event cache

```release-note
Fixed a bug in the API server watch cache, which could cause a missing watch event immediately after cache initialization.
```
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

k8s-github-robot pushed a commit that referenced this pull request Aug 4, 2017
…2-upstream-release-1.6

Automatic merge from submit-queue

Automated cherry pick of #49992

Cherry pick of #49992 on release-1.6.

#49992: Correctly handle empty watch event cache

```release-note
Fixed a bug in the API server watch cache, which could cause a missing watch event immediately after cache initialization.
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants