Scheduler should use a shared informer, and fix broken watch behavior for cached watches #46223
Conversation
Force-pushed from 881eb09 to 0f9ffed (Compare)
Reverts #46199
@smarterclayton do you still need the testcase?
@kubernetes/sig-api-machinery-pr-reviews we broke watch DELETE events with filters when we introduced the watch cache. This PR corrects that behavior... but it potentially has ramifications for API consumers. APIs not backed by watch caches are correct (in the strict sense that they have always behaved this way since 1.0); APIs backed by watch caches return the most recent update, not the old update. This is an unfortunate state to be in: we are broken both ways, we regressed several releases ago, and no one noticed until I actually wrote client code that depended on the correct behavior of the system. We need to decide whether we accept the break, fix the break, or provide both modes for those who need correct behavior.
Yes, still need the filtering set for scheduler extensions that expect to get only assigned pods.
does this mean we never get a watch event containing the final etcd record, only the next-to-final record? (or is PrevObject the final etcd record?)
Not sure what it does for final delete.
@liggitt before this fix, we get (DELETED, pod.Phase=Succeeded). With the fix, we get (DELETED, pod.Phase=Running), which is correct, since the fieldSelector is for assigned, non-terminated pods. So with the fix we never get the final etcd watch record, but we shouldn't be getting that one in this case.
I think @liggitt was referring to an unfiltered watch.
exactly
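To make the discussion concrete, here is a minimal hypothetical sketch of the kind of filtered watch in question: a client-go watch restricted by a field selector to assigned, non-terminated pods. The selector string and function name are illustrative assumptions, not code from this PR.

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchAssignedPods opens a watch restricted to pods that are bound to a node
// and not terminated. With this filter, a DELETE for a pod that finished as
// Succeeded should carry the last state that matched the filter (Running),
// not the final Succeeded state.
func watchAssignedPods(ctx context.Context, client kubernetes.Interface) (watch.Interface, error) {
	selector := fields.ParseSelectorOrDie(
		"spec.nodeName!=,status.phase!=Succeeded,status.phase!=Failed")
	return client.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
		FieldSelector: selector.String(),
	})
}
```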
@smarterclayton would it make more sense to move the assigned pod lister functionality into pod_expansion.go?
Force-pushed from 0f9ffed to 39ba06f (Compare)
I think we might want to consider backporting the cache watch stuff if we can decide on the API implications.
Send an email to the dev list? I doubt that would cover everyone using watches and relying on this incorrect behavior, but it's better than nothing.
Force-pushed from 39ba06f to e795360 (Compare)
Can be used either from a true shared informer or a local shared informer created just for the scheduler.
The underlying storage has always returned the old object on watch delete events when filtering. The cache watcher does not, which means a downstream caller gets different behavior. This fixes the cache watcher to be consistent with our long-standing behavior for watch. It may result in a behavior change (the filter becomes more precise), but this was a regression in behavior.
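For illustration, here is a simplified sketch (not the actual cacheWatcher code) of the rule this change enforces: on a DELETED event, a filtered watcher reports the previous object if it matched the filter, rather than the post-change state.

```go
package example

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
)

// watchCacheEvent stands in for the watch cache's internal event, which keeps
// both the new state and the state before the change.
type watchCacheEvent struct {
	Type       watch.EventType
	Object     runtime.Object // state after the change
	PrevObject runtime.Object // state before the change
}

// filteredEvent returns the event a filtered watcher should deliver, or nil
// to drop it. For DELETED events it reports the previous object when that
// object matched the filter, matching the etcd-backed (non-cached) behavior
// described above; returning Object here was the regression.
func filteredEvent(ev watchCacheEvent, matches func(runtime.Object) bool) *watch.Event {
	if ev.Type == watch.Deleted {
		if ev.PrevObject != nil && matches(ev.PrevObject) {
			return &watch.Event{Type: watch.Deleted, Object: ev.PrevObject}
		}
		return nil // the object never matched the filter, so the client never saw it
	}
	if matches(ev.Object) {
		return &watch.Event{Type: ev.Type, Object: ev.Object}
	}
	return nil
}
```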
Force-pushed from e795360 to e9e6935 (Compare)
Consensus seems to be merge and backport. Reviewers, do your worst
@kubernetes/sig-scheduling-pr-reviews
c.scheduledPodsHasSynced = podInformer.Informer().HasSynced
// scheduled pod cache
podInformer.Informer().AddEventHandler(
	cache.FilteringResourceEventHandler{
Will the informer cache all Pods locally (in the scheduler) and filter, or is the filtering handled in the apiserver?
That's the informer's job. The informer in this package is filtered to non-terminal pods, and then split in half into the pod cache (assigned) and queue (unassigned). The global informer in the controller manager (which this PR does not use, but an extension scheduler or hyperkube could use) watches all pods, and this filters down to the scheduler's set.
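For context, here is a rough, hypothetical sketch of the split described above, using client-go's cache.FilteringResourceEventHandler; the function signature and handler names are illustrative assumptions rather than the PR's actual wiring.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	coreinformers "k8s.io/client-go/informers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// registerPodHandlers splits a single pod informer in two: assigned pods keep
// the scheduler's cache of scheduled pods up to date, unassigned pods feed the
// scheduling queue.
func registerPodHandlers(podInformer coreinformers.PodInformer,
	toCache, toQueue cache.ResourceEventHandlerFuncs) {

	assigned := func(obj interface{}) bool {
		pod, ok := obj.(*v1.Pod)
		return ok && pod.Spec.NodeName != ""
	}

	// Scheduled pod cache: only pods already bound to a node.
	podInformer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
		FilterFunc: assigned,
		Handler:    toCache,
	})

	// Scheduling queue: everything that is not yet assigned.
	podInformer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
		FilterFunc: func(obj interface{}) bool { return !assigned(obj) },
		Handler:    toQueue,
	})
}
```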
Yeah - I would recommend this too. @smarterclayton - thanks a lot for debugging and fixing it. It's unfortunate that we didn't find this bug for 5 releases...
This LGTM
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, smarterclayton
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
@smarterclayton: The following test(s) failed:
Automatic merge from submit-queue (batch tested with PRs 45766, 46223)
@kubernetes/kubernetes-release-managers this is something we want to backport to 1.6, 1.5, and any older release that we release a patch for.
Removing label
@mwielgus ^^
Can be used either from a true shared informer or a local shared informer created just for the scheduler.

Fixes a bug in the cache watcher where we were returning the "current" object from a watch event, not the historic event. This means that we broke behavior when introducing the watch cache. This may have API implications for filtering watch consumers, but on the other hand, it prevents filtering clients from incorrectly seeing objects outside of their watch, which can lead to other subtle bugs.
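As a usage note on the shared-informer point, here is a hedged sketch of how a caller might create a shared informer factory whose pod informer is handed to the scheduler; the resync period and function name are assumptions for illustration.

```go
package example

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newPodInformerFactory builds a shared informer factory. A controller
// manager would share one factory across many controllers; a standalone
// scheduler can create a local one just for itself, as the commit message
// above describes.
func newPodInformerFactory(cfg *rest.Config) (informers.SharedInformerFactory, error) {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	// Touching the pod informer registers it with the factory; event handlers
	// (e.g. the scheduler's filtering handlers) are added before Start is called.
	_ = factory.Core().V1().Pods().Informer()
	return factory, nil
}
```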