Add request processing HPA into the queue after processing is finished. #72373
Conversation
Hi @krzysztof-jastrzebski. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig autoscaling
/assign mwielgus
@@ -298,20 +302,20 @@ func (a *HorizontalController) computeReplicasForMetrics(hpa *autoscalingv2.Hori
 	return replicas, metric, statuses, timestamp, nil
 }

-func (a *HorizontalController) reconcileKey(key string) error {
+func (a *HorizontalController) reconcileKey(key string) (err error, deleted bool) {
Error must be the last variable.
done
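For reference, a minimal hedged sketch of the requested ordering; the controller type and body below are hypothetical stand-ins, and the only point being illustrated is the Go convention that the error is the last return value, i.e. `(deleted bool, err error)` rather than `(err error, deleted bool)`:

```go
package main

import "fmt"

// controller is a hypothetical stand-in for HorizontalController.
type controller struct{}

// reconcileKey reports whether the HPA named by key was deleted, and any error.
// Go convention puts the error last: (deleted bool, err error).
func (c *controller) reconcileKey(key string) (deleted bool, err error) {
	// ... look up the HPA by key; if it no longer exists, return deleted = true ...
	return false, nil
}

func main() {
	c := &controller{}
	deleted, err := c.reconcileKey("default/my-hpa")
	fmt.Println(deleted, err)
}
```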
Force-pushed from 7a0b7ee to c56439d.
How was this PR tested?
I've created a K8s build with debug logging showing when HPAs are processed, then created a cluster and an HPA and observed how often the HPA was updated. Next I removed the HPA just after it was processed and verified that it was not processed anymore.
/lgtm
/approve
/ok-to-test
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: krzysztof-jastrzebski, mwielgus

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest

Review the full test history for this PR. Silence the bot with an /lgtm cancel or /hold comment for consistent failures.
Force-pushed from c56439d to 48eb53e.
// request into the queue. Request is not inserted into queue by resync if previous one wasn't processed yet.
// This happens quite often because requests from previous resync are removed from the queue at the same moment
// as next resync inserts new requests.
if !deleted {
Am I reading this correctly that it always requeues items that still exist, even in non-error cases?
yes, see comment below
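To make that concrete, here is a hedged sketch of the worker shape this change produces: the key is re-added with the rate-limited (resync) delay after every pass, unless the HPA was deleted. The controller type, the reconcile stub, and the rate limiter below are stand-ins for illustration, not the actual HPA controller code:

```go
package main

import (
	"k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/util/workqueue"
)

// exampleController is a hypothetical stand-in for HorizontalController.
type exampleController struct {
	queue workqueue.RateLimitingInterface
}

// reconcileKey is a stub; it reports whether the HPA named by key was deleted.
func (a *exampleController) reconcileKey(key string) (deleted bool, err error) {
	return false, nil
}

// processNextWorkItem sketches the behavior discussed in this thread:
// the key is requeued after processing so the HPA is reconciled again one
// resync interval from now, even in non-error cases; deleted HPAs are not requeued.
func (a *exampleController) processNextWorkItem() bool {
	key, quit := a.queue.Get()
	if quit {
		return false
	}
	defer a.queue.Done(key)

	deleted, err := a.reconcileKey(key.(string))
	if err != nil {
		runtime.HandleError(err)
	}

	if !deleted {
		a.queue.AddRateLimited(key)
	}
	return true
}

func main() {
	c := &exampleController{
		queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "example"),
	}
	c.queue.Add("default/my-hpa")
	c.processNextWorkItem()
}
```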
/hold

Unconditional reinsertion seems very likely to cause problems. Doesn't this cause HPAs to be processed much more frequently than intended?
The HPA controller uses a rate limiter which always returns ResyncPeriod (currently 15 seconds); see where the queue is created in the controller.
In that case every request inserted into the queue waits 15 seconds in the queue. If we insert one request, it stays in the queue for 15 seconds, and all requests inserted during those 15 seconds are skipped, as they would be executed after the request which is already in the queue. After 15 seconds the next request can be inserted, and it also stays in the queue for 15 seconds, etc. All redundant requests are skipped, so the HPA is updated at most every 15 seconds. I see this is not the best way of using RateLimitingQueue, but I would like to make a small fix so I can backport it to 1.12. If you think we should use the queue in another way, I'm happy to fix it in a next PR. Do you have any suggestion how to do that?
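To illustrate the queue behavior described above, here is a hedged sketch: a rate limiter that always returns the same fixed delay, so every AddRateLimited lands one resync period in the future, and redundant adds for a key that is already waiting collapse into a single entry. The limiter type and wiring below are approximations for illustration, not the controller's actual code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// fixedIntervalRateLimiter approximates the behavior described above:
// every item is delayed by exactly the resync period, regardless of history.
type fixedIntervalRateLimiter struct {
	interval time.Duration
}

func (r *fixedIntervalRateLimiter) When(item interface{}) time.Duration { return r.interval }
func (r *fixedIntervalRateLimiter) Forget(item interface{})             {}
func (r *fixedIntervalRateLimiter) NumRequeues(item interface{}) int    { return 0 }

func main() {
	resyncPeriod := 2 * time.Second // the real controller defaults to 15s
	queue := workqueue.NewNamedRateLimitingQueue(
		&fixedIntervalRateLimiter{interval: resyncPeriod}, "example")

	// Redundant adds while the key is still waiting in the delaying queue
	// collapse into one entry, so the key is handed out at most once per
	// resync period.
	for i := 0; i < 3; i++ {
		queue.AddRateLimited("default/my-hpa")
	}

	start := time.Now()
	key, _ := queue.Get() // blocks until the delay elapses
	fmt.Printf("got %v after %v; queue length now %d\n", key, time.Since(start), queue.Len())
	queue.Done(key)
}
```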
This is a strange use of the rate limiting queue trying to achieve tight guarantees about retry interval. I'm concerned about doubling the number of queue insertion attempts for systems with large numbers of HPAs. Attempting to enqueue all HPAs in the system every 15 seconds is already very aggressive... this change would attempt to enqueue them all twice every 15 seconds. Until the structure of this controller can be revisited, can we do something simpler and less likely to impact performance like one of the following:
Either of those approaches would give
I was thinking about a similar solution. The problem with it is why add/subtract 1 second: resyncPeriod is defined by a flag. The default is 15 seconds, but someone can use a 1 second resyncPeriod. If we added 1 second then HPAs would be refreshed every 2 seconds instead of 1; we would also process the HPA on every change of the objects (rate limiter set to 0), which can in some cases be very often.
Setting a 1 second resyncPeriod isn't likely to work well in general, but I agree we wouldn't want to drop that to 0. Setting it to a percentage or capping it at a minimum value would prevent that. Unconditional requeuing is confusing to anyone who is familiar with all the other controllers. I'd prefer the smallest change we can make that doesn't take this controller even further away from standard usage of the queue components, so I'm not in favor of the current PR, but will defer to @kubernetes/sig-autoscaling-pr-reviews.

/hold cancel
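To make the alternative concrete, one hedged way to combine the ideas in this exchange is to offset the resync period by one second but floor the result at a fraction of the period, so a very short resyncPeriod cannot collapse toward zero. The helper name, the one-second offset, and the 50% floor below are illustrative choices, not anything agreed in this thread:

```go
package main

import (
	"fmt"
	"time"
)

// requeueDelay is a hypothetical helper: requeue slightly before the next
// resync would fire, but never drop below half the resync period, so a
// 1 second resyncPeriod does not degrade to a 0 second delay.
func requeueDelay(resyncPeriod time.Duration) time.Duration {
	delay := resyncPeriod - time.Second
	if floor := resyncPeriod / 2; delay < floor {
		delay = floor
	}
	return delay
}

func main() {
	for _, p := range []time.Duration{time.Second, 15 * time.Second} {
		fmt.Printf("resyncPeriod=%v -> requeue delay %v\n", p, requeueDelay(p))
	}
}
```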
can you expand a bit on why this occurs in the comment? Use your example from above
-	if err == nil {
-		// don't "forget" here because we want to only process a given HPA once per resync interval
-		return true
+	deleted, err := a.reconcileKey(key.(string))
You should still have a comment as to why we're not forgetting the key, since this works significantly differently from normal controllers.
Please also fix the PR description to remove the unused flags.
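For comparison, the conventional worker pattern the reviewers are alluding to calls Forget on success and only requeues on error, relying on informer events and periodic resync for the next pass. This is a hedged sketch with generic names, not the HPA controller's code:

```go
package main

import (
	"k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/util/workqueue"
)

// conventionalWorker sketches the usual controller pattern for contrast with
// this PR: forget the key on success (resetting backoff) instead of
// unconditionally re-adding it after every pass.
func conventionalWorker(queue workqueue.RateLimitingInterface, reconcile func(key string) error) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := reconcile(key.(string)); err != nil {
		runtime.HandleError(err)
		queue.AddRateLimited(key) // retry with backoff on error
		return true
	}
	queue.Forget(key) // success: clear rate-limiting history, do not requeue
	return true
}

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "example")
	queue.Add("default/my-hpa")
	conventionalWorker(queue, func(key string) error { return nil })
}
```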
I'm inclined to go with @liggitt's suggestion short-term, and think up a better long-term fix (e.g. don't watch HPA, and instead just reconcile every 15 seconds, staggered, or something).
Using the solution suggested by @liggitt will take the same amount of time (implementing and testing) as implementing the final solution. I prefer submitting this solution (unless you see any bugs in it), backporting it, and implementing the final solution immediately.
All of the proposed solutions are ugly one way or another. Probably the cleanest solution would be to write the right queue instead of trying to bend the existing one to match the needs. Given the amount of time needed to implement it, I'm rather in favour of merging the proposed/existing/tested fix unless someone proves it doesn't work. @krzysztof-jastrzebski please expand the comments in the code so that the possible confusion caused by non-standard use is minimal.
This fixes a bug where a request inserted by resync was skipped because the previous one hadn't been processed yet.
Force-pushed from 48eb53e to c6ebd12.
@mwielgus done
/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-gce-100-performance
/test pull-kubernetes-kubemark-e2e-gce-big
/test pull-kubernetes-e2e-gce-100-performance
/lgtm
/test pull-kubernetes-integration
…-pick-of-#72373-upstream-release-1.13 Automated cherry pick of #72373: Add request processing HPA into the queue after processing is
…-pick-of-#72373-upstream-release-1.12 Automated cherry pick of #72373: Add request processing HPA into the queue after processing is
This fixes a bug where a request inserted by resync was skipped because the previous one hadn't been processed yet.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a bug where a request inserted by resync was skipped because the previous one hadn't been processed yet.
Which issue(s) this PR fixes:
Fixes #72372
Special notes for your reviewer:
Does this PR introduce a user-facing change?: