
Add request processing HPA into the queue after processing is finished. #72373

Merged
merged 1 commit into from
Jan 4, 2019

Conversation

krzysztof-jastrzebski
Contributor

@krzysztof-jastrzebski krzysztof-jastrzebski commented Dec 27, 2018

This fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.
Which issue(s) this PR fixes:

Fixes #72372

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixes a bug in the HPA controller so that HPAs are always updated every resyncPeriod (15 seconds).

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 27, 2018
@k8s-ci-robot
Contributor

Hi @krzysztof-jastrzebski. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 27, 2018
@krzysztof-jastrzebski
Contributor Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 27, 2018
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Dec 27, 2018
@krzysztof-jastrzebski
Contributor Author

/assign mwielgus

@@ -298,20 +302,20 @@ func (a *HorizontalController) computeReplicasForMetrics(hpa *autoscalingv2.Hori
return replicas, metric, statuses, timestamp, nil
}

func (a *HorizontalController) reconcileKey(key string) error {
func (a *HorizontalController) reconcileKey(key string) (err error, deleted bool) {
Contributor

@mwielgus mwielgus Dec 27, 2018


Error must be the last variable.

Contributor Author


done
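For anyone skimming this thread, a minimal, self-contained illustration of the convention the reviewer is asking for: when a function returns multiple values, the error goes last. The reconcileKey below is a hypothetical stand-in for the controller method, matching the call site deleted, err := a.reconcileKey(key.(string)) shown further down.

package main

import "fmt"

// reconcileKey is a stand-in for HorizontalController.reconcileKey: multiple
// return values with the error last, per Go convention.
func reconcileKey(key string) (deleted bool, err error) {
	if key == "" {
		return false, fmt.Errorf("empty key")
	}
	// ... reconcile the HPA identified by key ...
	return false, nil
}

func main() {
	deleted, err := reconcileKey("default/my-hpa")
	fmt.Println(deleted, err)
}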

@mwielgus
Contributor

How was this PR tested?

@krzysztof-jastrzebski
Contributor Author

I created a K8s build with debug logging showing when HPAs are processed, then created a cluster and an HPA and observed how often the HPA was updated. Next I removed the HPA just after it was processed and verified that it was not processed anymore.

@mwielgus
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 27, 2018
@mwielgus
Contributor

/approve

@mwielgus
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: krzysztof-jastrzebski, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 27, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

// request into the queue. A request is not inserted into the queue by resync if the previous one wasn't processed yet.
// This happens quite often because requests from the previous resync are removed from the queue at the same moment
// as the next resync inserts new requests.
if !deleted {
Member

@liggitt liggitt Dec 28, 2018


Am I reading this correctly that it always requeues items that still exist, even in non-error cases?

Contributor Author


yes, see comment below

@liggitt
Member

liggitt commented Dec 28, 2018

/hold

Unconditional reinsertion seems very likely to cause problems. Doesn't this cause HPAs to be processed much more frequently than intended?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 28, 2018
@krzysztof-jastrzebski
Contributor Author

The HPA controller uses a rate limiter which always returns resyncPeriod (currently 15 seconds).
Rate Limiter:

func (r *FixedItemIntervalRateLimiter) When(item interface{}) time.Duration {

Creation:
queue: workqueue.NewNamedRateLimitingQueue(NewDefaultHPARateLimiter(resyncPeriod), "horizontalpodautoscaler"),

In that case, every request inserted into the queue waits 15 seconds there. If we insert one request, it stays in the queue for 15 seconds, and all requests inserted during those 15 seconds are skipped because they would be executed after the request that is already in the queue. After 15 seconds the next request can be inserted, and it also stays in the queue for 15 seconds, and so on.
All redundant requests are skipped, and an HPA is updated at most every 15 seconds.

I realize this is not the best way of using RateLimitingQueue, but I would like to make a small fix so I can backport it to 1.12. If you think we should use the queue in another way, I'm happy to fix it in a follow-up PR. Do you have any suggestions on how to do that?
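To make the mechanism above concrete, here is a minimal sketch, not the controller's actual code: a rate limiter that always returns a fixed interval means every AddRateLimited waits the full interval in the delaying queue, duplicate adds for a key that is already waiting are collapsed, and re-adding the key right after processing it (as this PR does) keeps the cycle going. The fixedIntervalRateLimiter type and the demo timings are illustrative assumptions.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// fixedIntervalRateLimiter mirrors the idea of FixedItemIntervalRateLimiter:
// When always returns the same delay, regardless of the item or retry count.
type fixedIntervalRateLimiter struct {
	interval time.Duration
}

func (r *fixedIntervalRateLimiter) When(item interface{}) time.Duration { return r.interval }
func (r *fixedIntervalRateLimiter) Forget(item interface{})             {}
func (r *fixedIntervalRateLimiter) NumRequeues(item interface{}) int    { return 0 }

func main() {
	resyncPeriod := 2 * time.Second // 15s in the real controller, shortened for the demo
	queue := workqueue.NewNamedRateLimitingQueue(&fixedIntervalRateLimiter{resyncPeriod}, "horizontalpodautoscaler")

	// Simulate resync: the same HPA key is added over and over. Adds that happen
	// while the key is still waiting in the delaying queue are collapsed into one.
	go func() {
		for i := 0; i < 6; i++ {
			queue.AddRateLimited("default/my-hpa")
			time.Sleep(500 * time.Millisecond)
		}
	}()

	// Worker: process the key, then re-add it so the next interval is never
	// skipped even if a resync raced with processing (the behavior of this PR).
	for i := 0; i < 2; i++ {
		key, _ := queue.Get()
		fmt.Printf("processed %v at %s\n", key, time.Now().Format("15:04:05.000"))
		queue.Done(key)
		queue.AddRateLimited(key) // requeue while the HPA still exists
	}
}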

@liggitt
Member

liggitt commented Dec 28, 2018

This is a strange use of the rate limiting queue trying to achieve tight guarantees about retry interval.

I'm concerned about doubling the number of queue insertion attempts for systems with large numbers of HPAs. Attempting to enqueue all HPAs in the system every 15 seconds is already very aggressive... this change would attempt to enqueue them all twice every 15 seconds.

Until the structure of this controller can be revisited, can we do something simpler and less likely to impact performance like one of the following:

  • set up the HPA AddEventHandlerWithResyncPeriod with resyncPeriod+time.Second
  • set up NewDefaultHPARateLimiter(resyncPeriod-time.Second)

Either of those approaches would give delayingType#waitingLoop a second to move ready HPA items from waitingForQueue into the queue. If that takes longer than a second (for example, under load), HPA processing could still skip a resync interval, but that actually seems like a better outcome to me than doubling queue attempts in pursuit of exact retry intervals.
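As a rough sketch of what those two alternatives could look like, assuming the informer and workqueue APIs of that era; the variable names are placeholders, and newHPARateLimiter stands in for a fixed-interval rate limiter constructor such as NewDefaultHPARateLimiter:

package hpaoptions

import (
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Option A: let the informer resync slightly slower than the queue's fixed
// interval, giving the delaying queue a second to drain ready items first.
func optionA(hpaInformer cache.SharedIndexInformer, handler cache.ResourceEventHandler, resyncPeriod time.Duration) {
	hpaInformer.AddEventHandlerWithResyncPeriod(handler, resyncPeriod+time.Second)
}

// Option B: keep the informer resync as-is, but have the queue release items a
// second early. newHPARateLimiter is a placeholder for a constructor that
// returns a fixed-interval workqueue.RateLimiter.
func optionB(resyncPeriod time.Duration, newHPARateLimiter func(time.Duration) workqueue.RateLimiter) workqueue.RateLimitingInterface {
	return workqueue.NewNamedRateLimitingQueue(newHPARateLimiter(resyncPeriod-time.Second), "horizontalpodautoscaler")
}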

@krzysztof-jastrzebski
Contributor Author

I was thinking about a similar solution. The problem with it is deciding why to add/subtract 1 second. resyncPeriod is defined by a flag. The default is 15 seconds, but someone can use a 1 second resyncPeriod. If we added 1 second then HPAs would be refreshed every 2 seconds instead of 1; we would also process an HPA on every change of the watched objects (rate limiter set to 0), which in some cases can be very often.
I don't think that adding an HPA to the queue after processing it is a problem, as the current implementation of the HPA controller (it is single-threaded) can process fewer than 10 HPAs per second. Adding 10 requests every second to a local queue shouldn't be a big deal.

@liggitt
Member

liggitt commented Jan 2, 2019

The default is 15 seconds, but someone can use a 1 second resyncPeriod. If we added 1 second then HPAs would be refreshed every 2 seconds instead of 1; we would also process an HPA on every change of the watched objects (rate limiter set to 0), which in some cases can be very often.

Setting a 1 second resyncPeriod isn't likely to work well in general, but I agree we wouldn't want to drop that to 0. Setting to a percentage or capping at a minimum value would prevent that.

Unconditional requeuing is confusing to anyone who is familiar with all the other controllers. I'd prefer the smallest change we can make that doesn't take this controller even further away from standard usage of the queue components, so I'm not in favor of the current PR, but will defer to @kubernetes/sig-autoscaling-pr-reviews

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 2, 2019
Contributor

@DirectXMan12 DirectXMan12 left a comment


can you expand a bit on why this occurs in the comment? Use your example from above

if err == nil {
// don't "forget" here because we want to only process a given HPA once per resync interval
return true
deleted, err := a.reconcileKey(key.(string))
Contributor


you should still have a comment as to why we're not forgetting the key, since this works significantly differently from normal controllers.

@DirectXMan12
Contributor

please also fix the PR description to remove the unused flags.

@DirectXMan12
Contributor

I'm inclined to go with @liggitt's suggestion short-term, and think up a better long-term fix (e.g. don't watch HPA, and instead just reconcile every 15 seconds, staggered, or something).

@krzysztof-jastrzebski
Contributor Author

Using the solution suggested by @liggitt would take the same amount of time (implementing and testing) as implementing the final solution. I prefer submitting this solution (unless you see any bugs in it), backporting it, and implementing the final solution immediately after.

@mwielgus
Contributor

mwielgus commented Jan 4, 2019

All of the proposed solutions are ugly one way or another. Probably the cleanest solution would be to write the right queue instead of trying to bend the existing one to match the needs. Given the amount of time needed to implement it, I'm rather in favour of merging the proposed/existing/tested fix unless someone proves it doesn't work.

@krzysztof-jastrzebski please expand the comments in the code so that the possible confusion caused by non-standard use is minimal.

This fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.
@krzysztof-jastrzebski
Contributor Author

@mwielgus done

@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-kubemark-e2e-gce-big

@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-kubemark-e2e-gce-big

@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

@mwielgus
Contributor

mwielgus commented Jan 4, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 4, 2019
@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-integration

@k8s-ci-robot k8s-ci-robot merged commit 86691ca into kubernetes:master Jan 4, 2019
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 7, 2019
k8s-ci-robot added a commit that referenced this pull request Jan 8, 2019
…-pick-of-#72373-upstream-release-1.13

Automated cherry pick of #72373: Add request processing HPA into the queue after processing is
k8s-ci-robot added a commit that referenced this pull request Jan 8, 2019
…-pick-of-#72373-upstream-release-1.12

Automated cherry pick of #72373: Add request processing HPA into the queue after processing is