
Add request processing HPA into the queue after processing is finished. #72373

Merged
merged 1 commit into from
Jan 4, 2019

Conversation

krzysztof-jastrzebski
Contributor

@krzysztof-jastrzebski krzysztof-jastrzebski commented Dec 27, 2018

This fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.
Which issue(s) this PR fixes:

Fixes #72372

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixes a bug in the HPA controller so that HPAs are always updated every resyncPeriod (15 seconds).

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 27, 2018
@k8s-ci-robot
Contributor

Hi @krzysztof-jastrzebski. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 27, 2018
@krzysztof-jastrzebski
Contributor Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 27, 2018
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Dec 27, 2018
@krzysztof-jastrzebski
Contributor Author

/assign mwielgus

@@ -298,20 +302,20 @@ func (a *HorizontalController) computeReplicasForMetrics(hpa *autoscalingv2.Hori
return replicas, metric, statuses, timestamp, nil
}

func (a *HorizontalController) reconcileKey(key string) error {
func (a *HorizontalController) reconcileKey(key string) (err error, deleted bool) {
Contributor

@mwielgus mwielgus Dec 27, 2018


Error must be the last variable.

Contributor Author


done
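For anyone skimming this thread, a minimal, self-contained illustration of the convention the reviewer is asking for: when a function returns multiple values, the error goes last. The reconcileKey below is a hypothetical stand-in for the controller method, matching the call site deleted, err := a.reconcileKey(key.(string)) shown further down.

package main

import "fmt"

// reconcileKey is a stand-in for HorizontalController.reconcileKey: multiple
// return values with the error last, per Go convention.
func reconcileKey(key string) (deleted bool, err error) {
	if key == "" {
		return false, fmt.Errorf("empty key")
	}
	// ... reconcile the HPA identified by key ...
	return false, nil
}

func main() {
	deleted, err := reconcileKey("default/my-hpa")
	fmt.Println(deleted, err)
}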

@mwielgus
Contributor

How was this PR tested?

@krzysztof-jastrzebski
Contributor Author

I created a K8s build with debug logging showing when HPAs are processed, then created a cluster and an HPA and observed how often the HPA was updated. Next I removed the HPA just after it was processed and verified that it was not processed anymore.

@mwielgus
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 27, 2018
@mwielgus
Contributor

/approve

@mwielgus
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: krzysztof-jastrzebski, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 27, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

// request into the queue. A request is not inserted into the queue by resync if the previous one wasn't processed yet.
// This happens quite often because requests from the previous resync are removed from the queue at the same moment
// as the next resync inserts new requests.
if !deleted {
Member

@liggitt liggitt Dec 28, 2018


Am I reading this correctly that it always requeues items that still exist, even in non-error cases?

Contributor Author


yes, see comment below

@liggitt
Member

liggitt commented Dec 28, 2018

/hold

Unconditional reinsertion seems very likely to cause problems. Doesn't this cause HPAs to be processed much more frequently than intended?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 28, 2018
@krzysztof-jastrzebski
Contributor Author

The HPA controller uses a rate limiter which always returns resyncPeriod (currently 15 seconds).
Rate Limiter:

func (r *FixedItemIntervalRateLimiter) When(item interface{}) time.Duration {

Creation:
queue: workqueue.NewNamedRateLimitingQueue(NewDefaultHPARateLimiter(resyncPeriod), "horizontalpodautoscaler"),

In that case, every request inserted into the queue waits 15 seconds there. If we insert one request, it stays in the queue for 15 seconds, and all requests inserted during those 15 seconds are skipped because they would be executed after the request that is already in the queue. After 15 seconds the next request can be inserted, and it also stays in the queue for 15 seconds, and so on.
All redundant requests are skipped, and an HPA is updated at most every 15 seconds.

I realize this is not the best way of using RateLimitingQueue, but I would like to make a small fix so I can backport it to 1.12. If you think we should use the queue in another way, I'm happy to fix it in a follow-up PR. Do you have any suggestions on how to do that?
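To make the mechanism above concrete, here is a minimal sketch, not the controller's actual code: a rate limiter that always returns a fixed interval means every AddRateLimited waits the full interval in the delaying queue, duplicate adds for a key that is already waiting are collapsed, and re-adding the key right after processing it (as this PR does) keeps the cycle going. The fixedIntervalRateLimiter type and the demo timings are illustrative assumptions.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// fixedIntervalRateLimiter mirrors the idea of FixedItemIntervalRateLimiter:
// When always returns the same delay, regardless of the item or retry count.
type fixedIntervalRateLimiter struct {
	interval time.Duration
}

func (r *fixedIntervalRateLimiter) When(item interface{}) time.Duration { return r.interval }
func (r *fixedIntervalRateLimiter) Forget(item interface{})             {}
func (r *fixedIntervalRateLimiter) NumRequeues(item interface{}) int    { return 0 }

func main() {
	resyncPeriod := 2 * time.Second // 15s in the real controller, shortened for the demo
	queue := workqueue.NewNamedRateLimitingQueue(&fixedIntervalRateLimiter{resyncPeriod}, "horizontalpodautoscaler")

	// Simulate resync: the same HPA key is added over and over. Adds that happen
	// while the key is still waiting in the delaying queue are collapsed into one.
	go func() {
		for i := 0; i < 6; i++ {
			queue.AddRateLimited("default/my-hpa")
			time.Sleep(500 * time.Millisecond)
		}
	}()

	// Worker: process the key, then re-add it so the next interval is never
	// skipped even if a resync raced with processing (the behavior of this PR).
	for i := 0; i < 2; i++ {
		key, _ := queue.Get()
		fmt.Printf("processed %v at %s\n", key, time.Now().Format("15:04:05.000"))
		queue.Done(key)
		queue.AddRateLimited(key) // requeue while the HPA still exists
	}
}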

@liggitt
Member

liggitt commented Dec 28, 2018

This is a strange use of the rate limiting queue trying to achieve tight guarantees about retry interval.

I'm concerned about doubling the number of queue insertion attempts for systems with large numbers of HPAs. Attempting to enqueue all HPAs in the system every 15 seconds is already very aggressive... this change would attempt to enqueue them all twice every 15 seconds.

Until the structure of this controller can be revisited, can we do something simpler and less likely to impact performance like one of the following:

  • set up the HPA AddEventHandlerWithResyncPeriod with resyncPeriod+time.Second
  • set up NewDefaultHPARateLimiter(resyncPeriod-time.Second)

Either of those approaches would give delayingType#waitingLoop a second to move ready HPA items from waitingForQueue into the queue. If that takes longer than a second (for example, under load), HPA processing could still skip a resync interval, but that actually seems like a better outcome to me than doubling queue attempts in pursuit of exact retry intervals.
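As a rough sketch of what those two alternatives could look like, assuming the informer and workqueue APIs of that era; the variable names are placeholders, and newHPARateLimiter stands in for a fixed-interval rate limiter constructor such as NewDefaultHPARateLimiter:

package hpaoptions

import (
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Option A: let the informer resync slightly slower than the queue's fixed
// interval, giving the delaying queue a second to drain ready items first.
func optionA(hpaInformer cache.SharedIndexInformer, handler cache.ResourceEventHandler, resyncPeriod time.Duration) {
	hpaInformer.AddEventHandlerWithResyncPeriod(handler, resyncPeriod+time.Second)
}

// Option B: keep the informer resync as-is, but have the queue release items a
// second early. newHPARateLimiter is a placeholder for a constructor that
// returns a fixed-interval workqueue.RateLimiter.
func optionB(resyncPeriod time.Duration, newHPARateLimiter func(time.Duration) workqueue.RateLimiter) workqueue.RateLimitingInterface {
	return workqueue.NewNamedRateLimitingQueue(newHPARateLimiter(resyncPeriod-time.Second), "horizontalpodautoscaler")
}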

@krzysztof-jastrzebski
Contributor Author

I was thinking about a similar solution. The problem with it is deciding why to add/subtract 1 second. resyncPeriod is defined by a flag. The default is 15 seconds, but someone can use a 1 second resyncPeriod. If we added 1 second then HPAs would be refreshed every 2 seconds instead of 1; we would also process an HPA on every change of the watched objects (rate limiter set to 0), which in some cases can be very often.
I don't think that adding an HPA to the queue after processing it is a problem, as the current implementation of the HPA controller (it is single-threaded) can process fewer than 10 HPAs per second. Adding 10 requests every second to a local queue shouldn't be a big deal.

@liggitt
Member

liggitt commented Jan 2, 2019

The default is 15 seconds, but someone can use a 1 second resyncPeriod. If we added 1 second then HPAs would be refreshed every 2 seconds instead of 1; we would also process an HPA on every change of the watched objects (rate limiter set to 0), which in some cases can be very often.

Setting a 1 second resyncPeriod isn't likely to work well in general, but I agree we wouldn't want to drop that to 0. Setting to a percentage or capping at a minimum value would prevent that.

Unconditional requeuing is confusing to anyone who is familiar with all the other controllers. I'd prefer the smallest change we can make that doesn't take this controller even further away from standard usage of the queue components, so I'm not in favor of the current PR, but will defer to @kubernetes/sig-autoscaling-pr-reviews

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 2, 2019
Contributor

@DirectXMan12 DirectXMan12 left a comment


can you expand a bit on why this occurs in the comment? Use your example from above

if err == nil {
// don't "forget" here because we want to only process a given HPA once per resync interval
return true
deleted, err := a.reconcileKey(key.(string))
Contributor


you should still have a comment as to why we're not forgetting the key, since this works significantly differently from normal controllers.

@DirectXMan12
Contributor

please also fix the PR description to remove the unused flags.

@DirectXMan12
Contributor

I'm inclined to go with @liggitt's suggestion short-term, and think up a better long-term fix (e.g. don't watch HPA, and instead just reconcile every 15 seconds, staggered, or something).

@krzysztof-jastrzebski
Contributor Author

Using the solution suggested by @liggitt would take the same amount of time (implementing and testing) as implementing the final solution. I prefer submitting this solution (unless you see any bugs in it), backporting it, and implementing the final solution immediately after.

@mwielgus
Contributor

mwielgus commented Jan 4, 2019

All of the proposed solutions are ugly one way or another. Probably the cleanest solution would be to write the right queue instead of trying to bend the existing one to match the needs. Given the amount of time needed to implement it, I'm rather in favour of merging the proposed/existing/tested fix unless someone proves it doesn't work.

@krzysztof-jastrzebski please expand the comments in the code so that the possible confusion caused by non-standard use is minimal.

This fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.
@krzysztof-jastrzebski
Contributor Author

@mwielgus done

@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-kubemark-e2e-gce-big

@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-kubemark-e2e-gce-big

@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

@mwielgus
Contributor

mwielgus commented Jan 4, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 4, 2019
@krzysztof-jastrzebski
Contributor Author

/test pull-kubernetes-integration

@k8s-ci-robot k8s-ci-robot merged commit 86691ca into kubernetes:master Jan 4, 2019
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 7, 2019
k8s-ci-robot added a commit that referenced this pull request Jan 8, 2019
…-pick-of-#72373-upstream-release-1.13

Automated cherry pick of #72373: Add request processing HPA into the queue after processing is
k8s-ci-robot added a commit that referenced this pull request Jan 8, 2019
…-pick-of-#72373-upstream-release-1.12

Automated cherry pick of #72373: Add request processing HPA into the queue after processing is