Change GCE LB health check interval from 2s to 8s #70099

Merged

k8s-ci-robot merged 1 commit into kubernetes:master from grayluck:five-sec-hc

Oct 25, 2018

Contributor

grayluck commented Oct 22, 2018

Let ELB always ensure HttpHealthCheck.

e2e test included to test whether health check interval will be
reconciled when kube-controller-manager restarts.

What type of PR is this?

/kind bug

What this PR does / why we need it:
Health checks at a 2s interval impose CPU and network overhead that can overrun the cluster and make nodes unhealthy. Raising the HC check interval from 2s to 5s relieves that traffic and decreases the health check QPS by 2.5x.
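
The QPS figure follows from the fact that each checker sends one probe per interval, so the probe rate scales with the inverse of the interval. A minimal sketch of that arithmetic, where `reductionFactor` is a hypothetical helper name and not part of this PR:

```go
package main

import "fmt"

// reductionFactor returns how many times fewer probes a health checker
// sends when its interval grows from oldSec to newSec. The probe rate is
// 1/interval, so the ratio of rates is simply newSec/oldSec.
// This is an illustrative helper, not code from the PR.
func reductionFactor(oldSec, newSec float64) float64 {
	return newSec / oldSec
}

func main() {
	// Interval proposed in the description.
	fmt.Printf("2s -> 5s: %.1fx fewer probes\n", reductionFactor(2, 5))
	// Interval named in the PR title.
	fmt.Printf("2s -> 8s: %.1fx fewer probes\n", reductionFactor(2, 8))
}
```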

Special notes for your reviewer:
/assign @bowei

Does this PR introduce a user-facing change?:

The default health check interval for GCE/GKE load balancers changes from 2 seconds to 8 seconds, and the default unhealthyThreshold changes to 3.
Health check parameters can be configured to values larger than the defaults.

Contributor Author

grayluck commented Oct 22, 2018

/priority critical-urgent

grayluck closed this

grayluck reopened this

grayluck force-pushed the five-sec-hc branch 2 times, most recently from 301d5dd to 250c4b3 Compare

October 22, 2018 21:25

k8s-ci-robot added the release-note label

k8s-ci-robot assigned bowei

k8s-ci-robot added kind/bug size/M needs-sig needs-ok-to-test priority/critical-urgent labels

k8s-ci-robot requested review from bowei and gmarek

October 22, 2018 21:41

k8s-ci-robot added sig/cloud-provider sig/testing and removed needs-sig labels

grayluck force-pushed the five-sec-hc branch from 250c4b3 to e102229 Compare

October 22, 2018 21:53

Contributor Author

grayluck commented Oct 22, 2018

/remove-sig cloud-provider
/remove-sig testing

Contributor Author

grayluck commented Oct 22, 2018

/sig network

k8s-ci-robot added cncf-cla: yes sig/network and removed sig/cloud-provider sig/testing labels

Contributor Author

grayluck commented Oct 22, 2018

e2e test passed.

e2e test log:
STEP: create load balancer service
Oct 22 15:24:13.238: INFO: Waiting up to 20m0s for service "lb-hc-int" to have a LoadBalancer
STEP: modify the health check interval
STEP: restart kube-controller-manager
STEP: health check should be reconciled
Oct 22 15:25:29.183: INFO: hc.CheckIntervalSec = 10
Oct 22 15:25:49.355: INFO: hc.CheckIntervalSec = 10
Oct 22 15:26:09.343: INFO: hc.CheckIntervalSec = 5

Seeing health check interval being reconciled to 5 sec.

bowei reviewed

test/e2e/network/service.go Outdated

Member

bowei Oct 23, 2018

I think this needs to be [serial] no? Otherwise, you might have the healthcheck recreated by another test running at the same time?

Contributor Author

grayluck Oct 23, 2018

You're right. This test might be disrupted.

Member

bowei commented Oct 23, 2018

/ok-to-test

k8s-ci-robot removed the needs-ok-to-test label

bowei reviewed

pkg/cloudprovider/providers/gce/gce.go Outdated

Member

bowei Oct 23, 2018

We should be reducing the gceHcUnhealthyThreshold to 3, otherwise it will be very unresponsive.

Contributor Author

grayluck Oct 23, 2018

Done.

pkg/cloudprovider/providers/gce/gce.go Outdated

Member

bowei Oct 23, 2018

Let's increase this to 8 s.

Then we have 3 * 8 = 24 s to notice a VM is down.

pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go Outdated

Member

bowei Oct 23, 2018

I think we probably want to add logic to keep higher values for the parameters on the healthcheck if the user has increased the value outside of Kubernetes. This gives the user a way out if their cluster is being impacted by the healthcheck volume.

Contributor Author

grayluck Oct 23, 2018

Logic added. Unit test also added to guard the logic.

k8s-ci-robot removed the size/M label

grayluck force-pushed the five-sec-hc branch from ea945fb to 8403a8e Compare

October 24, 2018 21:05

Contributor Author

grayluck commented Oct 24, 2018

/hold cancel

k8s-ci-robot removed the do-not-merge/hold label

freehan reviewed

pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go Outdated

Contributor

freehan Oct 24, 2018

Reconcile the HealthCheck interval to be >= the new default interval.

E.g. if the old health check interval is 2s and the new default is 8s, reconcile to 8s.
If the existing health check interval is greater than the default interval, keep the existing value.

Contributor Author

grayluck Oct 25, 2018

Comment added.

pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go Outdated

Contributor

freehan Oct 24, 2018

httpHealthCheckChanged

Contributor Author

grayluck Oct 25, 2018

Changed to needToUpdateHttpHealthChecks

pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go Outdated

Contributor

freehan Oct 25, 2018

if needToUpdateHealthChecks(old, new) {
    newHC := mergeHealthCheck(old, new)
    update(newHC)
}
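
A self-contained sketch of the suggested check, merge, and update flow is given here. It assumes a simplified `healthCheck` struct with only the two fields discussed in this thread, and an `update` callback standing in for the GCE API call; all of these are hypothetical stand-ins, not the code that was merged.

```go
package main

import "fmt"

// healthCheck is a hypothetical stand-in for the GCE HTTP health check
// resource; only the fields discussed in this thread are modeled.
type healthCheck struct {
	CheckIntervalSec   int64
	UnhealthyThreshold int64
}

func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// needToUpdateHealthChecks reports whether any value on the existing
// check is below the desired default.
func needToUpdateHealthChecks(existing, desired healthCheck) bool {
	return existing.CheckIntervalSec < desired.CheckIntervalSec ||
		existing.UnhealthyThreshold < desired.UnhealthyThreshold
}

// mergeHealthCheck keeps any existing value that is larger than the
// desired default, so user-raised settings are preserved.
func mergeHealthCheck(existing, desired healthCheck) healthCheck {
	return healthCheck{
		CheckIntervalSec:   maxInt64(existing.CheckIntervalSec, desired.CheckIntervalSec),
		UnhealthyThreshold: maxInt64(existing.UnhealthyThreshold, desired.UnhealthyThreshold),
	}
}

// reconcile calls update only when the existing check needs to change.
func reconcile(existing, desired healthCheck, update func(healthCheck)) {
	if needToUpdateHealthChecks(existing, desired) {
		update(mergeHealthCheck(existing, desired))
	}
}

func main() {
	desired := healthCheck{CheckIntervalSec: 8, UnhealthyThreshold: 3}
	show := func(hc healthCheck) { fmt.Printf("update to %+v\n", hc) }

	// Interval below the default is raised; the larger threshold is kept.
	reconcile(healthCheck{CheckIntervalSec: 2, UnhealthyThreshold: 5}, desired, show)
	// Values already above the defaults: update is not called.
	reconcile(healthCheck{CheckIntervalSec: 10, UnhealthyThreshold: 5}, desired, show)
}
```

Taking the maximum per field is one way to give users the escape hatch requested earlier in the review: a value raised outside Kubernetes survives a controller restart.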

Contributor Author

grayluck Oct 25, 2018

Done.

pkg/cloudprovider/providers/gce/gce_loadbalancer_internal.go Outdated

Contributor

freehan Oct 25, 2018

make it consistent with external LB

if needToUpdateHealthChecks(old, new) {
    newHC := mergeHealthCheck(old, new)
    update(newHC)....
}

Contributor Author

grayluck Oct 25, 2018

Done.

grayluck force-pushed the five-sec-hc branch from dbe414d to d67df6f Compare

October 25, 2018 00:33

freehan reviewed

pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go Outdated

pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go Outdated

Contributor

freehan commented Oct 25, 2018

just nits.
/lgtm

k8s-ci-robot assigned freehan

k8s-ci-robot added the lgtm label

grayluck force-pushed the five-sec-hc branch from d67df6f to 2c95696 Compare

October 25, 2018 00:53

k8s-ci-robot removed the lgtm label


Change GCE LB health check interval from 2s to 8s, unhealthyThreshold to 3

40ab479

Force ELB to ensureHealthCheck when target pool exists.

Add e2e test to ensure that HC interval will be reconciled when
kube-controller-manager restarts.

Health checks with bigger thresholds and larger intervals will not be reconciled.

Add unittest for ILB and ELB to ensure HC reconciles and is configurable.

grayluck force-pushed the five-sec-hc branch from 2c95696 to 40ab479 Compare

October 25, 2018 00:55

Contributor

freehan commented Oct 25, 2018

/lgtm

k8s-ci-robot added the lgtm label

Contributor

freehan commented Oct 25, 2018

/test pull-kubernetes-integration
/test pull-kubernetes-kubemark-e2e-gce-big

Contributor

freehan commented Oct 25, 2018

/approve

Contributor

k8s-ci-robot commented Oct 25, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: freehan, grayluck

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/cloudprovider/providers/gce/OWNERS~~ [freehan]
~~test/e2e/network/OWNERS~~ [freehan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the approved label

k8s-ci-robot merged commit ebace77 into kubernetes:master

Contributor

k8s-ci-robot commented Oct 25, 2018

@grayluck: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-verify | `40ab479` | link | `/test pull-kubernetes-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

justinsb mentioned this pull request

Failing test: should reconcile LB health check interval [Slow][Serial] #70280

Closed

Member

justinsb commented Oct 26, 2018

This caused an e2e failure on GKE because we don't support restarting k-c-m: #70280. Fix in #70283

This was referenced Oct 26, 2018

Automated cherry pick of #70099: Change GCE LB health check interval from 2s to 8s, #70301

Closed

Automated cherry pick of #70099 #70315

Merged

Automated cherry pick of #70099 #70317

Merged

Automated cherry pick of #70099: Change GCE LB health check interval from 2s to 8s, #70318

Merged

k8s-ci-robot added a commit that referenced this pull request


Merge pull request #70317 from grayluck/automated-cherry-pick-of-#70099-upstream-release-1.11

b919401

Automated cherry pick of #70099

k8s-ci-robot added a commit that referenced this pull request


Merge pull request #70318 from grayluck/automated-cherry-pick-of-#70099-upstream-release-1.12

9328c49

Automated cherry pick of #70099: Change GCE LB health check interval from 2s to 8s,

k8s-ci-robot added a commit that referenced this pull request


Merge pull request #70315 from grayluck/automated-cherry-pick-of-#70099-upstream-release-1.10

bf2ba3e

Automated cherry pick of #70099

Labels

approved cncf-cla: yes kind/bug lgtm priority/critical-urgent release-note sig/cloud-provider sig/network sig/testing size/L