Skip to content

Conversation

@grayluck
Copy link
Contributor

@grayluck grayluck commented Oct 22, 2018

Let ELB always ensure HttpHealthCheck.

e2e test included to test whether health check interval will be
reconciled when kube-controller-manager restarts.

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
The CPU and network overhead for health check with 2s interval which overruns the cluster and make nodes unhealthy. HC check interval 2s -> 5s is a relief to the traffic and decrease the QPS by 2.5x.

Special notes for your reviewer:
/assign @bowei

Does this PR introduce a user-facing change?:

GCE/GKE load balancer health check default interval changes from 2 seconds to 8 seconds, unhealthyThreshold to 3.
Health check parameters are configurable to be bigger than default values.

@grayluck
Copy link
Contributor Author

/priority critical-urgent

@grayluck grayluck closed this Oct 22, 2018
@grayluck grayluck reopened this Oct 22, 2018
@grayluck grayluck force-pushed the five-sec-hc branch 2 times, most recently from 301d5dd to 250c4b3 Compare October 22, 2018 21:25
@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 22, 2018
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Oct 22, 2018
@k8s-ci-robot k8s-ci-robot requested review from bowei and gmarek October 22, 2018 21:41
@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 22, 2018
@grayluck
Copy link
Contributor Author

/remove-sig cloud-provider
/remove-sig testing

@grayluck
Copy link
Contributor Author

/sig network

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 22, 2018
@grayluck
Copy link
Contributor Author

e2e test passed.

e2e test log:
STEP: create load balancer service
Oct 22 15:24:13.238: INFO: Waiting up to 20m0s for service "lb-hc-int" to have a LoadBalancer
STEP: modify the health check interval
STEP: restart kube-controller-manager
STEP: health check should be reconciled
Oct 22 15:25:29.183: INFO: hc.CheckIntervalSec = 10
Oct 22 15:25:49.355: INFO: hc.CheckIntervalSec = 10
Oct 22 15:26:09.343: INFO: hc.CheckIntervalSec = 5

Seeing health check interval being reconciled to 5 sec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be [serial] no? Otherwise, you might have the healthcheck recreated by another test running at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. This test might be disrupted.

@bowei
Copy link
Member

bowei commented Oct 23, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 23, 2018
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be reducing the gceHcUnhealthyThreshold to 3, otherwise it will be very unresponsive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's increase this to 8 s.

Then we have 3 * 8 = 24 s to notice a VM is down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want to add logic to keep higher values for the parameters on the healthcheck if the user has increased the value outside of Kubernetes. This gives the user a way out if their cluster is being impacted by the healthcheck volume.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic added. Unit test also added to guard the logic.

@k8s-ci-robot k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 23, 2018
@grayluck
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2018
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconcile HealthCheck interval to be >= the new default interval.

E.g. old health check interval is 2s, new one is 8. Reconcile to 8.
If the existing health check > the default interval, keep the existing health check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

httpHealthCheckChanged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to needToUpdateHttpHealthChecks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if needToUpdateHealthChecks(old, new) {
   newHC :=  mergeHealthCheck(old, new)
   update(newHC)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it consistent with external LB

if needToUpdateHealthChecks(old, new) {
   newHC :=  mergeHealthCheck(old, new)
   update(newHC)....
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@freehan
Copy link
Contributor

freehan commented Oct 25, 2018

just nits.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2018
…d to 3

Force ELB to ensureHealthCheck when target pool exists.

Add e2e test to ensure that HC interval will be reconciled when
kube-controller-manager restarts.

Health checks with bigger thresholds and larger intervals will not be reconciled.

Add unittest for ILB and ELB to ensure HC reconciles and is configurable.
@freehan
Copy link
Contributor

freehan commented Oct 25, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2018
@freehan
Copy link
Contributor

freehan commented Oct 25, 2018

/test pull-kubernetes-integration
/test pull-kubernetes-kubemark-e2e-gce-big

@freehan
Copy link
Contributor

freehan commented Oct 25, 2018

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: freehan, grayluck

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2018
@k8s-ci-robot k8s-ci-robot merged commit ebace77 into kubernetes:master Oct 25, 2018
@k8s-ci-robot
Copy link
Contributor

@grayluck: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-verify 40ab479 link /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@justinsb
Copy link
Member

This caused an e2e failure on GKE because we don't support restarting k-c-m: #70280. Fix in #70283

k8s-ci-robot added a commit that referenced this pull request Oct 27, 2018
…99-upstream-release-1.11

Automated cherry pick of #70099
k8s-ci-robot added a commit that referenced this pull request Oct 30, 2018
…99-upstream-release-1.12

Automated cherry pick of #70099: Change GCE LB health check interval from 2s to 8s,
k8s-ci-robot added a commit that referenced this pull request Oct 30, 2018
…99-upstream-release-1.10

Automated cherry pick of #70099
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants