-
Notifications
You must be signed in to change notification settings - Fork 42k
Change GCE LB health check interval from 2s to 8s #70099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/priority critical-urgent |
301d5dd to
250c4b3
Compare
|
/remove-sig cloud-provider |
|
/sig network |
|
e2e test passed. e2e test log: Seeing health check interval being reconciled to 5 sec. |
test/e2e/network/service.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be [serial] no? Otherwise, you might have the healthcheck recreated by another test running at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. This test might be disrupted.
|
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be reducing the gceHcUnhealthyThreshold to 3, otherwise it will be very unresponsive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's increase this to 8 s.
Then we have 3 * 8 = 24 s to notice a VM is down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably want to add logic to keep higher values for the parameters on the healthcheck if the user has increased the value outside of Kubernetes. This gives the user a way out if their cluster is being impacted by the healthcheck volume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic added. Unit test also added to guard the logic.
|
/hold cancel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reconcile HealthCheck interval to be >= the new default interval.
E.g. old health check interval is 2s, new one is 8. Reconcile to 8.
If the existing health check > the default interval, keep the existing health check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
httpHealthCheckChanged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to needToUpdateHttpHealthChecks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if needToUpdateHealthChecks(old, new) {
newHC := mergeHealthCheck(old, new)
update(newHC)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it consistent with external LB
if needToUpdateHealthChecks(old, new) {
newHC := mergeHealthCheck(old, new)
update(newHC)....
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
just nits. |
…d to 3 Force ELB to ensureHealthCheck when target pool exists. Add e2e test to ensure that HC interval will be reconciled when kube-controller-manager restarts. Health checks with bigger thresholds and larger intervals will not be reconciled. Add unittest for ILB and ELB to ensure HC reconciles and is configurable.
|
/lgtm |
|
/test pull-kubernetes-integration |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: freehan, grayluck The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@grayluck: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
…99-upstream-release-1.11 Automated cherry pick of #70099
…99-upstream-release-1.12 Automated cherry pick of #70099: Change GCE LB health check interval from 2s to 8s,
…99-upstream-release-1.10 Automated cherry pick of #70099
Let ELB always ensure HttpHealthCheck.
e2e test included to test whether health check interval will be
reconciled when kube-controller-manager restarts.
What type of PR is this?
What this PR does / why we need it:
The CPU and network overhead for health check with 2s interval which overruns the cluster and make nodes unhealthy. HC check interval 2s -> 5s is a relief to the traffic and decrease the QPS by 2.5x.
Special notes for your reviewer:
/assign @bowei
Does this PR introduce a user-facing change?: