add ability to disable health checks on kube-apiserver for healthz using query-params #70676
Conversation
Resolved review thread on staging/src/k8s.io/apiserver/pkg/server/healthz/healthz_test.go (outdated).
/ok-to-test Generally LGTM, but I'd like to see the param used in https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/manifests/kube-apiserver.manifest#L37 - and we can add a comment at the same time!
I don't object per se, but are you sure it's worth it just to avoid the crashlooping? At least in the etcd example, it's not like this process is doing much for you.
Crashlooping is also noisy. Yes, in the etcd example, not crashlooping is not going to fix the issue, but if etcd is unhealthy-ish due to a longish boot, then it will get to a correct state faster without crashlooping. There are trade-offs, for sure. Then again, this feature is opt-in, and if you are in a situation where it occurs to you to use it (i.e. this is happening a lot), then it probably seems worthwhile to you.
/assign @cheftako
Force-pushed from 0923187 to dc6922b, then from dc6922b to 33d4194.
Yeah, I think a blacklist is safer and more closely models the user workflow of "I know check X isn't useful in my setup because Y" -- clearly an admin won't have done this reasoning for items that are brand new!
…On Tue, Nov 13, 2018 at 10:20 AM Han Kang wrote, commenting on staging/src/k8s.io/apiserver/pkg/server/healthz/healthz.go (#70676 (comment)):
> @@ -141,12 +142,28 @@ func (c *healthzCheck) Check(r *http.Request) error {
> 	return c.check(r)
> }
> +// getExcludedChecks extracts the health check names to be excluded from the query param
> +func getExcludedChecks(r *http.Request) sets.String {
> +	checks, found := r.URL.Query()["exclude"]
With whitelisting, don't we encounter the same issues that @deads2k mentioned earlier in the thread, re: version skew and rollbacks?
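For context, here is a minimal, self-contained sketch that completes the helper quoted above, assuming k8s.io/apimachinery's sets package; the exact code merged in this PR may differ.

```go
package healthz

import (
	"net/http"

	"k8s.io/apimachinery/pkg/util/sets"
)

// getExcludedChecks extracts the health check names to be excluded from the
// "exclude" query parameter, e.g. /healthz?exclude=etcd&exclude=log.
func getExcludedChecks(r *http.Request) sets.String {
	checks, found := r.URL.Query()["exclude"]
	if found {
		return sets.NewString(checks...)
	}
	return sets.NewString()
}
```

Because the parameter is repeatable, callers can opt out of several checks at once, and a value that matches no registered check simply has no effect.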
Force-pushed from 5666d19 to 0f798a0.
/test pull-kubernetes-e2e-gke
Sorry to be nitpicky, but I'd call this a warning and not an error, since it doesn't make the check fail and the word "error" makes admins nervous :)
Sure, I can remove the prefix. I added it after the fact because it made it slightly easier to parse the string out from the verbose output blob.
Force-pushed from 0f798a0 to 895dd41.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lavalamp, logicalhan. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/test pull-kubernetes-verify
In cluster/gce/manifests/kube-apiserver.manifest:
    "host": "127.0.0.1",
    "port": 8080,
-   "path": "/healthz"
+   "path": "/healthz?exclude=etcd"
There are a couple of issues with this:
- The API server main blocks (after listening, but before serving) on the etcd client opening a connection to the etcd server. Not all, but some etcd failures will cause the kube-apiserver to never start serving /healthz.
- The etcd healthcheck is not the only healthcheck that depends on etcd. Spot checking [poststarthook/bootstrap-controller, poststarthook/rbac/bootstrap-roles, poststarthook/scheduling/bootstrap-system-priority-classes, poststarthook/ca-registration, autoregister-completion], all depend on etcd being up before they return an initial healthy status.
Is this change intended to help reduce restarts on apiserver startup or only in steady state?
> Is this change intended to help reduce restarts on apiserver startup or only in steady state?
The etcd check use case was to avoid useless restarts in steady state.
Yes, it was intended for steady state. Also, the observation that boot sequences are more problematic was a contributing factor in my thinking behind #71054.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Currently, the healthz endpoint is inflexible in that every registered health check is executed whenever the endpoint is hit. Since the kube-apiserver only supports a liveness check today, any persistent health check failure will eventually result in a kubelet-driven restart. However, this may not be desirable behavior if you have additional monitoring services layered on top of your control plane.
For instance, consider an etcd failure from the perspective of the kube-apiserver. Currently, persistent failures of the etcd health check in the kube-apiserver's healthz endpoint will cause the kube-apiserver to be restarted. This sometimes helps (in the case of a faulty etcd-client connection), but it can also be detrimental: the kube-apiserver restarts until the etcd cluster is functioning, then bombards the etcd cluster with requests to update local caches, saturating etcd to the point where it times out on its own liveness probes and triggers a kubelet-driven restart of etcd.
Suppose we could now exclude etcd from the liveness health check. Then, with an external monitoring service, we could independently ask etcd whether it is healthy and also ask the kube-apiserver whether it thinks etcd is healthy. Based on the combination of those answers, we could respond more intelligently to certain situations: for example, restarting the kube-apiserver when it thinks etcd is unhealthy but etcd is actually reporting healthy, and not restarting it when etcd is actually unhealthy.
This feature is backwards-compatible and opt-in: there is no impact on existing behavior unless the query params are actually used, and an exclude value that does not match the name of an existing health check is a no-op.
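To make the intended mechanics concrete, below is a rough, self-contained sketch of a handler that skips excluded checks and treats unknown exclude values as a no-op, surfacing them only as a warning in the verbose output. The handler name, the healthChecker interface, and the message wording here are illustrative stand-ins, not necessarily what this PR merges.

```go
package healthzexample

import (
	"bytes"
	"fmt"
	"net/http"
	"strings"

	"k8s.io/apimachinery/pkg/util/sets"
)

// healthChecker is a stand-in for the healthz checker interface: a named
// check plus a function that reports its health.
type healthChecker interface {
	Name() string
	Check(r *http.Request) error
}

// handleRootHealthz serves /healthz, skipping any checks named in the
// "exclude" query parameter.
func handleRootHealthz(checks ...healthChecker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		excluded := sets.NewString(r.URL.Query()["exclude"]...)
		var verbose bytes.Buffer
		failed := false
		for _, check := range checks {
			if excluded.Has(check.Name()) {
				// Excluded checks are skipped entirely and cannot fail the probe.
				excluded.Delete(check.Name())
				fmt.Fprintf(&verbose, "[+]%s excluded: ok\n", check.Name())
				continue
			}
			if err := check.Check(r); err != nil {
				failed = true
				fmt.Fprintf(&verbose, "[-]%s failed\n", check.Name())
			} else {
				fmt.Fprintf(&verbose, "[+]%s ok\n", check.Name())
			}
		}
		if excluded.Len() > 0 {
			// Leftover names matched no registered check: warn, don't fail.
			fmt.Fprintf(&verbose, "warn: no matching checks to exclude for: %s\n",
				strings.Join(excluded.List(), ","))
		}
		if failed {
			http.Error(w, verbose.String()+"healthz check failed", http.StatusInternalServerError)
			return
		}
		// NOTE: for simplicity this sketch always writes the per-check detail;
		// the real healthz handler only emits it when ?verbose is set.
		fmt.Fprint(w, verbose.String()+"healthz check passed\n")
	}
}
```

With this shape, a liveness probe pointed at /healthz?exclude=etcd still runs every other registered check, while a typo such as ?exclude=ectd is ignored apart from the warning line, so it cannot silently change probe behavior.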
Which issue(s) this PR fixes:
Fixes #70591
Does this PR introduce a user-facing change?:
/sig api-machinery