add ability to disable health checks on kube-apiserver for healthz using query-params #70676

logicalhan · 2018-11-06T00:43:29Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, the healthz endpoint is inflexible: every registered health check is executed whenever the endpoint is hit. Since the kube-apiserver currently supports only a liveness check, any persistent health check failure will eventually result in a kubelet-driven restart. However, this may not be desirable behavior if you have additional monitoring services layered on top of your control plane.

For instance, take an etcd failure from the perspective of the kube-apiserver. Currently, persistent failures of the etcd health check in the kube-apiserver's healthz endpoint will cause the kube-apiserver to be restarted. This sometimes helps (in the case of a faulty etcd client connection) but can also be detrimental: the kube-apiserver restarts until the etcd cluster is functioning, then bombards the etcd cluster with requests to update its local caches, saturating etcd and causing it to time out on its own liveness probes and trigger a kubelet-driven restart.

Let's say that we could now exclude etcd from our liveness health check. Then, if we had an external monitoring service, we could independently ask etcd whether it was healthy, and we could also ask the kube-apiserver whether it thought etcd was unhealthy. Based on the combination of those answers, we could enable more intelligent responses to certain situations, such as restarting the kube-apiserver when it thinks etcd is unhealthy while etcd itself reports healthy, but not restarting it when etcd is actually unhealthy.
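The decision an external monitor might make from those two answers can be sketched as a small truth table. This is only an illustration of the reasoning in the paragraph above: the function name `shouldRestartAPIServer` and its two boolean inputs are hypothetical, and how a real monitor would obtain them is out of scope here.

```go
package main

import "fmt"

// shouldRestartAPIServer is a hypothetical helper, not part of this PR.
// etcdReportsHealthy:       what etcd says when asked directly.
// apiserverSeesEtcdHealthy: what the kube-apiserver's own etcd check says.
func shouldRestartAPIServer(etcdReportsHealthy, apiserverSeesEtcdHealthy bool) bool {
	// Restart only when etcd itself is fine but the kube-apiserver disagrees,
	// which points at the apiserver side (for example a bad client connection).
	// If etcd is really unhealthy, restarting the kube-apiserver does not help.
	return etcdReportsHealthy && !apiserverSeesEtcdHealthy
}

func main() {
	for _, etcd := range []bool{true, false} {
		for _, view := range []bool{true, false} {
			fmt.Printf("etcd healthy=%v, apiserver sees healthy=%v -> restart=%v\n",
				etcd, view, shouldRestartAPIServer(etcd, view))
		}
	}
}
```

Only one of the four combinations leads to a restart, which is the point of keeping the etcd check out of the liveness probe and leaving the decision to the monitor.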

This feature is backwards-compatible and opt-in. There is no impact on existing behavior unless these query params are actually used. The feature is a no-op for strings which do not match the name of an existing health check.

Which issue(s) this PR fixes:
Fixes #70591

Does this PR introduce a user-facing change?:

The kube-apiserver's healthz now takes in an optional query parameter which allows you to disable health checks from causing healthz failures.

/sig api-machinery

logicalhan · 2018-11-06T18:35:33Z

/cc @cheftako @lavalamp @liggitt

staging/src/k8s.io/apiserver/pkg/server/healthz/healthz_test.go

staging/src/k8s.io/apiserver/pkg/server/healthz/healthz.go

justinsb · 2018-11-07T16:08:04Z

/ok-to-test

Generally LGTM but I'd like to see the param used in https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/manifests/kube-apiserver.manifest#L37 - and we can add a comment at the same time!

deads2k · 2018-11-07T17:52:21Z

I don't object per se, but are you sure that it's worth it to avoid just crashlooping it? At least in the etcd example, it's not like this process is doing much for you.

logicalhan · 2018-11-07T21:55:23Z

> I don't object per se, but are you sure that it's worth it to avoid just crashlooping it? At least in the etcd example, it's not like this process is doing much for you.

Crashlooping is also noisy. Yes, in the etcd example, not crashlooping is not going to fix the issue, but if etcd is unhealthy-ish due to a longish boot, then it will get to a correct state faster without crashlooping. There are trade-offs, for sure. Then again, this feature is opt-in, and if you are in a situation where it occurs to you to use this feature (i.e. this is happening a lot), then it probably seems worthwhile to you.

roycaihw · 2018-11-08T21:17:23Z

/assign @cheftako

lavalamp · 2018-11-13T18:29:48Z

Yeah I think a blacklist is safer and more closely models the user workflow of "I know check X isn't useful in my setup because Y"--clearly an admin won't have done this reasoning for items that are brand new!

On Tue, Nov 13, 2018 at 10:20 AM Han Kang wrote:

> In staging/src/k8s.io/apiserver/pkg/server/healthz/healthz.go:
>
>     @@ -141,12 +142,28 @@ func (c *healthzCheck) Check(r *http.Request) error {
>      	return c.check(r)
>      }
>     +// getExcludedChecks extracts the health check names to be excluded from the query param
>     +func getExcludedChecks(r *http.Request) sets.String {
>     +	checks, found := r.URL.Query()["exclude"]
>
> With whitelisting, don't we encounter the same issues that @deads2k mentioned earlier in the thread, re: version skew and rollbacks?

logicalhan · 2018-11-13T21:48:36Z

/test pull-kubernetes-e2e-gke

logicalhan · 2018-11-13T22:36:55Z

/test pull-kubernetes-e2e-gke

lavalamp · 2018-11-13T22:42:30Z

staging/src/k8s.io/apiserver/pkg/server/healthz/healthz.go

Sorry to be nitpicky, but I'd call this a warning and not an error, since it doesn't make the check fail and the word "error" makes admins nervous :)

Sure, I can remove the prefix. I added it after the fact because it made it slightly easier to parse the string out from the verbose output blob.

lavalamp · 2018-11-13T23:01:20Z

/lgtm
/approve

k8s-ci-robot · 2018-11-13T23:01:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, logicalhan

Needs approval from an approver in each of these files:

~~cluster/gce/OWNERS~~ [lavalamp]
~~staging/src/k8s.io/apiserver/OWNERS~~ [lavalamp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

logicalhan · 2018-11-14T00:34:55Z

/test pull-kubernetes-verify

mikedanese · 2018-11-15T03:17:50Z

cluster/gce/manifests/kube-apiserver.manifest

        "host": "127.0.0.1",
        "port": 8080,
-        "path": "/healthz"
+        "path": "/healthz?exclude=etcd"


There are a couple of issues with this:

1. The API server main blocks (after listening, but before serving) on the etcd client opening a connection to the etcd server. Not all, but some etcd failures will cause the kube-apiserver to never start serving /healthz.

2. The etcd healthcheck is not the only healthcheck that depends on etcd. Spot checking [poststarthook/bootstrap-controller, poststarthook/rbac/bootstrap-roles, poststarthook/scheduling/bootstrap-system-priority-classes, poststarthook/ca-registration, autoregister-completion], all depend on etcd being up before they return an initial healthy status.

Is this change intended to help reduce restarts on apiserver startup or only in steady state?

> Is this change intended to help reduce restarts on apiserver startup or only in steady state?

The etcd check use case was to avoid useless restarts in steady state

Yes, it was intended for steady state. Also, the observation that boot sequences are more problematic was a contributing factor for my thought process in #71054.

k8s-ci-robot requested review from deads2k and sttts November 6, 2018 00:43

k8s-ci-robot added the needs-priority, needs-ok-to-test, and area/apiserver labels Nov 6, 2018

k8s-ci-robot requested review from cheftako, lavalamp and liggitt November 6, 2018 18:35

lavalamp reviewed Nov 6, 2018

staging/src/k8s.io/apiserver/pkg/server/healthz/healthz_test.go (outdated, resolved)

lavalamp reviewed Nov 6, 2018

staging/src/k8s.io/apiserver/pkg/server/healthz/healthz.go (outdated, resolved)

k8s-ci-robot removed the needs-ok-to-test label Nov 7, 2018

k8s-ci-robot assigned cheftako Nov 8, 2018

logicalhan force-pushed the exclude-checks branch 4 times, most recently from 0923187 to dc6922b Compare November 10, 2018 03:45

k8s-ci-robot added the needs-rebase label Nov 10, 2018

logicalhan force-pushed the exclude-checks branch from dc6922b to 33d4194 Compare November 13, 2018 00:00

k8s-ci-robot added the sig/cluster-lifecycle label and removed the needs-rebase label Nov 13, 2018

logicalhan force-pushed the exclude-checks branch from 5666d19 to 0f798a0 Compare November 13, 2018 20:20

lavalamp reviewed Nov 13, 2018

lavalamp added the priority/important-soon label Nov 13, 2018

k8s-ci-robot removed the needs-priority label Nov 13, 2018

Han Kang added 2 commits November 13, 2018 14:48

add ability to exclude health checks from failing healthz by passing …

f1f1bc8

…in a query param

exclude etcd from the liveness health check for the kube-apiserver on…

895dd41

… GCE

logicalhan force-pushed the exclude-checks branch from 0f798a0 to 895dd41 Compare November 13, 2018 22:49

k8s-ci-robot assigned lavalamp Nov 13, 2018

k8s-ci-robot added the lgtm label Nov 13, 2018

k8s-ci-robot added the approved label Nov 13, 2018

lavalamp added this to the v1.13 milestone Nov 14, 2018

k8s-ci-robot merged commit ca338b9 into kubernetes:master Nov 14, 2018

mikedanese mentioned this pull request Nov 15, 2018

increase the liveness probe delay for GCE e2e tests to avoid premature teardown #71054

Merged

mikedanese reviewed Nov 15, 2018

k8s-ci-robot added a commit that referenced this pull request Dec 5, 2018

Merge pull request #71285 from cheftako/automated-cherry-pick-of-#70753-#70676-#70971-upstream-release-1.12 (ee860a5)

Automated cherry pick of #70753 #70676 #70971 upstream release 1.12

logicalhan mentioned this pull request Dec 8, 2018

REQUEST: New membership for @logicalhan kubernetes/org#292

Closed

6 tasks

k8s-ci-robot added a commit that referenced this pull request Dec 13, 2018

Merge pull request #71334 from cheftako/automated-cherry-pick-of-#713…

15008f0

…25-upstream-release-1.10 Automated cherry pick of #70753, #70676 and #70971 upstream release 1.10

chaochn47 mentioned this pull request Oct 5, 2023

Add livez and readyz for etcd etcd-io/etcd#16651

Merged

chaochn47 mentioned this pull request Oct 13, 2023

raft loop prober with counter etcd-io/etcd#16713

Closed

9 participants
