I appreciate the inclusion of advice about timeouts. I’ve seen too many outages due to the default 1 second timeout on Kubernetes liveness probes.
You’d think 1 second is plenty to return a 200 status, but for a Python server running 2 gunicorn workers, a pitifully small burst of requests could occupy both workers for 30 seconds (failure threshold of 3 × period of 10s), triggering a restart. Using aiohttp has the same risk because of how easy it is to accidentally block the single event loop.
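To make the aiohttp failure mode concrete, here’s a minimal sketch (the /healthz and /work routes are made up for illustration): one synchronous call in any handler stalls the single event loop, so even a trivial health handler can’t answer within a 1-second probe timeout.

```python
# Minimal aiohttp sketch of how one blocking call starves the event loop:
# while /work sits in time.sleep(), the otherwise-trivial /healthz handler
# cannot run, so a 1s liveness probe times out and failures start accumulating.
import time
from aiohttp import web

async def healthz(request):
    # No I/O, no dependencies -- instant, unless the loop itself is blocked.
    return web.Response(text="ok")

async def work(request):
    # BUG (deliberate, for illustration): a synchronous call blocks the whole
    # event loop; every other request, including /healthz, waits until it returns.
    time.sleep(30)
    return web.Response(text="done")

app = web.Application()
app.add_routes([web.get("/healthz", healthz), web.get("/work", work)])

if __name__ == "__main__":
    web.run_app(app, port=8080)
```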
When the consequence of failing a health check is something as drastic as restarting the service, the health check should be tuned so that failures “truly indicate unrecoverable application failure, for example a deadlock.”
Timeouts, timeouts: always wrong!
Some too short and some too long.
Excellent rhyme!
but yeah, a health check is a binary classifier at the end of the day, and the timeout is the decision threshold. The only principled way to set up a health check would be to train one under prod conditions.
Kinda riffing off of this, I’ve had success in the past implementing /health or /heartbeat endpoints which returned JSON blobs that included, among other things, the result of a trivial query (select 1; or similar) to prove connectivity. These are cheap, helpful for sanity-checking, and provide useful context in addition to just “did the heartbeat succeed?”.
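A rough sketch of what such an endpoint might look like; the /heartbeat shape, field names, and SQLite-backed select 1 are placeholder assumptions chosen only to keep the example self-contained:

```python
# Sketch of a "deep" /heartbeat endpoint returning a JSON blob that includes a
# trivial "select 1" connectivity check. Swap the SQLite call for your real
# driver or connection pool.
import asyncio
import sqlite3
import time
from aiohttp import web

START_TIME = time.time()

def check_db() -> bool:
    # "select 1" proves we can open a connection and run a query at all.
    try:
        conn = sqlite3.connect("app.db", timeout=1.0)
        try:
            conn.execute("select 1;")
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False

async def heartbeat(request):
    db_ok = await asyncio.to_thread(check_db)  # keep the event loop unblocked
    body = {
        "status": "ok" if db_ok else "degraded",
        "uptime_seconds": round(time.time() - START_TIME, 1),
        "checks": {"database": "ok" if db_ok else "unreachable"},
    }
    # Returns 200 either way; whether a failed dependency should fail the probe
    # is exactly the policy question debated in this thread.
    return web.json_response(body)

app = web.Application()
app.add_routes([web.get("/heartbeat", heartbeat)])

if __name__ == "__main__":
    web.run_app(app, port=8080)
```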
We had them too, but I removed those: I don’t want everything to restart when there’s DB maintenance going on.
We had the same realization, so we changed the Kubernetes health checks to not talk to anything external to the process. We added a second endpoint that returns all the fancy connectivity checks, and have our monitoring tool check it every 5-10 minutes and alert us if something is wrong.
I also take this approach when defining healthchecks. I’ve not seen it explicitly stated though, and I wonder how commonly it’s followed.
Is it a good rule of thumb to only fail a healthcheck if restarting the container would lead to the problem being resolved?
At my last employer, a health check was allowed to fail (explicitly returning false) during initialization. Once a service had served a positive health check, normal practice and policy was to restart the service/container if it started returning negative health checks again. This practice was quite reliable.
We also had a concept of ‘deep’ health checks which reported on the status of dependencies. This was convenient because we could leverage the same monitoring and alerting infrastructure, but deep health check failures did not trigger container restarts or change load balancer policies.
It’s a good idea to separate the deep healthchecks like that. I guess if they don’t directly affect regular health, they can be used as an observability metric.
The problem is that if, say, your application literally cannot do anything without a DB connection, and a particular machine (or pod, or whatever) has a missing/failing DB connection, you don’t want traffic going to it because the only possible outcome of it is an HTTP 500. You want all the alarms going off and you want something to try restarting the server/pod/whatever – if the DB really is down, well, your whole site was going to be down no matter what and you can always manually toggle off the restarts for the duration of the outage. And if it’s only down for that one server/pod/whatever, you want a restart to happen quickly because a restart is the cheapest and easiest way to try to fix a bad server/pod.
I think it’s critical to treat this complex problem in layers; e.g.:
An error report: a component generated an exception and had to fail a request, or a health check failed. One or two of these might be interesting but don’t necessarily mean the component is hard down; a large number could represent a serious fault.
The diagnosis that a component is broken; e.g., an instance is generating exceptions for more than 10% of requests, or a health check failed twice consecutively. This is as much a policy decision as anything else, and is quite hard to definitively get right! Diagnoses often need to encompass telemetry from multiple layers, so that a fault can correctly be pinned on a backend database that’s down rather than on everything that depends on it, for example.
Corrective action; e.g., restarting something or replacing a component with a new one. You don’t always want to leap straight to restarting stuff: you may want to restart at most N instances in an hour to avoid making things worse by constantly restarting. If it’s a RAID array, for example, you might offline a busted disk that’s making everything slow, but obviously you can’t offline more disks than you have parity stripes without data loss.
There are other layers as well, like how to report or alarm on errors, diagnoses, and corrective actions; how to tie all this into how you do deployments; and how to express policy, etc.
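As a toy illustration of the corrective-action layer above, a restart budget might look something like this sketch; the class name and the 3-per-hour numbers are made-up defaults, not anyone’s real policy:

```python
# Toy restart budget: cap automatic restarts so a bad diagnosis can't take out
# the whole fleet by restarting everything at once.
import time
from collections import deque

class RestartBudget:
    """Permit at most `max_restarts` automatic restarts per rolling window."""

    def __init__(self, max_restarts=3, window_seconds=3600.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restart_times = deque()

    def allow(self):
        """Return True if another restart is permitted right now (and record it)."""
        now = time.monotonic()
        # Drop restarts that have aged out of the rolling window.
        while self._restart_times and now - self._restart_times[0] > self.window_seconds:
            self._restart_times.popleft()
        if len(self._restart_times) >= self.max_restarts:
            return False  # budget exhausted: stop restarting, escalate to a human
        self._restart_times.append(now)
        return True

budget = RestartBudget()
for instance in ["api-1", "api-2", "api-3", "api-4"]:
    print(instance, "->", "restart" if budget.allow() else "alert only")
```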
I agree. It’s hard to make a blanket call on whether restarting is worthwhile; e.g., failing to connect to a DB could be due to the connection pool being drained by connections not being returned, and that would be fixed by restarting.
As with ztoz’s ‘deep’ health checks, I think it probably makes sense to have a separate concept for “this container needs restarting” from “this container is not in a working state”, where the latter is a metric signal collected to alert on at a higher level than the orchestrator, and the former is a direct signal to the orchestrator to restart the container.
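In Kubernetes terms that split maps onto readiness vs. liveness probes: a failing readiness probe takes the pod out of the Service’s endpoints without restarting it, while a failing liveness probe triggers a restart. A minimal sketch of keeping the two endpoints separate (the /livez and /readyz names and the module-level flag are just illustrative conventions):

```python
# Sketch of separating "restart me" from "don't send me traffic":
# /livez only reports whether the process itself is wedged (restart-worthy),
# /readyz reports whether dependencies are usable (drain traffic, alert, don't restart).
from aiohttp import web

DEPENDENCIES_OK = True  # in a real service this would be a live or cached dependency check

async def livez(request):
    # Process is up and the event loop is turning; nothing external is consulted.
    return web.Response(text="alive")

async def readyz(request):
    # 503 tells the load balancer / orchestrator to stop sending traffic,
    # without implying that a restart would fix anything.
    if DEPENDENCIES_OK:
        return web.Response(text="ready")
    return web.Response(status=503, text="dependencies unavailable")

app = web.Application()
app.add_routes([web.get("/livez", livez), web.get("/readyz", readyz)])

if __name__ == "__main__":
    web.run_app(app, port=8080)
```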