Description
Thanos, Prometheus and Golang version used:
Thanos v0.37.0
Prometheus v3.1.0
Object Storage Provider:
S3
What happened:
Thanos-compact threw a transient error "preallocate: no space left on device". Although that cleanup loop failed, subsequent cleanup loops completed without error.
The thanos_compact_halted metric was set to 1 and remained so, despite compact making progress.
On-call personnel had to manually restart the pod in order to clear the resulting alert.
What you expected to happen:
The metric should have been reset to 0 once compact completed a loop without error, so that the alert clears on its own and on-call personnel do not have to restart the pod manually.
How to reproduce it (as minimally and precisely as possible):
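Run Thanos compact in production until a cleanup loop fails with a transient error such as "preallocate: no space left on device", then let subsequent loops complete successfully and observe that thanos_compact_halted stays at 1.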
Full logs to relevant components:
Anything else we need to know: