thanos_compact_halted metric never cleared #8228

@johngmyers

Thanos, Prometheus and Golang version used:

Thanos v0.37.0
Prometheus v3.1.0

Object Storage Provider:

S3

What happened:

Thanos-compact threw a transient error "preallocate: no space left on device". Although that cleanup loop failed, subsequent cleanup loops completed without error.

The thanos_compact_halted metric was set to 1 and remained so, despite compact making progress.

On-call personnel had to manually restart the pod in order to clear the resulting alert.

What you expected to happen:

The metric should have been reset to 0 once compact completed a loop without error, so that on-call personnel would not have to waste time manually restarting the pod.

How to reproduce it (as minimally and precisely as possible):

Run Thanos in production.

Full logs to relevant components:

Anything else we need to know:
