Weekly High IO on cluster #1521

@johnseekins

Description

Describe the bug
I've described this bug in other places (#1441), but I feel like I'm stealing thunder from other issues, so I'm breaking it out into a separate issue.
At a high level:
We have a cluster configured to retain data for 1 year (-retentionPeriod=365d). Every Monday at midnight (00:00 UTC) we see a significant IO spike that directly causes data rerouting. The IO spike outlasts the reroutes, so it doesn't seem directly connected to the rerouting.
Our instances are 8 CPU x 16 GB RAM, running atop Ceph storage (with 4 TB disks per storage node).

Because we see a significant drop in objects in Ceph itself at the same time, we suspect this might be related to an aggressive number of deletes in the cluster. The associated Ceph graph is also linked below.
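
If it helps triage, here's a rough sketch of how we could diff the vmstorage /metrics counters across the Monday 00:00 UTC window to see which ones jump; the :8482 port matches the -httpListenAddr flag below, but the hostname, sampling window, and output are placeholders rather than part of our actual tooling:

    // Sketch: snapshot a vmstorage /metrics endpoint before and after the
    // Monday 00:00 UTC window and print the counters that grew the most.
    // The host name is hypothetical; :8482 is taken from -httpListenAddr.
    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "sort"
        "strconv"
        "strings"
        "time"
    )

    // scrape reads the Prometheus text exposition ("<name>{labels} <value>")
    // from the given URL into a map keyed by the full series name.
    func scrape(url string) (map[string]float64, error) {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        metrics := make(map[string]float64)
        sc := bufio.NewScanner(resp.Body)
        sc.Buffer(make([]byte, 1024*1024), 1024*1024)
        for sc.Scan() {
            line := sc.Text()
            if line == "" || strings.HasPrefix(line, "#") {
                continue
            }
            i := strings.LastIndexByte(line, ' ')
            if i < 0 {
                continue
            }
            v, err := strconv.ParseFloat(line[i+1:], 64)
            if err != nil {
                continue
            }
            metrics[line[:i]] = v
        }
        return metrics, sc.Err()
    }

    func main() {
        url := "http://vmstorage-1:8482/metrics" // hypothetical node

        before, err := scrape(url)
        if err != nil {
            panic(err)
        }
        time.Sleep(10 * time.Minute) // span the midnight window

        after, err := scrape(url)
        if err != nil {
            panic(err)
        }

        type delta struct {
            name string
            d    float64
        }
        var deltas []delta
        for name, v := range after {
            if d := v - before[name]; d > 0 {
                deltas = append(deltas, delta{name, d})
            }
        }
        sort.Slice(deltas, func(i, j int) bool { return deltas[i].d > deltas[j].d })

        // Print the 20 counters that increased the most during the window.
        n := 20
        if len(deltas) < n {
            n = len(deltas)
        }
        for _, x := range deltas[:n] {
            fmt.Printf("%-80s +%.0f\n", x.name, x.d)
        }
    }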

To Reproduce

  1. Have a large ingest rate (>~800k points/sec) into a large cluster (28 storage nodes); a toy load-generator sketch follows this list
  2. Wait a week
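
For the ingest half of step 1, a toy generator along these lines shows the shape of our write path; the vminsert URL (the cluster import path for Prometheus exposition format), tenant 0, metric name, and series count are all assumptions, and a single process like this won't get anywhere near 800k points/sec on its own:

    // Sketch: push synthetic samples at vminsert in Prometheus text format.
    // URL, tenant, metric name, and series count are illustrative only.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Hypothetical vminsert import endpoint, tenant 0.
        url := "http://vminsert:8480/insert/0/prometheus/api/v1/import/prometheus"

        const series = 10000 // distinct time series per batch

        for {
            ts := time.Now().UnixMilli()
            var buf bytes.Buffer
            for i := 0; i < series; i++ {
                // One sample per series per batch: name{labels} value timestamp_ms
                fmt.Fprintf(&buf, "synthetic_metric{instance=\"gen-%d\"} %d %d\n", i, i, ts)
            }
            resp, err := http.Post(url, "text/plain", &buf)
            if err != nil {
                fmt.Println("push failed:", err)
            } else {
                resp.Body.Close()
            }
            time.Sleep(time.Second)
        }
    }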

Expected behavior
No sudden weekly IO spikes.

Logs
No errors were found in the logs other than messages about query timeouts (which are clearly symptoms of the problem).

Screenshots
Load profile: (screenshot)
Re-routing of data: (screenshot)
Ceph objects being removed: (screenshot from 2021-08-04 10-35-49)

Version
1.63.0

Used command-line flags

 -search.maxUniqueTimeseries=300000000
 -search.maxTagKeys=1000000
 -search.maxTagValues=1000000000
 -dedup.minScrapeInterval=1s
 -memory.allowedPercent=75
 -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1
 -retentionPeriod=365d
 -vminsertAddr :8400
 -vmselectAddr :8401
 -httpListenAddr :8482
