
Weekly High IO on cluster #1521

Closed
johnseekins opened this issue Aug 4, 2021 · 7 comments
Labels
question The question issue

Comments

@johnseekins
Contributor

johnseekins commented Aug 4, 2021

Describe the bug
I've described this bug in other places (#1441), but I feel like I'm stealing thunder from other issues, so I'm breaking this out to a separate issue...
At a high level:
We have a cluster configured to retain data for 1 year (-retentionPeriod=365d). Every Monday at midnight (00:00 UTC) we see a significant IO spike that directly causes data rerouting. The IO spike outlasts the re-routes, so it doesn't seem directly connected to the re-routing.
Our instances are 8 CPU x 16 GB RAM instances running atop Ceph storage (with 4 TB disks per storage node).

Because we see a significant drop in objects in Ceph itself at the same time, we suspect this might be related to an aggressive number of deletes in the cluster. The associated Ceph graph is also linked below.
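One way to cross-check the object-count drop from the Ceph side is to snapshot per-pool object counts before and after the Monday window; a minimal sketch (the pool name is a placeholder):

# Per-pool usage and object counts across the cluster
ceph df detail
# Narrow to the pool backing the vmstorage volumes (pool name is hypothetical)
rados -p <vm-storage-pool> df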

To Reproduce

  1. Have a large ingest rate (>~800k points/sec) into a large cluster (28 storage nodes).
  2. Wait a week.

Expected behavior
No sudden weekly IO spikes.

Logs
No errors discovered in the logs other than messages about query timeouts (which are clearly symptoms of the problem).
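For reference, a quick way to scan the storage logs around the event window; a sketch assuming the vmstorage systemd unit is named vmstorage (adjust dates to the relevant Monday):

# Grep the journal around Monday midnight for errors and timeouts
journalctl -u vmstorage --since "2021-08-01 23:00" --until "2021-08-02 03:00" | grep -iE "error|timeout"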

Screenshots
Load profile: (screenshot)
Re-routing of data: (screenshot)
Ceph objects being removed: (screenshot)

Version
1.63.0

Used command-line flags

 -search.maxUniqueTimeseries=300000000 -search.maxTagKeys=1000000 -search.maxTagValues=1000000000 -dedup.minScrapeInterval=1s -memory.allowedPercent=75 -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1 -retentionPeriod=365d -vminsertAddr :8400 -vmselectAddr :8401 -httpListenAddr :8482
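For readability, the same flags one per line, roughly as the vmstorage invocation would look (the binary path is a placeholder):

/path/to/vmstorage-prod \
  -search.maxUniqueTimeseries=300000000 \
  -search.maxTagKeys=1000000 \
  -search.maxTagValues=1000000000 \
  -dedup.minScrapeInterval=1s \
  -memory.allowedPercent=75 \
  -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1 \
  -retentionPeriod=365d \
  -vminsertAddr :8400 \
  -vmselectAddr :8401 \
  -httpListenAddr :8482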
@hagen1778
Collaborator

hagen1778 commented Aug 5, 2021

Hi @johnseekins! Thank you for such a detailed issue!
Could you please provide some additional screenshots from our dashboards:

  • storage. LSM parts
  • storage. Disk writes/reads
  • storage. Active merges
  • resource usage. Open FDs
  • overview. Disk space used

Thank you!

@hagen1778 hagen1778 added the question The question issue label Aug 5, 2021
@johnseekins
Contributor Author

johnseekins commented Aug 5, 2021

Active merges and LSM parts: (screenshot)
Disk r/w and FDs: (screenshot)
Disk used: (screenshot)
And a few other interesting panels: (screenshot)

There is definitely a huge spike in indexdb and small merges during that time, but they seem to be symptoms (as they spike up suddenly during the event and then taper off...)

@hagen1778
Collaborator

Hm, the LSM parts graph suggests VM has about 6-8k parts on disk. And "vm_hdd Objects" (if I'm reading it right) suggests about 3 million objects were deleted over 2.5 hours, which doesn't correlate with the number of parts either merged or deleted by VM.
The spike in ActiveMerges can be explained by the RowsRerouted graph - every time vmstorage receives new time series, it results in new index parts on disk and subsequent merges...

Is there any chance some kind of cron job is enabled in the OS that runs every Monday at midnight? I can recall a similar issue where a cron job for the fstrim process caused lags on our SSDs every week.
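One way to enumerate everything scheduled weekly on a storage node, as a sketch (exact paths vary by distribution):

# List all systemd timers and their next trigger times
systemctl list-timers --all
# Check classic cron entries as well
crontab -l
ls /etc/cron.weekly /etc/cron.d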

@johnseekins
Contributor Author

johnseekins commented Aug 5, 2021

We can't find anything... but I'll check again. The one suspicious thing was an MD scan of the hypervisor's OS drive... but that only happens once a month, and it triggers two hours after the event starts.

But fstrim on the boxes is scheduled for Monday at midnight!

root@store-1:~# systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2021-06-16 14:08:16 UTC; 1 months 19 days ago
    Trigger: Mon 2021-08-09 00:00:00 UTC; 3 days left
   Triggers: ● fstrim.service
       Docs: man:fstrim

Jun 16 14:08:16 store-1 systemd[1]: Started Discard unused blocks once a week.

So drive trimming is causing this? That's strange, given the data drive is a Ceph block device and the OS drive is an OpenStack block device. But I'm ready to believe it!

I've disabled fstrim on all the storage nodes ('cause whether or not that's the actual problem, there's no reason for it to be running on these hosts).
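For reference, a sketch of disabling the timer so it also stays off across reboots (run on each storage node):

# Stop the timer now and prevent it from starting on boot
systemctl disable --now fstrim.timer
# Optionally mask it so nothing can pull it back in
systemctl mask fstrim.timer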

@johnseekins
Contributor Author

This was fstrim. It may be worth documenting somewhere that fstrim can have an adverse effect on the system?

@hagen1778
Collaborator

hagen1778 commented Aug 10, 2021

Hm, I'm not sure this is exactly relevant to VM. The case I mentioned happened about 5 years ago: a Postgres cluster suffering every Sunday because of fstrim. I was just lucky to recall it now and ask you to check whether there was something similar in your system.
However, here's a commit to mention fstrim in the Tuning section.

@johnseekins
Contributor Author

It wasn't directly related to VictoriaMetrics, no. But disabling fstrim on these virtual hosts meant the weekly I/O spikes didn't happen.
