
Weekly High IO on cluster #1521

Closed
johnseekins opened this issue Aug 4, 2021 · 7 comments
Labels
question The question issue

Comments

@johnseekins
Contributor

johnseekins commented Aug 4, 2021

Describe the bug
I've described this bug in other places (#1441), but I feel like I'm stealing thunder from other issues, so I'm breaking this out to a separate issue...
At a high level:
We have a cluster configured to retain data for 1 year (-retentionPeriod=365d). Every Monday at midnight (00:00 UTC) we see a significant IO spike that directly causes data rerouting. The IO spike outlasts the re-routes, so it doesn't seem directly connected to the re-routing.
Our instances are 8 CPU x 16 GB RAM instances running atop Ceph storage (with 4 TB disks per storage node).

Because we see a significant drop in objects in Ceph itself at the same time, we suspect this might be related to an aggressive number of deletes in the cluster. The associated Ceph graph is also linked below.
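One way to cross-check the object-count drop from the Ceph side is to snapshot per-pool object counts before and after the Monday window; a minimal sketch (the pool name is a placeholder):

# Per-pool usage and object counts across the cluster
ceph df detail
# Narrow to the pool backing the vmstorage volumes (pool name is hypothetical)
rados -p <vm-storage-pool> df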

To Reproduce

  1. Have a large ingest rate (>~800k points/sec) into a large cluster (28 storage nodes).
  2. Wait a week.

Expected behavior
No sudden weekly IO spikes.

Logs
No errors discovered in the logs other than messages about query timeouts (which are clearly symptoms of the problem).
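For reference, a quick way to scan the storage logs around the event window; a sketch assuming the vmstorage systemd unit is named vmstorage (adjust dates to the relevant Monday):

# Grep the journal around Monday midnight for errors and timeouts
journalctl -u vmstorage --since "2021-08-01 23:00" --until "2021-08-02 03:00" | grep -iE "error|timeout"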

Screenshots
Load profile: (screenshot)
Re-routing of data: (screenshot)
Ceph objects being removed: (screenshot)

Version
1.63.0

Used command-line flags

 -search.maxUniqueTimeseries=300000000 -search.maxTagKeys=1000000 -search.maxTagValues=1000000000 -dedup.minScrapeInterval=1s -memory.allowedPercent=75 -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1 -retentionPeriod=365d -vminsertAddr :8400 -vmselectAddr :8401 -httpListenAddr :8482
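For readability, the same flags one per line, roughly as the vmstorage invocation would look (the binary path is a placeholder):

/path/to/vmstorage-prod \
  -search.maxUniqueTimeseries=300000000 \
  -search.maxTagKeys=1000000 \
  -search.maxTagValues=1000000000 \
  -dedup.minScrapeInterval=1s \
  -memory.allowedPercent=75 \
  -storageDataPath=/var/lib/victoriametrics/storage/prod_cluster_1 \
  -retentionPeriod=365d \
  -vminsertAddr :8400 \
  -vmselectAddr :8401 \
  -httpListenAddr :8482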
@hagen1778
Collaborator

hagen1778 commented Aug 5, 2021

Hi @johnseekins! Thank you for such a detailed issue!
Could you please provide some additional screenshots from our dashboards:

  • storage. LSM parts
  • storage. Disk writes/reads
  • storage. Active merges
  • resource usage. Open FDs
  • overview. Disk space used

Thank you!

@hagen1778 hagen1778 added the question The question issue label Aug 5, 2021
@johnseekins
Contributor Author

johnseekins commented Aug 5, 2021

Active merges and LSM parts: (screenshot)
Disk r/w and FDs: (screenshot)
Disk used: (screenshot)
And a few other interesting panels: (screenshot)

There is definitely a huge spike in indexdb and small merges during that time, but they seem to be symptoms (as they spike up suddenly during the event and then taper off...)

@hagen1778
Collaborator

Hm, the LSM parts graph suggests VM has about 6-8k parts on disk. And "vm_hdd Objects" (if I'm reading it right) suggests about 3 million objects were deleted over 2.5 hours, which doesn't correlate with the number of parts either merged or deleted by VM.
The spike in ActiveMerges can be explained by the RowsRerouted graph - every time vmstorage receives new time series, it results in new index parts on disk and subsequent merges...

Is there any chance some kind of cron job is enabled in the OS that runs every Monday at midnight? I can recall a similar issue where a cron job for the fstrim process caused lags on our SSDs every week.
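One way to enumerate everything scheduled weekly on a storage node, as a sketch (exact paths vary by distribution):

# List all systemd timers and their next trigger times
systemctl list-timers --all
# Check classic cron entries as well
crontab -l
ls /etc/cron.weekly /etc/cron.d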

@johnseekins
Contributor Author

johnseekins commented Aug 5, 2021

We can't find anything... but I'll check again. The one suspicious thing was an MD scan of the hypervisor's OS drive... but that only happens once a month, and it triggers two hours after the event starts.

But fstrim on the boxes is scheduled for Monday at midnight!

root@store-1:~# systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2021-06-16 14:08:16 UTC; 1 months 19 days ago
    Trigger: Mon 2021-08-09 00:00:00 UTC; 3 days left
   Triggers: ● fstrim.service
       Docs: man:fstrim

Jun 16 14:08:16 store-1 systemd[1]: Started Discard unused blocks once a week.

So drive trimming is causing this? That's strange, given the data drive is a Ceph block device and the OS drive is an OpenStack block device. But I'm ready to believe it!

I've disabled fstrim on all the storage nodes ('cause whether or not that's the actual problem, there's no reason for it to be running on these hosts).
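For reference, a sketch of disabling the timer so it also stays off across reboots (run on each storage node):

# Stop the timer now and prevent it from starting on boot
systemctl disable --now fstrim.timer
# Optionally mask it so nothing can pull it back in
systemctl mask fstrim.timer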

@johnseekins
Contributor Author

This was fstrim. It may be worth documenting somewhere that fstrim can have an adverse effect on the system?

@hagen1778
Collaborator

hagen1778 commented Aug 10, 2021

Hm, I'm not sure this is exactly relevant to VM. The case I mentioned happened about 5 years ago: a Postgres cluster suffering every Sunday because of fstrim. I was just lucky to recall it now and ask you to check whether there was something similar in your system.
However, here's a commit to mention fstrim in the Tuning section.

@johnseekins
Contributor Author

It wasn't directly related to VictoriaMetrics, no. But disabling fstrim on these virtual hosts meant the weekly I/O spikes didn't happen.
