Weekly High IO on cluster #1521
Comments
Hi @johnseekins! Thank you for such a detailed issue!
Thank you!
Hm, the LSM parts graph suggests VM has about 6-8k parts on disk. And "vm_hdd Objects" (if I'm reading it right) suggests about 3 million objects were deleted over the 2.5h window, which does not correlate with the number of parts either merged or deleted by VM. Is there any chance that some kind of cronjob is enabled in the OS and runs every Monday at midnight? I can recall a similar issue where a cronjob for the fstrim process caused lags for our SSDs every week.
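For reference, a quick way to look for weekly scheduled jobs on systemd-based hosts (generic commands, not specific to this cluster):

```sh
# List all systemd timers, including inactive ones, to spot anything on a weekly schedule
systemctl list-timers --all

# Check the classic cron locations for weekly entries as well
ls /etc/cron.weekly/
crontab -l
cat /etc/crontab
```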
We can't find anything...but I'll check again. The one suspicious thing was an MD scan of the hypervisor's OS drive...but that only happens once a month...and triggers two hours after the event starts. But fstrim on the boxes is scheduled for Monday at midnight!
So drive trimming is causing this? That's strange, given the data drive is a Ceph block device and the OS drive is an OpenStack block device. But I'm ready to believe it! I've disabled fstrim on all the storage nodes ('cause whether or not that's the actual problem, there's no reason for it to be running on these hosts).
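For anyone hitting the same thing: on systemd hosts the weekly trim is usually driven by fstrim.timer, whose default OnCalendar=weekly schedule fires Monday at 00:00, which lines up with the spike. A minimal sketch of how to check and disable it (run on each storage node):

```sh
# Inspect the timer and see when it last fired and will fire next
systemctl status fstrim.timer
systemctl cat fstrim.timer

# Stop and disable the timer on hosts where trimming isn't needed
sudo systemctl disable --now fstrim.timer
```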
Hm, I'm not sure this is exactly relevant to VM. The case I mentioned happened to me 5 years ago, and it was a Postgres cluster suffering every Sunday because of fstrim. I was just lucky to recall it now and ask you to check whether there is something similar in your system.
It wasn't directly related to VictoriaMetrics, no. But disabling fstrim on these virtual hosts meant the weekly I/O spikes didn't happen. |
Describe the bug
I've described this bug in other places (#1441), but I feel like I'm stealing thunder from other issues, so I'm breaking this out into a separate issue...
At a high level:
We have a cluster configured to retain data for 1 year (`-retentionPeriod=365d`). Every Monday at midnight (00:00 UTC) we see a significant IO spike that directly causes data rerouting. The IO spike outlasts the re-routes, so it doesn't seem directly connected to the re-routing. Our instances are 8 CPU x 16 GB RAM nodes running atop Ceph storage (with 4 TB disks per storage node).
Because we see a significant drop in objects in Ceph itself at the same time, we suspect this might be related to an aggressive number of deletes in the cluster. The associated Ceph graph is also linked below.
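For context, a minimal sketch of how the retention setting is passed to a vmstorage node; the binary path and data path below are illustrative placeholders, and only the `-retentionPeriod` value reflects our configuration:

```sh
# Hypothetical vmstorage invocation; paths are placeholders,
# only -retentionPeriod matches our actual setting.
/usr/local/bin/vmstorage \
  -retentionPeriod=365d \
  -storageDataPath=/var/lib/victoria-metrics
```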
To Reproduce
Expected behavior
No sudden weekly IO spikes.
Logs
No errors discovered in the logs other than messages about query timeouts (which are clearly symptoms of the problem).
Screenshots
Load profile:
Re-routing of data:
Ceph objects being removed:
Version
1.63.0
Used command-line flags