-
Notifications
You must be signed in to change notification settings - Fork 21.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run GC out-of-band by default #50449
Comments
As discussed on campfire, it's debatable whether it's a good default or not. Every time a major GC triggers past the boot sequence, is in my opinion a Ruby GC bug (or a misconfiguration). We fixed many of these "bugs" in Ruby 3.3 with Peter Zhu, and I'll continue to work with him to make major GC as rare as possible in future Ruby versions. OOB GC is a good workaround for these "bugs" if you really care about minimizing tail latency, and don't mind trading off some throughput. Especially on small deployments, Puma's Also, major GC really gain painful the larger your heap is. In Shopify's monolith a major GC takes That said, Ruby GC can be a latency hog, perhaps an alternative solution is to make GC pauses more visible, and direct people toward https://github.com/Shopify/autotuner to improve that if they notice a problem. Something that has been on my personal TODO for a while was to make it trivial to add extra timings in the request logs, because I wanted to make a gems that adds GC and GVL time in there. But now that I convinced Koichi to add |
What drawbacks do you see with running GC OOB as a bridging default while all the GC bugs and issues are getting worked out? That the GC might run too often? Curious to understand that if, say, a GC run takes 200ms on a medium sized app (I believe that's what it was for HEY), would running the GC twice as frequently as its otherwise triggered be 2x 200ms or 2x 100ms? Unless there's a large, additional overhead to running the GC somewhat more often, it seems that running it OOB by default is a win? Maybe we could also try to autotune the GC variables on the basis of how much RAM is available to the VM when booting? |
Throughput impact. While the process is running GC, it's not accepting and processing requests. It's even worse with Puma with multiple threads, because you first need to wait for all current requests to complete before you are "out of band".
GC tuning is not only function of RAM available, but mainly of the application profile. e.g. how many resident objects, how many allocations per requests etc. |
One just idea is
On multi-thread (puma-based?) web application it does not work though because other threads stops on major GC. Directly, the incremental GC is probably not tuned well enough. |
I was thinking the same. I guess I need to write the feature request :) (FYI @peterzhu2118)
Yes, Puma would need another hook to handle this correctly, but I'm sure that can be added to Puma if the Ruby feature justify a new hook. |
Thinking more about it, this should also prevent object promotion, otherwise it may make the issue worse. On paper, once boot is done, no object should ever survive a request cycle (of course it's never perfect in practice), so if we prevent promotion, it should also prevent the need for a major. |
Would like to see us pursue both paths at the same time. Both offer a OOB GC default, even if it requires a 1:1 worker/thread ratio, and then also work to remove the need for GC'ing like this between requests at all. |
Again, OOB GC is a hard tradeoff that I'm not quite confident to default on users, it's really something you need to know is there and be ready to tune. In other words, way too sharp of a tool to be enabled by default. Additionally, it's basically incompatible with small scale deployments, e.g. a single worker Puma on heroku, when OOB GC triggers you immediately go to 0 capacity for the duration of the GC, which is easily multiple hundred milliseconds. It's only viable if you have a decent number of processes. But more importantly it's not a boolean config, as you need to select a frequency that is heavily dependent on the app size, profile and also on what gems you use (native gems that create unprotected objects tend to cause major GC a lot more). So I think we could add a commented out snippet in the generated |
So I guess what I'm most curious about is whether we can find a way to only run GC when it's actually close to being required anyway. Because here's how I see the two scenarios: Request A triggers in-request GC. During this time no new clients are accepted to the server AND the current client is made to wait. Request B needs to wait until processing of A and the GC is complete to be served. Request X puts the system over the GC limit. The GC is scheduled to run OOB after the request. Request Y needs to wait until X is done and the GC is done to be served. The only way the second scenario seems worse than the first to me is if we are running GC more often than we normally would. Presumably we can find a way not to do that? |
We discussed this at a recent Shopify ruby-infra gathering and will try to get the feature described in #50449 (comment) in Ruby 3.4. One slight note though is that we believe that disabling major GC must also disable promotion of objects to the old generation. I'll update this issue once the feature request is open upstream. |
This issue has been automatically marked as stale because it has not been commented on for at least three months. |
Doesn't seem like we have a straight shot towards achieving this in an automated way. Closing for now. |
Ruby's GC can be a bit of a crude beast, and from the perspective of an application, it's awakened seemingly at random. When it runs, it smites that request with a serious penalty, and seemingly through no fault of its own. We've seen dramatic improvement in our P99 and P100 on HEY by running OOB GC, following @byroot's excellent deep-dive into the Shopify OOB GC work.
So let's enable this great trick by default, and in case we need a few simple levers to control it, let's make them few and simple.
We can use https://www.rubydoc.info/gems/puma/Puma%2FDSL:out_of_band.
cc @byroot
The text was updated successfully, but these errors were encountered: