Perf issue: High long-tail response times with KeepAlive=true when running with (or without) workers #3487
Comments
Have you seen #3443? Not sure we need more issues?
And #2311
Nope, agreed on the duplicate, closing. Thanks.
Moving a discussion from Slack over here. I was struggling to understand what exactly is going on. Others got there before me, but I'm going to re-explain in my own words/understanding, trying not to leave anything out, so hopefully someone new to the problem has a better starting place. Thanks @bensheldon and @slizco in addition to @dentarg. Also, I want to re-open this issue as a fresh discussion of the understanding of the core problem. If it fizzles out, I'm happy to close.

Variance explanation

A very short and overly simplified description of the problem is this: Puma's keepalive behavior allows requests to jump in line in the queue when they're from the same connection, even when requests from other clients/connections have been waiting for a while.

Long: The variance happens due to this. Here's a simple lifecycle of a request for one Puma thread and one connection:
Let's say there are 2 threads and the hey CLI is trying to use 3 connections. Each connection gets to completely utilize a Puma thread for up to 10 requests due to
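To make that lifecycle concrete, here's a hedged pseudocode sketch of the behavior being described; it is not Puma's actual source, and names like MAX_FAST_INLINE, read_request, and next_request_ready? are illustrative stand-ins:

```ruby
# Illustrative pseudocode only -- these helpers are stand-ins, not Puma internals.
MAX_FAST_INLINE = 10 # the default per-connection request cap discussed in this thread

def serve(connection)
  served = 0
  loop do
    request = connection.read_request            # parse the next request off this socket
    connection.write_response(handle(request))   # run the app and respond
    served += 1
    # Stay "inline" on this connection only while the same client already has
    # another request ready and we haven't hit the per-connection cap.
    break unless request.keep_alive? &&
                 served < MAX_FAST_INLINE &&
                 connection.next_request_ready?
  end
  # Only after breaking out does this thread go back to the shared queue of
  # waiting clients -- which is why other connections can wait a long time.
end
```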
Thoughts on optimization

Short: The easiest thing that I believe will improve the situation is setting the default of

Long: We're trying to tune a few different values here:
So the question is: is there a better way to tune the system to min/max the outcome that we want, or at least make good tradeoffs? I would think about this like I did with https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/. One way to do it is to think about different ideal scenarios and how different values relate to each other. In the scenario where no requests are from the same client, Puma has to pay the overhead of doing some checks, but otherwise all numbers would seem to be ideal (or at least, we cannot improve them by changing behavior). Effectively: with only this parameter to tune, you end up with an inverse correlation between two values we want to optimize.

We can maybe pick a better default, but we can also explore other signals to use to trigger a switch. By signals I mean things we can measure, such as timing information (like request/response times) or other countable things. We could make the value clock-time based instead of request based, setting a maximum time a thread can devote to a single client, like Ruby's thread "quantum." The question is how we should tune that value, or what an ideal value would be. If it's less than the median response time, then it's the same as disabling keepalive behavior. If it's too high, then we get the same behavior as today. It feels like setting the

We could maybe investigate getting more/better information from the system in terms of a "lookahead" or timing-based comparison. But tuning the default value of max_fast_inline down seems like a good sell. I like when magic numbers can be kinda explained, and would venture to suggest we set max_fast_inline's default to be the same value as Puma's max threads. I was curious what Falcon (or others) are doing with keepalive, but a quick grep didn't get me much. @ioquatix I'm curious if you've got thoughts or opinions on this issue and my analysis.
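As a concrete sketch of that suggestion (hedged: it assumes a standard config/puma.rb and that the installed Puma version exposes the max_fast_inline DSL option; the ENV variable name is just an example):

```ruby
# config/puma.rb -- cap fast-inline keepalive serving at the thread count so a
# single connection cannot hold a thread for 10 requests while others wait.
max_threads_count = Integer(ENV.fetch("PUMA_MAX_THREADS", 5))
threads max_threads_count, max_threads_count
max_fast_inline max_threads_count
```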
To validate my logic, here's some manual testing done by adjusting those values.

With puma threads = 2, max fast inline = 2, and hey connections = 3:
With puma threads = 5, max fast inline = 5, and hey connections = 6:
It's better on average, but the relationship with the upper bound didn't change. It's still average response time multiplied by
@schneems Setting max_fast_inline == max_threads makes complete sense to me and I'm a little ashamed that we didn't do that from the very beginning! Clamping it at a max value of 10 also makes sense. Improving correctness here would mean dynamically adjusting max_fast_inline according to queue latency. We could get very in the weeds on that, only to conclude that setting
@schneems
I agree with you it’s a trade-off between time spent in TCP operations versus providing “stable” request handling times. For many Puma applications (like those running on Heroku), it’s unlikely that the next request in-line (based on the time it entered the queue) will be on the same connection the thread is already handling. This is due to the roughly even distribution of requests/connections across the Heroku routers. Each router should be handling an equal fraction of the requests for an app. The same is true for Puma apps running on different cloud providers with various load balancer set-ups. I’m a bit skeptical of this:
I think the other big considerations are:
I believe the best way to ensure requests are handled in order and minimize queue time is to set

I believe that's actually the intention of this block (Lines 488 to 495 in 5fc43d7), but I am not sure it's behaving as expected. I have yet to confirm if we are getting inside this
@slizco A totally fair assessment. One option is to only even consider checking for a next request on the client if the queue is empty. That way, if someone is waiting, we get to them ASAP, but in the case that we could handle the next request without starving someone else, we do.
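A hedged sketch of that ordering (illustrative helper names, not Puma's actual condition): the inline path is only taken when nobody else is waiting.

```ruby
# Illustrative only: prefer clients already waiting in the queue; reuse the
# current connection's next request only when doing so starves no one.
def next_work(thread_pool, client)
  if thread_pool.queue_empty? && client.next_request_buffered? # hypothetical helpers
    client.next_request       # nobody is waiting, so keep serving this connection
  else
    thread_pool.pop_waiting   # someone is waiting; serve them first
  end
end
```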
Ah! We actually meant to do just that but I think the conditional is inverted!
Oh, actually we skewed it even further toward the inline check by only even checking max_fast_inline IF there was a client waiting. I'm going to PR a change to the logic to swing the pendulum back the other direction.
Thank you @schneems for digging deeper into this issue and suggesting potential improvements! ❤️ It would be interesting to see how other application servers deal with keep-alive. Another data point: we recently migrated a Node app to AWS and benchmarked it both on Heroku (no keep-alive) and AWS (keep-alive by default). With that app we don't see a difference in latencies like the one we saw with Puma.
Do we have any data on how much time this would save in the average case? Is it a few milliseconds, tens of milliseconds, or hundreds of milliseconds? In the best case it saves one round-trip time, I guess? But practically speaking, for any production app with decent traffic, the keep-alive connections would be managed by the load balancer, and the round-trip time (the slow-start fee) we are trying to save here is possibly a very short round trip, with the load balancer and application server sitting in the same region and data center. I am wondering if the savings being made here are really worth it when the downside is that other requests are starved for a much longer duration and pay a huge penalty. It might be worth adding some Puma config to easily disable this keep-alive behavior for people whose traffic pattern doesn't warrant these slow-start savings, for scenarios like the ones described above. We added a close-connection header recently to our responses so that the AWS ALB terminates the keep-alive connection it maintains with Puma, and then we started getting response times comparable to what we were getting with Heroku (which doesn't use keep-alive between its router and Puma).
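For reference, the close-connection workaround described above can be done with a small Rack middleware; this is a hedged sketch (the class name and wiring are illustrative), not something Puma itself provides:

```ruby
# Hedged sketch: ask the upstream (e.g. an AWS ALB) not to reuse its
# keep-alive connection to Puma by closing after every response.
class ForceConnectionClose
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    headers["Connection"] = "close" # signals the client/LB to drop the socket
    [status, headers, body]
  end
end

# e.g. in config.ru: use ForceConnectionClose
```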
See #3488 (comment). Some quick testing shows that it behaves essentially the same with keep-alive connections and without. Re closing connections: with Puma's current code structure, it isn't necessary to close the connection, just to remove the 'special' code that checks whether a keep-alive connection has sent another request immediately after Puma sends the response. It was this code that was causing the 'long-tail' response times. But closing connections (and reopening them) may incur a time penalty, especially with SSL connections. It depends on several things, including the 'front-end' setup (nginx, load balancers, etc). Doing so also pushes them up into the OS's socket backlog, which probably isn't best, as querying it is very OS/platform dependent...
@MSP-Greg In a lot of real-world use cases, SSL termination happens on the load balancer and there's no SSL involved when the load balancer communicates with Puma over keep-alive connections. Heroku does the same and terminates SSL at the routing layer. They also explicitly disable keep-alive for communication between the Heroku router and dynos/Puma, as mentioned here:
It would be interesting to hear why they made that choice. Overall, IMHO, it would be useful to quantify the savings keep-alive brings in the average case, compared to the cost we pay by making other requests wait. When looking at numbers with varied workloads, maybe the simpler solution without keep-alive wins, and enabling keep-alive can then be an optional setting for scenarios where it matters a lot.
I'm working on the new Heroku router, and this issue actually presented itself and was reported by our customers because the new router does, by default, support keep-alives, where the legacy router does not.
I hear you that it oftentimes may not make a huge difference to have keep-alives disabled, especially if TLS has already been terminated by the load balancer/proxy. That said, there's still time saved (it will vary considerably from set-up to set-up). Plus, keep-alives are the default in HTTP/1.1, and I would expect the default Puma server configuration to be compatible with that default HTTP behavior in a way that ensures requests are served fairly.
@slizco If that’s possible that would be great! But as per @schneems's comment, that sounds like a hard problem to solve based on the current implementation. Some trade-offs seemingly need to be made, and I am wondering if those trade-offs are worth it given the perceived benefits versus the practical loss of performance for new requests. If the algorithm can be changed to serve all requests in order while reusing any keep-alive connections that still exist, that would be ideal, but it sounded like that's hard and has other implications.
I think several people are looking into this. I also am, but I have some non-coding tasks that take priority in the immediate future.

We'll say it's not trivial. I think a solution that does not close the keep-alive connections is best. Other things also come into play, like pipelined/multiple requests, the effects when using workers, etc...
Looking forward to a solution, and great work everyone involved! ❤️ Another bit of trivia: I presented a hypothetical scenario in this comment on the original issue, and seemingly it wasn't far from reality 😄
Some more investigating this morning has led me to believe that even if you configure Lines 163 to 169 in 48b89b5
I was able to demonstrate this with
I've thrown some debug lines in the request handling code, and I can see that the latter two OR'ed conditions are sometimes true:
I believe this logic makes sense. If the number of busy threads is less than the max, a thread should be available to take the next connection in line. If there is no new connection on the socket, then why not just continue processing requests off this connection? So if either of these is true, the connection is kept open. However, I think we may be falling into a race condition. The values for

I believe the simplest, safest way for us to ensure requests are fairly served is to provide a server-side option to disable keep-alives entirely.
This doesn't address the concern about keeping connections open while still serving requests in order, but I think it's a worthy addition: #3496
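If that option ships as proposed in #3496, usage would presumably look something like this (hedged: the exact option name and the release it lands in depend on that PR):

```ruby
# config/puma.rb -- `enable_keep_alives` comes from #3496 and may not exist in
# the Puma version you have installed.
enable_keep_alives false # respond, then close the connection every time
```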
Thanks for your work on this. I may have a solution for this, but I have a lot of testing to do. As you know, the current code sends a response, and if the connection is keep-alive, it tries to read another request. I'm trying to remove that code and send the keep-alive connections back to the reactor. It does seem to even out the response times...
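A heavily simplified, hedged sketch of that idea (illustrative names only, not the actual #3506 change): after responding, a still-open keep-alive connection goes back to the reactor so it queues behind clients that were already waiting.

```ruby
# Illustrative only -- not Puma's real internals.
def finish_request(client, keep_alive)
  if keep_alive
    @reactor.add(client) # wait for the client's next request alongside everyone else
  else
    client.close
  end
end
```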
Thank you @slizco for the PR to disable keep-alive! 🌟
To all: thanks to everyone for investigating this. Please see #3506 for a possible update that resolves this issue.
Thanks @MSP-Greg! I'll take a look shortly. Also, does anyone know in what version we could expect the
I think it will be included in 6.5.0 (when that release will happen, only @nateberkopec knows). We can help out by updating #3409
@MSP-Greg I commented on the PR. I think putting the request back in the reactor is the right idea. It's just a question of also adjusting other logic to allow the reactor to continue to ingest additional TCP connections and to keep them ordered at least semi-fairly. So either your work or #3488 seems like a good starting point. If I'm right about the logic for limiting ingesting TCP connections based on the number of available threads, then I think we could solve the long tail for single workers by removing that logic when we're not in worker mode. That theory remains to be proven, but even if it's correct, that still leaves us with the problem in worker mode, which is more common.

I looked at some other webserver implementations. I'm focusing on ones that have a process-forking model (workers). The two most interesting candidates so far are NGINX and gunicorn (Python). I poked around the ecosystem in general and it seems that lots of webservers are using timeouts for keep-alive connections rather than this kind of latch-counter thing we're doing with max requests (though that's tangential: https://github.com/benoitc/gunicorn/blob/903792f152af6a27033d458020923cb2bcb11459/docs/source/settings.rst#keepalive).

I would ideally like to reproduce a similar setup to what we're doing with Puma. I spent a little while messing with nginx config and getting an

We need to simulate a constant 200ms workload with a fixed number of processes. I'm not sure of the easiest way to do that with nginx. Possibly fcgi with a Ruby process? Once we've got that case we can validate whether it would suffer the same performance characteristics under the same load. If someone has more experience with nginx and could give me some tips, or wants to try setting up a similar benchmark in another webserver (like Python), it would be helpful.

I'm thinking: if we find a north star, some existing implementation that handles this case correctly, then we don't have to re-invent the wheel. If we find out other servers have this problem too, at least we feel comfort knowing we're not alone and that it's a genuinely hard problem ™️.
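For the constant 200ms workload on the Puma side, a tiny Rack app is enough; here's a minimal sketch, assuming a plain config.ru:

```ruby
# config.ru -- every request does ~200ms of simulated work, so queueing
# effects (rather than app variance) dominate the measured latencies.
run lambda { |_env|
  sleep 0.2
  [200, { "content-type" => "text/plain" }, ["ok\n"]]
}
```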
Thanks for looking into this. Re the 'latch-counter': maybe have both it and a 'total connection time' value that can be used to close keep-alive connections? Both should be user configurable. More comments tomorrow, been a long day...
Updating here as well. I'm able to verify that #3506 removes the long tail times I was seeing.
Agreed. I'm still digging into your work, but if it really does eliminate that long tail, then we might be able to juice our throughput numbers in the scenario where many (more than 10) very small requests come in in rapid succession, if we use a timer rather than a countdown latch. That's not my main concern; I just wanted to call it out as interesting.
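A hedged sketch of that timer idea (illustrative names and budget value, not a concrete proposal): give each keep-alive connection a wall-clock budget instead of a request count.

```ruby
# Illustrative only: serve inline until a time budget expires, however many
# tiny requests that turns out to be.
INLINE_BUDGET = 0.1 # seconds; a made-up value for illustration

def serve_inline(connection)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + INLINE_BUDGET
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline &&
        (request = connection.next_ready_request) # hypothetical helper
    connection.write_response(handle(request))
  end
end
```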
I can also confirm that I am seeing improved performance with keep-alive when benchmarking that fix locally. More details in this comment:

I can also try to do more comprehensive load tests later on our AWS setup to verify that the loss of performance we were seeing before goes away with this fix. It will be easier to do that, though, when this is released in a new Puma version. I also like @schneems's suggestion on the PR about keeping both code paths for faster rollback, in case someone else uncovers issues that we might have missed in that change.
We could possibly go a step further and release this as an experimental opt-in feature that results in performance gains, to get more people to try it out. Maybe do a blog post about it as well. I am happy to help with the blog post if needed and share our AWS benchmarking results in it. If there are no complaints for a few weeks, we can make this the default in the next version, deprecate the legacy code path, and eventually retire it. That might be a safer way to roll this out gradually. Thank you everyone for the great work on this! 🌟❤️
Thanks for taking a look!
Short: The config

Long: I've tried all the ways to do gradual rollouts. Both pre-releases and opt-in config come with the problem that if no one knows about them, then they won't be used. The benefit of an opt-in config is that there are methods we could use to advertise it. For example, we could add a banner on Puma boot that checks if it's been less than 30 days since release and emits a message like "Help wanted: Please try this configuration option X and give us feedback about your experiences via ". If we end up taking a more gradual strategy, then I volunteer to take on that rollout load. I've actually not really tried that approach (a time-limited banner encouraging a feature).

The downside of a config approach is that it can be hard to know when it's safe to get rid of the old config, or worse: if you do it poorly, then the configuration suggestion might change over time. I.e. we don't want to roll out some config like I suggested

The lines of code changed in Greg's PR are pretty few, but understanding the nuance is quite difficult. I somewhat worry that trying to support both paths could introduce a problem while trying to offer an "out." It's also asking him to do a considerable amount of work on top of what's already being done.

With all that said, my preference would stand at wanting to roll this out as a default. Currently I feel very confident in my understanding of the change; there are some unknowns here: #3506 (comment). If we can increase confidence there, then I would like to roll this out as a default. If we are still unsure, then I would support shipping with a toggle and inviting people to turn it on. I can volunteer to take on that config work separately from Greg, so we could merge his change in and then I could refactor it to re-introduce the old codepath (if that's easier for him) and he can review it. I could use
I agree with you @schneems. To me, it sounds harder to get two code paths right instead of one. And I don't think we should introduce even more configs around this. We should keep it simple.
If this happens, the workaround for users is to use the previous version until a version with the bug fixed is released.
Absolutely not! If we are confident about the change then it can go out as the default. My suggestion was mostly to mitigate risk and not disrupt other people's setup in case we have missed something here.
That's mostly the scenario I am slightly worried about. Having said that, @dentarg's suggestion of people going back to the previous version can also work, even though it would cause some disruption and noise if a bug were uncovered that affected people's production setups. My suggestion was mostly to roll it out this way:
The slight benefit I see here is that we are giving people a heads-up that this is coming, and more people trying it out helps cover the risk around the "implementation bug" scenario. Having said that, both of you understand this better, and if you feel confident enough then we can certainly roll this out as the default in the next release, get rid of the old code path, and save ourselves a lot of work. Let me know if I can help out in any way. Maybe @slizco can help out as well by rolling this out to the new Heroku router and letting us know how it is performing for customers who have opted into their beta. That can help us gather more confidence around this! ❤️
Update on where we're at: #3506 (comment). I'm suggesting deprecating/removing
We don't use that, but some people on this issue claim to be using it and saw performance gains:

But that might be because of this keep-alive bug, as the issue is pretty similar.
@ghiculescu still using Puma with
We're using

We took @nateberkopec's advice on this (from a Slack conversation):

Related issues: #2311
Short: Removing this feature makes the internals easier to reason about, and I do not believe there's a good reason to disable the reactor that we want to support. Research also points to disabling the reactor being an implementation detail of an initial effort to configure the ACCEPT behavior; over time it morphed into effectively meaning "disable the reactor".

Long:

## Background

The `queue_requests` config was originally introduced in #640 in 2015. It is marked as closing issue #612 (opened in 2014, ten years ago). The use case in that issue for this option to disable the reactor was this: a worker with 1 thread needs to send a request to itself to finish. The example given is downloading a PDF from the same server. If the reactor is accepting requests as fast as it can, the same worker that sent the request might accept the request; however, since it cannot complete the first request (waiting on the second), the second request will never execute, i.e. it will deadlock.

The problem described does not come from having a reactor that buffers request bodies, but rather from the application ACCEPT-ing a new request from the socket before it had capacity to process it. Disabling the reactor was effectively an implementation detail. The issue was opened nearly 10 years ago and the justification given for wanting to run with one max thread is that "not all code is thread safe". In the decade since, the Ruby community has come a long way. Sidekiq's threaded job queue has become the industry standard. It has been an expectation for quite some time now that all Ruby code should be threadsafe. An application can avoid this deadlock situation either by not requiring this functionality (making requests to itself) or by running with more threads. It's also worth noting that the possibility of deadlock is not zero when moving to 2 threads, as both those threads might be executing a "give me a PDF" request; however, the chances are decreased. I do not believe this to be a primary use case we need to support, and further, we've since mitigated the issue in other ways.

In the same PR a `wait_until_not_full` method was introduced, but it was only called if queue_requests was disabled. This method still lives on today, and is called unconditionally before the socket is read: https://github.com/puma/puma/blob/a59afabe782ec8869bb709ca991b32c0b2cc95fd/lib/puma/server.rb#L363

What does `wait_until_not_full` do? It implements what I call "anti-thundering-herd" measures, where a worker will first check if it has capacity to handle a new request (it sees if all threads are busy or not) and only when it has capacity does it ACCEPT a new request from the socket. The situation for a thundering herd looks like this: Puma boots with multiple workers, one worker boots before the others and accepts more requests than it can process. The other workers boot, have nothing to work on, and are idle. However, it didn't always work as intended. This is one fix: 9690d8f. I was present in the 2017 roundtable mentioned in the commit. Since then, this logic has been relatively stable.

Since that was introduced, we also introduced `wait_for_less_busy_worker`, which adds a tiny sleep to the worker to handle a related yet different scenario caused by OS thread scheduling. In this scenario, the OS might favor executing a busy worker rather than one with capacity to take new requests. More details are in the PR where it was added: #2079. The solution was to add a really tiny sleep to the busy workers to hint to the OS to let the less busy ones be scheduled. This became a default

## Queue requests confusion

The `queue_requests` config that was originally introduced to gate ACCEPT behavior has eventually come to mean "disable the reactor" in the current codebase, as defaults have changed and morphed over the years. The original docs introduced in that PR hint that disabling it deactivates keepalive functionality:

> # Note that setting this to false disables HTTP keepalive and
> # slow clients will occupy a handler thread while the request
> # is being sent. A reverse proxy, such as nginx, can handle
> # slow clients and queue requests before they reach puma.

But it wasn't exactly clear what that meant, i.e. when a keepalive request is made to Puma with queue_requests disabled, what should the expected behavior be? The released version of Puma checks for several conditions before determining whether or not to send a "close" header to a keepalive request: https://github.com/puma/puma/blob/d2b3b309bf76f252819cf6c89d72249895daf629/lib/puma/request.rb#L167-L176. The ability to deactivate this keepalive behavior is recent and is staged on main; it was introduced in #3496. Notably, even HEAD does not deactivate this behavior based on the state of `queue_requests`. The current behavior is that while that logic returns `true` and the request is not placed back into the reactor (not possible when queue_requests is false), it will loop and continue to accept keepalive requests over the same connection, because this code https://github.com/puma/puma/blob/9712d29bc156a7fa05a6f4ec6deda5ac20df50be/lib/puma/server.rb#L468-L470 is inside of a while loop.

## Keepalive connection

This looping behavior for keepalive connections is not limited to when the reactor is disabled. There's a bug that currently means keepalive requests effectively monopolize a thread for up to 10 sequential requests (by default), which is described here: #3487. All signs point to needing to put the keepalive request into the reactor as part of the mitigation. However, as pointed out in this comment #3506 (comment), when request queuing is disabled we are left with a difficult path forward. We can either choose to continue monopolizing the thread and preserve the behavior that exists today, or we can close the connection, which aligns with the docs saying that this setting will not work with keepalive requests. Rather than make that determination for the user, this change suggests we should remove that codepath as an invalid configuration option.

## Usage

As part of informal research into usage of the feature, I asked on Mastodon and no one appears to be using it:

- Link: https://ruby.social/@Schneems/113289823703358711
- Screenshot: https://www.dropbox.com/s/9t1ur0rpqz5zkku/Screenshot%202024-10-14%20at%2012.43.27%E2%80%AFPM.png?dl=0

I asked in Slack groups and the only example someone could give was this issue comment where (funny enough) they're using it to try to diagnose the very same buggy keepalive behavior: #2311 (comment). From my personal experience debugging Ruby applications at Heroku, I have seen this feature used in the wild, so I know it's used. It's just that when it is used, people are not aware that they're losing slow-client protection and disabling the reactor. Usually they're after some other side-effect that disabling it caused. In short, I don't believe it's heavily used, and for the cases where it is used, I don't believe those are valid use cases.
In the comments I added a note requesting people inform us of their use cases if they feel they have a valid one so we can investigate it and possibly add a different configuration option.
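For anyone following along, the setting under discussion is toggled like this (it exists today and defaults to true):

```ruby
# config/puma.rb -- setting this to false disables the reactor (and with it
# slow-client buffering), which is the codepath the comment above proposes
# removing.
queue_requests false
```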
I've introduced a PR to remove this.

@camallen I think the reason it helped is that you're seeing the behavior described here (#3487). When you disable that setting, it has the effect of breaking out of a

Line 493 in 36c7307
Thanks for the links and detailed reasoning, @schneems.

For us, times have changed since #3487; we're running Puma with 2 threads in cluster mode now. I think we'd be fine running without
Here's a writeup about this issue on the Heroku blog: https://blog.heroku.com/pumas-routers-keepalives-ohmy (references this issue)
Describe the bug
When using `ab` with `-k`, or using `hey` (a Go-based a/b network testing library), I see very high long-tail response times, an order of magnitude higher than what I would expect.

AB without keepalive:
Longest request took 432ms
With keepalive:
Longest request took 2,477ms (or 5.7x longer than without keepalive).
Puma config:
To Reproduce
bundle exec puma -t5 -p 3000
hey
Extra observations
In `server.rb` the codepaths diverge (puma/lib/puma/server.rb, Lines 467 to 469 in 796d8c6) based on `prepare_response()` in `lib/puma/request.rb`. The return value of true indicates keepalive and false indicates no keepalive.