
Perf issue: High long-tail response times with KeepAlive=true when running with (or without) workers #3487

Open · schneems opened this issue Sep 13, 2024 · 41 comments · May be fixed by #3506

@schneems
Contributor

Describe the bug

When using ab with -k, or using hey (a Go-based HTTP load-testing tool), I see very high long-tail response times, an order of magnitude higher than what I would expect.

AB without keepalive:

$  ab -c 10 -n 200 127.0.0.1:3000/
...
Percentage of the requests served within a certain time (ms)
  50%    420
  66%    421
  75%    422
  80%    422
  90%    423
  95%    424
  98%    426
  99%    428
 100%    432 (longest request)

Longest request took 432ms

With keepalive:

$  ab -c 10 -n 200 -k 127.0.0.1:3000/
...
Percentage of the requests served within a certain time (ms)
  50%    205
  66%    205
  75%    206
  80%    206
  90%   1852
  95%   2258
  98%   2273
  99%   2450
 100%   2477 (longest request)

Longest request took 2,477ms (or 5.7x longer than without keepalive).

Puma config:

$ cat Gemfile
source "https://rubygems.org"

gem "puma"
gem "rack"
⛄️ 3.3.1 🚀 /private/tmp/e3f9bf2d7f5e861b6de95d7e48a1b5b0
$ cat config.ru
require 'rack'

response = ['200', {'Content-Type' => 'text/html'}, ['Hello, World!']]
app = Proc.new do |env|
  sleep 0.2
  response
end

run app
$ bundle exec puma -t5 -p 3000
Puma starting in single mode...
* Puma version: 6.4.2 (ruby 3.3.1-p55) ("The Eagle of Durango")
*  Min threads: 5
*  Max threads: 5
*  Environment: development
*          PID: 7434
* Listening on http://0.0.0.0:3000
Use Ctrl-C to stop

To Reproduce

  • Use the above example app that sleeps for 200ms
  • Boot with bundle exec puma -t5 -p 3000
  • Hit with AB and keepalive enabled or hey
$ hey -c 10 -n 200 http://localhost:3000/

Summary:
  Total:	8.3877 secs
  Slowest:	2.4628 secs
  Fastest:	0.2005 secs
  Average:	0.3612 secs
  Requests/sec:	23.8445

  Total data:	2600 bytes
  Size/request:	13 bytes

Response time histogram:
  0.201 [1]	|
  0.427 [184]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.653 [0]	|
  0.879 [0]	|
  1.105 [0]	|
  1.332 [0]	|
  1.558 [0]	|
  1.784 [0]	|
  2.010 [0]	|
  2.237 [0]	|
  2.463 [15]	|■■■

Extra observations

  • Enabling workers makes the problem go away
  • Disabling keepalive makes the problem go away

In server.rb the codepaths diverge

puma/lib/puma/server.rb, lines 467 to 469 (at 796d8c6):

case handle_request(client, requests + 1)
when false
  break
That could be a reasonable place to look, along with prepare_response() in lib/puma/request.rb. A return value of true indicates keepalive; false indicates no keepalive.
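
To make the behavior easier to picture, here's a condensed, illustrative model of the inline keep-alive loop described in this issue. It is a toy sketch, not Puma's actual source; handle_request here is a stand-in lambda:

# Toy model: a thread keeps serving buffered requests from one keep-alive
# connection, up to max_fast_inline, before yielding the connection back.
max_fast_inline = 10

# Stand-in for Puma's handle_request: true means "keep-alive, more to do",
# false means "stop inlining / close the socket".
buffered_requests = 13
handle_request = -> { (buffered_requests -= 1) > 0 }

served = 0
served += 1 while served < max_fast_inline && handle_request.call

puts "served #{served} request(s) inline before yielding the thread"
# => served 10 request(s) inline before yielding the thread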

@schneems
Contributor Author

Similar issue from earlier #2625 h/t @slizco

@dentarg
Member

dentarg commented Sep 13, 2024

Have you seen #3443? Not sure we need more issues?

@dentarg
Member

dentarg commented Sep 13, 2024

And #2311

@schneems
Contributor Author

Nope, agreed on the duplicate, closing. Thanks.

@schneems
Contributor Author

Moving a discussion from Slack over here. I was struggling to understand what exactly is going on. Others got there before me, but I'm going to re-explain it in my own words, trying not to leave anything out, so that someone new to the problem has a better starting place.

Thanks @bensheldon and @slizco in addition to @dentarg.

Also, I want to re-open this issue as a fresh discussion of the core problem. If it fizzles out, I'm happy to close.

Variance explanation

A very short and overly simplified description of the problem is this: Puma's keepalive behavior allows requests to jump ahead in the queue when they're from the same connection, even when requests from other clients/connections have been waiting for a while.

Long: The variance happens because of the following. Here's a simplified lifecycle of a request for one Puma thread handling one connection:

  • A request comes in, the reactor sees it's fully buffered, hands it to a thread
  • The thread processes the request, sleeping 200ms (due to application code)
  • When done, a response is delivered. Puma sees keepalive is true, so the connection is still open; it takes the next request from the same client and repeats this process
  • That repeats up to 10 times, until Puma sees that the default max_fast_inline has been reached

Let's say there are 2 threads and the hey CLI is trying to use 3 connections. Each connection gets to completely monopolize a Puma thread for up to 10 requests due to max_fast_inline. Let's call the hey connections A, B, and C and the Puma threads one and two.

  • Connection A makes a request. It's received by Puma thread one and follows the pattern above.
  • Connection B makes a request. It's received by Puma thread two and follows the pattern above.
  • Connection C makes a request and is stuck waiting in the reactor or on the TCP socket.
  • Thread one keeps processing requests until it hits its inline maximum and then returns the connection to the reactor.
  • Connection C's first request is picked up by thread one. At this point it has had to wait 10 x 200ms to be processed, so its total round-trip time is 2000ms + 200ms, which is what we're seeing (a quick worked version of this arithmetic is below).
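
Back-of-the-envelope check of that worst case (assumes every request takes the full 200ms from the example app and the default max_fast_inline of 10):

request_time    = 0.2  # seconds of app work per request (the sleep in config.ru)
max_fast_inline = 10   # Puma's current default
queue_wait = max_fast_inline * request_time  # => 2.0 seconds spent waiting in line
total      = queue_wait + request_time       # => 2.2 seconds round trip for connection C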

Thoughts on optimization

Short: The easiest change that I believe will improve the situation is setting the default of max_fast_inline to the same value as the maximum number of threads. The theoretical "best" solution might not be possible; it would involve preferring to preserve the order in which requests were received over minimizing TCP connection overhead.

Long: We're trying to tune a few different values here:

  • Minimize TCP slowstart overhead (the value prop of Keepalive)
  • Minimize request variance
  • Maximize server throughput
  • Maximize client throughput

So the question is: is there a better way to tune the system to min/max the outcomes we want, or at least make good tradeoffs? I would think about this the way I did with https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/.

One way to do this is to think about different ideal scenarios and how the different values relate to each other:
In the scenario where all requests are from the same client, we would ideally never want to close the connection, so the client never has to pay the TCP slowstart fee again. However, as soon as you have a single request from a different connection, you would block it forever. Basically, this strategy provides the absolute minimum TCP slowstart overhead, but the absolute maximum request variance.

In the scenario where no requests are from the same client, Puma has to pay the overhead of doing some checks, but otherwise all numbers would seem to be ideal (or at least, we cannot improve them by changing behavior).

Those are the two extremes.

The current strategy for dealing with a mixed workload, or a sustained load where the number of clients is higher than the number of threads, is to treat each client fairly by limiting the total number of requests it is allowed to jump the queue with before being asked to wait its turn. But fair doesn't always mean equal, and this leads to very high request variance. If we tune this number up, the TCP slowstart overhead gets lower, but at the cost of higher request variance. Tuning it down to a low number (like 1 or 0) would effectively disable the keepalive behavior, maximizing slowstart overhead in exchange for minimal request variance.

Effectively: with only this parameter to tune, you end up with an inverse correlation between two values we want to optimize. We can maybe pick a better default, but we can also explore other signals to trigger a switch. By signals I mean things we can measure, such as timing information (like request/response times) or other countable things.

We saw that we could minimize request variance by disabling this keepalive behavior, but that maximizes TCP slowstart overhead. Is there something we could signal off of that doesn't have a 1:1 inverse correlation? If the problem is line-jumping, then an idealized algorithm could be: always process requests in the order they arrive, and if the next request is from the same client, don't close the connection. However, I'm guessing that this is either impossible or difficult to know. In the absence of perfect information we can use a proxy; possibly we could use timing values? My one reservation is that constructing time objects can be expensive, and comparing them can be as well. Also, depending on where you're getting the timing data from, it might be unreliable.

We could make the value clock-time based instead of request-count based: set a maximum amount of time a thread can devote to a single client, like Ruby's thread "quantum." The question is how we should tune that value, or what an ideal value would be. If it's less than the median response time, it's the same as disabling the keepalive behavior. If it's too high, we get the same behavior as today.
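
As a rough illustration of that idea, here's a hypothetical sketch of a wall-clock budget per connection. The serve_one_request lambda and the 0.5s budget are made up for the example; none of this is Puma API:

KEEP_ALIVE_QUANTUM = 0.5 # seconds; how to pick this value is exactly the open question above

serve_one_request = -> { sleep 0.2; true } # simulates the 200ms app; true = connection still open

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
loop do
  break unless serve_one_request.call                 # connection closed by the client
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  break if elapsed >= KEEP_ALIVE_QUANTUM              # budget spent: yield the thread
end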

It feels like setting max_fast_inline to the same value as the maximum number of threads is a reasonable default. It's not perfect; a system with 128 threads could force a client to wait 25,600 ms (25.6 seconds). But on average there should be one new thread taking a new client every "cycle" (if you define a cycle as every Puma thread processing one request). From this lens, the problem seems to be about the ratio of max_fast_inline to threads. This seems easy and somewhat intuitive. I just worry that the delta between ideal and actual behavior is larger than we would expect. We could set it to min(max_threads, 10) and avoid unexpected terminal behavior; at worst we're saying the behavior wouldn't change versus today.

We could maybe investigate getting more/better information from the system in terms of a "lookahead" or timing-based comparison. But tuning the default value of max_fast_inline down seems like an easy sell. I like when magic numbers can be explained, and would venture to suggest we set max_fast_inline's default to the same value as Puma's max threads (a config sketch is below).
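
For reference, the proposed default can already be expressed as an explicit setting today. A minimal config/puma.rb sketch, assuming max_fast_inline keeps its current config DSL name and using RAILS_MAX_THREADS purely as an example environment variable:

max_threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads_count, max_threads_count

# Pin the inline limit to the thread count, clamped at the current default of 10,
# per the min(max_threads, 10) suggestion above.
max_fast_inline [max_threads_count, 10].min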

I was curious what Falcon (or others) are doing with keepalive, but a quick grep didn't get me much. @ioquatix I'm curious if you've got thoughts or opinions on this issue and my analysis.

@schneems
Contributor Author

schneems commented Sep 17, 2024

To validate my logic, here's some manual testing with max_fast_inline adjusted down.

With puma threads = 2, max fast inline = 2, and hey connections = 3:

$ hey -c 3 -n 200 http://localhost:3000/

Summary:
  Total:	20.7240 secs
  Slowest:	0.6296 secs
  Fastest:	0.2008 secs
  Average:	0.3077 secs
  Requests/sec:	9.5542

  Total data:	2574 bytes
  Size/request:	13 bytes

Response time histogram:
  0.201 [1]	|
  0.244 [103]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.287 [0]	|
  0.329 [0]	|
  0.372 [0]	|
  0.415 [43]	|■■■■■■■■■■■■■■■■■
  0.458 [49]	|■■■■■■■■■■■■■■■■■■■
  0.501 [0]	|
  0.544 [0]	|
  0.587 [0]	|
  0.630 [2]	|■


Latency distribution:
  10% in 0.2030 secs
  25% in 0.2049 secs
  50% in 0.2090 secs
  75% in 0.4167 secs
  90% in 0.4418 secs
  95% in 0.4487 secs
  99% in 0.6296 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0008 secs, 0.2008 secs, 0.6296 secs
  DNS-lookup:	0.0004 secs, 0.0000 secs, 0.0042 secs
  req write:	0.0001 secs, 0.0000 secs, 0.0006 secs
  resp wait:	0.3066 secs, 0.2007 secs, 0.6252 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0010 secs

Status code distribution:
  [200]	198 responses

With puma threads = 5, max fast inline = 5, and hey connections = 6:

$ hey -c 6 -n 200 http://localhost:3000/

Summary:
  Total:	9.0548 secs
  Slowest:	1.2390 secs
  Fastest:	0.2005 secs
  Average:	0.2432 secs
  Requests/sec:	21.8670

  Total data:	2574 bytes
  Size/request:	13 bytes

Response time histogram:
  0.201 [1]	|
  0.304 [174]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.408 [3]	|■
  0.512 [13]	|■■■
  0.616 [0]	|
  0.720 [3]	|■
  0.824 [1]	|
  0.927 [2]	|
  1.031 [0]	|
  1.135 [0]	|
  1.239 [1]	|


Latency distribution:
  10% in 0.2022 secs
  25% in 0.2039 secs
  50% in 0.2051 secs
  75% in 0.2065 secs
  90% in 0.4093 secs
  95% in 0.4170 secs
  99% in 1.2390 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0002 secs, 0.2005 secs, 1.2390 secs
  DNS-lookup:	0.0002 secs, 0.0000 secs, 0.0022 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0003 secs
  resp wait:	0.2428 secs, 0.2004 secs, 1.2353 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0017 secs

Status code distribution:
  [200]	198 responses

It's better on average, but the relationship to the upper bound didn't change: it's still roughly the average response time multiplied by (max_fast_inline + 1), i.e. 200ms × (5 + 1) ≈ 1,200ms.

@evanphx
Member

evanphx commented Sep 17, 2024

@schneems Setting max_fast_inline == max_threads makes complete sense to me and I'm a little ashamed that we didn't do that from the very beginning! Clamping it at a max value of 10 also makes sense.

Improving correctness here would mean dynamically adjusting max_fast_inline according to queue latency. We could get very in the weeds on that, only to conclude that setting max_fast_inline == min(max_threads, 10) results in effectively the same latency gains. Given the work required, I'm fully on board with making this change.

@slizco
Contributor

slizco commented Sep 17, 2024

@schneems
This is 100% my understanding as well:

  • Connection C makes a request and is stuck waiting in the reactor or on the TCP socket.
  • Thread one keeps processing requests until it hits its inline maximum and then returns the connection to the reactor.
  • Connection C's first request is picked up by thread one. At this point it has had to wait 10 x 200ms to be processed, so its total round-trip time is 2000ms + 200ms, which is what we're seeing.

I agree with you that it's a trade-off between time spent in TCP operations versus providing "stable" request handling times. For many Puma applications (like those running on Heroku), it's unlikely that the next request in line (based on the time it entered the queue) will be on the same connection the thread is already handling. This is due to the roughly even distribution of requests/connections across the Heroku routers: each router should be handling an equal fraction of the requests for an app. The same is true for Puma apps running on other cloud providers with various load balancer setups.

I’m a bit skeptical of this:

It feels like setting max_fast_inline to the same value as the maximum number of threads is a reasonable default. It's not perfect; a system with 128 threads could force a client to wait 25,600 ms (25.6 seconds). But on average there should be one new thread taking a new client every "cycle" (if you define a cycle as every Puma thread processing one request). From this lens, the problem seems to be about the ratio of max_fast_inline to threads. This seems easy and somewhat intuitive. I just worry that the delta between ideal and actual behavior is larger than we would expect. We could set it to min(max_threads, 10) and avoid unexpected terminal behavior; at worst we're saying the behavior wouldn't change versus today.

I think the other big considerations are:

  • The number of connections inbound to the Puma worker. If there are more connections than threads, you can almost guarantee requests will be waiting in the queue for a duration of max_fast_inline*request_handling_time (under current logic). If we used min(max_threads, 10) as the comparison value, that at best improves the longest queue times to max_threads*request_handling_time. With more threads, that number gets worse.
  • The typical request handling times really matter here. If the application processes requests quickly, all <1ms, processing requests inline is not such a big deal. If there is any sort of mid-to-long delay (like our 200ms artificial server lag), you're going to start to see request queue times upwards of 2s. If you use min(max_threads, 10) as the comparison value, you'll see the highest queue times at ~1s with max_threads=5 (see the sketch below).
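
A quick sketch of that comparison using the 200ms example app from this issue (assumes a waiting connection is stuck behind a thread that serves its full inline budget):

request_handling_time = 0.2  # seconds
current_default  = 10 * request_handling_time            # max_fast_inline = 10            => 2.0s queued
proposed_default = [5, 10].min * request_handling_time   # min(max_threads, 10), 5 threads => 1.0s queued
puts "current: ~#{current_default}s, proposed: ~#{proposed_default}s"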

@slizco
Contributor

slizco commented Sep 17, 2024

I believe the best way to ensure requests are handled in order and minimize queue time is to set max_fast_inline=1 and return the client to the reactor at the end of each request, without closing the connection. That means the reactor has to do more work than if a thread could process 10 requests from a client in a row, but at least it keeps connections stable at the TCP level.

I believe that's actually the intention of this block:

puma/lib/puma/server.rb, lines 488 to 495 (at 5fc43d7):

unless next_request_ready
  break unless @queue_requests
  client.set_timeout @persistent_timeout
  if @reactor.add client
    close_socket = false
    break
  end
end

but I am not sure it's behaving as expected.

I have yet to confirm whether we are getting inside this unless statement; I'll test this out soon.

@evanphx
Member

evanphx commented Sep 17, 2024

@slizco A totally fair assessment. One option is to only consider checking for a next request on the client if the queue is empty. That way, if someone is waiting, we get to them ASAP; but in the case that we could handle the next request without starving someone else, we do.

@evanphx
Member

evanphx commented Sep 17, 2024

Ah! We actually meant to do just that but I think the conditional is inverted!

@evanphx
Member

evanphx commented Sep 17, 2024

Oh, actually we skewed it even further toward the inline check by only checking max_fast_inline if there was a client waiting. I'm going to open a PR changing the logic to swing the pendulum back the other direction.

@amitsaxena

Thank you @schneems for digging deeper into this issue and suggesting potential improvements! ❤️

It would be interesting to see how other application servers deal with keep-alive. Another data point: we recently migrated a Node app to AWS and benchmarked it on Heroku (no keep-alive) and AWS (keep-alive by default). We don't see a latency difference there similar to what we saw with Puma.

then we would ideally never want to close the connection so they never have to pay the TCP slowstart fee again

Do we have any data on how much time this would save in the average case? Is it a few milliseconds, tens of milliseconds, or hundreds of milliseconds? In the best case it saves one round trip, I guess? Practically speaking, for any production app with decent traffic, the keep-alive connections would be managed by the load balancer, and the round trip (slowstart fee) we are trying to save here is likely a very short one, with the load balancer and application server sitting in the same region and data center. I am wondering whether the savings made here are really worth it when the downside is that other requests are starved for a much longer duration and pay a huge penalty.

It might be worth adding some Puma config to easily disable this keep-alive behavior for people whose traffic pattern doesn't warrant the slowstart savings in scenarios like those described above. We recently added a close-connection header to our responses so that the AWS ALB terminates the keep-alive connection it maintains with Puma, and we then started getting response times comparable to what we were getting on Heroku (which doesn't use keep-alive between its router and Puma).
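
For anyone wanting to replicate that workaround, here's a minimal sketch of adding such a header via Rack middleware (the middleware name is made up; whether the upstream load balancer honors the header depends on your setup):

# Adds "Connection: close" to every response so an upstream proxy/load balancer
# tears down its keep-alive connection to Puma after each request.
class ForceConnectionClose
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    headers["Connection"] = "close"
    [status, headers, body]
  end
end

# config.ru
#   use ForceConnectionClose
#   run app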

@MSP-Greg
Member

See #3488 (comment)

Some quick testing shows that it behaves essentially the same with keep-alive connections and without.

Re closing connections, with Puma's current code structure, it isn't necessary to close the connection, just remove the 'special' code that checks if a keep-alive connection has sent another request immediately after Puma sends the response. It was this code that was causing the 'long-tail' response times.

But, closing connections (and reopening them) may incur a time penalty, especially with SSL connections. It depends on several things, including 'front-end' setup (nginx, load-balancers, etc). Doing so also pushes them up into the OS's socket backlog, which probably isn't best, as querying it is very OS/platform dependent...

@amitsaxena

But, closing connections (and reopening them) may incur a time penalty, especially with SSL connections.

@MSP-Greg In a lot of real-world use cases, SSL termination happens on the load balancer and there's no SSL involved when the load balancer communicates with Puma over keep-alive connections. Heroku does the same and terminates SSL at the routing layer. They also explicitly disable keep-alive for communication between the Heroku router and dynos/Puma, as mentioned here:

The router will assume that the client wants to close the connection on each request (no keep-alive).
https://devcenter.heroku.com/articles/http-routing#http-versions-supported

It would be interesting to hear why they made that choice.

Overall, IMHO it would be useful to quantify the savings keep-alive provides in the average case compared to the cost we pay by making other requests wait. Looking at numbers across varied workloads, maybe the simpler solution without keep-alive wins, and enabling keep-alive can become an optional setting for scenarios where it matters a lot.

@slizco
Contributor

slizco commented Sep 18, 2024

@amitsaxena

They also explicitly disable keep-alive for communication between Heroku router and dynos/puma as mentioned here:

I'm working on the new Heroku router, and this issue actually surfaced and was reported by our customers because the new router does, by default, support keep-alives, where the legacy router does not.

When looking at numbers with varied workloads, maybe the simpler solution without keep-alive wins. And then enabling keep-alive can be an optional setting for scenarios where it matters a lot.

I hear you that it oftentimes may not make a huge difference to have keep-alives disabled, especially if TLS has already been terminated by the load balancer/proxy. That said, there's still time saved (it will vary considerably from setup to setup). Plus, keep-alives are the default in HTTP/1.1, and I would expect the default Puma server configuration to be compatible with that default HTTP behavior in a way that ensures requests are served fairly.

@amitsaxena

in a way that ensures requests are served fairly

@slizco If that's possible, that would be great! But as per @schneems's comment, that sounds like a hard problem to solve with the current implementation. Some trade-offs seemingly need to be made, and I wonder whether those trade-offs are worth it given the perceived benefits versus the practical loss of performance for new requests. If the algorithm can be changed to serve all requests in order while reusing any keep-alive connections that still exist, that would be ideal, but it sounds like that's hard and has other implications.

@MSP-Greg
Member

If the algorithm can be changed to serve all requests in order and reusing any keep-alive connections

I think several people are looking into this. I also am but I have some non-coding tasks that take priority in the immediate future.

it sounded like that’s hard and has other implications.

We'll say it's not trivial. I think a solution that does not close the keep-alive connections is best. There are other things to consider, like pipelined/multiple requests, the effect when using workers, etc...

@dentarg
Member

dentarg commented Sep 19, 2024

Just a little fun fact: the very first issue on Puma was the request for keep-alive support :) #1 (added with 997c18)

@amitsaxena

amitsaxena commented Sep 19, 2024

I think several people are looking into this.

Looking forward to a solution and great work everyone involved! ❤️

Another bit of trivia: I presented a hypothetical scenario in this comment on the original issue, and seemingly it wasn't far from reality 😄

So theoretically speaking if Heroku started using keep-alive in their router as an enhancement, all apps using puma will perform exponentially worse compared to today :)

@slizco
Contributor

slizco commented Sep 19, 2024

Some more investigating this morning has led me to believe that even if you configure max_fast_inline=1, that is not a bullet-proof way to force connections to close. It's still possible for a thread to continue pulling requests off its current connection, due to this conditional in handle_request:

puma/lib/puma/request.rb, lines 163 to 169 (at 48b89b5):

# Close the connection after a reasonable number of inline requests
# if the server is at capacity and the listener has a new connection ready.
# This allows Puma to service connections fairly when the number
# of concurrent connections exceeds the size of the threadpool.
force_keep_alive = requests < @max_fast_inline ||
  @thread_pool.busy_threads < @max_threads ||
  !client.listener.to_io.wait_readable(0)

I was able to demonstrate this with hey. In both cases, Puma is running with max_fast_inline=1 and more concurrent connections than threads (max = 5):

❯ hey -c 15 -q 15 -z 30s http://0.0.0.0:3000/fixed

Summary:
  Total:	30.5479 secs
  Slowest:	1.2286 secs
  Fastest:	0.2013 secs
  Average:	0.6217 secs
  Requests/sec:	23.8642

  Total data:	2187 bytes
  Size/request:	3 bytes

Response time histogram:
  0.201 [1]	|
  0.304 [147]	|■■■■■■■■■■■■■■■■■■■■
  0.407 [0]	|
  0.509 [5]	|■
  0.612 [0]	|
  0.715 [288]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.818 [169]	|■■■■■■■■■■■■■■■■■■■■■■■
  0.920 [98]	|■■■■■■■■■■■■■■
  1.023 [4]	|■
  1.126 [15]	|■■
  1.229 [2]	|


Latency distribution:
  10% in 0.2049 secs
  25% in 0.6273 secs
  50% in 0.6575 secs
  75% in 0.7936 secs
  90% in 0.8303 secs
  95% in 0.8399 secs
  99% in 1.0383 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0009 secs, 0.2013 secs, 1.2286 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0003 secs
  resp wait:	0.6206 secs, 0.2011 secs, 1.2276 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0007 secs

Status code distribution:
  [200]	729 responses

~
❯ hey -disable-keepalive -c 15 -q 15 -z 30s http://0.0.0.0:3000/fixed

Summary:
  Total:	30.5141 secs
  Slowest:	0.6583 secs
  Fastest:	0.2112 secs
  Average:	0.6243 secs
  Requests/sec:	23.7595

  Total data:	2175 bytes
  Size/request:	3 bytes

Response time histogram:
  0.211 [1]	|
  0.256 [4]	|
  0.301 [0]	|
  0.345 [0]	|
  0.390 [0]	|
  0.435 [5]	|
  0.479 [0]	|
  0.524 [0]	|
  0.569 [0]	|
  0.614 [1]	|
  0.658 [714]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 0.6243 secs
  25% in 0.6264 secs
  50% in 0.6287 secs
  75% in 0.6309 secs
  90% in 0.6323 secs
  95% in 0.6334 secs
  99% in 0.6365 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0010 secs, 0.2112 secs, 0.6583 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0006 secs
  resp wait:	0.6231 secs, 0.2091 secs, 0.6573 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0063 secs

Status code distribution:
  [200]	725 responses

I've thrown some debug lines in the request handling code, and I can see that the latter two OR'ed conditions are sometimes true:

FORCE KEEPALIVE: true. (requests < @max_fast_inline): false, (@thread_pool.busy_threads < @max_threads): true, (!client.listener.to_io.wait_readable(0)): false
Connections kept-alive: 109
FORCE KEEPALIVE: true. (requests < @max_fast_inline): false, (@thread_pool.busy_threads < @max_threads): true, (!client.listener.to_io.wait_readable(0)): false
Connections kept-alive: 110
FORCE KEEPALIVE: true. (requests < @max_fast_inline): false, (@thread_pool.busy_threads < @max_threads): false, (!client.listener.to_io.wait_readable(0)): true
Connections kept-alive: 111

I believe this logic makes sense. If the number of busy threads is less than the max, a thread should be available to take the next connection in-line. If there is no new connection on the socket, then why not just continue processing requests off this connection? So if either of these is true, the connection is kept open.
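
Restating that conditional with one comment per clause (the same logic as the excerpt from lib/puma/request.rb quoted above, with annotations added for readability):

force_keep_alive =
  requests < @max_fast_inline ||                # still under the inline request cap
  @thread_pool.busy_threads < @max_threads ||   # a spare thread exists for other connections
  !client.listener.to_io.wait_readable(0)       # no new connection is waiting on the listener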

However, I think we may be falling into a race condition. The values for @thread_pool.busy_threads and client.listener.to_io.wait_readable(0) are constantly changing, and being accessed by multiple threads at a time. Those threads will then take actions that affect those values.

I believe the simplest, safest way for us to ensure requests are fairly served is to provide a server-side option to disable keep-alives entirely.
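
Once such an option exists, opting out would be a one-line config change. A sketch, assuming the DSL method ships under the name discussed later in this thread and in #3496 (enable_keep_alives):

# config/puma.rb
enable_keep_alives false  # serve each request, then close the connection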

@slizco
Contributor

slizco commented Sep 19, 2024

This doesn't address the concern about keeping connections open while still serving requests in order, but I think it's a worthy addition: #3496

@MSP-Greg
Member

@slizco

Thanks for your work on this. I may have a solution for this, but I have a lot of testing to do. As you know the current code sends a response, and if the connection is keep-alive, it tries to read another request. I'm trying to remove that code and send the keep-alive connections back to the reactor. It does seem to even out the response times...

@amitsaxena

Thank you @slizco for the PR to disable keep-alive! 🌟

@MSP-Greg
Member

To all,

Thanks to everyone for investigating this.

Please see #3506 for a possible update that resolves this issue.

@slizco
Contributor

slizco commented Oct 1, 2024

Thanks @MSP-Greg! I'll take a look shortly.

Also, does anyone know in what version we could expect the enable_keep_alives option to be released? Thanks

@dentarg
Member

dentarg commented Oct 1, 2024

I think it will be included in 6.5.0 (when that release will happen only @nateberkopec knows). We can help out by updating #3409

@schneems
Contributor Author

schneems commented Oct 2, 2024

@MSP-Greg I commented on the PR. I think putting the request back in the reactor is the right idea. It's just a question of also adjusting other logic to allow the reactor to continue to ingest additional TCP connections and to keep them ordered at least semi-fairly. So either your work or #3488 seems like a good starting point.

If I'm right about the logic for limiting the ingestion of TCP connections based on the number of available threads, then I think we could solve the long tail for single-mode servers by removing that logic when we're not in worker mode. That theory remains to be proven, but even if it's correct, that still leaves us with the problem in worker mode, which is more common.

I looked at some other webserver implementations, focusing on ones that have a process-forking model (workers). The two most interesting candidates so far are NGINX and gunicorn (Python). I poked around the ecosystem in general, and it seems that lots of webservers use timeouts for keep-alive connections rather than the kind of latch-counter we're doing with max requests (though that's tangential: https://github.com/benoitc/gunicorn/blob/903792f152af6a27033d458020923cb2bcb11459/docs/source/settings.rst#keepalive).

I would ideally like to reproduce a setup similar to what we're doing with Puma. I spent a little time messing with nginx config and getting an echo_sleep server working (https://gist.github.com/schneems/610e16906e30914eba80459952d1f1c6), but once I did, I found that echo_sleep happens asynchronously instead of blocking, so it's not simulating the same thing.

We need to simulate a constant 200ms workload with a fixed number of processes. I'm not sure of the easiest way to do that with nginx. Possibly FastCGI with a Ruby process? Once we've got that case, we can validate whether it suffers the same performance characteristics under the same load. If someone has more experience with nginx and could give me some tips, or wants to try setting up a similar benchmark in another webserver (like Python), that would be helpful.

I'm thinking: if we find a north star, some existing implementation that handles this case correctly, then we don't have to re-invent the wheel. If we find out other servers have this problem too, at least we can take comfort knowing we're not alone and that it's a genuinely hard problem ™️.

@MSP-Greg
Member

MSP-Greg commented Oct 2, 2024

@schneems

Thanks for looking into this.

Re the 'latch-counter': maybe have both it and a 'total connection time' value that can be used to close keep-alive connections? Both should be user-configurable.

More comments tomorrow, been a long day...

@schneems
Contributor Author

schneems commented Oct 3, 2024

Updating here as well. I'm able to verify that #3506 removes the long tail times I was seeing.

Re the 'latch-counter', maybe have both it and a 'total connection time' value that can be used to close ka connections? Both should be user configurable.

Agreed. I'm still digging into your work, but if it really does eliminate that long tail, then we might be able to juice our throughput numbers in the scenario where many (more than 10) very small requests come in in rapid succession if we use a timer rather than countdown latch. That's not my main concern, I wanted to call it out as interesting.

@amitsaxena

I can also confirm that I am seeing improved keep-alive performance when benchmarking that fix locally. More details in this comment:
#3443 (comment)

I can also try and do more comprehensive load tests later on our AWS setup to verify that the loss of performance we were seeing before goes away with this fix. It will be easier to do that though when this is released with a new puma version.

I also like @schneems's suggestion on the PR about keeping both the code paths for faster rollback in case someone else uncovers issues that we might have missed in that change.

one option would be to ship code that has both codepaths and allow the end user to configure it if they need. We just introduced enable_keepalives config on HEAD, we could make it accept true/false/:legacy where :legacy is the current behavior on master

We could possibly go a step further and release this as an experimental opt-in feature that delivers performance gains, to get more people to try it out. Maybe do a blog post about it as well. I am happy to help with the blog post if needed and share our AWS benchmarking results in it. If there are no complaints for a few weeks, we can make this the default in the next version, deprecate the legacy code path, and eventually retire it. That might be a safer way to roll this out gradually.

Thank you everyone for great work on this! 🌟❤️
You are making the internet faster! 😄 💪

@schneems schneems changed the title Perf issue: High longtail response times with KeepAlive=true when running without workers Perf issue: High longtail response times with KeepAlive=true when running with(or without) workers Oct 9, 2024
@schneems
Contributor Author

schneems commented Oct 9, 2024

Thanks for taking a look!

That might be a safer way to roll this out gradually.

Short: The config enable_keepalives false already allows anyone to opt out of this behavior. It's not quite the same as "legacy/current behavior", but it is still a safety valve. I see no reason why someone would actively want the old behavior, other than that the new behavior has a bug that needs to be fixed. We're going to need to roll this fix out as the new default someday; if we do a major rev and advertise that people can either disable keepalive behavior altogether or use the older major version, that would be my preference. The thing that could change my mind is whether we feel we need to roll out something that we have medium-to-high uncertainty about.

Long:

I've tried all the ways to do gradual rollouts. Both pre-releases and opt-in config come with the problem that if no one knows about them, they won't be used. The benefit of an opt-in config is that there are methods we could use to advertise it. For example, we could add a banner on Puma boot that checks whether it's been less than 30 days since release and emits a message like "Help wanted: Please try this configuration option X and give us feedback about your experiences via ". If we end up taking a more gradual strategy, then I volunteer to take on that rollout load.

I've actually not really tried that approach (time limited banner encouraging a feature).

The downside of a config approach is that it can be hard to know when it's safe to get rid of the old config, or worse: if you do it poorly, the configuration suggestion might change over time. I.e. we don't want to roll out some config like enable_awesome :yes and then, 3 months down the road, have to explain to people that enable_awesome :yes actually slows down their code and they should instead be using the default. One of the biggest offenders is having a config value take a "new" and "current" state, but what happens when you need to add a third state and :current is no longer current? Or what if it's not a temporary/temporal state, but needs to live on forever as a named value that better represents what it's doing? Then you have to warn people: "sorry, :current is no longer valid, use :blerg instead".

I suggested :legacy here, which kind of falls into the new/current pitfall. But I really don't think the "legacy" state should live on forever. The benefit of keeping two codepaths alive here forever is small, and the cost would be large. The current behavior is "monopolize a Puma thread to respond to multiple requests in a connection in an unfair order," which I really don't think anyone wants. I really only think the reason someone would want to opt out of the new behavior is if there's a bug in our implementation. The delta between "to opt out of this behavior, change one line in your config" versus "to opt out of this behavior, change one line in your Gemfile" is roughly the same, IMHO. Not to mention, we have another config that was just introduced, enable_keepalives false, that will prevent this new logic from firing, which people can use to opt out if they encounter problems.

The number of lines changed in Greg's PR is pretty low, but understanding the nuance is quite difficult. I somewhat worry that trying to support both paths could introduce a problem while trying to offer an "out." It's also asking him to do a considerable amount of work on top of what's already being done.

With all that said, my preference still stands: roll this out as the default. Currently I feel very confident in my understanding of the change; there are some unknowns here: #3506 (comment). If we can increase confidence there, then I would like to roll this out as the default. If we are still unsure, then I would support shipping with a toggle and inviting people to turn it on. I can volunteer to take on that config work separately from Greg, so we could merge his change in and then I could refactor it to re-introduce the old codepath (if that's easier for him) and he can review it. I could use enable_keepalives :reactor to indicate the new behavior, which I expect will live on beyond this temporary state, while enable_keepalives true would preserve existing behavior. Once we are confident/comfortable, enable_keepalives true and enable_keepalives :reactor (both being truthy) would have the same effect and not require anyone to update their config after the default change.

@dentarg
Member

dentarg commented Oct 10, 2024

I agree with you @schneems. To me, it sounds harder to get two code paths right than one. And I don't think we should introduce even more configs around this. We should keep it simple.

I really only think the reason someone would want to opt out of the new behavior is if there's a bug in our implementation.

If this happens, the workaround for users is to use the previous version until a version with the bug fixed is released.

@amitsaxena

But I really don't think the "legacy" state should live on forever.

Absolutely not! If we are confident about the change then it can go out as the default. My suggestion was mostly to mitigate risk and not disrupt other people's setup in case we have missed something here.

I really only think the reason someone would want to opt out of the new behavior is if there's a bug in our implementation

That's mostly the scenario I am slightly worried about. Having said that, @dentarg's suggestion of people going back to the previous version can also work, even though it would cause some disruption and noise if a bug were uncovered that affected people's production setups.

My suggestion was mostly to roll it out this way:

  1. Retain existing code path as default and provide the new code path as an opt-in in next Puma version. Market the new code path with a promise of better performance and encourage people to try it out while some of us benchmark it on our production setups.
  2. Let it stay for a couple of weeks while people try it out and if no bugs are reported, we release a new version of Puma with the new code path as default and old code path removed.

The slight benefit I see here is that we are giving people a heads-up that this is coming, and more people trying it out helps cover the risk around the "implementation bug" scenario. Having said that, both of you understand this better, and if you feel confident enough, we can certainly roll this out as the default in the next release, get rid of the old code path, and save ourselves a lot of work. Let me know if I can help out in any way.

Maybe @slizco can help out as well by rolling this out to the new Heroku router and letting us know how it performs for customers who have opted into their beta. That would help us gather more confidence around this! ❤️

@schneems
Contributor Author

Update on where we're at: #3506 (comment). I'm suggesting deprecating/removing queue_requests, which definitely would not be easily toggle-able and more or less takes a rollout strategy off the table. If you use queue_requests false, I'm curious to hear more.

@amitsaxena

If you use queue_requests false I'm curious to hear more.

We don't use that, but some people on this issue claim to be using it and saw performance gains:
#2311 (comment)

But that might be because of this keep-alive bug as the issue is pretty similar.

@dentarg
Member

dentarg commented Oct 14, 2024

@ghiculescu still using Puma with queue_requests false in production?

@camallen

We're using queue_requests = false at Tanda. We have puma behind an nginx reverse proxy to protect us from slow clients.

We took @nateberkopec's advice on this (from a Slack conversation):

"disable. it's a poorly name setting, really what it does is turn request buffering on/off. nginx will do that for you, so you don't need puma to do it"

related issues, #2311

schneems added a commit that referenced this issue Oct 15, 2024
- Short: Removing this feature makes the internals easier to reason about, and I do not believe there's a good reason to disable the reactor that we want to support. Research also suggests that disabling the reactor was an implementation detail of an initial effort to configure the ACCEPT behavior, and over time it morphed into effectively meaning "disable the reactor".

- Long: 

## Background

The `queue_requests` config was originally introduced in #640 in 2015. It is marked as closing this issue #612 (opened in 2014, ten years ago). 

The use case in the issue for this option to disable the reactor was this: 

A worker with 1 thread needs to send a request to itself to finish. The example given is downloading a PDF from the same server. If the reactor is accepting requests as fast as it can, the same worker that sent the request might accept it; however, since it cannot complete the first request (it is waiting on the second), the second request will never execute, i.e. it will deadlock.

The problem described does not come from having a reactor that buffers request bodies, but rather from the application ACCEPT-ing a new request from the socket before it had capacity to process it. Disabling the reactor was effectively an implementation detail.

The issue was opened nearly 10 years ago, and the justification given for wanting to run with one max thread is that "not all code is thread safe". In the decade since, the Ruby community has come a long way. Sidekiq's threaded job queue has become the industry standard, and it's been an expectation for quite some time now that all Ruby code should be thread-safe. An application can avoid this deadlock situation by either not requiring this functionality (making requests to itself) or running with more threads. It's also worth noting that the possibility of deadlock is not zero when moving to 2 threads, as both threads might be executing a "give me a PDF" request; however, the chances are decreased. I do not believe this to be a primary use case we need to support, and further, we've since mitigated the issue in other ways.

In the same PR, a `wait_until_not_full` method was introduced, but it was only called if queue_requests was disabled. This method still lives on today and is called unconditionally before the socket is read: https://github.com/puma/puma/blob/a59afabe782ec8869bb709ca991b32c0b2cc95fd/lib/puma/server.rb#L363

What does `wait_until_not_full` do? It is what I call an "anti-thundering-herd" measure: a worker first checks whether it has capacity to handle a new request (i.e. whether all its threads are busy) and only when it has capacity does it ACCEPT a new request from the socket. The thundering-herd situation looks like this: Puma boots with multiple workers; one worker boots before the others and accepts more requests than it can process; the other workers boot, have nothing to work on, and sit idle. However, it didn't always work as intended. This is one fix: 9690d8f. I was present at the 2017 roundtable mentioned in the commit. Since then, this logic has been relatively stable.
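
A minimal sketch of that accept loop (illustrative only, not Puma's actual code; `wait_until_not_full` is the real ThreadPool method being described, the rest is simplified):

loop do
  thread_pool.wait_until_not_full  # block while every thread in this worker is busy
  io = listener.accept             # only ACCEPT a new connection once there is capacity
  thread_pool << io                # hand the connection to the pool
end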

Since that was introduced, we also added `wait_for_less_busy_worker`, which handles a related yet different scenario caused by OS thread scheduling: the OS might favor executing a busy worker rather than one with capacity to take new requests. More details are in the PR where it was added, #2079. The solution was to add a really tiny sleep to the busy workers to hint to the OS to let the less busy ones be scheduled. This became a default.

## Queue requests confusion

The `queue_requests` config, originally introduced to gate ACCEPT behavior, has come to mean "disable the reactor" in the current codebase as defaults have changed and morphed over the years.

While the original docs introduced in that PR hint that disabling it deactivates keepalive functionality:

>      # Note that setting this to false disables HTTP keepalive and
>      # slow clients will occupy a handler thread while the request
>      # is being sent. A reverse proxy, such as nginx, can handle
>      # slow clients and queue requests before they reach puma.

It wasn't exactly clear what that meant, i.e. when a keepalive request is made to Puma with queue_requests disabled, what should the expected behavior be?

The released version of Puma checks several conditions before determining whether or not to send a "close" header on a keepalive request: https://github.com/puma/puma/blob/d2b3b309bf76f252819cf6c89d72249895daf629/lib/puma/request.rb#L167-L176. The ability to deactivate this keepalive behavior is recent and is staged on main; it was introduced in #3496. Notably, even HEAD does not deactivate this behavior based on the state of `queue_requests`.

The current behavior is that, while that logic returns `true` and the request is not placed back into the reactor (which is not possible when queue_requests is false), Puma will loop and continue to accept keepalive requests over the same connection, because this code https://github.com/puma/puma/blob/9712d29bc156a7fa05a6f4ec6deda5ac20df50be/lib/puma/server.rb#L468-L470 is inside a while loop.

## Keepalive connection

This looping behavior for keepalive connections is not limited to when the reactor is disabled. There's a bug that currently means that keepalive requests effectively monopolize a thread for up to 10 sequential requests (by default) which is described here #3487.

All signs point to needing to put the keepalive request back into the reactor as part of the mitigation. However, as pointed out in this comment, #3506 (comment), when request queuing is disabled we are left with a difficult path forward. We can either continue monopolizing the thread and preserve the behavior that exists today, or close the connection, which aligns with the docs saying that this setting does not work with keepalive requests. Rather than make that determination for the user, this change suggests we remove that codepath as an invalid configuration option.

## Usage

As part of informal research into usage of the feature, I asked on Mastodon and no one appears to be using it:

- Link: https://ruby.social/@Schneems/113289823703358711
- Screenshot: https://www.dropbox.com/s/9t1ur0rpqz5zkku/Screenshot%202024-10-14%20at%2012.43.27%E2%80%AFPM.png?dl=0

I asked in Slack groups, and the only example someone could give was this issue comment where (funnily enough) it's used to try to diagnose the very same buggy keepalive behavior: #2311 (comment).

From my personal experience debugging Ruby applications at Heroku, I have seen this feature used in the wild, so I know it's used. It's just that when it is used, people are not aware that they're losing slow-client protection and disabling the reactor. Usually they're after some other side-effect that disabling it caused.

In short, I don't believe it's heavily used, and for the cases where it is used, I don't believe those are valid use cases. In the comments I added a note requesting that people inform us of their use cases if they feel they have a valid one, so we can investigate it and possibly add a different configuration option.
@schneems
Contributor Author

schneems commented Oct 16, 2024

I've opened a PR to remove the queue_requests setting; the commit message has a lot of historical detail: #3524.

@camallen I think the reason it helped is that you're seeing the behavior described here (#3487): when you disable that setting, it has the effect of breaking out of a while loop at this line:

break unless @queue_requests

If we fix the underlying behavior, then it shouldn't be needed. Basically, I think the fix for you wasn't disabling the reactor, but taking a different codepath that's only tangentially related.

@camallen

thanks for the links and detailed reasoning @schneems.

If we fix the underlying behavior then it shouldn't be needed.

For us, times have changed since #3487; we're running Puma with 2 threads in cluster mode now. I think we'd be fine running without queue_requests(false) if we avoid any latency issues.

@dentarg dentarg added the perf label Nov 4, 2024
@dentarg
Member

dentarg commented Nov 27, 2024

Here's a writeup about this issue on the Heroku blog: https://blog.heroku.com/pumas-routers-keepalives-ohmy (references this issue)
