The Heroku API uses GCRA. I wrote a client-side rate-throttling algorithm that minimizes sleep time while distributing it fairly across all clients, with no state needed other than what’s returned by the server.

I described how here: https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/
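For anyone who hasn’t run into GCRA before, the server-side bookkeeping is tiny: a single timestamp per token. This is only a sketch of the textbook algorithm, not Heroku’s actual implementation, and the parameter names are mine:

```ruby
# GCRA in its "virtual scheduling" form: the only per-token state is the
# theoretical arrival time (TAT) of the next request.
class GCRA
  def initialize(rate:, burst:)
    @emission_interval = 1.0 / rate          # seconds "charged" per request
    @tolerance = @emission_interval * burst  # how far ahead of schedule a token may run
    @tat = Time.now.to_f                     # theoretical arrival time
  end

  def allow?(now = Time.now.to_f)
    return false if @tat > now + @tolerance  # too far ahead of schedule: rate limited
    @tat = [@tat, now].max + @emission_interval
    true
  end
end
```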
I get the impression your task would have been a lot easier if the server told your client the permitted rate, so the client did not have to work it out!
Like the 4500 number I mention in the article? I agree.
Or, if you mean the sleep value, pushing that to the server would require much more state persistence. It’s not that big of a deal to persist the data per client, but centralizing it all would be more work for the server.
I saw in your article that you mentioned exposing when the next request will be available. If clients use that naively, you get a thundering herd where potentially hundreds of clients race for one valid request and then do it over and over again. Maybe if the server kept track of how many clients got a 429 and were waiting on retry, it could fold that into the suggested sleep time. But I’m betting there are some non-trivial edge cases.
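To make that failure mode concrete, this is roughly what I mean by “naively” (a sketch only; the reset value is hypothetical, use whatever the server actually exposes):

```ruby
# Naive retry: every client that got a 429 sleeps until the advertised
# reset time, then fires immediately. With hundreds of clients on one
# token they all wake at the same instant and race for a single slot.
def naive_retry(reset_epoch)
  sleep([reset_epoch - Time.now.to_f, 0].max)
end

# Jitter spreads the herd out a bit, but it is still a guess without
# knowing how many other clients are waiting on the same token.
def jittered_retry(reset_epoch, max_jitter: 5.0)
  sleep([reset_epoch - Time.now.to_f, 0].max + rand * max_jitter)
end
```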
I meant the server could tell the client, eg, 10 requests per second. I got the impression the 4500 number was the number of requests available in the bucket (the burst size), is that right?
I thought the server had per-client rate limits, not a global rate limit. I would have thought a global limit would allow a busy client to starve out others. Did you have some cunning way to avoid that problem?
It’s a per-token rate limit. Many clients can use the same token, and even the same “client” might be multi-threaded or multi-process.

There’s no way for the server to guarantee, or even to know, that one and only one client is using a token unless it did something like issuing a nonce with each response and requiring it back on the next request (effectively forcing serial requests only).
10 requests per second
The problem is that the server needs to know how many clients are using a token for it to give out a useful number. Otherwise each client sees “10 per second”, and if you double the number of clients, the server would need some way to figure out that it should start telling them “5 per second” instead. The clients themselves don’t know whether they are the only one or one of a million sharing the same token. You could tell them, but that would require distributed communication, which is expensive and difficult.
Maybe there’s some way it could be calculated via request rate and Little’s law or whatever. But any additional state requirement on the server demands additional resources (which we are already trying to conserve) and complexity. Distributed computing is fundamentally a very challenging domain to get correct and performant.
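The arithmetic itself is trivial; it’s the denominator that nobody can see. Purely illustrative numbers:

```ruby
limit_per_hour = 4500                           # per-token limit, the number from the article
clients        = 10                             # invisible to the server and to each client
fair_share     = limit_per_hour / clients.to_f  # => 450.0 requests/hour each

clients    = 20                                 # someone spins up more CI runs...
fair_share = limit_per_hour / clients.to_f      # => 225.0, and no client saw anything change
```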
Did you have some cunning way to avoid that problem?
Kinda the whole point of the link I posted: exponential backoff with proportional decrease. Basically TCP slow-start, but in reverse. But it relies on all clients sharing the same algorithm.
In the end the algorithm is quite simple. The hard part was developing it and then proving it was better.
It’s maybe like 30 lines of code. If you use Ruby you can just use my gem.
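For a flavor of the shape of it, here’s a stripped-down sketch. This is not the gem’s actual code: the constants are invented, the exact decrease formula is in the article, and the response object is assumed to expose the status code and the remaining-limit value the server already sends.

```ruby
# "Exponential backoff with proportional decrease", roughly: back off
# multiplicatively on 429, and when requests succeed, shrink the sleep by
# an amount proportional to how much headroom the server reports.
class RateThrottle
  def initialize(limit: 4500.0)
    @limit     = limit   # server's advertised maximum (the 4500 from the article)
    @sleep_for = 0.0     # the only state a client keeps
  end

  # Call with a block that performs one API request and returns the response.
  def call
    loop do
      sleep(@sleep_for)
      response = yield

      if response.status == 429
        # Rate limited: multiplicative increase, with jitter so clients
        # sharing a token don't stay synchronized.
        @sleep_for = [@sleep_for, 1.0].max * (1.5 + rand)
      else
        # Success: proportional decrease. The decrease scales with the
        # remaining capacity the server reports, using no shared state.
        remaining   = response.remaining.to_f
        @sleep_for -= @sleep_for * (remaining / @limit) / 10.0
        return response
      end
    end
  end
end
```

Usage would look something like `RateThrottle.new.call { client.get("/apps") }` (the `client` here is hypothetical); the idea is that every process converges on a fair share using only its own local sleep value plus what the server already returns.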
Ah, I see now. I was wondering about customers competing for resources, but the per-token limit deals with that. Your client-side rate limiter is for cases where many clients are run by the same customer using the same token and sharing a server-side rate limit, and they need to co-operate without shared state. I didn’t imagine one customer would create so much work for themselves!

Thanks for taking the time to explain :-)
You mostly run into it when a “client” is a service of some kind. For instance, in my case I create and tear down ~90 Heroku apps on prod as part of integration testing for the Ruby buildpack via the API. Multiple CI runs can be happening in parallel, and each test makes multiple API calls. It doesn’t take much to chew through an hour’s worth of limits.
Client-side rate throttling helps global server load too, since the server spends less time saying “429, try again later” and more time responding to real requests. But my main goal was improving the client experience.
On its own terms this metaphor doesn’t make much sense. If the bucket is leaking, why is it bad for it to overflow?
A “filly bucket” (where a hose pours water in at a constant rate and a request is allowed if you can fill a cup from what’s in the bucket) is more logical but less euphonious. It’s exactly the same as “token bucket”, except that token bucket implies discreteness (and probably a fixed cost of 1) whereas the leaky/filly bucket leads you to think about continuous flow and arbitrary costs.
I encourage anyone to come up with a snappier name and promote it a bit; it’d have a fair chance of catching on.
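To spell out what I mean, here’s the “filly bucket” sketched in Ruby (names invented for illustration; continuous refill, arbitrary cost per request):

```ruby
# Water pours in at `rate` units per second up to `capacity`; a request of
# `cost` units is allowed only if you can fill your cup from what's in the
# bucket right now.
class FillyBucket
  def initialize(rate:, capacity:)
    @rate, @capacity = rate.to_f, capacity.to_f
    @level      = @capacity
    @updated_at = Time.now.to_f
  end

  def allow?(cost = 1.0)
    now = Time.now.to_f
    @level = [@level + (now - @updated_at) * @rate, @capacity].min
    @updated_at = now

    return false if cost > @level
    @level -= cost
    true
  end
end
```

As far as I can tell this is the same arithmetic GCRA does: the bucket’s water level and GCRA’s “theoretical arrival time” are two ways of storing the same single number.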
Heh :-) TBH I don’t think “leaky bucket” really needs a better name; its existing name is so old and well-known I doubt it can be displaced. Even so I like to pick apart dodgy compsci metaphors because playing with alternative analogies is an oblique strategy for examining an idea from different angles and getting to know it better.
On the other hand I think GCRA desperately needs a more catchy name and/or an evocative metaphor. “Earliest Allowed”? “Always-Slow Clock”? Maybe “Tortoise and Hare” because the analogy is deliberately broken? Maybe “Alice’s White Rabbit” is a better animal-themed literary reference? I haven’t come up with a winner yet.
I wonder whether anyone has tried to fold system load into the concept of time. I.e. time flows slower if the system is under load from other requests.
It sounds like you’re looking for queue management algorithms, such as CoDel, PIE or BBR.