In the previous post, I wrote a small Redis clone using the most naïve manner. It was able to hit nearly 1M queries per second on our test instance (c6g.4xlarge, using 16 cores and 64 GB of memory). Before we get any deeper into optimization, it is worth understanding where the time is actually being spent. I run the server under a profiler, to see the various costs.
I like using dotTrace as a profiler, while using the Tracing mode, since that gives me execution time as well as the number of calls. Often enough I can reason a lot about the system performance just from those details.
Take a look at the following stats, this is the breakdown of costs in the actual processing of the connection:
And here it is when we break it up by
You can see that the cost of FlushAsync() dominates. I’m going to form a hypothesis here. When we call FlushAsync() on the StreamWriter, we’ll also flush to the underlying stream. Looking deeper into the call stack that looks like we’ll need a separate packet per command at the TCP level.
What will happen if we’ll change the StreamWriter’s AutoFlush to true, which will cause it to write immediately to the underlying stream, but won’t call the flush on the TCP stream. That will allow the TCP stream to buffer writes more efficiently.
The code change involved is removing the FlushAsync() calls and initializing the StreamWiter like so:
Let’s run the benchmark again, which will give us (on my development machine):
- 138,979.57 QPS – using AutoFlush = true
- 139,653.98 QPS – using FlushAsync
Either option is a wash, basically. But here is why:
Basically, AutoFlush set to true will flush not just the current stream, but also the underlying stream, putting us in the same position.
The problem is that we need to flush, otherwise we may buffer results in memory that won’t be sent to the client. Redis benchmarks rely heavily on pipelining (sending multiple commands at once), but it is entirely possible that you’ll get a bunch of commands, write them (to the buffer) and then not send anything to the client since the output buffer isn’t full. We can optimize this quite easily, using the following change:
What I’m doing here is writing to the StreamWriter directly, and I’ll only flush the buffer if there is no more input waiting. That should reduce the number of packets we send significantly, and it does. Running the benchmark again gives us:
- 229,783.30 QPS – using delayed flushing
That is almost twice as fast, which is impressive, for such a small change. The idea is that we are able to buffer our writes far more, but not delay them too much. If we write enough to the StreamWriter buffer, it will flush itself automatically, and we’ll only actually flush the StreamWriter manually when we have nothing further to read, which we do in parallel with the reading itself.
Here is the new cost structure:
And the actual methods called:
If we’ll compare this to the first profiling results, we can find some really interesting numbers. Before, we have called FlushAsync per command (see the ExecuteCommand & FlushAsync), now we call this a lot less often).
You can see that most of the time is now in the “business logic” for this system, and from the subsystems breakdown, a lot of the cost is now in the collections.
The GC costs here also went down significantly (~5%). I’m fairly certain that this is because we flush to the TCP stream, but I didn’t check too much.
Note that string processing and GC take a lot of time, but the Collections / ExecuteCommand is taking the vast majority of the costs.
If we look into that, we’ll see:
And that is… interesting.
Mostly because the major costs are in TryAddInternal. We know that there is high contention in this scenario, but 92% of the time spent in the method directly? What is it doing? Looking at the code, it becomes obvious:
The ConcurrentDictionary is sharding the calls between the locks. And the number of locks is defined by the number of the cores we have by default. The more concurrency we have, the more we can benefit from increasing the amount. I tried setting this to 1024 and running it under the profiler, and this gave me a few percentage points improvements, but not much more. Valuable, but not at the level we are playing with.
Even so, we managed to get some interesting details from this exploration. We know that we’ll have to deal with the dictionary implementation, since it takes roughly 50% of our time. I also want to pay some attention to these numbers:
Right now, we need to figure out how to make it faster in terms of collections, but we also have to consider overall GC costs as well as the string processing details. More on that in the next post.