<h2>A twisted tale of memory optimization</h2>
<p>I was looking into reducing the allocation in a particular part of our code, and I ran into what was basically the following code (boiled down to the essentials):</p>
<blockquote>
<script src="https://gist.github.com/ayende/5212d11a369922d2b4408d08890e831c.js"></script>
</blockquote>
<p>As you can see, this does a <em>lot</em> of allocations. The actual method in question was a pretty good size, and all those operations happened in different locations and weren’t as obvious.</p>
<p>Take a moment to look at the code, how many allocations can you spot here?</p>
<p>The first one, obviously, is the string allocation, but there is another one inside the call to GetBytes(). Let’s fix that first by allocating the buffer once (I’m leaving aside the allocation of the reusable buffer; you can assume it is big enough to cover all our needs):</p>
<blockquote>
<script src="https://gist.github.com/ayende/c4e0decb2442767e9bce88da7c5be89e.js"></script>
</blockquote>
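<p>To make that concrete, here is a minimal sketch of that kind of change (my illustration, not the embedded gist; the buffer name and size are assumptions): GetBytes() writes into a reusable byte buffer instead of allocating a new array on every call.</p>
<blockquote>
<pre><code>using System;
using System.Text;

static class Writer
{
    // Sketch only - names and buffer size are illustrative, not the actual gist.
    private static readonly byte[] BytesBuffer = new byte[256]; // assumed big enough for our needs

    public static int WriteValue(long value)
    {
        string text = value.ToString(); // still allocates a string; that is the next fix
        // Writing into the reusable buffer avoids the byte[] that GetBytes(string) would allocate.
        return Encoding.UTF8.GetBytes(text.AsSpan(), BytesBuffer.AsSpan());
    }
}
</code></pre>
</blockquote>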
<p>For that matter, we can also easily fix the second problem, by avoiding the string allocation:</p>
<blockquote>
<script src="https://gist.github.com/ayende/30fa55843832bfff0e630557fa75247e.js"></script>
</blockquote>
<p>That is a few minutes of work, and we are good to go. This method is called <em>a lot</em>, so we can expect a huge reduction in the amount of memory that we allocated.</p>
<p>Except… that didn’t happen. In fact, the amount of memory that we allocate remained pretty much the same. Digging into the details, we allocate roughly the same number of byte arrays (how!) and instead of allocating a lot of strings, we now allocate a lot of character arrays.</p>
<p>I broke the code apart into multiple lines, which made things a lot clearer. (<a href="https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA+ABATARgLABQGADAAQY4B0AKjAgC4DchJ5VAkgPLMEsDM5LKQDKYGADsAhlACWEUgG9CpFSPqxJAW1IB9AM7qYWnqtLAAnvRgBtALq7YAVz2TgAGxgAhRwDMfMKBNVDAEMFFIAdVkrAAowAAtpOzNffygAGlIZcXpSD3EASkVlU1US0tIAN2kzSxg9UgBeUgBRcUgAE2yAc0oAVWoAMQAOSgBxGHpPOr0Y4FSA60pKfNtMnScXdy8FqAKgiv1DLUoomViNmGdXD28/ReWLKz1bffLSAF9CD6A==">In fact, I threw that into SharpLab, to be honest</a>). Take a look:</p>
<blockquote>
<script src="https://gist.github.com/ayende/942a494f64f10df43d3af3886b475189.js"></script>
</blockquote>
<p>This code: buffer[..len] is actually translated to:</p>
<blockquote>
<pre>char[] charBuffer = RuntimeHelpers.GetSubArray(buffer, Range.EndAt(len));</pre>
</blockquote>
<p>That will, of course, allocate. I had to change the code to be very explicit about the types that I wanted to use:</p>
<blockquote>
<script src="https://gist.github.com/ayende/12049752c22b3574fe5d432039d6f481.js"></script>
</blockquote>
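<p>The exact change is in the gist above, but the shape of the fix is roughly the following sketch (names and buffer sizes are mine, not the original code): keep everything typed as spans, so that slicing never touches an array.</p>
<blockquote>
<pre><code>using System;
using System.Text;

static class Writer
{
    // Sketch of the allocation-free shape - names and buffer sizes are mine, not the original code.
    private static readonly char[] CharBuffer = new char[32];
    private static readonly byte[] BytesBuffer = new byte[256];

    public static int WriteValue(long value)
    {
        // Format the number straight into a char buffer - no string allocated.
        value.TryFormat(CharBuffer.AsSpan(), out int len);

        // Slicing a Span<char> is just bookkeeping; slicing the char[] itself would
        // go through RuntimeHelpers.GetSubArray() and allocate a new array.
        ReadOnlySpan<char> chars = CharBuffer.AsSpan(0, len);

        return Encoding.UTF8.GetBytes(chars, BytesBuffer.AsSpan());
    }
}
</code></pre>
</blockquote>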
<p>This will not allocate, but if you look at the changes in the code, you can see that the use of <em>var</em> in this case really tripped me up, because of the number of overloads and the automatic coercion of types that <em>didn’t</em> happen.</p>
<p>For that matter, note that <em>any</em> slicing on arrays will generate a <em>new</em> array, including this code:</p>
<blockquote>
<script src="https://gist.github.com/ayende/339bacdd5907b08534ea76f399ff89d4.js"></script>
</blockquote>
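<p>To illustrate the difference (my own snippet, not the gist):</p>
<blockquote>
<pre><code>using System;

class SliceDemo
{
    static void Main()
    {
        byte[] buffer = new byte[64];

        byte[] copy = buffer[..16];              // allocates a new 16-byte array (GetSubArray)
        Span<byte> view = buffer.AsSpan()[..16]; // no allocation - a bounded view over the same memory

        Console.WriteLine(copy.Length == view.Length); // True, but only one of the two cost an allocation
    }
}
</code></pre>
</blockquote>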
<p>This makes perfect sense when you realize what is going on, but it can still be a big surprise. I looked at the code <em>a lot</em> before I figured out what was going on, and that was with a profiler output that pinpointed the fault.</p>
<h2>A performance profile mystery: The cost of Stopwatch</h2>
<p>Measuring the length of time that a particular piece of code takes is a surprisingly challenging task. There are two aspects to this: the first is how you ensure that the cost of getting the start and end times won’t interfere with the work you are doing; the second is how to actually get the time (potentially <em>many</em> times a second) as efficiently as possible.</p>
<p>To give some context, <a href="https://aakinshin.net/posts/stopwatch/">Andrey Akinshin</a> does a great overview of how the Stopwatch class works in C#. On Linux, that is basically calling the <em>clock_gettime</em> system call, except that this is <em>not</em> a system call. That is actually a piece of code that the Kernel sticks inside <em>your</em> process that will then integrate with other aspects of the Kernel to optimize this. The idea is that this system call is <em>so frequent</em> that you cannot pay the cost of the Kernel mode transition. There is good coverage <a href="https://berthub.eu/articles/posts/on-linux-vdso-and-clockgettime/">of this here</a>.</p>
<p>In short, that is a very well-known problem and quite a lot of brainpower has been dedicated to solving it. And then we reached this situation:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_thumb.png" alt="image" width="1201" height="141" border="0" /></a></p>
<p>What you are seeing here is us testing the indexing process of RavenDB under the profiler. This is indexing roughly 100M documents, and according to the profiler, we are spending 15% of our time gathering metrics?</p>
<p>The StatsScope.Start() method simply calls Stopwatch.Start(), so we are basically looking at a profiler output that says that Stopwatch is accounting for 15% of our runtime?</p>
<p>Sorry, I don’t believe that. I mean, it is <em>possible</em>, but it seems far-fetched.</p>
<p>In order to test this, I wrote a <a href="https://gist.github.com/ayende/563e0e4b0dae1bf986a660076f143aa6">very simple program</a>, which will generate 100K integers and test whether they are prime or not. I’m doing that to test compute-bound work, basically, and testing calling Start() and Stop() either across the whole loop or in each iteration.</p>
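<p>The linked program is the authoritative version; a rough sketch of that kind of benchmark (my reconstruction, not the actual gist) looks like this:</p>
<blockquote>
<pre><code>using System;
using System.Diagnostics;

class StopwatchBench
{
    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (long i = 2; i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }

    static void Main()
    {
        var rnd = new Random(2023);
        var total = Stopwatch.StartNew(); // one Stopwatch across the whole loop
        int primes = 0;

        for (int i = 0; i < 100_000; i++)
        {
            // Variant under test: a Stopwatch started and stopped on every iteration.
            var perIteration = Stopwatch.StartNew();
            if (IsPrime(rnd.Next())) primes++;
            perIteration.Stop();
        }

        total.Stop();
        Console.WriteLine($"{primes} primes, {total.ElapsedMilliseconds} ms");
    }
}
</code></pre>
</blockquote>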
<p>I run that a few times and I’m getting:</p>
<ul>
<li>Windows: 311 ms with Stopwatch per iteration and 312 ms without</li>
<li>Linux: 450 ms with Stopwatch per iteration and 455 ms without</li>
</ul>
<p>On Linux, there is about 5ms overhead if we use a per iteration stopwatch, on Windows, it is either the same cost or slightly cheaper <em>with </em>per iteration stopwatch.</p>
<p>Here is the profiler output on Windows:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_4.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_thumb_1.png" alt="image" width="862" height="123" border="0" /></a></p>
<p>And on Linux:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_6.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_thumb_2.png" alt="image" width="891" height="101" border="0" /></a></p>
<p>Now, that is what happens when we are doing a significant amount of work, what happens if the amount of work is negligible? I made the IsPrime() method very cheap, and I got:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_8.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/A-performance-profile-mystery_A64B/image_thumb_3.png" alt="image" width="847" height="108" border="0" /></a></p>
<p>So that is a good indication that this isn’t <em>free</em>, but still…</p>
<p>Comparing the costs, it is utterly ridiculous that the profiler says that so much time is spent in those methods.</p>
<p>Another aspect here may be the issue of the profiler impact itself. There are differences between using Tracing and Sampling methods, for example.</p>
<p>I don’t have <em>an</em> answer, just a lot of very curious questions.</p>
<h2>Production postmortem: The dog ate my request</h2>
<p>A customer called us, quite upset, because their RavenDB cluster was failing every few minutes. That was weird, because they were running on our cloud offering, so we had full access to the metrics, and we saw absolutely no problem on our end.</p>
<p>During the call, it turned out that every now and then, but almost always immediately after a new deployment, RavenDB would fail some requests. On a fairly consistent basis, we could see two failures and a retry that was finally successful.</p>
<p>Okay, so at least there is no user-visible impact, but this was still super strange to see. On the backend, we couldn’t see any reason why we would get that sort of error.</p>
<p>Looking at the failure stack, we narrowed things down to an async operation that was invoked via DataDog. Our suspicions were focused on this being an error in the async machinery customization that DataDog uses for adding non-invasive monitoring.</p>
<p>We created a custom build for the user that they could test and waited to get the results from their environment. Trying to reproduce this locally using DataDog integration didn’t raise any flags.</p>
<p>The good thing was that we did find a smoking gun, a violation of the natural order and invariant breaking behavior.</p>
<p>The not so good news was that it was in our own code. At least that meant that we could fix this.</p>
<p>Let’s see if I can explain what is going on. The customer was using a custom configuration: FastestNode. This is used to find the nearest / least loaded node in the cluster and operate from it.</p>
<p>How does RavenDB <em>know</em> which is the fastest node? That is kind of hard to answer, after all. It checks.</p>
<p>Every now and then, RavenDB replicates a read request to all nodes in the cluster. Something like this:</p>
<blockquote>
<script src="https://gist.github.com/ayende/692d3ecd0df97d3b7e2df3997632ca19.js"></script>
</blockquote>
<p>The idea is that we send the request to all the nodes, and wait for the <em>first</em> one to arrive. Since this is the same request, all servers will do the same amount of work, and we’ll find the fastest node from our perspective.</p>
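<p>Conceptually, the probe looks something like the following sketch (the method and parameter names are mine, not RavenDB’s actual API):</p>
<blockquote>
<pre><code>using System;
using System.Threading;
using System.Threading.Tasks;

static class FastestNodeProbe
{
    // Conceptual sketch of the "fastest node" check, not the actual RavenDB code.
    public static async Task<int> FindFastestNodeAsync(
        Func<int, CancellationToken, Task> sendRequestToNode, int nodeCount)
    {
        using var cts = new CancellationTokenSource();

        // Send the same read request to every node, all sharing one cancellation token.
        var probes = new Task[nodeCount];
        for (int i = 0; i < nodeCount; i++)
            probes[i] = sendRequestToNode(i, cts.Token);

        // The first response wins - that node is the fastest from our perspective.
        Task winner = await Task.WhenAny(probes);
        int fastest = Array.IndexOf(probes, winner);

        // Cancel the requests that are still in flight. This is what later shows up
        // in the monitoring as a couple of "failed" requests every few minutes.
        cts.Cancel();
        return fastest;
    }
}
</code></pre>
</blockquote>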
<p>Did you notice the cancellation token in there? When we return from this function, we cancel the existing requests. Here is what this looks like from the monitoring perspective:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-post-mortem_1004C/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-post-mortem_1004C/image_thumb.png" alt="image" width="1133" height="123" border="0" /></a></p>
<p>This looks <em>exactly</em> like a couple of failures (and a failover) in the system every few minutes, and it was quite confusing until we figured out exactly what was going on.</p>
<h2>Fight for every byte it takes: Fitting 64 values in 4 bits</h2>
<p>Moving to <a href="https://ayende.com/blog/199362-B/fight-for-every-byte-it-takes-nibbling-at-the-costs?key=9b599e1acc764686bc2c4a6b18a09121">nibble encoding</a> gave us a measurable improvement in the density of the entries in the page. The problem is that we have pretty much run out of room to push this further. We are currently using a byte per entry to hold the size of the entry (as two nibbles, of 4 bits each). You can’t really go lower than that.</p>
<p>Let’s review again what we know about the structure of the data: we have an 8KB page with three sections, a fixed-size header and the variable-size offsets and entries arrays. Here is what this looks like:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Fight-for-every-byte-it-takes-Fitting_12C2D/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Fight-for-every-byte-it-takes-Fitting_12C2D/image_thumb.png" alt="image" width="451" height="458" border="0" /></a></p>
<p>This is called a slotted page design. The idea is that the offset array at the bottom of the page is maintaining the sort orders of the entries, and that we can write the entries from the top of the page. When we need to sort the entries, we just need to touch the offsets array (shown in yellow in the image).</p>
<p>Given that we are talking about size and density, we spent a lot of time trying to reduce the size of the entries, but can we do something with the header or the offsets? The header is just 4 bytes right now, two shorts that denote the location of the bottom and the top position in the page. Given that the page is 8KB in size, we have to use a 16-bit integer to cover the range. For offsets, the situation is the same. We have to be able to point to the entry location on the page, and that means that we have to reach 8KB. So the offsets are actually 16-bit ints and take two bytes each.</p>
<p>In other words, there is a <em>hidden</em> overhead of 2 bytes per entry that we didn’t even consider. In the case of our latest success, we were able to push 759 entries into the page, which means that we are actually using 18.5% of the page just to hold the <em>offsets</em> of the entries. That is 1.48 KB that is being used.</p>
<p>The problem is that we <em>need</em> to use this. We have to be able to point to an entry anywhere in the page, which means that we have to reach 0 .. 8192. The minimum size we can use is 16 bits or two bytes.</p>
<p>Or do we?</p>
<p>16 bits gives us a range of 0 .. 65,535, after all. That is far in excess of what we need. We <em>could</em> use a 64KB page, but there are other reasons to want to avoid that.</p>
<p>To cover 8KB, we only need 13 bits, after all. For that matter, we can trim that even further: if we decide that an entry should be aligned on a 2-byte boundary, we can address the entire page with just 12 bits, a 25% saving over the 16-bit offsets.</p>
<p>That means that we have <em>4 whole free bits</em> to play with. The first idea is to change the offsets array from 16 bits ints to 12 bits ints. That would save us 380 bytes at 759 entries per page. That is quite a lot. Unfortunately, working with bits in this manner would be <em>super</em> awkward. We are doing a lot of random access and moves while we are building the page. It is <em>possible</em> to do this using bits, but not fun.</p>
<p>So we can set things up so we have a nibble free to use. We <em>just</em> used nibbles to save on the cost of variable size ints, to great success.</p>
<p>However, we don’t need just <em>a</em> nibble, we need two of them. We need to store the size of the key and the value in bytes. Actually, we don’t quite need two nibbles: the size of the key and the value each max out at 8 bytes, so we can encode each in 3 bits. In other words, we need 6 bits to encode this information.</p>
<p>We only have 4 bits, however. It is a <em>really</em> nice idea, however, and I kept turning that in my head, trying to come up with all sorts of clever ways to figure out how we can push 64 values in 4 bits. The impact of that would be pretty amazing.</p>
<p>Eventually, I realized that it is fairly easy to <em>prove</em>, using math, that there is no way to do so. Faced with this failure, I realigned my thinking and found a solution. I don’t <em>need</em> to have a perfect answer, I can have a <em>good </em>one.</p>
<p>4 bits give me a range of 16 values (out of the possible 64). If I give up on trying to solve the whole problem, can I solve a meaningful part of it?</p>
<p>And I came up with the following idea. We can do a two-stage approach, we’ll map the most common 15 values of key and value sizes to those 4 bits. The last value will be a marker that you have to go and look elsewhere.</p>
<p>Using just the data in the offset, I’m able to figure out what the location of the entry in the page is as well as the size of the key and value for most cases. For the (hopefully rare) scenarios where that is <em>not</em> the case, we fall back to storing the size information as two nibbles preceding the entry data.</p>
<p>This is a pretty neat idea, even if I say so myself, and it has a good chance to allow us to save about 1 byte per entry in the common case. In fact, I tested that and about 90% of the cases in my test case are covered by the top 15 cases. That is a pretty good indication that I’m on the right track.</p>
<p>All of that said, let’s look at how this looks in code:</p>
<blockquote>
<script src="https://gist.github.com/ayende/766084b0333cf3eb90e4fdf66ec6f58c.js"></script>
</blockquote>
<p>I’m using a switch expression here for readability, so it is clear what is going on. If the key and value sizes are in one of the known patterns, we can put that in the nibble we’ll return. If the value is not, we’ll write it to the entry buffer.</p>
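<p>The real table of the 15 most common patterns comes from the data itself; a sketch of the shape of such an encoder (with made-up patterns and a made-up marker value) would be:</p>
<blockquote>
<pre><code>// Illustrative sketch - the actual 15 (keySize, valSize) patterns are derived from the data.
static byte TryEncodeSizesAsNibble(int keySize, int valSize)
{
    return (keySize, valSize) switch
    {
        (1, 1) => (byte)1,
        (2, 2) => (byte)2,
        (4, 4) => (byte)3,
        (8, 8) => (byte)4,
        // ... up to 15 known (keySize, valSize) patterns ...
        _ => (byte)0 // 0 is used here as the "unknown" marker: the caller then writes
                     // the two size nibbles before the entry data, as in the old format
    };
}
</code></pre>
</blockquote>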
<p>The Set method itself had to change in some subtle but crucial ways, let’s look at it first, then I’ll discuss those changes:</p>
<blockquote>
<script src="https://gist.github.com/ayende/e23479d9e77e0a81abaadc81704736ea.js"></script>
</blockquote>
<p>As before, we encode the entry into a temporary buffer. Now, in addition to getting the length of the entry, we are also getting the nibble that we’ll need to store.</p>
<p>You can see the changes in how we work with the offsets array following that. When we need to update an existing value, we are using this construction to figure out the actual entry offset:</p>
<blockquote>
<pre><span style="color: #9b00d3;">var</span> actualEntryOffset = ((offsets[idx] & <span style="color: #9b00d3;">0xFFF0</span>) >> 3);</pre>
</blockquote>
<p>What exactly is going on here? Don’t try to figure it out yet, let’s see how we are <em>writing </em>the data:</p>
<blockquote>
<pre>top = (<span style="color: #0000ff;">ushort</span>)((top - reqEntryLen) & ~1); <span style="color: #9bbb59;">// align on two bytes boundary </span></pre>
<p>offsets[idx] = (<span style="color: #0000ff;">ushort</span>)(top << 3 | nibble);</p>
</blockquote>
<p>Those two code snippets may look very odd, so let’s go over them in detail.</p>
<p>First, remember that we have an 8KB page to work with, but we need to use 4 bits for the size nibble we got from encoding the entry. To address the full 8,192 values in the page, we’ll need to reserve 13 bits. That is… a problem. We solve that by saying that the entry addresses must always be aligned on two bytes boundary. That is handled by clearing the first bit in the new top computation. Since we are growing down, that has the effect of ensuring aligned-by-two.</p>
<p>Then, we merge the top location and the nibble together. We <em>know</em> that the bottom-most bit of the top is cleared, so we can just shift the value left by 3 bits, and we know that we have 4 cleared bits ready for the nibble.</p>
<p>Conversely, when we want to <em>read</em>, we clear the lowest 4 bits and then shift right by three. That has the effect of returning us back to the original state.</p>
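<p>To make the bit manipulation concrete, here is the round trip in isolation (a standalone check, not code from the actual page implementation):</p>
<blockquote>
<pre><code>using System;
using System.Diagnostics;

class OffsetPacking
{
    static void Main()
    {
        ushort top = 6000;      // entry position, already aligned on a 2-byte boundary
        byte nibble = 0b1011;   // the 4-bit size marker we got from encoding the entry

        // Write: the lowest bit of 'top' is known to be 0, so shifting left by 3
        // leaves 4 clear bits at the bottom for the nibble.
        ushort packed = (ushort)(top << 3 | nibble);

        // Read: mask away the nibble bits, then shift back.
        var actualEntryOffset = (packed & 0xFFF0) >> 3;
        var sizeNibble = packed & 0xF;

        Debug.Assert(actualEntryOffset == top && sizeNibble == nibble);
        Console.WriteLine($"{actualEntryOffset}, {sizeNibble}"); // 6000, 11
    }
}
</code></pre>
</blockquote>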
<p>A little bit confusing, but we managed to squeeze 784 entries into the page using the realistic dataset and 765 using the full one. That is another 3.5% of space savings over the previous nibble attempt and over a 10% increase in capacity from the variable integer approach.</p>
<p>And at this point, I don’t believe that there is anything more that I <em>can</em> do to reduce the size in a significant manner without having a negative impact elsewhere.</p>
<p>We are not done yet, however. We are done with the size aspect, but we also have much to do in terms of performance and optimizations for runtime.</p>
<p>In the meantime, you can see <a href="https://gist.github.com/ayende/af3c4a7acc8cc2508e0b676bc9a8eee5">my full code here</a>. In the next post, we are going to start talking about the actual machine code and how we can optimize it.</p>
<h2>Fight for every byte it takes: Variable size data</h2>
<p>In <a href="https://ayende.com/blog/199329-B/fight-for-every-byte-it-takes-storing-raw-numbers?key=8cb96c2c2e9c49a58ae7a0fd14b35d6b">my previous post</a>, we stored keys and values as raw numbers inside the 8KB page. That was simple, but wasteful. For many scenarios, we are never going to need the full 8-byte range of a long. Most numbers are far smaller.</p>
<p>In the example I gave in the last post, we are storing the following range of numbers (file offsets, basically). I’m using two test scenarios, one where I’m testing the full range (for correctness) and one where I’m testing files under 512 GB in size. Given that we are trying to compress the space, once we hit the 512GB mark, it is probably <em>less</em> urgent, after all.</p>
<p>Here are the number generations that I’m using:</p>
<blockquote>
<script src="https://gist.github.com/ayende/afc45a55389d8dfe9348051f2b2e378f.js"></script>
</blockquote>
<p>What this <em>means</em> is:</p>
<table border="0" cellspacing="5" cellpadding="5">
<tbody>
<tr>
<td valign="top">Full data set</td>
<td valign="top">Realistic data set</td>
</tr>
<tr>
<td valign="top">
<ul>
<li> 3% in the first 128 bytes</li>
<li> 7% in the first 64 KB</li>
<li>25% in the first 8 MB</li>
<li>35% in the first 2 GB</li>
<li>15% in the first 512 GB</li>
<li>5% in the first 128 TB</li>
<li>3% in the first 32 Petabytes</li>
<li>2% in the first 4 Exabytes</li>
</ul>
</td>
<td valign="top">
<ul>
<li> 1% in the first 128 bytes</li>
<li> 2% in the first 64 KB</li>
<li>27% in the first 8 MB</li>
<li>35% in the first 2 GB</li>
<li>25% in the first 512 GB</li>
</ul>
</td>
</tr>
</tbody>
</table>
<p>This is meant to verify that we can handle <em>any</em> scenario; in practice, we can usually focus on the first 512 GB, which is far more common.</p>
<p>With either data set, using <a href="https://ayende.com/blog/199329-B/fight-for-every-byte-it-takes-storing-raw-numbers?key=8cb96c2c2e9c49a58ae7a0fd14b35d6b">my previous approach</a>, I can fit up to 511 entries per page. That makes sense, since we are storing the data raw, so how can we do better? Most of the time, we don’t need anywhere near 8 bytes per value. For that reason, we have <a href="https://en.wikipedia.org/wiki/Variable-length_quantity">variable length encoding</a>, which goes by many names, such as variable size int, 7-bit integers, etc. I adapted some methods from the .NET codebase to allow me to operate on Spans, like so:</p>
<blockquote>
<script src="https://gist.github.com/ayende/593bdcab072b3ddd94646e6c724870e5.js"></script>
</blockquote>
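<p>The gist holds the adapted methods; a typical 7-bits-per-byte variable size integer over spans looks roughly like this sketch (mine, not the code taken from the .NET runtime):</p>
<blockquote>
<pre><code>using System;

static class VarInt
{
    // Sketch of 7-bits-per-byte variable size encoding over spans.
    public static int Write(Span<byte> dest, ulong value)
    {
        int pos = 0;
        while (value >= 0x80)
        {
            dest[pos++] = (byte)(value | 0x80); // low 7 bits plus a continuation flag
            value >>= 7;
        }
        dest[pos++] = (byte)value;
        return pos; // one byte for values under 128, one more byte per additional 7 bits
    }

    public static ulong Read(ReadOnlySpan<byte> src, out int len)
    {
        ulong result = 0;
        int shift = 0;
        len = 0;
        byte b;
        do
        {
            b = src[len++];
            result |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}
</code></pre>
</blockquote>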
<p>Let’s check what sort of savings we can get using this approach:</p>
<ul>
<li>Under 127 bytes – 1 byte</li>
<li>128 bytes .. 32 KB – 2 bytes</li>
<li>32KB .. 8MB – 3 bytes</li>
<li>8MB .. 2GB – 4 bytes</li>
<li>2 GB .. 512 GB – 5 bytes</li>
<li>512GB .. 128 TB – 6 bytes</li>
<li>128TB .. 32 Petabytes – 7 bytes</li>
<li>32 Petabytes .. 8 Exabytes – 8 bytes</li>
<li>Greater than 8 Exabytes – 9 bytes</li>
</ul>
<p>That is really cool, since for the realistic data set, we can pack a lot more data into the page.</p>
<p>It comes with a serious issue, however. The data is no longer fixed size (well, that <em>is </em>the point, no?). Why is that a problem? Because we want to be able to do a binary search on that, which means that we need to be able to access the data by index. As usual, the solution is to utilize indirection. We’ll dedicate the bottom of the page to an array of fixed-size int (16 bits – sufficient to cover the 8KB range of the page) that will point to the actual location of the entry. Like before, we are going to reserve the first few bytes as a header, in this case we’ll use 4 bytes, divided into two shorts. Those will keep track of the writes to the bottom and the top of the page.</p>
<p>At the bottom, we’ll have the actual offsets that point to the entries, and at the top, we write the actual entries. Here is what this looks like:</p>
<blockquote>
<script src="https://gist.github.com/ayende/fd693cbc0689ba30da6ea0035044e817.js"></script>
</blockquote>
<p>Let’s see how our reading from the page will look now. As you can see, it is very similar to what we had before, but instead of going directly to the key by its offset, we have to use the indirection:</p>
<blockquote>
<script src="https://gist.github.com/ayende/63a0469aed1da7724b6b58c06d57d286.js"></script>
</blockquote>
<p>The offsets array contains the location of the entry in the page, and that is laid out as the [varint-key][varint-val]. So we read (and discard) the key from the offset we found (we have to do that to discover its size) and then we read and return the actual value.</p>
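<p>In sketch form (not the gist itself, and reusing the VarInt sketch from the previous snippet), the lookup through the indirection is:</p>
<blockquote>
<pre><code>using System;

static class PageReader
{
    // Sketch: resolve the value for the entry at position idx in the offsets array.
    public static ulong GetValueByIndex(ReadOnlySpan<byte> page, ReadOnlySpan<ushort> offsets, int idx)
    {
        ReadOnlySpan<byte> entry = page.Slice(offsets[idx]);

        // The entry is laid out as [varint-key][varint-val]. We must decode (and discard)
        // the key first, since we only learn its size by reading it.
        VarInt.Read(entry, out int keyLen);
        return VarInt.Read(entry.Slice(keyLen), out _);
    }
}
</code></pre>
</blockquote>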
<p>Let’s look at how we implemented the actual binary search in the page:</p>
<blockquote>
<script src="https://gist.github.com/ayende/f5657de04f85b10f363aaff80096dd4c.js"></script>
</blockquote>
<p>This is a bog standard binary search, with the only interesting bit that we are going through the offsets array to find the actual location of the key, which we then read using variable size decoding.</p>
<p>The interesting part of this model happens when we need to <em>set</em> a value. Here is what this looks like, with my notes following the code.</p>
<blockquote>
<script src="https://gist.github.com/ayende/03d11c5a244bce32a52c8579c6233764.js"></script>
</blockquote>
<p>This is quite a lot, I’ll admit. Let’s try to break up into individual pieces what is going on here.</p>
<p>First, we get the header values (bottom, top) and initialize them if empty (note that bottom is set to 4, <em>after </em>the header, while top is set to the <em>end</em> of the buffer). The idea is that the bottom grows up and the top grows down. This is called <a href="https://siemens.blog/posts/database-page-layout/">Slotted Page design</a> and it is a staple of database design.</p>
<p>We then encode the key and the value into a temporary buffer. We need to do that so we’ll know what size the entry will take. Then we need to figure out if we are updating an existing record or creating a new one.</p>
<p>Updating an existing record is complex. This is because the <em>size</em> of the new record may be <em>greater</em> than the size of the old one. So we can’t put it in the same location. I’m handling this by just allocating new space for this entry, ignoring the old space that was allocated to it.</p>
<blockquote>
<p>I’m not handling any deletes / space reclamation on this series. That is a separate subject, not complex, but fairly tedious to do properly. So I’m going to focus solely on writes.</p>
</blockquote>
<p>Updates to an existing entry that also change its size aren’t in my test dataset, so I’m not worried about it too much here. I mention this to point out that variable length records bring with them considerations that we wouldn’t have run into with the fixed-size model.</p>
<p>And after all of this work? What are the results?</p>
<p>With the fixed-size version, we could fit 511 entries into the page. With the variable size int, however, we can do better.</p>
<p>For the realistic dataset, I can fit 712 entries for the page, and for the full dataset, 710 (there are very few very big elements even there, but we can see that it has an impact).</p>
<p>511 vs. 712 may not sound like much, but that is a 40% increase in the number of entries that I can fit. To give some context, using 8KB pages, that is a difference of 5 MB per million entries. That <em>adds</em> up.</p>
<p>The question is, can we do <em>better</em>? More on that in my next post…</p>
<h2>Fight for every byte it takes: Storing raw numbers</h2>
<p>I write databases for a living, which means that I’m thinking a <em>lot</em> about persistence. Here is a fun challenge that we went through recently. We have the need to store a list of keys and values and then look up a value by key. Pretty standard stuff. The keys and values are both 64-bit integers. In other words, what I would like to have is:</p>
<blockquote>
<p>Dictionary<long,long> lookup;</p>
</blockquote>
<p>That would be perfect, except that I’ve to persist the data, which means that I have to work with raw bytes. It’s easiest to think about it if we have some code in front of us. Here is the interface that I need to implement:</p>
<blockquote>
<script src="https://gist.github.com/ayende/3db756efe28b2f9a06a59647af4d26de.js"></script>
</blockquote>
<p>As you can see, we have a byte buffer (8KB in size) and we want to add or lookup values from the buffer. All the data resides <em>in</em> the buffer, nothing is external. And we cannot unpack it in memory, because this is used for lookups, so this needs to be really fast.</p>
<p>The keys we are storing are file offsets, so they correlate quite nicely to the overall size of the file. Meaning that you’ll have a lot of small values, but also large ones. Given a key, we need to be able to look up its value quickly, since we may run this lookup billions of times.</p>
<p>Given that I have 8KB of data, I can do the following, just treat the buffer as a sorted array, which means that I get a pretty easy way to search for a particular value and a simple way to actually store things.</p>
<p>Theoretically, given an 8KB page, and 16 bytes per each (key, value) entry, we can store up to 512 entries per page. But it turns out that this is just a theory. We also need to keep track of the <em>number</em> of items that we have, and that takes some space. Just a couple of bytes, but it means that we don’t have those bytes available. A page can now contain up to 511 entries, and even at full capacity, we have 14 bytes wasted (2 for the number of entries, and the rest are unused).</p>
<p>Here is what this looks like in code:</p>
<blockquote>
<script src="https://gist.github.com/ayende/46510091980b51cfb862456b8c7a9d8f.js"></script>
</blockquote>
<p>As you can see, we are creating two arrays, the keys are growing from the bottom of the page and the values are growing from the top. The idea is that I can utilize the BinarySearch() method to quickly find the index of a key (or where it <em>ought</em>) to go. From there, I can look at the corresponding values array to get the actual value. The fact that they are growing separately (and toward each other) means that I don’t need to move as much memory if I’m getting values out of order.</p>
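<p>The gist has the actual implementation; the lookup side of it is roughly this sketch (the exact index mapping between the two arrays is my assumption):</p>
<blockquote>
<pre><code>using System;

static class RawPage
{
    // Sketch of the lookup only. Keys are a sorted Span<long> growing from the bottom of
    // the page, values a parallel Span<long> growing down from the top (layout details omitted).
    public static bool TryGetValue(ReadOnlySpan<long> keys, ReadOnlySpan<long> values, long key, out long value)
    {
        int idx = keys.BinarySearch(key);
        if (idx < 0)
        {
            value = default;
            return false;
        }

        // Since values grow down from the top, the i'th key matches the i'th value from the end.
        value = values[values.Length - 1 - idx];
        return true;
    }
}
</code></pre>
</blockquote>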
<p>For now, I want to set up the playground in which we’ll operate. The <em>type</em> of data that you write into such a system is important. I decided to use the following code to generate the test set:</p>
<blockquote>
<script src="https://gist.github.com/ayende/288e87d6f1209e33285043c9a072e6ed.js"></script>
</blockquote>
<p>The idea is that we’ll generate a random set of numbers, in the given distribution. Most of the values are in the range of 8MB to 512GB, representing a pretty good scenario overall, I think.</p>
<p>And with that, we just need to figure out what metrics we want to use for this purpose. My goal is to push as many values as I can into the buffer, while maintaining the ability to get a value by its key as fast as possible.</p>
<p>The current approach, for example, does a binary search on a sorted array plus an extra lookup to the companion values array. You really can’t beat this if you need to allow storing arbitrary keys. Here is my test bench:</p>
<blockquote>
<script src="https://gist.github.com/ayende/6e53c1d996068c9990878006c4ae76aa.js"></script>
</blockquote>
<p>This will insert key/value pairs into the page until it is full. Note that we allow duplicates (we’ll just update the value), so we need to keep track of the number of <em>entries</em> inserted, not just the number of insertions. We also validate the structure at any step of the way, to ensure that we always get the right behavior.</p>
<p>This code runs as expected and we can put 511 values into the page before it gives up. This approach works, it is simple to reason about and has very few flaws. It is also quite wasteful in terms of information density. I would like to do better than 511 entries / page. <em>Is</em> it possible to drop below 16 bytes per entry?</p>
<p>Give it some thought; I’m going to present several ways of doing just that in my next post…</p>
<h2>Looking into Corax’s posting lists: Part II</h2>
<p>In a <a href="https://ayende.com/blog/198594-A/looking-into-coraxs-posting-lists-part-i?key=5172525066dd45348327285e42ef0483">previous post</a> (which went out a long time ago) I explained that we have the notion of a set of uint64 values that are used for document IDs. We build a B+Tree with different behaviors for branch pages and leaf pages, allowing us to pack a lot of document IDs (thousands or more) per page.</p>
<p>The problem is that this structure holds the data compressed, so when we add or remove a value, we don’t <em>know</em> if it exists already or not. That is a problem, because while we are able to do any relevant fixups to skip duplicates and erase removed values, we end up in a position where the number of entries in the set is not accurate. That is a Problem, with a capital P, since we use that for query optimizations.</p>
<p>The solution for that is to move to a different manner of storing the data in the leaf page. Instead of going with a model where we add the data directly to the page and compress when the raw values section overflows, we’ll use the following format:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists-Part-I_129A4/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists-Part-I_129A4/image_thumb.png" alt="image" width="504" height="334" border="0" /></a></p>
<p>Basically, I removed the raw values section from the design entirely. That means that whenever we want to add a new value, we need to find the relevant compressed segment inside the page and add to it (potentially creating a page split, etc).</p>
<p>Obviously, that is <em>not</em> going to perform well for write. Since on each addition, we’ll need to decompress the segment, add to it and then compress it again.</p>
<p>The idea here is that we don’t <em>need</em> to do that. Instead of trying to store the entries in the set immediately, we are going to keep them in memory for the duration of the transaction. Just before we commit the transaction, we are going to have two lists of document IDs to go through. One of added documents and one of removed documents. We can then sort those ids and then start walking over the list, find the relevant page for each section in the list, and merging it with the compressed values.</p>
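<p>Conceptually, the pattern is the following sketch (my types and names, not Corax’s actual code):</p>
<blockquote>
<pre><code>using System.Collections.Generic;

// Sketch of per-transaction buffering for a posting list - not the actual Corax types.
class PostingListTransaction
{
    private readonly List<ulong> _additions = new List<ulong>();
    private readonly List<ulong> _removals = new List<ulong>();

    public void Add(ulong docId) => _additions.Add(docId);
    public void Remove(ulong docId) => _removals.Add(docId);

    public void Commit()
    {
        // Sort once per transaction, so that we can walk the tree leaf pages in order
        // and decompress / merge / recompress each segment only a single time.
        _additions.Sort();
        _removals.Sort();

        // The real code would now walk the sorted IDs, locate the leaf page covering each
        // consecutive range, and merge that range into the page's compressed segments as a batch.
    }
}
</code></pre>
</blockquote>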
<p>By moving the buffering stage from the per-page model to the per-transaction model, we actually gain quite a lot of performance, since if we have a lot of changes to a single page, we can handle compression of the data only once. It is a very strange manner of working, to be honest, because I’m used to doing the operation immediately. By delaying the cost to the end of the transaction, we are able to gain two major benefits. First, we have a big opportunity for batching and optimizing work on large datasets. Second, we have a single code path for this operation. It’s always: “Get a batch of changes and apply them as a unit”. It turns out that this is far easier to understand and work with. And that is for the writes portion of Corax.</p>
<p>Remember, however, that Corax is a search engine, so we expect a <em>lot</em> of reads. For reads, we can now stream the results directly from the compressed segments. Given that we can usually pack a <em>lot</em> of numbers into a segment, and that we don’t need to compare to the uncompressed portion, that ended up benefiting us significantly on the read side as well, surprisingly.</p>
<p>Of course, there is also another issue: look at the Baseline in the Page Header. We’ll discuss that in the next post; it turned out that it wasn’t such a good idea.</p>
<h2>Production postmortem: The big server that couldn’t handle the load</h2>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-big-server-tha_EA92/image_2.png"><img style="border: 0px currentcolor; float: left; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-big-server-tha_EA92/image_thumb.png" alt="image" width="141" height="90" align="left" border="0" /></a> A user of ours called us, quite frantic. They are running a lot of systems on RavenDB, and have been for quite some time.</p>
<p>However, very recently they started to run into severe issues. RavenDB would complain that there isn’t sufficient memory to run.</p>
<p>The system metrics, however, said that there are still gobs of GBs available (I believe that this is the appropriate technical term).</p>
<p>After verifying the situation, the on-call engineer escalated the issue. The problem was <em>weird</em>. There <em>was</em> enough memory, for sure, but for some reason RavenDB would be unable to run properly.</p>
<p>An important aspect is that this user is running a multi-tenant system, with each tenant being served by its own database. Each database has a few indexes as well.</p>
<p>Once we figured <em>that</em> out, it was actually easy to understand what is going on.</p>
<p>There are actually quite a few limits that you have to take into account. <a href="https://ayende.com/blog/197635-A/production-postmortem-out-of-memory-on-a-clear-sky">I talked about them here</a>. In that post, the issue was the maximum number of tasks defined by the system, after which you can no longer create new threads.</p>
<p>In this case, the suspect was: vm.max_map_count.</p>
<p>Beyond just total memory, Linux has a limit on the number of memory <em>mappings</em> that a process may have. And RavenDB uses Voron, which is based on <em>mmap</em>(), and each database and each index typically have multiple maps going on.</p>
<p>Given the number of databases involved…</p>
<p>The solution was to increase the max_map_count and add a task for us, to give a warning to the user ahead of time when they are approaching the system's limits.</p>
<h2>Production postmortem: The heisenbug server</h2>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-hiesien_ABC0/image_2.png"><img style="border: 0px currentcolor; float: right; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-hiesien_ABC0/image_thumb.png" alt="image" width="284" height="284" align="right" border="0" /></a>A user reported that they observed nodes in the cluster “going dark”. Basically, they would stop communicating with the rest of the cluster, but would otherwise appear functional. Both the internal and external metrics were all fine, the server would just stop responding to anything over the network. The solution for the problem was to restart the service (note, the service, not the whole machine), but the problem would happen every few days.</p>
<p>As you can imagine, we are taking this sort of thing <em>very</em> seriously, so we looked into the problem. And we came up short. The problem made <em>absolutely</em> no sense. The problem occurred on a (minor) version migration, but there was absolutely nothing related to this that could cause it. What was <em>really</em> weird was that the <em>service</em> itself continued to work. We could see log entries being written and it was able to execute scheduled backups, for example. It would just refuse to talk to us over the network.</p>
<p>That was <em>super</em> strange, since the <em>network </em>itself was fine. All the monitoring systems were green, after all. For that matter, the user was able to SSH into the system to restart the service. This didn’t match with any other issue we could think of. Since the user worked around the problem by restarting the server, we didn’t have a lead.</p>
<p>Then we noticed the exact same problem in one of our cloud instances, and there we have much better diagnostic capabilities. Once we had noticed a problematic server, we were able to SSH into that and try to figure out what was going on.</p>
<p>Here is what we found out:</p>
<ul>
<li>The server will not respond to HTTP(s) communication either from outside the machine or by trying to connect from <em>inside</em> the machine.</li>
<li>The server <em>will</em> respond to SNMP queries both from inside the machine and outside of it (which is how we typically monitor the system).</li>
</ul>
<p>When we designed RavenDB, we implemented a “maintenance hatch” for such scenarios, in addition to using HTTP(s) for communication, RavenDB also exposes a named pipe that allows you to connect to the server without going through the network at all. This ensures that if you have administrator privileges on the server, you are able to connect even if there are network issues, certificate problems, etc.</p>
<p>Here is the kicker. Under this particular situation, we <em>could</em> <em>not </em>activate this escape hatch. That is not supposed to be possible. Named pipes on Linux, where we run into the problem, are basically Unix Sockets. A network issue such as a firewall problem or something similar isn’t going to affect them.</p>
<p>At the same time, we were able to communicate with the process using SNMP. What is the problem?</p>
<p>Lacking any other options, we dumped the process, restarted the service, and tried to do the analysis offline. We couldn’t find any problem. All the details we looked at said that everything was fine, the server <em>was</em> properly listening to new connections and it <em>should</em> work. That was… weird.</p>
<p>And then it happened again, and we did the same analysis, and it came back the same. We were clueless. One of the things that we updated between versions was the .NET runtime that we were using, so we <a href="https://github.com/dotnet/runtime/issues/79446">opened an issue</a> to see if anyone ran into the same problem.</p>
<p>And then it happened again. This time, we knew that just looking at the dump wouldn’t help us, so we tried other avenues. Linux has a pretty rich set of knobs and dials that you can look at to see what was going on. We suspected that this may be an issue with running out of file descriptors, running out of memory, etc.</p>
<p>We tried looking into what is going on inside the process using <em>strace</em>, and everything was <strong>fine</strong>. The trace clearly showed that the server was processing requests and was able to send and receive data properly.</p>
<p>Wait, go through that statement again please!</p>
<p>It is fine? But the reason we are using <em>strace</em> is that there is a problem. It looks like the problem fixed itself. That was annoying, because we were hoping to use the trace to figure out what is going on. We added more monitoring along the way, which would let us know if the server found itself isolated. And we waited.</p>
<p>The next time we ran into the problem, the first thing we did was run <em>strace, </em>we needed to get the root cause as soon as possible, and we were afraid that it would fix itself before we had a chance to get to the root cause. The moment we used <em>strace</em>, the server got back online, continuing as if there was never any issue.</p>
<p>Over the next few instances of this issue, we were able to confirm the following observations:</p>
<ol>
<li>The service would stop responding to TCP and Unix Sockets entirely.</li>
<li>There were <em>no</em> firewall or network issues.</li>
<li>The service was up and functional, tailing the log showed activity.</li>
<li>We could query the server state using SNMP.</li>
<li>Running strace on the service process would fix the problem.</li>
</ol>
<p>There are a few more things, the actual trigger for the fix wasn’t strace itself. It was the <em>ptrace()</em> call, which it uses. That would cause the service to start responding again. The <em>ptrace() </em>call is basically the beginning and the end of debugging under Linux. <em>Everything </em>uses it.</p>
<p>If you want to dump a memory process, you start with <em>ptrace()</em>. You want to trace the calls, <em>ptrace(). </em>You want to <em>debug </em>the process? GDB will start by calling <em>ptrace()</em>, etc.</p>
<p>And doing that would alleviate the problem.</p>
<p>That was… quite annoying.</p>
<p>We still had absolutely no indication of what the root cause even was.</p>
<p>We suspected it may be something inside Kestrel that was causing a problem. But that wouldn’t affect the named pipes / Unix sockets that we also saw.</p>
<p>Networking <em>worked</em>, because SNMP did. We thought that this may be because SNMP uses UDP instead of TCP, and looked into that, but we couldn’t figure out how that would be any different.</p>
<p>Looking at this further, we found that we have this in the code dumps:</p>
<blockquote>
<pre><code> ~~~~ 5072
1 Interop+Sys.WaitForSocketEvents(IntPtr, SocketEvent*, Int32*)
1 System.Net.Sockets.SocketAsyncEngine.EventLoop()
1 System.Net.Sockets.SocketAsyncEngine+<>c.ctor>b__14_0(Object)</code></pre>
</blockquote>
<p>As you can see, we are waiting for this in the .NET Sockets thread. The SNMP, on the other hand, looked like:</p>
<blockquote>
<pre><code>Thread (0x559):
[Native Frames]
System.Net.Sockets!System.Net.Sockets.SocketPal.SysReceive()
System.Net.Sockets!System.Net.Sockets.SocketPal.TryCompleteReceiveFrom()
System.Net.Sockets!System.Net.Sockets.SocketAsyncContext.ReceiveFrom()
System.Net.Sockets!System.Net.Sockets.SocketPal.ReceiveFrom()
System.Net.Sockets!System.Net.Sockets.Socket.ReceiveFrom()
SharpSnmpLib.Engine!Lextm.SharpSnmpLib.Pipeline.ListenerBinding.AsyncReceive()</code></pre>
</blockquote>
<p>That was really interesting, since it meant that for sockets (both HTTP and Unix), we were always using async calls, but for SNMP, we were using the synchronous API. We initially suspected that this may be something related to the thread pool. Maybe we had something that blocked it, but it turns out to be a lot more interesting. Here is the code that is actually <em>handling</em> the SNMP:</p>
<blockquote>
<pre>var count = _socket.ReceiveFrom(buffer, ref remote);
Task.Factory.StartNew(() => HandleMessage(buffer, count, (IPEndPoint)remote));</pre>
</blockquote>
<p>In other words, we are actually reading from the socket in a blocking manner, but then processing the actual message using the thread pool. So being able to get results via SNMP meant that the thread pool was fine.</p>
<p>At this point we resorted to hair pulling, rubber ducking and, in some instances, shaking our fists to heaven.</p>
<p>I reminded myself that I’m an adult with a bit of experience solving problems, and dug deeper. We started looking into how .NET is actually <a href="https://github.com/dotnet/runtime/blob/e52462326be03fb329384b7e04a33d3eb7c16736/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs#L176">handling sockets in async mode.</a> This end up <a href="https://github.com/dotnet/runtime/blob/e52462326be03fb329384b7e04a33d3eb7c16736/src/native/libs/System.Native/pal_networking.c#L2727">here, doing a system call</a>:</p>
<blockquote>
<pre>while ((numEvents = epoll_wait(port, events, *count, -1)) < 0 && errno == EINTR);</pre>
</blockquote>
<p>Reading through the man page for <em>epoll_wait()</em> I learned how <em>epoll</em>() works, that it is complex and that we need to be aware of level-triggered and edge-triggered options. Since .NET uses <a href="https://github.com/dotnet/runtime/blob/e52462326be03fb329384b7e04a33d3eb7c16736/src/native/libs/System.Native/pal_networking.c#LL2693C60-L2693C67">edge-triggered</a> events (EPOLLET, which I keep reading as electronic chicken), we focused on that.</p>
<p>There are a <em>lot</em> of edge cases and things to cover, but everything we checked was handled properly. We finally had a good smoking gun. For some reason, we weren’t getting notifications from <em>epoll</em>(), even though we should. Using <em>strace()</em> or friends somehow fixes that.</p>
<p>We actually <em>found</em> the <a href="https://stackoverflow.com/questions/20272845/strace-fixes-hung-process">exact scenario we saw in StackOverflow</a>, but without any idea what the issue was. <a href="https://xkcd.com/979/">Truly, there is an XKCD for everything</a>.</p>
<p>Our current understanding of the issue:</p>
<ul>
<li>All async sockets in .NET are going through the same socket engine, and are using <em>epoll</em>() under the covers.</li>
<li>SNMP is using <em>synchronous</em> calls, so it wasn’t using <em>epoll()</em>.</li>
</ul>
<p>That covers both of the weird things that we are seeing. So what is the issue?</p>
<p>It is not in .NET. Given the size & scope of .NET, we wouldn’t be the only ones seeing that. Below .NET, there is the kernel, so we looked into that. The machines we were running that on were using kernel 5.4.0-azure-1095, so we looked into that.</p>
<p>And it looked like it is <a href="https://lore.kernel.org/linux-fsdevel/[email protected]/#r">a kernel bug</a>, which was fixed in the <a href="https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1996678/comments/28">next updated kernel</a>. A race condition inside the kernel would cause us to miss wakeups, and then we would basically just stall without anything to wake us up.</p>
<p>We dug deeper to understand a bit more about this situation, and <a href="https://man7.org/linux/man-pages/man2/ptrace.2.html">we got this</a>:</p>
<pre> Some system calls return with <strong>EINTR </strong>if a signal was sent to a
tracee, but delivery was suppressed by the tracer. (This is very
typical operation: it is usually done by debuggers on every
attach, in order to not introduce a bogus <strong>SIGSTOP</strong>). As of Linux
3.2.9, the following system calls are affected (this list is
likely incomplete): <a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html">epoll_wait(2)</a>, and <a href="https://man7.org/linux/man-pages/man2/read.2.html">read(2)</a> from an <a href="https://man7.org/linux/man-pages/man7/inotify.7.html">inotify(7)</a>
file descriptor. The usual symptom of this bug is that when you
attach to a quiescent process with the command
strace -p <process-ID>
then, instead of the usual and expected one-line output such as
restart_syscall(<... resuming interrupted call ...>_
or
select(6, [5], NULL, [5], NULL_
('_' denotes the cursor position), you observe more than one
line. For example:
clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
epoll_wait(4,_
What is not visible here is that the process was blocked in
<a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html">epoll_wait(2)</a> before <a href="https://man7.org/linux/man-pages/man1/strace.1.html">strace(1)</a> has attached to it. Attaching
caused <a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html">epoll_wait(2)</a> to return to user space with the error
<strong>EINTR</strong>. In this particular case, the program reacted to <strong>EINTR </strong>by
checking the current time, and then executing <a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html">epoll_wait(2)</a>
again. (Programs which do not expect such "stray" <strong>EINTR </strong>errors
may behave in an unintended way upon an <a href="https://man7.org/linux/man-pages/man1/strace.1.html">strace(1)</a> attach.)</pre>
<p>And.. that is exactly what is happening. On attaching, the <em>epoll_wait</em>() will return with <em>EINTR</em>, which will cause .NET to retry the command, and that “fixes” the issue.</p>
<p>It makes total sense now, and concludes the discovery process of a pretty nasty bug.</p>
<p>Now, if you’ll excuse me, I need to go and apologize to a rubber duck.</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-hiesien_ABC0/image_4.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-hiesien_ABC0/image_thumb_1.png" alt="image" width="228" height="229" border="0" /></a></p>
<h2>Answer: what does this code print?</h2>
<p>I posted this code <a href="https://ayende.com/blog/198721-A/challenge-what-does-this-code-print?key=1872d7ef392d44dbb7efeabf515e32d5">previously</a>:</p>
<blockquote>
<script src="https://gist.github.com/ayende/af76ca202e7de6ac6143401efa65278d.js"></script>
</blockquote>
<p>And asked what it prints. This is actually an infinite loop that will print an endless amount of zeros to the console. The question is why.</p>
<p>The answer is that we are running into two separate features of C# that interact with each other in a surprising way.</p>
<p>The issue is that we are using a <em>nullable</em> iterator here, and accessing the struct using the <em>Value </em>property. The problem is that this is a <em>struct</em>, and using a property will cause it to be copied.</p>
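<p>A minimal stand-in that reproduces the same behavior (not the original snippet, which is in the gist) looks like this:</p>
<blockquote>
<pre><code>using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // List<T>.Enumerator is a mutable struct; wrapping it in a Nullable<T>
        // means every access to .Value hands back a *copy* of that struct.
        List<int>.Enumerator? e = new List<int> { 1, 2, 3 }.GetEnumerator();

        // MoveNext() advances a temporary copy, never the stored enumerator, and
        // Current is then read from yet another fresh copy - so this loops forever,
        // printing default(int), i.e. zero.
        while (e.Value.MoveNext())
            Console.WriteLine(e.Value.Current);
    }
}
</code></pre>
</blockquote>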
<p>So the way it works, the code actually runs:</p>
<blockquote>
<script src="https://gist.github.com/ayende/741819d6ccf7178a3724bf1ab08c8244.js"></script>
</blockquote>
<p>And now you can more easily see the temporary copies that are created and how, because we are using a value type here, we end up with a different instance each time.</p>
<h2>Challenge: what does this code print?</h2>
<p>Given the following code:</p>
<blockquote>
<script src="https://gist.github.com/ayende/af76ca202e7de6ac6143401efa65278d.js"></script>
</blockquote>
<p>Can you <em>guess</em> what it will do?</p>
<p>Can you explain <em>why</em>?</p>
<p>I love that this snippet is under 20 lines of code, but being able to explain it shows a lot more knowledge about C# than you would expect.</p>
<h2>Answer: What does this code print?</h2>
<p>I asked the <a href="https://ayende.com/blog/198689-B/challenge-what-does-this-code-print?key=57f95b74a23443828fa5f02fb82d19b4">following question</a>, about code that uses <em>AsyncLocal</em> as well as async calls. Here is the code again:</p>
<blockquote>
<script src="https://gist.github.com/ayende/180ac03715165f723ffed93f228c0973.js"></script>
</blockquote>
<p>This code prints <em>False</em> twice; the question is why. I would expect the <em>AsyncLocal</em> value to remain the same after the call to Start(), since that is obviously the <em>point</em> of <em>AsyncLocal</em>. It turns out that this isn’t the case.</p>
<p>AsyncLocal is good if you are trying to pass a value <em>down</em> to child tasks, but it isn’t applicable to other tasks that are called at the same level. In other words, it works for children, not sibling tasks. This is actually even more surprising in the code above, since we don’t do any awaits in the <em>Start</em>() method.</p>
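<p>Here is a minimal sketch of that children-vs-siblings behavior (my own example, not the challenge code): the value flows <em>into</em> a child task, but a change made inside the child does not flow back to the caller:</p>
<blockquote>
<pre>using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static readonly AsyncLocal<string> Local = new AsyncLocal<string>();

    static async Task Main()
    {
        Local.Value = "parent";

        await Task.Run(() =>
        {
            Console.WriteLine(Local.Value); // "parent" - the value flows down into the child

            Local.Value = "child";          // this change stays in the child's execution context
        });

        Console.WriteLine(Local.Value);     // still "parent" - the child's change is not visible here
    }
}</pre>
</blockquote>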
<p>The question is why? <a href="https://learn.microsoft.com/en-us/dotnet/api/system.threading.asynclocal-1?view=net-7.0">Looking at the documentation</a>, I couldn’t see any reason for that. Digging deeper into the source code, I figured out what is going on.</p>
<p>We can use <a href="https://sharplab.io/#v2:D4AQTAjAsAUCAMACEEB0AVAFgJwKYEMATASwDsBzAblgWQgFZqYaBmZMRAZQGNdT9sxAPawA3rESTkbAIIBnAJ6luAGSHd8AGwA8AIyFDNAPkQzuAF2IA3XIgC8iUrgDupxcrUad+w0YAUAJRMUtLIABzIAGxc5gLmgRJS4jAhIWaWNqgAaloArrYO5tj5wVIAvrCJkiBsIBEg0QBKuaQJKUlVqSgAnH7p1rjZebhBnSEg3TFxgaWpdL39mTma+aPtkhUwZUA===">SharpLab.io</a> to lower the high level C# code to see what is actually going on here, which gives us the following code for the <em>Start</em>() method:</p>
<blockquote>
<script src="https://gist.github.com/ayende/d0fbad92269e1d9dde5767af50c15151.js"></script>
</blockquote>
<p>Note that we call the <em>AsyncTaskMethodBuilder.Start</em>() method, which ends up in <em>AsyncMethodBuilderCore.Start</em>(). There we have a bunch of interesting code. In particular, we remember the current thread’s execution context <em>before</em> we execute user code, <a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs#L33-L34">here</a>. After the code is done running, we restore it if needed, as you can see <a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs#L50-L53">here</a>.</p>
<p>That looks fine, but why would the execution context change here? It turns out that one of the few places that interacts with it is <em>AsyncLocal</em> itself, which ends up in <a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs#L532">ExecutionContext.SetLocalValue</a>. The way it works, each time you set an async local, it creates a new <em>layer</em> in the async stack. And when you <em>exit</em> an async call, it will reset the async stack to where it was before the async call started.</p>
<p>In other words, the <em>local</em> in the name <em>AsyncLocal</em> doesn’t parallel <em>ThreadLocal</em>; it is more similar to a local <em>variable</em>, which goes out of scope on function exit.</p>
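<p>If you actually need a change to be visible to the caller, one way to get there (a sketch of the general approach, assuming a trivial mutable Holder class) is to store a holder object in the AsyncLocal. The reference flows down as usual, so mutating the object it points to is visible everywhere that shares it:</p>
<blockquote>
<pre>using System;
using System.Threading;
using System.Threading.Tasks;

class Holder { public bool Flag; }

class Program
{
    static readonly AsyncLocal<Holder> Local = new AsyncLocal<Holder>();

    static async Task Main()
    {
        Local.Value = new Holder();

        // The child context sees the same Holder *reference*, so mutating it is visible to the caller.
        await Task.Run(() => Local.Value.Flag = true);

        Console.WriteLine(Local.Value.Flag); // True
    }
}</pre>
</blockquote>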
<p><a href="https://stackoverflow.com/a/37309427">This isn’t a new thing, and there are workarounds</a>, but it was interesting enough that I decided to dig deep and understand what is actually going on.</p>https://www.ayende.com/blog/198690-B/answer-what-does-this-code-print?Key=6c670721-8ab4-4dd4-8eb5-b6d3929a5a39https://www.ayende.com/blog/198690-B/answer-what-does-this-code-print?Key=6c670721-8ab4-4dd4-8eb5-b6d3929a5a39Thu, 15 Dec 2022 12:00:00 GMTChallenge: What does this code print?<p>Take a look at the following code, what do you think it will print?</p><blockquote><script src="https://gist.github.com/ayende/180ac03715165f723ffed93f228c0973.js"></script></blockquote><p>Since it obviously doesn’t print the <em>expected</em> value, why do you think this is the case?</p>https://www.ayende.com/blog/198689-B/challenge-what-does-this-code-print?Key=57f95b74-a234-4382-8fa5-f02fb82d19b4https://www.ayende.com/blog/198689-B/challenge-what-does-this-code-print?Key=57f95b74-a234-4382-8fa5-f02fb82d19b4Wed, 14 Dec 2022 12:00:00 GMTSolving support issues in other people’s products<p>A user <a href="https://github.com/ravendb/ravendb/issues/15375">contacted us</a> to tell us that RavenDB does not work in his environment. As you can imagine, we didn’t really like to hear that, so we looked deeper into the issue. The issue in question included the actual problem, which looked something like this:</p>
<blockquote>
<pre>{
"Url": "/auth/",
"Type": "Raven.Client.Exceptions.Routing.RouteNotFoundException",
"Message": "There is no handler for path: GET /auth/",
"Error": "Raven.Client.Exceptions.Routing.RouteNotFoundException: There is no handler for path: GET /auth/\n"
}</pre>
</blockquote>
<p>My reaction to that was… huh?!</p>
<p>That is a really strange error, since RavenDB does <em>not</em> have an “/auth/” endpoint. The problem isn’t with RavenDB, it is with something else.</p>
<p>In this case, the user ran RavenDB on port 8080 (which is the normal thing to do) and then tried to access RavenDB in the browser.</p>
<p>The problem was that they <em>previously</em> ran some other software, and that software had the following interaction:</p>
<blockquote>
<pre>* Connected to 127.0.0.1 port 8080
> GET / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.85.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://127.0.0.1:8080/auth/
< Content-Type: text/html; charset=UTF-8</pre>
</blockquote>
<p>In other words, it redirected the browser to the “/auth/” endpoint. It’s critical to understand that a 301 response means: Moved Permanently. That means it is cached by the browser. In this case, the same endpoint was being reused for a different software product, and the browser cache meant that we got strange results.</p>https://www.ayende.com/blog/198626-A/solving-support-issues-in-other-peoples-products?Key=c33c5575-d31c-4373-b936-187722334e8bhttps://www.ayende.com/blog/198626-A/solving-support-issues-in-other-peoples-products?Key=c33c5575-d31c-4373-b936-187722334e8bTue, 06 Dec 2022 12:00:00 GMTLooking into Corax’s posting lists: Part I<p>One of the big changes in RavenDB is the new search engine, Corax. We want to replace Lucene with a purpose-built search engine, capable of doing everything that we can do with Lucene, but far faster.</p>
<p>In this post, I want to discuss a particular detail of the implementation: managing the posting list. In information retrieval, the idea of a posting list is that we have a list of document ids that match a particular term. I’m ignoring the details of how we create or manage the list. The basic idea is that we need to store a potentially large number of document ids, update them on the fly, as well as query them. Lucene manages that by creating immutable segments, but that design decision doesn’t match the way we want to do things in Corax. The problem with immutable segments is that you’ll eventually need to run compaction, which can be really expensive.</p>
<p>As a database, we already have a really good data structure that matches most of those requirements, it turns out. A B+Tree can do a pretty good approximation of a posting list, but it does have a problem: it’s <em>not</em> efficient in terms of space usage. A document id is a 64-bit integer, and we can make a number of assumptions about it. Therefore, I created a dedicated B+Tree-like structure to hold the posting list. This B+Tree is called a Set inside of Voron, and it can hold any number of uint64 values. Internally, this is managed as two separate types of pages: the Branch pages (which are fairly standard B+Tree branches) and the Leaf pages.</p>
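<p>To give a feel for the shape of the thing, here is a rough sketch of the operations such a Set has to support (my own illustration, not the actual Voron API):</p>
<blockquote>
<pre>using System.Collections.Generic;

// A rough sketch of the posting list Set contract (illustration only, not the Voron code).
public interface IPostingListSet
{
    void Add(ulong documentId);          // add a document id to the set
    void Remove(ulong documentId);       // remove it (handled internally via tombstones)
    IEnumerable<ulong> GetEntries();     // iterate all ids, in sorted order
    long NumberOfEntries { get; }        // the count the query optimizer will ask for later
}</pre>
</blockquote>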
<p>Here is the first version of the Set Leaf Page implementation:</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists_117D3/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Looking-into-Coraxs-posting-lists_117D3/image_thumb.png" alt="image" width="504" height="334" border="0" /></a></p>
</blockquote>
<p>Let’s figure out what is going on here. The page has a notion of a base, which means that all the document IDs in the page share the same upper 33 bits. Basically, inside a single page, all the IDs are in the same 2 billion range. That means that even though the document IDs are 64 bits, in the context of a single page, we can treat them as 32-bit integers. That turns out to be important, since most integer compression routines work on 32-bit integers.</p>
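<p>Here is a minimal sketch of that split (my own illustration of the idea, not the actual implementation): the upper 33 bits form the page baseline, and only the lower 31 bits are stored per entry:</p>
<blockquote>
<pre>static class PostingListIds
{
    // Lower 31 bits are stored per entry; the upper 33 bits form the page baseline.
    private const ulong LocalMask = (1UL << 31) - 1;

    public static ulong Baseline(ulong documentId)   => documentId & ~LocalMask;
    public static uint  LocalValue(ulong documentId) => (uint)(documentId & LocalMask);
    public static ulong Reconstruct(ulong baseline, uint localValue) => baseline | localValue;
}</pre>
</blockquote>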
<p>How does that work? We have a section in the page that is dedicated to raw values, and we insert values into that section until it is full. Whenever that happens, we compress the raw values section using PForDelta compression. The page will then contain multiple compressed segments and a single raw values segment. Each compressed segment is non-overlapping with the other compressed segments (but may overlap with the raw values). PForDelta is <em>really</em> good at compressing integers, especially if it is able to figure out patterns in your data. And document IDs in Corax are explicitly designed to have common patterns, so it is able to take advantage of this behavior. When we read the entries from the page, we merge the compressed segments with the raw values and return the full details.</p>
<p><a href="https://github.com/ravendb/ravendb/blob/aee3df75d1edd8488a72bb03294f8853da34ca6b/src/Voron/Data/Sets/SetLeafPage.cs">The code isn’t particularly simple</a>, but has a fairly straightforward approach to the problem once you understand the design.</p>
<p>One thing that I haven’t touched on is the notion of <em>removals</em>. That is an important concept, and we handle it in an interesting way. Remember that I said that the baseline for the page is the upper 33 bits? That is because the numbers inside the page are only 31 bits in length; we reserve the top bit to mark a value as a tombstone. In other words, when we write to the raw values, we can have either a new value or a removed value, distinguished using the uppermost bit.</p>
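<p>A sketch of what that encoding could look like (again, my illustration of the idea rather than the actual code):</p>
<blockquote>
<pre>static class RawValueEncoding
{
    // The top bit of the 32-bit slot marks a removal; the lower 31 bits are the local value.
    private const uint TombstoneBit = 1u << 31;

    public static uint Encode(uint localValue, bool isRemoval) =>
        isRemoval ? localValue | TombstoneBit : localValue;

    public static bool IsRemoval(uint encoded)  => (encoded & TombstoneBit) != 0;
    public static uint LocalValue(uint encoded) => encoded & ~TombstoneBit;
}</pre>
</blockquote>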
<p>When we fill the raw values segment, we compress it alongside the relevant compressed segments. At that point, we’ll filter out the removed values. This is <em>very</em> similar to the segments that Lucene is using, but we do that on a page boundary (8KB), not across all values.</p>
<p>We are able to push a <em>lot</em> of values into a single page. We typically see thousands to tens of thousands of document IDs fitting into a single 8KB page. That is pretty amazing, since even if you have a posting list that has millions of entries, the actual disk reads are minimal.</p>
<p>The design was with us throughout most of the development of Corax, but we ran into a serious issue with the way it works when we started to build the query optimizer for Corax.</p>
<p>That is an interesting statement, I think you’ll agree. What is the relation between a (very) low-level design of the on-disk data format and the result of the query optimizer?</p>
<p>Well, it turns out that one of the things that we need to know for the query optimizer is: “How big is this posting list?”</p>
<p>This question is actually <em>really</em> important to be able to generate optimal queries. And given the structure we have, we can provide a good answer to that, most of the time, but not always.</p>
<p>Why is that?</p>
<p>The problem is what happens when we want to <em>remove</em> a value from the set, or add an <em>existing</em> value. If the value already exists inside a compressed segment, we don’t open the compressed segment (which would require re-writing it from scratch), so we record an addition that is actually spurious. Conversely, if we try to remove a value that isn’t in the set, we’ll wrongly decrement the number of entries in the posting list, leading to a mismatch between the recorded number of entries and the actual number we have in the posting list.</p>
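<p>A toy model of the bookkeeping (my illustration, not the actual code) makes the drift obvious: the counter is adjusted without knowing whether the value is really present in the already-compressed segments:</p>
<blockquote>
<pre>using System.Collections.Generic;

class ToyPostingListPage
{
    private const uint TombstoneBit = 1u << 31;

    private readonly List<uint> _rawValues = new List<uint>();
    public long NumberOfEntries { get; private set; }

    public void Add(uint localValue)
    {
        _rawValues.Add(localValue); // may duplicate a value already sitting in a compressed segment
        NumberOfEntries++;          // ...in which case this increment is spurious
    }

    public void Remove(uint localValue)
    {
        _rawValues.Add(localValue | TombstoneBit);
        NumberOfEntries--;          // wrong if the value was never in the set to begin with
    }
}</pre>
</blockquote>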
<p>That was <em>very</em> confusing to figure out, I’ll admit. It was also really hard to fix, but I’ll address that in the next post in the series.</p>https://www.ayende.com/blog/198594-A/looking-into-coraxs-posting-lists-part-i?Key=51725250-66dd-4534-8327-285e42ef0483https://www.ayende.com/blog/198594-A/looking-into-coraxs-posting-lists-part-i?Key=51725250-66dd-4534-8327-285e42ef0483Wed, 30 Nov 2022 12:00:00 GMTOn the critical importance of accepting and working with invalid data<p>A long while ago, I was involved in a project that dealt with elder home care. The company I was working for was contracted by the government to send care workers to the homes of elderly people to assist in routine tasks. Depending on the condition of the patient in question, they would spend anything from 2 – 8 hours a day a few days a week with the patient. Some caretakers would have multiple patients that they would be in charge of.</p>
<p>There were a lot of regulations and details that we had to get right. A patient may be allowed a certain amount of hours a week by the government, but may also pay for additional hours from their own pocket, etc. Quite interesting to work on, but not the topic of this post.</p>
<p>A key aspect of the system was actually paying the workers, of course. The problem was the way it was handled. At the end of each month, they would designate “office hours” for the care workers to come and submit their time sheets. Those had to be signed by the patients or their family members and were submitted in person at the office. Based on the number of hours worked, the workers would be paid.</p>
<p>The time sheets would look something like this:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/829fd5737075_137B1/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/829fd5737075_137B1/image_thumb.png" alt="image" width="775" height="87" border="0" /></a></p>
<p>A single worker would typically have 4 – 8 of those, and the office would spend a considerable amount of time deciphering those, computing the total hours and handing over payment.</p>
<p>The idea of the system I was working on was to automate all of that. Here is more or less what I needed to do:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/829fd5737075_137B1/image_4.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/829fd5737075_137B1/image_thumb_1.png" alt="image" width="1016" height="410" border="0" /></a></p>
<p>For each one of the timesheet entries, the office would need to input the start & end times, who the care worker took care of, whether the time was approved (actually, whether this was paid by the government, privately, by the company, etc), and whether it was counted as overtime, etc.</p>
<p>The rules for all of those were… interesting, but we got it working and tested. And then we got to talk with the actual users.</p>
<p>They took one look at the user interface they had to work with and absolutely rebelled. The underlying issue was that during the end of the month period, each branch would need to handle hundreds of care workers, typically within four to six hours. They didn’t have the <em>time</em> to do that (pun intended). Their current process was to review the submitted time sheet with the care worker, stamp it with approved, and put just the total hours worked into the payroll system.</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/829fd5737075_137B1/image_6.png"><img style="border: 0px currentcolor; float: right; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/829fd5737075_137B1/image_thumb_2.png" alt="image" width="240" height="170" align="right" border="0" /></a>Having to spend so much time on data entry was horrendous for them, but the company really wanted to have that level of granularity, to be able to properly track how many hours were worked and handle its own billing more easily.</p>
<p>Many of the care workers were also… non-technical, and the timesheet had to be approved by the patient or a family member. Having a signed piece of paper was easy to handle; trying to get them to go to a website and enter hours was a non-starter. That was also before everyone had a smartphone (and even today I would assume that would be difficult in that line of business).</p>
<p>As an additional twist, it turns out that the manual process allowed the office employees to better manage the care workers. For example, they might give them a 0.5 / hour adjustment for a particular month to manage a difficult client or deal with some specific issue, or approve (at their discretion) overtime pay when it wasn’t <em>quite</em> “proper” to do so.</p>
<p>One of the reasons that the company wanted to move to a modern system was to avoid these sorts of “initiatives”, but they turned out to be quite important for the proper management of the care workers and for actually getting things done. For an additional twist, they had multiple branches, and each of those had a different methodology for how that was handled, and all of them were different from what HQ thought it should be.</p>
<p>The process turned out to be even more complex than we initially realized, because there was a lot of flexibility in the system that was actually crucial for the proper running of the business.</p>
<p>If I recall properly, we ended up having a two-stage approach. During the end of the month rush, they would fill in the gross hours and payment details. That part of the system was intentionally simplified to the point where we did almost no data validation and trusted them to put the right values.</p>
<p>After payroll was processed, they had to actually put in all those hours, and we would run a check on the expected amount / validation vs. what was actually paid. That gave the company the insight it needed into what was going on, the office people were able to keep handling things the way they were used to, and discrepancies could be raised and handled more easily.</p>
<p>I was there in the “I’m just a techie” role, so I got to sit on the sidelines and watch the tug-of-war between the different groups. It was quite interesting to see. I described the process above in a bit of a chaotic manner, but there <em>were</em> proper procedures and processes in the offices. They were just something that the company HQ never even realized was there.</p>
<p>It also taught me quite a lot about the importance of accepting “invalid” data. In many cases, you’ll see computerized solutions flat-out refuse to accept values that they consider to be wrong. The problem is that often, you need to record reality, which may not agree with validation rules on your systems. And in most cases, reality wins.</p>https://www.ayende.com/blog/198370-A/on-the-critical-importance-of-accepting-and-working-with-invalid-data?Key=a6012150-3695-4ced-997e-b66d1c0326bbhttps://www.ayende.com/blog/198370-A/on-the-critical-importance-of-accepting-and-working-with-invalid-data?Key=a6012150-3695-4ced-997e-b66d1c0326bbFri, 14 Oct 2022 12:00:00 GMTWhat bug through yonder weirdness my build breaks?<p>We have a lot of tests for RavenDB, and we run them in plenty of environments. We semi-frequently get a build failure when running on the <a href="https://github.com/ravendb/ravendb/runs/8286680144?check_suite_focus=true">“macOS latest” runner</a> on GitHub.</p>
<p>The problem is that the information that I have is self-contradictory. Here is the most relevant piece:</p>
<blockquote>
<script src="https://gist.github.com/ayende/31c5cb783877d0109c093b08f73848d9.js"></script>
</blockquote>
<p>Here you can see the failure itself and what is causing it.</p>
<p>Note that the debug message shows that all three variables here have the same numeric value. The <em>address</em> and <em>current</em> variables are also held on the stack, so there is no possibility of race conditions or anything like that.</p>
<p>I can’t figure out any reason why this would be triggered in this case. About the only thing that comes to mind is whether there is some weirdness going on with pointer comparisons on macOS, but I don’t have a lead to follow.</p>
<p>We haven’t investigated it properly yet, so I thought I’d throw this out to the blog and see if you have any idea what may be going on here.</p>https://www.ayende.com/blog/198242-B/what-bug-through-yonder-weirdness-my-build-breaks?Key=965ff556-3ddd-4bb0-8141-93d0f07fa3f9https://www.ayende.com/blog/198242-B/what-bug-through-yonder-weirdness-my-build-breaks?Key=965ff556-3ddd-4bb0-8141-93d0f07fa3f9Tue, 13 Sep 2022 12:00:00 GMTMy new interview task: Stop the flow<p>We ran into an interesting scenario at work that I thought would make for a pretty good interview task. Consider a server that needs to proxy a request from the outside world to an internal service, something like this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/My-new-interview-task-Stop-the-flow_FAA4/image_4.png"><img width="621" height="355" title="image" style="border: 0px currentcolor; border-image: none; display: block; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/My-new-interview-task-Stop-the-flow_FAA4/image_thumb_1.png" border="0"></a></p><p>That isn’t that interesting. The interesting bit is that the network between the internal server and the proxy is running at 10Gb/sec and the external network is limited to 512Kb/sec.</p><p>Furthermore, the internal server expects the other side to just… take the load. It will blast the other side with as much data as it can, and if you can’t handle that, it will cut the connection. What this means is that for small requests, the proxy can successfully pass the data to the external client, but for larger ones, it is unable to read the data quickly enough to do so and the internal server will disconnect from it.</p><p>It is the responsibility of the proxy to manage that scenario. That is the background for this task. Practically speaking, this means that you have <a href="https://gist.github.com/ayende/8c1106d397220e35128659c2f1518c5e">the following code</a>, which works if the size is 8MB but fails if it is 64MB.</p><p>We have the SlowClientStream and the FastServerStream – which means that we are able to focus completely on the task at hand (ignoring networks, etc).</p><p>The requirement is to pass 64 MB of data between those two streams (which have conflicting requirements):</p><ul><li>The FastServerStream requires that you read from it at a rate of about 31Kb/sec.</li><li>The SlowClientStream, on the other hand, will accept data at a maximum rate of about 30Kb/sec (but this is variable over time).</li></ul><p>You may <em>not</em> change the implementation of either stream (but may add behavior in separate classes).</p><p>You may not read the entire response from the server before sending it to the client.</p><p>There is a limit of 32 MB on the maximum amount of memory you may use in the program.</p><p>How would you go about solving this?</p><p><a href="https://gist.github.com/ayende/8c1106d397220e35128659c2f1518c5e">The challenge skeleton is here</a>.</p>https://www.ayende.com/blog/197698-B/my-new-interview-task-stop-the-flow?Key=80004f0d-d90e-4a8f-bd83-dea31577ca50https://www.ayende.com/blog/197698-B/my-new-interview-task-stop-the-flow?Key=80004f0d-d90e-4a8f-bd83-dea31577ca50Thu, 11 Aug 2022 12:00:00 GMTWhen debugging, assume an unreliable narrator<p>When we are handling a support call, we are often working with partial information about the state of the software at the customer site. Sometimes that is an unavoidable part of the job.
When troubleshooting a system with patients' records, I can’t just ask the customer to schlep the data to my laptop so I can analyze it properly. Even if we could do that, there are a <em>lot</em> of cases that simply don’t reproduce anywhere but the live environment.</p>
<p>Part of the process of debugging an issue in a production environment is being able to gather enough information <em>on site</em> to draw the appropriate conclusions from. RavenDB comes with a lot of tools to do just that. One of the most useful of those tools is the idea of the debug package. That is a simple idea, in the end. It gathers all the information we have about our system and packages it into a zip file. That zip file contains a lot of metrics, but it doesn’t contain customer data (aside from database & index names, which are usually not sensitive).</p>
<p>There have been several separate cases recently where we were able to use the debug package to analyze what was going on and come back to the customer with answers. However, when hearing our explanation of what the root cause of the issue was, the customer rejected our hypothesis.</p>
<p>In one case, a customer was worried about the load that they were observing in their system. Not because there <em>was</em> an issue, but because the number of requests that they observed was higher than expected. The customer had recently switched to using concurrent subscriptions and deployed a couple of worker nodes to spread the load of processing documents. However, the number of requests observed was <em>far</em> higher than they expected. Whereas before they had a single worker per subscription, and a known amount of work that they could easily measure, after switching to concurrent subscriptions they observed a <em>big</em> increase in the number of requests processed by RavenDB.</p>
<p>Given that they deployed their subscriptions to two workers, it was initially expected that the amount of work that the cluster was processing would double. Instead, it increased tenfold. Looking at the metrics in the debug package, we could see that they had 10 instances of each subscription running, but the customer was insistent that they had only deployed two worker nodes.</p>
<p>Our metrics said that there were 5 subscriptions from IP-1 and 5 subscriptions from IP-2. After some back and forth it was revealed that everyone was correct, but talking past each other. The customer deployed two worker nodes, yes. But each of those spawned 5 instances of the subscriptions to take advantage of concurrency inside the same worker node.</p>
<p>In the second case, we have a customer that noticed a marked increase in the amount of bandwidth that they were using. They traced that additional bandwidth to the internal communication between the nodes in the cluster. Given that they are running in the cloud, they were (rightly) concerned about the sudden jump in the bandwidth. We started the investigation process and… we didn’t like what we saw. The cluster had gone through <em>three</em> full node rebuilds in the past month. At least, that was what the data was telling us. But that didn’t make much sense.</p>
<p>Quite concerned that there is something really bad going on, we talked to the customer, who thought about this for a while, checked their own logs and explained what was going on. They are running on Lsv2-series Azure instances, and apparently, within the space of a few weeks, all three of their instances had been moved to another physical host. The Lsv2-series instances use local ephemeral NVMe drives. When they moved an instance between hosts, the effect was as if we were given a brand new hard disk. RavenDB was able to handle that scenario more or less seamlessly, with the other nodes in the cluster filling in for the down node and sending it all the data it lost. The effect of that, of course, was a big jump in network bandwidth while that was going on.</p>
<p>The customer wasn’t actually aware that this had happened until they looked at the logs; RavenDB had it handled, and it was only noticed because of the bandwidth spike.</p>
<p>The point of this post isn’t to talk about how awesome RavenDB is (even if I do think it is pretty awesome). Nor is it to extoll how good our support team is at figuring out things about the customer setup that even the customer isn’t aware of.</p>
<p><em>The</em> point of this post is that you have to take into account, quite clearly, that the details that the customer is providing may be outdated, wrong or just misleading. Not because of any malicious intention on their end, but because they give you the information they <em>have</em>, not what is actually going on.</p>
<p>It reminds me of an old trick in tech support: “Take the plug out of the socket, blow on both the socket and the plug, then insert it again”. The point isn’t to blow away whatever dust may have been there preventing good contact. The point is to ensure that the silly thing is actually <em>plugged</em> in, but you can’t <em>ask</em> if this is plugged in, because the person on the other side of the call would say: “Of course it is” and never check.</p>https://www.ayende.com/blog/197793-C/when-debugging-assume-an-unreliable-narrator?Key=f1425369-1baa-4f76-a7a8-f61bc332716bhttps://www.ayende.com/blog/197793-C/when-debugging-assume-an-unreliable-narrator?Key=f1425369-1baa-4f76-a7a8-f61bc332716bWed, 20 Jul 2022 12:00:00 GMTProduction postmortem: Out of memory on a clear sky<p>A customer opened a support call telling us that they had reached the scaling limits of RavenDB. Given that they had a pretty big machine specifically to handle the load they were expecting, they were (rightly) upset about that.</p>
<p>A short back and forth caused us to realize that RavenDB started to fail shortly after they added a new customer to their system. And by fail I mean that it started throwing <em>OutOfMemoryException</em> in certain places. The system was <em>not</em> loaded and there weren’t any other indications of high load. The system had <em>plenty</em> of memory available, but critical functions inside RavenDB would fail because of out of memory errors.</p>
<p>We looked at the actual error and found this log message:</p>
<blockquote>
<pre>Raven.Client.Exceptions.Database.DatabaseLoadFailureException: Failed to start database orders-21
At /data/global/ravenData/Databases/orders-21
---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Threading.Thread.StartInternal(ThreadHandle t, Int32 stackSize, Int32 priority, Char* pThreadName)
at System.Threading.Thread.StartCore()
at Raven.Server.Utils.PoolOfThreads.LongRunning(Action`1 action, Object state, String name) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Utils\PoolOfThreads.cs:line 91
at Raven.Server.Documents.TransactionOperationsMerger.Start() in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\TransactionOperationsMerger.cs:line 76
at Raven.Server.Documents.DocumentDatabase.Initialize(InitializeOptions options, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DocumentDatabase.cs:line 388
at Raven.Server.Documents.DatabasesLandlord.CreateDocumentsStorage(StringSegment databaseName, RavenConfiguration config, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DatabasesLandlord.cs:line 826
</pre>
</blockquote>
<p>This is <em>quite</em> an interesting error. To start with, this is us failing to load a database, because we couldn’t spawn the relevant thread to handle transaction merging. That is <em>bad</em>, but why?</p>
<p>It turns out that .NET will only consider a single failure scenario for a thread failing to start. If it fails, it must be because the system is out of memory. However, we are running on Linux, and there are <em>other</em> reasons why that can happen. In particular, there are various limits that you can set on your environment that restrict the number of threads you can create.</p>
<p>There are global knobs that you should look at first, such as those:</p>
<ul>
<li><code>/proc/sys/kernel/threads-max</code></li>
<li><code>/proc/sys/kernel/pid_max</code></li>
<li><code>/proc/sys/vm/max_map_count</code></li>
</ul>
<p>Any of those can serve as a limit. There are also ways to set those limits on a per-process basis.</p>
<p>There is also a per user setting, which is controlled via:</p>
<blockquote>
<pre><code>/etc/systemd/logind.conf: UserTasksMax</code></pre>
</blockquote>
<p>The easiest way to figure out what is going on is to look at the kernel log at that time. Here is what we got in the log:</p>
<blockquote>
<pre><code>a-orders-srv kernel: cgroup: fork rejected by pids controller in /system.slice/ravendb.service</code></pre>
</blockquote>
<p>That made it obvious where the problem was: in the ravendb.service file, we didn’t have TasksMax set, which meant that it defaulted to 4915 (probably chosen automatically by the system based on some heuristic).</p>
<p>When the number of databases and operations on the database reached a particular size, we hit the limit and started failing. That is <em>not</em> a fun place to be in, but at least it is easy to fix.</p>
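<p>One way to raise the limit (a sketch of the general idea, not necessarily the exact change that was deployed) is to set TasksMax explicitly in the service unit, or in a drop-in override, and then reload systemd:</p>
<blockquote>
<pre># in ravendb.service (or a drop-in override file), under the [Service] section:
[Service]
TasksMax=infinity

# then apply it:
$ sudo systemctl daemon-reload
$ sudo systemctl restart ravendb</pre>
</blockquote>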
<p>I created this post specifically so it will be easy to Google that in the future. I also created <a href="https://github.com/dotnet/runtime/issues/71761">an issue to get a better error message in this scenario</a>.</p>https://www.ayende.com/blog/197635-A/production-postmortem-out-of-memory-on-a-clear-sky?Key=43188033-6bfd-4682-9fc8-1bef248f2cafhttps://www.ayende.com/blog/197635-A/production-postmortem-out-of-memory-on-a-clear-sky?Key=43188033-6bfd-4682-9fc8-1bef248f2cafMon, 11 Jul 2022 12:00:00 GMT