Ayende @ Rahien

Jul 01 2024

ChallengeEfficient snapshotable state

time to read 4 min | 778 words

Tags:

At the heart of RavenDB, there is a data structure that we call the Page Translation Table. It is one of the most important pieces inside RavenDB.

The page translation table is basically a Dictionary<long, Page>, mapping between a page number and the actual page. The critical aspect of this data structure is that it is both concurrent and multi-version. That is, at a single point, there may be multiple versions of the table, representing different versions of the table at given points in time.

The way it works, a transaction in RavenDB generates a page translation table as part of its execution and publishes the table on commit. However, each subsequent table builds upon the previous one, so things become more complex. Here is a usage example (in Python pseudo-code):

table = {}


with wtx1 = write_tx(table):
  wtx1.put(2, 'v1')
  wtx1.put(3, 'v1')
  wtx1.publish(table)


# table has (2 => v1, 3 => v1)


with wtx2 = write_tx(table):
  wtx2.put(2, 'v2')
  wtx2.put(4, 'v2')
  wtx2.publish(table)


# table has (2 => v2, 3 => v1, 4 => v2)

This is pretty easy to follow, I think. The table is a simple hash table at this point in time.

The catch is when we mix read transactions as well, like so:

# table has (2 => v2, 3 => v1, 4 => v2)


with rtx1 = read_tx(table):


        with wtx3 = write_tx(table):
                wtx3.put(2, 'v3')
                wtx3.put(3, 'v3')
                wtx3.put(5, 'v3')


                with rtx2 = read_tx(table):
                        rtx2.read(2) # => gives, v2
                        rtx2.read(3) # => gives, v1
                        rtx2.read(5) # => gives, None


                wtx3.publish(table)


# table has (2 => v3, 3 => v3, 4 => v2, 5 => v3)
# but rtx2 still observe the value as they were when
# rtx2 was created


        rtx2.read(2) # => gives, v2
        rtx2.read(3) # => gives, v1
        rtx2.read(5) # => gives, None

In other words, until we publish a transaction, its changes don’t take effect. And any read translation that was already started isn’t impacted. We also need this to be concurrent, so we can use the table in multiple threads (a single write transaction at a time, but potentially many read transactions). Each transaction may modify hundreds or thousands of pages, and we’ll only clear the table of old values once in a while (so it isn’t infinite growth, but may certainly reach respectable numbers of items).

The implementation we have inside of RavenDB for this is complex! I tried drawing that on the whiteboard to explain what was going on, and I needed both the third and fourth dimensions to illustrate the concept.

Given these requirements, how would you implement this sort of data structure?

Oct 13 2023

ChallengeFastest node selection metastable error state–answer

time to read 2 min | 290 words

Tweet Share Share 1 comments

Tags:

In the previous post, I showed a very simple request router that would always select the fastest node. That worked for a long while, until it didn’t, and the challenge is figuring out why.

As it turns out, the issue is a simple one of spooky action at a distance. Here is what happens. Assume that we have three servers and 10 clients. Each server is sized to handle 4 clients. So far, so good, the system has the capacity to spare.

The problem is in the manner in which clients will detect which is the fastest node in the cluster. The only thing that is considered is the state of the node at the time of selection. At that time, we may end up with all the nodes selecting one particular node as the fastest.

In other words, we have three servers, two of them have no clients talking to them and one of the servers has all the clients talking to it. That results in that node going down, obviously. The clients would then react appropriately, and select a new node to talk to. All of them would do that, find the fastest node, and… bring it down as well. Rinse & repeat.

The issue can be stated as Time Of Check vs Time Of Use, but also as a race condition, where all individual nodes end up doing a synchronized “wave” operation that kills the system.

How do you prevent this?

You introduce randomness into the system. You don’t test the status once, but re-check on a regular basis so you can respond to shifting load. You should also introduce randomness into the process. So the nodes won’t all do this exactly at the same time and end up in the same position.

Oct 12 2023

ChallengeFastest node selection metastable error state

time to read 1 min | 186 words

Tweet Share Share 12 comments

Tags:

Side note: Current state in Israel right now is bad. I’m writing this blog post as a form of escapism so I can talk about something that makes sense and follow logic and reason. I’ll not comment on the current status otherwise in this area.

Consider the following scenario. We have a bunch of servers and clients. The clients want to send requests for processing to the fastest node that they have available. But the algorithm that was implemented has an issue, can you see what this is?

To simplify things, we are going to assume that the work that is being done for each request is the same, so we don’t need to worry about different request workloads.

The idea is that each client node will find the fastest node (usually meaning the nearest one) and if there is enough load on the server to have it start throwing errors, it will try to find another one. This system has successfully spread the load across all servers, until one day, the entire system went down. And then it stayed down.

Can you figure out what is the issue?

Sep 19 2023

ChallengeSpot the bug

time to read 1 min | 27 words

Tweet Share Share 5 comments

Tags:

The following bug cost me a bunch of time, can you see what I’m doing wrong?

For fun, it’s so nasty because usually, it will accidentally work.

Jan 04 2023

Challengewhat does this code print?

time to read 1 min | 42 words

Tweet Share Share 4 comments

Tags:

Given the following code:

Can you guess what it will do?

Can you explain why?

I love that this snippet is under 20 lines of code, but being able to explain it shows a lot more knowledge about C# than you would expect.

Dec 14 2022

ChallengeWhat does this code print?

time to read 1 min | 30 words

Tweet Share Share 4 comments

Tags:

Take a look at the following code, what do you think it will print?

Since it obviously doesn’t print the expected value, why do you think this is the case?

Jul 01 2022

ChallengeFind the stack smash bug… – answer

time to read 2 min | 249 words

Tweet Share Share 2 comments

Tags:

Yesterday I presented a bug that killed the process in a particularly rude manner. This is a recursive function that guards against stack overflows using RuntimeHelpers.EnsureSufficientExecutionStack().

Because of how this function kills the process, it took some time to figure out what is going on. There was no StackOverflowException, just an exit code. Here is the relevant code:

This looks okay, we optimize for zero allocations on the common path (less than 2K items), but also handle the big one.

The problem is that our math is wrong. More specifically, take a look at this line:

var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1;

Let’s assume that your count is 10, what do you think the value of this is going to be?

Well, it looks like this should give us 2, right?

10 / 8 + 10%8 == 0 ? 0 :1

The problem is in the operator precedence. I read this as:

(10 / 8) + (10 % 8 == 0 ? 0 : 1)

And the C# compiler read it as:

(10 / 8 + 10 % 8) == 0 ? 0 : 1

In other words, *#@*!*@!.

The end result is that we overwrite past our allocated stack. Usually that doesn’t do anything bad, since there is enough stack space. But sometimes, if the stack is aligned just right, we cross into the stack guard page and kill the process.

Opps, that was not expected.

Jun 30 2022

ChallengeFind the stack smash bug…

time to read 1 min | 171 words

Tweet Share Share 4 comments

Tags:

The following code is something that we ran into yesterday, under some conditions, this code will fail with a stack overflow. More specifically, the process crashes and the return code is –1073740791 (or as it is usually presented: 0xC0000409.

At this point in my career I can look at that error code and just recall that this is the Windows error code for a stack overflow, to be more precise, this is: STATUS_STACK_BUFFER_OVERRUN

That… makes sense, I guess, this is a recursive code, after all. Let’s take a look:

Except, that this code explicitly protects against this. Note the call to:

RuntimeHelpers.EnsureSufficientExecutionStack();

In other words, if we are about the run out of stack space, we ask the .NET framework to throw (just before we run out, basically).

This code doesn’t fail often, and we tried to push deeply nested structure through that, and we got an InsufficientExecutionStackException thrown.

Sometimes, however, when we run this code with a relatively flat structure (2 – 4 levels), it will just die with this error.

Can you spot the bug?

Jun 03 2022

ChallengeSpot the data corruption

time to read 1 min | 49 words

Tweet Share Share 6 comments

Tags:

The following code does not output the right value, can you tell why?

This has been a real bug that we ran into (with only slightly more complicated code.

The sort of things that make you just stare at the screen in disbelief when you realize what is going on…

May 06 2022

ChallengeSpot the optimization–solution

time to read 1 min | 164 words

Tweet Share Share 6 comments

Tags:

In my previous post, I asked why this change would result in a better performing system, since the total amount of work that is done is the same:

The answer is quite simple. The amount of work that our code is doing is the same, sure, but that isn’t all the code that runs.

In the first version, we would allocate the string, and then we’ll start a bunch of async operations. Those operations are likely to take some time and involve I/O (otherwise, they wouldn’t be async).

It is very likely that in the meantime, we’ll get a GC run. At that point, the string pointed to be the ids variable will be promoted (since it survived a GC). That means that it would be collected much later.

Using the new code, the scope of the ids string is far shorter. That means that the GC is more likely to catch it very early and significantly reduce the cost of releasing the memory.

Oren Eini

Oren Eini

CEO of RavenDB

ChallengeEfficient snapshotable state

ChallengeFastest node selection metastable error state–answer

ChallengeFastest node selection metastable error state

ChallengeSpot the bug

Challengewhat does this code print?

ChallengeWhat does this code print?

ChallengeFind the stack smash bug… – answer

ChallengeFind the stack smash bug…

ChallengeSpot the data corruption

ChallengeSpot the optimization–solution

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed