At the heart of RavenDB, there is a data structure that we call the Page Translation Table. It is one of the most important pieces inside RavenDB.
The page translation table is basically a Dictionary<long, Page>, mapping between a page number and the actual page. The critical aspect of this data structure is that it is both concurrent and multi-version. That is, at a single point, there may be multiple versions of the table, representing different versions of the table at given points in time.
The way it works, a transaction in RavenDB generates a page translation table as part of its execution and publishes the table on commit. However, each subsequent table builds upon the previous one, so things become more complex. Here is a usage example (in Python pseudo-code):
table = {}
with wtx1 = write_tx(table):
wtx1.put(2, 'v1')
wtx1.put(3, 'v1')
wtx1.publish(table)
# table has (2 => v1, 3 => v1)
with wtx2 = write_tx(table):
wtx2.put(2, 'v2')
wtx2.put(4, 'v2')
wtx2.publish(table)
# table has (2 => v2, 3 => v1, 4 => v2)
This is pretty easy to follow, I think. The table is a simple hash table at this point in time.
The catch is when we mix read transactions as well, like so:
# table has (2 => v2, 3 => v1, 4 => v2)
with rtx1 = read_tx(table):
with wtx3 = write_tx(table):
wtx3.put(2, 'v3')
wtx3.put(3, 'v3')
wtx3.put(5, 'v3')
with rtx2 = read_tx(table):
rtx2.read(2) # => gives, v2
rtx2.read(3) # => gives, v1
rtx2.read(5) # => gives, None
wtx3.publish(table)
# table has (2 => v3, 3 => v3, 4 => v2, 5 => v3)
# but rtx2 still observe the value as they were when
# rtx2 was created
rtx2.read(2) # => gives, v2
rtx2.read(3) # => gives, v1
rtx2.read(5) # => gives, None
In other words, until we publish a transaction, its changes don’t take effect. And any read translation that was already started isn’t impacted. We also need this to be concurrent, so we can use the table in multiple threads (a single write transaction at a time, but potentially many read transactions). Each transaction may modify hundreds or thousands of pages, and we’ll only clear the table of old values once in a while (so it isn’t infinite growth, but may certainly reach respectable numbers of items).
The implementation we have inside of RavenDB for this is complex! I tried drawing that on the whiteboard to explain what was going on, and I needed both the third and fourth dimensions to illustrate the concept.
Given these requirements, how would you implement this sort of data structure?