Voron is the code name for another storage engine that we are currently trying out, based on LMDB. After taking it for a spin for a while, it is now pretty complete, and I decided to give it some perf test runs. So far, there has been zero performance work. All the usual caveats apply here, with regards to short test runs, etc.
Just like before, we found that this was horribly slow on the first run. The culprit was our debug code, which verified the entire structure whenever we added or removed something. Once we ran it in release mode, we started getting more interesting results.
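Something along these lines is a common way to keep that sort of check out of release builds (a sketch only, not the actual Voron code; the method name is made up):

[Conditional("DEBUG")] // System.Diagnostics; call sites are removed entirely in release builds
private void ValidateStructure(Transaction tx)
{
    // walk every page in the tree, checking key ordering, page boundaries,
    // free space bookkeeping, etc. - O(n) work on every Add / Remove
}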
Here is the test code:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        var value = new byte[100];
        new Random().NextBytes(value);
        var ms = new MemoryStream(value);
        for (long i = 0; i < Count; i++)
        {
            ms.Position = 0;
            env.Root.Add(tx, i.ToString("0000000000000000"), ms);
        }
        tx.Commit();
    }

    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        DebugStuff.RenderAndShow(tx, tx.GetTreeInformation(env.Root).Root);
        tx.Commit();
    }
}
We write 1 million entries, each with a 100 byte value and a 16 character key. We run it in three modes:
fill seq none : 9,578 ms 104,404 ops / sec
fill seq buff : 10,802 ms 92,575 ops / sec
fill seq sync : 9,387 ms 106,528 ops / sec
None means no flushing to disk, letting the OS deal with that completely. Buffers means flushing to the OS, but not to disk, and sync means doing a full fsync.
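The flushMode passed to the pager in the snippet above is what selects between those behaviors. As a sketch (the enum and member names here are my guess, not necessarily Voron's actual API), it maps to something like:

// Sketch only - the real Voron enum may be named and shaped differently.
public enum FlushMode
{
    None,    // never flush explicitly; the OS writes dirty pages back whenever it likes
    Buffers, // push dirty pages to the OS (e.g. FlushViewOfFile), but don't force them to disk
    Full     // real fsync (e.g. FlushViewOfFile + FlushFileBuffers) - the "sync" run above
}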
Now, this is a pretty stupid way to go about it, I have to say. This is doing everything in a single transaction, and we are actually counting the time to open & close the db here as well, but that is okay for now. We aren't interested in real numbers, just some rough ideas.
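A more realistic shape for the write side would be to commit in batches rather than in one giant transaction, something like this (a sketch only; the batch size of 10,000 is arbitrary):

var value = new byte[100];
new Random().NextBytes(value);
var ms = new MemoryStream(value);
for (long start = 0; start < Count; start += 10000)
{
    // one transaction per 10,000 entries instead of one for the whole run
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        for (long i = start; i < Math.Min(start + 10000, Count); i++)
        {
            ms.Position = 0;
            env.Root.Add(tx, i.ToString("0000000000000000"), ms);
        }
        tx.Commit();
    }
}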
Now, let us see how we read it:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    using (var tx = env.NewTransaction(TransactionFlags.Read))
    {
        var ms = new MemoryStream(100);
        for (int i = 0; i < Count; i++)
        {
            var key = i.ToString("0000000000000000");
            using (var stream = env.Root.Read(tx, key))
            {
                ms.Position = 0;
                stream.CopyTo(ms);
            }
        }
        tx.Commit();
    }
}
And this gives us:
read seq : 3,289 ms 304,032 ops / sec
And again, this is with opening & closing the entire db.
We could do better with pre-allocation of space on disk, etc., but I wanted to keep things realistic and to allow us to grow.
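Pre-allocation here would basically mean reserving the file space up front, so the memory mapped file doesn't have to grow (and re-map) during the run. If we wanted to try it, it would look roughly like this (a sketch, not something Voron does in this test):

// Reserve ~1 GB for bench.data before opening the environment.
using (var fs = new FileStream("bench.data", FileMode.OpenOrCreate, FileAccess.ReadWrite))
{
    fs.SetLength(1024L * 1024 * 1024);
}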
Next, I wanted to see how much we would gain by parallelizing everything, so I wrote the following code:
using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        var currentBase = i;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                for (int j = 0; j < Count / parts; j++)
                {
                    // each thread reads its own contiguous slice of the keys
                    var current = currentBase * (Count / parts) + j;
                    var key = current.ToString("0000000000000000");
                    using (var stream = env.Root.Read(tx, key))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }
                tx.Commit();
            }
            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}
I then ran it with 1 – 16 parts, to see how it behaves.
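The driver for that is just a loop over the part counts, timing each run; something like this (a sketch; ReadInParallel is a hypothetical wrapper around the snippet above):

foreach (var parts in new[] { 1, 2, 4, 8, 16 })
{
    var sw = Stopwatch.StartNew(); // System.Diagnostics
    ReadInParallel(parts);         // hypothetical wrapper around the code above
    sw.Stop();
    Console.WriteLine("read parallel {0,2} : {1:#,#} ms {2:#,#} ops / sec",
        parts, sw.ElapsedMilliseconds, Count * 1000 / sw.ElapsedMilliseconds);
}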
And the results pretty much match what I expected:
read seq : 3,317 ms 301,424 ops / sec
read parallel 1 : 2,539 ms 393,834 ops / sec
read parallel 2 : 1,950 ms 512,711 ops / sec
read parallel 4 : 2,201 ms 454,172 ops / sec
read parallel 8 : 2,139 ms 467,387 ops / sec
read parallel 16 : 2,010 ms 497,408 ops / sec
We get a 2x perf improvement from running on two cores. Running on 4 threads requires some dancing around, which caused a small perf drop, and beyond that we mostly see more time spent in thread switching than anything else. As you can see, we get a nice jump in performance the more cores we use, until we reach the actual machine limitations.
Note that I made sure to clear the OS buffer cache before each test. If we don't do that, we get:
read seq : 2,562 ms 390,291 ops / sec
read parallel 1 : 2,608 ms 383,393 ops / sec
read parallel 2 : 1,868 ms 535,220 ops / sec
read parallel 4 : 1,646 ms 607,283 ops / sec
read parallel 8 : 1,673 ms 597,557 ops / sec
read parallel 16 : 1,581 ms 632,309 ops / sec
So far, I am pretty happy with those numbers. What I am not happy with is the current API, but I'll talk about that in a separate post.