RavenDB Cloud has a whole bunch of new features that were quietly launched in the past few months. I talked about them in this post. It turns out that the team keeps on delivering new stuff, faster than I can write about it.
The new feature is a really interesting one, because it is pretty simple to understand and has some interesting implications for production.
You need to explicitly enable auto-scaling on your cluster. Here is what that looks like:
Once you have enabled auto-scaling (which usually takes under a minute), you can click on the Configure button to set your own policies:
Here is what this looks like:
The idea is very simple: we routinely measure the load on the system, and if CPU usage stays above the threshold for a sustained period, we’ll kick off scaling to the next tier (or maybe higher, see the Upscaling / Downscaling step options) to provide additional resources to the system. If there isn’t enough load (as measured in CPU usage), we will downscale back to the lowest instance type.
Conceptually, this is a pretty simple setup. You use a lot of CPU - you get a bigger machine that has more resources to use, until it all balances out.
Now, let’s talk about the implications of this feature. To start with, it means that you only pay based on your actual load, and you don’t need to over-provision to match peak load.
The design of this feature and RavenDB in general means that we’ll be able to make scale up and scale down changes without any interruption in service. That means that you can let the auto-scaling manage the size of your instances.
In the image above, you may have noticed that I’m using the PB line of products (PB10 … PB50). That stands for burstable instances, which consume CPU credits if they are in use. How that plays with auto-scaling is really interesting.
As you use more CPU, you consume all the CPU credits, and your CPU usage is high. At this point, the auto-scaling kicks in and moves you to a higher tier. That gives you both more baseline CPU credits and higher CPU credits accrual rate.
Together with zero downtime upscaling and downscaling, it means that you can benefit from the burstable instances' lower cost without having to worry about running out of resources.
Note that auto-scaling only applies within the same family of instances. So if you are running on burstable instances, you’ll scale between burstable instances, and if you are running on the P series (non-burstable), your auto-scaling will use P instances.
Note that we offer auto-scaling for development instances as well. There, however, there is only a single RavenDB instance, so auto-scaling will still trigger, but there will be up to two minutes during which the instance is being scaled and is inaccessible. That isn’t an issue for the production tier.
In RavenDB, we really care about performance. That means that our typical code does not follow idiomatic C# code. Instead, we make use of everything that the framework and the language give us to eke out that additional push for performance. Recently we ran into a bug that was quite puzzling. Here is a simple reproduction of the problem:
using System.Runtime.InteropServices;

var counts = new Dictionary<int, int>();
var totalKey = 10_000;
ref var total = ref CollectionsMarshal.GetValueRefOrAddDefault(
    counts, totalKey, out _);
for (int i = 0; i < 4; i++)
{
    var key = i % 32;
    ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(
        counts, key, out _);
    count++;
    total++;
}
Console.WriteLine(counts[totalKey]);
What would you expect this code to output? We are using two important features of C# here:
Value types (in this case, an int, but the real scenario was with a struct)
CollectionsMarshal.GetValueRefOrAddDefault()
The latter method is a way to avoid performing two lookups in the dictionary to get the value if it exists and then add or modify it.
If you run the code above, it will output the number 2.
That is not expected, but when I sat down and thought about it, it made sense.
We are keeping track of the reference to a value in the dictionary, and we are mutating the dictionary.
The documentation for the method very clearly explains that this is a Bad Idea. It is an easy mistake to make, but still a mistake. The challenge here is figuring out why this is happening. Can you give it a minute of thought and see if you can figure it out?
A dictionary is basically an array that you access using an index (computed via a hash function), that is all. So if we strip everything away, the code above can be seen as:
var buffer = new int[2];
ref var total = ref buffer[0];
We simply have a reference to the first element in the array, that’s what this does behind the scenes. And when we insert items into the dictionary, we may need to allocate a bigger backing array for it, so this becomes:
var buffer = new int[2];
ref var total = ref buffer[0];

var newBuffer = new int[4];
buffer.CopyTo(newBuffer, 0);
buffer = newBuffer;

total = 1;
var newTotal = buffer[0]; // 0 - buffer now refers to the new array, while total still points to the old one
In other words, the total variable is pointing to the first element in the two-element array, but we allocated a new array (and copied all the values). That is the reason why the code above gives the wrong result. Makes perfect sense, and yet, was quite puzzling to figure out.
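One way to avoid the trap (a sketch of the general pattern, not necessarily the exact fix we applied) is to make sure a cached reference is never used after the dictionary may have been mutated, for example by updating the total through the indexer instead of a held ref:

using System.Runtime.InteropServices;

var counts = new Dictionary<int, int>();
var totalKey = 10_000;
counts[totalKey] = 0;
for (int i = 0; i < 4; i++)
{
    var key = i % 32;
    // The ref is acquired and consumed before any further mutation of the dictionary,
    // so a resize can no longer invalidate it.
    ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, key, out _);
    count++;
    counts[totalKey]++; // updating an existing key never resizes the dictionary
}
Console.WriteLine(counts[totalKey]); // 4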
I need to query over a time span, either known (start, end) or (start, $currentDate), and I need to be able to sort on them.
That might sound… vague, I know. A better way to explain this is that I have a list of people, and I need to sort them by their age. That’s trivial to do since I can sort by the birthday, right? The problem is that we include some historical data, so some people are deceased.
Basically, we want to be able to get the following data, sorted by age ascending:

Name                   Birthday   Death
Michael Stonebraker    1943       N/A
Sir Tim Berners-Lee    1955       N/A
Narges Mohammadi       1972       N/A
Sir Terry Pratchett    1948       2015
Agatha Christie        1890       1976
This doesn’t look hard, right? I mean, all you need to do is something like:
order by datediff( coalesce(Death, now()), Birthday )
Easy enough, and would work great if you have a small number of items to sort. What happens if we want to sort over 10M records?
Look at the manner in which we are ordering: it requires us to evaluate each and every record. That means we’ll have to scan through the entire list and sort it, which can be really expensive. And because we are sorting over a date (which changes), you can’t even get away with a computed field.
RavenDB will refuse to run queries that can only work with small amounts of data but will fail as the data grows. This is part of our philosophy that things should Just Work. Of course, in this case, it doesn’t work, so how does this align with our philosophy?
The idea is simple: if we cannot make it work in all cases, we reject it outright, so that your system is not susceptible to hidden traps. By explicitly rejecting such a query upfront, we make sure that you’ll have a good solution and not something that will fail as your data size grows.
What is the appropriate behavior here, then? How can we make it work with RavenDB?
The key issue is that we want to be able to figure out what is the value we’ll sort on during the indexing stage. This is important because otherwise we’ll have to compute it across the entire dataset for each query. We can do that in RavenDB by exposing that value to the index.
We cannot just call DateTime.Today, however. That won’t work when the day rolls over, of course. So instead, we store that value in a document config/current-date, like so:
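The configuration document itself isn’t reproduced here; a minimal sketch of storing it from the C# client might look like the following (the CurrentDate class name is an illustrative assumption, and the collection must match whatever the index’s LoadDocument() call references):

public class CurrentDate
{
    public DateTime Date { get; set; }
}

// Store (or overwrite) the well-known configuration document.
using (var session = store.OpenSession())
{
    session.Store(new CurrentDate { Date = DateTime.UtcNow.Date }, "config/current-date");
    session.SaveChanges();
}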
Once this is stored as a document, we can then write the following index:
from p in docs.People
let end = p.Death ?? LoadDocument("config/current-date", "Config").Date
select new
{
Age = end - p.Birthday
}
And then query it using:
from index 'People/WithAge'
order by Age desc
That works beautifully, of course, until the next day. What happens then? Well, we’ll need to schedule an update to the config/current-date document to correct the date.
At that point, because there is an association created between all the documents that loaded the current date, the indexing engine in RavenDB will go and re-index them. The idea is that at any given point in time, we have already computed the value, and can run really quick queries and sort on it.
When you update the configuration document, it is a signal that we need to re-index the referencing documents. RavenDB is good at knowing how to do that on a streaming basis, so it won’t need to do a huge amount of work all at once.
You’ll also note that we only load the configuration document if we don’t have an end date. So the deceased people’s records will not be affected or require re-indexing.
In short, we can benefit from querying over the age without incurring query time costs and can defer those costs to background indexing time. The downside is that we need to set up a cron job to make it happen, but that isn’t too big a task, I think.
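As a sketch, that scheduled job only needs to touch the configuration document (reusing the hypothetical CurrentDate class from the sketch above):

// Run once a day (cron job, scheduled task, etc.) - updating the document is the signal
// for RavenDB to re-index the people that reference it.
using (var session = store.OpenSession())
{
    var current = session.Load<CurrentDate>("config/current-date");
    current.Date = DateTime.UtcNow.Date;
    session.SaveChanges();
}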
You can utilize similar setups for other scenarios where you need to query over changing values. The performance benefits here are enormous. And what is more interesting, even if you have a huge amount of data, this approach will just keep on ticking and deliver great results at very low latencies.
The Cloud team at RavenDB has been working quite hard recently. The company at large is gearing up for the upcoming 6.2 release, but I can’t ignore the number of goodies that have dropped for RavenDB Cloud Customers.
Large Clusters & Sharding
RavenDB Cloud runs your production cluster with 3 nodes by default. Each one of them operates in a separate availability zone for maximum survivability. The new feature allows you to add additional nodes to your cluster. In the RavenDB Cloud Portal, you can see the “Add node” button and its impact:
Clicking this button allows you to add additional nodes to your cluster. The nodes will be deployed and attached to your cluster within a minute or two. The new nodes will be deployed in the same region (but not necessarily the same availability zone) where your cluster is already deployed.
There are plans in place to add support for deploying nodes in other regions and even in a multi-cloud environment. I would love to hear your feedback on this proposed feature.
You can see the new instances in the RavenDB Studio as well:
The key reason to add additional nodes to a cluster is that you have a very large dataset and want to shard the data. Here is what this can look like:
In this case, we have sharded the data across 5 nodes, with a replication factor of 2.
Feature selection
There are certain Enterprise features that are only available in the higher-end instances in RavenDB Cloud (typically P30 or higher). We now allow you to selectively enable these features even on lower-tier instances.
This feature allows you to easily pick & choose (on an a-la-carte basis) the specific features you want, without having to upgrade to the more expensive tiers.
Metrics & monitoring
This feature isn’t actually new, but it absolutely deserves your attention. The RavenDB Cloud Portal has a metrics button that you should get familiar with:
Clicking it will provide a wealth of information about your cluster and its behavior. That can be really useful if you want to understand the system’s behavior. Take a peek:
Alerts & Warnings
In addition to just looking at the metrics, the RavenDB Cloud backend will give you some indication about things that you should pay attention to. For example, let’s assume that we had a node failure. You’ll typically not notice that since the RavenDB Cluster & client will work to ensure high availability.
You’ll be able to see that in the metrics, and the RavenDB Cloud Portal will bring it to your attention:
Summary
The major point we strive for in RavenDB and RavenDB Cloud is the notion that the entire experience should be seamless: from deployment and routine management to ensuring that you don’t have to concern yourself with the minutiae of data management, so you can focus on your application.
Being able to develop both the software and its execution environment greatly helps in providing solutions that Just Work. I’m really proud of what we have accomplished and I would love to get your feedback on it.
It has been almost a year since the release of RavenDB 6.0. The highlights of the 6.0 release were Corax (a new blazing-fast indexing engine) and Sharding (server-side and simple to operate at scale). We made 10 stable releases in the 6.0.x line since then, mostly focused on performance, stability, and minor features.
The new RavenDB 6.2 release is now out and it has a bunch of new features for you to play with and explore. The team has been working on a wide range of new features, from enabling serverless triggers to quality-of-life improvements for operations teams.
RavenDB 6.2 is a Long Term Support (LTS) release
RavenDB 6.2 is a Long Term Support release, replacing the current 5.4 LTS (released in 2022). That means that we’ll support RavenDB 5.4 until Oct 2025, and we strongly encourage all users to upgrade to RavenDB 6.2 at their earliest convenience.
You can get the new RavenDB 6.2 bits on the download page. If you are running in the cloud, you can open a support request and ask to be upgraded to the new release.
Data sovereignty and geo-distribution via Prefixed Sharding
In RavenDB 6.2 we introduced a seemingly simple change to the way RavenDB handles sharding, with profound implications for what you can do with it. Prefixed sharding allows you to define which shards a particular set of documents will go to.
Here is a simple example:
In this case, data for users in the US will reside in shards 0 & 1, while the EU data is limited to shards 2 & 3. The data from Asia is spread over shards 0, 2, & 4. You can then assign those shards to specific nodes in a particular geographic region, and with that, you are done.
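As a rough sketch of the shape of such a configuration when creating a sharded database from the C# client (type and property names here are from memory and should be treated as assumptions; the documentation linked below has the authoritative API):

// Illustrative only - prefixes map sets of document IDs to specific shards.
var record = new DatabaseRecord("orders")
{
    Sharding = new ShardingConfiguration
    {
        Prefixed = new List<PrefixedShardingSetting>
        {
            new PrefixedShardingSetting { Prefix = "users/us/", Shards = new List<int> { 0, 1 } },
            new PrefixedShardingSetting { Prefix = "users/eu/", Shards = new List<int> { 2, 3 } },
            new PrefixedShardingSetting { Prefix = "users/asia/", Shards = new List<int> { 0, 2, 4 } },
        }
    }
};
store.Maintenance.Server.Send(new CreateDatabaseOperation(record, replicationFactor: 2));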
RavenDB will ensure that documents will stay only in their assigned location, handling data sovereignty issues for you. In the same manner, you get to geographically split the data so you can have a single world-spanning database while issuing mostly local queries.
You can read more about this feature and its impact in the documentation.
Actors architecture with Akka.NET
New in RavenDB 6.2 is the integration of RavenDB with Akka.NET. The idea is to allow you to easily manage state persistence of distributed actors in RavenDB. You’ll get both the benefit of the actor model via Akka.NET, simplifying parallelism and concurrency, while at the same time freeing yourself from persistence and high availability concerns thanks to RavenDB.
Azure Functions integration with ETL to Azure Queues
This is the sort of feature with hidden depths. ETL to Azure Queue Storage is fairly simple on the surface, it allows you to push data using RavenDB’s usual ETL mechanisms to Azure Queues. At a glance, this looks like a simple extension of our already existing capabilities with queues (ETL to Kafka or RabbitMQ).
OpenTelemetry integration
OpenTelemetry support is new in this release, in addition to the Prometheus, Telegraf, and SNMP telemetry solutions that are already in RavenDB. You can pick any of them to monitor and inspect the state of RavenDB.
Studio Omni-Search
We made some nice improvements to RavenDB Studio as well, and probably the most visible of those is the Omni-Search feature. You can now hit Ctrl+K in the Studio and just search across everything:
Commands in the Studio
Documents
Indexes
This feature greatly enhances the discoverability of features in RavenDB as well as makes it a joy for those of us (myself included) who love to keep our hands on the keyboard.
Summary
I’m really happy about this release. It continues the predictable and stable release cadence we have kept since the release of 6.0 a year ago. The new release adds a whole bunch of new features and capabilities, and it can be upgraded in place (including cross-version clusters) and deployed to production with no hassles.
Looking forward, we have already started work on the next version of RavenDB, tentatively meant to be 7.0. We have some cool ideas about what will go into that release (check the roadmap), but the key feature is likely to make RavenDB a more intelligent database, one might even say, artificially so.
Corax was released just under a year ago, and we are seeing more customers deploying that to production. During a call with a customer, we noticed the following detail:
Let me explain what we are seeing here. The two indexes are the same, operating on the same set of documents. The only difference between those indexes is the indexing engine.
What is really amazing here is that Corax is able to index in 3:21 minutes what Lucene takes 17:15 minutes to index. In other words, Corax is more than 5 times faster than Lucene in a real world scenario.
RavenDB has a hidden feature, enabled by default and not something that you usually need to be aware of. It has built-in support for caching. Consider the following code:
async Task<Dictionary<string, int>> HowMuchWorkToDo(string userId)
{
    using var session = _documentStore.OpenAsyncSession();
    var results = await session.Query<Item>()
        .GroupBy(x => new { x.Status, x.AssignedTo })
        .Where(g => g.Key.AssignedTo == userId && g.Key.Status != "Closed")
        .Select(g => new
        {
            Status = g.Key.Status,
            Count = g.Count()
        })
        .ToListAsync();
    return results.ToDictionary(x => x.Status, x => x.Count);
}
What happens if I call it twice with the same user? The first time, RavenDB will send the query to the server, where it will be evaluated and executed. The server will also send an ETag header with the response. The client will remember the response and its ETag in its own memory.
The next time this is called on the same user, the client will again send a request to the server. This time, however, it will also inform the server that it has a previous response to this query, with the specified ETag. The server, when realizing the client has a cached response, will do a (very cheap) check to see if the cached response matches the current state of the server. If so, it can inform the client (using 304 Not Modified) that it can use its cache.
In this way, we benefit twice:
First, on the server side, we avoid the need to compute the actual query.
Second, on the network side, we aren’t sending a full response back, just a very small notification to use the cached version.
You’ll note, however, that there is still an issue. We have to go to the server to check. That means that we still pay the network costs. So far, this feature is completely transparent to the user. It works behind the scenes to optimize server query costs and network bandwidth costs.
The next stage is to involve the user. Enter the AggressiveCache() feature (see the full documentation here), which allows the user to specify an additional aspect. Now, when the client has the value in the cache, it will skip going to the server entirely and serve the request directly from the cache.
What about cache invalidation? Instead of having the client check on each request if things have changed, we invert the process. The client asks the server to notify it when things change, and until it gets notice from the server, it can serve responses completely from the local cache.
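A sketch of what opting in could look like with the C# client (a scope-based call on the document store; check the documentation above for the exact options):

// Everything inside this scope may be answered straight from the client-side cache,
// with no round trip to the server, as long as no change notification has arrived.
using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
{
    var todo = await HowMuchWorkToDo("users/1-A");
    // ... more queries / loads that can be served from the cache ...
}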
I really love this feature. That was the Good part; now let’s talk about the other pieces:
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
The bad part of caching is that this introduces more complexity to the system. Consider a system with two clients that are using the same database. An update from one of them may show up at different times in each. Cache invalidation will not happen instantly, and it is possible to get into situations where the server fails to notify the client about the update, meaning that we didn’t clear the cache.
We have a good set of solutions around all of those, I think. But it is important to understand that the problem space itself is a problem.
In particular, let’s talk about dealing with the following query:
var emps = await session.Query<Employee>()
    .Include(x => x.Department)
    .Where(x => x.Location.City == "London")
    .ToListAsync();
When an employee is changed on the server, it will send a notice to the client, which can evict the item from the cache, right? But what about when a department is changed?
For that matter, what happens if a new employee is added to London? How do we detect that we need to refresh this query?
There are solutions to those problems, but they are super complicated and have various failure modes that often require more computing power than actually running the query. For that reason, RavenDB uses a much simpler model. If the server notifies us about any change, we’ll mark the entire cache as suspect.
The next request will have to go to the server (again with an ETag, etc) to verify that the response hasn’t changed. Note that if the specific query results haven’t changed, we’ll get OK (304 Not Modified) from the server, and the client will use the cached response.
Conservatively aggressive approach
In other words, even when using aggressive caching, RavenDB still has to go to the server sometimes. What is the impact of this approach when you have a system under load?
We’ll still use aggressive caching, but you’ll see brief periods where we aren’t checking with the server (usually we are able to cache for about a second or so), followed by queries to the server to check for any changes.
In most cases, this is what you want. We still benefit from the cache while reducing the number of remote calls by about 50%, and we don’t have to worry about missing updates. The downside is that, as application developers, we know that this particular document and query are independent, so we want to cache them until we get notice about that particular document being changed.
The default aggressive caching in RavenDB will not be of major help here, I’m afraid. But there are a few things you can do.
You can use Aggressive Caching in the NoTracking mode. In that mode, the client will not ask the server for notifications on changes, and will cache the responses in memory until they expire (clock expiration or size expiration only).
Another option is to take this feature higher than RavenDB directly, but still use its capabilities. Since we have a scenario where we know that we want to cache a specific set of documents and refresh the cache only when those documents are updated, let’s write it.
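The class from the original post isn’t reproduced here; the following is a minimal sketch of what such a cache could look like, matching the points below (the ConfigurationCache and Config names, and the "Configs" collection, are illustrative assumptions):

using System;
using System.Collections.Concurrent;
using Raven.Client.Documents;

public sealed class ConfigurationCache : IDisposable
{
    // Immutable records only - the same instance may be handed to multiple threads.
    public sealed record Config(string Id, string Value);

    // Document IDs in RavenDB are case insensitive, so use the matching comparer.
    private readonly ConcurrentDictionary<string, Config> _cache =
        new(StringComparer.OrdinalIgnoreCase);

    private readonly IDocumentStore _store;
    private readonly IDisposable _subscription;

    public ConfigurationCache(IDocumentStore store)
    {
        _store = store;
        var changes = store.Changes();
        // If the connection drops, we may have missed notifications - clear everything.
        changes.ConnectionStatusChanged += (_, _) => _cache.Clear();
        // Evict a document whenever the server notifies us of a change in its collection.
        _subscription = changes.ForDocumentsInCollection("Configs")
            .Subscribe(change => _cache.TryRemove(change.Id, out _));
    }

    public Config Get(string id)
    {
        return _cache.GetOrAdd(id, docId =>
        {
            using var session = _store.OpenSession();
            return session.Load<Config>(docId);
        });
    }

    public void Dispose() => _subscription.Dispose();
}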
There are a few things to note about this code. We are holding live instances, so we ensure that the values we keep are immutable records. Otherwise, we may hand the same instance to two threads which can be… fun.
Note that document IDs in RavenDB are case insensitive, so we pass the right string comparer.
Finally, the magic happens in the constructor. We register for two important events. Whenever the connection status of the Changes() connection is modified, we clear the cache. This handles any lost updates scenarios that occurred while we were disconnected.
In practice, the subscription to events on that particular collection is where we ensure that after the server notification, we can evict the document from the cache so that the next request will load a fresh version.
Caching + Distributed Systems = 🤯🤯🤯
I’m afraid this isn’t an easy topic once you dive into the specifics and constraints we operate under. As I mentioned, I would love your feedback on the background cache refresh feature, or maybe you have better insight into other ways to address the topic.
RavenDB is a pretty old codebase, hitting 15+ years in production recently. In order to keep it alive & well, we make sure to follow the rule of always leaving the code in a better shape than we found it.
Today’s tale is about the StreamBitArray class, deep in the guts of Voron, RavenDB’s storage engine. The class itself isn’t really that interesting, it is just an implementation of a Bit Array that we have for a bitmap. We wrote it (based on Mono’s code, it looks like) very early in the history of RavenDB and have never really touched it since.
The last time anyone touched it was 5 years ago (fixing the namespace), 7 years ago we created an issue from a TODO comment, etc. Most of the code dates back to 2013, actually. And even then it was moved from a different branch, so we lost the really old history.
To be clear, that class did a full tour of duty. For over a decade, it has served us very well. We never found a reason to change it, never got a trace of it in the profiler, etc. As we chip away at various hurdles inside RavenDB, I ran into this class and really looked at it with modern sensibilities. I think that this makes a great test case for code refactoring from the old style to our modern one.
Here is what the class looks like:
Already, we can see several things that really bug me. That class is only used in one context, to manage the free pages bitmap for Voron. That means we create it whenever Voron frees a page. That can happen a lot, as you might imagine.
A single bitmap here covers 2048 pages, so when we create an instance of this class we also allocate an array with 64 ints. In other words, we need to allocate 312 bytes for each page we free. That isn’t fun, and it actually gets worse. Here is a typical example of using this class:
using (freeSpaceTree.Read(section, out Slice result))
{
    sba = !result.HasValue ?
        new StreamBitArray() :
        new StreamBitArray(result.CreateReader());
}
sba.Set((int)(pageNumber % NumberOfPagesInSection), true);
using (sba.ToSlice(tx.Allocator, out Slice val))
    freeSpaceTree.Add(section, val);
And inside the ToSlice() call, we have:
public ByteStringContext.InternalScope ToSlice(ByteStringContext context, ByteStringType type, out Slice str)
{
    var buffer = ToBuffer();
    var scope = context.From(buffer, 0, buffer.Length, type, out ByteString byteString);
    str = new Slice(byteString);
    return scope;
}

private unsafe byte[] ToBuffer()
{
    var tmpBuffer = new byte[(_inner.Length + 1) * sizeof(int)];
    unsafe
    {
        fixed (int* src = _inner)
        fixed (byte* dest = tmpBuffer)
        {
            *(int*)dest = SetCount;
            Memory.Copy(dest + sizeof(int), (byte*)src, tmpBuffer.Length - 1);
        }
    }
    return tmpBuffer;
}
In other words, ToSlice() calls ToBuffer(), which allocates an array of bytes (288 bytes are allocated here), copies the data from the inner buffer to a new one (using fixed on the two arrays, which is a performance issue all in itself) and then calls a method to do the actual copy. Then in ToSlice() itself we allocate it again in native memory, which we then write to Voron, and then discard the whole thing.
In short, somehow it turns out that freeing a page in Voron costs us ~1KB of memory allocations. That sucks, I have to say. And the only reasoning I have for this code is that it is old.
There is also a constructor that accepts a reader over a piece of memory, and it does a bunch of things: it calls a few methods, uses fixed on the array, etc., all to get the data from the reader into the class. That is horribly inefficient.
Let’s write it from scratch and see what we can do. The first thing to notice is that this is a very short-lived class, it is only used inside methods and never held for long. This usage pattern tells me that it is a good candidate to be made into a struct, and as long as we do that, we might as well fix the allocation of the array as well.
Note that I have a hard constraint, I cannot change the structure of the data on disk for backward compatibility reasons. So only in-memory changes are allowed.
Here is my first attempt at refactoring the code:
public unsafe struct StreamBitArray
{
    private fixed uint _inner[64];
    public int SetCount;

    public StreamBitArray()
    {
        SetCount = 0;
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[0]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[8]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[16]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[24]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[32]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[40]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[48]);
        Vector256<uint>.Zero.StoreUnsafe(ref _inner[56]);
    }

    public StreamBitArray(byte* ptr)
    {
        var ints = (uint*)ptr;
        SetCount = (int)*ints;

        var a = Vector256.LoadUnsafe(ref ints[1]);
        var b = Vector256.LoadUnsafe(ref ints[9]);
        var c = Vector256.LoadUnsafe(ref ints[17]);
        var d = Vector256.LoadUnsafe(ref ints[25]);
        var e = Vector256.LoadUnsafe(ref ints[33]);
        var f = Vector256.LoadUnsafe(ref ints[41]);
        var g = Vector256.LoadUnsafe(ref ints[49]);
        var h = Vector256.LoadUnsafe(ref ints[57]);

        a.StoreUnsafe(ref _inner[0]);
        b.StoreUnsafe(ref _inner[8]);
        c.StoreUnsafe(ref _inner[16]);
        d.StoreUnsafe(ref _inner[24]);
        e.StoreUnsafe(ref _inner[32]);
        f.StoreUnsafe(ref _inner[40]);
        g.StoreUnsafe(ref _inner[48]);
        h.StoreUnsafe(ref _inner[56]);
    }
}
That looks like a lot of code, but let’s see what changes I brought to bear here.
Using a struct instead of a class saves us an allocation.
Using a fixed array means that we don’t have a separate allocation for the buffer.
Using [SkipLocalsInit] means that we ask the JIT not to zero the struct. We do that directly in the default constructor.
We are loading the data from the ptr in the second constructor directly.
The fact that this is a struct and using a fixed array means that we can create a new instance of this without any allocations, we just need 260 bytes of stack space (the 288 we previously allocated also included object headers).
Let’s look at the actual machine code that these two constructors generate. Looking at the default constructor, we have:
There is the function prolog and epilog, but the code of this method uses 4 256-bit instructions to zero the buffer. If we were to let the JIT handle this, it would use 128-bit instructions and a loop to do it. In this case, our way is better, because we know more than the JIT.
As for the constructor accepting an external pointer, here is what this translates into:
This code is exciting to me because we are also allowing instruction-level parallelism. We effectively allow the CPU to execute all the operations of reading and writing in parallel.
Next on the chopping block is this method:
public int FirstSetBit()
{
    for (int i = 0; i < _inner.Length; i++)
    {
        if (_inner[i] == 0)
            continue;
        return i << 5 | HighestBitSet(_inner[i]);
    }
    return -1;
}

private static int HighestBitSet(int v)
{
    v |= v >> 1; // first round down to one less than a power of 2
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return MultiplyDeBruijnBitPosition[(uint)(v * 0x07C4ACDDU) >> 27];
}
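The vectorized replacement from the post isn’t reproduced here; a rough sketch of the approach described next (illustrative only, not the actual RavenDB code) could look like this:

// Scan 8 uints at a time; once a non-zero block is found, locate the exact int
// and then the first set bit inside it.
public int FirstSetBit()
{
    for (int i = 0; i < 64; i += 8)
    {
        var block = Vector256.LoadUnsafe(ref _inner[i]);
        if (Vector256.EqualsAll(block, Vector256<uint>.Zero))
            continue;
        for (int j = i; j < i + 8; j++)
        {
            if (_inner[j] != 0)
                return j * 32 + BitOperations.TrailingZeroCount(_inner[j]);
        }
    }
    return -1;
}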
We are using vector instructions to scan 8 ints at a time, trying to find the first one that is set. Then we find the right int and locate the first set bit there. Here is what the assembly looks like:
In short, the code is simpler, shorter, and more explicit about what it is doing. The machine code that is running there is much tighter. And I don’t have allocations galore.
This particular optimization isn’t about showing better numbers in a specific scenario that I can point to. I don’t think we ever delete enough pages to actually see this in a profiler output in such an obvious way. The goal is to reduce allocations and give the GC less work to do, which has a global impact on the performance of the system.
During a performance evaluation internally, we ran into a strange situation. Our bulk insert performance using the node.js API was significantly worse than the performance of other clients. In particular, when we compared that to the C# version, we saw that the numbers were significantly worse than expected.
To be fair, this comparison is made between our C# client, which has been through the wringer in terms of optimization and attention to performance, and the Node.js client. The focus of the Node.js client was on correctness and usability.
It isn’t fair to expect the same performance from Node.js and C#, after all. However, that difference in performance was annoying enough to make us take a deeper look into what was going on.
Here is the relevant code:
const store = new DocumentStore('http://localhost:8080', 'bulk');
store.initialize();

const bulk = store.bulkInsert();
for (let i = 0; i < 100_000_000; i++) {
    await bulk.store(new User('user' + i));
}
await bulk.finish();
As you can see, the Node.js numbers are respectable. Running at a rate of over 85,000 writes per second is nothing to sneeze at.
But I also ran the exact same test with the C# client, and I got annoyed. The C# client was able to hit close to 100,000 more writes per second than the Node.js client. And in both cases, the actual limit was on the client side, not on the server side.
For fun, I ran a few clients and hit 250,000 writes/second without really doing much. The last time we properly tested ingest performance for RavenDB we achieved 150,000 writes/second. So it certainly looks like we are performing significantly better.
Going back to the Node.js version, I wanted to know what exactly was the problem that we had there. Why are we so much slower than the C# version? It’s possible that this is just the limits of the node.js platform, but you gotta check to know.
Node.js has an --inspect flag that you can use, and Chrome has a built-in profiler (chrome://inspect) that can plug into that. Using the DevTools, you can get a performance profile of a Node.js process.
I did just that and got the following numbers:
That is… curious. Really curious, isn’t it?
Basically, none of my code appears here at all, most of the time is spent dealing with the async machinery. If you look at the code above, you can see that we are issuing an await for each document stored.
The idea with bulk insert is that under the covers, we split the writing to an in-memory buffer and the flushing of the buffer to the network. In the vast majority of cases, we’ll not do any async operations in the store() call. If the buffer is full, we’ll need to flush it to the network, and that may force us to do an actual await operation. In Node.js, awaiting an async function that doesn’t actually perform any async operation appears to be super expensive.
We threw around a bunch of ideas on how to resolve this issue. The problem is that Node.js has no equivalent to C#’s ValueTask. We also have a lot of existing code out there in the field that we must remain compatible with.
Our solution to this dilemma was to add another function that you can call, like so:
for (let i = 0; i < 100_000_000; i++) {
    const user = new User('user' + i);
    const id = "users/" + i;
    if (bulk.tryStoreSync(user, id) == false) {
        await bulk.store(user, id);
    }
}
The idea is that if you call tryStoreSync() we’ll try to do everything in memory, but it may not be possible (e.g. if we need to flush the buffer). In that case, you’ll need to call the async function store() explicitly.
Given that the usual reason for using the dedicated API for bulk insert is performance, this looks like a reasonable thing to ask. Especially when you can see the actual performance results. We are talking about over 55%(!!!) improvement in the performance of bulk insert.
It gets even better. That was just the mechanical fix to avoid generating a promise per operation. While we are addressing this performance issue, there are a few other low-hanging fruits that could improve the bulk insert performance in Node.js.
For example, it turns out that we pay a hefty cost to generate the metadata for all those documents (runtime reflection cost, mostly). We can generate it once and be done with it, like so:
const bulk = store.bulkInsert();
const metadata = {
    "@collection": "Users",
    "Raven-Node-Type": "User"
};
for (let i = 0; i < 100_000_000; i++) {
    const user = new User('user' + i);
    const id = "users/" + i;
    if (bulk.tryStoreSync(user, id, metadata) == false) {
        await bulk.store(user, id, metadata);
    }
}
await bulk.finish();
And this code in particular gives us:
That is basically near enough to the C#’s speed that I don’t think we need to pay more attention to performance. Overall, that was time very well spent in making things go fast.
A customer called us about some pretty weird-looking numbers in their system:
You’ll note that the total number of entries in the index across all the nodes does not match. Notice that node C has 1 less entry than the rest of the system.
At the same time, all the indicators are green. As far as the administrator can tell, there is no issue, except for the number discrepancy. Why is it behaving in this manner?
Well, let’s zoom out a bit. What are we actually looking at here? We are looking at the state of a particular index in a single database within a cluster of machines. When examining the index, there is no apparent problem. Indexing is running properly, after all.
The actual problem was a replication issue, which prevented replication from proceeding to the third node. When looking at the index status, you can only see that the entry count is different.
When we zoom out and look at the state of the cluster, we can see this:
There are a few things that I want to point out in this scenario. The problem here is a pretty nasty one. All nodes are alive and well, they are communicating with each other, and any simple health check you run will give good results.
However, there is a problem that prevents replication from properly flowing to node C. The actual details aren’t relevant (a bug that we fixed, to tell the complete story). The most important aspect is how RavenDB behaves in such a scenario.
The cluster detected this as a problem, marked the node as problematic, and raised the appropriate alerts. As a result of this, clients would automatically be turned away from node C and use only the healthy nodes.
From the customer’s perspective, the issue was never user-visible since the cluster isolated the problematic node. I had a hand in the design of this, and I wrote some of the relevant code. And I’m still looking at these screenshots with a big sense of accomplishment.
This stuff isn’t easy or simple. But to an outside observer, the problem started from: why am I looking at funny numbers in the index state in the admin panel? And not at: why am I serving the wrong data to my users.
The design of RavenDB is inherently paranoid. We go to a lot of trouble to ensure that even if you run into problems, even if you encounter outright bugs (as in this case), the system as a whole would know how to deal with them and either recover or work around the issue.
As you can see, live in production, it actually works and does the Right Thing for you. Thus, I can end this post by saying that this behavior makes me truly happy.