Ayende @ Rahienhttps://www.ayende.com/blog/Ayende @ RahienCopyright (C) Ayende Rahien 2004 - 2021 (c) 202460Fun with bugs: Advanced Dictionary API<p style="text-align:left;">In RavenDB, we <em>really</em>&nbsp;care about performance. That means that our typical code does <em>not</em>&nbsp;follow idiomatic C# code. Instead, we make use of everything that the framework and the language give us to eke out that additional push for performance. Recently we ran into a bug that was quite puzzling. Here is a simple reproduction of the problem:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-csharp'><code class='line-numbers language-csharp'><span class="token keyword">using</span> <span class="token namespace">System<span class="token punctuation">.</span>Runtime<span class="token punctuation">.</span>InteropServices</span><span class="token punctuation">;</span> <span class="token class-name"><span class="token keyword">var</span></span> counts <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">Dictionary<span class="token punctuation">&lt;</span><span class="token keyword">int</span><span class="token punctuation">,</span> <span class="token keyword">int</span><span class="token punctuation">></span></span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token class-name"><span class="token keyword">var</span></span> totalKey <span class="token operator">=</span> <span class="token number">10_000</span><span class="token punctuation">;</span> <span class="token keyword">ref</span> <span class="token class-name"><span class="token keyword">var</span></span> total <span class="token operator">=</span> <span class="token keyword">ref</span> CollectionsMarshal<span class="token punctuation">.</span><span class="token function">GetValueRefOrAddDefault</span><span class="token punctuation">(</span> counts<span class="token punctuation">,</span> totalKey<span class="token punctuation">,</span> <span class="token keyword">out</span> _<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token class-name"><span class="token keyword">int</span></span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> <span class="token number">4</span><span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token class-name"><span class="token keyword">var</span></span> key <span class="token operator">=</span> i <span class="token operator">%</span> <span class="token number">32</span><span class="token punctuation">;</span> <span class="token keyword">ref</span> <span class="token class-name"><span class="token keyword">var</span></span> count <span class="token operator">=</span> <span class="token keyword">ref</span> CollectionsMarshal<span class="token punctuation">.</span><span class="token function">GetValueRefOrAddDefault</span><span class="token punctuation">(</span> counts<span class="token punctuation">,</span> key<span class="token punctuation">,</span> <span class="token keyword">out</span> _<span class="token punctuation">)</span><span class="token punctuation">;</span> count<span class="token operator">++</span><span class="token punctuation">;</span> total<span class="token operator">++</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> Console<span class="token punctuation">.</span><span class="token function">WriteLine</span><span class="token punctuation">(</span>counts<span class="token punctuation">[</span>totalKey<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">What would you <em>expect</em>&nbsp;this code to output? We are using two important features of C# here:</p><ul><li>Value types (in this case, an int, but the real scenario was with a struct)</li><li>CollectionMarshal.GetValueRefOrAddDefault()</li></ul><p style="text-align:left;">The latter method is a way to avoid performing two lookups in the dictionary to get the value if it exists and then add or modify it. </p><p style="text-align:left;">If you run the code above, it will output the number 2. </p><p style="text-align:left;">That is <em>not</em>&nbsp;expected, but when I sat down and thought about it, it made sense.</p><p style="text-align:left;">We are keeping track of the reference to a value in the dictionary, <em>and we are mutating</em>&nbsp;the dictionary.</p><p style="text-align:left;">The documentation for the method <span style="text-decoration:underline;"><a style="color:inherit;" href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.collectionsmarshal.getvaluereforadddefault?view=net-8.0#system-runtime-interopservices-collectionsmarshal-getvaluereforadddefault-2(system-collections-generic-dictionary((-0-1))-0-system-boolean@)">very clearly explains that this is a Bad Idea.</a></span>&nbsp;It is an easy mistake to make, but still a mistake. The challenge here is figuring out <em>why</em>&nbsp;this is happening. Can you give it a minute of thought and see if you can figure it out?</p><p style="text-align:left;">A dictionary is basically an array that you access using an index (computed via a hash function), that is all. So if we strip everything away, the code above can be seen as:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'><span class="token keyword">var</span> buffer <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">int</span><span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">;</span> ref <span class="token keyword">var</span> total <span class="token operator">=</span> ref <span class="token keyword">var</span> buffer<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">We simply have a reference to the first element in the array, that&rsquo;s what this does behind the scenes. And when we insert items into the dictionary, we may need to allocate a bigger backing array for it, so this becomes:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'><span class="token keyword">var</span> buffer <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">int</span><span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">;</span> ref <span class="token keyword">var</span> total <span class="token operator">=</span> ref <span class="token keyword">var</span> buffer<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token keyword">var</span> newBuffer <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">int</span><span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span><span class="token punctuation">;</span> buffer<span class="token punctuation">.</span><span class="token function">CopyTo</span><span class="token punctuation">(</span>newBuffer<span class="token punctuation">)</span><span class="token punctuation">;</span> buffer <span class="token operator">=</span> newBuffer<span class="token punctuation">;</span> total <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span> <span class="token keyword">var</span> newTotal <span class="token operator">=</span> buffer<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span></code></pre><hr/></p><p style="text-align:left;">In other words, the <em>total</em>&nbsp;variable is pointing to the first element in the <em>two-element array</em>, but we allocated a <em>new</em>&nbsp;array (and copied all the values). That is the reason why the code above gives the wrong result. Makes perfect sense, and yet, was quite puzzling to figure out.</p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201761-C/fun-with-bugs-advanced-dictionary-api?Key=8f507241-34b2-4e31-b3f1-a31f6adbcc28https://www.ayende.com/blog/201761-C/fun-with-bugs-advanced-dictionary-api?Key=8f507241-34b2-4e31-b3f1-a31f6adbcc28Fri, 15 Nov 2024 10:00:00 GMTQuerying over the current time in RavenDB<p style="text-align:left;">We received <span style="text-decoration:underline;"><a style="color:inherit;" href="https://github.com/ravendb/ravendb/discussions/19327#discussioncomment-10747974">a really interesting question</a></span>&nbsp;from a user, which basically boils down to:</p><blockquote><p style="text-align:left;">I need to query over a time span, either known (start, end) or (start, $currentDate), and I need to be able to sort on them.</p></blockquote><p style="text-align:left;">That might sound&hellip; vague, I know. A better way to explain this is that I have a list of people, and I need to sort them by their age. That&rsquo;s trivial to do since I can sort by the birthday, right? The problem is that we include some historical data, so some people are deceased. </p><p style="text-align:left;">Basically, we want to be able to get the following data, sorted by age ascending:</p><table style="width:100%;" class="table-bordered table-striped" ><tr><strong><td>Name</td><td>Birthday</td><td>Death</td></strong></tr><tr><td>Michael Stonebraker</td><td>1943</td><td>N/A</td></tr><tr><td>Sir Tim Berners-Lee </td><td>1955</td><td>N/A</td></tr><tr><td>Narges Mohammadi</td><td>1972</td><td>N/A</td></tr><tr><td>Sir Terry Prachett</td><td>1948</td><td>2015</td></tr><tr><td>Agatha Christie</td><td>1890</td><td>1976</td></tr></table><p style="text-align:left;">This doesn&rsquo;t look hard, right? I mean, all you need to do is something like:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-bash'><code class='line-numbers language-bash'>order by datediff<span class="token punctuation">(</span> coalesce<span class="token punctuation">(</span>Death, now<span class="token punctuation">(</span><span class="token punctuation">))</span>, Birthday <span class="token punctuation">)</span></code></pre><hr/></p><p style="text-align:left;">Easy enough, and would work <em>great</em>&nbsp;if you have a small number of items to sort. What happens if we want to sort over 10M records? </p><p style="text-align:left;">Look at the manner in which we are ordering, that will <em>require</em>&nbsp;us to evaluate each and every record. That means we&rsquo;ll have to scan through the entire list and sort it. This can be <em>really</em>&nbsp;expensive. And because we are sorting over a date (which changes), you can&rsquo;t even get away with a computed field.</p><p style="text-align:left;">RavenDB will refuse to run queries that can only work with small amounts of data but will fail as the data grows. This is part of our philosophy, saying that things should Just Work. Of course, in this case, it do<em>esn&rsquo;t</em>&nbsp;work, so the question is how this aligns&nbsp;with our philosophy?</p><p style="text-align:left;">The idea is simple. If we cannot make it work in all cases, we will reject it outright. The idea is to ensure that your system is not susceptible to hidden traps. By explicitly rejecting it upfront, we make sure that you&rsquo;ll have a good solution and not something that will fail as your data size grows. </p><p style="text-align:left;">What is the appropriate behavior here, then? How can we make it work with RavenDB?</p><p style="text-align:left;">The key issue is that we want to be able to figure out what is the value we&rsquo;ll sort on <em>during the indexing stage</em>. This is important because otherwise we&rsquo;ll have to compute it across the entire dataset for <em>each</em>&nbsp;query. We can do that in RavenDB by exposing that value to the index. </p><p style="text-align:left;">We cannot just call DateTime.Today, however. That won&rsquo;t work when the day rolls over, of course. So instead, we store that value in a document config/current-date,&nbsp;like so:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-json'><code class='line-numbers language-json'><span class="token punctuation">{</span> <span class="token comment">// config/current-date</span> <span class="token property">"Date"</span><span class="token operator">:</span> <span class="token string">"2024-10-10T00:00:00.0000000"</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">Once this is stored as a document, we can then write the following index:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-json'><code class='line-numbers language-json'>from p in docs.People let end = p.Death ?? LoadDocument(<span class="token string">"config/current-date"</span><span class="token punctuation">,</span> <span class="token string">"Config"</span>).Date select new <span class="token punctuation">{</span> Age = end - p.Birthday <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">And then query it using:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-bash'><code class='line-numbers language-bash'>from index <span class="token string">'People/WithAge'</span> order by Age desc</code></pre><hr/></p><p style="text-align:left;">That works beautifully, of course, until the next day. What happens then? Well, we&rsquo;ll need to schedule an update to the config/current-date&nbsp;document to correct the date.</p><p style="text-align:left;">At that point, because there is an association created between all the documents that loaded the current date, the indexing engine in RavenDB will go and re-index them. The idea is that at any given point in time, we have already computed the value, and can run <em>really</em>&nbsp;quick queries and sort on it. </p><p style="text-align:left;">When you update the configuration document, it is a signal that we need to re-index the referencing documents. RavenDB is good at knowing how to do that on a streaming basis, so it won&rsquo;t need to do a huge amount of work all at once. </p><p style="text-align:left;">You&rsquo;ll also note that we only load the configuration document if we <em>don&rsquo;t</em>&nbsp;have an end date. So the deceased people&rsquo;s records will not be affected or require re-indexing.</p><p style="text-align:left;">In short, we can benefit from querying over the age without incurring query time costs and can defer those costs to background indexing time. The downside is that we need to set up a cron job to make it happen, but that isn&rsquo;t too big a task, I think.</p><p style="text-align:left;">You can utilize similar setups for other scenarios where you need to query over changing values. The performance benefits here are enormous. And what is more interesting, even if you have a huge amount of data, this approach will just keep on ticking and deliver great results at very low latencies.</p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201729-B/querying-over-the-current-time-in-ravendb?Key=d39c0afe-76fe-4911-81d6-1d3e3b73f23dhttps://www.ayende.com/blog/201729-B/querying-over-the-current-time-in-ravendb?Key=d39c0afe-76fe-4911-81d6-1d3e3b73f23dFri, 11 Oct 2024 12:00:00 GMTRavenDB 6.2 release<p style="text-align:left;">It has been almost a year since the release of RavenDB 6.0. The highlights of the 6.0 release were Corax (a new blazing-fast indexing engine) and Sharding (server-side and simple to operate at scale). We made 10 stable releases in the 6.0.x line since then, mostly focused on performance, stability, and minor features.</p><p style="text-align:left;">The new RavenDB 6.2 release is now out and it has a <em>bunch </em>of new features for you to play with and explore. The team has been working on a wide range of new features, from enabling serverless triggers to quality-of-life improvements for operations teams. </p><h3 style="color:#434343;text-align:left;">RavenDB 6.2 is a Long Term Support (LTS) release</h3><blockquote><p style="text-align:left;">RavenDB 6.2 is a Long Term Support release, replacing the current 5.4 LTS (released in 2022). That means that we&rsquo;ll support RavenDB 5.4 until Oct 2025, and we strongly encourage all users to upgrade to RavenDB 6.2 at their earliest convenience.</p></blockquote><p style="text-align:left;">You can get the new RavenDB 6.2 bits on <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/download">the download page</a></span>. If you are running in the cloud, you can open a support request and ask to be upgraded to the new release. </p><h2 style="text-align:left;">Data sovereignty and geo-distribution via Prefixed Sharding</h2><p style="text-align:left;">In RavenDB 6.2 we introduced a <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.1/csharp/sharding/administration/sharding-by-prefix">seemingly simple change</a></span>&nbsp;to the way RavenDB handles sharding, with profound implications for what you can do with it. <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.1/csharp/sharding/administration/sharding-by-prefix">Prefixed sharding</a></span>&nbsp;allows you to define which shards a particular set of documents will go to. </p><p style="text-align:left;">Here is a simple example:</p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">In this case, data for users in the US will reside in shards 0 &amp; 1, while the EU data is limited to shards 2 &amp; 3. The data from Asia is spread over shards 0, 2, &amp; 4. &nbsp;You can then assign those shards to specific nodes in a particular geographic region, and with that, you are done. </p><p style="text-align:left;">RavenDB will ensure that documents will stay only in their assigned location, handling data sovereignty issues for you. In the same manner, you get to geographically split the data so you can have a single world-spanning database while issuing mostly local queries.</p><p style="text-align:left;">You can read more about this feature and its impact in <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.1/csharp/sharding/administration/sharding-by-prefix">the documentation</a></span>.</p><h1 style="text-align:left;">Actors architecture with Akka.NET</h1><p style="text-align:left;">New in RavenDB 6.2 is the <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/articles/using-ravendb-persistence-in-an-akka-net-application">integration of RavenDB with Akka.NET</a></span>. The idea is to allow you to easily manage state persistence of distributed actors in RavenDB. You&rsquo;ll get both the benefit of the actor model via Akka.NET, simplifying parallelism and concurrency, while at the same time freeing yourself from persistence and high availability concerns thanks to RavenDB.</p><p style="text-align:left;">We have an article out discussing <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/articles/using-ravendb-persistence-in-an-akka-net-application">how you use RavenDB &amp; Akka.NET</a></span>,&nbsp;and if you are into that sort of thing, there is also a <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/articles/notes-from-integrating-ravendb-with-akka-net-persistence">detailed set of notes covering the actual implementation</a></span>&nbsp;and the challenges involved.</p><h2 style="text-align:left;">Azure Functions integration with ETL to Azure Queues</h2><p style="text-align:left;">This is the sort of feature with hidden depths. <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.1/csharp/server/ongoing-tasks/etl/queue-etl/azure-queue">ETL to Azure Queue Storage</a></span>&nbsp;is fairly simple on the surface, it allows you to push data using RavenDB&rsquo;s usual ETL mechanisms to Azure Queues. At a glance, this looks like a simple extension of our already existing capabilities with queues (ETL to Kafka or RabbitMQ). </p><p style="text-align:left;">The reason that this is a top-line feature is that it also enables a very interesting scenario. You can now seamlessly integrate Azure Functions into your RavenDB data pipeline using this feature. <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/articles/using-azure-queue-storage-etl-to-process-ravendb-documents-in-a-serverless-manner">We have an article out that walks you through setting up Azure Functions to process data from RavenDB</a></span>.</p><h2 style="text-align:left;">OpenTelemetry integration</h2><p style="text-align:left;">In RavenDB 6.2 we have added support for the OpenTelemetry framework. This allows your operations team to more easily integrate RavenDB into your infrastructure. You can read <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.1/csharp/server/administration/monitoring/open-telemetry">more about how to set up OpenTelemetry for your RavenDB cluster in the documentation</a></span>.</p><p style="text-align:left;">OpenTelemetry integration is in addition to <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.1/csharp/server/administration/monitoring/prometheus">Prometheus</a></span>, <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.1/csharp/server/administration/monitoring/telegraf">Telegraf</a></span>,&nbsp;and <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.1/csharp/server/administration/SNMP/snmp">SNMP</a></span>&nbsp;telemetry solutions that are already in RavenDB. You can pick any of them to monitor and inspect the state of RavenDB. </p><h2 style="text-align:left;">Studio Omni-Search</h2><p style="text-align:left;">We made some nice improvements to RavenDB Studio as well, and probably the most visible of those is the Omni-Search feature. &nbsp;You can now hit Ctrl+K in the Studio and just search across <em>everything</em>:</p><ul><li>Commands in the Studio</li><li>Documents</li><li>Indexes</li></ul><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">This feature greatly enhances the discoverability of features in RavenDB as well as makes it a joy for those of us (myself included) who love to keep our hands on the keyboard.</p><h3 style="color:#434343;text-align:left;">Summary</h3><p style="text-align:left;">I&rsquo;m really happy about this release. It follows a predictable and stable release cadence since the release of 6.0 a year ago. The new release adds a whole bunch of new features and capabilities, and it can be upgraded in place (including cross-version clusters) and deployed to production with no hassles.</p><p style="text-align:left;">Looking forward, we have already started work on the next version of RavenDB, tentatively meant to be 7.0. We have some cool ideas about what will go into that release (check the <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/about/roadmap">roadmap</a></span>), but the key feature is likely to make RavenDB a more intelligent database, one might even say, artificially so.</p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201697-B/ravendb-6-2-release?Key=3fc21f8e-8e84-4868-a4fe-74fd0a515c5chttps://www.ayende.com/blog/201697-B/ravendb-6-2-release?Key=3fc21f8e-8e84-4868-a4fe-74fd0a515c5cWed, 18 Sep 2024 12:00:00 GMTSeeing the results of Corax in production<p>Corax was released just under a year ago, and we are seeing more customers deploying that to production. During a call with a customer, we noticed the following detail:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Seeing-the-results-of-Corax-in-productio_D60D/image_2.png"><img width="900" height="226" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Seeing-the-results-of-Corax-in-productio_D60D/image_thumb.png" border="0"></a></p><p>Let me explain what we are seeing here. The two indexes are the same, operating on the same set of documents. The only difference between those indexes is the indexing engine.</p><p>What is really amazing here is that Corax is able to index in 3:21 minutes what Lucene takes 17:15 minutes to index. In other words, Corax is <em>more than 5 times faster</em> than Lucene in a real world scenario.</p><p>And these news make me <em>very</em> happy.</p>https://www.ayende.com/blog/201633-A/seeing-the-results-of-corax-in-production?Key=b89c4aac-27e1-48e7-b8b9-631de8b2c513https://www.ayende.com/blog/201633-A/seeing-the-results-of-corax-in-production?Key=b89c4aac-27e1-48e7-b8b9-631de8b2c513Wed, 28 Aug 2024 12:00:00 GMTOptimizing old code: StreamBitArray refactoring<p style="text-align:left;">RavenDB is a pretty old codebase, hitting 15+ years in production recently. In order to keep it alive &amp; well, we make sure to follow the rule of always leaving the code in a better shape than we found it. </p><p style="text-align:left;">Today&rsquo;s tale is about the <span style="text-decoration:underline;"><a style="color:inherit;" href="https://github.com/ravendb/ravendb/blame/ceb5e9cfa53c8ba93d872a97bbda8250b0417c65/src/Voron/Impl/FreeSpace/StreamBitArray.cs">StreamBitArray</a></span>&nbsp;class, deep in the guts of Voron, RavenDB&rsquo;s storage engine. The class itself isn&rsquo;t really that interesting, it is just an implementation of a Bit Array that we have for a bitmap. We wrote it (based on Mono&rsquo;s code, it looks like) very early in the history of RavenDB and have never really touched it since.</p><p style="text-align:left;">The last time anyone touched it was 5 years ago (fixing the namespace), 7 years ago we created an issue from a TODO comment, etc. Most of the code dates back to 2013, actually. And even then it was moved from a different branch, so we lost the <em>really </em>old history. </p><p style="text-align:left;">To be clear, that class <em>did a full tour of duty. </em>For over a decade, it has served us very well. We never found a reason to change it, never got a trace of it in the profiler, etc. As we chip away at various hurdles inside RavenDB, I ran into this class and really looked at it with modern sensibilities. I think that this makes a great test case for code refactoring from the old style to our modern one.</p><p style="text-align:left;">Here is what the class looks like:</p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">Already, we can see several things that really bug me. That class is only used in one context, to manage the free pages bitmap for Voron. That means we create it whenever Voron frees a page. That can happen a lot, as you might imagine. </p><p style="text-align:left;">A single bitmap here covers 2048 pages, so when we create an instance of this class we also allocate an array with 64 ints. In other words, we need to allocate 312 bytes for each page we free. That isn&rsquo;t fun, and it actually gets worse. Here is a typical example of using this class:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-go'><code class='line-numbers language-go'>using <span class="token punctuation">(</span>freeSpaceTree<span class="token punctuation">.</span><span class="token function">Read</span><span class="token punctuation">(</span>section<span class="token punctuation">,</span> out Slice result<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> sba <span class="token operator">=</span> <span class="token operator">!</span>result<span class="token punctuation">.</span>HasValue ? <span class="token builtin">new</span> <span class="token function">StreamBitArray</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">:</span> <span class="token builtin">new</span> <span class="token function">StreamBitArray</span><span class="token punctuation">(</span>result<span class="token punctuation">.</span><span class="token function">CreateReader</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> sba<span class="token punctuation">.</span><span class="token function">Set</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token builtin">int</span><span class="token punctuation">)</span><span class="token punctuation">(</span>pageNumber <span class="token operator">%</span> NumberOfPagesInSection<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token boolean">true</span><span class="token punctuation">)</span><span class="token punctuation">;</span> using <span class="token punctuation">(</span>sba<span class="token punctuation">.</span><span class="token function">ToSlice</span><span class="token punctuation">(</span>tx<span class="token punctuation">.</span>Allocator<span class="token punctuation">,</span> out Slice val<span class="token punctuation">)</span><span class="token punctuation">)</span> freeSpaceTree<span class="token punctuation">.</span><span class="token function">Add</span><span class="token punctuation">(</span>section<span class="token punctuation">,</span> val<span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">And inside the ToSlice()&nbsp;call, we have:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-dart'><code class='line-numbers language-dart'>public <span class="token class-name">ByteStringContext.InternalScope</span> <span class="token class-name">ToSlice</span><span class="token punctuation">(</span><span class="token class-name">ByteStringContext</span> context<span class="token punctuation">,</span> <span class="token class-name">ByteStringType</span> type<span class="token punctuation">,</span> out <span class="token class-name">Slice</span> str<span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">var</span> buffer <span class="token operator">=</span> <span class="token class-name">ToBuffer</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> scope <span class="token operator">=</span> <span class="token class-name"><span class="token namespace">context<span class="token punctuation">.</span></span>From</span><span class="token punctuation">(</span>buffer<span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> <span class="token class-name"><span class="token namespace">buffer<span class="token punctuation">.</span></span>Length</span><span class="token punctuation">,</span> type<span class="token punctuation">,</span> out <span class="token class-name">ByteString</span> byteString<span class="token punctuation">)</span><span class="token punctuation">;</span> str <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">Slice</span><span class="token punctuation">(</span>byteString<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> scope<span class="token punctuation">;</span> <span class="token punctuation">}</span> private unsafe byte<span class="token punctuation">[</span><span class="token punctuation">]</span> <span class="token class-name">ToBuffer</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">var</span> tmpBuffer <span class="token operator">=</span> <span class="token keyword">new</span> byte<span class="token punctuation">[</span><span class="token punctuation">(</span>_inner<span class="token punctuation">.</span>Length <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token operator">*</span>sizeof <span class="token punctuation">(</span>int<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">;</span> unsafe <span class="token punctuation">{</span> fixed <span class="token punctuation">(</span>int<span class="token operator">*</span> src <span class="token operator">=</span> _inner<span class="token punctuation">)</span> fixed <span class="token punctuation">(</span>byte<span class="token operator">*</span> dest <span class="token operator">=</span> tmpBuffer<span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token operator">*</span><span class="token punctuation">(</span>int<span class="token operator">*</span><span class="token punctuation">)</span> dest <span class="token operator">=</span> <span class="token class-name">SetCount</span><span class="token punctuation">;</span> <span class="token class-name">Memory.Copy</span><span class="token punctuation">(</span>dest <span class="token operator">+</span> sizeof <span class="token punctuation">(</span>int<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">(</span>byte<span class="token operator">*</span><span class="token punctuation">)</span> src<span class="token punctuation">,</span> <span class="token class-name"><span class="token namespace">tmpBuffer<span class="token punctuation">.</span></span>Length</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token keyword">return</span> tmpBuffer<span class="token punctuation">;</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">In other words, ToSlice()&nbsp;calls ToBuffer(), which allocates an array of bytes (288 bytes are allocated here), copies the data from the inner buffer to a new one (using fixed&nbsp;on the two arrays, which is a performance issue all in itself) and then calls a method to do the actual copy. Then in ToSlice()&nbsp;itself we allocate it <em>again</em>&nbsp;in native memory, which we then write to Voron, and then discard the whole thing. </p><p style="text-align:left;">In short, somehow it turns out that freeing a page in Voron costs us ~1KB of memory allocations. That sucks, I have to say. And the only reasoning I have for this code is that it is old. </p><p style="text-align:left;">Here is the constructor for this class as well:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-csharp'><code class='line-numbers language-csharp'><span class="token keyword">public</span> <span class="token function">StreamBitArray</span><span class="token punctuation">(</span><span class="token class-name">ValueReader</span> reader<span class="token punctuation">)</span> <span class="token punctuation">{</span> SetCount <span class="token operator">=</span> reader<span class="token punctuation">.</span><span class="token function">ReadLittleEndianInt32</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">unsafe</span> <span class="token punctuation">{</span> <span class="token keyword">fixed</span> <span class="token punctuation">(</span><span class="token keyword">int</span><span class="token operator">*</span> i <span class="token operator">=</span> _inner<span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token class-name"><span class="token keyword">int</span></span> read <span class="token operator">=</span> reader<span class="token punctuation">.</span><span class="token function">Read</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">byte</span><span class="token operator">*</span><span class="token punctuation">)</span>i<span class="token punctuation">,</span> _inner<span class="token punctuation">.</span>Length <span class="token operator">*</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span><span class="token type-expression class-name"><span class="token keyword">int</span></span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>read <span class="token operator">&lt;</span> _inner<span class="token punctuation">.</span>Length <span class="token operator">*</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span><span class="token type-expression class-name"><span class="token keyword">int</span></span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">throw</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">EndOfStreamException</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">This accepts a reader to a piece of memory and does a <em>bunch</em>&nbsp;of things. It calls a few methods, uses fixed on the array, etc., all to get the data from the reader to the class. That is horribly inefficient. </p><p style="text-align:left;">Let&rsquo;s write it from scratch and see what we can do. The first thing to notice is that this is a very short-lived class, it is only used inside methods and never held for long. This usage pattern tells me that it is a good candidate to be made into a struct, and as long as we do that, we might as well fix the allocation of the array as well. </p><p style="text-align:left;">Note that I have a hard constraint, I cannot change the structure of the data on disk for backward compatibility reasons. So only in-memory changes are allowed. </p><p style="text-align:left;">Here is my first attempt at refactoring the code:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-rust'><code class='line-numbers language-rust'>public <span class="token keyword">unsafe</span> <span class="token keyword">struct</span> <span class="token type-definition class-name">StreamBitArray</span> <span class="token punctuation">{</span> private fixed uint _inner<span class="token punctuation">[</span><span class="token number">64</span><span class="token punctuation">]</span><span class="token punctuation">;</span> public int <span class="token class-name">SetCount</span><span class="token punctuation">;</span> public <span class="token class-name">StreamBitArray</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token class-name">SetCount</span> <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token class-name">Vector256</span><span class="token operator">&lt;</span>uint<span class="token operator">></span><span class="token punctuation">.</span><span class="token class-name">Zero</span><span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token class-name">Vector256</span><span class="token operator">&lt;</span>uint<span class="token operator">></span><span class="token punctuation">.</span><span class="token class-name">Zero</span><span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">8</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token class-name">Vector256</span><span class="token operator">&lt;</span>uint<span class="token operator">></span><span class="token punctuation">.</span><span class="token class-name">Zero</span><span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">16</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token class-name">Vector256</span><span class="token operator">&lt;</span>uint<span class="token operator">></span><span class="token punctuation">.</span><span class="token class-name">Zero</span><span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">24</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token class-name">Vector256</span><span class="token operator">&lt;</span>uint<span class="token operator">></span><span class="token punctuation">.</span><span class="token class-name">Zero</span><span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">32</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token class-name">Vector256</span><span class="token operator">&lt;</span>uint<span class="token operator">></span><span class="token punctuation">.</span><span class="token class-name">Zero</span><span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">40</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token class-name">Vector256</span><span class="token operator">&lt;</span>uint<span class="token operator">></span><span class="token punctuation">.</span><span class="token class-name">Zero</span><span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">48</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token class-name">Vector256</span><span class="token operator">&lt;</span>uint<span class="token operator">></span><span class="token punctuation">.</span><span class="token class-name">Zero</span><span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">56</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> public <span class="token class-name">StreamBitArray</span><span class="token punctuation">(</span>byte<span class="token operator">*</span> ptr<span class="token punctuation">)</span> <span class="token punctuation">{</span> var ints <span class="token operator">=</span> <span class="token punctuation">(</span>uint<span class="token operator">*</span><span class="token punctuation">)</span>ptr<span class="token punctuation">;</span> <span class="token class-name">SetCount</span> <span class="token operator">=</span> <span class="token punctuation">(</span>int<span class="token punctuation">)</span><span class="token operator">*</span>ints<span class="token punctuation">;</span> var a <span class="token operator">=</span> <span class="token class-name">Vector256</span><span class="token punctuation">.</span><span class="token class-name">LoadUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> ints<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var b <span class="token operator">=</span> <span class="token class-name">Vector256</span><span class="token punctuation">.</span><span class="token class-name">LoadUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> ints<span class="token punctuation">[</span><span class="token number">9</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var c <span class="token operator">=</span> <span class="token class-name">Vector256</span><span class="token punctuation">.</span><span class="token class-name">LoadUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> ints<span class="token punctuation">[</span><span class="token number">17</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var d <span class="token operator">=</span> <span class="token class-name">Vector256</span><span class="token punctuation">.</span><span class="token class-name">LoadUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> ints<span class="token punctuation">[</span><span class="token number">25</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var e <span class="token operator">=</span> <span class="token class-name">Vector256</span><span class="token punctuation">.</span><span class="token class-name">LoadUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> ints<span class="token punctuation">[</span><span class="token number">33</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var f <span class="token operator">=</span> <span class="token class-name">Vector256</span><span class="token punctuation">.</span><span class="token class-name">LoadUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> ints<span class="token punctuation">[</span><span class="token number">41</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var g <span class="token operator">=</span> <span class="token class-name">Vector256</span><span class="token punctuation">.</span><span class="token class-name">LoadUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> ints<span class="token punctuation">[</span><span class="token number">49</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var h <span class="token operator">=</span> <span class="token class-name">Vector256</span><span class="token punctuation">.</span><span class="token class-name">LoadUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> ints<span class="token punctuation">[</span><span class="token number">57</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> a<span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> b<span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">8</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> c<span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">16</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> d<span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">24</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> e<span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">32</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> f<span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">40</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> g<span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">48</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> h<span class="token punctuation">.</span><span class="token class-name">StoreUnsafe</span><span class="token punctuation">(</span><span class="token keyword">ref</span> _inner<span class="token punctuation">[</span><span class="token number">56</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">That looks like a lot of code, but let&rsquo;s see what changes I brought to bear here. </p><ul><li>Using a struct instead of a class saves us an allocation.</li><li>Using a fixed array means that we don&rsquo;t have a separate allocation for the buffer.</li><li>Using [SkipLocalsInit]&nbsp;means that we ask the JIT <em>not</em>&nbsp;to zero the struct. We do that directly in the default constructor.</li><li>We are loading the data from the ptr&nbsp;in the second constructor directly.</li></ul><p style="text-align:left;">The fact that this is a struct and using a fixed array means that we can create a new instance of this without any allocations, we just need 260 bytes of stack space (the 288 we previously allocated also included object headers).</p><p style="text-align:left;">Let&rsquo;s look at the actual machine code that these two constructors generate. Looking at the default constructor, we have:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-yaml'><code class='line-numbers language-yaml'>StreamBitArray..ctor() <span class="token key atrule">L0000</span><span class="token punctuation">:</span> push ebp <span class="token key atrule">L0001</span><span class="token punctuation">:</span> mov ebp<span class="token punctuation">,</span> esp <span class="token key atrule">L0003</span><span class="token punctuation">:</span> vzeroupper <span class="token key atrule">L0006</span><span class="token punctuation">:</span> xor eax<span class="token punctuation">,</span> eax <span class="token key atrule">L0008</span><span class="token punctuation">:</span> mov <span class="token punctuation">[</span>ecx+0x100<span class="token punctuation">]</span><span class="token punctuation">,</span> eax <span class="token key atrule">L000e</span><span class="token punctuation">:</span> vxorps ymm0<span class="token punctuation">,</span> ymm0<span class="token punctuation">,</span> ymm0 <span class="token key atrule">L0012</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L0016</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0x20<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L001b</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0x40<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L0020</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0x60<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L0025</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0x80<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L002d</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0xa0<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L0035</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0xc0<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L003d</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0xe0<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L0045</span><span class="token punctuation">:</span> vzeroupper <span class="token key atrule">L0048</span><span class="token punctuation">:</span> pop ebp <span class="token key atrule">L0049</span><span class="token punctuation">:</span> ret</code></pre><hr/></p><p style="text-align:left;">There is the function prolog and epilog, but the code of this method uses 4 256-bit instructions to zero the buffer. If we were to let the JIT handle this, it would use 128-bit instructions and a loop to do it. In this case, our way is better, because we know more than the JIT.</p><p style="text-align:left;">As for the constructor accepting an external pointer, here is what this translates into:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-yaml'><code class='line-numbers language-yaml'>StreamBitArray..ctor(Byte<span class="token important">*)</span> <span class="token key atrule">L0000</span><span class="token punctuation">:</span> push ebp <span class="token key atrule">L0001</span><span class="token punctuation">:</span> mov ebp<span class="token punctuation">,</span> esp <span class="token key atrule">L0003</span><span class="token punctuation">:</span> vzeroupper <span class="token key atrule">L0006</span><span class="token punctuation">:</span> mov eax<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx<span class="token punctuation">]</span> <span class="token key atrule">L0008</span><span class="token punctuation">:</span> mov <span class="token punctuation">[</span>ecx+0x100<span class="token punctuation">]</span><span class="token punctuation">,</span> eax <span class="token key atrule">L000e</span><span class="token punctuation">:</span> vmovups ymm0<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx+4<span class="token punctuation">]</span> <span class="token key atrule">L0013</span><span class="token punctuation">:</span> vmovups ymm1<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx+0x24<span class="token punctuation">]</span> <span class="token key atrule">L0018</span><span class="token punctuation">:</span> vmovups ymm2<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx+0x44<span class="token punctuation">]</span> <span class="token key atrule">L001d</span><span class="token punctuation">:</span> vmovups ymm3<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx+0x64<span class="token punctuation">]</span> <span class="token key atrule">L0022</span><span class="token punctuation">:</span> vmovups ymm4<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx+0x84<span class="token punctuation">]</span> <span class="token key atrule">L002a</span><span class="token punctuation">:</span> vmovups ymm5<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx+0xa4<span class="token punctuation">]</span> <span class="token key atrule">L0032</span><span class="token punctuation">:</span> vmovups ymm6<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx+0xc4<span class="token punctuation">]</span> <span class="token key atrule">L003a</span><span class="token punctuation">:</span> vmovups ymm7<span class="token punctuation">,</span> <span class="token punctuation">[</span>edx+0xe4<span class="token punctuation">]</span> <span class="token key atrule">L0042</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm0 <span class="token key atrule">L0046</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0x20<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm1 <span class="token key atrule">L004b</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0x40<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm2 <span class="token key atrule">L0050</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0x60<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm3 <span class="token key atrule">L0055</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0x80<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm4 <span class="token key atrule">L005d</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0xa0<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm5 <span class="token key atrule">L0065</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0xc0<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm6 <span class="token key atrule">L006d</span><span class="token punctuation">:</span> vmovups <span class="token punctuation">[</span>ecx+0xe0<span class="token punctuation">]</span><span class="token punctuation">,</span> ymm7 <span class="token key atrule">L0075</span><span class="token punctuation">:</span> vzeroupper <span class="token key atrule">L0078</span><span class="token punctuation">:</span> pop ebp <span class="token key atrule">L0079</span><span class="token punctuation">:</span> ret</code></pre><hr/></p><p style="text-align:left;">This code is <em>exciting</em>&nbsp;to me because we are also allowing instruction-level parallelism. We effectively allow the CPU to execute all the operations of reading and writing in parallel. </p><p style="text-align:left;">Next on the chopping block is this method:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-csharp'><code class='line-numbers language-csharp'><span class="token keyword">public</span> <span class="token return-type class-name"><span class="token keyword">int</span></span> <span class="token function">FirstSetBit</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token class-name"><span class="token keyword">int</span></span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> _inner<span class="token punctuation">.</span>Length<span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>_inner<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token keyword">continue</span><span class="token punctuation">;</span> <span class="token keyword">return</span> i <span class="token operator">&lt;&lt;</span> <span class="token number">5</span> <span class="token operator">|</span> <span class="token function">HighestBitSet</span><span class="token punctuation">(</span>_inner<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">return</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">private</span> <span class="token keyword">static</span> <span class="token return-type class-name"><span class="token keyword">int</span></span> <span class="token function">HighestBitSet</span><span class="token punctuation">(</span><span class="token class-name"><span class="token keyword">int</span></span> v<span class="token punctuation">)</span> <span class="token punctuation">{</span> v <span class="token operator">|=</span> v <span class="token operator">>></span> <span class="token number">1</span><span class="token punctuation">;</span> <span class="token comment">// first round down to one less than a power of 2 </span> v <span class="token operator">|=</span> v <span class="token operator">>></span> <span class="token number">2</span><span class="token punctuation">;</span> v <span class="token operator">|=</span> v <span class="token operator">>></span> <span class="token number">4</span><span class="token punctuation">;</span> v <span class="token operator">|=</span> v <span class="token operator">>></span> <span class="token number">8</span><span class="token punctuation">;</span> v <span class="token operator">|=</span> v <span class="token operator">>></span> <span class="token number">16</span><span class="token punctuation">;</span> <span class="token keyword">return</span> MultiplyDeBruijnBitPosition<span class="token punctuation">[</span><span class="token punctuation">(</span><span class="token keyword">uint</span><span class="token punctuation">)</span><span class="token punctuation">(</span>v <span class="token operator">*</span> <span class="token number">0x07C4ACDDU</span><span class="token punctuation">)</span> <span class="token operator">>></span> <span class="token number">27</span><span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">We are using vector instructions to scan 8 ints at a time, trying to find the first one that is set. Then we find the right int and locate the first set bit <em>there</em>. Here is what the assembly looks like:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-yaml'><code class='line-numbers language-yaml'>StreamBitArray.FirstSetBit() <span class="token key atrule">L0000</span><span class="token punctuation">:</span> push ebp <span class="token key atrule">L0001</span><span class="token punctuation">:</span> mov ebp<span class="token punctuation">,</span> esp <span class="token key atrule">L0003</span><span class="token punctuation">:</span> vzeroupper <span class="token key atrule">L0006</span><span class="token punctuation">:</span> xor edx<span class="token punctuation">,</span> edx <span class="token key atrule">L0008</span><span class="token punctuation">:</span> cmp <span class="token punctuation">[</span>ecx<span class="token punctuation">]</span><span class="token punctuation">,</span> cl <span class="token key atrule">L000a</span><span class="token punctuation">:</span> vmovups ymm0<span class="token punctuation">,</span> <span class="token punctuation">[</span>ecx+edx<span class="token important">*4</span><span class="token punctuation">]</span> <span class="token key atrule">L000f</span><span class="token punctuation">:</span> vxorps ymm1<span class="token punctuation">,</span> ymm1<span class="token punctuation">,</span> ymm1 <span class="token key atrule">L0013</span><span class="token punctuation">:</span> vpcmpud k1<span class="token punctuation">,</span> ymm0<span class="token punctuation">,</span> ymm1<span class="token punctuation">,</span> <span class="token number">6</span> <span class="token key atrule">L001a</span><span class="token punctuation">:</span> vpmovm2d ymm0<span class="token punctuation">,</span> k1 <span class="token key atrule">L0020</span><span class="token punctuation">:</span> vptest ymm0<span class="token punctuation">,</span> ymm0 <span class="token key atrule">L0025</span><span class="token punctuation">:</span> jne short L0039 <span class="token key atrule">L0027</span><span class="token punctuation">:</span> add edx<span class="token punctuation">,</span> <span class="token number">8</span> <span class="token key atrule">L002a</span><span class="token punctuation">:</span> cmp edx<span class="token punctuation">,</span> <span class="token number">0x40</span> <span class="token key atrule">L002d</span><span class="token punctuation">:</span> jl short L000a <span class="token key atrule">L002f</span><span class="token punctuation">:</span> mov eax<span class="token punctuation">,</span> <span class="token number">0xffffffff</span> <span class="token key atrule">L0034</span><span class="token punctuation">:</span> vzeroupper <span class="token key atrule">L0037</span><span class="token punctuation">:</span> pop ebp <span class="token key atrule">L0038</span><span class="token punctuation">:</span> ret <span class="token key atrule">L0039</span><span class="token punctuation">:</span> vmovmskps eax<span class="token punctuation">,</span> ymm0 <span class="token key atrule">L003d</span><span class="token punctuation">:</span> tzcnt eax<span class="token punctuation">,</span> eax <span class="token key atrule">L0041</span><span class="token punctuation">:</span> add eax<span class="token punctuation">,</span> edx <span class="token key atrule">L0043</span><span class="token punctuation">:</span> xor edx<span class="token punctuation">,</span> edx <span class="token key atrule">L0045</span><span class="token punctuation">:</span> tzcnt edx<span class="token punctuation">,</span> <span class="token punctuation">[</span>ecx+eax<span class="token important">*4</span><span class="token punctuation">]</span> <span class="token key atrule">L004a</span><span class="token punctuation">:</span> shl eax<span class="token punctuation">,</span> <span class="token number">5</span> <span class="token key atrule">L004d</span><span class="token punctuation">:</span> add eax<span class="token punctuation">,</span> edx <span class="token key atrule">L004f</span><span class="token punctuation">:</span> vzeroupper <span class="token key atrule">L0052</span><span class="token punctuation">:</span> pop ebp <span class="token key atrule">L0053</span><span class="token punctuation">:</span> ret</code></pre><hr/></p><p style="text-align:left;">In short, the code is simpler, shorter, and more explicit about what it is doing. The machine code that is running there is <em>much</em>&nbsp;tighter. And I don&rsquo;t have allocations galore. </p><p style="text-align:left;">This particular optimization isn&rsquo;t about showing better numbers in a specific scenario that I can point to. I don&rsquo;t think we ever delete enough pages to actually see this in a profiler output in such an obvious way. The goal is to reduce allocations and give the GC less work to do, which has a global impact on the performance of the system.</p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201601-A/optimizing-old-code-streambitarray-refactoring?Key=3620c293-cb23-48b8-bb06-df1576cea427https://www.ayende.com/blog/201601-A/optimizing-old-code-streambitarray-refactoring?Key=3620c293-cb23-48b8-bb06-df1576cea427Fri, 16 Aug 2024 12:00:00 GMTImproving RavenDB's Node.js bulk insert performance<p style="text-align:left;">During a performance evaluation internally, we ran into a strange situation. Our bulk insert performance using the node.js API was <em>significantly</em>&nbsp;worse than the performance of other clients. In particular, when we compared that to the C# version, we saw that the numbers were <em>significantly</em>&nbsp;worse than expected.</p><p style="text-align:left;">To be fair, this comparison is made between our C# client, which has been through the <em>wringer</em>&nbsp;in terms of optimization and attention to performance, and the Node.js client. The focus of the Node.js client was on correctness and usability. </p><p style="text-align:left;">It isn&rsquo;t fair to expect the same performance from Node.js and C#, after all. However, that difference in performance was annoying enough to make us take a deeper look into what was going on.</p><p style="text-align:left;">Here is the relevant code:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'><span class="token keyword">const</span> store <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">DocumentStore</span><span class="token punctuation">(</span><span class="token string">'http://localhost:8080'</span><span class="token punctuation">,</span> <span class="token string">'bulk'</span><span class="token punctuation">)</span><span class="token punctuation">;</span> store<span class="token punctuation">.</span><span class="token function">initialize</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">const</span> bulk <span class="token operator">=</span> store<span class="token punctuation">.</span><span class="token function">bulkInsert</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">let</span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> <span class="token number">100_000_000</span><span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">await</span> bulk<span class="token punctuation">.</span><span class="token function">store</span><span class="token punctuation">(</span><span class="token keyword">new</span> <span class="token class-name">User</span><span class="token punctuation">(</span><span class="token string">'user'</span> <span class="token operator">+</span> i<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">await</span> bulk<span class="token punctuation">.</span><span class="token function">finish</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">As you can see, the Node.js numbers are <em>respectable</em>. Running at a rate of over 85,000 writes per second is nothing to sneeze at.</p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">But I also ran the exact same test with the C# client, and I got annoyed. The C# client was able to hit close to 100,000 <em>more</em>&nbsp;writes per second than the Node.js client. And in both cases, the actual limit was on the client side, not on the server side. </p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">For fun, I ran a few clients and hit 250,000 writes/second without really doing much. The last time we properly tested ingest performance for RavenDB we achieved 150,000 writes/second. So it certainly looks like we are performing significantly better.</p><p style="text-align:left;">Going back to the Node.js version, I wanted to know what exactly was the problem that we had there. Why are we so much slower than the C# version? It&rsquo;s possible that this is just the limits of the node.js platform, but you <em>gotta </em>check to know. </p><p style="text-align:left;">Node.js has an --inspect flag that you can use, and Chrome has a built-in profiler (<span style="color:#2c3437;">chrome://inspect) </span>that can plug into that. Using the DevTools, you can get a performance profile of a Node.js process.</p><p style="text-align:left;">I did just that and go the following numbers:</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">That is&hellip; curious. <em>Really</em>&nbsp;curious, isn&rsquo;t it?</p><p style="text-align:left;">Basically, none of my code appears here <em>at</em>&nbsp;all, most of the time is spent dealing with the async machinery. If you look at the code above, you can see that we are issuing an await&nbsp;for each document stored. &nbsp;</p><p style="text-align:left;">The idea with bulk insert is that under the covers, we split the writing to an in-memory buffer and the flushing of the buffer to the network. In the vast majority of cases, we&rsquo;ll <em>not</em>&nbsp;do any async operations in the store()&nbsp;call. If the buffer is full, we&rsquo;ll need to flush it to the network, and that may force us to do an actual await&nbsp;operation. In Node.js, awaiting an async function that doesn&rsquo;t actually perform any async operation appears to be <em>super</em>&nbsp;expensive.</p><p style="text-align:left;">We threw around a bunch of ideas on how to resolve this issue. The problem is that Node.js has no equivalent to C#&rsquo;s ValueTask.&nbsp;We also have a lot of existing code out there in the field that we must remain compatible with.</p><p style="text-align:left;">Our solution to this dilemma was to add another function that you can call, like so:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'><span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">let</span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> <span class="token number">100_000_000</span><span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">const</span> user <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">User</span><span class="token punctuation">(</span><span class="token string">'user'</span> <span class="token operator">+</span> i<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">const</span> id <span class="token operator">=</span> <span class="token string">"users/"</span> <span class="token operator">+</span> i<span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>bulk<span class="token punctuation">.</span><span class="token function">tryStoreSync</span><span class="token punctuation">(</span>user<span class="token punctuation">,</span> id<span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token boolean">false</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">await</span> bulk<span class="token punctuation">.</span><span class="token function">store</span><span class="token punctuation">(</span>user<span class="token punctuation">,</span> id<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">The idea is that if you call tryStoreSync()&nbsp;we&rsquo;ll try to do everything in memory, but it may not be possible (e.g. if we need to flush the buffer). In that case, you&rsquo;ll need to call the async function store()&nbsp;explicitly. </p><p style="text-align:left;">Given that the usual reason for using the dedicated API for bulk insert is performance, this looks like a reasonable thing to ask. Especially when you can see the actual performance results. We are talking about over 55%<strong>(!!!)</strong>&nbsp;improvement in the performance of bulk insert.</p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">It gets even better. That was just the mechanical fix to avoid generating a promise per operation. While we are addressing this performance issue, there are a few other low-hanging fruits that could improve the bulk insert performance in Node.js. </p><p style="text-align:left;">For example, it turns out that we pay a hefty cost to generate the metadata for all those documents (runtime reflection cost, mostly). We can generate it <em>once</em>&nbsp;and be done with it, like so:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'><span class="token keyword">const</span> bulk <span class="token operator">=</span> store<span class="token punctuation">.</span><span class="token function">bulkInsert</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">const</span> metadata <span class="token operator">=</span> <span class="token punctuation">{</span> <span class="token string-property property">"@collection"</span><span class="token operator">:</span> <span class="token string">"Users"</span><span class="token punctuation">,</span> <span class="token string-property property">"Raven-Node-Type"</span><span class="token operator">:</span> <span class="token string">"User"</span> <span class="token punctuation">}</span><span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">let</span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> <span class="token number">100_000_000</span><span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">const</span> user <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">User</span><span class="token punctuation">(</span><span class="token string">'user'</span> <span class="token operator">+</span> i<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">const</span> id <span class="token operator">=</span> <span class="token string">"users/"</span> <span class="token operator">+</span> i<span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>bulk<span class="token punctuation">.</span><span class="token function">tryStoreSync</span><span class="token punctuation">(</span>user<span class="token punctuation">,</span> id<span class="token punctuation">,</span> metadata<span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token boolean">false</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">await</span> bulk<span class="token punctuation">.</span><span class="token function">store</span><span class="token punctuation">(</span>user<span class="token punctuation">,</span> id<span class="token punctuation">,</span> metadata<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token keyword">await</span> bulk<span class="token punctuation">.</span><span class="token function">finish</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">And this code in particular gives us:</p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">That is basically near enough to the C#&rsquo;s speed that I don&rsquo;t think we <em>need</em>&nbsp;to pay more attention to performance. Overall, that was time very well spent in making things go <em>fast</em>.</p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201569-A/improving-ravendbs-node-js-bulk-insert-performance?Key=77afcb92-876b-45b3-99da-17adb2630100https://www.ayende.com/blog/201569-A/improving-ravendbs-node-js-bulk-insert-performance?Key=77afcb92-876b-45b3-99da-17adb2630100Thu, 08 Aug 2024 12:00:00 GMT Cloned Dictionary vs. Immutable Dictionary vs. Frozen Dictionary in high traffic systems<p style="text-align:left;">In my <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ayende.com/blog/201281-B/challenge-efficient-snapshotable-state?key=352251b097cd4fbda04a3e274ffdc2f0">previous post</a></span>, I explained what we are trying to do. Create a way to carry a dictionary between transactions in RavenDB, allowing one write transaction to modify it while all other read transactions only observe the state of the dictionary as it was at the publication time.</p><p style="text-align:left;">I want to show a couple of ways I tried solving this problem using the built-in tools in the Base Class Library. Here is roughly what I&rsquo;m trying to do:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'>IEnumerable<span class="token operator">&lt;</span>object<span class="token operator">></span> <span class="token function">SingleDictionary</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">var</span> dic <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">Dictionary</span><span class="token operator">&lt;</span>long<span class="token punctuation">,</span> object<span class="token operator">></span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> random <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">Random</span><span class="token punctuation">(</span><span class="token number">932</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> v <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">object</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// number of transactions</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">var</span> txCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> txCount <span class="token operator">&lt;</span> <span class="token number">1000</span><span class="token punctuation">;</span> txCount<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token comment">// operations in transaction</span> <span class="token keyword">for</span> <span class="token punctuation">(</span>int opCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> opCount <span class="token operator">&lt;</span> <span class="token number">10_000</span><span class="token punctuation">;</span> opCount<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> dic<span class="token punctuation">[</span>random<span class="token punctuation">.</span><span class="token function">NextInt64</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1024</span> <span class="token operator">*</span> <span class="token number">1024</span> <span class="token operator">*</span> <span class="token number">1024</span><span class="token punctuation">)</span><span class="token punctuation">]</span> <span class="token operator">=</span> v<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">yield</span> <span class="token keyword">return</span> dic<span class="token punctuation">;</span><span class="token comment">// publish the dictionary</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">As you can see, we are running a thousand transactions, each of which performs 10,000 operations. We &ldquo;publish&rdquo; the state of the transaction after each time. </p><p style="text-align:left;">This is just to set up a baseline for what I&rsquo;m trying to do. I&rsquo;m focusing solely on this one aspect of the table that is published. Note that I cannot actually use this particular code. The issue is that the dictionary is both mutable and shared (across threads), I cannot do that. </p><p style="text-align:left;">The easiest way to go about this is to just clone the dictionary. Here is what this would look like:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-bash'><code class='line-numbers language-bash'>IEnumerable<span class="token operator">&lt;</span>object<span class="token operator">></span> <span class="token function-name function">ClonedDictionary</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> var dic <span class="token operator">=</span> new Dictionary<span class="token operator">&lt;</span>long, object<span class="token operator">></span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var random <span class="token operator">=</span> new Random<span class="token punctuation">(</span><span class="token number">932</span><span class="token punctuation">)</span><span class="token punctuation">;</span> var <span class="token function">v</span> <span class="token operator">=</span> new object<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> // number of transactions <span class="token keyword">for</span> <span class="token punctuation">(</span>var txCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> txCount <span class="token operator">&lt;</span> <span class="token number">1000</span><span class="token punctuation">;</span> txCount++<span class="token punctuation">)</span> <span class="token punctuation">{</span> // operations <span class="token keyword">in</span> transaction <span class="token keyword">for</span> <span class="token punctuation">(</span>int opCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> opCount <span class="token operator">&lt;</span> 10_000<span class="token punctuation">;</span> opCount++<span class="token punctuation">)</span> <span class="token punctuation">{</span> dic<span class="token punctuation">[</span>random.NextInt64<span class="token punctuation">(</span><span class="token number">0</span>, <span class="token number">1024</span> * <span class="token number">1024</span> * <span class="token number">1024</span><span class="token punctuation">)</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token function">v</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> // publish the dictionary yield <span class="token builtin class-name">return</span> new Dictionary<span class="token operator">&lt;</span>long, object<span class="token operator">></span><span class="token punctuation">(</span>dic<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">This is basically the same code, but when I publish the dictionary, I&rsquo;m going to create a new instance (which will be read-only). This is <em>exactly</em>&nbsp;what I want: to have a cloned, read-only copy that the read transactions can use while I get to keep on modifying the write copy.</p><p style="text-align:left;">The downside of this approach is twofold. First, there are a lot of allocations because of this, and the more items in the table, the more expensive it is to copy.</p><p style="text-align:left;">I can try using the ImmutableDictionary in the Base Class Library, however. Here is what this would look like:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-dart'><code class='line-numbers language-dart'><span class="token class-name">IEnumerable</span><span class="token generics"><span class="token punctuation">&lt;</span>object<span class="token punctuation">></span></span> <span class="token class-name">ClonedImmutableDictionary</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">var</span> dic <span class="token operator">=</span> <span class="token class-name">ImmutableDictionary.Create</span><span class="token generics"><span class="token punctuation">&lt;</span>long<span class="token punctuation">,</span> object<span class="token punctuation">></span></span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> random <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">Random</span><span class="token punctuation">(</span><span class="token number">932</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> v <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token function">object</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// number of transactions</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">var</span> txCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> txCount <span class="token operator">&lt;</span> <span class="token number">1000</span><span class="token punctuation">;</span> txCount<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token comment">// operations in transaction</span> <span class="token keyword">for</span> <span class="token punctuation">(</span>int opCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> opCount <span class="token operator">&lt;</span> <span class="token number">10</span>_000<span class="token punctuation">;</span> opCount<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> dic <span class="token operator">=</span> <span class="token class-name"><span class="token namespace">dic<span class="token punctuation">.</span></span>Add</span><span class="token punctuation">(</span><span class="token class-name"><span class="token namespace">random<span class="token punctuation">.</span></span>NextInt64</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1024</span> <span class="token operator">*</span> <span class="token number">1024</span> <span class="token operator">*</span> <span class="token number">1024</span><span class="token punctuation">)</span><span class="token punctuation">,</span> v<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// publish the dictionary</span> <span class="token keyword">yield</span> <span class="token keyword">return</span> dic<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">The benefit here is that the act of publishing is effectively a no-op. Just send the immutable value out to the world. The downside of using immutable dictionaries is that each operation involves an allocation, and the actual underlying implementation is far less efficient as a hash table than the regular dictionary.</p><p style="text-align:left;">I can try to optimize this a bit by using the builder pattern, as shown here:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'>IEnumerable<span class="token operator">&lt;</span>object<span class="token operator">></span> <span class="token function">BuilderImmutableDictionary</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">var</span> builder <span class="token operator">=</span> ImmutableDictionary<span class="token punctuation">.</span>CreateBuilder<span class="token operator">&lt;</span>long<span class="token punctuation">,</span> object<span class="token operator">></span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> random <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">Random</span><span class="token punctuation">(</span><span class="token number">932</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> v <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">object</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">;</span> <span class="token comment">// number of transactions</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">var</span> txCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> txCount <span class="token operator">&lt;</span> <span class="token number">1000</span><span class="token punctuation">;</span> txCount<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token comment">// operations in transaction</span> <span class="token keyword">for</span> <span class="token punctuation">(</span>int opCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> opCount <span class="token operator">&lt;</span> <span class="token number">10_000</span><span class="token punctuation">;</span> opCount<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> builder<span class="token punctuation">[</span>random<span class="token punctuation">.</span><span class="token function">NextInt64</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1024</span> <span class="token operator">*</span> <span class="token number">1024</span> <span class="token operator">*</span> <span class="token number">1024</span><span class="token punctuation">)</span><span class="token punctuation">]</span> <span class="token operator">=</span> v<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// publish the dictionary</span> <span class="token keyword">yield</span> <span class="token keyword">return</span> builder<span class="token punctuation">.</span><span class="token function">ToImmutable</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">Now we only pay the immutable cost one per transaction, right? However,&nbsp;the underlying implementation is still an AVL tree, not a proper hash table. This means that not only is it more expensive for publishing the state, but we are now slower for <em>reads</em>&nbsp;as well. That is <em>not</em>&nbsp;something that we want.</p><p style="text-align:left;">The BCL recently introduced a FrozenDictionary, which is meant to be super efficient for a really common case of dictionaries that are accessed a <em>lot</em>&nbsp;but rarely written to. I delved into its implementation and was impressed by the amount of work invested into ensuring that this will be really fast.</p><p style="text-align:left;">Let&rsquo;s see how <em>that </em>would look like for our scenario, shall we?</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'>IEnumerable<span class="token operator">&lt;</span>object<span class="token operator">></span> <span class="token function">FrozenDictionary</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">var</span> dic <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">Dictionary</span><span class="token operator">&lt;</span>long<span class="token punctuation">,</span> object<span class="token operator">></span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> random <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">Random</span><span class="token punctuation">(</span><span class="token number">932</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">var</span> v <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">object</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// number of transactions</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">var</span> txCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> txCount <span class="token operator">&lt;</span> <span class="token number">1000</span><span class="token punctuation">;</span> txCount<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token comment">// operations in transaction</span> <span class="token keyword">for</span> <span class="token punctuation">(</span>int opCount <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> opCount <span class="token operator">&lt;</span> <span class="token number">10_000</span><span class="token punctuation">;</span> opCount<span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> dic<span class="token punctuation">[</span>random<span class="token punctuation">.</span><span class="token function">NextInt64</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1024</span> <span class="token operator">*</span> <span class="token number">1024</span> <span class="token operator">*</span> <span class="token number">1024</span><span class="token punctuation">)</span><span class="token punctuation">]</span> <span class="token operator">=</span> v<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// publish the dictionary</span> <span class="token keyword">yield</span> <span class="token keyword">return</span> dic<span class="token punctuation">.</span><span class="token function">ToFrozenDictionary</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre><hr/></p><p style="text-align:left;">The good thing is that we are using a standard dictionary on the write side and publishing it once per transaction. The downside is that we need to pay a cost to create the frozen dictionary that is proportional to the number of items in the dictionary. That can get expensive <em>fast</em>.</p><p style="text-align:left;">After seeing all of those options, let&rsquo;s check the numbers. The full code is in <span style="text-decoration:underline;"><a style="color:inherit;" href="https://gist.github.com/ayende/307d50cac75ab3c5819568f9e5f68644">this gist</a></span>.</p><p style="text-align:left;">I executed all of those using Benchmark.NET, let&rsquo;s see the results. </p><table style="width:100%;" class="table-bordered table-striped" ><tr><strong><td><span style="color:#1f2328;"><strong>Method</strong></span></td><td><span style="color:#1f2328;"><strong>Mean</strong></span></td><td><span style="color:#1f2328;"><strong>Ratio</strong></span></td></strong></tr><tr><td><span style="color:#1f2328;">SingleDictionaryBench</span></td><td><span style="color:#1f2328;">7.768 ms</span></td><td><span style="color:#1f2328;">1.00</span></td></tr><tr><td><span style="color:#1f2328;">BuilderImmutableDictionaryBench</span></td><td><span style="color:#1f2328;">122.508 ms</span></td><td><span style="color:#1f2328;">15.82</span></td></tr><tr><td><span style="color:#1f2328;">ClonedImmutableDictionaryBench</span></td><td><span style="color:#1f2328;">176.041 ms</span></td><td><span style="color:#1f2328;">21.95</span></td></tr><tr><td><span style="color:#1f2328;">ClonedDictionaryBench</span></td><td><span style="color:#1f2328;">1,489.614 ms</span></td><td><span style="color:#1f2328;">195.04</span></td></tr><tr><td><span style="color:#1f2328;">FrozenDictionaryBench</span></td><td><span style="color:#1f2328;">6,279.542 ms</span></td><td><span style="color:#1f2328;">807.36</span></td></tr><tr><td><span style="color:#1f2328;">ImmutableDictionaryFromDicBench</span></td><td><span style="color:#1f2328;">46,906.047 ms</span></td><td><span style="color:#1f2328;">6,029.69</span></td></tr></table><p style="text-align:left;">Note that the difference in speed is absolutely staggering. The SingleDictionaryBench&nbsp;is a bad example. It is just filling a dictionary directly, with no additional cost. The cost for the <span style="color:#1f2328;">BuilderImmutableDictionaryBench </span>is more reasonable, given what it has to do. </p><p style="text-align:left;">Just looking at the benchmark result isn&rsquo;t sufficient. I implemented every one of those options in RavenDB and ran them under a profiler. The results are quite interesting.</p><p style="text-align:left;">Here is the version I started with, using a frozen dictionary. That is the <em>right</em>&nbsp;data structure for what I want. I have one thread that is mutating data, then publish the frozen results for others to use.</p><p style="text-align:left;">However, take a look at the profiler results! Don&rsquo;t focus on the duration values, look at the percentage of time spent creating the frozen dictionary. That is 60%(!) of the <em>total transaction time</em>. That is&hellip; an absolutely insane number.</p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">Note that it is clear that the frozen dictionary isn&rsquo;t suitable for our needs here. The ratio between reading and writing isn&rsquo;t sufficient to justify the cost. One of the benefits of FrozenDictionary is that it is <em>more</em>&nbsp;expensive to create than normal since it is trying hard to optimize for reading performance.</p><p style="text-align:left;">What about the ImmutableDictionary? Well, that is a <em>complete</em>&nbsp;non-starter. It is taking close to 90%(!!) of the total transaction runtime. I know that I called the frozen numbers insane, I should have chosen something else, because now I have no words to describe this. </p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">Remember that one problem here is that we cannot just use the regular dictionary or a concurrent dictionary. We need to have a <em>fixed</em>&nbsp;state of the dictionary when we publish it. What if we use a normal dictionary, cloned?</p><p style="text-align:left;">This is far better, at about 40%, instead of 60% or 90%.</p><p style="text-align:left;"><img src=""/></p><p style="text-align:left;">You have to understand, better doesn&rsquo;t mean good. Spending those numbers on just publishing the state of the transaction is beyond ridiculous. </p><p style="text-align:left;">We need to find another way to do this. Remember where we started? The PageTable in RavenDB that currently handles this is really complex. </p><p style="text-align:left;">I looked into my records and found <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ayende.com/blog/164739/immutable-collections-performance">this blog post from over a decade ago</a></span>, discussing this exact problem. It certainly looks like this complexity is at least semi-justified. </p><p style="text-align:left;">I still want to be able to fix this&hellip; but it won&rsquo;t be as easy as reaching out to a built-in type in the BCL, it seems. </p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201314-B/cloned-dictionary-vs-immutable-dictionary-vs-frozen-dictionary-in-high-traffic-systems?Key=5b127528-fc8b-4749-9442-eedcd34afb9bhttps://www.ayende.com/blog/201314-B/cloned-dictionary-vs-immutable-dictionary-vs-frozen-dictionary-in-high-traffic-systems?Key=5b127528-fc8b-4749-9442-eedcd34afb9bWed, 03 Jul 2024 12:00:00 GMTChallenge: Efficient snapshotable state<p style="text-align:left;">At the heart of RavenDB, there is a data structure that we call the Page Translation Table. It is one of <em>the</em>&nbsp;most important pieces inside RavenDB.</p><p style="text-align:left;">The page translation table is basically a Dictionary&lt;long, Page&gt;, mapping between a page number and the actual page. The critical aspect of this data structure is that it is both concurrent and multi-version. That is, at a single point, there may be multiple versions of the table, representing different versions of the table at given points in time.</p><p style="text-align:left;">The way it works, a transaction in RavenDB generates a page translation table as part of its execution and publishes the table on commit. However, each subsequent table builds upon the previous one, so things become more complex. Here is a usage example (in Python pseudo-code):</p><p style="text-align:left;"><hr/><pre class='line-numbers language-bash'><code class='line-numbers language-bash'>table <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token punctuation">}</span> with wtx1 <span class="token operator">=</span> write_tx<span class="token punctuation">(</span>table<span class="token punctuation">)</span>: wtx1.put<span class="token punctuation">(</span><span class="token number">2</span>, <span class="token string">'v1'</span><span class="token punctuation">)</span> wtx1.put<span class="token punctuation">(</span><span class="token number">3</span>, <span class="token string">'v1'</span><span class="token punctuation">)</span> wtx1.publish<span class="token punctuation">(</span>table<span class="token punctuation">)</span> <span class="token comment"># table has (2 => v1, 3 => v1)</span> with wtx2 <span class="token operator">=</span> write_tx<span class="token punctuation">(</span>table<span class="token punctuation">)</span>: wtx2.put<span class="token punctuation">(</span><span class="token number">2</span>, <span class="token string">'v2'</span><span class="token punctuation">)</span> wtx2.put<span class="token punctuation">(</span><span class="token number">4</span>, <span class="token string">'v2'</span><span class="token punctuation">)</span> wtx2.publish<span class="token punctuation">(</span>table<span class="token punctuation">)</span> <span class="token comment"># table has (2 => v2, 3 => v1, 4 => v2)</span></code></pre><hr/></p><p style="text-align:left;">This is pretty easy to follow, I think. The table is a simple hash table at this point in time.</p><p style="text-align:left;">The catch is when we mix read transactions as well, like so:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-bash'><code class='line-numbers language-bash'><span class="token comment"># table has (2 => v2, 3 => v1, 4 => v2)</span> with rtx1 <span class="token operator">=</span> read_tx<span class="token punctuation">(</span>table<span class="token punctuation">)</span>: with wtx3 <span class="token operator">=</span> write_tx<span class="token punctuation">(</span>table<span class="token punctuation">)</span>: wtx3.put<span class="token punctuation">(</span><span class="token number">2</span>, <span class="token string">'v3'</span><span class="token punctuation">)</span> wtx3.put<span class="token punctuation">(</span><span class="token number">3</span>, <span class="token string">'v3'</span><span class="token punctuation">)</span> wtx3.put<span class="token punctuation">(</span><span class="token number">5</span>, <span class="token string">'v3'</span><span class="token punctuation">)</span> with rtx2 <span class="token operator">=</span> read_tx<span class="token punctuation">(</span>table<span class="token punctuation">)</span>: rtx2.read<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span> <span class="token comment"># => gives, v2</span> rtx2.read<span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span> <span class="token comment"># => gives, v1</span> rtx2.read<span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">)</span> <span class="token comment"># => gives, None</span> wtx3.publish<span class="token punctuation">(</span>table<span class="token punctuation">)</span> <span class="token comment"># table has (2 => v3, 3 => v3, 4 => v2, 5 => v3)</span> <span class="token comment"># but rtx2 still observe the value as they were when</span> <span class="token comment"># rtx2 was created</span> rtx2.read<span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span> <span class="token comment"># => gives, v2</span> rtx2.read<span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span> <span class="token comment"># => gives, v1</span> rtx2.read<span class="token punctuation">(</span><span class="token number">5</span><span class="token punctuation">)</span> <span class="token comment"># => gives, None</span></code></pre><hr/></p><p style="text-align:left;">In other words, until we publish a transaction, its changes don&rsquo;t take effect. And any <em>read</em>&nbsp;translation that was already started isn&rsquo;t impacted. We also need this to be concurrent, so we can use the table in multiple threads (a single write transaction at a time, but potentially <em>many</em>&nbsp;read transactions). Each transaction may modify hundreds or thousands of pages, and we&rsquo;ll only clear the table of old values once in a while (so it isn&rsquo;t <em>infinite</em>&nbsp;growth, but may certainly reach respectable numbers of items).</p><p style="text-align:left;">The implementation we have inside of RavenDB for this is <em>complex</em>! I tried drawing that on the whiteboard to explain what was going on, and I needed both the third and fourth dimensions to illustrate the concept. </p><p style="text-align:left;">Given these requirements, how would you implement this sort of data structure? </p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201281-B/challenge-efficient-snapshotable-state?Key=352251b0-97cd-4fbd-a04a-3e274ffdc2f0https://www.ayende.com/blog/201281-B/challenge-efficient-snapshotable-state?Key=352251b0-97cd-4fbd-a04a-3e274ffdc2f0Mon, 01 Jul 2024 12:00:00 GMTRecording: Building a Database Engine in C# & .NET<p>Watch Oren Eini, CEO of RavenDB, as he delves into the intricate process of constructing a database engine using C# and .NET. Uncover the unique features that make C# a robust system language for high-end system development. Learn how C# provides direct memory access and fine-grained control, enabling developers to seamlessly blend high-level concepts with intimate control over system operations within a single project. Embark on the journey of leveraging the power of C# and .NET to craft a potent and efficient database engine, unlocking new possibilities in system development.</p><p>I’m going <em>deep</em> into some of the cool stuff that you can do with C# and low level programming.</p><p><iframe width="787" height="443" title="Building a Database Engine in C# &amp; .NET" src="https://www.youtube.com/embed/4TqR8yVVjV4" frameborder="0" allowfullscreen="" referrerpolicy="strict-origin-when-cross-origin" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"></iframe></p>https://www.ayende.com/blog/201217-A/recording-building-a-database-engine-in-c-net?Key=7886276f-7ae5-4c05-8418-d01e20ab4e3ahttps://www.ayende.com/blog/201217-A/recording-building-a-database-engine-in-c-net?Key=7886276f-7ae5-4c05-8418-d01e20ab4e3aWed, 19 Jun 2024 12:00:00 GMTCorax Query Plan visualization<p style="text-align:left;">Corax is the new indexing and querying engine in RavenDB, which recently came out with RavenDB 6.0. Our focus when building Corax was on one thing, performance. I did a full talk explaining <span style="text-decoration:underline;"><a style="color:inherit;" href="https://vimeo.com/905237416">how it works from the inside out, available here</a></span>&nbsp;as well as <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ayende.com/blog/200738-B/recording-technology-friends-oren-eini-on-the-corax-search-engine">a couple of podcasts</a></span>.</p><p style="text-align:left;">Now that RavenDB 6.0 has been out for a while, we&rsquo;ve had the chance to complete a few features that didn&rsquo;t make the cut for the big 6.0 release. There is a host of small features for Corax, mostly completing tasks that were not included in the initial 6.0 release. </p><p style="text-align:left;">All these features are available in the 6.0.102 release, which went live in late April 2024.</p><p style="text-align:left;">The most important new feature for Corax is query plan visualization. </p><p style="text-align:left;">Let&rsquo;s run the following query in the RavenDB Studio on the sample data set:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-unknown><code class='line-numbers language-unknown'>from index <span class="token string">'Orders/ByShipment/Location'</span> where spatial<span class="token punctuation">.</span><span class="token function">within</span><span class="token punctuation">(</span>ShipmentLocation<span class="token punctuation">,</span> spatial<span class="token punctuation">.</span><span class="token function">circle</span><span class="token punctuation">(</span> <span class="token number">10</span><span class="token punctuation">,</span> <span class="token number">49.255</span><span class="token punctuation">,</span> <span class="token number">4.154</span><span class="token punctuation">,</span> <span class="token string">'miles'</span><span class="token punctuation">)</span> <span class="token punctuation">)</span> and <span class="token punctuation">(</span>Employee <span class="token operator">=</span> <span class="token string">'employees/5-A'</span> or Company <span class="token operator">=</span> <span class="token string">'companies/85-A'</span><span class="token punctuation">)</span> order by Company<span class="token punctuation">,</span> <span class="token function">score</span><span class="token punctuation">(</span><span class="token punctuation">)</span> include <span class="token function">timings</span><span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre><hr/></p><p style="text-align:left;">Note that we are using the <em>include</em><em>timings()</em>&nbsp;feature. If you configure this index to use Corax, issuing the above query will also give us the full query plan. In this case, you can see it here:</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">You can see exactly how the query engine has processed your query and the pipeline it has gone through. </p><p style="text-align:left;">We have incorporated many additional features into Corax, including phrase queries, scoring based on spatial results, and more complex sorting pipelines. For the most part, those are small but they fulfill specific needs and enable a wider range of scenarios for Corax.</p><p style="text-align:left;">Over six months since Corax went live with 6.0, I can tell that it has been a successful feature. It performs its primary job well, being a faster and more efficient querying engine. And the best part is that it isn&rsquo;t even something that you need to be aware of.</p><p style="text-align:left;">Corax has been the default indexing engine for the Development and Community editions of RavenDB for over 3 months now, and almost no one has noticed. </p><p style="text-align:left;">It&rsquo;s a strange metric, I know, for a feature to be successful when no one is even aware of its existence, but that is a common theme for RavenDB. The whole point behind RavenDB is to provide a database that works, allowing you to forget about it.</p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201025-C/corax-query-plan-visualization?Key=4d05e664-8367-41a4-8411-2029d0071cebhttps://www.ayende.com/blog/201025-C/corax-query-plan-visualization?Key=4d05e664-8367-41a4-8411-2029d0071cebTue, 07 May 2024 12:00:00 GMTRavenDB Cloud now offering NVMe based clusters<p style="text-align:left;">I&rsquo;m really happy to announce that RavenDB Cloud is now offering <span style="text-decoration:underline;"><a style="color:inherit;" href="https://cloud.ravendb.net/pricing?config=eyJpb3BzIjozMDAwLCJjbG91ZFByb3ZpZGVyIjoiQXdzIiwiYXdzUmVnaW9uIjoidXMtZWFzdC0xIiwiYXp1cmVSZWdpb24iOiJlYXN0dXMiLCJnY3BSZWdpb24iOiJ1cy1lYXN0MSIsInRpZXIiOiJQcm9kdWN0aW9uIiwiY29udHJhY3RUZXJtIjoiT25EZW1hbmQiLCJwYXltZW50T3B0aW9uIjoiTW9udGhseSIsImNwdVByaW9yaXR5IjoiUGVyZm9ybWFuY2UiLCJpbnN0YW5jZVR5cGUiOiJQTjIwIiwic3RvcmFnZVR5cGUiOiJTc2RTdGFuZGFyZCIsInN0b3JhZ2VTaXplIjozMiwidGhyb3VnaHB1dCI6MTI1LCJwcm9kdWN0VHlwZUlkIjoiQXdzL1Byb2QvdXMtZWFzdC0xIn0=">NVMe based instances on the Performance tier</a></span>. In short, that means that you can deploy RavenDB Cloud clusters to handle some truly high workloads. </p><p style="text-align:left;">You can learn more about what is actually going on in <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/docs/article-page/6.0/csharp/cloud/cloud-instances#performance-grade-production-cluster">our documentation</a></span>. For performance numbers and costs, feel free to skip to the bottom of this post.</p><p style="text-align:left;">I&rsquo;m assuming that you may not be familiar with everything that a database needs to run <em>fast</em>, and this feature deserves a full explanation of what is on offer. So here are the full details of what you can now do.</p><p style="text-align:left;">RavenDB is a transactional database that often processes far more data than the memory available on the machine. Consequently, it needs to read from and write to &nbsp;the disk. In fact, as a database, you can say that it is its primary role. This means that one of <em>the </em>most important factors for database performance is the speed of your disk. I have written about <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/articles/questions-to-answer-when-sizing-a-ravendb-node">the topic before in more depth</a></span>, if you are interested in <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ravendb.net/articles/database-hardware-optimization-strategies-to-reduce-hosting-costs">exploring the topic</a></span>. </p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">When running on-premises, it&rsquo;s easy to get the best disks you can. We recommend at least good SSDs and prefer NVMe drives for best results. When running on the cloud, the situation is quite different. Machines in the cloud are assumed to be transient, they come and go. Disks, on the other hand, are required to be persistent. So a typical disk on the cloud is actually a remote storage device (typically replicated). That means that disk I/O on the cloud is&hellip; slow. To the point where you could get better performance from off-the-shelf hardware from 20 years ago.</p><p style="text-align:left;">There are higher tiers of high-performance disks available in the cloud, of course. If you need them, you are paying quite a lot for that additional performance. There is another option, however. You can use NVMe disks on the cloud as well. Well, you <em>could</em>, but do you want to?</p><p style="text-align:left;">The reason you&rsquo;d want to use an NVMe disk in the cloud is performance, of course. But the problem with achieving this performance on the cloud is that you lose the safety of &ldquo;this disk is persistent beyond the machine&rdquo;. In other words, this is literally a disk that is attached to the physical server hosting your VM. If something goes wrong with that machine, you lose the disk. Traditionally, that means that you can only use that for transient data, not as the backend store for a database. </p><p style="text-align:left;">However, RavenDB has some interesting options to deal with this. RavenDB Cloud runs RavenDB clusters with 3 copies of the data by default, operating in a full multi-master configuration. Given that we already have multiple copies of the data, what would happen if we lost a machine?</p><p style="text-align:left;">The underlying watchdog would realize that something happened and initiate recovery, which will effectively spawn the instance on another node. That <em>works</em>, but what about the data? All of that data is now lost. The design of RavenDB treats that as an acceptable scenario, the cluster would detect such an issue, move the affected node to rehabilitation mode, and start pumping all the data from the other nodes in the cluster to it.</p><p style="text-align:left;">In short, now we&rsquo;ve shifted from a node failure being catastrophic to having a small bump in the data traffic bill at the end of the month. Thanks to its multi-master setup, RavenDB can recover even if <em>two</em>&nbsp;nodes go down at the same time, as we&rsquo;ll recover from the third one. RavenDB Cloud runs the nodes in the cluster in separate availability zones specifically to handle such failure scenarios. </p><p style="text-align:left;">We have run into this scenario multiple times, both as part of our testing and as actual production events. I am happy to say that everything works as expected, the failed node comes up empty, is filled by the rest of the cluster, and then seamlessly resumes its work. The users were not even aware that something happened.</p><p style="text-align:left;">Of course, there is always the possibility that the entire region could go down, or that three separate instances in three separate availability zones would fail at the same time. What happens then? That is expected to be a rare scenario, but we are all about covering our edge cases.<img src="" style="float: right"/></p><p style="text-align:left;">In such a scenario, you would need to recover from backup. Clusters using NVMe disks are configured to run using Snapshot backups, which consume slightly more disk space than normal but can be restored more quickly. </p><p style="text-align:left;">RavenDB Cloud also blocks the user&#39;s ability to scale up or down such clusters from the portal and requires a support ticket to perform them. This is because special care is needed when performing such operations on NVMe machines. Even with those limitations, we are able to perform such actions with zero downtime for the users.</p><p style="text-align:left;">And after all this story, let&rsquo;s talk numbers. Take a look at the following table illustrating the costs vs. performance on AWS (us-east-1):</p><table style="width:100%;"><tr><strong><td>Type</td><td># of cores</td><td>Memory</td><td>Disk</td><td>Cost ($ / hour)</td></strong></tr><tr><td>P40 (Premium disk)</td><td>16</td><td>64 GB</td><td>2 TB, 10,000 IOPS, 360 MB/s</td><td>8.790</td></tr><tr><td>PN30 (NVMe)</td><td>8</td><td>64 GB</td><td>2 TB, 110,000 IOPS, 440 MB/s</td><td>6.395</td></tr><tr><td>PN40 (NVMe)</td><td>16</td><td>128 GB</td><td>4 TB, 220,000 IOPS, 880 MB/s</td><td>12.782</td></tr></table><p style="text-align:left;">The situation is even more blatant when looking at the numbers on Azure (eastus):</p><table style="width:100%;"><tr><strong><td>Type</td><td># of cores</td><td>Memory</td><td>Disk</td><td>Cost ($ / hour)</td></strong></tr><tr><td>P40 (Premium disk)</td><td>16</td><td>64 GB</td><td>2 TB, 7,500 IOPS, 250 MB/s</td><td>7.089</td></tr><tr><td>PN30 (NVMe)</td><td>8</td><td>64 GB</td><td>2 TB, 400,000 IOPS, 2 GB/s</td><td>6.485</td></tr><tr><td>PN40 (NVMe)</td><td>16</td><td>128 GB</td><td>4 TB, 800,000 IOPS, 4 GB/s</td><td>12.956</td></tr></table><p style="text-align:left;">In other words, you can upgrade to the NVMe cluster and actually <em>reduce</em>&nbsp;the spend if you are stalled on I/O. We expect most workloads to see both higher performance and lower cost from a move from P40 with premium disk to PN30 (same amount of memory, fewer cores). For most workloads, we have found that IO rate matters even more than core count or CPU horsepower. </p><p style="text-align:left;">I&rsquo;m really excited about this new feature, not only because it can give you a big performance boost but also because it leverages the same behavior that RavenDB already exhibits for handling issues in the cluster and recovering from unexpected failures. </p><p style="text-align:left;">In short, you now have some new capabilities at your fingertips, being able to use RavenDB Cloud for even more demanding workloads. Give it a try, I hear it goes vrooom &#128578;.</p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/200641-B/ravendb-cloud-now-offering-nvme-based-clusters?Key=738f78d5-e43f-4b1e-9e9d-7b8aa56f8ad5https://www.ayende.com/blog/200641-B/ravendb-cloud-now-offering-nvme-based-clusters?Key=738f78d5-e43f-4b1e-9e9d-7b8aa56f8ad5Mon, 12 Feb 2024 12:00:00 GMTCorax, Lucene, Benchmarks and lies!<p style="text-align:left;">When we started working on Corax (10 years ago!), we had a pretty simple mission statement for that: &ldquo;Lucene, but 10 times faster for our use case&rdquo;. When we actually started implementing this in code (early 2020), we had a few more rules about the direction we wanted to take.</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">Corax had to be faster than Lucene in all scenarios, and 10 times faster for common indexing and querying scenarios. Corax design is meant for online indexing, not batch-oriented like Lucene. We favor moving work to indexing time and ensuring that our data structures on disk can work with no additional processing time.</p><p style="text-align:left;">Lucene was created at a time when data size was much smaller and disks were far more expensive. It shows in the overall design in many ways, but one of the critical aspects is that the file design for Lucene is compressed, meaning that you need to read the data, decode that into the in-memory data structure, and then process it. </p><p style="text-align:left;">For RavenDB&rsquo;s use case, that turned out to be a serious problem. In particular, the issue of cold queries, where you query the database for the first time and have to pay the initialization cost, was particularly difficult. Now, cold queries aren&rsquo;t really that interesting, from a benchmark perspective, you have to warm things up in every software (caches are everywhere, from your disk to your CPU). I like to say that even <em>memory</em>&nbsp;has caches (yes, plural) because it is so slow (L1, L2, L3 caches). </p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">With Lucene&rsquo;s design, however, whenever it runs an indexing batch, it creates a new file, and to start querying after that means that you have a &ldquo;cold start&rdquo; for that file. Usually, those files are small, but every now and then Lucene needs to merge several files together and then we have to pay the cold start price for a large amount of data.</p><p style="text-align:left;">The issue is that this sometimes introduces a high latency spike (hitting us in the P999 targets), which is really hard to smooth over. We spent a lot of time and engineering resources ensuring that this doesn&rsquo;t have a big impact on our users. </p><p style="text-align:left;">One of the design goals for Corax was to ensure that this doesn&rsquo;t happen. That we are able to get consistent performance from the system without periodic maintenance tasks. That led us to a very different internal design. The persistent data structures that we use are meant to be used as is, without initial processing. </p><p style="text-align:left;">Everything has a cost, and in this case, it means that the size of Corax on disk is typically somewhat larger than Lucene. The big advantage is that the amount of <em>memory</em>&nbsp;being used by Corax tends to be significantly lower. And in today&rsquo;s world, disks are far cheaper than memory. Corax&rsquo;s cold start time is <em>orders of magnitude</em>&nbsp;faster than Lucene&rsquo;s cold start time. </p><p style="text-align:left;">It turns out that there is a huge impact in another scenario as well, completely unexpected. We continuously run performance tests on our system, and we got some ridiculous results when testing query performance using encrypted databases.</p><p style="text-align:left;">When you use encryption at rest, RavenDB ensures that the only time that your data is decrypted is when there is an <em>active transaction using the data</em>. In other words, even in-memory buffers are encrypted. That applies to documents as well as indexes. It does <em>not</em>&nbsp;apply to the in-memory data that Lucene holds in its cache, though. For Corax, however, <em>all</em>&nbsp;of its state is encrypted.</p><p style="text-align:left;">When we run our benchmark on encrypted database queries, we expect to see either roughly the same performance between Corax and Lucene or see Lucene edging out Corax in this scenario, since it can use its cache without paying decryption costs.</p><p style="text-align:left;">Instead, we got really puzzling results. I tried showing them in bar chart format, but I literally couldn&rsquo;t make the data fit in a reasonable size. The scenario is testing queries on an encrypted database, using an m5.xlarge instance on AWS. We are hitting the server with 500 queries/second, and testing for the 99.99 percentile performance.</p><table style="width:100%;"><tr><strong><td><strong>Indexing Engine</strong></td><td><strong>99.99% percentile (ms)</strong></td><td><strong>99.99% percentile (seconds)</strong></td></strong></tr><tr><td>Lucene</td><td>40,210</td><td>40.21</td></tr><tr><td>Corax</td><td>186</td><td>0.18</td></tr></table><p style="text-align:left;">Take a look at those numbers! Somehow Corax is absolutely smoking Lucene&rsquo;s lunch. And I was quite surprised about that. I mean, I&rsquo;m <em>happy</em>, I guess, that the indexing engine we spent so much time on is doing this well, but any time that we see a performance number that we cannot explain we need to figure out what is going on.</p><p style="text-align:left;">Here is the profiler output for this benchmark, using Lucene.</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">As you can see, the vast majority of the time is spent decrypting pages. And we are decrypting pages belonging to a stream. Those are the Lucene files, stored (encrypted in this case) inside of Voron. The issue is that the access pattern that Lucene is using forces us to touch large parts of the file. It usually reads a very small portion each time, but in various locations. Given that the data is encrypted, we have to decrypt each of those locations. </p><p style="text-align:left;">Corax, on the other hand, keeps the persistent data structure in such a way that when we need to access specific pages only. That means that in terms of the number of pages touched by Corax or Lucene for this particular scenario, Lucene is using a lot more. You&rsquo;ll usually not notice that since Voron (our storage engine) is memory mapped and those accesses are cheap. When using encrypted storage, however, we need to decrypt the data first, so that was very noticeable. </p><p style="text-align:left;">It&rsquo;s interesting to note that this also applies to instances where there is a memory pressure involved. Corax would tend to touch a lot less memory and have a smaller working set, while Lucene will generate more page faults. </p><p style="text-align:left;">Really interesting results, and I&rsquo;m both happy and amused that totally different design decisions have led to such a big impact in this scenario. In short, Corax is fast, really fast, and in many more scenarios than we initially thought.</p> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/200609-B/corax-lucene-benchmarks-and-lies?Key=d14c28d7-6b81-4927-8498-70bf7147e5c9https://www.ayende.com/blog/200609-B/corax-lucene-benchmarks-and-lies?Key=d14c28d7-6b81-4927-8498-70bf7147e5c9Wed, 24 Jan 2024 12:00:00 GMTOptimizing cache resets for higher transaction output<p>One of <em>the</em> most frequent operations we make in RavenDB is getting a page pointer. That is the basis of pretty much <em>everything</em> that you can think of inside the storage engine. On the surface, that is pretty simple, here is the API we&rsquo;ll call:</p> <p><span style="color: #0000ff;">public</span> <span style="color: #001080;">Page</span> <span style="color: #795e26;">GetPage</span> <span style="color: #3b3b3b;">(</span> <span style="color: #0000ff;">long</span> <span style="color: #001080;">pageNumber</span> <span style="color: #3b3b3b;">)</span></p> <p>Easy, right? But internally, we need to ensure that we get the <em>right</em> page. And that is where some complexity enters the picture. Voron, our storage engine, implements a pattern called MVCC (multi-version concurrency control). In other words, two transactions loading the same page may see different versions of the page at the same time.</p> <p>What this means is that the call to <span style="color: #795e26;">GetPage</span> <span style="color: #3b3b3b;">() </span>needs to check if the page:</p> <ul> <li>Has been modified in the current transaction</li> <li>Has been modified in a <em>previously </em>committed transaction and has not yet flushed to disk</li> <li>The on-disk version is the most up-to-date one</li> </ul> <p>Each one of those checks is <em>cheap</em>, but getting a page is a <em>common</em> operation. So we implemented a small cache in front of these operations, which resulted in a substantial performance improvement.&nbsp;</p> <p>Conceptually, here is what that cache looks like:</p> <p><span style="color: #0000ff;">public unsafe</span> <span style="color: #0000ff;">class</span> <span style="color: #267f99;">PageLocator </span> <span style="color: #3b3b3b;">{</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">private</span> <span style="color: #0000ff;">struct</span> <span style="color: #267f99;">PageData </span> <span style="color: #3b3b3b;">{</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">long</span> <span style="color: #001080;">PageNumber</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">byte</span>* <span style="color: #001080;">Pointer</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">bool</span> <span style="color: #001080;">IsWritable</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">private</span> <span style="color: #267f99;">PageData</span> <span style="color: #3b3b3b;">[] </span> <span style="color: #001080;">_cache</span> = <span style="color: #0000ff;">new</span> <span style="color: #267f99;">PageData</span> <span style="color: #3b3b3b;">[</span> <span style="color: #098658;">512</span> <span style="color: #3b3b3b;">];</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">byte</span>* <span style="color: #795e26;">Get</span> <span style="color: #3b3b3b;">(</span> <span style="color: #0000ff;">long</span> <span style="color: #001080;">page</span> <span style="color: #3b3b3b;">, </span> <span style="color: #0000ff;">out</span> <span style="color: #0000ff;">bool</span> <span style="color: #001080;">isWritable</span> <span style="color: #3b3b3b;">) {</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">var</span> <span style="color: #001080;">index</span> = <span style="color: #001080;">page</span> &amp; <span style="color: #098658;">511</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">ref</span> <span style="color: #0000ff;">var</span> <span style="color: #001080;">data</span> = <span style="color: #0000ff;">ref</span> <span style="color: #001080;">_cache</span> <span style="color: #3b3b3b;">[</span> <span style="color: #001080;">index</span> <span style="color: #3b3b3b;">];</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #af00db;">if</span> <span style="color: #3b3b3b;"> (</span> <span style="color: #001080;">data</span>.<span style="color: #001080;">PageNumber</span> == <span style="color: #001080;">page</span> <span style="color: #3b3b3b;">) {</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #001080;">isWritable</span> = <span style="color: #001080;">data</span>.<span style="color: #001080;">IsWritable</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #af00db;">return</span> <span style="color: #001080;">data</span>.<span style="color: #001080;">Pointer</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #af00db;">return</span> <span style="color: #001080;">LookupAndGetPage</span> <span style="color: #3b3b3b;">(</span> <span style="color: #001080;">page</span> <span style="color: #3b3b3b;">, </span> <span style="color: #0000ff;">ref</span> <span style="color: #001080;">data</span> <span style="color: #3b3b3b;">, </span> <span style="color: #0000ff;">out</span> <span style="color: #001080;">isWritable</span> <span style="color: #3b3b3b;">);</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">void</span> <span style="color: #795e26;">Reset</span> <span style="color: #3b3b3b;">() {</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #af00db;">for</span> <span style="color: #3b3b3b;"> (</span> <span style="color: #0000ff;">int</span> <span style="color: #001080;">i</span> = <span style="color: #098658;">0</span> <span style="color: #3b3b3b;">; </span> <span style="color: #001080;">i</span> &lt; <span style="color: #001080;">_cache</span>.<span style="color: #001080;">Length</span> <span style="color: #3b3b3b;">; </span> <span style="color: #001080;">i</span>++<span style="color: #3b3b3b;">)</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #001080;">_cache</span> <span style="color: #3b3b3b;">[</span> <span style="color: #001080;">i</span> <span style="color: #3b3b3b;">]</span>.<span style="color: #001080;">PageNumber</span> =-<span style="color: #098658;">1</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">}</span></p> <p>This is intended to be a fairly simple cache, and it is fine if certain access patterns aren&rsquo;t going to produce optimal results. After all, the source it is using is already quite fast, we simply want to get even better performance when we can. This is important because caching is quite a complex topic on its own. The <span style="color: #267f99;">PageLocator </span>itself is used in the context of a transaction and is pooled. Transactions in RavenDB tend to be pretty short, so that alleviates a lot of the complexity around cache management.</p> <p>This is actually a pretty solved problem for us, and has been for a long while. We have been using some variant of the code above for about 9 years. The reason for this post, however, is that we are trying to optimize things further. And this class showed up in our performance traces as problematic.</p> <p>Surprisingly enough, what is actually costly isn&rsquo;t the lookup part, but making the <span style="color: #267f99;">PageLocator </span>ready for the next transaction. We are talking about the <span style="color: #795e26;">Reset</span> <span style="color: #3b3b3b;">()</span> method.</p> <p>The question is: how can we significantly optimize the process of making the instance ready for a new transaction? We don&rsquo;t want to allocate, and resetting the page numbers is what is costing us.</p> <p>One option is to add an <span style="color: #0000ff;">int</span> <span style="color: #001080;">Generation</span> field to the <span style="color: #267f99;">PageData </span>structure, which we&rsquo;ll then check on getting from the cache. Resetting the cache can then be done by incrementing the locator&rsquo;s generation with a single instruction. That is pretty awesome, right?</p> <p>It sadly exposes a problem, what happens when we use the locator enough to encounter an overflow? What happens if we have a sequence of events that brings us back to the same generation as a cached instance? We&rsquo;ll be risking getting an old instance (from a previous transaction, which happened long ago).&nbsp;</p> <p>The chances for that are <em>low</em>, seriously low. But that is not an acceptable risk for us. So we need to consider going further. Here is what we ended up with:</p> <p><span style="color: #0000ff;">public</span> <span style="color: #0000ff;">unsafe</span> <span style="color: #0000ff;">class</span> <span style="color: #267f99;">PageLocator</span> <span style="color: #3b3b3b;"> {</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">private</span> <span style="color: #0000ff;">struct</span> <span style="color: #267f99;">PageData</span> <span style="color: #3b3b3b;"> {</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">long</span> <span style="color: #001080;">PageNumber</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">byte</span>* <span style="color: #001080;">Pointer</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">ushort</span> <span style="color: #001080;">Generation</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">bool</span> <span style="color: #001080;">IsWritable</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">private</span> <span style="color: #267f99;">PageData</span> <span style="color: #3b3b3b;">[] </span> <span style="color: #001080;">_cache</span> = <span style="color: #0000ff;">new</span> <span style="color: #267f99;">PageData</span> <span style="color: #3b3b3b;">[</span> <span style="color: #098658;">512</span> <span style="color: #3b3b3b;">];</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">private</span> <span style="color: #0000ff;">int</span> <span style="color: #001080;">_generation</span> = <span style="color: #098658;">1</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">byte</span>* <span style="color: #795e26;">Get</span> <span style="color: #3b3b3b;">(</span> <span style="color: #0000ff;">long</span> <span style="color: #001080;">page</span> <span style="color: #3b3b3b;">, </span> <span style="color: #0000ff;">out</span> <span style="color: #0000ff;">bool</span> <span style="color: #001080;">isWritable</span> <span style="color: #3b3b3b;">) {</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">var</span> <span style="color: #001080;">index</span> = <span style="color: #001080;">page</span> &amp; <span style="color: #098658;">511</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">ref</span> <span style="color: #0000ff;">var</span> <span style="color: #001080;">data</span> = <span style="color: #0000ff;">ref</span> <span style="color: #001080;">_cache</span> <span style="color: #3b3b3b;">[</span> <span style="color: #001080;">index</span> <span style="color: #3b3b3b;">];</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #af00db;">if</span> <span style="color: #3b3b3b;"> (</span> <span style="color: #001080;">data</span>.<span style="color: #001080;">PageNumber</span> == <span style="color: #001080;">page</span> &amp;&amp; <span style="color: #001080;">data</span>.<span style="color: #001080;">Generation</span> == <span style="color: #001080;">_generation</span> <span style="color: #3b3b3b;">) {</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #001080;">isWritable</span> = <span style="color: #001080;">data</span>.<span style="color: #001080;">IsWritable</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #af00db;">return</span> <span style="color: #001080;">data</span>.<span style="color: #001080;">Pointer</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #af00db;">return</span> <span style="color: #001080;">LookupAndGetPage</span> <span style="color: #3b3b3b;">(</span> <span style="color: #001080;">page</span> <span style="color: #3b3b3b;">, </span> <span style="color: #0000ff;">ref</span> <span style="color: #001080;">data</span> <span style="color: #3b3b3b;">, </span> <span style="color: #0000ff;">out</span> <span style="color: #001080;">isWritable</span> <span style="color: #3b3b3b;">);</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #0000ff;">public</span> <span style="color: #0000ff;">void</span> <span style="color: #795e26;">Reset</span> <span style="color: #3b3b3b;">() {</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #001080;">_generation</span>++<span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #af00db;">if</span> <span style="color: #3b3b3b;">(</span> <span style="color: #001080;">_generation</span> &gt;= <span style="color: #098658;">65535</span> <span style="color: #3b3b3b;">){</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #001080;">_generation</span> = <span style="color: #098658;">1</span> <span style="color: #3b3b3b;">;</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span> <span style="color: #001080;">MemoryMarshal</span>.<span style="color: #001080;">Cast</span> <span style="color: #3b3b3b;">&lt;</span> <span style="color: #267f99;">PageData</span> <span style="color: #3b3b3b;">, </span> <span style="color: #0000ff;">byte</span> <span style="color: #3b3b3b;">&gt;(</span> <span style="color: #001080;">_cache</span> <span style="color: #3b3b3b;">)</span>.<span style="color: #001080;">Fill</span> <span style="color: #3b3b3b;">(</span> <span style="color: #098658;">0</span> <span style="color: #3b3b3b;">);</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">&nbsp;&nbsp;&nbsp;&nbsp;}</span></p> <p><span style="color: #3b3b3b;">}</span></p> <p>Once every 64K operations, we&rsquo;ll pay the cost of resetting the entire buffer, but we do that in an instruction that is heavily optimized. If needed, we can take it further, but here are our results before the change:</p> <p><img src="https://lh7-us.googleusercontent.com/DfKQynCl4Q_3ycGLcwIFdjz8tpUwyKpjzjlz1FG1Z9RCNgtk9WCPfrmbQkW4CSVygBoIwbmyJZSGVSo5U-u36GJSzndPF95klG_tBQy022C62OV_gcEpCNiNDZqGY58B7F_sSlEaufGvI2IZRU-qYSI" alt="" width="569" height="57" /></p> <p>And afterward:</p> <p><img src="https://lh7-us.googleusercontent.com/SG9vSdUalb328B_0E433IBJmNAk03GVcO988vHkiLt2ZFthHhmnqDYGeMnCP-zNDExiXy8m2lFQ6kvbBya0UwhpDXqRASfgX9dbwrq0mZgejVI-PYmpRdGj24NiXJXrbLULMLemiT8_-3OCe_vdrGCk" alt="" width="609" height="41" /></p> <p>The cost of the <span style="color: #795e26;">Renew</span> <span style="color: #3b3b3b;">()</span>call, which was composed mostly of the&nbsp; <span style="color: #795e26;">Reset</span> <span style="color: #3b3b3b;">()</span> call, basically dropped off the radar,and the performance roughly doubled.</p> <p>That is a pretty cool benefit for a straightforward change.</p> <p>&nbsp;</p>https://www.ayende.com/blog/200449-A/optimizing-cache-resets-for-higher-transaction-output?Key=51c9f8d8-ba69-4881-a3b6-b80bf04e51dahttps://www.ayende.com/blog/200449-A/optimizing-cache-resets-for-higher-transaction-output?Key=51c9f8d8-ba69-4881-a3b6-b80bf04e51daThu, 11 Jan 2024 12:00:00 GMTRavenDB Backups are now Faster & Smaller<p>With<a href="https://ayende.com/blog/200193-B/ravendb-version-6-0-is-now-live?key=4f7ec81d511745869ec409d0b602bc46"> the release of RavenDB 6.0</a>, we are now starting to focus on smaller features. The first one out of the gate, part of RavenDB 6.0.1 release, is actually a set of enhancements around making backups faster, smaller and cheaper.</p> <p>I just checked, and the core backup behavior of RavenDB hasn't changed much since 2010(!). In other words, decisions that were made almost 14 years ago are still in effect. There have been a&hellip; number of changes in both RavenDB, its operating environment and the size of the database that we deal with.</p> <p>In the past year, we ran into a number of cases where people working with datasets in the high hundreds of GB to low TB range had issues with backups. In particular, with the duration of backups. After the 6.0 release, we had the capacity to do a lot more about this, so we took a look.</p> <p>On first impression, you would expect that backing up a database whose size exceeds 750GB will take&hellip; a while. And indeed, it does. The question is, why? It&rsquo;s a lot of data, sure. But where does the time go?</p> <p>The format of RavenDB backups is really simple. It is just a GZipped JSON file. The contents are treated as a JSON stream that contains all the data in the database. This has a number of advantages, the file size is small, the format itself lends itself well to extension, it is streamable, etc. In fact, it is a testament to the early design decision that we haven&rsquo;t really had to touch that in so long.</p> <p>Given that the format is stable, and that we have a lot of experience with producing JSON, we approach the task of optimizing the backups with a good idea where we should go. The problem is likely with I/O (we need to go through the entire database, after all). There were some (pretty wild) ideas flying around on how to address this, but the first thing to do, of course, was to run it under the profiler.</p> <p>The results, as you can imagine, were not what we expected. It turns out that we spend quite a lot of the time inside of GZip, compressing the data. It turns out that when we set up the backup format all those years ago, we chose GZip and Optimal compression mode. In other words, we wanted the file size to be as small as possible. That&hellip; makes sense, of course. But it turns out that the vast majority of the time is actually spent compressing the data?</p> <p>Time to start looking deeper into that. GZip is an old format (it came out in 1992!). And recently there have been a number of new compression algorithms (Zstd, Brotli, etc). We decided to look into those in detail. GZip also has several modes that affect compression ratio vs. compression time.</p> <p>After a bit of experimentation, we have the following details when backing up a 35GB database.</p> <table border="0" cellspacing="0" cellpadding="0"> <tbody> <tr> <td valign="top"> <p><strong>Algorithm &amp; Mode&nbsp;&nbsp;&nbsp;&nbsp; </strong></p> </td> <td valign="top"> <p><strong>Size</strong></p> </td> <td valign="top"> <p><strong>Time</strong></p> </td> </tr> <tr> <td valign="top"> <p>GZip - Optimal</p> </td> <td valign="top"> <p>5.9 GB</p> </td> <td valign="top"> <p>6 min, 40 sec</p> </td> </tr> <tr> <td valign="top"> <p>GZip - Fastest</p> </td> <td valign="top"> <p>6.6 GB</p> </td> <td valign="top"> <p>4 min, 7 sec</p> </td> </tr> <tr> <td valign="top"> <p>ZStd - Fastest</p> </td> <td valign="top"> <p>4.1 GB</p> </td> <td valign="top"> <p>3 min, 1 sec</p> </td> </tr> </tbody> </table> <p>The data in this case is mostly textual (JSON), and it turns out that we can reduce the backup time by more than half while saving 30% in the space we take. Those are some nice numbers.</p> <p>You&rsquo;ll note that ZStd also has a mode that controls compression ratio vs compression time. We tried checking this as well on a different dataset (a snapshot of the actual database) with a size of 25.5GB and we got:</p> <table border="0" cellspacing="0" cellpadding="0"> <tbody> <tr> <td valign="top"> <p><strong>Algorithm &amp; Mode&nbsp;&nbsp;&nbsp; </strong></p> </td> <td valign="top"> <p><strong>Size</strong></p> </td> <td valign="top"> <p><strong>Time</strong></p> </td> </tr> <tr> <td valign="top"> <p>ZStd - Fastest</p> </td> <td valign="top"> <p>2.18 GB</p> </td> <td valign="top"> <p>56 sec</p> </td> </tr> <tr> <td valign="top"> <p>ZStd - Optimal</p> </td> <td valign="top"> <p>1.98 GB</p> </td> <td valign="top"> <p>1 min, 41 sec</p> </td> </tr> <tr> <td valign="top"> <p>GZip - Optimal</p> </td> <td valign="top"> <p>2.99 GB</p> </td> <td valign="top"> <p>3 min, 50 sec</p> </td> </tr> </tbody> </table> <p>As you can see, GZip isn&rsquo;t going to get a participation trophy at this point, coming dead last for both size and time.</p> <p>In short, RavenDB 6.0.1 will use the new ZStd compression algorithm for backups (and export files),&nbsp; and you can expect to have greatly reduced backup times as well as smaller backups overall.</p> <p>This is now the default mode for RavenDB 6.0.1 or higher, but you can control that in the backup settings if you so wish.</p> <p><a href="https://ayende.com/blog/Images/Open-Live-Writer/RavenDB-Backups-are-now-Faster--Smaller_947F/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/RavenDB-Backups-are-now-Faster--Smaller_947F/image_thumb.png" alt="image" width="602" height="203" border="0" /></a></p> <p><span style="font-weight: 400;">Restoring from old backups is no issue, of course, but restoring a ZStd backup on an older version of RavenDB is not supported. You can configure RavenDB to use the GZip algorithm if that is required.</span></p> <p><span style="font-weight: 400;">Another feature that is going to improve backup performance is the notion of backup mode, which you can see in the image above. RavenDB backups support multiple destinations, so you can back up to Amazon S3 as well as Azure Blob Storage as a single unit.&nbsp;</span></p> <p><span style="font-weight: 400;">At the time of designing the backup system, that was a nice feature to have, since we assume that you&rsquo;ll usually have a backup to a local disk (for quick restore) as well as an offsite backup for longer-term storage. In practice, almost all backup configurations in RavenDB have a single destination. However, because we have support for multiple backup destinations, the backup process will first write the backup file to the local disk and then upload it.</span></p> <p><span style="font-weight: 400;">The new Direct Upload mode only supports a single destination, and it streams the data to the destination directly, without touching the disk. As a result of this change, we are using far less I/O during backup procedures as well as reducing the total time it takes to run the backup.</span></p> <p><span style="font-weight: 400;">This is especially useful if your backup destination is nearby and the network is good. This is frequently the case in the cloud, where you are backing up to S3 in the same region. In our tests, it reduced the backup time by 30% in some cases.</span></p> <p><span style="font-weight: 400;">From a coding perspective, those are not huge changes, but together they mean that backups in RavenDB are now cheaper, faster, and far smaller. That translates to a better operating environment for your system. It also means that the costs of storing backups are going to go down by a significant amount.</span></p> <p>You can read all the technical details about the few features in the feature announcements:</p> <ul> <li> <p><a href="https://github.com/ravendb/ravendb/discussions/17678">Introduction of zstd compression algorithm in Backups and Exports</a></p> </li> <li> <p><a href="https://github.com/ravendb/ravendb/discussions/17679">Direct Upload mode in Backups</a></p> </li> </ul>https://www.ayende.com/blog/200321-A/ravendb-backups-are-now-faster-smaller?Key=37896812-01f2-4e47-b843-793236cb4cfdhttps://www.ayende.com/blog/200321-A/ravendb-backups-are-now-faster-smaller?Key=37896812-01f2-4e47-b843-793236cb4cfdFri, 15 Dec 2023 12:00:00 GMTProduction Postmortem: The Spawn of Denial of Service<p>A customer contacted us to complain about a highly unstable cluster in their production system. The metrics didn&rsquo;t support the situation, however. There was no excess load on the cluster in terms of CPU and memory, but there were a lot of network issues. The cluster got to the point where it would just flat-out be unable to connect from one node to another.</p> <p>It was obviously some sort of a network issue, but our ping and network tests worked just fine. Something else was going on. Somehow, the server would get to a point where it would be simply inaccessible for a period of time, then accessible, then not, etc. What was <em>weird</em> was that the usual metrics didn&rsquo;t give us anything. The logs were fine, as were memory and CPU. The network was stable throughout.</p> <p>If the first level of metrics isn&rsquo;t telling a story, we need to dig deeper. So we did, and we found something really interesting. Here is the total number of TCP connections on the server over time.</p> <p>So there are a <em>lot</em> of connections on the system, which is choking it? But the CPU is fine, so what is going on? Are we being attacked? We looked at the connections, but they all came from authorized machines, and the firewall was locked down tight.</p> <p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-Postmortem-The-Spawn-of-Denia_FC1A/unnamed.jpg"><img style="border: 0px currentcolor; display: inline; background-image: none;" title="unnamed" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-Postmortem-The-Spawn-of-Denia_FC1A/unnamed_thumb.jpg" alt="unnamed" width="640" height="141" border="0" /></a></p> <p>If you look closely at the graph, you can see that it hits 32K connections at its peak. That is a really interesting number, because 32K is also the number of ephemeral port range values for Linux. In other words, we basically hit the OS limit for how many connections could be sustained between a client and a server.</p> <p>The question is what could be generating all of those connections? Remember, they are coming from a trusted source and are valid operations.&nbsp; Indeed, digging deeper we could see that there are a lot of connections in the TIME_WAIT state.</p> <p>We asked to look at the client code to figure out what was going on. Here is what we found:</p> <blockquote> <script src="https://gist.github.com/ayende/502133e9abcbe5efc21c8daa60cece6b.js"></script> </blockquote> <p>There is&hellip; not much here, as you can see. And certainly nothing that should cause us to generate a stupendous amount of connections to the server. In fact, this is a very short process. It is going to run, read a single line from the input, write a document to RavenDB, and then exit.</p> <p>To understand what is actually going on, we need to zoom out and understand the system at a higher level. Let&rsquo;s assume that the script above is called using the following manner:</p> <blockquote> <script src="https://gist.github.com/ayende/b523953e5c2c5f23c2ecede304808b33.js"></script> </blockquote> <p>What will happen now? All of this code is pretty innocent, I&rsquo;m sure you can tell. But together, we are going to get the following interesting behavior:</p> <p>For each line in the input, we&rsquo;ll invoke the script, which will spawn a separate process to connect to RavenDB, write a single document to the server, and exit. Immediately afterward, we'll have <em>another</em> such process, etc.</p> <p>Each of those processes is going to have a separate connection, identified by a quartet of (src ip, src port, dst ip, dst port). And there are only so many such ports available on the OS. Once you close a connection, it is moved to a TIME_WAIT mode, and any packets that arrive for the specified connection quartet are going to be assumed to be from the old connection and drop. Generate enough new connections fast enough, and you literally lock yourself out of the network.</p> <p>The solution to this problem is to avoid using a separate process for each interaction. Aside from alleviating the connection issue (which also requires non trivial cost on the server) it allows RavenDB to far better optimize network and traffic patterns.</p>https://www.ayende.com/blog/200289-B/production-postmortem-the-spawn-of-denial-of-service?Key=afdb85dc-3822-4528-b03b-456fceeb4409https://www.ayende.com/blog/200289-B/production-postmortem-the-spawn-of-denial-of-service?Key=afdb85dc-3822-4528-b03b-456fceeb4409Tue, 12 Dec 2023 12:00:00 GMTFiltering negative numbers, fast: Beating memcpy()<p>In the <a href="https://ayende.com/blog/200102-C/filtering-negative-numbers-fast-avx">previous post</a>, I was able to utilize AVX to get some nice speedups. In general, I was able to save up to 57%(!) of the runtime in processing arrays of 1M items. That is really amazing, if you think about it. But my best effort only gave me a 4% improvement when using 32M items.</p> <p>I decided to investigate what is going on in more depth, and I came up with the following benchmark. Given that I want to filter negative numbers, what would happen if the only negative number in the array was the first one?</p> <p>In other words, let&rsquo;s see what happens when we could write this algorithm as the following line of code:</p> <blockquote> <p>array[1..].CopyTo(array);</p> </blockquote> <p>The idea here is that we should measure the speed of raw memory copy and see how that compares to our code.</p> <p>Before we dive into the results, I want to make a few things explicit. We are dealing here with arrays of long, when I&rsquo;m talking about an array with 1M items, I&rsquo;m actually talking about an 8MB buffer, and for the 32M items, we are talking about 256MB of memory.</p> <p>I&rsquo;m running these benchmarks on the following machine:</p> <blockquote> <p>&nbsp;&nbsp;&nbsp; AMD Ryzen 9 5950X 16-Core Processor</p> <p>&nbsp;&nbsp;&nbsp; Base speed:&nbsp;&nbsp;&nbsp; 3.40 GHz<br />&nbsp;&nbsp;&nbsp;&nbsp; L1 cache:&nbsp;&nbsp;&nbsp; 1.0 MB<br />&nbsp;&nbsp;&nbsp;&nbsp; L2 cache:&nbsp;&nbsp;&nbsp; 8.0 MB<br />&nbsp;&nbsp;&nbsp;&nbsp; L3 cache:&nbsp;&nbsp;&nbsp; 64.0 MB</p> <p>&nbsp;&nbsp;&nbsp; Utilization&nbsp;&nbsp;&nbsp; 9%<br />&nbsp;&nbsp;&nbsp;&nbsp; Speed&nbsp;&nbsp;&nbsp; 4.59 GHz</p> </blockquote> <p>In other words, when we look at this, the 1M items (8MB) can fit into L2 (barely, but certainly backed by the L3 cache. For the 32M items (256MB), we are far beyond what can fit in the cache, so we are probably dealing with memory bandwidth issues.</p> <p>I wrote the following functions to test it out:</p> <blockquote> <script src="https://gist.github.com/ayende/ab0deb8f900f3005f439802a10f367cb.js"></script> </blockquote> <p>Let&rsquo;s look at what I&rsquo;m actually testing here.</p> <ul> <li>CopyTo() &ndash; using the span native copying is the most ergonomic way to do things, I think.</li> <li>MemoryCopy() &ndash; uses a built-in unsafe API in the framework. That eventually boils down to a <a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffer.cs#L134">heavily optimized routine</a>, which&hellip; calls to Memove() if the buffer overlaps (as they do in this case).</li> <li>MoveMemory() &ndash; uses a pinvoke to call to the Windows API to do the moving of memory for us.</li> </ul> <p>Here are the results for the 1M case (8MB):</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> </tr> </thead> <tbody> <tr> <td>FilterCmp</td> <td>1048599</td> <td style="text-align: right;">441.4 us</td> <td style="text-align: right;">1.78 us</td> <td style="text-align: right;">1.58 us</td> <td style="text-align: right;">1.00</td> </tr> <tr> <td>FilterCmp_Avx</td> <td>1048599</td> <td style="text-align: right;">141.1 us</td> <td style="text-align: right;">2.70 us</td> <td style="text-align: right;">2.65 us</td> <td style="text-align: right;">0.32</td> </tr> <tr> <td>CopyTo</td> <td>1048599</td> <td style="text-align: right;">872.8 us</td> <td style="text-align: right;">11.27 us</td> <td style="text-align: right;">10.54 us</td> <td style="text-align: right;">1.98</td> </tr> <tr> <td>MemoryCopy</td> <td>1048599</td> <td style="text-align: right;">869.7 us</td> <td style="text-align: right;">7.29 us</td> <td style="text-align: right;">6.46 us</td> <td style="text-align: right;">1.97</td> </tr> <tr> <td>MoveMemory</td> <td>1048599</td> <td style="text-align: right;">126.9 us</td> <td style="text-align: right;">0.28 us</td> <td style="text-align: right;">0.25 us</td> <td style="text-align: right;">0.29</td> </tr> </tbody> </table> <p>We can see some real surprises here. I&rsquo;m using the FilterCmp (the very basic implementation) that I wrote.</p> <p>I cannot explain why CopyTo() and MemoryMove() are so slow.</p> <p>What is really impressive is that the FilterCmp_Avx() and MoveMemory() are so close in performance, and <em>so much faster. </em>To put it in another way, we are already at a stage where we are within shouting distance from the MoveMemory() performance. That is.. really impressive.</p> <p>That said, what happens with 32M (256MB) ?</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> </tr> </thead> <tbody> <tr> <td>FilterCmp</td> <td>33554455</td> <td style="text-align: right;">22,763.6 us</td> <td style="text-align: right;">157.23 us</td> <td style="text-align: right;">147.07 us</td> <td style="text-align: right;">1.00</td> </tr> <tr> <td>FilterCmp_Avx</td> <td>33554455</td> <td style="text-align: right;">20,122.3 us</td> <td style="text-align: right;">214.10 us</td> <td style="text-align: right;">200.27 us</td> <td style="text-align: right;">0.88</td> </tr> <tr> <td>CopyTo</td> <td>33554455</td> <td style="text-align: right;">27,660.1 us</td> <td style="text-align: right;">91.41 us</td> <td style="text-align: right;">76.33 us</td> <td style="text-align: right;">1.22</td> </tr> <tr> <td>MemoryCopy</td> <td>33554455</td> <td style="text-align: right;">27,618.4 us</td> <td style="text-align: right;">136.16 us</td> <td style="text-align: right;">127.36 us</td> <td style="text-align: right;">1.21</td> </tr> <tr> <td>MoveMemory</td> <td>33554455</td> <td style="text-align: right;">20,152.0 us</td> <td style="text-align: right;">166.66 us</td> <td style="text-align: right;">155.89 us</td> <td style="text-align: right;">0.89</td> </tr> </tbody> </table> <p>Now we are <em>faster</em> in the FilterCmp_Avx than MoveMemory. That is&hellip; a pretty big wow, and a really nice close for this blog post series, right? Except that we won&rsquo;t be stopping here.</p> <p>The way the task I set out works, we are actually filtering just the first item out, and then we are basically copying the memory. Let&rsquo;s do some math: 256MB in 20.1ms means 12.4 GB/sec!</p> <p>On this system, I have the following memory setup:</p> <blockquote> <p>&nbsp;&nbsp;&nbsp; 64.0 GB</p> <p>&nbsp;&nbsp;&nbsp; Speed:&nbsp;&nbsp;&nbsp; 2133 MHz<br />&nbsp;&nbsp;&nbsp;&nbsp; Slots used:&nbsp;&nbsp;&nbsp; 4 of 4<br />&nbsp;&nbsp;&nbsp;&nbsp; Form factor:&nbsp;&nbsp;&nbsp; DIMM<br />&nbsp;&nbsp;&nbsp;&nbsp; Hardware reserved:&nbsp;&nbsp;&nbsp; 55.2 MB</p> </blockquote> <p>I&rsquo;m using DDR4 memory, so <a href="https://www.softwareok.com/?seite=faq-This-and-That-or-Other&amp;faq=74">I can expect a maximum speed of 17GB/sec.</a> In theory, I might be able to get more if the memory is located on different DIMMs, but for the size in question, that is not likely.</p> <p>I&rsquo;m going to skip the training montage of VTune, understanding memory architecture and figuring out what is actually going on.</p> <p>Let&rsquo;s drop everything and look at what we have with just AVX vs. MoveMemory:</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Median</th> <th style="text-align: right;">Ratio</th> </tr> </thead> <tbody> <tr> <td>FilterCmp_Avx</td> <td>1048599</td> <td style="text-align: right;">141.6 us</td> <td style="text-align: right;">2.28 us</td> <td style="text-align: right;">2.02 us</td> <td style="text-align: right;">141.8 us</td> <td style="text-align: right;">1.12</td> </tr> <tr> <td>MoveMemory</td> <td>1048599</td> <td style="text-align: right;">126.8 us</td> <td style="text-align: right;">0.25 us</td> <td style="text-align: right;">0.19 us</td> <td style="text-align: right;">126.8 us</td> <td style="text-align: right;">1.00</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp_Avx</td> <td>33554455</td> <td style="text-align: right;">21,105.5 us</td> <td style="text-align: right;">408.65 us</td> <td style="text-align: right;">963.25 us</td> <td style="text-align: right;">20,770.4 us</td> <td style="text-align: right;">1.08</td> </tr> <tr> <td>MoveMemory</td> <td>33554455</td> <td style="text-align: right;">20,142.5 us</td> <td style="text-align: right;">245.08 us</td> <td style="text-align: right;">204.66 us</td> <td style="text-align: right;">20,108.2 us</td> <td style="text-align: right;">1.00</td> </tr> </tbody> </table> <p>The new baseline is MoveMemory, and in this run, we can see that the AVX code is slightly <em>slower</em>.</p> <p>It&rsquo;s sadly not uncommon to see numbers shift by those ranges when we are testing such micro-optimizations, mostly because we are subject to so many variables that can affect performance. In this case, I dropped all the other benchmarks, which may have changed things.</p> <p>At any rate, using those numbers, we have 12.4GB/sec for MoveMemory() and 11.8GB/sec for the AVX version. The <em>hardware </em>maximum speed is 17GB/sec. So we are quite close to what can be done.</p> <p>For that matter, I would like to point out that the trivial code completed the task in 11GB/sec, so that is <em>quite</em> respectable and shows that the issue here is literally getting the memory fast enough to the CPU.</p> <p>Can we do something about that? I made a <a href="https://gist.github.com/ayende/ad5ae9c3a4519b116db920f2b9ea10cd">pretty small change</a> to the AVX version, like so:</p> <blockquote> <script src="https://gist.github.com/ayende/bdb8612f1c05dd32810a4164cd04f6a5.js"></script> </blockquote> <p>What are we actually doing here? Instead of loading the value and immediately using it, we are loading the <em>next</em> value, then we are executing the loop and when we iterate again, we will start loading the next value and process the current one. The idea is to parallelize load and compute at the instruction level.</p> <p>Sadly, that didn&rsquo;t seem to do the trick. I saw a 19% additional cost for that version compared to the vanilla AVX one on the 8MB run and a 2% additional cost on the 256MB run.</p> <p>I then realized that there was one really important test that I had to also make, and wrote the following:</p> <blockquote> <script src="https://gist.github.com/ayende/61cda608199a25705d0139adbcaeb1f1.js"></script> </blockquote> <p>In other words, let's test the speed of moving memory and filling memory as fast as we possibly can. Here are the results:</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">RatioSD</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>MoveMemory</td> <td>1048599</td> <td style="text-align: right;">126.8 us</td> <td style="text-align: right;">0.36 us</td> <td style="text-align: right;">0.33 us</td> <td style="text-align: right;">0.25</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">270 B</td> </tr> <tr> <td>FillMemory</td> <td>1048599</td> <td style="text-align: right;">513.5 us</td> <td style="text-align: right;">10.05 us</td> <td style="text-align: right;">10.32 us</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">351 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>MoveMemory</td> <td>33554455</td> <td style="text-align: right;">20,022.5 us</td> <td style="text-align: right;">395.35 us</td> <td style="text-align: right;">500.00 us</td> <td style="text-align: right;">1.26</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">270 B</td> </tr> <tr> <td>FillMemory</td> <td>33554455</td> <td style="text-align: right;">15,822.4 us</td> <td style="text-align: right;">19.85 us</td> <td style="text-align: right;">17.60 us</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">351 B</td> </tr> </tbody> </table> <p>This is really interesting, for a small buffer (8MB), MoveMemory is somehow faster. I don&rsquo;t have a way to explain that, but it has been a pretty consistent result in my tests.</p> <p>For the large buffer (256MB), we are seeing results that make more sense to me.</p> <ul> <li>MoveMemory &ndash; 12.5 GB / sec</li> <li>FIllMemory &ndash; 15.8 GB / sec</li> </ul> <p>In other words, for MoveMemory, we are both reading and writing, so we are paying for memory bandwidth in both directions. For filling the memory, we are only writing, so we can get better performance (no need for reads).</p> <p>In other words, we are talking about hitting the real physical limits of what the hardware can do. There are all sorts of tricks that one can pull, but when we are this close to the limit, they are almost always context-sensitive and dependent on many factors.</p> <p>To conclude, here are my final results:</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">RatioSD</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>FilterCmp_Avx</td> <td>1048599</td> <td style="text-align: right;">307.9 us</td> <td style="text-align: right;">6.15 us</td> <td style="text-align: right;">12.84 us</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">0.05</td> <td style="text-align: right;">270 B</td> </tr> <tr> <td>FilterCmp_Avx_Next</td> <td>1048599</td> <td style="text-align: right;">308.4 us</td> <td style="text-align: right;">6.07 us</td> <td style="text-align: right;">9.26 us</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">0.03</td> <td style="text-align: right;">270 B</td> </tr> <tr> <td>CopyTo</td> <td>1048599</td> <td style="text-align: right;">1,043.7 us</td> <td style="text-align: right;">15.96 us</td> <td style="text-align: right;">14.93 us</td> <td style="text-align: right;">3.37</td> <td style="text-align: right;">0.11</td> <td style="text-align: right;">452 B</td> </tr> <tr> <td>ArrayCopy</td> <td>1048599</td> <td style="text-align: right;">1,046.7 us</td> <td style="text-align: right;">15.92 us</td> <td style="text-align: right;">14.89 us</td> <td style="text-align: right;">3.38</td> <td style="text-align: right;">0.14</td> <td style="text-align: right;">266 B</td> </tr> <tr> <td>UnsafeCopy</td> <td>1048599</td> <td style="text-align: right;">309.5 us</td> <td style="text-align: right;">6.15 us</td> <td style="text-align: right;">8.83 us</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.04</td> <td style="text-align: right;">133 B</td> </tr> <tr> <td>MoveMemory</td> <td>1048599</td> <td style="text-align: right;">310.8 us</td> <td style="text-align: right;">6.17 us</td> <td style="text-align: right;">9.43 us</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">270 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp_Avx</td> <td>33554455</td> <td style="text-align: right;">24,013.1 us</td> <td style="text-align: right;">451.09 us</td> <td style="text-align: right;">443.03 us</td> <td style="text-align: right;">0.98</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">270 B</td> </tr> <tr> <td>FilterCmp_Avx_Next</td> <td>33554455</td> <td style="text-align: right;">24,437.8 us</td> <td style="text-align: right;">179.88 us</td> <td style="text-align: right;">168.26 us</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">270 B</td> </tr> <tr> <td>CopyTo</td> <td>33554455</td> <td style="text-align: right;">32,931.6 us</td> <td style="text-align: right;">416.57 us</td> <td style="text-align: right;">389.66 us</td> <td style="text-align: right;">1.35</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">452 B</td> </tr> <tr> <td>ArrayCopy</td> <td>33554455</td> <td style="text-align: right;">32,538.0 us</td> <td style="text-align: right;">463.00 us</td> <td style="text-align: right;">433.09 us</td> <td style="text-align: right;">1.33</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">266 B</td> </tr> <tr> <td>UnsafeCopy</td> <td>33554455</td> <td style="text-align: right;">24,386.9 us</td> <td style="text-align: right;">209.98 us</td> <td style="text-align: right;">196.42 us</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">133 B</td> </tr> <tr> <td>MoveMemory</td> <td>33554455</td> <td style="text-align: right;">24,427.8 us</td> <td style="text-align: right;">293.75 us</td> <td style="text-align: right;">274.78 us</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">270 B</td> </tr> </tbody> </table> <p>As you can see, just the AVX version is comparable or (slightly) beating the MoveMemory function.</p> <p>I tried things like prefetching memory, both the next item, the next cache item and from the next page, using non-temporal load and stores and many other things, but this is a pretty tough challenge.</p> <p>What is really interesting is that the original, very simple and obvious implementation, clocked at 11 GB/sec. After pulling pretty much all the stops and tricks, I was able to hit 12.5 GB / sec.</p> <p>I don&rsquo;t think anyone can look / update / understand the resulting code in any way without going through deep meditation. That is <em>not</em> a bad result at all. And along the way, I learned quite a bit about how the lowest level of the machine architecture&nbsp;is working.</p>https://www.ayende.com/blog/200129-A/filtering-negative-numbers-fast-beating-memcpy?Key=150a4bae-c691-4f65-a2b6-cbabae9f8832https://www.ayende.com/blog/200129-A/filtering-negative-numbers-fast-beating-memcpy?Key=150a4bae-c691-4f65-a2b6-cbabae9f8832Fri, 15 Sep 2023 12:00:00 GMTFiltering negative numbers, fast: AVX<p><a href="https://ayende.com/blog/200101-C/filtering-negative-numbers-fast-unroll?key=617ce0ba58e7446092686f584c90c4cd">In the previous post</a> I discussed how we can optimize the filtering of negative numbers by unrolling the loop, looked into branchless code and in general was able to improve performance by up to 15% from the initial version we started with. We pushed as much as we could on what can be done using scalar code. Now it is the time to open a whole new world and see what we can do when we implement this challenge using vector instructions.</p> <p>The key problem with such tasks is that SIMD, AVX and their friends were designed by&hellip; an interesting process using a perspective that makes sense if you can see in a couple of additional dimensions. I assume that at least some of that is implementation constraints, but the key issue is that when you start using SIMD, you realize that you don&rsquo;t have general-purpose instructions. Instead, you have a lot of dedicated instructions that are doing one thing, hopefully well, and it is your role to compose them into something that would make sense. Oftentimes, you need to turn the solution on its head in order to successfully solve it using SIMD. The benefit, of course, is that you can get quite an amazing boost in speed when you do this.</p> <p>The algorithm we use is basically to scan the list of entries and copy to the start of the list only those items that are positive. How can we do that using SIMD? The whole point here is that we want to be able to operate on multiple data, but this particular task isn&rsquo;t trivial. I&rsquo;m going to show the code first, then discuss what it does in detail:</p> <blockquote> <script src="https://gist.github.com/ayende/3982fadce72c64a99f5c3a5a1ea8d2c2.js"></script> </blockquote> <p>We start with the usual check (if you&rsquo;ll recall, that ensures that the JIT knows to elide some range checks, then we load the PremuteTable. For now, just assume that this is magic (and it is). The first interesting thing happens when we start iterating over the loop. Unlike before, now we do that in chunks of 4 int64 elements at a time. Inside the loop, we start by loading a vector of int64 and then we do the first odd thing. We call <em>ExtractMostSignificantBits</em>(), since the sign bit is used to mark whether&nbsp;a number if negative or not. That means that I can use a single instruction to get an integer with the bits set for all the negative numbers. That is particularly juicy for what we need, since there is no need for comparisons, etc.</p> <p>If the mask we got is all zeroes, it means that all the numbers we loaded to the vector are positives, so we can write them as-is to the output and move to the next part. Things get interesting when that isn&rsquo;t the case.</p> <p>We load a permute value using some shenanigans (we&rsquo;ll touch on that shortly) and call the <em>PermuteVar8x32</em>() method. The idea here is that we pack all the non-negative numbers to the start of the vector, then we write the vector to the output. The key here is that when we do that, we increment the output index only by the number of valid values.&nbsp; The rest of this method just handles the remainder that does not fit into a vector.</p> <p>The hard part in this implementation was to figure out how to handle the scenario where we loaded some negative numbers. We need a way to filter them, after all. But there is no SIMD instruction that allows us to do so. Luckily, we have the <a href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.permutevar8x32?view=net-7.0">Avx2.PermuteVar8x32</a>() method to help here. To confuse things, we don&rsquo;t actually want to deal with 8x32 values. We want to deal with 4x64 values. There <em>is</em> <a href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx2.permute4x64?view=net-7.0#system-runtime-intrinsics-x86-avx2-permute4x64(system-runtime-intrinsics-vector256((system-int64))-system-byte)">Avx2.Permute4x64</a>() method, and it will work quite nicely, with a single caveat. This method assumes that you are going to pass it a constant value. We don&rsquo;t have such a constant, we need to be able to provide that based on whatever the masked bits will give us.</p> <p>So how do we deal with this issue of filtering with SIMD? We need to move all the values we care about to the front of the vector. We have the method to do that, <em>PermuteVar8x32</em>() method, and we just need to figure out how to actually make use of this. <em>PermuteVar8x32</em>() accepts an input vector as well as a vector of the premutation you want to make. In this case, we are basing this on the 4 top bits of the 4 elements vector of int64. As such, there are a total of 16 options available to us. We have to deal with 32bits values rather than 64bits, but that isn&rsquo;t that much of a problem.</p> <p>Here is the premutation table that we&rsquo;ll be using:</p> <blockquote> <script src="https://gist.github.com/ayende/9b1ace0341874c50403342346928299c.js"></script> </blockquote> <p>What you can see here is that when we have a 1 in the bits (shown in comments) we&rsquo;ll not copy that to the vector. Let&rsquo;s take a look at the entry of <em>0101</em>, which may be caused by the following values [1,-2,3,-4].</p> <p>When we look at the right entry at index #5 in the table: 2,3,6,7,0,0,0,0</p> <p>What does this mean? It means that we want to put the 2nd int64 element in the source vector and move it as the first element of the destination vector, take the 3rd element from the source as the second element in the destination and discard the rest (marked as 0,0,0,0 in the table).</p> <p>This is a bit hard to follow because we have to compose the value out of the individual 32 bits words, but it works quite well. Or, at least, it would work, but not as efficiently. This is because we would need to load the PermuteTableInts into a variable and access it, but there are better ways to deal with it. We can also ask the JIT to embed the value directly. The problem is that the pattern that the JIT recognizes is limited to ReadOnlySpan&lt;byte&gt;, which means that the already non-trivial int32 table got turned into this:</p> <blockquote> <script src="https://gist.github.com/ayende/e45b5ddce0d41879912a7dbed1958345.js"></script> </blockquote> <p>This is the exact same data as before, but using ReadOnlySpan&lt;byte&gt; means that the JIT can package that inside the data section and treat it as a constant value.</p> <p>The code was heavily optimized, to the point where I noticed <a href="https://github.com/dotnet/runtime/issues/91824">a JIT bug</a> where these two versions of the code give different assembly output:</p> <blockquote> <script src="https://gist.github.com/ayende/ea4aefe4e0c956692558ad2a4c880d05.js"></script> </blockquote> <p>Here is what we get out:</p> <blockquote> <script src="https://gist.github.com/ayende/5fe1d8ae986dcfacf758bbaa301d1b27.js"></script> </blockquote> <p>This looks like an unintended consequence of Roslyn and the JIT each doing their (separate jobs), but not reaching the end goal. Constant folding looks like it is done mostly by Roslyn, but it does a scan like that from the left, so it wouldn&rsquo;t convert $A * 4 * 8 to $A * 32. That is because it stopped evaluating the constants when it found a variable. When we add parenthesis, we isolate the value and now understand that we can fold it.</p> <p>Speaking of assembly, here is the annotated assembly version of the code:</p> <blockquote> <script src="https://gist.github.com/ayende/7e144aacf9385ee7ec4ebd7d1428a5ac.js"></script> </blockquote> <p>And after all of this work, where are we standing?</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">RatioSD</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>FilterCmp</td> <td>23</td> <td style="text-align: right;">285.7 ns</td> <td style="text-align: right;">3.84 ns</td> <td style="text-align: right;">3.59 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>23</td> <td style="text-align: right;">272.6 ns</td> <td style="text-align: right;">3.98 ns</td> <td style="text-align: right;">3.53 ns</td> <td style="text-align: right;">0.95</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>FilterCmp_Unroll_8</td> <td>23</td> <td style="text-align: right;">261.4 ns</td> <td style="text-align: right;">1.27 ns</td> <td style="text-align: right;">1.18 ns</td> <td style="text-align: right;">0.91</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">672 B</td> </tr> <tr> <td>FilterCmp_Avx</td> <td>23</td> <td style="text-align: right;">261.6 ns</td> <td style="text-align: right;">1.37 ns</td> <td style="text-align: right;">1.28 ns</td> <td style="text-align: right;">0.92</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">521 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1047</td> <td style="text-align: right;">758.7 ns</td> <td style="text-align: right;">1.51 ns</td> <td style="text-align: right;">1.42 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>1047</td> <td style="text-align: right;">756.8 ns</td> <td style="text-align: right;">1.83 ns</td> <td style="text-align: right;">1.53 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>FilterCmp_Unroll_8</td> <td>1047</td> <td style="text-align: right;">640.4 ns</td> <td style="text-align: right;">1.94 ns</td> <td style="text-align: right;">1.82 ns</td> <td style="text-align: right;">0.84</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">672 B</td> </tr> <tr> <td>FilterCmp_Avx</td> <td>1047</td> <td style="text-align: right;">426.0 ns</td> <td style="text-align: right;">1.62 ns</td> <td style="text-align: right;">1.52 ns</td> <td style="text-align: right;">0.56</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">521 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1048599</td> <td style="text-align: right;">502,681.4 ns</td> <td style="text-align: right;">3,732.37 ns</td> <td style="text-align: right;">3,491.26 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>1048599</td> <td style="text-align: right;">499,472.7 ns</td> <td style="text-align: right;">6,082.44 ns</td> <td style="text-align: right;">5,689.52 ns</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>FilterCmp_Unroll_8</td> <td>1048599</td> <td style="text-align: right;">425,800.3 ns</td> <td style="text-align: right;">352.45 ns</td> <td style="text-align: right;">312.44 ns</td> <td style="text-align: right;">0.85</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">672 B</td> </tr> <tr> <td>FilterCmp_Avx</td> <td>1048599</td> <td style="text-align: right;">218,075.1 ns</td> <td style="text-align: right;">212.40 ns</td> <td style="text-align: right;">188.29 ns</td> <td style="text-align: right;">0.43</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">521 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>33554455</td> <td style="text-align: right;">29,820,978.8 ns</td> <td style="text-align: right;">73,461.68 ns</td> <td style="text-align: right;">61,343.83 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>33554455</td> <td style="text-align: right;">29,471,229.2 ns</td> <td style="text-align: right;">73,805.56 ns</td> <td style="text-align: right;">69,037.77 ns</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>FilterCmp_Unroll_8</td> <td>33554455</td> <td style="text-align: right;">29,234,413.8 ns</td> <td style="text-align: right;">67,597.45 ns</td> <td style="text-align: right;">63,230.70 ns</td> <td style="text-align: right;">0.98</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">672 B</td> </tr> <tr> <td>FilterCmp_Avx</td> <td>33554455</td> <td style="text-align: right;">28,498,115.4 ns</td> <td style="text-align: right;">71,661.94 ns</td> <td style="text-align: right;">67,032.62 ns</td> <td style="text-align: right;">0.96</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">521 B</td> </tr> </tbody> </table> <p>So it seems that the idea of using SIMD instruction has a lot of merit. Moving from the original code to the final version, we see that we can complete the same task in up to half the time.</p> <p>I&rsquo;m not quite sure why we aren&rsquo;t seeing the same sort of performance on the 32M, but I suspect that this is likely because we far exceed the CPU cache and we have to fetch it all from memory, so that is as fast as it <em>can</em> go.</p> <p>If you are interested in learning more, <a href="https://lemire.me/blog/2022/06/23/filtering-numbers-quickly-with-sve-on-amazon-graviton-3-processors/">Lemire solves the same problem in SVE (SIMD for ARM)</a> and <a href="https://quickwit.io/blog/simd-range">Paul has a similar approach in Rust</a>.</p> <p>If you can think of further optimizations, I would love to hear your ideas.</p>https://www.ayende.com/blog/200102-C/filtering-negative-numbers-fast-avx?Key=ecacbd27-fe9d-4a7c-8824-411229bbec91https://www.ayende.com/blog/200102-C/filtering-negative-numbers-fast-avx?Key=ecacbd27-fe9d-4a7c-8824-411229bbec91Wed, 13 Sep 2023 12:00:00 GMTFiltering negative numbers, fast: Unroll<p><a href="https://ayende.com/blog/200100-C/filtering-negative-numbers-fast-scalar?key=457d45e9014e4352ac3dab2aa2bd4e04">In the previous post,</a> we looked into what it would take to reduce the cost of filtering negative numbers. We got into the assembly and analyzed exactly what was going on. In terms of this directly, I don&rsquo;t think that even hand-optimized assembly would take us further. Let&rsquo;s see if there are other options that are available for us to get better speed.</p> <p>The first thing that pops to mind here is to do a loop unrolling. After all, we have a very tight loop, if we can do more work per loop iteration, we might get better performance, no? Here is my first version:</p> <blockquote> <script src="https://gist.github.com/ayende/ce56c4517491bc8c08961c9d147213f1.js"></script> </blockquote> <p>And here are the benchmark results:</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>FilterCmp</td> <td>23</td> <td style="text-align: right;">274.6 ns</td> <td style="text-align: right;">0.40 ns</td> <td style="text-align: right;">0.35 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>23</td> <td style="text-align: right;">257.5 ns</td> <td style="text-align: right;">0.94 ns</td> <td style="text-align: right;">0.83 ns</td> <td style="text-align: right;">0.94</td> <td style="text-align: right;">606 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1047</td> <td style="text-align: right;">748.1 ns</td> <td style="text-align: right;">2.91 ns</td> <td style="text-align: right;">2.58 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>1047</td> <td style="text-align: right;">702.5 ns</td> <td style="text-align: right;">5.23 ns</td> <td style="text-align: right;">4.89 ns</td> <td style="text-align: right;">0.94</td> <td style="text-align: right;">606 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1048599</td> <td style="text-align: right;">501,545.2 ns</td> <td style="text-align: right;">4,985.42 ns</td> <td style="text-align: right;">4,419.45 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>1048599</td> <td style="text-align: right;">446,311.1 ns</td> <td style="text-align: right;">3,131.42 ns</td> <td style="text-align: right;">2,929.14 ns</td> <td style="text-align: right;">0.89</td> <td style="text-align: right;">606 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>33554455</td> <td style="text-align: right;">29,637,052.2 ns</td> <td style="text-align: right;">184,796.17 ns</td> <td style="text-align: right;">163,817.00 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>33554455</td> <td style="text-align: right;">29,275,060.6 ns</td> <td style="text-align: right;">145,756.53 ns</td> <td style="text-align: right;">121,713.31 ns</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">606 B</td> </tr> </tbody> </table> <p>That is quite a jump, 6% &ndash; 11% savings is no joke. Let&rsquo;s look at what is actually going on at the assembly level and see if we can optimize this further.</p> <p>As expected, the code size is bigger, 264 bytes versus the 55 we previously got. But more importantly, we got the range check back, and a lot of them:</p> <blockquote> <script src="https://gist.github.com/ayende/d2265156014143ece0ac3901a430428e.js"></script> </blockquote> <p>The JIT isn&rsquo;t able to reason about our first for loop and see that all our accesses are within bounds, which leads to doing a lot of range checks, and likely slows us down. Even with that, we are still showing significant improvements here.</p> <p>Let&rsquo;s see what we can do with this:</p> <blockquote> <script src="https://gist.github.com/ayende/48719ff7e902113847be504a56473a06.js"></script> </blockquote> <p>With that, we expect to have no range checks and still be able to benefit from the unrolling.</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">RatioSD</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>FilterCmp</td> <td>23</td> <td style="text-align: right;">275.4 ns</td> <td style="text-align: right;">2.31 ns</td> <td style="text-align: right;">2.05 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>23</td> <td style="text-align: right;">253.6 ns</td> <td style="text-align: right;">2.59 ns</td> <td style="text-align: right;">2.42 ns</td> <td style="text-align: right;">0.92</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">563 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1047</td> <td style="text-align: right;">741.6 ns</td> <td style="text-align: right;">5.95 ns</td> <td style="text-align: right;">5.28 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>1047</td> <td style="text-align: right;">665.5 ns</td> <td style="text-align: right;">2.38 ns</td> <td style="text-align: right;">2.22 ns</td> <td style="text-align: right;">0.90</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">563 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1048599</td> <td style="text-align: right;">497,624.9 ns</td> <td style="text-align: right;">3,904.39 ns</td> <td style="text-align: right;">3,652.17 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>1048599</td> <td style="text-align: right;">444,489.0 ns</td> <td style="text-align: right;">2,524.45 ns</td> <td style="text-align: right;">2,361.38 ns</td> <td style="text-align: right;">0.89</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">563 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>33554455</td> <td style="text-align: right;">29,781,164.3 ns</td> <td style="text-align: right;">361,625.63 ns</td> <td style="text-align: right;">320,571.70 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>33554455</td> <td style="text-align: right;">29,954,093.9 ns</td> <td style="text-align: right;">588,614.32 ns</td> <td style="text-align: right;">916,401.59 ns</td> <td style="text-align: right;">1.01</td> <td style="text-align: right;">0.04</td> <td style="text-align: right;">563 B</td> </tr> </tbody> </table> <p>That helped, by quite a lot, it seems, for most cases, the 32M items case, however, was slightly slower, which is quite a surprise.</p> <p>Looking at the assembly, I can see that we still have branches, like so:</p> <blockquote> <script src="https://gist.github.com/ayende/5ffa2b6ad85b5039c9145996843e1608.js"></script> </blockquote> <p>And here is why this is the case:</p> <blockquote> <script src="https://gist.github.com/ayende/77b7f29167e7369f80454ee31a26def9.js"></script> </blockquote> <p>Now, can we do better here? It turns out that we can, by using a branchless version of the operation. Here is another way to write the same thing:</p> <blockquote> <script src="https://gist.github.com/ayende/f2547983bed2c383991d0cb468c53f65.js"></script> </blockquote> <p>What happens here is that we are unconditionally setting the value in the array, but only increment if the value is greater than or equal to zero. That saves us in branches and will likely result in less code. In fact, let&rsquo;s see what sort of assembly the JIT will output:</p> <blockquote> <script src="https://gist.github.com/ayende/205e96a945978d829b6abb936760edb1.js"></script> </blockquote> <p>What about the performance? I decided to pit the two versions (normal and branchless) head to head and see what this will give us:</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>FilterCmp_Unroll</td> <td>23</td> <td style="text-align: right;">276.3 ns</td> <td style="text-align: right;">4.13 ns</td> <td style="text-align: right;">3.86 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll_Branchleses</td> <td>23</td> <td style="text-align: right;">263.6 ns</td> <td style="text-align: right;">0.95 ns</td> <td style="text-align: right;">0.84 ns</td> <td style="text-align: right;">0.96</td> <td style="text-align: right;">547 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>1047</td> <td style="text-align: right;">743.7 ns</td> <td style="text-align: right;">9.41 ns</td> <td style="text-align: right;">8.80 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll_Branchleses</td> <td>1047</td> <td style="text-align: right;">733.3 ns</td> <td style="text-align: right;">3.54 ns</td> <td style="text-align: right;">3.31 ns</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">547 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>1048599</td> <td style="text-align: right;">502,631.1 ns</td> <td style="text-align: right;">3,641.47 ns</td> <td style="text-align: right;">3,406.23 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll_Branchleses</td> <td>1048599</td> <td style="text-align: right;">495,590.9 ns</td> <td style="text-align: right;">335.33 ns</td> <td style="text-align: right;">297.26 ns</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">547 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>33554455</td> <td style="text-align: right;">29,356,331.7 ns</td> <td style="text-align: right;">207,133.86 ns</td> <td style="text-align: right;">172,966.15 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_Unroll_Branchleses</td> <td>33554455</td> <td style="text-align: right;">29,709,835.1 ns</td> <td style="text-align: right;">86,129.58 ns</td> <td style="text-align: right;">71,922.10 ns</td> <td style="text-align: right;">1.01</td> <td style="text-align: right;">547 B</td> </tr> </tbody> </table> <p>Surprisingly enough, it looks like the branchless version is very slightly <em>slower</em>. That is a surprise, since I would expect reducing the branches to be more efficient.</p> <p>Looking at the assembly of those two, the branchless version is slightly bigger (10 bytes, not that meaningful). I think that the key here is that there is a 0.5% chance of actually hitting the branch, which is pretty low. That means that the branch predictor can likely do a really good job and we aren&rsquo;t going to see any big benefits from the branchless version.</p> <p>That said&hellip; what would happen if we tested that with 5% negatives? That difference in behavior may cause us to see a different result. I tried that, and the results were quite surprising. In the case of the 1K and 32M items, we see a slightl&nbsp;<em>cost </em>for the branchless version (additional 1% &ndash; 4%) while for the 1M entries there is an 18% <em>reduction</em> in latency for the branchless version.</p> <p>I ran the tests again with a 15% change of negative, to see what would happen. In that case, we get:</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">RatioSD</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>FilterCmp_Unroll</td> <td>23</td> <td style="text-align: right;">273.5 ns</td> <td style="text-align: right;">3.66 ns</td> <td style="text-align: right;">3.42 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">537 B</td> </tr> <tr> <td>FilterCmp_Unroll_Branchleses</td> <td>23</td> <td style="text-align: right;">280.2 ns</td> <td style="text-align: right;">4.85 ns</td> <td style="text-align: right;">4.30 ns</td> <td style="text-align: right;">1.03</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">547 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>1047</td> <td style="text-align: right;">1,675.7 ns</td> <td style="text-align: right;">29.55 ns</td> <td style="text-align: right;">27.64 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">537 B</td> </tr> <tr> <td>FilterCmp_Unroll_Branchleses</td> <td>1047</td> <td style="text-align: right;">1,676.3 ns</td> <td style="text-align: right;">16.97 ns</td> <td style="text-align: right;">14.17 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">547 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>1048599</td> <td style="text-align: right;">2,206,354.4 ns</td> <td style="text-align: right;">6,141.19 ns</td> <td style="text-align: right;">5,444.01 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">537 B</td> </tr> <tr> <td>FilterCmp_Unroll_Branchleses</td> <td>1048599</td> <td style="text-align: right;">1,688,677.3 ns</td> <td style="text-align: right;">11,584.00 ns</td> <td style="text-align: right;">10,835.68 ns</td> <td style="text-align: right;">0.77</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">547 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp_Unroll</td> <td>33554455</td> <td style="text-align: right;">205,320,736.1 ns</td> <td style="text-align: right;">2,757,108.01 ns</td> <td style="text-align: right;">2,152,568.58 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">537 B</td> </tr> <tr> <td>FilterCmp_Unroll_Branchleses</td> <td>33554455</td> <td style="text-align: right;">199,520,169.4 ns</td> <td style="text-align: right;">2,097,285.87 ns</td> <td style="text-align: right;">1,637,422.86 ns</td> <td style="text-align: right;">0.97</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">547 B</td> </tr> </tbody> </table> <p>As you can see, we have basically the same cost under 15% negatives for small values, a <em>big</em> improvement on the 1M scenario and not much improvement on the 32M scenario.</p> <p>All in all, that is very interesting information. Digging into the exact why and how of that means pulling a CPU instruction profiler and starting to look at where we have stalls, which is a bit further that I want to invest in this scenario.</p> <p>What if we&rsquo;ll try to rearrange the code a little bit. The code looks like this (load the value and AddToOutput() immediately):</p> <blockquote> <p>AddToOutput(ref itemsRef, Unsafe.Add(ref itemsRef, i + 0));</p> </blockquote> <p>What if we split it a little bit, so the code will look like this:</p> <blockquote> <script src="https://gist.github.com/ayende/eda4fa6716cdc2dafebc912b23fa56f7.js"></script> </blockquote> <p>The idea here is that we are trying to get the JIT / CPU to fetch the items before they are actually needed, so there would be more time for the memory to arrive.</p> <p>Remember that for the 1M scenario, we are dealing with 8MB of memory and for the 32M scenario, we have 256MB. Here is what happens when we look at the loop prolog, we can see that it is indeed first fetching all the items from memory, then doing the work:</p> <p><script src="https://gist.github.com/ayende/0260946c003df0ad02ae18cdde2b80dc.js"></script></p> <p>In terms of performance, that gives us a small win (1% &ndash; 2% range) for the 1M and 32M entries scenario.</p> <p>The one last thing that I wanted to test is if we&rsquo;ll unroll the loop even further, what would happen if we did 8 items per loop, instead of 4.</p> <blockquote> <script src="https://gist.github.com/ayende/3416319e177d00eec4f97bfeab6052c8.js"></script> </blockquote> <p>There is <em>some</em> improvement, (4% in the 1K scenario, 1% in the 32M scenario) but also slowdowns&nbsp; (2% in the 1M scenario).</p> <p>I think that this is probably roughly the end of the line as far as we can get for scalar code.</p> <p>We already made quite a few strides in trying to parallelize the work the CPU is doing by just laying out the code as we would like it to be. We tried to control the manner in which it touches memory and in general, those are pretty advanced techniques.</p> <p>To close this post, I would like to take a look at the gains we got. I&rsquo;m comparing the first version of the code, the last version we had on the previous post and the unrolled version for both branchy and branchless with 8 operations at once and memory prefetching.</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">RatioSD</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>FilterCmp</td> <td>23</td> <td style="text-align: right;">277.3 ns</td> <td style="text-align: right;">0.69 ns</td> <td style="text-align: right;">0.64 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>23</td> <td style="text-align: right;">270.7 ns</td> <td style="text-align: right;">0.42 ns</td> <td style="text-align: right;">0.38 ns</td> <td style="text-align: right;">0.98</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>FilterCmp_Unroll_8</td> <td>23</td> <td style="text-align: right;">257.6 ns</td> <td style="text-align: right;">1.45 ns</td> <td style="text-align: right;">1.21 ns</td> <td style="text-align: right;">0.93</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">672 B</td> </tr> <tr> <td>FilterCmp_Unroll_8_Branchless</td> <td>23</td> <td style="text-align: right;">259.9 ns</td> <td style="text-align: right;">1.96 ns</td> <td style="text-align: right;">1.84 ns</td> <td style="text-align: right;">0.94</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">682 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1047</td> <td style="text-align: right;">754.3 ns</td> <td style="text-align: right;">1.38 ns</td> <td style="text-align: right;">1.22 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>1047</td> <td style="text-align: right;">749.0 ns</td> <td style="text-align: right;">1.81 ns</td> <td style="text-align: right;">1.69 ns</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>FilterCmp_Unroll_8</td> <td>1047</td> <td style="text-align: right;">647.2 ns</td> <td style="text-align: right;">2.23 ns</td> <td style="text-align: right;">2.09 ns</td> <td style="text-align: right;">0.86</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">672 B</td> </tr> <tr> <td>FilterCmp_Unroll_8_Branchless</td> <td>1047</td> <td style="text-align: right;">721.2 ns</td> <td style="text-align: right;">1.23 ns</td> <td style="text-align: right;">1.09 ns</td> <td style="text-align: right;">0.96</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">682 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1048599</td> <td style="text-align: right;">499,675.6 ns</td> <td style="text-align: right;">2,639.97 ns</td> <td style="text-align: right;">2,469.43 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>1048599</td> <td style="text-align: right;">494,388.4 ns</td> <td style="text-align: right;">600.46 ns</td> <td style="text-align: right;">532.29 ns</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>FilterCmp_Unroll_8</td> <td>1048599</td> <td style="text-align: right;">426,940.7 ns</td> <td style="text-align: right;">1,858.57 ns</td> <td style="text-align: right;">1,551.99 ns</td> <td style="text-align: right;">0.85</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">672 B</td> </tr> <tr> <td>FilterCmp_Unroll_8_Branchless</td> <td>1048599</td> <td style="text-align: right;">483,940.8 ns</td> <td style="text-align: right;">517.14 ns</td> <td style="text-align: right;">458.43 ns</td> <td style="text-align: right;">0.97</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">682 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>33554455</td> <td style="text-align: right;">30,282,334.8 ns</td> <td style="text-align: right;">599,306.15 ns</td> <td style="text-align: right;">531,269.30 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>33554455</td> <td style="text-align: right;">29,410,612.5 ns</td> <td style="text-align: right;">29,583.56 ns</td> <td style="text-align: right;">24,703.61 ns</td> <td style="text-align: right;">0.97</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>FilterCmp_Unroll_8</td> <td>33554455</td> <td style="text-align: right;">29,102,708.3 ns</td> <td style="text-align: right;">42,824.78 ns</td> <td style="text-align: right;">40,058.32 ns</td> <td style="text-align: right;">0.96</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">672 B</td> </tr> <tr> <td>FilterCmp_Unroll_8_Branchless</td> <td>33554455</td> <td style="text-align: right;">29,761,841.1 ns</td> <td style="text-align: right;">48,108.03 ns</td> <td style="text-align: right;">42,646.51 ns</td> <td style="text-align: right;">0.98</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">682 B</td> </tr> </tbody> </table> <p>The unrolled 8 version is the winner by far, in this scenario (0.5% negatives). Since that is the scenario we have in the real code, that is what I&rsquo;m focusing on.</p> <p>Is there anything left to do here?</p> <p>My next step is to explore whether&nbsp;using vector instructions will be a good option for us.</p>https://www.ayende.com/blog/200101-C/filtering-negative-numbers-fast-unroll?Key=617ce0ba-58e7-4460-9268-6f584c90c4cdhttps://www.ayende.com/blog/200101-C/filtering-negative-numbers-fast-unroll?Key=617ce0ba-58e7-4460-9268-6f584c90c4cdTue, 12 Sep 2023 12:00:00 GMTFiltering negative numbers, fast: Scalar<p>While working deep on the guts of RavenDB, I found myself with a seemingly simple task. Given a list of longs, I need to filter out all negative numbers as quickly as possible.</p> <p>The actual scenario is that we run a speculative algorithm, given a potentially large list of items, we check if we can fulfill the request in an optimal fashion. However, if that isn&rsquo;t possible, we need to switch to a slower code path that does more work.</p> <p>Conceptually, this looks something like this:</p> <blockquote> <script src="https://gist.github.com/ayende/9e0e2ceaa091b0039eb70bf4d79c4f70.js"></script> </blockquote> <p>That is the setup for this story. The problem we have now is that we now need to <em>filter</em> the results we pass to the <em>RunManually</em>() method.</p> <p>There is a problem here, however. We marked the entries that we already used in the list by negating them. The issue is that <em>RunManually</em>() does not allow negative values, and its internal implementation is <em>not</em> friendly to ignoring those values.</p> <p>In other words, given a Span&lt;long&gt;, I need to write the code that would filter out all the negative numbers. Everything else about the list of numbers should remain the same (the order of elements, etc).</p> <p>From a coding perspective, this is as simple as:</p> <blockquote> <script src="https://gist.github.com/ayende/f24e8a3b056f72f68b7ee9af265c21ed.js"></script> </blockquote> <p>Please note, just <em>looking</em> at this code makes me cringe a lot. This does the work, but it has an absolutely horrible performance profile. It allocates multiple arrays, uses a lambda, etc.</p> <p>We don&rsquo;t actually <em>care</em> about the <em>entries </em>here, so we are free to modify them without allocating a new value. As such, let&rsquo;s see what kind of code we can write to do this work in an efficient manner. Here is what I came up with:</p> <blockquote> <script src="https://gist.github.com/ayende/b952dba7f28b538ec8ba4f0513d1bb70.js"></script> </blockquote> <p>The way this works is that we scan through the list, skipping writing the negative lists, so we effectively &ldquo;move down&rdquo; all the non-negative lists on top of the negative ones. This has a cost of O(N) and will modify the entire array, the final output is the number of valid items that we have there.</p> <p>In order to test the performance, I wrote the following harness:</p> <blockquote> <script src="https://gist.github.com/ayende/a0006f32dba0e5b26f15ac0021b14b94.js"></script> </blockquote> <p>We compare 1K, 1M and 32M elements arrays, each of which has about 0.5% negative, randomly spread across the range. Because we modify the values directly, we need to sprinkle the negatives across the array on each call. In this case, I&rsquo;m testing two options for this task, one that uses a direct comparison (shown above) and one that uses bitwise or, like so:</p> <blockquote> <script src="https://gist.github.com/ayende/809308fbbf24975197d4133a6771bbfe.js"></script> </blockquote> <p>I&rsquo;m testing the cost of sprinkling negatives as well, since that has to be done before each benchmark call (since we modify the array during the call, we need to &ldquo;reset&rdquo; its state for the next one).</p> <p>Given the two options, before we discuss the results, what would you expect to be the faster option? How would the size of the array matter here?</p> <p>I really like this example, because it is <em>simple, </em>there isn&rsquo;t any real complexity in what we are trying to do. And there is a very straightforward implementation that we can use as our baseline. That also means that I get to analyze what is going on at a very deep level. You might have noticed the disassembler attribute on the benchmark code, we are going to dive <em>deep</em> into that. For the same reason, we aren&rsquo;t using <em>exactly</em> 1K, 1M, or 32M arrays, but slightly higher than that, so we&rsquo;ll have to deal with remainders later on.</p> <p>Let&rsquo;s first look at what the JIT actually did here. Here is the annotated assembly for the FilterCmp function:</p> <blockquote> <script src="https://gist.github.com/ayende/3c08fffcd2baa9e91ae844f3114f93b5.js"></script> </blockquote> <p>For the FilterOr, the code is pretty much the same, except that the key part is:</p> <blockquote> <script src="https://gist.github.com/ayende/45dfa8af2e19a5c9b2927dca660af698.js"></script> </blockquote> <p>As you can see, the cmp option is slightly smaller, in terms of code size. In terms of performance, we have:</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> </tr> </thead> <tbody> <tr> <td>FilterOr</td> <td>1047</td> <td style="text-align: right;">745.6 ns</td> </tr> <tr> <td>FilterCmp</td> <td>1047</td> <td style="text-align: right;">745.8 ns</td> </tr> <tr> <td>&mdash;</td> <td>&ndash;</td> <td style="text-align: right;">&ndash;</td> </tr> <tr> <td>FilterOr</td> <td>1048599</td> <td style="text-align: right;">497,463.6 ns</td> </tr> <tr> <td>FilterCmp</td> <td>1048599</td> <td style="text-align: right;">498,784.8 ns</td> </tr> <tr> <td>&mdash;</td> <td>&ndash;</td> <td style="text-align: right;">&ndash;</td> </tr> <tr> <td>FilterOr</td> <td>33554455</td> <td style="text-align: right;">31,427,660.7 ns</td> </tr> <tr> <td>FilterCmp</td> <td>33554455</td> <td style="text-align: right;">30,024,102.9 ns</td> </tr> </tbody> </table> <p>The costs are very close to one another, with Or being very slightly faster on low numbers, and Cmp being slightly faster on the larger sizes. Note that the difference level between them is basically noise. They have the same performance.</p> <p>The question is, can we do better here?</p> <p>Looking at the assembly, there is an extra range check in the main loop that the JIT couldn&rsquo;t elide (the call to <em>items[output++])</em>. Can <em>we</em> do something about it, and would it make any difference in performance? Here is how I can remove the range check:</p> <blockquote> <script src="https://gist.github.com/ayende/451cf7bb3594ad7a860058fe51e18e9c.js"></script> </blockquote> <p>Here I&rsquo;m telling the JIT: &ldquo;I know what I&rsquo;m doing&rdquo;, and it shows.</p> <p>Let&rsquo;s look at the assembly changes between those two methods, first the prolog:</p> <blockquote> <script src="https://gist.github.com/ayende/c397a944faddbaeb0be20a3d8b8587b5.js"></script> </blockquote> <p>Here you can see what we are actually doing here. Note the last 4 instructions,&nbsp;we have a range check for the items, and then we have another check for the loop. The first will get you an exception, the second will just skip the loop. In both cases, we test the exact same thing. The JIT had a chance to actually optimize that, but didn&rsquo;t.</p> <p>Here is a funny scenario where <em>adding</em> code may reduce the amount of code generated. Let&rsquo;s do another version of this method:</p> <blockquote> <script src="https://gist.github.com/ayende/e15931fa4dbdc4258ea2676bb0fcc0e6.js"></script> </blockquote> <p>In this case, I added a check to handle the scenario of items being empty. What can the JIT do with this now? It turns out, quite a lot. We dropped 10 bytes from the method, which is a nice result of our diet.&nbsp; Here is the annotated version of the assembly:</p> <blockquote> <script src="https://gist.github.com/ayende/6c710aceedc59a8160cabe081c25ede5.js"></script> </blockquote> <p>A lot of the space savings in this case come from just not having to do a range check, but you&rsquo;ll note that we still do an extra check there (lines 12..13), even though we already checked that. I think that the JIT knows that the value is not zero at this point, but has to consider that the value may be negative.</p> <p>If we&rsquo;ll change the initial guard clause to: items.Length &lt;= 0, what do you think will happen? At this point, the JIT is smart enough to just elide everything, we are at 55 bytes of code and it is a super clean assembly (not a sentence I ever thought I would use). I&rsquo;ll spare you going through <em>more </em>assembly listing, but you can find <a href="https://gist.github.com/ayende/ef55effdec3a68e9e0c4ca5b1884bc9a">the output here</a>.</p> <p>And after all of that, where are we at?</p> <table class="table table-striped table-bordered"> <thead> <tr> <th>Method</th> <th>N</th> <th style="text-align: right;">Mean</th> <th style="text-align: right;">Error</th> <th style="text-align: right;">StdDev</th> <th style="text-align: right;">Ratio</th> <th style="text-align: right;">RatioSD</th> <th style="text-align: right;">Code Size</th> </tr> </thead> <tbody> <tr> <td>FilterCmp</td> <td>23</td> <td style="text-align: right;">274.5 ns</td> <td style="text-align: right;">1.91 ns</td> <td style="text-align: right;">1.70 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>23</td> <td style="text-align: right;">269.7 ns</td> <td style="text-align: right;">1.33 ns</td> <td style="text-align: right;">1.24 ns</td> <td style="text-align: right;">0.98</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1047</td> <td style="text-align: right;">744.5 ns</td> <td style="text-align: right;">4.88 ns</td> <td style="text-align: right;">4.33 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>1047</td> <td style="text-align: right;">745.8 ns</td> <td style="text-align: right;">3.44 ns</td> <td style="text-align: right;">3.22 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>1048599</td> <td style="text-align: right;">502,608.6 ns</td> <td style="text-align: right;">3,890.38 ns</td> <td style="text-align: right;">3,639.06 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>1048599</td> <td style="text-align: right;">490,669.1 ns</td> <td style="text-align: right;">1,793.52 ns</td> <td style="text-align: right;">1,589.91 ns</td> <td style="text-align: right;">0.98</td> <td style="text-align: right;">0.01</td> <td style="text-align: right;">397 B</td> </tr> <tr> <td>&nbsp;</td> <td>&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> <td style="text-align: right;">&nbsp;</td> </tr> <tr> <td>FilterCmp</td> <td>33554455</td> <td style="text-align: right;">30,495,286.6 ns</td> <td style="text-align: right;">602,907.86 ns</td> <td style="text-align: right;">717,718.92 ns</td> <td style="text-align: right;">1.00</td> <td style="text-align: right;">0.00</td> <td style="text-align: right;">411 B</td> </tr> <tr> <td>FilterCmp_NoRangeCheck</td> <td>33554455</td> <td style="text-align: right;">29,952,221.2 ns</td> <td style="text-align: right;">442,176.37 ns</td> <td style="text-align: right;">391,977.84 ns</td> <td style="text-align: right;">0.99</td> <td style="text-align: right;">0.02</td> <td style="text-align: right;">397 B</td> </tr> </tbody> </table> <p>There is a very slight benefit to the NoRangeCheck, but even when we talk about 32M items, we aren&rsquo;t talking about a lot of time.</p> <p>The question what can we do better here?</p>https://www.ayende.com/blog/200100-C/filtering-negative-numbers-fast-scalar?Key=457d45e9-014e-4352-ac3d-ab2aa2bd4e04https://www.ayende.com/blog/200100-C/filtering-negative-numbers-fast-scalar?Key=457d45e9-014e-4352-ac3d-ab2aa2bd4e04Mon, 11 Sep 2023 12:00:00 GMTOptimizing a three-way merge<p>Deep inside of the Corax indexing engine inside of RavenDB there is the notion of a posting list. A posting list is just an ordered set of entry ids that contains a particular term. During the indexing process, we need to add and remove items from that posting list. This ends up being something like this:</p> <blockquote> <script src="https://gist.github.com/ayende/4c697a93697ad11ed2db5ad99e3ce39c.js"></script> </blockquote> <p>For fun, go and ask ChatGPT to write you the code for this task.</p> <p>You can assume that there are no duplicates between the removals and additions, and that adding an existing item is a no-op (so just one value would be in the end result). Here is a quick solution for this task (not actually tested that much, mind, but sufficient to understand what I&rsquo;m trying to do):</p> <blockquote> <script src="https://gist.github.com/ayende/aa1c05e9e0ef222f149ba829f694429e.js"></script> </blockquote> <p>If you look at this code in terms of performance, you&rsquo;ll realize that this is quite expensive. In terms of complexity, this is actually pretty good, we iterate over the arrays just once, and the number of comparisons is also bounded to the lengths of the list.</p> <p>However, there is a big issue here, the number of branches that you have to deal with. Basically, every if and every for loop is going to add a tiny bit of latency to the system. This is because these are unpredictable branches, which are pretty nasty to deal with.</p> <p>It turns out that the values that we put in the posting list are actually always a multiple of 4, so the bottom 2 bits are always cleared. That means that we actually have a different way to deal with it. Here is the new logic:</p> <blockquote> <script src="https://gist.github.com/ayende/0ce8c4ab675be292d80975685b94e2fb.js"></script> </blockquote> <p>This code was written with an eye to being able to explain the algorithm, mind, not performance.</p> <p>The idea goes like this. We flag the removals with a bit, then concatenate all the arrays together, sort them, and then do a single scan over the whole thing, removing duplicates and removals.</p> <p>In the real code, we are using raw pointers, not a List, so there&nbsp;are no access checks, etc.</p> <p>From an algorithmic perspective, this code makes absolutely no sense at all. We concatenate all the values together, then sort them (O(NlogN) operation) then scan it again?!</p> <p align="left">How can that be faster than a single scan across all three arrays? The answer is simple, we have a really efficient sort primitive (<a href="https://github.com/damageboy/VxSort">vxsort</a>) that is able to <a href="https://twitter.com/damageboy/status/1533391869618491392?lang=en">sort things really fast</a> (GB/sec). There is a <a href="https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1">really good series of posts that explain how that is achieved</a>.</p> <p align="left">Since we consider sorting to be cheap, the rest of the work is just a single scan on the list, and there are <em>no</em> branches at all there. The code plays with the offset that we write into, figuring out whether&nbsp;we need to overwrite the current value (duplicate) or go back (removal), but in general it means that it can execute very quickly.</p> <p align="left">This approach also has another really important aspect. Take a look at the actual code that we have in production. This is from about an hour worth of profiling a busy indexing session:</p> <p align="left"><a href="https://ayende.com/blog/Images/Open-Live-Writer/Optimizing-a-three-way-merge_11E41/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Optimizing-a-three-way-merge_11E41/image_thumb.png" alt="image" width="986" height="112" border="0" /></a></p> <p align="left">And the more common code path:</p> <p align="left"><a href="https://ayende.com/blog/Images/Open-Live-Writer/Optimizing-a-three-way-merge_11E41/image_4.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Optimizing-a-three-way-merge_11E41/image_thumb_1.png" alt="image" width="1143" height="232" border="0" /></a></p> <p align="left">In both of them, you&rsquo;ll notice something really important. There isn&rsquo;t a call to sorting <em>at all</em> in here. In fact, when I search for the relevant function, I find:</p> <p align="left"><a href="https://ayende.com/blog/Images/Open-Live-Writer/Optimizing-a-three-way-merge_11E41/image_6.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Optimizing-a-three-way-merge_11E41/image_thumb_2.png" alt="image" width="997" height="132" border="0" /></a></p> <p align="left">That is 25 ms out of over an hour.</p> <p align="left">How can this be? As efficient as the sorting can be, we are supposed to be calling it a <em>lot</em>.</p> <p align="left">Well, consider one scenario, what happens if:</p> <ul> <li> <div align="left">There are no removals</div> </li> <li> <div align="left">All additions happen after the last existing item in the list</div> </li> </ul> <p align="left">In this case, I don&rsquo;t <em>need</em> to do anything beyond concatenate the lists. I can skip the entire process entirely, just copy the existing and additions to the output and call it a day.</p> <p align="left">Even when I do have a lot of removals and complicated merge processes, the code structure means that the CPU can get through this code very quickly. This isn&rsquo;t super friendly for humans to read, but for the CPU, this is chump change.</p>https://www.ayende.com/blog/200065-B/optimizing-a-three-way-merge?Key=67d6f65d-63ba-4fb7-9d31-dfc49ae5aa1dhttps://www.ayende.com/blog/200065-B/optimizing-a-three-way-merge?Key=67d6f65d-63ba-4fb7-9d31-dfc49ae5aa1dMon, 04 Sep 2023 12:00:00 GMT