Ayende @ Rahienhttps://www.ayende.com/blog/Ayende @ RahienCopyright (C) Ayende Rahien 2004 - 2021 (c) 202460Fun with bugs: Advanced Dictionary API<p style="text-align:left;">In RavenDB, we <em>really</em> care about performance. That means that our typical code does <em>not</em> follow idiomatic C# code. Instead, we make use of everything that the framework and the language give us to eke out that additional push for performance. Recently we ran into a bug that was quite puzzling. Here is a simple reproduction of the problem:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-csharp'><code class='line-numbers language-csharp'><span class="token keyword">using</span> <span class="token namespace">System<span class="token punctuation">.</span>Runtime<span class="token punctuation">.</span>InteropServices</span><span class="token punctuation">;</span>
<span class="token class-name"><span class="token keyword">var</span></span> counts <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">Dictionary<span class="token punctuation"><</span><span class="token keyword">int</span><span class="token punctuation">,</span> <span class="token keyword">int</span><span class="token punctuation">></span></span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token class-name"><span class="token keyword">var</span></span> totalKey <span class="token operator">=</span> <span class="token number">10_000</span><span class="token punctuation">;</span>
<span class="token keyword">ref</span> <span class="token class-name"><span class="token keyword">var</span></span> total <span class="token operator">=</span> <span class="token keyword">ref</span> CollectionsMarshal<span class="token punctuation">.</span><span class="token function">GetValueRefOrAddDefault</span><span class="token punctuation">(</span>
counts<span class="token punctuation">,</span> totalKey<span class="token punctuation">,</span> <span class="token keyword">out</span> _<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token class-name"><span class="token keyword">int</span></span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator"><</span> <span class="token number">4</span><span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token class-name"><span class="token keyword">var</span></span> key <span class="token operator">=</span> i <span class="token operator">%</span> <span class="token number">32</span><span class="token punctuation">;</span>
<span class="token keyword">ref</span> <span class="token class-name"><span class="token keyword">var</span></span> count <span class="token operator">=</span> <span class="token keyword">ref</span> CollectionsMarshal<span class="token punctuation">.</span><span class="token function">GetValueRefOrAddDefault</span><span class="token punctuation">(</span>
counts<span class="token punctuation">,</span> key<span class="token punctuation">,</span> <span class="token keyword">out</span> _<span class="token punctuation">)</span><span class="token punctuation">;</span>
count<span class="token operator">++</span><span class="token punctuation">;</span>
total<span class="token operator">++</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
Console<span class="token punctuation">.</span><span class="token function">WriteLine</span><span class="token punctuation">(</span>counts<span class="token punctuation">[</span>totalKey<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">What would you <em>expect</em> this code to output? We are using two important features of C# here:</p><ul><li>Value types (in this case, an int, but the real scenario was with a struct)</li><li>CollectionMarshal.GetValueRefOrAddDefault()</li></ul><p style="text-align:left;">The latter method is a way to avoid performing two lookups in the dictionary to get the value if it exists and then add or modify it. </p><p style="text-align:left;">If you run the code above, it will output the number 2. </p><p style="text-align:left;">That is <em>not</em> expected, but when I sat down and thought about it, it made sense.</p><p style="text-align:left;">We are keeping track of the reference to a value in the dictionary, <em>and we are mutating</em> the dictionary.</p><p style="text-align:left;">The documentation for the method <span style="text-decoration:underline;"><a style="color:inherit;" href="https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.collectionsmarshal.getvaluereforadddefault?view=net-8.0#system-runtime-interopservices-collectionsmarshal-getvaluereforadddefault-2(system-collections-generic-dictionary((-0-1))-0-system-boolean@)">very clearly explains that this is a Bad Idea.</a></span> It is an easy mistake to make, but still a mistake. The challenge here is figuring out <em>why</em> this is happening. Can you give it a minute of thought and see if you can figure it out?</p><p style="text-align:left;">A dictionary is basically an array that you access using an index (computed via a hash function), that is all. So if we strip everything away, the code above can be seen as:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'><span class="token keyword">var</span> buffer <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">int</span><span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
ref <span class="token keyword">var</span> total <span class="token operator">=</span> ref <span class="token keyword">var</span> buffer<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">We simply have a reference to the first element in the array, that’s what this does behind the scenes. And when we insert items into the dictionary, we may need to allocate a bigger backing array for it, so this becomes:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-javascript'><code class='line-numbers language-javascript'><span class="token keyword">var</span> buffer <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">int</span><span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
ref <span class="token keyword">var</span> total <span class="token operator">=</span> ref <span class="token keyword">var</span> buffer<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
<span class="token keyword">var</span> newBuffer <span class="token operator">=</span> <span class="token keyword">new</span> <span class="token class-name">int</span><span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
buffer<span class="token punctuation">.</span><span class="token function">CopyTo</span><span class="token punctuation">(</span>newBuffer<span class="token punctuation">)</span><span class="token punctuation">;</span>
buffer <span class="token operator">=</span> newBuffer<span class="token punctuation">;</span>
total <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span>
<span class="token keyword">var</span> newTotal <span class="token operator">=</span> buffer<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span></code></pre><hr/></p><p style="text-align:left;">In other words, the <em>total</em> variable is pointing to the first element in the <em>two-element array</em>, but we allocated a <em>new</em> array (and copied all the values). That is the reason why the code above gives the wrong result. Makes perfect sense, and yet, was quite puzzling to figure out.</p>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201761-C/fun-with-bugs-advanced-dictionary-api?Key=8f507241-34b2-4e31-b3f1-a31f6adbcc28https://www.ayende.com/blog/201761-C/fun-with-bugs-advanced-dictionary-api?Key=8f507241-34b2-4e31-b3f1-a31f6adbcc28Fri, 15 Nov 2024 10:00:00 GMTDebugging the Linux kernel using awesome psychic powers<p style="text-align:left;">I wanted to test low-level file-system behavior in preparation for a new feature for RavenDB. Specifically, I wanted to look into hole punching - where you can give low-level instructions to the file system to indicate that you’re giving up disk space, but without actually reducing the size of the file.</p><p style="text-align:left;">This can be very helpful in space management. If I have a section in the file that is full of zeroes, I can just tell the file system that, and it can skip storing that range of zeros on the disk entirely. This is an advanced feature for file systems. I haven't actually used that in the past, so I needed to gain some expertise with it.</p><p style="text-align:left;">I wrote the following code for Linux:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-bash'><code class='line-numbers language-bash'>int fd <span class="token operator">=</span> open<span class="token punctuation">(</span><span class="token string">"test.file"</span>, O_CREAT <span class="token operator">|</span> O_WRONLY, 0644<span class="token punctuation">)</span><span class="token punctuation">;</span>
lseek<span class="token punctuation">(</span>fd, <span class="token number">128</span> * <span class="token number">1024</span> * <span class="token number">1024</span> - <span class="token number">1</span>, SEEK_SET<span class="token punctuation">)</span><span class="token punctuation">;</span> // 128MB <span class="token function">file</span>
write<span class="token punctuation">(</span>fd, <span class="token string">""</span>, <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
fallocate<span class="token punctuation">(</span>fd, // <span class="token number">32</span> MB hole from the 16MB<span class="token punctuation">..</span>48MB range
FALLOC_FL_PUNCH_HOLE <span class="token operator">|</span> FALLOC_FL_KEEP_SIZE,
<span class="token number">16</span> * <span class="token number">1024</span> * <span class="token number">1024</span>, <span class="token number">32</span> * <span class="token number">1024</span> * <span class="token number">1024</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
close<span class="token punctuation">(</span>fd<span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">The code for <span style="text-decoration:underline;"><a style="color:inherit;" href="https://gist.github.com/ayende/68dcb097c0500785b18a14ab9afa69a1">Windows is here</a></span> if you want to see it. I tested the feature on both Windows & Linux, and it worked. I could see that while the file size was 128MB, I was able to give back 16MB to the operating system without any issues. I turned the code above into a test and called it a day. </p><p style="text-align:left;">And then the CI build broke. But that wasn’t possible since I <em>tested</em> that. And there had been CI runs that did work on Linux. So I did the obvious thing and started running the code above in a loop. </p><p style="text-align:left;">I found something <em>really</em> annoying. This code worked, <em>sometimes</em>. And sometimes it just didn’t. </p><p style="text-align:left;">In order to get the size, I need to run this code:</p><p style="text-align:left;"><hr/><pre class='line-numbers language-bash'><code class='line-numbers language-bash'>struct <span class="token function">stat</span> st<span class="token punctuation">;</span>
fstat<span class="token punctuation">(</span>fd, <span class="token operator">&</span>st<span class="token punctuation">)</span><span class="token punctuation">;</span>
printf<span class="token punctuation">(</span><span class="token string">"Total size: %lld bytes<span class="token entity" title="\n">\n</span>"</span>,
<span class="token punctuation">(</span>long long<span class="token punctuation">)</span>st.st_size<span class="token punctuation">)</span><span class="token punctuation">;</span>
printf<span class="token punctuation">(</span><span class="token string">"Actual size on disk: %lld bytes<span class="token entity" title="\n">\n</span>"</span>,
<span class="token punctuation">(</span>long long<span class="token punctuation">)</span>st.st_blocks * <span class="token number">512</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre><hr/></p><p style="text-align:left;">I’m used to weirdness from file systems at this point, but this is really simple. All the data is 4KB aligned (in fact, all the data is 16MB aligned). There shouldn’t be any weirdness here.</p><p style="text-align:left;">As you can see, I’m already working at the level of Linux syscalls, but I used strace to check if there is something funky going on. Nope, there was a 1:1 mapping between the code and the actual system calls issued.</p><p style="text-align:left;">That means that I have to debug deeper if I want to understand what is going on. This involves debugging the Linux Kernel, which is a <span style="text-decoration:underline;"><a style="color:inherit;" href="https://github.com/torvalds/linux/blob/master/fs/ext4/inode.c#L3946">Big Task</a></span>. Take a look at the code in the relevant link. I’m fairly certain that <span style="text-decoration:underline;"><a style="color:inherit;" href="https://github.com/torvalds/linux/blob/master/fs/ext4/inode.c#L4024-L4025">the issue is in those lines</a></span>. The problem is that this cannot be, since both offset & length are aligned to 4KB.</p><p style="text-align:left;">I got out my crystal ball and thinking hat and meditated on this. If you’ll note, the difference between the expected and actual values is exactly 4KB. It almost looks like the file <em>itself</em> is not aligned on a 4KB boundary, but the holes must be. </p><p style="text-align:left;">Given that I just want to release this space to the operating system and 4KB is really small, I can adjust that as a fudge factor for the test. I would <em>love</em> to understand exactly what is going on, but so far the “file itself is not 4KB aligned, but holes are” is a good working hypothesis (even though my gut tells me it might be wrong). </p><blockquote><p style="text-align:left;">If you know the actual reason for this, I would love to hear it. </p></blockquote><p style="text-align:left;">And don't get me started on what happened with sparse files in macOS. There, the OS will <em>randomly</em> decide to mark some parts of your file as holes, making any deterministic testing really hard. </p>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201665-A/debugging-the-linux-kernel-using-awesome-psychic-powers?Key=9089428b-0483-4aad-9653-44f96a66cde9https://www.ayende.com/blog/201665-A/debugging-the-linux-kernel-using-awesome-psychic-powers?Key=9089428b-0483-4aad-9653-44f96a66cde9Tue, 17 Sep 2024 12:00:00 GMTCryptographically impossible bug hunt<p style="text-align:left;">I’m currently deep in the process of modifying the internals of Voron, trying to eke out more performance out of the system. I’m making great progress, but I’m also touching parts of the code that haven’t even been looked at for a <em>long </em>time. </p><p style="text-align:left;">In other words, I’m mucking about with the most stable and most critical portions of the storage engine. It’s a lot of fun, and I’m actually seeing some great results, but it is also nerve-wracking. </p><p style="text-align:left;">We have enough tests that I’ve great confidence I would catch any actual stability issues, but the drive back toward a fully green build has been a slog.</p><p style="text-align:left;">The process is straightforward:</p><ul><li>Change something.</li><li>Verify that it works better than before.</li><li>Run the entire test suite (upward of 30K tests) to see if there are any breaks.</li></ul><p style="text-align:left;">The last part can be frustrating because it takes a <em>while</em> to run this sort of test suite. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any access to that memory would result in an access violation. </p><p style="text-align:left;">I fixed those in the code, of course, but we have a <em>lot</em> of tests, including some tests that intentionally corrupt data to verify that RavenDB behaves properly under those conditions. </p><p style="text-align:left;">One such test writes garbage to the RavenDB file, using read-write memory. The idea is to verify that the checksum matches on read and abort early. Because that test directly modifies what is now read-only memory, it generates a crash due to a memory access violation. That doesn’t just result in a test failure, it takes the whole process down.</p><p style="text-align:left;">I’ve gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed. <img src=""/></p><p style="text-align:left;">And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about exactly what was going on. When trying to read data from disk, it would follow up a pointer to an invalid location. That is <em>not</em> supposed to happen, obviously. </p><p style="text-align:left;">Looks like I have a little data corruption issue on my hands. The problem is that this shouldn’t be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we are using a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been <em>cryptographically</em> verified to be valid.</p><p style="text-align:left;">My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.</p><p style="text-align:left;">RavenDB trusts that since the data was properly decrypted, it is valid and tries to use it. Because the data is garbage, that leads to… excitement. Once I realized what was going on, I was really confused. I’m pretty sure that I didn’t break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) to garbage. </p><p style="text-align:left;">It was also quite frustrating to track because any small-stage test that I wrote would return the expected results. It was only when I ran the entire system and <em>stressed</em> it that I got this weird scenario.</p><p style="text-align:left;">I started practicing for my Fields medal acceptance speech while digging deeper. Something here <em>had</em> to be wrong. It took me a while to figure out what was going on, but eventually, I tracked it down to registering to the TransactionCommit event when we open a new file. </p><p style="text-align:left;">The idea is that when we commit the transaction, we’ll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally. </p><p style="text-align:left;">As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data <em>twice</em> and then write it to the file. Upon decryption, we would successfully decrypt the data but would end up with still encrypted data (looking like random fluff). </p><p style="text-align:left;">The fix was simply moving the event registration to the transaction level, not the file level. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.</p>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201441-B/cryptographically-impossible-bug-hunt?Key=b0e0416d-2ab6-4e40-9e90-74a7c913b24ehttps://www.ayende.com/blog/201441-B/cryptographically-impossible-bug-hunt?Key=b0e0416d-2ab6-4e40-9e90-74a7c913b24eWed, 24 Jul 2024 12:00:00 GMTPermission to pay your money, please!<p style="text-align: left;">I’m trying to pay a SaaS bill online, and I run into the following issue. I have insufficient permissions to pay the invoice on the account. No insufficient funds, which is something that you’ll routinely run into when dealing with payment processing. But insufficient permissions!</p>
<p style="text-align: left;"> </p>
<p style="text-align: left;">Is… paying something an act that requires permissions? That something that happens? Can I get <em>more</em> vulnerabilities like that? When I get people to drive-by pay for my bills?</p>
<p style="text-align: left;">I can’t think of a scenario where you are prevented from <em>paying</em> to the provider. That is… weird.</p>
<p style="text-align: left;">And now I’m in this “nice” position where I have to chase after the provider to <em>give them money</em>, because otherwise they’ll close the account.</p>https://www.ayende.com/blog/200865-A/permission-to-pay-your-money-please?Key=394e981a-ce0f-4dd0-8773-8b34736558e0https://www.ayende.com/blog/200865-A/permission-to-pay-your-money-please?Key=394e981a-ce0f-4dd0-8773-8b34736558e0Mon, 25 Mar 2024 12:00:00 GMTChallenge: Spot the bug<p>The following bug cost me a bunch of time, can you see what I’m doing wrong?</p>
<blockquote>
<script src="https://gist.github.com/ayende/cd3d6a7efae64b9725a3a9083625d4c1.js"></script>
</blockquote>
<p>For fun, it’s so nasty because usually, it will accidentally work.</p>https://www.ayende.com/blog/200161-C/challenge-spot-the-bug?Key=277eeb3d-ea55-4969-a179-95bd86ebece9https://www.ayende.com/blog/200161-C/challenge-spot-the-bug?Key=277eeb3d-ea55-4969-a179-95bd86ebece9Tue, 19 Sep 2023 12:00:00 GMTSolving heap corruption errors in managed applications<p>RavenDB is a .NET application, written in C#. It also has a <em>non trivial</em> amount of unmanaged memory usage. We absolutely need that to get the proper level of performance that we require.</p>
<p>With managing memory manually, there is also the possibility that we’ll mess it up. We run into one such case, when running our full test suite (over 10,000 tests) we would get random crashes due to heap corruption. Those issues are <em>nasty</em>, because there is a big separation between the root cause and the actual problem manifesting.</p>
<p>I recently learned that you can use <a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/gflags-and-pageheap">the gflags tool on .NET executables</a>. We were able to narrow the problem to a single scenario, but we still had no idea where the problem really occurred. So I installed the <a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools">Debugging Tools for Windows</a> and then executed:</p>
<blockquote>
<pre> &"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\gflags.exe" /p /enable C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe</pre>
</blockquote>
<p>What this does is enable a special debug heap at the executable level, which applies to <em>all</em> operations (managed and native memory alike).</p>
<p>With that enabled, I ran the scenario in question:</p>
<blockquote>
<p>PS C:\Work\ravendb-6.0\test\Tryouts> C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe<br /> 42896<br /> Starting to run 0<br /> Max number of concurrent tests is: 16<br /> Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32.<br /> To attach debugger to test process (x64), use proc-id: 42896. Url <a href="http://127.0.0.1:51595">http://127.0.0.1:51595</a><br /> Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32. License limits: A: 3/32. Total utilized cores: 3. Max licensed cores: 1024<br /> <a href="http://127.0.0.1:51595/studio/index.html#databases/documents?&database=Should_correctly_reduce_after_updating_all_documents_1&withStop=true&disableAnalytics=true">http://127.0.0.1:51595/studio/index.html#databases/documents?&database=Should_correctly_reduce_after_updating_all_documents_1&withStop=true&disableAnalytics=true</a><br /> Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.<br /> at Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Encode(System.ReadOnlySpan`1<Byte>, System.Span`1<Byte>)<br /> at Sparrow.Server.Compression.HopeEncoder`1[[Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], Sparrow.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]].Encode(System.ReadOnlySpan`1<Byte> ByRef, System.Span`1<Byte> ByRef)<br /> at Voron.Data.CompactTrees.PersistentDictionary.ReplaceIfBetter[[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593],[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]](Voron.Impl.LowLevelTransaction, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Voron.Data.CompactTrees.PersistentDictionary)<br /> at Raven.Server.Documents.Indexes.Persistence.Corax.CoraxIndexPersistence.Initialize(Voron.StorageEnvironment)</p>
</blockquote>
<p>That pinpointed things so I was able to know exactly where we are messing up.</p>
<p>I was also able to reproduce the behavior on the debugger:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Solving-heap-corruption-errors-in-manage_B4AE/image%20(3)_2.png"><img style="border: 0px currentcolor; display: inline; background-image: none;" title="image (3)" src="https://ayende.com/blog/Images/Open-Live-Writer/Solving-heap-corruption-errors-in-manage_B4AE/image%20(3)_thumb.png" alt="image (3)" width="1394" height="306" border="0" /></a></p>
<p>This saved me <em>hours </em>or <em>days</em> of trying to figure out where the problem actually is.</p>https://www.ayende.com/blog/199682-B/solving-heap-corruption-errors-in-managed-applications?Key=9d558341-07d7-46a6-be46-f94cb752404ehttps://www.ayende.com/blog/199682-B/solving-heap-corruption-errors-in-managed-applications?Key=9d558341-07d7-46a6-be46-f94cb752404eWed, 05 Jul 2023 12:00:00 GMTBug chasing, narrowing down the scope<p>I just completed a major refactoring of a piece of code inside RavenDB that is responsible for how we manage sorted queries. The first two tiers of tests all passed, which is great. Now was the time to test how this change performed. I threw 50M records into RavenDB and indexed them. I did… not <em>like</em> the numbers I got back. It makes sense, since I was heavily refactoring to get a particular structure, I could think of a few ways to improve performance, but I like doing this based on profiler output.</p>
<p>When running the same scenario under the profiler, the process crashed. That is… quite annoying, as you can imagine. In fact, I discovered a really startling issue.</p>
<p>If I index the data and query on it, I get the results I expect. If I restart the process and run the <em>same query, </em>I get an ExecutionEngineException. Trying to debug those is a PITA. In this case, I’m 100% at fault, we are doing a <em>lot</em> of unsafe things to get better performance, and it appears that I messed up something along the way. But my only reproduction is a 50M records dataset. To give some context, this means 51GB of documents to be indexed and 18 GB of indexing. Indexing this in release mode takes about 20 minutes. In debug mode, it takes a <em>lot</em> longer.</p>
<p>Trying to find <em>an</em> error there, especially one that can only happen after you restart the process is going to be a challenging task. But this isn’t my first rodeo. Part of good system design is knowing how to address just these sorts of issues.</p>
<p>The indexing process inside RavenDB is single-threaded per index. That means that we can rule out a huge chunk of issues around race conditions. It also means that we can play certain tricks. Allow me to present you with the nicest tool for debugging that you can imagine: repeatable traces.</p>
<p>Here is what this looks like in terms of code:</p>
<blockquote>
<script src="https://gist.github.com/ayende/e248344c5b038f2cc2bd5b654609a0cd.js"></script>
</blockquote>
<p>In this case, you can see that this is a development only feature, so it is really bare-bones one. What it does is capture the indexing and commit operations on the system and write them to a file. I have another piece of similarly trivial code that reads and applies them, as shown below. Don’t bother to dig into that, the code itself isn’t really that interesting. What is important is that I have captured the behavior of the system and can now replay it at will.</p>
<blockquote>
<script src="https://gist.github.com/ayende/447c66ac5b724a5983f51b21e2392b53.js"></script>
</blockquote>
<p>The code itself isn’t much, but it does the job. What is more important, note that we have calls to StopDatabase() and StartDatabase(), I was able to reproduce the crash using this code.</p>
<p>That was a massive win, since it dropped my search area from 50M documents to <em>merely</em> 1.2 million.</p>
<p>The key aspect of this is that I now have a way to play around with things. In particular, instead of using the commit points in the trace, I can force a commit (and start / stop the database) every 10,000 items (by calling FlushIndexAndRenewWriteTransaction). When using that, I can reproduce this <em>far </em>faster. Here is the output when I run this in release mode:</p>
<blockquote>
<pre>1 With 0
2 With 10000
3 With 10000
4 With 10000
5 With 10000
6 With 10000
7 With 10000
8 With 10000
9 With 10000
10 With 10000
11 With 10000
Fatal error. Internal CLR error. (0x80131506)
</pre>
</blockquote>
<p>So now I dropped the search area to 120,000 items, which is pretty awesome. Even more important, when I run this in debug mode, I get this:</p>
<blockquote>
<pre>1 With 0
2 With 10000
Process terminated. Assertion failed.
at Voron.Data.Containers.Container.Get(Low...</pre>
</blockquote>
<p>So now I have a repro in 30,000 items, what is even better, a debug assertion was fired, so I have a really good lead into what is going on.</p>
<p>The key challenge in this bug is that it is probably triggered as a result of a commit and an index of the <em>next</em> batch. There is a bunch of work that we do around batch optimizations that likely cause this sort of behavior. By being able to capture the input to the process and play with the batch size, we were able to reduce the amount of work required to generate a reproduction from 50M records to 30,000 and have a lead into what is going on.</p>
<p>With that, I can now start applying more techniques to narrow down what is going on. But by far the most important aspect as far as I’m concerned is the feedback cycle. I can now hit F5 to run the code and encounter the problem in a few seconds.</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_thumb.png" alt="image" width="697" height="86" border="0" /></a></p>
</blockquote>
<p>It looks like we hit a debug assertion because we keep a reference to an item that was already freed. That is really interesting, and now I can find out <em>which</em> item and then figure out why this is the case. And at each point, I can simply go one step back in the investigation, and reproduce the state, I have to hit F5 and wait a bit. This means that I can be far more liberal in how I figure out this bug.</p>
<p>This is triggered by a query on the indexed data, and if I follow up the stack, I have:</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_4.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_thumb_1.png" alt="image" width="704" height="201" border="0" /></a></p>
</blockquote>
<p>This is really interesting, I wonder… what happens if I query <em>before</em> I restart the database? With this structure, this is easy to do.</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_6.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_thumb_2.png" alt="image" width="724" height="197" border="0" /></a></p>
</blockquote>
<p>This is actually a <em>big</em> relief. I had no idea why restarting the database would cause us to expose this bug.</p>
<p>Another thing to note is that when I ran into the problem, I reproduced this on a query that sorted on a single field. In the test code, I’m testing on <em>all</em> fields, so that may be an asset in exposing this faster.</p>
<p>Right now the problem reproduces on the id field, which is unique. That helps, because it removes a large swath of code that deals with multiple terms for an entry. The current stage is that I can now reproduce this issue <em>without</em> running the queries, and I know exactly where it goes wrong.</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_10.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_thumb_4.png" alt="image" width="894" height="80" border="0" /></a></p>
</blockquote>
<p>And I can put a breakpoint on the exact location where this entry is created:</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_12.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_thumb_5.png" alt="image" width="429" height="123" border="0" /></a></p>
</blockquote>
<p>By the way, note that I’m modifying the code instead of using a conditional breakpoint. This is because of the performance difference. For a conditional breakpoint, the debugger has to stop execution, evaluate the condition and decide what to do. If this is run a <em>lot</em>, it can have a <em>huge</em> impact on performance. Easier to modify the code. The fact that I can do that and hit F5 and get to the same state allows me to have a lot more freedom in the ergonomics of how I work.</p>
<p>This allows me to discover that the entry in question was created during the <em>second</em> transaction. But the failure happens during the <em>third</em>, which is really interesting. More to the point, it means that I can now do this:</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_14.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_thumb_6.png" alt="image" width="368" height="145" border="0" /></a></p>
</blockquote>
<p>With the idea that this will trigger the assert on the exact entry that cause the problem. This is a good idea, and I wish that it worked, but we are actually doing a non-trivial amount of work during the commit process, so now we have a negative feedback and another clue. This is happening in the commit phase of the indexing process. It’s not a big loss, I can do the same in the commit process as well. I have done just that and now I know that I have a problem when indexing the term: “tweets/1212163952102137856”. Which leads to this code:</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_16.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_thumb_7.png" alt="image" width="855" height="100" border="0" /></a></p>
</blockquote>
<p>And at this point, I can now single step through this and figure out what is going on, I hope.</p>
<p>When working on complex data structures, one of the things that you need to do is to allow to visualize them. Being able to manually inspect the internal structure of your data structures can save you a <em>lot</em> of debugging. As I mentioned, this isn’t my first rodeo. So when I narrowed it down to a specific location, I started looking into exactly what is going on.</p>
<p>Beforehand, I need to explain a couple of terms (pun intended):</p>
<ul>
<li>tweets/1212163952102137856 – this is the entry that <em>triggers</em> the error.</li>
<li>tweets/1212163846623727616 – this is the term that should be returned for 1679560</li>
</ul>
<p>Here is what the structure looks like at the time of the insert:</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_18.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-narrowing-down-the-scope_984B/image_thumb_8.png" alt="image" width="684" height="270" border="0" /></a></p>
</blockquote>
<p>You can notice that the value here for the last page is the same as the one that we are checking for 1679560.</p>
<p>To explain what is going on will take us down a pretty complex path that you probably don’t care about, but the situation is that we are keeping track of the id in two locations. Making sure to add and remove it in both locations as appropriate. <em>However</em>, at certain points, we may decide to shuffle things around inside the tree, and we didn’t sync that up properly with the rest of the system, leading to a dangling reference.</p>
<p>Now that I know what is going on, I can figure out how to fix it. But the story of this post was mostly about how I figured it out, not the bug itself.</p>
<p>The key aspect was to get to the point where I can reproduce this easily, so I can repeat it as many times that is needed to slowly inch closer to the solution.</p>https://www.ayende.com/blog/199457-B/bug-chasing-narrowing-down-the-scope?Key=ae76509f-0c74-4d15-a2e5-da5a93a273f4https://www.ayende.com/blog/199457-B/bug-chasing-narrowing-down-the-scope?Key=ae76509f-0c74-4d15-a2e5-da5a93a273f4Fri, 05 May 2023 12:00:00 GMTBug chasing, the process is more important than the result<p>I’m doing a pretty major refactoring inside of RavenDB right now. I was able to finish a bunch of work and submitted things to the CI server for testing. RavenDB has several layers of tests, which we run depending on context.</p>
<p>During development, we’ll usually run the FastTests. About 2,300 tests are being run to validate various behaviors for RavenDB, and on my machine, they take just over 3 minutes to complete. The next tier is the SlowTests, which run for about 3 hours on the CI server and run about 26,000 tests. Beyond that we actually have a few more layers, like tests that are being run only on the nightly builds and stress tests, which can take several minutes each to complete.</p>
<p>In short, the usual process is that you write the code and write the relevant tests. You also validate that you didn’t break anything by running the FastTests locally. Then we let CI pick up the rest of the work. At the last count, we had about 9 dedicated machines as CI agents and given our workload, an actual full test run of a PR may complete the next day.</p>
<p>I’m mentioning all of that to explain that when I reviewed the build log for my PR, I found that there were a bunch of tests that failed. That was reasonable, given the scope of my changes. I sat down to grind through them, fixing them one at a time. Some of them were <em>quite</em> important things that I didn’t take into account, after all. For example, one of the tests failed because I didn’t account for sorting on a dynamic numeric field. Sorting on a numeric field worked, and a dynamic text field also worked. But dynamic numeric field didn’t. It’s the sort of thing that I would never think of, but we got the tests to cover us.</p>
<p>But when I moved to the next test, it didn’t fail. I ran it again, and it still didn’t fail. I ran it in a loop, and it failed on the 5th iteration. That… sucked. Because it meant that I had a race condition in there somewhere. I ran the loop again, and it failed <em>again</em> on the 5th. In fact, in every iteration I tried, it would only fail on the 5th iteration.</p>
<p>When trying to isolate a test failure like that, I usually run that in a loop, and hope that with enough iterations, I’ll get it to reproduce. Having it happen constantly on the 5th iteration was… really strange. I tried figuring out what was going on, and I realized that the test was generating 1000 documents using a random. The fact that I’m using Random is the reason it is non-deterministic, of course, except that this is the code inside my test base class:</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_2.png"><img style="border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_thumb.png" alt="image" width="450" height="47" border="0" /></a></p>
</blockquote>
<p>So this is <em>properly</em> initialized with a seed, so it will be consistent.</p>
<p>Read the code again, do you see the problem?</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_7.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_thumb_2.png" alt="image" width="450" height="47" border="0" /></a></p>
</blockquote>
<p>That is a <em>static</em> value. So there are two problems here:</p>
<ul>
<li>I’m getting the <em>bad</em> values on the fifth run in a consistent manner because that is the set of results that reproduce the error.</li>
<li>This is a <em>shared</em> instance that may be called from multiple tests at once, leading to the wrong result because Random is not thread safe.</li>
</ul>
<p>Before fixing this issue so it would run properly, I decided to use an ancient debugging technique. It’s called printf().</p>
<p>In this case, I wrote out all the values that were generated by the test and wrote a <em>new</em> test to replay them. <em>That</em> one failed consistently.</p>
<p>The problem was that it was still too big in scope. I iterated over this approach, trying to end up with a smaller section of the codebase that I could invoke to repeat this issue. <em>That</em> took most of the day. But the end result is a test like this:</p>
<blockquote>
<script src="https://gist.github.com/ayende/1af4847d0f7dc293f6b075824ca8c3d8.js"></script>
</blockquote>
<p>As you can see, in terms of the amount of code that it invokes, it is pretty minimal. Which is pretty awesome, since that allowed me to figure out what the problem was:</p>
<blockquote>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_9.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Bug-chasing-the-process-is-more-importan_FFE0/image_thumb_3.png" alt="image" width="247" height="76" border="0" /></a></p>
</blockquote>
<p>I’ve been developing software professionally for over two decades at this point. I still get caught up with things like that, sigh.</p>https://www.ayende.com/blog/199425-B/bug-chasing-the-process-is-more-important-than-the-result?Key=25f17f7f-bece-4fee-98d2-91ccbe9df5e8https://www.ayende.com/blog/199425-B/bug-chasing-the-process-is-more-important-than-the-result?Key=25f17f7f-bece-4fee-98d2-91ccbe9df5e8Thu, 04 May 2023 12:00:00 GMTDebugging native memory issues in a C# application<p>I’m working on improving the performance of Corax, RavenDB’s new search engine. Along the way, I introduced a bug, a fairly nasty one. At a random location, while indexing a ~50 million documents corpus, we are getting an access violation exception. That means that I messed something up.</p>
<p>That makes sense, given that my changes were mostly about making things lower-level. Working directly with pointers and avoiding length checks. At our speed, even the use of Span can be a killer for performance, and we want to be as close to the raw metal as possible. The particular changeset that I was working on was able to improve the indexing speed from 90,000 per second to 120,000 per second. That is a change that I absolutely <em>want</em> to keep, so I started investigating it.</p>
<p>I mentioned that it is a <em>fairly</em> nasty problem. A truly nasty problem would be heap corruption that is discovered after the fact and is very hard to trace. In this case, it was <em>not</em> consistent, which is <em>really</em> strange. One of <em>the</em> important aspects of Corax is that it is single-threaded, which means that a lot of complexity is out the window. It means that for the same input, we always have the same behavior. If there is any variance, such as <em>not</em> crashing all the time, it means that there are external factors involved.</p>
<p>At any rate, given that it happened at least half the time, I was able to attach WinDBG to the process and wait for the exception to happen, this is what I got:</p>
<blockquote>
<pre>(5e20.1468): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
Corax!Corax.IndexWriter.AddEntriesToTermResultViaSmallPostingList+0x953:
00007ffa`24dcea53 c4e261902411 vpgatherdd xmm4,dword ptr [rcx+xmm2],xmm3 ds:0000026d`516514e7=????????</pre>
</blockquote>
<p>Now, look at the last line, that is an interesting one, we use the VPGATHERDD assembly instruction. It is gathering packed DWORD values, in C#, this is generated using the Avx2.GatherVector128() method. We are using that to do some bit packing in this case, so this makes a lot of sense.</p>
<p>Next, let’s see what we get from the exception:</p>
<blockquote>
<pre>0:074> .exr -1
ExceptionAddress: 00007ffafc2bfe7c (KERNELBASE!RaiseException+0x000000000000006c)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000080
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000026d51650000
Attempt to read from address 0000026d51650000</pre>
</blockquote>
<p>All of this points to an out-of-bounds read, but why is that? The call we have for GatherVector128() is used inside a method named: ReadAvx2(). And this method is called like this:</p>
<blockquote>
<pre>private unsafe static ulong Read(int stateBitPos, byte* inputBufferPtr, int bitsToRead, int inputBufferSize, out int outputStateBit)
{
if ((stateBitPos + bitsToRead) / 8 >= inputBufferSize)
throw new ArgumentOutOfRangeException();
if ( Avx2.IsSupported)
{
return ReadAvx2(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
}
return ReadScalar(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
}</pre>
</blockquote>
<p>It is an optimized approach to read some bits from a buffer, I’ll skip the details on exactly how this works. As you can see, we have a proper bounds check here, ensuring that we aren’t reading past the end of the buffer.</p>
<p>Except…</p>
<p>That we aren’t actually checking this. What we are doing is checking that we can access the <em>bytes</em> range, but consider the following scenario:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Debugging-native-memory-issues-in-a-C-ap_817C/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Debugging-native-memory-issues-in-a-C-ap_817C/image_thumb.png" alt="image" width="625" height="152" border="0" /></a></p>
<p>We have a memory page and a buffer that is located toward the end of it. We are now trying to access the last bit in the buffer, using ReadAvx2(). If we’ll check the actual bytes range, it will pass, we are trying to access the last byte.</p>
<p>However, we are going to call GatherVector128(), which means that we’ll actually access 16 bytes(!), and only the first byte is in the valid memory range, the rest is going to be read from the next page, which isn’t mapped.</p>
<p>This also explains why we are not always failing. If the <em>next </em>page is valid (which is subject to the decisions of the operating system allocator), it will pass. So that is why we didn’t have 100% reproduction. In fact, this is the sort of bug that is very easy to hide for a very long time in the system, given that it is dependent on the actual memory structure of the application.</p>
<p>Once we figured out what was going on, it was pretty easy to understand, but the fact that the AVX instructions will read after the end of the buffer was really confusing. Because even when we used Span, and its range checks, it would be completely ignored. Makes total sense, given that those aren’t really methods, but compiler intrinsics that are translated to direct machine instructions.</p>
<p>Amusingly enough, now that we found the problem, we ran into something very similar a long while ago. Then it was <a href="https://ayende.com/blog/185185-C/production-postmortem-the-arm-is-killing-me">the wrong instruction</a> being used (loading a word, instead of a byte), that would fail, but the same overal setup. It will sometimes fail, depending on the state of the <em>next</em> page in the memory.</p>
<p>We actually built some tooling around managing that, we call that electric fence memory. We allocate memory so any out-of-band access would <em>always</em> hit invalid memory, stopping us in our tracks. That means that I can get <em>easy</em> reproduction of those sorts of issues, and once we have that, the rest isn’t really that interesting, to be honest. It’s just a normal bug fix. It’s the hunt for the root cause that is both incredibly frustrating and quite rewarding.</p>https://www.ayende.com/blog/199265-A/debugging-native-memory-issues-in-a-c-application?Key=8667f482-df0e-4b5b-8ff9-f4b69797e8b5https://www.ayende.com/blog/199265-A/debugging-native-memory-issues-in-a-c-application?Key=8667f482-df0e-4b5b-8ff9-f4b69797e8b5Mon, 10 Apr 2023 12:00:00 GMTProduction postmortem: The server ate all my memory<p>A customer reported a scenario where RavenDB was using stupendous amounts of memory. In the orders of tens of GB on a system that didn’t have that much load.</p>
<p>Our first suspicion was that this is an issue with reading the metrics, since RavenDB will <a href="https://ayende.com/blog/181729-C/understanding-memory-utilization-in-ravendb">try to keep as much of the data in memory</a>, which sometimes leads users to worry. <a href="https://ayende.com/blog/181729-C/understanding-memory-utilization-in-ravendb">I spoke about this at length in the past</a>.</p>
<p>In this case, that wasn’t the case. We were able to drill down into the exact cause of the memory usage and we found out that RavenDB <em>was</em> using an abnormally high amount of memory. The question was why that was, exactly.</p>
<p>We looked into the common operations on the server, and we found a suspicious query, it looked something like this:</p>
<blockquote>
<p><span style="font-family: Consolas;"><span style="color: #0000ff;">from index</span> 'Sales/Actions'<br /><span style="color: #0000ff;"> where</span> <span style="color: #9b00d3;">endsWith</span>(WorkflowStage, '/Final')</span></p>
</blockquote>
<p>The <em>endsWith</em> query was suspicious, so we looked into that further. In general, <em>endsWith</em> requires us to scan all the unique terms for a particular field, but in most cases, there aren’t that many unique values for a field. In this case, however, that wasn’t the case, here are some of the values for WorkflowStage:</p>
<ul>
<li>Workflows/3a1af12a-b5d2-4c96-9348-177ebaacab6c/Step-2</li>
<li>Workflows/6aacc86c-2f28-4b8b-8dee-1024314d5add/Final</li>
</ul>
<p>In total, there were about 250 <em>million </em>sales in the database, each one of them with a unique WorflowStage value.</p>
<p>What does this mean, in terms of RavenDB query execution? Well, the fields are indexed, but we need to effectively do:</p>
<blockquote>
<script src="https://gist.github.com/ayende/eb784b0fba00fd8b4caf20fdfa9f8ef6.js"></script>
</blockquote>
<p>This isn’t the actual code, but it will show you what is going on.</p>
<p>In other words, in order to process this query, we have to scan (and materialize) all 250 million unique terms for this field. Obviously that is going to consume a lot of memory.</p>
<p>But what is the solution to that? Instead of doing an expensive <em>endsWith</em> query, we can move the computation from the query time to the index time.</p>
<p>In other words, instead of indexing the <em>WorkflowStage </em>field as is, we’ll extract the information we want from it. The index would have one of those:</p>
<blockquote>
<p>IsFinalWorkFlowStage = doc.WorkflowStage.EndsWith(“/Final”),</p>
<p>WorkflowStagePostfix = doc.WorkflowStage.Split(‘/’).Last()</p>
</blockquote>
<p>The first one will check whether the value is final or not, while the second just gets the (one of hopefully a few) postfixes for the field. We can then query using equality instead of <em>endsWith</em>, leading to far better performance and greatly reduced memory usage, since we don’t need to materialize any values during the query.</p>https://www.ayende.com/blog/198913-C/production-postmortem-the-server-ate-all-my-memory?Key=ed9c76e7-cf4c-4660-a3e1-188a53117d66https://www.ayende.com/blog/198913-C/production-postmortem-the-server-ate-all-my-memory?Key=ed9c76e7-cf4c-4660-a3e1-188a53117d66Fri, 27 Jan 2023 12:00:00 GMTProduction postmortem: The big server that couldn’t handle the load<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-big-server-tha_EA92/image_2.png"><img style="border: 0px currentcolor; float: left; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-big-server-tha_EA92/image_thumb.png" alt="image" width="141" height="90" align="left" border="0" /></a> A user of ours called us, quite frantic. They are running a lot of systems on RavenDB, and have been for quite some time.</p>
<p>However, very recently they started to run into severe issues. RavenDB would complain that there isn’t sufficient memory to run.</p>
<p>The system metrics, however, said that there are still gobs of GBs available (I believe that this is the appropriate technical term).</p>
<p>After verifying the situation, the on-call engineer escalated the issue. The problem was <em>weird</em>. There <em>was</em> enough memory, for sure, but for some reason RavenDB would be unable to run properly.</p>
<p>An important aspect is that this user is running a multi-tenant system, with each tenant being served by its own database. Each database has a few indexes as well.</p>
<p>Once we figured <em>that</em> out, it was actually easy to understand what is going on.</p>
<p>There are actually quite a few limits that you have to take into account. <a href="https://ayende.com/blog/197635-A/production-postmortem-out-of-memory-on-a-clear-sky">I talked about them here</a>. In that post, the issue was the maximum number of tasks defined by the system. After which, you can no longer create new threads.</p>
<p>In this case, the suspect was: vm.max_map_count.</p>
<p>Beyond just total memory, Linux has a limit on the number of memory <em>mappings</em> that a process may have. And RavenDB uses Voron, which is based on <em>mmap</em>(), and each database and each index typically have multiple maps going on.</p>
<p>Given the number of databases involved…</p>
<p>The solution was to increase the max_map_count and add a task for us, to give a warning to the user ahead of time when they are approaching the system's limits.</p>https://www.ayende.com/blog/198881-C/production-postmortem-the-big-server-that-couldnt-handle-the-load?Key=a4b4820b-d2b8-4e52-9a3c-69229409e55dhttps://www.ayende.com/blog/198881-C/production-postmortem-the-big-server-that-couldnt-handle-the-load?Key=a4b4820b-d2b8-4e52-9a3c-69229409e55dMon, 23 Jan 2023 12:00:00 GMTProduction postmortem: The heisenbug server<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-hiesien_ABC0/image_2.png"><img style="border: 0px currentcolor; float: right; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-hiesien_ABC0/image_thumb.png" alt="image" width="284" height="284" align="right" border="0" /></a>A user reported that they observed nodes in the cluster “going dark”. Basically, they would stop communicating with the rest of the cluster, but would otherwise appear functional. Both the internal and external metrics were all fine, the server would just stop responding to anything over the network. The solution for the problem was to restart the service (note, the service, not the whole machine), but the problem would happen every few days.</p>
<p>As you can imagine, we are taking this sort of thing <em>very</em> seriously, so we looked into the problem. And we came up short. The problem made <em>absolutely</em> no sense. The problem occurred on a (minor) version migration, but there was absolutely nothing related to this that could cause it. What was <em>really</em> weird was that the <em>service </em>itself continue to work. We could see log entries being written and it was able to execute scheduled backups, for example. It would just refuse to talk to us over the network.</p>
<p>That was <em>super</em> strange, since the <em>network </em>itself was fine. All the monitoring systems were green, after all. For that matter, the user was able to SSH into the system to restart the service. This didn’t match with any other issue we could think of. Since the user worked around the problem by restarting the server, we didn’t have a lead.</p>
<p>Then we noticed the exact same problem in one of our cloud instances, and there we have much better diagnostic capabilities. Once we had noticed a problematic server, we were able to SSH into that and try to figure out what was going on.</p>
<p>Here is what we found out:</p>
<ul>
<li>The server will not respond to HTTP(s) communication either from outside the machine or by trying to connect from <em>inside</em> the machine.</li>
<li>The server <em>will</em> respond to SNMP queries both from inside the machine and outside of it (which is how we typically monitor the system).</li>
</ul>
<p>When we designed RavenDB, we implemented a “maintenance hatch” for such scenarios, in addition to using HTTP(s) for communication, RavenDB also exposes a named pipe that allows you to connect to the server without going through the network at all. This ensures that if you have administrator privileges on the server, you are able to connect even if there are network issues, certificate problems, etc.</p>
<p>Here is the kicker. Under this particular situation, we <em>could</em> <em>not </em>activate this escape hatch. That is not supposed to be possible. Named pipes on Linux, where we run into the problem, are basically Unix Sockets. A network issue such as a firewall problem or something similar isn’t going to affect them.</p>
<p>At the same time, we were able to communicate with the process using SNMP. What is the problem?</p>
<p>Lacking any other options, we dumped the process, restarted the service, and tried to do the analysis offline. We couldn’t find any problem. All the details we looked at said that everything was fine, the server <em>was</em> properly listening to new connections and it <em>should</em> work. That was… weird.</p>
<p>And then it happened again, and we did the same analysis, and it came back the same. We were clueless. One of the things that we updated between versions was the .NET runtime that we were using, so we <a href="https://github.com/dotnet/runtime/issues/79446">opened an issue</a> to see if anyone ran into the same problem.</p>
<p>And then it happened again. This time, we knew that just looking at the dump wouldn’t help us, so we tried other avenues. Linux has a pretty rich set of knobs and dials that you can look at to see what was going on. We suspected that this may be an issue with running out of file descriptors, running out of memory, etc.</p>
<p>We tried looking into what is going on inside the process using <em>strace</em>, and everything was <strong>fine</strong>. The trace clearly showed that the server was processing requests and was able to send and receive data properly.</p>
<p>Wait, go through that statement again please!</p>
<p>It is fine? But the reason we are using <em>strace</em> is that there is a problem. It looks like the problem fixed itself. That was annoying, because we were hoping to use the trace to figure out what is going on. We added more monitoring along the way, which would let us know if the server found itself isolated. And we waited.</p>
<p>The next time we ran into the problem, the first thing we did was run <em>strace, </em>we needed to get the root cause as soon as possible, and we were afraid that it would fix itself before we had a chance to get to the root cause. The moment we used <em>strace</em>, the server got back online, continuing as if there was never any issue.</p>
<p>Over the next few instances of this issue, we were able to confirm the following observations:</p>
<ol>
<li>The service would stop responding to TCP and Unix Sockets entirely.</li>
<li>There were <em>no</em> firewall or network issues.</li>
<li>The service was up and functional, tailing the log showed activity.</li>
<li>We could query the server state using SNMP.</li>
<li>Running strace on the service process would fix the problem.</li>
</ol>
<p>There are a few more things, the actual trigger for the fix wasn’t strace itself. It was the <em>ptrace()</em> call, which it uses. That would cause the service to start responding again. The <em>ptrace() </em>call is basically the beginning and the end of debugging under Linux. <em>Everything </em>uses it.</p>
<p>If you want to dump a memory process, you start with <em>ptrace()</em>. You want to trace the calls, <em>ptrace(). </em>You want to <em>debug </em>the process? GDB will start by calling <em>ptrace()</em>, etc.</p>
<p>And doing that would alleviate the problem.</p>
<p>That was… quite annoying.</p>
<p>We still had absolutely no indication of what the root cause even was.</p>
<p>We suspected it may be something inside Kestrel that was causing a problem. But that wouldn’t affect the named pipes / Unix sockets that we also saw.</p>
<p>Networking <em>worked</em>, because SNMP did. We thought that this may be because SNMP uses UDP instead of TCP, and looked into that, but we couldn’t figure out how that would be any different.</p>
<p>Looking at this further, we found that we have this in the code dumps:</p>
<blockquote>
<pre><code> ~~~~ 5072
1 Interop+Sys.WaitForSocketEvents(IntPtr, SocketEvent*, Int32*)
1 System.Net.Sockets.SocketAsyncEngine.EventLoop()
1 System.Net.Sockets.SocketAsyncEngine+<>c.ctor>b__14_0(Object)</code></pre>
</blockquote>
<p>As you can see, we are waiting for this in the .NET Sockets thread. The SNMP, on the other hand, looked like:</p>
<blockquote>
<p><span style="font-family: 'Courier New';">Thread (0x559):<br /> [Native Frames]<br /> System.Net.Sockets!System.Net.Sockets.SocketPal.SysReceive()<br /> System.Net.Sockets!System.Net.Sockets.SocketPal.TryCompleteReceiveFrom()<br /> System.Net.Sockets!System.Net.Sockets.SocketAsyncContext.ReceiveFrom()<br /> System.Net.Sockets!System.Net.Sockets.SocketPal.ReceiveFrom()<br /> System.Net.Sockets!System.Net.Sockets.Socket.ReceiveFrom()<br /> SharpSnmpLib.Engine!Lextm.SharpSnmpLib.Pipeline.ListenerBinding.AsyncReceive()<br /> </span></p>
</blockquote>
<p>That was really interesting, since it meant that for sockets (both HTTP and Unix), we were always using async calls, but for SNMP, we were using the synchronous API. We initially suspected that this may be something related to the thread pool. Maybe we had something that blocked it, but it turns out to be a lot more interesting. Here is the code that is actually <em>handling</em> the SNMP:</p>
<blockquote>
<p>var count = _socket.ReceiveFrom(buffer, ref remote);</p>
<p>Task.Factory.StartNew(() => HandleMessage(buffer, count, (IPEndPoint)remote));</p>
</blockquote>
<p>In other words, we are actually reading from the socket in a blocking manner, but then processing the actual message using the thread pool. So being able to get results via SNMP meant the thread pool was well.</p>
<p>At this point we resulted to hair pulling, rubber ducking and in some instances, shaking our fists to heaven.</p>
<p>I reminded myself that I’m an adult with a bit of experience solving problems, and dug deeper. We started looking into how .NET is actually <a href="https://github.com/dotnet/runtime/blob/e52462326be03fb329384b7e04a33d3eb7c16736/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs#L176">handling sockets in async mode.</a> This end up <a href="https://github.com/dotnet/runtime/blob/e52462326be03fb329384b7e04a33d3eb7c16736/src/native/libs/System.Native/pal_networking.c#L2727">here, doing a system call</a>:</p>
<blockquote>
<p>while ((numEvents = epoll_wait(port, events, *count, -1)) < 0 && errno == EINTR);</p>
</blockquote>
<p>Reading through the man page for <em>epoll_wait()</em> I learned how <em>epoll</em>() works, that it is complex and that we need to be aware of level-triggered and edge-triggered options. Since .NET uses <a href="https://github.com/dotnet/runtime/blob/e52462326be03fb329384b7e04a33d3eb7c16736/src/native/libs/System.Native/pal_networking.c#LL2693C60-L2693C67">edge-triggered</a> events (EPOLLET, which I keep reading as electronic chicken), we focused on that.</p>
<p>There are a <em>lot</em> of edge cases and things to cover, but everything we checked was handled properly. We finally had a good smoking gun. For some reason, we weren’t getting notifications from <em>epoll</em>(), even though we should. Using <em>strace()</em> or friends somehow fixes that.</p>
<p>We actually <em>found</em> the <a href="https://stackoverflow.com/questions/20272845/strace-fixes-hung-process">exact scenario we saw in StackOverflow</a>, but without any idea what the issue was. <a href="https://xkcd.com/979/">Truly, there is an XKCD for everything</a>.</p>
<p>Our current understanding of the issue:</p>
<ul>
<li>All async sockets in .NET are going through the same socket engine, and are using <em>epoll</em>() under the covers.</li>
<li>SNMP is using <em>synchronous</em> calls, so it wasn’t using <em>epoll()</em>.</li>
</ul>
<p>That covers both of the weird things that we are seeing. So what is the issue?</p>
<p>It is not in .NET. Given the size & scope of .NET, we wouldn’t be the only ones seeing that. Below .NET, there is the kernel, so we looked into that. The machines we were running that on were using kernel 5.4.0-azure-1095, so we looked into that.</p>
<p>And it looked like it is <a href="https://lore.kernel.org/linux-fsdevel/[email protected]/#r">a kernel bug</a>, which was fixed in the <a href="https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1996678/comments/28">next updated kernel</a>. A race condition inside the kernel would cause us to miss wakeups, and then we would basically just stall without anything to wake us up.</p>
<p>We dug deeper to understand a bit more about this situation, and <a href="https://man7.org/linux/man-pages/man2/ptrace.2.html">we got this</a>:</p>
<pre> Some system calls return with <strong>EINTR </strong>if a signal was sent to a
tracee, but delivery was suppressed by the tracer. (This is very
typical operation: it is usually done by debuggers on every
attach, in order to not introduce a bogus <strong>SIGSTOP</strong>). As of Linux
3.2.9, the following system calls are affected (this list is
likely incomplete): <a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html">epoll_wait(2)</a>, and <a href="https://man7.org/linux/man-pages/man2/read.2.html">read(2)</a> from an <a href="https://man7.org/linux/man-pages/man7/inotify.7.html">inotify(7)</a>
file descriptor. The usual symptom of this bug is that when you
attach to a quiescent process with the command
strace -p <process-ID>
then, instead of the usual and expected one-line output such as
restart_syscall(<... resuming interrupted call ...>_
or
select(6, [5], NULL, [5], NULL_
('_' denotes the cursor position), you observe more than one
line. For example:
clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
epoll_wait(4,_
What is not visible here is that the process was blocked in
<a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html">epoll_wait(2)</a> before <a href="https://man7.org/linux/man-pages/man1/strace.1.html">strace(1)</a> has attached to it. Attaching
caused <a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html">epoll_wait(2)</a> to return to user space with the error
<strong>EINTR</strong>. In this particular case, the program reacted to <strong>EINTR </strong>by
checking the current time, and then executing <a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html">epoll_wait(2)</a>
again. (Programs which do not expect such "stray" <strong>EINTR </strong>errors
may behave in an unintended way upon an <a href="https://man7.org/linux/man-pages/man1/strace.1.html">strace(1)</a> attach.)</pre>
<p>And.. that is exactly what is happening. On attaching, the <em>epoll_wait</em>() will return with <em>EINTR</em>, which will cause .NET to retry the command, and that “fixes” the issue.</p>
<p>It makes total sense now, and concludes the discovery process of a pretty nasty bug.</p>
<p>Now, if you’ll excuse me, I need to go and apologize to a rubber duck.</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-hiesien_ABC0/image_4.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-hiesien_ABC0/image_thumb_1.png" alt="image" width="228" height="229" border="0" /></a></p>https://www.ayende.com/blog/198849-C/production-postmortem-the-heisenbug-server?Key=1eeda567-02a8-4bbb-b90f-557523973233https://www.ayende.com/blog/198849-C/production-postmortem-the-heisenbug-server?Key=1eeda567-02a8-4bbb-b90f-557523973233Mon, 16 Jan 2023 12:00:00 GMTAnswer: What does this code print?<p>I asked the <a href="https://ayende.com/blog/198689-B/challenge-what-does-this-code-print?key=57f95b74a23443828fa5f02fb82d19b4">following question</a>, about code that uses <em>AsyncLocal</em> as well as async calls. Here is the code again:</p>
<blockquote>
<script src="https://gist.github.com/ayende/180ac03715165f723ffed93f228c0973.js"></script>
</blockquote>
<p>This code prints <em>False</em> twice, the question is why. I would expect that the <em>AsyncLocal</em> value to remain the same after the call to Start(), since that is obviously the <em>point</em> of <em>AsyncLocal. </em>It turns out that this isn’t the case.</p>
<p>AsyncLocal is good if you are trying to pass a value <em>down</em> to child tasks, but it won’t be applicable to other tasks that are called in the same level. In other words, it works for children, not siblings tasks. This is actually even more surprising in the code above, since we don’t do any awaits in the <em>Start</em>() method.</p>
<p>The question is why? <a href="https://learn.microsoft.com/en-us/dotnet/api/system.threading.asynclocal-1?view=net-7.0">Looking at the documentation</a>, I couldn’t see any reason for that. Digging deeper into the source code, I figured out what is going on.</p>
<p>We can use <a href="https://sharplab.io/#v2:D4AQTAjAsAUCAMACEEB0AVAFgJwKYEMATASwDsBzAblgWQgFZqYaBmZMRAZQGNdT9sxAPawA3rESTkbAIIBnAJ6luAGSHd8AGwA8AIyFDNAPkQzuAF2IA3XIgC8iUrgDupxcrUad+w0YAUAJRMUtLIABzIAGxc5gLmgRJS4jAhIWaWNqgAaloArrYO5tj5wVIAvrCJkiBsIBEg0QBKuaQJKUlVqSgAnH7p1rjZebhBnSEg3TFxgaWpdL39mTma+aPtkhUwZUA===">SharpLab.io</a> to lower the high level C# code to see what is actually going on here, which gives us the following code for the <em>Start</em>() method:</p>
<blockquote>
<script src="https://gist.github.com/ayende/d0fbad92269e1d9dde5767af50c15151.js"></script>
</blockquote>
<p>Note that we call to <em>AsyncTaskMethodBuilder.Start</em>() method, which ends up in <em>AsyncMethodBuilderCore.Start</em>(). There we have a bunch of interesting code, in particular, we remember the current thread execution context <em>before</em> we execute user code, <a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs#L33-L34">here</a>. After the code is done running, we restore it if this is needed, as you can see <a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs#L50-L53">here</a>.</p>
<p>That looks fine, but why would the execution context change here? It turns out that one of the few places that interact with it is the <em>AsyncValue</em> itself, which ends up in the <a href="https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs#L532">ExecutionContext.SetLocalValue</a>. The way it works, each time you set an async local, it creates a new <em>layer </em>in the async stack. And when you <em>exit</em> an async call, it will reset the async stack to the place it was before the async call started.</p>
<p>In other words, the <em>local</em> in the name <em>AsyncLocal</em> isn’t a match to <em>ThreadLocal</em>, but is more similar to a local <em>variable</em>, which goes out of scope on function exit.</p>
<p><a href="https://stackoverflow.com/a/37309427">This isn’t a new thing, and there are workarounds</a>, but it was interesting enough that I decided to dig deep and understand what is actually going on.</p>https://www.ayende.com/blog/198690-B/answer-what-does-this-code-print?Key=6c670721-8ab4-4dd4-8eb5-b6d3929a5a39https://www.ayende.com/blog/198690-B/answer-what-does-this-code-print?Key=6c670721-8ab4-4dd4-8eb5-b6d3929a5a39Thu, 15 Dec 2022 12:00:00 GMTChallenge: What does this code print?<p>Take a look at the following code, what do you think it will print?</p><blockquote><script src="https://gist.github.com/ayende/180ac03715165f723ffed93f228c0973.js"></script></blockquote><p>Since it obviously doesn’t print the <em>expected</em> value, why do you think this is the case?</p>https://www.ayende.com/blog/198689-B/challenge-what-does-this-code-print?Key=57f95b74-a234-4382-8fa5-f02fb82d19b4https://www.ayende.com/blog/198689-B/challenge-what-does-this-code-print?Key=57f95b74-a234-4382-8fa5-f02fb82d19b4Wed, 14 Dec 2022 12:00:00 GMTSolving support issues in other people’s products<p>A user <a href="https://github.com/ravendb/ravendb/issues/15375">contacted us</a> to tell us that RavenDB does not work in his environment. As you can imagine, we didn’t really like to hear that, so we looked deeper into the issue. The issue in question included the actual problem, which looked something like this:</p>
<blockquote>
<pre>{
"Url": "/auth/",
"Type": "Raven.Client.Exceptions.Routing.RouteNotFoundException",
"Message": "There is no handler for path: GET /auth/",
"Error": "Raven.Client.Exceptions.Routing.RouteNotFoundException: There is no handler for path: GET /auth/\n"
}</pre>
</blockquote>
<p>My reaction to that was… huh?!</p>
<p>That is a really strange error, since RavenDB does <em>not</em> have an “/auth/” endpoint. The problem isn’t with RavenDB, it is with something else.</p>
<p>In this case, the user ran RavenDB on port 8080 (which is the normal thing to do) and then tried to access RavenDB in the browser.</p>
<p>The problem was that they <em>previously</em> ran some other software, and that software had the following interaction:</p>
<blockquote>
<pre>* Connected to 127.0.0.1 port 8080
> GET / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.85.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://127.0.0.1:8080/auth/
< Content-Type: text/html; charset=UTF-8</pre>
</blockquote>
<p>In other words, it redirected the browser to the “/auth/” endpoint. It’s critical to understand that 301 response means: Moved Permanently. That means that they are actually cached by the browser. In this case, the scenario was reusing the same endpoint for a different software product, and the browser cache meant that we got strange results.</p>https://www.ayende.com/blog/198626-A/solving-support-issues-in-other-peoples-products?Key=c33c5575-d31c-4373-b936-187722334e8bhttps://www.ayende.com/blog/198626-A/solving-support-issues-in-other-peoples-products?Key=c33c5575-d31c-4373-b936-187722334e8bTue, 06 Dec 2022 12:00:00 GMTProduction postmortem: Do you trust this server?<p>A customer called us with a problem. They set up a production cluster successfully, they could manually verify that everything is working, except that it would fail when they try to connect to it via the client API.</p>
<p>The error in question looked something like this:</p>
<blockquote>
<p>CertificateNameMismatchException: You are trying to contact host rvn-db-72 but the hostname must match one of the CN or SAN properties of the server certificate: CN=rvn-db-72, OU=UAT, OU=Computers, OU=Operations, OU=Jam, DC=example, DC=com, DNS Name=rvn-db-72.jam.example.com</p>
</blockquote>
<p>That is… a really strange error. Because they <em>were</em> accessing the server using: rvn-db-72.jam.example.com, and that was the configured certificate for it. But for some reason the RavenDB client was trying to connect directly to rvn-db-72. It <em>was</em> able to connect to it, but failed on the hostname validation because the certificates didn’t match.</p>
<p>Initially, we suspected that there is some sort of a MITM or some network appliance that got in the way, but we finally figured out that we had the following sequence of events, shown in the image below. The RavenDB client was properly configured, but when it asked the server where the database is, the server would give the wrong URL, leading to this error.</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-Do-you-trust-this-_EABB/image_4.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-Do-you-trust-this-_EABB/image_thumb_1.png" alt="image" width="578" height="287" border="0" /></a></p>
<p>This deserves some explanation. When we initialize the RavenDB client, one of the first things that the client does is query the cluster for the URLs where it can find the database it needs to work with. This is because the distribution of databases in a cluster doesn’t have to match the nodes in the cluster.</p>
<p>Consider this setup:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-Do-you-trust-this-_EABB/image_8.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-Do-you-trust-this-_EABB/image_thumb_3.png" alt="image" width="582" height="358" border="0" /></a></p>
<p>In this case, we have three nodes in the cluster, but the “Orders DB” is located only on two of them. If we query the <em>rvn-db-72</em> database for the topology of “Orders DB”, we’ll get nodes <em>rvn-db-73 </em>and <em>rvn-db-74</em>. Here is what this will look like:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-Do-you-trust-this-_EABB/image_10.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-Do-you-trust-this-_EABB/image_thumb_4.png" alt="image" width="536" height="362" border="0" /></a></p>
<p>Now that we understand what is going on, what is the root cause of the problem?</p>
<p>A misconfigured server, basically. The <em>PublicServerUrl</em> for the server in question was left as the hostname, instead of the full domain name.</p>
<blockquote>
<script src="https://gist.github.com/ayende/5cf12c7adf7d9dd4a0e81682ad47aa4b.js"></script>
</blockquote>
<p>This configuration meant that the server would give the wrong URL to the client, which would then fail.</p>
<p>This is something that only the client API is doing, so the Studio behaved just fine, which made it harder to figure out what exactly is going on there. The actual fix is trivial, naturally, but figuring it out took too long. We’ll be adding an alert to detect and resolve misconfigurations like that in the future.</p>https://www.ayende.com/blog/198305-B/production-postmortem-do-you-trust-this-server?Key=3aa88ca0-ed5a-4d0a-bca6-413bbab4df27https://www.ayende.com/blog/198305-B/production-postmortem-do-you-trust-this-server?Key=3aa88ca0-ed5a-4d0a-bca6-413bbab4df27Mon, 03 Oct 2022 12:00:00 GMTProduction postmortem: The missed indexing reference<p>RavenDB has a really nice feature, it allows you to index data from <em>related</em> documents. Consider the following document structure:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-missed-indexin_CD3A/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-missed-indexin_CD3A/image_thumb.png" alt="image" width="951" height="486" border="0" /></a></p>
<p>We have tickets, vehicles, and users, and we want to issue a search on all the tickets issued to Joe. Leaving aside whether this is the proper <em>way</em> to handle this, here is what the index would look like:</p>
<blockquote>
<script src="https://gist.github.com/ayende/f95bb7d41a128a86c26991728edba9fe.js"></script>
</blockquote>
<p>What we are doing here is walk the reference graph and index data from related documents. So far, so good. The cool thing about this feature is that RavenDB is in charge of ensuring that if we update the owner of the vehicle or the name of the user, the Right Thing will happen.</p>
<p>Of course, I wouldn’t be writing this blog post if we didn’t run into a problem in this scenario.</p>
<p>The way it works, for each collection referenced by the index, RavenDB maintains a list of the last document that was chceked for changes in the collection. That way, on modification of a related document, we can tell that we need to re-index a particular document.</p>
<p>This looks something like this:</p>
<blockquote>
<script src="https://gist.github.com/ayende/65e85076c25fabacb2be9df5665c2e1f.js"></script>
</blockquote>
<p>In other words, for each document that was loaded by another during indexing, we keep a list of the referencing documents.</p>
<p>Let’s say that we update document <em>vehicles/200</em>. That would be written to the storage with a new etag, and the index would wake up. It would ask to get all the documents in the Vehicles collection after etag 456, get <em>vehicles/200</em> and then check the <em>ReferencedBy</em> and find that the document <em>tickets/100</em> loaded it. At this point, it will re-index <em>tickets/100</em> to ensure we have the latest values.</p>
<p>There is quite a bit more to this process, of course, I’m skipping on a lot of optimizations and detail work. For the purpose of this post, we don’t need any of that.</p>
<p>A customer reported that (very rarely), an index similar to the one above would “miss” on updates. That should not be possible. As much as I love this feature, conceptually, it is a very simple one, there isn’t much here that <em>can</em> fail. And yet, it did. Figuring out what was happening required us to look very deeply into the exact series of steps that were taken to produce this output. It turns out that our approach had a hole in it.</p>
<p>We assume that the writes would always happen in an orderly fashion. In other words, that the writes would be consistent. But there is no actual requirement for that.</p>
<p>Consider what happens if I write just the ticket document to the database:</p>
<ul>
<li>RavenDB will index the ticket document</li>
<li>It will attempt to load the associated vehicle, figure out that there is no such document and move on</li>
<li>The related user document, of course, is not known at this point (since there is no vehicle document)</li>
</ul>
<p>The end result is that we have the following data internally:</p>
<blockquote>
<script src="https://gist.github.com/ayende/d432cec3ec432c35450095e22b9b18dc.js"></script>
</blockquote>
<p>That is fine, when we’ll add the vehicle and the user, we’ll do the appropriate wiring, no?</p>
<p>In almost all cases, that is exactly what will happen. However, consider the metadata above. We are concerned here with <em>tickets/100</em>, but there is also <em>tickets/20</em>, whose references exist properly. So the structure we have right now in terms of reference tracking is:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-missed-indexin_CD3A/image_4.png"><img style="border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-missed-indexin_CD3A/image_thumb_1.png" alt="image" width="940" height="539" border="0" /></a></p>
<p>It’s important to note that the references are always kept from the initial 'tickets' document. So even though the path from <em>tickets/20</em> to <em>users/99 </em>goes through <em>vehicles/19</em>, the relationship is a direct association.</p>
<p>What will happen if we’ll insert just the <em>users/300</em> document now? Well, there is no reference to this document, so we’ve no reason to do anything with it. But that isn’t a problem. When <em>vehicles/200</em> is inserted, this will be fixed.</p>
<p>On the other hand, if we add just <em>vehicles/200</em> to the database (with <em>users/300</em> not being present), that is a change in a tracked document, which will cause us to index the referencing document (<em>tickets/100</em>) again and move us to this state:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-missed-indexin_CD3A/image_12.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-missed-indexin_CD3A/image_thumb_5.png" alt="image" width="931" height="298" border="0" /></a></p>
<p>When we will then add <em>users/300</em>, document <em>tickets/100</em> will have the record of this reference and we’ll re-index it.</p>
<p>In other words, we are covered on both sides. Except, that there is still this pesky (and impossible) problem that the user is seeing.</p>
<p>Now, consider the following state of affairs, we are back in the initial state, both <em>vehicles/200 </em>and <em>users/300</em> are missing in the database and <em>tickets/20, vehicles/19 </em>and <em>users/99</em> are there.</p>
<p>We add <em>vehicles/200</em> to the database, and there is a re-indexing process going on. <em>At the same time</em> that we re-index <em>tickets/100</em> because of the new <em>vehicles/200</em> document, we are adding the <em>users/300</em> document in a <strong>separate</strong> transaction.</p>
<p>That means that during the indexing of <em>tickers/100</em>, we’ll see document <em>vehicles/200</em> but not the <em>users/300</em> document (even though it exists).</p>
<p>That is still not a problem, we’ll write the referencing record and on the next batch, detect that we have a user that we haven’t seen and re-index the document again.</p>
<p>Except… what if we didn’t update <em>just</em> the <em>users/300</em> document in this case, what if we also updated <em>users/99</em> at the same transaction (and after we insert document <em>users/300</em>).</p>
<p>Depending on the exact timings, we may end up missing document <em>users/300</em> (because there was no reference to it at the time) but will notice that document <em>users/99</em> was updated (we already had it referenced). Since <em>users/99 </em>was modified after <em>users/300</em>, we’ll record that we observed all the changes in the Users collection before <em>users/99</em>. That, crucially, also includes the <em>users/300</em> that we never noticed.</p>
<p>This is confusing, I’ll freely admit. In order to reproduce this bug you need a non-standard pattern for creating references, a chain of at least two references, multiple independent references with different states, <em>and</em> an unlucky draw from Murphy with the exact timing of transactions, indexing and order of operations.</p>
<p>The root cause was that we recorded the newly added document reference in memory, and only updated them when the entire indexing batch was completed. During that time, there may have been multiple transactions that modified the documents. But because we didn’t sync the state until the end of the batch, we would end up missing this case. Solving the problem once we knew what was going on involved moving a single line of code from the outer loop to an inner one, basically.</p>
<p>Writing a reproducible test case was actually far harder, since so many things had to fall just <em>so</em> this would happen. I have to admit that I don’t have any strong conclusions about this bug. It isn’t something systematic or an issue that we missed. It is a sequence of unfortunate events with a very low probability of occurring that we never actually considered.</p>
<p>The really good thing about this issue is that it is the first one in this particular area of the code in quite some time. That means that this has been quite stable for many scenarios.</p>https://www.ayende.com/blog/198210-C/production-postmortem-the-missed-indexing-reference?Key=5c2437cb-f78c-4d69-a562-42520c0a1ef6https://www.ayende.com/blog/198210-C/production-postmortem-the-missed-indexing-reference?Key=5c2437cb-f78c-4d69-a562-42520c0a1ef6Thu, 15 Sep 2022 12:00:00 GMTWhat bug through yonder weirdness my build breaks?<p>We have a lot of tests for RavenDB, and we are running them on plenty of environments. We semi frequently get a build failure when running on the <a href="https://github.com/ravendb/ravendb/runs/8286680144?check_suite_focus=true">“macOS latest” runner</a> on GitHub.</p>
<p>The problem is that the information that I have is self-contradicting. Here is the most relevant piece:</p>
<blockquote>
<script src="https://gist.github.com/ayende/31c5cb783877d0109c093b08f73848d9.js"></script>
</blockquote>
<p>Here you can see the failure itself and what is causing it.</p>
<p>Note that the debug message is showing that all three variables here have the same numeric value. The <em>address</em> and the <em>current</em> variables are also held on the stack, so there is no option for race conditions, or something like that.</p>
<p>I can’t figure out any reason why this would be triggered, in this case. About the only thing that pops to mind is whether there is some weirdness going on with pointer comparisons on MacOS, but I don’t have a lead to follow.</p>
<p>We haven’t investigated it properly yet, I thought to throw this to the blog and see if you have any idea what may be going on here.</p>https://www.ayende.com/blog/198242-B/what-bug-through-yonder-weirdness-my-build-breaks?Key=965ff556-3ddd-4bb0-8141-93d0f07fa3f9https://www.ayende.com/blog/198242-B/what-bug-through-yonder-weirdness-my-build-breaks?Key=965ff556-3ddd-4bb0-8141-93d0f07fa3f9Tue, 13 Sep 2022 12:00:00 GMTProduction postmortem: Efficiency all the way to Out of Memory error<p>RavenDB is written in C#, and as such, uses managed memory. As a database, however, we need granular control of our memory, so we also do manual memory management.</p>
<p>One of the key optimizations that we utilize to reduce the amount of overhead we have on managing our memory is using an arena allocator. That is a piece of memory that we allocate in one shot from the operating system and operate on. Once a particular task is done, we can discard that whole segment in one shot, rather than try to work out exactly what is going on there. That gives us a proper <em>scope</em> for operations, which means that missing a free in some cases isn’t the end of the world.</p>
<p>It also makes the code for RavenDB memory allocation super simple. Here is what this looks like:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-memory-savings_B349/image_8.png"><img style="border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-memory-savings_B349/image_thumb_3.png" alt="image" width="346" height="274" border="0" /></a></p>
<p>Whenever we need to allocate more memory, we’ll just bump the allocator up. Initially, we didn’t even implement freeing memory, but it turns out that there are a <em>lot</em> of long running processes inside of RavenDB, so we needed to reuse the memory inside the same operation, not just between operations.</p>
<p>The implementation of freeing memory is pretty simple, as well. If we return the last item that we allocated, we can just drop the next allocation position by how many bytes were allocated. For that matter, it also allows us to do incremental allocations. We can ask for some memory, then <em>increase</em> the allocation amount on the fly very easily.</p>
<p>Here is a (highly simplified) example of how this works:</p>
<blockquote>
<script src="https://gist.github.com/ayende/d623acc194d7adde68b26c66f6016f3a.js"></script>
</blockquote>
<p>As you can see, there isn’t much <em>there</em>. A key requirement here is that you need to return the memory back in the reverse order of how you allocated it. That is <em>usually</em> how it goes, but what if it doesn’t happen?</p>
<p>Well, then we can’t reuse the memory directly. Instead, we’ll place them in a free list. The actual allocations are done on powers of two, so that makes things easier. Here is what this actually looks like:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-memory-savings_B349/image_12.png"><img style="border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem-The-memory-savings_B349/image_thumb_5.png" alt="image" width="569" height="361" border="0" /></a></p>
<p>So if we free, but not from the top, we remember the location and can use it again. Note that for 2048 in the image above, we don’t have any free items.</p>
<p>I’m quite fond of this approach, since this is simple, easy to understand and has a great performance profile. But I wouldn’t be writing this blog post if we didn’t run into issues, now would I?</p>
<p>A customer reported high memory usage (to the point of memory exhaustion) when doing a certain set of operations. That… didn’t make any sense, to be honest. That was a <em>well traveled</em> code path, any issue there should have been long found out.</p>
<p>They were able to send us a reproduction and the support team was able to figure out what is going on. The problem was that the code in question did a couple of things, which altogether led to an interesting issue.</p>
<ul>
<li>It allocated and deallocated memory, but not always in the same order – this is fine, that is why we have the free list, after all.</li>
<li>It extended the memory allocation it used on the fly – perfectly fine and an important optimization for us.</li>
</ul>
<p>Give it a moment to consider how could these two operations together result in a problem…</p>
<p>Here is the sequence of events:</p>
<ul>
<li>Loop:</li>
<ul>
<li>Allocate(1024) -> $1</li>
<li>Allocate(256) -> $2</li>
<li>Grow($1, 4096) -> Success</li>
<li>Allocate(128) -> $3</li>
<li>Free($1) (4096)</li>
<li>Free($3) (128)</li>
<li>Free($2) (256)</li>
</ul>
</ul>
<p>What is going on here?</p>
<p>Well, the issue is that we are allocating a 1KB buffer, but <em>return</em> a 4KB buffer. That means that we add the returned buffer to the 4KB free list, but we cannot <em>pull </em>from that free list on allocation.</p>
<p>Once found, it was an easy thing to do (detect this state and handle it), but until we figured it out, it was quite a mystery.</p>https://www.ayende.com/blog/197825-C/production-postmortem-efficiency-all-the-way-to-out-of-memory-error?Key=e100f378-87a0-471d-b8f7-8c1a5f831f88https://www.ayende.com/blog/197825-C/production-postmortem-efficiency-all-the-way-to-out-of-memory-error?Key=e100f378-87a0-471d-b8f7-8c1a5f831f88Fri, 22 Jul 2022 12:00:00 GMTWhen debugging, assume an unreliable narrator<p>When we are handling a support call, we are often working with partial information about the state of the software at the customer site. Sometimes that is an unavoidable part of the job. When troubleshooting a system with patients' records, I can’t just ask the customer to schlep the data to my laptop so I can analyze it properly. Even if we could do that, there are a <em>lot</em> of cases that simply don’t reproduce anywhere but the live environment.</p>
<p>Part of the process of debugging an issue in a production environment is to be able to gather enough information <em>on site</em> that we can draw the appropriate conclusions from. RavenDB comes with a lot of tools to do just that. One of the most useful of those tools is the idea of the debug package. That is a simple idea, in the end. It gathers all the information we have about our system and packages that into a zip file. That zip file contains a lot of metrics, but it doesn’t contain customer data (aside from databases & index names, which are usually not sensitive).</p>
<p>There have been several separate cases recently where we were able to use the debug package to analyze what is going on and came back to the customer with answers. However, when hearing our explanations about what was the root cause of the issue, the customer rejected our hypothesis.</p>
<p>In one case, a customer was worried about the load that they were observing in their system. Not because there <em>was</em> an issue, but the number of requests that they observed was higher than was expected. The customer switched to using concurrent subscriptions recently and deployed a couple of worker nodes to spread the load of processing documents. However, the number of requests observed was <em>far</em> higher than they expected. Whereas before they had a single worker per subscription, and a known amount of work that they could easily measure, after switching to concurrent subscriptions they observed a <em>big</em> increase in the number of requests processed by RavenDB.</p>
<p>Given that they deployed their subscriptions to two workers, initially, it was expected that the amount of work that the cluster is processing would double. Instead, it was increased by tenfold. Looking at the metrics in the debug package, we could see that they had 10 instances of each subscription running, but the customer was insistent that they only deployed two workers nodes.</p>
<p>Our metrics said that there were 5 subscriptions from IP-1 and 5 subscriptions from IP-2. After some back and forth it was revealed that everyone was correct, but talking past each other. The customer deployed two worker nodes, yes. But each of those spawned 5 instances of the subscriptions to take advantage of concurrency inside the same worker node.</p>
<p>In the second case, we have a customer that noticed a marked increase in the amount of bandwidth that they were using. They traced that additional bandwidth to the internal communication between the nodes in the cluster. Given that they are running in the cloud, they were (rightly) concerned about the sudden jump in the bandwidth. We started the investigation process and… we didn’t like what we saw. The cluster had gone through <em>three</em> full node rebuilds in the past month. At least, that was what the data was telling us. But that didn’t make much sense.</p>
<p>Quite concerned that there is something really bad going on, we talked to the customer, who thought about this for a while, checked their own logs and explained what was going on. They are running on Lsv2-series Azure instances, and apparently, within the space of a few weeks, all three of their instances had been moved to another physical host. The Lsv2-series instances use local ephemeral NVMe drives. When they moved an instance between hosts, the effect was as if we were given a brand new hard disk. RavenDB was able to handle that scenario more or less seamlessly, with the other nodes in the cluster filling in for the down node and sending it all the data it lost. The effect of that, of course, was a big jump in network bandwidth while that was going on.</p>
<p>The customer wasn’t actually aware that this happened until they looked at the logs, RavenDB had it handled, and it was only noticed because of the bandwidth spike.</p>
<p>The point of this post isn’t to talk about how awesome RavenDB is (even if I do think it is pretty awesome). Nor is it to extoll how good our support team is at figuring out things about the customer setup that even the customer isn’t aware of.</p>
<p><em>The</em> point of this post is that you have to take into account, quite clearly, that the details that the customer is providing may be outdated, wrong or just misleading. Not because of any malicious intention on their end, but because they give you the information they <em>have</em>, not what is actually going on.</p>
<p>It reminds me of an old trick in tech support: “Take the plug out of the socket, blow on both the socket and the plug, then insert it again”. The point isn’t to blow whatever dust may have been there, preventing good contact. The point is to ensure that the silly thing is actually <em>plugged</em> in, but you can’t <em>ask</em> if this is plugged in, because the person on the other side of the call would say: “Of course it is” and never check.</p>https://www.ayende.com/blog/197793-C/when-debugging-assume-an-unreliable-narrator?Key=f1425369-1baa-4f76-a7a8-f61bc332716bhttps://www.ayende.com/blog/197793-C/when-debugging-assume-an-unreliable-narrator?Key=f1425369-1baa-4f76-a7a8-f61bc332716bWed, 20 Jul 2022 12:00:00 GMT