Ayende @ Rahienhttps://www.ayende.com/blog/Ayende @ RahienCopyright (C) Ayende Rahien 2004 - 2021 (c) 202460Reading unfamiliar codebases quickly: LMDB<p style="text-align:left;">Reading code is a Skill (with a capital letter, yes) that is really important for developers. You cannot be a good developer without it.</p><p style="text-align:left;">Today I want to talk about one aspect of this. The ability to go into an unfamiliar codebase and extract <em>one</em> piece of information out. The idea is that we don’t need to understand the entire system, grok the architecture, etc. I want to understand one thing about it and get away as soon as I can.</p><p style="text-align:left;">For example, you know that project Xyz is doing some operation, and you want to figure out how this is done. So you need to look at the code and figure that out, then you can go your merry way.</p><p style="text-align:left;">Today, I’m interested in understanding how the LMDB project writes data to the disk on Windows. This is because LMDB is based around a memory-mapped model, and Windows <span style="text-decoration:underline;"><a style="color:inherit;" href="https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-mapviewoffile#remarks">doesn’t keep the data between file I/O and mmap I/O coherent</a></span>.</p><p style="text-align:left;">LMDB is an embedded database engine (similar to Voron, and in fact, Voron is based on some ideas from LMDB) written in C. If you are interested in it, I wrote <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ayende.com/blog/posts/series/162754/reviewing-lightning-memory-mapped-database-library">11 posts going through every line of code</a></span> in the project. </p><p style="text-align:left;">So I’m familiar with the project, but the last time I read the code was over a decade ago. From what I recall, the code is <em>dense</em>. There are about 11.5K lines of code in a single file, implementing the entire thing.</p><p style="text-align:left;">I’m using <span style="text-decoration:underline;"><a style="color:inherit;" href="https://git.openldap.org/openldap/openldap/-/blob/master/libraries/liblmdb/mdb.c?ref_type=heads#L3432">the code from here</a></span>.</p><p style="text-align:left;">The first thing to do is find the relevant section in the code. I started by searching for the WriteFile() function, the Win32 API to write. The first occurrence of a call to this method is in the <span style="color:#795e26;">mdb_page_flush </span><span style="text-decoration:underline;"><a style="color:inherit;" href="https://git.openldap.org/openldap/openldap/-/blob/master/libraries/liblmdb/mdb.c?ref_type=heads#L3432">function</a></span>.</p><p style="text-align:left;">I look at this code, and… there isn’t really anything there. It is fairly obvious and straightforward code (to be clear, that is a <em>compliment</em>). I was expecting to see a trick there. I couldn’t find it.</p><p style="text-align:left;">That meant either the code had a gaping hole and potential data corruption (highly unlikely) or I was missing something. That led me to a long trip of trying to distinguish between documented <em>guarantees</em> and actual behavior. </p><p style="text-align:left;">The documentation for <span style="text-decoration:underline;"><a style="color:inherit;" href="https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-mapviewoffile">MapViewOfFile</a></span> is pretty clear:</p><blockquote><p style="text-align:left;"><span style="color:#161616;">A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the </span><a style="color:inherit;" href="https://learn.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-readfile">ReadFile</a><span style="color:#161616;"> or </span><a style="color:inherit;" href="https://learn.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-writefile">WriteFile</a><span style="color:#161616;"> function.</span></p></blockquote><p style="text-align:left;">I have <span style="text-decoration:underline;"><a style="color:inherit;" href="https://ayende.com/blog/164577/is-select-broken-memory-mapped-files-with-unbufferred-writes-race-condition">my own run-ins with this behavior</a></span>, which was super confusing. This means that I had experimental evidence to say that this is broken. But it didn’t make sense, there was no code in LMDB to handle it, and this is pretty easy to trigger. </p><p style="text-align:left;">It turns out that while the documentation is pretty broad about <em>not</em> guaranteeing the behavior, the actual issue only occurs if you are working with remote files or using unbuffered I/O.</p><p style="text-align:left;">If you are working with local files and buffered I/O (which is 99.99% of the cases), then you can rely on this behavior. I found some <span style="text-decoration:underline;"><a style="color:inherit;" href="https://community.osr.com/t/memory-mapped-file-position/13911/2">vague</a></span><span style="text-decoration:underline;"><a style="color:inherit;" href="https://community.osr.com/t/coherency-memory-mapped-file-write/13919/3">references </a></span>to this, but that wasn’t enough. There is <span style="text-decoration:underline;"><a style="color:inherit;" href="https://community.osr.com/t/memory-mapped-file-detection-logic/43826/11">this post that is really interesting</a></span>, though.</p><p style="text-align:left;">I pinged Howard Chu, the author of LMDB, for clarification, and he was quick enough to assure me that yes, my understanding was (now) correct. On Windows, you can mix memory map operations with file I/O and get the right results.</p><blockquote><p style="text-align:left;">The documentation appears to be a holdover from Windows 9x, with the NT line always being able to ensure coherency for local files. This is a guess about the history of documentation, to be honest. Not something that I can verify.</p></blockquote><p style="text-align:left;">I had the wrong information in my head for over a decade. I did <em>not</em> expect this result when I started this post, I was sure I would be discussing navigating complex codebases. I’m going to stand in the corner and feel upset about this for a while now. </p>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/201313-B/reading-unfamiliar-codebases-quickly-lmdb?Key=64d1c380-b6d1-47a0-8ce6-d4cb741659b1https://www.ayende.com/blog/201313-B/reading-unfamiliar-codebases-quickly-lmdb?Key=64d1c380-b6d1-47a0-8ce6-d4cb741659b1Fri, 05 Jul 2024 12:00:00 GMTCode review & Time Travel<p style="text-align:left;">A not insignificant part of my job is to go over code. Today I want to discuss how we approach code reviews at RavenDB, not from a process perspective but from an operational one. I have been a developer for nearly 25 years now, and I’ve come to realize that when I’m doing a code review I’m actually looking at the code from three separate perspectives.</p><p style="text-align:left;">The first, and most obvious one, is when I’m actually looking for problems in the code - ensuring that I can understand what is going on, confirming the flow makes sense, etc. This involves looking at the code <em>as it is right now</em>. </p><p style="text-align:left;">I’m going to be showing snippets of code reviews here. You are not actually expected to follow the <em>code</em>, only the concepts that we talk about here.</p><p style="text-align:left;">Here is a classic code review comment:</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">There is some duplicated code that we need to manage. Another comment that I liked is this one, pointing out a potential optimization in the code:</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">If we define the code using the <em>static</em> keyword, we’ll avoid delegate allocation and save some memory, yay!</p><p style="text-align:left;">It gets more interesting when the code is correct and proper, but may do something weird in some cases, such as in this one:</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">I really love it when I run into those because they allow me to actually explore the problem thoroughly. Here is an even better example, this isn’t about a problem in the code, but a discussion on its impact. </p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">RavenDB has been around for over 15 years, and being able to go back and look at those conversations in a decade or so is invaluable to understanding what is going on. It also ensures that we can share current knowledge a lot more easily.</p><p style="text-align:left;">Speaking of long running-projects, take a look at the following comment:</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">Here we need to provide some context to explain. The <em>_caseInsensitive</em> variable here is a concurrent dictionary, and the change is a pretty simple optimization to avoid the annoying KeyValuePair overload. Except… this code is there intentionally, we use it to ensure that the removal operation will only succeed if <em>both</em> the key and the value match. There was an old bug that happened when we removed blindly and the end result was that an updated value was removed. </p><p style="text-align:left;">In this case, we look at the code change from a historical perspective and realize that a modification would reintroduce old (bad) behavior. We added a comment to explain that in detail in the code (and there already was a test to catch it if this happens again). </p><p style="text-align:left;">By far, the most important and critical part of doing code reviews, in my opinion, is not focusing on what <em>is</em> or what <em>was</em>, but on what <em>will </em>be. In other words, when I’m looking at a piece of code, I’m considering not only what it is doing right now, but also what we’ll be doing with it in the future. </p><p style="text-align:left;">Here is a simple example of what I mean, showing a change to a perfectly fine piece of code:</p><p style="text-align:left;"><img src="" style="float: right"/></p><p style="text-align:left;">The problem is that the if statement will call <em>InitializeCmd</em>(), but we previously called it <em>using a different condition</em>. We are essentially testing for the same thing using two different methods, and while currently we end up with the same situation, in the future we need to be aware that this may change. </p><p style="text-align:left;">I believe one of the major shifts in my thinking about code reviews came about because I mostly work on RavenDB, and we have kept the project running over a long period of time. Focusing on making sure that we have a sustainable and maintainable code base over the long haul is <em>important. </em>Especially because you need to experience those benefits over time to really appreciate looking at codebase changes from a historical perspective.</p>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/9000.0.1/themes/prism.min.css" integrity="sha512-/mZ1FHPkg6EKcxo0fKXF51ak6Cr2ocgDi5ytaTBjsQZIH/RNs6GF6+oId/vPe3eJB836T36nXwVh/WBl/cWT4w==" crossorigin="anonymous" referrerpolicy="no-referrer" />https://www.ayende.com/blog/200577-B/code-review-time-travel?Key=3ec75656-3dca-4436-91a4-85f16cbae154https://www.ayende.com/blog/200577-B/code-review-time-travel?Key=3ec75656-3dca-4436-91a4-85f16cbae154Fri, 19 Jan 2024 12:00:00 GMTProduction postmortem: The allocating query<p>A customer was experiencing large memory spikes in some cases, and we were looking into the allocation patterns of some of the queries that were involved. One of the things that popped up was a query that allocated just under 30GB of managed memory during its processing.</p>
<p>Let me repeat that, because it bears repeating. That query allocated 30(!) GB(!) during its execution. Now, that doesn’t mean that it was <em>consuming</em> 30 GB, it was just the allocations involved. Most of that memory was immediately discarded during the operation. But 30 GB of garbage to cleanup puts a lot of pressure on the system. We took a closer look at the offensive query. It looked something like this:</p>
<blockquote>
<p><span style="color: #0000ff;">from index</span> “Notifications/RoutingAndPriority” <br /><span style="color: #0000ff;">where</span> <span style="color: #9b00d3;">startsWith</span>(Route, $routeKeyPrefix)<span style="color: #0000ff;"> <br />order by</span> Priority <span style="color: #0000ff;">desc</span></p>
</blockquote>
<p>That does <em>not</em> seem like a query that should be all that expensive. But details matter, so we dove into this. For this particular query, the routes are hierarchical structures that are <em>unique </em>for each message. Something like:</p>
<ul>
<li>notifications/traffic/new-york-city/67a81019-941b-4d04-a0db-0559ed45343c</li>
<li>notifications/emergency/las-vegas/0a8e18fb-563b-4b6a-8e93-e10e08239656</li>
</ul>
<p>And the queries that were generated were using the city & topic to filter the information that they were interested in.</p>
<p>The customer in question had a <em>lot</em> of notifications going on at all times. And each one of their Routes was unique. Internally, RavenDB uses Lucene (currently <img class="wlEmoticon wlEmoticon-smile" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem_3E51/wlEmoticon-smile_2.png" alt="Smile" /> ) to handle searches, and Lucene is using an inverse index to execute queries.</p>
<p>The usual way to think about is like this:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem_3E51/image_2.png"><img style="margin: 0px; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem_3E51/image_thumb.png" alt="image" width="234" height="358" border="0" /></a></p>
<p>We have a list of terms (Brown, Green & Purple) and each of them has a list of the matching documents that contain the particular term.</p>
<p>The process of issuing a prefix query then is easy, scan all entries that match the prefix and return their results. This is indeed what Lucene is doing. However… while it is <em>doing</em> that, it will do something like this:</p>
<blockquote>
<script src="https://gist.github.com/ayende/144f242419ffc03cad35603adec2263e.js"></script>
</blockquote>
<p>Pay close attention to what is actually happening here. There are <em>two</em> enumerators that we work with. One for the terms for the field and one for the documents for a specific term.</p>
<p>All of this is perfectly reasonable, but there is an issue. What happens when you have a <em>lot</em> of unique values? Well, then Lucene will have a lot of iterations of the loop. In this case, each term has just a single match, and Lucene is pretty good at optimizing search by specific term.</p>
<p>The actual problem is that Lucene <em>allocates</em> a string instance for <em>each term</em>. If we have 30 million notifications for New York’s traffic, that means that we’ll allocate 30 million strings during the processing of the query. We aren’t <em>retaining</em> these strings, mind. They’ll be cleaned up by the GC quickly enough, but that is an additional cost that we don’t actually want.</p>
<p>Luckily, in this case, there is a much simple solution. Given that the pattern of the route is known, we can skip the unique portion of the route. That means that in our index, we’ll do something similar to:</p>
<blockquote>
<p>Route = doc.Route.Substring(0, doc.Route.LastIndexOf('/') + 1)</p>
</blockquote>
<p>Once that is done, the number of <em>unique</em> matches there would be negligible. There would be no more allocations galore to observe and overall system performance is much improved.</p>
<p>We looked into whether there is something that we can do with Lucene to avoid this allocations issue, but it is endemic to the way the API works. The longer term plan is to fix that completely, of course. We are making great strides there already <img class="wlEmoticon wlEmoticon-smile" src="https://ayende.com/blog/Images/Open-Live-Writer/Production-postmortem_3E51/wlEmoticon-smile_2.png" alt="Smile" />.</p>
<p>In short, if you are doing startsWith() queries or similar, pay attention to the number of unique terms that you have to go through. A simple optimization on the index like the one above can bring quite a bit of dividends.</p>https://www.ayende.com/blog/197953-A/production-postmortem-the-allocating-query?Key=8a19a5e6-c99f-484b-b58c-f0caf6e65221https://www.ayende.com/blog/197953-A/production-postmortem-the-allocating-query?Key=8a19a5e6-c99f-484b-b58c-f0caf6e65221Fri, 05 Aug 2022 12:00:00 GMTReviewing code isn’t a binary operation<p>I want to comment on the following tweet:</p><blockquote><blockquote class="twitter-tweet"><p lang="en" dir="ltr">A rather banal “thought experiment:”<br><br>What if the only code we could review was tests and interface definitions?<br><br>What would that force us to specify at the interface and behaviour level, rather than just the implementation level?</p>— Reginald Braithwaite (@raganwald) <a href="https://twitter.com/raganwald/status/1345428908766875648?ref_src=twsrc%5Etfw">January 2, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></blockquote><p>When I read it, I had an immediate and visceral reaction. Because this is one of those things that <em>sound</em> nice, but is actually a horrible dystopian trap. It confused two very important concepts and put them in the wrong order, resulting in utter chaos. </p><p>The two ideas are “writing tests” and “producing high quality code”. And they are usually expressed in something like this:</p><blockquote><p>We write tests in order to product high quality code. </p></blockquote><p>Proper tests ensure that you can make forward progress without having regressions. They are a tool you use to ensure a certain level of quality as you move forward. If you assume that the target is the <em>tests</em> and that you’ll have high quality code because of that, however, you end up in weird places. For example, take a look at the following set of stairs. They aren’t going anywhere, and aside from being decorative, serves no purpose.</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing_9EF5/image_2.png"><img width="596" height="438" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing_9EF5/image_thumb.png" border="0"></a></p><p>When you start considering tests themselves to be the goal, instead of a means to achieve it, you end up with decorative tests. They add to your budget and make it harder to change things, but don’t benefit the project. </p><p>There are a <em>lot</em> of things that you are looking for in code review that shouldn’t be in the tests. For example, consider the error handling strategy for the code. We have an invariant that says that exceptions may no escape our code when running in a background thread. That is because this will kill the process. How would you express something like that in a test? Because you end up with an error raised from a write to a file that happens when the disk is full that kills the server. </p><p>Another example is a critical piece of code that needs to be safely handle out of memory exceptions. You can test for that, sure, but it is hard and expensive. It also tend to freeze your design and implementation, because you are now testing implementation concerns and that make it very hard to change your code. </p><p>Reviewing code for performance pitfalls is also another major consideration. How do you police allocations using a test? And do that without killing your productivity? For fun, the following code allocates:</p><blockquote><p>Console.WriteLine(1_000_000);</p></blockquote><p>There are ways to monitor and track these kind of things, for sure, but they are very badly suited for repeatable tests. </p><p>Then there are other things that you’ll cover in the review, more tangible items. For example, the quality of error messages you raise, or the logging output. </p><p>I’m not saying that you can’t write tests for those. I’m saying that you shouldn’t. That is something that you want to be able to change and modify quickly, because you may realize that you want to add more information in a certain scenario. Freezing the behavior using tests just means that you have more work to do when you need to make the changes. And reviewing just test code is an even bigger problem when you consider that you need to consider interactions between features and their impact on one another. Not in terms of correctness, you absolutely need to test that, but in terms of behavior.</p><p>The interleaving of internal tasks inside of RavenDB was careful designed to ensure that we’ll be biased in favor of user facing operations, starving background operations if needed. At the same time, it means that we need to ensure that we’ll give the background tasks time to run. I can’t really think about how you would write a test for something like that without basically setting in stone the manner in which we make that determination. That is something that I explicitly don’t want to do, it will make changing how and what we do harder. But that is something that I absolutely consider in code reviews.</p>https://www.ayende.com/blog/192834-B/reviewing-code-isnt-a-binary-operation?Key=badef09f-84dc-4cf1-84c9-d063c947c74dhttps://www.ayende.com/blog/192834-B/reviewing-code-isnt-a-binary-operation?Key=badef09f-84dc-4cf1-84c9-d063c947c74dTue, 05 Jan 2021 12:00:00 GMTReviewing mimalloc: Part II<p>In my last post, I looked into the infrastructure of mimalloc, to be frank. A few details on how it is actually hooking the system allocator when used, mostly. I’m still using what is effectively a random scatter/gather approach to the codebase. Mostly because it is small. I’m trying to… grok it would be the best term, I guess. I’m going over the code because it also give me a good idea about practices in what seems to be a <em>damn good</em> C codebase. </p><blockquote><p>I should mention that I drove right into the code, there is also the <a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf">tech report</a>, which I intend to read, but only after I got through enough of the code to get a good feeling for it. </p></blockquote><p>I run into the code in the <em>options.c</em> file, for instance:</p><blockquote><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-mimalloc-Part-II_13640/image_2.png"><img width="931" height="733" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-mimalloc-Part-II_13640/image_thumb.png" border="0"></a></p></blockquote><p>This is a really nice way to get configuration values from the user. What I find really interesting, to be frank, is not the actual options, which would be interesting later on, but the fact that this is such a nice way to represent things in a readable manner. </p><p>I’m doing similar things in C# (a much higher level language) to create readable options (typically using dictionary & dictionary initializer). I like how the code is able to express things so succinctly in a language with far fewer features.</p><p>However, the <em>order</em> of parameters is critical (is should match the <em>mi_option_t</em> enum values), and there is no way to express this in the code. </p><p>I also liked this code, which is part of reading configuration values from env variables:</p><blockquote><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-mimalloc-Part-II_13640/image_4.png"><img width="745" height="228" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-mimalloc-Part-II_13640/image_thumb_1.png" border="0"></a></p></blockquote><p>I like that this is using <em>strstr()</em> in reverse in this manner. It is really elegant.</p><p>Going over the OS abstraction layer, on the other hand, show some granliness, take a look here:</p><blockquote><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-mimalloc-Part-II_13640/image_8.png"><img width="570" height="430" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-mimalloc-Part-II_13640/image_thumb_3.png" border="0"></a></p></blockquote><p>I actually simplified the code abit, because it also had #if there for BSD, Linux, etc. I find it harder to follow this style, maybe adding indentation would help, but I have had to read this function multiple times, filtering for specific OSes to get it right.</p><p>I did find this tidbit, which is interesting:</p><blockquote><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-mimalloc-Part-II_13640/image_6.png"><img width="725" height="361" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-mimalloc-Part-II_13640/image_thumb_2.png" border="0"></a></p></blockquote><p>This is attempting to do allocation with exclusive access. I wonder how this is actually used for. It looks like mimalloc is attempting to allocate in specific addresses, so that should be interesting.</p><p>Indeed, in <em>_mi_os_reset()</em> they will explicitly ask the OS to throw the memory away by calling MADV_FREE or MEM_RESET. I find this interesting, because this let the OS know that the memory can be thrown away, but the allocation still persists. I’m currently looking into some fragmentation issues in 32bits, which <em>won’t</em> be helped by this scenario. Then again, I don’t think that mimalloc is primarily intended for 32 bits systems (I can see code handling 32 bits, but I don’t think this is the primary use case or that 32 bits had a lot of attention).</p><p>The <em>mi_os_alloc_aligned_ensured()</em> method call is doing some interesting things. If I need a 16MB buffer, but aligned on 1MB boundary, I have no real way to ask this from the OS. So this is implemented directly by over-allocating. To be fair, I can’t think of a good reason why you’ll want to do something like that (you have no guarantees about what the <em>actual physical </em>memory layout would be after all, and that is the only scenario I can think this would be useful. Given that page aligned memory (which is what you get anyway from the OS) is usually sufficient, I wonder what is the use case for that here.</p><p>I get why mimalloc have to struggle with this, given that it is limited to just returning a pointer back (malloc interface), and doesn’t have a good way to tell that it played games with the alignment when you pass a value to <em>free()</em>. There seems to be a lot of code around here to deal with memory alignment requirements. I think that I’ll need to go up the stack to figure out what kind of alignment requirements it has.</p><p>That is enough for now, I think. I’m going to get to the core of mimalloc in the next post.</p>https://www.ayende.com/blog/188001-C/reviewing-mimalloc-part-ii?Key=85573868-a09c-45e6-9f1e-044d7da4a799https://www.ayende.com/blog/188001-C/reviewing-mimalloc-part-ii?Key=85573868-a09c-45e6-9f1e-044d7da4a799Mon, 22 Jul 2019 12:00:00 GMTReviewing Sled: Part III<p>Unusually for me, I had a bit of a pause in reviewing <a href="https://github.com/spacejam/sled/">Sled</a>. As a reminder, Sled is an embedded database engine written in Rust. I last stopped looking at the buffer management, but I still don’t really have a good grasp of what is going on.</p><p>The next file is the iterator. It looks like it translates between segments and messages in these segments. Here is the struct:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_2.png"><img width="1152" height="305" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb.png" border="0"></a></p><p>As you can see, the log iterator holds an iterator of segments, and iterating over it looks like it will go over all the messages in the segments in order. Yep, here is the actual work being done:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_4.png"><img width="839" height="154" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_1.png" border="0"></a></p><p>The <em>next()</em> method is fairly straightforward, I found. But I have to point out this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_6.png"><img width="961" height="159" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_2.png" border="0"></a></p><p>First, the <em>will need</em> call is really interesting. Mostly because you have a pretty obvious way to do conditional compiling that doesn’t really sucks. <em>#if</em> is usually much more jarring in the code.</p><p>Second, I think that the style of putting really important functions inside an if result in a pretty dense code. Especially since the if is entered only on error. I would have preferred to have it as a stand alone variable, and then check if it failed. </p><p>What I don’t understand is the <em>read_segment</em> call. Inside that method, we have:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_8.png"><img width="916" height="81" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_3.png" border="0"></a></p><p>There are also similar calls on segment trailer. It looks like we have a single file for the data, but stuff that is too large is held externally, in the blob files. </p><p>We then get to this guy, which I find really elegant way to handle all the different states.</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_10.png"><img width="926" height="432" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_4.png" border="0"></a></p><p>That is about it for interesting bits in the iterator, the next fun bit is the Log. I do have to admit that I don’t like the <em>term</em> log. It is too easy to get it confused with a debug log. In Voron, I used the term Journal or Write Ahead Journal (OH in the office: “did we waj it yet?”). </p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_12.png"><img width="1020" height="445" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_5.png" border="0"></a></p><p>The fact that you need to figure out where to get the offset of the data you are about to write is really interesting. This is the method that does the bulk of the work:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_14.png"><img width="1083" height="299" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_6.png" border="0"></a></p><p>Note that it just reserve and complete the operation. This also does <em>not</em> flush the data to disk. That is handled by the flusher or by explicit call. The <em>reserve()</em> method calls to <em>reserve_internal()</em> and there we find this gem:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_16.png"><img width="1197" height="525" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_7.png" border="0"></a></p><p>I know what it does (conditional compilation), but I find it really hard to follow. Especially because it <em>looks </em>like a mistake, with <em>buf </em>being defined twice. This is actually a case where an <em>#if</em> statement would be better, in my eyes.</p><p>Most of the code in there is to manage calls to the iobuf, which I already reviewed. So I’m going to skip ahead and look at something that is going to be more interesting, the page cache. Sled has an interesting behavior, in that it can shred a page into multiple location, requiring some logic to bring it all back together. That is going to be really interesting to look at, I hope.</p><p>The file stats with this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_18.png"><img width="1301" height="416" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_8.png" border="0"></a></p><p>And this… takes a while to unpack. Remember that epoch is manual GC pattern for concurrent data structure without GC.</p><p>The <em>cached_ptr</em> value is a shared pointer to a Node (inside a lock free stack) that holds a CacheEntry with static lifetime and thread safe to a generic argument that must have static lifetime and be thread safe. And there is a unsigned long there as well.</p><p>No idea yet what is going on. But here is the first method on this struct:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_20.png"><img width="901" height="145" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_9.png" border="0"></a></p><p>That is… a lot. The cache entry is a discriminated union with the following options:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_22.png"><img width="723" height="376" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_10.png" border="0"></a></p><p>There are some awesome documentation comments here, including full blown sample code that really help understand what is going on in the code.</p><p>There seems to be a few key methods that are involved here:</p><ul><li>allocate(val) – create a new page and write an initial value, gets back a page id.</li><li>link(id, val) – make a write to a page id. Which simply write a value out.</li><li>get(id) – read all the values for a page id, and uses a materializer to merge them all to a single value.</li><li>replace(id, val) – set the page id to the new value, removing all the other values it had.</li></ul><p>The idea here, as I gather. Is to allow sequential writes to the data, but allow fast reads, mostly by utilizing SSD’s random read feature.</p><p>I’m trying to follow the code, but it is a bit complicated. In particular, we have:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_24.png"><img width="682" height="189" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-Sled-Part-III_11EA8/image_thumb_11.png" border="0"></a></p><p>This try to allocate either a free page or allocate a new one. One of the things that really mess with me here is that the use of the term Page. I’m using to B+Tree, where a page is literally some section of memory. Here it refers to something more nebulous. Key point here, I don’t see where the <em>size</em> is specified. But given that you can link items to page, that sort of make sense. I just need to get used to Pages != storage. </p><p>The fact that all of this is generic also make it hard to follow what is actually going on. I’m getting lost in here, so I think that I’ll stop for now.</p>https://www.ayende.com/blog/187073-C/reviewing-sled-part-iii?Key=8e4c8a31-646a-4c58-a99c-83a7d67beb12https://www.ayende.com/blog/187073-C/reviewing-sled-part-iii?Key=8e4c8a31-646a-4c58-a99c-83a7d67beb12Tue, 23 Apr 2019 12:00:00 GMTReviewing FASTER: Summary<p><a href="http://github.com/microsoft/FASTER/">FASTER</a> is an interesting project, with some unique approaches to solving their tasks that I haven’t encountered before. When I initially read the paper about a year or so ago, I was impressed with what they were doing, even I didn’t quite grasp exactly what was going on. After reading the code, this is now much clearer. I don’t remember where I read it, but I remember reading a Googler talking about the difference between Microsoft and Google with regards to publishing technical papers. Google would only publish something after it has been in production for a while (and probably ready to sunset <img class="wlEmoticon wlEmoticon-smile" style="" alt="Smile" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Summary_66D/wlEmoticon-smile_2.png">) while Microsoft would publish papers about software that hasn’t been deployed yet. </p><p>The reason I mention this is that FASTER isn’t suitable for production. Not by a long shot. I’m talking about issues such as swallowing errors, writing to the console as an error handling approach, calling sleep(), lack of logging / tracing / visibility into what is going on in the system. In short, FASTER looks like it was produced to support the paper. It is proof of concept / research code, not something that can take and use. </p><p>You can see it clearly in the way that the system is designed to be operated. You have dedicated threads that process requests as fast as they possibly can. However, there is no concept of working in any kind of operational environment, you can’t start using FASTER from an ASP.Net MVC app, for example. They models are just too different. I can think of a few ways to build a server using the FASTER model, but they are all pretty awkward and very specialized. This lead to the next issue with the project, it is <em>highly</em> specialized solution.</p><p>It isn’t meant for general consumption. In fact, as I can figure out, this is perfect if you have a relatively small working set that you do a <em>lot</em> of operations on. The examples I have seen given are related to tracking ads, which is a great example. If you want to store impressions on an ad, the active ads are going to pretty small, but you are going to have a lot of impressions on them. For other stuff, I don’t see a lot of usage scenarios.</p><p>FASTER is limited in the following ways:</p><ul><li>Get / Set / Update only – no way to <em>query</em></li><li>No support for atomic operations on more than a single key</li><li>Can support fixed length values only</li><li>Crash means data loss (of the most recent 14.4 GB, usually)</li><li>The API is <em>awkward</em> to use, and you need to write a bit of (non trivial) code for each key/val you want to store.</li><li>No support for compaction of data beyond dropping the oldest entries</li></ul><p>Some of these issues can be mitigated. For example, compaction can be implemented, and you can force data to be written to disk faster if you want to, but those aren’t in the box and require careful handling.</p><p>Now that I have gone over the code, I’m not quite sure what was the point there, to be honest. In terms of performance, you can get about 25% of the achieved performance by just using ConcurrentDictionary in .NET, I’m pretty sure that you can do better by literally just using a concurrent hash map in C++. This isn’t something that you can use as a primary data store, after all, so I wonder why not just keep all the data in memory and be done with it. </p><p>I liked the mutable / read only portions of the log, that is certainly a really nice way to do it, and I’m sure that the epoch idea simplified things during the implementation with the ability to not worry about concurrent accesses. However, the code is <em>complex</em> and I’m pretty sure that it is <em>not</em> going to be fun to debug / work with in real world scenarios.</p><p>To sum it up, interesting codebase and approaches, but I would caution from using it for real. The perf numbers are to salivate over, but the manner in which the benchmark was written means that it is not really applicable for any real world scenario.</p>https://www.ayende.com/blog/184387-A/reviewing-faster-summary?Key=e3ebba51-faaf-4961-8bc6-86015458a43dhttps://www.ayende.com/blog/184387-A/reviewing-faster-summary?Key=e3ebba51-faaf-4961-8bc6-86015458a43dThu, 06 Sep 2018 09:00:00 GMTReviewing FASTER: When the data hits the disk<p>So far, I ignored anything in <a href="https://github.com/Microsoft/FASTER/">FASTER</a> about how the data actually hits the disk. Based on my reading of the code and the paper, here is what I <em>think</em> that is going on. FASTER works in segments and in conjunction with its allocator. When you create a new instance, you have to define what would be the log size. From looking at the code, they seem to be favoring 16GB as the default size of the log. This is passed to PersistentMemoryMalloc, which uses pages of 32MB each to manage the memory. Out of these 16GB, 14.4GB are considered to be mutable and 1.6 GB is considered to be read only. </p><p>On startup, this class allocates two pages (64MB). Whenever FASTER needs more memory, such as to store a new record, it will eventually call to this method:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_2.png"><img width="910" height="517" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_thumb.png" border="0"></a></p><p>Again we have <em>num_slots</em> that actually means size in bytes, but I’ll leave my pet peeves aside. </p><p>You can see that this allocates from tail of the page use <em>Reserve, </em>which does an atomic operation. If we run out of space in the 32MB page, the <em>caller</em> need to call <em>NewPage()</em> to handle the new allocation. This plays together with the buffer management and the epoch. In particular, here is how a new page is allocated in the normal case. Assuming we just started and we consumed 64MB of memory, the new entry will allocate the 3rd page. This will also move the section of read only memory when there is a new page allocated and the number of active pages is beyond 14.4 GB.</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_4.png"><img width="1158" height="361" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_thumb_1.png" border="0"></a></p><p>What this means, in practice, is that FASTER typically has a 14.4GB range in which all operations are working on purely mutable memory. That means that two impressions on the same ad will end up being very close to simply Interlocked.Increment on the relevant value. This is the key for the performance that FASTER exhibits. </p><p>What happens once we start going beyond the 14.4 GB? FASTER will begin to move data to the read only section. In this case, it means that the any new modifications to the data will create a new copy of it in the mutable section.</p><p>At the same time, <em>PageAlignedShiftReadOnlyAddress()</em> will also starts an async process of writing these readonly pages to disk. Here is how it works:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_8.png"><img width="1241" height="277" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_thumb_3.png" border="0"></a></p><p>If the <em>read_only_address</em> was updated, FASTER calls to <em>BumpCurrentEpoch()</em> and will execute <em>OnPagesMarkedReadOnly</em>() when the Epoch moves beyond this range (this works because then it is guaranteed that no one may mutate this memory). That method looks like this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_10.png"><img width="1167" height="339" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_thumb_4.png" border="0"></a></p><p>The notion of <em>read_only_address</em> and <em>safe_readonly_only_address</em> is discussed in the paper quite nicely, by the way. <em>AsyncFlushPages()</em> writes the data to disk, as you would expect and updates various in memory structures.</p><p>Note that just having the data written to disk doesn’t remove the in memory copy. It is still available in memory until it is pushed back from the log by new data. Now that we understand how the data goes to the log and then to the disk, I want to figure out how the data is written in to the disk itself. Actual writes are handled here:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_12.png"><img width="973" height="519" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Writing-data_142D/image_thumb_5.png" border="0"></a></p><p>What you can see is that the destination offset is used to divide the data on disk to sections. Each section is 1GB in size. In other words, the way FASTER works is to write the data in 1 GB segments that are sequential over time. </p><p>This also plays into the expiration policy that FASTER employs. Because it uses a logs based system, it accumulate data over time and will need to handle that. The current way it deals with the problem is to just delete old files, this gets rid of the data in roughly time based order, which is suitable for the use case that the paper talks about. Another alternative is to read the old files, and move the still valid entries to the start. That doesn’t seem to be implemented and I think it will be pretty hard to do and likely consume a <em>lot</em> of resources.</p><p>I’ll probably have another summary post about FASTER, but that is pretty much it. I’ve ignored other parts (recovery, several state machines used to handle internal state, etc), but they aren’t important to grokking what it is actually doing. It is an interesting codebase, but it feels… incomplete. But I’ll reserve such thoughts to the summary post.</p>https://www.ayende.com/blog/184386-A/reviewing-faster-when-the-data-hits-the-disk?Key=04032ab3-2f56-40f3-b546-5c76296372fchttps://www.ayende.com/blog/184386-A/reviewing-faster-when-the-data-hits-the-disk?Key=04032ab3-2f56-40f3-b546-5c76296372fcWed, 05 Sep 2018 09:00:00 GMTReviewing FASTER: Reading data from disk<p>One of the things that <a href="https://github.com/microsoft/FASTER/">FASTER</a> claims is that it is suitable for larger than memory datasets with its hybrid log approach. I briefly went over that code during my review, but I want to dedicate this post to discussing how FASTER work with the disk.</p><p>FASTER is using async I/O on Linux & Windows to do its I/O, which present its own challenges, in particular, how do you handle an operation that is going to need to hit the disk ( to read old data). Another thing that I would like to discover is how does it save the data to disk. We’ll start from the reading data part.</p><p>Reading in FASTER looks like this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_8.png"><img width="794" height="161" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_3.png" border="0"></a></p><p>You pass a context, a callback and serial number. I’m not sure what the serial is about, I <em>think</em> it is about checkpoints, but not sure. You’ll be called with the results once the operation executed. </p><p>Here is the core of the Read method:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_6.png"><img width="1219" height="359" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_2.png" border="0"></a></p><p>FASTER first checks the in memory details, and if it isn’t there it is either not found or on disk. This is actually really interesting, because it implies that FASTER keep track of all of the data items purely in memory. Let’s go to InternalRead and figure out how that works. We already looked into most of that in the <a href="https://ayende.com/blog/184354-A/reviewing-faster-the-hash-structure">previous post</a> FindEntry is called to find the entry by it’s hash. FASTER keep all the hashes in memory, even while it is writing entries to disk. This way, it can easily answer if an entry exists or not. Note that as I read things, if FASTER has more than a certain number of values, hash collision rate is going to reach high percentage, requiring often expensive I/O to figure out whatever the value exists.</p><p>If the value is in memory, your ReadContext.GetAtomic() method will be called. Here is one such implementation:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_10.png"><img width="626" height="93" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_4.png" border="0"></a></p><p>Where the value was defined as:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_12.png"><img width="622" height="127" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_5.png" border="0"></a></p><p>If the value has already been moved to the read only section, it will use the Get() method, instead, as an optimization. If the value is not on the mutable section or the read only section, it is on the disk, in which case we have this code:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_14.png"><img width="1188" height="162" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_6.png" border="0"></a> </p><p>The <em>go_async()</em> method just records the status of the operation when we started the async process, it doesn’t actually invoke anything async. That is done in the caller code, Read().</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_16.png"><img width="1222" height="155" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_7.png" border="0"></a></p><p>This, in turn, gets us to this piece of code:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_18.png"><img width="810" height="150" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_8.png" border="0"></a></p><p>Note that the code is <em>full</em> of handling of the current phase of the thread. I’m ignoring all of these for now, they aren’t important. </p><p>The next thing to look at is the async I/O request, which gets us to:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_20.png"><img width="1240" height="397" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_9.png" border="0"></a></p><p>We first register the pending I/O, then actually starts to process the async call. Except that not really. <em>AsyncGetFromDisk()</em> isn’t actually doing I/O. Instead, it seems to be focused on limiting the number of concurrent I/O operations that are going on.</p><p>In this case, if there are more than 120 pending I/Os, it will call <em>io_getevents()</em> in Linux and <em>GetQueuedCompletionStatus()</em> and try to process any completed I/O immediately. </p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_22.png"><img width="1101" height="429" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_10.png" border="0"></a></p><p>ProtectAndDrain is more interesting. It asks the epoch to complete any pending operations. Such actions are the creation of a new segment file or the moving of a segment from memory to disk.</p><p>I find it interesting that FASTER chose to stall the thread until all pending I/Os are completed. Given their model, it would make more sense to push such operation into the thread ops and resume process other requests. I guess that they expect this to be either a rare case or using this for throttling overall system load. You might also have noticed the num_records arguments, this is used in the hlog method:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_24.png"><img width="629" height="37" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_11.png" border="0"></a></p><p>That was <em>confusing</em>, I expected this to be the number of records (as in, how many records are read from disk) but this is the number of bytes to read.</p><p>The C++ memory model make async code a bit complex. In particular, if you’ll look at the first code snippet in this post, you’ll see that we pass a stack allocated struct to the Read() method. Because this method can complete in an async manner, what FASTER will do is to perform a deep copy of the data. Combine that with lambda’s capturing state, and I’m pretty sure that this code will cause a <em>very</em> subtle memory corruption in rare cases.</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_28.png"><img width="841" height="220" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_13.png" border="0"></a></p><p>What I <em>think</em> will happen is that we capture by ref the stack variable and in 99% of the cases, we’ll run this in sync mode, meaning that there will be no issues. Only if the value needs to be read from disk will you get an issue. Because that function will already return but you still have the callback (and the captured reference now pointing to something completely different) still active. I think that a C++ developer would recognize this, and the fact that C++ require you to be explicit about your captures make this a lot easier to deal with. </p><p>It is important to note that there is no good way to actually handle the async nature of certain operations here. Any code that actually handle the operation need to go in the callback. </p><p>Another aspect to remember is that just reading from the disk doesn’t mean that you got the right value. You might have gotten a hash collision:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_30.png"><img width="1023" height="449" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_2B/image_thumb_14.png" border="0"></a></p><p>In other words, if you have a way to generate hash collisions, as soon as the value hits the disk, you are going to be facing with making N <em>disk I/O requests</em> to find if you have the value or not. <a href="https://news.ycombinator.com/item?id=3401900">Denial of service attacks against hash tables</a> are well known and represent a significant risk of to services. </p><p>Next on my list if seeing how FASTER is actually <em>writing </em>the data to disk, but that will be in a separate post. </p>https://www.ayende.com/blog/184385-A/reviewing-faster-reading-data-from-disk?Key=bbe0732a-22ff-479b-bad6-4c1fe0a558d4https://www.ayende.com/blog/184385-A/reviewing-faster-reading-data-from-disk?Key=bbe0732a-22ff-479b-bad6-4c1fe0a558d4Tue, 04 Sep 2018 09:00:00 GMTReviewing FASTER: The hash structure<p>Continuing my stroll through the FASTER codebase, I want to try to get to the core stuff, not just the periphery. This is a bit hard, because there is a lot of code here and it seems to be hard to find where the real action is happening. I decided to find how FASTER is handling Get and Put operations. There is something called InternalHashTable inside FasterKv, which seems promising, so I’m going to follow up on that. Interestingly enough, it shows up as:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_2.png"><img width="1138" height="75" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb.png" border="0"></a></p><p>So there are some funky stuff here to consider too, but we’ll deal with that later, for now, I want to understand what it is doing with the InternalHashTable. Inside that class, there is the notion of bukcets:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_6.png"><img width="914" height="67" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_2.png" border="0"></a></p><p>And a bucket is:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_8.png"><img width="927" height="410" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_3.png" border="0"></a></p><p>This starts to get interesting for me, so let’s dig deeper into HashBucketEntry, where they use the same union trick I talked about in <a href="https://ayende.com/blog/184321-A/reviewing-faster-digging-into-the-c-impl">my previous posts</a>, this allow to easily define an atomic version of it with very little effort.</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_10.png"><img width="884" height="274" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_4.png" border="0"></a></p><p>There is also the overflow entry definition, which looks like this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_12.png"><img width="424" height="213" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_5.png" border="0"></a></p><p>I got to say, I’m really liking this approach for handling data packing as well as multi threading concerns. It also plays very well with the actual cache line architecture of modern CPUs.</p><p>There is also the KeyHash struct, whose heart is this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_14.png"><img width="423" height="244" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_6.png" border="0"></a></p><p>So far, this lines us very neatly to how the FASTER paper talks about things. Given a KeyHash, here is how you get it’s bucket:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_16.png"><img width="574" height="97" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_7.png" border="0"></a></p><p>This simply index into the <em>buckets_</em> array by taking the (size_<em> –1)</em> bits from the hash itself. That tells me that the FASTER structure is <em>very</em> sensitive to the choice of hash function that will be used. This <a href="https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed">SO answer</a> has some amazing detail on analysis of the output of hash functions with different outputs, which can give you some idea about why this matters so much. <a href="https://probablydance.com/2018/05/28/a-new-fast-hash-table-in-response-to-googles-new-fast-hash-table/">This post</a> is an <em>in depth</em> discussion of this, as well of why the choice of hash function <em>matters</em>. This is enough for me to stop everything and try to find what kind of hash function is actually being used here.</p><p>The user get to define their own hash function, and if they do so badly (for example, use the identity function since they have an 8 bytes value and they need an 8 bytes hash) that is going to be fun. The function that they seem to be using in their examples is this one:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_18.png"><img width="1052" height="307" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_8.png" border="0"></a></p><p>A bit of searching on the interweb didn’t narrow it down to something specific, it may be something that they wrote specifically for that. Given that the paper doesn’t mention this, it doesn’t seem to be something special.</p><p>Given that 40343 is a prime, it seems like a pretty common pattern of multiple by a prime with each 16 bits portion of the key. The idea is that the multiplication by prime will spread the bits around. No idea how high the quality of this hash function is, since actual analysis would take at least a few hours. At least at a glance, it doesn’t seem to be awful, but I do wonder whatever something like FNV-1. In fact, this is very similar, but with different prime and offset and with addition instead of XOR. </p><p>The actual InternalHashTable class doesn’t actually <em>do</em> something, there are a few functions around checkpointing and recovery, but they aren’t very interesting at this point. I’m going back to actually working with the has table and looked at how reads work. They end up in the InternalRead method which does the majority of the work inside the FindEntry function. The first thing that happens <em>there</em> is:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_20.png"><img width="1057" height="100" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_9.png" border="0"></a></p><p>The first line is there to handle growing the hash table on the fly and will be ignored for now. As you can see, this basically just gives me the relevant bucket. Here is the next stage, once we <em>have</em> a bucket, we need to find the relevant entry that matches what we are looking for. Here is the code:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_22.png"><img width="1189" height="823" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_10.png" border="0"></a></p><p>You can see that there are a bunch of stuff going on here. First, we run over the known entries in the bucket, trying to find an entry with the same tag. You can see the <em>tentative</em> usage, which is used to sync up between concurrent readers and writers. There may be more items in the bucket than we have space for, so there is also the concept of overflow. This is basically a linked list of 7 items at a time and likely to be pretty fast (frequently already in the higher tiers of the cache for commonly accessed data). </p><p>Now that we have the entry, let’s see what is done with it. I’m currently reading through the InternalRead code. Just finding the matching hash doesn’t really help us, we may have hash collisions, this is handled here:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_24.png"><img width="877" height="523" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_118CC/image_thumb_11.png" border="0"></a></p><p>This is the first time that we actually run into the hybrid log (<em>hlog</em> here). But the basic idea is pretty obvious. Either the key match, or we have a pointer to the previous entry. I’m not seeing any handling of complete mismatch, though. I’m pretty sure that <a href="https://github.com/Microsoft/FASTER/issues/26">this is a bug</a> (the behavior is different in the C# version).</p><p>This is enough for now, from here the InternalRead code is starting to do things with the hlog, state machine, etc. I’m going to handle that in a separate post.</p>https://www.ayende.com/blog/184354-A/reviewing-faster-the-hash-structure?Key=916292d2-ed28-4908-b12e-189516ae0556https://www.ayende.com/blog/184354-A/reviewing-faster-the-hash-structure?Key=916292d2-ed28-4908-b12e-189516ae0556Mon, 03 Sep 2018 09:00:00 GMTReviewing FASTER: Working with the file system<p>In my <a href="https://ayende.com/blog/184321-A/reviewing-faster-digging-into-the-c-impl?key=7c5bfc3488fc4ef1b4f17f3d49cad989">last post</a> I dove into the Epoch implementation. The Epoch is explained very nicely in the paper, and the code follows the paper pretty closely. The code make sense, but I still lack the proper… feeling for how it is actually being used. The Epoch allows you to register code that will be executed when the epoch is updated, which is the key to how FASTER is making progress, but while I can see that this is being called from the allocators, I haven’t really grokked it yet. I’m going to back to the faster.h file and see what I can glean from there.</p><p>Because of the template utilization, it is kinda hard to figure out what exactly is going on, I’m going to look at some of the examples and try to figure out what it is doing there. Here is one instance of this class:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_2.png"><img width="1275" height="101" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb.png" border="0"></a></p><p>AdId and NumClicks are just two ways to provide operations on 8 bytes keys and values. I like these examples because they provide good use case to talk about FASTER usage.</p><p>This code leads me to the FileSystemDisk, which is defined as:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_4.png"><img width="594" height="214" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_1.png" border="0"></a></p><p>In the FileSystemFile, we have this code:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_6.png"><img width="1505" height="352" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_2.png" border="0"></a></p><p>This is pretty simple, but I was quite amused by this, because this is C# API sneaking up again in the C++ code. There is also this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_8.png"><img width="746" height="125" title="image" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_3.png" border="0"></a></p><p>I’m not sure what this bundle is, though. I run into this code in the class:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_10.png"><img width="887" height="287" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_4.png" border="0"></a></p><p>This is… not nice, in my eyes. Based on this code, whoever allocated this instance also allocated a buffer large enough to include more data there. This is fairly common, since you want to work with such data together, but I find it ugly / risky because it means that there are multiple locations that needs to be aware of it. I would like it better if they just passed the pointer explicitly. That would avoid this snippet:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_12.png"><img width="1536" height="250" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_5.png" border="0"></a></p><p>Which I find pretty annoying to read. What is stranger is that to use this, you have to write (bundle_t has been typedef for the FileSystemSegmentBundle):</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_14.png"><img width="1544" height="131" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_6.png" border="0"></a></p><p>I get what is going on here, but I just find it really awkward to handle. There are multiple places where you need knowledge of this allocation pattern and I don’t believe that the benefit of placing all of the data together is <em>that</em> important. For that matter, given the importance of <em>not</em> using new explicitly in modern C++, I’m sure that there are other ways to achieve the same goal that would be more natural.</p><p>Going through the code, we now have:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_16.png"><img width="620" height="90" title="image" style="border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_7.png" border="0"></a></p><p>I decided to make this post about the file system usage, because there is a <em>lot</em> of pretty complex code here that I would like to understand. I finally figured out what the S is, this is the segment size:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_18.png"><img width="1186" height="98" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_8.png" border="0"></a></p><p>This means that the earlier definition of FasterKv basically defined Segment Size of 1 GB in size. I’m not sure what these segments <em>are</em>, though. I’m pretty sure that this is how they manage time base expiration, but I’m not certain. Following upward from the creation of a new segment, we have WriteAsync, like so:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_20.png"><img width="1352" height="709" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_9.png" border="0"></a></p><p>You can see that the segment number is basically just the file id, and if the file does not already exists, we call OpenSegment on it. Afterward, we call WriteAsync on that specific file. I’ll look into how that work in a bit, this isn’t that interesting at the moment. Right now I want to dig into OpenSegment. I removed some error handling here, but the gist of it is clear. </p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_24.png"><img width="1589" height="1082" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_11.png" border="0"></a></p><p>The actual code also handles threading and errors, which I omitted. You can see that it creates the new files, copying them from the existing value. Then it creates a context that holds the <em>old</em> files and pass it to BumpCurrentEpoch.</p><p>When FASTER is sure that no one else is looking at this epoch, it will call the callback and delete / dispose the old files. This is a nice way to ensure consistency. LMDB does something similar with its transactions’ table. So now we know that whenever we write at a 1GB boundary, FASTER will generate a new epoch. </p><p>What about the actual writing? Here is what this looks like (the Linux impl):</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_26.png"><img width="1523" height="433" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_12.png" border="0"></a></p><p>On Linux, this ends up being:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_28.png"><img width="937" height="59" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_13.png" border="0"></a></p><p>This is then checked in TryComplete:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_30.png"><img width="1262" height="123" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_91F3/image_thumb_14.png" border="0"></a></p><p>This is called FasterKv.CompletePending(), which seems to be called occasionally by FASTER. On Windows, this is using async I/O and callbacks to handle this.</p><p>Okay, this is already long enough, but I got a handle on how FASTER is writing to disk, even though I don’t know yet what it is <em>doing</em> with that. I also saw an actual use of Epoch that made sense (clearing old data once no one is looking at that).</p>https://www.ayende.com/blog/184353-A/reviewing-faster-working-with-the-file-system?Key=6f5d1bdf-7164-4b10-ab17-cfb7a808d696https://www.ayende.com/blog/184353-A/reviewing-faster-working-with-the-file-system?Key=6f5d1bdf-7164-4b10-ab17-cfb7a808d696Fri, 31 Aug 2018 09:00:00 GMTReviewing FASTER: Digging into the C++ impl<p>After going over the <a href="https://ayende.com/blog/184225-A/reviewing-faster-reading-the-paper?key=ced524c829bc4a1fae86a3c7facc325b">paper</a> and the <a href="https://ayende.com/blog/184257-A/reviewing-faster-lets-start-with-managed-code?key=478acffa3cca45daaf9de4c00a6d378d">managed implementation</a>, I’m ready to start with the C++ implementation. I have higher hopes for this code. As I started browsing the C++ code, it occurred to me that the way the C#’s implementation handles dynamic code generation is pretty much how templates in C++ work. I wonder if this was the trigger for that.</p><p>The C++ code reads a lot more naturally to me. There are some nice tricks that are used there that are a joy to read. For example, take a look at Address manipulation:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_4.png"><img width="825" height="276" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_1.png" border="0"></a></p><p>The colon syntax are a way to express bitfields in C. But the real fun part for me is the idea of <em>control_</em>. What is this for? Well, it runs out that in addition to Address, the also defined AtomicAddress, whose implementation need to provide atomic operation on address. This is implemented in the following manner:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_6.png"><img width="1193" height="732" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_2.png" border="0"></a></p><p>I find this a really elegant way to handle this requirement.</p><p>Another amusing observation is that almost all the code in FASTER is in .h files, because of the heavy use of templates. I wonder how that affects compilation speed and how that would play in larger projects.</p><p>It is in <em>faster.h</em> that we start to get into the interesting bits. I first run into this guy:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_10.png"><img width="758" height="431" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_4.png" border="0"></a></p><p>This maps pretty closely to what I have actually seen the C# code does, but in C++ it is a much more natural approach that dynamic compilation on the fly as it did in C#. </p><p>Next we have the constructor, which looks like this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_12.png"><img width="1105" height="544" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_5.png" border="0"></a></p><p>The <em>epoch_</em> field is auto initialized by the compiler and is not shown here. This indicates that FASTER can handle up to 2.1 billion entries in total, which seems to be a strange limit for a data store that is expected to handle hundreds of thousands of operations per second. I’m going to jump around the codebase a bit, because I want to follow exactly what is going on when initializing this class. The first place to look is the epoch. The idea of epoch is described in the paper, so I’m not going to repeat it. The code defines a struct that is 64 bytes in size (cache line sized, to avoid false sharing), this is used to store a thread specific value and is used to maintain most of the invariants of the epoch.</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_16.png"><img width="702" height="357" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_7.png" border="0"></a></p><p>When switching between epochs, there are actions that needs to be run, here is what this looks like in the code. </p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_14.png"><img width="1002" height="663" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_6.png" border="0"></a></p><p>I must say, this really mess up with my mind, because we have C#’s naming conventions (TryPop, TryPush) in C++ code. It’s like the code couldn’t decide what shape it wanted to be in <em>either</em> language.</p><p>The number of threads that can take part is limited by this value:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_18.png"><img width="879" height="62" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_8.png" border="0"></a></p><p>Right now, this is set to 96, which means that if you need more threads than that, you’ll get a runtime error. This fits nicely with the model FASTER uses of long running threads, but I don’t see how it can play well with actually accepting commands from network / other location. </p><p>As part of it’s constructor, this method is called, which actually does the real work of setting up the epoch.</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_20.png"><img width="1072" height="393" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_9.png" border="0"></a></p><p>I’m not really sure at this point why it is allocating two additional entries beyond the specified size. </p><p>When a thread start running FATER code, it needs to register itself with the Epoch, this is done in the <em>Protect()</em> call. </p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_22.png"><img width="832" height="151" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_10.png" border="0"></a></p><p>Going into the <em>Thread</em> class reveals a simple table of values that are used to give ids to the threads that asked to get an id. This is done in this function:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_24.png"><img width="1096" height="342" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_11.png" border="0"></a></p><p>It took me a couple of times of reading the first two lines to understand what is going on here. This is an <em>awesome</em> way to handle a circular buffer scanning. It is very clear and saves a bunch of code ( at the cost of doing mod operation, which can be usually be masked if the value is known at compile time and is a power of 2, which in this case it is not). I’m probably going to use this the next time I need to implement scanning through a ring buffer. </p><p>Then we have computing the earliest safe epoch:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_26.png"><img width="1255" height="560" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_12.png" border="0"></a></p><p>The first of these methods is elegant, it does a simple read from the table, reading potentially stale values. This doesn’t matter, because the worst thing that can happen is that we’ll keep a previous epoch for longer than it is required. </p><p>The second one reads wrong to me, but I’ll have to dig deeper into the C++ memory model more deeply for this. The problem is that this seems like it is relying on the CPU to update its state somehow. But I don’t see any instruction that would force it to. I <em>think</em> that the set to <em>safe_to_reclaim_epoch</em> (which is std::atomic<uint64_t>) will use memory_order_seq_cst for the operation, but I’m not sure how that would impact reads from the <em>table_</em>. </p><p>Also, I want you to pay attention to the variable names here. Private member fields:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_28.png"><img width="519" height="124" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_13.png" border="0"></a></p><p>Public member fields:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_30.png"><img width="697" height="155" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Digging-into-the-C-impl_AF0F/image_thumb_14.png" border="0"></a></p><p>And then we have SpinWaitForSafeToReclaim that uses:</p><ul><li>safe_to_reclaim_epoch – public member field</li><li>safe_to_reclaim_epoch_ – method argument </li></ul><p>I’m not sure if this a common practice in C++, but this is <em>really </em>confusing to me. This is enough for now, I’m going to keep going thought the C++ code in my next post. There hasn’t been anything really interesting so far, just digging into the code and getting a feel as to how it is put together.</p>https://www.ayende.com/blog/184321-A/reviewing-faster-digging-into-the-c-impl?Key=7c5bfc34-88fc-4ef1-b4f1-7f3d49cad989https://www.ayende.com/blog/184321-A/reviewing-faster-digging-into-the-c-impl?Key=7c5bfc34-88fc-4ef1-b4f1-7f3d49cad989Thu, 30 Aug 2018 09:00:00 GMTReviewing FASTER: Let’s check these numbers<p>Before heading to the C++ implementation, I thought I would take the time to just walk through the FASTER codebase as it is performing a simple operation. The access violation error that I previously run into has been fixed, so I could run the benchmark. Here is my configuration:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-How-does-this-thing-wor_9EA4/image_2.png"><img width="671" height="38" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-How-does-this-thing-wor_9EA4/image_thumb.png" border="0"></a></p><p>I got the following results, when running on a single thread.</p><blockquote><p>Total 142,603,719 ops done in 30 secs.</p></blockquote><p>This is about 4.7 million operations per second, which is certainly nice. I then decided to compare this to ConcurrentDictionary to see what kind of performance that would give me. I made the <a href="https://gist.github.com/ayende/785cfa6107332a8c9c2b203d30bf8e7e">following changes</a>, which I’ll admit are pretty brute force way to go about it. But note that this is probably significantly less efficient then I could probably write it. Nevertheless, using ConcurrentDictionary with a single thread in the same benchmark gives me:</p><blockquote><p>Total 84,729,062 ops done in 30 secs.</p></blockquote><p>There isn’t much call for using this on a single thread, though. My machine has 20 cores, so let’s see what happens when I give FASTER its head, shall we?</p><blockquote><p>2,330,054,219 ops done in 30.021 secs.</p></blockquote><p>That is impressive, with about 77,668,473 operations per second. On the other hand, this is what happened when I run with 20 threads and ConcurrentDictionary:</p><blockquote><p>671,071,685 ops done in 30.024 secs.</p></blockquote><p>This gives us “only” 22,369,056 operations per second. </p><p>It is clear that FASTER is much better, right? The problem is that it isn’t much faster <em>enough</em>. What do I mean by this? I used idiomatic C# for the ConcurrentDictionary usage and got with 1/4 of FASTER’s perf. The FATER codebase is doing native calls and unsafe manipulation, dedicated allocation, etc. I expect to get better perf at that point, but “merely” 400% improvement isn’t enough for the kind of effort that was put into this. I run the concurrent dictionary in a sampling profiler, with 20 threads, and I got the following results. </p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-How-does-this-thing-wor_9EA4/image_4.png"><img width="1137" height="246" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-How-does-this-thing-wor_9EA4/image_thumb_1.png" border="0"></a></p><p>On the other hand, using FASTER for the same scenario gives:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-How-does-this-thing-wor_9EA4/image_6.png"><img width="1117" height="71" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-How-does-this-thing-wor_9EA4/image_thumb_2.png" border="0"></a></p><p>This is really interesting. You can see that the FASTER option spends all its time in either: InternalUpsert or inside the RunYcsb method (which is actually the store.Read() method that was inlined). </p><p>What is more interesting is that there are no additional calls there. The InternalUpsert call is 219 lines of code, and the code uses [MethodImpl(MethodImplOptions.AggressiveInlining)] quite aggressively (pun intended). On the other hand, the ConcurrentDictionary implementation has to make virtual method calls on each call. </p><p>There are several ways to handle this, including using generic struct that can eliminate most of the virtual calls. This is effectively what FASTER is doing, without the generics. FASTER also benefits from pre-allocating everything upfront. If you’ll look at the profiler results, you can see that these are the major source of “slowness” in the benchmark.</p><p>Given the nature of the benchmark, I feel that it is unfair to compare FASTER to persistent data stores and it should be compared more closely to a concurrent hash map. Given that this is effectively what FASTER is doing in this showcase benchmark, that seems a lot more fair. I checked the literature and we have <a href="https://arxiv.org/pdf/1601.04017.pdf">this paper</a> talking about concurrent hash maps where we see (Figure 2.a) numbers that near to 300 millions ops/sec for pure writes and 600 millions ops/sec for reads. </p>https://www.ayende.com/blog/184289-A/reviewing-faster-lets-check-these-numbers?Key=fcf4b9d1-86b8-4daa-9414-8f4e55e458bbhttps://www.ayende.com/blog/184289-A/reviewing-faster-lets-check-these-numbers?Key=fcf4b9d1-86b8-4daa-9414-8f4e55e458bbWed, 29 Aug 2018 09:00:00 GMTReviewing FASTER: Let’s start with managed code<p><a href="https://ayende.com/blog/184225-A/reviewing-faster-reading-the-paper?key=ced524c829bc4a1fae86a3c7facc325b">Last post</a> commented on the FASTER paper, now I’m getting to reading the code. This is a pleasant change of scenery for me, since FASTER is written in C# (there is a C++ port that I’ll also be going through later). As someone who spend most of their time in C#, that make it much easier to go through the code. On the other hand, this was <em>clearly</em> written by someone who spent most of their time in C++. This means that naming conventions and the general approach to writing the code sometimes directly contradict C# conventions. Some of that really bugs me.</p><p>The first thing I did was to try to run the benchmark on my own machine, to get relative numbers. It died with an <a href="https://github.com/Microsoft/FASTER/issues/15">AccessViolationException</a>, which was a disappointment. I’m going to ignore that and just read through the code. One thing that I did noticed when reading through the benchmark code is that this piece:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_2.png"><img width="416" height="224" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_thumb.png" border="0"></a></p><p>This maps closely to some of the things they mentioned in the paper, where each thread refresh every 256 ops and complete pending operations every 65536 ops. The reason I call this out is that having this done in what effectively is a client code is a bad idea in term of design. The benchmark code is operating under continuous stream of operations and can afford to amortize such things. However, real code need to deal with client code that <em>isn’t </em>well behaving and you can’t assume that you’ll be operating in this kind of environment.</p><p>Based on the semantics discussed in the paper and what the code is showing me, I would assume that the general model for actually using this is to spawn off a bunch of threads and then listen to some data source. For example, a socket. On each network read, the thread would apply a batch of operations. Note that this means that you have to deal with threads that may be arbitrarily delayed. I would expect that under such a scenario, you’ll see a very different performance profile. You won’t have all threads working in tandem and staying close to one another. </p><p>When getting to the code itself, I started with the native portions, trying to figure out what FASTER is doing with the lowest level of the system. It is mostly advanced file operations (things like telling the OS to don’t defrag files, allow to allocate disk space without zeroing it first, etc). As far as I can see, this isn’t actually being called from the current code, but I assume that they at least tested out this approach. Another native code they have is related to calling __rdtsc(), which they use in the HiResTimer class. </p><p>This can be replaced completely by the .NET Stopwatch class and I believe that dropping the <em>adv-file-ops</em> and <em>readtsc </em>native projects is possible and straightforward, allowing for a fully managed FASTER impl. Note that it is still using a <em>lot</em> of interop calls, it seems, but at least so far, I think that most of them are not necessary. To be fair, given the way the code is structured, a lot of the native code is about I/O with pointers, which until the Great Spanification in .NET Core was PITA to deal with.</p><p>In general, by the way, the code reads more as a proof of concept than a production codebase. I notice this in particular with the general approach for errors handling. Here are two examples:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_4.png"><img width="764" height="100" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_thumb_1.png" border="0"></a></p><p>This is from the native code, from which it is a PITA to return errors. However, <em>writing to the console</em> <em>from a library </em>is not an error reporting strategy. What really bugged me was this code, from the MallocFixedPageSize code:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_6.png"><img width="532" height="266" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_thumb_2.png" border="0"></a></p><p>If there is any error in the process of pinning memory, just ignore it? Leave the memory pinned? </p><p>Here is another example that made me cringe:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_8.png"><img width="528" height="192" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_thumb_3.png" border="0"></a></p><p>Moving on to the <em>CodeGen</em> directory, it gets interesting. The paper talks about dynamic code generation, but I didn’t expect to see such wide usage of this. In particular, large sections of the code (13 files, to be exact, over 6000 lines of code) are dynamically loaded, transformed and compiled on the fly when you create the hashtable. </p><p>I understand the <em>reasoning </em>for this, you want to get the JIT to compile the best possible code that it can for the specific operations you execute. However, this make it pretty hard to follow what is going on there. In particular, code generating code make it harder to follow what end up actually going on. There are also better ways to do it. Either through generic struct parameters to specialize the code or only generating the dedicated connecting methods as needed and not recompiling large portions on the fly. </p><p>The <em>Device</em> directory isn’t really interesting. Mostly pretty trivial I/O operations, so I’m not going to discuss it in depth.</p><p>Next, getting to the Epoch, which was really interesting to read in the paper. The actual implementation raise a few questions. The Epoch value in an 32 bits integer, that means that it will wrap fairly quickly. It looks like the Epoch is bumped every time you need a new page. Given the operations rate that are reported, I would expect it to happen on a regular basis (this is also required for the system to progress properly and sync up). My reading is that wrapping of the Epoch counter will result in Bad Things Going On.</p><p>There there is this:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_10.png"><img width="499" height="86" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_thumb_4.png" border="0"></a></p><p>Size, in this case, is always set to 128. The size of Entry is 64 bytes, this means that this call will allocate 8,320 bytes in Gen0, immediately pin it and never let it go. This is going to result in memory fragmentation. It would be better to allocate this on the Large Object Heap and avoid the issue. In fact, the same issue can be seen in the memory allocation, where the code does:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_12.png"><img width="521" height="62" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_thumb_5.png" border="0"></a></p><p>In This case, PageSize is 65,536, however, and given any value except a byte, this will automatically go to the Large Object Heap anyway. I’m going to be a bit ungenerous here, but I’m not sure if this was intentional or not.</p><p>I’m firmly convinced that this is pure POC code, here is the snippet that finally did it for me, from the BumpCurrentEpoch() code. </p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_14.png"><img width="618" height="203" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER-Lets-start-with-the-cod_7CAD/image_thumb_6.png" border="0"></a></p><p>Note that I don’t mind the Thread.Sleep() here, it make sense in context, but the Console.WriteLine cinched the deal for me. This is not something that you can take a use, but something to look at as you implement this properly.</p><p>I have to say that I find it very hard to go through the code, mostly because it’s in C# and there are certain expectations that I have from the code which are routinely violated. I think that I’m going to stop reading the C# codebase and switch over to the C++ implementation. I expect this to be more idiomatic, both because it is the second time this was written and because the people writing the code are clearly more using to writing in C++.</p>https://www.ayende.com/blog/184257-A/reviewing-faster-lets-start-with-managed-code?Key=478acffa-3cca-45da-af9d-e4c00a6d378dhttps://www.ayende.com/blog/184257-A/reviewing-faster-lets-start-with-managed-code?Key=478acffa-3cca-45da-af9d-e4c00a6d378dTue, 28 Aug 2018 09:00:00 GMTReviewing FASTER: Reading the paper<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_D137/image_2.png"><img width="262" height="287" title="image" align="right" style="border: 0px currentcolor; border-image: none; float: right; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_D137/image_thumb.png" border="0"></a>The <a href="https://github.com/Microsoft/FASTER">FASTER</a> is a fast key-value store from Microsoft. It’s <a href="https://www.microsoft.com/en-us/research/uploads/prod/2018/03/faster-sigmod18.pdf">paper</a> was published a while ago and now that the source is available, I thought that I would take a peek. You might say that I have a professional interest in such a project <img class="wlEmoticon wlEmoticon-smile" style="" alt="Smile" src="https://ayende.com/blog/Images/Open-Live-Writer/Reviewing-FASTER_D137/wlEmoticon-smile_2.png">.</p><p>I decided to read the paper first before reading the code, because I want to figure out what <em>isn’t</em> in the paper. The implementation details beyond the abstract are what I find most interesting, to be honest. But if I’m going to read the paper, I’m also going to add some comments on it. Please note that the comments are probably not going to make much sense unless you read the paper.</p><p>The <em>Epoch Basic</em> section explains how they use a global epoch counter with a thread local value and a shared table that marks what epochs are visible to each thread. They use this to provide synchronization points, I assume (don’t know yet). This resonates very strongly with how LMDB’s global transaction table operates. </p><p>I like the concept of the drain list which is executed whenever an epoch become safe. I would assume that they use that to let the clients know that their operation was accepted and what was its state.</p><p>I wasn’t able to figure out what they use the <em>tag</em> for in the hash bucket entry. I think that the way it works is that you have K hash buckets and then use the first K bits to find the appropriate bucket, then scan for a match on the last 15 bits. I’m not really sure how that work with collisions, though. I assume that this will be clearer when I get to the code. I like the two phase scan approach to ensure that you have no conflicts when updating an entry. </p><p>The paper keeps repeating the speed numbers of 160 millions ops/sec and talking about 20 millions ops / sec as being “merely”. Just based on my reading so far, I don’t see how this can happen. What is interesting to me is what is the definition of ops. Is it something like incrementing the same (small) set of counters? If that is the case, than the perf numbers both make more sense and are of less interest. Typically when talking about ops / sec in such scenarios we talk about inserts / updates to individual documents / rows / objects. Again, I don’t know yet, but that is my thinking so far.</p><p>One thing that I find sad is that this isn’t a durable store. A failure in the middle of operations would cause some data to be completely lost. It seems like they have support for checkpoints, so you don’t lose everything. However, I’m not sure how often that happens and the benchmarks they are talking about were run without it. Interestingly enough, the benchmarks were run without garbage collection. I haven’t gotten to the discussion on that yet, so I’m not exactly what that means Another missing feature here is that there is no support for atomicity. You cannot ensure that two separate operations will run as a single atomic step. </p><p>The benchmark machine is 28 cores with 256GB RAM and 3.2 TB NVMe drive. This is a really nice machine, but from the get go I can tell that this is not going to be a fair comparison to most other storage engines. Faster is explicitly designed to work mostly in memory and with high degree of parallelism. This is great, but it gives us some important features (atomic batches and durability, also known as transactions). The data size they tested are:</p><ul><li>250 million records with 8 bytes keys & 8 bytes values – Just under <strong>4GB </strong>in total.</li><li>250 million records with 8 bytes keys & 100 bytes values – Just under <strong>32GB </strong>in total.</li></ul><p>I’m not sure why they think that this is going to provide larger than memory setup. In particular, they mention a memory budget of 2GB, but I assume that this is just internal configuration. There is also going to be quite a lot of memory cached in the OS’ page cache, for example, and I don’t see anything controlling that. Maybe I’ll when I’ll go through the code, though.</p><p>Okay, the garbage collection they refer to is related to how they compact the log. They use an interesting approach where they just discard it at a some point, and any pointer to the removed section is considered to be deleted automatically. That is likely to be very efficient, assuming that you don’t care about older data. </p><p>All in all, I feel that I have a vague understanding on how Faster works and a lot better grasp on what it does and how it is utilized. I’m looking forward to diving into the code.</p>https://www.ayende.com/blog/184225-A/reviewing-faster-reading-the-paper?Key=ced524c8-29bc-4a1f-ae86-a3c7facc325bhttps://www.ayende.com/blog/184225-A/reviewing-faster-reading-the-paper?Key=ced524c8-29bc-4a1f-ae86-a3c7facc325bMon, 27 Aug 2018 09:00:00 GMTReading the NSA’s codebase: LemonGraph review–Part VII–Summary<p>As the final post in this series, I decided to see how I can create a <em>complex</em> query. Given the NSA’s statement, I decided to see if I can use <a href="https://github.com/NationalSecurityAgency/lemongraph">LemonGraph</a> to find a dog of interest. In particular, given our graph, I wanted to find start with a particular dog and find another dog that likes this dog that also like a dog that dislike the original.</p><p>As a reminder, here is the graph:</p><p><img alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_thumb_3.png"></p><p>And starting from Arava, I want to find a dog that likes Arava that also likes a dog that isn’t liked by Arava.</p><p>The LemonGraph query language isn’t very expressive, or at least I’m not familiar enough with it to make it work properly. I decided on the following query:</p><blockquote><p>n(type="dog", value="arava")-><br> @e(type="likes", value="yes")-><br>n(type="dog")-><br> @e(type="likes", value="yes")-><br>n(type="dog")-><br> @e(type="likes", value="no")-><br>@N(type="dog", value="arava")</p></blockquote><p>This is a bit of a brute force method to do this. It encodes the path directly. There are a few minor things that might not be obvious here. The @ prefix means don’t return this to the user and the N() indicates that we shouldn’t filter already seen values. I can certainly see how this can be useful for certain things <img class="wlEmoticon wlEmoticon-smile" style="" alt="Smile" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_CB05/wlEmoticon-smile_2.png">. I wonder if LemonGraph has a better way to express such a query.</p><p>This is the first time I actually reviewed this kind of codebase, where some things are done in C and some in Python. It was quite interesting to see the interaction between them. The codebase itself is really interesting, but I found it really hard to follow at times. The love affair with tuples and dynamic behavior made the code non trivial and likely is going to cause maintenance issues down the line. It is also quite obvious that this is intended for internal consumption, with very little time or effort spent on “productization”. By that I meant things like better query errors and more obvious thing to do. </p><p>It has an interesting approach to solving the problem of graph queries and I’ve learned quite a few things from it. </p><p><a name="comments"></a>https://www.ayende.com/blog/184102-C/reading-the-nsas-codebase-lemongraph-review-part-vii-summary?Key=807f8f07-e2b5-4f29-bad6-747ad30514behttps://www.ayende.com/blog/184102-C/reading-the-nsas-codebase-lemongraph-review-part-vii-summary?Key=807f8f07-e2b5-4f29-bad6-747ad30514beMon, 13 Aug 2018 09:00:00 GMTReading the NSA’s codebase: LemonGraph review–Part VI–Executing queries<p>After going over many layers inside <a href="https://github.com/NationalSecurityAgency/lemongraph">LemonGraph</a>, I think we are ready to get to the real deal. We know how LemonGraph starts to execute a query. It find the clause that is likely to have the least number of items and starts from there. These are the seeds of the query, but how is it going to actually process that?</p><p>Here is my graph:</p><p><img alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_thumb_3.png"></p><p>And here is the query that I want to run on it: n()->e(type="likes", value="yes")->n()</p><p>LemonGraph detects that the cheapest source of seeds for this query is the edge and provides these to the <em>MatchCTX </em>class which does the actual processing. Let’s explore how it is working on a single seed. </p><p>The actual behavior starts from the <em>matches()</em> method, which starts here:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_2.png"><img width="405" height="478" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb.png" border="0"></a></p><p>As usual for this codebase, the use of tuples for controlling behavior is prevalent and annoying. In this case, the <em>idx</em> argument controls the position of the target in the query, whatever this is the start, end or somewhere in the middle. The deltas define what direction the matches should go and the stops where you must stop searching, I think.</p><p>Now that we setup the ground rules, the execution is as follows:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_4.png"><img width="1108" height="310" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_1.png" border="0"></a></p><p>In this case, because the seed of our query is the edge (which is in the middle) it determines that <em>deltas </em>are (1, –1) and <em>stops </em>are (2, 0). We’ll also go to the first clause in the if statement. The <em>link</em> variable there controls whatever we should follow links going in or out. </p><p>There is a <em>lot</em> of logic actually packed into the <em>link</em> initialization. The <em>self.link </em>is an array that has ‘next’ and ‘prev’ as its values, so the idea is that it find what property to look at, then use the dictionary syntax to get the relevant property and decide what kind of direction it should go. </p><p>We’ll focus on the <em>_recurse()</em> method for now. In this case, we are being passed:</p><ul><li>idx = 1</li><li>delta = 1</li><li>stop = 2</li></ul><p>Here is the actual method:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_6.png"><img width="714" height="376" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_2.png" border="0"></a></p><p>As you can see, we first validate that the match is valid (this is where we mostly check other properties of the value that we are currently checking, filtering them in place). Usually the <em>do_seen </em>will be set to <em>True</em>, which will ensure that we only traverse each node once. The main iteration logic is done in combination with the <em>_next()</em> method, shown below:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_8.png"><img width="798" height="284" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_3.png" border="0"></a></p><p> The first line there shows usage of delta, which control in what direction to move. I’m not quite sure why the <em>filter</em> is set in the if statements, since it is also being set immediately after. This looks like redundant code that snuck in somehow. </p><p>The bulk of the work is shelled out to <em>iterlinks</em>, so let’s check what is going on there… I know that we are running on an edge here, so we need to look at the <em>Edge.iterlinks()</em> method:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_10.png"><img width="903" height="407" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_4.png" border="0"></a></p><p>There isn’t really much to tell here, it checks the filters and return the incoming or outgoing edges, nothing more. On the other hand, the <em>Node.iterlinks() </em>implementation is a bit simpler:<a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_12.png"><img width="689" height="157" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_5.png" border="0"></a></p><p>We’ll explore exactly how the edges are loaded for a node in a bit, however, I do want to note right now that the <em>_next() </em>method isn’t passing the types of the edge, even though it looks to me that it has this information. In that case, that would give a performance boost because it could filter a lot of edges in busy graphs.</p><p>Actually, I already looked at how iterators working in <a href="https://ayende.com/blog/184098-C/reading-the-nsas-codebase-lemongraph-review-part-iii-figuring-out-queries?key=3841c2073e084d37bf8d9df46662bb63">a previous post</a>, so I won’t go over all of that again. This is basically calling this method</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_14.png"><img width="1246" height="348" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_6.png" border="0"></a></p><p>The <em>graph_node_edges_in()</em> and <em>graph_node_edges_out()</em> methods are pretty simple, basically just scanning the <em>DB_TGTNODE_IDX</em> or <em>DB_SRCNODE_IDX </em>indexes, respectively. The <em>graph_node_edges()</em> however, is really strange. </p><p>Here it is:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_16.png"><img width="924" height="206" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_7.png" border="0"></a></p><p>It took me a few reads to figure out that this is probably a case of someone unpacking a statement for debugging but forgetting to remove the packed statement. The second return statement is ignored and likely stripped out as unreachable, but it <em>is </em>pretty confusing to read.</p><p>The <em>graph_iter_concat()</em> answers a question I had about how <em>graph_iter_t</em> is used, here is how it looks like:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_18.png"><img width="669" height="454" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_8.png" border="0"></a></p><p>This is using C’s support for passing an unknown number of parameters to a method. This basically builds a simple linked list, which also answers the usage of <em>_blarf() </em>function and its behavior a few posts ago.</p><p>So we are back were we started, we understand how the data flows into the Python code, now let’s go back and look at <em>_recurse()</em> and <em>_next()</em> calls.</p><p>Now I know a lot more about the behavior of the code, so this make sense. The <em>stop</em> argument to the <em>_recurse()</em> control the depth of the number of links that would be traversed in a particular query. Now that I know how this particular clause work, I understand how LemonGraph goes from the edge to the node in e()->n(), but I don’t know how it goes to back to find the full path. </p><p>The key for that are the calls to <em>push()</em> and <em>pop() </em>in the the <em>MatchCtx._recurse()</em> methods. These update the chain, like so:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_20.png"><img width="924" height="129" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_9.png" border="0"></a></p><p>In processing the query, we first append the edge to the chain:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_22.png"><img width="499" height="50" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_10.png" border="0"></a></p><p>The next step goes from the edge to it’s destination, giving us Oscar:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_24.png"><img width="471" height="70" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_11.png" border="0"></a></p><p>Remember that in the <em>matches()</em>, we got into this if clause:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_26.png"><img width="1095" height="204" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_12.png" border="0"></a></p><p>This is when we start scanning an edge, which first add things to the right, then it scan things to the <em>left </em>in the <em>_next()</em> method. Look at the <em>push() </em>method there. By the time we get to the <em>result()</em>, we have iterated both sides of the connection, giving us:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_28.png"><img width="490" height="91" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_A4B1/image_thumb_13.png" border="0"></a></p><p>That is a very clever way of doing things. </p><p>Let’s try doing things a little bit differently. I’m going to check a slightly different query: n(type=”dog”, value=”arava”)->e(type="likes", value="yes")->n()</p><p>If I understand things correctly, this is going to select the node as the starting point, because that is a lot more efficient. That will allow me to figure out the other side of this operation. I just run this through the debugger, and this is indeed how it actually works. </p><p>The usage of <em>yield</em> to pass control midway is not trivial to follow, but it ends up being quite a nice implementation. This is enough for this post. In my next one, I want to explore a few of the possible operations that exposed by LemonGraph.</p>https://www.ayende.com/blog/184101-C/reading-the-nsas-codebase-lemongraph-review-part-vi-executing-queries?Key=fc7bc515-816f-4026-a643-4c2ecde64b56https://www.ayende.com/blog/184101-C/reading-the-nsas-codebase-lemongraph-review-part-vi-executing-queries?Key=fc7bc515-816f-4026-a643-4c2ecde64b56Fri, 10 Aug 2018 09:00:00 GMTReading the NSA’s codebase: LemonGraph review–Part V–Query parsing<p>I <a href="https://ayende.com/blog/184098-C/reading-the-nsas-codebase-lemongraph-review-part-iii-figuring-out-queries?key=3841c2073e084d37bf8d9df46662bb63">said before</a> that I don’t want to get into the details of how <a href="https://github.com/NationalSecurityAgency/lemongraph">LemonGraph</a> is dealing with parsing the queries. Unfortunately, I can’t avoid that. There seems to be a lot of logic, magic and mystery in the <em>MatchLGQL()</em> class, which is critical to understanding how queries work.</p><p>The problem is that either my Python-fu is lacking or it is just really hard to figure out a non trivial codebase behavior in a dynamic language like python. I find it hard to figure out what data is stored where and how it is manipulated. Therefor, I decided to break with my usual custom and actually run the code in the debugger to try to follow what is going on there. I tried to run this on WSL, but it crashed horribly, so I had to spin up a VM and setup PyCharm on it. First time that I’m actually using that and the experience is pretty nice so far. Being able to inspect things directly means that it is much easier to figure out the behavior of the code. </p><p>In order to explore how queries work in LemonGraph, I created the following graph, which represents the relationships between my dogs:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_8.png"><img width="590" height="624" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_thumb_3.png" border="0"></a></p><p>Here is how this looks like in code:</p><script src="https://gist.github.com/ayende/ea9c670550691decc79104a8cac59d95.js"></script><p>This tells us to find all the dogs that like each other. And it finds:</p><ul><li>Arava –> Oscar</li><li>Oscar –> Arava</li><li>Oscar –> Pheobe</li></ul><p>Now that we have a query that we can sink our teeth into, let’s figure out how this work, shall we? Inside the dreaded <em>MatchLGQL()</em> class, there are <em>all sorts</em> of regular expressions running on the parse this thing, but eventually we get to the partially processed parsed query:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_10.png"><img width="939" height="90" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_thumb_4.png" border="0"></a></p><p>This screen shot might explain why I wasn’t happy with the code structure for figuring out what is going on without debugging. The number of tuples here is quite amazing, and they are used everywhere. This make static analysis (as in, just reading the code) too hard for me. But with the debugger, that is much easier. If you are familiar with ASTs, this should be pretty easy to figure out.</p><p>Here is a piece of code that we already looked at (and criticized), this is in <em>munge_obj()</em> method, where it is deciding how to optimize the query:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_14.png"><img width="647" height="355" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_thumb_6.png" border="0"></a></p><p>This piece of code is critical for the performance of the system. And it is really hard to understand. Here is what is going on.</p><p>The <em>accel</em> array tell a later piece of code how to accelerate the query, using the type or type and value to start from a particular source. The <em>info</em> is used to carry state about particular clause in the query. Before this code run there is some code that builds the dictionary <em>d</em> which is used to figure out the filters on the particular clause. This is <em>fun</em>, because it is using missing a key lookup in the dictionary for control flow.</p><p>Let’s follow the logic?</p><ul><li>Line 2 - If the clause operates on a node, rank it as 6. If it is an edge, rank it as 7.</li><li>Line 6 – If the clause has a <em>type</em> specified, rank is as 4 if it is a node, 5 if it is an edge. Otherwise, abort the optimization.</li><ul><li>You might not see the “abort this optimization” in line 6, because it relies on the dictionary to throw if the key isn’t found. This is a common pattern in this code and something that I personally greatly dislike.</li></ul><li>Line 8 – it uses the <em>length </em>of the type as a metric for secondary ranking. I’m not quite sure why this is the case. I guess the code needed a tie breaker, but I can’t imagine why the length of a type would have any impact on performance.</li><ul><li>Unless, of course, the code assumes that shorter types are more common, and therefor will prefer to use the rare longer types?</li></ul><li>Line 10 – If there is a type <em>and</em> a value defined, that is even better. Note that again the is the ranking of node (2) and edge (3) which I find non obvious.</li></ul><p>Here are the results of the matches after they have been munged, I marked the ranking:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_16.png"><img width="1372" height="94" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_thumb_7.png" border="0"></a></p><p>Looking at this, this seems very strange, the <em>rank2</em> value is 1 in the second element, but I expected it to be the length of the string. As it turns out, this is not working directly on the string, it is working on the tuple of <em>possible values</em>, so the secondary ranking here is not based on the length of the type or the value but on the <em>number</em> of possible types and values that were specified for each clause.</p><p>The code judges that the best place to start this query is with the second entry, since it is the most specific option. This in turn takes us the the <em>seeds()</em> method that <a href="https://ayende.com/blog/184098-C/reading-the-nsas-codebase-lemongraph-review-part-iii-figuring-out-queries?key=3841c2073e084d37bf8d9df46662bb63">we previously covered</a>. In this case, the code is going to hit this branch:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_18.png"><img width="679" height="154" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_thumb_8.png" border="0"></a></p><p>This means that it is going to be iterating over all the edges of a particular type and filtering them in Python code. This is strange, because the on disk indexes actually support doing a direct query on the (type, value) directly and would probably be <em>much</em> cheaper in the case you have many values for a particular type of an edge. </p><p>In fact, just that is implemented for querying nodes by (type, value):</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_20.png"><img width="837" height="130" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_8B1B/image_thumb_9.png" border="0"></a></p><p>I’m guessing that they are either don’t have a lot of queries on (type, value) on edges or not a lot of different values for edge types that they can optimize in this manner.</p><p>That is enough for now, I have a pretty good grasp of how queries are parsed and how they fetch data from the underlying storage. The next post will talk about how LemonGraph takes the seeds of the query and execute the actual graph operations on them. The code that does this is <em>tight </em>and will require a full post to explore properly.</p>https://www.ayende.com/blog/184100-C/reading-the-nsas-codebase-lemongraph-review-part-v-query-parsing?Key=54a3d1e0-b3a1-4313-90a2-a8b973e6e335https://www.ayende.com/blog/184100-C/reading-the-nsas-codebase-lemongraph-review-part-v-query-parsing?Key=54a3d1e0-b3a1-4313-90a2-a8b973e6e335Thu, 09 Aug 2018 09:00:00 GMTReading the NSA’s codebase: LemonGraph review–Part IV–Compressed, sortable integers<p>Before going over the actual query implementation, I wanted to talk about something that I just realized. I <a href="https://ayende.com/blog/184066-C/reading-the-nsas-codebase-lemongraph-review-part-i?key=d248365d55574326bec98ba8d9e6699c">said previously</a> that I don’t understand why <a href="https://github.com/NationalSecurityAgency/lemongraph">LemonGraph</a> is using its integer encoding method, because it is less efficient than using variant sized integer. What I didn’t take into account is that the method LemonGraph is using gives short, but sortable, integers.</p><p>Here is the encoding method:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_9B6/image_2.png"><img width="1099" height="211" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_9B6/image_thumb.png" border="0"></a></p><p>Now, let’s see what kind of binary output this will generate when given a few numbers:</p><p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_9B6/image_4.png"><img width="492" height="147" title="image" style="margin: 0px; border: 0px currentcolor; border-image: none; display: inline; background-image: none;" alt="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_9B6/image_thumb_1.png" border="0"></a></p><p>The key here is that the number of bytes is stored as the first item. This means that when we compare two numbers using <em>memcmp</em>(), the number with more bytes is considered greater. This is indeed the case, because if you need more bytes to store a number, it is obviously larger.</p><p>What about two numbers that have the same number of bytes? This is handled by simple binary comparison of the values. If they are the same size, the fact that the <em>encode()</em> output them in big endian format means that we can compare them using <em>memcmp()</em> and be done with it.</p><p>This is a really nice way to both keep the values sorted and to avoid storing the full 8 bytes for numbers that are usually much smaller.</p>https://www.ayende.com/blog/184099-C/reading-the-nsas-codebase-lemongraph-review-part-iv-compressed-sortable-integers?Key=75aba3eb-e62c-4bd1-8603-13bc434eefb8https://www.ayende.com/blog/184099-C/reading-the-nsas-codebase-lemongraph-review-part-iv-compressed-sortable-integers?Key=75aba3eb-e62c-4bd1-8603-13bc434eefb8Wed, 08 Aug 2018 09:00:00 GMTReading the NSA’s codebase: LemonGraph review–Part III - Figuring out queries<p>After <a href="https://ayende.com/blog/184097-C/reading-the-nsas-codebase-lemongraph-review-part-ii?key=a59721f61eb1452ba72c97627cbea623">figuring out</a> how <a href="https://ayende.com/blog/184066-C/reading-the-nsas-codebase-lemongraph-review-part-i?key=d248365d55574326bec98ba8d9e6699c">LemonGraph</a> is storing data on disk, my next task is to figure out how queries are handled. Here are some queries:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_2.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb.png" alt="image" width="480" height="290" border="0" /></a></p>
<p>A query starts in the Python’s Query class, where it is parsed by the <a href="https://github.com/NationalSecurityAgency/lemongraph/blob/master/LemonGraph/MatchLGQL.py">MatchLGQL</a> class. I scanned this code briefly, but this is basically doing query parsing into the in memory representation. This is typically ugly piece of code, and that holds true for this code as well. Python’s dynamic nature also means that there isn’t an easy to follow set of AST classes. I’ll skip the details of query parsing and just see how it is actually handling the queries, then.</p>
<p>I started to look into the query execution and got lost in details that I didn’t understand. In particular, there is a very strong tie between the query parsing and the execution. More so than I expected. What brought this home was this piece of code, which is used to rank the most efficient manner in which you should start executing the query.</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_4.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_1.png" alt="image" width="907" height="603" border="0" /></a></p>
<p>At this point, I think that the idea here is that when you start running a query, you want to start from the smallest set of seed nodes. the ranking here seems to be a nice way to go about doing just that, but I’m not really sure yet how this is applied.</p>
<p>This is later used to figure out what the best starting location is in the <em>Query.finalize() </em>method.</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_6.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_2.png" alt="image" width="836" height="150" border="0" /></a></p>
<p>This all come together for me inside the <em>_adhoc()</em> method on Query, where the query is actually being run:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_8.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_3.png" alt="image" width="738" height="173" border="0" /></a></p>
<p>The <em>self.compiled</em> is the set of already parsed matches. On each of them, we create a context (which will track already visited nodes / edges) and start by finding the seeds on the query. Seeds are handled using… an interesting approach:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_10.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_4.png" alt="image" width="843" height="412" border="0" /></a></p>
<p>It took me a few reads to get what this code is trying to do and I still thing that this is an obnoxious way to write things. This basically does the other side of the ranking. It is using the ranking to decide which method to call. I think that an enum would be about three time more readable. Especially since a bit lower you have:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_12.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_5.png" alt="image" width="711" height="176" border="0" /></a></p>
<p>I have to say that the modulus usage is the sign of true genius. Or madness. I’m not sure which, because I can’t figure out what the history of this code is. This was the same way from the initial commit, but I think that this code has history from before the public release. And it might have a reason for this kind of code. But I don’t like it and I think it would have been<em> much</em> easier to read if it wasn’t using magic numbers all over the place.</p>
<p>At any rate, let’s assume that we have the simplest query, for all the nodes. This would send us to <em>txn.nodes()</em> method. This would be rank 6, by the way. Here is how this looks like:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_14.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_6.png" alt="image" width="922" height="387" border="0" /></a></p>
<p>As you can see, we have two modes here. If the rank was 6, we aren’t sent a type. But if the rank was 4, we are getting a type. I’m going to start from the search for types, which seems more interesting.</p>
<p>Here is where we end up in the C code:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_18.png"><img style="border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_8.png" alt="image" width="1344" height="464" border="0" /></a></p>
<p>Yes, the <em>graph_nodes_type()</em> method is calling <em>_graph_nodes_edges_type()</em> method. That was confusing as well to me. The key here is the DB_NODE_IDX index there, which tell it to use a different tree for the lookups.</p>
<p>The <em>graph_string_lookup()</em> is something that we already run into, this is the <em>__blob_resolve()</em> method, which is searching for the string id for the given type. The code starts to get interesting when we see the <em>graph_iter_new() </em>call:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_20.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_9.png" alt="image" width="1081" height="408" border="0" /></a></p>
<p>So we specify an iterator on the given prefix. From the previous post, you might recall that DB_NODE_IDX is specific as (type, val –> id). So this does a search on the first item that matches the prefix. The <em>_cleanse_beforeID()</em> method will ensure that the <em>beforeID </em>is only valid if it represent a value that is between 1 and the max log id that was generated on the graph.</p>
<p>The iterator we got from the <em>nodes()</em> method just implement’s Python iteration interface, starting from the <em>graph_iter_new()</em> item, then calling <em>graph_iter_next()</em> until the end. This is implemented like so:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_24.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_11.png" alt="image" width="819" height="295" border="0" /></a></p>
<p>Here we see for the first time the actual usage of <em>beforeID</em>. I’m not sure what this does yet, so we’ll skip this for now and look at the <em>_iter_idx_nextID()</em> method and come back to it later. This method is quite interesting. We have the head of the iterator, which is set to true in the iterator init. I’m not sure what this means yet. What seems to be interesting is that <em>_blarf()</em> method, which I understand to be a cry of frustration (I had to look it up).</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_26.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_12.png" alt="image" width="725" height="647" border="0" /></a></p>
<p>I’m not sure what the <em>next</em> pointer is doing there at all. We’ll look at that after I’ll inspect (with some measure of dread) the <em>_blarf()</em> function.</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_28.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_13.png" alt="image" width="996" height="294" border="0" /></a></p>
<p>To start with, I love the use of <em>goto</em> instead of <em>return</em> statements. I understand that this may be a coding convention here and this is used to clearly mark when resources are supposed to be disposed, but still…</p>
<p>The <em>iter_next_key()</em> ends up moving the cursor (and validating that the prefix is still valid). The <em>_parse_idx_logID()</em> call is here:</p>
<p><a href="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_30.png"><img style="margin: 0px; border: 0px currentcolor; display: inline; background-image: none;" title="image" src="https://ayende.com/blog/Images/Open-Live-Writer/Reading-the-NSAs-codebase-LemonGraph-rev_14AE1/image_thumb_14.png" alt="image" width="756" height="286" border="0" /></a></p>
<p>And this is just a fancy way of saying, gimme the last value in this buffer.</p>
<p>To understand what is going on, let’s go back a bit and understand what is going on here. We are actually scanning the index <em>DB_NODE_IDX</em>. And that index has the value of (type, val, id). Since the iteration is done in sorted order, this means that you can iterate over all the matches that match the type you want. The use of <em>beforeID</em> for filtering here, however, is interesting. I wonder how common is the use of historical queries like this are. Because if you need to skip over a lot of items, this will result in an O(N) operation while you are scanning and discarding a lot of data.</p>
<p>Anyway, there doesn’t seem to be any use of the <em>graph_iter_t.next</em> in this code path, so I’ll leave it for now. The iteration over all the nodes is done with the exact same method, but without specifying any prefix, which means that it matches everything.</p>
<p>I have to admit that at this point, I don’t really get how it process the more complex queries. I had to give up the “let’s not run the code” and try a few things out. Surprisingly, on WSL, the code just cause segmentation fault. I tracked it down to something in cffi, but I didn’t care that much. I created a simple ubuntu machine and played with it for a bit. So far, we just looked at how we get the first seeds of the query. A lot of the smarts seems to be hidden in the <em>MatchCTX</em> class. In particular, the <em>matches()</em> method seems to be doing a lot of magic.</p>
<p>I’m going to concentrate on that in my next post.</p>https://www.ayende.com/blog/184098-C/reading-the-nsas-codebase-lemongraph-review-part-iii-figuring-out-queries?Key=3841c207-3e08-4d37-bf8d-9df46662bb63https://www.ayende.com/blog/184098-C/reading-the-nsas-codebase-lemongraph-review-part-iii-figuring-out-queries?Key=3841c207-3e08-4d37-bf8d-9df46662bb63Tue, 07 Aug 2018 09:00:00 GMT