I wanted to test low-level file-system behavior in preparation for a new feature for RavenDB. Specifically, I wanted to look into hole punching - where you can give low-level instructions to the file system to indicate that you’re giving up disk space, but without actually reducing the size of the file.
This can be very helpful in space management. If I have a section in the file that is full of zeroes, I can just tell the file system that, and it can skip storing that range of zeros on the disk entirely. This is an advanced feature for file systems. I haven't actually used that in the past, so I needed to gain some expertise with it.
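For reference, here is a minimal sketch of what hole punching looks like on Linux (my reconstruction, error handling elided):

#define _GNU_SOURCE
#include <fcntl.h>

// Release the disk blocks backing [offset, offset + len).
// KEEP_SIZE ensures the logical file size stays unchanged.
int punch_hole(int fd, off_t offset, off_t len) {
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}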
The code for Windows is here if you want to see it. I tested the feature on both Windows & Linux, and it worked. I could see that while the file size was 128MB, I was able to give back 16MB to the operating system without any issues. I turned the code above into a test and called it a day.
And then the CI build broke. That shouldn’t have been possible, since I had tested exactly that, and there had been CI runs that did work on Linux. So I did the obvious thing and started running the code above in a loop.
I found something really annoying. This code worked, sometimes. And sometimes it just didn’t.
In order to get the actual size used on the disk, I need to run this code:
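A sketch of one way to do that, using stat: st_blocks counts the 512-byte blocks actually allocated on disk, regardless of the logical file size.

#include <sys/stat.h>

// Actual on-disk usage: punched holes don't count against st_blocks
long long allocated_bytes(int fd) {
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_blocks * 512;
}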
I’m used to weirdness from file systems at this point, but this is really simple. All the data is 4KB aligned (in fact, all the data is 16MB aligned). There shouldn’t be any weirdness here.
As you can see, I’m already working at the level of Linux syscalls, but I used strace to check if there is something funky going on. Nope, there was a 1:1 mapping between the code and the actual system calls issued.
That means that I have to debug deeper if I want to understand what is going on. This involves debugging the Linux Kernel, which is a Big Task. Take a look at the code in the relevant link. I’m fairly certain that the issue is in those lines. The problem is that this cannot be, since both offset & length are aligned to 4KB.
I got out my crystal ball and thinking hat and meditated on this. If you’ll note, the difference between the expected and actual values is exactly 4KB. It almost looks like the file itself is not aligned on a 4KB boundary, even though the holes within it must be.
Given that I just want to release this space to the operating system and 4KB is really small, I can adjust that as a fudge factor for the test. I would love to understand exactly what is going on, but so far the “file itself is not 4KB aligned, but holes are” is a good working hypothesis (even though my gut tells me it might be wrong).
If you know the actual reason for this, I would love to hear it.
And don't get me started on what happened with sparse files in macOS. There, the OS will randomly decide to mark some parts of your file as holes, making any deterministic testing really hard.
RavenDB has a hidden feature, enabled by default and not something that you usually need to be aware of. It has built-in support for caching. Consider the following code:
async Task<Dictionary<string, int>> HowMuchWorkToDo(string userId)
{
    using var session = _documentStore.OpenAsyncSession();
    var results = await session.Query<Item>()
        .GroupBy(x => new { x.Status, x.AssignedTo })
        .Where(g => g.Key.AssignedTo == userId && g.Key.Status != "Closed")
        .Select(g => new
        {
            Status = g.Key.Status,
            Count = g.Count()
        })
        .ToListAsync();
    return results.ToDictionary(x => x.Status, x => x.Count);
}
What happens if I call it twice with the same user? The first time, RavenDB will send the query to the server, where it will be evaluated and executed. The server will also send an ETag header with the response. The client will remember the response and its ETag in its own memory.
The next time this is called on the same user, the client will again send a request to the server. This time, however, it will also inform the server that it has a previous response to this query, with the specified ETag. The server, when realizing the client has a cached response, will do a (very cheap) check to see if the cached response matches the current state of the server. If so, it can inform the client (using 304 Not Modified) that it can use its cache.
In this way, we benefit twice:
First, on the server side, we avoid the need to compute the actual query.
Second, on the network side, we aren’t sending a full response back, just a very small notification to use the cached version.
You’ll note, however, that there is still an issue. We have to go to the server to check. That means that we still pay the network costs. So far, this feature is completely transparent to the user. It works behind the scenes to optimize server query costs and network bandwidth costs.
The next stage is to involve the user. Enter the AggressiveCache() feature (see the full documentation here), which allows the user to specify an additional aspect. Now, when the client has the value in the cache, it will skip going to the server entirely and serve the request directly from the cache.
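Here is a sketch of what enabling it can look like (the store and user id here are placeholders; see the linked documentation for the full API):

using (store.AggressivelyCache())
{
    // requests inside this scope may be answered entirely from the
    // client's local cache, without a round trip to the server
    var workload = await HowMuchWorkToDo("users/1-A");
}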
What about cache invalidation? Instead of having the client check on each request if things have changed, we invert the process. The client asks the server to notify it when things change, and until it gets notice from the server, it can serve responses completely from the local cache.
I really love this feature. That was the Good part; now let’s talk about the other pieces:
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton
The bad part of caching is that this introduces more complexity to the system. Consider a system with two clients that are using the same database. An update from one of them may show up at different times in each. Cache invalidation will not happen instantly, and it is possible to get into situations where the server fails to notify the client about the update, meaning that we didn’t clear the cache.
We have a good set of solutions around all of those, I think. But it is important to understand that the problem space itself is a problem.
In particular, let’s talk about dealing with the following query:
var emps = await session.Query<Employee>()
    .Include(x => x.Department)
    .Where(x => x.Location.City == "London")
    .ToListAsync();
When an employee is changed on the server, it will send a notice to the client, which can evict the item from the cache, right? But what about when a department is changed?
For that matter, what happens if a new employee is added to London? How do we detect that we need to refresh this query?
There are solutions to those problems, but they are super complicated and have various failure modes that often require more computing power than actually running the query. For that reason, RavenDB uses a much simpler model. If the server notifies us about any change, we’ll mark the entire cache as suspect.
The next request will have to go to the server (again with an ETag, etc.) to verify that the response hasn’t changed. If the specific query results haven’t changed, we’ll get 304 Not Modified from the server, and the client will use the cached response.
Conservatively aggressive approach
In other words, even when using aggressive caching, RavenDB still has to go to the server sometimes. What is the impact of this approach when you have a system under load?
We’ll still use aggressive caching, but you’ll see brief periods where we serve from the cache without checking with the server (usually for about a second or so), followed by queries to the server to check for any changes.
In most cases, this is what you want. We still benefit from the cache while reducing the number of remote calls by about 50%, and we don’t have to worry about missing updates. The downside is that, as application developers, we know that this particular document and query are independent, so we want to cache them until we get notice about that particular document being changed.
The default aggressive caching in RavenDB will not be of major help here, I’m afraid. But there are a few things you can do.
You can use Aggressive Caching in the NoTracking mode. In that mode, the client will not ask the server for notifications on changes, and will cache the responses in memory until they expire (clock expiration or size expiration only).
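In code, that might look something like this (a sketch; check the documentation for the exact signature, and note that the duration here is arbitrary):

using (store.AggressivelyCacheFor(
    TimeSpan.FromMinutes(5),
    AggressiveCacheMode.DoNotTrackChanges))
{
    // served from the local cache until the duration expires,
    // with no change-notification subscription to the server
    var emps = await session.Query<Employee>().ToListAsync();
}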
Another option is to implement this one level above RavenDB, while still using its capabilities. Since we have a scenario where we know that we want to cache a specific set of documents and refresh the cache only when those documents are updated, let’s write it.
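Here is a minimal sketch of the idea (the cache class and its names are mine; Changes() is RavenDB’s API, and the lambda-style Subscribe requires System.Reactive):

public sealed record UserSummary(string Id, string Name); // immutable record

public class CollectionCache<T>
{
    // document IDs in RavenDB are case insensitive, hence the comparer
    private readonly ConcurrentDictionary<string, T> _cache =
        new(StringComparer.OrdinalIgnoreCase);

    public CollectionCache(IDocumentStore store)
    {
        var changes = store.Changes();
        // on any connection status change we may have missed updates, so drop everything
        changes.ConnectionStatusChanged += (_, _) => _cache.Clear();
        // evict a document when the server notifies us that it changed
        changes.ForDocumentsInCollection<T>()
            .Subscribe(change => _cache.TryRemove(change.Id, out _));
    }

    public async Task<T> GetAsync(string id, IAsyncDocumentSession session)
    {
        if (_cache.TryGetValue(id, out var value))
            return value;
        value = await session.LoadAsync<T>(id);
        _cache.TryAdd(id, value);
        return value;
    }
}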
There are a few things to note about this code. We are holding live instances, so we ensure that the values we keep are immutable records. Otherwise, we may hand the same instance to two threads which can be… fun.
Note that document IDs in RavenDB are case insensitive, so we pass the right string comparer.
Finally, the magic happens in the constructor. We register for two important events. Whenever the connection status of the Changes() connection is modified, we clear the cache. This handles any lost updates scenarios that occurred while we were disconnected.
In practice, the subscription to events on that particular collection is where we ensure that after the server notification, we can evict the document from the cache so that the next request will load a fresh version.
Caching + Distributed Systems = 🤯🤯🤯
I’m afraid this isn’t an easy topic once you dive into the specifics and constraints we operate under. As I mentioned, I would love your feedback on the background cache refresh feature, or maybe you have better insight into other ways to address the topic.
I got into an interesting discussion on LinkedIn about my previous post, talking about Code Rot. I was asked about Legacy Code defined as code without tests and how I reconcile code rot with having tests.
I started to reply there, but it really got out of hand and became its own post.
I read Working Effectively with Legacy Code for the first time in 2005 or thereabouts. It left a massive impression on me and on the industry at large. The book is one of the reasons I started rigorously writing tests for my code; it got me interested in mocking and eventually led me to write Rhino Mocks.
It is ironic that the point of this post is that I disagree with that definition by Michael, precisely because of Rhino Mocks. Let’s start with numbers: the last commit to the Rhino Mocks repository was about a decade ago. It has just under 1,000 tests, and code coverage ranges between 95% and 100%.
I can modify this codebase with confidence, knowing that I will not break stuff unintentionally. The design of the code is very explicitly meant to aid in testing and the entire project was developed with a Test First mindset.
I haven’t touched the codebase in a decade (and it has been close to 15 years since I really delved into it). The code itself was written in .NET 1.1 around the 2006 timeframe. It literally predates generics in .NET.
It compiles and runs all tests when I try to run it, which is great. But it is still very much a legacy codebase.
It is a legacy codebase because changing this code is a big undertaking. This code will not run on modern systems. We need to address issues related to dynamic code generation between .NET Framework and .NET.
That in turn requires a high level of expertise and knowledge. I’m fairly certain that given enough time and effort, it is possible to do so. The problem is that this will now require me to reconstitute my understanding of the code.
The tests are going to be invaluable for actually making those changes, but the core issue is that a lot of knowledge has been lost. It will be a Project just to get it back to a normative state.
This scenario is pretty interesting because I am actually looking back at my own project. Thinking about having to do the same to a similar project from someone else’s code is an even bigger challenge.
Legacy code, in this context, means that there is a huge amount of effort required to start moving the project along. Note that if we had kept the knowledge and information within the same codebase, the same process would be far cheaper and easier.
Legacy code isn’t about the state of the codebase in my eyes; it is about the state of the team maintaining it. The team, with its knowledge and expertise, is far more important than the code itself.
An orphaned codebase, one with no one to take care of it, is a legacy project even if it has tests. Conversely, a project with no tests but with an actively knowledgeable team operating on it is not.
Note that I absolutely agree that tests are crucial regardless. The distinction that I make between legacy projects and non-legacy projects is whether we can deliver a change to the system.
Reminder: A codebase that isn’t being actively maintained and has no tests is the worst thing of all. If you are in that situation, go read Working Effectively with Legacy Code, it will be a lifesaver.
I need a feature with an ideal cost of X (time, materials, effort, cost, etc). A project with no tests but people familiar with it will be able to deliver it at a cost of 2-3X. A legacy project will need 10X or more. The second feature may still require 2X from the maintained project, but only 5X from the legacy system. However, that initial cost to get things started is the killer.
In other words, what matters here is the inertia, the ability to actually deliver updates to the system.
A customer called us about some pretty weird-looking numbers in their system:
You’ll note that the total number of entries in the index does not match across all the nodes: node C has one less entry than the rest of the system.
At the same time, all the indicators are green. As far as the administrator can tell, there is no issue, except for the number discrepancy. Why is it behaving in this manner?
Well, let’s zoom out a bit. What are we actually looking at here? We are looking at the state of a particular index in a single database within a cluster of machines. When examining the index, there is no apparent problem. Indexing is running properly, after all.
The actual problem was a replication issue, which prevented replication from proceeding to the third node. When looking at the index status, you can only see that the entry count is different.
When we zoom out and look at the state of the cluster, we can see this:
There are a few things that I want to point out in this scenario. The problem here is a pretty nasty one. All nodes are alive and well, they are communicating with each other, and any simple health check you run will give good results.
However, there is a problem that prevents replication from properly flowing to node C. The actual details aren’t relevant (a bug that we fixed, to tell the complete story). The most important aspect is how RavenDB behaves in such a scenario.
The cluster detected this as a problem, marked the node as problematic, and raised the appropriate alerts. As a result of this, clients would automatically be turned away from node C and use only the healthy nodes.
From the customer’s perspective, the issue was never user-visible since the cluster isolated the problematic node. I had a hand in the design of this, and I wrote some of the relevant code. And I’m still looking at these screenshots with a big sense of accomplishment.
This stuff isn’t easy or simple. But to an outside observer, the problem started from: why am I looking at funny numbers in the index state in the admin panel? And not at: why am I serving the wrong data to my users.
The design of RavenDB is inherently paranoid. We go to a lot of trouble to ensure that even if you run into problems, even if you encounter outright bugs (as in this case), the system as a whole would know how to deal with them and either recover or work around the issue.
As you can see, live in production, it actually works and does the Right Thing for you. Thus, I can end this post by saying that this behavior makes me truly happy.
I was talking to a colleague about a particular problem we are trying to solve. He suggested that we solve the problem using a particular data structure from a recently published paper. As we were talking, he explained how this data structure works and how that should handle our problem.
The solution was complex and it took me a while to understand what it was trying to achieve and how it would fit our scenario. And then something clicked in my head and I said something like:
Oh, that is just an epoch-based, copy-on-write B+Tree with a single producer and concurrent readers?
If this sounds like nonsense to you, it is fine. Those are very specific terms that we are using here. The point of such a discussion is that this sort of jargon serves a very important purpose. It allows us to talk with clarity and intent about fairly complex topics, knowing that both sides have the same understanding of what we are actually talking about.
The idea is that we can elevate the conversation and focus on the differences between what the jargon specifies and the topic at hand. This is abstraction at the level of ideas, where we can zoom out of a lot of details and still keep a high accuracy of intent.
Being able to discuss something at this level is hugely important because we can convey complex ideas easily. Once I managed to put what he was suggesting in context that I could understand, we were able to discuss the pros and cons of this data structure for the scenario.
I do appreciate that the conversation basically stopped making sense to anyone who isn’t already well-versed in the topic as soon as we were able to (from my perspective) clearly and effectively communicate.
“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean — neither more nor less.”
-- Lewis Carroll, Through the Looking-Glass
Clarity of communication is a really important aspect of software engineering. Being able to explain, hopefully in a coherent fashion, why the software is built the way it is and why the code is structured just so can be really complex. Leaning on existing knowledge and understanding can make that a lot simpler.
There is also another aspect. When using jargon like that, it is clear when you don’t know something. You can go and research it. The mere fact that you can’t understand the text tells you both that you are missing information and where you can find it.
For software, you need to consider two scenarios. Writing code today and explaining how it works to your colleagues, and looking at code that you wrote ten years ago and trying to figure out what was going on there.
In both cases, I think that this sort of approach is a really useful way to convey information.
I usually talk about the things that I do that were successful. Today I want to discuss something that I tried but failed at. Documenting failed approaches is just as important as documenting successes, though less enjoyable.
In order to explain the failure, I need to get a bit deeper into how computers handle memory. There is physical memory, the RAM sticks that you have in your machine, and then there is how the OS and CPU present that memory to your code. Usually, the abstraction is quite seamless, and we don’t need to pay attention to it.
Occasionally, we can take advantage of this model. Consider the following memory setup, showing a single physical memory page that was mapped in two different locations:
In this case, it means that you can do things like this:
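Here is a minimal sketch of setting that up on Linux (error handling elided; memfd_create is one way to get a page you can map twice):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = memfd_create("page", 0); // anonymous in-memory file
    ftruncate(fd, 4096);              // exactly one page long
    char *page1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *page2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(page1, "hello");          // write through the first mapping...
    printf("%s\n", page2);           // ...and read it back via the second
    return 0;
}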
In other words, because the two virtual pages point to the same physical page in memory, we can modify memory in one location and see the changes in another. This isn’t spooky action at a distance, it is simply the fact that the memory addresses we use are virtual and they point to the same place.
Note that in the image above, I modified the data using the pointer to Page 1 and then read it from Page 2. The Memory Management Unit (MMU) in the CPU can do a bunch of really interesting things because of this. You’ll note that each virtual page is annotated with an access permission.
In this case, the second page is marked as Copy on Write. That means that when we read from this page, the MMU will happily read the data from the physical page it is pointed to. But when we write, the situation is different.
The MMU will raise an exception to the operating system, telling it that a write was attempted on this page, which is forbidden. At this point, the OS will allocate a new physical page, copy the data to it, and then update the virtual address to point to the new page. Here is what this looks like:
Now we have two distinct mappings. A write to either one of them will not be reflected on the other. Here is what this looks like in code:
*page1 = '1'; // now both pages read '1'
printf("Page1: %c, Page2: %c\n", *page1, *page2);
// output: Page1: 1, Page2: 1

*page2 = '2'; // force the copy on write to occur
printf("Page1: %c, Page2: %c\n", *page1, *page2);
// output: Page1: 1, Page2: 2
As long as the modifications happened through the first page address (the orange one in the image), there was no issue and any change would be reflected in both pages. When we make a modification to the second page (the green one in the image), the OS will create a new physical page and effectively split them forever.
Changes made to either page will only be reflected in that page, not both, since they aren’t sharing the same page.
Note that this behavior applies at a page boundary. What happens if I have a 1GB buffer and create a copy-on-write mapping on top of it?
The amount of physical memory that I would consume is still just 1GB.
In fact, I get what is effectively a very fast memcpy(), since I’m not actually copying anything. And for all intents and purposes, it works. I can change the data through the second buffer, and it will not show up in the first buffer. Of particular note is that when I modify the data through the second buffer, only a single page is changed. Here is what this looks like:
So instead of having to copy 1GB all at once, we map the buffer again as copy on write, and we can get a new page whenever we actually modify our “copy” of the data.
So far, this is great, and it is heavily used for many optimizations. It is also something that I want to use to implement cheap snapshots of a potentially large data structure. The idea that I have is that I can use this technique to implement it.
Here is the kind of code that I want to write:
var list = new CopyOnWriteList();
list.Put(1);
list.Put(2);
var snapshot1 = list.CreateSnapshot();
list.Put(3);
var snapshot2 = list.CreateSnapshot();
list.Put(4);
And the idea is that I’ll have (at the same time) the following:
list:      1,2,3,4
snapshot1: 1,2
snapshot2: 1,2,3
I want to have effectively unlimited snapshots, and the map may contain a large amount of data. In graphical form, you can see it here:
We started with Page 1, created a copy-on-write mapping (Page 2), modified Page 2 (breaking the copy-on-write link), and then attempted to create another copy-on-write mapping on top of Page 2. That turns out to be a problem.
Let’s see the code that we need in order to create a copy using a copy-on-write mapping. Take a look at the APIs we have for this:
MapViewOfFile(hMapFile, FILE_MAP_COPY, 0, 0, 4096); // windows
mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, shm_fd, 0); // linux
A key aspect of the API is that we need to provide a source for the copy-on-write operation, which means we can only create a copy-on-write from a single source. We cannot perform a copy-on-write on top of a page that was itself marked as copy-on-write, because there is nothing we can hand to the API as the source for such a mapping.
I tried being clever and wrote the following code on Linux:
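(A reconstruction of the idea, error handling elided:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    char *orig = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int fd = open("/proc/self/mem", O_RDONLY);
    // try to create a copy-on-write view of our own memory, using the
    // original mapping's address as the file offset
    char *copy = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE,
                      fd, (off_t)(uintptr_t)orig);
    if (copy == MAP_FAILED)
        perror("mmap"); // fails with ENODEV: /proc/self/mem cannot be mmap'ed
    return 0;
}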
On Linux, you can use the special file /proc/self/mem to refer to your memory using file I/O. That means that I can get a file descriptor for my own memory, which provides a source for my copy-on-write operation.
I was really excited when I realized that this was a possibility. I spent a lot of time trying to figure out how I could do the same on Windows. However, when I actually ran the code on Linux, I realized that this doesn’t work.
The mmap() call will return ENODEV when I try that. It looks like this isn’t a supported action.
Linux has another call that looks almost right, mremap(), but that either zeros out the region or requires setting up a userfaultfd handler for it. So it can’t serve my needs.
This is quite annoying since we are almost there. All the relevant pieces are available, if we had a way to tell the kernel to create the mapping, everything else should just work from there.
Anyway, this is my tale of woe, trying (and failing) to create a snapshot-based system using the Memory Management Unit. Hopefully, you’ll either learn something from my failure or let me know that there is a way to do this…
Reading code is a Skill (with a capital letter, yes) that is really important for developers. You cannot be a good developer without it.
Today I want to talk about one aspect of this. The ability to go into an unfamiliar codebase and extract one piece of information out. The idea is that we don’t need to understand the entire system, grok the architecture, etc. I want to understand one thing about it and get away as soon as I can.
For example, you know that project Xyz is doing some operation, and you want to figure out how this is done. So you need to look at the code and figure that out, then you can go your merry way.
LMDB is an embedded database engine (similar to Voron, and in fact, Voron is based on some ideas from LMDB) written in C. If you are interested in it, I wrote 11 posts going through every line of code in the project.
So I’m familiar with the project, but the last time I read the code was over a decade ago. From what I recall, the code is dense. There are about 11.5K lines of code in a single file, implementing the entire thing.
The first thing to do is find the relevant section in the code. I started by searching for the WriteFile() function, the Win32 API to write. The first occurrence of a call to this method is in the mdb_page_flush function.
I look at this code, and… there isn’t really anything there. It is fairly obvious and straightforward code (to be clear, that is a compliment). I was expecting to see a trick there. I couldn’t find it.
That meant either the code had a gaping hole and potential data corruption (highly unlikely) or I was missing something. That led me to a long trip of trying to distinguish between documented guarantees and actual behavior.
The documentation for MapViewOfFile is pretty clear:
A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function.
I have had my own run-ins with this behavior, which were super confusing. That means I had experimental evidence saying that this is broken. But it didn’t make sense: there was no code in LMDB to handle it, and this is pretty easy to trigger.
It turns out that while the documentation is pretty broad about not guaranteeing the behavior, the actual issue only occurs if you are working with remote files or using unbuffered I/O.
If you are working with local files and buffered I/O (which is 99.99% of the cases), then you can rely on this behavior. I found some vague references to this, but that wasn’t enough. There is this post that is really interesting, though.
I pinged Howard Chu, the author of LMDB, for clarification, and he was quick enough to assure me that yes, my understanding was (now) correct. On Windows, you can mix memory map operations with file I/O and get the right results.
The documentation appears to be a holdover from Windows 9x, with the NT line always being able to ensure coherency for local files. This is a guess about the history of documentation, to be honest. Not something that I can verify.
I had the wrong information in my head for over a decade. I did not expect this result when I started this post, I was sure I would be discussing navigating complex codebases. I’m going to stand in the corner and feel upset about this for a while now.
Today I got in my car to drive to work and realized that Waze suggested “Work” as the primary destination to select. I had noticed that before, and it is a really nice feature. Today, I got to thinking about how I would implement something like that.
That was a nice drive since I kept thinking about algorithms and data flow. When I got to the office, I decided to write about how we can implement something like that. Based on historical information, let’s suggest the likely destinations.
Here is the information we have:
The Lat & Lng coordinates represent the start location, the time is the start time for the trip, and the destination is obvious. In the data set above, we have trips to and from work, to the gym once a week, and to our parents over the weekends.
Based on this data, I would like to build recommendations for destinations. I could try to analyze the data and figure out all sorts of details. The prediction that I want to make is, given a location & time, to find where my likely destination is going to be.
I could try to analyze the data on a deep level, drawing on patterns, etc. Or I can take a very different approach and just throw some computing power at the problem.
Let’s talk in code since this is easier. I have a list of trips that look like this:
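Presumably something like this, given how it is used below:

record Trip(double Lat, double Lng, DateTime Time, string Destination);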
I’m going to start by processing the trips data, to extract the relevant information:
var historyByDest = new Dictionary<string, List<double[]>>();
foreach (var trip in trips)
{
    if (historyByDest.TryGetValue(trip.Destination, out var list) is false)
    {
        historyByDest[trip.Destination] = list = new();
    }
    list.Add([
        trip.Lat,
        trip.Lng,
        trip.Time.Hour * 100 + trip.Time.Minute, // time of day, encoded as HHMM
        trip.Time.DayOfYear,
        (int)trip.Time.DayOfWeek
    ]);
}
What this code does is extract details (location, day of the week, time of day, etc.) from the trip information and store them in an array. For each trip, we basically break apart the trip across multiple dimensions.
The next step is to make the actual prediction we want, which will begin by extracting the same dimensions from the inputs we get, like so:
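Assuming the current position and time are in lat, lng, and now (names of my choosing), that looks like:

double[] compare = [
    lat,
    lng,
    now.Hour * 100 + now.Minute, // time of day, encoded the same way
    now.DayOfYear,
    (int)now.DayOfWeek
];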
Now we basically have an array of values from which we want to predict, and for each destination, an array that represents the same dimensions of historical trips. Here is the actual computation:
List<(string Dest, double Score)> scores = new();
foreach (var (dest, items) in historyByDest)
{
    double score = 0;
    foreach (var cur in items)
    {
        for (var i = 0; i < cur.Length; i++)
        {
            score += Math.Abs(cur[i] - compare[i]);
        }
    }
    score /= items.Count;
    scores.Add((dest, score));
}
scores.Sort((x, y) => x.Score.CompareTo(y.Score));
What we do here is compute the difference between the two arrays: the current start location & time compared to the start location & time of historical trips. We do that not only on the raw data but also extract additional features from the information.
For example, one dimension is the day of the week, and the other is the time of day. It is not sufficient to compare just the date itself.
The end result is the distance between the current trip start and previous trips for each of the destinations I have. Then I can return the destinations that most closely match my current location & time.
Running this over a few tests shows that this is remarkably effective. For example, if I’m at home on a Saturday, I’m very likely to visit either set of grandparents. On Sunday morning, I head to the Gym or Work, but on Monday morning, it is more likely to be Work.
All of those were mostly fixed, with the day of the week and the time being different. But if I’m at my parents’ house on a weekday (which is unusual), the location would have a far greater weight on the decision, etc. Note that the code is really trivial (I spent more time generating the actual data), but we can extract some nice information from this.
The entire code is here; admittedly, it’s pretty dirty code, since I wanted to test how this would actually work. At this point, I’m going to update my Curriculum Vitae and call myself a senior AI developer.
Joking aside, this approach provides a good (although highly simplified) overview of how modern AI systems work. Given a data item (image, text, etc.), you run that through the engine that outputs the embedding (those arrays we saw earlier, with values for each dimension) and then try to find its nearest neighbors across multiple dimensions.
In the example above, I explicitly defined the dimensions to use, whereas LLMs have their own “secret sauce” for this. The concept, at a sufficiently high level, is the same.
I have a piece of code that has been living, rent-free, in my head for the past 30 years or so.
In middle school (I was 12 - 13 at the time), I was taught Pascal as the entry-level programming language. I found it to be a really fascinating topic, as you can imagine.
One of the things I did was try to read other people’s code and see what sense I could make out of it. That was way before the Internet was a thing, though. Even at that time, Pascal was mostly on its way out, and there wasn’t much code that a kid could access.
To give some context, at the time, if you wanted to move data from one place to the next, the only real option was to physically disassemble your computer, take out the hard disk, wrap it in a towel for protection, and carry it to your destination. I recall doing that and then spending some hours at a friend’s house, trying to figure out the right jumper configuration so the BIOS would recognize two drives at once.
Because of this, I had a great interest in disk parking, a technique that helps ensure that taking the hard drive from one location to the next wouldn’t ruin it. I remember running into a particular piece of code to handle disk parking in Pascal and being quite excited by this. Here I had something that I needed to do, and also in a language that I was familiar with.
I remember staring at the code and trying to make sense of it for a while. I couldn’t figure it out. In fact, I couldn’t even begin to understand what was going on there. I remember feeling both frustrated and stupid because I could understand the syntax but not what it was doing.
In fact, it was so far above my head that I was upset about it for days, to the point that decades later, it still pops into my head. When it came back for a visit this week, I decided to try to figure it out once and for all.
That led to the first problem, which is that I cannot actually recall the code. I remember that it was in Pascal, that it dealt with parking the disk needles, and that it was full of weird numbers that I couldn’t grasp. Turning to Google didn’t help much; I found some code samples, but nothing that really matched what I recalled. I ended up cobbling together something that more or less matches what I had in my head.
program DiskPark;

uses Dos;

function ParkHead(drive: char): Boolean;
var
  regs: Registers;
begin
  with regs do
  begin
    AH := $13;
    AL := Ord(drive) - Ord('A') + $80;
    DL := AL;
    Intr($13, regs);
    ParkHead := (Flags and FCarry) = 0;
  end;
end;

begin
  if ParkHead('A') then
    WriteLn('Success: parked')
  else
    WriteLn('Failure, no idea why!');
end.
The actual code that I remember was beefier, and I think it handled a bunch of additional scenarios, but it has been 30 years; I can’t recall it.
Amusingly enough, I actually had to update my publishing software, since I never thought I would be publishing Pascal code snippets 🙂.
What is interesting is that, as I look at this code and try to view it through the glasses of my 12-year-old self, I can understand exactly the frustration.
Look at this code, it is filled with random numbers and letters. What is AH or the Intr() call? I mean, looking at this, there is no place to get started understanding this.
Let’s look at this line:
AH := $13;
There is no definition of AH anywhere, I can’t tell where it is coming from. And $13, what is that about? A bad luck omen because I can’t figure it out, I think…
The problem was that my 12-year-old self wrote programs like “Guess the number”, and really advanced stuff was bubble sort. I approached the code above with the same mindset. There is a logic to this that I can figure out if I can just stare at this long enough.
The problem is that the code above is basically arbitrary, there is nothing intrinsic about it.
The weird variables that come from nowhere (AL, DL, and AH) are registers (at the time, I don’t believe I was even aware that those existed). The weird numbers are just arguments that we pass to the BIOS when you invoke a command.
In the same sense, consider this code:
return syscall(437, ptr, 525376);
This line will tell you absolutely nothing about the reasoning behind the code. The numbers seem arbitrary, but in this case they correspond to calling SYS_openat2 (syscall 437 on x86-64) with the flags O_APPEND | O_CREAT | O_CLOEXEC.
The whole program above is just the way you have to go to invoke a system call (as we would call it today), and there is no logic or sense to it. Just something that you have to know. And once you do know, all of this can be packaged into a much simpler, higher-level concept: the system call. (Yes, it is an interrupt invocation, not the same thing. You may go stand in the pedantic corner, mind.)
Hopefully, I won’t ever have to think about this program again. I managed to not only understand what it does, but actually grok it.
I think that the following requirements are reasonable (loosely based on what SecureDrop aims for):
Completely anonymous:
- No accounts, no registrations.
- No server-side state about users.
Prevent metadata tracking:
- It’s not just message contents that are hidden.
- Cannot tell if A talks to B or C.
- Accessible via Tor to protect against traffic analysis.
- Cannot tell who is talking to me at all.
Assume that the server may be compromised by a malicious entity:
- No special trust should be granted to the server.
Reasoning
(You can skip this part to read the actual implementation below)
Let’s consider an actual usage scenario. We have Whistleblower Will (WW from now on) who wants to send sensitive information to Journalist Jane (JJ from now on). Let’s assume that the adversary in this case is a Big Bad Behemoth. I believe that the current villain du jour is Boeing, which is a huge company and had a couple of strange incidents with whistleblowers recently.
If WW wants to send stuff to JJ, why can’t he just email [email protected] from his personal email [email protected] ? Practically speaking, email today is being sent using TLS anyway, so we can assume that no one can read the message in the middle. Moreover, even pretty sophisticated traffic analysis would find it difficult to track the email, as WW is talking to AOL (or GMail, or their likely email provider), and then AOL is communicating with the nice.journalist server (which is likely hosted by Exchange, Gmail, etc.) In other words, any such message would likely be lost in the noise of regular traffic.
However, while that is likely to happen, it isn’t guaranteed. Given the risk of life & limb involved in our scenario, we would like to ensure that this is the case. I’m writing this post because it is an interesting scenario, not because I actually have a use case. As usual in my encryption posts, this is merely my musings on the matter, don’t take me as an authority on the subject. That said, I’m actually quite interested in realistic threat models here. Please provide any feedback, I would love to know more about this.
One thing to pay attention to in this scenario, however: if my threat model is a Bad Company, that is one thing, but what if my threat model includes a nation state? I would point you to this wonderful article: This World of Ours by James Mickens, which manages to be both hilarious and informative. The level of capability that you face when your opponent is a nation state is high.
When thinking about nation-states and whistleblowers… Snowden is the first name that comes to mind. In this case, the issue isn’t whether there is some plaintext being sent over the wire for everyone who cares to listen. Assume you are sending such an email from AOL to Gmail. Both companies will provide any data they have, including the full contents of any messages, if provided with an appropriate court order.
Moving from legal (but dubious) actions to the other side, I’m fairly certain that it would take very little investigative work to find an appropriate person who:
Works at an email provider.
Has access to the email contents.
Is able to make use of an appropriate cash infusion in their life.
I would also further assume that the actual amount required is surprisingly low.
Another thing to consider is that our whistleblower may want to provide the information to the journalist, but may absolutely not want to be identified. That includes being identified by the journalist.
In short, the whole mindset when building something like a dead drop is extremely paranoid. With that in mind, let’s see how we can build such a system.
Implementation
The concept behind this system is that a journalist will publish their public key in some manner, probably in their newspaper. That is a fairly simple and obvious step, of course. The issue with needing a dead drop isn’t about being able to hide the contents of the messages. If that were the case, a PGP encrypted message would suffice. The real issue is that we want to hide the fact that we are even communicating at all.
Note that this isn’t about any form of instant messaging. This is about dropping a message (text, files, etc.) and having it sent securely to the other side. The key aspect is that not only are the contents hidden, but also the fact that we even sent it in the first place. And even if you are watching the source or the destination, you can’t tell who the other side is. For reference, think of spycraft techniques in the Cold War era.
Given a journalist public key, the whistleblower will package all the data they want to send in a zip file and encrypt that using the public key. That is easy enough, the problem now is that we need to push that file somewhere and then have the journalist get it. That is where our system has a role.
In the design of the system, I tried to intentionally reduce the amount of information that we provide to the server. That way, even if the server is malicious, there isn’t much that they can do with what they have.
I decided to use a serverless architecture here, for two primary reasons. The first reason is that I’m currently teaching a Cloud Development course and that was a nice project to play with. The second reason is that by running everything as a serverless function, I reduced the amount of data that I can easily aggregate since invocations are independent of one another.
From a client perspective, here is the manner in which I can send my sensitive information to a journalist. We need two pieces of information: the address of the dead drop (in this case I’m using a dummy .onion address) and the journalist’s public key. The public key is used to encrypt the information so only the journalist can see it.
Let’s look at the code first, and then discuss what is going on there:
BASE_URL ="https://deaddrop0j22dp4vl2id.onion" # example only
JOURNALIST_PUB_KEY = base64.b64decode('GVT0GzjFRvMxcDh9c6jpmXkHoGB5KoIp9vyU3RozT2A=')
data =open('secrets.zip','rb').read() # file to send
searler =SealedBox(PublicKey(JOURNALIST_PUB_KEY))
enc_file = searler.encrypt(data)
with torpy.http.requests() as s:
res = s.get(BASE_URL +"/upload-url").json()
s.put(res.get('url'),files={'file':enc_file}).raise_for_status()
file_id = res.get('id').encode('ascii')+ b'='
enc_id = searler.encrypt(base64.urlsafe_b64decode(base64_id))
s.put(BASE_URL +"/register-id", data=enc_id).raise_for_status()
The first step is to encrypt the data to send (in this case, the file secrets.zip) using SealedBox. That is the act of encrypting the data using a public key so only the corresponding private key can open it.
The next step is to use Tor to call GET /upload-url, getting back a JSON object with url and id properties.
The output is basically a random name of the file and an S3 presigned URL. Note that the file id is a base64 value, with the padding removed, which is why I have to add back the = character when decoding the value.
We can use that URL to upload the file to S3 (or a compatible service), and then we PUT /register-id with the encrypted file id, again using the journalist public key.
In other words, we have broken the upload process into four separate stages:
Encrypt the file
Request an upload url and get a random file id
Upload the encrypted file
Encrypt the file id and register it
The entire process is done over Tor, ensuring that our physical location is not leaked.
SecureDrop has a whole list of steps to take in order to increase your safety at this stage, by the way. Tor is basically just the start here, and physical security is of paramount importance.
This looks like a really convoluted way to do things, I know, but there is some logic behind everything. Let’s look at the actual implementation and see how all those pieces look from the other side. In terms of the architectural diagram, here is what we use to generate the upload URL:
The following code implements the logic for the upload-url endpoint. As you can see, there is really nothing here. We generate a random 32-byte token, which will serve as the file id, generate the pre-signed URL, and hand it over to the caller.
import json
import os
import secrets

import boto3

upload_bucket = os.environ.get('UPLOAD_BUCKET')
s3 = boto3.client('s3')

def generate_upload_url(event, context):
    id = secrets.token_urlsafe(32)
    resp = s3.generate_presigned_url('put_object',
        Params={'Bucket': upload_bucket, 'Key': 'uploads/' + id},
        ExpiresIn=3600)
    body = json.dumps({'id': id, 'url': resp})
    return {'statusCode': 200, 'body': body}
The caller is free to make one or more such calls and use (or not) the pre-signed URLs that it got. The backend doesn’t have any say in this, nor any way to influence the caller.
The intent here is that we split the responsibilities, instead of having an upload that the server can gather information about. If we assume a malicious server, then requesting an upload URL doesn’t provide much information, just the Tor exit node IP that was used.
Proper setup would mean that the S3 bucket we use is not logging anything. Even if we assume that it is doing so, the only data in the log would be the Tor exit node IP. A malicious server would also have access to the actual uploaded file, but that is not usable without the private key of the journalist.
At this point we have an uploaded file, but how do we actually get the journalist to know about it? This is the part where the real fun happens. First, let’s look at the architectural diagram, then we’ll discuss how it works in detail.
There are a lot of moving pieces here. The design is intentionally meant to be hard to pierce since we are trying to ensure that even if the system is operated by a malicious entity, it will still retain much of its secured capabilities. (Again, this is a mental exercise for me, something to do for fun. If your life/liberty is at stake here, you probably want to get a second opinion on this design).
How do we let the journalist know that we have a new file for them (and along the way, not let anyone else know about it)? The whistleblower has the file id from the server, the one used as the file name for the upload.
I’m using Libsodium SealedBox. SealedBox is a way to encrypt a value given a public key in such a way that only the owner of the associated private key can access it.
SealedBox has an envelope overhead of 48 bytes. In other words, encrypting a 32-byte id plus the 48-byte envelope gives us exactly 80 bytes.
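You can sanity-check that math with PyNaCl (an aside, not part of the system’s code):

from nacl.public import PrivateKey, SealedBox

sk = PrivateKey.generate()
box = SealedBox(sk.public_key)
print(len(box.encrypt(b'\x00' * 32))) # 80 == 32-byte id + 48-byte envelope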
The whistleblower will use SealedBox to encrypt the file name, and then call the register id endpoint. Here is the associated lambda backend for the register-id endpoint:
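Roughly, it looks like this (a sketch based on the description below; the handler names and responses are my assumptions):

import os

import boto3

queue_url = os.environ.get('NOTIFICATIONS_QUEUE')
sqs = boto3.client('sqs')

def register_id_internal(body):
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return {'statusCode': 201}

def register_id(event, context):
    body = event.get('body') or ''
    if len(body) != 108: # 80 encrypted bytes, base64'ed by the lambda infrastructure
        return {'statusCode': 400}
    return register_id_internal(body)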
It… doesn’t do much (you may have noticed a theme here). We simply verify that the size of the value matches, and then we send it to an SQS queue.
By the way, if we are sending 80 bytes, why are we receiving 108 bytes? This is because we are using lambda, and the binary data is being base64’ed by the lambda infrastructure.
Just registering the (encrypted) file id in the queue isn’t really helpful. It is just… sitting there; who is going to read or operate on it? That is where the two timers come into play. Note that the architecture diagram has events that happen every minute and every 5 minutes. Every minute, we have a lambda running to maybe publish a decoy value. Its code looks like this:
queue_url = os.environ.get('NOTIFICATIONS_QUEUE')
sqs = boto3.client('sqs')

def maybe_publish_decoy(event, context):
    if secrets.randbelow(4) != 0:
        return # 75% of the time, do nothing
    # 25% of the time, generate a decoy message
    return register_id_internal(
        base64.urlsafe_b64encode(secrets.token_bytes(80)).decode('ascii'))
The maybe_publish_decoy() lambda will usually do nothing, but 25% of the time it will register a decoy value in an SQS queue. Note that the actual message we generate is just random bytes, nothing meaningful.
Remember that when a user registers a file id, that id also ends up in the same queue. Because both user ids and decoy ids end up in the same queue, and they both look like random bits, there is no way for an external observer to tell which is which.
For that matter, there is no way to tell for the system itself. Once a decoy is posted to the queue, there is no way to know whether it is a decoy value or a real one.
The last component in our system is actually working with the messages in the queue. Those are handled by the publish_ids() lambda, which is invoked every 5 minutes. Let’s look at the code:
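Roughly (a sketch based on the description that follows; the queue plumbing and names are my assumptions):

import base64
import datetime
import os
import secrets

import boto3

queue_url = os.environ.get('NOTIFICATIONS_QUEUE')
bucket = os.environ.get('UPLOAD_BUCKET')
sqs = boto3.client('sqs')
s3 = boto3.client('s3')

def publish_ids(event, context):
    res = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=8)
    msgs = res.get('Messages', [])
    ids = [m['Body'] for m in msgs]
    while len(ids) < 8: # pad with made-up ids so the file size is constant
        ids.append(base64.urlsafe_b64encode(
            secrets.token_bytes(80)).decode('ascii'))
    key = 'ids/' + datetime.datetime.utcnow().isoformat() # lexically sorted by time
    s3.put_object(Bucket=bucket, Key=key, Body='\n'.join(ids) + '\n')
    for m in msgs:
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=m['ReceiptHandle'])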
Every 5 minutes, the publish_ids() lambda runs. It starts by reading messages from the queue: we read a batch of at most 8 messages at a time, and pad with made-up ids if there aren’t enough real messages to fill the file.
We then write the file to another S3 folder. Note that we use the ISO format for the file name, which gives us lexical sorting of the values by time. Here is what this looks like on the bucket itself:
As you can see, roughly every 5 minutes, we have some values being written out. The file size is always the same, and we can’t tell based on the presence of a file if there were messages posted at a given time.
In the time frame shown here, I didn’t post any messages, but you can see that we still have two files written at 04:15, likely because there were enough messages in the queue to force a second file (basically, a race between the decoy lambda and the publish lambda). That is a desirable outcome, by the way.
This is… pretty much it, I have to say. There aren’t any additional behaviors or things to explore. This Rube Goldberg machine is meant to create a system that breaks apart the different sections and loses information as we move forward.
I’ll cover the impact of this design in the case of a malicious server later on in the post. For now, I want to cover how the journalist can read the data. The server currently holds two folders:
uploads/ - allows anonymous GET for files, auto-expires in 14 days, uploads require a pre-signed URL
ids/ - allows anonymous GET and LIST, auto-expires in 14 days, only written to by the backend (from the queue)
A journalist is going to be running a check every 5 - 10 minutes, using Tor, on the contents of the ids/ bucket, like so:
import base64
import io
import time

import xmltodict

def read_messages(session, base_url, reveal, last_file=""):
    while True:
        res = session.get(base_url, params={
            'prefix': 'ids/', 'start-after': last_file, 'list-type': 2})
        res.raise_for_status()
        dict_data = xmltodict.parse(res.content)
        contents = dict_data.get('ListBucketResult').get('Contents')
        if len(contents) == 0:
            time.sleep(5 * 60)
            continue
        for file in contents:
            last_file = file.get('Key')
            data = session.get(base_url + last_file)
            for line in io.BytesIO(data.content).readlines():
                id = base64.urlsafe_b64decode(line)
                try:
                    file_name = reveal.decrypt(id)
                    yield file_name
                except:
                    pass
What we are doing here is scanning the ids/ folder, getting all the id files that we haven’t seen yet. For each of those files, we fetch it and try to decrypt each of the lines in it. When we are done processing all the files in the folder, we wait 5 minutes and then check for new ones.
This relies on the fact that S3 buckets return items in lexical sort order, and our publish_ids() lambda generates the file names in the ids/ folder using lexically sorted timestamps.
If there are many journalists listening on the system, each one of them will be making 1 - 2 remote calls every 5 minutes, seeing if there are any ids in there that they can decrypt. Note that this gives the server absolutely no information about what data each journalist is able to access. It also drastically reduces the amount of information that you need to deal with and distribute.
Each journalist will need to go through roughly 250 KB per day of ids to scan for messages aimed at them. That is assuming that we have a low load system, with < 8 messages / every 5 minutes. Note that this is still over 2,300 messages/day.
Assuming that we have a very high load and have to push 100,000 messages a day, the amount of data that each journalist will have to scan is under 10 MB. Those are good numbers, especially since we don’t intend this to be a high-traffic system.
With the file id in hand, the journalist can now download and decrypt the data, like so:
BASE_URL ="https://deaddrop0j22dp4vl2id.onion" # example only
PRIVATE_KEY ='Iei28jYsIl5E/Kks9BzGeg/36CKsrojEh65IUE2eNvA='
key =PrivateKey(base64.b64decode(PRIVATE_KEY))
reveal =SealedBox(key)
with torpy.http.requests() as s:for msg inread_messages(s, BASE_URL, reveal):
res = s.get(BASE_URL +"uploads/"+ msg)try:print(reveal.decrypt(res.content))
except:
pass
We get the file ids, download them from the S3 bucket, and decrypt them. And now the journalist can start actually looking at the data. To reply, the whistleblower would need to send their own public key to the journalist, and subscribe in the same manner to the updates.
Note that this is not meant to be an online protocol, and you can scan the data once a day or once a week, without any real problems. That makes this system attractive since you can schedule a weekly trip to a remote location where you can anonymously check if you have anything new in your “mailbox”, without anyone being able to tell.
With the raw technical details out of the way, let’s consider some of the implications of this sort of system design.
The cautious playbook
A journalist would publish their public key, likely in their paper or website. They are interested in getting such information from anonymous sources. At the same time, they need to ensure that no one can tell if someone sent them information. A whistleblower wants to send information, but be protected from anyone knowing that they sent it. Ideally, we should even limit the knowledge that any information was sent at all.
SecureDrop recommends that access to the system be only via Tor, and usually from a location that isn’t near your usual haunts, one you traveled to without a phone, paid for in cash, using a live-CD Tails instance.
The idea of adding additional layers beyond the system encryption is that even with a powerful adversary, the number of hurdles they have to go through is very high. You have to break the system encryption, and then Tor, and then you reach some coffee shop in the middle of nowhere and need to go over footage to try to identify someone.
The crazy thing is that this is actually viable. It isn’t science fiction today to obtain the security footage (which we can safely assume exists), run it through a facial recognition system and get a list of people to check. So we need to have multiple layers of defense in place.
From the point of view of this dead drop system, all the data is encrypted, the only information you have is about analyzing traffic patterns. A good way to alleviate even that is to not run everything at once.
The whole point of breaking it into discrete steps is that you can execute it in isolation. You can get the upload url, upload the file, and then actually register the id a day later, for example. That makes timing analysis a lot harder.
Or consider a few bots that would (via different Tor exit nodes for each operation):
Request upload urls on an ongoing basis
Occasionally upload random files to those urls
Read the ids/ folder files
Download some of those urls
If we have a few of those bots, and they send each other “messages” that generate decoy traffic, it is going to be much harder to track who and what is happening, since you’ll have activity (anonymized via Tor) that hides the real actions.
I think that those bots are likely to be another important layer of security, defeating traffic analysis and masking actual usage by people.
Consequences of a malicious takeover
Let’s consider for a moment what will happen if the system is taken over by a malicious party. In this case, we assume that there is a valid system that was taken over. Therefore, we need to figure out the impact of old messages that were sent and new messages that will be sent from now on.
The design of the system calls for the following important configurations:
Disabling logging on access (S3, endpoints, etc).
All files (ids, uploads, etc) are deleted within 14 days.
We are routing data in such a way that information is lost (registered id will be merged with decoys, etc)
Assuming that a proper system was taken over by a malicious party, the only additional information that they now have is all the contents of the uploads/ folder, which aren’t visible to other parties (you have to know the file name to download).
Given that the files are encrypted, there isn’t much that is leaked. And the bots I talked about will mask the real traffic with dummy files.
Once the system has been taken over, we can assume that the following happens: There is correlation now between calls to upload-url and the actual uploaded file. You can also correlate a registered id with an upload, assuming sufficiently low traffic (which is likely).
Those are the only additional bits of information that you gain from having access to the system. When we register an id to be published, the whistleblower sends the encrypted value, which the server has no way of correlating to the recipient.
The server can now do traffic analysis, for example, monitor that someone is reading the ids/ folder and then downloading a file from the uploads/ folder. But that is of limited utility, that would be:
Masked by the bots traffic.
Only give you the Tor exit node.
A malicious party could also disable expiration and retain long-term all the uploaded files, in case they get the keys at a later time, but beyond that, I can’t think of anything else that they will get from actually having control over the system.
In particular, for the whistleblower, there is no data leakage about who they are or who they are talking to. Even with the collaboration of the journalist, if the whistleblower didn’t provide that information, they remain anonymous.
Side channels
Another aspect of this system is that we don’t have to go through the id registration and publication to journalists at all. The fact that the file is stored (and encrypted) means that you don’t need to pass a potentially large amount of data to a journalist; you can just give them the id itself. That is 108 bytes in base64 format, and it doesn’t convey any additional information beyond that.
The question is, of course: if you can pass the id, why not just pass the encrypted file directly?
Attacks and mitigations
The design of the system makes certain attacks impossible to execute or impossible to hide. For example, you can perform denial of service attacks on a journalist by sending many messages that they would have to go through (you have the public key, after all). But that is obviously detectable.
In general, denial of service attacks on the system can be mitigated by requiring proof of work to submit the files.
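As an illustration of the idea (not part of the system as described), a hashcash-style check could look like this:

import hashlib

# the client must find a nonce whose hash of (challenge + nonce) has
# `bits` leading zero bits before the server accepts the submission
def solve(challenge: bytes, bits: int = 20) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, 'little')).digest()
        if int.from_bytes(digest, 'big') >> (256 - bits) == 0:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, bits: int = 20) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, 'little')).digest()
    return int.from_bytes(digest, 'big') >> (256 - bits) == 0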
What about blocking access from a source or to a particular journalist? There is no identifying the source, so you cannot block based on that. You can also not block the journalist, since the server doesn’t have any idea who the destination is.
You can block access to the system simply by ignoring id registrations. From the outside world, it will look like no one sent us any messages. However, that is easily detectable: a journalist can send a message to herself, and upon not receiving it, detect that the system is not operational.
Key leakage
What happens if the key pair of a journalist leaks? That would be catastrophic for the journalist and the sources because it would enable decryption of any messages that were sent. This is mitigated by only keeping messages for a maximum of 14 days, but we must assume that an adversary has copies of all the messages that were ever sent.
In this scenario, key leakage would compromise all communications intended for the journalist. Technically speaking, we could try to do a key exchange using this system and have a temporary key assigned for a particular conversation. The problem is that this sort of system is mostly offline, with days or weeks between interactions.
That means that we need to persist the keys (and can thus assume that a key pair leak will also leak any “temporary” keys). Something that you can do is publish not just a public key but several over time, with built-in expiry. Let’s say that you publish your key in January, and replace that with a new key in February. In March, you’ll destroy the January key, so you don’t have a way to leak that.
Having rotating keys is a cute idea, but I think in practice this is too complex. People have a hard enough time remembering things like their passwords; requiring them to remember (and change) multiple passphrases is too much. On a yearly basis, however, that makes a lot more sense. But then again, what does “destroy the old key” mean, exactly?
Summary
Well, this post has gone on entirely too long. I actually started writing it to play around with serverless system architecture, but I got sidetracked into everything else. It is long enough that I won’t try to dive into the serverless aspect in this post. Maybe in a future one.
As a reminder, this is a nice design, and the blog post and research consumed quite a few very enjoyable evenings, but I’m not a security or cryptography expert. If you require a system like this, I would recommend consulting an actual professional.