Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

[email protected] +972 52-548-6969

Posts: 7,527
|
Comments: 51,162
Privacy Policy · Terms
filter by tags archive
time to read 5 min | 862 words

It has been almost a year since the release of RavenDB 6.0. The highlights of the 6.0 release were Corax (a new blazing-fast indexing engine) and Sharding (server-side and simple to operate at scale). We made 10 stable releases in the 6.0.x line since then, mostly focused on performance, stability, and minor features.

The new RavenDB 6.2 release is now out and it has a bunch of new features for you to play with and explore. The team has been working on a wide range of new features, from enabling serverless triggers to quality-of-life improvements for operations teams.

RavenDB 6.2 is a Long Term Support (LTS) release

RavenDB 6.2 is a Long Term Support release, replacing the current 5.4 LTS (released in 2022). That means that we’ll support RavenDB 5.4 until Oct 2025, and we strongly encourage all users to upgrade to RavenDB 6.2 at their earliest convenience.

You can get the new RavenDB 6.2 bits on the download page. If you are running in the cloud, you can open a support request and ask to be upgraded to the new release.

Data sovereignty and geo-distribution via Prefixed Sharding

In RavenDB 6.2 we introduced a seemingly simple change to the way RavenDB handles sharding, with profound implications for what you can do with it. Prefixed sharding allows you to define which shards a particular set of documents will go to.

Here is a simple example:

In this case, data for users in the US will reside in shards 0 & 1, while the EU data is limited to shards 2 & 3. The data from Asia is spread over shards 0, 2, & 4.  You can then assign those shards to specific nodes in a particular geographic region, and with that, you are done.

RavenDB will ensure that documents will stay only in their assigned location, handling data sovereignty issues for you. In the same manner, you get to geographically split the data so you can have a single world-spanning database while issuing mostly local queries.

You can read more about this feature and its impact in the documentation.

Actors architecture with Akka.NET

New in RavenDB 6.2 is the integration of RavenDB with Akka.NET. The idea is to allow you to easily manage state persistence of distributed actors in RavenDB. You’ll get both the benefit of the actor model via Akka.NET, simplifying parallelism and concurrency, while at the same time freeing yourself from persistence and high availability concerns thanks to RavenDB.

We have an article out discussing how you use RavenDB & Akka.NET, and if you are into that sort of thing, there is also a detailed set of notes covering the actual implementation and the challenges involved.

Azure Functions integration with ETL to Azure Queues

This is the sort of feature with hidden depths. ETL to Azure Queue Storage is fairly simple on the surface, it allows you to push data using RavenDB’s usual ETL mechanisms to Azure Queues. At a glance, this looks like a simple extension of our already existing capabilities with queues (ETL to Kafka or RabbitMQ).

The reason that this is a top-line feature is that it also enables a very interesting scenario. You can now seamlessly integrate Azure Functions into your RavenDB data pipeline using this feature. We have an article out that walks you through setting up Azure Functions to process data from RavenDB.

OpenTelemetry integration

In RavenDB 6.2 we have added support for the OpenTelemetry framework. This allows your operations team to more easily integrate RavenDB into your infrastructure. You can read more about how to set up OpenTelemetry for your RavenDB cluster in the documentation.

OpenTelemetry integration is in addition to Prometheus, Telegraf, and SNMP telemetry solutions that are already in RavenDB. You can pick any of them to monitor and inspect the state of RavenDB.

Studio Omni-Search

We made some nice improvements to RavenDB Studio as well, and probably the most visible of those is the Omni-Search feature.  You can now hit Ctrl+K in the Studio and just search across everything:

  • Commands in the Studio
  • Documents
  • Indexes

This feature greatly enhances the discoverability of features in RavenDB as well as makes it a joy for those of us (myself included) who love to keep our hands on the keyboard.

Summary

I’m really happy about this release. It follows a predictable and stable release cadence since the release of 6.0 a year ago. The new release adds a whole bunch of new features and capabilities, and it can be upgraded in place (including cross-version clusters) and deployed to production with no hassles.

Looking forward, we have already started work on the next version of RavenDB, tentatively meant to be 7.0. We have some cool ideas about what will go into that release (check the roadmap), but the key feature is likely to make RavenDB a more intelligent database, one might even say, artificially so.

time to read 4 min | 764 words

I wanted to test low-level file-system behavior in preparation for a new feature for RavenDB. Specifically, I wanted to look into hole punching - where you can give low-level instructions to the file system to indicate that you’re giving up disk space, but without actually reducing the size of the file.

This can be very helpful in space management. If I have a section in the file that is full of zeroes, I can just tell the file system that, and it can skip storing that range of zeros on the disk entirely. This is an advanced feature for file systems. I haven't actually used that in the past, so I needed to gain some expertise with it.

I wrote the following code for Linux:


int fd = open("test.file", O_CREAT | O_WRONLY, 0644);
lseek(fd, 128 * 1024 * 1024 - 1, SEEK_SET); // 128MB file
write(fd, "", 1);
fallocate(fd, // 32 MB hole from the 16MB..48MB range
    FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 
    16 * 1024 * 1024, 32 * 1024 * 1024); 
close(fd);

The code for Windows is here if you want to see it. I tested the feature on both Windows & Linux, and it worked. I could see that while the file size was 128MB, I was able to give back 16MB to the operating system without any issues. I turned the code above into a test and called it a day.

And then the CI build broke. But that wasn’t possible since I tested that. And there had been CI runs that did work on Linux. So I did the obvious thing and started running the code above in a loop.

I found something really annoying. This code worked, sometimes. And sometimes it just didn’t.

In order to get the size, I need to run this code:


struct stat st;
fstat(fd, &st);
printf("Total size: %lld bytes\n",
    (long long)st.st_size);
printf("Actual size on disk: %lld bytes\n", 
    (long long)st.st_blocks * 512);

I’m used to weirdness from file systems at this point, but this is really simple. All the data is 4KB aligned (in fact, all the data is 16MB aligned). There shouldn’t be any weirdness here.

As you can see, I’m already working at the level of Linux syscalls, but I used strace to check if there is something funky going on. Nope, there was a 1:1 mapping between the code and the actual system calls issued.

That means that I have to debug deeper if I want to understand what is going on. This involves debugging the Linux Kernel, which is a Big Task. Take a look at the code in the relevant link. I’m fairly certain that the issue is in those lines. The problem is that this cannot be, since both offset & length are aligned to 4KB.

I got out my crystal ball and thinking hat and meditated on this. If you’ll note, the difference between the expected and actual values is exactly 4KB. It almost looks like the file itself is not aligned on a 4KB boundary, but the holes must be.

Given that I just want to release this space to the operating system and 4KB is really small, I can adjust that as a fudge factor for the test. I would love to understand exactly what is going on, but so far the “file itself is not 4KB aligned, but holes are” is a good working hypothesis (even though my gut tells me it might be wrong).

If you know the actual reason for this, I would love to hear it.

And don't get me started on what happened with sparse files in macOS. There, the OS will randomly decide to mark some parts of your file as holes, making any deterministic testing really hard.

time to read 4 min | 765 words

I’m currently deep in the process of modifying the internals of Voron, trying to eke out more performance out of the system. I’m making great progress, but I’m also touching parts of the code that haven’t even been looked at for a long time.

In other words, I’m mucking about with the most stable and most critical portions of the storage engine. It’s a lot of fun, and I’m actually seeing some great results, but it is also nerve-wracking.

We have enough tests that I’ve great confidence I would catch any actual stability issues, but the drive back toward a fully green build has been a slog.

The process is straightforward:

  • Change something.
  • Verify that it works better than before.
  • Run the entire test suite (upward of 30K tests) to see if there are any breaks.

The last part can be frustrating because it takes a while to run this sort of test suite. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any access to that memory would result in an access violation.

I fixed those in the code, of course, but we have a lot of tests, including some tests that intentionally corrupt data to verify that RavenDB behaves properly under those conditions.

One such test writes garbage to the RavenDB file, using read-write memory. The idea is to verify that the checksum matches on read and abort early. Because that test directly modifies what is now read-only memory, it generates a crash due to a memory access violation. That doesn’t just result in a test failure, it takes the whole process down.

I’ve gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed.

And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about exactly what was going on. When trying to read data from disk, it would follow up a pointer to an invalid location. That is not supposed to happen, obviously.

Looks like I have a little data corruption issue on my hands. The problem is that this shouldn’t be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we are using a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been cryptographically verified to be valid.

My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.

RavenDB trusts that since the data was properly decrypted, it is valid and tries to use it. Because the data is garbage, that leads to… excitement. Once I realized what was going on, I was really confused. I’m pretty sure that I didn’t break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) to garbage.

It was also quite frustrating to track because any small-stage test that I wrote would return the expected results. It was only when I ran the entire system and stressed it that I got this weird scenario.

I started practicing for my Fields medal acceptance speech while digging deeper. Something here had to be wrong. It took me a while to figure out what was going on, but eventually, I tracked it down to registering to the TransactionCommit event when we open a new file.

The idea is that when we commit the transaction, we’ll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally.

As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data twice and then write it to the file. Upon decryption, we would successfully decrypt the data but would end up with still encrypted data (looking like random fluff).

The fix was simply moving the event registration to the transaction level, not the file level. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.

time to read 3 min | 485 words

I was talking to a colleague about a particular problem we are trying to solve. He suggested that we solve the problem using a particular data structure from a recently published paper. As we were talking, he explained how this data structure works and how that should handle our problem.

The solution was complex and it took me a while to understand what it was trying to achieve and how it would fit our scenario. And then something clicked in my head and I said something like:

Oh, that is just epoch-based, copy-on-write B+Tree with single-producer/ concurrent-readers?

If this sounds like nonsense to you, it is fine. Those are very specific terms that we are using here. The point of such a discussion is that this sort of jargon serves a very important purpose. It allows us to talk with clarity and intent about fairly complex topics, knowing that both sides have the same understanding of what we are actually talking about.

The idea is that we can elevate the conversation and focus on the differences between what the jargon specifies and the topic at hand. This is abstraction at the logic level, where we can basically zoom out a lot of details and still keep high intent accuracy.

Being able to discuss something at this level is hugely important because we can convey complex ideas easily. Once I managed to put what he was suggesting in context that I could understand, we were able to discuss the pros and cons of this data structure for the scenario.

I do appreciate that the conversation basically stopped making sense to anyone who isn’t already well-versed in the topic as soon as we were able to (from my perspective) clearly and effectively communicate.

“When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.”

Clarity of communication is a really important aspect of software engineering. Being able to explain, hopefully in a coherent fashion, why the software is built the way it is and why the code is structured just so can be really complex. Leaning on existing knowledge and understanding can make that a lot simpler.

There is also another aspect. When using jargon like that, it is clear when you don’t know something. You can go and research it. The mere fact that you can’t understand the text tells you both that you are missing information and where you can find it.

For software, you need to consider two scenarios. Writing code today and explaining how it works to your colleagues, and looking at code that you wrote ten years ago and trying to figure out what was going on there.

In both cases, I think that this sort of approach is a really useful way to convey information.

time to read 10 min | 1997 words

I usually talk about the things that I do that were successful. Today I want to discuss something that I tried but failed at. Documenting failed approaches is just as important, though less enjoyable, as documenting what we excel at.

In order to explain the failure, I need to get a bit deeper into how computers handle memory. There is physical memory, the RAM sticks that you have in your machine, and then there is how the OS and CPU present that memory to your code. Usually, the abstraction is quite seamless, and we don’t need to pay attention to it.

Occasionally, we can take advantage of this model. Consider the following memory setup, showing a single physical memory page that was mapped in two different locations:

In this case, it means that you can do things like this:


*page1 = '*';
printf("Same: %d - Val: %c\n", (page1 == page2), *page2); 
// output is:
// Same: 0 - Val: *

In other words, because the two virtual pages point to the same physical page in memory, we can modify memory in one location and see the changes in another. This isn’t spooky action at a distance, it is simply the fact that the memory addresses we use are virtual and they point to the same place.

Note that in the image above, I modified the data using the pointer to Page 1 and then read it from Page 2. The Memory Management Unit (MMU) in the CPU can do a bunch of really interesting things because of this. You’ll note that each virtual page is annotated with an access permission.

In this case, the second page is marked as Copy on Write. That means that when we read from this page, the MMU will happily read the data from the physical page it is pointed to. But when we write, the situation is different.

The MMU will raise an exception to the operating system, telling it that a write was attempted on this page, which is forbidden. At this point, the OS will allocate a new physical page, copy the data to it, and then update the virtual address to point to the new page. Here is what this looks like:

Now we have two distinct mappings. A write to either one of them will not be reflected on the other. Here is what this looks like in code:


*page1 = '1'; // now 
printf("Page1: %c, Page2: %c\n", *page1, *page2); 
// output: Page1: 1, Page2: 1
*page2 = '2'; // force the copy on write to occur
printf("Page1: %c, Page2: %c\n", *page1, *page2); 
// output: Page1: 1, Page2: 2

As long as the modifications happened through the first page address (the orange one in the image), there was no issue and any change would be reflected in both pages. When we make a modification to the second page (the green one in the image), the OS will create a new physical page and effectively split them forever.

Changes made to either page will only be reflected in that page, not both, since they aren’t sharing the same page.

Note that this behavior applies at a page boundary. What happens if I have a buffer, 1GB in size, and I use this technique on it? Let’s assume that we have a buffer that is 1GB in size and I created a copy-on-write mapping on top of it.

The amount of physical memory that I would consume is still just 1GB.

In fact, I would effectively memcpy()very quickly, since I’m not actually copying anything. And for all intents and purposes, it works. I can change the data through the second buffer, and it would not show up in the first buffer. Of particular note is that when I modify the data on the second buffer, only a single page is changed. Here is what this looks like:

So instead of having to copy 1GB all at once, we map the buffer again as copy on write, and we can get a new page whenever we actually modify our “copy” of the data.

So far, this is great, and it is heavily used for many optimizations. It is also something that I want to use to implement cheap snapshots of a potentially large data structure. The idea that I have is that I can use this technique to implement it.

Here is the kind of code that I want to write:


var list = new CopyOnWriteList();
list.Put(1);
list.Put(2);


var snapshot1 = list.CreateSnapshot();


list.Put(3)




var snapshot2 = list.CreateSnapshot();


list.Put(4);

And the idea is that I’ll have (at the same time) the following:

listsnapshot1snapshot2
1,2,3,41,21,2,3

I want to have effectively unlimited snapshots, and the map may contain a large amount of data. In graphical form, you can see it here:

We started with Page 1, created a Copy of Write for Page 2, modified Page 2 (breaking the Copy on Write), and then attempted to create a Copy on Write for Page 2. That turns out to be a problem.

Let’s see the code that we need in order to create a copy using copy-on-write mapping on Linux:


int shm_fd = shm_open("/third", O_CREAT | O_RDWR, 0666);
ftruncate(shm_fd, 4096);
char *page1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
page1[0] = 'A'; page1[1] = 'B';
// pages1 = 'AB'
char *page2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, shm_fd, 0);
// pages2 = 'AB'
page1[0]= 'a';
// pages1 = 'aB'
// pages2 = 'aB' (same pagee)
page2[2] = 'C'; // force a private copy creation
// pages1 = 'aB'
// pages2 = 'aBC'
page1[1] = 'b';
// pages1 = 'ab'
// pages2 = 'aBC' (no change here)

The code in Windows is pretty similar and behaves in the same manner:


HANDLE hMapFile = CreateFileMapping(INVALID_HANDLE_VALUE,
    NULL,PAGE_READWRITE,0,4096, TEXT("Local\\MySharedMemory"));
char* page1 = MapViewOfFile(hMapFile,
    FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 4096);
page1[0] = 'A'; page1[1] = 'B';
// pages1 = 'AB'
char* page2 = MapViewOfFile(hMapFile,
    FILE_MAP_COPY, 0, 0, 4096);
// pages2 = 'AB'
page1[0] = 'a';
// pages1 = 'aB'
// pages2 = 'aB' (same pagee)
page2[2] = 'C'; // force a copy on write 
// pages1 = 'aB'
// pages2 = 'aBC'
page1[1] = 'b';
// pages1 = 'ab'
// pages2 = 'aBC' (no change here)

Take a look at the API we have for creating a copy-on-write:


MapViewOfFile(hMapFile, FILE_MAP_COPY, 0, 0, 4096); // windows
mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, shm_fd, 0); // linux

A key aspect of the API is that we need to provide a source for the Copy-on-Write operation. That means that we can only create a Copy-on-Write from a single source. We cannot perform a Copy-on-Write on top of a page that was marked as copy-on-write. This is because we cannot refer to it. Basically, I don’t have a source that I can use for this sort of mapping.

I tried being clever and wrote the following code on Linux:


int selfmem = open("/proc/self/mem", O_RDWR);
char *page2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, 
                   MAP_PRIVATE, selfmem, (off_t)page1);

On Linux, you can use the special file /proc/self/mem to refer to your memory using file I/O. That means that I can get a file descriptor for my own memory, which provides a source for my copy-on-write operation.

I was really excited when I realized that this was a possibility. I spent a lot of time trying to figure out how I could do the same on Windows. However, when I actually ran the code on Linux, I realized that this doesn’t work.

The mmap() call will return ENODEV when I try that. It looks like this isn’t a supported action.

Linux has another call that looks almost right, which is mremap(), but that either zeros out or sets up a userfaulfdhandler for the region. So it can’t serve my needs.

Looking around, I’m not the first person to try this, but it doesn’t seem like there is an actual solution.

This is quite annoying since we are almost there. All the relevant pieces are available, if we had a way to tell the kernel to create the mapping, everything else should just work from there.

Anyway, this is my tale of woe, trying (and failing) to create a snapshot-based system using the Memory Manager Unit. Hopefully, you’ll either learn something from my failure or let me know that there is a way to do this…

time to read 1 min | 129 words

Watch Oren Eini, CEO of RavenDB, as he delves into the intricate process of constructing a database engine using C# and .NET. Uncover the unique features that make C# a robust system language for high-end system development. Learn how C# provides direct memory access and fine-grained control, enabling developers to seamlessly blend high-level concepts with intimate control over system operations within a single project. Embark on the journey of leveraging the power of C# and .NET to craft a potent and efficient database engine, unlocking new possibilities in system development.

I’m going deep into some of the cool stuff that you can do with C# and low level programming.

time to read 6 min | 1090 words

I have a piece of code that has been living, rent-free, in my head for the past 30 years or so.

In middle school (I was 12 - 13 at the time), I was taught Pascal as the entry-level programming language. I found it to be a really fascinating topic, as you can imagine.

One of the things I did was try to read other people’s code and see what sense I could make out of it. That was way before the Internet was a thing, though. Even at that time, Pascal was mostly on its way out, and there wasn’t much code that a kid could access.

To give some context, at the time, if you wanted to move data from one place to the next, the only real option was to physically disassemble your computer, take out the hard disk, wrap it in a towel for protection, and carry it to your destination. I recall doing that and then spending some hours at a friend’s house, trying to figure out the right jumper configuration so the BIOS would recognize two drives at once.

Because of this, I had a great interest in disk parking, a technique that helps ensure that taking the hard drive from one location to the next wouldn’t ruin it. I remember running into a particular piece of code to handle disk parking in Pascal and being quite excited by this. Here I had something that I needed to do, and also in a language that I was familiar with.

I remember staring at the code and trying to make sense of it for a while. I couldn’t figure it out. In fact, I couldn’t even begin to understand what was going on there. I remember feeling both frustrated and stupid because I could understand the syntax but not what it was doing.

In fact, it was so far above my head that I was upset about it for days, to the point that decades later, it still pops into my head. When it came back for a visit this week, I decided to try to figure it out once and for all.

That led to the first problem, which is that I cannot actually recall the code. I remember it was in Pascal, that it dealt with parking the disk needles, and it was full of weird numbers that I couldn’t grasp.  Turning to Google didn’t help much, I found some code samples, but nothing that really jived with what I recalled. I ended up cobbling something that more or less matches what I had in my head.


program DiskPark;


function ParkHead(drive: char): Boolean;
var
  regs: Registers;
begin
  with regs do
  begin
    AH := $13;  
    AL := Ord(drive) - Ord('A') + $80;
    DL := AL;  
    Intr($13, regs); 
    ParkHead := (Flags and FCarry) = 0;
  end;
end;


begin
  if ParkHead('A') then
    WriteLn('Success: parked')
  else
    WriteLn('Failure, no idea why!');
end.

The actual code that I remember was beefier, and I think that it handled a bunch of additional scenarios, but it had been 30 years, I can’t recall it.

Amusingly enough, I actually had to update my publishing software, since I never thought I would be publishing Pascal code snippets 🙂.

What is interesting is that, as I look at this code and try to view it through the glasses of my 12-year-old self, I can understand exactly the frustration.

Look at this code, it is filled with random numbers and letters. What is AH or the Intr() call? I mean, looking at this, there is no place to get started understanding this.

Let’s look at this line:


AH := $13;

There is no definition of AH anywhere, I can’t tell where it is coming from. And $13, what is that about? A bad luck omen because I can’t figure it out, I think…

The problem was that my 12-year-old self wrote programs like “Guess the number”, and really advanced stuff was bubble sort. I approached the code above with the same mindset. There is a logic to this that I can figure out if I can just stare at this long enough.

The problem is that the code above is basically arbitrary, there is nothing intrinsic about it.

The weird variables that come from nowhere (AL, DL, and AH) are registers (at the time, I don’t believe I was even aware that those existed). The weird numbers are just arguments that we pass to the BIOS when you invoke a command.

In the same sense, consider this code:


return syscall(437, ptr, 525376);

This line will tell you absolutely nothing about the reasoning behind the code. While the numbers seem arbitrary,in this case, they correspond to calling SYS_openat with the flags O_APPEND | O_CREAT | O_CLOEXEC.

The whole program above is just the way you have to go to invoke a system call (as we would call it today), and there is no logic or sense to it. Just something that you have to know. And once you do know, all of this can be packaged into a much simpler, higher-level concept: the system call. (Yes, it is an interrupt invocation, not the same thing. You may go stand in the pedantic corner, mind.)

Hopefully, I wouldn’t ever have to think about this program any longer. I managed to not only understand what it does, but actually grok it.

time to read 4 min | 668 words

I got a really interesting comment on a blog post talking about query optimization. The context was that working with dates is much easier for a query engine than working with date & time at the millisecond granularity. You can read the details on that in the post. Here, I want to focus on the modeling difference between the two.

In real-life scenarios, DateTime.Date is kind of useless. I don't know if any application will use such a query unless they are a UK-only application or an application that says they only deal with UTC. In practice, it is often hour or minute precision due to the time zone.. The majority of time zones are hourly based, with some operating on minute bases. e.g. 15 or 30 minutes. (Newfoundland UTC-03:30, India UTC+5:30, Eucla UTC+8:45)

This is a great example of the difference in thinking between developers and business people. Because we typically work with date & time, we tend to assume that all the associated considerations for time also apply to dates. However, that isn’t actually the case.

The date 2024-05-13 is a date. It is the same in the UK and in India. The start time for that date may be different, but the date is the same. A date doesn’t have a time zone. Mostly because that isn’t a meaningful distinction.

Let’s consider the most famous date of all, your birthday. You were born at a given point in time (date time) in a particular location (time zone). However, for the vast majority of scenarios, such details are irrelevant.

If you were born on Baker Island (UTC-12) and currently reside in Line Islands (UTC+14), you’ll celebrate your birthday based on the date, not the time. That is made obvious when you consider that a year is not an exact measurement in terms of time and the duration of time within a year varies considerably between different years.

When we talk about businesses and dates, it really gets more complex. Consider the simplest scenario, we have a business that has stores in Honolulu, Hawaii, and Dallas, Texas. On Dec 31, 2023, a purchase was made in the Honolulu store at 8PM. That is already Jan 1st, 2024 in Dallas, mind. What year would taxes be paid on that sale?

You can make all sorts of cases here, for paying that in the current year or the previous one, etc. In practice, it doesn’t matter. The date the sale was made is what determines the tax year. That means that a sale in Honolulu will be registered in 2023, while a sale that happened half an hour earlier in Dallas will be registered for 2024.

The reason for that is simply that there is no really good answer here when you start working across time zones. And trying to maintain the distinction between when the day started is not really meaningful from a business perspective.

Here is another consideration, if I signed an office lease for 6 months starting on January 1st. However, in March, daylight savings time came into effect. When is my lease going to expire?

The answer is May 31, 23:59, regardless of changes in DST. Because the duration is in months (and days) not in terms of time(and hours).

There are scenarios in business that I care deeply about the time that passed. A great example would be for payroll purposes. I did a night shift with daylight savings time in it. You bet that I want to get paid for the total number of hours that passed, not the difference between the hours on the clock. For fun, it gets really complex when you have shifts that cross payroll periods (how do you calculate overtime), but that is a topic for another time (pun intended).

The common case, however, is that you only care about the date, and the timezone is not relevant.

time to read 27 min | 5255 words

I ran into an interesting Reddit comment about deniable encryption and decided to spend an evening playing with it. The concept is that we have a way to encrypt a message in such a way that we can provide a key that would reveal a different message.

The idea is that if you are forced to reveal your key, you can do so, without spilling your secret. From a technical perspective, this is a truly fascinating scenario. Of course, it comes with the problem that if you’ve provided a key that doesn’t show anything the adversary is happy with, they’ll assume that there is another key.

Note: As usual when talking about cryptography, I’m at best an amateur in this area. This is strictly me having fun, don’t try to keep your Bitcoin keys here (instead, send them to me by snail mail).

In theory, there is a simple way to do so. Behold my prediction for the winner of the 2024 US election. I don’t want to reveal to you ahead of time, but here is the encrypted value:


4a/8AqOcEFzlMRP9VsLgEtXIq8+Cc11bKKp6+iR3c975qOrNtA==

After the election, I’ll share the key that will show that I properly predicted this (you should send me bitcoins at that point). Let’s commit further and show you how you can verify this, it’s really simple:


string Decrypt(string encrypted, string key)
{
    var t = Convert.FromBase64String(encrypted);
    var k = Convert.FromBase64String(key);
    return Encoding.UTF8.GetString(
         t.Select((b, i) => (byte)(b ^ k[i])).ToArray()
    );
}

In the interest of time (and those bitcoins), I’ll let you know that the answer is either:


tsaSbMbuMDODESHNZPbAR4bozqPnECkyR8Rak1dNU4qL3Ye9mg==
tsaSbMbuMDODESHNZPbAR4bozqPnECkyR8Rak1dNU5yQzI+jmg==

Voila, we are done, right? Not only did I demonstrate my ability to properly predict the future, but I was also able to show how you can use two separate keys to decrypt the same data.

This is just a property of the way I “encrypted” the data. What happened is that I took some random bytes, and when I needed to produce an answer, I XORed those bytes with the message I wanted to get and then I sent you the XORed value. When you XOR it again with the “message” you previously got, we get the output I want. In essence, that “message” you got is a one-time pad, and I can use that to send you any message I want.

This also has all the usual limitations of one-time pads, you can only send data up to the size of the key, it doesn’t protect you from the text being corrupted, and it is malleable. In other words, you have no way to ensure (cryptographically) that the message you received was actually sent by me.

Modern cryptography relies on something called AEAD (Authenticated Encryption with Associated Data), which ensures that you can send as much encrypted data as you want and ensures that no one can alter the data I receive if the decryption process is successful.

What I aim to do is create a proper way to encrypt a message and be able to retrieve it later, but also provide another key if needed. Here is the API that I have in my mind:


var output = DeniableEncryption.Encrypt(
    ("P@ssw0rd", "Joey doesn't share food!"),
    ("swordfish", "Meet at dawn by the beach to toast the new year"),
    ("adm1n!str@t0r", "We were on a break!"),
    ("Qwerty!Asdf@2024", 
          "Bitcoin seed: lonely ghost need apology spend shy festival funds"
    )
);

As you can see, we are actually encrypting multiple messages here, each with its own password. The output of this code will be something like this:


OwlVWixT3NMen2vuklHsY34SEIfu+zuDKJx5UMQKEjP4GnDnSDsagUiMWxcl83kC4GRI1s64JEU7
x7vf4u15FZOw3DDzwCG71Mqatfjc7nTzAox7Cr9FtVnxqsqkIyeOVsq6yHbiP56HAlUkbu7/D3kp
RrRmNtdqo5S6Dl7Y50fH492W/+/wtEhNePfWP7YhO1KsvUcnX2S4B7VKnTZbhJDhvxgTUlMK4/Cl
UdQiP3H0R1CZ8ucB1mb1yP/gkwIPYA7ComAzKtM9VOviCzqP5wqzdq7KcWM2FdJH3Lqpuoi376Lr
3Dnh4FUDe8jJhU2xlNhj7O9tXPczIeeUu7GuG0FHegeHprqRc7AkKK5b5kiEN7VnQz58fk29WBAQ
2LTYrBwbDn47Sw4PybMFcl9Wy4Yuw9ElHQoZcvXGk9hAXsqWdRyRgOq8HREiuZpzvSyx5v56T2+u
3hGVeXStKfYF6T5R9w8wqPKQ8UB3/blKguQZjtJNlueXDSpJ5Jzl/7FKUCSJfK16l9NgiVpcGGLg
qke4cZ1aVdpL6rsCH8uLLjEO0MVliuLjjX2VP76UrJBpTdh2uuyjaNDa6tFxQ/zx0jz5VlVbNZv+
w4KOl9jkQYk1U9VZ4K1v8IP6swZ1AOmfkWDuXmeNHlBeFk7Au0+ZDgpAJptwZrSDP2rgRgnrTlDL
qN9YqSlV4Q2TBmcVmQy9+sW1aI7yHhUdzfx3wZ551BmJX0IBLJviMKm+1DglKlHfs/+BZ1fz1HR2
ygnSSq2FXMf8y8vDZRjxb54Q/4JzgvYMpq47/ukmzPgc6JcYfkLtDCNXeJ+0bUDjn5r+2I8d3DLF
FbkGa6oB+um1ipAAeO5paH6aqzERtK3qEDw+nmnHW+y7BQUrQ7VRq2rmmPI2Hq8aMGcyCu5MpI97
/TXgcZDsUxFgRvA4ZQWz0UJ1ZNZOHgCRWgsAOkp1Y6zUKYwcJlhiICfUlfuj+6I2YayldANn/GR/
jn8/U4W2UXN3cAkl9AtgkxooGVQThPnAVTqo4l+EiL9X7WLo3EyaTJSNzxSSgudMQZJJJc+nYK66
A592plmhCzsvWedUeBzacE/pifyWM9DGDHqa5K75VLGV1AenIIBsUkYc44UPDLazpLSUBfTXB7LZ
Tig4sHmmLvfiXrD/lH1jFKEsAdTHiYWLooxTBD0Q2COpEM6kyKkljwfko8FVpO6HRNwiyqwQCsDM
xSMOO2Vk5qQVAb7VOEXN0fQjxEjSeB890XODduP6d941dqk+L1iAQK50GHk8WWCkZFn6FricVCFs
5T6fEWRj6wJlD4EISmoNVmanqAmF79Spg4YCOr3N2EbSdokf5d4ZNA7GtDFX9esbHTIPS0SiXyUx
MHgS99CllTACSEpmEisu/JiEsWKHOg0oy6oCZbi9hrMbeMGGQ9jM4sIxQ20/u7dV1maut67CPN0H
F5VYkRQ0/PNqhvT4Tb3CuGLndxD/nsvs9MgOOZljVZWLaF//Gdno2CNNPjnnHTuURScpcIFKMN5c
80NDgQX6kO8Hoiho7NFc61QwDCNYk1xk2qVDo9jcOaJ2zdtFWJfVavhLCNnEfWoHxzBmS4oL6Ss=

And I’m able to turn that back into the encrypted message using this code:


var msg in DeniableEncryption.Decrpyt(pwd, encrypted);
Console.WriteLine(msg);

If I don’t have the password, on the other hand, it should be completely unfeasible for me to figure out what the message is. In fact, let’s try to list the requirements from such a scheme:

  • With a password, I can easily decrypt the message.
  • Knowing one password isn’t useful for decrypting a message using any other password.
  • I cannot tell how many messages are hiding in the encrypted text.
  • I cannot detect anything about the messages themselves.

I’m an amateur at best in cryptography, so I’m not going to try to construct something myself. Let’s see if I can cobble together something that would at least hold up for a bit.

I’m using real passwords, and I need to turn them into encryption keys. I’m going to be using PBKDF2 to do that, and for the encryption itself, I’ll use the AES-GCM algorithm. Here is the rough format of the output.

We start with a salt (32 bytes generated using CSRNG) which is used to feed into the PBKDF2 algorithm, then a set of offsets into the file, and the actual data itself. Note that we are always “storing” exactly 8 messages.

In practice, I’m going to allow up to 6 user-defined messages, to ensure that we always have “empty” slots. The size of the data is also meaningful, so we need to ensure that we aren’t leaking that.

What I’m doing is ensuring that we round up (by 64 bytes) the size of all the messages that we want to encrypt and ensure that each data block is of the same size. To avoid leaking even what is the exact size at 64-byte intervals, I’m writing some additional random bytes at the end.

Let’s look a bit deeper into the format of the data block itself. We start by writing the actual size of the block, then the nonce and authentication tag (important for the AES-GCM usage), and then the encrypted message. The rest is filled with random data.

You’ll note that the offsets in the overall output format and the size in the data format implies that we are leaking information about the messages we encrypted. Given that I need to know where to look for the value in the value, and I need to know the size, why am I spending so much time trying to obfuscate that?

The idea is that I actually have two levels of encryption here. When I derive the key with PBKDF2, I’m asking it to use SHA512 and give me 40 bytes of derived key material. I’m actually only using 32 bytes of those as the actual encryption key, leaving me with 8 bytes (two pairs of 4 bytes) that I can use to XOR with the offset and the length. That hides the actual offset and size (basically using some of the PBKFD2 output as a stream cipher).

It has all the usual problems of raw stream cipher, but I don’t care about malleability or authentication in this scenario. I rely on AES-GCM to handle that part of the process and just need to hide the information from other prying eyes. A man-in-the-middle attack targeting those values is going to be able to cause me to try (and fail) to decrypt a value, so I don’t think that this matters.

With all of that said, let’s look at the actual code for the encryption portion:


public static byte[] Encrypt(params (string Password, string Value)[] items)
{
    if (items.Length > MaxUserItems)
        throw new ArgumentException("You are allowed up to 6 items");
    if (items.GroupBy(x => x.Password).Any(x => x.Count() != 1))
        throw new ArgumentException("No reusing passwords");


    var totalSize = items.Max(
        x => Encoding.UTF8.GetByteCount(x.Value) + 
             sizeof(int) + AesGcm.NonceByteSizes.MaxSize +
              AesGcm.TagByteSizes.MaxSize
    );
    var sizeAlignedUp = (totalSize + BlockSize - 1) & -BlockSize;


    var additionalSizeMixed = RandomNumberGenerator.GetInt32(1, 4) *
        RandomNumberGenerator.GetInt32(BlockSize / 2, BlockSize);
    var outputBuffer = RandomNumberGenerator.GetBytes(
       ItemsCount * sizeAlignedUp + OffsetsBlockSize + SaltSize +
       additionalSizeMixed
    );
    Span<byte> output = outputBuffer;


    var salt = output.Slice(0, SaltSize);
    var offsetsBlock = MemoryMarshal.Cast<byte, int>(
       output.Slice(SaltSize, OffsetsBlockSize)
    );


    int index = RandomNumberGenerator.GetInt32(ItemsCount);
    foreach (var (pwd, val) in items)
    {
        ReadOnlySpan<byte> derived = Rfc2898DeriveBytes.Pbkdf2(pwd, 
           salt, Iterations, HashAlgorithmName.SHA512, 
           sizeof(int) + DerivedKeySize + sizeof(int)
        );


        var plaintext = Encoding.UTF8.GetBytes(val);
        var requiredSize = sizeof(int) + AesGcm.NonceByteSizes.MaxSize +
           AesGcm.TagByteSizes.MaxSize + plaintext.Length;


        var offset = sizeAlignedUp * index + SaltSize + OffsetsBlockSize +
            RandomNumberGenerator.GetInt32(sizeAlignedUp - requiredSize);


        var sizeMask = MemoryMarshal.Read<int>(derived.Slice(0, sizeof(int)));


        offsetsBlock[index] = offset ^ sizeMask;


        index = (index + 1) % ItemsCount;


        Span<byte> mem = output.Slice(offset, requiredSize);


        var lenMask = MemoryMarshal.Read<int>(
           derived.Slice(sizeof(int), sizeof(int))
        );
        var mask = lenMask ^ plaintext.Length;
        MemoryMarshal.Write(mem, mask);


        var derivedKey = derived.Slice(
           sizeof(int) + sizeof(int),
           DerivedKeySize
        );


        using var cipher = new AesGcm(derivedKey, AesGcm.TagByteSizes.MaxSize);


        cipher.Encrypt(
            nonce: mem.Slice(sizeof(int), AesGcm.NonceByteSizes.MaxSize),
            plaintext: plaintext,
            ciphertext: mem.Slice(sizeof(int) + AesGcm.NonceByteSizes.MaxSize +
                          AesGcm.TagByteSizes.MaxSize, plaintext.Length
            ),
            tag: mem.Slice(sizeof(int) + AesGcm.NonceByteSizes.MaxSize,
                   AesGcm.TagByteSizes.MaxSize)
            );
    }


    return outputBuffer;
}

We validate that the user provided us with up to 6 messages (MaxUserItems) to encrypt and that there are no repeated passwords, then we compute the size required to encrypt the longest message. We align that on 64 bytes (BlockSize) and use that to compute the actual overall buffer size. Note that we also add a bit of additional space at the end, to confuse attempts to figure out values based on size (such as the BEAST attack).

We then get the output buffer. Note that in this case, we are asking the RandomNumberGenerator class to give us a buffer that is already filled with random data. The idea is that we don’t need to worry about filling stuff up with cryptographically secured data. We start with random noise, and we add whatever meaning we need from there.

The first 32 bytes (SaltSize) are the salt, this is used to mitigate rainbow table attacks, among others. The next 32 bytes are used as the offsets array, which are used to store the location of the actual encrypted messages.

For the message we want to encrypt, we start by using PBKDF2 to derive a 40-byte cryptographic key. We are using SHA512 (which has a block size of 64 bytes) and 210,000 iterations to derive the key, per the OWASP recommendation.

We want to be unpredictable, so we aren’t writing the first element to the first offset position. Instead, we start the offset position in a random location. We figure out what is the size of the encrypted value (including the size, nonce, tag, and actual encrypted bytes) and stash that at a random location in a random offset in the output buffer.

We then take the first 4 bytes of the derived key value and XOR that with the offset of the value we’ll be writing. We are using those bytes as a stream cipher, basically. We write the encrypted offset to the offsets table. Note that in order to decrypt that, you need to re-run the PBKDF2 computation, which requires that you have the password.

The next 4 bytes (4..8) are used as a stream cipher to encrypt the length of the value we are about to encrypt. And the other 32 bytes (8..40) are used as the encryption key itself.

Note that we are “missing” things like nonce generation. We don’t need that, since the nonce buffer we point to has already been seeded with random values from a cryptographic source.

The Encrypt() does most of the work, and… this is pretty much it. There isn’t a lot of code, most of it is in how we put things together.

The decryption portion is a lot more interesting, I think, so let’s take a look at it:


public static string? Decrpyt(string pwd, byte[] encrypted)
{
    Span<byte> mem = encrypted;
    var salt = mem.Slice(0, SaltSize);


    ReadOnlySpan<byte> derived = Rfc2898DeriveBytes.Pbkdf2(pwd, salt,
       Iterations, HashAlgorithmName.SHA512, sizeof(int) + 
       DerivedKeySize + sizeof(int)
    );


    var offsetMask = MemoryMarshal.Read<int>(derived.Slice(0, sizeof(int)));
    var lenMask = MemoryMarshal.Read<int>(
       derived.Slice(sizeof(int), sizeof(int))
    );
    var derivedKey = derived.Slice(sizeof(int) + sizeof(int), DerivedKeySize);


    var offsetsBlock = MemoryMarshal.Cast<byte, int>(
        mem.Slice(SaltSize, OffsetsBlockSize)
    );


    for (int i = 0; i < ItemsCount; i++)
    {
        var offset = offsetsBlock[i] ^ offsetMask;
        if (offset < SaltSize + OffsetsBlockSize ||
            offset + sizeof(int) > mem.Length)
            continue;


        var maskedLen = MemoryMarshal.Read<int>(
           mem.Slice(offset, sizeof(int))
        );


        var len = maskedLen ^ lenMask;


        if (len < 0 || offset + len + sizeof(int) > mem.Length)
            continue;


        using var cipher = new AesGcm(derivedKey, AesGcm.TagByteSizes.MaxSize);
        var outputBuf = new byte[len];
        try
        {
            cipher.Decrypt(
                nonce: mem.Slice(
                          offset + sizeof(int), 
                          AesGcm.NonceByteSizes.MaxSize
                ),
                ciphertext: mem.Slice(
                    offset + sizeof(int) + AesGcm.NonceByteSizes.MaxSize +
                    AesGcm.TagByteSizes.MaxSize,
                    len
                ),
                tag: mem.Slice(
                        offset + sizeof(int) + AesGcm.NonceByteSizes.MaxSize,
                        AesGcm.TagByteSizes.MaxSize
                ),
                outputBuf);
        }
        catch (CryptographicException)
        {
            // expected, we may hit a dummy value or wrong password
        }
        return Encoding.UTF8.GetString(outputBuf);
    }
    return null;
}

Here we take the first 32 bytes (the salt) and use PBKDF2 and the password to generate the derived key. Again, we are getting 40 bytes back. The first 4 bytes are the offset mask (to figure out where to look for the values, the next 4 bytes are the length mask, to figure out the length for decryption, and the last 32 bytes are the decryption key.

Without the password, we cannot get to the derived key, remember. Then we start scanning through the offsets block. For each of the items we XOR the value in the offsets with the mask. Here we have three options:

  • The XORed value is completely off, which we detect and skip.
  • The XORed value is correct and points to the right offset to continue the operation.
  • The XORed value appears to be correct (its value in bounds). We’ll continue the operation, but fail in the next stage when we actually try to decrypt the value. This is because we are using AES-GCM, which is an AEAD (authenticated encryption) that validates (using cryptographic primitives) that the decrypted value matches the value that was encrypted. I wrote a blog post (part of a larger series) explaining this in detail.

With the offset, we can now read the masked length of the buffer, which has the same problems as the masked offset. We XOR that with the right mask and need to deal with the obvious wrong, correct, or appears to be correct but actually wrong scenario as well. We don’t really care, since we leave the actual validation to the authenticated encryption portion.

If we are able to correctly decrypt the value, we immediately return it. But if not, we’ll try with the next offset, etc. Note that for decryption, we are scanning the offsets array and attempting to check whether the key we derived from the password is able to decrypt the current value. During encryption, we randomized where everything goes, and here we can just do a simple scan and stop on the first value that was successfully decrypted.

As I mentioned, that was a lovely evening to spend on an interesting exercise. I think that this is a valid way to go about building a deniable encryption scheme. The full code is here, I would love your feedback on both the code and the actual idea.

I like that I can provide multiple passwords and messages, in a simple manner. I think that a viable use case would be to encrypt three values. Safe, honeypot, and the real deal. For example:


var output = DeniableEncryption.Encrypt(
    ("safe", "I don't like Mondays"),
    ("honeypot", "I microwave fish in the office break room and I’m not going to stop"),
    ("motherlode", "Bitcoin seed: armor cactus gaze off future blade artist")
);

There is no way to tell whether there is a third option here, and the format is intentionally always assuming 8 “entries”, even if you provide less than the maximum. Of course, that also raises the problem of what if after you give up the motherlode, the other side still suspects there are more secrets. At this point, I’ll point you out to Mickens and a wonderful article about threat models.

 Check out the code and let me know what you think about this.

time to read 1 min | 113 words

For Episode 123 of the CollabTalk Podcast, we explored the pivotal role of community in shaping businesses, discussing my guest’s founding of his company and the strategies for building and nurturing open-source communities. We covered the symbiosis between commercial success and community engagement, emphasizing the importance of community feedback in innovation and the challenges and benefits of integrating open-source models into business strategies. You can listen to the podcast above and follow me using your favorite app, such as Spotify, Apple Podcasts, Stitcher, Soundcloud, or the iHeartRadio app. Be sure to subscribe!

FUTURE POSTS

  1. RavenDB Performance: 15% improvement in one line - 15 hours from now

There are posts all the way to Dec 02, 2024

RECENT SERIES

  1. RavenDB Cloud (2):
    26 Nov 2024 - Auto scaling
  2. Challenge (75):
    01 Jul 2024 - Efficient snapshotable state
  3. Recording (14):
    19 Jun 2024 - Building a Database Engine in C# & .NET
  4. re (33):
    28 May 2024 - Secure Drop protocol
  5. Meta Blog (2):
    23 Jan 2024 - I'm a JS Developer now
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}