Debugging the Linux kernel using awesome psychic powers

time to read 4 min | 764 words

I wanted to test low-level file-system behavior in preparation for a new feature for RavenDB. Specifically, I wanted to look into hole punching - where you can give low-level instructions to the file system to indicate that you’re giving up disk space, but without actually reducing the size of the file.

This can be very helpful in space management. If I have a section in the file that is full of zeroes, I can just tell the file system that, and it can skip storing that range of zeros on the disk entirely. This is an advanced feature for file systems. I haven't actually used that in the past, so I needed to gain some expertise with it.

I wrote the following code for Linux:


int fd = open("test.file", O_CREAT | O_WRONLY, 0644);
lseek(fd, 128 * 1024 * 1024 - 1, SEEK_SET); // 128MB file
write(fd, "", 1);
fallocate(fd, // 32 MB hole from the 16MB..48MB range
    FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 
    16 * 1024 * 1024, 32 * 1024 * 1024); 
close(fd);

The code for Windows is here if you want to see it. I tested the feature on both Windows & Linux, and it worked. I could see that while the file size was 128MB, I was able to give back 16MB to the operating system without any issues. I turned the code above into a test and called it a day.

And then the CI build broke. But that wasn’t possible since I tested that. And there had been CI runs that did work on Linux. So I did the obvious thing and started running the code above in a loop.

I found something really annoying. This code worked, sometimes. And sometimes it just didn’t.

In order to get the size, I need to run this code:


struct stat st;
fstat(fd, &st);
printf("Total size: %lld bytes\n",
    (long long)st.st_size);
printf("Actual size on disk: %lld bytes\n", 
    (long long)st.st_blocks * 512);

I’m used to weirdness from file systems at this point, but this is really simple. All the data is 4KB aligned (in fact, all the data is 16MB aligned). There shouldn’t be any weirdness here.

As you can see, I’m already working at the level of Linux syscalls, but I used strace to check if there is something funky going on. Nope, there was a 1:1 mapping between the code and the actual system calls issued.

That means that I have to debug deeper if I want to understand what is going on. This involves debugging the Linux Kernel, which is a Big Task. Take a look at the code in the relevant link. I’m fairly certain that the issue is in those lines. The problem is that this cannot be, since both offset & length are aligned to 4KB.

I got out my crystal ball and thinking hat and meditated on this. If you’ll note, the difference between the expected and actual values is exactly 4KB. It almost looks like the file itself is not aligned on a 4KB boundary, but the holes must be.

Given that I just want to release this space to the operating system and 4KB is really small, I can adjust that as a fudge factor for the test. I would love to understand exactly what is going on, but so far the “file itself is not 4KB aligned, but holes are” is a good working hypothesis (even though my gut tells me it might be wrong).

If you know the actual reason for this, I would love to hear it.

And don't get me started on what happened with sparse files in macOS. There, the OS will randomly decide to mark some parts of your file as holes, making any deterministic testing really hard.