Digging into extendible hashing


After my post yesterday, I dug a lot deeper into extendible hashing. There is a wealth of information on the topic. What is more interesting, from my point of view, is just how freaking elegant this data structure is.

I spent a few hours implementing it, because I don't really get a data structure until I actually sit down and write it in code. I decided to write it in C, because I haven't written C in a while. It took about 200 lines of code, and I could see how it works.

Well, I say see, but I learned something very important a few years ago when implementing complex data structures: one of the first things you want to do is make sure that you have a visualization of the data. That can be absolutely invaluable when debugging.

Here is what this looks like when we have 31 entries in the hash table, for example:

I'm outputting Graphviz text and then rendering it. This means that I can very easily and visually see how things are working, which is very helpful for understanding where I messed up.
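The visualization code itself is tiny. Here is a minimal sketch of the idea (the bucket_t / hash_table_t layout below is assumed for illustration, it is not the layout from the actual source): walk the directory, emit one record node per slot and per bucket, and let Graphviz draw the arrows.

#include <stdio.h>
#include <stdint.h>

/* Assumed layout, for illustration only: a directory of 2^global_depth
   bucket pointers, each bucket with a local depth and a few 64-bit keys. */
typedef struct bucket {
    uint32_t local_depth;
    uint32_t count;
    uint64_t keys[4];
} bucket_t;

typedef struct hash_table {
    uint32_t global_depth;
    bucket_t **directory; /* 1 << global_depth entries */
} hash_table_t;

/* Emit Graphviz DOT text; render with: dot -Tsvg table.dot -o table.svg */
void print_dot(hash_table_t *ht, FILE *out) {
    uint32_t size = 1u << ht->global_depth;
    fprintf(out, "digraph ht {\n  node [shape=record];\n  dir [label=\"");
    for (uint32_t i = 0; i < size; i++)
        fprintf(out, "%s<s%u> %u", i ? "|" : "", i, i);
    fprintf(out, "\"];\n");
    for (uint32_t i = 0; i < size; i++) {
        bucket_t *b = ht->directory[i];
        /* Buckets shared by several slots get emitted more than once;
           Graphviz simply merges nodes that have the same name. */
        fprintf(out, "  b%p [label=\"depth=%u", (void *)b, b->local_depth);
        for (uint32_t j = 0; j < b->count; j++)
            fprintf(out, "|%llu", (unsigned long long)b->keys[j]);
        fprintf(out, "\"];\n  dir:s%u -> b%p;\n", i, (void *)b);
    }
    fprintf(out, "}\n");
}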

You can find the whole source here. There is a bit more there than I can comfortably discuss in a blog post, but what I found most useful is to insert data, then see the before / after pictures of the table.

What you can see above is a hash table that is split into 4 sections; however, it only has three pages. This is because of the way the hash table works. As shown, all values that end with a 00 binary suffix go to the leftmost page, and all values that end with 10 go to the rightmost one. Not coincidentally, these pages are pointed to from the 0 (00) and 2 (10) positions in the hash table's bucket list.
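The directory lookup that produces this behavior is nothing more than masking off the low bits of the key (two bits here, since the directory has four slots). Something along these lines, assuming a global_depth field like in the sketch above:

/* The directory slot for a key is just its low global_depth bits,
   which is why ...00 keys and ...10 keys land on different pages. */
static inline uint32_t dir_index(uint64_t key, uint32_t global_depth) {
    return (uint32_t)(key & ((1ull << global_depth) - 1));
}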

We also have something a lot more interesting, though. We have the middle page, which is pointed to by both the 1 (01) and 3 (11) positions. In fact, you can see that this page has a depth of 1, so values from both positions are mixed in this location. As things currently stand, I didn't do anything interesting with regard to how to find a value inside the page (I'm doing a simple linear search), but I expect it to be relatively straightforward to treat the page as a hash table as well.
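The elegant part is what happens when a shared page like that middle one fills up. Because its local depth (1) is lower than the directory's global depth (2), it can be split without growing the directory at all: only the slots that pointed at it get rewired. Here is a sketch of that step, using the same assumed structs as above (the real code in the repository will differ):

#include <stdlib.h>

void split_bucket(hash_table_t *ht, uint32_t idx) {
    bucket_t *old = ht->directory[idx];
    if (old->local_depth == ht->global_depth)
        return; /* the directory has to be doubled first, not shown here */

    bucket_t *sibling = calloc(1, sizeof(*sibling));
    old->local_depth++;
    sibling->local_depth = old->local_depth;

    /* The bit that now tells the two buckets apart. */
    uint64_t high_bit = 1ull << (old->local_depth - 1);

    /* Redistribute the keys between the old bucket and its sibling. */
    uint32_t kept = 0;
    for (uint32_t i = 0; i < old->count; i++) {
        if (old->keys[i] & high_bit)
            sibling->keys[sibling->count++] = old->keys[i];
        else
            old->keys[kept++] = old->keys[i];
    }
    old->count = kept;

    /* Rewire only the directory slots that pointed at the old bucket
       and whose distinguishing bit is set. */
    for (uint32_t i = 0; i < (1u << ht->global_depth); i++) {
        if (ht->directory[i] == old && (i & high_bit))
            ht->directory[i] = sibling;
    }
}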

I need this feature to be able to use it as a Map<int64, int64>, which really simplifies my life. The key observation is that I don't actually need to do any hashing here. There can be no collisions, after all, given that I'm using 64-bit keys inside the hash table. Using extendible hashing, I also don't need to mix the values. If I get values that are clustered together in the same place, I'll end up splitting the page and solving the problem naturally. Of course, that would eventually cause me to double my directory size, so there is some cost for collisions, but I don't think that this is going to be a problem. I tested this with a million consecutive numbers (with no hash mixing) and got a directory size of 32KB. That doesn't seem likely to become a problem.
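The 32KB figure is easy to sanity check: a million 16-byte entries at 512 entries per 8KB bucket needs roughly 2,000 buckets, and since buckets are not perfectly packed, the power-of-two directory ends up at 4,096 slots of 8 bytes each, which is exactly 32KB. A throwaway check, under those assumptions:

#include <stdio.h>

int main(void) {
    unsigned long long entries = 1000000ull;
    unsigned long long pairs_per_bucket = 8192 / 16; /* 512 per 8KB bucket */
    unsigned long long buckets = (entries + pairs_per_bucket - 1) / pairs_per_bucket;

    /* Directory size is a power of two at least as large as the bucket count. */
    unsigned long long dir_slots = 1;
    while (dir_slots < buckets)
        dir_slots <<= 1; /* 2,048 at perfect packing; one more doubling gives 4,096 */

    printf("~%llu buckets, >= %llu directory slots (%llu bytes)\n",
           buckets, dir_slots, dir_slots * 8);
    return 0;
}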

I also ought to mention linear hashing, which uses a different approach. The linked article does a great job explaining how it works. There isn't any recent work comparing linear hashing to extendible hashing; the comparison I kept running into is from the 80s. The advantage then was using linear hashing on machines with small memories. Remember, this is the 80s that we are talking about, and I'm not sure what counted as small memory in that time frame. The article talks about limiting the amount of memory used for extendible hashing's directory to 1,000 entries. That translates (assuming 32-bit values) to less than 8KB of data. That isn't something I really think I need to worry about.

Let’s do some basic math, shall we?

Assuming an 8KB page size and 64-bit pointers, we can have a directory page for extendible hashing that holds pointers to 1,024 buckets. Each one of them is going to be 8KB in turn, so at full utilization, we are looking at a single directory page pointing to 8MB of buckets. In each bucket, we are storing keys and values that are both 64 bits in size, which gives us 524,288 entries. In practice, I haven't accounted for metadata, so we can probably expect to be able to address about half a million entries per 8MB. If we need to store a hundred million records, you'll need roughly 1.6GB using this scheme.
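Spelled out as a quick calculation (metadata ignored, and the names here are just for the arithmetic):

#include <stdio.h>

int main(void) {
    unsigned long long page = 8192;                          /* 8KB pages             */
    unsigned long long dir_entries = page / 8;               /* 1,024 bucket pointers */
    unsigned long long addressable = dir_entries * page;     /* 8MB of buckets        */
    unsigned long long per_bucket = page / 16;               /* 512 key/value pairs   */
    unsigned long long total = dir_entries * per_bucket;     /* 524,288 entries       */
    unsigned long long hundred_million = 100000000ull * 16;  /* ~1.6GB of raw pairs   */

    printf("%llu buckets, %lluMB addressable, %llu entries per directory page\n",
           dir_entries, addressable >> 20, total);
    printf("100M records: ~%lluMB\n", hundred_million >> 20);
    return 0;
}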

Given current hardware, that seems pretty good to me. There are other advantages to take into account, though. One of the advantages of linear hashing is that you can skip the directory part, and indeed, that can grow quite big. In the case of the 100 million records I mentioned earlier, you'll need about 12MB just for the directory metadata. Linear hashing can skip this cost (but has other storage costs related to overflow pages, which I'm ignoring here). However, that assumes that you can actually compute the address of a page from the key. That isn't really something we can do if we haven't dedicated the whole file to this purpose. In Voron, for example, you can't assume that you have a raw range of pages to work with. Pages are allocated as needed, and may be interleaved with pages from other users of the disk space.
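For contrast, this is roughly what the classic linear hashing address computation looks like (a generic sketch, not code from the linked article). The last line is the part that assumes a dedicated, contiguous range of pages:

#include <stdint.h>

/* n0 = initial bucket count, level = completed rounds of doubling,
   next = the next bucket that is due to be split. */
uint64_t linear_hash_page(uint64_t hash, uint64_t n0, uint64_t level, uint64_t next) {
    uint64_t bucket = hash % (n0 << level);
    if (bucket < next)                       /* this bucket was already split,      */
        bucket = hash % (n0 << (level + 1)); /* so use the finer-grained modulus    */
    return bucket; /* page address = file_base + bucket * PAGE_SIZE */
}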

I can't see any real advantages to linear hashing at this time. I would be happy to hear about scenarios where it makes sense.