Researching a disk based hash table


There is a specific scenario that I run into that would be really helped by an O(1) lookup cost on a disk-persistent data structure. Voron, our storage engine library, is built on top of a whole big pile of B+Trees, which have an O(log N) lookup cost. I could use that, but I wanted to see if we could do better.

The natural thing to do when you hear about O(1) costs is to go fetch the nearest hash table, so I spent some time thinking about how to build a hash table that would be persisted to disk. My needs are relatively simple, I believe:

  • O(1) lookup (at least in the common case).
  • Able to support data sets greater than memory.
  • Mutable (writes & deletes).
  • Keys and values are limited to int64.

It doesn’t sound hard, right?
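
To make those requirements concrete, this is roughly the shape of the thing I'm after. This is a hypothetical sketch; the names are mine and don't correspond to anything in Voron:

```python
# Hypothetical interface for the structure described above: a mutable
# int64 -> int64 map that lives in a file, can grow well past memory,
# and answers point lookups in O(1) in the common case.
from typing import Optional

class PersistentHashTable:
    def __init__(self, path: str) -> None: ...
    def get(self, key: int) -> Optional[int]: ...   # None if the key is missing
    def put(self, key: int, value: int) -> None: ...
    def delete(self, key: int) -> None: ...
```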

But when I started thinking about it, I ran into a pretty hard limit. If the size of the data is greater than memory, then we have to take data access costs into account. A simple approach here would be to allocate a section in the file for the hash table and use a hash to get to the right location in the file. That works if you don’t need to support mutations. But when you do, you run into a big problem. At some point, the load factor of the hash table is going to increase to the point where you need to grow it. At that point, you may need to re-hash the entire thing.
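
As a strawman, here is a minimal sketch of that "hash to a location in the file" approach. It assumes a slot count fixed at creation time, 16-byte slots of (int64 key, int64 value), linear probing for collisions, and zero standing in for an empty slot. Everything here is illustrative, not Voron code:

```python
import os
import struct

SLOT = struct.Struct("<qq")   # 16 bytes per slot: int64 key, int64 value
EMPTY = 0                     # assumption: 0 is never used as a real key

class FixedDiskHash:
    def __init__(self, path, slots=1 << 20):
        self.slots = slots
        mode = "r+b" if os.path.exists(path) else "w+b"
        self.f = open(path, mode)
        self.f.truncate(slots * SLOT.size)   # the layout is fixed at creation time

    def _read(self, idx):
        self.f.seek(idx * SLOT.size)
        return SLOT.unpack(self.f.read(SLOT.size))

    def _write(self, idx, key, value):
        self.f.seek(idx * SLOT.size)
        self.f.write(SLOT.pack(key, value))

    def put(self, key, value):
        idx = hash(key) % self.slots
        for _ in range(self.slots):
            k, _ = self._read(idx)
            if k == EMPTY or k == key:
                self._write(idx, key, value)
                return
            idx = (idx + 1) % self.slots     # linear probing on collisions
        raise RuntimeError("hash table is full")

    def get(self, key):
        idx = hash(key) % self.slots
        for _ in range(self.slots):
            k, v = self._read(idx)
            if k == EMPTY:
                return None
            if k == key:
                return v
            idx = (idx + 1) % self.slots
        return None
```

Note that the slot count is baked into the file layout, which is exactly where mutations get us into trouble.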

Assume that the hash table is 4 GB at this point, that you need to re-hash it into an 8 GB one, and that you have just 1 GB of memory available. That is going to take some time and be a pretty horrible process all around. That is as far as I got when I considered directly translating an in-memory hash table to a disk-based one. I’m pretty lucky that I don’t have to do that, because there is a wealth of research on the matter.
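
Growing the hypothetical table above is where things fall apart: every live entry has to be read from the old file and re-inserted into a larger one, because each key’s slot depends on the total slot count. A sketch of that pass, continuing the FixedDiskHash example:

```python
# Continuing the hypothetical FixedDiskHash sketch: growing it means a full
# pass over the old file, re-inserting every live entry into a new file,
# because the target slot changes when the slot count changes. With a 4 GB
# table and 1 GB of memory, the writes to the new file are effectively
# random I/O with almost no cache to absorb them.

def regrow(old: FixedDiskHash, new_path: str) -> FixedDiskHash:
    new = FixedDiskHash(new_path, slots=old.slots * 2)
    for idx in range(old.slots):
        key, value = old._read(idx)      # sequential scan of the old file
        if key != EMPTY:
            new.put(key, value)          # scattered writes to the new file
    return new
```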

These ideas go back to before I was born, although B+Trees predate them by a decade or so. The key here is to use extendible hashing. The Wikipedia article is pretty good; I especially liked the Python code showing how things work there. The original paper on the topic is also quite interesting, and should be of interest to people who care about the details of how storage engines work.
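
To show the idea, here is a minimal in-memory sketch of extendible hashing, in the spirit of the Python example on the Wikipedia page (a real implementation would keep the directory and the buckets on disk). A directory of 2^global_depth pointers leads to buckets; when a bucket fills up, only that bucket is split and only the directory entries that pointed at it are touched, so growth never requires re-hashing the whole table:

```python
# Minimal in-memory sketch of extendible hashing. A real implementation would
# persist the directory and the buckets to disk; this only shows the mechanics.

class Bucket:
    def __init__(self, depth, capacity=4):
        self.depth = depth            # local depth: how many hash bits this bucket "owns"
        self.capacity = capacity
        self.items = {}

class ExtendibleHash:
    def __init__(self, capacity=4):
        self.global_depth = 1
        self.capacity = capacity
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        # the low global_depth bits of the hash select the directory entry
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._index(key)].items.get(key)

    def put(self, key, value):
        bucket = self.directory[self._index(key)]
        if key in bucket.items or len(bucket.items) < bucket.capacity:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.put(key, value)          # retry; may trigger further splits

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # double the directory; the new entries initially alias the old buckets
            self.directory += self.directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth, bucket.capacity)
        high_bit = 1 << (bucket.depth - 1)          # the newly significant hash bit
        # move the entries whose new bit is set into the sibling bucket
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            (sibling if hash(k) & high_bit else bucket).items[k] = v
        # repoint only the directory slots that should now reference the sibling
        for i in range(len(self.directory)):
            if self.directory[i] is bucket and i & high_bit:
                self.directory[i] = sibling

if __name__ == "__main__":
    ht = ExtendibleHash(capacity=2)
    for i in range(1_000):
        ht.put(i, i * i)
    assert ht.get(42) == 1764 and ht.get(999) == 998001
```

The part that matters for my scenario is in _split: a split doubles (at most) the directory of pointers, not the data itself, so there is no 4 GB to 8 GB re-hash of everything.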

I believe that my next step is going to be playing with some codebases that implement these ideas. I decided to look at how this is done in the DBM family of systems. They are old, and some of them are probably direct implementations of the extendible hashing paper, but I’m mostly interested in seeing how things fit together at this point.

All of that said, I ran into a lot of red flags along the way.

Modern B-Tree Techniques discusses the issue of B-Trees vs. hash indexes and comes to the conclusion that you shouldn’t bother. It covers quite a few aspects of this issue, from implementation complexity to usage scenarios.

The Berkeley DB documentation states that for anything requiring locality of reference, B-Trees are the way to go. However, for large amounts of data, their Hash implementation uses less metadata, so it might be better. That said, this doesn’t match my expectations for how the system will behave. Looking at this StackOverflow answer, it seems very likely that if you have a working set greater than memory, the hash implementation will hit page faults all the time, while the B-Tree implementation will be able to keep at least most of its metadata in memory and benefit greatly from that.

Indeed, we have this interesting quote from Berkeley DB as well:

Hash access method is appropriate for data sets so large that not even the Btree indexing structures fit into memory. At that point, it's better to use the memory for data than for indexing structures. This trade-off made a lot more sense in 1990 when main memory was typically much smaller than today.

All in all, this seems like a nice area for me to look into. I’ll go and read some code now, and maybe I’ll write about it.