In my previous post, I showed a pretty trivial index and asked how to efficiently update it. Efficient being time & memory wise.
The easiest approach to do that is by using a reverse lookup option. Of course, that means that we actually need to store about twice as much data as before.
Given the following documents:
- users/21 – Oren Eini
- users/42 – Hibernating Rhinos
- users/13 – Arava Eini
Previously, we had:
Term | Documents |
Oren | users/21, |
Eini | users/21, users/13 |
Hibernating | users/42, |
Rhinos | users/42, |
Arava | users/13 |
And with the reverse lookup, we have:
Term | Documents | Document | Terms | |
Oren | users/21, | users/21 | Oren, Eini | |
Eini | users/21, users/13 | users/42 | Hibernating, Rhinos | |
Hibernating | users/42 | users/13 | Arava, Eini | |
Rhinos | users/42 | |||
Arava | users/13 |
And each update to the index would first do a lookup for the document id, then remove the document id from all the matching terms.
The downside of that is that we take about twice as much room. The upside is that all the work is done during indexing time, and space is pretty cheap.
It isn’t that cheap, though. So we want to try something better.
Another alternative is to introduce a level of indirection, like so:
Term | Documents | Num | Id | |
Oren | 1, | 1 | users/21 | |
Eini | 1,3 | 2 | users/42 | |
Hibernating | 2 | 3 | users/13 | |
Rhinos | 2 | |||
Arava | 3 |
Now, let us say that we want to update users/13 to be Phoebe Eini, we will end up with:
Term | Documents | Num | Id | |
Oren | 1, | 1 | users/21 | |
Eini | 1,3,4 | 2 | users/42 | |
Hibernating | 2 | 4 | users/13 | |
Rhinos | 2 | |||
Arava | 3 | |||
Phoebe | 4 |
We removed the 3rd document, and didn’t touch the terms except to add to them.
That gives us a very fast way to add to the system, and if someone will search for Arava, we will see that the number no longer exists, so we’ll return no results for the query.
Of course, this means that we have to deal with garbage in the index, and have some way to clean it up periodically. It also means that we don’t have a way to really support Update, instead we have just Add and Delete operations.