David Boike shows us the Dude…
In this talk, I’m creating, live on stage, a database for time series data, with the goal of handling 100 billion events per year.
And it actually worked!
In this panel discussion, the speakers at the RavenDB Conference talk with the attendees and share their own experiences of using RavenDB in anger.
This low level talk is all about Voron, how it actually works, the design decisions that we had to make, the things that didn’t work…
We recorded the talks at the recent RavenDB conference, and we are starting to make them available now. The first to show up is Mauro Servienti’s intro talk to RavenDB.
https://www.youtube.com/watch?v=0WEMNKk3HHU
Corax is a research project that we have, to see how we can build a full text search library on top of Voron. Along the way, we take the chance to find out how Lucene does things, and what we can do better. Pretty much from the get go, Corax is likely to use more disk space than Lucene, probably significantly so. I would be happy if we could keep it to merely a 50% increase over Lucene.
The reason is that Lucene goes to great lengths to save disk space: from storing all integers in a variable length format, to prefix compression, to implicitly referencing data in other files. For example, you can see that when you try reading term positions:
TermPositions are ordered by term (the term is implicit, from the .tis file).
Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).
The downside of saving every little bit is that it is a lot more complex to read the data, requiring multiple caches and complex code paths to actually get it right. It made a lot of sense when Lucene was created: disk space was at a premium. I won’t go as far as to say that disk space doesn’t matter, but given a trade off of using more disk space vs. using more memory / complexity, it is much easier to justify disk space usage today*.
* The caveat here is that you need to be careful, because just accessing the disk can be very slow.
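To make that trade off concrete, here is a minimal, self-contained sketch of the kind of variable length integer encoding mentioned above (Lucene calls its flavor VInt): seven payload bits per byte, with the high bit marking whether another byte follows, so small values like a term frequency of 1 take a single byte. The class and method names here are mine, not Lucene’s or Corax’s.

using System;
using System.Collections.Generic;

public static class VarIntSample
{
    // Encode a non-negative integer using 7 bits per byte; the high bit of
    // each byte says "another byte follows". Small values take one byte.
    public static IEnumerable<byte> Encode(uint value)
    {
        while (value >= 0x80)
        {
            yield return (byte)((value & 0x7F) | 0x80);
            value >>= 7;
        }
        yield return (byte)value;
    }

    // Decode a value produced by Encode, reading bytes until the high bit is clear.
    public static uint Decode(IEnumerator<byte> bytes)
    {
        uint value = 0;
        int shift = 0;
        while (bytes.MoveNext())
        {
            byte b = bytes.Current;
            value |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                break;
            shift += 7;
        }
        return value;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Encode(1)));   // 1     (one byte)
        Console.WriteLine(string.Join(",", Encode(300))); // 172,2 (two bytes)
    }
}

The saving is real, but every read now has to walk the bytes one at a time, which is exactly the kind of complexity Corax is willing to trade away for disk space.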
One of the major things that we wanted to deal with in Corax is reducing index corruption issues, and seeing if we can simplify things into a transactional system. As a side effect of that, we don’t need to have index segments, and we don’t need to do merges to free disk space. The problem is that in order to handle this, we need to track additional information that Lucene doesn’t need to.
Let us look at the actual data we keep. Here is a very simple index:
using (var fullTextIndex = new FullTextIndex(StorageEnvironmentOptions.CreateMemoryOnly(), new DefaultAnalyzer()))
{
    using (var indexer = fullTextIndex.CreateIndexer())
    {
        indexer.NewIndexEntry();
        indexer.AddField("Name", "Oren Eini");
        indexer.AddField("Email", "[email protected]");

        indexer.NewIndexEntry();
        indexer.AddField("Name", "Arava Eini");
        indexer.AddField("Email", "[email protected]");

        indexer.Flush();
    }
}
For each field, we are going to create a multi tree. And for each unique term in the field we have a list of (Index Entry Id, term frequency, boost).
- @fld_Name
  - arava
    - { 2, 1, 1.0 } (index id 2, freq 1, boost 1.0)
  - eini
    - { 1, 1, 1.0 }
    - { 2, 1, 1.0 }
  - oren
    - { 1, 1, 1.0 }
- @fld_Email
  - [email protected]
    - { 2, 1, 1.0 }
  - [email protected]
    - { 1, 1, 1.0 }
This is pretty much the equivalent to the way Lucene stores things. Possible space optimizations here include not storing default values (term freq or boost of 1), storing index entry ids as variable ints, etc.
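To make the layout above a bit more tangible, here is a small, self-contained C# model of the same shape: a tree per field, the terms of that field sorted, and a list of (index entry id, frequency, boost) postings per term. It models the structure only; the actual Corax code stores this in Voron multi trees, not dictionaries.

using System;
using System.Collections.Generic;

// One posting: (index entry id, term frequency, boost), as in the listing above.
public record Posting(long IndexEntryId, int Frequency, float Boost);

public class InvertedIndexModel
{
    // field name -> term -> postings; stands in for the per field multi tree.
    readonly Dictionary<string, SortedDictionary<string, List<Posting>>> _fields = new();

    public void Add(string field, string term, long entryId, int freq = 1, float boost = 1.0f)
    {
        if (_fields.TryGetValue(field, out var terms) == false)
            _fields[field] = terms = new SortedDictionary<string, List<Posting>>(StringComparer.Ordinal);
        if (terms.TryGetValue(term, out var postings) == false)
            terms[term] = postings = new List<Posting>();
        postings.Add(new Posting(entryId, freq, boost));
    }

    public static void Main()
    {
        var index = new InvertedIndexModel();
        // Mirrors the Name field from the example above.
        index.Add("@fld_Name", "oren", 1);
        index.Add("@fld_Name", "eini", 1);
        index.Add("@fld_Name", "arava", 2);
        index.Add("@fld_Name", "eini", 2);
    }
}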
The problem is that while this is actually enough for the way Lucene does things, it is not enough for the way Corax does things. Let us consider the case of deleting a document. How would you go about doing this using the information above?
Lucene does this by marking a document id as deleted, and will purge its details on the next segment merge. That works, but only because a segment merge actually reads & writes all of the relevant segments’ data. Without a segment merge, deleting a document would require us to scan all the data in the entire database. This is not really practical. Therefore, we need to store additional data so we can delete it later on. In this case, we have the Docs tree, which has keys of (index entry id, field id, term num). It looks like this:
- Docs
  - [1,1,1]: oren
  - [1,1,2]: eini
  - [1,2,1]: [email protected]
  - [2,1,1]: arava
  - [2,1,2]: eini
  - [2,2,1]: [email protected]
Using this information, we can now remove all traces of a document when it is deleted. However, the problem here is that we need to also keep the terms per document in the index. That really blows up the index size, obviously.
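A minimal sketch of why that reverse mapping matters for deletion: with only the term -> postings side, deleting entry 1 would mean scanning every term of every field looking for it; with an entry -> (field, term) listing, we only touch the handful of terms the entry actually used. The types and names here are illustrative, not the actual Corax trees.

using System.Collections.Generic;

public class DeletionModel
{
    // term side: field -> term -> index entry ids (frequency and boost omitted here).
    readonly Dictionary<string, Dictionary<string, HashSet<long>>> _postings = new();
    // the "Docs" side: index entry id -> the (field, term) pairs it was indexed under.
    readonly Dictionary<long, List<(string Field, string Term)>> _docs = new();

    public void Add(long entryId, string field, string term)
    {
        if (_postings.TryGetValue(field, out var terms) == false)
            _postings[field] = terms = new Dictionary<string, HashSet<long>>();
        if (terms.TryGetValue(term, out var entries) == false)
            terms[term] = entries = new HashSet<long>();
        entries.Add(entryId);

        if (_docs.TryGetValue(entryId, out var usedTerms) == false)
            _docs[entryId] = usedTerms = new();
        usedTerms.Add((field, term));
    }

    // Deletion only visits the terms recorded for this entry, instead of
    // scanning the entire database.
    public void Delete(long entryId)
    {
        if (_docs.TryGetValue(entryId, out var usedTerms) == false)
            return;
        foreach (var (field, term) in usedTerms)
        {
            var entries = _postings[field][term];
            entries.Remove(entryId);
            if (entries.Count == 0)
                _postings[field].Remove(term); // last posting gone, drop the term
        }
        _docs.Remove(entryId);
    }
}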
The reason for storing the document fields in this peculiar manner is that we also want to reuse this information for sorting. When Lucene needs to sort data, it has to read all of the data from the fields, then recreate the values for all relevant documents. Corax can just serve the data that is already there.
A pretty obvious step to save space would be to track the terms separately, and use an id in the Docs tree, not the full term. That leads to an interesting problem, because we are going to need to be able to go from term –> id and id –> term, which pretty much requires storing them twice, unfortunately.
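A tiny sketch of what that looks like; the two dictionaries stand in for the two lookups (term -> id and id -> term) we would need to persist, which is why the terms effectively end up stored twice. The names are illustrative only.

using System.Collections.Generic;

public class TermDictionaryModel
{
    // term -> id: used when indexing and querying ("what id does 'eini' have?").
    readonly Dictionary<string, long> _termToId = new();
    // id -> term: used when deleting or sorting ("what term is id 42?").
    readonly Dictionary<long, string> _idToTerm = new();
    long _nextId;

    public long GetOrAdd(string term)
    {
        if (_termToId.TryGetValue(term, out var id))
            return id;
        id = ++_nextId;
        _termToId[term] = id; // first copy of the term
        _idToTerm[id] = term; // second copy, to go back from id to term
        return id;
    }

    public string GetTerm(long id) => _idToTerm[id];
}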
Final note, Corax is a research project.
During the RavenDB courses* in the past few weeks, I was talking with one of the attendees and I came up with what I think is a great analogy.
* Yes, courses, in the past 4 weeks, I’ve given the RavenDB course 3 times. That probably explains why I don’t remember which course it was.
What are your success metrics? From Opening a Champagne Bottle To Hiding Under the Bed with Said Bottle?
The first success metric is when you have enough users (and, presumably, revenue) to cross the threshold to the Big Boys League. Let us call this the 25,000 users range. That is the moment when you throw a party, go to the store and grab a whole case of champagne bottles and make fancy speeches. Of course, the problem with success is that you can have too much of it. A system that does just fine (maybe creaks a little) with 25,000 users is going to behave pretty differently when you have 100,000 users. That is the moment when you find your engineers under the bed, with a half empty bottle of champagne, muttering things about Out Of Capacity errors and refusing to come out until you fire all the users.
In just about any system, you need to define the success points. Twitter was very lucky that it managed to grow even though it had so many problems when its user base exploded. It is far more likely that users will figure out that your service is… well, that your engineers are drunk and hiding under the bed, and that the service looks accordingly.
And yes, you can try to talk to people about SLAs, metrics and capacities. But I have found that an image like that tends to give you a lot more focused answers. Even if a lot of the time the answer is “I don’t know”. That is a place to start, because it makes things a lot more acute than just “how many req/sec do we need to support?”.
There is a lot of stuff that is hard to do when you are working on a large team. But the really nice thing is the velocity at which you can move.
I just started the morning with the following commands:
And I have another PR pending for a different branch that I’m going to have to look at. Overall, I think that this is a pretty cool thing to have. We can push forward in many directions at once, and it can be pretty awesome to look at all the good things that are coming our way.
One of the nice features that Voron got from LMDB is the notion of multi trees. If I recall correctly, LMDB calls them duplicate items, or something like that. Basically, it is the ability to store multiple values for a single key.
Update: The behavior described in this post has been in LMDB for over 3 years. It is just something that Voron didn't have. Please note that this post discusses Voron's limitation and its solution, not LMDB's, which has had the solution for a long time.
Those items are actually stored as a nested tree, which effectively gives us a great way to handle duplicates. From an API standpoint, this looks like this:
// store
tree.MultiAdd(tx, "Eini", "users/1");
tree.MultiAdd(tx, "Eini", "users/3");
tree.MultiAdd(tx, "Eini", "users/5");

// read
using (var it = tree.MultiRead(tx, "Eini"))
{
    if (it.Seek(Slice.BeforeAllKeys) == false)
        yield break;

    do
    {
        yield return it.CurrentKey; // "users/1", "users/3", "users/5"
    } while (it.MoveNext());
}
Internally, we handle this in the following fashion:
- If a multi add operation is the very first such operation, we’ll add it as a simple key/value pair in the tree.
- If a multi add operation is the 2nd such operation, we’ll create a new tree, and add both operations to the new tree. The original tree will have the key/nested tree reference stored.
That leads to a very easy to use API, which is quite useful in many cases. However, it also has space usage implications. In particular, a tree means that we have to allocate at least one page to it. And that means that we have to allocate 4KB to hold any multi tree with more than a single value. Since there are actually a lot of scenarios where we want to store a small number of small values, that can often lead to a lot of waste on our part. We use 4KB to store data that we could have stored in just 64 bytes, for example. That means that when we want to store things that might be duplicated, we need to take the space cost into consideration as well.
I think that this is a pretty bad idea, because it pushes us toward things like composite keys in some scenarios. That is something that Voron should be able to just handle for me. So we changed that; in fact, what we did was change the way we approach multi trees in general. Now, for every multi value, we can store up to 1KB of items in the same page as the key, before we devolve into a full blown multi tree.
The idea is that we can use the same Page mechanism we use elsewhere, and just have a nested page inside the parent page. As long as the nested page is small enough, we can just store it embedded. That results in some nice space savings, since usually we have items in the following counts: zero, one, a few, lots.
We need this to help with Corax, and in general, that would reduce the amount of space we need to use in many cases.
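Here is a rough, self-contained model of the policy described above: values for a key stay embedded next to the key while their total size fits within a small budget (1KB above), and only once they outgrow it do we pay for a separate nested tree (at least one full 4KB page). It models the decision only, not the actual Voron page code.

using System.Collections.Generic;
using System.Linq;
using System.Text;

public class MultiValueModel
{
    const int EmbeddedBudget = 1024;    // keep values inline up to ~1KB
    readonly List<string> _embedded = new();
    SortedSet<string> _nestedTree;      // stands in for a full multi tree (its own page)

    public void MultiAdd(string value)
    {
        if (_nestedTree != null)
        {
            _nestedTree.Add(value);
            return;
        }
        _embedded.Add(value);
        if (_embedded.Sum(v => Encoding.UTF8.GetByteCount(v)) > EmbeddedBudget)
        {
            // Outgrew the inline budget: promote to a real nested tree.
            _nestedTree = new SortedSet<string>(_embedded);
            _embedded.Clear();
        }
    }

    public IEnumerable<string> MultiRead() => _nestedTree ?? (IEnumerable<string>)_embedded;
}

With a few small values, everything stays next to the key; a key with many values behaves like the old multi tree.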
I posted before about the design of how I would approach building a search engine library. I decided to bite the bullet and actually try to do it. Using Voron, that turned out to be a really simple thing to do. Of course, this doesn’t do a tenth of what Lucene does, but it actually does quite a lot. The code is available here, and I want to emphasize again, this is purely an experimental / research project.
The entire thing comes to less than 500 lines of code. And it is pretty functional even at this stage.
Corax is composed of:
- Analysis
- Indexing
- Querying
Analysis of the documents is handled via analyzers:
public interface IAnalyzer
{
    ITokenSource CreateTokenSource(string field, ITokenSource existing);
    bool Process(string field, ITokenSource source);
}
An analyzer creates a token source, which accepts a TextReader and produces tokens. For each token, the Process method is called, and it is used to do things to the relevant token. For example, here is the default analyzer:
public class DefaultAnalyzer : IAnalyzer
{
    readonly IFilter[] _filters =
    {
        new LowerCaseFilter(),
        new RemovePossesiveSuffix(),
        new StopWordsFilter(),
    };

    public ITokenSource CreateTokenSource(string field, ITokenSource existing)
    {
        return existing ?? new StringTokenizer();
    }

    public bool Process(string field, ITokenSource source)
    {
        for (int i = 0; i < _filters.Length; i++)
        {
            if (_filters[i].ProcessTerm(source) == false)
                return false;
        }
        return true;
    }
}
The idea here is to match, fairly closely, what Lucene is doing, but hopefully with clearer code. This analyzer will take a stream of text, break it up into discrete tokens, lower case them, remove the possessive ‘s suffix and clear stop words. Note that each of the filters actually modifies the token in place. And the tokenizer is pretty simple, but it does the job for now.
Now, let us move to indexing. With Lucene, the recommendation is that you reuse your document and field instances, to avoid creating garbage for the GC. With Corax, I took it a step further:
using (var indexer = fullTextIndex.CreateIndexer())
{
    indexer.NewDocument();
    indexer.AddField("Name", "Oren Eini");

    indexer.NewDocument();
    indexer.AddField("Name", "Ayende Rahien");

    indexer.Flush();
}
There are a couple of things to note here. An index can create indexers; it is intended to have multiple concurrent indexers running at the same time. Note that unlike Lucene, we don’t have Document or Field classes. Instead, we call methods on the indexer to create a new document and then add fields to the current document. When you are done with a document, you start a new one, or flush to complete the entire operation. For long running indexing, the indexer will flush itself automatically for you.
I think that this API gives us the best approach. It guides you toward using a single instance, with internal optimizations to make it memory efficient. Multiple instances can be used concurrently to speed up indexing time. And it knows when to flush itself for you, so you don’t have to worry about that. Although you do have to complete the operation by calling Flush() at the end.
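Since the index is built for several concurrent indexers, a large batch can be split across threads, each owning its own indexer and calling Flush() when its slice is done. The partitioning scheme and the people array here are made up for illustration; only CreateIndexer, NewDocument, AddField and Flush come from the API shown above.

using System.Threading.Tasks;

public static class ParallelIndexing
{
    public static void IndexAll(FullTextIndex fullTextIndex, (string Name, string Email)[] people)
    {
        // Four partitions, each with its own indexer running concurrently.
        Parallel.For(0, 4, partition =>
        {
            using (var indexer = fullTextIndex.CreateIndexer())
            {
                for (int i = partition; i < people.Length; i += 4)
                {
                    indexer.NewDocument();
                    indexer.AddField("Name", people[i].Name);
                    indexer.AddField("Email", people[i].Email);
                }
                indexer.Flush(); // complete this indexer's batch
            }
        });
    }
}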
How about searching? That turned out to be pretty similar as well. All you have to do is create a searcher:
using (var searcher = fti.CreateSearcher())
{
    QueryResults queryResults = searcher.QueryTop(new TermQuery("Name", "Arava"), 10);
    Console.WriteLine(queryResults.TotalResults);
    foreach (var match in queryResults.Results)
    {
        Console.WriteLine(match.DocumentId + " - " + match.Score);
    }
}
We create a searcher, and then we can utilize it to perform queries.
So far, this has been all about the API we have, I’ll talk about the actual implementation in my next post.