Building data store – indexing data structure
I run into an interestingly annoying problem recently. Basically, I am trying to write the following code:
tree.Add(new { One = 1, Two = 2 }, 13);
tree.Add(new { One = 2, Two = 3 }, 14); tree.Add(new { One = 3, Two = 1 }, 15); var docPos = tree.Find(new { One = 1, Two = 2 }); Assert.Equal(13, docPos); docPos = tree.Find(new { Two = 2 }); Assert.Equal(13, docPos); docPos = tree.Find(new { Two = 1 }); Assert.Equal(14, docPos);
As you can imagine, this is part of an indexing approach, the details of which aren’t really important. What is important is that I am trying to figure out how to support partial searches. In the example, we index by One & Two, and we can search on both of them. The problem begins when we want to make a search on just Two.
While the tree can compare between partial results just fine, the problem is how to avoid traversing the entire tree for a partial result. The BTree is structured like this:
The problem when doing a partial search is that at the root, I have no idea if I should turn right or left.
What I am thinking now is that since I can’t do a binary search, I’ll have to use a BTree+ instead. Since BTree+ also have the property that the leaf nodes are a linked list, it means that I can scan it effectively. I am hoping for a better option, though.
Any ideas?
Comments
Afaik, in the database world they use bitmap indexes for this, but only if the cardinality (=amount of possible different value combinations) is small enough.
A bitmap index is basicly an intermediate table that contains all the unique combinations between your indexed fields, and for each combination a series of references to all items containing this combination...
The other way of solving it would be using a separate index for every searchable field, but that would probably cost to much... (allthough storage is rather cheap these days)
If you do find an answer, that would be big news, because you would avoid having to create multiple indexes for multiple searches..
Good luck on your quest...
Tom
ps: Fyi, I am a big fan of your work/blog
I wonder if using a skip list implementation instead of btrees would provide a bit more flexibility.. you could do all sorts of interesting and unholy things the sparser lists.
Sorry, comment above should end: "WITH the sparser lists."
In a database, this is typically done by creating a second index over the separate elements, e.g. solely on 'Two'. When you for example use a compound PK with two fields which are also FK fields (each field in the compound pk is an fk to a different table, e.g. OrderDetails in Northwind), you get an index on the whole PK, but filtering on 1 field in that compound PK will still produce a table scan, as it can't leverage the index available, as that's a compound index, like you have.
the DB engine optimizer then picks the right index: it picks the single field index (if available) if the fields necessary for the other indexes aren't available.
So in your case: if you search on One and Two, the engine should pick the index over One AND Two (the one you create in your example). In the case of searching on solely 'Two', the index over 'Two' is picked, as there's no value for 'One' to limit down the rows in that index, or better: not all fields in the index are present in the predicate set.
In SQL Server the leaf nodes form a linked list. Has worked for them for years^^
Addy,
A skip list isn't really helpful, because you run into the same situation either way.
Try implementing partial search with skip list, and you run into the same problem
Frans,
Yeah, pretty much what I got to.
Although, I think that what you can do is perform an index scan, rather than a table scan, in those cases. That should still be more efficient.
Tobi,
What you describe is a BTree+
Hello,
We are currently using R-tree index for quite similar tasks. I can mail you c# rtree in-memory index for indexind integer-bound multidimensional objects.
You are asking two different questions really. What is the structure you should store the the original index in, and what is the structure you should store the secondary index in.
And it all depends on your access pattern. For a general approach, B+-Trees have worked well for databases for a long time, but it really does require more info than the above to base a real answer either of the indexes.
You will find MySQL indexes (at least) run into the same trouble Ayende.
They support multi-column queries, but only if you specify a match on the left-most columns of the index without leaving a 'gap' in the selection clause.
For example, assume an index on: INDEX (col1, col2, col3).
A WHERE clause (for arguments sake) can utilise this index when it looks like:
WHERE col1 = ?
WHERE col1 = ? AND col2 = ?
WHERE col1 = ? AND col2 = ? AND col3 = ?
But not:
WHERE col1 = ? AND col3 = ?
WHERE col2 = ? AND col3 = ?
WHERE col3 = ?
This is because the index was created by merging the columns in order of col1, col2, col3. Without specifying a selection criteria for the first (left most) columns, the index would need to begin with every index node as the starting point and then evaluate col2, etc.
So maybe at a minimum, you could have the same matching criteria on your indexes. If people want to search by 'Two = ?' (and One = ? AMD Two = ?), then they need to define an index like INDEX (Two, One) instead of INDEX (One, Two).
They could opt to create two indexes of both combinations if their queries often require one or the other field being searched on.
"Since BTree+ also have the property that the leaf nodes are a linked list, it means that I can scan it effectively. I am hoping for a better option, though."
I believe databases (at least MySQL) will prefer a table scan instead if it can't use the index to begin with, because random I/O (by randomly checking index nodes) is worse than just doing a sequential table scan.
--
Bitmap indexes are a good idea when you have a low cardinality (eg only a few values for a column, such as True/False or Yes/Now/Maybe, etc).
They would let you quickly filter out which nodes contain that value; but you need to maintain a bitmap index for each possible value which is where it gets expensive.
They have a nice advantage though that you can merge multiple bitmap indexes to resolve queries.
For large indexes with low cardinality on One, you can find the distinct values of One (very fast as you can repeatedly read a value from One and seek to the neek position where x > (last value of One) ) and for each of those you can to a seek with One and Two. SQL Server currently does not support this access path (alas).
Andrew,
I would love to see that, sure, but please not that I am thinking about the problem in the more generic sense.
The indexed data may be: { User = "ayende", Age = 18 }, for example.
Dennis,
Not quite, I am trying to see if I can do a multi column index that allows searching on non-leftmost columns
Andrew,
Yeah, I think that I pretty much agree with you here.
Regarding index vs. table scans, that actually depend on how you store the data.
In a relational datastore, table scan is usually straightforward scanning.
In append only, table scan would involve a lot of seeks, so an index scan is probably better.
Tobi,
Good point.
If your index is ordered (One then Two), then it's unordered regarding Two only. There isn't a general way to efficient search unordered data, you'll always have to so some kind of full scan.
In some special cases there might be clever solutions (e.g. when One has low cardinality), but assuming no special knowledge about the data, a full scan cannot be avoided.
If you need to search for exact matches on One, Two and One+Two, then having two indexes should work best.
If you have range queries (X1 <= One <= X2 && Y1 <= Two <= Y2), then that's actually a geometric problem. You'll want to use a spatial index data structure (e.g. a quad tree).
Searching (O(N)) or indexed searching (less than O(N))?
I know of no data structure that is capable of keeping two or more attributes to do indexed searching on. And I am guessing that if one exists it would put a severe penalty on the searching of the "primary" key (A wild guess would that in such a structure O(log N) would be the best that you can do). An obvious solution would be to just implicitly keep more than one index.
For just allowing linear searching, then yes, B+ Trees would be an excellent solution. But you will find that you need to do a lot of work to keep the leafs on disc so that they are sequentially accessible.
I was wrong. It can be done using kd-trees. http://en.wikipedia.org/wiki/Kd-tree
Querying an axis-parallel range in a balanced kd-tree takes O(n1-1/k +m) time, where m is the number of the reported points, and k the dimension of the kd-tree.
Kd-trees, quad trees, bitsliced, projection indeces are not good in high dimensions. R-tree rules (:
what about having two indexes(or n indexes for n columns)
then doing a search over one index you get a set of doc ids,then you do a search over the second index and do a intersect of the doc ids of the first index with the second index.
i.e. you get the doc ids that match the criteria of the two columns.
Dennis,
Wow, K Tree is fascinating, and I see how you can make a search efficiently there, but it seems like it would be pretty complex implementation-wise.
I am not sure if it wouldn't be simpler to just do an index scan on a BTree instead
Uriel,
I am trying to avoid having to do so.
Certain, for optimal performance, that would be best, but I hope to try something a bit better.
Perhaps a skipgraph would do what you want?
I've shared small project with R-tree index.
rapidshare.com/files/400910942/RTraverser.rar.html
It can be significantly improved to do range searches, such as "One from 767 to 891 and Two from 10.10.2010 to 11.11.2011". Also, it can be used to create hierarchized keys.
Comment preview