Building data store – indexing data structure

time to read 2 min | 341 words

I run into an interestingly annoying problem recently. Basically, I am trying to write the following code:

tree.Add(new { One = 1, Two = 2 }, 13);
tree.Add(new { One = 2, Two = 3 }, 14); 
tree.Add(new { One = 3, Two = 1 }, 15);

var docPos = tree.Find(new { One = 1, Two = 2 });
Assert.Equal(13, docPos);

docPos = tree.Find(new { Two = 2 });
Assert.Equal(13, docPos);

docPos = tree.Find(new { Two = 1 });
Assert.Equal(14, docPos);

As you can imagine, this is part of an indexing approach, the details of which aren’t really important. What is important is that I am trying to figure out how to support partial searches. In the example, we index by One & Two, and we can search on both of them. The problem begins when we want to make a search on just Two.

While the tree can compare between partial results just fine, the problem is how to avoid traversing the entire tree for a partial result. The BTree is structured like this:

The problem when doing a partial search is that at the root, I have no idea if I should turn right or left.

What I am thinking now is that since I can’t do a binary search, I’ll have to use a BTree+ instead. Since BTree+ also have the property that the leaf nodes are a linked list, it means that I can scan it effectively. I am hoping for a better option, though.

Any ideas?

Tweet Share Share 23 comments

Tags:

Databases

Comments

19 Jun 2010
05:07 AM

Tom Janssens

Afaik, in the database world they use bitmap indexes for this, but only if the cardinality (=amount of possible different value combinations) is small enough.

A bitmap index is basicly an intermediate table that contains all the unique combinations between your indexed fields, and for each combination a series of references to all items containing this combination...

The other way of solving it would be using a separate index for every searchable field, but that would probably cost to much... (allthough storage is rather cheap these days)

If you do find an answer, that would be big news, because you would avoid having to create multiple indexes for multiple searches..

Good luck on your quest...

Tom

ps: Fyi, I am a big fan of your work/blog

19 Jun 2010
07:59 AM

addy santo

I wonder if using a skip list implementation instead of btrees would provide a bit more flexibility.. you could do all sorts of interesting and unholy things the sparser lists.

19 Jun 2010
08:00 AM

addy santo

Sorry, comment above should end: "WITH the sparser lists."

19 Jun 2010
08:53 AM

Frans Bouma

In a database, this is typically done by creating a second index over the separate elements, e.g. solely on 'Two'. When you for example use a compound PK with two fields which are also FK fields (each field in the compound pk is an fk to a different table, e.g. OrderDetails in Northwind), you get an index on the whole PK, but filtering on 1 field in that compound PK will still produce a table scan, as it can't leverage the index available, as that's a compound index, like you have.

the DB engine optimizer then picks the right index: it picks the single field index (if available) if the fields necessary for the other indexes aren't available.

So in your case: if you search on One and Two, the engine should pick the index over One AND Two (the one you create in your example). In the case of searching on solely 'Two', the index over 'Two' is picked, as there's no value for 'One' to limit down the rows in that index, or better: not all fields in the index are present in the predicate set.

19 Jun 2010
10:06 AM

tobi

In SQL Server the leaf nodes form a linked list. Has worked for them for years^^

19 Jun 2010
10:09 AM

Ayende Rahien

Addy,

A skip list isn't really helpful, because you run into the same situation either way.

Try implementing partial search with skip list, and you run into the same problem

19 Jun 2010
10:10 AM

Ayende Rahien

Frans,

Yeah, pretty much what I got to.

Although, I think that what you can do is perform an index scan, rather than a table scan, in those cases. That should still be more efficient.

Tobi,

What you describe is a BTree+

19 Jun 2010
10:36 AM

Andrew Borodin

Hello,

We are currently using R-tree index for quite similar tasks. I can mail you c# rtree in-memory index for indexind integer-bound multidimensional objects.

19 Jun 2010
11:26 AM

Dennis

You are asking two different questions really. What is the structure you should store the the original index in, and what is the structure you should store the secondary index in.

And it all depends on your access pattern. For a general approach, B+-Trees have worked well for databases for a long time, but it really does require more info than the above to base a real answer either of the indexes.

19 Jun 2010
11:44 AM

Andrew

You will find MySQL indexes (at least) run into the same trouble Ayende.

They support multi-column queries, but only if you specify a match on the left-most columns of the index without leaving a 'gap' in the selection clause.

For example, assume an index on: INDEX (col1, col2, col3).

A WHERE clause (for arguments sake) can utilise this index when it looks like:

WHERE col1 = ?
WHERE col1 = ? AND col2 = ?
WHERE col1 = ? AND col2 = ? AND col3 = ?

But not:

WHERE col1 = ? AND col3 = ?
WHERE col2 = ? AND col3 = ?
WHERE col3 = ?

This is because the index was created by merging the columns in order of col1, col2, col3. Without specifying a selection criteria for the first (left most) columns, the index would need to begin with every index node as the starting point and then evaluate col2, etc.

So maybe at a minimum, you could have the same matching criteria on your indexes. If people want to search by 'Two = ?' (and One = ? AMD Two = ?), then they need to define an index like INDEX (Two, One) instead of INDEX (One, Two).

They could opt to create two indexes of both combinations if their queries often require one or the other field being searched on.

"Since BTree+ also have the property that the leaf nodes are a linked list, it means that I can scan it effectively. I am hoping for a better option, though."

I believe databases (at least MySQL) will prefer a table scan instead if it can't use the index to begin with, because random I/O (by randomly checking index nodes) is worse than just doing a sequential table scan.

Bitmap indexes are a good idea when you have a low cardinality (eg only a few values for a column, such as True/False or Yes/Now/Maybe, etc).

They would let you quickly filter out which nodes contain that value; but you need to maintain a bitmap index for each possible value which is where it gets expensive.

They have a nice advantage though that you can merge multiple bitmap indexes to resolve queries.

19 Jun 2010
11:59 AM

tobi

For large indexes with low cardinality on One, you can find the distinct values of One (very fast as you can repeatedly read a value from One and seek to the neek position where x > (last value of One) ) and for each of those you can to a seek with One and Two. SQL Server currently does not support this access path (alas).

19 Jun 2010
13:13 PM

Ayende Rahien

Andrew,

I would love to see that, sure, but please not that I am thinking about the problem in the more generic sense.

The indexed data may be: { User = "ayende", Age = 18 }, for example.

19 Jun 2010
13:14 PM

Ayende Rahien

Dennis,

Not quite, I am trying to see if I can do a multi column index that allows searching on non-leftmost columns

19 Jun 2010
13:17 PM

Ayende Rahien

Andrew,

Yeah, I think that I pretty much agree with you here.

Regarding index vs. table scans, that actually depend on how you store the data.

In a relational datastore, table scan is usually straightforward scanning.

In append only, table scan would involve a lot of seeks, so an index scan is probably better.

Tobi,

Good point.

19 Jun 2010
13:56 PM

Daniel

If your index is ordered (One then Two), then it's unordered regarding Two only. There isn't a general way to efficient search unordered data, you'll always have to so some kind of full scan.

In some special cases there might be clever solutions (e.g. when One has low cardinality), but assuming no special knowledge about the data, a full scan cannot be avoided.

If you need to search for exact matches on One, Two and One+Two, then having two indexes should work best.

If you have range queries (X1 <= One <= X2 && Y1 <= Two <= Y2), then that's actually a geometric problem. You'll want to use a spatial index data structure (e.g. a quad tree).

19 Jun 2010
14:00 PM

Dennis

Searching (O(N)) or indexed searching (less than O(N))?

I know of no data structure that is capable of keeping two or more attributes to do indexed searching on. And I am guessing that if one exists it would put a severe penalty on the searching of the "primary" key (A wild guess would that in such a structure O(log N) would be the best that you can do). An obvious solution would be to just implicitly keep more than one index.

For just allowing linear searching, then yes, B+ Trees would be an excellent solution. But you will find that you need to do a lot of work to keep the leafs on disc so that they are sequentially accessible.

19 Jun 2010
15:10 PM

Dennis

I was wrong. It can be done using kd-trees. http://en.wikipedia.org/wiki/Kd-tree

Querying an axis-parallel range in a balanced kd-tree takes O(n1-1/k +m) time, where m is the number of the reported points, and k the dimension of the kd-tree.

19 Jun 2010
15:31 PM

Andrew Borodin

Kd-trees, quad trees, bitsliced, projection indeces are not good in high dimensions. R-tree rules (:

19 Jun 2010
18:22 PM

Uriel Katz

what about having two indexes(or n indexes for n columns)

then doing a search over one index you get a set of doc ids,then you do a search over the second index and do a intersect of the doc ids of the first index with the second index.

i.e. you get the doc ids that match the criteria of the two columns.

19 Jun 2010
20:05 PM

Ayende Rahien

Dennis,

Wow, K Tree is fascinating, and I see how you can make a search efficiently there, but it seems like it would be pretty complex implementation-wise.

I am not sure if it wouldn't be simpler to just do an index scan on a BTree instead

19 Jun 2010
20:05 PM

Ayende Rahien

Uriel,

I am trying to avoid having to do so.

Certain, for optimal performance, that would be best, but I hope to try something a bit better.

20 Jun 2010
00:45 AM

Perhaps a skipgraph would do what you want?

20 Jun 2010
06:25 AM

Andrew Borodin

I've shared small project with R-tree index.

rapidshare.com/files/400910942/RTraverser.rar.html

It can be significantly improved to do range searches, such as "One from 767 to 891 and Two from 10.10.2010 to 11.11.2011". Also, it can be used to create hierarchized keys.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB