Table scans, index scans and index seeks, on my!

time to read 5 min | 979 words

In general, when you break it down to the fundamentals, a data base is just a fancy linked list + some btrees. Yes, I am ignoring a lot, but if you push aside a lot of the abstraction, that is what is actually going on.

If you ever dealt with database optimizations you are familiar with query plans, like the following (from NHProf):

You can see that we have some interesting stuff going on here:

  • Index Seek
  • Index Scan

And if you are unlucky, you are probably familiar with the dreaded “your !@&!@ query is causing a table scan!” scream from the DBA. But most of the time, people just know that table scan is slow, index scan is fast and index seek is fastest. I am ignoring things like clustered vs. unclustered indexes, since they aren’t really important for what I want to do.

For simplicity sake, I’ll use the following in memory data structure:

public class DocumentDatabase
{
    public List<JObject> Documents = new ...;
    public IDictionary<string, IDictionary<JToken, JObject>> Indexes = new ...; 
}

To keep things simple, we will only bother with the case of exact matching. For example, I might store the following document:

{ "User": "ayende", "Name": "Ayende Rahien", "Email": "[email protected]" }

And define an index on Name & Email. What would happen if I wanted to make a query by the user name?

Well, we don’t really have any other option, we have to do what amounts to a full table scan:

foreach (var doc in Documents)
{
    if(doc.User == "ayende")
          yield return doc;
}

As you can imagine, this is an O(N) operation, which can get pretty expensive if we are querying a large table.

What happen if I want to find data by name & email? We have an index that is perfectly suited for that, so we might as well use it:

Indexes["NameAndEmail"][new{Name="Ayende Rahien", Email = “[email protected]}];

Note that what we are doing here is accessing the NameAndEmail index, and then making a query on that. This is essentially an index seek.

What happens if I want to query just by email? There isn’t an index just for emails, but we do have an index that contains emails. We have two options, use a table scan, or and index scan. We already saw what a table scan is, so let us look at what is an index scan:

var nameAndEmailIndex = Indexes["NameAndEmail"];
foreach (var indexed in nameAndEmailIndex)
{
   if(indexed.Email == "[email protected]")
             yield return indexed;
}

All in all, it is very similar to the table scan, and when using in memory data structures, it is probably not worth doing index scans (at least, not if the index is over the entire data set).

Where index scans prove to be very useful is when we are talking about persistent data sets, where reading the data from the index may be more efficient than reading it from the table. That is usually because the index is much smaller than the actual table. In certain databases, the format of the data on the disk may make it as efficient to do a table scan in some situations as it to do an index scan.

Another thing to note is that while I am using hashes to demonstrate the principal, in practice, most persistent data stores are going to use some variant of trees.