We received a really interesting question from a user, which basically boils down to:
I need to query over a time span, either known (start, end) or (start, $currentDate), and I need to be able to sort on them.
That might sound… vague, I know. A better way to explain this is that I have a list of people, and I need to sort them by their age. That’s trivial to do since I can sort by the birthday, right? The problem is that we include some historical data, so some people are deceased.
Basically, we want to be able to take the following data and sort it by age:
| Name | Birthday | Death |
| --- | --- | --- |
| Michael Stonebraker | 1943 | N/A |
| Sir Tim Berners-Lee | 1955 | N/A |
| Narges Mohammadi | 1972 | N/A |
| Sir Terry Pratchett | 1948 | 2015 |
| Agatha Christie | 1890 | 1976 |
This doesn’t look hard, right? I mean, all you need to do is something like:
order by datediff( coalesce(Death, now()), Birthday )
Easy enough, and it would work great if you have a small number of items to sort. What happens if we want to sort over 10M records?
Look at the manner in which we are ordering: it requires us to evaluate each and every record. That means we’ll have to scan through the entire list and sort it, which can be really expensive. And because the sort value depends on the current date, which changes daily, you can’t even get away with a computed field.
RavenDB will refuse to run queries that only work with small amounts of data and fail as the data grows. This is part of our philosophy that things should Just Work. Of course, in this case the query doesn’t work at all, so how does that align with our philosophy?
The answer is simple: if we cannot make a query work in all cases, we reject it outright. This ensures that your system is not susceptible to hidden traps. By explicitly rejecting it upfront, we make sure that you end up with a good solution, not something that will fail as your data size grows.
What is the appropriate behavior here, then? How can we make it work with RavenDB?
The key issue is that we want to figure out the value we’ll sort on during the indexing stage. This matters because otherwise we’d have to compute it across the entire dataset for each query. We can do that in RavenDB by exposing that value to the index.
We cannot just call DateTime.Today in the index, however, since the indexed value would go stale as soon as the day rolls over. Instead, we store the current date in a document named config/current-date, like so:
{ // config/current-date
"Date": "2024-10-10T00:00:00.0000000"
}
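If you are working from the .NET client, here is a minimal sketch of how you might create that document. The Config class, the document id handling, and the use of UTC are assumptions for illustration, not part of the original question:
using System;
using Raven.Client.Documents;

// Hypothetical class matching the configuration document shown above.
public class Config
{
    public DateTime Date { get; set; }
}

public static class CurrentDateSetup
{
    public static void EnsureCurrentDate(IDocumentStore store)
    {
        using var session = store.OpenSession();

        // Store (or overwrite) the document under the well-known id.
        // Note: by default the client places a Config object in the "Configs"
        // collection; the index must load it from wherever it actually lives.
        session.Store(new Config { Date = DateTime.UtcNow.Date }, "config/current-date");
        session.SaveChanges();
    }
}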
Once this is stored as a document, we can then write the following index:
from p in docs.People
let end = p.Death ?? LoadDocument("config/current-date", "Config").Date
select new
{
    Age = end - p.Birthday
}
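If you prefer to define the index in code rather than in the Studio, roughly the same map function can be expressed with AbstractIndexCreationTask. This is a sketch: the Person class is an assumption about the document shape, and it reuses the Config class from the earlier snippet:
using System;
using System.Linq;
using Raven.Client.Documents.Indexes;

// Assumed shape of the documents in the People collection.
public class Person
{
    public string Name { get; set; }
    public DateTime Birthday { get; set; }
    public DateTime? Death { get; set; }
}

// The class name follows RavenDB's convention, producing the index name 'People/WithAge'.
public class People_WithAge : AbstractIndexCreationTask<Person>
{
    public People_WithAge()
    {
        Map = people => from p in people
                        // Use the death date when present, otherwise the shared current-date document.
                        // The second argument is the collection that document lives in; adjust it
                        // to match how the configuration document is actually stored.
                        let end = p.Death ?? LoadDocument<Config>("config/current-date", "Config").Date
                        select new
                        {
                            Age = end - p.Birthday
                        };
    }
}
Deploying it is then a matter of calling new People_WithAge().Execute(store) once at startup.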
And then query it using:
from index 'People/WithAge'
order by Age desc
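From the .NET client, the same query can be sent as raw RQL. This sketch assumes the Person class from above:
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;

public static class AgeQueries
{
    public static List<Person> OldestFirst(IDocumentStore store)
    {
        using var session = store.OpenSession();

        // Raw RQL, identical to the query shown above; the heavy lifting was
        // already done at indexing time, so this is just a sorted read.
        return session.Advanced
            .RawQuery<Person>("from index 'People/WithAge' order by Age desc")
            .ToList();
    }
}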
That works beautifully, of course, until the next day. What happens then? Well, we’ll need to schedule an update to the config/current-date document to correct the date.
At that point, because RavenDB tracks the association between the configuration document and every document that loaded it, the indexing engine will go and re-index those documents. The idea is that at any given point in time, the sort value has already been computed, so we can run really quick queries and sort on it.
When you update the configuration document, that is the signal that the referencing documents need to be re-indexed. RavenDB is good at doing that on a streaming basis, so it won’t need to do a huge amount of work all at once.
You’ll also note that we only load the configuration document if we don’t have an end date. So the deceased people’s records will not be affected or require re-indexing.
In short, we can query and sort over the age without incurring query-time costs, deferring those costs to background indexing time instead. The downside is that we need to set up a scheduled job (a cron job, for example) to update the date, but that isn’t too big a task, I think.
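For completeness, here is a hedged sketch of what that daily job might do, reusing the Config class from above and leaving the actual scheduling (cron, a hosted service, a task scheduler) to whatever your environment already provides:
using System;
using Raven.Client.Documents;

public static class CurrentDateJob
{
    // Intended to run once a day, shortly after midnight.
    public static void RollCurrentDate(IDocumentStore store)
    {
        using var session = store.OpenSession();

        var config = session.Load<Config>("config/current-date");
        if (config == null)
        {
            config = new Config();
            session.Store(config, "config/current-date");
        }

        config.Date = DateTime.UtcNow.Date;

        // Saving this change is the signal for RavenDB to re-index the
        // documents that loaded config/current-date in the index.
        session.SaveChanges();
    }
}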
You can utilize similar setups for other scenarios where you need to query over changing values. The performance benefits here are enormous. And what is more interesting, even if you have a huge amount of data, this approach will just keep on ticking and deliver great results at very low latencies.