I’m going to be talking to SWENUG next week about RavenDB and how it affects the design and architecture of your systems.
Feel free to join us live on Sep 22, 5 PM UTC.
On October 26, I’ll be giving a two-day workshop on RavenDB 5.0 as part of the NDC Conference.
In the workshop, I’m going to cover using RavenDB from scratch. We are going to explore how to utilize RavenDB, modeling and design decisions, and how to use RavenDB’s features to your best advantage. Topics include data distribution and management, application and system architecture, and much more.
I’m really looking forward to it, see you there.
You can listen to my conversation with Barry about RavenDB on the Developer Weekly Podcast.
Chicago publishes its taxi trips in an easy-to-consume format, so I decided to see what kind of information I could dig out of the data using RavenDB. Here is what the data looks like:
There are actually a lot more fields in the data, but I wanted to generate a more focused dataset to show off certain features. For that reason, I’m going to record the trips for each taxi, where for each trip I look at the start time, duration, and pickup and drop-off locations. The data size is significant, with about 194 million trips recorded.
I converted the data into RavenDB’s time series, with a Location time series for each taxi’s location at a given point in time. You can see that the location is tagged with the type of event associated with it. The raw data has both the pickup and the drop-off for each row, but I split it into two separate events.
The reason I did it this way is that we get a lot of questions about how to use RavenDB for doing… stuff with vehicle and location data. The Chicago taxi data is a good source for a non-trivial amount of real-world data, which is very nice to use.
Once we have all the data loaded, we can see that there are 9,179 distinct taxis in the dataset, with a varying number of events for each taxi. Here is one such scenario:
The taxi in question has six years of data and 6,545 pickup and drop-off events.
The question now is, what can we do with this data? What sort of questions can we answer?
Asking where a taxi is at a given point in time is easy enough:
And gives us the results:
But asking a question about a single taxi isn’t that interesting. Can we do things across all taxis?
Let’s think about what kind of questions we can ask:
To answer all of these questions, we have to aggregate data from multiple time series. We can do that using a Map/Reduce index on the time series data. Here is what this looks like:
We scan through all the location events for the taxis and group them on an hourly basis. We also generate a GeoHash code for the taxi’s location at that time. This uses a GeoHash with a length of 9, so it represents an accuracy of about 2.5 meters.
We then aggregate all the taxis that were in the same GeoHash at the same hour into a single entry. To make it easier for ourselves, we’ll also use a spatial field (computed from the GeoHash) to allow for spatial queries.
The idea is that we want to aggregate the taxis’ location information on both space and time. It is easy to go from a more accurate timestamp to a lower granularity one (zeroing the minutes and seconds of a time). For spatial location, we can use a GeoHash of a particular precision to do pretty much the same thing. Instead of having to deal with the various points, we’ll aggregate the taxis by decreasing the resolution we use to track the location.
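To make those two reductions concrete, here is a minimal Python sketch of both: zeroing out the minutes and seconds of a timestamp, and encoding a coordinate into a GeoHash whose length sets the spatial resolution (truncating the hash lowers it). This is an illustration, not RavenDB’s implementation; the sample taxi ids and coordinates are made up.

```python
from collections import defaultdict
from datetime import datetime

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard GeoHash alphabet

def geohash(lat: float, lng: float, length: int = 9) -> str:
    """Encode a coordinate as a GeoHash by repeatedly halving the
    longitude and latitude intervals; longer hashes mean smaller cells."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits, even = [], True  # bits alternate: longitude first, then latitude
    while len(bits) < length * 5:
        if even:  # refine the longitude interval
            mid = (lng_lo + lng_hi) / 2
            bits.append(1 if lng >= mid else 0)
            if lng >= mid:
                lng_lo = mid
            else:
                lng_hi = mid
        else:  # refine the latitude interval
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            if lat >= mid:
                lat_lo = mid
            else:
                lat_hi = mid
        even = not even
    # pack every 5 bits into one base-32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

def to_hour(ts: datetime) -> datetime:
    """Reduce time granularity by zeroing minutes and seconds."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Group taxis into (hour, geohash) cells, as the map/reduce index does.
events = [  # (taxi id, timestamp, lat, lng) - made-up sample data
    ("taxis/1", datetime(2019, 12, 1, 9, 17), 41.89, -87.70),
    ("taxis/2", datetime(2019, 12, 1, 9, 48), 41.89, -87.70),
]
cells = defaultdict(set)
for taxi, ts, lat, lng in events:
    cells[(to_hour(ts), geohash(lat, lng))].add(taxi)
```

Both events above land in the same (hour, GeoHash) cell, so the two taxis end up aggregated into a single entry.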
The GeoHash code isn’t part of RavenDB. It is provided as additional source code to the index, and can be seen in full at the following link. With this index in place, we are ready to start answering all sorts of interesting questions. Since the data is from Chicago, I decided to look at the map and see if I could find anything interesting there.
I created the following shape on a map:
This is the textual representation of the shape using Well Known Text: POLYGON((-87.74606191713963 41.91097449402647,-87.66915762026463 41.910463501644806,-87.65748464663181 41.89359845829678,-87.64924490053806 41.89002045220879,-87.645811672999 41.878262735374236,-87.74194204409275 41.874683870355824,-87.74606191713963 41.91097449402647)).
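As an aside, the containment check a spatial query performs against such a polygon can be sketched with a plain ray-casting test. This is just an illustration in Python, not how RavenDB evaluates it; the test points are made up (one roughly in the middle of the shape, one east of it).

```python
# The polygon above, as (lng, lat) pairs - WKT lists longitude first.
polygon = [
    (-87.74606191713963, 41.91097449402647),
    (-87.66915762026463, 41.910463501644806),
    (-87.65748464663181, 41.89359845829678),
    (-87.64924490053806, 41.89002045220879),
    (-87.645811672999, 41.878262735374236),
    (-87.74194204409275, 41.874683870355824),
    (-87.74606191713963, 41.91097449402647),  # WKT rings repeat the first point
]

def contains(poly, x, y):
    """Ray-casting point-in-polygon test: cast a horizontal ray from the
    point and count how many polygon edges it crosses; an odd count
    means the point is inside."""
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's latitude
            # x coordinate where the edge crosses the horizontal ray
            xc = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < xc:
                inside = not inside
    return inside
```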
And now I can query over the data to find the taxis that were in that particular area on Dec 1st, 2019:
And here are the results of this query:
You can see that we have a very nice way to see which taxis were at each location at a given time. We can also use the same results to paint a heat map over time, counting the number of taxis in a particular location.
To put this into (sadly) modern terms, we can use this to track people who were near a particular person, to figure out whether they might be at risk of getting sick from being near a sick person.
In order to answer this question, we need to take two steps. First, we query for the locations of a particular taxi over a time period. We already saw how we can query on that. Then we find all the taxis that were in those locations at the matching times. That gives us the intersection of taxis that were in the same place as the initial taxi, and from there we can send the plague police.
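The intersection step itself is simple set logic over the (hour, GeoHash) cells the index produces. A sketch, with made-up cell keys and taxi ids:

```python
# Map of (hour, geohash) cell -> taxis seen in that cell (made-up data).
cells = {
    ("2019-12-01T09", "dp3wjztvq"): {"taxis/1", "taxis/2"},
    ("2019-12-01T10", "dp3wmk0x1"): {"taxis/1", "taxis/3"},
    ("2019-12-01T11", "dp3tu9e2b"): {"taxis/4"},
}

def taxis_near(target: str, cells: dict) -> set:
    """Step 1: find every cell the target taxi appeared in.
    Step 2: collect all other taxis seen in those same cells."""
    nearby = set()
    for taxis in cells.values():
        if target in taxis:
            nearby |= taxis
    nearby.discard(target)  # the target taxi is not its own contact
    return nearby
```

Here `taxis_near("taxis/1", cells)` yields the taxis that shared a cell with taxis/1, namely taxis/2 and taxis/3.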
I recently got an interesting question about how RavenDB is processing certain types of queries. The customer in question had a document with a custom analyzer and couldn’t figure out why certain queries didn’t work.
For the purpose of the discussion, let’s consider the following analyzer:
In other words, when using this analyzer, we’ll have the following transformations:
As a reminder, this is used in an inverted index, which gives us the ability to look up a term and find all the documents containing that term.
An analyzer is applied to the text that is being indexed, but also to the queries. In other words, because during indexing I changed “singing” to “sing”, I also need to do the same for the query. Otherwise, a query for “singing voice” will have no results, even though the term “singing” was in the original data.
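A toy illustration of that symmetry, with a simple stand-in stemmer instead of the customer’s actual analyzer (the document id and text are made up):

```python
def analyze(text: str) -> list:
    """Stand-in analyzer: lowercase, split on whitespace, and strip a
    trailing 'ing' (so 'singing' -> 'sing')."""
    terms = []
    for word in text.lower().split():
        if word.endswith("ing") and len(word) > 3:
            word = word[:-3]
        terms.append(word)
    return terms

# Index time: the analyzer runs on the document text.
inverted_index = {}
for doc_id, text in {"docs/1": "Singing voice"}.items():
    for term in analyze(text):
        inverted_index.setdefault(term, set()).add(doc_id)

def search(query: str) -> set:
    """Query time: the SAME analyzer runs on the query, so the query
    terms line up with the indexed terms."""
    results = None
    for term in analyze(query):
        docs = inverted_index.get(term, set())
        results = docs if results is None else results & docs
    return results or set()
```

The index stores “sing”, never “singing”; because the query goes through the same analyzer, a search for “singing voice” still finds the document.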
The rules change when we do a prefix search, though. Consider the following query:
What should we be searching on here? Remember, this is using an analyzer, but we are also doing a prefix search. Let’s consider our options. If we pass this through the analyzer, the query changes its meaning: instead of searching for terms starting with “sing”, we’ll search for terms starting with “s”.
That will give us results for “Sterling Silver”, which is probably not expected. In this case, by the way, I’m actually looking for the term “singularity”, and processing the term further would prevent that.
For that reason, RavenDB assumes that queries using wildcard searches are not subject to an analyzer and will not process them through one. The reasoning is simple: by using a wildcard, you are quite explicitly stating that this is not a real term.
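To sketch the difference, assume a stand-in stemmer that strips a trailing “ing” (so “sing” itself collapses to “s”) and a made-up list of terms as they were stored after index-time analysis:

```python
def analyze_term(word: str) -> str:
    """Stand-in stemmer: strips a trailing 'ing', so 'sing' -> 's'."""
    word = word.lower()
    return word[:-3] if word.endswith("ing") and len(word) > 3 else word

# Made-up terms from the inverted index (already analyzed at index time,
# e.g. "Sterling" was stored as "sterl").
terms = {"sing", "sterl", "silver", "singularity"}

def prefix_search(prefix: str, analyze_query: bool) -> set:
    """Return all indexed terms starting with the prefix."""
    if analyze_query:
        prefix = analyze_term(prefix)  # 'sing' collapses to just 's'
    return {t for t in terms if t.startswith(prefix)}
```

Skipping the analyzer, `prefix_search("sing", analyze_query=False)` matches only “sing” and “singularity”; analyzing the prefix first widens it to every term starting with “s”, dragging in the Sterling Silver results.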
We recently added support for RavenDB on Linux ARM 64. You can get the new build here:
The nice thing about ARM64 is that when you run on AWS, you can use A1 instances, which cost 48% less.
Oh, and what about the performance? This is RavenDB running on an a1.large instance (2 cores, 4 GB RAM) with a standard gp2 disk with 100 IOPS.