Slaying relational dragons
I recently had a fascinating support call, talking about how to optimize a very big model and an access pattern that basically required having the entire model in memory in order to perform certain operations.
A pleasant surprise was that it wasn’t horrible (when I get called, there is usually a mess), which is what made things interesting. In the space of two hours, we managed to:
- Reduce the number of queries by 90%.
- Reduce the size of the queries by 52%.
- Increase responsiveness by 60%, even for a data set an order of magnitude larger.
My default answer whenever I am asked when to use NHibernate is: Whenever you use a relational database.
My strong recommendation at the end of that support call? Don’t use a relational DB for what you are doing.
The ERD just below has absolutely nothing to do with the support call, but hopefully it will help make the example clear. Note that I dropped some of the association tables to make it simpler.
And the scenario we have to deal with is this one:
Every single table in the ERD is touched by this screen. Using a relational database, I would need something like the following to get all this data:
SELECT * FROM Users WHERE Id = @UserId

SELECT * FROM Subscriptions
WHERE UserId = @UserId AND GETDATE() BETWEEN StartDate AND EndDate

SELECT MIN(CheckedBooks.CheckedAt), Books.Name, Books.ImageUrl, AVG(Reviews.NumberOfStars),
       GROUP_CONCAT(', ', Authors.Name), GROUP_CONCAT(', ', Categories.Name)
FROM CheckedBooks
JOIN Books ON BookId
JOIN BookToAuthors ON BookId
JOIN Authors ON AuthorId
JOIN Reviews ON BookId
JOIN BooksCategories ON BookId
JOIN Categories ON CategoryId
WHERE CheckedBooks.UserId = @UserId
GROUP BY BookId

SELECT Books.Name, Books.ImageUrl, AVG(Reviews.NumberOfStars),
       GROUP_CONCAT(', ', Authors.Name), GROUP_CONCAT(', ', Categories.Name)
FROM Books
JOIN BookToAuthors ON BookId
JOIN Authors ON AuthorId
JOIN Reviews ON BookId
JOIN BooksCategories ON BookId
JOIN Categories ON CategoryId
WHERE BookId IN (SELECT BookId FROM QueuedBooks WHERE UserId = @UserId)
GROUP BY BookId

SELECT Books.Name, Books.ImageUrl, AVG(Reviews.NumberOfStars),
       GROUP_CONCAT(', ', Authors.Name), GROUP_CONCAT(', ', Categories.Name)
FROM Books
JOIN BookToAuthors ON BookId
JOIN Authors ON AuthorId
JOIN Reviews ON BookId
JOIN BooksCategories ON BookId
JOIN Categories ON CategoryId
WHERE BookId IN (SELECT BookId FROM RecommendedBooks WHERE UserId = @UserId)
GROUP BY BookId

SELECT Books.Name, Books.ImageUrl, AVG(Reviews.NumberOfStars),
       GROUP_CONCAT(', ', Authors.Name), GROUP_CONCAT(', ', Categories.Name)
FROM Books
JOIN BookToAuthors ON BookId
JOIN Authors ON AuthorId
JOIN Reviews ON BookId
JOIN BooksCategories ON BookId
JOIN Categories ON CategoryId
WHERE Books.Name LIKE @search OR Categories.Name LIKE @search OR Reviews.Review LIKE @search
GROUP BY BookId
Yes, this is a fairly simplistic approach, without de-normalization, and I would never perform searches in this manner, but… notice how complex things are getting. For bonus points, look at the fourth query: the queued books are ordered, so try to figure out how we can get that order in a meaningful way. I shudder to think about the execution plan of this set of queries, even if we ignore the last one, which does full text searching in the slowest possible way. And this is just for bringing the data for a single screen, assuming that it will magically show up (you need to do a lot of manipulation at the app level to make this happen).
The problem is simple: our data access pattern and the data storage technology that we use are at odds with one another. While relational modeling dictates normalization, our actual data usage means that we don't really deal with a single-row entity with relatively rare access to associations, which is the best case for OLTP. Nor are we dealing with set-based logic, which is the best case for OLAP / relational queries.
Instead, we are dealing with an aggregate that spans multiple tables, mostly because we have no other way to express lists and many-to-many associations in a relational database.
Let us see how we could handle things if we were using a document or key/value database. We would have two aggregates, User and Book.
GetUser(userId) -> would result in:
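The user aggregate is a single document that holds everything we know about the user; the books are referenced only by their ids. Roughly something like this (a sketch only; the field names and values are illustrative, not a prescribed schema):

var user = new
{
    Id = 1,
    Name = "Example User",
    Subscription = new { Type = "Monthly", StartsAt = "2010-01-01", EndsAt = "2010-12-31" },
    QueuedBooks = new[] { 153, 1337 },
    RecommendedBooks = new[] { 153 },
    CheckedOutBooks = new[] { new { BookId = 1337, CheckedAt = "2010-02-01" } }
};

Note that the queued books are just an ordered list of ids inside the user document, so the ordering problem that was so awkward in the SQL above simply disappears.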
We can now issue another query, to bring the associated books. GetBooks(153, 1337) would result in:
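Each book is likewise a self-contained document; the author names, the category names, the average rating and the review ids all live inside it. Again, only an illustrative sketch with made-up values:

var books = new[]
{
    new
    {
        Id = 153,
        Name = "Example Book",
        ImageUrl = "/covers/153.png",
        AvgStars = 4.2,
        Authors = new[] { "Author A", "Author B" },
        Categories = new[] { "Fantasy" },
        ReviewIds = new[] { 7, 13, 81 }
    }
    // ... and a second document, with the same shape, for book 1337
};

Everything the screen needs about a book comes back in that single document, with no joins and no aggregation at query time.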
Note that the entire data structure is different, we haven’t just copied the normalized relational model, we have a totally different model. An aggregate (similar to DDD’s aggregate) is a single entity that contains anything except other aggregates. References to other aggregates are allowed (from user to all the books), but most of the entity’s data is stored as a single value.
That has several interesting implications. First, we need two queries to get the data for the screen: one to get the user's data, and a second to get the books that we need to display. Reducing remote calls is something that you really care about, and simplifying the queries to mere queries by id is going to have a significant effect as well.
By changing the data storage technology, we also enforced a very rigid aggregate boundary. Transactions become much simpler as well, since most transactions will now modify only a single aggregate, which is a single operation, no matter how many actual operations we perform on that aggregate. And by tailoring the data structure that we use to match our needs, we have natural aggregate boundaries.
The end result is a far simpler method of working with the data. It may mean that we have to do more work upfront, but look at the type of work we would have to do in order to try to solve our problems using the relational model. I know what model I would want for this sort of problem.
Comments
In database land we too have 'code smells' or anti-patterns. One of them is to model the db after screen layouts or screen behavior. I hope I don't have to explain why that's so utterly bad. From what I see in your blogpost, you just did that.
It's ok to optimize some parts of the database for fetching, but it's always based on top of an existing model which is in normalized form not to annoy people, but because it's then in a correct form (theoretically), which means that you can consume it in whatever application you think of; the model is always the truth as it's correct.
For example you also could have used an indexed view or multiple indexed views to optimize fetching in this area. That way you didn't have to butcher the abstract entity model at all, and still you'd get optimized fetching.
The main point of using a relational model is that entity definitions are usable as separate elements, so you can just define new relationships between them to utilize them in a different way without changing the data (the entity instances). Storing the data in a structure which meets the screen the data is used in might sound 'great' but if you change the screen a bit it will force you to alter the database and also doesn't allow you to reason over the data at all as it's not in a 'correct' form, it's stored as the screen demanded.
IMHO "One Size Fits All" is soo of the nineties :) I don't think I have to give examples of that either, and would agree with Ayende that for certain problems it is advantageous to use a different modelling technique. The "correct form" depends on what you are doing, even if you stay completely in the relational realm, normalization is a choice (OLTP vs OLAP for instance). Would agree that purely "modelling a database model after a screen" is incorrect, but I can't agree on the fact that Ayende is doing that here.
What about CQRS?
Like OLAP DB for reads and Event Storage for writes?
From how I read this, Ayende is using a key/value database for loading the screens, which would simply have a serialized object in the value field.
Or I suppose it could quite easily be an object database.
I personally am moving more and more over to CQRS and splitting my reads and writes. That way we don't have to try and make a one size fits all, and scaling becomes easy to "think about" :)
It looks like each area of the screen could be considered as a separate 'widget', each with its own related service & query/table.
For recommendations use of something built over Apache Mahout (or similar) might be best.
Subscription details might indeed be held as a document. Orders may be relational, or a set of documents that are linked through a MapReduce view.
I don't think that, apart from maybe as a ViewModel, it would be good to capture all of this in a single model.
that's just nonsense. A relational model's correctness doesn't depend on what you do with it, that's the whole point.
Frans,
The relational model doesn't work in many scenarios. In this case, the assumption that the ERD is the One True Model is false.
Optimizing the data model to fit the actual data access patterns isn't an anti pattern by any means. This is simply an easier way of thinking in Aggregates, one where the Aggregates are actually first class citizens.
FallenGameR,
I don't like talking out of context, show me a scenario where this would be applicable, and I can consider it.
BTW, an OLAP RDBMS would suffer from the same problem outlined in this post
Jonathan,
An OODB might be a problem here (not sure), since I want to control the scope of my objects (to reduce remote calls).
That depends on how the DB allows me to define aggregate boundaries.
No-one says it does. That's why things like indexed views were invented (even partitioned across machines). This, without you having to ruin your model because v1.0 has a couple of screens hitting the DB hard. (What happens if v2.0 of your application has different screens?)
Abstract entity models model the entity definitions discovered in reality and how they relate to each other. Saying that there are more than 1 model possible suggests that there are more definitions of 'reality' possible. It's not something you can cook up on the spot, Oren, this is proven science for decades. (and mind you, it has nothing to do with performance). You optimized the fetching pipeline. That's great. However once you enter the slippery slope of changing your model because performance requires it, you'll end up with 1 big table because that requires just 1 query.
So: optimizing for data-retrieval is an action you perform on the data to transform it into a model which is suitable for READs. It's not used for WRITES. As soon as you do that, you have a big anti-pattern implemented.
Read-only databases (the way this has been solved for decades already) are often used for this. Writeable databases are used for DML; read-only databases with a transformation of the data are used for fetching. But never for writing.
It's an anti pattern because when you need a different application consuming the same data you're in trouble, as that other application can't reason over the data. If the main application is refactored, you then have a truckload of data (not information!) which has no meaning, as the model isn't there: it's tied directly to the application in a previous version.
THAT's why it's very important that people focus on correctness of a model, as that gives meaning to your data, otherwise you have just a bucket of bits.
And why did you not just replace the first SQL query with an indexed view?
Like the two previous people who mentioned CQRS - I think that applies perfectly to this scenario. It also fits somewhat with what Frans is saying with a read-only database. You optimize your write-model and your read-model differently.
You might end up in the same place with a document database for your write model (and maybe also for your read model) but you don't necessarily have to.
Notice the type of information that I need to load there. You really have no choice but to make a LOT of queries to get the information for a single screen.
The issue is that the relational model doesn't deal well with aggregates.
Of course there is more than one definition of reality! E=mc², after all, so everything is relative.
More seriously, a single model isn't going to serve you really well, because it is unsuited for the tasks that you require of it.
A transactional model and a reporting model are the most common examples, but they aren't the only ones. My accountant has a very different view about my finances than I do, for example.
Um, my model is actually extremely optimized for writes; just about all writes can be handled in a single remote call. That means removing things like deadlocking entirely.
The problem is that you make the assumption that I tailored the data model for the UI, but that is not the case. I tailored the data model to fit the way I am interacting with the data, so everything is encapsulated in a single aggregate.
Frans,
Maybe a misunderstanding, but The Correct model is not about correctness, the correct model is the right model to choose for the specific need. There is no one true model. That is what I was referring to with the OLTP - OLAP reference. What is correct to choose regarding normalization in oltp is not correct in olap.
In a datawarehouse structure things are very different than in the online processing environment. And these structures are different and separate on purpose.
Bubak,
Can you write me an indexed view definition that would substantially reduce the number of queries that I have to make?
If you are going the key/value pair database route, this view does indeed become easy!
What happens when the users come back and ask for all books under a category, or written by a specific author?
I am hoping there is a clever way to handle that case in the K/V database world; otherwise, the database is tailored for a specific view.
Pete,
You use different methods to handle this. For example, maintaining a Lucene index of the data would make this VERY easy
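For example, something along these lines, as a rough sketch against the Lucene.Net API (assuming each book aggregate is indexed as a Lucene document with BookId, Category and Author fields whenever it is saved; the field names here are made up for the example):

using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// "All books under a category" becomes a single term query against the book index.
var searcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo("books-index")), true);
var byCategory = new TermQuery(new Term("Category", "fantasy"));
var hits = searcher.Search(byCategory, 100);
foreach (var hit in hits.ScoreDocs)
{
    var bookId = searcher.Doc(hit.Doc).Get("BookId");
    // load the book aggregate by its id from the document store
}
// "All books by a specific author" is the same kind of query against the Author field.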
@Ayende
Hmm, having Scalaris for key/value, Lucene for searches, and SQL 2008 for the relational part of it... How much are you planning to pay the system administrator if this thing is to be installed in a local public library?
Pre-ORM, a view would have been my choice to solve that distinct problem, but along the exact same premise. It is important to differentiate the model based on the way the data is presented.
This is a perfect example of data-centric design. Instead of using an ORM, they would be using the technology as a RORM (Relational Object Relational Mapper:) or simply as a persistence wrapper. There is no smell in defining a model around a view, it just means turning the perspective on-end and focusing on the view instead of the data.
Part of the problem is the relational bug has been deeply ingrained. Eliminate duplication, save space. The first half is absolutely valid, the second now causes many more problems than it's worth. Space is cheap. Engineering time to try and deal with the performance issues introduced by diving beyond 3rd normal form is bloody expensive.
"Part of the problem is the relational bug has been deeply ingrained. Eliminate duplication, save space. The first half is absolutely valid, the second now causes many more problems than it's worth. Space is cheap. Engineering time to try and deal with the performance issues introduced by diving beyond 3rd normal form is bloody expensive."
Well said!! Sometimes a little duplication goes a long way... BTW...
"Space is cheap"
It's not a matter of space. It's the complications of maintaining consistency when you have redundant data.
@Ayende, thank you for the answer, that will work. I suppose that I have a hard time designing data models driven by their UI/use cases. So many years have been spent working in a noun-per-table fashion.
@pete
That's the first half, and why there is a valid case for using relational databases. But there is a balance needed in the quest to eliminate duplication. The argument for the depth of normalization historically was predominantly around space. Drives and memory were slower and cost a whole lot dearer. The demand for the data was simpler. Now we see a different demand for data, more visual, dimensionalized... Now that deep normalization is working against you, especially if your object model reflects that normalized structure. (class per table.)
Property A.B.C.D.E.F if lazy-loaded can mean huge select N+1 scenarios. Eager fetch and you're looking at loading everything from A, B, C, D, and E. If you can get by without C or D it helps. Even better if you can avoid falling into classes arranged by relations in the first place.
@Alex,
a) Who said anything about having a relational db?
b) From the POV of the Doc DB, Lucene queries are part & parcel of the actual service it provides. IOW, the Lucene index is an implementation of how you can query the DB.
Ken,
I don't understand what you mean by re-processing the entire aggregate.
My usual method would be something like:
var aggregate = Get(1);
aggregate.DoSomeActions();
Save(aggregate);
No reprocessing, no need to figure out what changed.
As for reindexing, that is a cheap operation, all told, so I wouldn't worry about it.
There are two ways this can be resolved.
Either we create the association directly, as shown in the post, where you have the reviews ids in the book.
Or, we can query the DB for all reviews for a particular book id, which is a cheap op.
Having a UI with so many books from different segments on this screen is almost equal to getting N random books from the DB.
"Transactions becomes much simpler as well, since most transactions will now modify only a single aggregate, which is a single operation, no matter how many actual operations we perform on that aggregate."
I really don't get this. If you delete a book, should it not be deleted also from every user aggregate referencing this book? I haven't used k/v databases, but from your aggregate example, doesn't it mean the database still has to perform user + N (where N is the book count the user aggregate is referencing) lookups?
@Ayende
Have you really any notion of what an indexed / materialized view is? I think you don't. It's exactly what you want: you have a normal model for writes, and an indexed view (or several) for reads. An indexed view is a view whose query results are stored and on which you can place indexes, even with aggregates (at least in Oracle). The sole purpose of materialized/indexed views is precisely what you are dealing with: to reduce the # of queries over a normalized model.
The advantage of it is that you can keep your normalized, correct, model for the data itself, how it relates, what the meaning of the bits is, and the indexed views (which update themselves!) for fetching. Get a different need for fetching? Create a different indexed view. The data and model itself doesn't change, as there's no need for change.
What you did is throwing everything together because you want to consume the data at this point in this way. If you want/need to consume the data differently in a future version, you have a big problem, which is unnecessary, because the RDBMS already offers you this feature at no additional effort. The only thing you need to do is create a view in the DB, map an entity to it and fetch that entity in your screen. A 5 minute or less task.
About several forms of reality: no there's only one reality: in that reality you recognize 'book', 'author', 'user', 'queue' etc. These are entity definitions you can recognize in your domain. That there are several projections of this model for various purposes, that's a different story. But it's important to realize there's 1 reality: all those projections have that reality as their SOURCE. Otherwise you have several sources of WHY you have created the model you have at hand: what does this series of bits mean, for example.
Like I said: you end up with 1 table as that's the easiest: 1 query. Right? There's a reason why relational models are normalized.
That's the same thing, Oren. The UI needs data-access and you created your model to meet your data-access strategy. What happens if that changes? In a normal database, you wouldn't have to change anything.
@Raymond: models used for data-warehousing are always based on (projections of) normalized models, and the data inside these models is too. This is done so because otherwise you can't reason over what the meaning is of the data in the de-normalized materialized views used in data-warehousing. With the projection you can reason about what the source is and thus what the meaning of the destination is.
@Steve
Normalization isn't about saving space. It's about data consistency: an attribute is in the model just once. This means that if the value has to change, you only have to change it in 1 spot, not in several spots. The more you de-normalize, the more you introduce duplicates. This isn't bad for disk space, it's bad for modifications, as the logic to do so gets more complicated as more and more thought has to be put into the code to keep the data consistent (you've to update all duplicates of a given value!)
Duplication is therefore not really wanted for writes. It's however different for reads. As reads can be done from projected datasets (which are denormalized for example) you can optimize reads without introducing problematic database maintenance for writes: the reads fetch from projected sets, which we all know as materialized/indexed views.
@Frans
Are you talking about an indexed view to load an entire aggregate (say 3-4 levels deep) from the DB? Never even thought of that....I've always avoided views like the plague. Maybe I should think about this again.
Bunter,
a) deletes are actually pretty rare.
b) you can handle that lazily
c) associations as first-class citizens will make this easier (graph db)
@Marc
There's a difference between an indexed view and a normal view. A normal view is a query which is embedded in the query you use the view in. So if you have a view V like:
SELECT ... FROM Customers {complicated join here} WHERE Country = 'USA';
and you do in another query:
SELECT o.OrderId, v.CustomerId
FROM Orders o INNER JOIN V ON o.CustomerId = V.CustomerId
it will become at runtime:
SELECT o.OrderId, v.CustomerId
FROM Orders o INNER JOIN (SELECT ... FROM Customers {complicated join here} WHERE Country = 'USA') V ON o.CustomerId = V.CustomerId
so that's not going to help in the performance department, only to deal with abstractions.
An indexed view is something else. Instead of mapping the query in the main query, you get the resultset stored on disk, and you can create indexes on it again. So if I make 'V' an indexed view, the db will store the entire resultset of V on disk as a 'table' like all the other tables are stored on disk too. If I then add an index on customerId on V, I don't get the query of V mapped onto my main query, but I access 'V' as if it's a table, complete with index. As the DB takes care of the maintenance of it, if I change a row in Customers, and it's in V's resultset, it's updated as well. You can of course schedule this (most of the time, data-warehousing is done on data harvested periodically, e.g. once a day).
See: http://en.wikipedia.org/wiki/Materialized_view
and for example:
download.oracle.com/.../repmview.htm
(An indexed view is the SQL Server equivalent of a materialized view.)
As resultsets can get large, modern systems allow partitioning of these as well across multiple systems. The advantage is: you get very fast fetches, maintenance is low and you keep your original normalized model. So if your fetch strategy changes or you have to provide multiple fetch strategies (for multiple applications for example) you can simply provide more or different views without changing a single table or row in your db.
And yet, that is clearly a requirement of the software, since it makes working with it much easier.
Not giving the user what they want is not an option, and fighting with an RDBMS to do something it is not meant to do isn't really productive.
Frans,
I am well aware what a materialized view is.
However, while materialized views can significantly reduce the amount of I/O done, they cannot significantly reduce the number of queries that you are required to make.
Can you provide me with a sample SQL that would reduce the number of queries required for this screen?
Who says that the relational model is the correct one?
Huh? Why is that?
I can query the data more or less freely, and by structuring the data on aggregate boundaries I have gotten to a position where I have clearly defined responsibilities and access patterns.
Accessing a value object outside of its aggregate is frowned upon; this pattern matches the way I want to think about my data.
And, finally, there is no reason not to apply materialized views on top of this data model as well; in fact, that is very easy.
Are you familiar with CouchDB, by any chance?
Those are not projections, those are totally different things.
The common example is client tracking for billing & help desk. The name may be the same, but everything else is different, trying to treat them the same leads to issues.
Hm, can you show me where my model is NOT normalized? Can you point to a place where I have duplicated data?
There is no such place. Relational & normalized are two distinct properties that do not relate to one another.
No, I didn't.
I showed the example of the UI because that is the easiest to understand, but others include things like adding a book to the queue:
var user = GetUser(1);
var book = GetBook(13);
user.AddToQueue(book); // BL goes here
Save(user);
LOL, that is why migrations are so important to so many people.
Nice article. You have made a great point here.
Perhaps you can elaborate in another article how an aggregate would actually be chopped-up into key/value pairs and stored - especially how compositions of dependent objects (non-aggregates) and lists (aggregates or non-aggregates) would be handled?
Sony,
That depends rather heavily on the actual NoSQL DB impl
@Frans
I don't know why everyone is attacking a clarification of a secondary point I made about normalization. Primary is duplication, and this is a fully justified reason for using an RDBMS, and structuring tables to 3NF. Secondary is/was space, driving the duplication zealots to the depths of the 5NF and beyond.
I still see database structures today with 5NF and even 6NF tables, and when you try and plot these out in OO it-is-a-mess. I am dealing with one right now with addresses. The goal is that a physical address should exist in the system once, and only once. So the address to entity (customer/company, etc.) becomes a many-to-many relationship (since an entity can have multiple addresses.) This adds crap-loads of complexity when one customer decides to update their address. Is that address used by others? Should it be updated, or replaced? Does the new address already exist? A freaking headache that really isn't justified just because 2 entities could share the same location. This decision isn't the RDBMS' fault at all, but it is due to a dominant normalized design opinion. Thou shalt not duplicate!
I'm really not seeing the criticism of the relational model here. If the problem is round trips to the server then the ORM layer should be issuing its requests in a single batch, which results in a single round trip to the DB. If you're worried about the joins in the DB then you use materialized views or denormalize wherever you find bottlenecks.
The problem (to me) seems to be more in the mapping layer (NHibernate) that doesn't easily allow multiple queries in a single round trip rather than the DB for making your queries look a little complex.
The real issue with going to something like Ayende's proposed solution is it completely falls apart with the next requirement - say something like "who has book X in their queue" or even more likely "who has book X checked out". Suddenly you end up with back-references all through the code which need updating on writes or having to do complete list traversals.
On a large system you'll inevitably get a request for some sort of OLAP query eventually that slices things in various ways and having a solid RDB underneath really helps the ETL.
While relational databases are certainly not the answer for everything, I can't see the fault in using one in this particular case.
John,
a) NHibernate fully supports batching multiple queries, both for writing & reading.
That doesn't reduce the complexity involved or the number of queries required.
b) Answering both your questions would be something like:
User.Queue: 41
User.CheckedOut: 41
Both of which are cheap query operations on top of the same model, using something like Lucene indexing
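Concretely, that is one term query each (a sketch only, assuming the user aggregates are indexed with multi-valued Queue and CheckedOut fields holding book ids):

using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// "Who has book 41 in their queue?" / "Who has book 41 checked out?"
// No back-references need to be stored in the aggregates themselves.
var users = new IndexSearcher(FSDirectory.Open(new DirectoryInfo("users-index")), true);
var whoQueuedIt = users.Search(new TermQuery(new Term("Queue", "41")), 100);
var whoCheckedItOut = users.Search(new TermQuery(new Term("CheckedOut", "41")), 100);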
Ayende
One last question if I may... I was wondering how Lucene would handle, for lack of a better term, "indirect aggregations".
For example, extend your model slightly by adding an address for each user.
I want to see the top three popular book categories by country.
Given your experience with Lucene, is that something relatively simple to answer? I'm not looking for code, I just want to get a feel for the difficulty of implementing this kind of question.
Thanks!
Pete,
Lucene is a document db, as such, it is not capable of doing relations in the same sense that an RDBMS can.
What you want cannot be constructed as a single query in Lucene, since it doesn't support any aggregation.
Indeed, there is no way to compare between different documents using Lucene.
Ayende,
On the batching of queries, you specifically mentioned "reducing remote calls", which I was simply pointing out was a non-issue given the ability to make multiple queries run in a single batch. In that sense the RDB solution is better than the Lucene one because Lucene doesn't offer the same ability. Similarly the query plan on those queries would be pretty straightforward, particularly if you took the common code from the queries on books and dropped it into a materialized view (as suggested by others). Should all be nice and fast, especially given that the optimizer will be able to reduce the row counts very quickly right off the bat.
Given I now properly understand (thanks to some co-workers smacking me around the head for idiocy) that you're really just suggesting a text search over a set of "files" created with a specific format, I can see a number of problems with the approach you're proposing:
i) Text indexers such as Lucene require manual updating of the indexes which is an expensive operation. Any change to the data is going to be significantly more expensive than the RDBMS solution as you need to update your indexes before returning the new result sets.
ii) Escaping and data formatting. Unless you use a proper key-value store, a text indexer like Lucene will have issues with escaping (especially given titles with colons and the like in them). Not insurmountable, but certainly annoying.
iii) Performance/Scalability. Ultimately, Lucene (and most indexers of the type) itself is built on an RDBMS. What you've effectively proposed is just a massive denormalization of the data. It's basically equivalent to saying in SQL 2008 that you're going to denormalize your schema into two tables "Books" and "Users" which both have an id and an XML column, but without the added benefit of escaping and automatic index updating.
I certainly believe that some denormalization doesn't go astray when you come to analyzing your problem, but I suspect tossing the RDBMS out the window is a bit like tossing the baby out with the bathwater in this situation.
"However, while materialized views can significantly reduce the amount of I/O done, they cannot significantly reduce the number of queries that you are required to make.
Can you provide me with a sample SQL that would reduce the number of queries required for this screen?"
Ah, you say you are well aware of what a materialized view is, yet you have no clue how it reduces the # of queries?
Let me help you:
you join all the tables in the screen together to a de-normalized set. Then you query that set with 1 query and a filter. As the set is stored on disk as a 'table', it's 1 query and 1 i/o procedure.
You see the need for storing things in an object-oriented fashion. That's nothing new, the OODB vendors have been trying to sell that picture for decades. Now, conspiracy theories aside, why aren't they successful in normal applications, do you think? Would it be because structuring data in a rigid form like they force you to is only beneficial for the active problem at hand, but when it changes, the data structure has to change as well, while the data structured in an RDBMS wouldn't have required that restructuring?
But alas, enough. It's simply not worth it to discuss anything with you, as I don't recall a single moment where you weren't convinced you were absolutely right.
"b) you can handle that lazily"
But of course your relational sample avoids all the laziness and fetches all the data. Don't you think the comparison is unequal at best?
Besides all that, I have recently rarely seen a system that doesn't have very extensive integration with other sinks/sources. Those single app/single database things are "so nineties", to borrow the phrase one commentator used to ridicule relational databases. With a k/v database this would most probably require another model (view), if I get the thinking behind k/v correctly. So will the updates traversing into N different models turn your read-once advantage into a write-many nightmare?
John,
Don't assume that Lucene doesn't offer that; it is trivial to do something like that on the server side, and I recently did some really funky things with Lucene.
i) Updating a single document in Lucene is actually a very cheap operation (see the sketch at the end of this reply).
ii) that is not really an issue, it is a matter of how you index things, but it is an utterly solved problem
iii) Huh? Lucene is NOT built on an RDBMS, the data model and access strategies are completely different.
Also, I suggested Lucene for indexing, not for storing the data.
I think that there are some problems with your understanding of how Lucene works
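To make the "cheap update" point concrete: re-indexing a single book is just a delete-and-add keyed on its id. A minimal sketch against the Lucene.Net API (3.0-era names; the field names and values are made up for the example):

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

var writer = new IndexWriter(
    FSDirectory.Open(new DirectoryInfo("books-index")),
    new StandardAnalyzer(Version.LUCENE_30),
    IndexWriter.MaxFieldLength.UNLIMITED);

// Rebuild the index document for book 1337 from its aggregate.
var doc = new Document();
doc.Add(new Field("BookId", "1337", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Category", "fantasy", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Author", "example author", Field.Store.NO, Field.Index.ANALYZED));

// Replaces any existing index entry whose BookId term is 1337, in one cheap operation.
writer.UpdateDocument(new Term("BookId", "1337"), doc);
writer.Commit();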
Frans,
a) I would really appreciate it if you toned it down a bit. You come off as condescending, and while I am sure that you don't mean to sound like that, it is annoying on the receiving side.
b) Look at the data model; we are talking about collection associations, not single associations. You can't return that as a single result set.
Frans,
If this is so simple, you can spend 5 minutes and produce the SQL for the materialized view & the queries that reproduce the data needed for the screen
As I said, lazily is an option. First class associations would deal with this easily enough.
My relational model suffers from severe issues with regard to the problem at hand; that is why I am trying to move away from it.
Huh? I am not sure that I understand why you would need that.
I don't see a scenario where that is likely, and would strongly resist it if that was the case.
Can you give me a concrete example of this?
Thanks Ayende.
I suppose I speak for many when I say that I need to learn how to determine when to use NoSQL or not.
Ayende,
I look forward to reading the funky stuff you did with Lucene - the discussion has been interesting so far and I'm sure it will be just as interesting with those examples. I'm not exactly trying to dismiss Lucene as a solution, just pointing out that in the specific case you mentioned here an RDBMS really isn't that poor a performer and I'd be willing to wager that it would work out at least as fast and possibly more flexible and robust than the Lucene solution.
I'll freely admit I'm quite new to Lucene and still learning by prowling through the 3.0.0 source tree and documentation. Apologies for any misunderstandings I have and thanks for the corrections.
For the specific comments:
i) Adding a row in an RDBMS (most updates in a netflix-style system will be single OLTP style row adds) is going to be significantly faster than updating a page in Lucene no matter which way you look at it, just from the size of the update. The index system in Lucene doesn't seem as optimized for this sort of OLTP behavior as your typical RDBMS is, though I will admit your entire article was based around reading data rather than writing - you do mention using it primarily for data indexing rather than data storage.
ii) Agreed that it's solved, just an annoyance and I'm sure you see as many bugs from "utterly solved escaping problems" as I do (cough SQL Injection cough)
iii) I stand corrected - saw the links to sleepycat in the code and made too many assumptions. The file format has plenty in common with RDBMS structures, but is definitely geared to the full-text search.
I understand the idea of using Lucene to index rather than store (wasn't aware you even could use it to store) but I'm still not really seeing the whole concept as any different to dumping the whole thing in a great big XML index.
John,
The actual sample scenario required over 50 queries to get the actual data for the operation to run, don't get too hung up on the actual sample, I tend to simplify them so they would get the point across without being TOO complex.
Regarding inserts, that is actually cheap. I am using Lucene only for indexing, so the things that I am indexing are much cheaper than the full entity. An RDBMS would have to do the same on insert (updating indexes / materialized views).
Ayende, you are without a doubt one of the smartest programmers around. That's why I am amazed about the nature of this argument. The problem you describe here has had a solution for over 20 years, and it has been addressed by most RDBMS manufacturers. While I love great abstraction and design ideas as much as the next guy, I have to say that in this case I would rely on the database. It's to your credit that you tackle a variety of issues on your blog and propose creative solutions. However, especially in reading the heat of the comments back and forth, I recommend taking a step back and looking at DB solutions in addition to code patterns. Of course everyone is free to follow their own creativity in solving a thorny problem. Alas, we all know sometimes a different approach to a problem can really move things along. Personally I believe that the Ayende Rahien whose work I admire does not simply dismiss potential solutions out of hand. That just doesn't sound like you. Please take a look at indexed views. You might like them.
Thomas,
Explain the solution, then.
If you are proposing a materialized view, show me the SQL to generate the view and the SQL to query it
I too would like to see an example using the view.
"As I said, lazily is an option. "
And yet you decided to add them to the SQL and not to your k/v sample, leaving out the most trivial optimizations like having the book rating calculated and added to the book entity :) But I agree, the relational model might seem utterly complicated compared to your extremely denormalized model (which you can implement in a hackish way in a relational database as well, for example summing the review rating, authors, and categories along with the book entity and instantly having 5 fewer joins, or creating a materialized view and getting all the data from a single "entity" and aggregating it in memory). However, relational databases are old, proven technology and rather well tuned. So what if there are 6 joins? If it's a small application, performance isn't an issue anyway with any sane approach; with a huge application nobody dares to implement it on a niche database anyway :)
Examples where this model turns into write many:
1) Part of your category tree is loaded from another datasource. Categories can appear and disappear at will, and books belonging to disappearing categories need to be remapped. It can happen on a daily basis. Category names can change with any load.
2) Part of the book catalog is loaded from another datasource. Books can be taken off the shelf at any moment, meaning they must disappear from the reader's recommendation list, or be marked hidden in the public catalog if they are purchased (in your case, "checked out").
In sum - my point is your case might be right, but your sample is a bit skewed. In a lot more specialized data models, with a deeply hierarchical nature and strong aggregates, k/v probably can really pay off.
Marc,
It is in the queue
Bunter,
I am not following, what did I do in the SQL and not in the DDB?
Are you talking about avg_stars? If so, feel free to ignore this, then.
You don't really need it.
1) That is not a problem, as I said. Those aren't driven from the document, those are driven via indexes.
2) Ditto.
@Ayende: About my tone: the thing is, you're actually pretty inflexible towards what I have said in this thread. This isn't new, I don't recall you ever admitting you didn't know something or you were wrong, everybody else is always wrong or has the wrong idea. It's your blog, do what you have to do but it's not helpful.
You ask for an example for the materialized view/indexed views. I assume you're using SQL Server. Sql server has the concept of an indexed view. This is similar to a materialized view on oracle. See: msdn.microsoft.com/en-us/library/dd171921.aspx
You posted a couple of sql queries. All queries use a where clause with a User. Strip the queries from that clause, and create for each query an indexed view. Then write queries which simply query your indexed views with the where clause using @userid. This gives the following benefits:
1) no more join logic
2) no aggregate computation
You still need several queries. So you can also create, for this screen, a single indexed view, which only has the information needed for this screen. This will have more redundant info per row, but it's also more efficient, as you then have to query just that single indexed view.
There are many ways to optimize this screen, as you know. If the db is the bottleneck, creating such an indexed view can help and at the same time without you having to tie the model to the application (as you did in your optimized form).
Try it with the indexed views, it's not as hard as you might think (it's very easy, you already have the queries). The only drawback sqlserver has with indexed views is that when you update a table in an indexed view the ARITHABORT flag has to be set. I'm not sure NHibernate has an option to set that flag before a query.
It would be really interesting to see the indexed view idea implemented with NHibernate for this book sample. Ayende, got any spare time??? ;)
@Marc: I don't think there's anything to do with NHibernate for the indexed view - it should just work :) An indexed view is just like a normal table.
I gotta go with Frans here. Even if one can't write a single view that ties together and flattens different result sets, one can most definitely write an SP that returns multiple result sets (from multiple indexed views). Apps come and go, data stays forever.
I don't really get your complaint with the # of queries; are you worried about stressing out the DB engine? Or code clutter due to various queries/OR mappings? It can't be performance; optimizing these queries is bread & butter for every DB person. Either you've a serious point that you're holding up, or you're just trolling :)
Bela,
Like I said, the actual issue I presented here is the simplified version of the issue.
The actual problem was that we didn't have just 6 tables and a few queries, we had dozens of tables, and a LOT of queries.
Sure, you can try to optimize this sort of thing, but it stops being an effective approach after a certain time/size has been reached.
Beyond that, there is the maintainability issue of having to get everything working just right in order to be able to get things done. The more complex this is, the harder it would be to maintain.
Worse, if every screen had this requirement, the cost of developing/changing a screen goes through the roof.
Changing the model so you don't have to worry about this, because you have clean lines between things, is much easier.
"I want to see the top three popular book categories by country. " - pete
"Pete,Lucene is a document db, as such, it is not capable of doing relations in the same sense that an RDBMS can.
What you want cannot be constructed as a single query in Lucene. Since it doesn't support any aggregation...."
Ayende/Pete:
I have to disagree. If using lucene/solr the query to get such a result could be simplified to below.
Solr:
(noun:"book" AND (rank:"1" OR rank:"2" OR rank:"3")&facet=country
More over we can get all the ranked books for each country (not just one) using the above.
Also Lucene/Solr can get me back any facet, such as state, zip, country, customer age, etc. Allowing slicing of data (datawarehouse type functions, but infinitely faster)
by merely slicing on the facets.
(noun:"book" AND (rank:"1" OR rank:"2" OR rank:"3")&facet=country,zip,state,age
This would give me all rankings.
Now does this matter? No. Because why do I want to get all book rankings for a country? Who cares? That's a back-office/support task, and that's not a revenue stream nor something that needs to be optimized or even thought about until you actually "have" a product.
Do you sign into amazon.com and see book ratings by country? No, you get books that "YOU" care about. And that's a different query altogether (distance scoring).
Read: Algorithms of the Intelligent Web (Manning), Collective Intelligence, etc. What I find fascinating about these arguments is the lack of focus on the product.
I think Ayende Rahien has more of the customer's goal in mind, not IT's. Software is transient and there is no "perfect" solution.
Ayende is fundamentally right that it can't do (out of the box) something like AVG(book.price) where facet.country = "US".
But that's a set operation, perfect for a database. Or you could just code it in as many lines using the right abstractions/Hadoop/MapReduce/PLINQ/yada yada.
// (pseudocode) facet on country and compute the average price per facet
var avgPrices = solr.Query(/* ... */);
Console.WriteLine(avgPrices["US"]);
Console.WriteLine(avgPrices["JP"]);
@John Wiltshire
"I'd be willing to wager that it would work out at least as fast and possibly more flexible and robust than the Lucene solution."
I really hate these two words in IT. Anytime somebody says something like that, I promised a long time ago to ask "what does robust mean to you?" and "are you describing the coffee you had this morning?". Bold, rich, robust are words used to describe coffee, not software. Kool-Aid is for kids.
"Examples where this model turns into write many:
1) part of your category tree is loaded from another data source. Categories can appear and disappear at will, books belonging to disappearing categories need to be remapped. It can happen daily basis. Category names can change at with any load.
2) Part of the book catalog is loaded from another data source. Books can be taken off shelf any moment meaning they must disappear from reader recommendation list, marked hidden if they are purchased (in your case "checked out") in public catalog"
What are you talking about exactly? Book categories can appear/disappear at will, like magic?
To live in a world where book categories can disappear is interesting. I imagine it's like being on the bridge of some starship and the helm says "Captain, we just lost another book category to the Borg!"
Seriously though, when was the last time "fiction" was dropped from Amazon's lexicon? Or "drama" from Netflix? Besides the obvious, let's say you were talking about categories at a more granular level (like ASP.NET MVC 2.0 books). We are assuming, in this case, that book updates need to be immediate.
It takes months to many years to write a book, and book publishers work with distributors for many months to deliver a book (often waiting for good market conditions), so if it takes 6 hours to update a Lucene index (which it doesn't, BTW) I think you might be focusing too much of your time solving the wrong problem to the wrong question.
Besides, that's why they have a nice "pre-order" button.
Also, we assume business growth won't make this point absurd when you add multi-language support for your categories. Show me a database that handles that very well. "Business is great, we want to move into China." Oh, that's easy, our RDBMS can take care of that in no time, nobody has to weed through any queries at all and it requires no code changes. Everybody then high-fives and has a beer.
Again, a nice world, but I don't think it gets close to that.
@Frans Bouma
Just a quick couple of jabs, I am paraphrasing
In short, your "quick and easy" five minute task has exploded into multiple queries, multiple indexed views, and potentially multiple materialized views.
Doesn't that sound like a problem? I am sure you're one of those people that can look at a SQL query and see the execution paths in your head and think you're the only one in your company / area that knows about partitioned views, OVER queries, FACT tables, whichever. But I say this with kindness, you're a dinosaur.
Honestly look around you and take a hard look at where databases are going. You might not like it, but stop and think for a moment.
Google, Amazon, Facebook, Netflix, Hadoop/MapReduce.
The idea of you "owning" the data and having it perfectly clean and usable without repetition is... cute.
And if I were in a small/medium sized bureaucracy-laden company that only had to deal with one database for clearing, processing, and reports - and I never talked to the business side - I am sure I would see no problem in an "SQL can do anything where I work" attitude. So the guys over at Google and Facebook clearly have no idea what they are talking about.
Information overload is happening, it is a problem people are trying to solve day-by-day. Tons of information is being collected from many different sources (blogs, cell phone usage, credit card purchase patterns, census data, transaction records). It is not clean, it is not pure, and you know what? That doesn't matter.
What matters is delivering a product/experience to your customers faster, cheaper, and with more quality.
You, in this situation, have exhibited more passion and rancor than I have seen in awhile. Good for you - I like passionate people. But you have yet to say anything meaningful. Which I dislike even more than I like passionate people.
Because you are the type of person that will slam your fist down saying:
"But E.F. Codd had said to his people. The way to god is through RDMS. The heathens, which are inferior to you in every-way, will at some point come together. You must at that point take up arms and by being a complete douche/tool in order to prevent their ascension. For no other idea other the RDMS is the way, all
others are corrupt and evil. These heathens with their strange ideas will kill your children, murder kittens, and throw puppies out of things called "cars". You must save the kittens! Go forth and be fruitful! WITH (NOLOCK) AMEN!"
So you like your niche of SQL and whatever else you do. So much so that you have become incapable of looking at other situations and/or solutions which, while young and immature, are probably going to replace your beloved system.
You fear change or at least fear thinking differently.
I am sure COBOL developers had this same passion and rancor when they met with RDBMS people. Because in 1959 Grace Hopper had said to them:
"There will be a time when a man with a name like a fish will come down to you and say his way is better, he will try to convert you. But beware his evils, he will corrupt you and you will end up killing kittens. Save the kittens! Be fruitful and multiply! PIC(X) AMEN!"
Terrance,
That is really interesting. Is this a Solr-only feature? Can I use it with pure Lucene?
Any good docs on it?
Interesting article and discussion. Thanks.
Managing data consistency seems like a tough problem with the document approach.
To be more concrete, in the sample if a book's rating gets bumped I'll need to do something like the following to keep things consistent:
var book = GetBook(1);
book.Rate(4);
Save(book);
// The rub, I need to remember to do this too:
var users = GetUsersWithRecommendedBook(1);
foreach (var u in users)
{
    // update the user's copy of the book's rating, then Save(u)?
}
Correct?
Marc,
No :-)
Frans,
The fun part about blog discussions is that you always have the chance to just say nothing if the other side irritates you. I spent several days thinking about it.
I am still waiting for the SQL required to make a "single indexed view" that would serve all the information required for this screen.
I am not asking for instructions on how to do it, and I can find the documentations on my own.
I am asking you to put in the minuscule amount of effort you seem certain this is going to take and actually produce this.
I was enjoying the argument immensely without any intention of jumping in until the following post...
"The idea of you "owning" the data and having it perfectly clean and usable without repetition is... cute. "
It seems almost a waste of time to rebut this... but here goes...
Let's see...
Ever need to count something accurately?
a. money
b. books
c. authors
d. customers
e. widgits
f. orders
Ever need to represent a financial transaction?
a. It's definitely not "cute" when you have to explain to the business
I missed the part of the requirement stating "close is good enough..."
Sorry if my tone seems a bit condescending, I'm just blown away by this sentiment...
Rob,
Let us take a bank account as a simple example.
It is NOT transactional, and it is probably the classic example for BASE.
The notion that you advocate doesn't really exist.
@Rob
1.) Yes
a. yes
b. maybe?
c. why?
d. why?
e. why?
f. why?
2.) Yes, and?
3.) And?
You are injecting an argument for transactional data (money, orders, etc) which by their definition are repeated. As my statement said, "clean, usable, without repetition". Transactional data is very repetitious and not normalized (for speed and for ease), so your point, while valid, is neither here nor there. But good point, as that could confuse other people, so thanks for bringing it up.
As for financial transactions and your view of them.... having to be "clean/pure/whatever".
Ever work for a large institutional brokerage house?
Guess how they manage checking each trade for validation, amount, balance, margin, etc to ensure that everything is "accurate and clean" while keeping the delay on the system down?
Answer: They don't. Not a single one does.
In these financial systems the slightest ms delay could cost thousands of dollars and upset a multimillion dollar client. At the time of trade there is no check, there is no accounting, the data isn't pristine, there is just data - not referentially checked. In fact it could be on a piece of paper.
When it comes time to clear everything, you then go through batch/workflow/processing to ensure that the amounts are cleared and settled.
Clearing/exchanges don't go a single day without some discrepancy in one or more trades. Both clearer and processor send clearing files to each other and sometimes * gasp * talk on the phone and this could take days!? What about that poor database, and those people that want the data perfect?! Oh no!
For transactional data / financial data and your view of clean and accurate data I recommend the following book:
www.amazon.com/.../ref=pd_lpo_k2_dp_sr_1
I might also recommend the below when talking about transactions as it concerns business.
"Starbucks does not use a two-phase commit."
www.eaipatterns.com/ramblings/18_starbucks.html
Again, I digress; my focus was on Frans, who enforces his judicious belief in how superior databases are at everything, and for whom questioning moving even the smallest item out of a database that uses SQL to a database of another form, be it ORM, EAV, or otherwise, resulted in his attack.
Congratulations, Ayende!
You chose to try a different model and you showed it to the world. The point here is not whether it could be done right with a relational DB. Of course it could.
The point is that it was easier, more intuitive and more suitable for the kind of problem you were trying to solve, isn't it?
Relational DBs are great to solve a lot of problems, but are not the only way.
Look at Google. They decided to rebuild everything by themselves (DB, filesystem, search algorithms, etc) because the existing software didn't fit their needs. And it was a great success. I agree with Terrance, open your mind, there could be other ways of doing things right!
Even though I'm not sold on the NoSQL approach yet, it's fascinating reading the arguments on either side between two people so clearly switched on with their approaches (regardless of questions of which is actually 'better').
On a general note, many, many new databases are appearing for data warehousing and general use that are not relational (have a look at illuminate). While relational databases work for smaller apps and singular relationship models, they are inherently limited because they were designed for procedural programming with a small set of data with singular relationships. NHibernate and other products exist to bridge the gap between OOP and RDBMS, but programmers still have to pull the complete dataset into memory to manage multiple relationships (e.g. "show me all tables, rows and columns that contain the text 'FOO'").
Beyond that there is a huge performance hit just to use the SQL-92 standard, from parsing SQL commands and data storage. The new column-based DBs are outperforming Oracle and MSSQL by a factor of 10, and correlation databases are at almost 100x. MSSQL and Oracle have dominated database software for almost 15 years now and it has become about dogma, not about performance. There is a reason both of their EULAs don't let independent sources benchmark their servers.
I didn't read all of the replies here. For the queries, it is actually very simple if you create a stored procedure and call it from your code. I guess the problem is that you are trying to use NHibernate for all database access.
This is an interesting topic. I arrived here in my searching of DDD and repositories as they apply/mesh with .NET technologies like LINQ-to-SQL and ADO.NET Entities in a layered architecture. Specifically, I'm currently battling the purpose and placement of the repository pattern in this scenario. So I have a non-LOB-specific framework, a DAL, a BLL, etc. Obviously, I'm using LINQ-based technologies in my DAL, and aim to keep my layers as decoupled as possible in case I need to physically tier/distribute them.
Anyway, within my organization, I have two segments that I want to model out in the architecture, namely both OLTP and OLAP. So I have a DW, and two operational data stores (ODS) in the transactional environment. As anyone with experience in both of these segments knows, the logical and physical models are vastly different. But, of course, my goal is to keep the conceptual business model (the domain model) the same. Obviously, I'm going to leverage my repositories to decouple most of this.
However, it gets interesting, as in Ayende's case, when I get to the business layer. The BIG decision point is: Should I keep a single organization-level "domain" model for all segments? Or would "domain" really mean one model per segment/department? On the one hand, I'm tempted to let my BLL encapsulate that and use a single set of domain objects as my BOs and let each BO contain the appropriate repositories per segment to accomplish the storage need, based on the business operations. But on the other, it seems I'm violating some key principles. Things like OLAP's 99% selectivity, OLTP's normalized read/write, not to mention federation, governance, and security due to a completely different set of users (yes, applications are users).
Should I keep these as separate "vertical" layers? Any thoughts? Anyone done it before?
Nix my comment. I've decided to go with specific repositories per segment. I'm now way past this, and have boiled the repository, unit of work, entities, and domain/business objects down. Basically, my idea is to wrap my repositories with "resource agents" and then a "resource manager" to act as a sort of controller. I'm borrowing some of this concept from the notion of "ambient transactions." The aim is to decouple so that my agent (depending on the kind of agent implementation) can act more autonomously in handling the repositories. So I'll keep my repositories intentionally lightweight, nothing more than a view, and let the domain object be the model, and my upstream resource manager be the controller. Actually, the "resource manager" will evolve into some type of controller - just not sure what to call it yet. But I hope to mimic the same semantics as the TransactionScope constructs to leverage the ambient resource agent concept. This will be especially nice for testing or virtualizing.
Admittedly, I'm still fuzzy on the impl. details, but roughly something like this:
ResourceAgent

ResourceManager : IUnitOfWork, IDisposable
    IDictionary<string, ResourceAgent> Agents
    Save(bool throwOnConflict)
    Revert()
    Dispose()

BusinessActivityScope (controller) : IDisposable
    ResourceManager
    Complete()
    Dispose()
    Add
Then I want to be able to use similar constructs as System.Transaction:
using(BusinessActivityScope s = new BusinessActivityScope()) {
Order order = new Order();
// do something with it
s.Add(order);
s.Complete();
}
Clearly there are HUGE holes in this as an "architecture" but I hope that it hints at the vision. With this, it wouldn't be hard to get it to play nicely with System.Transactions, as well as in a service-oriented or workflow environment. Further, it would be way slick to test, as you could, through either DI or AOP, let the agent discover environmental settings and, for example, automatically use a mock resource, and so on. It could also work in a tiered architecture, with not that much extra work.
I digress; my whole perspective was on balancing the appreciation of the pattern as well as how to boil it down nicely into a modern, contemporary technology domain. Sometimes those patterns get too "classical", if you know what I mean. Just look in the .NET framework, and you will find them everywhere, but often with a more contemporary (some use the term "optimized") implementation.
Lastly, Ayende, you almost always post interesting stuff. I've been following your blogs and articles since 2005-ish. You didn't know it, but you were a NHibernate mentor for me back in the day, hahahaha! Time has sped along!