DBMS Musings: PACELC

Due to Hurricane Sandy, Yale gave me a day off from teaching today and I have finally been able to get to a few things on my "to-do" list. One of them is to write a blog post about the IEEE Computer CAP Retrospective edition and make my paper that appeared inside of it publicly available.

Earlier this year, the IEEE Computer magazine came out with an issue largely devoted to a 12-year retrospective of the CAP theorem and contains several articles from distributed systems researchers that contribute various opinions and thoughts about CAP. The first article is from Eric Brewer, who coined the CAP theorem 12 years ago (though he points out in his article that it was actually 14 years ago). A PDF of Brewer’s article is available for free from: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed. The second article is from Seth Gilbert and Nancy Lynch (the same Gilbert and Lynch that proved the CAP theorem 10 years ago).

The third article is from me, and contains my criticisms of CAP that long-time readers of my blog will be familiar with. In particular, I point out that many people assume that modern NoSQL systems relax consistency guarantees in order to gain availability due to the constraints of the CAP theorem, when the reality is that these systems give up on consistency even in the absence of network partitions, which is not required according to the CAP theorem. The reason why they give up on consistency is because of a desire to improve system latency, an increasingly important requirement in the modern impatient world. I then describe the latency-consistency tradeoff in more detail, and end the article with the PACELC reformulation of CAP that debuted on my blog over two years ago. With the permission of the IEEE, I am making a free version of this article available today. This article is the first time that the PACELC formulation and my thoughts on CAP appear in a scholarly article, which gives people a venue to refer to (bibtex code available here) when citing this work (you can stop citing a blog post!)

The fourth article is from Raghu Ramakrishnan, entitled “CAP and Cloud Data Management” and describes the PNUTS system that I have mentioned in the past as a good example of a system for which the consistency-latency tradeoff has had a more direct impact on the system design than the consistency-availability tradeoff of CAP. The fifth article is from Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell of Cornell University on overcoming CAP with soft-state replication. Unfortunately, I cannot find a free link to Raghu’s article, but if you have an IEEE account, you can access it at at: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6122007&tag=1. The Birman et. al. article can be found for free at: http://www.cs.cornell.edu/Projects/mrc/CAP.pdf.

If you have enjoyed my thoughts on CAP on this blog, I highly recommend you read each of these five articles.

The Brewer article in particular acknowledges my past criticism of CAP not actually being about picking two of three out of C (consistency), A (availability), and P (partition tolerance) due to the fact that it does not make sense to reason about a system that is ‘CA’. (If there is no partition, any system can be both consistent and available --- the only question is what happens when there is a partition --- does consistency or availability get sacrificed?) Brewer uses this observation to lead into a nice generalization of consistency-availability tradeoff. In particular, when a partition occurs, the system does three things: (1) detect that the partition occurred, (2) enter a partition mode that may or may not limit some operations, and (3) initiate some sort of reconciliation algorithm when the partition is fixed. Depending on how these three things are implemented, it is possible to obtain much of the spectrum between CP systems and AP systems. The article also contains a nice reference to the CRDT work by Shapiro et. al. at INRIA. Overall, I strongly support Brewer’s approach to navigating this tradeoff. It also fits nicely with Mehul Shah’s talk at HPTS in the way that the spectrum between consistency and availability is explicitly considered at system design time, rather than trying to bolt consistency on top of an AP (eventually consistent) system after the fact (a wildly suboptimal endeavor).

While most of Brewer’s article focused on the consistency-availability tradeoff, Brewer also briefly acknowledges that “in its classic interpretation, the CAP theorem ignores latency”, and that some systems reduce consistency for latency (he even refers to the PNUTS example I used in my original blog post). I remain convinced that PACELC is the best way to reason about both of these tradeoffs in a single formulation: if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?

As 24/7 availability becomes increasingly important for modern applications, database systems are frequently replicated in order to stay up and running in the face of database server failure. It is no longer acceptable for an application to wait for a database to recover from a log on disk --- most mission-critical applications need immediate failover to a replica.

There are several important tradeoffs to consider when it comes to system design for replicated database systems. The most famous one is CAP --- you have to trade off consistency vs. availability in the event of a network partition. In this post, I will go into detail about a lesser-known but equally important tradeoff --- between latency and consistency. Unlike CAP, where consistency and availability are only traded off in the event of a network partition, the latency vs. consistency tradeoff is present even during normal operations of the system. (Note: the latency-consistency tradeoff discussed in this post is the same as the "ELC" case in my PACELC post).

The intuition behind the tradeoff is the following: there's no way to perform consistent replication across database replicas without some level of synchronous network communication. This communication takes time and introduces latency. For replicas that are physically close to each other (e.g., on the same switch), this latency is not necessarily onerous. But replication over a WAN will introduce significant latency.

The rest of this post adds more meat to the above intuition. I will discuss several general techniques for performing replication, and show how each technique trades off latency or consistency. I will then discuss several modern implementations of distributed database systems and show how they fit into the general replication techniques that are outlined in this post.

There are only three alternatives for implementing replication (each with several variations): (1) data updates are sent to all replicas at the same time, (2) data updates are sent to an agreed upon master node first, or (3) data updates are sent to a single (arbitrary) node first. Each of these three cases can be implemented in various ways; however each implementation comes with a consistency-latency tradeoff. This is described in detail below.

Data updates are sent to all replicas at the same time. If updates are not first passed through a preprocessing layer or some other agreement protocol, replica divergence (a clear lack of consistency) could ensue (assuming there are multiple updates to the system that are submitted concurrently, e.g., from different clients), since each replica might choose a different order with which to apply the updates . On the other hand, if updates are first passed through a preprocessing layer, or all nodes involved in the write use an agreement protocol to decide on the order of operations, then it is possible to ensure that all replicas will agree on the order in which to process the updates, but this leads to several sources of increased latency. For the case of the agreement protocol, the protocol itself is the additional source of latency. For the case of the preprocessor, the additional sources of latency are:
1. Routing updates through an additional system component (the preprocessor) increases latency
2. The preprocessor either consists of multiple machines or a single machine. If it consists of multiple machines, an agreement protocol to decide on operation ordering is still needed across machines. Alternatively, if it runs on a single machine, all updates, no matter where they are initiated (potentially anywhere in the world) are forced to route all the way to the single preprocessor first, even if there is a data replica that is nearer to the update initiation location.

Data updates are sent to an agreed upon location first (this location can be dependent on the actual data being updated) --- we will call this the “master node” for a particular data item. This master node resolves all requests to update the same data item, and the order that it picks to perform these updates will determine the order that all replicas perform the updates. After it resolves updates, it replicates them to all replica locations. There are three options for this replication:
1. The replication is done synchronously, meaning that the master node waits until all updates have made it to the replica(s) before "committing" the update. This ensures that the replicas remain consistent, but synchronous actions across independent entities (especially if this occurs over a WAN) increases latency due to the requirement to pass messages between these entities, and the fact that latency is limited by the speed of the slowest entity.
2. The replication is done asynchronously, meaning that the update is treated as if it were completed before it has been replicated. Typically the update has at least made it to stable storage somewhere before the initiator of the update is told that it has completed (in case the master node fails), but there are no guarantees that the update has been propagated to replicas. The consistency-latency tradeoff in this case is dependent on how reads are dealt with:
  1. If all reads are routed to the master node and served from there, then there is no reduction in consistency. However, there are several latency problems with this approach:
    1. Even if there is a replica close to the initiator of the read request, the request must still be routed to the master node which could potentially be physically much farther away.
    2. If the master node is overloaded with other requests or has failed, there is no option to serve the read from a different node. Rather, the request must wait for the master node to become free or recover. In other words, there is a potential for increased latency due to lack of load balancing options.
  2. If reads can be served from any node, read latency is much better, but this can result in inconsistent reads of the same data item, since different locations have different versions of a data item while its updates are still being propagated, and a read can potentially be sent to any of these locations. Although the level of reduced consistency can be bounded by keeping track of update sequence numbers and using them to implement “sequential/timeline consistency” or “read-your-writes consistency”, these options are nonetheless reduced consistency options. Furthermore, write latency can be high if the master for a write operation is geographically far away from the requester of the write.
3. A combination of (a) and (b) are possible. Updates are sent to some subset of replicas synchronously, and the rest asynchronously. The consistency-latency tradeoff in this case again is determined by how reads are dealt with. If reads are routed to at least one node that had been synchronously updated (e.g. when R + W > N in a quorum protocol, where R is the number of nodes involved in a synchronous read, W is the number of nodes involved in a synchronous write, and N is the number of replicas), then consistency can be preserved, but the latency problems of (a), (b)(i)(1), and (b)(i)(2) are all present (though to somewhat lower degrees, since the number of nodes involved in the synchronization is smaller, and there is potentially more than one node that can serve read requests). If it is possible for reads to be served from nodes that have not been synchronously updated (e.g. when R + W <= N), then inconsistent reads are possible, as in (b)(ii) above .
Data updates are sent to an arbitrary location first, the updates are performed there, and are then propagated to the other replicas. The difference between this case and case (2) above is that the location that updates are sent to for a particular data item is not always the same. For example, two different updates for a particular data item can be initiated at two different locations simultaneously. The consistency-latency tradeoff again depends on two options:
1. If replication is done synchronously, then the latency problems of case (2)(a) above are present. Additionally, extra latency can be incurred in order to detect and resolve cases of simultaneous updates to the same data item initiated at two different locations.
2. If replication is done asynchronously, then similar consistency problems as described in case (1) and (2b) above present themselves.

Therefore, no matter how the replication is performed, there is a tradeoff between consistency and latency. For carefully controlled replication across short distances, there exists reasonable options (e.g. choice 2(a) above, since network communication latency is small in local data centers); however, for replication over a WAN, there exists no way around the significant consistency-latency tradeoff.

To more fully understand the tradeoff, it is helpful to consider how several well-known distributed systems are placed into the categories outlined above. Dynamo, Riak, and Cassandra choose a combination of (2)(c) and (3) from the replication alternatives described above. In particular, updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c)). Reads are synchronously sent to R nodes with R + W typically being set to a number less than or equal to N, leading to a possibility of inconsistent reads. However, the system does not always send updates to the same node for a particular data item (e.g., this can happen in various failure cases, or due to rerouting by a load balancer), which leads to the situation described in alternative (3) above, and the potentially more substantial types of consistency shortfalls. PNUTS chooses option (2)(b)(ii) above, for excellent latency at reduced consistency. HBase chooses (2) (a) within a cluster, but gives up consistency for lower latency for replication across different clusters (using option (2)(b)).

In conclusion, there are two major reasons to reduce consistency in modern distributed database systems, and only one of them is CAP. Ignoring the consistency-latency tradeoff of replicated systems is a great oversight, since it is present at all times during system operation, whereas CAP is only relevant in the (arguably) rare case of a network partition. In fact, the consistency-latency tradeoff is potentially more significant than CAP, since it has a more direct effect of the baseline operations of modern distributed database systems.

DBMS Musings

Monday, October 29, 2012

IEEE Computer issue on the CAP Theorem

Wednesday, December 7, 2011

Replication and the latency-consistency tradeoff

Blog Archive

Followers