tag:blogger.com,1999:blog-88996458009480094962024-12-28T15:00:56.461-08:00DBMS MusingsDaniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]Blogger55125tag:blogger.com,1999:blog-8899645800948009496.post-59915427099993948102021-03-25T08:52:00.002-07:002021-03-25T08:52:13.760-07:00My thoughts on the data mesh<p>The concept of a "data warehouse" has been around for a long time. A really, really long time. The term started being used in the 1970s, and has essentially retained the same meaning for its 50 years of existence, which is an eternity in the realm of computer science. </p><p>The data warehouse consists of two basic components:</p><p></p><ol style="text-align: left;"><li>An organizational process for ingesting data from operational data stores, cleaning and transforming this data, and then serving it from a central and accessible location within an enterprise.</li><li>Database software that implements the storage, modeling, and serving of the data in the data warehouse.</li></ol><p></p><p>The second component --- the database software --- has made tremendous progress over the past five decades. The current software is much faster, more fault tolerant, and more scalable than the original data warehousing software.</p><p>However, the first component --- the organizational process --- has been much slower to modernize, and still scales very poorly. </p><p>The data mesh, recently proposed by Zhamak Dehghani, is a pretty major paradigm shift that has the potential to bring this first component forward in a rare major redesign of the organizational process. </p><p>I wrote up my thoughts in detail in <a href="https://blog.starburstdata.com/data-mesh-the-answer-to-the-data-warehouse-hypocrisy">a guest post on Starburst's blog</a>. The short summary of my thesis in that post is that a major reason why data warehousing organizational processes fail is that they don't scale. 
The data mesh has the potential to do for the data warehouse from an organizational perspective what the parallel DBMS did for the data warehouse from a database software scalability perspective. </p>Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]0tag:blogger.com,1999:blog-8899645800948009496.post-43734568483039453742019-12-18T12:28:00.000-08:002019-12-18T14:35:09.693-08:00It's time to rethink how we share data on the WebThe world wide web (WWW) has had an extraordinary impact
on our day-to-day lives. An enormous amount of information is
available to any participant at extremely low cost (usually this cost
is paid via one’s attention to advertisements). However, the interface
is fundamentally limited. A user must either have pre-existing
knowledge of the location of the needed information (e.g., the correct
URL), or use a search interface which generally attempts to
match words in a search query with the natural language found on
the web. It is totally impossible to query the entire Internet with
a single SQL query (or any other structured query language), and
even if you could, the data available on the WWW is not published
in a format which would be amenable to such queries. Even for those Websites that provide an API to access structured data, the data is typically provided in JSON format, which is orders of magnitude slower to process than native relational data formats, and usually non-interoperable with similar datasets provided by other Websites.<br />
<br />
A small group of researchers and practitioners are today releasing a vision for a complete redesign of how we share structured data on the Internet. This vision is outlined in a <a href="http://www.cs.umd.edu/~abadi/papers/anylogAbadiEtAl.pdf">13 page paper</a> that will be appearing in the data systems vision conference called <a href="http://cidrdb.org/cidr2020/">CIDR </a>that convenes next month. This vision includes a proposed architecture of a completely decentralized, ownerless platform for sharing structured data. This platform aims to enable a new WWW for structured data (e.g.,
data that fits in rows and columns of relational tables), with an
initial focus on IoT data. Anybody can publish structured data using their preferred schema, and they retain the ability to specify the permissions of that data. Some data will be published
with open access—in which case it will be queryable by any user of the platform. Other data will be published in encrypted form, in which
case only users with access to the decryption key may access query
results.<br />
<br />
The platform is designed to make it easy for users not only to publish IoT (or any other type of structured) datasets, but also to be rewarded every time the data that they publish is queried. The platform provides a SQL interface that supports querying the entire wealth of previously published data. Questions such as:
“What was the maximum temperature reported in Palo Alto on June
21, 2008?” or “What was the difference in near accidents between
self-driving cars that used deep-learning model X vs. self-driving
cars that used deep-learning model Y?” or “How many cars passed
the toll bridge in the last hour?” or “How many malfunctions were
reported by a turbine of a particular model in all deployments in
the last year?” can all be expressed using clean and clearly specified
SQL queries over the data published in the platform from many
different data sources.<br />
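To make the flavor of such queries concrete, here is a minimal sketch of how the first question above maps to plain SQL, using SQLite in Python. The schema, table name, and data are entirely hypothetical; real schemas on the platform would be registered per application domain by publishers.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE temperature_readings (
                    city TEXT, reading_date TEXT, temp_f REAL)""")
conn.executemany(
    "INSERT INTO temperature_readings VALUES (?, ?, ?)",
    [("Palo Alto", "2008-06-21", 92.0),
     ("Palo Alto", "2008-06-21", 88.5),
     ("Palo Alto", "2008-06-22", 85.0)])

# "What was the maximum temperature reported in Palo Alto on June 21, 2008?"
(max_temp,) = conn.execute(
    "SELECT MAX(temp_f) FROM temperature_readings"
    " WHERE city = ? AND reading_date = ?",
    ("Palo Alto", "2008-06-21")).fetchone()
print(max_temp)  # 92.0
```

The point of the vision is that one such query could range over all data published under that schema, by any publisher, rather than over a single local database.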
<br />
The Internet of Things was chosen as the initial use-case for this vision since the data is machine-generated and usually requires less
cleaning than human-generated data. Furthermore, there are a limited
number of unique devices, with typically many instances of a
particular unique device. Each instance of a device (that is running
a particular software version) produces data according to an identical
schema (for a long period of time). This reduces the complexity
of the data integration problem. In many cases, device manufacturers
can also include digital signatures that are sent along with any
data generated by that device. These signatures can be used to verify
that the data was generated by a known manufacturer, thereby
reducing the ability of publishers to profit off of the contribution of
“fake data” to the platform.<br />
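As a rough illustration of the signature check described above, here is a sketch in Python. It uses a shared-key HMAC purely for brevity; a real deployment would presumably use asymmetric (public-key) signatures so that verifiers never hold the signing key, and the key, device name, and payload here are all invented for this example.

```python
import hashlib
import hmac

# Hypothetical signing key embedded in a manufacturer's device firmware.
MANUFACTURER_KEY = b"hypothetical-device-signing-key"

def sign_reading(payload: bytes) -> bytes:
    # Computed on-device and sent along with the generated data.
    return hmac.new(MANUFACTURER_KEY, payload, hashlib.sha256).digest()

def verify_reading(payload: bytes, signature: bytes) -> bool:
    # Run before crediting a publisher for contributed data.
    expected = hmac.new(MANUFACTURER_KEY, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

reading = b'{"device": "turbine-x1", "rpm": 3600}'
sig = sign_reading(reading)
assert verify_reading(reading, sig)             # genuine data accepted
assert not verify_reading(b'{"fake": 1}', sig)  # "fake data" rejected
```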
<br />
As alluded to above, publishers receive a financial reward
every time the data that they contributed participates in a query
result. This reward accomplishes three important goals: (1) it motivates
data owners to contribute their data to the platform; (2) it
motivates data owners to make their data public (since public data
will be queried more often than private data); and (3) it motivates data
owners to use an existing schema to publish their data (instead of
creating a new one).<br />
<br />
The first goal is an important departure from the WWW, where
data contributors are motivated by the fame and fortune that come
with bringing people directly to their website. Monetizing this web
traffic through ad revenue disincentivizes interoperability since providing
access to the data through a standardized API reduces the
data owner’s ability to serve advertisements. Instead, the proposed architecture enables
data contributors to monetize data through a SQL interface
that can answer queries from any source succinctly and directly. Making this data public, the second goal, increases the potential
for monetization.<br />
<br />
The third goal is a critical one for structured data: the data integration
problem is best approached at the source—at the time that
the data is generated rather than at query time. The proposed architecture aims
to incentivize data integration prior to data publication by allowing
free market forces to generate consensus on a small number of economically
viable schemas per application domain. Of course, this incentivization
does not completely solve the data integration problem,
but we expect the platform to be useful for numerous application domains
even when large amounts of potentially relevant data must
be ignored at query time due to data integration challenges.<br />
<br />
As a fully decentralized system, anybody can create an interface
to the data on the platform --- both for humans and for machines. We envision that a typical human-oriented interface would
look like the following: users are presented with a faceted interface
that helps them to choose from a limited number of application
domains. Once the domain is chosen, the user is presented with another
faceted interface that enables the user to construct selection
predicates (to narrow the focus of the data that the user is interested
in within that domain). After this is complete, one of the schemas
from all of the registered schemas for that domain is selected based
on which datasets published using that schema contain the most
relevant data based on the user’s predicates. After the schema is
chosen, the interface aids the user in creating a static or streaming
SQL query over that schema. The entire set of data that was published using that schema, and to which the user who
issued the query has access, is queried. The results are combined,
aggregated, and returned to the user. Machine interfaces would likely skip most of these steps, and instead query the platform directly in SQL (potentially after issuing some initial queries to access important metadata that enable the final SQL query to be formed).<br />
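The schema-selection step above can be sketched roughly as follows. The registered schemas, their datasets, and the ranking rule are all hypothetical; an actual interface would rank schemas by relevance to the user's predicates in a more sophisticated way.

```python
# Made-up registered schemas for one application domain, each mapped to
# the rows published under it.
registered_schemas = {
    "weather_v1": [{"city": "Palo Alto", "temp_f": 92.0},
                   {"city": "Boston", "temp_f": 71.0}],
    "weather_v2": [{"city": "Palo Alto", "temp_f": 91.5},
                   {"city": "Palo Alto", "temp_f": 88.0}],
}

def pick_schema(schemas, predicate):
    # Rank schemas by how many of their published rows satisfy the
    # user's selection predicates; pick the best-covered schema.
    return max(schemas,
               key=lambda s: sum(predicate(row) for row in schemas[s]))

chosen = pick_schema(registered_schemas,
                     lambda row: row["city"] == "Palo Alto")
assert chosen == "weather_v2"  # its datasets hold the most matching rows
```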
<br />
The proposed architecture incorporates third-party contractors and
coordinators for storing and providing query access to data. Contractors
and coordinators act as middlemen between data publishers
and consumers. This allows publishers to meaningfully participate in the network without having to provide resources
for storage and processing of data. This also facilitates managing data
at the edge.<br />
<br />
Despite making the system easier to use for publishers, the existence
of contractors and coordinators in the architecture presents two
challenges: (1) how to incentivize them to participate, and (2) how
to preserve the integrity of data and query results when untrusted
and potentially malicious entities are involved in the storage and
processing. The <a href="http://www.cs.umd.edu/~abadi/papers/anylogAbadiEtAl.pdf">published CIDR paper</a> proposes an infrastructure to solve both these
challenges.<br />
<br />
To summarize the solutions briefly: contractors and coordinators are incentivized similarly to publishers,
by a financial reward for every query they serve. Querying
the platform requires a small payment of tokens (a human-facing interface may serve advertisements to subsidize this token cost). These payment
tokens are shared between the publishers that contributed data that
was returned by the query, along with the contractors and coordinators
that were involved in processing that query.
The financial reward received per query incentivizes participation of contractors and coordinators in query processing. However,
it does not ensure that the participation is honest and correct query
results are returned. In fact, without safeguards, contractors and
coordinators can make more money by not wasting local resources
on query processing and instead returning half-baked answers
to query requests.<br />
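The payment-sharing scheme a few paragraphs above can be illustrated with a toy sketch. The 50/40/10 percentages are invented purely for illustration; the actual division of rewards is left to the platform's economic design.

```python
# Hypothetical split of one query's token payment between the publishers
# whose data appeared in the result and the contractors/coordinators
# that processed the query.
def split_payment(tokens, publishers, contractors, coordinators):
    pools = [(50, publishers), (40, contractors), (10, coordinators)]
    shares = {}
    for percent, members in pools:
        # Each pool's share is divided evenly among its members.
        for member in members:
            shares[member] = tokens * percent / (100 * len(members))
    return shares

shares = split_payment(100, ["pub_a", "pub_b"], ["contractor_1"], ["coord_1"])
assert shares == {"pub_a": 25.0, "pub_b": 25.0,
                  "contractor_1": 40.0, "coord_1": 10.0}
```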
<br />
Indeed, one of the main obstacles to building decentralized database
systems like the one we are proposing is securing the confidentiality, integrity,
and availability of data, query results, and payment/incentive processing
when the participants in the system are mutually distrustful
and no universally-trusted third party exists. Until
relatively recently, the security mechanisms necessary for building
such a system did not exist, were too inefficient, or were unable to
scale. Today, we believe recent advances in secure query processing,
blockchain, byzantine agreement, and trusted execution environments
put secure decentralized database systems within reach.
The proposed infrastructure uses a combination of these mechanisms
to secure data and computation within the system. For more details, please take a look at the <a href="http://www.cs.umd.edu/~abadi/papers/anylogAbadiEtAl.pdf">CIDR paper</a>!<br />
<br />
I have a student, <a href="http://gangliao.me/">Gang Liao</a>, who recently started building a prototype of the platform (research codebase at: <a href="https://github.com/DSLAM-UMD/P2PDB">https://github.com/DSLAM-UMD/P2PDB</a>). Please contact us if you have some IoT data you can contribute our research prototype. Separate from this academic effort, there is also a company called <a href="https://anylog.co/">AnyLog </a>that has taken some of the ideas from the research paper and is building a non-fully decentralized version of the prototype.Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]0tag:blogger.com,1999:blog-8899645800948009496.post-39650284864171403672019-10-07T08:25:00.000-07:002019-12-05T15:50:07.418-08:00Introducing SLOG: Cheating the low-latency vs. strict serializability tradeoffThis post provides an overview of a new technology from my lab that was recently <a href="http://www.cs.umd.edu/~abadi/papers/1154-Abadi.pdf">published in VLDB 2019</a>. In short: SLOG is a geographically distributed data store that guarantees both strict serializability and low latency (for both reads and writes). The first half of this post is written to convince you that it is impossible to achieve both strict serializability and low latency reads and writes, and gives examples of existing systems that achieve one or the other, but never both. The second half of the post describes how SLOG cheats this tradeoff and is the only system in research or industry that achieves both (in the common case). <br />
<br />
<h3>
The Strict Serializability vs. Low Latency Tradeoff</h3>
Modern applications span geographic borders with users located on any continent worldwide. Application state thus must be accessible anywhere. If application state never changes, it is possible for all users to interact with the global application with low latency. Application state is simply replicated to many locations around the world, and users can interact with their nearest replica. But for most applications, state changes over time: products get purchased, money is exchanged, users become linked, resources are consumed, etc. The problem is: sending messages across the world can take hundreds of milliseconds. There is thus a necessary delay before information about the state change travels to all parts of the world. In the meantime, applications have two choices: they can present stale, out-of-date state to the user, or they can make the user wait until the correct, up-to-date state arrives.<br />
<br />
Presenting stale state allows for low-latency interactions with the application around the world, but may lead to negative experiences by the end user. For example, observing stale state may result in users purchasing products that no longer exist, overdrawing balances from bank accounts, confusing discussion threading on social media, or reacting to incorrect stimuli in online multiplayer games.<br />
<br />
Applications typically store state in some kind of “data store”. Some data stores guarantee “strong consistency” (such as strict consistency, atomic consistency, or linearizability, which I discussed in a <a href="https://dbmsmusings.blogspot.com/2019/07/overview-of-consistency-levels-in.html">previous post</a>) that ensures that stale state will never be accessed. More specifically: any requests to access state on behalf of a client will reflect all state changes that occurred prior to that request. However, guaranteeing strong consistency may result in slower (high-latency) response times. In contrast, other data stores make reduced consistency guarantees and as a result, can achieve improved latency. In short, there is a tradeoff between consistency and latency: this is the ELC part of the <a href="http://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf">PACELC theorem</a>.<br />
<br />
Separately, some systems provide support for transactions: groups of read and write requests that occur together as atomic units. Such systems typically provide “isolation” guarantees that prevent transactions from reading incomplete/temporary/incorrect data that were written by other concurrently running transactions. There are different levels of isolation guarantees, which I’ve discussed at length in previous posts (<a href="https://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html">post 1</a>, <a href="https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html">post 2</a>, <a href="https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html">post 3</a>), but the highest level is “serializability”, which guarantees that concurrent transactions are run in an equivalent fashion to how they would have been run if there were no other concurrently running transactions. In geo-distributed systems, where concurrently running transactions are running on physically separate machines that are potentially thousands of miles apart, serializability becomes harder to achieve. In fact, there exist <a href="http://www.nawab.me/Uploads/Nawab_Helios_SIGMOD2015.pdf">proofs in the literature</a> that show that there is a fundamental tradeoff between serializability and latency in geographically distributed systems.<br />
<br />
Bottom line: there is a fundamental tradeoff between consistency and latency. And there is another fundamental tradeoff between serializability and latency.<br />
<br />
Strict serializability <a href="https://dbmsmusings.blogspot.com/2019/08/an-explanation-of-difference-between.html">includes perfect isolation (serializability) and consistency (linearizability) guarantees</a>. If either one of those guarantees by itself is enough to incur a tradeoff with latency, then certainly the combination will necessarily require a latency tradeoff.<br />
<br />
Indeed, a close examination of the commercial (and academic) data stores that guarantee strict serializability highlights this latency tradeoff. For example, <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">Spanner</a> (along with all the Spanner-derivative systems: <a href="https://cloud.google.com/spanner/">Cloud Spanner</a>, <a href="https://www.cockroachlabs.com/">CockroachDB</a>, and <a href="https://www.yugabyte.com/">YugaByte</a>) runs an agreement protocol (e.g. Paxos/Raft) as part of every write to the data store in order to ensure that a majority of replicas agree to the write. This global coordination helps to ensure strict serializability, but increases the latency of each write. [<i>Side note: neither CockroachDB nor YugaByte currently guarantee full strict serializability, but the reason for that is unrelated to the discussion in this post, and is discussed at length in several of my previous posts (<a href="https://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html">post 1</a>, <a href="https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html">post 2</a>). The main point that is relevant for this post is that they inherit Spanner’s requirement to do an agreement protocol upon each write, which facilitates achieving a higher level of consistency than what would have been achieved if they did not inherit this functionality from Spanner.</i>] <br />
<br />
For example, if there are three copies of the data --- in North America, Europe, and Asia --- then every single write in Spanner and Spanner-derivative systems requires agreement between at least two continents before it can complete. It is impossible to complete a write in less than 100 milliseconds in such deployments. Such high write latencies are unacceptable for most applications. Cloud Spanner, aware of this latency problem that arises from the Spanner architecture, refuses to allow such “global” deployments. Rather, all writable copies of data must be within a 1000 mile radius (as of the time of publication of this post … see:<a href="https://cloud.google.com/spanner/docs/instances"> https://cloud.google.com/spanner/docs/instances</a> which currently says: “In general, the voting regions in a multi-region configuration are placed geographically close—less than a thousand miles apart—to form a low-latency quorum that enables fast writes (<a href="https://cloud.google.com/spanner/docs/replication#aside-why-read-only-and-witness-replicas">learn more</a>).”). Spanner replicas outside of this 1000 mile radius must be read-only replicas that operate at a lag behind the quorum eligible copies. In summary, write latency is so poor in Spanner that they disable truly global deployments. [<i>Side note: CockroachDB improves over Spanner by allowing truly global deployments and running Paxos only within geo-partitions, but as mentioned in the side note above, not with strict serializability.</i>]<br />
<br />
<a href="https://aws.amazon.com/rds/aurora/">Amazon Aurora</a> --- like Cloud Spanner --- explicitly disables strictly serializable global deployments. As of the time of this post, Aurora only guarantees serializable isolation for single-master deployments, and even then only on the primary instance. The read-only replicas operate at a maximum of “repeatable read” isolation.<br />
<br />
Calvin and Calvin-derivative systems (such as <a href="https://fauna.com/">Fauna</a>, <a href="https://github.com/domsj/streamy-db">Streamy-db</a>, and <a href="https://dl.acm.org/citation.cfm?id=2915227">T-Part</a>) typically have better write latency than Spanner-derivative systems (see my <a href="http://dbmsmusings.blogspot.com/2018/12/partitioned-consensus-and-its-impact-on.html">previous post on this subject</a> for more details), but they still have to run an agreement protocol across the diameter of the deployment for every write transaction. The more geographically disperse the deployment, the longer the write latency.<br />
<br />
In the research literature, papers like <a href="http://mdcc.cs.berkeley.edu/mdcc.pdf">MDCC</a> and <a href="https://irenezhang.net/papers/tapir-sosp15.pdf">TAPIR</a> require at least one round trip across the geographic diameter of the deployment in order to guarantee strict serializability. Again: for truly global deployments, this leads to hundreds of milliseconds of latency. <br />
<br />
In short: whether you focus on the theory, or whether you focus on the practice, it is seemingly impossible to achieve both strict serializability (for arbitrary transactions) and low latency reads and writes.<br />
<br />
<h3>
How SLOG Cheats the Strict Serializability vs. Low Latency Tradeoff</h3>
Let’s say that a particular data item was written by a client in China. Where is the most likely location from which it will be accessed next? In theory, it could be accessed from anywhere. But in practice for most applications, (at least) 8 times out of 10, it will be accessed by a client in China. There is a natural locality in data access that existing strictly serializable systems have failed to exploit.<br />
<br />
If the system can be certain that the next read after a write will occur from the same region of the world that originated the write, there is no need (from a strict serializability correctness perspective) to synchronously replicate the write to any other region. The next read will be served from the same near-by replica from which the write originated, and is guaranteed to see the most up to date value. The only reason to synchronously replicate to another region is for availability --- in case an entire region fails. Entire region failure is rare, but still important to handle in highly available systems. Some systems, such as Amazon Aurora, only synchronously replicate to a nearby region. This limits the latency effect from synchronous replication. Other systems, such as Spanner, run Paxos or Raft, which causes synchronous replication to a majority of regions in a deployment --- a much slower process. The main availability advantage of Paxos/Raft over replication to only nearby regions is to be able to remain (partially) available in the event of network partitions. However, Google themselves claim that <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf">network partitions are very rare, and have almost no effect on availability</a> in practice. And Amazon clearly came to the same conclusion in their decision to avoid Paxos/Raft from the beginning.<br />
<br />
To summarize, in practice: replication within a region and to near-by regions can achieve near identical levels of availability as Paxos/Raft, yet avoid the latency costs. And if there is locality in data access, there is no need to run synchronous replication of every write for any reason unrelated to availability. So <b>in effect, for workloads with access locality, there is no strict serializability vs. low latency tradeoff</b>. Simply serve reads from the same location as the most recent write to that same data item, and synchronously replicate writes only within a region and to near-by regions for availability.<br />
<br />
Now, let’s drop the assumption of perfect locality. Those reads that initiate far from where that data item was last written must travel farther and thus necessarily take longer (i.e. high latency). If such reads are common, it is indeed impossible to achieve low latency and strict serializability at the same time. However, as long as the vast majority of reads initiate from locations near to where that item was last written, the vast majority of reads will achieve low latency. And writes also achieve low latency (since they are not synchronously replicated across the geographic diameter of the deployment). In short, low latency is achieved for the vast majority of reads and writes, without giving up strict serializability. <br />
<br />
Things get complicated in two cases: (1) How does the system that is processing a read know which region to send it to? In other words, how does any arbitrary region know which was the last region to write a data item? (2) What happens if a transaction needs to access multiple data items that were last written by different regions?<br />
<br />
The solution to (1) is simpler than the solution to (2). For (1), each data item is assigned only one region at a time as its “home region”. Writes to that data item can only occur at that home region. As the region locality of a data item shifts over time, an explicit hand-off process automatically occurs in which that data item is “rehoused” at a new region. Each region has a cache of the current home region for each data item. If the cache is out of date for a particular read, the read will be initially sent to the wrong region, but that region will immediately forward the read request to the new home of that data item. <br />
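The home-region lookup and forwarding logic can be sketched roughly as follows. The region names, the cached mapping, and the single data item are all hypothetical, and the real system tracks many more details per item.

```python
# Ground truth after a rehousing hand-off: item "x" now lives in Europe.
authoritative_home = {"x": "europe"}
# The USA region's cache has not yet observed the hand-off.
region_cache = {"usa": {"x": "usa"}}

def route_read(origin, item):
    # Send the read wherever the origin's cache says the item lives.
    target = region_cache.get(origin, {}).get(item, authoritative_home[item])
    hops = 1
    if target != authoritative_home[item]:
        # Stale cache: the wrong region forwards the read to the item's
        # new home, and the origin refreshes its cache entry.
        target = authoritative_home[item]
        region_cache[origin][item] = target
        hops = 2
    return target, hops

served_at, hops = route_read("usa", "x")
assert (served_at, hops) == ("europe", 2)  # one extra forward on a stale cache
served_at, hops = route_read("usa", "x")
assert (served_at, hops) == ("europe", 1)  # cache is now up to date
```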
<br />
The solution to (2) is far more complicated. Getting such transactions to run correctly without unreasonably high latencies was a huge challenge. Our initial approach (not discussed in the paper we published) was to rehouse all data items accessed by a transaction to the same region prior to running it. Unfortunately, this approach interfered with the system’s ability to take advantage of the natural region locality in a workload (by forcing data items to be rehoused to non-natural locations) and made it nearly impossible to run transactions with overlapping access sets in parallel, which destroyed the throughput scalability of the system.<br />
<br />
Instead, the <a href="http://www.cs.umd.edu/~abadi/papers/1154-Abadi.pdf">paper we published</a> uses a novel asynchronous replication algorithm that is designed to work with a deterministic execution framework based on our previous research on <a href="http://www.cs.umd.edu/~abadi/papers/calvin-sigmod12.pdf">Calvin</a>. This solution enables processing of “multi-home” transactions with no forced rehousing and half-round-trip cross-region latency. For example, let’s say that a transaction wants to read, modify, and write two data items: X and Y, where X is housed in the USA and Y is housed in Europe. All writes to X that occurred prior to this transaction are asynchronously sent from the USA to Europe, and (simultaneously) all writes to Y that occurred prior to this transaction are asynchronously sent from Europe to the USA. After this simple one-way communication across the continents is complete, each region then independently runs the read/modify/writes to both X and Y locally, and relies on determinism to ensure that they both are guaranteed to write the same exact values and come to the same commit/abort decision for the transaction without further communication. <br />
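The X-and-Y example above can be sketched as follows. The transaction logic and values are invented, and the real system replays an ordered input log through a deterministic execution engine rather than calling a single Python function; the sketch only shows why determinism makes further communication unnecessary.

```python
# After the one-way exchange of prior writes, both regions hold
# identical values for X and Y. Each replays the same deterministic
# transaction logic, so both reach the same writes and the same
# commit/abort decision with no further messages.
def run_transaction(x, y):
    # Hypothetical read/modify/write over both items: move 10 units
    # from X to Y, aborting if X would go negative.
    if x < 10:
        return None  # every region independently reaches the same abort
    return x - 10, y + 10

usa_result = run_transaction(x=50, y=7)     # executed in the USA
europe_result = run_transaction(x=50, y=7)  # executed in Europe
assert usa_result == europe_result == (40, 17)  # no coordination needed
```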
<br />
Obviously there are many details that I’m leaving out of this short blog post. The reader is encouraged to look at the <a href="http://www.cs.umd.edu/~abadi/papers/1154-Abadi.pdf">full paper</a> for more details. But the final performance numbers are astonishingly good. SLOG is able to get the same throughput as state-of-the-art geographically replicated systems such as Calvin (which is an order of magnitude better than Spanner) while at the same time getting <b>an order of magnitude better latency for most transactions, and equivalent latency to Paxos-based systems such as Calvin and Spanner for the remaining “multi-home” transactions</b>. <br />
<br />
An example of the latency improvement of SLOG over Paxos/Raft-based systems for a workload in which 90% of all reads initiate from the same region as where they were last written is shown in the figure below: [<i>Experimental setup is <a href="http://www.cs.umd.edu/~abadi/papers/1154-Abadi.pdf">described in the paper</a>, and is run over a multi-continental geo-distributed deployment that includes locations on the west and east coasts of the US, and Europe</i>.]<br />
<img height="465" src="https://lh3.googleusercontent.com/dWPebCkYcRlr3cf_o4VAV0sK2ZlxAbH9vLH7gEZ4VSP2xbieh3xeL_wnoqox_Ie5OXPIeZwgaiVeqdsBi61jfrb4dEfOKvpsZST0ymf1CevmsKPmQA4P6kNZgMmsJAeX3JutMMke" width="640" /><br />
This figure shows two versions of SLOG --- one version that only replicates data within the same region, and one version that also replicates to a near-by region (to be robust to entire region failure). For the version of SLOG with only intra-region replication, the figure shows that 80% of transactions complete in less than 10 milliseconds, 90% in less than 20 milliseconds, and the remaining transactions take no longer than round-trip Paxos latency across the diameter of the geographically dispersed deployment. Adding synchronous replication to near-by regions adds some latency, but only a small factor. In contrast, Paxos/Raft-based systems such as Spanner and Calvin incur orders of magnitude higher latency to complete transactions for this benchmark (which is a simplified version of Figure 12 from the <a href="http://www.cs.umd.edu/~abadi/papers/1154-Abadi.pdf">SLOG paper</a>).<br />
<br />
<div>
<h3>
Conclusion</h3>
By cheating the latency tradeoff, SLOG is able to get average latencies on the order of 10 milliseconds for <b>both reads and writes</b> for the same geographically dispersed deployments that require hundreds of milliseconds in existing strictly serializable systems available today. SLOG does this <b>without giving up</b> strict serializability, <b>without giving up</b> throughput scalability, and <b>without giving up</b> availability (aside from the negligible availability difference relative to Paxos-based systems from not being as tolerant to network partitions). In short, by improving latency by an order of magnitude without giving up any other essential feature of the system, an argument can be made that SLOG is strictly better than the other strictly serializable systems in existence today. </div>
<span style="font-size: xx-small;">[This article includes a description of work that was funded by the NSF under grant IIS-1763797. Any opinions, findings, and conclusions or recommendations expressed in this article are those of Daniel Abadi and do not necessarily reflect the views of the National Science Foundation.]</span>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]21tag:blogger.com,1999:blog-8899645800948009496.post-56401757430213734372019-08-23T09:49:00.000-07:002019-12-05T14:51:49.225-08:00An explanation of the difference between Isolation levels vs. Consistency levels<div class="separator" style="clear: both; text-align: left;">
<i>(Editor's note: This post is <a href="https://fauna.com/blog/demystifying-database-systems-part-4-isolation-levels-vs-consistency-levels">cross-posted</a> on <a href="https://fauna.com/">Fauna's </a>blog, where Daniel Abadi serves as an <a href="https://fauna.com/company">adviser</a>. Fauna is a <a href="https://fauna.com/blog/consistency-without-clocks-faunadb-transaction-protocol">serverless, cloud database system</a> that is built using the Calvin scalable distributed data store technology from Daniel Abadi's research lab.)</i></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In several recent posts, we discussed two ways to trade off correctness for performance in database systems. In particular, I wrote two posts (<a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html">first one</a> and <a href="https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html">second one</a>) on the subject of isolation levels, and one <a href="https://fauna.com/blog/demystifying-database-systems-introduction-to-consistency-levels">post</a> on the subject of consistency levels.</div>
<br />
For many years, database users did not have to simultaneously understand the concept of isolation levels and consistency levels. Either a database system provided a correctness/performance tradeoff using isolation levels, or it provided a correctness/performance tradeoff using consistency levels, but never both. This resulted in a blurring of these concepts to the point that many people --- even experts in the field --- confuse isolation levels with consistency levels and vice versa. For example, <a href="https://www.infoq.com/presentations/state-serverless-computing/">this talk</a> by a PhD student at Berkeley (incorrectly) refers to causal consistency as an “isolation level”. And <a href="http://db.csail.mit.edu/pubs/silo.pdf">this paper</a> from well-known researchers at MIT and Harvard --- including a Turing Award laureate --- (incorrectly) calls snapshot isolation and serializability “consistency” levels. I am confident that all of these well-known researchers know the difference between isolation levels and consistency levels. However, as long as isolation and consistency could not be tuned within the same system, there has been little necessity for precision in the parlance around these terms. <br />
<br />
In modern times, systems are being released that provide both a set of isolation levels and a set of consistency levels to choose from. Unfortunately, due to the historical failure in our community to be careful in our parlance, these new systems continue the legacy of blurring the distinction between isolation and consistency, which has resulted in “isolation levels” or “consistency levels” containing a mixture of isolation and consistency guarantees, and widespread confusion. Database users need more education on the difference between isolation levels and consistency levels and how they interact with each other in order to make an informed decision on how to trade off correctness for performance. This post is designed to be a resource for these types of decisions.<br />
<br />
<h3>
Background material</h3>
This post is designed to be a self-contained overview of the difference between isolation levels and consistency levels and how they interact with each other. You do not need to read my previous posts that I linked to above in order to read this post. Nonetheless, some of the definitions and fundamental concepts from those posts are important background material for understanding this post. To avoid making you read those previous posts, in this section, I will summarize the pertinent background material (but for more detail and elaboration, please see the previous posts).<br />
<br />
<b>Database isolation</b> refers to the ability of a database to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). The overarching goal is to prevent reads and writes of temporary, incomplete, aborted, or otherwise incorrect data written by concurrent transactions.<br />
<br />
If the application developer is able to ensure the correctness of their code when no other concurrent processes are running, a system that guarantees <b>perfect</b> isolation will ensure that the code remains correct even when there is other code running concurrently in the system that may read or write the same data. Thus, in order to achieve perfect isolation, the system must ensure that when transactions are running concurrently, the final state is equivalent to a state of the system that would exist if they were run serially. This perfect isolation level is called <b>serializability</b>. <br />
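As a quick illustration, here is a toy Python sketch (an illustration of the idea, not how any real database implements isolation): running each read-modify-write transaction under a single global lock trivially guarantees serializability, because transactions never actually interleave, so the final state always matches a serial execution.

```python
import threading

# A toy "database": one record guarded by a single global lock. Holding the
# lock for an entire transaction trivially guarantees serializability,
# because transactions never actually interleave.
db = {"x": 0}
txn_lock = threading.Lock()

def increment_x():
    # The whole read-modify-write runs as one isolated unit.
    with txn_lock:
        current = db["x"]       # read
        db["x"] = current + 1   # write back

threads = [threading.Thread(target=increment_x) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Equivalent to running the two transactions serially: 0 -> 1 -> 2.
print(db["x"])  # 2
```

Real systems achieve the same equivalence-to-serial guarantee with far more concurrency (locking at finer granularity, multiversioning, deterministic scheduling, and so on), but the correctness criterion is exactly this one.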
<br />
One of our previous posts <a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html">discussed some famous anomalies</a> in application code that can occur at levels of isolation below serializability, and also some <a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html">widely-used reduced isolation levels</a>, of which the most notable are snapshot isolation and read committed. A detailed understanding of these anomalies and reduced isolation levels is not critical in order to understand the material we discuss below in this post.<br />
<br />
<b>Database consistency</b> is defined in <a href="https://dbmsmusings.blogspot.com/2019/07/overview-of-consistency-levels-in.html">different ways depending on the context</a>, but when a modern system offers multiple <b>consistency levels</b>, they define consistency in terms of the client view of the database. If two clients can see different states at the same point in time, we say that their view of the database is <b>inconsistent</b> (or, more euphemistically, operating at a “<b>reduced consistency level</b>”). Even if they see the same state, but that state does not reflect writes that are known to have committed previously, their view is inconsistent with the known state of the database. <br />
<br />
We explained <a href="https://dbmsmusings.blogspot.com/2019/07/overview-of-consistency-levels-in.html">in our previous post</a> that the notion of a <b>consistency level</b> originated with researchers of shared-memory multi-processor systems. The goal of the early work was to reason about how and when different threads of execution, which may be running concurrently and accessing overlapping sets of data in shared memory, should see the writes performed by each other. We gave several examples of consistency levels using schedule figures similar to the one below, in which different threads of execution write and read data as time moves forward from left to right. In the figure below, we have four threads of execution: P1, P2, P3, and P4. P1 writes the value 5 to x, and thread P2 writes the value 10 to y after P1’s write to x completes. P3 and P4 read the new value (10) for y during overlapping periods of time. P3 and P4 then read x during overlapping periods of time. P3 starts slightly earlier and sees the old value, while P4 starts (and completes) later and sees the new value. <br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-vdyRqXhQ96LU27h5q9qz2cXe9780YLNq46Ls-25OcenPuNbTFh6zKSDWV9ttGZbTx266_UK6idc_Hamc46dMC7dKayXqpiT7oFuRvVW_r3BHxQ7YINh46QTBrDoRyUrUIjFIBfZrUyU/s1600/Slide5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-vdyRqXhQ96LU27h5q9qz2cXe9780YLNq46Ls-25OcenPuNbTFh6zKSDWV9ttGZbTx266_UK6idc_Hamc46dMC7dKayXqpiT7oFuRvVW_r3BHxQ7YINh46QTBrDoRyUrUIjFIBfZrUyU/s640/Slide5.PNG" width="640" /></a><br />
To some degree, our example schedule is "consistent": although P3 and P4 read different values for x, since P4 initiated its read after P3, they can believe that the official value of x in the system did not change until a point in time in between these two read requests. P3 saw the new value of y (10) before reading the old value of x (0). Therefore, it believes that the write to y happened before the write to x. No thread observes a write ordering different from the write to y occurring before the write to x (P1 and P2 do not perform any reads, so they do not “see” anything, and P4 sees the new values of both x and y, and therefore did not see anything to contradict P3’s observation that the write to y happened before the write to x). Therefore, all threads can agree on a consistent, global sequential ordering of writes. This level of consistency is called <b>sequential consistency.</b><br />
<b><br /></b>
Sequential consistency is a very high level of consistency, but it is not perfect. The reason why it is not perfect is that the globally observed write order contradicts reality. In reality, the write to x happens before the write to y. Furthermore, P3 sees a stale value for x: it reads the old value of x (0) long after the write of 5 to x has completed. This return of the non-current version of x is another example of contradicting reality. Perfect consistency (in this post, we are going to call <b>linearizability</b> “perfect” even though <a href="https://dbmsmusings.blogspot.com/2019/07/overview-of-consistency-levels-in.html"><b>strict consistency</b> has slightly stronger guarantees</a>) ensures both: that every thread of execution observes the same write ordering AND that write ordering matches reality (if write A completes before write B starts, no thread will see the write to B happening before the write to A). This guarantee is also true for reads: if a write to A completes before a read of A begins, then the write of A will be visible to the subsequent read of A (assuming A was not overwritten by a different write request in the interim). <br />
<br />
When expanding the traditional consistency diagrams such as the one from our example to a transactional model, we annotate each read and write request with the transaction identifier of the transaction that initiated it. If each thread of execution can only process one transaction at a time, and transactions cannot be processed by more than one thread of execution, then the traditional consistency diagrams need only be supplemented with rectangular boundaries indicating the start and end point of each transaction within a thread of execution, as shown in the figure below.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQpVAXZgp6vgtn4NLXh1C41xmZEcAUu9wrpn1NZZeNFSmYiYSXx40AbKAMZqVhzr5KhYfh3zGswSCCz2VycKiu85cr0Y5OViMhyj0sBAGQbywDj08W9OomW9f0s0vbe0UIB_AdddqB6Hc/s1600/Slide10.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQpVAXZgp6vgtn4NLXh1C41xmZEcAUu9wrpn1NZZeNFSmYiYSXx40AbKAMZqVhzr5KhYfh3zGswSCCz2VycKiu85cr0Y5OViMhyj0sBAGQbywDj08W9OomW9f0s0vbe0UIB_AdddqB6Hc/s640/Slide10.PNG" width="640" /></a><br />
Once you include transactions in these consistency diagrams, the concept of database isolation must be considered (isolation is the I of ACID within the ACID set of guarantees for transactions). Depending on the isolation level, a write within a transaction may not become visible until the entire transaction is complete. Isolation guarantees thus place limitations on how and when writes become visible to threads of execution that read database state. Yet, consistency guarantees also specify how and when writes become visible! This conceptual overlap of isolation and consistency guarantees is the source of much confusion. <br />
<br />
<h3>
Ignorance is dangerous! You need to understand the difference between isolation guarantees and consistency guarantees. And you need to know what you are getting for both types of guarantees.</h3>
Let’s look at an example. The diagram below shows a system running two transactions in different threads of execution. Each transaction reads the current value of X, adds one to it, and writes it back out. In any serializable execution, the final value of X after running both transactions should be two higher than the initial value. But in this example, we have a system running at perfect, linearizable consistency, but the final value of X is only one higher than the initial value. Clearly, this is a correctness bug.<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLnGZUAYx-PefeEn4pi-_ntLmMPAgwAWbMRwbckzU1A6YvcPQBlAWCe7CuWxBdIPeRV5BvTSpV36AvkizWqxGnNY70dmsJ317WLOxGLjWdnjmoqGGTVlpW0F5yQ35LjTYVmNHp2bjaomM/s1600/Slide11New.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="423" data-original-width="1280" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLnGZUAYx-PefeEn4pi-_ntLmMPAgwAWbMRwbckzU1A6YvcPQBlAWCe7CuWxBdIPeRV5BvTSpV36AvkizWqxGnNY70dmsJ317WLOxGLjWdnjmoqGGTVlpW0F5yQ35LjTYVmNHp2bjaomM/s640/Slide11New.jpg" width="640" /></a><br />
The basic problem is that historically, as we described above, consistency levels are only designed for single-operation actions. They generally do not model the possibility of a group of actions that commit or abort atomically. Therefore, they do not protect against the case where writes are performed only temporarily, but ultimately get rolled back if a transaction aborts. Furthermore, without explicit synchronization constructs, consistency levels do not protect against reads (on which subsequent writes in that same transaction depend) becoming immediately stale as a result of a concurrent write by a different process. (This latter problem is the problem that was shown in Figure 3).<br />
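To make this concrete, here is a toy Python sketch (an illustration, not any particular system): every individual read and write below is atomic --- the register behaves like a perfectly linearizable one --- yet a fixed interleaving of two read-modify-write transactions still loses an update.

```python
# Every individual read and write below is atomic (think of x as a
# perfectly linearizable register), yet interleaving two read-modify-write
# transactions still loses an update. Consistency guarantees say nothing
# about groups of operations.
x = {"value": 0}

def read_x():
    return x["value"]     # a single, atomic read

def write_x(v):
    x["value"] = v        # a single, atomic write

# A fixed, "bad" interleaving of transactions T1 and T2:
t1_read = read_x()        # T1 reads x = 0
t2_read = read_x()        # T2 reads x = 0 (T1 has not written yet)
write_x(t1_read + 1)      # T1 writes x = 1
write_x(t2_read + 1)      # T2 also writes x = 1, clobbering T1's update

print(x["value"])  # 1, not the 2 that any serial execution would produce
```

No consistency level forbids this interleaving: every read returned the most recent completed write. Only an isolation guarantee over the transaction as a group rules it out.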
<br />
The bottom line is that as soon as you have the concept of a transaction --- a group of read and write operations --- you need to have rules for what happens during the timeline between the first of the operations of the group and the last of the operations of the group. What operations by other threads of execution are allowed to occur during this time period between the first and last operation, and which operations are not allowed? This set of rules is defined by the isolation levels that <a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html">I discussed previously</a>. For example, serializable isolation states that the only concurrent operations that are allowed to occur are those which will not cause a visible change in database state relative to what the state would have been if there were no concurrent operations.<br />
<br />
The bottom line: as soon as you allow transactions, you need isolation guarantees. <br />
<br />
What about vice versa? Once you have isolation guarantees, do you need consistency guarantees?<br />
<br />
Let’s look at another example. The diagram below shows a system running three transactions. In one thread of execution, a transaction runs (call it T1) that adds one to X (it was originally 4 and now it is 5). In the other thread of execution, a transaction runs (call it T2) that writes the current value of X to Y (5). After the transaction returns, a new transaction runs (call it T3) that reads the value of X and sees the old value (4). The result is entirely serializable (the equivalent serial order is T3 then T1 then T2). But it violates strict serializability and linearizability because it reorders T3 before T2 even though T3 started in real time after T2 completed. Furthermore, it violates sequential consistency because T2 and T3 are run in the same thread of execution, and under sequential consistency, the later transaction, T3, is not allowed to see an earlier value of X than the earlier transaction, T2. This could result in a bug in the application: many applications are unable to handle the phenomenon of database state going backwards in time across transactions --- and especially not within the same session. <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigTYnV0yge1U4BMvShlM3_ivWDq9goIEo1Nbseo2ZLhjmXpIT6BmHhsiwwwYJAqUhV0chbQxyRCnEeRgHWFQXfgEQgoIPBhpq_-JdQbZy8MVXwnsvGW78aah3pU97PTld3meEzs-stHkw/s1600/Slide12New.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="535" data-original-width="1280" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigTYnV0yge1U4BMvShlM3_ivWDq9goIEo1Nbseo2ZLhjmXpIT6BmHhsiwwwYJAqUhV0chbQxyRCnEeRgHWFQXfgEQgoIPBhpq_-JdQbZy8MVXwnsvGW78aah3pU97PTld3meEzs-stHkw/s640/Slide12New.jpg" width="640" /></a></div>
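One way this schedule can arise in practice is with a lagging replica. The toy Python sketch below (hypothetical request routing, not any real system's behavior) reproduces the schedule: T2 is served by the primary, while the later transaction T3 is routed to a replica that has not yet applied T1's write.

```python
# A primary and a lagging replica. Every transaction is serializable
# (equivalent serial order: T3, then T1, then T2), but T3 --- which starts
# after T2 returns --- sees database state "go back in time".
primary = {"x": 4, "y": 0}
replica = {"x": 4}            # stale: has not yet applied T1's write

# T1: add one to x, on the primary.
primary["x"] += 1             # x is now 5 on the primary

# T2: copy the current value of x into y, served by the primary.
t2_read = primary["x"]        # sees 5
primary["y"] = t2_read

# T3: runs after T2 completes, but is routed to the stale replica.
t3_read = replica["x"]        # sees 4 --- an older x than T2 observed

print(t2_read, t3_read)  # 5 4
```

Nothing here violates serializability; the anomaly is purely a consistency (ordering of non-concurrent transactions) problem.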
<br />
Guarantees of isolation without any guarantees of consistency are not particularly helpful. As an extreme example: assume that a database started empty and then grows over time during the lifetime of an application. A system that guarantees perfect (serializable) isolation without any consistency guarantees could theoretically return the empty set (no results) for every single read-only transaction without violating its serializability guarantee. The serial order that the system is equivalent to is simply a serial order where the read-only transactions happen before the first write transaction. In essence, the serializability guarantee allows transactions to “go back in time” --- as long as the final result is equivalent to some serial order, the guarantee is upheld. Without some kind of consistency guarantee, there are no restrictions on which serial order it needs to be equivalent to. <br />
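To see how extreme this can get, consider a toy multiversion store in Python (a sketch for illustration only): a read-only transaction pinned to the initial, empty version returns no results, yet the execution remains serializable, because it is equivalent to a serial order in which that transaction ran before any write.

```python
# A toy multiversion store. versions[0] is the empty initial database;
# each write transaction appends a new version.
versions = [{}]

def write_txn(updates):
    new_state = dict(versions[-1])
    new_state.update(updates)
    versions.append(new_state)

def read_only_txn(version=0):
    # A (pathological but serializable) reader pinned to an old snapshot.
    return versions[version]

write_txn({"x": 5})
write_txn({"y": 10})

# The read-only transaction "ran first" in the equivalent serial order,
# so returning the empty state violates no serializability guarantee.
print(read_only_txn())    # {}
print(read_only_txn(2))   # {'x': 5, 'y': 10}
```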
<br />
In general, all of the time travel bugs that I discussed in <a href="https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html">a previous post</a> in the context of serializability are possible at any isolation level. <b>Isolation levels only specify what happens for concurrent transactions</b>. If two transactions are not running at the same time, no isolation guarantee in the world will specify how they should be processed. They are already perfectly isolated from each other by virtue of not running concurrently with each other! This is what makes time travel correctness bugs possible at even the highest levels of isolation. If you want guarantees for nonconcurrent transactions --- for example, to be sure that a later transaction reads the writes of an earlier transaction --- <b>you need consistency guarantees</b>. <br />
<br />
Many, but not all, consistency levels make real-time guarantees for nonconcurrent operations. For example, strict consistency and linearizable/atomic consistency both ensure that all nonconcurrent operations are ordered by real time. Even lower consistency levels, such as sequential consistency and causal consistency, make real-time guarantees for operations within the same thread of execution. <br />
<br />
So the bottom line is the following: once you have transactions with more than one operation, you need isolation guarantees. And if you need any kind of guarantee for non-concurrent transactions, you need consistency guarantees. Most of the time, if you are using a database system, you need both. <br />
<br />
Therefore, as a database user, you need to figure out what you need in terms of isolation guarantees, and you also need to figure out what you need in terms of consistency guarantees. And if that wasn’t hard enough, you then need to figure out how to map what you think you need to the various options that your system gives you. This is where it gets really tricky, because many systems don’t view isolation and consistency as separate guarantees that can be intermixed arbitrarily. Rather, they only allow certain combinations of isolation guarantees and consistency guarantees, and give each combination a name. To make matters worse, the common industry practice is to call each name for a potential combination an “isolation level”, even though it is really a combination of an isolation level and a consistency level. We will give some examples in the next section.<br />
<br />
<h3>
“Isolation levels” that are really a mix of isolation and consistency guarantees</h3>
For my loyal readers that read everything I write, you might have been confused by my two posts on isolation guarantees when read together as a unit. In the <a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html">earlier of these two posts</a>, I claimed that serializability is the highest possible isolation level. Then, in the <a href="https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html">next post</a>, I described a whole bunch of isolation levels that were seemingly “higher” or “more correct” --- “one-copy serializability”, “strong session serializability”, “strong write serializability”, “strong partition serializability”, and “strict serializability”. <br />
<br />
But if you have understood what I’ve written so far in this post, the answer to why this is not a contradiction should now be obvious. Serializability is indeed the highest possible isolation level. All variations of serializability that we discussed --- one-copy serializability, strong session serializability, strong write serializability, strong partition serializability, and strict serializability --- have identical isolation guarantees: they all guarantee serializability. <b>The only difference between these so-called isolation levels is their consistency guarantees. </b><br />
<br />
One-copy serializability guarantees sequential consistency where no thread of execution can process more than one transaction. Strong session serializability guarantees sequential consistency where each session corresponds to a separate thread of execution. Strong write serializability corresponds to a linearizability guarantee for write transactions, but only sequential consistency for read-only transactions. Strong partition serializability guarantees linearizable consistency within a partition, but only sequential consistency across partitions. And strict serializability guarantees linearizable consistency at all times for all reads and writes (across non-concurrent transactions). <br />
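The decomposition described above can be restated as data --- the mapping below is just the paragraph above written down explicitly (the level names come from the literature; the short consistency descriptions are informal shorthand):

```python
# Each "isolation level" name from the paragraph above, decomposed into
# its actual isolation guarantee and its actual consistency guarantee.
serializability_variants = {
    "one-copy serializability":
        ("serializable", "sequential; one transaction per thread"),
    "strong session serializability":
        ("serializable", "sequential; one thread per session"),
    "strong write serializability":
        ("serializable", "linearizable writes, sequential reads"),
    "strong partition serializability":
        ("serializable", "linearizable within a partition, sequential across"),
    "strict serializability":
        ("serializable", "linearizable for all reads and writes"),
}

# The isolation guarantee is identical in every case:
print(all(iso == "serializable"
          for iso, _ in serializability_variants.values()))  # True
```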
<br />
The same method can be used to understand other so-called “isolation levels”. For example, the isolation levels of WEAK SNAPSHOT ISOLATION, STRONG SNAPSHOT ISOLATION, and STRONG SESSION SNAPSHOT ISOLATION are overviewed <a href="http://www.vldb.org/conf/2006/p715-daudjee.pdf">in this paper</a>. All of these “isolation levels” make equivalent isolation guarantees: all guarantee snapshot isolation and are thus susceptible to write skew anomalies. The only difference is that STRONG SNAPSHOT ISOLATION guarantees linearizable consistency (if timestamps are based on real time) alongside SNAPSHOT ISOLATION’s imperfect isolation guarantee, while the other levels pair lower levels of consistency with snapshot isolation.<br />
<br />
<h3>
Conclusion</h3>
Database system vendors will continue to use single terms to describe a particular mixture of isolation levels and consistency levels. My recommendation is to avoid getting confused by these loaded definitions and break them apart before starting to reason about them. If you see an “isolation level” with three or more words, chances are, it is really an isolation level combined with a consistency level. Break apart the term into the component isolation and consistency guarantees, and then carefully consider each guarantee separately. What does your application require as far as isolation of concurrently running transactions? Does the isolation guarantee that the isolation level is giving you match your requirements? Furthermore, what does your application require as far as ordering of non-concurrent transactions? Does the consistency guarantee that you extracted from the complex term match your requirements?<br />
<br />
As a separate point, it is worth measuring the difference in performance you get between high isolation and consistency levels and lower ones. The quality of the architecture of the system can have a significant effect on the amount of performance drop for high isolation and consistency levels. Poorly designed systems will push you into choosing lower isolation and consistency levels by virtue of a dramatic performance drop if you choose higher levels. All systems will have some performance drop, but well-designed systems will observe a much less dramatic drop. Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]0tag:blogger.com,1999:blog-8899645800948009496.post-91488554257172648852019-07-25T11:29:00.000-07:002019-12-05T14:52:09.176-08:00Overview of Consistency Levels in Database SystemsDatabase systems typically give users the ability to trade off correctness for performance. We have spent the previous <a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html">two</a> <a href="http://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html">posts</a> in this series discussing one particular way to trade off correctness for performance: database isolation levels. In distributed systems, there is a whole other category for trading off correctness for performance: consistency levels. There are an increasing number of distributed database systems that are giving their users multiple different consistency levels to choose from, thereby allowing the user to specify which consistency guarantees are needed from the system for a particular application. Similar to isolation levels --- weaker consistency levels typically come with better performance, and thus come with the same types of temptations as reduced isolation levels.<br />
<br />
In this post, we give a short tutorial on consistency levels --- explaining what they do and how they work. Much of the existing literature and online discussion of consistency levels is done in the context of multi-processor or distributed systems that operate on a single data item at a time, without the concept of a “transaction”. In this post, we give some direction for how to think about consistency levels in the context of ACID-compliant database systems.<br />
<br />
<br />
<h3>
What is a consistency level?</h3>
<div>
<br /></div>
The definition of <b>consistency </b>is fundamentally dependent on context. In general, consistency refers to the ability of a system to ensure that it complies (without fail) with a predefined set of rules. However, this set of rules changes based on context. For example, the C of ACID and the C of CAP both refer to consistency. However, the sets of rules implied by these two contexts are totally different. In ACID, the rules refer to application-defined semantics. A system that guarantees the C of ACID ensures that processing a transaction does not violate referential integrity constraints, foreign-key constraints, and any other application-specific constraints (such as: “every user must have a name”). In contrast, the consistency C of CAP refers to the rules related to making a concurrent, distributed system appear like a single-threaded, centralized system. Reads at a particular point in time have only one possible result: they must reflect the most recent completed write (in real time) of that data item, no matter which server processed that write. <br />
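In code, the C of ACID boils down to commit-time checks of application-defined rules. Here is a minimal Python sketch (a hypothetical commit path, not any real database's API) for the “every user must have a name” constraint:

```python
# A toy commit path enforcing an application-defined ACID-C constraint
# ("every user must have a name"). Hypothetical sketch, not a real API.
def commit_user(db, user):
    if not user.get("name"):
        # The transaction is rejected rather than violating the constraint.
        raise ValueError("constraint violated: every user must have a name")
    db.append(user)

users = []
commit_user(users, {"id": 1, "name": "Alice"})   # commits
try:
    commit_user(users, {"id": 2})                # no name: rejected
except ValueError:
    pass

print(len(users))  # 1
```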
<br />
One point of confusion that we can eliminate at this point is that the phrase “consistency level” is not typically used in the context of ACID consistency. This is because the C of ACID is almost entirely the responsibility of the application developer --- only the developer can ensure that the code they place inside a transaction does not violate application semantics when it is run in isolation. ACID is really a misnomer --- really it should be AID, since only those three (atomicity, isolation, and durability) are in the realm of system guarantees. [Joe Hellerstein claims that he was taught that the C in ACID was only included to make an LSD pun, but acknowledges that this may be apocryphal.]<br />
<br />
When we talk about <b>consistency levels</b>, we’re really referring to the C of CAP. In this context, perfect consistency --- usually referred to as “strict consistency” --- would imply that the system ensures that all reads reflect all previous writes --- no matter where those writes were performed. Any consistency level below “perfect” consistency enables situations to occur where a read does not return the most recent write of a data item. [Side note: the C of CAP in the <a href="https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf">original CAP paper</a> refers to something called “atomic consistency”, which is slightly weaker than strict consistency but still considered “perfect” for practical purposes. We’ll discuss the difference later in this post.]<br />
<br />
Depending on how a particular system is architected, perfect consistency becomes easier or harder to achieve. In poorly designed systems, achieving perfection comes with a prohibitive performance and availability cost, and users of such systems are pushed to accept guarantees significantly short of perfection. However, even in well designed systems, there is often a non-trivial performance benefit achieved by accepting guarantees short of perfection.<br />
<div>
<br />
<br />
<div>
<h3>
An overview of well-known consistency levels</h3>
<div>
<br /></div>
The notion of a <i>consistency level</i> originated with researchers of shared-memory multi-processor systems. The goal of the early work was to reason about how and when different threads of execution, which may be running concurrently and accessing overlapping sets of data in shared memory, should see the writes performed by each other. As such, most of the initial work focused on reads and writes of individual data items, rather than at the level of groups of reads and writes within a transaction. We will begin our discussion using the same terminology as the original research and models, and then proceed to discuss how to apply these ideas to transactions.<br />
<br />
<img height="360" src="https://lh6.googleusercontent.com/5cnrzdg1rvfkkU6vQP4FZUNCNBEyvvSk9s0YRwOdy6QBO7S1gTvoWXy1v_R-AKx6O76vxMjJz-ZYZ0M-yGEVoWrM7SF0U_5_35CiQP3HyThopnmxtdYo-iS4fx6rd8jHzMMFYR84" width="640" /><br />
In <b>sequential consistency</b>, all writes --- no matter which thread made the write, and no matter which data item was written to --- are globally ordered. Every single thread of execution must see the writes occurring in this order. For example, if one thread saw data X being updated to 5, and then later saw Y being updated to 10, every single thread must see the update of X happening before the update of Y. If any thread sees the new value of Y but the old value of X, sequential consistency would be violated. This example is shown in Figure 1. In this figure, time gets later as you move to the right, and there are 4 threads of execution: P1, P2, P3, and P4. Every thread (that reads X and Y) sees the update of X from 0 to 5 happening before the update of Y from 0 to 10. Threads P1 and P2 write X and Y, respectively, but do not read either one. Thread P3 sees the new value of X and subsequently sees the old value of Y. This is only possible if the update to X happened before the update to Y. Thread P4 only sees the new values of X and Y, so it does not see which one happened first. Thus, all threads agree that it is possible that the update of X happened before the update to Y. Contrast this to Figure 2, below, in which P3 and P4 see clearly different orders of the updates to X and Y --- P3 sees the new value of X (5) and subsequently the old value of Y (0), while P4 sees the new value of Y (10) and subsequently the old value of X (0). Figure 2 is thus not a sequentially consistent schedule.<br />
<br />
<img height="360" src="https://lh4.googleusercontent.com/TtkHwC9miGLxPDABC8833lD08SUU6-pV3-NfFvV6-2GYk7CGPb2b4SqbdrT97voMnFY4gCUqfAerhfuvOJFQ4QkvoRyAN6zuya29a_0oIJVJJZ6puPEI3tW7n9nVFgJNzZj0bp2L" width="640" /><br />
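The agreement requirement illustrated by Figures 1 and 2 can be made concrete with a tiny sketch (my own illustrative Python, deliberately simplified to a single pair of writes; the function and names here are invented for this post):

```python
# Illustrative sketch (not from any real system): each reader reports the
# order it deduced for two writes, or None if it could not tell. Sequential
# consistency requires all readers that deduced an order to agree on it.

def sequentially_consistent(observations):
    seen = {o for o in observations if o is not None}
    return len(seen) <= 1

# Figure 1: P3 deduces X-before-Y; P4 cannot tell which came first.
print(sequentially_consistent([("X", "Y"), None]))        # True
# Figure 2: P3 deduces X-before-Y, but P4 deduces Y-before-X.
print(sequentially_consistent([("X", "Y"), ("Y", "X")]))  # False
```

A real checker would have to consider all writes to all data items, but the core idea is the same: there must exist one global order that every thread's observations are compatible with.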
In general, sequential consistency does not place any requirements on how to order the writes. In our example, the write to X happened in real time before the write to Y. Nonetheless, as long as every thread agrees to see the write to Y as happening before the write to X, sequential consistency allows the official history to be different from what occurred according to real time (the only restriction is that writes and reads originating from the same thread of execution cannot be reordered). See Figure 3 for an example of this. <br />
<br />
<img height="360" src="https://lh4.googleusercontent.com/8oMybla_KNq15aVhgTaAhXfPI_lyjyCDtZ_vWp_6mFrAN0JVZD-W02BrzZ5Fl-nxe6KWjAD74VYJrc69X5BjAVECBt5Sj3Gd8u4ULTRaa5MYppKgjynS327fNCpT43avNzOQNf-y" width="640" /><br />
In contrast to sequential consistency, <b>strict consistency</b> <u>does </u>place real time requirements on how to order the writes. It assumes that it is always possible to know what time it currently is with zero error --- i.e. that every thread of execution agrees on the precise current time. The order of the writes in the sequential order must be equal to the real time order in which these writes were issued. Furthermore, every read operation must return the value of the most recent write in real time --- no matter which thread of execution initiated that write. In a distributed system (and even in multi-processor single-server systems), it is impossible in practice to achieve global agreement on the precise current time, which renders strict consistency mostly of theoretical interest. <br />
<br />
None of Figures 1, 2, or 3 above satisfy strict consistency because they all contain either a read of X=0 or a read of Y=0 after the value of X or Y has been written to a new value. However, Figure 4 below satisfies strict consistency since all reads reflect the most recent write in real time:<br />
<br />
<img height="360" src="https://lh5.googleusercontent.com/DheOuflv94Ckm43QI7WBqbhnYKBz9U9d2HVTH4ogK9ZPwq38TlAA6bdAWH0MNdxn3gdxY_WAGZDVOMHijGWnBhfKJutSOTSZJCEDLiLv8_XcVLEu6CSc6Qm8KFI9VsUDIyBlLekI" width="640" /><br />
In a distributed/replicated system, where writes and reads can originate anywhere, the highest level of consistency obtained in practice is <b>linearizability </b>(also known as “atomic consistency,” which is what it is called in the CAP theorem). Linearizability is very similar to strict consistency: both are extensions of sequential consistency that impose real time constraints on writes. The difference is that the linearizability model acknowledges that a period of time elapses between when an operation is submitted to the system and when the system responds with an acknowledgement that it was completed. In a distributed system, the sending of the write request to the correct location(s) --- which may include replication --- can occur during this time period. A linearizability guarantee does not place any ordering constraints on operations that occur with overlapping start and end times. The only ordering constraint is for operations that do not overlap in time --- only in those cases must the earlier operation be ordered before the later one. <br />
<br />
<img height="360" src="https://lh6.googleusercontent.com/CoG4z0qbkcRt1VQ3s7vDewbOZqMIE6eHYB1KJzaT-MBe_aKs4XK0UsHnofNzTaHLQzMyeg5iL_K2APblbwD3HthiOu0LxANV2TQ22MmsTstuGsBKdOj2UqOdZAYynMKkJlsFaYbV" width="640" /><br />
Figure 5 above shows an example of a schedule that is linearizable, but not strictly consistent. It is not strictly consistent since the read of X by P3 is initiated (and returns) slightly after the write of X by P1 is initiated, but still sees the old value. Nonetheless, it is linearizable because this read of X by P3 and write of X by P1 overlap in time, and therefore linearizability does not require the read of X by P3 to see the result of the write of X by P1.<br />
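The real-time rule of linearizability can be stated as a small checker (my own illustrative sketch; the operation names and start/end times below are hypothetical, not taken from the figure):

```python
# A proposed total order of operations is linearizable with respect to real
# time only if it never reverses two operations that do NOT overlap.

def linearizable_order(ops, order):
    """ops: name -> (start_time, end_time); order: proposed total order.
    If operation A completes before operation B begins, A must precede B.
    Overlapping operations may be placed in either order."""
    pos = {name: i for i, name in enumerate(order)}
    return all(pos[a] < pos[b]
               for a in ops for b in ops
               if a != b and ops[a][1] < ops[b][0])

# Figure 5's key point: the write of X and the read of X overlap in time,
# so the read may legally be ordered before the write (and see the old value).
overlapping = {"write_x": (1.0, 3.0), "read_x": (2.0, 3.5)}
print(linearizable_order(overlapping, ["read_x", "write_x"]))  # True

# Had the read started only after the write completed, that order would
# violate linearizability:
disjoint = {"write_x": (1.0, 3.0), "read_x": (3.5, 4.0)}
print(linearizable_order(disjoint, ["read_x", "write_x"]))     # False
```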
<br />
While linearizability and strict consistency are stronger than sequential consistency, sequential consistency is by itself a very high level of consistency, and there exist many consistency levels below it.<br />
<br />
<b>Causal consistency</b> is a popular and useful consistency level that sits slightly below sequential consistency. In sequential consistency, all writes must be globally ordered --- even if they are totally unrelated to each other. Causal consistency does not enforce orderings of unrelated writes. However, if a thread of execution performs a read of some data item (call it X) and then writes that data item or a different one (call it Y), the system assumes that the subsequent write may have been caused by the read. Therefore, it enforces the order of the writes to X and Y --- specifically, all threads of execution must observe the write of Y after the write of X. <br />
<br />
<img height="360" src="https://lh4.googleusercontent.com/DsAerW3qMhsrbwZFX4GkXS3vLMLYgZ2jR9EkXglhf5pTL2KISEG9HIgVrHp5959JFOJ8wPjxnL6BlqvR-Z2gtILCV-gt-7kOfJcnbGQYMpofEhhqCBkt-r15Ll-5wLYQIIecZ_sN" width="640" /><br />
As an example, compare Figure 6 (above) with Figure 2. In Figure 2, P3 saw the write to X happening before the write to Y, but P4 saw the write to Y happening before the write to X. This violates sequential consistency, but not causal consistency. However, in Figure 6, P2 read the write to X before performing the write to Y. That places a causal constraint between the writes to X and Y --- Y must happen after X. Therefore, when P4 sees the write to Y without the write to X, causal consistency is violated.<br />
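A standard mechanism for tracking the causal constraints described above (widely used in practice, though not discussed in this post) is a vector clock: each thread keeps one counter per thread, advances its own on a write, and adopts the maximum on a read. Here is a minimal sketch with the Figure 6 scenario as hypothetical clock values:

```python
# Vector-clock comparison: write A causally precedes write B iff every
# component of A's clock is <= the corresponding component of B's clock,
# and the clocks are not identical. If neither precedes the other, the
# writes are concurrent and causal consistency imposes no order.

def happens_before(vc_a, vc_b):
    return all(a <= b for a, b in zip(vc_a, vc_b)) and vc_a != vc_b

# In Figure 6, P2 reads X=5 before writing Y, so the write to Y inherits
# (and advances) P1's clock: the write to X happens-before the write to Y.
write_x = (1, 0)   # P1's clock after writing X (illustrative values)
write_y = (1, 1)   # P2's clock after reading X and then writing Y
print(happens_before(write_x, write_y))  # True -> Y must be seen after X

# Two writes made with no reads in between are concurrent: no order enforced.
print(happens_before((1, 0), (0, 1)))    # False
print(happens_before((0, 1), (1, 0)))    # False
```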
<br />
<b>Eventual consistency</b> is even weaker --- even causally dependent writes may become visible out of order. For example, despite violating every other consistency guarantee that we have discussed so far, Figure 6 does not necessarily violate eventual consistency. The only guarantee in eventual consistency is that if there are no writes for a “long” period of time (where the definition of “long” is system dependent), every thread of execution will agree on the value of the last write. So as long as P4 eventually sees the new value of X (5) at some later point in time (not shown in Figure 6), then eventual consistency is maintained.</div>
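One common way to implement the "eventually agree on the last write" guarantee (not specific to any system discussed here; the class and timestamps are my own illustrative toy) is a last-writer-wins register, where each replica keeps the write with the highest timestamp and replicas exchange state in the background:

```python
class LWWRegister:
    """Toy eventually consistent register: writes may arrive at replicas in
    any order; each replica keeps the write with the highest timestamp, so
    once writes stop, all replicas converge on the last write's value."""
    def __init__(self):
        self.ts, self.val = 0, None

    def write(self, ts, val):
        if ts >= self.ts:          # stale (lower-timestamped) writes ignored
            self.ts, self.val = ts, val

    def sync(self, other):
        # Anti-entropy: exchange state in both directions.
        self.write(other.ts, other.val)
        other.write(self.ts, self.val)

r1, r2 = LWWRegister(), LWWRegister()
r1.write(1, 0); r2.write(1, 0)     # X starts at 0 on both replicas
r1.write(2, 5)                     # the write of X=5 reaches r1 first
print(r2.val)                      # 0: r2 may serve the stale value for a while
r1.sync(r2)                        # background sync after writes quiesce
print(r1.val == r2.val == 5)       # True: replicas have converged
```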
<div>
<br /></div>
<div>
<br />
<h3>
Strong consistency vs. Weak consistency</h3>
<br />
Strict consistency and linearizability/atomic consistency are typically thought of as “strong” consistency levels. In many cases, sequential consistency is also referred to as a strong consistency level. The key feature that these consistency levels share is that the state of the database goes through a universally agreed-upon sequence of state changes. This allows the realities of replication to be hidden from the end user. In essence, the user's view of the database is that there is only one copy of the database that continuously makes state transitions in a forward direction. In contrast, weaker consistency levels (such as causal consistency and eventual consistency) allow different observers to see different progressions of database state --- a clear impossibility unless there is more than one copy of the database. Thus, for weaker levels of consistency, the application developer must be explicitly aware of the replicated nature of the data in the database, thereby increasing the complexity of development relative to strong consistency. <br />
<br />
<br /></div>
<h3>
Transactions and consistency levels</h3>
<div>
<br />
As we discussed above, consistency levels have historically been defined in terms of individual reads or writes of a data item. This makes the discussion of consistency levels hard to apply to the context of database systems in which groups of read and write operations occur together in atomic transactions. Indeed, both the research literature and vendor documentation (for those vendors that offer multiple consistency levels) are extremely confusing and do not take a uniform approach when it comes to applying consistency levels in the context of database systems.<br />
<br />
I think that the simplest way to reason about consistency levels in the presence of database transactions is to make only minor adjustments to the consistency models we discussed earlier. We still view consistency as threads of execution sending read and write requests to a shared data store. The only difference is that we annotate each read and write request with the transaction identifier of the transaction that initiated it. If each thread of execution can only process one transaction at a time, and transactions cannot be processed by more than one thread of execution, then the traditional timeline consistency diagrams need only be supplemented with rectangular boundaries indicating the start and end point of each transaction within a thread of execution, as shown in the figure below.<br />
<br />
<img height="360" src="https://lh4.googleusercontent.com/EmBlKLCta7kuBHFUiOq-jeICmfNlz7JZBp_w1C-oyFcIXjmeiGSPLJMSimjAT12BOfZw32PNCrSsAM5TZ59eJj1qUy6yj-xTiJwLecvM5XAMUk231kqXyLpsoLMnoZNxiOo6GZ9U" width="640" /><br />
The presence of a transaction in a consistency diagram adds additional constraints corresponding to the A, I, and D of ACID: all of the reads and writes within the transaction succeed or fail together (atomicity), they are isolated from other concurrently running transactions (the degree of isolation will depend on the isolation level), and writes of committed transactions will survive all kinds of system failure (durability).<br />
<br />
The atomicity and durability guarantees of transactions are pretty easy to cognitively separate from consistency guarantees, because they deal with fundamentally different concepts. The problem is isolation. Consistency guarantees specify how and when writes are visible to threads of execution that read database state. Isolation guarantees also place limitations on when writes become visible. This conceptual overlap of isolation and consistency guarantees is the source of much confusion. In the next post of this series I plan to give a tutorial for understanding the difference between isolation levels and consistency levels.</div>
</div>
Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]2tag:blogger.com,1999:blog-8899645800948009496.post-65219598949597865332019-07-15T09:37:00.000-07:002019-12-05T14:54:31.237-08:00The dangers of conditional consistency guaranteesThe ease of writing an application on top of infrastructure that guarantees consistency <a href="http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html">cannot be overstated</a>. It’s just so much easier to write an application when you can be sure that every read will return the most recent state of the data --- no matter where in the world writes to data state occur, and no matter what types of failure may occur. <br />
<br />
Unfortunately, consistency in distributed systems comes at a cost. Other desirable system properties may be traded off in order to uphold a consistency guarantee. For example, it is well known (see the CAP theorem) that <a href="http://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf">in the event of a network partition</a>, it is impossible to guarantee both consistency and availability. One must be traded off for the other. Furthermore, in a system deployed across a wide geography, the laws of physics prevent the communication required to maintain consistency from happening instantaneously. This leads to the <a href="http://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf">PACELC tradeoff</a>: when there is a network partition (the P in PACELC), the system must decide how it will tradeoff availability (A) for consistency (C). Else (E) --- during normal operation of the system --- the system must decide how it will tradeoff latency (L) for consistency (C).<br />
<br />
There are four points in the PACELC tradeoff space: (1) PA/EC (when there is a partition, choose availability; else, choose consistency), (2) PC/EC (choose consistency at all times), (3) PA/EL (prioritize availability and latency over consistency), and (4) PC/EL (when there is a partition, choose consistency; else, choose latency). <br />
<br />
If you were to ask a student who just learned about the CAP theorem in a distributed systems class what guarantees they would want from a distributed system deployed for an application that needs 99.999% availability, the student would likely answer: “That’s easy! I would choose a system that prioritizes availability over consistency in the event of a network partition. But in the absence of a network partition, it is possible for a system to guarantee both availability and consistency, so I would want both those guarantees in the system I would deploy.”<br />
<br />
This answer is natural and obvious. But at the same time, it is naive and incorrect. <br />
<br />
The system that the student described would be a PA/EC system from PACELC --- a system that does not guarantee consistency during a network partition, but otherwise guarantees consistency. There are many applications that need several “nines” of availability. If the right PACELC system for those applications was PA/EC, then there would be many vendors competing with each other, attempting to sell the best PA/EC system. Yet in practice, the opposite is true --- it is extremely rare to find PA/EC systems in the wild. In fact, it took me <a href="http://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html">several years until I came across the IMDG (in-memory data grid) market</a> before I could find examples of real PA/EC systems. <br />
<br />
The reason for this is that PA/EC systems are much less useful than one might initially expect. To understand why this is the case, one must understand the fundamental difference between a real guarantee and a conditional guarantee.<br />
<br />
<br />
<h3>
Real guarantees vs. conditional guarantees</h3>
<br />
Of the three properties that we’ve mentioned thus far as being desirable in a distributed system --- availability, consistency, and latency --- only one of them can ever be a real guarantee: consistency. No system will ever guarantee 100% availability, nor will any system guarantee that 100% of requests will complete within a particular latency bound. Such guarantees would be impossible to uphold. A “PA” system in PACELC prioritizes availability, but cannot guarantee 100% availability. Similarly, an “EL” system in PACELC prioritizes latency, but does not make any particular 100% guarantees. <br />
<br />
In contrast, consistency is a real guarantee. Systems that guarantee consistency are actually capable of upholding the promise of 100% consistency, and many systems do in fact provide it.<br />
<br />
Real guarantees form the foundation of the success of modern computing. Computers are extremely complex systems. Even a simple program must involve a large number of complex components --- from registers and logic gates to cache and memory to operating systems and security all the way up to the run time environment of the program. A simple “hello world” program ought to take weeks to develop. The reason why it doesn’t is because of the power of “abstraction” in computer science. Each level of a computer system builds on top of an underlying level that abstracts away the complex details of its implementation. Instead, the underlying level presents an interface to the higher level, and inviolable guarantees about what is returned through this interface. <br />
<br />
For example, when a program reads from a particular place in memory, it can assume that it will see the data that was last written there. In reality, the processor inside a computer may choose to execute instructions within a program out of order. Furthermore, electric or magnetic interference may cause bits to spontaneously flip in memory. In practice, complicated solutions are used to detect and correct such bit corruption, including using redundant memory bits, additional circuitry, and error correcting codes. All of this is hidden to the higher level program that wants to read this location in memory. The higher level program assumes that it will receive the data that was most recently written to that location 100% of the time. [Side note: In theory, all of the error correcting logic in storage systems still cannot yield an actual 100% guarantee. However, the chances of a corrupted data value being returned to the end user is so infinitesimally small that this guarantee is still treated like 100% for all practical purposes.]<br />
<br />
Imagine the extra complexity that would be involved in writing a program if you could not assume that instructions from a program are processed in the specified order, or the data in memory has not been corrupted. Almost every line of code that reads data from a variable would have to be surrounded by ‘if-statements’ that check for what might be corruption in that particular context and that specifies what should be done if the read appears to not be correct. This extra code would likely increase the development cycle for a program by one to two orders of magnitude!<br />
<br />
For a program to be “bug-free”, this guarantee must be 100%. If a read returns the correct result 99% of the time, or even 99.99% or 99.9999% of the time, the above described ‘if-statements’ and conditional logic will still be necessary. Only if the program can assume that the correct result will always be returned can the extra conditional logic be avoided. Thus, from the perspective of the developer, there is an enormous difference between a complete guarantee and an incomplete guarantee. Within the space of incomplete guarantees, it is certainly the case that 99.9999% accuracy is better than 99% accuracy. But the big jump in complexity savings for the developer is the jump between anything less than 100% and a 100% guarantee. <br />
<br />
This is why a PA/EC system has limited utility to an application developer. A PA/EC system is a system that guarantees consistency in all cases ... except when there is a network partition. In other words, a PA/EC system makes a conditional guarantee of consistency: it guarantees consistency only as long as a particular condition is met --- that there is no network partition. <br />
<br />
However, we just said that the big jump in complexity savings for the developer is a jump between anything less than 100% and a 100% guarantee. Any guarantee that comes with conditions that are out of the control of the developer (such as ensuring that there will never be a network partition), is not that helpful. The developer still has to write code to handle the cases where the guarantee is not upheld. <br />
<br />
I am not arguing that all systems must guarantee consistency. I am only pointing out that there is a big difference in developer complexity between developing on top of a system that guarantees consistency fully, and one that guarantees consistency only under certain conditions. Without a full guarantee of consistency, there is a significant amount of extra effort required to ensure the absence of corner-case bugs that may emerge days, weeks, or even years after the application is first deployed. <br />
<br />
<br />
<h3>
Conditional guarantees and PACELC</h3>
<div>
<br /></div>
We just explained that a PA system in PACELC results in a large amount of extra effort for the application developer, in order to ensure that the application remains bug-free even when inconsistent reads occur during a network partition. If so, during normal operation, where there is a tradeoff between consistency and latency, why choose consistency? All of the effort in building an application that does not break in the face of inconsistent reads has already been paid. Why not benefit from all that effort during normal operation and get better latency as well?<br />
<br />
All of this suggests that application developers should decide whether they want to develop on top of a system that guarantees consistency or not. If they want consistency, they should deploy on top of a PC/EC system in PACELC --- i.e. a system that guarantees consistency 100% of the time. If they are willing to go to all of the extra effort to build on top of a system that does not guarantee consistency, they should be looking to get as many benefits as possible from this effort, and should deploy on top of PA/EL system, and thereby get better availability and latency properties (even if those properties do not come with guarantees).<br />
<br />
<br />
<div>
<h3>
New consistency guarantees in Hazelcast</h3>
<div>
<br /></div>
As I pointed out <a href="http://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html">in a previous post</a>, when deploying within a single geographic region, the latency vs. consistency tradeoff is not significant. In such environments, there is not much difference between PA/EC and PA/EL systems. Since in-memory data grids are typically deployed in a single region, PA/EC became the standard PACELC option for the IMDG market. <br />
<br />
Nonetheless, it is interesting to note how Hazelcast --- one of the most widely used IMDG implementations --- is moving away from the PA/EC default configuration and towards the more natural PC/EC and PA/EL pairings. <br />
<br />
In 2017, I wrote an <a href="http://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html">in-depth analysis</a> of Hazelcast’s consistency and availability guarantees. At around the same time, Kyle Kingsbury published a <a href="https://jepsen.io/analyses/hazelcast-3-8-3">linearizability analysis</a> of Hazelcast using his Jepsen testing library. These analyses led to what CEO Greg Luck tells me was 3 person-years of effort to release the first <a href="https://hazelcast.com/blog/hazelcast-imdg-3-12-introduces-cp-subsystem/">PC/EC configuration option in Hazelcast</a>. <br />
<br />
A quick reminder about what Hazelcast does: Many Java programs store program state inside Java collections and data structures such as Queue, Map, or Multimap. As a program scales --- both in terms of the number of clients accessing data and in terms of the amount of data stored --- these data structures may get too large to fit in memory on a single server. Hazelcast provides a distributed implementation of these popular data structures, thereby enabling programs that leverage them to scale. Java programs interact with Hazelcast identically to how they interacted with the local Java data structures. However, under the covers, Hazelcast is distributing and replicating them across a cluster of machines.<br />
<br />
Hazelcast also provides distributed implementations of Java concurrency structures such as AtomicLong, AtomicReference, and Lock. Using such structures without a consistency guarantee is downright dangerous. For example, in situations for which consistency is not upheld, it is possible for multiple different processes to acquire the same lock (concurrently) or for non-atomic actions to occur on supposedly atomic data structures. In other words, lack of consistency violates the fundamental premise of these backbone Java concurrency structures. <br />
<br />
With a PC/EC configuration option, you can now deploy Hazelcast’s distributed version of these concurrency tools --- IAtomicLong, IAtomicReference, and ILock --- and get <a href="https://hazelcast.com/blog/testing-the-cp-subsystem-with-jepsen/">identical correctness guarantees</a> relative to what they provide when being used by a regular single-server Java program. <br />
<br />
Hazelcast’s PC/EC configuration performs distributed agreement on data structure state via the <a href="https://raft.github.io/">Raft consensus protocol</a>. Raft is known for achieving strong availability: even when a network partition occurs, the majority partition can remain available (the consistency guarantee only requires that the minority partition lose availability). <br />
<br />
Separately from the consistency issue, another well-known problem with distributed implementations of lock data structures is liveness. A server can acquire the lock and then fail for an indefinite period of time while holding the lock. Meanwhile the rest of the distributed system remains alive, but unable to acquire the lock and make progress. Therefore, in practice, distributed locks typically include timeouts, in order to prevent servers that fail while holding the lock from causing the rest of the system to stall indefinitely. Unfortunately, as soon as it is possible for a lock to timeout, it is <a href="https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html">possible for a process to get timed out while holding the lock, and yet be temporarily unaware that the lock timed out.</a> In such a scenario, it is possible for two different processes to concurrently believe they have the lock, and perform concurrent side-effects that the lock was designed to prevent. Hazelcast’s distributed lock implementation <a href="https://hazelcast.com/blog/long-live-distributed-locks/">includes a fencing mechanism</a> which enables downstream code --- code that is designed to be run while locks are being held --- to participate in the locking protocol and be able to distinguish which process is correct when multiple processes concurrently believe that they hold the lock. <br />
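The fencing idea can be sketched in a few lines (my own illustrative toy, not Hazelcast's actual implementation; all class and method names here are invented). The lock service hands out a monotonically increasing token with each grant, and the downstream resource rejects any request carrying a token older than the newest one it has seen:

```python
class LockService:
    """Toy lock server that issues a fencing token with each grant."""
    def __init__(self):
        self.token = 0
        self.holder = None

    def acquire(self, client):
        if self.holder is not None:
            return None            # lock is held; caller must retry
        self.holder = client
        self.token += 1
        return self.token          # fencing token accompanies the grant

    def timeout(self):
        self.holder = None         # lock expires; old holder may not know yet


class Storage:
    """Downstream resource that participates in the locking protocol by
    refusing writes that carry a stale fencing token."""
    def __init__(self):
        self.max_token = 0
        self.data = None

    def write(self, token, value):
        if token < self.max_token:
            return False           # stale holder: fenced off
        self.max_token = token
        self.data = value
        return True


lock, store = LockService(), Storage()
t_a = lock.acquire("A")            # A holds the lock with token 1
lock.timeout()                     # A's lock times out, but A doesn't know
t_b = lock.acquire("B")            # B now holds the lock with token 2
store.write(t_b, "from B")         # accepted
print(store.write(t_a, "from A"))  # False: A's belated write is rejected
```

This is why the token matters: both A and B briefly believe they hold the lock, but only the process with the newest token can actually mutate the protected resource.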
<br />
As for the PA/EL option in Hazelcast, they have recently added <a href="https://docs.hazelcast.org/docs/3.12.1/manual/html-single/index.html#pn-counter">PN Counters </a>--- a CRDT counter implementation that enables some correctness guarantees despite the potential for inconsistent reads. In particular, when there is no consistency guarantee, it is possible for a server to read a stale value of the counter. For example, process A may update the value of a counter from 4 to 5. Afterwards, process B sees the stale value of the counter (4) and also adds one to it (to get 5). In versions of Hazelcast prior to 3.10, the final value of the counter --- even after whatever event caused the inconsistency is resolved --- would be 5. In other words, one of the two increments of the counter would be lost. With PN Counters (introduced in Hazelcast 3.10), the counter state will eventually converge to the correct value (in this case, 6, reflecting the value after two increments to an initial value of 4). PN Counters also support decrements of the counter (in addition to increments), and guarantee (1) read-your-own-writes and (2) monotonic reads in addition to (3) what I described above --- the eventual convergence of the counter value to the correct value. <br />
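The core of a PN-Counter fits in a few lines. Here is a minimal sketch (my own illustrative code, not Hazelcast's implementation): each replica tracks per-replica increment and decrement totals, and merging takes the element-wise maximum, so concurrent updates are never lost:

```python
class PNCounter:
    """Minimal PN-Counter CRDT sketch. State is two arrays: increments and
    decrements attributed to each replica. Merging two states takes the
    element-wise max, which is commutative, associative, and idempotent,
    so replicas converge regardless of merge order."""
    def __init__(self, replica_id, n_replicas):
        self.i = replica_id
        self.p = [0] * n_replicas   # increments made, per replica
        self.n = [0] * n_replicas   # decrements made, per replica

    def increment(self):
        self.p[self.i] += 1

    def decrement(self):
        self.n[self.i] += 1

    def value(self):
        return sum(self.p) - sum(self.n)

    def merge(self, other):
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]

# The scenario from the text: the counter starts at 4; processes A and B
# each add 1 concurrently on different replicas, then the replicas merge.
a, b = PNCounter(0, 2), PNCounter(1, 2)
for _ in range(4):
    a.increment()
b.merge(a)                  # both replicas now read 4
a.increment()               # process A: 4 -> 5
b.increment()               # process B, concurrently: also thinks 4 -> 5
a.merge(b); b.merge(a)      # anti-entropy
print(a.value(), b.value()) # 6 6 -- neither increment is lost
```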
<br />
Furthermore, inconsistent reads in Hazelcast prior to version 3.10 --- under PA/EC or PA/EL configurations --- could potentially result in duplicate IDs being generated. Hazelcast 3.10 introduced <a href="https://docs.hazelcast.org/docs/3.12.1/manual/html-single/index.html#flakeidgenerator">Flake ID generators</a> which guarantee global unique IDs even in the absence of a global consistency guarantee. <br />
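The idea behind flake-style IDs can be sketched as follows (an illustrative Python toy with an assumed 41/10/12-bit layout, not Hazelcast's exact bit layout or implementation): each node packs a millisecond timestamp, its own node ID, and a per-millisecond sequence number into one integer, so nodes generate globally unique, roughly time-ordered IDs without any coordination:

```python
import threading
import time

class FlakeIdGenerator:
    """Toy flake-style ID generator. Uniqueness across nodes comes from the
    node_id field; uniqueness within a node comes from the timestamp plus a
    sequence number that disambiguates IDs created in the same millisecond."""
    def __init__(self, node_id):
        self.node_id = node_id     # assumed unique per node (e.g. assigned at cluster join)
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now <= self.last_ms:
                now = self.last_ms # same millisecond (or clock skew):
                self.seq += 1      # disambiguate with the sequence field
            else:
                self.seq = 0
            self.last_ms = now
            # assumed layout: 41 bits timestamp | 10 bits node | 12 bits sequence
            return (now << 22) | (self.node_id << 12) | self.seq

gen = FlakeIdGenerator(node_id=1)
ids = [gen.next_id() for _ in range(1000)]
print(len(set(ids)) == len(ids))   # True: no duplicates, no coordination needed
```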
<br />
<br />
<h3>
Conclusion</h3>
<div>
<br /></div>
PA/EC systems sound good in theory, but are not particularly useful in practice. Our one example of a real PA/EC system --- Hazelcast --- has spent the past 1.5 years introducing features that are designed for alternative PACELC configurations --- specifically PC/EC and PA/EL configurations. PC/EC and PA/EL configurations are a more natural cognitive fit for an application developer. Either the developer can be certain that the underlying system guarantees consistency in all cases (the PC/EC configuration) in which case the application code can be significantly simplified, or the system makes no guarantees about consistency at all (the PA/EL configuration) but promises high availability and low latency. CRDTs and globally unique IDs can still provide limited correctness guarantees despite the lack of consistency guarantees in PA/EL configurations. </div>
Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]0tag:blogger.com,1999:blog-8899645800948009496.post-74557346172528023672019-06-28T10:11:00.002-07:002019-12-05T14:52:33.316-08:00Correctness Anomalies Under Serializable Isolation<div dir="ltr">
<span id="gmail-docs-internal-guid-787211cb-7fff-bc56-4b3e-f50f34996208"></span><br />
<span id="gmail-docs-internal-guid-787211cb-7fff-bc56-4b3e-f50f34996208"></span><br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span id="gmail-docs-internal-guid-787211cb-7fff-bc56-4b3e-f50f34996208"><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Most database systems support multiple </span><a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">isolation levels</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> that enable their users to trade off exposure to various types of </span><a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">application anomalies and bugs</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> for (</span><a href="http://www.vldb.org/pvldb/vol10/p613-faleiro.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">potentially small</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">) increases in potential transaction concurrency. For decades, the highest level of “bug-free correctness” offered by commercial database systems was “SERIALIZABLE” isolation in which the database system runs transactions in parallel, but in a way that is equivalent to as if they were running one after the other. 
This isolation level was considered </span><a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">“perfect”</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> because it enabled users that write code on top of a database system to avoid having to reason about bugs that could arise due to concurrency. As long as particular transaction code is correct in the sense that if nothing else is running at the same time, the transaction will take the current database state from one correct state to another correct state (where “correct” is defined as not violating any semantics of an application), then serializable isolation will guarantee that the presence of concurrently running transactions will not cause any kind of race conditions that could allow the database to get to an incorrect state. </span><span style="font-family: "arial"; font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">In other words, serializable isolation generally allows an application developer to avoid having to reason about concurrency, and only focus on making single-threaded code correct.</span></span></div>
<span id="gmail-docs-internal-guid-787211cb-7fff-bc56-4b3e-f50f34996208">
</span><br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span id="gmail-docs-internal-guid-787211cb-7fff-bc56-4b3e-f50f34996208"><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">In the good old days of having a “database server” running on a single physical machine, serializable isolation was indeed sufficient, and database vendors never attempted to sell database software with stronger correctness guarantees than SERIALIZABLE. However, as distributed and replicated database systems have proliferated over the last few decades, anomalies and bugs have started to appear in applications even when running over a database system that guarantees serializable isolation. As a consequence, database system vendors started to release systems with stronger correctness guarantees than serializable isolation, which promise a lack of vulnerability to these newer anomalies. In this post, we will discuss several well-known bugs and anomalies in serializable distributed database systems, and modern correctness guarantees that ensure avoidance of these anomalies. </span></span></div>
<span id="gmail-docs-internal-guid-787211cb-7fff-bc56-4b3e-f50f34996208">
<br /><h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="color: #434343; font-family: "arial"; font-size: 14pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">What does “serializable” mean in a distributed/replicated system?</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span>
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">We defined “serializable isolation” above as a guarantee that even though a database system is allowed to run transactions in parallel, the final result is equivalent to running them one after the other. In a replicated system, this guarantee must be strengthened in order to avoid the anomalies that would only occur at lower levels of isolation in non-replicated systems. For example, let’s say that the balance of Alice’s checking account ($50) is replicated so that the same value is stored in data centers in Europe and the United States. Many systems do not replicate data synchronously over such large distances. Rather, a transaction will complete at one region first, and its update to the database system may be replicated afterwards. If a withdrawal of $20 is made concurrently in the United States and Europe, the old balance ($50) is read in both places, $20 is removed from it, and the new balance ($30) is written back in both places and replicated to the other data center. This final balance is clearly incorrect --- it should be $10 --- and was caused by concurrently executing transactions. But the truth is, the same outcome could happen if the transactions were serial (one after the other) as long as the replication is not included as part of the transaction (but rather happens afterwards). Therefore, a concurrency bug results despite equivalence to a serial order.</span></div>
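The lost update in Alice’s example can be sketched with a toy simulation (the `Replica` class and the `withdraw`/`replicate` helpers below are illustrative names, not any real system’s API):

```python
# Toy sketch: two replicas of a balance, each serving a withdrawal locally
# and shipping the *written value* to its peer only after the commit.

class Replica:
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        old = self.balance            # read the local current balance
        self.balance = old - amount   # write back; transaction commits here
        return self.balance

def replicate(src, dst):
    dst.balance = src.balance         # async replication after the commit

us, eu = Replica(50), Replica(50)
us.withdraw(20)    # US transaction commits: local balance $30
eu.withdraw(20)    # EU transaction commits concurrently: also read $50

replicate(us, eu)  # replication happens after both commits...
replicate(eu, us)
# ...so both copies converge to $30, even though the correct balance is $10.
```

Note that the outcome is still equivalent to a serial order of the two withdrawals, because replication was not part of either transaction; the bug lives in the replication scheme rather than in the serializability guarantee itself.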
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Rony Attar, Phil Bernstein, and Nathan Goodman </span><a href="https://ieeexplore.ieee.org/document/5010293" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">expanded the concept of serializability</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> in 1984 to define correctness in the context of replicated systems. The basic idea is that all the replicas of a data item behave like a single logical data item. When we say that a concurrent execution of transactions is “equivalent to processing them in a particular serial order”, this implies that whenever a data item is read, the value returned will be the most recent write to that data item by a previous transaction in the (equivalent) serial order --- no matter which copy was written by that write. In this context, “most recent write” means the write by the closest (previous) transaction in that serial order. In our example above, either the withdrawal in Europe or the withdrawal in the US will be ordered first in the equivalent serial order. Whichever transaction is second must --- when it reads the balance --- read the value written by the first transaction. Attar et al. named this guarantee “</span><span style="font-family: "arial"; font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">one copy serializability</span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">” or “1SR”, because the isolation guarantee is equivalent to serializability in an unreplicated system with “one copy” of every data item.</span></div>
<br /><h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="color: #434343; font-family: "arial"; font-size: 14pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Anomalies are possible under serializability; anomalies are possible under one-copy serializability</span></h3>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">We just stated that one-copy serializability in replicated systems is the identical isolation guarantee as serializability in unreplicated systems. There are many, many database systems that offer an isolation level called “serializability”, but </span><a href="https://fauna.com/blog/a-comparison-of-scalable-database-isolation-levels" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">very few (if any) replicated database systems that offer an isolation level called “one-copy serializability</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">”. To understand why this is the case, we need to explain some challenges in writing bug-free programs on top of systems that “only” guarantee serializability. </span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">A serializable system only guarantees that transactions will be processed in an equivalent way to </span><span style="font-family: "arial"; font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">some </span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">serial order. The serializability guarantee by itself doesn’t place any constraints on what this serial order is. In theory, one transaction can run and commit. Another transaction can come along --- a long period of time after the first one commits --- and be processed in such a way that the resulting equivalent serial order places the later transaction before the earlier one. In a sense, the later transaction “time travels” back in time: the final state of the database is equivalent to that transaction having run before transactions that completed before it began. A serializable system doesn’t prevent this. Nor does a one-copy serializable system. Nonetheless, in single-server systems, it is easy and convenient to prevent time-travel. Therefore, the vast majority of single-server systems that guarantee “serializability” also prevent it. Indeed, preventing time travel was so trivial that most commercial serializable systems did not consider it notable enough to document. </span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">In contrast, in distributed and replicated systems, it is far less trivial to guarantee a lack of time travel, and many systems allow some forms of time travel in their transaction processing behavior. </span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">The next few sections describe some forms of time-travel anomalies that occur in distributed and/or replicated systems, and the types of application bugs that they may cause. All of these anomalies are possible under a system that only guarantees one-copy serializability. Therefore, vendors typically document which of these anomalies they do and do not allow, thereby potentially guaranteeing a level of correctness higher than one-copy serializability. At the end of this post, we will classify different correctness guarantees and which time-travel anomalies they allow.</span></div>
<br /><h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="color: #434343; font-family: "arial"; font-size: 14pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">The immortal write</span></h3>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Let’s say the user of an application currently has a display name of “Daniel”, and decides to change it to “Danny”. He goes to the application interface, and changes his display name accordingly. He then reads his profile to ensure the change took effect, and confirms that it has. Two weeks later, he changes his mind again, and decides he wants to change his display name to “Danger”. He goes to the interface, changes his display name accordingly, and is told that the change was successful. But when he performs a read on his profile, it still lists his name as “Danny”. He can go back and change his name a million times. Every time, he is told the change was successful, but the value of his display name in the system remains “Danny”. </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNf3nYNAK-fXTm74s4_gRpzlpg3Tzok1xUBb_wY7RpMFlyH4gMqUIojLDIdWdPPGQYjXA7NJN9o1cB6IJ4iv_KBCaHKKpnlvBWQoOAVLI8VMhYksQ8hIjW9HX6U7U986YpLe3MmyA8eiw/s1600/Slide1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNf3nYNAK-fXTm74s4_gRpzlpg3Tzok1xUBb_wY7RpMFlyH4gMqUIojLDIdWdPPGQYjXA7NJN9o1cB6IJ4iv_KBCaHKKpnlvBWQoOAVLI8VMhYksQ8hIjW9HX6U7U986YpLe3MmyA8eiw/s640/Slide1.PNG" width="640" /></a></div>
<span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;">What happened? All of the future writes of his name travelled back in time to a point in the serial order directly before the transaction that changed his name to “Danny”. The “Danny” transaction therefore overwrote the value written by all of these other transactions, even though it happened much earlier than these other transactions in real time. The system decided that the serial order that it was guaranteeing equivalence to has the “Danny” transaction after all of the other name-change transactions --- it has full power to decide this without violating its serializability guarantee. [Side note: when the “Danny” transaction and/or the other name-change transactions also perform a read to the database as part of the same transaction as the write to the name, the ability to time-travel without violating serializability becomes much more difficult. But for “blind write” transactions such as these examples, time-travel is easy to accomplish.]</span></div>
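One concrete way such an immortal blind write can arise is a multi-master system that resolves conflicting writes with last-writer-wins timestamps: if the “Danny” write is stamped by a server with a fast-running clock, every later write silently loses the timestamp comparison. A hedged sketch (the `lww_apply` function and the timestamp values are invented for illustration):

```python
# Illustrative last-writer-wins (LWW) register; not any particular system.

def lww_apply(register, value, ts):
    # Apply a write only if its timestamp beats the stored one; either way
    # the client is told the write "succeeded" (it is a blind write).
    if ts > register["ts"]:
        register["value"], register["ts"] = value, ts

name = {"value": "Daniel", "ts": 0}

# A server with a skewed clock stamps the "Danny" write far in the future:
lww_apply(name, "Danny", ts=1_000_000)

# All later name changes carry smaller (real-time) timestamps and lose:
for ts, value in [(100, "Danger"), (200, "Dan"), (300, "D")]:
    lww_apply(name, value, ts)

# The display name remains "Danny" no matter how many writes follow.
```

Reordering the losing writes to a point in the serial order before the “Danny” write is exactly the time travel described above, and it does not violate serializability.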
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">In multi-master asynchronously replicated database systems, where writes are allowed to occur at either replica, it is possible for conflicting writes to occur across the replicas. In such a scenario, it is tempting to leverage time travel to create an immortal blind write, which enables straightforward conflict resolution without violating the serializability guarantee. </span></div>
<br /><h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="color: #434343; font-family: "arial"; font-size: 14pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">The stale read</span></h3>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">The most common type of anomaly that appears in replicated systems but not in serializable single-server systems is the “stale read anomaly”. For example, Charlie has a bank account with $50 left in the account. He goes to an ATM and withdraws $50. Afterwards, he asks for a receipt with his current bank balance. The receipt (incorrectly) states that he has $50 left in his account (when, in actuality, he now has no money left). As a result, Charlie is left with an incorrect impression of how much money he has, and may make real-life mistakes (for example, splurging on a new gadget) that he wouldn’t have made if he had a correct impression of his account balance. This anomaly happened as a result of a stale read: his account certainly used to have $50 in it. But when the ATM did a read request on the bank database to get his current balance, this read request did not reflect the write to his balance that happened a few seconds earlier when he withdrew money from his account. </span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">The stale read anomaly is extremely common in asynchronously replicated database systems (such as read replicas in MySQL or Amazon Aurora). The write (the update to Charlie’s balance) gets directed to one copy, which is not immediately replicated to the other copy. If the read gets directed to the other copy before the new write has been replicated to it, it will see a stale value.</span></div>
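A minimal sketch of the routing that produces this anomaly, with plain dictionaries standing in for the primary copy and an asynchronously maintained read replica:

```python
primary = {"charlie": 50}
replica = {"charlie": 50}   # read replica, updated asynchronously

# The withdrawal transaction commits on the primary.
primary["charlie"] -= 50

# The receipt's balance query is routed to the replica *before* the write
# has been shipped to it, so it observes the old value.
receipt_balance = replica["charlie"]   # 50, though the true balance is 0
```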
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8_QC6tdEqknxebJOkGIxwGshw-hhhxee-LjaU7DTKfzSfq_1MeJ6u-cOAmp9oN3JbmBL_EDUw4lP9FBShTcTDbDCq4lbFmtMUbIoFN97Wvrj3AXEN3uPSKM-me1fle_pZ1sCiIAuvZzU/s1600/staleRead.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="413" data-original-width="1108" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8_QC6tdEqknxebJOkGIxwGshw-hhhxee-LjaU7DTKfzSfq_1MeJ6u-cOAmp9oN3JbmBL_EDUw4lP9FBShTcTDbDCq4lbFmtMUbIoFN97Wvrj3AXEN3uPSKM-me1fle_pZ1sCiIAuvZzU/s640/staleRead.jpg" width="640" /></a></div>
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span>
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Stale reads do not violate serializability. The system is simply time travelling the read transaction to a point in the equivalent serial order of transactions before the new writes to this data item occur. Therefore, asynchronously replicated database systems can allow stale reads without giving up their serializability (or even one-copy serializability) guarantees. </span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">In a single-server system, there’s little motivation to read anything aside from the most recent value of a data item. In contrast, in a replicated system, the network round trips required for synchronous replication are time-consuming and expensive. It is therefore tempting to perform replication asynchronously, since reads can occur from asynchronous read-only replicas without violating serializability (as long as the replicated data becomes visible in the same order as the original). </span></div>
<br /><h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="color: #434343; font-family: "arial"; font-size: 14pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">The causal reverse</span></h3>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">In contrast to the </span><span style="font-family: "arial"; font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">stale read anomaly,</span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> the </span><span style="font-family: "arial"; font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">causal reverse anomaly</span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> can happen in any distributed database system and is independent of how replication is performed (synchronous or asynchronous). In the causal reverse anomaly, a later write that was </span><span style="font-family: "arial"; font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">caused</span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> by an earlier write time-travels to a point in the serial order prior to the earlier write. In general, these two writes can be to totally different data items. Reads that occur in the serial order between these two writes may observe the “effect” without the “cause”, which can lead to application bugs.</span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">For example, most banks do not exchange money between accounts in a single database transaction. Instead, money is removed from one account into a bank-owned account in one transaction. A second transaction moves the money from the bank-owned account to the account intended as the destination for this transfer. The second transaction is </span><span style="font-family: "arial"; font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">caused </span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">by the first. If the first transaction didn’t succeed for any reason, the second transaction would never be issued.</span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Let’s say that $1,000,000 is being transferred from account A (which currently has $1,000,000 and will have $0 left after this transfer) to account B (which currently has $0, and will have $1,000,000 after the transfer). Let’s say that account A and account B are owned by the same entity, and this entity wishes to get a sizable loan that requires $2,000,000 as a down payment. In order to see if this customer is eligible for the loan, the lender issues a read transaction that reads the values of accounts A and B and takes the sum of the balance of those two accounts. If this read transaction occurs in the serial order before the transfer of $1,000,000 from A to B, a total of $1,000,000 will be observed across accounts. If this read transaction occurs after the transfer of $1,000,000 from A to B, a total of $1,000,000 will still be observed across accounts. If this read transaction occurs between the two transactions involved in transfer of $1,000,000 from A to B that we described above, a total of $0 will be observed across accounts. In all three possible cases, the entity will be (correctly) denied the loan due to lack of funds necessary for the down payment. </span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH9gy8qjqU__GZQ-yYB-A3Vko1kthRTc7XqvzVWNWaBh8bEXfwJtwgpPE5DZQrKQQmmUDVbNKg8klep0Q2jQnUHKwn9ARMm99TyPbZaYNRwdE833ucCDswJ3EJoUijnkQ6YNQy1XIyCwU/s1600/CausalReverse.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH9gy8qjqU__GZQ-yYB-A3Vko1kthRTc7XqvzVWNWaBh8bEXfwJtwgpPE5DZQrKQQmmUDVbNKg8klep0Q2jQnUHKwn9ARMm99TyPbZaYNRwdE833ucCDswJ3EJoUijnkQ6YNQy1XIyCwU/s640/CausalReverse.png" width="640" /></a></div>
<div dir="ltr">
<span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;">However, if the second transaction involved in the transfer (the one that adds $1,000,000 to account B) time-travels before the transaction that caused it to exist in the first place (the one that subtracts $1,000,000 from account A), it is possible for a read transaction that occurs between these two writes to (incorrectly) observe a balance across accounts of $2,000,000 and thereby allow the entity to secure the loan. Since the transfer was performed in two separate transactions, this example does not violate serializability. The equivalent serial order is: (1) the transaction that does the second part of the transfer, (2) the read transaction, and (3) the transaction that does the first part of the transfer. However, this example shows the potential for devastating bugs in application code if causative transactions are allowed to time-travel to a point in time before their cause. </span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">One example of a distributed database system that allows the causal reverse is CockroachDB (aka CRDB). CockroachDB partitions a database such that each partition commits writes and synchronously replicates data separately from other partitions. Each write receives a timestamp based on the local clock on one of the servers within that partition. In general, it is impossible to perfectly synchronize clocks across a large number of machines, so CockroachDB allows a maximum clock skew for which clocks across a deployment can differ. However, (unlike Google Spanner) CockroachDB does not wait for the maximum clock skew bound to pass before committing a transaction. Therefore, it is possible in CockroachDB for a transaction to commit, and for a later transaction --- one that writes data to a different partition, was caused by the earlier one, and started after the earlier one finished --- to receive an earlier timestamp than the earlier transaction. This enables a read (in CockroachDB’s case, this read has to be sent to the system before the two write transactions) to potentially see the write of the later transaction, but not the earlier one. </span></div>
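The following toy simulation (emphatically not CockroachDB’s actual implementation; the clock values and helper names are invented) shows how per-partition commit timestamps plus clock skew can let a caused write receive an earlier timestamp than its cause, so that a snapshot read between the two timestamps observes $2,000,000:

```python
# Two partitions stamp commits with their own local clocks; partition B's
# clock lags, so the *caused* write gets an earlier timestamp.

clocks = {"A": 100, "B": 90}   # partition-local clocks; B lags by 10 ticks

log = []  # (timestamp, operation) pairs, stamped by the local clock

def commit(partition, op):
    log.append((clocks[partition], op))
    clocks[partition] += 1

# The transfer: the credit is issued only after the debit commits.
commit("A", ("debit", "A", 1_000_000))    # stamped ts=100
commit("B", ("credit", "B", 1_000_000))   # stamped ts=90: before its cause!

# A snapshot read at ts=95 applies every operation stamped at or before 95,
# so it sees the credit (ts=90) but not the debit (ts=100).
balances = {"A": 1_000_000, "B": 0}
for ts, (kind, acct, amount) in sorted(log):
    if ts <= 95:
        balances[acct] += amount if kind == "credit" else -amount

observed = sum(balances.values())  # 2_000_000: effect seen without cause
```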
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">In other words, if the bank example we’ve been discussing were implemented over CockroachDB, the entity wishing to secure the loan could simply repeatedly make the loan request and then transfer money between accounts A and B until the causal reverse anomaly shows up, and the loan is approved. Obviously, a well-written application should be able to detect the repeated loan requests and prevent this hack from occurring. But in general, it is hard to predict all possible hacks and write defensive application code to prevent them. Furthermore, banks are not usually able to recruit elite application programmers, which leads to some </span><a href="https://www.theverge.com/2019/2/5/18212902/huaxia-bank-qin-qisheng-atm-loophole-hack-china" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">mind-boggling vulnerabilities</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> in real-world applications. </span></div>
<br /><h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="color: #434343; font-family: "arial"; font-size: 14pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Avoiding time travel anomalies</span></h3>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">All the anomalies that we’ve discussed so far --- the immortal write, the stale read, and the causal reverse --- exploit the permissibility of time travel in the serializability guarantee, and thereby introduce bugs in application code. To avoid these bugs, the system needs to guarantee that transactions are not allowed to travel in time, in addition to guaranteeing serializability. As we mentioned above, single-server systems generally make this time-travel guarantee without advertising it, since the implementation of this guarantee is trivial on a single server. In distributed and replicated database systems, this additional guarantee of “no time travel” on top of the other serializability guarantees is non-trivial, but has nonetheless been accomplished by several systems such as </span><a href="https://fauna.com/" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">FaunaDB</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">/</span><a href="http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Calvin</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">, </span><a href="https://www.foundationdb.org/" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">FoundationDB</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">, and </span><a 
href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Spanner</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">. This high level of correctness is called </span><a href="http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">strict serializability</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">. </span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Strict serializability makes all of the guarantees of one-copy serializability that we discussed above. In addition, it guarantees that if a transaction X completed before transaction Y started (in real time) then X will be placed before Y in the serial order that the system guarantees equivalence to. </span></div>
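This real-time constraint can be stated as a simple predicate over transaction start/commit times and a proposed equivalent serial order (a sketch; the function name and sample inputs are illustrative):

```python
def respects_real_time(serial_order, start, commit):
    # Strict serializability's extra rule: if X committed before Y started
    # (in real time), X must precede Y in the equivalent serial order.
    pos = {t: i for i, t in enumerate(serial_order)}
    return all(pos[x] < pos[y]
               for x in serial_order
               for y in serial_order
               if commit[x] < start[y])

start  = {"X": 0, "Y": 10}
commit = {"X": 5, "Y": 15}      # X finished (in real time) before Y began

ok  = respects_real_time(["X", "Y"], start, commit)   # allowed order
bad = respects_real_time(["Y", "X"], start, commit)   # Y time-travelled
```

Both orders are serializable; only the first is strictly serializable, which is exactly the distinction this section draws.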
<br /><h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="color: #434343; font-family: "arial"; font-size: 14pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Classification of serializable systems</span></h3>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Systems that guarantee strict serializability eliminate all types of time travel anomalies. At the other end of the spectrum, systems that guarantee “only” one-copy serializability are vulnerable to all of the anomalies that we’ve discussed in this post (even though they are immune to the </span><a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">isolation anomalies we discussed in a previous post</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">). There also exist systems that guarantee a version of serializability between these two extremes. One example is </span><a href="https://cs.uwaterloo.ca/~kmsalem/pubs/DaudjeeICDE04.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">“strong session serializable” systems</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> that guarantee strict serializability of transactions within the same session, but otherwise only one-copy serializability. Another example is "strong write serializable" systems that guarantee strict serializability for all transactions that insert or update data, but only one-copy serializability for read-only transactions. This isolation level is commonly implemented by read-only replica systems where all update transactions go to the master replica, which processes them with strict serializability. These updates are asynchronously replicated to read-only replicas in the order they were processed at the master. Reads from the replicas may be stale, but they are still serializable. 
A third class of systems is </span><span style="font-family: arial;"><span style="font-size: 14.6667px; white-space: pre-wrap;">"strong partition serializable" systems that guarantee strict serializability only on a per-partition basis. Data is divided into a number of disjoint partitions. Within each partition, transactions that access data within that partition are guaranteed to be strictly serializable. But otherwise, the system only guarantees one-copy serializability. This isolation level can be implemented by synchronously replicating</span></span><span style="font-family: arial; font-size: 11pt; white-space: pre-wrap;"> writes within a partition, but avoiding coordination across partitions for disjoint writes (we explained above that CockroachDB is a member of this class). Now that we have given names to these different levels of serializability, we can summarize the anomalies to which they are vulnerable with a simple chart:</span></div>
<br /><br /><div dir="ltr" style="margin-left: 0pt;">
<table style="border-collapse: collapse; border-image: none; border: medium; width: 468pt;"><colgroup><col width="*"></col><col width="*"></col><col width="*"></col><col width="*"></col></colgroup><tbody>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">System Guarantee</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Immortal write</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Stale read</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Causal reverse</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">ONE COPY SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">STRONG SESSION SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Possible (but not within same session)</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Possible (but not within same session)</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Possible (but not within same session)</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">STRONG WRITE SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">STRONG PARTITION SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">STRICT SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td></tr>
</tbody></table>
</div>
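<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: &quot;arial&quot;; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">To see why strong write serializable systems permit stale reads but nothing worse, consider the following toy Python model of the read-replica design described above. It is a sketch for intuition only (the class and method names are invented here, and do not correspond to any real database's API): a primary applies writes in a single serial order, and a replica applies the same log asynchronously, in order.</span></div>

```python
# Toy model of a "strong write serializable" system: the primary applies
# writes in strict serial order, and a replica lazily replays the same
# totally ordered log. Names and structure are illustrative only.

class Primary:
    def __init__(self):
        self.state = {}
        self.log = []          # totally ordered write log

    def write(self, key, value):
        self.state[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.state = {}
        self.applied = 0       # log position replicated so far

    def catch_up(self, n=1):
        # asynchronously apply (at most) the next n log entries, in order
        for key, value in self.primary.log[self.applied:self.applied + n]:
            self.state[key] = value
        self.applied = min(self.applied + n, len(self.primary.log))

    def read(self, key):
        return self.state.get(key)

primary = Primary()
replica = Replica(primary)

primary.write("x", 1)
replica.catch_up()          # replica now reflects x = 1
primary.write("x", 2)       # committed at the primary, not yet replicated

stale = replica.read("x")   # stale read: returns 1, not the latest 2
fresh = primary.state["x"]  # reads at the primary are always current
```

<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: &quot;arial&quot;; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Because the replica replays the log in primary order, every replica read observes some prefix of the serial history: serializable, just not necessarily current. That is exactly the stale-read anomaly marked as possible for strong write serializable systems in the chart above, while immortal writes and causal reverses remain impossible.</span></div>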
<br /><br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">For readers who read my </span><a href="http://dbmsmusings.blogspot.com/2019/05/introduction-to-transaction-isolation.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">previous post on isolation levels</span></a><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">, we can combine the isolation anomalies from that post with the time travel anomalies from this post to get a single table with all the anomalies we’ve discussed across the two posts:</span></div>
<br /><br /><br /><div dir="ltr" style="margin-left: 0pt;">
<table style="border-collapse: collapse; border-image: none; border: medium; width: 468pt;"><colgroup><col width="124"></col><col width="60"></col><col width="97"></col><col width="73"></col><col width="73"></col><col width="*"></col><col width="*"></col><col width="*"></col></colgroup><tbody>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">System Guarantee</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Dirty read</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Non-repeatable read</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Phantom Read</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Write Skew</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Immortal write</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Stale read</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Causal reverse</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">READ UNCOMMITTED</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">READ COMMITTED</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">REPEATABLE READ</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">SNAPSHOT ISOLATION</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">SERIALIZABLE / ONE COPY SERIALIZABLE / STRONG SESSION SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">STRONG WRITE SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">STRONG PARTITION SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: red; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Possible</span></div>
</td></tr>
<tr style="height: 0pt;"><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">STRICT SERIALIZABLE</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td><td style="border: 1pt solid rgb(0, 0, 0); padding: 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;">
<span style="color: #38761d; font-family: "arial"; font-size: 9pt; vertical-align: baseline; white-space: pre-wrap;">Not Possible</span></div>
</td></tr>
</tbody></table>
</div>
<br /></span></div>Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]11tag:blogger.com,1999:blog-8899645800948009496.post-89530915689518324272019-05-03T07:42:00.000-07:002019-12-05T14:52:47.043-08:00Introduction to Transaction Isolation Levels<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
For decades, database systems have given their users multiple isolation levels to choose from, ranging from some flavor of “serializability” at the high end down to “read committed” or “read uncommitted” at the low end. These different isolation levels expose an application to markedly different types of concurrency bugs. Nonetheless, many database users stick with the default isolation level of whatever database system they are using, and do not bother to consider which isolation level is optimal for their application. This practice is dangerous: the vast majority of widely used database systems—including Oracle, IBM DB2, Microsoft SQL Server, SAP HANA, MySQL, and PostgreSQL—do not guarantee any flavor of serializability by default. As we will detail below, isolation levels weaker than serializability can lead to concurrency bugs in an application and negative user experiences. It is very important for a database user to be aware of the isolation level guaranteed by the database system, and of the concurrency bugs that may emerge as a result. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
In this post we give a tutorial on database isolation levels, the advantages that come with lower isolation levels, and the types of concurrency bugs that these lower levels may allow. Our main focus in this post is the difference between serializable isolation levels and lower levels that expose an application to specific types of concurrency anomalies. There are also important differences within the category of serializable isolation (e.g., “strict serializability” makes a different set of guarantees than “one-copy serializability”). However, in keeping with the title of this post as an “introduction” to database isolation levels, we will ignore the subtle differences within the class of serializable isolation, and focus on the commonality of all elements within this class—referring to this commonality as “serializable”. In a future, less introductory, post, we will investigate the category of serializable isolation in more detail.</div>
<h3 style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 26px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; letter-spacing: -0.3px; line-height: 32px; margin-bottom: 20px; margin-top: 40px;">
What is an “Isolation Level”?</h3>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
<span style="box-sizing: border-box; font-weight: 700;">Database isolation</span> refers to the ability of a database to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). The overarching goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions.</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
There is such a thing as perfect isolation (we will define this below). Unfortunately, perfection usually comes at a performance cost—in terms of transaction latency (how long before a transaction completes) or throughput (how many transactions per second can the system complete). Depending on how a particular system is architected, perfect isolation becomes easier or harder to achieve. In poorly designed systems, achieving perfection comes with a prohibitive performance cost, and users of such systems will be pushed to accept guarantees significantly short of perfection. However, even in well designed systems, there is often a non-trivial performance benefit achieved by accepting guarantees short of perfection. Therefore, <span style="box-sizing: border-box; font-weight: 700;">isolation levels</span> came into existence: they provide the user of a system the ability to trade off isolation guarantees for improved performance. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
As we will see as we proceed in this discussion, there are many subtleties that lead to confusion when discussing isolation levels. This confusion is exacerbated by the existence of a SQL standard that <span style="box-sizing: border-box; font-weight: 700;">fails to accurately define database isolation levels</span> and database vendors that attach liberal and non-standard semantics to particular named isolation levels. Nonetheless, a database user is obligated to do due diligence on the different options provided by a particular system, and choose the right level for their application. </div>
<h3 style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 26px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; letter-spacing: -0.3px; line-height: 32px; margin-bottom: 20px; margin-top: 40px;">
Perfect Isolation</h3>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
Let’s begin our discussion of database isolation levels by providing a notion of what “perfect” isolation is. We defined <span style="box-sizing: border-box; font-weight: 700;">isolation</span> above as the ability of a database system to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions). What does it mean to be perfect in this regard? At first blush, it may appear that perfection is impossible. If two transactions both read and write the same data item, it is critical that they impact each other. If they ignore each other, then whichever transaction writes last will clobber the other’s write, resulting in the same final state as if the clobbered transaction had never run. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
The database system was one of the first scalable concurrent systems, and has served as an archetype for many other concurrent systems developed subsequently. The database system community—many decades ago—developed an incredibly powerful (and yet perhaps underappreciated) mechanism for dealing with the complexity of implementing concurrent programs. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
The idea is as follows: human beings are fundamentally bad at reasoning about concurrency. It’s hard enough to write a bug-free non-concurrent program. But once you add concurrency, there is a near-infinite array of race conditions that could occur—if one thread reaches line 17 of a program before another thread reaches line 5, but after it reaches line 3, a problem could occur that would not exist under any other concurrent execution of the program. It is nearly impossible to consider all the different ways that program execution in different threads could overlap with each other, and how different types of overlap can lead to different final states.</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
Instead, database systems provide a beautiful abstraction to the application developer—a “transaction”. A transaction may contain arbitrary code, but it is fundamentally single-threaded. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
An application developer only needs to focus on the code within a transaction—to ensure it is correct when there are no other concurrent processes running in the system. Given any starting state of the database, the code must not violate the semantics of the application. Ensuring correctness of code is non-trivial, but it is much easier to ensure the correctness of code when it is running by itself, than ensuring the correctness of code when it is running alongside other code that may attempt to read or write shared data. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
If the application developer is able to ensure the correctness of their code when no other concurrent processes are running, a system that guarantees <span style="box-sizing: border-box; font-weight: 700;">perfect</span> isolation will ensure that the code remains correct even when there is other code running concurrently in the system that may read or write the same data. In other words, a database system enables a user to write code without concern for the complexity of potential concurrency, and yet still process that code concurrently without introducing new bugs or violations of application semantics. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
Implementing this level of perfection sounds difficult, but it’s actually fairly straightforward to achieve. We have already assumed that the code is correct when run without concurrency over <span style="box-sizing: border-box; font-weight: 700;">any starting state</span>. Therefore, if transaction code is run serially — one after the other — then the final state will also be correct. Thus, in order to achieve perfect isolation, all the system has to do is ensure that when transactions are running concurrently, the final state is equivalent to a state of the system that would exist if they were run serially. There are several ways to achieve this—such as via locking, validation, or multi-versioning—that are out of scope for this article. The key point for our purposes is that we are defining “perfect isolation” as the ability of a system to run transactions in parallel, but in a way that is equivalent to running them one after the other. In the <a href="http://jtc1sc32.org/doc/N2301-2350/32N2311T-text_for_ballot-CD_9075-2.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb; text-decoration-line: none;">SQL standard</a>, this perfect isolation level is called <span style="box-sizing: border-box; font-weight: 700;">serializability</span>. </div>
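The crudest route to this guarantee can be sketched in a few lines of Python (a toy of my own, not how any real database is built): funnel every transaction through a single global lock, so that concurrent submissions actually execute one after the other. Real systems achieve the same serial-equivalence guarantee with far more parallelism via the locking, validation, or multi-versioning techniques mentioned above.

```python
import threading

db = {"x": 0}
global_lock = threading.Lock()

def run_transaction(txn):
    # Serial execution: every transaction runs alone, so the final state
    # is trivially equivalent to *some* serial order of the transactions.
    with global_lock:
        txn(db)

def increment(db):
    old = db["x"]        # read
    db["x"] = old + 1    # write a value derived from the read

threads = [threading.Thread(target=run_transaction, args=(increment,))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(db["x"])  # always 10, exactly as if the 10 transactions ran serially
```

Without the lock, the read-then-write inside `increment` could interleave across threads and lose updates, which is precisely the kind of anomaly discussed next.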
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
Isolation levels in distributed systems get more complicated. Many distributed systems implement variations of the serializable isolation level, such as <span style="box-sizing: border-box; font-weight: 700;">one-copy serializability (1SR),</span> <span style="box-sizing: border-box; font-weight: 700;">strict serializability (strict 1SR) </span>or <span style="box-sizing: border-box; font-weight: 700;">update serializability (US)</span>. However, as we mentioned above, in order to focus the discussion in this post on core concepts behind database isolation, we will defer discussion of these more advanced concepts to a future post.</div>
<h3 style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 26px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; letter-spacing: -0.3px; line-height: 32px; margin-bottom: 20px; margin-top: 40px;">
Anomalies in Concurrent Systems</h3>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
The SQL standard defines several isolation levels below serializability. Furthermore, there are other isolation levels commonly found in commercial database systems—most notably snapshot isolation—which are not included in the SQL standard. Before we discuss these different levels of isolation, let’s discuss some well-known application bugs/anomalies that can occur at isolation levels below serializability. We will describe these bugs using a retail example.</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
Let’s assume that whenever a customer purchases a widget, the following “Purchase” transaction is run:</div>
<ol style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px; margin-top: 0px;">
<li style="box-sizing: border-box;">Read old inventory</li>
<li style="box-sizing: border-box;">Write new inventory which is one less than what was read in step (1)</li>
<li style="box-sizing: border-box;">Insert new order corresponding to this purchase into the orders table</li>
</ol>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
If such Purchase transactions run serially, then all initial inventory will always be accounted for. If we started with 42 widgets, then at all times, the sum of all inventory remaining plus orders in the order table will be 42.</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
But what if such transactions run concurrently at isolation levels below serializability?</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
For example, let’s say that two transactions running concurrently read the same initial inventory (42), and then both attempt to write out the new inventory of one less than the value that they read (41) in addition to the new order. In such a case, the final state is an inventory of 41, yet there are two new orders in the orders table (for a total of 43 widgets accounted for). We created a widget out of nothing! Clearly, this is a bug. It is known as the <span style="box-sizing: border-box; font-weight: 700;">lost-update anomaly</span>. </div>
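This interleaving is easy to reproduce in a few lines of Python (a hand-rolled simulation of mine, not real database behavior): both Purchase transactions perform step (1) before either performs steps (2) and (3).

```python
inventory = 42   # starting inventory of widgets
orders = []      # the orders table

def purchase_write(old_inventory):
    # Steps (2) and (3) of the Purchase transaction
    global inventory
    inventory = old_inventory - 1
    orders.append("order")

# Step (1) of both transactions runs before either writes:
read_by_t1 = inventory        # T1 reads 42
read_by_t2 = inventory        # T2 also reads 42
purchase_write(read_by_t1)    # T1 writes inventory = 41, inserts an order
purchase_write(read_by_t2)    # T2 overwrites inventory with 41, inserts an order

print(inventory, len(orders))   # 41 inventory, 2 orders
print(inventory + len(orders))  # 43: a widget created out of nothing
```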
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
As another example, let’s say these same two transactions are running concurrently but this time the second transaction starts in between steps (2) and (3) of the first one. In this case, the second transaction reads the value of inventory after it has been decremented - i.e. it reads the value of 41 and decrements it to 40, and writes out the order. In the meantime, the first transaction aborted when writing out the order (e.g. because of a credit card decline). In such a case, during the abort process, the first transaction reverts back to the state of the database before it began (when inventory was 42). Therefore the final state is an inventory of 42, and one order written out (from the second transaction). Again, we created a widget out of nothing! This is known as the <span style="box-sizing: border-box; font-weight: 700;">dirty-write anomaly</span> (because the second transaction overwrote the value of the first transaction’s write before it decided whether it would commit or abort). </div>
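The same scenario as a small Python simulation (my own sketch; real systems implement rollback via undo logs or versioning, not a saved variable):

```python
inventory = 42
orders = []

undo_value = inventory        # T1 records the pre-transaction value for rollback
old = inventory
inventory = old - 1           # T1 step (2): uncommitted write, inventory = 41

# T2 starts here, reads T1's uncommitted value, and overwrites it (dirty write):
old2 = inventory              # T2 reads 41
inventory = old2 - 1          # T2 writes 40 over T1's uncommitted write
orders.append("T2 order")     # T2 step (3): T2 commits

# T1 aborts (e.g. credit card decline) and rolls back:
inventory = undo_value        # rollback clobbers T2's committed write

print(inventory, len(orders))  # 42 and 1 order: 43 widgets accounted for
```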
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
As a third example, let’s say a separate transaction performs a read of both the inventory and the orders table, in order to make an accounting of all widgets that ever existed. If it runs between steps (2) and (3) of a Purchase transaction, it will see a temporary state of the database in which the widget has disappeared from the inventory, but has not yet appeared as an order. It will appear that a widget has been lost—another bug in our application. This is known as the <span style="box-sizing: border-box; font-weight: 700;">dirty-read anomaly</span>, since the accounting transaction was allowed to read the temporary (incomplete) state of the purchase transaction. </div>
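A sketch of the dirty read, again as a toy simulation of my own:

```python
inventory = 42
orders = []

# Purchase transaction, paused between steps (2) and (3):
old = inventory
inventory = old - 1           # step (2): inventory is now 41 (uncommitted)

# The accounting transaction runs here and reads the temporary state:
total_accounted = inventory + len(orders)
print(total_accounted)        # 41: a widget appears to have been lost

orders.append("order")        # step (3) finally runs; 42 widgets accounted for again
```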
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
As a fourth example, let’s say that a separate transaction checks the inventory and acquires some more widgets if there are fewer than 10 widgets left:</div>
<ol style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px; margin-top: 0px;">
<li style="box-sizing: border-box;">IF (READ(Inventory) IN (10, 11, 12))</li>
<li style="box-sizing: border-box;"> Ship some new widgets to restock inventory via standard shipping</li>
<li style="box-sizing: border-box;">IF (READ(Inventory) < 10)</li>
<li style="box-sizing: border-box;"> Ship some new widgets to restock inventory via express shipping</li>
</ol>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
Note that this transaction reads the inventory twice. If the Purchase transaction runs in between steps (1) and (3) of this transaction, then a different value of inventory will be read each time. If the initial inventory before the Purchase transaction ran was 10, this would lead to the same restock request being made twice—once with standard shipping and once with express shipping. This is called the <span style="box-sizing: border-box; font-weight: 700;">non-repeatable read anomaly.</span></div>
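Simulated in Python (my own toy; the bare `inventory -= 1` stands in for a Purchase transaction committing between the two reads):

```python
inventory = 10
restock_requests = []

# Restock transaction begins:
if inventory in (10, 11, 12):        # first read of inventory sees 10
    restock_requests.append("standard shipping")

inventory -= 1                       # a concurrent Purchase commits here

if inventory < 10:                   # second read of inventory now sees 9
    restock_requests.append("express shipping")

print(restock_requests)  # ['standard shipping', 'express shipping']: restocked twice
```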
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
As a fifth example, imagine a transaction that scans the orders table in order to calculate the maximum price of an order and then scans it again to find the average order price. In between these two scans, an extremely expensive order gets inserted that skews the average so much that it becomes higher than the maximum price found in the previous scan. This transaction returns an average price that is larger than the maximum price—a clear impossibility and a bug that would never happen in a serializable system. This bug is slightly different than the non-repeatable read anomaly since every value that the transaction read stayed the same between the two scans—the source of the bug is that <span style="box-sizing: border-box; font-weight: 700;">additional records were inserted</span> in between these two scans. This is called the <span style="box-sizing: border-box; font-weight: 700;">phantom read anomaly</span>.</div>
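A quick numeric sketch of the phantom (the order prices are made-up values of my own):

```python
orders = [100, 120, 90]                  # order prices in the orders table

max_price = max(orders)                  # first scan: maximum is 120

orders.append(10_000)                    # phantom: a huge order is inserted

avg_price = sum(orders) / len(orders)    # second scan: average is 2577.5

print(avg_price > max_price)             # True: the "average" exceeds the "maximum"
```

Note that no value read by the first scan changed; the bug comes entirely from the newly inserted record.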
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
As a final example, assume that the application allows the price of the widget to change based on inventory. For example, many airlines increase ticket price as the inventory for a flight decreases. Assume that the application uses a formula to place a constraint on how these two variables interrelate—e.g. 10I + P >= $500 (where I is inventory and P is price). Before allowing a purchase to succeed, the purchase transaction has to check both the inventory and price to make sure the above constraint is not violated. If the constraint will not be violated, the update of inventory by that Purchase transaction may proceed. Similarly, a separate transaction that implements special promotional discounts may check both the inventory and price to make sure that the constraint is not violated when updating the price as part of a promotion. If it will not be violated, the price can be updated. Now: imagine these two transactions are running at the same time—they both read the old value of I and P and independently decide that their updates of inventory and price respectively will not violate the constraint. They therefore proceed with their updates. Unfortunately, this may result in a new value of I and P that violates the constraint! If one had run before the other, the first one would have succeeded, and the other would have read the values of I and P after the first one finished, detected that its update would violate the constraint, and therefore not proceeded. But since they were running concurrently, they both see the old value and incorrectly decide that they can proceed with the update. This bug is called the <span style="box-sizing: border-box; font-weight: 700;">write skew anomaly</span> because it happens when two transactions read the same data but update disjoint subsets of the data that was read. </div>
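The write skew can be reproduced with a few lines of Python (the starting values of I and P are my own; the constraint 10I + P >= 500 is the one from the example above):

```python
I, P = 45, 60                 # inventory and price: 10*45 + 60 = 510 >= 500

def constraint_holds(i, p):
    return 10 * i + p >= 500  # the application-level constraint

# Both transactions validate against the same old state:
purchase_ok  = constraint_holds(I - 1, P)   # 10*44 + 60 = 500 -> True
promotion_ok = constraint_holds(I, P - 5)   # 10*45 + 55 = 505 -> True

# Each then updates a disjoint item:
if purchase_ok:
    I -= 1                    # Purchase decrements inventory
if promotion_ok:
    P -= 5                    # Promotion discounts the price

print(constraint_holds(I, P))  # False: 10*44 + 55 = 495 < 500
```

Had either transaction re-validated after the other committed, it would have seen the violation and refused to proceed; concurrency is what lets both checks pass.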
<h3 style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 26px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; letter-spacing: -0.3px; line-height: 32px; margin-bottom: 20px; margin-top: 40px;">
Definitions in The ISO SQL Standard</h3>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
The SQL standard defines reduced isolation levels in terms of which of these anomalies are possible. In particular, it contains the following table:</div>
<table border="1">
<tbody>
<tr><th>Isolation Level</th><th>Dirty Read</th><th>Non-repeatable read</th><th>Phantom Read</th></tr>
<tr><td>READ UNCOMMITTED</td><td>Possible</td><td>Possible</td><td>Possible</td></tr>
<tr><td>READ COMMITTED</td><td>Not Possible</td><td>Possible</td><td>Possible</td></tr>
<tr><td>REPEATABLE READ</td><td>Not Possible</td><td>Not Possible</td><td>Possible</td></tr>
<tr><td>SERIALIZABLE</td><td>Not Possible</td><td>Not Possible</td><td>Not Possible</td></tr>
</tbody></table>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
<br />
There are many, many problems with how the SQL standard defines these isolation levels. Most of these problems were <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb;">already pointed out in 1995</a>, but inexplicably, revision after revision of the SQL standard has been released since that point without fixing these problems.</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
The first problem is that the standard only uses three types of anomalies to define each isolation level—dirty read, non-repeatable read, and phantom read. However, there are many types of concurrency bugs that can appear in practice—many more than just these three anomalies. In this post alone we have already described six unique types of anomalies. The SQL standard makes no mention of whether the READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ isolation levels are susceptible to the lost update anomaly, the dirty-write anomaly, or the write-skew anomaly. As a result, each commercial system is free to decide which of these other anomalies these reduced isolation levels are susceptible to—and in many cases these vulnerabilities are poorly documented, leading to confusion and unpredictable bugs for the application developer.</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
A second (related) problem is that using anomalies to define isolation levels only gives the end user a guarantee of what specific types of concurrency bugs are impossible. It does not give a precise definition of the potential database states that are viewable by any particular transaction. There are several improved and more precise definitions of isolation levels given in the academic literature. <a href="http://pmg.csail.mit.edu/papers/adya-phd.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb; text-decoration-line: none;">Atul Adya’s PhD thesis</a> gives a precise definition of the SQL standard isolation levels based on how reads and writes from different transactions may be interleaved. However, these definitions are given from the point of view of the system. The <a href="http://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb; text-decoration-line: none;">recent work by Natacha Crooks et al.</a> gives elegant and precise definitions from the point of view of the user. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
A third problem is that the standard does not define, nor provide correctness constraints on one of the most popular reduced isolation levels used in practice: snapshot isolation (nor any of its many variants—<a href="http://www.news.cs.nyu.edu/~jinyang/pub/walter-sosp11.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb; text-decoration-line: none;">PSI</a>, <a href="https://pages.lip6.fr/Marc.Shapiro/papers/NMSI-SRDS-2013.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb; text-decoration-line: none;">NMSI</a>, <a href="http://www.bailis.org/papers/ramp-sigmod2014.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb; text-decoration-line: none;">Read Atomic</a>, etc). By failing to provide a definition of snapshot isolation, differences in concurrency vulnerabilities allowed by snapshot isolation have emerged across systems. In general, snapshot isolation performs all reads of data as of a particular snapshot of the database state which contains only committed data. This snapshot remains constant throughout the lifetime of the transaction, so all reads are guaranteed to be repeatable (in addition to being only of committed data). Furthermore, concurrent transactions that write the same data detect conflicts with each other and typically resolve this conflict via aborting one of the conflicting transactions. This prevents the lost-update anomaly. However, conflicts are only detected if the conflicting transactions write an overlapping set of data. If the write sets are disjoint, these conflicts will not be detected. Therefore snapshot isolation is vulnerable to the write skew anomaly. Some implementations are also vulnerable to the phantom read anomaly. </div>
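The conflict rule can be modeled in a few lines (a deliberate oversimplification of mine—real snapshot isolation implementations track timestamps and versions): a committing transaction aborts only if its write set overlaps the write set of a concurrent, already-committed transaction; read sets are never checked.

```python
# Toy "first-committer-wins" check for snapshot isolation.
def try_commit(concurrent_committed_writes, write_set):
    for ws in concurrent_committed_writes:
        if ws & write_set:                  # write-write overlap -> conflict
            return "abort"
    concurrent_committed_writes.append(write_set)
    return "commit"

# Two Purchases both write 'inventory': the overlap is detected and one
# aborts, so the lost-update anomaly is prevented.
history = []
print(try_commit(history, {"inventory"}))   # commit
print(try_commit(history, {"inventory"}))   # abort

# Write skew: both transactions read inventory AND price, but write disjoint
# items, so no write-write overlap exists and both commit.
history = []
print(try_commit(history, {"inventory"}))   # commit
print(try_commit(history, {"price"}))       # commit -- constraint may now be violated
```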
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
A fourth problem is that the SQL standard seemingly gives two different definitions of the SERIALIZABLE isolation level. First, it defines SERIALIZABLE correctly: that the final result must be equivalent to a result that could have occurred if there were no concurrency. But then, it presents the above table, which seems to imply that as long as an isolation level does not allow dirty reads, non-repeatable reads, or phantom reads, it may be called SERIALIZABLE. Oracle has historically leveraged this ambiguity to justify calling its implementation of snapshot isolation “SERIALIZABLE”. To be honest, I think most people who read the ISO <a href="http://jtc1sc32.org/doc/N2301-2350/32N2311T-text_for_ballot-CD_9075-2.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb; text-decoration-line: none;">SQL standard</a> would come away believing the more precise definition of SERIALIZABLE given earlier in the document (which is the correct one) is the intention of the authors of the document. Nonetheless, I guess Oracle’s lawyers have looked at it and determined that there is enough ambiguity in the document to legally justify their reliance on the other definition. If any of my readers are aware of any real lawsuits that came from application developers who believed they were getting a SERIALIZABLE isolation level, but experienced write skew anomalies in practice, I would be curious to hear about them. Or if you are an application developer and this happened to you, I would also be curious to hear about it.</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
The bottom line is this: it is nearly impossible to give clear definitions of the different isolation levels available to application developers, because vagueness and ambiguities in the SQL standard have led to semantic differences across implementations/systems. </div>
<h3 style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 26px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; letter-spacing: -0.3px; line-height: 32px; margin-bottom: 20px; margin-top: 40px;">
What Isolation Level Should You Choose?</h3>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
My advice to application programmers is the following: reduced isolation levels are dangerous. It is very hard to figure out which concurrency bugs may present themselves. If every system defined its isolation levels using the methodology of Crooks et al. that I cited above, at least you would have a precise and formal definition of their associated guarantees. Unfortunately, the formalism of the Crooks paper is too advanced for most database users, so it is unlikely that database vendors will adopt these formalisms in their documentation any time soon. In the meantime, reduced isolation levels remain vaguely defined in practice and risky to use. </div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
Furthermore, even if you could know exactly which concurrency bugs are possible for a particular isolation level, writing an application in a way that these bugs will not happen in practice (or if they do, that they will not cause negative experiences for the users of the application) is also very challenging. If your database system gives you a choice, the right choice is usually to avoid lower isolation levels than serializable isolation (for the vast majority of database systems, you actually have to go and change the defaults to accomplish this!).</div>
<div style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px;">
However, there are three caveats:</div>
<br />
<ol style="box-sizing: border-box; color: #444444; font-family: acumin-pro, Helvetica, sans-serif; font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 16px; margin-top: 0px;">
<li style="box-sizing: border-box;"> As I mentioned above, some systems use the “SERIALIZABLE” isolation level to mean something weaker than true serializable isolation. Unfortunately, this means that simply choosing the “SERIALIZABLE” isolation level in your database system may not be sufficient to actually ensure serializability. You need to check the documentation to ensure that it defines “SERIALIZABLE” in the following way: that the visible state of the database is always equivalent to a state that could have occurred if there were no concurrency. Otherwise, your application will likely be vulnerable to the write-skew anomaly.</li>
<li style="box-sizing: border-box;">As mentioned above, the serializable isolation level comes with a performance cost. Depending on the quality of the system architecture, the performance cost of serializability may be large or small. In a recent research paper that I wrote with Jose Faleiro and Joe Hellerstein, we <a href="http://www.vldb.org/pvldb/vol10/p613-faleiro.pdf" style="border-bottom: 2px solid transparent; box-sizing: border-box; color: #323fcb; text-decoration-line: none;">showed</a> that in a well-designed system, the performance difference between SERIALIZABLE and READ COMMITTED can be negligible … and in some cases it is possible for the SERIALIZABLE isolation level to (surprisingly) outperform the READ COMMITTED isolation level. If you find that the cost of serializable isolation in your system is prohibitive, you should probably consider using a different database system before you consider settling for a reduced isolation level.</li>
<li style="box-sizing: border-box;">In distributed systems, there are important anomalies that can (and do) emerge even within the class of serializable isolation levels. For such systems, it is important to understand the subtle differences between the elements of this class (strict serializability is the safest). We will shed more light on this matter in a future post.</li>
</ol>
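To make the write-skew anomaly from the first caveat concrete, here is a minimal sketch in Python. Plain dictionaries stand in for a database running under snapshot isolation; the variable names and the invariant (x + y must stay non-negative) are purely illustrative. Each transaction validates the invariant against its own snapshot and writes a value the other did not write, so a first-committer-wins conflict check would pass both transactions, yet together they violate the invariant — exactly the behavior a truly serializable system would prevent.

```python
# Hypothetical illustration of write skew under snapshot isolation.
# Invariant the application wants to preserve: x + y >= 0.
db = {"x": 50, "y": 50}

# Both transactions take their snapshot before either one commits.
snap1 = dict(db)
snap2 = dict(db)

# T1: withdraw 100 against x, checking the invariant on its snapshot.
if snap1["x"] + snap1["y"] >= 100:
    db["x"] = snap1["x"] - 100

# T2: withdraw 100 against y, also validating against the old snapshot.
# It wrote nothing that T1 wrote, so neither transaction is aborted.
if snap2["x"] + snap2["y"] >= 100:
    db["y"] = snap2["y"] - 100

print(db["x"] + db["y"])  # -100: the invariant is violated
```

Under serializable isolation, one of the two transactions would be forced to run (logically) after the other, see the already-reduced total of 0, and fail its check.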
Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]4tag:blogger.com,1999:blog-8899645800948009496.post-61590684513805326422019-01-25T08:30:00.000-08:002019-12-04T14:54:50.891-08:00It’s Time to Move on from Two Phase Commit<div dir="ltr" id="docs-internal-guid-79900447-7fff-54aa-d826-6c77a1428553" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">The two-phase commit protocol (2PC) has been </span><a href="https://dl.acm.org/citation.cfm?id=7266" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">used in enterprise software systems for over three decades</span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">. It has been an incredibly impactful protocol for ensuring atomicity and durability of transactions that access data in multiple partitions or shards. It is used everywhere --- both in older “venerable” distributed systems, database systems, and file systems such as Oracle, IBM DB2, PostgreSQL, and Microsoft TxF (transactional NTFS), and in younger “millennial” systems such as MariaDB, TokuDB, VoltDB, Cloud Spanner, Apache Flink, Apache Kafka, and Azure SQL Database. If your system supports ACID transactions across shards/partitions/databases, there’s a high probability that it is running 2PC (or some variant thereof) under the covers. [Sometimes it’s even “over the covers” --- older versions of MongoDB </span><a href="https://docs.mongodb.com/v3.4/tutorial/perform-two-phase-commits/" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">required users to implement 2PC for multi-document transactions in application code</span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">.]</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">In this post, we will first describe 2PC: how it works and what problems it solves. Then, we will show some major issues with 2PC and how modern systems attempt to get around these issues. Unfortunately, these attempted solutions cause other problems to emerge. In the end, I will make the case that the next generation of distributed systems should avoid 2PC, and how this is possible.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: "arial"; font-size: 18.66px; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Overview of the 2PC protocol</span></h3>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">There are many variants of 2PC, but the basic protocol works as follows: </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Background assumption: The work entailed by a transaction has already been divided across all of the shards/partitions that store data accessed by that transaction. We will refer to the effort performed at each shard as being performed by the “worker” for that shard. Each worker is able to start working on its responsibilities for a given transaction independently of the others. The 2PC protocol begins at the end of transaction processing, when the transaction is ready to “commit”. It is initiated by a single, coordinator machine (which may be one of the workers involved in that transaction).</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">The basic flow of the 2PC protocol is shown in the figure below. [The protocol begins at the top of the figure and then proceeds in a downward direction.]</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<br />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"><img height="218" src="https://lh6.googleusercontent.com/F7giuNAtrYsZz1ZYZPOuqXnnYYfUTiXq-IU-wKkiRbnKqhnEehTA1cIwdDTIQ5MylDjbsKhbjYLyZJkt-HFQQ50BaMtS2Vel_vr0II5vbjl2K6UX3tjQTnOb75ICgeo7IQDmIyeG" style="border: 0px none rgb(0, 0, 0); transform: matrix(1, 0, 0, 1, 0, 0);" width="400" /></span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<br />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Phase 1: A coordinator asks each worker whether they have successfully completed their responsibilities for that transaction and are ready to commit. Each worker responds ‘yes’ or ‘no’.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Phase 2: The coordinator counts all the responses. If every worker responded ‘yes’, then the transaction will commit. Otherwise, it will abort. The coordinator sends a message to each worker with the final commit decision and receives an acknowledgement back.</span></div>
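The two phases above can be sketched in a few lines of Python. This is a hypothetical, single-process stand-in: the `Worker` class, its `prepare`/`finish` methods, and the in-memory `log` lists are illustrative substitutes for real networked workers exchanging messages and writing to durable storage.

```python
# Minimal single-process sketch of the 2PC flow (illustrative only).
class Worker:
    def __init__(self, can_commit):
        self.can_commit = can_commit  # did this shard's local work succeed?
        self.log = []                 # stand-in for durable storage

    def prepare(self, txn_id):
        # Phase 1: the worker durably records its vote BEFORE replying,
        # since after voting 'yes' it may no longer abort unilaterally.
        self.log.append((txn_id, "vote", self.can_commit))
        return self.can_commit

    def finish(self, txn_id, decision):
        # Phase 2: apply the coordinator's final decision and acknowledge.
        self.log.append((txn_id, decision))
        return True

def two_phase_commit(workers, txn_id):
    # Phase 1: collect votes. A single 'no' vetoes the transaction.
    votes = [w.prepare(txn_id) for w in workers]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: disseminate the decision and collect acknowledgements.
    acks = [w.finish(txn_id, decision) for w in workers]
    assert all(acks)
    return decision

print(two_phase_commit([Worker(True), Worker(True)], 1))   # commit
print(two_phase_commit([Worker(True), Worker(False)], 2))  # abort
```

In a real implementation the coordinator would also durably log its decision between the two phases, before sending any phase-2 message, so that a crash after the decision point cannot change the outcome.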
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">This mechanism ensures the atomicity property of transactions: either the entire transaction will be reflected in the final state of the system, or none of it. If even just a single worker cannot commit, then the entire transaction will be aborted. In other words: each worker has “veto-power” for a transaction. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">It also ensures transaction durability. Each worker ensures that all of the writes of a transaction have been durably written to storage prior to responding ‘yes’ in phase 1. This gives the coordinator freedom to make a final decision about a transaction without concern for the fact that a worker may fail after voting ‘yes’. [In this post, we are being purposefully vague when using the term “durable writes” --- this term can either refer to writing to local non-volatile storage or, alternatively, replicating the writes to enough locations for it to be considered “durable”.]</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">In addition to durably writing the writes that are directly required by the transaction, the protocol itself requires additional writes that must be made durable before it can proceed. For example, a worker has veto power until the point it votes ‘yes’ in phase 1. After that point, it cannot change its vote. But what if it crashes right after voting ‘yes’? When it recovers it might not know that it voted ‘yes’, and still think it has veto power and go ahead and abort the transaction. To prevent this, it must write its vote durably before sending the ‘yes’ vote back to the coordinator. [In addition to this example, in standard 2PC, there are two other writes that are made durable prior to sending messages that are part of the protocol.]</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<br />
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: "arial"; font-size: 18.66px; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">The problems with 2PC</span></h3>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">There are two major problems with 2PC. The first is well known, and discussed in every reputable textbook that presents 2PC. The second is much less well known, but a major problem nonetheless.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">The well-known problem is referred to as the “blocking problem”. This happens when every worker has voted ‘yes’, but the coordinator fails before sending a message with the final decision to at least one worker. The reason why this is a problem is that by voting ‘yes’, each worker has removed its power to veto the transaction. However, the coordinator still has absolute power to decide the final state of a transaction. In this situation, the workers cannot get together to make a decision amongst themselves --- they can’t abort because maybe the coordinator decided to commit before it failed, and they can’t commit because maybe the coordinator decided to abort before it failed. Thus, they have to block --- wait until the coordinator recovers --- in order to find out the final decision. In the meantime, they cannot process transactions that conflict with the stalled transaction since the final outcome of the writes of that transaction is yet to be determined. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">There are two categories of work-arounds to the blocking problem. The first category of work-around modifies the core protocol in order to eliminate the blocking problem. Unfortunately, these modifications reduce the performance --- typically by adding an extra round of communication --- and thus are rarely used in practice. The second category keeps the protocol intact but reduces the probability of the types of coordinator failure that can lead to the blocking problem --- for example, by running 2PC over replica consensus protocols and ensuring that important state for the protocol is replicated at all times. Unfortunately, once again, these work-arounds reduce performance, since the protocol requires that these replica consensus rounds occur sequentially, and thus they may add significant latency to the protocol. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">The lesser-known problem is what I call the “cloggage problem”. 2PC occurs after a transaction is processed, and thus necessarily increases the latency of the transaction by an amount equal to the time it takes to run the protocol. This latency increase alone can already be an issue for many applications, but a potentially larger issue is that worker nodes do not know the final outcome of a transaction until mid-way through the second phase. Until they know the final outcome, they have to be prepared for the possibility that it might abort, and thus they typically prevent conflicting transactions from making progress until they are certain that the transaction will commit. These blocked transactions in turn block other transactions from running, and so on, until 2PC completes and all of the blocked transactions can resume. This cloggage further increases the average transaction latency and also decreases transactional throughput.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">To summarize the problems we discussed above: 2PC poisons a system along four dimensions: <b>latency </b>(the time of the protocol plus the stall time of conflicting transactions), <b>throughput </b>(because it prevents conflicting transactions from running during the protocol), <b>scalability </b>(the larger the system, the more likely transactions become multi-partition and have to pay the throughput and latency costs of 2PC), and <b>availability </b>(the blocking problem we discussed above). <i>Nobody likes 2PC, but for decades, people have assumed that it is a necessary evil.</i></span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: "arial"; font-size: 18.66px; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">It’s time to move on</span></h3>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">For over three decades, we’ve been stuck with two-phase commit in sharded systems. People are aware of the performance, scalability, and availability problems it introduces, but nonetheless continue on, with no obvious better alternative.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">The truth is, if we would just architect our systems differently, the need for 2PC would vanish. There have been some attempts to accomplish this --- both in academia (such as <a href="https://cs.uwaterloo.ca/~kdaudjee/courses/cs848/papers/non2PC.pdf" target="_blank">this SIGMOD 2016 paper</a>) and industry. However, these attempts typically work by avoiding multi-sharded transactions in the first place, such as by repartitioning data in advance of a transaction so that it is no longer multi-sharded. Unfortunately, this repartitioning reduces performance of the system in other ways.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">What I am calling for is a deeper type of change in the way we architect distributed systems. I insist that systems should still be able to process multi-sharded transactions --- with all of the ACID guarantees that entails, such as atomicity and durability --- but with much simpler and faster commit protocols. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">It all comes down to a fundamental assumption that has been present in our systems for decades: a transaction may abort at any time and for any reason.</span><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> Even if I run the same transaction on the same initial system state … if I run it at 2:00PM it may commit, but at 3:00PM it may abort.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">There are several reasons why most architects believe we need this assumption. First, a machine may fail at any time --- including in the middle of a transaction. Upon recovery, it is generally impossible to recreate all of the state of that transaction that was in volatile memory prior to the failure. As a result, it is seemingly impossible to pick up where the transaction left off prior to the failure. Therefore, the system aborts all transactions that were in progress at the time of the failure. Since a failure can occur at any time, this means that a transaction may abort at any time.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Second, most concurrency control protocols require the ability to abort a transaction at any time. Optimistic protocols perform a “validation” phase after processing a transaction. If validation fails, the transaction aborts. Pessimistic protocols typically use locks to prevent concurrency anomalies. This use of locks may lead to deadlock, which is resolved by aborting (at least) one of the deadlocked transactions. Since deadlock may be discovered at any time, the transaction needs to retain the ability to abort at any time.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">If you look carefully at the two-phase commit protocol, you will see that this arbitrary potential to abort a transaction is the primary source of complexity and latency in the protocol. Workers cannot easily tell each other whether they will commit or not, because they might still fail after this point (before the transaction is committed) and want to abort this transaction during recovery. Therefore, they have to wait until the end of transaction processing (when all important state is made durable) and proceed in the necessary two phases: in the first phase, each worker publicly relinquishes its control to abort a transaction, and only then can the second phase occur in which a final decision is made and disseminated.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">In my opinion we need to </span><span style="background-color: white; font-family: "arial"; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">remove veto power from workers </span><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">and </span><span style="background-color: white; font-family: "arial"; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">architect systems in which the system does not have freedom to abort a transaction whenever it wants</span><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> during its execution. Only logic within a transaction should be allowed to cause a transaction to abort. If it is theoretically possible to commit a transaction given the current state of the database, that transaction must commit, no matter what types of failures occur. Furthermore, there must not be race conditions relative to other concurrently running transactions that can affect the final commit/abort state of a transaction.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Removing abort flexibility sounds hard. We’ll discuss soon how to accomplish this. But first let’s observe how the commit protocol changes if transactions don’t have abort flexibility. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: "arial"; font-size: 18.66px; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">What a commit protocol looks like when transactions can’t abort arbitrarily</span></h3>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Let’s look at two examples:</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">In the first example, assume that the worker for the shard that stores the value for variable X is assigned a single task for a transaction: change the value of X to 42. Assume (for now) that there are no integrity constraints or triggers defined on X (which may prevent the system from setting X to 42). In such a case, that worker is never given the power to be able to abort the transaction. No matter what happens, that worker must change X to 42. If that worker fails, it must change X to 42 after it recovers. Since it never has any power to abort, there is no need to check with that worker during the commit protocol to see if it will commit. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">In the second example, assume that the worker for the shard that stores the value for variables Y and Z is assigned two tasks for a transaction: subtract 1 from the previous value of Y and set Z to the new value of Y. Furthermore, assume that there is an integrity constraint on Y that states that Y can never go below 0 (e.g., if it represents the inventory of an item in a retail application). Therefore, this worker has to run the equivalent of the following code: </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> IF (Y > 0)</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> Subtract 1 from Y</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> ELSE</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> ABORT the transaction</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> Z = Y</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">This worker must be given the power to abort the transaction, since this is required by the logic of the application. However, this power is limited: only if the initial value of Y was 0 can this worker abort the transaction. Otherwise, it has no choice but to commit. Therefore, it doesn’t have to wait until it has completed the transaction code before knowing whether it will commit. On the contrary: as soon as it has finished executing the first line of code in the transaction, it already knows its final commit/abort decision. This implies that the commit protocol will be able to start much earlier relative to 2PC.</span></div>
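The pseudocode above can be run directly as a small Python sketch (the worker function and state dictionary are hypothetical names for this illustration); note that the decision is fixed as soon as the constraint check on Y finishes, before Z is ever written:

```python
def yz_worker(state):
    """Hypothetical worker for the shard holding Y and Z. Its final
    commit/abort decision is known as soon as the constraint check on Y
    completes -- before Z is ever written."""
    if state["Y"] > 0:
        decision = "commit"   # decision fixed here; nothing below can abort
        state["Y"] -= 1
        state["Z"] = state["Y"]
    else:
        decision = "abort"    # the only abort the application logic allows
    return decision

state = {"Y": 3, "Z": 0}
assert yz_worker(state) == "commit" and state == {"Y": 2, "Z": 2}

state = {"Y": 0, "Z": 7}
assert yz_worker(state) == "abort" and state == {"Y": 0, "Z": 7}
```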
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Let’s now combine these two examples into a single example in which a transaction is being performed by two workers --- one doing the work described in the first example, and the other doing the work described in the second example. Since we are guaranteeing atomicity, the first worker cannot simply blindly set X to 42. Rather, its own work must also be dependent on the value of Y. In effect, its transaction code becomes:</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> temp = Do_Remote_Read(Y)</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> if (temp > 0)</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> X = 42</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Note that if the first worker’s code is written in this way, the code for the other worker can be simplified to just: </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> IF (Y > 0)</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> Subtract 1 from Y</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> Z = Y</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">By writing the transaction code in this way, we have removed explicit abort logic from both workers. Instead, both workers have if statements that check for the constraint that would have caused the original transaction to abort. If the original transaction would have aborted, both workers end up doing nothing. Otherwise, both workers change the values of their local state as required by the transaction logic. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">The important thing to note at this point is that </span><span style="background-color: white; font-family: "arial"; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">the need for a commit protocol has been totally eliminated in the above code</span><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">. The system is not allowed to abort a transaction for any reason other than conditional logic defined by application code on a given state of the data. And all workers condition their writes on this same conditional logic so that they can all independently decide to “do nothing” in those situations where a transaction cannot complete as a result of current system state. Thus, all possibility of a transaction abort has been removed, and there is no need for any kind of distributed protocol at the end of transaction processing to make a combined final decision about the transaction. All of the problems of 2PC have been eliminated. There is no blocking problem because there is no coordinator. And there is no cloggage problem, because all necessary checks are overlapped with transaction processing instead of after it completes. </span></div>
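As an illustration of why no commit protocol remains, the two rewritten workers can be simulated in a few lines of Python (the shard dictionaries and the stand-in for Do_Remote_Read are invented for this sketch): both workers branch on the same check of Y and never exchange votes.

```python
def run_transaction(shard_x, shard_yz):
    """Both workers condition their writes on the same check of Y, so each
    can independently 'do nothing' -- no votes, no coordinator, no 2PC."""
    y_positive = shard_yz["Y"] > 0   # stands in for Do_Remote_Read(Y) > 0

    # Worker 1 (shard holding X):
    if y_positive:
        shard_x["X"] = 42

    # Worker 2 (shard holding Y and Z):
    if y_positive:
        shard_yz["Y"] -= 1
        shard_yz["Z"] = shard_yz["Y"]

x, yz = {"X": 0}, {"Y": 1, "Z": 9}
run_transaction(x, yz)
assert x == {"X": 42} and yz == {"Y": 0, "Z": 0}

x, yz = {"X": 0}, {"Y": 0, "Z": 9}
run_transaction(x, yz)   # the would-be abort: both workers do nothing
assert x == {"X": 0} and yz == {"Y": 0, "Z": 9}
```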
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Moreover, as long as the system is not allowed to abort a transaction for any reason other than the conditional application logic based on input data state, it is always possible to rewrite any transaction as we did above in order to replace abort logic in the code with if statements that conditionally check the abort conditions. Furthermore, it is possible to accomplish this without actually rewriting application code. [The details of how to do this are out of scope for this post, but to summarize at a high level: shards can set special system-owned boolean flags when they have completed any conditional logic that could cause an abort, and it is these boolean flags that are remotely read from other shards.] </span></div>
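A very rough sketch of the flag mechanism, with all class and method names invented since the post deliberately leaves the details out of scope: the shard that owns the conditional logic publishes a system-owned boolean once its abort-capable checks complete, and other shards condition their writes on a remote read of that flag rather than on rewritten application code.

```python
class Shard:
    """Hypothetical shard wrapper: application code is untouched; the
    system publishes a boolean flag once abort-capable logic finishes."""
    def __init__(self, data):
        self.data = data
        self.flags = {}   # txn_id -> outcome of this shard's abort checks

    def finish_conditional_logic(self, txn_id, ok):
        self.flags[txn_id] = ok    # system-owned flag, set exactly once

    def remote_read_flag(self, txn_id):
        return self.flags[txn_id]  # what other shards read remotely

yz_shard = Shard({"Y": 2, "Z": 0})
yz_shard.finish_conditional_logic(txn_id=7, ok=yz_shard.data["Y"] > 0)

x_shard = Shard({"X": 0})
if yz_shard.remote_read_flag(txn_id=7):   # write conditioned on the flag
    x_shard.data["X"] = 42
```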
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">In essence, there are two types of aborts possible in transaction processing systems: (1) those caused by the state of the data, and (2) those caused by the system itself (e.g., failures or deadlocks). Category (1) can always be written in terms of conditional logic on the data, as we did above. So if you can eliminate category (2) aborts, the commit protocol can be eliminated.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">So now, all we have to do is explain how to eliminate category (2) aborts.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: "arial"; font-size: 18.66px; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Removing system-induced aborts</span></h3>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">I have spent almost an entire decade designing systems that do not allow system-induced aborts. Examples of such systems are </span><a href="http://www.cs.umd.edu/~abadi/papers/calvin-sigmod12.pdf" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Calvin</span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">, </span><a href="http://www.cs.umd.edu/~abadi/papers/calvinfs.pdf" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">CalvinFS</span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">, </span><a href="http://www.cs.umd.edu/~abadi/papers/orthrus-sigmod16.pdf" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Orthrus</span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">, </span><a href="http://www.cs.umd.edu/~abadi/papers/early-write-visibility.pdf" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">PVW</span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">, and a </span><a href="http://www.cs.umd.edu/~abadi/papers/lazy-xacts.pdf" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">system that processes transactions lazily</span></a><span 
style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">. The impetus for this feature came from the first of these projects --- Calvin --- because it is a deterministic database system. A deterministic database guarantees that there is only one possible final state of the data in the database given a defined set of input requests. It is therefore possible to send the same input to two distinct replicas of the system and be certain that the replicas will process this input independently and end up in the same final state, without any possibility of divergence. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">System-induced aborts such as system failures or concurrency control race conditions are fundamentally nondeterministic events. It is very possible that one replica will fail or enter a race condition while the other replica will not. If these nondeterministic events were allowed to cause a transaction to abort, then one replica might abort a transaction while the other commits --- a fundamental violation of the deterministic guarantee. Therefore, we had to design Calvin in a way that failures and race conditions cannot cause a transaction to abort. For concurrency control, Calvin used pessimistic locking with a deadlock avoidance technique that ensured that the system would never get into a situation where it had to abort a transaction due to deadlock. In the face of a system failure, Calvin did not pick up a transaction exactly where it left off (because of the loss of volatile memory during the failure). Nonetheless, it was able to </span><span style="background-color: white; font-family: "arial"; white-space: pre-wrap;">bring the processing of that transaction to completion without having to abort it. It accomplished this</span><span style="background-color: white; font-family: "arial"; white-space: pre-wrap;"> by restarting the transaction from the same original input. </span></div>
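The determinism argument can be illustrated with a toy Python sketch (the input log and apply function are hypothetical): two replicas fed the same durable input log independently reach the same final state, so a recovering replica can simply replay from the original input instead of aborting anything.

```python
def deterministic_apply(db, txn):
    """A deterministic transaction: the same input against the same
    starting state always yields the same final state (the abort
    condition is just an if statement that does nothing)."""
    if db["Y"] > 0:
        db["Y"] -= txn["amount"]
        db["Z"] = db["Y"]

# Hypothetical durable input log; the fourth transaction fails its check.
input_log = [{"amount": 1} for _ in range(4)]

replica_a = {"Y": 3, "Z": 0}
replica_b = {"Y": 3, "Z": 0}

for txn in input_log:
    deterministic_apply(replica_a, txn)

# Replica B "crashes" and loses volatile state; it recovers by replaying
# the same input log from the start -- no transaction is ever aborted.
for txn in input_log:
    deterministic_apply(replica_b, txn)
```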
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Neither of these solutions --- deadlock avoidance and transaction restart upon a failure --- is limited to deterministic database systems. [Transaction restart gets a little tricky in nondeterministic systems if some of the volatile state associated with a transaction that was lost during a failure was observed by other machines that did not fail. But there are simple ways to solve this problem that are out of scope for this post.] Indeed, some of the other systems I linked to above are nondeterministic systems. Once we realized the power that comes with removing system-level aborts, we built this feature into every system we built after the Calvin project --- even the nondeterministic systems.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: "arial"; font-size: 18.66px; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Conclusion</span></h3>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">I see very little benefit in system architects making continued use of 2PC in sharded systems moving forward. I believe that removing system-induced aborts and rewriting state-induced aborts is the better way forward. Deterministic database systems such as Calvin or </span><a href="https://fauna.com/" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">FaunaDB </span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> always remove system-induced aborts anyway, and thus usually can avoid 2PC as we described above. But it is a huge waste to limit this benefit to only deterministic databases. It is not hard to remove system-induced aborts from nondeterministic systems. Recent projects have shown that it is even possible to remove system-induced aborts in systems that use concurrency control techniques other than pessimistic concurrency control. For example, both the PVW and the lazy transaction processing systems we linked to above use a variant of multi-versioned concurrency control. And FaunaDB uses a variant of optimistic concurrency control. </span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">In my opinion there is very little excuse to continue with antiquated assumptions regarding the need for system-induced aborts. In the old days, when systems ran on single machines, such assumptions were justifiable. However, in modern times, where many systems need to scale to multiple machines that can fail independently of each other, these assumptions require expensive coordination and commit protocols such as 2PC. The performance problems of 2PC have been a major force behind the rise of non-ACID-compliant systems that give up important guarantees in order to achieve better scalability, availability, and performance. 2PC is just too slow --- it increases the latency of all transactions, not just by the length of the protocol itself, but also by preventing transactions that access the same set of data from running concurrently. 2PC also limits scalability (by reducing concurrency) and availability (the blocking problem we discussed above). The way forward is clear: we need to reconsider antiquated assumptions when designing our systems and say “good-bye” to two-phase commit!</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<br />
<span style="font-size: xx-small;">[This article includes a description of work that was funded by the NSF under grant IIS-1763797. Any opinions, findings, and conclusions or recommendations expressed in this article are those of Daniel Abadi and do not necessarily reflect the views of the National Science Foundation.]</span>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]34tag:blogger.com,1999:blog-8899645800948009496.post-48907544156653822152018-12-14T07:00:00.000-08:002019-12-04T14:56:12.532-08:00Partitioned consensus and its impact on Spanner’s latency<div dir="ltr" id="docs-internal-guid-752a93a8-7fff-c332-00cf-b0874cf3a4eb" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In a </span><a href="http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline; white-space: pre-wrap;">post that I published in September</span></a><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, I described two primary approaches for performing consensus in distributed systems, and how the choice of approach affects the consistency guarantees of the system. In particular, consensus can either be </span><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">unified</span><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, such that all writes in the system participate in the same distributed consensus protocol, or it can be </span><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">partitioned</span><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, such that different subsets of the data participate in distinct consensus protocols. </span></div>
<div dir="ltr" id="docs-internal-guid-5bc41c2e-7fff-e5b7-3c83-41638e91bc81" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The primary downside of partitioned consensus is that achieving global consistency is much more complicated. Consistency guarantees require that requests submitted after previous requests complete will never “go back in time” and view a state of the system that existed prior to the completed request. Such guarantees are hard to enforce in partitioned consensus systems since different partitions operate independently from each other: consistency requires a notion of “before” and “after” --- even for events on separate machines or separate partitions. For partitions that operate completely independently, the most natural way to define “before” and “after” is to use real time --- the time on the clocks of the different partitions. However, clocks tend to skew at the millisecond granularity, and keeping clocks in sync is nontrivial. We discussed how Google has a hardware solution that aids in clock synchronization, while other solutions attempt to use software-only clock synchronization algorithms. </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In contrast, unified consensus results in a global order of all requests. This global order can be used to implement the notion of “before” and “after” without having to rely on local clock time, which entirely avoids the need for clock synchronization. This results in stronger consistency guarantees: unified consensus systems can guarantee consistency at all times, while partitioned consensus systems can only guarantee consistency if the clock skew stays within an expected bound. For software-only implementations, it is hard to avoid occasionally violating the maximum clock skew bound assumption, and the violations themselves may not be discoverable. Therefore, unified consensus is the safer option.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The post led to several interesting debates, most of which are beyond the scope of this post. However, there was one interesting debate I’d like to explore more deeply in this post. In particular, the question arose: Are there any fundamental </span><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">latency </span><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">differences between unified-consensus systems and partitioned-consensus systems? When I read the comments to my post (both on the post itself and also external discussion boards), I noticed that there appears to be a general assumption amongst my readers that unified consensus systems must have higher latency than partitioned consensus systems. One reader even accused me of purposely avoiding discussing latency in that post in order to cover up a disadvantage of unified consensus systems. In this post, I want to clear up some of the misconceptions and inaccurate assumptions around these latency tradeoffs, and present a deeper (and technical) analysis on how these different approaches to consensus have surprisingly broad consequences on transaction latency. We will analyze the latency tradeoff from three perspectives: (1) Latency for write transactions, (2) Latency for linearizable read-only transactions and (3) Latency for serializable snapshot transactions.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: Arial; font-size: 18.66px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Latency for write transactions</span></h3>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The debate around write transactions is quite interesting since valid arguments can be made for both sides. </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The partitioned-consensus side points out two latency downsides of unified consensus: (1) As mentioned in </span><a href="http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline; white-space: pre-wrap;">my previous post</span></a><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">, in order to avoid scalability bottlenecks, unified consensus algorithms perform consensus batch-at-a-time instead of on individual requests. They, therefore, have to pay the additional latency of collecting transactions into batches prior to running consensus. In the original Calvin paper, batch windows were 10ms (so the average latency would be 5ms); however, we have subsequently halved the batch window latency in my labs at Yale/UMD. </span><a href="http://www.fauna.com/" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline; white-space: pre-wrap;">FaunaDB</span></a><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (which uses Calvin’s unified consensus approach) also limits the batch window to 10ms. (2) For unified consensus, there will usually be one extra network hop to get the request to the participant of the consensus protocol for its local region. 
This extra hop is local, and therefore can be assumed to take single-digit milliseconds. If you combine latency sources (1) and (2), the extra latency incurred by the preparation stage for unified consensus is approximately 10-15ms. </span></div>
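The arithmetic behind that 10-15ms estimate can be written out explicitly (a back-of-the-envelope model, with a made-up function name, not code from any system): a request waits half the batch window on average, plus one extra local network hop.

```python
def unified_extra_latency_ms(batch_window_ms, local_hop_ms):
    """Expected pre-consensus latency for unified consensus: a request
    waits half the batch window on average, plus one local network hop."""
    return batch_window_ms / 2 + local_hop_ms

# A 10ms batch window plus a single-digit-millisecond local hop lands
# in the 10-15ms ballpark cited above.
extra = unified_extra_latency_ms(10, 5)
```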
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">On the other hand, the unified-consensus side points out that multi-partition atomic write transactions require two-phase commit for partitioned-consensus systems. For example, let’s say that a transaction writes data in two partitions: A and B. In a partitioned-consensus system, the write that occurred in each partition achieves consensus separately. It is very possible that the consensus in partition A succeeds while in B it fails. If the system guarantees atomicity for transactions, then the whole transaction must fail, which requires coordination across the partitions --- usually via two-phase commit. Two-phase commit can result in significant availability outages (if the coordinator fails at the wrong time) unless it runs on top of consensus protocols. Thus, Spanner and Spanner-derivative systems all run two-phase commit over partitioned consensus for multi-partition transactions. The latency cost of the Raft/Paxos protocol itself (once it gets started) is the same for unified and partitioned consensus, but two-phase commit requires </span><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">two rounds of consensus</span><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> to commit such transactions. A single round of consensus may take between 5ms and 200ms, depending on how geographically dispersed the deployment is. Since Spanner requires two sequential rounds, the minimum transaction latency is double that --- from 10ms for single-region deployments to 400ms for geographically dispersed deployments. </span></div>
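The two-sequential-rounds arithmetic can be made explicit with a back-of-envelope model (numbers are taken from this post; the 12.5ms unified preparation figure is simply the midpoint of the 10-15ms estimate above, and the function names are illustrative):

```python
# Back-of-envelope commit-latency model (illustrative; numbers from the post).
def partitioned_commit_ms(round_ms: float) -> float:
    # 2PC over consensus: a prepare round, then a commit round, run sequentially.
    return 2 * round_ms

def unified_commit_ms(round_ms: float, prep_ms: float = 12.5) -> float:
    # One consensus round plus batching + forwarding preparation (~10-15ms).
    return round_ms + prep_ms

single_region = (partitioned_commit_ms(5), unified_commit_ms(5))
geo_dispersed = (partitioned_commit_ms(200), unified_commit_ms(200))
```

In a single region the fixed preparation cost dominates (10ms vs. 17.5ms for a multi-partition commit), while in a geographically dispersed deployment the second consensus round dominates (400ms vs. 212.5ms), which is part of why the comparison is workload and deployment dependent.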
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In practice, two-phase commit has an additional issue: a transaction cannot commit until all partitions vote to commit. A simple majority is not sufficient --- rather, every partition must vote to commit. A single slow partition (for example, a partition undergoing leader election) stalls the entire transaction. This increases the observed long tail latency in proportion to the number of partitions a transaction touches.</span></div>
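The long-tail effect can be illustrated with a toy simulation (the latency distribution is an assumption for illustration: 99% of partition votes take 5ms and 1% hit a 100ms stall, such as a partition mid-leader-election):

```python
import random

# Toy tail-latency model: a 2PC commit finishes only when the *slowest*
# touched partition has voted, so commit latency is a max over partitions.
# (Per-partition vote latency distribution is an invented assumption.)
random.seed(0)

def mean_commit_ms(n_partitions: int, trials: int = 10_000) -> float:
    total = 0.0
    for _ in range(trials):
        votes = (5.0 if random.random() < 0.99 else 100.0
                 for _ in range(n_partitions))
        total += max(votes)  # wait for every vote, i.e. the slowest one
    return total / trials
```

A transaction touching eight partitions sees a noticeably higher mean (and a far worse tail) than one touching a single partition, because any one stall delays the whole commit.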
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In contrast, unified consensus systems such as Calvin and its derivatives such as FaunaDB do not require two-phase commit. [Side note: a discussion of how to avoid two-phase commit in a unified consensus system can be found in </span><a href="http://www.vldb.org/pvldb/vol7/p821-ren.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline; white-space: pre-wrap;">this VLDB paper</span></a><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">. FaunaDB’s approach is slightly different, but the end result is the same: no two-phase commit.] As a result, unified consensus systems such as Calvin and FaunaDB only require one round of consensus to commit all transactions --- even transactions that access data on many different machines. </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The bottom line is that the better latency option between unified and partitioned consensus for write transactions is somewhat workload dependent. Unified consensus increases latency for all transactions by a little, but partitioned consensus can increase latency by a lot more for multi-partition transactions. For most applications, it is impossible to avoid multi-partition transactions. For example, many applications allow transactional interactions between arbitrary users (payments between users, exchanges between users, “friendship” status updates between users, gaming interactions, etc.). Although it is possible to group users into partitions such that many of their interactions will be with other users within that partition (e.g. partition by a user’s location), as long as arbitrary interactions are allowed, there will always be interactions between users in different partitions. These multi-partition interactions are much slower in partitioned-consensus systems. Thus, for most workloads, unified consensus is the better latency option for write transactions.</span></div>
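The inevitability of cross-partition interactions also follows from simple probability. A minimal sketch, under the simplifying assumption that two interacting users are placed independently and uniformly across partitions:

```python
# If two interacting users are placed independently and uniformly across P
# partitions, they share a partition with probability 1/P, so the fraction of
# multi-partition transactions grows toward 1 as the app scales out.
# (Uniform placement is a simplifying assumption; location-based partitioning
# raises the same-partition odds but cannot eliminate cross-partition pairs.)
def multi_partition_fraction(num_partitions: int) -> float:
    return 1.0 - 1.0 / num_partitions
```

Location-based partitioning improves the odds that a pair shares a partition, but as long as arbitrary pairs can interact, the cross-partition fraction never reaches zero, and it grows as the partition count grows.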
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: Arial; font-size: 18.66px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Latency for linearizable read-only transactions</span></h3>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Linearizable read-only transactions are generally sent to the consensus leader’s region and performed (or at least assigned a timestamp) there [other options exist, but this is what Spanner and other systems mentioned in my previous post do]. In unified-consensus, there is only one leader region for the whole system. Linearizable read transactions that initiate from near this region will be processed with low latency, while transactions that initiate from farther away will observe correspondingly higher latency. </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Meanwhile, in partitioned-consensus, there is one leader per partition, and these leaders can be located in different regions. The partitioned-consensus supporters argue that this can lead to lower latency in an array of common scenarios. An application developer can specify a location-based partitioning algorithm. All data that is usually accessed from region X should be placed in the same partition, with a consensus leader located in region X. All data that is usually accessed from region Y should be placed in the same partition, with a consensus leader located in region Y. In doing so, a larger number of read-only transactions will observe lower latency. </span></div>
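This placement idea can be sketched as a routing table (the regions, RTTs, and partition names here are hypothetical, and, as noted below, Spanner does not actually implement this optimization):

```python
# Sketch of location-aware leader placement under partitioned consensus.
# Each partition's consensus leader is pinned in the region that usually
# accesses it, so local linearizable reads avoid a WAN hop.
LEADER_REGION = {"us-users": "us-east", "eu-users": "eu-west"}
RTT_MS = {
    ("us-east", "us-east"): 2, ("eu-west", "eu-west"): 2,
    ("us-east", "eu-west"): 80, ("eu-west", "us-east"): 80,
}

def linearizable_read_ms(client_region: str, partition: str) -> int:
    # A linearizable read is served from the partition's leader region.
    return RTT_MS[(client_region, LEADER_REGION[partition])]
```

US clients reading US-partitioned data stay local (2ms in this toy model); only accesses that cross the partitioning scheme pay the wide-area round trip.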
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The downside of this approach is that it breaks the abstraction of the consensus protocol as a separate component of the system --- now the consensus protocol and data storage layer become much more intertwined, making the system more monolithic. Furthermore, consensus protocols run leader elections after a lease expires, and would have to reduce the democratic nature of this protocol in order to ensure the leader remains in the closest possible region. Finally, it increases complexity and reduces the flexibility of the partitioning protocol. As far as I know, the most well-known example of a partitioned-consensus system --- Spanner --- does not take advantage of this potential optimization for these reasons. </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Consequently, although there is, in theory, a potential latency advantage for partitioned-consensus systems for linearizable read-only transactions, in practice this advantage is not realized.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">On the other hand, there is a fundamental latency advantage for unified-consensus systems in the presence of multi-partition transactions. A multi-partition transaction in a partitioned-consensus system must involve more than one leader. The leaders of each partition accessed by the read transaction must communicate with each other in order to determine a timestamp at which this linearizable read can be safely performed (see sections 4.1.4 and 4.2.2 of </span><a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline; white-space: pre-wrap;">the Spanner paper</span></a><span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">). Alternatively, a timestamp sufficiently far into the future (at the end of the clock skew uncertainty bound) can be chosen at which to perform the read. Either way --- whether it is communication across leaders (that may be located in different geographic regions) or whether it is waiting until the end of the clock skew uncertainty bound --- multi-partition reads pay an additional latency cost in order to ensure linearizability. In contrast, unified consensus systems have only a single leader region, and can perform linearizable reads across the servers in this region, without any communication with other regions or waiting for clock skew uncertainty windows to close. </span></div>
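The two options just described can be sketched as a toy cost model (illustrative functions, not Spanner code): either ask every touched leader for a safe timestamp, paying a round trip to the farthest leader, or read at a timestamp pushed past the clock-skew bound and wait that uncertainty out.

```python
# Option (a): negotiate a safe read timestamp across all touched leaders.
def negotiated_read(leader_safe_ts: dict, leader_rtt_ms: dict):
    read_ts = max(leader_safe_ts.values())       # safe across every leader
    added_latency = max(leader_rtt_ms.values())  # gated by the farthest leader
    return read_ts, added_latency

# Option (b): read at now + max assumed clock skew and wait the window out.
def uncertainty_read(now_ms: float, max_skew_ms: float):
    return now_ms + max_skew_ms, max_skew_ms
```

Either way there is an added cost: a round trip that may cross regions in option (a), or the full skew-uncertainty wait in option (b).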
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /><br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: Arial; font-size: 18.66px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Latency for serializable snapshot read-only transactions</span></h3>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Many applications --- even if they require linearizable write transactions --- do not require linearizable read-only transactions, and instead are content to read from a recent snapshot of the database state. However, such snapshot reads must be serializable --- they should reflect the database state as of a particular transaction in the serial order, and none of the writes from transactions after that transaction. </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Recall that transactions may write data on different machines / partitions. For example, a transaction, T, may write data on partition A and partition B. A serializable snapshot that includes data from both partition A and partition B must therefore include both of T’s writes to those partitions, or neither. In particular, it should include both of T’s writes if the snapshot is as of a point in time “after” T, and neither of T’s writes if the snapshot is as of a point in time “before” T. Note that this notion of “point in time” must exist --- even across partitions. Therefore, once again, there needs to be a global notion of “before” and “after” --- even for writes across different machines. As long as this notion of “before” and “after” exists, such snapshot reads can be sent to any replica to be processed there, and do not require any consensus or interaction with consensus leader regions. This is critically important to support low latency reads in a geographically dispersed deployment. </span></div>
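The both-or-neither requirement can be seen in a minimal multi-version sketch (a toy data structure, not Calvin or Spanner internals): given a global commit order, any replica can answer a read as of timestamp ts by returning, for each key, the newest version with commit_ts &lt;= ts.

```python
# Toy multi-version store: per key, a list of (commit_ts, value) sorted by
# commit_ts. Transaction T committed at ts=12 and wrote both x and y
# (imagine x and y living on different partitions).
versions = {
    "x": [(5, "x@5"), (12, "x@12")],
    "y": [(12, "y@12")],
}

def snapshot_read(key: str, ts: int):
    # Newest version at or before the snapshot timestamp; None if none exists.
    result = None
    for commit_ts, value in versions[key]:
        if commit_ts <= ts:
            result = value
    return result
```

A snapshot at ts=10 is "before" T and sees neither of T's writes, while a snapshot at ts=12 is "after" T and sees both, even though the writes landed on different partitions. That is exactly the atomicity the global order provides.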
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">As mentioned in the introductory paragraph of this post, both unified- and partitioned-consensus systems are capable of generating global notions of “before” and “after”, and thus both types of systems are able to achieve the performance advantage of being able to perform these reads from any replica. However, as we mentioned above, unified-consensus systems can achieve this global notion of “before” and “after” without any clock synchronization, while partitioned-consensus systems use clock synchronization. Thus, unified-consensus can always achieve correct serializable snapshot reads, while partitioned-consensus can only achieve the same result if the maximum clock skew bound assumption is not violated. </span></div>
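What a violated skew bound looks like, in a minimal sketch (all numbers invented for illustration): commit timestamps come from local clocks, and when the true skew exceeds the assumed bound, timestamp order can contradict commit order.

```python
MAX_ASSUMED_SKEW_MS = 5  # the bound the system was configured to trust

def commit_ts(true_time_ms: float, clock_error_ms: float) -> float:
    # Each partition stamps commits with its own (possibly skewed) clock.
    return true_time_ms + clock_error_ms

t1 = commit_ts(100, 0)    # txn 1 commits first, on partition A
t2 = commit_ts(110, -20)  # txn 2 commits 10ms later, on a partition whose
                          # clock is 20ms slow: worse than the assumed bound
# Timestamp order now contradicts commit order (t2 < t1), so a snapshot at
# ts=95 would include txn 2's writes but not txn 1's: a serializability
# violation of exactly the kind described above.
```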
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<h3 dir="ltr" style="font-size: 18.73px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 25.84px; margin-bottom: 5.33px; margin-top: 21.33px;">
<span style="color: #434343; font-family: Arial; font-size: 18.66px; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Conclusion</span></h3>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The latency debate between unified vs. partitioned consensus is an intricate one. However, it is clear that multi-partition transactions exacerbate the disadvantages of partitioned-consensus systems in (at least) three dimensions:</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<ol style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; margin-bottom: 0px; margin-top: 0px;">
<li dir="ltr" style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><div dir="ltr" style="line-height: 20.23px;">
<span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Multi-partition transactions require two-phase commit on top of the consensus protocol in partitioned-consensus systems. In many deployments, consensus across replicas is the latency bottleneck. By requiring two-phase commit on top of consensus, partitioned-consensus systems result in (approximately) double the latency (relative to unified-consensus) in such deployments, and higher long tail latency.</span></div>
</li>
<li dir="ltr" style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><div dir="ltr" style="line-height: 20.23px;">
<span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Multi-partition transactions require communication across leaders or waiting out clock skew uncertainty bounds for linearizable transactions --- even for linearizable read-only transactions. </span></div>
</li>
<li dir="ltr" style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><div dir="ltr" style="line-height: 20.23px;">
<span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Partitioned-consensus systems require clock synchronization in order to achieve low latency snapshot reads (in addition to all linearizable operations). Any incorrect assumptions of the maximum clock skew across machines may result in serializability violations (and thus incorrect results being returned).</span></div>
</li>
</ol>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">As we discussed above, it is usually impossible to avoid multi-partition transactions in real-world applications. Furthermore, as an application scales, the number of partitions must increase, and thus the probability of multi-partition transactions is also likely to increase. Therefore, the disadvantages of partitioned-consensus systems relative to unified-consensus systems become more pronounced as the application scales. </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<br style="font-variant-east-asian: normal; font-variant-numeric: normal;" /></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<br />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: Arial; font-size: 14.66px; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">(Daniel Abadi is an advisor at FaunaDB)</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<span style="font-size: xx-small;">[This article includes a description of work that was funded by the NSF under grant IIS-1763797. Any opinions, findings, and conclusions or recommendations expressed in this article are those of Daniel Abadi and do not necessarily reflect the views of the National Science Foundation.]</span>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]6tag:blogger.com,1999:blog-8899645800948009496.post-59245065458279480792018-09-21T06:09:00.000-07:002019-12-04T14:53:20.101-08:00NewSQL database systems are failing to guarantee consistency, and I blame Spanner<div dir="ltr" id="docs-internal-guid-34e9d9bf-7fff-5228-0a7b-b2989ac9e72b" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<h3 style="text-align: center;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">(Spanner vs. Calvin, Part 2)</span></h3>
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"><i>[TL;DR I <a href="http://dbmsmusings.blogspot.com/2017/04/distributed-consistency-at-scale.html" target="_blank">wrote a post in 2017</a> that discussed Spanner vs. Calvin that focused on performance differences. This post discusses another very important distinction between the two systems: the subtle differences in consistency guarantees between Spanner (and Spanner-derivative systems) vs. Calvin.]</i></span><br />
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"><br /></span>
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">The CAP theorem famously states that it is impossible to guarantee both consistency and availability in the event of a network partition. Since network partitions are always theoretically possible in a scalable, distributed system, the architects of modern scalable database systems fractured into two camps: those that prioritized availability (the NoSQL camp) and those that prioritized consistency (the NewSQL camp). For a while, the NoSQL camp was clearly the more dominant of the two --- in an “always-on” world, downtime is unacceptable, and developers were forced into handling the reduced consistency levels of scalable NoSQL systems. [Side note: NoSQL is a broad umbrella that contains many different systems with different features and innovations. When this post uses the term “NoSQL”, we are referring to the subset of the umbrella that is known for building scalable systems that prioritize availability over consistency, such as <a href="http://cassandra.apache.org/" target="_blank">Cassandra</a>, <a href="https://aws.amazon.com/dynamodb/" target="_blank">DynamoDB </a>(default settings), <a href="http://www.project-voldemort.com/" target="_blank">Voldemort</a>, <a href="http://couchdb.apache.org/" target="_blank">CouchDB</a>, <a href="http://basho.com/products/riak-kv/" target="_blank">Riak</a>, and multi-region deployments of <a href="https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels" target="_blank">Azure CosmosDB</a>.]</span></div>
<div dir="ltr" style="font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial";"><span style="font-size: 14.66px; white-space: pre-wrap;">Over the past decade, application developers have discovered that it is extremely difficult to build bug-free applications over database systems that do not guarantee consistency. This has led to a surprising shift in momentum, with many of the more recently released systems claiming to guarantee consistency (and be CP from CAP). Included in this list of newer systems are: <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf" target="_blank">Spanner </a>(and its <a href="https://cloud.google.com/spanner" target="_blank">Cloud Spanner</a> counterpart), <a href="https://fauna.com/" target="_blank">FaunaDB</a>, <a href="https://www.cockroachlabs.com/" target="_blank">CockroachDB</a>, and <a href="https://yugabyte.com/" target="_blank">YugaByte</a>. In this post, we will look more deeply into the consistency claims of these four systems (along with similar systems) and note that while some do indeed guarantee consistency, way too many of them fail to completely guarantee consistency. We will trace the failure to guarantee consistency to a controversial design decision made by Spanner that has been tragically and imperfectly emulated in other systems.</span></span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<h2 dir="ltr" style="font-size: 24px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 33.12px; margin-bottom: 0px; margin-top: 2.66px;">
<span style="color: #2f5496; font-family: "arial"; font-size: 16px; vertical-align: baseline; white-space: pre-wrap;">What is consistency anyway?</span></h2>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Consistency, also known as “atomic consistency” or “linearizability”, guarantees that once a write completes, all future reads will reflect the value of that write. For example, let’s say that we have a variable called X, whose value is currently 4. If we run the following code:</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">X = 10;</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Y = X + 8;</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">In a consistent system, there is only one possible value for Y after running this code (assuming the second statement is run after the first statement completes): 18. Everybody who has completed an “Introduction to Programming” course understands how this works, and relies on this guarantee when writing code.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">In a system that does not guarantee consistency, the value of Y after running this code is also probably 18. But there’s a chance it might be 12 (since the original value of X was 4). Even if the system returns an explicit message: “I have completed the X = 10 statement”, it is nonetheless still a possibility that the subsequent read of X will reflect the old value (4) and Y will end up as 12. Consequently, the application developer has to be aware of the non-zero possibility that Y is not 18, and must deal with all possible values of Y in subsequent code. This is MUCH more complicated, and beyond the intellectual capabilities of a non-trivial subset of application developers.</span></div>
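The hazard described in this paragraph can be sketched with a two-replica toy model (illustrative only; `EventuallyConsistentStore` is an invented class, not a real client library): the write is acknowledged before it reaches every replica, so a later read can be served an old value.

```python
class EventuallyConsistentStore:
    """Toy store: write() acks after updating one replica; replication lags."""

    def __init__(self, value):
        self.replicas = [value, value]

    def write(self, value):
        self.replicas[0] = value  # acknowledged before replica 1 catches up

    def read(self, replica: int):
        return self.replicas[replica]

store = EventuallyConsistentStore(4)  # X starts at 4
store.write(10)                       # "I have completed the X = 10 statement"
y = store.read(1) + 8                 # this read hits the stale replica
```

Here y ends up as 12 rather than 18, which is exactly the corner case the application developer is forced to handle.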
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">[Side note: Another name for "consistency" is "strong consistency". This alternate name was coined in order to distinguish the full consistency guarantee from weaker consistency levels that also use the word "consistency" in their name (despite not providing the complete consistency guarantee). Indeed, some of these weaker consistency levels, such as "causal consistency", "session consistency", and "bounded staleness consistency" provide useful guarantees that somewhat reduce complexity for application developers. Nonetheless, the best way to avoid the existence of corner case bugs in an application is to build it on top of a system that guarantees complete, strong consistency.] </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<h2 dir="ltr" style="font-size: 24px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 33.12px; margin-bottom: 0px; margin-top: 2.66px;">
<span style="color: #2f5496; font-family: "arial"; font-size: 16px; vertical-align: baseline; white-space: pre-wrap;">Why give up on consistency?</span></h2>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Consistency is a basic staple, a guarantee that is extremely hard to live without. So why do most NoSQL systems fail to guarantee consistency? They blame the CAP theorem. (For example,</span><a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" style="text-decoration-line: none;"><span style="color: black; font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #0563c1; font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">the Amazon Dynamo paper</span></a><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">, which inspired many widely used NoSQL systems, such as Cassandra, DynamoDB, and Riak, mentions the availability vs. consistency tradeoff in the first paragraph of the section that discusses its “Design Considerations”, which led to its famous “eventually consistent” architecture.) It is </span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">very hard</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">, but </span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">not impossible,</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> to build applications over systems that do not guarantee consistency. 
But the CAP theorem says that it is </span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">impossible</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-family: "arial"; font-size: 14.66px; white-space: pre-wrap;">for a system that guarantees consistency </span><span style="font-family: "arial"; font-size: 14.66px; white-space: pre-wrap;">to guarantee 100% availability in the presence of a network partition. So if you can only choose one, it makes sense to choose availability. As we said above, once a system fails to guarantee consistency, developing applications on top of it without ugly corner case bugs is extremely challenging, and generally requires highly skilled application developers who are able to handle the intellectual rigors of such development environments. Nonetheless, such skilled developers do exist, and this is the only way around the CAP theorem's impossibility result for 100% availability.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">The reasoning of the previous paragraph, although perhaps well-thought out and convincing, is fundamentally flawed. The CAP theorem lives in a theoretical world where there is such a thing as 100% availability. In the real world, there is no such thing as 100% availability. Highly available systems are defined in terms of ‘9s’. Are you 99.9% available? Or 99.99% available? The more 9s, the better. Availability is fundamentally a pursuit in imperfection. No system can </span><span style="font-family: "arial"; font-size: 14.66px; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">guarantee</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> availability.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">This fact has significant ramifications when considering the availability vs. consistency tradeoff that was purported by the CAP theorem. It is not the case that if we guarantee consistency, we have to give up the guarantee of availability. We never had a guarantee of availability in the first place! Rather, guaranteeing consistency causes a </span><span style="font-family: "arial"; font-size: 14.66px; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">reduction</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> to our already imperfect availability. </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Therefore, the question becomes: how much availability is lost when we guarantee consistency? In practice, the answer is </span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">very little</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">. Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available. Only the minority partition must become unavailable. Therefore, for the reduction in availability to be perceived, there must be both a network partition, and also clients that are able to communicate with the nodes in the minority partition (and not the majority partition). This combination of events is typically rarer than other causes of system unavailability. Consequently, the real-world impact of guaranteeing consistency on availability is often negligible. It is very possible to have a system that guarantees consistency and achieves high availability at the same time.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">[Side note: I have <a href="http://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf" target="_blank">written </a>extensively about these <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html" target="_blank">issues with the CAP theorem</a>. I believe the <a href="https://en.wikipedia.org/wiki/PACELC_theorem" target="_blank">PACELC theorem</a> is better able to summarize consistency tradeoffs in distributed systems.]</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<h2 dir="ltr" style="font-size: 24px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 33.12px; margin-bottom: 0px; margin-top: 2.66px;">
<span style="color: #2f5496; font-family: "arial"; font-size: 16px; vertical-align: baseline; white-space: pre-wrap;">The glorious return of consistent NewSQL systems</span></h2>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; white-space: pre-wrap;">The argument above actually yields three distinct reasons for modern systems to be CP from CAP instead of AP (i.e., to choose consistency over availability):</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-left: 48px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">(1) Systems that fail to guarantee consistency result in complex, expensive, and often buggy application code.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-left: 48px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">(2) The reduction of availability that is caused by the guarantee of consistency is minute, and hardly noticeable for many deployments.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px; margin-left: 48px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">(3) The CAP theorem is fundamentally asymmetrical. CP systems can </span><span style="font-family: "arial"; font-size: 14.66px; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">guarantee</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> consistency. AP systems </span><span style="font-family: "arial"; font-size: 14.66px; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">do not guarantee</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> availability (no system can guarantee 100% availability). Thus only one side of the CAP theorem opens the door for any useful guarantees.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">I believe that these three points are what have caused the amazing renaissance of distributed, transactional database systems --- many of which have become commercially available in the past few years --- that choose to be CP from CAP instead of AP. There is still certainly a place for AP systems, and their associated NoSQL implementations. But for most developers, building on top of a CP system is a safer bet.</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">However, when I say that CP systems are the safer bet, I intend to refer to CP systems that </span><span style="font-family: "arial"; font-size: 14.66px; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"><b>actually</b></span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"><b> </b>guarantee consistency. Unfortunately, way too many of these modern NewSQL systems fail to guarantee consistency, despite their claims to the contrary. And once the guarantee is removed, the corner case bugs, complexity, and costs return.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<h2 dir="ltr" style="font-size: 24px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 33.12px; margin-bottom: 0px; margin-top: 2.66px;">
<span style="color: #2f5496; font-family: "arial"; font-size: 16px; vertical-align: baseline; white-space: pre-wrap;">Spanner is the source of the problem</span></h2>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">I have </span><a href="http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">discussed in previous posts</span></a><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> that there are many ways to guarantee consistency in distributed systems. The most popular mechanism, which guarantees consistency at minimal cost to availability, is to use the Paxos or Raft consensus protocols to enforce consistency across multiple replicas of the data. At a simplified level, these protocols work via a majority voting mechanism. Any change to the data requires a majority of replicas to agree to the change. This allows the minority of replicas to be down or unavailable and the system can nonetheless continue to read or write data.</span></div>
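At that simplified level, the commit rule of these protocols is just a strict-majority test. A minimal sketch (not the actual API of any Paxos or Raft implementation):

```python
def majority_commit(acks: int, n_replicas: int) -> bool:
    """A change commits only if a strict majority of replicas acknowledge it."""
    return acks > n_replicas // 2

# With 5 replicas, the system tolerates up to 2 unavailable replicas:
assert majority_commit(3, 5)       # 3 of 5 acknowledged: the change commits
assert not majority_commit(2, 5)   # only a minority reachable: no progress
```

This is why the minority side of a partition must stop accepting writes: it can never assemble the majority of acknowledgments that a commit requires.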
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Most NewSQL systems use consensus protocols to enforce consistency. However, they differ in a significant way in </span><span style="font-family: "arial"; font-size: 14.66px; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">how</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> they use these protocols. I divide NewSQL systems into two categories along this dimension: The first category, as embodied in systems such as Calvin (which came out of my research group) and FaunaDB, uses a single, global consensus protocol per database. Every transaction participates in the same global protocol. The second category, as embodied in systems such as Spanner, CockroachDB, and YugaByte, partitions the data into ‘shards’, and applies a separate consensus protocol per shard.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">The main downside of the first category is scalability. A server can process a fixed number of messages per second. If every transaction in the system participates in the same consensus protocol, the same set of servers vote on every transaction. Since voting requires communication, the number of votes per second is limited by the number of messages each server can handle. This limits the total amount of transactions per second that the system can handle.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Calvin and FaunaDB get around this downside via batching. Rather than voting on each transaction individually, they vote on batches of transactions. Each server batches all transactions that it receives over a fixed time period (e.g., 10 ms), and then initiates a vote on that entire batch at once. With 10ms batches, Calvin was able to achieve a throughput of over 500,000 transactions per second. For comparison, Amazon.com and NASDAQ likely process no more than 10,000 orders/trades per second even during peak workloads <i>[Update: there has been some discussion about these numbers from my readers. The number for NASDAQ might be closer to 100,000 orders per second. I have not seen anybody dispute the 10,000 orders per second number from Amazon.com, but readers have pointed out that they issue more than 10,000 writes to the database per second. However, this blog post is focused on strictly serializable transactions rather than individual write operations. For Calvin's 500,000 transactions per second number, each transaction included many write operations.]</i> </span></div>
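The effect of batching can be seen with back-of-the-envelope arithmetic. The message capacity and per-round message cost below are illustrative assumptions, not measurements from Calvin or any other system:

```python
msgs_per_server_per_sec = 100_000   # assumed message-processing capacity per server
msgs_per_consensus_round = 4        # assumed messages each server handles per vote

# One consensus round per transaction: throughput is capped by messaging.
txns_per_sec_unbatched = msgs_per_server_per_sec // msgs_per_consensus_round

# One consensus round per 10 ms batch: only 100 rounds per second are needed,
# so throughput is instead capped by how many transactions fit in a batch.
batches_per_sec = 1000 // 10
txns_per_batch = 5_000
txns_per_sec_batched = batches_per_sec * txns_per_batch

assert txns_per_sec_unbatched == 25_000
assert txns_per_sec_batched == 500_000
```

Under these assumed numbers, batching moves the bottleneck from "votes per second" to "transactions per batch," which is what lets global consensus reach the throughput range described above.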
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">The main downside of the second category is that by localizing consensus on a per-shard basis, it becomes nontrivial to guarantee consistency in the presence of transactions that touch data in multiple shards. The quintessential example is the case of someone performing a sequence of two actions on a photo-sharing application (1) Removing her parents from having permission to see her photos (2) Posting her photos from spring break. Even though there was a clear sequence of these actions from the vantage point of the user, if the permissions data and the photo data are located in separate shards, and the shards perform consensus separately, there is a risk that the parents will nonetheless be able to see the user’s recently uploaded photos.</span></div>
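The anomaly can be sketched as two shard states that apply their commits independently. This is a toy model for illustration only, not the data structures of any real system:

```python
# The user's real-time order of actions was: (1) revoke, then (2) post.
permissions_shard = []                   # the revocation has not been applied yet
photos_shard = ["spring_break_photos"]   # the post has already been applied

def parents_can_see(photo: str) -> bool:
    revoked = "revoke_parents" in permissions_shard
    return photo in photos_shard and not revoked

# Because the two shards reached their states in different effective orders,
# the parents can see the new photos even though the revocation came "first":
assert parents_can_see("spring_break_photos")
```

The consistency violation is precisely that a reader observes the second action's effect (the post) without the first action's effect (the revocation), inverting the order the user experienced.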
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Spanner famously got around this downside with their TrueTime API. All transactions receive a timestamp which is based on the actual (wall-clock) current time. This enables there to be a concept of “before” and “after” for two different transactions, even those that are processed by completely disjoint set of servers. The transaction with a lower timestamp is “before” the transaction with a higher timestamp. Obviously, there may be a small amount of skew across the clocks of the different servers. Therefore, Spanner utilizes the concept of an “uncertainty” window which is based on the maximum possible time skew across the clocks on the servers in the system. After completing their writes, transactions wait until after this uncertainty window has passed before they allow any client to see the data that they wrote.</span></div>
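A minimal sketch of that commit-wait rule follows. The 7 ms skew bound is a made-up value, and a real implementation tracks per-operation uncertainty rather than polling in a sleep loop:

```python
import time

MAX_CLOCK_SKEW = 0.007  # assumed worst-case skew across server clocks, in seconds

def commit_wait(commit_timestamp: float) -> None:
    """Hold a transaction's writes invisible until every server's clock
    must have passed the commit timestamp."""
    while time.time() < commit_timestamp + MAX_CLOCK_SKEW:
        time.sleep(0.001)

start = time.time()
commit_wait(start)
# After the wait, no clock within the skew bound can still read a time
# before `start`, so the visible commit order agrees with real time.
assert time.time() >= start + MAX_CLOCK_SKEW
```

Note that the wait only guarantees anything if the real skew truly stays below `MAX_CLOCK_SKEW`, which is exactly the tradeoff discussed next.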
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Spanner thus faces a potentially uncomfortable tradeoff. It is desirable that the uncertainty window be as small as possible, since as it gets larger, the latency of transactions increases, and the overall concurrency of the system decreases. On the other hand, the system needs to be 100% sure that clock skew </span><span style="font-family: "arial"; font-size: 14.66px; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">never</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> gets larger than the uncertainty window (since otherwise the guarantee of consistency would no longer exist), and thus larger windows are safer than smaller ones.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Spanner handles this tradeoff with a specialized hardware solution that uses both GPS and atomic clocks to ensure a minimal clock skew across servers. This solution allows the system to keep the uncertainty window relatively narrow while at the same time keeping the probability of incorrect uncertainty window estimates (and corresponding consistency violations) to be extremely small. Indeed, the probability is so small that Spanner’s architects feel comfortable claiming that Spanner “guarantees” consistency.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">[It is worth noting at this point that systems that use global consensus avoid this problem entirely. If every transaction goes through the same protocol, then a natural order of all transactions emerges --- the order is simply the order in which transactions were voted on during the protocol. When batches are used instead of transactions, it is the batches that are ordered during the protocol, and transactions are globally ordered by combining their batch identifier with their sequence number within the batch. There is no need for clock time to be used in order to create a notion of before or after. Instead, the consensus protocol itself can be used to elegantly create a global order.]</span></div>
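That ordering rule can be stated in a few lines. The batch contents below are made up purely for illustration:

```python
# Batches are ordered by the consensus protocol; transactions within a
# batch are ordered by their position. No wall-clock time is involved.
voted_batches = [
    (1, ["t3", "t1"]),   # batch 1 won the first round of the protocol
    (2, ["t2"]),         # batch 2 won the second round
]

global_order = [
    (batch_id, seq, txn)
    for batch_id, txns in voted_batches
    for seq, txn in enumerate(txns)
]

assert global_order == [(1, 0, "t3"), (1, 1, "t1"), (2, 0, "t2")]
```

Every pair of transactions is comparable by the tuple `(batch_id, seq)`, which is what replaces timestamps as the definition of "before" and "after."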
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<h2 dir="ltr" style="font-size: 24px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 33.12px; margin-bottom: 0px; margin-top: 2.66px;">
<span style="color: #2f5496; font-family: "arial"; font-size: 16px; vertical-align: baseline; white-space: pre-wrap;">Spanner Derivatives</span></h2>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">Spanner is a beautiful and innovative system. It was also invented by Google and widely used there. Either because of the former or the latter (or both), it has been extremely influential, and many systems (e.g., CockroachDB and YugaByte) have been inspired by the architectural decisions of Spanner. Unfortunately, these derivative systems are software-only, which implies that they have inherited only the software innovations without the hardware and infrastructure upon which Spanner relies at Google. In light of Spanner’s decision to have separate consensus protocols per shard, software-only derivatives are extremely dangerous. Like Spanner, these systems rely on real-world time in order to enforce consistency --- CockroachDB on</span><a href="https://cse.buffalo.edu/tech-reports/2014-04.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> HLC (hybrid logical clocks)</span></a><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> and YugaByte on</span><a href="http://users.ece.utexas.edu/~garg/pdslab/david/hybrid-time-tech-report-01.pdf" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> Hybrid Time</span></a><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">. Like Spanner, these systems rely on knowing the maximum clock skew across servers in order to avoid consistency violations. But unlike Spanner, these systems lack hardware and infrastructure support for minimizing and measuring clock skew uncertainty.</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">CockroachDB, to its credit, has</span><a href="https://www.cockroachlabs.com/blog/living-without-atomic-clocks/" style="text-decoration-line: none;"><span style="color: black; font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="color: #0563c1; font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">acknowledged</span></a><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> that by only incorporating Spanner’s software innovations, the system cannot guarantee CAP consistency (which, as mentioned above, is linearizability). </span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">YugaByte, however, continues to claim a guarantee of consistency [Edit for clarification: YugaByte only makes this claim for single key operations; however, YugaByte also relies on time synchronization for reading correct snapshots for transactions running under snapshot isolation.]. I would advise people to be wary of these claims which are based on assumptions of maximum clock skew. YugaByte, by virtue of its Spanner roots, will run into consistency violations when the local clock on a server suddenly jumps beyond the skew uncertainty window. This can happen under a variety of scenarios such as when a VM that is running YugaByte freezes or migrates to a different machine. Even without sudden jumps, YugaByte’s free edition relies on the user to set the assumptions about maximum clock skew. Any mistaken assumptions on behalf of the user can result in consistency violations.</span></div>
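The failure mode is easy to state numerically. Both numbers below are hypothetical, chosen only to illustrate the shape of the problem:

```python
assumed_max_skew_ms = 250   # the skew bound the operator configured
actual_skew_ms = 900        # e.g., the effect of a VM freeze or migration pause

# Commit-wait only waits out the *assumed* bound, so when reality exceeds it,
# another server's clock can still read a time "before" a committed write:
consistency_holds = actual_skew_ms <= assumed_max_skew_ms
assert not consistency_holds
```

The point is that the guarantee is conditional: it holds only while the operator's skew assumption holds, and nothing in the system enforces that assumption.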
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">In contrast to CockroachDB and YugaByte, FaunaDB was inspired by Calvin instead of Spanner. [Historical note: the Calvin and Spanner papers were both published in 2012]. FaunaDB therefore has a single, elegant, global consensus protocol, and needs no small print regarding clock skew assumptions. Consequently, FaunaDB is able to </span><span style="font-family: "arial"; font-size: 14.66px; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">guarantee</span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> consistency of transactions that modify </span><span style="font-family: "arial"; font-size: 14.66px; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">any </span><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">data in the database without concern for the corner case violations that can plague software-only derivatives of Spanner-style systems.</span></div>
<div dir="ltr" style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: 22.08px; margin-bottom: 10.66px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">There are other differences between Calvin-style systems and Spanner-style systems </span><a href="http://dbmsmusings.blogspot.com/2017/04/distributed-consistency-at-scale.html" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">that I’ve talked about in the past</span></a><span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">. In this post we focused on perhaps the most consequential difference: global consensus vs. partitioned consensus. As with any architectural decision, there are tradeoffs between these two options. For the vast majority of applications, exceeding 500,000 transactions a second is beyond their wildest dreams. If so, then the decision is clear. Global consensus is probably the better choice.</span></div>
<div dir="ltr" style="line-height: 22.08px; margin-bottom: 10.66px;">
<div style="font-size: 16px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<div style="font-size: 16px;">
<span style="font-family: "arial"; font-size: 14.66px; vertical-align: baseline; white-space: pre-wrap;">[Editor's note: Daniel Abadi is an advisor at FaunaDB.] </span></div>
<br />
<span style="font-size: xx-small;">[This article includes a description of work that was funded by the NSF under grant IIS-1763797. Any opinions, findings, and conclusions or recommendations expressed in this article are those of Daniel Abadi and do not necessarily reflect the views of the National Science Foundation.]</span></div>
<br style="font-size: 16px; font-variant-east-asian: normal; font-variant-numeric: normal;" />Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]50tag:blogger.com,1999:blog-8899645800948009496.post-42859818109872676852018-03-27T07:24:00.000-07:002018-03-27T07:24:08.205-07:00An analysis of the strengths and weaknesses of Apache ArrowIn my <a href="http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html" target="_blank">previous blog post</a>, I discussed the relatively new Apache Arrow project, and compared it with two similar column-oriented storage formats in ORC and Parquet. In particular, I explained how storage formats targeted for main memory have fundamental differences from storage formats targeted for disk-resident data. There was a surprising amount of activity surrounding the post --- the post received 28,000 visits (making it the 7th most popular post on my blog all time), and <a href="https://news.ycombinator.com/item?id=15594542" target="_blank">86 comments in a HackerNews thread</a> discussing the post. Given this clear interest of my readers in Apache Arrow, I would like to take a deeper look into the project in this post, and present my analysis of both some specific decisions that were made regarding the format itself, and also my personal experience with installing and running experiments on the code base.<br />
<br />
A quick caveat before we begin: Many of the comments on the HackerNews thread revolved around a back-and-forth between fans and contributors to the Apache Arrow project who went ballistic when they read my title with a sarcastic tone (the title was: “Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?”) and more thoughtful and thorough readers who tried to calm them down and explain that the entire post was there to explain precisely why it makes sense to have Arrow as a separate project. However, one common point that was brought up by the pro-Arrow crowd was that my post was narrow in the sense that I only looked at Arrow from the perspective of using it as a storage format in the context of database and data analytics engines, whereas Arrow, as a general standard for representing data in main memory could also be used outside of this space. I should have been clearer about the scope of my analysis in that post, so this time around I want to be more clear: the scope of my analysis in this post is solely from the perspective of using Apache Arrow as a storage format in the context of database and data analytics engines and tools. I limit the scope to this context for two reasons: (1) I predict that the majority of Arrow’s use cases will be in that context (where I define data analytics tools broadly enough to include projects like <a href="https://pandas.pydata.org/about.html" target="_blank">Pandas</a>) (2) As someone who has spent his entire career as a database system researcher, this is the only context in which I am qualified to present my opinion.<br />
<br />
<h4>
What exactly is Apache Arrow?</h4>
Arrow’s homepage self-describes in the following way: “Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.”<br />
<br />
In other words, the creators of Arrow envision the project having impact in three ways: (1) as a development <b>platform</b>, (2) as a columnar memory format <b>standard</b>, and (3) as a set of useful <b>libraries</b>.<br />
<br />
In practice, the majority of the code in the Github repository at the time of my interactions with the codebase was for constructing, accessing, and testing data structures using the Arrow standard. So my analysis in this article will just focus on the Arrow standard and the set of code that is provided to help in the implementation of this standard.<br />
<br />
<h4>
Is it even possible to have everybody agree on a data representation standard?</h4>
For decades, it was impossible to fathom that there could be a standard representation for data in a database system. Database systems have historically been horribly monolithic and complex pieces of software. The many components of the system --- the storage layer, the transaction manager, the access manager, the recovery manager, the optimizer, etc. --- were significantly intertwined, and designed assuming particular architectural choices for the other components. Therefore, a “page” of data on storage was not a simple block of data, but also contained information critical to the recovery manager (e.g. the identifier of the log record that most recently wrote to this page), the transaction manager (e.g. timestamps required by multi-version concurrency control schemes), the access manager, and so on. Each unique database system had different concurrency control schemes, different logging structures, and different index implementations; therefore a page of data in one system looked vastly different from a page of data in another system.<br />
<br />
Therefore, if you wanted to move data from one system to another one, you would have to “export” the data, which involved rewriting the data from the complex page format stored inside the system to a simpler representation of the data. This simple representation would be passed to the other system which would then rewrite the simple representation into its own proprietary standard. This process of rewriting the data before export is called “serialization” and rewriting it back before import is called “deserialization”. Serialization and deserialization costs when transferring data between systems have historically been necessary and unavoidable.<br />
<br />
Over the past few decades, the database system world has changed significantly. First, people stopped believing that one size fits all for database systems, and different systems started being used for different workloads. Most notably, systems that specialized in data analysis separated from systems that specialized in transactional processing. Systems that specialized in data analysis tend to either be read-only or read-mostly systems, and therefore generally have far simpler concurrency control and recovery logic. Second, as the price of memory has rapidly declined, a larger percentage of database applications fit entirely in main memory. This also resulted in simpler recovery and buffer manager logic, which further simplified data representation. Finally, as open source database systems started to proliferate, a greater emphasis was placed on modular design and clean interfaces between the components of the system, in order to accommodate the typical distributed development of open source projects.<br />
<br />
All of this has led to much simpler data representations in main memory and analytical database engines, especially those in the open source sphere. Fixed-width data types are often just represented in arrays, and variable-width data types in only slightly more complicated data structures. All of a sudden, the prospect of standardizing the data representation across main memory analytical database systems has become a realistic goal, thereby enabling the transfer of data between systems without having to pay serialization and deserialization costs.<br />
<br />
This is exactly the goal of Apache Arrow. Arrow is, in its essence, a data representation specification --- a standard that can be implemented by any engine that processes data in main memory. Engines that use this standard internally can avoid any kind of serialization and deserialization costs when moving data between each other, which several other blog posts (e.g. <a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" target="_blank">here </a>and <a href="http://wesmckinney.com/blog/python-parquet-update/" target="_blank">here</a>) have shown to result in significant performance gains. 13 major open source projects, including Pandas, Spark, Hadoop and Dremio have already embraced the standard, which I believe is enough critical mass for the Arrow standard to become ubiquitous in the data analytics industry. Even if existing systems do not adopt the standard for their own internal data representation, I expect they will at least support data exports in Arrow. This increases the motivation for any new main memory analytics engine being designed to adopt it.<br />
<br />
While ubiquity is usually a good indicator of quality, there are plenty of languages, APIs, and pieces of software that become ubiquitous for reasons entirely unrelated to their quality. For example, the SQL interface to database systems took off due to the business dominance of the systems that used SQL, even though there were arguably better interfaces to database systems that had been proposed prior to SQL’s take-over. Furthermore, even high quality things are often optimized for certain scenarios, and yield suboptimal performance in scenarios outside of the intended sweet spot. Therefore, I took a deeper look at Apache Arrow without any preconceived biases. Below, I present my analysis and experience with playing with the code of the project, and discuss some of the design decisions of Arrow, and the tradeoffs associated with those decisions.<br />
<br />
<h4>
Columnar</h4>
It would be easy for someone who sees Arrow’s self-description of being “columnar” to mistakenly assume that Arrow’s scope is limited to two dimensional data structures that have rows and columns, and that by being “columnar”, it is possible to derive that Arrow stores data column-by-column instead of row-by-row. In fact, Arrow is actually a more general standard --- including specification for one-dimensional data structures such as arrays and lists, and also data structures with more than two dimensions through its support for nesting. Nonetheless, we have all become accustomed to interacting with data through relational database systems and spreadsheets, both of which store data in two dimensional objects, where each row corresponds to an entity, and each column an attribute about that entity. By being columnar, Apache Arrow stores such two dimensional objects attribute by attribute instead of entity by entity. In other words, the first attribute of each entity is stored contiguously, then the second attribute of every entity, and so on.<br />
<br />
This columnar representation means that if you want to extract all attributes for a particular entity, the system must jump around memory, finding the data for that entity from each of the separately-stored attributes. This results in a random memory access pattern which results in slower access times than sequential access patterns (my <a href="http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html" target="_blank">previous post</a> discusses this point in more detail). Therefore, Arrow is not ideal for workloads that tend to read and write a small number of entire entities at once, such as OLTP (transactional) workloads. On the other hand, data analytics workloads tend to focus on only a subset of the attributes at once; scanning through large quantities of entities to aggregate values of these attributes. Therefore, storing data in a columnar fashion results in sequential, high performance access patterns for these workloads.<br />
<br />
Storing data column-by-column instead of row-by-row has other advantages for analytical workloads as well --- for example it enables SIMD acceleration and potentially increases compression rates (my <a href="http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html" target="_blank">previous post</a> goes into more detail on these subjects). The bottom line is that by choosing a columnar representation for two-dimensional data structures, Arrow is clearly positioning itself for adoption by data analytics workloads that do not access individual data items, but rather access a subset of the attributes (columns) from many data items. Indeed, many of the open source projects that have been built natively on Arrow, such as <a href="http://www.dremio.com/" target="_blank">Dremio</a> and NVIDIA’s <a href="https://devblogs.nvidia.com/goai-open-gpu-accelerated-data-analytics/" target="_blank">Open GPU Accelerated Analytics</a> (GOAI), are focused on analytics.<br />
<br />
<h4>
Fixed-width data types</h4>
Note that attributes of entities tend to have a uniform data type. Therefore, by choosing a columnar data representation, Arrow can store columns of two dimensional tables in an identical way to how it stores data in one dimension of uniform type. In particular, fixed-width data types such as integers and floats can simply be stored in arrays. Arrow is little endian by default and pads arrays to 64-byte boundaries. Aside from the extra padding, Arrow arrays store data in memory in an equivalent fashion to arrays in C, except that Arrow arrays have three extra pieces of metadata that are not present in C arrays: (1) the length of the array, (2) the number of null elements in the array, and (3) a bitmap indicating which elements of the array are null. One interesting design decision made by Arrow is that null elements of the array take up an identical amount of space as non-null elements --- the only way to know whether an element is null is to check whether the associated bit for that element is set in the null bitmap that is part of the metadata for the array. The alternative design would have been to not waste storage at all on the null elements, and instead derive the location of null elements by inspection of the null bitmap. The tradeoff here is storage space vs. random access performance. By expending space on null elements, the nth element of the array can be quickly located by simply multiplying n by the fixed-width size of each element in the array. However, if the null elements are removed from the array (and their location derived from the null bitmap), the amount of space needed to store the array will be smaller, but additional calculations and bit counting must occur before finding the value for an element in the array at a particular index. On the other hand, sequential scans of the entire array may be faster if the system is bottlenecked by memory bandwidth, since the array is smaller.<br />
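To make this layout concrete, here is a minimal C++ sketch of the scheme just described. The struct and all of its names are illustrative inventions, not the actual Arrow C++ API, and it models the bitmap with 1 meaning non-null:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a fixed-width column: values are stored densely, null slots
// occupy full space, and a separate bitmap records which slots are valid
// (bit set = non-null). Names are hypothetical, not the Arrow API.
struct Int32ColumnSketch {
    std::vector<int32_t> values;   // slot n holds element n, even when null
    std::vector<uint8_t> validity; // bit n set => element n is non-null

    bool is_null(size_t n) const {
        return (validity[n / 8] & (1u << (n % 8))) == 0;
    }
    // O(1) random access: the location of element n is just base + 4 * n,
    // precisely because null slots are not squeezed out of the array.
    int32_t value(size_t n) const { return values[n]; }
};
```

The alternative design (nulls take no space) would instead have to count set bits in the bitmap up to position n before it could locate element n.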
<br />
Since Arrow’s design decision was made to optimize for random array element access, I ran a simple benchmark where I created an array of 100,000,000 32-bit integers, put random values in each element of the array, and then searched for the value at 50,000 different locations in the array. I first tried this experiment in a regular C array that allowed null elements to take up an identical amount of space as non-null elements (similar to Arrow). I then tried a different C array where nulls take up no space, and a high performance index is used to speed up random access of the array. I then installed Apache Arrow, built the same array using the Int32Builder in the Arrow C++ API, and accessed it through the Arrow Int32Array API. I ran these experiments on an EC2 t2.medium instance. The results are shown below:<br />
<br />
<span id="docs-internal-guid-c5f8550a-67a4-a399-1658-2139a6a7a863"><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><img height="268" src="https://lh5.googleusercontent.com/1vg-ewYf5RZn1AIxedaR6V7yIALC7RU0Fj1C-HA89GesJ9uz7R7975Cq2s-un3p6LYO1ke9jSh2_5dRTTcvMa7OfHrBaMJh1Y1jFdz3K7oISCbBt3Eb0AEvWVXwycT8J1p1JtuY2" style="border: none; transform: rotate(0rad);" width="400" /></span></span><br />
<br />
<br />
As expected, the version of the C array where nulls take up no space was much slower than the other options for this random access workload. Even though we used a high performance index, direct address offset calculations are always faster. Accessing data through the API that comes with the Arrow codebase was slightly slower than accessing data from an array directly. However, this is not because of the Arrow data format itself. When, after building the Arrow array, I bypassed the Arrow API, accessed a pointer to the raw data in the array, cast it as a const int*, and then proceeded to access this raw data directly in C, I saw performance equivalent to a normal C array. The cause of the slowdown when accessing data through the Arrow API is presumably the C++ compiler failing to inline all of the extra function calls (despite the -O3 flag). I therefore conclude that for applications that are extremely performance sensitive, it is better to access raw data created in accordance with the Arrow specification than to use the API to access the data. But for most cases, using the API will be sufficient. As for the decision to allow nulls to take up space, that was certainly a win for this random-access workload. But for a workload that scans through the entire dataset, the C array in which nulls take up no space would have been up to 10% faster, since in our experiment 10% of all the values were null, and thus that version of the C array was 10% smaller than the Arrow-specified arrays.<br />
<br />
<h4>
Variable-width data types</h4>
For variable width data types, Arrow stores the data for each element contiguously in memory without any separator bytes between the elements. In order to determine where one element ends and the next one begins, Arrow stores the byte offset of the first byte of each element of the array inside an integer array next to the raw data (there is an example in the next section below). In order to access a random element of the variable-width array, the integer array is accessed to find out the starting position of this and the next element in the raw data (the difference between these positions is the length of the element), and then the raw data is accessed.<br />
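A minimal C++ sketch of this offsets-plus-raw-data layout (illustrative names, not the actual Arrow API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of a variable-width string column: all bytes are stored
// back-to-back with no separators, and an offsets array (one entry longer
// than the logical array) delimits the elements. Element i occupies the
// byte range [offsets[i], offsets[i+1]).
struct StringColumnSketch {
    std::string data;              // raw bytes, no separator bytes
    std::vector<int32_t> offsets;  // offsets.size() == length + 1

    std::string get(size_t i) const {
        // two offset lookups give both the start and the length of element i
        return data.substr(offsets[i], offsets[i + 1] - offsets[i]);
    }
};
```

Note that the length of element i falls out of the layout for free, as the difference of two adjacent offsets.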
<br />
The decision not to include separator bytes in the raw data between the elements makes the solution more general --- you don’t have to reserve special byte values for these separators. However, it slows down certain types of sequential access patterns. For example, I ran an experiment where I created an array of 12,500,000 variable-sized strings (average of 8 characters per string) using the StringBuilder API, and searched for a substring of size two characters within all elements of the array (extracting the index of all elements that contain the substring). I measured how long this query took when accessing the array both through the Arrow StringArray C++ API, and also over the raw Arrow data directly. Thirdly, I measured how long the same query took over a string array that included a separator byte between each element. The results are shown below:<br />
<br />
<span id="docs-internal-guid-c5f8550a-67a5-81b5-3452-1f5f9cbd1309"><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><img height="268" src="https://lh5.googleusercontent.com/E9v41i6LJ0mVDVMdaIEuiAuDs96Q4ymJOmf0Aj2ehf-7YGH3SwvTHvSwyKxnmwlAESrV60HI5JAMZ_Ysghu0Q7hIY-K5PJIg3nRw30fkeMeO5dPqrFjDjCzHVeU1ygEPrmcKr9yt" style="border: none; transform: rotate(0rad);" width="400" /></span></span><br />
<br />
In this case, the best performance came from the array that was not created according to the Arrow specification. The reason for this is that the raw data could not be searched directly for the two-byte substring in the dataset created according to the Arrow specification, because the companion integer array containing the list of element boundaries needed to be repeatedly consulted to ensure that substring matches did not span multiple elements. However, when separator bytes were located inside the array itself, no secondary array needed to be scanned.<br />
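To illustrate the boundary problem, here is a toy boundary-aware substring scan over the offsets-plus-raw-data layout (my own illustrative code, not the Arrow implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// A substring scan over the separator-free raw buffer must restrict each
// search to one element's byte range, so that a candidate match spanning
// two adjacent elements is not counted as a hit.
std::vector<size_t> contains_scan(const std::string& data,
                                  const std::vector<int32_t>& offsets,
                                  const std::string& needle) {
    std::vector<size_t> hits;
    for (size_t e = 0; e + 1 < offsets.size(); ++e) {
        // search only inside element e: [offsets[e], offsets[e+1])
        std::string elem = data.substr(offsets[e], offsets[e + 1] - offsets[e]);
        if (elem.find(needle) != std::string::npos)
            hits.push_back(e);
    }
    return hits;
}
```

With elements "Alice", "Bob", "Charlie" stored as "AliceBobCharlie", a naive scan of the raw buffer would falsely match "eB" at the Alice/Bob boundary; the per-element search above rejects it.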
<br />
It should be noted that string separators only accelerate certain types of queries, and I purposely chose one such query for this example. For queries that they do not accelerate, they tend to have the opposite effect, decreasing performance by bloating the size of the array. Furthermore, it should be reiterated at this point that reserving byte values for string separators would have prevented any application that does not reserve the same byte values from using Arrow, thereby limiting the scope of Arrow’s utility. In addition, many other queries can actually benefit from having the companion integer array. For example, an equality comparison (name == "johndoe") can utilize the integer array to ignore any value that has a different length. It should also be noted that any application that wishes to have string separators can simply add them to their strings directly, before creating the array using the StringBuilder API. So this experiment does not show a fundamental weakness of the Arrow standard --- it just indicates that in some cases you may have to add to the raw data in order to get optimal performance.<br />
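As a sketch of the length-based skipping mentioned above (again my own illustrative code, not part of Arrow), an equality scan can reject most non-matches by comparing lengths from the offsets array, touching the raw bytes only when the lengths agree:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Equality scan over the offsets/raw-data layout: the length of element i
// is offsets[i+1] - offsets[i], so elements of the wrong length are
// skipped without ever reading their bytes.
std::vector<size_t> equals_scan(const std::string& data,
                                const std::vector<int32_t>& offsets,
                                const std::string& target) {
    std::vector<size_t> hits;
    for (size_t i = 0; i + 1 < offsets.size(); ++i) {
        int32_t len = offsets[i + 1] - offsets[i];
        if (static_cast<size_t>(len) != target.size()) continue; // cheap skip
        if (std::memcmp(data.data() + offsets[i], target.data(), len) == 0)
            hits.push_back(i);
    }
    return hits;
}
```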
<br />
<h4>
Nested Data</h4>
As self-describing data formats such as JSON become more popular, users are increasingly dropping the two-dimensional restrictions of relational tables and spreadsheets, and instead using nested models for their data. Arrow elegantly deals with nested data without requiring conceptual additions to the basic layout principles described above, where raw data is stored contiguously and offset arrays are used to quickly find particular data elements. For example, in a data set describing classes at the University of Maryland, I may want to nest the list of students in each class. Consider the data set:<br />
<br />
Classes:<br />
Name: Introduction to Database Systems<br />
Instructor: Daniel Abadi<br />
Students: Alice<br />
Bob<br />
Charlie<br />
Name: Advanced Topics in Database Systems<br />
Instructor: Daniel Abadi<br />
Students: Andrew<br />
Beatrice<br />
<br />
could be stored as follows:<br />
<br />
Name offsets: 0, 32, 67<br />
Name values: Introduction to Database SystemsAdvanced Topics in Database Systems<br />
<br />
Instructor offsets: 0, 12, 24<br />
Instructor values: Daniel AbadiDaniel Abadi<br />
<br />
Nested student list offsets: 0, 3, 5<br />
Student offsets: 0, 5, 8, 15, 21, 29<br />
Student values: AliceBobCharlieAndrewBeatrice<br />
<br />
Note that the nested student attribute required two different offset lists: (1) Students are variable length, and thus we need one offset list to specify where one student ends and the next one begins, just as for any variable length attribute; and (2) We need a second offset list to indicate how many students exist per class. The "Nested student list offsets" array accomplishes this second goal --- it indicates that the first class has (3-0) = 3 students, and the second class has (5-3) = 2 students.<br />
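The two-level offset scheme from the example above can be sketched in C++ as follows (illustrative code, not the Arrow API; the parameter names are my own):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reconstruct the student list for one class from the nested layout:
// list_offsets delimits each class's run of strings, and str_offsets
// delimits each string inside the shared byte buffer.
std::vector<std::string> students_of_class(
        const std::string& values,
        const std::vector<int32_t>& str_offsets,
        const std::vector<int32_t>& list_offsets,
        size_t class_idx) {
    std::vector<std::string> out;
    // class class_idx owns strings [list_offsets[class_idx], list_offsets[class_idx+1])
    for (int32_t s = list_offsets[class_idx]; s < list_offsets[class_idx + 1]; ++s)
        out.push_back(values.substr(str_offsets[s],
                                    str_offsets[s + 1] - str_offsets[s]));
    return out;
}
```

Running this on the blog post's example data (values "AliceBobCharlieAndrewBeatrice", student offsets 0,5,8,15,21,29, list offsets 0,3,5) recovers the three students of the first class and the two students of the second.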
<br />
Arrow currently allows list, struct, and various types of union type data to be nested in an attribute value.<br />
<br />
<h4>
Conclusion</h4>
It is important to separate out the specification of a standard from the tools and libraries provided in the current codebase that help developers implement this standard. As long as you are performing in-memory analytics where your workloads are typically scanning through a few attributes of many entities, I do not see any reason not to embrace the Arrow standard. The time is right for database systems architects to agree on and adhere to a main memory data representation standard. The proposed Arrow standard fits the bill, and I would encourage designers of main memory data analytics systems to adopt the standard by default unless they can find a compelling reason that representing their data in a different way will result in a significantly different performance profile (for example, Arrow’s attribute-contiguous memory layout is not ideal if your workloads typically access multiple attributes from only a single entity, as is common in OLTP workloads). I also found the tools available in the codebase to read and write data using this standard to be easy to use and quick to get started with. However, I did find that at times, the code was slightly slower than the raw (and less general) implementation of the standard I wrote myself. Nonetheless, the existing codebase is good enough for most use cases and will likely further accelerate adoption of the standard. Furthermore, additional performance enhancements to the codebase appear to be on their way, such as optimized LLVM-based processing modules.
Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?<div dir="ltr" id="docs-internal-guid-2f6bc37b-7310-b8da-89b6-95cce279a139" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Apache Parquet and Apache ORC have become popular file formats for storing data in the Hadoop ecosystem. Their primary value proposition revolves around their “columnar data representation format”. To quickly explain what this means: many people model their data in a set of two dimensional tables where each row corresponds to an entity, and each column an attribute about that entity. However, storage is one-dimensional --- you can only read data sequentially from memory or disk in one dimension. Therefore, there are two primary options for storing tables on storage: Store one row sequentially, followed by the next row, and then the next one, etc; or store the first column sequentially, followed by the next column, and then the next one, etc. </span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><img height="143" src="https://lh4.googleusercontent.com/pFfoArQ8s-eCBy3bBU59BsoTLXUMLcjsTlMdWzMyNHsO2Gd9QtCQ-x_DUHDSeUOFg2zJVuXS19CRr_OB8Pj8avDlk09GFfWr3PmtjaHKDm7njq2LozGK3cZNv83guyioI3_UkhR7AbcUYyQddQ" style="border: none; margin-left: auto; margin-right: auto; transform: rotate(0rad);" width="400" /></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Storage layout difference between row- and column-oriented formats</td></tr>
</tbody></table>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">For decades, the vast majority of data engines used row-oriented storage formats. This is because many early data application workloads revolved around reading, writing, and updating single entities at a time. If you store data using a columnar format and you want to extract all attributes for a particular entity, the system must jump around, finding the data for that entity from each of the separately-stored attributes. This results in a random access pattern, which leads to slower access times than sequential access patterns. Therefore, columnar storage formats are a poor fit for workloads that tend to read and write entire entities at once, such as OLTP (transactional) workloads. Over time, workloads became more complex, and data analytics workloads emerged that tended to focus on only a few attributes at once; scanning through large quantities of entities to aggregate and/or process values of these attributes. Thus, storing data in a columnar fashion became more viable, and columnar formats resulted in sequential, high performance access patterns for these workloads. </span></div>
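The access-pattern difference can be sketched with two toy scan loops in C++ (all names and sizes here are my own illustrative choices, mirroring the shape of the experiment later in the post):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each entity has kCols int32 attributes; the query counts matches of a
// value in attribute 0 only.
constexpr size_t kCols = 6;

// Row-oriented layout: attribute 0 of entity r sits at rows[r * kCols],
// so the scan strides through memory and every cache line fetched also
// carries the five unwanted attributes.
size_t count_row_store(const std::vector<int32_t>& rows, int32_t needle) {
    size_t hits = 0;
    for (size_t r = 0; r * kCols < rows.size(); ++r)
        if (rows[r * kCols] == needle) ++hits;
    return hits;
}

// Column-oriented layout: attribute 0 is one contiguous array, giving a
// purely sequential scan (and a loop compilers can vectorize).
size_t count_column_store(const std::vector<int32_t>& col0, int32_t needle) {
    size_t hits = 0;
    for (int32_t v : col0)
        if (v == needle) ++hits;
    return hits;
}
```

Both functions compute the same answer; the point is only the difference in memory access pattern between the two layouts.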
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Apache Arrow has recently been released with seemingly an identical value proposition as Apache Parquet and Apache ORC: it is a columnar data representation format that accelerates data analytics workloads. Yes, it is true that Parquet and ORC are designed to be used for storage on disk and Arrow is designed to be used for storage in memory. But disk and memory share the fundamental similarity that sequential access is faster than random access, and therefore the analytics workloads which tend to scan through attributes of data will perform more optimally if data is stored in columnar format no matter where data is stored --- in memory or on disk. And if that’s the case, the workloads for which Parquet and ORC are a good fit will be the identical set of workloads for which Arrow is a good fit. If so, why do we need two different Apache projects?</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Before we answer this question, let us run a simple experiment to validate the claimed advantages of column-stores. On an Amazon EC2 t2.medium instance, I created a table with 60,000,000 rows (entities) in main memory. Each row contained six attributes (columns), all of them 32-bit integers. Each row is therefore 24 bytes, and the entire table is almost 1.5GB. I created both row-oriented and column-oriented versions of this table, where the row-oriented version stores the first 24-byte row, followed by the next one, etc; and the column-oriented version stores the entire first column, followed by the next one, etc. I then ran a simple query ideally suited for column-stores --- I simply search the entire first column for a particular value. The column-oriented version of my table should therefore have to scan through just the first column and will never need to access the other five columns. Therefore it will need to scan through 60,000,000 values * 4 bytes per value = almost 0.25GB. Meanwhile the row-store will need to scan through the entire 1.5GB table because the granularity with which data can be passed from memory to the CPU (a cache line) is larger than the 24-byte tuple. Therefore, it is impossible to read just the relevant first attribute from memory without reading the other five attributes as well. So if the column-store has to read 0.25GB of data and the row-store has to read 1.5GB of data, you might expect the column-store to be 6 times faster than the row-store. However, the actual results are presented in the table below:</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><img height="240" src="https://lh6.googleusercontent.com/3sJ1-lkZvKd65gxLiHTur7OFjnKROc5mCL_bKZSNCuqLPxqqk7xVAtrYiE4EtnAO1b7iIajqM8jxQs5vpvUlzDxoPqzHHDuXbZ3myKNVPMp8GQMSvgHnH13CKHvK_NgtaL_n736HWNrUPlSPWQ" style="border: none; transform: rotate(0rad);" width="400" /></span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Surprisingly, the row-store and the column-store perform almost identically, despite the query being ideally suited for a column-store. The reason is that I turned off all CPU optimizations (</span><a href="https://en.wikipedia.org/wiki/Vector_processor" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline;">such as vectorization / SIMD processing</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">) for this query. This resulted in the query being bottlenecked by CPU processing, despite the tiny amount of CPU work that has to happen for this query (just a simple integer comparison operation per row). To understand how it is possible for such a simple query to be bottlenecked by the CPU, we need to understand some basic performance specifications of typical machines. As a rough rule of thumb, sequential scans through memory can feed data from memory to the CPU at a rate of around 30GB a second. However, a modern CPU processor runs at approximately 3 GHz --- in other words, it can process around 3 billion instructions a second. So even if the processor is doing a 4-byte integer comparison every single cycle, it is processing no more than 12GB a second --- a far smaller rate than the 30GB a second of data that can be sent to it. Therefore, CPU is the bottleneck, and it does not matter that the column-store only needs to send one sixth of the data from memory to the CPU relative to a row-store. 
</span></div>
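As a worked version of the back-of-envelope arithmetic above (the constants are the rough rule-of-thumb figures quoted in the paragraph, not measurements):

```cpp
// A ~3 GHz core doing one 4-byte integer comparison per instruction
// consumes at most 12 GB/s, well under the ~30 GB/s that a sequential
// memory scan can supply --- so the CPU, not memory bandwidth, is the
// bottleneck for the unvectorized scan.
#include <cassert>

constexpr double kInstrPerSec    = 3.0e9; // ~3 GHz, one compare per cycle
constexpr double kBytesPerCompare = 4.0;  // one 32-bit integer per compare
constexpr double kCpuGBps = kInstrPerSec * kBytesPerCompare / 1e9; // 12 GB/s
constexpr double kMemGBps = 30.0;         // rough sequential-scan bandwidth
static_assert(kCpuGBps < kMemGBps, "CPU, not memory, limits the scan");
```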
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">On the other hand, if I turn on CPU optimizations (</span><a href="https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline;">by adding the ‘-O3’ compiler flag</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">), the equation is very different. Most notably, the compiler can vectorize simple operations such as the comparison operation from our query. What this means is that originally each of the 60,000,000 integer comparisons that are required for this query happened sequentially --- each one occurring in a separate instruction from the previous. However, once vectorization is turned on, most processors can actually take four (or more) contiguous elements of our column, and do the comparison operation for all four of these elements in parallel --- in a single instruction. This effectively makes the CPU go 4 times faster (or more if it can do more than 4 elements in parallel). However, this vectorization optimization only works if each of the four elements fit in the processor register at once, which roughly means that they have to be contiguous in memory. Therefore, the column-store is able to take advantage of vectorization, while the row-store is not. Thus, when I run the same query with CPU optimizations turned on, I get the following result:</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><img height="240" src="https://lh6.googleusercontent.com/sl5Nr0u4E-HpORJi1cWfS-LqGX1rrmMjgoqhjfJIOIU4A3tqCj1UVw5KQOfbzqNR0HRKfgKDWP4PDExsQddITCt-FSV_s59ynok5sATIeBYPJNBrnW2bDc82Q8SsNfbYyXspy92sI7dKFWOvTQ" style="border: none; transform: rotate(0rad);" width="400" /></span></div>
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"></span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">As can be seen, the EC2 processor appears to be vectorising 4 values per instruction, and therefore the column-store is 4 times faster than the row-store. However, the CPU still appears to be the bottleneck (if memory was the bottleneck, we would expect the column-store to be 6 times faster than the row-store). </span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">We can thus conclude from this experiment that column-stores are still better than row-stores for attribute-limited, sequential scan queries like the one in our example and similar queries typically found in data analytics workloads. So indeed, it does not matter whether the data is stored on disk or in memory --- column-stores are a win for these types of workloads. However, the reason is totally different. When the table is stored on disk, the CPU is much faster than the bandwidth with which data can be transferred from disk to the CPU. Therefore, column-stores are a win because they require less data to be transferred for these workloads. On the other hand, when the table is stored in memory, the amount of data that needs to be transferred is less relevant. Instead, column-stores are a win because they are better suited to vectorized processing. </span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">The reason why it is so important to understand the difference in bottleneck (even though the bottom line is the same) is that certain decisions for how to organize data into storage formats will look different depending on the bottleneck. Most notably, compression decisions will look very different. In particular, for data stored on disk, where the bandwidth of getting data from disk to CPU is the bottleneck, compression is almost always a good idea. When you compress data, the total size of the data is decreased, and therefore less data needs to be transferred. However, you may have to pay additional CPU processing costs to do the decompression upon arrival. But if CPU is not the bottleneck, this is a great tradeoff to make. On the other hand, if CPU is the bottleneck, such as our experiments above where the data was located in main memory, the additional CPU cost of decompression is only going to slow down the query.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Now we can understand some of the key differences between Apache Parquet/ORC and Apache Arrow. Parquet and ORC, since they are designed for disk-resident data, support high-ratio compression algorithms such as snappy (both), gzip (Parquet), and zlib (ORC) all of which typically require decompression before data processing (and the associated CPU costs). Meanwhile, Arrow, which is designed for memory-resident-data, does not support these algorithms. The only compression currently supported by Arrow is dictionary compression, a scheme that usually does not require decompression before data processing. For example, if you want to find a particular value in a data set, you can simply search for the associated dictionary-encoded value instead. I assume that the Arrow developers will eventually read</span><a href="http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline;"> my 2006 paper on compression in column-stores</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"> and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression). 
I also expect that they will read </span><a href="http://dl.acm.org/citation.cfm?id=1129919" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline;">the X100 compression paper</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"> which includes schemes which can be decompressed using vectorized processing. Thus, I expect that Arrow will eventually support an expanded set of compression options beyond just dictionary compression. But it is far less likely that we will see heavier-weight schemes like gzip and snappy in the Apache Arrow library any time soon.</span></div>
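The "operate directly on compressed data" property of dictionary compression can be sketched in a few lines of Python (a toy illustration, not Arrow's actual encoding): values become small integer codes, and a predicate is evaluated by looking up one code and then comparing integers, with no decompression step.

```python
# Sketch of dictionary compression with direct predicate evaluation.
cities = ["NYC", "Paris", "NYC", "Tokyo", "Paris", "NYC"]

dictionary = {}    # value -> code
encoded = []       # the stored column: just small integers
for c in cities:
    code = dictionary.setdefault(c, len(dictionary))
    encoded.append(code)

# Query: count rows where city == "NYC". Look the code up once,
# then compare integers instead of strings --- no decompression needed.
target = dictionary["NYC"]
count = sum(1 for code in encoded if code == target)
assert count == 3
```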
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Another difference between optimizing for main memory and optimizing for disk is that the relative difference between random reads and sequential reads is much smaller for memory than for disk. For magnetic disk, a sequential read can be 2-3 orders of magnitude faster than a random read. However, for memory, the difference is usually less than an order of magnitude. In other words, it might take hundreds or even thousands of sequential reads on disk in order to amortize the cost of the original random read to get to the beginning of the sequence. But for main memory, it takes less than ten sequential reads to amortize the cost of the original random read. This enables the batch size of Apache Arrow data to be much smaller than batch sizes of disk-oriented storage formats. Apache Arrow actually fixes batches to be no more 64K records. </span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.3800000000000001; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">So to return back to the original question: do we really need a third column-store Apache project? I would say that there are fundamental differences between main-memory column-stores and disk-resident column-stores. Main-memory column-stores, like Arrow, need to be CPU optimized and focused on vectorized processing (Arrow aligns data to 64-byte boundaries for this reason) and low-overhead compression algorithms. Disk-resident column-stores need to be focused transfer-bandwidth and support higher-compression ratio algorithms. Therefore, it makes sense to keep Arrow and Parquet/ORC as separate projects, while also continuing to maintain tight integration.</span></div>
<br />Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]17tag:blogger.com,1999:blog-8899645800948009496.post-86115894379233625332017-10-08T13:35:00.001-07:002017-10-08T13:35:13.458-07:00Hazelcast and the Mythical PA/EC System<div dir="ltr" id="docs-internal-guid-9f7fd3be-fd92-998f-b2e4-d62b1077e02f" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">(Editor’s note: I was unaware that Kyle Kingsbury was doing a <a href="https://jepsen.io/analyses/hazelcast-3-8-3" target="_blank">linearizability analysis of Hazelcast</a> when I was writing this post. Kyle’s analysis resulted in Greg Luck, Hazelcast’s CEO, to write a <a href="https://blog.hazelcast.com/jepsen-analysis-hazelcast-3-8-3/" target="_blank">blog post</a> where he cited the PACELC theorem, and came to some of the same conclusions that I came to in writing this post. This post, however, was 98% written before both Kyle’s and Greg’s posts, but their posts got me to accelerate the completion of my analysis and publish it now.)</span></div>
<b style="font-weight: normal;"><br /></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Seven years ago, I</span><a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html" style="text-decoration: none;"><span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"> </span><span style="background-color: transparent; color: #0563c1; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline;">introduced the PACELC theorem</span></a><span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"> as a mechanism to more clearly explain the consistency tradeoffs in building distributed systems. At that time, many people were familiar with the consistency vs. availability trade-off that was made well-known by the CAP theorem. However, it was common amongst people unfamiliar with the details of CAP theorem to believe that this tradeoff is always present in a distributed system. However, the truth is that the CAP consistency-availability tradeoff actually describes a very rare corner case. Only when there is an actual network partition --- an increasingly unusual event in modern day infrastructures --- does the consistency-availability tradeoff present itself. At all other times, it is possible to be both available and consistent. Nonetheless, many systems choose not to be fully consistent at all times. The reason for this has nothing to do with the CAP tradeoff. Instead, there is a separate latency vs. consistency tradeoff. 
Enforcing consistency requires either (1) synchronization messages between machines that must remain consistent with each other or (2) all requests involving a particular data item to be served by a single master for that data item instead of the closest replica to the location where the request originates. Both of these options come with a latency cost. By relaxing consistency and serving reads and writes directly from a closest replica (without synchronization with other replicas), latency can be improved --- sometimes by an order of magnitude.</span></div>
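Option (2) above can be sketched in a few lines (a toy model; the node names and round-trip latencies are hypothetical): a request served by a distant master pays a latency cost but sees the latest write, while a request served by a nearby replica is fast but may read stale data.

```python
# Toy sketch of the latency vs. consistency tradeoff for single-master reads.
latency_ms = {"master": 80, "local_replica": 2}   # hypothetical RTTs

replicas = {
    "master":        {"x": 2},   # has the latest write
    "local_replica": {"x": 1},   # has not yet received it
}

def read(key, consistent):
    node = "master" if consistent else "local_replica"
    return replicas[node][key], latency_ms[node]

value, cost = read("x", consistent=True)
assert (value, cost) == (2, 80)    # consistent but slow

value, cost = read("x", consistent=False)
assert (value, cost) == (1, 2)     # fast but stale
```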
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Therefore, I felt that it was important to clearly tease apart these separate consistency tradeoffs. This led to the PACELC theorem: if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">In general, the PACELC theorem leads to four categories of systems: PC/EC, PA/EL, PC/EL, and PA/EC. However, in practice, an application will either go to the effort of building on top of a reduced consistency system or it will not. If it goes to this effort, it stands to benefit in two areas: availability upon a partition, and latency in everyday operation. It is unusual for a system to go to this effort and choose only to attain benefit in one area. Hence, two of these four categories are more common than the other two: PC/EC systems designed for applications that can never sacrifice consistency, and PA/EL systems that are designed for applications that are capable of being built over a reduced consistency system. Despite being less common, PACELC nonetheless theorizes about the existence of PC/EL and PA/EC systems. At the time when I originally introduced PACELC, I gave the example of PNUTS as a PC/EL system. However, I could not think of any good examples of PA/EC systems. 
Even in</span><a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf" style="text-decoration: none;"><span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"> </span><span style="background-color: transparent; color: #0563c1; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline;">my extended article on PACELC</span></a><span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"> in the CAP-anniversary edition of IEEE Computer, I only gave a somewhat hand-wavey example of a PA/EC system.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">The basic problem with PA/EC systems is the following: although partitions are a rare event, they are not impossible. Any application built on top of a PA system must have mechanisms in place to deal with inconsistencies that arise during these partition events. But once they have these mechanisms in place, why not benefit during normal operation and get better latency as well?</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Over the past few weeks, I have been looking more deeply at the In-Memory Data Grid (“IMDG”) market, and took an especially deep dive into Hazelcast, a ubiquitous open source implementation of a IMDG, with hundreds of thousands of in production deployments. It turns out that Hazelcast (and, indeed, most of the in-memory data grid industry) is a real implementation of the mythical PA/EC system.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">In order to understand why PA/EC makes sense for Hazelcast and other IMDGs, we need to first discuss some background material on (1) Hazelcast use cases, (2) data replication and (3) PACELC. </span></div>
<b style="font-weight: normal;"><br /></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline;">Hazelcast use cases</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">The most common use case for Hazelcast is the following. Let’s say that you write a Java program that stores and manipulates data inside popular Java collections and data structures, e.g., Queue, Map, AtomicLong, or Multimap. You may want to run this program in multiple different clients, all accessing the same Java collections and data structures. Furthermore, these data structures may get so large that they cannot fit in memory on a single server. Hazelcast comes to the rescue --- it provides a distributed implementation of these Java data structures, thereby enabling scalable utilization of them. Users interact with Hazelcast the same way that they interacted with their local data structures, but behind the scenes, Hazelcast is distributing and replicating them across a cluster of machines.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">The vast majority of Hazelcast use cases are within a single computing cluster. Both the client programs and the Hazelcast data structures are located in the same physical region.</span></div>
<b style="font-weight: normal;"><br /></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline;">Data replication</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">In general, any arbitrary system may choose to replicate data for one of two primary reasons: Either they want to improve fault tolerance (if a server containing some of the data fails, a replica server can be accessed instead), or they want to improve request latency (messages that have to travel farther distances take longer to transmit; therefore, having a replica of the data “near” locations from which they are typically accessed can improve request latency). </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">As mentioned above, in-memory data grids are typically running in the same region as the clients which access them. Therefore, only the first reason to replicate data (fault tolerance) applies. (This reason alone is enough to justify the costs of replication in any scalable system. The more physical machines that exist in the system, the more likely it is that at least one machine will fail at any given point in time. Therefore, the bigger the system, the more you need to replicate for fault tolerance).</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">If the replicas only exist for fault tolerance and not for performance, there is no fundamental requirement to ever access them except in the case of a failure. All reads and writes can be directed to the primary copy of any data item, with the replicas only ever accessed if the primary is not available. (In such a scenario, it is a good idea to mix primary and replica partitions on servers across the cluster, in order to prevent underutilization of server resources.) If all reads and writes go to the same location, this leads to 100% consistency and linearizability (in the absence of failures) since it is easy for a single server to ensure that reads reflect the most recent writes.</span></div>
<b style="font-weight: normal;"><br /></b>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline;">What this means for PACELC</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Recall what I wrote above about the latency vs. consistency tradeoff: “Enforcing consistency requires either (1) synchronization messages between machines that must remain consistent with each other or (2) <span style="font-size: 14.6667px;">all requests involving a particular data item to be served by a single master for that data item instead of the closest replica to the location</span> where the request originates. Both of these options come with a latency cost.” In truth, option (2) does not come with a latency cost when all requests originate from a location closest to the master replica. It’s only when messages travel for longer than the distance to the nearest replica where a cost materializes. In order words, there is no consistency vs. latency tradeoff in the typical Hazelcast use case.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Thus, we should clarify at this point that the PACELC theorem assumes that requests may originate from any arbitrary location. The ELC part of PACELC disappears if all requests come from the same location. I would argue that the CAP theorem makes the same assumption, but such an argument is not as straightforward, and requires a refined discussion about the CAP theorem which is outside scope of this particular blog post.</span></div>
<b style="font-weight: normal;"><br /></b><br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline;">Failures and partitions</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Up until now, we have said that as long as the master node does not fail, if it serves all reads and writes, then full consistency is achieved. The obvious next question is: what happens if the master node fails and a new master takes over? In such a scenario, the ability of the system to maintain consistency depends on how replication is performed. If replication was asynchronous, then consistency cannot be guaranteed, since some updates may have been performed on the old master, but had not yet been replicated to the new master before the old master failed. If all data had been synchronously replicated to the new master, then full consistency can still be guaranteed. </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">A failed node is logically equivalent to a partition where the failed node is located in one partition and every other node is in the other partition, and all client requests can reach the second partition but not the first. If the failed node is the master node, and replication was asynchronous, then both the CAP theorem and the PAC part of PACELC state that there are only two choices: quiesce the entire system since the only consistent copy is not accessible (i.e. choosing consistency over availability), or serve reads and writes from nodes in the available partition (i.e. choosing availability over consistency). </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Hazelcast by default uses “synchronous” replication, which is actually an interesting hybrid between asynchronous and synchronous replication. The master asynchronously sends the writes to the replicas, and each replica acknowledges these writes to the client. The client synchronously waits for these acknowledgments before returning from the call. However, if the requisite number of acknowledgments do not arrive before the end of a time out period, the call either returns with the write succeeding or throws an exception, depending on configuration. If Hazelcast had been configured to throw an exception, the client can retry the operation. Hazelcast also has an anti-entropy algorithm that works offline to re-synchronize replicas with the master to repair missed replications. However, either way --- until the point where the missed replication has been repaired either through the anti-entropy algorithm or through a client retry, the system is temporarily in a state where the write has succeeded on the master but not on at least one replica. </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">In addition to the hybrid synchronous algorithm described above, Hazelcast also can be configured to use standard asynchronous replication. When configured in this way, the client does not wait for acknowledgments from the replicas before returning from the call. Thus, updates that failed to get replicated will go undetected until the anti-entropy algorithm identifies and repairs the missed replication. </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Thus, either way --- whether Hazelcast is configured to use standard asynchronous replication or to use the default hybrid “synchronous” model --- it is possible for the write call to return with the write only succeeding on the master. </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">If the master node fails, Hazelcast selects a new master to serve reads and writes, even though (as we just mentioned) it is possible that the new master does not have all the writes from the original master. If there is a network partition, the original master will remain the master for its partition, but the other partition will select its own master. Again, this second master may not have all the writes from the original master. Furthermore, a full split brain situation may occur, where the masters for the two different partitions independently accept writes to their partition, thereby causing the partitions to diverge further. However, Hazelcast does have a “split brain protection” feature that prevents significant divergence. The way this feature works is that the system can be configured to define a minimum size for read and write operations. If this minimum size is set to be larger than half of the size of the cluster, then the smaller partition will not accept reads and writes, which prevents further divergence from the larger partition. However, it can take 10s of seconds for the smaller partition to realize how small it is (although Hazelcast claims it will be much faster than this in 3.9.1 and 3.10). Thus there is a delay before the split brain protection kicks in, and the partitions can diverge during this delay period. </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">The bottom line here is that both when the master fails and in the (rare) case of a network partition, a new master is selected that may not have all the updates from the original master. The system always remains available, but the second master is allowed to temporarily diverge from the original master. Thus, Hazelcast is PA/EC in PACELC. If the master has failed or partitioned, Hazelcast chooses availability over consistency. However, in the absence of failures or partitions, Hazelcast is fully consistent. (As mentioned above, Hazelcast also achieves low latency in the absence of failures or partitions in its primary use case. However, it is appropriate to label Hazelcast EC rather than EL since, if a request were to originate in a location far from the master, it would still choose consistency over latency and serve the request from the master.)</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Indeed, any system that serves reads and writes from the master, but elects a new master upon a failure, where this new master is not 100% guaranteed to have seen all of the writes from the original master, will be PA/EC in PACELC. So the PA/EC category is larger than I had originally expected.</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">I would still argue, however, that PA/EC systems are fundamentally confusing to the end user. If the system cannot guarantee consistency in all cases, then the end user is forced to handle cases of inconsistency in application logic. And once they have the code written to handle these cases (e.g., by including merge functions that resolve situations where replicas may diverge), then the value of the system being consistent in the absence of failures or partitions is significantly reduced. PA/EC systems thus only make sense for applications for which availability takes priority over consistency, but where the code that handles inconsistencies needs to be run as infrequently as possible --- e.g. when the code involves a real world charge (such as refunding a customer’s account) or significant performance costs. </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Since not all applications fit into the above category, I suspect that many PA/EC systems will have settings to either increase consistency in order to become fully consistent (i.e. become PC instead of PA) or reduce consistency guarantees in the “else case” (i.e., become EL instead of EC). </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Indeed, Hazelcast is such a system and can be configured to be EL rather than EC. There are several ways to accomplish this, but the primary mechanism is through their Near Cache feature. Near Cache is a client side cache of recently accessed data items. If the data items stored in the Near Cache are updated by a different client, these changes are not synchronously replicated to the first client’s Near Cache. Hence, the Near Cache is not kept consistent with the master version of the data (instead it is “eventually consistent”). However, reads by the client are served by its Near Cache if a copy of the data item to be read is stored there. Therefore, excellent latency (less than one microsecond) can be achieved at the cost of consistency --- EL in PACELC. </span></div>
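A minimal model of this Near Cache behavior can make the EL classification concrete. The classes below are an illustrative sketch of a client-side cache without synchronous invalidation, not the Hazelcast client API:

```python
# Minimal sketch of a "Near Cache": each client keeps a local, eventually
# consistent copy of recently read entries. Illustrative only -- this is
# not the Hazelcast client API.

class Cluster:
    def __init__(self):
        self.store = {}          # the authoritative (master) copy

class Client:
    def __init__(self, cluster):
        self.cluster = cluster
        self.near_cache = {}     # local cache; NOT invalidated synchronously

    def put(self, key, value):
        self.cluster.store[key] = value   # write-through to the cluster
        self.near_cache[key] = value

    def get(self, key):
        if key in self.near_cache:        # fast, possibly stale read
            return self.near_cache[key]
        value = self.cluster.store[key]   # slower, consistent read
        self.near_cache[key] = value
        return value

cluster = Cluster()
a, b = Client(cluster), Client(cluster)
a.put("x", 1)
assert b.get("x") == 1    # b's first read goes to the cluster
a.put("x", 2)             # b's near cache is not updated synchronously...
assert b.get("x") == 1    # ...so b reads a stale value: EL in PACELC
```

The stale read at the end is exactly the consistency cost being traded for sub-microsecond local-read latency.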
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Furthermore, Hazelcast also supports replication of clusters over a WAN. For example, in a disaster recovery use case, all writes go to the primary cluster, and they are asynchronously replicated to a backup cluster. Alternatively, both clusters can accept writes, and they are asynchronously replicated to the other cluster (the application is responsible for resolving conflicting writes to the different clusters using a conflict resolution strategy registered with Hazelcast). Unlike what we discussed earlier, in this case read requests may originate from arbitrary locations rather than always from a location near the master. Hazelcast serves these reads from the closest location, even though it may not have the most up-to-date copy of the data. Thus, Hazelcast is EL by default for WAN replication. </span></div>
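The shape of such an application-registered conflict resolution strategy can be sketched simply. The last-writer-wins policy below is just one illustrative choice, and the function name and tuple representation are my own, not Hazelcast's interface:

```python
# Sketch of an application-supplied conflict resolution strategy for
# active-active WAN replication. Hazelcast lets the application register
# a merge policy; this last-writer-wins version is one illustrative choice.

def last_writer_wins(local, remote):
    """Each entry is a (value, timestamp) pair; keep the newer write."""
    return local if local[1] >= remote[1] else remote

# The same key was written in both clusters while replication lagged:
us_entry = ("blue", 100)    # (value, write timestamp)
eu_entry = ("green", 105)
merged = last_writer_wins(us_entry, eu_entry)
assert merged == ("green", 105)  # both clusters converge on the later write
```

Any deterministic, commutative policy works here; the key requirement is that both clusters apply the same function to the same pair of conflicting entries so that they converge.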
<div dir="ltr" style="line-height: 1.38; margin-bottom: 8pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Calibri; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">In summary, through my investigation of Hazelcast (and in-memory data grids in general), I have discovered a new category of PA/EC systems. However, due to the confusing nature of PA/EC systems, it is no surprise that Hazelcast can be configured to be PA/EL in addition to its PA/EC default.</span></div>
<br />Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]2tag:blogger.com,1999:blog-8899645800948009496.post-61457698279587559162017-04-06T10:59:00.001-07:002017-04-06T12:59:52.109-07:00Distributed consistency at scale: Spanner vs. Calvin<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
Introduction</h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In 2012, two research papers were published that described the design of geographically replicated, consistent, ACID-compliant, transactional database systems. Both papers criticized the proliferation of NoSQL database systems that compromise replication consistency and transactional support, and argued that it is entirely possible to build extremely scalable, geographically replicated systems without giving up on consistency and transactional support. The first of these papers was the </span><a href="http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.pdf" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Calvin paper</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">, published in SIGMOD 2012. 
A few months later, Google published their </span><a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Spanner paper</span></a><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> in OSDI 2012. Both of these papers have been cited many hundreds of times and have influenced the design of several modern “NewSQL” systems, including </span><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 11pt; font-style: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;"><a href="https://fauna.com/" style="text-decoration: none;">FaunaDB</a></span><span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> (where this post is also being published).
</span><span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;"> </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Recently, Google released a beta version of their Spanner implementation, available to customers who use Google Cloud Platform. This development has excited many users seeking to build scalable apps on Google’s cloud, since they now have a reliably scalable and consistent transactional database system to use as a foundation. However, the availability of Spanner outside of Google has also brought it more scrutiny --- what are its technical advantages in practice, and what are its costs? Even though it has been five years since the Calvin paper was published, it is only now that the database community is asking me to directly compare and contrast the technical designs of Calvin and Spanner.
</span><span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;"> </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The goal of this post is to do exactly that --- compare the architectural design decisions made in these two systems, and specifically focus on the advantages and disadvantages of these decisions against each other as they relate to performance and scalability. This post is focused on the protocols described in the original papers from 2012. Although the publicly available versions of these systems likely have deviated from the original papers, the core architectural distinctions remain the same.</span></div>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The CAP theorem in context</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Before we get started, allow me to suggest the following: Ignore the CAP theorem in the context of this discussion. Just forget about it. It’s not relevant for the types of modern architectural deployments discussed in this post, where network partitions are rare.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Both Spanner and Calvin replicate data across independent regions for high availability. And both Spanner and Calvin are technically CP systems from CAP: they guarantee 100% consistency (serializability, linearizability, etc.) across the entire system. Yes, when there is a network partition, both systems make slight compromises on availability, but partitions are rare enough in practice that developers on top of both systems can assume a fully-available system to many 9s of availability.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">(BTW: </span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">If you didn’t believe me in 2010 when I <a href="http://dbmsmusings.blogspot.co.il/2010/04/problems-with-cap-and-yahoos-little.html" target="_blank">explained the shortfalls of using CAP to understand the practical consistency and availability properties of modern systems</a></span><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">, maybe you will believe the author of the CAP theorem himself, Eric Brewer, who <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf" target="_blank">recommends against analyzing Spanner through the lens of the CAP theorem</a>.)</span><a href="https://static.googleusercontent.com/media/research.google.com/en/pubs/archive/45855.pdf" style="text-decoration: none;"><span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> </span></a></div>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Ordering transactions in time</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Let us start our comparison of Calvin and Spanner with the most obvious difference between them: Spanner’s use of “TrueTime” vs. Calvin’s use of “preprocessing” (or “sequencing” in the language of the original paper) for transaction ordering. In fact, most of the other differences between Spanner and Calvin stem from this fundamental choice.
</span><span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;"> </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">A serializable system provides a notion of transactional ordering. Even though many transactions may be executed in parallel across many CPUs and many servers in a large distributed system, the final state (and all observable intermediate states) must be as if each transaction was processed one-by-one. If no transactions touch the same data, it is trivial to process them in parallel and maintain this guarantee. However, if the transactions read or write each other’s data, then they must be ordered against each other, with one considered earlier than the other. The one considered “later” must be processed against a version of the database state that includes the writes of the earlier one. In addition, the one considered “earlier” must be processed against a version of the database state that excludes the writes of the later one.</span></div>
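A toy example makes the ordering requirement concrete. The two transactions below are hypothetical, but they show why two transactions that touch the same data must agree on which came first: the two serial orders produce different, but individually valid, final states:

```python
# Two conflicting transactions: both read and write the same key "x".
def t1(db): db["x"] = db["x"] + 10   # read-modify-write: add 10
def t2(db): db["x"] = db["x"] * 2    # read-modify-write: double

def run_serially(txns, initial):
    """Execute transactions one-by-one; any serializable schedule must be
    equivalent to some such serial order."""
    db = dict(initial)
    for t in txns:
        t(db)
    return db

assert run_serially([t1, t2], {"x": 1})["x"] == 22  # (1 + 10) * 2
assert run_serially([t2, t1], {"x": 1})["x"] == 12  # (1 * 2) + 10
```

Either order is a legal serializable outcome; what is not legal is an interleaving whose result matches neither, which is why both systems need a mechanism (TrueTime timestamps or a log) to pin down the order.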
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Locking and logging</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Spanner uses TrueTime for this transaction ordering. Google famously uses a combination of GPS and atomic clocks in all of their regions to synchronize time to within a known uncertainty bound. If two transactions are processed during time periods that do not have overlapping uncertainty bounds, Spanner can be certain that the later transaction will see all the writes of the earlier transaction.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Spanner obtains write locks within the data replicas on all the data it will write before performing any write. If it obtains all the locks it needs, it proceeds with all of its writes and then assigns the transaction a timestamp at the end of the uncertainty range of the coordinator server for that transaction. It then waits until this later timestamp has definitely passed for all servers in the system (which is the entire length of the uncertainty range) and then releases locks and commits the transaction. Future transactions will get later timestamps and see all the writes of this earlier transaction. Thus, in Spanner, every transaction receives a timestamp based on the actual time that it committed, and this timestamp is used to order transactions. Transactions with later timestamps see all the writes of transactions with earlier timestamps, with locking used to enforce this guarantee.
</span><span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;"> </span></div>
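The commit-wait logic just described can be sketched as follows. The interval API and the 7 ms uncertainty bound are illustrative assumptions, not Spanner's actual implementation or measured numbers:

```python
import time

# Sketch of Spanner-style "commit wait" using a TrueTime-like uncertainty
# interval. EPSILON and the API shape are illustrative assumptions.

EPSILON = 0.007  # assumed clock uncertainty bound (7 ms)

def tt_now():
    """Return an (earliest, latest) interval guaranteed to contain true time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(release_locks):
    _, latest = tt_now()
    commit_ts = latest                  # timestamp at end of uncertainty range
    while tt_now()[0] <= commit_ts:     # wait out the uncertainty window...
        time.sleep(0.001)
    release_locks()                     # ...then release locks and commit
    return commit_ts

ts = commit(lambda: None)
assert tt_now()[0] > ts  # any later transaction now gets a strictly later timestamp
```

The wait lasts roughly the full width of the uncertainty window, which is why Spanner's commit latency depends directly on how tightly TrueTime can bound clock skew.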
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In contrast, Calvin uses preprocessing to order transactions. All transactions are inserted into a distributed, replicated log before being processed. In more detail: clients submit transactions to the preprocessing layer of their local region, which then submits these transactions to the global log via a cross-region consensus process like Paxos. This is similar to a write-ahead log in a traditional, non-distributed database. The order that the transactions appear in this log is the official transaction ordering. Every replica reads from their local copy of this replicated log and processes transactions in a way that guarantees that their final state is equivalent to what it would have been had every transaction in the log been executed one-by-one.</span></div>
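The consequence of this log-based ordering is that a replica's state is a pure function of the log. Here is a minimal sketch (my own names, not Calvin's code) of that idea:

```python
# Sketch of Calvin-style sequencing: all transactions are first appended to
# a single agreed-upon log, and every replica applies that log in order, so
# all replicas converge to the same state. Illustrative, not Calvin's code.

log = []  # the global transaction log (its order is fixed by consensus)

def sequence(txn):
    log.append(txn)  # in Calvin, this append is a cross-region Paxos round

def replay(initial):
    """A replica derives its state purely from the log, in log order."""
    db = dict(initial)
    for txn in log:
        txn(db)
    return db

sequence(lambda db: db.update(x=db["x"] + 10))
sequence(lambda db: db.update(x=db["x"] * 2))

replica_a = replay({"x": 1})
replica_b = replay({"x": 1})
assert replica_a == replica_b == {"x": 22}  # replicas converge deterministically
```

Since the log order is the transaction order, no further coordination between replicas is needed during normal execution; this only holds if every logged transaction is deterministic, which the next paragraph addresses.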
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Replication overhead</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The design difference between TrueTime and preprocessing directly leads to a difference in how the systems perform replication. In Calvin, the replication of transactional input during preprocessing is the only replication that is needed. Calvin uses a deterministic execution framework to avoid *all* cross-replica communication during normal (non-recovery mode) execution aside from preprocessing. Every replica sees the same log of transactions and guarantees not only a final state equivalent to executing the transactions in this log one-by-one, but also a final state equivalent to that of every other replica.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This requires the preprocessor to analyze the transaction code and “pre-execute” any nondeterministic code (e.g. calls to sys.random() or time.now()). (The implications of this for the types of transactions supported by Calvin are discussed at the end of this post.) Once all code within a transaction is deterministic, a replica can safely focus on just processing the transactions in the log in the correct order without concern for diverging from the other replicas.
</span><span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;"> </span></div>
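One way this pre-execution could work is sketched below: nondeterministic calls are evaluated once by the preprocessor, and their results are shipped with the transaction so that every replica replays identical values. The names and mechanism here are illustrative assumptions, not Calvin's implementation:

```python
import random
import time

# Sketch of pre-executing nondeterministic code: the preprocessor resolves
# random()/now() once, and the logged transaction carries only deterministic
# input. Illustrative only -- not Calvin's actual mechanism.

def preprocess(template):
    """Resolve nondeterminism up front; replicas replay the resolved values."""
    resolved = {"rand": random.random(), "now": time.time()}
    return lambda db: template(db, resolved)

def txn_template(db, resolved):
    db["token"] = resolved["rand"]       # replicas never call random() themselves
    db["created_at"] = resolved["now"]   # ...or read their own clocks

logged_txn = preprocess(txn_template)
replica_a, replica_b = {}, {}
logged_txn(replica_a)
logged_txn(replica_b)
assert replica_a == replica_b  # identical state on every replica
```

Had each replica called random() and time.time() itself, their states would diverge even when executing the same log in the same order.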
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In contrast, since Spanner does not do any transaction preprocessing, it can only perform replication after transaction execution. Spanner performs this replication via a cross-region Paxos process.</span></div>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The cost of two-phase commit</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Another key difference between Spanner and Calvin is how they commit multi-partitioned transactions. Both Calvin and Spanner partition data into separate shards that may be stored on separate machines that fail independently from each other. In order to guarantee transaction atomicity and durability, any transaction that accesses data in multiple partitions must go through a commit procedure that ensures that every partition successfully processed the part of the transaction that accessed data in that partition. Since machines may fail at any time, including during the commit procedure, this process generally takes two rounds of communication between the partitions involved in the transaction. This two-round commit protocol is called “two-phase commit” and is used in almost every ACID-compliant distributed database system, including Spanner. This two-phase commit protocol can often consume the majority of latency for short, simple transactions, since the actual processing time of the transaction is much less than the delays involved in sending and receiving two rounds of messages over the network.
</span><span style="font-family: "arial"; font-size: 11pt; white-space: pre-wrap;"> </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The cost of two-phase commit is particularly high in Spanner because the protocol involves three forced writes to a log that cannot be overlapped with each other. In Spanner, every forced write to a log involves a cross-region Paxos agreement, so the latency of two-phase commit in Spanner is at least equal to three times the latency of cross-region Paxos.</span></div>
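A back-of-the-envelope model makes this latency multiplier concrete. The 50 ms figure below is an assumed example for a cross-region Paxos round, not a measurement of either system:

```python
# Latency model for the claim above: Spanner's two-phase commit involves
# three forced, non-overlapping log writes, each requiring a cross-region
# Paxos agreement. The 50 ms figure is an assumed example, not a measurement.

PAXOS_LATENCY_MS = 50  # assumed cross-region Paxos agreement latency

def spanner_multi_partition_commit_ms(paxos_ms):
    forced_log_writes = 3  # per the post: three non-overlapping forced writes
    return forced_log_writes * paxos_ms

def calvin_write_txn_ms(paxos_ms):
    # One Paxos round to append the transaction to the global log during
    # preprocessing; the commit protocol itself adds at most one more round
    # of communication (and sometimes zero).
    return paxos_ms

assert spanner_multi_partition_commit_ms(PAXOS_LATENCY_MS) == 150
assert calvin_write_txn_ms(PAXOS_LATENCY_MS) == 50
```

Under these assumptions, a multi-partition write transaction pays roughly three cross-region consensus rounds in Spanner versus one in Calvin, which is the gap the following sections quantify in more detail.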
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Determinism is durability</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In contrast to Spanner, Calvin leverages deterministic execution to avoid two-phase commit. Machine failures do not cause transactional aborts in Calvin. Instead, after a failure, the machine that failed in Calvin re-reads the input transaction log from a checkpoint, and deterministically replays it to recover its state at the time of the failure. It can then continue on from there as if nothing happened. As a result, the commit protocol does not need to worry about machine failures during the protocol, and can be performed in a single round of communication (and in some cases, zero rounds of communication --- see the original paper for more details).</span></div>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Performance</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">At this point, I think I have provided enough details to make it possible to present a theoretical comparison of the bottom line performance of Calvin vs. Spanner for a variety of different types of requests. This comparison assumes a perfectly optimized and implemented version of each system.</span></div>
<h4 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Transactional write latency</span></h4>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">A transaction that is “non-read-only” writes at least one value to the database state. In Calvin, such a transaction must pay the latency cost of preprocessing, which is roughly the cost of running cross-region Paxos to agree to append the transaction to the log. After this is complete, the remaining latency is the cost of processing the transaction itself, which includes the zero or one-phase commit protocol for distributed transactions.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In Spanner, there is no preprocessing latency, but it still has to pay the cost of cross-region Paxos replication at commit time, which is roughly equivalent to the Calvin preprocessing latency. Spanner also has to pay the commit wait latency discussed above (which is the size of the time uncertainty window), but this can be overlapped with replication. It also pays the latency of two phase commit for multi-partition transactions.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Thus, Spanner and Calvin have roughly equivalent latency for single-partition transactions, but Spanner has worse latency than Calvin for multi-partition transactions due to the extra phases in the transaction commit protocol.</span></div>
<h4 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">
Snapshot read latency</span></h4>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Both Calvin and Spanner keep older versions of data around, and can read data at a requested earlier timestamp from a local replica without any Paxos communication with the other replicas.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Thus, both Calvin and Spanner can achieve very low snapshot-read latency.</span></div>
<h4 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">
Transactional read latency</span></h4>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Read-only transactions do not write any data, but they must be linearizable with respect to other transactions that write data. In practice, Calvin accomplishes this by placing the read-only transaction in the preprocessor log. This means that a read-only transaction in Calvin must pay the cross-region replication latency. In contrast, Spanner only needs to submit the read-only transaction to the leader replica(s) for the partition(s) that are accessed in order to get a global timestamp (and therefore be ordered relative to concurrent transactions). Therefore, there is no cross-region Paxos latency --- only the commit time (uncertainty window) latency.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Thus, Spanner has better latency than Calvin for read-only transactions submitted by clients that are physically close to the location of the leader servers for the partitions accessed by that transaction.</span></div>
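The read-only-transaction comparison above can be sketched as a toy latency model. This is an illustration only, under simplifying assumptions I am making here (a single replication round for Calvin, a single client-to-leader round trip plus commit wait for Spanner); the numbers are made up, not measurements of either system:

```python
# Toy latency model for read-only transactions (illustrative assumptions;
# the figures below are made up, not measurements of either system).

def calvin_readonly_ms(cross_region_paxos_ms: float) -> float:
    # Calvin places the read-only transaction in the replicated input log,
    # so it pays cross-region replication latency.
    return cross_region_paxos_ms

def spanner_readonly_ms(client_to_leader_rtt_ms: float,
                        uncertainty_window_ms: float) -> float:
    # Spanner only contacts the leader(s) of the accessed partitions to get
    # a timestamp, then waits out the clock-uncertainty window.
    return client_to_leader_rtt_ms + uncertainty_window_ms

print(calvin_readonly_ms(100.0))      # 100.0
print(spanner_readonly_ms(2.0, 7.0))  # 9.0
```

The model makes the text's condition explicit: Spanner wins only when the client-to-leader round trip is small, i.e. the client is physically close to the leaders of every partition it touches.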
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">
Scalability</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Spanner and Calvin are both (theoretically) roughly linearly scalable for transactional workloads in which it is rare for concurrent transactions to access the same data. However, major differences begin to present themselves as the conflict rate between concurrent transactions starts to increase.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Both Spanner and Calvin, as presented in the paper, use locks to prevent concurrent transactions from interfering with each other in impermissible ways. However, the amount of time they hold locks for an identical transaction is substantially different. Both systems need to hold locks during the commit protocol. However, since Calvin’s commit protocol is shorter than Spanner’s, Calvin reduces the lock hold time at the end of the transaction. On the flip side, Calvin acquires all locks that it will need at the beginning of the transaction, whereas Spanner performs all reads for a transaction before acquiring write locks. Therefore, Spanner reduces lock time at the beginning of the transaction.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">However, this latter advantage for Spanner is generally outweighed by the former disadvantage (extra lock-hold time at the end of the transaction), since, as discussed above, two-phase commit in Spanner involves at least three iterations of cross-region Paxos. Furthermore, Spanner has an additional major disadvantage relative to Calvin in lock-hold time: Spanner must also hold locks during replication (which, as mentioned above, is also a cross-region Paxos process). The farther apart the regions, the larger the latency of this replication, and therefore, the longer Spanner must hold locks.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In contrast, Calvin does its replication during preprocessing, and therefore does not need to hold locks during replication. This leads to Calvin holding locks for much shorter periods of time than Spanner, and therefore being able to process more concurrent transactions in parallel that conflict with each other.
</span></div>
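To make the difference concrete, here is a back-of-the-envelope model of lock-hold time under the assumptions discussed above (three cross-region Paxos rounds for Spanner's commit protocol, replication before lock acquisition for Calvin). The phase counts and latency figures are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope lock-hold-time model (illustrative only; the latency
# numbers and phase counts are assumptions based on the discussion above).

def spanner_lock_hold_ms(paxos_rtt_ms: float, local_work_ms: float) -> float:
    # Spanner holds write locks through its commit protocol, which involves
    # roughly three cross-region Paxos rounds (2PC plus replication).
    commit_protocol = 3 * paxos_rtt_ms
    return local_work_ms + commit_protocol

def calvin_lock_hold_ms(local_work_ms: float) -> float:
    # Calvin replicates the transaction *input* before acquiring locks, so
    # locks are held only for local execution plus a short local commit.
    return local_work_ms

paxos_rtt = 100.0  # assumed cross-region Paxos round trip, ms
work = 5.0         # assumed local execution time, ms
print(spanner_lock_hold_ms(paxos_rtt, work))  # 305.0
print(calvin_lock_hold_ms(work))              # 5.0
```

Under these (made-up) numbers, a conflicting transaction waits roughly 60x longer behind Spanner's locks than behind Calvin's, which is the mechanism behind the throughput gap described above.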
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">A second difference that can affect scalability is the following: Calvin requires only a single Paxos group for replicating the input log. In contrast, Spanner requires one independent Paxos group per shard, with proportionally higher overhead.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Overall, Calvin has higher throughput scalability than Spanner for transactional workloads where concurrent transactions access the same data. This advantage increases with the distance between datacenters.</span></div>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">
Limitations</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In order to implement deterministic transaction processing, Calvin requires the preprocessor to analyze transactions and potentially “pre-execute” any non-deterministic code to ensure that replicas do not diverge. This implies that the preprocessor requires the entire transaction to be submitted at once. This highlights another difference between Calvin and Spanner --- while Spanner theoretically allows arbitrary client-side interactive transactions (that may include external communication), Calvin supports a more limited transaction model.
</span><span style="font-family: arial; font-size: 11pt; white-space: pre-wrap;"> </span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">There are some subtle, but interesting differences between Calvin and Spanner in rare situations where every single replica for a shard is unavailable, or if all but one are unavailable, but these differences are out of scope for this post.</span></div>
<h3 dir="ltr" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 14pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">
Conclusion</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I’m obviously biased in favor of Calvin, but in going through this exercise, I found it very difficult to find cases where an ideal implementation of Spanner theoretically outperforms an ideal implementation of Calvin. The only place where I could find that Spanner has a clear performance advantage over Calvin is for latency of read-only transactions submitted by clients that are physically close to the location of the leader servers for the partitions accessed by that transaction. Since any complex transaction is likely to touch multiple partitions, this is almost impossible to guarantee in a real-world setting.
</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">However, many real-world workloads do not require client-side interactive transactions, and furthermore only need transactional support for writes, and are satisfied with performing reads against a snapshot (after all, this is the default isolation model of many SQL systems). It seems to me that Calvin is the better fit for a large class of modern applications.</span></div>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span></div>
<div>
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]1tag:blogger.com,1999:blog-8899645800948009496.post-26801126292727170292015-10-28T07:36:00.000-07:002015-10-30T11:12:26.594-07:00Why MongoDB, Cassandra, HBase, DynamoDB, and Riak will only let you perform transactions on a single data item<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">(This post is co-authored by Daniel Abadi and Jose Faleiro and cross-posted on </span><a href="http://www.jmfaleiro.com/blog/post/fit-tradeoff/" target="_blank">Jose's blog</a>)</div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";"><br /></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">NoSQL systems such as MongoDB, Cassandra, HBase, DynamoDB, and Riak have made many things easier for application developers. They generally have extremely flexible data models that reduce the burden of predicting in advance how an application will change over time. They support a wide variety of data types, allow nesting of data, and permit dynamic addition of new attributes. Furthermore, on the whole, they are relatively easy to install, with far fewer configuration parameters and dependencies than many traditional database systems.</span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">On the other hand, their lack of support for traditional atomic transactions is a major step backwards in terms of ease-of-use for application developers. An atomic transaction enables a group of writes (to different items in the database) to occur in an all-or-nothing fashion --- either they will all succeed and be reflected in the database state, or none of them will. Moreover, in combination with appropriate concurrency control mechanisms, atomicity guarantees that concurrent and subsequent transactions either observe all of the completed writes of an atomic transaction or none of them. Without atomic transactions, application developers have to write corner-case code to account for cases in which a group of writes (that are supposed to occur together) have only partially succeeded or have been only partially observed by concurrent processes. This code is error-prone, and requires a complex understanding of the semantics of an application.</span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">At first it
may seem odd that these NoSQL systems, that are so well-known for their
developer-friendly features, should lack such a basic ease-of-use tool as an
atomic transaction. One might have thought that this missing feature is a
simple matter of maturity --- these systems are relatively new and perhaps they
simply haven't yet gotten around to implementing support for atomic
transactions. Indeed, Cassandra's "batch update" feature could be viewed as a
mini-step in this direction (despite the severe constraints on what types of
updates can be placed in a "batch update"). However, as we start to
approach a decade since these systems were introduced, it is clear that there
is a more fundamental reason for the lack of transactional support in these
systems.<o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">Indeed, there is a deeper reason for their lack of transactional support, and it stems from their focus on scalability. Most NoSQL systems are designed to scale horizontally across many different machines, where the data in a database is partitioned across these machines. The writes in a (general) transaction may access data in several different partitions (on several different machines). Such transactions are called "distributed transactions". Guaranteeing atomicity in distributed transactions requires that the machines that participate in the transaction coordinate with each other. Each machine must establish that the transaction can successfully commit on <i>every other</i> machine involved in the transaction. Furthermore, a protocol is used to ensure that no machine involved in the transaction will fail before the writes that it was involved in for that transaction are present in stable storage. This avoids scenarios where one set of nodes commit a transaction's writes, while another set of nodes abort or fail before the transaction is complete (which violates the all-or-nothing guarantee of atomicity).</span></div>
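The coordination just described is essentially the two-phase commit pattern. A minimal sketch of the idea, with hypothetical in-memory participants (a real system must also durably log each step so a crashed machine can recover its decision), might look like:

```python
# Minimal sketch of the two-phase commit idea described above (hypothetical
# in-memory participants; real systems must also log each step durably).

class Participant:
    def __init__(self, name: str):
        self.name = name
        self.staged = None
        self.committed = {}

    def prepare(self, writes: dict) -> bool:
        # Phase 1: stage the writes and vote on whether they can commit.
        self.staged = writes
        return True  # a real participant may vote False (e.g., on conflict)

    def commit(self) -> None:
        # Phase 2: make the staged writes visible.
        self.committed.update(self.staged)
        self.staged = None

    def abort(self) -> None:
        self.staged = None

def two_phase_commit(participants, writes_per_participant) -> bool:
    # All-or-nothing: commit only if *every* participant votes yes.
    votes = [p.prepare(writes_per_participant[p.name]) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

a, b = Participant("a"), Participant("b")
ok = two_phase_commit([a, b], {"a": {"x": 1}, "b": {"y": 2}})
print(ok)  # True
```

Note that even in this toy version, no participant's writes become visible until every participant has voted --- which is exactly the waiting that causes the performance problems discussed next.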
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">This coordination process is expensive, both in terms of resources and in terms of the latency it adds to database requests. However, the bigger issue is that other operations are not allowed to read the writes of a transaction until this coordination is complete, since the all-or-nothing nature of transaction execution implies that these writes may need to be rolled back if the coordination process determines that some of the writes cannot complete and the transaction must be aborted. The delay of concurrent transactions can cause further delay of other transactions that have overlapping read- and write-sets with the delayed transactions, resulting in overall "cloggage" of the system. The distributed coordination that is required for distributed transactions thus has significant drawbacks for overall database system performance --- both in terms of the throughput of transactions per unit time that the system can process, and in terms of the latency of transactions as they get caught up in the cloggage (this cloggage latency often dominates the latency of the transaction coordination protocol itself). Therefore, most NoSQL systems have chosen to disallow general transactions altogether rather than become susceptible to the performance pitfalls that distributed transactions can entail.</span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">MongoDB,
Riak, HBase, and Cassandra all provide support for transactions on a single
key. This is because all information associated with a single key is stored on
a single machine (aside from replicas stored elsewhere). Therefore,
transactions on a single key are guaranteed not to involve the types of
complicated distributed coordination described above. <o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">Given that
distributed transactions necessitate distributed coordination, it would seem
that there is a fundamental tradeoff between scalable performance and support
for distributed transactions. Indeed, many practitioners assume that this is
the case. When they set out to build a scalable system, they immediately assume
that they will not be able to support distributed atomic transactions without
severe performance degradation.<o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">This is in
fact <b>completely false</b>. It is very much possible for a scalable system to
support performant distributed atomic transactions.<o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">In <a href="http://cs-www.cs.yale.edu/homes/dna/papers/fit.pdf" target="_blank">a recent paper</a>, we published a new representation of the tradeoffs involved in supporting atomic transactions in scalable systems. In particular, there exists a three-way tradeoff between fairness, isolation, and throughput (FIT). A scalable database system which supports atomic distributed transactions can achieve at most two out of these three properties. Fairness corresponds to the intuitive notion that the execution of any given transaction is not deliberately delayed in order to benefit other transactions. Isolation provides each transaction with the illusion that it has the entire database system to itself. In doing so, isolation guarantees that if any pair of transactions conflict, then one transaction in the pair will always observe the writes of the other. As a consequence, it relieves application developers of the burden of reasoning about complex interleavings of conflicting transactions' reads and writes. Throughput refers to the ability of the database to process many concurrent transactions per unit time (without hiccups in performance due to clogging).</span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">The FIT
tradeoff dictates that there exist three classes of systems that support atomic
distributed transactions:<o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
</div>
<ol>
<li><span style="color: #000000; font-family: Verdana, sans-serif; font-size: 9pt; line-height: 18pt;">Those
that guarantee fairness and isolation, but sacrifice throughput, </span></li>
<li><span style="color: #000000; font-family: Verdana, sans-serif; font-size: 9pt; line-height: 18pt;">Those that
guarantee fairness and throughput, but sacrifice isolation, and </span></li>
<li><span style="color: #000000; font-family: Verdana, sans-serif; font-size: 9pt; line-height: 18pt;">Those that
guarantee isolation and throughput, but sacrifice fairness.</span></li>
</ol>
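The three classes above amount to a "pick at most two" constraint. As a toy illustration (not part of the paper's formalism), the rule can be written as a one-line check:

```python
# The FIT tradeoff as a "pick at most two" check (a toy illustration, not
# part of the paper's formalism).

FIT = {"fairness", "isolation", "throughput"}

def achievable(desired: set) -> bool:
    # A scalable system supporting atomic distributed transactions can
    # guarantee at most two of the three FIT properties.
    return desired <= FIT and len(desired) <= 2

print(achievable({"fairness", "isolation"}))                # True
print(achievable({"isolation", "throughput"}))              # True
print(achievable({"fairness", "isolation", "throughput"}))  # False
```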
<br />
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">In other
words, not only is it possible to build scalable systems with high throughput
distributed transactions, but there actually exist two classes of systems that
can do so: those that sacrifice isolation, and those that sacrifice fairness.
We discuss each of these two alternatives below. <o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";"><br /></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">(Latency is not explicitly mentioned in the tradeoff, but systems that give up throughput also give up latency due to cloggage, and systems that give up fairness yield increased latency for those transactions treated unfairly.)</span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";"><br /></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: Verdana, sans-serif;"><span style="font-size: large;"><b>Give up on
isolation</b></span><span style="font-size: 9pt;"><o:p></o:p></span></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">As described above, the root source of the database system cloggage isn't the distributed coordination itself. Rather, it is the fact that other transactions that want to access the data that a particular transaction wrote have to wait until <i>after </i>the distributed coordination is complete before reading or writing the shared data. This waiting occurs due to strong isolation, which guarantees that one transaction in a pair of conflicting transactions must observe the writes of the other. Since a transaction's writes are not guaranteed to commit until after the distributed coordination process is complete, concurrent conflicting transactions cannot make progress for the duration of this coordination.</span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">However, all of this assumes that it is unacceptable for a transaction's writes to not be immediately observable by concurrent conflicting transactions. If this "isolation" requirement is dropped, there is no need for other transactions to wait until the distributed coordination is complete before executing and committing.</span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">While
giving up on strong isolation seemingly implies that distributed databases
cannot guarantee correctness (because transactions execute against potentially
stale database state), it turns out that there exists a class of database
constraints that can be guaranteed to hold despite the use of weak isolation
among transactions. For more details on the kinds of guarantees that can hold
on constraints despite weak isolation, Peter Bailis's work on <a href="http://www.bailis.org/papers/ramp-sigmod2014.pdf" target="_blank">Read Atomic Multi-Partition</a>
(RAMP) transactions provides some great intuition.<o:p></o:p></span></div>
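The flavor of these guarantees can be sketched in a few lines of Python. The class below is a single-process toy modeled loosely on RAMP-Fast (the names `RampFastStore`, `prepare`, and `commit` are my own, and the real algorithm runs its rounds across partitions rather than over in-memory dictionaries): writers install every version of a write set before making any of them visible, and readers use the write-set metadata attached to each version to detect and repair a partially visible write.

```python
import itertools

class RampFastStore:
    """Toy, single-process sketch of a RAMP-Fast-style multi-versioned store."""

    def __init__(self):
        self.versions = {}   # (key, ts) -> (value, frozenset of sibling keys)
        self.latest = {}     # key -> highest *visible* timestamp
        self._clock = itertools.count(1)

    def prepare(self, items):
        """Phase 1: install versions tagged with the full write set."""
        ts = next(self._clock)
        for k, v in items.items():
            self.versions[(k, ts)] = (v, frozenset(items))
        return ts

    def commit(self, keys, ts):
        """Phase 2: make versions visible (may reach partitions at different times)."""
        for k in keys:
            self.latest[k] = max(self.latest.get(k, 0), ts)

    def write_all(self, items):
        self.commit(list(items), self.prepare(items))

    def read_all(self, keys):
        """Read-atomic read: observe all of a write set or none of it."""
        # Round 1: newest visible version of each key.
        got = {k: self.latest.get(k, 0) for k in keys}
        # Sibling metadata tells us which versions we are *required* to see.
        required = dict(got)
        for k, ts in got.items():
            if ts == 0:
                continue
            _, siblings = self.versions[(k, ts)]
            for s in siblings & set(keys):
                required[s] = max(required[s], ts)
        # Round 2: fetch exactly the required versions; they were installed
        # in phase 1, so they exist even if not yet visible.
        return {k: (self.versions[(k, ts)][0] if ts else None)
                for k, ts in required.items()}

store = RampFastStore()
store.write_all({"x": 1, "y": 1})
ts = store.prepare({"x": 2, "y": 2})
store.commit(["x"], ts)            # simulate a commit seen by only x so far
print(store.read_all(["x", "y"]))  # {'x': 2, 'y': 2} -- the repair kicked in
```

Note that this does not contradict the point above about giving up strong isolation: the store never blocks conflicting writers; it only guarantees that readers observe write sets atomically.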
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";"><br /></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: Verdana, sans-serif;"><span style="font-size: large;"><b>Give up on
fairness</b></span><span style="font-size: 9pt;"><o:p></o:p></span></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">The
underlying motivation for giving up isolation in systems is that distributed
coordination extends the duration for which transactions with overlapping data
accesses are unable to make progress. Intuitively, distributed coordination and
isolation mechanisms overlap in time.
This suggests that another way to circumvent the interaction between
isolation techniques and distributed coordination is to <i>re-order </i>distributed
coordination such that its overlap with any isolation mechanism is minimized.
This intuition forms the basis of Isolation-Throughput systems (which give up
fairness). In giving up fairness,
database systems gain the flexibility to pick the most opportune time to pay
the cost of distributed coordination.
For instance, it is possible to perform coordination outside of
transaction boundaries so that the additional time required to do the
coordination does not increase the time that conflicting transactions cannot
run. In general, when the system does not need to guarantee fairness, it can deliberately
prioritize or delay specific transactions in order to benefit overall
throughput.<o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";"><a href="http://www.cs.ucsb.edu/~sudipto/papers/socc10-das.pdf" target="_blank">G-Store</a> is
a good example of an Isolation-Throughput system (which gives up
fairness). G-Store extends a
(non-transactional) distributed key-value store with support for multi-key
transactions. G-Store restricts the
scope of transactions to an application defined set of keys called a <i>KeyGroup</i>.
An application defines KeyGroups dynamically based on the set of keys it
anticipates will be accessed together over the course of some period of time.
Note that the only restriction on transactions is that the keys involved in the
transaction be part of a single KeyGroup. G-Store allows KeyGroups to be
created and disbanded when needed, and therefore effectively provides arbitrary
transactions over any set of keys.<o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">When an
application defines a KeyGroup, G-Store moves the constituent keys from their
nodes to a single leader node. The leader node copies the corresponding
key-value pairs, and all transactions on the KeyGroup are executed on the
leader. Since all the key-value pairs involved in a transaction are stored on a
single node (the leader node), G-Store transactions do not need to execute a
distributed commit protocol during transaction execution. <o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">G-Store
pays the cost of distributed coordination prior to executing transactions. In
order to create a KeyGroup, G-Store executes an expensive distributed protocol
to allow a leader node to take ownership of a KeyGroup, and then move the
KeyGroup's constituent keys to the leader node. The KeyGroup creation protocol
involves expensive distributed coordination, the cost of which is amortized
across the transactions which execute on the KeyGroup.<o:p></o:p></span></div>
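A toy Python sketch makes this ordering concrete. The names here (`GStoreSketch`, `create_group`, `run_txn`) are my own, and the real KeyGroup creation protocol is a careful distributed handshake rather than the naive data move shown below; the point is only where the coordination cost lands: all of it in `create_group`, none of it per transaction.

```python
class Node:
    """A storage node holding a partition of the key-value space."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class GStoreSketch:
    """Toy model of G-Store's pattern: pay the distributed coordination
    up front (KeyGroup creation ships every key to one leader node) so
    that each transaction then commits on a single node."""

    def __init__(self):
        self.owner = {}    # key -> node currently holding it
        self.groups = {}   # group id -> (leader node, set of keys)

    def put(self, node, key, value):
        node.store[key] = value
        self.owner[key] = node

    def create_group(self, gid, keys, leader):
        # The expensive, coordination-heavy step: move every key's
        # value to the leader before any transaction runs.
        for k in keys:
            src = self.owner[k]
            if src is not leader:
                leader.store[k] = src.store.pop(k)
                self.owner[k] = leader
        self.groups[gid] = (leader, set(keys))

    def run_txn(self, gid, txn):
        # The cheap step: the transaction touches only its group's keys,
        # all resident on the leader, so the commit is local -- no
        # distributed commit protocol on the transaction's critical path.
        leader, keys = self.groups[gid]
        view = {k: leader.store[k] for k in keys}
        txn(view)
        leader.store.update(view)

a, b = Node("a"), Node("b")
g = GStoreSketch()
g.put(a, "k1", 10)
g.put(b, "k2", 5)
g.create_group("grp", ["k1", "k2"], leader=a)   # coordination happens here

def transfer(view):                              # single-node transaction
    view["k1"] -= 3
    view["k2"] += 3

g.run_txn("grp", transfer)
print(a.store)   # {'k1': 7, 'k2': 8}
```

A transaction over a not-yet-grouped set of keys would first need its own `create_group` call (and possibly the disbanding of existing groups), which is exactly where the fairness cost shows up.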
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">The key
point is that while G-Store still must perform distributed coordination, this
coordination is done prior to transaction execution --- before the need to be
concerned with isolation from other transactions. Once the distributed
coordination is complete (all the relevant data has been moved to a single
master node), the transaction completes quickly on a single node without
forcing concurrent transactions with overlapping data accesses to wait for
distributed coordination. Hence, G-Store achieves both high throughput and
strong isolation. <o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">However,
the requirement that transactions restrict their scope to a single KeyGroup
favors transactions that execute on keys which have already been grouped. This
is "unfair" to transactions that need to execute on a set of as yet
ungrouped keys. Before such transactions can begin executing, G-Store must
first disband existing KeyGroups to which some keys may belong, and then create
the appropriate KeyGroup --- a process with much higher latency than if the
desired KeyGroup already existed.<o:p></o:p></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";"><br /></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: Verdana, sans-serif;"><span style="font-size: large;"><b>Conclusions</b></span><span style="font-size: 9pt;"><o:p></o:p></span></span></div>
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<span style="color: #000000; font-family: "Verdana","sans-serif"; font-size: 9.0pt; mso-bidi-font-family: "Courier New"; mso-fareast-font-family: "Times New Roman";">The
fundamental reason for the poor performance of conventional distributed
transactions is the fact that the mechanisms for guaranteeing atomicity
(distributed coordination) and isolation overlap in time. The key to enabling
high throughput distributed transactions is to separate these two concerns. This
insight leads to two ways of separating atomicity and isolation mechanisms. The
first option is to weaken isolation such that conflicting transactions can
execute and commit in parallel. The second option is to re-order atomicity and
isolation mechanisms so that they do not overlap in time, and in doing so, give
up fairness during transaction execution.<o:p></o:p></span></div>
<br />
<i>(Edit: MongoDB and HBase both have (or will soon have) limited support for multi-key transactions as long as those keys are within the same partition. However, hopefully it is clear to the reader that this post is discussing the difficulties of implementing distributed --- <b>cross-partition</b> --- transactions). </i><br />
<div class="MsoNormal" style="line-height: 18.0pt; margin-bottom: .0001pt; margin-bottom: 0in; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;">
<br /></div>
Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]17tag:blogger.com,1999:blog-8899645800948009496.post-59449370906670101232012-11-07T06:55:00.002-08:002012-11-07T06:55:19.333-08:00Is Upstart the right way to get college student start-ups funded?In 2010, the movie “<a href="http://www.imdb.com/title/tt1285016/" target="_blank">The Social Network</a>” was released, which
has had a tremendously positive effect on the computer science department at
Yale; and from what I have heard, a similar effect has been observed across the
country. My understanding (I have not seen this movie myself) is that the
movie’s plot revolves around Mark Zuckerberg and his role in the formation of
Facebook. In the time since the movie was released, the number of computer
science majors at Yale has nearly quadrupled, and these majors are increasingly looking at start-ups (either
founding their own or joining existing ones) as options for when they graduate (instead of going to Wall Street and working as quants, which has historically been the popular career path for Yale CS majors).<br />
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In my opinion, this is unquestionably a good thing. Yale has
some of the brightest minds of the next generation, and I feel a lot more
confident about the future of our country when I see these great minds being
applied to creating new entities and jobs and building something real, instead
of being wasted in the zero-sum arms race of who can create the automatic
trading algorithm that is epsilon better than anybody else’s. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
One consequence of this start-up craze is that I get
bombarded with requests from students who want to meet with me to discuss their
start-up idea. This is partly because I teach the “Intro to Programming” course at
Yale which has had consistently between 120 and 150 students (many of whom are budding entrepreneurs) enrolled since the
release of “The Social Network”, partly because the success of <a href="http://hadapt.com/" target="_blank">Hadapt </a>is
certainly no secret around Yale, and partly because I live on Yale’s campus and
part of my job in this capacity is to serve as an adviser and mentor to
undergraduates. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
When I meet with these students I hear all kinds of ideas.
Some of them are good, and some of them are bad. Some of them make me think
about an area in a different way, and some of them are carbon copies of
something that already exists. Some I could get excited about and some I couldn’t.
But just about all of them have one thing in common: the students involved
vastly overestimate their probability of success. I suppose this should not
surprise me --- after all, these are Yale students that have been successful in
just about everything they have ever done in their life. So it follows that
they would expect their start-up to be successful. But even when I talk to
students at other universities who have start-up ideas --- students who have
not necessarily been so successful in their lives --- even they are totally
convinced that their startup idea is unlikely to fail. It seems that there is a basic
psychological flaw in the human mind --- we so desperately want our dreams to
come true that we ignore statistical data about the probability of success and
trick ourselves into believing that we are the statistical anomaly and will
succeed where others have failed. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Many of these students find out the hard reality regarding
their start-up idea when they attempt to raise funding. They find
out that investors are extremely conscious of the probability of success of a
group of students with no experience, no reputation, and a limited network. Most
of these start-ups fail to raise funding from professional investors. Some
students give up at this point. Other students continue along with limited
funding from friends and family in an attempt to create more meat around the
bones of their idea and reduce risk for the professional investors. Most will eventually
fail, while a rare few will succeed.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The outcome of all this is that
despite all of these students eager to be entrepreneurs and start companies,
very few student ideas receive funding, and most of these ideas never see the
light of day. Whether or not this is a good thing is certainly up for debate,
but my feeling is that it is a shame that so few student start-ups get funding.</div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Therefore, when I first heard of <a href="http://www.upstart.com/" target="_blank">Upstart </a>(I think it was in
August), I was quite interested in the idea --- it proposed a way to get student
start-ups funded. I signed up to receive e-mail updates, but did not hear from
them for several months. However, on Monday of this week I received an update
from them that they were open for business. I looked through the profiles of
the students who were looking for funding and I saw that no fewer than 4 out of
the (approximately) 20 profiles that were available online were from Yale
University. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
However, a deeper look at the Upstart Website reveals a
problematic clause that is attached with the funding of the student start-up
ideas. This is not a traditional crowdfunding model where investors receive
equity in the start-up in exchange for their investment dollars. Instead, the
investors get a percentage of the student’s income for a 10-year period in exchange
for the investment. This way, in the likely event that the student’s start-up
idea does not work out, the investor is able to receive a nice return on
investment by taking a cut from the student’s hard earned salary when the
student enters the workforce. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This does not seem right to me. On one side you have
students who have an inaccurate view of the probability of success of their
start-up, and on the other side you have investors who are looking to profit
off of the boundless optimism and dreams of these students. These students,
with no experience in the real world, no understanding of what skills are necessary
to build a company, and a perception of entrepreneurship built more from
Hollywood than the cold realities of business, are more than happy to mortgage a
percentage of 10 years of future earnings for a chance to receive some
short-term money whose costs and benefits they have no idea how to properly evaluate.
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In the traditional model, where the investor receives equity in exchange for the
investment, at least the investor is in the same boat as the student --- their
interests are aligned and focused on making the start-up a success. With the
Upstart model, you have almost the exact opposite. Since the salary of a
founder is typically below-market in exchange for the equity the founder
receives, the expected rate of return for the investor is actually higher if
the student were to give up on the start-up and get a normal job. This is
especially true when the investment rate of return for the investor is capped
(as it is in Upstart), so that even if the start-up were to take off and the student
were to become very wealthy from it, the return to the investor is not markedly
different from what it would have been if the company had failed and the
student later received a salary at market value. To exacerbate the situation,
the investor-investee relationship in Upstart is also supposed to be something of a
mentor-mentee relationship, which is particularly dangerous when interests are
misaligned. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
I think Upstart should be commended for trying to get more
funding to college students with ideas for starting companies. And although I
don’t know many people involved, the people I do know are good people and I
highly doubt they are trying to do anything evil. (Jonathan Eng was a TA for my
Introduction to Programming class 4 years ago, and he was a good and
honest TA). However, I do not believe the people involved in Upstart realize how hard it
is for students to accurately evaluate the costs and benefits of receiving funding
in this way. Therefore I am highly concerned about this model as a way
forward for student entrepreneurship.<o:p></o:p></div>
Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]7tag:blogger.com,1999:blog-8899645800948009496.post-48000037127580471732012-10-29T11:44:00.000-07:002016-08-14T19:16:04.002-07:00IEEE Computer issue on the CAP TheoremDue to Hurricane Sandy, Yale gave me a day off from teaching today and I have finally been able to get to a few things on my "to-do" list. One of them is to write a blog post about the IEEE Computer CAP Retrospective edition and make my paper that appeared inside of it <a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf">publicly available</a>.<br />
<div class="MsoNormal">
<br />
Earlier this year, the IEEE Computer magazine <a href="http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6155638">came out with an issue</a> largely devoted to a 12-year retrospective of the CAP theorem and contains several articles from distributed systems researchers that contribute
various opinions and thoughts about CAP. The first article is from
Eric Brewer, who coined the CAP theorem 12 years ago (though he points out in
his article that it was actually 14 years ago). A PDF of Brewer’s article is
available for free from: <a href="http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed">http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed</a>.
The <a href="http://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf">second article</a> is from Seth Gilbert and Nancy Lynch (the same Gilbert and
Lynch that proved the CAP theorem 10 years ago). <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The third article is from me, and contains my criticisms of
CAP that long-time readers of my blog will be familiar with. In particular, I
point out that many people assume that modern NoSQL systems relax consistency
guarantees in order to gain availability due to the constraints of the CAP
theorem, when the reality is that these systems give up on consistency even in
the absence of network partitions, which is not required according to the CAP
theorem. The reason why they give up on
consistency is because of a desire to improve system latency, an increasingly
important requirement in the modern impatient world. I then describe the
latency-consistency tradeoff in more detail, and end the article with the
PACELC reformulation of CAP that <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">debuted on my blog</a> over two years ago.
With the permission of the IEEE, I am making <a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf">a free version of this article</a>
available today.
This article is the first time that the PACELC formulation and my thoughts on
CAP appear in a scholarly article, which gives people a venue to refer to (bibtex
code <a href="http://cs-www.cs.yale.edu/homes/dna/pubs/getbibtex.cgi?v=abadi-pacelc">available here</a>) when citing this work (you can stop citing a blog post!)<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The fourth article is from Raghu Ramakrishnan, entitled “CAP
and Cloud Data Management” and describes the PNUTS system that I have <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">mentioned in the past</a> as a good example of a system for which
the consistency-latency tradeoff has had a more direct impact on the system
design than the consistency-availability tradeoff of CAP. The fifth article is
from Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell of Cornell
University on overcoming CAP with soft-state replication. Unfortunately, I cannot find a free
link to Raghu’s article, but if you have an IEEE account, you can access it at: <a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6122007&tag=1">http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6122007&tag=1</a>. The Birman et al. article can be found for free at: <a href="http://www.cs.cornell.edu/Projects/mrc/CAP.pdf">http://www.cs.cornell.edu/Projects/mrc/CAP.pdf</a>.<br />
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
If you have enjoyed my thoughts on CAP on this blog, I
highly recommend you read each of these five articles. <o:p></o:p>The Brewer article in particular acknowledges my past criticism
of CAP not actually being about picking two of three out of C (consistency), A (availability),
and P (partition tolerance) due to the fact that it does not make sense to reason about
a system that is ‘CA’. (If there is no partition, any system can be both
consistent and available --- the only question is what happens when there is a
partition --- does consistency or availability get sacrificed?) Brewer uses
this observation to lead into a nice generalization of consistency-availability
tradeoff. In particular, when a partition occurs, the system does three things:
(1) detect that the partition occurred, (2) enter a partition mode that may or
may not limit some operations, and (3) initiate some sort of reconciliation algorithm
when the partition is fixed. Depending on how these three things are
implemented, it is possible to obtain
much of the spectrum between CP systems and AP systems. The article also
contains a nice reference to <a href="http://hal.inria.fr/docs/00/61/73/41/PDF/RR-7687.pdf">the CRDT work</a> by Shapiro et al. at INRIA. Overall, I strongly support Brewer’s approach to
navigating this tradeoff. It also fits nicely with Mehul Shah’s <a href="http://www.hpts.ws/sessions/HPTS-2011-distribute.pdf">talk at HPTS</a> in the way that the spectrum between consistency and availability is explicitly
considered at system design time, rather than trying to bolt consistency on top
of an AP (eventually consistent) system after the fact (a wildly suboptimal
endeavor).</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
While most of Brewer’s article focused on the
consistency-availability tradeoff, Brewer also briefly acknowledges that “in
its classic interpretation, the CAP theorem ignores latency”, and that some
systems reduce consistency for latency (he even refers to the PNUTS example I
used in my original blog post). I remain convinced that PACELC is the best way
to reason about both of these tradeoffs in a single formulation: if there is a
partition (P) how does the system tradeoff between availability and consistency
(A and C); else (E) when the system is running as normal in the absence of
partitions, how does the system tradeoff between latency (L) and consistency
(C)?<o:p></o:p></div>
Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]4tag:blogger.com,1999:blog-8899645800948009496.post-8396762018175409952012-06-26T06:01:00.000-07:002012-06-26T06:01:08.841-07:00Defending Matt Welsh’s 'Startup University' Post<br />
<div class="MsoNormal">
A week ago, Matt Welsh <a href="http://matt-welsh.blogspot.com/2012/06/startup-university.html">released a blog post</a> on attaching a startup incubator to a university in order to create a funding
model for some of the research that is performed at the university.
Unfortunately, the beginning part of the blog post talked about the “inefficiency”
of universities in terms of “producing real products” and the (perhaps overly
dramatic) assertion that “nothing of practical value came out of [Matt’s]
entire research career”. Although Matt has clarified that it was not his
intention to indicate that the goal of academic research was to “produce real,
shipping products that people could use”, many people interpreted the opening
part of Matt’s post in that way, and reacted negatively (including, notably, Michael
Mitzenmacher who <a href="http://matt-welsh.blogspot.com/2012/06/startup-university.html?showComment=1340064344043#c1224490173715862604">responded in a comment</a> and Joe Hellerstein who responded in <a href="http://databeta.wordpress.com/2012/06/20/an-open-letter-to-matt/">his own blog post</a>).
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
If we ignore the problems with the first part of Matt’s
post, the rest of the post raises some important points and interesting ideas.
As an academic who has spent large chunks of time spinning off a research project into a startup (HadoopDB was commercialized by
<a href="http://www.hadapt.com/">Hadapt</a>, which by most available metrics has been an example of a research lab-to-startup success story), many parts of Matt’s article rung true:</div>
<div class="MsoNormal">
</div>
<ol>
<li><span style="background-color: white; text-indent: -0.25in;">Matt’s statement: “Most universities make
starting a company painfully difficult when it comes to questions of IP
ownership [and] licensing” was certainly true for Hadapt. It took way too long, and
way too much effort to get an agreement in place. Part of the problem was
discussed in the comment thread of Matt’s post --- licensing patents are much
better aligned with the core mission of a university than accepting equity in
start-ups.</span></li>
<li><span style="background-color: white; font-size: 7pt; text-indent: -0.25in;"> </span><span style="background-color: white; text-indent: -0.25in;">Matt’s statement: “Most universities also make
starting a company painfully difficult when it comes to […] </span><span style="background-color: white; color: #333333; font-family: Cambria, serif; font-size: 11.5pt; line-height: 115%; text-indent: -0.25in;">forcing the academic's research to be
dissociated with their commercial activitie</span><span style="background-color: white; text-indent: -0.25in;">s.” This was also true for
me. I do not mean to criticize the university --- I absolutely understand the
need for the conflict of interest safeguards because of the way that universities
(and the assumptions of incoming students) are structured today. However,
restructuring some of these assumptions in the way that Matt talks about may reduce
the legal liabilities, and allow for fewer safeguards to have to be put in
place. I also think that the students are hurt more than helped by some of
these safeguards. For example, one of the PhD students involved in HadoopDB wanted
to work part time for Hadapt while finishing his PhD. However, due to the COI
legal complexities, he was forbidden from doing this and was forced to choose
between Hadapt and the PhD program (he, of course, chose to take a leave of absence
and join Hadapt).</span></li>
<li><span style="background-color: white; font-size: 7pt; text-indent: -0.25in;"> </span><span style="background-color: white; text-indent: -0.25in;">Matt’s statement that academics starting companies
“</span><span style="background-color: white; color: #333333; font-family: Cambria, serif; font-size: 11.5pt; line-height: 115%; text-indent: -0.25in;">involves a high degree of risk (potentially
career-ending for pre-tenure faculty)</span><span style="background-color: white; text-indent: -0.25in;">” obviously resonates with me. Whether or not Hadapt is successful, it has certainly taken my time away from publishing papers (though obviously, I'm still trying to publish as much as I can --- see, for example, <a href="http://dbmsmusings.blogspot.com/2012/05/if-all-these-new-dbms-technologies-are.html">my last post on the Calvin project</a>). Since publication quantity and quality remain key statistics for academic success, any conscious reduction of them comes with a clear risk.</span></li>
</ol>
<div class="MsoNormal">
The bottom line is that I absolutely agree with Matt’s
assertion that there are a lot of extremely intelligent faculty in academic
institutions across the world that have made the mental calculation and decided
that the benefits do not outweigh the risks in spinning off a startup from an
ongoing research project. Whether or not this is a bad thing is up
for debate --- it is certainly not the core mission of a university to spin off
companies or produce real-world products. However, most universities do have
some number of applied fields, and measuring impact in applied fields often begins with looking at real-world deployments of the research ideas. Starting companies is
clearly the most direct mechanism for translating research ideas to real-world
impact. Hence, it’s probably not a controversial statement to assert that
reducing some of the barriers to starting companies would allow faculty in applied
fields to increase their impact, the primary goal of research.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Therefore, allowing for explicit relationships between
research groups and university-sponsored start-up incubators, where the university
invests in a start-up, with proceeds from such investments being used to sponsor
additional research in the department, is an idea worth considering. I would,
however, change a few things about Matt’s proposal:</div>
<div class="MsoNormal">
</div>
<ol>
<li><span style="background-color: white; font-size: 7pt; text-indent: -0.25in;"> </span><span style="background-color: white; text-indent: -0.25in;">I would not simply replace venture capital money
with university money. Although it is easy to get into the trap of assuming
that the venture capitalist simply trades investment dollars for equity in the
company, it turns out that venture capitalists provide a lot of value in
addition to their money. Seeing firsthand the difference at Hadapt before and
after we got big-name venture capitalists behind us really drove this point
home for me. </span><span style="background-color: white; text-indent: -0.25in;"> </span><span style="background-color: white; text-indent: -0.25in;">Therefore, I would
recommend that the university partner with venture capitalists, or otherwise
hire successful venture capitalists to work in-house (and continue to
compensate them using the standard venture capital compensation schemes). Although
the Kauffman report has recently shed some light on how poorly venture
capital has performed over the last decade, the top venture capitalists have
still done very well, and it is important to remember that the goal for the university
is not to turn a profit on the investment, but rather to increase the number of
startups coming out of the university, in order to increase the research
impact. Break-even performance or even small amounts of losses are totally
acceptable.</span></li>
<li><span style="background-color: white; font-size: 7pt; text-indent: -0.25in;"> </span><span style="background-color: white; text-indent: -0.25in;">The model will not work for every university. The
location of the university is critical. Trying to get an incubator going for
universities located in the middle of nowhere is a recipe for disaster.
Technologists like to think that the technology that the company is
commercializing is the most important factor in the company’s success. In fact,
it falls way behind ‘market’ and ‘people’ as a determining factor. The company
needs competent and experienced people throughout the organization --- the engineering
team, marketing, sales, support, etc. Recruiting a competent team in a location
where there have been few comparable companies is likely to be futile.
Students from the university can only get you so far --- you need a mix of
experienced people as well.</span></li>
<li><span style="background-color: white; font-size: 7pt; text-indent: -0.25in;"> </span><span style="background-color: white; text-indent: -0.25in;">There need to be explicit mechanisms in place
to reduce the risk for the faculty member. This means that the faculty member
should get credit for certain company metrics at promotion or yearly evaluation
time in addition to standard paper citation metrics. Company financial data is
probably not a great metric, but customer counts of people actually using the
technology, or even customer counts at “me-too” competitors could be used. Three years after publishing the original HadoopDB paper, there are real people using
this technology to solve real problems. It’s pretty rare to see such an
immediate impact, and it ought to count for something.</span></li>
</ol>
<div class="MsoNormal">
Obviously my own experiences have made me predisposed to
liking Matt’s ideas, but I do encourage people to read the second half of his
post independently of the first half.<o:p></o:p></div>Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]4tag:blogger.com,1999:blog-8899645800948009496.post-82242254778943144322012-05-16T06:56:00.000-07:002012-05-16T12:50:22.761-07:00If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.<b><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">(This post is coauthored by <a href="http://cs.yale.edu/homes/thomson/">Alexander Thomson</a> and <a href="http://cs-www.cs.yale.edu/homes/dna/">Daniel Abadi</a>)</span></b><br />
<b><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.</span></b><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the <a href="http://www.tpc.org/tpcc/results/tpcc_perf_results.asp">TPC-C Website with the top 10 results in raw transactions per second</a></span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?</span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The surprising truth is that these new DBMS technologies are not </span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">listed in the TPC-C top ten results </span><span style="font-family: Arial; font-size: 15px; font-weight: bold; vertical-align: baseline; white-space: pre-wrap;">not</span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> because they do not care enough to enter, but rather because they would not win if they did. </span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. And the later parts of this blog post describe research from our group at Yale that introduces a fourth category of tradeoff that provides a roadmap to end the dominance of Oracle and DB2.</span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">These categories are:</span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">(1) </span><span style="font-family: Arial; font-size: 15px; font-weight: bold; vertical-align: baseline; white-space: pre-wrap;">Sacrifice ACID for scalability.</span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> <a href="http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html">Our previous post on this topic</a> discussed this in detail. Basically we argue that a major class of new scalable technologies falls under the category of “NoSQL”; these systems achieve scalability by dropping ACID guarantees, thereby allowing them to eschew two-phase locking, two-phase commit, and other impediments to concurrency and processor independence that hurt scalability. All of these systems that relax ACID are immediately ineligible to enter the TPC-C competition since ACID guarantees are one of TPC-C’s requirements. That’s why you don’t see NoSQL databases in the TPC-C top 10---they are immediately disqualified.</span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">(2) </span><span style="font-family: Arial; font-size: 15px; font-weight: bold; vertical-align: baseline; white-space: pre-wrap;">Reduce transaction flexibility for scalability. </span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">There are many so-called <a href="http://blogs.the451group.com/information_management/2011/04/06/what-we-talk-about-when-we-talk-about-newsql/">“NewSQL” databases</a></span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> that claim to be both ACID-compliant and scalable. And these claims are true---to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single “partition” or “shard” of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards to be fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), then these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but need to perform two phase commit or other agreement protocols in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two phase commit come at a great scalability cost (see <a href="http://cs-www.cs.yale.edu/homes/dna/papers/determinism-vldb10.pdf">our 2010 paper</a> that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called “distributed transactions” or “multi-partition transactions”) are very rare. 
Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a “remote” warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That’s why the NewSQL databases do not enter TPC-C results --- even just 10% of multi-shard transactions causes their performance to degrade rapidly.</span><br />
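To see how sharply even a small multi-shard fraction hurts, consider a back-of-the-envelope model. The numbers below are our own illustrative assumptions (0.1 ms of local work per transaction, a 1 ms round trip within a datacenter), not TPC-C measurements; the point is only that locks held across the extra round trips of an agreement protocol dominate transaction time:

```python
# Toy model: a multi-shard transaction pays two extra network round trips
# (prepare + commit in two-phase commit) while still holding its locks.
# All numbers are illustrative assumptions, not benchmark results.

def avg_txn_time_ms(local_ms, network_rtt_ms, multi_shard_fraction):
    """Average lock-holding time per transaction, mixing single-shard
    transactions with multi-shard ones that run two-phase commit."""
    two_pc_ms = local_ms + 2 * network_rtt_ms   # locks held across both rounds
    return (1 - multi_shard_fraction) * local_ms + multi_shard_fraction * two_pc_ms

local, rtt = 0.1, 1.0   # assumed: 0.1 ms local work, 1 ms datacenter RTT
for frac in (0.0, 0.01, 0.10):
    t = avg_txn_time_ms(local, rtt, frac)
    print(f"{frac:>4.0%} multi-shard: {t:.3f} ms avg lock-hold time per txn")
```

Under these assumptions, moving from 0% to 10% multi-shard transactions triples the average lock-holding time (0.1 ms to 0.3 ms), which is roughly the kind of degradation the paragraph above describes.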
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">(3) </span><span style="font-family: Arial; font-size: 15px; font-weight: bold; vertical-align: baseline; white-space: pre-wrap;">Trade cost for scalability</span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">. If you use high-end hardware, it is possible to get stunningly high transactional throughput using old database technologies that don’t have shared-nothing horizontal scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human-scale applications. The only downside is cost---the Oracle system that is able to achieve 500,000 transactions per second costs a prohibitive $30,000,000! </span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that’s why the old database technologies are still dominating TPC-C. None of these new technologies can handle both ACID and 10% remote transactions.</span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: bold; vertical-align: baseline; white-space: pre-wrap;">A fourth approach...</span><br />
<b><span style="font-family: Arial; font-size: 15px; vertical-align: baseline; white-space: pre-wrap;"></span></b><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the <a href="http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.pdf">full paper</a></span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. The cost to us to run these experiments was less than $300 (of course, this is renting hardware rather than buying, so it’s hard to compare prices --- but still --- a factor of 100,000 less than $30,000,000 is quite large).</span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> </span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Calvin, our prototype system designed and built by a large team of researchers at Yale that include Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases on cheap, commodity hardware stored locally or in the cloud. Calvin significantly</span> <span style="white-space: pre-wrap;">improves the scalability over our <a href="http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html">previous approach</a> to achieving determinism in database systems.</span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: bold; vertical-align: baseline; white-space: pre-wrap;">Scaling ACID</span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br />
<span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">The key to Calvin’s strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work </span><span style="font-family: Arial; font-size: 15px; font-style: italic; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">before</span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:</span><br />
<br />
<ul>
<li><span style="font-family: Arial; font-size: 15px; white-space: pre-wrap;">Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
</span></li>
<li><span style="font-family: Arial; font-size: 15px; white-space: pre-wrap;">Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
</span></li>
<li><span style="font-family: Arial; font-size: 15px; white-space: pre-wrap;">Disk accesses. In <a href="http://cs-www.cs.yale.edu/homes/dna/papers/determinism-vldb10.pdf">our VLDB 2010 paper</a>, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.</span></li>
</ul>
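The core idea behind this pipeline can be sketched in a few lines. The toy code below is our illustration, not Calvin's actual implementation: once every replica holds the same agreed-upon input sequence, each executes it deterministically and independently, and all replicas converge to identical state without any commit protocol or synchronization of results:

```python
# Sketch (hypothetical, not Calvin's code): deterministic replay of an
# agreed-upon transaction sequence makes replicas converge with no 2PC.

def transfer(db, src, dst, amount):
    """A deterministic, fully server-side transaction."""
    if db.get(src, 0) >= amount:
        db[src] = db[src] - amount
        db[dst] = db.get(dst, 0) + amount

def run_replica(sequence):
    """Each replica independently applies the same global sequence in order."""
    db = {"alice": 100, "bob": 50}
    for txn, args in sequence:
        txn(db, *args)
    return db

# The sequencing layer agrees on this order up front (replication of *input*);
# after that point, replicas never exchange transaction results.
sequence = [(transfer, ("alice", "bob", 30)), (transfer, ("bob", "alice", 10))]
replica_1 = run_replica(sequence)
replica_2 = run_replica(sequence)
assert replica_1 == replica_2   # identical state with zero output synchronization
```

Because the transaction logic itself is deterministic and the input order is fixed in advance, the "agreement" work happens once, before locks are acquired, rather than once per distributed transaction as in two-phase commit.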
<br />
<b><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">As a result, each transaction’s user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction’s logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; vertical-align: baseline; white-space: pre-wrap;">Strongly consistent global replication</span><br /><span style="font-family: Arial; font-size: 15px; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Calvin’s deterministic execution semantics provide an additional benefit: replicating transactional </span><span style="font-family: Arial; font-size: 15px; font-style: italic; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">input</span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> is sufficient to achieve strongly consistent replication. 
Since replicating batches of transaction requests is extremely inexpensive and happens </span><span style="font-family: Arial; font-size: 15px; font-style: italic; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">before </span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">the transactions acquire locks and begin executing, Calvin’s transactional throughput capacity does not depend </span><span style="font-family: Arial; font-size: 15px; font-style: italic; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at all</span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> on its replication configuration.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">In other words, not only can Calvin run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers---at no cost to throughput.</span><span style="font-family: Arial; font-size: 15px; font-style: italic; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction 
results between replicas. (Calvin’s end-to-end transaction </span><span style="font-family: Arial; font-size: 15px; font-style: italic; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">latency</span><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> does depend on message delays between replicas, of course---there is no getting around the speed of light.)</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; vertical-align: baseline; white-space: pre-wrap;">Flexible data model</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. 
More recently, we’ve hooked Calvin up to Google’s </span><a href="http://code.google.com/p/leveldb/"><span style="color: #1155cc; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">LevelDB</span></a><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"> and added support for SQL-based data access within transactions, building relational tables on top of LevelDB’s efficient sorted-string storage.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">From an application developer’s point of view, Calvin’s primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction. 
</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">If this limitation is acceptable, Calvin presents a nice alternative in the tradeoff space to achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).</span></b>Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]43tag:blogger.com,1999:blog-8899645800948009496.post-1901913986059737362011-12-07T20:38:00.000-08:002011-12-08T07:16:11.595-08:00Replication and the latency-consistency tradeoff<div>As 24/7 availability becomes increasingly important for modern applications, database systems are frequently replicated in order to stay up and running in the face of database server failure. It is no longer acceptable for an application to wait for a database to recover from a log on disk --- most mission-critical applications need immediate failover to a replica.</div><div><br /></div><div>There are several important tradeoffs to consider when it comes to system design for replicated database systems. The most famous one is CAP --- you have to trade off consistency vs. availability in the event of a network partition. In this post, I will go into detail about a lesser-known but equally important tradeoff --- between latency and consistency. Unlike CAP, where consistency and availability are only traded off in the event of a network partition, the latency vs. consistency tradeoff is present even during normal operations of the system. 
(Note: the latency-consistency tradeoff discussed in this post is the same as the "ELC" case in my <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">PACELC post</a>).</div><div><br /></div><div>The intuition behind the tradeoff is the following: there's no way to perform consistent replication across database replicas without some level of synchronous network communication. This communication takes time and introduces latency. For replicas that are physically close to each other (e.g., on the same switch), this latency is not necessarily onerous. But replication over a WAN will introduce significant latency.</div><div><br /></div><div>The rest of this post adds more meat to the above intuition. I will discuss several general techniques for performing replication, and show how each technique trades off latency or consistency. I will then discuss several modern implementations of distributed database systems and show how they fit into the general replication techniques that are outlined in this post.</div><div><br /></div>There are only three alternatives for implementing replication (each with several variations): (1) data updates are sent to all replicas at the same time, (2) data updates are sent to an agreed upon master node first, or (3) data updates are sent to a single (arbitrary) node first. Each of these three cases can be implemented in various ways; however, each implementation comes with a consistency-latency tradeoff. This is described in detail below.<div><br /><ol><li><b>Data updates are sent to all replicas at the same time</b>. If updates are not first passed through a preprocessing layer or some other agreement protocol, replica divergence (a clear lack of consistency) could ensue (assuming there are multiple updates to the system that are submitted concurrently, e.g., from different clients), since each replica might choose a different order with which to apply the updates. 
On the other hand, if updates are first passed through a preprocessing layer, or all nodes involved in the write use an agreement protocol to decide on the order of operations, then it is possible to ensure that all replicas will agree on the order in which to process the updates, but this leads to several sources of increased latency. For the case of the agreement protocol, the protocol itself is the additional source of latency. For the case of the preprocessor, the additional sources of latency are:<br /><ol type="a"><br /><li>Routing updates through an additional system component (the preprocessor) increases latency</li><br /><li>The preprocessor either consists of multiple machines or a single machine. If it consists of multiple machines, an agreement protocol to decide on operation ordering is still needed across machines. Alternatively, if it runs on a single machine, all updates, no matter where they are initiated (potentially anywhere in the world) are forced to route all the way to the single preprocessor first, even if there is a data replica that is nearer to the update initiation location.</li><br /></ol></li><br /><li><b>Data updates are sent to an agreed upon location first</b> (this location can be dependent on the actual data being updated) --- we will call this the “master node” for a particular data item. This master node resolves all requests to update the same data item, and the order that it picks to perform these updates will determine the order that all replicas perform the updates. After it resolves updates, it replicates them to all replica locations. There are three options for this replication:<br /><ol type="a"><br /><li>The replication is done synchronously, meaning that the master node waits until all updates have made it to the replica(s) before "committing" the update. 
This ensures that the replicas remain consistent, but synchronous actions across independent entities (especially if this occurs over a WAN) increase latency due to the requirement to pass messages between these entities, and the fact that latency is limited by the speed of the slowest entity.</li><br /><li>The replication is done asynchronously, meaning that the update is treated as if it were completed before it has been replicated. Typically the update has at least made it to stable storage somewhere before the initiator of the update is told that it has completed (in case the master node fails), but there are no guarantees that the update has been propagated to replicas. The consistency-latency tradeoff in this case is dependent on how reads are dealt with:<br /><ol type="i"><li>If all reads are routed to the master node and served from there, then there is no reduction in consistency. However, there are several latency problems with this approach:<br /><ol><li>Even if there is a replica close to the initiator of the read request, the request must still be routed to the master node, which could potentially be physically much farther away.</li><br /><li>If the master node is overloaded with other requests or has failed, there is no option to serve the read from a different node. Rather, the request must wait for the master node to become free or recover. In other words, there is a potential for increased latency due to a lack of load balancing options.</li></ol></li><br /><li>If reads can be served from any node, read latency is much better, but this can result in inconsistent reads of the same data item, since different locations have different versions of a data item while its updates are still being propagated, and a read can potentially be sent to any of these locations. 
Although the level of reduced consistency can be bounded by keeping track of update sequence numbers and using them to implement “sequential/timeline consistency” or “read-your-writes consistency”, these options are nonetheless reduced consistency options. Furthermore, write latency can be high if the master for a write operation is geographically far away from the requester of the write.</li></ol></li><br /><li>A combination of (a) and (b) is possible. Updates are sent to some subset of replicas synchronously, and the rest asynchronously. The consistency-latency tradeoff in this case again is determined by how reads are dealt with. If reads are routed to at least one node that had been synchronously updated (e.g. when R + W > N in a quorum protocol, where R is the number of nodes involved in a synchronous read, W is the number of nodes involved in a synchronous write, and N is the number of replicas), then consistency can be preserved, but the latency problems of (a), (b)(i)(1), and (b)(i)(2) are all present (though to somewhat lower degrees, since the number of nodes involved in the synchronization is smaller, and there is potentially more than one node that can serve read requests). If it is possible for reads to be served from nodes that have not been synchronously updated (e.g. when R + W <= N), then inconsistent reads are possible, as in (b)(ii) above.</li></ol><br /></li><li><b>Data updates are sent to an arbitrary location first</b>, the updates are performed there, and are then propagated to the other replicas. The difference between this case and case (2) above is that the location that updates are sent to for a particular data item is not always the same. For example, two different updates for a particular data item can be initiated at two different locations simultaneously. 
The consistency-latency tradeoff again depends on two options:<br /><ol type="a"><li>If replication is done synchronously, then the latency problems of case (2)(a) above are present. Additionally, extra latency can be incurred in order to detect and resolve cases of simultaneous updates to the same data item initiated at two different locations.</li><br /><li>If replication is done asynchronously, then similar consistency problems as described in cases (1) and (2)(b) above present themselves.</li></ol></ol></div><div><br />Therefore, no matter how the replication is performed, there is a tradeoff between consistency and latency. For carefully controlled replication across short distances, there exist reasonable options (e.g. choice (2)(a) above, since network communication latency is small in local data centers); however, for replication over a WAN, there is no way around the significant consistency-latency tradeoff.</div><div><br />To more fully understand the tradeoff, it is helpful to consider how several well-known distributed systems are placed into the categories outlined above. Dynamo, Riak, and Cassandra choose a combination of (2)(c) and (3) from the replication alternatives described above. In particular, updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c)). Reads are synchronously sent to R nodes, with R + W typically being set to a number less than or equal to N, leading to a possibility of inconsistent reads. However, the system does not always send updates to the same node for a particular data item (e.g., this can happen in various failure cases, or due to rerouting by a load balancer), which leads to the situation described in alternative (3) above, and the potentially more substantial types of consistency shortfalls. PNUTS chooses option (2)(b)(ii) above, for excellent latency at reduced consistency. 
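As a concrete aside, the R + W > N overlap condition just mentioned can be checked exhaustively in a few lines (a toy sketch of my own, not tied to any particular system's implementation):

```python
# Toy sketch: with N replicas, a write acknowledged by W nodes and a read
# that contacts R nodes are guaranteed to overlap in at least one replica
# exactly when R + W > N.
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """Exhaustively verify that every possible R-node read set
    intersects every possible W-node write set among n replicas."""
    replicas = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(replicas, r)
               for writes in combinations(replicas, w))

# R + W > N: every read is guaranteed to reach a synchronously updated node.
assert quorums_always_overlap(n=5, r=3, w=3)
# R + W <= N: some read set can miss every written replica, so stale
# (inconsistent) reads are possible.
assert not quorums_always_overlap(n=5, r=2, w=3)
```

Note that even in the overlapping configuration, the latency cost of the synchronous round trips to W nodes remains, which is exactly the tradeoff being described.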
HBase chooses (2)(a) within a cluster, but gives up consistency for lower latency for replication across different clusters (using option (2)(b)).</div><div><br />In conclusion, there are two major reasons to reduce consistency in modern distributed database systems, and only one of them is CAP. Ignoring the consistency-latency tradeoff of replicated systems is a great oversight, since it is present at all times during system operation, whereas CAP is only relevant in the (arguably) rare case of a network partition. In fact, the consistency-latency tradeoff is potentially more significant than CAP, since it has a more direct effect on the baseline operations of modern distributed database systems.<br /></div>Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]15tag:blogger.com,1999:blog-8899645800948009496.post-61819389965201947252011-10-04T21:57:00.000-07:002011-10-07T09:28:21.076-07:00Overview of the Oracle NoSQL DatabaseOracle is the clear market leader in the commercial database community, and therefore it is critical for any member of the database community to pay close attention to the new product announcements coming out of Oracle’s annual Open World conference. The sheer size of Oracle’s sales force, entrenched customer base, and third-party ecosystem instantly gives any new Oracle product the potential for very high impact. Oracle’s new products require significant attention simply because they’re made by Oracle.<br /><br />I was particularly eager for this year’s Oracle Open World conference, because there were rumors of two separate new Oracle products involving Hadoop and NoSQL --- two of the central research focuses of my database group at Yale --- one of them (Hadoop) also being the focus of my recent startup (<a href="http://www.hadapt.com">Hadapt</a>). 
Oracle’s Hadoop announcements, while very interesting from a business perspective (everyone is talking about how this “validates” Hadoop), are not so interesting from a technical perspective (the announcements seem to revolve around (1) creating a “connector” between Hadoop and Oracle, where Hadoop is used for ETL tasks, and the output of these tasks is then loaded over this connector to the Oracle DBMS, and (2) packaging the whole thing into an appliance, which again is very important from a business perspective since there is certainly a market for anything that makes Hadoop easier to use, but does not seem to be introducing any technically interesting new contributions).<br /><br />In contrast, the Oracle NoSQL database is actually a brand new system built by the Oracle BerkeleyDB team, and is therefore very interesting from a technical perspective. I therefore spent way too much time trying to find out as much as I could about this new system from a variety of sources. There is not yet a lot of publicly available information about the system; however, there is <a href="http://www.oracle.com/technetwork/database/nosqldb/learnmore/nosql-database-498041.pdf">a useful whitepaper</a> written by the illustrious Harvard professor Margo Seltzer, who has been working with Oracle since they acquired her start-up in 2006 (the aforementioned BerkeleyDB).<br /><br />Due to the dearth of available information on the system, I thought that it would be helpful to the readers of my blog if I provided an overview of what I’ve learned about it so far. Some of the facts I state below have been directly stated by Oracle; other facts are inferences that I’ve made, based on my understanding of the system architecture and implementation. As always, if I have made any mistakes in my inferences, please let me know, and I will fix them as soon as possible.<br /><br />The coolest thing about the Oracle NoSQL database is that it is not a simple copy of a currently existing NoSQL system. 
It is not Dynamo or SimpleDB. It is not Bigtable or HBase. It is not Cassandra or Riak. It is not MongoDB or CouchDB. It is a new system that has chosen a different point (actually --- several different points) in the system-design tradeoff space than any of the above mentioned systems. Since it makes a different set of tradeoffs, it is entirely inappropriate to call it “better” or “worse” than any of these systems. There will be situations where the Oracle solution will be more appropriate, and there will be situations where other systems will be more appropriate.<br /><br /><span style="font-weight:bold;">Overview of the system:</span><br />Oracle NoSQL database is a distributed, replicated key-value store. Given a cluster of machines (in a shared-nothing architecture, with each machine having its own storage, CPU, and memory), each key-value pair is placed on several of these machines depending on the result of a hash function on the key. In particular, the key-value pair will be placed on a single master node, and a configurable number of replica nodes. All write and update operations for a key-value pair go to the master node for that pair first, and then to all replica nodes afterwards. This replication is typically done asynchronously, but it is possible to request that it be done synchronously if one is willing to tolerate the higher latency costs. Read operations can go to any node if the user doesn’t mind incomplete consistency guarantees (i.e. reads might not see the most recent data), but they must be served from the master node if the user requires the most recent value for a data item (unless replication is done synchronously). There is no SQL interface (it is a NoSQL system after all!) 
--- rather it supports simple insert, update, and delete operations of key-value pairs.<br /><br />The following is where the Oracle NoSQL Database falls in various key dimensions:<br /><br /><span style="font-weight:bold;">CAP</span><br />Like many NoSQL databases, the Oracle NoSQL Database is configurable to be either C/P or A/P in CAP. In particular, if writes are configured to be performed synchronously to all replicas, it is C/P in CAP --- a partition or node failure causes the system to be unavailable for writes. If replication is performed asynchronously, and reads are configured to be served from any replica, it is A/P in CAP --- the system is always available, but there is no guarantee of consistency. [<span style="font-style:italic;">Edit: Actually this configuration is really just P of CAP --- minority partitions become unavailable for writes (see comments about eventual consistency below). This violates the technical definition of "availability" in CAP. However, it is obviously the case that the system still has more availability in this case than the synchronous write configuration.</span>]<br /><br /><span style="font-weight:bold;">Eventual consistency</span><br />Unlike Dynamo, SimpleDB, Cassandra, or Riak, the Oracle NoSQL Database does not support eventual consistency. I found this to be extremely amusing, since Oracle’s marketing material associates NoSQL with the BASE acronym. But the E in BASE stands for eventual consistency! So by Oracle’s own definition, their lack of support of eventual consistency means that their NoSQL Database is not actually a NoSQL Database! (In my opinion, their database is really NoSQL --- they just need to fix their marketing literature that associates NoSQL with BASE). My proof for why the Oracle NoSQL Database does not support eventual consistency is the following: Let’s say the master node for a particular key-value pair fails, or a network partition separates the master node from its replica nodes. 
The key-value pair becomes unavailable for writes for a short time until the system elects a new master node from the replicas. Writes can then continue at the new master node. However, any writes that had been submitted to the old master node, but had not yet been sent to the replicas before the master node failure (or partition), are lost. In an eventually consistent system, these old writes can be reconciled with the current state of the key-value pair after the failed node recovers its log from stable storage, or when the network partition is repaired. Of course, if replication had been configured to be done synchronously (at a cost of latency), there would not be data loss during network partitions or node failures. Therefore, <span style="font-weight:bold;">there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to trade off <span style="font-style:italic;">consistency </span>for latency and availability during failure and network partition events, the Oracle NoSQL system instead trades off <span style="font-style:italic;">durability </span>for latency and availability.</span> To be clear, this difference is only for inserts and updates --- the Oracle NoSQL database is able to trade off consistency for latency on read requests --- it supports similar types of timeline consistency tradeoffs as the <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">Yahoo PNUTs/Sherpa system</a>.<br /><br /><span style="font-style:italic;">[Two of the members of the Oracle NoSQL Database team have commented below. There is a little bit of a debate about my statement that the Oracle NoSQL Database lacks eventual consistency, but I stand by the text I wrote above. For more, see the comments.]</span><br /><br /><span style="font-weight:bold;">Joins</span><br />Like most NoSQL systems, the Oracle NoSQL database does not support joins. 
It only supports simple read, write, update, and delete operations on key-value pairs.<br /><br /><span style="font-weight:bold;">Data Model</span><br />The Oracle NoSQL database actually has a more subtle data model than simple key-value pairs. In particular, the key is broken down into a “major key path” and “minor key path” where all keys with the same “major key path” are guaranteed to be stored on the same physical node. I expect that the way minor keys will be used in the Oracle NoSQL database will map directly to the way column families are used in Bigtable, HBase and Cassandra. Rather than trying to gather together every possible attribute about a key in a giant “value” for the single key-value pair, you can separate them into separate key-value pairs where the “major key path” is the same for all the keys in the set of key-value pairs, but the “minor key path” will be different. This is similar to how column families for the same key in Bigtable, HBase, and Cassandra can also be stored separately. Personally, I find the major and minor key path model to be more elegant than the column family model (I have <a href="http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html">ranted against column-families</a> in the past).<br /><br /><span style="font-weight:bold;">ACID compliance</span><br />Like most NoSQL systems, the Oracle NoSQL database is not ACID compliant. Besides the durability and consistency tradeoffs mentioned above, the Oracle NoSQL database also does not support arbitrary atomic transactions (the A in ACID). 
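To make the placement rule behind the major/minor key path model concrete, here is a minimal sketch (the function name, hash choice, and node count are my own assumptions for illustration, not Oracle's actual API or implementation): only the major key path feeds the hash, so all key-value pairs sharing a major key path land on the same node.

```python
# Hypothetical illustration of major-key-path placement (not Oracle's code).
import hashlib

NUM_NODES = 4

def node_for(major_path, minor_path):
    # The minor key path is deliberately ignored for placement;
    # only the major key path is hashed.
    digest = hashlib.sha1("/".join(major_path).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

# Every pair sharing the major key path ("users", "alice") lands on the
# same node, whatever its minor key path.
assert node_for(("users", "alice"), ("profile",)) == \
       node_for(("users", "alice"), ("photos", "p1"))
```

This colocation is what makes single-node atomic operations across a set of related pairs possible without a distributed commit protocol, as described next.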
However, it does support atomic operations on the same key, and even allows atomic transactions on sets of keys that share the same major key path (since keys that share the same major key path are guaranteed to be stored on the same node, atomic operations can be performed without having to worry about distributed commit protocols across multiple machines).<br /><br /><span style="font-weight:bold;">Summary</span><br />The sweet spot for the Oracle NoSQL database seems to be in single-rack deployments (e.g. the Oracle Big Data appliance) with a low-latency network, so that the system can be set up to use synchronous replication while keeping the latency costs of this type of replication small (and the probability of network partitions is small). Another sweet spot is for wider-area deployments where the application is able to work around reduced durability guarantees. It therefore seems to present the largest amount of competition for NoSQL databases like MongoDB, which have similar sweet spots. However, the Oracle NoSQL database will need to add additional “developer-friendly” features if it wants to compete head-to-head with MongoDB. Either way, there are clearly situations where the Oracle NoSQL database will be a great fit, and I love that Oracle (in particular, the Oracle BerkeleyDB team) built this system from scratch as an interesting and technically distinct alternative to currently available NoSQL systems. I hope Oracle continues to invest in the system and the team behind it.Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]31tag:blogger.com,1999:blog-8899645800948009496.post-78790141027835491552011-07-19T06:31:00.001-07:002011-07-28T14:16:02.794-07:00Hadoop's tremendous inefficiency on graph data management (and how to avoid it)Hadoop is great. It seems clear that it will serve as the basis of the vast majority of analytical data management within five years. 
Already today it is extremely popular for unstructured and polystructured data analysis and processing, since it is hard to find other options that are superior from a price/performance perspective. The reader should not take the following as me blasting Hadoop. I believe that Hadoop (with its ecosystem) is going to take over the world.<br /><br />The problem with Hadoop is that its strength is also its weakness. Hadoop gives the user tremendous flexibility and power to scale all kinds of different data management problems. This is obviously great. But it is this same flexibility that allows the user to perform incredibly inefficient things and not care, because (a) they can simply add more machines and use Hadoop's scalability to hide inefficiency in user code, and (b) they can convince themselves that since everyone talks about Hadoop as being designed for "batch data processing" anyway, they can let their process run in the background and not care about how long it will take for it to return.<br /><br />Although not the subject of this post, an example of this inefficiency can be found in a <a href="http://cs-www.cs.yale.edu/homes/dna/papers/split-execution-hadoopdb.pdf">SIGMOD paper</a> that a bunch of us from Yale and the University of Wisconsin published 5 weeks ago. The paper shows that using Hadoop on structured (relational) data is at least a factor of 50 less efficient than it needs to be (an incredibly large number given how hard data center administrators work to yield less than a factor of two improvement in efficiency). As many readers of this blog already know, this factor of 50 improvement is the reason why <a href="http://www.hadapt.com/">Hadapt</a> was founded. But this post is not about Hadapt or relational data. 
In this post, the focus is on graph data, and how, if one is not careful, using Hadoop can be well over a factor of 1000 less efficient than it needs to be.<br /><br />Before we get into how to improve Hadoop's efficiency on graph data by a factor of 1000, let's pause for a second to comprehend how dangerous it is to let inefficiencies in Hadoop become widespread. Imagine a world where the vast majority of data processing runs on Hadoop (a not entirely implausible scenario). If people allow these factors of 50 or 1000 to exist in their Hadoop utilization, these inefficiency factors translate directly to factors of 50 or 1000 <span style="font-style:italic;">more power utilization, more carbon emissions, more data center space, and more silicon waste</span>. The disastrous environmental consequences in a world where everyone standardizes on incredibly inefficient technology are downright terrifying. And this is ignoring the impact on businesses in terms of server and energy costs, and lower performance. It seems clear that developing a series of "best practices" around using Hadoop efficiently is going to be extremely important moving forward.<br /><br />Let's delve into the subject of graph data in more detail. Recently there was a <a href="http://www.dist-systems.bbn.com/people/krohloff/papers/2011/Rohloff_Schantz_DIDC_2011.pdf">paper by Rohloff et al.</a> that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph of data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. 
This paper and resulting system (called SHARD) have received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a <a href="http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/">feature on Cloudera's Website</a>. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically been difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.<br /><br />The only problem is that, as shown by an <a href="http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf">upcoming VLDB paper that we're releasing today</a>, this technique is an astonishing factor of 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper. Our paper, led by my student, Jiewen Huang, achieves these enormous speedups in the following ways:<br /><br /><ol><li>Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close together in the graph can end up very far apart in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). 
When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.<br /><br /></li><li>Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency.<br /><br /></li><li>Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores is optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. 
By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.</li></ol><br />To a first degree of approximation, each of the above three improvements yields an entire order of magnitude speedup (a factor of 10). By combining them, we therefore saw the factor of 1340 improvement in performance on the identical benchmark that was run in the SHARD paper. (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results, <a href="http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf">please see our paper</a>.)<br /><br /><div>It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the famous Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs for which people tend to use systems based on Google's Pregel project today).</div><div><br />In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to reach a factor of two, it is likely a good idea to think about what is causing this inefficiency, and attempt to find ways to avoid it (while keeping the same scalability properties). Certainly once the factor extends beyond two (such as the enormous 1340 factor we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. 
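The locality argument from the first improvement above (graph partitioning instead of hash partitioning) can be seen in a toy example (my own illustration of the general idea, not the actual partitioning algorithm from the paper): count the "cut" edges whose endpoints land on different machines, since each cut edge is a potential network hop during sub-graph pattern matching.

```python
# Toy graph: two triangles connected by a single bridge edge, spread
# across two machines.
edges = [(0, 1), (1, 2), (2, 0),       # triangle A
         (3, 4), (4, 5), (5, 3),       # triangle B
         (2, 3)]                       # lone bridge between the triangles

def edge_cut(assignment):
    """Number of edges whose endpoints live on different machines."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

hash_assignment = {v: v % 2 for v in range(6)}            # hash partitioning
graph_assignment = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # one triangle per machine

print(edge_cut(hash_assignment))   # 5 of the 7 edges cross machines
print(edge_cut(graph_assignment))  # only the bridge edge crosses
```

On real graphs the same effect holds at scale: a locality-aware partitioner cuts a small fraction of edges, while hash partitioning cuts most of them, and every cut edge translates into network traffic (roughly one MapReduce job per hop).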
This does not mean that Hadoop should be thrown away; however, it will become necessary to package Hadoop with "best practice" solutions to avoid such unnecessarily high levels of waste.<br /></div>Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]14tag:blogger.com,1999:blog-8899645800948009496.post-65804299620096644802011-05-19T06:16:00.000-07:002011-05-19T06:37:26.571-07:00Why Sam Madden is wrong about peer review<p class="MsoNormal">Yesterday my former PhD advisor, Sam Madden, <a href="http://joinedandserved.blogspot.com/2011/05/thoughts-on-value-of-peer-review-in-cs.html">wrote a blog post</a> consisting of a passionate defense of the status quo in the peer review process (though he does say that the review quality needs to be improved). In an effort to draw attention to his blog (Sam is a super-smart guy, and you will get a lot out of reading his blog), I intend to start a flame war with him in this space.</p> <p class="MsoNormal">At issue: The quality of reviews of research paper submissions in the database community is deteriorating rapidly. It is clear that something needs to be fixed. Jeff Naughton offered several suggestions for how to fix the problem in his <a href="http://cs-www.cs.yale.edu/homes/dna/talks/naughtonicde.pdf">ICDE keynote</a>. A few days ago, I publicly supported his fifth suggestion (eliminating the review process altogether) on Twitter. Sam argued against this suggestion using five main points. Below I list each of Sam's points, and explain why everything he says is wrong:</p> <hr> <p class="MsoNormal">Sam's point #1: Most of the submissions aren't very good. The review process does the community a favor in making sure that these bad papers do not get published.</p> <p class="MsoNormal">My response: I think only a few papers are truly embarrassing, but who cares? Most of the videos uploaded to YouTube aren't very good. They don't in any way detract from the good videos that are uploaded. 
The cost of publishing a bad paper is basically zero if everybody knows that all papers will be accepted. Rejecting a good paper, which then gets sent to a non-database venue and receives all kinds of publicity there, yields a tremendous opportunity cost to the database community. Sam Madden should know this very well since (perhaps) his most famous paper fits in that category. The model of "accept everything and let the good submissions carry you" has always proven to be a better model than "let's have a committee of busy people who basically have zero incentive to do a good job (beyond their own ethical standards) decide what to accept" when the marginal cost of accepting an additional submission is near zero. In the Internet age, the good submissions (even from unknown authors) get their appropriate publicity with surprising speed (see YouTube, Hacker News, Quora, etc.).</p> <p class="MsoNormal">Sam's point #2: If every paper is accepted, then how do we decide which papers get the opportunity to be presented at the conference? It seems we need a review committee at least for that.</p> <p class="MsoNormal">My response: First of all, there might be fewer submissions under the "accept everything" model, since there will not be any resubmissions, and there is incentive for people to make sure that their paper is actually ready for publication before submitting it (because the onus of making sure their paper is not an embarrassment now falls on the authors and not on the PC --- assuming once something is published, you can't take it back). So it might be possible to just let everyone give a talk (if you increase the number of tracks). However, if that is not feasible, there are plenty of other options. For example, all papers are accepted immediately; over the course of one calendar year, each paper sits out there in public and can be cited by other papers. 
The top sixty papers in terms of citations after one year get to present at the conference. This only extends the delay between submission and the actual conference by 4 months --- today there is usually an 8 month delay while papers are being reviewed, and camera-ready papers are being prepared.</p> <p class="MsoNormal" style="tab-stops:100.9pt">Sam's point #3: Eliminating the review system will discourage people from working hard on their papers.</p> <p class="MsoNormal" style="tab-stops:100.9pt">My response: I could not disagree more. Instead of having your paper reviewed by three people in private, every problem, every flaw in logic, every typo is immediately out there in the public for people to look at and comment on. As long as submissions cannot be withdrawn, the fear of long term embarrassment yields enough incentive for the authors to ensure that the paper is in good shape at the time of submission.</p> <p class="MsoNormal" style="tab-stops:100.9pt">Sam's point #4: Having papers in top conferences is an important metric for evaluating researchers.</p> <p class="MsoNormal" style="tab-stops:100.9pt">My Response: This is a horrible, horrible metric, and being able to finally eliminate it might be the best outcome of switching to an "accept everything" model. Everybody knows that it is much easier to get a paper accepted that goes into tremendous depth on an extremely narrow (and ultimately inconsequential) problem than to write a broad paper that solves a higher level (and important) problem, but has less depth. The "paper counting" metric incentivizes people to write inconsequential papers. Good riddance.</p> <p class="MsoNormal" style="tab-stops:100.9pt">Sam's point #5: Having papers accepted provides a form of validation, a way to measure progress and success. 
There is also some kind of psychological benefit.</p> <p class="MsoNormal" style="tab-stops:100.9pt">My response: People who measure themselves in this way are doomed to failure. If you have a paper accepted that nobody ever reads or cites over the long term, you have made zero impact. Just because you managed to get a paper through three poor reviewers is no cause for celebration. We should be celebrating impact, not publication. Furthermore, I strongly disagree with the psychological benefit argument. Getting a paper rejected does <b>FAR</b> more psychological damage than getting a paper accepted does good.</p> <hr> <p class="MsoNormal" style="tab-stops:100.9pt">In conclusion, it's time to eliminate the private peer review process and open it up to the public. All papers should be accepted for publication, and people should be encouraged to review papers in public (on their blogs, on Twitter, in class assignments that are published on the Web, etc.). Let the free market bring the good papers to the top and let the bad papers languish in obscurity. This is the way the rest of the Internet works. It's time to bring the database community into the Internet age. Imagine how much more research could be done if we didn't waste so much of the world's top researchers' time on PC duties and on revising good papers that were improperly rejected. Imagine how many good researchers we have lost because of the psychological trauma of working really hard on a good paper, only to see it rejected. The current system is antiquated and broken, and the solution is obvious and easy to implement. 
It's time for a change.</p>Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]23tag:blogger.com,1999:blog-8899645800948009496.post-72973286600791036362011-03-28T19:48:00.000-07:002011-03-29T06:12:13.295-07:00Why I'm doing a start-up pre-tenure<p class="MsoNormal">Thanks to the tireless work of the entire <a href="http://www.hadapt.com/">Hadapt</a> team, we had a very successful launch at GigaOM's Structure Big Data conference last week. In coming out of stealth, we told the world what we're doing (<a href="http://www.hadapt.com/please-register-to-download-the-white-paper">in short</a>, we're building the only Big Data analytical platform architected from scratch to be (1) optimized for cloud deployments and (2) closely integrated with Hadoop so you don't need those annoying connectors to non-Hadoop-based data management systems anymore; i.e., we're bringing high-performance SQL to Hadoop). Although a lot of people knew I was involved in a start-up, several people were surprised to find out at the launch how centrally involved I am in Hadapt, and I have received a lot of questions along the lines of what Maryland professor Jimmy Lin (@lintool) tweeted last week:</p> <p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height: 9.5pt"><span style="font-size:7.5pt;font-family:"Arial","sans-serif"; mso-fareast-font-family:"Times New Roman";color:#444444">.@<a href="http://twitter.com/daniel_abadi"><span style="mso-bidi-font-size:11.0pt; color:#0084B4">daniel_abadi</span></a></span><span style="font-size:7.5pt; mso-bidi-font-size:11.0pt;font-family:"Arial","sans-serif";mso-fareast-font-family: "Times New Roman";color:#444444"> </span><span style="font-size:7.5pt; font-family:"Arial","sans-serif";mso-fareast-font-family:"Times New Roman"; color:#444444">wondering how the tenure track 
thing fits in with @<a href="http://twitter.com/Hadapt"><span style="mso-bidi-font-size:11.0pt; color:#0084B4">Hadapt</span></a></span><span style="font-size:7.5pt;mso-bidi-font-size: 11.0pt;font-family:"Arial","sans-serif";mso-fareast-font-family:"Times New Roman"; color:#444444"> </span><span style="font-size:7.5pt;font-family:"Arial","sans-serif"; mso-fareast-font-family:"Times New Roman";color:#444444">(r u on leave?) - but congrats on coming out of the Ivory tower! :)</span></p> <p class="MsoNormal">Although Jimmy did not question my sanity in his tweet, others have, so I think it is time for me to explain my (hopefully rational) decision-making process that led me to start a company while still on the tenure track at Yale.</p> <p class="MsoNormal">A few facts to get out of the way: although I am currently on teaching leave from Yale, I am not taking a complete leave of absence, which means my tenure clock is still ticking while I'm putting all this effort into Hadapt. The time I'm spending on Hadapt necessarily subtracts from the time I have available to spend on the more traditional research activities of junior faculty (publishing papers, serving on program committees and editorial boards of publication venues, and attending conferences). This means there is a huge risk that, if I am evaluated using traditional metrics when I come up for tenure, I will not have optimized my performance in these areas, which will reduce my probability of receiving tenure. When I was considering starting Hadapt, I sent e-mails to several senior faculty members in my field and asked them if they could think of an example of a database systems professor doing a start-up while still a junior faculty member, and going on to eventually receive tenure (I desperately wanted a precedent that I could use to justify my decision). 
Not a single one of the people I e-mailed was able to think of such a case (in fact, one of them called the chair of my department to yell at him for even thinking of letting me start a company while still pre-tenure). Starting Hadapt is a gamble --- there's no doubt about it.</p> <p class="MsoNormal">So why am I doing it? I want my research to make impact, which to me means that my research ideas should make it into real systems that are used by real people. Unfortunately for me, the research I enjoy the most is research that envisions complete system designs (rather than research on individual techniques that can be applied directly to today's systems). It's hard enough to publish these system design papers, but it's almost impossible to get other people to actually adopt your design in real-world deployments unless an extensive and complete prototype is available, or your design is already proven in real-world applications. For example, there have been many papers published by academics that fall in the same general space as the Google Bigtable paper. Yet the Bigtable paper has had a tremendous amount of impact, while the other papers languish in obscurity. Why? Because when Powerset and Zvents needed to implement a scalable real-time database, they felt safer using the design suggested in the Google paper (in their respective HBase and Hypertable projects) than the design from some other academic paper that has not been proven in the real world (even if the other design is more elegant and a better fit for the problem at hand). </p> <p class="MsoNormal">Therefore, if you want to publish system design papers that make impact on the real world, you seemingly only have three choices:</p> <p class="MsoNormal">(1) You can use the resources in your lab to build a complete prototype of your idea. 
That way, when people are considering using your idea, they can reduce their risk by trying out your system on their application without significant upfront development cost. Unfortunately, building a complete prototype is a much harder task than building enough of a prototype to get a paper published. It involves a ton of work to deal with all of the corner cases, and to make it work well out of the box --- this amount of work is far too much for a small handful of students to do (especially if they want to graduate before they retire). Therefore, additional engineers must be hired to complete the prototype. In the DARPA glory days, this was possible --- I've heard stories of database projects burning over a million dollars per year to complete the engineering of an academic prototype. Unfortunately, those days are long gone. My attempts to get just one tiny programmer to build out the HadoopDB prototype were rebuffed by the National Science Foundation.</p> <p class="MsoNormal">(2) You leave academia and work for Google, Yahoo, Facebook, IBM, etc. Matt Welsh has <a href="http://matt-welsh.blogspot.com/2010/11/why-im-leaving-harvard.html">discussed in significant detail</a> his decision to leave Harvard and do exactly that. This is a great solution in many ways --- it increases the probability of your research making impact by orders of magnitude, and has the added bonus of eliminating a lot of the administrative time sinks inherent in academic jobs. If I didn't love the other aspects of being part of an academic community so much, this is certainly what I would do.</p> <p class="MsoNormal">(3) You do a start-up. This is basically the same as choice (1), except you raise the money to build out the prototype from angel investors and venture capitalists instead of from the government (which typically funds the academic lab). 
The main downside is that starting a company is highly non-trivial, and you end up having to spend a lot of time on all kinds of non-technical tasks --- meeting with investors, meeting with potential customers, interviewing potential employees, investing the time to understand the market, coming up with a go-to-market strategy, attending board meetings, dealing with patents, participating in boring trade shows, etc., etc., etc. It adds up to an extraordinary amount of time. It's also more competitive than academia --- there are far more people who want to see you fail in the start-up world than in academia, and some of these people go to great lengths to increase the probability of your failure. There are all kinds of hurdles that come up, and you need to have a strong will to overcome them. If it weren't for the most determined person I have ever met, Justin Borgman, the CEO of Hadapt, we would never have made it to where we are today. It's hard to start a company, but in my mind, it was the only viable option if I wanted my three years of research on HadoopDB to make impact (Hadapt is a commercialization of the HadoopDB research project). </p> <p class="MsoNormal">If it weren't for the fact that I spent the majority of the last decade soaking up the wisdom of Mike Stonebraker, I might not have chosen option (3). But I watched as my PhD thesis on C-Store was commercialized by Vertica (which was sold last month to HP), and another one of my research projects (H-Store) was commercialized by VoltDB. Thanks to Stonebraker and the first-class engineers at Vertica, I can claim that my PhD research is in use today by Groupon, Verizon, Twitter, Zynga, and hundreds of other businesses. When I come up for tenure, I want to be able to make similar claims about my research at Yale on HadoopDB. So I'm taking the biggest gamble of my career to see that happen. 
I just hope that the people writing letters for me at tenure time take my contributions to Hadapt into consideration when they are evaluating the impact I have made on the database systems field. I know that this will require a departure from the traditional way junior faculty are evaluated, but it's time to increase the value we place on building real, usable systems. Otherwise, there'll be no place left in academia for systems researchers.</p> <p class="MsoNormal">[Note: Hadapt has <a href="http://www.hadapt.com/news">successfully raised</a> a round of financing and is hiring. If you have experience building real systems, especially database systems --- or even if you have built reasonably complex academic prototypes --- please send an e-mail to [email protected]. I personally read every e-mail that goes to that address.]</p>Daniel Abadihttp://www.blogger.com/profile/16753133043157018521[email protected]18