Relational databases have been commercially dominant and the go-to solution for quite some time. But change is brewing. Large Web applications have created requirements poorly matched for relational databases, and leading companies like Google and Amazon have built themselves new database infrastructures, such as BigTable and Dynamo, from the ground up. Understandably, these new systems have attracted a lot of attention.
The term NoSQL was first popularized in 2009 as a label for a broad class of non-relational, distributed databases designed to tackle this new class of problems. The term quickly gained popularity as the number of open-source NoSQL database projects, such as HBase, Cassandra and Voldemort, grew. Though a precise definition of NoSQL remains elusive, the broad goals usually include fault tolerance, horizontal scalability, good price/performance, and a relatively simple data model.
With the explosion of interest, there are now likely more than 200 NoSQL databases available. Though performance and robustness vary dramatically, many provide scalability and fault tolerance by running on a cluster of commodity hardware.
However, data model flexibility tends to be very limited in these first-generation NoSQL systems. Though all NoSQL databases, as a group, implement a large variety of data models, each one individually usually provides just one basic data model (such as graph, document, key-value, etc.). This drives the unfortunate need for developers and ops teams to adopt multiple databases if they want to take advantage of multiple data models.
Enter the CAP Theorem
To understand NoSQL systems and their data-model inflexibility, we need to dig into the CAP Theorem.
First-generation NoSQL databases were designed in the shadow of Eric Brewer’s “CAP Theorem.” It was so named because the popular (though misleading) summary was that developers had to “pick two out of three” of (C)onsistency, (A)vailability and (P)artition tolerance. The tradeoff applies to distributed systems where communication channels can fail, with dramatic consequences.
Seeing system availability as essential, the CAP Theorem was used as justification by first-generation NoSQL systems to abandon consistency in order to maximize availability. Thus, the much weaker model of “eventual consistency” was adopted. Eventual consistency means, simply, that writes to the database “eventually” get done, and are “eventually” seen by other clients.
Eventual consistency represents a dramatic weakening of the guarantees that traditional databases provide, and they place a huge burden on software developers. Designing applications that maintain correct behavior even if the accuracy of the database cannot be relied upon is quite a challenge!
From a recent Google paper:
“We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level.”
In this same paper, Google detailed a system called F1, which is a scalable and fault-tolerant SQL database, which seems to contradict some people’s previous understanding of the CAP Theorem.
In 2012, Brewer explained that the CAP Theorem had been widely misunderstood. He noted that “the ‘two of three’ formulation was always misleading,” and that “CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare.” This point is fundamental because the CAP notion of perfect availability (i.e., even disconnected nodes can accept writes) is very different from the availability of the database as a whole to a client. Therefore, a distributed database can be designed to be fault-tolerant and highly available without supporting perfect “availability” in the CAP sense.
Google has perhaps gone the furthest in the reconsideration of CAP with its Spanner database, which is intended to replace BigTable across a wide range of Google applications. Spanner is a globally distributed database providing not just strong data consistency, but also true multi-row ACID transactions like its SQL cousins.
The future of NoSQL
Databases are undergoing a sort of Cambrian explosion, with many new approaches being explored after decades of relative stability. The experimentation with new data models and query languages has been particularly broad and exciting.
Though no standards have emerged, it seems like key-value is emerging as a ubiquitous model, as is the “document” concept of hierarchical data. It’s likely that, as applications and languages evolve, this experimentation will continue and there will be a broad array of data models represented in the next generation of NoSQL databases.
Sacrificing strong consistency (and therefore ACID transactions) seems to have been too hasty. Transactions, which have been part of SQL databases for decades, allow applications to both deal with concurrent client access and build robust abstractions. This is evident in that there are several NoSQL companies starting to head toward adding transactional features to their solutions. For some time, NoSQL vendor MarkLogic has had multi-statement transactions, where they write and lock data with a two-phase commit. And DataStax, the company behind Cassandra, recently added Compare-and-Swap operations (calling them “lightweight transactions”) as part of its 2.0 release.
Google was seen leading the way in NoSQL when it introduced BigTable as an alternative to relational systems that were hard to scale and required exotic and expensive hardware. But as Google observed, building systems without transactional guarantees is difficult, even for the most experienced of engineers.
It seems as though Google is yet again redefining NoSQL with F1, a layer on top of the Spanner NoSQL database that adds a SQL data model and query language. This points to an exciting future for NoSQL databases. Next-gen NoSQL systems will continue to employ shared-nothing, distributed architectures with fault tolerance and scalability, but they may follow in Google’s footsteps by aggressively exploring strong consistency with transactional guarantees.
Dave Rosenthal is cofounder of FoundationDB, a NoSQL database company.