Oracle. SQL Server. DB2. MySQL. PostgreSQL. These may be the five most popular enterprise databases, but they’ve become a bit dull over the past three years. Since the NoSQL movement began in 2010, new data stores have offered such a diverse array of use cases that it seems almost any traditional database could now be replaced by some specialized data store.
But while the NoSQL revolution sparked this new springtime for databases, not all the green shoots are NoSQL. New databases are cropping up, or just now maturing, in all manner of technical areas: new graph databases, new time-series databases, highly expandable key-value stores, and even new takes on the relational model.
Like most tools in any type of job, using the right database in the right place can make the difference between success and failure. That’s why choosing a database has gone from being one of the easiest decisions your team has to make to one of the hardest.
So, then, we set out on a trip through this verdant and growing meadow of data stores. Which one is right for you? That depends entirely on your use case.
SQL revolution
NoSQL can mean two things: No SQL, or Not Only SQL. It is the latter that many NoSQL companies tout when offering their data stores as a supplement to existing relational databases. But just because you need fast response times and highly scalable transactions doesn’t mean you have to throw SQL out entirely.
Still, the challenge for new and old relational database players alike is to focus on the strengths of their data store, and to make sure developers understand the best use case for their software.
Scott Jarr, cofounder and chief strategy officer of VoltDB, said that his company has found the sweet spot for its SQL-based relational data store that focuses on Java-based stored procedures. “I think that we are in a state of incredible noise and confusion in the market, and part of that is a natural stage of a market that is in its early stage and growing fast,” he said. “People are no longer looking at the individual products. Instead, they’re saying ‘I’ve got a particular problem,’ and then they’re starting to look at the databases that are options to them.
“Our challenge has been identifying what that use case is, and figuring out how we talk about it. It’s been quite clear to us as we’ve accelerated that our use case is very simple: It’s high-velocity in-bound transactions, like stocks, or the Web. It’s about making decisions on them in real time, and they’re looking at real-time analytics on it. That was news to us two or three years ago. The real-time analytic component became a very unique third leg to that stool. Being able to look at the analytics in real time is very important.”
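The pattern Jarr describes, where each inbound transaction updates real-time analytics in the same step, can be sketched roughly as follows. This is a toy Python model for illustration only; the class and method names are hypothetical, and VoltDB itself implements this logic as Java stored procedures.

```python
from collections import defaultdict

class RealTimeFeed:
    """Toy model of high-velocity ingest with real-time aggregates.

    Each incoming transaction updates per-symbol running statistics
    as it arrives, so analytic queries never have to scan history.
    """
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def ingest(self, symbol, price):
        # The transaction and the analytics update happen together.
        self.count[symbol] += 1
        self.total[symbol] += price

    def average(self, symbol):
        # Real-time analytic: answered from running aggregates.
        return self.total[symbol] / self.count[symbol]

feed = RealTimeFeed()
for price in (101.0, 102.0, 103.0):
    feed.ingest("ACME", price)
print(feed.average("ACME"))  # 102.0
```

The point of the sketch is the coupling: the decision-making data (the running aggregates) is maintained inline with the write path rather than computed later in a batch.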
Naturally, SQL databases don’t have to live in a vacuum. The moniker “Not Only SQL” is aptly applied in the enterprise, where massive databases aren’t going to be replaced overnight by some hot new NoSQL store. But having giant data stores already in place doesn’t limit one’s access to NoSQL technologies. Using NoSQL stores as a front for existing databases is a great way to spread the benefits of NoSQL without losing the relational benefits on the back end.
Even Oracle is getting into the spirit of the NoSQL revolution by including the Memcached API in its latest release, MySQL 5.6.
Tomas Ulin, vice president of MySQL engineering at Oracle, said the addition of the Memcached API will enable developers to more easily manage and fill their caching layer directly from the database.
“We announced a year ago that we were adding the Memcached API to access and update data in InnoDB,” he said. “We can join the best of both worlds in the SQL/NoSQL discussion. We think we can make it much easier to gain the benefits of NoSQL-type access.”
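The idea of fronting a relational store with a key-value cache can be sketched with a simple cache-aside read path. This is an illustrative Python stand-in, not the MySQL memcached plugin itself: the class name is hypothetical, and plain dicts stand in for memcached and for the relational table.

```python
class CacheAsideStore:
    """Sketch of the 'Not Only SQL' pattern: a key-value cache
    fronting a relational store (both modeled as dicts here)."""
    def __init__(self, backing):
        self.backing = backing      # stands in for the relational table
        self.cache = {}             # stands in for the memcached layer
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1          # fast path: no database round trip
            return self.cache[key]
        self.misses += 1
        value = self.backing[key]   # fall through to the database
        self.cache[key] = value     # populate the cache for next time
        return value

store = CacheAsideStore({"user:1": "Alice"})
store.get("user:1")   # miss: reads the backing store, fills the cache
store.get("user:1")   # hit: served from the cache
print(store.hits, store.misses)  # 1 1
```

What the InnoDB Memcached API adds over this application-side pattern is that the cache protocol talks to the database directly, so the two layers cannot drift apart.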
Time for time series
The cloudy term “NoSQL” has gobbled up a number of database types over the past few years, but when you get right down to it, there are a few types that, while truly not using SQL, don’t really belong under the umbrella term either. Graph databases and time series databases, for example, have both existed for some time, but it’s been the NoSQL revolution that’s brought them out into the open for discussion in the mainstream software development world.
They’re also relevant because of the many new use cases cropping up every day, thanks to social networks, mobile devices, and the need for scalable cloud-based data stores. In the cloud, many existing data-store ideas just don’t work out properly, and so the past few years have been a period of rebirth for these extant database types.
The newest of these databases is Saturnalia DB. This time-series database was crafted by Jonathan Moore and Leif Ryge, both developers at Web-crawling company Spinn3r. The pair, while remaining at Spinn3r, have launched a new company, called StatMover, around a hosted version of Saturnalia.
“Different problems are better served by different data models,” said Moore. “If you have a particular problem in mind, you can get more performance and lower cost by having a database tailored to your data.”
Thus, as Moore and Ryge were building Spinn3r’s high-speed Web-crawling and storage system, they quickly found that they needed a solution to replace the traditional open-source logging data store RRDtool.
“RRDtool was written for a different age,” said Moore. “RRDtool does not scale in terms of I/O, and you have other limiting factors like how much data you want to keep when you create the database, and it starts dropping data quickly.
“We realized from our work at Spinn3r running a high-performance crawler, we needed an order of magnitude more info on the stack. The tools fell over eventually, and we had to decide what could we monitor. Not ‘What can we monitor?’ but rather ‘What do we want to monitor?’”
StatMover thus offers a hosted version of Saturnalia and gives developers a place to forward all of their server and application stats. When problems arise, they can typically be spotted through graphical analysis of the statistics and logs gathered. And that’s just what StatMover offers as a SaaS product.
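The contrast Moore draws with RRDtool, which allocates a fixed-size ring at creation time and drops older data, can be illustrated with an append-only time-series sketch. This is a minimal Python illustration, not Saturnalia's actual design; the class name and storage layout are assumptions made for the example.

```python
import bisect

class SeriesStore:
    """Append-only time-series sketch. Unlike an RRDtool-style
    fixed-size ring, retention is not decided at creation time:
    points accumulate until the operator chooses to expire them."""
    def __init__(self):
        self.series = {}   # metric name -> sorted list of (ts, value)

    def record(self, metric, ts, value):
        # Keep each series sorted by timestamp on insert.
        bisect.insort(self.series.setdefault(metric, []), (ts, value))

    def range(self, metric, start, end):
        # Binary-search the sorted points for the [start, end] window.
        pts = self.series.get(metric, [])
        lo = bisect.bisect_left(pts, (start,))
        hi = bisect.bisect_right(pts, (end, float("inf")))
        return pts[lo:hi]

ts = SeriesStore()
ts.record("cpu.load", 100, 0.5)
ts.record("cpu.load", 200, 0.9)
ts.record("cpu.load", 300, 0.7)
print(ts.range("cpu.load", 150, 300))  # [(200, 0.9), (300, 0.7)]
```

A real time-series store adds compression, downsampling, and scalable I/O on top of this, but the core access pattern, ordered appends and time-window reads, is what the sketch shows.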
Graphic wave
Another type of database that’s riding the NoSQL wave to success is the graph database. While graph databases have found an immediate niche inside social networking-related applications, Emil Eifrem, CEO of Neo Technology, said that their usefulness will be seen far beyond Facebook. He takes a different view of how databases are differentiated.
“I always tend to take the data-model view to this explosion of databases,” he said. “What are the abstractions, the building blocks exposed to programmers? Graph has three core abstractions: nodes, typed relations between nodes, and key-value pairs attached to both nodes and relationships. That final point is incredibly important and very powerful. Graph is the first model that fundamentally embraces how relations are modeled as a first-class citizen.”
That means that not only can individual objects or data items be referenced by a key-value pair, but all of an item’s connections to other items and objects can be called out by a key value as well. The relationships between data are thus modeled in a quickly accessible storage model.
Eifrem is also bullish on graph database performance due to this first-class citizenship of relations. Rather than performing multiple sub-queries into the data and waiting for relations to pan out in the query, graph databases allow developers to quickly sort data on the fly without being forced to lay out the database schema according to unknown future data-organization requirements.
That means info stored in a graph database can quickly be laid out by date, size, type, owner and any other sub-category, without those individual items being called out into their own tables or indexed in their own schema.
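The three abstractions Eifrem names, nodes, typed relationships, and properties on both, can be sketched in a few lines. This is a minimal illustrative Python model, not Neo4j's API; the class and method names are made up for the example.

```python
class Graph:
    """Minimal property graph: nodes, typed relationships, and
    key-value properties attached to both."""
    def __init__(self):
        self.nodes = {}   # node id -> property dict
        self.edges = []   # (src, rel_type, dst, property dict)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, src, rel_type, dst, **props):
        # Relationships are first-class: typed, directed, and
        # able to carry their own key-value properties.
        self.edges.append((src, rel_type, dst, props))

    def neighbors(self, node_id, rel_type):
        # Traversal follows stored relationships directly; no join
        # against a separate table is required.
        return [dst for s, t, dst, _ in self.edges
                if s == node_id and t == rel_type]

g = Graph()
g.add_node("alice", name="Alice")
g.add_node("bob", name="Bob")
g.relate("alice", "KNOWS", "bob", since=2010)
print(g.neighbors("alice", "KNOWS"))  # ['bob']
```

A production graph database indexes the edges so that each hop is a constant-time lookup rather than the linear scan shown here, which is what makes deep traversals fast.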
“We talk about whiteboard friendliness. If you are able to sketch out your domain on a whiteboard, translating that into a data model in the database typically means taking that whiteboard and saying everything you’ve drawn as a hub is a node, and every arrow is a relationship,” said Eifrem.
NoSQL, three years on
It seems odd to call a handful of three-year-old database projects and software companies the old guard, but in this rapidly changing database landscape, tools like Apache Cassandra, Apache CouchDB, MongoDB and Riak are all now the entrenched players.
That’s not to say that these existing NoSQL databases are standing still. The past 12 months have seen extensive maturation for all four of these popular data stores.
Robin Schumacher, vice president of products at DataStax (the company behind Apache Cassandra), said that much of the company’s recent work on the platform has been focused on filling in the gaps for enterprise users.
The company released Enterprise Edition 3.0 of Cassandra and its supporting tools in late January. This version adds a number of enterprise features, such as internal authentication, in-transit SSL encryption, and the ability to do more granular data restores from backup.
Schumacher said that “Cassandra was faulted for being more complex to configure and use” in the past, and that most of the updates DataStax is working on are targeted at remedying this critique.
Elsewhere, the defenestration of Apache CouchDB through the corporate windows of companies like Cloudant and Membase has yielded some interesting results for it.
First, the database’s creator, Damien Katz, left the project in order to focus on rewriting many of the underlying ideas in CouchDB in C for Membase. Then, the Cloudant team decided to build a new solution for its hosted database product, one that only used a portion of the CouchDB code.
Mike Miller, cofounder and chief scientist of Cloudant, doesn’t think these two decisions will hurt Apache CouchDB in the long run, and he pointed out that Cloudant still contributes about a third of the Apache CouchDB code overall.
“We’ve chosen to layer on the CouchDB API [on top of our service],” said Miller. “I think Apache CouchDB is going to have a long and fruitful history. I think it’s the PostgreSQL of NoSQL. In contrast to everything, it doesn’t have a vendor behind it. It was a pure open-source project.
“The thing we love about it is the API. The API is very clean. The basic things a database does, like a key-value store, those map perfectly onto a clean REST endpoint: You can get, put, post, delete. That’s something Apache CouchDB got right. We also like their model of not focusing. You don’t do anything except pure HTTP. That has advantages to developers, to give them pure HTTP, especially for mobile developers, because you don’t need any middleware to talk to the database itself, which allows you to do incredible things,” said Miller. Incredible things like simplifying architectures, and keeping applications in direct contact with databases, he said.
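The verb-to-operation mapping Miller praises can be sketched as a tiny request handler. This dict-backed Python class is an illustration of the idea only; real CouchDB speaks full HTTP, with JSON documents, revision tracking, and replication that this sketch omits, and the class name is hypothetical.

```python
class RestDocStore:
    """Sketch of mapping REST verbs onto document-store operations:
    GET reads, PUT writes, DELETE removes, keyed by document ID."""
    def __init__(self):
        self.docs = {}

    def handle(self, verb, doc_id, body=None):
        if verb == "GET":
            # Read: return the document, or None if absent.
            return self.docs.get(doc_id)
        if verb == "PUT":
            # Write: create or replace the document at this ID.
            self.docs[doc_id] = body
            return {"ok": True, "id": doc_id}
        if verb == "DELETE":
            # Remove: report whether anything was actually deleted.
            return {"ok": self.docs.pop(doc_id, None) is not None}
        raise ValueError("unsupported verb: " + verb)

db = RestDocStore()
db.handle("PUT", "doc1", {"title": "hello"})
print(db.handle("GET", "doc1"))   # {'title': 'hello'}
db.handle("DELETE", "doc1")
print(db.handle("GET", "doc1"))   # None
```

Because every operation is an ordinary HTTP request in the real system, any HTTP-capable client, including a mobile app, can talk to the database with no middleware in between, which is exactly Miller's point.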
Elliott Cordo, principal consultant at Caserta Concepts, has to do incredible things every day. His consulting company works with enterprises, typically on big data-warehouse analytics projects. He said that he’s seen a lot of great innovation in databases over the past few years, but what he wants in the future is more flexibility from those data stores.
“I think we’ll see stores like Apache Hadoop’s HBase mature and become more of a front-end system,” he said. “We’ll see more high-level analytics languages evolve. These databases will also have built-in functionality of aggregation, rather than just being architected for queries.
“In the analytic world, you need them to be a little more general-purpose. We’re going to see more in-memory databases, like Memcached. And we’ll see memory-based OLAP solutions, integrating and working with these platforms.”
Precognition
While data stores are becoming more diverse and interesting to developers, the primary reason for this whole kerfuffle over data has come from the need for analytics, in real time or otherwise. One company that is taking on the analytics side of the problem is Precog, a SaaS-based solution designed to give developers and analysts easy access to predictive analytic tools and techniques.
John De Goes, founder, CEO and CTO of Precog, said that Precog is able to perform deep analytics, and that developers can create these analytics quickly from within their browser.
“We focus on persistent data, interactions and state changes,” he said. “We provide APIs and client libraries and database synchronizers, but primarily APIs to allow developers to capture this data on a mobile device, on an application, on some sort of sensor-based device. We let them trivially capture it with a few lines of code. They can store any kind of semi-structured data, store nested JSON they got off of Twitter, as well as modify schema over time instead of modifying the database. Then we provide integration. We take that stream of persistent data coming in and augment it, cross-reference it with your database of customers, and add info to it. Our technology allows us to do a pre-join on that to accelerate the process of analytics on the fly.”
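The "pre-join" De Goes describes, cross-referencing each incoming semi-structured event with a customer table at ingest so later analytics need no join, can be sketched as a small enrichment step. This Python function is an assumption-laden illustration of the idea, not Precog's implementation; the field names are invented for the example.

```python
def enrich(events, customers):
    """Sketch of ingest-time augmentation: each incoming event is
    cross-referenced with a customer table and the matched record
    is embedded in the event, so analytics downstream read one
    pre-joined stream instead of joining at query time."""
    for event in events:
        customer = customers.get(event.get("customer_id"), {})
        # Embed the matched customer record; unmatched events get {}.
        yield {**event, "customer": customer}

customers = {42: {"name": "Acme Corp", "tier": "gold"}}
events = [{"customer_id": 42, "action": "click"},
          {"customer_id": 7, "action": "view"}]
for row in enrich(events, customers):
    print(row)
```

Because the events are nested JSON of no fixed shape, the schema can evolve over time without altering the store, which matches the semi-structured model described in the quote.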
Precog is built on new technology constructed by De Goes and his team. “You can think of it as a time-series database, but I would characterize it as a statistical database,” he said. “It also includes measured data. It enables you to do in-database statistics without having to stream all that back to the client. That doesn’t work with Big Data. We enable you to bring those computations into the database and execute those really fast.”
In-Memory Influx
With all the furor over Apache Hadoop in the enterprise, running analytics on Big Data is a hot topic. ScaleOut Software is addressing this market with its own scalable in-memory data store that can run analytics at a much faster pace than Hadoop.
William Bain, CEO of ScaleOut Software, said that developers can find great time savings with the workflow enabled by an in-memory data grid used for analysis. These savings come not only from the real-time nature of the software, but also from the fact that ScaleOut does not require data ingress steps between query attempts.
“The workflow for our customers is that they’re using the in-memory data grid to hold data,” said Bain. “The data is naturally changing rapidly on a daily basis. This is different from Hadoop. The data is changing while the analytics is going on. For a financial trading system or an airline reservation system, they cannot wait for data to arrive. ScaleOut allows you to perform MapReduce while the data is changing.
“The data being stored in the in-memory data grid is naturally object-oriented, and it fits into the object collection model. It can be selected for analysis based on query properties. Instead of record readers, you do a parallel read through the grid.”
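The shape of Bain's description, selecting objects from an in-memory grid and running MapReduce over them in place, can be sketched in a few lines. This toy Python version is sequential and purely illustrative; the function name is invented, and the real product distributes the "parallel read through the grid" across servers.

```python
from functools import reduce

def grid_map_reduce(grid, select, map_fn, reduce_fn):
    """Sketch of MapReduce over an in-memory data grid: objects are
    selected by a query predicate, mapped in place (sequentially
    here; in parallel across the grid in the real system), then
    reduced to a single result. No data ingress step is needed,
    because the objects already live in memory."""
    selected = (obj for obj in grid.values() if select(obj))
    return reduce(reduce_fn, (map_fn(obj) for obj in selected))

# Toy grid of reservation objects keyed by reservation ID.
grid = {1: {"flight": "AA10", "seats": 2},
        2: {"flight": "AA10", "seats": 3},
        3: {"flight": "BA20", "seats": 1}}

# Total seats booked on flight AA10.
total = grid_map_reduce(grid,
                        lambda o: o["flight"] == "AA10",
                        lambda o: o["seats"],
                        lambda a, b: a + b)
print(total)  # 5
```

The contrast with a Hadoop-style batch job is that nothing is staged into HDFS first: the analysis runs over live objects that may be changing between (and during) runs.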