The move to the cloud has brought many changes to software development, but few shifts have been as radical as those occurring in the database market right now. So different is the cloud for software architects that the first response from developers was to build entirely new databases to solve these new problems.
Thus, 2010 was the year of the NoSQL database. But as time has moved on and NoSQLs have become more mature, developers are figuring out that the old relational ways of doing things shouldn’t be thrown out with the metaphorical bathwater.
While relational databases were considered old world just a year ago, a new crop of options for in-cloud development has brought them racing back to the forefront. A combination of new relational databases aimed at the cloud, coupled with more mature in-cloud relational offerings, such as Microsoft’s SQL Azure database and Amazon’s SimpleDB, has presented some compelling reasons to ditch the new-fangled NoSQLs.
Amazon’s SimpleDB, for example, is a cloud-based relational data storage system focused on simplicity, as the name implies. Rather than cram caching, transformations and compromise solutions to the CAP problem (Consistency, Availability, Partition tolerance: You can only choose two) into a new-world database, SimpleDB eschews futuristic ideas in favor of a clean, easy-to-use data store that can form the backbone of scalable applications while providing the 20% of functionality needed by 80% of users.
Adam Selipsky, vice president of product management and developer relations for Amazon Web Services, said that SimpleDB is about choice and ease of use. “Running a relational database, irrespective of where you do it, takes a certain amount of work and administration,” he said.
“There are a lot of use cases where people don’t need that full functionality of a relational database. SimpleDB is really meant to be the Swiss Army knife of databases. You’re not going to do joins, you’re not going to do complex math procedures. If you want to do data indexing and querying, then it can take all the scaling hassles away from you, and you don’t have to worry about schemas.”
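To make the "indexing and querying without schemas" idea concrete, here is a minimal Python sketch of a schemaless item store. It is not SimpleDB's API: the class `SchemalessDomain` and its `put`, `delete` and `select` methods are hypothetical names chosen for illustration, and the sketch assumes attribute values are simple hashable values such as strings.

```python
from collections import defaultdict


class SchemalessDomain:
    """Hypothetical schemaless store: items are free-form attribute dicts."""

    def __init__(self):
        self.items = {}                 # item name -> attribute dict
        self.index = defaultdict(set)   # (attribute, value) -> item names

    def put(self, name, attributes):
        # Replace any earlier version so stale index entries do not linger.
        self.delete(name)
        self.items[name] = dict(attributes)
        for pair in attributes.items():
            self.index[pair].add(name)

    def delete(self, name):
        # Remove the item and every index entry that pointed at it.
        for pair in self.items.pop(name, {}).items():
            self.index[pair].discard(name)

    def select(self, attribute, value):
        # Equality lookup only: no joins, no math, just the index.
        return sorted(self.index.get((attribute, value), ()))
```

Every attribute is indexed as it is written, so two items in the same domain can carry entirely different attributes, and the only query offered is an equality lookup.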
Selipsky said that SimpleDB evolved out of the needs Amazon saw in its users. Databases were a hassle to maintain within the cloud, and yet most developers were only using a fraction of the functionality of those databases, he said.
“Where we started was to provide all the building blocks,” he said. “There’s an incredible variety of needs in our customer base. We have all the separable fundamental Web services with fundamental calls. That’s been one of the principal reasons these services have been so popular. We do think it’s important to make our services easier to use. There are a lot of examples where we will start to make the services easier to use.”
Look before you leap
Stephen O’Grady, analyst with research firm RedMonk, said that every in-cloud database offering is different, and that developers need to know what they’re getting into before they design an architecture around those data stores. He pointed out that Microsoft’s initial offering of SQL Server for Azure did not meet developer expectations, and thus caused some strife for both users and Microsoft itself.
“When Azure came out, the data store was basically a hierarchical database, not a relational one,” he said. “It was not your typical SQL Server. A lot of the initial users chafed at that requirement, and as a result the subsequent iteration [of the software from Microsoft was a] re-badged SQL Server. That’s one of the data stores available on Azure now and it looks a lot like your regular relational now.”
Microsoft decided to move closer to the old-world model of a relational database for its cloud-based offerings. Google, on the other hand, has made some very specific choices in its AppEngine offering, choices that dramatically impact the developer.
As O’Grady explained, “With Google AppEngine, their implementation of Big Table is a unique design. If you’re in the cloud, it really depends on what data store you’re using and what the properties of that data stores are. If you’re dealing with a relational, it will look and act like a relational. If it’s not, you have to adjust your application design to take that into account.
“That’s one of the potential throttles, because a lot of applications depend on a relational database, so moving to AppEngine would require a lot of work and porting. You have to be very aware of what you’re designing and developing to. If you’re designing for AppEngine, you can’t take that application and natively deploy it because you don’t have access to the database.”
EnterpriseDB is the company behind Postgres, an alternative open-source database. Robin Schumacher, director of product strategy at EnterpriseDB, said there’s really one great promise of a cloud database for developers. “The whole idea is to not really have to make changes to your application for it to benefit from a cloud database. That’s what you’re gunning for. There’s definitely a difference between putting up an Oracle instance in the cloud, and having a cloud database that meets the definition of what a cloud database, like SQL Azure or SimpleDB, is supposed to do,” he said.
It is in this capacity that many cloud database systems can be used to complement each other. The common model right now is to augment a relational database such as MySQL or SimpleDB with a NoSQL or caching layer to speed up the access to information. Such system designs began to pop up with the rise of the cloud, and Memcached was a frequently used caching layer for all manner of databases.
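The caching arrangement described above is often called the cache-aside pattern. The Python sketch below shows one way it could look, under loud assumptions: a plain dictionary stands in for a Memcached client, an in-memory SQLite table stands in for the relational back end, and the names `CacheAsideStore`, `make_store` and the `users` table are invented for illustration.

```python
import sqlite3


class CacheAsideStore:
    """Cache-aside sketch: check the cache first, fall back to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}      # stand-in for a Memcached client (assumption)
        self.db_reads = 0    # counts reads that had to reach the database

    def get_user(self, user_id):
        if user_id in self.cache:
            return self.cache[user_id]          # cache hit, no SQL issued
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        name = row[0] if row else None
        if name is not None:
            self.cache[user_id] = name          # populate cache for next time
        return name

    def set_user(self, user_id, name):
        self.conn.execute(
            "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
            (user_id, name),
        )
        self.conn.commit()
        self.cache.pop(user_id, None)           # invalidate the stale entry


def make_store():
    """Build a store backed by a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return CacheAsideStore(conn)
```

Repeated reads of the same row reach the database only once. Writes go to the database and invalidate the cached copy rather than updating it, which is one common choice among several.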
Today, however, there are entire classes of databases that include their own caching systems. Damien Katz began writing CouchDB as a way to solve data storage scalability issues, but earlier this year the company he formed to shepherd that database merged with Membase, an advanced form of Memcached.
Together, the two projects have merged to form Couchbase, a company focused on both the scalable back end and the RAM-stored front end of their database. “It was after I had the initial versions of CouchDB written in C++ that I really started thinking about scalability,” said Katz.
“I had a lot of experience with conventional concurrency, with threads and locks. I heard about Erlang. I decided to check it out, play with it for a week, and after a week I knew I could write everything in it. So I threw away my old code and rewrote everything in Erlang. It took me a month and a half to rewrite what I had written in C++ in six months.”
After CouchDB reached version 1.0, however, Katz and company decided there was something missing from their setup, and thus the merger with Membase. The merger effectively divided the database into a two-part system: hot storage and cold storage. Live data is kept at the edges of the cluster, quickly available in RAM when needed, while cold, long-term storage of that data is left to the back-end CouchDB, a system that can store not only data but also other databases.
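The hot and cold split can be pictured with a small Python sketch. This is not Couchbase's implementation. It assumes a write-through policy and least-recently-used eviction, neither of which the article specifies, and `TieredStore` is a hypothetical name. A dictionary stands in for the persistent back end.

```python
from collections import OrderedDict


class TieredStore:
    """Two-tier sketch: a bounded RAM tier in front of a durable tier."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # RAM tier, ordered oldest to newest use
        self.cold = {}             # stand-in for the persistent back end

    def put(self, key, value):
        self.cold[key] = value     # write through to long-term storage
        self._promote(key, value)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)   # mark as recently used
            return self.hot[key]
        value = self.cold[key]          # raises KeyError if never stored
        self._promote(key, value)
        return value

    def _promote(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)   # evict the least recently used
```

Nothing is lost when the RAM tier evicts a key, because every write has already reached the cold tier. A later read simply pulls the value back into RAM.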
Elsewhere in the cloud, the Apache Cassandra project has been moving closer to the Apache Hadoop project. In a manner similar to Couchbase, Cassandra can be used as a NoSQL database at the edges of a network, and thanks to recently released updates to the database, Hadoop can read data directly from Cassandra.
O’Grady said this multi-tier approach has been a best practice for some time now. What’s changed, he said, is that relational databases are not the solution to all problems anymore.
“That’s been a common pattern for a long time, at least in the sense of caching. A lot of high-scale properties have a relational database at the back end, supplemented by a caching mechanism, most commonly Memcached, as a front-end caching solution,” said O’Grady.
“That’s been a design pattern and a best practice for a while now. As far as the overall diversity, there’s no question that we’re in an era now where heterogeneity is the norm. As recently as three or four years ago, if you had a persistence problem, the solution was a relational database.
“That’s not true anymore. What is still true is that relational is a solution to a lot of problems, it’s just not the solution to all problems. Developers are beginning to realize that if I only want to store a key and a value, maybe I need a key value store, or if I am traversing graphs, maybe I need a graph database.”
It’s a bit of a thought shift, of course. Databases used to be singular towers of the truth, pillars of information consumed by all from a uniform and unique source. Today, however, the cloud offers many ways for developers to bring together various types of data stores. This is a major contrast to another trend in the market: master data management.
Faster than MDM
Master data management (MDM) practices stipulate that a single source be designated as the one true data source. In an MDM shop, one central database takes on all changes from the day’s (or hour’s) work and keeps the canonical record. It’s the one-server-to-rule-them-all approach.
But the move to the cloud has largely pushed this notion aside, said Brian Hopkins, principal analyst at Forrester Research. MDM, despite being something of a buzzword for the past few years, is essentially the opposite of a cloud database strategy. Instead of simple servers spreading data across nodes, MDM is about policies, constant ingress and ubiquitous validation.
Thus, Hopkins espouses a lighter-weight approach to working with multiple databases, one that sidesteps full MDM practices.
“MDM is a piece of it, but I hesitate to use that word,” he said. “We’ve had statistics where we see an average MDM implementation period takes 24 months, with 30 months to payback ROI. What I’m talking about is more along the lines of the data virtualization space.”
And indeed, “data virtualization” has become a new buzzword in and of itself. While cloud databases can offer simple data stores that can quickly scale, data virtualization in-cloud can bring the relevant information into the same RAM space as a running application. It’s about bringing the data to the application rather than the application to the data.
Starcounter is a new relational database company out of Sweden that will be launching later this summer. The company’s self-titled relational database offers data virtualization capabilities as well.
Peter Idestam-Almquist, CTO of Starcounter, said that his company plans to expand these data virtualization capabilities beyond C# and .NET, though that is the only environment they currently support.
“Even if you run a database and you have your application code running and you want to access the data in the database, although they are running on the same machine, you transfer that data in RAM,” he said. “You transfer it from one part of RAM to the part of RAM that belongs to the application. You also transform the data from one format used by the DB system to what is used by the application system. By integrating these, you neither move nor transform the data. Instead of having a copy of the data, the application code has direct access to the data.”
It may seem odd that a startup would be focused on creating a new relational database, but there is something of a bumper crop of said companies thanks to the opportunities posed by the cloud. Traditional relational databases like MySQL and Microsoft’s SQL Server do work in the cloud, but they weren’t built for the cloud. Microsoft plans its SQL Azure as a next-generation, cloud-based SQL Server, but there are many others pushing new relational database software packages.
One such company is Citrusleaf, a startup focused on building a real-time transactional and relational database of the same name. Srini Srinivasan, CTO of Citrusleaf, said that his company’s database offers fast transactions and immediate consistency. Along the development path, he said he learned some interesting things about the CAP problem.
“Here’s the insight we’ve had in terms of CAP: When there is no failure, you can have consistency and availability,” he said. “When the partition happens, we can continue to let the two partitions work, and when they come back together, there would be a consistency issue. We have all the code available to detect the conflict, so most of the time our customers have chosen the ability to simply do the conflict resolution themselves.”
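One common way to detect the kind of conflict Srinivasan describes is a version vector, in which each record carries a per-node counter of the writes it has seen. The article does not say how Citrusleaf detects conflicts, so the Python sketch below is an assumed technique, not the company's method. The function names `compare` and `merge` and the `resolve` callback are hypothetical.

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts mapping node name to counter).

    Returns 'equal', 'a_newer', 'b_newer' or 'conflict'.
    """
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"      # each side saw writes the other missed
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


def merge(record_a, record_b, resolve):
    """Merge two (value, version_vector) records after a partition heals.

    `resolve` is the application-supplied function that picks or combines
    values when the two sides truly conflict.
    """
    value_a, vv_a = record_a
    value_b, vv_b = record_b
    merged_vv = {
        n: max(vv_a.get(n, 0), vv_b.get(n, 0)) for n in set(vv_a) | set(vv_b)
    }
    order = compare(vv_a, vv_b)
    if order == "conflict":
        return resolve(value_a, value_b), merged_vv
    if order == "b_newer":
        return value_b, merged_vv
    return value_a, merged_vv
```

When one side has simply seen more writes than the other, the newer value wins automatically. Only when both sides wrote independently during the partition does the customer-supplied `resolve` function get called, for example `merge(a, b, resolve=max)` to keep the larger of two counters.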
It’s about location
Srinivasan said that, beyond the database itself, the most important thing about cloud hosting is location, location, location. “The key thing to look for is collocation. If you’re a developer developing with a database, your access to the database has to happen really fast. Some of our customers have had this problem: They have an application in cloud, and it has to access databases outside cloud, so they found data centers where they were collocated,” he said.
EnterpriseDB’s Schumacher said that much of a cloud database’s appeal rests in its ability to alleviate pain for DBAs. “From a DBA perspective, you’re hoping a cloud takes away a lot of the pain you have to deal with,” he said.
“You want transparent database expansion and contraction, so you can automatically add nodes when demanded, and remove those nodes when demand decreases. Also important is load balancing across those nodes. From a developer perspective, you want to avoid those sharding situations where the application has to be aimed at a specific shard. The load balancers take care of those things, and you hopefully would not have downtime thanks to auto fail-over.”
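One way a routing layer can hide shards from application code while nodes come and go is consistent hashing. Schumacher does not name a technique, so the Python sketch below is an assumption for illustration only. `HashRing`, its methods and the node names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: maps keys to nodes without app-level sharding."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas   # virtual points per node, smooths the load
        self._keys = []            # sorted hash positions on the ring
        self._owners = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(text):
        return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Place several virtual points for the node around the ring.
        for i in range(self.replicas):
            position = self._hash(f"{node}#{i}")
            self._owners[position] = node
            bisect.insort(self._keys, position)

    def remove_node(self, node):
        # Drop every virtual point that belongs to the node.
        self._keys = [p for p in self._keys if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def node_for(self, key):
        # Walk clockwise to the first virtual point at or after the key.
        if not self._keys:
            raise LookupError("no nodes in ring")
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

The application only ever asks `ring.node_for(key)`. When a node is added, the only keys that move are those the new node takes over, and removing that node puts them back where they were, which is what makes expansion and contraction cheap.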
Schumacher said that these and a few other necessities form the basic requirements for any cloud database. “Those are the key things. There are some minor things, like provisioning. Can you have a rolling upgrade? Do you have multi-tenancy capabilities? Is there a billing interface, so you can easily determine cost and usage?” he asked.
Beyond these capabilities, however, there are still major hurdles to overcome for all cloud databases. “A lot of the cloud databases can scale for reads, but can they scale for writes? Sometimes that’s difficult to pull off,” said Schumacher.
Karen Padir, vice president of products and marketing at EnterpriseDB, said that the company will be offering its own cloud database later this year, which may, perhaps, tackle the read/write disparity. “We’re building a cloud database for release in the late summer or early fall,” she said.
“We will host it in your private cloud, or in the public cloud. Right now, we’re looking at supporting Amazon Web Services, and we’re talking with Eucalyptus and Red Hat.”