For most of the 1990s, databases were the most boring tool in the shed. The rise of the Web over the aughts changed the demands placed on databases, but did not meaningfully change the form of the data stores we know and love in our day-to-day application work.
The constraints placed on applications to perform at Web scale could be overcome only by specialty database vendors such as FairCom and Oracle. The advent of the cloud, however, brought these Web-scale problems to the forefront, and around 2007, the database landscape began to change drastically.
Some of the new solutions, like Tokyo Cabinet, Redis and Apache Cassandra, took the approach of spreading a key-value store across many servers. Others, like MongoDB and CouchDB, approached the problem from the document model, stashing data in many forms. Over time, however, the market demanded a third path.
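As a rough illustration of the two approaches just described, the Python sketch below spreads keys across several stand-in "servers" by hashing each key, then stores both opaque values and nested documents. The names `shard_for` and `ShardedKV` and the node names are hypothetical, and the modulo placement is a deliberate simplification. It is not how any of the products named above are implemented.

```python
import hashlib


def shard_for(key, servers):
    """Pick the server that owns a key by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


class ShardedKV:
    """Toy key-value store whose keys are spread across several 'servers'."""

    def __init__(self, servers):
        self.servers = list(servers)
        # Each dict stands in for one remote server (hypothetical names).
        self.nodes = {name: {} for name in self.servers}

    def put(self, key, value):
        self.nodes[shard_for(key, self.servers)][key] = value

    def get(self, key, default=None):
        return self.nodes[shard_for(key, self.servers)].get(key, default)


store = ShardedKV(["node-a", "node-b", "node-c"])

# Key-value style: an opaque value looked up by its key.
store.put("session:42", "logged-in")

# Document style: the value is a nested, self-describing record,
# and two documents need not share the same fields.
store.put("user:42", {"name": "Ada", "roles": ["admin"]})
store.put("user:43", {"name": "Lin", "address": {"city": "Austin"}})
```

Because the owning server is computed from the key alone, any client can find a record without consulting a central directory. With plain modulo placement, though, adding a server would move most keys, which is one reason a production system would likely use a more careful placement scheme.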
Enterprises live on SQL. They employ SQL developers. They hold treasure troves of SQL-accessible data, and they understand SQL processing as a workload. Thus, the quickly rising buzzword “NoSQL” began to stretch out and encompass these new ways of storing SQL data.
“Not Only SQL” is now the call, and it’s used to highlight the myriad choices developers now face when choosing a database. The days of picking between a CSV file and Oracle are behind us. Today, developers can choose the right database for the job, and that’s not just some broad, three-tier category. Databases are as varied as birds or insects, and carry with them just as many adaptations to exist in their niche.
What’s driving uptake of NoSQL
Evaldo H. de Oliveira, director of business development for FairCom, said that the database world has been moving very quickly in the past few years, but a lot of the new ideas being discussed aren’t entirely new.
“Our database is different than others. We’re not a pure relational database,” he said. “We’ve always been a bit more designed for high-performance applications. If you follow the database market, the majority of the market in the last year has been about SQL. We have SQL as well, but our technology has been designed to use and take advantage of the non-SQL way to handle the data, so that’s made a difference for us.
“It was funny when everyone started talking about NoSQL. That’s been our business forever.
“The definition of NoSQL starts with a ‘No.’ Something that’s not, which can be a lot of things. There are multiple different types of NoSQL databases. There are graph databases, document databases. There are multiple different types of databases. One of them is [the] key-value store. That’s where we fit.”
Those old ideas are coming in handy because of the urgent business need for information, analytics and real-time information processing. De Oliveira said there is a growing need in enterprises for “real-time analytics: the kind of thing that needs to run on a single version of the truth.
“There are a lot of analytics going in one direction, which is real-time analytics on top of transactions. For a credit card system, you have their credit card number and you’re processing the transactions in real time, live. But at the same time if the systems have a full real-time system like ours, you can run analytics on top of that. Because of the Internet of Things, a lot of customers don’t want to do big analytics on batch the next day; they want to run it live on real-time systems. In the next five to 10 years, [businesses are] going to need something more in real time.”
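To make the idea of analytics running on top of live transactions concrete, here is a minimal Python sketch. The class `LiveCardStats`, its fields, and the card labels are hypothetical, and amounts are assumed to be integer cents. It only shows aggregates being kept current as each transaction is recorded, rather than recomputed in a batch the next day. It does not represent any vendor's product.

```python
from collections import defaultdict


class LiveCardStats:
    """Keeps per-card aggregates current as each transaction is recorded."""

    def __init__(self):
        self.ledger = []                 # the transaction records themselves
        self.count = defaultdict(int)    # card -> number of transactions
        self.total = defaultdict(int)    # card -> total amount in cents

    def record(self, card, amount_cents):
        # The write and the analytics update happen in the same step,
        # so a query never waits for a nightly batch job.
        self.ledger.append((card, amount_cents))
        self.count[card] += 1
        self.total[card] += amount_cents

    def average(self, card):
        n = self.count.get(card, 0)
        return self.total.get(card, 0) / n if n else 0.0


stats = LiveCardStats()
stats.record("card-A", 2500)
stats.record("card-A", 1500)
stats.record("card-B", 900)
```

After these three calls, `stats.average("card-A")` reflects both card-A transactions immediately, because the ledger and the aggregates are updated together and share a single version of the data.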
Big Data dealing
Jack Norris, CMO for MapR, agreed that interest in NoSQL is being driven by business analysts, who want that up-to-the-minute view of the business through analytics. To this end, Apache Hadoop has dominated a lot of the talking points around modern Big Data processing.
“There are some uses where I can do some of my ETL processing on Hadoop, and use that as a refinery,” said Norris of the practice of storing everything in Hadoop, then pouring out select datasets into other databases for analysis.
This tactic has been particularly popular for data stores that have connectors to Hadoop, such as Apache Cassandra. But Norris said that the database a developer chooses for this is only half a solution. Instead, he said, keeping that data in Hadoop and doing analysis there means faster access to data.
“People are starting to understand what the technologies are good at and what they’re not good at,” said Norris. “They’re looking at creating the applications and deployments that really impact the business. It’s beyond the experimental phase and really looking at how do we create significant value.
“What we’ve seen is increasingly it’s not about a separate silo of database operations. It’s more about doing that in conjunction with Hadoop. I’ve got all this unstructured data coming in, and I want to impact the business and do that in a real-time way.”
Thus, the true promise of NoSQL—whether standalone or inside Hadoop (as it is with HBase)—is to make the data more accessible to everyone involved. Without the need for painful data migrations and transitions, analysts can get to their answers faster and with less of a headache.
“The environments are complex and hybrid,” said FairCom’s de Oliveira, “so most of the time they have different technologies. There’s a lot going on in what we used to call ETL, but it’s been a little more sophisticated than that, because the transform process is changing and consolidating data. It has a lot to do with Internet of Things: There’s information being generated that needs to be stored somewhere. There are so many different types of data that need to be processed. It’s about having records of those transactions.
“We are a source, but also the target of these consolidations. Sometimes they use us to consolidate this data. For example, they load C-Tree with all the stock market data and sell it to traders in real time.”
And Hadoop environments are a big part of that ETL revolution. Thanks to Apache HBase, for example, even relational data can be stored in Hadoop for processing.
“We’re seeing a huge change as companies look at their data centers and the rate of data ingress, their new data sources, and the applications they’re required to push out,” said Norris. “They have to replicate and do separate ETL activities, and they want to always be agile. Bring those together and it’s a new platform where you land the data and perform operations directly on it. We’ve seen great strides, but we’ll continue to see huge changes and you’ll continue to see MapR lead with innovations.”
How is your database different?
“As a key-value store database, we provide all the interfaces for applications that need to handle the data under the key-value store model. We handle data without the defined schema. It’s a very flexible schema for data.
“For the document databases, they have a schema-less model because everything’s a document. The key-value pair is different than that because the key-value pair is stored in a schema-less way and you use indexes overtop of it. The difference is being able to support transactions. We support full transactions. We are ACID-compliant, and we’ve always been ACID-compliant.
“C-Tree is used in payment systems. These customers never really cared about multiple tables and querying; all they care about is the SLA of the credit card transaction.
“We’re used in a typical key-value store situation, where the application needs to be fast enough to find the info and needs to be flexible enough to have multiple schema types. This is typical for NoSQL scenarios.”
— Evaldo H. de Oliveira, Director of Business Development for FairCom
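The combination described in this sidebar (schema-less records, an index layered on top, and transactions) can be sketched in a few lines of Python. Everything here is hypothetical, including the class name `TinyKV`, the `index_field` parameter, and the record shapes, and none of it reflects c-tree's actual API. The sketch covers only the all-or-nothing (atomicity) part of ACID. Durability and isolation are out of scope.

```python
import copy


class TinyKV:
    """In-memory key-value store with one secondary index and atomic batches."""

    def __init__(self, index_field):
        self.index_field = index_field
        self.records = {}   # key -> record (a dict of any shape)
        self.index = {}     # indexed value -> set of keys

    def _unindex(self, key):
        old = self.records.get(key)
        if old is not None and self.index_field in old:
            value = old[self.index_field]
            self.index[value].discard(key)
            if not self.index[value]:
                del self.index[value]

    def put(self, key, record):
        self._unindex(key)
        self.records[key] = record
        if self.index_field in record:
            self.index.setdefault(record[self.index_field], set()).add(key)

    def find(self, value):
        """Look up keys through the index instead of scanning every record."""
        return sorted(self.index.get(value, ()))

    def transaction(self, operations):
        """Apply every operation, or roll back to the prior state if one fails."""
        snapshot = (copy.deepcopy(self.records), copy.deepcopy(self.index))
        try:
            for op in operations:
                op(self)
        except Exception:
            self.records, self.index = snapshot
            raise


store = TinyKV(index_field="card")
store.put("txn:1", {"card": "card-A", "amount": 25})
store.put("txn:2", {"card": "card-A", "amount": 40, "merchant": "cafe"})  # extra field


def failing_batch(s):
    s.put("txn:3", {"card": "card-B", "amount": 10})
    raise ValueError("simulated failure part-way through the batch")


try:
    store.transaction([failing_batch])
except ValueError:
    pass  # the partial write to txn:3 has been rolled back
```

The two stored records have different fields, yet both are reachable through the same index on `card`. The failed batch leaves no trace in either the records or the index, which is the behavior a payment system would want from a transaction.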
How is your Hadoop different?
“We’re also seeing if you’ve got a Hadoop you can trust, a platform that’s available and has full business continuity, then instead of moving data around, you’re landing it and doing applications on top of that. Typically, you’re seeing companies doing a wide variety of applications. Eighteen percent of our customers have 50 or more applications running on a single cluster.
“You’re basically pulling together data and doing different analytics directly on that. In many cases, it’s because you don’t have the time to ship it off to another system and load it in and do transformations. That’s great if you are doing reporting on what happened last week or yesterday in the business, but if you’re trying to impact business as it happens, that’s a dramatic change. That’s what we’re seeing with ad media who do 100 billion ad auctions a day. They’re making adjustments while they’re happening. If it’s fraud detection, deciding while the credit card swipe is taking place, ‘Is this fraudulent activity?’
“When you’ve got the customer on the website and you’re deciding what to show them and what product to recommend, or what additional info to make available to them. That’s where you basically have to have capabilities that provide that consistent low latency. Latency is a big issue when you’re looking at integrating in Hadoop. We’ve focused on that for some time. From the lowest level of architecture up through the stack, we’ve made different product enhancements to provide consistent low latency.”
— Jack Norris, CMO of MapR
What about HBase?
“HBase is the standard NoSQL option with Hadoop,” said MapR’s Norris. “It has scalability advantages beyond some of the other NoSQLs. It is architected to work inside Hadoop. Where we’ve invested has been in optimizations and making sure HBase applications can run on an enterprise-grade NoSQL option: MapR-DB. It provides consistent low latency, eliminated Java compactions, [and] eliminated downtimes. We have customers that have used that in a variety of applications. It’s in cable TV ad insertions, optimizing across 50 million set-top boxes.
“We have global table replication, and that is a big advantage. Eventual consistency really limits the type of applications and type of data you would trust using a NoSQL model. MapR-DB supports enterprise-grade, mission-critical applications.”
A Buyer’s Guide to NoSQL
Amazon: Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. It is a fully managed cloud database and supports both document and key-value store models. Its flexible data model and reliable performance make it a great fit for mobile, Web, gaming, ad technology, IoT, and many other applications.
The Apache Software Foundation: Apache Accumulo is a key-value store, based on Google’s BigTable design, that provides a robust, scalable, high-performance data storage and retrieval system. Apache Cassandra is a highly scalable, high-availability, high-performance wide column store also based on Google’s BigTable design. Apache CouchDB is a document database that uses JSON for documents, JavaScript for Map/Reduce queries, and regular HTTP for an API. Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. Apache HBase is a Hadoop database that provides random, real-time read/write access to Big Data, and it’s designed to host very large tables consisting of billions of rows and millions of columns using clusters of commodity hardware.
Basho Technologies: Riak KV is a NoSQL open-source database that uses a key-value model. It combines operational simplicity with high availability, scalability, and fault tolerance. Riak KV can be used to store text, images, documents, user and session data, log files, etc. Riak KV Enterprise includes multi-cluster replication ensuring low latency and robust business continuity.
BrightstarDB: BrightstarDB is a fast, embeddable NoSQL database for .NET. It is cross-platform and open source under an MIT license, supporting Linux, OS X, iOS and Android as well as Windows. It provides a code-first entity framework enabling developers to benefit from its schema-free triple-store while still using strongly typed objects and LINQ in their applications.
Cloudera: Cloudera Enterprise is the first unified platform for Big Data, powered by the world’s most popular Hadoop distribution. It includes the most powerful open-source frameworks, including Apache Spark and Impala for the fastest analytics. Cloudera Enterprise is designed specifically for mission-critical, production environments, with the simplest administration, compliance-ready security, and comprehensive data management.
Couchbase: Couchbase Server is a high-performance, open-source distributed NoSQL database for building Web, mobile and IoT applications. Couchbase Server can be deployed as a document database, a key-value store or a distributed cache. It scales across commodity hardware to support massive data sets with a high number of concurrent reads and writes while maintaining low latency and strong consistency. Couchbase Server 4.0 will include the N1QL query language that extends SQL to JSON, enabling developers with existing SQL expertise to easily write applications on a NoSQL database.
DataStax: DataStax delivers Apache Cassandra in a database platform that meets the performance and availability demands of Internet of Things, Web and mobile applications. It gives organizations a secure, fast, always-on database technology that remains operationally simple when scaled in a single data center or across multiple data centers and clouds. DataStax offers enterprise-grade capabilities such as search, analytics, in-memory computing, advanced security, automated management services, and visual management and monitoring, among others.
FairCom: c-treeACE is a fully ACID key-value database that supports multiple relational and non-relational APIs. Its unique No+SQL technology facilitates high-performance NoSQL and industry-standard SQL access within the same application over the same data. Flexible schema records, customizable data types, and high-speed indexing, as well as Stored Procedures, user-defined functions, triggers, and full transaction support, make c-treeACE an ideal NoSQL database for mission-critical applications.
MapR: MapR-DB is an enterprise-grade, high-performance, in-Hadoop NoSQL database-management system. It lets you run Apache HBase applications with higher performance and reliability. MapR-DB delivers the speed, scalability and flexibility needed for today’s Big Data environments. It is integrated into the MapR Distribution to support running operational and analytical workloads in the same cluster. In addition to MapR-DB, MapR supports Apache HBase. MapR-DB is available in MapR Enterprise Database Edition and for unlimited production use in the freely downloadable MapR Community Edition.
MarkLogic: MarkLogic is the only Enterprise NoSQL database platform with the flexibility, scalability and agility of NoSQL combined with enterprise-hardened features like ACID transactions, high availability and failover, disaster recovery, government-grade security, full-text search, semantics, and schema-agnostic data modeling. MarkLogic improves time to market by making it easy for developers to implement new business logic, meeting or even exceeding business objectives.
MemcacheDB: MemcacheDB is a distributed key-value storage and retrieval system designed for persistence. It is not a cache solution. MemcacheDB conforms to the memcached protocol, so any memcached client can connect to it. MemcacheDB uses Berkeley DB as a storage back end to support features such as transactions and replication.
MongoDB: MongoDB is a next-generation database that helps businesses transform their industries by harnessing the power of data. Companies use MongoDB to create applications never before possible at a fraction of the cost of legacy databases. Specifically, MongoDB stores data in JSON documents, taking advantage of JSON’s seamless mapping to native programming language types and dynamic schema so data models can evolve easily compared to relational databases. It scales linearly and processes queries much faster than a relational database. In addition, MongoDB is easy to install, configure, maintain and use.
Neo Technology: Neo4j helps businesses create new products and services by bringing data relationships to the fore. Neo4j combines a native graph property model with ACID transactions, making it crucial for applications in master data management, IT operations, fraud detection, real-time recommendations and graph-based search tools.
NuoDB: NuoDB is a fully ACID-transactional SQL DBMS, but architected for the cloud, with high-speed elastic scale-out and scale-in. NuoDB operates a distributed in-memory cache, with continuous availability supported by arbitrary levels of redundant distributed persistence. With its unique, patented architecture, NuoDB offers global transactional consistency across a globally distributed database.
Objectivity: InfiniteGraph is a graph database that enables organizations to ask deeper, more complex questions across new and existing data stores by traversing complex relationships requiring multiple hops across vast and distributed data stores. The latest version improves search results and provides faster ingest performance. In addition, the Visualizer has been enhanced for visualizing and navigating the graph, and navigation policies can be saved in the graph for later reuse.
Oracle: Oracle NoSQL is a distributed key-value database designed to provide highly reliable, scalable and available data storage across a configurable set of systems that function as storage nodes. It provides a powerful and flexible transaction model that greatly simplifies developing a NoSQL-based application. Oracle NoSQL scales horizontally with high availability and transparent load balancing even when dynamically adding new capacity.
Pivotal: GemFire/Geode is a NoSQL in-memory database for extreme-scale applications, providing advanced database capabilities in extreme low-latency and high-concurrency environments for custom applications. GemFire can uniquely support globally distributed environments, massive client/user queries and in-memory distributed functions, and it scales linearly to support any transactional application load. Redis is an open-source, BSD-licensed, advanced key-value data store. It is often called a data structure server since keys can contain strings, hashes, lists, sets, and sorted sets. Redis works with an in-memory data set to speed up performance, and it supports master-slave replication. It includes many other features such as transactions, pub/sub, Lua scripting, time-limited keys, and configuration settings that allow Redis to behave like a cache.