First, there’s size. Many data problems aren’t a question of volume, but more of velocity, variety or veracity. While a Hadoop cluster easily scales up to contend with massive amounts of unstructured data, MongoDB is a NoSQL alternative that functions well on several terabytes of data, but may run into limitations after that point. Another option is Apache HBase, whose flexible data model is good for quickly nabbing small stats within large columnar data sets. And many are fond of the lesser-known Cassandra, another NoSQL, fault-tolerant database written in Java, popular for storing huge machine-to-machine data sets or transaction logging. Accumulo is an interesting option offering cell-level security. There are other flavors, each with pros and cons: Couchbase, Riak and Redis, to name a few.
“One technology people use today is NoSQL databases, because NoSQL is designed to be highly scalable. Companies like Uber, in last two years or so, built an architecture with a huge, huge set of NoSQL databases because of their access to mobile content. On the back-end is Cassandra, then there’s Teradata. In the middle is Talend, tying it all together,” said Ciaran Dynes, Dublin, Ireland-based vice president of product at Talend, a French master data management company and enterprise service bus provider.
Hadoop’s success comes from its ability to distribute computation massively across clusters of commodity hardware, at a time when storage costs in the cloud have plummeted.
“The economic argument for Hadoop is, in the old data warehousing days, $40,000 per terabyte of data was the traditional cost. Now we’re talking about $1,000 per terabyte for Hadoop. That’s a 40x savings, if you ignore the fact that data is growing. So with growing data, you may need something alongside your Teradata data warehouse. You may be legally required to archive information. Hadoop is almost at the same price point as it would have been for the old archive technology, but Hadoop is a data processing platform whereas tape isn’t. Hadoop gets you best of both worlds, and maybe some data scientist helps you find some lost or hidden gems,” said Dynes.
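The arithmetic behind Dynes’ comparison can be sketched in a few lines. This is a minimal illustration that uses only the two per-terabyte figures he quotes; the function names and the 100-terabyte example volume are assumptions made for illustration, not figures from Talend.

```python
# Per-terabyte figures quoted by Dynes (US dollars).
WAREHOUSE_COST_PER_TB = 40_000
HADOOP_COST_PER_TB = 1_000


def storage_cost(terabytes, cost_per_tb):
    """Total cost of holding a given volume at a flat per-terabyte price."""
    return terabytes * cost_per_tb


def savings_ratio(old_cost_per_tb, new_cost_per_tb):
    """How many times cheaper the new platform is per terabyte."""
    return old_cost_per_tb / new_cost_per_tb


ratio = savings_ratio(WAREHOUSE_COST_PER_TB, HADOOP_COST_PER_TB)
print(f"Per-terabyte savings: {ratio:.0f}x")

# Hypothetical volume, chosen only to show how the gap widens as data grows.
example_tb = 100
print("Warehouse:", storage_cost(example_tb, WAREHOUSE_COST_PER_TB))
print("Hadoop:   ", storage_cost(example_tb, HADOOP_COST_PER_TB))
```

The ratio stays fixed at 40x, but the absolute gap grows with every terabyte added, which is the sense in which growing data strengthens the case for running Hadoop alongside a warehouse.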
But there are still costs associated with Hadoop, depending on how it is set up, tuned and supported. Microsoft’s HDInsight, comprising Hortonworks Hadoop ported to the 64-bit version of Windows Server 2012 R2 and supporting the .NET Framework, also offers Azure Blob storage for unstructured data. Blob storage has advantages over the Hadoop Distributed File System (HDFS): it’s accessible, via REST APIs, to more applications and other HDInsight clusters; it’s an archive to the HDFS data lake (as redundant as that sounds); it’s a cheaper storage container than the Hadoop compute cluster and cuts data loading costs; it has automatic elastic scaling that can be less complex than HDFS node provisioning; and it’s geo-replicated, if need be (though that’s more expensive).
And what about personnel? Hortonworks argues that the Hadoop learning curve is nontrivial and, these days, non-essential: “New world companies that are data-driven from day one may have a team of 20 guys using Hadoop and managing their own Hadoop cluster. They don’t realize, ‘Hey I don’t need to do this anymore. The Hadoop thing isn’t the differentiator, it’s what I’m doing with it that’s the differentiator,’” said Jim Walker, director of product marketing at Hortonworks, which sells enterprise Hadoop support while proudly proclaiming its status as the 100 percent open-source provider of Hadoop distributions.
It’s in the box
Oracle makes the same argument with its Oracle Big Data Appliance X3-2, based on the Sun X3-2L, a commodity x86 server for distributed computing. “Many people ask ‘Why should any organization buy a Hadoop appliance?’ While this is a valid question at first glance, the question should really be ‘Why would anyone want to build a Hadoop cluster themselves?’ Building a Hadoop cluster for anything beyond a small cluster is not trivial,” according to the Oracle Big Data Handbook (Oracle Press, September 2013). The process of building a large Hadoop cluster is closer to erecting a skyscraper, the book contends, and is thus best left to experts. Economically, too, Oracle makes the argument that its Big Data Appliance list price of $450,000 plus $54,000 per year of maintenance is ultimately cheaper than building an 18-node cluster on your own.
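Oracle’s pricing claim is easy to turn into a running total. The sketch below uses only the list price and annual maintenance figure cited above. The three-year horizon, the node-for-node comparison with an 18-node cluster, and the function names are assumptions made for illustration; the cost of a self-built cluster is not given here, so none is modeled.

```python
# Figures cited from Oracle (US dollars).
LIST_PRICE = 450_000
ANNUAL_MAINTENANCE = 54_000

# Assumption: the appliance is compared node-for-node with an 18-node cluster.
NODES = 18


def appliance_cost(years):
    """List price plus maintenance accumulated over a number of years."""
    return LIST_PRICE + ANNUAL_MAINTENANCE * years


def cost_per_node(years):
    """Total appliance cost spread evenly across the assumed node count."""
    return appliance_cost(years) / NODES


# Hypothetical three-year horizon.
years = 3
print("Total over", years, "years:", appliance_cost(years))
print("Per node:", cost_per_node(years))
```

A self-built cluster would need to come in under the resulting per-node figure, including hardware, setup, tuning and support, for the do-it-yourself route to be cheaper over the same period.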
Like Oracle, IBM is no newcomer to database technology. InfoSphere BigInsights melds its expertise in relational databases, SPSS advanced analytics, grid computing, modeling and distributed data crunching with Hadoop. The ubiquity of IBM and Oracle solutions isn’t a legacy: it’s a gold mine. Historic data expertise can’t be downplayed as enterprises seek to define new types of enterprise data hubs via Hadoop, especially since valuable data, once discovered, is likely to be propagated back into the enterprise data warehouse.
Twenty years ago, Oracle’s decision support systems or data warehouses “became engines for terabyte-sized data warehouses. Today Oracle-based data warehouses have grown into the petabytes thanks to many further improvements in the Oracle Database and the introduction of engineered systems such as the Oracle Exadata Database Machine,” according to the Oracle Big Data Handbook. Why does this matter? Because, “Over time, the functionality mix will increasingly overlap in each as more Hadoop capabilities appear in relational databases and more relational database capabilities appear in Hadoop,” the handbook says. Thus the performance enhancements and redundancy of traditional data warehouses may be useful in hybrid applications, which are likely to be the norm.