When it comes to Big Data, the big news stories swirl around the Apache Hadoop project. Hadoop’s popularity has many causes, but its success hasn’t done much to make the Big Data puzzle any easier to solve. Hadoop promises a place to put all your data, but actually deriving business value from that data is another matter entirely.
After all, the Big Data revolution is not just about storing that data, said Luis Maldonado, director of product management for HP Vertica. Maldonado said that enterprises want to “query that data and have a conversation with it. It allows me to have conversations I haven’t thought about before. Understanding customers, no matter what your vertical, has been a big push.”
And with so much data being generated by those customers, there’s never been a better time to try to comprehend why they do what they do. “There’s a big focus on better understanding your customer,” said Maldonado. “People are starting to understand, ‘How are my customers segmented?’ ‘How effective are my campaigns in retaining customers and acquiring them?’ ‘If I have loyalty programs, how do I understand the effect these have?’”
Unfortunately, customers don’t keep their data in neatly ordered relational data stores. They communicate with enterprises through Twitter, Facebook, the corporate website, partner sites, and even the good old-fashioned telephone.
What if there were some magical place where you could store all of this data, structured or not, from customer transaction records, to security camera footage, to tweets, to relational data stores, all the way down to voice recordings of tech-support calls? And what if you could build such a data store on open-source software and commodity hardware?
The Apache Hadoop Project is, if nothing else, a place to put the data and to perform computations upon it, no matter its form. A Hadoop cluster is built upon the Hadoop Distributed File System (HDFS), which can spread petabytes of data across commodity hardware reliably, but not yet in a highly available (HA) fashion. With the help of Apache HBase, relational data stores, such as MySQL and Oracle, can be dumped into Hadoop with their relational information intact. And if you have a good Java developer, you can use all of this infrastructure to perform queries upon petabytes of data at a time.
Untangling that data
In years past, analysts using R, SAS or some other data-analysis platform would write complex computations and statistical analysis routines that ran against uniformly coded data in a more traditional data store.
Hadoop, however, requires a Java developer to write what’s known as a Map/Reduce job in order to process data inside the cluster. While Map/Reduce was designed to save developers time—requiring them only to write the code needed for the problem they’re trying to solve on a large data set—writing Map/Reduce jobs is still a programming task suited to an actual Java developer, not to a business analyst, for example.
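To make the division of labor concrete, here is a minimal sketch of the Map/Reduce pattern in plain Python. It is not Hadoop’s Java API and runs on a single machine; the function names (map_phase, shuffle, reduce_phase, run_job) and the sample tweets are invented for illustration. The developer writes only the map and reduce steps, while the grouping step in the middle stands in for work the framework would do across the cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key. In a real cluster the framework does this part."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values seen for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory data set."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three short customer tweets.
tweets = ["great service today", "service was slow", "great product"]
counts = run_job(tweets)
print(counts)
```

Even in this toy form, the analyst’s question (“which words come up most?”) has to be recast as key/value pairs, which is the programming step the article describes.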
Maldonado said, “There’s not a lot of talent available. There’s high demand for people who can write Map/Reduce functions and can write out a rich analytics platform on Hadoop.”
There is hope, however, in the form of Cascading, Hive, Pig and about a dozen other new and maturing data-access and management layers for Hadoop. Pig, for example, is a platform built on Hadoop to provide a procedural data-access layer through a language called Pig Latin. Hive, on the other hand, is Facebook’s attempt to build a SQL-like query language on top of Hadoop.
Chris Wensel, founder of enterprise Big Data company Concurrent, said, “Hadoop, in theory, is cheaper than a Teradata database or Greenplum. But mainstream people out there don’t know Java. They’re not going to learn Cascalog, they know SQL, and Greenplum and Teradata are already SQL. The value is that you’ve already written applications in SQL, but you have a team competing for resources on the Teradata system, so migrating code onto Hadoop with as little code as possible is the first pain point.”
Wensel is the creator of Cascading, an API layer for Hadoop that he said smooths over this pain point. He added, “Historically, our adoption has been through attrition. Most people get seduced by the idea that they can write a couple lines of a SQL-like syntax and put things into production. But once you want to do something interesting, and want to base your business on it, you need to do things like write unit tests.”
Cascading was built to address many of the common headaches developers encountered building on Hadoop, Hive and Pig. For example, said Wensel, “If you’re processing and seeing bad data, Cascading stores it off to the side so you can inspect it postmortem without killing the application.”
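The idea Wensel describes, setting bad records aside rather than letting them kill the job, can be sketched in a few lines of plain Python. This is not Cascading’s API; the record format (“customer_id,amount”) and the names parse_order and process_with_trap are assumptions made up for the example.

```python
def parse_order(line):
    """Parse a 'customer_id,amount' line; raises ValueError if it is malformed."""
    customer_id, amount = line.split(",")
    return customer_id.strip(), float(amount)


def process_with_trap(lines):
    """Parse every line, diverting failures to a trap list instead of aborting."""
    good, trap = [], []
    for line_number, line in enumerate(lines, start=1):
        try:
            good.append(parse_order(line))
        except ValueError as error:
            # Keep the raw line and the reason so it can be inspected afterward.
            trap.append((line_number, line, str(error)))
    return good, trap


# Hypothetical input: two clean records and two bad ones.
raw_lines = ["c1,19.99", "c2,not-a-number", "c3", "c4,5.00"]
good, trap = process_with_trap(raw_lines)
print(good)
print(trap)
```

The job finishes with the clean records processed, and the trapped lines remain available for a postmortem.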
He said that because Cascading is an API, “you can build frameworks on top of it. You can build reusable code that anyone on your team can use, and you can write your own languages, like PyCascading, or the JRuby variant. There are lots of other internal projects to support other higher-level languages, like Scala and Clojure.”
Analytics come to the business side
Naturally, there has been demand for more traditional data-access layers inside and alongside Hadoop. A common model is to use Hadoop to sort and bundle relevant subsets of data, then offload that into a smaller, more analytics-focused data store. Many companies such as DataStax, HP and Revolution Analytics are taking this approach, working in tandem with Hadoop.
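A rough sketch of that offload model, in plain Python: a filtering step stands in for the Hadoop job that bundles the relevant subset, and an in-memory SQLite database stands in for the smaller analytics store. None of this reflects any vendor’s product; the event fields, table name and function names are hypothetical.

```python
import sqlite3

# Hypothetical raw events, standing in for mixed data sitting in the cluster.
raw_events = [
    {"customer": "c1", "channel": "web", "campaign": "spring", "spend": 20.0},
    {"customer": "c2", "channel": "twitter", "campaign": "spring", "spend": 0.0},
    {"customer": "c3", "channel": "phone", "campaign": "loyalty", "spend": 35.5},
    {"customer": "c1", "channel": "web", "campaign": "loyalty", "spend": 10.0},
]


def extract_purchases(events):
    """Bundle step: keep only the events that involved a purchase."""
    return [
        (event["customer"], event["campaign"], event["spend"])
        for event in events
        if event["spend"] > 0
    ]


def load_into_store(rows):
    """Offload step: load the reduced subset into a small SQL-queryable store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer TEXT, campaign TEXT, spend REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_into_store(extract_purchases(raw_events))

# An analyst can now ask questions in ordinary SQL against the smaller store.
totals = conn.execute(
    "SELECT campaign, SUM(spend) FROM purchases GROUP BY campaign ORDER BY campaign"
).fetchall()
print(totals)
```

The point of the pattern is that the heavy sifting happens once, in the cluster, and the business side works against a store it already knows how to query.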
David Smith, vice president of corporate marketing at Revolution Analytics, said, “If you hire data scientists, there is an excellent chance they’re already trained in R. You can use all the flexibility of R to distill that data and, for example, extract sentiment from text. There are all sorts of unimagined applications sitting there in Hadoop that need a data science process to figure out what’s going on there. One of the big advantages of using the R Hadoop packages is that you only need to learn one language. The alternative is to learn Java and Pig and Hive just to be able to do an abstraction and analysis of the data.”
Maldonado said that HP Vertica can help bring Hadoop to the business analytics people as well. “We started with integrating at the Map/Reduce level,” he said. “We made it simple for the Map/Reduce developer to pull and push data from the Vertica analytics platform. This past fall, we introduced a simplified model for customers who don’t have expertise in Hadoop.”
But moving data to and from Hadoop, then controlling access to that data, is an essential piece of the enterprise Hadoop puzzle, and a solution is only just now forthcoming. According to Brian Christian, CTO of Hadoop security company Zettaset, “You have this weird storm of regulated industries. This even includes telcos, water supplies and utilities where they’re doing constant data sampling. They’re trying to create smart grids and they’re getting this data overflow, and there really is no good way to handle it. They turn to Hadoop, and again they can’t get Hadoop to comply. Now they’re in this catch-22 of ‘What do I do?’”
Christian said Zettaset offers enterprise-class security for Hadoop clusters, allowing teams locked down by HIPAA or Sarbanes-Oxley to maintain controls on that otherwise unruly blob of data inside of a Hadoop cluster.
A laundry list of changes
Amid all of the Hadoop hype is another freight train of promises. This train is known as Hadoop 2.0, and it should reach the station this summer. Headed up by the folks at Hortonworks, Hadoop 2.0 includes a number of fundamental changes that will mold Hadoop into something that looks less like a Map/Reduce cluster and more like a generic workload cluster with a highly available file system.
In version 2.0, HDFS is being improved to allow for predictable availability and better redundancy of data. The result will be that HBase, the non-relational, column-oriented data store that runs on top of HDFS, will now be reliable enough to run as a front-end database rather than a back-end storage framework for dumping MySQL information.
The actual job-scheduling system behind Map/Reduce in Hadoop has been rewritten from the ground up in Hadoop 2.0. Known as MapReduce 2.0 (MRv2), or YARN, this project splits the two main duties of the original Map/Reduce JobTracker daemon, cluster resource management and job scheduling and monitoring, into separate daemons. The net effect will be better cluster sharing across jobs, as well as the ability to run non-Map/Reduce workloads on Hadoop. Thus, Hadoop 2.0 will be able to shoulder generic workloads, not just those expressed in Map/Reduce terms.
But while the Apache Software Foundation and Hortonworks push their open-source vision for Hadoop, MapR has spent the last three years fixing many of the problems Hadoop 2.0 addresses. Tomer Shiran, vice president of product management at MapR, said that his company’s distribution of Hadoop offers many features not available in other forms of the platform. “MapR is the only distribution that can do disaster recovery and incrementally send that to another data center,” he said.
“It’s the only one that replicates the metadata of the cluster. Having that full [high availability] across every layer of the stack is unique to MapR as well. You look at any other system in the enterprise, whether that’s Teradata, NetApp or Oracle; they all have these capabilities. They have to have these capabilities.”
Shiran said MapR can claim these capabilities because the company was “able to rearchitect that layer of the stack, and that took several years. [MapR] is a read/write system, it has full read/write support. One of the challenges of HDFS is it was built for a search engine, where the only use case was to crawl. Because Hadoop has evolved so much into all these other use cases, the limited use case of crawling is no longer the only one.”
Thus, MapR and Zettaset claim to have made Hadoop enterprise-ready now, while Hortonworks continues to push for solutions to problems inside of the Apache Software Foundation. While these capabilities may someday become standard in all installations of Apache Hadoop, for now they are available only from enterprise software companies.