When it comes to Big Data, the big news stories swirl around the Apache Hadoop project. While there are many reasons for Hadoop’s popularity, its success hasn’t done much to make the Big Data puzzle any easier to solve. While Hadoop promises a place to put all your data, actually deriving business value from that data is another matter entirely.
After all, the Big Data revolution is not just about storing that data, said Luis Maldonado, director of product management for HP Vertica. Maldonado said that enterprises want to “query that data and have a conversation with it. It allows me to have conversations I haven’t thought about before. Understanding customers, no matter what your vertical, has been a big push.”
And with so much data being generated by those customers, there’s never been a better time to try to comprehend why they do what they do. “There’s a big focus on better understanding your customer,” said Maldonado. “People are starting to understand, ‘How are my customers segmented?’ ‘How effective are my campaigns in retaining customers and acquiring them?’ ‘If I have loyalty programs, how do I understand the effect these have?’”
Unfortunately, customers don’t keep their data in neatly ordered relational data stores. They communicate with enterprises through Twitter, Facebook, the corporate website, partner sites, and even the good old-fashioned telephone.
What if there was some magical place where you could store all of this unstructured data, from customer transaction records, to security camera footage, to tweets, to relational data stores, all the way down to voice recordings of tech-support calls? And what if you could build such a data store on open-source software and commodity hardware?
The Apache Hadoop Project is, if nothing else, a place to put the data and to perform computations upon it, no matter its form. A Hadoop cluster is built upon the Hadoop File System, which can spread petabytes of data across commodity hardware reliably, but not yet in a highly available (HA) fashion. With the help of Apache HBase, relational data stores, such as MySQL and Oracle, can be dumped into Hadoop with their relational information intact. And if you have a good Java developer, you can use all of this infrastructure to perform queries upon petabytes of data at a time.
Untangling that data
In years past, analysts using R, SAS or some other data-analysis platform would write complex computations and statistical analysis routines that would run against a more traditional data store against uniformly coded data.
Hadoop, however, requires a Java developer to write what’s known as a Map/Reduce job in order to process data inside the cluster. While Map/Reduce was designed to save developers time—requiring them only to write the code needed for the problem they’re trying to solve on a large data set—writing Map/Reduce jobs is still a programming task suited to an actual Java developer, not to a business analyst, for example.
Maldonado said that “There’s not a lot of talent available. There’s high demand for people who can write Map/Reduce functions and can write out a rich analytics platform on Hadoop.”
There is hope, however, in the form of Cascading, Hive, Pig and about a dozen other new and maturing data-access and management layers for Hadoop. Pig, for example, is a platform built on Hadoop to provide a procedural data-access layer through a language called Pig Latin. Hive, on the other hand, is Facebook’s attempt to build a SQL-like query language on top of Hadoop.