Big Data may be the new big buzzword, but it’s not an entirely original concept. With some careful digging, you can find organizations that have already been building with and for Big Data for years. What’s really changed in the Big Data revolution is that there are now numerous tools, both open-source and commercial, for handling all of that data.
Bill Yetman, senior director of engineering at Ancestry.com, has a data problem as big as history. (Your history and my history, to be precise.) Ancestry.com doesn’t just track family trees; it houses historic documents and information that is used to verify family histories. And it’s been doing it for more than 10 years.
In practice, that makes the business more like a data-analysis firm with a public-facing interface than a traditional genealogy company. And it also means that its workflows on data have evolved into a replicable pattern: an example of how Big Data should flow through an enterprise.
“We’re a pretty classic data warehouse. We’ve been an enterprise data warehouse for 10 years,” said Yetman. “With that, as we’ve gone over time, we’ve turned around and put a lot of our behavioral data, engagement data, and other info—like how users are building their family trees—out into the data warehouse, which is a little bit backwards. A lot of that data is very unstructured, so we’re taking it and putting it into a structured data warehouse.”
And this approach, prescient as it was, is the future being planned for Big Data everywhere: store everything, even user-interaction data, in a central repository. An active Big Data plan can then take that information and turn it into actionable intelligence for the business side.
Merv Adrian, research vice president at research firm Gartner, said that data has, up until now, been captured in a lossy fashion. It wasn’t possible for businesses to tabulate every possible variable and condition in every single transaction. The store temperature, sales status, and prior customer purchase information were unknowable elements when the focus was on tracking the money and the inventory in a reliable fashion.
A place to put it
One of the biggest reasons for the new desire within enterprises to capture and analyze any and all data, structured or unstructured, is that there is now a place to put all that information. The Apache Hadoop project has revolutionized how companies handle and utilize their longer-term business information.
“One of the first things we did with Hadoop was use it to capture all our logs,” said Yetman. “The way we did this was to turn around and pull that behavioral data out of the data warehouse and into Hadoop, where we’d do the analytics and the processing and evaluation. It’s almost ETL happening in Hadoop.”
Other Big Data analysis platforms have also sprung up to accommodate the market’s newfound need for Big Data tools and platforms:
Sqrrl’s Accumulo platform is built from the U.S. National Security Agency’s now-infamous data-gathering platform. LexisNexis, the legendary subscription-based data-mining service, open-sourced its HPCC Big Data platform in 2011. Twitter’s Storm project offers a different approach to processing Big Data, and the UC Berkeley Spark project pushes the ideals of Hadoop even further while building on top of the HDFS file system. All the while, Hadoop 2.0 is being brewed by the elephants of Hortonworks.
With so many data platforms and processing environments to choose from, the prospect of devising a Big Data strategy for the developers at your company can be daunting. And don’t worry, you’re not overreacting: It’s extremely daunting.
It’s made even more daunting by the prospect of your higher-ups having caught the buzz of Big Data. Gartner’s Adrian said that he’s even had calls where the client has said, “The CEO read about Big Data in a magazine on an airplane, and he got off and said, ‘We need Big Data.’ What does that even mean?”
Even worse, you can’t expect to hire your way into a Big Data solution. Adrian said that talent is at an extreme premium at the moment. He cited Gartner estimates that there will be a need for 4.4 million new data-knowledge workers to handle the demands businesses will be creating in the next decade.
What’s a development manager to do? Yetman has some advice. “How do you identify and capture whatever is all of your data? Because ‘all’ is a lot. What we’ve had to do is approach things in an iterative manner. We identify a key set of data and go after it. We figure out the right way to ingest it and provide high partitioning at a simple level. How do we get it in the hands of someone who can take a look at it and see if it’s providing the value you want? You can’t just boil the ocean. You have to go after things one piece at a time. Find the data that’s going to be the most valuable for you as a business and attack those first.”
Moving the data
At the beginning of any Big Data strategy design endeavor, the first choice is fairly similar to the first choice in business: Where do you set up shop? For Big Data, this typically leads to a few early decisions on platform. For Hadoop users, HDFS or Cassandra are the first choices to consider. Other options, such as running Hadoop on top of ZFS, Lustre or other file systems, are becoming more viable over time as those solutions mature.
But even after choosing your platform’s file system, there are a dozen other data-focused decisions to be made. How will you store your relational data within this platform? How about your unstructured data? How will you manage flat files and versioning? And what methods will you be using to access all of this information?
For relational data, HBase has matured into an increasingly capable data-management platform for traditional enterprise datasets that require ingestion and availability within a Hadoop cluster.
According to Justin Erickson, director of product management at Cloudera, HBase’s community has been focusing on “two big buckets of things. First is stability and durability, and the second is the ease of use of the whole platform. There’s been some general work to harden HBase, so there are less bugs. Some of what we’ve been doing around replication—and the recent work around snapshots—are examples of things we can do to make it easier for new developers to go to the system.”
Those replication and snapshot changes help with the general use of HBase by developers and administrators, said Erickson. One of the primary needs of many developers is simply the ability to quickly test things locally, on their desktop or laptop. He said that Cloudera and the HBase community have been working to make it clear how to prototype in this fashion using HBase, a feat that requires a number of additional moving parts beyond just HBase itself.
While replication and snapshot capabilities are evolving in HBase, MapR has long made its name from offering these capabilities in its Hadoop platform. Tomer Shiran, director of product management at MapR, said that his company’s Hadoop distribution “provides point-in-time recovery things like snapshots and disaster recovery.” Considering that Hadoop clusters tend to range into the petabyte region, disaster recovery can save an organization considerable time and money. It’s an important consideration for any Big Data cluster.
“Having that full high availability across every layer of the stack is unique to MapR as well,” said Shiran. “You look at any other system in the enterprise, whether that’s Teradata, NetApp or Oracle, they all have these capabilities. They have to have these.”
So while open-source Big Data platforms like Hadoop are enticing, they are often devoid of the enterprise functions required by most organizations. It is for this reason that so many different Hadoop distributions are available for enterprises: Cloudera and MapR have their own distributions, Hortonworks favors vanilla Apache Hadoop, and other companies like WANdisco, Microsoft and even Netflix have their own distributions.
But data processing doesn’t begin and end only with Hadoop as the storage framework. HDFS is not the only way to go. The Apache Cassandra project has proven quite popular as a NoSQL storage system for working with Hadoop, said Gartner’s Adrian.
Elsewhere, Leon Guzenda, cofounder and CTO of Objectivity, said that graph databases are a useful alternative tool for analyzing the relationships between data, something that Hadoop is aiming to support through the Giraph Project, which is not yet finished.
Objectivity, on the other hand, has been offering its InfiniteGraph database as a solution for Big Data needs. Guzenda said that InfiniteGraph is a good alternative to NoSQLs and Hadoop as it allows developers “to find relationships not based on statistical correlations. There’s other data in there none of these things touch: It’s in the relationships between the data. It might be just straight visualization of the graph of the network.”
Processing the data
Once the Big Data is in place, it’s time to write the actual code that will process it. Be it in Accumulo, Hadoop, SAP, SPSS or any other system, writing software to read and process all of that information is no small task.
For users of today’s Hadoop, this means writing Map/Reduce jobs in Java. For users of the forthcoming Hadoop 2.0, that means writing just about any batch-processing job you can think of, possibly in numerous languages.
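The Map/Reduce model itself is easy to sketch outside of Hadoop. The following Python sketch simulates the map, shuffle and reduce phases using the canonical word-count example; it is purely illustrative and uses none of the actual Hadoop APIs, which require a running cluster and (today) Java.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs -- here, (word, 1) for each word."""
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse each key's values -- here, sum the counts."""
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = (pair for record in records for pair in map_phase(record))
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts["the"])  # 3
```

On a real cluster, the map and reduce functions run in parallel across many machines, and the shuffle moves data over the network; the programming model, however, is exactly this shape.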
Arun Murthy, founder of and architect at Hortonworks, said that Hadoop 2.0 will enable significantly more use cases for all that data stored in HDFS. “As we looked at Hadoop three or four years ago, we saw that people were putting all the data into HDFS,” he said. “It doesn’t discriminate against data types, but Map/Reduce was the only way to get at that data.”
The natural next step, he said, was to change Hadoop to support other types of workloads on all of that data. Now that there was a place to put all that data, Map/Reduce wasn’t enough to accomplish the myriad tasks corporations wanted to perform on top of it.
This desire leads to the use cases currently filled by Twitter’s Storm project: stream processing and manipulation of information in real time. Murthy said that the desire for this capability is already rearing its head in the form of companies pushing to enable real-time SQL queries on Hadoop.
YARN is the next generation of Map/Reduce for Hadoop. This project seeks to split up the two major functions of the Job Tracker in Hadoop: resource management, and job scheduling and monitoring. YARN allows each individual application to have its own manager, which in turn allows more jobs to be run on a Hadoop cluster concurrently.
“YARN becomes this generic resource,” said Murthy. “We wanted to do streaming event processing; there is a need for interactive SQL, and we have the Hive and the Stinger projects… YARN seemed like the right solution. We’ve been working on this for three years,” and he felt as though the work was almost done.
Once all those jobs are finally brought over to Hadoop, however, there is one sticking point every developer can relate to: optimization. For standard applications, attaching a debugger and some performance-monitoring tools is as easy as a few clicks in the IDE. But when your application is spread across 100 servers, each with its own data store and hardware quirks, performance management can be a nightmare.
That’s why Compuware is offering its application performance-management tools in the Hadoop marketplace. Michael Kopp, technology strategist for Compuware APM Center of Excellence, said, “Our business unit inside Compuware focuses on application performance management. We help our customers around the world troubleshoot and manage the performance of their mission-critical applications in production settings and pre-production settings. In the last year and a half, customers have come to ask us to apply application performance monitoring for their Big Data applications. We have applied APM to Hadoop so we help our customers cut down on effort and time for problem-solving and finding performance issues inside Hadoop.”
Analyzing the data
Once the data is in place, perhaps the most difficult task in this new Big Data world is writing the batch-processing Map/Reduce jobs themselves. It is in this skill space that Gartner’s Adrian described a lack of employable candidates. And while numerous tools, such as Cascading, Hive, Pig and the new Stinger Initiative all attempt to give developers an easier way to access and process data inside of Hadoop, there’s still no easy way to bring an entire team up to speed on writing Map/Reduce, short of paying for some training.
Objectivity’s Guzenda said, “I think we need predictive analytics; it has got to be a lot easier to use. It’s really confusing for people who don’t have a grounding in statistical methods. It’s very easy to come to incorrect conclusions if you don’t understand what the statistics are doing. I’ve seen some interesting examples of that: The fact that the numbers 8 and 5 appear in my phone number, and my height have nothing to do with each other.”
One of the fundamental requirements of any data-analysis project is the ability to go traipsing through the data to figure out just what you’re working with. One of the easiest ways to do this is with a search tool.
LucidWorks makes just such a tool. Originally formed to support the Apache Lucene, Solr and Nutch projects, LucidWorks now offers open-source enterprise search tools that can be embedded in existing applications, or on top of Hadoop.
Grant Ingersoll, CTO of LucidWorks, said that his company comes directly from the same place Hadoop comes from. “A lot of people don’t realize Hadoop started as part of the Lucene project to assist with building large-scale distributed indexes, and along came Yahoo and said, ‘We can use this for other use cases.’ We take a connector-based approach. If you’ve got Hadoop, we’ll treat that like data.”
LucidWorks is also partnering with MapR to spread its search solutions to those customers.
Another company that has long made a business of traipsing through unstructured data is Splunk. Coming from the IT administration side, where analyzing logs can require searching for a single line among millions, Splunk has grown into an enterprise data-discovery tool with a powerful interface for developers.
Clint Sharp, senior product manager for Big Data integrations at Splunk, said, “What’s different about Splunk is that we don’t require you to do structuring and analysis of the data in advance. There are no ETL requirements. I don’t have to give you the data in a tabular form. Give us the data however it sits, and Splunk will be able to read that data and give you the ability to do charting and analytics on it. We’re allowing them to do the analytics on the data without having to do a whole lot of upfront investment in order to do analytics on top of that.”
This is a different mindset from traditional business analytics, where the questions must be prepared before the data is massaged into a form where the answers can be gleaned. Hadoop and Splunk require no prior data normalization, which means the data flow pipeline doesn’t need to have dozens of embedded transformations. Indeed, this can cause some confusion for traditional enterprise users.
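Sharp’s “no ETL” point boils down to schema-on-read: raw records are stored exactly as they arrive, and structure is imposed only at query time. Here is a minimal Python sketch of that idea (the log format, field names and query helper are all invented for illustration; this is not Splunk’s implementation):

```python
import re

# Raw, unstructured log lines stored exactly as they arrived -- no upfront ETL.
raw_logs = [
    '2013-05-01 12:00:01 user=alice action=search latency_ms=120',
    '2013-05-01 12:00:02 user=bob action=view latency_ms=45',
    '2013-05-01 12:00:03 user=alice action=view latency_ms=80',
]

def query(logs, **filters):
    """Impose structure at read time: parse key=value fields, then filter."""
    for line in logs:
        fields = dict(re.findall(r'(\w+)=(\S+)', line))
        if all(fields.get(k) == v for k, v in filters.items()):
            yield fields

# Analytics without any prior normalization: average latency of 'view' events.
views = list(query(raw_logs, action='view'))
avg_latency = sum(int(f['latency_ms']) for f in views) / len(views)
print(avg_latency)  # 62.5
```

The point of the pattern is that nothing about the eventual questions had to be decided when the data was stored; a new question just means a new parse at read time.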
“The first piece of advice I would give is ‘Don’t throw it away,’ ” said Sharp. “The key to analyzing the data is having it. Find a place to store it. We don’t care if that’s in Hadoop or not. Keep the data.
“Second, my advice would be [that] the world of ETL is, from my perspective, a wasted investment. Rather than doing a bunch of reformatting and structuring, you’re going to chew through a lot of labor doing that. When we worked on a business-intelligence project, well more than 50% to 70% of the investment in an analytics or [business intelligence] project is just collecting the data and getting it into the right rows and columns: figuring out how to structure the data so I can ask it the right questions.”
Those days are over, he added, thanks to unstructured data analysis platforms like Hadoop and Splunk.
That’s good news for Ancestry.com’s Yetman, who’s been putting all of this Big Data technology to use at his company. The goal of all these technologies is to eliminate humans from the decision process, and to use machine learning to figure out when business events—like fraud, market opportunity, or individual customer sales incentives—are occurring, and to react to them instantly. Despite the slower nature of Hadoop processing at present, the future is in real-time, computer-assisted decision-making.
Yetman’s team is already in the future, thanks to its Big Data experience. “We’ve used machine-learning algorithms. Some of our records are hand-written, but others are typeset. City directories, which are older, will show the name of the person and the occupation, before there were phone numbers,” he said.
“We’re using natural-language processing to turn around and pull out names, occupations and addresses. Obituaries are even harder. Can you identify the person who died, but also the spouse, the children, the surviving relatives? It’s where they were born, where they died, where they lived. A lot of machine learning is done to evaluate that. Then we host the algorithms as a service and call the service with the new content to actually do the work. The whole idea is how can we take this content all the way to the site without a human being being involved so we can get it indexed on the site in a totally automated way.”
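The city-directory extraction Yetman describes can be pictured with a toy example. Ancestry.com uses trained machine-learning and natural-language models for this work; the Python sketch below is only a rule-based stand-in with invented entries, meant to show the shape of the extraction task (pull structured fields out of semi-structured historical text):

```python
import re

# Hypothetical city-directory lines of the kind Yetman describes:
# surname, given name, occupation, address -- from before phone numbers.
entries = [
    "Smith, John, carpenter, 14 Elm St.",
    "Jones, Mary, dressmaker, 3 Oak Ave.",
]

PATTERN = re.compile(r'^(?P<surname>[^,]+),\s*(?P<given>[^,]+),\s*'
                     r'(?P<occupation>[^,]+),\s*(?P<address>.+)$')

def extract(line):
    """Pull name, occupation and address out of one directory entry."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None

records = [extract(e) for e in entries]
print(records[0]['occupation'])  # carpenter
```

Obituaries, as Yetman notes, are far harder: the fields of interest (the deceased, spouse, children, places of birth and death) do not sit in a predictable order, which is why statistical models rather than fixed patterns are needed there.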
Is Big Data just a Big Bubble?
Looking around at the conferences, webinars, ads and venture capital investments, it certainly sounds and looks like a bubble. But to be a bubble, the fundamental change at hand must be an ephemeral and not-quite-ready-for-prime-time sort of change. When speaking to Big Data experts on the potential of Hadoop and other analytics platforms, it’s clear that this isn’t some shallow coat of paint on existing tools. The business value of Big Data is very real, and wide-reaching.
Ryan Betts, field CTO at VoltDB, said, “The nature of a bubble is you never really know when you’re in it. That being said, I think Big Data is very real. I think it’s more than a buzzword. We already see the impact in our day-to-day lives. It impacts us every time we use the Web in the way ads are targeted at us. I truly believe this is going to be transformative to retail. Everywhere you carry a phone, there is an impact in Big Data. You’re taking with you a sensor that identifies you. I think it’s really real in the same way the Internet seemed like a buzzword, and turned out not to be. I think Big Data is really going to be about data and mobile, and tools like HDFS to extract data from those feeds.”
Storm on the horizon
Twitter’s Storm Project is an example of what should be expected from the open-source community in a post-Hadoop world. While Hadoop is bound to disk and Map/Reduce jobs only, Storm is a massive harness and queue for processing streams of information.
Storm is still in its early stages, but many startups have already rolled it out into production. At its heart, Storm is about taking streams of data and performing computations on them, in a fault-tolerant, highly scalable manner.
One early user of Storm is Groupon, which uses the platform to normalize address information for its customers. As properly formatted, interpretable business location information can be tricky to collect and keep up to date, Groupon passes all incoming information through a Storm system that performs more than 40 computations on the information.
At the end of the queue, an address has been checked for spelling, age, veracity, proper punctuation, duplication, and dozens of other possible errors that could be found. Storm, while more focused on real-time processing of streams, is still a Big Data platform to keep an eye on.
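The shape of such a pipeline is easy to sketch: each incoming tuple flows through a chain of small computations (Storm calls these “bolts”). The Python sketch below is a single-process stand-in with three invented normalization steps; a real Storm topology distributes these computations across a cluster with fault tolerance and back-pressure handling:

```python
def trim(addr):
    """Strip stray whitespace from an incoming address."""
    return addr.strip()

def fix_case(addr):
    """Normalize capitalization."""
    return addr.title()

def expand_abbreviations(addr):
    """One invented normalization rule, for illustration only."""
    return addr.replace("St.", "Street")

# The "topology": a chain of small computations each tuple flows through.
PIPELINE = [trim, fix_case, expand_abbreviations]

def process(stream):
    for address in stream:
        for step in PIPELINE:
            address = step(address)
        yield address

incoming = ["  14 elm st. ", "3 OAK AVE"]
print(list(process(incoming)))  # ['14 Elm Street', '3 Oak Ave']
```

Scaling this pattern to Groupon’s 40-plus checks is a matter of adding more steps to the chain; Storm’s job is to run each step on many machines at once and replay any tuple a failed machine drops.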
A new Spark for Big Data
While Hadoop has undergone more than three years of rewriting in order to reach version 2.0, many of the features being brought into the project with that release are already available in Spark, an open-source project out of UC Berkeley.
Spark, at its core, is a near-real-time take on Hadoop. It’s a platform for processing large amounts of data, but there are some fundamental differences from Hadoop. For starters, Spark is memory-based: the data it spreads across its cluster resides in RAM. While HDFS is still the default file system, in-memory storage is inherently faster than the disk-based storage used in Hadoop.
That means batch jobs can be run at a significantly faster rate than those that run on Hadoop, simply by virtue of the data being more readily accessible through a faster medium.
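The advantage is easy to picture: an iterative workload that rereads its input from disk for every job pays the I/O cost each time, while one that caches the dataset in memory pays it once. The Python sketch below illustrates only that access pattern; it is not the Spark API, and the dataset and jobs are invented:

```python
# Simulating the difference between rereading input per job (Hadoop-style)
# and caching the dataset in RAM across jobs (Spark-style).

def load_from_disk():
    """Stand-in for an expensive read of a large dataset from storage."""
    return list(range(1_000_000))

def jobs_without_cache(n_jobs):
    """Hadoop-style: every job rereads the input from storage."""
    results = []
    for _ in range(n_jobs):
        data = load_from_disk()        # repeated I/O cost
        results.append(sum(data))
    return results

def jobs_with_cache(n_jobs):
    """Spark-style: load once, keep the dataset in RAM, reuse it."""
    data = load_from_disk()            # paid once
    return [sum(data) for _ in range(n_jobs)]

# Both produce identical answers; only the I/O pattern differs.
assert jobs_without_cache(3) == jobs_with_cache(3)
```

For iterative algorithms such as machine learning, which pass over the same dataset dozens of times, this single design choice accounts for much of Spark’s speed advantage.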
As a result, Spark can also do stream processing, which is Storm’s forte.
Finally, Spark is designed to save developers time by using a concise API and focusing on using Scala as a language for writing jobs. Spark is still in its early stages, but it’s already offering many features that are only available elsewhere in Hadoop 2.0.
The big guns of Big Data
IBM, Microsoft and Oracle are the world’s largest database companies. And while they all offer plenty of data-analysis tools and layers for IBM’s DB2, Microsoft’s SQL Server and Oracle’s 11g, all three companies have also embraced Hadoop as a new solution.
Yet, despite having adopted Hadoop into their product lines, all three companies have completely different takes on how software should be consumed by enterprises.
Microsoft, for example, pushes its HDInsight service from Windows Azure. Within Azure, developers can spin up and manage a Hadoop cluster, and run batch jobs across data stored there. HDInsight is also available for Windows Server for organizations looking to work with an on-premises solution.
Oracle, on the other hand, sells Hadoop as an integrator adapter for its existing hardware and software solutions. Oracle also entered an agreement with Cloudera in 2012 to provide the Cloudera Enterprise Hadoop distribution to its customers.
Finally, IBM has had, likely, the oddest engagement with Hadoop. Rather than simply selling IBM PureData System for Hadoop and other connectors for its databases, the company actually took its Hadoop knowledge onto Jeopardy! IBM’s Watson machine used Hadoop to help it understand questions, and answer them in the form of a question.