Apache Hadoop adoption is accelerating among enterprises and advanced computing environments as the project, related projects, and ecosystem continue to expand. While there were valid reasons to avoid the 1.x versions, skeptics are reconsidering since Hadoop 2 (particularly the latest 2.2.0 version) provides a viable choice for a wider range of users and uses.
“The Hadoop 1.x generation was not easy to deploy or easy to manage,” said Juergen Urbanski, former chief technologist of T-Systems, the IT consulting division of Deutsche Telekom. “The many moving parts that make up a Hadoop cluster were difficult for users to configure. Fortunately, Hadoop 2 fills in many of the gaps. Manageability is a key expectation, particularly for the more critical business use cases.”
Hadoop 2.2.0 adds the YARN resource-management framework to the core set of Hadoop modules, which include the Hadoop Common set of utilities, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce for parallel processing. Other improvements include enhancements to HDFS, binary compatibility for MapReduce applications built on Hadoop 1.x, and support for running Hadoop on Windows.
Meanwhile, Hadoop-related projects and commercial products are proliferating along with the ecosystem. Collectively, the new Hadoop capabilities provide a more palatable and workable solution, not only for enterprise developers, business analysts and IT, but also a larger community of data scientists.
“There are many technologies that are helping Hadoop realize its potential as being a more general-purpose platform for computing,” said Doug Cutting, co-creator of Hadoop. “We started out as a batch processing system. People used it to do computations on large data sets that they couldn’t do before, and they could do it affordably. Now there’s an ever-increasing amount of data processing that organizations can do using this one platform.”
YARN expands the possibilities
The limitations of Map/Reduce were the genesis of Apache Hadoop NextGen MapReduce (a.k.a. YARN), according to Arun Murthy, release manager for Hadoop 2.
“It was apparent as early as 2008 that Map/Reduce was going to become a limiting factor because it’s just one algorithm,” he said. “If you’re trying to do things like machine learning and modeling, Map/Reduce is not the right algorithm to do it.”
Rather than replace Map/Reduce altogether, the Hadoop developers supplemented it with YARN, which provides resource management and fault tolerance as base primitives of the platform while allowing end users to process and track data in whatever ways suit them.
“The architecture had to be more general-purpose than Map/Reduce,” said Murthy. “We kept the good parts of Map/Reduce, such as scale and simple APIs, but we had to allow other things to coexist on the same platform.”
The original Hadoop MapReduce was based on the Google Map/Reduce paper, while Hadoop HDFS was based on the Google File System paper. HDFS provides a mechanism to store huge amounts of heterogeneous data cheaply; Map/Reduce enables highly efficient parallel processing.
“Map/Reduce is a mature concept that comes from LISP and functional programming,” said Murthy. “Google scaled Map/Reduce out in a massive way while keeping a real simple interface for the end user so the end user does not have to deal with the nitty-gritty details of scheduling, resource management, fault tolerance, network partitions, and other crazy stuff. It allowed the end user to just deal with the business logic.”
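The canonical illustration of that simple interface is word counting. The sketch below uses the Hadoop 2 MapReduce Java API: the map function emits a (word, 1) pair for every token, and the reduce function sums the counts per word. Class names and input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}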
Because YARN is an open framework, users are free to run processing engines and algorithms other than Map/Reduce. In addition, third-party applications can run on the platform and integrate with it.
“The scientific and security computing communities depend on Open MPI technologies, which weren’t even an option in Hadoop 1,” said Edmon Begoli, CTO of analytics consulting firm PYA Analytics. “The architecture of Hadoop 2 and YARN allows you to plug in your own resource manager and your own parallel processing algorithms. People in the high-performance computing community have been talking about YARN enthusiastically for years.”
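To give a flavor of what plugging into YARN involves, here is a minimal, hypothetical submission sketch using the YarnClient API from org.apache.hadoop.yarn.client.api. The application name, queue and resource sizes are placeholders, and a real application would also populate the ContainerLaunchContext that launches its ApplicationMaster.

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ResourceManager using the cluster configuration.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager for a new application ID.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("my-yarn-app"); // illustrative name
    appContext.setQueue("default");

    // Describe the container that will run the ApplicationMaster.
    // A real application would set the launch command, local resources
    // and environment here.
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    appContext.setAMContainerSpec(amContainer);

    // Resources requested for the ApplicationMaster (sizes illustrative).
    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(512); // MB
    amResource.setVirtualCores(1);
    appContext.setResource(amResource);

    // Submit; from here, the ApplicationMaster negotiates containers with
    // the ResourceManager and runs whatever processing model it likes.
    yarnClient.submitApplication(appContext);
    yarnClient.stop();
  }
}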
HDFS: Aspirin for other headaches
Some CIOs have been reluctant to bring Hadoop into the enterprise because there have been too many barriers to entry, although Hadoop 2 improvements are turning the tide.
“I think two of the deal breakers were NameNode federation and the Quorum Journal Manager, which is basically a failover for the HDFS NameNode,” said Jonathan Ellis, project chair for Apache Cassandra. “Historically, if your NameNode went down, you were basically screwed because you’d lose some amount of data.”
Hadoop 2 introduces the Quorum Journal Manager, which replicates the NameNode’s edit log across a quorum of journal machines so that a NameNode failure no longer means data loss, he said. NameNode federation allows a pool of NameNodes to share responsibility for an HDFS cluster.
“NameNode federation is a bit of a hack because each NameNode still only knows about the file set it owns, so at the client level you have to somehow teach the client to look for some files on one NameNode and other files on another NameNode,” said Ellis.
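For the curious, the hdfs-site.xml fragment below sketches what a minimal QJM-backed HA pair looks like; the nameservice ID, hostnames and ports are placeholders rather than a recommended production layout.

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<!-- NameNode edits are written to a quorum of JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<!-- Clients use this class to locate the active NameNode on failover -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>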
HDFS is nevertheless an economically feasible way to store terabytes or even petabytes of data. Facebook has a single cluster that stores more than 100PB on Hadoop, according to Murthy.
“It’s amazing how much data you can store on Hadoop,” he said. “But you have to interact with the data, interrogate it, and come up with insights. That’s where YARN comes in. Now you have a general-purpose data operating system, and on top of it you can run applications like Apache Storm.”
John Haddad, senior director of product marketing at Informatica, said the Hadoop 2 improvements allow his organization to run more types of applications and workloads.
“Various teams can run a variety of different applications on the cluster concurrently,” he said. “Hadoop 1 lacked some of the security, high availability and flexibility necessary to have different applications, different types of workloads, and more than one organization or team submitting jobs to the cluster.”
Gearing up for prime time
The number and types of Hadoop open-source projects and commercial offerings are expanding rapidly. Hadoop-related projects include HBase, a highly scalable distributed database; the Hive data warehouse infrastructure; the Pig language and framework for parallel computing; and Ambari, which provisions, manages and monitors Apache Hadoop clusters.
“It seems like we’ve got 20 or 30 new projects every week,” said Cutting. “We have all these separate, independent projects that work together, so they’re interdependent but under separate control so the ecosystem can evolve.”
Meanwhile, solution providers are building products for or integrating their products with Hadoop. Collectively, Hadoop improvements, open-source projects and compatible commercial products are allowing organizations to tailor it to their needs, rather than having to shoehorn what they are doing into a limited set of capabilities. And the results are impressive.
For example, Oak Ridge National Laboratory used Hadoop to help the Centers for Medicare and Medicaid Services identify tens of millions of dollars in overpayments and fraudulent transactions in just three weeks.
“Using only two or three engineers, we were able to approach and understand the data from different angles using Hive on Hadoop because it allowed us to write SQL-like queries and use a machine-learning library or run straight Map/Reduce queries,” said PYA Analytics’ Begoli. “In the traditional warehousing world, the same project would have taken months unless you had a very expensive data warehouse platform and very expensive technology consulting resources to help you.”
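The project’s actual queries are not public, but the flavor of Hive’s SQL-like interface is easy to convey. The HiveQL sketch below defines a hypothetical claims table over files already sitting in HDFS and flags provider/procedure pairs with unusually high average payments; the schema, path and thresholds are invented for illustration.

-- Hypothetical schema; the real CMS data model is not public.
CREATE EXTERNAL TABLE claims (
  claim_id      STRING,
  provider_id   STRING,
  procedure_cd  STRING,
  paid_amount   DOUBLE,
  claim_date    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/cms/claims';

-- Surface provider/procedure pairs whose average payment is far above the norm.
SELECT provider_id,
       procedure_cd,
       AVG(paid_amount) AS avg_paid,
       COUNT(*)         AS num_claims
FROM claims
GROUP BY provider_id, procedure_cd
HAVING AVG(paid_amount) > 10000
   AND COUNT(*) > 100;

Behind the scenes, Hive compiles such queries into Map/Reduce jobs (and, increasingly, other YARN-hosted engines), which is what makes this approach so cheap at warehouse scale.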
The groundswell of innovation is enabling Hadoop to move beyond its batch-processing roots to include real-time and near-real-time analytics.
Skeptics are doing a double take
Hadoop 2 is converting more skeptics than Hadoop 1 because it’s more mature, it’s easier (but not necessarily easy) to implement, it has more options, and its community is robust.
“You can bring Hadoop into your organization and not worry about vendor lock-in or what happens if the provider disappears,” said Murthy. “We have contributions from about 2,000 people at this point.”
There are also significant competitive pressures at work. Organizations that have adopted Hadoop are improving the effectiveness of fraud detection, portfolio management, ad targeting, search, and customer-behavior analysis by combining structured and unstructured data from internal and external sources that commonly include social networks, mobile devices and sensors.
“We’re seeing organizations start off with basic things like data warehouse optimization, and then move on to other cool and interesting things that can drive more revenue for the company,” said Informatica’s Haddad.
For example, Yahoo has been deploying YARN in production for a year, and the throughput of the YARN clusters has more than doubled. According to Murthy, Yahoo’s 35,000-node cluster now processes 130 to 150 jobs per day versus 50 to 60 before YARN.
“When you’ve got 2x over 35,000 to 40,000 nodes, that’s phenomenal,” he said. “It’s a pretty compelling story to tell a CIO that if you just upgrade your software from Hadoop 1 to Hadoop 2, you’ll see 2x throughput improvements in your jobs.”
Of course, Hadoop 2.2.0 isn’t perfect. Nothing is. And some question what Hadoop will become as it continues to evolve.
Hadoop co-creator Cutting said the beauty of Hadoop as an open-source project is that new things can replace old things naturally. That prospect somewhat concerns PYA Analytics’ Begoli, however.
“I’m concerned about the explosion of frameworks because it happened with Java and it’s happening with JavaScript,” he said. “When everyone is contributing something, it can be too much or the original vision can be diluted. On the other hand, a lot of brilliant teams are contributing to Hadoop. There are management tools, SQL tools, third-party tools and a lot of other things that are being incubated to deliver advanced capabilities.”
While Hadoop’s full impact has yet to be realized, Hadoop 2 is a major step forward.
Well-known Hadoop implementations
Amazon Web Services: Amazon Elastic MapReduce uses Hadoop to provide a quick, easy and cost-effective way to distribute and process large amounts of data across a resizable cluster of Amazon EC2 instances. It can be used to analyze click-stream data, process vast amounts of genomic data and other large scientific data sets, and process logs generated by Web and mobile applications.
Apache Hadoop: Apache Hadoop is an open-source software framework for the distributed storage and processing of very large data sets across clusters of commodity hardware.