Hadoop is now a general-purpose platform

Published: January 30th, 2014

Apache Hadoop adoption is accelerating among enterprises and advanced computing environments as the project, related projects, and ecosystem continue to expand. While there were valid reasons to avoid the 1.x versions, skeptics are reconsidering since Hadoop 2 (particularly the latest 2.2.0 version) provides a viable choice for a wider range of users and uses.

“The Hadoop 1.x generation was not easy to deploy or easy to manage,” said Juergen Urbanski, former chief technologist of T-Systems, the IT consulting division of Deutsche Telecom. “The many moving parts that make up a Hadoop cluster were difficult for users to configure. Fortunately, Hadoop 2 fills in many of the gaps. Manageability is a key expectation, particularly for the more critical business use cases.”

Hadoop 2.2.0 adds the YARN resource-management framework to the core set of Hadoop modules, which include the Hadoop Common set of utilities, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce for parallel processing. Other improvements include enhancements to HDFS, binary compatibility for Map/Reduce applications built on Hadoop 1.x, and support for running Hadoop on Windows.

Meanwhile, Hadoop-related projects and commercial products are proliferating along with the ecosystem. Collectively, the new Hadoop capabilities provide a more palatable and workable solution, not only for enterprise developers, business analysts and IT, but also a larger community of data scientists.

“There are many technologies that are helping Hadoop realize its potential as being a more general-purpose platform for computing,” said Doug Cutting, co-creator of Hadoop. “We started out as a batch processing system. People used it to do computations on large data sets that they couldn’t do before, and they could do it affordably. Now there’s an ever-increasing amount of data processing that organizations can do using this one platform.”

YARN expands the possibilities
The limitations of Map/Reduce were the genesis of Apache Hadoop NextGen MapReduce (a.k.a. YARN), according to Arun Murthy, release manager for Hadoop 2.

“It was apparent as early as 2008 that Map/Reduce was going to become a limiting factor because it’s just one algorithm,” he said. “If you’re trying to do things like machine learning and modeling, Map/Reduce is not the right algorithm to do it.”

Rather than replacing Map/Reduce altogether, it was supplemented with YARN to provide things like resource management and fault tolerance as base primitives in the platform, while allowing end users to do different things as they process and track the data in different ways.

“The architecture had to be more general-purpose than Map/Reduce,” said Murthy. “We kept the good parts of Map/Reduce, such as scale and simple APIs, but we had to allow other things to coexist on the same platform.”

The original Hadoop MapReduce was based on the Google Map/Reduce paper, while Hadoop HDFS was based on the Google File System paper. HDFS provides a mechanism to store huge amounts of heterogeneous data cheaply; Map/Reduce enables highly efficient parallel processing.

“Map/Reduce is a mature concept that comes from LISP and functional programming,” said Murthy. “Google scaled Map/Reduce out in a massive way while keeping a real simple interface for the end user so the end user does not have to deal with the nitty-gritty details of scheduling, resource management, fault tolerance, network partitions, and other crazy stuff. It allowed the end user to just deal with the business logic.”

Because YARN is an open framework, users are free to use algorithms other than Map/Reduce. In addition, applications can run on and integrate with it.

“The scientific and security computing communities depend on Open MPI technologies, which weren’t even an option in Hadoop 1,” said Edmon Begoli, CTO of analytics consulting firm PYA Analytics. “The architecture of Hadoop 2 and YARN allows you to plug in your own resource manager and your own parallel processing algorithms. People in the high-performance computing community have been talking about YARN enthusiastically for years.”

#!HDFS: Aspirin for other headaches
Some CIOs have been reluctant to bring Hadoop into the enterprise because there have been too many barriers to entry, although Hadoop 2 improvements are turning the tide.

“I think two of the deal breakers were NameNode federation and the Quorum Journal Manager, which is basically a failover for the HDFS NameNode,” said Jonathan Ellis, project chair for Apache Cassandra. “Historically, if your NameNode went down, you were basically screwed because you’d lose some amount of data.”

Hadoop 2 introduces the Quorum Journal Manager, where changes to the NameNodes are recorded to replicated machines to avoid data loss, he said. NameNode federation allows a pool of NameNodes to share responsibility for an HDFS cluster.

“NameNode federation is a bit of a hack because each NameNode still only knows about the file set it owns, so at the client level you have to somehow teach the client to look for some files on one NameNode and other files on another NameNode,” said Ellis.

HDFS is nevertheless an economically feasible way to store terabytes or even petabytes of data. Facebook has a single cluster that stores more than 100PB on Hadoop, according to Murthy.

“It’s amazing how much data you can store on Hadoop,” he said. “But you have to interact with the data, interrogate it, and come up with insights. That’s where YARN comes in. Now you have a general-purpose data operating system, and on top of it you can run applications like Apache Storm.”

John Haddad, senior director of product marketing at Informatica, said the Hadoop 2 improvements allow his organization to run more types of applications and workloads.

“Various teams can run a variety of different applications on the cluster concurrently,” he said. “Hadoop 1 lacked some of the security, high availability and flexibility necessary to have different applications, different types of workloads, and more than one organization or team submitting jobs to the cluster.”

#!Gearing up for prime time
The number and types of Hadoop open-source projects and commercial offerings are expanding rapidly. Hadoop-related projects include HBase, a highly scalable distributed database; the Hive data warehouse infrastructure; the Pig language and framework for parallel computing; and Ambari, which provisions, manages and monitors Apache Hadoop clusters.

“It seems like we’ve got 20 or 30 new projects every week,” said Cutting. “We have all these separate, independent projects that work together, so they’re interdependent but under separate control so the ecosystem can evolve.”

Meanwhile, solution providers are building products for or integrating their products with Hadoop. Collectively, Hadoop improvements, open-source projects and compatible commercial products are allowing organizations to tailor it to their needs, rather than having to shoehorn what they are doing into a limited set of capabilities. And the results are impressive.

For example, Oak Ridge National Laboratory used Hadoop to help the Center for Medicare and Medicaid Services identify tens of millions of dollars in overpayments and fraudulent transactions in just three weeks.

“Using only two or three engineers, we were able to approach and understand the data from different angles using Hive on Hadoop because it allowed us to write SQL-like queries and use a machine-learning library or run straight Map/Reduce queries,” said PYA Analytics’ Begoli. “In the traditional warehousing world, the same project would have taken months unless you had a very expensive data warehouse platform and very expensive technology consulting resources to help you.”

The groundswell of innovation is enabling Hadoop to move beyond its batch-processing roots to include real-time and near-real-time analytics.

#!Skeptics are doing a double take
Hadoop 2 is converting more skeptics than Hadoop 1 because it’s more mature, it’s easier (but not necessarily easy) to implement, it has more options, and its community is robust.

“You can bring Hadoop into your organization and not worry about vendor lock-in or what happens if the provider disappears,” said Murthy. “We have contributions from about 2,000 people at this point.”

There are also significant competitive pressures at work. Organizations that have adopted Hadoop are improving the effectiveness of things like fraud detection, portfolio management, ad targeting, search, and customer behavior by combining structured and unstructured data from internal and external sources that commonly include social networks, mobile devices and sensors.

“We’re seeing organizations start off with basic things like data warehouse optimization, and then move on to other cool and interesting things that can drive more revenue from the company,” said Informatica’s Haddad.

For example, Yahoo has been deploying YARN in production for a year, and the throughput of the YARN clusters has more than doubled. According to Murthy, Yahoo’s 35,000-node cluster now processes 130 to 150 jobs per day versus 50 to 60 before YARN.

“When you’ve got 2x over 35,000 to 40,000 nodes, that’s phenomenal,” he said. “It’s a pretty compelling story to tell a CIO that if you just upgrade your software from Hadoop 1 to Hadoop 2, you’ll see 2x throughput improvements in your jobs.”

Of course, Hadoop 2.2.0 isn’t perfect. Nothing is. And some question what Hadoop will become as it continues to evolve.

Hadoop co-creator Cutting said the beauty of Hadoop as an open-source project is that new things can replace old things naturally. That prospect somewhat concerns PYA Analytics’ Begoli, however.

“I’m concerned about the explosion of frameworks because it happened with Java and it’s happening with JavaScript,” he said. “When everyone is contributing something, it can be too much or the original vision can be diluted. On the other hand, a lot of brilliant teams are contributing to Hadoop. There are management tools, SQL tools, third-party tools and a lot of other things that are being incubated to deliver advanced capabilities.”

While Hadoop’s full impact has yet to be realized, Hadoop 2 is a major step forward.

#!Well-known Hadoop implementations

Amazon Web Services: Amazon Elastic MapReduce uses Hadoop in order to provide a quick, easy and cost-effective way to distribute and process large amounts of data across a resizable cluster of Amazon EC2 instances. It can be used to analyze click-stream data, process vast amounts of genomic data and other large scientific data sets, and process logs generated by Web and mobile applications.

Apache Hadoop: Apache Hadoop is an open-source

Article Tags

Apache, AWS, Big Data, Cloudera, Hadoop, HDFS, Hortonworks, IBM, Intel, LexisNexis, Map/Reduce, MapR, Microsoft, Pivotal, WANdisco, YARN

About Lisa Morgan

Lisa Morgan is a contributing editor to SD Times.

View all posts by Lisa Morgan

Cookie	Duration	Description
cf_use_ob	past	Cloudflare sets this cookie to improve page load times and to disallow any security restrictions based on the visitor's IP address.
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
CookieLawInfoConsent	1 year	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
JSESSIONID	session	The JSESSIONID cookie is used by New Relic to store a session identifier so that New Relic can monitor session counts for an application.
PHPSESSID	session	This cookie is native to PHP applications. The cookie is used to store and identify a users' unique session ID for the purpose of managing user session on the website. The cookie is a session cookies and is deleted when all the browser windows are closed.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__atuvc	1 year 1 month	AddThis sets this cookie to ensure that the updated count is seen when one shares a page and returns to it, before the share count cache is updated.
__atuvs	30 minutes	AddThis sets this cookie to ensure that the updated count is seen when one shares a page and returns to it, before the share count cache is updated.
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.

Cookie	Duration	Description
__gads	1 year 24 days	The __gads cookie, set by Google, is stored under DoubleClick domain and tracks the number of times users see an advert, measures the success of the campaign and calculates its revenue. This cookie can only be read from the domain they are set on and will not track any data while browsing through other sites.
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_S6PB8V57DG	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_846073_1	1 minute	Set by Google to distinguish users.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
_jsuid	1 year	This cookie contains random number which is generated when a visitor visits the website for the first time. This cookie is used to identify the new visitors to the website.
at-rand	never	AddThis sets this cookie to track page visits, sources of traffic and share counts.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
iutk	5 months 27 days	This cookie is used by Issuu analytic system to gather information regarding visitor activity on Issuu products.
uvc	1 year 1 month	Set by addthis.com to determine the usage of addthis.com service.
vuid	2 years	Vimeo installs this cookie to collect tracking information by setting a unique ID to embed videos to the website.
WMF-Last-Access	1 month 14 hours 26 minutes	This cookie is used to calculate unique devices accessing the website.

Cookie	Duration	Description
__Host-GAPS	2 years	This cookie allows the website to identify a user and provide enhanced functionality and personalisation.
_pxhd	session	Used by Zoominfo to enhance customer data.
IDE	1 year 24 days	Google DoubleClick IDE cookies are used to store information about how the user uses the website to present them with relevant ads and according to the user profile.
loc	1 year 1 month	AddThis sets this geolocation cookie to help understand the location of users who share the information.
mc	1 year 1 month	Quantserve sets the mc cookie to anonymously track user behaviour on the website.
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
__gpi	1 year 24 days	No description
__Secure-YEC	1 year 1 month	No description
_heatmaps_g2g_100754890	10 minutes	No description
_techvalidate_session	session	No description
cf_7166_id	20 years	No description
cf_7166_person_last_update	session	No description
f5avraaaaaaaaaaaaaaaa_session_	session	No description available.
GoogleAdServingTest	session	No description
Gyazo_cfwoker	7 years 2 months 17 days 7 hours	No description
incap_ses_451_2783402	session	No description
incap_ses_769_2783402	session	No description
loglevel	never	No description available.
m	2 years	No description available.
nlbi_2783402	session	No description
prism_252377639	1 month	No description
TS011605d9	session	No description
ustream-guest	session	No description available.
visid_incap_2783402	1 year	No description
xtc	1 year 1 month	No description

AI

AI and Software Development

Observability

Guide to Observability

CI/CD

A guide to CI/CD

Cloud Native

Cloud Native Content

Data

A Guide to Data

Test

Security Testing

Mobile

Mobile Testing

API

Sponsored by Parasoft

Performance

Load & Performance Testing

DevSecOps

A Guide to DevSecOps

Enterprise Security

A Guide to Security

Supply Chain Security

Supply Chain Security

Dev Manager

Dev Managers Content

Agile

A Guide To Agile

Value Stream

A Guide To Value Stream

Productivity

A Guide To Productivity

DevOps

DevOps Content

API

Gravitee.io

AI

AI and Software Development

Value Stream Management

A Guide To Value Stream

Hadoop is now a general-purpose platform

Article Tags

Subscribe to SDTimes

About Lisa Morgan

Related Articles

This week in AI dev tools: Gemini 2.5 Pro and Flash GA, GitHub Copilot Spaces, and more (June 20, 2025)

IBM launches new integration to help unify AI security and governance

AI updates from the past week: OpenAI Codex adds internet access, Mistral releases coding assistant, and more — June 6, 2025

Amazon launches serverless distributed SQL database, Aurora DSQL