Getting a handle on Hadoop

Published: May 28th, 2014

- Alex Handy

You can think of it as an ever-inflating pink elephant. It’s either got its own space in which to grow, or it’ll just end up sucking all the air out of the room. It’s always easier to talk about the elephant in the zoo than the elephant in the room, and Hadoop is definitely a zoo-full of complex moving parts that can cause just as much damage as an enraged bull elephant, provided we drag this metaphor into the realm of data.

All that data is why you have a Hadoop cluster, after all. Even if you haven’t integrated the system into your day-to-day activities, Hadoop is nothing if not a data lake. It’s a cheap place to put the data. And that, said Mike Tuchen, CEO of Talend, is the single most important thing to remember about Hadoop: its value.

Get someone else to pay for your cluster
This should be quite easy to do, said Tuchen. In fact, he advocated that you should run, not walk, to your CIO/CEO’s office with all of your Hadoop ideas. As an enterprise executive, there is one thing that will always make you look good and get you promoted: cost savings.

Because Hadoop can be an order of magnitude cheaper than systems from traditional ETL and data-warehousing vendors, bringing a cluster online and replacing existing systems can turn you into a rock star, said Tuchen.

“Why care about Hadoop? It’s dramatically cheaper,” he said. “You can take a subset of your data warehouse work and offload it for a dramatically cheaper price. A lot of customers are phrasing it as data process offload and data warehousing. And when you look at it with that lens… if you add up hardware plus software from EMC, NetApp, IBM and you compare it to Hadoop, you’re talking about something that was US$30,000 or $40,000 a terabyte, to $1,000 a terabyte.”
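To put Tuchen's figures in perspective, here is a rough back-of-the-envelope sketch. The per-terabyte prices are the ones he quotes; the 100-terabyte volume is a hypothetical chosen only for illustration, and the flat per-terabyte pricing is a simplifying assumption.

```python
def storage_cost(terabytes: float, cost_per_tb: float) -> float:
    """Total cost in US dollars for a volume at a flat per-terabyte price."""
    return terabytes * cost_per_tb


# Per-terabyte figures quoted by Tuchen.
TRADITIONAL_LOW = 30_000
TRADITIONAL_HIGH = 40_000
HADOOP = 1_000

# Hypothetical data volume, for illustration only.
volume_tb = 100

hadoop_cost = storage_cost(volume_tb, HADOOP)
low_cost = storage_cost(volume_tb, TRADITIONAL_LOW)
high_cost = storage_cost(volume_tb, TRADITIONAL_HIGH)

savings_low = low_cost - hadoop_cost
savings_high = high_cost - hadoop_cost
ratio_low = TRADITIONAL_LOW / HADOOP
ratio_high = TRADITIONAL_HIGH / HADOOP

print(f"Hadoop: ${hadoop_cost:,.0f}")
print(f"Traditional: ${low_cost:,.0f} to ${high_cost:,.0f}")
print(f"Savings: ${savings_low:,.0f} to ${savings_high:,.0f} "
      f"({ratio_low:.0f}x to {ratio_high:.0f}x cheaper)")
```

At those prices the gap is a factor of 30 to 40, which is where the "order of magnitude" claim comes from.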

Saving that kind of money for your company could just get you that VP position you wanted. But don’t expect this to be an overnight change. Hadoop is still a difficult system to own and operate, and it’s particularly difficult to hire for. That’s why Tip No. 2 is so important.

Train, don’t hire
If you can hire Hadoop developers and administrators, get out there and do it. If you think you know a team you can bring in-house, or if you happen to have an internal expert, put them into your Hadoop project exclusively.

Why? Because it is quite difficult to find Hadoop people. Popular job site Indeed.com shows that Hadoop has grown from a non-existent job market in 2009 to encompass 0.2% of all jobs on the site. The term has grown 225,000% since 2009. By contrast, the term “Java” is included in around 2% of all jobs posted on the site.

With that many other jobs out there, and such a shallow pool of Hadoop developers and administrators in the market to begin with, it’s going to be difficult for you and your team to find the Hadoop knowledge base you need.

So don’t expect to hire into Hadoop. You may find one or two Hadoop-experienced persons, but your team as a whole is going to have to learn this stuff the hard way: by hand, with manuals and online documentation. Just be sure you keep them happy, or else they’ll find Tip No. 3 useful.

Once you know Hadoop, you are more valuable
If you do train up your in-house employees on Hadoop, understand that they will then be worth at least six figures on the open market, especially if they are willing to relocate. If you’re training up your staff to run Hadoop, and you’ve just saved your department millions of dollars on EMC, NetApp or Teradata licenses, use some of that money to compensate your new Hadoop workers.

This is not an issue that you can ignore until later, either. With the job market for Hadoop expertise on fire, even Wall Street firms are hiring developers with as little as six months of Hadoop experience. Unless you want to waste your training budget on people who just walk out the door, you’ll likely need to “enhance” some salaries around this new cluster project.

And speaking of clusters, Tip No. 4 has a lot to do with what you put on those clusters.

Tune and monitor the cluster
A single bad stick of RAM in one machine can make an entire cluster sluggish. When you’re building your applications and your Hadoop cluster itself, you’ll want to be sure you’re able to monitor your jobs all the way through the process. Chris Wensel, CTO and founder of Concurrent, said that you and your team have some important decisions to make as you’re designing your processes and your cluster.

Wensel said that, overall, “reducing latency is your ultimate goal, but also reducing the likelihood of failure. The way these technologies were built, they weren’t intended for operational systems.” As such, it is only recently that Hadoop and its many sub-projects have even added high availability support for the underlying file system.

That means Hadoop can still be somewhat brittle. Wensel said teams must first “decide if your application is something with an SLA. Is it something that has to complete in two hours every day, or every 10 minutes? Is it something you don’t want to think about at 10 p.m. when the pager goes off? If it’s an application that’s driving revenue, you need to really think about that. If you decide it has an SLA, you need to adopt some structural integrity in the application itself.”

Zoltan Prekopcsak, cofounder and CEO of Radoop, said that being able to monitor an application end to end through the Hadoop cluster is key to not jamming up the pipeline.

“It’s hard to find the problems,” he said. “If you’re a larger cluster, then it’s very common that one of the nodes has some issues or some problems. When you are submitting jobs to this cluster, this node can slow down the whole operation. It makes a lot of sense to periodically check your cluster and check your computers. There are great monitoring tools for that. Be on the safe side and make sure the hardware components are in good shape, because otherwise you’ll run into strange issues.”
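The kind of periodic check Prekopcsak describes might look something like the sketch below. It is not taken from any particular monitoring tool: the node names, the task timings and the 2x threshold are all hypothetical, and a real cluster would pull these numbers from its own job history or metrics system.

```python
from statistics import median


def find_slow_nodes(task_seconds_by_node, threshold=2.0):
    """Return nodes whose median task time exceeds `threshold` times
    the median of all per-node medians."""
    node_medians = {
        node: median(times)
        for node, times in task_seconds_by_node.items()
        if times
    }
    cluster_median = median(node_medians.values())
    return sorted(
        node for node, m in node_medians.items()
        if m > threshold * cluster_median
    )


# Hypothetical task durations in seconds, grouped by the node that ran them.
sample = {
    "node-01": [41, 39, 44],
    "node-02": [40, 42, 38],
    "node-03": [131, 127, 140],  # the kind of node a bad stick of RAM produces
    "node-04": [43, 40, 41],
}

slow = find_slow_nodes(sample)
print(slow)
```

Comparing medians rather than averages keeps one unusually long task from flagging an otherwise healthy node.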

Replace existing infrastructure
Once your team is up and running with Hadoop, and able to reliably use the cluster for storage and analysis, it’s time to go hunting. Hunt down the most expensive batch-operation or storage systems in your company and figure out what it would take to bring them into Hadoop.

You should be looking at your ETL systems, your data-warehousing systems, and even your live backup systems. Talend’s Tuchen said that these systems can all be replaced by a well-run Hadoop cluster, or even two or three clusters.

“What all of the Hadoop distributions will recommend you do is do the ETL scenario,” he said. “You load your data into HDFS [Hadoop Distributed File System] and transform in place. Now you’ve got all your data in the native format sitting in HDFS, so you want to think about that to solve your archive use case. What it turns out is that when you do the math economically, it costs the same as or less than archiving with something like Iron Mountain.”

While Hadoop isn’t completely ready for full-time backup duty, it’s getting there thanks to a new project called Falcon. Tuchen said Falcon “does the data capture replication to give you disaster recovery. If you have two clusters, you put it in one and replicate to the distant one for disaster recovery.”

But moving backup duties to Hadoop isn’t just a flashy way to save some money; it’s also a game changer for your archivist. “The neat thing about that is if you choose to use Hadoop for archiving, now suddenly what you’ve done is you’ve paid for your entire Hadoop installation because of this other use case,” said Tuchen. “But now you’ve got all your data on spinning disks. It’s all analyzable and all the analysis you do with Hadoop now comes for free. That’s pretty cool. Now instead of having a couple quarters’ or a year’s worth of data on spinning disk, and everything else inaccessible on tape, now you have all your data back for years, for the same cost.”

Monitor your cluster
Hadoop brings your team back to the bad old days of the mainframe. You’re all building huge, important projects, but there’s only one central place to run those applications, and there aren’t enough hours in the day for everyone to diddle around on the cluster to see what they’ve done correctly.

For this reason, said Walter Maguire, chief field technologist at HP Vertica, being agile in the traditional sense might not work for you in a Hadoop workflow.

“The notion of changing quickly, trying something new, learning from it, and the idea that I can very quickly adapt my infrastructure, is appealing,” he said. “We see a lot of customers today seeing that approach to Big Data. Obviously a free-for-all won’t work, otherwise you’re subject to whichever process is consuming the most resources.”

To that end, your job scheduling and access management should be a big part of your cluster design decisions. As your cluster grows and adopts new technologies like Spark, YARN and Tez, how will you control jobs that run simultaneously, or at short intervals?

Each distribution has its own way to manage this layer of the cluster, and thus this should be an important part of your cluster distribution selection process.

Don’t optimize the application; optimize the data
Radoop’s Prekopcsak said that most developers will instinctively want to tweak their applications in order to get better performance out of them once they hit the Hadoop cluster. He said this instinct is incorrect.

“When people start writing their own first programs, then sooner or later they want to optimize it or try to improve it,” said Prekopcsak. “But in many cases optimizations do not necessarily come from rewriting the code itself, but mostly from refactoring the data upon which the job is run.

“Using compression and partitioning can really speed up operations. Sometimes that’s up to a 100x speed-up. If the data is organized the right way, all your processing will be more effective. At first, it makes more sense to get the data into good shape, rather than optimizing your code. Transforming the data will give you much more performance gains.”
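A toy illustration of why partitioning helps is sketched below in plain Python rather than in any Hadoop API. The records, the `day` key and the helper names are all hypothetical; the point is only that a query against partitioned data touches far fewer rows than a full scan. Compression is not covered here.

```python
from collections import defaultdict


def partition_by(records, key):
    """Group records into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return partitions


def total_full_scan(records, day):
    """Sum amounts for one day by reading every record.
    Returns (total, rows_scanned)."""
    scanned = 0
    total = 0
    for record in records:
        scanned += 1
        if record["day"] == day:
            total += record["amount"]
    return total, scanned


def total_partitioned(partitions, day):
    """Sum amounts for one day by reading only that day's partition.
    Returns (total, rows_scanned)."""
    rows = partitions.get(day, [])
    return sum(r["amount"] for r in rows), len(rows)


# Hypothetical data set: 30 days with 10 records per day.
records = [
    {"day": f"2014-05-{d:02d}", "amount": d}
    for d in range(1, 31)
    for _ in range(10)
]
partitions = partition_by(records, "day")

full = total_full_scan(records, "2014-05-28")
pruned = total_partitioned(partitions, "2014-05-28")
print(full, pruned)
```

Both queries return the same total, but the partitioned version reads 10 rows instead of 300. The application code barely changed; the layout of the data did.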

Be ready for change
This is a two-sided tip, as it pertains to your cluster, and to Hadoop as a whole. On the micro scale, be sure you keep in mind the fact that your application is going to change once it hits the live data. Said Concurrent’s Wensel: “The other side of the problem is that as you’re developing an application, as you get larger and larger data sets, your application changes. It is a challenge to build an application and have it grow to larger data sets. Be very conscious of the fact that things are changing.”

But on the macro scale, also be aware that the Hadoop ecosystem as a whole is changing almost every day. While Hadoop 2.0 brought in many structural changes and new features to the platform, there are still new aspects of the project that are only now beginning to wind their way into enterprise workloads.

Michael Masterson, director of strategic business development at Compuware, said, “There’s a whole new class of applications being targeted and built for this cluster. It’s moving from a fairly well-defined pattern around Map/Reduce to much more dynamic type use cases. With rapid ingestion of data with Storm, more rapid querying with Spark with up to 100x improvement, and with things like Impala from Cloudera that are also creating more real-time application frameworks on top… These are probably just the beginning of what YARN opens up.

“We’re seeing the rapid evolution of what that Hadoop programming model looks like, and the rapid evolution of applications, each written on this cluster, each of which will be designed and built and driving a line-of-business need across the organization.”

Masterson continued: “We are starting to see customers realizing workloads are changing. There have been some best practices around. For example, if you build a cluster and don’t know what to do, there’s a sizing guide from HP. One of the rules is that the number of spindles on disk shouldn’t exceed the number of cores. That’s just a general suggestion, but I think it’s getting harder and harder to use a general practice like that. The types of workloads are changing,” he said, referencing solid-state drives as another change in the cluster paradigm.
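The rule of thumb Masterson cites is simple enough to express as a quick check. The sketch below assumes nothing beyond the rule as he states it; the node configurations are hypothetical, and as he notes, the rule itself is becoming less reliable as workloads and storage hardware change.

```python
def spindles_within_rule(spindles: int, cores: int) -> bool:
    """True if a node follows the rule of thumb that the number of
    disk spindles should not exceed the number of CPU cores."""
    return spindles <= cores


# Hypothetical node configurations: name -> (spindles, cores).
nodes = {
    "balanced": (12, 16),
    "disk-heavy": (24, 16),
}

report = {
    name: spindles_within_rule(spindles, cores)
    for name, (spindles, cores) in nodes.items()
}
print(report)
```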

To that end, be prepared to grow your cluster, build purpose-driven and specific Hadoop clusters, and above all, be sure to keep abreast of the work being done in projects relevant to your team and job.

Tez
Tez is a project aimed at making life easier for developers who have to use YARN. The project is led by Hortonworks and is currently in the Apache Incubator. From the project’s site:

The Apache Tez project is aimed at building an application framework which allows for a complex directed acyclic graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.

The support for third-party application masters is the crucial aspect to flexibility in YARN. It permits new job runtimes in addition to classical Map/Reduce, while still keeping M/R available and allowing both the old and new to coexist on a single cluster. Apache Tez is one such job runtime that provides richer capabilities than traditional map-reduce. The motivation is to provide a better runtime for scenarios such as relational querying that do not have a strong affinity for the Map/Reduce primitive. This need arises because the Map/Reduce primitive mandates a very particular shape to every job and although this mandatory shape is very general and can be used to implement essentially any batch-oriented data processing job, it conflates too many details and provides too little flexibility.
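To make the "directed acyclic graph of tasks" idea concrete, here is a small sketch using Python's standard-library `graphlib` module (Python 3.9 or later). It does not use the Tez API; the task names and the `execution_stages` helper are hypothetical. It only shows how a relational-style job can be described as a graph of dependent tasks and run in stages, rather than being forced into a fixed map-then-reduce shape.

```python
from graphlib import TopologicalSorter

# Hypothetical relational-style job: each task maps to the set of
# tasks that must finish before it can start.
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def execution_stages(graph):
    """Group tasks into stages. Every task in a stage has all of its
    dependencies satisfied by earlier stages, so the tasks within one
    stage could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(dag)
for number, tasks in enumerate(stages, start=1):
    print(number, tasks)
```

The two scans have no dependencies and land in the first stage together; the join, aggregate and sort steps then follow one per stage.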

Falcon
The Apache Falcon project is an effort to more explicitly and granularly control the data you’re storing in your cluster. As such, it allows you to choose how many times data is replicated, coordinate multiple clusters, and enforce policies across the various endpoints data might flow through. From the Apache Incubator page:

Apache Falcon is a feed-processing and feed-management system aimed at making it easier for end consumers to onboard their feed processing and feed management on Hadoop clusters.

Data Management on Hadoop encompasses data motion, process orchestration, life-cycle management, data discovery, etc., among other concerns. Falcon is a new data-processing and management platform for Hadoop that solves this problem and creates additional opportunities by building on existing components within the Hadoop ecosystem without reinventing the wheel.

Article Tags

Big Data, Hadoop

About Alex Handy

Alex Handy is the Senior Editor of Software Development Times.
