You can think of Hadoop as an ever-inflating pink elephant. It’s either got its own space in which to grow, or it’ll just end up sucking all the air out of the room. It’s always easier to talk about the elephant in the zoo than the elephant in the room, and Hadoop is definitely a zoo-full of complex moving parts that can cause just as much damage as an enraged bull elephant, provided we drag this metaphor into the realm of data.
All that data is why you have a Hadoop cluster, after all. Even if you haven’t integrated the system into your day-to-day activities, Hadoop is nothing if not a data lake. It’s a cheap place to put the data. And that, said Mike Tuchen, CEO of Talend, is the single most important thing to remember about Hadoop: its value.
Get someone else to pay for your cluster
This should be quite easy to do, said Tuchen. In fact, he advocated that you should run, not walk, to your CIO/CEO’s office with all of your Hadoop ideas. As an enterprise executive, there is one thing that will always make you look good and get you promoted: cost savings.
Because Hadoop is an order of magnitude cheaper than systems from traditional ETL and data-warehousing vendors, bringing a cluster online and replacing existing systems can turn you into a rock star, said Tuchen.
“Why care about Hadoop? It’s dramatically cheaper,” he said. “You can take a subset of your data warehouse work and offload it for a dramatically cheaper price. A lot of customers are phrasing it as data process offload and data warehousing. And when you look at it with that lens… if you add up hardware plus software from EMC, NetApp, IBM and you compare it to Hadoop, you’re talking about something that was US$30,000 or $40,000 a terabyte, to $1,000 a terabyte.”
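To put those per-terabyte figures in perspective, the savings math is simple enough to sketch out. In the snippet below, the warehouse size and offload fraction are hypothetical placeholders; only the per-terabyte prices come from Tuchen’s numbers.

```python
# Back-of-the-envelope savings from offloading warehouse work to Hadoop.
# The per-terabyte prices come from Tuchen's quote above; the data volume
# and offload fraction are hypothetical -- substitute your own figures.

WAREHOUSE_TB = 200                 # hypothetical: total warehouse data, in TB
OFFLOAD_FRACTION = 0.5             # hypothetical: share of that data offloaded
TRADITIONAL_COST_PER_TB = 35_000   # midpoint of the $30,000-$40,000 range
HADOOP_COST_PER_TB = 1_000

offloaded_tb = WAREHOUSE_TB * OFFLOAD_FRACTION
savings = offloaded_tb * (TRADITIONAL_COST_PER_TB - HADOOP_COST_PER_TB)
print(f"Offloading {offloaded_tb:.0f} TB saves roughly ${savings:,.0f}")
# -> Offloading 100 TB saves roughly $3,400,000
```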
Saving that kind of money for your company could just get you that VP position you wanted. But don’t expect this to be an overnight change. Hadoop is still a difficult system to own and operate, and it’s particularly difficult to hire for. That’s why Tip No. 2 is so important.
Train, don’t hire
If you can hire Hadoop developers and administrators, get out there and do it. If you think you know a team you can bring in-house, or if you happen to have an internal expert, put them into your Hadoop project exclusively.
Why? Because it is quite difficult to find Hadoop people. Popular job site Indeed.com shows that Hadoop has grown from a non-existent job market in 2009 to encompass 0.2% of all jobs on the site. The term has grown 225,000% since 2009. By contrast, the term “Java” is included in around 2% of all jobs posted on the site.
With that many other jobs out there, and such a shallow pool of Hadoop developers and administrators in the market to begin with, it’s going to be difficult for you and your team to find the Hadoop knowledge base you need.
(Related: Hadoop is a general-purpose platform)
So don’t expect to hire into Hadoop. You may find one or two people with Hadoop experience, but your team as a whole is going to have to learn this stuff the hard way: by hand, with manuals and online documentation. Just be sure you keep them happy, or else they’ll find Tip No. 3 useful.
Once you know Hadoop, you are more valuable
If you do train up your in-house employees on Hadoop, understand that they will then be worth at least six figures on the open market, especially if they are willing to relocate. If you’re training up your staff to run Hadoop, and you’ve just saved your department millions of dollars on EMC, NetApp or Teradata licenses, use some of that money to compensate your new Hadoop workers.
This is not an issue that you can ignore until later, either. With the job market for Hadoop expertise on fire, even Wall Street firms are hiring developers with as little as six months of Hadoop experience. Unless you want to waste your training budget on people who just walk out the door, you’ll likely need to “enhance” some salaries around this new cluster project.
And speaking of clusters, Tip No. 4 has a lot to do with what you put on those clusters.
Tune and monitor the cluster
A single bad stick of RAM in one machine can make an entire cluster sluggish. When you’re building your applications and your Hadoop cluster itself, you’ll want to be sure you’re able to monitor your jobs all the way through the process. Chris Wensel, CTO and founder of Concurrent, said that you and your team have some important decisions to make as you’re designing your processes and your cluster.
Wensel said that, overall, “reducing latency is your ultimate goal, but also reducing the likelihood of failure. The way these technologies were built, they weren’t intended for operational systems.” As such, it is only recently that Hadoop and its many sub-projects have even added high availability support for the underlying file system.
That means Hadoop can still be somewhat brittle. Wensel said teams must first “decide if your application is something with an SLA. Is it something that has to complete in two hours every day, or every 10 minutes? Is it something you don’t want to think about at 10 p.m. when the pager goes off? If it’s an application that’s driving revenue, you need to really think about that. If you decide it has an SLA, you need to adopt some structural integrity in the application itself.”
Zoltan Prekopcsak, cofounder and CEO of Radoop, said that being able to monitor an application end to end through the Hadoop cluster is key to not jamming up the pipeline.
“It’s hard to find the problems,” he said. “If you have a larger cluster, then it’s very common that one of the nodes has some issues or some problems. When you are submitting jobs to this cluster, this node can slow down the whole operation. It makes a lot of sense to periodically check your cluster and check your computers. There are great monitoring tools for that. Be on the safe side and make sure the hardware components are in good shape, because otherwise you’ll run into strange issues.”
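One lightweight way to catch the kind of sick node Prekopcsak describes is to poll the metrics the NameNode already publishes over its JMX HTTP endpoint. The sketch below assumes a hypothetical NameNode host and port, and the bean and attribute names can differ between Hadoop versions, so treat it as a starting point rather than a finished health check.

```python
# Minimal sketch: poll the NameNode JMX endpoint and flag dead DataNodes.
# The host/port are hypothetical; bean and attribute names may vary by
# Hadoop version, so verify them against your own cluster's /jmx output.
import json
import urllib.request

NAMENODE_JMX = "http://namenode.example.com:50070/jmx"   # hypothetical host

def dead_datanode_count(jmx_url=NAMENODE_JMX):
    """Return the NumDeadDataNodes figure reported by the NameNode."""
    with urllib.request.urlopen(jmx_url) as resp:
        beans = json.load(resp)["beans"]
    for bean in beans:
        if bean.get("name") == "Hadoop:service=NameNode,name=FSNamesystemState":
            return bean.get("NumDeadDataNodes", 0)
    return 0

if __name__ == "__main__":
    dead = dead_datanode_count()
    if dead:
        print(f"WARNING: {dead} DataNode(s) reported dead -- check the hardware "
              "before submitting long-running jobs")
    else:
        print("All DataNodes reporting in")
```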
Replace existing infrastructure
Once your team is up and running with Hadoop, and able to reliably use the cluster for storage and analysis, it’s time to go hunting. Hunt down the most expensive batch-operation or storage systems in your company and figure out what it would take to bring them into Hadoop.
You should be looking at your ETL systems, your data-warehousing systems, and even your live backup systems. Talend’s Tuchen said that these systems can all be replaced by a well-run Hadoop cluster, or even two or three clusters.
“What all of the Hadoop distributions will recommend you do is the ETL scenario,” he said. “You load your data into HDFS [Hadoop Distributed File System] and transform in place. Now you’ve got all your data in the native format sitting in HDFS, so you want to think about that to solve your archive use case. It turns out that when you do the math economically, it costs the same as or less than archiving with something like Iron Mountain.”
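The load-then-transform pattern Tuchen describes can start as simply as landing raw extracts in HDFS with the stock command-line tools and leaving the transformation to whatever engine runs on the cluster. In the sketch below, the local and HDFS paths are hypothetical; only the `hdfs dfs` commands are standard Hadoop.

```python
# Sketch of "load into HDFS, transform in place": land raw extracts in HDFS
# with the stock `hdfs dfs` CLI so the same copy serves ETL and archive duty.
# The local and HDFS paths are hypothetical placeholders.
import subprocess
from datetime import date

LOCAL_EXTRACT = "/data/exports/orders.csv"                          # hypothetical
HDFS_RAW_DIR = f"/warehouse/raw/orders/dt={date.today():%Y-%m-%d}"  # hypothetical

def load_to_hdfs(local_path, hdfs_dir):
    """Create the target directory and copy the raw extract into HDFS."""
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)

if __name__ == "__main__":
    load_to_hdfs(LOCAL_EXTRACT, HDFS_RAW_DIR)
    # Downstream transformations (Hive, Pig, MapReduce and so on) then read the
    # raw files where they sit, in their native format.
```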
While Hadoop isn’t completely ready for full-time backup duty, it’s getting there thanks to a new project called Falcon. Tuchen said Falcon “does the data capture replication to give you disaster recovery. If you have two clusters, you put it in one and replicate to the distant one for disaster recovery.”
(Related: Why ignoring Hadoop is not an option)
But moving backup duties to Hadoop isn’t just a flashy way to save some money; it’s also a game changer for your archivist. “The neat thing about that is if you choose to use Hadoop for archiving, now suddenly what you’ve done is you’ve paid for your entire Hadoop installation because of this other use case,” said Tuchen. “But now you’ve got all your data on spinning disks. It’s all analyzable and all the analysis you do with Hadoop now comes for free. That’s pretty cool. Now instead of having a couple quarters’ or a year’s worth of data on spinning disk, and everything else inaccessible on tape, now you have all your data back for years, for the same cost.”
Monitor your cluster
Hadoop brings your team back to the bad old days of the mainframe. You’re all building huge, important projects, but there’s only one central place to run those applications, and there aren’t enough hours in the day for everyone to diddle around on the cluster to see whether they’ve done things correctly.
For this reason, said Walter Maguire, chief field technologist at HP Vertica, being agile in the traditional sense might not work for you in a Hadoop workflow.
“The notion of changing quickly, trying something new, learning from it, and the idea that I can very quickly adapt my infrastructure, is appealing,” he said. “We see a lot of customers today taking that approach to Big Data. Obviously a free-for-all won’t work; otherwise you’re subject to whichever process is consuming the most resources.”
To that end, your job scheduling and access management should be a big part of your cluster design decisions. As your cluster grows and adopts new technologies like Spark, YARN and Tez, how will you control jobs that run simultaneously, or at short intervals?
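Whatever distribution you choose, the knobs usually end up in the YARN scheduler configuration; a common pattern is to carve the cluster into capacity queues per team so that no single workload can starve the rest. The queue names and percentages below are hypothetical, and the property names are those of the YARN Capacity Scheduler, so check them against whatever scheduler your distribution ships.

```python
# Sketch: generate a minimal capacity-scheduler.xml that splits the cluster
# into per-team YARN queues. Queue names and percentages are hypothetical;
# verify the property names against the scheduler your distribution uses.
import xml.etree.ElementTree as ET

QUEUES = {"etl": 60, "analytics": 30, "adhoc": 10}   # hypothetical split, percent

def capacity_scheduler_xml(queues):
    conf = ET.Element("configuration")

    def add_property(name, value):
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)

    add_property("yarn.scheduler.capacity.root.queues", ",".join(queues))
    for queue, percent in queues.items():
        add_property(f"yarn.scheduler.capacity.root.{queue}.capacity", percent)
    return ET.tostring(conf, encoding="unicode")

if __name__ == "__main__":
    assert sum(QUEUES.values()) == 100, "queue capacities must sum to 100%"
    print(capacity_scheduler_xml(QUEUES))
```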
Each distribution has its own way to manage this layer of the cluster, and thus this should be an important part of your cluster distribution selection process.
Don’t optimize the application; optimize the data
Radoop’s Prekopcsak said that most developers will instinctively want to tweak their applications in order to get better performance out of them once they hit the Hadoop cluster. He said this instinct is incorrect.
“When people start writing their own first programs, then sooner or later they want to optimize it or try to improve it,” said Prekopcsak. “But in many cases optimizations do not necessarily come from rewriting the code itself, but mostly from refactoring the data upon which the job is run.
“Using compression and partitioning can really speed up operations. Sometimes that’s up to a 100x speed-up. If the data is organized the right way, all your processing will be more effective. At first, it makes more sense to get the data into good shape, rather than optimizing your code. Transforming the data will give you much more performance gains.”
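As a concrete illustration of the data-first tuning Prekopcsak describes, a dataset can be rewritten once into a partitioned, compressed columnar layout so that later jobs scan only the slices they need. The sketch below uses PySpark’s DataFrame writer (newer tooling than some of what is discussed here), and the paths and partition column are hypothetical.

```python
# Sketch of data-first optimization: rewrite raw CSV once into a partitioned,
# compressed columnar layout so downstream jobs read only the slices they need.
# Paths and the partition column are hypothetical; requires a PySpark install.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reshape-events").getOrCreate()

raw = spark.read.csv("/warehouse/raw/events", header=True, inferSchema=True)

# Partition by event date and compress with Snappy: queries that filter on
# event_date touch only the matching directories instead of the whole dataset.
(raw.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("/warehouse/curated/events", compression="snappy"))

spark.stop()
```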
Be ready for change
This is a two-sided tip, as it pertains both to your cluster and to Hadoop as a whole. On the micro scale, keep in mind that your application is going to change once it hits live data. Said Concurrent’s Wensel: “The other side of the problem is that as you’re developing an application, as you get larger and larger data sets, your application changes. It is a challenge to build an application and have it grow to larger data sets. Be very conscious of the fact that things are changing.”
But on the macro scale, also be aware that the Hadoop ecosystem as a whole is changing almost every day. While Hadoop 2.0 brought in many structural changes and new features to the platform, there are still new aspects of the project that are only now beginning to wind their way into enterprise workloads.
Michael Masterson, director of strategic business development at Compuware, said, “There’s a whole new class of applications being targeted and built for this cluster. It’s moving from a fairly well-defined pattern around Map/Reduce to much more dynamic use cases. With rapid ingestion of data with Storm, more rapid querying with Spark with up to 100x improvement, and with things like Impala from Cloudera that are also creating more real-time application frameworks on top… These are probably just the beginning of what YARN opens up.
“We’re seeing the rapid evolution of what that Hadoop programming model looks like, and the rapid evolution of applications, each written on this cluster, each of which will be designed and built and driving a line-of-business need across the organization.”
Masterson continued: “We are starting to see customers realizing workloads are changing. There have been some best practices around. For example, if you build a cluster and don’t know what to do, there’s a sizing guide from HP. One of the rules is that the number of spindles on disk shouldn’t exceed the number of cores. That’s just a general suggestion, but I think it’s getting harder and harder to use a general practice like that. The types of workloads are changing,” he said, referencing solid-state drives as another change in the cluster paradigm.
To that end, be prepared to grow your cluster, build purpose-driven and specific Hadoop clusters, and above all, be sure to keep abreast of the work being done in projects relevant to your team and job.
Tez
Tez is a project aimed at making life easier for developers who have to use YARN. The project is led by Hortonworks and is currently in the Apache Incubator. From the project’s site:
The Apache Tez project is aimed at building an application framework which allows for a complex directed acyclic graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.
The support for third-party application masters is the crucial aspect to flexibility in YARN. It permits new job runtimes in addition to classical Map/Reduce, while still keeping M/R available and allowing both the old and new to coexist on a single cluster. Apache Tez is one such job runtime that provides richer capabilities than traditional map-reduce. The motivation is to provide a better runtime for scenarios such as relational querying that do not have a strong affinity for the Map/Reduce primitive. This need arises because the Map/Reduce primitive mandates a very particular shape to every job and although this mandatory shape is very general and can be used to implement essentially any batch-oriented data processing job, it conflates too many details and provides too little flexibility.
Falcon
The Apache Falcon project is an effort to more explicitly and granularly control the data you’re storing in your cluster. As such, it allows you to choose how many times data is replicated, coordinate multiple clusters, and enforce policies across the various endpoints data might flow through. From the Apache Incubator page:
Apache Falcon is a feed-processing and feed-management system aimed at making it easier for end consumers to onboard their feed processing and feed management on Hadoop clusters.
Data Management on Hadoop encompasses data motion, process orchestration, life-cycle management, data discovery, etc., among other concerns. Falcon is a new data-processing and management platform for Hadoop that solves this problem and creates additional opportunities by building on existing components within the Hadoop ecosystem without reinventing the wheel.