The Apache Software Foundation announced the release of Apache Hadoop 2.0. This next-generation version significantly expands the platform’s capabilities, thanks to the new YARN cluster resource manager. Additionally, the Hadoop Distributed File System (HDFS) was upgraded to support high availability and data snapshotting.
Shaun Connolly, vice president of corporate strategy at Hortonworks, said that Hadoop 2.0 is the culmination of a great deal of work between his company and Apache. “If you look at the Hadoop 2.0 line, it’s been in development for over two years. Our strategy has continued to be that we put a premium on the YARN work because each of these systems needs to plug in and inform YARN what the resources are so it can schedule the workloads appropriately,” he said.
With core resource management now implemented in the YARN project, Hadoop clusters running version 2.0 will no longer be limited to Map/Reduce jobs, said Connolly. YARN will allow other types of jobs to run across the cluster against the data in HDFS. And because Hadoop 2.0 is binary compatible with existing Hadoop 1.x applications, data already stored in a Hadoop cluster can stay where it is during the upgrade.
HDFS was also upgraded in version 2.0. The primary change was to make it highly available by removing the NameNode as a single point of failure. As a result of this increase in reliability and stability, HDFS can now be used to underpin real-time applications. The most common use case is an HBase database inside a Hadoop cluster serving as the back-end data store for an external-facing application. Prior to the high-availability changes in HDFS, HBase could not reliably serve as the database behind a public-facing application.
Building on these changes, a number of next-generation Hadoop projects have been in the works at Hortonworks and inside the Apache Incubator. One of these is Apache Tez, a framework for near-real-time data processing in Hadoop.
Tez is designed to allow a Hadoop cluster to answer interactive queries, rather than only running batch jobs that take a long time to execute. Tez runs on top of YARN, which is what allows this capability to be deployed across a cluster. It does not use Map/Reduce; instead, it models each job as a directed acyclic graph (DAG) of processing tasks.
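To illustrate what running a job as a directed acyclic graph means, here is a minimal Python sketch. It is not Tez's API: the `run_dag` helper, the task names, and the sample data are all hypothetical, and everything runs in a single process rather than across a cluster. The point is only that each task starts once the tasks it depends on have finished, with results handed directly downstream.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run every task after its upstream tasks, passing their results along.

    tasks: maps a task name to a callable.
    deps:  maps a task name to a tuple of upstream task names.
    """
    results = {}
    order = []
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in deps.get(name, ())]
        results[name] = tasks[name](*upstream)
        order.append(name)
    return results, order


# Hypothetical query: join orders to users, then total the amount per user.
def scan_orders():
    return [("a", 3), ("b", 5), ("a", 4)]


def scan_users():
    return {"a": "Ann", "b": "Bob"}


def join(orders, users):
    return [(users[user_id], amount) for user_id, amount in orders]


def aggregate(rows):
    totals = {}
    for user_name, amount in rows:
        totals[user_name] = totals.get(user_name, 0) + amount
    return totals


tasks = {
    "scan_orders": scan_orders,
    "scan_users": scan_users,
    "join": join,
    "aggregate": aggregate,
}
deps = {
    "join": ("scan_orders", "scan_users"),
    "aggregate": ("join",),
}

results, order = run_dag(tasks, deps)
print(order)
print(results["aggregate"])
```

In this toy version the two scans have no dependency on each other, so a real engine would be free to run them in parallel. The join and aggregation follow as further nodes of the same graph instead of as separate batch jobs.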
Tez ties in closely with Hive and Pig, allowing users of these data-analysis packages to construct queries, check them quickly, and refine them, without having to wait for the cluster to finish the whole job.
“We’ve been doubling down on Apache Hive as the SQL layer for Hadoop,” said Connolly. “In 2.0, we have some of the fruits of the Hive work, as well as the beginnings of Hive being able to take advantage of Tez.”
Another Hadoop project expected to grow now that Hadoop 2.0 is complete is Apache Falcon. Falcon provides a data life-cycle management service for Hadoop, easing the movement of data into and out of the platform.
“You’re able to begin to have a framework that allows you to orchestrate recovery and retention. You can say, ‘I want to manage this dataset and make sure it’s moved to another cluster and ages out after a year or two.’ This is a framework that enables those types of scenarios to be managed,” said Connolly.
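The scenario Connolly describes, replicating a dataset to another cluster and aging it out after a set period, can be sketched as a small policy check. This Python example is an illustration only and does not use Falcon's interface: the `RetentionPolicy` class, the `plan_actions` function, the paths, and the cluster name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    replicate_to: str   # hypothetical name of the target cluster
    max_age: timedelta  # data older than this is aged out


def plan_actions(partitions, policy, today):
    """Decide, per dataset partition, whether to replicate it or age it out.

    partitions: maps a partition path to the date it was created.
    Returns a list of (action, path, target) tuples, sorted by path.
    """
    actions = []
    for path in sorted(partitions):
        age = today - partitions[path]
        if age > policy.max_age:
            actions.append(("evict", path, None))
        else:
            actions.append(("replicate", path, policy.replicate_to))
    return actions


policy = RetentionPolicy(replicate_to="backup-cluster",
                         max_age=timedelta(days=365))
partitions = {
    "/data/clicks/2012-01-01": date(2012, 1, 1),
    "/data/clicks/2013-10-01": date(2013, 10, 1),
}
actions = plan_actions(partitions, policy, today=date(2013, 10, 16))
for action in actions:
    print(action)
```

The sketch only produces a plan. A life-cycle service would also have to schedule the copies, carry them out, and retry them on failure.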
Apache Hadoop 2.0 is available today from the Apache Software Foundation’s website.