The Apache Software Foundation announced the release of Apache Hadoop 2.0. This next-generation version significantly expands the platform’s capabilities, thanks to the new YARN cluster resource manager. Additionally, the Hadoop Distributed File System (HDFS) was upgraded to support high availability and data snapshotting.
Shaun Connolly, vice president of corporate strategy at Hortonworks, said that Hadoop 2.0 is the culmination of a great deal of work between his company and Apache. “If you look at the Hadoop 2.0 line, it’s been in development for over two years. Our strategy has continued to be that we put a premium on the YARN work because each of these systems needs to plug in and inform YARN what the resources are so it can schedule the workloads appropriately,” he said.
(For the full technical details of Hadoop 2.0, see the Apache Software Foundation’s announcement of Apache Hadoop 2.)
With Hadoop 2.0’s core resource management reimplemented in the YARN project, Hadoop clusters running version 2.0 will no longer be limited to MapReduce jobs, said Connolly. YARN allows other types of jobs to run across the cluster against the data inside HDFS. And because Hadoop 2.0 is binary compatible with existing Hadoop 1.x applications, data already stored inside a Hadoop cluster can be left in place during an upgrade.
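As an illustration of how each node informs YARN of its resources, capacities are declared per node in yarn-site.xml. The property names below are from the Hadoop 2.x configuration; the values are example settings, not recommendations:

```xml
<!-- yarn-site.xml: example NodeManager capacity settings (illustrative values) -->
<configuration>
  <property>
    <!-- Total memory (MB) this node advertises to the ResourceManager -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <!-- Number of virtual CPU cores this node advertises -->
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <property>
    <!-- Largest single container allocation the scheduler will grant -->
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>
</configuration>
```

The ResourceManager schedules containers for any application type (MapReduce or otherwise) against these advertised capacities.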
HDFS was also upgraded in version 2.0. The primary change was to make HDFS highly available. As a result of this increase in reliability and stability, HDFS can now underpin real-time applications. The most common example is an HBase database inside a Hadoop cluster serving as the back-end data store for an external-facing application. Prior to the high-availability changes in HDFS, HBase could not reliably serve a database to the outside world.
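By way of illustration, HDFS high availability in 2.0 is configured as a pair of NameNodes (one active, one standby) behind a logical nameservice. A minimal hdfs-site.xml sketch, where the nameservice name and hostnames are placeholders, looks like:

```xml
<!-- hdfs-site.xml: sketch of an HA NameNode pair (names/hosts are placeholders) -->
<configuration>
  <property>
    <!-- Logical name clients use instead of a single NameNode host -->
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <!-- Two NameNodes, one active and one standby, in the nameservice -->
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
  <property>
    <!-- Lets clients fail over transparently to whichever NameNode is active -->
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```

Because clients address the nameservice rather than a specific host, a NameNode failover is invisible to applications such as HBase, which is what makes externally facing, real-time workloads practical.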
Based on all these changes, a number of next-generation Hadoop projects have been in the works at Hortonworks and inside the Apache Incubator. One of these is Apache Tez, a framework for near-real-time data processing in Hadoop.