SAN JOSE — Two years ago, the Hadoop Summit in Santa Clara was sponsored by seven companies, one of which had only two employees: its founders. The conference was so talk-heavy that an expo hall was out of the question. Last week, the conference’s new home in San Jose hosted a hair under 50 sponsor companies.
Heavyweights such as Cisco, Dell, Microsoft and NetApp have all sidled up to the Hadoop bar to drink from the flow of demand spilling into the marketplace. The two-person company from 2010 was Datameer; today it employs around 30 people and offers version 2.0 of its Excel-like interface for building Hadoop data-mining jobs.
All of this growth in the industry around Hadoop is mirrored in the Apache Software Foundation, which recently announced that it had experienced unprecedented expansion and graduation of Incubator projects. Powered by the fast evolution and frequent creation of Apache projects to supplement Hadoop, the foundation now has more than 100 active top-level projects for the first time.
Indeed, an Apache Software Foundation announcement cited Big Data projects around Hadoop as a primary reason for the growth in participation, sponsorships and overall projects in 2011. The announcement pointed to an IDC report predicting that the Hadoop market will grow 60% annually.
In January, Hadoop 1.0 was released. About a month ago, the first alpha release of Hadoop 2.0 was placed in the Apache repositories. Arun Murthy, founder and architect of Hortonworks, told Summit attendees what they could expect from this forthcoming rewrite of much of the Hadoop platform. (Hortonworks is the former Yahoo Hadoop group, which was spun out into its own company last year. Hortonworks is charged with maintaining, supporting and advancing vanilla Apache Hadoop.)
The biggest change in 2.0 is the refocusing of the Hadoop project into a general data-processing platform, rather than a dedicated MapReduce system. Murthy said that this change has required a complete re-architecting of how Hadoop works on the server-side, but he added that the underlying APIs and MapReduce use cases should remain largely unchanged for those already familiar with them.
What will be a major change, however, is that the MapReduce portions of Hadoop are being refactored into a user-side library. That means the actual MapReduce implementation used on a Hadoop cluster can now be swapped out, and indeed, multiple implementations and batch jobs will be able to run at the same time.
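The user-facing programming model that Murthy said would remain stable is the familiar map/shuffle/reduce pipeline. A minimal, framework-free sketch of the classic word-count pattern (in Python rather than Hadoop's Java Mapper/Reducer API; function names here are illustrative only):

```python
from collections import defaultdict

# Toy word count illustrating the MapReduce programming model.
# Real Hadoop jobs implement Mapper/Reducer classes in Java; this
# standalone sketch only mirrors the map -> shuffle -> reduce phases.

def map_phase(documents):
    """Emit (word, 1) pairs, as a Hadoop Mapper would."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, as the framework's shuffle/sort step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts per word, as a Hadoop Reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # "the" appears twice across the two documents
```

In Hadoop 2.0's design, this whole pipeline becomes just one kind of application running on the cluster rather than the cluster's built-in purpose.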
Murthy said that Hadoop 1.0 uses two trackers: a JobTracker and a TaskTracker. The JobTracker is currently responsible for most of the work inside the cluster: resource management, job scheduling and resource reclamation. The TaskTracker is sent down to each node and used locally to manage a specific task inside a job.
The new architecture in Hadoop 2.0 will feature two concepts essential to understanding the new trackers. First, "applications" will be the new name for generic bags of tasks or whole jobs. Second, the concept of "containers" is being introduced, which will allow cluster resources to be allocated in a more partitioned, controllable fashion.
Container allocation will be handled by the new ResourceManager, and actual job management will now be handled by a per-application ApplicationMaster; essentially, said Murthy, this splits Hadoop's current JobTracker into two parts. Together with the NodeManager, the revamped successor to the TaskTracker on each node, this new triumvirate forms YARN, the new job management and scheduling system at the core of Hadoop 2.0.
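The division of labor Murthy described can be sketched as a toy model: one cluster-wide allocator of containers, and one negotiator per job. This is a minimal Python illustration, not YARN's actual Java RPC interfaces; all class and method names here are invented for the sketch:

```python
# Toy model of YARN's split of the old JobTracker's duties.
# Names are illustrative; real YARN components speak Java RPC protocols.

class ResourceManager:
    """Cluster-wide: hands out containers (bundles of memory/CPU)."""
    def __init__(self, total_containers):
        self.available = total_containers

    def allocate(self, requested):
        granted = min(requested, self.available)
        self.available -= granted
        return granted

class ApplicationMaster:
    """Per-application: negotiates containers and tracks its own tasks."""
    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks

    def run(self, rm):
        granted = rm.allocate(self.tasks)
        # In real YARN, tasks would now launch inside containers on the
        # NodeManagers; here we just report what was granted.
        return f"{self.name}: {granted}/{self.tasks} tasks got containers"

rm = ResourceManager(total_containers=10)
# Two applications -- say, a MapReduce job and some other batch engine --
# share one cluster, which is the point of the new architecture.
print(ApplicationMaster("mapreduce-job", 6).run(rm))
print(ApplicationMaster("other-batch-job", 6).run(rm))
```

Note that the second application only gets what is left, which is exactly the arbitration role the ResourceManager inherits from the old JobTracker.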
This is central to the future of the platform, said Murthy: Data can remain in a Hadoop cluster and Hadoop systems can remain in control of that cluster, but developers will now be able to perform any batch data operation on said data, not just MapReduce.
All these changes are yielding massive performance increases, said Murthy. The Hadoop 2.0 branch includes more than three years of changes crafted by the Apache, Hortonworks and Yahoo teams. HDFS will soon receive a high-availability mode in the 2.0 branch, allowing Hadoop's much-maligned file system to serve as a core, highly available network asset.
Currently, Hadoop tops out at around 4,000 nodes in a cluster. Murthy said that the new architecture in 2.0 will enable clusters to scale up to 10,000 nodes. He also said that 2.0 is prepared for beefier individual nodes, a necessity as hardware capacity keeps pace with Moore's law.
There were dozens of companies represented on the Hadoop Summit show floor. The aforementioned Datameer was demonstrating version 2.0 of its Web-based interface for querying Hadoop data. The product is targeted at business analysts rather than developers, and version 2.0 includes extensive new interface customization, charting and graphing options.
Moving down the stack to the developers who do the actual data-analysis work, Karmasphere 2.0 was also released at the Summit. This query creation and life-cycle management tool is entirely Web-based, and it gives Hive users a clean, consistent interface for crafting, tweaking and sharing their queries in a social way.
Drawn to Scale, a new startup, introduced its first product, Spire. The company’s CEO and founder, Bradford Stephens, said Spire allows Hadoop users to send unadulterated SQL queries into the cluster. Results can be derived in real-time, he said, catering to companies that need fast action on deep data. He added that Spire is now available in beta form. (In this reporter’s opinion, the company would have won the “Best New T-Shirt Design” award, if there had been one.)
Dataguise unveiled its data-masking and security tools for Hadoop at the Summit. Dataguise is the first company to address data security concerns within Hadoop file systems. It offers Hadoop administrators a way to disguise, anonymize and generally secure sensitive data that’s being streamed into, or already exists within, a Hadoop cluster.
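Field-level masking of the kind such tools automate can be illustrated with a generic sketch. This is not Dataguise's actual method or API; the salt, field names and token format below are invented for the example:

```python
import hashlib

# Generic illustration of masking a sensitive field before it lands in
# a shared data store: replace the value with a salted one-way token,
# keeping only the last four digits visible for record matching.
def mask_ssn(ssn, salt="example-salt"):
    digest = hashlib.sha256((salt + ssn).encode()).hexdigest()[:12]
    return f"XXX-XX-{ssn[-4:]}|{digest}"

record = {"name": "Jane Doe", "ssn": "123-45-6789"}
record["ssn"] = mask_ssn(record["ssn"])
print(record["ssn"])  # original SSN is no longer recoverable from the record
```

The design point is that masking happens before, or as, data enters the cluster, so analysts downstream never see the raw values.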
Perhaps the biggest product news at the show was Cloudera’s first public showing of Cloudera Enterprise Edition 4.0. This repackaged and enhanced Hadoop distribution is targeted at large business users of Hadoop, as the company’s free edition cuts out after 50 nodes.
Cloudera Enterprise Edition 4.0 brings in updated packages from the Hadoop ecosystem, such as Apache Avro, Flume and HBase. The distribution also includes speed-ups over standard Apache Hadoop, including a faster MapReduce shuffle, faster disk I/O and faster file-system reads.