The big-data world hit a major milestone over the holiday break: the Apache Hadoop project released version 1.0 today.
Hadoop has grown to become one of the most active projects at the Apache Software Foundation, with dozens of sub-projects attached to it. The project as a whole fulfills several big-data goals: it gives developers cluster-management tools, a place to put petabytes of data for analysis, an implementation of the map/reduce programming model for distributing jobs across that data, and a host of analysis tools for querying and processing all of it. (A primer on map/reduce is available here.)
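For readers new to the idea, the map/reduce model can be sketched in a few lines of single-process Python. This is only an illustration of the three phases (map, shuffle, reduce) using a word count; it is not Hadoop's API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big clusters"])
print(counts)
```

In a real cluster the map and reduce calls run on many nodes in parallel, close to where the data is stored; here they run in one loop so that only the data flow is visible.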
One of the primary points of focus in the 1.0 release is the HBase database, a non-relational, column-oriented store that lets administrators keep very large tables on top of the Hadoop Distributed File System (HDFS). On its own, that file system is rudimentary and suited only to storing massive amounts of data across cluster nodes, but with HBase, Hadoop administrators can serve live data from their clusters. Popular social bookmarking site StumbleUpon uses HBase, rather than MySQL or Oracle, as the live database for its website.
In this release, HBase was moved up to a top-level project under Hadoop and received numerous performance improvements, with the goal of removing the performance barriers that keep it from being a fully viable MySQL replacement for Hadoop users.
Another new addition to Hadoop 1.0 is WebHDFS, a RESTful API for inserting data directly into Hadoop over HTTP. Previously, getting data into a Hadoop cluster required very specific tooling and could not be done via REST calls.
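As a rough sketch of what a WebHDFS write looks like, the Python below builds a file-creation URL and performs the two-step upload: a PUT to the NameNode, which answers with a redirect, followed by a PUT of the file contents to the DataNode address it returns. The host name, file path and helper names are hypothetical, port 50070 is assumed to be the NameNode's HTTP port, and the cluster is assumed to be running without Kerberos security.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def _put(url, body=None):
    """Send one HTTP PUT without following redirects; return (status, Location header)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port)
    try:
        conn.request("PUT", f"{parts.path}?{parts.query}", body=body)
        resp = conn.getresponse()
        resp.read()  # drain the response body before closing the connection
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def create_file(host, port, path, data):
    """Two-step WebHDFS create; returns True if the DataNode reports 201 Created."""
    # Step 1: ask the NameNode where to write. No file data is sent here.
    status, location = _put(webhdfs_url(host, port, path, "CREATE", overwrite="true"))
    if status != 307 or not location:
        raise RuntimeError(f"expected a 307 redirect from the NameNode, got {status}")
    # Step 2: send the file contents to the DataNode URL from the redirect.
    status, _ = _put(location, body=data)
    return status == 201


# Hypothetical host and path, for illustration only; no network call is made here.
url = webhdfs_url("namenode.example.com", 50070, "/logs/day1.txt", "CREATE", overwrite="true")
print(url)
```

Calling `create_file("namenode.example.com", 50070, "/logs/day1.txt", b"hello")` would attempt the upload against a live cluster. The redirect is handled by hand because the file data should only be sent in the second request, to the DataNode.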
As Hadoop has begun to pop up inside enterprises, the project has recently put more focus on security. With the release of version 1.0, Kerberos authentication has been implemented across nodes.
Merv Adrian, research vice president at Gartner, said that “Gartner is seeing a steady increase in interest in Apache Hadoop and related ‘big data’ technologies, as measured by substantial growth in client inquiries, dramatic rises in attendance at industry events, increasing financial investments and the introduction of products from leading data management and data integration software vendors. The 1.0 release of Apache Hadoop marks a major milestone for this open-source offering as enterprises across multiple industries begin to integrate it into their technology architecture plans.”