Running Hadoop jobs and keeping a cluster up is a full-time administrator's job. Setting up jobs and parceling them out to the cluster can be difficult, especially if you’re building multiple applications that rely on Hadoop inputs and outputs. Stitching these external applications together can be a pain, and reconciling their needs inside map/reduce jobs can make things even more complicated.
We’ll let the Cascading project describe itself here:
As a library and API that can be driven from any JVM-based language (Jython, JRuby, Groovy, Clojure, etc.), developers can create applications and frameworks that are “operationalized”. That is, a single deployable JAR can be used to encapsulate a series of complex and dynamic processes all driven from the command line or a shell, instead of using external schedulers to glue many individual applications together with XML against each individual command line interface.
Not only does Cascading abstract away map/reduce, it also gives developers a much simpler way to deploy their jobs. With deployment handled, Cascading lets developers build applications on top of Hadoop using a much simpler design model.
The Cascading tree also includes a number of tools built on the system. One of those tools, dubbed the “Multitool,” even allows developers to run grep and sed jobs directly against the cluster’s data set.
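To make the idea of chaining grep- and sed-style steps into one deployable unit more concrete, here is a minimal sketch in plain Python. It is not Cascading's API and does not run on Hadoop; the helper names (`grep`, `sed`, `assemble`) and the sample log lines are hypothetical, chosen only to illustrate how small stages can be composed into a single flow.

```python
import re
from functools import reduce


def grep(pattern):
    """Return a stage that keeps only the lines matching the regex."""
    rx = re.compile(pattern)
    return lambda lines: (line for line in lines if rx.search(line))


def sed(pattern, replacement):
    """Return a stage that rewrites each line, like sed's s/pattern/replacement/g."""
    rx = re.compile(pattern)
    return lambda lines: (rx.sub(replacement, line) for line in lines)


def assemble(*stages):
    """Chain stages into one callable so the whole flow is a single unit."""
    return lambda lines: reduce(lambda data, stage: stage(data), stages, lines)


# Hypothetical log lines standing in for a file stored on the cluster.
log = [
    "INFO job_1 started",
    "ERROR job_1 failed on node7",
    "INFO job_2 started",
    "ERROR job_2 failed on node3",
]

# Keep the error lines, then mask the node names.
flow = assemble(grep(r"^ERROR"), sed(r"node\d+", "<node>"))
result = list(flow(log))
print(result)
```

In a real cluster setting the framework would translate such a chain into map/reduce work behind the scenes; this sketch only applies the stages in order to an in-memory list.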
If you like Hadoop, you might also like… Well, if I ran Mahout in my brain, I could crank out some suggestions here. Mahout is the machine-learning library for Hadoop. It’s all about providing abstractions and primitives that make machine-learning applications easier to write on a Hadoop cluster.
With Mahout, developers can build recommendation engines and create software that spots patterns and similarities in data sets. The classic example of such a system is the Netflix recommendation engine. Though Netflix’s system is not based on Hadoop or Mahout, it offers the same kind of functionality: Users input movies they like, and the system recommends similar movies.
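The "users who liked this also liked that" idea can be sketched in a few lines of plain Python. This is not Mahout's API, and it is not how Netflix's engine works; it is a toy co-occurrence recommender, and the function names and the sample users and movies are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(likes):
    """Count, for each ordered pair of movies, how many users liked both."""
    co = defaultdict(Counter)
    for movies in likes.values():
        for a, b in permutations(set(movies), 2):
            co[a][b] += 1
    return co


def recommend(co, liked, top_n=3):
    """Score unseen movies by how often they co-occur with the user's likes."""
    scores = Counter()
    for movie in liked:
        for other, count in co.get(movie, {}).items():
            if other not in liked:
                scores[other] += count
    return [movie for movie, _ in scores.most_common(top_n)]


# Hypothetical sample data: which movies each user said they liked.
likes = {
    "ann": ["Alien", "Blade Runner", "The Matrix"],
    "bob": ["Alien", "Blade Runner"],
    "cat": ["Alien", "Amelie"],
    "dan": ["Amelie", "Chocolat"],
}

co = build_cooccurrence(likes)
suggestions = recommend(co, {"Alien"})
print(suggestions)
```

Here two users liked both "Alien" and "Blade Runner", so "Blade Runner" ranks first for someone who liked "Alien". A library such as Mahout aims to run this kind of counting and scoring across a cluster, where the data set is far too large for one machine's memory.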
Mahout is at version 0.4 right now, but 0.5 should be coming before the end of the year. It’s still in its infancy, but I could easily see it turning into the basis for an entire future based on giant sets of data.