Running Hadoop jobs and keeping a cluster up is a full time administrator job. Setting up jobs and parceling them out to the cluster can be difficult, especially if you’re building multiple applications that rely on Hadoop inputs and outputs. Stitching these external applications together can be a pain, and bringing their needs together inside of map/reduce jobs can make things even more complicated.

We’ll let the Cascading project describe itself here:

As a library and API that can be driven from any JVM-based language (Jython, JRuby, Groovy, Clojure, etc.), developers can create applications and frameworks that are “operationalized”. That is, a single deployable JAR can be used to encapsulate a series of complex and dynamic processes all driven from the command line or a shell, instead of using external schedulers to glue many individual applications together with XML against each individual command line interface.

Not only does Cascading abstract away from map/reduce, it also allows developers to deploy their jobs in a much simpler way. That being done, cascading allows developers to build applications on top of Hadoop with a much simpler design model.

The cascading tree also includes a number of tools based on the system. One of those tools, dubbed the “Multitool,” even allows developers to run grep and sed jobs right against the cluster’s data set.



If you like Hadoop, you might also like… Well, if I ran Mahout in my brain, I could crank out some suggestions here. Mahout is the machine-learning library for Hadoop. It’s all about presenting abstracts and primitives that make AI easier to write on a Hadoop cluster.

As a result of using Mahout, developers can build recommendation engines and create software that can spot patterns and similarities in data sets. The classic example of such a system is the Netflix recommendation engine. Though its system is not based on Hadoop or Mahout, it does offer the same functionality: Users input movies they like, and the system is able to recommend similar movies.

Mahout is at version 0.4 right now, but 0.5 should be coming before the end of the year. It’s still in its infancy, but I could easily see it turning into the basis for an entire future based on giant sets of data.



Originally from Facebook, Hive is a way to query a Hadoop cluster with SQL-like code. Instead of actual SQL, Hive uses its own DSL called QL, which is said to be very similar to SQL. Naturally, if a user is trying to do simple, fast queries, this is not the system to use: Hadoop is still a non-real-time system, and Hive queries can take a few minutes if they’re being run against a petabyte of data.

But then, Hadoop has never been about speed; it’s always been about size. Frankly, 10 years ago, the idea of running a query against petabytes of data was the sort of thing you’d only read about in science fiction books or in quantum computing papers. Thus, despite the lack of responsiveness, Hadoop and Hive can still combine to offer businesses an insanely powerful tool for finding needles in data haystacks.

In addition, Hive adds RPC to the Hadoop mix, meaning developers can plug query bars into existing applications that send out to the Hadoop cluster and bring back the needed information. Just imagine being at Facebook and having the ability to sort out all users nicknamed “Bubba,” cross-referenced with all users who own motorcycles. Considering that information is buried under 46 petabytes of other data, it’s phenomenal to think modern machines can make sense of such a request at all.

Did we neglect to mention the Facebook Hadoop instance is 46 petabytes? Well, it is.



Avro is a data serialization system for Hadoop. The project has been expanding at Apache for just over two years now, and it’s currently at version 1.51.

Avro’s big contribution to Hadoop is the use of schemas to describe data. Instead of just pouring tons of generic data into a Hadoop cluster, then normalizing the data inside the cluster, Avro brings data descriptions into Hadoop via these schemas, and then keeps those schemas with the data, allowing generic processing to take place without a ton of extra data-manipulation coding required in the pipeline.

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic scripting languages, since data, together with its schema, is fully self-describing.

From the Avro documentation: When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server both have the other’s full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.


BackType Technology has been working with Hadoop for a while, but when it came up against a few limitations of the platform, it decided to throw the whole system out and build something from scratch. The result is a Hadoop-like system that can function in real time, rather than the somewhat plodding pace with which Hadoop can do its jobs. This new Hadoop-free project is called Storm, and it should be ready for general use later this year.

The initial problem BackType encountered was due to some faulty shell usage within Hadoop. From the BackType Technology blog:

We traced the problem to a sloppy implementation detail of Hadoop. It turns out that Hadoop sometimes shells out to do various things with the local filesystem. When you shell out in Java, the process gets forked. Forking a process causes the child process to reserve the same amount of memory for itself as the parent process is using (to fully understand what’s happening, you need to learn about memory overcommit and the copy-on-write semantics of forking in Linux). This means that the Hadoop process which was using 1GB will temporarily “use” 2GB when it shells out.

And so, Storm was born. While no one outside of BackType has really touched the software much yet, it’s still a very promising piece of technology that could soon offer a fast and efficient way to bring an entire cluster to bear on a single stack of data.