You’ve undoubtedly heard of Apache Hadoop, a framework for distributed processing, and the Map/Reduce pattern it has helped make famous. But “Map and Reduce” is a general pattern, not a framework-specific technology. “Map” means “Do some data processing on every element in your collection”; “Reduce” means “Walk over your collection of data (such as that produced by the ‘Map’ step) and summarize or coalesce the results.” For instance, “Map” all the photos you took on your vacation by taking a quick look at them and marking them as blurry or sharp. “Reduce” them by deleting the blurry ones.
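To make that concrete, here is a minimal sketch (in Java, since that is Hadoop’s native tongue, although nothing here touches Hadoop itself; the Photo type, file names and sharpness threshold are all my own invention):

import java.util.List;
import java.util.stream.Collectors;

public class PhotoTriage {
    // Hypothetical types: a photo with a sharpness score, and the Map step's verdict on it.
    record Photo(String file, double sharpness) {}
    record Marked(String file, boolean blurry) {}

    public static void main(String[] args) {
        List<Photo> shots = List.of(
                new Photo("beach-001.jpg", 0.91),
                new Photo("bigfoot-017.jpg", 0.32),
                new Photo("sunset-042.jpg", 0.85));

        // "Map": apply the same quick judgment to every element. Blurry or sharp?
        List<Marked> marked = shots.stream()
                .map(p -> new Marked(p.file(), p.sharpness() < 0.5))
                .collect(Collectors.toList());

        // "Reduce": walk over the mapped results and coalesce them. Here, drop the blurry ones.
        List<String> keepers = marked.stream()
                .filter(m -> !m.blurry())
                .map(Marked::file)
                .collect(Collectors.toList());

        System.out.println(keepers); // [beach-001.jpg, sunset-042.jpg]
    }
}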
Two things are important: first, make the Map data processing as discrete and rapid as possible. In the case of triaging photos, don’t look at the first of 2,000 photos and decide whether to put it in the photo album you share with your friends; just decide whether it’s blurry or not. The second rule is to keep the “Reduce” step separate from the Map step. Maybe it turns out that the only photo you have of Bigfoot is a little blurry; value decisions like that are often hard to make without the context of the entire calculation.
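Here is one way (again, a sketch with invented data) to keep the value judgment in the Reduce step, where the whole collection is visible: rather than discarding every blurry photo during the Map pass, group the rated photos by subject and keep the best of each, so the lone, slightly blurry Bigfoot shot survives:

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class TriageWithContext {
    // Hypothetical record: each photo carries a subject tag and a sharpness score from the Map pass.
    record Rated(String file, String tag, double sharpness) {}

    public static void main(String[] args) {
        List<Rated> rated = List.of(
                new Rated("beach-001.jpg", "beach", 0.91),
                new Rated("beach-002.jpg", "beach", 0.30),
                new Rated("bigfoot-017.jpg", "bigfoot", 0.32)); // blurry, but the only Bigfoot shot

        // Reduce with full context: per subject, keep the sharpest photo, even one that
        // would have failed a naive per-photo cutoff applied during the Map step.
        Map<String, Optional<Rated>> bestPerSubject = rated.stream()
                .collect(Collectors.groupingBy(Rated::tag,
                        Collectors.maxBy(Comparator.comparingDouble(Rated::sharpness))));

        bestPerSubject.forEach((tag, best) ->
                System.out.println(tag + " -> " + best.map(Rated::file).orElse("none")));
    }
}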
This implies another non-obvious aspect of the Map/Reduce approach: Map/Reduce is really Map and Reduce and then Map some more and Reduce some more and then save that and return later to Map a little more, etc. With digital photography, success comes from an efficient and consistent way of rating and tagging your media, and then from working with those Map/Reduced datasets for different projects (a “Highlights of Our Trip” album versus an “A Glimpse of Bigfoot” album).
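In code terms (still a sketch, with a made-up Keeper type standing in for the rated, tagged dataset produced by earlier passes), the later rounds of Map/Reduce run over the reduced data rather than over the raw camera dump:

import java.util.List;
import java.util.stream.Collectors;

public class Albums {
    // The already Map/Reduced dataset: rated, tagged keepers from earlier passes.
    record Keeper(String file, String tag, double rating) {}

    public static void main(String[] args) {
        List<Keeper> keepers = List.of(
                new Keeper("beach-001.jpg", "beach", 4.5),
                new Keeper("sunset-042.jpg", "sunset", 5.0),
                new Keeper("bigfoot-017.jpg", "bigfoot", 3.0));

        // A second Map/Reduce pass over the reduced data, not over the raw photos:
        List<String> highlights = keepers.stream()
                .filter(k -> k.rating() >= 4.0)         // "Highlights of Our Trip"
                .map(Keeper::file)
                .collect(Collectors.toList());

        List<String> bigfoot = keepers.stream()
                .filter(k -> k.tag().equals("bigfoot")) // "A Glimpse of Bigfoot"
                .map(Keeper::file)
                .collect(Collectors.toList());

        System.out.println(highlights);
        System.out.println(bigfoot);
    }
}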
The same principle holds true with Big Data: Even if you have a hunch about the ultimate answer you’re trying to derive, it is more likely to emerge from incremental steps. Although ultimately you may rerun your entire calculation from scratch, that’s something to avoid during the development stage: re-processing raw data over and over again, rather than working from an already-Map/Reduced dataset, is infuriating and wasteful. (On the other hand, you must bear in mind the reductions you’ve already applied and avoid trying to re-derive something you’ve already discarded.)
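During development, that can be as unglamorous as caching the intermediate dataset to disk and only regenerating it when you must. A rough sketch, with a hypothetical ratings.csv file and a stand-in for the expensive pass over raw data:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class IntermediateCache {
    public static void main(String[] args) throws IOException {
        Path cache = Path.of("ratings.csv"); // hypothetical intermediate Map/Reduced dataset

        List<String> ratings;
        if (Files.exists(cache)) {
            // Start the next pass from the intermediate dataset...
            ratings = Files.readAllLines(cache);
        } else {
            // ...and only fall back to the slow pass over raw data when you have to.
            ratings = rateRawPhotos();
            Files.write(cache, ratings);
        }
        System.out.println(ratings.size() + " rated photos available for the next pass");
    }

    private static List<String> rateRawPhotos() {
        // Stand-in for scanning the raw camera dump; returns "file,rating" lines.
        return List.of("beach-001.jpg,4.5", "bigfoot-017.jpg,3.0");
    }
}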
I haven’t mentioned distributed processing yet, but the reason why Map/Reduce has become so popular is that the Map step, if done properly, is highly parallelizable. The Map function applies the same function to every element in a collection; it’s sometimes called the “Apply” function, and LINQ calls it the “Select” function. A properly written Map function ought not have any loop-carried dependencies: There ought to be nothing in it that requires information from other collection elements or that is dependent upon the order in which it is calculated.
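In other words, a good mapper is a pure function of its single argument. A sketch (the blur check is a placeholder of my own, not a real image-processing routine):

import java.util.List;
import java.util.stream.Collectors;

public class IndependentMap {
    // A well-behaved Map function: its result depends only on the element it is handed,
    // never on other elements, shared state, or the order of processing.
    static boolean looksBlurry(String file) {
        // Placeholder check; the point is that it reads nothing but its own argument.
        return file.contains("bigfoot");
    }

    public static void main(String[] args) {
        List<String> files = List.of("beach-001.jpg", "bigfoot-017.jpg", "sunset-042.jpg");

        // The equivalent of LINQ's Select: the same function applied to every element.
        List<Boolean> verdicts = files.stream()
                .map(IndependentMap::looksBlurry)
                .collect(Collectors.toList());
        System.out.println(verdicts);

        // A mapper that updated a running total, or compared each photo to "the previous one",
        // would have a loop-carried dependency and could no longer be split safely across workers.
    }
}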
Achieving such independence may require preprocessing with other Map/Reduce sequences, but the benefit gained is that a framework such as Apache Hadoop can distribute the Map calculation across multiple cores, chips or machines (this, of course, becomes very framework- and problem-specific).
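You can see the payoff in miniature without any framework at all: because the mapping function is independent per element, Java will happily spread it across cores if you ask. This is a toy analogue of what Hadoop does across machines, not Hadoop’s own API:

import java.util.List;
import java.util.stream.Collectors;

public class ParallelMap {
    public static void main(String[] args) {
        List<String> files = List.of("a.jpg", "b.jpg", "c.jpg", "d.jpg");

        // Because the mapping function is independent per element, switching from
        // stream() to parallelStream() spreads the work across cores without any
        // change to the function itself.
        List<Integer> lengths = files.parallelStream()
                .map(String::length)   // stand-in for an expensive per-photo calculation
                .collect(Collectors.toList());

        System.out.println(lengths);
    }
}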
The Reduce step (called in other places “Inject,” “Fold” or “Aggregate”) is not independent: Once all the Map functions have been calculated, it moves through the data and does such things as remove duplicates, coalesce data, or collect statistics (such as when you gather the number and ratings of photos tagged “Bigfoot”).
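Continuing the sketch (same invented Rated type as before), a Reduce step that gathers exactly that kind of statistic might look like this:

import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TagStats {
    record Rated(String file, String tag, double rating) {}

    public static void main(String[] args) {
        List<Rated> rated = List.of(
                new Rated("beach-001.jpg", "beach", 4.5),
                new Rated("bigfoot-017.jpg", "bigfoot", 3.0),
                new Rated("bigfoot-018.jpg", "bigfoot", 2.5));

        // The Reduce (a.k.a. Fold/Aggregate) step sees the whole mapped dataset at once,
        // so it can coalesce it: here, count the photos and average the ratings per tag.
        Map<String, DoubleSummaryStatistics> statsByTag = rated.stream()
                .collect(Collectors.groupingBy(Rated::tag,
                        Collectors.summarizingDouble(Rated::rating)));

        DoubleSummaryStatistics bigfoot = statsByTag.get("bigfoot");
        System.out.println("bigfoot: " + bigfoot.getCount() + " photos, average rating "
                + bigfoot.getAverage());
    }
}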
Map, Reduce and the concept of using them in sequence are straightforward. Hadoop is impressively straightforward to get up and running, too. But in between “Hello World” and winning a Kaggle contest are quite a few steps.
The book “MapReduce Design Patterns” by Donald Miner and Adam Shook is a good intermediate resource and among my favorite technical books of the year. It does not have the step-by-step instructions of a “recipe” book, but I think that’s a fine decision given Hadoop’s position as a fairly specialized technology. By avoiding line-by-line breakdowns, the book is able to deliver a lot of content in its 436 pages. It starts with an approachable summary in 30 or so pages, and then covers “Summarization Patterns,” which I think is the logical training ground for Map/Reduce. I’m no expert in Hadoop, but the extensive discussion of “Join Patterns” struck me as being as comprehensive as it was enlightening.
I highly recommend the book to anyone with even a passing interest in Big Data and distributed processing.
Larry O’Brien is a developer evangelist/advocate for Xamarin. Read his blog at www.knowing.net.