This has been the month of Twitter. The company announced that it would soon be filing for an IPO, and it also managed to hang its logo on the side of its swanky downtown San Francisco offices. Perhaps the most significant bit of news, though, was the release of Summingbird, a new project that combines Apache Hadoop with Twitter’s own Storm project.
Storm, for those who’ve not seen it, is all about distributed stream processing in a timely fashion. Groupon uses Storm to polish, validate and de-duplicate address information for the businesses it covers. Its Storm process includes about 45 steps between an address being committed and it being deemed valid, fit and accurate.
Storm combined with Hadoop will allow all of those pesky data transformations to take place in real time as information is pulled into the Hadoop cluster. Effectively, this offers an easier way to get data into Hadoop; today, loading data into Hadoop is a manual process that someone has to set up and run by hand.
Summingbird promises to make that pipeline for incoming data more manageable, and to let the data be transformed before it hits Hadoop. It also means that developers building Hadoop applications now have a lot more flexibility in what they can do with all the data in the cluster.
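To make the idea concrete, here is a minimal sketch of the kind of cleanup described above (polish, validate, de-duplicate) applied to records one at a time as they arrive, before anything is stored. It is plain Python rather than Storm or Summingbird code, and every name in it (polish, is_valid, clean_stream) and the toy validity rule are assumptions made up for illustration; a real pipeline such as Groupon's would involve far more steps.

```python
from typing import Iterable, Iterator


def polish(address: str) -> str:
    """Hypothetical 'polish' step: collapse whitespace and normalize case."""
    return " ".join(address.split()).title()


def is_valid(address: str) -> bool:
    """Toy validity rule (an assumption): a street number followed by a name."""
    parts = address.split()
    return len(parts) >= 2 and parts[0].isdigit()


def clean_stream(records: Iterable[str]) -> Iterator[str]:
    """Polish, validate and de-duplicate records as they stream in."""
    seen = set()  # polished addresses already emitted
    for raw in records:
        address = polish(raw)
        if not is_valid(address):
            continue  # drop records that fail validation
        if address in seen:
            continue  # drop duplicates of an earlier record
        seen.add(address)
        yield address


# Records arrive one by one; only clean, unique ones reach storage.
incoming = ["  12 main st ", "12 Main St", "no number road", "7 elm  ave"]
stored = list(clean_stream(incoming))
print(stored)
```

Because clean_stream is a generator, each record is processed as soon as it arrives rather than after the whole batch has been loaded, which is the property the article attributes to transforming data on its way into Hadoop.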
Expect Summingbird to change the Hadoop equation.