I stopped in on the MongoDB Developer Days in San Francisco on Friday. After all the discussion of NoSQL databases we’ve heard over the past few years, it was great to see how enterprises are actually using MongoDB.
What came out of the event were a number of stacks and processes that companies had chained to MongoDB. Yuri Finkelstein, lead architect of platform services at eBay, explained how MongoDB is used to solve the auction service’s photo-hosting problems.
eBay hosts millions of images every day, and Finkelstein said that eliminating duplicate images from the data stores translates directly into dollar savings in processing, storage, bandwidth and other common IT costs. As such, the team at eBay had to find a way to check on page load whether images were duplicates. This required a separate data store to hold all the image MD5 checksums, along with some additional meta information to guard against the rare but not impossible chance of an MD5 collision. For this task, they chose MongoDB.
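The checksum lookup described above can be illustrated with a minimal sketch. It is not eBay's implementation: the `ImageIndex` class, the record fields, and the choice of byte length as the extra meta information are all assumptions made for illustration, and a plain Python dict stands in for the MongoDB collection.

```python
import hashlib


class ImageIndex:
    """Hypothetical in-memory stand-in for the checksum store (a dict, not MongoDB)."""

    def __init__(self):
        # Maps an MD5 hex digest to the list of records sharing that digest.
        self._by_md5 = {}

    def add(self, image_id, data):
        """Return the id of an already-stored duplicate, or store the image and return image_id."""
        digest = hashlib.md5(data).hexdigest()
        size = len(data)  # assumed meta information used to catch MD5 collisions
        for record in self._by_md5.get(digest, []):
            if record["size"] == size:
                return record["image_id"]
        # Either the digest is new, or it collided with an image of a different size.
        self._by_md5.setdefault(digest, []).append({"image_id": image_id, "size": size})
        return image_id
```

Calling `add` twice with the same bytes returns the id of the first image, so the caller can point the second listing at the existing copy instead of storing it again. Two different images that happened to share an MD5 digest but differed in size would be kept as separate records.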
“We are running MongoDB in a very demanding environment. It’s business-critical at the moment. Is it reliable? Yes,” said Finkelstein. He also advised that running MongoDB requires intimate knowledge of just how the database works.
“MongoDB has lots of features,” he said. “Really, there are too many to pick from, and it’s tempting to use them all. The problem is, unless you understand how every feature works, you’re going to find yourself in trouble. If you want to be successful, you need to deeply understand every aspect of how this database works in order to be successful.”
Elsewhere at MongoDB Days, Peter Bakkum, member of the technical staff at Groupon, described a complex system at Groupon, the star of which was actually Twitter’s open-source project, the Storm Framework.
Bakkum described Storm as “a real-time distribution framework developed at Twitter. If you have a cluster of machines and you need to process data in a real-time way, and data is streaming in all the time, how do you break that up? How do you manage all the queues in between all these queues? Storm manages parallelism. We deploy the topology, and our workers are deployed to do the work.”
Thus, Groupon breaks out its various data normalization and cross-referencing tasks into individual applications, known as “Bolts” in Storm terminology. As new information is pulled into Groupon’s databases—be it a new business address, a company that’s moved, or a merchant that has gone out of business—addresses need to be properly reformatted, duplicate records need to be checked, and information needs to be verified.
In Storm, each Bolt performs one of these tasks, then passes the data down the line to the next process in the pipeline. Essentially, this pipeline gives Groupon an always-available central store for all the CRM-like data it needs to keep. Bakkum said that this, in fact, is the service his team offers internally to Groupon’s employees and salespeople.
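The Bolt-to-Bolt hand-off can be pictured with a small sketch in plain Python. It does not use Storm's API and is not Groupon's code: the three stages, the record fields and the `run_pipeline` helper are hypothetical, chosen only to mirror the reformatting, duplicate-checking and verification tasks mentioned above.

```python
def normalize_address(record):
    """Stage 1: collapse stray whitespace and use consistent capitalization."""
    cleaned = " ".join(record["address"].split()).title()
    return {**record, "address": cleaned}


def make_dedup_stage():
    """Stage 2: build a stage that drops records it has already seen."""
    seen = set()

    def dedup(record):
        key = (record["name"].lower(), record["address"].lower())
        if key in seen:
            return None  # signal that the record should go no further
        seen.add(key)
        return record

    return dedup


def verify(record):
    """Stage 3: flag whether the required fields are present (a toy check)."""
    return {**record, "verified": bool(record["name"] and record["address"])}


def run_pipeline(records, stages):
    """Pass each record through every stage in order; a stage returning None drops it."""
    results = []
    for record in records:
        for stage in stages:
            record = stage(record)
            if record is None:
                break
        else:
            results.append(record)
    return results
```

For example, feeding two spellings of the same merchant through `run_pipeline(records, [normalize_address, make_dedup_stage(), verify])` leaves a single cleaned, verified record. In Storm itself the stages would run in parallel across a cluster rather than in one loop; the sketch shows only the order in which data moves.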