Even if a data-driven software company doesn’t end up there, Hadoop can be a great starting platform. That’s what San Francisco- and Taipei-based Fliptop found in 2009 when it launched a social media identity matching engine, ultimately employed by such companies as MailChimp, Dell, Toyota, Oracle and Nordstrom. Built on Elastic MapReduce, Amazon Web Services’ implementation of Hadoop, Fliptop’s initial approach used Ruby on Rails, Scala, Java, Apache Solr for search and Drools for business rules. But when market signals began to point elsewhere, founder Doug Camplejohn, drawing on experience with prior startups acquired by Bertelsmann and Symantec, followed them.
“In 2012, we realized that while big companies were spending millions of dollars to build data science teams and custom data warehouses utilizing lots of expensive tools, most mid-market companies could not. So we pivoted. We built a suite of SaaS applications that leveraged internal and external data and data science for everyone, not just the data jocks,” Camplejohn said. Building a solution for predictive lead scoring, however, was a different kind of big data problem, one that Hadoop and Amazon’s Elastic MapReduce couldn’t solve.
“When we were more a social data business, we were doing a lot of batch processing with Elastic MapReduce. Large brands would give us a list of several hundred million email addresses, and we’d find social media profiles associated with those,” said Dan Chiao, Fliptop vice president of engineering. “Now, with predictive analytics, we have a large batch processing job to build an initial machine learning model, sucking in existing data in the CRM system. But the ongoing task is to score incoming leads as they come into Marketo. For that we do a lot of real-time processing on Apache Storm.”
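The split Chiao describes, one large batch job to fit a model on historical CRM data followed by per-lead scoring as records arrive, can be sketched in a few lines of Python. This is only an illustration of the pattern, not Fliptop’s code: the feature names, the toy data and the choice of logistic regression are all assumptions, and a generator stands in for a streaming system such as Storm.

```python
import math
from typing import Dict, Iterable, Iterator, List, Tuple

Lead = Dict[str, float]

# Hypothetical features, chosen only for illustration.
FEATURES = ["employees_log10", "visited_pricing_page"]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def to_vector(lead: Lead) -> List[float]:
    # A leading 1.0 serves as the bias term.
    return [1.0] + [lead[name] for name in FEATURES]


def train_batch(history: List[Tuple[Lead, int]],
                epochs: int = 500, lr: float = 0.1) -> List[float]:
    """Batch step: fit logistic-regression weights on historical leads."""
    weights = [0.0] * (len(FEATURES) + 1)
    for _ in range(epochs):
        for lead, converted in history:
            x = to_vector(lead)
            predicted = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            error = predicted - converted
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


def score(weights: List[float], lead: Lead) -> float:
    """Probability-like score in (0, 1) for a single lead."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, to_vector(lead))))


def score_stream(weights: List[float],
                 incoming: Iterable[Lead]) -> Iterator[float]:
    """Streaming step: score each lead as it arrives, one at a time."""
    for lead in incoming:
        yield score(weights, lead)


# Toy historical data: (lead, 1 if it converted else 0).
history = [
    ({"employees_log10": 3.0, "visited_pricing_page": 1.0}, 1),
    ({"employees_log10": 2.5, "visited_pricing_page": 1.0}, 1),
    ({"employees_log10": 1.0, "visited_pricing_page": 0.0}, 0),
    ({"employees_log10": 1.5, "visited_pricing_page": 0.0}, 0),
]
weights = train_batch(history)

# New leads arriving after the model has been built.
incoming = [
    {"employees_log10": 3.0, "visited_pricing_page": 1.0},
    {"employees_log10": 1.0, "visited_pricing_page": 0.0},
]
scores = list(score_stream(weights, incoming))
```

The point of the separation is that the expensive step runs once, or on a schedule, while the scoring function stays cheap enough to call on every incoming record.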
Fliptop’s new technology stack is centered on Scala, chosen for its concise functional syntax, its access to Java libraries and deployment architecture, and Scala-specific libraries such as Dispatch and the Lift Web Framework. For streaming intake and normalization of massive data, they chose Storm, which Twitter had open-sourced and which had recently moved to Apache. Finally, for their expanding data sources and continually morphing schemas, they chose the open-source MongoDB, which has “the most sophisticated query syntax of the NoSQL options,” according to Chiao. The solution was hosted on AWS and Azure.
The journey Fliptop has taken in the last five years isn’t unusual, as ISVs attempt to build on the Hadoop platform, mining the Big Data ecosystem for gems that make data science viable even without Hadoop expertise. With all the “data washing” hype now blurring the differences among business intelligence, predictive analytics and traditional databases, it’s time to gaze clearly upon this new “data operating system.”
When Hadoop doesn’t fit
What are you trying to gather and what will you do with it? Those are the first questions any data science project should answer. If you’re predicting utility power consumption, you may be dealing with smart meters that generate interval data, along with sensors that check voltage. Retailers may offer personalization based on facial recognition and biometrics, à la Minority Report. Or healthcare systems may look for fraudulent behavior in the form of anomalous data. Whether the data is transactional, historical or master data will determine how you store and query it. How often you need to get to the data has implications for storage design and costs. Finally, whether you need to process it in real time or in batch mode determines much of your particular Big Data stack.
“I think of Hadoop as an operating system and not a database. It’s a database because it provides you with the ability to store and query data. But if you look at the architecture, it’s like an OS, because operating systems are pluggable. Today we have Pig, then Spark comes along. Like an OS, it allows us to develop on top of it,” said Eli Collins, chief technologist for Cloudera, the market-leading provider of Hadoop distributions and services.
And while major vendors from Oracle to Microsoft to IBM and Intel are incorporating Hadoop distributions from Apache, Cloudera and Hortonworks into their own big data offerings, it’s important to note that Hadoop isn’t one-size-fits-all. At the recent Dreamforce conference in San Francisco, plenty of vendors in the Salesforce ecosystem touted the data science behind their apps, but only some, like Fliptop’s early incarnation, used Hadoop as a data store. Popular alternatives include HBase, Cassandra and MongoDB, for varying reasons.