Have you heard the news? A “data lake” overflowing with information about Hadoop and other tools, data science and more threatens to drown IT shops. What’s worse, some Big Data efforts may fail to stay afloat if they don’t prove their worth early on.
“Here’s a credible angle on why Big Data could implode,” began Gary Nakamura, CEO of Concurrent, which makes Cascading, an open-source data application development platform that works with Hadoop, and Driven, a tool for visualizing data pipeline performance. “A CTO could walk into a data center, and when they say, ‘Here is your 2,000-node Hadoop cluster,’ the CTO says, ‘What the hell is that and why am I paying for it?’ That could happen quite easily. I predicted last year that this would be the ‘show me the money’ year for Hadoop.”
While plenty can go wrong, Nakamura is bullish on Hadoop. With companies like his betting robustly on the Hadoop file system (and its attendant components in the Big Data stack), now is a strategic moment to check your data pipelines for leaks. Here’s a primer on where the database market stands, what trends will rock the boat, and how to configure your data science team for success.
Follow the leader
Risks aside, no one—not even the federal government—is immune to the hope that Big Data will bring valuable breakthroughs. Data science has reached presidential heights, with the Obama administration’s appointment of former LinkedIn and RelateIQ quantitative engineer DJ Patil as the United States’ first Chief Data Scientist in February. If Patil’s slick talks and books are any indication, he is at home in a political setting. Though building on government data isn’t new for many companies offering services in real estate (Zillow), employment (LinkedIn), small business (Intuit), mapping (ESRI) or weather (The Climate Corporation), his role should prompt many more to innovate with newly opened data streams via the highly usable data.gov portal.
“I think it’s wonderful that the government sees what’s happening in the Big Data space and wants to grow it. I worked at LinkedIn for three years, and for a period of time [Patil] was my manager. It’s great to see him succeed,” said Jonathan Goldman, director of data science and analytics at Intuit. (Goldman cofounded Level Up Analytics, which Intuit acquired in 2013.)
Defining the kernel, unifying the stack
“In the last 10 years we’ve gone through a massive explosion of technology in the database industry,” said Seth Proctor, CTO of NuoDB. “Ten years ago, there were only a few dozen databases out there. Now you have a few hundred technologies to consider that are in the mainstream, because there are all these different applications and problem spaces.”
After a decade of growth, however, the Hadoop market is consolidating around a new “Hadoop kernel,” similar to the Linux kernel, and the Open Data Platform industry initiative announced in February is designed to reduce fragmentation and accelerate Apache Hadoop’s maturation. Similarly, the Algorithms, Machines and People Laboratory (AMPLab) at the University of California, Berkeley is now halfway through its six-year DARPA-funded Big Data research initiative, and it’s beginning to move up the stack and focus on a “unification philosophy” around more sophisticated machine learning, according to Michael Franklin, director of AMPLab and associate chair of computer science at UC Berkeley.
“If you look at the current Big Data ecosystem, it started off with Google MapReduce, and things that were built at Amazon and Facebook and Yahoo,” he said at the annual AMPLab AMP Camp conference in late 2014. “The first place most of us saw that stuff was when the open-source version of Hadoop MapReduce came out. Everyone thought this is great, I can get scalable processing, but unfortunately the thing I want to do is special: I want to do graphs, I want to do streaming, I want to do database queries.
“What happened is, people started taking the concepts of MapReduce and started specializing. Of course, that specialization leads to a bunch of problems. You end up with stovepipe systems, and for any one problem you want to solve, you have to cobble together a bunch of systems.
“So what we’re doing in the AMPLab is the opposite: Don’t specialize MapReduce; generalize it. Two additions to Hadoop MapReduce can enable all the models.”
First, said Franklin, general directed acyclic graphs will enrich the language, adding more operators such as joins, filters and flattening. Second, data sharing across all phases of the program will enable better performance that is not disk-bound.
Thus, the Berkeley Data Analytics Stack (BDAS, pronounced “Badass”) starts with the Hadoop Distributed File System (HDFS) storage layer, with resource virtualization via Mesos and YARN. A new layer called Tachyon provides caching for reliable cross-cluster data sharing at memory speed. Next is the processing layer, with AMPLab’s own Spark and Velox Model Serving. Finally, the access and interface layer includes BlinkDB, SampleClean, SparkR, Shark, GraphX, MLlib and Spark Streaming.
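To make Franklin’s two additions concrete, here is a minimal PySpark sketch (file paths and field layouts are hypothetical) in which joins and filters are ordinary operators in a directed acyclic graph, and an intermediate result is cached in memory so downstream phases are not disk-bound.

```python
# Minimal PySpark sketch of the two generalizations of MapReduce described above:
# (1) a richer DAG of operators (join, filter) beyond map and reduce, and
# (2) in-memory data sharing between phases instead of writing to disk.
# File paths and field layouts are hypothetical.
from pyspark import SparkContext

sc = SparkContext("local[*]", "generalized-mapreduce-sketch")

# Two hypothetical datasets: (user_id, country) and (user_id, purchase_amount).
users = sc.textFile("hdfs:///data/users.csv") \
          .map(lambda line: line.split(",")) \
          .map(lambda f: (f[0], f[1]))                  # (user_id, country)
purchases = sc.textFile("hdfs:///data/purchases.csv") \
              .map(lambda line: line.split(",")) \
              .map(lambda f: (f[0], float(f[1])))       # (user_id, amount)

# Join and filter are first-class operators in the DAG, not separate jobs.
joined = users.join(purchases)                          # (user_id, (country, amount))

# Cache the shared intermediate result in memory; both downstream phases
# reuse it without another round trip to disk.
joined.cache()

revenue_by_country = joined.map(lambda kv: (kv[1][0], kv[1][1])) \
                           .reduceByKey(lambda a, b: a + b)
big_spenders = joined.filter(lambda kv: kv[1][1] > 1000.0).count()

print(revenue_by_country.take(5), big_spenders)
sc.stop()
```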
On the commercial side, Concurrent is one of many companies seeking to simplify Hadoop for enterprise users. “I’ve seen many articles about Hadoop being too complex, and that you need to hire specialized skill,” said Nakamura. “Our approach is the exact opposite. We want to let the mainstream world leverage all they have spent and execute on that strategy without having to go hire specialized skill. Deploying a 50-node cluster is not a trivial task. We solve the higher-order problem: We’re going to help you run and manage data applications.”
The case for enterprise data warehousing
One of the most compelling scenarios for Hadoop and its ilk isn’t as exciting as new applications, but some say it is an economic imperative. With the cost of traditional data warehouses running 20x to 40x higher per terabyte than Hadoop, offloading enterprise data hubs or warehouses to Hadoop running on commodity hardware does more than save money and add scalability; it enables dynamic data schemas for future data science discoveries, rather than the prescriptive, static schemas of traditional relational database management systems (RDBMSes).
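As a back-of-the-envelope illustration of that cost gap (only the 20x-to-40x multiplier comes from the claim above; the per-terabyte price and data volume are hypothetical placeholders):

```python
# Hypothetical back-of-the-envelope comparison using the 20x-40x multiplier
# cited above. The $/TB figure and data volume are made-up placeholders.
hadoop_cost_per_tb = 1_000          # assumed commodity-hardware cost, USD per TB
warehouse_multiplier = (20, 40)     # traditional EDW runs 20x-40x higher per TB
data_volume_tb = 500                # hypothetical warehouse size

hadoop_total = hadoop_cost_per_tb * data_volume_tb
edw_low, edw_high = (hadoop_total * m for m in warehouse_multiplier)
print(f"Hadoop: ${hadoop_total:,}   Traditional EDW: ${edw_low:,} to ${edw_high:,}")
```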
According to the website of Sonra.io, an Irish firm offering services for enterprise Hadoop adoption, “The data lake, a.k.a. Enterprise Data Hub, will not replace the data warehouse. It just addresses its shortcomings. Both are complementary and two sides of the same coin. Unlike the monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling… If implemented properly, this results in a nearly unlimited potential for agile insight.”
The fight for the hearts of traditional enterprise data warehouse (EDW) and RDBMS customers isn’t a quiet one, however. Some of the biggest posturing has come from the great minds behind seminal database technologies, such as Bill Inmon, “the father of data warehousing.”
Inmon has criticized Cloudera advertisements linking Big Data to the data warehouse. “While it is true that the data warehouse often contains large amounts of data, there the comparison to Big Data ends. Big Data is good at gobbling up large amounts of data. But analyzing the data, using the data for integrated reporting, and trusting the data as a basis for compliance is simply not in the cards,” he wrote on his blog.
“There simply is not the carefully constructed and carefully maintained infrastructure surrounding Big Data that there is for the data warehouse. Any executive that would use Big Data for Sarbanes-Oxley reporting or Basel II reporting isn’t long for his/her job.”
Ralph Kimball is the 1990s rival who countered Inmon’s top-down EDW vision with a proposal for small, star or snowflake schema-based data marts that form a composite, bottom-up EDW. Kimball has boarded the Big Data train, presenting a webinar with Cloudera about building a data warehouse with Hadoop.
“The situation that Hadoop is in now is similar to one the data community was in 20 years ago,” said Eli Collins, chief technologist for Cloudera. “The Inmon vs. Kimball debates were about methodologies. I don’t see it as a hotly debated item anymore, but I think it’s relevant to us. Personally, I don’t want to remake mistakes from the past. The Kimball methodology is about making data accessible so you can ask questions. We want to continue to make data more self-service.”
“We can look to history to see how the future is going,” said Jim Walker, director of product marketing at Hortonworks. “There is some value to looking at the Inmon vs. Kimball debate. Is one approach better than another? We’re now seeing a bit more of the Inmon approach. Technology has advanced to the point where we can do that. When everything was strictly typed data, we didn’t have access. Hadoop is starting to tear down some of those walls with schema-on-read over schema-on-write. Technology has opened up to give us a hybrid approach. So there are lessons learned from the two points of view.”
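To illustrate the schema-on-read approach Walker mentions, here is a brief Spark SQL sketch (the path and field names are hypothetical): the structure of the data is discovered when it is read and queried, rather than declared up front as it would be in a schema-on-write warehouse.

```python
# Schema-on-read sketch: the structure of the data is inferred at query time,
# not declared up front with a CREATE TABLE statement (schema-on-write).
# The file path and field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Raw, loosely structured events landed in the data lake as JSON.
events = spark.read.json("hdfs:///lake/raw/clickstream/2015-03-*.json")
events.printSchema()                       # schema inferred from the data itself

# Query immediately, without an upfront modeling step.
events.createOrReplaceTempView("clicks")
spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clicks
    WHERE country = 'IE'
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""").show()
```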
The idea that the data lake is going to completely replace the data warehouse is a myth, according to Eamon O’Neill, director of product management at HP Software in Cambridge, Mass. “I don’t see that happening very soon. It’s going to take time for the data lake to mature,” he said.
“I was just talking to some entertainment companies that run casinos here at the Gartner Business Intelligence and Analytics Summit, and a lot of the most valuable data, they don’t put in Hadoop. It may take years before that’s achieved.”
However, there are cases where it makes sense, continued O’Neill. “It depends on how sensitive the data is and how quickly you want an answer. There are kinds of data you shouldn’t pay a high price for, or that when you query you don’t care if it takes a while. You put it in SAP or Oracle when you want the answer back in milliseconds.”
Data at scale: SQL, NoSQL or NewSQL?
Another bone of contention is the role of NoSQL in solving the scalability limitations of SQL and RDBMSes when faced with Web-scale transaction loads. NoSQL databases sacrifice ACID (atomicity, consistency, isolation and durability) guarantees to increase performance. MongoDB, Apache HBase, Cassandra, Accumulo, Couchbase, Riak and Redis are among the popular NoSQL choices, each offering different optimizations depending on the application (such as streaming, columnar data or scalability).
“One of the brilliant things about SQL is that it is a declarative language; you’re not telling the database what to do, you’re telling it what you want to solve,” said NuoDB’s Proctor, whose company is one of several providing NewSQL alternatives that combine NoSQL scalability with ACID promises. “You understand that there are many different ways to answer that question. One thing that comes out of the new focus on data analysis and science is a different view on the programming model: where the acceptable latencies are, how different sets of problems need different optimizations, and how your architecture can evolve.”
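Proctor’s point about declarative languages is easiest to see side by side. The toy sketch below, using Python’s built-in sqlite3 with hypothetical data, states what result is wanted in SQL and leaves the execution strategy to the engine, while the imperative version spells out every step by hand.

```python
# Toy contrast: declarative SQL vs. an imperative loop over the same data.
# Table name and rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('acme', 120.0), ('acme', 80.0), ('globex', 45.0);
""")

# Declarative: describe the result; the engine picks how to compute it.
declarative = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE amount > 50 GROUP BY customer"
).fetchall()

# Imperative: spell out the scanning, filtering and grouping yourself.
imperative = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    if amount > 50:
        imperative[customer] = imperative.get(customer, 0.0) + amount

print(declarative, imperative)
```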
NuoDB was developed by database architect Jim Starkey, who is known for such contributions to database science as the blob column type, event alerts, arrays and triggers. Its distributed, shared-nothing architecture aggregates opt-in nodes into a SQL/ACID-compliant database that “behaves like a flock of birds that fly in an organized fashion but without a central point of control or a single point of failure,” according to Proctor.
Elsewhere, MIT professor and Turing Award winner Michael Stonebraker, the “father of the modern relational database,” cofounded VoltDB in 2009, another NewSQL offering that claims “insane” speed. The company recently touted a benchmark of 686,000 transactions per second for a Spring MVC-enabled application using VoltDB. Another company he founded, Vertica, was acquired by HP Software in 2011 and added to its Haven Big Data platform.
“Vertica is a very fast, scalable data engine with a columnar database architecture that’s very good for OLAP queries,” explained HP’s O’Neill. “It runs on commodity servers and scales out horizontally. Very much like Hadoop, it’s designed to be a cluster of nodes. You can use commodity two-slot servers, and just keep adding them.”
On the SQL side, HP took the Vertica SQL query agent and put it on Hadoop. “It’s comparable to other SQL engines on Hadoop,” said O’Neill. “It’s for when the customer is just in the stage of exploring the data, and they need a SQL query view on it so they can figure out if it’s valuable or not.”
On the NewSQL side, HP Vertica offers a JDBC key-value API to quickly query data on a single node for high-volume requests returning just a few results.
With the explosion of new technologies, it’s likely that the future will include more database specialization. Clustrix, for example, is a San Francisco-based NewSQL competitor that is focusing on e-commerce and also promoting the resurgence of SQL on top of distributed shared nothing architectures.
SQL stays strong
Meanwhile, movements at Google and Facebook are showing the limitations of NoSQL.
“Basically, the traditional database wisdom is just plain wrong,” said Stonebraker in a 2013 talk at MIT where he criticized standard RDBMS architectures as ultimately doomed. According to him, traditional row-based data storage cannot match column-based storage’s 100x performance increase, and he predicted that online analytical processing (OLAP) and data warehouses will migrate to column-based data stores (such as Vertica) within 10 years.
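As a rough, in-memory illustration of why columnar storage favors analytic scans (the 100x figure is Stonebraker’s claim, not something this toy reproduces): an aggregate over one column reads a single contiguous array instead of walking every field of every row.

```python
# Rough illustration of the row-store vs. column-store trade-off for an
# analytic scan. The data and timings are illustrative, not a benchmark.
import array
import random
import time

n = 1_000_000
# Row layout: one record per row; every query touches whole rows.
rows = [{"id": i, "region": i % 50, "revenue": random.random(), "cost": random.random()}
        for i in range(n)]
# Column layout: each attribute stored as its own contiguous array.
revenue_col = array.array("d", (r["revenue"] for r in rows))

t0 = time.time()
row_total = sum(r["revenue"] for r in rows)        # row store: walk every row
t1 = time.time()
col_total = sum(revenue_col)                       # column store: scan one array
t2 = time.time()

print(f"row scan {t1 - t0:.3f}s, column scan {t2 - t1:.3f}s")
```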
Meanwhile, there’s a race among vendors for the best online transaction processing (OLTP) data storage designs, but classic models spend the bulk of their time on buffer pools, locking, latching and recovery—not on useful data processing.
Despite all that, operational data still relies on SQL—er, NewSQL.
“Hadoop was the first MapReduce system that grabbed mindshare,” said Proctor. “People said, ‘Relational databases won’t scale.’ But then about a year and a half ago, Google turned around and said, ‘We can’t run AdWords without a relational database.’ ”
Proctor was referring to Google’s paper “F1: A Distributed SQL Database that Scales.” In it, Google engineers took aim at the claims made for “eventual consistency,” which could not meet the hard requirements they faced in maintaining financial data integrity.
“…Developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date,” according to the F1 paper. “We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level. Full transactional consistency is one of the most important properties of F1.”
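What that database-level guarantee buys developers can be seen in a minimal sketch (using Python’s built-in sqlite3 and a hypothetical accounts table, not anything from F1 itself): a debit and a credit either commit together or roll back together, with none of the hand-rolled repair logic the F1 authors describe.

```python
# Minimal sketch of database-level transactional consistency: both sides of a
# transfer commit together or roll back together, so the application never has
# to reconcile half-applied state by hand. Table and balances are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER);
    INSERT INTO accounts VALUES ('advertiser', 500), ('platform', 0);
""")

def transfer(src, dst, cents):
    try:
        with conn:  # one ACID transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (cents, src))
            if conn.execute("SELECT balance FROM accounts WHERE id = ?",
                            (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (cents, dst))
    except ValueError:
        pass  # nothing was applied; no compensation code needed

transfer("advertiser", "platform", 200)   # succeeds atomically
transfer("advertiser", "platform", 900)   # fails atomically, balances untouched
print(conn.execute("SELECT * FROM accounts").fetchall())
```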
As the F1 paper explained, “conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use a NoSQL key/value store, and to work around the lack of ACID transactional guarantees and the lack of conveniences like secondary indexes, SQL, and so on. When we sought a replacement for Google’s MySQL data store for the AdWords product, that option was simply not feasible: the complexity of dealing with a non-ACID data store in every part of our business logic would be too great, and there was simply no way our business could function without SQL queries.
“Instead of going NoSQL, we built F1, a distributed relational database system that combines high availability, the throughput and scalability of NoSQL systems, and the functionality, usability and consistency of traditional relational databases, including ACID transactions and SQL queries.”
While this design results in higher commit latency, improvements in the client application have kept observable end-user speed as good as or better than before, according to Google engineers.
In a similar vein, Facebook created Presto, a SQL engine optimized for low-latency interactive analysis of petabytes of data. Netflix’s Big Data Platform team is an enthusiastic proponent of Presto for querying its multi-petabyte data warehouse for things like A/B tests, user streaming experiences and recommendation algorithms.
FoundationDB is another hybrid NewSQL database, whose proprietary NoSQL-style core key-value store can act as universal storage with the transactional integrity of a SQL DBMS. Unfortunately, many FoundationDB users were surprised in March when Apple purchased the ISV, possibly for its own high-volume OLTP needs, and its open-source components were promptly removed from GitHub.
Building better data science teams
As with any technology initiative, the biggest success factors aren’t the tools, but the people using them.
Intuit is a case in point: With the 2013 acquisition of Level Up Analytics, the consumer tax and accounting software maker injected a team of data science professionals into the heart of its business.
“Intuit is going through a massive transformation to leverage data—not just to understand customers, but also to feed it back into our products,” said Lucian Lita, cofounder of Level Up Analytics and now director of data engineering at Intuit. “We’re building a very advanced Big Data platform and putting Intuit on the map in terms of data science. We’re educating internally, contributing to open source and starting to have a good drumbeat.”
As an example, QuickBooks Financing uses data to solve a classic small business problem: New businesses need financing to grow, but they’re so new that they can’t prove they are viable. “Intuit uses Big Data techniques to get attributes of your business, score it, and we should get something like the normal 70% rejection rate by banks turned into a 70% acceptance rate,” said Intuit’s Goldman.
“Small businesses don’t have access to data like Walmart and Starbucks do. We could enable that: Big Data for the little guy,” he said.
The experience of running a data science startup didn’t just translate into being acquired by an established enterprise. It also gave Goldman, Lita and cofounder Anuranjita Tewary, now director of product management at Intuit, insight into how to form effective data science teams.
“When we were working at Level Up Analytics, we spoke with over 100 companies building data products,” Tewary said. “This helped us understand how to structure teams to succeed. It’s important to hire the right mix of skills: product thinking, data thinking and engineering.”
When she looked at struggling data science projects at a multinational bank, a media company and an advertising company, Tewary saw common pitfalls. One of the most frequent? “Treating the data product as a technology project, ending up with not much to show for it and no business impact,” she said.
“It was more, ‘What technology should we have just for the sake of having this technology?’”
Because the tools have gotten easier to use, the idea of having Big Data for Big Data’s sake (and running it in a silo) may not be long for this world.
“It used to be that we were selling tools to IT. Now we think about analytics tools we can sell to business,” said HP’s O’Neill. “Increasingly, data scientists live in the marketing or financial departments, trying to predict what’s going to happen.”
In 1985, the physicist and statistician Edwin Thompson Jaynes wrote, “It would be very nice to have a formal apparatus that gives us some ‘optimal’ way of recognizing unusual phenomena and inventing new classes of hypotheses that are most likely to contain the true one; but this remains an art for the creative human mind.”
That quote is one of the inspirations behind Zoubin Ghahramani’s Automatic Statistician, a project of the Cambridge Machine Learning Group that won a $750,000 Google Focused Research Award. Using Bayesian inference, the Automatic Statistician examines unstructured data, explores possible statistical models that could explain it, and then reports back with roughly 10 pages of graphics and natural language describing the patterns it finds.
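The real system searches over far richer model structures and writes its own report; the toy sketch below, which fits a few polynomial candidates to synthetic data and ranks them with the Bayesian Information Criterion (BIC), only illustrates the underlying idea of automatically proposing and comparing explanations of the data.

```python
# Toy illustration of automated model search. The actual Automatic Statistician
# explores much richer model families and generates a natural-language report;
# this sketch just fits a few candidate models to synthetic data and ranks them.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 3.0 * np.sin(x) + rng.normal(0, 0.5, x.size)   # synthetic data

def bic(residuals, k, n):
    # BIC for least squares with Gaussian errors: n*ln(RSS/n) + k*ln(n)
    rss = float(np.sum(residuals ** 2))
    return n * np.log(rss / n) + k * np.log(n)

candidates = {}
for degree in (1, 2, 3, 5):                      # candidate polynomial models
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    candidates[f"polynomial, degree {degree}"] = bic(resid, degree + 1, x.size)

for name, score in sorted(candidates.items(), key=lambda kv: kv[1]):
    print(f"{name}: BIC = {score:.1f}")          # lower BIC = preferred model
```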
IBM’s Watson and HP’s IDOL are trying to solve the unstructured data problem. The fourth technology on HP’s Haven Big Data platform, IDOL is for parsing “human data”: prose, e-mails, PDFs, slides, videos, TV, voicemail and more. “It extracts from all these media, finds key concepts and indexes them, categorizes them, performs sentiment analysis—looking for tones of voice like forceful, weak, angry, happy, sad,” said O’Neill. “It groups similar documents, categorizes topics into taxonomies, detects language, and makes these things available for search.”
Futuristic as the possibility of parsing both human data and the data “exhaust” from the Internet of Things sounds, the technology may be the easy part. As the textbook “Doing Data Science” by Cathy O’Neil and Rachel Schutt explained, “You all are not just nerds sitting in the corner. You have increasingly important ethical questions to consider while you work.”
Ultimately, user-generated data will form a feedback loop, reinforcing and influencing subsequent user behaviors. According to O’Neil and Schutt, for data science to thrive, it will be critical to “bring not just a set of machine learning tools, but also our humanity, to interpret and find meaning in data and make ethical, data-driven decisions.”
How to find a data scientist
Claudia Perlich knows quite a bit about competition: She’s a three-time winner of the KDD Cup, the ACM’s annual data mining and knowledge discovery competition. Her recommendation for finding a qualified data scientist? Use a platform like Kaggle and make it a contest.
First, however, you must answer the hardest question: What data science problem to solve. “This is typically one of the hardest questions to answer, but very much business-dependent,” she said. Once you have formulated a compelling question, a Kaggle competition “will get you almost surely the highest-performance solution emerging from the work of thousands of competing data scientists. You could even follow up with whoever is doing well and hire them.
“It turns out that many companies are using Kaggle as a screening platform for DS applicants. It is notably cheaper than paying a headhunter, and you can be confident that the person you hire can in fact solve the problem.”