Who asks the questions?
The explosion of interest in data science isn’t limited to quants. “There are different end-user communities for Hadoop. A data scientist or analyst might want to use SAS, R, Python—those are for writing programs. There’s another community from the DBA world, which is using SQL or a business intelligence tool that uses SQL data sources. Then there’s the ETL community, which has traditionally done transforms via graphical tools. They’re using third-party drag-and-drop—reporting software is in there too. Finally, the platform has gotten easier for non-analysts to use, and you can use free-text search to inspect the data. Splunk is the leader there; their product, Hunk, can run directly on Hadoop. For each of these end-user communities, there’s a different set of tools,” said Collins.
As the technology stack has evolved to meet the demands of Hadoop users, the concept of storing everything has grown commonplace. That’s due in part, according to Hortonworks, to new abilities to retrieve and make sense of data coming from industrial sensors, web traffic and other sources. In Hadoop 2.0, YARN replaces the classic Hadoop MapReduce framework. It’s a new application management framework—the Hadoop operating system—and it improves scalability and cluster performance.
“We saw a major shift on October 23 last year with the introduction of YARN. That’s why you heard the terms ‘data lake,’ ‘data reservoir,’” said Hortonworks’ Walker. As a result of YARN, Hadoop has become an architectural center serving many needs, Walker explained. “Spark, Hive—all these things live in the Hadoop world because of the technology of YARN. If my applications need SQL access, I’ll use Hive. If I’m doing machine learning and data science, I’ll use Spark,” Walker continued.
Python pulls ahead
Today, every element of the Apache Hadoop stack—Ambari for provisioning clusters, Avro for data serialization, Cassandra for scalable multi-master databases, Chukwa for data collection, HBase for large table storage, Hive for ad hoc querying, Mahout for machine learning and recommendation engines, Pig for programming parallel computation, Spark for ETL and stream processing, Tez for programming data tasks, and ZooKeeper for coordinating distributed applications—has several contenders vying for users and targeting specific niches.
Take Mahout, a library of scalable machine-learning algorithms named after the Indian term for an elephant rider. Microsoft Azure users, among others, will find this pre-installed on HDInsight 3.1 clusters. But it may not be production-ready, depending on your needs.
“We tested Mahout, but we didn’t get good results,” said Fliptop’s Chiao. “It had to do with the fact that it only had skeleton implementations of some sophisticated machine learning techniques we wanted to use, like Random Forest and AdaBoost.” In its stead, Fliptop chose the Python-based scikit-learn, a mature, actively developed, production-grade machine learning library that has been around since 2006 and, Fliptop decided, offered performance comparable to R.
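The ensemble methods Chiao names are both a few lines of scikit-learn. The sketch below is illustrative only—the dataset is synthetic and the parameters are assumptions, not Fliptop's actual pipeline—but it shows the shape of the API that drew them away from Mahout:

```python
# Illustrative sketch only: synthetic data standing in for a real
# lead-scoring dataset; parameters are assumptions, not Fliptop's code.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The two techniques Fliptop cites: Random Forest and AdaBoost
for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              AdaBoostClassifier(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```

Swapping one estimator for another while keeping the same `fit`/`score` interface is exactly the kind of maturity the "skeleton implementations" in Mahout lacked at the time.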
“This is a great time to be a data scientist or data engineer who relies on Python or R,” blogged Ben Lorica, chief data scientist at O’Reilly Media. Python and R code can run on a variety of increasingly scalable execution engines such as GraphLab, wise.io, Adatao, H2O, Skytree and Revolution R. Speculation swirled in September 2014 when Cloudera acquired Datapad, a company whose co-founder, Wes McKinney, created the Python-based Pandas data analysis library.
“I don’t think of Python as tooling. But there’s been a ton of interest in the Python community around integrating with Hadoop—there’s a large community there. We published an overview of 19 Python frameworks built around Hadoop a few months ago, and that was our number one or two most popular blog post ever,” said Collins.
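The lowest-friction way Python meets Hadoop—and the pattern many of those frameworks wrap—is Hadoop Streaming, where the mapper and reducer are ordinary scripts reading standard input. A minimal word-count sketch (written as plain functions so the same logic runs locally or under `hadoop-streaming.jar`):

```python
# Minimal Hadoop Streaming-style word count. The functions below are a
# local sketch of the mapper/reducer contract; in a real streaming job
# each would be its own script consuming sys.stdin.
import sys
from itertools import groupby

def mapper(lines):
    """Emit a (word, 1) pair for every token in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Sum counts per word. Hadoop's shuffle phase delivers pairs
    sorted by key; sorted() here simulates that locally."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```

Frameworks like the ones surveyed in that blog post largely exist to hide this plumbing—submitting jobs, serializing records, chaining steps—behind more Pythonic APIs.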
Security still a stepchild
New features and sparkly storytelling are all the rage, but security, at least among users, still gets short shrift. An integration vendor recently explained, under cover of anonymity, that security had not even been a concern for his company, which made tools to pass data from IoT devices such as thermostats and sprinkler systems to cloud data stores: “That’s what we depend on Hadoop and Amazon to do,” the vendor said.