At Big Data TechCon in Chicago this week, attendees were treated to a glimpse of the future of Hadoop and large-scale data processing. New offerings such as Apache Apex and Snowflake showed that enterprise options are expanding fast.
Keynote speaker Owen O’Malley, cofounder of Hortonworks and a 10-year veteran of the Hadoop project, gazed into the future of the platform. He detailed the origins of Hadoop and compared them to the plans Hortonworks and the Apache community have for the project.
“Customers are building these very complicated flows that use a lot of different technologies,” he said. “Apache Eagle is an incubator project that does security analytics over Hadoop audit logs. They capture the audit logs out of the servers, put them in Kafka, put that into Spark, use some machine-learning libraries, process it, generate models, throw those into Storm, and run the model versus the incoming data and shove it all into a server and provide those notifications out to the user. That’s great, but it’s a pain in the ass to set up.”
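To give a sense of the hand-wiring O’Malley describes, here is a minimal sketch of just the first hop of such a pipeline: a Spark Streaming job pulling audit-log events off a Kafka topic, using the Spark 1.x Kafka integration that was current at the time. The broker address and topic name are hypothetical, and a real Eagle-style deployment would add feature extraction, model scoring and alerting downstream of this stream.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class AuditLogIngest {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("audit-log-ingest");
    // Micro-batches every five seconds.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092"); // hypothetical broker

    // Pull raw audit-log events from a Kafka topic (topic name is hypothetical).
    JavaPairInputDStream<String, String> auditLogs = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, Collections.singleton("hdfs-audit-logs"));

    // Downstream stages (feature extraction, model scoring, alerting) would hang off this stream.
    auditLogs.map(record -> record._2()).print();

    jssc.start();
    jssc.awaitTermination();
  }
}
```

Every further stage—Spark model training, Storm scoring, the notification store—has to be wired up and operated separately, which is exactly the burden O’Malley wants a packaging system to remove.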
O’Malley then indicated that what the Hadoop ecosystem needs is a packaging system, similar to Debian’s apt-get or RubyGems.
“You’d like to make it like the iTunes App Store, where it comes down, it installs it, and it runs, so it doesn’t take a bunch of Ph.D.s in computer science to get the thing up and running,” he said.
New tools
Elsewhere at the show, DataTorrent demonstrated Apache Apex, a new incubator project aimed at building an enterprise-grade stream processing service on top of Hadoop.
John Fanelli, vice president of product and marketing at DataTorrent, said that Apex was created by a team of experienced Hadoop developers from the original Yahoo team. He added that Apex is designed to handle much of the reliability work that now must be hand-coded into applications for Storm and Spark.
“One thing Spark does for streaming is micro-batch,” said Fanelli. “They grab events and batch them. They ultimately do MapReduce on that, and if an individual batch fails, they’ll rerun it. But if batch No. 3 fails, it may not rerun until after batch No. 6 runs, so you can’t do anything that’s predictive: If this occurred, if that occurred, then do this.
“Also, high availability is a big requirement for enterprises. In both Spark Streaming and Storm, the developer has to code in fault tolerance. The developer has to decide what to save, how often to save it, and what to do on recovery. With Apex, the platform takes care of all that. The developers only write business logic; they don’t write operational code.”
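To illustrate the split Fanelli describes, here is a minimal sketch of an Apex operator in the DataTorrent style: the developer supplies only the per-tuple business logic, while checkpointing, recovery and replay are left to the platform. The operator name and filter condition are hypothetical, and exact package names can vary across Apex releases.

```java
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// A minimal Apex operator: only the per-tuple business logic lives here.
// State checkpointing and recovery are handled by the Apex platform, not the developer.
public class ErrorFilterOperator extends BaseOperator {

  // Emits matching tuples to whatever operator is wired downstream.
  public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();

  // Receives incoming tuples; process() is the only code the developer writes.
  public final transient DefaultInputPort<String> in = new DefaultInputPort<String>() {
    @Override
    public void process(String event) {
      if (event.contains("ERROR")) { // hypothetical filter condition
        out.emit(event);
      }
    }
  };
}
```

In a Storm or Spark Streaming application, the equivalent operator would also need logic deciding what state to save, how often, and how to rebuild it after a failure.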
Snowflake Computing announced a cloud-based enterprise data-warehousing solution. Its service handles scaling and provisioning, allowing developers and administrators to simply pour in their data and have it quickly replicate globally in AWS or on premises.
DBSH demonstrated its continuous data integration system for NoSQL and SQL databases. It uses models to match datasets stored in relational and non-relational systems, allowing developers and DBAs to sync data across the two disparate types of databases.
LexisNexis showed HPCC, its large-scale data-processing platform. As the platform has grown, it has added numerous integrations with storage systems (such as Apache Cassandra and HDFS), allowing it to process data from existing data lakes.
Linoma Software gave attendees a look at its managed file transfer products. While the company offers ways to migrate large stores of data around the world, it was the reverse-proxy capability that it promoted most. With the reverse proxy, data can be pushed into the cloud without needing to open a port in the corporate firewall.
Texifter provided text-mining tools that focus on both traditional mining applications and the shallower, outsourced social-media variety. DiscoverText is the company’s multilingual machine-learning text-analytics cloud platform, which can be used to generate insights, clean messy data, or evaluate the success of marketing campaigns.
Striim demonstrated its end-to-end streaming analytics platform: an in-memory system that processes and transforms data from multiple sources, giving downstream processing queues instant access to clean data.