The Apache Hadoop project has long been big enough to warrant its own conference, now in its ninth year. During the Hadoop Summit keynote, Hortonworks cofounder and chief architect Arun Murthy detailed the company’s plans to improve the security, governance and ease of use of the Hadoop platform and its ancillary support projects.
Those projects continue to pop up within the Apache Software Foundation, with dozens of Hadoop-related efforts underway (such as HBase, Hive and ZooKeeper) alongside those that are only now becoming hot topics at the show, such as Metron, NiFi and Ranger.
Murthy said that Hadoop is no longer about just the data: It’s about the data and data flows as the backing for enterprise applications. For this reason, he said, Hadoop is increasingly partnered with streaming and data ingestion platforms such as Apache Spark, Apache Storm or Apache Kafka.
To this end, Murthy indicated that Hortonworks is working on the concept of Assemblies for future releases of the platform. Assemblies are groupings of platform components that are managed and scaled as a whole, rather than as individual components.
“I don’t have to download Kafka and Storm. If someone’s done it already, I am happy to customize the last 10% to 15%,” said Murthy, indicating that Hortonworks will offer these platforms in pre-integrated Assemblies.
He went on to say that Hortonworks will focus heavily on security and governance for the Hadoop platform in future releases. This will include new features for managing access to data and for low-level data masking.
Adam Wenchel, vice president of data intelligence at CapitalOne, also spoke during the Hadoop Summit keynote. During his time on stage, he detailed the company’s work on the Apache Metron project. Metron was created, he said, to put Hadoop and machine learning to work on cybersecurity problems.
“Metron is a production-ready reference architecture for Big Data cybersecurity,” said Wenchel. “It uses battle tested solutions under the hood. It allows real-time data exploration, enriching streaming data with threat intelligence feeds, and being able to dive down in detail and do [packet capture] analysis—going packet by packet and doing appropriate forensics when you think you might have a malicious actor on your network.”
Metron was created in December 2014 by Cisco, but Wenchel said Hortonworks has become the leading contributor to the project—though he added that CapitalOne has contributed a good deal of code as well. Specifically, CapitalOne contributed code to parse various security-related data streams, such as router logs and proxy information.
MapR used Hadoop Summit to discuss its Spyglass Initiative, an effort to give greater insight into Hadoop operations through customizable and sharable dashboards. Jack Norris, CMO of MapR, said that this year the Hadoop world is focusing more and more on enabling application development on top of data lakes.
As a data platform for the development of enterprise applications, said Norris, Hadoop is about more than just analytics and reports. “I think the center point of the story is that we’re in the middle of the biggest replatforming in the past 30 years for the enterprise stack. It’s not about the language, or the data manipulation. It’s about the data platform itself. We’ve commoditized at hardware and OS level. The containers and Docker are still reliant on knowing where the data is, and one of the biggest constraints is if you have a stateful app or not. If you have a stateful app, most enterprises are saying it’s not appropriate for Docker,” said Norris.
With a proper data layer strategy in place, based on Hadoop being the single target for all information syncing and retrieval, stateful applications can have their data states abstracted by the network architecture rather than by the applications’ architecture. This means developers can build stateful applications with Docker, because the data is stored in its own layer within the environment. Norris called this a converged architecture, as it would include many different Hadoop ecosystem projects, such as Flink, Spark and Storm, working together.
“I think that’s where the real enabling focus is,” said Norris. “If you’ve got that convergence in place, and you have a flexible data model with self-describing schemas, and [our] Open JSON Application Interface…it’s a game-changer. Most organizations have a fixed data model that takes them months to change and modify. If you have a loosely coupled structure based on JSON, you can have websites changing every hour, and it’s not impacting the downstream speed.”