Since it was created in 2011, Storm has garnered a lot of attention from the Big Data and stream-processing worlds. In September 2014, the project finally reached top-level status at the Apache Foundation, making 2015 the first full year in which Storm will be considered “enterprise ready.” But that doesn’t mean there isn’t still plenty of work to do on the project.
Taylor Goetz, Apache Storm project-management committee chair and member of the technical staff at Hortonworks, said that the road map for Storm in 2015 shows a path through security territory. “Initially, Storm was conceived and developed to be deployed in a non-hostile environment. Storm is a distributed system. There are multiple nodes,” he said.
Thus, Storm was not originally designed with security in mind. That’s changing this year, said Goetz, as the team works to add authentication and process controls to the system. Some of that work will be featured in future releases of the project.
“We’ve allowed every process to authenticate against all the other components in that cluster,” he said. “All interactions are authenticated with Kerberos [and] also have the concept of individual users. A unit of computation, in Storm, is called a ‘topology.’ With [the] security features we’ve added, the topology itself runs as the user that submitted it, [which] allows us to implement security.”
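For readers unfamiliar with the abstraction, a topology is a graph of spouts (data sources) and bolts (processing steps) that runs continuously once submitted to the cluster. Below is a minimal sketch, using the pre-1.0 backtype.storm API current in 2015 and Storm’s bundled TestWordSpout, of how a topology is wired up and submitted; under the security features Goetz describes, it would execute as the authenticated user who submitted it.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class PrinterTopology {
    // A trivial bolt that handles each tuple individually as it arrives.
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println(input.getString(0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Emits nothing downstream, so no output fields to declare.
        }
    }

    public static void main(String[] args) throws Exception {
        // Wire a spout (data source) to a bolt (processing step).
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);
        builder.setBolt("print", new PrinterBolt(), 4).shuffleGrouping("words");

        // With security enabled, the submitted topology runs as the
        // Kerberos-authenticated user who submitted it.
        StormSubmitter.submitTopology("printer", new Config(), builder.createTopology());
    }
}
```

Here, shuffleGrouping distributes tuples randomly across the bolt’s four executors; fieldsGrouping would instead route tuples sharing a field value to the same executor.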
Storm is part of an increasingly crowded field of data tools that have clustered around Hadoop. Hortonworks now offers frontline support for the project as well, cementing the usefulness of Storm in Hadoop environments. But that’s not to say there isn’t still some confusion in the marketplace as to what Storm is used for.
Specifically, Goetz said that the use cases for Apache Spark and Apache Storm are different, despite some overlaps in their capabilities.
“Storm is a stream-processing framework that allows one-at-a-time processing or micro-batching,” he said. “Spark is a batch-processing framework that also allows micro-batching.”
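The micro-batching Goetz refers to on the Storm side is exposed through the Trident API, which processes the stream as a series of small batches and layers state-management semantics on top. Here is a sketch of the canonical Trident word count, again against the 2015-era API (FixedBatchSpout and MemoryMapState are test utilities that ship with Storm; a production topology would use a real spout and a durable state backend):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class TridentWordCount {
    // Splits each sentence tuple into one tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // The spout replays two fixed sentences in micro-batches of up to
        // three tuples; Trident processes the stream batch by batch.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("four score and seven years ago"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // Word counts are folded into in-memory state, batch by batch.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count"));

        StormSubmitter.submitTopology("trident-word-count", new Config(),
                topology.build());
    }
}
```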
The advantage of using something like Storm with a Hadoop Big Data system, said Goetz, is that it allows developers and analysts to bridge the gap between the batch jobs that will finish in a few hours and the information that’s coming into the system right this moment.
He added that Hadoop can be used to store all the raw data coming into Storm, while Storm does all the data processing and transformation as it arrives. If errors occur during processing, the administrator can simply rewind and replay the raw data that Hadoop has stored.
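Storm’s hook for that replay is its tuple-tree acknowledgment model: a bolt anchors each emitted tuple to the input it came from and then acks or fails that input, and a failed tuple is re-emitted by the spout, which can re-read the raw record from durable storage such as HDFS. A hedged sketch of a bolt following that pattern (the uppercase transform is just a stand-in):

```java
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;

public class TransformBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            // Stand-in for a real transformation of the raw record.
            String transformed = input.getString(0).toUpperCase();

            // Anchoring the output to the input ties it into Storm's tuple
            // tree, so a failure anywhere downstream fails the whole tree.
            collector.emit(input, new Values(transformed));
            collector.ack(input);
        } catch (Exception e) {
            // Failing the tuple tells the spout to re-emit it; the raw
            // record can be re-read from storage such as HDFS.
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("transformed"));
    }
}
```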
Goetz said that he personally will be spending his time on Storm helping to bring order to its various integration points. The list of connectors should grow as 2015 moves on, he said.