“DataStax offers Apache Cassandra. They used to package it with Hadoop, but now they are packaging only with Spark. SAP HANA is packaging with Spark. I think you are going to see more and more of these storage providers bypassing Hadoop and using [it] when it comes to analytics.”
But Gualtieri thinks there’s a specific class of business that will choose to forgo Hadoop and head straight for Spark. “I think who’s going to say that is a startup with a lot of venture money, and really if you think about it, the relationship between the two is that Spark has no file system, but someone can say ‘I’m going to use just UNIX or an EMC SAN,’ ” he said.
That’s because, at the end of the day, HDFS is still the cheapest way to put petabytes on disk. Even without the rest of Hadoop, many enterprises have already begun migrating to an HDFS data lake, and that momentum shapes the future architecture of a company’s data as a whole.
One company that has ditched HDFS in favor of its own storage medium is IBM. “IBM has really made some investments and moves, not the least of which is Spark running on the Z Series mainframe, which is just amazing,” said Gualtieri. “That’s Spark without Hadoop, and that’s very interesting because that’s where many companies’ transactions are. Now you can do your analytics on the database.”
And mainframes are still where it’s at for many enterprises. While Hadoop has grown dramatically inside many organizations over the past five years, it’s still early days, says Gualtieri.
“The main [enterprise] questions are still about Hadoop. From an enterprise standpoint, they still want to adopt Hadoop. They see that as the first step, but at the same time they understand Spark is part of what I call ‘Hadoop and Friends.’ All the major distributions include Spark now. The cloud providers provide it as well,” said Gualtieri.
Perhaps one reason for Spark’s quick rise to prominence among the field of Apache projects for Big Data on Hadoop is that it simply does more. While it’s clearly an easier way to write processing jobs for a Hadoop cluster, Spark also includes Spark SQL and Spark Streaming.
Spark Streaming is part of a burgeoning movement toward more stable, open-source stream-processing solutions, but hardcore real-time users will likely stick with Apache Storm or move to Apache Flink. Spark Streaming processes incoming data in small batches rather than one event at a time, so it typically adds about a second or so of latency.
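To see where that latency comes from, consider how batch-oriented stream processing behaves: an event cannot be processed until the batch it belongs to closes. The following is a minimal sketch in plain Python, not the Spark API. The function names `micro_batch` and `batch_wait_ms` are hypothetical, and the sketch assumes a fixed batch interval and ignores processing time entirely.

```python
from collections import defaultdict


def batch_close_ms(timestamp_ms, interval_ms):
    """Return the time at which the batch containing this event closes."""
    return (timestamp_ms // interval_ms + 1) * interval_ms


def micro_batch(events, interval_ms):
    """Group (timestamp_ms, payload) events by the close time of their batch."""
    batches = defaultdict(list)
    for timestamp_ms, payload in events:
        batches[batch_close_ms(timestamp_ms, interval_ms)].append(payload)
    return dict(batches)


def batch_wait_ms(events, interval_ms):
    """For each event, how long it waits before its batch can be processed."""
    return [batch_close_ms(ts, interval_ms) - ts for ts, _ in events]


# Hypothetical arrival times in milliseconds.
events = [(100, "a"), (400, "b"), (1200, "c"), (1900, "d")]

print(micro_batch(events, 1000))    # {1000: ['a', 'b'], 2000: ['c', 'd']}
print(batch_wait_ms(events, 1000))  # [900, 600, 800, 100]
print(batch_wait_ms(events, 500))   # [400, 100, 300, 100]
```

Under these assumptions, an event can wait up to one full interval before any work starts, so a one-second interval alone rules out sub-500ms responses. Shrinking the interval reduces the wait, but a real system also pays scheduling and processing overhead on every batch, which this sketch leaves out.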
StreamAnalytix’s Venugopal said, “There are other wonderful advantages of Spark Streaming, like the simplicity of machine learning. But it is not the solution to many problems that other solutions exist for. Low-latency stream processing, such as anything under 500ms, is not a candidate for Spark Streaming. We see enterprises using Storm and Kafka for their streaming stack.