Kyvos Insights has built an interactive analytics solution on top of Hadoop, and Anand said that his team looked for a way to “do fast incremental analytics. There’s capabilities in Spark to do those interactive tasks, and there’s a natural advantage for using Spark’s in-memory computation that can help us in our solutions.”
Two great tastes
Asking whether Spark is replacing Hadoop is largely the wrong question. The better question is which portions of Hadoop are being replaced by Spark. At present, MapReduce is the casualty as users quickly move to Spark, but Hadoop’s underlying data storage layer (HDFS and HBase) is likely not going away any time soon.
Thus, Forrester principal analyst for application development and delivery Mike Gualtieri feels that Hadoop and Spark will remain tied together for some time to come.
“I think Spark and Hadoop make the most sense together. You get the best of both worlds. Hadoop was designed for large volumes, Spark was designed for speed. When the data will fit in memory, use Spark, but if you want long term storage you use Hadoop,” said Gualtieri.
Ion Stoica, CEO and cofounder of Spark company Databricks, feels that Spark can completely replace Hadoop when combined with the right data store. That’s because Spark can run against more than just HDFS.
“We are working well with Hadoop,” he said. “Spark is a data-processing engine, so if people already have their implementation of a data lake or data hub using Hadoop and HDFS, Spark will happily consume that data. However, if we look forward, we do believe we will see more and more instances where Spark will consume data from other data sources. If you’re in the cloud storing data in Amazon S3 or in Microsoft Azure’s Blob Store, there is not a great reason to just spin up a Hadoop cluster in Amazon.”
Stoica went on to say that usage of Spark against existing enterprise storage systems is growing. “The other thing we’re seeing is if you think about many of the different enterprises, they have a storage solution—be it a database or a simple highly reliable data store—and that company wants to provide an analytics solution, until now the default solution was to also sell the first solution for the Hadoop cluster for analytics,” he said.
That’s a big win for companies like EMC, Teradata and NetApp, which have been scrambling to adapt to the new Hadoop world, where enterprise data storage is effectively commoditized.
“Going forward, many of these companies are going to align with Spark, first because it’s a good processing engine, and second is because Spark doesn’t provide a storage engine, it is not competing with storage providers,” said Stoica. “If I am going to be a storage provider and sell a packaged Hadoop cluster, it’ll provide very cheap storage, which will compete with my own solutions.