“We see many large enterprises still…adopting things like Apache Storm for low-latency streaming use cases, so we have taken the approach of abstracting both Spark Streaming and Storm under a single UI layer so enterprises have the freedom of choosing for the right use case.”
Spark SQL, on the other hand, is keeping pace with its main competitor from Cloudera, Apache Impala. According to benchmarks performed by AtScale, Spark SQL and Impala each have their own performance advantages. The benchmarks place both on the interactive-query end of the spectrum, while Hive on Tez sits on the batch-processing end.
AtScale CEO Dave Mariani said his company’s benchmarks showed that “essentially there’s not one SQL-on-Hadoop engine that does it all. There are different workloads: There’s batch and interactive. Batch means we’re computing in aggregate on the cluster. Where you find Big Data with the trillions of rows, tools like Hive on Tez tend to be more stable and be able to consistently produce those results.
“For interactive queries, Spark SQL and Impala are really good at accessing smaller data sets very quickly. Hive on Tez is the tortoise that wins the race. You can’t run a Hive on Tez query and get an answer faster than 10 seconds, versus Impala and Spark SQL, where you can get answers in milliseconds.”
Because of this, Mariani said he sees customers mixing SQL engines: They’ll process the massive cluster data with Hive on Tez, then use Impala or Spark SQL to run interactive queries on the data aggregated by Hive on Tez.
“You need to have more than one engine depending on the workload. There is a reason for these multiple engines to exist,” said Mariani.
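The two-stage pattern Mariani describes can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for both engines (a real deployment would run the first stage on Hive on Tez and the second on Impala or Spark SQL), and the table names, column names and sample rows are invented for the example.

```python
import sqlite3

# Stand-in for the cluster: an in-memory SQLite database (an assumption made
# so the sketch is self-contained; it is not how Hive, Impala or Spark run).
conn = sqlite3.connect(":memory:")

# Hypothetical raw fact table, standing in for the very large data set.
conn.execute("CREATE TABLE raw_sales (region TEXT, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [
        ("east", "2016-01-01", 100.0),
        ("east", "2016-01-01", 50.0),
        ("east", "2016-01-02", 25.0),
        ("west", "2016-01-01", 70.0),
        ("west", "2016-01-02", 30.0),
    ],
)

# Stage 1 (batch): scan all raw rows once and store a much smaller aggregate.
conn.execute(
    """
    CREATE TABLE daily_sales AS
    SELECT region, day, SUM(amount) AS total, COUNT(*) AS orders
    FROM raw_sales
    GROUP BY region, day
    """
)


# Stage 2 (interactive): ad hoc queries read only the aggregate table.
def region_total(region):
    row = conn.execute(
        "SELECT SUM(total) FROM daily_sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]


print(region_total("east"))
```

The point of the split is that the expensive full scan happens once, in the batch stage, while each interactive query touches only the pre-aggregated rows.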
While he said that Impala outperformed Spark SQL when running many queries concurrently, he also said that Spark 1.6 has made major improvements that make life with SQL much easier for developers and data scientists alike.
“There’s been a dramatic improvement in Spark SQL’s ability to not just perform quickly, but also to not fail on large data sets. 1.6 is significantly improved on both fronts. They basically rewrote the internals of how they do query processing, and they improved the joins and join functionality dramatically,” he said.
It’s just easier
At its core, Spark is about making large-scale data processing jobs easier in every direction. The benefits are most evident when Spark is compared to MapReduce. In fact, Spark is quickly replacing MapReduce simply because it puts the power of the Hadoop cluster directly into the hands of the data scientist, without the need for a Java developer in between.