The Apache Software Foundation has promoted the Spark project from the Apache Incubator to top-level project status, less than a month after the release of Spark 0.9.0, which the developers said was the project's largest release yet.
A sign of that growth is that version 0.9.0 included contributions from 83 developers, a hefty number of outside contributors for a project that was, at the time, still in the Apache Incubator. Now, as a top-level project, Spark joins the Apache Hadoop family of projects, the single largest umbrella at the Apache Software Foundation.
Bill Bain, CEO of ScaleOut Software, said, “Spark is a very exciting and important new technology. We’re seeing a very rapid rate of contributions to it as an open-source technology.”
Spark is positioned as a replacement for MapReduce, Hadoop’s core processing engine. Its overall goal is to increase the speed of a Hadoop cluster and to ease the development of jobs that run on top of HDFS. To that end, Spark keeps working data in memory across the cluster so that queries and jobs can execute in seconds rather than minutes.
Spark also has other goals. The project aims to enable workloads for which vanilla Hadoop would be ill-suited or too slow, including machine learning, streaming and interactive queries.
Because Spark targets current versions of Hadoop, it is not tied to MapReduce. In fact, version 0.9.0 adds GraphX, a new tool for graph processing within Spark. Spark also supports stream processing with fault tolerance built in, which lets streaming applications stay up and running even when nodes in the cluster crash.
With all that Hadoop-style data held in memory, Spark can act very quickly on the data it is processing. Thus, the oft-discussed scenario of analyzing a customer’s past purchases while they are still in the store, before they make a new one, is more likely to finish in time on Spark than on Hadoop.
In-memory data storage and processing is not unique to Spark, however. Spark is just one of many tools that have taken this route for enterprise data and analytics processing.
Henry Sohn, vice president of operations at DataTorrent, said, “We do see that the industry, overall, is moving toward faster analytics and being able to get at the insights from that data in a much shorter timeframe. The fact that Big Data has been around for some time, and due to the nature of the technology, previously the insights you’d gain from that were necessarily batched. It’s reminiscent of the old punch-card-driven machines, when you would literally wait for computers to come back to you.
“But that’s all changing. There is new software coming out that allows you to get the value from that data much faster. It used to take eight hours or longer. Those days are finishing. You’re seeing a lot of solutions geared around faster analytics, and the space is occupied by in-memory query or in-memory SQL Engines.”
And while all these in-memory solutions, both new and old, have their own specific use cases, the overall trend in analytics and data processing is pushing everyone, including non-Hadoop users, into memory.