Apache Spark 1.6, which shipped yesterday, offers performance enhancements ranging from faster processing of the Parquet data format to more efficient streaming state management.
As a large-scale data processing platform, Apache Spark has untethered itself from the Hadoop platform, and as a result it can be used against key-value stores and other types of databases. Still, Hadoop remains a large part of the Spark target ecosystem, so faster processing of data in the Apache Parquet format will accelerate Spark deployments that work with Hadoop systems.
Apache Parquet is a columnar storage format that works across the Hadoop ecosystem. The 1.6 release of Apache Spark introduces a new Parquet reader that bypasses the existing parquet-mr record assembly routines, which had previously eaten up a lot of processing cycles. The change promises an almost 50% improvement in scan speed.
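Because the new reader sits behind Spark SQL's existing Parquet data source, applications pick up the speedup without code changes. Here is a minimal sketch of a Parquet read in Spark 1.6, where the file path and column names are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Standard Spark 1.6 setup: a SparkContext wrapped in a SQLContext.
val sc = new SparkContext(new SparkConf().setAppName("ParquetScan"))
val sqlContext = new SQLContext(sc)

// The same DataFrame call as in earlier releases; the faster
// Parquet reader is applied under the hood.
val events = sqlContext.read.parquet("/data/events.parquet") // hypothetical path
events.select("userId", "eventTime").show() // hypothetical columns
```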
Apache Spark 1.6 also includes better memory management. Previously, Spark split available memory into two fixed regions, one for execution and one for caching, and worked within those static boundaries. Now, the memory manager in Spark can automatically tune the size of the different memory regions, and the runtime will grow and shrink them according to each one's specific needs.
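The unified manager is enabled by default in 1.6. A minimal configuration sketch follows, with values that are illustrative assumptions rather than tuning recommendations:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MemoryTuning")
  // Fraction of the heap shared between execution and cached data
  // under the unified memory manager (defaults to 0.75 in 1.6).
  .set("spark.memory.fraction", "0.75")
  // Portion of that shared region shielded from eviction for cached data.
  .set("spark.memory.storageFraction", "0.5")
  // Set to "true" to revert to the static pre-1.6 split.
  .set("spark.memory.useLegacyMode", "false")
```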
The state management API in Spark Streaming has been redesigned in this release. Version 1.6 is the first to include the new mapWithState API, which scales linearly with the number of updates in a batch rather than with the total number of records being tracked. This allows it to process only the deltas instead of constantly rescanning the full dataset.
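Below is a minimal sketch of mapWithState for a running word count, following the pattern in Spark's own streaming examples; the socket source and checkpoint path are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("StatefulWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/checkpoint") // state tracking requires a checkpoint directory

// Hypothetical source: a socket stream of text, keyed by word with a count of 1.
val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// The mapping function is invoked only for keys updated in the current batch,
// so cost scales with the number of updates, not the total number of tracked keys.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

val runningCounts = words.mapWithState(StateSpec.function(mappingFunc))
runningCounts.print()

ssc.start()
ssc.awaitTermination()
```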
Speaking of datasets, Apache Spark 1.6 includes the new Dataset API, which brings compile-time type safety to DataFrames. An extension of the existing DataFrame API, Datasets support static typing and user functions that run directly on existing Scala or Java types.
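A minimal sketch of the typed API in 1.6; the case class and sample values are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Long)

val sc = new SparkContext(new SparkConf().setAppName("DatasetDemo"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // brings toDS() and the required encoders into scope

// Create a typed Dataset; a typo such as _.agee fails at
// compile time rather than at run time, unlike a DataFrame column name.
val people = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()
val adults = people.filter(_.age > 21).map(_.name)
adults.show()
```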
For data scientists, Apache Spark 1.6 has improved its machine-learning pipeline. The Pipeline API offers functionality to save and reload pipelines to and from persistent storage. Apache Spark 1.6 also increases algorithm coverage in machine learning, adding support for univariate and bivariate statistics, bisecting k-means clustering, online hypothesis testing and survival analysis, along with support for reading non-standard JSON data.
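A minimal sketch of the new pipeline persistence, assuming a simple text-classification pipeline; the stages and the save path are illustrative:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Assemble a simple tokenize -> hash -> classify pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// New in 1.6: persist the pipeline definition to storage...
pipeline.save("/models/text-pipeline") // hypothetical path

// ...and reload it later, even from a different application.
val restored = Pipeline.load("/models/text-pipeline")
```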