Work on Apache Arrow has been progressing rapidly since its inception earlier this year, and Arrow is now emerging as the open-source standard for columnar in-memory data representation, enabling fast vectorized processing and interoperability across the Big Data ecosystem.
Background
Apache Parquet is now the de facto standard for columnar storage on disk, and building on that success, popular open-source Big Data projects (Calcite, Cassandra, Drill, HBase, Impala, Pandas, Parquet, Spark, etc.) are adopting Arrow as the standard for in-memory columnar representation.
The data-processing industry is also following this trend by moving to vectorized execution (DB2 BLU, MonetDB, Vectorwise, etc.) on top of columnar data.
In addition, many open-source SQL engines are at different stages of the transition from row-wise to column-wise processing. Drill already uses a columnar in-memory representation; Hive, Spark (via SparkSQL), and Presto all have partial vectorization capabilities; and Impala plans to add columnar vectorization soon.
This brings in-memory representation in line with the storage side, where columnar is the new standard. For example, Parquet is commonly used for immutable files on a distributed file system like HDFS or S3, while Kudu is another columnar option suitable for mutable datasets. As this shift is underway, it is crucial to arrive at a standard for in-memory representation that enables efficient interoperability between these systems.
Key benefits of standardization
If the benefit of vectorized execution is faster processing of ever-growing datasets, then standardizing this representation enables even greater performance by making interoperability faster as well. The applications of such standardization are endless, so we will focus on a few concrete examples.
Portable and fast User-Defined Functions (UDFs): One common way data scientists analyze data is with Python libraries like NumPy and Pandas. As long as the data fits in the memory of a single machine, this is efficient and fast. Once it no longer fits, however, one must switch to a distributed environment. PySpark addresses the latter case by allowing distributed execution of Python code on Spark. But because Spark runs on the JVM while Python runs in its own native process, considerable overhead is incurred serializing and deserializing objects as they move between the two runtimes. This overhead arises because the representation both systems can understand is usually their lowest common denominator, chosen for compatibility rather than efficiency.
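As a concrete illustration, here is a minimal sketch of the row-at-a-time pattern described above, using the standard PySpark UDF API (the column names and data are made up for the example). Each value is serialized from the JVM to a Python worker process, passed through the function, and the result is serialized back.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-overhead").getOrCreate()
df = spark.createDataFrame([(i, float(i)) for i in range(1000)], ["id", "value"])

# A plain Python UDF: values cross the JVM/Python boundary one row at a time,
# incurring serialization and deserialization on every call.
plus_one = udf(lambda v: v + 1.0, DoubleType())

df.withColumn("value_plus_one", plus_one("value")).show(5)
```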
Arrow solves this by providing a columnar representation that combines data processing efficiency with standardization so that no conversion between formats is necessary. The internal representation for each system is the same, and as such, it also becomes the communication format. No serialization/deserialization is involved, and not even a single copy is made. Since Arrow record batches are immutable, they can be safely shared as read-only memory without requiring copies from one context to another.
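To make the idea tangible, here is a minimal sketch using pyarrow's streaming IPC API (the current API, which has evolved considerably since the 0.1.0 release discussed in this post; the column contents are illustrative). The bytes written to the buffer are laid out exactly as the record batch is laid out in memory, so a reader can reference them directly rather than decode them.

```python
import pyarrow as pa

# An immutable Arrow record batch with two columns.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([0.1, 0.2, 0.3])],
    names=["id", "value"],
)

# Write the batch to an in-memory buffer using the Arrow streaming IPC format;
# the wire layout is the same as the in-memory layout.
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
buf = sink.getvalue()

# Read it back: the reader maps the buffer's memory directly, with no
# per-value decoding or copying.
reader = pa.ipc.open_stream(buf)
same_batch = reader.read_next_batch()
```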
And because the format is standard, UDFs become universal. Once execution engines have integrated with Arrow, the same function library can be used by all of them with the same efficiency.
Python-native bindings in Arrow 0.1.0: Thanks to many contributors, including Wes McKinney (creator of the Pandas library) and Uwe Korn (data scientist at Blue Yonder), the first release of Arrow (0.1.0) has native C++-backed integration between Python, Pandas, Arrow and Parquet (limited to flat schemas for now).
Arrow Record Batches can be manipulated as Pandas dataframes in memory, and they can be saved and loaded from Parquet files using the Parquet-Arrow integration.
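For example, the round trip looks roughly like the following sketch, written against current pyarrow and pandas APIs (the file name and columns are illustrative):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Pandas -> Arrow (flat schemas only in early releases), then Arrow -> Parquet on disk.
table = pa.Table.from_pandas(df)
pq.write_table(table, "example.parquet")

# Parquet -> Arrow -> Pandas for further analysis.
df_again = pq.read_table("example.parquet").to_pandas()
```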
One use case is to use your favorite SQL-on-Hadoop solution (Drill, Hive, Impala, SparkSQL, etc.) to slice and dice your data, aggregating it to a manageable size before undertaking a more detailed analysis in Pandas.
The next step is Arrow-based UDFs, which position Arrow as the new language-agnostic API (JVM, C++, Python, R). Arrow-based UDFs are portable across query engines and eliminate the overhead of communication and serialization.
Future developments
Here are some applications of Arrow we expect to see in the near future.
Arrow-based Python UDFs for Spark: Now that we have efficient Java and Python implementations of Arrow, we can remove the serialization overhead incurred by distributed Python execution in Spark and enable vectorized execution of Python UDFs.
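The shape of such a UDF changes from per-row scalars to per-batch columns. Here is an illustrative sketch (not an existing Spark API at the time of writing) of a function that operates on an entire Arrow record batch exposed as a Pandas Series, letting NumPy's vectorized kernels do the work:

```python
import pandas as pd

def plus_one_batch(values: pd.Series) -> pd.Series:
    # One call processes an entire record batch worth of rows,
    # instead of one serialized row at a time.
    return values + 1.0

# Simulating what an Arrow-aware engine would do for each incoming batch:
batch = pd.Series(range(1_000_000), dtype="float64")
result = plus_one_batch(batch)
```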
Faster storage integration: We can make the integration between Kudu and SQL engines drastically more efficient by exchanging Arrow data directly, eliminating the column-to-row and row-to-column conversions that currently happen on each side.