The database revolution happened a few years back, when NoSQL options stormed the world. Today, however, a second wave of innovation in data storage has been unleashed in the form of Apache Arrow. Arrow defines a standard for columnar in-memory analytics and will provide a unified data structure, algorithms and cross-language bindings.
The overall goal of Arrow is to ensure that systems like SAP HANA and Apache Spark are compatible and can move data between in-memory columnar storage systems without lengthy and costly data transformations.
Jacques Nadeau, vice president of Apache Arrow and CTO and cofounder of Dremio (a startup still in stealth mode), wrote a blog entry on Dremio’s website describing the goals of the Arrow project.
Nadeau pointed out that many other Apache projects have already attempted to solve some of the problems with in-memory columnar storage, and that by the end of the year, many such projects will be Arrow-compatible.
“Leading execution engines are continuously seeking opportunities to improve performance. Developers in the Drill, Impala and Spark communities have all separately recognized the need for a columnar in-memory approach to data,” wrote Nadeau.
“[Apache] Drill developers have implemented a columnar in-memory analytics layer inside Drill known as ValueVectors. Drill provides an execution layer that performs SQL processing directly on columnar data without row materialization. The combination of optimizations for columnar storage and direct columnar execution significantly lowers memory footprints and provides faster execution of BI and SQL workloads.”
He went on to write that Apache Impala and Apache Spark developers had both realized the need for something like Arrow, and that they had both begun work on similar efforts, known as CIDR and Tungsten, respectively.
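To make the columnar idea in Nadeau's description concrete, here is a minimal sketch in plain Python. It is not Arrow's or Drill's actual API or memory format; the sample data, column names and function names are hypothetical and chosen only for illustration. It contrasts a row-oriented layout with a layout that keeps one contiguous typed buffer per column, and shows an aggregate computed from a single column without building row objects.

```python
from array import array

# Hypothetical sample data, for illustration only.
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 3, "amount": 5.25},
]

# Columnar layout (an assumption for this sketch, not Arrow's format):
# one contiguous, typed buffer per column instead of one object per row.
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),  # 64-bit signed ints
    "amount": array("d", (r["amount"] for r in rows)),    # 64-bit floats
}


def total_row_oriented(records):
    """Sum 'amount' by visiting every row object and picking out one field."""
    return sum(record["amount"] for record in records)


def total_columnar(cols):
    """Sum 'amount' by scanning only that column's buffer; no rows are built."""
    return sum(cols["amount"])


if __name__ == "__main__":
    print(total_row_oriented(rows))
    print(total_columnar(columns))
```

Both functions return the same total, but the columnar version touches only the data the query needs. That is the general property the quoted passage credits with smaller memory footprints and faster BI and SQL workloads.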