The database revolution happened a few years back, when NoSQL options stormed the world. Today, however, a second wave of innovation in data storage has been unleashed in the form of Apache Arrow. Arrow defines a standard for columnar in-memory analytics, and will provide a unified data structure, algorithms and cross-language bindings.
The overall goal of Arrow is to make systems like SAP HANA and Apache Spark compatible, so that in-memory data can move between multiple columnar storage systems without lengthy and costly data transformations.
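To make the cost argument concrete, here is a minimal sketch of two components reading one column from the same memory. It uses only Python's standard library, not Arrow itself; the names `prices`, `shared_view` and `column_sum` are hypothetical and chosen for illustration. The assumption is that when both sides agree on the buffer layout, the consumer can read the producer's data in place, with no copy and no conversion step.

```python
from array import array

# Producer side: one column of 64-bit floats in a single contiguous buffer.
prices = array("d", [1.5, 2.5, 4.0])

# Consumer side: a view over the same memory through the buffer protocol.
# Nothing is copied or re-encoded; both names refer to the same bytes.
shared_view = memoryview(prices)


def column_sum(view):
    """Aggregate a column directly from the shared buffer."""
    return sum(view)


# A write by the producer is visible through the consumer's view,
# which shows that the consumer never received a separate copy.
prices[0] = 10.0
```

The design point is the shared layout: because both sides interpret the bytes the same way, handing over the column costs nothing beyond passing a reference.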
Jacques Nadeau, vice president of Apache Arrow and CTO and cofounder of Dremio (a startup still in stealth mode), wrote a blog entry on Dremio’s website describing the goals of the Arrow project.
Nadeau pointed out that many other Apache projects have already attempted to solve some of the problems with in-memory columnar storage, and that by the end of the year, many such projects will be Arrow-compatible.
“Leading execution engines are continuously seeking opportunities to improve performance. Developers in the Drill, Impala and Spark communities have all separately recognized the need for a columnar in-memory approach to data,” wrote Nadeau.
“[Apache] Drill developers have implemented a columnar in-memory analytics layer inside Drill known as ValueVectors. Drill provides an execution layer that performs SQL processing directly on columnar data without row materialization. The combination of optimizations for columnar storage and direct columnar execution significantly lowers memory footprints and provides faster execution of BI and SQL workloads.”
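The difference between row and columnar layouts that Nadeau describes can be sketched in a few lines. This is an illustration using only Python's standard library, not Drill's ValueVectors or Arrow's actual data structures; the field names and the functions `total_row_oriented`, `total_columnar` and `total_columnar_filtered` are hypothetical.

```python
from array import array

# Row-oriented layout: one record object per row.
rows = [
    {"region": "east", "amount": 120.0},
    {"region": "west", "amount": 75.5},
    {"region": "east", "amount": 30.25},
]

# Column-oriented layout: one sequence per field, with the numeric
# column stored as a contiguous, typed array of 64-bit floats.
columns = {
    "region": ["east", "west", "east"],
    "amount": array("d", [120.0, 75.5, 30.25]),
}


def total_row_oriented(records):
    # Every full record is visited just to read one field.
    return sum(record["amount"] for record in records)


def total_columnar(cols):
    # Only the "amount" column is read; no row objects are built.
    return sum(cols["amount"])


def total_columnar_filtered(cols, region):
    # Filter on one column and aggregate another, still without
    # materializing rows.
    return sum(
        amount
        for name, amount in zip(cols["region"], cols["amount"])
        if name == region
    )
```

Both layouts give the same totals. The columnar version touches only the fields a query needs, which is the property behind the smaller memory footprint and faster BI and SQL execution mentioned in the quote.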
He went on to write that Apache Impala and Apache Spark developers had both realized the need for something like Arrow, and that they had both begun work on similar efforts, known as CIDR and Tungsten, respectively.
“The Big Data community has recognized an opportunity to develop a shared technology to address columnar in-memory analytics, and has joined forces to create Apache Arrow,” wrote Nadeau. “The Apache Drill community is seeding the project with the Java library, based on Drill’s existing ValueVectors technology, and Wes McKinney, creator of Pandas and Ibis, is contributing the initial C/C++ library. Given the credentials of those involved as well as code provenance, the Apache Software Foundation decided to make Apache Arrow a Top-Level Project, highlighting the importance of the project and community behind it.”
Arrow is decidedly the future of in-memory columnar storage, at least for the Apache Software Foundation, wrote Nadeau. “Drill, Impala, Kudu, Ibis and Spark will become Arrow-enabled this year, and I anticipate that many other projects will embrace Arrow in the near future as well. Arrow community members (including myself) will [be] speaking at upcoming conferences, including Strata San Jose, Strata London and numerous meetups,” he wrote.
Apache Arrow is already available from the Apache GitHub repository.