The Apache Software Foundation has announced Apache Parquet as a Top-Level Project.
Apache Parquet, a columnar Hadoop storage format, is used across Big Data processing frameworks, data models and query engines from MapReduce and Apache Spark to Apache Hive, Impala and others. Parquet’s ascension to a Top-Level Project signifies the health and maturity of the open-source community around it, according to Apache Parquet Project Management Committee member and Cloudera software engineer Ryan Blue, who also sees the move as a vote of confidence in the project.
Blue said what makes Parquet unique in the Hadoop ecosystem is its bring-your-own-object model.
“Lots of applications are based on existing row-oriented formats, like Avro and Thrift, that come with objects to represent the data,” said Blue. “A great feature of Parquet is that it is built to work natively with those existing classes, so you don’t have to change the application to go from a row-oriented to a column-oriented format. Parquet can read directly to Avro records, Spark data frames, Hive’s internal writeables, and others.”
Parquet’s open object model and language-agnostic format definition have influenced its integration throughout most of the Hadoop ecosystem, added Julien Le Dem, vice president of the Apache Parquet project. The format is currently used in production at companies and organizations including Cloudera, Netflix, Twitter, Stripe and NASA’s Jet Propulsion Laboratory.
“[The open object model and language-agnostic format] make it easy to experiment with the many query engines and frameworks that exist today,” Le Dem said. “You don’t need to import data into a proprietary storage to analyze when there’s a lot of data that’s really important.”
Going forward, Blue said members of other Apache projects such as Drill, Presto and Hive are collaborating on a vectorized API for accessing Parquet data. Le Dem said Parquet’s graduation to a Top-Level Project represents the last step in making Parquet a community-driven standard.
“Parquet is not controlled by a single company but is the result of the collaboration of many contributors including teams developing query engines, as well as companies that have adopted it,” said Le Dem. “We’re looking towards those query engines to continue to deeply integrate with the columnar format to reap all the benefits. We’re also improving the APIs so that the benefits of Parquet apply as well to custom applications developed in Hadoop, for example using [the] optimized filter in MapReduce.”