The Apache Software Foundation (ASF) has announced that the big data analysis library Apache DataSketches is now a top-level project. Becoming a top-level project indicates that DataSketches’ community has been well-governed under the ASF’s processes and principles.

“We are excited to be part of the ASF,” said Lee Rhodes, vice president of Apache DataSketches. “We have learned a great deal from the incubation process and look forward to working with new users of our library that want to take advantage of sketching technology.”

The project provides a library of streaming algorithms, otherwise known as sketches. According to the ASF, sketches are ideal for queries that cannot afford the resources needed to generate exact results, which makes DataSketches useful wherever approximate answers are acceptable.
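
To give a rough sense of what a sketch does, here is a toy distinct-count sketch in Python. It is not the DataSketches API and not necessarily how the library's own sketches work; it is a minimal K-Minimum-Values illustration, and the class name `KMVSketch`, its methods, and the default `k` are all assumptions made up for this example. It keeps only the `k` smallest hash values seen in a stream, estimates the number of distinct items from them, and can be merged with another sketch without going back to the raw data.

```python
import hashlib


class KMVSketch:
    """Toy K-Minimum-Values sketch for approximate distinct counting.

    Illustrative only: a hypothetical class, not part of Apache DataSketches.
    """

    def __init__(self, k=1024):
        self.k = k
        self.hashes = set()  # at most k smallest normalized hash values seen

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        # Duplicates hash to the same value, so the set ignores them.
        self.hashes.add(self._hash(item))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))

    def merge(self, other):
        # Assumes both sketches were built with the same k.
        merged = KMVSketch(self.k)
        merged.hashes = set(sorted(self.hashes | other.hashes)[: self.k])
        return merged

    def estimate(self):
        # Fewer than k distinct hashes retained: the count is exact.
        if len(self.hashes) < self.k:
            return float(len(self.hashes))
        # Otherwise estimate from the k-th smallest hash value.
        return (self.k - 1) / max(self.hashes)


if __name__ == "__main__":
    monday, tuesday = KMVSketch(), KMVSketch()
    for user_id in range(0, 60_000):
        monday.update(user_id)
    for user_id in range(40_000, 100_000):
        tuesday.update(user_id)
    # Distinct users across both days, computed from the sketches alone.
    print(round(monday.merge(tuesday).estimate()))
```

Each sketch here stores at most 1,024 numbers regardless of how many items it has seen, and the merged sketch retains exactly the same values as one built over the combined stream, which is why a distinct count, a metric that cannot simply be summed, can still be combined after the fact.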

Key benefits include speed, efficiency, parallelization, optimization in large-scale computing environments, binary compatibility, expanded analysis, and mathematically defined and proven error properties. 

The project was started at Yahoo in 2012. It was first open-sourced in 2015 and entered the Apache Incubator in March 2019. It is used at companies such as Nielsen Identity, Permutive, Splice Machine, and Verizon Media.  

“Sketches are fundamental to calculating many of our key company metrics,” said Tom Miller, director of software development engineering at Verizon Media. “It allows us to greatly simplify our data processing and reduce storage costs by allowing us to calculate non-additive metrics across user specified dimension combinations at report time instead of having to either retain raw data or pre-calculate for each set of dimensions.”