Earlier this week, LinkedIn announced it was open-sourcing AvroTensorDataset, a “TensorFlow dataset for reading, parsing, and processing Avro data.” Apache Avro is the primary storage format that LinkedIn uses for its training data.
According to LinkedIn, its machine learning workloads were bottlenecked by the need to read multiple terabytes of input data. The company says AvroTensorDataset can speed up data preprocessing by multiple orders of magnitude.
The tool was built internally at LinkedIn, where it has been in production for over a year. The company decided to open-source the project so that others could see similar performance gains in their own training workloads.
LinkedIn says that with this tool it has improved processing speed by 162x compared to existing solutions and decreased overall training time by 66%.
“ATDSDataset is LinkedIn’s solution to efficiently read Avro data into TensorFlow. Through multiple performance enhancements, we were able to speed up I/O throughput by orders of magnitude over existing Avro reader solutions. Our team at LinkedIn worked closely with the TensorFlow I/O community to open-source this feature, and we hope that by open-sourcing it, the TensorFlow community can also benefit from these performance enhancements,” Jonathan Hung, staff software engineer at LinkedIn, wrote in a blog post.