Databricks today announced Delta Lake, an open-source project designed to bring reliability to data lakes for both batch and streaming data. The project was revealed during the Spark + AI Summit taking place this week in San Francisco.
Data lakes are used as repositories for structured and unstructured data, but factors such as failed writes, schema mismatches and a lack of consistency can undermine the reliability of the data. In an email to SD Times about the announcement, the company explained that failed writes occur when a write job fails partway through, which, Databricks said, is inevitable in large, distributed environments. “What is needed is a mechanism that is able to ensure that either a write takes place completely or not at all (and not multiple times, adding spurious data). Failed jobs can impose a considerable burden to recover to a clean state,” the company wrote.
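The all-or-nothing behavior the company describes can be illustrated with a generic pattern: stage the data in a temporary file, then swap it into place in a single step. The sketch below is only an illustration under that assumption. It is not Delta Lake's mechanism, and the `atomic_write` helper and the file names are hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Stage the bytes in a temporary file in the same directory, so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the staged file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # A failed write leaves the destination untouched; drop the staging file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the job fails before the final swap, the destination still holds the previous complete version and no partial output remains, so a retry does not have to clean up first.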
Schema mismatches occur when data has missing fields, or when data producers change the schema without making downstream consumers aware. “If not handled properly, data pipelines could drop data or store them in an inconsistent way. The ability to observe and enforce schema would serve to mitigate this,” Databricks explained.
Databricks Delta automatically validates that the schema of the DataFrame being written is compatible with the schema of the table, according to the company. Columns that are present in the table but not in the DataFrame are set to null. If the DataFrame contains extra columns that are not present in the table, the write throws an exception. Databricks Delta provides DDL (data definition language) statements to add new columns explicitly, as well as the ability to update the schema automatically.
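The two validation rules described above can be sketched in plain Python. This is a toy model of the rules as the company states them, not Delta Lake or Spark code; the `enforce_schema` function, the `SchemaMismatchError` class and the column names are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class SchemaMismatchError(Exception):
    """Raised when incoming rows carry columns the table does not have."""


def enforce_schema(table_columns: List[str], rows: List[Row]) -> List[Row]:
    """Align incoming rows to the table's columns, or reject the whole write."""
    aligned = []
    for row in rows:
        # Rule 2: extra columns that the table lacks cause the write to fail.
        extra = sorted(set(row) - set(table_columns))
        if extra:
            raise SchemaMismatchError(f"unexpected columns: {extra}")
        # Rule 1: columns missing from the incoming row are filled with None (null).
        aligned.append({col: row.get(col) for col in table_columns})
    return aligned
```

For a table with columns `id`, `name` and `country`, the row `{"id": 1, "name": "Ada"}` is stored with `country` set to `None`, while a row that also carries an `age` field raises `SchemaMismatchError` and nothing is written.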
Finally, a lack of consistency can result from organizations mixing batch and streaming data. “Trying to read data while it is being appended to provides a challenge since on the one hand there is a desire to keep ingesting new data while on the other hand anyone reading the data prefers a consistent view,” the company wrote. “This is especially an issue when there are multiple readers and writers at work. It is undesirable and impractical, of course, to stop read access while writes complete or stop write access while reads are in progress.”
“Today nearly every company has a data lake they are trying to gain insights from, but data lakes have proven to lack data reliability. Delta Lake has eliminated these challenges for hundreds of enterprises. By making Delta Lake open source, developers will be able to easily build reliable data lakes and turn them into ‘Delta Lakes’,” Ali Ghodsi, cofounder and CEO at Databricks, said in a statement. Ghodsi is one of the creators of the Apache Spark analytics engine for big data and machine learning.
In the announcement, Databricks wrote: “With Delta Lake, developers will be able to undertake local development and debugging on their laptops to quickly develop data pipelines. They will be able to access earlier versions of their data for audits, rollbacks or reproducing machine learning experiments. They will also be able to convert their existing Parquet, a commonly used data format to store large datasets, files to Delta Lakes in-place, thus avoiding the need for substantial reading and re-writing.”
The Delta Lake project is one of the initiatives Databricks is undertaking as part of an increased focus on software developers, according to Databricks cofounder and chief architect Reynold Xin. “Developers are being pulled in in finserv, healthcare, retail and even medium-sized organizations,” Xin told SD Times in an interview prior to the announcement. “Because of the cloud, and the amounts of data being collected, there’s a need to make it simple to analyze.”
The Delta Lake project can be found at delta.io and is released under the Apache 2.0 license.