Data Intelligence Platforms revolutionize data management by employing AI models to deeply understand the semantics of enterprise data; we call this data intelligence. They build on the foundation of the lakehouse – a unified system to query and manage all data across the enterprise – but automatically analyze both the data (contents and metadata) and how it is used (queries, reports, lineage, etc.) to add new capabilities. Through this deep understanding of data, Data Intelligence Platforms enable:

  • Natural Language Access: Leveraging AI models, DI Platforms enable working with data in natural language, adapted to each organization’s jargon and acronyms. The platform observes how data is used in existing workloads to learn the organization’s terms and offers a tailored natural language interface to all users – from non-experts to data engineers (a sketch of this idea follows the list).
  • Semantic Cataloguing and Discovery: Generative AI can understand each organization’s data model, metrics and KPIs to offer unparalleled discovery features or automatically identify discrepancies in how data is being used.
  • Automated Management and Optimization: AI models can optimize data layout, partitioning and indexing based on data usage, reducing the need for manual tuning and knob configuration.
  • Enhanced Governance and Privacy: DI Platforms can automatically detect, classify and prevent misuse of sensitive data, while simplifying management using natural language.
  • First-Class Support for AI Workloads: DI Platforms can enhance any enterprise AI application by allowing it to connect to the relevant business data and leverage the semantics learned by the DI Platform (metrics, KPIs, etc.) to deliver accurate results. AI application developers no longer have to “hack” intelligence together through brittle prompt engineering.
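
A minimal sketch of the natural language access idea above: before a question is handed to a text-to-SQL model, the platform can inject the organization-specific semantics it has learned from existing workloads. Every name here (LEARNED_GLOSSARY, expand_jargon, translate_to_sql) is a hypothetical illustration, not a DatabricksIQ API.

```python
# Hypothetical sketch: resolving organization-specific jargon before
# natural-language-to-SQL translation. The glossary stands in for semantics a
# DI platform would learn automatically from queries, dashboards and lineage.

LEARNED_GLOSSARY = {
    "ARR": "annual recurring revenue: 12 * SUM(amount) over active subscriptions",
    "EMEA": "rows where region_code IN ('DE', 'FR', 'GB')",
}

def expand_jargon(question: str, glossary: dict[str, str]) -> str:
    """Annotate a question with learned definitions so the text-to-SQL model
    grounds acronyms and metrics in the organization's own semantics."""
    notes = [f"{term} = {meaning}" for term, meaning in glossary.items() if term in question]
    return question + ("\n\nContext: " + "; ".join(notes) if notes else "")

question = "What was ARR in EMEA last quarter?"
prompt = expand_jargon(question, LEARNED_GLOSSARY)
# `prompt` would then go to the platform's text-to-SQL model (translate_to_sql
# in this sketch); the key point is that the model sees the learned semantics,
# not just the raw jargon.
```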

Some might wonder how this is different from the natural language Q&A capabilities that BI tools have added over the last few years. BI tools represent only one narrow (although important) slice of overall data workloads, and as a result have no visibility into the vast majority of workloads in an organization, or into the data’s lineage and uses before it reaches the BI layer. Without visibility into these workloads, they cannot develop the deep semantic understanding required. As a result, these natural language Q&A capabilities have yet to see widespread adoption. With data intelligence platforms, BI tools will be able to leverage the underlying AI models for much richer functionality. We therefore believe this core functionality will reside in data platforms.

At Databricks, we’ve been building a data intelligence platform on top of the data lakehouse and have grown increasingly excited about the possibilities of AI in data platforms as we have added individual features. We build on the existing unique capabilities of the Databricks lakehouse as the only data platform in the industry with (1) a unified governance layer across data and AI and (2) a single unified query engine that spans ETL, SQL, machine learning and BI. In addition, we’ve leveraged our acquisition of MosaicML to build the AI models in a Data Intelligence Engine we call DatabricksIQ, which fuels all parts of our platform.

DatabricksIQ already permeates many of the layers of our current stack. It is used to:

  • Tune knobs throughout the platform, including automatically indexing columns and laying out partitions, strengthening the foundation of the lakehouse. This provides lower TCO and better performance for our customers.
  • Improve governance in Unity Catalog (UC) by automatically generating descriptions and tags for all data assets in UC. These are then leveraged to make the whole platform aware of jargon, acronyms, metrics and semantics, enabling better semantic search, higher AI assistant quality and stronger governance (see the sketch after this list).
  • Improve the generation of Python and SQL in our AI assistant, powering both text-to-SQL and text-to-Python.
  • Make those queries much faster by incorporating predictions about the data into query planning in our Photon query engine.
  • Provide optimal autoscaling and minimize cost inside Delta Live Tables and Serverless Jobs, based on predictions about the workload.
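
To ground the Unity Catalog point above, here is a hedged sketch of reviewing and curating generated metadata from a notebook. It assumes a Databricks environment (where a `spark` session is provided) and standard Unity Catalog SQL (COMMENT ON, SET TAGS, information_schema); the table, schema and tag names are hypothetical and the exact syntax is illustrative.

```python
# Hedged sketch: reviewing and curating AI-generated metadata in Unity Catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks

table = "main.sales.orders"  # hypothetical catalog.schema.table

# Inspect the description proposed for the table.
spark.sql(
    "SELECT table_name, comment FROM main.information_schema.tables "
    "WHERE table_schema = 'sales' AND table_name = 'orders'"
).show(truncate=False)

# Accept or refine the generated description, and tag the asset so semantic
# search and governance policies can use it.
spark.sql(f"COMMENT ON TABLE {table} IS 'Customer orders; one row per line item.'")
spark.sql(f"ALTER TABLE {table} SET TAGS ('domain' = 'sales', 'contains_pii' = 'false')")
```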

Last, but perhaps most importantly, we believe that data intelligence platforms will greatly simplify the development of enterprise AI applications. We are integrating DatabricksIQ directly with our AI platform, Mosaic AI, to make it easy for enterprises to create AI applications that understand their data. Mosaic AI now offers multiple capabilities to directly integrate enterprise data into AI systems, including:

  • End-to-end RAG (Retrieval Augmented Generation) to build high-quality conversational agents on your custom data, leveraging the Databricks Vector Database for “memory” (see the sketch after this list).
  • Training custom models either from scratch on an organization’s data, or by continued pretraining of existing models such as MPT and Llama 2, to further enhance AI applications with deep understanding of a target domain.
  • Efficient and secure serverless inference on your enterprise data, connected to Unity Catalog’s governance and quality monitoring functionality.
  • End-to-end MLOps based on the popular MLflow open source project, with all produced data automatically actionable, tracked and monitorable in the lakehouse.
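
As a concrete companion to the RAG bullet above, here is a minimal retrieval sketch. It assumes the databricks-vectorsearch Python client and a pre-built vector index over enterprise documents; the endpoint and index names are hypothetical, and the call shapes are meant as an illustration of the documented client rather than a definitive recipe.

```python
# Minimal RAG retrieval sketch (hypothetical index and endpoint names).
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="docs_endpoint",              # hypothetical endpoint
    index_name="main.kb.support_docs_index",    # hypothetical UC index
)

question = "How do we calculate net revenue retention?"

# Retrieve the most relevant governed document chunks as grounding context.
hits = index.similarity_search(
    query_text=question,
    columns=["doc_id", "chunk_text"],
    num_results=5,
)

# Response format abbreviated: each row carries the requested columns plus a score.
rows = hits["result"]["data_array"]
context = "\n\n".join(row[1] for row in rows)

prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
# `prompt` would then be sent to a model served through Mosaic AI (for example,
# a fine-tuned Llama 2 endpoint) to produce a grounded answer.
```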

We believe that AI will transform all software, and data platforms are one of the areas most ripe for innovation through AI. Historically, data platforms have been hard for end users to access and for data teams to manage and govern. Data intelligence platforms are set to transform this landscape by directly tackling both of these challenges – making data much easier to query, manage and govern. In addition, their deep understanding of data and its use will be a foundation for enterprise AI applications that operate on that data. As AI reshapes the software world, we believe that the leaders in every industry will be those who leverage data and AI deeply to power their organizations. DI Platforms will be a cornerstone for these organizations, enabling them to create the next generation of data and AI applications with quality, speed and agility.