In a recent keynote at the Databricks Data + AI Summit, co-founder Reynold Xin raised pertinent questions about the future of online transaction processing (OLTP) databases. He pointed out that these systems remain largely unchanged since the 1990s, characterized by a monolithic architecture that combines compute and storage on large machines. This traditional setup often leads to over-provisioning, scaling difficulties, and performance bottlenecks.
Databricks is addressing these issues with its new product, Lakebase, which separates compute from storage. The approach makes transactional databases more flexible and efficient to operate, particularly in the context of emerging technologies like agentic AI. By decoupling these components, Databricks aims to redefine how developers interact with databases, making them more adaptable to modern needs.
Purpose-Built for AI
Xin emphasized that conventional approaches to database integration are not conducive to AI applications. He contrasted how developers add features in software, by simply creating a new branch of the code, with the cumbersome task of cloning a production database, which can take days. Lakebase aims to streamline this by enabling near-instantaneous branching: developers can create a clone of a database in less than a second, significantly enhancing workflow efficiency.
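To make the contrast concrete, here is a minimal sketch of what near-instant branching could look like from a developer's side. The endpoint, project identifier, and response fields are hypothetical illustrations, not Databricks' documented API.

```python
import time
import requests  # assumes the requests library is installed

# Hypothetical control-plane endpoint and identifiers; Lakebase's actual
# API surface may differ. This only illustrates the workflow.
API = "https://api.example-lakebase.com/v1"
PROJECT_ID = "prod-orders-db"
TOKEN = "..."  # placeholder credential

def create_branch(name: str) -> dict:
    """Ask the control plane for a copy-on-write branch of the primary.

    Because the branch shares the parent's storage pages until they are
    modified, the call returns in well under a second instead of the
    hours or days a full dump-and-restore clone can take.
    """
    resp = requests.post(
        f"{API}/projects/{PROJECT_ID}/branches",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"name": name, "parent": "main"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # assumed to include a connection string for the branch

start = time.perf_counter()
branch = create_branch("feature-new-checkout-flow")
print(f"branch ready in {time.perf_counter() - start:.2f}s")
# A feature branch of code (or a CI job) can now point at
# branch["connection_string"] and test against production-shaped data.
```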
Built on open-source Postgres, Lakebase leverages a novel architecture that separates storage and compute, enabling systems to scale effectively for concurrent users and larger datasets. The architecture supports various open storage formats, such as Parquet, which are compatible with machine learning tools and libraries. This separation not only reduces costs but also allows for a copy-on-write capability, meaning additional storage costs are incurred only when changes are made.
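Because the data sits in open formats, the same files can be picked up directly by everyday analytics and machine learning tooling. A minimal sketch, assuming a local Parquet snapshot whose path and columns are invented for illustration:

```python
import pandas as pd  # pyarrow or fastparquet must be installed for Parquet support

# Hypothetical snapshot exported from the transactional database into an
# open Parquet file; file path and column names are illustrative only.
orders = pd.read_parquet("snapshots/orders.parquet")

# The same frame can feed reporting queries or an ML feature pipeline
# without a separate ETL copy of the data.
daily_revenue = (
    orders.assign(order_date=pd.to_datetime(orders["created_at"]).dt.date)
          .groupby("order_date")["amount"]
          .sum()
)
print(daily_revenue.tail())
```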
Streaming Is Changing Enterprise Data Needs
As streaming data becomes increasingly integral to enterprise operations, the separation of compute and storage is proving essential. Mohan noted that this shift is paving the way for applications to scale infinitely, prompting new considerations regarding evaluation, observability, and data semantics. The need for accuracy in AI outputs is paramount, especially when dealing with critical data sources like payroll systems.
Databricks aims to take ownership of the entire data lifecycle—from creation to consumption—ensuring that data remains within its ecosystem. This approach allows for rapid reporting and analytics, addressing the challenges faced by businesses that require timely insights from their data.
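One way to picture that creation-to-consumption flow is a small job that reads newly written rows from the operational Postgres side and lands them as Parquet for reporting. The connection string, table, and watermark below are invented for illustration; in Lakebase itself the synchronization is handled by the underlying storage layer rather than by client code like this.

```python
import pandas as pd
from sqlalchemy import create_engine, text  # assumes SQLAlchemy and a Postgres driver are installed

# Illustrative connection string and table; not Lakebase-specific.
engine = create_engine("postgresql+psycopg2://app:secret@localhost:5432/orders")

# Pull the rows created since the last sync (watermark kept deliberately simple).
recent = pd.read_sql(
    text("SELECT id, customer_id, amount, created_at FROM orders WHERE created_at > :since"),
    engine,
    params={"since": "2025-01-01"},
)

# Land them in an open format where BI and ML tools can read them immediately.
recent.to_parquet("orders_since_2025-01-01.parquet", index=False)
print(f"synced {len(recent)} rows for reporting")
```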
A “Disaggregation of Storage and Compute”
The integration of Lakebase with Databricks’ existing infrastructure combines the familiarity of Postgres with the scalability of a modern serverless architecture. This synergy not only enhances the developer experience but also aligns with the operational maturity of Databricks’ Data Intelligence Platform. Mohan highlighted that the acquisition of Neon provides Databricks with a competitive edge, enabling the development of applications that can efficiently handle vast amounts of data.
With the rise of AI agents, each capable of conducting experiments on codebases and databases, the separation of storage and compute becomes even more critical. Xin pointed out that this architecture allows for extensive experimentation at minimal cost, fostering innovation and agility within organizations. The underlying storage framework facilitates high-throughput data synchronization across various data lakes, further enhancing operational efficiency.
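A rough sketch of that experiment loop, reusing the hypothetical create_branch helper from the earlier example along with a matching delete_branch; the migration and check queries are likewise invented:

```python
import psycopg2  # standard Postgres driver; assumes it is installed

# create_branch(name) and delete_branch(branch_id) are the hypothetical
# helpers sketched earlier; they stand in for whatever branching API the
# platform actually exposes.

MIGRATION = "CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);"
CHECK = "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"

def run_experiment(name: str) -> str:
    """Let an agent try a schema change on a throwaway copy-on-write branch.

    Because the branch shares unmodified pages with production, creating
    and discarding it costs almost nothing, so many experiments can run
    in parallel without touching the primary.
    """
    branch = create_branch(name)
    try:
        conn = psycopg2.connect(branch["connection_string"])
        conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction
        with conn.cursor() as cur:
            cur.execute(MIGRATION)
            cur.execute(CHECK)
            plan = "\n".join(row[0] for row in cur.fetchall())
        conn.close()
        return plan
    finally:
        delete_branch(branch["id"])  # the experiment leaves no trace on production

print(run_experiment("agent-try-customer-index"))
```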
As enterprises navigate this evolving landscape, the focus on data evaluation and reliability will become increasingly important. The nuances of language and semantics in AI will necessitate a deeper examination of model accuracy, ensuring that businesses can trust the insights derived from their data.