Sixteen years ago, as I embarked on my PhD journey at UC Berkeley, my advisor shared a thought that resonated deeply: “OLTP databases are a solved problem. They work. Focus on analytics.” This was during a time when the potential for collecting vast amounts of data—both structured and unstructured—was just beginning to unfold, alongside the burgeoning field of machine learning, now often referred to as AI. Following this guidance, I joined my cofounders on a research project that ultimately evolved into Apache Spark, and later, Databricks.
As we developed Databricks, we explored various databases and soon discovered that OLTP databases were far from being a solved issue. They were cumbersome, challenging to scale, and remarkably fragile. Our frustration led us to ponder what an OLTP database would look like if we were to design it from scratch today. This inquiry sparked the creation of Lakebase, our innovative serverless Postgres database.
This article delves into the architecture of Lakebase OLTP. We begin by examining the storage layer of traditional monolithic databases to identify the pain points, then explore how Lakebase reconfigures these components into independent, externalized services. Finally, we introduce LTAP, which allows transactions and analytics to operate on a single data copy in real-time, eliminating the delays and costs associated with Change Data Capture (CDC) or data mirroring.
The database as a monolith
The majority of databases currently in use are monolithic in nature, including MySQL, Postgres, and classic Oracle. Lakebase, built on Postgres—which, interestingly, also originated at Berkeley—serves as our primary example. In these systems, the two critical components stored on disk are the write-ahead log (WAL) and the data files.
When a transaction is committed, the database does not immediately rewrite the data files, as that would be inefficient due to the scattered nature of the rows across the file. Instead, it appends a description of the change to the WAL, a sequential log on disk. A transaction is deemed committed once this log entry is durably written, with the actual data files updated asynchronously later.
To simplify, the WAL is designed to make writes fast and safe, while the data files are optimized for reads. The log allows for a swift transaction commitment through a single sequential append, avoiding the need for random I/O. The data files enable quick query responses by reading the current state directly, rather than replaying the entire database history. For those interested in the intricate details of this design, the 69-page ARIES paper provides an in-depth exploration, albeit with a warning about its complexity.
However, this monolithic architecture presents several challenges:
- Data loss from misconfiguration: A commit’s durability is contingent on the disk flush. If the database or operating system is misconfigured, a write to the WAL may be acknowledged before it is flushed to durable media, risking data loss during power failures or kernel panics.
- Data loss from node failure: Both the WAL and data files reside on a single machine. If that machine’s disk fails, the data is lost. While techniques like RAID-1/RAID-10 can enhance durability, they do not fundamentally resolve this issue.
- Scaling reads necessitates physical cloning: When a single machine cannot handle traffic, the typical solution is to add a read replica, which involves creating a full physical copy of the database, a process that can be time-consuming and resource-intensive.
- High availability requires physical duplication: To survive primary node loss, at least one additional standby node is needed, which incurs additional costs and setup time.
- Analytics compete with transactional traffic: Heavy analytical queries can degrade the performance of latency-sensitive transactional workloads, leading to suboptimal performance.
These issues stem from the same root cause: the monolithic architecture ties durability to a single machine’s disk, necessitating physical cloning for scaling and availability, while workloads interfere due to shared resources.
Lakebase architecture
If tasked with redesigning an OLTP database today, one would leverage modern cloud components: affordable, highly durable cloud object storage combined with elastic compute. This was the approach taken by the Neon team, forming the foundation of Lakebase.
The pivotal change involves making Postgres compute instances stateless. This is achieved by externalizing the WAL and data files from local disks into purpose-built, independently scalable services. The compute layer transforms into a stateless Postgres engine, capable of being started, stopped, and replicated without owning the data.
Let’s explore how these two storage services collaborate to address the aforementioned challenges while maintaining performance.
Scaling writes: WAL becomes SafeKeeper
In a monolithic system, durability is achieved by flushing writes to local disk. In contrast, Lakebase externalizes the WAL to a distributed storage service known as SafeKeeper. Instead of relying on disk flushes, a commit is made durable by replicating the log record across a quorum of SafeKeeper nodes using Paxos-based network replication. This eliminates the risk of data loss due to disk failure or misconfiguration.
One might wonder if this transition increases write latency due to the additional network hop. The answer is no. For any serious Postgres deployment prioritizing durability and availability, synchronous replication is already necessary, which involves that same network hop. In fact, the combination of SafeKeeper and PageServer can yield 5X higher write throughput and 2X lower read latency.
Scaling reads: data files become PageServer
The data files are transitioned to another distributed storage service called PageServer. The WAL is streamed from SafeKeeper into the PageServer, which asynchronously applies changes to its version of the data, materializing pages into low-cost cloud object storage (the lake). The PageServer functions similarly to a write-through cache for the underlying object storage.
This setup mirrors the WAL-data files relationship of the monolith, but now the two components exist in separate, independently scalable services connected via the network. When a page is requested from the PageServer, if the latest version is not available, the PageServer applies logs from SafeKeeper to reconstruct the current state.
Again, one might ask if moving data files from local disks to PageServer increases read latency. The answer remains no for practical purposes. The system is designed to minimize latency impact through aggressive, multi-layered caching. When fetching a page, Postgres first checks its buffer pool in local memory. If the page is absent, it looks in a local disk cache, resorting to the PageServer only in case of a cache miss. As compute nodes can be configured with local memory and disk capacities akin to a monolithic setup, local cache hit rates remain unchanged, ensuring that read latency is comparable to that of a monolith while benefiting from virtually limitless storage.
What this unlocks
With the WAL residing in SafeKeeper and data files in PageServer, a multitude of capabilities that were previously challenging or impossible in monolithic systems become inherent to the architecture. The following features are already available as part of the Lakebase product on both Databricks and Neon:
- Still Postgres: This remains authentic Postgres, ensuring that the wire protocol, SQL, drivers, and extensions function seamlessly.
- Unlimited storage: Data is stored in cloud object storage rather than on provisioned local disks, effectively removing capacity ceilings.
- Serverless, elastic compute: As compute is stateless, it can scale up instantly under load and down to zero when idle, eliminating costs associated with idle infrastructure.
- Durable writes and zero data loss: A commit is durable once replicated across SafeKeeper nodes via Paxos, independent of any single local disk.
- Simpler high availability: High availability no longer requires maintaining a second full physical clone, as the durable state resides in a replicated storage layer independent of compute instances.
- Instant branching, cloning, and recovery: Cloning a large production database becomes a metadata operation rather than a physical copy, allowing for rapid experimentation and recovery.
While separating compute from storage is not a novel concept, Lakebase distinguishes itself by storing operational data on commodity object storage in an open format. This opens avenues for other engines to access the data directly, leading to LTAP.
LTAP: one copy for transactions and analytics
Thus far, we have focused on enhancing a single operational database—making it more durable, elastic, and cost-effective. However, with data residing in an externalized storage layer, we can transcend the traditional separation of transactional databases and analytical systems.
Returning to the PageServer, it already streams changes from the WAL and asynchronously materializes pages into object storage. This materialization step is the ideal moment to address a longstanding issue. In a Lakebase environment, data in object storage is initially written in Postgres’s native page format, which is optimized for transactions but not for analytics. Consequently, analytical engines often face conversion costs or rely on separate data copies maintained by pipelines, leading to governance challenges.
We recently introduced LTAP, which stands for Lake Transactional/Analytical Processing, to eliminate the dual data copy dilemma. The key innovation is to unify the two realms at the storage layer instead of the engine layer. We retain the best tools for each task: Postgres for transactions, with full ACID semantics, and Lakehouse engines for analytics. The change lies in the data beneath them—rather than maintaining two copies in different formats, there is now one durable copy in open columnar formats like Delta and Iceberg, stored as Parquet, accessible to both sides with various caching levels for enhanced performance.
Materializing in columnar form
This section requires a deeper understanding of Postgres internals. As the PageServer materializes pages into object storage, it converts Postgres data from a row format into Parquet’s columnar layout. The exact Postgres representation of every value is preserved, allowing any Postgres-compatible engine to reinterpret it without information loss. This approach differs from CDC, which transmits a stream of logical change events into a foreign schema, discarding Postgres’s physical and transactional semantics. Here, we maintain those semantics.
The PageServer layer’s spare CPU handles the row-to-columnar transcoding during materialization, adding no burden to the Postgres compute serving transactions. To efficiently serve transactional reads, the PageServer still materializes traditional row-based pages in a local cache, but this serves strictly as a performance cache. The underlying durable store remains unified in the lake, accessible by both transactional and analytical engines.
Preserving Postgres semantics in columnar form hinges on two factors: the type system and multi-versioning.
- Type system: Most Postgres types map directly to native Parquet types. Any values lacking a lossless columnar counterpart are retained alongside the original columns in a structured overflow field, ensuring the canonical Postgres text is accessible for reconstruction.
- Multi-versioning: Postgres retains every row version observable by transactions, enabling snapshot isolation and point-in-time recovery. Open table formats provide table-wide consistent snapshots without intermediate row versions. By separating durability from visibility, every row materialized to columnar retains its physical heap address, allowing full reconstruction of heap pages. Postgres indexes are served and rebuilt from the hot cache tier, while intermediate row versions are preserved for MVCC semantics and eventual garbage collection.
This design yields a significant advantage: columnar data compresses more efficiently than row data, often by over tenfold. This conversion reduces the volume of data traversing the network between the caching layer and object store, making the storage path more economical. We even dual-write both row and columnar formats during the transitional rollout of LTAP to ensure data integrity.
Reading the latest data without affecting Postgres
A critical challenge is ensuring freshness. If analytics read from a lake copy, how do they access data committed moments ago that has yet to be materialized? This issue often undermines “just point analytics at the lake” designs, so it’s essential to explore how LTAP addresses it.
When an analytical query initiates—such as from the Lakehouse//RT product recently announced—it first queries Postgres for the current LSN, the log sequence number marking the exact position in the WAL. This metadata lookup is efficient. With the LSN, the analytical engine retrieves most data, including everything materialized up to that point, directly from object storage. Only a small set of recent changes not yet materialized is fetched from the PageServer and merged on top.
The outcome is a consistent, fully up-to-date read of data as of that LSN. The majority of processing occurs on scalable object storage, and critically, Postgres itself handles minimal analytical read traffic, merely returning the LSN. This ensures that transactional workloads remain unaffected by large analytical queries.
For very small tables containing only a few rows, we forgo converting them to columnar form and creating associated Iceberg metadata, as the overhead would outweigh the benefits. Such tables remain present and queryable as part of the single copy.
Every table, automatically
Given the significance of this challenge, there has been considerable market noise surrounding the integration of OLTP and analytics. Traditional approaches like CDC involve replicating data from OLTP storage into a separate analytics storage tier, often referred to as “mirroring” or “zero ETL.”
In CDC or mirroring, the replication pipeline incurs costs, making it impractical to apply universally across all tables. Users must explicitly select which tables to replicate, and this process typically introduces delays.
LTAP eliminates the need for opting in. Every existing table is, by design, already in the lake and queryable. There is no list of replicated or mirrored tables, as replication is nonexistent. Instead, there exists a single governed copy of the data in open formats, with no ETL pipeline to manage or monitor. The transactional and analytical engines can scale independently, each tailored to its specific workload. With no data movement and no second copy, the two views remain synchronized: analytics always reads the same data just written by the application.
For a visual representation of how LTAP integrates, consider watching this demo from the Data and AI Summit.
What about HTAP?
For those familiar with the field, it’s evident that LTAP is a deliberate reference to HTAP: hybrid transactional/analytical processing. HTAP has long been the holy grail of database engineering, aiming to create a single engine capable of handling both transactional and analytical workloads.
However, no widely adopted HTAP database system has emerged. The challenges stem from several factors:
- Incomplete feature set: Developing a new proprietary engine to perform a single task is already a multi-year endeavor. Attempting to create a single engine that excels at multiple tasks compounds the investment required to achieve a mature feature set.
- No ecosystem: Postgres and Spark are at the center of extensive ecosystems comprising drivers, extensions, tools, and decades of operational knowledge. A new engine starts outside this ecosystem, limiting its utility.
- No performance isolation: Many HTAP systems run transactions and analytics on the same hardware, leading to contention for CPU and memory resources. This results in analytical queries starving transactional workloads.
All these challenges arise from the decision to unify the two workloads into a single engine. Lakebase and LTAP circumvent these issues by integrating at the storage layer while employing distinct compute engines for each workload, leveraging their complete feature sets and ecosystem support, along with ensuring performance isolation.
Closing thought
When we first proposed the Lakebase architecture last year, we anticipated that it would unlock unlimited storage, elastic compute, durable writes, simpler high availability, and instant branching—insights gleaned from our experiences with the Neon platform. These capabilities followed almost automatically once the WAL was housed in SafeKeeper and the data files in PageServer.
The LTAP concept emerged later, as the Neon and Databricks teams collaborated to tackle the long-standing challenge of executing analytics on the most current transactional data. As we refine LTAP and roll it out in the coming months, all Lakebase tables will be readily available for analytics, performing at levels comparable to Lakehouse data.
What excites me most is the future. While LTAP represents a natural progression, the same architecture opens numerous optimization opportunities to decouple other resource-intensive maintenance operations from core transactional workloads. We are just beginning to explore the possibilities this architecture presents, and we look forward to sharing our discoveries.
Acknowledgment: I extend my gratitude to the Lakebase team for bringing the concepts discussed in this blog to life, reviewing the content, and ensuring the technical details are accurate.