How we loaded a petabyte into PostgreSQL before New Year — and what happened next

It all began as a lighthearted conversation by the office coffee machine, and what started as a joke quickly grew into a serious endeavor. Amid talk of hardware tuning and transaction throughput, we decided on an audacious experiment: testing PostgreSQL’s capabilities with a one-petabyte database. The challenge was set on December 10, with a report due by January 20 and the holiday season looming just around the corner.

The Hardware Side of the Story

Creating a database of this magnitude is still largely theoretical in production environments. While databases of hundreds of terabytes are becoming more common, petabytes remain in the realm of research labs. However, as data volumes continue to swell, the landscape is changing. To test PostgreSQL’s resilience under such pressure, we first needed to secure a petabyte of storage. Unfortunately, spare petabytes were not readily available in the office.

A quick survey of the market revealed that while we could purchase 15 TB disks and assemble them into a 2U chassis, the costs and delivery times were prohibitive. With the New Year approaching, we sought a more agile solution: the cloud. We reached out to several providers with a straightforward request:

  • Guaranteed 1 PB of unified storage, no thin provisioning.
  • Only SSDs; no HDDs allowed.
  • Access for multiple machines, ideally no fewer than three.
  • Each VM should have between 4 and 16 vCPUs and at least 64 GiB of RAM.

However, a surprise awaited us. Every cloud provider we contacted fell short of our primary requirement: a single contiguous chunk of storage. The alternative of cobbling together storage from different regions would have complicated performance testing. Thus, we pivoted back to renting physical servers, only to encounter the same issue: no data center had sufficient fast disk space readily available.

Just as we were about to abandon the project, a promising data center responded. They offered us seven servers that met our needs, and we decided to proceed with a distributed database solution using Shardman across these machines.

What is Shardman?

Shardman is a distributed database engine developed by Postgres Professional over the past five years. Its core functionality revolves around partitioning tables and distributing those partitions across nodes in a shared-nothing architecture, all while presenting a unified SQL interface. It maintains full ACID compliance, is optimized for OLTP workloads, and allows any node to serve as an entry point for queries.
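
To make the "any node is an entry point" idea concrete, here is a minimal sketch (not code from the project) of how a client might talk to such a cluster: it keeps a list of node DSNs, connects to whichever one it likes, and issues plain SQL, while the engine routes the query to the right partitions. Host names, credentials, and the table name below are placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"math/rand"

	_ "github.com/lib/pq" // the standard PostgreSQL wire protocol is all the client needs
)

func main() {
	// DSNs of the cluster nodes; any of them can accept queries.
	nodes := []string{
		"postgres://bench@node1:5432/bench?sslmode=disable",
		"postgres://bench@node2:5432/bench?sslmode=disable",
		"postgres://bench@node3:5432/bench?sslmode=disable",
	}

	// Pick a node at random: the SQL interface and the visible data are the same everywhere.
	db, err := sql.Open("postgres", nodes[rand.Intn(len(nodes))])
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A query against a distributed table (the name is illustrative) is
	// transparently routed to the partitions living on other nodes.
	var total int64
	if err := db.QueryRow("SELECT count(*) FROM accounts").Scan(&total); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows visible from this entry point:", total)
}
```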

With the hardware secured, we faced a tight timeline: only two weeks remained until the New Year.

The Benchmark

With hardware in place, we needed a reliable benchmark to measure performance. Enter YCSB (Yahoo! Cloud Serving Benchmark), specifically its Go implementation by PingCAP. This NoSQL benchmark, while simplistic, suited our needs well: it uses a classic key-value model with no complex joins and no distributed transactions.

The benchmark creates a single table, userdata, which can be partitioned and therefore fits naturally with Shardman. Each row occupies roughly 1100 bytes, so reaching a petabyte meant loading around a trillion rows within our limited timeframe. That works out to pushing roughly one gigabyte per second, a challenging yet achievable target with modern disks.
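
As a quick sanity check on those numbers, here is a back-of-the-envelope sketch using only the figures above: roughly 1100 bytes per row, a 1 PB target, and about 1 GB/s of sustained ingest.

```go
package main

import "fmt"

func main() {
	const (
		rowBytes   = 1100.0 // approximate size of one userdata row
		targetSize = 1e15   // 1 PB in bytes (decimal)
		ingestRate = 1e9    // ~1 GB/s of sustained loading
	)

	rows := targetSize / rowBytes
	days := targetSize / ingestRate / 86400

	fmt.Printf("rows needed: %.2e (on the order of a trillion)\n", rows)
	fmt.Printf("loading time at ~1 GB/s: %.1f days\n", days)
	// Prints roughly 9.1e11 rows and ~11.6 days of non-stop loading,
	// which is why the holiday deadline mattered so much.
}
```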

Alongside YCSB, we also wrote our own tests with pg_microbench, which let us choose exactly which shard to connect to for reads and writes. Our test flow was to run shardman.global_analyze() to collect statistics, follow it with a warm-up period, and then launch the actual tests with varying thread counts.
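
Here is a condensed sketch of that flow, not the actual pg_microbench code: connect to a chosen shard, refresh statistics, warm up, then run the workload with increasing worker counts. The DSN, table, and column names are illustrative, and the exact invocation of shardman.global_analyze() should be checked against the Shardman documentation.

```go
package main

import (
	"database/sql"
	"log"
	"sync"
	"time"

	_ "github.com/lib/pq"
)

// runWorkers hammers the database with the given query from `workers`
// goroutines for the duration d.
func runWorkers(db *sql.DB, workers int, d time.Duration, query string) {
	var wg sync.WaitGroup
	deadline := time.Now().Add(d)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				if _, err := db.Exec(query); err != nil {
					log.Println("worker error:", err)
					return
				}
			}
		}()
	}
	wg.Wait()
}

func main() {
	// Connect directly to the shard we want to exercise.
	db, err := sql.Open("postgres", "postgres://bench@shard3:5432/bench?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Collect cluster-wide statistics before measuring anything.
	if _, err := db.Exec("SELECT shardman.global_analyze()"); err != nil {
		log.Fatal(err)
	}

	read := "SELECT field0 FROM userdata WHERE ycsb_key = 'user42'"
	runWorkers(db, 8, 2*time.Minute, read) // warm-up
	for _, w := range []int{16, 32, 64, 128} {
		runWorkers(db, w, 5*time.Minute, read) // the measured runs
	}
}
```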

December 10, 2024: Harsh Reality

While awaiting the servers, we decided to run a preliminary benchmark in our lab. We spun up a dozen virtual machines with 8 cores each and launched the test. To our dismay, only 50 GB of data had been processed across all nodes after 30 minutes, far below our expectations. Upon investigation, we discovered that the data generator was getting stuck: hash collisions in its key generation logic were sending it into endless retry loops.

Management decided to change the approach: batch inserts instead of writing one row at a time. However, this led to deadlocks, as multiple threads tried to insert the same rows because of the collisions. Ultimately, we settled for a simpler sequential key generation scheme, which dramatically improved the loading speed; a sketch of this approach is shown below.
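
A minimal sketch of that final scheme, with illustrative table and column names: a single atomic counter (offset per node so key ranges never overlap) hands out strictly sequential keys, and rows go out in multi-row INSERT batches, so two threads can never fight over the same key.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"
	"sync/atomic"

	_ "github.com/lib/pq"
)

var nextKey atomic.Int64 // initialised with this node's key-range offset

// insertBatch generates batchSize sequential keys and inserts them in one statement.
func insertBatch(db *sql.DB, batchSize int) error {
	values := make([]string, 0, batchSize)
	args := make([]interface{}, 0, batchSize*2)
	for i := 0; i < batchSize; i++ {
		k := nextKey.Add(1)
		values = append(values, fmt.Sprintf("($%d, $%d)", i*2+1, i*2+2))
		args = append(args, fmt.Sprintf("user%d", k), strings.Repeat("x", 1000)) // ~1 KB payload per row
	}
	query := "INSERT INTO userdata (ycsb_key, field0) VALUES " + strings.Join(values, ",")
	_, err := db.Exec(query, args...)
	return err
}

func main() {
	db, err := sql.Open("postgres", "postgres://bench@node1:5432/bench?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	nextKey.Store(0) // e.g. node N would start at N * some large offset to keep ranges disjoint
	for i := 0; i < 1000; i++ {
		if err := insertBatch(db, 1000); err != nil {
			log.Fatal(err)
		}
	}
}
```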

December 20: Eleven Days Left

With the new approach in place, we achieved a remarkable 150 GB of data loaded in just five minutes. It was time to transition to the real hardware.

December 27: We Got the Servers

After installing Debian 12 and configuring RAID 0 across ten disks, we achieved nearly a full petabyte of usable space. We set up Shardman and monitoring tools, ensuring that each node was synchronized to avoid clock skew. With everything prepared, we initiated the data generator across all nodes.

Each node managed to load data at approximately 150 GB every ten minutes, and we felt optimistic about our progress.

December 28: Never Deploy on a Friday

However, Saturday morning revealed that two nodes had fallen significantly behind, processing only a fraction of the data compared to the others. After extensive troubleshooting, we identified elevated RAID usage and slow I/O operations as the culprits. Despite tech support’s assistance, we ultimately decided to break the RAID arrays on the problematic nodes and revert to standalone disks. This workaround, however, led to complications with Shardman’s support for local tablespaces.

After some clever adjustments, we managed to create the necessary tablespaces and resumed data generation.
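
The gist of the workaround, as a minimal sketch: each standalone disk gets its own PostgreSQL tablespace, which the distributed tables can then be pointed at. The mount points and node name are placeholders, the directories must already exist and belong to the postgres user, and the Shardman-specific wiring is omitted here.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://postgres@node4:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One tablespace per standalone disk instead of a single RAID-backed volume.
	for i := 1; i <= 10; i++ {
		ddl := fmt.Sprintf("CREATE TABLESPACE disk%d LOCATION '/mnt/disk%d/pgdata'", i, i)
		if _, err := db.Exec(ddl); err != nil {
			log.Fatal(err)
		}
	}
}
```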

January 3: Never Rush

As we approached the deadline, we encountered a new issue: disk space on one server ran low because the tablespace configuration for indexes had been overlooked. After rectifying this, we got back on track.

January 9: What’s Going On?

Post-holiday, we faced another setback when critical RAM errors were detected across multiple servers. Fortunately, tech support quickly resolved the hardware issues, allowing us to continue our work.

January 10: Let’s Test

With the servers stable, we conducted our performance tests using both YCSB and pg_microbench. The results were promising, showing high transaction processing speeds across various worker configurations.

January 15: Still Testing

By January 15, data generation was complete: we had reached 863 terabytes, short of the full petabyte goal. Then autovacuum kicked in, consuming resources and slowing operations down, so we adjusted the configuration to keep performance up during this phase.
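
The article does not list the exact settings we touched; as a purely hypothetical illustration, these are standard PostgreSQL autovacuum knobs one would typically adjust at this point, controlling how much I/O autovacuum may consume per cycle (cost limit) and how often it wakes up (naptime). Both can be changed with a configuration reload, no restart required, and the values shown are illustrative, not the ones used in the test.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://postgres@node1:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical tuning of standard autovacuum GUCs (illustrative values).
	for _, stmt := range []string{
		"ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000",
		"ALTER SYSTEM SET autovacuum_naptime = '5min'",
		"SELECT pg_reload_conf()", // both settings take effect on reload
	} {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}
}
```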

January 17: Time to Make Some Graphs

As we analyzed our results, we noted that server 3, despite its earlier issues, performed exceptionally well in read tests. Meanwhile, server 6 excelled in write operations, showcasing unexpected performance discrepancies among the nodes.

Ultimately, our findings highlighted the importance of careful planning, robust benchmarks, and a thorough understanding of system internals. We documented our learnings and published our data loader for Shardman on GitHub, inviting others to explore the complexities of handling large-scale databases.
