Inserting 2 million records per second into Postgres is not just a theoretical exercise; it is achievable in practice. However, rather than fixating on micro-benchmarks, it is worth stepping back to ask a more significant question: which abstractions align best with our specific workload requirements?
This exploration will delve into five distinct methods for inserting data into Postgres using Python. The objective is not merely to identify the fastest method but to comprehend the trade-offs involved in terms of abstraction, safety, convenience, and performance.
By the end of this analysis, readers will gain insights into:
- the strengths and weaknesses of ORM, Core, and driver-level inserts
- when performance truly matters
- how to select the appropriate tool without succumbing to over-engineering
Why Fast Inserts Matter
High-volume insert workloads are commonplace in various scenarios:
- loading millions of records
- syncing data from external APIs
- backfilling analytics tables
- ingesting events or logs into warehouses
Even minor inefficiencies can accumulate rapidly. Transforming a 3-minute insert job into a 10-second operation can significantly alleviate system load, liberate resources, and enhance overall throughput.
However, it is essential to recognize that faster does not inherently equate to better. In scenarios involving smaller workloads, sacrificing clarity and safety for marginal speed gains often proves counterproductive. The true aim lies in understanding when performance is critical and why it matters.
Which Tools Do We Use to Insert?
To interact with our Postgres database, we need a database driver. Here, we use psycopg3, with SQLAlchemy as an additional layer on top. Here's how the two differ:
Psycopg3 (the Driver)
psycopg3 is a low-level PostgreSQL driver for Python. This minimal abstraction communicates directly with Postgres, leaving the developer responsible for writing SQL, managing batching, and ensuring correctness.
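For illustration, here is a minimal driver-level sketch; the connection string and the users table are assumptions for this example, not part of the benchmark:

```python
import psycopg

rows = [("Ada", 36), ("Grace", 45)]  # plain tuples: the driver's native shape

# The psycopg3 connection context manager commits on a clean exit.
with psycopg.connect("postgresql://localhost/demo") as conn:
    with conn.cursor() as cur:
        # You write the SQL and placeholders yourself; the driver binds
        # the parameters and handles the round trips.
        cur.executemany(
            "INSERT INTO users (name, age) VALUES (%s, %s)",
            rows,
        )
```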
SQLAlchemy
SQLAlchemy operates atop database drivers like psycopg3, offering two distinct layers:
1) SQLAlchemy Core
This layer provides a SQL abstraction and execution framework that is database-agnostic, allowing developers to write Python expressions that Core translates into the appropriate SQL dialect while safely binding parameters.
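A minimal Core sketch, again assuming a hypothetical users table; the data arrives as dictionaries, and Core compiles and binds the INSERT for us:

```python
from sqlalchemy import MetaData, Table, Column, Integer, String, create_engine, insert

metadata = MetaData()
users = Table(
    "users",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("age", Integer),
)

engine = create_engine("postgresql+psycopg://localhost/demo")

# Passing a list of dicts triggers SQLAlchemy's "executemany" path:
# one compiled INSERT, safely bound parameters, no raw SQL strings.
with engine.begin() as conn:
    conn.execute(
        insert(users),
        [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}],
    )
```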
2) SQLAlchemy ORM
The ORM, built on Core, offers even greater abstraction by mapping Python classes to database tables, tracking object states, and managing relationships. While the ORM enhances productivity and safety, it introduces overhead, particularly during bulk operations.
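A minimal ORM sketch in SQLAlchemy 2.0 style, with a hypothetical User model mapped to the same assumed table:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]
    age: Mapped[int]

engine = create_engine("postgresql+psycopg://localhost/demo")

with Session(engine) as session:
    # The unit of work tracks each object's state, emits the INSERTs,
    # and refreshes primary keys on commit.
    session.add_all([User(name="Ada", age=36), User(name="Grace", age=45)])
    session.commit()
```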
In essence, these three options exist along a spectrum:
- ORM simplifies the use of Core
- Core enhances the safety of using the Driver while maintaining database agnosticism
The Benchmark
To ensure a fair benchmarking process:
- each method receives data in its intended format (ORM objects for ORM, dictionaries for Core, tuples for the Driver)
- only the time spent transferring data from Python to Postgres is measured
- no method incurs penalties for conversion tasks
- the database operates within the same environment as our Python script, preventing bottlenecks from upload speeds
The aim is not to identify the fastest insert method but to comprehend what each approach excels at.
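To make the measurement rule concrete, here is a rough sketch of the timing loop; insert_fn and payload are hypothetical stand-ins for each method under test and its pre-converted data:

```python
import time

def measure(insert_fn, payload):
    # payload is prepared up front so only the Python-to-Postgres
    # transfer lands inside the timed window.
    start = time.perf_counter()
    insert_fn(payload)
    return time.perf_counter() - start
```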
1) Is Faster Always Better?
What is better? A Ferrari or a Jeep?
The answer hinges on the problem you’re trying to solve. For navigating a forest, the Jeep is preferable. If speed is your goal, the Ferrari takes the lead.
This analogy extends to data insertion. Reducing a 10-second insert by 300 milliseconds may not warrant the added complexity and risk. Conversely, in other scenarios, such a gain could be invaluable.
Notably, the fastest method on paper may prove to be the slowest when considering:
- maintenance costs
- correctness guarantees
- cognitive load
2) What is Your Starting Point?
The right insertion strategy depends less on row count and more on the shape your data already has.
ORM, Core, and the driver are not competing tools; they are optimized for different objectives:
| Method | Purpose |
| --- | --- |
| ORM (add_all) | Business logic, correctness, small batches |
| ORM (bulk_save_objects) | ORM objects at scale |
| Core (execute) | Structured data, light abstraction |
| Driver (executemany) | Raw rows, high throughput |
| Driver (COPY) | Bulk ingestion, ETL, firehose workloads |
The ORM excels in CRUD-heavy applications where clarity and safety are paramount, such as websites and APIs. Here, performance is typically “good enough,” and clarity takes precedence.
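The two ORM rows in the table above differ mainly in bookkeeping: add_all runs the full unit of work, while bulk_save_objects (a legacy bulk API in modern SQLAlchemy) skips most of it. A sketch, reusing the hypothetical User model and engine from earlier:

```python
from sqlalchemy.orm import Session

with Session(engine) as session:
    users = [User(name=f"user_{i}", age=i % 90) for i in range(10_000)]
    # No relationship handling or attribute refresh: less safety, more speed.
    session.bulk_save_objects(users)
    session.commit()
```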
Core is ideal for scenarios requiring control without the need to write raw SQL, such as data ingestion, batch jobs, and analytics pipelines.
The Driver is tailored for maximum throughput, particularly in cases involving extensive writes, such as machine learning training sets, bulk loads, and low-latency ingestion services. While it minimizes overhead, it also necessitates manual SQL writing, increasing the risk of errors.
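As a sketch of that trade-off, here is what a COPY-based load might look like with psycopg3; the table and the generated rows are assumptions:

```python
import psycopg

# A generator keeps memory flat while streaming a large batch.
rows = ((f"user_{i}", i % 90) for i in range(1_000_000))

with psycopg.connect("postgresql://localhost/demo") as conn:
    with conn.cursor() as cur:
        # COPY streams rows over a single protocol channel instead of
        # issuing batched INSERT statements, which is why it tops
        # throughput charts.
        with cur.copy("COPY users (name, age) FROM STDIN") as copy:
            for row in rows:
                copy.write_row(row)
```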
3) Don’t Mismatch Abstractions
The ORM isn’t slow. COPY isn’t magic.
Performance challenges arise when data is forced through an abstraction for which it was not designed:
- Using Core with SQLAlchemy ORM objects can lead to slowdowns due to conversion overhead.
- Utilizing ORM with tuples is often awkward and brittle.
- Employing ORM bulk operations in ETL processes can result in wasted overhead.
At times, reverting to a lower level can actually enhance performance.
When to Choose Which?
A useful rule of thumb is as follows:
| Layer | Use it when… |
| --- | --- |
| ORM | You are building an application (correctness and productivity) |
| Core | You are moving or transforming data (balance between safety and speed) |
| Driver | You are pushing performance limits (raw power and full responsibility) |