A common task in enterprise systems involves loading substantial volumes of data into PostgreSQL, often tens or even hundreds of millions of rows. While it may seem straightforward to loop in Java and call the save() method for each record, this approach can be excruciatingly slow. Even a well-optimized PostgreSQL instance cannot compensate for inefficient data handling on the application side.
Spring and Hibernate Configuration
To enhance the speed of data insertion into PostgreSQL using Spring and Hibernate, we begin with the most apparent step: tuning the configuration.
Baseline Implementation Using Spring + Hibernate
Initially, we measure the duration of a basic insert operation. Starting from a clean Spring Initializr project, we connect a repository, configure the entity, link to the database, and execute a simple transactional snippet that inserts random entities via the save() method:
```kotlin
@Transactional
fun saveBySpring(count: Int) {
    val currencies = currencyRepo.findAll()
    val accounts = accountRepo.findAll()
    for (i in 0 until count) {
        pdCustomRepository.save(
            getRandomEntity(id = null, currencies.random(), accounts.random())
        )
    }
}
```
The results are less than impressive—over 10 minutes for the operation. This figure serves as our baseline as we explore ways to reduce it.
Additionally, Hibernate’s memory consumption is noteworthy; it retains every object in the session-level cache until the transaction concludes. A small heap size could easily lead to an OutOfMemoryError. Enabling logging reveals that data is sent to the database at the moment save() is invoked, rather than at the end of the entity generation process.
Hibernate’s generate_statistics setting offers valuable insights into the queries dispatched to the database. In our case, 4 million records were transmitted, with none batched. Batching is essential; it’s akin to moving books in stacks rather than one by one. Let’s explore how to implement batching effectively.
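A minimal sketch of enabling these statistics in a Spring Boot application.yml (the property path assumes a standard Spring Boot + Hibernate setup):

```yaml
spring:
  jpa:
    properties:
      hibernate:
        generate_statistics: true
```

With this enabled, Hibernate logs a session summary that includes the number of JDBC statements and JDBC batches executed, which is exactly the signal we need here.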
Enabling Batching: JDBC Batch Size
Hibernate offers the hibernate.jdbc.batch_size setting, but it may not behave as expected: simply enabling it does not guarantee batching, and in my experience it initially had no effect. When I created the database, I had set up a sequence to generate IDs, anticipating multiple backend instances sharing the same database; this let me use GenerationType.IDENTITY and avoid generating IDs on the backend. However, according to Hibernate’s documentation, batching does not work with GenerationType.IDENTITY.
To resolve this, we must explicitly request IDs from the database by adding a SequenceGenerator and switching to GenerationType.SEQUENCE. This adjustment yields a performance improvement of approximately 23%, though high memory consumption persists due to the session-level cache. Checking the logs again reveals that while 40 batches were sent, the total number of SQL statements remained at 4 million, with overhead arising from reading IDs from the database.
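A sketch of the adjusted ID mapping (the generator name, sequence name, and entity shape are illustrative, not the article's exact code):

```kotlin
@Entity
class PaymentDocumentEntity(
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "pd_seq")
    @SequenceGenerator(name = "pd_seq", sequenceName = "payment_document_seq", allocationSize = 1)
    var id: Long? = null,
    // ... remaining columns ...
)
```

Batching itself is then switched on via the spring.jpa.properties.hibernate.jdbc.batch_size property (e.g. 50).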
Reducing ID Read Requests from the Database
To minimize ID read requests, we can utilize caching. PostgreSQL offers a CACHE keyword for sequences, while Hibernate offers application-side ID caching via the sequence generator's allocationSize. These two approaches differ significantly.
- PostgreSQL’s session-level cache allows it to return multiple IDs at once, but unused IDs are not returned to the sequence.
- Application-side caching reserves IDs for use across the application without further database queries, significantly reducing the number of requests.
Testing both methods under a 10-thread load shows that application-side caching is faster, achieving about a 30% performance gain compared to database-side caching, which yields only a 2% improvement.
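The database-side option looks roughly like this (the sequence name and cache size are illustrative); PostgreSQL then preallocates a block of values per session:

```sql
ALTER SEQUENCE payment_document_seq CACHE 50;
```

The application-side equivalent is raising allocationSize on the @SequenceGenerator (with the sequence's INCREMENT BY matched accordingly), so Hibernate hands out a whole block of IDs after a single database round trip.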
The saveAll() Method
Next, we consider collecting all entities into a list and invoking saveAll(). This approach results in a minimal improvement of around 2%, as saveAll() effectively calls save() in a loop. Consequently, we observe that Hibernate continues to send data row by row, generating individual INSERT INTO statements for each entity.
PostgreSQL supports batch (multi-row) inserts, allowing multiple rows to be sent in a single INSERT statement. However, Spring does not enable this by default, although the JDBC driver can facilitate it.
Driver Configuration: reWriteBatchedInserts
By enabling the reWriteBatchedInserts setting, the JDBC driver can transform regular inserts into multi-row inserts. This adjustment leads to a significant speed increase without altering any code. The total time for the operation is now reduced to 4 minutes and 37 seconds.
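Assuming a standard Spring Boot datasource configuration (host and database name are placeholders), the flag is set on the JDBC URL:

```yaml
spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true
```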
Clearing the Session-Level Cache via EntityManager
To further optimize memory usage, we can clear the session-level cache when the data sent to the database is no longer needed. By injecting an EntityManager using @PersistenceContext, we can manually manage the insert process. When batching, we flush the data to the database and clear the cache:
```kotlin
@PersistenceContext
lateinit var entityManager: EntityManager

@Transactional
fun saveBySpringWithManualBatching(count: Int) {
    val currencies = currencyRepo.findAll()
    val accounts = accountRepo.findAll()
    for (i in 0 until count) {
        entityManager.persist(getRandomEntity(id = null, currencies.random(), accounts.random()))
        if (i != 0 && i % batchSize == 0) {
            // Push the current batch to the database and detach the entities
            // so the session-level cache does not grow without bound.
            entityManager.flush()
            entityManager.clear()
        }
    }
}
```
This method results in an 18-second time gain and significantly reduces memory usage, mitigating the risk of an OutOfMemoryError.
Configuring order_inserts
Another beneficial feature is the order_inserts setting. When inserting multiple entity types in one transaction, Hibernate may create numerous batches, as it closes the current batch and starts a new one with each different entity type. Setting order_inserts=true compels Hibernate to group objects by type before forming batches, enhancing batching efficiency.
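The setting sits alongside the other Hibernate properties (assuming the same Spring Boot application.yml layout as before):

```yaml
spring:
  jpa:
    properties:
      hibernate:
        order_inserts: true
```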
With all settings applied, we have successfully reduced the insertion time from 10 minutes and 30 seconds to 4 minutes and 19 seconds. The next step is to consider writing our own data insertion layer.
Writing Your Own Layer
Creating a custom layer that generates the necessary INSERT statements from entities is feasible. The first step is to take an entity instance and extract its data and column names, for example via reflection:
```kotlin
val data = PaymentDocumentEntity(
    account = AccountEntity().apply { id = 1 },
    expense = false,
    amount = BigDecimal("10.11"),
    cur = CurrencyEntity(code = "RUB"),
    orderDate = LocalDate.parse("2023-01-01"),
    orderNumber = "123",
    prop20 = "1345",
    prop15 = "END",
    paymentPurpose = "paymentPurpose",
    prop10 = "prop10",
)
```
While reflection is a straightforward method, it is not the fastest. Alternatives include manually listing column names or utilizing Kotlin’s KProperty for column specifications. The performance difference among these methods is minimal—approximately 3 seconds per operation for 4 million entities.
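As an illustration of the reflection approach, a hypothetical helper might derive snake_case column names from an entity's properties like this (a real implementation would honor @Column annotations rather than guessing from names):

```kotlin
import kotlin.reflect.full.memberProperties

// Hypothetical sketch; requires the kotlin-reflect dependency.
fun columnNames(entity: Any): List<String> =
    entity::class.memberProperties.map { prop ->
        // Convert camelCase property names to snake_case column names.
        prop.name.replace(Regex("([a-z0-9])([A-Z])")) {
            "${it.groupValues[1]}_${it.groupValues[2].lowercase()}"
        }
    }
```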
Architecture of the Custom Layer
The custom layer comprises three components:
- Processor: Contains the logic for transforming data for saving.
- Saver: Manages the saving process.
- Factory: Selects the appropriate saver based on the saver type.
The implementation features abstract classes with shared logic and various savers, each responsible for its insertion method. The initial implementation is a basic multi-row insert, where data is passed directly within the INSERT statement:
```sql
INSERT INTO payment_document (account_id, amount, expense, cur, order_date, order_number, payment_purpose, prop_10, prop_15, prop_20)
VALUES (1000004, '10.23', true, 'RUB', '2023-06-25', '123456', 'some purpose', 'some 10', 'some 15', 'some 20'),
       (1000005, '11.23', true, 'RUB', '2023-06-26', '123457', 'some purpose', 'some 10', 'some 15', 'some 20'),
       (1000006, '12.23', true, 'RUB', '2023-06-27', '123458', 'some purpose', 'some 10', 'some 15', 'some 20');
```
However, accessing data via reflection is not optimal. Therefore, we explore other options, such as using prepared statements instead of regular statements.
Using Prepared Statements
Prepared statements offer advantages, including enhanced security against SQL injection and improved performance due to reduced parsing and planning overhead. The maximum batch size can be determined by dividing the maximum allowed number of parameters by the number of columns.
Benchmarks reveal that the most efficient batch size is 5,000 records. Prepared statements outperform regular statements, and the server-side prepare optimization allows PostgreSQL to reuse query plans, further enhancing performance.
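A sketch of a prepared-statement saver (the table name is from the earlier example, but only two columns are shown for brevity):

```kotlin
import java.math.BigDecimal
import java.sql.Connection

// PostgreSQL's extended query protocol allows at most 65,535 bind parameters
// per statement, so a multi-row insert can carry at most 65_535 / columnCount rows.
fun maxRowsPerInsert(columnCount: Int): Int = 65_535 / columnCount

fun insertBatch(connection: Connection, rows: List<Pair<Long, BigDecimal>>) {
    val sql = "INSERT INTO payment_document (account_id, amount) VALUES (?, ?)"
    connection.prepareStatement(sql).use { ps ->
        for ((accountId, amount) in rows) {
            ps.setLong(1, accountId)
            ps.setBigDecimal(2, amount)
            ps.addBatch()
        }
        ps.executeBatch()
    }
}
```

With reWriteBatchedInserts enabled, the driver rewrites such a batch into multi-row inserts, and the server-side prepare optimization (controlled by the driver's prepareThreshold parameter) lets PostgreSQL reuse the query plan.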
Indexes and Triggers
While triggers can complicate matters due to embedded business logic, indexes can be managed more straightforwardly. Dropping indexes before insertion and recreating them afterward is a common practice, but caution is advised in production environments due to potential risks.
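For reference, the drop-and-recreate pattern looks like this (the index name and definition are illustrative). If the load fails midway, the index must be recreated by hand, which is one reason this is risky in production:

```sql
DROP INDEX IF EXISTS idx_payment_document_order_date;
-- ... perform the bulk load ...
CREATE INDEX idx_payment_document_order_date ON payment_document (order_date);
```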
Using PostgreSQL’s Custom Methods
PostgreSQL’s COPY command facilitates bulk data transfers between a file and a table. Testing this method reveals that it is faster than multi-row inserts, with memory consumption remaining low due to streaming data directly to disk.
COPY via CSV and File
Generating a CSV file with the required data and utilizing the COPY command yields impressive performance results, even when writing to disk. When storing data in memory using a StringWriter, performance improves further.
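A sketch of the in-memory variant using the pgjdbc CopyManager (the table and columns are illustrative; `csv` holds rows accumulated via a StringWriter):

```kotlin
import java.io.StringReader
import java.sql.Connection
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

// Stream CSV data straight into the table without an intermediate file.
fun copyCsv(connection: Connection, csv: String) {
    val pgConnection = connection.unwrap(BaseConnection::class.java)
    CopyManager(pgConnection).copyIn(
        "COPY payment_document (account_id, amount) FROM STDIN WITH (FORMAT csv)",
        StringReader(csv)
    )
}
```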
COPY in Binary Format
While the binary format is theoretically faster, it introduces complexities, especially with data type conversions. Benchmarking shows that the binary version may not always outperform the CSV format.
Integrating COPY with Spring
Integrating the COPY command into Spring transactions allows for the use of custom saver implementations within transactional boundaries, enhancing flexibility and performance.
Parallel Insertion
To maximize efficiency, we can parallelize the insertion process. By creating a dedicated ThreadPoolExecutor and implementing a ConcurrentSaverHandler, we can manage tasks effectively, allowing the main thread to continue generating data while separate threads handle database interactions.
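A minimal, generic sketch of that idea (the article's ConcurrentSaverHandler is more elaborate; this only shows the fan-out shape, with pool size and chunking left to the caller):

```kotlin
import java.util.concurrent.Executors

// A fixed pool of saver threads consumes ready chunks while the
// caller keeps generating the next ones.
fun <T> saveConcurrently(chunks: Sequence<List<T>>, threads: Int, saver: (List<T>) -> Unit) {
    val executor = Executors.newFixedThreadPool(threads)
    try {
        chunks.map { chunk -> executor.submit { saver(chunk) } }
            .toList()
            .forEach { it.get() } // rethrows if any saver failed
    } finally {
        executor.shutdown()
    }
}
```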
The results of this parallelization are substantial, demonstrating a marked improvement in operation speed and memory usage.