An AWS engineer has reported that PostgreSQL throughput on Linux 7.0 drops to roughly half of what earlier kernels deliver. Benchmarks point to the removal of the PREEMPT_NONE preemption mode as the cause of the regression: on a 96-vCPU Graviton4 instance, throughput measured just 0.51x that of earlier kernel versions. With the stable release of Linux 7.0 imminent and set to coincide with Ubuntu 26.04 LTS, kernel maintainers are pushing responsibility for a fix onto PostgreSQL developers rather than reverting the kernel change.
How the Regression Was Found
Salvatore Dipietro of Amazon/AWS benchmarked PostgreSQL 17 on a 96-vCPU Graviton4 instance (EC2 m8g.24xlarge) running AL2023, with storage on 12 IO2 volumes in RAID0 formatted as XFS. He ran the pgbench simple-update workload for 1,200 seconds with 1,024 clients and 96 threads, at a scale factor of 8,470 and a fillfactor of 90. Linux 7.0 delivered only 0.51x the throughput of its predecessors. By bisecting, Dipietro traced the regression to kernel commit 7dadeaa6e851, introduced in v7.0-rc1 by Intel developer Peter Zijlstra. The commit, titled “sched: Further restrict the preemption modes,” eliminated PREEMPT_NONE as the default and restricted modern CPU architectures to Full and Lazy preemption only.
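For readers who want to approximate the setup, the reported parameters map onto standard pgbench options roughly as follows. This is a sketch, not Dipietro's actual invocation: the article does not give his exact command line, the database name "bench" is hypothetical, and connection options are omitted.

```python
import shlex

# Parameters as reported in the benchmark description.
SCALE_FACTOR = 8470
FILLFACTOR = 90
CLIENTS = 1024
THREADS = 96
DURATION_S = 1200

# Hypothetical database name; not taken from the report.
DBNAME = "bench"


def init_cmd(dbname=DBNAME):
    """Build the initialization command: -i (init), -s (scale), -F (fillfactor)."""
    return ["pgbench", "-i", "-s", str(SCALE_FACTOR), "-F", str(FILLFACTOR), dbname]


def run_cmd(dbname=DBNAME):
    """Build the timed run: -c clients, -j threads, -T seconds, -b built-in script."""
    return [
        "pgbench",
        "-c", str(CLIENTS),
        "-j", str(THREADS),
        "-T", str(DURATION_S),
        "-b", "simple-update",
        dbname,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; a scale factor of 8,470
    # creates a very large dataset and should not be started by accident.
    print(shlex.join(init_cmd()))
    print(shlex.join(run_cmd()))
```

The script only prints the two commands. Reproducing the reported numbers would also depend on the instance type, storage layout, and PostgreSQL configuration, none of which this sketch sets up.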
The mechanism behind the slowdown is lock-holder preemption. Under the PREEMPT_NONE model, a thread holding a spinlock could generally complete its critical section without interruption. Under the new PREEMPT_LAZY model, the scheduler can preempt the holder, and every thread waiting on that lock keeps spinning until the holder runs again. PostgreSQL relies on short-held spinlocks for buffer management, so the change lands on one of its hottest code paths. Profiling data showed 55% of CPU time spent spinning in PostgreSQL’s spinlock routine, s_lock(), particularly within the StrategyGetBuffer/GetVictimBuffer buffer management call path. On a 96-vCPU system, the database ends up spending more than half its CPU budget on lock contention rather than executing queries.
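A back-of-the-envelope model shows why even rare preemption of a lock holder can be so costly. It is a toy calculation, not a model of PostgreSQL or of the kernel scheduler. It assumes a single lock that is always contended, so throughput is limited by how long each holder keeps the lock, and all of the numbers in it are made up for illustration.

```python
def relative_throughput(hold_us, preempt_prob, stall_us):
    """Throughput of a saturated lock relative to the no-preemption case.

    hold_us:      length of the critical section in microseconds
    preempt_prob: chance that a holder is preempted during one acquisition
    stall_us:     time a preempted holder stays off-CPU, lock still held
    """
    if hold_us <= 0:
        raise ValueError("hold_us must be positive")
    # Average time the lock is unavailable per acquisition.
    effective_hold_us = hold_us + preempt_prob * stall_us
    return hold_us / effective_hold_us


if __name__ == "__main__":
    # Illustrative values only: 1 us critical section, 1 ms off-CPU stall.
    for prob in (0.0, 0.0001, 0.001, 0.01):
        ratio = relative_throughput(1.0, prob, 1000.0)
        print(f"preempt_prob={prob}: relative throughput {ratio:.2f}")
```

With these invented values, a holder that is preempted once in a thousand acquisitions is enough to halve throughput, because a stall lasts a thousand times longer than the critical section it interrupts. The real workload is far more complicated, but the asymmetry between short hold times and long scheduling delays is the point.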
With the commit reverted, throughput recovered to 1.94x that of the unpatched 7.0 kernel, averaging 98,565 transactions per second (tps) versus 50,751 tps across three runs. The recovery strongly suggests that the preemption change is responsible for the drop, rather than an unrelated issue with PostgreSQL or the underlying hardware.
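The 0.51x and 1.94x figures are two views of the same pair of measurements, which a few lines of arithmetic confirm. The variable names are ours; the two tps values are the averages quoted above.

```python
# Average throughput across three runs, as reported.
STOCK_TPS = 50_751     # Linux 7.0 as shipped
REVERTED_TPS = 98_565  # Linux 7.0 with the preemption commit reverted


def speedup_from_revert():
    """How much faster the reverted kernel is than the stock one."""
    return REVERTED_TPS / STOCK_TPS


def slowdown_from_change():
    """Stock throughput as a fraction of the reverted kernel's throughput."""
    return STOCK_TPS / REVERTED_TPS


if __name__ == "__main__":
    print(f"revert vs. stock: {speedup_from_revert():.2f}x")   # 1.94x
    print(f"stock vs. revert: {slowdown_from_change():.2f}x")  # 0.51x
```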
Why the Kernel Changed
The restriction of preemption modes in Linux 7.0 is a deliberate design choice aimed at long-standing problems in the kernel’s scheduling model. By limiting modern CPU architectures (arm64, x86, powerpc, riscv, s390, and loongarch) to Full and Lazy preemption, Zijlstra sought to eliminate the need for the many voluntary preemption points scattered throughout the kernel codebase. Part of the motivation came from the PREEMPT_RT real-time kernel variant, which had suffered performance degradation from excessive scheduling.
PREEMPT_LAZY was introduced as a compromise that permits preemption while keeping scheduling overhead low, and removing PREEMPT_NONE from modern architectures is a step toward standardizing on that model. In his commit message, Zijlstra argued that a fundamentally preemptible model is what makes these problems manageable. He also noted that the patch was kept minimal in case regressions arose that proved hard to address.
The Fix Standoff
Two paths forward exist, and neither is guaranteed to land before the stable release. Dipietro has proposed a patch to restore PREEMPT_NONE as the default, but kernel developers have pushed back, suggesting that PostgreSQL adopt the rseq time slice extension instead. The extension lets a user-space thread request a brief extension of its CPU time slice so that it is not preempted at a critical moment, which could mitigate the lock-holder preemption that PostgreSQL currently suffers from.
Integrating the rseq mechanism would require significant changes in PostgreSQL, including platform-specific code paths that sit uneasily with its commitment to broad OS portability. Reverting the kernel change would be simpler for PostgreSQL, but it would undo progress toward eliminating legacy preemption models across multiple architectures. The kernel community appears to be leaning toward placing the burden of adaptation on applications rather than on the scheduler, which raises the question of how other spinlock-heavy applications will fare under Linux 7.0.
For database operators running PostgreSQL on Linux, the issue is pressing. A deployment that upgrades to Linux 7.0 before the matter is resolved could see throughput fall sharply without any change having been made on the database side. Organizations planning to move production database servers to Ubuntu 26.04 LTS must weigh their options: delay the upgrade, pin an older kernel, or accept the risk of degraded performance while awaiting either a kernel revert or a PostgreSQL update. With the stable release of Linux 7.0 approaching, the window for making these decisions is closing.
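Operators who want to know which preemption mode a given host is actually running can, on kernels built with dynamic preemption, read it from the scheduler's debugfs entry. The sketch below assumes that debugfs is mounted at /sys/kernel/debug, that the kernel exposes sched/preempt there, and that the file lists the available modes with the active one in parentheses. The file is usually readable only by root, and the sample strings in the comments are illustrative rather than captured from a Linux 7.0 system.

```python
from pathlib import Path

# Assumption: debugfs is mounted here and the kernel was built with
# dynamic preemption, which exposes this file.
PREEMPT_PATH = Path("/sys/kernel/debug/sched/preempt")


def active_mode(text):
    """Return the parenthesized mode from a line such as
    'none voluntary (full) lazy', or None if no mode is marked."""
    for token in text.split():
        if token.startswith("(") and token.endswith(")"):
            return token[1:-1]
    return None


def read_active_mode(path=PREEMPT_PATH):
    """Read the active mode from debugfs; None if the file is unavailable."""
    try:
        return active_mode(path.read_text())
    except OSError:
        # Missing file, unmounted debugfs, or insufficient privileges.
        return None


if __name__ == "__main__":
    mode = read_active_mode()
    print(mode if mode is not None else "preemption mode not available")
```

A check like this could be folded into a pre-upgrade audit, alongside a benchmark run on a staging host with the candidate kernel.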