How Five PostgreSQL Optimizations Sped Up Our Dashboard

Our dashboard operates on a robust Django monolith, supported by PostgreSQL. As we explore solutions to enhance search and filter performance at scale, we are transitioning towards denormalization into a query-optimized store, specifically ClickHouse. However, establishing a secondary database is a long-term commitment. Before making that leap, we aimed to maximize PostgreSQL’s capabilities to provide immediate relief to our users.

We focused on optimizing the max of p95 across endpoints for our business plan customers. Initially, our p50 metric stood at 0.7 seconds, which seemed satisfactory. Yet, this figure masked the occasional instances where customers experienced significant delays, with the same endpoint reporting a p95 of 8 seconds. Our goal was to reduce this p95 to no more than 3 seconds, and we ultimately surpassed this target, achieving a remarkable p95 of 1 second.

Leveraging Observability

Before implementing any changes, we prioritized establishing observability tools to diagnose the root causes of performance issues and measure the effectiveness of our solutions. We set up metrics to monitor the p95 for paying customers on specific endpoints, allowing us to track our progress towards the 3-second goal in real time.

Utilizing OpenTelemetry traces, we identified the slowest HTTP requests from paying customers, particularly focusing on the slowest SQL queries within those requests. We also employed the EXPLAIN ANALYZE command to manually replay those requests on a replica, gaining insights into their sluggishness. With the necessary instrumentation in place, we proceeded to address the identified issues, starting with a technique known as “late joining.”

Late Joining

One inherent limitation of PostgreSQL is its lack of guarantee that JOINs on non-filtered tables will be deferred until after WHERE clauses are applied. For instance, consider a query counting feedback resources related to an incident table:

SELECT i.*, COUNT(f.id) AS feedbacks_count
FROM incident i
LEFT JOIN incident_feedback f ON f.issue_id = i.id
WHERE i.account_id = 42
  AND i.severity < 10
GROUP BY i.id
ORDER BY i.id DESC
LIMIT 50;

While we expect PostgreSQL to retrieve feedback only after filtering incidents based on criteria like account_id = 42 and severity < 10, the query planner might opt to perform the JOIN beforehand, which can be costly depending on the size of the joined table. To work around this limitation, we adopted a two-step approach:

  1. Execute a preliminary query joining only on filtering fields.
  2. Subsequently, query display-only fields separately, filtering based on the foreign keys from step one.

This translates to an initial query that strictly applies the filters to retrieve a page of incident IDs:

SELECT i.id
FROM incident i
WHERE i.account_id = 42
  AND i.severity < 10
ORDER BY i.id DESC
LIMIT 50;

Followed by a second query fetching all required fields for display:

SELECT f.incident_id, COUNT(*) AS feedbacks_count
FROM incident_feedback f
WHERE f.incident_id IN (1001, 1002, 1003, ...)
GROUP BY f.incident_id;

Thanks to the Django ORM, implementing this optimization was straightforward; we switched from select_related to prefetch_related, and refactored our tooling to promote this pattern across the organization.

Asynchronous Counting

In addition to displaying a page of results, our dashboard initially returned a count attribute in the API payload. This approach followed the Django Paginator abstraction, but as data volume increased, the SQL query for the count began to take significantly longer than the queries for individual pages of items.

To mitigate this issue without sacrificing user experience, we implemented a solution involving two parallel requests:

  1. A request to fetch a page of items.
  2. A request to fetch the total count.

This allows users to view results as soon as the first request completes, while the total count is fetched asynchronously.

Premium Replication

Database caching plays a crucial role in query performance, with stark differences observed between cold and hot cache scenarios. We noted up to a 20-fold performance improvement when the cache was warm. Given that all production traffic was served by our primary PostgreSQL instance, we decided to create a PostgreSQL replica dedicated to read operations for paying customers and our most frequently accessed endpoints.

This premium replica not only enhances the user experience but also provides better isolation from potential performance degradation caused by other app components. For instance, while a heavy process may trigger a CPU spike on our primary database instance, the premium replica remains unaffected, ensuring consistent performance for business customers.

Full-Text Search Optimizations

Our incidents listing view supports full-text search across various attributes. Upon analyzing traces, we discovered that the search query string was a significant contributor to long request times. Our text search utilizes the pg_trgm extension, which indexes strings by breaking them down into trigrams.

However, we faced two challenges:

  • From a product perspective, users struggled to understand which attribute was responsible for a match.
  • Technically, the underlying SQL query employed an UNION statement, merging results from multiple queries filtering on different tables.

To resolve these issues, we revamped the search experience. Instead of returning results directly, the full-text search now separately searches supported attributes, allowing users to select one for filtering.

This new approach offers several advantages:

  • Users can clearly identify the attribute they are filtering over.
  • Searches across distinct attributes can be executed in parallel, enhancing speed.
  • The front-end can utilize local caching to avoid redundant requests for matching attributes.
  • Once a filter is selected, the query shifts from a full-text search to a targeted search based on the chosen attribute.

Denormalization

Lastly, we explored optimizing filtering on specific fields that exhibited anomalous performance. By analyzing query plans, we identified that PostgreSQL must choose between two strategies when filtering and sorting paginated results:

  1. Scan the main table in sort order, applying filters row-by-row until N results are found.
  2. Utilize the filter index to identify all matching rows upfront, JOIN them back to the main table, sort the entire result set, and return the first N entries.

While the first strategy is efficient for frequent matches, it can be slow for selective filters. Conversely, the second strategy excels with selective filters but can degrade with many matches. PostgreSQL’s choice between these strategies is probabilistic, leading to potential inefficiencies.

The solution lies in creating a composite index that encompasses both filtering and sorting criteria. This requires denormalizing the filter field onto the main table or a dedicated denormalization table, which we opted for. For instance, consider the following query:

SELECT *
FROM incident i
JOIN secret s ON i.secret_id = s.id
WHERE i.account_id = 42
  AND s.validity = 'valid'
ORDER BY i.date DESC
LIMIT 50;

By denormalizing the secret.validity column alongside the incident date, we can create a composite index:

(account_id, validity, issue_date)
 ^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^   
 matches filters     | matches ordering

This index enables PostgreSQL to efficiently match the filters and scan in the appropriate order, significantly improving performance.

PostgreSQL 18 Upgrade

In conjunction with the application and query modifications, we upgraded our production stack to PostgreSQL 18. This upgrade provided incremental performance improvements and established a stronger foundation for future enhancements.

It is important to note that the upgrade served as an accelerator rather than a standalone solution; the most significant gains stemmed from our targeted changes in query design, counting strategy, search behavior, and cache isolation. Nevertheless, running on PostgreSQL 18 helped stabilize these improvements and mitigate the risk of regressions over time.

The Results

A visual representation speaks volumes:

The accompanying chart illustrates:

  1. The number of traces on our most active dashboard view that exceeded 3 seconds.
  2. The p95 of this endpoint for our paying customers.

In the first chart, we observe clusters of spikes separated by flatter segments, corresponding to weekly patterns. The second chart smooths data over a 7-day period to highlight trends. Vertical annotations mark the deployment dates of various optimizations detailed in this article. The latest p95 value, positioned at the right of the second chart, stands at 1 second, comfortably below our 3-second target.

Tech Optimizer
How Five PostgreSQL Optimizations Sped Up Our Dashboard