How TimescaleDB helped us scale analytics and reporting

July 8, 2025

At Cloudflare, PostgreSQL and ClickHouse serve as our primary databases for transactional and analytical workloads. For teams developing products with configurations in our Dashboard, PostgreSQL is often the database of choice due to its speed, versatility, and reliability, honed over more than three decades of development. It has been a cornerstone of our infrastructure since inception, with hundreds of PostgreSQL instances running across various configurations and replication setups. In contrast, ClickHouse, introduced to our stack around 2017, allows us to ingest tens of millions of rows per second while maintaining millisecond-level query performance. However, like any technology, it comes with its own set of trade-offs.

Designing for future growth

With a decade of experience in software development, I have come to value systems that are simple and reliable. I often advocate for architectures with minimal moving parts. When I encounter a system diagram featuring more than three components, I question their necessity. As engineers, we can easily fall into the trap of designing for hypothetical scenarios that may never materialize. We envision future scalability, complex failure modes, or edge cases, leading us to create solutions for problems that might not arise. In practice, systems often evolve differently than anticipated, and we can defer large-scale designs by managing customer expectations and implementing guardrails, such as product limits and rate limits. By focusing on launching initial product versions with just a few essential components, we can ship, test, and learn quickly. Complexity can always be added later when truly needed. This philosophy, whether framed as YAGNI (You Aren’t Gonna Need It) or KISS (Keep It Simple, Stupid), underscores the reality that we are rarely adept at predicting the future. Each additional component introduces potential points of failure, necessitating documentation, testing, observability, and service level objectives (SLOs). Often, teams must learn new programming languages to support new components.

Making Digital Experience Monitoring simple

Two years ago, I was assigned to develop a new product at Cloudflare: Digital Experience Monitoring (DEX). DEX provides visibility into device, network, and application performance within Zero Trust environments. Our initial objective was straightforward—launch a minimum viable product (MVP) focused on fleet status monitoring and synthetic tests, delivering actionable analytics and troubleshooting capabilities to customers. From a technical perspective, fleet status and synthetic tests generate structured logs via the WARP client, which are uploaded to an API, stored in a database, and visualized in the Cloudflare Dashboard.

As with many new engineering initiatives at Cloudflare, DEX began as a “tiger team,” a small group of seasoned engineers tasked with rapidly validating a new product. I faced several constraints:

  • Team of three full-stack engineers.
  • Daily collaboration with 2-3 other teams.
  • Ability to launch in beta, with engineering driving product limits.
  • Emphasis on rapid shipping.

To balance utility and simplicity, we made intentional design choices early on:

  • Fleet status logs would be uploaded from WARP clients at fixed 2-minute intervals.
  • Synthetic tests required user preconfiguration by target (HTTP or traceroute) and frequency.
  • We capped usage: each device could run up to 10 synthetic tests, no more than once every 5 minutes.
  • Data retention was set to 7 days.

These guardrails enabled us to launch DEX months ahead of schedule and gather early customer feedback without prematurely investing in scalability and performance. We recognized the need for a basic configuration plane—an interface in the Dashboard for users to create and manage synthetic tests, supported by an API and database for data persistence. This led us to the following architecture:

  • HTTP API for managing test configurations.
  • PostgreSQL for storing those configurations.
  • React UI embedded in the Cloudflare Dashboard.

Just three components—simple, focused, and precisely what we required. Each of these components, however, came with inherent complexity. PostgreSQL was deployed as a high-availability cluster, featuring one primary and one synchronous replica for failover scenarios, alongside several asynchronous replicas distributed across two geographies. The API was deployed on horizontally scaled Kubernetes pods across two regions, while the React app was globally served via Cloudflare’s network. Thanks to our platform teams, this complexity was abstracted away, allowing us to concentrate on just three essential parts, though it highlighted that each component carries significant costs behind the scenes.

Next, we needed to construct the analytics plane—an ingestion pipeline to collect structured logs from WARP clients, store them, and visualize them for our customers in the Dashboard. I was eager to explore ClickHouse for this task, having witnessed its performance in other projects. However, as I delved into the internal documentation for ClickHouse, the complexity became apparent:

Writing data to ClickHouse requires your service to generate logs in a clear format, using Cap’n Proto or Protocol Buffers. Logs should be written to a socket for logfwdr to transport to PDX, then to a Kafka topic. Use a Concept:Inserter to read from Kafka, batching data to achieve a write rate of less than one batch per second.

This meant that incorporating ClickHouse, along with the WARP client, would add five components to our system diagram. The architecture is justified by how ClickHouse handles writes: its default table engine, MergeTree, is optimized for high-throughput batch inserts. Each insert is written as a separate part on disk, and background merges later combine those parts to keep the data manageable. This design excels at sustained bulk ingestion but struggles with many small writes, which can lead to write amplification, resource contention, and throttling.

Ultimately, it became evident that ClickHouse is akin to a sports car—ideal for high-speed scenarios—but we required a reliable daily driver for short trips. For our initial launch, we didn’t need millions of inserts per second; we needed something straightforward, dependable, and familiar enough to bring us to market. A colleague suggested we simply use PostgreSQL, asserting that “it can be cranked up” to meet our anticipated load. Thus, we decided to proceed with PostgreSQL.

Using PostgreSQL for analytics

Structurally, there is little distinction between configuration data and analytical logs. Logs are essentially structured payloads—often in JSON—that can be transformed into a columnar format and stored in a relational database. For instance, a device state log might look like this:

{
  "timestamp": "2025-06-16T22:50:12.226Z",
  "accountId": "025779fde8cd4ab8a3e5138f870584a7",
  "deviceId": "07dfde77-3f8a-4431-89f7-acfcf4ead4fc",
  "colo": "SJC",
  "status": "connected",
  "mode": "warp+doh",
  "clientVersion": "2024.3.409.0",
  "clientPlatform": "windows"
}

To store these logs, we created a straightforward PostgreSQL table:

CREATE TABLE device_state (
	"timestamp" TIMESTAMP WITH TIME ZONE NOT NULL,
	account_id TEXT NOT NULL,
	device_id TEXT NOT NULL,
	colo TEXT,
	status TEXT,
	mode TEXT,
	client_version TEXT,
	client_platform TEXT
);

Notably, this table lacks a primary key, which is intentional since time-series data is rarely queried by unique ID. Instead, queries typically focus on time ranges and various attributes (e.g., account ID or device ID). However, we still needed a way to deduplicate logs when clients retry uploads. To handle both deduplication and our most common queries, we created two indexes:

CREATE UNIQUE INDEX device_state_device_account_time ON device_state USING btree (device_id, account_id, "timestamp");
CREATE INDEX device_state_account_time ON device_state USING btree (account_id, "timestamp");

The unique index ensures deduplication, as each (device, account, timestamp) tuple represents a single log entry. The second index supports typical time-window queries at the account level, as we consistently query by account_id (representing individual customers) and timestamp, making them integral to the index.

Data insertion from our API utilized the UPSERT query:

INSERT INTO device_state (...) VALUES (...) ON CONFLICT DO NOTHING;
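
Spelled out with the columns and sample values from above (a sketch for illustration; in production the statement is generated by the API's database driver), the upsert looks like this:

-- A concrete version of the upsert: the unique index on
-- (device_id, account_id, "timestamp") turns client retries into no-ops
-- instead of duplicate rows.
INSERT INTO device_state
    ("timestamp", account_id, device_id, colo, status, mode, client_version, client_platform)
VALUES (
    '2025-06-16T22:50:12.226Z',
    '025779fde8cd4ab8a3e5138f870584a7',
    '07dfde77-3f8a-4431-89f7-acfcf4ead4fc',
    'SJC',
    'connected',
    'warp+doh',
    '2024.3.409.0',
    'windows'
)
ON CONFLICT DO NOTHING;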

About order of columns in multicolumn indexes

PostgreSQL’s B-tree indexes support multiple columns, but the order of those columns significantly impacts query performance. According to PostgreSQL documentation on multicolumn indexes:

A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned.

In time-series workloads, queries typically impose inequality constraints on the time column, followed by equality constraints on other columns. A typical query for generating visualizations might look like this:

SELECT 
  DATE_TRUNC('hour', timestamp) as hour, 
  account_id,
  device_id,
  status,
  COUNT(*) as total
FROM device_state 
WHERE 
  account_id = 'a' AND
  device_id = 'b' AND
  timestamp BETWEEN '2025-07-01' AND '2025-07-02'
GROUP BY hour, account_id, device_id, status;

In this case, the WHERE clause includes equality constraints on account_id and device_id, along with two inequality constraints on timestamp (BETWEEN expands to >= and <=). Had we built the index in the order (timestamp, account_id, device_id), only the timestamp constraints would limit the portion of the index scanned: PostgreSQL would walk every index entry within that time range and filter out non-matching account_id and device_id values afterward.

Additionally, the runtime complexity of searching in a B-tree is O(log n), meaning that as the size of your table (and all indexes) increases, the search becomes slower. Therefore, optimizing the order of columns can significantly enhance query performance. We have observed up to 100% improvement in SELECT query performance simply by adjusting the order of account_id and device_id in our multicolumn index. To achieve optimal performance for time range queries, we adhere to these guidelines for column order (see the sketch after this list):

  • The timestamp column is always last.
  • Other columns are leading columns, ordered by their cardinalities, starting with the highest cardinality column.
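
When in doubt, EXPLAIN makes the difference visible. A quick check against the query shape above (same placeholder values) might look like this:

-- Hypothetical sanity check: confirm the planner picks the
-- (device_id, account_id, "timestamp") index and scans only a narrow slice of it.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM device_state
WHERE
  account_id = 'a' AND
  device_id = 'b' AND
  "timestamp" BETWEEN '2025-07-01' AND '2025-07-02';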

Launch and improvements

By taking a step back during system design and avoiding premature optimization, we successfully transitioned from zero to a functioning DEX MVP in under four months. Early metrics showed promising ingest throughput and healthy API request latency. Following the launch, we concentrated on gathering feedback while monitoring system performance. As adoption increased, we scaled to 1,000 inserts per second, resulting in our tables growing to billions of rows. This growth led to performance degradation, particularly for large customers querying time ranges exceeding seven days across tens of thousands of devices.

Optimizing query performance with precomputed aggregates

As DEX expanded to encompass billions of device logs, one of our first performance optimizations involved precomputing aggregates, also known as downsampling. The premise is that if you understand the nature of your queries in advance—such as grouping by status, mode, or geographic location—you can precompute and store those summaries instead of repeatedly querying the raw data. This approach significantly reduces the volume of data scanned and simplifies query execution.

For example, in our DEX Fleet Status dashboard, we render common visualizations that typically group logs by status, mode, or colo, either over a one-hour window or across the entire time range. Our largest customers may have over 30,000 devices, each reporting logs every two minutes, resulting in millions of records daily per customer. However, the columns we visualize (e.g., status and mode) usually have only a few distinct values (4-6). By aggregating this data in advance, we can condense millions of rows into a few hundred per interval, allowing us to query significantly smaller, more manageable tables.

This optimization yielded remarkable results: we experienced up to a 1000x improvement in query performance, with charts that previously took several seconds to render now displaying instantly, even for seven-day views across tens of thousands of devices. Implementing this technique in PostgreSQL posed challenges. While PostgreSQL does support materialized views, they did not meet our needs out of the box, as they do not refresh automatically and incrementally. Instead, we utilized a cron job that periodically executed custom aggregation queries for all pre-aggregate tables. Our Database platform team had developed a lightweight framework for data retention that we integrated into our solution. Despite the considerable time investment required for optimizing aggregation performance, the results were well worth it: fast, reliable queries for the majority of customer use cases.
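
A minimal sketch of that approach, assuming a hypothetical hourly rollup table and keeping only the status dimension (the real jobs cover more columns and intervals):

-- Hourly rollup table, keyed so that reruns of the cron job stay idempotent.
CREATE TABLE device_state_by_status_1h (
    bucket TIMESTAMP WITH TIME ZONE NOT NULL,
    account_id TEXT NOT NULL,
    status TEXT NOT NULL,
    total BIGINT NOT NULL,
    UNIQUE (bucket, account_id, status)
);

-- Executed periodically by the cron job: aggregate the previous hour of raw
-- logs and upsert the summaries.
INSERT INTO device_state_by_status_1h (bucket, account_id, status, total)
SELECT
    DATE_TRUNC('hour', "timestamp") AS bucket,
    account_id,
    COALESCE(status, 'unknown') AS status,
    COUNT(*) AS total
FROM device_state
WHERE "timestamp" >= DATE_TRUNC('hour', NOW() - INTERVAL '1 hour')
  AND "timestamp" < DATE_TRUNC('hour', NOW())
GROUP BY 1, 2, 3
ON CONFLICT (bucket, account_id, status) DO UPDATE SET total = EXCLUDED.total;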

Table partitioning

While pre-computed aggregates proved beneficial, they are not a panacea. As we added more columns to accommodate new DEX features, we had to invest time in creating new pre-aggregated tables. Additionally, some features required queries with combined filters, which meant going back to the raw data with all of its columns. However, performance on raw tables was lacking. One technique we considered was table partitioning. In PostgreSQL, a table is stored on disk as a single heap, split into 1 GB segment files once it grows large. Partitioning breaks a large table into smaller child tables, each covering a specific slice of the data (e.g., one day of logs). PostgreSQL then scans only the partitions relevant to the timestamp filter in a query, which can improve query performance significantly.

We were particularly interested in range-partitioning on the timestamp column, as our customers desired longer data retention—up to one year. Storing an entire year of data in a single large table would have severely impacted query performance.

CREATE TABLE device_state (
    ...
) PARTITION BY RANGE (timestamp);

CREATE TABLE device_state_20250601 PARTITION OF device_state
    FOR VALUES FROM ('2025-06-01') TO ('2025-06-02');
CREATE TABLE device_state_20250602 PARTITION OF device_state
    FOR VALUES FROM ('2025-06-02') TO ('2025-06-03');
CREATE TABLE device_state_20250603 PARTITION OF device_state
    FOR VALUES FROM ('2025-06-03') TO ('2025-06-04');

Unfortunately, PostgreSQL does not automatically manage partitions; each must be manually created as shown above. This limitation would have necessitated building a comprehensive partition management system to automate the process. Ultimately, we opted against partitioning, as it did not address our core issue: enhancing performance for frequent dashboard queries on recent raw data spanning up to seven days.
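
For a sense of what we would have had to build and operate, the automation boils down to a scheduled job along these lines (an illustrative sketch only):

-- Run daily by a scheduler: create tomorrow's partition ahead of time.
DO $$
DECLARE
    next_day date := CURRENT_DATE + 1;
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS device_state_%s PARTITION OF device_state
             FOR VALUES FROM (%L) TO (%L)',
        to_char(next_day, 'YYYYMMDD'), next_day, next_day + 1);
    -- A companion job would DETACH and DROP partitions older than the retention window.
END $$;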

TimescaleDB

As our raw PostgreSQL setup began to reveal its limitations, we explored alternative options to improve query performance, leading us to discover TimescaleDB. What particularly piqued my interest were its columnstore and sparse index features, common in OLAP databases like ClickHouse. TimescaleDB appeared to offer a solution to our raw performance challenges. Additionally:

  • It’s Postgres: TimescaleDB is packaged as a PostgreSQL extension, allowing it to coexist seamlessly with PostgreSQL and granting access to the entire Postgres ecosystem. We can continue using vanilla Postgres tables for transactional workloads while leveraging TimescaleDB hypertables for analytical tasks, providing the convenience of one database for everything.
  • Automatic partition management: Unlike Postgres, which requires manual table partitioning, TimescaleDB hypertables are partitioned by default and managed automatically.
  • Automatic data pre-aggregation/downsampling: Tedious processes in native Postgres, such as creating and managing downsampled tables, are automated in TimescaleDB through continuous aggregates, simplifying development and deployment.
  • Real-time data pre-aggregation/downsampling: TimescaleDB addresses the issue of outdated async aggregates by integrating the most recent raw data into rollup tables during queries.
  • Compression: TimescaleDB’s compression feature can reduce table size by over 90% while enhancing query performance.
  • Columnstore performance for real-time analytics: TimescaleDB’s hybrid row/columnar engine enables fast scans and aggregations over large datasets, supporting UPSERTs for backfilling. This combination delivers strong performance for analytical queries while minimizing storage overhead.
  • Rich library of analytics tools and functions: TimescaleDB offers a suite of tools and functions tailored for analytical workloads, including percentile approximation, approximate counts of unique values, and time-weighted averages (see the sketch below).
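
For example, approximating a latency percentile becomes a one-liner. The sketch below assumes a hypothetical synthetic_http_tests hypertable with a latency_ms column; note that many of these hyperfunctions ship in the separate timescaledb_toolkit extension:

-- Approximate p95 latency per hour from a hypothetical synthetic_http_tests hypertable.
SELECT
    time_bucket(INTERVAL '1 hour', "timestamp") AS hour,
    approx_percentile(0.95, percentile_agg(latency_ms)) AS p95_latency_ms
FROM synthetic_http_tests
GROUP BY 1;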

One particularly compelling aspect is that TimescaleDB automates aggregation and data retention, allowing us to simplify our infrastructure and eliminate a component from the system architecture entirely.

Evaluating TimescaleDB for DEX

We deployed a self-hosted TimescaleDB instance on our canary PostgreSQL cluster to conduct a side-by-side comparison with vanilla Postgres. Our production backend was dual-writing to both systems. As anticipated, installing TimescaleDB was straightforward—simply load the library and execute the following SQL query:

CREATE EXTENSION IF NOT EXISTS timescaledb;

Next, we:

  • Created raw tables
  • Converted them to hypertables
  • Enabled columnstore features
  • Set up continuous aggregates
  • Configured automated policies for compression and retention

Here’s a condensed example for device_state logs:

-- Create device_state table.
CREATE TABLE device_state (
	...
);

-- Convert it to a hypertable.
SELECT create_hypertable ('device_state', by_range ('timestamp', INTERVAL '1 hour'));

-- Add columnstore settings
ALTER TABLE device_state SET (
    timescaledb.enable_columnstore,
    timescaledb.segmentby = 'account_id'
);

-- Schedule recurring compression jobs
CALL add_columnstore_policy('device_state', after => INTERVAL '2 hours', schedule_interval => INTERVAL '1 hour');

-- Schedule recurring data retention jobs
SELECT add_retention_policy('device_state', INTERVAL '7 days');

-- Create device_state_by_status_1h continuous aggregate
CREATE MATERIALIZED VIEW device_state_by_status_1h
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(INTERVAL '1 hour', "timestamp") AS time_bucket,
  account_id,
  status,
  COUNT(*) as total
FROM device_state
GROUP BY 1,2,3
WITH NO DATA;

-- Enable realtime aggregates
ALTER MATERIALIZED VIEW device_state_by_status_1h
SET (timescaledb.materialized_only=FALSE);

-- Schedule recurring continuous aggregate jobs to refresh past 10 hours every 10 minutes
SELECT add_continuous_aggregate_policy (
  'device_state_by_status_1h',
  start_offset=>INTERVAL '10 hours',
  end_offset=>INTERVAL '1 minute',
  schedule_interval=>INTERVAL '10 minutes',
  buckets_per_batch => 1
);
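
With this in place, a typical dashboard chart reads from the much smaller continuous aggregate rather than the raw hypertable; for example (same placeholder account as before):

-- Realtime aggregation transparently merges materialized buckets with raw
-- rows newer than the last refresh, so results stay up to date.
SELECT time_bucket, status, total
FROM device_state_by_status_1h
WHERE account_id = 'a'
  AND time_bucket >= NOW() - INTERVAL '7 days'
ORDER BY time_bucket;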

After a two-week backfill period, we conducted benchmarks using real production queries from our dashboard. We tested:

  • Three time windows: past 1 hour, 24 hours, and 7 days
  • Three columnstore modes: uncompressed, compressed, and compressed with segmenting
  • Datasets containing 500 million to 1 billion rows

The results were impressive, showcasing performance improvements ranging from 5x to 35x, depending on the query type and time range:

  • For short windows (1–24 hours), even uncompressed hypertables performed admirably.
  • For longer windows (7 days), compression and columnstore settings (especially with segmenting) made a significant difference.
  • Sparse indexes proved crucial; as PostgreSQL’s B-tree indexes faltered at scale, Timescale’s minmax sparse indexes and columnar layout excelled.

In addition to query performance, we observed remarkable compression ratios, achieving up to 33x:

SELECT 
    pg_size_pretty(before_compression_total_bytes) as before,
    pg_size_pretty(after_compression_total_bytes) as after,
    ROUND(before_compression_total_bytes / after_compression_total_bytes::numeric, 2) as compression_ratio
FROM hypertable_compression_stats('device_state');

before: 1616 GB
after: 49 GB
compression_ratio: 32.83

This outcome allowed us to retain 33 times more data for the same cost.

What makes columnstore so fast?

Two primary factors contribute to the speed of columnstore: compression and sparse indexes. It may seem counterintuitive that querying compressed data, which necessitates decompression, can outperform querying raw data. However, in practice, input/output (I/O) is the primary bottleneck in most analytical workloads. The reduction in disk I/O from compression often outweighs the CPU cost associated with decompression. In TimescaleDB, compression transforms a hypertable into a columnar format: values from each column are grouped in chunks (typically 1,000 at a time), stored in arrays, and then compressed into binary form. For a more detailed explanation, refer to this TimescaleDB blog post.

You might wonder how this is feasible in PostgreSQL, which is traditionally row-based. TimescaleDB employs a clever solution by utilizing PostgreSQL TOAST pages. After compressing tuples of 1,000 values, they are moved to external TOAST pages. The columnstore table itself effectively becomes a table of pointers to TOAST, where the actual data is stored and retrieved lazily, column by column.

The second factor is sparse minmax indexes. Rather than indexing every single value, a sparse index stores only a small amount of metadata per block of rows, making it far smaller and cheaper to consult when querying large datasets. TimescaleDB implements minmax sparse indexes: each compressed batch of 1,000 values gets two additional metadata columns holding its minimum and maximum. The query engine consults these columns to decide whether a value could exist in a compressed batch before decompressing it. However, we later discovered that sparse indexes must be explicitly enabled via the timescaledb.orderby option; otherwise, TimescaleDB falls back to a default that may not be optimal for every query. We added all columns that we filter on to the orderby setting:

-- Add columnstore settings
ALTER TABLE device_state SET (
    timescaledb.enable_columnstore,
    timescaledb.segmentby = 'account_id',
    timescaledb.orderby = 'timestamp,device_id,colo,mode,status,client_version,client_platform'
);

TimescaleDB at Cloudflare

Following the success of DEX, other teams began exploring TimescaleDB for its simplicity and performance. A notable example is the Zero Trust Analytics & Reporting (ART) team, responsible for generating analytics and long-term reports for Zero Trust products such as Access, Gateway, CASB, and DLP. These datasets reside in various ClickHouse and PostgreSQL clusters, which we aimed to consolidate into a unified repository designed to integrate related but geographically disparate data points, tailored to meet our customers’ analytical needs.

We opted to use TimescaleDB as the aggregation layer atop raw logs stored elsewhere. We developed a system of crawlers utilizing cron jobs that periodically query multiple clusters for hourly aggregates across all customers. These aggregates are ingested into TimescaleDB, where we employ continuous aggregates to further roll them up into daily and monthly summaries for reporting.
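
As an illustration, with hypothetical table and column names, rolling the crawler-ingested hourly aggregates up to daily summaries is just another continuous aggregate:

-- Hypothetical daily rollup over hourly aggregates ingested by the crawlers,
-- assuming a gateway_requests_1h hypertable with time column "bucket".
CREATE MATERIALIZED VIEW gateway_requests_1d
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 day', bucket) AS day,
    account_id,
    action,
    SUM(total) AS total
FROM gateway_requests_1h
GROUP BY 1, 2, 3
WITH NO DATA;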

Access and Gateway datasets are substantial, often ingesting millions of rows per second. To accommodate arbitrary filters in reporting, crawler queries group by all relevant fields, including high-cardinality columns like IP addresses. This results in a low downsampling ratio, leading to the insertion of approximately 100,000 aggregated rows per second. TimescaleDB manages this load effectively, but we made some adjustments:

  • We switched from bulk INSERTs to COPY, significantly improving ingestion throughput. While we did not benchmark this ourselves, numerous public benchmarks show that COPY outperforms INSERT for large batches (see the sketch after this list).
  • We disabled synchronous replication, as temporary data loss is acceptable; our crawlers are idempotent and can reprocess any missing data as necessary.
  • We also disabled fsync, as durability is less critical for this use case, allowing us to improve ingest performance by skipping disk syncs.
  • We eliminated most indexes on the hypertables, retaining only one on (account_id, timestamp), and relied on aggressive compression and sparse indexes instead. Dropping the extra indexes kept insert rates high without significantly hurting query performance, since only a small, recent portion of each table remains uncompressed and depends on traditional B-tree indexes.
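
A rough sketch of those ingestion-side adjustments, reusing the hypothetical table from above (the exact settings live in our cluster configuration and crawler code):

-- Bulk-load batches with COPY instead of multi-row INSERTs; the crawler
-- streams CSV rows through the driver's COPY support.
COPY gateway_requests_1h (bucket, account_id, action, src_ip, total)
FROM STDIN WITH (FORMAT csv);

-- Relax durability for this workload: don't wait for WAL flush or synchronous
-- replicas on commit. The crawlers are idempotent and can re-ingest any data
-- lost after a crash.
SET synchronous_commit = off;
-- fsync is disabled cluster-wide (fsync = off in postgresql.conf) to skip disk syncs.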

This system is operational at Cloudflare Zero Trust Analytics.
