At Cloudflare, PostgreSQL and ClickHouse are used as primary databases for transactional and analytical workloads. PostgreSQL is preferred for product development due to its speed, versatility, and reliability, with hundreds of instances running in various configurations. ClickHouse, introduced in 2017, allows ingestion of tens of millions of rows per second with millisecond-level query performance but comes with trade-offs.
The Digital Experience Monitoring (DEX) product was developed with a focus on simplicity and rapid shipping by a team of three engineers working in collaboration with other teams. The initial launch covered fleet status logs uploaded every two minutes and synthetic tests with usage caps, and the MVP shipped ahead of schedule. The architecture consisted of an HTTP API, PostgreSQL for storing configurations, and a React UI in the Cloudflare Dashboard.
PostgreSQL was also chosen for analytics because it handles structured logs well. The device state log table was created without a primary key but was optimized with unique and standard indexes covering the common queries. Data was inserted with UPSERTs (INSERT ... ON CONFLICT), and the column order of the multicolumn indexes was tuned to match the query patterns.
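A minimal sketch of what such a schema could look like; the table, column names, and index definitions below are hypothetical stand-ins for the actual DEX schema, chosen only to illustrate the pattern of a table with no primary key, a unique index as the UPSERT conflict target, and a multicolumn index ordered for the common filter.

```sql
-- Hypothetical device state log table: no primary key, but a unique
-- index on (device_id, ts) that UPSERTs can use to deduplicate reports.
CREATE TABLE device_state_logs (
    device_id   uuid        NOT NULL,
    account_id  uuid        NOT NULL,
    ts          timestamptz NOT NULL,
    status      text        NOT NULL,
    colo        text,
    metadata    jsonb
);

-- Unique index used as the conflict target for UPSERTs.
CREATE UNIQUE INDEX device_state_logs_device_ts_uq
    ON device_state_logs (device_id, ts);

-- Multicolumn index ordered to match the most common query shape:
-- equality on account_id first, then the time-range predicate.
CREATE INDEX device_state_logs_account_ts_idx
    ON device_state_logs (account_id, ts DESC);

-- UPSERT: insert a new reading, or overwrite the row if the device
-- already reported for that timestamp.
INSERT INTO device_state_logs (device_id, account_id, ts, status, colo, metadata)
VALUES ('5c9f0b6e-0000-4000-8000-000000000001',
        '5c9f0b6e-0000-4000-8000-000000000002',
        now(), 'connected', 'SJC', '{"client_version": "1.0"}')
ON CONFLICT (device_id, ts)
DO UPDATE SET status   = EXCLUDED.status,
              colo     = EXCLUDED.colo,
              metadata = EXCLUDED.metadata;
```

Putting the equality column before the range column in the multicolumn index is what makes it useful for queries like "all rows for this account in the last hour", since the planner can seek on account_id and then scan the time range.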
After launch, DEX grew to around 1,000 inserts per second and query performance degraded. Precomputed aggregates were introduced to speed up queries, yielding up to a 1000x improvement. Table partitioning was considered but not pursued because of the manual management it would have required.
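One common way to precompute aggregates in vanilla PostgreSQL is a materialized view refreshed on a schedule; the post's actual implementation may differ (for example, summary tables maintained directly by the ingest path), and every name and interval below is illustrative.

```sql
-- Hypothetical precomputed aggregate: per-device, per-minute rollups so
-- that dashboard queries scan a few rollup rows instead of raw logs.
CREATE MATERIALIZED VIEW device_state_logs_1m AS
SELECT device_id,
       account_id,
       date_trunc('minute', ts) AS bucket,
       count(*)                                     AS samples,
       count(*) FILTER (WHERE status = 'connected') AS connected_samples
FROM device_state_logs
GROUP BY device_id, account_id, date_trunc('minute', ts);

-- Unique index required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX device_state_logs_1m_uq
    ON device_state_logs_1m (device_id, account_id, bucket);

-- Rebuilt on a schedule (e.g. from a cron job); CONCURRENTLY keeps the
-- view readable while it is being refreshed.
REFRESH MATERIALIZED VIEW CONCURRENTLY device_state_logs_1m;
```

The trade-off is that the rollup is only as fresh as its last refresh, and each refresh re-reads the underlying table, which is part of what makes TimescaleDB's incrementally maintained aggregates attractive at higher volumes.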
TimescaleDB was explored as an alternative to improve query performance further, offering automatic partition management, continuous aggregates, and compression. A side-by-side comparison with plain PostgreSQL showed significant performance gains, particularly over longer time windows. Integrating TimescaleDB also made data retention and aggregation more efficient and simplified the infrastructure.
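A sketch of how these TimescaleDB features are typically wired up, assuming the same hypothetical table as above; the bucket sizes, offsets, and retention windows are illustrative, not the values from the post.

```sql
-- Convert the table into a hypertable; chunks (time-based partitions)
-- are then created and dropped automatically. Normally done on an empty
-- table, or with migrate_data => true for an existing one.
SELECT create_hypertable('device_state_logs', 'ts');

-- Continuous aggregate: an incrementally maintained hourly rollup.
CREATE MATERIALIZED VIEW device_state_logs_hourly
WITH (timescaledb.continuous) AS
SELECT device_id,
       time_bucket('1 hour', ts) AS bucket,
       count(*) AS samples
FROM device_state_logs
GROUP BY device_id, time_bucket('1 hour', ts);

-- Keep the rollup fresh on a schedule instead of manual refreshes.
SELECT add_continuous_aggregate_policy('device_state_logs_hourly',
       start_offset      => INTERVAL '1 day',
       end_offset        => INTERVAL '1 hour',
       schedule_interval => INTERVAL '1 hour');

-- Compress chunks older than a week and drop raw data after 90 days.
ALTER TABLE device_state_logs
    SET (timescaledb.compress, timescaledb.compress_segmentby = 'device_id');
SELECT add_compression_policy('device_state_logs', INTERVAL '7 days');
SELECT add_retention_policy('device_state_logs', INTERVAL '90 days');
```

Because retention is just "drop old chunks", it replaces the bulk DELETEs (and the bloat and vacuum work they cause) that a plain PostgreSQL table would need.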
Following DEX's success, other teams at Cloudflare adopted TimescaleDB for analytics and reporting, consolidating data from various sources. They managed high ingestion rates by optimizing the ingest path, for example by switching to COPY for bulk inserts and adjusting replication settings.
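A rough sketch of a COPY-based bulk ingest pattern; whether the teams load directly into the hypertable or go through a staging table as shown here is an assumption, all identifiers are illustrative, and the replication tuning mentioned above is not shown.

```sql
-- Hypothetical staging table matching the main table's shape; UNLOGGED
-- skips WAL for the intermediate data.
CREATE UNLOGGED TABLE device_state_logs_staging
    (LIKE device_state_logs);

-- Stream a whole batch in one round trip instead of many single-row
-- INSERTs. (COPY ... FROM STDIN reads the rows from the client; with
-- psql this is usually issued as \copy.)
COPY device_state_logs_staging (device_id, account_id, ts, status)
FROM STDIN WITH (FORMAT csv);

-- Merge the batch, reusing the same conflict target as the UPSERT path.
INSERT INTO device_state_logs (device_id, account_id, ts, status)
SELECT device_id, account_id, ts, status
FROM device_state_logs_staging
ON CONFLICT (device_id, ts)
DO UPDATE SET status = EXCLUDED.status;

TRUNCATE device_state_logs_staging;
```

The win comes from amortizing per-statement overhead: one COPY carrying thousands of rows costs far less in parsing, planning, and network round trips than thousands of individual INSERTs.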