When using Ring’s semantic video search, users expect to pinpoint specific moments, whether “a dog in my backyard,” “a package delivery,” or “someone wearing a blue shirt,” with remarkable speed and accuracy. Meeting this expectation requires a system that goes beyond traditional keyword matching: one built on PostgreSQL and pgvector and tailored to video data at massive scale.
In this post, we examine Ring’s billion-scale semantic video search powered by Amazon RDS for PostgreSQL and pgvector, covering the architectural decisions, cost-performance trade-offs, key lessons learned, and future directions. The Ring team has built a vector search solution designed for global scale, serving millions of users through vector embeddings: numerical representations of visual content generated by AI models. By transforming video frames into vectors that capture the visual content of each frame, Ring can store these representations efficiently and run similarity searches against them. When a user enters a query like “package delivery,” the system translates the text into a vector and retrieves the most similar video frames, returning relevant results in under two seconds.
Searching video at global scale
Ring’s mission is straightforward yet ambitious: enhancing neighborhood safety. Over the past 12 years, the company has refined video technology to ensure that users can monitor what matters most—whether it’s family members returning home or a cherished pet straying too far. To achieve this goal, Ring generates vast amounts of video data daily from millions of devices across the globe.
The video search system operates on a colossal scale, serving billions of read requests daily across four continents and nine AWS regions while adhering to strict latency requirements for millions of customers.
| Metric | Scale |
| --- | --- |
| Embeddings Stored | 100–200 billion |
| Daily New Embeddings | ~2 billion |
| Data Footprint | 140–150+ TB across 3 PostgreSQL clusters |
| Latency Target | <200 ms (P50), <2 seconds (worst case) |
| Read/Write Ratio | ~80% reads, ~20% writes |
| Global Presence | 4 continents, 9 AWS Regions |
Traditional metadata-based search methods, which rely on tags, timestamps, or manual annotations, fall short when it comes to accommodating natural-language queries. Ring required a fundamentally different approach capable of performing similarity searches at massive scale, ensuring real-time ingestion without compromising query performance, maintaining per-user data isolation aligned with privacy standards, and achieving operational resilience during traffic spikes, such as those experienced on Halloween.
Evaluating alternatives
Prior to finalizing a production architecture, the engineering team conducted a structured evaluation across three key dimensions: cost, query latency, and operational complexity.
Purpose-built vector databases
Initially, several dedicated vector search solutions appeared promising. They offered compelling features such as purpose-built indexing, managed approximate nearest neighbor (ANN) search, and native embedding workflows. However, upon closer examination, these systems did not align with Ring’s requirements. Given the query volume and data scale, the associated costs were prohibitive. These systems were optimized for hybrid keyword/vector search patterns, while Ring’s workload predominantly involved dense-vector retrieval, rendering the added complexity unjustifiable.
The S3-backed ANN prototype
The team also experimented with a custom ANN pipeline utilizing Amazon Simple Storage Service (Amazon S3) as the embedding store. The architecture was straightforward: high-dimensional embedding vectors were stored in S3, retrieved during query time, and processed in memory for nearest-neighbor searches. While functional, this approach did not meet the required speed. The latency incurred from fetching embeddings from object storage consistently exceeded Ring’s sub-2-second service level agreement (SLA). This experiment underscored a critical architectural constraint: vector data and query execution must be co-located.
Why RDS for PostgreSQL with pgvector won
With purpose-built vector database solutions ruled out due to cost and fit, and the S3 prototype dismissed for latency issues, the team settled on RDS for PostgreSQL with the pgvector extension. This decision was both pragmatic and well-informed.
The engineering team had deep PostgreSQL expertise across schema design, query tuning, replication, and observability. Introducing a new vector database would have meant onboarding a separate system, complete with new failure modes, operational runbooks, and monitoring gaps. With pgvector, the team gained native ANN search capabilities, via Hierarchical Navigable Small World (HNSW) or Inverted File Flat (IVFFlat) indexing, directly within the database engine it already trusted and understood.
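As an illustration of what that looks like in practice, here is a generic sketch of pgvector’s index DDL; the table and column names are placeholders rather than Ring’s schema (and, as discussed later in this post, Ring ultimately chose to run without these indexes):

-- Illustrative only: pgvector's ANN index methods on a generic table
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id BIGSERIAL PRIMARY KEY,
    embedding vector(768)    -- fixed-dimension embedding column
);

-- HNSW: graph-based ANN index, built here for inner-product distance
CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);

-- IVFFlat: clusters vectors into lists and probes a subset per query
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);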
As the Ring engineers succinctly stated: “We use PostgreSQL as a vector database.” This is not merely a workaround; it is a deliberate architectural choice that integrates relational and vector workloads into a single, well-understood tier—fulfilling every production requirement without expanding the operational surface area.
By opting for Amazon RDS for PostgreSQL with pgvector, Ring turned what could have been a costly infrastructure expansion into a capability enhancement, meeting production demands for cost, latency, and reliability without introducing new failure modes or a separate database tier. The approach also retains the flexibility to scale beyond a billion embeddings without an architectural redesign, demonstrating that the right choice is the one that aligns with operational realities.
Solution overview: semantic video search with pgvector
The AI Video Search leverages the Contrastive Language-Image Pre-training (CLIP) model, enabling users to search video footage using natural-language queries. While CLIP supports multiple input modalities (text and images), Ring’s implementation specifically utilizes the model to generate vector embeddings from video frames captured at regular intervals. These embeddings are stored in a database and searched against text-based query embeddings to identify the most relevant moments. The architecture employs managed AWS services to process billions of video frames daily while maintaining sub-second query latency.
Video ingestion and embedding generation
Figure 1: Video ingestion and embedding generation pipeline
- Video capture: Videos from Ring devices are stored in Amazon S3 buckets.
- Event-driven processing: S3 event notifications trigger processing workflows via Amazon Simple Queue Service (Amazon SQS) queues, which buffer incoming requests and decouple ingestion from processing—critical for managing traffic spikes during peak activity periods.
- Embedding generation: The CLIP model, operating on GPU-accelerated Amazon Elastic Container Service (Amazon ECS) instances, extracts frames at regular intervals from video recordings and generates 768-dimensional vector embeddings for each frame. Each embedding encapsulates the visual content of that moment in a format suitable for similarity search.
- Vector storage: Amazon RDS for PostgreSQL, utilizing the pgvector extension, stores embeddings. Each video frame becomes a searchable record, allowing users to locate specific moments across weeks of footage.
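To make the write path concrete, here is a minimal sketch of what storing one frame’s embedding might look like. The partition name and values are hypothetical, the full schema appears later in this post, and the toy three-dimensional vector stands in for a real 768-dimensional embedding:

-- Hypothetical per-frame insert (production writes are batched from
-- the ECS embedding workers; values and partition name are illustrative)
INSERT INTO embeddings.embeddings_user123_cam456_v2 (
    user_id, device_id, timestamp, frame_timestamp, video_timestamp,
    model_version, event_id, frame_index, content, inserted_at, updated_at
) VALUES (
    'user123', 'cam456', 1718000000, 1718000002, 1718000000,
    2, 'evt-0001', 3,
    '[0.012, -0.045, 0.101]'::halfvec,  -- toy 3-dim vector; real embeddings are 768-dim
    NOW(), NOW()
);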
Search query flow
Figure 2: Search query flow from user request to ranked results
- Natural-language query: A Ring user submits a search query such as “dog in my backyard” or “package delivery.”
- Query embedding: The query text is converted into a 768-dimensional embedding using the same CLIP model employed during ingestion.
- Similarity search: RDS for PostgreSQL executes a similarity search using pgvector’s dot-product (inner product) distance operator across the user’s stored embeddings; Ring selected this operator for its computational efficiency with normalized embedding vectors (see the query sketch after this list).
- Ranked results: The database returns the most relevant video segments, ranked by similarity score, delivering results in under two seconds.
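As a sketch of what that similarity search can look like in SQL, assuming the per-user partition schema shown later in this post (the partition name is hypothetical, and $1 is the query embedding bound by the application):

-- pgvector's <#> operator returns the negative inner product, so
-- ascending order ranks the most similar frames first
SELECT event_id,
       video_timestamp,
       frame_timestamp,
       (content <#> $1::halfvec) * -1 AS similarity
FROM embeddings.embeddings_user123_cam456_v2
ORDER BY content <#> $1::halfvec
LIMIT 25;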
Deep dive: architectural decisions
The architecture embodies several unconventional decisions that proved pivotal in achieving performance at this scale, each informed by rigorous testing and the unique demands of video search workloads.
User-based table partitioning
The most significant architectural decision involved implementing table partitioning by user. Instead of consolidating embeddings into a single massive table, Ring creates dedicated partitions for each user’s data. The partition identifier combines user ID, device ID, and model version, facilitating straightforward upgrades when the embedding model changes.
-- Per-user partition; ${tableIdentifier} combines user ID, device ID,
-- and embedding model version
CREATE TABLE embeddings.embeddings_${tableIdentifier} (
    user_id VARCHAR,
    device_id VARCHAR,
    timestamp BIGINT,
    frame_timestamp BIGINT,
    video_timestamp BIGINT,
    model_version SMALLINT,   -- CLIP model generation that produced the embedding
    event_id TEXT,
    frame_index INTEGER,
    content halfvec,          -- 768-dimensional half-precision frame embedding
    inserted_at TIMESTAMP,
    updated_at TIMESTAMP,
    UNIQUE(user_id, device_id, timestamp, frame_timestamp)  -- one row per frame
);
This partitioning strategy yields several advantages:
- Query performance: The PostgreSQL query optimizer leverages partition constraints to bypass irrelevant data, scanning only the target user’s partition (~1 GB) rather than the entire dataset.
- Data isolation: Each user’s data is logically separated, bolstering privacy and mitigating noisy-neighbor effects.
- Lifecycle management: Partitions can be created during user onboarding and promptly removed when features are disabled or subscriptions expire, eliminating the need for costly DELETE operations (see the DDL sketch after this list).
- Model versioning: Including the model version in the partition identifier enables Ring to maintain embeddings from various model generations concurrently, facilitating gradual rollouts of enhanced CLIP models without necessitating re-embedding of historical data.
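A minimal sketch of that lifecycle in DDL, with a hypothetical template table and partition names:

-- Onboarding: create a dedicated partition for this user/device/model
CREATE TABLE embeddings.embeddings_user123_cam456_v2
    (LIKE embeddings.embeddings_template INCLUDING ALL);

-- Model upgrade: new-generation embeddings land in a new partition
-- while the old one continues serving queries until it is retired
CREATE TABLE embeddings.embeddings_user123_cam456_v3
    (LIKE embeddings.embeddings_template INCLUDING ALL);

-- Offboarding: drop the partition outright instead of running
-- millions of row-level DELETEs
DROP TABLE embeddings.embeddings_user123_cam456_v2;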
Brute-force parallel search with dot-product distance: no vector indexes
In a notably counterintuitive architectural choice, Ring opted to forgo vector indexes entirely. Instead of using pgvector’s built-in ANN index methods such as HNSW or IVFFlat, the team relied on brute-force parallel sequential scans within each user partition, employing pgvector’s dot-product (inner product) distance operator to rank results.
The team evaluated ANN indexes early in development but set them aside without extensive benchmarking, because the rationale was already clear: the video search use case demands 100% recall. When a user searches for “dog in my backyard,” every matching moment must be retrieved. ANN indexes trade recall for speed, a compromise Ring’s video search application could not afford, since any loss in accuracy would directly undermine the user experience. Brute-force search is viable here due to several synergistic factors:
- Small partition size: Each user’s partition is approximately 1 GB—small enough to facilitate efficient scanning with parallel workers.
- Aggressive parallelism: Setting max_parallel_workers_per_gather = 16 allows PostgreSQL to deploy sufficient workers to fully utilize Amazon Elastic Block Store (Amazon EBS) bandwidth during partition scans.
- Zero index maintenance: With ~2 billion new embeddings ingested daily, avoiding index builds and maintenance alleviates a significant operational burden and mitigates write amplification.
- 100% recall: Brute-force search ensures that every relevant result is returned—crucial for a feature where missing a specific moment erodes user trust.
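A quick way to confirm this behavior is to inspect the query plan for a ranked scan. The following sketch reuses the hypothetical partition name from earlier with a toy query vector; on a production-sized partition, the expected plan is a Gather Merge over a Parallel Seq Scan rather than any index scan:

-- With no vector index present and parallel workers enabled, a large
-- partition is read by parallel sequential scan, scoring every stored
-- frame and therefore guaranteeing 100% recall
EXPLAIN (COSTS OFF)
SELECT event_id
FROM embeddings.embeddings_user123_cam456_v2
ORDER BY content <#> '[0.012, -0.045, 0.101]'::halfvec
LIMIT 25;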
Multi-cluster routing strategy
Ring distributes data across four PostgreSQL clusters per region using an assignment-based routing strategy. When a device is onboarded to the video search service, a weighted algorithm assigns it to a cluster based on current cluster size and capacity. This assignment is stored in Amazon DynamoDB—chosen for its single-digit millisecond read latency and global availability—and remains fixed for the device’s lifetime, ensuring that queries for a given user are directed to the same cluster housing their data.
This approach facilitates natural horizontal scaling: as the user base expands, new clusters can be added, and the weighting algorithm adjusted to direct new devices to clusters with available capacity. Data is partitioned across clusters so that if an entire cluster becomes unavailable, the impact is confined to the subset of users served by that cluster.
EBS-optimized instance selection
Through extensive testing, Ring identified EBS throughput as the primary bottleneck for their scan-heavy workload. The team evaluated multiple instance sizes within the r6id family. A key finding was that EBS-optimized instances delivered comparable I/O throughput irrespective of instance size; performance differences among sizes were driven primarily by CPU and RAM (which affect buffer cache hit rates) rather than raw EBS capability. This unexpected result confirmed that EBS-optimized networking, not instance size, was the critical factor.
| Instance Type | EBS Throughput (MB/s) | EBS IOPS | Avg Query Latency (ms) |
| --- | --- | --- | --- |
| db.r6id.4xlarge | 489 | 3,861 | 1,504 |
| db.r6id.8xlarge | 566 | 4,465 | 1,292 |
| db.r6id.16xlarge | 529 | 4,208 | 1,359 |
Table: EBS performance across instance types (687 MB per partition, no vector indexes). Note that the 16xlarge instance did not outperform the 8xlarge—confirming that EBS throughput, not instance size, is the determining factor for this workload.
This insight enabled Ring to optimize instance sizing for cost efficiency rather than over-provisioning to larger instance types that would not yield proportional performance gains.
Performance challenges and solutions
During the production rollout, Ring encountered two significant performance challenges that necessitated architectural refinements to maintain sub-second query latency at a global scale.
Challenge 1: Cold-start latency
Initial proof-of-concept (PoC) testing demonstrated excellent sub-2-second performance. However, after two months in production, latencies surged to over 10 seconds for certain queries. Investigations revealed that PoC testing had inadvertently relied on cached data, obscuring a critical architectural reality: RDS for PostgreSQL stores data on Amazon EBS volumes, not on local instance storage.
When data was not resident in PostgreSQL’s shared buffer cache, queries necessitated EBS I/O roundtrips, introducing substantial latency. PostgreSQL allocates 25% of instance RAM to shared buffers by default, meaning data beyond this threshold had to be fetched from EBS upon first access. This behavior was particularly pronounced on read replicas and following failover events, where instances began with entirely cold caches.
Solution: pg_prewarm, Read Optimized Instances, and buffer management
The team implemented a multi-layered strategy to mitigate cold-start latency. First, they deployed automated pg_prewarm scripts that preload frequently accessed user partitions into shared buffers during read replica initialization and after failover events. Additionally, Ring utilized RDS Read Optimized Instances (r6id family with local NVMe storage), which provide a local storage tier that complements EBS and helps reduce read latency for frequently accessed data. This combined approach trades startup time for consistent query performance:
| Scenario | Query Latency |
| --- | --- |
| Cold instance (no buffer cache) | ~10,500 ms |
| After pg_prewarm | <300 ms |
| Sustained (subsequent queries, same user) | Sub-second |
The 35x improvement from cold to warmed instances highlights the significance of buffer management for EBS-backed PostgreSQL deployments at this scale.
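A minimal sketch of the warm-up step, assuming the hypothetical partition name used earlier; Ring’s actual automation iterates over frequently accessed partitions on replica initialization and after failover:

-- pg_prewarm ships with PostgreSQL and is available on Amazon RDS
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Read the partition's heap pages into shared buffers so the first
-- query after a failover does not pay the EBS cold-read penalty
SELECT pg_prewarm('embeddings.embeddings_user123_cam456_v2');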
Challenge 2: Under-utilized EBS bandwidth
Performance testing revealed that single-threaded queries achieved only ~50 MB/s throughput against db.r6id instances capable of over 500 MB/s—resulting in over 90% of available EBS bandwidth remaining untapped. The root cause was PostgreSQL’s default query execution: the optimizer typically spawned only 2–4 parallel workers, insufficient to generate enough concurrent I/O operations to optimally utilize the available EBS bandwidth.
Solution: Aggressive parallel execution
The team configured max_parallel_workers_per_gather = 16, enabling PostgreSQL to deploy enough parallel workers to fully leverage available EBS bandwidth. Combined with user-based partitioning (which avoided the need for index-driven query plans that inhibited parallelism), this configuration allowed parallel sequential scans to elevate EBS throughput from ~50 MB/s to between 489 and 590 MB/s.
This optimization was only feasible because Ring had purposefully eliminated traditional indexes from their partitions. The presence of indexes would have led PostgreSQL’s query optimizer to favor index scans over parallel sequential scans, inadvertently restricting I/O concurrency. Removing indexes compelled the optimizer to select parallel scans—a counterintuitive decision that unlocked an order-of-magnitude improvement in throughput.
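Note that this setting does not act alone: the per-gather value is capped by the server-wide worker pools, so all three parameters matter. A sketch of the session-level equivalents, with illustrative values (on Amazon RDS these are applied through the DB parameter group):

-- Workers assigned to a single Gather / Gather Merge plan node
SET max_parallel_workers_per_gather = 16;
-- Server-wide pool of parallel workers; must be at least as large
SET max_parallel_workers = 16;
-- max_worker_processes must also be >= 16; it cannot be changed with
-- SET and requires a restart, so on RDS it is set via the parameter group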
Production performance
In production, the architecture consistently delivers low-latency performance across its global deployment:
| Percentile | Target | Actual |
| --- | --- | --- |
| P50 | <200 ms | ~200 ms |
| P95 | <2,000 ms | ~600 ms |
| P99 | <2,000 ms | ~600 ms (occasional spikes to 7–8s) |
The workload is heavily read-dominant (~80% reads, ~20% writes), with peak search traffic reaching approximately 300 requests per minute. Read and write latencies are not directly comparable: read operations scan a user’s entire embedding history, while writes involve storing individual video event embeddings. The system currently employs halfvec for storage optimization.
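As context for halfvec: pgvector stores vector components as 32-bit floats (4 * dimensions + 8 bytes per value), while halfvec uses 16-bit floats (2 * dimensions + 8 bytes), roughly halving storage for Ring’s 768-dimensional embeddings. A toy comparison:

-- Compare on-disk size of the same toy vector in both types.
-- At 768 dimensions this works out to roughly 3,080 bytes (vector)
-- vs roughly 1,544 bytes (halfvec) per embedding.
SELECT pg_column_size('[0.012, -0.045, 0.101]'::vector)  AS vector_bytes,
       pg_column_size('[0.012, -0.045, 0.101]'::halfvec) AS halfvec_bytes;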
Reliability and operations
The reliability strategy centers on Amazon RDS Multi-AZ deployments at the cluster level for automated failover. Each PostgreSQL cluster operates as an independent unit with its own set of user partitions. Data is distributed across clusters so that a single cluster failure limits the impact to only those users assigned to that cluster, preventing a global outage.
For monitoring, the team employs Amazon CloudWatch, utilizing a combination of built-in RDS metrics and custom application metrics, complemented by CloudWatch alarms for automated alerting. Key metrics tracked include EBS throughput and IOPS utilization, query latency distributions, buffer cache hit rates, and cluster-level connection counts.
ML and embedding pipeline
Ring’s embedding pipeline is powered by the CLIP model, which generates 768-dimensional vector representations of video frames. CLIP is particularly well-suited for this application because it produces aligned embeddings for both images and text within the same vector space, ensuring that a text query like “dog in my backyard” and a video frame depicting a dog in a yard yield vectors that sit close together in 768-dimensional space. The 768 dimensions are inherent to the CLIP model architecture and provide a rich representation that balances accuracy with storage efficiency at scale.
The implementation uses the model’s image encoding capabilities to process video frames at regular intervals, sampling frames periodically and generating an embedding for each. These embeddings capture the visual content of each moment in a format optimized for similarity search. The partition schema includes a model_version field, allowing Ring to maintain embeddings from different CLIP model generations concurrently. This design enables gradual rollouts of improved models without bulk re-embedding of historical data: new embeddings are simply written to new model-versioned partitions while older partitions continue to serve queries until they are retired.
Lessons learned
The journey to billion-scale vector search has imparted several key insights regarding PostgreSQL performance at extreme scale:
- Design for EBS behavior from day one. RDS for PostgreSQL stores data on Amazon EBS. Understanding throughput limits, buffer cache behavior, and cold-start latency characteristics should inform architectural decisions prior to writing the first line of code—rather than addressing issues post-production.
- Partition strategically for your access pattern. User-based partitioning emerged as the single most impactful optimization. It simultaneously enhanced query performance, simplified data lifecycle management, supported model versioning, and, crucially, unlocked parallel query execution by removing the need for traditional indexes.
- Test with cold caches. PoC results that depend on warmed buffer caches can misrepresent production behavior. Validate performance with cold instances and realistic data distributions to avoid surprises after launch.
- Challenge conventional indexing wisdom. For workloads requiring 100% recall on bounded per-user datasets, brute-force parallel scans can outperform ANN indexes while eliminating index maintenance overhead. This is particularly true when combined with aggressive parallelism and EBS-optimized instances.
- Validate I/O throughput for your specific workload. EBS-optimized instances vary in their ability to utilize available bandwidth. Testing across instance families and fine-tuning PostgreSQL’s parallel worker configuration can unlock significant throughput improvements.
What’s next?
Ring continues to explore next-generation AWS capabilities and architectural enhancements to further optimize their solution:
- Amazon Aurora PostgreSQL-Compatible Edition: During the initial evaluation, both Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL were tested. Based on their specific workload characteristics at the time, particularly cold-data access patterns favoring provisioned IOPS on RDS, the team selected RDS for PostgreSQL. As Aurora PostgreSQL and pgvector capabilities have evolved significantly, Ring is actively assessing Aurora to optimize operations and reduce the overhead of managing multiple RDS clusters. As the team noted, “The solution works well and meets key performance indicators (KPIs), but launching new clusters and operating them is not very efficient.” Aurora’s horizontal scaling capabilities could greatly simplify Ring’s multi-cluster architecture.
- Caching and performance optimization: Ring is investigating additional caching strategies, including evaluating Amazon ElastiCache, to complement their pg_prewarm approach and further reduce latency for frequently accessed user data.
- Next-generation vector indexing: As pgvector continues to mature with improved HNSW and IVFFlat implementations, Ring plans to reassess whether hybrid approaches (indexed pre-filtering with brute-force final ranking) could further reduce latency while maintaining recall assurances.
- Improved ingestion pipelines: Enhancing the efficiency of handling event-driven spikes, particularly during high-activity periods like Halloween when doorbell activity surges.
- Lower tail latency: Aiming to reduce P99 latency spikes to consistently meet the sub-2-second SLA.
- Next-generation embedding models: Ring continues to evaluate improved embedding models that may offer higher accuracy, smaller dimensions, or better performance characteristics for video search workloads.