Better Relevance for AI Apps With BM25 Algorithm in PostgreSQL

Tiger Data, previously known as Timescale, has recently unveiled a preview version of pg_textsearch, an open-source extension for PostgreSQL that leverages the BM25 algorithm to enhance keyword search relevance and performance. The initial response has been remarkable, with the project amassing over 1,800 stars on GitHub shortly after its release.

The rebranding from Timescale to Tiger Data reflects a strategic shift. Initially focused on time-series databases, the company recognized that developers were using its PostgreSQL implementation for a variety of applications beyond time series. As the company broadened its scope to include cloud offerings and PostgreSQL for Agents, the name change aimed to eliminate confusion, as explained by Mike Freedman, founder and CTO.

In recent months, Tiger Data has concentrated on enhancing search capabilities within PostgreSQL, particularly for AI applications. Freedman noted, “Our customers expressed a desire for a general-purpose search primitive to explore AI search, which was not readily available in the market. This prompted us to develop and open-source our solution.” The pg_textsearch extension aims to address the growing need for improved search functionality in a database that is more than 30 years old.

The Need for Better Keyword Search in the AI Era

The resurgence of interest in PostgreSQL, often referred to as a “boring” yet reliable database, has been particularly pronounced in light of the recent AI boom. While vector databases initially garnered attention, there is a noticeable trend towards integrating vector and keyword search, as Freedman highlighted.

Traditional search engines like Apache Lucene and Elasticsearch, along with PostgreSQL’s native capabilities, have long provided keyword search functionality. However, the advent of AI has accelerated the demand for more relevant search results. Senior software engineer TJ (Todd) Green elaborated, stating that AI-native applications require search capabilities tailored for large language models (LLMs) to retrieve context effectively. He emphasized the importance of combining semantic understanding from vector searches with the precision of keyword matching to enhance result quality.

What Is the BM25 Algorithm?

The BM25 algorithm, or Best Matching 25, is a sophisticated method for ranking relevance in information retrieval systems, offering improvements over the traditional TF-IDF (Term Frequency-Inverse Document Frequency) approach. The pg_textsearch extension utilizes a memtable architecture to index and rank information, implementing several key features:

  • Inverse document frequency to prioritize rare terms.
  • Term frequency saturation so that repeated occurrences of a term in one document yield diminishing returns.
  • Document length normalization to keep long documents from skewing results.
  • Relative ranking that emphasizes rank order over absolute score values.
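
To make the list above concrete, here is a minimal sketch of the textbook BM25 scoring function in Python. It is not the pg_textsearch implementation: the function name `bm25_scores`, the pre-tokenized input format, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration, and the extension's own parameters and internals may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the `query` terms.

    Illustrative only: `k1` controls term frequency saturation and `b`
    controls how strongly document length is normalized (assumed defaults).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: documents longer than average are penalized.
        norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Inverse document frequency: rare terms get a larger weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturation: this ratio approaches k1 + 1 as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores
```

Because the scores depend on corpus statistics such as average document length, they are best used to order results within one corpus rather than compared as absolute values across corpora.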

Compatible with PostgreSQL versions 17 and 18, pg_textsearch addresses performance issues that arise with large corpus sizes in native PostgreSQL searches. By allowing users to set memory sizes for their corpus and apply score thresholds, it enhances performance by filtering out low-relevance results.

When combined with pgvector and pgvectorscale, developers can integrate keyword and vector searches within PostgreSQL using a single SQL query, thereby simplifying data retrieval.
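
The article does not say how keyword and vector results are merged. One common option is reciprocal rank fusion (RRF), sketched below in Python purely to show the idea. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the document IDs are illustrative assumptions. In practice the same logic would be expressed in SQL over the two ranked result sets.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Assumed technique (not confirmed by the article): each list contributes
    1 / (k + rank) per document, so only rank order matters, not raw scores.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


# Hypothetical result lists from a keyword (BM25) query and a vector query.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
merged = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF works on ranks alone, which suits BM25 output because BM25 scores are meaningful as an ordering rather than as absolute values.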

How pg_textsearch Was Built

The development of pg_textsearch commenced in October, following a few months of planning. Green noted that the decision to commit to this project was challenging, given the rapidly evolving landscape of technology. “A project like this would have required significant time and resources pre-AI tools, so we opted for a different approach,” he explained.

That approach involved collaboration with AI tools, specifically Claude Code with the Opus model. Green, who has a background in computer science and database development, used it to streamline the development process. He anticipates that a production-ready version will be available early in the new year, contingent on user feedback and performance assessments.

Freedman emphasized the significance of PostgreSQL in the developer community, stating, “Postgres has captured the hearts and minds of developers. Our goal is to extend its capabilities, enabling developers to create simpler, more efficient data architectures without the need for multiple databases.” He expressed optimism about the transformative impact of AI on development practices and the potential for PostgreSQL to adapt to these changes.
