Migrate full-text search from SQL Server to Amazon Aurora PostgreSQL-compatible edition or Amazon RDS for PostgreSQL

Efficiently searching and retrieving information from large datasets is a core requirement for data-driven applications. Like most commercial and open-source databases, PostgreSQL handles structured data well, and it also provides tools for searching unstructured or semi-structured text: built-in full-text search (FTS) and extensions such as pg_trgm and pg_bigm.

Traditional SQL queries that use the LIKE and ILIKE operators or regular expressions work well for exact or pattern-based matching on structured data. They are less effective for searching large blocks of unstructured text, such as documents, articles, or product descriptions.

When transitioning from a commercial database like SQL Server to an open-source alternative such as PostgreSQL, several considerations must be addressed. Migrating full-text search from SQL Server to Amazon Aurora PostgreSQL-Compatible Edition necessitates adjustments to both queries and schema structures, as the full-text search implementations differ between the two systems. Notably, the AWS Schema Conversion Tool (AWS SCT) does not automatically convert full-text search code.

SQL Server’s FTS is engineered to search for specific words, phrases, or even variations of words (known as stemming) within unstructured or semi-structured text data. This functionality facilitates rapid searching, ranking, and indexing of textual content, rendering it an invaluable resource for applications that manage substantial volumes of text-based information.

Prerequisites

For this discussion, we utilize the AdventureWorks2019 sample database to illustrate the migration of FTS from SQL Server 2019 Standard Edition to PostgreSQL. The following high-level steps outline the implementation of FTS on a SQL Server database:

  1. Enable full-text search for the AdventureWorks2019 database:
    USE [AdventureWorks2019] 
    GO 
    EXEC sp_fulltext_database 'enable' 
    GO
  2. Create a full-text catalog:
    CREATE FULLTEXT CATALOG DescFTSCatalog; 
    GO

    A full-text catalog serves as a logical container for full-text indexes, organizing the index data and defining language-specific word breakers and stemmers.

  3. Define a full-text index on the columns containing text data that require searching:
    CREATE FULLTEXT INDEX 
    ON [AdventureWorks2019].[Production].[ProductDescription]([Description]) 
    KEY INDEX [PK_ProductDescription_ProductDescriptionID] 
    ON DescFTSCatalog 
    GO
  4. Utilize AWS SCT and AWS Database Migration Service (AWS DMS) to convert and migrate the AdventureWorks2019 database from SQL Server to Amazon Aurora PostgreSQL.

For best practices regarding migration, refer to Migrate SQL Server to Amazon Aurora PostgreSQL using best practices and lessons learned from the field. In this instance, we will migrate the ProductDescription table from SQL Server to Amazon Aurora PostgreSQL using AWS SCT and AWS DMS.

PostgreSQL presents several methods for text searching: exact search, pattern matching, regular expressions, and full-text search. The subsequent sections will detail how to implement FTS in PostgreSQL on the migrated database to achieve comparable results.

Full-text search in PostgreSQL

The LIKE and ILIKE operators, along with regular expressions, are used in a query’s WHERE clause for pattern matching with wildcard characters. They cannot rank results by how closely each row matches the search pattern, and they have no linguistic awareness: they do not stem words or skip frequently used words such as “the” and “is.” PostgreSQL’s built-in full-text search covers these gaps through the tsvector and tsquery data types, along with associated functions, operators, and parameters.

The tsvector data type represents a preprocessed and transformed version of text data, optimized for efficient full-text searching. It encapsulates information about individual words or tokens within a text document, facilitating swift and accurate text search operations. The to_tsvector function can be utilized to convert text data (documents) into tsvector format, which is essential for efficient data searching.

The tsquery data type holds one or more lexemes, the normalized units of text used for searching, optionally combined with operators. Lexemes can be specified in exact or prefix form. By combining lexemes with operators such as & (AND), | (OR), and ! (NOT), you can build complex search conditions. The to_tsquery function converts search text that already contains these operators into a tsquery; for plain text without operators, use plainto_tsquery. Either result can be matched against tsvector columns and values. For instance, applying plainto_tsquery with the english configuration to the phrase “He is running in the park” yields the lexemes “run” and “park”: stop words such as “he,” “is,” “in,” and “the” are dropped, “running” is stemmed to “run,” and all words are converted to lowercase.
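The normalization described above can be observed directly. The following is a minimal sketch, assuming a PostgreSQL session in which the built-in pg_catalog.english configuration is available; the sample sentence is the one used in the text, and the column aliases are illustrative.

```sql
-- Inspect how a document and a plain-text search string are normalized
SELECT to_tsvector('pg_catalog.english', 'He is running in the park')     AS document_vector,
       plainto_tsquery('pg_catalog.english', 'He is running in the park') AS query_terms;

-- The @@ operator tests whether a tsvector satisfies a tsquery
SELECT to_tsvector('pg_catalog.english', 'He is running in the park')
       @@ to_tsquery('pg_catalog.english', 'run & park') AS is_match;
```

plainto_tsquery is used for the sentence because to_tsquery expects operators between terms and rejects plain multi-word text.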

In the following sections, we will illustrate several use cases for full-text search and provide strategies for migrating them from SQL Server to Amazon Aurora PostgreSQL-compatible and Amazon RDS for PostgreSQL.

CONTAINS predicate with AND operator

Basic FTS queries in SQL Server utilize the CONTAINS predicate. This predicate in Transact-SQL offers a versatile approach to executing advanced FTS in SQL Server databases. It accommodates various search conditions, proximity searches, wildcards, and thesaurus features, enabling users to tailor their queries to meet specific requirements.

In the following sample query, the CONTAINS predicate checks for the terms “entry” and “level” within the Description column:

SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE CONTAINS([Description], 'entry & level');

This query can be translated into PostgreSQL using the to_tsvector and to_tsquery functions, as demonstrated below. The example names the built-in text search configuration pg_catalog.simple explicitly; the configuration used when this argument is omitted is governed by the default_text_search_config parameter.

SQL Server
SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE CONTAINS([Description], 'entry & level');
PostgreSQL
SELECT productdescriptionid, description 
FROM production.productdescription 
WHERE to_tsvector('pg_catalog.simple', Description) @@ to_tsquery('pg_catalog.simple', 'entry & level');
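The choice of configuration changes which lexemes are produced. As a rough sketch, assuming only the built-in pg_catalog.simple and pg_catalog.english configurations (the sample sentence is made up for illustration), the two can be compared side by side:

```sql
-- Show which configuration is used when none is passed explicitly
SHOW default_text_search_config;

-- simple lowercases tokens only; english also removes stop words and stems
SELECT to_tsvector('pg_catalog.simple',  'Entry level bikes for beginners') AS simple_vector,
       to_tsvector('pg_catalog.english', 'Entry level bikes for beginners') AS english_vector;
```

Whichever configuration is chosen, the same one should be used for both to_tsvector and to_tsquery so that document and query lexemes are normalized identically.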

CONTAINS predicate with OR operator

This scenario mirrors the previous use case, with the distinction that the check is performed using the OR operator. In the following sample query, the predicate verifies the presence of “entry,” “level,” or both:

SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE CONTAINS([Description], 'entry | level');

This query can similarly be rewritten in PostgreSQL using the to_tsvector and to_tsquery functions, as shown below, again using the built-in text search configuration pg_catalog.simple.

SQL Server
SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE CONTAINS([Description], 'entry | level');
PostgreSQL
SELECT productdescriptionid, description 
FROM production.productdescription 
WHERE to_tsvector('pg_catalog.simple', Description) @@ to_tsquery('pg_catalog.simple', 'entry | level');
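Beyond & and |, tsquery also supports negation (!) and phrase matching (<->). The following sketch assumes the same production.productdescription table; the search terms are illustrative and this is not a query from the original migration.

```sql
-- Rows that contain "entry" but not "level"
SELECT productdescriptionid, description
FROM production.productdescription
WHERE to_tsvector('pg_catalog.simple', description)
      @@ to_tsquery('pg_catalog.simple', 'entry & !level');

-- Rows where "entry" is immediately followed by "level";
-- phraseto_tsquery builds the same kind of query from plain text
SELECT productdescriptionid, description
FROM production.productdescription
WHERE to_tsvector('pg_catalog.simple', description)
      @@ phraseto_tsquery('pg_catalog.simple', 'entry level');
```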

FREETEXT predicate

The FREETEXT predicate in Transact-SQL (T-SQL) facilitates full-text searches in SQL Server databases. Unlike the CONTAINS predicate, which requires explicit terms and operators, FREETEXT allows for more flexible, natural language-based searches.

In the following sample query, FREETEXT examines the terms “entry” or “level” and their variations (via stemming) within the Description column:

SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE FREETEXT([Description], 'entry level');

This query can be reformulated in PostgreSQL using the to_tsvector and to_tsquery functions, as illustrated below, utilizing the configuration value pg_catalog.english. This configuration employs english_stem and a simple dictionary to convert tokens into lexemes, which represent normalized forms of words or tokens suitable for indexing and search operations.

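SQL Server
SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE FREETEXT([Description], 'entry level');
PostgreSQL
SELECT productdescriptionid, description 
FROM production.productdescription 
WHERE to_tsvector('pg_catalog.english', description) @@ to_tsquery('pg_catalog.english', 'entry | level');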

FREETEXTTABLE function with RANK

FTS in SQL Server can produce an optional score (or rank value) that indicates the relevance of the data returned by a full-text query. This rank value is computed for each row and can serve as a criterion for ordering the result set of a query based on relevance. The rank values reflect only a relative order of relevance among the rows in the result set, lacking significance across different queries.

In the following sample queries, FREETEXTTABLE checks for the words “entry” or “level” and their forms (using stemming) in the Description column while also retrieving the RANK information:

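SELECT FT_TBL.ProductDescriptionID, FT_TBL.Description, KEY_TBL.RANK 
FROM Production.ProductDescription AS FT_TBL 
INNER JOIN FREETEXTTABLE(Production.ProductDescription, Description, 'entry level') AS KEY_TBL ON FT_TBL.ProductDescriptionID = KEY_TBL.[KEY] 
ORDER BY KEY_TBL.RANK DESC;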

In PostgreSQL, the ts_rank function is employed to compute the relevance ranking of search results based on their alignment with a specific query. The ranking is derived from a numerical value that signifies how closely a document corresponds to the search terms in the query.

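SQL Server
SELECT FT_TBL.ProductDescriptionID, FT_TBL.Description, KEY_TBL.RANK 
FROM Production.ProductDescription AS FT_TBL 
INNER JOIN FREETEXTTABLE(Production.ProductDescription, Description, 'entry level') AS KEY_TBL ON FT_TBL.ProductDescriptionID = KEY_TBL.[KEY] 
ORDER BY KEY_TBL.RANK DESC;
PostgreSQL
SELECT productdescriptionid, description, ts_rank(to_tsvector('pg_catalog.english', description), to_tsquery('pg_catalog.english', 'entry | level')) AS rank 
FROM production.productdescription 
WHERE to_tsvector('pg_catalog.english', description) @@ to_tsquery('pg_catalog.english', 'entry | level') 
ORDER BY rank DESC;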

The ts_headline function is utilized to generate a summarized version of a document’s text, highlighting the most relevant sections that correspond to a specific search query. This function is particularly useful for crafting search result snippets or headlines that provide context to users regarding the relevance of a certain document to their search.
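A snippet query along these lines might look as follows. This is a sketch rather than the article's original example: it assumes the production.productdescription table and the 'entry | level' search used earlier, and the option values are illustrative.

```sql
-- Return a short excerpt of each matching description with the hits wrapped in <b> tags
SELECT productdescriptionid,
       ts_headline('pg_catalog.english',
                   description,
                   to_tsquery('pg_catalog.english', 'entry | level'),
                   'StartSel=<b>, StopSel=</b>, MaxWords=20, MinWords=5') AS snippet
FROM production.productdescription
WHERE to_tsvector('pg_catalog.english', description)
      @@ to_tsquery('pg_catalog.english', 'entry | level');
```

ts_headline works on the original text rather than on a stored tsvector, so it is usually applied only to rows that have already been filtered by the @@ match.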

CONTAINSTABLE and FORMSOF functions with RANK

The FORMSOF function in SQL Server facilitates inflectional searches, which involve searching for various forms of a word, including plurals, verb tenses, or related word forms. This capability enhances the likelihood of discovering relevant documents, even if they contain variations of the search term, thereby improving search accuracy.

In the following sample queries, CONTAINSTABLE checks for the word “gear” and its forms (using INFLECTIONAL) in the Description column while also retrieving the RANK information:

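SELECT FT_TBL.ProductDescriptionID, FT_TBL.Description, KEY_TBL.RANK 
FROM Production.ProductDescription AS FT_TBL 
INNER JOIN CONTAINSTABLE(Production.ProductDescription, Description, 'FORMSOF(INFLECTIONAL, gear)') AS KEY_TBL ON FT_TBL.ProductDescriptionID = KEY_TBL.[KEY] 
ORDER BY KEY_TBL.RANK DESC;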

In PostgreSQL, phrases are initially segmented into words or tokens, which are then normalized and stemmed to form base words (lexemes) using the pg_catalog.english FTS configuration. This process ensures that inflectional searches are automatically managed, as the lexemes will be consistent across different forms (stemming) of a word.

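SQL Server
SELECT FT_TBL.ProductDescriptionID, FT_TBL.Description, KEY_TBL.RANK 
FROM Production.ProductDescription AS FT_TBL 
INNER JOIN CONTAINSTABLE(Production.ProductDescription, Description, 'FORMSOF(INFLECTIONAL, gear)') AS KEY_TBL ON FT_TBL.ProductDescriptionID = KEY_TBL.[KEY] 
ORDER BY KEY_TBL.RANK DESC;
PostgreSQL
SELECT productdescriptionid, description, ts_rank(to_tsvector('pg_catalog.english', description), to_tsquery('pg_catalog.english', 'gear')) AS rank 
FROM production.productdescription 
WHERE to_tsvector('pg_catalog.english', description) @@ to_tsquery('pg_catalog.english', 'gear') 
ORDER BY rank DESC;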
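To see why no FORMSOF equivalent is needed in PostgreSQL, the stemming step can be inspected on its own. The following sketch assumes the built-in english_stem dictionary and pg_catalog.english configuration; the word forms are illustrative.

```sql
-- Ask the english_stem dictionary for the lexeme of individual word forms
SELECT ts_lexize('english_stem', 'gears')   AS from_plural,
       ts_lexize('english_stem', 'gearing') AS from_gerund;

-- Check whether a document containing several inflected forms matches a query for the base word
SELECT to_tsvector('pg_catalog.english', 'gears, geared, gearing')
       @@ to_tsquery('pg_catalog.english', 'gear') AS is_match;
```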

Improving query performance in PostgreSQL

In the sample PostgreSQL queries presented earlier, the to_tsvector function computes the tsvector value from the description column of the productdescription table for every row at query time, which becomes expensive as the table grows. The following sections explore strategies to improve query performance.

Solution 1: Use a GIN index

The GIN (Generalized Inverted Index) in PostgreSQL is a widely adopted indexing method designed to speed up searches over composite values, such as arrays, JSON documents, and tsvector values used in full-text search. While the standard database index, a B-tree, is optimized for equality and range comparisons on whole values, GIN indexes the components of a value (for example, each lexeme in a tsvector) separately, so queries that look for particular components can locate matching rows without scanning the whole table.

To implement this approach, create an expression-based GIN index on the relevant column within the productdescription table.

  1. Execute the following command:
    CREATE INDEX productdescription_gin_index 
    ON production.productdescription 
    USING GIN (to_tsvector('pg_catalog.english', description));

    If the table contains millions of rows, consider increasing the maintenance_work_mem configuration parameter at the session level to shorten index creation time. This parameter sets the maximum amount of memory used by maintenance operations such as index creation; it defaults to 64 MB in PostgreSQL.

  2. Run the following EXPLAIN ANALYZE query:
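    EXPLAIN ANALYZE SELECT productdescriptionid, description 
    FROM production.productdescription 
    WHERE to_tsvector('pg_catalog.english', description) @@ to_tsquery('pg_catalog.english', 'entry | level');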

The output will reveal a bitmap index scan being conducted on productdescription_gin_index, indicating improved query performance. The following screenshots illustrate the explain plan prior to and following the creation of the index.
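The session-level maintenance_work_mem adjustment mentioned in step 1 could be sketched as follows. The 256MB value is illustrative, not a recommendation from the original article, and the index definition is assumed to be an expression index on the description column named productdescription_gin_index.

```sql
-- Raise the limit for the current session only (illustrative value)
SET maintenance_work_mem = '256MB';

-- Build the expression-based GIN index while the higher limit is in effect
CREATE INDEX IF NOT EXISTS productdescription_gin_index
ON production.productdescription
USING GIN (to_tsvector('pg_catalog.english', description));

-- Return to the server default for the rest of the session
RESET maintenance_work_mem;
```

For the planner to use an expression index, the query must repeat the indexed expression exactly, including the explicit configuration name.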

Solution 2: Use a stored generated column

This approach involves creating a computed column description_tsv that retains the tsvector value derived from the description column in the table, followed by establishing a GIN index on the computed column.

  1. Execute the following commands:
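    -- Assumption: the expression index from Solution 1 is dropped so that its name can be reused
    DROP INDEX IF EXISTS production.productdescription_gin_index; 
    ALTER TABLE production.productdescription 
    ADD COLUMN description_tsv tsvector 
    GENERATED ALWAYS AS (to_tsvector('pg_catalog.english', description)) STORED; 
    CREATE INDEX productdescription_gin_index 
    ON production.productdescription USING GIN (description_tsv);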
  2. Run the following sample EXPLAIN ANALYZE query:
    EXPLAIN ANALYZE SELECT productdescriptionid, description 
    FROM production.productdescription 
    WHERE description_tsv @@ to_tsquery('pg_catalog.english', 'entry | level');

The output will indicate that a bitmap index scan is being performed on productdescription_gin_index, demonstrating an enhancement in query performance.

Full-text search in PostgreSQL using the pg_trgm extension

The pg_trgm extension in PostgreSQL facilitates text search functionalities by leveraging trigrams. Trigrams consist of sets of three consecutive characters extracted from a given string. By employing trigrams, users can identify similarities or matches in text patterns by comparing the number of trigrams that align between strings, alongside a predefined similarity threshold parameter set before executing the search.
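The trigram decomposition can be inspected directly. The sketch below assumes the pg_trgm extension can be created in the current database; the sample words are illustrative.

```sql
-- Make the trigram functions available
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- List the trigrams that pg_trgm extracts from a word (the word is padded with blanks first)
SELECT show_trgm('gear') AS trigrams;

-- Score two strings by the share of trigrams they have in common (0 to 1)
SELECT similarity('gear', 'gears') AS score;
```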

The pg_trgm extension offers operators that enable the creation of trigram indexes on text columns requiring search capabilities. This indexing allows for efficient similarity operations on the indexed columns. The extension provides three similarity operations: similarity (%), word_similarity (<%), and strict_word_similarity (<<%). The threshold parameters for these operations are pg_trgm.similarity_threshold, pg_trgm.word_similarity_threshold, and pg_trgm.strict_word_similarity_threshold, which can be configured to values ranging from 0 (no similarity) to 1 (perfect match). The functions similarity(), word_similarity(), and strict_word_similarity() are utilized to compute the similarity score. The implementation of pg_trgm can be executed as follows:

  1. Run the command to create the pg_trgm extension:
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

  2. Execute the command to create the GIN index on the description column of the productdescription table:
    -- The index name is an assumption chosen for illustration
    CREATE INDEX productdescription_trgm_index 
    ON production.productdescription 
    USING GIN (description gin_trgm_ops);
  3. Set the pg_trgm.similarity_threshold parameter to 0.2. The % operator returns true when the similarity() score of two strings, a value between 0 and 1 based on their shared trigrams, exceeds this threshold:
    -- The search string 'entry level' is illustrative
    SET pg_trgm.similarity_threshold = 0.2; 
    SELECT productdescriptionid, description, similarity(description, 'entry level') AS score 
    FROM production.productdescription 
    WHERE description % 'entry level' 
    ORDER BY score DESC;
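  4. Set the pg_trgm.word_similarity_threshold parameter to 0.6. The <% operator compares the search string with the most similar continuous extent of words in the column value and returns true when that score exceeds this threshold:
    -- The search string 'entry level' is illustrative
    SET pg_trgm.word_similarity_threshold = 0.6; 
    SELECT productdescriptionid, description, word_similarity('entry level', description) AS score 
    FROM production.productdescription 
    WHERE 'entry level' <% description 
    ORDER BY score DESC;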
  5. Set the pg_trgm.strict_word_similarity_threshold parameter to 0.6. The <<% operator works like <%, except that the matching extent in the column value must start and end on word boundaries:
    -- The search string 'entry level' is illustrative
    SET pg_trgm.strict_word_similarity_threshold = 0.6; 
    SELECT productdescriptionid, description, strict_word_similarity('entry level', description) AS score 
    FROM production.productdescription 
    WHERE 'entry level' <<% description 
    ORDER BY score DESC;
  6. Run the command to drop the index and enable sequential scan:
    -- Assumes the trigram index was named productdescription_trgm_index
    DROP INDEX IF EXISTS production.productdescription_trgm_index; 
    SET enable_seqscan = on;
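The difference between the three scoring functions is easiest to see on a short pair of strings. This sketch assumes pg_trgm is installed; the strings are made up for illustration.

```sql
-- Compare the three pg_trgm scores for the same search string and target text
SELECT similarity('level', 'entry level bike')             AS whole_string_score,
       word_similarity('level', 'entry level bike')        AS best_extent_score,
       strict_word_similarity('level', 'entry level bike') AS word_bounded_score;
```

similarity() compares the two strings as a whole, while the word-level functions look for the best-matching portion of the second string, which is why they tend to suit short search terms matched against long descriptions.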

Full-text search in PostgreSQL using the pg_bigm extension

The pg_bigm extension in PostgreSQL enhances full-text search capabilities, particularly for languages with intricate character sets, such as Asian languages. A bigram is a pair of consecutive characters extracted from a string. The extension segments text into bigrams and builds an index on them. The pg_bigm extension provides the bigm_similarity() function, the similarity operator =%, and the pg_bigm.similarity_limit threshold parameter. The implementation of pg_bigm can be executed as follows:

  1. Run the command to create the pg_bigm extension. For guidance on creating the extension in Amazon RDS for PostgreSQL, refer to Using PostgreSQL extensions with Amazon RDS for PostgreSQL:
    CREATE EXTENSION IF NOT EXISTS pg_bigm;
  2. Run the command to create the GIN index on the description column of the productdescription table:
    -- The index name is an assumption chosen for illustration
    CREATE INDEX productdescription_bigm_index 
    ON production.productdescription 
    USING GIN (description gin_bigm_ops);
  3. Set the pg_bigm.similarity_limit parameter to 0.15. The =% operator returns true when the bigm_similarity() score of two strings, a value between 0 and 1 based on their shared bigrams, reaches this limit:
    -- The search string 'entry level' is illustrative
    SET pg_bigm.similarity_limit = 0.15; 
    SELECT productdescriptionid, description, bigm_similarity(description, 'entry level') AS score 
    FROM production.productdescription 
    WHERE description =% 'entry level' 
    ORDER BY score DESC;
  4. Run the command to drop the index and enable sequential scan:
    -- Assumes the bigram index was named productdescription_bigm_index
    DROP INDEX IF EXISTS production.productdescription_bigm_index; 
    SET enable_seqscan = on;
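The bigram decomposition, and the LIKE-style search that a pg_bigm GIN index is designed to accelerate, can be sketched as follows. This assumes the pg_bigm extension is installed and the production.productdescription table exists; the search strings are illustrative.

```sql
-- List the bigrams that pg_bigm extracts from a word
SELECT show_bigm('gear') AS bigrams;

-- likequery() wraps a keyword in % wildcards and escapes LIKE metacharacters,
-- producing a pattern that a gin_bigm_ops index can serve
SELECT productdescriptionid, description
FROM production.productdescription
WHERE description LIKE likequery('entry');
```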