Amazon Aurora PostgreSQL-Compatible Edition has recently embraced the pgvector extension, allowing users to store and manipulate vector embeddings directly within their relational databases. This integration opens up a wealth of opportunities for data analysis and visualization, particularly when exploring semantic similarities.
Combined with Amazon Bedrock, a fully managed service that offers a choice of foundation models (FMs) including Amazon Titan Text Embeddings, this makes it straightforward to generate embeddings, store them alongside the source data in Aurora, and explore semantic similarities between records.
Visualizing Vector Embeddings
Foundation models such as amazon.titan-embed-text-v1 accept up to 8,192 tokens of input text and return a vector representation with 1,536 dimensions. Visualizing data with that many dimensions directly is impractical, so a dimensionality reduction technique is needed first. Common options include principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE). In this post, we use PCA because it preserves as much of the variance as possible while reducing the data to a handful of dimensions that can be plotted.
Solution Overview
The following steps outline the process of performing PCA with Amazon Bedrock and Aurora:
- Prepare your dataset for generating vector embeddings, utilizing a sample dataset with product categories.
- Generate vector embeddings of the product descriptions using the Amazon Bedrock foundation model amazon.titan-embed-text-v1.
- Store the product data and vector embeddings in an Aurora PostgreSQL database with the pgvector extension.
- Import the libraries necessary for PCA.
- Convert high-dimensional vector embeddings into three-dimensional embeddings using PCA.
- Generate a scatter plot of the three-dimensional embeddings to visualize semantic similarities within the data.
Prerequisites
To implement this solution, you need the following prerequisites:
- An Aurora PostgreSQL-Compatible cluster running a version that supports the pgvector extension.
- Access to the Amazon Titan Text Embeddings model (amazon.titan-embed-text-v1) in Amazon Bedrock.
- A SageMaker notebook instance (or another Jupyter environment) with network connectivity to the Aurora cluster.
- The database credentials stored as a secret in AWS Secrets Manager.
Implementing the Solution
Once the prerequisites are established, follow these steps to implement the solution:
- Sign in to a Jupyter notebook instance with a Python kernel, such as the conda_python3 kernel on a SageMaker notebook instance.
- Install the required binaries and import the necessary libraries (psycopg2 and the pgvector helper are needed later to load the data into Aurora):

!pip install -U boto3 psycopg2-binary pgvector
import json, boto3, psycopg2, pandas as pd
from pgvector.psycopg2 import register_vector

- Import the sample product catalog data:

df = pd.read_csv("./product_catalog.csv", sep="|")
df.head(5)
The sample dataset consists of 60 products categorized into groups such as Fruit, Sport, Furniture, and Electronics. An example of the resulting table is as follows:
. | p_category | p_name | p_description |
0 | Fruit | Apple | Juicy and crisp apple, perfect for snacking or… |
1 | Fruit | Banana | Sweet and creamy banana, a nutritious addition… |
2 | Fruit | Mango | Exotic and flavorful mango, delicious eaten fr… |
3 | Fruit | Orange | Refreshing and citrusy orange, packed with vit… |
4 | Fruit | Pineapple | Fresh and tropical pineapple, known for its sw… |
- Use the Amazon Bedrock amazon.titan-embed-text-v1 model to generate vector embeddings of the product descriptions.
- Create an Amazon Bedrock runtime client that is used to invoke the text embeddings model:

def create_bedrock_client(region):
    bedrock_client = boto3.client("bedrock-runtime", region_name=region)
    return bedrock_client

bedrock_client = create_bedrock_client('us-east-1')

- Define a function to generate text embeddings, passing the Amazon Bedrock client and the text data:

def create_description_embedding(desc, bedrock_client):
    payload = {"inputText": f"{desc}"}
    body = json.dumps(payload)
    model = "amazon.titan-embed-text-v1"
    accept = "application/json"
    contentType = "application/json"
    response = bedrock_client.invoke_model(
        body=body, modelId=model, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())
    embeddings = response_body.get("embedding")
    return embeddings

- Generate embeddings for each product description:

all_records = []
for record in df['p_description']:
    embedded_data = create_description_embedding(record, bedrock_client)
    all_records.append(embedded_data)
df.insert(2, 'p_embeddings', all_records)
Alternatively, you can load the sample data into the Aurora PostgreSQL database first and use the Aurora machine learning extension to call the aws_bedrock.invoke_model_get_embeddings function, generating the embeddings inside the database, as sketched below.
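If you take that route, the call might look roughly like the following sketch. It assumes the Aurora machine learning (aws_ml) extension is installed with Amazon Bedrock access configured, and that dbconn is a psycopg2 connection such as the one created in the next step; check the Aurora machine learning documentation for the exact function signature in your engine version.

# Sketch: generate an embedding inside Aurora PostgreSQL via the Aurora ML integration.
# Assumes the aws_ml extension is installed and dbconn is an open psycopg2 connection.
cur = dbconn.cursor()
cur.execute("""SELECT aws_bedrock.invoke_model_get_embeddings(
    model_id      := 'amazon.titan-embed-text-v1',
    content_type  := 'application/json',
    json_key      := 'embedding',
    model_input   := '{"inputText": "Juicy and crisp apple"}');""")
print(cur.fetchone())
cur.close()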
- Create the pgvector extension in the Aurora PostgreSQL database and create a table to load the product catalog data, including the vector embeddings:
client = boto3.client('secretsmanager')
response = client.get_secret_value(SecretId='aupg-vector-secret')
database_secrets = json.loads(response['SecretString'])
dbhost = database_secrets['host']
dbport = database_secrets['port']
dbuser = database_secrets['username']
dbpass = database_secrets['password']

# Connect to the Aurora PostgreSQL cluster
dbconn = psycopg2.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport)
dbconn.set_session(autocommit=True)
cur = dbconn.cursor()

# Enable the pgvector extension and register the vector type with psycopg2
cur.execute("create extension if not exists vector;")
register_vector(dbconn)

# Create the product catalog table with a 1,536-dimension vector column
cur.execute("drop table if exists product_catalog;")
cur.execute("""create table if not exists product_catalog(
    p_id serial primary key,
    p_category varchar(15),
    p_name varchar(50),
    p_description text,
    p_embeddings vector(1536));""")

# Load the product data and embeddings
for index, row in df.iterrows():
    cur.execute("""INSERT INTO product_catalog (p_category, p_name, p_description, p_embeddings) values(%s, %s, %s, %s);""",
                (row.p_category, row.p_name, row.p_description, row.p_embeddings))

# Create an IVFFlat index on the embeddings column and refresh table statistics
cur.execute("""CREATE INDEX ON product_catalog
    USING ivfflat (p_embeddings vector_l2_ops) WITH (lists = 100);""")
cur.execute("vacuum analyze product_catalog;")

cur.close()
dbconn.close()
print("Data loaded successfully!")
The pgvector extension supports various indexing techniques, including Inverted File with Flat Compression (IVFFlat) and Hierarchical Navigable Small World (HNSW). In this instance, we utilize the IVFFlat index due to its efficiency with smaller datasets. For a deeper understanding of these indexing methods, refer to Optimize generative AI applications with pgvector indexing.
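For larger datasets where query speed and recall matter more than index build time, you could create an HNSW index instead. The following is a minimal sketch using pgvector's default m and ef_construction values (HNSW requires pgvector 0.5.0 or later):

# Sketch: build an HNSW index on the embeddings column instead of IVFFlat
with psycopg2.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport) as conn:
    with conn.cursor() as cur:
        cur.execute("""CREATE INDEX ON product_catalog
            USING hnsw (p_embeddings vector_l2_ops) WITH (m = 16, ef_construction = 64);""")
    conn.commit()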
- Retrieve records from the PostgreSQL table for visualization:
with psycopg2.connect("host='{}' port={} user={} password={}".format(dbhost, dbport, dbuser, dbpass)) as conn:
    # Register the vector type so p_embeddings is returned as numerical arrays
    register_vector(conn)
    sql = "select p_category, p_name, p_embeddings, p_description from product_catalog;"
    df_data = pd.read_sql_query(sql, conn)
df_data.head(3)
The following table illustrates an example of the results:
. | p_category | p_name | p_embeddings | p_description |
0 | Fruit | Apple | [-0.41796875, 0.7578125, -0.16308594, 0.045898… | Juicy and crisp apple, perfect for snacking or… |
1 | Fruit | Banana | [0.8515625, 0.036376953, 0.31835938, 0.1318359… | Sweet and creamy banana, a nutritious addition… |
2 | Fruit | Mango | [0.6328125, 0.73046875, 0.3046875, -0.72265625… | Exotic and flavorful mango, delicious eaten fr… |
- Apply PCA to perform dimensionality reduction on the vector embeddings:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
vis_dims = pca.fit_transform(df_data['p_embeddings'].to_list())
vis_dims
df_data['pca_embed'] = vis_dims.tolist()
df_data.head(5)
The resulting table from this operation is as follows:
. | p_category | p_name | p_description | p_embeddings | pca_embed |
0 | Fruit | Apple | Juicy and crisp apple, perfect for snacking or… | [-0.41796875, 0.7578125, -0.16308594, 0.045898… | [0.35626856571655474, 11.501643004047386, 4.42… |
1 | Fruit | Banana | Sweet and creamy banana, a nutritious addition… | [0.8515625, 0.036376953, 0.31835938, 0.1318359… | [-0.3547466621907463, 10.105496442467032, 2.81… |
2 | Fruit | Mango | Exotic and flavorful mango, delicious eaten fr… | [0.6328125, 0.73046875, 0.3046875, -0.72265625… | [0.17147068159548648, 11.720291050641865, 4.28… |
3 | Fruit | Orange | Refreshing and citrusy orange, packed with vit… | [0.921875, 0.69921875, 0.29101562, 0.061523438… | [0.8320213087523731, 10.913051113510148, 3.717… |
4 | Fruit | Pineapple | Fresh and tropical pineapple, known for its sw… | [0.33984375, 0.70703125, 0.24707031, -0.605468… | [-0.0008173639438334911, 11.01867977558647, 3…. |
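Because three components discard most of the original 1,536 dimensions, it can be worth checking how much variance the projection actually retains. A quick check on the pca object fitted in the previous step:

# Share of the original variance captured by each of the three principal components
print(pca.explained_variance_ratio_)
# Total variance retained by the 3-D projection
print(pca.explained_variance_ratio_.sum())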
- Visualize the newly generated vector embeddings by plotting a three-dimensional graph:
import plotly.graph_objs as go
import numpy as np

# Plot one trace per product category so each category gets its own color
categories = df_data["p_category"].unique()
fig = go.Figure()
for i, cat in enumerate(categories):
    sub_matrix = np.array(df_data[df_data["p_category"] == cat]["pca_embed"].to_list())
    x = sub_matrix[:, 0]
    y = sub_matrix[:, 1]
    z = sub_matrix[:, 2]
    fig.add_trace(
        go.Scatter3d(
            x=x,
            y=y,
            z=z,
            mode="markers",
            marker=dict(size=5, color=i, colorscale="Viridis", opacity=0.8),
            name=cat,
        )
    )
fig.update_layout(
    autosize=False,
    title="3D Scatter Plot of Categories",
    width=800,
    height=500,
    margin=dict(l=50, r=50, b=100, t=100, pad=10),
    scene=dict(
        xaxis=dict(title="x"),
        yaxis=dict(title="y"),
        zaxis=dict(title="z"),
    ),
)
fig.show()
The resulting three-dimensional scatter plot shows that products with similar meanings cluster closely together in the embedding space. It is this proximity that allows a semantic search to return products with comparable meanings.
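To illustrate that point, a nearest-neighbor query against the same table might look like the following sketch. The search phrase and LIMIT are arbitrary examples; the query embeds the phrase with the same Titan model and orders the results by L2 distance using pgvector's <-> operator:

# Sketch: semantic search over the product catalog (hypothetical search phrase)
query_embedding = create_description_embedding("a sweet tropical fruit", bedrock_client)
with psycopg2.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport) as conn:
    register_vector(conn)
    with conn.cursor() as cur:
        cur.execute(
            """SELECT p_name, p_category
               FROM product_catalog
               ORDER BY p_embeddings <-> %s
               LIMIT 5;""",
            (np.array(query_embedding),),
        )
        print(cur.fetchall())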
For a comprehensive step-by-step demonstration of this solution, please refer to the GitHub repository.
Resource Cleanup
To prevent incurring unnecessary charges, it is advisable to delete the resources created during this process:
- Delete the SageMaker Jupyter notebook instance.
- Delete the Aurora PostgreSQL cluster if it is no longer required.
About the Author
Ravi Mathur is a Senior Solutions Architect at AWS, providing technical assistance and architectural guidance across various AWS services. With extensive experience in software engineering and architecture for large-scale enterprises, he is well-equipped to address complex challenges in cloud computing.