In the evolving landscape of employment, job search platforms serve as vital conduits between employers and prospective candidates. These platforms are underpinned by sophisticated search engines that meticulously process and analyze extensive volumes of both structured and unstructured data to yield pertinent results. The construction of such systems necessitates robust database technologies capable of executing intricate queries, facilitating full-text and semantic searches, and incorporating geospatial functionalities for location-aware recommendations.
The anatomy of a modern job search engine
Understanding the essential components and requirements of a contemporary job search platform is crucial before delving into PostgreSQL’s specific search capabilities. At its core, a job search engine comprises two primary elements:
Data repository – This serves as the backbone of any job search platform, housing job listings from diverse sources. The database is perpetually updated through web crawlers, direct employer submissions, and integrations with job boards and recruitment platforms. It also retains profiles and resumes of job candidates.
Search engine – This component enables bidirectional search, allowing employers to seek candidates and candidates to pursue opportunities. It processes queries by analyzing and joining structured data (such as job titles, locations, and salary ranges) with unstructured content (like job descriptions and candidate resumes). An advanced search engine transcends mere keyword matching, grasping context, managing synonyms, recognizing related concepts, and considering location-based constraints.
An effective job search engine necessitates:
- Full-text search – This feature provides precise lexical matching for job titles, skills, and organization names, supporting exact phrase matching and typo-tolerant fuzzy searches for partial matches. While it excels when users can articulate specific search criteria, it lacks contextual understanding.
- Semantic search – Vector-based similarity search introduces contextual understanding by interpreting job descriptions and candidate qualifications beyond literal terminology. This capability captures nuanced professional relationships and implicit requirements that keyword matching might overlook, facilitating more intelligent matching between candidates and positions.
- Geospatial search – This feature refines results by incorporating geographic considerations, enabling users to discover opportunities within specific distance parameters, commute thresholds, or regional boundaries, thus aligning professional qualifications with the realities of the job market.
By integrating these complementary search techniques, job search engines can process complex queries that simultaneously evaluate exact terms, contextual meanings, and geographic considerations, delivering more relevant matches in an increasingly intricate employment landscape.
PostgreSQL as a comprehensive search solution
PostgreSQL emerges as a dual-purpose solution, functioning both as a robust data repository and an advanced search engine. Through its built-in features and extensions, PostgreSQL adeptly manages all three essential search dimensions within a single system:
- Full-text search using the built-in tsvector and tsquery types with GIN indexes
- Vector similarity search for semantic matching through the pgvector extension
- Geospatial queries via the PostGIS extension with GiST indexes
A job search engine leverages PostgreSQL to store job listings and candidate profiles, providing real-time full-text and semantic search across millions of resumes and job listings while identifying job matches within specified geographical radii. This unified approach simplifies architecture, reduces operational complexity, and enables hybrid search strategies.
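Consider the following simple data model for a job search engine, which comprises “job” and “resume” tables. These tables include columns of type tsvector, vector, and geometry to hold full-text search vectors, vector embeddings, and the geographical locations of jobs and candidates. The embedding_function call is a placeholder: PostgreSQL requires generated-column expressions to be immutable, so in practice embeddings are often computed outside the table definition and written to an ordinary vector column.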
CREATE TABLE job (
job_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
company TEXT,
title TEXT NOT NULL,
description TEXT,
title_tsv TSVECTOR -- Computed full-text search vectors
GENERATED ALWAYS AS (to_tsvector('english', title)) STORED,
description_tsv TSVECTOR
GENERATED ALWAYS AS (to_tsvector('english', description)) STORED,
semantic_vector VECTOR(3)
GENERATED ALWAYS AS (
embedding_function(title || ' ' || description) -- replace with actual embedding generation
) STORED,
skills_vector VECTOR(3) -- Computed skills vector
GENERATED ALWAYS AS (
embedding_function(array_to_string(skills, ' '))
) STORED,
geom GEOMETRY(Point, 4326),
location TEXT,
salary_range INT4RANGE,
experience_level TEXT,
job_location GEOMETRY(Point, 4326),
skills TEXT[]
);
CREATE TABLE resume (
candidate_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
candidate_name TEXT,
raw_resume TEXT,
resume_tsv TSVECTOR -- Computed search vectors
GENERATED ALWAYS AS (to_tsvector('english', raw_resume)) STORED,
skills_vector VECTOR(3) -- Computed skills vector
GENERATED ALWAYS AS (
embedding_function(array_to_string(skills, ' ')) -- replace with actual embedding generation
) STORED,
geom GEOMETRY(Point, 4326),
location TEXT,
skills TEXT[],
job_history TEXT[],
education_levels TEXT[]
);
Let us now explore how a job search engine can implement various search techniques in PostgreSQL.
Full-text search in PostgreSQL
PostgreSQL’s full-text search capabilities provide a robust foundation for matching job listings and candidate profiles based on specific keywords, phrases, and requirements. When ingesting job listings and resumes, the engine employs tokenization techniques, breaking documents into lexemes using predefined linguistic dictionaries. These dictionaries guide the normalization of text, removal of stop-words, and application of stemming to reduce words to their root forms. The resulting standardized lexemes are then organized into an inverted index, creating an efficient structure for rapid retrieval. When a candidate inputs a search query, the engine matches the query tokens against indexed job description tokens, ranking results based on term frequency and lexical matches to present a list of relevant job opportunities.
The following diagram illustrates the process of full-text search in PostgreSQL:
PostgreSQL’s full-text search components include:
- Dictionaries – These enable language-aware lexeme parsing, stemming, and stop-word removal. They transform raw text into standardized lexemes, ensuring that variations like “working,” “worked,” and “works” all match a search for “work.” PostgreSQL includes built-in dictionaries for many languages and allows for custom dictionaries tailored to specialized terminology.
- Text processing – The to_tsvector function converts documents (such as job descriptions or resumes) into a special tsvector format that stores normalized lexemes along with their positions and optional weights. Similarly, the to_tsquery function processes search queries into a format optimized for matching against these document vectors.
- Match operator – The match operator (@@) tests whether a document vector satisfies a query, returning true if a match exists.
- Ranking functions – Functions such as ts_rank and ts_rank_cd determine the relevance of matches based on factors like term frequency and document structure, allowing results to be sorted by relevance.
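The components above can be observed in isolation with a few one-line queries. This is a minimal sketch that uses only built-in functions and the 'english' configuration; the sample strings are made up for illustration.

```sql
-- Stemming: three inflected forms collapse to the single lexeme 'work'.
-- Result: 'work':1,2,3
SELECT to_tsvector('english', 'working worked works');

-- Stop words such as "the", "is", and "a" are dropped; the remaining
-- words are stemmed and stored with their positions.
SELECT to_tsvector('english', 'The engineer is building a search engine');

-- The match operator returns a boolean. Both "developer" and "developers"
-- are expected to reduce to the same lexeme, so the plural query term
-- still matches the singular word in the document.
SELECT to_tsvector('english', 'Senior Java developer')
       @@ to_tsquery('english', 'java & developers') AS matches;
```

Running to_tsvector by hand like this is a convenient way to check how a given configuration tokenizes domain-specific terms before relying on it in an index.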
The following example identifies candidates with “JavaScript” and either “React” or “Angular” skills while excluding those mentioning “WordPress.”
WITH resume AS(SELECT * FROM (VALUES
('John','react,javascript, wordpress'),
('Mary','angular, javascript')
) resume(candidate_name, skills))
SELECT candidate_name
FROM resume
WHERE TO_TSVECTOR(resume.skills)@@ to_tsquery('english', 'javascript & (react | angular) & !wordpress');
candidate_name
------------------
Mary
Advanced full-text search features
PostgreSQL offers several advanced features for more complex full-text search:
Proximity search – This feature allows for finding words that appear near each other in a document:
SELECT * FROM resume WHERE resume_tsv @@ to_tsquery('english', 'software <-> engineering');
This would match “software engineering” but not “software testing engineering,” where the terms are not adjacent.
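Two related built-ins may be useful here. The sketch below assumes the resume table from the data model above; the search phrases are illustrative only.

```sql
-- phraseto_tsquery builds the "followed by" query from plain text,
-- so user input does not need to contain the <-> operator.
SELECT candidate_name
FROM resume
WHERE resume_tsv @@ phraseto_tsquery('english', 'software engineering');

-- <N> matches lexemes that are exactly N positions apart.
-- <2> would match "software testing engineering" (one word in between),
-- but not "software engineering".
SELECT candidate_name
FROM resume
WHERE resume_tsv @@ to_tsquery('english', 'software <2> engineering');
```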
Simple ranking – This ensures that the most relevant results are ranked higher. The ts_rank function scores a document by how often the query terms occur in it; ts_rank_cd, used in the example below, also takes into account how close the matching terms are to each other (cover density). For instance, you can rank resumes by whether the word “Amazon” appears in the skills text:
WITH resume AS(SELECT * FROM (VALUES
('some-company:software engineering'),
('Amazon: software engineering')
) resume(skills))
SELECT resume.skills, TS_RANK_CD(TO_TSVECTOR(resume.skills), TO_TSQUERY('Amazon')) AS rank FROM resume
ORDER BY rank DESC;
skills rank
---------------------------------------------------
Amazon: software engineering 0.1
some-company:software engineering 0.0
Weighted ranking – This allows for assigning different importance to various parts of a document. The following weights can be assigned:
- A (most important): Highest weight
- B (high importance): Second highest
- C (medium importance): Third level
- D (lowest importance): Default weight
Here is an example:
-- Create a function to generate weighted search vector
-- This function assigns weight “A” to column “title”, and weight “B” to column “skills”
CREATE FUNCTION job_search_vector(title TEXT, skills TEXT)
RETURNS tsvector AS $$
BEGIN
RETURN
setweight(to_tsvector('english', title), 'A') || setweight(to_tsvector('english', skills), 'B');
END
$$ LANGUAGE plpgsql;
WITH job AS(SELECT * FROM (VALUES
(1,'programmer','java, python, junit'),
(2,'QA','python, junit')
) job(id,title, skills))
SELECT job.id, job.title, job.skills,
ts_rank(job_search_vector(job.title, job.skills), to_tsquery('QA & python & junit & java')
) AS rank
FROM job
ORDER BY rank DESC;
id title skills rank
-------------------------------------------------------------
2 QA python, junit 0.915068
1 programmer java, python, junit 0.77922493
Fuzzy matching – The pg_trgm extension complements full-text search, enabling similarity-based matching. This capability is crucial for job search platforms where users might misspell technical terms or job titles:
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- provides the % similarity operator
SET pg_trgm.similarity_threshold = 0.4;
WITH job AS(SELECT * FROM (VALUES
(1,'programmer','java, python, junit'),
(2,'QA','python, junit')
) job(id,title, skills))
SELECT * FROM job WHERE job.title % 'programer';
id title skills
-------------------------------------------------------
1 programmer java, python, junit
Indexing for performance
PostgreSQL provides specialized index types to optimize full-text search performance:
GIN (Generalized Inverted Index) – This is ideal for static text data where search speed is prioritized over update speed. GIN indexes excel with tsvector columns and are the preferred choice for most job search scenarios.
GiST (Generalized Search Tree) – This index type strikes a balance between search and update performance, consuming less space but potentially being slower for complex queries. GiST indexes are more suitable for applications with frequent updates.
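As a minimal sketch, the indexes for the data model above could be created as follows. The index names are hypothetical, and the trigram index assumes the pg_trgm extension is installed.

```sql
-- GIN indexes on the precomputed tsvector columns
CREATE INDEX resume_tsv_gin_idx ON resume USING GIN (resume_tsv);
CREATE INDEX job_title_tsv_gin_idx ON job USING GIN (title_tsv);
CREATE INDEX job_description_tsv_gin_idx ON job USING GIN (description_tsv);

-- Trigram GIN index so that fuzzy matches with the % operator
-- on job titles can use an index instead of a sequential scan
CREATE INDEX job_title_trgm_idx ON job USING GIN (title gin_trgm_ops);
```

Because the tsvector columns are stored, queries such as resume_tsv @@ to_tsquery(...) can use these indexes directly without recomputing to_tsvector for each row.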
Semantic search with pgvector
While full-text search excels at finding exact matches, it lacks an understanding of meaning and context. For instance, a full-text search wouldn’t naturally recognize that “software engineer” and “developer” represent similar roles, or that “cloud architecture” relates to “AWS expertise.” This is where semantic search through vector embeddings becomes invaluable.
Understanding vector embeddings
Vector embeddings represent text as points in a high-dimensional space, where the geometric relationships between these points capture semantic relationships. Similar concepts appear closer together in this vector space, even if they share no common terms. The pgvector extension adds vector data types and operations to PostgreSQL, enabling the storage of these embeddings directly in the database and facilitating efficient similarity searches.
Implementing semantic search
The following diagram illustrates how semantic search is implemented in PostgreSQL:
The steps for implementing vector search include:
- Generate embeddings – Convert job descriptions and candidate resumes into vector embeddings, typically utilizing machine learning models available through services such as Amazon Bedrock. Below is an example of embeddings generated for job postings.
Vector Embedding
- Store vectors – Store the embeddings in PostgreSQL using pgvector’s vector data type.
- Similarity search – Utilize vector operators to identify similar items by calculating the distance between the vector embeddings, as depicted in the following diagram.
Vector Similarity Search
The following query calculates the distance between an applicant’s skillset embedding and the embeddings for the job postings:
SELECT * FROM (
WITH job AS(SELECT * FROM (VALUES
(1, 'Data Scientist', 'python, machine learning, sql, statistics, deep learning', '[0.9, 0.7, 0.2]'::vector(3)),
(2, 'Frontend Developer', 'javascript, react, css, html, redux', '[0.2, 0.1, 0.9]'::vector(3)),
(3, 'Backend Engineer', 'java, spring, microservices, sql, api design', '[0.5, 0.9, 0.3]'::vector(3)),
(4, 'DevOps Engineer', 'kubernetes, docker, aws, terraform, ci/cd', '[0.6, 0.8, 0.4]'::vector(3))
) job(job_id, title, skills, skill_vector)
),
resume AS(SELECT * FROM (VALUES
(1, 'Mary', 'java, spring boot, microservices, postgresql, rest api', '[0.45, 0.95, 0.25]'::vector(3)),
(2, 'Jean', 'docker, kubernetes, aws, jenkins, terraform', '[0.65, 0.75, 0.35]'::vector(3)),
(3, 'Bill', 'javascript, python, react, express, mongodb, node.js', '[0.35, 0.6, 0.65]'::vector(3))
) resume(candidate_id, name, skills, skill_vector)
)
SELECT job.title, resume.name, 1 - (job.skill_vector <-> resume.skill_vector) AS similarity_score -- 1 minus Euclidean (L2) distance
FROM job CROSS JOIN resume
)scores
WHERE similarity_score > 0.9
ORDER BY similarity_score DESC;
title name similarity_score
---------------------------------------------------
DevOps Engineer Jean 0.9133974713
Backend Engineer Mary 0.9133974391
The <-> operator calculates the Euclidean (L2) distance between vectors, where a smaller distance indicates greater similarity.
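pgvector defines several distance operators, including <-> (Euclidean distance), <=> (cosine distance), and <#> (negative inner product). The following sketch shows a typical nearest-neighbor query using cosine distance against the job and resume tables from the data model above; the candidate_id value is a hypothetical example.

```sql
-- Five jobs whose skills embedding is closest to one candidate's embedding.
-- Ordering by the distance operator directly (rather than by a derived
-- similarity expression) lets pgvector use a matching vector index.
SELECT j.job_id,
       j.title,
       1 - (j.skills_vector <=> c.skills_vector) AS cosine_similarity
FROM job j
CROSS JOIN (
    SELECT skills_vector
    FROM resume
    WHERE candidate_id = 1      -- hypothetical candidate
) c
ORDER BY j.skills_vector <=> c.skills_vector
LIMIT 5;
```

Cosine distance ignores vector magnitude, which tends to suit text embeddings; whichever operator is used should match the metric the embedding model was designed for.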
Optimizing vector search performance
As the dataset size expands, performance becomes critical for vector searches. The pgvector extension offers specialized index types for vector similarity searches:
IVFFlat index – This divides the vector space into smaller partitions for more efficient searching:
CREATE INDEX ON resume USING ivfflat (skills_vector vector_l2_ops) WITH (lists = 100);
HNSW index – The Hierarchical Navigable Small World graph index provides even faster approximate nearest neighbor searches:
CREATE INDEX ON resume USING hnsw (skills_vector vector_cosine_ops) WITH (m=16, ef_construction=64);
These indexes significantly enhance search performance at the expense of some precision, making them ideal for large job search platforms where sub-second response times are paramount.
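Both index types expose a query-time setting in pgvector that trades speed for recall. The values below are illustrative rather than recommendations and would need tuning against real data.

```sql
-- IVFFlat: number of lists scanned per query (pgvector default is 1).
-- Higher values improve recall at the cost of speed.
SET ivfflat.probes = 10;

-- HNSW: size of the candidate list kept during the graph search
-- (pgvector default is 40). Higher values improve recall.
SET hnsw.ef_search = 100;
```

An index is only considered when the query's distance operator matches the operator class it was built with, for example <=> for vector_cosine_ops and <-> for vector_l2_ops.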
Geospatial search with PostGIS
Location often plays a pivotal role in job searches. Candidates typically seek positions within commuting distance, while employers aim to attract local talent. PostgreSQL’s PostGIS extension provides geospatial capabilities for implementing location-aware job searches.
Geospatial search implementation
The following outlines the geospatial search architecture and implementation steps in PostgreSQL:
- Install and enable PostGIS.
- Add geometry or geography columns.
- Index using GiST indexes.
- Perform geospatial queries using geospatial functions, such as ST_DWithin, and sort locations by distance using ST_Distance, as needed. The following query demonstrates this:
CREATE EXTENSION postgis;
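A minimal sketch of the remaining steps, assuming the job table from the data model above. The index name, the search origin (a point in Seattle), and the 25 km radius are illustrative assumptions.

```sql
-- GiST index on the geography form of the point, so that distances
-- are measured in meters rather than in degrees
CREATE INDEX job_geog_gist_idx ON job USING GIST ((geom::geography));

-- Jobs within 25 km of the origin, nearest first
WITH origin AS (
    SELECT ST_SetSRID(ST_MakePoint(-122.3321, 47.6062), 4326)::geography AS pt
)
SELECT j.job_id,
       j.title,
       ST_Distance(j.geom::geography, o.pt) AS distance_m
FROM job j
CROSS JOIN origin o
WHERE ST_DWithin(j.geom::geography, o.pt, 25000)   -- radius in meters
ORDER BY distance_m;
```

ST_MakePoint takes longitude first and latitude second. Casting to geography keeps the radius in meters; with plain geometry in SRID 4326 the distance argument would be interpreted in degrees.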
Combining search techniques
While each search technique offers distinct advantages, a hybrid approach leverages the strengths of each method to deliver more relevant results across diverse use cases. A combination of full-text search and semantic search proves particularly effective for complex queries where both specific terms and overall meaning are essential. By employing full-text search to match user preferences and similarity search to broaden recommendations to related content, users gain a more comprehensive view of available options. This allows for matching specific skills or job titles (full-text search) while also understanding related roles or transferable skills through similarity search. For queries where user intent may not be fully captured by exact keyword matching, similarity search can help uncover relevant results, while full-text search ensures that no exact matches are overlooked. To enhance job searches further, geo-contextual search can be integrated with full-text or semantic search.
In a hybrid search, each method ranks results independently using its own relevance algorithm. To combine these rankings meaningfully, the Reciprocal Rank Fusion (RRF) algorithm assigns each result a unified score by summing 1 / (k + rank) over the individual rankings, where k is a smoothing constant.
The following example illustrates hybrid search by showcasing top candidates for engineering positions while considering distance and skill matches:
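One way such a query could look is sketched below, not as the definitive implementation. It assumes the resume table from the data model above; the search terms, the query embedding, the origin point, the 50 km radius, the per-method limit of 50, and the RRF constant k = 60 are all illustrative assumptions.

```sql
WITH params AS (
    SELECT to_tsquery('english', 'engineer & (java | python)') AS q,
           '[0.5, 0.9, 0.3]'::vector(3) AS qvec,   -- embedding of the job's skills
           ST_SetSRID(ST_MakePoint(-122.3321, 47.6062), 4326)::geography AS origin
),
nearby AS (          -- geospatial filter: candidates within 50 km
    SELECT r.candidate_id, r.resume_tsv, r.skills_vector
    FROM resume r, params p
    WHERE ST_DWithin(r.geom::geography, p.origin, 50000)
),
fts AS (             -- full-text ranking
    SELECT n.candidate_id,
           ROW_NUMBER() OVER (ORDER BY ts_rank_cd(n.resume_tsv, p.q) DESC) AS rnk
    FROM nearby n, params p
    WHERE n.resume_tsv @@ p.q
    ORDER BY rnk
    LIMIT 50
),
sem AS (             -- semantic ranking by cosine distance
    SELECT n.candidate_id,
           ROW_NUMBER() OVER (ORDER BY n.skills_vector <=> p.qvec) AS rnk
    FROM nearby n, params p
    ORDER BY rnk
    LIMIT 50
)
SELECT candidate_id,
       r.candidate_name,
       -- Reciprocal Rank Fusion: sum of 1 / (k + rank) with k = 60
       COALESCE(1.0 / (60 + fts.rnk), 0)
     + COALESCE(1.0 / (60 + sem.rnk), 0) AS rrf_score
FROM fts
FULL OUTER JOIN sem USING (candidate_id)
JOIN resume r USING (candidate_id)
ORDER BY rrf_score DESC
LIMIT 10;
```

A candidate that appears in only one ranking still receives a score from that ranking, while candidates ranked highly by both methods rise to the top. Because RRF uses ranks rather than raw scores, the differing scales of ts_rank_cd and vector distance do not need to be normalized.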
Performance and scaling considerations
As your job search platform expands in data volume, user base, and query complexity, performance optimization becomes increasingly critical. Job search applications face several specific performance challenges:
- Computational complexity – Hybrid search queries that combine multiple techniques can be resource-intensive, especially when involving complex operations like vector similarity calculations or geospatial distance measurements.
- Indexing overhead – Maintaining specialized indexes for different search techniques increases storage requirements and can slow write operations.
- Result merging – Combining results from various search algorithms often necessitates complex join operations and scoring calculations.
- Concurrent query load – Popular job search platforms must accommodate numerous simultaneous search requests, particularly during peak usage periods.
PostgreSQL offers several features to address these performance challenges:
- Parallel query execution – This distributes query workloads across multiple CPU cores.
- Query pipelining – This lets a client send several queries without waiting for each result, reducing round-trip latency.
- Materialized views – These pre-compute common search operations.
- Appropriate indexing – This involves selecting the right index type for each search dimension.
- Table partitioning – This divides large tables into manageable chunks based on logical divisions to eliminate unnecessary data during searches.
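Two of these features can be sketched briefly. The view, table, and column names below (job_count_by_level, job_partitioned, posted_on) are hypothetical and not part of the data model above.

```sql
-- Materialized view: pre-compute a frequently requested aggregate
CREATE MATERIALIZED VIEW job_count_by_level AS
SELECT experience_level, count(*) AS open_jobs
FROM job
GROUP BY experience_level;

-- A unique index is required for REFRESH ... CONCURRENTLY,
-- which rebuilds the view without blocking readers
CREATE UNIQUE INDEX job_count_by_level_idx
    ON job_count_by_level (experience_level);
REFRESH MATERIALIZED VIEW CONCURRENTLY job_count_by_level;

-- Range partitioning by posting date, so that queries restricted to
-- recent postings can skip older partitions entirely
CREATE TABLE job_partitioned (
    job_id    bigint NOT NULL,
    title     TEXT   NOT NULL,
    posted_on DATE   NOT NULL,
    PRIMARY KEY (job_id, posted_on)
) PARTITION BY RANGE (posted_on);

CREATE TABLE job_partitioned_2025_q1 PARTITION OF job_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
```

The partition key must be part of the primary key, which is why posted_on appears in it here.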
What about other applications?
While this discussion centers on job search platforms, the architecture presented is applicable to a wide array of applications. The following table summarizes some examples:
Application | Full-text search | Vector search | Geospatial search | How it helps |
E-commerce Product Discovery | Product names, descriptions, and specifications | “Similar products” recommendations based on product embeddings | Local availability and delivery time estimation | Helps shoppers find products matching their specific requirements while also discovering related items they might be interested in, filtered by what’s available in their region. |
Real estate platforms | Property features, amenities, and descriptions | Find properties with similar overall characteristics | Neighborhood analysis and proximity to points of interest | Helps homebuyers find properties meeting their explicit criteria while also discovering neighborhoods they hadn’t considered but match their lifestyle preferences. |
Content recommendation systems | Topic-specific articles or videos | Thematically similar content based on embeddings | Locally relevant news and events | Enables both precise content discovery and serendipitous recommendations contextually relevant to the user’s location and interests. |
Travel and hospitality | Accommodation amenities and features | “Places similar to this one” recommendations | Proximity to attractions, transportation, and activities | Helps travelers find accommodations that meet their specific requirements while also discovering options in areas they might not have initially considered. |
Healthcare provider matching | Medical specialties and treatments | Providers with similar practice patterns and patient reviews | Proximity and accessibility | Helps patients find providers who match their specific medical needs while considering factors like practice style and convenient location. |