PostgreSQL — Topic 12: Full Text Search

12.1

What Is Full Text Search?

LIKE vs FTS — and how PostgreSQL thinks about text

▾

🤔 The Problem with LIKE

The obvious way to search text is LIKE — but it has serious limitations in production:

❌ LIKE — naive string matching

SELECT * FROM articles
WHERE body LIKE '%running%';

Problems:
• Can't use indexes (full table scan)
• Won't match "run", "runs", "runner"
• Case sensitive by default
• No relevance ranking
• Extremely slow on large tables

✅ Full Text Search — smart matching

SELECT * FROM articles
WHERE to_tsvector('english', body)
   @@ to_tsquery('english', 'running');

Benefits:
• GIN index — lightning fast
• Matches "run", "runs", "runner"
• Language-aware stemming
• Relevance ranking (ts_rank)
• Scales to millions of rows

🧠 How PostgreSQL FTS Works

PostgreSQL FTS has two main building blocks:

tsvector

A sorted list of lexemes (normalized word stems) with their positions. This is what you store/index. Think of it as a pre-processed, searchable version of your text.

tsquery

A search query — also normalised to lexemes — with boolean operators (& AND, | OR, ! NOT, <-> phrase). This is what the user searches for.

FTS Pipeline — from raw text to search result

RAW TEXT:  "PostgreSQL is running fast and Postgres runs well"
               │
               ▼  to_tsvector('english', ...)
               │
      ┌────────────────────────────────────────────┐
      │  Parser: split into tokens                 │
      │  "PostgreSQL" "is" "running" "fast" ...    │
      │               │                            │
      │  Stop words removed: "is", "and", "well"  │
      │               │                            │
      │  Stemmer: "running" → "run"                │
      │           "runs"    → "run"                │
      │           "fast"    → "fast"               │
      │               │                            │
      │  RESULT tsvector:                          │
      │  'fast':4 'postg':1,7 'run':3,6            │
      └────────────────────────────────────────────┘
               │
               ▼  @@ (match operator)
               │
      tsquery: to_tsquery('english', 'run & fast')
               =  'run' & 'fast'
               │
               ▼
      MATCH! ✅  (both lexemes appear in tsvector)

🔑 Key Concepts at a Glance

Concept	What It Is	Example
Lexeme	Normalised word stem — the dictionary form	"running" → `run`
Stop word	Common word filtered out (carries no search value)	"is", "the", "and", "a"
Stemming	Reducing a word to its root form	"quickly" → `quick`
tsvector	Preprocessed, indexed form of a document	`'fast':2 'run':1,5`
tsquery	A search expression with boolean logic	`'run' & 'fast'`
@@	The match operator — does tsvector match tsquery?	`tsvector @@ tsquery`
ts_rank	Ranks results by relevance (0.0–1.0)	`ts_rank(vec, query)`
GIN index	Index type for fast FTS queries	`CREATE INDEX ... USING GIN`

12.2

to_tsvector

Converting text into a searchable document

▾

📖 Syntax & Basic Usage

to_tsvector(language, text) converts raw text into a tsvector. The language controls the stop-word list and stemmer.

-- Basic conversion
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog');

-- Result:
-- 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
-- Notice: "The", "over" removed (stop words)
--         "jumps" → "jump", "lazy" → "lazi" (stemmed)

-- Each word gets a position number (used for phrase search)
SELECT to_tsvector('english', 'PostgreSQL runs fast. PostgreSQL is powerful.');
-- 'fast':3 'postgresql':1,5 'powerful':6 'run':2
-- "postgresql" appears at positions 1 and 5

-- Without language — uses default_text_search_config
SELECT to_tsvector('Databases are amazing tools for data storage');
-- 'amaz':3 'data':6 'databas':1 'storag':7 'tool':4

🔤 What Happens to Each Word

Let's trace the sentence "The users are quickly running their PostgreSQL databases" step by step:

Input tokens

The

users → user

are

quickly → quick

running → run

their

PostgreSQL → postgresql

databases → databas

■ Stop words (removed) ■ Stemmed (kept)

-- Verify the stemming yourself:
SELECT to_tsvector('english',
  'The users are quickly running their PostgreSQL databases');

-- Result: 'databas':8 'postgresql':7 'quick':4 'run':5 'user':2
-- Positions: user=2, quick=4, run=5, postgresql=7, databas=8

🗄️ Storing tsvector in a Table Column

For production, don't call to_tsvector() on every query — store the result in a dedicated column and keep it updated with a trigger or generated column.

-- Option 1: Generated column (PG 12+) — auto-updated on INSERT/UPDATE
CREATE TABLE articles (
    id         BIGSERIAL PRIMARY KEY,
    title      TEXT NOT NULL,
    body       TEXT NOT NULL,
    author     TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),

    -- Auto-generated tsvector: weighted title (A) + body (B)
    search_vec TSVECTOR GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body,  '')), 'B')
    ) STORED
);

-- Option 2: Trigger-based (PG 11 and earlier)
CREATE TABLE articles (
    id         BIGSERIAL PRIMARY KEY,
    title      TEXT,
    body       TEXT,
    search_vec TSVECTOR   -- manually maintained
);

-- Trigger to keep search_vec up to date
CREATE OR REPLACE FUNCTION articles_search_update() RETURNS TRIGGER AS $$
BEGIN
    NEW.search_vec :=
        setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.body,  '')), 'B');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_search_trigger
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE FUNCTION articles_search_update();

💡

setweight() marks lexemes with weight A, B, C, or D. Title words (weight A) score higher than body words (weight B) in ts_rank. This is how Google-style "title match beats body match" works.

🌍 Supported Languages

-- List all available text search configurations (languages)
SELECT cfgname FROM pg_ts_config;

-- Common ones:
-- english, spanish, french, german, portuguese,
-- italian, dutch, russian, simple

-- 'simple' — no stemming, no stop words (good for codes/IDs)
SELECT to_tsvector('simple', 'The quick brown fox');
-- 'brown':3 'fox':4 'quick':2 'the':1  (all words kept!)

-- Check current default
SHOW default_text_search_config;  -- usually 'pg_catalog.english'

-- Set default for session
SET default_text_search_config = 'english';

12.3

to_tsquery

Building search queries with boolean logic

▾

🔍 Query Functions

PostgreSQL provides three functions to build a tsquery, each with different strictness:

Function	Behaviour	Best For
`to_tsquery()`	Each word must be a valid lexeme. Errors on stop words.	Internal/programmatic queries where you control input
`plainto_tsquery()`	Plain text → AND of all words. Stop words silently ignored.	Simple user search box — "postgres fast query"
`phraseto_tsquery()`	Words must appear in order, adjacent.	Exact phrase search — "machine learning"
`websearch_to_tsquery()`	Google-style: AND/OR/NOT/phrases from natural input.	User-facing search with "quotes" and -exclusions

🛠️ to_tsquery — Full Boolean Control

-- AND: both words must appear
SELECT to_tsquery('english', 'postgresql & performance');
-- 'postgresql' & 'perform'  (both must match)

-- OR: either word
SELECT to_tsquery('english', 'postgresql | mysql');
-- 'postgresql' | 'mysql'

-- NOT: word must NOT appear
SELECT to_tsquery('english', 'database & !nosql');
-- 'databas' & !'nosql'

-- Phrase: words adjacent in order (distance operator)
SELECT to_tsquery('english', 'machine <-> learning');
-- 'machin' <1> 'learn'  (must appear next to each other)

-- Custom distance: within N words
SELECT to_tsquery('english', 'query <2> optimization');
-- must appear within 2 positions of each other

-- Prefix matching with :*
SELECT to_tsquery('english', 'postg:*');
-- matches: postgres, postgresql, postgreSQL ✅

-- Combined complex query
SELECT to_tsquery('english', '(postgresql | mysql) & (index | performance) & !slow');
-- Reads: (postgresql or mysql) AND (index or performance) AND NOT slow

💬 plainto_tsquery — Best for User Input

-- Just paste raw user text — it handles everything
SELECT plainto_tsquery('english', 'the quick brown fox');
-- 'quick' & 'brown' & 'fox'  ("the" removed as stop word)

SELECT plainto_tsquery('english', 'postgres query optimization tips');
-- 'postgr' & 'queri' & 'optim' & 'tip'

-- No error on stop words (unlike to_tsquery)
SELECT plainto_tsquery('english', 'the is are');
-- '' (empty — all stop words, no error) ⚠️ check for empty result

🌐 websearch_to_tsquery — Google-Style

-- Double quotes = phrase search
SELECT websearch_to_tsquery('english', '"machine learning" database');
-- 'machin' <-> 'learn' & 'databas'

-- Minus = NOT
SELECT websearch_to_tsquery('english', 'postgresql -mysql');
-- 'postgresql' & !'mysql'

-- OR keyword
SELECT websearch_to_tsquery('english', 'postgres OR mysql fast');
-- ('postgresql' | 'mysql') & 'fast'

-- Real-world: build a search API endpoint
SELECT id, title
FROM articles
WHERE search_vec @@ websearch_to_tsquery('english', 'postgresql "query planner" -slow')
ORDER BY ts_rank(search_vec, websearch_to_tsquery('english', 'postgresql "query planner" -slow')) DESC
LIMIT 10;

⚡ The Match Operator @@

-- @@ returns TRUE if tsvector matches tsquery
SELECT
    to_tsvector('english', 'PostgreSQL is fast and powerful')
    @@
    to_tsquery('english', 'fast & powerful')  AS match;
-- Result: true ✅

SELECT
    to_tsvector('english', 'PostgreSQL is fast and powerful')
    @@
    to_tsquery('english', 'fast & mysql')  AS match;
-- Result: false ❌ (mysql not in text)

-- Using in a WHERE clause — this is the core FTS query pattern
SELECT id, title
FROM articles
WHERE search_vec @@ plainto_tsquery('english', 'postgres index performance');

12.4

Ranking with ts_rank

Sorting results by relevance — like a real search engine

▾

📊 What Is ts_rank?

ts_rank(tsvector, tsquery) returns a float between 0 and 1 indicating how relevant a document is to the search query. Higher = more relevant. It considers frequency of the matching lexemes and their weights (A > B > C > D).

-- Basic ranking: sort by relevance score descending
SELECT
    id,
    title,
    ts_rank(search_vec, to_tsquery('english', 'postgresql & index')) AS rank
FROM articles
WHERE search_vec @@ to_tsquery('english', 'postgresql & index')
ORDER BY rank DESC
LIMIT 10;

⚖️ ts_rank vs ts_rank_cd

Function	Algorithm	Use When
`ts_rank()`	Based on frequency of matching lexemes. The more times a word appears, the higher the rank.	General purpose. Short documents (articles, descriptions).
`ts_rank_cd()`	Cover Density ranking — also considers proximity of matched terms to each other.	Long documents where clustering of terms matters (books, manuals).

-- ts_rank_cd — better for long documents
SELECT
    id,
    title,
    ts_rank_cd(search_vec, query) AS rank
FROM articles,
     to_tsquery('english', 'postgresql & performance') query
WHERE search_vec @@ query
ORDER BY rank DESC;

🏷️ Weight Ranking — Title Beats Body

When you use setweight(), words from title (A) are ranked higher than words from body (B). You can tune the weight multipliers:

-- Default weights: D=0.1, C=0.2, B=0.4, A=1.0
-- Pass custom weights array: {D, C, B, A}

SELECT
    id,
    title,
    ts_rank(
        search_vec,
        to_tsquery('english', 'postgresql'),
        32  -- normalization: divide rank by document length
    ) AS rank
FROM articles
WHERE search_vec @@ to_tsquery('english', 'postgresql')
ORDER BY rank DESC;

-- Normalization values (can be OR'd together):
-- 0  = ignore document length (default)
-- 1  = divide by 1 + log(length)
-- 2  = divide by document length
-- 4  = divide by mean harmonic distance between extents
-- 8  = divide by number of unique words
-- 16 = divide by 1 + log(unique word count)
-- 32 = divide by rank + 1 (most common normalization)

📌 ts_headline — Highlight Matches

ts_headline() wraps matching words in your text with HTML tags — essential for showing search snippets like Google does.

-- Generate highlighted snippet
SELECT
    id,
    title,
    ts_headline(
        'english',
        body,
        to_tsquery('english', 'postgresql & index'),
        'MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=false,
         MaxFragments=2, StartSel=<b>, StopSel=</b>'
    ) AS snippet
FROM articles
WHERE search_vec @@ to_tsquery('english', 'postgresql & index')
ORDER BY
    ts_rank(search_vec, to_tsquery('english', 'postgresql & index')) DESC
LIMIT 5;

-- Example output for snippet:
-- "...using <b>PostgreSQL</b> you can create a GIN <b>index</b>
--  on any tsvector column for fast full text..."

-- Options explained:
-- MaxWords/MinWords: size of the snippet
-- MaxFragments: how many text fragments to return
-- StartSel/StopSel: tags to wrap matched words in
-- HighlightAll: if true, highlight entire document

⚠️

ts_headline is slow! It re-parses the raw text column (not the tsvector). Never use it without a WHERE filter first — only highlight the top results you're returning, not the whole table.

📈 Ranking in Action — Visual Example

Three documents, same query 'postgresql & index' — rank varies by frequency and weight:

Rank scores — "postgresql & index"

"PostgreSQL Index Tuning Guide" (title: both words)

0.912

"PostgreSQL Tips" (body: mentions index 3×)

0.583

"Intro to Databases" (body: each word once)

0.224

12.5

GIN Indexes

Making full text search lightning fast at scale

▾

⚡ Why GIN?

GIN (Generalized Inverted Index) is the right index type for FTS. It stores a posting list — for each lexeme, a list of document IDs that contain it. This lets PostgreSQL find all matching documents for a lexeme in O(1) time.

GIN Index internals — Inverted Index

Lexeme        │  Document IDs (posting list)
──────────────┼──────────────────────────────────────
'databas'     │  [1, 5, 12, 44, 203, 901 ...]
'fast'        │  [2, 5, 7, 44, 100, 203 ...]
'index'       │  [1, 2, 5, 19, 44, 55, 203 ...]
'postgresql'  │  [1, 2, 5, 7, 12, 19, 44, 203 ...]
'run'         │  [3, 8, 44, 100 ...]

Query: 'postgresql' & 'index'
→ Lookup 'postgresql' posting list: [1,2,5,7,12,19,44,203...]
→ Lookup 'index'      posting list: [1,2,5,19,44,55,203...]
→ Intersect: [1,2,5,19,44,203...]
→ Fetch only those rows ✅ (no full scan!)

🛠️ Creating GIN Indexes

-- Option A: Index the stored tsvector column (BEST — fastest queries)
CREATE INDEX idx_articles_search
    ON articles USING GIN (search_vec);

-- Option B: Functional index — no extra column needed
-- (slightly slower on INSERT but no extra column storage)
CREATE INDEX idx_articles_fts
    ON articles USING GIN (
        to_tsvector('english', title || ' ' || body)
    );

-- Option C: Separate indexes per column (when weights matter)
CREATE INDEX idx_articles_title_gin
    ON articles USING GIN (to_tsvector('english', title));

CREATE INDEX idx_articles_body_gin
    ON articles USING GIN (to_tsvector('english', body));

-- Build concurrently on large tables (no lock)
CREATE INDEX CONCURRENTLY idx_articles_search_gin
    ON articles USING GIN (search_vec);

✅

Best practice: Store the tsvector in a generated column and create the GIN index on that column. This gives the fastest query performance and keeps the index up-to-date automatically.

📊 GIN vs GiST for FTS

Index	Query Speed	Build Speed	Update Speed	Space	Use For FTS?
GIN	Fast	Slow	Slower	Larger	✅ Preferred
GiST	Moderate	Fast	Faster	Smaller	Only if heavy writes

💡

Rule of thumb: Use GIN for FTS unless your table is extremely write-heavy (millions of inserts/sec). GIN queries are 2–5× faster. GiST is only better if build/update time is a critical bottleneck.

✅ Verify Index Is Being Used

-- Run EXPLAIN to verify GIN is used
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title
FROM articles
WHERE search_vec @@ to_tsquery('english', 'postgresql & index');

-- Look for in EXPLAIN output:
-- Bitmap Index Scan on idx_articles_search  ← GIN used ✅
--   Index Cond: (search_vec @@ '''postgresql'' & ''index''')

-- If you see Seq Scan — index not used ❌
-- Reasons: table too small (planner prefers seq scan),
--          or functional index language mismatch

12.6

Interactive: FTS Playground

See tsvector, tsquery, and results in action

▾

🧪 tsvector Preview

Type any text and see what PostgreSQL's lexer produces (stop words removed, words stemmed).

📄 to_tsvector Simulator

Language:

Input text:

Click "Generate tsvector" to see the result →

🔎 Live Search Simulator

Search across sample articles and see ranked results with highlighted snippets.

🗞️ Article Search Demo

Search query:

Query mode:

Type a search term and click Search →

⚙️ Full Production Query Builder

🏗️ Generate FTS SQL

Table name:

tsvector column:

Search input:

Include headline: Limit:

Click "Generate SQL" →

12.7

Real-World FTS Patterns

Complete search implementations for production

▾

📰 Blog / Document Search (Complete Setup)

-- 1. Create the table with generated tsvector
CREATE TABLE articles (
    id         BIGSERIAL PRIMARY KEY,
    title      TEXT        NOT NULL,
    body       TEXT        NOT NULL,
    tags       TEXT[],
    author_id  BIGINT      NOT NULL,
    published  BOOLEAN     DEFAULT false,
    created_at TIMESTAMPTZ DEFAULT NOW(),

    -- Weighted vector: title (A) > body (B)
    search_vec TSVECTOR GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body,  '')), 'B')
    ) STORED
);

-- 2. GIN index on the search vector
CREATE INDEX idx_articles_search ON articles USING GIN(search_vec);

-- 3. The search query with ranking + headline
SELECT
    id,
    title,
    author_id,
    created_at,
    ts_rank(search_vec, query)    AS rank,
    ts_headline('english', body, query,
        'MaxWords=30, MinWords=15, MaxFragments=2')  AS snippet
FROM
    articles,
    websearch_to_tsquery('english', 'postgresql indexing') query
WHERE
    search_vec @@ query
    AND published = true               -- only show published
ORDER BY
    rank DESC
LIMIT 10;

💬 AI Chat: Message Search

-- Search inside chat history for a specific user
CREATE TABLE messages (
    id              BIGSERIAL PRIMARY KEY,
    conversation_id BIGINT NOT NULL,
    user_id         BIGINT NOT NULL,
    role            TEXT   NOT NULL,  -- 'user' | 'assistant'
    content         TEXT   NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    search_vec      TSVECTOR GENERATED ALWAYS AS (
        to_tsvector('english', content)
    ) STORED
);

CREATE INDEX idx_messages_search ON messages USING GIN(search_vec);
CREATE INDEX idx_messages_user   ON messages(user_id, created_at DESC);

-- Find messages mentioning "refund" for a specific user
SELECT
    m.id,
    m.conversation_id,
    m.role,
    m.created_at,
    ts_headline('english', m.content,
        plainto_tsquery('english', 'refund payment')) AS snippet
FROM messages m
WHERE
    m.user_id   = 42
    AND m.search_vec @@ plainto_tsquery('english', 'refund payment')
ORDER BY m.created_at DESC
LIMIT 20;

🛒 Product Search with Filters

CREATE TABLE products (
    id          BIGSERIAL PRIMARY KEY,
    name        TEXT           NOT NULL,
    description TEXT,
    category    TEXT,
    price       NUMERIC(10,2),
    in_stock    BOOLEAN        DEFAULT TRUE,
    search_vec  TSVECTOR GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(name,        '')), 'A') ||
        setweight(to_tsvector('english', coalesce(description, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(category,    '')), 'C')
    ) STORED
);

CREATE INDEX idx_products_search ON products USING GIN(search_vec);

-- Search products: text + category filter + price range
SELECT
    id, name, price, category,
    ts_rank(search_vec, websearch_to_tsquery('english', 'wireless headphones')) AS rank
FROM products
WHERE
    search_vec @@ websearch_to_tsquery('english', 'wireless headphones')
    AND category = 'Electronics'
    AND price BETWEEN 500 AND 5000
    AND in_stock = true
ORDER BY rank DESC
LIMIT 20;

💡 FTS Best Practices

1

Always use a stored tsvector column + GIN index

Calling to_tsvector() at query time skips the index. Pre-compute and store it.

2

Use websearch_to_tsquery() for user-facing inputs

It gracefully handles stop words, quotes, and minus signs — just like Google search.

3

Apply ts_headline only on the top results

Run WHERE + ORDER BY rank LIMIT first, then apply ts_headline to those rows only.

4

Add setweight() for title vs body

Title matches should outrank body matches — always weight your columns intentionally.

5

Combine FTS with regular filters

FTS and WHERE price < 500 AND category = 'X' work together. PostgreSQL uses bitmap AND to combine the GIN index with B-tree indexes.