PostgreSQL — Topic 10: Query Optimization

10.1

The Query Planner

How PostgreSQL decides how to run your query

▾

What the Planner Does

When you send a SQL query to PostgreSQL, it doesn't execute it immediately. It first goes through a pipeline:

1

Parse

SQL text is parsed into an internal parse tree. Syntax errors are caught here.

2

Analyze / Rewrite

Resolves table/column names, checks permissions, expands views, rewrites subqueries.

3

Plan (The Smart Part)

The planner generates multiple possible execution strategies and picks the one with the lowest estimated cost. This is where EXPLAIN shows you what it chose.

4

Execute

The executor runs the chosen plan and returns results. EXPLAIN ANALYZE shows you how long each step actually took.

Cost-Based Optimization

PostgreSQL is a cost-based optimizer. It assigns a numerical cost to each possible plan (based on estimated I/O and CPU work) and picks the plan with the lowest cost. Costs are not real milliseconds — they're abstract units based on configurable parameters.

How the planner estimates cost

  Cost = (pages read × seq_page_cost)
       + (index pages × random_page_cost)
       + (rows processed × cpu_tuple_cost)
       + (comparisons × cpu_operator_cost)

  Default values:
    seq_page_cost    = 1.0   (sequential disk read — cheap)
    random_page_cost = 4.0   (random disk seek — 4× more expensive)
    cpu_tuple_cost   = 0.01  (processing one row)
    cpu_operator_cost= 0.0025 (one comparison)

  → On SSDs, set random_page_cost = 1.1 (almost same as sequential)

⚙️

Tune for your storage: If you're on SSD (which most cloud databases are), set random_page_cost = 1.1 in postgresql.conf. This makes the planner favor index scans more often, which is correct behavior on SSDs where random reads are nearly as fast as sequential.

Statistics — The Planner's Eyes

The planner can't know exactly how many rows a condition will return without running it. Instead it uses statistics collected by ANALYZE — histograms and frequency data about column values stored in pg_statistic.

-- ANALYZE collects statistics on a table (run automatically by autovacuum)
ANALYZE orders;

-- View statistics: most common values for a column
SELECT
  attname,
  n_distinct,
  correlation    -- 1.0 = perfectly sorted, -1.0 = reverse sorted, ~0 = random
FROM  pg_stats
WHERE tablename = 'orders';

-- Increase statistics target for columns with skewed data (default = 100)
ALTER TABLE orders
  ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
-- More statistics = better estimates = better plans for skewed distributions

10.2

EXPLAIN — Reading the Query Plan

Decode every line the planner shows you

▾

Basic EXPLAIN Syntax

-- Shows the query PLAN (does NOT run the query)
EXPLAIN SELECT * FROM orders WHERE user_id = 42;

-- Shows plan + actually RUNS the query and reports real timing
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;

-- Full detail: plan + run + memory + I/O buffers (most useful)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
  SELECT * FROM orders WHERE user_id = 42;

-- JSON format for programmatic tools (explain.depesz.com, pganalyze)
EXPLAIN (ANALYZE, FORMAT JSON)
  SELECT * FROM orders WHERE user_id = 42;

⚠️

EXPLAIN ANALYZE actually runs the query. If you EXPLAIN ANALYZE an UPDATE or DELETE, it will modify data. Wrap it in a transaction and ROLLBACK if you don't want the side effects:
BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK;

Anatomy of an EXPLAIN Output

Let's read a real EXPLAIN output line by line. Every indented node represents one step in the plan tree. The innermost (most indented) nodes execute first.

Nested Loop (cost=0.86..92.15 rows=12 width=136) ← estimated -> Index Scan using idx_orders_user_id on orders (cost=0.43..44.76 rows=12 width=104) Index Cond: (user_id = 42) -> Index Scan using users_pkey on users (cost=0.43..3.95 rows=1 width=32) Index Cond: (id = orders.user_id)

cost=0.86..92.15 start_cost..total_cost — estimated cost to return first row .. to return all rows. These are abstract units, not milliseconds.

rows=12 Estimated rows the planner thinks this node will return. Compare with actual rows in EXPLAIN ANALYZE to find estimation errors.

width=136 Estimated average row size in bytes. Helps spot unexpectedly wide rows.

Index Scan using … Which index is being used. "Index Scan" = index lookup + heap fetch. "Index Only Scan" = index lookup only (faster).

Index Cond: (user_id = 42) The condition being evaluated using the index.

Nested Loop The join strategy. For each row in the outer table, it finds matching rows in the inner table. Good for small outer result sets.

The Three Scan Types

Sequential Scan

Reads every page of the table from start to finish. Used when: no usable index, table is small, large fraction of rows returned.

🐢 Slow on large tables

Index Scan

Traverses index to find matching row pointers, then fetches each row from heap. Used when: selective condition, index exists, small fraction of rows.

⚡ Fast for selective queries

Bitmap Index Scan

Scans index, builds a bitmap of matching pages, then reads heap pages in order. Best for medium selectivity or combining multiple indexes (BitmapAnd / BitmapOr).

✓ Good for moderate selectivity

Index Only Scan

All needed columns are in the index (via INCLUDE or are key columns). Never touches the heap. Fastest possible scan for covered queries.

🚀 Fastest — zero heap I/O

🎯

When does PostgreSQL choose Seq Scan over Index Scan? When it estimates that more than ~5–15% of the table's rows will be returned, a sequential scan is actually faster because it reads pages in order (avoiding random I/O). This is correct behavior — don't always fear Seq Scan on small tables.

-- Which scan type will be chosen?

-- Table: 1M rows, 1% match → Index Scan
SELECT * FROM orders WHERE user_id = 42;

-- Table: 1M rows, 60% match → Sequential Scan (most rows anyway)
SELECT * FROM orders WHERE status != 'completed';

-- Two indexes combined → Bitmap Index Scan + BitmapAnd
SELECT * FROM orders
WHERE  user_id = 42 AND status = 'pending';
-- PostgreSQL may use idx_user_id + idx_status, intersect bitmaps

-- Covering index → Index Only Scan
SELECT order_id, status FROM orders WHERE user_id = 42;
-- If index on (user_id) INCLUDE (order_id, status) exists

10.3

EXPLAIN ANALYZE — Actual vs Estimated

Actual cost · Actual time · Row estimation errors

▾

What EXPLAIN ANALYZE Adds

EXPLAIN ANALYZE actually runs the query and adds actual timing and row counts next to the planner's estimates. This lets you spot two problems: (1) slow nodes, (2) bad estimates that led to a wrong plan.

Nested Loop (cost=0.86..92.15 rows=12 width=136) ← estimated (actual time=0.123..4.876 rows=847 loops=1) ← ACTUAL ↑ PROBLEM: estimated 12, got 847! -> Index Scan using idx_orders_user_id on orders (cost=0.43..44.76 rows=12 width=104) (actual time=0.041..2.310 rows=847 loops=1) Index Cond: (user_id = 42) Planning Time: 0.234 ms Execution Time: 156.789 ms ← total wall-clock time

actual time=0.123..4.876 start_time..end_time in milliseconds for this node. Time to first row .. time to all rows.

rows=847 Actual rows returned. Compare with estimated rows=12. A 70× mismatch means stale or insufficient statistics — run ANALYZE.

loops=1 How many times this node was executed. In a Nested Loop join, the inner node runs once per outer row. If loops=1000, multiply the per-loop time × 1000.

Planning Time How long PostgreSQL spent choosing the plan. Usually <1ms. High planning time = very complex query with many joins.

Execution Time Total wall-clock time to run the query and return results. This is what your application feels.

Reading EXPLAIN ANALYZE — Interactive Examples

Click a scenario to see the EXPLAIN ANALYZE output and diagnosis:

Seq Scan on orders (cost=0.00..28450.00 rows=1 width=104) (actual time=12345.67..12345.68 rows=1 loops=1) Filter: (user_id = 42) Rows Removed by Filter: 2000000 Execution Time: 12345.68 ms

🔴

Diagnosis: PostgreSQL scanned 2 million rows to find 1. "Rows Removed by Filter" is the smoking gun — it means no index was used, everything passed through a filter.

Fix: CREATE INDEX idx_orders_user_id ON orders (user_id);

Hash Join (cost=125.00..890.00 rows=10 width=200) (actual time=5.234..890.123 rows=85000 loops=1) Hash Cond: (orders.user_id = users.id) -> Seq Scan on orders (cost=0.00..680.00 rows=10 width=104) (actual time=0.012..120.45 rows=85000 loops=1) -> Hash (cost=45.00..45.00 rows=1000 width=96) (actual time=2.100..2.100 rows=1000 loops=1) Execution Time: 890.123 ms

🟡

Diagnosis: Planner estimated 10 rows, got 85,000 — an 8,500× mismatch. This causes the planner to choose the wrong join strategy and under-allocate hash memory.

Fix: Run ANALYZE orders; to refresh statistics. If the column has skewed values, increase statistics target: ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;

Nested Loop (cost=0.86..9200.00 rows=100 width=200) (actual time=0.234..45000.789 rows=100 loops=1) -> Seq Scan on orders (outer) (actual rows=10000 loops=1) -> Index Scan on order_items (inner) (actual rows=1 loops=10000) ← 10000 index lookups! Index Cond: (order_id = orders.id) Execution Time: 45000.789 ms

🔴

Diagnosis: The inner index scan executed 10,000 times (loops=10000). Nested loop on large outer sets is expensive. The planner chose this because it expected 100 outer rows, not 10,000.

Fix: Run ANALYZE to correct the row estimate. The planner will likely switch to a Hash Join which builds a hash table once. Alternatively, rewrite the query to filter outer rows earlier.

Index Only Scan using idx_orders_api_cover on orders (cost=0.43..8.47 rows=20 width=48) (actual time=0.021..0.089 rows=20 loops=1) Index Cond: (user_id = 1001) Heap Fetches: 0 ← zero heap I/O! Planning Time: 0.123 ms Execution Time: 0.112 ms

🟢

This is ideal. Index Only Scan means all data came from the index — the heap (table) was never touched. "Heap Fetches: 0" confirms it. Execution time: 0.112ms for 20 rows.

How to achieve this: CREATE INDEX idx_orders_api_cover ON orders (user_id, created_at DESC) INCLUDE (order_id, status, total);

Using BUFFERS — Find I/O Bottlenecks

Adding BUFFERS to EXPLAIN ANALYZE shows how many disk pages were read from shared buffer cache (fast) vs. read from disk (slow).

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42;

-- Output includes:
--   Buffers: shared hit=5  read=120
--   ↑ "hit" = came from memory cache (fast)
--   ↑ "read" = had to read from disk (slow — 120 random disk reads!)
--
-- Goal: maximize cache hits, minimize disk reads
-- If read is high → either the table/index needs more RAM (shared_buffers)
-- or the query is doing too much disk I/O (missing index / bad plan)

10.4

Join Strategies

Nested Loop · Hash Join · Merge Join

▾

The Three Ways PostgreSQL Joins Tables

Nested Loop Join

For each row in outer table, probe inner table (usually via index). Best when outer result is tiny (1–100 rows) and inner has a good index.

Cost: O(outer × index_lookup)

Hash Join

Build a hash table of the smaller table, probe with the larger. Best for large tables with no index, or when both sides return many rows. Most common in analytics.

Cost: O(build + probe)

Merge Join

Both inputs must be sorted on the join key. Reads them in parallel like a zip. Excellent when both sides are already sorted (e.g., index scan on both). Rare in practice.

Cost: O(sort + merge)

Join Strategy Examples in EXPLAIN

-- Query: join users and orders
EXPLAIN ANALYZE
SELECT u.name, o.total
FROM   users u
JOIN   orders o ON o.user_id = u.id
WHERE  u.id = 42;

-- Plan: user.id=42 matches 1 user → few orders → Nested Loop
--
-- Hash Join (analytics query, many rows):
-- EXPLAIN ANALYZE
-- SELECT u.name, COUNT(*) FROM users u JOIN orders o ON o.user_id = u.id
-- GROUP BY u.name;
-- → Hash Join: builds hash table on users, probes with orders

work_mem — Memory for Sorts and Hashes

work_mem controls how much memory each sort or hash operation can use before spilling to disk. Too little → disk spills → slow queries. Too much → OOM with many concurrent queries.

-- Check current work_mem
SHOW work_mem;   -- default: 4MB (often too low)

-- Set it for your current session (safe)
SET work_mem = '64MB';

-- In EXPLAIN ANALYZE, disk spill shows as:
--   Sort Method: external merge  Disk: 1456kB  ← spilling to disk
--   Sort Method: quicksort  Memory: 25kB       ← fits in memory ✓

-- Tip: a single query can use work_mem × number_of_sort_operations
-- So with 100 connections × 64MB = 6.4GB potential RAM usage
-- Set it per-session for heavy queries rather than globally

10.5

Common Optimization Patterns

Real fixes you'll use every day in production

▾

Pattern 1 — Function on Column Blocks Index

Wrapping a column in a function prevents the index from being used because the index stores raw values, not function results.

✗ Index not used

-- LOWER() prevents index use on email
SELECT * FROM users
WHERE LOWER(email) = 'alice@x.com';
-- → Seq Scan (even with index on email)

✓ Fix: Expression Index

-- Create an index on the expression
CREATE INDEX idx_users_email_lower
  ON users (LOWER(email));

-- Now the query uses the index
SELECT * FROM users
WHERE LOWER(email) = 'alice@x.com';

-- Other common expression indexes
-- Index on extracted JSONB field:
CREATE INDEX idx_meta_brand ON products ((metadata ->> 'brand'));

-- Index on date part (group by month):
CREATE INDEX idx_orders_month ON orders (DATE_TRUNC('month', created_at));

-- Index on COALESCE:
CREATE INDEX idx_name_coalesce ON users (COALESCE(display_name, username));

Pattern 2 — OFFSET Pagination is Slow at Scale

Using OFFSET for pagination is extremely slow on large tables because PostgreSQL must scan and discard all rows before the offset.

✗ OFFSET pagination (gets slower each page)

-- Page 1: fast
SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 0;

-- Page 5000: scans 100,000 rows just to skip them
SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 100000;
-- → Seq Scan, 100k rows discarded

✓ Keyset / Cursor pagination (always fast)

-- First page
SELECT * FROM orders ORDER BY id LIMIT 20;

-- Next page: use last seen id as cursor
SELECT * FROM orders
WHERE id > 100020          -- last id from prev page
ORDER BY id LIMIT 20;
-- → Index Scan, always fast regardless of page

Pattern 3 — SELECT * Hurts Performance

✗ SELECT * — fetches everything

-- Fetches all columns including large TEXT/JSONB
-- Prevents Index-Only Scans
-- Wastes network bandwidth
SELECT * FROM orders WHERE user_id = 42;

✓ Select only what you need

-- Smaller rows = faster transfer
-- Can enable Index-Only Scan with INCLUDE
SELECT order_id, status, total
FROM orders
WHERE user_id = 42;

Pattern 4 — N+1 Queries (the Silent Killer)

An N+1 problem is when application code runs 1 query to get N records, then N more queries to get related data for each. This makes 1+N database round trips instead of 1.

✗ N+1 (100 users = 101 queries)

-- Query 1: get 100 users
users = SELECT * FROM users LIMIT 100;

-- For each user (in app code):
for user in users:
  orders = SELECT * FROM orders
           WHERE user_id = user.id;
-- = 101 database round trips!

✓ JOIN or IN — 1 query

-- One query with JOIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.id = ANY(ARRAY[1,2,...,100]);

-- Or bulk fetch and join in app
SELECT * FROM orders
WHERE user_id = ANY(ARRAY[1,2,...100]);
-- = 1 database round trip ✓

Pattern 5 — Implicit Type Cast Breaks Index

-- Column type: user_id BIGINT
-- Index: CREATE INDEX idx ON orders (user_id);

-- ✗ Bad: passing a string — implicit cast, index may not be used
SELECT * FROM orders WHERE user_id = '42';

-- ✓ Good: correct type
SELECT * FROM orders WHERE user_id = 42;

-- ✗ Bad: comparing varchar id to integer
SELECT * FROM users WHERE phone = 9876543210;   -- phone is TEXT

-- ✓ Always pass parameters in the correct type in your ORM/driver

Pattern 6 — OR vs UNION ALL for Different Indexes

✗ OR may force Seq Scan

-- Planner might choose Seq Scan
-- even if both conditions are indexed
SELECT * FROM users
WHERE email = 'a@x.com'
   OR phone = '9876543210';

✓ UNION ALL uses both indexes

SELECT * FROM users
WHERE email = 'a@x.com'
UNION ALL
SELECT * FROM users
WHERE phone = '9876543210'
  AND email != 'a@x.com';
-- Two separate index scans ✓

⚡

EXPLAIN Output Interpreter

Paste or pick a scenario — get a plain-English diagnosis

▾

Choose a common EXPLAIN warning sign:

📊

Rows Mismatch: estimated ≠ actual
What you see: rows=10 (estimated) vs rows=85000 (actual) — large gap.

Why it matters: The planner chose a suboptimal join strategy (Nested Loop instead of Hash Join) because it underestimated rows. This caused the query to be 100× slower than needed.

Fix steps:
1. ANALYZE table_name; — refresh column statistics
2. If the column has a skewed distribution: ALTER TABLE t ALTER COLUMN col SET STATISTICS 500; ANALYZE t;
3. For very complex multi-join queries: consider SET join_collapse_limit = 1; to force join order.

🐢

Sequential Scan on a Large Table
What you see: Seq Scan on orders with Rows Removed by Filter: 2000000.

Why it happens: No suitable index exists, or the planner estimates too many rows match to bother with an index.

Fix steps:
1. Add an index on the WHERE column: CREATE INDEX ON orders (user_id);
2. If the planner still chooses Seq Scan despite index: run ANALYZE, check if >5% rows match (Seq Scan may actually be correct).
3. Force index for testing: SET enable_seqscan = off; then re-run EXPLAIN ANALYZE (don't leave this in production).

💾

Sort Disk Spill
What you see: Sort Method: external merge Disk: 45632kB — the sort spilled to disk.

Why it matters: Disk sorts are 10–100× slower than in-memory sorts. This shows up in heavy ORDER BY, GROUP BY, or Hash Join operations.

Fix steps:
1. Increase work_mem for this session: SET work_mem = '256MB'; then re-run.
2. If sort is on an indexed column, add an ORDER BY index so the sort step is skipped entirely.
3. For global tuning: increase work_mem in postgresql.conf but be careful (concurrent queries multiply it).

🔍

Index Exists but Not Being Used
What you see: Seq Scan even though you created an index on the column.

Common reasons + fixes:
1. Function wrapping column: WHERE LOWER(email) = ? — create expression index on LOWER(email).
2. Type mismatch: Column is BIGINT but you're comparing to a string literal. Fix the parameter type.
3. Stale statistics: Run ANALYZE table;
4. Low selectivity: Column has only 2–3 distinct values. Index scan isn't worth it. Use a partial index.
5. Table too small: Planner correctly prefers Seq Scan on a 100-row table.

🔁

High Loops Count on Inner Node
What you see: Inner node shows loops=50000 — executed 50,000 times.

Why it matters: Nested Loop join with a large outer set. Even 0.1ms per inner lookup × 50,000 = 5 seconds.

Fix steps:
1. Run ANALYZE — if the planner underestimated the outer rows, it should switch to Hash Join after statistics are fresh.
2. Force Hash Join to test: SET enable_nestloop = off; then EXPLAIN ANALYZE again.
3. Ensure the inner table join column has a good index (index lookup in nested loop must be O(log n)).
4. Consider rewriting correlated subqueries as JOINs to give the planner more freedom.

📋

EXPLAIN Cheat Sheet

Everything you need to remember at a glance

▾

You See in EXPLAIN	What It Means	Action
`Seq Scan` + large table	Reading every row — no index used	Add index
`Rows Removed by Filter: N`	N rows scanned but discarded — wasteful	Add index or partial index
`Index Only Scan` + Heap Fetches: 0	Perfect — no heap I/O	Ideal, nothing to do
estimated rows ≪ actual rows	Planner had bad statistics	ANALYZE or SET STATISTICS
`Sort Method: external merge Disk:`	Sort spilled to disk	Increase work_mem
`loops=N` where N is large	Nested loop with huge outer set	ANALYZE — planner should switch to Hash Join
`Hash Batches: N` where N > 1	Hash join spilled to disk (N batches)	Increase work_mem
`Bitmap Heap Scan`	Medium selectivity — two-step index + heap	Usually fine
`Buffers: shared read=N` (large N)	Many disk reads — not in cache	Check shared_buffers, query frequency
High `Planning Time`	Too many joins or complex query structure	Simplify query or use CTEs

🏆

The Query Optimization Workflow:
1. EXPLAIN (ANALYZE, BUFFERS) on the slow query
2. Find the most expensive node (highest actual time)
3. Check: is it a Seq Scan that should be an Index Scan? → Add index
4. Check: are row estimates way off? → Run ANALYZE
5. Check: is there a disk sort? → Raise work_mem
6. Check: are you seeing high loops? → Fix join order with statistics
7. Re-run EXPLAIN ANALYZE and confirm improvement
8. Use explain.depesz.com or pganalyze for visual tree analysis