Query Optimization
Understanding how PostgreSQL decides to execute your query — and how to read, interpret, and act on that plan — is what separates engineers who guess from engineers who know. EXPLAIN is your primary diagnostic tool.
The Query Planner
How PostgreSQL decides how to run your query
What the Planner Does
When you send a SQL query to PostgreSQL, it doesn't execute it immediately. It first goes through a pipeline:
Cost-Based Optimization
PostgreSQL is a cost-based optimizer. It assigns a numerical cost to each possible plan (based on estimated I/O and CPU work) and picks the plan with the lowest cost. Costs are not real milliseconds — they're abstract units based on configurable parameters.
Cost = (pages read × seq_page_cost)
+ (index pages × random_page_cost)
+ (rows processed × cpu_tuple_cost)
+ (comparisons × cpu_operator_cost)
Default values:
seq_page_cost = 1.0 (sequential disk read — cheap)
random_page_cost = 4.0 (random disk seek — 4× more expensive)
cpu_tuple_cost = 0.01 (processing one row)
cpu_operator_cost= 0.0025 (one comparison)
→ On SSDs, set random_page_cost = 1.1 (almost same as sequential)
random_page_cost = 1.1 in postgresql.conf. This makes the planner favor index scans more often, which is correct behavior on SSDs where random reads are nearly as fast as sequential.Statistics — The Planner's Eyes
The planner can't know exactly how many rows a condition will return without running it. Instead it uses statistics collected by ANALYZE — histograms and frequency data about column values stored in pg_statistic.
-- ANALYZE collects statistics on a table (run automatically by autovacuum) ANALYZE orders; -- View statistics: most common values for a column SELECT attname, n_distinct, correlation -- 1.0 = perfectly sorted, -1.0 = reverse sorted, ~0 = random FROM pg_stats WHERE tablename = 'orders'; -- Increase statistics target for columns with skewed data (default = 100) ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500; ANALYZE orders; -- More statistics = better estimates = better plans for skewed distributions
EXPLAIN — Reading the Query Plan
Decode every line the planner shows you
Basic EXPLAIN Syntax
-- Shows the query PLAN (does NOT run the query) EXPLAIN SELECT * FROM orders WHERE user_id = 42; -- Shows plan + actually RUNS the query and reports real timing EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42; -- Full detail: plan + run + memory + I/O buffers (most useful) EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE user_id = 42; -- JSON format for programmatic tools (explain.depesz.com, pganalyze) EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM orders WHERE user_id = 42;
BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK;Anatomy of an EXPLAIN Output
Let's read a real EXPLAIN output line by line. Every indented node represents one step in the plan tree. The innermost (most indented) nodes execute first.
The Three Scan Types
-- Which scan type will be chosen? -- Table: 1M rows, 1% match → Index Scan SELECT * FROM orders WHERE user_id = 42; -- Table: 1M rows, 60% match → Sequential Scan (most rows anyway) SELECT * FROM orders WHERE status != 'completed'; -- Two indexes combined → Bitmap Index Scan + BitmapAnd SELECT * FROM orders WHERE user_id = 42 AND status = 'pending'; -- PostgreSQL may use idx_user_id + idx_status, intersect bitmaps -- Covering index → Index Only Scan SELECT order_id, status FROM orders WHERE user_id = 42; -- If index on (user_id) INCLUDE (order_id, status) exists
EXPLAIN ANALYZE — Actual vs Estimated
Actual cost · Actual time · Row estimation errors
What EXPLAIN ANALYZE Adds
EXPLAIN ANALYZE actually runs the query and adds actual timing and row counts next to the planner's estimates. This lets you spot two problems: (1) slow nodes, (2) bad estimates that led to a wrong plan.
Reading EXPLAIN ANALYZE — Interactive Examples
Click a scenario to see the EXPLAIN ANALYZE output and diagnosis:
Fix:
CREATE INDEX idx_orders_user_id ON orders (user_id);
Fix: Run
ANALYZE orders; to refresh statistics. If the column has skewed values, increase statistics target: ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
Fix: Run ANALYZE to correct the row estimate. The planner will likely switch to a Hash Join which builds a hash table once. Alternatively, rewrite the query to filter outer rows earlier.
How to achieve this:
CREATE INDEX idx_orders_api_cover ON orders (user_id, created_at DESC) INCLUDE (order_id, status, total);
Using BUFFERS — Find I/O Bottlenecks
Adding BUFFERS to EXPLAIN ANALYZE shows how many disk pages were read from shared buffer cache (fast) vs. read from disk (slow).
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 42; -- Output includes: -- Buffers: shared hit=5 read=120 -- ↑ "hit" = came from memory cache (fast) -- ↑ "read" = had to read from disk (slow — 120 random disk reads!) -- -- Goal: maximize cache hits, minimize disk reads -- If read is high → either the table/index needs more RAM (shared_buffers) -- or the query is doing too much disk I/O (missing index / bad plan)
Join Strategies
Nested Loop · Hash Join · Merge Join
The Three Ways PostgreSQL Joins Tables
Join Strategy Examples in EXPLAIN
-- Query: join users and orders EXPLAIN ANALYZE SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id WHERE u.id = 42; -- Plan: user.id=42 matches 1 user → few orders → Nested Loop -- -- Hash Join (analytics query, many rows): -- EXPLAIN ANALYZE -- SELECT u.name, COUNT(*) FROM users u JOIN orders o ON o.user_id = u.id -- GROUP BY u.name; -- → Hash Join: builds hash table on users, probes with orders
work_mem — Memory for Sorts and Hashes
work_mem controls how much memory each sort or hash operation can use before spilling to disk. Too little → disk spills → slow queries. Too much → OOM with many concurrent queries.
-- Check current work_mem SHOW work_mem; -- default: 4MB (often too low) -- Set it for your current session (safe) SET work_mem = '64MB'; -- In EXPLAIN ANALYZE, disk spill shows as: -- Sort Method: external merge Disk: 1456kB ← spilling to disk -- Sort Method: quicksort Memory: 25kB ← fits in memory ✓ -- Tip: a single query can use work_mem × number_of_sort_operations -- So with 100 connections × 64MB = 6.4GB potential RAM usage -- Set it per-session for heavy queries rather than globally
Common Optimization Patterns
Real fixes you'll use every day in production
Pattern 1 — Function on Column Blocks Index
Wrapping a column in a function prevents the index from being used because the index stores raw values, not function results.
-- LOWER() prevents index use on email SELECT * FROM users WHERE LOWER(email) = 'alice@x.com'; -- → Seq Scan (even with index on email)
-- Create an index on the expression CREATE INDEX idx_users_email_lower ON users (LOWER(email)); -- Now the query uses the index SELECT * FROM users WHERE LOWER(email) = 'alice@x.com';
-- Other common expression indexes -- Index on extracted JSONB field: CREATE INDEX idx_meta_brand ON products ((metadata ->> 'brand')); -- Index on date part (group by month): CREATE INDEX idx_orders_month ON orders (DATE_TRUNC('month', created_at)); -- Index on COALESCE: CREATE INDEX idx_name_coalesce ON users (COALESCE(display_name, username));
Pattern 2 — OFFSET Pagination is Slow at Scale
Using OFFSET for pagination is extremely slow on large tables because PostgreSQL must scan and discard all rows before the offset.
-- Page 1: fast SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 0; -- Page 5000: scans 100,000 rows just to skip them SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 100000; -- → Seq Scan, 100k rows discarded
-- First page SELECT * FROM orders ORDER BY id LIMIT 20; -- Next page: use last seen id as cursor SELECT * FROM orders WHERE id > 100020 -- last id from prev page ORDER BY id LIMIT 20; -- → Index Scan, always fast regardless of page
Pattern 3 — SELECT * Hurts Performance
-- Fetches all columns including large TEXT/JSONB -- Prevents Index-Only Scans -- Wastes network bandwidth SELECT * FROM orders WHERE user_id = 42;
-- Smaller rows = faster transfer -- Can enable Index-Only Scan with INCLUDE SELECT order_id, status, total FROM orders WHERE user_id = 42;
Pattern 4 — N+1 Queries (the Silent Killer)
An N+1 problem is when application code runs 1 query to get N records, then N more queries to get related data for each. This makes 1+N database round trips instead of 1.
-- Query 1: get 100 users
users = SELECT * FROM users LIMIT 100;
-- For each user (in app code):
for user in users:
orders = SELECT * FROM orders
WHERE user_id = user.id;
-- = 101 database round trips!
-- One query with JOIN SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id WHERE u.id = ANY(ARRAY[1,2,...,100]); -- Or bulk fetch and join in app SELECT * FROM orders WHERE user_id = ANY(ARRAY[1,2,...100]); -- = 1 database round trip ✓
Pattern 5 — Implicit Type Cast Breaks Index
-- Column type: user_id BIGINT -- Index: CREATE INDEX idx ON orders (user_id); -- ✗ Bad: passing a string — implicit cast, index may not be used SELECT * FROM orders WHERE user_id = '42'; -- ✓ Good: correct type SELECT * FROM orders WHERE user_id = 42; -- ✗ Bad: comparing varchar id to integer SELECT * FROM users WHERE phone = 9876543210; -- phone is TEXT -- ✓ Always pass parameters in the correct type in your ORM/driver
Pattern 6 — OR vs UNION ALL for Different Indexes
-- Planner might choose Seq Scan -- even if both conditions are indexed SELECT * FROM users WHERE email = 'a@x.com' OR phone = '9876543210';
SELECT * FROM users WHERE email = 'a@x.com' UNION ALL SELECT * FROM users WHERE phone = '9876543210' AND email != 'a@x.com'; -- Two separate index scans ✓
EXPLAIN Output Interpreter
Paste or pick a scenario — get a plain-English diagnosis
Choose a common EXPLAIN warning sign:
What you see:
rows=10 (estimated) vs rows=85000 (actual) — large gap.Why it matters: The planner chose a suboptimal join strategy (Nested Loop instead of Hash Join) because it underestimated rows. This caused the query to be 100× slower than needed.
Fix steps:
1.
ANALYZE table_name; — refresh column statistics2. If the column has a skewed distribution:
ALTER TABLE t ALTER COLUMN col SET STATISTICS 500; ANALYZE t;3. For very complex multi-join queries: consider
SET join_collapse_limit = 1; to force join order.
What you see:
Seq Scan on orders with Rows Removed by Filter: 2000000.Why it happens: No suitable index exists, or the planner estimates too many rows match to bother with an index.
Fix steps:
1. Add an index on the WHERE column:
CREATE INDEX ON orders (user_id);2. If the planner still chooses Seq Scan despite index: run ANALYZE, check if >5% rows match (Seq Scan may actually be correct).
3. Force index for testing:
SET enable_seqscan = off; then re-run EXPLAIN ANALYZE (don't leave this in production).
What you see:
Sort Method: external merge Disk: 45632kB — the sort spilled to disk.Why it matters: Disk sorts are 10–100× slower than in-memory sorts. This shows up in heavy ORDER BY, GROUP BY, or Hash Join operations.
Fix steps:
1. Increase work_mem for this session:
SET work_mem = '256MB'; then re-run.2. If sort is on an indexed column, add an ORDER BY index so the sort step is skipped entirely.
3. For global tuning: increase
work_mem in postgresql.conf but be careful (concurrent queries multiply it).
What you see: Seq Scan even though you created an index on the column.
Common reasons + fixes:
1. Function wrapping column:
WHERE LOWER(email) = ? — create expression index on LOWER(email).2. Type mismatch: Column is
BIGINT but you're comparing to a string literal. Fix the parameter type.3. Stale statistics: Run
ANALYZE table;4. Low selectivity: Column has only 2–3 distinct values. Index scan isn't worth it. Use a partial index.
5. Table too small: Planner correctly prefers Seq Scan on a 100-row table.
What you see: Inner node shows
loops=50000 — executed 50,000 times.Why it matters: Nested Loop join with a large outer set. Even 0.1ms per inner lookup × 50,000 = 5 seconds.
Fix steps:
1. Run ANALYZE — if the planner underestimated the outer rows, it should switch to Hash Join after statistics are fresh.
2. Force Hash Join to test:
SET enable_nestloop = off; then EXPLAIN ANALYZE again.3. Ensure the inner table join column has a good index (index lookup in nested loop must be O(log n)).
4. Consider rewriting correlated subqueries as JOINs to give the planner more freedom.
EXPLAIN Cheat Sheet
Everything you need to remember at a glance
| You See in EXPLAIN | What It Means | Action |
|---|---|---|
Seq Scan + large table |
Reading every row — no index used | Add index |
Rows Removed by Filter: N |
N rows scanned but discarded — wasteful | Add index or partial index |
Index Only Scan + Heap Fetches: 0 |
Perfect — no heap I/O | Ideal, nothing to do |
| estimated rows ≪ actual rows | Planner had bad statistics | ANALYZE or SET STATISTICS |
Sort Method: external merge Disk: |
Sort spilled to disk | Increase work_mem |
loops=N where N is large |
Nested loop with huge outer set | ANALYZE — planner should switch to Hash Join |
Hash Batches: N where N > 1 |
Hash join spilled to disk (N batches) | Increase work_mem |
Bitmap Heap Scan |
Medium selectivity — two-step index + heap | Usually fine |
Buffers: shared read=N (large N) |
Many disk reads — not in cache | Check shared_buffers, query frequency |
High Planning Time |
Too many joins or complex query structure | Simplify query or use CTEs |
1.
EXPLAIN (ANALYZE, BUFFERS) on the slow query2. Find the most expensive node (highest actual time)
3. Check: is it a Seq Scan that should be an Index Scan? → Add index
4. Check: are row estimates way off? → Run ANALYZE
5. Check: is there a disk sort? → Raise work_mem
6. Check: are you seeing high loops? → Fix join order with statistics
7. Re-run EXPLAIN ANALYZE and confirm improvement
8. Use explain.depesz.com or pganalyze for visual tree analysis