🧠
Why this matters: Everything in PostgreSQL — VACUUM, table bloat, snapshot isolation, why updates are slow on wide rows — flows from one core idea: PostgreSQL never overwrites data in place. It always writes a new version. Once you understand that, everything else clicks.
8.1

Architecture

Postmaster, Backend Processes, and Shared Memory

The Postmaster — The Master Process

When you start PostgreSQL, the first process that runs is the Postmaster. It is the parent of all other PostgreSQL processes. Its job is to listen for incoming client connections and spawn a new backend process for each one.

PostgreSQL Process Architecture
  ┌─────────────────────────────────────────────────────────────┐
  │                    POSTMASTER (PID 1)                       │
  │         Listens on port 5432, spawns processes              │
  └──────┬───────────────────────────────────────┬─────────────┘
         │ forks on each connection               │ background workers
         ▼                                        ▼
  ┌──────────────┐  ┌──────────────┐    ┌──────────────────────┐
  │  Backend     │  │  Backend     │    │  Background Workers  │
  │  Process     │  │  Process     │    │  ─────────────────── │
  │  (Client A)  │  │  (Client B)  │    │  • Autovacuum        │
  │              │  │              │    │  • WAL Writer        │
  │  Runs SQL    │  │  Runs SQL    │    │  • Checkpointer      │
  │  for 1 conn  │  │  for 1 conn  │    │  • BGWriter          │
  └──────┬───────┘  └──────┬───────┘    │  • Stats Collector   │
         │                 │            └──────────────────────┘
         └────────┬────────┘
                  ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                   SHARED MEMORY                             │
  │  ─────────────────────────────────────────────────────────  │
  │  Shared Buffers (cache)  │  WAL Buffers  │  Lock Tables    │
  │  CLOG (commit status)    │  Proc Array   │  Free Space Map │
  └─────────────────────────────────────────────────────────────┘
                  │
                  ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                     DISK STORAGE                            │
  │   Data Files (base/)  │  WAL Files (pg_wal/)  │  Indexes   │
  └─────────────────────────────────────────────────────────────┘

Postmaster

The Postmaster is the supervisor. It does not run queries itself. When a client connects, Postmaster forks a new backend process dedicated to that connection. If the backend crashes, only that one connection is lost — not the whole database.

-- See the postmaster and all backend processes SELECT pid, backend_type, state, query FROM pg_stat_activity ORDER BY backend_type; -- Typical output: -- pid | backend_type | state | query -- 1234 | autovacuum worker | active | autovacuum: VACUUM public.orders -- 1235 | client backend | active | SELECT * FROM users -- 1236 | client backend | idle | -- 1237 | walwriter | active | -- 1238 | checkpointer | active |

Backend Process

Each client connection gets its own backend process. This process is responsible for parsing SQL, planning the query, executing it, and sending results back. It has its own private memory (work_mem, sort buffers) and also accesses the shared memory pool.

Private Memory
work_mem: for sorts and hash joins. temp_buffers: for temp tables. Each backend has its own copy — not shared.
Shared Memory
shared_buffers: the main page cache, shared by ALL backends. Lock tables, transaction status — all shared.
Lifecycle
Forked on connect, destroyed on disconnect. Connection poolers (PgBouncer) reuse backends to avoid fork overhead.
connection_limit
Default max_connections = 100. Each backend uses ~5-10MB RAM. High connection counts → use PgBouncer in production.

Shared Memory — The Heart of PostgreSQL

Shared memory is the area all backend processes can read and write simultaneously. The most important component is shared_buffers — the in-memory page cache. When a backend needs a data page, it first checks shared_buffers. If found (cache hit), no disk I/O. If not found (cache miss), it reads from disk into shared_buffers.

ComponentPurposeDefault / Notes
shared_buffersPage cache — holds hot data pages in memory128MB default; set to 25% of RAM in production
WAL BuffersHolds WAL records before flush to diskAuto-tuned; ~1/32 of shared_buffers
CLOGCommit Log — tracks which transactions are committed/abortedTiny; critical for MVCC visibility
Lock TableAll currently held locksShared; checked by every backend
Proc ArrayActive transactions and their XIDsUsed to compute snapshots for MVCC
-- Check current shared_buffers setting SHOW shared_buffers; -- Production recommendation: 25% of total RAM -- In postgresql.conf: -- shared_buffers = 4GB (on a 16GB RAM server) -- effective_cache_size = 12GB (OS page cache estimate) -- See buffer cache hit rate (should be > 99% in production) SELECT sum(heap_blks_hit) AS hits, sum(heap_blks_read) AS misses, round(sum(heap_blks_hit)::numeric / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2) AS hit_rate_pct FROM pg_statio_user_tables;
🎯
Production tip: A buffer cache hit rate below 95% means your hot data doesn't fit in shared_buffers. Increase it or add more RAM. Every cache miss means a disk read, which is 100–1000× slower than a memory read.
8.2

Storage — Pages, Blocks & Heap

How data is physically laid out on disk

Pages and Blocks — The Unit of Storage

PostgreSQL reads and writes data in fixed-size units called pages (also called blocks). The default page size is 8KB. Even if you update a single byte in a row, PostgreSQL reads and writes the entire 8KB page. This is the fundamental unit of I/O.

Page Size
8192 bytes (8KB) by default. Compile-time constant — cannot change without rebuilding PostgreSQL.
Table File
Each table is stored as one or more files under $PGDATA/base/<dboid>/<relfilenode>. Files are split at 1GB.
Segment Files
When a table exceeds 1GB, PostgreSQL creates a new segment file: relfilenode, relfilenode.1, relfilenode.2, etc.
Free Space Map
A companion file (relfilenode_fsm) tracks which pages have free space so inserts can fill gaps left by deleted rows.

What's Inside a Page?

Every 8KB page has the same structure. Understanding this explains why large rows are expensive and why VACUUM matters.

Anatomy of one 8KB PostgreSQL Page
┌──────────────────────────────────────────────────────┐  ← Page Start (byte 0)
│  PAGE HEADER  (24 bytes)                             │
│  pd_lsn: last WAL record that modified this page     │
│  pd_lower: end of item pointer array                 │
│  pd_upper: start of tuple data                       │
│  pd_special: start of special space (for indexes)    │
├──────────────────────────────────────────────────────┤
│  ITEM POINTER ARRAY (grows downward →)               │
│  [ptr1][ptr2][ptr3]...  each 4 bytes                 │
│  Points to each tuple's offset within this page      │
├──────────────────────────────────────────────────────┤
│                                                      │
│            FREE SPACE                                │
│    (between item array and tuple data)               │
│                                                      │
├──────────────────────────────────────────────────────┤
│  TUPLE DATA (grows upward ←)                         │
│  Each tuple (row) has:                               │
│    [HeapTupleHeader][actual column data]             │
│  HeapTupleHeader contains: xmin, xmax, ctid, infomask│
├──────────────────────────────────────────────────────┤
│  SPECIAL SPACE (for index pages only)                │
└──────────────────────────────────────────────────────┘  ← Page End (byte 8191)
💡
ctid — physical location: Every row has a system column called ctid which is its physical location: (page_number, tuple_number_within_page). For example (0,1) means page 0, first tuple. When a row is updated, its ctid changes because the new version is written to a different location.

Heap Storage — What Tables Actually Are

In PostgreSQL, a table is stored as a heap — rows are written in no particular order, just wherever there's space. This is different from clustered tables (like InnoDB) where rows are stored in primary key order. The heap structure is why PostgreSQL has CLUSTER as an explicit operation.

-- See the physical file location of a table SELECT pg_relation_filepath('orders'); -- Returns: base/16384/16385 (database OID / table file) -- See the physical location (ctid) of each row SELECT ctid, id, status FROM orders LIMIT 5; -- ctid | id | status -- (0,1) | 1 | shipped ← page 0, tuple 1 -- (0,2) | 2 | pending ← page 0, tuple 2 -- (0,3) | 3 | shipped ← page 0, tuple 3 -- (1,1) | 4 | pending ← page 1, tuple 1 (new page) -- See table size SELECT pg_size_pretty(pg_table_size('orders')) AS table_size, pg_size_pretty(pg_indexes_size('orders')) AS index_size, pg_size_pretty(pg_total_relation_size('orders')) AS total_size;

Why Row Width Matters

Because a page is 8KB and rows cannot span pages (for heap tables), a single row must fit within one page. Wide rows (many large columns) mean fewer rows per page, which means more pages to scan for full table scans and more I/O overall.

✅ Narrow rows — efficient
-- 100 bytes per row
-- 8192 / 100 ≈ 80 rows per page
-- 1M rows → ~12,500 pages
-- Sequential scan: fast
❌ Wide rows — expensive
-- 4000 bytes per row
-- 8192 / 4000 ≈ 2 rows per page
-- 1M rows → ~500,000 pages
-- Sequential scan: 40× more I/O!
⚠️
TOAST — handling large values: When a column value exceeds ~2KB, PostgreSQL automatically stores it in a separate TOAST (The Oversized-Attribute Storage Technique) table. The main row stores a pointer. This keeps rows on-page for most access patterns, while large blobs live separately. JSONB columns often use TOAST.
8.3

MVCC — Multi-Version Concurrency Control

⭐ The most important PostgreSQL internal topic

The Core Idea — Writers Don't Block Readers

In traditional databases, if someone is writing to a row, readers must wait. PostgreSQL avoids this entirely using MVCC: instead of modifying a row in place, it keeps multiple versions of every row. Each transaction sees a consistent snapshot of the database as it existed when the transaction started.

The result: Readers never block writers. Writers never block readers. This is why PostgreSQL can handle thousands of concurrent queries without a global read lock — and it's all implemented without any application-level locking.

Transaction IDs (XIDs) — The Clock of MVCC

Every transaction gets a unique Transaction ID (XID), a monotonically increasing 32-bit integer. XIDs are the "timestamps" MVCC uses to determine which row versions are visible to which transactions.

Past XIDs 100–499
committed or aborted
Committed XID 500
visible to future txns
Current Txn XID 501
← you are here
Future XIDs 502–∞
invisible to you
-- See the current transaction XID SELECT txid_current(); -- e.g. returns 12345 -- See xmin and xmax on actual rows (hidden system columns) SELECT xmin, xmax, id, name FROM users LIMIT 3; -- xmin | xmax | id | name -- 12340 | 0 | 1 | Alice ← created in txn 12340, not deleted -- 12341 | 0 | 2 | Bob ← created in txn 12341, not deleted -- 12342 | 12350| 3 | Carol ← created in 12342, deleted in 12350 -- XID wraparound: 32-bit means only ~4 billion XIDs -- PostgreSQL uses "epoch" tricks to handle this (VACUUM FREEZE) SELECT age(datfrozenxid), datname FROM pg_database; -- age = how many XIDs since last freeze. Keep < 1.5 billion!
⚠️
XID Wraparound — a real catastrophe: Since XID is 32-bit, it wraps around after ~4 billion transactions. If this happens, old rows would look "in the future" and become invisible — effectively corrupting the database. VACUUM FREEZE prevents this by marking old rows as frozen (safe forever). Autovacuum handles this automatically, but it must not be disabled on high-write databases.

Row Versions — xmin and xmax

Every row (tuple) in PostgreSQL's heap carries two hidden metadata fields in its header:

xmin — Born in
The XID of the transaction that inserted this row version. The row is visible to transactions that started after xmin committed.
xmax — Died in
The XID of the transaction that deleted or updated this row version. 0 means still alive. Set when the row is deleted or a new version is created.

When you UPDATE a row, PostgreSQL does NOT modify the existing tuple. Instead it:

  1. Sets xmax on the old tuple to the current XID (marks it as "dead to new readers")
  2. Inserts a brand new tuple with the updated data, setting xmin to the current XID

The old tuple sticks around until VACUUM cleans it up. This is where table bloat comes from.

Row Versions Before and After UPDATE
Before UPDATE (row salary=50000):
Page Header
xmin=100
xmax=0
id=1 salary=50000
(LIVE)
free space
After UPDATE salary=60000 (in XID 200):
Page Header
xmin=100
xmax=200
id=1 salary=50000
(DEAD to txns ≥200)
xmin=200
xmax=0
id=1 salary=60000
(NEW LIVE version)
↑ Old version stays on disk until VACUUM runs. This is bloat.

Visibility Rules — Which Version Do You See?

When your transaction reads a row, PostgreSQL applies these rules to decide which version to show you:

ConditionIs the tuple visible?
xmin is committed AND xmax is 0✅ YES — row is live and never deleted
xmin is committed AND xmax is committed AND xmax > my snapshot✅ YES — was deleted but by a txn after my snapshot
xmin is committed AND xmax is committed AND xmax ≤ my snapshot❌ NO — row was deleted before my snapshot started
xmin is NOT committed (still in progress or aborted)❌ NO — inserting txn didn't commit yet
xmin = my own XID (I inserted this row in this txn)✅ YES — I can see my own uncommitted changes
💡
CLOG (Commit Log): PostgreSQL determines whether an xmin/xmax is "committed" by consulting the CLOG (pg_xact directory). CLOG stores one 2-bit status per transaction: in-progress, committed, aborted, or sub-committed. Checking CLOG is extremely fast because it's cached in shared memory.

Snapshot Isolation — What You See When You BEGIN

When your transaction starts, PostgreSQL takes a snapshot of the database state. A snapshot captures:

Transaction Snapshot (taken at BEGIN)
xmin (horizon):XID 450← oldest active txn when I started
xmax (my XID):XID 501← my transaction ID
xip_list (in-flight):[452, 460, 475]← txns that were active, treat as invisible

A transaction with XID=N is visible to me if: N < xmin (definitely committed before I started) OR (xmin ≤ N < xmax AND N not in xip_list AND N is committed).

Interactive: MVCC Visibility Explorer

Step through a real MVCC scenario: two concurrent transactions, one updating, the other reading. See exactly what each transaction sees at each moment.

🔬 MVCC Demo — Concurrent Read & Write

Scenario: salary row starts at 50,000. T1 updates it to 60,000. What does T2 see?

Timeline
T1 (XID 200) — Writer
T2 (XID 201) — Reader
t=0
t=1
t=2
t=3
t=4
t=5
t=6
Click the step buttons above to walk through the scenario. Watch how T2 always sees a consistent snapshot, even while T1 is mid-update.
8.4

Dead Tuples, Bloat & Why VACUUM Exists

The inevitable consequence of MVCC — and the solution

Dead Tuples — The Price of MVCC

Because MVCC keeps old row versions instead of overwriting them, every UPDATE and DELETE leaves behind dead tuples — old row versions that no active transaction can see anymore. These dead tuples accumulate on disk and cause:

Table Bloat
Dead tuples take up space. A heavily-updated table can be 2–10× larger than the actual live data.
Index Bloat
Indexes also point to old tuple versions. Dead index entries slow down index scans and waste space.
Slower Scans
Sequential scans must visit pages containing dead tuples. More pages = more I/O = slower queries.
XID Wraparound Risk
Very old unfrozen tuples can cause XID wraparound if VACUUM doesn't run. Fatal if ignored.

Visualizing Bloat After Heavy Updates

Table pages after 1000 updates with no VACUUM
dead
dead
live
dead
live
dead
dead
live
dead
live
dead
dead
dead
live
dead
live
Only 6 live rows out of 16 slots. 62% bloat! A full table scan reads ALL pages including dead ones.
After VACUUM runs:
live
live
live
live
live
live
free
free
Dead tuples removed. Free space marked in FSM. New inserts can reuse it.
⚠️
Important: Regular VACUUM marks dead space as reusable but does NOT shrink the file. VACUUM FULL rewrites the entire table to a new file, actually returning space to the OS — but it requires an ACCESS EXCLUSIVE lock (blocks everything) and is slow. Use it sparingly.

Checking Bloat in Production

-- Check dead tuples per table (key bloat indicator) SELECT relname AS table_name, n_live_tup AS live_rows, n_dead_tup AS dead_rows, round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct, last_vacuum, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10; -- If dead_pct > 20%, VACUUM is overdue -- Manually trigger VACUUM ANALYZE VACUUM ANALYZE orders; -- reclaim dead space + update stats VACUUM(VERBOSE) orders; -- see detailed output -- Only when you really need to reclaim disk space -- (blocks all queries during execution!) VACUUM FULL orders;

Autovacuum — The Automatic Solution

PostgreSQL's autovacuum daemon runs in the background and automatically vacuums tables when the number of dead tuples exceeds a threshold. It's controlled by these key settings:

-- When autovacuum triggers (in postgresql.conf) -- Triggers when dead tuples > threshold + scale_factor * table_size autovacuum_vacuum_threshold = 50 -- min 50 dead tuples autovacuum_vacuum_scale_factor = 0.2 -- + 20% of table rows -- Example: orders table has 500,000 rows -- Autovacuum triggers when dead tuples > 50 + (0.2 × 500,000) = 100,050 -- For high-write tables, lower the scale factor: ALTER TABLE orders SET ( autovacuum_vacuum_scale_factor = 0.01, -- vacuum at 1% dead tuples autovacuum_vacuum_threshold = 100 ); -- Check if autovacuum is keeping up SELECT schemaname, relname, n_dead_tup, autovacuum_count, last_autovacuum FROM pg_stat_user_tables WHERE n_dead_tup > 10000 ORDER BY n_dead_tup DESC;
🚨
Never disable autovacuum. Some people disable it to "improve write performance." This is one of the most common ways to destroy a PostgreSQL database. Without autovacuum, dead tuples accumulate without bound, XID wraparound occurs, and eventually the database refuses all new transactions to protect itself.
8.5

Snapshot Isolation in Practice

What each isolation level actually sees under MVCC

How Isolation Levels Map to Snapshots

Different isolation levels (from Topic 6) are implemented in PostgreSQL by controlling when a new snapshot is taken:

Isolation LevelSnapshot TakenWhat You SeePostgreSQL Behavior
READ COMMITTED At each statement Committed data as of each query Default in PostgreSQL. Fresh snapshot per SQL statement.
REPEATABLE READ At first statement in txn Consistent view of the whole transaction Same snapshot for all queries in the txn. No phantom reads.
SERIALIZABLE At first statement + tracking As if transactions ran one at a time SSI (Serializable Snapshot Isolation) — detects conflicts, may abort.

READ COMMITTED — The Default

In READ COMMITTED, each SQL statement takes a fresh snapshot. This means if a concurrent transaction commits between two of your queries, your second query sees the new data. This is the most common isolation level.

-- READ COMMITTED (default) behavior BEGIN; -- isolation: READ COMMITTED (default) -- Query 1: snapshot taken here → sees salary = 50,000 SELECT salary FROM employees WHERE id = 1; -- 50000 -- Meanwhile, another txn commits: UPDATE salary = 60,000 -- Query 2: NEW snapshot taken → sees salary = 60,000! SELECT salary FROM employees WHERE id = 1; -- 60000 -- Two queries in the same transaction saw different values! -- This is the "non-repeatable read" anomaly — allowed in READ COMMITTED COMMIT;

REPEATABLE READ — Stable Snapshot

In REPEATABLE READ, the snapshot is taken once at the start of the transaction and reused for all queries. You see a stable, consistent view of the database — even if other transactions commit in the meantime.

-- REPEATABLE READ — snapshot is locked in at BEGIN BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; -- Snapshot taken here → salary = 50,000 SELECT salary FROM employees WHERE id = 1; -- 50000 -- Another txn commits: UPDATE salary = 60,000 -- Same snapshot reused → still sees 50,000! SELECT salary FROM employees WHERE id = 1; -- 50000 (consistent!) COMMIT; -- Use for: reports, aggregations that must see a consistent point-in-time

Long-Running Transactions — A Silent Killer

A long-running transaction holds its snapshot open. PostgreSQL cannot clean up (VACUUM) any dead tuples that might be visible to that snapshot. This means one idle long transaction can cause the entire table to bloat without any vacuum being able to help.

-- Find long-running transactions (huge production risk!) SELECT pid, now() - xact_start AS txn_age, now() - query_start AS query_age, state, query FROM pg_stat_activity WHERE xact_start IS NOT NULL AND now() - xact_start > INTERVAL '5 minutes' ORDER BY txn_age DESC; -- Find the oldest transaction horizon (anything older blocks VACUUM) SELECT pid, backend_xmin, now() - xact_start AS age FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY backend_xmin; -- Set idle_in_transaction_session_timeout to auto-kill stragglers -- In postgresql.conf: idle_in_transaction_session_timeout = '5min'
🚨
Production incident pattern: A developer opens a transaction for a migration, gets distracted, and leaves it open for hours. The table being migrated starts bloating rapidly. VACUUM runs but can't clean anything because the old transaction holds the snapshot. By morning, disk is full. Always set idle_in_transaction_session_timeout.

Topic 8 Summary

The mental model that connects everything in PostgreSQL

The One Mental Model to Rule Them All

🧠
PostgreSQL never modifies data in place. Every write creates a new version.

This one principle explains:
• Why reads don't block writes → (each txn sees its own snapshot)
• Why table bloat happens → (old versions accumulate)
• Why VACUUM exists → (to clean up dead versions)
• Why long transactions are dangerous → (they hold snapshots open, blocking VACUUM)
• Why XID wraparound is a risk → (32-bit counter of versions must be frozen periodically)

Key Concepts Summary

ConceptWhat it isWhy it matters
PostmasterParent process; spawns one backend per connectionFork model means crashes are isolated per connection
shared_buffersIn-memory page cache; shared by all backendsCache hits = no disk I/O; set to 25% RAM in prod
Page (8KB)Unit of storage; all reads/writes use whole pagesRow width affects rows-per-page → scan efficiency
HeapUnordered row storage (unlike clustered tables)Inserts go anywhere with free space; CLUSTER reorders
MVCCMultiple row versions; readers see consistent snapshotsNo read-write blocking; foundation of concurrency
xmin / xmaxHidden fields on every row — born/died in which XIDHow visibility is determined without locks
XIDTransaction ID — 32-bit counter per transactionThe "clock" of MVCC; must be periodically frozen
SnapshotPoint-in-time view of committed transactionsTaken at BEGIN or per-statement depending on isolation
Dead TuplesOld row versions no longer visible to any transactionCause bloat; must be cleaned by VACUUM
AutovacuumBackground daemon that runs VACUUM automaticallyEssential; never disable; tune scale_factor per table
CLOGCommit log; records commit/abort status per XIDChecked during visibility evaluation; lives in shared memory
TOASTOverflow storage for large column values (>~2KB)Keeps main heap rows compact; transparent to users
🎯
Up next → Topic 9: Indexing — Now that you understand how data is stored physically, you're ready to learn how indexes speed up access to that data. You'll learn BTree, Hash, GIN, GiST, BRIN, covering indexes, partial indexes, and when each one is the right tool.