PostgreSQL — Topic 8: Internals & MVCC

8.1

Architecture

Postmaster, Backend Processes, and Shared Memory

▾

The Postmaster — The Master Process

When you start PostgreSQL, the first process that runs is the Postmaster. It is the parent of all other PostgreSQL processes. Its job is to listen for incoming client connections and spawn a new backend process for each one.

PostgreSQL Process Architecture

  ┌─────────────────────────────────────────────────────────────┐
  │                    POSTMASTER (PID 1)                       │
  │         Listens on port 5432, spawns processes              │
  └──────┬───────────────────────────────────────┬─────────────┘
         │ forks on each connection               │ background workers
         ▼                                        ▼
  ┌──────────────┐  ┌──────────────┐    ┌──────────────────────┐
  │  Backend     │  │  Backend     │    │  Background Workers  │
  │  Process     │  │  Process     │    │  ─────────────────── │
  │  (Client A)  │  │  (Client B)  │    │  • Autovacuum        │
  │              │  │              │    │  • WAL Writer        │
  │  Runs SQL    │  │  Runs SQL    │    │  • Checkpointer      │
  │  for 1 conn  │  │  for 1 conn  │    │  • BGWriter          │
  └──────┬───────┘  └──────┬───────┘    │  • Stats Collector   │
         │                 │            └──────────────────────┘
         └────────┬────────┘
                  ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                   SHARED MEMORY                             │
  │  ─────────────────────────────────────────────────────────  │
  │  Shared Buffers (cache)  │  WAL Buffers  │  Lock Tables    │
  │  CLOG (commit status)    │  Proc Array   │  Free Space Map │
  └─────────────────────────────────────────────────────────────┘
                  │
                  ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                     DISK STORAGE                            │
  │   Data Files (base/)  │  WAL Files (pg_wal/)  │  Indexes   │
  └─────────────────────────────────────────────────────────────┘

Postmaster

The Postmaster is the supervisor. It does not run queries itself. When a client connects, Postmaster forks a new backend process dedicated to that connection. If the backend crashes, only that one connection is lost — not the whole database.

-- See the postmaster and all backend processes
SELECT pid, backend_type, state, query
FROM pg_stat_activity
ORDER BY backend_type;

-- Typical output:
-- pid   | backend_type       | state  | query
-- 1234  | autovacuum worker  | active | autovacuum: VACUUM public.orders
-- 1235  | client backend     | active | SELECT * FROM users
-- 1236  | client backend     | idle   |
-- 1237  | walwriter          | active |
-- 1238  | checkpointer       | active |
          

Backend Process

Each client connection gets its own backend process. This process is responsible for parsing SQL, planning the query, executing it, and sending results back. It has its own private memory (work_mem, sort buffers) and also accesses the shared memory pool.

Private Memory

work_mem: for sorts and hash joins. temp_buffers: for temp tables. Each backend has its own copy — not shared.

Shared Memory

shared_buffers: the main page cache, shared by ALL backends. Lock tables, transaction status — all shared.

Lifecycle

Forked on connect, destroyed on disconnect. Connection poolers (PgBouncer) reuse backends to avoid fork overhead.

connection_limit

Default max_connections = 100. Each backend uses ~5-10MB RAM. High connection counts → use PgBouncer in production.

Shared Memory — The Heart of PostgreSQL

Shared memory is the area all backend processes can read and write simultaneously. The most important component is shared_buffers — the in-memory page cache. When a backend needs a data page, it first checks shared_buffers. If found (cache hit), no disk I/O. If not found (cache miss), it reads from disk into shared_buffers.

Component	Purpose	Default / Notes
shared_buffers	Page cache — holds hot data pages in memory	128MB default; set to 25% of RAM in production
WAL Buffers	Holds WAL records before flush to disk	Auto-tuned; ~1/32 of shared_buffers
CLOG	Commit Log — tracks which transactions are committed/aborted	Tiny; critical for MVCC visibility
Lock Table	All currently held locks	Shared; checked by every backend
Proc Array	Active transactions and their XIDs	Used to compute snapshots for MVCC

-- Check current shared_buffers setting
SHOW shared_buffers;

-- Production recommendation: 25% of total RAM
-- In postgresql.conf:
-- shared_buffers = 4GB      (on a 16GB RAM server)
-- effective_cache_size = 12GB  (OS page cache estimate)

-- See buffer cache hit rate (should be > 99% in production)
SELECT
  sum(heap_blks_hit) AS hits,
  sum(heap_blks_read) AS misses,
  round(sum(heap_blks_hit)::numeric /
    nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2) AS hit_rate_pct
FROM pg_statio_user_tables;
          

🎯

Production tip: A buffer cache hit rate below 95% means your hot data doesn't fit in shared_buffers. Increase it or add more RAM. Every cache miss means a disk read, which is 100–1000× slower than a memory read.

8.2

Storage — Pages, Blocks & Heap

How data is physically laid out on disk

▾

Pages and Blocks — The Unit of Storage

PostgreSQL reads and writes data in fixed-size units called pages (also called blocks). The default page size is 8KB. Even if you update a single byte in a row, PostgreSQL reads and writes the entire 8KB page. This is the fundamental unit of I/O.

Page Size

8192 bytes (8KB) by default. Compile-time constant — cannot change without rebuilding PostgreSQL.

Table File

Each table is stored as one or more files under $PGDATA/base/<dboid>/<relfilenode>. Files are split at 1GB.

Segment Files

When a table exceeds 1GB, PostgreSQL creates a new segment file: relfilenode, relfilenode.1, relfilenode.2, etc.

Free Space Map

A companion file (relfilenode_fsm) tracks which pages have free space so inserts can fill gaps left by deleted rows.

What's Inside a Page?

Every 8KB page has the same structure. Understanding this explains why large rows are expensive and why VACUUM matters.

Anatomy of one 8KB PostgreSQL Page

┌──────────────────────────────────────────────────────┐  ← Page Start (byte 0)
│  PAGE HEADER  (24 bytes)                             │
│  pd_lsn: last WAL record that modified this page     │
│  pd_lower: end of item pointer array                 │
│  pd_upper: start of tuple data                       │
│  pd_special: start of special space (for indexes)    │
├──────────────────────────────────────────────────────┤
│  ITEM POINTER ARRAY (grows downward →)               │
│  [ptr1][ptr2][ptr3]...  each 4 bytes                 │
│  Points to each tuple's offset within this page      │
├──────────────────────────────────────────────────────┤
│                                                      │
│            FREE SPACE                                │
│    (between item array and tuple data)               │
│                                                      │
├──────────────────────────────────────────────────────┤
│  TUPLE DATA (grows upward ←)                         │
│  Each tuple (row) has:                               │
│    [HeapTupleHeader][actual column data]             │
│  HeapTupleHeader contains: xmin, xmax, ctid, infomask│
├──────────────────────────────────────────────────────┤
│  SPECIAL SPACE (for index pages only)                │
└──────────────────────────────────────────────────────┘  ← Page End (byte 8191)

💡

ctid — physical location: Every row has a system column called ctid which is its physical location: (page_number, tuple_number_within_page). For example (0,1) means page 0, first tuple. When a row is updated, its ctid changes because the new version is written to a different location.

Heap Storage — What Tables Actually Are

In PostgreSQL, a table is stored as a heap — rows are written in no particular order, just wherever there's space. This is different from clustered tables (like InnoDB) where rows are stored in primary key order. The heap structure is why PostgreSQL has CLUSTER as an explicit operation.

-- See the physical file location of a table
SELECT pg_relation_filepath('orders');
-- Returns: base/16384/16385  (database OID / table file)

-- See the physical location (ctid) of each row
SELECT ctid, id, status FROM orders LIMIT 5;
-- ctid   | id | status
-- (0,1)  |  1 | shipped   ← page 0, tuple 1
-- (0,2)  |  2 | pending   ← page 0, tuple 2
-- (0,3)  |  3 | shipped   ← page 0, tuple 3
-- (1,1)  |  4 | pending   ← page 1, tuple 1 (new page)

-- See table size
SELECT
  pg_size_pretty(pg_table_size('orders'))       AS table_size,
  pg_size_pretty(pg_indexes_size('orders'))     AS index_size,
  pg_size_pretty(pg_total_relation_size('orders')) AS total_size;
          

Why Row Width Matters

Because a page is 8KB and rows cannot span pages (for heap tables), a single row must fit within one page. Wide rows (many large columns) mean fewer rows per page, which means more pages to scan for full table scans and more I/O overall.

✅ Narrow rows — efficient

-- 100 bytes per row
-- 8192 / 100 ≈ 80 rows per page
-- 1M rows → ~12,500 pages
-- Sequential scan: fast

❌ Wide rows — expensive

-- 4000 bytes per row
-- 8192 / 4000 ≈ 2 rows per page
-- 1M rows → ~500,000 pages
-- Sequential scan: 40× more I/O!

⚠️

TOAST — handling large values: When a column value exceeds ~2KB, PostgreSQL automatically stores it in a separate TOAST (The Oversized-Attribute Storage Technique) table. The main row stores a pointer. This keeps rows on-page for most access patterns, while large blobs live separately. JSONB columns often use TOAST.

8.3

MVCC — Multi-Version Concurrency Control

⭐ The most important PostgreSQL internal topic

▾

The Core Idea — Writers Don't Block Readers

In traditional databases, if someone is writing to a row, readers must wait. PostgreSQL avoids this entirely using MVCC: instead of modifying a row in place, it keeps multiple versions of every row. Each transaction sees a consistent snapshot of the database as it existed when the transaction started.

⚡

The result: Readers never block writers. Writers never block readers. This is why PostgreSQL can handle thousands of concurrent queries without a global read lock — and it's all implemented without any application-level locking.

Transaction IDs (XIDs) — The Clock of MVCC

Every transaction gets a unique Transaction ID (XID), a monotonically increasing 32-bit integer. XIDs are the "timestamps" MVCC uses to determine which row versions are visible to which transactions.

Past XIDs 100–499
committed or aborted

Committed XID 500
visible to future txns

Current Txn XID 501
← you are here

Future XIDs 502–∞
invisible to you

-- See the current transaction XID
SELECT txid_current();          -- e.g. returns 12345

-- See xmin and xmax on actual rows (hidden system columns)
SELECT xmin, xmax, id, name FROM users LIMIT 3;
-- xmin  | xmax | id | name
-- 12340 |    0 |  1 | Alice    ← created in txn 12340, not deleted
-- 12341 |    0 |  2 | Bob      ← created in txn 12341, not deleted
-- 12342 | 12350|  3 | Carol    ← created in 12342, deleted in 12350

-- XID wraparound: 32-bit means only ~4 billion XIDs
-- PostgreSQL uses "epoch" tricks to handle this (VACUUM FREEZE)
SELECT age(datfrozenxid), datname FROM pg_database;
-- age = how many XIDs since last freeze. Keep < 1.5 billion!
          

⚠️

XID Wraparound — a real catastrophe: Since XID is 32-bit, it wraps around after ~4 billion transactions. If this happens, old rows would look "in the future" and become invisible — effectively corrupting the database. VACUUM FREEZE prevents this by marking old rows as frozen (safe forever). Autovacuum handles this automatically, but it must not be disabled on high-write databases.

Row Versions — xmin and xmax

Every row (tuple) in PostgreSQL's heap carries two hidden metadata fields in its header:

xmin — Born in

The XID of the transaction that inserted this row version. The row is visible to transactions that started after xmin committed.

xmax — Died in

The XID of the transaction that deleted or updated this row version. 0 means still alive. Set when the row is deleted or a new version is created.

When you UPDATE a row, PostgreSQL does NOT modify the existing tuple. Instead it:

Sets xmax on the old tuple to the current XID (marks it as "dead to new readers")
Inserts a brand new tuple with the updated data, setting xmin to the current XID

The old tuple sticks around until VACUUM cleans it up. This is where table bloat comes from.

Row Versions Before and After UPDATE

Before UPDATE (row salary=50000):

xmin=100
xmax=0
id=1 salary=50000
(LIVE)

free space

After UPDATE salary=60000 (in XID 200):

xmin=100
xmax=200
id=1 salary=50000
(DEAD to txns ≥200)

xmin=200
xmax=0
id=1 salary=60000
(NEW LIVE version)

↑ Old version stays on disk until VACUUM runs. This is bloat.

Visibility Rules — Which Version Do You See?

When your transaction reads a row, PostgreSQL applies these rules to decide which version to show you:

Condition	Is the tuple visible?
xmin is committed AND xmax is 0	✅ YES — row is live and never deleted
xmin is committed AND xmax is committed AND xmax > my snapshot	✅ YES — was deleted but by a txn after my snapshot
xmin is committed AND xmax is committed AND xmax ≤ my snapshot	❌ NO — row was deleted before my snapshot started
xmin is NOT committed (still in progress or aborted)	❌ NO — inserting txn didn't commit yet
xmin = my own XID (I inserted this row in this txn)	✅ YES — I can see my own uncommitted changes

💡

CLOG (Commit Log): PostgreSQL determines whether an xmin/xmax is "committed" by consulting the CLOG (pg_xact directory). CLOG stores one 2-bit status per transaction: in-progress, committed, aborted, or sub-committed. Checking CLOG is extremely fast because it's cached in shared memory.

Snapshot Isolation — What You See When You BEGIN

When your transaction starts, PostgreSQL takes a snapshot of the database state. A snapshot captures:

Transaction Snapshot (taken at BEGIN)

xmin (horizon):XID 450← oldest active txn when I started

xmax (my XID):XID 501← my transaction ID

xip_list (in-flight):[452, 460, 475]← txns that were active, treat as invisible

A transaction with XID=N is visible to me if: N < xmin (definitely committed before I started) OR (xmin ≤ N < xmax AND N not in xip_list AND N is committed).

Interactive: MVCC Visibility Explorer

Step through a real MVCC scenario: two concurrent transactions, one updating, the other reading. See exactly what each transaction sees at each moment.

🔬 MVCC Demo — Concurrent Read & Write

Scenario: salary row starts at 50,000. T1 updates it to 60,000. What does T2 see?

Timeline

T1 (XID 200) — Writer

T2 (XID 201) — Reader

t=0

—

t=1

—

t=2

—

t=3

—

t=4

—

t=5

—

t=6

—

Click the step buttons above to walk through the scenario. Watch how T2 always sees a consistent snapshot, even while T1 is mid-update.

8.4

Dead Tuples, Bloat & Why VACUUM Exists

The inevitable consequence of MVCC — and the solution

▾

Dead Tuples — The Price of MVCC

Because MVCC keeps old row versions instead of overwriting them, every UPDATE and DELETE leaves behind dead tuples — old row versions that no active transaction can see anymore. These dead tuples accumulate on disk and cause:

Table Bloat

Dead tuples take up space. A heavily-updated table can be 2–10× larger than the actual live data.

Index Bloat

Indexes also point to old tuple versions. Dead index entries slow down index scans and waste space.

Slower Scans

Sequential scans must visit pages containing dead tuples. More pages = more I/O = slower queries.

XID Wraparound Risk

Very old unfrozen tuples can cause XID wraparound if VACUUM doesn't run. Fatal if ignored.

Visualizing Bloat After Heavy Updates

Table pages after 1000 updates with no VACUUM

dead

live

dead

live

dead

live

dead

live

dead

live

dead

live

Only 6 live rows out of 16 slots. 62% bloat! A full table scan reads ALL pages including dead ones.

After VACUUM runs:

live

free

Dead tuples removed. Free space marked in FSM. New inserts can reuse it.

⚠️

Important: Regular VACUUM marks dead space as reusable but does NOT shrink the file. VACUUM FULL rewrites the entire table to a new file, actually returning space to the OS — but it requires an ACCESS EXCLUSIVE lock (blocks everything) and is slow. Use it sparingly.

Checking Bloat in Production

-- Check dead tuples per table (key bloat indicator)
SELECT
  relname                                             AS table_name,
  n_live_tup                                          AS live_rows,
  n_dead_tup                                          AS dead_rows,
  round(n_dead_tup::numeric /
    nullif(n_live_tup + n_dead_tup, 0) * 100, 1)  AS dead_pct,
  last_vacuum,
  last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- If dead_pct > 20%, VACUUM is overdue

-- Manually trigger VACUUM ANALYZE
VACUUM ANALYZE orders;           -- reclaim dead space + update stats
VACUUM(VERBOSE) orders;          -- see detailed output

-- Only when you really need to reclaim disk space
-- (blocks all queries during execution!)
VACUUM FULL orders;
          

Autovacuum — The Automatic Solution

PostgreSQL's autovacuum daemon runs in the background and automatically vacuums tables when the number of dead tuples exceeds a threshold. It's controlled by these key settings:

-- When autovacuum triggers (in postgresql.conf)
-- Triggers when dead tuples > threshold + scale_factor * table_size
autovacuum_vacuum_threshold  = 50        -- min 50 dead tuples
autovacuum_vacuum_scale_factor = 0.2    -- + 20% of table rows

-- Example: orders table has 500,000 rows
-- Autovacuum triggers when dead tuples > 50 + (0.2 × 500,000) = 100,050

-- For high-write tables, lower the scale factor:
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,  -- vacuum at 1% dead tuples
  autovacuum_vacuum_threshold = 100
);

-- Check if autovacuum is keeping up
SELECT
  schemaname,
  relname,
  n_dead_tup,
  autovacuum_count,
  last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
          

🚨

Never disable autovacuum. Some people disable it to "improve write performance." This is one of the most common ways to destroy a PostgreSQL database. Without autovacuum, dead tuples accumulate without bound, XID wraparound occurs, and eventually the database refuses all new transactions to protect itself.

8.5

Snapshot Isolation in Practice

What each isolation level actually sees under MVCC

▾

How Isolation Levels Map to Snapshots

Different isolation levels (from Topic 6) are implemented in PostgreSQL by controlling when a new snapshot is taken:

Isolation Level	Snapshot Taken	What You See	PostgreSQL Behavior
READ COMMITTED	At each statement	Committed data as of each query	Default in PostgreSQL. Fresh snapshot per SQL statement.
REPEATABLE READ	At first statement in txn	Consistent view of the whole transaction	Same snapshot for all queries in the txn. No phantom reads.
SERIALIZABLE	At first statement + tracking	As if transactions ran one at a time	SSI (Serializable Snapshot Isolation) — detects conflicts, may abort.

READ COMMITTED — The Default

In READ COMMITTED, each SQL statement takes a fresh snapshot. This means if a concurrent transaction commits between two of your queries, your second query sees the new data. This is the most common isolation level.

-- READ COMMITTED (default) behavior
BEGIN; -- isolation: READ COMMITTED (default)

-- Query 1: snapshot taken here → sees salary = 50,000
SELECT salary FROM employees WHERE id = 1;  -- 50000

-- Meanwhile, another txn commits: UPDATE salary = 60,000

-- Query 2: NEW snapshot taken → sees salary = 60,000!
SELECT salary FROM employees WHERE id = 1;  -- 60000

-- Two queries in the same transaction saw different values!
-- This is the "non-repeatable read" anomaly — allowed in READ COMMITTED
COMMIT;
          

REPEATABLE READ — Stable Snapshot

In REPEATABLE READ, the snapshot is taken once at the start of the transaction and reused for all queries. You see a stable, consistent view of the database — even if other transactions commit in the meantime.

-- REPEATABLE READ — snapshot is locked in at BEGIN
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Snapshot taken here → salary = 50,000
SELECT salary FROM employees WHERE id = 1;  -- 50000

-- Another txn commits: UPDATE salary = 60,000

-- Same snapshot reused → still sees 50,000!
SELECT salary FROM employees WHERE id = 1;  -- 50000 (consistent!)

COMMIT;
-- Use for: reports, aggregations that must see a consistent point-in-time
          

Long-Running Transactions — A Silent Killer

A long-running transaction holds its snapshot open. PostgreSQL cannot clean up (VACUUM) any dead tuples that might be visible to that snapshot. This means one idle long transaction can cause the entire table to bloat without any vacuum being able to help.

-- Find long-running transactions (huge production risk!)
SELECT
  pid,
  now() - xact_start           AS txn_age,
  now() - query_start          AS query_age,
  state,
  query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > INTERVAL '5 minutes'
ORDER BY txn_age DESC;

-- Find the oldest transaction horizon (anything older blocks VACUUM)
SELECT
  pid,
  backend_xmin,
  now() - xact_start AS age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY backend_xmin;

-- Set idle_in_transaction_session_timeout to auto-kill stragglers
-- In postgresql.conf:
idle_in_transaction_session_timeout = '5min'
          

🚨

Production incident pattern: A developer opens a transaction for a migration, gets distracted, and leaves it open for hours. The table being migrated starts bloating rapidly. VACUUM runs but can't clean anything because the old transaction holds the snapshot. By morning, disk is full. Always set idle_in_transaction_session_timeout.

✓

Topic 8 Summary

The mental model that connects everything in PostgreSQL

▾

The One Mental Model to Rule Them All

🧠

PostgreSQL never modifies data in place. Every write creates a new version.

This one principle explains:
• Why reads don't block writes → (each txn sees its own snapshot)
• Why table bloat happens → (old versions accumulate)
• Why VACUUM exists → (to clean up dead versions)
• Why long transactions are dangerous → (they hold snapshots open, blocking VACUUM)
• Why XID wraparound is a risk → (32-bit counter of versions must be frozen periodically)

Key Concepts Summary

Concept	What it is	Why it matters
Postmaster	Parent process; spawns one backend per connection	Fork model means crashes are isolated per connection
shared_buffers	In-memory page cache; shared by all backends	Cache hits = no disk I/O; set to 25% RAM in prod
Page (8KB)	Unit of storage; all reads/writes use whole pages	Row width affects rows-per-page → scan efficiency
Heap	Unordered row storage (unlike clustered tables)	Inserts go anywhere with free space; CLUSTER reorders
MVCC	Multiple row versions; readers see consistent snapshots	No read-write blocking; foundation of concurrency
xmin / xmax	Hidden fields on every row — born/died in which XID	How visibility is determined without locks
XID	Transaction ID — 32-bit counter per transaction	The "clock" of MVCC; must be periodically frozen
Snapshot	Point-in-time view of committed transactions	Taken at BEGIN or per-statement depending on isolation
Dead Tuples	Old row versions no longer visible to any transaction	Cause bloat; must be cleaned by VACUUM
Autovacuum	Background daemon that runs VACUUM automatically	Essential; never disable; tune scale_factor per table
CLOG	Commit log; records commit/abort status per XID	Checked during visibility evaluation; lives in shared memory
TOAST	Overflow storage for large column values (>~2KB)	Keeps main heap rows compact; transparent to users

🎯

Up next → Topic 9: Indexing — Now that you understand how data is stored physically, you're ready to learn how indexes speed up access to that data. You'll learn BTree, Hash, GIN, GiST, BRIN, covering indexes, partial indexes, and when each one is the right tool.

PostgreSQL Internals & MVCC

Architecture

The Postmaster — The Master Process

Postmaster

Backend Process

Shared Memory — The Heart of PostgreSQL

Storage — Pages, Blocks & Heap

Pages and Blocks — The Unit of Storage

What's Inside a Page?

Heap Storage — What Tables Actually Are

Why Row Width Matters

MVCC — Multi-Version Concurrency Control

The Core Idea — Writers Don't Block Readers

Transaction IDs (XIDs) — The Clock of MVCC

Row Versions — xmin and xmax

Visibility Rules — Which Version Do You See?

Snapshot Isolation — What You See When You BEGIN

Interactive: MVCC Visibility Explorer

🔬 MVCC Demo — Concurrent Read & Write

Dead Tuples, Bloat & Why VACUUM Exists

Dead Tuples — The Price of MVCC

Visualizing Bloat After Heavy Updates

Checking Bloat in Production

Autovacuum — The Automatic Solution

Snapshot Isolation in Practice

How Isolation Levels Map to Snapshots

READ COMMITTED — The Default

REPEATABLE READ — Stable Snapshot

Long-Running Transactions — A Silent Killer

Topic 8 Summary

The One Mental Model to Rule Them All

Key Concepts Summary