1.1

Data Modeling

Entities, attributes, relationships & cardinality

Entities

An entity is a real-world thing your database needs to track. Think of it as a noun — a person, a product, an order.

💡
Rule of thumb: If you find yourself asking "does this thing have its own identity?" — it's probably an entity. A customer has their own identity. But a customer's address might just be an attribute (unless addresses are shared across customers, in which case it becomes its own entity).
Entity Identification
Name the real-world objects: User, Product, Order, Payment
Business Objects
Things the business cares about: invoices, customers, employees
Entity Lifecycle
Created → Updated → Soft-deleted. Track state changes over time
-- Example: Identifying entities for an e-commerce app -- Ask yourself: what are the "nouns" in our system? -- Entity 1: Users (customers who buy) CREATE TABLE users ( user_id BIGINT PRIMARY KEY, email VARCHAR(255), created_at TIMESTAMPTZ ); -- Entity 2: Products (things being sold) CREATE TABLE products ( product_id BIGINT PRIMARY KEY, name VARCHAR(500), price NUMERIC(10,2) ); -- Entity 3: Orders (a transaction between user and products) CREATE TABLE orders ( order_id BIGINT PRIMARY KEY, user_id BIGINT, -- links to users total NUMERIC(10,2), created_at TIMESTAMPTZ );

Attributes

Attributes are the properties or columns of an entity. There are three kinds:

Required Attributes
Must always have a value. Use NOT NULL. Example: email for a user
Optional Attributes
Can be null. Example: middle_name, phone — not everyone provides these
Derived Attributes
Computed from other columns. Example: age derived from date_of_birth
CREATE TABLE users ( user_id BIGINT PRIMARY KEY, -- Required attributes (NOT NULL) email VARCHAR(255) NOT NULL, first_name VARCHAR(100) NOT NULL, -- Optional attributes (allow NULL) middle_name VARCHAR(100), phone VARCHAR(20), -- Derived attribute (compute age from dob, don't store age!) date_of_birth DATE, -- age = EXTRACT(YEAR FROM AGE(date_of_birth)) ← compute on query created_at TIMESTAMPTZ DEFAULT NOW() ); -- Querying a derived attribute on the fly: SELECT first_name, EXTRACT(YEAR FROM AGE(date_of_birth)) AS age FROM users;
⚠️
Don't store derived attributes! If you store age, it gets stale. Compute it at query time from date_of_birth instead.

Relationships

Relationships describe how entities connect to each other. There are three types:

-- ONE-TO-ONE: each user has at most one profile -- Example: User ↔ UserProfile CREATE TABLE users ( user_id BIGINT PRIMARY KEY, email VARCHAR(255) NOT NULL ); CREATE TABLE user_profiles ( profile_id BIGINT PRIMARY KEY, user_id BIGINT UNIQUE REFERENCES users(user_id), -- UNIQUE ensures max one profile per user bio TEXT, avatar_url TEXT );
Use 1:1 to split large tables — keep hot columns in the main table, cold/optional columns in a separate table.
-- ONE-TO-MANY: one user has many orders -- The "many" side holds the foreign key CREATE TABLE users ( user_id BIGINT PRIMARY KEY, email VARCHAR(255) ); CREATE TABLE orders ( order_id BIGINT PRIMARY KEY, user_id BIGINT REFERENCES users(user_id), -- ^ This is the FK on the "many" side total NUMERIC(10,2), created_at TIMESTAMPTZ ); -- Find all orders for a user: SELECT * FROM orders WHERE user_id = 42;
Most common relationship. The foreign key always goes on the "many" side.
-- MANY-TO-MANY: products ↔ categories -- Needs a JUNCTION (bridge) table in between CREATE TABLE products ( product_id BIGINT PRIMARY KEY, name VARCHAR(500) ); CREATE TABLE categories ( category_id BIGINT PRIMARY KEY, name VARCHAR(200) ); -- Junction table: holds the relationship CREATE TABLE product_categories ( product_id BIGINT REFERENCES products(product_id), category_id BIGINT REFERENCES categories(category_id), PRIMARY KEY (product_id, category_id) -- composite PK ); -- Find all categories for a product: SELECT c.name FROM categories c JOIN product_categories pc ON c.category_id = pc.category_id WHERE pc.product_id = 10;

Cardinality

Cardinality defines how many instances of one entity relate to another. It also tells you whether a relationship is mandatory or optional.

Type Notation Example DB Implementation
Mandatory 1:1 ||—|| Employee must have exactly 1 contract NOT NULL + UNIQUE FK
Optional 1:1 |o—|| User may have a profile (optional) UNIQUE FK (nullable)
Mandatory 1:N ||—< Order must belong to a user NOT NULL FK
Optional 1:N |o—< Category may have 0 products Nullable FK
M:N >—< Products ↔ Tags Junction table
1.2

Keys

Primary, foreign, candidate, alternate & composite keys

Primary Keys

A primary key uniquely identifies every row. There are several strategies:

Natural Key
A real-world unique value like email, passport_number, ISBN
Surrogate Key
System-generated integer (SERIAL / BIGSERIAL). No business meaning
UUID Key
Universally unique 128-bit ID. Great for distributed systems
Identity Column
SQL standard auto-increment. Preferred over SERIAL in modern PG
-- Strategy 1: Natural Key (use email as PK) CREATE TABLE users_natural ( email VARCHAR(255) PRIMARY KEY, -- risky! email can change first_name TEXT ); -- Strategy 2: Surrogate Key (auto-increment integer) CREATE TABLE users_surrogate ( user_id BIGSERIAL PRIMARY KEY, -- auto-generates: 1, 2, 3... email TEXT UNIQUE NOT NULL ); -- Strategy 3: UUID Key (best for distributed / microservices) CREATE EXTENSION IF NOT EXISTS "pgcrypto"; CREATE TABLE users_uuid ( user_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), email TEXT UNIQUE NOT NULL ); -- Strategy 4: Identity Column (SQL standard, PG 10+) CREATE TABLE users_identity ( user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, email TEXT UNIQUE NOT NULL );
💡
When to use what? For most OLTP apps, use BIGINT GENERATED ALWAYS AS IDENTITY. For microservices or APIs where IDs leak to clients, use UUID. Avoid natural keys as PKs unless the value will truly never change.

Foreign Keys

A foreign key enforces referential integrity — it ensures a child row can't exist without a valid parent.

-- Referential Integrity + Cascade options CREATE TABLE users ( user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, email TEXT UNIQUE NOT NULL ); CREATE TABLE orders ( order_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, user_id BIGINT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE -- delete user → delete their orders ON UPDATE CASCADE, -- if user_id changes, update here too total NUMERIC(10,2) ); -- Other cascade options: -- ON DELETE SET NULL → sets user_id to NULL when user deleted -- ON DELETE RESTRICT → prevents deletion if orders exist -- ON DELETE NO ACTION → default, same as RESTRICT (deferred) -- Test: try inserting an order with a non-existent user_id INSERT INTO orders (user_id, total) VALUES (9999, 150.00); -- ERROR: insert or update on table "orders" violates foreign key constraint
⚠️
CASCADE DELETE is powerful but dangerous. Deleting a user will delete all their orders silently. Prefer soft deletes (adding is_deleted = true) over physical deletes in production.

Candidate, Alternate & Composite Keys

-- CANDIDATE KEY: any column(s) that COULD be the primary key -- In users: both user_id and email could uniquely identify a row -- → user_id is chosen as PK (surrogate) -- → email becomes the ALTERNATE KEY CREATE TABLE users ( user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- PK email TEXT UNIQUE NOT NULL, -- Alternate Key passport_num TEXT UNIQUE -- another Alternate Key ); -- COMPOSITE KEY: PK made of multiple columns -- Use when no single column is unique, but combination is CREATE TABLE order_items ( order_id BIGINT REFERENCES orders(order_id), product_id BIGINT REFERENCES products(product_id), quantity INT NOT NULL DEFAULT 1, price NUMERIC(10,2) NOT NULL, PRIMARY KEY (order_id, product_id) -- composite PK -- No separate order_item_id needed — the combo is unique );
Key TypeDefinitionExample
Candidate KeyAny minimal set of columns that uniquely identifies a rowemail, passport_num, user_id
Primary KeyThe chosen candidate key — only one per tableuser_id
Alternate KeyCandidate keys not chosen as PK (enforced via UNIQUE)email
Composite KeyPK or UNIQUE constraint spanning multiple columns(order_id, product_id)
1.3

Normalization

1NF → 2NF → 3NF → BCNF → 4NF → 5NF

Normalization is the process of organizing tables to eliminate redundancy and ensure data integrity. Each normal form adds stricter rules. Click each form below to explore it:

1NF
Atomic values
2NF
No partial deps
3NF
No transitive deps
BCNF
Strict 3NF
4NF
No multi-valued
5NF
Join dependency
1NF — First Normal Form: Atomic Values & No Repeating Groups

Every column must contain a single, indivisible value. No lists, no arrays as column values, no duplicate rows.

-- ❌ VIOLATES 1NF: storing multiple phone numbers in one column -- and repeating rows for the same order orders_bad ┌──────────┬─────────────┬──────────────────────────┐ │ order_id │ customer │ phone │ ├──────────┼─────────────┼──────────────────────────┤ │ 1 │ Alice │ 555-1234, 555-5678 │ ← two phones in one cell! │ 1 │ Alice │ 555-1234, 555-5678 │ ← duplicate row! └──────────┴─────────────┴──────────────────────────┘ -- ✅ 1NF COMPLIANT: atomic values, one phone per row CREATE TABLE customer_phones ( customer_id BIGINT, phone VARCHAR(20), -- one value per cell PRIMARY KEY (customer_id, phone) );
2NF — Second Normal Form: No Partial Dependencies

Applies when you have a composite primary key. Every non-key column must depend on the entire composite key, not just part of it.

-- ❌ VIOLATES 2NF: product_name depends only on product_id, -- NOT on the full composite key (order_id, product_id) order_items_bad ┌──────────┬────────────┬──────────────────┬──────────┐ │ order_id │ product_id │ product_name │ quantity │ │ PK │ PK │ ← partial dep! │ │ └──────────┴────────────┴──────────────────┴──────────┘ -- ✅ 2NF COMPLIANT: move product_name to its own table CREATE TABLE products ( product_id BIGINT PRIMARY KEY, product_name TEXT -- depends on product_id only → correct! ); CREATE TABLE order_items ( order_id BIGINT, product_id BIGINT REFERENCES products(product_id), quantity INT, -- depends on (order_id, product_id) → correct! PRIMARY KEY (order_id, product_id) );
3NF — Third Normal Form: No Transitive Dependencies

No non-key column should depend on another non-key column. All columns must depend directly on the primary key.

-- ❌ VIOLATES 3NF: city depends on zip_code, not on user_id -- zip_code → city is a TRANSITIVE dependency users_bad ┌─────────┬──────────┬──────────┬──────────┐ │ user_id │ email │ zip_code │ city │ │ PK │ │ │ ← trans! │ └─────────┴──────────┴──────────┴──────────┘ -- ✅ 3NF COMPLIANT: extract zip → city to its own table CREATE TABLE zip_codes ( zip_code VARCHAR(10) PRIMARY KEY, city TEXT, state TEXT ); CREATE TABLE users ( user_id BIGINT PRIMARY KEY, email TEXT UNIQUE NOT NULL, zip_code VARCHAR(10) REFERENCES zip_codes(zip_code) -- city is NOT stored here — get it by joining zip_codes );
BCNF — Boyce-Codd Normal Form: Strict 3NF

Every determinant (anything that determines another column's value) must be a candidate key. Catches edge cases that 3NF misses when there are multiple overlapping candidate keys.

-- Scenario: Students can be tutored by teachers. -- A student has ONE teacher per subject. -- Each teacher teaches ONLY ONE subject. -- ❌ VIOLATES BCNF: tutoring_bad ┌─────────┬─────────┬───────────┐ │ student │ subject │ teacher │ │ PK │ PK │ │ └─────────┴─────────┴───────────┘ -- Problem: teacher → subject (teacher determines subject) -- but teacher is NOT a candidate key -- ✅ BCNF COMPLIANT: decompose CREATE TABLE teacher_subjects ( teacher TEXT PRIMARY KEY, subject TEXT ); CREATE TABLE student_tutoring ( student TEXT, teacher TEXT REFERENCES teacher_subjects(teacher), PRIMARY KEY (student, teacher) );
💡
In practice, most well-designed databases in 3NF are also in BCNF. You'll mainly need BCNF for complex academic scenarios.
4NF — Fourth Normal Form: No Multi-Valued Dependencies

A table must not have more than one independent multi-valued fact about an entity.

-- Scenario: A person can have multiple hobbies AND multiple phones -- These are INDEPENDENT facts -- ❌ VIOLATES 4NF: mixing two independent multi-valued facts person_data_bad ┌────────┬──────────┬───────────────┐ │ person │ hobby │ phone │ ├────────┼──────────┼───────────────┤ │ Alice │ Cooking │ 555-1234 │ │ Alice │ Cooking │ 555-5678 │ ← duplicate hobby! │ Alice │ Gaming │ 555-1234 │ ← duplicate phone! │ Alice │ Gaming │ 555-5678 │ └────────┴──────────┴───────────────┘ -- ✅ 4NF COMPLIANT: separate tables for each multi-valued fact CREATE TABLE person_hobbies ( person TEXT, hobby TEXT, PRIMARY KEY (person, hobby) ); CREATE TABLE person_phones ( person TEXT, phone TEXT, PRIMARY KEY (person, phone) );
5NF — Fifth Normal Form: No Join Dependencies

A table must not be decomposable into smaller tables without losing information, unless the join dependencies are implied by the candidate keys. This is rare in practice.

-- Scenario: Agents sell products in cities. -- A tuple (agent, product, city) means: this agent sells this product in this city -- But not every combination is valid (not every agent sells every product in every city) -- ✅ 5NF: if the table CANNOT be losslessly decomposed, -- keep it as one table CREATE TABLE agent_product_city ( agent_id BIGINT, product_id BIGINT, city_id BIGINT, PRIMARY KEY (agent_id, product_id, city_id) ); -- 5NF is mostly theoretical. In production, stop at 3NF/BCNF. -- Only apply 4NF and 5NF for truly complex domains.
⚠️
Real-world advice: Design to 3NF in day-to-day work. Understand BCNF and 4NF conceptually. 5NF is academic — you won't encounter it in most production systems.
1.4

Denormalization

Controlled redundancy for read performance

💡
Denormalization is intentional. After normalizing your database, you sometimes deliberately introduce redundancy to speed up reads. The trade-off: faster reads, more complex writes and storage.

Read Optimization Patterns

There are three common denormalization patterns used in production:

-- AGGREGATION TABLE: store pre-computed totals -- Problem: COUNT(*) over millions of orders is slow -- Solution: maintain a counter table CREATE TABLE user_order_stats ( user_id BIGINT PRIMARY KEY REFERENCES users(user_id), total_orders INT DEFAULT 0, total_spent NUMERIC(12,2) DEFAULT 0, last_order_at TIMESTAMPTZ ); -- Update this whenever an order is placed (via trigger or app code) UPDATE user_order_stats SET total_orders = total_orders + 1, total_spent = total_spent + 150.00, last_order_at = NOW() WHERE user_id = 42; -- Now reading a user's stats is instant (no aggregation needed): SELECT total_orders, total_spent FROM user_order_stats WHERE user_id = 42;
-- SUMMARY TABLE: daily/hourly rollups for dashboards -- Instead of scanning millions of order rows for a report, -- pre-aggregate by day CREATE TABLE daily_sales_summary ( summary_date DATE PRIMARY KEY, total_revenue NUMERIC(15,2) DEFAULT 0, order_count INT DEFAULT 0, new_users INT DEFAULT 0 ); -- Populate with a nightly job: INSERT INTO daily_sales_summary (summary_date, total_revenue, order_count) SELECT DATE_TRUNC('day', created_at)::date AS summary_date, SUM(total) AS total_revenue, COUNT(*) AS order_count FROM orders WHERE created_at >= NOW() - INTERVAL '1 day' GROUP BY 1 ON CONFLICT (summary_date) DO UPDATE SET total_revenue = EXCLUDED.total_revenue, order_count = EXCLUDED.order_count;
-- MATERIALIZED VIEW: PostgreSQL's built-in summary caching -- Like a regular view, but stores the result physically CREATE MATERIALIZED VIEW product_sales_summary AS SELECT p.product_id, p.name, SUM(oi.quantity) AS units_sold, SUM(oi.quantity * oi.price) AS revenue FROM products p JOIN order_items oi ON p.product_id = oi.product_id GROUP BY p.product_id, p.name WITH DATA; -- populate immediately -- Query it like a table (super fast): SELECT * FROM product_sales_summary ORDER BY revenue DESC LIMIT 10; -- Refresh when data changes: REFRESH MATERIALIZED VIEW product_sales_summary; -- Or refresh concurrently (no lock, needs UNIQUE index): CREATE UNIQUE INDEX ON product_sales_summary (product_id); REFRESH MATERIALIZED VIEW CONCURRENTLY product_sales_summary;
Materialized Views are the cleanest denormalization tool in PostgreSQL. Use them for reports, dashboards, and heavy analytical queries. Refresh on a schedule or after bulk updates.

When to Normalize vs Denormalize

Normalized (3NF) Denormalized
Read speed Slower (JOINs needed) Faster (pre-computed)
Write speed Faster (one place) Slower (update multiple)
Storage Less More
Consistency Automatic Manual maintenance
Best for OLTP (transactions) Reporting, dashboards

Topic 1: Database Design — all 4 sub-sections covered.

Ready to move on to Topic 2: OLTP Database Design?

🎉

Topic 1 approved! Reply "Next" in the chat to get Topic 2: OLTP Database Design as a new HTML file.