Database Deep Dive
TL;DR (30-second summary)
- Indexing: Trade disk space for query speed (B-tree for range queries, hash for exact match)
- Replication: Copy data for availability (master-slave pattern)
- Sharding: Split data across servers for scalability (horizontal partitioning)
- Partitioning: Organize data logically within one database (range, hash, list)
Key insight: A single unindexed database becomes the bottleneck as data grows - these techniques trade complexity and storage for speed and scale.
Why This Matters
In interviews: Interviewers expect you to answer questions like "How do you handle 1 billion users?" The building blocks: indexing, replication, sharding.
At work: Most performance and scalability problems are database problems.
Core Concepts
1. Indexing
Problem: Without index, database scans entire table (O(n) query time).
Solution: Index = Data structure for fast lookups (O(log n) for B-tree, O(1) for hash).
B-Tree Index (Most Common)
Structure: Balanced tree, logarithmic lookup time.
            [50]
           /    \
       [25]      [75]
      /    \    /    \
[10,20][30,40][60,70][80,90]
Use for:
- ✅ Range queries (WHERE age BETWEEN 25 AND 35)
- ✅ Sorting (ORDER BY created_at DESC)
- ✅ Prefix matching (WHERE name LIKE 'John%')
Example:
-- Create index
CREATE INDEX idx_users_email ON users(email);
-- Query benefits
SELECT * FROM users WHERE email = 'john@example.com'; -- Fast (0.01s)
SELECT * FROM users WHERE age > 30 ORDER BY age; -- Fast (uses index)
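A sorted array searched with binary search captures the property that makes B-tree range queries cheap: an O(log n) seek to the start of the range, then a contiguous walk. A minimal Python sketch (a real B-tree is a disk-friendly n-ary tree, but the access pattern is the same):

```python
import bisect

# A sorted list of (age, row_id) pairs stands in for a B-tree index on `age`.
index = sorted([(34, 1), (28, 2), (41, 3), (25, 4), (30, 5), (52, 6)])

def range_query(index, low, high):
    """Return row_ids with low <= age <= high without scanning the whole index."""
    start = bisect.bisect_left(index, (low,))            # O(log n) seek to first match
    end = bisect.bisect_right(index, (high, float("inf")))
    return [row_id for _, row_id in index[start:end]]    # contiguous walk

print(range_query(index, 25, 35))  # row_ids for ages 25..35
```

The same seek-then-walk shape is why `ORDER BY` and prefix `LIKE` also benefit: matching entries are adjacent in the index.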
Hash Index
Structure: Hash table, constant lookup time for exact matches only.
Use for:
- ✅ Exact equality (WHERE id = 123)
- ❌ Range queries (not supported)
- ❌ Sorting (not supported)
Example (Redis):
key: "user:123"
hash: hash("user:123") → bucket → value
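A Python dict is itself a hash table, so it illustrates the trade-off directly: O(1) average exact-match lookups, but no ordering, so any "range query" degenerates into a full scan. A toy sketch:

```python
# A dict behaves like a hash index: constant-time exact lookups, no ordering.
users_by_email = {
    "john@example.com": {"id": 123, "age": 34},
    "jane@example.com": {"id": 456, "age": 29},
}

# Exact match: one hash computation, no scan.
user = users_by_email["john@example.com"]

# "Range query" over a hash index requires examining every entry:
over_30 = [u for u in users_by_email.values() if u["age"] > 30]
```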
Composite Index
Definition: Index on multiple columns.
Order matters:
-- Creates index on (city, age)
CREATE INDEX idx_city_age ON users(city, age);
-- These queries use the index:
SELECT * FROM users WHERE city = 'NYC'; -- ✅ Uses index
SELECT * FROM users WHERE city = 'NYC' AND age > 30; -- ✅ Uses index
-- This query does NOT use the index:
SELECT * FROM users WHERE age > 30; -- ❌ Doesn't use index
-- Reason: Index is (city, age), not (age). Like searching phone book by first name only.
Rule: Leftmost prefix rule - composite index (A, B, C) can be used for queries on (A), (A, B), or (A, B, C), but not (B) or (C) alone.
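The leftmost prefix rule falls out of how a composite index is sorted. A sketch with rows sorted by (city, age): a query on city can binary-search to a contiguous run, but ages are only sorted *within* each city, so a query on age alone must scan everything.

```python
import bisect

# Composite index on (city, age): rows sorted by city first, then age.
idx = sorted([("NYC", 25, 1), ("NYC", 40, 2), ("SF", 31, 3), ("SF", 22, 4)])

def by_city(city):
    """Leftmost prefix: seek to the city's contiguous run, walk it."""
    start = bisect.bisect_left(idx, (city,))
    return [row for row in idx[start:] if row[0] == city]

def by_age_only(min_age):
    """No leftmost prefix: ages aren't globally sorted, so full scan."""
    return [row for row in idx if row[1] > min_age]

print(by_city("NYC"))    # seeks directly to the NYC run
print(by_age_only(30))   # must examine every entry
```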
Index Trade-offs
| Aspect | Pros | Cons |
|---|---|---|
| Read Performance | 100-1000x faster queries | - |
| Write Performance | - | Slower inserts/updates (must update index) |
| Storage | - | Extra disk space (~10-30% of table size) |
| Maintenance | - | Needs occasional rebuilding (fragmentation) |
Don't over-index! Each index slows down writes. Analyze query patterns first.
2. Replication
Purpose: Copy data to multiple servers for availability and read scaling.
Master-Slave Replication
How it works:
- All writes go to master
- Master writes to write-ahead log (WAL)
- Replicas read WAL and apply changes (asynchronous)
- Reads can go to master or replicas
Benefits:
- ✅ High availability (if master fails, promote replica)
- ✅ Read scaling (distribute reads across replicas)
- ✅ Backup (replicas can be used for backups without locking master)
Replication lag: Replicas apply the WAL with a delay, so reads from a replica can return stale data.
Consistency options:
- Asynchronous (default): Fast writes, but reads may be stale (50-200ms lag)
- Synchronous: Slow writes (wait for replica), but always consistent
- Semi-synchronous: Wait for at least 1 replica (compromise)
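A toy simulation of asynchronous replication makes the stale-read problem concrete: the master appends to a log, a replica applies it later, and a read between the write and the catch-up sees old data. All class names here are illustrative, not any real database's API.

```python
# Toy async replication: master appends to a log; a replica applies it later.
class Master:
    def __init__(self):
        self.data, self.log = {}, []
    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))      # write-ahead log entry

class Replica:
    def __init__(self):
        self.data, self.applied = {}, 0
    def catch_up(self, log, up_to):
        # Apply log entries asynchronously, possibly behind the master.
        for key, value in log[self.applied:up_to]:
            self.data[key] = value
        self.applied = up_to

master, replica = Master(), Replica()
master.write("user:1", "Alice")
replica.catch_up(master.log, up_to=1)      # replica is caught up
master.write("user:1", "Alicia")           # new write, not yet replicated
stale = replica.data["user:1"]             # still "Alice": replication lag
```

Synchronous replication closes this window by blocking the second `write` until the replica has applied it, at the cost of write latency.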
Master-Master Replication
Both nodes accept writes - more complex, risk of conflicts.
Use case: Multi-region writes (users in US write to US master, EU users write to EU master)
Conflict resolution: Last-write-wins, vector clocks, or application logic.
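Last-write-wins, the simplest of the three, can be sketched in a few lines: each write carries a timestamp, and the newer one wins when two masters disagree.

```python
# Last-write-wins (LWW): each write is a (timestamp, value) pair; newer wins.
def lww_merge(a, b):
    """Resolve a conflict between two masters' writes for the same key."""
    return a if a[0] >= b[0] else b

us_write = (1700000005, "name=Alice")    # written on the US master
eu_write = (1700000009, "name=Alicia")   # written later on the EU master
winner = lww_merge(us_write, eu_write)   # the EU write is newer
```

Note the hidden cost: the losing write is silently discarded, which is why vector clocks or application-level merging are used when lost updates are unacceptable.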
3. Sharding (Horizontal Partitioning)
Purpose: Split data across multiple databases to scale writes and storage.
Sharding Strategies
1. Range-Based Sharding
Shard 1: user_id 0 - 999,999
Shard 2: user_id 1,000,000 - 1,999,999
Shard 3: user_id 2,000,000 - 2,999,999
Pros: Easy to add shards, range queries work
Cons: Hotspots (if user IDs grow sequentially, newest shard gets all writes)
2. Hash-Based Sharding
shard = hash(user_id) % num_shards
Example: hash("user123") % 3 → Shard 2
Pros: Even distribution, no hotspots
Cons: Range queries don't work, resharding is hard (requires rehashing)
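Both properties - even distribution and painful resharding - show up in a short sketch. A stable hash (md5 here) is used rather than Python's built-in `hash()`, which is randomized per process and unusable for routing:

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    # Stable hash so every process routes the same key to the same shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

users = [f"user{i}" for i in range(1000)]

# Even distribution across 3 shards:
counts = [sum(1 for u in users if shard_for(u, 3) == s) for s in range(3)]

# The resharding problem: going from 3 to 4 shards remaps most keys.
moved = sum(1 for u in users if shard_for(u, 3) != shard_for(u, 4))
print(counts, f"{moved} of {len(users)} keys move when adding a shard")
```

With `% num_shards` routing, roughly three quarters of keys change shards when going from 3 to 4 shards - which is why consistent hashing exists.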
3. Geographic Sharding
Shard US: Users in USA
Shard EU: Users in Europe
Shard Asia: Users in Asia
Pros: Low latency (data close to users), compliance (GDPR)
Cons: Uneven distribution if user base is imbalanced
4. Entity-Based Sharding (Logical)
Shard by tenant_id (for SaaS apps)
All data for customer A on Shard 1
All data for customer B on Shard 2
Pros: Queries within tenant don't cross shards, easy to isolate tenants
Cons: Large tenants can overwhelm a single shard
Sharding Challenges
| Challenge | Solution |
|---|---|
| Cross-shard queries | Denormalize data, use separate aggregation service |
| Transactions across shards | Avoid, or use 2PC/saga pattern (slow) |
| Hotspots | Consistent hashing, split hot shards |
| Resharding | Consistent hashing, or double-write during migration |
| Complexity | Start with vertical scaling, shard only when necessary |
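The "consistent hashing" fix in the table above can be sketched as a hash ring: shards and keys hash onto the same circle, each key goes to the next shard clockwise, and adding a shard only remaps the keys on its new arc instead of rehashing almost everything. A minimal version without virtual nodes (real implementations add many virtual nodes per shard to even out the arcs):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring (no virtual nodes, for clarity)."""
    def __init__(self, shards):
        self.ring = sorted((_hash(s), s) for s in shards)
    def shard_for(self, key: str) -> str:
        # Route to the first shard at or after the key's position, wrapping around.
        i = bisect.bisect(self.ring, (_hash(key),)) % len(self.ring)
        return self.ring[i][1]

before = HashRing(["shard-a", "shard-b", "shard-c"])
after = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
keys = [f"user{i}" for i in range(1000)]
# Only keys on shard-d's new arc move; every other key keeps its shard.
moved = sum(1 for k in keys if before.shard_for(k) != after.shard_for(k))
print(f"{moved} of {len(keys)} keys move")
```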
Always say: "Sharding is a last resort. First try: indexing, caching, read replicas, vertical scaling. Shard only when single database can't handle load."
4. Partitioning
Partitioning = Splitting table within a single database (logical split, not distributed).
Use case: Improve query performance, easier data management (e.g., archive old data).
Types:
- Range: partitions by year (2020, 2021, 2022...)
- Hash: partition = hash(key) % N
- List: partitions by region (US, EU, Asia)
Example (PostgreSQL):
-- Partition orders table by year
CREATE TABLE orders (
id SERIAL,
user_id INT,
created_at TIMESTAMP
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2022 PARTITION OF orders
FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
-- Query automatically routes to correct partition
SELECT * FROM orders WHERE created_at > '2023-06-01'; -- Only scans orders_2023
Benefits: Fast queries (scan only relevant partition), easy to drop old partitions.
Common Interview Questions
Q1: "How do you scale a database to handle 1 billion users?"
Answer (step-by-step):
- Indexing: Add indexes on frequently queried columns (email, user_id)
- Caching: Redis for hot data (90% of reads from cache)
- Read replicas: 5-10 replicas for read traffic (read-heavy workload)
- Vertical scaling: Upgrade master to larger instance (buy time)
- Sharding: When single database can't handle writes, shard by user_id (hash-based)
Q2: "Explain master-slave replication and its trade-offs."
Answer:
- How: Master handles writes, replicas copy data asynchronously, replicas serve reads
- Benefits: High availability, read scaling
- Trade-offs:
- Replication lag (50-200ms) - reads may be stale
- Write bottleneck (all writes to master)
- Failover complexity (promote replica to master)
Q3: "What's the difference between sharding and partitioning?"
Answer:
- Partitioning: Logical split within single database (e.g., partition by year)
- Still one database, easier queries
- Sharding: Physical split across multiple databases (e.g., shard by user_id)
- Distributed system, scales writes, but complex (cross-shard queries hard)
Q4: "You have a query that's slow. How do you optimize it?"
Answer (debugging steps):
- EXPLAIN: Analyze query plan, check if index is used
- Add index: If full table scan, add index on WHERE/ORDER BY columns
- Optimize query: Avoid SELECT *, reduce JOINs, add LIMIT
- Cache results: If query is frequent, cache with Redis (5-minute TTL)
- Denormalize: If many JOINs, consider duplicating data
- Read replica: Offload read traffic to replicas
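Step 1 of that checklist can be tried end-to-end with the sqlite3 module from Python's standard library (SQLite's variant is `EXPLAIN QUERY PLAN`; the exact syntax and output differ across databases, but the idea - check whether the plan scans or seeks - is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INT)")

def plan(sql):
    # Each plan row's last column is a human-readable description of the step.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'john@example.com'"
before = plan(query)                       # full table scan
conn.execute("CREATE INDEX idx_users_email ON users(email)")
after = plan(query)                        # index search

print(before)
print(after)
```

If the plan still shows a scan after adding the index, the remaining steps (rewrite the query, cache, denormalize) come into play.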
Trade-offs
| Technique | Pros | Cons | Use When |
|---|---|---|---|
| Indexing | Fast queries (100x) | Slow writes, storage overhead | Query optimization |
| Replication | High availability, read scaling | Replication lag, complexity | Read-heavy workload |
| Sharding | Scales writes, unlimited growth | Cross-shard queries hard | >1TB data, write-heavy |
| Caching | Ultra-fast reads | Stale data, cache invalidation | Hot data, read-heavy |
Real-World Examples
Instagram (Sharding)
- Sharded PostgreSQL: 1000s of shards by user_id (hash-based)
- No cross-shard queries: Denormalize data to avoid joins
- Result: Billions of photos, writes scale linearly
Netflix (Read Replicas)
- Master-slave MySQL: 1 master, 50+ read replicas
- Read-heavy: 99% of traffic is reads (viewing history, recommendations)
- Result: Handles millions of requests per second
Discord (Cassandra Sharding)
- Sharded by guild_id: Each Discord server is on a shard
- No cross-shard queries: Messages are always scoped to a guild
- Result: Billions of messages, scales horizontally by adding shards
Quick Reference Card
Indexing:
- B-tree: Range queries, sorting (most common)
- Hash: Exact match only
- Trade-off: Fast reads, slow writes
Replication:
- Master-slave: Writes to master, reads from replicas
- Replication lag: 50-200ms (asynchronous)
- Use for: High availability, read scaling
Sharding:
- Range: Easy to add shards, risk of hotspots
- Hash: Even distribution, resharding is hard
- Last resort: Only when single DB can't handle load
Optimization checklist:
- Add indexes
- Cache hot data (Redis)
- Add read replicas
- Vertical scaling (bigger instance)
- Shard (if >1TB or write-heavy)
Further Reading
- Use The Index, Luke! - Deep dive on indexing
- Sharding at Instagram
- Vitess - MySQL Sharding - How YouTube shards MySQL
- High Performance MySQL
Next: Caching Strategies - Redis patterns, CDN caching, cache invalidation.