Skip to main content

Database Deep Dive

TL;DR (30-second summary)

Indexing: Trade disk space for query speed (B-tree for range queries, hash for exact match). Replication: Copy data for availability (master-slave pattern). Sharding: Split data across servers for scalability (horizontal partitioning). Partitioning: Organize data logically (range, hash, list).

Key insight: Databases are slow by default - these techniques make them fast and scalable.

Why This Matters

In interviews: Interviewers expect you to discuss "How do you handle 1 billion users?" Answer: indexing, replication, sharding.

At work: Most performance and scalability problems are database problems.

Core Concepts

1. Indexing

Problem: Without index, database scans entire table (O(n) query time).

Solution: Index = Data structure for fast lookups (O(log n) for B-tree, O(1) for hash).

B-Tree Index (Most Common)

Structure: Balanced tree, logarithmic lookup time.

                 [50]
/ \
[25] [75]
/ \ / \
[10,20] [30,40] [60,70] [80,90]

Use for:

  • ✅ Range queries (WHERE age BETWEEN 25 AND 35)
  • ✅ Sorting (ORDER BY created_at DESC)
  • ✅ Prefix matching (WHERE name LIKE 'John%')

Example:

-- Create index
CREATE INDEX idx_users_email ON users(email);

-- Query benefits
SELECT * FROM users WHERE email = 'john@example.com'; -- Fast (0.01s)
SELECT * FROM users WHERE age > 30 ORDER BY age; -- Fast (uses index)

Hash Index

Structure: Hash table, constant lookup time for exact matches only.

Use for:

  • ✅ Exact equality (WHERE id = 123)
  • ❌ Range queries (not supported)
  • ❌ Sorting (not supported)

Example (Redis):

key: "user:123"
hash: hash("user:123") → memory address → value

Composite Index

Definition: Index on multiple columns.

Order matters:

-- Creates index on (city, age)
CREATE INDEX idx_city_age ON users(city, age);

-- These queries use the index:
SELECT * FROM users WHERE city = 'NYC'; -- ✅ Uses index
SELECT * FROM users WHERE city = 'NYC' AND age > 30; -- ✅ Uses index

-- This query does NOT use the index:
SELECT * FROM users WHERE age > 30; -- ❌ Doesn't use index
-- Reason: Index is (city, age), not (age). Like searching phone book by first name only.

Rule: Leftmost prefix rule - composite index (A, B, C) can be used for queries on (A), (A, B), or (A, B, C), but not (B) or (C) alone.

Index Trade-offs

AspectProsCons
Read Performance100-1000x faster queries-
Write Performance-Slower inserts/updates (must update index)
Storage-Extra disk space (~10-30% of table size)
Maintenance-Needs occasional rebuilding (fragmentation)
Red Flag

Don't over-index! Each index slows down writes. Analyze query patterns first.

2. Replication

Purpose: Copy data to multiple servers for availability and read scaling.

Master-Slave Replication

How it works:

  1. All writes go to master
  2. Master writes to write-ahead log (WAL)
  3. Replicas read WAL and apply changes (asynchronous)
  4. Reads can go to master or replicas

Benefits:

  • ✅ High availability (if master fails, promote replica)
  • ✅ Read scaling (distribute reads across replicas)
  • ✅ Backup (replicas can be used for backups without locking master)

Replication lag:

Consistency options:

  • Asynchronous (default): Fast writes, but reads may be stale (50-200ms lag)
  • Synchronous: Slow writes (wait for replica), but always consistent
  • Semi-synchronous: Wait for at least 1 replica (compromise)

Master-Master Replication

Both nodes accept writes - more complex, risk of conflicts.

Use case: Multi-region writes (users in US write to US master, EU users write to EU master)

Conflict resolution: Last-write-wins, vector clocks, or application logic.

3. Sharding (Horizontal Partitioning)

Purpose: Split data across multiple databases to scale writes and storage.

Sharding Strategies

1. Range-Based Sharding

Shard 1: user_id 0 - 999,999
Shard 2: user_id 1,000,000 - 1,999,999
Shard 3: user_id 2,000,000 - 2,999,999

Pros: Easy to add shards, range queries work
Cons: Hotspots (if user IDs grow sequentially, newest shard gets all writes)

2. Hash-Based Sharding

shard = hash(user_id) % num_shards
Example: hash("user123") % 3 → Shard 2

Pros: Even distribution, no hotspots
Cons: Range queries don't work, resharding is hard (requires rehashing)

3. Geographic Sharding

Shard US: Users in USA
Shard EU: Users in Europe
Shard Asia: Users in Asia

Pros: Low latency (data close to users), compliance (GDPR)
Cons: Uneven distribution if user base is imbalanced

4. Entity-Based Sharding (Logical)

Shard by tenant_id (for SaaS apps)
All data for customer A on Shard 1
All data for customer B on Shard 2

Pros: Queries within tenant don't cross shards, easy to isolate tenants
Cons: Large tenants can overwhelm a single shard

Sharding Challenges

ChallengeSolution
Cross-shard queriesDenormalize data, use separate aggregation service
Transactions across shardsAvoid, or use 2PC/saga pattern (slow)
HotspotsConsistent hashing, split hot shards
ReshardingConsistent hashing, or double-write during migration
ComplexityStart with vertical scaling, shard only when necessary
Interview Tip

Always say: "Sharding is a last resort. First try: indexing, caching, read replicas, vertical scaling. Shard only when single database can't handle load."

4. Partitioning

Partitioning = Splitting table within a single database (logical split, not distributed).

Use case: Improve query performance, easier data management (e.g., archive old data).

Types:

  1. Range: partitions by year (2020, 2021, 2022...)
  2. Hash: partition = hash(key) % N
  3. List: partitions by region (US, EU, Asia)

Example (PostgreSQL):

-- Partition orders table by year
CREATE TABLE orders (
id SERIAL,
user_id INT,
created_at TIMESTAMP
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2022 PARTITION OF orders
FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

-- Query automatically routes to correct partition
SELECT * FROM orders WHERE created_at > '2023-06-01'; -- Only scans orders_2023

Benefits: Fast queries (scan only relevant partition), easy to drop old partitions.

Common Interview Questions

Q1: "How do you scale a database to handle 1 billion users?"

Answer (step-by-step):

  1. Indexing: Add indexes on frequently queried columns (email, user_id)
  2. Caching: Redis for hot data (90% of reads from cache)
  3. Read replicas: 5-10 replicas for read traffic (read-heavy workload)
  4. Vertical scaling: Upgrade master to larger instance (buy time)
  5. Sharding: When single database can't handle writes, shard by user_id (hash-based)

Q2: "Explain master-slave replication and its trade-offs."

Answer:

  • How: Master handles writes, replicas copy data asynchronously, replicas serve reads
  • Benefits: High availability, read scaling
  • Trade-offs:
    • Replication lag (50-200ms) - reads may be stale
    • Write bottleneck (all writes to master)
    • Failover complexity (promote replica to master)

Q3: "What's the difference between sharding and partitioning?"

Answer:

  • Partitioning: Logical split within single database (e.g., partition by year)
    • Still one database, easier queries
  • Sharding: Physical split across multiple databases (e.g., shard by user_id)
    • Distributed system, scales writes, but complex (cross-shard queries hard)

Q4: "You have a query that's slow. How do you optimize it?"

Answer (debugging steps):

  1. EXPLAIN: Analyze query plan, check if index is used
  2. Add index: If full table scan, add index on WHERE/ORDER BY columns
  3. Optimize query: Avoid SELECT *, reduce JOINs, add LIMIT
  4. Cache results: If query is frequent, cache with Redis (5-minute TTL)
  5. Denormalize: If many JOINs, consider duplicating data
  6. Read replica: Offload read traffic to replicas

Trade-offs

TechniqueProsConsUse When
IndexingFast queries (100x)Slow writes, storage overheadQuery optimization
ReplicationHigh availability, read scalingReplication lag, complexityRead-heavy workload
ShardingScales writes, unlimited growthCross-shard queries hard>1TB data, write-heavy
CachingUltra-fast readsStale data, cache invalidationHot data, read-heavy

Real-World Examples

Instagram (Sharding)

  • Sharded PostgreSQL: 1000s of shards by user_id (hash-based)
  • No cross-shard queries: Denormalize data to avoid joins
  • Result: Billions of photos, writes scale linearly

Netflix (Read Replicas)

  • Master-slave MySQL: 1 master, 50+ read replicas
  • Read-heavy: 99% of traffic is reads (viewing history, recommendations)
  • Result: Handles millions of requests per second

Discord (Cassandra Sharding)

  • Sharded by guild_id: Each Discord server is on a shard
  • No cross-shard queries: Messages are always scoped to a guild
  • Result: Billions of messages, infinite scalability

Quick Reference Card

Indexing:

  • B-tree: Range queries, sorting (most common)
  • Hash: Exact match only
  • Trade-off: Fast reads, slow writes

Replication:

  • Master-slave: Writes to master, reads from replicas
  • Replication lag: 50-200ms (asynchronous)
  • Use for: High availability, read scaling

Sharding:

  • Range: Easy to add shards, risk of hotspots
  • Hash: Even distribution, resharding is hard
  • Last resort: Only when single DB can't handle load

Optimization checklist:

  1. Add indexes
  2. Cache hot data (Redis)
  3. Add read replicas
  4. Vertical scaling (bigger instance)
  5. Shard (if >1TB or write-heavy)

Further Reading



Next: Caching Strategies - Redis patterns, CDN caching, cache invalidation.