Database Deep Dive
TL;DR (30-second summary)
- Indexing: Trade disk space for query speed (B-tree for range queries, hash for exact match)
- Replication: Copy data for availability (master-slave pattern)
- Sharding: Split data across servers for scalability (horizontal partitioning)
- Partitioning: Organize data logically within one database (range, hash, list)
Key insight: A single unindexed database becomes the bottleneck as data grows - these techniques trade complexity and storage for speed and scale.
Why This Matters
In interviews: Interviewers expect you to answer questions like "How do you handle 1 billion users?" The building blocks: indexing, replication, sharding.
At work: Most performance and scalability problems are database problems.
Core Concepts
1. Indexing
Problem: Without index, database scans entire table (O(n) query time).
Solution: Index = Data structure for fast lookups (O(log n) for B-tree, O(1) for hash).
B-Tree Index (Most Common)
Structure: Balanced tree, logarithmic lookup time.
            [50]
           /    \
       [25]      [75]
      /    \    /    \
[10,20][30,40][60,70][80,90]
Use for:
- ✅ Range queries (WHERE age BETWEEN 25 AND 35)
- ✅ Sorting (ORDER BY created_at DESC)
- ✅ Prefix matching (WHERE name LIKE 'John%')
Example:
-- Create index
CREATE INDEX idx_users_email ON users(email);
-- Query benefits
SELECT * FROM users WHERE email = 'john@example.com'; -- Fast (0.01s)
SELECT * FROM users WHERE age > 30 ORDER BY age; -- Fast (uses index)
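A sorted array searched with binary search captures the property that makes B-tree range queries cheap: an O(log n) seek to the start of the range, then a contiguous walk. A minimal Python sketch (a real B-tree is a disk-friendly n-ary tree, but the access pattern is the same):

```python
import bisect

# A sorted list of (age, row_id) pairs stands in for a B-tree index on `age`.
index = sorted([(34, 1), (28, 2), (41, 3), (25, 4), (30, 5), (52, 6)])

def range_query(index, low, high):
    """Return row_ids with low <= age <= high without scanning the whole index."""
    start = bisect.bisect_left(index, (low,))            # O(log n) seek to first match
    end = bisect.bisect_right(index, (high, float("inf")))
    return [row_id for _, row_id in index[start:end]]    # contiguous walk

print(range_query(index, 25, 35))  # row_ids for ages 25..35
```

The same seek-then-walk shape is why `ORDER BY` and prefix `LIKE` also benefit: matching entries are adjacent in the index.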
Hash Index
Structure: Hash table, constant lookup time for exact matches only.
Use for:
- ✅ Exact equality (WHERE id = 123)
- ❌ Range queries (not supported)
- ❌ Sorting (not supported)
Example (Redis):
key: "user:123"
hash: hash("user:123") → bucket → value
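A Python dict is itself a hash table, so it illustrates the trade-off directly: O(1) average exact-match lookups, but no ordering, so any "range query" degenerates into a full scan. A toy sketch:

```python
# A dict behaves like a hash index: constant-time exact lookups, no ordering.
users_by_email = {
    "john@example.com": {"id": 123, "age": 34},
    "jane@example.com": {"id": 456, "age": 29},
}

# Exact match: one hash computation, no scan.
user = users_by_email["john@example.com"]

# "Range query" over a hash index requires examining every entry:
over_30 = [u for u in users_by_email.values() if u["age"] > 30]
```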
Composite Index
Definition: Index on multiple columns.
Order matters:
-- Creates index on (city, age)
CREATE INDEX idx_city_age ON users(city, age);
-- These queries use the index:
SELECT * FROM users WHERE city = 'NYC'; -- ✅ Uses index
SELECT * FROM users WHERE city = 'NYC' AND age > 30; -- ✅ Uses index
-- This query does NOT use the index:
SELECT * FROM users WHERE age > 30; -- ❌ Doesn't use index
-- Reason: Index is (city, age), not (age). Like searching phone book by first name only.
Rule: Leftmost prefix rule - composite index (A, B, C) can be used for queries on (A), (A, B), or (A, B, C), but not (B) or (C) alone.
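The leftmost prefix rule falls out of how a composite index is sorted. A sketch with rows sorted by (city, age): a query on city can binary-search to a contiguous run, but ages are only sorted *within* each city, so a query on age alone must scan everything.

```python
import bisect

# Composite index on (city, age): rows sorted by city first, then age.
idx = sorted([("NYC", 25, 1), ("NYC", 40, 2), ("SF", 31, 3), ("SF", 22, 4)])

def by_city(city):
    """Leftmost prefix: seek to the city's contiguous run, walk it."""
    start = bisect.bisect_left(idx, (city,))
    return [row for row in idx[start:] if row[0] == city]

def by_age_only(min_age):
    """No leftmost prefix: ages aren't globally sorted, so full scan."""
    return [row for row in idx if row[1] > min_age]

print(by_city("NYC"))    # seeks directly to the NYC run
print(by_age_only(30))   # must examine every entry
```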
Index Trade-offs
| Aspect | Pros | Cons |
|---|---|---|
| Read Performance | 100-1000x faster queries | - |
| Write Performance | - | Slower inserts/updates (must update index) |
| Storage | - | Extra disk space (~10-30% of table size) |
| Maintenance | - | Needs occasional rebuilding (fragmentation) |
Don't over-index! Each index slows down writes. Analyze query patterns first.
2. Replication
Purpose: Copy data to multiple servers for availability and read scaling.
Master-Slave Replication
How it works:
- All writes go to master
- Master writes to write-ahead log (WAL)
- Replicas read WAL and apply changes (asynchronous)
- Reads can go to master or replicas
Benefits:
- ✅ High availability (if master fails, promote replica)
- ✅ Read scaling (distribute reads across replicas)
- ✅ Backup (replicas can be used for backups without locking master)
Replication lag: Replicas apply the WAL with a delay, so reads from a replica can return stale data.
Consistency options:
- Asynchronous (default): Fast writes, but reads may be stale (50-200ms lag)
- Synchronous: Slow writes (wait for replica), but always consistent
- Semi-synchronous: Wait for at least 1 replica (compromise)
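A toy simulation of asynchronous replication makes the stale-read problem concrete: the master appends to a log, a replica applies it later, and a read between the write and the catch-up sees old data. All class names here are illustrative, not any real database's API.

```python
# Toy async replication: master appends to a log; a replica applies it later.
class Master:
    def __init__(self):
        self.data, self.log = {}, []
    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))      # write-ahead log entry

class Replica:
    def __init__(self):
        self.data, self.applied = {}, 0
    def catch_up(self, log, up_to):
        # Apply log entries asynchronously, possibly behind the master.
        for key, value in log[self.applied:up_to]:
            self.data[key] = value
        self.applied = up_to

master, replica = Master(), Replica()
master.write("user:1", "Alice")
replica.catch_up(master.log, up_to=1)      # replica is caught up
master.write("user:1", "Alicia")           # new write, not yet replicated
stale = replica.data["user:1"]             # still "Alice": replication lag
```

Synchronous replication closes this window by blocking the second `write` until the replica has applied it, at the cost of write latency.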
Master-Master Replication
Both nodes accept writes - more complex, risk of conflicts.
Use case: Multi-region writes (users in US write to US master, EU users write to EU master)
Conflict resolution: Last-write-wins, vector clocks, or application logic.
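Last-write-wins, the simplest of the three, can be sketched in a few lines: each write carries a timestamp, and the newer one wins when two masters disagree.

```python
# Last-write-wins (LWW): each write is a (timestamp, value) pair; newer wins.
def lww_merge(a, b):
    """Resolve a conflict between two masters' writes for the same key."""
    return a if a[0] >= b[0] else b

us_write = (1700000005, "name=Alice")    # written on the US master
eu_write = (1700000009, "name=Alicia")   # written later on the EU master
winner = lww_merge(us_write, eu_write)   # the EU write is newer
```

Note the hidden cost: the losing write is silently discarded, which is why vector clocks or application-level merging are used when lost updates are unacceptable.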
3. Sharding (Horizontal Partitioning)
Purpose: Split data across multiple databases to scale writes and storage.
Sharding Strategies
1. Range-Based Sharding
Shard 1: user_id 0 - 999,999
Shard 2: user_id 1,000,000 - 1,999,999
Shard 3: user_id 2,000,000 - 2,999,999
Pros: Easy to add shards, range queries work
Cons: Hotspots (if user IDs grow sequentially, newest shard gets all writes)
2. Hash-Based Sharding
shard = hash(user_id) % num_shards
Example: hash("user123") % 3 → Shard 2
Pros: Even distribution, no hotspots
Cons: Range queries don't work, resharding is hard (requires rehashing)
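Both properties - even distribution and painful resharding - show up in a short sketch. A stable hash (md5 here) is used rather than Python's built-in `hash()`, which is randomized per process and unusable for routing:

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    # Stable hash so every process routes the same key to the same shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

users = [f"user{i}" for i in range(1000)]

# Even distribution across 3 shards:
counts = [sum(1 for u in users if shard_for(u, 3) == s) for s in range(3)]

# The resharding problem: going from 3 to 4 shards remaps most keys.
moved = sum(1 for u in users if shard_for(u, 3) != shard_for(u, 4))
print(counts, f"{moved} of {len(users)} keys move when adding a shard")
```

With `% num_shards` routing, roughly three quarters of keys change shards when going from 3 to 4 shards - which is why consistent hashing exists.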
3. Geographic Sharding
Shard US: Users in USA
Shard EU: Users in Europe
Shard Asia: Users in Asia
Pros: Low latency (data close to users), compliance (GDPR)
Cons: Uneven distribution if user base is imbalanced
4. Entity-Based Sharding (Logical)
Shard by tenant_id (for SaaS apps)
All data for customer A on Shard 1
All data for customer B on Shard 2
Pros: Queries within tenant don't cross shards, easy to isolate tenants
Cons: Large tenants can overwhelm a single shard
Sharding Challenges
| Challenge | Solution |
|---|---|
| Cross-shard queries | Denormalize data, use separate aggregation service |
| Transactions across shards | Avoid, or use 2PC/saga pattern (slow) |
| Hotspots | Consistent hashing, split hot shards |
| Resharding | Consistent hashing, or double-write during migration |
| Complexity | Start with vertical scaling, shard only when necessary |
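The "consistent hashing" fix in the table above can be sketched as a hash ring: shards and keys hash onto the same circle, each key goes to the next shard clockwise, and adding a shard only remaps the keys on its new arc instead of rehashing almost everything. A minimal version without virtual nodes (real implementations add many virtual nodes per shard to even out the arcs):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring (no virtual nodes, for clarity)."""
    def __init__(self, shards):
        self.ring = sorted((_hash(s), s) for s in shards)
    def shard_for(self, key: str) -> str:
        # Route to the first shard at or after the key's position, wrapping around.
        i = bisect.bisect(self.ring, (_hash(key),)) % len(self.ring)
        return self.ring[i][1]

before = HashRing(["shard-a", "shard-b", "shard-c"])
after = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
keys = [f"user{i}" for i in range(1000)]
# Only keys on shard-d's new arc move; every other key keeps its shard.
moved = sum(1 for k in keys if before.shard_for(k) != after.shard_for(k))
print(f"{moved} of {len(keys)} keys move")
```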
Always say: "Sharding is a last resort. First try: indexing, caching, read replicas, vertical scaling. Shard only when single database can't handle load."
4. Partitioning
Partitioning = Splitting table within a single database (logical split, not distributed).
Use case: Improve query performance, easier data management (e.g., archive old data).
Types:
- Range: partitions by year (2020, 2021, 2022...)
- Hash: partition = hash(key) % N
- List: partitions by region (US, EU, Asia)
Example (PostgreSQL):
-- Partition orders table by year
CREATE TABLE orders (
id SERIAL,
user_id INT,
created_at TIMESTAMP
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2022 PARTITION OF orders
FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
-- Query automatically routes to correct partition
SELECT * FROM orders WHERE created_at > '2023-06-01'; -- Only scans orders_2023
Benefits: Fast queries (scan only relevant partition), easy to drop old partitions.
Common Interview Questions
Q1: "How do you scale a database to handle 1 billion users?"
Answer (step-by-step):
- Indexing: Add indexes on frequently queried columns (email, user_id)
- Caching: Redis for hot data (90% of reads from cache)
- Read replicas: 5-10 replicas for read traffic (read-heavy workload)
- Vertical scaling: Upgrade master to larger instance (buy time)
- Sharding: When single database can't handle writes, shard by user_id (hash-based)
Q2: "Explain master-slave replication and its trade-offs."
Answer:
- How: Master handles writes, replicas copy data asynchronously, replicas serve reads
- Benefits: High availability, read scaling
- Trade-offs:
- Replication lag (50-200ms) - reads may be stale
- Write bottleneck (all writes to master)
- Failover complexity (promote replica to master)
Q3: "What's the difference between sharding and partitioning?"
Answer:
- Partitioning: Logical split within single database (e.g., partition by year)
- Still one database, easier queries
- Sharding: Physical split across multiple databases (e.g., shard by user_id)
- Distributed system, scales writes, but complex (cross-shard queries hard)
Q4: "You have a query that's slow. How do you optimize it?"
Answer (debugging steps):
- EXPLAIN: Analyze query plan, check if index is used
- Add index: If full table scan, add index on WHERE/ORDER BY columns
- Optimize query: Avoid SELECT *, reduce JOINs, add LIMIT
- Cache results: If query is frequent, cache with Redis (5-minute TTL)
- Denormalize: If many JOINs, consider duplicating data
- Read replica: Offload read traffic to replicas
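Step 1 of that checklist can be tried end-to-end with the sqlite3 module from Python's standard library (SQLite's variant is `EXPLAIN QUERY PLAN`; the exact syntax and output differ across databases, but the idea - check whether the plan scans or seeks - is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, age INT)")

def plan(sql):
    # Each plan row's last column is a human-readable description of the step.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'john@example.com'"
before = plan(query)                       # full table scan
conn.execute("CREATE INDEX idx_users_email ON users(email)")
after = plan(query)                        # index search

print(before)
print(after)
```

If the plan still shows a scan after adding the index, the remaining steps (rewrite the query, cache, denormalize) come into play.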
Trade-offs
| Technique | Pros | Cons | Use When |
|---|---|---|---|
| Indexing | Fast queries (100x) | Slow writes, storage overhead | Query optimization |
| Replication | High availability, read scaling | Replication lag, complexity | Read-heavy workload |
| Sharding | Scales writes, unlimited growth | Cross-shard queries hard | >1TB data, write-heavy |
| Caching | Ultra-fast reads | Stale data, cache invalidation | Hot data, read-heavy |
Real-World Examples
Instagram (Sharding)
- Sharded PostgreSQL: 1000s of shards by user_id (hash-based)
- No cross-shard queries: Denormalize data to avoid joins
- Result: Billions of photos, writes scale linearly
Netflix (Read Replicas)
- Master-slave MySQL: 1 master, 50+ read replicas
- Read-heavy: 99% of traffic is reads (viewing history, recommendations)
- Result: Handles millions of requests per second
Discord (Cassandra Sharding)
- Sharded by guild_id: Each Discord server is on a shard
- No cross-shard queries: Messages are always scoped to a guild
- Result: Billions of messages, scales horizontally by adding shards
Quick Reference Card
Indexing:
- B-tree: Range queries, sorting (most common)
- Hash: Exact match only
- Trade-off: Fast reads, slow writes
Replication:
- Master-slave: Writes to master, reads from replicas
- Replication lag: 50-200ms (asynchronous)
- Use for: High availability, read scaling
Sharding:
- Range: Easy to add shards, risk of hotspots
- Hash: Even distribution, resharding is hard
- Last resort: Only when single DB can't handle load
Optimization checklist:
- Add indexes
- Cache hot data (Redis)
- Add read replicas
- Vertical scaling (bigger instance)
- Shard (if >1TB or write-heavy)
Further Reading
- Use The Index, Luke! - Deep dive on indexing
- Sharding at Instagram
- Vitess - MySQL Sharding - How YouTube shards MySQL
- High Performance MySQL
Next: Caching Strategies - Redis patterns, CDN caching, cache invalidation.