System Design Fundamentals
TL;DR (30-second summary)
System design is about trade-offs. You can't have infinite scalability, perfect consistency, and 100% availability all at once. The fundamentals are:
- Scalability: Handle growth (more users, data, requests)
- Reliability: Stay operational despite failures
- Availability: Minimize downtime
- Performance: Low latency, high throughput
- Maintainability: Easy to evolve and debug
The CAP theorem forces you to choose: during network partitions, pick Consistency OR Availability (not both).
Why This Matters
In interviews: Every system design question tests your understanding of these fundamentals. Interviewers want to hear you discuss trade-offs explicitly.
At work: These concepts guide every architectural decision you'll make.
Core Concepts
1. Scalability
Definition: A system's ability to handle increased load.
Two types:
- Vertical scaling (scale up): Add more CPU/RAM to existing machine
- Horizontal scaling (scale out): Add more machines
| Type | Pros | Cons | When to Use |
|---|---|---|---|
| Vertical | • Simple (no code changes) • No distributed system complexity | • Physical limits • Single point of failure • Expensive at large scale | Early stages, legacy apps, databases that don't shard well |
| Horizontal | • Nearly unlimited scaling • High availability • Cost-effective | • Complex (load balancing, data consistency) • Requires stateless design | Web servers, microservices, most modern systems |
Always ask: "Can we scale horizontally?" If yes, prefer it. If not, explain why (e.g., "databases need sharding strategy").
2. Reliability
Definition: The system continues functioning correctly despite faults.
Faults are inevitable:
- Hardware failures (disk crashes, network issues)
- Software bugs (null pointer exceptions, infinite loops)
- Human errors (bad config, accidental deletes)
Reliability techniques: redundancy (no single point of failure), replication, automatic failover, retries with backoff, and fault-injection testing.
Key metrics: MTBF (Mean Time Between Failures) and MTTR (Mean Time To Recovery)
- MTBF: How long until something breaks?
- MTTR: How fast can we fix it?
Steady-state availability = MTBF / (MTBF + MTTR)
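The MTBF/MTTR ratio can be sanity-checked with a quick calculation (the numbers below are illustrative, not from any real system):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A component that fails every 1000 hours and takes 1 hour to repair:
print(f"{availability(1000, 1):.4%}")
# Halving MTTR improves availability as much as doubling MTBF would:
print(f"{availability(1000, 0.5):.4%}")
```

Note the practical takeaway: fast recovery (low MTTR) is often cheaper to achieve than fewer failures (high MTBF).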
3. Availability
Definition: The system is operational and accessible when needed.
Measured as "nines":
| Availability | Downtime/Year | Downtime/Month | Use Case |
|---|---|---|---|
| 99% (2 nines) | 3.65 days | 7.31 hours | Internal tools |
| 99.9% (3 nines) | 8.77 hours | 43.8 minutes | Most web apps |
| 99.99% (4 nines) | 52.6 minutes | 4.38 minutes | Payment systems |
| 99.999% (5 nines) | 5.26 minutes | 26.3 seconds | Critical infrastructure |
Formula: Availability = Uptime / (Uptime + Downtime)
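The downtime figures in the table follow directly from this formula; a short script reproduces them (assuming a 365.25-day year, so numbers match the table's 8766-hour year):

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

def downtime_per_year(availability: float) -> float:
    """Allowed downtime in hours per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

for nines in (0.99, 0.999, 0.9999, 0.99999):
    hours = downtime_per_year(nines)
    print(f"{nines:.3%}: {hours:.2f} h/year ({hours * 60:.1f} min)")
```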
Don't confuse reliability and availability:
- A system can be available (responding) but unreliable (giving wrong answers)
- Example: A buggy API that returns 200 OK but corrupted data
4. Latency vs Throughput
Two different performance measures:
Latency: Time to complete a single request (milliseconds)
- P50: 50th percentile (median)
- P95: 95th percentile
- P99: 99th percentile (tail latency)
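Percentiles are easy to compute from raw latency samples; here is a minimal nearest-rank sketch (production systems typically use streaming estimators instead of sorting every sample):

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: value at or below which p% of samples fall."""
    s = sorted(latencies_ms)
    # ceil(p/100 * n) - 1, clamped to valid indices
    k = max(0, min(len(s) - 1, -(-p * len(s) // 100) - 1))
    return s[k]

samples = [12, 15, 14, 120, 13, 16, 15, 14, 300, 10]  # two slow outliers
print(percentile(samples, 50))   # median is unaffected by outliers
print(percentile(samples, 99))   # tail latency is dominated by them
```

This is why tail latency (P99) matters: averages and medians hide the slow requests your unhappiest users actually see.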
Throughput: Number of requests per second (QPS - Queries Per Second)
| Metric | Good Targets | Critical Factors |
|---|---|---|
| Latency | • Web: < 200ms • API: < 100ms • Real-time: < 50ms | Network hops, database queries, algorithm complexity |
| Throughput | • Depends on use case • Twitter: 300K QPS (reads) • Netflix: 1M+ requests/min | Server capacity, parallelism, caching |
Trade-off: Sometimes improving one hurts the other
- Batching improves throughput but increases latency
- Caching improves latency but may reduce consistency
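The batching trade-off can be made concrete with a toy cost model: assume each flush pays a fixed overhead (e.g., one network round trip) plus a per-item cost. The numbers below are illustrative only.

```python
FIXED_OVERHEAD_MS = 10.0   # per-flush cost, amortized across the batch
PER_ITEM_MS = 0.5          # marginal cost of each item

def batch_cost_ms(n: int) -> float:
    """Time to flush a batch of n items."""
    return FIXED_OVERHEAD_MS + PER_ITEM_MS * n

for n in (1, 10, 100):
    cost = batch_cost_ms(n)
    throughput = n / (cost / 1000)      # items per second
    print(f"batch={n:>3}: latency={cost:.1f} ms, throughput={throughput:,.0f}/s")
```

Larger batches amortize the fixed overhead (throughput rises) but every item waits for the whole flush (latency rises): the trade-off in miniature.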
5. CAP Theorem
The most important concept in distributed systems. The theorem: a distributed system can guarantee at most two of Consistency (every read sees the latest write), Availability (every request gets a non-error response), and Partition tolerance (the system keeps operating when nodes can't communicate).
Reality: Network partitions WILL happen, so you must have P. The real choice is:
- CP: Sacrifice availability for consistency (reject requests during partition)
- AP: Sacrifice consistency for availability (eventual consistency)
Examples:
| System | Choice | Reason |
|---|---|---|
| Bank transfers | CP | Can't risk showing wrong balance |
| Social media feed | AP | OK if you see slightly stale posts |
| Shopping cart | AP | Better to allow purchases than show error |
| Inventory system | CP | Can't oversell items |
Always say: "Since network partitions are inevitable, we need partition tolerance. So the real question is: CP or AP for this use case?"
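The CP/AP split can be sketched with a hypothetical toy replica (not a real database): during a partition, a CP replica rejects writes it cannot replicate, while an AP replica accepts them locally and reconciles later.

```python
class Replica:
    def __init__(self, mode: str):
        self.mode = mode          # "CP" or "AP"
        self.data = {}
        self.partitioned = False  # can we reach the other replicas?

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse rather than risk divergent state
            raise RuntimeError("unavailable: cannot replicate during partition")
        self.data[key] = value    # AP: accept locally, sync after partition heals
        return "ok"

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

print(ap.write("cart", ["book"]))   # AP stays available
try:
    cp.write("balance", 100)
except RuntimeError as e:
    print(e)                        # CP sacrifices availability
```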
6. Consistency Models
Consistency isn't binary ("consistent" or "not") - it's a spectrum:
| Model | Guarantee | Use Case |
|---|---|---|
| Strong Consistency | Reads always see latest write | Financial transactions, inventory |
| Causal Consistency | Related events seen in order | Chat messages (replies after original) |
| Session Consistency | Your own writes always visible to you | Shopping cart, user profile |
| Eventual Consistency | All replicas converge eventually | Social media likes, view counts |
7. Service Level Objectives (SLOs)
SLI (Service Level Indicator): A metric (e.g., latency, error rate)
SLO (Service Level Objective): Target value for SLI (e.g., P99 latency < 200ms)
SLA (Service Level Agreement): Contract with consequences (e.g., 99.9% uptime or refund)
Example SLOs:
- 99.9% of requests complete in < 200ms
- 99.99% availability over 30-day window
- Error rate < 0.1%
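SLOs are often operationalized as error budgets: the amount of failure the SLO permits over a window. A quick calculation for the example SLOs above:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

def error_budget_requests(slo_success_rate: float, total_requests: int) -> int:
    """Failed requests allowed by a success-rate SLO."""
    return int((1 - slo_success_rate) * total_requests)

print(f"{error_budget_minutes(0.9999):.2f} min of downtime per 30 days")
print(error_budget_requests(0.999, 1_000_000), "failed requests per million")
```

When the budget is spent, teams typically freeze risky launches until reliability recovers; when budget remains, they can ship faster.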
Trade-offs
Every system design decision involves trade-offs:
| Dimension | Option A | Option B |
|---|---|---|
| Scaling | Vertical (easier) | Horizontal (more scalable) |
| Consistency | Strong (correct) | Eventual (available) |
| Storage | SQL (structured) | NoSQL (flexible) |
| Latency | Synchronous (predictable) | Asynchronous (decoupled) |
| Cost | Over-provision (reliable) | Right-size (economical) |
No free lunch: You can't optimize for everything. Choose based on requirements.
Common Interview Questions
Q1: "How would you design a system for high availability?"
Answer structure:
- Eliminate single points of failure: Load balancers, database replicas, multiple regions
- Add redundancy: N+1 or N+2 provisioning
- Implement health checks: Auto-remove unhealthy nodes
- Plan for failure: Circuit breakers, graceful degradation
- Monitor and alert: Know when things break
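One of the "plan for failure" techniques above, the circuit breaker, fits in a few lines. This is a minimal sketch; real implementations (e.g., resilience4j) add timeouts and a half-open probing state.

```python
class CircuitBreaker:
    """Trips open after `threshold` consecutive failures, then fails fast."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            # Open: shed load instead of hammering a failing dependency
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise TimeoutError("downstream timeout")

for _ in range(3):
    try:
        breaker.call(flaky)
    except Exception as e:
        print(type(e).__name__, e)
# After 2 failures the breaker fails fast without calling flaky() again.
```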
Q2: "What's the difference between latency and throughput?"
Answer:
- Latency: How long one request takes (time)
- Throughput: How many requests per second (rate)
- Example: A highway with latency = time to drive 100 miles, throughput = cars per hour
Q3: "Explain the CAP theorem and give examples."
Answer:
- State the theorem: During network partition, choose C or A
- Emphasize P is mandatory in distributed systems
- Give concrete examples:
- CP: Banking (block transactions during partition)
- AP: Facebook feed (show stale data during partition)
- Mention that many large-scale systems choose eventual consistency (AP) and use techniques such as read repair and anti-entropy to minimize staleness
Q4: "How do you measure system reliability?"
Answer:
- Availability: Uptime percentage (e.g., 99.99% = 52.6 minutes downtime/year)
- Error rate: Failed requests / Total requests
- MTBF and MTTR: How often failures happen and how fast you recover
- SLOs: Specific targets like P99 latency < 200ms
Real-World Examples
Netflix (High Availability)
- Architecture: Multi-region, auto-scaling, chaos engineering
- Trade-off: Chose AP (eventual consistency) for non-critical data
- Result: Can lose entire AWS region and still serve content
Amazon DynamoDB (AP System)
- Design: Eventual consistency by default, optional strong consistency
- Trade-off: Availability over immediate consistency
- Result: 99.999% availability, powers Amazon.com cart
Google Spanner (CP System)
- Design: Strong consistency with TrueTime API
- Trade-off: Slightly higher latency for consistency guarantees
- Result: Global transactions with ACID properties
Quick Reference Card
Memorize these:
- Vertical scaling: Scale up (bigger machine)
- Horizontal scaling: Scale out (more machines)
- Latency: Time per request (ms)
- Throughput: Requests per second (QPS)
- CAP: Consistency + Availability + Partition Tolerance (pick 2, but really pick C or A)
- 99.9% availability = 8.77 hours downtime/year
- 99.99% availability = 52.6 minutes downtime/year
Key trade-offs:
- Consistency ↔ Availability
- Latency ↔ Throughput (sometimes)
- Cost ↔ Reliability
- Simplicity ↔ Scalability
Further Reading
- CAP Theorem - Microsoft Azure
- AWS Well-Architected Framework
- "Designing Data-Intensive Applications" - Chapter 1, 2, 9
- GitHub System Design Primer - Scalability
Next: Back-of-Envelope Calculations - Learn to estimate system requirements like a pro.