System Design Fundamentals
TL;DR (30-second summary)
System design is about trade-offs. You can't have infinite scalability, perfect consistency, and 100% availability all at once. The fundamentals are:
- Scalability: Handle growth (more users, data, requests)
- Reliability: Stay operational despite failures
- Availability: Minimize downtime
- Performance: Low latency, high throughput
- Maintainability: Easy to evolve and debug
The CAP theorem forces you to choose: during network partitions, pick Consistency OR Availability (not both).
Why This Matters
In interviews: Every system design question tests your understanding of these fundamentals. Interviewers want to hear you discuss trade-offs explicitly.
At work: These concepts guide every architectural decision you'll make.
Core Concepts
1. Scalability
Definition: A system's ability to handle increased load.
Two types:
- Vertical scaling (scale up): Add more CPU/RAM to existing machine
- Horizontal scaling (scale out): Add more machines
| Type | Pros | Cons | When to Use |
|---|---|---|---|
| Vertical | • Simple (no code changes) • No distributed system complexity | • Physical limits • Single point of failure • Expensive at large scale | Early stages, legacy apps, databases that don't shard well |
| Horizontal | • Nearly unlimited scaling • High availability • Cost-effective | • Complex (load balancing, data consistency) • Requires stateless design | Web servers, microservices, most modern systems |
Always ask: "Can we scale horizontally?" If yes, prefer it. If not, explain why (e.g., "databases need sharding strategy").
2. Reliability
Definition: The system continues functioning correctly despite faults.
Faults are inevitable:
- Hardware failures (disk crashes, network issues)
- Software bugs (null pointer exceptions, infinite loops)
- Human errors (bad config, accidental deletes)
Reliability techniques: redundancy (no single point of failure), replication, automatic failover, retries with backoff, and fault-injection testing.
Key metrics: MTBF (Mean Time Between Failures) and MTTR (Mean Time To Recovery)
- MTBF: How long until something breaks?
- MTTR: How fast can we fix it?
Steady-state availability = MTBF / (MTBF + MTTR)
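The MTBF/MTTR ratio can be sanity-checked with a quick calculation (the numbers below are illustrative, not from any real system):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A component that fails every 1000 hours and takes 1 hour to repair:
print(f"{availability(1000, 1):.4%}")
# Halving MTTR improves availability as much as doubling MTBF would:
print(f"{availability(1000, 0.5):.4%}")
```

Note the practical takeaway: fast recovery (low MTTR) is often cheaper to achieve than fewer failures (high MTBF).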
3. Availability
Definition: The system is operational and accessible when needed.
Measured as "nines":
| Availability | Downtime/Year | Downtime/Month | Use Case |
|---|---|---|---|
| 99% (2 nines) | 3.65 days | 7.31 hours | Internal tools |
| 99.9% (3 nines) | 8.77 hours | 43.8 minutes | Most web apps |
| 99.99% (4 nines) | 52.6 minutes | 4.38 minutes | Payment systems |
| 99.999% (5 nines) | 5.26 minutes | 26.3 seconds | Critical infrastructure |
Formula: Availability = Uptime / (Uptime + Downtime)
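The downtime figures in the table follow directly from this formula; a short script reproduces them (assuming a 365.25-day year, so numbers match the table's 8766-hour year):

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

def downtime_per_year(availability: float) -> float:
    """Allowed downtime in hours per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

for nines in (0.99, 0.999, 0.9999, 0.99999):
    hours = downtime_per_year(nines)
    print(f"{nines:.3%}: {hours:.2f} h/year ({hours * 60:.1f} min)")
```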
Don't confuse reliability and availability:
- A system can be available (responding) but unreliable (giving wrong answers)
- Example: A buggy API that returns 200 OK but corrupted data
4. Latency vs Throughput
Two different performance measures:
Latency: Time to complete a single request (milliseconds)
- P50: 50th percentile (median)
- P95: 95th percentile
- P99: 99th percentile (tail latency)
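Percentiles are easy to compute from raw latency samples; here is a minimal nearest-rank sketch (production systems typically use streaming estimators instead of sorting every sample):

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: value at or below which p% of samples fall."""
    s = sorted(latencies_ms)
    # ceil(p/100 * n) - 1, clamped to valid indices
    k = max(0, min(len(s) - 1, -(-p * len(s) // 100) - 1))
    return s[k]

samples = [12, 15, 14, 120, 13, 16, 15, 14, 300, 10]  # two slow outliers
print(percentile(samples, 50))   # median is unaffected by outliers
print(percentile(samples, 99))   # tail latency is dominated by them
```

This is why tail latency (P99) matters: averages and medians hide the slow requests your unhappiest users actually see.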
Throughput: Number of requests per second (QPS - Queries Per Second)
| Metric | Good Targets | Critical Factors |
|---|---|---|
| Latency | • Web: < 200ms • API: < 100ms • Real-time: < 50ms | Network hops, database queries, algorithm complexity |
| Throughput | • Depends on use case • Twitter: 300K QPS (reads) • Netflix: 1M+ requests/min | Server capacity, parallelism, caching |
Trade-off: Sometimes improving one hurts the other
- Batching improves throughput but increases latency
- Caching improves latency but may reduce consistency
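The batching trade-off can be made concrete with a toy cost model: assume each flush pays a fixed overhead (e.g., one network round trip) plus a per-item cost. The numbers below are illustrative only.

```python
FIXED_OVERHEAD_MS = 10.0   # per-flush cost, amortized across the batch
PER_ITEM_MS = 0.5          # marginal cost of each item

def batch_cost_ms(n: int) -> float:
    """Time to flush a batch of n items."""
    return FIXED_OVERHEAD_MS + PER_ITEM_MS * n

for n in (1, 10, 100):
    cost = batch_cost_ms(n)
    throughput = n / (cost / 1000)      # items per second
    print(f"batch={n:>3}: latency={cost:.1f} ms, throughput={throughput:,.0f}/s")
```

Larger batches amortize the fixed overhead (throughput rises) but every item waits for the whole flush (latency rises): the trade-off in miniature.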
5. CAP Theorem
The most important concept in distributed systems. The theorem: a distributed system can guarantee at most two of Consistency (every read sees the latest write), Availability (every request gets a non-error response), and Partition tolerance (the system keeps operating when nodes can't communicate).
Reality: Network partitions WILL happen, so you must have P. The real choice is:
- CP: Sacrifice availability for consistency (reject requests during partition)
- AP: Sacrifice consistency for availability (eventual consistency)
Examples:
| System | Choice | Reason |
|---|---|---|
| Bank transfers | CP | Can't risk showing wrong balance |
| Social media feed | AP | OK if you see slightly stale posts |
| Shopping cart | AP | Better to allow purchases than show error |
| Inventory system | CP | Can't oversell items |
Always say: "Since network partitions are inevitable, we need partition tolerance. So the real question is: CP or AP for this use case?"
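The CP/AP split can be sketched with a hypothetical toy replica (not a real database): during a partition, a CP replica rejects writes it cannot replicate, while an AP replica accepts them locally and reconciles later.

```python
class Replica:
    def __init__(self, mode: str):
        self.mode = mode          # "CP" or "AP"
        self.data = {}
        self.partitioned = False  # can we reach the other replicas?

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse rather than risk divergent state
            raise RuntimeError("unavailable: cannot replicate during partition")
        self.data[key] = value    # AP: accept locally, sync after partition heals
        return "ok"

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

print(ap.write("cart", ["book"]))   # AP stays available
try:
    cp.write("balance", 100)
except RuntimeError as e:
    print(e)                        # CP sacrifices availability
```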
6. Consistency Models
Consistency isn't binary ("consistent" or "not") - it's a spectrum:
| Model | Guarantee | Use Case |
|---|---|---|
| Strong Consistency | Reads always see latest write | Financial transactions, inventory |
| Causal Consistency | Related events seen in order | Chat messages (replies after original) |
| Session Consistency | Your own writes always visible to you | Shopping cart, user profile |
| Eventual Consistency | All replicas converge eventually | Social media likes, view counts |
7. Service Level Objectives (SLOs)
SLI (Service Level Indicator): A metric (e.g., latency, error rate)
SLO (Service Level Objective): Target value for SLI (e.g., P99 latency < 200ms)
SLA (Service Level Agreement): Contract with consequences (e.g., 99.9% uptime or refund)
Example SLOs:
- 99.9% of requests complete in < 200ms
- 99.99% availability over 30-day window
- Error rate < 0.1%
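SLOs are often operationalized as error budgets: the amount of failure the SLO permits over a window. A quick calculation for the example SLOs above:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

def error_budget_requests(slo_success_rate: float, total_requests: int) -> int:
    """Failed requests allowed by a success-rate SLO."""
    return int((1 - slo_success_rate) * total_requests)

print(f"{error_budget_minutes(0.9999):.2f} min of downtime per 30 days")
print(error_budget_requests(0.999, 1_000_000), "failed requests per million")
```

When the budget is spent, teams typically freeze risky launches until reliability recovers; when budget remains, they can ship faster.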
Trade-offs
Every system design decision involves trade-offs:
| Dimension | Option A | Option B |
|---|---|---|
| Scaling | Vertical (easier) | Horizontal (more scalable) |
| Consistency | Strong (correct) | Eventual (available) |
| Storage | SQL (structured) | NoSQL (flexible) |
| Latency | Synchronous (predictable) | Asynchronous (decoupled) |
| Cost | Over-provision (reliable) | Right-size (economical) |
No free lunch: You can't optimize for everything. Choose based on requirements.
Common Interview Questions
Q1: "How would you design a system for high availability?"
Answer structure:
- Eliminate single points of failure: Load balancers, database replicas, multiple regions
- Add redundancy: N+1 or N+2 provisioning
- Implement health checks: Auto-remove unhealthy nodes
- Plan for failure: Circuit breakers, graceful degradation
- Monitor and alert: Know when things break
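One of the "plan for failure" techniques above, the circuit breaker, fits in a few lines. This is a minimal sketch; real implementations (e.g., resilience4j) add timeouts and a half-open probing state.

```python
class CircuitBreaker:
    """Trips open after `threshold` consecutive failures, then fails fast."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            # Open: shed load instead of hammering a failing dependency
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise TimeoutError("downstream timeout")

for _ in range(3):
    try:
        breaker.call(flaky)
    except Exception as e:
        print(type(e).__name__, e)
# After 2 failures the breaker fails fast without calling flaky() again.
```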
Q2: "What's the difference between latency and throughput?"
Answer:
- Latency: How long one request takes (time)
- Throughput: How many requests per second (rate)
- Example: A highway with latency = time to drive 100 miles, throughput = cars per hour
Q3: "Explain the CAP theorem and give examples."
Answer:
- State the theorem: During network partition, choose C or A
- Emphasize P is mandatory in distributed systems
- Give concrete examples:
- CP: Banking (block transactions during partition)
- AP: Facebook feed (show stale data during partition)
- Mention that many large-scale systems choose eventual consistency (AP) and use techniques such as read repair and anti-entropy to minimize staleness
Q4: "How do you measure system reliability?"
Answer:
- Availability: Uptime percentage (e.g., 99.99% = 52.6 minutes downtime/year)
- Error rate: Failed requests / Total requests
- MTBF and MTTR: How often failures happen and how fast you recover
- SLOs: Specific targets like P99 latency < 200ms
Real-World Examples
Netflix (High Availability)
- Architecture: Multi-region, auto-scaling, chaos engineering
- Trade-off: Chose AP (eventual consistency) for non-critical data
- Result: Can lose entire AWS region and still serve content
Amazon DynamoDB (AP System)
- Design: Eventual consistency by default, optional strong consistency
- Trade-off: Availability over immediate consistency
- Result: 99.999% availability, powers Amazon.com cart
Google Spanner (CP System)
- Design: Strong consistency with TrueTime API
- Trade-off: Slightly higher latency for consistency guarantees
- Result: Global transactions with ACID properties
Quick Reference Card
Memorize these:
- Vertical scaling: Scale up (bigger machine)
- Horizontal scaling: Scale out (more machines)
- Latency: Time per request (ms)
- Throughput: Requests per second (QPS)
- CAP: Consistency + Availability + Partition Tolerance (pick 2, but really pick C or A)
- 99.9% availability = 8.77 hours downtime/year
- 99.99% availability = 52.6 minutes downtime/year
Key trade-offs:
- Consistency ↔ Availability
- Latency ↔ Throughput (sometimes)
- Cost ↔ Reliability
- Simplicity ↔ Scalability
Further Reading
- CAP Theorem - Microsoft Azure
- AWS Well-Architected Framework
- "Designing Data-Intensive Applications" - Chapter 1, 2, 9
- GitHub System Design Primer - Scalability
Next: Back-of-Envelope Calculations - Learn to estimate system requirements like a pro.