
System Design Fundamentals

TL;DR (30-second summary)

System design is about trade-offs. You can't have infinite scalability, perfect consistency, and 100% availability all at once. The fundamentals are:

  • Scalability: Handle growth (more users, data, requests)
  • Reliability: Stay operational despite failures
  • Availability: Minimize downtime
  • Performance: Low latency, high throughput
  • Maintainability: Easy to evolve and debug

The CAP theorem forces you to choose: during network partitions, pick Consistency OR Availability (not both).

Why This Matters

In interviews: Every system design question tests your understanding of these fundamentals. Interviewers want to hear you discuss trade-offs explicitly.

At work: These concepts guide every architectural decision you'll make.

Core Concepts

1. Scalability

Definition: A system's ability to handle increased load.

Two types:

  • Vertical scaling (scale up): Add more CPU/RAM to existing machine
  • Horizontal scaling (scale out): Add more machines
Vertical scaling (scale up):

  • Pros: Simple (no code changes), no distributed-system complexity
  • Cons: Physical limits, single point of failure, expensive at large scale
  • When to use: Early stages, legacy apps, databases that don't shard well

Horizontal scaling (scale out):

  • Pros: Nearly unlimited scaling, high availability, cost-effective
  • Cons: Complex (load balancing, data consistency), requires stateless design
  • When to use: Web servers, microservices, most modern systems
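Horizontal scaling works because stateless servers are interchangeable: a load balancer can hand each request to any machine in the pool. A minimal round-robin sketch (the server names are made up for illustration):

```python
from itertools import cycle

# Hypothetical pool of stateless web servers.
servers = ["web-1", "web-2", "web-3"]

class RoundRobinBalancer:
    """Distributes requests evenly across a pool of interchangeable servers."""
    def __init__(self, pool):
        self._pool = cycle(pool)

    def pick(self):
        return next(self._pool)

lb = RoundRobinBalancer(servers)
assignments = [lb.pick() for _ in range(6)]
print(assignments)  # each server receives exactly two of the six requests
```

Adding capacity is then just adding a name to the pool, which is exactly why statelessness matters: any server must be able to handle any request.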
Interview Tip

Always ask: "Can we scale horizontally?" If yes, prefer it. If not, explain why (e.g., "databases need sharding strategy").

2. Reliability

Definition: The system continues functioning correctly despite faults.

Faults are inevitable:

  • Hardware failures (disk crashes, network issues)
  • Software bugs (null pointer exceptions, infinite loops)
  • Human errors (bad config, accidental deletes)

Reliability techniques:

  • Redundancy and replication: No single component failure takes the system down
  • Automatic failover: Route traffic around failed nodes
  • Retries with backoff: Absorb transient faults
  • Chaos engineering: Deliberately inject failures to find weaknesses early

Key metrics: MTBF (Mean Time Between Failures) and MTTR (Mean Time To Recovery)

  • MTBF: How long until something breaks?
  • MTTR: How fast can we fix it?

Together these give the standard steady-state availability estimate: Availability = MTBF / (MTBF + MTTR)
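The formula can be checked with a quick worked example (the MTBF/MTTR figures below are hypothetical):

```python
# Worked example: steady-state availability from MTBF and MTTR.
mtbf_hours = 720.0   # mean time between failures: roughly one failure per 30 days
mttr_hours = 1.0     # mean time to recovery: one hour per incident

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"{availability:.4%}")  # 99.8613%
```

Note that halving MTTR improves availability about as much as doubling MTBF, which is why fast recovery is often the cheaper lever.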

3. Availability

Definition: The system is operational and accessible when needed.

Measured as "nines":

Availability          Downtime/Year   Downtime/Month   Use Case
99% (2 nines)         3.65 days       7.31 hours       Internal tools
99.9% (3 nines)       8.77 hours      43.8 minutes     Most web apps
99.99% (4 nines)      52.6 minutes    4.38 minutes     Payment systems
99.999% (5 nines)     5.26 minutes    26.3 seconds     Critical infrastructure

Formula: Availability = Uptime / (Uptime + Downtime)
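The downtime figures in the table above can be reproduced from the availability percentage with a few lines of arithmetic:

```python
# Convert an availability target into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime permitted per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_minutes_per_year(target):.1f} min/year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why going from 3 to 4 nines is usually far harder than going from 2 to 3.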

Red Flag

Don't confuse reliability and availability:

  • A system can be available (responding) but unreliable (giving wrong answers)
  • Example: A buggy API that returns 200 OK but corrupted data

4. Latency vs Throughput

Two different performance measures:

Latency: Time to complete a single request (milliseconds)

  • P50: 50th percentile (median)
  • P95: 95th percentile
  • P99: 99th percentile (tail latency)

Throughput: Number of requests per second (QPS - Queries Per Second)

Latency (good targets):

  • Web: < 200ms
  • API: < 100ms
  • Real-time: < 50ms
  • Critical factors: network hops, database queries, algorithm complexity

Throughput (good targets):

  • Depends on use case (Twitter: ~300K QPS for reads; Netflix: 1M+ requests/min)
  • Critical factors: server capacity, parallelism, caching

Trade-off: Sometimes improving one hurts the other

  • Batching improves throughput but increases latency
  • Caching improves latency but may reduce consistency

5. CAP Theorem

The most important concept in distributed systems. The theorem states that a distributed system can guarantee at most two of three properties: Consistency (every read sees the latest write), Availability (every request receives a non-error response), and Partition tolerance (the system keeps working when network messages between nodes are lost).

Reality: Network partitions WILL happen, so you must have P. The real choice is:

  • CP: Sacrifice availability for consistency (reject requests during partition)
  • AP: Sacrifice consistency for availability (eventual consistency)
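The CP/AP choice can be illustrated with a toy two-replica store (a teaching sketch, not a real database): during a partition, the CP variant refuses writes, while the AP variant accepts them locally and serves stale reads elsewhere.

```python
class Replica:
    def __init__(self, value):
        self.value = value

class TinyStore:
    """Two replicas; during a partition, writes cannot reach the secondary."""
    def __init__(self, mode):
        self.mode = mode              # "CP" or "AP"
        self.primary = Replica("v1")
        self.secondary = Replica("v1")
        self.partitioned = False

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse the write rather than risk inconsistency.
                raise RuntimeError("unavailable: partition in progress")
            self.primary.value = value  # AP: accept locally, replicate later
        else:
            self.primary.value = value
            self.secondary.value = value

    def read_secondary(self):
        return self.secondary.value

ap = TinyStore("AP")
ap.partitioned = True
ap.write("v2")                  # accepted despite the partition
stale = ap.read_secondary()     # "v1": available, but stale

cp = TinyStore("CP")
cp.partitioned = True
try:
    cp.write("v2")
    refused = ""
except RuntimeError as e:
    refused = str(e)            # consistent, but unavailable
print(stale, refused)
```

The AP store stays responsive at the cost of a stale read; the CP store stays correct at the cost of rejecting the request. That is the whole theorem in miniature.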

Examples:

System               Choice   Reason
Bank transfers       CP       Can't risk showing wrong balance
Social media feed    AP       OK if you see slightly stale posts
Shopping cart        AP       Better to allow purchases than show an error
Inventory system     CP       Can't oversell items
Interview Tip

Always say: "Since network partitions are inevitable, we need partition tolerance. So the real question is: CP or AP for this use case?"

6. Consistency Models

Not just "consistent" or "not consistent" - it's a spectrum:

Model                  Guarantee                                Use Case
Strong consistency     Reads always see the latest write        Financial transactions, inventory
Causal consistency     Related events are seen in order         Chat messages (replies after the original)
Session consistency    Your own writes are always visible       Shopping cart, user profile
Eventual consistency   All replicas converge eventually         Social media likes, view counts
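Eventual consistency is commonly implemented with last-write-wins (LWW) merging: each write carries a timestamp, and replicas keep the newer entry when they exchange state. A minimal sketch (values and timestamps are invented):

```python
def lww_merge(a, b):
    """Last-write-wins: keep the entry with the later timestamp."""
    return a if a[1] >= b[1] else b

# Each replica holds (value, timestamp); they briefly disagree.
replica_a = ("like_count=10", 100)   # older write
replica_b = ("like_count=11", 105)   # newer write, seen only by replica B

# Anti-entropy: replicas exchange state and merge.
replica_a = lww_merge(replica_a, replica_b)
replica_b = lww_merge(replica_b, replica_a)
print(replica_a == replica_b)  # True: both now hold the newest write
```

LWW is simple but lossy (concurrent writes with the older timestamp are silently discarded), which is acceptable for like counts and a poor fit for bank balances.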

7. Service Level Objectives (SLOs)

SLI (Service Level Indicator): A metric (e.g., latency, error rate)
SLO (Service Level Objective): Target value for SLI (e.g., P99 latency < 200ms)
SLA (Service Level Agreement): Contract with consequences (e.g., 99.9% uptime or refund)

Example SLOs:

  • 99.9% of requests complete in < 200ms
  • 99.99% availability over 30-day window
  • Error rate < 0.1%
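An SLO translates directly into an error budget: the amount of failure you are allowed before the objective is breached. A quick calculation, assuming hypothetical traffic of 10M requests per month and a 99.9% target:

```python
# Error budget implied by a 99.9% success SLO over a 30-day window.
slo = 0.999
monthly_requests = 10_000_000  # hypothetical traffic

error_budget = round((1 - slo) * monthly_requests)
downtime_budget_min = (1 - slo) * 30 * 24 * 60

print(error_budget)                    # 10000 failed requests allowed
print(round(downtime_budget_min, 1))   # 43.2 minutes of full downtime allowed
```

Teams often spend this budget deliberately: risky deploys and experiments are fine while budget remains, and freeze when it runs out.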

Trade-offs

Every system design decision involves trade-offs:

Dimension      Option A                    Option B
Scaling        Vertical (easier)           Horizontal (more scalable)
Consistency    Strong (correct)            Eventual (available)
Storage        SQL (structured)            NoSQL (flexible)
Latency        Synchronous (predictable)   Asynchronous (decoupled)
Cost           Over-provision (reliable)   Right-size (economical)

No free lunch: You can't optimize for everything. Choose based on requirements.

Common Interview Questions

Q1: "How would you design a system for high availability?"

Answer structure:

  1. Eliminate single points of failure: Load balancers, database replicas, multiple regions
  2. Add redundancy: N+1 or N+2 provisioning
  3. Implement health checks: Auto-remove unhealthy nodes
  4. Plan for failure: Circuit breakers, graceful degradation
  5. Monitor and alert: Know when things break
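One of the failure-planning techniques above, the circuit breaker, can be sketched in a few lines (the threshold, reset window, and flaky downstream call are all hypothetical):

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, reject calls ("open") until
    `reset_after` seconds pass, then allow a single trial call."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None    # half-open: permit one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker(threshold=2, reset_after=60.0)

def flaky():
    raise ConnectionError("downstream timeout")

for _ in range(2):               # two failures trip the breaker
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

try:
    breaker.call(flaky)
except RuntimeError as e:
    msg = str(e)
print(msg)                       # circuit open: failing fast
```

Failing fast protects the caller from piling up slow, doomed requests and gives the failing dependency room to recover, which is the graceful-degradation idea in step 4.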

Q2: "What's the difference between latency and throughput?"

Answer:

  • Latency: How long one request takes (time)
  • Throughput: How many requests per second (rate)
  • Example: A highway with latency = time to drive 100 miles, throughput = cars per hour
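The highway analogy also shows how the two relate through concurrency (Little's law: throughput = concurrency / latency). The numbers below are illustrative:

```python
# Little's law applied to the highway analogy: throughput = concurrency / latency.
latency_hours = 100 / 60    # time for one car to drive 100 miles at 60 mph
cars_on_road = 500          # concurrent cars ("in-flight requests")

throughput_cars_per_hour = cars_on_road / latency_hours
print(round(throughput_cars_per_hour))  # 300 cars/hour
```

The same relation explains why adding parallelism (more cars, more worker threads) raises throughput without improving any single request's latency.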

Q3: "Explain the CAP theorem and give examples."

Answer:

  1. State the theorem: During network partition, choose C or A
  2. Emphasize P is mandatory in distributed systems
  3. Give concrete examples:
    • CP: Banking (block transactions during partition)
    • AP: Facebook feed (show stale data during partition)
  4. Mention most systems use eventual consistency (AP) with techniques to minimize staleness

Q4: "How do you measure system reliability?"

Answer:

  • Availability: Uptime percentage (e.g., 99.99% = 52 minutes downtime/year)
  • Error rate: Failed requests / Total requests
  • MTBF and MTTR: How often failures happen and how fast you recover
  • SLOs: Specific targets like P99 latency < 200ms

Real-World Examples

Netflix (High Availability)

  • Architecture: Multi-region, auto-scaling, chaos engineering
  • Trade-off: Chose AP (eventual consistency) for non-critical data
  • Result: Can lose entire AWS region and still serve content

Amazon DynamoDB (AP System)

  • Design: Eventual consistency by default, optional strong consistency
  • Trade-off: Availability over immediate consistency
  • Result: 99.999% availability, powers Amazon.com cart

Google Spanner (CP System)

  • Design: Strong consistency with TrueTime API
  • Trade-off: Slightly higher latency for consistency guarantees
  • Result: Global transactions with ACID properties

Quick Reference Card

Memorize these:

  • Vertical scaling: Scale up (bigger machine)
  • Horizontal scaling: Scale out (more machines)
  • Latency: Time per request (ms)
  • Throughput: Requests per second (QPS)
  • CAP: Consistency + Availability + Partition Tolerance (pick 2, but really pick C or A)
  • 99.9% availability = 8.77 hours downtime/year
  • 99.99% availability = 52.6 minutes downtime/year

Key trade-offs:

  • Consistency ↔ Availability
  • Latency ↔ Throughput (sometimes)
  • Cost ↔ Reliability
  • Simplicity ↔ Scalability


Next: Back-of-Envelope Calculations - Learn to estimate system requirements like a pro.