
Distributed Systems Fundamentals

TL;DR

Distributed system: multiple computers cooperating over a network to act as one system. Challenges: network failures, clock synchronization, consensus. Consensus algorithms: Raft, Paxos (elect a leader, agree on replicated state).

Core Concepts

Challenges

Problem | Impact | Solution
Network partition | Nodes can't communicate | CAP theorem (choose CP or AP)
Partial failures | Some nodes fail, not all | Timeouts, retries, circuit breakers
Clock skew | Clocks don't match | Vector clocks, logical timestamps
Concurrency | Conflicting updates | Distributed locks, consensus
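
To make the "vector clocks" entry concrete, here is a minimal sketch of how a vector clock orders events without relying on wall-clock time. The function names (`tick`, `merge`, `happened_before`, `concurrent`) and the dictionary representation are illustrative assumptions, not taken from any particular library.

```python
from typing import Dict

# A vector clock maps each node name to a counter of events seen from that node.
VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter incremented (a local event)."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local: VectorClock, received: VectorClock, node: str) -> VectorClock:
    """On receiving a message: take the element-wise max, then tick the receiver."""
    merged = {
        n: max(local.get(n, 0), received.get(n, 0))
        for n in set(local) | set(received)
    }
    return tick(merged, node)


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if `a` causally precedes `b`: a <= b on every node and a < b on at least one."""
    nodes = set(a) | set(b)
    return (
        all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
        and any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    )


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if neither clock precedes the other, i.e. the updates may conflict."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    return a_ahead and b_ahead


# Nodes "A" and "B" each record one local event, then B receives A's message.
a1 = tick({}, "A")
b1 = tick({}, "B")
b2 = merge(b1, a1, "B")
```

Here `a1` and `b1` are concurrent (neither node had seen the other's event), while `a1` happened before `b2` because B merged A's clock when it received the message.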

Consensus (Raft Algorithm)

Key properties:

  • Majority vote: Decisions need a quorum of more than half the nodes (3 of 5, 4 of 7)
  • Leader election: One node coordinates writes
  • Log replication: Leader replicates to followers
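
The majority rule above can be sketched as simple arithmetic. This is a simplified illustration, not a Raft implementation: the helper names are hypothetical, and `commit_index` ignores Raft's additional requirement that the leader only commits entries from its current term this way.

```python
from typing import List


def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return cluster_size - quorum_size(cluster_size)


def commit_index(match_index: List[int]) -> int:
    """Highest log index stored on a majority of nodes.

    `match_index` holds, for every node including the leader, the highest
    log index known to be replicated on that node.
    """
    ranked = sorted(match_index, reverse=True)
    # The quorum-th highest value is present on at least a quorum of nodes.
    return ranked[quorum_size(len(match_index)) - 1]


# Five-node cluster: the leader and one follower are at index 7,
# the other followers lag at 5, 4, and 2.
replicated = [7, 7, 5, 4, 2]
safe_to_commit = commit_index(replicated)
```

With these numbers, index 5 is the highest entry present on three of the five nodes, so it is the highest one a majority has acknowledged. Note that a 4-node cluster needs 3 votes and tolerates only 1 failure, the same as a 3-node cluster, which is why odd cluster sizes are the usual choice.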

Distributed Locks

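import redis  # redis-py client

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Acquire lock (SET if not exists, with a 30-second TTL)
locked = r.set("lock:cron_job", "server_1", nx=True, ex=30)

if locked:
    try:
        run_cron_job()
    finally:
        # Release only if we still hold the lock; the TTL may have expired
        # and another server may own it now. (GET then DELETE is not atomic.)
        if r.get("lock:cron_job") == b"server_1":
            r.delete("lock:cron_job")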

Challenges:

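  • Lock timeout: The TTL frees the lock if the holder crashes, but if the job outlives the TTL, a second process can acquire the lock while the first is still running
  • Split brain: A single Redis node is a single point of failure. Use a quorum-based lock such as Redlock, or a consensus-backed store such as etcd or ZooKeeper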
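
The lock-timeout problem can be illustrated with owner tokens: each acquisition gets a unique token, and a release only succeeds if that token still owns the lock. The sketch below uses `InMemoryLockStore`, a hypothetical in-process stand-in for a lock service (it is not a Redis client), with an explicit `now` parameter so expiry can be shown without sleeping.

```python
import time
import uuid
from typing import Dict, Optional, Tuple


class InMemoryLockStore:
    """Hypothetical stand-in for a lock service; not a real Redis client."""

    def __init__(self) -> None:
        # key -> (owner token, expiry time)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, key: str, ttl: float, now: Optional[float] = None) -> Optional[str]:
        """Return a fresh owner token, or None if the lock is held and unexpired."""
        now = time.monotonic() if now is None else now
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return None
        token = uuid.uuid4().hex
        self._locks[key] = (token, now + ttl)
        return token

    def release(self, key: str, token: str) -> bool:
        """Delete the lock only if `token` still owns it."""
        held = self._locks.get(key)
        if held is None or held[0] != token:
            return False
        del self._locks[key]
        return True


store = InMemoryLockStore()
first = store.acquire("lock:cron_job", ttl=30, now=0)      # server 1 gets the lock
blocked = store.acquire("lock:cron_job", ttl=30, now=10)   # still held: None
second = store.acquire("lock:cron_job", ttl=30, now=31)    # TTL expired: server 2 gets it
stale_release = store.release("lock:cron_job", first)      # server 1 finishes late
owner_release = store.release("lock:cron_job", second)     # server 2 releases normally
```

The late release from the first holder is rejected, so it cannot delete the second holder's lock. Against a real Redis server, the compare-and-delete step would typically need to run atomically on the server side, for example in a Lua script. Tokens do not stop the first job from still running after its TTL expires; they only keep the release safe.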

Quick Reference

CAP theorem: CP or AP (can't have both consistency and availability during a partition)
Consensus: Raft, Paxos (leader election, agreement)
Distributed locks: Redis, etcd, ZooKeeper