Distributed Systems Fundamentals
TL;DR
Distributed system: Multiple computers cooperating over a network to act as one system. Challenges: Network failures, clock synchronization, consensus. Consensus algorithms: Raft, Paxos (agree on shared state despite failures, typically by electing a leader).
Core Concepts
Challenges
| Problem | Impact | Solution |
|---|---|---|
| Network partition | Nodes can't communicate | Choose consistency or availability while the partition lasts (CAP theorem: CP or AP) |
| Partial failures | Some nodes fail, not all | Timeouts, retries, circuit breakers |
| Clock skew | Clocks don't match, so wall-clock timestamps can't order events reliably | Logical timestamps (Lamport clocks, vector clocks) |
| Concurrency | Conflicting updates | Distributed locks, consensus |
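
The vector clocks named in the table can be sketched as follows. This is a minimal illustration, not a production implementation: the function names (`tick`, `merge`, `happened_before`, `concurrent`) and the dict-based representation are assumptions made for this example.

```python
from typing import Dict

# Hypothetical representation: node name -> number of events seen from it.
VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local: VectorClock, remote: VectorClock) -> VectorClock:
    """Element-wise maximum, applied when a node receives a message."""
    nodes = set(local) | set(remote)
    return {n: max(local.get(n, 0), remote.get(n, 0)) for n in nodes}


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if `a` causally precedes `b`: a <= b everywhere, a < b somewhere."""
    nodes = set(a) | set(b)
    no_greater = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    strictly_less = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return no_greater and strictly_less


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if the clocks differ and neither precedes the other."""
    nodes = set(a) | set(b)
    differ = any(a.get(n, 0) != b.get(n, 0) for n in nodes)
    return differ and not happened_before(a, b) and not happened_before(b, a)


# Two nodes write independently: neither update precedes the other.
write_a = tick({}, "A")
write_b = tick({}, "B")
is_conflict = concurrent(write_a, write_b)

# Node B then receives A's write and makes a new one, which follows A's.
write_b2 = tick(merge(write_b, write_a), "B")
is_ordered = happened_before(write_a, write_b2)
```

Here `is_conflict` flags the kind of conflicting update the table's "Concurrency" row refers to, while `is_ordered` shows that causally related writes can be ordered without relying on synchronized wall clocks.
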
Consensus (Raft Algorithm)
Key properties:
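- Majority vote: Decisions need more than half of the nodes (3 of 5, 4 of 7), so a 5-node cluster tolerates 2 failures
- Leader election: One node at a time is elected leader and coordinates all writes
- Log replication: Leader replicates log entries to followers; an entry is committed once a majority has stored it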
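
The majority rule above reduces to simple arithmetic. The sketch below only illustrates that arithmetic; the function names are hypothetical, and it does not implement Raft itself (no terms, logs, or RPCs).

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that is strictly more than half."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can fail while a majority still remains reachable."""
    return cluster_size - majority(cluster_size)


def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate becomes leader once it holds a majority of votes
    (its own vote included)."""
    return votes_received >= majority(cluster_size)


quorum_of_5 = majority(5)            # 3
quorum_of_7 = majority(7)            # 4
failures_of_5 = tolerated_failures(5)  # 2
```

One consequence worth noting: a 4-node cluster needs 3 votes and tolerates only 1 failure, the same as a 3-node cluster, which is why odd cluster sizes are the usual choice.
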
Distributed Locks
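```python
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Acquire lock (SET if not exists, with a 30-second TTL)
locked = r.set("lock:cron_job", "server_1", nx=True, ex=30)
if locked:
    try:
        run_cron_job()
    finally:
        # Release only if this server still owns the lock: the TTL may have
        # expired and another server may hold it now (this check is not atomic)
        if r.get("lock:cron_job") == b"server_1":
            r.delete("lock:cron_job")
```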
Challenges:
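- Lock timeout: A TTL frees the lock if the holder crashes, but a job that runs longer than the TTL can overlap with the next holder; set the TTL well above the expected runtime
- Split brain: A single Redis node is a single point of failure; Redlock acquires the lock on a majority of independent Redis nodes (a quorum scheme, not a consensus algorithm), while etcd and ZooKeeper back their locks with consensus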
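
To make the lock-timeout behavior concrete, here is a small in-memory simulation. It is a sketch under stated assumptions: `TTLLockTable` is a hypothetical stand-in for a lock store with set-if-absent-plus-expiry semantics, and time is passed in explicitly as `now` so the scenario is deterministic. It is not how Redis is implemented.

```python
from typing import Dict, Tuple


class TTLLockTable:
    """Hypothetical in-memory lock store with set-if-absent and expiry."""

    def __init__(self) -> None:
        # key -> (owner, expiry time in seconds)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, key: str, owner: str, ttl: float, now: float) -> bool:
        holder = self._locks.get(key)
        if holder is not None and holder[1] > now:
            return False  # still held and not yet expired
        self._locks[key] = (owner, now + ttl)
        return True

    def release(self, key: str, owner: str, now: float) -> bool:
        holder = self._locks.get(key)
        if holder is None or holder[1] <= now or holder[0] != owner:
            return False  # missing, expired, or owned by someone else
        del self._locks[key]
        return True


locks = TTLLockTable()
first = locks.acquire("lock:cron_job", "server_1", ttl=30, now=0)
blocked = locks.acquire("lock:cron_job", "server_2", ttl=30, now=10)

# server_1 crashes without releasing; the TTL frees the lock after 30 s.
takeover = locks.acquire("lock:cron_job", "server_2", ttl=30, now=31)

# If server_1 was only slow, its late release must not remove server_2's lock.
stale_release = locks.release("lock:cron_job", "server_1", now=32)
owner_release = locks.release("lock:cron_job", "server_2", now=33)
```

The owner check in `release` is the design choice to notice: without it, a slow first holder would delete the second holder's lock and let a third process in.
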
Quick Reference
CAP theorem: During a partition, choose consistency (CP) or availability (AP); you can't have both
Consensus: Raft, Paxos (leader election, agreement)
Distributed locks: Redis, etcd, ZooKeeper