Reliability Patterns
TL;DR
Circuit breaker: stop calling a failing service. Retry: retry failed requests with backoff and jitter. Timeout: don't wait forever. Bulkhead: isolate resources so one failure stays contained. Graceful degradation: partial functionality is better than total failure.
Patterns
1. Circuit Breaker
States:
- Closed: Normal (requests go through)
- Open: Service failing (fast-fail, don't call)
- Half-open: Let a trial request through to test whether the service has recovered
Code example:
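circuit_breaker = CircuitBreaker(
    failure_threshold=5,  # Open after 5 failures
    timeout=60,           # Try again (half-open) after 60s
)

def call_with_breaker():
    if not circuit_breaker.is_closed():
        return fallback_response()  # Circuit is open: fast-fail, don't call
    try:
        response = call_service()
    except Exception:  # Avoid a bare except, which also traps KeyboardInterrupt
        circuit_breaker.record_failure()
        return fallback_response()
    circuit_breaker.record_success()
    return response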
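The example above assumes a CircuitBreaker class exists. As a rough illustration of the three states, here is a minimal sketch of what such a class might look like. Everything in it is hypothetical: the `state` property, the `allow_request` method (which stands in for the `is_closed()` check but also admits the half-open trial call), and the injectable `clock` parameter are assumptions made for the sketch, not the API of any particular library.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout   # seconds to stay open before a trial call
        self.clock = clock       # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None    # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.timeout:
            return "half-open"
        return "open"

    def allow_request(self):
        # Closed and half-open both let a call through; open fast-fails.
        return self.state != "open"

    def record_success(self):
        # A success (including a half-open trial) closes the circuit.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        # A failed half-open trial, or too many failures, (re)opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A production breaker would also need thread safety and would usually limit half-open to a single in-flight trial request; both are left out here to keep the state transitions easy to follow.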
2. Retry with Exponential Backoff
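import random
import time

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:  # Retry only errors that may succeed next time
            if attempt == max_retries - 1:
                break  # Out of attempts: don't sleep before giving up
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            time.sleep(wait + random.uniform(0, 1))  # Add jitter
    raise MaxRetriesExceeded()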
Jitter: A random extra delay spreads retries out and prevents a thundering herd (all clients retrying at the same moment).
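To make the delay calculation concrete, here is a small sketch of a delay function. Note that it uses a different variant from the retry code above: instead of adding up to one second to the exponential wait, it randomizes the whole delay (often called "full jitter") and caps it. The function name `backoff_delay` and the `base`, `cap`, and `rng` parameters are assumptions made for this illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Delay in seconds before retry number `attempt` (0-based), full jitter."""
    ceiling = min(cap, base * 2 ** attempt)  # 1, 2, 4, ... capped at `cap`
    return rng.uniform(0, ceiling)           # anywhere between 0 and ceiling


# Example: delays for the first five retries (values differ on every run).
delays = [backoff_delay(attempt) for attempt in range(5)]
```

The cap keeps late retries from waiting minutes, and drawing from the full range spreads clients out more than a small additive jitter does, at the cost of some retries firing almost immediately.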
3. Timeout
Always set timeouts:
response = http.get(url, timeout=5) # Fail after 5s
Without a timeout: A request to an unresponsive service can hang indefinitely, holding its thread and connection the whole time (resource leak).
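When a call offers no timeout parameter of its own, the caller can still bound how long it waits. The sketch below does this with Python's standard `concurrent.futures` module; the helper name `call_with_timeout`, the `fallback` parameter, and the `slow_call` stand-in are assumptions made for illustration.

```python
import concurrent.futures
import time

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback=None):
    """Run func() on a worker thread but stop waiting after timeout_s seconds."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller is unblocked, but the worker thread keeps running:
        # Python cannot forcibly stop it, so prefer native I/O timeouts.
        future.cancel()  # only has an effect if the task has not started yet
        return fallback


def slow_call():
    time.sleep(0.3)  # stands in for an unresponsive dependency
    return "late"


outcome = call_with_timeout(slow_call, timeout_s=0.05, fallback="timed out")
```

This only limits the caller's wait. The underlying work still occupies a worker until it finishes, which is why a timeout built into the client library (as in the `http.get` line above) is the better option whenever one is available.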
4. Bulkhead Pattern
Idea: Isolate resources so one failure doesn't cascade.
Example: Separate thread pools for critical vs non-critical operations.
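The thread pool example can be sketched with two executors from Python's standard library. The pool sizes and the `slow_report` and `charge_card` functions are hypothetical stand-ins chosen for this illustration.

```python
import concurrent.futures
import time

# Hypothetical pool sizes; each pool is a separate "compartment".
critical_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
non_critical_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)


def slow_report():
    time.sleep(0.2)   # stands in for a slow non-critical dependency
    return "report"


def charge_card():
    return "charged"  # stands in for a critical operation


# Saturate the non-critical pool: two tasks run, the rest queue behind them.
backlog = [non_critical_pool.submit(slow_report) for _ in range(4)]

# The critical pool has its own threads, so this call is not stuck in that queue.
result = critical_pool.submit(charge_card).result(timeout=0.1)
```

With a single shared pool of two threads, `charge_card` would have waited behind the queued reports. Separate pools trade some idle capacity for that isolation.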
5. Graceful Degradation
Example (Netflix):
- Recommendations service down? → Show generic popular movies
- Personalization service down? → Show cached results
Better partial functionality than complete failure.
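In code, the first fallback above might look like the following sketch. The names `get_home_page_rows`, `ServiceUnavailable`, `POPULAR_MOVIES`, and `broken_recommendations` are all hypothetical and only illustrate the shape of the pattern, not how Netflix implements it.

```python
POPULAR_MOVIES = ["Movie A", "Movie B", "Movie C"]  # hypothetical static fallback


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a dependency is down."""


def get_home_page_rows(user_id, fetch_recommendations):
    """Return (rows, degraded) for the home page."""
    try:
        return fetch_recommendations(user_id), False
    except ServiceUnavailable:
        # Degrade: serve generic content instead of an error page.
        return POPULAR_MOVIES, True


def broken_recommendations(user_id):
    raise ServiceUnavailable("recommendations service down")


rows, degraded = get_home_page_rows(42, broken_recommendations)
```

Returning a `degraded` flag alongside the data lets the caller log or surface the fact that it served fallback content, so an outage does not go unnoticed.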
Common Interview Questions
Q1: "What is a circuit breaker and why use it?"
Answer:
- Pattern: Stop calling failed service (fast-fail)
- States: Closed (normal) → Open (failing) → Half-open (testing)
- Why: Prevent cascading failures, give service time to recover
- Example: If payment service is down, don't keep trying (circuit opens)
Q2: "How do you prevent cascading failures?"
Answer:
- Circuit breaker: Stop calling failed dependencies
- Timeout: Don't wait forever
- Bulkhead: Isolate thread pools
- Graceful degradation: Return cached/default data
Q3: "Retry vs circuit breaker?"
Answer:
- Retry: Transient errors (network blip) - try again
- Circuit breaker: Service is down - stop trying, fast-fail
- Use both: Retry each call a few times with backoff; if failures keep accumulating past the threshold, the circuit breaker opens and later calls fast-fail
Quick Reference
Circuit breaker: Fast-fail when service down
Retry: Exponential backoff + jitter
Timeout: Always set (don't hang forever)
Bulkhead: Isolate resources
Graceful degradation: Partial > total failure