
Reliability Patterns

TL;DR

Circuit breaker: Stop calling a failing service. Retry: Retry failed requests with backoff and jitter. Timeout: Don't wait forever. Bulkhead: Isolate resources so failures don't spread. Graceful degradation: Partial functionality is better than total failure.

Patterns

1. Circuit Breaker

States:

  • Closed: Normal (requests go through)
  • Open: Service failing (fast-fail, don't call)
  • Half-open: Test if service recovered

Code example:

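# CircuitBreaker, call_service, and fallback_response are illustrative placeholders.
circuit_breaker = CircuitBreaker(
    failure_threshold=5,  # Open after 5 failures
    timeout=60,           # Try again after 60s
)

def guarded_call():
    if not circuit_breaker.is_closed():
        return fallback_response()  # Service is down: fast-fail, don't call it
    try:
        response = call_service()
    except Exception:
        circuit_breaker.record_failure()
        return fallback_response()
    circuit_breaker.record_success()
    return response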
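
The three states listed above can be sketched as a small class. This is a minimal, single-threaded illustration and not a production implementation: the `state` property, the `allow_request()` method, and the injectable `clock` parameter are assumptions made for this example. `allow_request()` plays the role of the `is_closed()` check in the snippet above, but it also lets calls through in the half-open state so that recovery can be detected.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch (single-threaded, illustrative only)."""

    def __init__(self, failure_threshold=5, timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout      # seconds to stay open before probing again
        self._clock = clock         # injectable so tests can control time
        self._failures = 0
        self._opened_at = None      # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.timeout:
            return "half-open"
        return "open"

    def allow_request(self):
        """Closed and half-open let a call through; open fast-fails."""
        return self.state != "open"

    def record_success(self):
        # Any success (including a half-open probe) closes the circuit.
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            # Trip (or re-trip) the breaker and restart the cool-down timer.
            self._opened_at = self._clock()
```

One simplification to be aware of: this sketch lets every request through while half-open. Stricter designs admit only a single probe request until it succeeds or fails.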

2. Retry with Exponential Backoff

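import random
import time

# TransientError and MaxRetriesExceeded are application-defined exceptions.
def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                break  # Last attempt failed: give up without sleeping
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            time.sleep(wait + random.uniform(0, 1))  # Add jitter
    raise MaxRetriesExceeded()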

Jitter: A random delay added to each wait prevents a thundering herd, where all clients retry at the same moment.
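
A common variant, often called "full jitter", randomizes the whole delay instead of adding a small random offset to it. The sketch below is one way to compute such a delay. The function name `backoff_delay` and the `base`, `cap`, and `rng` parameters are assumptions for illustration, and the cap is an addition that keeps waits from growing without bound.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
    return rng.uniform(0, ceiling)           # spread clients across [0, ceiling]
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.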

3. Timeout

Always set timeouts:

response = http.get(url, timeout=5)  # Fail after 5s

Without a timeout: A request to an unresponsive service can hang indefinitely, holding a thread or connection the whole time (resource leak).
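
As a concrete version using only the Python standard library, the sketch below wraps `urllib.request.urlopen`, which accepts a `timeout` argument in seconds. The helper name `fetch_with_timeout` and the choice to return `None` on failure are assumptions for this example. Real code might raise a domain-specific error or return a fallback instead.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout_s=5.0):
    """Return the response body, or None if the call fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout):
        # URLError covers connection failures (including connect timeouts);
        # socket.timeout covers a server that accepts but never answers.
        return None
```

The timeout applies to each blocking socket operation, not to the request as a whole, so a server that trickles data slowly can still take longer than `timeout_s` in total.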

4. Bulkhead Pattern

Idea: Isolate resources so one failure doesn't cascade.

Example: Separate thread pools for critical vs non-critical operations.
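
A minimal sketch of that idea with `concurrent.futures`: each class of work gets its own bounded pool, so a backlog of slow non-critical tasks cannot use up the threads that critical requests need. The pool names, pool sizes, and the `submit` helper are hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool sizes; real values depend on the workload.
critical_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="critical")
background_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="background")


def submit(task, *args, critical=False):
    """Route work to its own pool so one backlog cannot starve the other."""
    pool = critical_pool if critical else background_pool
    return pool.submit(task, *args)
```

Even if every background worker is stuck on a slow dependency, `submit(handler, critical=True)` still finds a free thread in the critical pool.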

5. Graceful Degradation

Example (Netflix):

  • Recommendations service down? → Show generic popular movies
  • Personalization service down? → Show cached results

Partial functionality is better than complete failure.
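
The fallback chain in the example above (live result, then cached result, then generic list) can be sketched as follows. The names `recommendations_for`, `fetch_personalized`, and `POPULAR_TITLES` are hypothetical, and a plain dict stands in for a real cache.

```python
POPULAR_TITLES = ["Popular title A", "Popular title B"]  # hypothetical default list


def recommendations_for(user_id, fetch_personalized, cache):
    """Return the best available result: live, then cached, then generic."""
    try:
        titles = fetch_personalized(user_id)
    except Exception:
        # Dependency is down: degrade instead of failing the whole page.
        return cache.get(user_id, POPULAR_TITLES)
    cache[user_id] = titles  # remember the last good answer for later outages
    return titles
```

Catching `Exception` broadly keeps the sketch short. In practice it is usually better to catch only the errors that signal an unavailable dependency, so that programming bugs still surface.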

Common Interview Questions

Q1: "What is a circuit breaker and why use it?"

Answer:

  • Pattern: Stop calling failed service (fast-fail)
  • States: Closed (normal) → Open (failing) → Half-open (testing)
  • Why: Prevent cascading failures, give service time to recover
  • Example: If payment service is down, don't keep trying (circuit opens)

Q2: "How do you prevent cascading failures?"

Answer:

  1. Circuit breaker: Stop calling failed dependencies
  2. Timeout: Don't wait forever
  3. Bulkhead: Isolate thread pools
  4. Graceful degradation: Return cached/default data

Q3: "Retry vs circuit breaker?"

Answer:

  • Retry: Transient errors (network blip) - try again
  • Circuit breaker: Service is down - stop trying, fast-fail
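  • Use both: Retry transient errors a few times with backoff; if failures keep accumulating past the breaker's threshold, the circuit opens and further calls fast-fail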

Quick Reference

Circuit breaker: Fast-fail when service down
Retry: Exponential backoff + jitter
Timeout: Always set (don't hang forever)
Bulkhead: Isolate resources
Graceful degradation: Partial > total failure

