Deployment Cascade Failure

The Interview Question

"We deployed a minor change to our user service. Within 10 minutes, 5 other services started failing. Rolling back the user service didn't help. The entire platform is down. What happened and how do we recover?"

Asked at: Netflix, Amazon, Google, any microservices company

Time to solve: 35-40 minutes

Difficulty: ⭐⭐⭐⭐ (Senior SRE)


Clarifying Questions to Ask

  1. "What kind of change was deployed?" → Code, config, resource limits?
  2. "How are services connected?" → Sync HTTP, async messages?
  3. "Do we have circuit breakers?" → If yes, why didn't they trip?
  4. "What do the logs show?" → Timeouts? Errors? Nothing?
  5. "Are any services healthy?" → Pattern of failure propagation

The Cascade Visualized

    User Service (query latency: 10ms → 2000ms)
          │  sync HTTP calls, 30s timeout
          ▼
    Order Service (threads blocked waiting on User Service)
          │  thread pool + DB connection pool exhausted
          ▼
    Payment / Recommendation / Notification Services
          │  database hits max connections, queries rejected
          ▼
    Database overloaded → aggressive retries → entire platform down

Root Cause Analysis

What Actually Happened

# The "minor change" added a new database query
class UserService:
    def get_user(self, user_id):
        user = db.query("SELECT * FROM users WHERE id = ?", user_id)

        # NEW: Added preferences lookup (seems harmless!)
        prefs = db.query("""
            SELECT * FROM user_preferences
            WHERE user_id = ?
        """, user_id)  # Missing index on user_id! Full table scan!

        return {**user, 'preferences': prefs}

The cascade:

  1. User Service: Response time goes from 10ms → 2000ms
  2. Order Service: 30-second timeout, threads blocked waiting
  3. Thread pool exhaustion: Order Service can't accept new requests
  4. Connection pool exhaustion: Database connections held by blocked threads
  5. Database overload: Max connections reached, queries rejected
  6. Everything fails: Services retry aggressively, making it worse
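The arithmetic behind steps 2-3 is Little's law: average in-flight requests = arrival rate × latency. A quick sketch with illustrative numbers (the 100 req/s and 200-thread pool are assumptions, not from the incident):

```python
def concurrent_requests(arrival_rate_rps: float, latency_s: float) -> float:
    """Little's law: average number of in-flight requests."""
    return arrival_rate_rps * latency_s

# Order Service at 100 req/s, with a 200-thread pool (illustrative):
healthy = concurrent_requests(100, 0.010)  # User Service at 10ms
degraded = concurrent_requests(100, 2.0)   # User Service at 2000ms

print(healthy)   # 1.0  -- one thread busy on average
print(degraded)  # 200.0 -- the entire pool is blocked
```

A 200x latency increase upstream turns a pool that was 0.5% utilized into one that is fully exhausted, with no change in traffic.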

Why Rollback Didn't Help

# By the time we rolled back:
# 1. Connection pools were exhausted
# 2. Thread pools were exhausted
# 3. Message queues were full
# 4. Database was overloaded

# Rolling back the code doesn't:
# - Return connections to pool
# - Clear blocked threads
# - Drain overfull queues
# - Reset rate limiters
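The connection-pool part of this can be made concrete with a toy pool (a plain semaphore, purely illustrative): once every permit is checked out by a blocked caller, deploying fixed code returns nothing to the pool; only restarting the holders or recreating the pool does.

```python
import threading

class ToyConnectionPool:
    """Minimal stand-in for a real pool: N permits, held until released."""
    def __init__(self, size: int):
        self._sem = threading.BoundedSemaphore(size)

    def acquire(self, timeout: float = 0.1) -> bool:
        return self._sem.acquire(timeout=timeout)

    def release(self):
        self._sem.release()

pool = ToyConnectionPool(size=5)

# Blocked threads check out every connection and never return them
for _ in range(5):
    assert pool.acquire()

# "Rolling back" (shipping fixed code) changes nothing about the pool:
assert pool.acquire() is False  # still exhausted

# Restarting the service (recreating the pool) is what frees the permits:
pool = ToyConnectionPool(size=5)
assert pool.acquire() is True
```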

Immediate Recovery

Step 1: Stop the Bleeding

# 1. Scale down the source of cascade
kubectl scale deployment user-service --replicas=0

# 2. Restart affected services to clear blocked resources
kubectl rollout restart deployment order-service
kubectl rollout restart deployment payment-service
kubectl rollout restart deployment recommendation-service

# 3. Temporarily increase database connection limits
# (Emergency only - not sustainable)
# NOTE: in PostgreSQL, max_connections only takes effect after a server
# restart - pg_reload_conf() is NOT enough for this setting.
psql -c "ALTER SYSTEM SET max_connections = 500;"
# ...then restart PostgreSQL, or instead raise limits at the connection
# pooler (e.g. PgBouncer), which can be adjusted without a restart.

Step 2: Drain Queues If Needed

# If queues are backed up, decide: process or discard?
def emergency_queue_drain(queue, action='dead_letter'):
    """
    Options:
    - 'dead_letter': Move to DLQ for later analysis
    - 'discard': Drop messages (data loss!)
    - 'process': Process at reduced rate
    """
    while queue.length() > threshold:
        msg = queue.get()

        if action == 'dead_letter':
            dead_letter_queue.put(msg)
        elif action == 'discard':
            log.warning(f"Discarding message: {msg.id}")
        elif action == 'process':
            try:
                process_with_timeout(msg, timeout=1)
            except Exception:
                dead_letter_queue.put(msg)

Step 3: Gradual Recovery

# Bring services back one at a time
# Start with services that have no dependencies

recovery_order = [
    'database',              # 1. Ensure DB is healthy
    'cache',                 # 2. Warm up caches
    'user-service',          # 3. Fixed code, minimal replicas
    'order-service',         # 4. One service at a time
    'payment-service',
    'notification-service',
]

for service in recovery_order:
    deploy_with_minimal_replicas(service)
    wait_for_health_check(service, timeout=60)

    if not is_healthy(service):
        rollback(service)
        alert(f"Recovery failed at {service}")
        break

    gradually_scale_up(service)
    monitor_for_5_minutes(service)

Prevention: Defense in Depth

Layer 1: Timeouts Everywhere

# Every external call needs a timeout
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

class ResilientHttpClient:
    def __init__(self):
        self.client = httpx.Client(
            timeout=httpx.Timeout(
                connect=1.0,  # Connection timeout
                read=5.0,     # Read timeout
                write=5.0,    # Write timeout
                pool=2.0      # Pool checkout timeout
            ),
            limits=httpx.Limits(
                max_connections=100,
                max_keepalive_connections=20
            )
        )

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=0.1, max=2)
    )
    def get(self, url):
        return self.client.get(url)

Layer 2: Circuit Breakers

from circuitbreaker import circuit, CircuitBreakerError

class OrderService:
    @circuit(
        failure_threshold=5,    # Open after 5 failures
        recovery_timeout=30,    # Try again after 30 seconds
        expected_exception=Exception
    )
    def get_user(self, user_id):
        return user_service_client.get(f'/users/{user_id}')

    def create_order(self, order_data):
        try:
            user = self.get_user(order_data.user_id)
        except CircuitBreakerError:
            # Circuit is open - use fallback
            user = self.get_user_from_cache(order_data.user_id)
            if not user:
                raise ServiceUnavailableError("User service down")

        return self.process_order(order_data, user)

Layer 3: Bulkheads (Resource Isolation)

from concurrent.futures import ThreadPoolExecutor

class BulkheadService:
    """
    Isolate resources per dependency.
    One failing dependency can't exhaust all resources.
    """

    def __init__(self):
        # Separate thread pools per dependency
        self.user_pool = ThreadPoolExecutor(max_workers=10)
        self.inventory_pool = ThreadPoolExecutor(max_workers=10)
        self.payment_pool = ThreadPoolExecutor(max_workers=5)

    def get_user(self, user_id):
        future = self.user_pool.submit(
            self._call_user_service, user_id
        )
        return future.result(timeout=5)  # Won't block other pools

    def check_inventory(self, product_id):
        future = self.inventory_pool.submit(
            self._call_inventory_service, product_id
        )
        return future.result(timeout=5)

Layer 4: Load Shedding

import random
import time
from collections import deque

class AdaptiveLoadShedder:
    """
    Reject requests when system is overloaded.
    Better to fail fast than cascade.
    """

    def __init__(self, max_latency_ms=100):
        self.max_latency = max_latency_ms
        self.latencies = deque(maxlen=1000)  # sliding window of recent requests
        self.rejection_rate = 0.0

    def should_accept(self) -> bool:
        if self.rejection_rate > 0:
            if random.random() < self.rejection_rate:
                return False
        return True

    def record_latency(self, latency_ms: float):
        self.latencies.append(latency_ms)
        self._update_rejection_rate()

    def _update_rejection_rate(self):
        if len(self.latencies) < 100:
            return

        avg_latency = sum(self.latencies) / len(self.latencies)

        if avg_latency > self.max_latency * 2:
            self.rejection_rate = min(0.9, self.rejection_rate + 0.1)
        elif avg_latency < self.max_latency:
            self.rejection_rate = max(0.0, self.rejection_rate - 0.05)

# Usage in middleware
load_shedder = AdaptiveLoadShedder()

@app.middleware
def load_shedding_middleware(request):
    if not load_shedder.should_accept():
        return Response(status=503, body="Service overloaded")

    start = time.time()
    response = handle_request(request)
    load_shedder.record_latency((time.time() - start) * 1000)

    return response

Layer 5: Graceful Degradation

class OrderService:
    def create_order(self, order_data):
        # Essential: Must succeed
        order = self.save_order(order_data)
        payment = self.process_payment(order)

        # Non-essential: Can fail gracefully
        try:
            recommendations = self.get_recommendations(order)
        except ServiceUnavailableError:
            recommendations = []  # Empty is OK

        try:
            self.send_confirmation_email(order)
        except ServiceUnavailableError:
            self.queue_email_for_later(order)  # Async retry

        try:
            self.update_analytics(order)
        except Exception:
            pass  # Analytics can be reconstructed later

        return OrderResponse(order, recommendations)

Architecture Patterns

resilience_patterns:
  per_service:
    - Circuit breaker (per dependency)
    - Bulkhead (isolated thread/connection pools)
    - Timeout (connect, read, write)
    - Retry with backoff

  per_request:
    - Load shedding
    - Rate limiting
    - Request hedging (for reads)

  infrastructure:
    - Health checks (liveness + readiness)
    - Auto-scaling
    - Canary deployments
    - Feature flags for quick disable
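Request hedging is the one pattern in this list with no example elsewhere on the page, so here is a minimal sketch: fire the read once, and if it has not answered within a hedge delay, fire a second copy and return whichever finishes first. All names here are illustrative, and this is only safe for idempotent reads.

```python
import concurrent.futures

def hedged_get(fetch, hedge_delay_s=0.05):
    """
    Call `fetch` once; if no result within hedge_delay_s, launch a second
    copy and return the first result to arrive. Idempotent reads only.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    try:
        futures = [pool.submit(fetch)]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_delay_s)
        if not done:
            futures.append(pool.submit(fetch))  # hedge with a second copy
            done, _ = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED
            )
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)  # let the losing request finish in background
```

Note the trade-off: hedging caps tail latency at roughly the hedge delay plus one fast response, but it adds load, which is exactly what you do not want during an active cascade; pair it with load shedding.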

Monitoring for Early Detection

# Prometheus alerts
groups:
  - name: cascade_early_warning
    rules:
      - alert: LatencySpike
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m])) > 1
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "P99 latency spike - potential cascade starting"

      - alert: ErrorRateSpike
        expr: rate(http_requests_total{status=~"5.."}[1m]) / rate(http_requests_total[1m]) > 0.1
        for: 30s
        labels:
          severity: critical

      - alert: ConnectionPoolExhaustion
        expr: hikaricp_connections_active / hikaricp_connections_max > 0.9
        for: 1m
        labels:
          severity: critical

      - alert: ThreadPoolExhaustion
        expr: executor_pool_size_threads / executor_pool_max_threads > 0.9
        for: 1m
        labels:
          severity: critical

Key Takeaways

  1. Timeouts are mandatory - No timeout = potential cascade
  2. Circuit breakers at every boundary - Fail fast, not slow
  3. Bulkheads isolate failures - One dependency shouldn't kill everything
  4. Load shed when overloaded - 503 is better than cascade
  5. Rollback is not enough - Need to clear blocked resources
  6. Monitor connection/thread pools - They're the canary

Golden rule: Design every service assuming its dependencies will fail, and that it will fail for its dependents.