Deployment Cascade Failure
The Interview Question
"We deployed a minor change to our user service. Within 10 minutes, 5 other services started failing. Rolling back the user service didn't help. The entire platform is down. What happened and how do we recover?"
Asked at: Netflix, Amazon, Google, any microservices company
Time to solve: 35-40 minutes
Difficulty: ⭐⭐⭐⭐ (Senior SRE)
Clarifying Questions to Ask
- "What kind of change was deployed?" → Code, config, resource limits?
- "How are services connected?" → Sync HTTP, async messages?
- "Do we have circuit breakers?" → If yes, why didn't they trip?
- "What do the logs show?" → Timeouts? Errors? Nothing?
- "Are any services healthy?" → Pattern of failure propagation
The Cascade Visualized
User Service (slow query) → Order Service (threads blocked on 30s timeouts) → thread and connection pool exhaustion → database overload → retries amplify load → platform-wide failure
Root Cause Analysis
What Actually Happened
# The "minor change" added a new database query
class UserService:
def get_user(self, user_id):
user = db.query("SELECT * FROM users WHERE id = ?", user_id)
# NEW: Added preferences lookup (seems harmless!)
prefs = db.query("""
SELECT * FROM user_preferences
WHERE user_id = ?
""", user_id) # Missing index! Full table scan!
return {**user, 'preferences': prefs}
The cascade:
- User Service: Response time goes from 10ms → 2000ms
- Order Service: 30-second timeout, threads blocked waiting
- Thread pool exhaustion: Order Service can't accept new requests
- Connection pool exhaustion: Database connections held by blocked threads
- Database overload: Max connections reached, queries rejected
- Everything fails: Services retry aggressively, making it worse
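Once the missing index is identified, the code-level fix is a one-line migration. A minimal sketch, assuming Postgres and the same db helper used in the snippet above (the index name is illustrative):

# Add the missing index so the preferences lookup becomes an index scan
# instead of a full table scan. CONCURRENTLY avoids blocking writes, but
# must run outside a transaction block.
db.execute("""
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_user_preferences_user_id
    ON user_preferences (user_id)
""")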
Why Rollback Didn't Help
# By the time we rolled back:
# 1. Connection pools were exhausted
# 2. Thread pools were exhausted
# 3. Message queues were full
# 4. Database was overloaded
# Rolling back the code doesn't:
# - Return connections to pool
# - Clear blocked threads
# - Drain overfull queues
# - Reset rate limiters
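A quick way to confirm that the pools are still saturated after the rollback is to look at the database's own view of its connections. A minimal sketch, assuming Postgres and psycopg2 (the DSN is hypothetical):

# Count backends by state; a pile of 'active' or 'idle in transaction'
# connections after the rollback means the pools never recovered.
import psycopg2

conn = psycopg2.connect("dbname=platform user=ops")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT state, count(*)
        FROM pg_stat_activity
        GROUP BY state
        ORDER BY count(*) DESC
    """)
    for state, count in cur.fetchall():
        print(f"{state or 'background'}: {count}")
conn.close()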
Immediate Recovery
Step 1: Stop the Bleeding
# 1. Scale down the source of cascade
kubectl scale deployment user-service --replicas=0
# 2. Restart affected services to clear blocked resources
kubectl rollout restart deployment order-service
kubectl rollout restart deployment payment-service
kubectl rollout restart deployment recommendation-service
# 3. Temporarily increase database connection limits
#    (Emergency only - not sustainable. Note: max_connections is not
#    reloadable; ALTER SYSTEM records the new value, but it only takes
#    effect after a Postgres restart.)
psql -c "ALTER SYSTEM SET max_connections = 500;"
psql -c "SELECT pg_reload_conf();"  # applies other changed settings; max_connections still needs a restart
Step 2: Drain Queues If Needed
# If queues are backed up, decide: process or discard?
def emergency_queue_drain(queue, dead_letter_queue, action='dead_letter', threshold=0):
    """
    Options:
    - 'dead_letter': Move to DLQ for later analysis
    - 'discard': Drop messages (data loss!)
    - 'process': Process at reduced rate
    """
    while queue.length() > threshold:
        msg = queue.get()
        if action == 'dead_letter':
            dead_letter_queue.put(msg)
        elif action == 'discard':
            log.warning(f"Discarding message: {msg.id}")
        elif action == 'process':
            try:
                process_with_timeout(msg, timeout=1)
            except Exception:
                dead_letter_queue.put(msg)
Step 3: Gradual Recovery
# Bring services back one at a time
# Start with services that have no dependencies
recovery_order = [
'database', # 1. Ensure DB is healthy
'cache', # 2. Warm up caches
'user-service', # 3. Fixed code, minimal replicas
'order-service', # 4. One service at a time
'payment-service',
'notification-service',
]
for service in recovery_order:
deploy_with_minimal_replicas(service)
wait_for_health_check(service, timeout=60)
if not is_healthy(service):
rollback(service)
alert("Recovery failed at {service}")
break
gradually_scale_up(service)
monitor_for_5_minutes(service)
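The helpers above are orchestration pseudocode. One way to implement the health gate, as a sketch assuming kubectl access and meaningful readiness probes, is to lean on kubectl rollout status:

import subprocess

def wait_for_health_check(service: str, timeout: int = 60) -> bool:
    """Block until the deployment reports all pods ready, or the timeout expires."""
    result = subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{service}",
         f"--timeout={timeout}s"],
        capture_output=True, text=True,
    )
    return result.returncode == 0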
Prevention: Defense in Depth
Layer 1: Timeouts Everywhere
# Every external call needs a timeout
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

class ResilientHttpClient:
    def __init__(self):
        self.client = httpx.Client(
            timeout=httpx.Timeout(
                connect=1.0,   # Connection timeout
                read=5.0,      # Read timeout
                write=5.0,     # Write timeout
                pool=2.0,      # Pool checkout timeout
            ),
            limits=httpx.Limits(
                max_connections=100,
                max_keepalive_connections=20,
            ),
        )

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=0.1, max=2),
    )
    def get(self, url):
        return self.client.get(url)
Layer 2: Circuit Breakers
from circuitbreaker import circuit, CircuitBreakerError

class OrderService:
    @circuit(
        failure_threshold=5,      # Open after 5 failures
        recovery_timeout=30,      # Try again after 30 seconds
        expected_exception=Exception,
    )
    def get_user(self, user_id):
        return user_service_client.get(f'/users/{user_id}')

    def create_order(self, order_data):
        try:
            user = self.get_user(order_data.user_id)
        except CircuitBreakerError:
            # Circuit is open - use fallback
            user = self.get_user_from_cache(order_data.user_id)
            if not user:
                raise ServiceUnavailableError("User service down")
        return self.process_order(order_data, user)
Layer 3: Bulkheads (Resource Isolation)
from concurrent.futures import ThreadPoolExecutor

class BulkheadService:
    """
    Isolate resources per dependency.
    One failing dependency can't exhaust all resources.
    """
    def __init__(self):
        # Separate thread pools per dependency
        self.user_pool = ThreadPoolExecutor(max_workers=10)
        self.inventory_pool = ThreadPoolExecutor(max_workers=10)
        self.payment_pool = ThreadPoolExecutor(max_workers=5)

    def get_user(self, user_id):
        future = self.user_pool.submit(
            self._call_user_service, user_id
        )
        return future.result(timeout=5)  # Won't block other pools

    def check_inventory(self, product_id):
        future = self.inventory_pool.submit(
            self._call_inventory_service, product_id
        )
        return future.result(timeout=5)
Layer 4: Load Shedding
import random
import time
from collections import deque

class AdaptiveLoadShedder:
    """
    Reject requests when the system is overloaded.
    Better to fail fast than cascade.
    """
    def __init__(self, max_latency_ms=100):
        self.max_latency = max_latency_ms
        self.latencies = deque(maxlen=1000)
        self.rejection_rate = 0.0

    def should_accept(self) -> bool:
        if self.rejection_rate > 0 and random.random() < self.rejection_rate:
            return False
        return True

    def record_latency(self, latency_ms: float):
        self.latencies.append(latency_ms)
        self._update_rejection_rate()

    def _update_rejection_rate(self):
        if len(self.latencies) < 100:
            return
        avg_latency = sum(self.latencies) / len(self.latencies)
        if avg_latency > self.max_latency * 2:
            self.rejection_rate = min(0.9, self.rejection_rate + 0.1)
        elif avg_latency < self.max_latency:
            self.rejection_rate = max(0.0, self.rejection_rate - 0.05)

# Usage in middleware (framework-style pseudocode)
load_shedder = AdaptiveLoadShedder()

@app.middleware
def load_shedding_middleware(request):
    if not load_shedder.should_accept():
        return Response(status=503, body="Service overloaded")
    start = time.time()
    response = handle_request(request)
    load_shedder.record_latency((time.time() - start) * 1000)
    return response
Layer 5: Graceful Degradation
class OrderService:
    def create_order(self, order_data):
        # Essential: Must succeed
        order = self.save_order(order_data)
        payment = self.process_payment(order)

        # Non-essential: Can fail gracefully
        try:
            recommendations = self.get_recommendations(order)
        except ServiceUnavailableError:
            recommendations = []  # Empty is OK

        try:
            self.send_confirmation_email(order)
        except ServiceUnavailableError:
            self.queue_email_for_later(order)  # Async retry

        try:
            self.update_analytics(order)
        except Exception:
            pass  # Analytics can be reconstructed later

        return OrderResponse(order, recommendations)
Architecture Patterns
resilience_patterns:
  per_service:
    - Circuit breaker (per dependency)
    - Bulkhead (isolated thread/connection pools)
    - Timeout (connect, read, write)
    - Retry with backoff
  per_request:
    - Load shedding
    - Rate limiting
    - Request hedging (for reads - see the sketch below)
  infrastructure:
    - Health checks (liveness + readiness)
    - Auto-scaling
    - Canary deployments
    - Feature flags for quick disable
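Most of these patterns appear in the layer snippets above; request hedging is the one that doesn't, so here is a minimal sketch. Assumptions: the read is idempotent, the replica URLs are interchangeable, and tail latency matters more than the extra load of an occasional duplicate request:

import concurrent.futures
import httpx

def hedged_get(urls, hedge_delay=0.05, timeout=2.0):
    """Send the read to the first replica; if it hasn't answered within
    hedge_delay seconds, also send it to the next one and return whichever
    responds first. Only safe for idempotent reads."""
    with httpx.Client(timeout=timeout) as client, \
         concurrent.futures.ThreadPoolExecutor(max_workers=len(urls)) as pool:
        futures = [pool.submit(client.get, urls[0])]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_delay)
        if not done and len(urls) > 1:
            futures.append(pool.submit(client.get, urls[1]))
        done, pending = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        for f in pending:
            f.cancel()  # best effort; the duplicate may still complete server-side
        return next(iter(done)).result()

# Usage (hypothetical replica URLs):
# resp = hedged_get(["http://user-1:8000/users/42", "http://user-2:8000/users/42"])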
Monitoring for Early Detection
# Prometheus alerts
groups:
  - name: cascade_early_warning
    rules:
      - alert: LatencySpike
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m])) > 1
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "P99 latency spike - potential cascade starting"

      - alert: ErrorRateSpike
        expr: rate(http_requests_total{status=~"5.."}[1m]) / rate(http_requests_total[1m]) > 0.1
        for: 30s
        labels:
          severity: critical

      - alert: ConnectionPoolExhaustion
        expr: hikaricp_connections_active / hikaricp_connections_max > 0.9
        for: 1m
        labels:
          severity: critical

      - alert: ThreadPoolExhaustion
        expr: executor_pool_size_threads / executor_pool_max_threads > 0.9
        for: 1m
        labels:
          severity: critical
Key Takeaways
- Timeouts are mandatory - No timeout = potential cascade
- Circuit breakers at every boundary - Fail fast, not slow
- Bulkheads isolate failures - One dependency shouldn't kill everything
- Load shed when overloaded - 503 is better than cascade
- Rollback is not enough - Need to clear blocked resources
- Monitor connection/thread pools - They're the canary
Golden rule: Design every service assuming its dependencies will fail, and that it will fail for its dependents.