Advanced Interview Scenarios

TL;DR

These are the "what would you do if..." problems that test your production experience and crisis management. These scenarios separate Staff engineers from Senior engineers.


Problem 11: GDPR "Right to be Forgotten" at Scale

The Scenario

"A user requests deletion of all their data under GDPR. We have 50+ microservices, 10+ databases, Kafka topics with 30-day retention, ML models trained on their data, and backups going back 7 years. You have 72 hours. Go."

Why It's Hard

User data is scattered across dozens of stores with very different deletion semantics: mutable relational rows, append-only Kafka topics, ML training sets, and immutable backups. No single team knows where all of it lives, and the 72-hour clock starts immediately.

Solution: Data Catalog + Soft Delete + Async Purge

Phase 1: Build Data Catalog (should exist before request)

```yaml
# data_catalog.yaml
user_data_locations:
  - service: user-service
    database: postgres
    tables: [users, user_preferences, user_sessions]
    pii_columns: [email, phone, address]

  - service: order-service
    database: mysql
    tables: [orders, shipping_addresses]
    foreign_key: user_id

  - service: analytics
    database: bigquery
    tables: [user_events, page_views]
    retention: 90_days

  - service: ml-pipeline
    storage: s3://ml-training-data
    action: retrain_model_without_user

  - service: kafka
    topics: [user-events, order-events]
    retention: 30_days
    action: wait_for_expiry
```

Phase 2: Soft Delete (Immediate - within minutes)

```python
def initiate_deletion(user_id):
    # 1. Mark user as deleted (prevents new data creation)
    db.execute("""
        UPDATE users
        SET status = 'PENDING_DELETION',
            email = ?,
            deleted_at = NOW()
        WHERE id = ?
    """, f"deleted-{user_id}@redacted.com", user_id)

    # 2. Revoke all access tokens
    auth_service.revoke_all_tokens(user_id)

    # 3. Queue async deletion job
    deadline = datetime.now() + timedelta(hours=72)
    queue.enqueue("user_deletion", {
        "user_id": user_id,
        "requested_at": datetime.now(),
        "deadline": deadline
    })

    return {"status": "deletion_initiated", "completion_by": deadline}
```

Phase 3: Async Purge (Background - within 72 hours)

```python
class UserDeletionJob:
    def execute(self, user_id):
        deletion_report = []

        for location in data_catalog.get_locations(user_id):
            try:
                if location.type == "database":
                    self.delete_from_db(location, user_id)
                elif location.type == "s3":
                    self.delete_from_s3(location, user_id)
                elif location.type == "elasticsearch":
                    self.delete_from_es(location, user_id)
                elif location.type == "kafka":
                    # Can't delete from Kafka - wait for retention
                    self.log_kafka_retention(location, user_id)

                deletion_report.append({"location": location, "status": "deleted"})
            except Exception as e:
                deletion_report.append({"location": location, "status": "failed", "error": str(e)})
                alert_oncall(f"GDPR deletion failed for {user_id} at {location}")

        # Store deletion certificate
        self.store_deletion_certificate(user_id, deletion_report)
```

Phase 4: Handle Backups (The Hard Part)

```python
# Option 1: Crypto-shredding (preferred)
# Each user's data encrypted with a unique key
# Delete the key = data unrecoverable

def delete_user_encryption_key(user_id):
    key_id = f"user-key-{user_id}"
    kms.schedule_key_deletion(key_id, pending_days=7)
    # After 7 days, all encrypted backups are unreadable

# Option 2: Lazy deletion on restore
# Mark user as deleted, filter during restore

def restore_backup(backup_date):
    data = load_backup(backup_date)
    deleted_users = get_deleted_user_ids()
    return filter_out_users(data, deleted_users)
```

Interview Follow-ups

Q: What about ML models trained on this user's data?

You can't "untrain" a model. Options:

  1. Retrain model without user's data (expensive)
  2. Differential privacy from the start (prevents this problem)
  3. Document that model weights are anonymized aggregate data (legal gray area)

Q: What about Kafka topics?

Can't delete individual messages. Options:

  1. Wait for retention period (if < 72 hours)
  2. Encrypt user data in Kafka, delete key
  3. Compact topics with tombstones (for keyed topics)
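Option 2 (encrypt in Kafka, then delete the key) can be sketched as below. This is a toy illustration: `UserKeyStore`, `encrypt_event`, and the SHA-256 counter-mode keystream are stand-ins invented here; a real system would hold per-user keys in a KMS and use an authenticated cipher such as AES-GCM.

```python
import hashlib
import secrets

def keystream(key: bytes, n: int) -> bytes:
    # Toy SHA-256 counter-mode keystream (demo only; use AES-GCM via a KMS in production)
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_bytes(data: bytes, ks: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, ks))

class UserKeyStore:
    """Per-user encryption keys; deleting a key 'shreds' that user's messages."""
    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Create the user's key on first use
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id):
        return self._keys.get(user_id)

    def shred(self, user_id):
        # GDPR deletion: drop the key; ciphertext still in Kafka becomes unreadable
        self._keys.pop(user_id, None)

keys = UserKeyStore()

def encrypt_event(user_id, payload: bytes) -> bytes:
    return xor_bytes(payload, keystream(keys.key_for(user_id), len(payload)))

def decrypt_event(user_id, blob: bytes):
    key = keys.get(user_id)
    if key is None:
        return None  # Key shredded: the event can no longer be read
    return xor_bytes(blob, keystream(key, len(blob)))
```

The appeal is that the Kafka log itself never changes: deletion becomes a single key-store operation, regardless of retention settings.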

Problem 12: The Thundering Herd

The Scenario

"Every day at midnight UTC, all caches expire simultaneously. At 00:00:01, we get 500K requests to the database, it falls over, and the site is down for 10 minutes. How do you fix this?"

Root Cause

Every cache entry is written with the same fixed TTL (typically by a nightly warm-up job), so all entries expire at the same instant. Every request then misses the cache and goes straight to the database.

Solution 1: Jittered TTL

```python
def cache_with_jitter(key, value, base_ttl=3600):
    # Add random jitter: 3600 ± 360 seconds (±10%)
    jitter = random.randint(-base_ttl // 10, base_ttl // 10)
    actual_ttl = base_ttl + jitter

    cache.set(key, value, ttl=actual_ttl)

# Expirations spread across a 12-minute window instead of all at once
```

Solution 2: Probabilistic Early Refresh

```python
def get_with_early_refresh(key, refresh_func):
    value, ttl_remaining = cache.get_with_ttl(key)

    if value is None:
        # Cache miss - fetch and store
        value = refresh_func()
        cache.set(key, value, ttl=3600)
        return value

    # Probabilistically refresh before expiry
    # Probability increases as TTL decreases
    expiry_probability = 1.0 / max(ttl_remaining, 1)

    if random.random() < expiry_probability:
        # Refresh in background (don't block)
        thread_pool.submit(lambda: cache.set(key, refresh_func(), ttl=3600))

    return value
```

Solution 3: Single-Flight Pattern (Mutex)

```python
import threading

class SingleFlight:
    def __init__(self):
        self.locks = {}
        self.results = {}

    def do(self, key, func):
        # Check if another thread is already fetching this key.
        # (Illustrative: a production version also needs a mutex around dict access.)
        event = self.locks.get(key)
        if event is not None:
            event.wait()  # Wait for the fetching thread's result
            return self.results[key]

        # This thread will fetch
        self.locks[key] = threading.Event()

        try:
            result = func()
            self.results[key] = result
            cache.set(key, result, ttl=3600)
            return result
        finally:
            self.locks[key].set()  # Signal waiters
            del self.locks[key]

# Usage
single_flight = SingleFlight()

def get_product(product_id):
    cached = cache.get(f"product:{product_id}")
    if cached:
        return cached

    # Only ONE request goes to the DB; others wait for its result
    return single_flight.do(
        f"product:{product_id}",
        lambda: db.query("SELECT * FROM products WHERE id = ?", product_id)
    )
```

Solution 4: Background Refresh (Best for Hot Keys)

```python
# Cron job runs every minute
@scheduled(every=60)
def refresh_hot_cache():
    hot_keys = analytics.get_hot_keys(top_n=1000)

    for key in hot_keys:
        ttl = cache.ttl(key)
        # Refresh if expiring in the next 5 minutes
        if ttl < 300:
            value = fetch_from_db(key)
            cache.set(key, value, ttl=3600)

# Result: Hot keys never expire (always refreshed before TTL)
```

Problem 13: The Split-Brain Database

The Scenario

"We had a network partition. Both database nodes thought they were primary and accepted writes for 3 minutes. Now we have conflicting data. How do we reconcile?"

The Horror

Solution: Conflict Detection + Resolution

Step 1: Identify Conflicts

```sql
-- Export writes from both nodes during the partition window
-- Node A writes
SELECT * FROM orders
WHERE updated_at BETWEEN '2024-01-15 10:00:00' AND '2024-01-15 10:03:00';

-- Node B writes (same query, run against node B)
SELECT * FROM orders
WHERE updated_at BETWEEN '2024-01-15 10:00:00' AND '2024-01-15 10:03:00';

-- Find conflicts (same primary key, different data)
SELECT a.id, a.status AS status_a, b.status AS status_b
FROM node_a_orders a
JOIN node_b_orders b ON a.id = b.id
WHERE a.status != b.status;
```

Step 2: Automated Resolution (Where Possible)

```python
def resolve_conflict(record_a, record_b, conflict_type, original_count=None):
    if conflict_type == "order_status":
        # Business rule: the later status in the order lifecycle wins
        status_order = ["pending", "paid", "shipped", "delivered", "cancelled"]
        if status_order.index(record_a.status) > status_order.index(record_b.status):
            return record_a
        return record_b

    elif conflict_type == "counter":
        # Counters: both nodes incremented from the same pre-partition value,
        # so sum the increments (original_count = the value before the partition)
        return record_a.count + record_b.count - original_count

    elif conflict_type == "last_write_wins":
        # Simple LWW based on timestamp
        return record_a if record_a.updated_at > record_b.updated_at else record_b

    else:
        # Can't auto-resolve - flag for manual review
        create_conflict_ticket(record_a, record_b)
        return None
```

Step 3: Manual Resolution Queue

```python
# For conflicts that can't be auto-resolved
def create_conflict_ticket(record_a, record_b):
    ticket = {
        "type": "data_conflict",
        "priority": "P1",
        "table": record_a.table,
        "id": record_a.id,
        "node_a_data": record_a,
        "node_b_data": record_b,
        "suggested_resolution": "MANUAL_REVIEW",
        "customer_impact": assess_customer_impact(record_a.id)
    }

    jira.create_ticket(ticket)
    notify_oncall("Split-brain conflict requires manual resolution")
```

Prevention: Proper Fencing

```python
# Use fencing tokens to prevent split-brain
class Database:
    def __init__(self):
        self.fencing_token = 0

    def acquire_primary(self, node_id):
        # Increment fencing token
        self.fencing_token += 1
        token = self.fencing_token

        # Store in distributed lock (etcd/ZooKeeper)
        distributed_lock.set("db_primary", {
            "node": node_id,
            "token": token
        })

        return token

    def write(self, data, fencing_token):
        current = distributed_lock.get("db_primary")

        # Reject writes carrying a stale fencing token
        if fencing_token < current["token"]:
            raise StaleLeaderException("You are no longer primary")

        self.do_write(data)
```

Problem 14: Debugging Intermittent Failures

The Scenario

"1% of requests randomly fail with a generic 500 error. It happens across all endpoints, all users, no pattern. We've been debugging for 2 weeks. Help."

Systematic Debugging Approach

Step 1: Correlate Everything

```python
# Enrich error logs with ALL context
def log_error(error, request):
    log.error({
        "error": str(error),
        "stack_trace": traceback.format_exc(),

        # Request context
        "request_id": request.id,
        "user_id": request.user_id,
        "endpoint": request.path,
        "method": request.method,
        "payload_size": len(request.body),

        # Infrastructure context
        "server_id": os.environ["SERVER_ID"],
        "container_id": os.environ["HOSTNAME"],
        "availability_zone": get_az(),
        "deployment_version": os.environ["VERSION"],

        # Timing
        "timestamp": datetime.now().isoformat(),
        "request_duration_ms": request.duration_ms,

        # Dependencies
        "db_connection_pool_used": db.pool.used_connections,
        "db_connection_pool_max": db.pool.max_connections,
        "redis_latency_ms": redis.last_latency_ms,
        "external_api_latency_ms": external_api.last_latency_ms,

        # System resources
        "memory_percent": psutil.virtual_memory().percent,
        "cpu_percent": psutil.cpu_percent(),
        "open_file_descriptors": len(os.listdir('/proc/self/fd')),
    })
```

Step 2: Look for Patterns

```sql
-- Query error logs for patterns
SELECT
    server_id,
    COUNT(*) AS error_count,
    AVG(memory_percent) AS avg_memory,
    AVG(db_connection_pool_used) AS avg_db_connections
FROM error_logs
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY server_id
ORDER BY error_count DESC;

-- Aha! Server 7 has 80% of errors and 95% memory usage
```

Step 3: Common Culprits Checklist

| Symptom | Likely Cause | How to Verify |
|---|---|---|
| Errors spike at :00 and :30 | Cron job interference | Check crontab |
| Errors correlate with memory | Memory leak | Profile with heapy |
| Errors correlate with one server | Bad deploy, hardware | Compare deployments |
| Errors during high traffic | Resource exhaustion | Load test |
| Errors after N hours | Connection pool leak | Monitor pool size |
| Random distribution | Race condition | Add distributed tracing |
| Errors with large payloads | Timeout, memory | Check payload sizes |

Step 4: The Usual Suspects

```python
# 1. Connection pool exhaustion
# Symptom: Works for hours, then all requests fail
def check_connection_pool():
    if db.pool.used >= db.pool.max * 0.9:
        log.warn("Connection pool nearly exhausted!")
        # Look for: long-running queries, uncommitted transactions

# 2. File descriptor leak
# Symptom: "Too many open files" errors
def check_fd_leak():
    fd_count = len(os.listdir('/proc/self/fd'))
    if fd_count > 1000:
        log.warn(f"High FD count: {fd_count}")
        # Look for: unclosed HTTP connections, file handles

# 3. Memory leak
# Symptom: OOM after N hours
def check_memory_leak():
    import tracemalloc
    tracemalloc.start()
    # ... run for a while ...
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')[:10]
    # Shows which lines allocate the most memory

# 4. DNS resolution failure
# Symptom: Random timeouts to external services
def check_dns():
    import socket
    try:
        socket.setdefaulttimeout(1)
        socket.gethostbyname("api.external-service.com")
    except (socket.timeout, socket.gaierror):
        log.error("DNS resolution failed!")
        # Look for: /etc/resolv.conf, DNS server issues
```

Problem 15: The Deployment That Broke Everything

The Scenario

"We deployed a 'safe' config change. 5 minutes later, error rate went from 0.1% to 50%. We rolled back but errors continued for 2 hours. What happened?"

Investigation Timeline

```
14:00 - Deploy config change (timeout: 30s → 5s)
14:05 - Error rate spikes to 50%
14:07 - Rollback initiated
14:10 - Rollback complete
14:15 - Error rate still 50% (WHY?!)
16:10 - Errors finally subside
```

Root Cause: Cascading Failure

Cutting the timeout from 30s to 5s turned slow-but-successful queries into failures. Clients retried, retries piled onto the connection pool, and the database saturated. By the time of the rollback, the system was in a self-sustaining retry storm.

Why Rollback Didn't Fix It

```python
# The problem: Rollback doesn't fix queued requests

# During the incident:
# - 100K requests queued waiting for the DB
# - Each request holds a connection
# - DB can only process 1K/sec
# - Queue takes 100 seconds to drain

# But worse:
# - Users see timeouts and refresh
# - Refreshes create NEW requests
# - Queue never drains

# Solution: Shed load during recovery
def handle_request():
    if circuit_breaker.is_open():
        # Fast-fail during recovery
        return Response(503, "Service recovering, please retry in 60s",
                        headers={"Retry-After": "60"})

    # Normal processing
    return process_request()
```

The Fix: Controlled Recovery

```python
class ControlledRecovery:
    def __init__(self):
        self.max_concurrent = 100
        self.current = 0
        self.recovery_mode = False

    def enter_recovery_mode(self):
        self.recovery_mode = True
        self.max_concurrent = 10  # Drastically reduce

    def gradual_recovery(self):
        # Slowly increase capacity
        while self.max_concurrent < 1000:
            if error_rate() < 0.01:  # Less than 1% errors
                self.max_concurrent += 10
            else:
                self.max_concurrent = max(10, self.max_concurrent - 20)  # Back off
            time.sleep(60)  # Re-evaluate every minute

    def handle_request(self):
        if self.current >= self.max_concurrent:
            return Response(503, "Shedding load", headers={"Retry-After": "30"})

        self.current += 1
        try:
            return process_request()
        finally:
            self.current -= 1
```

Prevention: Config Change Checklist

```markdown
## Before Any Config Change

- [ ] What's the blast radius? (All users? One region?)
- [ ] Can we canary this? (1% of traffic first)
- [ ] What metrics will show problems? (Set alerts)
- [ ] How fast can we detect issues? (< 1 minute)
- [ ] How fast can we rollback? (< 5 minutes)
- [ ] What happens if rollback doesn't fix it?
- [ ] Is there a dependency that won't recover automatically?
```
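The "canary this" item is often implemented with stable hash bucketing, so the same 1% of users consistently see the new value across requests. A minimal sketch, with illustrative function names not taken from the text:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    # Stable hash bucketing: a given user always lands in the same bucket,
    # so canary users see consistent behavior across requests
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def effective_timeout(user_id: str, old: int = 30, new: int = 5, canary_percent: int = 1) -> int:
    # Serve the new config value to canary_percent of users; everyone else keeps the old one
    return new if in_canary(user_id, canary_percent) else old
```

Ramping the rollout is then just raising `canary_percent` while watching the error-rate metrics from the checklist.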

Problem 16: Multi-Tenant Noisy Neighbor

The Scenario

"We're a SaaS platform. One enterprise customer started running huge batch jobs, and now all other customers are experiencing slowdowns. We can't just throttle them - they pay $500K/year."

Solution: Fair Resource Allocation

Option 1: Request Queuing with Priority

```python
class FairScheduler:
    def __init__(self):
        self.tenant_quotas = {}   # tenant_id → tokens/second
        self.tenant_buckets = {}  # tenant_id → current tokens

    def acquire(self, tenant_id, tokens=1):
        bucket = self.tenant_buckets.get(tenant_id, self.tenant_quotas[tenant_id])

        if bucket >= tokens:
            self.tenant_buckets[tenant_id] = bucket - tokens
            return True
        else:
            # Queue the request (don't reject)
            return self.queue_request(tenant_id, tokens)

    def refill_buckets(self):
        # Called every second
        for tenant_id, quota in self.tenant_quotas.items():
            current = self.tenant_buckets.get(tenant_id, 0)
            self.tenant_buckets[tenant_id] = min(current + quota, quota * 2)  # Max burst = 2x
```

Option 2: Isolated Resources

```python
# Route enterprise customers to dedicated resources
def get_db_connection(tenant_id):
    tier = get_tenant_tier(tenant_id)

    if tier == "enterprise":
        return enterprise_db_pool.get_connection()
    else:
        return shared_db_pool.get_connection()
```

Option 3: Background Job Separation

```python
# Separate queues for batch vs interactive workloads
class JobRouter:
    def route(self, job, tenant_id):
        if job.type == "batch":
            # Low-priority queue, limited concurrency
            return batch_queue.enqueue(job, priority="low", max_concurrent=5)
        else:
            # High-priority queue for interactive requests
            return interactive_queue.enqueue(job, priority="high")
```

Problem 17: The Impossible Bug - Time Zone Edition

The Scenario

"Twice a year, on the day clocks change for daylight saving, our billing system either double-charges customers or skips charges entirely. We've 'fixed' it three times."

The Problem

```python
# Buggy code
def get_charges_for_day(date):
    start = datetime(date.year, date.month, date.day, 0, 0, 0)  # Midnight
    end = start + timedelta(days=1)  # Next midnight

    return db.query("SELECT * FROM charges WHERE created_at BETWEEN ? AND ?", start, end)

# On the DST "spring forward" day:
# - 2 AM doesn't exist (clocks jump to 3 AM)
# - The day is only 23 hours long
# - Charges can fall into the "missing" hour

# On the DST "fall back" day:
# - 2 AM happens twice
# - The day is 25 hours long
# - Some charges are counted twice
```

The Fix: Use UTC + Timezone-Aware Comparisons

```python
from datetime import datetime, timedelta, timezone
import pytz

def get_charges_for_day(date, user_timezone="America/New_York"):
    tz = pytz.timezone(user_timezone)

    # Get midnight in the user's timezone, then convert to UTC
    local_midnight = tz.localize(datetime(date.year, date.month, date.day, 0, 0, 0))
    # normalize() corrects the UTC offset when the day crosses a DST change
    local_next_midnight = tz.normalize(local_midnight + timedelta(days=1))

    # Convert to UTC for the database query
    utc_start = local_midnight.astimezone(timezone.utc)
    utc_end = local_next_midnight.astimezone(timezone.utc)

    # Database stores everything in UTC
    return db.query(
        "SELECT * FROM charges WHERE created_at >= ? AND created_at < ?",
        utc_start, utc_end
    )

# This correctly handles:
# - 23-hour days (spring forward)
# - 25-hour days (fall back)
# - Users in different timezones
```

Other Time Landmines

```python
# Landmine 1: Leap seconds
# Solution: Use NTP, accept ±1 second precision

# Landmine 2: February 29
def get_same_day_next_year(date):
    try:
        return date.replace(year=date.year + 1)
    except ValueError:
        # Feb 29 → Feb 28 in a non-leap year
        return date.replace(year=date.year + 1, day=28)

# Landmine 3: Month arithmetic
# "One month from January 31" = ?
from dateutil.relativedelta import relativedelta
jan_31 = datetime(2024, 1, 31)
one_month_later = jan_31 + relativedelta(months=1)  # Feb 29, 2024

# Landmine 4: Server timezone vs user timezone
# Solution: ALWAYS store UTC, convert on display
```

Problem 18: Webhook Delivery Reliability

The Scenario

"Partners complain they're missing webhook events. We fire webhooks and log success, but partners say they never received them. Trust is eroding."

Solution: Reliable Webhook System

Implementation

```python
import hashlib
import hmac
import json

class WebhookDelivery:
    RETRY_DELAYS = [60, 300, 900, 3600, 7200, 14400, 28800]  # Exponential backoff (seconds)

    def send_webhook(self, webhook):
        try:
            response = http.post(
                webhook.url,
                json=webhook.payload,
                headers={
                    "X-Webhook-ID": webhook.id,
                    "X-Webhook-Timestamp": str(webhook.created_at.timestamp()),
                    "X-Webhook-Signature": self.sign(webhook.payload, webhook.secret)
                },
                timeout=30
            )

            if response.status_code == 200:
                self.mark_delivered(webhook)
            else:
                self.schedule_retry(webhook)

        except Exception:
            self.schedule_retry(webhook)

    def schedule_retry(self, webhook):
        if webhook.attempt >= len(self.RETRY_DELAYS):
            self.move_to_dlq(webhook)
            return

        delay = self.RETRY_DELAYS[webhook.attempt]
        webhook.attempt += 1
        webhook.next_retry = datetime.now() + timedelta(seconds=delay)
        queue.enqueue(webhook, delay=delay)

    def sign(self, payload, secret):
        # HMAC signature so the partner can verify authenticity
        return hmac.new(
            secret.encode(),
            json.dumps(payload).encode(),
            hashlib.sha256
        ).hexdigest()
```

Partner-Side Verification

```python
# Partners should verify the signature
def handle_webhook(request):
    expected_signature = hmac.new(
        MY_WEBHOOK_SECRET.encode(),
        request.body,
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison prevents timing attacks
    if not hmac.compare_digest(request.headers["X-Webhook-Signature"], expected_signature):
        return Response(401, "Invalid signature")

    # Idempotency check
    webhook_id = request.headers["X-Webhook-ID"]
    if already_processed(webhook_id):
        return Response(200, "Already processed")

    process_webhook(request.body)
    mark_processed(webhook_id)
    return Response(200, "OK")
```

Dashboard for Partners

```python
# Let partners see their webhook history
@app.route("/partner/webhooks")
def webhook_dashboard():
    webhooks = db.query("""
        SELECT id, event_type, status, attempts, created_at, delivered_at
        FROM webhooks
        WHERE partner_id = ?
        ORDER BY created_at DESC
        LIMIT 100
    """, current_partner.id)

    return render("webhook_dashboard.html", webhooks=webhooks)
```

Problem 19: Secret Rotation Without Downtime

The Scenario

"Security audit requires rotating all database passwords every 90 days. We have 50 services using the database. How do we rotate without downtime or coordinated deploys?"

Solution: Dual-Password Support

Implementation

```python
# Database: Support dual passwords
class DualPasswordAuth:
    def authenticate(self, username, password):
        user = db.get_user(username)

        # Check both current and previous password
        if bcrypt.verify(password, user.password_hash):
            return True
        if user.previous_password_hash and bcrypt.verify(password, user.previous_password_hash):
            return True

        return False

    def rotate_password(self, username, new_password):
        user = db.get_user(username)
        user.previous_password_hash = user.password_hash
        user.password_hash = bcrypt.hash(new_password)
        user.previous_password_valid_until = datetime.now() + timedelta(hours=24)
        db.update(user)

# Secret store: Version secrets
class SecretStore:
    def get_secret(self, name, version="latest"):
        if version == "latest":
            return self.secrets[name]["current"]
        return self.secrets[name]["versions"][version]

    def rotate_secret(self, name, new_value):
        current = self.secrets[name]["current"]
        self.secrets[name]["versions"].append(current)
        self.secrets[name]["current"] = new_value

        # Notify services to refresh
        self.notify_rotation(name)

# Service: Refresh on notification
class DatabaseConnection:
    def __init__(self):
        self.password = secret_store.get_secret("db_password")
        secret_store.on_rotation("db_password", self.refresh_password)

    def refresh_password(self):
        self.password = secret_store.get_secret("db_password")
        self.reconnect()
```

Problem 20: The Postmortem - What Really Happened

The Scenario

"Write a blameless postmortem for this incident: A junior engineer ran DELETE FROM users without a WHERE clause in production."

Postmortem Template

# Incident Postmortem: Mass User Deletion

## Summary
On 2024-01-15, all 2.3M user records were accidentally deleted from
the production database. Service was degraded for 4 hours. All data
was recovered from backups with no permanent data loss.

## Timeline (All times UTC)
- 14:23 - Engineer runs DELETE query intending to remove single user
- 14:23 - All user records deleted
- 14:25 - Monitoring alerts fire (500 errors spike)
- 14:27 - On-call acknowledges, begins investigation
- 14:35 - Root cause identified (empty users table)
- 14:40 - Decision made to restore from backup
- 14:45 - Backup restoration begins
- 16:30 - Database restored, application recovering
- 18:15 - Full service restoration confirmed

## Impact
- 4 hours of degraded service
- All users logged out and unable to re-authenticate
- 0 permanent data loss (1-hour backup RPO)
- Estimated revenue impact: $X
- Customer support tickets: 2,847

## Root Cause
An engineer executed a DELETE statement without a WHERE clause.
The query was intended to delete a single test user but instead
deleted all records.

Query executed:
```sql
DELETE FROM users; -- Missing: WHERE id = 'test_user_123'
```

## Contributing Factors

1. No query confirmation for destructive operations
   - Production database allows direct DELETE without confirmation
2. Shared credentials
   - Engineer used a service account with full write access
   - Individual accounts would enable an audit trail
3. No query review process
   - Ad-hoc production queries don't require peer review
4. Insufficient safeguards
   - No transaction wrapper for manual queries
   - No row-count confirmation before commit

## What Went Well

- Backup was recent (1 hour old) and restoration worked
- Monitoring detected the issue within 2 minutes
- Team mobilized quickly
- Communication was clear (status page updated, customers notified)

## What Went Wrong

- No safeguards prevented the DELETE
- Restoration took longer than expected (1.5 hours)
- No runbook for a "mass data deletion" scenario

## Action Items

| Action | Owner | Priority | Due Date |
|---|---|---|---|
| Implement query confirmation for DELETE/UPDATE | DBA Team | P0 | 2024-01-22 |
| Require WHERE clause for DELETE statements | DBA Team | P0 | 2024-01-22 |
| Individual database accounts with audit logging | Security | P1 | 2024-02-01 |
| Peer review process for production queries | Engineering | P1 | 2024-02-01 |
| Add point-in-time recovery (reduce RPO to 5 min) | DBA Team | P2 | 2024-03-01 |
| Create runbook for data recovery scenarios | SRE Team | P1 | 2024-01-29 |
| Conduct training on safe database practices | Engineering | P1 | 2024-02-15 |

## Lessons Learned

This incident was not caused by one person making a mistake. It was caused by a system that allowed a simple mistake to have catastrophic consequences. Our action items focus on building safeguards, not assigning blame.

Key insight: If a junior engineer can accidentally delete the entire database, our systems are not safe enough.
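The first two action items (query confirmation, mandatory WHERE clause) could be approximated with a guard in the query tooling. A deliberately naive sketch: `check_query` and `UnsafeQueryError` are hypothetical names, and a real gate would parse the SQL rather than pattern-match it.

```python
import re

class UnsafeQueryError(Exception):
    """Raised when a destructive statement lacks a WHERE clause."""

def check_query(sql: str) -> str:
    # Deliberately naive: regex matching, not parsing. A production gate would
    # use a real SQL parser and also require a row-count confirmation step.
    stmt = sql.strip().rstrip(";")
    is_destructive = re.match(r"(?is)^\s*(delete|update)\b", stmt) is not None
    has_where = re.search(r"(?is)\bwhere\b", stmt) is not None
    if is_destructive and not has_where:
        raise UnsafeQueryError("DELETE/UPDATE without a WHERE clause is blocked")
    return stmt
```

Wiring a check like this into the shared query console would have blocked the exact statement from this incident while leaving normal scoped queries untouched.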


---

## Quick Reference: Debugging Checklist

When facing any production incident:

```markdown
## Immediate (First 5 Minutes)
- [ ] What changed recently? (deploys, config, traffic)
- [ ] What's the blast radius? (all users, one region, one feature)
- [ ] Can we rollback? Should we?
- [ ] Who needs to know? (stakeholders, status page)

## Investigation (Next 30 Minutes)
- [ ] Check dashboards: Error rate, latency, traffic
- [ ] Check logs: Filter by time window of incident
- [ ] Check infrastructure: CPU, memory, disk, network
- [ ] Check dependencies: Database, cache, external APIs
- [ ] Check recent changes: Git log, deploy history, config changes

## Common Culprits
- [ ] Recent deployment
- [ ] Config change
- [ ] Traffic spike
- [ ] Database query performance
- [ ] External dependency failure
- [ ] Resource exhaustion (memory, connections, file descriptors)
- [ ] DNS/networking issues
- [ ] Certificate expiration
- [ ] Rate limiting (ours or external)

## Resolution
- [ ] Rollback if possible
- [ ] Scale up if resource exhaustion
- [ ] Failover if regional issue
- [ ] Feature flag off if new code
- [ ] Communicate status

## Post-Incident
- [ ] Write postmortem (within 48 hours)
- [ ] Identify action items
- [ ] Schedule review meeting
- [ ] Update runbooks
```

You're now prepared for the toughest scenarios! These problems test judgment, crisis management, and production experience - exactly what Staff+ roles require.

Practice tip: For each scenario, practice explaining your approach in 5 minutes. In senior interviews, the interviewer wants to hear your thought process, not just the answer.