Advanced Interview Scenarios

TL;DR

These are the "what would you do if..." problems that test your production experience and crisis management. These scenarios separate Staff engineers from Senior engineers.


Problem 11: GDPR "Right to be Forgotten" at Scale

The Scenario

"A user requests deletion of all their data under GDPR. We have 50+ microservices, 10+ databases, Kafka topics with 30-day retention, ML models trained on their data, and backups going back 7 years. You have 72 hours. Go."

Why It's Hard

User data is scattered across dozens of stores with very different deletion semantics: mutable relational rows, append-only Kafka topics, ML training sets, and immutable backups. No single team knows where all of it lives, and the 72-hour clock starts immediately.

Solution: Data Catalog + Soft Delete + Async Purge

Phase 1: Build Data Catalog (should exist before request)

```yaml
# data_catalog.yaml
user_data_locations:
  - service: user-service
    database: postgres
    tables: [users, user_preferences, user_sessions]
    pii_columns: [email, phone, address]

  - service: order-service
    database: mysql
    tables: [orders, shipping_addresses]
    foreign_key: user_id

  - service: analytics
    database: bigquery
    tables: [user_events, page_views]
    retention: 90_days

  - service: ml-pipeline
    storage: s3://ml-training-data
    action: retrain_model_without_user

  - service: kafka
    topics: [user-events, order-events]
    retention: 30_days
    action: wait_for_expiry
```

Phase 2: Soft Delete (Immediate - within minutes)

```python
def initiate_deletion(user_id):
    # 1. Mark user as deleted (prevents new data creation)
    db.execute("""
        UPDATE users
        SET status = 'PENDING_DELETION',
            email = ?,
            deleted_at = NOW()
        WHERE id = ?
    """, f"deleted-{user_id}@redacted.com", user_id)

    # 2. Revoke all access tokens
    auth_service.revoke_all_tokens(user_id)

    # 3. Queue async deletion job
    deadline = datetime.now() + timedelta(hours=72)
    queue.enqueue("user_deletion", {
        "user_id": user_id,
        "requested_at": datetime.now(),
        "deadline": deadline
    })

    return {"status": "deletion_initiated", "completion_by": deadline}
```

Phase 3: Async Purge (Background - within 72 hours)

```python
class UserDeletionJob:
    def execute(self, user_id):
        deletion_report = []

        for location in data_catalog.get_locations(user_id):
            try:
                if location.type == "database":
                    self.delete_from_db(location, user_id)
                elif location.type == "s3":
                    self.delete_from_s3(location, user_id)
                elif location.type == "elasticsearch":
                    self.delete_from_es(location, user_id)
                elif location.type == "kafka":
                    # Can't delete from Kafka - wait for retention
                    self.log_kafka_retention(location, user_id)

                deletion_report.append({"location": location, "status": "deleted"})
            except Exception as e:
                deletion_report.append({"location": location, "status": "failed", "error": str(e)})
                alert_oncall(f"GDPR deletion failed for {user_id} at {location}")

        # Store deletion certificate
        self.store_deletion_certificate(user_id, deletion_report)
```

Phase 4: Handle Backups (The Hard Part)

```python
# Option 1: Crypto-shredding (preferred)
# Each user's data encrypted with a unique key
# Delete the key = data unrecoverable

def delete_user_encryption_key(user_id):
    key_id = f"user-key-{user_id}"
    kms.schedule_key_deletion(key_id, pending_days=7)
    # After 7 days, all encrypted backups are unreadable

# Option 2: Lazy deletion on restore
# Mark user as deleted, filter during restore

def restore_backup(backup_date):
    data = load_backup(backup_date)
    deleted_users = get_deleted_user_ids()
    return filter_out_users(data, deleted_users)
```

Interview Follow-ups

Q: What about ML models trained on this user's data?

You can't "untrain" a model. Options:

  1. Retrain model without user's data (expensive)
  2. Differential privacy from the start (prevents this problem)
  3. Document that model weights are anonymized aggregate data (legal gray area)

Q: What about Kafka topics?

Can't delete individual messages. Options:

  1. Wait for retention period (if < 72 hours)
  2. Encrypt user data in Kafka, delete key
  3. Compact topics with tombstones (for keyed topics)
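Option 2 (encrypt in Kafka, then delete the key) can be sketched as below. This is a toy illustration: `UserKeyStore`, `encrypt_event`, and the SHA-256 counter-mode keystream are stand-ins invented here; a real system would hold per-user keys in a KMS and use an authenticated cipher such as AES-GCM.

```python
import hashlib
import secrets

def keystream(key: bytes, n: int) -> bytes:
    # Toy SHA-256 counter-mode keystream (demo only; use AES-GCM via a KMS in production)
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_bytes(data: bytes, ks: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, ks))

class UserKeyStore:
    """Per-user encryption keys; deleting a key 'shreds' that user's messages."""
    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Create the user's key on first use
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id):
        return self._keys.get(user_id)

    def shred(self, user_id):
        # GDPR deletion: drop the key; ciphertext still in Kafka becomes unreadable
        self._keys.pop(user_id, None)

keys = UserKeyStore()

def encrypt_event(user_id, payload: bytes) -> bytes:
    return xor_bytes(payload, keystream(keys.key_for(user_id), len(payload)))

def decrypt_event(user_id, blob: bytes):
    key = keys.get(user_id)
    if key is None:
        return None  # Key shredded: the event can no longer be read
    return xor_bytes(blob, keystream(key, len(blob)))
```

The appeal is that the Kafka log itself never changes: deletion becomes a single key-store operation, regardless of retention settings.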

Problem 12: The Thundering Herd

The Scenario

"Every day at midnight UTC, all caches expire simultaneously. At 00:00:01, we get 500K requests to the database, it falls over, and the site is down for 10 minutes. How do you fix this?"

Root Cause

Every cache entry is written with the same fixed TTL (typically by a nightly warm-up job), so all entries expire at the same instant. Every request then misses the cache and goes straight to the database.

Solution 1: Jittered TTL

```python
def cache_with_jitter(key, value, base_ttl=3600):
    # Add random jitter: 3600 ± 360 seconds (±10%)
    jitter = random.randint(-base_ttl // 10, base_ttl // 10)
    actual_ttl = base_ttl + jitter

    cache.set(key, value, ttl=actual_ttl)

# Expirations spread across a 12-minute window instead of all at once
```

Solution 2: Probabilistic Early Refresh

```python
def get_with_early_refresh(key, refresh_func):
    value, ttl_remaining = cache.get_with_ttl(key)

    if value is None:
        # Cache miss - fetch and store
        value = refresh_func()
        cache.set(key, value, ttl=3600)
        return value

    # Probabilistically refresh before expiry
    # Probability increases as TTL decreases
    expiry_probability = 1.0 / max(ttl_remaining, 1)

    if random.random() < expiry_probability:
        # Refresh in background (don't block)
        thread_pool.submit(lambda: cache.set(key, refresh_func(), ttl=3600))

    return value
```

Solution 3: Single-Flight Pattern (Mutex)

```python
import threading

class SingleFlight:
    def __init__(self):
        self.locks = {}
        self.results = {}

    def do(self, key, func):
        # Check if another thread is already fetching this key.
        # (Illustrative: a production version also needs a mutex around dict access.)
        event = self.locks.get(key)
        if event is not None:
            event.wait()  # Wait for the fetching thread's result
            return self.results[key]

        # This thread will fetch
        self.locks[key] = threading.Event()

        try:
            result = func()
            self.results[key] = result
            cache.set(key, result, ttl=3600)
            return result
        finally:
            self.locks[key].set()  # Signal waiters
            del self.locks[key]

# Usage
single_flight = SingleFlight()

def get_product(product_id):
    cached = cache.get(f"product:{product_id}")
    if cached:
        return cached

    # Only ONE request goes to the DB; others wait for its result
    return single_flight.do(
        f"product:{product_id}",
        lambda: db.query("SELECT * FROM products WHERE id = ?", product_id)
    )
```

Solution 4: Background Refresh (Best for Hot Keys)

```python
# Cron job runs every minute
@scheduled(every=60)
def refresh_hot_cache():
    hot_keys = analytics.get_hot_keys(top_n=1000)

    for key in hot_keys:
        ttl = cache.ttl(key)
        # Refresh if expiring in the next 5 minutes
        if ttl < 300:
            value = fetch_from_db(key)
            cache.set(key, value, ttl=3600)

# Result: Hot keys never expire (always refreshed before TTL)
```

Problem 13: The Split-Brain Database

The Scenario

"We had a network partition. Both database nodes thought they were primary and accepted writes for 3 minutes. Now we have conflicting data. How do we reconcile?"

The Horror

Solution: Conflict Detection + Resolution

Step 1: Identify Conflicts

```sql
-- Export writes from both nodes during the partition window
-- Node A writes
SELECT * FROM orders
WHERE updated_at BETWEEN '2024-01-15 10:00:00' AND '2024-01-15 10:03:00';

-- Node B writes (same query, run against node B)
SELECT * FROM orders
WHERE updated_at BETWEEN '2024-01-15 10:00:00' AND '2024-01-15 10:03:00';

-- Find conflicts (same primary key, different data)
SELECT a.id, a.status AS status_a, b.status AS status_b
FROM node_a_orders a
JOIN node_b_orders b ON a.id = b.id
WHERE a.status != b.status;
```

Step 2: Automated Resolution (Where Possible)

```python
def resolve_conflict(record_a, record_b, conflict_type, original_count=None):
    if conflict_type == "order_status":
        # Business rule: the later status in the order lifecycle wins
        status_order = ["pending", "paid", "shipped", "delivered", "cancelled"]
        if status_order.index(record_a.status) > status_order.index(record_b.status):
            return record_a
        return record_b

    elif conflict_type == "counter":
        # Counters: both nodes incremented from the same pre-partition value,
        # so sum the increments (original_count = the value before the partition)
        return record_a.count + record_b.count - original_count

    elif conflict_type == "last_write_wins":
        # Simple LWW based on timestamp
        return record_a if record_a.updated_at > record_b.updated_at else record_b

    else:
        # Can't auto-resolve - flag for manual review
        create_conflict_ticket(record_a, record_b)
        return None
```

Step 3: Manual Resolution Queue

```python
# For conflicts that can't be auto-resolved
def create_conflict_ticket(record_a, record_b):
    ticket = {
        "type": "data_conflict",
        "priority": "P1",
        "table": record_a.table,
        "id": record_a.id,
        "node_a_data": record_a,
        "node_b_data": record_b,
        "suggested_resolution": "MANUAL_REVIEW",
        "customer_impact": assess_customer_impact(record_a.id)
    }

    jira.create_ticket(ticket)
    notify_oncall("Split-brain conflict requires manual resolution")
```

Prevention: Proper Fencing

```python
# Use fencing tokens to prevent split-brain
class Database:
    def __init__(self):
        self.fencing_token = 0

    def acquire_primary(self, node_id):
        # Increment fencing token
        self.fencing_token += 1
        token = self.fencing_token

        # Store in distributed lock (etcd/ZooKeeper)
        distributed_lock.set("db_primary", {
            "node": node_id,
            "token": token
        })

        return token

    def write(self, data, fencing_token):
        current = distributed_lock.get("db_primary")

        # Reject writes carrying a stale fencing token
        if fencing_token < current["token"]:
            raise StaleLeaderException("You are no longer primary")

        self.do_write(data)
```

Problem 14: Debugging Intermittent Failures

The Scenario

"1% of requests randomly fail with a generic 500 error. It happens across all endpoints, all users, no pattern. We've been debugging for 2 weeks. Help."

Systematic Debugging Approach

Step 1: Correlate Everything

```python
# Enrich error logs with ALL context
def log_error(error, request):
    log.error({
        "error": str(error),
        "stack_trace": traceback.format_exc(),

        # Request context
        "request_id": request.id,
        "user_id": request.user_id,
        "endpoint": request.path,
        "method": request.method,
        "payload_size": len(request.body),

        # Infrastructure context
        "server_id": os.environ["SERVER_ID"],
        "container_id": os.environ["HOSTNAME"],
        "availability_zone": get_az(),
        "deployment_version": os.environ["VERSION"],

        # Timing
        "timestamp": datetime.now().isoformat(),
        "request_duration_ms": request.duration_ms,

        # Dependencies
        "db_connection_pool_used": db.pool.used_connections,
        "db_connection_pool_max": db.pool.max_connections,
        "redis_latency_ms": redis.last_latency_ms,
        "external_api_latency_ms": external_api.last_latency_ms,

        # System resources
        "memory_percent": psutil.virtual_memory().percent,
        "cpu_percent": psutil.cpu_percent(),
        "open_file_descriptors": len(os.listdir('/proc/self/fd')),
    })
```

Step 2: Look for Patterns

```sql
-- Query error logs for patterns
SELECT
    server_id,
    COUNT(*) AS error_count,
    AVG(memory_percent) AS avg_memory,
    AVG(db_connection_pool_used) AS avg_db_connections
FROM error_logs
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY server_id
ORDER BY error_count DESC;

-- Aha! Server 7 has 80% of errors and 95% memory usage
```

Step 3: Common Culprits Checklist

| Symptom | Likely Cause | How to Verify |
|---|---|---|
| Errors spike at :00 and :30 | Cron job interference | Check crontab |
| Errors correlate with memory | Memory leak | Profile with heapy |
| Errors correlate with one server | Bad deploy, hardware | Compare deployments |
| Errors during high traffic | Resource exhaustion | Load test |
| Errors after N hours | Connection pool leak | Monitor pool size |
| Random distribution | Race condition | Add distributed tracing |
| Errors with large payloads | Timeout, memory | Check payload sizes |

Step 4: The Usual Suspects

```python
# 1. Connection pool exhaustion
# Symptom: Works for hours, then all requests fail
def check_connection_pool():
    if db.pool.used >= db.pool.max * 0.9:
        log.warn("Connection pool nearly exhausted!")
        # Look for: long-running queries, uncommitted transactions

# 2. File descriptor leak
# Symptom: "Too many open files" errors
def check_fd_leak():
    fd_count = len(os.listdir('/proc/self/fd'))
    if fd_count > 1000:
        log.warn(f"High FD count: {fd_count}")
        # Look for: unclosed HTTP connections, file handles

# 3. Memory leak
# Symptom: OOM after N hours
def check_memory_leak():
    import tracemalloc
    tracemalloc.start()
    # ... run for a while ...
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')[:10]
    # Shows which lines allocate the most memory

# 4. DNS resolution failure
# Symptom: Random timeouts to external services
def check_dns():
    import socket
    try:
        socket.setdefaulttimeout(1)
        socket.gethostbyname("api.external-service.com")
    except (socket.timeout, socket.gaierror):
        log.error("DNS resolution failed!")
        # Look for: /etc/resolv.conf, DNS server issues
```

Problem 15: The Deployment That Broke Everything

The Scenario

"We deployed a 'safe' config change. 5 minutes later, error rate went from 0.1% to 50%. We rolled back but errors continued for 2 hours. What happened?"

Investigation Timeline

```
14:00 - Deploy config change (timeout: 30s → 5s)
14:05 - Error rate spikes to 50%
14:07 - Rollback initiated
14:10 - Rollback complete
14:15 - Error rate still 50% (WHY?!)
16:10 - Errors finally subside
```

Root Cause: Cascading Failure

Cutting the timeout from 30s to 5s turned slow-but-successful queries into failures. Clients retried, retries piled onto the connection pool, and the database saturated. By the time of the rollback, the system was in a self-sustaining retry storm.

Why Rollback Didn't Fix It

```python
# The problem: Rollback doesn't fix queued requests

# During the incident:
# - 100K requests queued waiting for the DB
# - Each request holds a connection
# - DB can only process 1K/sec
# - Queue takes 100 seconds to drain

# But worse:
# - Users see timeouts and refresh
# - Refreshes create NEW requests
# - Queue never drains

# Solution: Shed load during recovery
def handle_request():
    if circuit_breaker.is_open():
        # Fast-fail during recovery
        return Response(503, "Service recovering, please retry in 60s",
                        headers={"Retry-After": "60"})

    # Normal processing
    return process_request()
```

The Fix: Controlled Recovery

```python
class ControlledRecovery:
    def __init__(self):
        self.max_concurrent = 100
        self.current = 0
        self.recovery_mode = False

    def enter_recovery_mode(self):
        self.recovery_mode = True
        self.max_concurrent = 10  # Drastically reduce

    def gradual_recovery(self):
        # Slowly increase capacity
        while self.max_concurrent < 1000:
            if error_rate() < 0.01:  # Less than 1% errors
                self.max_concurrent += 10
            else:
                self.max_concurrent = max(10, self.max_concurrent - 20)  # Back off
            time.sleep(60)  # Re-evaluate every minute

    def handle_request(self):
        if self.current >= self.max_concurrent:
            return Response(503, "Shedding load", headers={"Retry-After": "30"})

        self.current += 1
        try:
            return process_request()
        finally:
            self.current -= 1
```

Prevention: Config Change Checklist

```markdown
## Before Any Config Change

- [ ] What's the blast radius? (All users? One region?)
- [ ] Can we canary this? (1% of traffic first)
- [ ] What metrics will show problems? (Set alerts)
- [ ] How fast can we detect issues? (< 1 minute)
- [ ] How fast can we rollback? (< 5 minutes)
- [ ] What happens if rollback doesn't fix it?
- [ ] Is there a dependency that won't recover automatically?
```
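The "canary this" item is often implemented with stable hash bucketing, so the same 1% of users consistently see the new value across requests. A minimal sketch, with illustrative function names not taken from the text:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    # Stable hash bucketing: a given user always lands in the same bucket,
    # so canary users see consistent behavior across requests
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def effective_timeout(user_id: str, old: int = 30, new: int = 5, canary_percent: int = 1) -> int:
    # Serve the new config value to canary_percent of users; everyone else keeps the old one
    return new if in_canary(user_id, canary_percent) else old
```

Ramping the rollout is then just raising `canary_percent` while watching the error-rate metrics from the checklist.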

Problem 16: Multi-Tenant Noisy Neighbor

The Scenario

"We're a SaaS platform. One enterprise customer started running huge batch jobs, and now all other customers are experiencing slowdowns. We can't just throttle them - they pay $500K/year."

Solution: Fair Resource Allocation

Option 1: Request Queuing with Priority

```python
class FairScheduler:
    def __init__(self):
        self.tenant_quotas = {}   # tenant_id → tokens/second
        self.tenant_buckets = {}  # tenant_id → current tokens

    def acquire(self, tenant_id, tokens=1):
        bucket = self.tenant_buckets.get(tenant_id, self.tenant_quotas[tenant_id])

        if bucket >= tokens:
            self.tenant_buckets[tenant_id] = bucket - tokens
            return True
        else:
            # Queue the request (don't reject)
            return self.queue_request(tenant_id, tokens)

    def refill_buckets(self):
        # Called every second
        for tenant_id, quota in self.tenant_quotas.items():
            current = self.tenant_buckets.get(tenant_id, 0)
            self.tenant_buckets[tenant_id] = min(current + quota, quota * 2)  # Max burst = 2x
```

Option 2: Isolated Resources

```python
# Route enterprise customers to dedicated resources
def get_db_connection(tenant_id):
    tier = get_tenant_tier(tenant_id)

    if tier == "enterprise":
        return enterprise_db_pool.get_connection()
    else:
        return shared_db_pool.get_connection()
```

Option 3: Background Job Separation

```python
# Separate queues for batch vs interactive workloads
class JobRouter:
    def route(self, job, tenant_id):
        if job.type == "batch":
            # Low-priority queue, limited concurrency
            return batch_queue.enqueue(job, priority="low", max_concurrent=5)
        else:
            # High-priority queue for interactive requests
            return interactive_queue.enqueue(job, priority="high")
```

Problem 17: The Impossible Bug - Time Zone Edition

The Scenario

"Twice a year, on the day clocks change for daylight saving, our billing system either double-charges customers or skips charges entirely. We've 'fixed' it three times."

The Problem

```python
# Buggy code
def get_charges_for_day(date):
    start = datetime(date.year, date.month, date.day, 0, 0, 0)  # Midnight
    end = start + timedelta(days=1)  # Next midnight

    return db.query("SELECT * FROM charges WHERE created_at BETWEEN ? AND ?", start, end)

# On the DST "spring forward" day:
# - 2 AM doesn't exist (clocks jump to 3 AM)
# - The day is only 23 hours long
# - Charges can fall into the "missing" hour

# On the DST "fall back" day:
# - 2 AM happens twice
# - The day is 25 hours long
# - Some charges are counted twice
```

The Fix: Use UTC + Timezone-Aware Comparisons

```python
from datetime import datetime, timedelta, timezone
import pytz

def get_charges_for_day(date, user_timezone="America/New_York"):
    tz = pytz.timezone(user_timezone)

    # Get midnight in the user's timezone, then convert to UTC
    local_midnight = tz.localize(datetime(date.year, date.month, date.day, 0, 0, 0))
    # normalize() corrects the UTC offset when the day crosses a DST change
    local_next_midnight = tz.normalize(local_midnight + timedelta(days=1))

    # Convert to UTC for the database query
    utc_start = local_midnight.astimezone(timezone.utc)
    utc_end = local_next_midnight.astimezone(timezone.utc)

    # Database stores everything in UTC
    return db.query(
        "SELECT * FROM charges WHERE created_at >= ? AND created_at < ?",
        utc_start, utc_end
    )

# This correctly handles:
# - 23-hour days (spring forward)
# - 25-hour days (fall back)
# - Users in different timezones
```

Other Time Landmines

```python
# Landmine 1: Leap seconds
# Solution: Use NTP, accept ±1 second precision

# Landmine 2: February 29
def get_same_day_next_year(date):
    try:
        return date.replace(year=date.year + 1)
    except ValueError:
        # Feb 29 → Feb 28 in a non-leap year
        return date.replace(year=date.year + 1, day=28)

# Landmine 3: Month arithmetic
# "One month from January 31" = ?
from dateutil.relativedelta import relativedelta
jan_31 = datetime(2024, 1, 31)
one_month_later = jan_31 + relativedelta(months=1)  # Feb 29, 2024

# Landmine 4: Server timezone vs user timezone
# Solution: ALWAYS store UTC, convert on display
```

Problem 18: Webhook Delivery Reliability

The Scenario

"Partners complain they're missing webhook events. We fire webhooks and log success, but partners say they never received them. Trust is eroding."

Solution: Reliable Webhook System

Implementation

```python
import hashlib
import hmac
import json

class WebhookDelivery:
    RETRY_DELAYS = [60, 300, 900, 3600, 7200, 14400, 28800]  # Exponential backoff (seconds)

    def send_webhook(self, webhook):
        try:
            response = http.post(
                webhook.url,
                json=webhook.payload,
                headers={
                    "X-Webhook-ID": webhook.id,
                    "X-Webhook-Timestamp": str(webhook.created_at.timestamp()),
                    "X-Webhook-Signature": self.sign(webhook.payload, webhook.secret)
                },
                timeout=30
            )

            if response.status_code == 200:
                self.mark_delivered(webhook)
            else:
                self.schedule_retry(webhook)

        except Exception:
            self.schedule_retry(webhook)

    def schedule_retry(self, webhook):
        if webhook.attempt >= len(self.RETRY_DELAYS):
            self.move_to_dlq(webhook)
            return

        delay = self.RETRY_DELAYS[webhook.attempt]
        webhook.attempt += 1
        webhook.next_retry = datetime.now() + timedelta(seconds=delay)
        queue.enqueue(webhook, delay=delay)

    def sign(self, payload, secret):
        # HMAC signature so the partner can verify authenticity
        return hmac.new(
            secret.encode(),
            json.dumps(payload).encode(),
            hashlib.sha256
        ).hexdigest()
```

Partner-Side Verification

```python
# Partners should verify the signature
def handle_webhook(request):
    expected_signature = hmac.new(
        MY_WEBHOOK_SECRET.encode(),
        request.body,
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison prevents timing attacks
    if not hmac.compare_digest(request.headers["X-Webhook-Signature"], expected_signature):
        return Response(401, "Invalid signature")

    # Idempotency check
    webhook_id = request.headers["X-Webhook-ID"]
    if already_processed(webhook_id):
        return Response(200, "Already processed")

    process_webhook(request.body)
    mark_processed(webhook_id)
    return Response(200, "OK")
```

Dashboard for Partners

```python
# Let partners see their webhook history
@app.route("/partner/webhooks")
def webhook_dashboard():
    webhooks = db.query("""
        SELECT id, event_type, status, attempts, created_at, delivered_at
        FROM webhooks
        WHERE partner_id = ?
        ORDER BY created_at DESC
        LIMIT 100
    """, current_partner.id)

    return render("webhook_dashboard.html", webhooks=webhooks)
```

Problem 19: Secret Rotation Without Downtime

The Scenario

"Security audit requires rotating all database passwords every 90 days. We have 50 services using the database. How do we rotate without downtime or coordinated deploys?"

Solution: Dual-Password Support

Implementation

```python
# Database: Support dual passwords
class DualPasswordAuth:
    def authenticate(self, username, password):
        user = db.get_user(username)

        # Check both current and previous password
        if bcrypt.verify(password, user.password_hash):
            return True
        if user.previous_password_hash and bcrypt.verify(password, user.previous_password_hash):
            return True

        return False

    def rotate_password(self, username, new_password):
        user = db.get_user(username)
        user.previous_password_hash = user.password_hash
        user.password_hash = bcrypt.hash(new_password)
        user.previous_password_valid_until = datetime.now() + timedelta(hours=24)
        db.update(user)

# Secret store: Version secrets
class SecretStore:
    def get_secret(self, name, version="latest"):
        if version == "latest":
            return self.secrets[name]["current"]
        return self.secrets[name]["versions"][version]

    def rotate_secret(self, name, new_value):
        current = self.secrets[name]["current"]
        self.secrets[name]["versions"].append(current)
        self.secrets[name]["current"] = new_value

        # Notify services to refresh
        self.notify_rotation(name)

# Service: Refresh on notification
class DatabaseConnection:
    def __init__(self):
        self.password = secret_store.get_secret("db_password")
        secret_store.on_rotation("db_password", self.refresh_password)

    def refresh_password(self):
        self.password = secret_store.get_secret("db_password")
        self.reconnect()
```

Problem 20: The Postmortem - What Really Happened

The Scenario

"Write a blameless postmortem for this incident: A junior engineer ran DELETE FROM users without a WHERE clause in production."

Postmortem Template

# Incident Postmortem: Mass User Deletion

## Summary
On 2024-01-15, all 2.3M user records were accidentally deleted from
the production database. Service was degraded for 4 hours. All data
was recovered from backups with no permanent data loss.

## Timeline (All times UTC)
- 14:23 - Engineer runs DELETE query intending to remove single user
- 14:23 - All user records deleted
- 14:25 - Monitoring alerts fire (500 errors spike)
- 14:27 - On-call acknowledges, begins investigation
- 14:35 - Root cause identified (empty users table)
- 14:40 - Decision made to restore from backup
- 14:45 - Backup restoration begins
- 16:30 - Database restored, application recovering
- 18:15 - Full service restoration confirmed

## Impact
- 4 hours of degraded service
- All users logged out and unable to re-authenticate
- 0 permanent data loss (1-hour backup RPO)
- Estimated revenue impact: $X
- Customer support tickets: 2,847

## Root Cause
An engineer executed a DELETE statement without a WHERE clause.
The query was intended to delete a single test user but instead
deleted all records.

Query executed:
```sql
DELETE FROM users; -- Missing: WHERE id = 'test_user_123'
```

## Contributing Factors

1. No query confirmation for destructive operations
   - Production database allows direct DELETE without confirmation
2. Shared credentials
   - Engineer used a service account with full write access
   - Individual accounts would enable an audit trail
3. No query review process
   - Ad-hoc production queries don't require peer review
4. Insufficient safeguards
   - No transaction wrapper for manual queries
   - No row-count confirmation before commit

## What Went Well

- Backup was recent (1 hour old) and restoration worked
- Monitoring detected the issue within 2 minutes
- Team mobilized quickly
- Communication was clear (status page updated, customers notified)

## What Went Wrong

- No safeguards prevented the DELETE
- Restoration took longer than expected (1.5 hours)
- No runbook for a "mass data deletion" scenario

## Action Items

| Action | Owner | Priority | Due Date |
|---|---|---|---|
| Implement query confirmation for DELETE/UPDATE | DBA Team | P0 | 2024-01-22 |
| Require WHERE clause for DELETE statements | DBA Team | P0 | 2024-01-22 |
| Individual database accounts with audit logging | Security | P1 | 2024-02-01 |
| Peer review process for production queries | Engineering | P1 | 2024-02-01 |
| Add point-in-time recovery (reduce RPO to 5 min) | DBA Team | P2 | 2024-03-01 |
| Create runbook for data recovery scenarios | SRE Team | P1 | 2024-01-29 |
| Conduct training on safe database practices | Engineering | P1 | 2024-02-15 |

## Lessons Learned

This incident was not caused by one person making a mistake. It was caused by a system that allowed a simple mistake to have catastrophic consequences. Our action items focus on building safeguards, not assigning blame.

Key insight: If a junior engineer can accidentally delete the entire database, our systems are not safe enough.
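The first two action items (query confirmation, mandatory WHERE clause) could be approximated with a guard in the query tooling. A deliberately naive sketch: `check_query` and `UnsafeQueryError` are hypothetical names, and a real gate would parse the SQL rather than pattern-match it.

```python
import re

class UnsafeQueryError(Exception):
    """Raised when a destructive statement lacks a WHERE clause."""

def check_query(sql: str) -> str:
    # Deliberately naive: regex matching, not parsing. A production gate would
    # use a real SQL parser and also require a row-count confirmation step.
    stmt = sql.strip().rstrip(";")
    is_destructive = re.match(r"(?is)^\s*(delete|update)\b", stmt) is not None
    has_where = re.search(r"(?is)\bwhere\b", stmt) is not None
    if is_destructive and not has_where:
        raise UnsafeQueryError("DELETE/UPDATE without a WHERE clause is blocked")
    return stmt
```

Wiring a check like this into the shared query console would have blocked the exact statement from this incident while leaving normal scoped queries untouched.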


---

## Quick Reference: Debugging Checklist

When facing any production incident:

```markdown
## Immediate (First 5 Minutes)
- [ ] What changed recently? (deploys, config, traffic)
- [ ] What's the blast radius? (all users, one region, one feature)
- [ ] Can we rollback? Should we?
- [ ] Who needs to know? (stakeholders, status page)

## Investigation (Next 30 Minutes)
- [ ] Check dashboards: Error rate, latency, traffic
- [ ] Check logs: Filter by time window of incident
- [ ] Check infrastructure: CPU, memory, disk, network
- [ ] Check dependencies: Database, cache, external APIs
- [ ] Check recent changes: Git log, deploy history, config changes

## Common Culprits
- [ ] Recent deployment
- [ ] Config change
- [ ] Traffic spike
- [ ] Database query performance
- [ ] External dependency failure
- [ ] Resource exhaustion (memory, connections, file descriptors)
- [ ] DNS/networking issues
- [ ] Certificate expiration
- [ ] Rate limiting (ours or external)

## Resolution
- [ ] Rollback if possible
- [ ] Scale up if resource exhaustion
- [ ] Failover if regional issue
- [ ] Feature flag off if new code
- [ ] Communicate status

## Post-Incident
- [ ] Write postmortem (within 48 hours)
- [ ] Identify action items
- [ ] Schedule review meeting
- [ ] Update runbooks
```

You're now prepared for the toughest scenarios! These problems test judgment, crisis management, and production experience - exactly what Staff+ roles require.

Practice tip: For each scenario, practice explaining your approach in 5 minutes. In senior interviews, the interviewer wants to hear your thought process, not just the answer.