Monitoring & Observability
TL;DR (30-second summary)
- Monitoring: Tracking known problems (CPU, memory, errors).
- Observability: Understanding unknown problems (logs, metrics, traces).
- Logs: What happened. Metrics: How much/how fast. Traces: Request flow through the system.
- Alerts: Notify when things break.
Golden rule: You can't fix what you can't see. Build observability from day one.
Why This Matters
In interviews: Shows you think about production operations, not just features. Separates senior from junior engineers.
At work: Production issues happen. Observability helps debug fast and prevent customer impact.
Core Concepts
1. Monitoring vs Observability
Monitoring: "Is the system working?"
Observability: "Why isn't it working?"
Example:
- Monitoring: "Error rate is 5%" (tells you there's a problem)
- Observability: "POST /api/checkout fails for users in EU region when cart > $1000" (tells you why)
2. The Three Pillars of Observability
Logs
What: Timestamped records of events.
Example:
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "user_id": "123",
  "order_id": "456",
  "error": "InsufficientFunds",
  "amount": 99.99
}
Log levels:
- DEBUG: Detailed info for debugging (disabled in production)
- INFO: Normal events (user login, order created)
- WARN: Potential issues (slow query, retrying request)
- ERROR: Failures (payment failed, API timeout)
- FATAL: System crashes
Best practices:
- Structured logging (JSON) - easier to parse and search
- Include context: user_id, request_id, trace_id
- Don't log sensitive data (passwords, credit cards)
- Use log levels appropriately (don't INFO everything)
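The practices above can be sketched with Python's stdlib `logging` module. This is a minimal illustration, not a production setup; the service name and context fields are made up:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object (structured logging)."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-service",   # illustrative service name
            "message": record.getMessage(),
        }
        # Context fields (user_id, request_id, ...) passed via `extra=`
        entry.update(getattr(record, "ctx", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)   # DEBUG stays disabled, as in production

logger.error("Payment processing failed",
             extra={"ctx": {"user_id": "123", "order_id": "456"}})
```

Because every field is a JSON key, a log backend can index and filter on `user_id` or `level` directly instead of regex-matching free text.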
Metrics
What: Numerical measurements over time.
Types:
- Counter: Cumulative value (only increases)
  - Example: requests_total, errors_total
- Gauge: Current value (can go up/down)
  - Example: cpu_usage, active_connections
- Histogram: Distribution of values in buckets
  - Example: request_duration (P50, P95, P99)
- Summary: Similar to a histogram, but with pre-calculated percentiles
Example (Prometheus):
# Counter
http_requests_total{method="GET", status="200"} 1547
# Gauge
database_connections_active 42
# Histogram
http_request_duration_seconds_bucket{le="0.1"} 95
http_request_duration_seconds_bucket{le="0.5"} 98
http_request_duration_seconds_bucket{le="1.0"} 100
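A toy sketch of the first three metric types, to make the semantics concrete; these are hand-rolled stand-ins, but real Prometheus client libraries expose the same inc/set/observe operations, and the bucket bounds match the example above:

```python
class Counter:
    """Cumulative value: can only increase."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Current value: can go up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into cumulative buckets (le = 'less or equal')."""
    def __init__(self, buckets=(0.1, 0.5, 1.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * len(self.buckets)
    def observe(self, value):
        # Cumulative: every bucket whose upper bound >= value is incremented.
        for i, le in enumerate(self.buckets):
            if value <= le:
                self.counts[i] += 1

requests = Counter()
requests.inc()                  # one request served
connections = Gauge()
connections.set(42)             # current connection count
latency = Histogram()
for seconds in (0.05, 0.3, 0.9):
    latency.observe(seconds)    # cumulative counts become [1, 2, 3]
```

The cumulative bucket counts are exactly what the `_bucket{le=...}` series above hold; percentile estimates are derived from them at query time.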
Key metrics (Google's Four Golden Signals):
- Latency: How long requests take (P50, P95, P99)
- Traffic: Requests per second (QPS)
- Errors: Error rate (%)
- Saturation: Resource utilization (CPU, memory, disk)
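Latency percentiles like P50/P95/P99 can be computed from raw samples with the nearest-rank method; the sample values below are made up. Note how, with few samples, the tail percentiles collapse onto the worst observation:

```python
def percentile(samples, p):
    """Nearest-rank percentile: p in [0, 100] over raw samples."""
    ordered = sorted(samples)
    # Rank = ceil(p/100 * N), clamped so p=0 still yields the minimum.
    rank = max(1, -(-len(ordered) * p // 100))  # ceiling division
    return ordered[rank - 1]

latencies_ms = [12, 15, 20, 22, 30, 45, 60, 120, 250, 900]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)}ms")
```

With only 10 samples, P95 and P99 both land on the 900ms outlier, which is why averages hide tail pain and why high percentiles need enough traffic to be meaningful.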
Traces
What: Track a single request as it flows through multiple services.
Trace breakdown (total: 295ms):
- API Gateway: 10ms
- Auth Service: 5ms
- Order Service: 50ms
- Payment Service: 200ms ← bottleneck
- Database: 30ms
Distributed tracing tools:
- Jaeger: Open-source, CNCF project
- Zipkin: Twitter's open-source tool
- AWS X-Ray: AWS-native tracing
- Datadog APM: Commercial, full-featured
Trace context (propagated via headers):
X-Trace-Id: 1234567890abcdef
X-Span-Id: abcdef1234567890
X-Parent-Span-Id: fedcba0987654321
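A minimal sketch of how that context propagates: each hop reuses the incoming trace id and becomes the parent of the next span. Header names follow the `X-Trace-Id` convention shown above (the modern standard is the W3C `traceparent` header), and the id formats here are illustrative:

```python
import uuid

def start_span(incoming_headers):
    """Continue the incoming trace, or start a new one at the edge."""
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]
    return {
        "X-Trace-Id": trace_id,                               # shared by all hops
        "X-Span-Id": span_id,                                 # unique per hop
        "X-Parent-Span-Id": incoming_headers.get("X-Span-Id", ""),
    }

# Request enters at the gateway, then hops to the order service.
gateway = start_span({})
order = start_span(gateway)

assert order["X-Trace-Id"] == gateway["X-Trace-Id"]       # same trace
assert order["X-Parent-Span-Id"] == gateway["X-Span-Id"]  # parent link
```

The shared trace id is what lets a tracing backend stitch the per-service spans back into the waterfall breakdown shown above.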
3. Alerting
Purpose: Notify team when something breaks (before customers complain).
Alert rules:
# Prometheus Alertmanager example
- alert: HighErrorRate
  # Ratio of 5xx responses to all responses over the last 5 minutes
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate on {{ $labels.service }}"
    description: "Error rate is {{ $value | humanizePercentage }}"
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency > 1 second"
Alert best practices:
| Do | Don't |
|---|---|
| ✅ Alert on symptoms (customer impact) | ❌ Alert on causes (disk 80% full) |
| ✅ Actionable (team can fix) | ❌ Noisy (alert fatigue) |
| ✅ Clear runbook (how to fix) | ❌ Vague ("something is wrong") |
| ✅ Proper severity (critical/warning) | ❌ Everything is "critical" |
On-call rotation:
- Primary on-call: First responder
- Secondary on-call: Backup if primary doesn't respond
- Escalation policy: Auto-escalate after N minutes
4. Dashboards
Purpose: Visualize system health at a glance.
Types:
- Overview Dashboard: High-level metrics (traffic, errors, latency)
- Service Dashboard: Deep dive into single service
- Business Dashboard: User signups, revenue, conversions
Example metrics to display:
Service: API Gateway
- Traffic:
  - Requests/second (RPS): 1,234 ▲ 5%
  - Bytes in/out: 10 MB/s
- Latency:
  - P50: 45ms
  - P95: 120ms ▼ 10ms
  - P99: 250ms
- Errors:
  - 4xx errors: 2.3% (client errors)
  - 5xx errors: 0.1% ▲ 0.05% (⚠️ trending up)
- Resources:
  - CPU: 65%
  - Memory: 4.2 GB / 8 GB
  - Connections: 450 / 1000
Dashboard tools:
- Grafana: Open-source, flexible, integrates with Prometheus
- Datadog: Commercial, full-featured
- CloudWatch: AWS-native
- New Relic: APM + dashboards
5. SLIs, SLOs, SLAs
SLI (Service Level Indicator): A metric (e.g., latency, uptime)
SLO (Service Level Objective): Target for SLI (e.g., 99.9% uptime)
SLA (Service Level Agreement): Contract with customers (e.g., refund if < 99.9%)
Example:
Service: Payment API
SLIs:
1. Availability (uptime)
2. Latency (P95 response time)
3. Error rate
SLOs:
1. 99.95% availability (21.6 min downtime/month)
2. P95 latency < 200ms
3. Error rate < 0.1%
SLA (customer-facing):
- 99.9% uptime guarantee
- If violated → 10% refund
Error budget:
- SLO: 99.95% availability
- Error budget: 0.05% = 21.6 minutes/month
- If the error budget is used up → freeze deploys, focus on reliability
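The error-budget arithmetic is easy to check; the calculation below assumes a 30-day month:

```python
def error_budget_minutes(slo_percent, days=30):
    """Allowed downtime per period for an availability SLO."""
    total_minutes = days * 24 * 60          # 43,200 for a 30-day month
    return round((100 - slo_percent) / 100 * total_minutes, 1)

print(error_budget_minutes(99.95))  # ≈ 21.6 minutes/month
print(error_budget_minutes(99.9))   # ≈ 43.2 minutes/month
print(error_budget_minutes(99.99))  # ≈ 4.3 minutes/month
```

Each extra "nine" cuts the budget by 10x, which is why 99.99% effectively rules out manual incident response.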
6. Centralized Logging
Problem: Logs scattered across 100s of servers.
Solution: Centralized logging (ELK Stack, Splunk, etc.)
ELK Stack:
- Elasticsearch: Store and search logs
- Logstash/Fluentd: Collect and parse logs
- Kibana: Visualize and query logs
Query example (Kibana):
service:"payment-service" AND level:"ERROR" AND timestamp:[now-1h TO now]
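As a sketch of what that query does server-side, here is the same filter expressed over in-memory records (the log records are made up; Elasticsearch evaluates this against its index instead of scanning):

```python
from datetime import datetime, timedelta, timezone

def match(log, now):
    """service == payment-service AND level == ERROR AND within last hour."""
    return (log["service"] == "payment-service"
            and log["level"] == "ERROR"
            and now - log["timestamp"] <= timedelta(hours=1))

now = datetime(2024, 1, 15, 11, 0, tzinfo=timezone.utc)
logs = [
    {"service": "payment-service", "level": "ERROR",
     "timestamp": now - timedelta(minutes=30)},   # matches
    {"service": "payment-service", "level": "INFO",
     "timestamp": now - timedelta(minutes=5)},    # wrong level
    {"service": "auth-service", "level": "ERROR",
     "timestamp": now - timedelta(minutes=10)},   # wrong service
]
hits = [log for log in logs if match(log, now)]
assert len(hits) == 1
```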
Benefits:
- Search across all logs (find issues fast)
- Correlation (trace request through services)
- Retention (keep logs for compliance, 90 days)
Common Interview Questions
Q1: "How do you debug a sudden spike in errors?"
Answer (step-by-step):
- Check dashboard: Error rate, affected endpoints, regions
- Check recent deployments: Rollback if new release
- Check logs: Filter by error level, look for patterns
- Check traces: Identify slow/failing service
- Check external dependencies: 3rd-party APIs, database
- Mitigate: Rollback, scale up, or circuit breaker
- Root cause: Deep dive after incident resolved
Q2: "What are the three pillars of observability?"
Answer:
- Logs: Discrete events (what happened)
- Metrics: Aggregated numbers (how much/how fast)
- Traces: Request flow (where time is spent)
Together: Complete picture for debugging unknown issues.
Q3: "What metrics would you track for a web application?"
Answer (Google's Four Golden Signals):
- Latency: P95 response time (< 200ms target)
- Traffic: Requests per second (QPS)
- Errors: 5xx error rate (< 0.1% target)
- Saturation: CPU/memory utilization (< 80%)
Plus business metrics:
- User signups, conversions, revenue
Q4: "How do you prevent alert fatigue?"
Answer:
- Alert on symptoms, not causes (customer impact vs disk space)
- Actionable alerts - include runbook (how to fix)
- Proper severity - not everything is "critical"
- Tune thresholds - avoid flapping (alert, resolve, alert...)
- Group related alerts - don't send 100 alerts for same issue
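The grouping point can be sketched as deduplication by fingerprint. The fingerprint fields here (alert name + service) are an illustrative choice; in Prometheus Alertmanager the equivalent behavior is configured via `group_by`:

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse alerts sharing a fingerprint into one notification group."""
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = (alert["name"], alert["service"])  # assumed grouping key
        groups[fingerprint].append(alert)
    return groups

# 100 pods fire the same alert, plus one unrelated latency alert.
alerts = [
    {"name": "HighErrorRate", "service": "api", "pod": f"api-{i}"}
    for i in range(100)
] + [{"name": "HighLatency", "service": "db", "pod": "db-0"}]

grouped = group_alerts(alerts)
assert len(grouped) == 2   # 101 raw alerts -> 2 notifications
```

The on-call engineer sees one "HighErrorRate on api (100 pods)" page instead of 100 separate ones.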
Trade-offs
| Aspect | Option A | Option B | Consider |
|---|---|---|---|
| Logging | Verbose (DEBUG in prod) | Minimal (ERROR only) | Debug speed vs storage cost |
| Metrics | High cardinality (many labels) | Low cardinality | Query power vs cost |
| Sampling | Trace 100% requests | Sample 1-10% | Accuracy vs overhead |
| Retention | 90 days | 7 days | Compliance vs cost |
Real-World Examples
Google (SRE Model)
- SLIs/SLOs: Formalized reliability targets
- Error budgets: 0.1% downtime → 43 min/month
- Postmortems: Blameless, focus on learning
- Result: 99.99% uptime for critical services
Netflix (Chaos Engineering)
- Chaos Monkey: Randomly kill servers (test resilience)
- Observability: Extensive metrics, traces, logs
- Alerting: Automated remediation (auto-scale, failover)
- Result: Lose AWS region, still serve content
Uber (Distributed Tracing)
- Jaeger: Track requests across 2000+ microservices
- Use case: "Why is this ride taking 5 seconds to match?"
- Result: Find bottleneck (geocoding service), optimize
Quick Reference Card
Three pillars:
- Logs: Discrete events (what happened)
- Metrics: Aggregated numbers (how much)
- Traces: Request flow (where time spent)
Four Golden Signals:
- Latency (P50, P95, P99)
- Traffic (QPS)
- Errors (error rate)
- Saturation (CPU, memory)
Log levels:
- DEBUG: Development only
- INFO: Normal events
- WARN: Potential issues
- ERROR: Failures
- FATAL: System crashes
Alert severity:
- Critical: PagerDuty (wake up on-call)
- High: Slack (immediate)
- Medium: Email (review later)
Tools:
- Logs: ELK, Splunk, CloudWatch
- Metrics: Prometheus, Datadog, Grafana
- Traces: Jaeger, Zipkin, X-Ray
Further Reading
- Google SRE Book - Free, comprehensive
- Prometheus Best Practices
- Distributed Tracing - OpenTelemetry
- Monitoring 101
Part 1 Complete! You now have the foundations of system design.
Next: Part 2: Building Blocks - Learn the components you'll use in every design.