
Monitoring & Observability

TL;DR (30-second summary)

Monitoring: Tracking known problems (CPU, memory, errors). Observability: Understanding unknown problems (logs, metrics, traces). Logs: What happened. Metrics: How much/how fast. Traces: Request flow through system. Alerts: Notify when things break.

Golden rule: You can't fix what you can't see. Build observability from day one.

Why This Matters

In interviews: Shows you think about production operations, not just features. Separates senior from junior engineers.

At work: Production issues happen. Observability helps debug fast and prevent customer impact.

Core Concepts

1. Monitoring vs Observability

Monitoring: "Is the system working?"
Observability: "Why isn't it working?"

Example:

  • Monitoring: "Error rate is 5%" (tells you there's a problem)
  • Observability: "POST /api/checkout fails for users in EU region when cart > $1000" (tells you why)

2. The Three Pillars of Observability

Logs

What: Timestamped records of events.

Example:

{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "user_id": "123",
  "order_id": "456",
  "error": "InsufficientFunds",
  "amount": 99.99
}

Log levels:

  • DEBUG: Detailed info for debugging (disabled in production)
  • INFO: Normal events (user login, order created)
  • WARN: Potential issues (slow query, retrying request)
  • ERROR: Failures (payment failed, API timeout)
  • FATAL: System crashes

Best practices:

  • Structured logging (JSON) - easier to parse and search
  • Include context: user_id, request_id, trace_id
  • Don't log sensitive data (passwords, credit cards)
  • Use log levels appropriately (don't INFO everything)
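The practices above can be sketched with Python's stdlib `logging` module and a custom JSON formatter (the field names and `payment-service` name are illustrative, matching the earlier example):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line with structured context."""
    CONTEXT_FIELDS = ("user_id", "order_id", "trace_id", "error")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge context passed via logging's `extra` argument.
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # DEBUG stays off outside development

logger.error("Payment processing failed",
             extra={"user_id": "123", "order_id": "456",
                    "error": "InsufficientFunds"})
```

Because every record becomes a single JSON line, a log pipeline can index each field (`user_id`, `error`, ...) instead of grepping free text.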

Metrics

What: Numerical measurements over time.

Types:

  1. Counter: Cumulative value (only increases)
    • Example: requests_total, errors_total
  2. Gauge: Current value (can go up/down)
    • Example: cpu_usage, active_connections
  3. Histogram: Distribution of values
    • Example: request_duration (P50, P95, P99)
  4. Summary: Similar to histogram, pre-calculated percentiles
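The semantics of the first three types can be captured in a few lines of Python (a toy in-memory sketch, not a real metrics client such as `prometheus_client`):

```python
from bisect import bisect_left

class Counter:
    """Cumulative value: can only increase (resets only on restart)."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters never decrease"
        self.value += amount

class Gauge:
    """Point-in-time value: can go up or down."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets by upper bound, Prometheus-style."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # le="..." upper bounds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot = +Inf
    def observe(self, value):
        self.counts[bisect_left(self.bounds, value)] += 1

requests_total = Counter()
requests_total.inc()                 # one more request served
latency = Histogram([0.1, 0.5, 1.0])
latency.observe(0.07)                # lands in the le="0.1" bucket
```

The key distinction to remember: a counter answers "how many ever", a gauge answers "how many right now", and a histogram answers "how are values distributed".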

Example (Prometheus):

# Counter
http_requests_total{method="GET", status="200"} 1547

# Gauge
database_connections_active 42

# Histogram
http_request_duration_seconds_bucket{le="0.1"} 95
http_request_duration_seconds_bucket{le="0.5"} 98
http_request_duration_seconds_bucket{le="1.0"} 100
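Percentiles can be estimated from cumulative buckets like the ones above by interpolating inside the bucket that contains the target rank, which is roughly what Prometheus's histogram_quantile does (a simplified sketch that assumes a lower bound of 0 for the first bucket):

```python
def estimate_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound.
    Returns a linear-interpolation estimate of the q-th quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            # Interpolate between this bucket's lower and upper bound.
            return prev_bound + (rank - prev_count) / in_bucket * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Using the bucket counts shown above:
p95 = estimate_quantile(0.95, [(0.1, 95), (0.5, 98), (1.0, 100)])  # ≈ 0.1 s
```

Note the trade-off: bucket boundaries limit accuracy, so the P95 here can only be located within a bucket, not exactly.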

Key metrics (Google's Four Golden Signals):

  1. Latency: How long requests take (P50, P95, P99)
  2. Traffic: Requests per second (QPS)
  3. Errors: Error rate (%)
  4. Saturation: Resource utilization (CPU, memory, disk)

Traces

What: Track a single request as it flows through multiple services.

Trace breakdown (total: 295ms):

  • API Gateway: 10ms
  • Auth Service: 5ms
  • Order Service: 50ms
  • Payment Service: 200ms ← bottleneck
  • Database: 30ms

Distributed tracing tools:

  • Jaeger: Open-source, CNCF project
  • Zipkin: Twitter's open-source tool
  • AWS X-Ray: AWS-native tracing
  • Datadog APM: Commercial, full-featured

Trace context (propagated via headers):

X-Trace-Id: 1234567890abcdef
X-Span-Id: abcdef1234567890
X-Parent-Span-Id: fedcba0987654321
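Propagation boils down to: reuse the incoming trace ID, mint a fresh span ID, and record the caller's span as the parent. A minimal sketch using the header names above (illustrative; the W3C standard uses a single `traceparent` header):

```python
import uuid

def start_span(incoming_headers):
    """Continue the incoming trace (or start a new one at the edge) and
    return the headers to attach to downstream calls."""
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex[:16]
    span_id = uuid.uuid4().hex[:16]
    return {
        "X-Trace-Id": trace_id,          # identical across every hop
        "X-Span-Id": span_id,            # unique to this hop
        "X-Parent-Span-Id": incoming_headers.get("X-Span-Id", ""),
    }
```

Because every service forwards the same `X-Trace-Id`, the tracing backend can stitch all spans of one request into the breakdown shown earlier.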

3. Alerting

Purpose: Notify team when something breaks (before customers complain).

Alert rules:

# Prometheus alerting rule example (Alertmanager routes the firing alerts)
- alert: HighErrorRate
  expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate on {{ $labels.service }}"
    description: "Error rate is {{ $value | humanizePercentage }}"

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency > 1 second"
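The `for: 5m` clause means the condition must hold continuously before the alert fires, which filters out brief spikes. That logic can be sketched in a few lines (a simplification of Prometheus's actual pending/firing state machine):

```python
def should_fire(samples, threshold, for_minutes):
    """samples: list of (minute, error_rate) in ascending time order.
    Fire only if the rate has stayed above `threshold` for at least
    `for_minutes` without interruption."""
    breach_start = None
    for minute, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = minute        # breach begins (pending state)
            if minute - breach_start >= for_minutes:
                return True                  # sustained breach: fire
        else:
            breach_start = None              # any healthy sample resets the timer
    return False
```

A single bad scrape resets the timer, so only sustained problems page the on-call engineer.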

Alert best practices:

Do                                      | Don't
✅ Alert on symptoms (customer impact)  | ❌ Alert on causes (disk 80% full)
✅ Actionable (team can fix)            | ❌ Noisy (alert fatigue)
✅ Clear runbook (how to fix)           | ❌ Vague ("something is wrong")
✅ Proper severity (critical/warning)   | ❌ Everything is "critical"

On-call rotation:

  • Primary on-call: First responder
  • Secondary on-call: Backup if primary doesn't respond
  • Escalation policy: Auto-escalate after N minutes

4. Dashboards

Purpose: Visualize system health at a glance.

Types:

  1. Overview Dashboard: High-level metrics (traffic, errors, latency)
  2. Service Dashboard: Deep dive into single service
  3. Business Dashboard: User signups, revenue, conversions

Example metrics to display:

Service: API Gateway

Traffic:
- Requests/second (RPS): 1,234 ▲ 5%
- Bytes in/out: 10 MB/s

Latency:
- P50: 45ms
- P95: 120ms ▼ 10ms
- P99: 250ms

Errors:
- 4xx errors: 2.3% (client errors)
- 5xx errors: 0.1% ▲ 0.05% (⚠️ trending up)

Resources:
- CPU: 65%
- Memory: 4.2 GB / 8 GB
- Connections: 450 / 1000

Dashboard tools:

  • Grafana: Open-source, flexible, integrates with Prometheus
  • Datadog: Commercial, full-featured
  • CloudWatch: AWS-native
  • New Relic: APM + dashboards

5. SLIs, SLOs, SLAs

SLI (Service Level Indicator): A metric (e.g., latency, uptime)
SLO (Service Level Objective): Target for SLI (e.g., 99.9% uptime)
SLA (Service Level Agreement): Contract with customers (e.g., refund if < 99.9%)

Example:

Service: Payment API

SLIs:
1. Availability (uptime)
2. Latency (P95 response time)
3. Error rate

SLOs:
1. 99.95% availability (21.6 min downtime/month)
2. P95 latency < 200ms
3. Error rate < 0.1%

SLA (customer-facing):
- 99.9% uptime guarantee
- If violated → 10% refund

Error budget:

SLO: 99.95% availability
Error budget: 0.05% = 21.6 minutes/month

If we use up error budget → freeze deploys, focus on reliability
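The budget arithmetic above is just the SLO's complement applied to the window length, which is worth being able to do on the spot in an interview:

```python
def error_budget_minutes(slo_percent, days=30):
    """Downtime allowance per window implied by an availability SLO."""
    total_minutes = days * 24 * 60          # 43,200 for a 30-day month
    return total_minutes * (100 - slo_percent) / 100

error_budget_minutes(99.95)   # 21.6 minutes per 30-day month
error_budget_minutes(99.9)    # 43.2 minutes
error_budget_minutes(99.99)   # ~4.3 minutes
```

Each extra "nine" cuts the budget by 10x, which is why 99.99% targets usually require automated failover rather than a human on-call.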

6. Centralized Logging

Problem: Logs scattered across 100s of servers.

Solution: Centralized logging (ELK Stack, Splunk, etc.)

ELK Stack:

  • Elasticsearch: Store and search logs
  • Logstash/Fluentd: Collect and parse logs
  • Kibana: Visualize and query logs

Query example (Kibana):

service:"payment-service" AND level:"ERROR" AND timestamp:[now-1h TO now]
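The same filter can be expressed by hand over a stream of structured JSON log lines, which shows why structured logging makes this query trivial (a sketch; in practice Elasticsearch does the indexing and matching):

```python
import json

def matching_errors(lines, service, since_ts):
    """Yield ERROR entries for one service newer than since_ts.
    ISO-8601 timestamps sort lexicographically, so string comparison
    is enough for the time-range filter."""
    for line in lines:
        entry = json.loads(line)
        if (entry["service"] == service
                and entry["level"] == "ERROR"
                and entry["timestamp"] >= since_ts):
            yield entry
```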

Benefits:

  • Search across all logs (find issues fast)
  • Correlation (trace request through services)
  • Retention (keep logs for compliance, 90 days)

Common Interview Questions

Q1: "How do you debug a sudden spike in errors?"

Answer (step-by-step):

  1. Check dashboard: Error rate, affected endpoints, regions
  2. Check recent deployments: Rollback if new release
  3. Check logs: Filter by error level, look for patterns
  4. Check traces: Identify slow/failing service
  5. Check external dependencies: 3rd-party APIs, database
  6. Mitigate: Rollback, scale up, or circuit breaker
  7. Root cause: Deep dive after incident resolved

Q2: "What are the three pillars of observability?"

Answer:

  1. Logs: Discrete events (what happened)
  2. Metrics: Aggregated numbers (how much/how fast)
  3. Traces: Request flow (where time is spent)

Together: Complete picture for debugging unknown issues.

Q3: "What metrics would you track for a web application?"

Answer (Google's Four Golden Signals):

  1. Latency: P95 response time (< 200ms target)
  2. Traffic: Requests per second (QPS)
  3. Errors: 5xx error rate (< 0.1% target)
  4. Saturation: CPU/memory utilization (< 80%)

Plus business metrics:

  • User signups, conversions, revenue

Q4: "How do you prevent alert fatigue?"

Answer:

  1. Alert on symptoms, not causes (customer impact vs disk space)
  2. Actionable alerts - include runbook (how to fix)
  3. Proper severity - not everything is "critical"
  4. Tune thresholds - avoid flapping (alert, resolve, alert...)
  5. Group related alerts - don't send 100 alerts for same issue

Trade-offs

Aspect    | Option A                       | Option B            | Consider
Logging   | Verbose (DEBUG in prod)        | Minimal (ERROR only)| Debug speed vs storage cost
Metrics   | High cardinality (many labels) | Low cardinality     | Query power vs cost
Sampling  | Trace 100% of requests         | Sample 1-10%        | Accuracy vs overhead
Retention | 90 days                        | 7 days              | Compliance vs cost

Real-World Examples

Google (SRE Model)

  • SLIs/SLOs: Formalized reliability targets
  • Error budgets: 0.1% downtime → 43.2 min/month
  • Postmortems: Blameless, focus on learning
  • Result: 99.99% uptime for critical services

Netflix (Chaos Engineering)

  • Chaos Monkey: Randomly kill servers (test resilience)
  • Observability: Extensive metrics, traces, logs
  • Alerting: Automated remediation (auto-scale, failover)
  • Result: Lose AWS region, still serve content

Uber (Distributed Tracing)

  • Jaeger: Track requests across 2000+ microservices
  • Use case: "Why is this ride taking 5 seconds to match?"
  • Result: Find bottleneck (geocoding service), optimize

Quick Reference Card

Three pillars:

  1. Logs: Discrete events (what happened)
  2. Metrics: Aggregated numbers (how much)
  3. Traces: Request flow (where time spent)

Four Golden Signals:

  1. Latency (P50, P95, P99)
  2. Traffic (QPS)
  3. Errors (error rate)
  4. Saturation (CPU, memory)

Log levels:

  • DEBUG: Development only
  • INFO: Normal events
  • WARN: Potential issues
  • ERROR: Failures
  • FATAL: System crashes

Alert severity:

  • Critical: PagerDuty (wake up on-call)
  • High: Slack (immediate)
  • Medium: Email (review later)

Tools:

  • Logs: ELK, Splunk, CloudWatch
  • Metrics: Prometheus, Datadog, Grafana
  • Traces: Jaeger, Zipkin, X-Ray

Part 1 Complete! You now have the foundations of system design.

Next: Part 2: Building Blocks - Learn the components you'll use in every design.