
Monitoring & Observability

TL;DR (30-second summary)

Monitoring: Tracking known problems (CPU, memory, errors). Observability: Understanding unknown problems (logs, metrics, traces). Logs: What happened. Metrics: How much/how fast. Traces: Request flow through system. Alerts: Notify when things break.

Golden rule: You can't fix what you can't see. Build observability from day one.

Why This Matters

In interviews: Shows you think about production operations, not just features. Separates senior from junior engineers.

At work: Production issues happen. Observability helps debug fast and prevent customer impact.

Core Concepts

1. Monitoring vs Observability

Monitoring: "Is the system working?"
Observability: "Why isn't it working?"

Example:

  • Monitoring: "Error rate is 5%" (tells you there's a problem)
  • Observability: "POST /api/checkout fails for users in EU region when cart > $1000" (tells you why)

2. The Three Pillars of Observability

Logs

What: Timestamped records of events.

Example:

{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "user_id": "123",
  "order_id": "456",
  "error": "InsufficientFunds",
  "amount": 99.99
}

Log levels:

  • DEBUG: Detailed info for debugging (disabled in production)
  • INFO: Normal events (user login, order created)
  • WARN: Potential issues (slow query, retrying request)
  • ERROR: Failures (payment failed, API timeout)
  • FATAL: System crashes

Best practices:

  • Structured logging (JSON) - easier to parse and search
  • Include context: user_id, request_id, trace_id
  • Don't log sensitive data (passwords, credit cards)
  • Use log levels appropriately (don't INFO everything)
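The practices above can be sketched with Python's stdlib `logging` module and a custom JSON formatter (the field names and `payment-service` name are illustrative, matching the earlier example):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line with structured context."""
    CONTEXT_FIELDS = ("user_id", "order_id", "trace_id", "error")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge context passed via logging's `extra` argument.
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # DEBUG stays off outside development

logger.error("Payment processing failed",
             extra={"user_id": "123", "order_id": "456",
                    "error": "InsufficientFunds"})
```

Because every record becomes a single JSON line, a log pipeline can index each field (`user_id`, `error`, ...) instead of grepping free text.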

Metrics

What: Numerical measurements over time.

Types:

  1. Counter: Cumulative value (only increases)
    • Example: requests_total, errors_total
  2. Gauge: Current value (can go up/down)
    • Example: cpu_usage, active_connections
  3. Histogram: Distribution of values
    • Example: request_duration (P50, P95, P99)
  4. Summary: Similar to histogram, pre-calculated percentiles
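The semantics of the first three types can be captured in a few lines of Python (a toy in-memory sketch, not a real metrics client such as `prometheus_client`):

```python
from bisect import bisect_left

class Counter:
    """Cumulative value: can only increase (resets only on restart)."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters never decrease"
        self.value += amount

class Gauge:
    """Point-in-time value: can go up or down."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets by upper bound, Prometheus-style."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # le="..." upper bounds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot = +Inf
    def observe(self, value):
        self.counts[bisect_left(self.bounds, value)] += 1

requests_total = Counter()
requests_total.inc()                 # one more request served
latency = Histogram([0.1, 0.5, 1.0])
latency.observe(0.07)                # lands in the le="0.1" bucket
```

The key distinction to remember: a counter answers "how many ever", a gauge answers "how many right now", and a histogram answers "how are values distributed".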

Example (Prometheus):

# Counter
http_requests_total{method="GET", status="200"} 1547

# Gauge
database_connections_active 42

# Histogram
http_request_duration_seconds_bucket{le="0.1"} 95
http_request_duration_seconds_bucket{le="0.5"} 98
http_request_duration_seconds_bucket{le="1.0"} 100
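Percentiles can be estimated from cumulative buckets like the ones above by interpolating inside the bucket that contains the target rank, which is roughly what Prometheus's histogram_quantile does (a simplified sketch that assumes a lower bound of 0 for the first bucket):

```python
def estimate_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound.
    Returns a linear-interpolation estimate of the q-th quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            # Interpolate between this bucket's lower and upper bound.
            return prev_bound + (rank - prev_count) / in_bucket * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Using the bucket counts shown above:
p95 = estimate_quantile(0.95, [(0.1, 95), (0.5, 98), (1.0, 100)])  # ≈ 0.1 s
```

Note the trade-off: bucket boundaries limit accuracy, so the P95 here can only be located within a bucket, not exactly.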

Key metrics (Google's Four Golden Signals):

  1. Latency: How long requests take (P50, P95, P99)
  2. Traffic: Requests per second (QPS)
  3. Errors: Error rate (%)
  4. Saturation: Resource utilization (CPU, memory, disk)

Traces

What: Track a single request as it flows through multiple services.

Trace breakdown (total: 295ms):

  • API Gateway: 10ms
  • Auth Service: 5ms
  • Order Service: 50ms
  • Payment Service: 200ms ← bottleneck
  • Database: 30ms

Distributed tracing tools:

  • Jaeger: Open-source, CNCF project
  • Zipkin: Twitter's open-source tool
  • AWS X-Ray: AWS-native tracing
  • Datadog APM: Commercial, full-featured

Trace context (propagated via headers):

X-Trace-Id: 1234567890abcdef
X-Span-Id: abcdef1234567890
X-Parent-Span-Id: fedcba0987654321
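Propagation boils down to: reuse the incoming trace ID, mint a fresh span ID, and record the caller's span as the parent. A minimal sketch using the header names above (illustrative; the W3C standard uses a single `traceparent` header):

```python
import uuid

def start_span(incoming_headers):
    """Continue the incoming trace (or start a new one at the edge) and
    return the headers to attach to downstream calls."""
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex[:16]
    span_id = uuid.uuid4().hex[:16]
    return {
        "X-Trace-Id": trace_id,          # identical across every hop
        "X-Span-Id": span_id,            # unique to this hop
        "X-Parent-Span-Id": incoming_headers.get("X-Span-Id", ""),
    }
```

Because every service forwards the same `X-Trace-Id`, the tracing backend can stitch all spans of one request into the breakdown shown earlier.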

3. Alerting

Purpose: Notify team when something breaks (before customers complain).

Alert rules:

# Prometheus alerting rule example (Alertmanager routes the firing alerts)
- alert: HighErrorRate
  expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate on {{ $labels.service }}"
    description: "Error rate is {{ $value | humanizePercentage }}"

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency > 1 second"
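The `for: 5m` clause means the condition must hold continuously before the alert fires, which filters out brief spikes. That logic can be sketched in a few lines (a simplification of Prometheus's actual pending/firing state machine):

```python
def should_fire(samples, threshold, for_minutes):
    """samples: list of (minute, error_rate) in ascending time order.
    Fire only if the rate has stayed above `threshold` for at least
    `for_minutes` without interruption."""
    breach_start = None
    for minute, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = minute        # breach begins (pending state)
            if minute - breach_start >= for_minutes:
                return True                  # sustained breach: fire
        else:
            breach_start = None              # any healthy sample resets the timer
    return False
```

A single bad scrape resets the timer, so only sustained problems page the on-call engineer.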

Alert best practices:

Do                                      | Don't
✅ Alert on symptoms (customer impact)  | ❌ Alert on causes (disk 80% full)
✅ Actionable (team can fix)            | ❌ Noisy (alert fatigue)
✅ Clear runbook (how to fix)           | ❌ Vague ("something is wrong")
✅ Proper severity (critical/warning)   | ❌ Everything is "critical"

On-call rotation:

  • Primary on-call: First responder
  • Secondary on-call: Backup if primary doesn't respond
  • Escalation policy: Auto-escalate after N minutes

4. Dashboards

Purpose: Visualize system health at a glance.

Types:

  1. Overview Dashboard: High-level metrics (traffic, errors, latency)
  2. Service Dashboard: Deep dive into single service
  3. Business Dashboard: User signups, revenue, conversions

Example metrics to display:

Service: API Gateway

Traffic:
- Requests/second (RPS): 1,234 ▲ 5%
- Bytes in/out: 10 MB/s

Latency:
- P50: 45ms
- P95: 120ms ▼ 10ms
- P99: 250ms

Errors:
- 4xx errors: 2.3% (client errors)
- 5xx errors: 0.1% ▲ 0.05% (⚠️ trending up)

Resources:
- CPU: 65%
- Memory: 4.2 GB / 8 GB
- Connections: 450 / 1000

Dashboard tools:

  • Grafana: Open-source, flexible, integrates with Prometheus
  • Datadog: Commercial, full-featured
  • CloudWatch: AWS-native
  • New Relic: APM + dashboards

5. SLIs, SLOs, SLAs

SLI (Service Level Indicator): A metric (e.g., latency, uptime)
SLO (Service Level Objective): Target for SLI (e.g., 99.9% uptime)
SLA (Service Level Agreement): Contract with customers (e.g., refund if < 99.9%)

Example:

Service: Payment API

SLIs:
1. Availability (uptime)
2. Latency (P95 response time)
3. Error rate

SLOs:
1. 99.95% availability (21.6 min downtime/month)
2. P95 latency < 200ms
3. Error rate < 0.1%

SLA (customer-facing):
- 99.9% uptime guarantee
- If violated → 10% refund

Error budget:

SLO: 99.95% availability
Error budget: 0.05% = 21.6 minutes/month

If we use up error budget → freeze deploys, focus on reliability
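The budget arithmetic above is just the SLO's complement applied to the window length, which is worth being able to do on the spot in an interview:

```python
def error_budget_minutes(slo_percent, days=30):
    """Downtime allowance per window implied by an availability SLO."""
    total_minutes = days * 24 * 60          # 43,200 for a 30-day month
    return total_minutes * (100 - slo_percent) / 100

error_budget_minutes(99.95)   # 21.6 minutes per 30-day month
error_budget_minutes(99.9)    # 43.2 minutes
error_budget_minutes(99.99)   # ~4.3 minutes
```

Each extra "nine" cuts the budget by 10x, which is why 99.99% targets usually require automated failover rather than a human on-call.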

6. Centralized Logging

Problem: Logs scattered across 100s of servers.

Solution: Centralized logging (ELK Stack, Splunk, etc.)

ELK Stack:

  • Elasticsearch: Store and search logs
  • Logstash/Fluentd: Collect and parse logs
  • Kibana: Visualize and query logs

Query example (Kibana):

service:"payment-service" AND level:"ERROR" AND timestamp:[now-1h TO now]
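The same filter can be expressed by hand over a stream of structured JSON log lines, which shows why structured logging makes this query trivial (a sketch; in practice Elasticsearch does the indexing and matching):

```python
import json

def matching_errors(lines, service, since_ts):
    """Yield ERROR entries for one service newer than since_ts.
    ISO-8601 timestamps sort lexicographically, so string comparison
    is enough for the time-range filter."""
    for line in lines:
        entry = json.loads(line)
        if (entry["service"] == service
                and entry["level"] == "ERROR"
                and entry["timestamp"] >= since_ts):
            yield entry
```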

Benefits:

  • Search across all logs (find issues fast)
  • Correlation (trace request through services)
  • Retention (keep logs for compliance, 90 days)

Common Interview Questions

Q1: "How do you debug a sudden spike in errors?"

Answer (step-by-step):

  1. Check dashboard: Error rate, affected endpoints, regions
  2. Check recent deployments: Rollback if new release
  3. Check logs: Filter by error level, look for patterns
  4. Check traces: Identify slow/failing service
  5. Check external dependencies: 3rd-party APIs, database
  6. Mitigate: Rollback, scale up, or circuit breaker
  7. Root cause: Deep dive after incident resolved

Q2: "What are the three pillars of observability?"

Answer:

  1. Logs: Discrete events (what happened)
  2. Metrics: Aggregated numbers (how much/how fast)
  3. Traces: Request flow (where time is spent)

Together: Complete picture for debugging unknown issues.

Q3: "What metrics would you track for a web application?"

Answer (Google's Four Golden Signals):

  1. Latency: P95 response time (< 200ms target)
  2. Traffic: Requests per second (QPS)
  3. Errors: 5xx error rate (< 0.1% target)
  4. Saturation: CPU/memory utilization (< 80%)

Plus business metrics:

  • User signups, conversions, revenue

Q4: "How do you prevent alert fatigue?"

Answer:

  1. Alert on symptoms, not causes (customer impact vs disk space)
  2. Actionable alerts - include runbook (how to fix)
  3. Proper severity - not everything is "critical"
  4. Tune thresholds - avoid flapping (alert, resolve, alert...)
  5. Group related alerts - don't send 100 alerts for same issue

Trade-offs

Aspect    | Option A                       | Option B            | Consider
Logging   | Verbose (DEBUG in prod)        | Minimal (ERROR only)| Debug speed vs storage cost
Metrics   | High cardinality (many labels) | Low cardinality     | Query power vs cost
Sampling  | Trace 100% of requests         | Sample 1-10%        | Accuracy vs overhead
Retention | 90 days                        | 7 days              | Compliance vs cost

Real-World Examples

Google (SRE Model)

  • SLIs/SLOs: Formalized reliability targets
  • Error budgets: 0.1% downtime → 43.2 min/month
  • Postmortems: Blameless, focus on learning
  • Result: 99.99% uptime for critical services

Netflix (Chaos Engineering)

  • Chaos Monkey: Randomly kill servers (test resilience)
  • Observability: Extensive metrics, traces, logs
  • Alerting: Automated remediation (auto-scale, failover)
  • Result: Lose AWS region, still serve content

Uber (Distributed Tracing)

  • Jaeger: Track requests across 2000+ microservices
  • Use case: "Why is this ride taking 5 seconds to match?"
  • Result: Find bottleneck (geocoding service), optimize

Quick Reference Card

Three pillars:

  1. Logs: Discrete events (what happened)
  2. Metrics: Aggregated numbers (how much)
  3. Traces: Request flow (where time spent)

Four Golden Signals:

  1. Latency (P50, P95, P99)
  2. Traffic (QPS)
  3. Errors (error rate)
  4. Saturation (CPU, memory)

Log levels:

  • DEBUG: Development only
  • INFO: Normal events
  • WARN: Potential issues
  • ERROR: Failures
  • FATAL: System crashes

Alert severity:

  • Critical: PagerDuty (wake up on-call)
  • High: Slack (immediate)
  • Medium: Email (review later)

Tools:

  • Logs: ELK, Splunk, CloudWatch
  • Metrics: Prometheus, Datadog, Grafana
  • Traces: Jaeger, Zipkin, X-Ray

Part 1 Complete! You now have the foundations of system design.

Next: Part 2: Building Blocks - Learn the components you'll use in every design.