
The Slow Query Mystery

The Interview Question

"We have a database query that usually takes 10ms but sometimes takes 10 seconds. Same query, same parameters. The DBA says indexes are fine. What's happening and how do you fix it?"

Asked at: Google, Amazon, Stripe, Meta

Time to solve: 30-35 minutes

Difficulty: ⭐⭐⭐⭐ (Senior)


Clarifying Questions to Ask

  1. "How often does the slow query occur?" → 1% vs 50% changes the investigation approach
  2. "Is it random users or specific patterns?" → Time of day, specific accounts?
  3. "What database?" → PostgreSQL, MySQL, etc.
  4. "When did this start?" → Recent deploy? Data growth?
  5. "Are there other symptoms?" → High CPU, disk IO?

Common Causes (And How to Identify)

Cause 1: Query Plan Flip

The Problem: Database optimizer chooses different execution plans based on table statistics. Sometimes it picks a bad plan.

-- Same query, different plans
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 12345;

-- Fast plan (Index Scan):
-- Index Scan using idx_orders_user_id on orders
-- Actual Time: 0.05..8.12 ms

-- Slow plan (Seq Scan):
-- Seq Scan on orders
-- Filter: (user_id = 12345)
-- Actual Time: 1500..9800 ms

Why it happens: Outdated statistics, or a cached generic plan from a prepared statement that is bad for some parameter values (PostgreSQL's flavor of "parameter sniffing").

Solution:

-- PostgreSQL: Update statistics
ANALYZE orders;

-- Force custom plans for prepared statements (PostgreSQL 12+)
SET plan_cache_mode = force_custom_plan;

-- Or use the pg_hint_plan extension -- bare /*+ ... */ hint
-- comments are ignored by vanilla PostgreSQL
LOAD 'pg_hint_plan';
SET pg_hint_plan.enable_hint TO on;
SELECT /*+ IndexScan(orders idx_orders_user_id) */ *
FROM orders WHERE user_id = 12345;

Cause 2: Lock Contention

The Problem: Query waiting for locks held by other transactions.

-- Check for lock waits (PostgreSQL)
SELECT
  blocked_locks.pid AS blocked_pid,
  blocked_activity.query AS blocked_query,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.query AS blocking_query,
  blocked_activity.wait_event_type
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;

Output when locks are the issue:

blocked_query: SELECT * FROM orders WHERE user_id = 12345 FOR UPDATE
blocking_query: UPDATE orders SET status = 'shipped' WHERE created_at < now() - interval '30 days'
wait_event_type: Lock

(Note: thanks to MVCC, a plain SELECT in PostgreSQL is not blocked by row updates; lock waits typically involve SELECT ... FOR UPDATE, DDL, or explicit LOCK TABLE.)

Solution:

-- 1. Add timeout to prevent indefinite waits
SET lock_timeout = '5s';

-- 2. Lock rows explicitly only when needed; SKIP LOCKED skips rows
--    held by other transactions instead of waiting (changes results!)
SELECT * FROM orders WHERE user_id = 12345 FOR UPDATE SKIP LOCKED;

-- 3. Break up large updates
-- Instead of:
UPDATE orders SET status = 'shipped' WHERE created_at < now() - interval '30 days';

-- Do batched updates (repeat until 0 rows affected):
UPDATE orders SET status = 'shipped'
WHERE id IN (
  SELECT id FROM orders
  WHERE created_at < now() - interval '30 days'
  AND status != 'shipped' -- so each batch makes progress
  LIMIT 1000
);
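The batching above needs a driver loop that repeats until no rows remain, committing between batches so locks are held only briefly. A minimal sketch, using an in-memory SQLite table as a stand-in for the orders table (the date filter from the example is dropped to keep it self-contained; real code would target PostgreSQL):

```python
import sqlite3

def batched_update(conn, batch_size=1000):
    """Repeatedly ship pending orders in small batches until none remain."""
    total = 0
    while True:
        cur = conn.execute(
            """UPDATE orders SET status = 'shipped'
               WHERE id IN (
                   SELECT id FROM orders
                   WHERE status != 'shipped'  -- so each batch makes progress
                   LIMIT ?
               )""",
            (batch_size,),
        )
        conn.commit()  # short transactions -> locks released quickly
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (id, status) VALUES (?, 'pending')",
    [(i,) for i in range(2500)],
)
print(batched_update(conn, batch_size=1000))  # 2500, done in 3 batches
```

Each iteration touches at most 1000 rows, so a blocked reader waits milliseconds instead of the full table scan's duration.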

Cause 3: Buffer Pool Miss

The Problem: Data not in memory, causing disk reads.

-- PostgreSQL: Check buffer cache hit ratio
SELECT
relname,
heap_blks_read,
heap_blks_hit,
ROUND(heap_blks_hit::numeric /
NULLIF(heap_blks_hit + heap_blks_read, 0) * 100, 2) as cache_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'orders';

Output showing problem:

relname | heap_blks_read | heap_blks_hit | cache_hit_ratio
--------+----------------+---------------+-----------------
orders | 15000000 | 5000000 | 25.00
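The ratio in that output is just hits / (hits + reads). A tiny helper makes the arithmetic explicit, using the numbers from the sample output above:

```python
def cache_hit_ratio(heap_blks_hit: int, heap_blks_read: int) -> float:
    """Percentage of block requests served from shared_buffers."""
    total = heap_blks_hit + heap_blks_read
    if total == 0:
        return 0.0
    return round(heap_blks_hit / total * 100, 2)

print(cache_hit_ratio(5_000_000, 15_000_000))   # 25.0  -> mostly disk reads
print(cache_hit_ratio(99_000_000, 1_000_000))   # 99.0  -> healthy
```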

Solution:

# postgresql.conf: increase shared_buffers
shared_buffers = 4GB  # a common starting point is ~25% of available RAM

-- Pre-warm critical tables
CREATE EXTENSION pg_prewarm;
SELECT pg_prewarm('orders');
SELECT pg_prewarm('idx_orders_user_id');

Cause 4: Network Latency Jitter

The Problem: Database is fast, network is slow.

# Measure connection time vs query time
import time

def diagnose_query_latency():
    # Time the connection setup (network + auth)
    start = time.time()
    conn = get_connection()  # your app's DB connection helper
    connect_time = time.time() - start

    # Time a trivial query (round-trip overhead)
    cursor = conn.cursor()
    start = time.time()
    cursor.execute("SELECT 1")
    simple_query_time = time.time() - start

    # Time the problematic query
    start = time.time()
    cursor.execute("SELECT * FROM orders WHERE user_id = 12345")
    problem_query_time = time.time() - start

    print(f"Connect: {connect_time*1000:.2f}ms")
    print(f"Simple query: {simple_query_time*1000:.2f}ms")
    print(f"Problem query: {problem_query_time*1000:.2f}ms")

Output indicating network issue:

Connect: 450ms  ← Network problem!
Simple query: 1.2ms
Problem query: 12ms ← Query is actually fast

Cause 5: Vacuum/Autovacuum Running

The Problem: Background maintenance competing for resources.

-- Check if autovacuum is running
SELECT pid, query, state, wait_event_type, query_start
FROM pg_stat_activity
WHERE query LIKE 'autovacuum%';

-- Check vacuum progress
SELECT * FROM pg_stat_progress_vacuum;

Solution:

# postgresql.conf: throttle autovacuum's impact
autovacuum_vacuum_cost_delay = 20ms  # default 2ms; vacuum runs longer but gentler
autovacuum_vacuum_cost_limit = 200   # work allowed between delays; lower = less I/O pressure
autovacuum_naptime = 5min            # default 1min; wake workers less often

-- Schedule heavy vacuums during off-hours
-- Run manually at 3 AM:
VACUUM ANALYZE orders;

Cause 6: Data Skew

The Problem: Some values return many more rows than others.

-- Check data distribution
SELECT user_id, COUNT(*) as order_count
FROM orders
GROUP BY user_id
ORDER BY order_count DESC
LIMIT 10;

Output showing skew:

user_id | order_count
--------+-------------
12345 | 500000 ← VIP customer, half million orders!
67890 | 50
11111 | 45
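Skew like this is easy to flag programmatically: compare each key's row count against a typical (median) key. A small sketch using the counts from the sample output above (the 100x threshold is an arbitrary assumption):

```python
from statistics import median

def detect_skew(counts: dict, ratio_threshold: float = 100.0) -> list:
    """Return keys whose row count dwarfs the median key's count."""
    if not counts:
        return []
    typical = median(counts.values())
    return [
        key for key, n in counts.items()
        if typical > 0 and n / typical >= ratio_threshold
    ]

# Counts from the sample output above
counts = {"12345": 500_000, "67890": 50, "11111": 45}
print(detect_skew(counts))  # ['12345']
```

Running this against the top-N query's output periodically catches VIP accounts before their queries start flipping plans.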

Solution:

-- For high-cardinality users, use pagination
SELECT * FROM orders
WHERE user_id = 12345
AND created_at > '2024-01-01'
ORDER BY id
LIMIT 100;

-- Or create covering index for common query
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC)
INCLUDE (status, total); -- Covering index

Systematic Debugging Approach

class QueryDebugger:
    def __init__(self, db):
        self.db = db  # any DB-API-style connection/wrapper

    def diagnose_slow_query(self, query, params):
        results = {}

        # 1. Check if query plan is consistent
        results["plan_check"] = self.check_query_plan_stability(query, params)

        # 2. Check for lock contention
        results["locks"] = self.check_lock_contention()

        # 3. Check buffer cache
        results["cache"] = self.check_buffer_cache_hit_ratio()

        # 4. Check for running vacuum
        results["vacuum"] = self.check_autovacuum_status()

        # 5. Check connection pool health
        results["pool"] = self.check_connection_pool()

        # 6. Check data distribution
        results["skew"] = self.check_data_skew(query)

        return self.generate_diagnosis(results)

    def check_query_plan_stability(self, query, params):
        """Run EXPLAIN 10 times; check whether the plan changes."""
        plans = []
        for _ in range(10):
            plan = self.db.execute(f"EXPLAIN {query}", params)
            plans.append(plan)

        unique_plans = len(set(str(p) for p in plans))
        return {
            "stable": unique_plans == 1,
            "unique_plans": unique_plans,
            "recommendation": "Consider pg_hint_plan" if unique_plans > 1 else None,
        }
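Comparing raw EXPLAIN ANALYZE output overstates instability, since costs and timings differ on every run. One approach (a sketch, not part of the debugger above) is to strip the numbers before hashing, so only the plan shape is compared:

```python
import hashlib
import re

def plan_fingerprint(explain_output: str) -> str:
    """Hash a plan's shape, ignoring costs, row estimates, and timings."""
    normalized = re.sub(r"\(cost=[^)]*\)", "", explain_output)
    normalized = re.sub(r"\(actual time=[^)]*\)", "", normalized)
    normalized = re.sub(r"\d+(\.\d+)?", "N", normalized)  # any leftover numbers
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

fast = "Index Scan using idx_orders_user_id on orders (cost=0.43..8.45 rows=12)"
fast2 = "Index Scan using idx_orders_user_id on orders (cost=0.56..9.91 rows=40)"
slow = "Seq Scan on orders (cost=0.00..35811.00 rows=12)"

print(plan_fingerprint(fast) == plan_fingerprint(fast2))  # True: same shape
print(plan_fingerprint(fast) == plan_fingerprint(slow))   # False: plan flipped
```

Storing this fingerprint per query over time gives you the "plan changed" signal the monitoring section below alerts on.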

Monitoring Setup

# Prometheus alerts for query latency
groups:
  - name: database_alerts
    rules:
      - alert: QueryLatencySpike
        expr: |
          histogram_quantile(0.99,
            rate(pg_query_duration_seconds_bucket{query="orders_by_user"}[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 query latency spike detected"

      - alert: QueryPlanChange
        expr: |
          changes(pg_stat_statements_plan_hash{query="orders_by_user"}[1h]) > 0
        labels:
          severity: info
        annotations:
          summary: "Query plan changed - investigate if latency affected"

Quick Reference: Diagnostic Commands

Check           | PostgreSQL Command
----------------+-----------------------------------------------
Current queries | SELECT * FROM pg_stat_activity
Lock waits      | SELECT * FROM pg_locks WHERE NOT granted
Cache hit ratio | SELECT * FROM pg_statio_user_tables
Index usage     | SELECT * FROM pg_stat_user_indexes
Autovacuum      | SELECT * FROM pg_stat_progress_vacuum
Query plans     | EXPLAIN (ANALYZE, BUFFERS) SELECT ...
Table stats     | SELECT * FROM pg_stats WHERE tablename = 'x'

Key Takeaways

  1. Same query ≠ same plan - Database optimizer can flip plans
  2. Check locks first - Most common cause of intermittent slowness
  3. Buffer cache matters - Cold cache = disk reads = slow
  4. Data skew is sneaky - Some users have way more data
  5. Monitor query plans - Alert when plans change
  6. Measure at each layer - Network, connection pool, query execution

Red flags to look for:

  • P99 latency >> P50 latency
  • Query plan hash changes over time
  • Lock wait events in pg_stat_activity
  • Buffer cache hit ratio below 95%
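The first red flag, P99 >> P50, can be checked directly from raw latency samples. A minimal sketch (the 10x threshold is an arbitrary assumption; production monitoring would use histogram metrics as in the Prometheus rules above):

```python
def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[index]

def bimodal_latency(samples, ratio_threshold=10.0):
    """Flag the 'usually 10ms, sometimes 10s' pattern."""
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    return p99 / p50 >= ratio_threshold, p50, p99

# 990 fast queries at ~10ms, 10 slow outliers at 10s
samples = [10.0] * 990 + [10_000.0] * 10
flagged, p50, p99 = bimodal_latency(samples)
print(flagged, p50, p99)  # True 10.0 10000.0
```

A healthy unimodal distribution keeps the ratio near 1; a ratio in the hundreds is exactly the symptom described in the interview question.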