The Slow Query Mystery
The Interview Question
"We have a database query that usually takes 10ms but sometimes takes 10 seconds. Same query, same parameters. The DBA says indexes are fine. What's happening and how do you fix it?"
Asked at: Google, Amazon, Stripe, Meta
Time to solve: 30-35 minutes
Difficulty: ⭐⭐⭐⭐ (Senior)
Clarifying Questions to Ask
- "How often does the slow query occur?" → 1% vs 50% changes the investigation approach
- "Is it random users or specific patterns?" → Time of day, specific accounts?
- "What database?" → PostgreSQL, MySQL, etc.
- "When did this start?" → Recent deploy? Data growth?
- "Are there other symptoms?" → High CPU, disk IO?
Common Causes (And How to Identify Them)
Cause 1: Query Plan Flip
The Problem: Database optimizer chooses different execution plans based on table statistics. Sometimes it picks a bad plan.
-- Same query, different plans
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 12345;

-- Fast plan (Index Scan):
--   Index Scan using idx_orders_user_id on orders
--     Actual Time: 0.05..8.12 ms

-- Slow plan (Seq Scan):
--   Seq Scan on orders
--     Filter: (user_id = 12345)
--     Actual Time: 1500..9800 ms
Why it happens: outdated statistics, or parameter sniffing (a cached generic plan that works for typical parameter values but is terrible for outliers).
Solution:
-- PostgreSQL: Update statistics
ANALYZE orders;
-- Force index usage: PostgreSQL has no built-in hint syntax,
-- so /*+ ... */ hint comments require the pg_hint_plan extension
LOAD 'pg_hint_plan';
SET pg_hint_plan.enable_hint TO on;
SELECT /*+ IndexScan(orders idx_orders_user_id) */ *
FROM orders WHERE user_id = 12345;
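To catch a plan flip from application code, one approach (a minimal sketch; real EXPLAIN output carries more detail than the fragments shown here) is to classify the scan node in the plan text:

```python
def detect_scan_type(explain_output: str) -> str:
    """Classify the scan strategy from PostgreSQL EXPLAIN text output."""
    # Order matters: check the more specific node names first
    for node in ("Index Only Scan", "Index Scan", "Bitmap Heap Scan", "Seq Scan"):
        if node in explain_output:
            return node
    return "Unknown"

fast_plan = "Index Scan using idx_orders_user_id on orders  (cost=0.43..8.45)"
slow_plan = "Seq Scan on orders\n  Filter: (user_id = 12345)"

print(detect_scan_type(fast_plan))  # Index Scan
print(detect_scan_type(slow_plan))  # Seq Scan
```

Logging this classification per query lets you correlate latency spikes with plan changes after the fact.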
Cause 2: Lock Contention
The Problem: Query waiting for locks held by other transactions.
-- Check for lock waits (PostgreSQL)
SELECT
    blocked_locks.pid        AS blocked_pid,
    blocked_activity.query   AS blocked_query,
    blocking_locks.pid       AS blocking_pid,
    blocking_activity.query  AS blocking_query,
    blocked_activity.wait_event_type
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;
Output when locks are the issue:
blocked_query: SELECT * FROM orders WHERE user_id = 12345
blocking_query: UPDATE orders SET status = 'shipped' WHERE created_at < now() - interval '30 days'
wait_event_type: Lock
Solution:
-- 1. Add timeout to prevent indefinite waits
SET lock_timeout = '5s';
-- 2. Skip locked rows when partial results are acceptable (e.g. job queues)
SELECT * FROM orders WHERE user_id = 12345 FOR UPDATE SKIP LOCKED;
-- 3. Break up large updates
-- Instead of:
UPDATE orders SET status = 'shipped' WHERE created_at < now() - interval '30 days';
-- Do batched updates, repeating until zero rows are affected
-- (the status filter keeps each batch from re-selecting rows already updated):
UPDATE orders SET status = 'shipped'
WHERE id IN (
    SELECT id FROM orders
    WHERE created_at < now() - interval '30 days'
      AND status != 'shipped'
    LIMIT 1000
);
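The batched UPDATE needs a driver loop that repeats until no rows remain. A sketch, where `execute_batch` is a stand-in for whatever DB client call runs one batch and returns the affected row count:

```python
def run_in_batches(execute_batch, max_batches=10_000):
    """Repeatedly run a LIMIT-ed UPDATE until it affects zero rows.

    execute_batch() runs one batch and returns the number of rows updated.
    Returns the total number of rows updated across all batches.
    """
    total = 0
    for _ in range(max_batches):  # hard cap so a bug can't loop forever
        affected = execute_batch()
        total += affected
        if affected == 0:
            break
        # In production, commit here and sleep briefly so waiting
        # transactions can acquire the row locks between batches
    return total

# Simulated client: 2500 matching rows, batches of up to 1000
remaining = [2500]
def fake_batch():
    n = min(1000, remaining[0])
    remaining[0] -= n
    return n

print(run_in_batches(fake_batch))  # 2500
```

Committing between batches is the point: each batch holds its row locks only briefly instead of one transaction holding millions of locks for minutes.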
Cause 3: Buffer Pool Miss
The Problem: Data not in memory, causing disk reads.
-- PostgreSQL: Check buffer cache hit ratio
SELECT
relname,
heap_blks_read,
heap_blks_hit,
ROUND(heap_blks_hit::numeric /
NULLIF(heap_blks_hit + heap_blks_read, 0) * 100, 2) as cache_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'orders';
Output showing problem:
relname | heap_blks_read | heap_blks_hit | cache_hit_ratio
--------+----------------+---------------+-----------------
orders | 15000000 | 5000000 | 25.00
Solution:
-- Increase shared_buffers in postgresql.conf
shared_buffers = 4GB  -- common starting point: ~25% of system RAM

-- Pre-warm critical tables and indexes into the buffer cache
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('orders');
SELECT pg_prewarm('idx_orders_user_id');
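The hit-ratio arithmetic from the monitoring query above, as a small helper (parameter names mirror the `pg_statio_user_tables` columns):

```python
def cache_hit_ratio(heap_blks_hit: int, heap_blks_read: int) -> float:
    """Percentage of heap block requests served from shared_buffers."""
    total = heap_blks_hit + heap_blks_read
    if total == 0:
        return 0.0  # no traffic yet; avoid division by zero
    return round(heap_blks_hit / total * 100, 2)

# The numbers from the example output above
print(cache_hit_ratio(heap_blks_hit=5_000_000, heap_blks_read=15_000_000))  # 25.0
```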
Cause 4: Network Latency Jitter
The Problem: Database is fast, network is slow.
# Measure connection time vs query time to isolate network latency
import time

def diagnose_query_latency():
    # Time just the connection setup (dominated by network round-trips)
    start = time.perf_counter()
    conn = get_connection()
    connect_time = time.perf_counter() - start

    # Time a trivial query: one round-trip plus parse overhead
    cursor = conn.cursor()
    start = time.perf_counter()
    cursor.execute("SELECT 1")
    cursor.fetchall()
    simple_query_time = time.perf_counter() - start

    # Time the problematic query
    start = time.perf_counter()
    cursor.execute("SELECT * FROM orders WHERE user_id = 12345")
    cursor.fetchall()
    problem_query_time = time.perf_counter() - start

    print(f"Connect: {connect_time*1000:.2f}ms")
    print(f"Simple query: {simple_query_time*1000:.2f}ms")
    print(f"Problem query: {problem_query_time*1000:.2f}ms")
Output indicating network issue:
Connect: 450ms ← Network problem!
Simple query: 1.2ms
Problem query: 12ms ← Query is actually fast
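The three measurements can be turned into an automatic verdict. A sketch with illustrative thresholds (the 100ms/50ms/10x cutoffs are assumptions; tune them for your environment):

```python
def classify_latency(connect_ms: float, simple_ms: float, problem_ms: float) -> str:
    """Point at the layer most likely responsible for observed slowness."""
    if connect_ms > 100:            # assumed threshold
        return "network/connection"   # even connection setup is slow
    if simple_ms > 50:              # assumed threshold
        return "round-trip/overload"  # trivial query slow: server or network load
    if problem_ms > 10 * max(simple_ms, 1e-3):
        return "query execution"      # only the real query is slow
    return "healthy"

print(classify_latency(450, 1.2, 12))  # network/connection (the example above)
print(classify_latency(2, 1.0, 900))   # query execution
```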
Cause 5: Vacuum/Autovacuum Running
The Problem: Background maintenance competing for resources.
-- Check if autovacuum is running
SELECT pid, query, state, wait_event_type, query_start
FROM pg_stat_activity
WHERE query LIKE 'autovacuum%';
-- Check vacuum progress
SELECT * FROM pg_stat_progress_vacuum;
Solution:
-- Throttle autovacuum (postgresql.conf)
autovacuum_naptime = 5min            -- check for work less often (default 1min)
autovacuum_vacuum_cost_delay = 20ms  -- sleep longer between chunks of work (default 2ms)
autovacuum_vacuum_cost_limit = 200   -- cap work done before each sleep
-- Schedule heavy vacuums during off-hours
-- Run manually at 3 AM:
VACUUM ANALYZE orders;
Cause 6: Data Skew
The Problem: Some values return many more rows than others.
-- Check data distribution
SELECT user_id, COUNT(*) as order_count
FROM orders
GROUP BY user_id
ORDER BY order_count DESC
LIMIT 10;
Output showing skew:
user_id | order_count
--------+-------------
12345 | 500000 ← VIP customer, half million orders!
67890 | 50
11111 | 45
Solution:
-- For high-cardinality users, use pagination
SELECT * FROM orders
WHERE user_id = 12345
AND created_at > '2024-01-01'
ORDER BY id
LIMIT 100;
-- Or create covering index for common query
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC)
INCLUDE (status, total); -- Covering index
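A quick way to quantify skew from the distribution query's results (a sketch; the 100x-the-median threshold is an assumption, not a standard):

```python
from statistics import median

def skew_ratio(order_counts: list[int]) -> float:
    """Ratio of the heaviest key's row count to the median key's."""
    return max(order_counts) / max(median(order_counts), 1)

def is_skewed(order_counts: list[int], threshold: float = 100.0) -> bool:
    return skew_ratio(order_counts) >= threshold

# Counts from the example output: one VIP dwarfs everyone else
counts = [500_000, 50, 45, 40, 38, 35, 30, 28, 25, 20]
print(skew_ratio(counts))   # thousands of times the median
print(is_skewed(counts))    # True
```

Run this against the top-N counts periodically: skew tends to creep in as a few accounts grow, rather than appearing overnight.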
Systematic Debugging Approach
class QueryDebugger:
    """Sketch: the check_* helpers wrap the diagnostic queries shown above."""

    def diagnose_slow_query(self, query, params):
        results = {}
        # 1. Check if query plan is consistent
        results["plan_check"] = self.check_query_plan_stability(query, params)
        # 2. Check for lock contention
        results["locks"] = self.check_lock_contention()
        # 3. Check buffer cache
        results["cache"] = self.check_buffer_cache_hit_ratio()
        # 4. Check for running vacuum
        results["vacuum"] = self.check_autovacuum_status()
        # 5. Check connection pool health
        results["pool"] = self.check_connection_pool()
        # 6. Check data distribution
        results["skew"] = self.check_data_skew(query)
        return self.generate_diagnosis(results)

    def check_query_plan_stability(self, query, params):
        """Run EXPLAIN 10 times and check whether the plan changes."""
        plans = []
        for _ in range(10):
            plan = db.execute(f"EXPLAIN {query}", params)
            plans.append(plan)
        unique_plans = len(set(str(p) for p in plans))
        return {
            "stable": unique_plans == 1,
            "unique_plans": unique_plans,
            "recommendation": "Consider pg_hint_plan" if unique_plans > 1 else None,
        }
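The `generate_diagnosis` step is left abstract above. One hedged sketch is to rank findings by how often each cause explains intermittent slowness, assuming each check returns a dict with a boolean "suspect" flag (an invented convention for this sketch):

```python
# Assumed priority: locks are the most common intermittent cause
PRIORITY = ["locks", "plan_check", "cache", "vacuum", "skew", "pool"]

def generate_diagnosis(results: dict) -> list[str]:
    """Return suspect check names, most likely culprit first.

    results maps each check name to a dict with a boolean
    "suspect" flag (an assumed convention for this sketch).
    """
    suspects = [name for name in PRIORITY
                if results.get(name, {}).get("suspect")]
    return suspects or ["no obvious cause; capture EXPLAIN during a slow run"]

results = {
    "locks": {"suspect": True},
    "plan_check": {"suspect": False},
    "cache": {"suspect": True},
}
print(generate_diagnosis(results))  # ['locks', 'cache']
```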
Monitoring Setup
# Prometheus alerts for query latency
groups:
  - name: database_alerts
    rules:
      - alert: QueryLatencySpike
        expr: |
          histogram_quantile(0.99,
            rate(pg_query_duration_seconds_bucket{query="orders_by_user"}[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 query latency spike detected"

      - alert: QueryPlanChange
        expr: |
          changes(pg_stat_statements_plan_hash{query="orders_by_user"}[1h]) > 0
        labels:
          severity: info
        annotations:
          summary: "Query plan changed - investigate if latency affected"
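`histogram_quantile` interpolates within cumulative buckets; the arithmetic it performs can be sketched in Python (bucket bounds here are illustrative, not from any real metric):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate Prometheus histogram_quantile over cumulative buckets.

    buckets: sorted (upper_bound, cumulative_count) pairs, last bound = +inf.
    Linearly interpolates within the bucket containing the q-th observation.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cap at the last finite bound
            # Linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# 1000 queries: 900 under 10ms, 80 between 10ms and 1s, 20 over 1s
buckets = [(0.01, 900), (1.0, 980), (float("inf"), 1000)]
print(histogram_quantile(0.99, buckets))  # 1.0 -> p99 sits right at the 1s alert threshold
```

This is also why bucket boundaries matter: a p99 near the top of a wide bucket is reported imprecisely, so place a bucket bound near your alert threshold.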
Quick Reference: Diagnostic Commands
| Check | PostgreSQL Command |
|---|---|
| Current queries | SELECT * FROM pg_stat_activity |
| Lock waits | SELECT * FROM pg_locks WHERE NOT granted |
| Cache hit ratio | SELECT * FROM pg_statio_user_tables |
| Index usage | SELECT * FROM pg_stat_user_indexes |
| Autovacuum | SELECT * FROM pg_stat_progress_vacuum |
| Query plans | EXPLAIN (ANALYZE, BUFFERS) SELECT ... |
| Table stats | SELECT * FROM pg_stats WHERE tablename = 'x' |
Key Takeaways
- Same query ≠ same plan - Database optimizer can flip plans
- Check locks first - Most common cause of intermittent slowness
- Buffer cache matters - Cold cache = disk reads = slow
- Data skew is sneaky - Some users have way more data
- Monitor query plans - Alert when plans change
- Measure at each layer - Network, connection pool, query execution
Red flags to look for:
- P99 latency >> P50 latency
- Query plan hash changes over time
- Lock wait events in pg_stat_activity
- Buffer cache hit ratio below 95%
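The first red flag can be automated as a trivial check (the 10x ratio is an assumed threshold; bimodal 10ms-vs-10s latency like this problem's blows far past it):

```python
def latency_red_flag(p50_ms: float, p99_ms: float, ratio: float = 10.0) -> bool:
    """Flag when the tail is disproportionately slower than the median."""
    return p99_ms >= ratio * max(p50_ms, 1e-3)

# The scenario from the interview question: 10ms typical, 10s worst case
print(latency_red_flag(p50_ms=10, p99_ms=10_000))  # True
print(latency_red_flag(p50_ms=10, p99_ms=35))      # False
```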