The Slow Query Mystery
The Interview Question
"We have a database query that usually takes 10ms but sometimes takes 10 seconds. Same query, same parameters. The DBA says indexes are fine. What's happening and how do you fix it?"
Asked at: Google, Amazon, Stripe, Meta
Time to solve: 30-35 minutes
Difficulty: ⭐⭐⭐⭐ (Senior)
Clarifying Questions to Ask
- "How often does the slow query occur?" → 1% vs 50% changes the investigation approach
- "Is it random users or specific patterns?" → Time of day, specific accounts?
- "What database?" → PostgreSQL, MySQL, etc.
- "When did this start?" → Recent deploy? Data growth?
- "Are there other symptoms?" → High CPU, disk IO?
Common Causes (And How to Identify Them)
Cause 1: Query Plan Flip
The Problem: Database optimizer chooses different execution plans based on table statistics. Sometimes it picks a bad plan.
-- Same query, different plans
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 12345;

-- Fast plan (Index Scan):
--   Index Scan using idx_orders_user_id on orders
--     Actual Time: 0.05..8.12 ms

-- Slow plan (Seq Scan):
--   Seq Scan on orders
--     Filter: (user_id = 12345)
--     Actual Time: 1500..9800 ms
Why it happens: outdated statistics, or parameter sniffing (a cached generic plan that works for typical parameter values but is terrible for outliers).
Solution:
-- PostgreSQL: Update statistics
ANALYZE orders;
-- Force index usage: PostgreSQL has no built-in hint syntax,
-- so /*+ ... */ hint comments require the pg_hint_plan extension
LOAD 'pg_hint_plan';
SET pg_hint_plan.enable_hint TO on;
SELECT /*+ IndexScan(orders idx_orders_user_id) */ *
FROM orders WHERE user_id = 12345;
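To catch a plan flip from application code, one approach (a minimal sketch; real EXPLAIN output carries more detail than the fragments shown here) is to classify the scan node in the plan text:

```python
def detect_scan_type(explain_output: str) -> str:
    """Classify the scan strategy from PostgreSQL EXPLAIN text output."""
    # Order matters: check the more specific node names first
    for node in ("Index Only Scan", "Index Scan", "Bitmap Heap Scan", "Seq Scan"):
        if node in explain_output:
            return node
    return "Unknown"

fast_plan = "Index Scan using idx_orders_user_id on orders  (cost=0.43..8.45)"
slow_plan = "Seq Scan on orders\n  Filter: (user_id = 12345)"

print(detect_scan_type(fast_plan))  # Index Scan
print(detect_scan_type(slow_plan))  # Seq Scan
```

Logging this classification per query lets you correlate latency spikes with plan changes after the fact.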
Cause 2: Lock Contention
The Problem: Query waiting for locks held by other transactions.
-- Check for lock waits (PostgreSQL)
SELECT
    blocked_locks.pid        AS blocked_pid,
    blocked_activity.query   AS blocked_query,
    blocking_locks.pid       AS blocking_pid,
    blocking_activity.query  AS blocking_query,
    blocked_activity.wait_event_type
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;
Output when locks are the issue:
blocked_query: SELECT * FROM orders WHERE user_id = 12345
blocking_query: UPDATE orders SET status = 'shipped' WHERE created_at < now() - interval '30 days'
wait_event_type: Lock
Solution:
-- 1. Add timeout to prevent indefinite waits
SET lock_timeout = '5s';
-- 2. Skip locked rows when partial results are acceptable (e.g. job queues)
SELECT * FROM orders WHERE user_id = 12345 FOR UPDATE SKIP LOCKED;
-- 3. Break up large updates
-- Instead of:
UPDATE orders SET status = 'shipped' WHERE created_at < now() - interval '30 days';
-- Do batched updates, repeating until zero rows are affected
-- (the status filter keeps each batch from re-selecting rows already updated):
UPDATE orders SET status = 'shipped'
WHERE id IN (
    SELECT id FROM orders
    WHERE created_at < now() - interval '30 days'
      AND status != 'shipped'
    LIMIT 1000
);
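The batched UPDATE needs a driver loop that repeats until no rows remain. A sketch, where `execute_batch` is a stand-in for whatever DB client call runs one batch and returns the affected row count:

```python
def run_in_batches(execute_batch, max_batches=10_000):
    """Repeatedly run a LIMIT-ed UPDATE until it affects zero rows.

    execute_batch() runs one batch and returns the number of rows updated.
    Returns the total number of rows updated across all batches.
    """
    total = 0
    for _ in range(max_batches):  # hard cap so a bug can't loop forever
        affected = execute_batch()
        total += affected
        if affected == 0:
            break
        # In production, commit here and sleep briefly so waiting
        # transactions can acquire the row locks between batches
    return total

# Simulated client: 2500 matching rows, batches of up to 1000
remaining = [2500]
def fake_batch():
    n = min(1000, remaining[0])
    remaining[0] -= n
    return n

print(run_in_batches(fake_batch))  # 2500
```

Committing between batches is the point: each batch holds its row locks only briefly instead of one transaction holding millions of locks for minutes.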
Cause 3: Buffer Pool Miss
The Problem: Data not in memory, causing disk reads.
-- PostgreSQL: Check buffer cache hit ratio
SELECT
relname,
heap_blks_read,
heap_blks_hit,
ROUND(heap_blks_hit::numeric /
NULLIF(heap_blks_hit + heap_blks_read, 0) * 100, 2) as cache_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'orders';
Output showing problem:
relname | heap_blks_read | heap_blks_hit | cache_hit_ratio
--------+----------------+---------------+-----------------
orders | 15000000 | 5000000 | 25.00
Solution:
-- Increase shared_buffers in postgresql.conf
shared_buffers = 4GB  -- common starting point: ~25% of system RAM

-- Pre-warm critical tables and indexes into the buffer cache
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('orders');
SELECT pg_prewarm('idx_orders_user_id');
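The hit-ratio arithmetic from the monitoring query above, as a small helper (parameter names mirror the `pg_statio_user_tables` columns):

```python
def cache_hit_ratio(heap_blks_hit: int, heap_blks_read: int) -> float:
    """Percentage of heap block requests served from shared_buffers."""
    total = heap_blks_hit + heap_blks_read
    if total == 0:
        return 0.0  # no traffic yet; avoid division by zero
    return round(heap_blks_hit / total * 100, 2)

# The numbers from the example output above
print(cache_hit_ratio(heap_blks_hit=5_000_000, heap_blks_read=15_000_000))  # 25.0
```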
Cause 4: Network Latency Jitter
The Problem: Database is fast, network is slow.
# Measure connection time vs query time to isolate network latency
import time

def diagnose_query_latency():
    # Time just the connection setup (dominated by network round-trips)
    start = time.perf_counter()
    conn = get_connection()
    connect_time = time.perf_counter() - start

    # Time a trivial query: one round-trip plus parse overhead
    cursor = conn.cursor()
    start = time.perf_counter()
    cursor.execute("SELECT 1")
    cursor.fetchall()
    simple_query_time = time.perf_counter() - start

    # Time the problematic query
    start = time.perf_counter()
    cursor.execute("SELECT * FROM orders WHERE user_id = 12345")
    cursor.fetchall()
    problem_query_time = time.perf_counter() - start

    print(f"Connect: {connect_time*1000:.2f}ms")
    print(f"Simple query: {simple_query_time*1000:.2f}ms")
    print(f"Problem query: {problem_query_time*1000:.2f}ms")
Output indicating network issue:
Connect: 450ms ← Network problem!
Simple query: 1.2ms
Problem query: 12ms ← Query is actually fast
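The three measurements can be turned into an automatic verdict. A sketch with illustrative thresholds (the 100ms/50ms/10x cutoffs are assumptions; tune them for your environment):

```python
def classify_latency(connect_ms: float, simple_ms: float, problem_ms: float) -> str:
    """Point at the layer most likely responsible for observed slowness."""
    if connect_ms > 100:            # assumed threshold
        return "network/connection"   # even connection setup is slow
    if simple_ms > 50:              # assumed threshold
        return "round-trip/overload"  # trivial query slow: server or network load
    if problem_ms > 10 * max(simple_ms, 1e-3):
        return "query execution"      # only the real query is slow
    return "healthy"

print(classify_latency(450, 1.2, 12))  # network/connection (the example above)
print(classify_latency(2, 1.0, 900))   # query execution
```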
Cause 5: Vacuum/Autovacuum Running
The Problem: Background maintenance competing for resources.
-- Check if autovacuum is running
SELECT pid, query, state, wait_event_type, query_start
FROM pg_stat_activity
WHERE query LIKE 'autovacuum%';
-- Check vacuum progress
SELECT * FROM pg_stat_progress_vacuum;
Solution:
-- Throttle autovacuum (postgresql.conf)
autovacuum_naptime = 5min            -- check for work less often (default 1min)
autovacuum_vacuum_cost_delay = 20ms  -- sleep longer between chunks of work (default 2ms)
autovacuum_vacuum_cost_limit = 200   -- cap work done before each sleep
-- Schedule heavy vacuums during off-hours
-- Run manually at 3 AM:
VACUUM ANALYZE orders;
Cause 6: Data Skew
The Problem: Some values return many more rows than others.
-- Check data distribution
SELECT user_id, COUNT(*) as order_count
FROM orders
GROUP BY user_id
ORDER BY order_count DESC
LIMIT 10;
Output showing skew:
user_id | order_count
--------+-------------
12345 | 500000 ← VIP customer, half million orders!
67890 | 50
11111 | 45
Solution:
-- For high-cardinality users, use pagination
SELECT * FROM orders
WHERE user_id = 12345
AND created_at > '2024-01-01'
ORDER BY id
LIMIT 100;
-- Or create covering index for common query
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC)
INCLUDE (status, total); -- Covering index
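A quick way to quantify skew from the distribution query's results (a sketch; the 100x-the-median threshold is an assumption, not a standard):

```python
from statistics import median

def skew_ratio(order_counts: list[int]) -> float:
    """Ratio of the heaviest key's row count to the median key's."""
    return max(order_counts) / max(median(order_counts), 1)

def is_skewed(order_counts: list[int], threshold: float = 100.0) -> bool:
    return skew_ratio(order_counts) >= threshold

# Counts from the example output: one VIP dwarfs everyone else
counts = [500_000, 50, 45, 40, 38, 35, 30, 28, 25, 20]
print(skew_ratio(counts))   # thousands of times the median
print(is_skewed(counts))    # True
```

Run this against the top-N counts periodically: skew tends to creep in as a few accounts grow, rather than appearing overnight.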
Systematic Debugging Approach
class QueryDebugger:
    """Sketch: the check_* helpers wrap the diagnostic queries shown above."""

    def diagnose_slow_query(self, query, params):
        results = {}
        # 1. Check if query plan is consistent
        results["plan_check"] = self.check_query_plan_stability(query, params)
        # 2. Check for lock contention
        results["locks"] = self.check_lock_contention()
        # 3. Check buffer cache
        results["cache"] = self.check_buffer_cache_hit_ratio()
        # 4. Check for running vacuum
        results["vacuum"] = self.check_autovacuum_status()
        # 5. Check connection pool health
        results["pool"] = self.check_connection_pool()
        # 6. Check data distribution
        results["skew"] = self.check_data_skew(query)
        return self.generate_diagnosis(results)

    def check_query_plan_stability(self, query, params):
        """Run EXPLAIN 10 times and check whether the plan changes."""
        plans = []
        for _ in range(10):
            plan = db.execute(f"EXPLAIN {query}", params)
            plans.append(plan)
        unique_plans = len(set(str(p) for p in plans))
        return {
            "stable": unique_plans == 1,
            "unique_plans": unique_plans,
            "recommendation": "Consider pg_hint_plan" if unique_plans > 1 else None,
        }
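The `generate_diagnosis` step is left abstract above. One hedged sketch is to rank findings by how often each cause explains intermittent slowness, assuming each check returns a dict with a boolean "suspect" flag (an invented convention for this sketch):

```python
# Assumed priority: locks are the most common intermittent cause
PRIORITY = ["locks", "plan_check", "cache", "vacuum", "skew", "pool"]

def generate_diagnosis(results: dict) -> list[str]:
    """Return suspect check names, most likely culprit first.

    results maps each check name to a dict with a boolean
    "suspect" flag (an assumed convention for this sketch).
    """
    suspects = [name for name in PRIORITY
                if results.get(name, {}).get("suspect")]
    return suspects or ["no obvious cause; capture EXPLAIN during a slow run"]

results = {
    "locks": {"suspect": True},
    "plan_check": {"suspect": False},
    "cache": {"suspect": True},
}
print(generate_diagnosis(results))  # ['locks', 'cache']
```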
Monitoring Setup
# Prometheus alerts for query latency
groups:
  - name: database_alerts
    rules:
      - alert: QueryLatencySpike
        expr: |
          histogram_quantile(0.99,
            rate(pg_query_duration_seconds_bucket{query="orders_by_user"}[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 query latency spike detected"

      - alert: QueryPlanChange
        expr: |
          changes(pg_stat_statements_plan_hash{query="orders_by_user"}[1h]) > 0
        labels:
          severity: info
        annotations:
          summary: "Query plan changed - investigate if latency affected"
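`histogram_quantile` interpolates within cumulative buckets; the arithmetic it performs can be sketched in Python (bucket bounds here are illustrative, not from any real metric):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate Prometheus histogram_quantile over cumulative buckets.

    buckets: sorted (upper_bound, cumulative_count) pairs, last bound = +inf.
    Linearly interpolates within the bucket containing the q-th observation.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cap at the last finite bound
            # Linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# 1000 queries: 900 under 10ms, 80 between 10ms and 1s, 20 over 1s
buckets = [(0.01, 900), (1.0, 980), (float("inf"), 1000)]
print(histogram_quantile(0.99, buckets))  # 1.0 -> p99 sits right at the 1s alert threshold
```

This is also why bucket boundaries matter: a p99 near the top of a wide bucket is reported imprecisely, so place a bucket bound near your alert threshold.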
Quick Reference: Diagnostic Commands
| Check | PostgreSQL Command |
|---|---|
| Current queries | SELECT * FROM pg_stat_activity |
| Lock waits | SELECT * FROM pg_locks WHERE NOT granted |
| Cache hit ratio | SELECT * FROM pg_statio_user_tables |
| Index usage | SELECT * FROM pg_stat_user_indexes |
| Autovacuum | SELECT * FROM pg_stat_progress_vacuum |
| Query plans | EXPLAIN (ANALYZE, BUFFERS) SELECT ... |
| Table stats | SELECT * FROM pg_stats WHERE tablename = 'x' |
Key Takeaways
- Same query ≠ same plan - Database optimizer can flip plans
- Check locks first - Most common cause of intermittent slowness
- Buffer cache matters - Cold cache = disk reads = slow
- Data skew is sneaky - Some users have way more data
- Monitor query plans - Alert when plans change
- Measure at each layer - Network, connection pool, query execution
Red flags to look for:
- P99 latency >> P50 latency
- Query plan hash changes over time
- Lock wait events in pg_stat_activity
- Buffer cache hit ratio below 95%
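The first red flag can be automated as a trivial check (the 10x ratio is an assumed threshold; bimodal 10ms-vs-10s latency like this problem's blows far past it):

```python
def latency_red_flag(p50_ms: float, p99_ms: float, ratio: float = 10.0) -> bool:
    """Flag when the tail is disproportionately slower than the median."""
    return p99_ms >= ratio * max(p50_ms, 1e-3)

# The scenario from the interview question: 10ms typical, 10s worst case
print(latency_red_flag(p50_ms=10, p99_ms=10_000))  # True
print(latency_red_flag(p50_ms=10, p99_ms=35))      # False
```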