
Memory Leak Hunt

The Interview Question

"Our Java service uses more memory every day. It starts at 2GB, by day 7 it's at 8GB, then OOMs and restarts. The cycle repeats. How do you find and fix the leak?"

Asked at: Amazon, Netflix, LinkedIn, any company with JVM services

Time to solve: 30-35 minutes

Difficulty: ⭐⭐⭐ (Senior)


Clarifying Questions to Ask

  1. "What type of memory is growing?" → Heap? Off-heap? Native?
  2. "Does it happen in all environments?" → Prod only? Load-related?
  3. "Any recent deploys before this started?" → Changed dependencies?
  4. "What's the traffic pattern?" → Spiky? Constant?
  5. "Are there scheduled jobs?" → Batch processing leaks?

The Investigation Process

Step 1: Confirm It's Actually a Leak

# Monitor heap usage over time
jstat -gc <pid> 5000 100 # Every 5 seconds, 100 samples

# Output to watch:
# OU (Old generation Used) - Should stabilize after GC
# If OU keeps growing even after Full GC → Leak confirmed

# Check GC activity
jstat -gcutil <pid> 5000

#   S0     S1     E      O      M     CGC   CGCT   GCT     (columns abridged)
#  0.00  45.12  67.89  95.43  92.10   150   45.2  120.5
#                      ^^^^^
#                      Old gen at 95% = trouble
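
The confirmation logic above can be sketched in code: sample old-gen usage right after each Full GC (the OU column from jstat) and check whether that post-GC floor keeps rising. A minimal sketch with illustrative names and thresholds:

```java
import java.util.List;

public class LeakCheck {
    /**
     * Given old-gen usage (MB) sampled immediately after each Full GC,
     * report a suspected leak when the post-GC floor rises by more than
     * toleranceMb over the window and never drops back near the start.
     */
    static boolean looksLikeLeak(List<Double> postGcOldGenMb, double toleranceMb) {
        if (postGcOldGenMb.size() < 3) return false; // not enough evidence
        double first = postGcOldGenMb.get(0);
        double min = first;
        for (double v : postGcOldGenMb) min = Math.min(min, v);
        double last = postGcOldGenMb.get(postGcOldGenMb.size() - 1);
        // Floor keeps rising: latest floor well above the start, never returned near it
        return last - first > toleranceMb && min >= first - toleranceMb;
    }

    public static void main(String[] args) {
        // Post-Full-GC old gen creeping up day after day -> leak
        System.out.println(looksLikeLeak(List.of(500.0, 900.0, 1400.0, 2100.0), 100)); // true
        // Post-GC usage stable -> healthy (memory tracks load, not time)
        System.out.println(looksLikeLeak(List.of(500.0, 520.0, 490.0, 510.0), 100)); // false
    }
}
```

The key point: memory that returns to roughly the same floor after every Full GC is churn, not a leak.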

Step 2: Take Heap Dumps

# Take heap dump when memory is low (after restart)
jmap -dump:format=b,file=heap_day1.hprof <pid>

# Take another when memory is high (before OOM)
jmap -dump:live,format=b,file=heap_day7.hprof <pid>

# Or configure JVM to dump on OOM
java -XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/var/log/heapdumps/ \
-jar app.jar
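
Heap dumps can also be triggered programmatically via the HotSpot-specific `HotSpotDiagnosticMXBean` (in `com.sun.management`), which is handy for dumping when a monitoring threshold fires rather than by hand. A minimal sketch:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class DumpOnDemand {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // Target file must not already exist; "true" = live objects only
        // (forces a full GC first, same as jmap -dump:live)
        Path out = Files.createTempDirectory("heapdumps").resolve("heap.hprof");
        diag.dumpHeap(out.toString(), true);

        System.out.println("Dumped " + Files.size(out) / (1024 * 1024) + " MB to " + out);
    }
}
```

This is HotSpot-specific (not a standard java.lang.management API), so it won't work on every JVM.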

Step 3: Analyze with Eclipse MAT

# Download Eclipse Memory Analyzer
# Open heap_day7.hprof

# Key reports to generate:
1. Leak Suspects Report → Automated analysis
2. Dominator Tree → Largest object trees
3. Histogram → Object count by class
4. Path to GC Roots → Why objects aren't collected

Common patterns to look for:

Dominator Tree:
├── java.util.HashMap: 4.2 GB (!)
│   └── entries: 50,000,000 objects
│       └── com.myapp.UserSession
│           └── ... (never cleared)
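
The "compare two dumps" step can also be scripted: take class histograms from day 1 and day 7 (e.g. parsed from `jmap -histo` output) and rank classes by instance-count growth — the unbounded map above would float straight to the top. A sketch with illustrative data (parsing omitted):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class HistogramDiff {
    /** Classes ordered by instance-count growth between two histograms. */
    static List<Map.Entry<String, Long>> topGrowers(Map<String, Long> day1,
                                                    Map<String, Long> day7) {
        return day7.entrySet().stream()
                .map(e -> Map.entry(e.getKey(),
                        e.getValue() - day1.getOrDefault(e.getKey(), 0L)))
                .filter(e -> e.getValue() > 0)                       // grew at all
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> day1 = Map.of("com.myapp.UserSession", 10_000L,
                                        "java.lang.String", 1_000_000L);
        Map<String, Long> day7 = Map.of("com.myapp.UserSession", 50_000_000L,
                                        "java.lang.String", 1_100_000L);
        // UserSession grew by ~50M instances -> prime leak suspect
        System.out.println(topGrowers(day1, day7).get(0).getKey());
        // prints "com.myapp.UserSession"
    }
}
```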

Common Memory Leak Causes

Leak 1: Unbounded Cache

// 🔴 BAD: Cache that never evicts
public class UserCache {
    private static final Map<String, User> cache = new HashMap<>();

    public User getUser(String id) {
        return cache.computeIfAbsent(id, this::loadUser);
        // Cache grows forever!
    }
}

// ✅ GOOD: Bounded cache with eviction (Caffeine)
public class UserCache {
    private static final Cache<String, User> cache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofMinutes(30))
            .build();

    public User getUser(String id) {
        return cache.get(id, this::loadUser);
    }
}

Leak 2: Event Listener Not Removed

// 🔴 BAD: Listener registered but never removed
public class OrderProcessor {
    public void processOrder(Order order) {
        eventBus.register(new OrderListener(order));
        // OrderListener stays registered forever!
    }
}

// ✅ GOOD: Unregister when done
public class OrderProcessor {
    public void processOrder(Order order) {
        OrderListener listener = new OrderListener(order);
        eventBus.register(listener);
        try {
            // Process order
        } finally {
            eventBus.unregister(listener);
        }
    }
}

// ✅ EVEN BETTER: Use weak references so forgotten listeners can still be GC'd
public class EventBus {
    private final List<WeakReference<EventListener>> listeners =
            Collections.synchronizedList(new ArrayList<>());
}
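
Filling in that weak-reference idea, here is a minimal event bus that holds listeners through `WeakReference` and prunes cleared references on publish, so a forgotten `unregister` no longer pins the listener. Illustrative sketch, not a drop-in replacement for a real event-bus library:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class WeakEventBus<E> {
    private final List<WeakReference<Consumer<E>>> listeners =
            Collections.synchronizedList(new ArrayList<>());

    public void register(Consumer<E> listener) {
        listeners.add(new WeakReference<>(listener));
    }

    public void publish(E event) {
        synchronized (listeners) {
            Iterator<WeakReference<Consumer<E>>> it = listeners.iterator();
            while (it.hasNext()) {
                Consumer<E> l = it.next().get();
                if (l == null) {
                    it.remove();       // listener was GC'd: drop the stale entry
                } else {
                    l.accept(event);
                }
            }
        }
    }
}
```

The trade-off: callers must keep a strong reference to any listener they still want notified, since the bus alone no longer keeps it alive — a lambda registered inline with no other reference can be collected at any time.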

Leak 3: ThreadLocal Not Cleared

// 🔴 BAD: ThreadLocal in thread pool
public class RequestContext {
    private static final ThreadLocal<UserSession> context = new ThreadLocal<>();

    public static void setSession(UserSession session) {
        context.set(session);
    }

    // Threads in pool are reused, ThreadLocal values accumulate!
}

// ✅ GOOD: Always clear ThreadLocal
public class RequestContext {
    private static final ThreadLocal<UserSession> context = new ThreadLocal<>();

    public static void setSession(UserSession session) {
        context.set(session);
    }

    public static void clear() {
        context.remove(); // Must call after request!
    }
}

// In a servlet filter/interceptor:
@Override
public void doFilter(ServletRequest request, ServletResponse response,
                     FilterChain chain) throws IOException, ServletException {
    try {
        RequestContext.setSession(extractSession(request));
        chain.doFilter(request, response);
    } finally {
        RequestContext.clear(); // Always clear!
    }
}
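
The accumulation is easy to demonstrate: with a pooled (reused) thread, a value set by one task and never removed is still visible to the next task on that thread. A runnable sketch with illustrative names:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakDemo {
    private static final ThreadLocal<String> context = new ThreadLocal<>();

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one reused thread

        // Task 1 sets the ThreadLocal but "forgets" to remove it
        pool.submit(() -> context.set("user-from-request-1")).get();

        // Task 2 runs on the same pooled thread and sees the stale value
        String stale = pool.submit(() -> context.get()).get();
        System.out.println(stale); // prints "user-from-request-1" — leaked state

        // The fix: remove() (normally in a finally block) ends the leak
        pool.submit(() -> context.remove()).get();
        String afterClear = pool.submit(() -> context.get()).get();
        System.out.println(afterClear); // prints "null"

        pool.shutdown();
    }
}
```

Beyond stale values, each retained entry also keeps its `UserSession` (and everything it references) reachable for the lifetime of the pool thread — that is the memory leak.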

Leak 4: Connection/Resource Not Closed

// 🔴 BAD: Connection not closed on error
public List<User> getUsers() throws SQLException {
    Connection conn = dataSource.getConnection();
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM users");
    List<User> users = mapResults(rs);
    conn.close(); // Never reached if mapResults throws!
    return users;
}

// ✅ GOOD: Try-with-resources
public List<User> getUsers() throws SQLException {
    try (Connection conn = dataSource.getConnection();
         PreparedStatement stmt = conn.prepareStatement("SELECT * FROM users");
         ResultSet rs = stmt.executeQuery()) {
        return mapResults(rs);
    } // Auto-closed even on exception
}

Leak 5: String Intern Abuse

// 🔴 BAD: Interning user-generated strings
public void processMessage(String message) {
    String normalized = message.toLowerCase().intern();
    // intern() adds every distinct value to the JVM string pool; an
    // unbounded input set bloats the pool (pre-Java 7 it lived in PermGen
    // and was effectively never reclaimed)
}

// ✅ GOOD: Don't intern unbounded strings
public void processMessage(String message) {
    String normalized = message.toLowerCase();
    // Normal string, eligible for GC
}

Leak 6: ClassLoader Leak (in apps with hot reload)

// Common in web apps that redeploy without restarting the JVM
// Old classloaders keep references to the previous deployment's classes

// Detection: Metaspace (PermGen pre-Java 8) keeps growing after redeploys

// Solutions:
// 1. Restart the JVM on deploy (recommended for prod)
// 2. Ensure class unloading can happen: modern collectors (G1, ZGC) unload
//    classes by default; -XX:+CMSClassUnloadingEnabled applied only to the
//    now-removed CMS collector
// 3. Fix static references (ThreadLocals, JDBC drivers, shutdown hooks)
//    that pin the old classloader
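
A quick way to watch for this from inside the JVM: the standard `ClassLoadingMXBean` tracks loaded and unloaded class counts, so a loaded count that climbs across redeploys while the unloaded count stays flat points at a classloader leak. Minimal sketch:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadWatch {
    public static void main(String[] args) {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        long loaded = cl.getLoadedClassCount();     // classes currently loaded
        long unloaded = cl.getUnloadedClassCount(); // classes unloaded so far
        long total = cl.getTotalLoadedClassCount(); // classes ever loaded

        System.out.printf("loaded=%d unloaded=%d total=%d%n", loaded, unloaded, total);
        // Healthy redeploys: unloaded rises with each cycle.
        // Classloader leak:  loaded/total climb, unloaded stays near zero.
    }
}
```

These values are also exposed over JMX, so they can feed the same dashboards as the heap metrics below.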

Debugging Tools Cheatsheet

Tool            | Use Case                   | Command
----------------|----------------------------|----------------------------------------------
jstat           | GC statistics              | jstat -gcutil <pid> 5000
jmap            | Heap dump                  | jmap -dump:live,format=b,file=heap.hprof <pid>
jcmd            | Memory info                | jcmd <pid> GC.heap_info
jvisualvm       | Real-time monitoring       | (GUI tool)
MAT             | Heap analysis              | (Eclipse Memory Analyzer, GUI)
async-profiler  | CPU + allocation profiling | ./profiler.sh -e alloc -d 60 <pid>

Quick Memory Profiling Script

#!/usr/bin/env python3
# memory_monitor.py - Track JVM memory over time

import csv
import subprocess
import time
from datetime import datetime


def get_memory_stats(pid):
    result = subprocess.run(
        ['jstat', '-gc', str(pid)],
        capture_output=True, text=True
    )
    # Parse jstat output: one header line, one line of values
    lines = result.stdout.strip().split('\n')
    headers = lines[0].split()
    values = lines[1].split()
    return dict(zip(headers, values))


def monitor(pid, output_file, interval=60):
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['timestamp', 'heap_used_mb', 'old_gen_mb', 'gc_count'])

        while True:
            stats = get_memory_stats(pid)
            heap_used = (float(stats['EU']) + float(stats['OU'])) / 1024  # KB -> MB
            old_gen = float(stats['OU']) / 1024
            gc_count = int(stats['FGC'])  # Full GC count

            writer.writerow([
                datetime.now().isoformat(),
                heap_used,
                old_gen,
                gc_count
            ])
            f.flush()

            print(f"Heap: {heap_used:.0f}MB, Old: {old_gen:.0f}MB, GCs: {gc_count}")
            time.sleep(interval)


if __name__ == '__main__':
    import sys
    monitor(int(sys.argv[1]), 'memory_log.csv')

Prevention Strategies

// 1. Bounded collections (LRU via access-ordered LinkedHashMap)
private static final int MAX_SIZE = 10_000;

Map<K, V> cache = Collections.synchronizedMap(
    new LinkedHashMap<>(MAX_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > MAX_SIZE;
        }
    }
);

// 2. Weak references for caches (an entry vanishes once its key is
//    no longer strongly referenced anywhere else)
Map<K, V> cache = new WeakHashMap<>();

// 3. Explicit cleanup in finally blocks
try {
    // Use resource
} finally {
    cleanup();
}

// 4. JVM flags for monitoring
// -XX:+UseG1GC
// -XX:MaxGCPauseMillis=200
// -Xlog:gc*:file=gc.log     (Java 9+; replaces -XX:+PrintGCDetails)

// 5. Metrics and alerts
@Scheduled(fixedRate = 60000) // every minute
public void reportMemoryMetrics() {
    MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    MemoryUsage heap = memory.getHeapMemoryUsage();

    metrics.gauge("jvm.heap.used", heap.getUsed());
    metrics.gauge("jvm.heap.max", heap.getMax());

    double usagePercent = (double) heap.getUsed() / heap.getMax() * 100;
    if (usagePercent > 80) {
        alert("High heap usage: " + usagePercent + "%");
    }
}

Key Takeaways

  1. Confirm the leak - Use jstat to verify old gen keeps growing
  2. Take before/after dumps - Compare heap state over time
  3. Use MAT's Leak Suspects - Automated analysis finds most leaks
  4. Common causes: Caches, listeners, ThreadLocal, connections
  5. Prevention: Bounded caches, weak references, try-with-resources
  6. Monitor continuously - Alert before OOM happens

Rule of thumb: If memory grows linearly with time (not traffic), you have a leak. If it grows with traffic, you need more capacity.