
Memory Leak Hunt

The Interview Question

"Our Java service uses more memory every day. It starts at 2GB, by day 7 it's at 8GB, then OOMs and restarts. The cycle repeats. How do you find and fix the leak?"

Asked at: Amazon, Netflix, LinkedIn, any company with JVM services

Time to solve: 30-35 minutes

Difficulty: ⭐⭐⭐ (Senior)


Clarifying Questions to Ask

  1. "What type of memory is growing?" → Heap? Off-heap? Native?
  2. "Does it happen in all environments?" → Prod only? Load-related?
  3. "Any recent deploys before this started?" → Changed dependencies?
  4. "What's the traffic pattern?" → Spiky? Constant?
  5. "Are there scheduled jobs?" → Batch processing leaks?

The Investigation Process

Step 1: Confirm It's Actually a Leak

# Monitor heap usage over time
jstat -gc <pid> 5000 100 # Every 5 seconds, 100 samples

# Output to watch:
# OU (Old generation Used) - Should stabilize after GC
# If OU keeps growing even after Full GC → Leak confirmed

# Check GC activity
jstat -gcutil <pid> 5000

#   S0     S1     E      O      M     CGC   CGCT   GCT     (columns abridged)
#  0.00  45.12  67.89  95.43  92.10   150   45.2  120.5
#                      ^^^^^
#                      Old gen at 95% = trouble
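
The confirmation logic above can be sketched in code: sample old-gen usage right after each Full GC (the OU column from jstat) and check whether that post-GC floor keeps rising. A minimal sketch with illustrative names and thresholds:

```java
import java.util.List;

public class LeakCheck {
    /**
     * Given old-gen usage (MB) sampled immediately after each Full GC,
     * report a suspected leak when the post-GC floor rises by more than
     * toleranceMb over the window and never drops back near the start.
     */
    static boolean looksLikeLeak(List<Double> postGcOldGenMb, double toleranceMb) {
        if (postGcOldGenMb.size() < 3) return false; // not enough evidence
        double first = postGcOldGenMb.get(0);
        double min = first;
        for (double v : postGcOldGenMb) min = Math.min(min, v);
        double last = postGcOldGenMb.get(postGcOldGenMb.size() - 1);
        // Floor keeps rising: latest floor well above the start, never returned near it
        return last - first > toleranceMb && min >= first - toleranceMb;
    }

    public static void main(String[] args) {
        // Post-Full-GC old gen creeping up day after day -> leak
        System.out.println(looksLikeLeak(List.of(500.0, 900.0, 1400.0, 2100.0), 100)); // true
        // Post-GC usage stable -> healthy (memory tracks load, not time)
        System.out.println(looksLikeLeak(List.of(500.0, 520.0, 490.0, 510.0), 100)); // false
    }
}
```

The key point: memory that returns to roughly the same floor after every Full GC is churn, not a leak.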

Step 2: Take Heap Dumps

# Take heap dump when memory is low (after restart)
jmap -dump:format=b,file=heap_day1.hprof <pid>

# Take another when memory is high (before OOM)
jmap -dump:live,format=b,file=heap_day7.hprof <pid>

# Or configure JVM to dump on OOM
java -XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/var/log/heapdumps/ \
-jar app.jar
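
Heap dumps can also be triggered programmatically via the HotSpot-specific `HotSpotDiagnosticMXBean` (in `com.sun.management`), which is handy for dumping when a monitoring threshold fires rather than by hand. A minimal sketch:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class DumpOnDemand {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // Target file must not already exist; "true" = live objects only
        // (forces a full GC first, same as jmap -dump:live)
        Path out = Files.createTempDirectory("heapdumps").resolve("heap.hprof");
        diag.dumpHeap(out.toString(), true);

        System.out.println("Dumped " + Files.size(out) / (1024 * 1024) + " MB to " + out);
    }
}
```

This is HotSpot-specific (not a standard java.lang.management API), so it won't work on every JVM.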

Step 3: Analyze with Eclipse MAT

# Download Eclipse Memory Analyzer
# Open heap_day7.hprof

# Key reports to generate:
1. Leak Suspects Report → Automated analysis
2. Dominator Tree → Largest object trees
3. Histogram → Object count by class
4. Path to GC Roots → Why objects aren't collected

Common patterns to look for:

Dominator Tree:
├── java.util.HashMap: 4.2 GB (!)
│   └── entries: 50,000,000 objects
│       └── com.myapp.UserSession
│           └── ... (never cleared)
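
The "compare two dumps" step can also be scripted: take class histograms from day 1 and day 7 (e.g. parsed from `jmap -histo` output) and rank classes by instance-count growth — the unbounded map above would float straight to the top. A sketch with illustrative data (parsing omitted):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class HistogramDiff {
    /** Classes ordered by instance-count growth between two histograms. */
    static List<Map.Entry<String, Long>> topGrowers(Map<String, Long> day1,
                                                    Map<String, Long> day7) {
        return day7.entrySet().stream()
                .map(e -> Map.entry(e.getKey(),
                        e.getValue() - day1.getOrDefault(e.getKey(), 0L)))
                .filter(e -> e.getValue() > 0)                       // grew at all
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> day1 = Map.of("com.myapp.UserSession", 10_000L,
                                        "java.lang.String", 1_000_000L);
        Map<String, Long> day7 = Map.of("com.myapp.UserSession", 50_000_000L,
                                        "java.lang.String", 1_100_000L);
        // UserSession grew by ~50M instances -> prime leak suspect
        System.out.println(topGrowers(day1, day7).get(0).getKey());
        // prints "com.myapp.UserSession"
    }
}
```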

Common Memory Leak Causes

Leak 1: Unbounded Cache

// 🔴 BAD: Cache that never evicts
public class UserCache {
    private static final Map<String, User> cache = new HashMap<>();

    public User getUser(String id) {
        return cache.computeIfAbsent(id, this::loadUser);
        // Cache grows forever!
    }
}

// ✅ GOOD: Bounded cache with eviction (Caffeine)
public class UserCache {
    private static final Cache<String, User> cache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofMinutes(30))
            .build();

    public User getUser(String id) {
        return cache.get(id, this::loadUser);
    }
}

Leak 2: Event Listener Not Removed

// 🔴 BAD: Listener registered but never removed
public class OrderProcessor {
    public void processOrder(Order order) {
        eventBus.register(new OrderListener(order));
        // OrderListener stays registered forever!
    }
}

// ✅ GOOD: Unregister when done
public class OrderProcessor {
    public void processOrder(Order order) {
        OrderListener listener = new OrderListener(order);
        eventBus.register(listener);
        try {
            // Process order
        } finally {
            eventBus.unregister(listener);
        }
    }
}

// ✅ EVEN BETTER: Use weak references so forgotten listeners can still be GC'd
public class EventBus {
    private final List<WeakReference<EventListener>> listeners =
            Collections.synchronizedList(new ArrayList<>());
}
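
Filling in that weak-reference idea, here is a minimal event bus that holds listeners through `WeakReference` and prunes cleared references on publish, so a forgotten `unregister` no longer pins the listener. Illustrative sketch, not a drop-in replacement for a real event-bus library:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class WeakEventBus<E> {
    private final List<WeakReference<Consumer<E>>> listeners =
            Collections.synchronizedList(new ArrayList<>());

    public void register(Consumer<E> listener) {
        listeners.add(new WeakReference<>(listener));
    }

    public void publish(E event) {
        synchronized (listeners) {
            Iterator<WeakReference<Consumer<E>>> it = listeners.iterator();
            while (it.hasNext()) {
                Consumer<E> l = it.next().get();
                if (l == null) {
                    it.remove();       // listener was GC'd: drop the stale entry
                } else {
                    l.accept(event);
                }
            }
        }
    }
}
```

The trade-off: callers must keep a strong reference to any listener they still want notified, since the bus alone no longer keeps it alive — a lambda registered inline with no other reference can be collected at any time.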

Leak 3: ThreadLocal Not Cleared

// 🔴 BAD: ThreadLocal in thread pool
public class RequestContext {
    private static final ThreadLocal<UserSession> context = new ThreadLocal<>();

    public static void setSession(UserSession session) {
        context.set(session);
    }

    // Threads in pool are reused, ThreadLocal values accumulate!
}

// ✅ GOOD: Always clear ThreadLocal
public class RequestContext {
    private static final ThreadLocal<UserSession> context = new ThreadLocal<>();

    public static void setSession(UserSession session) {
        context.set(session);
    }

    public static void clear() {
        context.remove(); // Must call after request!
    }
}

// In a servlet filter/interceptor:
@Override
public void doFilter(ServletRequest request, ServletResponse response,
                     FilterChain chain) throws IOException, ServletException {
    try {
        RequestContext.setSession(extractSession(request));
        chain.doFilter(request, response);
    } finally {
        RequestContext.clear(); // Always clear!
    }
}
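
The accumulation is easy to demonstrate: with a pooled (reused) thread, a value set by one task and never removed is still visible to the next task on that thread. A runnable sketch with illustrative names:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakDemo {
    private static final ThreadLocal<String> context = new ThreadLocal<>();

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one reused thread

        // Task 1 sets the ThreadLocal but "forgets" to remove it
        pool.submit(() -> context.set("user-from-request-1")).get();

        // Task 2 runs on the same pooled thread and sees the stale value
        String stale = pool.submit(() -> context.get()).get();
        System.out.println(stale); // prints "user-from-request-1" — leaked state

        // The fix: remove() (normally in a finally block) ends the leak
        pool.submit(() -> context.remove()).get();
        String afterClear = pool.submit(() -> context.get()).get();
        System.out.println(afterClear); // prints "null"

        pool.shutdown();
    }
}
```

Beyond stale values, each retained entry also keeps its `UserSession` (and everything it references) reachable for the lifetime of the pool thread — that is the memory leak.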

Leak 4: Connection/Resource Not Closed

// 🔴 BAD: Connection not closed on error
public List<User> getUsers() throws SQLException {
    Connection conn = dataSource.getConnection();
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM users");
    List<User> users = mapResults(rs);
    conn.close(); // Never reached if mapResults throws!
    return users;
}

// ✅ GOOD: Try-with-resources
public List<User> getUsers() throws SQLException {
    try (Connection conn = dataSource.getConnection();
         PreparedStatement stmt = conn.prepareStatement("SELECT * FROM users");
         ResultSet rs = stmt.executeQuery()) {
        return mapResults(rs);
    } // Auto-closed even on exception
}

Leak 5: String Intern Abuse

// 🔴 BAD: Interning user-generated strings
public void processMessage(String message) {
    String normalized = message.toLowerCase().intern();
    // intern() adds every distinct value to the JVM string pool; an
    // unbounded input set bloats the pool (pre-Java 7 it lived in PermGen
    // and was effectively never reclaimed)
}

// ✅ GOOD: Don't intern unbounded strings
public void processMessage(String message) {
    String normalized = message.toLowerCase();
    // Normal string, eligible for GC
}

Leak 6: ClassLoader Leak (in apps with hot reload)

// Common in web apps that redeploy without restarting the JVM
// Old classloaders keep references to the previous deployment's classes

// Detection: Metaspace (PermGen pre-Java 8) keeps growing after redeploys

// Solutions:
// 1. Restart the JVM on deploy (recommended for prod)
// 2. Ensure class unloading can happen: modern collectors (G1, ZGC) unload
//    classes by default; -XX:+CMSClassUnloadingEnabled applied only to the
//    now-removed CMS collector
// 3. Fix static references (ThreadLocals, JDBC drivers, shutdown hooks)
//    that pin the old classloader
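
A quick way to watch for this from inside the JVM: the standard `ClassLoadingMXBean` tracks loaded and unloaded class counts, so a loaded count that climbs across redeploys while the unloaded count stays flat points at a classloader leak. Minimal sketch:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadWatch {
    public static void main(String[] args) {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        long loaded = cl.getLoadedClassCount();     // classes currently loaded
        long unloaded = cl.getUnloadedClassCount(); // classes unloaded so far
        long total = cl.getTotalLoadedClassCount(); // classes ever loaded

        System.out.printf("loaded=%d unloaded=%d total=%d%n", loaded, unloaded, total);
        // Healthy redeploys: unloaded rises with each cycle.
        // Classloader leak:  loaded/total climb, unloaded stays near zero.
    }
}
```

These values are also exposed over JMX, so they can feed the same dashboards as the heap metrics below.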

Debugging Tools Cheatsheet

Tool            | Use Case                   | Command
----------------|----------------------------|----------------------------------------------
jstat           | GC statistics              | jstat -gcutil <pid> 5000
jmap            | Heap dump                  | jmap -dump:live,format=b,file=heap.hprof <pid>
jcmd            | Memory info                | jcmd <pid> GC.heap_info
jvisualvm       | Real-time monitoring       | (GUI tool)
MAT             | Heap analysis              | (Eclipse Memory Analyzer, GUI)
async-profiler  | CPU + allocation profiling | ./profiler.sh -e alloc -d 60 <pid>

Quick Memory Profiling Script

#!/usr/bin/env python3
# memory_monitor.py - Track JVM memory over time

import csv
import subprocess
import time
from datetime import datetime


def get_memory_stats(pid):
    result = subprocess.run(
        ['jstat', '-gc', str(pid)],
        capture_output=True, text=True
    )
    # Parse jstat output: one header line, one line of values
    lines = result.stdout.strip().split('\n')
    headers = lines[0].split()
    values = lines[1].split()
    return dict(zip(headers, values))


def monitor(pid, output_file, interval=60):
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['timestamp', 'heap_used_mb', 'old_gen_mb', 'gc_count'])

        while True:
            stats = get_memory_stats(pid)
            heap_used = (float(stats['EU']) + float(stats['OU'])) / 1024  # KB -> MB
            old_gen = float(stats['OU']) / 1024
            gc_count = int(stats['FGC'])  # Full GC count

            writer.writerow([
                datetime.now().isoformat(),
                heap_used,
                old_gen,
                gc_count
            ])
            f.flush()

            print(f"Heap: {heap_used:.0f}MB, Old: {old_gen:.0f}MB, GCs: {gc_count}")
            time.sleep(interval)


if __name__ == '__main__':
    import sys
    monitor(int(sys.argv[1]), 'memory_log.csv')

Prevention Strategies

// 1. Bounded collections (LRU via access-ordered LinkedHashMap)
private static final int MAX_SIZE = 10_000;

Map<K, V> cache = Collections.synchronizedMap(
    new LinkedHashMap<>(MAX_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > MAX_SIZE;
        }
    }
);

// 2. Weak references for caches (an entry vanishes once its key is
//    no longer strongly referenced anywhere else)
Map<K, V> cache = new WeakHashMap<>();

// 3. Explicit cleanup in finally blocks
try {
    // Use resource
} finally {
    cleanup();
}

// 4. JVM flags for monitoring
// -XX:+UseG1GC
// -XX:MaxGCPauseMillis=200
// -Xlog:gc*:file=gc.log     (Java 9+; replaces -XX:+PrintGCDetails)

// 5. Metrics and alerts
@Scheduled(fixedRate = 60000) // every minute
public void reportMemoryMetrics() {
    MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    MemoryUsage heap = memory.getHeapMemoryUsage();

    metrics.gauge("jvm.heap.used", heap.getUsed());
    metrics.gauge("jvm.heap.max", heap.getMax());

    double usagePercent = (double) heap.getUsed() / heap.getMax() * 100;
    if (usagePercent > 80) {
        alert("High heap usage: " + usagePercent + "%");
    }
}

Key Takeaways

  1. Confirm the leak - Use jstat to verify old gen keeps growing
  2. Take before/after dumps - Compare heap state over time
  3. Use MAT's Leak Suspects - Automated analysis finds most leaks
  4. Common causes: Caches, listeners, ThreadLocal, connections
  5. Prevention: Bounded caches, weak references, try-with-resources
  6. Monitor continuously - Alert before OOM happens

Rule of thumb: If memory grows linearly with time (not traffic), you have a leak. If it grows with traffic, you need more capacity.