Memory Leak Hunt
The Interview Question
"Our Java service uses more memory every day. It starts at 2GB, by day 7 it's at 8GB, then OOMs and restarts. The cycle repeats. How do you find and fix the leak?"
Asked at: Amazon, Netflix, LinkedIn, any company with JVM services
Time to solve: 30-35 minutes
Difficulty: ⭐⭐⭐ (Senior)
Clarifying Questions to Ask
- "What type of memory is growing?" → Heap? Off-heap? Native?
- "Does it happen in all environments?" → Prod only? Load-related?
- "Any recent deploys before this started?" → Changed dependencies?
- "What's the traffic pattern?" → Spiky? Constant?
- "Are there scheduled jobs?" → Batch processing leaks?
The Investigation Process
Step 1: Confirm It's Actually a Leak
# Monitor heap usage over time
jstat -gc <pid> 5000 100 # Every 5 seconds, 100 samples
# Output to watch:
# OU (Old generation Used) - Should stabilize after GC
# If OU keeps growing even after Full GC → Leak confirmed
# Check GC activity
jstat -gcutil <pid> 5000
# S0     S1     E      O      M      CCS    YGC   YGCT    FGC  FGCT   GCT
# 0.00   45.12  67.89  95.43  92.10  88.30  150   45.200  12   8.500  53.700
#                      ^
#                      Old gen (O) at 95% = trouble
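If you can add code to the service itself, the same check can be run in-process via `MemoryMXBean`: sample heap usage right after requesting a GC and watch whether the floor keeps rising. A minimal sketch (note that `System.gc()` is only advisory; the JVM may ignore it, so trust trends, not single samples):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class LeakCheck {
    // Heap bytes in use after requesting a GC. Sampled periodically,
    // a steadily rising floor suggests a leak.
    static long usedHeapAfterGc() {
        System.gc(); // advisory only; the JVM may skip it
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("heap used after GC: "
                + usedHeapAfterGc() / (1024 * 1024) + " MB");
    }
}
```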
Step 2: Take Heap Dumps
# Take heap dump when memory is low (after restart)
jmap -dump:format=b,file=heap_day1.hprof <pid>
# Take another when memory is high (before OOM)
jmap -dump:live,format=b,file=heap_day7.hprof <pid>
# Or configure JVM to dump on OOM
java -XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/var/log/heapdumps/ \
-jar app.jar
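Heap dumps can also be triggered programmatically through the HotSpot-specific `HotSpotDiagnosticMXBean`, which is handy for wiring up a "dump now" admin endpoint. A sketch (HotSpot JVMs only; `dumpHeap` fails if the target file already exists):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

public class DumpHelper {
    // Proxy to HotSpot's diagnostic bean (not available on non-HotSpot JVMs).
    static HotSpotDiagnosticMXBean diagnosticBean() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        return ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
    }

    // live=true dumps only reachable objects (forces a GC first),
    // mirroring jmap's -dump:live option. Fails if path already exists.
    static void dumpHeap(String path, boolean live) throws Exception {
        diagnosticBean().dumpHeap(path, live);
    }

    public static void main(String[] args) throws Exception {
        dumpHeap("heap_manual.hprof", true);
        System.out.println("dump written");
    }
}
```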
Step 3: Analyze with Eclipse MAT
# Download Eclipse Memory Analyzer
# Open heap_day7.hprof
# Key reports to generate:
1. Leak Suspects Report → Automated analysis
2. Dominator Tree → Largest object trees
3. Histogram → Object count by class
4. Path to GC Roots → Why objects aren't collected
Common patterns to look for:
Dominator Tree:
├── java.util.HashMap: 4.2 GB (!)
│   └── entries: 50,000,000 objects
│       └── com.myapp.UserSession
│           └── ... (never cleared)
Common Memory Leak Causes
Leak 1: Unbounded Cache
// 🔴 BAD: Cache that never evicts
public class UserCache {
    private static final Map<String, User> cache = new HashMap<>();

    public User getUser(String id) {
        return cache.computeIfAbsent(id, this::loadUser);
        // Cache grows forever!
    }
}

// ✅ GOOD: Bounded cache with eviction (Caffeine)
public class UserCache {
    private static final Cache<String, User> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofMinutes(30))
        .build();

    public User getUser(String id) {
        return cache.get(id, this::loadUser);
    }
}
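If pulling in Caffeine isn't an option, a JDK-only bounded LRU can be sketched with an access-ordered `LinkedHashMap` (the same idea used in the prevention section; class and method names here are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCacheDemo {
    // LRU cache: access-order LinkedHashMap that evicts the eldest
    // (least recently used) entry once maxSize is exceeded.
    static <K, V> Map<K, V> lruCache(int maxSize) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    public static void main(String[] args) {
        Map<Integer, String> cache = lruCache(10_000);
        for (int i = 0; i < 50_000; i++) {
            cache.put(i, "user-" + i);
        }
        System.out.println(cache.size()); // 10000 -- bounded, no leak
    }
}
```

Note this map is not thread-safe on its own; wrap it in `Collections.synchronizedMap` for concurrent use.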
Leak 2: Event Listener Not Removed
// 🔴 BAD: Listener registered but never removed
public class OrderProcessor {
    public void processOrder(Order order) {
        eventBus.register(new OrderListener(order));
        // OrderListener stays registered forever!
    }
}

// ✅ GOOD: Unregister when done
public class OrderProcessor {
    public void processOrder(Order order) {
        OrderListener listener = new OrderListener(order);
        eventBus.register(listener);
        try {
            // Process order
        } finally {
            eventBus.unregister(listener);
        }
    }
}

// ✅ EVEN BETTER: Use weak references
public class EventBus {
    private final List<WeakReference<EventListener>> listeners =
        Collections.synchronizedList(new ArrayList<>());
}
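A self-contained sketch of what such a weak-reference bus might look like (`WeakEventBus` and `Listener` are illustrative names, not from any library). The bus alone never keeps a listener alive, and dead slots are pruned lazily during publish:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class WeakEventBus {
    public interface Listener {
        void onEvent(String event);
    }

    // Weak references: registration alone does not prevent GC.
    private final List<WeakReference<Listener>> listeners = new ArrayList<>();

    public synchronized void register(Listener listener) {
        listeners.add(new WeakReference<>(listener));
    }

    public synchronized void publish(String event) {
        Iterator<WeakReference<Listener>> it = listeners.iterator();
        while (it.hasNext()) {
            Listener listener = it.next().get();
            if (listener == null) {
                it.remove(); // listener was GC'd: prune the dead slot
            } else {
                listener.onEvent(event);
            }
        }
    }

    public static void main(String[] args) {
        WeakEventBus bus = new WeakEventBus();
        List<String> seen = new ArrayList<>();
        Listener listener = seen::add; // keep a strong ref while in use
        bus.register(listener);
        bus.publish("order-created");
        System.out.println(seen); // [order-created]
    }
}
```

The trade-off: callers must hold their own strong reference to the listener for as long as they want events, otherwise it can be collected mid-flight.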
Leak 3: ThreadLocal Not Cleared
// 🔴 BAD: ThreadLocal in thread pool
public class RequestContext {
    private static final ThreadLocal<UserSession> context = new ThreadLocal<>();

    public static void setSession(UserSession session) {
        context.set(session);
    }
    // Threads in pool are reused, ThreadLocal values accumulate!
}

// ✅ GOOD: Always clear ThreadLocal
public class RequestContext {
    private static final ThreadLocal<UserSession> context = new ThreadLocal<>();

    public static void setSession(UserSession session) {
        context.set(session);
    }

    public static void clear() {
        context.remove(); // Must call after request!
    }
}

// In a servlet filter/interceptor:
@Override
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {
    try {
        RequestContext.setSession(extractSession(request));
        chain.doFilter(request, response);
    } finally {
        RequestContext.clear(); // Always clear!
    }
}
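The reuse problem is easy to reproduce with a one-thread pool: the same worker thread serves both "requests", so a value set by the first and never removed is visible to the second. A self-contained sketch (class and method names are illustrative):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalDemo {
    static final ThreadLocal<String> CTX = new ThreadLocal<>();

    // Simulates two sequential "requests" on the same pooled thread and
    // returns what the second request observes in the ThreadLocal.
    static String secondRequestSees(boolean clearAfterUse) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one reused thread
        try {
            pool.submit(() -> {
                CTX.set("alice");
                if (clearAfterUse) {
                    CTX.remove(); // the fix: clear before returning the thread
                }
            }).get();
            Callable<String> read = CTX::get;
            return pool.submit(read).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(secondRequestSees(false)); // alice  <-- leaked!
        System.out.println(secondRequestSees(true));  // null   <-- clean
    }
}
```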
Leak 4: Connection/Resource Not Closed
// 🔴 BAD: Connection not closed on error
public List<User> getUsers() throws SQLException {
    Connection conn = dataSource.getConnection();
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM users");
    List<User> users = mapResults(rs);
    conn.close(); // Never reached if mapResults throws!
    return users;
}

// ✅ GOOD: Try-with-resources
public List<User> getUsers() throws SQLException {
    try (Connection conn = dataSource.getConnection();
         PreparedStatement stmt = conn.prepareStatement("SELECT * FROM users");
         ResultSet rs = stmt.executeQuery()) {
        return mapResults(rs);
    } // Auto-closed even on exception
}
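That try-with-resources always closes, even when the body throws, can be demonstrated with a toy `AutoCloseable` (names here are illustrative):

```java
public class AutoCloseDemo {
    static final StringBuilder LOG = new StringBuilder();

    static class Resource implements AutoCloseable {
        @Override
        public void close() {
            LOG.append("closed;"); // runs even when the try body throws
        }
    }

    static String useAndFail() {
        try (Resource r = new Resource()) {
            throw new IllegalStateException("boom");
        } catch (IllegalStateException e) {
            return LOG.toString(); // close() has already run by now
        }
    }

    public static void main(String[] args) {
        System.out.println(useAndFail()); // closed;
    }
}
```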
Leak 5: String Intern Abuse
// 🔴 BAD: Interning user-generated strings
public void processMessage(String message) {
    String normalized = message.toLowerCase().intern();
    // intern() puts the string in the JVM's string pool. Before Java 7
    // the pool lived in PermGen and was effectively never collected;
    // since then interned strings can be GC'd once unreachable, but
    // flooding the fixed-size pool table with unbounded input still
    // wastes memory and slows lookups.
}

// ✅ GOOD: Don't intern unbounded strings
public void processMessage(String message) {
    String normalized = message.toLowerCase();
    // Normal string, eligible for GC
}
Leak 6: ClassLoader Leak (in apps with hot reload)
// Common in web apps that redeploy without restart
// Old classloaders keep references to classes
// Detection: PermGen/Metaspace keeps growing after redeploys
// Solution:
// 1. Restart JVM on deploy (recommended for prod)
// 2. Ensure class unloading is enabled (default with G1; the old
//    -XX:+CMSClassUnloadingEnabled flag applied only to the CMS
//    collector, which was removed in JDK 14)
// 3. Fix static references holding the old classloader
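Metaspace growth can also be watched in-process via `MemoryPoolMXBean`, assuming a HotSpot JVM where the pool is literally named "Metaspace":

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceCheck {
    // Returns Metaspace bytes used, or -1 if the pool isn't found
    // (pool names are JVM-specific; "Metaspace" matches HotSpot, JDK 8+).
    static long metaspaceUsed() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                return pool.getUsage().getUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Metaspace used: " + metaspaceUsed() / 1024 + " KB");
    }
}
```

If this number keeps climbing across redeploys, suspect a classloader leak.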
Debugging Tools Cheatsheet
| Tool | Use Case | Command |
|---|---|---|
| jstat | GC statistics | jstat -gcutil <pid> 5000 |
| jmap | Heap dump | jmap -dump:live,format=b,file=heap.hprof <pid> |
| jcmd | Memory info | jcmd <pid> GC.heap_info |
| jvisualvm | Real-time monitoring | GUI tool |
| MAT | Heap analysis | Eclipse Memory Analyzer |
| async-profiler | CPU + allocation profiling | ./profiler.sh -e alloc -d 60 <pid> |
Quick Memory Profiling Script
#!/usr/bin/env python3
# memory_monitor.py - Track JVM memory over time
import csv
import subprocess
import sys
import time
from datetime import datetime

def get_memory_stats(pid):
    result = subprocess.run(
        ['jstat', '-gc', str(pid)],
        capture_output=True, text=True
    )
    # Parse jstat output: a header row, then one row of values
    lines = result.stdout.strip().split('\n')
    headers = lines[0].split()
    values = lines[1].split()
    return dict(zip(headers, values))

def monitor(pid, output_file, interval=60):
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['timestamp', 'heap_used_mb', 'old_gen_mb', 'gc_count'])
        while True:
            stats = get_memory_stats(pid)
            heap_used = (float(stats['EU']) + float(stats['OU'])) / 1024  # KB -> MB
            old_gen = float(stats['OU']) / 1024
            gc_count = int(float(stats['FGC']))  # full GC count
            writer.writerow([
                datetime.now().isoformat(),
                heap_used,
                old_gen,
                gc_count,
            ])
            f.flush()
            print(f"Heap: {heap_used:.0f}MB, Old: {old_gen:.0f}MB, GCs: {gc_count}")
            time.sleep(interval)

if __name__ == '__main__':
    monitor(int(sys.argv[1]), 'memory_log.csv')
Prevention Strategies
// 1. Bounded collections (LRU via access-ordered LinkedHashMap)
Map<K, V> cache = Collections.synchronizedMap(
    new LinkedHashMap<K, V>(MAX_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > MAX_SIZE;
        }
    }
);
// 2. Weak references for caches (entries vanish once the key is no
//    longer strongly referenced elsewhere -- not for must-keep data)
Map<K, V> cache = new WeakHashMap<>();
// 3. Explicit cleanup in finally blocks
try {
// Use resource
} finally {
cleanup();
}
// 4. JVM flags for monitoring
// -XX:+UseG1GC
// -XX:MaxGCPauseMillis=200
// -XX:+PrintGCDetails       (JDK 8; removed in JDK 9+)
// -Xlog:gc*:file=gc.log     (JDK 9+ replacement)
// 5. Metrics and alerts
@Scheduled(fixedRate = 60000)
public void reportMemoryMetrics() {
    MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    MemoryUsage heap = memory.getHeapMemoryUsage();
    metrics.gauge("jvm.heap.used", heap.getUsed());
    metrics.gauge("jvm.heap.max", heap.getMax());

    double usagePercent = (double) heap.getUsed() / heap.getMax() * 100;
    if (usagePercent > 80) {
        alert("High heap usage: " + usagePercent + "%");
    }
}
Key Takeaways
- Confirm the leak - Use jstat to verify old gen keeps growing
- Take before/after dumps - Compare heap state over time
- Use MAT's Leak Suspects - Automated analysis finds most leaks
- Common causes: Caches, listeners, ThreadLocal, connections
- Prevention: Bounded caches, weak references, try-with-resources
- Monitor continuously - Alert before OOM happens
Rule of thumb: If memory grows linearly with time (not traffic), you have a leak. If it grows with traffic, you need more capacity.