
Kubernetes OOMKilled in Production: Root Cause & Fix

Your pod gets OOMKilled at 3 AM. Here's how I diagnosed memory leaks, tuned resource limits, and prevented recurrence without downtime.

The Incident

It was 3:17 AM when PagerDuty fired. Our payment-service pods were being OOMKilled repeatedly: the kernel killed the containers faster than the kubelet could restart them, so the deployment never stabilized.

The error in kubectl describe pod:

Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137

Exit code 137 means the process was killed by SIGKILL (137 = 128 + 9), sent here by the kernel's Out Of Memory killer.
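
That 137 decodes mechanically: Unix exit codes above 128 mean the process died from a signal, and the signal number is the exit code minus 128:

```shell
# Exit codes above 128 = "killed by signal (code - 128)"
echo $((137 - 128))      # 9, the signal number
kill -l $((137 - 128))   # KILL, i.e. SIGKILL
```

The same rule explains exit code 143 (128 + 15 = SIGTERM) when a pod shuts down gracefully.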

Root Cause Analysis

First, check actual memory usage vs configured limits:

kubectl top pods -n production --sort-by=memory
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

We found the application's peak memory need was roughly 1.8Gi, while the container's limit was still 512Mi, configuration drift from six months earlier.
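
Comparing the two numbers is easier when both are in the same unit; a tiny helper (mine, not part of kubectl; the figures are illustrative) normalizes Kubernetes quantities to bytes:

```shell
# Convert a Kubernetes memory quantity (Ki/Mi/Gi are 1024-based) to bytes
# so `kubectl top` output can be compared against manifest limits.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *)   echo "$1" ;;  # already plain bytes
  esac
}

to_bytes 512Mi   # 536870912  -- the configured limit
to_bytes 2Gi     # 2147483648 -- the new limit
```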

The Fix

Step 1: Identify the memory leak

kubectl exec -it payment-service-xxx -- /bin/sh
# Inside the container:
cat /proc/meminfo                  # host-wide numbers, not the container's
cat /sys/fs/cgroup/memory.current  # cgroup v2: bytes this container is using
cat /sys/fs/cgroup/memory.max      # cgroup v2: the enforced limit
jmap -heap 1  # JVM heap summary for PID 1 (JDK 8; use `jhsdb jmap --heap --pid 1` on 9+)

Step 2: Update resource requests/limits

resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"

Step 3: Set JVM heap boundaries (Java apps)

JAVA_OPTS="-Xms512m -Xmx1536m -XX:+UseContainerSupport"  # heap capped well below the 2Gi limit, leaving room for metaspace, thread stacks, and direct buffers
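
On JDK 8u191+ and JDK 10+, the heap can instead be sized as a percentage of the container's memory limit, so a future limit change does not require re-tuning -Xmx (the percentages here are illustrative):

```shell
# Heap scales with the cgroup memory limit instead of a hard-coded -Xmx
JAVA_OPTS="-XX:+UseContainerSupport -XX:InitialRAMPercentage=25.0 -XX:MaxRAMPercentage=75.0"
echo "$JAVA_OPTS"
```

Keep MaxRAMPercentage comfortably under 100 so non-heap memory still fits inside the limit.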

Step 4: Add memory alerts before OOMKill

# Prometheus alert rule: working_set (not usage_bytes, which includes
# reclaimable page cache) is what OOM decisions are based on; the second
# clause skips containers with no limit set (limit = 0)
- alert: HighMemoryUsage
  expr: |
    container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
    and container_spec_memory_limit_bytes > 0
  for: 5m
  labels:
    severity: warning

Prevention

  • Always set both requests and limits
  • Give memory limits headroom: at least ~2x your p99 usage
  • Use VPA (Vertical Pod Autoscaler) for automatic tuning
  • Monitor memory trends, not just spikes
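
The last bullet can be encoded in Prometheus itself: predict_linear extrapolates the working-set trend and fires hours before the limit would actually be hit (the window, horizon, and threshold below are illustrative):

```yaml
# Fires when the 30m growth trend would cross the limit within 4 hours
- alert: MemoryExhaustionPredicted
  expr: |
    predict_linear(container_memory_working_set_bytes[30m], 4 * 3600)
      > container_spec_memory_limit_bytes
    and container_spec_memory_limit_bytes > 0
  for: 15m
  labels:
    severity: warning
```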

The service has been stable for 90 days since this fix.

Dealing with a similar problem?

I offer production DevOps consulting. Let's fix it together.

Hire Me →