Your pod gets OOMKilled at 3 AM. Here's how I diagnosed memory leaks, tuned resource limits, and prevented recurrence without downtime.
## The Incident
It was 3:17 AM when PagerDuty fired. Our payment-service pods were being OOMKilled repeatedly: the kernel was killing them faster than the deployment could restart them.
The error in `kubectl describe pod`:

```
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
```

Exit code 137 means the process received SIGKILL (137 = 128 + signal 9), sent here by the kernel's Out Of Memory (OOM) killer.
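You can reproduce the 137 convention locally without an OOM: shells report a child killed by signal N with exit status 128 + N, and the OOM killer delivers SIGKILL (signal 9).

```shell
# Reproduce exit code 137 by hand: SIGKILL a process and
# observe the 128 + 9 status the shell reports
sleep 60 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo "exit code: $?"   # prints: exit code: 137
```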
## Root Cause Analysis
First, compare actual memory usage against the configured limits:

```shell
kubectl top pods -n production --sort-by=memory
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
```
We found the pod was using 1.8Gi but the limit was set to 512Mi — a configuration drift from 6 months ago.
## The Fix

### Step 1: Identify the memory leak

```shell
kubectl exec -it payment-service-xxx -- /bin/sh

# Inside the container:
cat /proc/meminfo
jmap -heap 1   # heap summary for JVM apps (the app runs as PID 1)
```
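One caveat worth flagging on the step above: `/proc/meminfo` inside a container typically reports the node's memory, not the cgroup limit, so a quick summary like this (a sketch; the cgroup path assumes cgroup v2) can look deceptively healthy:

```shell
# Summarize /proc/meminfo; inside a container this reflects the *node*,
# not the cgroup limit -- on cgroup v2, the real cap is in
# /sys/fs/cgroup/memory.max
awk '/^MemTotal|^MemAvailable/ {printf "%s %.0f MiB\n", $1, $2/1024}' /proc/meminfo
```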
### Step 2: Update resource requests/limits

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"
```
### Step 3: Set JVM heap boundaries (Java apps)

```shell
JAVA_OPTS="-Xms512m -Xmx1536m -XX:+UseContainerSupport"
```
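The sizing logic behind those flags: the heap cap must sit well below the container limit, because the JVM also needs memory for metaspace, thread stacks, the code cache, and off-heap buffers. With the numbers from this post:

```shell
# Heap sizing sanity check: -Xmx must leave headroom under the
# container limit for the JVM's non-heap memory
limit_mib=2048   # container memory limit (2Gi)
xmx_mib=1536     # -Xmx1536m
echo "headroom: $(( limit_mib - xmx_mib )) MiB"   # prints: headroom: 512 MiB
```

On JDK 10+ (or 8u191+), `-XX:MaxRAMPercentage=75.0` expresses the same 1536/2048 ratio without hard-coding the limit into JAVA_OPTS.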
### Step 4: Add memory alerts before OOMKill

```yaml
# Prometheus alert rule
- alert: HighMemoryUsage
  # container_memory_working_set_bytes is what the kubelet watches;
  # container_memory_usage_bytes also counts reclaimable page cache
  expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
  for: 5m
  labels:
    severity: warning
```
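To catch slow leaks rather than just threshold crossings, a companion rule (hypothetical name, same cAdvisor metrics as above) could extrapolate growth with PromQL's `predict_linear`:

```yaml
# Hypothetical companion rule: fire when the last hour's linear trend
# predicts the working set will hit the limit within 4 hours
- alert: MemoryLeakSuspected
  expr: >
    predict_linear(container_memory_working_set_bytes[1h], 4 * 3600)
      > container_spec_memory_limit_bytes
  for: 30m
  labels:
    severity: warning
```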
## Prevention

- Always set both `requests` and `limits`
- Never set limits below 2x your p99 memory usage
- Use the Vertical Pod Autoscaler (VPA) for automatic tuning
- Monitor memory trends, not just spikes
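The VPA suggestion can start in recommendation-only mode before you let it touch anything. A minimal sketch, assuming the VPA components are installed and the Deployment is named `payment-service`:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"   # report recommendations only; "Auto" applies them by evicting pods
```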
The service has been stable for 90 days since this fix.
Dealing with a similar problem?
I offer production DevOps consulting. Let's fix it together.
Hire Me →