
Kubernetes OOMKilled in Production: Root Cause & Fix

Your pod gets OOMKilled at 3 AM. Here's how I diagnosed memory leaks, tuned resource limits, and prevented recurrence without downtime.

The Incident

It was 3:17 AM when PagerDuty fired. Our payment-service pods were being OOMKilled repeatedly: the kernel killed the containers faster than the kubelet could restart them, so the deployment never stabilized.

The error in kubectl describe pod:

Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137

Exit code 137 means the process was killed by SIGKILL (137 = 128 + 9), sent here by the kernel's Out Of Memory killer.
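
That 137 decodes mechanically: Unix exit codes above 128 mean the process died from a signal, and the signal number is the exit code minus 128:

```shell
# Exit codes above 128 = "killed by signal (code - 128)"
echo $((137 - 128))      # 9, the signal number
kill -l $((137 - 128))   # KILL, i.e. SIGKILL
```

The same rule explains exit code 143 (128 + 15 = SIGTERM) when a pod shuts down gracefully.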

Root Cause Analysis

First, check actual memory usage vs configured limits:

kubectl top pods -n production --sort-by=memory
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

We found the application's peak memory need was roughly 1.8Gi, while the container's limit was still 512Mi, configuration drift from six months earlier.
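
Comparing the two numbers is easier when both are in the same unit; a tiny helper (mine, not part of kubectl; the figures are illustrative) normalizes Kubernetes quantities to bytes:

```shell
# Convert a Kubernetes memory quantity (Ki/Mi/Gi are 1024-based) to bytes
# so `kubectl top` output can be compared against manifest limits.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *)   echo "$1" ;;  # already plain bytes
  esac
}

to_bytes 512Mi   # 536870912  -- the configured limit
to_bytes 2Gi     # 2147483648 -- the new limit
```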

The Fix

Step 1: Identify the memory leak

kubectl exec -it payment-service-xxx -- /bin/sh
# Inside the container:
cat /proc/meminfo                  # host-wide numbers, not the container's
cat /sys/fs/cgroup/memory.current  # cgroup v2: bytes this container is using
cat /sys/fs/cgroup/memory.max      # cgroup v2: the enforced limit
jmap -heap 1  # JVM heap summary for PID 1 (JDK 8; use `jhsdb jmap --heap --pid 1` on 9+)

Step 2: Update resource requests/limits

resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"

Step 3: Set JVM heap boundaries (Java apps)

JAVA_OPTS="-Xms512m -Xmx1536m -XX:+UseContainerSupport"  # heap capped well below the 2Gi limit, leaving room for metaspace, thread stacks, and direct buffers
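
On JDK 8u191+ and JDK 10+, the heap can instead be sized as a percentage of the container's memory limit, so a future limit change does not require re-tuning -Xmx (the percentages here are illustrative):

```shell
# Heap scales with the cgroup memory limit instead of a hard-coded -Xmx
JAVA_OPTS="-XX:+UseContainerSupport -XX:InitialRAMPercentage=25.0 -XX:MaxRAMPercentage=75.0"
echo "$JAVA_OPTS"
```

Keep MaxRAMPercentage comfortably under 100 so non-heap memory still fits inside the limit.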

Step 4: Add memory alerts before OOMKill

# Prometheus alert rule: working_set (not usage_bytes, which includes
# reclaimable page cache) is what OOM decisions are based on; the second
# clause skips containers with no limit set (limit = 0)
- alert: HighMemoryUsage
  expr: |
    container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
    and container_spec_memory_limit_bytes > 0
  for: 5m
  labels:
    severity: warning

Prevention

  • Always set both requests and limits
  • Give memory limits headroom: at least ~2x your p99 usage
  • Use VPA (Vertical Pod Autoscaler) for automatic tuning
  • Monitor memory trends, not just spikes
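
The last bullet can be encoded in Prometheus itself: predict_linear extrapolates the working-set trend and fires hours before the limit would actually be hit (the window, horizon, and threshold below are illustrative):

```yaml
# Fires when the 30m growth trend would cross the limit within 4 hours
- alert: MemoryExhaustionPredicted
  expr: |
    predict_linear(container_memory_working_set_bytes[30m], 4 * 3600)
      > container_spec_memory_limit_bytes
    and container_spec_memory_limit_bytes > 0
  for: 15m
  labels:
    severity: warning
```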

The service has been stable for 90 days since this fix.

Dealing with a similar problem?

I offer production DevOps consulting. Let's fix it together.

Hire Me →