502s spiking after a deploy? I'll walk through the exact nginx config, upstream keepalive tuning, and health check changes that eliminated them.
The Problem
After a routine deployment, Nginx started returning sporadic 502 Bad Gateway errors. Error rate jumped from 0% to 2.3% — enough to trigger SLA alerts.
Diagnosis Steps
Check Nginx error logs first:
tail -f /var/log/nginx/error.log | grep "upstream"
Output:
2025/04/10 14:22:31 [error] upstream timed out (110: Connection timed out) while reading response header from upstream
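To see whether the timeouts cluster around deploy windows, you can bucket the error log by minute. This is a sketch: the printf lines below stand in for real entries from /var/log/nginx/error.log.

```shell
# Count "upstream timed out" errors per minute of the log.
# In production, replace the printf with: cat /var/log/nginx/error.log
printf '%s\n' \
  '2025/04/10 14:22:31 [error] upstream timed out (110: Connection timed out)' \
  '2025/04/10 14:22:45 [error] upstream timed out (110: Connection timed out)' \
  '2025/04/10 14:23:02 [error] upstream timed out (110: Connection timed out)' |
awk '/upstream timed out/ { split($2, t, ":"); count[t[1] ":" t[2]]++ }
     END { for (m in count) print m, count[m] }'
```

A sudden spike in one minute bucket lines up nicely against deploy timestamps.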
Check upstream health:
curl -v http://backend-service:8080/health
# Compare response times
for i in {1..20}; do curl -w "%{time_total}s\n" -o /dev/null -s http://backend:8080/; done
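Eyeballing twenty timings is error-prone; percentiles make the tail obvious. A quick sketch using Python's standard library, with hypothetical timings standing in for the curl loop's output:

```python
import statistics

# Hypothetical timings (seconds) pasted from the curl loop above.
timings = [0.021, 0.019, 0.024, 0.018, 0.022, 0.020,
           0.023, 0.019, 1.840, 0.021, 0.020, 2.310]

# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
p95 = statistics.quantiles(timings, n=20)[18]
print(f"median={statistics.median(timings):.3f}s p95={p95:.3f}s")
```

A healthy median with a multi-second p95, as in this sample, is the classic signature of intermittent upstream timeouts rather than a uniformly slow backend.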
Root Cause
The app's graceful shutdown was taking 45 seconds. Nginx's default upstream read timeout of 60s would have covered that, but we had overridden proxy_read_timeout to 30s, so during rolling deploys Nginx timed out in-flight connections to draining pods and returned 502s.
The Fix
upstream backend {
    server backend-service:8080;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Tuned timeouts
        proxy_connect_timeout 5s;
        proxy_send_timeout    60s;
        proxy_read_timeout    90s;

        # Retry on failure
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
}
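The proxy_next_upstream directives cap retries at 3 attempts inside a 10-second overall window, whichever runs out first. A minimal Python sketch of that tries-plus-budget logic (the names here are illustrative, not Nginx internals):

```python
import time

def proxy_with_retries(upstream_call, tries=3, budget_s=10.0):
    """Sketch of proxy_next_upstream semantics: retry a failed attempt
    until either the try limit or the overall time budget is exhausted."""
    deadline = time.monotonic() + budget_s
    last_error = None
    for _ in range(tries):
        if time.monotonic() >= deadline:
            break  # proxy_next_upstream_timeout exceeded
        try:
            return upstream_call()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc  # error/timeout-style failure: try again
    raise last_error  # all tries exhausted: the client sees the 502

# Usage: first attempt fails, the retry succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) == 1:
        raise ConnectionError("upstream reset")
    return "200 OK"

print(proxy_with_retries(flaky))
```

One caveat worth knowing: by default Nginx will not retry non-idempotent requests (POST, PATCH), which is usually what you want during a deploy.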
And in your Kubernetes deployment:
# On the container spec:
lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]

# On the pod spec (one level up from containers):
terminationGracePeriodSeconds: 60
The 5-second preStop sleep keeps the pod serving while its endpoint is removed from load balancing, so new traffic stops arriving before the app receives SIGTERM and begins its graceful shutdown.
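The grace-period arithmetic is worth making explicit, since Kubernetes SIGKILLs the pod when the budget runs out mid-request. A sketch (the 10s margin is an assumption, not a measured value):

```python
# Shutdown budget: the grace period must cover every phase of teardown.
pre_stop_sleep_s = 5    # LB drain window from the preStop hook
app_drain_s = 45        # observed graceful-shutdown time
safety_margin_s = 10    # headroom for slow requests (assumed)

required = pre_stop_sleep_s + app_drain_s + safety_margin_s
print(required <= 60)  # fits within terminationGracePeriodSeconds: 60
```

If the app's drain time ever grows, this is the invariant that silently breaks first.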
502s dropped to zero within 10 minutes.
Dealing with a similar problem?
I offer production DevOps consulting. Let's fix it together.
Hire Me →