502s spiking after a deploy? I'll walk through the exact nginx config, upstream keepalive tuning, and health check changes that eliminated them.
The Problem
After a routine deployment, Nginx started returning sporadic 502 Bad Gateway errors. Error rate jumped from 0% to 2.3% — enough to trigger SLA alerts.
Diagnosis Steps
Check Nginx error logs first:
tail -f /var/log/nginx/error.log | grep "upstream"
Output:
2025/04/10 14:22:31 [error] upstream timed out (110: Connection timed out) while reading response header from upstream
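To see whether the timeouts cluster around deploy windows, you can bucket the error log by minute. This is a sketch: the printf lines below stand in for real entries from /var/log/nginx/error.log.

```shell
# Count "upstream timed out" errors per minute of the log.
# In production, replace the printf with: cat /var/log/nginx/error.log
printf '%s\n' \
  '2025/04/10 14:22:31 [error] upstream timed out (110: Connection timed out)' \
  '2025/04/10 14:22:45 [error] upstream timed out (110: Connection timed out)' \
  '2025/04/10 14:23:02 [error] upstream timed out (110: Connection timed out)' |
awk '/upstream timed out/ { split($2, t, ":"); count[t[1] ":" t[2]]++ }
     END { for (m in count) print m, count[m] }'
```

A sudden spike in one minute bucket lines up nicely against deploy timestamps.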
Check upstream health:
curl -v http://backend-service:8080/health
# Compare response times
for i in {1..20}; do curl -w "%{time_total}s\n" -o /dev/null -s http://backend:8080/; done
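Eyeballing twenty timings is error-prone; percentiles make the tail obvious. A quick sketch using Python's standard library, with hypothetical timings standing in for the curl loop's output:

```python
import statistics

# Hypothetical timings (seconds) pasted from the curl loop above.
timings = [0.021, 0.019, 0.024, 0.018, 0.022, 0.020,
           0.023, 0.019, 1.840, 0.021, 0.020, 2.310]

# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
p95 = statistics.quantiles(timings, n=20)[18]
print(f"median={statistics.median(timings):.3f}s p95={p95:.3f}s")
```

A healthy median with a multi-second p95, as in this sample, is the classic signature of intermittent upstream timeouts rather than a uniformly slow backend.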
Root Cause
The app's graceful shutdown was taking 45 seconds. Nginx's default upstream read timeout of 60s would have covered that, but we had overridden proxy_read_timeout to 30s, so during rolling deploys Nginx timed out in-flight connections to draining pods and returned 502s.
The Fix
upstream backend {
    server backend-service:8080;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Tuned timeouts
        proxy_connect_timeout 5s;
        proxy_send_timeout    60s;
        proxy_read_timeout    90s;

        # Retry on failure
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
}
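The proxy_next_upstream directives cap retries at 3 attempts inside a 10-second overall window, whichever runs out first. A minimal Python sketch of that tries-plus-budget logic (the names here are illustrative, not Nginx internals):

```python
import time

def proxy_with_retries(upstream_call, tries=3, budget_s=10.0):
    """Sketch of proxy_next_upstream semantics: retry a failed attempt
    until either the try limit or the overall time budget is exhausted."""
    deadline = time.monotonic() + budget_s
    last_error = None
    for _ in range(tries):
        if time.monotonic() >= deadline:
            break  # proxy_next_upstream_timeout exceeded
        try:
            return upstream_call()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc  # error/timeout-style failure: try again
    raise last_error  # all tries exhausted: the client sees the 502

# Usage: first attempt fails, the retry succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) == 1:
        raise ConnectionError("upstream reset")
    return "200 OK"

print(proxy_with_retries(flaky))
```

One caveat worth knowing: by default Nginx will not retry non-idempotent requests (POST, PATCH), which is usually what you want during a deploy.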
And in your Kubernetes deployment:
# On the container spec:
lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]

# On the pod spec (one level up from containers):
terminationGracePeriodSeconds: 60
The 5-second preStop sleep keeps the pod serving while its endpoint is removed from load balancing, so new traffic stops arriving before the app receives SIGTERM and begins its graceful shutdown.
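The grace-period arithmetic is worth making explicit, since Kubernetes SIGKILLs the pod when the budget runs out mid-request. A sketch (the 10s margin is an assumption, not a measured value):

```python
# Shutdown budget: the grace period must cover every phase of teardown.
pre_stop_sleep_s = 5    # LB drain window from the preStop hook
app_drain_s = 45        # observed graceful-shutdown time
safety_margin_s = 10    # headroom for slow requests (assumed)

required = pre_stop_sleep_s + app_drain_s + safety_margin_s
print(required <= 60)  # fits within terminationGracePeriodSeconds: 60
```

If the app's drain time ever grows, this is the invariant that silently breaks first.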
502s dropped to zero within 10 minutes.
Dealing with a similar problem?
I offer production DevOps consulting. Let's fix it together.
Hire Me →