Zero-downtime deployments: what 'zero' actually requires, and what most teams don't have

The pattern

What 'zero downtime' actually requires:

✓ Health checks reflect REAL readiness (not just 'process started')
✓ Graceful shutdown drains in-flight requests (SIGTERM handling)
✓ Connection draining at the load balancer (not just the pod)
✓ Rollback faster than the deploy (< 5 min, automated)
✓ SLI measurement during the rollout window (not just after)

Missing any one of these = not zero downtime. Just unmonitored downtime.
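
To make the graceful-shutdown item concrete, here is a minimal Python sketch of the bookkeeping behind "drain in-flight requests on SIGTERM". The `DrainController` class, its method names, and `install_sigterm_handler` are hypothetical, invented for illustration; a real server framework will have its own hooks, and this only shows the shape of the logic.

```python
import signal
import threading


class DrainController:
    """Hypothetical helper: tracks in-flight requests so shutdown can wait for them."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        # An Event can be set from a signal handler without taking the lock.
        self._draining = threading.Event()

    @property
    def draining(self) -> bool:
        return self._draining.is_set()

    def try_start_request(self) -> bool:
        """Admit a request unless shutdown has begun."""
        with self._cond:
            if self._draining.is_set():
                return False
            self._in_flight += 1
            return True

    def finish_request(self) -> None:
        """Mark one admitted request as complete and wake any waiter."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def begin_shutdown(self) -> None:
        """Stop admitting new work; requests already admitted keep running."""
        self._draining.set()

    def wait_for_drain(self, timeout: float) -> bool:
        """Block until no requests are in flight. Returns False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)


def install_sigterm_handler(controller: DrainController) -> None:
    """Register the handler. Must be called from the main thread."""

    def _on_sigterm(signum, frame):
        controller.begin_shutdown()

    signal.signal(signal.SIGTERM, _on_sigterm)
```

In this sketch the readiness endpoint would start failing as soon as `draining` is true, so the load balancer stops sending new traffic, and the process would exit only after `wait_for_drain` returns. The timeout should sit below whatever grace period the orchestrator allows before it force-kills the process.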

The insight

Most teams say they do zero-downtime deploys and mean 'we haven't gotten a complaint in a while.' Actually measuring it reveals the truth: connection drops, in-flight request failures, and cache invalidation spikes during rollouts that nobody's tracking because nobody defined what zero means.

The non-obvious part

The most common failure mode is a health check that passes before the app is actually ready: the DB connection pool isn't established, caches aren't warm, background workers haven't started. The pod reports 'Ready' while the app is still initializing, so users see errors. No dashboard shows it, because nobody is measuring error rate during the rollout window.
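
A readiness check that reflects real readiness can be as simple as refusing to report healthy until every dependency check passes. The sketch below assumes a hypothetical `readiness` function that takes named check callables; the check names in the usage example are placeholders, not a real API.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check and return (HTTP status, per-check results).

    Reports 200 only when all checks pass. A check that raises counts as failed.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Usage with placeholder checks (real ones would probe the pool, cache, workers).
status, detail = readiness({
    "db_pool": lambda: True,
    "cache_warm": lambda: False,
    "workers_started": lambda: True,
})
```

With these placeholder checks the endpoint would answer 503 until the cache check flips to true, which keeps the pod out of rotation while the app is still initializing. Returning the per-check results makes it obvious which dependency is holding things up.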

My rule

Define 'zero downtime' as a measurable SLI: error rate < 0.1% over any 5-minute window during a deploy. Validate it in staging before calling the work done, then measure it in production on every release.
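
One way to check that rule against request logs is a sliding-window error rate. This is a sketch under assumptions: the `(timestamp_seconds, is_error)` sample shape, the `worst_error_rate` and `meets_slo` names, and the `min_requests` guard are all mine, and a real pipeline would more likely run the equivalent query in its metrics system.

```python
from collections import deque
from typing import Deque, Iterable, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, request failed?)


def worst_error_rate(
    samples: Iterable[Sample],
    window_seconds: float = 300.0,
    min_requests: int = 1,
) -> float:
    """Highest error rate seen in any window of `window_seconds` ending at a sample.

    Windows holding fewer than `min_requests` requests are skipped to avoid
    judging the rollout on a handful of requests.
    """
    window: Deque[Sample] = deque()
    errors = 0
    worst = 0.0
    for ts, is_error in sorted(samples):
        window.append((ts, is_error))
        errors += int(is_error)
        # Evict samples that fell out of the window (ts - window_seconds, ts].
        while window[0][0] <= ts - window_seconds:
            _, old_error = window.popleft()
            errors -= int(old_error)
        if len(window) >= min_requests:
            worst = max(worst, errors / len(window))
    return worst


def meets_slo(samples: Iterable[Sample], threshold: float = 0.001) -> bool:
    """True when the worst 5-minute error rate stays under 0.1%."""
    return worst_error_rate(samples) < threshold
```

Feeding it only the requests logged between deploy start and deploy end gives a pass/fail answer per release. Setting `min_requests` higher is a judgment call for low-traffic services, where a single failed request can otherwise dominate a window.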

Worth reading

  • Kubernetes deployment strategies — rolling, blue/green, canary with traffic splitting
  • AWS ALB / GCP Cloud Load Balancing — connection draining configuration and health check tuning
