Kubernetes rollouts: why 'pods are Ready' is the wrong promotion gate

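Readiness is a local signal: a probe on each pod, run by the kubelet on that pod's node, reporting only that the pod can accept traffic. Production health is a global one. Most rollout pipelines conflate the two, and that's where incidents come from.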

#kubernetes #reliability #devops #sre

How it looks in practice

Bad gate:                         Good gate:
                                  
Deploy ──▶ Pods Ready? ──▶ Done   Deploy ──▶ Pods Ready?
           (local signal)                    │
                                             ▼
                                    SLO window check
                                    (error rate + p95)
                                             │
                                    Pass ──▶ Promote
                                    Fail ──▶ Auto-rollback

Where it breaks

  • 100% of pods are Ready while p95 latency spikes: cold caches, a noisy neighbor, or a saturated DB connection pool.
  • HPA reacts more slowly than a fast rollout, so you ship overload before autoscaling catches up.
  • The canary stays green because its metrics lack the labels or slices needed to isolate the failing segment.
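
The last failure mode is easiest to see with numbers. The sketch below is illustrative only: the request records, the `region` label, and the 1% threshold are all assumptions, not taken from any real system. It shows an aggregate error rate that passes while one slice fails completely.

```python
from collections import defaultdict

def error_rates_by_slice(requests, key):
    """Return {slice_value: error_rate} for a list of request records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for req in requests:
        totals[req[key]] += 1
        if req["status"] >= 500:
            errors[req[key]] += 1
    return {k: errors[k] / totals[k] for k in totals}

# Hypothetical canary traffic: 995 healthy requests from "eu",
# 5 requests from "ap", every one of them failing.
requests = (
    [{"region": "eu", "status": 200}] * 995
    + [{"region": "ap", "status": 503}] * 5
)

MAX_ERROR_RATE = 0.01  # assumed 1% threshold

overall = sum(r["status"] >= 500 for r in requests) / len(requests)
by_region = error_rates_by_slice(requests, "region")

aggregate_passes = overall <= MAX_ERROR_RATE
failing_slices = [k for k, rate in by_region.items() if rate > MAX_ERROR_RATE]
```

Here the aggregate check passes at 0.5% errors while the "ap" slice is at 100%. Without a label that separates the two, the gate has nothing to fail on.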

The rule

Promote only when the canary holds your SLO slice (error rate + latency) for a fixed observation window. Otherwise: auto-rollback.
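
A minimal sketch of that rule, assuming metrics arrive as one sample per scrape interval. The `Sample` type, the function name, and the thresholds are hypothetical; tools such as Argo Rollouts and Flagger implement their own versions of this logic.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class Sample:
    error_rate: float      # fraction of failed requests in the interval
    p95_latency_ms: float  # 95th percentile latency in the interval

def promotion_decision(
    samples: Sequence[Sample],
    max_error_rate: float,
    max_p95_ms: float,
    window_size: int,
) -> str:
    """Return "rollback", "wait", or "promote" for a canary.

    samples are the intervals observed since the canary started.
    Any breach triggers a rollback immediately; promotion requires
    a full window of window_size samples with no breach.
    """
    for s in samples:
        if s.error_rate > max_error_rate or s.p95_latency_ms > max_p95_ms:
            return "rollback"
    if len(samples) < window_size:
        return "wait"
    return "promote"

# Usage: five clean intervals are required before promotion.
healthy = [Sample(error_rate=0.001, p95_latency_ms=180.0)] * 5
decision = promotion_decision(healthy, 0.01, 250.0, window_size=5)
```

Failing on the first breach is a deliberately strict choice for the sketch; a real gate might tolerate a small number of failed intervals before rolling back.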

How to sanity-check it

  • Use Argo Rollouts or Flagger with Prometheus-backed gates on error rate, latency percentiles, and saturation.
  • Alert on canary-vs-baseline deltas, not only on absolute thresholds. Deltas catch regressions that still pass absolute checks.
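
To see why a delta check catches what an absolute threshold misses, here is a rough sketch. The latency numbers, the 500 ms limit, and the 25% tolerance are made-up values for illustration.

```python
def regression_vs_baseline(canary: float, baseline: float,
                           max_relative_increase: float) -> bool:
    """True when the canary metric exceeds the baseline by more than
    the allowed fraction (0.25 means 25% worse than baseline)."""
    if baseline <= 0:
        # No usable baseline: treat any non-zero canary value as a regression.
        return canary > 0
    return (canary - baseline) / baseline > max_relative_increase

ABSOLUTE_P95_LIMIT_MS = 500.0  # assumed absolute alert threshold
baseline_p95_ms = 120.0        # hypothetical stable version
canary_p95_ms = 210.0          # hypothetical canary

passes_absolute = canary_p95_ms <= ABSOLUTE_P95_LIMIT_MS
is_regression = regression_vs_baseline(canary_p95_ms, baseline_p95_ms, 0.25)
```

The canary is 75% slower than the baseline yet sits well under the absolute limit, so only the delta check flags it.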

The bigger picture

Operational maturity isn't about tools — it's about designing for the failure you haven't seen yet.
