Kubernetes rollouts: why 'pods are Ready' is the wrong promotion gate
Readiness is a node-local signal. Production health is a global one. Most rollout pipelines conflate the two — and that's where incidents come from.
#kubernetes #reliability #devops #sre
How it looks in practice
Bad gate:

Deploy ──▶ Pods Ready? ──▶ Done
           (local signal)

Good gate:

Deploy ──▶ Pods Ready?
               │
               ▼
        SLO window check
        (error rate + p95)
               │
        Pass ──▶ Promote
        Fail ──▶ Auto-rollback

Where it breaks
- 100% Ready pods while P95 latency spikes — bad cache warmup, noisy neighbor, DB connection saturation.
- HPA reacts slower than a fast rollout — you ship overload before autoscaling catches up.
- Canary stuck green because metrics lack the right labels/slices to isolate the failing segment.
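The HPA-lag failure mode above can be illustrated with toy numbers (all invented for the sketch): traffic shifts to the canary in 25% steps every 30 seconds, while pod capacity trails roughly two samples behind because the autoscaler only reacts to sustained load it has already observed.

```python
# Traffic shifted to the canary ReplicaSet, in RPS, sampled every 30 s.
traffic = [0, 250, 500, 750, 1000, 1000, 1000]

# Canary capacity (pods * per-pod RPS). The HPA reacts ~60 s late,
# so capacity tracks the traffic curve about two samples behind.
capacity = [200, 200, 250, 500, 750, 1000, 1000]

# Every sample where shifted traffic exceeds available capacity is a
# window of user-visible overload that the rollout itself created.
overloaded = [t * 30 for t, (load, cap) in enumerate(zip(traffic, capacity))
              if load > cap]
# → overloaded at t = 30, 60, 90, 120 s: two full minutes of overload
#   before autoscaling catches up, while every pod reports Ready.
```

The exact numbers don't matter; the shape does. Any rollout that shifts traffic faster than the autoscaler's reaction time produces this gap.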
The rule
→ Promote only when the canary holds your SLO slice (error rate + latency) for a fixed observation window. Otherwise: auto-rollback.
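That rule can be sketched in a few lines. This is a minimal illustration, not a real controller: `Sample` is a hypothetical container for windowed metrics, and the thresholds (1% error rate, 300 ms p95) are placeholders for your actual SLO slice. In practice a tool like Argo Rollouts or Flagger pulls these values from Prometheus.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    error_rate: float  # fraction of failed requests, 0.0-1.0
    p95_ms: float      # 95th-percentile latency in milliseconds

def gate(samples, max_error_rate=0.01, max_p95_ms=300.0):
    """Promote only if EVERY sample in the observation window holds the SLO.

    A single breach anywhere in the window fails the gate: the canary must
    *hold* the slice for the whole window, not merely end up healthy.
    """
    if not samples:
        return "rollback"  # no data is a failure, not a pass
    ok = all(s.error_rate <= max_error_rate and s.p95_ms <= max_p95_ms
             for s in samples)
    return "promote" if ok else "rollback"
```

Note the empty-window case: missing metrics roll back. A gate that treats "no data" as "no problem" silently degrades into the bad gate from the diagram.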
How to sanity-check it
- Argo Rollouts or Flagger with Prometheus gates — error rate, latency percentiles, saturation.
- Alert on canary-vs-baseline deltas, not absolute thresholds; deltas catch regressions that still pass absolute checks.
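The delta idea from the second bullet can be sketched as a ratio check. The 1.2 ratio and 5 ms noise floor here are illustrative assumptions, not recommendations:

```python
def canary_regressed(canary_p95_ms, baseline_p95_ms,
                     max_ratio=1.2, noise_floor_ms=5.0):
    """Compare the canary against the live baseline, not a fixed threshold.

    A canary at 180 ms passes an absolute 300 ms gate, but if baseline
    sits at 90 ms that is a 2x regression worth rolling back.
    noise_floor_ms avoids flagging jitter when both values are tiny.
    """
    if canary_p95_ms <= noise_floor_ms:
        return False
    return canary_p95_ms > baseline_p95_ms * max_ratio
```

This is why delta alerts catch what absolute checks miss: the threshold moves with the baseline, so "worse than what we were just serving" fires even deep inside the absolute SLO budget.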
The bigger picture
Operational maturity isn't about tools — it's about designing for the failure you haven't seen yet.