Resilient Architecture: design for failure, not for uptime
A reliability workflow focused on isolating blast radius, controlling dependencies, and using SLIs/SLOs as rollout gates.
Why this is worth your time
- Most outages are dependency cascades, not single-node failures.
- Reliability is cheaper to design in early (timeouts, retries, bulkheads) than to retrofit after an incident.
- SLO gates prevent fast rollouts from turning into fast incidents.
Architecture pattern
- Bulkheads: isolate critical dependencies and apply concurrency limits.
- Timeouts and retries: bounded, jittered, and tuned per dependency.
- Load shedding: graceful degradation with explicit priorities.
- Progressive delivery: canary with SLO-based promotion and auto-rollback.
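The first two patterns can be sketched in a few lines. This is a minimal illustration, not a production client: `Bulkhead`, `BulkheadFull`, `backoff_delays`, `call_with_retries`, and all default values are hypothetical names and numbers chosen for the example. It assumes the wrapped call enforces its own per-request timeout (for example through the HTTP client's timeout setting); the code here only bounds the number of retries and the total time spent retrying.

```python
import random
import threading
import time


class BulkheadFull(Exception):
    """Raised when the dependency's concurrency limit is already reached."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust the caller."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn):
        # Fail fast instead of queueing: a full bulkhead sheds the call.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("concurrency limit reached")
        try:
            return fn()
        finally:
            self._slots.release()


def backoff_delays(max_retries, base=0.1, cap=2.0, rng=random.random):
    """Jittered exponential backoff: delay i is rng() * min(cap, base * 2**i)."""
    return [rng() * min(cap, base * 2 ** i) for i in range(max_retries)]


def call_with_retries(fn, max_retries=2, deadline_s=1.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn at most max_retries + 1 times without sleeping past the deadline."""
    start = time.monotonic()
    delays = backoff_delays(max_retries, rng=rng)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except BulkheadFull:
            raise  # shedding is deliberate; retrying would defeat it
        except Exception:
            if attempt == max_retries:
                raise  # retry count exhausted
            delay = delays[attempt]
            if time.monotonic() - start + delay > deadline_s:
                raise  # the next attempt would exceed the overall deadline
            sleep(delay)
```

A caller would typically combine the two, for example `call_with_retries(lambda: payments_bulkhead.run(fetch_quote))`, where `payments_bulkhead` and `fetch_quote` are placeholders for a per-dependency bulkhead and the actual request.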
Sharp edges
- Unbounded retries amplify incidents (retry storms).
- Missing timeouts turn slow dependencies into full service failure.
- Alerting on symptoms without ownership labels creates noise and slows response.
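To make the retry-storm risk concrete, the sketch below computes worst-case amplification when every layer of a call chain retries independently, and shows one possible way to cap it with a retry budget. `worst_case_attempts` and `RetryBudget` are hypothetical names, and the ratio-based budget is just one simple policy, assumed here for illustration.

```python
def worst_case_attempts(retries_per_layer, layers):
    """Attempts reaching the deepest dependency if every layer retries every failure."""
    return (1 + retries_per_layer) ** layers


class RetryBudget:
    """Permits retries only while they stay within a fixed fraction of requests seen."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        # Deny the retry if it would push retries above ratio * requests.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


# Three layers that each retry twice can multiply one user request
# into 27 attempts against the slowest dependency.
print(worst_case_attempts(retries_per_layer=2, layers=3))
```

With a budget in place, a dependency that fails for every caller sees at most a bounded fraction of extra traffic instead of a multiple of it.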
Production checklist
- Each dependency has a timeout, a retry budget, and defined circuit-breaker behavior.
- Critical paths have SLOs and error budgets.
- Canary rollout is gated by SLI deltas (canary vs baseline).
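The last two checklist items could be evaluated along the following lines. This is a sketch under stated assumptions: `Sli`, `error_budget_remaining`, `canary_decision`, and the thresholds (`max_delta`, `min_requests`) are hypothetical, and the request and error counts are assumed to come from whatever metrics backend is in use, measured over the same window for canary and baseline.

```python
from dataclasses import dataclass


@dataclass
class Sli:
    """Request and error counts for one deployment over one window."""
    requests: int
    errors: int

    @property
    def error_ratio(self):
        return self.errors / self.requests if self.requests else 0.0


def error_budget_remaining(slo_target, window):
    """Fraction of the error budget left in the window; negative means overspent."""
    allowed = (1.0 - slo_target) * window.requests
    if allowed == 0:
        return 0.0
    return 1.0 - window.errors / allowed


def canary_decision(canary, baseline, max_delta=0.005, min_requests=500):
    """Gate promotion on the canary-vs-baseline error-ratio delta."""
    if canary.requests < min_requests:
        return "wait"  # too little traffic to compare meaningfully
    delta = canary.error_ratio - baseline.error_ratio
    return "rollback" if delta > max_delta else "promote"
```

Comparing against a live baseline rather than a fixed threshold keeps the gate from blaming the canary for a problem that affects both versions, such as a degraded shared dependency.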
Copy/paste snippets
promql: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
kubectl -n <ns> describe hpa <name>
kubectl -n <ns> rollout status deploy/<name>