Workflows

Resilient Architecture: design for failure, not for uptime

A reliability workflow focused on isolating blast radius, controlling dependencies, and using SLIs/SLOs as rollout gates.

Why this is worth your time

  • Most outages are dependency cascades, not single-node failures.
  • Reliability is cheaper to design in early (timeouts, retries, bulkheads) than to retrofit after an incident.
  • SLO gates prevent fast rollouts from turning into fast incidents.

Architecture pattern

  • Bulkheads: isolate critical dependencies and apply concurrency limits.
  • Timeouts and retries: bounded, jittered, and tuned per dependency.
  • Load shedding: graceful degradation with explicit priorities.
  • Progressive delivery: canary with SLO-based promotion and auto-rollback.
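
As an illustration of the bulkhead and load-shedding bullets above, here is a minimal sketch of a concurrency limit around one dependency. The names `Bulkhead`, `BulkheadFull`, and `fetch_recommendations` are hypothetical and not from any particular library. The sketch assumes a threaded service where rejecting a call immediately and serving a fallback is an acceptable degradation.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the bulkhead has no free slot; the caller should degrade."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust shared workers."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing,
        # so a slow dependency sheds load rather than piling up callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


def fetch_recommendations(bulkhead, fetch, fallback):
    """Hypothetical call site: serve a fallback when the bulkhead is full."""
    try:
        return bulkhead.call(fetch)
    except BulkheadFull:
        return fallback
```

Rejecting rather than queueing is a design choice: a queue in front of a slow dependency only moves the pile-up, while an explicit rejection lets the caller apply its own priority rules.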

Sharp edges

  • Unbounded retries amplify incidents (retry storms).
  • Missing timeouts turn slow dependencies into full service failure.
  • Alerting on symptoms without ownership labels creates noise and slows response.
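
One way to avoid the retry-storm edge above is to bound both the number of attempts and the total time spent retrying, and to randomize the backoff. The sketch below is a minimal example under stated assumptions: `call_with_retries` and all of its default values are hypothetical, and `fn` is assumed to enforce its own per-call timeout.

```python
import random
import time


def call_with_retries(fn, *, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      deadline=5.0, sleep=time.sleep, rng=random.random):
    """Call fn() with a bounded number of attempts and full-jitter backoff.

    All defaults are illustrative; tune them per dependency.
    """
    start = time.monotonic()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # in real code, catch only retryable errors
            last_error = exc
        if attempt == max_attempts - 1:
            break
        # Full jitter: sleep a random amount in [0, min(max_delay, base * 2^attempt)].
        delay = rng() * min(max_delay, base_delay * (2 ** attempt))
        # Give up if the next sleep would push us past the overall deadline.
        if time.monotonic() - start + delay > deadline:
            break
        sleep(delay)
    raise last_error
```

The `sleep` and `rng` parameters exist only so the backoff can be tested deterministically; production callers would leave them at their defaults.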

Production checklist

  • Each dependency has a timeout, a retry budget, and defined circuit-breaker behavior.
  • Critical paths have SLOs and error budgets.
  • Canary rollout is gated by SLI deltas (canary vs baseline).
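
The canary gate in the last item can be sketched as a comparison of error ratios measured over the same window. This is an illustrative sketch only: `Sli`, `canary_decision`, and the thresholds `max_delta` and `min_requests` are assumptions, and a real gate would likely also compare latency and use a statistical test rather than a fixed delta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    """Request counts observed over the same window for one deployment track."""
    total: int
    errors: int

    @property
    def error_ratio(self) -> float:
        return self.errors / self.total if self.total else 0.0


def canary_decision(baseline: Sli, canary: Sli, *, max_delta: float = 0.005,
                    min_requests: int = 500) -> str:
    """Return "promote", "rollback", or "wait" from the canary-vs-baseline delta."""
    # Too little canary traffic: the ratio is mostly noise, so keep observing.
    if canary.total < min_requests:
        return "wait"
    delta = canary.error_ratio - baseline.error_ratio
    return "rollback" if delta > max_delta else "promote"
```

Comparing against a live baseline rather than a fixed threshold keeps the gate from blaming the canary for an error spike that affects both tracks.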

Copy/paste snippets

PromQL (5xx error ratio over 5 minutes):
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
kubectl -n <ns> describe hpa <name>
kubectl -n <ns> rollout status deploy/<name>
