On-call burnout is an alert design problem, not a schedule problem

Every team I've seen fights burnout by rotating people faster. The actual fix is almost always the same: the alerts are wrong.

#sre #observability #reliability #devops

How it looks in practice

Alert quality spectrum:

Noisy ◀───────────────────────────▶ Actionable

[cpu > 80%]  [pod restart]  [error budget burn]  [customer impact]
     │              │               │                    │
 ignore me      maybe?         investigate!          wake me up

Where it breaks

  • Alerts without a named owner and a runbook produce paralysis, not action — especially at 2am.
  • Flapping alerts are the fastest path to alert blindness — engineers learn to dismiss pages before reading them.
  • Cause-based alerts (disk full) and symptom-based alerts (latency spike) need different urgency and routing.

The rule

Before any alert ships: Who acts on it? What do they do? What's the cost of 30 minutes of inaction? If you can't answer all three, it's not ready.
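The three questions are easy to encode as a pre-ship gate in whatever pipeline reviews alert definitions. A hypothetical sketch (the `AlertSpec` fields are my invention, not any vendor's schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlertSpec:
    name: str
    owner: Optional[str]          # who acts on it?
    runbook_url: Optional[str]    # what do they do?
    inaction_cost: Optional[str]  # what does 30 minutes of silence cost?

def ready_to_ship(alert: AlertSpec) -> bool:
    """An alert ships only if all three questions have answers."""
    return all([alert.owner, alert.runbook_url, alert.inaction_cost])

good = AlertSpec("checkout-error-burn", "payments-oncall",
                 "https://runbooks.example/checkout-errors",
                 "failed orders, direct revenue loss")
bad = AlertSpec("cpu-over-80", owner=None, runbook_url=None,
                inaction_cost=None)

assert ready_to_ship(good) is True
assert ready_to_ship(bad) is False
```

Making the gate mechanical keeps the standard from eroding one "temporary" alert at a time.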

How to sanity-check it

  • Weekly alert review ritual: tag every last-week page as actionable / noisy / redundant. Kill the bottom two categories.
  • Use PagerDuty/OpsGenie alert grouping and escalation policies to reduce the interrupt rate without hiding real incidents.
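The weekly tally is a few lines over last week's pages. A sketch, assuming pages were already tagged during the review (the alert names and data shape here are hypothetical):

```python
from collections import Counter

# Last week's pages, tagged in the weekly review meeting.
pages = [
    ("disk-full-db3", "actionable"),
    ("cpu-over-80", "noisy"),
    ("pod-restart-flap", "noisy"),
    ("pod-restart-flap", "redundant"),
    ("latency-p99-api", "actionable"),
]

tally = Counter(tag for _, tag in pages)
kill_list = sorted({name for name, tag in pages
                    if tag in ("noisy", "redundant")})

print(tally)      # how last week's pages broke down by tag
print(kill_list)  # alerts to delete or rewrite this week
```

If the kill list isn't shrinking week over week, the review is a status meeting, not a ritual.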

The bigger picture

Reliability is a product feature. The engineers who treat it that way are the ones who get asked into the room.
