On-call burnout is an alert design problem, not a schedule problem
Every team I've seen fights burnout by rotating people faster. The actual fix is almost always the same: the alerts are wrong.
#sre #observability #reliability #devops
How it looks in practice
Alert quality spectrum:

Noisy ◀──────────────────────────▶ Actionable
[cpu > 80%]   [pod restart]   [error budget burn]   [customer impact]
     │              │                  │                    │
 ignore me        maybe?          investigate!         wake me up

Where it breaks
- Alerts without a named owner and a runbook produce paralysis, not action — especially at 2am.
- Flapping alerts are the fastest path to alert blindness — engineers learn to dismiss pages before reading them.
- Cause-based alerts (disk full) and symptom-based alerts (latency spike) need different urgency and routing.
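The cause/symptom split can be encoded directly in the alert rules. A hedged sketch as two Prometheus rules; the metric names, thresholds, and severity labels here are illustrative assumptions, not anything prescribed above:

```yaml
groups:
  - name: example-routing   # hypothetical group name
    rules:
      # Symptom-based: customers feel this -> page the on-call immediately.
      - alert: HighRequestLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: page        # Alertmanager routes this label to the pager
      # Cause-based: not yet customer-visible -> open a ticket, wake nobody.
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 30m
        labels:
          severity: ticket      # routed to a queue, reviewed in working hours
```

The routing decision lives in the `severity` label, so urgency is a property of the alert definition rather than of whoever happens to be on call.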
The rule
→ Before any alert ships: Who acts on it? What do they do? What's the cost of 30 minutes of inaction? If you can't answer all three, it's not ready.
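The three-question gate is small enough to enforce in review tooling. A minimal sketch, assuming a hypothetical alert-definition dict; the field names are illustrative:

```python
# The three questions, as required fields on an alert definition.
REQUIRED_FIELDS = ("owner", "runbook", "inaction_cost")

def ready_to_ship(alert: dict) -> bool:
    """An alert is shippable only if all three questions have answers."""
    return all(alert.get(field) for field in REQUIRED_FIELDS)

# Hypothetical draft alert: has an owner, but no runbook or stated cost.
draft = {"name": "latency-p99", "owner": "payments-oncall"}
ready_to_ship(draft)  # False: two of the three questions are unanswered
```

A check like this belongs in CI for the alert-rules repo, so an unanswered question blocks the merge instead of surfacing at 2am.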
How to sanity-check it
- Weekly alert review ritual: tag every page from the last week as actionable, noisy, or redundant. Kill the bottom two categories.
- Use PagerDuty/OpsGenie alert grouping and escalation policies to reduce the interrupt rate without hiding real incidents.
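A first pass of the weekly review is easy to automate. A minimal sketch; the page-export format and tagging heuristics are assumptions, not any real PagerDuty/OpsGenie API:

```python
from collections import Counter

# Hypothetical export: one record per page from last week's rotation.
pages = [
    {"alert": "cpu > 80%",         "acted_on": False, "duplicates": 0},
    {"alert": "error budget burn", "acted_on": True,  "duplicates": 0},
    {"alert": "pod restart",       "acted_on": False, "duplicates": 4},
]

def tag(page: dict) -> str:
    """Tag a page as actionable / redundant / noisy per the review ritual."""
    if page["acted_on"]:
        return "actionable"
    if page["duplicates"] > 0:
        return "redundant"
    return "noisy"

tally = Counter(tag(p) for p in pages)
# Alerts in the bottom two categories are deletion candidates for the review.
kill_list = sorted(p["alert"] for p in pages if tag(p) != "actionable")
```

The point of the script is not the heuristic, which a human should override freely, but that the review starts from data rather than from whoever complains loudest.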
The bigger picture
Reliability is a product feature. The engineers who treat it that way are the ones who get asked into the room.