Insights

Runbook quality decays silently — and that decay kills MTTR

Runbooks that haven't been run recently are wrong. Not outdated — wrong. The service changed. The tool was deprecated. The endpoint moved. Nobody updated the doc because nobody reads it until 3am. And…

The pattern

Runbook decay curve:

Quality
  │▓▓▓▓▓▓▓▓▓▓
  │         ▓▓▓▓▓
  │              ▓▓▓▓
  │                  ▓▓▓▓▓
  │                       ▓▓▓░░░░░░░
  │                             ░░░░░░░░ ← "last validated 8 months ago"
  └────────────────────────────────────▶
  Write   Month 1  Month 3  Month 6  Month 9

The insight

Runbooks that haven't been run recently are wrong. Not outdated — wrong. The service changed. The tool was deprecated. The endpoint moved. Nobody updated the doc because nobody reads it until 3am. And at 3am, a wrong runbook is worse than no runbook — it sends engineers down confident paths that dead-end.

The non-obvious part

The highest-leverage runbook improvement isn't better writing — it's a validation date and a quarterly review reminder. A runbook with 'last validated: 2 weeks ago' that's 70% accurate is worth more than a beautifully written one from 8 months ago that's 40% accurate.

My rule

Every runbook gets a 'last validated' date. Anything older than 3 months is assumed broken until proven otherwise. Review is part of the on-call rotation, not optional.

Worth reading

  • PagerDuty Incident Response guide — runbook standards and validation cadence
  • Post-incident review template: 'Did the runbook help, mislead, or was it missing?' (standard question)

Route: /insights/runbook-quality-decays-silently-and-that-decay-kills-mttr