Capacity planning is a risk budget conversation, not a utilization spreadsheet

The pattern

Cost asymmetry analysis:

Over-provision by 20%       Under-provision by 20%

Cost: +$8K/month            Cost: incident
Direct, predictable               + on-call burnout
                                  + customer churn
                                  + post-mortem
                                  + team morale tax

For P0 services, a 20% buffer almost always wins.

The insight

Teams that plan capacity by extrapolating last month's P95 get surprised when a product launch doubles traffic in a week. The right frame isn't 'what utilization should we run at' — it's 'what's the asymmetric cost of being wrong in each direction, and how much buffer does that justify?'

The non-obvious part

Teams that get capacity planning right treat it as an insurance calculation, not an optimization problem. The cost of over-provisioning is direct and visible. The cost of under-provisioning is diffuse, delayed, and almost always larger than it looks. That asymmetry should drive your buffer strategy, not your CFO's target utilization number.
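
One way to make the insurance framing concrete is to compare the buffer's monthly premium with the expected monthly loss from running without it. The sketch below is illustrative only: the $8K premium is the figure from the pattern above, while the incident probability and per-incident cost are hypothetical placeholders. It also assumes at most one capacity incident per month and that the buffer fully prevents it, which real systems will not guarantee.

```python
from dataclasses import dataclass


@dataclass
class BufferDecision:
    premium: float        # monthly cost of carrying the buffer
    expected_loss: float  # incident probability x cost per incident
    buy_buffer: bool      # True when the expected loss exceeds the premium


def evaluate_buffer(premium_per_month: float,
                    incident_probability_per_month: float,
                    cost_per_incident: float) -> BufferDecision:
    """Compare a buffer's premium with the expected loss it insures against."""
    if not 0.0 <= incident_probability_per_month <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    expected_loss = incident_probability_per_month * cost_per_incident
    return BufferDecision(
        premium=premium_per_month,
        expected_loss=expected_loss,
        buy_buffer=expected_loss > premium_per_month,
    )


# Hypothetical inputs: a 10% monthly chance of a capacity incident that
# costs $150K once churn, on-call time, and the post-mortem are counted.
decision = evaluate_buffer(8_000, 0.10, 150_000)
print(decision)
```

The per-incident cost is the input teams tend to underestimate, because the diffuse items (churn, burnout, morale) rarely appear on an invoice. Rerunning the comparison with a pessimistic estimate is a cheap sanity check.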

My rule

Set the buffer based on incident cost, not utilization targets. For every P0 service, calculate: what does one hour of downtime cost vs. one month of 20% over-provisioning? The math almost always justifies the buffer.
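
The calculation in this rule can be sketched as a break-even: how much downtime does the buffer need to prevent each month to pay for itself? This is a minimal sketch, not a costing model. The function name and the $50K/hour downtime cost are hypothetical; the $8K/month is the over-provisioning figure from the pattern above.

```python
def breakeven_downtime_hours(buffer_cost_per_month: float,
                             downtime_cost_per_hour: float) -> float:
    """Hours of avoided downtime per month at which the buffer pays for itself."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime cost per hour must be positive")
    return buffer_cost_per_month / downtime_cost_per_hour


# Hypothetical P0 service where one hour of downtime costs $50K.
hours = breakeven_downtime_hours(8_000, 50_000)
print(f"Buffer pays for itself if it prevents {hours * 60:.0f} minutes "
      f"of downtime per month")
```

With these inputs the break-even is 0.16 hours, roughly ten minutes of avoided downtime a month. If a single capacity incident on the service plausibly lasts longer than that, the buffer is the cheaper option.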

Worth reading

  • Google SRE Book: Being On-Call (ch. 11) and Handling Overload (ch. 21)
  • AWS/GCP cost anomaly detection: early signals for when your buffer is being consumed