Observability debt is invisible until an incident makes it expensive
The pattern
Observability debt accumulation:
Month 1: Service A metrics added (no ownership labels)
Month 3: Service B metrics added (different label schema)
Month 6: Dashboard count: 47. Useful in incident: 3.
Month 9: P0 incident. Can't isolate by customer/version.
Engineer guesses. Guesses wrong. +45min MTTR.
Fix: Define label schema FIRST. Instrument second.
The insight
You can't debug what you can't slice. Teams add dashboards for years and still can't answer the two questions that matter most in an incident: which customers are affected, and which change caused it. The problem is almost never the tool — it's the label strategy.
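To make "slicing" concrete, here is a minimal sketch of the two incident questions answered from labeled data. The sample records, version strings, tier names, and the `error_rate_by` helper are all hypothetical, invented for illustration; in practice a query against your metrics backend would do this grouping.

```python
from collections import Counter

# Hypothetical request samples: (version, customer_tier, is_error).
samples = [
    ("v1.4.0", "enterprise", False),
    ("v1.4.0", "free", False),
    ("v1.5.0", "enterprise", True),
    ("v1.5.0", "enterprise", True),
    ("v1.5.0", "free", False),
    ("v1.4.0", "enterprise", False),
]


def error_rate_by(samples, key_index):
    """Return {label_value: error_rate} for the label at position key_index."""
    totals, errors = Counter(), Counter()
    for sample in samples:
        key = sample[key_index]
        totals[key] += 1
        if sample[2]:  # is_error
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# "Which change caused it?" -> slice by version.
by_version = error_rate_by(samples, 0)
# "Which customers are affected?" -> slice by customer tier.
by_tier = error_rate_by(samples, 1)
```

Neither slice is possible if the `version` and `customer_tier` labels were never attached at instrumentation time, which is the debt the pattern above describes.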
The non-obvious part
The teams that debug incidents fastest don't have more metrics — they have metrics that answer the right questions at the right cardinality. SLI-first instrumentation design is a force multiplier. Most teams instrument first and wonder why dashboards are noisy.
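"The right cardinality" can be checked with simple arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses made-up value counts (20 services, 3 environments, and so on) and a hypothetical `series_count` helper to show how a single unbounded label changes the total.

```python
from math import prod


def series_count(label_values):
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Assumed counts of distinct values per label (illustrative only).
bounded = {"service": 20, "env": 3, "version": 5, "customer_tier": 4}
# Same schema plus a per-customer label with 10,000 distinct values.
unbounded = {**bounded, "customer_id": 10_000}

bounded_count = series_count(bounded)      # 20 * 3 * 5 * 4
unbounded_count = series_count(unbounded)  # the same product * 10,000
```

Under these assumed numbers the bounded schema tops out at 1,200 series per metric, while adding `customer_id` pushes the ceiling to 12,000,000. Slicing by tier instead of by individual customer keeps the question answerable at a manageable series count.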
My rule
→ Define your SLIs, then design labels that let you isolate by (service, env, version, customer tier) without exploding cardinality. Instrument last.
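One way to enforce the rule is to write the label schema down as data and validate every label set against it before a metric is registered. This is a sketch under assumptions: the allowed values, the `ALLOWED_LABELS` table, and the `validate_labels` function are hypothetical and not part of any metrics library.

```python
# Hypothetical label schema. None means "any value accepted" and assumes the
# values are bounded elsewhere (for example by a service registry or release list).
ALLOWED_LABELS = {
    "service": None,
    "env": {"dev", "staging", "prod"},
    "version": None,
    "customer_tier": {"free", "pro", "enterprise"},
}


def validate_labels(labels):
    """Return labels unchanged, or raise ValueError if they break the schema."""
    missing = ALLOWED_LABELS.keys() - labels.keys()
    extra = labels.keys() - ALLOWED_LABELS.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, allowed in ALLOWED_LABELS.items():
        if allowed is not None and labels[name] not in allowed:
            raise ValueError(f"{name}={labels[name]!r} not in {sorted(allowed)}")
    return labels


ok = validate_labels(
    {"service": "checkout", "env": "prod", "version": "v1.5.0", "customer_tier": "pro"}
)
```

A label set that adds `customer_id`, omits `version`, or uses an unknown tier is rejected at registration time, so Service A and Service B cannot drift into different schemas.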
Worth reading
- Brendan Gregg's USE Method and Tom Wilkie's RED Method for SLI-first design
- Prometheus label best practices: cardinality anti-patterns (prometheus.io/docs)