Observability debt is invisible until an incident makes it expensive

You can't debug what you can't slice. Teams add dashboards for years and still can't answer the two questions that matter most in an incident: which customers are affected, and which change caused it.…

The pattern

Observability debt accumulation:

Month 1:  Service A metrics added (no ownership labels)
Month 3:  Service B metrics added (different label schema)
Month 6:  Dashboard count: 47. Useful in incident: 3.
Month 9:  P0 incident. Can't isolate by customer/version.
          Engineer guesses. Guesses wrong. +45min MTTR.

Fix: Define label schema FIRST. Instrument second.
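One way to make "schema first" concrete is to write the schema down as data and check every new metric against it before it ships. The sketch below is a minimal illustration, not a real library API: the label names follow the ones used in this post, while REQUIRED_LABELS, ALLOWED_VALUES, the value sets, and validate_labels are all hypothetical.

```python
# Hypothetical shared schema: every service uses the same label names.
REQUIRED_LABELS = ("service", "env", "version", "customer_tier")

# Labels with a small, closed set of values (assumed values, for illustration).
ALLOWED_VALUES = {
    "env": {"dev", "staging", "prod"},
    "customer_tier": {"free", "pro", "enterprise"},
}


def validate_labels(labels: dict[str, str]) -> list[str]:
    """Return schema violations; an empty list means the labels conform."""
    problems = []
    # Every required label must be present.
    for name in REQUIRED_LABELS:
        if name not in labels:
            problems.append(f"missing label: {name}")
    # No ad hoc label names (e.g. "environment" instead of "env").
    for name in labels:
        if name not in REQUIRED_LABELS:
            problems.append(f"unknown label: {name}")
    # Closed-set labels must use one of the agreed values.
    for name, allowed in ALLOWED_VALUES.items():
        if name in labels and labels[name] not in allowed:
            problems.append(f"bad value for {name}: {labels[name]}")
    return problems
```

Run as a CI check or a code-review helper, something like this would catch the Month 3 problem (Service B inventing its own label schema) before it reaches a dashboard.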

The non-obvious part

The teams that debug incidents fastest don't have more metrics. They have metrics that answer the right questions at the right cardinality. SLI-first instrumentation design is a force multiplier because every metric and label exists to answer a question someone will ask during an incident. Most teams instrument first and then wonder why their dashboards are noisy.
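To show what "answering the right questions" looks like, here is a small sketch that slices an error rate by a single label. It assumes request and error counts are already tagged with version and customer_tier; the SAMPLES data is invented for illustration and error_rate_by is a hypothetical helper, not part of any metrics library.

```python
from collections import defaultdict

# Invented counters for illustration: one entry per label combination.
SAMPLES = [
    {"labels": {"version": "1.4.1", "customer_tier": "free"}, "requests": 1000, "errors": 2},
    {"labels": {"version": "1.4.1", "customer_tier": "enterprise"}, "requests": 500, "errors": 1},
    {"labels": {"version": "1.4.2", "customer_tier": "free"}, "requests": 1000, "errors": 3},
    {"labels": {"version": "1.4.2", "customer_tier": "enterprise"}, "requests": 500, "errors": 120},
]


def error_rate_by(samples, label):
    """Aggregate errors / requests grouped by the value of one label."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for sample in samples:
        key = sample["labels"][label]
        totals[key] += sample["requests"]
        errors[key] += sample["errors"]
    # Skip groups with no traffic to avoid dividing by zero.
    return {key: errors[key] / totals[key] for key in totals if totals[key]}


by_version = error_rate_by(SAMPLES, "version")      # which change caused it?
by_tier = error_rate_by(SAMPLES, "customer_tier")   # which customers are affected?
```

With this made-up data, slicing by version points at 1.4.2 and slicing by customer_tier points at enterprise. Without those two labels the same counters would only show a modest overall error rate and leave the engineer guessing.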

My rule

→ Define your SLIs, then design labels that let you isolate by (service, env, version, customer tier) without exploding cardinality. Instrument last.
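A quick back-of-the-envelope check shows why customer tier is in that rule and customer ID is not. The upper bound on time series for one metric is the product of the number of distinct values each label can take. The sketch below assumes illustrative counts (20 services, 3 environments, 5 live versions, 3 tiers, 10,000 customers) and a hypothetical per-metric budget; none of these numbers come from a real system.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Assumed value counts, for illustration only.
TIERED = {"service": 20, "env": 3, "version": 5, "customer_tier": 3}
PER_CUSTOMER = {"service": 20, "env": 3, "version": 5, "customer_id": 10_000}

SERIES_BUDGET = 10_000  # hypothetical per-metric budget

tiered_series = worst_case_series(TIERED)              # 20 * 3 * 5 * 3
per_customer_series = worst_case_series(PER_CUSTOMER)  # 20 * 3 * 5 * 10_000

within_budget = tiered_series <= SERIES_BUDGET
over_budget = per_customer_series > SERIES_BUDGET
```

Under these assumptions the tiered schema tops out at 900 series per metric, while swapping in a per-customer label raises the bound to 3,000,000. Real counts are usually lower than the bound because not every combination occurs, but the bound is what to plan against.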

Worth reading

▸ Brendan Gregg's USE Method and Tom Wilkie's RED Method for SLI-first design
▸ Prometheus label best practices — cardinality anti-patterns (prometheus.io/docs)
