Distributed tracing: the gap between having it and using it in incidents
The pattern
Current state (most orgs):            Target state:

Incident fires                        Incident fires
      │                                     │
Grep logs ──▶ Guess service           Pull trace ID from alert
      │                                     │
More grep ──▶ Find error              Trace shows full request path
      │                                     │
Escalate ──▶ More engineers           Latency waterfall identifies
      │                               bottleneck in 3 minutes
MTTR: 90 min                          MTTR: 15 min

The insight
Most orgs instrument distributed traces correctly and then debug incidents with grep. The investment in tracing pays off only when your debugging workflow changes — when you start from a trace ID instead of a log query. That's a culture change, not a tooling change.
The non-obvious part
Traces don't reduce MTTR on their own — runbooks that start from trace IDs do. The highest-leverage thing you can do after instrumenting is to rewrite your top 5 incident runbooks to start with 'get the trace ID from the alert, open it in Jaeger/Tempo, find the slowest span.' Engineers follow runbooks under pressure.
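The "find the slowest span" step in such a runbook can be automated. A minimal sketch: given a trace fetched from Jaeger's query API (`GET /api/traces/{trace-id}`, where span durations are reported in microseconds), picking out the bottleneck is a few lines. The payload below is a stripped-down assumption for illustration, not a full Jaeger response.

```python
def slowest_span(trace_json):
    """Return (operation, duration_ms) for the slowest span in a
    Jaeger-style trace payload. Durations arrive in microseconds."""
    spans = trace_json["data"][0]["spans"]
    worst = max(spans, key=lambda s: s["duration"])
    return worst["operationName"], worst["duration"] / 1000

# Minimal illustrative payload (real responses carry many more fields).
trace = {"data": [{"spans": [
    {"operationName": "HTTP GET /checkout", "duration": 412_000},
    {"operationName": "SELECT orders",      "duration": 2_310_000},
    {"operationName": "POST /payments",     "duration": 95_000},
]}]}

print(slowest_span(trace))  # ('SELECT orders', 2310.0)
```

Even this tiny helper changes the runbook's first step from "search logs" to "name the bottleneck", which is where the hypothesis comes from.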
My rule
→ Instrument your 3 highest-traffic endpoints first. Then rewrite one runbook to start from a trace ID. Measure incident time-to-hypothesis before and after.
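The before/after measurement need not be elaborate: a median over incident timestamps is enough to see whether the runbook change moved the needle. A minimal sketch, assuming incident records with a paged-at and a first-hypothesis timestamp (the field names and sample data are hypothetical):

```python
from datetime import datetime
from statistics import median

def time_to_hypothesis_minutes(incidents):
    """Median minutes from the page to the first written hypothesis."""
    deltas = [
        (datetime.fromisoformat(i["hypothesis_at"])
         - datetime.fromisoformat(i["paged_at"])).total_seconds() / 60
        for i in incidents
    ]
    return median(deltas)

before = [  # incidents debugged via log grep
    {"paged_at": "2024-03-01T10:00:00", "hypothesis_at": "2024-03-01T10:45:00"},
    {"paged_at": "2024-03-08T02:10:00", "hypothesis_at": "2024-03-08T03:05:00"},
]
after = [  # incidents debugged from the trace ID in the alert
    {"paged_at": "2024-04-02T14:00:00", "hypothesis_at": "2024-04-02T14:08:00"},
    {"paged_at": "2024-04-11T09:30:00", "hypothesis_at": "2024-04-11T09:42:00"},
]
print(time_to_hypothesis_minutes(before), time_to_hypothesis_minutes(after))
# 50.0 10.0
```

Median rather than mean keeps one marathon incident from drowning out the signal.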
Worth reading
- OpenTelemetry instrumentation guides — language SDKs (opentelemetry.io/docs)
- Grafana Tempo + Loki correlation — trace-to-log workflow without leaving the dashboard