Distributed tracing: the gap between having it and using it in incidents
The pattern
Current state (most orgs):            Target state:

Incident fires                        Incident fires
      │                                     │
Grep logs ──▶ Guess service           Pull trace ID from alert
      │                                     │
More grep ──▶ Find error              Trace shows full request path
      │                                     │
Escalate ──▶ More engineers           Latency waterfall identifies
      │                               bottleneck in 3 minutes
MTTR: 90 min                          MTTR: 15 min

The insight
Most orgs instrument distributed traces correctly and then debug incidents with grep. The investment in tracing pays off only when your debugging workflow changes — when you start from a trace ID instead of a log query. That's a culture change, not a tooling change.
The non-obvious part
Traces don't reduce MTTR on their own — runbooks that start from trace IDs do. The highest-leverage thing you can do after instrumenting is to rewrite your top 5 incident runbooks to start with 'get the trace ID from the alert, open it in Jaeger/Tempo, find the slowest span.' Engineers follow runbooks under pressure.
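The "find the slowest span" step in such a runbook can be automated. A minimal sketch: given a trace fetched from Jaeger's query API (`GET /api/traces/{trace-id}`, where span durations are reported in microseconds), picking out the bottleneck is a few lines. The payload below is a stripped-down assumption for illustration, not a full Jaeger response.

```python
def slowest_span(trace_json):
    """Return (operation, duration_ms) for the slowest span in a
    Jaeger-style trace payload. Durations arrive in microseconds."""
    spans = trace_json["data"][0]["spans"]
    worst = max(spans, key=lambda s: s["duration"])
    return worst["operationName"], worst["duration"] / 1000

# Minimal illustrative payload (real responses carry many more fields).
trace = {"data": [{"spans": [
    {"operationName": "HTTP GET /checkout", "duration": 412_000},
    {"operationName": "SELECT orders",      "duration": 2_310_000},
    {"operationName": "POST /payments",     "duration": 95_000},
]}]}

print(slowest_span(trace))  # ('SELECT orders', 2310.0)
```

Even this tiny helper changes the runbook's first step from "search logs" to "name the bottleneck", which is where the hypothesis comes from.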
My rule
→ Instrument your 3 highest-traffic endpoints first. Then rewrite one runbook to start from a trace ID. Measure incident time-to-hypothesis before and after.
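The before/after measurement need not be elaborate: a median over incident timestamps is enough to see whether the runbook change moved the needle. A minimal sketch, assuming incident records with a paged-at and a first-hypothesis timestamp (the field names and sample data are hypothetical):

```python
from datetime import datetime
from statistics import median

def time_to_hypothesis_minutes(incidents):
    """Median minutes from the page to the first written hypothesis."""
    deltas = [
        (datetime.fromisoformat(i["hypothesis_at"])
         - datetime.fromisoformat(i["paged_at"])).total_seconds() / 60
        for i in incidents
    ]
    return median(deltas)

before = [  # incidents debugged via log grep
    {"paged_at": "2024-03-01T10:00:00", "hypothesis_at": "2024-03-01T10:45:00"},
    {"paged_at": "2024-03-08T02:10:00", "hypothesis_at": "2024-03-08T03:05:00"},
]
after = [  # incidents debugged from the trace ID in the alert
    {"paged_at": "2024-04-02T14:00:00", "hypothesis_at": "2024-04-02T14:08:00"},
    {"paged_at": "2024-04-11T09:30:00", "hypothesis_at": "2024-04-11T09:42:00"},
]
print(time_to_hypothesis_minutes(before), time_to_hypothesis_minutes(after))
# 50.0 10.0
```

Median rather than mean keeps one marathon incident from drowning out the signal.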
Worth reading
- OpenTelemetry instrumentation guides — language SDKs (opentelemetry.io/docs)
- Grafana Tempo + Loki correlation — trace-to-log workflow without leaving the dashboard