
Incident RCA without a data-backed timeline is just a story you told yourself

Most post-mortems produce lessons that don't stick. The root cause is almost always the same: the timeline was built from memory, not from data.

#sre #reliability #observability #devops

How it looks in practice

Memory-based timeline:     Data-backed timeline:

T+0  "Deploy happened"     T+0:00  Deploy (Argo event)
T+?  "Errors started"      T+0:07  Error rate +0.3% (Prometheus)
T+?  "Someone noticed"     T+0:12  P95 latency 340ms→2.1s (trace)
T+?  "We rolled back"      T+0:19  Alert fired (PD)
                           T+0:31  Rollback complete (Argo)
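
A minimal sketch of how the right-hand column could be assembled, assuming every event has already been exported with a timezone-aware UTC timestamp and a named source. The `Event` class, the `build_timeline` function, and the sample clock times are hypothetical and chosen only to mirror the example above. The offsets are rendered as hours:minutes, which is one reading of the `T+0:07` notation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    at: datetime   # timezone-aware UTC timestamp taken from the source system
    what: str      # what happened
    source: str    # where the timestamp came from (illustrative labels)


def build_timeline(events: list[Event]) -> list[str]:
    """Sort events by time and render each as a T+H:MM offset from the first."""
    ordered = sorted(events, key=lambda e: e.at)
    if not ordered:
        return []
    start = ordered[0].at
    lines = []
    for e in ordered:
        total_minutes = int((e.at - start).total_seconds() // 60)
        hours, minutes = divmod(total_minutes, 60)
        lines.append(f"T+{hours}:{minutes:02d}  {e.what} ({e.source})")
    return lines


def utc(hour: int, minute: int) -> datetime:
    """Helper for the sample data below; the date is arbitrary."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Sample events, deliberately out of order, as they might arrive from exports.
sample_events = [
    Event(utc(14, 31), "Rollback complete", "Argo"),
    Event(utc(14, 0), "Deploy", "Argo event"),
    Event(utc(14, 19), "Alert fired", "PD"),
    Event(utc(14, 7), "Error rate +0.3%", "Prometheus"),
    Event(utc(14, 12), "P95 latency 340ms→2.1s", "trace"),
]

for line in build_timeline(sample_events):
    print(line)
```

Sorting on the source timestamp rather than on the order people remember is the whole point: the sequence comes from the data, and every line names where its time came from.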

Where it breaks

Without NTP or another clock-sync mechanism, log timestamps across services can drift apart by seconds, so the order of events in your timeline is wrong before you begin.
Correlation between a deploy event and a metric spike gets missed when dashboards lack deployment markers.
Contributing factors vanish from the narrative because they're hard to prove — and the same incident repeats.
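
One way to avoid missing the deploy/spike correlation is to check it mechanically instead of relying on someone spotting it on a dashboard. The sketch below assumes deploy times and the spike time are already available as timezone-aware datetimes. The function name `preceding_deploy` and the 30-minute window are hypothetical choices, not values from any tool. A match is a candidate to investigate, not proof of cause.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def preceding_deploy(
    spike_at: datetime,
    deploys: list[datetime],
    window: timedelta = timedelta(minutes=30),  # arbitrary look-back window
) -> Optional[datetime]:
    """Return the latest deploy at or before the spike and within the window, else None."""
    candidates = [d for d in deploys if timedelta(0) <= spike_at - d <= window]
    return max(candidates, default=None)


# Illustrative data: two deploys, one spike seven minutes after the second.
deploys = [
    datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc),
]
spike = datetime(2024, 1, 1, 14, 7, tzinfo=timezone.utc)
print(preceding_deploy(spike, deploys))
```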

The rule

→ Before the RCA meeting begins, build the timeline from data only. If you can't source an event, mark it 'unverified', not assumed.
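
The rule can be enforced at the point where the timeline is written down. This is a small sketch, assuming each entry optionally names its source. `Claim` and `label` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    what: str
    source: Optional[str] = None  # e.g. "Prometheus"; None when nobody can cite one


def label(claim: Claim) -> str:
    """Render a claim with its source, or flag it as unverified when none is given."""
    if claim.source and claim.source.strip():
        return f"{claim.what} ({claim.source.strip()})"
    return f"{claim.what} [unverified]"


print(label(Claim("Alert fired", "PD")))
print(label(Claim("Someone noticed")))
```

An entry without a source stays in the timeline, visibly flagged, so the meeting can decide whether to go and find the data instead of quietly treating it as fact.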

How to sanity-check it

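Use OpenTelemetry trace IDs as the timeline spine: they cross service boundaries, and each span carries nanosecond-resolution start and end timestamps. Ordering across hosts is still only as accurate as their clock sync.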
Add Grafana annotations for every deploy, config change, and scaling event, tagged so that any dashboard with a matching annotation query shows them.
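
As an illustration of the annotation step, the sketch below builds a request for Grafana's `POST /api/annotations` HTTP endpoint, which accepts a JSON body with `time` in epoch milliseconds plus `tags` and `text`. The base URL, token, tag names, and annotation text are placeholders, and the two helper functions are hypothetical. No `dashboardUID` is set, on the assumption that dashboards pick the annotation up through a tag-based annotation query. The request is prepared but not sent.

```python
import json
import urllib.request
from datetime import datetime, timezone


def annotation_payload(at: datetime, text: str, tags: list[str]) -> dict:
    """Build the JSON body for Grafana's POST /api/annotations endpoint."""
    if at.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return {
        "time": int(at.timestamp() * 1000),  # Grafana expects epoch milliseconds
        "tags": tags,
        "text": text,
    }


def annotation_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare the HTTP request; pass it to urllib.request.urlopen to actually send it."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace with a real Grafana URL and service account token.
payload = annotation_payload(
    datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc),
    "Deploy: example-service",
    ["deploy", "example-service"],
)
request = annotation_request("https://grafana.example.com/", "PLACEHOLDER_TOKEN", payload)
print(request.full_url, payload)
```

Calling this from the deploy pipeline itself, rather than by hand afterwards, is what keeps the markers complete.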

The bigger picture

Systems that are hard to debug were designed without the debugger in mind. Build observability in, not on.

