
Incident RCA without a data-backed timeline is just a story you told yourself

Most post-mortems produce lessons that don't stick. The root cause is almost always the same: the timeline was built from memory, not from data.

#sre #reliability #observability #devops

How it looks in practice

Memory-based timeline:     Data-backed timeline:

T+0  "Deploy happened"     T+0:00  Deploy (Argo event)
T+?  "Errors started"      T+0:07  Error rate +0.3% (Prometheus)
T+?  "Someone noticed"     T+0:12  P95 latency 340ms→2.1s (trace)
T+?  "We rolled back"      T+0:19  Alert fired (PagerDuty)
                           T+0:31  Rollback complete (Argo)
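
A data-backed timeline like the right-hand column can be assembled mechanically once every event carries a timestamp and a named source. The sketch below is illustrative only: the Event class, build_timeline, and the sample timestamps are hypothetical, and it assumes the T+0:07 notation means hours:minutes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    at: datetime   # timezone-aware timestamp copied from the source system
    what: str      # short description of what happened
    source: str    # system the timestamp came from, e.g. "Argo" or "Prometheus"


def format_offset(start: datetime, at: datetime) -> str:
    """Render the gap between start and at as T+H:MM (or T-H:MM)."""
    seconds = int((at - start).total_seconds())
    sign = "-" if seconds < 0 else "+"
    hours, minutes = divmod(abs(seconds) // 60, 60)
    return f"T{sign}{hours}:{minutes:02d}"


def build_timeline(events: list[Event]) -> list[str]:
    """Sort events by time and express each one relative to the earliest."""
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e.at)
    start = ordered[0].at
    return [f"{format_offset(start, e.at)}  {e.what} ({e.source})" for e in ordered]


if __name__ == "__main__":
    # Hypothetical timestamps, deliberately listed out of order.
    def at(minute: int) -> datetime:
        return datetime(2024, 1, 1, 10, minute, tzinfo=timezone.utc)

    for line in build_timeline([
        Event(at(31), "Rollback complete", "Argo"),
        Event(at(0), "Deploy", "Argo"),
        Event(at(7), "Error rate +0.3%", "Prometheus"),
    ]):
        print(line)
```

Sorting by timestamp rather than by the order people recall things is the whole point: the input order above is scrambled, and the output is not.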

Where it breaks

  • Without clock synchronization (NTP or equivalent), host clocks drift and log timestamps across services can diverge by seconds, so the event order is wrong before you begin.
  • Correlation between a deploy event and a metric spike gets missed when dashboards lack deployment markers.
  • Contributing factors vanish from the narrative because they're hard to prove — and the same incident repeats.
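
For the clock-divergence problem in the first bullet, one rough way to measure skew between two services is to compare a client-side span with the server-side span for the same request, using the same arithmetic NTP uses. This is a sketch under stated assumptions: the function names are hypothetical, timestamps are plain seconds, and network delay is assumed to be roughly symmetric in both directions.

```python
def estimate_clock_offset(
    client_start: float,
    client_end: float,
    server_start: float,
    server_end: float,
) -> float:
    """Estimate how far the server clock runs ahead of the client clock.

    All arguments are timestamps in seconds, each read from its own host's
    clock. Assumes request and response take about the same time on the wire
    (the NTP symmetric-delay assumption). A positive result means the server
    clock is ahead.
    """
    return ((server_start - client_start) + (server_end - client_end)) / 2


def to_client_clock(server_timestamp: float, offset: float) -> float:
    """Shift a server-side timestamp onto the client's clock."""
    return server_timestamp - offset


if __name__ == "__main__":
    # Hypothetical request: the client sees 100.000 -> 100.200, while the
    # server logs the same work as 103.050 -> 103.150.
    offset = estimate_clock_offset(100.000, 100.200, 103.050, 103.150)
    print(f"server clock is ahead by about {offset:.3f}s")
    print(f"server start on client clock: {to_client_clock(103.050, offset):.3f}")
```

A single request pair gives a noisy estimate; taking the median over many pairs would be a reasonable refinement.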

The rule

Build the timeline from data only, and finish it before the RCA meeting begins. If you can't source an event, mark it 'unverified' rather than treating it as fact.

How to sanity-check it

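  • Use OpenTelemetry trace IDs as the timeline spine: they propagate across service boundaries, and parent/child span links give a causal order even when host clocks disagree. Span timestamps have nanosecond resolution but are only as accurate as each host's clock.
  • Add Grafana annotations for every deploy, config change, and scaling event. Tag them consistently so that any dashboard with a matching annotation query shows them.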
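
Deploy markers can be pushed from a pipeline step through Grafana's annotations HTTP API (POST /api/annotations). The sketch below only builds the request. The helper names, the tag values, and the grafana.example.com URL are hypothetical, and it assumes a service account token with permission to create annotations.

```python
import json
import urllib.request
from datetime import datetime, timezone


def annotation_payload(at: datetime, text: str, tags: list[str]) -> dict:
    """Build the JSON body for an organization-level Grafana annotation."""
    return {
        "time": int(at.timestamp() * 1000),  # Grafana expects epoch milliseconds
        "tags": tags,                        # dashboards filter annotations by tag
        "text": text,
    }


def annotation_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare, but do not send, the HTTP request that creates the annotation."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    payload = annotation_payload(
        datetime.now(timezone.utc),
        "Deploy checkout-service (hypothetical example)",
        ["deploy", "checkout-service"],
    )
    request = annotation_request("https://grafana.example.com", "<token>", payload)
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Because the payload carries no dashboard ID, the annotation is not tied to a single dashboard; a dashboard displays it once it has an annotation query that filters on one of these tags.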

The bigger picture

Systems that are hard to debug were designed without the debugger in mind. Build observability in, not on.
