Incident RCA without a data-backed timeline is just a story you told yourself
Most post-mortems produce lessons that don't stick. The root cause is almost always the same: the timeline was built from memory, not from data.
#sre #reliability #observability #devops
How it looks in practice
Memory-based timeline:        Data-backed timeline:
T+0  "Deploy happened"        T+0:00  Deploy (Argo event)
T+?  "Errors started"         T+0:07  Error rate +0.3% (Prometheus)
T+?  "Someone noticed"        T+0:12  P95 latency 340ms→2.1s (trace)
T+?  "We rolled back"         T+0:19  Alert fired (PD)
                              T+0:31  Rollback complete (Argo)
Where it breaks
- Log timestamps across services diverge by seconds without NTP — your timeline is wrong before you begin.
- Correlation between a deploy event and a metric spike gets missed when dashboards lack deployment markers.
- Contributing factors vanish from the narrative because they're hard to prove — and the same incident repeats.
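One way to stop the deploy-to-spike correlation from being missed is to compute it instead of eyeballing it. The sketch below is a minimal illustration, not a specific tool's API: `first_breach_after`, the sample values, and the threshold are all hypothetical, and the samples are assumed to come from a single clock-synced source such as a metrics store.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp, error rate in percent)


def first_breach_after(
    deploy_at: datetime,
    samples: Iterable[Sample],
    baseline: float,
    delta: float,
) -> Optional[Sample]:
    """Return the first sample at or after the deploy whose value exceeds
    baseline + delta, or None if no sample does."""
    for ts, value in sorted(samples, key=lambda s: s[0]):
        if ts >= deploy_at and value > baseline + delta:
            return ts, value
    return None


# Hypothetical data: a deploy at 12:00 UTC and a few error-rate samples.
deploy_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
samples = [
    (deploy_at + timedelta(minutes=m), v)
    for m, v in [(-5, 0.20), (0, 0.21), (7, 0.50), (12, 0.90)]
]

breach = first_breach_after(deploy_at, samples, baseline=0.20, delta=0.25)
if breach is not None:
    print("first breach after deploy:", breach[0] - deploy_at, "value:", breach[1])
```

The offset is only as trustworthy as the clocks behind the two timestamps, which is why the skew problem in the first bullet has to be solved first.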
The rule
→ Build the timeline from data only, and finish it before the RCA meeting begins. If you can't source an event, mark it 'unverified' rather than treating it as fact.
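The rule can be enforced in whatever script assembles the timeline. The following is a rough sketch under stated assumptions: the `Event` type, `build_timeline`, and the sample events are hypothetical, and offsets are rendered as `T+H:MM`, which is one reading of the format in the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    at: datetime
    what: str
    source: Optional[str] = None  # None: nobody could point to data for this event


def format_offset(delta: timedelta) -> str:
    """Render an offset from the first event as T+H:MM (assumed format)."""
    minutes = int(delta.total_seconds()) // 60
    return f"T+{minutes // 60}:{minutes % 60:02d}"


def build_timeline(events: List[Event]) -> List[str]:
    """Sort events by time and label every unsourced one as 'unverified'."""
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e.at)
    start = ordered[0].at
    return [
        f"{format_offset(e.at - start)} {e.what} ({e.source or 'unverified'})"
        for e in ordered
    ]


# Hypothetical events; the unsourced one carries a remembered, approximate time.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
timeline = build_timeline([
    Event(t0 + timedelta(minutes=19), "Alert fired", "PD"),
    Event(t0, "Deploy", "Argo event"),
    Event(t0 + timedelta(minutes=15), "Someone noticed"),
])
for line in timeline:
    print(line)
```

Making `source` an explicit field means an event without evidence cannot slip into the timeline looking like the others.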
How to sanity-check it
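- Use OpenTelemetry trace IDs as the timeline spine: the same ID is propagated across service boundaries, and span timestamps are recorded at nanosecond resolution. Their accuracy still depends on each host's clock being in sync.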
- Add Grafana annotations for every deploy, config change, and scaling event, tagged so that any dashboard with a matching annotation query shows them.
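Deploy annotations are easiest to keep complete when the pipeline writes them. Below is a minimal sketch that builds a request for Grafana's HTTP annotations endpoint (`POST /api/annotations`, with `time` in epoch milliseconds). The helper name, URL, token, and tag values are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Iterable


def deploy_annotation_request(
    grafana_url: str,
    api_token: str,
    text: str,
    tags: Iterable[str],
    at: datetime,
) -> urllib.request.Request:
    """Build (but do not send) a request that creates an annotation not tied
    to a single dashboard, so it can be queried by tag."""
    payload = {
        "time": int(at.timestamp() * 1000),  # Grafana expects epoch milliseconds
        "tags": list(tags),
        "text": text,
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; a real pipeline would inject these from its own config.
req = deploy_annotation_request(
    "https://grafana.example.com/",
    "PLACEHOLDER_TOKEN",
    "Deploy checkout-service",
    ["deploy", "checkout-service"],
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
# To send it: urllib.request.urlopen(req)
```

Calling this from the deploy job itself, rather than by hand, is what keeps the markers complete enough to trust during an RCA.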
The bigger picture
Systems that are hard to debug were designed without the debugger in mind. Build observability in, not on.