How I Think

Short, opinionated pieces on SRE and platform engineering. These are the mental models I actually use, not theory, not docs summaries.

01 · 3 min read · observability, sre

Why your dashboard isn't actually observability

A Grafana dashboard with 200 panels is not observability. It's a wall of charts. The difference matters when something breaks at 2am.

Key points
  • Dashboards are great for exploration, not for production decisions
  • Production decisions need SLIs (what does 'broken' mean?) and SLOs (how broken is too broken?)
  • Most dashboards I've inherited had no defined question, just metrics someone thought were interesting
  • If your on-call team opens a dashboard during an incident and can't find the right panel in 30 seconds, the dashboard failed

Bottom line: Define the question first. Build the dashboard second. If you can't name the person who owns the alert, the dashboard isn't production-ready.
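Here's a rough sketch of what "define the question first" can look like as Prometheus recording rules. The service name, labels, and the 99.5% target are made up for illustration; the point is that the SLI and SLO exist as named, queryable things before any panel does.

```yaml
# Hypothetical Prometheus recording rules for a made-up checkout service.
groups:
  - name: checkout-slo
    rules:
      # SLI: what does 'broken' mean? Here, the ratio of non-5xx checkout requests.
      - record: sli:checkout_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="checkout", code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="checkout"}[5m]))
      # SLO: how broken is too broken? 99.5%, recorded once so dashboards and
      # alerts reference the same target instead of hardcoding it in panels.
      - record: slo:checkout_availability:target
        expr: vector(0.995)
```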

Page on user impact. Ticket on symptoms. Route by owner.
02 · 4 min read · terraform, iac

Adding a Terraform variable isn't always the right move

In every IaC review I've done, someone has added a variable to something that never changes. It feels like best practice. It isn't.

Key points
  • Variables exist for things that genuinely differ across environments or deployments
  • A module with 40 variables is effectively undocumented: which ones actually matter?
  • Hardcoded values aren't bad if the value never changes. They're honest
  • Forced variables add noise, slow down plans, and create unexpected failures
  • Modules with 3–5 well-chosen variables get reused far more than flexible-but-complex ones

Bottom line: Before adding a variable, ask: in what two situations would this be different? If you can't answer, hardcode it.
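A rough sketch of the shape I'm arguing for, using a made-up S3 log-bucket module. The variable names, defaults, and the hardcoded choice are illustrative, not a real module.

```hcl
# Hypothetical log-bucket module: a handful of variables that genuinely differ
# per deployment, everything else hardcoded on purpose.

variable "bucket_name" {
  description = "Globally unique bucket name; differs per environment."
  type        = string
}

variable "retention_days" {
  description = "Log retention; prod and staging legitimately disagree."
  type        = number
  default     = 90
}

resource "aws_s3_bucket" "logs" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id
  versioning_configuration {
    # Hardcoded: we have never wanted an unversioned log bucket. Honest, not lazy.
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id
  rule {
    id     = "expire-logs"
    status = "Enabled"
    filter {}
    expiration {
      days = var.retention_days
    }
  }
}
```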

If you can't name two scenarios where this value differs, it's not a variable.
03 · 4 min read · observability, on-call

The problem isn't alerts. It's bad alerts

There's a popular genre of engineering post: 'we deleted 90% of our alerts and everything got better'. That's missing the point.

Key points
  • Deleting bad alerts fixes the noise but doesn't fix why they were bad
  • A good alert fires on something a user actually experiences, not a system metric
  • Every alert needs: an owner, a runbook, and a reason it fires at this threshold
  • SLO-based alerting (burn rates) is the right model. Page on budget consumption, not raw numbers
  • 8 high-quality owned alerts beat 200 'just in case' ones every time

Bottom line: The goal is the right alerts, not fewer alerts. SLOs give you the framework to know which alerts actually matter.
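Roughly what an owned, SLO-based alert can look like as a Prometheus rule. The service, the 99.9% SLO, the 14.4x threshold, and the runbook URL are illustrative assumptions; a real setup would pair fast and slow burn windows rather than use a single one.

```yaml
groups:
  - name: checkout-slo-burn
    rules:
      # Page only when the error budget is burning fast enough to matter:
      # a 14.4x burn sustained over 1h exhausts a 30-day budget in ~2 days.
      # Assumes a 99.9% SLO, so the allowed error ratio is 0.001.
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{job="checkout", code!~"5.."}[1h]))
              /
              sum(rate(http_requests_total{job="checkout"}[1h]))
            )
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
          owner: payments-team        # no owner, no alert
        annotations:
          summary: "Checkout error budget burning at >14.4x"
          runbook: "https://runbooks.example.internal/checkout/slo-burn"  # hypothetical
```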

An alert without an owner is a fire alarm in an empty building.
04 · 5 min read · kubernetes, reliability

Canary deploys break when your metrics are too broad

Canary deployments look solved. The tooling (Argo Rollouts, Flagger) works well. The failure is almost always in what you measure.

Key points
  • Aggregate error rate won't catch a broken rollout that only affects EU customers, one API path, or one tenant
  • Your gate passes, you promote, and three hours later someone reports a 40% error rate on /checkout
  • The fix is slicing. Add service + version + region labels and compare canary vs baseline at that level
  • Canary windows that are too short miss things: cache warmup, scheduled jobs, traffic-shift patterns
  • When a canary fails, the most useful question is 'why didn't our gate catch this?' The answer is usually label gaps

Bottom line: Gate on your SLO slice, not on pod health. A pod being Ready is a local signal. User impact is global.
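A rough sketch of gating on the worst slice with an Argo Rollouts AnalysisTemplate backed by Prometheus. The service, label names, Prometheus address, and the 5% threshold are assumptions for illustration, not a drop-in config.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: worst-slice-error-rate
spec:
  args:
    - name: canary-version
  metrics:
    - name: canary-5xx-worst-slice
      interval: 1m
      count: 30                 # a longer window catches warmup and traffic shifts
      failureLimit: 1
      # Fail promotion if any single (region, path) slice exceeds 5% errors,
      # even when the aggregate looks healthy.
      successCondition: result[0] < 0.05
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            max(
              sum by (region, path) (
                rate(http_requests_total{job="checkout", version="{{args.canary-version}}", code=~"5.."}[5m])
              )
              /
              sum by (region, path) (
                rate(http_requests_total{job="checkout", version="{{args.canary-version}}"}[5m])
              )
            )
```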

'Pods are Ready' is not the same as 'users are happy'.
05 · 4 min read · terraform, reliability

Terraform state boundaries are your blast radius boundaries

How you split Terraform state is one of those decisions that feels like a detail until something goes wrong. Then it's everything.

Key points
  • One monolithic state for an environment means a bad apply can touch everything at once: networking, app infra, IAM
  • Splitting by resource type (all S3 in one state, all IAM in another) creates phantom coupling and circular dependencies
  • The right split: by lifecycle, ownership, and blast radius
  • Networking is slow-changing and high-impact. Isolate it
  • Per-service infra changes often and has limited blast radius. Give it its own state
  • Shared platform resources (EKS cluster, RDS) sit in the middle: team ownership, moderate risk

Bottom line: If a bad apply could take down more than one team's service, your state boundary is too wide.
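One way that split can look with S3 backends, sketched below. The bucket name, state keys, and directory layout are made up; each terraform block lives in its own root module.

```hcl
# network/backend.tf — slow-changing, high blast radius, platform-owned
terraform {
  backend "s3" {
    bucket = "acme-terraform-state"   # hypothetical bucket
    key    = "network/prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

# platform/backend.tf — shared EKS/RDS, moderate risk, platform-owned
terraform {
  backend "s3" {
    bucket = "acme-terraform-state"
    key    = "platform/prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

# services/checkout/backend.tf — per-service infra, frequent change, team-owned
terraform {
  backend "s3" {
    bucket = "acme-terraform-state"
    key    = "services/checkout/prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Cross-boundary reads go through remote state data sources, not a shared state file.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"
    key    = "network/prod/terraform.tfstate"
    region = "eu-west-1"
  }
}
```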

State boundaries should match team ownership and blast radius, not resource types.