Observability

Prometheus, Grafana, and OpenTelemetry in production. SLO design, alert hygiene, and making dashboards that actually help during incidents.

What I work on

  • SLI and SLO design: defining what "broken" means for real users before building any dashboards.
  • Prometheus scrape config, recording rules, and cardinality management.
  • Grafana dashboard hygiene: dashboards that answer a specific question, not walls of charts.
  • OTel Collector setup: the pipeline between instrumentation and your backend.
  • Alert design: SLO-based burn rate alerting instead of raw threshold alerts (see the sketch just after this list).
  • On-call noise reduction: ownership routing, runbook-linked alerts, and alert review processes.
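
To make the burn-rate point concrete, here is a minimal sketch of a multi-window burn-rate rule. Everything specific is an assumption for illustration: the http_requests_total metric, the 99.9% availability SLO over 30 days, and the runbook URL.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Page when the error budget burns 14.4x faster than sustainable:
      # at that rate, one hour consumes ~2% of a 30-day budget.
      # Requiring both a 1h and a 5m window to breach keeps the alert
      # from firing on old spikes and lets it clear quickly on recovery.
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Burning >14x of the 30-day error budget"
          runbook_url: https://example.internal/runbooks/slo-burn  # placeholder
```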

How I think about this

Start with the SLI, not the tool
What does "the service is broken" mean for a real user? Translate that into one metric with a threshold you'd page on. Build everything else as supporting context.
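
A sketch of what that translation can look like as Prometheus recording rules. The metric names and the definition of "good" (non-5xx, under 500ms) are assumptions, not a universal SLI:

```yaml
groups:
  - name: sli
    rules:
      # Availability SLI: fraction of requests that did not fail.
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      # Latency SLI: fraction of requests served within 500ms.
      - record: sli:http_latency_good:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
            / sum(rate(http_request_duration_seconds_count[5m]))
```
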
Labels are a strategy decision
Cardinality is a cost. Labels should let you isolate failures by service, version, region, and tenant. user_id as a label is a cardinality bomb.
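
One way to enforce that at scrape time: metric_relabel_configs can drop an offending label before it ever reaches storage (the job name, target, and user_id label here are hypothetical):

```yaml
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api:9090"]  # hypothetical target
    metric_relabel_configs:
      # Drop the per-user label before ingestion: it would turn one
      # series per endpoint into one series per endpoint per user.
      - action: labeldrop
        regex: user_id
```
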
Dashboards are for exploration
Production decisions need SLOs, not dashboards. A dashboard without a defined SLI is just a wall of charts that nobody trusts during an incident.
Alerts need owners
An alert without an owner is a fire alarm in an empty building. Every alert needs a runbook link and a named team responsible for it.
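
A sketch of that wired into the rule itself, reusing the availability recording rule sketched above. The team label, severity value, and runbook URL are placeholder conventions, and the routing side (PagerDuty/Opsgenie matching on team) is assumed:

```yaml
groups:
  - name: checkout-ownership
    rules:
      - alert: CheckoutAvailabilityBelowSLO
        # In practice the expression would be a burn-rate condition like
        # the one above; the point here is the ownership metadata.
        expr: sli:http_availability:ratio_rate5m < 0.999
        for: 5m
        labels:
          severity: page
          team: checkout  # Alertmanager routes on this label
        annotations:
          summary: "Checkout availability below SLO"
          runbook_url: https://example.internal/runbooks/checkout-slo  # placeholder
```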

Stack I use

  • Prometheus for metrics collection and alerting rules
  • Grafana for dashboards and alert management UI
  • OpenTelemetry Collector as the metrics/traces/logs pipeline (sample config after this list)
  • Jaeger or Tempo for distributed tracing
  • PagerDuty or Opsgenie for alert routing with ownership policies
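
To show how the Collector slots in, here is a minimal config. The endpoints and the Tempo backend are placeholder assumptions, and the prometheus exporter assumes a Collector build that includes it (it ships in the contrib distribution):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # apps send OTLP here
processors:
  batch: {}  # batch before export to cut outbound request volume
exporters:
  otlp:
    endpoint: tempo:4317  # hypothetical Tempo endpoint
    tls:
      insecure: true  # acceptable inside a trusted cluster, not across networks
  prometheus:
    endpoint: 0.0.0.0:8889  # Prometheus scrapes the Collector here
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```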