Observability
Prometheus, Grafana, and OpenTelemetry in production. SLO design, alert hygiene, and making dashboards that actually help during incidents.
What I work on
- SLI and SLO design: defining what "broken" means for real users before building any dashboards.
- Prometheus scrape config, recording rules, and cardinality management.
- Grafana dashboard hygiene: dashboards that answer a specific question, not walls of charts.
- OTel Collector setup: the pipeline between instrumentation and your backend (see the sketch after this list).
- Alert design: SLO-based burn-rate alerting instead of raw threshold alerts.
- On-call noise reduction: ownership routing, runbook-linked alerts, and alert review processes.
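
For concreteness, here's the kind of minimal Collector config I'd start from: OTLP in from instrumented services, metrics exposed for Prometheus to scrape, traces pushed to Tempo. The endpoints and the Tempo backend are assumptions; swap in whatever your environment actually runs.

```yaml
# otel-collector.yaml -- minimal sketch: OTLP in, Prometheus + Tempo out.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:          # first in every pipeline: shed load before OOM
    check_interval: 1s
    limit_mib: 512
  batch: {}                # batch before export to cut outbound request volume

exporters:
  prometheus:              # scrape target for Prometheus (contrib distribution)
    endpoint: 0.0.0.0:8889
  otlp:
    endpoint: tempo:4317   # assumed tracing backend; replace with yours
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```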
How I think about this
Start with the SLI, not the tool
What does "the service is broken" mean for a real user? Translate that into one metric with a threshold you'd page on. Build everything else as supporting context.
Labels are a strategy decision
Cardinality is a cost. Labels should let you isolate failures by service, version, region, and tenant. user_id as a label is a cardinality bomb.
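
If a high-cardinality label ships anyway, Prometheus can drop it at scrape time. This is a guardrail, not a fix: series that differed only by the dropped label will collide within a scrape, so the real fix is upstream in the instrumentation. Job name and target here are placeholders.

```yaml
# Scrape-time guardrail: strip a high-cardinality label before ingestion.
scrape_configs:
  - job_name: checkout
    static_configs:
      - targets: ["checkout:9090"]
    metric_relabel_configs:
      - action: labeldrop   # drops any label whose name matches the regex
        regex: user_id
```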
Dashboards are for exploration
Dashboards are for exploring and debugging, not for deciding whether production is healthy; that decision belongs to SLOs. A dashboard without a defined SLI behind it is just a wall of charts that nobody trusts during an incident.
Alerts need owners
An alert without an owner is a fire alarm in an empty building. Every alert needs a runbook link and a named team responsible for it.
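
For instance, a burn-rate alert built on the SLI recording rule sketched earlier, assuming a 99.9% monthly SLO. This is a single-window simplification of the multiwindow pattern from the Google SRE Workbook, and the `team` label and `runbook_url` annotation are the hooks that routing (PagerDuty/Opsgenie) and responders key off. All names are illustrative.

```yaml
# Burn-rate alert with ownership baked in. A sustained 14.4x burn
# exhausts a 30-day error budget in ~2 days, which is page-worthy.
groups:
  - name: slo-checkout-alerts
    rules:
      - alert: CheckoutErrorBudgetBurn
        expr: |
          (1 - sli:checkout_availability:ratio_rate5m) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
          team: checkout    # hypothetical owning team; routing keys off this
        annotations:
          runbook_url: https://runbooks.example.com/checkout/error-budget-burn
          summary: "Checkout is burning error budget at >14.4x the sustainable rate"
```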
Stack I use
- Prometheus for metrics collection and alerting rules
- Grafana for dashboards and alert management UI
- OpenTelemetry Collector as the metrics/traces/logs pipeline
- Jaeger or Tempo for distributed tracing
- PagerDuty or Opsgenie for alert routing with ownership policies