Observability

Prometheus, Grafana, and OpenTelemetry in production. SLO design, alert hygiene, and making dashboards that actually help during incidents.

What I work on

  • SLI and SLO design: defining what "broken" means for real users before building any dashboards.
  • Prometheus scrape config, recording rules, and cardinality management.
  • Grafana dashboard hygiene: dashboards that answer a specific question, not walls of charts.
  • OTel Collector setup: the pipeline between instrumentation and your backend.
  • Alert design: SLO-based burn rate alerting instead of raw threshold alerts (see the sketch just after this list).
  • On-call noise reduction: ownership routing, runbook-linked alerts, and alert review processes.
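
To make the burn-rate point concrete, here is a minimal sketch of a multi-window burn-rate rule. Everything specific is an assumption for illustration: the http_requests_total metric, the 99.9% availability SLO over 30 days, and the runbook URL.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Page when the error budget burns 14.4x faster than sustainable:
      # at that rate, one hour consumes ~2% of a 30-day budget.
      # Requiring both a 1h and a 5m window to breach keeps the alert
      # from firing on old spikes and lets it clear quickly on recovery.
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Burning >14x of the 30-day error budget"
          runbook_url: https://example.internal/runbooks/slo-burn  # placeholder
```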

How I think about this

Start with the SLI, not the tool
What does "the service is broken" mean for a real user? Translate that into one metric with a threshold you'd page on. Build everything else as supporting context.
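
A sketch of what that translation can look like as Prometheus recording rules. The metric names and the definition of "good" (non-5xx, under 500ms) are assumptions, not a universal SLI:

```yaml
groups:
  - name: sli
    rules:
      # Availability SLI: fraction of requests that did not fail.
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      # Latency SLI: fraction of requests served within 500ms.
      - record: sli:http_latency_good:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
            / sum(rate(http_request_duration_seconds_count[5m]))
```
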
Labels are a strategy decision
Cardinality is a cost. Labels should let you isolate failures by service, version, region, and tenant. user_id as a label is a cardinality bomb.
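
One way to enforce that at scrape time: metric_relabel_configs can drop an offending label before it ever reaches storage (the job name, target, and user_id label here are hypothetical):

```yaml
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api:9090"]  # hypothetical target
    metric_relabel_configs:
      # Drop the per-user label before ingestion: it would turn one
      # series per endpoint into one series per endpoint per user.
      - action: labeldrop
        regex: user_id
```
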
Dashboards are for exploration
Production decisions need SLOs, not dashboards. A dashboard without a defined SLI is just a wall of charts that nobody trusts during an incident.
Alerts need owners
An alert without an owner is a fire alarm in an empty building. Every alert needs a runbook link and a named team responsible for it.
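
A sketch of that wired into the rule itself, reusing the availability recording rule sketched above. The team label, severity value, and runbook URL are placeholder conventions, and the routing side (PagerDuty/Opsgenie matching on team) is assumed:

```yaml
groups:
  - name: checkout-ownership
    rules:
      - alert: CheckoutAvailabilityBelowSLO
        # In practice the expression would be a burn-rate condition like
        # the one above; the point here is the ownership metadata.
        expr: sli:http_availability:ratio_rate5m < 0.999
        for: 5m
        labels:
          severity: page
          team: checkout  # Alertmanager routes on this label
        annotations:
          summary: "Checkout availability below SLO"
          runbook_url: https://example.internal/runbooks/checkout-slo  # placeholder
```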

Stack I use

  • Prometheus for metrics collection and alerting rules
  • Grafana for dashboards and alert management UI
  • OpenTelemetry Collector as the metrics/traces/logs pipeline (sample config after this list)
  • Jaeger or Tempo for distributed tracing
  • PagerDuty or Opsgenie for alert routing with ownership policies
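
To show how the Collector slots in, here is a minimal config. The endpoints and the Tempo backend are placeholder assumptions, and the prometheus exporter assumes a Collector build that includes it (it ships in the contrib distribution):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # apps send OTLP here
processors:
  batch: {}  # batch before export to cut outbound request volume
exporters:
  otlp:
    endpoint: tempo:4317  # hypothetical Tempo endpoint
    tls:
      insecure: true  # acceptable inside a trusted cluster, not across networks
  prometheus:
    endpoint: 0.0.0.0:8889  # Prometheus scrapes the Collector here
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```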