Neeraja Khanapure
Making production systems reliable, one incident at a time ☀️
I work on cloud infrastructure, Kubernetes platforms, and streaming systems. On this site I share what I've learned, the patterns that work, and the things that quietly break at scale.

Things I've figured out
Short reads on Kubernetes, Terraform, observability, and reliability. Written from production experience, not from docs.
What I work on
Each area has its own page with real patterns, examples, and case studies.
- EKS and GKE in production
- Autoscaling, upgrades, RBAC
- Canary deploys and rollbacks
- On-call incident ownership
- Reusable modules and remote state
- CI/CD gating and drift detection
- Guardrails and policy enforcement
- Cross-team IaC patterns
- Consumer lag and DLQ patterns
- Partition strategy and retries
- Safe broker rolling restarts
- Streaming reliability debugging
- Prometheus, Grafana, OpenTelemetry
- SLO design and error budgets
- Alert hygiene and ownership
- Reducing on-call noise
- Migration and validation scripts
- API tooling for infra ops
- Toil reduction tooling
- Retry and idempotency patterns
- AWS, GCP and Azure
- HA architecture and multi-AZ
- Cost optimization
- Security and IAM controls
Have a production question?
Ask it here.
SRE Intel is an AI assistant trained on production SRE patterns. Ask about Kubernetes debugging, Terraform state issues, Kafka lag, or SLO design, and you get a specific answer, not a documentation link.
Start with --horizontal-pod-autoscaler-downscale-stabilization: the default is 5m, which is often too aggressive. Then check that CPU requests match actual steady-state usage, not peak. HPA scales against requests, not limits.
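To see why the request value matters, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), with CPU utilization measured as usage divided by the request. The function names, the 400m usage figure, and the 70% target are illustrative assumptions, and the sketch leaves out the controller's tolerance band and the stabilization window.

```python
import math


def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as the HPA sees it: usage relative to the CPU request, not the limit."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float) -> int:
    """Documented HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Hypothetical workload: each pod steadily uses about 400m CPU, HPA target is 70%.
steady_usage = 400
target = 70

# Request sized to peak (1000m): the HPA sees 40% utilization and wants fewer pods.
peak_sized = desired_replicas(4, cpu_utilization_percent(steady_usage, 1000), target)

# Request sized near steady state (500m): the HPA sees 80% and wants more pods.
steady_sized = desired_replicas(4, cpu_utilization_percent(steady_usage, 500), target)

print(peak_sized, steady_sized)
```

The same pods under the same load give opposite scaling decisions depending only on the request, which is why the request is worth checking before tuning anything else.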
Let's connect
I'm open to SRE, Platform Engineering, and DevOps roles — especially teams building something interesting on Kubernetes or multi-cloud.
If you read something here that was useful, I'd love to hear about it. And if you have a production problem I might be able to help with, reach out.