SRE · Platform Engineering · DevOps

Neeraja Khanapure

Making production systems reliable, one incident at a time ☀️

I work on cloud infrastructure, Kubernetes platforms, and streaming systems. On this site I share what I've learned, the patterns that work, and the things that quietly break at scale.

Kubernetes · Terraform · Kafka · AWS · GCP · Azure · Python · Prometheus · Grafana · OpenTelemetry
Neeraja Khanapure
SRE Intel ✦ Live

Have a production question?
Ask it here.

SRE Intel is an AI assistant trained on production SRE patterns. Ask about Kubernetes debugging, Terraform state issues, Kafka lag, or SLO design, and you get a specific answer, not a documentation link.

How do I fix HPA thrashing?
Safe K8s upgrade checklist?
Kafka consumer lag, where to start?
How to design an SLO?
Open SRE Intel →
How do I fix HPA thrashing on EKS?
Usually mis-sized CPU requests or the downscale window being too short.

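Start with the downscale stabilization window. The default is 5m (the kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization), which is often too aggressive for spiky workloads. On EKS the control plane is managed and that flag can't be changed, so set behavior.scaleDown.stabilizationWindowSeconds on the HPA itself. Then check that CPU requests match actual steady-state usage, not peak. HPA computes utilization against requests, not limits.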
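
To see why request sizing matters, here is a minimal Python sketch of the scaling rule from the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current utilization / target utilization), skipped when the ratio is within a 10% tolerance. The function names and the example numbers (300m steady-state usage, a 70% target, 4 replicas) are assumptions for illustration, and the real controller also averages across pods and handles missing metrics and unready pods.

```python
import math


def cpu_utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as the HPA sees it: usage relative to the request, not the limit."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int, current_pct: float,
                     target_pct: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale by the utilization ratio unless within tolerance."""
    ratio = current_pct / target_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, no scaling
    return math.ceil(current_replicas * ratio)


# Hypothetical workload: each pod steadily uses 300m CPU, HPA target is 70%.
undersized = cpu_utilization_pct(usage_millicores=300, request_millicores=100)
right_sized = cpu_utilization_pct(usage_millicores=300, request_millicores=400)

print(desired_replicas(4, undersized, target_pct=70))   # 18
print(desired_replicas(4, right_sized, target_pct=70))  # 4
```

With a 100m request the same steady load reads as 300% utilization and the HPA jumps from 4 to 18 replicas. With a 400m request it reads as 75%, which is inside the tolerance band, so nothing moves.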

Let's connect

I'm open to SRE, Platform Engineering, and DevOps roles — especially teams building something interesting on Kubernetes or multi-cloud.

If you read something here that was useful, I'd love to hear about it. And if you have a production problem I might be able to help with, reach out.

Open to new roles