SRE · Platform Engineering · DevOps

Neeraja Khanapure

Making production systems reliable, one incident at a time ☀️

I work on cloud infrastructure, Kubernetes platforms, and streaming systems. On this site I share what I've learned, the patterns that work, and the things that quietly break at scale.

Read the thinking pieces · Ask SRE Intel ✦

Kubernetes · Terraform · Kafka · AWS · GCP · Azure · Python · Prometheus · Grafana · OpenTelemetry

Writing

Things I've figured out

Short reads on Kubernetes, Terraform, observability, and reliability. Written from production experience, not from docs.

Thinking

Why your dashboard isn't actually observability

A beautiful Grafana board with 200 panels is not the same as knowing if your service is healthy. Here's the difference.

observability

Thinking

Good alerts vs. bad alerts: they're not the same thing

Deleting 90% of your alerts doesn't fix the problem. Ownership and SLOs do.

on-call

Workflows

Kubernetes rollouts: don't trust "pods are Ready"

A pod can be Ready and still be causing a 40% error rate. Here's how to actually gate a rollout.

kubernetes

Workflows

Terraform at scale breaks in predictable ways

Wide dependency graphs, surprise destroys, overused depends_on, and how to avoid all of them.

terraform

Insights

Observability is a labeling problem, not a tooling problem

You can have Prometheus, Grafana, and OTel all running and still not know which customer is broken.

prometheus

Insights

AIOps should help you think faster, not replace thinking

Auto-remediation without guardrails is just a faster way to make things worse at 2am.

aiops

📝 How I Think: 5 pieces on SRE tradeoffs
🔧 Workflows: K8s, Terraform, MLOps, CI/CD
💡 Insights: Observability, CI/CD, AIOps
🔖 Weekly Picks: Curated SRE reads

Skills

What I work on

Each area has its own page with real patterns, examples, and case studies.

⚙️

Kubernetes

EKS and GKE in production
Autoscaling, upgrades, RBAC
Canary deploys and rollbacks
On-call incident ownership

🏗️

Terraform

Reusable modules and remote state
CI/CD gating and drift detection
Guardrails and policy enforcement
Cross-team IaC patterns

📨

Kafka

Consumer lag and DLQ patterns
Partition strategy and retries
Safe broker rolling restarts
Streaming reliability debugging

📊

Observability

Prometheus, Grafana, OpenTelemetry
SLO design and error budgets
Alert hygiene and ownership
Reducing on-call noise

🐍

Python Automation

Migration and validation scripts
API tooling for infra ops
Toil reduction tooling
Retry and idempotency patterns

☁️

Cloud

AWS, GCP and Azure
HA architecture and multi-AZ
Cost optimization
Security and IAM controls

SRE Intel ✦ Live

Have a production question?
Ask it here.

SRE Intel is an AI assistant trained on production SRE patterns. Ask about Kubernetes debugging, Terraform state issues, Kafka lag, or SLO design, and you get a specific answer, not a documentation link.

How do I fix HPA thrashing?
Safe K8s upgrade checklist?
Kafka consumer lag, where to start?
How to design an SLO?

Open SRE Intel →

How do I fix HPA thrashing on EKS?

Usually mis-sized CPU requests or the downscale window being too short.

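Start with the downscale stabilization window. The default is 5m (the --horizontal-pod-autoscaler-downscale-stabilization flag on kube-controller-manager), which is often too short for spiky workloads. On EKS the control plane is managed and that flag is not yours to change, so set the window per HPA with behavior.scaleDown.stabilizationWindowSeconds instead. Then check that CPU requests match actual steady-state usage, not peak. HPA computes utilization against requests, not limits.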
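
To make the answer above concrete, here is a rough sketch of why a short downscale window thrashes. It assumes the replica formula from the Kubernetes HPA documentation (desired = ceil(current × currentUtilization / targetUtilization), with a 10% tolerance) and models stabilization as "scale down only to the highest recommendation seen in the window". The load numbers, the tick-based window, and all function names are illustrative assumptions, not output from a real cluster.

```python
import math
from collections import deque


def utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """CPU utilization as HPA sees it: usage relative to the request, not the limit."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current: int, current_util: float, target_util: float,
                     tolerance: float = 0.1) -> int:
    """Replica recommendation; no change while the ratio is within the tolerance."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current
    return max(1, math.ceil(current * ratio))


def simulate(total_load, request_millicores, target_util, window_ticks, start_replicas):
    """Replay total CPU load (millicores per tick) and return the replica count per tick.

    window_ticks is a simplified stand-in for the downscale stabilization window:
    scale-ups apply immediately, scale-downs are held to the highest
    recommendation seen in the last window_ticks ticks.
    """
    replicas = start_replicas
    recent = deque(maxlen=window_ticks)
    history = []
    for load in total_load:
        per_pod = load / replicas  # assume load spreads evenly across pods
        util = utilization_pct(per_pod, request_millicores)
        rec = desired_replicas(replicas, util, target_util)
        recent.append(rec)
        if rec < replicas:
            replicas = min(replicas, max(recent))  # stabilized scale-down
        else:
            replicas = rec
        history.append(replicas)
    return history


# Hypothetical spiky load: alternates between 1000m and 400m of total CPU.
load = [1000, 400, 1000, 400, 1000, 400]
no_window = simulate(load, request_millicores=250, target_util=50,
                     window_ticks=1, start_replicas=4)
with_window = simulate(load, request_millicores=250, target_util=50,
                       window_ticks=3, start_replicas=4)
print("window=1 tick :", no_window)
print("window=3 ticks:", with_window)
```

With a one-tick window the replica count follows every dip in load; with a longer window it holds at the level the spikes need. The same 200m of usage also reads as 80% against a 250m request but only 20% against a 1000m request, which is why requests sized for peak rather than steady state distort scaling decisions.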

Stay current