Skills
Kubernetes
EKS and GKE in production with real on-call ownership. Not just configuration — actual incident response, upgrades, and reliability work.
What I have done
- Operated EKS and GKE workloads with real on-call ownership. Rollouts, incidents, brownouts, and noisy alerts.
- Built safer delivery patterns: readiness gates, PDBs, canaries, and rollback playbooks.
- Designed for scale: HPA/VPA, Cluster Autoscaler, node group isolation (system vs workload), multi-AZ posture.
- Security and access: RBAC, IRSA, namespace boundaries, least-privilege service accounts, secrets strategy.
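The delivery patterns above can be sketched as minimal manifests. Names (`api`, `api-pdb`), the image, and the probe path are illustrative, not taken from a real cluster:

```yaml
# PodDisruptionBudget: keep at least 2 replicas up during voluntary
# disruptions (drains, node-group upgrades).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
---
# Readiness-gated rollout: a pod only receives traffic once /healthz
# answers, so a bad canary never serves requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
  strategy:
    rollingUpdate:
      maxUnavailable: 0   # never drop below desired capacity mid-rollout
      maxSurge: 1
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.2.3   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
```

Setting the PDB before the rollout matters: drains during upgrades respect it, canary rollouts don't need it but benefit from the same floor.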
What breaks in real life
HPA thrashing
Fix with sane requests/limits, scale-down stabilization windows, and queue-aware metrics. Most thrashing traces back to CPU requests that don't reflect real steady-state usage, so utilization targets swing on noise.
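A minimal anti-thrash HPA using the `autoscaling/v2` `behavior` field. The target utilization and rate limits are illustrative starting points, not tuned numbers:

```yaml
# HPA with a scale-down stabilization window to damp flapping.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api          # illustrative target
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # only meaningful if requests match real usage
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min of low load before shrinking
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60   # shed at most one pod per minute
```

Scale-up stays fast by default; the asymmetry is deliberate, since adding capacity late hurts more than removing it late.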
Node pressure and evictions
Right-size requests, set PDBs before you need them, isolate noisy neighbors onto their own node groups, and tune eviction thresholds.
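Eviction tuning lives in the kubelet config. A sketch with soft thresholds ahead of the hard ones, so you get a grace period (and an alertable signal) before pods are killed; the values are illustrative:

```yaml
# KubeletConfiguration fragment: soft eviction fires first with a grace
# period, hard eviction is the backstop. Reserve memory for system daemons
# so they aren't part of the squeeze.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "2m"
systemReserved:
  memory: "512Mi"
```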
DNS and CNI weirdness
Correlate CoreDNS latency, conntrack pressure, and CNI errors together. Keep runbooks. This class of issue is almost never obvious in isolation.
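One lever that helps on both fronts is pod-level resolver tuning. A sketch (names and image are placeholders): lowering `ndots` cuts the search-domain fan-out that inflates CoreDNS query volume, and `single-request-reopen` works around the well-known conntrack race that drops parallel A/AAAA queries on some glibc/CNI combinations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-example   # illustrative
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"          # default is 5; fewer speculative lookups per name
      - name: single-request-reopen   # serialize A/AAAA to dodge the conntrack race
  containers:
    - name: app
      image: example.com/app:1.0   # placeholder
```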
Upgrade blast radius
Staged upgrades, test add-ons first, gate critical workloads. Never upgrade control plane and node groups in the same window.
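On EKS, the node-group half of that staging can be expressed in eksctl config (syntax assumed from eksctl's `ClusterConfig` schema; cluster and node-group names are illustrative). Capping parallel node replacement lets PDBs hold the line during drains:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod          # illustrative
  region: us-east-1   # illustrative
managedNodeGroups:
  - name: workload-a
    instanceType: m6i.large
    desiredCapacity: 6
    updateConfig:
      maxUnavailable: 1   # replace one node at a time; drains respect PDBs
```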
Interview-ready examples
- Safe deploys: readiness gates + canary + metric-gated rollback
- Reduce on-call noise: SLO-based paging + ownership routing
- Cluster upgrade: staged rollout with add-on testing and PDB verification
- Cost reduction: right-sizing via VPA recommendations + spot node groups