Skills
Kubernetes
EKS and GKE in production with real on-call ownership. Not just configuration — actual incident response, upgrades, and reliability work.
What I have done
- Operated EKS and GKE workloads with real on-call ownership. Rollouts, incidents, brownouts, and noisy alerts.
- Built safer delivery patterns: readiness gates, PDBs, canaries, and rollback playbooks.
- Designed for scale: HPA/VPA, Cluster Autoscaler, node group isolation (system vs workload), multi-AZ posture.
- Security and access: RBAC, IRSA, namespace boundaries, least-privilege service accounts, secrets strategy.
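The delivery patterns above can be sketched as minimal manifests. Names (`api`, `api-pdb`), the image, and the probe path are illustrative, not taken from a real cluster:

```yaml
# PodDisruptionBudget: keep at least 2 replicas up during voluntary
# disruptions (drains, node-group upgrades).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
---
# Readiness-gated rollout: a pod only receives traffic once /healthz
# answers, so a bad canary never serves requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
  strategy:
    rollingUpdate:
      maxUnavailable: 0   # never drop below desired capacity mid-rollout
      maxSurge: 1
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.2.3   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
```

Setting the PDB before the rollout matters: drains during upgrades respect it, canary rollouts don't need it but benefit from the same floor.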
What breaks in real life
HPA thrashing
Fix with sane requests/limits, scale-down stabilization windows, and queue-aware metrics. Most thrashing traces back to CPU requests that don't reflect real steady-state usage, so utilization targets swing on noise.
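A minimal anti-thrash HPA using the `autoscaling/v2` `behavior` field. The target utilization and rate limits are illustrative starting points, not tuned numbers:

```yaml
# HPA with a scale-down stabilization window to damp flapping.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api          # illustrative target
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # only meaningful if requests match real usage
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min of low load before shrinking
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60   # shed at most one pod per minute
```

Scale-up stays fast by default; the asymmetry is deliberate, since adding capacity late hurts more than removing it late.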
Node pressure and evictions
Right-size requests, set PDBs before you need them, isolate noisy neighbors onto their own node groups, and tune eviction thresholds.
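Eviction tuning lives in the kubelet config. A sketch with soft thresholds ahead of the hard ones, so you get a grace period (and an alertable signal) before pods are killed; the values are illustrative:

```yaml
# KubeletConfiguration fragment: soft eviction fires first with a grace
# period, hard eviction is the backstop. Reserve memory for system daemons
# so they aren't part of the squeeze.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "2m"
systemReserved:
  memory: "512Mi"
```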
DNS and CNI weirdness
Correlate CoreDNS latency, conntrack pressure, and CNI errors together. Keep runbooks. This class of issue is almost never obvious in isolation.
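One lever that helps on both fronts is pod-level resolver tuning. A sketch (names and image are placeholders): lowering `ndots` cuts the search-domain fan-out that inflates CoreDNS query volume, and `single-request-reopen` works around the well-known conntrack race that drops parallel A/AAAA queries on some glibc/CNI combinations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-example   # illustrative
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"          # default is 5; fewer speculative lookups per name
      - name: single-request-reopen   # serialize A/AAAA to dodge the conntrack race
  containers:
    - name: app
      image: example.com/app:1.0   # placeholder
```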
Upgrade blast radius
Staged upgrades, test add-ons first, gate critical workloads. Never upgrade control plane and node groups in the same window.
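On EKS, the node-group half of that staging can be expressed in eksctl config (syntax assumed from eksctl's `ClusterConfig` schema; cluster and node-group names are illustrative). Capping parallel node replacement lets PDBs hold the line during drains:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod          # illustrative
  region: us-east-1   # illustrative
managedNodeGroups:
  - name: workload-a
    instanceType: m6i.large
    desiredCapacity: 6
    updateConfig:
      maxUnavailable: 1   # replace one node at a time; drains respect PDBs
```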
Interview-ready examples
- Safe deploys: readiness gates + canary + metric-gated rollback
- Reduce on-call noise: SLO-based paging + ownership routing
- Cluster upgrade: staged rollout with add-on testing and PDB verification
- Cost reduction: right-sizing via VPA recommendations + spot node groups