Workflows
Kubernetes cost spikes: the usual suspects and how to find them fast
Cloud bills spike in Kubernetes for the same reasons every time. None of them show up in the default dashboards, and most stay hidden until the month-end bill lands.
#kubernetes#finops#devops#platformengineering
How it looks in practice
Cost leak sources (ranked by surprise factor):
1. Unset resource requests → scheduler packs nodes → OOM → over-provision
2. Autoscaler scale-down lag → zombie nodes after traffic spike
3. Log pipelines w/o sampling → 40% of bill, 0% of dashboards
4. Idle namespaces → dev clusters running 24/7
5. Spot interruption gaps → fallback to on-demand, never reverted
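Item 1 is the cheapest to fix: set requests and limits on every container. A minimal sketch — the names, image, and numbers below are placeholders, not recommendations; size them against your own profiling data:

```yaml
# Hypothetical deployment fragment. All values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.0   # placeholder image
          resources:
            requests:            # what the scheduler packs nodes against
              cpu: "250m"
              memory: "256Mi"
            limits:              # the throttle / OOM-kill ceiling
              cpu: "500m"
              memory: "512Mi"
```

Without requests, the scheduler treats the pod as near-free and over-packs the node; the memory limit is the OOM boundary, so set it from observed usage, not guesses.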
Where it breaks
- Missing resource requests let the scheduler over-pack nodes — when pods OOM, you over-provision to compensate.
- Cluster autoscaler adds nodes faster than it removes them. Spot interruptions leave zombie capacity for hours.
- Logging agents (Fluentd/Filebeat) on every node with no sampling become the largest line item nobody owns.
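For the autoscaler lag in the second bullet, the knobs live on the cluster-autoscaler deployment itself. A sketch of the relevant flags (real upstream cluster-autoscaler flags; the image tag is a placeholder, and the values shown are the upstream defaults — tighten them if zombie nodes linger):

```yaml
# Fragment of a hypothetical cluster-autoscaler pod spec.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # placeholder tag
    command:
      - ./cluster-autoscaler
      - --scale-down-enabled=true
      - --scale-down-utilization-threshold=0.5   # node utilization below this is a removal candidate
      - --scale-down-unneeded-time=10m           # how long a node must be unneeded before removal
      - --scale-down-delay-after-add=10m         # cooldown after scale-up before any scale-down
```

The last two flags are why capacity hangs around after a traffic spike: with defaults, a node survives at least 10 minutes of being unneeded, plus any post-scale-up cooldown.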
The rule
→ Every workload needs requests AND limits. Review autoscaler scale-down thresholds monthly. Sample logs at source, not at the sink.
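"Sample at source" means cutting volume on the node, before it hits the pipeline. A sketch of the crudest version — assuming Fluent Bit (which supports a YAML config format) with its grep filter, and structured JSON logs that carry a `level` field; paths and tags are placeholders:

```yaml
# Hypothetical Fluent Bit config fragment: drop debug-level records
# on the node, before they are shipped anywhere.
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log   # placeholder path
      tag: kube.*
  filters:
    - name: grep
      match: 'kube.*'
      exclude: level debug              # drop records whose level field matches "debug"
  outputs:
    - name: stdout                      # placeholder output
      match: '*'
```

This is filtering, not true percentage sampling — but it is the cut that pays for itself first, because debug noise is usually the bulk of the volume.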
How to sanity-check it
- Kubecost or OpenCost — per-namespace/team attribution. Without this, no one feels accountable for the number.
- KEDA for event-driven workload scaling — eliminates idle replicas without sacrificing responsiveness.
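The KEDA point above can be sketched as a ScaledObject that scales a queue worker to zero when idle — the names, queue, and thresholds are placeholder assumptions, and the trigger type is just one example of KEDA's many scalers:

```yaml
# Hypothetical KEDA ScaledObject: scale "worker" on queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler          # placeholder name
spec:
  scaleTargetRef:
    name: worker               # the Deployment to scale (placeholder)
  minReplicaCount: 0           # scale to zero when the queue is empty
  maxReplicaCount: 20
  cooldownPeriod: 300          # seconds of inactivity before dropping to zero
  triggers:
    - type: rabbitmq           # example trigger; many scalers exist
      metadata:
        queueName: jobs        # placeholder queue
        mode: QueueLength
        value: "10"            # target messages per replica
        # broker address comes from a TriggerAuthentication in practice
```

`minReplicaCount: 0` is what kills the idle-replica cost; the cooldown keeps scale-to-zero from flapping on bursty queues.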
The bigger picture
The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.
Route: /workflows/kubernetes-cost-spikes-the-usual-suspects-and-how-to-find-them-fast