Workflows
Kubernetes cost spikes: the usual suspects and how to find them fast
Cloud bills spike in Kubernetes for the same reasons every time. None of them show up in the default dashboards, and most stay hidden until the month-end bill lands.
#kubernetes#finops#devops#platformengineering
How it looks in practice
Cost leak sources (ranked by surprise factor):
1. Unset resource requests → scheduler packs nodes → OOM → over-provision
2. Autoscaler scale-down lag → zombie nodes after traffic spike
3. Log pipelines w/o sampling → 40% of bill, 0% of dashboards
4. Idle namespaces → dev clusters running 24/7
5. Spot interruption gaps → fallback to on-demand, never reverted
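Item 1 is the cheapest to fix: set requests and limits on every container. A minimal sketch — the names, image, and numbers below are placeholders, not recommendations; size them against your own profiling data:

```yaml
# Hypothetical deployment fragment. All values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                      # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.0   # placeholder image
          resources:
            requests:            # what the scheduler packs nodes against
              cpu: "250m"
              memory: "256Mi"
            limits:              # the throttle / OOM-kill ceiling
              cpu: "500m"
              memory: "512Mi"
```

Without requests, the scheduler treats the pod as near-free and over-packs the node; the memory limit is the OOM boundary, so set it from observed usage, not guesses.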
Where it breaks
- Missing resource requests let the scheduler over-pack nodes — when pods OOM, you over-provision to compensate.
- Cluster autoscaler adds nodes faster than it removes them. Spot interruptions leave zombie capacity for hours.
- Logging agents (Fluentd/Filebeat) on every node with no sampling become the largest line item nobody owns.
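For the autoscaler lag in the second bullet, the knobs live on the cluster-autoscaler deployment itself. A sketch of the relevant flags (real upstream cluster-autoscaler flags; the image tag is a placeholder, and the values shown are the upstream defaults — tighten them if zombie nodes linger):

```yaml
# Fragment of a hypothetical cluster-autoscaler pod spec.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # placeholder tag
    command:
      - ./cluster-autoscaler
      - --scale-down-enabled=true
      - --scale-down-utilization-threshold=0.5   # node utilization below this is a removal candidate
      - --scale-down-unneeded-time=10m           # how long a node must be unneeded before removal
      - --scale-down-delay-after-add=10m         # cooldown after scale-up before any scale-down
```

The last two flags are why capacity hangs around after a traffic spike: with defaults, a node survives at least 10 minutes of being unneeded, plus any post-scale-up cooldown.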
The rule
→ Every workload needs requests AND limits. Review autoscaler scale-down thresholds monthly. Sample logs at source, not at the sink.
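"Sample at source" means cutting volume on the node, before it hits the pipeline. A sketch of the crudest version — assuming Fluent Bit (which supports a YAML config format) with its grep filter, and structured JSON logs that carry a `level` field; paths and tags are placeholders:

```yaml
# Hypothetical Fluent Bit config fragment: drop debug-level records
# on the node, before they are shipped anywhere.
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log   # placeholder path
      tag: kube.*
  filters:
    - name: grep
      match: 'kube.*'
      exclude: level debug              # drop records whose level field matches "debug"
  outputs:
    - name: stdout                      # placeholder output
      match: '*'
```

This is filtering, not true percentage sampling — but it is the cut that pays for itself first, because debug noise is usually the bulk of the volume.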
How to sanity-check it
- Kubecost or OpenCost — per-namespace/team attribution. Without this, no one feels accountable for the number.
- KEDA for event-driven workload scaling — eliminates idle replicas without sacrificing responsiveness.
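The KEDA point above can be sketched as a ScaledObject that scales a queue worker to zero when idle — the names, queue, and thresholds are placeholder assumptions, and the trigger type is just one example of KEDA's many scalers:

```yaml
# Hypothetical KEDA ScaledObject: scale "worker" on queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler          # placeholder name
spec:
  scaleTargetRef:
    name: worker               # the Deployment to scale (placeholder)
  minReplicaCount: 0           # scale to zero when the queue is empty
  maxReplicaCount: 20
  cooldownPeriod: 300          # seconds of inactivity before dropping to zero
  triggers:
    - type: rabbitmq           # example trigger; many scalers exist
      metadata:
        queueName: jobs        # placeholder queue
        mode: QueueLength
        value: "10"            # target messages per replica
        # broker address comes from a TriggerAuthentication in practice
```

`minReplicaCount: 0` is what kills the idle-replica cost; the cooldown keeps scale-to-zero from flapping on bursty queues.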
The bigger picture
The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.
Route: /workflows/kubernetes-cost-spikes-the-usual-suspects-and-how-to-find-them-fast