Kafka reliability (the stuff that matters in prod)
- Topic design: partitions, replication factor, retention, compaction when needed.
- Consumers: group stability, offset management, idempotent processing, bounded retries, and dead-letter queue (DLQ) patterns.
- Backpressure: protect downstream systems; rate limit, and keep batch sizes bounded so one poll never outruns what downstream can absorb.
- Observability: lag (per group/partition), rebalance rate, produce/consume errors, throughput.
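Sketch of the consumer-side pattern (idempotency + bounded retries + DLQ). It uses no Kafka client, so it runs anywhere. Everything named here (Record, Processor, the in-memory seen set, the dead_letters list) is a hypothetical stand-in: in production the dedupe store would need to be durable and the DLQ would be a real topic. Retries fire back-to-back to keep the sketch short; a real consumer would probably back off between attempts.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Record:
    """Minimal stand-in for a consumed Kafka record (hypothetical type)."""
    topic: str
    partition: int
    offset: int
    key: str
    value: str


@dataclass
class Processor:
    handler: Callable[[Record], None]
    max_attempts: int = 3
    # Assumption: in-memory stand-ins for a durable dedupe store and a DLQ topic.
    seen: set = field(default_factory=set)
    dead_letters: list = field(default_factory=list)

    def process(self, record: Record) -> str:
        # (topic, partition, offset) uniquely identifies a record, so it works
        # as a dedupe key when the same record is redelivered after a rebalance.
        record_id = (record.topic, record.partition, record.offset)
        if record_id in self.seen:
            return "duplicate"

        last_error = None
        for _ in range(self.max_attempts):
            try:
                self.handler(record)
                self.seen.add(record_id)
                return "ok"
            except Exception as exc:  # sketch only: retries every failure type
                last_error = exc

        # Retries exhausted: park the record so the partition keeps moving.
        self.dead_letters.append((record, repr(last_error)))
        self.seen.add(record_id)
        return "dead-lettered"
```

Usage: call process() for each polled record and commit the offset once it returns, whatever the result. A dead-lettered record is deliberately marked as seen so a redelivery does not retry it again.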
Debugging playbook (quick checklist)
- Lag spike: input surge, consumer slowdown, rebalance, or hot partition?
- Check: consumer errors, commit rate, rebalance count, broker health, in-sync replicas (ISR), network.
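- Fix: scale consumers (up to the partition count), tune max.poll.records and max.poll.interval.ms, increase partitions carefully (it changes key-to-partition mapping, so per-key ordering breaks across the change), isolate hot keys.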
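Sketch of the first triage step: is the lag concentrated on one partition, or spread evenly? Per-partition lag is the log end offset minus the group's committed offset. The function names and both thresholds (skew_ratio, min_lag) are illustrative assumptions, not Kafka settings; the offsets would come from whatever admin or metrics tooling is already in place.

```python
from statistics import median


def partition_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition = log end offset - committed offset (floored at 0)."""
    # Assumption: a partition with no committed offset is counted from 0.
    return {
        p: max(end - committed.get(p, 0), 0)
        for p, end in end_offsets.items()
    }


def classify_lag(lag: dict, skew_ratio: float = 3.0, min_lag: int = 100) -> str:
    """Rough triage: 'healthy', 'hot-partition', or 'uniform-lag'."""
    if sum(lag.values()) < min_lag:
        return "healthy"

    values = sorted(lag.values())
    worst, rest = values[-1], values[:-1]
    # Compare the worst partition against the median of the others, so one
    # outlier does not inflate its own baseline.
    if rest and worst >= skew_ratio * max(median(rest), 1):
        return "hot-partition"
    return "uniform-lag"


if __name__ == "__main__":
    end = {0: 1010, 1: 2012, 2: 3009, 3: 9000}
    committed = {0: 1000, 1: 2000, 2: 3000, 3: 4000}
    lag = partition_lag(end, committed)
    print(lag, classify_lag(lag))
```

Reading the result: "hot-partition" suggests key skew (isolate hot keys), while "uniform-lag" suggests an input surge or a slow or rebalancing group (check consumer errors and rebalance count, then scale consumers).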
Links