Skills
Kafka
Consumer lag debugging, DLQ patterns, partition strategy, and safe rolling restarts. Streaming systems that stay reliable under load.
What I work on
- Consumer lag monitoring and root cause analysis: lag growing vs lag stuck are very different problems.
- DLQ design: retry strategies with exponential backoff, poison pill isolation, and idempotent processing.
- Partition key design for even distribution and ordering guarantees where required.
- Safe broker rolling restarts with rack-aware replication and in-sync replica monitoring.
- Producer config tuning: acks, retries, and idempotency for at-least-once and exactly-once semantics.
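As a rough illustration of the DLQ bullet above, here is a minimal sketch of retry with exponential backoff followed by dead-lettering. The names `backoff_delays`, `process_with_dlq`, `handler`, and `send_to_dlq` are hypothetical and not tied to any Kafka client; the default delays and retry count are assumptions, not recommendations.

```python
import time
from typing import Any, Callable, List


def backoff_delays(max_retries: int, base_s: float = 0.5, cap_s: float = 30.0) -> List[float]:
    """Exponential schedule: base_s * 2^attempt, capped at cap_s (assumed defaults)."""
    return [min(cap_s, base_s * (2 ** attempt)) for attempt in range(max_retries)]


def process_with_dlq(
    record: Any,
    handler: Callable[[Any], None],
    send_to_dlq: Callable[[Any, Exception], None],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run handler(record); retry with backoff, then route the record to the DLQ.

    Returns True if the record was processed, False if it was dead-lettered.
    """
    delays = backoff_delays(max_retries)
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            handler(record)
            return True
        except Exception as exc:  # a real consumer would separate retryable from fatal errors
            last_error = exc
            if attempt < max_retries:
                sleep(delays[attempt])
    # Retries exhausted: isolate the poison pill so the partition can keep moving.
    send_to_dlq(record, last_error)
    return False
```

Because a record can be retried after a partial success, the handler in this sketch is assumed to be idempotent. Adding random jitter to the delays is a common refinement that is left out here for brevity.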
Debugging consumer lag
Lag growing steadily
Consumer throughput is below the produce rate. Check consumer CPU and memory, partition count vs consumer count, and batch size settings such as max.poll.records.
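The arithmetic behind this case can be sketched as follows. The function names and the sample rates are hypothetical; the one Kafka fact relied on is that consumers in a group beyond the partition count sit idle.

```python
import math
from typing import Optional, Tuple


def seconds_to_drain(lag: int, produce_rate: float, consume_rate: float) -> Optional[float]:
    """Time to clear the backlog, or None if the consumer never catches up."""
    if consume_rate <= produce_rate:
        return None  # lag grows (or holds) forever at these rates
    return lag / (consume_rate - produce_rate)


def consumers_needed(produce_rate: float, per_consumer_rate: float, partitions: int) -> Tuple[int, bool]:
    """Consumers required to match the produce rate, and whether partitions allow it."""
    needed = math.ceil(produce_rate / per_consumer_rate)
    # Extra consumers beyond the partition count receive no assignment.
    return needed, needed <= partitions
```

For example, with rates in messages per second, `consumers_needed(10000, 1000, 6)` reports that 10 consumers are needed but only 6 partitions exist, so scaling out consumers alone would not help in that case.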
Lag stuck but not growing
Usually a rebalancing loop or a poison pill. Run kafka-consumer-groups.sh --describe a few times in a row and watch for partition assignment churn.
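One way to spot a stuck partition is to sample offsets over time and flag partitions whose committed offset does not move while lag remains. This is a sketch under assumptions: `stalled_partitions` and the sample layout are hypothetical, and the offsets would come from whatever monitoring is already in place.

```python
from typing import Dict, List, Tuple


def stalled_partitions(samples: Dict[int, List[Tuple[int, int]]]) -> List[int]:
    """samples maps partition -> [(committed_offset, log_end_offset), ...], oldest first.

    A partition is flagged when its committed offset is the same in the first
    and last sample and the latest sample still shows lag.
    """
    stalled = []
    for partition, points in sorted(samples.items()):
        if len(points) < 2:
            continue  # not enough history to judge
        first_committed = points[0][0]
        last_committed, last_end = points[-1]
        if first_committed == last_committed and last_end - last_committed > 0:
            stalled.append(partition)
    return stalled
```

A partition flagged this way is a candidate for a poison pill at the committed offset; a whole group flagged at once points more toward a rebalancing loop.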
Lag spiky
Typically a downstream dependency adding latency. Check consumer processing time p99, not just the lag offset number.
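To make the p99 point concrete, here is a small sketch that times a handler and computes a nearest-rank percentile. The helper names `timed_call` and `percentile` are hypothetical, and the durations in the usage note are made-up illustration values.

```python
import math
import time
from typing import Any, Callable, List


def timed_call(handler: Callable[[Any], None], record: Any, durations_ms: List[float]) -> None:
    """Run handler(record) and append its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    try:
        handler(record)
    finally:
        durations_ms.append((time.perf_counter() - start) * 1000.0)


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With 98 calls at 10 ms and two slow calls at 500 ms and 900 ms, the median is 10 ms while `percentile(samples, 99)` is 500 ms, which is the kind of tail that shows up as spiky lag.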
Consumer group stuck on rebalance
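Check session.timeout.ms vs max.poll.interval.ms. If processing a batch takes longer than max.poll.interval.ms, the consumer is removed from the group and a rebalance starts. Increase max.poll.interval.ms if processing is slow, or reduce the batch size with max.poll.records.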
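The trade-off can be checked with simple arithmetic. This sketch assumes a worst-case per-record processing time measured elsewhere; the function names and the 50% headroom are hypothetical choices. The default of 300000 ms for max.poll.interval.ms matches recent Apache Kafka client releases as far as I know, but verify it for the version in use.

```python
def poll_budget_ok(
    max_poll_records: int,
    worst_case_ms_per_record: float,
    max_poll_interval_ms: int = 300_000,
) -> bool:
    """True if a full batch at worst-case speed finishes before the poll deadline."""
    return max_poll_records * worst_case_ms_per_record < max_poll_interval_ms


def max_safe_poll_records(
    worst_case_ms_per_record: float,
    max_poll_interval_ms: int = 300_000,
    headroom: float = 0.5,
) -> int:
    """Largest batch that uses at most `headroom` of the poll interval (at least 1)."""
    return max(1, int(max_poll_interval_ms * headroom / worst_case_ms_per_record))
```

For instance, 500 records at 700 ms each would need 350 seconds, which exceeds a 300 second poll interval, so `poll_budget_ok(500, 700)` is False and `max_safe_poll_records(700)` suggests a batch of 214.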
First commands I run
- kafka-consumer-groups.sh --describe to see per-partition lag distribution
- Compare lag across partitions: uneven lag points to a specific partition or consumer issue
- Check broker metrics: under-replicated partitions and ISR shrinks indicate broker health issues
- Monitor consumer group coordinator changes for rebalance frequency
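The per-partition comparison can be scripted roughly as below. This is a sketch that assumes the describe output has a header row containing TOPIC, PARTITION, and LAG columns, as in recent Kafka releases; column layout can vary by version, so the parser looks columns up by name. The function names, the sample text, and the 10x-median threshold are hypothetical.

```python
import statistics
from typing import Dict, List, Tuple

PartitionKey = Tuple[str, int]


def parse_lag(describe_output: str) -> Dict[PartitionKey, int]:
    """Extract {(topic, partition): lag} from describe-style tabular text."""
    lags: Dict[PartitionKey, int] = {}
    cols = None
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if {"TOPIC", "PARTITION", "LAG"} <= set(fields):
            cols = {name: i for i, name in enumerate(fields)}  # header row
            continue
        if cols is None or len(fields) <= max(cols["TOPIC"], cols["PARTITION"], cols["LAG"]):
            continue
        lag = fields[cols["LAG"]]
        partition = fields[cols["PARTITION"]]
        if not (lag.isdigit() and partition.isdigit()):
            continue  # e.g. "-" when no offset has been committed
        lags[(fields[cols["TOPIC"]], int(partition))] = int(lag)
    return lags


def hot_partitions(lags: Dict[PartitionKey, int], factor: float = 10.0) -> List[PartitionKey]:
    """Partitions whose lag exceeds factor times the median lag."""
    if not lags:
        return []
    baseline = max(statistics.median(lags.values()), 1)
    return sorted(key for key, lag in lags.items() if lag > factor * baseline)


# Illustrative sample only, not captured from a real cluster.
SAMPLE = """
GROUP   TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID  HOST       CLIENT-ID
orders  events  0          1000            1010            10    c-1          /10.0.0.1  app
orders  events  1          1000            9000            8000  c-2          /10.0.0.2  app
orders  events  2          1000            1020            20    c-1          /10.0.0.1  app
"""
```

On the sample, `hot_partitions(parse_lag(SAMPLE))` singles out partition 1 of the hypothetical `events` topic, which would point at consumer `c-2` or a skewed partition key rather than at overall throughput.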