How should you deploy Kafka consumers on Kubernetes? Before diving into the details, let’s start with some quick statistics about Kafka at OSO. We are experts in configuring, deploying and building applications on Kafka and have been running it at huge scale for many years. We started using Kafka in production in 2014 and have since delivered more than 30 enterprise projects for our clients. Our applications handle millions of messages per second and have processed over a trillion messages in total. With such extensive usage, we have gained valuable insights and lessons about Kafka.
Kafka consumers on Kubernetes: Challenges of distributed systems
Why is deploying Kafka consumers on Kubernetes difficult? Distributed systems like Kafka can be challenging to manage because their components run independently: one component can keep running even when another fails. Components can fail for various reasons, such as transient issues, configuration errors, or external dependency failures. To ensure that an application is healthy and can continue processing requests, Kubernetes provides liveness probes, readiness probes, and startup probes.
Kafka consumers on Kubernetes: Understanding Kubernetes probes
Kubernetes has three types of probes to help determine the health of an application container: liveness probes, readiness probes, and startup probes. A minimal example of wiring one up follows the list below.
Liveness probes: These probes tell Kubernetes when to restart an application container. They are useful for detecting deadlocks, where a container is still running but unable to process requests. By restarting the container, the application can remain available even when bugs are present.
Readiness probes: These probes determine when a container is ready to accept incoming traffic. They are particularly useful for containers that serve as backends for Kubernetes service load balancers, such as APIs. Readiness probes ensure that a pod is ready to handle requests and prevent traffic from being routed to an unhealthy container.
Startup probes: These probes hold off liveness checks for slow-starting containers until the application has finished starting, ensuring that a pod doesn’t get killed before it’s fully up and running.
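To make the mechanics concrete, here is a minimal sketch of the application side of a probe: a tiny HTTP endpoint that a liveness probe could poll. The probe itself (path, port, interval) is configured in the pod spec; the `/healthz` path, port 8080 and the `is_healthy()` placeholder below are illustrative assumptions, not part of the original setup.

```python
# Minimal sketch: an HTTP endpoint a Kubernetes liveness probe could poll.
# The probe itself is configured in the pod spec; is_healthy() is a placeholder.
from http.server import BaseHTTPRequestHandler, HTTPServer

def is_healthy() -> bool:
    # Replace with a real check (e.g. the Kafka offset checks described below).
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status = 200 if is_healthy() else 500
            self.send_response(status)
            self.end_headers()
            self.wfile.write(b"ok" if status == 200 else b"unhealthy")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Kubernetes would call GET /healthz on this port at the configured interval.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```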
Implementing smart health checks for Kafka consumers
We initially took a naive approach to health checks for Kafka consumers: the check made a request to list the topics on the broker, which exercised the underlying connection to Kafka. If the request failed, the liveness probe failed and the service was restarted. While this approach caught some issues, such as TLS connectivity problems, it didn’t catch all of them.
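A minimal sketch of that naive check, assuming the confluent-kafka Python client (the original services may have used a different client). It only proves the consumer can reach the broker; it says nothing about whether messages are actually being processed.

```python
# Naive health check: can we still talk to the broker?
# Assumes the confluent-kafka Python client; `consumer` is an existing Consumer.
from confluent_kafka import Consumer, KafkaException

def naive_health_check(consumer: Consumer) -> bool:
    try:
        # list_topics() round-trips to the broker, so it catches connectivity
        # and TLS problems, but not a consumer that is connected yet stuck.
        consumer.list_topics(timeout=5)
        return True
    except KafkaException:
        return False
```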
We quickly realised that sometimes consumers would sit idle and not process messages, leading to lag on the partition and potential incidents. To address this, we developed a smart health check approach inspired by PagerDuty’s blog post.
The smart health check approach uses two values: the current offset (the offset of the latest message on the topic) and the committed offset (the last offset the consumer has committed). By checking that these two values keep moving, we ensure that the consumer is constantly making progress and processing messages. Here’s how the smart health check works (a code sketch follows the steps):
- Retrieve the current offset from Kafka. If this fails, it indicates a connectivity issue with Kafka, and the liveness probe fails.
- Retrieve the committed offset. This value is stored in memory and should not fail; if it does, the consumer is in an unexpected state, and the liveness probe fails.
- Check whether the current offset and committed offset are the same. If they are, there are no messages left to process on the topic, and the liveness probe passes.
- If the current offset and committed offset differ, check whether the committed offset has changed since the last liveness probe. If it hasn’t, the consumer is not processing messages, and the consumer is restarted.
- If the committed offset has changed, the consumer is processing messages and everything is working as expected.
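Putting those steps together, a hedged sketch using the confluent-kafka Python client might look like the following. Here the partition’s high watermark stands in for the “current offset”, and the committed offset is compared against the value remembered from the previous probe; names such as `last_seen_committed` and `smart_health_check` are ours, not from the original service, and whether the committed offset comes from memory or the broker depends on the client you use.

```python
# Smart health check sketch (confluent-kafka Python client, assumed).
# Compares the broker's latest offset with the consumer's committed offset
# and checks that the committed offset moves between probes.
from confluent_kafka import Consumer, KafkaException

last_seen_committed: dict[tuple[str, int], int] = {}  # remembered between probes

def smart_health_check(consumer: Consumer) -> bool:
    try:
        assignments = consumer.assignment()
        committed = consumer.committed(assignments, timeout=5)
    except KafkaException:
        return False  # cannot even fetch offsets: fail the liveness probe

    healthy = True
    for tp in committed:
        try:
            # High watermark = offset of the next message to be produced
            # on this partition (the "current offset" in the steps above).
            _low, high = consumer.get_watermark_offsets(tp, timeout=5)
        except KafkaException:
            return False  # connectivity issue with Kafka

        key = (tp.topic, tp.partition)
        if tp.offset == high:
            # Nothing left to process on this partition: healthy.
            last_seen_committed[key] = tp.offset
            continue
        if last_seen_committed.get(key) == tp.offset:
            # There is lag but the committed offset has not moved since the
            # previous probe: the consumer looks stuck.
            healthy = False
        last_seen_committed[key] = tp.offset
    return healthy
```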
This smart health check approach helped reduce Kafka-related incidents and the number of false alerts.
Kafka consumers on Kubernetes: Challenges and Solutions
Cascading Failures: When one pod restarted because of an unrelated issue, other pods in the deployment could crash as well. This happens during rebalancing: a replica that was consuming from one partition suddenly switches to consuming from a different partition, so the progress it was tracking no longer lines up. To address this, each replica keeps track of only its own partitions, ensuring that a rebalance doesn’t affect the other pods (see the sketch below).
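A hedged sketch of that idea, reusing the `last_seen_committed` dictionary from the earlier sketch: key the health-check state by (topic, partition) and drop state for partitions the replica no longer owns after a rebalance. The `on_assign`/`on_revoke` callbacks are standard confluent-kafka rebalance hooks; the bookkeeping around them, the broker address, group and topic names are all placeholders.

```python
# Per-replica partition tracking: only judge health on partitions this replica
# currently owns, so a rebalance elsewhere cannot fail our probe.
from confluent_kafka import Consumer

last_seen_committed: dict[tuple[str, int], int] = {}

def on_assign(consumer: Consumer, partitions):
    # Start tracking only the partitions this replica now owns.
    for tp in partitions:
        last_seen_committed.setdefault((tp.topic, tp.partition), -1)

def on_revoke(consumer: Consumer, partitions):
    # Forget partitions handed to another replica, so stale state cannot make
    # the next health check think this consumer has stopped progressing.
    for tp in partitions:
        last_seen_committed.pop((tp.topic, tp.partition), None)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",  # placeholder address
    "group.id": "email-delivery",       # placeholder group
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"], on_assign=on_assign, on_revoke=on_revoke)
```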
Intermittent Liveness Probe Failures: An email delivery service for one of our clients, which reads messages in batches and calls external services, experienced intermittent liveness probe failures. The health checks were running too frequently: if a batch took longer to process than the interval between checks, the probe would mistakenly conclude that messages were not being consumed. The solution was to configure the initial delay and interval of the health checks to be larger than the time it takes to process a batch (see the sketch below).
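The fix itself lives in the pod spec (a larger initial delay and period for the probe), but a small sketch like the one below can help pick those values: it records how long each batch takes so the probe interval can be set comfortably above the worst case. All names here are illustrative, not from the original service.

```python
# Sketch: measure batch processing time so the liveness probe's period and
# initial delay can be configured to exceed the slowest observed batch.
import time

max_batch_seconds = 0.0  # worst case seen so far; compare against the probe interval

def handle(msg):
    ...  # placeholder for the real per-message work (e.g. calling an email API)

def process_batch(messages):
    global max_batch_seconds
    started = time.monotonic()
    for msg in messages:
        handle(msg)
    elapsed = time.monotonic() - started
    max_batch_seconds = max(max_batch_seconds, elapsed)
    # Log it so the probe settings can be tuned against real numbers.
    print(f"batch of {len(messages)} took {elapsed:.1f}s (max so far {max_batch_seconds:.1f}s)")
```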
Lessons learned and takeaways
Our experience with implementing smart health checks for Kafka consumers has taught us several valuable lessons:
- Naive health checks can give a false sense of security and make it seem like an application is running fine even when it’s not. It’s important to implement proper health checks to ensure the application’s reliability.
- Building proper health checks can make a significant difference in the self-healing capabilities of an application and reduce the need for engineers to be called for trivial issues.
- Smart health checks can help reduce the number of false alerts and improve the overall quality of life for the team.
- It’s crucial to consider the specific behaviour of a service when designing health checks and determine what “unhealthy” actually means in that context.
- Logging and metrics play a vital role in troubleshooting and understanding the behaviour of the application. Logging the partition being consumed, the offsets being committed, and the duration of health checks can provide valuable insights (see the sketch after this list).
- Having metrics on the performance of health checks is just as important as monitoring the application itself.
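As a small illustration of those last two points, a health check can instrument itself. The sketch below reuses `smart_health_check` and `last_seen_committed` from the earlier sketch and simply logs the committed offsets per partition and how long the check took; it is an assumption about how one might wire this up, not the original implementation.

```python
# Sketch: instrument the health check itself with logs and timing.
import logging
import time

log = logging.getLogger("health")

def timed_health_check(consumer) -> bool:
    started = time.monotonic()
    healthy = smart_health_check(consumer)  # the check sketched earlier
    duration = time.monotonic() - started
    for (topic, partition), offset in last_seen_committed.items():
        log.info("partition %s[%d]: committed offset %d", topic, partition, offset)
    log.info("health check %s in %.2fs", "passed" if healthy else "failed", duration)
    return healthy
```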
For more content:
How to take your Kafka projects to the next level with a Confluent preferred partner
Event driven Architecture: A Simple Guide
Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation
Successfully Reduce AWS Costs: 4 Powerful Ways
Protecting Kafka Cluster
Apache Kafka Common Mistakes
Kafka Cruise Control 101
Kafka performance best practices for monitoring and alerting
How to build a custom Kafka Streams Statestores
How to avoid configuration drift across multiple Kafka environments using GitOps