OSO has deployed its fair share of Kafka Streams applications, and rebalances and assignments are part and parcel of this architecture. In this post we will explain the reasons behind rebalances, the process of rebalancing, and how to optimise rebalances for better performance. So let’s get started!
When a rebalance occurs in Kafka Streams, it is usually triggered by a specific event or condition. One common reason for rebalancing is a scheduled probing rebalance. This type of rebalance is initiated at regular intervals to ensure the cluster remains balanced and optimised. During a probing rebalance, the assignment of tasks may not change significantly, but it helps maintain the overall balance of the cluster.
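The interval between probing rebalances is driven by the `probing.rebalance.interval.ms` Kafka Streams config (default 600000 ms, i.e. 10 minutes). As a rough sketch of the timer logic, which in reality lives inside Kafka Streams’ partition assignor:

```python
# Illustrative sketch of the probing-rebalance timer. The real logic lives
# inside Kafka Streams' StreamsPartitionAssignor; only the config key
# "probing.rebalance.interval.ms" and its default (600000 ms) are real.
PROBING_REBALANCE_INTERVAL_MS = 600_000

def probing_rebalance_due(last_rebalance_ms: int, now_ms: int) -> bool:
    """Return True when enough time has passed since the last rebalance
    for the leader to schedule another probing rebalance."""
    return now_ms - last_rebalance_ms >= PROBING_REBALANCE_INTERVAL_MS
```

Lowering the interval makes the cluster converge faster after scaling events, at the cost of more frequent rebalance activity.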
Another reason for rebalancing is when a consumer instance, let’s call it B, falls behind in processing tasks. In this case, a rebalance is scheduled to allow B to catch up and resume processing. Once B is ready, the rebalance process begins.
During a rebalance, tasks may change ownership from one consumer instance to another. In the past, rebalances would cause all processing to stop until the rebalance was complete. However, with the incremental cooperative rebalancing protocol, processing can continue during a rebalance unless a specific task needs to be swapped between instances.
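Kafka Streams opts into cooperative rebalancing automatically through its built-in StreamsPartitionAssignor. For a plain Kafka consumer, the analogous behaviour comes from the CooperativeStickyAssignor, which is selected via configuration. A minimal sketch, where the config key and assignor class name are real but the broker address and group id are placeholders:

```python
# For a plain Kafka consumer (Kafka Streams handles this automatically via
# StreamsPartitionAssignor), cooperative rebalancing is opted into with the
# partition.assignment.strategy setting. The key and class name are real;
# the bootstrap address and group id below are hypothetical.
consumer_config = {
    "bootstrap.servers": "localhost:9092",   # placeholder address
    "group.id": "example-group",             # placeholder group id
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
}
```

With the older eager protocol, every member revoked all of its partitions at the start of each rebalance; the cooperative protocol only revokes the partitions that actually need to move.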
When a rebalance occurs, the instance that needs to give up a task will flush its state stores, flush any buffers, and commit the task. This triggers a follow-up rebalance, allowing the other instance to start processing the task. This handoff between instances ensures a smooth transition and minimal downtime.
The rebalance process may involve multiple rebalances, depending on the number of tasks and the time it takes for instances to close out tasks. However, once a stable assignment is produced with no follow-up rebalances, it indicates that the rebalance process is complete and the cluster has converged.
While rebalances are necessary for maintaining a balanced and optimised cluster, it’s important to minimise unnecessary and unwanted rebalances. One way to achieve this is by tuning the consumer configurations. By setting parameters such as max.poll.interval.ms, heartbeat.interval.ms, and session.timeout.ms to larger values, the group becomes less sensitive to slow or briefly unresponsive consumers, thereby preventing unnecessary rebalances.
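As a concrete starting point, the settings above might look like the following. The config keys and quoted defaults are real Kafka consumer configs; the values shown are illustrative, not recommendations for every workload:

```python
# Consumer-level settings that control how quickly a consumer is declared
# dead and a rebalance is triggered. Keys are real consumer configs; the
# values are illustrative starting points only.
tuning = {
    # Maximum time between poll() calls before the consumer is considered
    # failed and its tasks are reassigned (default 300000 ms).
    "max.poll.interval.ms": "600000",
    # How long the group coordinator waits without a heartbeat before
    # evicting the consumer (default 45000 ms in recent clients).
    "session.timeout.ms": "90000",
    # Heartbeat frequency; keep this well below session.timeout.ms
    # (roughly one third is a common rule of thumb).
    "heartbeat.interval.ms": "30000",
}
```

The trade-off is detection latency: larger values mean a genuinely crashed instance holds on to its tasks for longer before the group notices.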
In scenarios where a large number of tasks need to be migrated, such as during a probing rebalance phase, the time to converge can be reduced by increasing the number of warm-up replicas and tuning restoration so each task warms up faster. This allows higher restore throughput from the broker and decreases the time required for each task to catch up.
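In Kafka Streams terms, the knobs involved are `max.warmup.replicas`, `acceptable.recovery.lag`, and `probing.rebalance.interval.ms`. The keys and quoted defaults below are real Kafka Streams configs; the values are illustrative:

```python
# Kafka Streams settings that shape how task migration is staged during
# probing rebalances. Keys are real Kafka Streams configs; the values
# chosen here are illustrative.
warmup_config = {
    # Extra warm-up replicas used to restore state on the destination
    # instance before task ownership moves (default 2).
    "max.warmup.replicas": "4",
    # A task counts as caught up once its state-store lag drops below this
    # many records (default 10000).
    "acceptable.recovery.lag": "10000",
    # Shorter interval -> more frequent probing rebalances -> faster
    # convergence, at the cost of more rebalance churn (default 600000 ms).
    "probing.rebalance.interval.ms": "300000",
}
```

More warm-up replicas let several tasks restore state in parallel, so each probing rebalance can hand over more tasks at once.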
To understand what’s happening during a rebalance, there are several key metrics to monitor. These metrics include:
- last-rebalance-seconds-ago — how long ago the group last rebalanced; a small value means a rebalance just happened
- rebalance-latency-avg and rebalance-latency-max — how long rebalances are taking
- rebalance-total and rebalance-rate-per-hour — how often rebalances occur
- failed-rebalance-total and failed-rebalance-rate-per-hour — rebalances that did not complete successfully
- the KafkaStreams client state — whether the application is in REBALANCING or RUNNING
By monitoring these metrics, it becomes easier to identify rebalance events and understand the state of the cluster during rebalancing.
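As an illustration, these metrics can feed a simple alerting check. The metric names below are real consumer coordinator metrics, but the helper function and threshold values are hypothetical, and the input dict stands in for values scraped from your monitoring system:

```python
# Hypothetical helper showing how rebalance metrics might drive a simple
# health check. The metric names are real consumer coordinator metrics;
# the thresholds and the function itself are illustrative only.
def rebalance_health(metrics: dict) -> list:
    """Return a list of warnings derived from rebalance-related metrics."""
    warnings = []
    if metrics.get("failed-rebalance-rate-per-hour", 0) > 0:
        warnings.append("rebalances are failing - check consumer logs")
    if metrics.get("rebalance-latency-avg", 0) > 60_000:
        warnings.append("rebalances are taking over a minute on average")
    if metrics.get("last-rebalance-seconds-ago", float("inf")) < 60:
        warnings.append("a rebalance happened in the last minute")
    return warnings
```

A healthy, converged cluster shows a large last-rebalance-seconds-ago, zero failed rebalances, and a client state of RUNNING.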
Rebalances are an essential part of maintaining a balanced and optimised Kafka Streams cluster. By understanding the reasons for rebalances, optimising rebalances, and monitoring key metrics, it becomes easier to manage rebalances and ensure smooth operation of the cluster.
Remember, rebalances are not something to be feared. They are a mechanism for maintaining the health and efficiency of the cluster. By understanding and managing rebalances effectively, you can ensure the smooth operation of your Kafka Streams applications.
For more content:
How to take your Kafka projects to the next level with a Confluent preferred partner
Event driven Architecture: A Simple Guide
Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation
Successfully Reduce AWS Costs: 4 Powerful Ways
Kafka performance best practices for monitoring and alerting
How to build a custom Kafka Streams Statestore
How to avoid configuration drift across multiple Kafka environments using GitOps
Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.
Contact Us