KRaft, short for Kafka Raft metadata mode, was introduced as a replacement for ZooKeeper to resolve release inconsistencies and to streamline the dependencies in Kafka’s overall architecture. KRaft is built on the Raft consensus algorithm, a distributed protocol, and offers:
- Increased scalability: KRaft scales better with larger deployments compared to ZooKeeper.
- High availability: No single point of failure, as the cluster can function even if some nodes are unavailable.
- Simplified architecture: KRaft eliminates the need for a separate ZooKeeper service, streamlining the Kafka ecosystem.
Migrating a large fleet of Kafka clusters to KRaft mode is not a trivial task, because Apache Kafka has used Apache ZooKeeper to store metadata since the beginning. Given the sheer number of moving parts involved, it’s impractical to migrate clusters one by one. Instead, a large-scale batch approach is necessary. This involves a two-step rolling process:
- Entering Hybrid Mode: The first roll switches the clusters to a hybrid mode where both ZooKeeper and KRaft are operational.
- Exiting Hybrid Mode: The second roll is executed swiftly after the first to minimize the time spent in hybrid mode and fully transition to KRaft.
One of the most important aspects of a migration is building confidence in the process, which comes from staging the migrations in a controlled order and monitoring each change. To keep the migration smooth and reliable, it’s crucial to follow these principles:
- Experiment using development clusters: Start with less critical development clusters to test the migration process without impacting production.
- Small Production Clusters: Proceed with smaller production clusters to begin experiencing real-world scenarios.
- Large Clusters: Finally, migrate the large, critical production clusters after the processes have been refined and proven in the earlier stages.
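The staged order above can be sketched as a small planning helper. This is only an illustration: the `Cluster` fields, the cluster names, and the `large_threshold` cut-off of 20 brokers are hypothetical and are not taken from any real fleet or from Kafka itself.

```python
from dataclasses import dataclass

# Stage order follows the principles above: development first,
# then small production, then large production.
STAGE_ORDER = ("development", "small-production", "large-production")


@dataclass(frozen=True)
class Cluster:
    name: str          # hypothetical cluster identifier
    environment: str   # "development" or "production"
    broker_count: int


def stage_of(cluster: Cluster, large_threshold: int = 20) -> str:
    """Classify a cluster into a migration stage (the threshold is an assumption)."""
    if cluster.environment == "development":
        return "development"
    if cluster.broker_count >= large_threshold:
        return "large-production"
    return "small-production"


def plan_migration(clusters, large_threshold: int = 20):
    """Return one batch of clusters per stage, in the order they should be migrated."""
    batches = {stage: [] for stage in STAGE_ORDER}
    for cluster in clusters:
        batches[stage_of(cluster, large_threshold)].append(cluster)
    # Within a stage, migrate the smallest clusters first.
    return [sorted(batches[stage], key=lambda c: c.broker_count) for stage in STAGE_ORDER]


fleet = [
    Cluster("payments", "production", 48),
    Cluster("sandbox", "development", 3),
    Cluster("audit", "production", 6),
]
for stage, batch in zip(STAGE_ORDER, plan_migration(fleet)):
    print(stage, [c.name for c in batch])
```

Each batch would be migrated and observed before the next one starts, so that lessons from the earlier stages feed into the later ones.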
What metrics to monitor in KRaft
Monitoring is critical during the migration process. Key metrics to watch include:
- Broker Count in Hybrid Mode: Tracks the number of brokers operating in both ZooKeeper and KRaft modes.
- Migrating Broker Count: Indicates how many brokers are still in the process of migrating from ZooKeeper to KRaft.
- ZooKeeper Migration State: Shows the current state of the migration process.
While running the migration, you can also verify its status by looking at the log on the KRaft controller leader or by checking the kafka.controller:type=KafkaController,name=ZkMigrationState metric.
When the metadata migration is completed and the metric value changes to MIGRATION, the brokers are still running in ZooKeeper mode; they switch to KRaft mode only in the subsequent roll. Together, these metrics help you understand the migration’s progress and identify potential issues early.
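To make the metric easier to read in dashboards or scripts, its numeric value can be mapped to a state name. The sketch below assumes the numeric values listed in the Apache Kafka documentation for ZkMigrationState (0 to 4); check them against the Kafka version you run. The helper names are illustrative, and how you fetch the raw value over JMX is left out.

```python
from enum import IntEnum


class ZkMigrationState(IntEnum):
    """Numeric values of the ZkMigrationState metric (assumed; verify for your version)."""
    NONE = 0            # cluster was created in KRaft mode, no migration involved
    MIGRATION = 1       # metadata copied; ZooKeeper is still in use
    PRE_MIGRATION = 2   # KRaft controller is waiting for the ZooKeeper-mode brokers
    POST_MIGRATION = 3  # migration has been finalized
    ZK = 4              # migration not started; a ZooKeeper controller is active


# States in which ZooKeeper is still part of the running system.
_ZOOKEEPER_IN_USE = {
    ZkMigrationState.ZK,
    ZkMigrationState.PRE_MIGRATION,
    ZkMigrationState.MIGRATION,
}


def zookeeper_still_required(metric_value: int) -> bool:
    """Return True while ZooKeeper must stay up for the given metric value."""
    return ZkMigrationState(metric_value) in _ZOOKEEPER_IN_USE


def describe(metric_value: int) -> str:
    """Render a short human-readable summary of the metric value."""
    state = ZkMigrationState(metric_value)
    if zookeeper_still_required(metric_value):
        return f"{state.name} (ZooKeeper still required)"
    return f"{state.name} (ZooKeeper not required)"


print(describe(1))
```

Note that MIGRATION is reported as still requiring ZooKeeper, which matches the point above: reaching that state does not mean the brokers have left ZooKeeper mode.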
How to handle potential migration issues
During a ZooKeeper to KRaft migration, a few issues might arise, such as:
- Metadata Synchronization Delays: As brokers switch modes, there can be delays in metadata synchronization between ZooKeeper and KRaft.
- Increased Load on ZooKeeper: The initial migration phase can put additional load on ZooKeeper as it needs to handle both old and new system requests.
To mitigate these issues, it’s essential to:
- Optimize Network Configuration: Ensure that the network can handle increased loads without becoming a bottleneck.
- Monitor ZooKeeper Performance: Keep an eye on ZooKeeper’s performance and scale it if necessary during the migration phase.
Test the process locally
In a practical demonstration, the migration process was showcased using a local cluster setup. The steps involved:
- Starting the Cluster: Initiate by starting ZooKeeper and then the brokers.
- Creating and Listing Topics: Create a topic and list the existing topics to verify the operational status of the cluster.
- Enabling Migration Configurations: Adjust configurations to allow the transition from ZooKeeper to KRaft.
- Rolling Restart: Perform a controlled rolling restart of the brokers to minimize downtime.
- Monitoring Through JMX Metrics: After the migration, use JMX metrics to confirm the new setup’s operational status.
Step-by-step process to migrate Kafka from ZooKeeper to KRaft
The migration from ZooKeeper to KRaft mode involves several critical steps, each designed to ensure a smooth transition and maintain system stability:
- Initiating the Cluster: The process begins with the startup of ZooKeeper, followed by the sequential startup of all brokers.
- Topic Management: Topics are first listed (showing none at this point), and then a new topic is created to confirm the cluster’s functionality.
- Migration Configuration Activation: Migration-specific configurations are enabled to prepare the brokers for the transition to KRaft mode.
- Rolling Restarts: Each broker is restarted in sequence to apply the new configurations without interrupting the overall service.
- Controller Configuration and Startup: The controller node is configured with the cluster ID and started to manage the cluster under the new system.
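As a rough illustration of the configuration steps above, the following sketch builds the migration-related properties as plain dictionaries and renders them in `.properties` format. The property keys are the ones used in the Apache Kafka migration documentation as far as we know; the node ID, ports, and host names are placeholder assumptions for a local setup. Other settings a real cluster needs (for example `inter.broker.protocol.version`, the listener security protocol map, and formatting the controller’s storage with the existing cluster ID) are not shown.

```python
def enable_broker_migration(broker_props, quorum_voters, controller_listener="CONTROLLER"):
    """Return a copy of a ZooKeeper-mode broker config with migration settings added."""
    updated = dict(broker_props)  # never mutate the caller's config
    updated["zookeeper.metadata.migration.enable"] = "true"
    updated["controller.quorum.voters"] = quorum_voters
    updated["controller.listener.names"] = controller_listener
    return updated


def controller_migration_props(node_id, quorum_voters, zookeeper_connect,
                               listener="CONTROLLER://:9093"):
    """Build a KRaft controller config that starts in migration mode."""
    return {
        "process.roles": "controller",
        "node.id": str(node_id),
        "controller.quorum.voters": quorum_voters,
        # The listener name is the part before "://", e.g. "CONTROLLER".
        "controller.listener.names": listener.split("://", 1)[0],
        "listeners": listener,
        "zookeeper.metadata.migration.enable": "true",
        "zookeeper.connect": zookeeper_connect,
    }


def render_properties(props):
    """Render a dict as Java-style key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Placeholder values for a single-controller local cluster.
voters = "3000@localhost:9093"
broker = {"broker.id": "0", "zookeeper.connect": "localhost:2181"}

print(render_properties(enable_broker_migration(broker, voters)))
print(render_properties(controller_migration_props(3000, voters, "localhost:2181")))
```

Generating the files from one place keeps the quorum voters string identical on brokers and controllers, which is easy to get wrong when editing files by hand across a rolling restart.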
Metrics to observe during the migration
During the migration, it’s crucial to monitor specific metrics to assess the progress and success of the transition:
- Active Broker Count: Indicates the total number of active brokers known to the controller.
- Migrating Broker Count: Reflects the number of brokers still operating under the old ZooKeeper mode.
- ZooKeeper Migration State: Tracks the stage of migration, providing insights into the overall process.
These metrics are essential for verifying that the migration is proceeding as expected and for making any necessary adjustments.
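One way to turn these three metrics into a go/no-go decision is a small gate that lists the reasons finalization should wait. This is a sketch under assumptions: the field names are hypothetical labels for values you collect from JMX yourself, the expected state value 1 (MIGRATION) follows the Kafka documentation as we understand it, and the migrating broker count is read as described above (brokers still in ZooKeeper mode).

```python
from dataclasses import dataclass

# Assumed numeric value of ZkMigrationState for MIGRATION; verify for your version.
MIGRATION_STATE = 1


@dataclass(frozen=True)
class MigrationSnapshot:
    active_broker_count: int     # brokers known to the controller
    migrating_broker_count: int  # brokers still operating in ZooKeeper mode
    zk_migration_state: int      # raw value of the migration state metric


def blocking_reasons(snapshot: MigrationSnapshot, expected_brokers: int) -> list:
    """Return the reasons finalization should wait; an empty list means go."""
    reasons = []
    if snapshot.active_broker_count != expected_brokers:
        reasons.append(
            f"expected {expected_brokers} active brokers, "
            f"saw {snapshot.active_broker_count}"
        )
    if snapshot.migrating_broker_count > 0:
        reasons.append(
            f"{snapshot.migrating_broker_count} broker(s) still in ZooKeeper mode"
        )
    if snapshot.zk_migration_state != MIGRATION_STATE:
        reasons.append(
            f"migration state is {snapshot.zk_migration_state}, "
            f"expected {MIGRATION_STATE}"
        )
    return reasons


snapshot = MigrationSnapshot(active_broker_count=3, migrating_broker_count=1,
                             zk_migration_state=1)
print(blocking_reasons(snapshot, expected_brokers=3))
```

Returning reasons rather than a bare boolean makes the output usable in an alert or a runbook, since the operator sees what is holding the migration back.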
Finalising the ZooKeeper to KRaft migration
Once the metrics confirm that all brokers have successfully migrated:
- Exiting Migration Mode: The final step involves rolling the controller quorum to fully exit the migration mode, ensuring that all operations are now being handled in KRaft mode.
- Shutting Down ZooKeeper: With the migration complete, ZooKeeper is shut down as it is no longer required, marking the full transition to KRaft.
- Operational Verification: A new topic is created post-migration to verify that the cluster is fully operational under the new system.
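Exiting migration mode comes down to removing the migration-only settings from each controller’s configuration before rolling the quorum. The sketch below assumes those settings are `zookeeper.metadata.migration.enable` and `zookeeper.connect`, as in the Apache Kafka migration documentation; confirm the list for your version before relying on it. The sample controller values are placeholders.

```python
# Settings assumed to be needed only while the cluster is in migration mode.
MIGRATION_ONLY_KEYS = (
    "zookeeper.metadata.migration.enable",
    "zookeeper.connect",
)


def exit_migration_mode(controller_props):
    """Return a copy of a controller config without the migration-only settings."""
    return {
        key: value
        for key, value in controller_props.items()
        if key not in MIGRATION_ONLY_KEYS
    }


# Placeholder controller config for a local cluster.
controller = {
    "process.roles": "controller",
    "node.id": "3000",
    "controller.quorum.voters": "3000@localhost:9093",
    "zookeeper.metadata.migration.enable": "true",
    "zookeeper.connect": "localhost:2181",
}

final_config = exit_migration_mode(controller)
print(sorted(final_config))
```

Each controller would be restarted with its cleaned configuration one at a time, and ZooKeeper is shut down only after the whole quorum has been rolled.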
Post-Migration metrics and adjustments
After exiting the migration mode, it’s important to continue monitoring the system to ensure stability and performance:
- Broker Operational Status: Confirm that all brokers are active and functioning correctly in the new mode.
- System Performance Metrics: Monitor CPU and memory usage to detect any potential overhead or inefficiencies introduced during the migration.
- Topic Creation and Management: Ensure that topics can be created and managed without issues, confirming the operational integrity of the cluster.
Useful learnings and insights from our experience
- Controller and Broker Configuration: It’s recommended to run controllers on separate nodes from brokers to decouple their roles and enhance system resilience.
- Single Node vs. Multi-Node Setup: While a single-node setup might be sufficient for non-critical applications or initial testing, a multi-node configuration is essential for production environments to avoid downtime during upgrades or failures.
- Continuous Monitoring and Optimization: Regularly review and optimize the configuration and performance of the Kafka cluster.
- Educational Resources and Community Support: Leverage the community Slack, documentation, and tutorials to stay updated on best practices and new features in Kafka management.
- Advanced Tooling and Automation: Consider implementing more advanced management tools and automation scripts to streamline operations and reduce manual intervention.