
Bringing Kafka Without Zookeeper Into Production

Sion Smith 31 July 2023

Running Kafka without Zookeeper

The time has finally arrived: we can run Kafka without ZooKeeper. In this post we discuss the limitations of replication under ZooKeeper, how version skew is handled in Kafka, troubleshooting techniques for Kafka clusters in KRaft mode, and the process of upgrading from ZooKeeper mode to KRaft mode.

Kafka without Zookeeper: Zookeeper basics

ZooKeeper stores data in data registers called znodes. Znodes are arranged in a filesystem-like hierarchy, so the name of a znode resembles a file path. The broker config property zookeeper.connect tells the broker which ensemble to connect to and under which (chroot) path. When a broker joins the cluster, it creates a distinct znode under /brokers/ids named after its broker.id property. The ensemble address and path in zookeeper.connect must be identical for all brokers in the cluster, and each broker.id must be unique.
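As a minimal sketch (the host names, chroot path, and ID below are placeholders), the relevant part of a broker's server.properties in ZooKeeper mode looks like this:

    # server.properties (ZooKeeper mode)
    # Ensemble address plus optional chroot path; identical on every broker
    zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
    # Unique per broker; names this broker's znode under /brokers/ids
    broker.id=1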

Kafka without Zookeeper: Replication limitations in ZooKeeper

In ZooKeeper mode, if one of your brokers is offline, for example during a rolling restart, you cannot create a new topic with 3x replication on a three-node cluster, because the controller only places replicas on live brokers. This limitation often leads people to run a minimum of four nodes even if they only need three. In KRaft mode, however, the controller remembers brokers that are temporarily down during a roll and puts them into a fenced state instead of removing them from the cluster. This allows new replicas to be placed on those nodes even when no other nodes are available. As a result, you can run a three-node cluster instead of a four-node one if three nodes are sufficient for your load.
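For instance, in ZooKeeper mode the following topic creation fails on a three-node cluster while one broker is down, because only live brokers are considered for replica placement (the topic name and bootstrap address are illustrative):

    # Requires three live brokers to succeed in ZooKeeper mode
    bin/kafka-topics.sh --create \
      --topic orders \
      --partitions 6 \
      --replication-factor 3 \
      --bootstrap-server broker1:9092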

Kafka without Zookeeper: Upgrading from ZooKeeper mode to KRaft mode

To upgrade from ZooKeeper mode to KRaft mode, the following steps are involved:

  1. Adding KRaft Controllers: Initially, all metadata and brokers are in ZooKeeper mode. To start the upgrade, a quorum of new KRaft controllers is added to the cluster; these controllers will hold the new metadata (a configuration sketch follows this list).
  2. Electing a Leader: The KRaft controllers elect a leader from their own ranks and register it as the controller in ZooKeeper. This allows the new controllers to win the controller election in ZooKeeper.
  3. Loading Metadata: Once the leader is elected, the existing metadata is loaded from ZooKeeper, much like the metadata load the old controller performs in ZooKeeper mode.
  4. Sending Requests: After loading the metadata, the new controllers send out all the requests the old controller used to send, such as LeaderAndIsr and UpdateMetadata requests.
  5. Simulating the Old Controller: The goal is to simulate the old controller for a while, while making metadata changes in the KRaft quorum instead of ZooKeeper. This lets the new controllers take over the old controller's responsibilities gradually.
  6. Rolling Brokers: The brokers are rolled one by one, upgrading each to KRaft mode individually. As each broker is rolled, it stops talking directly to ZooKeeper and instead forwards its admin requests to the KRaft controller.
  7. Removing ZooKeeper: Once all the brokers have been rolled, ZooKeeper can be removed from the cluster. At this point, all the metadata lives in the KRaft quorum.
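As the sketch promised in step 1 (node IDs, host names, and ports are placeholders; consult the migration guide for your Kafka version, as exact settings vary), a migration-enabled KRaft controller is configured roughly like this:

    # controller.properties - new KRaft controller joining a ZooKeeper-mode cluster
    process.roles=controller
    node.id=3000
    controller.quorum.voters=3000@ctrl1:9093,3001@ctrl2:9093,3002@ctrl3:9093
    controller.listener.names=CONTROLLER
    listeners=CONTROLLER://ctrl1:9093
    # Turns on the migration behaviour described above
    zookeeper.metadata.migration.enable=true
    # The controllers still need ZooKeeper access while the migration runs
    zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka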

Handling version skew in Kafka

During a roll in Kafka, we need to coordinate any changes to the API, or to the way we use the existing API, to ensure that all brokers have been rolled before the change takes effect. In ZooKeeper mode, the inter-broker protocol (the inter.broker.protocol.version setting) is used to control the RPC protocols and features supported by the brokers. However, there are some issues with the current use of the inter-broker protocol:

  1. Manual Configuration: The inter-broker protocol is manually configured, which means there is a chance of leaving it out or setting it incorrectly.
  2. Static Configuration: Even if the inter-broker protocol is set correctly, it requires a double roll to update it. First, all brokers need to be upgraded to the new software, and then the inter-broker protocol needs to be changed on all nodes and the brokers rolled again.
  3. No Downgrade Support: Currently, there is no official support for downgrading the inter-broker protocol. While there are some unofficial cases where downgrading is possible, it is not recommended due to potential metadata compatibility issues.
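For reference, this is the statically configured setting these issues refer to (the version value is illustrative):

    # server.properties (ZooKeeper mode)
    # Must be set on every broker; after a software upgrade, a second
    # roll is needed for a new value to take effect
    inter.broker.protocol.version=3.4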

To address these issues, Kafka introduces metadata versioning in KRaft mode. Each inter-broker protocol version will have a corresponding metadata version. Unlike the inter-broker protocol, the metadata version is dynamically configured at the controller level. This means that changing the metadata version does not require a roll and can be done by invoking a controller API. The controller will prevent version changes if there are brokers that have not been upgraded to support the new version. Additionally, Kafka supports downgrading in KRaft mode, allowing dynamic downgrades from the command line. However, there are two types of downgrades: safe and unsafe. Safe downgrades preserve metadata, while unsafe downgrades may result in metadata loss.
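As a hedged illustration using the kafka-features.sh tool (exact subcommands and flags vary between Kafka releases, so check the documentation for yours):

    # Show the currently finalized metadata version
    bin/kafka-features.sh --bootstrap-server broker1:9092 describe

    # Dynamically raise the metadata version; no broker roll required
    bin/kafka-features.sh --bootstrap-server broker1:9092 upgrade --metadata 3.5

    # Safe downgrade; an --unsafe flag exists for downgrades that may lose metadata
    bin/kafka-features.sh --bootstrap-server broker1:9092 downgrade --metadata 3.4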

Troubleshooting Kafka Clusters in KRaft Mode

In KRaft mode, the cluster metadata topic replaces ZooKeeper as the store of record. This means that instead of inspecting the metadata in ZooKeeper, we need to look at the metadata in the internal __cluster_metadata topic. Several tools are available for troubleshooting Kafka clusters in KRaft mode:

  1. Dump Log Tool: This tool allows you to view the contents of a log file and decode the cluster metadata into JSON format. A decoder flag tells the tool how to interpret the records in the topic, and the output includes records such as leader change records and broker registration records.
  2. Metadata Shell Tool: The metadata shell is a powerful tool that replaces the ZooKeeper shell. It can connect to a running controller cluster or read from a snapshot file or log file. The tool reads log entries into memory and constructs a virtual file system with all the metadata information. You can navigate through the metadata using commands like ls and cd. The metadata shell provides information about brokers, dynamic configs, quorum information, and topics.
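A minimal sketch of both tools; the file paths are illustrative and depend on your metadata log directory layout:

    # Decode cluster metadata records into JSON with the dump log tool
    bin/kafka-dump-log.sh --cluster-metadata-decoder \
      --files /var/lib/kafka/__cluster_metadata-0/00000000000000000000.log

    # Browse the metadata as a virtual file system with the metadata shell
    # (inside the shell, use ls and cd to explore brokers, configs, topics, ...)
    bin/kafka-metadata-shell.sh \
      --snapshot /var/lib/kafka/__cluster_metadata-0/00000000000000000000.log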

Monitoring the quorum in KRaft mode

When monitoring the quorum in KRaft mode, it is important to watch metrics such as the Raft state and the metadata offset to understand how they evolve over time. Additionally, there is an RPC, DescribeQuorum, that can be sent to the controller quorum to retrieve information about it: the leader ID and leader epoch of the cluster metadata topic, the log end offsets of each member, and the latest offset fetched by each member. Monitoring these metrics and issuing this RPC helps in troubleshooting and in verifying the stability of the quorum.
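The kafka-metadata-quorum.sh tool exposes this information from the command line (the bootstrap address is a placeholder):

    # Leader ID, leader epoch, and high watermark of the metadata quorum
    bin/kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --status

    # Per-replica log end offsets and fetch lag
    bin/kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --replication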

Kafka’s transition from ZooKeeper mode to KRaft mode brings significant improvements in scalability, reliability, and ease of management. Metadata versioning and downgrade support provide compatibility and flexibility during the upgrade process. If you would like support in moving your clusters off ZooKeeper and onto KRaft, please contact us for more information.

[Diagram: ZooKeeper to KRaft migration plan]
