
Apache Kafka’s KRaft Protocol: How to Eliminate Zookeeper and Boost Performance by 8x

Sion Smith 29 July 2025

In the Apache Kafka ecosystem, metadata management has long been the Achilles’ heel of cluster operations. For years, we’ve wrestled with Zookeeper dependencies, watched clusters struggle under the weight of external coordination, and accepted that scaling beyond certain partition limits meant architectural compromises. Those days are ending.

Apache Kafka’s KRaft protocol represents more than just a Zookeeper replacement—it’s a fundamental reimagining of how distributed streaming platforms should handle cluster coordination. The OSO engineers have extensively tested KRaft in production environments, and the results speak volumes: 8x performance improvements, elimination of external dependencies, and the ability to scale to millions of partitions without breaking a sweat.

This isn’t just another incremental update. KRaft’s controller-based architecture eliminates the metadata management complexity that has plagued Kafka deployments for years, creating a more robust and scalable streaming platform that finally delivers on Kafka’s original promise of true distributed systems simplicity.

The Metadata Challenge in Distributed Systems

Why Metadata Management Makes or Breaks Kafka Clusters

Every Kafka cluster lives or dies by its metadata. Broker IDs, leader assignments, in-sync replicas, ACLs, and topic configurations form the nervous system of any Kafka deployment. When metadata management fails, everything fails—producers can’t find partition leaders, consumers lose track of offsets, and administrators watch helplessly as their carefully orchestrated data pipelines grind to a halt.

The traditional Zookeeper-based approach created a web of dependencies that looked elegant on paper but proved nightmarish in practice. External dependencies meant managing configuration, security, and networking across multiple Apache projects. Each additional layer introduced new failure points, new security considerations, and new operational complexity that stretched engineering teams thin.

Perhaps most frustrating was the scalability ceiling. Zookeeper’s coordination overhead meant that as partition counts grew, performance degraded sharply. The OSO engineers have seen enterprise deployments hit walls at partition counts that should have been trivial for modern distributed systems to handle. Zookeeper’s batch processing approach to metadata updates created latency spikes that rippled through entire streaming architectures.

The Configuration Nightmare

Managing a traditional Kafka cluster meant juggling configurations across multiple systems:

# Zookeeper configuration
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

# Kafka broker configuration
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
log.dirs=/var/kafka-logs
num.network.threads=8
num.io.threads=8

Each service required separate monitoring, separate security policies, and separate disaster recovery procedures. When something went wrong—and it inevitably did—troubleshooting meant diving into multiple log files, multiple configuration sets, and multiple Apache projects with different operational patterns.
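A typical troubleshooting session made the point painfully clear. The paths and hostnames below are only illustrative, since they vary by installation, but the pattern of chasing a single incident across two unrelated systems was universal:

# Following one incident across two separate systems (paths are illustrative)
tail -f /var/log/zookeeper/zookeeper.out &    # Zookeeper server log
tail -f /var/log/kafka/controller.log &       # Kafka controller log
tail -f /var/log/kafka/server.log             # Kafka broker log

# Separate health checks for each system
echo ruok | nc zk1 2181                                        # Zookeeper four-letter-word check
kafka-broker-api-versions.sh --bootstrap-server kafka1:9092    # Kafka liveness check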

From External Coordination to Internal Consensus

KRaft’s Controller Architecture Deep Dive

KRaft eliminates this complexity through an elegant controller-based architecture that brings metadata management inside Kafka itself. Instead of relying on external coordination, KRaft designates specific nodes as controllers that form a consensus group responsible for all cluster metadata.

The architecture is deceptively simple: a quorum of controller nodes, one active leader, and multiple followers. The active controller handles all metadata writes, while followers maintain replicated copies and stand ready to take over leadership at a moment’s notice.

# KRaft controller configuration
process.roles=controller
node.id=1
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
listeners=CONTROLLER://controller1:9093
log.dirs=/var/kafka-logs

The genius lies in the details. Metadata isn’t stored in some external system with its own operational characteristics—it’s stored in a special Kafka topic called __cluster_metadata. This single-partition topic maintains a chronological record of all metadata changes, with each update receiving a sequential offset (0, 1, 2, and so forth).

The __cluster_metadata Topic: Kafka Managing Kafka

This approach solves multiple problems simultaneously. First, it eliminates external dependencies entirely. The metadata management system uses the same replication, persistence, and consistency guarantees that make Kafka reliable for application data. Second, it provides natural ordering and history tracking—administrators can see exactly when each metadata change occurred and even replay the sequence of events that led to any particular cluster state.

# Inspect the metadata topic
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
  --topic-list __cluster_metadata --describe

# View metadata records (requires filesystem access on a broker or controller node)
kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/kafka-logs/__cluster_metadata-0/00000000000000000000.log \
  --print-data-log

The single-partition design might seem like a bottleneck, but it’s actually a feature. Metadata operations in Kafka clusters follow strict ordering requirements—you can’t assign partition leadership before creating the partition, for instance. By channelling all metadata through a single partition, KRaft ensures natural ordering without complex coordination protocols.

Controller followers maintain local replicas of this metadata topic, which means they have immediate access to the complete cluster state. When leadership changes, there’s no cold start period where the new leader must rebuild its understanding of the cluster—everything is already local and immediately available.
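This is easy to verify on a running cluster. The kafka-metadata-quorum.sh tool, available from Kafka 3.3 onwards, reports how far each controller’s local replica of the metadata log trails the active leader; in a healthy cluster the lag values sit at or near zero:

# Check replication of the metadata log across the controller quorum (Kafka 3.3+)
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 \
  describe --replication

The output lists each voter’s log end offset, lag, and whether it is currently acting as Leader or Follower.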

Millisecond Elections vs Multi-Step Dependencies

Leader Election Revolution

The difference between KRaft and Zookeeper leader election isn’t just quantitative—it’s qualitative. Zookeeper’s multi-step process requires external coordination, state synchronisation, and multiple round-trips across network boundaries. When a Kafka controller needs to be elected, Zookeeper must first recognise the failure, coordinate among its own nodes, notify Kafka brokers, and then wait while the new controller rebuilds its metadata understanding from external sources.

KRaft’s Raft-based consensus operates entirely within the Kafka cluster. When a controller fails, the remaining controllers immediately begin a new election round using internal voting mechanisms. Because all metadata is stored locally in the __cluster_metadata topic, the newly elected leader can begin serving requests immediately—no external synchronisation required, no metadata rebuilding delays.

The Numbers Don’t Lie

The OSO engineers measured this difference in controlled environments with 2 million partitions. Zookeeper-based clusters required multiple seconds for leader election, followed by additional time for metadata synchronisation. KRaft clusters completed the entire process in milliseconds. This isn’t a marginal improvement—it’s a fundamental shift in how quickly Kafka clusters can recover from failures.

# Monitor controller election in KRaft mode
kafka-metadata-shell.sh --snapshot /var/kafka-logs/__cluster_metadata-0/00000000000000000001.log

# Check the current controller quorum leader and epoch (Kafka 3.3+)
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

Elimination of Cold Starts

Perhaps more importantly, KRaft eliminates the dreaded “cold start” problem that plagued Zookeeper-based deployments. When a traditional Kafka controller came online, it needed to rebuild its entire understanding of cluster metadata by querying Zookeeper. For large clusters, this process could take minutes, during which the cluster remained in a degraded state.

KRaft controllers maintain complete metadata locally. When a new leader is elected, it doesn’t need to rebuild anything—it already has the complete cluster state in its local replica of the __cluster_metadata topic. The transition from follower to leader involves updating in-memory pointers, not downloading gigabytes of state information.

# View controller logs during election
tail -f /var/kafka-logs/controller.log | grep "became leader"

# Check the __cluster_metadata partition state reported by each broker
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe \
  | grep '^{' | jq '.brokers[].logDirs[].partitions[] | select(.partition | startswith("__cluster_metadata"))'

Real-World Performance Impact

Controlled Shutdown Improvements

The OSO engineers conducted extensive testing comparing KRaft and Zookeeper performance across various scenarios. The controlled shutdown tests proved particularly revealing. With 2 million partitions—a scale that pushes most Kafka deployments to their limits—Zookeeper-based clusters required approximately 120 seconds to complete an orderly shutdown.

The same test with KRaft clusters completed in 20-30 seconds, representing a 6x improvement in shutdown time. This improvement cascades through operational procedures: rolling updates complete faster, maintenance windows shrink, and the risk of cascading failures during planned maintenance drops significantly.

# Graceful shutdown timing test
time kafka-server-stop.sh

# Monitor shutdown progress
kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe \
  | grep '^{' | jq '.brokers[].logDirs[].error'

Recovery Time Revolution

The uncontrolled shutdown recovery tests delivered even more dramatic results. When Zookeeper-based clusters experienced unexpected failures, recovery times often exceeded 450 seconds. During these recovery periods, the entire cluster remained unavailable—producers couldn’t write, consumers couldn’t read, and administrators could only wait while Zookeeper coordination protocols slowly rebuilt cluster state.

KRaft clusters recovered from identical failures in 20-30 seconds. The difference stems from KRaft’s local metadata availability—when controllers come back online, they don’t need to coordinate with external systems or rebuild state from scratch. They simply resume operations with their locally stored metadata, elect a new leader if necessary, and begin processing requests.
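A quick health check confirms the recovery once the nodes are back. Both commands below use standard kafka-topics.sh options and should return no rows once the cluster has fully caught up:

# Verify recovery: both commands should produce empty output on a healthy cluster
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions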

Throughput and Latency Gains

The performance improvements extend beyond recovery scenarios. In steady-state operation, the OSO engineers measured 8.2x throughput improvements and significantly lower average latencies in KRaft clusters compared to equivalent Zookeeper-based deployments.

# Performance testing with kafka-producer-perf-test
kafka-producer-perf-test.sh \
  --topic performance-test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092

# Monitor broker performance metrics
kafka-run-class.sh kafka.tools.JmxTool \
  --object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi

These improvements result from eliminating the batch processing overhead inherent in Zookeeper coordination. KRaft provides real-time metadata updates, reducing the latency between metadata changes and their availability throughout the cluster. When a partition leader fails, for instance, the replacement leader election and metadata propagation happen immediately rather than waiting for the next Zookeeper batch processing cycle.
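One simple way to observe this behaviour in a test environment is to watch partition leadership before and after stopping a broker. The topic name reuses the performance-test topic from the benchmark above and is purely illustrative:

# Record current partition leaders for a test topic
kafka-topics.sh --bootstrap-server kafka1:9092 \
  --describe --topic performance-test | grep "Leader:"

# Stop one broker (test environments only), then re-check leadership
# from a broker that is still running; new leaders appear almost immediately
kafka-server-stop.sh
kafka-topics.sh --bootstrap-server kafka2:9092 \
  --describe --topic performance-test | grep "Leader:"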

What This Means for Your Kafka Infrastructure

Deployment Simplification

The operational benefits of KRaft extend far beyond performance metrics. Perhaps the most immediate impact is deployment simplification. Instead of managing two separate Apache projects with different operational characteristics, teams manage a single distributed system with consistent tooling, monitoring, and operational procedures.

# KRaft-only deployment configuration
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
listeners=PLAINTEXT://kafka1:9092,CONTROLLER://kafka1:9093
log.dirs=/var/kafka-logs

This unified approach eliminates entire classes of operational complexity. Security policies, network configurations, and monitoring systems need only account for Kafka nodes. Disaster recovery procedures focus on a single system rather than coordinating recovery across multiple projects with different backup and restore characteristics.

Configuration Streamlining

KRaft’s controller functionality integrates seamlessly with existing broker operations. The same nodes that handle producer and consumer traffic can also serve as metadata controllers, eliminating the need for dedicated coordination infrastructure in smaller deployments.

# Combined broker and controller node
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
listeners=PLAINTEXT://kafka1:9092,CONTROLLER://kafka1:9093
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
log.dirs=/var/kafka-logs

For larger deployments requiring dedicated controller nodes, the configuration remains straightforward:

# Dedicated controller node
process.roles=controller
node.id=101
controller.quorum.voters=101@controller1:9093,102@controller2:9093,103@controller3:9093
listeners=CONTROLLER://controller1:9093
log.dirs=/var/kafka-metadata

Scalability Planning Transformation

KRaft’s support for millions of partitions opens architectural possibilities that were previously impractical. Event-driven architectures that required careful partition planning to avoid Zookeeper bottlenecks can now scale more naturally. Microservices architectures that needed to consolidate topics to manage partition counts can adopt more granular event modelling.

# Create high partition count topics (previously problematic with Zookeeper)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic high-scale-events \
  --partitions 10000 --replication-factor 3

# Monitor partition leadership distribution
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic high-scale-events | grep "Leader:"

The elimination of external coordination overhead means that partition operations—creation, deletion, leadership changes—complete faster and with less system impact. This enables more dynamic topic management strategies and supports use cases where topics are created and destroyed frequently based on application needs.
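A rough but telling exercise is to time topic creation and deletion on a test cluster. The topic name and partition count below are arbitrary:

# Time how long a short-lived, high-partition topic takes to create and delete
time kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic ephemeral-events --partitions 500 --replication-factor 3

time kafka-topics.sh --bootstrap-server localhost:9092 \
  --delete --topic ephemeral-events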

Looking Forward: The Post-Zookeeper Era

KRaft represents more than just a Zookeeper replacement—it’s a fundamental architectural evolution that positions Kafka for the next generation of real-time data platforms. The move from external coordination to internal consensus reflects broader industry trends towards simplified, self-contained distributed systems that minimise operational complexity while maximising reliability and performance.

As Kafka 4.0 makes KRaft mandatory, understanding this architectural shift becomes essential for any organisation building robust streaming data pipelines. The performance improvements, operational simplifications, and scalability enhancements aren’t just nice-to-have features—they’re foundational capabilities that enable entirely new classes of real-time applications.

The OSO engineers have seen firsthand how KRaft transforms Kafka operations. Clusters that previously required careful partition planning now scale effortlessly. Recovery procedures that once involved complex coordination across multiple systems now complete in seconds. Monitoring strategies that needed to account for multiple failure modes now focus on a single, unified system.

This isn’t just evolution—it’s the maturation of Apache Kafka into the distributed streaming platform it was always meant to be. For teams building the next generation of event-driven architectures, KRaft isn’t just an upgrade—it’s an enabler of possibilities that were previously out of reach.

Get expert Kafka support for your migration

Ready to migrate from Zookeeper to KRaft? Our Kafka specialists have guided dozens of enterprise migrations with zero downtime. Book a consultation to discuss your cluster's specific requirements and migration timeline.
