blog by OSO

How to run Apache Kafka across multiple Kubernetes clusters

Sion Smith 2 May 2025

In the Kafka ecosystem, resilience isn’t just a buzzword: it’s the difference between seamless operations and catastrophic failure. The OSO engineers have seen this firsthand across more than 30 enterprise engagements. When a mission-critical Kafka cluster fails, businesses can lose millions. But what if there were an architecture that virtually eliminates this risk?

Traditional single-cluster Kafka deployments create dangerous single points of failure, even when running on Kubernetes. This technical deep-dive explains how our engineers pioneered a multi-cluster Kafka architecture using Kubernetes operators, solving complex networking, synchronisation and security challenges to provide unprecedented resilience without compromising performance.

From Bare Metal to Kubernetes: Why Container Orchestration Changed Everything

The evolution of Kafka deployment models tells a compelling story of operational maturity. Let’s rewind to understand why running Kafka on Kubernetes makes sense in the first place.

A few years ago, setting up Kafka meant dedicated virtual machines or even bare metal servers. Managing them was complex, scaling required manual intervention, and deployments were painfully slow. The cloud improved matters, but running Kafka at scale still demanded significant operational expertise.

Enter Kubernetes—the ultimate abstraction layer for deploying and managing distributed systems. With Kubernetes, we gained:

  • Declarative configuration that eliminates manual broker provisioning
  • Automated orchestration that simplifies resource allocation
  • Dynamic scaling capabilities without hardware constraints
  • Automated failover mechanisms

But Kubernetes alone isn’t enough. Kafka’s stateful nature creates unique challenges:

  • It needs persistent storage
  • It requires coordinated scaling
  • It demands careful failure detection
  • It necessitates controlled rolling updates

That’s where Kubernetes operators come in. Think of an operator as your Kafka reliability engineer in software form—continuously monitoring deployments, ensuring broker health, automating scaling, and managing configuration.
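To make this concrete, here is a minimal sketch of the kind of declarative specification an operator reconciles. It uses the Strimzi `Kafka` custom resource API; the cluster name, Kafka version and replication settings are illustrative, not a production recommendation:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    listeners:
      - name: replication
        port: 9092
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      min.insync.replicas: 2
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

You apply this resource and the operator creates, monitors and heals the brokers to match it; you never provision a broker by hand.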

Beyond Simple Deployment: Solving the Hard Problems of Distributed Kafka

Running Kafka across multiple Kubernetes clusters sounds appealing, but it’s not as simple as deploying more brokers in different locations. Several serious challenges emerge:

1. Networking Isolation

Kubernetes clusters are isolated by default. Enabling brokers to communicate seamlessly across cluster boundaries requires sophisticated networking solutions.

2. Metadata and State Synchronisation

In a multi-cluster Kafka setup, every broker must see the same metadata: topic partition leadership and overall cluster state. Keeping that metadata in sync without introducing excessive replication lag requires careful engineering.

3. Fault Tolerance and Failover

If one Kubernetes cluster fails, how do we ensure Kafka continues operating without interruption or split-brain scenarios? The architecture must distribute leadership intelligently.

4. Security and Authentication

Cross-cluster communication introduces significant security considerations. Every connection between clusters represents a potential vulnerability that must be properly secured.

Strimzi: The Foundation for Advanced Kafka Architecture

Rather than using MirrorMaker or similar replication tools (which introduce latency and operational overhead), we need an operator-driven approach. Strimzi, the open-source Kubernetes operator for Apache Kafka, provides the foundation with two key features:

1. Kafka Node Pools

This relatively new abstraction in Strimzi allows defining groups of Kafka nodes independently. It abstracts away complexity so you can specify what you want declaratively, and the operator handles the implementation details—no more manually restarting brokers or worrying about partition distribution.
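As a sketch, a node pool is its own custom resource linked to the Kafka cluster by a label. The pool name, replica count and storage size below are illustrative; one pool per Kubernetes cluster is the pattern this article builds towards:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers-cluster-a
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 100Gi
```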

2. Strimzi Pod Sets

This controls deployment of pods within a Kubernetes cluster. By combining Kafka Node Pools with Pod Sets, we can distribute brokers and controllers across multiple Kubernetes clusters while treating them as a single logical cluster.

Solving the Networking Puzzle with Submariner and Cilium

We employ two cross-cluster networking technologies to enable seamless communication:

1. Submariner

Submariner establishes secure cross-cluster communication channels, enabling Kafka pods to communicate without complex networking hacks.
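Once Submariner connects the clusters, a service still has to be explicitly exported before peer clusters can resolve it. A hedged sketch using the multi-cluster services `ServiceExport` API that Submariner implements (the service and namespace names are illustrative; Strimzi names its broker headless service after the cluster):

```yaml
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-cluster-kafka-brokers   # Strimzi's headless service for broker pods
  namespace: kafka
```

Exported services then resolve across clusters under Submariner’s `clusterset.local` DNS domain rather than the usual in-cluster `cluster.local` names.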

2. Cilium

Cilium provides eBPF-based networking, ensuring better security, visibility, and control over how data moves between clusters.

Once Kubernetes clusters are connected, we need to configure Kafka to understand the distributed topology. Two key configurations enable this:

1. Controller Quorum Voters

Instead of keeping all controller pods in one Kubernetes cluster, we distribute them across multiple clusters, ensuring seamless failover if any single cluster fails.

2. Advertised Listeners

These define how brokers and clients find each other. We dynamically configure advertised listeners to use Submariner and Strimzi cross-service names, ensuring cross-cluster communication happens transparently.
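As an illustrative sketch of what these two settings resolve to in raw Kafka configuration (broker IDs, ports and DNS names are assumptions, following Submariner-style `clusterset.local` naming):

```properties
# One KRaft voter per controller pod, one pod per Kubernetes cluster
controller.quorum.voters=0@controller-0.cluster-a.kafka.svc.clusterset.local:9090,1@controller-1.cluster-b.kafka.svc.clusterset.local:9090,2@controller-2.cluster-c.kafka.svc.clusterset.local:9090

# Each broker advertises a cross-cluster-resolvable name instead of an
# in-cluster service DNS name
advertised.listeners=REPLICATION://broker-0.cluster-a.kafka.svc.clusterset.local:9091
```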

For security, we encrypt all cross-cluster traffic, enforce strict RBAC policies, and store remote-cluster credentials securely, exposing them to the central cluster operator deployment through environment variables.

Bringing It All Together: A Resilient, Seamless Kafka Experience

From a user perspective, the complexity is hidden. Engineers simply apply the Kafka and Kafka Node Pool custom resources in one of the Kubernetes clusters; the clusters themselves are already connected via Submariner or Cilium.

Users set a few environment variables in the central cluster operator deployment to indicate we’re running Kafka across multiple Kubernetes clusters. The operator then:

  1. Dynamically generates controller quorum voter configurations
  2. Sets up appropriate advertised listeners
  3. Distributes Kafka broker and controller pods across Kubernetes clusters
  4. Establishes secure cross-cluster communication
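The first step above, dynamically generating the quorum voter configuration, can be sketched in a few lines of Python. The DNS shape follows Submariner-style clusterset names; the pod, cluster and namespace names are all illustrative:

```python
def quorum_voters(controllers, port=9090, domain="svc.clusterset.local"):
    """Build a KRaft controller.quorum.voters string for controller pods
    spread across several Kubernetes clusters.

    `controllers` maps a node ID to a (pod, cluster, namespace) tuple.
    """
    voters = []
    for node_id, (pod, cluster, namespace) in sorted(controllers.items()):
        # Each voter entry is <id>@<cross-cluster DNS name>:<port>
        voters.append(f"{node_id}@{pod}.{cluster}.{namespace}.{domain}:{port}")
    return ",".join(voters)


# One controller pod per Kubernetes cluster (names are illustrative)
controllers = {
    0: ("controller-0", "cluster-a", "kafka"),
    1: ("controller-1", "cluster-b", "kafka"),
    2: ("controller-2", "cluster-c", "kafka"),
}
print(quorum_voters(controllers))
```

The operator performs the equivalent of this at reconciliation time, so the voter list always reflects the pods that actually exist across the clusters.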

The cluster operator in each remote cluster reconciles the Strimzi Pod Set custom resources to ensure everything runs smoothly.

A common question is: “What happens if the primary cluster fails?” The key design principle is that we don’t rely on a single central cluster. Leadership is distributed across multiple Kubernetes clusters, enabling seamless failover, and because the controller quorum and partition replicas span clusters, the loss of any single cluster cannot produce a split-brain scenario.

In standard Strimzi deployments, controller quorum voters and advertised listeners are automatically generated based on detected broker configuration. By default, Strimzi assumes all pods reside within a single Kubernetes cluster and uses internal service DNS names for communication.

In our multi-cluster approach, pods span multiple Kubernetes clusters, requiring direct pod-to-pod communication to avoid introducing additional latency.

Performance and Validation

The OSO engineers tested this architecture using the OpenMessaging Benchmark, a framework for measuring the performance of distributed messaging systems. The results were promising, showing minimal overhead compared to single-cluster deployments.

This approach enables a truly resilient multi-cluster Kafka deployment that functions as a single logical cluster despite spanning multiple physical locations. Key benefits include:

  • High resilience and availability – No single point of failure at the cluster level
  • Automation and scalability – Leveraging Kubernetes and Strimzi for operational simplicity
  • Cloud-native architecture – Future-proofed for evolving infrastructure
  • Solved cross-cluster communication – No need for complex replication tools

Implementing Your Own Multi-Cluster Kafka Architecture

To implement this architecture, follow these key steps:

  1. Deploy Strimzi Operators: Install the Strimzi operator in each Kubernetes cluster.
  2. Establish Cross-Cluster Networking: Configure Submariner or Cilium between all Kubernetes clusters.
  3. Configure the Central Operator: Set environment variables in the central cluster operator deployment to enable multi-cluster awareness.
  4. Apply Custom Resources: Deploy Kafka and Kafka Node Pool custom resources in the central cluster.
  5. Security Configuration: Implement proper encryption and RBAC policies for cross-cluster traffic.
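For step 3, the wiring might look like the fragment below in the central cluster operator Deployment. The variable names are hypothetical, chosen here for illustration, and are not part of the upstream Strimzi API:

```yaml
# Fragment of the central cluster operator Deployment (variable names hypothetical)
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator
          env:
            - name: MULTI_CLUSTER_ENABLED     # hypothetical: enable multi-cluster awareness
              value: "true"
            - name: REMOTE_CLUSTERS           # hypothetical: peer cluster identifiers
              value: "cluster-b,cluster-c"
            - name: REMOTE_KUBECONFIG_SECRET  # hypothetical: Secret holding remote credentials
              value: "remote-clusters-kubeconfig"
```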

For monitoring, standard Kafka monitoring practices apply, but pay special attention to cross-cluster network metrics and operator logs. Testing should include simulated cluster failures to verify resilience before production deployment.

The Future of Distributed Kafka

This isn’t just about solving a single problem—the OSO engineers are pushing the boundaries of what’s possible with Kafka on Kubernetes. As the ecosystem evolves, architectures like this will shape the future of distributed data systems.

Many businesses aren’t fully cloud-native yet. Industries like finance, healthcare, and telecommunications often face strict regulations around data sovereignty and compliance, requiring certain workloads to remain on-premises. Multi-cluster Kafka in on-premises or hybrid modes provides the resilience these industries demand without sacrificing sovereignty.

The future roadmap includes further optimisation and improved failure detection with reduced latency. These features are still in the proposal stage within the Strimzi community, but approval is expected soon.

Is your current Kafka architecture providing sufficient resilience for your business-critical data streams? If not, perhaps it’s time to consider a multi-cluster approach. The technology is mature, the benefits are clear, and the operational complexity can be managed through proper tooling and expertise.

As data becomes ever more critical to business operations, architectures that eliminate single points of failure will become the standard. Stay ahead of the curve by implementing multi-cluster Kafka today.

Get started with multi-region Kafka clusters today

If you are looking to deploy Apache Kafka across multiple regions using Kubernetes, reach out to one of our experts.
