blog by OSO

How to avoid configuration drift across multiple Kafka environments using GitOps

Sion Smith 2 August 2023

At OSO we deploy enterprise-ready, self-hosted Kafka clusters almost daily for a wide range of clients. In this post we share our best practices for keeping configurations consistent and avoiding configuration drift across multiple Kafka environments. We unpack the redundancy and inconsistency that crop up when scaling to support hundreds of applications across an enterprise, and show how challenges such as configuration drift can be addressed effectively.

Configuration Drift: The challenge of scaling

Scaling a multi-tenanted Kafka cluster (many teams using the same brokers) to accommodate a large number of applications is a hurdle that is not exclusive to Kafka or streaming platforms; it is a common issue across all industries. To illustrate, consider Sam’s sandwich shop in a fictional town. Initially, this shop offers a customizable sandwich experience, allowing customers to tailor-make their sandwiches. However, as popularity soars and orders pile up, the staff are tasked with replicating the same sandwiches repeatedly. This manual process is susceptible to human error, leading to inconsistencies in the final product delivered to customers.

In a similar vein, in a Kafka environment, when a new application team is onboarded to the Kafka cluster for event streaming, they often end up repeating the same set of operations multiple times. These operations include creating topics, adjusting topic configurations, assigning access to producers and consumers, and managing schemas. These repetitive tasks elevate the risk of errors and inconsistencies, which could wreak havoc during production outages.

Consistency across environments using GitOps

Large enterprises that support hundreds of applications sometimes opt to deploy multiple Kafka clusters per environment, and achieving consistency across those environments and clusters is paramount. To enhance isolation and minimise the blast radius, such a company implements a multi-cluster topology consisting of numerous clusters, most of which are built in an active-active replicated fashion across different regions. However, configuration inconsistency across replicated clusters presents a significant risk, often going unnoticed until a disaster recovery event happens.

To mitigate this risk, we employ a GitOps-based management plane acting as a central layer for governing and managing Kafka clusters. This management layer guarantees that any changes to the clusters’ state, including topic management, access control, and schema management, are conducted through Git. When a topic is created, the change is made in a YAML configuration file, which is then applied by the Kubernetes operator to create the topic with identical configuration on both replicated clusters. This transactional approach ensures the change is deemed complete only if it is persisted on both clusters; otherwise, it is rolled back.
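The all-or-nothing behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the operator itself: `InMemoryCluster` and `apply_topic_everywhere` are hypothetical names, the clusters are simulated in memory, and the sketch assumes the topic does not exist on any cluster beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryCluster:
    """Hypothetical stand-in for one Kafka cluster's admin interface."""
    name: str
    topics: dict = field(default_factory=dict)
    fail_on_create: bool = False  # lets the example simulate an outage

    def create_topic(self, topic: str, config: dict) -> None:
        if self.fail_on_create:
            raise RuntimeError(f"{self.name}: could not create {topic}")
        self.topics[topic] = dict(config)

    def delete_topic(self, topic: str) -> None:
        self.topics.pop(topic, None)


def apply_topic_everywhere(clusters, topic: str, config: dict) -> bool:
    """Create the topic on every cluster, or leave it on none of them."""
    applied = []
    try:
        for cluster in clusters:
            cluster.create_topic(topic, config)
            applied.append(cluster)
    except Exception:
        # Undo the partial change so the replicated clusters do not diverge.
        for cluster in reversed(applied):
            cluster.delete_topic(topic)
        return False
    return True


region_a = InMemoryCluster("region-a")
region_b = InMemoryCluster("region-b", fail_on_create=True)

# region-b rejects the change, so the copy on region-a is removed again.
applied = apply_topic_everywhere([region_a, region_b], "orders", {"partitions": 6})
print(applied, region_a.topics)  # False {}
```

A real implementation would also have to handle updates to existing topics, where rolling back means restoring the previous configuration rather than deleting the topic.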


Addressing inconsistency across environments

While the GitOps-based management plane tackles the inconsistency issue across replicated clusters, the challenge of inconsistency across different environments remains. To solve this, we use a Git branching strategy of one branch per environment; the branches are kept separate and are never merged. The Dev and Test branches are open to new creations and updates, allowing application teams to construct and test their configurations in the development environment. The Prod branch is closed to direct modifications, so changes reach higher environments only in a controlled and consistent way.

To move configurations from the development environment branch to higher environments, we have an automation container called Promote and Release. It takes a snapshot of all configurations from the development branch and pushes them into the Git production branch. This snapshot acts as a version-controlled release, facilitating easy promotion of configurations to higher environments.
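As a rough illustration of the snapshot idea, the sketch below freezes a development configuration into a release and copies it into a production state. It is not the Promote and Release tool: plain dictionaries stand in for Git branches, the content-hash version id is an assumption made for the example, and `create_release` and `promote` are hypothetical names.

```python
import hashlib
import json


def create_release(dev_configs: dict) -> dict:
    """Freeze the current dev configuration into a versioned release."""
    # Canonical JSON, so identical configurations get the same version id.
    payload = json.dumps(dev_configs, sort_keys=True)
    version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    # json.loads returns a deep copy, so later dev edits cannot leak in.
    return {"version": version, "configs": json.loads(payload)}


def promote(release: dict) -> dict:
    """Build the desired state of a higher environment from a release."""
    configs_copy = json.loads(json.dumps(release["configs"]))
    return {"release": release["version"], "configs": configs_copy}


dev_branch = {"topics/orders.yaml": {"partitions": 3, "retention.ms": 86400000}}
release = create_release(dev_branch)

# Work continues on dev after the snapshot was taken...
dev_branch["topics/orders.yaml"]["partitions"] = 99

# ...but production receives exactly what was released.
prod_branch = promote(release)
print(prod_branch["configs"]["topics/orders.yaml"]["partitions"])  # 3
```

In a Git-backed setup the release would more naturally be a commit or tag, which also provides the rollback path through ordinary Git history.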

With the release and promotion automation, any new changes are first tested in the development environment. Then, a release is created and promoted to higher environments. If any problems arise during testing, the configuration can be rolled back to its previous state using native Git workflow. This approach eliminates the need for monotonous operations and ensures consistency across environments.

The Promote automation also includes a “dry run” feature that compares the release version with the current state of the cluster in the higher environment. This comparison offers detailed information on the changes that will be made, assisting release engineers in spotting any disruptive changes and making informed decisions about promoting the configuration to production.
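A dry run of this kind boils down to a diff between the released configuration and the live one. The sketch below assumes both are available as mappings from topic name to configuration; the `dry_run` function and the create, update, and delete labels are hypothetical and only illustrate the comparison.

```python
def dry_run(release: dict, live: dict) -> list:
    """List the changes that promoting `release` would make to `live`.

    Both arguments map a topic name to its configuration dict.
    """
    changes = []
    for topic in sorted(release.keys() | live.keys()):
        wanted, current = release.get(topic), live.get(topic)
        if current is None:
            changes.append(("create", topic, wanted))
        elif wanted is None:
            changes.append(("delete", topic, current))
        elif wanted != current:
            # Report only the keys whose values differ, as (current, wanted).
            delta = {
                key: (current.get(key), wanted.get(key))
                for key in sorted(wanted.keys() | current.keys())
                if current.get(key) != wanted.get(key)
            }
            changes.append(("update", topic, delta))
    return changes


release = {
    "orders": {"partitions": 12, "retention.ms": 604800000},
    "payments": {"partitions": 6},
}
live = {
    "orders": {"partitions": 6, "retention.ms": 604800000},
    "audit": {"partitions": 3},
}

for change in dry_run(release, live):
    print(change)
# ('delete', 'audit', {'partitions': 3})
# ('update', 'orders', {'partitions': (6, 12)})
# ('create', 'payments', {'partitions': 6})
```

Surfacing deletions explicitly is a design choice in this sketch: they are the changes a release engineer is most likely to want to stop before they reach production.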


Moreover, the Promote automation incorporates scaling factors, allowing for scaling of topic partitions and read/write throughputs in higher environments. This flexibility enables customization using the Kustomize framework and optimization of configurations based on specific requirements in each environment.
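To make the idea of scaling factors concrete, here is a minimal Python sketch. It does not use Kustomize and is not how the Promote automation is implemented; `scale_topic`, the `write_mbps` field, and the factor values are all assumptions chosen for illustration.

```python
import math


def scale_topic(topic: dict, factors: dict) -> dict:
    """Return a copy of `topic` with the selected numeric fields scaled up."""
    scaled = dict(topic)
    for key, factor in factors.items():
        if key in scaled:
            # Round up so a fractional factor never shrinks a value to zero.
            scaled[key] = math.ceil(scaled[key] * factor)
    return scaled


dev_topic = {"partitions": 3, "write_mbps": 5, "retention.ms": 86400000}

# Hypothetical production factors: 4x the partitions, 10x the write throughput.
prod_factors = {"partitions": 4, "write_mbps": 10}

print(scale_topic(dev_topic, prod_factors))
# {'partitions': 12, 'write_mbps': 50, 'retention.ms': 86400000}
```

Fields without a factor, such as the retention setting here, pass through unchanged, so the environments differ only where a difference was declared.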


Configuration drift: Benefits of GitOps management of Kafka clusters

Using this GitOps management approach offers several advantages. It helps to avoid redundancy, reduce inconsistency, and gives a significant edge to developers. By centralising configuration management and governance of Kafka clusters, we can ensure that changes are consistent and controlled across environments. This approach also bolsters isolation and minimises the blast radius, reducing the impact of configuration issues.

Managing consistent configurations across multiple Kafka environments is crucial for avoiding configuration drift and maintaining stability and reliability in a large-scale enterprise. By centralising configuration management, using GitOps for governance, and promoting configurations in a controlled and consistent way, organisations can avoid redundancy, reduce inconsistency, and enhance overall application performance and reliability. This GitOps management approach not only boosts consistency but also strengthens isolation and limits the blast radius in case of configuration issues.

We are considering open-sourcing our Promote and Release automation. If you are interested in learning more or adopting this approach, please contact us.

