How to avoid configuration drift across multiple Kafka environments using GitOps
Sion Smith, 2 August 2023
At OSO we deploy enterprise-ready, self-hosted Kafka clusters almost daily for a wide range of clients. We would like to share our best practices and ideas on how we maintain consistent configurations and avoid configuration drift across multiple Kafka environments. The goal of today’s post is to unpack the challenges of redundancy and inconsistency that crop up when scaling to support hundreds of applications across an enterprise, and to show how challenges such as configuration drift can be addressed effectively.
Configuration Drift: The challenge of scaling
When deploying a multi-tenanted Kafka cluster (many teams using the same brokers), scaling to accommodate a large number of applications is a hurdle that is not exclusive to Kafka or streaming platforms; it is a common issue across all industries. To illustrate, consider Sam’s sandwich shop in a fictional town. Initially, this shop offers a customizable sandwich experience, allowing customers to tailor-make their sandwiches. However, as popularity soars and orders pile up, the staff are tasked with replicating the same sandwiches repeatedly. This manual process is susceptible to human error, leading to inconsistencies in the final product delivered to customers.
In a similar vein, in a Kafka environment, when a new application team is onboarded to the Kafka cluster for event streaming, they often end up repeating the same set of operations multiple times. These operations include creating topics, adjusting topic configurations, assigning access to producers and consumers, and managing schemas. These repetitive tasks elevate the risk of errors and inconsistencies, which could wreak havoc during production outages.
Consistency across environments using GitOps
Large enterprises that support hundreds of applications sometimes opt to deploy multiple Kafka clusters per environment, which makes consistency across environments and clusters paramount. To enhance isolation and minimise the blast radius, such a company implements a multi-cluster topology consisting of numerous clusters, most of which are built in an active-active replicated fashion across different regions. However, configuration inconsistency across replicated clusters can present a significant risk, often going unnoticed until a disaster recovery event happens.
To mitigate this risk, we employ a GitOps-based management plane acting as a central layer for governing and managing Kafka clusters. This management layer guarantees that any changes to the clusters’ state, including topic management, access control, and schema management, are conducted through Git. When a topic is created, the change is made inside a YAML configuration file, which in turn is applied by a Kubernetes operator that creates the topic with identical configurations on both replicated clusters. This transactional approach ensures the change is deemed complete only if it is persisted on both clusters; otherwise, it gets rolled back.
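The all-or-nothing behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions: `FakeCluster` is a hypothetical in-memory stand-in for a Kafka cluster, and `apply_topic_everywhere` is a made-up helper name, not the operator’s actual code.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for one Kafka cluster's topic registry."""
    name: str
    topics: dict = field(default_factory=dict)
    fail_on_create: bool = False  # lets the example simulate an unavailable cluster

    def create_topic(self, topic: str, config: dict) -> None:
        if self.fail_on_create:
            raise RuntimeError(f"{self.name}: cannot create topic {topic}")
        self.topics[topic] = dict(config)

    def delete_topic(self, topic: str) -> None:
        self.topics.pop(topic, None)


def apply_topic_everywhere(clusters, topic: str, config: dict) -> bool:
    """Create the topic on every cluster, or leave all clusters unchanged."""
    applied = []
    try:
        for cluster in clusters:
            cluster.create_topic(topic, config)
            applied.append(cluster)
    except Exception:
        # One cluster rejected the change: undo it on the clusters that accepted it.
        for cluster in reversed(applied):
            cluster.delete_topic(topic)
        return False
    return True


region_a = FakeCluster("region-a")
region_b = FakeCluster("region-b", fail_on_create=True)
first_attempt = apply_topic_everywhere([region_a, region_b], "orders", {"partitions": 6})
```

In this run `region_b` refuses the change, so the topic already created on `region_a` is removed again and neither cluster ends up with a configuration the other lacks.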
Addressing inconsistency across environments
While the GitOps-based management plane tackles the inconsistency issue across replicated clusters, the challenge of inconsistency across different environments remains. To solve this, we use a Git branching strategy of one branch per environment; the branches are kept separate and are never merged. The Dev / Test branches are open to new creations and updates, allowing application teams to construct and test their configurations in the development environment. The Prod branch is restricted: direct modifications to higher environments are not allowed, which ensures controlled and consistent changes.
To move configurations from the development environment branch to higher environments, we have an automation container called Promote and Release. This automation takes a snapshot of all configurations from the development branch and pushes it into the Git production branch. This snapshot acts as a version-controlled release, facilitating easy promotion of configurations to higher environments.
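To make the idea of a snapshot release concrete, here is a minimal sketch. It assumes the configurations are plain dictionaries, and `create_release` and the version string are hypothetical names chosen for illustration; the real Promote and Release automation works on Git branches and is not shown in this post.

```python
import copy
import hashlib
import json


def create_release(dev_configs: dict, version: str) -> dict:
    """Freeze the current development configs into a versioned release."""
    frozen = copy.deepcopy(dev_configs)  # later edits in dev must not leak into the release
    payload = json.dumps(frozen, sort_keys=True).encode("utf-8")
    return {
        "version": version,
        "digest": hashlib.sha256(payload).hexdigest(),  # identifies the exact content
        "configs": frozen,
    }


dev = {"orders": {"partitions": 3, "retention.ms": 604800000}}
release = create_release(dev, "2023.08.1")

# Development keeps moving, but the release stays as it was when it was cut.
dev["orders"]["partitions"] = 6
```

The digest gives release engineers a quick way to confirm that what reaches a higher environment is byte-for-byte what was tested.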
With the release and promotion automation, any new changes are first tested in the development environment. Then, a release is created and promoted to higher environments. If any problems arise during testing, the configuration can be rolled back to its previous state using native Git workflow. This approach eliminates the need for monotonous operations and ensures consistency across environments.
The Promote automation also includes a “dry run” feature that compares the release version with the current state of the cluster in the higher environment. This comparison offers detailed information on the changes that will be made, assisting release engineers in spotting any disruptive changes and making informed decisions about promoting the configuration to production.
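A dry run of this kind boils down to a diff between two sets of topic configurations. The sketch below assumes both the release and the live cluster state are available as dictionaries keyed by topic name; `dry_run` and the sample data are illustrative, not the output format of the actual automation.

```python
def dry_run(release: dict, live: dict) -> dict:
    """Describe what promoting `release` would change, without changing anything."""
    plan = {"create": [], "update": {}, "delete": []}
    for topic, wanted in release.items():
        if topic not in live:
            plan["create"].append(topic)
        elif wanted != live[topic]:
            current = live[topic]
            # Record (current value, new value) for each setting that differs.
            plan["update"][topic] = {
                key: (current.get(key), wanted.get(key))
                for key in sorted(set(current) | set(wanted))
                if current.get(key) != wanted.get(key)
            }
    plan["create"].sort()
    plan["delete"] = sorted(topic for topic in live if topic not in release)
    return plan


release = {
    "orders": {"partitions": 12, "retention.ms": 604800000},
    "payments": {"partitions": 6},
}
live = {
    "orders": {"partitions": 6, "retention.ms": 604800000},
    "legacy": {"partitions": 3},
}
plan = dry_run(release, live)
```

Here the plan reports one topic to create (`payments`), one partition change on `orders`, and one topic (`legacy`) that exists in production but not in the release, which is exactly the sort of potentially disruptive change a release engineer would want to see before promoting.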
Moreover, the Promote automation incorporates scaling factors, allowing for scaling of topic partitions and read/write throughputs in higher environments. This flexibility enables customization using the Kustomize framework and optimization of configurations based on specific requirements in each environment.
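As a rough illustration of a scaling factor, the sketch below multiplies the capacity-related settings of a development topic when it is promoted. The field names in `SCALED_FIELDS` and the function `scale_for_environment` are assumptions made for this example; in practice the equivalent adjustment is expressed through Kustomize rather than Python.

```python
import math

# Hypothetical field names; the real schema used by the automation is not shown here.
SCALED_FIELDS = ("partitions", "write_bytes_per_sec", "read_bytes_per_sec")


def scale_for_environment(topic_config: dict, factor: float) -> dict:
    """Return a copy of the config with capacity settings multiplied by `factor`."""
    if factor <= 0:
        raise ValueError("scaling factor must be positive")
    scaled = dict(topic_config)
    for name in SCALED_FIELDS:
        if name in scaled:
            # Round up so a topic never ends up with less capacity than requested.
            scaled[name] = max(1, math.ceil(scaled[name] * factor))
    return scaled


dev_topic = {"partitions": 3, "write_bytes_per_sec": 1_000_000, "cleanup.policy": "delete"}
prod_topic = scale_for_environment(dev_topic, 4)
```

Settings that do not describe capacity, such as `cleanup.policy`, pass through unchanged, so only size differs between environments.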
Configuration drift: Benefits of GitOps management of Kafka clusters
Using this GitOps management approach offers several advantages. It helps to avoid redundancy, reduce inconsistency, and gives a significant edge to developers. By centralising configuration management and governance of Kafka clusters, we can ensure that changes are consistent and controlled across environments. This approach also bolsters isolation and minimises the blast radius, reducing the impact of configuration issues.
Managing consistent configurations across multiple Kafka environments is crucial for avoiding configuration drift and maintaining stability and reliability in a large-scale enterprise. By centralising configuration management, using GitOps for governance, and promoting configurations in a controlled and consistent way, organisations can avoid redundancy, reduce inconsistency, and enhance overall application performance and reliability. This GitOps management approach not only boosts consistency but also strengthens isolation and limits the blast radius in case of configuration issues.
We are considering open-sourcing our Promote and Release automation. If you are interested in learning more or adopting this approach, please contact us.