At OSO we deploy enterprise-ready, self-hosted Kafka clusters almost daily for a wide range of clients. We would like to share our best practices and ideas on how we maintain consistent configurations and avoid configuration drift across multiple Kafka environments. The goal of today’s post is to unpack the challenges of redundancy and inconsistency, such as configuration drift, that crop up when scaling to support hundreds of applications across an enterprise, and how these challenges can be addressed effectively.
Scaling a multi-tenanted Kafka cluster (many teams sharing the same brokers) to accommodate a large number of applications is a hurdle that is not exclusive to Kafka or streaming platforms; it is a common issue across all industries. To illustrate, consider the analogy of Sam’s sandwich shop in a fictional town. Initially, the shop offers a fully customisable sandwich experience, allowing customers to tailor-make their sandwiches. However, as popularity soars and orders pile up, the staff are tasked with replicating the same sandwiches repeatedly. This manual process is susceptible to human error, leading to inconsistencies in the final product delivered to customers.
In a similar vein, each time a new application team is onboarded to a Kafka cluster for event streaming, the same set of operations tends to be repeated: creating topics, adjusting topic configurations, assigning access to producers and consumers, and managing schemas. These repetitive tasks elevate the risk of errors and inconsistencies, which can wreak havoc during production outages.
In large enterprises that support hundreds of applications and often opt to deploy multiple Kafka clusters per environment, achieving consistency across environments and clusters is paramount. To enhance isolation and minimise the blast radius, such enterprises implement a multi-cluster topology consisting of numerous clusters, most of which are built in an active-active replicated fashion across different regions. However, configuration inconsistency across replicated clusters presents a significant risk, often going unnoticed until a disaster recovery event happens.
To mitigate this risk, we employ a GitOps-based management plane that acts as a central layer for governing and managing Kafka clusters. This management layer guarantees that any change to a cluster’s state, including topic management, access control, and schema management, is conducted through Git. When a topic is created, the change is made in a YAML configuration file, which is then applied by a Kubernetes operator, creating the topic with identical configurations on both replicated clusters. This transactional approach ensures the change is deemed complete only if it is persisted on both clusters; otherwise, it is rolled back.
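The exact resource format depends on the operator in use; as an illustration, with a Strimzi-style topic operator, a topic definition committed to Git might look like the following (the names, namespace and settings here are purely examples, not our actual configuration):

```yaml
# Illustrative topic definition stored in Git and reconciled by the operator.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: payments.transactions
  namespace: kafka
  labels:
    strimzi.io/cluster: events-cluster   # the Kafka cluster this topic belongs to
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 604800000        # 7 days
    cleanup.policy: delete
    min.insync.replicas: 2
```

Because this file in Git is the single source of truth, the management plane can apply the same definition to both replicated clusters, instead of relying on someone re-running the same CLI commands by hand on each one.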
While the GitOps-based management plane tackles the inconsistency issue across replicated clusters, the challenge of inconsistency across different environments remains. To solve this, we use a Git branching strategy of one branch per environment; the branches are kept segregated and are never merged into one another. The Dev and Test branches are open to new creations and updates, allowing application teams to build and test their configurations in the development environment. The Prod branch, by contrast, accepts no direct modifications, ensuring changes to higher environments remain controlled and consistent.
To move configurations from the development branch to higher environments, we have an automation container called Promote and Release. This automation takes a snapshot of all configurations on the development branch and pushes it to the production branch in Git. The snapshot acts as a version-controlled release, facilitating easy promotion of configurations to higher environments, as sketched below.
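Our Promote and Release automation itself is not public, but the core idea can be sketched as a CI job that snapshots the dev branch and pushes it to the prod branch as a tagged release. The workflow below is a minimal illustration in a GitHub Actions style; the branch names, directory layout (topics/, acls/, schemas/) and versioning scheme are assumptions for the sketch, not our actual implementation:

```yaml
# Minimal promote-and-release sketch (illustrative only).
name: promote-and-release
on:
  workflow_dispatch:
    inputs:
      version:
        description: Release tag to create (e.g. v1.4.0)
        required: true
permissions:
  contents: write                 # allow the job to push the release commit and tag
jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # fetch all branches so both dev and prod are available
      - name: Snapshot dev configs onto the prod branch
        run: |
          git config user.name "promote-bot"
          git config user.email "promote-bot@example.com"
          git checkout prod
          # Take the topic/ACL/schema definitions exactly as they are on dev
          git checkout origin/dev -- topics/ acls/ schemas/
          git commit -m "Release ${{ github.event.inputs.version }}: promote dev configuration"
          git tag "${{ github.event.inputs.version }}"
          git push origin prod --tags
```

The important property is that the prod branch only ever receives whole, versioned snapshots, never ad-hoc edits.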
With the release and promotion automation, any new change is first tested in the development environment; then a release is created and promoted to higher environments. If any problems arise during testing, the configuration can be rolled back to its previous state using native Git workflows (for example, by reverting the release commit). This approach eliminates monotonous manual operations and ensures consistency across environments.
The Promote automation also includes a “dry run” feature that compares the release version with the current state of the cluster in the higher environment. This comparison offers detailed information on the changes that will be made, assisting release engineers in spotting any disruptive changes and making informed decisions about promoting the configuration to production.
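One way to approximate such a dry run with standard tooling, assuming the configurations are rendered with Kustomize into Kubernetes resources and using a hypothetical overlays/prod path, is to diff the release against the live objects before anything is applied. A step like this could be appended to a promotion workflow like the sketch above:

```yaml
# Illustrative dry-run step: compare the release with the live cluster state.
- name: Dry run against the production cluster
  run: |
    # kubectl diff exits with 1 when differences are found, so don't fail the job on it
    kubectl diff -k overlays/prod > promotion-diff.txt || true
    cat promotion-diff.txt
```

The resulting diff is what the release engineer reviews before approving the promotion.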
Moreover, the Promote automation incorporates scaling factors, allowing for scaling of topic partitions and read/write throughputs in higher environments. This flexibility enables customisation using the Kustomize framework and optimisation of configurations based on specific requirements in each environment.
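As a sketch of how such a scaling factor can be expressed, a Kustomize overlay for a higher environment can patch the partition count (or throughput-related settings) of a topic defined in the base; again, the paths, names and values here are illustrative:

```yaml
# overlays/prod/kustomization.yaml (illustrative): reuse the base topic
# definitions but scale partitions up for production traffic.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: KafkaTopic
      name: payments.transactions
    patch: |-
      - op: replace
        path: /spec/partitions
        value: 24
```

The base definitions stay identical across environments; only the environment-specific scaling factors live in each overlay.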
Using this GitOps management approach offers several advantages: it helps avoid redundancy, reduces inconsistency, and gives developers a significant edge. By centralising configuration management and governance of Kafka clusters, we can ensure that changes are consistent and controlled across environments. The approach also bolsters isolation and minimises the blast radius, reducing the impact of configuration issues.
Maintaining consistent configurations across multiple Kafka environments, and avoiding configuration drift, is crucial for stability and reliability in a large-scale enterprise. By centralising configuration management, using GitOps for governance, and promoting configurations in a controlled and consistent way, organisations can avoid redundancy, reduce inconsistency, and enhance overall application performance and reliability. This GitOps management approach not only boosts consistency but also strengthens isolation and limits the blast radius in case of configuration issues.
We are considering open-sourcing our Promote and Release automation. If you are interested in learning more or adopting this approach, please contact us.
For more content:
How to take your Kafka projects to the next level with a Confluent preferred partner
Event driven Architecture: A Simple Guide
Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation
Successfully Reduce AWS Costs: 4 Powerful Ways
Kafka performance best practices for monitoring and alerting
Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.
Contact Us