In this article, we will explore the factors that affect the performance of a Kafka cluster from an infrastructure perspective. We will discuss the throughput of the storage, the storage network, and the network, and how each affects the overall performance of the cluster. We will also look at scaling options and the trade-offs between scaling up and scaling out. Let’s dive in!
The throughput of a Kafka cluster is bounded by the maximum throughput of the storage, the storage network, and the network. In particular, the maximum throughput of the cluster cannot exceed the maximum throughput of the storage multiplied by the number of brokers and divided by the replication factor, because every byte produced is written to disk once per replica.
For example, with a replication factor of 2 and storage that can sustain 250 megabytes per second, the formula alone would allow 125 megabytes per second of ingress per broker. In practice, the cluster in this example tops out at around 81 megabytes per second, because the storage network throughput of the broker instance, rather than the storage itself, becomes the bottleneck.
To determine the maximum throughput of the network, we need to consider the replication traffic out of the broker and the read traffic caused by the consumers. The maximum throughput of the network out of the instance is the replication traffic plus the traffic going to the consumers.
It is important to note that this is a simplified model and assumes certain conditions. For example, it assumes that there is at least one consumer and that consumers are always reading from the tip of a topic. Additionally, it does not take into account CPU utilisation, which can be significant for certain workloads.
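To make the simplified model concrete, here is a minimal sizing sketch in Python. The formulas mirror the reasoning above; the function names, throughput figures, and broker counts are illustrative assumptions rather than measurements from any real cluster, and the sketch ignores CPU just as the model does.

```python
# A minimal sketch of the simplified throughput model described above.
# All figures are illustrative; substitute your own instance and volume numbers.

def write_bound_mb_s(storage_mb_s: float, storage_network_mb_s: float,
                     brokers: int, replication: int) -> float:
    # Every produced byte is written `replication` times across the cluster,
    # and each write passes through both the storage network and the storage
    # itself, so the slower of the two is what counts.
    per_broker = min(storage_mb_s, storage_network_mb_s)
    return per_broker * brokers / replication

def network_bound_mb_s(network_egress_mb_s: float, brokers: int,
                       replication: int, consumer_groups: int) -> float:
    # Traffic leaving each broker = replication traffic to the followers
    # (replication - 1 copies) + read traffic to consumers at the tip.
    per_broker_ingress = network_egress_mb_s / (replication - 1 + consumer_groups)
    return per_broker_ingress * brokers

def max_cluster_ingress_mb_s(storage_mb_s, storage_network_mb_s,
                             network_egress_mb_s, brokers,
                             replication, consumer_groups=1):
    # The cluster can only go as fast as its tightest bottleneck.
    return min(
        write_bound_mb_s(storage_mb_s, storage_network_mb_s, brokers, replication),
        network_bound_mb_s(network_egress_mb_s, brokers, replication, consumer_groups),
    )

# Example: 3 brokers, 250 MB/s storage, 160 MB/s storage network,
# 400 MB/s network egress, replication factor 2, one consumer group.
print(max_cluster_ingress_mb_s(250, 160, 400, brokers=3, replication=2))  # 240.0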
When it comes to scaling a Kafka cluster to achieve a certain throughput, there are four main factors that can be adjusted: the replication factor, the number of consumers reading the data, the size of the brokers (scaling up), and the number of brokers (scaling out).
When deciding whether to scale up or scale out, it is important to consider the trade-offs involved. Scaling up involves increasing the capacity of individual brokers, while scaling out involves adding more brokers to the cluster.
Scaling up can result in fewer, larger brokers. While this can provide higher performance, it also means that if a single broker fails, there is a higher load on the remaining brokers. On the other hand, scaling out results in a higher number of smaller brokers. This allows for smaller capacity increments when scaling, but it can also increase the complexity of maintenance and operation, especially when performing rolling upgrades.
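To put rough numbers on that blast-radius trade-off, the illustrative sketch below estimates how much load lands on each surviving broker when one broker fails, assuming traffic was spread evenly beforehand. The broker counts and utilisation figure are made up for the example.

```python
# Illustrative only: utilisation of the surviving brokers after one broker fails,
# assuming the failed broker's traffic is redistributed evenly.

def surviving_broker_utilisation(brokers: int, utilisation_before: float) -> float:
    # The failed broker's share of the traffic is spread across
    # the remaining (brokers - 1) brokers.
    return utilisation_before * brokers / (brokers - 1)

# Scale up: 3 large brokers at 60% utilisation -> 90% each after a failure.
print(surviving_broker_utilisation(3, 0.60))
# Scale out: 9 smaller brokers at 60% utilisation -> 67.5% each after a failure.
print(surviving_broker_utilisation(9, 0.60))
```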
It is recommended to deploy Kafka clusters with brokers that have the same configuration and run the same number of brokers in all availability zones. This helps avoid excessive load on the remaining brokers in case of a failure, ensuring a consistent throughput across the deployment.
One complex topic to consider is tiered storage. With tiered storage, data can be aged out from expensive block storage to more cost-effective object storage. This can help optimize storage costs. However, when it comes to performance, tiered storage introduces challenges.
In a tiered storage setup, data is sent from the broker to long-term storage, such as S3. Initially, this data is not consumed by any consumers, but at some point a consumer may request it for processing. This backfill operation requires much higher throughput than regular consumption, because the consumer is reading a large backlog as quickly as possible rather than keeping pace with newly produced data.
To accommodate the higher throughput required for backfill operations, it is necessary to increase the aggregate network throughput of the cluster. One way to do this is to add brokers before the backfill operation and scale back down afterwards.
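As a rough way to reason about that temporary scale-out, the sketch below estimates how many extra brokers a backfill might call for, given the size of the backlog, the window in which it has to be re-read, and the spare network egress available per broker. Every figure and function name here is an assumption made for illustration.

```python
import math

# Illustrative estimate of the temporary capacity needed for a backfill
# from tiered (object) storage. All figures are assumptions.

def extra_brokers_for_backfill(backlog_gb: float,
                               backfill_window_hours: float,
                               spare_egress_per_broker_mb_s: float) -> int:
    # Read rate needed to drain the backlog within the window,
    # on top of regular consumption.
    backfill_mb_s = backlog_gb * 1024 / (backfill_window_hours * 3600)
    return math.ceil(backfill_mb_s / spare_egress_per_broker_mb_s)

# Example: a 10 TB backlog to be re-read within 12 hours,
# with roughly 50 MB/s of spare network egress per broker.
print(extra_brokers_for_backfill(backlog_gb=10_000, backfill_window_hours=12,
                                 spare_egress_per_broker_mb_s=50))  # 5
```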
Cloud infrastructure often employs a burst model, where baseline performance for network, CPU, and storage network can be exceeded for a certain period of time. This burst capability can be beneficial for stateful operations or failure modes, as it allows for faster completion of operations like replicating data or spinning up new brokers.
However, it is important to be cautious when running performance tests. Short tests that consume burst credits may show better performance than expected, but this performance may not be sustainable in the long run. It is crucial to drive enough throughput during performance tests to exhaust the burst credits and evaluate the baseline performance of the cluster.
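One way to sanity-check a test plan is to estimate how long burst credits will last at the intended load, so the test is guaranteed to run past them and expose baseline performance. The sketch below uses a generic credit-bucket model; the bucket size and baseline figures are placeholders rather than real instance specifications.

```python
# Illustrative credit-bucket model: how long can a broker sustain a given
# test load above its baseline before burst credits run out?
# Bucket size and baseline are placeholders, not real instance specifications.

def burst_duration_minutes(credit_bucket_mb: float,
                           baseline_mb_s: float,
                           test_load_mb_s: float) -> float:
    if test_load_mb_s <= baseline_mb_s:
        return float("inf")  # at or below baseline, credits never run out
    drain_mb_s = test_load_mb_s - baseline_mb_s
    return credit_bucket_mb / drain_mb_s / 60

# Example: a full credit bucket worth 300 GB of I/O above baseline,
# a 60 MB/s baseline, and a 200 MB/s test load -> ~36 minutes.
print(burst_duration_minutes(credit_bucket_mb=300_000, baseline_mb_s=60,
                             test_load_mb_s=200))
```

A performance test shorter than that window would only ever measure burst performance.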
In conclusion, the performance of a Kafka cluster is influenced by various infrastructure factors, including storage throughput, storage network throughput, and network throughput. Scaling options such as adjusting the replication factor, changing the number of consumers, scaling up instances, or scaling out instances can help achieve the desired throughput. When deciding between scaling up and scaling out, it is important to consider the trade-offs involved, such as the blast radius of a broker failure and the operational complexity.
Tiered storage can be used to optimise storage costs, but it introduces challenges in terms of backfill operations and throughput requirements. Cloud infrastructure often employs a burst model, which can provide temporary performance boosts but should be carefully considered during performance testing. Overall, understanding and optimising these infrastructure factors can help ensure the optimal performance of a Kafka cluster.
If you’re interested in learning more about Kafka performance and monitoring, I recommend reaching out to OSO the Kafka Experts or checking out some of our code examples on GitHub.
For more content:
How to take your Kafka projects to the next level with a Confluent preferred partner
Event driven Architecture: A Simple Guide
Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation
Successfully Reduce AWS Costs: 4 Powerful Ways
Kafka performance best practices for monitoring and alerting
Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.
Contact Us