In this article, we will explore the factors that affect the performance of a Kafka cluster from an infrastructure perspective. We will discuss the throughput of the storage, the storage network, and the network, and how each affects the overall performance of the cluster. Additionally, we will look at scaling options and the trade-offs between scaling up and scaling out. Let’s dive in!
Kafka Performance: Understanding Throughput
The throughput of a Kafka cluster is bounded by the maximum throughput of the storage, the storage network, and the network. In particular, the maximum ingress throughput of the cluster cannot exceed the maximum throughput of the storage multiplied by the number of brokers and divided by the replication factor, because every byte produced must be written to the leader and to each replica.
For example, with a replication factor of 2 and storage that sustains 250 megabytes per second, each broker can absorb at most 250 / 2 = 125 megabytes per second of producer traffic. In practice the effective limit may be lower still, because the storage network, rather than the storage itself, can become the bottleneck as the number of brokers increases.
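The storage bound described above can be expressed as a small calculation. The figures are illustrative and not tied to any particular instance or volume type:

```python
def max_cluster_ingress_mb_s(storage_tput_mb_s: float,
                             num_brokers: int,
                             replication_factor: int) -> float:
    """Upper bound on cluster-wide producer (ingress) throughput.

    Every byte produced is persisted replication_factor times across
    the cluster, so the aggregate storage throughput is shared between
    the leader write and the replica writes.
    """
    return storage_tput_mb_s * num_brokers / replication_factor

# Three brokers with 250 MB/s of storage throughput each and a
# replication factor of 2: the cluster as a whole cannot ingest more
# than 375 MB/s, i.e. 125 MB/s per broker.
```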
Kafka Performance: Network Throughput
To determine the maximum throughput of the network, we need to consider the replication traffic leaving the broker and the read traffic generated by the consumers. The total network traffic out of a broker is the replication traffic plus the traffic going to the consumers.
It is important to note that this is a simplified model and assumes certain conditions. For example, it assumes that there is at least one consumer and that consumers are always reading from the tip of a topic. Additionally, it does not take into account CPU utilisation, which can be significant for certain workloads.
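Under those simplifying assumptions (at least one consumer group, all consumers reading from the tip), the network traffic leaving a broker can be sketched as follows. The function name and parameters here are our own, purely for illustration:

```python
def broker_egress_mb_s(ingress_mb_s: float,
                       replication_factor: int,
                       consumer_groups: int) -> float:
    """Network traffic leaving one broker under the simplified model:
    every byte produced is sent once to each follower replica
    (replication_factor - 1 copies) and once to each consumer group
    reading from the tip of the topic."""
    return ingress_mb_s * (replication_factor - 1 + consumer_groups)

# 100 MB/s of producer traffic per broker, replication factor 3 and
# two consumer groups: each broker pushes 100 * (2 + 2) = 400 MB/s
# out of its network interface.
```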
Kafka Performance: Scaling Options
When it comes to scaling a Kafka cluster to achieve a certain throughput, there are four main factors that can be adjusted:
Replication factor: This determines the number of copies of each message that are stored in the cluster. However, changing the replication factor is not always feasible or desirable.
Number of consumers: Each additional consumer group adds egress traffic, so the number of consumers a cluster can serve is limited by its network throughput. Scaling up the network throughput of the cluster accommodates more consumers.
Scaling up instances: By scaling up instances, the throughput of the storage, storage network, and network can be increased. Different volume types can also be chosen to increase the throughput of the storage volumes.
Scaling out instances: Scaling out involves adding or removing brokers in the cluster, which affects the overall performance of the cluster. It is important to note that scaling out can continue almost indefinitely, whereas scaling up is capped by the largest available instance size and may not always be feasible.
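Putting these factors together, a first-pass sizing estimate can be sketched like this. It uses the same simplified model as above and ignores CPU, burst behaviour, and failure headroom, so treat it as a starting point rather than a sizing tool:

```python
import math

def brokers_needed(target_ingress_mb_s: float,
                   storage_tput_mb_s: float,
                   network_tput_mb_s: float,
                   replication_factor: int,
                   consumer_groups: int) -> int:
    """Smallest broker count satisfying both the storage bound and the
    network-egress bound for a target cluster ingress throughput."""
    # Storage: every byte is written replication_factor times.
    by_storage = storage_tput_mb_s / replication_factor
    # Network: replication out plus one copy per consumer group.
    by_network = network_tput_mb_s / (replication_factor - 1 + consumer_groups)
    per_broker = min(by_storage, by_network)
    return math.ceil(target_ingress_mb_s / per_broker)

# Target 500 MB/s of ingress with 250 MB/s storage, 200 MB/s network,
# replication factor 2 and one consumer group: the network bound
# (200 / 2 = 100 MB/s per broker) dominates, so 5 brokers are needed.
```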
Scaling Up vs Scaling Out
When deciding whether to scale up or scale out, it is important to consider the trade-offs involved. Scaling up involves increasing the capacity of individual brokers, while scaling out involves adding more brokers to the cluster.
Scaling up can result in fewer, larger brokers. While this can provide higher performance, it also means that if a single broker fails, there is a higher load on the remaining brokers. On the other hand, scaling out results in a higher number of smaller brokers. This allows for smaller capacity increments when scaling, but it can also increase the complexity of maintenance and operation, especially when performing rolling upgrades.
It is recommended to deploy Kafka clusters with brokers that have the same configuration and run the same number of brokers in all availability zones. This helps avoid excessive load on the remaining brokers in case of a failure, ensuring a consistent throughput across the deployment.
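The effect of a broker failure on the survivors is easy to quantify. Assuming partitions, and therefore traffic, are spread evenly across brokers, the load on each remaining broker rises by a factor of N / (N - 1):

```python
def failover_load_multiplier(num_brokers: int) -> float:
    """Relative load on each surviving broker after one broker fails,
    assuming traffic is evenly distributed across brokers."""
    return num_brokers / (num_brokers - 1)

# With 3 large brokers, one failure raises the load on each survivor
# by 50%; with 12 small brokers, the same failure adds only about 9%.
```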
Tiered Storage
One complex topic to consider is tiered storage. With tiered storage, data can be aged out from expensive block storage to more cost-effective object storage. This can help optimize storage costs. However, when it comes to performance, tiered storage introduces challenges.
In a tiered storage setup, data is sent from the broker to long-term storage, such as S3. Initially, this data is not consumed by any consumers. But at some point, a consumer may request the data for processing. This backfill operation requires higher throughput compared to regular consumption.
To accommodate the higher throughput required for backfill operations, it is necessary to scale up the network throughput of the cluster. This can be done by increasing the number of brokers before the backfill operation and then scaling back down afterward.
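As a rough sketch, the number of brokers to provision before a backfill can be estimated from the backfill rate and the spare network capacity of each broker. Both figures here are assumptions that you would need to measure for your own cluster:

```python
import math

def brokers_for_backfill(steady_brokers: int,
                         backfill_mb_s: float,
                         headroom_mb_s_per_broker: float) -> int:
    """Broker count to scale out to before a tiered-storage backfill
    (and scale back down from afterwards). headroom_mb_s_per_broker is
    the spare network throughput each broker has beyond its
    steady-state traffic -- a measured, cluster-specific figure."""
    extra = math.ceil(backfill_mb_s / headroom_mb_s_per_broker)
    return steady_brokers + extra

# A 900 MB/s backfill with 200 MB/s of spare network per broker needs
# 5 extra brokers on top of a 6-broker steady-state cluster.
```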
Burst Model in Cloud Infrastructure
Cloud infrastructure often employs a burst model, where baseline performance for network, CPU, and storage network can be exceeded for a certain period of time. This burst capability can be beneficial for stateful operations or failure modes, as it allows for faster completion of operations like replicating data or spinning up new brokers.
However, it is important to be cautious when running performance tests. Short tests that consume burst credits may show better performance than expected, but this performance may not be sustainable in the long run. It is crucial to drive enough throughput during performance tests to exhaust the burst credits and evaluate the baseline performance of the cluster.
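A simple token-bucket model shows why short tests mislead: the credit bucket drains at the difference between the burst rate and the baseline rate, so a test must run well past that duration to observe baseline performance. The exact accounting varies by cloud provider and resource, so this is an illustrative model only:

```python
def burst_duration_s(credit_bucket_mb: float,
                     burst_mb_s: float,
                     baseline_mb_s: float) -> float:
    """Seconds a token-bucket style burst can be sustained: the bucket
    drains at (burst - baseline) MB/s. Illustrative model only."""
    if burst_mb_s <= baseline_mb_s:
        return float("inf")  # at or below baseline, credits never drain
    return credit_bucket_mb / (burst_mb_s - baseline_mb_s)

# A 3,000 MB credit bucket, bursting at 250 MB/s over a 100 MB/s
# baseline, lasts only 3000 / 150 = 20 seconds.
```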
Improving Apache Kafka Performance
In conclusion, the performance of a Kafka cluster is influenced by several infrastructure factors: storage throughput, storage network throughput, and network throughput. Scaling options such as adjusting the replication factor, increasing the number of consumers, scaling up instances, or scaling out instances can help achieve the desired throughput. When deciding between scaling up and scaling out, it is important to consider the trade-offs involved, such as the blast radius of a broker failure and the operational complexity.
Tiered storage can be used to optimise storage costs, but it introduces challenges in terms of backfill operations and throughput requirements. Cloud infrastructure often employs a burst model, which can provide temporary performance boosts but should be carefully considered during performance testing. Overall, understanding and optimising these infrastructure factors can help ensure the optimal performance of a Kafka cluster.
If you’re interested in learning more about Kafka performance and monitoring, we recommend reaching out to OSO, the Kafka experts, or checking out some of our code examples on GitHub.