blog by OSO

Kafka Performance 101

Sion Smith 5 July 2023

In this article, we will explore the factors that determine the performance of a Kafka cluster from an infrastructure perspective. We will discuss the throughput of the storage, the storage network, and the network, and how each affects the overall performance of the cluster. We will also look at scaling options and the trade-offs between scaling up and scaling out. Let’s dive in!

Kafka Performance: Understanding Throughput

The throughput of a Kafka cluster is bounded by whichever of the storage, storage network, and network throughput limits is reached first. In particular, the maximum throughput of the cluster cannot exceed the maximum throughput of the storage multiplied by the number of brokers, divided by the replication factor.

For example, with a replication factor of 2 and a maximum storage throughput of 250 megabytes per second per broker, each broker can sustain at most 125 megabytes per second of fresh producer traffic. This is because every byte a producer writes must also be written again for replication, so the storage (and the storage network feeding it) has to absorb the replication writes on top of the producer writes.
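This storage bound can be expressed as a quick back-of-the-envelope calculation (a sketch; the function name and the three-broker figure are illustrative assumptions, not measurements):

```python
# Storage-bound cluster throughput: each byte produced is written to
# `replication_factor` volumes, so only 1/RF of the aggregate storage
# throughput is available for fresh producer traffic.

def storage_bound_throughput(storage_mb_s: float, num_brokers: int,
                             replication_factor: int) -> float:
    """Maximum sustained producer ingress (MB/s) the storage tier allows."""
    return storage_mb_s * num_brokers / replication_factor

# Illustrative figures: 3 brokers, 250 MB/s per volume, replication factor 2.
print(storage_bound_throughput(250, 3, 2))  # 375.0 MB/s of producer ingress
```

Note that doubling the replication factor halves the ingress the same hardware can sustain, which is why the replication factor appears in every sizing discussion below.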

Kafka Performance: Network Throughput

To determine the maximum throughput of the network, we need to consider the replication traffic out of the broker and the read traffic caused by the consumers. The maximum throughput of the network out of the instance is the replication traffic plus the traffic going to the consumers.
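This egress model can be sketched as follows (the variable names and figures are illustrative assumptions, not Kafka configuration properties):

```python
def network_egress_mb_s(ingress_mb_s: float, replication_factor: int,
                        num_consumer_groups: int) -> float:
    """Traffic leaving one broker: replication out to the other replicas,
    plus reads by consumer groups following the tip of the log."""
    replication_out = ingress_mb_s * (replication_factor - 1)
    consumer_out = ingress_mb_s * num_consumer_groups
    return replication_out + consumer_out

# 32 MB/s of producer ingress per broker, replication factor 3, two consumer groups:
print(network_egress_mb_s(32, 3, 2))  # 128 MB/s leaving each broker
```

Every additional consumer group multiplies the egress side of this equation, which is why the network out of the broker usually saturates before the network in.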

It is important to note that this is a simplified model and assumes certain conditions. For example, it assumes that there is at least one consumer and that consumers are always reading from the tip of a topic. Additionally, it does not take into account CPU utilisation, which can be significant for certain workloads.

Kafka Performance: Scaling Options

When it comes to scaling a Kafka cluster to achieve a certain throughput, there are four main factors that can be adjusted:

  1. Replication factor: This determines the number of copies of each message that are stored in the cluster. However, changing the replication factor is not always feasible or desirable.
  2. Number of consumers: every additional consumer group increases the egress throughput the cluster must provide. Scaling up the network throughput of the brokers allows the cluster to accommodate more consumers.
  3. Scaling up instances: By scaling up instances, the throughput of the storage, storage network, and network can be increased. Different volume types can also be chosen to increase the throughput of the storage volumes.
  4. Scaling out instances: Scaling out involves adding brokers to the cluster (and scaling in removes them). This can have an impact on the overall performance of the cluster. It is important to note that scaling out will always work, while scaling up may not be feasible in all cases, since instance and volume sizes have upper limits.
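Putting the limits together, the sustained throughput of the cluster is set by whichever resource saturates first. A minimal sizing sketch, with all per-broker limits as illustrative assumptions rather than vendor figures:

```python
def max_cluster_ingress(num_brokers, rf, consumer_groups,
                        storage_mb_s, storage_net_mb_s, net_mb_s):
    """Maximum sustained producer ingress (MB/s) across the whole cluster.

    Per broker:
      - the storage volume and the storage network each absorb rf writes
        per byte of ingress (the leader write plus rf-1 replica writes),
      - the network egress carries (rf - 1 + consumer_groups) bytes out
        per byte of ingress.
    """
    per_broker = min(
        storage_mb_s / rf,
        storage_net_mb_s / rf,
        net_mb_s / (rf - 1 + consumer_groups),  # egress is usually the tighter side
    )
    return per_broker * num_brokers

# Illustrative limits: 3 brokers, RF 3, 2 consumer groups,
# 250 MB/s storage, 500 MB/s storage network, 600 MB/s network per broker.
print(max_cluster_ingress(num_brokers=3, rf=3, consumer_groups=2,
                          storage_mb_s=250, storage_net_mb_s=500, net_mb_s=600))
```

With these numbers the storage volume is the binding constraint; changing any one input shows which of the four scaling levers above would actually move the bottleneck.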

Scaling Up vs Scaling Out

When deciding whether to scale up or scale out, it is important to consider the trade-offs involved. Scaling up involves increasing the capacity of individual brokers, while scaling out involves adding more brokers to the cluster.

Scaling up can result in fewer, larger brokers. While this can provide higher performance, it also means that if a single broker fails, there is a higher load on the remaining brokers. On the other hand, scaling out results in a higher number of smaller brokers. This allows for smaller capacity increments when scaling, but it can also increase the complexity of maintenance and operation, especially when performing rolling upgrades.

It is recommended to deploy Kafka clusters with brokers that have the same configuration and run the same number of brokers in all availability zones. This helps avoid excessive load on the remaining brokers in case of a failure, ensuring a consistent throughput across the deployment.

Tiered Storage

One complex topic to consider is tiered storage. With tiered storage, data can be aged out from expensive block storage to more cost-effective object storage. This can help optimize storage costs. However, when it comes to performance, tiered storage introduces challenges.

In a tiered storage setup, data is sent from the broker to long-term storage, such as S3. Initially, this data is not consumed by any consumers. But at some point, a consumer may request the data for processing. This backfill operation requires higher throughput compared to regular consumption.

To accommodate the higher throughput required for backfill operations, the network throughput of the cluster must be increased temporarily. This can be done by adding brokers before the backfill operation and scaling back in afterwards.
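As a rough sizing sketch for such a backfill (all figures are illustrative assumptions):

```python
import math

def brokers_for_backfill(steady_egress_mb_s: float, backfill_mb_s: float,
                         per_broker_net_mb_s: float) -> int:
    """Brokers needed so that normal egress plus the backfill read
    stream both fit within the brokers' aggregate network throughput."""
    total = steady_egress_mb_s + backfill_mb_s
    return math.ceil(total / per_broker_net_mb_s)

# Illustrative: 300 MB/s steady egress, a 900 MB/s backfill,
# 200 MB/s of network headroom per broker.
print(brokers_for_backfill(300, 900, 200))  # 6 brokers during the backfill
```

If the steady state only needs two brokers under the same assumptions, the backfill temporarily triples the cluster size, which is exactly the scale-out-then-scale-in pattern described above.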

Burst Model in Cloud Infrastructure

Cloud infrastructure often employs a burst model, where baseline performance for network, CPU, and storage network can be exceeded for a certain period of time. This burst capability can be beneficial for stateful operations or failure modes, as it allows for faster completion of operations like replicating data or spinning up new brokers.

However, it is important to be cautious when running performance tests. Short tests that consume burst credits may show better performance than expected, but this performance may not be sustainable in the long run. It is crucial to drive enough throughput during performance tests to exhaust the burst credits and evaluate the baseline performance of the cluster.
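As an example of why test duration matters, here is a sketch of how long a single AWS gp2 volume can burst before its I/O credit bucket empties. The bucket size, burst rate, and baseline below are AWS's published gp2 figures; verify them against the current documentation before relying on them:

```python
# AWS gp2 burst model (published figures -- treat as assumptions to verify):
BUCKET_CREDITS = 5_400_000   # I/O credits in a full burst bucket
BURST_IOPS = 3_000           # maximum burst rate

def burst_minutes(volume_gib: int, test_iops: int = BURST_IOPS) -> float:
    """Minutes until the burst bucket empties at a given test load."""
    baseline_iops = max(100, 3 * volume_gib)  # gp2 baseline: 3 IOPS/GiB, floor of 100
    drain_rate = test_iops - baseline_iops    # credits spent per second above baseline
    if drain_rate <= 0:
        return float("inf")                   # at or below baseline, the bucket never drains
    return BUCKET_CREDITS / drain_rate / 60

# A 500 GiB volume (1,500 IOPS baseline) driven at the full 3,000 IOPS burst:
print(burst_minutes(500))  # 60.0
```

In other words, a performance test against this volume would need to sustain full load for well over an hour before it starts measuring baseline rather than burst performance.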

Improve Apache Kafka performance

In conclusion, the performance of a Kafka cluster is influenced by various infrastructure factors, including storage throughput, storage network throughput, and network throughput. Scaling options such as adjusting the replication factor, increasing the number of consumers, scaling up instances, or scaling out instances can help achieve the desired throughput. When deciding between scaling up or scaling out, it is important to consider the trade-offs involved, such as the blast radius and operational complexity.

Tiered storage can be used to optimise storage costs, but it introduces challenges in terms of backfill operations and throughput requirements. Cloud infrastructure often employs a burst model, which can provide temporary performance boosts but should be carefully considered during performance testing. Overall, understanding and optimising these infrastructure factors can help ensure the optimal performance of a Kafka cluster.

If you’re interested in learning more about Kafka performance and monitoring, I recommend reaching out to OSO, the Kafka experts, or checking out some of our code examples on GitHub.

For more content:

How to take your Kafka projects to the next level with a Confluent preferred partner

Event driven Architecture: A Simple Guide

Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation

Successfully Reduce AWS Costs: 4 Powerful Ways

Protecting Kafka Cluster

Kafka Cruise Control 101

Kafka performance best practices for monitoring and alerting

How to build a custom Kafka Streams Statestores

How to avoid configuration drift across

Get started with OSO professional services for Apache Kafka

Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.

Contact Us