How to reduce the TCO of Kafka on public cloud?
In today’s digital landscape, optimising costs and increasing efficiency are top priorities for organisations. This is especially true when deploying Kafka clusters in the cloud. Public cloud services have grown tremendously in recent years, with organisations investing strategically in data infrastructure and analytics while also looking to save money. However, there are hidden costs associated with implementing distributed solutions in the cloud.
Across more than 30 enterprise engagements, we have explored a number of ways to demystify and reduce the cloud costs of Kafka clusters. In this post we share lessons learned from the field, drawing on the experience of Kafka experts and enterprise companies. Topics include techniques to optimise Kafka’s network transfer costs, configuring production and consumption in single- and multi-zone environments, compression options, and tuning fan-out. By implementing these strategies, you can potentially reduce your cloud spend on Kafka clusters.
Reducing TCO of Kafka: Understanding the Public Cloud Landscape
The adoption of public cloud shows no signs of slowing down, with real-time analytics leading the charge. Organisations are increasingly using cloud services for their real-time data infrastructure and analytics needs, but there are hidden costs involved in implementing distributed solutions. When deploying Kafka clusters in the cloud, it is important to optimise costs and increase efficiency. Without the right planning and architecture, things can get very expensive – fast!
Reducing TCO of Kafka: How to calculate your Kafka cluster size?
Before deploying Kafka on public cloud, it is crucial to analyse your needs and estimate how your workload will grow. Start by estimating your throughput requirements; you can get a rough figure using this calculator. The calculator identifies the number of brokers and other resources required to meet your baseline, which you can then map onto concrete cloud resources. Track usage continuously and monitor spending on infrastructure; this lets you optimise for compute and throughput while keeping costs down. It is also important to evaluate your approach against previously established standards so that your Kafka deployment remains cost-effective.
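As a rough illustration of the arithmetic behind such a calculator, the sketch below estimates a broker count from aggregate write throughput. All figures (50 MB/s ingest, a 30 MB/s safe per-broker write rate) are hypothetical assumptions, not benchmarks; substitute your own measurements.
    // Rough Kafka broker-count estimate; all input figures are hypothetical assumptions.
    public class ClusterSizeEstimate {
        public static void main(String[] args) {
            double writeMBps = 50.0;        // assumed aggregate producer throughput (MB/s)
            int replicationFactor = 3;      // each write is also stored on follower brokers
            double perBrokerMBps = 30.0;    // assumed sustainable write rate per broker

            // Total cluster write load includes replication traffic.
            double clusterWriteMBps = writeMBps * replicationFactor;

            // Round up, then add one broker of headroom to tolerate a failure.
            int brokers = (int) Math.ceil(clusterWriteMBps / perBrokerMBps) + 1;
            System.out.printf("Estimated brokers: %d%n", brokers);
        }
    }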
Kafka on Public Cloud: Kafka compute options
When deploying Kafka, you have the option of virtual machines or Kubernetes. For high-performance deployments, the preferred instance types are memory optimised with a minimum of 4 vCPUs and 16 GB of RAM. These instance types are commonly used for log processing, data warehousing, and database workloads. It is important to choose instance types based on your performance and cost-saving requirements.
Using the Simple EventSizer, you can calculate the amount of storage your cluster needs for your desired retention period. If you are on AWS, EBS storage replicates data at the storage layer and allows volumes to be quickly reassigned to a new instance in case of broker failure, avoiding a lengthy re-replication of partition data. Additionally, once KIP-405 tiered storage becomes a reality, you can reduce compute costs for certain high-retention use cases without increasing broker counts or scaling local storage.
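For a back-of-the-envelope storage figure, the snippet below multiplies ingest rate by retention period and replication factor. The inputs are the same hypothetical 50 MB/s used above, and the result ignores compression, snapshots, and index overhead, so treat it as a starting point rather than a precise answer.
    // Rough retention storage estimate; all input figures are hypothetical assumptions.
    public class StorageEstimate {
        public static void main(String[] args) {
            double ingestMBps = 50.0;      // assumed producer throughput (MB/s)
            int replicationFactor = 3;     // bytes are stored once per replica
            int retentionDays = 7;         // desired retention period

            double retainedGB = ingestMBps * 60 * 60 * 24 * retentionDays
                    * replicationFactor / 1024;
            System.out.printf("Approximate cluster storage needed: %.0f GB%n", retainedGB);
        }
    }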
Networking is a major source of cost when deploying Kafka clusters. Producers and consumers should be positioned carefully across availability zones for reliability and resiliency. While keeping production and consumption within a single zone reduces cross-AZ traffic, it is generally recommended to spread consumers and producers across multiple availability zones for resilience. By analysing your network fan-out and provisioning for it efficiently, you can minimise networking costs. If you are unsure how to architect a cost-efficient solution, always consult the Kafka Experts at OSO.
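One concrete lever for cutting cross-AZ read traffic is KIP-392 follower fetching, which lets a consumer read from a replica in its own zone rather than always from the leader. A minimal consumer sketch, assuming the brokers are started with broker.rack set to their zone and replica.selector.class set to org.apache.kafka.common.replica.RackAwareReplicaSelector; the bootstrap address, group id, and zone name are placeholders.
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class InZoneConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // Match this to the availability zone the consumer runs in, e.g. eu-west-1a.
            props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "eu-west-1a");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Subscribe and poll as usual; fetches are served by the in-zone replica
                // whenever one exists, avoiding cross-AZ data transfer charges.
            }
        }
    }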
Kafka compression techniques
Kafka allows you to compress messages as they travel over the wire. Several compression codecs are supported, including Gzip, Snappy, LZ4, and Zstandard, each with different trade-offs in CPU usage, compression ratio, speed, and network utilisation. Snappy provides a good balance of CPU usage and compression ratio, while Zstandard achieves a higher compression ratio at slightly higher CPU cost. Highly repetitive payloads, such as XML and JSON, compress particularly well, saving disk space and network bandwidth for a modest amount of CPU.
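Compression is enabled with a single producer setting. A minimal sketch follows; the bootstrap address, topic name, and batching values are illustrative assumptions, not recommendations. Larger batches compress better, so it is common to pair compression with a small linger and a bigger batch size.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CompressedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Valid values: none, gzip, snappy, lz4, zstd.
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");
            // Give batches time to fill so each compressed batch carries more records.
            props.put(ProducerConfig.LINGER_MS_CONFIG, "20");
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders-json", "key", "{\"order\":42}"));
            }
        }
    }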
Kafka use-case configuration optimisation
Configuration plays a crucial role in optimising Kafka clusters. By tuning broker parameters such as log.segment.bytes, log.retention.bytes, and log.retention.ms, you can achieve better throughput at lower cost. Additionally, using tiered storage in Confluent Platform and eliminating ZooKeeper (by moving to KRaft) can further reduce costs and make Kafka more extensible.
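At the topic level these broker settings correspond to segment.bytes, retention.bytes, and retention.ms, which can be changed at runtime. Below is a sketch using the Java AdminClient; the topic name and the specific values (3 days, 50 GB per partition, 512 MB segments) are illustrative assumptions for a cost-conscious retention policy, not universal defaults.
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RetentionTuning {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic =
                        new ConfigResource(ConfigResource.Type.TOPIC, "orders-json");
                List<AlterConfigOp> ops = List.of(
                        // Keep at most ~3 days or ~50 GB per partition, whichever is hit first.
                        new AlterConfigOp(new ConfigEntry("retention.ms", "259200000"),
                                AlterConfigOp.OpType.SET),
                        new AlterConfigOp(new ConfigEntry("retention.bytes", "53687091200"),
                                AlterConfigOp.OpType.SET),
                        // Smaller segments let retention reclaim disk sooner.
                        new AlterConfigOp(new ConfigEntry("segment.bytes", "536870912"),
                                AlterConfigOp.OpType.SET));
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }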
Continuous monitoring and platform automation
When deploying Kafka, it is important to continuously monitor your workloads and adjust as needed. This includes estimating workloads up front, growing capacity incrementally rather than over-provisioning, and using automation to reduce operational burden. By offloading the heavy lifting to tools like Confluent, you can optimise costs and increase efficiency. You should always know how much runway is left in your Kafka clusters, and have monitoring and alerting in place that notifies you when you reach 70% of capacity.
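A minimal sketch of that 70% rule, assuming you already collect a per-broker disk-usage metric; how the used and capacity figures are obtained, and what the alert does, are hypothetical placeholders for whatever monitoring stack you run.
    // Hypothetical capacity check: alert once disk usage crosses 70% of capacity.
    public class RunwayCheck {
        static final double ALERT_THRESHOLD = 0.70;

        // usedBytes and capacityBytes would come from your monitoring stack (hypothetical here).
        static boolean shouldAlert(long usedBytes, long capacityBytes) {
            return (double) usedBytes / capacityBytes >= ALERT_THRESHOLD;
        }

        public static void main(String[] args) {
            long used = 750L * 1024 * 1024 * 1024;      // assumed 750 GB used
            long capacity = 1024L * 1024 * 1024 * 1024; // assumed 1 TB broker volume
            if (shouldAlert(used, capacity)) {
                System.out.println("Broker disk above 70% of capacity: plan expansion now.");
            }
        }
    }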
Reducing TCO of Kafka: How to lower Kafka cluster costs
Deploying Kafka clusters in the cloud requires careful analysis and optimisation to reduce costs. By analysing your needs, choosing the right compute and storage options, optimising networking, using compression, and tuning configurations, you can save money while maintaining high performance. Continuous monitoring and automation are also key to sustaining cost efficiency over time. To learn more about optimising Kafka costs and understanding the operational and infrastructure costs, check out Confluent’s comprehensive four-part series on the topic.
For more content:
How to take your Kafka projects to the next level with a Confluent preferred partner
Event driven Architecture: A Simple Guide
Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation
Successfully Reduce AWS Costs: 4 Powerful Ways
Protecting Kafka Cluster
Apache Kafka Common Mistakes
Kafka Cruise Control 101
Kafka performance best practices for monitoring and alerting
Real-time Push APIs Using Kafka
The new consumer rebalance protocol KIP-848