blog by OSO

Serverless Tiered Storage in Apache Kafka

Sion Smith 4 November 2024

Operating Apache Kafka at scale is complex and demands considerable experience. One of the most common issues our clients face is data storage. This post looks at serverless tiered storage for Apache Kafka, a recently released approach that leverages cloud object storage to provide cost-effective and scalable storage for Kafka data.

Introduction to Serverless Tiered Storage

Traditionally, Kafka clusters store data on disk, which can be expensive and difficult to manage at scale on public cloud platforms. Serverless tiered storage offers an alternative approach by using cloud object storage, such as Amazon S3, Google Cloud Storage, or Azure Blob Storage, to store the Kafka partitions. This allows you to take advantage of the scalability and cost-effectiveness of cloud storage while still benefiting from the real-time processing capabilities of Kafka.

How Serverless Tiered Storage Works

In a serverless tiered storage setup, Kafka data is stored in cloud object storage, while the Kafka brokers and agents are responsible for managing the data and processing the streaming requests. Here’s how it works:

  1. Service Discovery: Each agent in the Kafka cluster is aware of the other agents running in the same availability zone. This information is obtained from a service discovery system, which ensures that the load is evenly spread across the agents.
  2. Partitioning: Kafka data is divided into partitions, each of which holds a specific subset of a topic's data. Each agent is assigned a set of partitions to handle, ensuring that the load is evenly distributed.
  3. Metadata Store: The cluster uses a strongly consistent metadata store to handle all metadata operations and ensure serializability. This custom metadata store tracks offsets, topics, batches, and other relevant information.
  4. Locking Mechanism: To handle the lack of built-in locking in cloud object storage like S3, the cluster uses a strongly consistent metadata store as a locking mechanism. This ensures that operations are serialised and strongly consistent.
  5. Consumer Group Protocol: The cluster implements the consumer group protocol, similar to Kafka, to track committed offsets and consumer progress. The metadata store acts as the group coordinator and handles all the necessary operations.
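
To make the role of the metadata store more concrete, here is a minimal Python sketch of how a strongly consistent store could serialise writes and track consumer offsets. Everything in it is an assumption for illustration: the `MetadataStore` class, its method names, and the in-memory lock (which stands in for whatever consensus or transaction mechanism a real store would use) are hypothetical and do not describe any vendor's implementation.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    next_offset: int = 0
    # Each entry is (base_offset, object_key, record_count).
    batches: list = field(default_factory=list)


class MetadataStore:
    """Hypothetical in-memory stand-in for a strongly consistent metadata store."""

    def __init__(self):
        # The lock stands in for the store's own serialisation mechanism.
        self._lock = threading.Lock()
        self._partitions = {}
        self._committed = {}

    def commit_batch(self, topic, partition, object_key, record_count):
        """Assign offsets to a batch that was already written to object storage."""
        with self._lock:
            log = self._partitions.setdefault((topic, partition), PartitionLog())
            base_offset = log.next_offset
            log.batches.append((base_offset, object_key, record_count))
            log.next_offset = base_offset + record_count
            return base_offset

    def commit_offset(self, group, topic, partition, offset):
        """Record a consumer group's progress (the group coordinator role)."""
        with self._lock:
            self._committed[(group, topic, partition)] = offset

    def fetch_offset(self, group, topic, partition):
        """Return the last committed offset, or 0 if the group has none."""
        with self._lock:
            return self._committed.get((group, topic, partition), 0)
```

In this sketch, two agents that write files to object storage at the same time still receive non-overlapping offset ranges, because the ordering is decided in the metadata store rather than in the object store itself.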

Benefits of Serverless Kafka Tiered Storage

Serverless tiered storage is available from Amazon MSK and, more recently, WarpStream. It offers several benefits for managing Kafka data:

  1. Scalability: By leveraging cloud object storage, organisations can easily scale their storage capacity as needed without worrying about managing physical disks.
  2. Cost-effectiveness: Cloud object storage is typically more cost-effective than traditional disk storage, allowing organisations to store large amounts of data at a lower cost.
  3. Flexibility: Serverless tiered storage allows organisations to flexibly deploy Kafka clusters across multiple VPCs, regions, or even cloud accounts. This provides better isolation and flexibility in managing the cluster.
  4. Load Balancing: Instead of balancing topic partitions, serverless tiered storage balances client connections. This approach has been found to work well in practice, ensuring even distribution of load across the agents in the cluster.
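
The idea of balancing client connections rather than topic partitions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Agent` type, the `pick_agent` function, and the "least connections, same zone first" rule are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    agent_id: str
    zone: str
    connections: int = 0


def pick_agent(agents, client_zone):
    """Return the least-loaded agent, preferring agents in the client's zone."""
    if not agents:
        raise ValueError("no agents registered")
    same_zone = [a for a in agents if a.zone == client_zone]
    # Fall back to every agent if none is registered in the client's zone.
    candidates = same_zone or agents
    # Ties on connection count are broken by agent_id for determinism.
    chosen = min(candidates, key=lambda a: (a.connections, a.agent_id))
    chosen.connections += 1
    return chosen
```

Because any agent can serve any partition in this model, spreading connections evenly is enough to spread load, and adding an agent to the list immediately gives new clients somewhere else to go.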

Should you use Serverless Kafka?

While serverless tiered storage offers many benefits, there are some challenges and considerations to keep in mind:

  1. Latency: Although serverless tiered storage generally performs consistently, latency can vary, especially because data is written to cloud object storage. This can be mitigated by monitoring the time it takes to flush a file to object storage and implementing speculative retries.
  2. Client Configuration: Clients connecting to the Kafka cluster need to be configured to participate in the serverless tiered storage setup. This includes specifying the role of the client (e.g., producer or consumer) and configuring the view of the cluster that the client sees.
  3. Data Hotspots: It is possible for a single client to write a large amount of data, causing a hotspot in the cluster. However, this is rare in practice, and load balancing client connections usually works well. Additionally, more agents can be added to handle temporary spikes in load.
  4. Locking Mechanism: Since cloud object storage like S3 does not provide built-in locking, a strongly consistent metadata store is used as a locking mechanism. This ensures that operations are serialised and strongly consistent.
  5. Service Discovery: The service discovery system plays a crucial role in serverless tiered storage by providing information about the agents running in the cluster. It is important to ensure that the service discovery system is reliable and can handle the load.
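
The speculative retry mentioned under latency can be sketched as follows. This is a minimal Python illustration, not a production implementation: `put_with_speculative_retry`, `put_fn`, and `hedge_after_s` are hypothetical names, and the sketch assumes the upload is idempotent (for example, both attempts write the same object key), so that a duplicate attempt is harmless.

```python
import concurrent.futures as cf


def put_with_speculative_retry(put_fn, payload, hedge_after_s, executor):
    """Run put_fn(payload); if it is still running after hedge_after_s seconds,
    start a second attempt and return whichever succeeds first."""
    first = executor.submit(put_fn, payload)
    done, _ = cf.wait([first], timeout=hedge_after_s)
    pending = {first}
    if not done:
        # The first attempt is slow, so race a second one against it.
        pending.add(executor.submit(put_fn, payload))

    last_error = None
    while pending:
        done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
        for future in done:
            try:
                return future.result()
            except Exception as exc:
                # Remember the failure and keep waiting for the other attempt.
                last_error = exc
    raise last_error
```

In practice the hedge threshold would likely be derived from the monitored flush times, for example a high percentile of recent uploads, so that only unusually slow requests trigger a second attempt.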

Cutting the cost of running Apache Kafka

Serverless tiered storage offers a cost-effective and scalable solution for managing Kafka data by leveraging cloud object storage. By combining the real-time processing capabilities of Kafka with the scalability and cost-effectiveness of cloud storage, organisations can efficiently manage their data pipelines and streaming applications. While there are some challenges to consider, serverless tiered storage provides flexibility, scalability, and cost savings for Kafka clusters.

If you are interested in learning more or adopting this approach then please contact us.

Get started with OSO professional services for Apache Kafka

Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.