
Master Apache Kafka 101

Josh 28 February 2023

Apache Kafka stream processing

What is Apache Kafka? It is a distributed, open-source system for stream processing, supported by the Apache Software Foundation. Kafka provides very low latency and is extremely fault-tolerant. It was originally designed at LinkedIn to handle large volumes of data with high throughput and fault tolerance. Kafka powers event streaming at companies such as Netflix, Uber, Airbnb, and The New York Times, which use it for real-time data streams, stream processing, and data integration.

Yup! It is used by huge and innovative companies.

In a nutshell, a Kafka cluster consists of multiple servers, called brokers, that run the Kafka software and communicate via a high-performance TCP network protocol. Producers publish data to brokers, which store it in topics. Consumers subscribe to topics and consume messages from the brokers. Within a topic, messages are stored in partitions.

Kafka uses a publish-subscribe (pub-sub) model where producers publish messages to topics and consumers subscribe to topics to receive messages. Producers don’t need to know about consumers and vice versa. This decoupling between producers and consumers makes the system more scalable.

A Kafka broker stores messages in a topic’s partitions whether or not any consumer is subscribed. Consumers pull messages from a partition at their own pace, tracking how far they have read with an offset. Because messages are persisted to disk (and can be replicated across brokers), they are not lost if a consumer disconnects or a single broker fails.

Distributed Log

  • Kafka is based on the distributed commit log, also known as the distributed log concept. A log represents an ordered series of events: knowing what previously happened tells the system what to do next.
  • Every message has a timestamp; if one isn’t provided via code, Kafka will assign one automatically.
  • By default, data stored in the log is retained for 7 days (this retention period is configurable).
  • The log is immutable, i.e. existing data can’t be changed or deleted; data can only be appended. Each event record is written at a particular offset in the log. See the diagram (fig 1) for a visual representation after reading about producers and consumers.

Producers

  • One or many producers can append entries to a log; as noted earlier, logs are immutable, so entries can never change or delete existing data.
  • Multiple producers can write to the same log concurrently. A minimal producer sketch follows below.
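To make this concrete, here is a minimal producer sketch in Java using the official Kafka client. The broker address localhost:9092 and the products topic are assumptions for illustration, not part of the original example:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Append a record to the log; the broker assigns the offset
            // (and a timestamp, if the record doesn't set one)
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("products", "bike-1", "Bike created"))
                    .get();
            System.out.printf("Appended at partition=%d offset=%d timestamp=%d%n",
                    meta.partition(), meta.offset(), meta.timestamp());
        }
    }
}
```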

Consumers

  • One or many consumers can read the data “at their own pace” by keeping track of the offset in the logs. This allows consumers to know where they last read data and where to proceed.
  • Multiple consumers can read from the log.
  • For example, let’s say a consumer read data from the log at offset 3 (offset=3); the next time it reads from that log, it will know to move on to the next point, offset 4 (offset=4).
  • Producers write the data for the consumers to “read”/consume.
Fig 1. Example of an offset in a topic
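Here is a matching consumer sketch, again a minimal illustration assuming the same local broker and products topic; the offset-demo group id is made up:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "offset-demo");             // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");       // start from the beginning of the log

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("products"));
            while (true) {
                // Poll at "our own pace"; Kafka tracks the committed offset per partition
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```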

Topic

  • Think of a topic as a category, such as “/products”, “/transactions”, etc.
  • The topic is where the producers write to and consumers read from.
  • A topic can have one or more partitions.
  • The offset numbers are always unique within a partition.
  • Topics can be split into multiple partitions that can reside on multiple machines. This is part of what makes Kafka so fast, as partitions can be load balanced across machines. A sketch of creating a multi-partition topic follows this list.
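As an illustration, a topic with several partitions can be created programmatically with Kafka’s admin client. The topic name, partition count, and replication factor below are placeholder values:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // "products" topic with 3 partitions, replication factor 1 (single-broker dev setup)
            NewTopic topic = new NewTopic("products", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```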

Partitions

  • Partitions are added to a topic so that producers can write/store data and consumers can read it.
  • Partitions can live on multiple machines to achieve high performance through horizontal scaling, i.e. as more resources are needed, more machines can be added to handle the load.
  • Multiple producers can write to different partitions of the same topic, and a single producer can likewise spread its writes across partitions.
  • For example, see fig 2. We have one producer writing to different partitions of the same topic. Notice how the offsets are different? This is because the data is being load balanced between the different partitions, which spreads the load and provides high throughput.
Fig 2. Producer writing to various product topics.
  • The example above doesn’t use custom partitioning, meaning ordering can’t be guaranteed across partitions: producers “write at their own speed”, so the order in which data is appended across partitions can’t be guaranteed (Kafka only guarantees order within a single partition). The sketch below shows this spread in practice.
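One way to observe this load balancing is to send a handful of keyless records and print which partition each lands on. This sketch assumes the same local broker and a products topic with multiple partitions:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class PartitionDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // No key: the default partitioner spreads keyless records
                // across partitions (recent clients do this batch by batch)
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("products", "event-" + i))
                        .get();
                System.out.printf("event-%d -> partition=%d offset=%d%n",
                        i, meta.partition(), meta.offset());
            }
        }
    }
}
```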

Custom partitioning

  • Messages can be routed to the same partition consistently; for example, “Bike” products could always go to the first partition.
  • More commonly, a “round robin” strategy is used so that data is evenly distributed across the partitions. A sketch of a custom partitioner follows this list.
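Here is a sketch of what such a custom partitioner might look like using the Java client’s Partitioner interface. BikePartitioner and its routing rule are invented for illustration:

```java
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical partitioner: "Bike" keys always land on partition 0,
// everything else is spread by a hash of the key.
public class BikePartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if ("Bike".equals(key)) {
            return 0; // all Bike products always go to the first partition
        }
        if (keyBytes == null) {
            return 0; // keyless records: kept trivially simple for this sketch
        }
        // Spread remaining keys by hash, like Kafka's default keyed partitioning
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

A producer would then opt in with props.put("partitioner.class", BikePartitioner.class.getName()).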

No locking concept in Kafka

  • It’s great that multiple producers can write to the same topic/partition and that data can be load balanced, but there is no locking concept.
  • If producer A were writing a batch of messages to a topic, producer B wouldn’t wait for producer A to finish. As a result, messages sent to a topic/partition by different producers may not be appended in the order they were sent.
  • In terms of speed and high throughput, though, this is great: producers can asynchronously send batches of messages to one or many topics, as the sketch below shows.
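As an illustration of this asynchronous, batched style, a producer can fire off many sends without waiting for each to complete and learn each outcome via a callback. The linger.ms and batch.size values here are arbitrary tuning examples:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncBatchDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "10");      // wait up to 10 ms to fill a batch
        props.put("batch.size", "32768");  // batch up to 32 KB per partition

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // send() returns immediately; the callback reports the
                // appended offset (or an error) once the batch is acknowledged
                producer.send(new ProducerRecord<>("products", "event-" + i),
                        (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("appended at partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
            }
        } // close() flushes any outstanding batches
    }
}
```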

Cluster

  • Consists of one or many servers/nodes.
  • Each node/server in the cluster runs the Kafka broker software.
  • Topics are stored on the brokers in the cluster.
  • It is recommended to run many brokers/nodes inside a cluster to benefit from Kafka’s ability to replicate data, as noted earlier.
Several brokers within a cluster

Broker

Each node in a cluster is called a broker.

  • Partitions are stored across the brokers in the cluster.
  • A broker (node) can contain one or many partitions for a topic.
  • Brokers can be added/removed from the cluster as needed.
  • Data is persisted for a configurable amount of time (the retention period), regardless of whether it has been consumed.
  • Partitions can be replicated across multiple brokers (nodes).
  • For each partition, one broker is elected as the leader by the controller; the other brokers holding that partition’s replicas follow it. The sketch below shows replication settings at topic creation.
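For example, replication and retention can both be set when a topic is created. The broker list, replication factor of 3, and 7-day retention below are illustrative values:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class ReplicatedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // assumed brokers

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each replicated to 3 brokers; retain data for 7 days
            NewTopic topic = new NewTopic("transactions", 3, (short) 3)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```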

Controller

  • One of the brokers/nodes will serve as the active controller.
  • Responsible for administrative tasks such as managing partitions and replicas and electing partition leaders.
  • There can only be one active controller within a cluster at a time; a sketch of inspecting it follows.
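A small sketch of inspecting the brokers and the active controller with the admin client; the broker address is again an assumption:

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ClusterInfoExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // Every node in the cluster is a broker
            for (Node node : cluster.nodes().get()) {
                System.out.printf("broker id=%d host=%s:%d%n", node.id(), node.host(), node.port());
            }
            // The one active controller in the cluster
            Node controller = cluster.controller().get();
            System.out.printf("controller id=%d%n", controller.id());
        }
    }
}
```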

Apache Kafka is a distributed, fault-tolerant, and low-latency stream processing system that has gained popularity among major companies like Netflix, Uber, Airbnb, and The New York Times. Apache Kafka operates on a publish-subscribe model, allowing producers to publish messages to topics and consumers to subscribe and receive those messages. Its distributed commit log architecture ensures data integrity, while partitioning enables horizontal scaling and high performance. With its robust features and scalability, Apache Kafka provides a powerful foundation for real-time data streaming and efficient stream processing.

Feel free to take a look at the official documentation, and also read some of our blog posts: Learn Apache Kafka, Getting Started with Apache Kafka, and How to Set Up a Test Kafka on Kubernetes might be good options! You can also look at Apache Kafka use cases to see how Dufry strengthens its case for a Kafka architecture with an Apache Kafka proof of concept (POC).

Happy coding everyone! 😊

For more content:

How to take your Kafka projects to the next level with a Confluent preferred partner

Event driven Architecture: A Simple Guide

Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation

Successfully Reduce AWS Costs: 4 Powerful Ways

Get started with OSO professional services for Apache Kafka

Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.

Contact Us