What is Apache Kafka? It is an open-source distributed system for stream processing, maintained by the Apache Software Foundation. Kafka provides very low latency and is extremely fault-tolerant. It was originally designed at LinkedIn to handle large volumes of data with high throughput and fault tolerance. Kafka is great for event streaming platforms: Netflix uses it in its streaming platform, Uber uses it for real-time data streams, and Airbnb and The New York Times use it to manage their stream processing and data integration.
Yup! It is used by huge and innovative companies.
In a nutshell, a Kafka cluster consists of multiple servers, called brokers, that run the Kafka software and communicate via a high-performance TCP network protocol. Producers publish data to brokers, which store it in topics. Consumers subscribe to topics and consume messages from the brokers. Within a topic, messages are stored in partitions.
Kafka uses a publish-subscribe (pub-sub) model where producers publish messages to topics and consumers subscribe to topics to receive messages. Producers don’t need to know about consumers and vice versa. This decoupling of producers from consumers makes the system more scalable.
Kafka brokers store messages in a topic’s partitions whether or not anyone is currently reading them. Consumers pull messages from the partitions they subscribe to, at their own pace, rather than having messages pushed to them. Because messages are persisted to disk and can be replicated across brokers, they aren’t lost if a consumer disconnects or a broker fails.
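To make that flow concrete, here’s a minimal producer sketch using Kafka’s Java client. The broker address (localhost:9092), the "products" topic, and the record contents are assumptions for illustration, not details from this article:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to the (assumed) "products" topic.
            // The broker appends it to a partition and assigns it an offset.
            producer.send(new ProducerRecord<>("products", "bike-001", "Bike created"));
        } // closing the producer flushes anything still buffered
    }
}
```

The producer never talks to a consumer directly; it only hands records to the brokers, which is what keeps the two sides decoupled.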
Distributed Log
Kafka is based on the distributed commit log, a.k.a. the distributed log concept. A log is an ordered, append-only sequence of records; by replaying it you can know what previously happened and decide what to do next.
Every message must have a timestamp; if one isn’t provided via code, Kafka will provide one automatically.
By default, data stored in the log is retained for 7 days. This retention period is configurable (see the sketch after this list).
The log is immutable, i.e. it can’t be changed or deleted; data can only be appended. To reiterate: logs store data, and each event record is written at a particular offset in the log. See diagram (fig 1) for a visual representation after reading about producers and consumers.
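As a hedged sketch of how that retention window could be changed, here’s the Java AdminClient adjusting it per topic. The broker address and the "products" topic are assumptions for illustration:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.Collections;
import java.util.Properties;

public class RetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "products");
            // 604800000 ms = 7 days, matching Kafka's default retention.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(setRetention)))
                 .all().get();
        }
    }
}
```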
Producers
One or many producers can append entries, and as noted earlier logs are immutable, so entries can never change or delete existing data.
Multiple producers can write to the same log.
Consumers
One or many consumers can read the data “at their own pace” by keeping track of their offset in the log. This lets a consumer know where it last read data and where to resume.
Multiple consumers can read from the log.
For example, let’s say a consumer read data from the log at offset 3. The next time it reads from that log, it knows where it left off and moves on to the next point, offset 4 (see the consumer sketch below).
Producers write the data for the consumers to “read”/consume.
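Here’s a hedged consumer sketch in Java showing how offsets surface in the client API. The broker address, the "demo-group" consumer group, and the topic are assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "demo-group");              // hypothetical consumer group
        props.put("auto.offset.reset", "earliest");       // start from the beginning if the group has no stored offset
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("products"));
            while (true) {
                // Pull model: the consumer asks the broker for new records at its own pace.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Every record carries the partition and offset it was read from.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

By default the client commits the group’s offsets automatically, which is how the consumer resumes from where it left off after a restart.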
Topic
Think of a topic as a category, such as “/products”, “/transactions”, etc.
The topic is where the producers write to and consumers read from.
A topic can have one or more partitions.
The offset numbers are always unique within a partition.
Topics can be split into multiple partitions that can reside on multiple machines. This is part of what makes Kafka so fast, as the load can be balanced across them. A topic-creation sketch follows this list.
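As a sketch of how a topic gets its partitions (broker address and topic name assumed; replication is covered further below), the Java AdminClient could create a multi-partition topic like this:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // "products" topic with 3 partitions, each replicated to 2 brokers
            // (replication factor 2 assumes a cluster of at least two brokers).
            NewTopic topic = new NewTopic("products", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```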
Partitions
Partitions are added to a topic so that producers can write/store data and consumers can read that data.
Partitions can live on multiple machines to achieve high performance through horizontal scaling, i.e. as more resources are needed, more machines can be added to handle the load.
Multiple producers can write to different partitions of the same (possibly replicated) topic; likewise, a single producer can write to more than one topic.
For example, see fig 2. We have one producer writing to different partitions of the same replicated topic. Notice how the offsets are different? Each partition maintains its own independent offset sequence, and the data is being load balanced between the partitions. This spreads the load and can dramatically improve performance and throughput.
The example above doesn’t use custom partitioning, so ordering can’t be guaranteed across partitions: producers “write at their own speed”, meaning the order in which data is appended across partitions isn’t guaranteed. (Within a single partition, order is preserved.)
Custom partitioning
Messages can be sent to the same partition consistently. For example, “Bike” products could always go to the first partition (see the partitioner sketch after this list).
More commonly, a round-robin strategy is used so that data is distributed evenly across the partitions.
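Here’s a hedged sketch of what that could look like using the Java client’s Partitioner interface. The BikePartitioner class and its routing rule are hypothetical:

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;

// Hypothetical partitioner: every record keyed "Bike" lands on partition 0,
// so all bike events stay ordered on one partition; everything else is
// spread across the partitions by key hash.
public class BikePartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if ("Bike".equals(key)) {
            return 0;
        }
        int hash = (key == null) ? 0 : key.hashCode();
        return Math.floorMod(hash, numPartitions);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

A producer would opt in via props.put("partitioner.class", BikePartitioner.class.getName()). Without a custom partitioner, records that share a key still land on the same partition (by key hash), and keyless records are spread across partitions by the default partitioner.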
No locking concept in Kafka
It’s great that multiple producers can write to the same topic/partition and that data can be load balanced, but there isn’t a locking concept.
If producer A is writing a batch of messages to a topic, producer B won’t wait until producer A has finished writing them. This means messages sent to a topic/partition won’t necessarily be appended in the order they were sent.
In terms of speed and ensuring high throughput, though, this is great! Producers can asynchronously send batches of messages to one or many topics, as sketched below.
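As a hedged illustration of that asynchronous behaviour (topic name and payloads assumed), the Java client’s send() returns immediately; records are batched in the background and a callback fires once the broker acknowledges the write:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AsyncSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // send() is non-blocking: records are batched in the background,
                // and the callback fires when the broker acknowledges (or fails) the write.
                producer.send(new ProducerRecord<>("transactions", "key-" + i, "payload-" + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace();
                            } else {
                                System.out.printf("acked partition=%d offset=%d%n",
                                        metadata.partition(), metadata.offset());
                            }
                        });
            }
        } // closing the producer flushes any records still buffered
    }
}
```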
Cluster
Consists of one or many servers/nodes
Each node/server in the cluster runs the Kafka broker software.
Topics are hosted by the brokers in the cluster.
It’s recommended to run multiple brokers/nodes inside a cluster to benefit from Kafka’s ability to replicate data, as noted earlier.
Broker
Each node in a cluster is called a broker.
Partitions are distributed across the brokers in the cluster.
A broker (node) can contain one or many partitions for a topic.
Brokers can be added/removed from the cluster as needed.
Data is persisted for a configurable amount of time (the retention period), whether or not it has already been consumed.
Partitions can be replicated across multiple brokers (nodes).
For each partition, the controller chooses one broker among those holding a replica to act as that partition’s leader.
Controller
One of the brokers/nodes will serve as the active controller.
It’s responsible for administrative duties such as managing partitions, handling replication, and electing partition leaders.
There can be only one active controller within a cluster at a time.
Apache Kafka is a distributed, fault-tolerant, and low-latency stream processing system that has gained popularity among major companies like Netflix, Uber, Airbnb, and The New York Times. Apache Kafka operates on a publish-subscribe model, allowing producers to publish messages to topics and consumers to subscribe and receive those messages. Its distributed commit log architecture ensures data integrity, while partitioning enables horizontal scaling and high performance. With its robust features and scalability, Apache Kafka provides a powerful foundation for real-time data streaming and efficient stream processing.