
What is Apache Kafka? The Ultimate Kafka Guide 

Sion Smith 24 July 2023

Overview

Apache Kafka serves as a robust distributed data streaming platform, capable of publishing, subscribing, storing, and processing record streams in real-time. Crafted to manage data flows from a plethora of sources to a diverse array of consumers, it excels at transporting vast volumes of data. Essentially, it doesn’t merely transfer data from point A to B but extends its reach from A to Z and beyond, simultaneously catering to all required destinations.

Kafka is the leading technology in event-driven architectures. It helps technical leads and senior developers introduce modern software development practices and build scalable, high-performance, data-driven systems.

In this ‘Kafka for beginners’ guide, we’ll briefly outline the basics of Apache Kafka and event-driven architecture before diving into why Kafka is used by businesses like LinkedIn and Netflix, how to deploy it step-by-step, and where you can find Kafka support.

But really what is Apache Kafka?

Apache Kafka is an open-source, distributed event streaming platform that allows you to publish, subscribe, and process streams of records in real-time.

Kafka helps decouple systems, allowing multiple teams to consume structured and unstructured data in a consistent manner. Since event-driven systems are more modular, flexible, and decoupled than those that use batch processing, Kafka is useful for building KAAP-based architectures.

Senior developers also like Kafka because its ecosystem is always evolving, with a variety of client libraries, connectors, and tools that simplify how teams integrate and deploy the technology.

Event-driven architecture (EDA) is a type of software design where components communicate with each other by producing and consuming events, where an event is any significant state change within the system. For example, an event could be an action a user takes on your website, like clicking ‘buy now’, which triggers a whole host of downstream asynchronous activities.

With event-driven architecture, the components of the system that generate events—producers—need not know anything about who or what is going to consume their events. They are often loosely coupled with the components that consume the events, known as consumers. These events get routed through a broker, which acts as a central point of communication for all components and stores this data for on-demand retrieval.


Flexibility 

Event-driven architecture is a system design where components are loosely coupled, allowing changes to be made to individual components without affecting the entire system. This makes it possible to independently scale and deploy components.

Scalability 

Since events can be processed asynchronously in event-driven architectures, you can add more instances of a component to handle increased event loads. Adding producers or consumers will horizontally scale the system and keep it highly responsive.

Modularity  

With event-driven architecture, each component or service centres around handling specific types of events. Not only does this make the architecture easier to maintain, but it also lets you add new functionalities without modifying existing components.

Real-time reactions

Event-driven architectures process events as they occur, which is useful for real-time analytics, notifications, and monitoring. Tackling financial fraud? You want the ability to capture information and react immediately.

Integration and interoperability 

Finally, since events are used as a common language, event-driven architecture easily connects with external systems, whether those systems are inside or outside your organisation. For example, if you’re a retailer, you might need to connect with your legacy point of sale (POS) systems.

Read More: How Drivvn Migrated From Batch to Real-Time Streaming Data With Apache Kafka

How does Apache Kafka work? 

Kafka operates through a publish-subscribe model. Essentially, data producers stream data in real-time to Kafka topics, and consumers subscribe to those topics and process the data as soon as it arrives.

The breakdown of Kafka real time streaming 

When a producer sends a message to Kafka, the message is first written to a partition on one of the brokers. Almost instantaneously, that message is then replicated to partitions on other brokers, which means there’s no single point of failure and the message remains available around the clock.

Consumers read messages from Kafka by subscribing to a topic. It’s like a newspaper: subscribing to the topic means that you’ll effectively receive all of that topic’s messages and updates in real time.

In this model, data is bundled into topics which contain only a certain type of event. Each topic is also divided into partitions, which are stored on different Kafka brokers. This allows consumers to work in parallel, each consuming a different partition. Dividing topics like this lets Kafka scale horizontally, since you just add more brokers to the cluster if you’d like to handle more data.

Read More: Apache Kafka 101

Topics 

Topics represent a particular stream of records or messages. You can think of this building block as a category name to which records are published. For example, if you’re a financial services firm, you might name one such topic ‘payments’. If you’re an e-commerce store, you might title a topic ‘orders’. If you’re a healthcare provider, you might name one ‘patient_records’.

Producers 

Producers publish records to Kafka topics. They write data to a Kafka broker by selecting a topic and sending a message containing a key, a value, and an optional timestamp. In this context, keys determine the exact partition to which the message is sent, while values contain the actual data you want to transmit.
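
You can try keys and values without writing any code by using the console producer that ships with Kafka. A minimal sketch, assuming a local broker on localhost:9092 and a hypothetical ‘payments’ topic, where everything before the colon is treated as the key and the rest as the value:

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic payments --property parse.key=true --property key.separator=:

Because records with the same key always land on the same partition, this is how Kafka preserves ordering per key.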

Consumers

Consumers subscribe to and read data from Kafka topics. They’re essentially the opposite of Kafka’s producers, and they track their reading progress in each partition of a topic. This progress, the offset of the last record they have read, is also stored on the brokers in an internal topic called __consumer_offsets.
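
You can inspect this progress with the consumer groups tool bundled with Kafka. A quick sketch, assuming a local broker and a hypothetical consumer group called ‘my-consumer-group’; it prints the current offset, log-end offset, and lag for every partition the group is reading:

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-consumer-group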

Brokers

Brokers are the central point of contact that both receive and store published records. They’re responsible for handling read and write requests from producers and consumers, and they log records on disk so that those records are distributed and fault-tolerant.

When you put these components of Kafka architecture together, you get unusually high throughput and scalability.

Read More: Event-Driven Architecture: A Simple Guide 

Apache Kafka in Industry

Kafka is used by more than eighty percent of the companies in the Fortune 500. That includes top players in entertainment, financial services, e-commerce, and software, such as LinkedIn, Box, Goldman Sachs, Uber, Shopify, Spotify, Target, Netflix, Tesla, and TikTok.

Here’s a quick sampling of how companies use Kafka in their operations.

LinkedIn 

LinkedIn used event-driven architecture to monitor and analyse data, like when users viewed a profile, sent a message, or clicked on a lead magnet. The company maintains 100+ Kafka clusters that handle more than 7 trillion messages per day!

Netflix 

Netflix used event-driven architecture to sync its studio changes related to talent, cash flow, budget, schedules, and payments. Event-driven architecture also helps the video streamer run intensive data analytics and power its massive movie recommendation engine.

Spotify

Before switching to Google Cloud Pub/Sub, Spotify used Kafka for its event delivery system—making it easier to respond with real time suggestions when users listened to songs or searched for specific artists. Like Kanye? You’ll probably like Consequence.

Target 

With more than 26 million site visitors each month, Target used event-driven architecture to speed up its digital service and decrease the number of items it listed as out-of-stock. Event-driven architecture helped the company sync data across locations to better represent its net sales and stock.

Kugu 

Kugu used event-driven architecture to scale its building management platform across Europe and incorporate new features into its product. Notably, Kafka and EDA made it possible for the company to process larger amounts of energy data and pursue its climate goals.

Dufry 

Dufry used EDA to pilot a real-time view of its sales data and a single source of truth for its global levels of stock. As a multinational retailer, Dufry needed to switch to event-driven architecture to successfully retarget travellers with personalised offers and increase revenue per customer.

Read More: Dufry strengthens its case for a Kafka architecture with an Apache Kafka proof of concept (POC) 

Apache Kafka Practical Applications

Leverage event-driven architecture and Apache Kafka in situations where you want your system to be more flexible, scalable, responsive, or fault-tolerant:

    • Scaling a digital business

    • Managing international stock and sales

    • Personalising user recommendations

    • Generating new value from data

    • Monitoring fraud and emergency alerts

Scaling a digital business or platform

Seamless integration between microservices ensures scalable and reliable data communication and helps analyse website metrics in real time. (Drivvn, Kugu, and VAS)

Managing international supply chains, stock, and sales 

Kafka’s horizontal scalability makes it good for tracking large amounts of disparate, interconnected events taking place around the world. (Dufry)

Personalising user recommendations

Real-time data ingestion, processing, and distribution helps improve the types of content you show to users based on their viewing or streaming history. (LinkedIn)

Generating new value from data

Data analytics provides a better idea of broader trends and patterns in your organisation, helping your team improve its services. (The Department for Education, M1 Finance, and 33N)

Monitoring fraud or emergency alerts

Kafka’s publish-subscribe model efficiently captures and reacts to events that might otherwise put clients at risk. (Detected)

How do I deploy Apache Kafka and event-driven architecture? 

To deploy Apache Kafka and event-driven architecture, you’ll follow a few key steps:

    1. Download and install Kafka on your servers – it is free and open source.

    2. Create some topics and partitions.

    3. Build a simple producer and consumer.

    4. Start producing and consuming events.

Once you’ve decided to use Kafka, start by mapping out your strategy and checking that your hardware and software meet the requirements.

Determine your production requirements, such as expected message throughput, storage capacity, and fault tolerance. Talk to Kafka experts and consultants. Then consider factors like cluster size, network configuration, and data retention policies. You’ll also want to take into account your time, budget, and level of Kafka expertise in-house.

What are the requirements to deploy Apache Kafka? 

You’ll need sufficient disk space for log files and data, adequate memory for message buffering and caching, a compatible Java Runtime Environment (JRE), properly configured network settings, scalability considerations, security measures, and monitoring and management tools.

Hardware requirements 

Sufficient disk space – Kafka requires disk space to store log files and data. How much space you need depends on your expected data volume, retention period, and replication factor.

Adequate memory – Kafka works best when you have sufficient memory to handle message buffering and caching to efficiently process messages.

Sufficient CPU – You can improve Kafka performance with a higher number of CPU cores, especially for handling increased message throughput.

Software requirements 

Java Runtime Environment (JRE) – Kafka is built on Java, so you need to have a compatible Java Runtime Environment installed on the deployment machines. Kafka typically requires Java 8 or later versions.

Compatible Operating System – Kafka works with multiple operating systems, including Linux, Windows, and macOS.

ZooKeeper (only earlier versions) – Kafka used to rely on Apache ZooKeeper for managing cluster coordination and maintaining metadata. Starting from Kafka version 2.8.0, however, clusters can run without ZooKeeper, and Kafka instead uses its own internal metadata management system. It’s called KRaft!

It’s a little bit more streamlined. In previous versions of Kafka, ZooKeeper was used for a variety of tasks:

    • Maintaining the cluster metadata, such as the list of brokers and topics

    • Electing leaders for partitions

    • Coordinating state changes between brokers

Since Kafka 3.3, however, KRaft has been production-ready as a replacement for ZooKeeper. KRaft is built into the Kafka broker and provides all of the same functionality as ZooKeeper, but it’s more efficient and scalable.

Kafka deployment is now simplified. There’s no longer a need to install and configure ZooKeeper. Instead, you can simply start up your Kafka brokers and they automatically form a cluster.
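
As a rough sketch, assuming a Kafka 3.x download where the sample KRaft configuration lives at config/kraft/server.properties, bringing up a single KRaft broker looks like this: generate a cluster ID, format the storage directories, and start the server.

KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties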

Choose a method of deployment

You can deploy Kafka on your own hardware, on a cloud platform, or on a managed Kafka service.

Install Kafka and ZooKeeper (if necessary)

If you’re deploying Kafka on your own hardware, download the Kafka binary distribution and install it (https://kafka.apache.org/downloads). If you’re using an older version of Kafka, then also install ZooKeeper.

Configure Kafka

Kafka comes with default configuration files located in the config directory that you can modify based on your requirements.

Essentially, you’ll need to edit the server.properties file to configure each Kafka broker. This includes specifying the broker ID, the listener port, the log directories, and the default replication factor. You’ll also need to make sure that you have proper backup and disaster recovery mechanisms in place.
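
As an illustration, the property names below are standard broker settings, but the values are examples only:

broker.id=0
listeners=PLAINTEXT://:9092
log.dirs=/var/lib/kafka/data
num.partitions=3
default.replication.factor=3
log.retention.hours=168

Here broker.id uniquely identifies the broker, listeners sets the port it accepts connections on, log.dirs is where partition data is written to disk, and log.retention.hours controls how long records are kept before deletion.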

Start the Kafka brokers

Open a new terminal window or tab, navigate to the Kafka installation directory, and start the Kafka broker(s) using the provided script. If you have multiple Kafka brokers, you’ll need to start each one with a different server.properties file, specifying a unique broker ID and appropriate configurations.

bin/kafka-server-start.sh config/server.properties

Test Kafka

Kafka is now running on your local machine. You can now create topics, produce messages, and consume messages using the Kafka command-line tools! To see this in practice, test the code snippets below. Just replace ‘my-topic’ with the actual name you selected.

Create a topic: 

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic my-topic --partitions 3

Produce messages: 

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my-topic

Consume messages: 

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --from-beginning

Monitor and tune performance

Track the health, performance, and resource utilisation of your Kafka cluster. To do so, set up metrics collection, logging, and alerting to identify issues.

Check your message production, consumption, and data integrity, and perform load testing to check that your Kafka cluster can handle the right throughput and meet performance requirements.

Additional Apache Kafka deployment resources 

What’s the easiest way to deploy Apache Kafka? 

Typically, a managed service from one of the large cloud vendors.

Instead of setting up and managing Kafka clusters yourself, Kafka as a service lets you use a cloud provider’s infrastructure and expertise to handle the operational details of running Kafka. Essentially, someone else handles the Kafka deployment and management – and it’s pay as you go.

The main options are Confluent, Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

These cloud providers offer the benefits of Kafka while abstracting away the underlying infrastructure management. Each of them, however, has its own unique features and benefits.

Confluent Cloud

From the creators of Apache Kafka, Confluent is a truly cloud-native experience, providing Kafka with a wide range of enterprise-grade features to unleash developer productivity.

Confluent benefits 

The Confluent platform can be deployed across all major cloud vendors, and linked securely to your applications through private networking. Its usability, security, and low cost make it an excellent choice for beginners.

Azure (Microsoft Azure) 

Azure has a comprehensive set of services, but no native Kafka offering. This means you will have to either leverage Confluent or use this Terraform module to deploy it.  It also provides a strong hybrid cloud offering with Azure Arc, allowing you to deploy and manage Kafka across on-premises, multi-cloud, and edge environments.

Microsoft Azure’s benefits 

Azure is considered one of the top cloud providers and is growing rapidly. It has gained popularity, especially among enterprises with existing Microsoft investments, and offers strong integration with Microsoft’s enterprise tools and services.

GCP (Google Cloud Platform)

GCP offers Cloud Pub/Sub, a managed messaging service, which is an alternative to Kafka. Cloud Pub/Sub provides scalable and reliable messaging, but it has some differences in terms of architecture and features compared to Kafka.

As an aside, GCP also provides Apache Kafka as a service through its Cloud Marketplace, enabling you to deploy and manage Kafka clusters on GCP infrastructure.

GCP’s selling points 

GCP has strong capabilities in big data and analytics, which make it suitable for scenarios that require Kafka integration with data processing and machine learning services.

AWS (Amazon Web Services) 

AWS has a wide range of complementary services that integrate well with Kafka, such as Amazon S3 for data storage, Amazon EC2 for customisable infrastructure, and AWS Lambda for serverless event processing.

AWS’s benefits 

As the market leader, it has a significant market share, offering a well-established cloud platform. AWS also has a large customer base, a wide range of reference architectures, and extensive community support.

On AWS, you have two options: self-managed or a managed service.

Self-managed

With this approach, you set up and manage Kafka clusters on AWS EC2 instances. You have full control over the deployment, configuration, and maintenance of Kafka.

Managed service

AWS provides a managed Kafka service called Amazon Managed Streaming for Apache Kafka (MSK). With MSK, AWS handles the underlying infrastructure and operational aspects of Kafka, allowing you to focus on using Kafka without worrying about the cluster management.

    • Create an MSK cluster using the AWS Management Console, AWS CLI, or AWS SDKs (see the CLI sketch after this list).

    • Configure the cluster parameters, including the number of brokers, storage capacity, security settings, and networking.

    • Configure SSL/TLS encryption for network communication and configure authentication mechanisms, such as SASL, to secure client connections to your Kafka cluster.

    • Adjust the MSK cluster settings and configurations as needed. MSK provides a managed control plane for Kafka, allowing you to configure various cluster parameters through the AWS Management Console or APIs.

    • Configure the appropriate Amazon Elastic Block Store (EBS) volumes or other storage options for your Kafka brokers to ensure adequate capacity and performance.

    • Implement monitoring and observability solutions, such as Amazon CloudWatch, to monitor the health, performance, and resource utilisation of your Kafka deployment.
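
For the first of those steps, creating the cluster from the AWS CLI is a single command. A hedged sketch, where my-msk-cluster is a placeholder name and broker-nodes.json is a hypothetical file describing the broker instance type, client subnets, and security groups:

aws kafka create-cluster --cluster-name my-msk-cluster --kafka-version "3.5.1" --number-of-broker-nodes 3 --broker-node-group-info file://broker-nodes.json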

Level up your Kafka

Remove the need to deploy Kafka manually with infrastructure as code (IaC) tools, infrastructure automation tools, and configuration management.

Automate your Kafka deployment in 3 steps 

    1. Define your infrastructure as code

Use a tool like Terraform to define the infrastructure resources needed for your Kafka deployment. This includes virtual machines or containers, networking components, storage resources, and any other dependencies required by Kafka.

    2. Provision the infrastructure

We suggest leveraging a container platform like Kubernetes to provision the defined infrastructure resources. As part of this process, you’ll create and configure the necessary virtual machines or containers, set up networking, and attach storage resources.

    3. Install and configure Kafka

Use an open source operator like Strimzi to automate the installation and configuration of Kafka on the provisioned infrastructure. Strimzi will take care of downloading the Kafka Docker images, setting up the Kafka broker properties, and configuring networking, security settings, and any required dependencies.
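
As a minimal sketch, assuming the Strimzi operator is already installed in your Kubernetes cluster, the Kafka custom resource you apply with kubectl looks something like this (values are illustrative, and Strimzi also supports ZooKeeper-less KRaft clusters via node pools):

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim
      size: 100Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}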

Read More: How to Run Kubernetes With Ease 

Read More: What Does Ansible Do? From Beginner to Expert in 5 Minutes! 

Docker 

While Docker’s not a requirement for running Kafka, it can be a convenient option for deploying and managing Kafka clusters locally. It allows you to package Kafka and its dependencies into containers, which can be easily deployed and scaled across different environments.
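
For local experimentation, a single-node broker can be started with one command. A sketch assuming Docker is installed and using the official apache/kafka image published for recent Kafka releases; this is handy for development, not a production setup:

docker run -d --name kafka -p 9092:9092 apache/kafka:latest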

Terraform 

Terraform is an open-source infrastructure as code (IaC) tool that lets you declaratively define and provision infrastructure resources. With it, you can version control your infrastructure configurations, apply consistent changes, and manage dependencies between resources.
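
As a hedged illustration of the idea rather than a complete Kafka deployment, the snippet below defines a single EC2 instance that could host a broker. The region, AMI ID, and instance size are placeholders:

provider "aws" {
  region = "eu-west-2"
}

resource "aws_instance" "kafka_broker" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "m5.large"
  tags = {
    Role = "kafka-broker"
  }
}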

Read More: Making Terraform Smarter

Ansible 

If you are going down the VM (virtual machine) route, Ansible simplifies how you manage and configure your Kafka infrastructure by helping you define system states and execute tasks across a wide range of systems, including servers, network devices, and cloud environments.

Basically, Ansible uses a human-readable language called YAML to describe automation playbooks. These playbooks are organised collections of instructions that define your ideal configuration and the actions you want executed on target systems. For example, a playbook might include tasks like installing a package, modifying a given configuration, or launching a service.
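
A small sketch of what such a playbook might look like, assuming a hypothetical inventory group called kafka_brokers and a pre-existing kafka systemd service on those hosts:

- name: Prepare Kafka broker hosts
  hosts: kafka_brokers
  become: true
  tasks:
    - name: Install a Java runtime
      ansible.builtin.package:
        name: openjdk-17-jre-headless
        state: present
    - name: Ensure the Kafka service is started and enabled
      ansible.builtin.service:
        name: kafka
        state: started
        enabled: true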

Tips and Tricks

Work with a Kafka specialist 

Find Kafka experts and be clear about your requirements!

First, before you start, it’s important to be clear about what you need. What are your goals for using Kafka? What kind of data do you need to process? What are your performance requirements?

Kafka specialists will also need access to your data and systems in order to do their job, so make sure to set them up with the necessary access permissions.

While you’re doing so, be prepared to troubleshoot. Kafka is a distributed system, and you’ll have times when things go wrong! Don’t let it faze you.

Enable Transport Layer Security (TLS) or Secure Sockets Layer (SSL) encryption

Secure the network communication between clients, brokers, and other components. This prevents eavesdropping and protects data in transit.
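
On the broker side, TLS is configured in server.properties. A sketch with placeholder hostnames, file paths, and passwords:

listeners=SSL://broker1.example.com:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/kafka.broker1.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/var/private/ssl/kafka.broker1.truststore.jks
ssl.truststore.password=changeit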

Configure Kafka’s built-in Access Control Lists (ACLs)

Restrict access to topics, consumer groups, and administrative operations. ACLs let you define granular permissions based on user roles and make sure only authorised users and applications can interact with the Kafka cluster.
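
For example, assuming an authorizer is enabled on the cluster, the following grants a hypothetical ‘analytics’ principal read access to a ‘payments’ topic through a specific consumer group:

bin/kafka-acls.sh --bootstrap-server localhost:9092 --add --allow-principal User:analytics --operation Read --topic payments --group analytics-group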

Implement authentication mechanisms such as SASL (Simple Authentication and Security Layer) or Kerberos 

Verify the identities of clients connecting to Kafka. This ensures that only authenticated and authorised users can access the cluster. This also allows you to implement a quota system if you want to throttle noisy users.
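
Quotas are applied with the kafka-configs.sh tool. A sketch that throttles a hypothetical client ID to roughly 1 MB/s of produce and consume traffic:

bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --add-config 'producer_byte_rate=1048576,consumer_byte_rate=1048576' --entity-type clients --entity-name noisy-client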

Configure firewalls and network segmentation 

Restrict access to Kafka from external networks. Allow only necessary ports and IP ranges to communicate with the Kafka cluster. If Kafka needs to be accessed over the internet or across different networks, consider using a virtual private network (VPN) or secure gateway for encrypted and authenticated access.

Monitor key metrics, logs, and audit trails 

Detect suspicious activity and potential security breaches.

Perform regular security audits of your Kafka cluster 

Identify and address any vulnerabilities or misconfigurations.

    • Properly determine the size and capacity of your Kafka cluster based on your expected workload.

    • Choose an efficient data serialisation format, such as Avro, JSON, or ProtoBuf, to transfer data efficiently between producers and consumers.

    • Distribute data evenly across partitions while considering the access patterns of producers and consumers to create a balanced load and effectively retrieve data.

    • Monitor key metrics such as CPU and memory utilisation, network throughput, disk I/O, and broker lag.

    • Implement replication and backup mechanisms to safeguard against data loss in the event of failures.

    • Perform load testing to simulate real-world scenarios and assess the performance and scalability of your Kafka cluster.

    • Maintain up-to-date documentation to troubleshoot issues, maintain the system, and share knowledge with your fellow developers.

    • Consult official Kafka documentation and stay on top of Kafka releases.

Partner with Experts

What is Kafka consulting?

Kafka consultants usually provide guidance on how to design Kafka-based architectures, select appropriate configurations, and optimise Kafka clusters for performance and scalability.

What’s more, they collaborate with your organisation to understand its specific use cases, requirements, and challenges.

What types of Kafka support can I get? 

Technical support

Kafka experts offer access to a support team that can address technical questions, provide guidance on Kafka best practices, and help troubleshoot issues related to Kafka deployment, configuration, performance, and integration. They’ll help you through channels like email, phone, video consults, or online ticketing systems.

Performance tuning

Experts provide analysis and optimisation recommendations to help you fine-tune your Kafka clusters and achieve optimal throughput, latency, and scalability. For example, they might review your configuration settings, partitioning strategy, replication factors, and other parameters to optimise your team’s Kafka performance.

Upgrades and migration

You can also get help with planning and executing Kafka upgrades to newer versions or migrating from older versions to the latest stable releases. Here, you’ll get guidance on compatibility, feature enhancements, and potential impact on existing deployments.

Training and education

Finally, good Confluent consultants will help your team upskill and learn more about Kafka and event-driven technology so that you can maintain, improve, and troubleshoot your systems not only during deployment, but over time.

Read the official documentation

Start by referring to the documentation provided by Apache Kafka. These docs cover everything from Kafka concepts to architecture, configuration, deployment, and APIs. They’re a good comprehensive guide for beginners and advanced developers alike.

Take online tutorials and courses

A variety of online tutorials and courses can help you understand Kafka in a structured and practical format. Platforms like Udemy, Coursera, and LinkedIn Learning offer courses specifically focused on Kafka, with levels ranging from introductory to advanced.

Attend industry conferences and webinars 

Keep an eye out for Kafka-related conferences, meetups, and webinars. These events often feature talks by Kafka experts, use-case presentations, and discussions on emerging trends and best practices. We loved Kafka Summit London this year, and regularly host Kafka Meetups!

Read More: Kafka Summit London [ OSO’s 7-Step Prep Guide ] 

Check out community resources

Engage with the Kafka community by participating in forums, mailing lists, and discussion groups. The Apache Kafka website, for instance, hosts community resources where you can seek guidance, ask questions, and learn from others’ experiences.

Get Confluent certified 

Confluent certification programs validate your knowledge and expertise in Kafka and event-driven architecture. For the most part, you’ll see two main options for certification: Confluent Certified Developer for Apache Kafka (CCDAK) and Confluent Certified Administrator for Apache Kafka (CCAK).

Wrapping up

After reading our guide to Kafka 101, you now have a comprehensive overview of Kafka’s core concepts, architecture, deployment options, and best practices. Following the steps outlined above, you can start to confidently set up, configure, and use Kafka to meet your organisation’s needs for streaming data.

Whether you’re an executive, developer, or data engineer, embracing Kafka opens up a world of possibilities for building scalable and resilient data-driven applications. Dive into the Kafka ecosystem, experiment with various use cases, and leverage the rich array of tools and integrations available to unlock the full potential of Kafka in your projects.

Download the Ultimate Guide to Kafka Technology 

Download the Kafka-as-a-service Whitepaper

Read more:

How to take your Kafka projects to the next level with a Confluent preferred partner

Event driven Architecture: A Simple Guide

Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation

Successfully Reduce AWS Costs: 4 Powerful Ways

Protecting Kafka Cluster

Apache Kafka Common Mistakes

Kafka Cruise Control 101

Kafka performance best practices for monitoring and alerting

Real-time Push APIs Using Kafka 

The new consumer rebalance protocol KIP-848

Get started with emerging technologies today

Have a conversation with one of our experts to discover how we can work with you to adopt emerging technologies to keep your business growing.

Book a call