
Kafka and OSO

What is Apache Kafka?

The streaming platform that enables companies of all sizes to build scalable, fault-tolerant, and real-time data pipelines.

Contact Us
An Introduction to Kafka

What is Kafka and how does Kafka work?

Apache Kafka is a distributed streaming platform that is used to build real-time data pipelines and streaming applications. It was originally developed by engineers at LinkedIn and was later open-sourced to the Apache Software Foundation.


Kafka is designed to handle high-volume, high-velocity data in a wide variety of formats (including binary), and enables users to store, process, and publish data in real time.


At its core, Kafka is a messaging system that uses a publish-subscribe model to allow producers to send messages to a set of brokers. Messages are written to these Kafka brokers, which are distributed across a cluster of servers, and can be read in real time by subscribers (consumers). Kafka is highly scalable, fault-tolerant, and offers many built-in features for managing multiple versions of a data source and processing them efficiently.

Kafka this, Kafka that

Why are there so many flavours of Kafka?

The core of Kafka provides the fundamental features such as brokers, topics, producers, and consumers. This comes from the Apache Kafka project, maintained by the Apache Software Foundation. This is typically referred to as Open Source Kafka or Apache Kafka.


However, there are different vendors that package Kafka with additional features, tools, and support options. Some popular Kafka distributions include Confluent Platform, Confluent Cloud, Cloudera, IBM Cloud Pak for Integration, and Amazon MSK. We’ll explore these in more detail later.

Open source

What is the Apache Software Foundation?

The Apache Software Foundation (ASF) is a non-profit organisation that provides support for a range of open source software projects, including Apache Kafka. ASF also provides various resources, infrastructure, and governance for more than 350 Apache projects. As a result, Apache Kafka benefits from the foundation’s extensive community support and open-source development model, which has helped it become a widely used and popular distributed streaming platform.

Sign up to the Kafka Report

Keeping you up-to-date with the latest news and events from the world of Kafka and event streaming

We will only use your details for sending you our newsletter and occasional OSO marketing emails. You can unsubscribe at any time.

Kafka Architecture

The building blocks of Kafka Architecture

Kafka’s architecture is based on a distributed model that enables it to handle large amounts of data across multiple machines with no single point of failure. The key components of Kafka architecture include Kafka brokers, producers, consumers, and topics.


Kafka brokers are the servers that communicate with each other to manage the distribution of messages across the Kafka cluster. Producers connect to the brokers and write messages to the cluster, where they are stored in partitions.


Consumers read messages from the partitions stored on the brokers, and a consumer group can read messages from multiple partitions. One of the key features of Kafka’s architecture is that it supports horizontal scaling by allowing new brokers to be added to the cluster as the data volume grows without needing to notify either producers or consumers.


Additionally, Kafka’s architecture has built-in fault tolerance, which ensures that if a broker goes down, data can still be read and written from other brokers. With its highly scalable and fault-tolerant architecture, Kafka has become a go-to technology for building real-time data pipelines and applications that require handling large volumes of data distributed across multiple servers with zero downtime.
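To make these building blocks concrete, here is a minimal Java sketch that creates a topic using Kafka’s AdminClient. The broker address, topic name, partition count, and replication factor are illustrative assumptions; a replication factor of 3 needs at least three brokers in the cluster.

```java
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a broker is reachable on localhost:9092 (illustrative)
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread the load across brokers; a replication
            // factor of 3 keeps a copy of each partition on three brokers
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Set.of(orders)).all().get();
        }
    }
}
```

Each of the six partitions can then be read in parallel by the members of a consumer group, and because each partition is copied to three brokers, the topic survives the loss of a broker.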

Learn more about Kafka

Take a look at some of our blog posts explaining the main concepts of Kafka and Event Streaming

Kafka Use Cases

Examples of how companies use Kafka to solve real-world problems

Kafka’s versatility in event streaming can help companies solve real-world problems by building scalable and fault-tolerant data pipelines for various applications, such as real-time analytics, IoT sensors, and machine learning. Kafka can be used to process high-volume data streams such as website clickstreams, user logs, and financial transactions, to name just a few.

  1. Retail Industry

Kafka can be used to improve real-time inventory tracking and supply chain management, which helps retailers ensure that they always have the right products in stock for their customers. By processing real-time sales data and connecting it with supply chain systems, Kafka can notify retailers which products are selling well and need to be restocked.

  2. Healthcare Industry

Kafka can be used to improve patient care by collecting and processing real-time health data streams from different sources. This data can then be analysed to provide doctors with insights into a patient’s health status and help them make better-informed decisions. Here is a blog by Kai Waehner with some real-world Healthcare Kafka examples.

  3. Finance Industry

Kafka can be used to detect fraudulent transactions in real time by processing large volumes of financial data from multiple sources simultaneously. This helps banks and financial institutions to quickly identify suspicious behaviour and mitigate losses.

  4. Transportation Industry

Kafka can be used to improve route optimisation and logistics by processing real-time data from GPS sensors installed in vehicles. This helps transportation companies to optimise their routes and schedules, reducing fuel consumption and costs while improving customer satisfaction.

  5. Gaming Industry

Kafka can be used to build real-time multiplayer gaming applications that require low-latency communication between players. By processing real-time game events, Kafka ensures that all players stay in sync, providing a smooth and uninterrupted gaming experience.


In all these scenarios, Kafka is used to handle large volumes of different types of data, process real-time data streams, provide insights, and improve decision-making, making it an essential tool for businesses across different industries that wish to improve their real-time data processing capabilities.


At OSO, we have helped dozens of companies to implement Kafka and provide a customised solution for their specific use cases, such as transforming the way data is consumed, simplifying data processing, and improving overall data governance.


Check out some examples of OSO customer stories to learn more about how companies are using Kafka to solve their data pipeline challenges.

Set up Kafka

Getting Started with Apache Kafka: a beginner’s guide to installing and working with Kafka

    1. Download and Install Kafka:
      Visit the Apache Kafka website and download the appropriate Kafka distribution package for your operating system. Once downloaded, extract the contents of the package and set up the configuration files.

    2. Start a Kafka Cluster:
      To start Kafka, you need to start the ZooKeeper service and the Kafka broker service. These services can be started using Kafka’s command-line interfaces, namely bin/zookeeper-server-start.sh and bin/kafka-server-start.sh.

    3. Create a Topic:
      To work with Kafka, you need to create a topic, which is essentially a stream of events. Use the command-line tool bin/kafka-topics.sh to create a topic, specifying the topic name, the number of partitions, and the replication factor.

    4. Produce Messages:
      Use the Kafka producer API to start producing messages. In any language that Kafka supports, such as Java or Python, create a producer client that sends messages to the Kafka topic (see the sketch after this list).

    5. Consume Messages:
      Using the Kafka consumer API in your preferred language, create a consumer that subscribes to the same Kafka topic and reads the messages that the producer sends.

    6. Monitor and Manage Kafka:
      Kafka provides a few command-line tools and web UIs to help you monitor and manage Kafka clusters.
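The produce and consume steps above can be sketched in Java with Kafka’s client APIs. This is a minimal example, assuming a broker on localhost:9092 and a topic named quickstart-events created as in step 3; all names and settings are illustrative.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceAndConsume {
    public static void main(String[] args) {
        // Producer: send one string message to the topic
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("quickstart-events", "key1", "hello kafka"));
        } // close() flushes any buffered messages

        // Consumer: join a consumer group and read from the earliest offset
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "quickstart-group");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("quickstart-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```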


Finally, why not sign up to the Kafka Report, a newsletter from OSO that keeps you up to date with all things Kafka.



Options

Which type of Kafka is best for you?

There are many different distributions of Kafka available, and they are typically categorised based on their deployment options, features, and support models. Here are the 5 we get asked about most often:

Apache Kafka

As we’ve already discussed, Apache Kafka is the core open source version of Kafka. Because it is open source, it offers a high degree of customisation and flexibility. Its key disadvantage is that it requires a significant amount of technical expertise to install and maintain.


Confluent Platform

Confluent Platform is a commercial distribution of Kafka that offers enterprise-grade features and support for Kafka clusters. It includes some additional features beyond Apache Kafka, such as Confluent Schema Registry, a UI for managing Kafka topics and consumer groups, and more. The security module and audit trail are very popular features in financial organisations. However, it requires a licence and may not be suitable for organisations with limited budgets.


Confluent Cloud

Confluent Cloud is a fully managed cloud-based service that offers many of the features of Confluent Platform, with the added benefits of automated deployment and scaling. It is suitable for organisations of all sizes, and for basic setups it requires only an intermediate level of technical expertise to set up and manage. However, it may not be suitable for enterprises that need to maintain sovereignty over their data.


Amazon Managed Streaming for Apache Kafka (MSK)

Amazon MSK is a fully managed Kafka service designed for use with Amazon Web Services (AWS). It eliminates the need to manually set up and manage Kafka clusters, which can save time and resources. However, it is a cloud-based service, which may not be suitable for organisations with specific security or compliance requirements.


Aiven for Apache Kafka

Aiven for Apache Kafka is a fully managed cloud-hosted version of Apache Kafka that simplifies deployment and management of Kafka clusters. It offers automatic scaling, backups, and a user-friendly interface for managing topics and consumers. Aiven also provides 24/7 customer support and uses industry-standard encryption for security.


In general, the choice between these Kafka distributions will depend on the specific needs and preferences of your organisation. In our experience, organisations gravitate towards fully managed solutions as the complexity and business criticality of their real-time data streaming platform increases.

Grow

Scaling with Kafka: how Kafka enables elastic scaling for handling large volumes of data

Scaling with Kafka is an essential aspect of the core functionality of the technology. As the volume of data grows, the system needs to accommodate the increased traffic by scaling, which means adding more computing and storage resources. Apache Kafka allows elastic scaling of your data streams by horizontally adding or removing brokers, increasing the number of partitions, or replicating your data across the cluster. It provides a distributed infrastructure, reliable data streaming, and fault tolerance, with load-balancing and rebalancing capabilities.


However, scaling with Kafka can pose hurdles, such as unbalanced loads and network congestion, especially if you scale too rapidly without a proper plan or monitoring. Adding a broker is not as easy as scaling up a pod in Kubernetes. Using a managed service will reduce many of the associated headaches, but for many companies, self-hosting is the only option available due to internal policies.
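As a small example of planned, incremental scaling, the sketch below uses Kafka’s Java AdminClient to grow a topic’s partition count; the broker address and topic name are hypothetical.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class ExpandTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the "orders" topic to 12 partitions. Partitions can only
            // ever be increased, and adding partitions changes how keys are
            // hashed to partitions, so plan such changes carefully.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```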


Additionally, monitoring the key metrics and vital signs of your cluster, using a Kubernetes operator or monitoring tools such as Prometheus and Grafana, can reveal problems in time and provide insights into how to correct them.

Problems scaling Kafka?

Our Kafka experts are here to help

We offer support and consultancy services on Apache Kafka. Our team can help you avoid common pitfalls and provide solutions and recommendations to improve your Kafka cluster’s performance, optimise configurations, monitor the health of your cluster, and ensure that your data streams are reliable and fault tolerant.

Find out about our Kafka services
Integrate with Kafka

Kafka Ecosystem: an overview of tools and technologies that integrate with Kafka

The Kafka ecosystem comprises various tools and technologies that integrate with Kafka to extend its functionality and provide solutions catering to specific use cases.

Kafka Streams

Kafka Streams is a powerful library for building real-time event streaming applications on Apache Kafka. It simplifies the development of stateful stream processing applications by providing a simple, easy-to-use set of APIs that allows developers to write streaming applications using the same constructs as batch processing.

It allows developers to perform a wide variety of stream processing operations, from filtering and transformation to complex aggregations and joins, in real-time. It is tightly integrated with Kafka’s partitioning and replication features, which allows for scalable and fault-tolerant stream processing.

Kafka Streams also supports windowing operations, which enable developers to perform calculations over time-based windows (e.g., sliding, tumbling) or session-based windows bounded by gaps of inactivity. Additionally, Kafka Streams provides a range of built-in serialisers and deserialisers to cover the most common data formats, and it also allows for custom serialisation.
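As an illustration, here is a minimal Kafka Streams sketch that filters click events and counts them per user over five-minute tumbling windows. The broker address and the clicks and click-counts topics are hypothetical.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class ClickCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Keys are user ids, values are the page clicked (assumed format)
        KStream<String, String> clicks = builder.stream("clicks");

        clicks.filter((user, page) -> page != null)       // drop empty events
              .groupByKey()                                // group by user id
              .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
              .count()                                     // clicks per user per window
              .toStream()
              .map((windowedUser, count) ->
                      KeyValue.pair(windowedUser.key(), count.toString()))
              .to("click-counts");                         // write results downstream

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```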

Kafka Connect

Kafka Connect is a scalable and fault-tolerant tool for streaming data between external systems and Apache Kafka. It simplifies the process of integrating Kafka with other data sources, such as databases, filesystems, and message queues, by providing a simple configuration-based framework for creating and managing connectors.

Kafka Connect is built on a distributed, fault-tolerant architecture, which allows it to scale horizontally and handle high-throughput data ingestion independently of Kafka brokers. It also features a REST API for configuring and monitoring connectors, and it provides a range of pre-built connectors for popular data sources, such as JDBC, Elasticsearch, and HDFS.
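As a sketch of that configuration-based approach, the example below registers the FileStreamSource connector that ships with Kafka by posting JSON to the Connect REST API (port 8083 by default). The connector name, file path, and topic are illustrative, and the snippet assumes Java 15+ for the text block.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Connector config: tail /tmp/input.txt into the "file-lines" topic
        String json = """
            {
              "name": "file-source-demo",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "file": "/tmp/input.txt",
                "topic": "file-lines"
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // default Connect REST port
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Because the connector is just configuration, the same pattern applies to any source or sink connector; only the connector.class and its settings change.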

Kafka REST Proxy

Kafka REST Proxy is a RESTful interface that allows external systems to interact with Apache Kafka over HTTP or HTTPS. It sits between client applications and Kafka clusters and translates HTTP requests into Kafka’s native producer and consumer protocols.

Kafka REST Proxy simplifies the process of integrating Kafka with web-based or cloud-native applications by providing a simple, standardised API for publishing or consuming Kafka messages. It supports a range of message formats, including Avro, JSON, and binary data, and provides built-in support for schema registry.

By using Kafka REST Proxy, client applications can easily interact with Kafka from any programming language or platform that supports HTTP. It also provides an additional layer of security by enforcing access control policies and enabling secure communication over HTTPS. Developers can also gain exposure to Kafka in a more familiar format via RESTful interfaces.
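A minimal Java sketch of producing a message through the REST Proxy, assuming Confluent’s v2 REST Proxy API on its default port 8082; the orders topic and payload are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProduce {
    public static void main(String[] args) throws Exception {
        // One JSON-encoded record; the proxy forwards it to the "orders" topic
        String body = "{\"records\":[{\"value\":{\"orderId\":42,\"status\":\"NEW\"}}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8082/topics/orders")) // default REST Proxy port
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```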

Schema Registry

Schema Registry is a tool for managing schemas in Apache Kafka that provides a centralised store for schema files. It enables producers and consumers to agree on a standardised format for the data being sent or received from the Kafka cluster. This helps to ensure interoperability between different applications and enables developers to evolve data schemas over time without breaking existing data pipelines.

Schema Registry supports multiple schema formats, such as JSON Schema, Avro, and Protobuf, and stores schema information in a schema registry topic on the Kafka broker. When publishing or consuming messages, the schema information is checked and validated against the registry. If the schema is not compatible, the producer or consumer is notified with an error.
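A minimal Java producer sketch using Confluent’s KafkaAvroSerializer together with Schema Registry, assuming a registry on its default port 8081; the Order schema and topic name are illustrative.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProduceExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // KafkaAvroSerializer registers the schema with Schema Registry and
        // validates each record against it before it reaches the broker
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // default registry port

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
                + "{\"name\":\"orderId\",\"type\":\"int\"},"
                + "{\"name\":\"status\",\"type\":\"string\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("orderId", 42);
        order.put("status", "NEW");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "42", order));
        }
    }
}
```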

ksqlDB

ksqlDB is a powerful open-source database built on top of Apache Kafka that provides a SQL-like interface for real-time stream processing. It simplifies the process of building real-time, event-driven applications by enabling developers to write queries and transformations using a familiar SQL syntax.

ksqlDB allows developers to create tables, define schemas, and perform complex aggregations and joins on real-time streams of data. It is designed to be scalable and fault-tolerant, and it uses Kafka’s distributed architecture to provide high-throughput data processing. It also integrates seamlessly with Kafka ecosystem components such as Kafka Connect or Kafka Streams.

However, be warned that there are some limitations and nuances to be aware of when compared to a traditional SQL database.
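As a sketch, the example below uses the ksqlDB Java client (io.confluent.ksql.api.client), assuming a ksqlDB server on its default port 8088 and a pre-existing clicks topic; the stream definition and query are illustrative.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;
import io.confluent.ksql.api.client.Row;
import io.confluent.ksql.api.client.StreamedQueryResult;

public class KsqlExample {
    public static void main(String[] args) throws Exception {
        ClientOptions options = ClientOptions.create()
                .setHost("localhost")
                .setPort(8088); // default ksqlDB server port
        Client client = Client.create(options);

        // Declare a stream over an existing Kafka topic
        client.executeStatement(
                "CREATE STREAM clicks (user_id VARCHAR, url VARCHAR) "
                + "WITH (KAFKA_TOPIC='clicks', VALUE_FORMAT='JSON');").get();

        // Push query: keeps delivering rows as new events arrive
        StreamedQueryResult result = client.streamQuery(
                "SELECT user_id, url FROM clicks EMIT CHANGES;").get();
        Row row = result.poll();
        System.out.println(row.values());

        client.close();
    }
}
```

Note that EMIT CHANGES makes this a continuous push query over the stream, in contrast to the one-shot SELECT of a traditional SQL database.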

Conduktor

An end-to-end suite of solutions for Kafka developers which includes Kafka management, testing, monitoring, data quality, and data governance. It allows developers to interact with the entire Kafka ecosystem such as Brokers, Topics, Consumers, Producers, Kafka Connect, and Confluent Schema Registry.

SpecMesh

An exciting new open source project that aims to leverage well-established Kafka design principles to help the Data Mesh community.

Kafka Tips

Best Practices for Kafka: Tips and tricks for using Kafka effectively and efficiently

Design for scalability

Kafka is designed to be horizontally scalable, which means it’s important to design your Kafka-based solution to allow for expansion. This includes choosing the right hardware, partitioning your data effectively, and monitoring your system to identify bottlenecks and add capacity as necessary.

Use the right data serialisation format

To ensure that your messages can be effectively processed and understood by all systems in your Kafka ecosystem, it’s important to use the right data serialisation format. Avro and JSON are popular serialisation formats, and choosing one that is compatible with your entire stack will make it easier to process messages with any tool.

Monitor and optimise your Kafka cluster

Kafka works best when it is optimised for your specific use case. This means monitoring your Kafka cluster to identify performance bottlenecks, tuning your Kafka settings, and optimising your Kafka-based applications to make the most efficient use of resources. By paying close attention to performance and fine-tuning your system regularly, you can ensure that you are getting the most out of your Kafka-based solution.

Is Kafka the best option?

Alternatives to Apache Kafka

While Kafka is a popular and robust distributed streaming platform, there are alternative solutions available for streaming data pipelines and messaging, such as Apache Pulsar, RabbitMQ, and Amazon Kinesis.

Talk to us about Kafka
FAQ

Frequently Asked Questions about Kafka: answers to common questions about Kafka

What are the disadvantages of Kafka?

One of the significant disadvantages or challenges of using Kafka can be its complexity. Kafka, being a distributed system with many moving parts, requires a high level of technical expertise, and it can be challenging to set up and maintain. Building event-driven applications often also requires a different set of technical solutions. Configuring Kafka optimally and monitoring its health add extra complexity to Kafka operations. Another concern is the steep learning curve associated with its programming APIs, particularly for new users, who may take time to familiarise themselves with Kafka.

Is Kafka difficult to work with?

Kafka can seem difficult because it is a complex and robust distributed system designed to handle large volumes of data in near-real-time. It has many moving parts, including brokers, producers, consumers, topics, and partitions, and requires a deep understanding of its architecture to implement and manage effectively.

Additionally, using Kafka requires a high level of technical expertise, making it challenging for new users to get started. Kafka has several programming interfaces and APIs that require an understanding of distributed systems programming paradigms such as stateful stream processing and message queuing.

Kafka also has its unique jargon and concepts, such as topics and partitions, which can seem foreign to those new to the platform. This can make writing Kafka-specific code frustrating for developers who are new to the platform, and some may require additional training to become proficient with Kafka development and maintenance.

However, despite these challenges, once properly set up and configured, Kafka offers significant benefits and value for streaming data processing, data integration, and analysis. Building your architecture in a decoupled way allows for rapid development in isolation from other teams. With tailored training and genuine expertise, such as that offered by platform providers, some of the barriers to getting started with Kafka can be reduced, giving users the ability to reap the benefits of a reliable and scalable data streaming platform.

Is Kafka a Database or a messaging system?

Kafka is not a database in the traditional sense. It is instead a distributed streaming platform designed to handle large volumes of data in real time. Although Kafka is similar to a traditional message broker, it can also function as a robust distributed messaging system. Rather than serving as a general-purpose data store, Kafka acts as a durable, distributed log that can reliably transport data streams between applications and services.

Kafka does not have some of the features typically found in databases, such as tables, indexes, or data querying capabilities. Instead, Kafka stores data in a log data structure, where data is appended to the end of the log as messages. Kafka topics can be partitioned to support parallel processing and provide horizontal scalability, which is generally not feasible with traditional databases.

In summary, Kafka can be used as part of an overall data processing architecture, and while it can store data, it is not a traditional database management system (DBMS). If a user requires a database, they can integrate data stored in Kafka with other systems, such as Hadoop, Spark, or NoSQL databases, for further storage, processing, or analysis.

What is MSK on AWS?

Amazon MSK stands for “Managed Streaming for Apache Kafka”, and it is a fully managed service provided by Amazon Web Services (AWS) that allows users to use Apache Kafka as a streaming platform without having to worry about the overhead of managing the underlying infrastructure. MSK provides a ready-to-use Kafka cluster that can handle large volumes of data and scale horizontally by adding or removing nodes according to the required throughput.

Amazon MSK provides a range of benefits, including:

  1. Fully managed service: with Amazon MSK, users don’t need to worry about setting up and maintaining Kafka infrastructure, which can significantly reduce operational overheads.

  2. Elastic scalability: Amazon MSK’s elasticity is flexible, meaning it can be scaled-up or scaled-down on-demand, based on users’ needs.

  3. Security: MSK is built with security in mind, with support for encryption in-transit and at rest, AWS Identity and Access Management (IAM), and Virtual Private Cloud (VPC) integration.

  4. High availability: Amazon MSK replicates data across multiple availability zones and automatically detects and replaces unhealthy brokers, helping to ensure that data streams remain available and recoverable.

  5. Integration with AWS services: Amazon MSK is tightly integrated with other AWS services, such as Amazon S3, Amazon EMR, Amazon Redshift, and AWS Lambda.

In summary, Amazon MSK is designed to provide customers with the standard functionality and cluster resilience of Kafka, but with reduced overhead, flexible scalability options, and close integration with AWS cloud services.

Accelerate your Kafka adoption with the right support

Get in touch with us today to discover how you can leverage your data to build a faster, more responsive business

CONTACT US