blog by OSO

Machine Learning with Apache Kafka

Jan 15 April 2024
Machine learning with Kafka

With all the generative AI hype around multi-model LLMs, everyone is forgetting the power of a small, highly customised model – one which is trained on your data. Leveraging Apache Kafka and machine learning (ML), we can combine these to create a powerful streaming analytics solution. ML involves two main steps: model building and model deployment. Model building includes data ingestion, pre-processing, and training, while model deployment involves real-time scoring and predictions. Kafka, with its data integration and processing capabilities, is an ideal fit for both stages of the ML pipeline.

How to build streaming architecture which uses ML

  1. Ingest data from car sensors into Kafka using technologies like MQTT.
  2. Pre-process the data using Kafka native technologies like Kafka Streams or Apache Flink.
  3. Train the model using machine learning frameworks like TensorFlow, leveraging the extreme scale of the cloud and storing the pre-processed data in a data store like Google Cloud Storage.
  4. Deploy the trained model using technologies like Apache Flink or Kafka Streams, embedding the model directly into the streaming application.
  5. Access historical data for reprocessing and analytics purposes, using tiered storage solutions like Confluent Storage for Kafka.
  6. Continuously process data in real-time and access old events to gain valuable insights and improve analytics capabilities.
  7. Use Kafka’s scalability and reliability to process millions of messages per second at scale.
  8. Build additional consumer applications, handle errors, and ensure compliance and regulatory requirements are met.
  9. Separate model deployment from model training, embedding the model directly into the streaming application for real-time, scalable, and reliable deployment.

Data Ingestion and Pre-processing

Before training a model, it is essential to ingest and pre-process the data. Kafka can be used to ingest data from various sources, such as IoT devices, using something like MQTT Kafka Connector. This cluster can be located at the edge, in a data centre, or in the cloud. Also, Kafka can integrate with other Kafka clusters, allowing for data transfer between production and analytics clusters.

Once the data is ingested into the analytics kafka cluster, pre-processing is performed using Kafka native technologies like Kafka Streams or Apache Flink. These technologies enable filtering, transformations, and anonymization of the data. For more complex pre-processing tasks, a Python framework can be used in conjunction with Kafka native technologies. This allows for scalable and reliable pre-processing of data in real-time.

Model Training

After preprocessing the data, it is ready for model training. While Kafka is not directly involved in model training, it plays a crucial role in storing and accessing the data required for training. The pre-processed data can be ingested into a data store like Google Cloud Storage. From there, machine learning frameworks like TensorFlow can be used to train the model. In the example given, an autoencoder model is trained to detect anomalies in the vibration spikes of motor engines in cars.

There are different approaches to model training. One approach is to ingest the data into a data lake, such as Hadoop or S3, and use ML services or Spark ML to train the model. Another approach is to perform streaming ingestion and model training, where the machine learning framework directly consumes data from Kafka. This simplifies the architecture by eliminating the need for an additional data lake. TensorFlow IO can be used to consume data from Kafka feeds, enabling real-time processing of millions of messages at scale.

Model Deployment

Once the model is trained, the next step is model deployment. Model deployment is separate from model training and can be done in various ways. One option is to use an external model server like TensorFlow Serving, which involves making RPC calls between the streaming Kafka application and the model server. While this approach works for some use cases, it may not be ideal for high-scale, connected infrastructures.

Example IoT anomaly detection use case

To illustrate the power of combining Kafka and machine learning, let’s consider an example of anomaly detection in a connected car infrastructure. In this scenario, car sensors send data to Kafka using technologies like MQTT. The data is then processed using Apache Flink to train an autoencoder model in TensorFlow. The trained model is deployed using Flink, where it is applied to sensor values to detect anomalies in vibration spikes. This simple query, powered by Kafka under the hood, can process millions of messages per second at scale and reliably detect anomalies in real-time.We have lots of these examples on our GitHub, please reach out to us if you have any other use-cases in mind.

Fore more content:

How to take your Kafka projects to the next level with a Confluent preferred partner

Event driven Architecture: A Simple Guide

Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation

Successfully Reduce AWS Costs: 4 Powerful Ways

Protecting Kafka Cluster

Apache Kafka Common Mistakes

Kafka Cruise Control 101

Kafka performance best practices for monitoring and alerting

How to build a custom Kafka Streams Statestores

Get started with OSO professional services for Apache Kafka

Have a conversation with a Kafka expert to discover how we help your adopt of Apache Kafka in your business.

Contact Us