Amid all the generative AI hype around multimodal LLMs, it is easy to forget the power of a small, highly customised model trained on your own data. By combining Apache Kafka with machine learning (ML), we can build a powerful streaming analytics solution. ML involves two main steps: model building and model deployment. Model building includes data ingestion, pre-processing, and training, while model deployment involves real-time scoring and predictions. With its data integration and processing capabilities, Kafka is an ideal fit for both stages of the ML pipeline.
How to build a streaming architecture that uses ML
Ingest data from car sensors into Kafka using technologies like MQTT.
Pre-process the data using Kafka native technologies like Kafka Streams or Apache Flink.
Train the model using machine learning frameworks like TensorFlow, leveraging the extreme scale of the cloud and storing the pre-processed data in a data store like Google Cloud Storage.
Deploy the trained model using technologies like Apache Flink or Kafka Streams, embedding the model directly into the streaming application.
Access historical data for reprocessing and analytics purposes, using tiered storage solutions such as Confluent's Tiered Storage for Kafka.
Continuously process data in real-time and access old events to gain valuable insights and improve analytics capabilities.
Use Kafka’s scalability and reliability to process millions of messages per second at scale.
Build additional consumer applications, handle errors, and ensure compliance and regulatory requirements are met.
Separate model deployment from model training, embedding the model directly into the streaming application for real-time, scalable, and reliable deployment.
Data Ingestion and Pre-processing
Before training a model, it is essential to ingest and pre-process the data. Kafka can ingest data from various sources, such as IoT devices, via a connector like the MQTT Kafka connector. The ingesting Kafka cluster can be located at the edge, in a data centre, or in the cloud. Kafka clusters can also replicate data between one another, allowing transfer from a production cluster to an analytics cluster.
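To make the ingestion step concrete, here is a minimal sketch of a bridge from MQTT into Kafka using the paho-mqtt and kafka-python libraries. In production a Kafka Connect MQTT source connector would typically do this job; the topic layout (`cars/<id>/sensors`), topic names, and broker addresses below are placeholder assumptions for illustration.

```python
# Sketch: bridge MQTT sensor messages into a Kafka topic.
# Topic names, broker addresses, and the MQTT topic layout are assumptions.

def to_kafka_record(mqtt_topic: str, payload: bytes) -> tuple[bytes, bytes]:
    """Key the Kafka record by car ID, assuming MQTT topics like 'cars/<id>/sensors'."""
    car_id = mqtt_topic.split("/")[1]
    return car_id.encode(), payload

if __name__ == "__main__":
    import paho.mqtt.client as mqtt
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def on_message(client, userdata, msg):
        key, value = to_kafka_record(msg.topic, msg.payload)
        producer.send("car-sensors.raw", key=key, value=value)

    client = mqtt.Client()  # paho-mqtt 2.x also expects a mqtt.CallbackAPIVersion argument
    client.on_message = on_message
    client.connect("localhost", 1883)
    client.subscribe("cars/+/sensors")
    client.loop_forever()
```

Keying records by car ID keeps each vehicle's events in order within a Kafka partition, which matters for the windowed processing later in the pipeline.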
Once the data is ingested into the analytics Kafka cluster, pre-processing is performed using Kafka-native technologies like Kafka Streams or Apache Flink. These technologies enable filtering, transformation, and anonymisation of the data. For more complex pre-processing tasks, a Python framework can be used in conjunction with Kafka-native technologies. This allows for scalable and reliable pre-processing of data in real time.
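A minimal sketch of such a pre-processing step, assuming JSON sensor events with hypothetical fields `vin`, `vibration`, and `ts`. kafka-python stands in for illustration; Kafka Streams or Flink would play the same role in production:

```python
# Sketch: filter malformed events and anonymise the vehicle ID.
# Field names, topic names, and brokers are illustrative assumptions.
import hashlib
import json

def preprocess(event):
    """Drop incomplete readings and replace the VIN with a one-way hash."""
    if "vibration" not in event or "vin" not in event:
        return None  # filter out malformed events
    return {
        # SHA-256 hash so downstream analytics cannot recover the raw VIN
        "vehicle": hashlib.sha256(event["vin"].encode()).hexdigest()[:12],
        "vibration": float(event["vibration"]),
        "ts": event.get("ts"),
    }

if __name__ == "__main__":
    from kafka import KafkaConsumer, KafkaProducer
    consumer = KafkaConsumer("car-sensors.raw", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for msg in consumer:
        out = preprocess(json.loads(msg.value))
        if out is not None:
            producer.send("car-sensors.clean", json.dumps(out).encode())
```

The same filter/map logic translates directly into a Kafka Streams topology or a Flink job; the pure `preprocess` function is the part worth unit-testing either way.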
Model Training
After pre-processing the data, it is ready for model training. While Kafka is not directly involved in model training, it plays a crucial role in storing and accessing the data required for training. The pre-processed data can be ingested into a data store like Google Cloud Storage. From there, machine learning frameworks like TensorFlow can be used to train the model. In the example given, an autoencoder model is trained to detect anomalies in the vibration spikes of motor engines in cars.
There are different approaches to model training. One approach is to ingest the data into a data lake, such as Hadoop or S3, and use ML services or Spark ML to train the model. Another approach is to perform streaming ingestion and model training, where the machine learning framework directly consumes data from Kafka. This simplifies the architecture by eliminating the need for an additional data lake. TensorFlow IO can be used to consume data from Kafka feeds, enabling real-time processing of millions of messages at scale.
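For the autoencoder mentioned above, training can be sketched in TensorFlow/Keras as follows. The window size and layer widths here are illustrative assumptions, not tuned values, and the training data would come from the pre-processed store (e.g. Google Cloud Storage) or a Kafka feed:

```python
# Sketch: a small autoencoder for vibration anomaly detection.
# Trained only on normal windows, so anomalous spikes reconstruct poorly.
import numpy as np
import tensorflow as tf

WINDOW = 30  # assumed number of consecutive vibration readings per example

def build_autoencoder(window=WINDOW):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window,)),
        tf.keras.layers.Dense(16, activation="relu"),  # encoder
        tf.keras.layers.Dense(4, activation="relu"),   # bottleneck
        tf.keras.layers.Dense(16, activation="relu"),  # decoder
        tf.keras.layers.Dense(window),                 # reconstruction
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def train(normal_windows):
    """Fit the autoencoder to reconstruct normal vibration windows."""
    model = build_autoencoder(normal_windows.shape[1])
    model.fit(normal_windows, normal_windows, epochs=5, batch_size=32, verbose=0)
    return model
```

After training, the reconstruction error on a new window serves as the anomaly score: normal windows reconstruct well, vibration spikes do not.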
Model Deployment
Once the model is trained, the next step is model deployment. Model deployment is separate from model training and can be done in various ways. One option is to use an external model server like TensorFlow Serving, with the streaming Kafka application making an RPC call to the model server for every prediction. While this approach works for some use cases, the extra network hop adds latency and a point of failure, so it may not be ideal for high-scale, connected infrastructures. The alternative is to embed the trained model directly into the streaming application, so that scoring happens in-process with no remote dependency.
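The embedded approach can be sketched like this: the model is loaded once at startup and applied in-process to a sliding window of sensor values, with no RPC to a model server. The window length, threshold, topic names, and model path are placeholder assumptions:

```python
# Sketch: embed the trained model directly in the streaming consumer.
# WINDOW and THRESHOLD are illustrative assumptions, not tuned values.
import collections
import json
import numpy as np

WINDOW = 30        # must match the window length used in training
THRESHOLD = 0.5    # reconstruction-error threshold for flagging anomalies

def reconstruction_error(model, window):
    """MSE between a vibration window and the model's reconstruction of it."""
    recon = model(window.reshape(1, -1))
    return float(np.mean((np.asarray(recon).ravel() - window) ** 2))

def is_anomaly(model, window, threshold=THRESHOLD):
    return reconstruction_error(model, np.asarray(window, dtype="float32")) > threshold

if __name__ == "__main__":
    import tensorflow as tf
    from kafka import KafkaConsumer, KafkaProducer
    model = tf.keras.models.load_model("autoencoder.keras")  # placeholder path
    consumer = KafkaConsumer("car-sensors.clean", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    buf = collections.deque(maxlen=WINDOW)
    for msg in consumer:
        buf.append(json.loads(msg.value)["vibration"])
        if len(buf) == WINDOW and is_anomaly(model, list(buf)):
            producer.send("anomalies", msg.value)
```

Because the model lives inside the consumer, it scales the same way the Kafka application does: add consumer instances in the same consumer group and each one scores its own partitions independently.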
Example IoT anomaly detection use case
To illustrate the power of combining Kafka and machine learning, let's consider anomaly detection in a connected car infrastructure. In this scenario, car sensors send data to Kafka via MQTT. The data is pre-processed with Apache Flink and used to train an autoencoder model in TensorFlow. The trained model is then deployed with Flink, which applies it to incoming sensor values to detect anomalies in vibration spikes. This pipeline, powered by Kafka under the hood, can process millions of messages per second at scale and reliably detect anomalies in real time. We have lots of these examples on our GitHub; please reach out to us if you have any other use cases in mind.