Our solution supports you on the journey of migrating a batch data pipeline to a real-time streaming data pipeline using Apache Kafka. In this post we explore the challenges you may face when moving from batch to real time, the tools and technologies involved, and the lessons learned along the way.
Batch to Real Time: The road to real-time data streaming
We now live in a world driven by real-time decision making. Migrating your batch data processes is increasingly necessary to meet the growing demands of the business and to provide timely, accurate data to decision makers.
Building real-time momentum
The first step in the migration process (batch to real time) is to select a candidate batch process that is suitable for conversion to real time. We know what you are thinking: your organisation has hundreds of batch processes. We have a framework to help you score them and decide which to migrate first. The next step is to build a proof of concept to demonstrate the value of a real-time streaming data pipeline. This involves implementing a small part of the algorithm in Kafka Streams and proving that it works.
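As a minimal sketch of what such a proof of concept might look like in Kafka Streams: the topic names (`raw-events`, `poc-output`) and the `applyBusinessRule` step below are illustrative assumptions, not details from any specific pipeline. The point is simply that one narrow slice of the batch algorithm is re-expressed as a chain of per-event operations.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class BatchToStreamingPoc {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "batch-to-streaming-poc");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read the raw events that the batch job used to pick up in bulk.
        KStream<String, String> rawEvents = builder.stream("raw-events");

        // Re-express one small part of the batch algorithm as streaming steps:
        // drop incomplete records, then apply the transformation per event.
        rawEvents
            .filter((key, value) -> value != null && !value.isEmpty())
            .mapValues(BatchToStreamingPoc::applyBusinessRule)
            .to("poc-output");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Placeholder for the fragment of the batch algorithm chosen for the PoC.
    private static String applyBusinessRule(String value) {
        return value.toUpperCase();
    }
}
```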
Batch to Real Time: Migrating Data
The next step is migrating the data from the legacy databases into the Kafka pipeline. This involves using Kafka Connect, a tool that allows for easy integration of external systems with Kafka. There are over 100 production-ready Kafka connectors; the biggest challenge is figuring out which connectors to use and how to extract the necessary data from legacy databases that are often hidden away behind other systems.
One of the most widely used connectors is the JDBC source connector, which lets you pull data from most in-house, self-hosted databases that expose a JDBC interface. This lends itself well to traditional change data capture (CDC) techniques. You can also configure a custom query within the JDBC source connector to massage the data and extract exactly what you need.
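As a hedged example of how such a connector might be registered, the sketch below builds a JDBC source connector configuration and posts it to the Kafka Connect REST API using the standard Java HTTP client. The connection URL, credentials, columns, custom query and topic name are all placeholder assumptions; check the connector documentation for the exact options your database requires.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSourceConnector {

    public static void main(String[] args) throws Exception {
        // Illustrative connector configuration for a legacy orders database.
        // A custom query selects only the columns the pipeline needs; the
        // timestamp + incrementing mode gives CDC-style incremental loads.
        String connectorJson = """
            {
              "name": "legacy-orders-jdbc-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "tasks.max": "1",
                "connection.url": "jdbc:postgresql://legacy-db:5432/orders",
                "connection.user": "connect",
                "connection.password": "secret",
                "mode": "timestamp+incrementing",
                "timestamp.column.name": "updated_at",
                "incrementing.column.name": "id",
                "query": "SELECT id, customer_id, total, status, updated_at FROM orders",
                "topic.prefix": "legacy-orders",
                "poll.interval.ms": "5000"
              }
            }
            """;

        // Register the connector with the Kafka Connect REST API.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://connect:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```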
Batch to Real Time: Validating the Data
Once the pipeline is set up and data is flowing, the next step is to ensure the accuracy and correctness of that data. To validate it, you can pull the legacy data into Kafka using the initial connector, then use Apache Flink or ksqlDB to join the legacy data with the data produced by the Kafka Streams pipeline and run validation checks comparing the two streams.
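The join itself can be done in Flink or ksqlDB as described above; purely to illustrate the idea, here is a comparable sketch using a Kafka Streams table-table join, with `legacy-orders`, `poc-output` and `validation-discrepancies` as assumed topic names. Any key whose latest legacy value and streaming value disagree is flagged to a discrepancy topic for investigation.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class LegacyVsStreamingValidation {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "legacy-vs-streaming-validation");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Latest legacy result per key, pulled into Kafka by the initial connector.
        KTable<String, String> legacy = builder.table("legacy-orders");
        // Latest result per key produced by the new streaming pipeline.
        KTable<String, String> streaming = builder.table("poc-output");

        // Join the two tables on their key and flag any values that do not match.
        legacy.join(streaming,
                (legacyValue, streamingValue) ->
                    legacyValue.equals(streamingValue)
                        ? null // matching records need no further attention
                        : "legacy=" + legacyValue + ", streaming=" + streamingValue)
            .toStream()
            .filter((key, diff) -> diff != null)
            .to("validation-discrepancies");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```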
This validation process can be time-consuming, but it is essential. It allows you to identify any discrepancies or issues and make the necessary adjustments to the algorithm. The results of the validation can be used to prove to stakeholders that the pipeline is working as intended and to build confidence in the accuracy of the data.
Throughout the migration process, we at OSO have learned several valuable lessons. Here are some key takeaways:
- Thinking in a streaming mindset: To be successful in building a streaming data pipeline, you need to think in a streaming way and break the algorithm down into processing steps. Whether you are using Kafka Streams, Apache Flink or ksqlDB, it is crucial to consider the most efficient way to process the data at each step.
  - In Kafka Streams, breaking the algorithm into separate processing stages was effective.
  - In ksqlDB, implementing multiple queries helped optimise the processing (see the sketch after this list).
  - Taking a step back and considering the most efficient way to break up the algorithm is essential.
- The value of Kafka Connect: Kafka Connect proved invaluable for integrating with the legacy architecture. It made it easy to plug external systems into Kafka and saved significant time and effort compared to building custom connectors.
- The complexity of writing custom connectors: While it may seem simple to read data from a database and put it into a Kafka topic, writing custom connectors can be challenging. There are many configuration details and corner cases to consider, making an existing, production-ready connector a more reliable and efficient option than rolling your own.
- Architectural constraints: It is important to consider architectural constraints outside of the pipeline itself. While streaming data pipelines can be powerful and efficient, there may be constraints in the broader business architecture that make it impractical or not worth the effort. It is crucial to evaluate the context and determine if a streaming pipeline is the right solution.
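To illustrate the ksqlDB point in the first takeaway, here is a minimal sketch that splits one large transformation into two chained queries using the ksqlDB Java client. The server address, stream names and columns are illustrative assumptions rather than details of any real pipeline.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class MultiStepKsqlPipeline {

    public static void main(String[] args) throws Exception {
        // Connect to a ksqlDB server; host and port are assumptions for this sketch.
        ClientOptions options = ClientOptions.create()
            .setHost("localhost")
            .setPort(8088);
        Client client = Client.create(options);

        // Declare the source stream over the raw topic.
        client.executeStatement(
            "CREATE STREAM raw_orders "
          + "(id BIGINT, customer_id VARCHAR, total DOUBLE, status VARCHAR) "
          + "WITH (KAFKA_TOPIC='raw-orders', VALUE_FORMAT='JSON');").get();

        // Step 1: filter the raw events into an intermediate stream instead of
        // doing everything in one large query.
        client.executeStatement(
            "CREATE STREAM complete_orders AS "
          + "SELECT id, customer_id, total FROM raw_orders "
          + "WHERE status = 'COMPLETE';").get();

        // Step 2: aggregate the filtered stream in a second, separate query.
        client.executeStatement(
            "CREATE TABLE customer_totals AS "
          + "SELECT customer_id, SUM(total) AS total_spend "
          + "FROM complete_orders GROUP BY customer_id;").get();

        client.close();
    }
}
```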
Conclusion
Migrating from a batch processing pipeline to a real-time streaming data pipeline (batch to real time) using Apache Kafka is a complex but rewarding journey. By building a proof of concept, leveraging Kafka Connect, and validating the data, you can successfully migrate your pipelines and vastly improve both the accuracy of your data and the efficiency of your pipelines.
For more content:
How to take your Kafka projects to the next level with a Confluent preferred partner
Event driven Architecture: A Simple Guide
Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation
Successfully Reduce AWS Costs: 4 Powerful Ways
Protecting Kafka Cluster
Apache Kafka Common Mistakes
Kafka Cruise Control 101
Kafka performance best practices for monitoring and alerting
How to build a custom Kafka Streams Statestores
How to avoid configuration drift across multiple Kafka environments using GitOps