blog by OSO

Batch vs Real time stream processing: A beginner's guide

Sion Smith 8 August 2023

What is the difference between batch and real-time stream processing? Both approaches have their own advantages and challenges, and understanding them can help teams make informed decisions about which method to use for their data processing needs.

Batch vs Real time stream processing: The risks of selecting the wrong architecture 

When it comes to data processing, getting the architecture wrong can have serious consequences. Duplicated or leaked data can cause problems downstream, and introducing additional state complicates the entire process. When multiple sources are involved, for example, idempotency becomes a complex question: in a distributed system, the order of processing can vary, leading to potential issues. This is why transactions are crucial for ensuring that streaming events are not processed more than once.
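To make the risk concrete, here is a minimal Python sketch (event names and values are illustrative) of what happens when an at-least-once pipeline redelivers an event to a naive consumer:

```python
# Simulate at-least-once delivery: a broker may redeliver an event,
# e.g. after a consumer crashes before committing its offset.
events = [
    {"id": "evt-1", "amount": 100},
    {"id": "evt-2", "amount": 50},
    {"id": "evt-1", "amount": 100},  # redelivery of evt-1
]

total = 0
for event in events:
    # A naive consumer applies every delivery, so the duplicate
    # inflates the running total.
    total += event["amount"]

print(total)  # 250 instead of the correct 150
```

This is exactly the double-processing that transactional processing (or idempotent consumers) is meant to prevent.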

Batch vs Real time stream processing: The importance of implementing idempotency

Implementing idempotency on top of every stream processing task can be extremely complex. It requires careful consideration and ongoing maintenance as software and pipelines evolve. No program is perfect from the start: bugs are expected in the initial versions, and that is how software is built. Built-in idempotency support in a platform can address these issues, make the development process smoother, and reduce the risk of errors.
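As a sketch of the idea, one common pattern is to track the IDs of events already applied, so a redelivered event has no effect. This is a simplification (a real pipeline would persist the seen-set transactionally alongside the state, which is where the complexity lives):

```python
# Idempotent processing sketch: remember processed event IDs so a
# redelivered event is applied exactly once. Illustrative only; in
# production the `seen` set must be persisted atomically with `total`.
events = [
    {"id": "evt-1", "amount": 100},
    {"id": "evt-2", "amount": 50},
    {"id": "evt-1", "amount": 100},  # duplicate delivery
]

seen = set()
total = 0
for event in events:
    if event["id"] in seen:
        continue  # already applied: skip the duplicate
    seen.add(event["id"])
    total += event["amount"]

print(total)  # 150: the duplicate had no effect
```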

Schema evolution in data processing

Another important topic to consider when dealing with data processing is schema evolution. Even if you use a schema-less database, there is always some schema attached to the data, whether it's at the application layer or elsewhere. Adding new features or fields to the schema can be a challenging task, especially when backward-incompatible changes need to be made.

Dealing with schema evolution can be messy, and it's important to understand how it compares between batch processing and stream processing. Is it easier to handle in one approach than the other?

Incompatible schema changes

When there is a completely incompatible schema change, both batch and real-time stream processing systems require a reset. In an incremental pipeline, where data is processed incrementally, a complete schema change requires starting from the beginning. The schema of the table in the destination needs to be updated, and the entire process needs to be recreated.

In the batch world, if the schema change is not done incrementally, a naive approach of creating a new table for each run can work. However, from an end-to-end perspective, there is still a problem. Consumers of the data will need to adjust their processes when the schema changes, and communication becomes difficult.
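The naive batch approach might look like the following sketch, where each run writes to a fresh table (table names and the in-memory `warehouse` stand in for a real database):

```python
# Naive batch approach after an incompatible schema change: write each
# run to a fresh table instead of evolving one in place. Illustrative
# only; `warehouse` stands in for a real data warehouse.
from datetime import date

warehouse = {}

def run_batch(run_date, rows):
    # A new table per run sidesteps in-place schema migration...
    table_name = f"orders_{run_date.isoformat()}"
    warehouse[table_name] = rows
    return table_name

run_batch(date(2023, 8, 1), [{"order_id": 1, "total": 100}])
# After the schema change, new runs simply carry the new columns.
latest = run_batch(date(2023, 8, 2),
                   [{"order_id": 2, "total": 50, "currency": "GBP"}])

# ...but consumers must now discover the latest table themselves,
# which is the end-to-end communication problem described above.
print(latest)
```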

Handling compatible schema changes

There are also compatible schema changes, such as adding a new column or allowing null values in a column. These changes are generally easier to handle and can be done incrementally without causing major disruptions. However, it’s important to note that there is not always a clear answer to what is considered a breaking schema change. Determining whether a change requires a restart or can be handled automatically is difficult and often requires user input.
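For a compatible change such as a new optional column, readers can fall back to a default when the field is absent, so old and new records flow through the same code path. A minimal sketch (field names are illustrative):

```python
# Handling a compatible schema change: a new optional column appears,
# and readers supply a default when it is absent. Illustrative only.
old_rows = [{"order_id": 1, "total": 100}]                    # before the change
new_rows = [{"order_id": 2, "total": 50, "currency": "GBP"}]  # after

def read(row):
    # Missing values for the new column resolve to a default (None),
    # so pre-change and post-change rows are processed uniformly.
    return {
        "order_id": row["order_id"],
        "total": row["total"],
        "currency": row.get("currency"),  # None for pre-change rows
    }

merged = [read(r) for r in old_rows + new_rows]
print(merged[0]["currency"], merged[1]["currency"])  # None GBP
```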

The Ideal Approach: Incremental Changes Until Incompatible Schema

The ideal approach for consumers of data is to handle schema changes incrementally until a fundamentally incompatible change occurs. This allows for a simpler conceptual model where users query a table that is kept up to date incrementally, and adjust their processes only when the schema actually changes. Using GitOps here can help bridge the gap between development and operations.
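One way to automate part of that decision is a compatibility check over the before and after schemas. This is a deliberate simplification, not the full compatibility rules of a format like Avro or Protobuf, and as the article notes, the final call still belongs to the user:

```python
# Sketch of a compatibility check: model a schema as {field: type}.
# Simplified rule of thumb: added fields are fine; removed or retyped
# fields force a restart. Real formats define much richer rules.
def classify_change(old_schema, new_schema):
    for field, ftype in old_schema.items():
        if field not in new_schema:
            return "restart"          # field removed: breaking
        if new_schema[field] != ftype:
            return "restart"          # field retyped: breaking
    return "incremental"              # only additions: keep going

old = {"order_id": "int", "total": "int"}
print(classify_change(old, {**old, "currency": "string"}))  # incremental
print(classify_change(old, {"order_id": "int"}))            # restart
```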

Batch vs Real time stream processing: The importance of user feedback

In the end, the user always has the final say in determining whether data is compatible and how to handle schema changes. If the data is compatible, the schema can be modified on a go-forward basis. If not, a restart and a new table may be necessary.

In conclusion, both batch and real-time stream processing have their own advantages and challenges. It's important to understand the differences and make informed decisions based on the specific needs of your data processing tasks. Incremental changes are generally preferred until a fundamentally incompatible schema change occurs. However, determining compatibility can be difficult, and user involvement is crucial in making the right decisions.

For more content:

How to take your Kafka projects to the next level with a Confluent preferred partner

Event driven Architecture: A Simple Guide

Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation

Successfully Reduce AWS Costs: 4 Powerful Ways

Protecting Kafka Cluster

Apache Kafka Common Mistakes

Kafka Cruise Control 101

Kafka performance best practices for monitoring and alerting

How to build a custom Kafka Streams Statestores

How to avoid configuration drift across multiple Kafka environments using GitOps

Get started with OSO professional services for Apache Kafka

Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.

Contact Us