
5 Powerful Steps to Real-Time Data Magic with DBT Transformation

Sion Smith 8 August 2023

How do you use DBT for real time data transformation and analytics? DBT focuses on the T in the ETL (Extract, Transform, Load) process: it doesn’t extract or load data, but it is extremely good at transforming data that has already been loaded into your warehouse. One common problem with DBT for real time data transformation is duplication. We will discuss the importance of avoiding duplicate messages and the benefits of subscribing to multiple topics. Additionally, we will define the terms “upcaster” and “downcaster” and their role in schema translation.

DBT for real time: Avoiding double posting and subscribing to multiple topics

When it comes to DBT for real time data transformation and analytics, it is crucial to avoid producing duplicate events. This means that as soon as you stop producing data to the old topic, you should start producing to the new one, while your consumers continue to consume from both topics so that nothing is missed. Once the producer has switched over, you can instruct your consumers to consume only from the new topic. This approach allows for a seamless transition without data loss or interruption in processing.
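
To make the cutover concrete, here is a minimal sketch using the confluent-kafka Python client. The broker address, consumer group, and the topic names "events-v1" (old) and "events-v2" (new) are all assumptions for illustration, not part of the original article.

```python
# Minimal sketch, assuming the confluent-kafka Python client and
# hypothetical topic names "events-v1" (old) and "events-v2" (new).
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "analytics-loader",          # hypothetical consumer group
    "auto.offset.reset": "earliest",
})

# During the cutover, consume from both the old and the new topic so that
# no events are missed while the producer switches over.
consumer.subscribe(["events-v1", "events-v2"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        print(msg.topic(), msg.value())  # hand the event to your downstream load here

        # Once the producer has fully switched and "events-v1" is drained,
        # re-subscribe to the new topic only:
        # consumer.subscribe(["events-v2"])
finally:
    consumer.close()
```

Once the old topic stops receiving data, the commented re-subscribe line completes the switch with no gap in consumption.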

In addition to avoiding double posting, subscribing to multiple topics can provide several benefits. It allows you to:

  • Distribute the load: By distributing the load across multiple topics, you can handle a higher volume of data and ensure better performance.
  • Enable parallel processing: Subscribing to multiple topics enables parallel processing, which can significantly improve the speed and efficiency of data transformation and analytics.
  • Implement data segregation: Subscribing to different topics based on specific criteria or data attributes allows you to segregate and process data separately, providing more flexibility and control over your data pipelines (see the sketch after this list).
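
As a rough illustration of data segregation, a single consumer can subscribe to several topics and route each record to a topic-specific handler. The topic names and handler functions below are hypothetical, again using the confluent-kafka Python client.

```python
# Minimal sketch of data segregation by topic, assuming the
# confluent-kafka Python client; topic names and handlers are hypothetical.
from confluent_kafka import Consumer

def load_orders(value: bytes) -> None:
    print("order:", value)        # hypothetical per-topic handlers

def load_payments(value: bytes) -> None:
    print("payment:", value)

HANDLERS = {"orders": load_orders, "payments": load_payments}

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "segregated-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(list(HANDLERS))

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    # Route each record to the handler registered for its source topic.
    HANDLERS[msg.topic()](msg.value())
```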

DBT for real time data: Schema translation with Upcasters and Downcasters

Schema translation is the process of converting data from one schema version to another. In some cases, the schemas may not be directly compatible due to differences in rules and compatibility requirements. This is where upcasters and downcasters come into play.

Upcasters and downcasters allow for schema translation between different versions that may not have strict forward and backward compatibility rules. Upcasting involves converting data from an older schema version to a newer one, while downcasting involves converting data from a newer schema version to an older one.
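
As a rough sketch of the idea, an upcaster and a downcaster can be written as plain functions that map between two hypothetical versions of an order record: the upcaster fills in a default for a field the old version lacks, and the downcaster drops it again. The record shape, field names, and default value are assumptions, not something from the original article.

```python
# Minimal sketch of hand-written casting rules between two hypothetical
# schema versions of an "order" record (v1 has no "currency" field).

def upcast_v1_to_v2(record: dict) -> dict:
    """Convert an order from schema v1 to v2 by adding the new field
    with a sensible default (the human knowledge encoded as code)."""
    upcasted = dict(record)
    upcasted["currency"] = record.get("currency", "GBP")  # assumed default
    return upcasted

def downcast_v2_to_v1(record: dict) -> dict:
    """Convert an order from schema v2 back to v1 by dropping the field
    that older consumers do not understand."""
    downcasted = dict(record)
    downcasted.pop("currency", None)
    return downcasted

# Example: an old v1 event is upcast before being processed by v2 consumers.
v1_event = {"order_id": 42, "amount": 9.99}
print(upcast_v1_to_v2(v1_event))  # {'order_id': 42, 'amount': 9.99, 'currency': 'GBP'}
```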

However, schema translation using upcasters and downcasters requires some human knowledge and logic. It is not a recommended approach for every scenario, as it can be more complex and time-consuming to maintain custom casting rules for each version of the schema. A good example of this can be found here.

Maintaining schema definitions with schema registry

A schema registry is a valuable tool for managing and maintaining schema definitions when using DBT for real time data. However, it is not always recommended to store the master copy of schemas in the schema registry itself. Instead, it is advised to keep the source of truth and golden record of schema definitions in a separate repository.

By maintaining schema definitions outside of the schema registry, you can have more control over versioning, compatibility checks, and data ops processes. You can use a version control system like Git to store your schemas and define build rules to ensure compatibility between different versions. Schema registry can still play a role in this process by validating compatibility during the merge process and ensuring that the main branch remains buildable and can package the new version.
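
One way to wire this into a build is a small check that asks the schema registry whether a candidate schema from the Git repository is still compatible with the latest registered version. The sketch below calls Confluent Schema Registry's REST compatibility endpoint; the registry URL, subject name, and schema path are assumptions to adapt to your own repository layout.

```python
# Minimal sketch of a CI-style compatibility check against Confluent
# Schema Registry's REST API. The registry URL, subject name and schema
# path are hypothetical.
import json
import requests

REGISTRY_URL = "http://localhost:8081"   # assumed registry address
SUBJECT = "orders-value"                 # assumed subject name

def is_compatible(schema_path: str) -> bool:
    with open(schema_path) as f:
        candidate = f.read()             # Avro schema JSON stored in the Git repo
    resp = requests.post(
        f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps({"schema": candidate}),
    )
    resp.raise_for_status()
    return resp.json().get("is_compatible", False)

if __name__ == "__main__":
    # Fail the build (e.g. in a merge check) when the new schema would
    # break compatibility with what is already registered.
    if not is_compatible("schemas/orders.avsc"):
        raise SystemExit("Schema is not compatible with the latest registered version")
```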

Additionally, storing schemas in a separate repository allows for easier integration with other data discoverability tools. Data discoverability is a crucial aspect of data management, and having a centralized system for storing and managing schemas can make it easier to explore and discover data within your organization.

Generating language-specific code and improving discoverability

Storing schemas in a separate repository also allows developers to easily generate language-specific code for their data needs. By pulling down the repository and utilizing tools like the Avro CLI or the Proto CLI, developers can render objects in their preferred language and make them more accessible and usable.
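
In Python, rather than generating classes, the schema pulled from the repository is typically parsed and used directly at runtime. The snippet below is a rough illustration using the fastavro library; the schema path and record fields are hypothetical.

```python
# Minimal sketch: loading an Avro schema pulled from the schema repository
# and using it to serialise a record. The schema path and record fields
# are hypothetical; in JVM languages you would generate classes instead.
import io
import json
import fastavro

with open("schemas/orders.avsc") as f:
    schema = fastavro.parse_schema(json.load(f))

record = {"order_id": 42, "amount": 9.99, "currency": "GBP"}

# Serialise the record with the repository's schema ...
buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, record)

# ... and read it back to confirm the round trip.
buf.seek(0)
print(fastavro.schemaless_reader(buf, schema))
```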

Furthermore, a centralized system for storing schemas can improve compatibility with other discoverability tools. Data discoverability is a major challenge in the data management space, and having a centralized repository of schemas can make it easier to integrate with third-party data discovery tools and enable better data sharing and collaboration within your organization.

For more content:

How to take your Kafka projects to the next level with a Confluent preferred partner

Event driven Architecture: A Simple Guide

Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation

Successfully Reduce AWS Costs: 4 Powerful Ways

Protecting Kafka Cluster

Apache Kafka Common Mistakes

Kafka Cruise Control 101

Kafka performance best practices for monitoring and alerting

How to build a custom Kafka Streams Statestores

How to avoid configuration drift across multiple Kafka environments using GitOps

Get started with OSO professional services for Apache Kafka

Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.

Contact Us