How to manage schemas in Apache Kafka

How do you manage schemas in Apache Kafka? Data contracts are a relatively new concept, so how can they be used for long-term schema management? This post looks at why it matters to maintain a source of truth for schema definitions and at the benefits of keeping schemas in a version-controlled repository. It also covers schema evolution, which can be very challenging, and the use of schema casters to handle compatibility between different schema versions. Let’s dive in!

Manage schemas in Apache Kafka: The importance of data contracts

When managing schemas in Apache Kafka, it is crucial to have a reliable source of truth: a golden record of schema definitions. This ensures consistency and allows for easy maintenance and evolution of schemas over time. Storing your schemas in a centralised repository, separate from any specific service or tool, provides flexibility and control over your data contracts.

Manage Schemas: Storing schemas in a Git repository

One approach to managing schemas is to store them in a version-controlled repository, such as Git. This allows for easy tracking of changes, collaboration among team members, and the ability to roll back to previous versions if needed. By keeping your schemas in a repository, you can maintain a history of schema changes and easily browse through the evolution of your data contracts. You can also follow the standard GitOps approach to operations; an example of this can be found here.
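As a minimal sketch of what this can look like in practice, the snippet below reads the version history of each schema from a checked-out repository. The directory layout (`<root>/<subject>/v<N>.avsc`), the Avro file extension, and the function name `load_schema_history` are assumptions made for illustration; the article does not prescribe a layout.

```python
import json
import re
from pathlib import Path

# Hypothetical layout: <root>/<subject>/v<N>.avsc (one file per schema version)
VERSION_FILE = re.compile(r"^v(\d+)\.avsc$")


def load_schema_history(root):
    """Return {subject: [(version, schema_dict), ...]} ordered by version number."""
    history = {}
    for subject_dir in sorted(Path(root).iterdir()):
        if not subject_dir.is_dir():
            continue
        versions = []
        for path in subject_dir.iterdir():
            match = VERSION_FILE.match(path.name)
            if match:  # skip READMEs and other non-schema files
                schema = json.loads(path.read_text())
                versions.append((int(match.group(1)), schema))
        if versions:
            # Sort numerically so that v10 comes after v2
            history[subject_dir.name] = sorted(versions, key=lambda item: item[0])
    return history
```

A CI job could call `load_schema_history("schemas")` on every pull request and compare each new version against its predecessor before anything is published.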

Using Confluent Schema Registry for runtime compatibility

While storing schemas in a version-controlled repository is beneficial for compile-time management, it is also important to ensure runtime compatibility between different schema versions. This is where a schema registry comes into play. A schema registry is a service that stores the registered versions of each schema and enforces compatibility rules whenever a new version is registered.
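To make this concrete, here is a hedged sketch of asking Confluent Schema Registry whether a candidate schema is compatible with the latest registered version of a subject, using its REST compatibility endpoint. The registry URL, the subject name, and the helper names are assumptions for illustration, and `is_compatible` only works against a reachable registry.

```python
import json
import urllib.parse
import urllib.request


def build_compatibility_request(registry_url, subject, schema):
    """Build the POST request that tests `schema` against the latest version of `subject`."""
    url = "{}/compatibility/subjects/{}/versions/latest".format(
        registry_url.rstrip("/"), urllib.parse.quote(subject, safe="")
    )
    # The registry expects the schema itself as a JSON-encoded string
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )


def is_compatible(registry_url, subject, schema):
    """Send the request and return the registry's verdict (needs a running registry)."""
    request = build_compatibility_request(registry_url, subject, schema)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["is_compatible"]
```

For example, `is_compatible("http://localhost:8081", "orders-value", new_schema)` could gate a deployment, where the URL and subject are placeholders for your own environment.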

Schema Casters for compatibility

In some cases, you may need to cast data from one schema version to another that is not directly compatible. This is where schema casters come in. Schema casters allow you to upcast or downcast a record to a version that would not be considered compatible under the regular rules of backward compatibility. However, this process requires hand-written logic and domain knowledge to ensure the result is correct.
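The idea can be sketched as a chain of small functions, one per version step. The event versions, field names, and the `"UNKNOWN"` fallback below are hypothetical and exist only to illustrate a change (splitting a name field) that default values alone cannot express.

```python
# Hypothetical versions of a "customer" event, for illustration only:
#   v1: {"name": "Ada Lovelace"}
#   v2: {"first_name": "Ada", "last_name": "Lovelace"}
#   v3: v2 plus a "country" field


def upcast_v1_to_v2(record):
    # Human-supplied rule: split the full name on the first space
    first, _, last = record["name"].partition(" ")
    return {"first_name": first, "last_name": last}


def upcast_v2_to_v3(record):
    # Human-supplied rule: older events carry no country, so use a fallback
    return {**record, "country": "UNKNOWN"}


UPCASTERS = {1: upcast_v1_to_v2, 2: upcast_v2_to_v3}
LATEST_VERSION = 3


def upcast(record, version, target=LATEST_VERSION):
    """Apply one caster per version step until `target` is reached."""
    while version < target:
        record = UPCASTERS[version](record)
        version += 1
    return record
```

Calling `upcast({"name": "Ada Lovelace"}, 1)` walks the record through both steps. Every new schema version adds another function to this chain, and a downcaster would need the reverse mapping as well.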

Challenges and drawbacks of using Schema Casters

While schema casters can be a clever solution for handling compatibility between different schema versions, they come with their own challenges and drawbacks. One major drawback is the maintenance burden. You need to write casting code for every version of the data, which is time-consuming. Additionally, schema casters are written in a specific programming language, so if you need multi-language support, each caster has to be reimplemented per language, which takes even more time and effort.

Maintaining a source of truth for schema definitions

Based on our experience at OSO with various companies, we recommend maintaining your source of truth and golden record of schema definitions separately from a service like a schema registry. This allows for custom exporting mechanisms and centralised maintenance of schemas. By using a version-controlled repository and implementing CI/CD, DataOps, and GitOps best practices, you can ensure that your schema definitions are easily accessible, compatible, and discoverable.

Future tools and improvements

While there are existing tools like schema registry and Maven plugins that help with schema management, there is still room for improvement in this space. Currently, most customers end up writing custom scripts to handle schema compatibility checks and combine different tooling to bring it all together. However, there is a need for a single, comprehensive tool that can handle all aspects of schema management, from interpretation to generation.
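A custom compatibility check of the kind teams end up writing might look like the sketch below. It is deliberately simplified and is not the full Avro resolution algorithm: it only handles flat record schemas, treats any type change as incompatible (real Avro permits some promotions, such as int to long), and ignores aliases and unions. The function name is hypothetical.

```python
def backward_incompatibilities(old_schema, new_schema):
    """List reasons why a reader using new_schema could not read data written with old_schema.

    Simplified rules, assumed for illustration:
      * a field added in new_schema must declare a default
      * a field present in both schemas must keep the same type
      * a field removed in new_schema is fine (the reader ignores it)
    """
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        name = field["name"]
        if name not in old_fields:
            if "default" not in field:
                problems.append(f"new field '{name}' has no default")
        elif old_fields[name]["type"] != field["type"]:
            problems.append(f"field '{name}' changed type")
    return problems
```

An empty list means the change passes this simplified check; a CI job could fail the build whenever the list is non-empty.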

One potential tool that could be developed is a unified schema management tool for Apache Kafka that integrates with version control systems like Git and provides a seamless workflow for schema management. This tool could automate compatibility checks, generate language-specific code, and provide a web interface for browsing schema history.

In data engineering with Apache Kafka, managing schemas effectively is essential for maintaining data integrity and for reliable communication between the applications in a data ecosystem. One key strategy is to run a dedicated schema registry, which acts as the centralised runtime store for schemas and enforces consistency and compatibility across applications and services. Complementing this, a version-controlled repository like Git holds the source of truth for the schema definitions and provides a systematic way to manage schema changes collaboratively, enabling teams to monitor, review, and refine schemas over time. Together, the two contribute to a robust and adaptable data infrastructure.