blog by OSO

How to Back Up Your Kafka Schema Registry (And Why Your Current Approach Probably Won’t Restore)

Sion Smith 7 April 2026

Most Kafka teams have something they call a Schema Registry backup — a curl script tucked into a cron job, a console consumer dump, an export from a governance tool. Very few have ever tested whether those backups actually restore. And on the day they need them, a haunting pattern emerges in the forum threads: schemas restored, IDs reassigned, messages unreadable, and a final post that reads “Finally, we delete all unprocessed data and start over.”

That outcome is more common than the Kafka community likes to admit. The OSO engineers have come across multiple production incidents where teams with otherwise solid disaster recovery practices discovered, mid-incident, that their Schema Registry recovery story was incomplete. It’s the reason we built kafka-backup, an open-source point-in-time recovery tool for Apache Kafka, with Schema Registry backup and restore available as an enterprise feature. In one well-documented case on a public Kafka forum, an enterprise running on managed cloud infrastructure lost their Kafka nodes alongside Schema Registry. When everything came back up, the schema data was gone. Their stream processing applications were failing because they couldn’t resolve schema IDs. There was no supported way to recreate schemas with their original IDs. Their final option was to delete unprocessed data and start over.

The lesson is simple, if uncomfortable: backing up Schema Registry is not the same as backing up the schema definitions. A credible backup must capture the schemas, their compatibility settings, and their cross-references, and it must preserve the schema IDs that are embedded in every Kafka message on disk. This guide walks through how to do it properly — what to capture, how to capture it, and how to restore it without breaking your consumers.

Why the Wire Format Makes This Hard

Every Kafka message serialised with Avro, JSON Schema, or Protobuf using the Confluent wire format begins with a five-byte prefix. The first byte is a magic byte, always 0x00. The next four bytes are a big-endian integer holding the schema ID. After that comes the actual payload.

The schema content itself — field names, types, namespaces, defaults — is not in the message. Only a pointer.

When a consumer receives a message, it reads the magic byte, extracts the schema ID, queries Schema Registry for that ID, fetches the schema definition, and only then deserialises the payload. Break any link in that chain and the consumer doesn’t fail gracefully — it cannot read the message at all.
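The prefix described above can be parsed in a few lines. This is a minimal sketch (the function name is ours, not from any library) that splits a wire-format message into its schema ID and payload:

```python
import struct

MAGIC_BYTE = 0x00

def parse_wire_format(message: bytes) -> tuple[int, bytes]:
    """Split a Confluent wire-format message into (schema_id, payload)."""
    if len(message) < 5 or message[0] != MAGIC_BYTE:
        raise ValueError("not a Confluent wire-format message")
    # Bytes 1-4 hold the schema ID as a big-endian unsigned integer.
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]
```

Everything a consumer does next hinges on that integer resolving against a live registry.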

This is the property that makes Schema Registry what the OSO engineers call a “silent dependency.” It sits there, day in and day out, working reliably enough that teams forget it’s in their critical path. There’s no daily friction reminding you it exists. The only time you discover your backup story is broken is during an actual incident — which is the worst possible moment to be improvising.

The implication for backup strategy is concrete: backing up schema definitions is not enough. The IDs are the part that matters, because the IDs are what every message on disk is referencing. There’s a public GitHub issue on the open-source Schema Registry project where a user reports that after a restart, schema ID 221 simply wasn’t there. The error came back as RestClientException: Schema not found, error code 40403. Their stream processing applications halted completely. The schema definition existed somewhere — they could probably reconstruct it from source control — but the binding between ID 221 and that definition was gone, and millions of messages on disk were now unreadable.

Schema content can usually be reconstructed. The IDs embedded in millions of Kafka messages cannot. This is the asymmetry that catches teams out.

Why the Common Approaches Fall Short

Before rolling your own solution, it’s worth understanding why the obvious approaches don’t work in production.

The console consumer dump

The most commonly suggested approach is to dump the internal _schemas topic to a file using kafka-console-consumer. It captures something, but the format is undocumented and implementation-specific — it has never been committed to as a stable public contract, which means a version upgrade can quietly break your backup. There’s no way to filter by subject, the output is an opaque binary blob you can’t meaningfully inspect, and it doesn’t compose with the rest of your backup pipeline. It’s also unavailable on managed cloud services where you don’t have direct access to internal topics.

The curl script

A more refined approach uses the Schema Registry REST API directly. There’s a well-known case study of an engineer who heroically extracted schemas this way before their registry process restarted and the in-memory state was lost forever. They had no replicas, the _schemas topic was destroyed, but Schema Registry was still serving from cache. They had a narrow window to curl every schema before the process died. It worked — barely — and they ended up restoring to an empty registry using IMPORT mode. But curl scripts are inherently fragile. No error handling, no retry logic, no rate limiting, no authentication beyond what you bolt on manually, no understanding of schema reference dependencies, and no answer for restoring schemas with their original IDs across a populated target.

Continuous replication tools

The vendor-native options provide continuous replication between registries but require specific platform versions and enterprise licences, and they’re designed for replication rather than point-in-time backup. They don’t give you a snapshot that corresponds to a point-in-time Kafka data backup. There are documented cases on public forums where users running managed Kafka discovered that during a failover, schema data was not preserved — the recommended procedure left them recovering schemas and data as separate manual steps, in the middle of an incident, hoping the two would stay in sync.

Governance platforms

Tools designed for schema governance can back up schemas, but as a separate artefact from your Kafka data. You end up with two backup artefacts and two restore processes, and during a recovery you must carefully orchestrate the restore so that schemas land before data. That coordination burden falls on you, and coordination under pressure is exactly where incidents happen.

The common thread: none of these capture schemas in the same artefact as your Kafka data, and most have no answer for restoring schemas with their original IDs.

The Four Things a Complete Backup Must Capture

A backup that will actually restore needs four pieces.

1. The schemas themselves, every version, via the REST API

The REST API is preferable to the _schemas topic dump because it’s a stable public contract, works on both on-premises and cloud-hosted registries, produces inspectable JSON, and supports subject filtering. Walk the API: GET /subjects to discover the list, then GET /subjects/{subject}/versions for each, then GET /subjects/{subject}/versions/{version} to fetch the full definition. The response gives you the schema string, the schema type (Avro, JSON Schema, or Protobuf), the global schema ID, and the references array.
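The three-step walk above is straightforward to script. A minimal sketch, assuming standard-library HTTP only; `dump_all_schemas` and `fetch_json` are our names, and a production tool would add auth, retries, and rate limiting:

```python
import json
from urllib.request import urlopen

def fetch_json(base_url: str, path: str):
    """GET a Schema Registry endpoint and decode the JSON body."""
    with urlopen(f"{base_url}{path}") as resp:
        return json.load(resp)

def dump_all_schemas(base_url: str, get=fetch_json) -> dict:
    """Walk subjects -> versions -> full definitions.

    Returns {subject: [version response, ...]}, where each version
    response carries the schema string, type, global ID, and references.
    """
    backup = {}
    for subject in get(base_url, "/subjects"):
        backup[subject] = [
            get(base_url, f"/subjects/{subject}/versions/{v}")
            for v in get(base_url, f"/subjects/{subject}/versions")
        ]
    return backup
```

The injectable `get` parameter makes the walker testable without a live registry, which is exactly the kind of property a backup tool needs anyway.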

2. Compatibility configuration

Capture the global compatibility level via GET /config and any per-subject overrides via GET /config/{subject}. Without these, your restored registry may reject schemas that producers expect to register, or accept ones that should have been rejected. The compatibility settings are part of the contract between producers and consumers, and losing them silently changes that contract on the recovered side.

3. Mode state

Capture the registry mode via the /mode endpoints, both globally and per-subject. This matters because IMPORT mode is what makes ID-preserving restore possible, and you need to know what state to return subjects to after a restore completes.
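The compatibility and mode settings described in these two steps follow the same capture pattern. A minimal sketch; `capture_settings` and `get_json` are hypothetical names, and a production tool would treat a 404 on a per-subject endpoint as "no override set" rather than as a failure:

```python
import json
from urllib.request import urlopen

def get_json(url: str):
    """GET a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)

def capture_settings(base_url: str, subjects, get=get_json) -> dict:
    """Collect global and per-subject compatibility and mode state."""
    out = {
        "config": get(f"{base_url}/config"),
        "mode": get(f"{base_url}/mode"),
        "subjects": {},
    }
    for s in subjects:
        out["subjects"][s] = {
            # Per-subject overrides may not exist; handle 404 as "no override".
            "config": get(f"{base_url}/config/{s}"),
            "mode": get(f"{base_url}/mode/{s}"),
        }
    return out
```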

4. Schema references and dependency ordering

The references field in each version response is critical for Protobuf schemas and any schema type with cross-references. Suppose you have an orders-value schema that references a Money type from common-money-value and an Address type from common-address-value. You can’t restore orders-value first — Schema Registry will reject it because the referenced schemas don’t yet exist on the target. The OSO engineers describe building a directed acyclic graph of dependencies and running a topological sort using Kahn’s algorithm, so that referenced schemas are restored before the schemas that reference them. The resulting dependency order is baked into the backup manifest, so at restore time the engine doesn’t need to recompute it — it just follows the order in the manifest. Circular references are detected during the sort and surfaced as errors before any restore work begins.
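The dependency ordering described above can be sketched with Kahn's algorithm directly. This is an illustrative implementation, not the kafka-backup source; it assumes every referenced subject also appears as a key in the input map:

```python
from collections import deque

def restore_order(deps: dict[str, list[str]]) -> list[str]:
    """Topological sort (Kahn's algorithm) over subject references.

    deps maps each subject to the subjects it references. The returned
    order places referenced subjects before the subjects that reference
    them; a cycle is surfaced as an error before any restore work begins.
    """
    indegree = {s: 0 for s in deps}
    dependents = {s: [] for s in deps}
    for subject, refs in deps.items():
        for ref in refs:
            indegree[subject] += 1
            dependents[ref].append(subject)  # assumes ref is a key in deps
    queue = deque(sorted(s for s, d in indegree.items() if d == 0))
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for dep in dependents[s]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                queue.append(dep)
    if len(order) != len(deps):
        raise ValueError("circular schema references detected")
    return order
```

For the orders-value example, the sort guarantees that common-money-value and common-address-value come out ahead of orders-value.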

Restoring Without Breaking Your Consumers

Schema IDs in the registry are assigned monotonically. You don’t get to choose. If your source registry has orders-value v1 at ID 1, and you restore to a registry that already contains other schemas, the next available ID might be 47. You restore the schema, it gets ID 47, but your messages still say 0x00 0x00 0x00 0x00 0x01. Consumers fail.

There are two strategies that actually work.

Strategy 1: Force original IDs with IMPORT mode

Schema Registry has a mode called IMPORT that allows registration with explicit IDs and version numbers. Normally you can’t control what ID gets assigned — you call POST /subjects/{subject}/versions and the registry hands out the next available integer. In IMPORT mode, you include an id field in the request body and the registry honours it. The sequence is: set the subject to IMPORT mode via PUT /mode/{subject} with {"mode": "IMPORT"}, register each version with the explicit id and version from the backup, then return the subject to READWRITE. Do this for every subject, following the dependency order, and the target registry ends up with schema IDs identical to the source. The prerequisite is that the target registry must have mode.mutability=true set in its properties file, or the mode-change requests will be rejected outright. This is the right choice when you control the target registry and it’s empty or compatible.
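The IMPORT-mode sequence above can be sketched per subject as follows. A hypothetical `send(method, path, body)` callable stands in for the HTTP client, and `import_subject` is our name; versions are assumed to come from the backup in ascending order, each carrying its original id and version:

```python
def import_subject(base_url: str, subject: str, versions, send):
    """Restore one subject with its original IDs via IMPORT mode.

    Requires mode.mutability=true on the target registry, or the
    mode-change requests will be rejected.
    """
    # 1. Put just this subject into IMPORT mode.
    send("PUT", f"{base_url}/mode/{subject}", {"mode": "IMPORT"})
    # 2. Register every version with its explicit id and version number.
    for v in versions:
        send("POST", f"{base_url}/subjects/{subject}/versions", {
            "schema": v["schema"],
            "schemaType": v.get("schemaType", "AVRO"),
            "id": v["id"],
            "version": v["version"],
            "references": v.get("references", []),
        })
    # 3. Return the subject to normal operation.
    send("PUT", f"{base_url}/mode/{subject}", {"mode": "READWRITE"})
```

Run this for every subject in the dependency order from the manifest and the target ends up with IDs identical to the source.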

Strategy 2: Rewrite the wire format bytes

For migrations to managed services where you can’t control the ID space, the alternative is more interesting. Let the target registry assign whatever IDs it wants during schema registration, but as each schema is registered, record the mapping: source ID 5 became target ID 23. Build a complete translation table for every schema in the backup. Then, during Kafka data restore — which always runs after schema restore — hook into the record processing pipeline. For every message being written to the target cluster, read the first byte. If it’s 0x00, read bytes one through four as a big-endian integer, look up the target ID in the mapping table, and if they differ, overwrite those four bytes in place with the new ID. The rest of the message is untouched. Consumers on the target see messages with schema IDs that resolve correctly against the target registry.

The magic byte check is the safety guard. Messages that aren’t in Confluent wire format — raw JSON, plain text, custom binary formats — won’t have 0x00 as their first byte and are skipped entirely. The rewrite only touches messages it’s certain about. This is the right choice for managed cloud registries and shared multi-team registries where the ID space is out of your hands.
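The rewrite step, including the magic byte guard, fits in a short function. This is an illustrative sketch rather than the kafka-backup implementation; the function name and the shape of the mapping table are ours:

```python
import struct

MAGIC_BYTE = 0x00

def rewrite_schema_id(message: bytes, id_map: dict[int, int]) -> bytes:
    """Remap the embedded schema ID using a source->target table.

    Messages that are not in Confluent wire format are returned
    untouched; only the four ID bytes are ever rewritten.
    """
    if len(message) < 5 or message[0] != MAGIC_BYTE:
        return message  # raw JSON, plain text, custom binary: skip
    (source_id,) = struct.unpack(">I", message[1:5])
    target_id = id_map.get(source_id, source_id)
    if target_id == source_id:
        return message
    return message[:1] + struct.pack(">I", target_id) + message[5:]
```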

A Worked Example with OSO kafka-backup

As mentioned earlier, the OSO engineers built kafka-backup to handle exactly this gap. Schemas are captured alongside Kafka topic data in a single artefact, eliminating the coordination burden between separate tools.

A minimal config adds an enterprise.schema_registry section to your existing backup configuration:

mode: backup
backup_id: daily-2026-04-06
source:
  bootstrap_servers: ["kafka:9092"]
storage:
  backend: s3
  bucket: kafka-backups

enterprise:
  schema_registry:
    url: "https://schema-registry:8081"
    auth:
      type: basic
      username: ${SR_USERNAME}
      password: ${SR_PASSWORD}

A single command captures topics and schemas together — no second tool, no separate schedule:

kafka-backup backup --config backup-config.yaml

The output is human-readable. Schemas live in a schema-registry/ directory within the backup prefix, organised by subject:

{backup_id}/
  manifest.json
  topics/...
  schema-registry/
    _manifest.json
    _global_config.json
    subjects/
      orders-value/
        _metadata.json
        v1.json
        v2.json
      payments-value/
        _metadata.json
        v1.json

Each version file contains the full schema string, type, global ID, and references array. The manifest contains the dependency order, ID-to-file mapping, and subject counts. The whole artefact is inspectable directly from S3 or your storage backend without any tooling — useful for operational debugging when something goes wrong and you need to know exactly what’s in the backup before you commit to a restore.

For a disaster recovery restore into a fresh target registry, force the original IDs:

enterprise:
  schema_registry:
    url: "https://dr-schema-registry:8081"
    restore:
      strategy: preserve
      force_ids: true

For a migration to a managed cloud registry where the ID space is out of your hands, rewrite the message bytes instead:

enterprise:
  schema_registry:
    url: "https://psrc-xxxxx.confluent.cloud"
    auth:
      type: basic
      username: ${CCLOUD_SR_API_KEY}
      password: ${CCLOUD_SR_API_SECRET}
    restore:
      strategy: preserve
      rewrite_ids: true

Three restore strategies cover the realistic scenarios: preserve adds only what’s missing on the target, overwrite replaces existing schemas from the backup, and skip leaves any subject with existing versions alone. Combined with force_ids or rewrite_ids, these handle fresh DR registries, managed cloud migrations, and restores into shared multi-team registries without trampling other teams’ work.

The operational details that turn a script into infrastructure are built in. A token bucket rate limiter (default 25 requests per second) keeps backup runs clear of cloud-imposed API limits, with retry logic and configurable backoff handling transient 429 and 5xx responses. Authentication supports basic auth, mTLS for environments with mutual TLS PKI, and credential interpolation from environment variables so secrets never end up in config files. Pre-flight checks before any restore verify that the target registry is reachable, that mode mutability is enabled if force_ids is set, and that the backup manifest can be parsed — failing fast is always better than failing halfway through. Dry-run mode lets you compute what would happen without making any changes: which subjects would be restored, which would be skipped, what the ID mappings would look like. And metrics exposed via the standard Prometheus endpoint give day-to-day visibility into subjects backed up, API errors by endpoint, ID rewrite counts, and IMPORT mode transitions.

The Backup You Can Restore

Schema Registry backup is one of those topics that sounds boring until the day you need it, and then it’s the most important thing in your infrastructure. The mechanics are not complicated — capture the schemas via the REST API, capture the configuration and references, restore in dependency order, preserve or remap the IDs to match what’s embedded in your messages on disk. What’s complicated is doing all of that together, in a single artefact coupled to your Kafka data backup, with the operational properties that turn a script into a piece of infrastructure.

The teams who end up deleting their unprocessed data didn’t get there because they were careless or under-resourced. They got there because Schema Registry presented itself as boring, reliable, ignorable infrastructure — and the wire format quietly turned every Kafka message into a hostage of a system they weren’t backing up properly. Five bytes at the front of every message, with enormous operational consequences when they stop resolving.

Whether you build your own tooling or use an existing one, the test is the same: run the drill, verify the IDs, and make sure the messages still read on the other side.

Back up your Kafka schema registry the right way

Talk to an OSO engineer about implementing point-in-time Schema Registry backup and restore alongside your Kafka data, with full schema ID preservation built in from day one.
