In the world of Apache Kafka, a seemingly innocent pattern is quietly corrupting data integrity across countless microservice architectures. It’s called “dual writes,” and it might be happening in your system right now.
As organisations decompose monolithic applications into microservices, each service typically maintains its own database. These services don’t exist in isolation—they need to collaborate and share information to solve business problems. This necessity creates a critical challenge: how does a service update its own database while reliably communicating these changes to other services, often through message brokers like Apache Kafka?
Many developers intuitively reach for what seems like the simplest solution: update the database and then send a message to Kafka. This approach, known as “dual writes,” is fundamentally flawed and introduces significant risks to system consistency that can be costly and difficult to resolve.
Without atomic consistency between database updates and messaging, systems can fall into inconsistent states where data in one service doesn’t match related data in another. The consequences range from minor annoyances to critical business failures—orders that never ship, payments that aren’t recorded, or inventory that doesn’t reconcile.
This article demonstrates why dual writes are fundamentally flawed, explains the risks they introduce, and presents proven alternatives for maintaining data consistency across distributed systems.
Understanding the Dual Write Problem
What Are Dual Writes?
Dual writes occur when an application needs to update two separate systems without transactional guarantees between them. In microservice architectures, this typically looks like:
- Updating a database record (e.g., creating a new order in the Orders service database)
- Publishing a message to a messaging system like Kafka (e.g., notifying the Shipment service about the new order)
These operations happen sequentially but without a distributed transaction spanning both systems. This pattern appears frequently in microservice architectures where services need to maintain their own state while communicating changes to other services.
The pattern seems attractive because of its apparent simplicity and directness. The code looks clean and straightforward:
// Pseudo-code for the dual write approach
try {
    // Step 1: Update database
    database.save(newOrder);

    // Step 2: Publish message
    kafka.publish("orders-topic", orderCreatedEvent);

    return success();
} catch (Exception e) {
    return error();
}
The Fundamental Flaw
The critical problem with dual writes is the lack of distributed transaction support between databases and messaging systems. As of now, Apache Kafka cannot participate in distributed transactions (though this may change with KIP-939, which we’ll discuss later).
This limitation creates two primary failure modes:
- Database commits but messaging fails: Your order is saved in the database, but the shipment service never learns about it. The customer’s order exists but will never be fulfilled.
- Message is sent but database update fails: The shipment service receives notification about an order that doesn’t exist in the orders database. This could trigger processing of a non-existent order.
Both scenarios result in inconsistent system state, violating one of the core principles of reliable systems. Even with retry mechanisms, there are edge cases that can’t be fully addressed, such as service crashes between operations.
Distributed Systems Consistency Challenge
The CAP Theorem Context
The dual write problem is fundamentally connected to the CAP theorem, which states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance: when a network partition occurs, the system must sacrifice either consistency or availability.
Most modern distributed systems prioritise availability and partition tolerance, making consistency the challenging property to maintain. While we often can’t have perfect consistency in distributed environments, we need mechanisms to ensure eventual consistency without data loss or corruption.
The cost of inconsistency can be substantial:
- Financial losses due to missed orders or duplicate processing
- Eroded customer trust when systems deliver unpredictable results
- Engineering time spent investigating and manually reconciling data
- Increased complexity in applications to handle potentially inconsistent states
Common Failure Scenarios
Several failure patterns repeatedly emerge in systems that rely on dual writes:
- Transaction rollbacks creating orphaned messages: If a message is sent before the database transaction is committed, and then that transaction rolls back, the message refers to a state that never actually existed in the database.
- Network partitions causing partial updates: Network issues might allow the database update to succeed but prevent the message from being sent, creating an inconsistent state that’s difficult to detect and resolve.
- Service outages creating backlogs: If the messaging system is temporarily unavailable, updates might continue in the database while messages queue up or fail entirely, leading to temporary or permanent inconsistency.
- Ordering problems: Without careful management, the order of operations might not be preserved, potentially violating business logic that depends on the sequence of events.
These failures don’t just create one-time issues—they compound over time, creating increasingly divergent system states that become harder to reconcile the longer they persist.
Detecting Consistency Issues
Consistency issues can lurk beneath the surface, manifesting in subtle ways before causing major problems. Watch for these warning signs:
- Unexplained data discrepancies between services that should be in sync
- Customer complaints about missing actions or duplicate processing
- Reconciliation processes that consistently find exceptions requiring manual intervention
- Intermittent issues that can’t be reproduced reliably in testing environments
- “Ghost” records that appear in one system but not in related systems
If your team regularly performs “data fixing” operations or has built comprehensive reconciliation processes, you might be treating the symptoms rather than addressing the root cause.
Chaos Engineering Approaches
Proactively testing for consistency issues can help identify problems before they affect users. Consider these testing approaches:
- Simulate network partitions: Use a fault-injection proxy such as Toxiproxy to cut or degrade the connection between your application and your messaging system or database (see the sketch below); broader chaos tools such as Chaos Monkey can add instance failures on top.
- Test recovery mechanisms: Force failures at various points in your transaction flow and verify that your system recovers to a consistent state.
- Measure consistency levels under stress: Introduce high load while simultaneously degrading infrastructure performance, then measure how consistency holds up.
These tests often reveal that dual write approaches fail to maintain consistency under realistic failure conditions, highlighting the need for more robust patterns.
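As a concrete illustration, here is a minimal sketch of such a test, assuming the toxiproxy-java client (eu.rekawek.toxiproxy) with a Toxiproxy instance sitting between the application and the Kafka broker; the class name, the addresses, and the commented-out service call are placeholders.

// Chaos-test sketch: the toxiproxy-java client is assumed; names and addresses are illustrative
import eu.rekawek.toxiproxy.Proxy;
import eu.rekawek.toxiproxy.ToxiproxyClient;
import eu.rekawek.toxiproxy.model.ToxicDirection;

public class KafkaPartitionChaosTest {

    public static void main(String[] args) throws Exception {
        // The application is configured to reach Kafka through localhost:19092,
        // which Toxiproxy forwards to the real broker at kafka:9092
        ToxiproxyClient toxiproxy = new ToxiproxyClient("localhost", 8474);
        Proxy kafkaProxy = toxiproxy.createProxy("kafka", "localhost:19092", "kafka:9092");

        // Degrade the link so Kafka publishes time out while database writes still succeed
        kafkaProxy.toxics().latency("kafka-latency", ToxicDirection.DOWNSTREAM, 30_000);

        // Exercise the dual-write path during the simulated partition, e.g.:
        // orderService.createOrder(testOrder);

        // Afterwards, remove the toxic again and compare the orders database with the
        // events that actually reached Kafka to see whether the system stayed consistent
    }
}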
Principles of the Outbox Pattern
The most widely used answer to the dual write problem is the outbox pattern. Instead of publishing to Kafka directly, the service writes its business data and a corresponding event record into an “outbox” table within the same local database transaction; a separate process then relays the outbox entries to Kafka. Because the local transaction either commits both writes or rolls both back, no event can be lost and no event can describe state that never existed.
Implementation Approaches for the Outbox Pattern
When implementing the outbox pattern, you need a reliable mechanism to move messages from your outbox table to Kafka. Let’s explore the two primary approaches through practical examples.
Polling-Based Implementation
The polling approach is how many teams first implement the outbox pattern. Imagine an e-commerce company that has just introduced an outbox table for its order service:
// OrderService.java
@Transactional
public void createOrder(Order order) throws JsonProcessingException {
    // Save the order to the orders table
    orderRepository.save(order);

    // Create an outbox message describing the change
    OutboxEvent event = new OutboxEvent(
            UUID.randomUUID(),
            "Order",
            order.getId().toString(),
            objectMapper.writeValueAsString(orderToEventMapper.toEvent(order))
    );

    // Save the outbox message in the same transaction
    outboxRepository.save(event);
}
This works initially, but as their order volume grows, they encounter problems. During a flash sale, a long-running transaction inserts an outbox row with sequence number 157, while several quick transactions insert rows 158, 159, and 160. The polling job (a sketch of which follows below) picks up 158-160 first and misses 157 because its transaction hasn't committed yet. By the next polling cycle, the job is only looking for sequence numbers greater than 160, so the event for order 157 is skipped permanently.
They also struggle with the polling frequency—when set to 5 seconds, customers complain about shipping notification delays. When reduced to 1 second, their database starts showing increased load during peak hours.
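To make the failure mode concrete, here is a minimal sketch of such a naive polling relay; the repository method, the auto-increment position column, and the Spring scheduling setup are assumptions for illustration rather than code from the team's actual system.

// Naive polling relay sketch; repository and field names are illustrative
import java.util.List;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class OutboxPollingRelay {

    private final OutboxRepository outboxRepository;
    private final KafkaTemplate<String, String> kafkaTemplate;

    // Highest auto-increment position published so far; rows that commit late
    // with a lower position (like 157 in the example) are silently skipped
    private long lastSeenPosition = 0;

    public OutboxPollingRelay(OutboxRepository outboxRepository,
                              KafkaTemplate<String, String> kafkaTemplate) {
        this.outboxRepository = outboxRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @Scheduled(fixedDelay = 5000)   // the 5-second interval the team started with
    public void relayPendingEvents() {
        List<OutboxEvent> events =
                outboxRepository.findByPositionGreaterThanOrderByPositionAsc(lastSeenPosition);
        for (OutboxEvent event : events) {
            kafkaTemplate.send(event.getAggregateType() + "-events",
                    event.getAggregateId(), event.getPayload());
            lastSeenPosition = event.getPosition();
        }
    }
}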
Log-Based Change Data Capture (CDC)
After encountering these issues, the team switches to a log-based CDC approach using Debezium. They modify their database configuration to enable logical decoding (wal_level = logical on Postgres, which lets Debezium create a replication slot) and set up Debezium with Kafka Connect:
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "orders",
    "database.server.name": "ecommerce",
    "table.include.list": "public.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.fields.additional.placement": "aggregate_id:header:aggregate_id",
    "transforms.outbox.route.topic.replacement": "${routedByValue}-events",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.route.by.field": "aggregate_type"
  }
}
With this configuration, Debezium reads changes directly from the database’s transaction log. When the team compares their old and new approaches during another flash sale, they see dramatic improvements:
- Events are captured in the exact order they were committed, ensuring that if Order A was committed before Order B, the events arrive in Kafka in that same order
- No events are missed, even during high-volume periods, as the transaction log contains every committed change
- Latency drops from seconds to milliseconds, as events are captured almost immediately after commit
- Database load decreases since there’s no need for frequent, potentially expensive queries
The team also simplifies their code. Their order service looks almost identical, but they completely eliminate their polling service. Instead, Debezium handles extracting events and publishing them to Kafka.
Taking it a step further, Postgres users in the team even explore removing the outbox table entirely, using pg_logical_emit_message() to write directly to the transaction log:
@Transactional
public void createOrder(Order order) throws JsonProcessingException {
    // Save the order to the orders table
    orderRepository.save(order);

    // Emit the event straight into the transaction log; no outbox table is involved.
    // A bind parameter avoids the manual quote escaping of string concatenation.
    String payload = objectMapper.writeValueAsString(orderToEventMapper.toEvent(order));
    jdbcTemplate.queryForObject(
            "SELECT pg_logical_emit_message(true, 'outbox', ?::text)::text",
            String.class,
            payload
    );
}
While requiring additional configuration in Debezium, this approach eliminates the need for outbox table housekeeping entirely.
Advanced Implementation Considerations
Housekeeping
Since the outbox table only serves as a temporary store for messages, you need a strategy to prevent it from growing indefinitely:
- Transactional delete: Insert and delete the outbox row in the same transaction; the delete doesn't stop CDC from capturing the insert (see the sketch after this list)
- Periodic cleanup: Run a scheduled job to remove processed messages
- For Postgres users: Use pg_logical_emit_message() to write directly to the transaction log without using a table at all
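A minimal sketch of the transactional-delete variant, reusing the earlier createOrder example; it assumes OutboxRepository is a Spring Data JpaRepository so that flush() is available to force the INSERT out before the DELETE.

// Transactional-delete sketch; assumes OutboxRepository extends JpaRepository
@Transactional
public void createOrder(Order order) throws JsonProcessingException {
    orderRepository.save(order);

    OutboxEvent event = new OutboxEvent(
            UUID.randomUUID(),
            "Order",
            order.getId().toString(),
            objectMapper.writeValueAsString(orderToEventMapper.toEvent(order)));

    // Both statements are recorded in the transaction log, so Debezium still
    // captures the INSERT, but the outbox table itself never accumulates rows
    outboxRepository.save(event);
    outboxRepository.flush();   // ensure the INSERT is issued before the DELETE
    outboxRepository.delete(event);
}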
Backfilling Events
Sometimes you need to recreate events for existing data (e.g., after misconfiguration or data loss). The outbox pattern can be extended to support this:
- Use a “chunk” approach with marker events to delineate windows of backfilled data
- Implement deduplication logic to discard backfill events for records that have received regular updates
- Process these chunks efficiently while maintaining consistency with ongoing operations
Ensuring Idempotency in Consumers
Since message delivery in distributed systems often has at-least-once semantics, consumers need to handle potential duplicates:
- Use a monotonically increasing value (like Postgres’s LSN – Log Sequence Number) instead of arbitrary IDs
- Consumers can track the highest processed LSN and ignore any messages with lower values
- This approach requires storing only a single value rather than an ever-growing set of processed IDs (see the sketch below)
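A minimal consumer-side sketch of this idea: the lsn header, the processed_offsets table, and the consumer name are assumptions for illustration, and how the LSN reaches the consumer depends on how your relay or Debezium configuration exposes it.

// Idempotent consumer sketch; header name, table, and consumer id are hypothetical
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class OrderEventConsumer {

    private final JdbcTemplate jdbcTemplate;

    public OrderEventConsumer(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @KafkaListener(topics = "Order-events")
    @Transactional
    public void onOrderEvent(ConsumerRecord<String, String> record) {
        // The LSN of the originating database change, carried as a record header
        long lsn = Long.parseLong(
                new String(record.headers().lastHeader("lsn").value(), StandardCharsets.UTF_8));

        // A single row per consumer holds the highest LSN processed so far
        // (assumes the row was seeded when the consumer was deployed)
        Long highestProcessed = jdbcTemplate.queryForObject(
                "SELECT last_lsn FROM processed_offsets WHERE consumer = ?",
                Long.class, "shipment-service");

        if (highestProcessed != null && lsn <= highestProcessed) {
            return;   // duplicate or out-of-order redelivery: already handled
        }

        // ... apply the business logic for the event ...

        jdbcTemplate.update(
                "UPDATE processed_offsets SET last_lsn = ? WHERE consumer = ?",
                lsn, "shipment-service");
    }
}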
Alternative Approaches
While the outbox pattern is often the best solution, alternatives exist:
Event Sourcing Approach
Instead of writing to a database first, you could:
- Write events directly to Kafka
- Build a local read model by consuming these events
This works but has significant drawbacks:
- Loss of synchronous read-your-writes consistency
- Complexity in implementing constraints and validations
- More complex user experience (events may not be immediately queryable)
Two-Phase Commit with Kafka
With upcoming support for two-phase commit transactions in Kafka (KIP-939), a direct dual-write approach may become viable. However:
- It increases the availability requirements (both database AND Kafka must be available)
- It brings additional overhead to your synchronous request processing
- It may not provide significant advantages over the outbox pattern for most use cases
Stream Processing for Legacy Applications
For legacy applications that can’t be modified:
- Use CDC to capture raw data changes
- Apply stream processing (e.g., with Apache Flink) to transform these into higher-level events
- Publish the transformed events to Kafka topics for consumption
This works well for unmodifiable systems but makes transactional consistency more challenging to maintain. A minimal sketch of such a transformation job follows below.
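The following sketch assumes Flink's Kafka connector (flink-connector-kafka) and Debezium change events arriving as JSON strings; the topic names and the transformation logic are illustrative placeholders.

// Sketch of a Flink job turning raw CDC change events into business events
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OrderChangeToEventJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Raw row-level change events captured by Debezium from the legacy orders table
        KafkaSource<String> rawChanges = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("ecommerce.public.orders")
                .setGroupId("order-event-transformer")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Higher-level business events for downstream consumers
        KafkaSink<String> orderEvents = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("order-events")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(rawChanges, WatermarkStrategy.noWatermarks(), "raw-order-changes")
                .map(OrderChangeToEventJob::toBusinessEvent)
                .sinkTo(orderEvents);

        env.execute("legacy-order-change-to-event");
    }

    // Placeholder: a real job would map a Debezium change payload
    // (before/after images) to a domain event such as OrderCreated
    private static String toBusinessEvent(String changeEventJson) {
        return changeEventJson;
    }
}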
CDC Outbox Checklist
- Design your outbox table structure
  - Include message ID, aggregate type, aggregate ID, and payload
  - Consider adding metadata for routing and tracking
- Select your CDC approach
  - For most systems, log-based CDC with Debezium is recommended
  - Configure the Debezium outbox event router for streamlined implementation
- Implement housekeeping
  - Either delete records in the same transaction or implement a cleanup process
  - For Postgres, consider direct transaction log messages
- Ensure consumer idempotency
  - Use LSNs or similar monotonically increasing values
  - Implement tracking of processed messages
- Test thoroughly
  - Verify behaviour under network partitions
  - Confirm recovery after process crashes
  - Validate ordering guarantees