In the world of Apache Kafka, a seemingly innocent pattern is quietly corrupting data integrity across countless microservice architectures. It’s called “dual writes,” and it might be happening in your system right now.
As organisations decompose monolithic applications into microservices, each service typically maintains its own database. These services don’t exist in isolation—they need to collaborate and share information to solve business problems. This necessity creates a critical challenge: how does a service update its own database while reliably communicating these changes to other services, often through message brokers like Apache Kafka?
Many developers intuitively reach for what seems like the simplest solution: update the database and then send a message to Kafka. This approach, known as “dual writes,” is fundamentally flawed and introduces significant risks to system consistency that can be costly and difficult to resolve.
Without atomic consistency between database updates and messaging, systems can fall into inconsistent states where data in one service doesn’t match related data in another. The consequences range from minor annoyances to critical business failures—orders that never ship, payments that aren’t recorded, or inventory that doesn’t reconcile.
This article demonstrates why dual writes are fundamentally flawed, explains the risks they introduce, and presents proven alternatives for maintaining data consistency across distributed systems.
Understanding the Dual Write Problem
What Are Dual Writes?
Dual writes occur when an application needs to update two separate systems without transactional guarantees between them. In microservice architectures, this typically looks like:
- Updating a database record (e.g., creating a new order in the Orders service database)
- Publishing a message to a messaging system like Kafka (e.g., notifying the Shipment service about the new order)
These operations happen sequentially but without a distributed transaction spanning both systems. This pattern appears frequently in microservice architectures where services need to maintain their own state while communicating changes to other services.
The pattern seems attractive because of its apparent simplicity and directness. The code looks clean and straightforward:
// Pseudo-code for the dual write approach
try {
    // Step 1: Update database
    database.save(newOrder);

    // Step 2: Publish message
    kafka.publish("orders-topic", orderCreatedEvent);

    return success();
} catch (Exception e) {
    return error();
}
The Fundamental Flaw
The critical problem with dual writes is the lack of distributed transaction support between databases and messaging systems. As of now, Apache Kafka cannot participate in distributed transactions (though this may change with KIP-939, which we’ll discuss later).
This limitation creates two primary failure modes:
- Database commits but messaging fails: Your order is saved in the database, but the shipment service never learns about it. The customer’s order exists but will never be fulfilled.
- Message is sent but database update fails: The shipment service receives notification about an order that doesn’t exist in the orders database. This could trigger processing of a non-existent order.
Both scenarios result in inconsistent system state, violating one of the core principles of reliable systems. Even with retry mechanisms, there are edge cases that can’t be fully addressed, such as service crashes between operations.
Distributed Systems Consistency Challenge
The CAP Theorem Context
The dual write problem is fundamentally connected to the CAP theorem, which states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance: when a network partition occurs, the system must sacrifice either consistency or availability.
Most modern distributed systems prioritise availability and partition tolerance, making consistency the challenging property to maintain. While we often can’t have perfect consistency in distributed environments, we need mechanisms to ensure eventual consistency without data loss or corruption.
The cost of inconsistency can be substantial:
- Financial losses due to missed orders or duplicate processing
- Eroded customer trust when systems deliver unpredictable results
- Engineering time spent investigating and manually reconciling data
- Increased complexity in applications to handle potentially inconsistent states
Common Failure Scenarios
Several failure patterns repeatedly emerge in systems that rely on dual writes:
- Transaction rollbacks creating orphaned messages: If a message is sent before the database transaction is committed, and then that transaction rolls back, the message refers to a state that never actually existed in the database.
- Network partitions causing partial updates: Network issues might allow the database update to succeed but prevent the message from being sent, creating an inconsistent state that’s difficult to detect and resolve.
- Service outages creating backlogs: If the messaging system is temporarily unavailable, updates might continue in the database while messages queue up or fail entirely, leading to temporary or permanent inconsistency.
- Ordering problems: Without careful management, the order of operations might not be preserved, potentially violating business logic that depends on the sequence of events.
These failures don’t just create one-time issues—they compound over time, creating increasingly divergent system states that become harder to reconcile the longer they persist.
Detecting Consistency Issues
Consistency issues can lurk beneath the surface, manifesting in subtle ways before causing major problems. Watch for these warning signs:
- Unexplained data discrepancies between services that should be in sync
- Customer complaints about missing actions or duplicate processing
- Reconciliation processes that consistently find exceptions requiring manual intervention
- Intermittent issues that can’t be reproduced reliably in testing environments
- “Ghost” records that appear in one system but not in related systems
If your team regularly performs “data fixing” operations or has built comprehensive reconciliation processes, you might be treating the symptoms rather than addressing the root cause.
Chaos Engineering Approaches
Proactively testing for consistency issues can help identify problems before they affect users. Consider these testing approaches:
- Simulate network partitions: Use a fault-injection proxy such as Toxiproxy to cut or degrade the connection between your application and your messaging system or database (see the sketch below); broader chaos tools such as Chaos Monkey can add instance failures on top.
- Test recovery mechanisms: Force failures at various points in your transaction flow and verify that your system recovers to a consistent state.
- Measure consistency levels under stress: Introduce high load while simultaneously degrading infrastructure performance, then measure how consistency holds up.
These tests often reveal that dual write approaches fail to maintain consistency under realistic failure conditions, highlighting the need for more robust patterns.
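As a concrete illustration, here is a minimal sketch of such a test, assuming the toxiproxy-java client (eu.rekawek.toxiproxy) with a Toxiproxy instance sitting between the application and the Kafka broker; the class name, the addresses, and the commented-out service call are placeholders.

// Chaos-test sketch: the toxiproxy-java client is assumed; names and addresses are illustrative
import eu.rekawek.toxiproxy.Proxy;
import eu.rekawek.toxiproxy.ToxiproxyClient;
import eu.rekawek.toxiproxy.model.ToxicDirection;

public class KafkaPartitionChaosTest {

    public static void main(String[] args) throws Exception {
        // The application is configured to reach Kafka through localhost:19092,
        // which Toxiproxy forwards to the real broker at kafka:9092
        ToxiproxyClient toxiproxy = new ToxiproxyClient("localhost", 8474);
        Proxy kafkaProxy = toxiproxy.createProxy("kafka", "localhost:19092", "kafka:9092");

        // Degrade the link so Kafka publishes time out while database writes still succeed
        kafkaProxy.toxics().latency("kafka-latency", ToxicDirection.DOWNSTREAM, 30_000);

        // Exercise the dual-write path during the simulated partition, e.g.:
        // orderService.createOrder(testOrder);

        // Afterwards, remove the toxic again and compare the orders database with the
        // events that actually reached Kafka to see whether the system stayed consistent
    }
}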
Principles of the Outbox Pattern
The most widely used answer to the dual write problem is the outbox pattern. Instead of publishing to Kafka directly, the service writes its business data and a corresponding event record into an “outbox” table within the same local database transaction; a separate process then relays the outbox entries to Kafka. Because the local transaction either commits both writes or rolls both back, no event can be lost and no event can describe state that never existed.
Implementation Approaches for the Outbox Pattern
When implementing the outbox pattern, you need a reliable mechanism to move messages from your outbox table to Kafka. Let’s explore the two primary approaches through practical examples.
Polling-Based Implementation
The polling approach is how many teams first implement the outbox pattern. Imagine an e-commerce company that has just introduced an outbox table for its order service:
// OrderService.java
@Transactional
public void createOrder(Order order) throws JsonProcessingException {
    // Save the order to the orders table
    orderRepository.save(order);

    // Create an outbox message describing the change
    OutboxEvent event = new OutboxEvent(
            UUID.randomUUID(),
            "Order",
            order.getId().toString(),
            objectMapper.writeValueAsString(orderToEventMapper.toEvent(order))
    );

    // Save the outbox message in the same transaction
    outboxRepository.save(event);
}
This works initially, but as their order volume grows, they encounter problems. During a flash sale, a long-running transaction inserts an outbox row with sequence number 157, while several quick transactions insert rows 158, 159, and 160. The polling job (a sketch of which follows below) picks up 158-160 first and misses 157 because its transaction hasn't committed yet. By the next polling cycle, the job is only looking for sequence numbers greater than 160, so the event for order 157 is skipped permanently.
They also struggle with the polling frequency—when set to 5 seconds, customers complain about shipping notification delays. When reduced to 1 second, their database starts showing increased load during peak hours.
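To make the failure mode concrete, here is a minimal sketch of such a naive polling relay; the repository method, the auto-increment position column, and the Spring scheduling setup are assumptions for illustration rather than code from the team's actual system.

// Naive polling relay sketch; repository and field names are illustrative
import java.util.List;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class OutboxPollingRelay {

    private final OutboxRepository outboxRepository;
    private final KafkaTemplate<String, String> kafkaTemplate;

    // Highest auto-increment position published so far; rows that commit late
    // with a lower position (like 157 in the example) are silently skipped
    private long lastSeenPosition = 0;

    public OutboxPollingRelay(OutboxRepository outboxRepository,
                              KafkaTemplate<String, String> kafkaTemplate) {
        this.outboxRepository = outboxRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @Scheduled(fixedDelay = 5000)   // the 5-second interval the team started with
    public void relayPendingEvents() {
        List<OutboxEvent> events =
                outboxRepository.findByPositionGreaterThanOrderByPositionAsc(lastSeenPosition);
        for (OutboxEvent event : events) {
            kafkaTemplate.send(event.getAggregateType() + "-events",
                    event.getAggregateId(), event.getPayload());
            lastSeenPosition = event.getPosition();
        }
    }
}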
Log-Based Change Data Capture (CDC)
After encountering these issues, the team switches to a log-based CDC approach using Debezium. They modify their database configuration to enable logical decoding (wal_level = logical on Postgres, which lets Debezium create a replication slot) and set up Debezium with Kafka Connect:
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "orders",
    "database.server.name": "ecommerce",
    "table.include.list": "public.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.fields.additional.placement": "aggregate_id:header:aggregate_id",
    "transforms.outbox.route.topic.replacement": "${routedByValue}-events",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.route.by.field": "aggregate_type"
  }
}
With this configuration, Debezium reads changes directly from the database’s transaction log. When the team compares their old and new approaches during another flash sale, they see dramatic improvements:
- Events are captured in the exact order they were committed, ensuring that if Order A was committed before Order B, the events arrive in Kafka in that same order
- No events are missed, even during high-volume periods, as the transaction log contains every committed change
- Latency drops from seconds to milliseconds, as events are captured almost immediately after commit
- Database load decreases since there’s no need for frequent, potentially expensive queries
The team also simplifies their code. Their order service looks almost identical, but they completely eliminate their polling service. Instead, Debezium handles extracting events and publishing them to Kafka.
Taking it a step further, Postgres users in the team even explore removing the outbox table entirely, using pg_logical_emit_message() to write directly to the transaction log:
@Transactional
public void createOrder(Order order) throws JsonProcessingException {
    // Save the order to the orders table
    orderRepository.save(order);

    // Emit the event straight into the transaction log; no outbox table is involved.
    // A bind parameter avoids the manual quote escaping of string concatenation.
    String payload = objectMapper.writeValueAsString(orderToEventMapper.toEvent(order));
    jdbcTemplate.queryForObject(
            "SELECT pg_logical_emit_message(true, 'outbox', ?::text)::text",
            String.class,
            payload
    );
}
While requiring additional configuration in Debezium, this approach eliminates the need for outbox table housekeeping entirely.
Advanced Implementation Considerations
Housekeeping
Since the outbox table only serves as a temporary store for messages, you need a strategy to prevent it from growing indefinitely:
- Transactional delete: Insert and delete the outbox row in the same transaction; the delete doesn't stop CDC from capturing the insert (see the sketch after this list)
- Periodic cleanup: Run a scheduled job to remove processed messages
- For Postgres users: Use pg_logical_emit_message() to write directly to the transaction log without using a table at all
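A minimal sketch of the transactional-delete variant, reusing the earlier createOrder example; it assumes OutboxRepository is a Spring Data JpaRepository so that flush() is available to force the INSERT out before the DELETE.

// Transactional-delete sketch; assumes OutboxRepository extends JpaRepository
@Transactional
public void createOrder(Order order) throws JsonProcessingException {
    orderRepository.save(order);

    OutboxEvent event = new OutboxEvent(
            UUID.randomUUID(),
            "Order",
            order.getId().toString(),
            objectMapper.writeValueAsString(orderToEventMapper.toEvent(order)));

    // Both statements are recorded in the transaction log, so Debezium still
    // captures the INSERT, but the outbox table itself never accumulates rows
    outboxRepository.save(event);
    outboxRepository.flush();   // ensure the INSERT is issued before the DELETE
    outboxRepository.delete(event);
}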
Backfilling Events
Sometimes you need to recreate events for existing data (e.g., after misconfiguration or data loss). The outbox pattern can be extended to support this:
- Use a “chunk” approach with marker events to delineate windows of backfilled data
- Implement deduplication logic to discard backfill events for records that have received regular updates
- Process these chunks efficiently while maintaining consistency with ongoing operations
Ensuring Idempotency in Consumers
Since message delivery in distributed systems often has at-least-once semantics, consumers need to handle potential duplicates:
- Use a monotonically increasing value (like Postgres’s LSN – Log Sequence Number) instead of arbitrary IDs
- Consumers can track the highest processed LSN and ignore any messages with lower values
- This approach requires storing only a single value rather than an ever-growing set of processed IDs (see the sketch below)
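A minimal consumer-side sketch of this idea: the lsn header, the processed_offsets table, and the consumer name are assumptions for illustration, and how the LSN reaches the consumer depends on how your relay or Debezium configuration exposes it.

// Idempotent consumer sketch; header name, table, and consumer id are hypothetical
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class OrderEventConsumer {

    private final JdbcTemplate jdbcTemplate;

    public OrderEventConsumer(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @KafkaListener(topics = "Order-events")
    @Transactional
    public void onOrderEvent(ConsumerRecord<String, String> record) {
        // The LSN of the originating database change, carried as a record header
        long lsn = Long.parseLong(
                new String(record.headers().lastHeader("lsn").value(), StandardCharsets.UTF_8));

        // A single row per consumer holds the highest LSN processed so far
        // (assumes the row was seeded when the consumer was deployed)
        Long highestProcessed = jdbcTemplate.queryForObject(
                "SELECT last_lsn FROM processed_offsets WHERE consumer = ?",
                Long.class, "shipment-service");

        if (highestProcessed != null && lsn <= highestProcessed) {
            return;   // duplicate or out-of-order redelivery: already handled
        }

        // ... apply the business logic for the event ...

        jdbcTemplate.update(
                "UPDATE processed_offsets SET last_lsn = ? WHERE consumer = ?",
                lsn, "shipment-service");
    }
}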
Alternative Approaches
While the outbox pattern is often the best solution, alternatives exist:
Event Sourcing Approach
Instead of writing to a database first, you could:
- Write events directly to Kafka
- Build a local read model by consuming these events
This works but has significant drawbacks:
- Loss of synchronous read-your-writes consistency
- Complexity in implementing constraints and validations
- More complex user experience (events may not be immediately queryable)
Two-Phase Commit with Kafka
With upcoming support for two-phase commit transactions in Kafka (KIP-939), a direct dual-write approach may become viable. However:
- It increases the availability requirements (both database AND Kafka must be available)
- It brings additional overhead to your synchronous request processing
- It may not provide significant advantages over the outbox pattern for most use cases
Stream Processing for Legacy Applications
For legacy applications that can’t be modified:
- Use CDC to capture raw data changes
- Apply stream processing (e.g., with Apache Flink) to transform these into higher-level events
- Publish the transformed events to Kafka topics for consumption
This works well for unmodifiable systems but makes transactional consistency more challenging to maintain. A minimal sketch of such a transformation job follows below.
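The following sketch assumes Flink's Kafka connector (flink-connector-kafka) and Debezium change events arriving as JSON strings; the topic names and the transformation logic are illustrative placeholders.

// Sketch of a Flink job turning raw CDC change events into business events
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OrderChangeToEventJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Raw row-level change events captured by Debezium from the legacy orders table
        KafkaSource<String> rawChanges = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("ecommerce.public.orders")
                .setGroupId("order-event-transformer")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Higher-level business events for downstream consumers
        KafkaSink<String> orderEvents = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("order-events")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(rawChanges, WatermarkStrategy.noWatermarks(), "raw-order-changes")
                .map(OrderChangeToEventJob::toBusinessEvent)
                .sinkTo(orderEvents);

        env.execute("legacy-order-change-to-event");
    }

    // Placeholder: a real job would map a Debezium change payload
    // (before/after images) to a domain event such as OrderCreated
    private static String toBusinessEvent(String changeEventJson) {
        return changeEventJson;
    }
}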
CDC Outbox Checklist
- Design your outbox table structure
  - Include message ID, aggregate type, aggregate ID, and payload
  - Consider adding metadata for routing and tracking
- Select your CDC approach
  - For most systems, log-based CDC with Debezium is recommended
  - Configure the Debezium outbox event router for streamlined implementation
- Implement housekeeping
  - Either delete records in the same transaction or implement a cleanup process
  - For Postgres, consider direct transaction log messages
- Ensure consumer idempotency
  - Use LSNs or similar monotonically increasing values
  - Implement tracking of processed messages
- Test thoroughly
  - Verify behaviour under network partitions
  - Confirm recovery after process crashes
  - Validate ordering guarantees