
11 · Data Consistency Engineering: Getting Data Right in a World Without Cross-Service Transactions

The thesis in one line: In the single-machine era, one BEGIN…COMMIT could make "deduct inventory, record the order, issue the coupon" all succeed together or all fail together. The moment those three things scatter across three services and three databases, that transaction boundary that let you sleep soundly is mercilessly severed by the network. The previous chapter explained why distributed systems get messy, lose data, and fork; this chapter picks up from "at-least-once + idempotency" and walks through a body of engineering craft for getting data right anyway, in a world without cross-service transactions: Saga, Outbox, idempotency, event sourcing, CQRS, and contract evolution.


🧭 This is Chapter 2 of the Advanced track. 10 · The Hard Truths of Distributed Systems laid out the "pathology" — partial failure, no global clock, consensus is expensive, exactly-once is an illusion. This chapter is the "clinic": now that you know things will go wrong, how exactly do you treat them? The hook planted at the end of the last chapter — "at-least-once delivery + idempotent consumers = effectively exactly-once" — is precisely the foundation under every technique in this chapter.

Use an everyday scenario to hold the whole chapter together: placing an online order. Behind it are three actions: "deduct inventory, create the order, issue the coupon." On a single machine, one database transaction can make them "all succeed or all fail." But once those three actions live in three services and three databases, the transaction boundary that let you sleep at night is cut by the network — inventory is deducted, the order is created, but the coupon never goes out, leaving the data half-baked. This chapter's whole toolkit (Saga, Outbox, idempotency...) solves one question: once the universal transaction boundary is gone, how do we make the data right again?

This is also the layer that tests engineers most in the AI era. AI can write you the "happy path" order code in seconds: deduct inventory, create the order, send the message, await all the way through. But it almost never asks, of its own accord, "what happens to the first two steps when the third one fails?" And that is precisely the subject of this chapter.


1. Why a "Cross-Service Transaction" Is Nearly a Pipe Dream: The Temptation and Cost of 2PC

Let's go back to the biggest pain of microservices, raised in 04 · The Ten Core Architecture Patterns: once you adopt "database per service," a single business action that spans multiple services no longer has any transaction boundary that can frame all of its steps together.

   Monolith era (one transaction handles it all):
   ┌──────────────────────────────────────┐
   │  BEGIN                                 │
   │    deduct inventory ──┐                │
   │    create order     ──┼─ same DB,      │  all succeed
   │    issue coupon     ──┘  one txn       │  or all fail
   │  COMMIT                                │
   └──────────────────────────────────────┘

   Microservices era (three databases, none can frame the others):
   ┌──────────┐   ┌──────────┐   ┌──────────┐
   │ Inventory │   │  Order    │   │ Marketing │
   │  DB_A     │   │  DB_B     │   │  DB_C     │
   └────┬─────┘   └────┬─────┘   └────┬─────┘
        deducted ✓     created ✓      ✗ crashed
        ──────────────────────────────────────▶
        Inventory deducted, order created, coupon never issued — the data is left "half-baked"

So can't we just frame those three databases back into one transaction? That's exactly what "distributed transactions" set out to do, and the most classic protocol is called Two-Phase Commit (2PC):

   Phase 1 (vote): the coordinator asks every participant "are you ready to commit?"
      Coordinator ──prepare──▶ [Inventory] [Order] [Marketing]
                               ◀── yes / yes / yes
   Phase 2 (commit): only when all say yes does it send commit; one no and everyone rolls back
      Coordinator ──commit ──▶ [Inventory] [Order] [Marketing]

It sounds beautiful, but the cost is so high that big companies steer clear of it on their core paths:

  • Synchronous blocking, long-held locks: between "vote" and "commit," every participant has to lock the relevant data and clutch its resources, waiting idle. As long as a single participant is slow, all participants are dragged down together — and under high concurrency, these locks pile up into a disaster fast.
  • The coordinator is a single point of failure: the Phase 2 commit command goes out halfway, and the coordinator crashes — some participants received the commit, others didn't, and the rest sit there clutching their locks, caught in limbo (in-doubt), with nothing to do but wait for the coordinator to come back to life and issue the final command.
  • Inherently at odds with availability: 2PC requires every participant to be online and to vote yes before it can move forward. That effectively wires multiple services' availability "in series": drop any one and the whole transaction jams. In the CAP terms of chapter 10, this is choosing strong consistency (C) at the cost of availability (A).
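The two phases and the "in-doubt" window described above can be sketched in a few lines. This is a toy model under stated assumptions, not a real XA implementation: the `Participant` class, the `two_phase_commit` function, and the `crash_after` knob are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """A toy resource manager; `can_commit` simulates its vote."""
    name: str
    can_commit: bool = True
    state: str = "idle"  # idle -> prepared (locks held) -> committed / aborted

    def prepare(self) -> bool:
        # Phase 1: vote. A "yes" means locks are held until told otherwise.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants, crash_after=None):
    """Run 2PC. `crash_after=k` simulates the coordinator dying after
    sending commit to the first k participants (hypothetical knob)."""
    votes = [p.prepare() for p in participants]   # Phase 1: collect votes
    if not all(votes):
        for p in participants:
            p.abort()                             # a single "no" aborts everyone
        return "aborted"
    for i, p in enumerate(participants):          # Phase 2: send commit
        if crash_after is not None and i >= crash_after:
            return "coordinator crashed"          # the rest stay in doubt
        p.commit()
    return "committed"


services = [Participant("inventory"), Participant("order"), Participant("marketing")]
outcome = two_phase_commit(services, crash_after=1)
states = {p.name: p.state for p in services}
```

After this run, `inventory` is committed while `order` and `marketing` are still `prepared`: they voted yes, so they can neither commit nor abort on their own and must keep their locks until the coordinator returns.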

Architectural wisdom: 2PC isn't "unusable," it's "extremely costly when used in the wrong place." Inside a single database, or among a small set of tightly coupled resources (say, multiple shards of one database), 2PC/XA still has its place; but when you span multiple independent services and also have to withstand high concurrency, 2PC's synchronous blocking and single-point coordinator will make it collapse. So the industry's mainstream answer isn't "make the transaction even stronger," but "just give up on strong cross-service transactions and switch to an eventually-consistent scheme that can tolerate intermediate states" — and that sets the stage for Saga next.


2. Saga: Splitting One Big Transaction Into "a Chain of Local Transactions + Compensation on Failure"

Since we can't frame multiple services into one transaction, let's flip our thinking: split the big transaction into a chain of small local transactions, each running only against its own service's database (local transactions are reliable). On failure, instead of relying on "rollback," run a chain of "compensating actions" in reverse to undo what's already been done. This is the Saga pattern (Chris Richardson systematized it on microservices.io).

The classic example microservices.io gives is "an order must not exceed the customer's credit limit": the order and the customer live in different databases, so you can't pin them down with a single ACID transaction, and so it's split into a Saga —

   Normal path (each step is one local transaction):
   ① create order (pending) ─▶ ② deduct inventory ─▶ ③ reserve credit limit ─▶ ④ order → confirmed

   When some step fails, run compensation in reverse (undo the completed steps):
   ③ reserving credit fails ✗
        ◀── compensate ②: add the inventory back
   ◀────── compensate ①: mark the order as cancelled

The most crucial point, and the one newcomers most easily misunderstand: compensation is not a database rollback.

   Rollback: the transaction hasn't committed yet; the database "acts as if nothing happened" for you, clean and tidy.
   Compensation: the previous step "has already committed, and may even have produced
                 external side effects" — you can't erase history, you can only perform
                 another "reverse operation" to offset it.
  • The money has already been paid to the merchant → compensation isn't "pretend it was never paid," it's initiating a refund (a new, reverse transaction record).
  • The email has already been sent → compensation isn't "recall the email" (you can't), it's sending another "order cancelled" correction email.
  • The inventory has already been deducted → compensation is adding the inventory back (but by this moment someone else may have already bought it, which is the inherent trouble with compensation).
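The "run compensation in reverse" rule can be sketched as a tiny in-process orchestrator. This is a minimal illustration, assuming each step is a pair of plain functions acting on a shared dictionary; `run_saga` and all step names are hypothetical, and a real Saga would persist its progress and call remote services.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run each local transaction in order; on failure, run the
    compensations of the already-completed steps in reverse order."""
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(ctx)
            done.append(step)
        except Exception as exc:
            ctx["log"].append(f"{name} failed: {exc}")
            for comp_name, _, compensate in reversed(done):
                compensate(ctx)  # a new offsetting action, not a rollback
                ctx["log"].append(f"compensated {comp_name}")
            return False
    return True


def create_order(ctx): ctx["order"] = "pending"
def cancel_order(ctx): ctx["order"] = "cancelled"
def deduct_inventory(ctx): ctx["stock"] -= 1
def restock(ctx): ctx["stock"] += 1


def reserve_credit(ctx):
    if ctx["credit"] < ctx["price"]:
        raise ValueError("insufficient credit")
    ctx["credit"] -= ctx["price"]


def release_credit(ctx): ctx["credit"] += ctx["price"]


ctx = {"stock": 5, "credit": 10, "price": 50, "log": []}
ok = run_saga(
    [
        ("create order", create_order, cancel_order),
        ("deduct inventory", deduct_inventory, restock),
        ("reserve credit", reserve_credit, release_credit),
    ],
    ctx,
)
```

Note that the order ends up as `cancelled`, not deleted: the history of the attempt remains visible, which is exactly the difference between compensation and rollback. In a real system each compensation would also need to be idempotent, because it may itself be retried.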

Architectural wisdom: Saga trades a "semantic undo" for "no cross-service locks." The price is that you must face the intermediate states head-on: while step ③ hasn't finished, the system is genuinely in the awkward state of "order created, inventory deducted, but not yet confirmed." The order's pending status during this window works as what the Saga literature calls a semantic lock: a business-level flag telling everyone else that this record is not final yet. You have to answer, in business terms: what does the user see during this time? Can it be cancelled? Will others see this "half-finished" order? Saga moves consistency from "the database guarantees it for you" to "the business process guarantees it for you": you save the locks, but you gain a state machine you now have to worry about.

Orchestration vs. Choreography: Who Conducts This Chain of Steps?

There are two styles for how Saga strings this chain of steps together, and this is the most important trade-off in this section:

   Orchestration: there's a "conductor" giving the orders
   ┌────────────┐
   │   Saga      │ ① command→ Inventory ──receipt→┐
   │ orchestrator│ ② command→ Order     ──receipt→┤ it records "which step we're on now"
   │ (central    │ ③ command→ Marketing ──receipt→┘ on failure, sends compensation commands back
   │  brain)     │
   └────────────┘
   Pros: the flow is clear at a glance, easy to monitor, compensation logic is centralized.
   Cons: the orchestrator is a new core component, prone to becoming a "god service."

   Choreography: no conductor; each service reacts to "events" on its own
   Inventory ──emits "inventory deducted"──▶ Order ──emits "order created"──▶ Marketing
                                                                             │emits "coupon issued"
   Pros: no center, low coupling, each service is autonomous.
   Cons: the flow is "scattered" everywhere; no one can see the whole picture at a glance, and compensation chains are hard to trace.

| | Orchestration | Choreography |
|---|---|---|
| Control flow | Centralized in the orchestrator, the whole picture is visible | Dispersed across events, relying on event "relay" |
| Coupling | Services only talk to the orchestrator, decoupled from each other | Services are decoupled via events, but with an implicit "who listens to whom" undercurrent |
| Observability | Strong: one place shows "which step it's stuck on" | Weak: the flow is scattered, troubleshooting is like a jigsaw puzzle |
| Best for | Critical long flows with many steps, complex compensation, and a need for strong monitoring (orders, payments, onboarding) | Lightweight flows with few steps, few participants, pursuing loose coupling |
| Risk | The orchestrator bloats into a "god service" | A few too many steps and it becomes "event spaghetti," with no one able to explain the whole flow |
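To make the choreography style concrete, here is a minimal sketch using an in-memory event bus. The `subscribe` and `publish` helpers and the event names are assumptions made up for this example; a real system would use a message broker and separate processes.

```python
from collections import defaultdict

handlers = defaultdict(list)   # event name -> list of subscriber callbacks
trace = []                     # records the order in which events fired


def subscribe(event):
    """Decorator that registers a function as a listener for `event`."""
    def register(fn):
        handlers[event].append(fn)
        return fn
    return register


def publish(event, payload):
    trace.append(event)
    for fn in handlers[event]:
        fn(payload)


@subscribe("order_requested")
def inventory_service(payload):
    publish("inventory_deducted", payload)   # reacts, then emits its own event


@subscribe("inventory_deducted")
def order_service(payload):
    publish("order_created", payload)


@subscribe("order_created")
def marketing_service(payload):
    publish("coupon_issued", payload)


publish("order_requested", {"order_id": 12345})
```

Notice that no single function contains the whole flow: the sequence only exists in `trace` after the fact. That is the autonomy, and also the observability problem, that the comparison above describes. An orchestrator would instead hold the sequence in one place, much like a list of steps handed to a single function.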

Key judgment: Few steps, short chain: use choreography. Many steps, complex compensation, and a need to see at a glance "where it is now, and why it's stuck": use orchestration. A simple rule of thumb: when you find yourself "having to flip through several services' logs to piece together what actually happened to a single order," it's time to bring in an orchestrator. This is also why companies like Uber and DoorDash build or adopt dedicated durable workflow engines (the case study below covers this). In essence, they've turned the Saga orchestrator into platform-level infrastructure.


3. The Dual-Write Problem: You Changed the Database AND Need to Send a Message — How Do You Avoid Losing or Duplicating?

Whether it's Saga or event-driven, both rest on an extremely common, extremely insidious problem — the dual write:

A service finishes handling a request and typically has to both change its own database (the order is created) and send a message/event externally (notify downstream that "the order is created"). Those two things land in two different systems (the database + the message broker), and no transaction can frame them both at once.

So no matter which one you put first, there's a failure window:

   Option A: write the database first, then send the message
      DB COMMIT ✓ ───(at this instant the process crashes / network hiccups / the broker happens to be down)───▶ message never sent
      Result: the database says "the order is created," but downstream will never receive the notification → lost message

   Option B: send the message first, then write the database
      message sent ✓ ───(at this instant the DB write fails / rolls back)───▶ no such order in the database
      Result: downstream receives "the order is created," but the database has no such record → phantom message

The reason this problem is insidious is that it almost never shows up in the test environment (everything goes smoothly locally), yet in production's tail probability it keeps manufacturing "the database and downstream don't match up" ghost incidents. Red Hat, Confluent, and others all rank it as the number-one pitfall of event-driven architecture.

The solution: the Transactional Outbox. The core idea is elegant to the point of being almost cunning — since "sending a message" can't be put in the same transaction as "writing the database," then don't actually send the message; instead, insert a row into an outbox table in the same database, and commit that row together with the business data in the same local transaction. This way the "business change" and the "message I want to send" atomically live and die together. Then let an independent "courier" read this table and actually send the messages out.

   ┌──────────── same local transaction (atomic) ────────────┐
   │  INSERT order (orders)                                    │
   │  INSERT pending message (outbox) ← the message also       │  both succeed
   │                                    lands in the database  │  or both fail
   └──────────────────────────────────────────────────────────┘
                         │ after the transaction commits

   ┌──────────────┐  read new outbox rows  ┌──────────────┐
   │   Message     │ ─────────────────────▶ │ Message broker│ ──▶ downstream
   │   Relay       │  mark as sent after     │    Kafka      │
   └──────────────┘  publishing             └──────────────┘
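The write side of the pattern can be sketched with Python's built-in `sqlite3` module. The table layout, the `create_order` function, and the `fail` flag are assumptions for illustration only; any relational database with local transactions would work the same way.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(conn, order_id, fail=False):
    """Insert the order and its pending event in ONE local transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'created')", (order_id,)
        )
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order_created", json.dumps({"order_id": order_id})),
        )


create_order(db, 1)
try:
    create_order(db, 2, fail=True)   # neither the order nor the event survives
except RuntimeError:
    pass

orders = [r[0] for r in db.execute("SELECT id FROM orders ORDER BY id")]
pending = db.execute("SELECT COUNT(*) FROM outbox WHERE sent = 0").fetchone()[0]
```

Order 1 exists together with its pending event; order 2 left no trace in either table. There is no state in which the order exists without its event, which is the whole point of the pattern.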

There are two ways for the courier to read the outbox, which microservices.io calls:

  • Polling Publisher: periodically SELECT the rows in that table that are "not yet sent," send them out, and mark them as sent. Simple and direct; the downside is polling latency and extra query pressure.
  • Transaction Log Tailing = CDC (Change Data Capture): instead of querying the table, read the database's transaction log directly (binlog / WAL) — the moment a new row is committed to the outbox table, capture it and send it out immediately. This is exactly what Debezium does — it even built a dedicated Outbox Event Router that routes outbox-table changes directly into Kafka messages, with zero polling and preserved ordering.
   The essence of CDC: the database's "transaction log" is itself a stream of truth about "what happened."
   write DB ──▶ binlog / WAL (which exists for replication and recovery anyway) ──▶ Debezium tails ──▶ Kafka
            "the log is the truth" — the core idea of Jay Kreps's "The Log" (covered in this chapter's case study)

Key judgment: The moment "changed the database, and also need to send a message/event" appears in your system, you almost certainly have to use an Outbox, rather than naively appending a line of kafka.send() after writing the database. The latter is the "happy path" that AI most loves to generate, and that most easily plants a landmine, in the vibe-coding era. Note that the guarantee the Outbox gives you is at-least-once (the courier may crash "after sending but before marking as sent," causing a resend) — so it must be paired with the idempotent consumption of the next section; you can't have one without the other.
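A Polling Publisher, including the crash window that makes it at-least-once, might look like the sketch below. The `relay_once` function, the `crash_before_mark` flag, and the list standing in for the broker are all hypothetical; a production relay would also batch, back off, and handle broker errors.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)"
)
with db:
    db.executemany(
        "INSERT INTO outbox (payload) VALUES (?)",
        [("order 1 created",), ("order 2 created",)],
    )

broker = []  # stand-in for a real broker such as Kafka


def relay_once(conn, publish, crash_before_mark=False):
    """One polling pass: publish unsent rows in id order, then mark them sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)
        if crash_before_mark:
            raise RuntimeError("relay crashed after publishing, before marking")
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


try:
    relay_once(db, broker.append, crash_before_mark=True)  # first pass dies midway
except RuntimeError:
    pass
relay_once(db, broker.append)                              # restart and poll again

unsent = db.execute("SELECT COUNT(*) FROM outbox WHERE sent = 0").fetchone()[0]
```

After the restart, `broker` contains "order 1 created" twice: nothing was lost, but one message was duplicated. Publishing and marking are themselves two writes to two systems, so the relay can only promise at-least-once.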


4. Idempotency: The Cornerstone of Distributed Correctness

The previous chapter spelled it out: the transport layer can only achieve at-least-once (no loss, but possible duplication). The Outbox is also at-least-once. Retries, even more so, are everywhere. This means "the same operation gets executed twice" is the norm in distributed systems, not the exception. Making a system still correct when it's "executed repeatedly" — this property is called idempotency.

   Definition of idempotency: the same operation, executed once and executed N times,
   has exactly the same effect on the system's state.

   Naturally idempotent:     SET balance = 100   ← set it to 100, do it any number of times, it's still 100 ✓
   Naturally non-idempotent: balance = balance + 100  ← do it twice and you've added 100 extra, a disaster ✗
                                                        (double charge / double payout)

Operations like "add/subtract," "ship," "pay out money," "send a message" are naturally non-idempotent, and yet they happen to be the ones that can least afford to go wrong. How do you remake them into being idempotent? A three-piece set:

   ① Idempotency Key: every operation carries a globally unique ID
      (e.g., for "the payment of order 12345," key = pay_order_12345)
   ② Dedup Table: the consumer maintains a table recording "this key, I've already processed"
   ③ Idempotent consumer: when a message arrives, check the dedup table first. Seen it? Skip and return
      the stored result. Never seen it? Process it and record the key.

   ┌─ receive message (key=pay_order_12345)
   │   is this key in the dedup table?
   │      yes ──▶ already processed, return success directly (don't charge a second time)
   │      no  ──▶ in the same transaction: execute the charge + write the key, commit together
   └─

Note that "in the same transaction: execute the business + write the idempotency key" in step ③ — this is once again the craft of "tucking correctness into one local transaction," cut from the same cloth as the Outbox. If you split "execute the business first, then write the key" into two steps, a crash in between will leave a hole of "money charged but key not recorded," and the next retry charges again.
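The three-piece set can be sketched with `sqlite3` as follows. The table names, the `charge` function, and the result string are illustrative assumptions; the part that matters is that the balance update and the key insert share one transaction.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE processed (idem_key TEXT PRIMARY KEY, result TEXT NOT NULL);
    INSERT INTO accounts VALUES ('alice', 100);
""")


def charge(conn, idem_key, account, amount):
    """Charge at most once per idempotency key, however often it is delivered."""
    seen = conn.execute(
        "SELECT result FROM processed WHERE idem_key = ?", (idem_key,)
    ).fetchone()
    if seen:
        return seen[0]  # duplicate delivery: return the stored result
    result = f"charged {amount}"
    with conn:  # business change + key are committed together, or not at all
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account),
        )
        conn.execute(
            "INSERT INTO processed (idem_key, result) VALUES (?, ?)",
            (idem_key, result),
        )
    return result


first = charge(db, "pay_order_12345", "alice", 30)
second = charge(db, "pay_order_12345", "alice", 30)   # redelivered message
balance = db.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone()[0]
```

The primary key on `idem_key` is a second line of defense: if two deliveries race past the initial check, the later insert fails and takes its balance update down with it.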

Architectural wisdom: "Retry" and "idempotency" are a pair of twin brothers that must appear together. Any time you write "retry on failure" in your code, you must first ask: is the operation being retried idempotent? A non-idempotent operation paired with automatic retries amounts to installing a "random double-charging machine" in your system. The idempotent charging of the payment system and the dedup-and-rate-limiting of the notification system are living templates of this three-piece set; payment APIs like Stripe make Idempotency-Key a first-class request header exposed externally, precisely because they know full well: the client's network will definitely retry, so the server must be idempotent.
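The client half of that pairing is worth a sketch too: the key must be generated once and reused on every retry. The `FlakyPaymentServer` class below is a hypothetical in-memory stand-in, not Stripe's API; it processes the request and then "loses" the first response, which is the case that makes clients retry.

```python
import uuid


class FlakyPaymentServer:
    """Hypothetical server that loses its first response after processing."""

    def __init__(self):
        self.results = {}              # idempotency key -> stored response
        self.charges = 0               # how many times money actually moved
        self.drop_next_response = True

    def charge(self, idem_key, amount):
        if idem_key not in self.results:
            self.charges += 1
            self.results[idem_key] = {"status": "succeeded", "amount": amount}
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost in transit")
        return self.results[idem_key]


def charge_with_retry(server, amount, attempts=3):
    idem_key = str(uuid.uuid4())       # generated ONCE, reused on every retry
    for _ in range(attempts):
        try:
            return server.charge(idem_key, amount)
        except TimeoutError:
            continue                   # the client cannot tell whether it was charged
    raise RuntimeError("gave up after retries")


server = FlakyPaymentServer()
response = charge_with_retry(server, 500)
```

The client saw a timeout and retried, yet `server.charges` is 1. Had the key been generated inside the loop, each retry would have looked like a brand-new payment.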


5. Event Sourcing: Store "What Happened," Not "What It Is Now"

💧 Optional deep dive (this section and the next one on CQRS are both safe to skip on first read; the main line does not depend on them). Event sourcing and CQRS are "heavy weapons" — most everyday CRUD systems do not need them, and forcing them in usually creates more trouble than value. For a first pass, you only need one takeaway: know what they are, and know when it is finally your turn to use them. The rest is the full version for readers who want to understand them.

Up to here, we've been patching within the world of "storing the current state." Now for a flip in worldview — Event Sourcing: instead of storing "the account's balance is now 100," store the sequence of events that led to this balance: "open account (+0) → deposit 80 → deposit 50 → withdraw 30." The current state (balance 100) is no longer something saved directly, but the result computed by "replaying" all the events in order.

   Traditional (state storage): the database holds only one current value; each change "overwrites" the old value
      account.balance = 100   ← you'll never know what it "used to be," or why it became 100

   Event sourcing (append-only, never overwrite):
      [open account] [deposit 80] [deposit 50] [withdraw 30]  ← an append-only, never-modified event stream
       └──────────────────── replay and sum ───────────────────▶ current balance = 100
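The replay in the diagram above can be written as a small fold over the event list. The `Event` class and the event kind names are assumptions chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)  # events are immutable once written
class Event:
    kind: str            # "opened", "deposited", or "withdrawn"
    amount: int = 0


def replay(events: Iterable[Event], balance: int = 0) -> int:
    """Compute the state by applying every event in order."""
    for e in events:
        if e.kind == "deposited":
            balance += e.amount
        elif e.kind == "withdrawn":
            balance -= e.amount
        # "opened" does not change the balance
    return balance


stream = [
    Event("opened"),
    Event("deposited", 80),
    Event("deposited", 50),
    Event("withdrawn", 30),
]
current = replay(stream)          # state after all four events
earlier = replay(stream[:2])      # state as of the first deposit
```

Replaying a prefix of the stream gives the state at an earlier moment, which is all "time travel" means here.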

This idea is actually something you use every day, just without realizing it:

   Double-entry bookkeeping (centuries of accounting wisdom): the ledger is append-only. A mistake can't be "erased,"
                                                              you can only post a "reversing" entry. Current balance = sum of all entries.
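   Git: the repo stores not just "what the files look like now," but an append-only history of commits.
        `git checkout <some commit>` takes you back to that point in time: this is time travel.
        (Strictly, each commit records a snapshot rather than a diff, so the parallel is the immutable history, not the replay.)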

What event sourcing buys you is very tempting:

  • Complete audit: every state change is an immutable event, inherently a perfect audit log — a hard requirement in finance and compliance scenarios ("how exactly did this money end up like this?" is always traceable).
  • Replayable, time travel: want to know "what state this account was in at 3 PM last Tuesday"? Just replay the events up to that moment. A bizarre bug in production? Replay that sequence of events in the test environment and reproduce it perfectly.
  • One set of events, many interpretations: the same event stream is used today to compute the balance, and tomorrow can be used for risk-control features or data analysis — a new requirement doesn't have to change history, it only needs a new "projection" that reinterprets the old events.

But its cost is equally hardcore — don't let the halo go to your head:

  • Querying gets hard: the database holds a pile of events; want to query "accounts with a balance over 1000"? You can't WHERE directly — you have to replay the events into state first. This is exactly the problem the next section's CQRS sets out to solve (the two often appear as a pair).
  • Event schema evolution is hell: once an event is written, it's never deleted or modified (it's the truth). But three years later your event structure has changed, and the old events must still be replayable correctly by the new code — this kind of "backward compatibility spanning years" is event sourcing's hardest engineering challenge (Section 7 covers evolution specifically).
  • Replay cost: once events accumulate into the millions, computing the current state by replaying from the very beginning every time gets slow — so you periodically store snapshots, replaying forward from the most recent snapshot rather than from the dawn of creation. (Like a save point in a game: you don't restart from level one every time; you continue from the latest save.)
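The snapshot idea from the last bullet can be sketched like this. The `Snapshot` type and the `load` function are hypothetical names; the assumption is that a snapshot records how many events it already covers.

```python
from typing import List, NamedTuple, Optional, Tuple

Event = Tuple[str, int]  # (kind, amount); kind is "deposit" or "withdraw"


class Snapshot(NamedTuple):
    version: int   # number of events already folded into `balance`
    balance: int


def apply(balance: int, event: Event) -> int:
    kind, amount = event
    return balance + amount if kind == "deposit" else balance - amount


def load(events: List[Event], snapshot: Optional[Snapshot] = None) -> Tuple[int, int]:
    """Return (balance, events_replayed), starting from the snapshot if any."""
    start, balance = (snapshot.version, snapshot.balance) if snapshot else (0, 0)
    tail = events[start:]
    for event in tail:
        balance = apply(balance, event)
    return balance, len(tail)


events = [("deposit", 1)] * 1000 + [("withdraw", 200), ("deposit", 50)]
snap = Snapshot(version=1000, balance=load(events[:1000])[0])

full = load(events)          # replays all 1002 events
fast = load(events, snap)    # replays only the 2 events after the snapshot
```

Both calls arrive at the same balance; the snapshot only changes how much work it takes. The events stay the source of truth, so a snapshot can always be thrown away and rebuilt.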

Key judgment: Event sourcing is a double-edged sword, by no means "more advanced, therefore better." It shines in domains where "auditability/traceability overrides everything, and the business is naturally a sequence of events" (ledgers, transactions, order state machines, the edit history of a collaborative document); but if you cram it into an ordinary CRUD backend, all you buy is the cost of "hard to query, hard to evolve, heavy cognitive load," without ever using its upside. Ask first: do I really need "the complete history of every past moment"? If not, just honestly store state.


6. CQRS: Splitting "Read" and "Write" Into Two Separate Models

Event sourcing left behind a hard problem — "the event stream can't be queried directly." The solution is CQRS (Command Query Responsibility Segregation). We mentioned it in 04; here we dig in deeper.

The core idea in one line: use one model for writes, another model for reads, with "events / sync" feeding the read model fresh in between.

Analogy: a restaurant separates the kitchen from the menu display. The kitchen (write side) only cares that dishes are made correctly and inventory is accurate; the menu, review wall, and bestseller board (read side) only care that guests can understand and order quickly. Each is optimized for its job, and the waiter (event sync) carries updates between them. The trade-off is that the "sold out" label on the menu may lag the kitchen by a moment (eventual consistency).

   Traditional: reads and writes share the same model, the same table — has to be both write-friendly
                (normalized, strongly consistent) and read-friendly (all kinds of queries).
                The result is often "compromise on both ends," where one complex query can drag down the whole write database.

   CQRS: split left and right, each optimized on its own
   Write side (Command)                          Read side (Query / materialized views)
   ┌──────────────┐  domain events  ┌─────────────────────────────────────┐
   │ Write model   │ ──────────────▶ │ Read model 1: order list (optimized for the list page) │
   │ normalized/   │  projection      │ Read model 2: user profile (optimized for the detail page) │
   │ strongly      │  updates         │ Read model 3: search index (optimized for search) │
   │ consistent;   │ ──────────────▶ └─────────────────────────────────────┘
   │ just "writes  │
   │ correctly"    │
   └──────────────┘
   After writing, only the write database is strongly consistent; the read database is "projected" out of events,
   slightly lagging = eventual consistency

microservices.io nailed CQRS's most practical use in microservices: once you have "one database per service," a query that "joins data across several services" can no longer be joined directly (the data is scattered across each service). CQRS's answer is: build a dedicated read database (view database) that subscribes to the domain events emitted by each service, pre-assembles and flattens the needed data, and stores it — at query time, just read this ready-made view, fast and simple.
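A read-side projection of this kind might look like the sketch below. The event shapes, the `seq` field, and the `OrderSummaryProjection` class are assumptions for illustration; a real view would live in its own database and subscribe to a broker.

```python
class OrderSummaryProjection:
    """Read model: per-customer order count and total, built from events."""

    def __init__(self):
        self.rows = {}       # customer_id -> {"orders": int, "total": int}
        self.last_seq = 0    # highest event sequence number already applied

    def apply(self, event: dict) -> None:
        if event["seq"] <= self.last_seq:
            return           # duplicate delivery: already projected
        if event["type"] == "OrderCreated":
            row = self.rows.setdefault(
                event["customer_id"], {"orders": 0, "total": 0}
            )
            row["orders"] += 1
            row["total"] += event["amount"]
        elif event["type"] == "OrderCancelled":
            row = self.rows[event["customer_id"]]
            row["orders"] -= 1
            row["total"] -= event["amount"]
        self.last_seq = event["seq"]


view = OrderSummaryProjection()
incoming = [
    {"seq": 1, "type": "OrderCreated", "customer_id": "c1", "amount": 40},
    {"seq": 2, "type": "OrderCreated", "customer_id": "c1", "amount": 60},
    {"seq": 3, "type": "OrderCancelled", "customer_id": "c1", "amount": 40},
    {"seq": 2, "type": "OrderCreated", "customer_id": "c1", "amount": 60},  # redelivery
]
for event in incoming:
    view.apply(event)

summary = view.rows["c1"]    # the query is a plain lookup, no joins
```

Tracking `last_seq` makes the projection safe under at-least-once delivery, assuming events arrive in order. Between an event being written and `apply` running, the view is stale, which is the eventual consistency discussed next.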

CQRS's sweet and bitter:

  • The sweet: ① reads and writes can be optimized and scaled independently (when reads vastly outnumber writes, pile on replicas on the read side while the write side doesn't budge); ② one write model can project into any number of read models tailored for different queries; ③ complex queries no longer drag down the write database.
  • The bitter: ① the read model is eventually consistent — you just finished writing and go to read, but the projection may not have caught up, so you read stale data (the classic "just placed an order but can't see it in the order list"). This must be confronted head-on in the product experience; ② system components, data redundancy, and operational complexity all double; ③ you've added an asynchronous "projection" pipeline that must be guaranteed not to lose or scramble data.

Key judgment (the most important in this section): CQRS is a "heavy weapon," and the vast majority of CRUD systems should not use it — it's a hotspot for over-engineering. The scenarios where it's truly worth it are narrow: read and write loads are severely asymmetric, or the read side's query shapes are extremely diverse (the same data has to be queried four completely different ways — list, detail, search, report). A simple test: when you find that "for the sake of one report query, you're forced to build a pile of oddly-shaped indexes on the core write database, and it's even slowing down order placement," that's when CQRS starts to pay off. Don't adopt CQRS just because "it sounds advanced"; its eventual consistency and doubled complexity are real debt that must be repaid.


7. Schema / Contract Evolution: Echoing "Data Is the Hardest to Change"

That line from 05 · Data & State, "logic is easy to change, data is hard," is amplified to the extreme in the distributed + event-driven world: your database schema, the event structures you emit, your API contracts — the moment someone else (another service, three years' worth of accumulated old events, an old client not yet upgraded) depends on them, you can no longer change them on a whim.

The difficulty lies in "old and new must coexist." During canary releases and rolling upgrades, the new version's and old version's code and data are online at the same time:

   The intermediate state of a rolling upgrade: old and new code online at the same time, old- and new-format data both existing
   ┌──────────┐  writes new format  ┌────────────┐
   │ New       │ ──────────────────▶ │            │ ◀── the old version is still reading; does it recognize the new format? (needs forward compat)
   │ version   │                     │ data /      │
   └──────────┘                     │ messages    │
   ┌──────────┐  reads old format   │            │
   │ Old       │ ◀────────────────── │            │ ◀── the new version has to read old data; does it recognize the old format? (needs backward compat)
   │ version   │                     └────────────┘
   └──────────┘
  • Backward compatible: new code can understand old data/old messages.
  • Forward compatible: old code can understand (or at least ignore without crashing) new data/new messages.
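Both directions can be illustrated with two versions of a JSON reader. The field names and the default currency are assumptions invented for this example.

```python
import json


def read_v1(raw: str) -> dict:
    """Old code: knows only order_id and amount, ignores anything else."""
    data = json.loads(raw)
    return {"order_id": data["order_id"], "amount": data["amount"]}


def read_v2(raw: str) -> dict:
    """New code: understands `currency`, added in version 2."""
    data = json.loads(raw)
    return {
        "order_id": data["order_id"],
        "amount": data["amount"],
        # Old events lack this field, so fall back to a default
        # (the default "USD" is an assumption for this sketch).
        "currency": data.get("currency", "USD"),
    }


old_message = '{"order_id": 1, "amount": 500}'
new_message = '{"order_id": 2, "amount": 700, "currency": "EUR"}'

backward = read_v2(old_message)   # new code reading old data
forward = read_v1(new_message)    # old code reading new data
```

Both directions work only because the change was additive and optional. Renaming `amount`, or making `currency` required without a default, would break one side or the other.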

And the industry's battle-tested approach for safely migrating a database from an old structure to a new one is called expand-contract (also known as "parallel change"). The intuition is like building a new bridge next to the old one: first let both bridges carry traffic in parallel (old and new fields coexist), then, once traffic has safely moved to the new bridge, dismantle the old one. At no point is the road closed.

   ❌ Dangerous approach: directly RENAME / DROP a column — at the instant of deployment,
                          the not-yet-upgraded old instances all crash en masse

   ✅ The three steps of expand-contract (the system stays available at every step):
   ┌── Expand ──────────┐   ┌── Migrate (dual write) ────┐   ┌── Contract ──────────┐
   │ Add new columns/    │   │ Code writes both new and    │   │ Only after confirming  │
   │ tables, don't drop   │   │ old copies; a background     │   │ no one reads the old   │
   │ the old ones.        │   │ job slowly moves existing    │   │ field anymore do you    │
   │ Old code is fully     │   │ data to the new structure.   │   │ safely drop the old     │
   │ unaware.             │   │                              │   │ columns/old code.       │
   └────────────────────┘   └────────────────────────────┘   └──────────────────────┘
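The three steps can be walked through with `sqlite3`, here renaming a column from `fullname` to `display_name`. The table and column names are hypothetical, and in production the three steps would be separate deployments spread over days or weeks, not one script.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT);
    INSERT INTO users (fullname) VALUES ('Ada Lovelace'), ('Alan Turing');
""")

# 1. Expand: add the new column; old code keeps using `fullname` untouched.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# 2. Migrate: new code writes both columns...
def create_user(conn, name):
    with conn:
        conn.execute(
            "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
        )


create_user(db, "Grace Hopper")

# ...while a background job backfills rows written before the expand step.
with db:
    db.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

missing = db.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]

# 3. Contract: only once nothing reads `fullname` anymore.
# DROP COLUMN needs SQLite 3.35 or newer, hence the version guard.
if missing == 0 and sqlite3.sqlite_version_info >= (3, 35, 0):
    db.execute("ALTER TABLE users DROP COLUMN fullname")

names = [r[0] for r in db.execute("SELECT display_name FROM users ORDER BY id")]
```

At every point in this sequence, code that only knows `fullname` and code that only knows `display_name` can both run, right up until the contract step.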

The whole process is supported by canary migration: first route 1% of traffic to the new path, watch the monitoring, then 10%, 50%, 100% once it's fine. This way, even if the new structure has a bug, only a small handful blows up, and you can roll back to the old path in seconds.
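One common way to implement the percentage split is deterministic hashing, sketched below. The `in_canary` function is a hypothetical helper, and hashing the user ID is an assumption; some systems bucket by request or tenant instead.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user into one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent   # buckets 0..percent-1 take the new path


users = [f"user-{i}" for i in range(10_000)]
share_at_10 = sum(in_canary(u, 10) for u in users) / len(users)


def handle(user_id: str, percent: int) -> str:
    return "new path" if in_canary(user_id, percent) else "old path"
```

Because the bucket depends only on the user ID, the same user always lands on the same path, and raising the percentage from 1 to 10 only adds users to the canary. Rolling back is just setting `percent` to 0.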

Architectural wisdom: Evolvability is essentially "never making a destructive change in one fell swoop." Break a dangerous "in-place surgery" into three small surgeries of "expand → dual-write migrate → contract," each keeping old and new compatible and the system running. This is of a piece with 08 · Architecture Decision Records and Evolution: good architecture isn't designed right once, it's something that can change safely, one small step at a time, without downtime or data loss. In an event-sourcing system this point is especially deadly — you can't delete old events, and the new code must forever recognize every version of the old events, so "how event schemas evolve" must be thought through on day one.


📌 Real Case: DoorDash Uses Cadence to Backstop "Lost Events"

DoorDash's delivery business (Drive) early on relied on an event-driven approach to string together its dispatch flow: one action emits an event, downstream services listen and carry on. This is precisely the textbook "choreographed Saga" from Section 2: low coupling, easy to extend. But it ran straight into the hard problems of this chapter, and the fix is instructive:

  1. Lose one event, and the whole flow "jams halfway": event delivery is at-least-once at best, and in practice an event can still go missing (for example, through the dual-write window from Section 3). The moment some key event isn't delivered, or some consumer crashes mid-processing, that delivery stops in an intermediate state that no one advances and no one compensates. This is exactly the "dual write / partial failure" pain that Sections 2 and 3 hammered on. The weakness of pure choreography is laid bare too: the flow is scattered across each service's listeners, and no one can see at a glance "where exactly this delivery is stuck, and why it stopped moving."
  2. Their solution: bring in Cadence as a "durable workflow" backstop. DoorDash put Drive's delivery-creation flow onto Cadence (the workflow orchestration engine Uber open-sourced in 2017, later donated to the CNCF), as a backstop to the main event stream. The essence of engines like Cadence is exactly turning Section 2's Saga orchestrator into platform-level infrastructure: it persists the state of every step of the workflow, automatically retries a step on failure, and continues from the last checkpoint when a process crashes, rather than starting over or simply losing the work.
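The three behaviours named above (persist the state of every step, retry on failure, resume from the last checkpoint) can be illustrated with a toy runner. This is a minimal sketch under stated assumptions, not Cadence's or Temporal's API: the checkpoint is a local JSON file, and `run_workflow`, `load_state`, and `save_state` are hypothetical names.

```python
import json
import os
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], Dict]]


def load_state(path: str) -> Dict:
    """Load the checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"done": [], "ctx": {}}


def save_state(path: str, state: Dict) -> None:
    """Write the checkpoint to a temp file, then swap it in with a rename."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_workflow(path: str, steps: List[Step], max_attempts: int = 3) -> Dict:
    """Run steps in order, skipping any that a previous run already completed."""
    state = load_state(path)
    for name, fn in steps:
        if name in state["done"]:
            continue  # finished before the last crash; do not redo it
        for attempt in range(1, max_attempts + 1):
            try:
                state["ctx"].update(fn(state["ctx"]))
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; the checkpoint still holds earlier progress
        state["done"].append(name)
        save_state(path, state)
    return state["ctx"]
```

Note the gap this sketch leaves open: a step can finish and the process can crash before `save_state` runs, so that step will execute again on resume. That is why durable workflows still need idempotent steps.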

The lesson maps precisely onto this chapter: pure choreography (event-driven) falls short when "there are many steps, you need traceability, and you need a backstop"; the moment some step might fail with no one to compensate, you need an orchestration layer that "remembers progress and withstands crashes." This is the "Saga orchestration + idempotent retries + durable state" trio moving from "pattern" to "platform" in the real world, and it explains why companies like Uber, DoorDash, and Netflix all independently built or adopted durable workflow engines (Cadence / Temporal).

📎 DoorDash Engineering Blog: Building Reliable Workflows: Cadence as a Fallback for Event-Driven Processing · Uber Official Blog: open-source orchestration tool Cadence

Another primary source worth remembering: Jay Kreps's The Log, written at LinkedIn. Its idea that "the log is the truth" is the shared intellectual root of Outbox, CDC, and event sourcing. LinkedIn later ran trillions of messages a day on Kafka (Confluent's figures), turning "an append-only, never-modified log as the source of truth between systems" from an essay into industry infrastructure.


🤖 The AI / Vibe-Coding Lens

Connect this entire chapter to the present moment and you'll find a striking correspondence:

An AI agent's "multi-step tool calls" are, in essence, a distributed Saga.

   An AI agent helps you "book a business trip":
   ① call the flight-booking tool (charge money, issue ticket) ──▶ ② call the hotel-booking tool (charge money, place order) ──▶ ③ call the car-rental tool ✗ fails
        every step has a real side effect                            every step may time out / duplicate                        on failure, what about the first two steps?
                                                                                                                                  the flight and hotel are already booked!
  • Every step has side effects and may fail → these are Saga's steps; on failure you need compensation (refund the ticket, refund the hotel), not pretending nothing happened.
  • Tool calls will time out, will be retried by the framework → this is at-least-once; so every tool call must be idempotent (the same tool_call_id executed repeatedly must not actually charge twice or send two emails).
  • An agent's long task restarts mid-process → this is partial failure; you need to persist its state so it can resume from the last checkpoint, the same principle as DoorDash backstopping with Cadence. The long-task state and checkpoint recovery of the AI Agent / Workflow Platform are a living template of this chapter, and the idempotent charging of the payment system is the last line of defense when an agent calls a payment tool.
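The first two bullets above (compensation on failure, idempotency by `tool_call_id`) can be sketched together. This is a minimal sketch, assuming results are cached in an in-memory dict; a real system would use a durable dedup table. `IdempotentTools` and `run_saga` are hypothetical names, not any agent framework's API.

```python
from typing import Any, Callable, Dict, List, Tuple


class IdempotentTools:
    """Remembers results by tool_call_id so a retried call has no second side effect."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}  # assumption: in-memory, for illustration

    def call(self, tool_call_id: str, fn: Callable[[], Any]) -> Any:
        if tool_call_id in self._results:
            return self._results[tool_call_id]  # replay the stored result
        result = fn()
        self._results[tool_call_id] = result
        return result


# (tool_call_id, action, compensation)
SagaStep = Tuple[str, Callable[[], Any], Callable[[], None]]


def run_saga(tools: IdempotentTools, steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        call_id, action, _ = step
        try:
            tools.call(call_id, action)
        except Exception:
            for _, _, compensate in reversed(completed):
                compensate()  # e.g. refund the hotel, then refund the flight
            return False
        completed.append(step)
    return True
```

In the business-trip scenario, a failing car-rental step would trigger the hotel refund and then the flight refund, in that order. The sketch assumes compensations themselves succeed; in practice they also need retries and idempotency.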

And this lays bare the deepest hurdle of the vibe-coding era:

AI can generate the "happy path" in seconds — a flowing string of await tool_a(); await tool_b(); await tool_c();. But it almost never brings its own compensation, idempotency, or outbox. You ask it to "write an order-placement flow," and it gives you the rosy world where "all three steps succeed"; you have to press it yourself: "if the third step fails, how do the first two compensate? Are these calls idempotent? Is sending the message and writing the database a dual write?" — these questions AI won't think of for you proactively, because it's optimizing for "make the demo run," not "keep the data correct through failures."

And that layer of judgment, "keeping data correct through failures," is precisely the ability that the human architect of this era most needs to shore up, and the one hardest to replace. Implementation gets ever cheaper (AI can write the code for Saga, Outbox, and idempotency), but the judgment calls remain yours: should there be compensation here, how much inconsistency can be tolerated, which step must be idempotent. The cost of getting them wrong is borne by your business. This through-line is of a piece with [10].


🎯 Pop Quiz

🤔 An order service finishes handling a request and must both write the order into its own database and send an "order created" event to Kafka for downstream consumers. How do you avoid losing the event or processing it twice?
  • A. Commit the database write first, then immediately send the message; wrapping both steps in retries is enough
  • B. Write the order and a pending message into two tables of the same database in one local transaction, then have an independent relay read the pending messages and send them, with downstream consumers processing idempotently
  • C. Use a two-phase commit transaction spanning the database and the message broker to forcibly bind the two together

Chapter Summary

  • Core thesis: once microservices give you "one database per service," that cross-service transaction boundary that let you sleep soundly is gone. Don't try to bring strong transactions back (2PC blocks synchronously, has a single-point coordinator, is at odds with availability); switch to a set of eventually-consistent engineering techniques that can tolerate intermediate states.
  • Saga: split a big transaction into "a chain of local transactions + compensation on failure." Compensation is not a rollback (history can't be erased, you can only operate in reverse). Orchestration (central conductor, good monitoring) vs. choreography (event relay, low coupling): many steps and a need for traceability → orchestration; few steps and a pursuit of loose coupling → choreography.
  • The dual-write problem → the Transactional Outbox: "changing the database" and "sending the message" can't be in the same transaction, so write the "pending message" into an outbox table in the same local transaction too, then deliver via polling or CDC (Debezium). It's at-least-once, so it must be paired with idempotency.
  • Idempotency is the cornerstone of distributed correctness: idempotency key + dedup table + idempotent consumer. "Retry" and "idempotency" must appear together, otherwise automatic retries are a "random double-charging machine."
  • Event sourcing: store "what events happened" rather than "what the state is now" (analogous to double-entry bookkeeping and Git history). The upside is complete audit, replayability, time travel; the cost is hard queries, hard event-schema evolution, and reliance on snapshots. Don't use it just because it's "advanced"; first ask whether you really need the complete history.
  • CQRS: separate read and write models, with the read model projected from events and eventually consistent. The scenarios where it's worth it are narrow (read and write loads severely asymmetric, query shapes extremely diverse); for the vast majority of CRUD, using it is over-engineering.
  • Contract / schema evolution: backward + forward compatibility, with data migration following expand-contract (expand → dual-write migrate → contract) + canary. Evolvability = never making a destructive change in one fell swoop.
  • The AI-era through-line: an AI agent's multi-step tool calls are a distributed Saga, and vibe coding gives you only the "happy path," never bringing its own compensation/idempotency/outbox — the judgment of "keeping data correct through failures" is precisely the layer the human architect must shore up.
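As a compact recap of the Outbox and idempotency bullets above, here is a minimal sketch using SQLite from Python's standard library. The table layout, the `event_id` format, and the names `place_order`, `relay_once`, and `IdempotentConsumer` are assumptions for illustration; the `publish` callback stands in for a real broker client, and the consumer's dedup set would be a durable table in practice.

```python
import json
import sqlite3
from typing import Callable, List, Set


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, sku TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id TEXT NOT NULL UNIQUE,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn: sqlite3.Connection, order_id: str, sku: str) -> None:
    """Write the order and its pending event in ONE local transaction."""
    payload = json.dumps({"order_id": order_id, "sku": sku})
    with conn:  # commits both rows, or rolls both back on error
        conn.execute("INSERT INTO orders (id, sku) VALUES (?, ?)", (order_id, sku))
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"order-created:{order_id}", payload),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish unsent rows, then mark them sent. A crash between the two steps
    causes a re-send on the next pass, so delivery is at-least-once."""
    rows = conn.execute(
        "SELECT id, event_id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_id, payload in rows:
        publish(event_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


class IdempotentConsumer:
    """Drops events whose event_id has already been processed."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()  # assumption: a dedup table in real systems
        self.handled: List[dict] = []

    def handle(self, event_id: str, payload: str) -> None:
        if event_id in self.seen:
            return
        self.seen.add(event_id)
        self.handled.append(json.loads(payload))
```

A relay loop would call `relay_once(conn, consumer.handle)` periodically (polling); CDC tools such as Debezium replace that polling by reading the database log instead.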

Bridging forward: in this chapter we learned to get data right in a world without cross-service transactions — but the premise of "getting it right" is that the system must first withstand failure. The next chapter (Advanced track, Chapter 3), 12 · Designing for Failure: Resilience Engineering, moves from "data is correct" to "the system doesn't fall": timeouts, retries, circuit breakers, bulkheads, graceful degradation, chaos engineering — when failure is destined to come, how do you make the system bow gracefully rather than collapse with a crash.
