Skip to content

10 · Distributed Systems — The Hard Truths: Partial Failure, Time, and Consensus

The thesis in one line: a single machine quietly hands you three luxuries — an operation either succeeds or fails, the whole world shares one "now," and a function call always arrives. Cross over to a second machine and all three vanish. The plain-words CAP of the foundations track was just the doorway; this chapter digs down to the truly jagged rock: where distribution is actually hard — and, in an age where AI writes your implementation, which calls still have to be made by you.


🧭 The advanced track starts here. The foundations track (01–09) teaches you to read a system and design a small-to-medium one from scratch. From this chapter on, the advanced track deals with a different class of problem — the hard rock that only bares its teeth once a system gets big or critical: distribution, failure, scale, evolution, organization.

Don't tense up: this chapter is "old friends, one layer deeper," not a brand-new universe. Remember the example from 05 · Data & State: "like counts can be eventually consistent; ATM balances must be strongly consistent"? Remember the plain-words CAP line: "when the network splits, consistency and availability can't both win"? This chapter simply turns those lines over and asks what traps sit underneath them — why consistency is only "eventual," why strong consistency is so expensive, and what "the network splits" actually means. You're already standing at the doorway; we're just walking one layer further in.

This is exactly the skill that pays in the AI era. AI can spit out a working Raft implementation in seconds, but "do we need consensus here, how much inconsistency can we tolerate, when the network splits do we want to be correct or online" — these are judgment calls whose cost your business bears, and AI can't give you a canned answer, because the answer depends on what you're willing to trade for what. Implementation gets cheaper; judgment gets more valuable. This thread runs through the whole advanced track.


1. The single machine's three luxuries — gone the moment you cross to a second machine

When you write single-machine code, three things feel so obvious you never notice they're luxuries:

   Single-machine world (three luxuries)   Distributed world (all three gone)
   ─────────────────────────────────       ──────────────────────────────────
   ① a function call always arrives        ① the network is unreliable: drops, reorders,
                                              duplicates, delays, partitions
   ② the whole world shares one "now"       ② no global clock: every machine's watch differs;
      (one clock)                              "at the same time" is an illusion
   ③ it either succeeds or fails            ③ partial failure: some nodes succeeded, some failed,
                                              and some — you don't know if they're dead or alive

The first two are easy to grasp. The third — partial failure — is the root of all evil. On one machine an operation's outcome is binary: success, or failure. In a distributed system there's a deadly third state — "don't know":

You send a request to another machine and wait two seconds with no reply. Did it not receive it? Receive it but still computing? Finish, but the response got lost? Or has the whole machine crashed? — You cannot tell "it's dead" apart from "it's just slow." This is called gray failure.

Your only weapon is the timeout: wait long enough with no reply, treat it as failed. But a timeout is fundamentally a guess — guess too short and you mislabel a merely-slow node as dead (then retry, making things worse); guess too long and failure recovery is sluggish. Much of distributed reliability design is, in essence, a wrestling match with this ghost of "can't tell dead from slow."

Architectural wisdom: in the single-machine era you write code assuming "the call will succeed"; in the distributed era flip it — assume every cross-network call may fail, time out, or be delivered twice, then ask: what happens to my system then? This one mental flip is the watershed between the foundations and the advanced track.


2. Consistency isn't a switch, it's a spectrum (digging deeper into 05)

In 05 · Data & State we stretched consistency into a line from "strong" to "eventual." Now zoom into the middle — it's actually several rungs, each with a different price (latency and availability):

   expensive ◀──────────────────────────────────────────▶ cheap
   Linearizable        Sequential     Causal           Eventual
 "everyone instantly  "everyone sees  "causally related "eventually agrees,
  sees the same        the same order, events keep their  in the window each
  latest value, as if  but maybe not   order; unrelated   sees its own"
  one machine"         real-time"      ones, whatever"
   ↑ ATM withdrawal     ↑ config push   ↑ chat msgs/replies ↑ like counts/views
  • Linearizable: the strongest. Anyone reading at any node instantly reads the most recent write, as if the system were a single machine. Highest cost — it needs inter-node coordination, and both latency and availability must give way to it.
  • Causal: a very practical middle rung. Causally related operations (you replied to my message) appear in the same order to everyone; concurrent unrelated operations (two strangers each posting) can be in any order. Far cheaper than linearizable, yet it avoids the absurdity of "seeing the reply before the original post."
  • Eventual: the weakest and cheapest. After a write, replicas catch up over time; in the window, each may see its own version.

The key insight: strong consistency isn't "better," it's "more expensive." Each rung you climb leftward, you pay for it in latency and availability. The architectural call isn't "be as strong as possible," it's "how strong does this data deserve to be" — an ATM balance deserves linearizable; chat messages are fine with causal; a like count finds even eventual consistency lavish. Spend your precious strong-consistency budget where things actually break.


3. CAP is just the opening line; PACELC is the bill you pay every day

05 gave the plain-words CAP: a network partition (P) will happen sooner or later, and when it does you can only pick one of consistency (C) or availability (A). But CAP omits half the truth — most of the time the network is fine, so what are you paying then?

PACELC completes it:

   if (P, network partition) {       // happens occasionally
       pick one:  C or A             // be correct, or stay online?
   } else {                          // ★ 99% of the time you're here ★
       pick one:  L or C             // low latency, or strong consistency?
   }
  • During a partition (PAC): the net splits in two; you either refuse service to stay correct (pick C) or keep serving but maybe hand out stale data (pick A). This is the part CAP talks about.
  • When there's no partition (ELC): even with a perfectly healthy network, strong consistency still costs money — to make all replicas agree on a value, you wait for inter-node round trips, which is latency (L). Want it faster (lower L)? Loosen consistency (C).

Plain-English translation of those two lines: CAP can make you think "consistency only becomes a trade-off at extreme moments when the network is broken." PACELC wakes you up: even when the network is perfectly healthy — the 99% of days where nothing dramatic is happening — every write, every choice of "should we wait for every replica to confirm," is already trading latency for consistency. This is not an emergency insurance premium you pay only during incidents; it's the utility bill you pay every day.

The judgment point: CAP can mislead you into thinking "consistency is only a trade-off during network faults." Wrong. PACELC reminds you: even in calm weather, every "should I wait for all replicas to confirm" choice trades latency for consistency. That's the bill you pay every single day.


4. There is no "now": what logical clocks buy you is "causal correctness"

💧 Optional deep dive (safe to skip on first read; the main line does not depend on it). This is the most "academic" part of distributed systems. You only need to take away one sentence: in a distributed system there is no single "now," so ordering two events is not about looking at clocks; it is about asking "which one caused which" (causality). As for the names of the three clock types below, most business systems will never need them. If this gets heavy, skip to Section 5 with no loss. The rest is for readers who want the full picture.

If every machine's physical clock drifts and network delay is unpredictable, then how does the statement "event A happened before B" count in a distributed system?

First feel how real the trap is: you and a friend both try to grab the same ticket on your own phones, and the two phone clocks may differ by tens of milliseconds; "who tapped first" can happen exactly inside those milliseconds. Comparing the two clock readings is unreliable — the clocks themselves don't agree.

The trick is to abandon physical time and use causality instead: don't ask "what time was it"; ask "did this happen because of that?" That's a logical clock:

   Node A:  event a1 ──send msg m──▶

   Node B:  event b1 ──────────────── ▶ receive m ── event b2

        The logical clock guarantees: a1 (send) must order before b2 (after receive),
        because b2 "causally" depends on m, and m came from a1.
        As for a1 vs b1 — no causal link between them, so it's "concurrent," not ordered.

The key move is surprisingly simple: every message carries the sender's current counter; the receiver moves its own counter forward to be greater than that value. That pins "send" before "receive" and preserves causal order. On top of this simple idea, there are three increasingly refined approaches:

  • Lamport clock: puts every event into one total line (everyone can say which came first).
    • Analogy: like forcing every chat message to get a unique server-side sequence number. The downside: it can't tell whether "B truly replied to A" or "A and B just happened to post independently at the same time." It guarantees you won't see the reply before the original, but it also forces unrelated events into an artificial order.
  • Vector clock: goes further — it can identify which events are actually concurrent and unrelated, which is how it detects "write conflicts."
    • Analogy: if two people edit the same line in the same document at the same time, a vector clock can see that neither edit knew about the other. The system then knows "these changes collided; someone must decide how to merge," instead of blindly overwriting one based on whichever wall-clock timestamp looks later.
  • Hybrid Logical Clock (HLC): stitches "physical clock time" and a "logical counter" together, staying close to real time while preserving causal order. Modern distributed databases use it for "a consistent snapshot at some point."
    • In one sentence: pure logical clocks don't look like real time, which is awkward for debugging; HLC makes the value still look like a timestamp while keeping causality safe.

The judgment point: most business systems don't need logical clocks — don't add one to look sophisticated. But the moment your system must guarantee "causal correctness," it's unavoidable: the merge order of concurrent edits in a Collaborative Doc, message ordering in Realtime Chat, consistent snapshots in a distributed database… they're all the same problem: "how to establish order in a world with no global clock."


5. Consensus: the most expensive coordination there is — never overuse it

Sometimes you really do need a group of machines to reach rock-solid agreement on something: who is the primary? What is entry #100 of this replication log? Who holds this distributed lock right now? That's the consensus problem, and Raft / Paxos is what solves it.

What consensus buys is hardcore: even if some nodes crash or get cut off, the surviving majority can still reach a single, consistent decision on "one value / the order of one log." It is the source of "a single authoritative truth" in the distributed world.

As a everyday analogy: consensus is like a group meeting where every decision only counts after more than half the room raises a hand. The upside is stability — even if a few people temporarily leave (nodes crash), as long as you still have a majority, the meeting can decide, and you don't get two groups passing opposite resolutions. The cost is just as plain: every decision has to wait for a majority to hear it and respond; the more people there are, and the farther apart they sit (across data centers), the slower that round becomes. Consensus is for the important board meeting, not for every casual group chat.

But it's very expensive:

   Every decision goes through a round of "majority voting":
   proposer ──▶ [node1][node2][node3][node4][node5]
                   ╲    │    │    ╱
                    wait for > half to confirm before it counts (this round trip = latency)
   • at least 3 or 5 nodes (to tolerate 1~2 down)   • write throughput bounded by a single leader
   • one extra network round trip per write          • membership changes are subtle and error-prone

The judgment point (most important in this section): consensus is expensive; use it only in the few places that "must have a single authoritative order" — leader election, cluster metadata, distributed locks, replication logs / state machines. The vast majority of business data should not run through consensus directly. A system that crams every business write into a consensus group is paying a steep latency-and-throughput price for a "globally unique order" it doesn't need.

❌ Anti-pattern: using a Raft cluster as a general-purpose database. ✅ Right: let consensus govern only that tiny sliver of metadata where "there can be only one answer," and let business data shard where it should and be eventually consistent where it can.


6. "Exactly-once" is an illusion: hand correctness to idempotency

Newcomers love to ask: "how does a messaging system guarantee exactly-once delivery?" The brutal truth: the delivery layer can't. Because the network drops and duplicates, once you send a message and don't get an ack, you have only two choices:

   resend (risk a duplicate)   ──▶  at-least-once: never lost, but maybe duplicated
   don't resend (risk a loss)  ──▶  at-most-once:  never duplicated, but maybe lost
   "exactly-once" as a delivery guarantee — does not exist.

What you can achieve is to solve it somewhere else:

at-least-once delivery (allow duplicates) + idempotent consumer (a duplicate still takes effect only once) = "exactly-once" in business effect.

"Idempotent" means: running the same operation once or ten times yields the same result. The usual implementation gives each operation an idempotency key, and the consumer keeps a dedup table recording "this key has been handled," skipping it on a repeat.

The judgment point: don't chase the phantom of exactly-once at the transport layer; push correctness down into idempotent design at the consumer. The idempotent charge in a Payment System and the dedup/throttling in a Notification System are living examples of this move.

This "at-least-once + idempotency" is the setup for the next chapter — it's the bedrock under Saga, Outbox, and event sourcing, the engineering techniques for "getting data right in a distributed world."


📌 Real case: how a 43-second network partition threw GitHub into chaos for 24 hours

On October 21, 2018, routine work to replace optical equipment at GitHub left the primary US East Coast data center cut off from the US East Coast network hub for 43 seconds — just 43 seconds. But those 43 seconds hit every jagged rock in this chapter precisely:

  1. Partial failure + failure detection: during the cutoff, the East Coast data center kept accepting writes (it wasn't dead, just isolated); meanwhile Orchestrator — the Raft-consensus-based manager of database topology — judged "East Coast unreachable" as "East Coast primary unavailable." This is exactly gray failure: can't tell dead from slow.
  2. Consensus, double-edged: the West Coast and public-cloud Orchestrator nodes formed a majority and, per Raft consensus, automatically failed the primaries over to the West Coast, directing writes there. Everything executed "correctly, as configured."
  3. No global truth, so the data forked: 43 seconds later the network recovered, and both coasts had each accepted writes — becoming "two clusters, each believing it's the primary and each holding data the other lacks." Auto-reconciliation risked data loss, so it had to be repaired by hand — degraded service for a full 24 hours and 11 minutes.

GitHub's number-one remediation afterward was: forbid Orchestrator from promoting primaries across regions. In other words — an automated failover mechanism designed for "single-node failure" made, in the face of a "whole-region partition," a decision that the application tier simply couldn't survive, and it did so "correctly."

The lesson maps precisely onto this chapter: partial failure means you can't tell dead from slow; the absence of a global clock makes failure detection inevitably mis-fire; consensus reaches agreement among a majority, yet during a partition it can "correctly" steer the system into a catastrophic topology. The more powerful the automation, the harder you must think about what it will "correctly" do in the extreme case.

📎 GitHub's official write-up: October 21 post-incident analysis


🤖 Why this chapter is not "obsolete low-level detail" in the AI era

You might think: isn't all this the low-level stuff AI can write for me? Quite the opposite —

  • AI writes the implementation; you make the call. AI can generate the code for Raft, retries, idempotent dedup; but "do we need consensus here, how much inconsistency can we tolerate, C or A during a partition" are judgment calls whose cost your business bears. Being able to write it ≠ being able to choose it.
  • AI-native systems are themselves heavily distributed. Look at the just-shipped Agent templates: the long-running tasks of an AI Agent / Workflow Platform must be resumable (partial failure), its tool calls must be idempotent (at-least-once retries), its memory must stay consistent across steps; the concurrent subagents and checkpoint recovery of a coding agent like Claude Code also stand on this chapter's hard truths. LLMs stack one more layer of "nondeterminism" on top of distribution's "nondeterminism" — the hard truths only matter more, not less.

🎯 Quick check

🤔A team wants their messaging system to achieve exactly-once delivery. The soundest architectural judgment is?
  • AUse a more reliable protocol at the transport layer to force exactly-once
  • BAccept at-least-once delivery and make the consumer idempotent, deduping by an idempotency key
  • CSwitch to a strongly consistent database for all messages so nothing duplicates

Chapter summary

  • Core claim: distribution is hard because the single machine's three luxuries are gone — the network is unreliable, there is no global clock, and failure is partial. Partial failure is the deadliest, because it brings the gray failure of "can't tell whether the other side is dead or just slow," and your only weapon (the timeout) is merely a guess.
  • Consistency is a spectrum with a price list: linearizable → sequential → causal → eventual; each rung stronger raises the latency-and-availability bill. The call isn't "as strong as possible," it's "how strong does this data deserve to be."
  • PACELC is more complete than CAP: during a partition, choose between C/A; with no partition (99% of the time) you still choose between L/C — strong consistency costs latency even in calm weather.
  • No global clock, so use causality: logical clocks (Lamport / vector / HLC) buy "causal correctness." Most systems don't need them, but collaboration, ordering, and snapshots can't avoid them.
  • Consensus is the most expensive; don't overuse it: Raft/Paxos gives you "a single authoritative truth," at the cost of majority round trips, a node floor, and bounded throughput. Use it only for leader election, metadata, locks, replication logs — where "there can be only one" globally.
  • Exactly-once is an illusion: the delivery layer is only at-least-once or at-most-once; at-least-once + an idempotent consumer = exactly-once in effect.
  • The AI-era thread: implementation keeps getting cheaper, while the judgment about choices and failure boundaries gets more valuable — and AI-native systems are themselves heavily distributed, so these hard truths only matter more.

Bridging ahead: this chapter laid out the "pathology" of distribution — why things go chaotic, get lost, fork. The next chapter (advanced track, part 2), The Engineering of Data Consistency, picks up from this chapter's closing "at-least-once + idempotency" and walks into how you actually get data right in a world without cross-service transactions: Saga, Outbox, idempotent dedup, and putting event sourcing and CQRS into practice.

💬 Comments