14 · Evolving & Splitting Large Systems: Changing the Engine on a Plane in Flight

The thesis in one line: when a system runs real business, can't be stopped, and still changes every day, you will never get the chance to "stop it, tear it down, rewrite it once, and ship the new one." All the real craft lies in "changing the engine while the plane keeps flying." The previous class of problems was "how does a system avoid disasters" (distribution, failure, scale). This chapter is a different class: the system is already huge and critical, so how do you reshape it into something better without letting it crash?

🧭 Picking up right after 08 · Architecture Decision Records & Evolution. Chapter 8 told you that architecture grows up, that you must leave seams, and when to upgrade. That was the worldview of evolution; this chapter gives you the craft of evolution: when the system is already a behemoth running in production, exactly which few moves let you fix it, split it, and replace it bit by bit, with zero downtime and the ability to roll back.
This is also one of the most counterintuitive throughlines of the AI era. AI can generate a "looks cleaner" new implementation in seconds, and vibe coding can let you pile up a working prototype in half a day — but "taking a working pile of stuff and, while it keeps serving, keeps making money, and its requirements keep changing, evolving it without downtime into a maintainable system" is something AI cannot hand you, because what it needs isn't code, it's continuous judgment about "where to move first, how to hedge, when to switch, and how to back out when things break." Implementation gets ever cheaper; steering gets ever more valuable.

1. Why "tear it down and rewrite" is almost doomed: you're chasing a moving target

Every engineer who has inherited a mess has had the same thought cross their mind: "this code is hopeless, better to tear it down and start over." This thought is almost always wrong — not because the old code is any good, but because the act of rewriting itself simply cannot win, as a matter of physics.

   Why the "big rewrite" is a trap:

   Timeline ──────────────────────────────────────────────▶

   Old system:  v1 ── edit ── edit ── fix bug ── add feature ── edit ── …  (always moving!)
                                                              ▲
   New system:        written from 0 ─────────────────────…   has to catch up to here
                      (you think the target is here, but it keeps running)

   By the time your new system finally matches the old one's "version back then,"
   the old system has long since run off elsewhere — you're always a stretch behind,
   and that stretch keeps growing.

The core of the problem is: the old system won't stop and wait for you. While you rewrite it, others are still fixing its bugs and adding features — the target is moving. You think all you have to do is replicate the existing features once, but "existing features" is a living thing that keeps changing; worse still, in that code that looks dirty and weird, there is tacit knowledge — bought with years of stepping on landmines and written down nowhere (this continues the opening of [08], "we deleted the dual-write, and three weeks later it blew up"): that one bizarre if exists to dodge some customer's dirty data, that retry block exists to absorb some third party's flakiness. Rewrite from scratch, and all that blood and tears resets to zero — you'll step on the same landmines all over again.

Architectural wisdom: the biggest cost of a rewrite is not "writing the code again," but "stepping on all the landmines again — and freezing all your progress while you do it." Almost every ugly scar on the old system is a scab over a past production incident. A big rewrite means tearing off all the scars while shooting at a moving target, while demanding that your competitors stand still and wait for you — three impossibilities stacked on top of each other.

So this chapter has only one main road: don't tear down, only evolve incrementally — let old and new coexist, let the old shrink piece by piece while the new grows piece by piece, with every step stoppable, rollback-able, and the system always online. The six sections below are all the concrete craft along this one road.

2. The Strangler Fig pattern: intercept around the old system and let it slowly get "strangled" offline

In 2001, Martin Fowler saw a plant in an Australian rainforest: the strangler fig. Its seed lands in the upper branches of a host tree and sprouts; its roots work their way down the trunk into the soil while its foliage spreads upward to grab the sunlight. Over many years the fig becomes self-sufficient, while the host tree it grew on withers and dies. He said: this is what reshaping a large legacy system ought to look like.

   Strangler Fig pattern: add a facade/routing layer "outside" the old system,
   and gradually steer traffic toward the new implementation

         ┌──────────────────────────────────────────┐
   req ─▶│   Facade / Routing layer (Interceptor)    │  ← this layer is the key
         └───────┬──────────────────────┬───────────┘
                 │ not migrated yet      │ already migrated
                 ▼                      ▼
          ┌─────────────┐        ┌──────────────────┐
          │  Old monolith│        │  New service/impl │
          │ (shrinking)  │        │ (growing)         │
          └─────────────┘        └──────────────────┘

   Over time: the routing table's "go to the new impl" entries keep increasing,
            the old monolith gets hit less and less, until one day the old monolith
            receives not a single request — it has been "strangled," and is safely retired.

The fundamental difference from a big rewrite is: old and new always coexist, traffic is switched block by block, and every block you switch over is a small, rollback-able release — instead of hoarding up one giant version and betting your life on it.

The whole pattern's vital point is that "facade / routing" layer — every external request must pass through it first, and only then does it have the authority to decide "does this request go to the old or the new." This facade can be an API gateway, a reverse proxy (routing by URL prefix), or even a dispatch function inside the monolith. Without it, you don't even have the switch to "quietly cut some feature over to the new implementation."
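The routing layer described above can be sketched in a few lines. This is a minimal illustration, assuming routing by URL prefix; `StranglerFacade`, `legacy_monolith`, and `new_orders_service` are hypothetical names, not part of any real gateway or proxy.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_monolith(path: str) -> str:
    # Hypothetical stand-in for the old system.
    return f"legacy handled {path}"


def new_orders_service(path: str) -> str:
    # Hypothetical stand-in for a feature that has been reimplemented.
    return f"new orders service handled {path}"


class StranglerFacade:
    """Single entry point: the routing table decides old vs. new per request."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Cut one URL prefix over to the new implementation."""
        self._migrated[prefix.rstrip("/")] = handler

    def roll_back(self, prefix: str) -> None:
        """Send a prefix back to the old system (no-op if it was never migrated)."""
        self._migrated.pop(prefix.rstrip("/"), None)

    def handle(self, path: str) -> str:
        # Longest prefix first, so a sub-path can be migrated on its own.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return self._migrated[prefix](path)
        # Anything not listed is still served by the old system.
        return self._legacy(path)
```

Each `migrate` call is one small release and each `roll_back` call is its undo. The old system remains the default for everything the table does not mention.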

Architectural wisdom: the Strangler Fig pattern is not "fast"; it is often slow, and it requires maintaining two sets in parallel for a while — but it breaks "betting the whole system in one shot," a giant risk, into "switching one small block at a time," a controllable sequence of risks. You're not dueling the old system; you're slowly growing a new system on top of it, until the old one is unattended and withers naturally. Prerequisite: you can insert an interception layer around the old system. That layer is often where [08]'s "leave seams" gets cashed out during the reshaping period.

3. Branch by Abstraction: swap the implementation behind an abstraction layer, and keep the trunk always shippable

The Strangler Fig governs migration of "whole features around the system's periphery." But often what you need to replace is a core component inside the system, called from countless places — like swapping a homegrown cache layer for Redis, or moving a data-access layer from ORM-A to ORM-B. Such a thing is tangled up tightly with its callers; you can't just "intercept a layer outside it."

A beginner's instinct is: open a long-lived feature branch, hide inside it grinding away for two months, and merge back to the trunk when done. This is another classic trap — the longer the branch lives, the more it diverges from the trunk, and that final merge becomes a disaster (others have been changing the trunk for those two months too), and throughout the process the trunk simply can't ship your part.

Branch by Abstraction (Fowler / Paul Hammant) gives the opposite approach: don't open a long-lived branch; make all changes on the trunk, and let a layer of "abstraction" let old and new implementations coexist.

   Branch by Abstraction in five steps, all on the trunk,
   with the system always compilable and shippable:

   ① In front of "the component to be replaced," insert an abstraction (interface)
        caller ──▶ 【abstraction】──▶ old impl

   ② Make all callers depend on this abstraction (no longer calling the old impl directly)
        caller ──▶ 【abstraction】──▶ old impl   ← behavior is completely unchanged here

   ③ Behind the abstraction, write the new impl too (a switch/feature flag controls which runs)
        caller ──▶ 【abstraction】─┬─▶ old impl (default)
                                  └─▶ new impl (switch off, no traffic yet)

   ④ Gradually flip the switch to the new impl; flip back anytime on trouble (this is rollback)
        caller ──▶ 【abstraction】─┬─▶ old impl (gradually deprecated)
                                  └─▶ new impl (gradually ramped up)

   ⑤ Once the new impl is stable, delete the old impl; the abstraction can stay or go
        caller ──▶ 【abstraction】──▶ new impl

It is a twin sibling of the Strangler Fig: the Strangler Fig intercepts "outside" the system and swaps whole features; Branch by Abstraction works "inside" the system and swaps the implementation behind some abstraction. Their shared belief is exactly the same — the trunk is always in a shippable state, never betting on one long-lived big branch — which is precisely the spirit of continuous integration / trunk-based development.
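The five steps can be sketched for the cache example as follows. This is a sketch under stated assumptions: `Cache`, `HomegrownCache`, `NewCache`, and `SwitchingCache` are hypothetical names, `NewCache` is an in-memory stand-in rather than a real Redis client, and the percentage switch is one possible way to "gradually flip."

```python
import zlib
from typing import Dict, Optional, Protocol


class Cache(Protocol):
    """Steps 1-2: the abstraction that every caller depends on."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class HomegrownCache:
    """Old implementation (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class NewCache:
    """Step 3: new implementation; an in-memory stand-in for illustration."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SwitchingCache:
    """Step 4: a switch behind the abstraction; 0 means all old, 100 all new."""

    def __init__(self, old: Cache, new: Cache, new_percent: int = 0) -> None:
        self._old = old
        self._new = new
        self.new_percent = new_percent

    def _pick(self, key: str) -> Cache:
        # Stable bucket per key, so a given key does not flap between backends.
        bucket = zlib.crc32(key.encode("utf-8")) % 100
        return self._new if bucket < self.new_percent else self._old

    def get(self, key: str) -> Optional[str]:
        return self._pick(key).get(key)

    def set(self, key: str, value: str) -> None:
        self._pick(key).set(key, value)
```

Callers only ever see `Cache`, so raising or lowering `new_percent` changes no call site. Because this is a cache, a key that moves between backends simply misses once; a component holding authoritative data would need the migration discipline of section 5 instead.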

Architectural wisdom: "open a branch, change it slowly, merge when done" is the intuition, yet it's the most expensive anti-pattern in large-scale refactoring — because you're building a 'merge nuke,' and the fuse length is decided by other people. Branch by Abstraction inverts it: first spend the effort to stand up an abstraction as a "socket" that both old and new implementations can plug into, then switch over calmly on the trunk. The cost of writing one extra layer of abstraction buys you the security of "stoppable anytime, shippable anytime, revertible anytime."

4. Parallel Run / Dark Launch: run old and new at once, compare first, then cut traffic over

The Strangler Fig and Branch by Abstraction both give you a "switch." But before you flip it, on what basis do you trust that the new implementation is correct? Unit tests pass? Real production traffic is forever more devious than any test case you can dream up.

The most hardcore way to build confidence is called the parallel run: both old and new implementations run against the same batch of real requests at the same time; return the old implementation's result to the user (the user feels nothing) while quietly comparing the new implementation's result against the old one and recording the differences. This is also called a "dark launch" — the new code is already running in production, it's just that its output doesn't count yet.

   Parallel Run / Dark Launch: the new impl "runs alongside" first, comparing without taking effect

                 ┌──▶ old impl ──▶ result A ─────────────────▶ returned to user ✅
   real req ──┬──┤                            │
              │  └──▶ new impl ──▶ result B ──┘
              │                               ▼
              │                          compare A vs B
              │                               │
              │                 ┌─────────────┴─────────────┐
              │                 ▼                           ▼
              │            match → log one                mismatch → alert + write log
              │            (confidence +1)                (this is the landmine you didn't know about!)
              │
   Note: the user always gets result A from the old impl; even if the new impl is wrong, production is unaffected.
        Only once the "mismatch rate" drops low enough do you have the nerve to truly cut traffic to the new impl.

The most famous open-source implementation of this idea is GitHub's Scientist library (detailed in the real-world cases below). Its essence: the result of the control (the old code) is always returned to the user; the candidate (the new code) runs against the same input; the order in which the two run is randomized to expose ordering dependencies; exceptions thrown by the candidate are swallowed and recorded (never letting experimental code take production down); and finally it publishes "whether the two sides' results match, and how long each took" for analysis.
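The mechanism can be sketched in Python. This is an illustration of the idea only and is not Scientist's API (Scientist itself is a Ruby library); `run_experiment` and `Observation` are hypothetical names, and the sketch assumes the candidate has no side effects (it only computes a result).

```python
import random
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Observation:
    name: str
    matched: bool
    control_value: Any
    candidate_value: Any
    candidate_error: Optional[str]
    control_seconds: float
    candidate_seconds: float


def run_experiment(
    name: str,
    control: Callable[[], Any],
    candidate: Callable[[], Any],
    publish: Callable[[Observation], None],
) -> Any:
    """Run old and new code on the same request; always return the old result."""

    def timed_control() -> Tuple[Any, float]:
        start = time.perf_counter()
        value = control()  # control errors propagate, as they did before
        return value, time.perf_counter() - start

    def timed_candidate() -> Tuple[Any, Optional[str], float]:
        start = time.perf_counter()
        try:
            return candidate(), None, time.perf_counter() - start
        except Exception as exc:  # the candidate must never break production
            return None, repr(exc), time.perf_counter() - start

    # Randomize which side runs first.
    if random.random() < 0.5:
        control_value, control_secs = timed_control()
        cand_value, cand_error, cand_secs = timed_candidate()
    else:
        cand_value, cand_error, cand_secs = timed_candidate()
        control_value, control_secs = timed_control()

    observation = Observation(
        name=name,
        matched=cand_error is None and cand_value == control_value,
        control_value=control_value,
        candidate_value=cand_value,
        candidate_error=cand_error,
        control_seconds=control_secs,
        candidate_seconds=cand_secs,
    )
    try:
        publish(observation)
    except Exception:
        pass  # a reporting failure must not reach the user either
    return control_value
```

A call site would look like `run_experiment("permissions", lambda: old_check(user), lambda: new_check(user), metrics.append)`, where `old_check`, `new_check`, and `metrics` are placeholders for your own code.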

Architectural wisdom: the greatest fear when refactoring a critical path is "I assumed equivalence, but it isn't equivalent." The parallel run turns that fear into a set of quantifiable data — using real traffic as the judge, not your own confidence. It's especially well suited to those core computations that are "logically complex, catastrophically costly if wrong, and impossible to pin down exactly how many edge cases there are" (billing, permissions, risk control, pricing). The cost is running two sets at once with extra overhead, so it is a heavy weapon for critical paths, not an everyday tool for everywhere.
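"Quantifiable data" can be as simple as a running mismatch rate plus an explicit gate. The sketch below is an assumption about how such a gate might look; `ComparisonStats`, `ready_to_cut_over`, and the default thresholds are hypothetical, and the right numbers are a business judgment, not something the code can supply.

```python
from dataclasses import dataclass


@dataclass
class ComparisonStats:
    matches: int = 0
    mismatches: int = 0

    def record(self, matched: bool) -> None:
        if matched:
            self.matches += 1
        else:
            self.mismatches += 1

    @property
    def total(self) -> int:
        return self.matches + self.mismatches

    @property
    def mismatch_rate(self) -> float:
        # With no evidence yet, assume the worst rather than divide by zero.
        return self.mismatches / self.total if self.total else 1.0


def ready_to_cut_over(
    stats: ComparisonStats,
    max_mismatch_rate: float = 0.0001,  # hypothetical threshold
    min_samples: int = 100_000,  # hypothetical threshold
) -> bool:
    """Both conditions must hold: enough real traffic seen, and few enough mismatches."""
    return stats.total >= min_samples and stats.mismatch_rate <= max_mismatch_rate
```

The minimum sample count matters as much as the rate: a zero mismatch rate over fifty requests says very little about a billing path.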

5. Zero-downtime data migration: data is the hardest to change, so "expand–contract," with every step rollback-able

05 · Data & State long ago laid down an iron law: stateless things are easy to change; data is the hardest. Code can be blue-green switched and rolled back, but there is only one copy of the data, and once you corrupt it, it often can't be saved. So when evolution involves "switching databases / changing the table schema / splitting databases," you need a process that is almost ritually rigorous.

Its overall idea is the same as Branch by Abstraction above, and it matches the pattern known as Parallel Change (also called expand-contract, first presented by Joshua Kerievsky): expand first (let old and new coexist), then contract (delete the old once you've confirmed all is well), and any step in between can stop and back out:

   Zero-downtime data migration in five steps (expand → … → contract), the first four rollback-able:

   ① Dual write
      The app writes to both [old store] and [new store]. Reads still go to the old.
      old ◀── write ──┤ app ├── write ──▶ new
                       └─ read ─▶ old
      → Escape route: just stop writing to the new store; the old store is authoritative throughout.

   ② Backfill
      Batch-move the historical data that existed "before dual write began" into the new store.
      → Escape route: the backfill is a batch job that only writes the new store; abort and rerun anytime.

   ③ Shadow read / compare
      Read requests read both sides, and the result returned to the user is still the old store's;
      meanwhile, compare whether "what the new store reads out" matches "the old one" (this is the parallel run from the last section!).
      → Escape route: compare without cutting traffic; if the mismatch rate isn't up to standard, keep fixing — never cut.

   ④ Cut over read
      Once the match rate is high enough, cut "reads" to the new store. You're still dual-writing here.
      → Escape route: cut reads back to the old store, because it has been dual-written all along and is still fresh.

   ⑤ Contract (cleanup)
      Watch for a while until stable, then stop dual-writing to the old store, and finally retire the old store.
      → This is the only "irreversible" step, so put it last and leave a long enough observation window.

The most valuable design across the whole chain is that the old store keeps receiving every write through the first four steps, so it stays fresh and you can instantly back out of any of them. The only truly irreversible part is the final "cleanup," and by then the previous four steps have already built up your confidence.
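Steps ①, ③, and ④ can be sketched as one wrapper whose behavior depends on the current phase. This is a minimal sketch, assuming both stores behave like dictionaries; `Phase` and `MigratingStore` are hypothetical names, and partial-failure handling for the second write is deliberately left out.

```python
from enum import Enum
from typing import Any, Dict, List, Optional


class Phase(Enum):
    DUAL_WRITE = "dual_write"  # step 1: write both, read old
    SHADOW_READ = "shadow_read"  # step 3: read both, return old, compare
    READ_NEW = "read_new"  # step 4: read new, still write both


class MigratingStore:
    def __init__(
        self,
        old: Dict[str, Any],
        new: Dict[str, Any],
        phase: Phase = Phase.DUAL_WRITE,
    ) -> None:
        self.old = old
        self.new = new
        self.phase = phase
        self.mismatches: List[str] = []

    def write(self, key: str, value: Any) -> None:
        # Every phase writes both stores, which is what keeps the old one fresh.
        self.old[key] = value
        self.new[key] = value

    def read(self, key: str) -> Optional[Any]:
        if self.phase is Phase.READ_NEW:
            return self.new.get(key)
        value = self.old.get(key)
        if self.phase is Phase.SHADOW_READ and self.new.get(key) != value:
            # Record the difference; the user still gets the old store's answer.
            self.mismatches.append(key)
        return value
```

Backing out of any phase is a single assignment to `phase`, which is safe only because `write` never stopped updating the old store.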

Architectural wisdom: data-migration disasters almost all stem from "one big cut": stop the service some midnight, run a migration script, ship the change, and then pray. The right posture is to stretch it into a long chain of "dual write → backfill → shadow compare → cut read → cleanup," making "the old store is authoritative all along" the safety net you can jump onto anytime — until you cut it only at the last moment. This technique gets a more detailed engineering treatment in the dual-write / backfill of 11 · Data Consistency Engineering; it's the same essence.
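The backfill step is what makes "abort and rerun anytime" true, and that depends on the job being idempotent and resumable. The sketch below assumes dictionary-like stores with sortable string keys; `backfill_batches` is a hypothetical name, and a real database would need an atomic "insert if absent" in place of `setdefault`.

```python
from typing import Any, Dict, Iterator, Optional


def backfill_batches(
    old: Dict[str, Any],
    new: Dict[str, Any],
    batch_size: int = 1000,
    resume_after: Optional[str] = None,
) -> Iterator[str]:
    """Copy historical rows old -> new in key order, yielding a checkpoint per batch."""
    keys = sorted(k for k in old if resume_after is None or k > resume_after)
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        for key in batch:
            # Only fill gaps: a row already placed by dual write is left untouched.
            new.setdefault(key, old[key])
        # The last key of the batch is the checkpoint to persist for a later resume.
        yield batch[-1]
```

Rerunning from the start is harmless because existing rows are skipped, and passing the last checkpoint as `resume_after` avoids redoing finished work.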

6. Splitting the monolith: find the seams first, isolate with an anti-corruption layer, and "modularize the monolith first, split out services on demand"

The first five moves are "how to swap safely." This section answers a bigger strategic question: should a big monolith be split into microservices at all? And how?

First, a bucket of cold water. The mistake beginners make most easily is treating "splitting into microservices" as progress in itself — the big shops all split, so we split too. But 04 · The Ten Core Architecture Patterns said long ago: microservices are "the cure of the mature stage," not "the standard kit of the growth stage"; splitting too early only upgrades "function calls" into "network calls," shouldering all the suffering of distribution out of thin air (the hard truths of that entire chapter, [10]).

How do you find the "seams" worth splitting? The answer comes from DDD (Domain-Driven Design)'s Bounded Context — don't split by technical layers (that only yields anemic CRUD services); split by the natural boundaries of business capability: orders are one context, inventory another, billing another. Good service boundaries are the boundaries of business concepts, not the boundaries of database tables. (This also directly echoes [08]'s Conway's Law: if service boundaries don't line up with team boundaries, splitting is pointless.)

When you split, the old and new domain models are bound to clash — the concept of "order" in the old monolith may be dirty and muddled, and you don't want it polluting the clean model in the new service. The standard weapon to isolate it is the Anti-Corruption Layer (ACL):

   Anti-Corruption Layer (ACL): build a "translation + isolation" wall between the new service and the old system

   ┌────────────────┐     ┌─────────┐     ┌──────────────────────────┐
   │  New service    │────▶│   ACL    │────▶│  Old monolith (messy model)│
   │ (clean model)   │◀────│          │◀────│                          │
   └────────────────┘     └─────────┘     └──────────────────────────┘
                              ▲
              Translate the old system's dirty concepts into a form the new model can accept;
              the new service only ever deals with a "clean interface," and the rot of the old model can't seep in.
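In code, an ACL is often little more than a translation function that is the only place allowed to know the old model's quirks. The following is an illustrative sketch: the legacy field names (`ORD_NO`, `AMT_CENTS`, `STATUS_CD`) and status codes are invented for the example.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Any, Dict


class OrderStatus(Enum):
    PENDING = "pending"
    PAID = "paid"
    CANCELLED = "cancelled"


@dataclass(frozen=True)
class Order:
    """The clean model the new service works with."""

    order_id: str
    total: Decimal
    status: OrderStatus


# Hypothetical legacy status codes; the mapping lives only inside the ACL.
_LEGACY_STATUS = {
    "1": OrderStatus.PENDING,
    "2": OrderStatus.PAID,
    "X": OrderStatus.CANCELLED,
}


def order_from_legacy(row: Dict[str, Any]) -> Order:
    """Translate one legacy record; reject anything the new model cannot express."""
    code = str(row["STATUS_CD"]).strip().upper()
    if code not in _LEGACY_STATUS:
        raise ValueError(f"unknown legacy status code: {code!r}")
    return Order(
        order_id=str(row["ORD_NO"]).strip(),
        total=Decimal(str(row["AMT_CENTS"])) / 100,  # legacy stores integer cents
        status=_LEGACY_STATUS[code],
    )
```

Failing loudly on an unknown code is a design choice: it keeps undocumented legacy states from silently entering the new model.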

And the most critical strategic judgment is the "posture" of the split — don't start by splitting into a pile of independently deployed microservices, each with its own database, calling each other over the network:

   The correct order for monolith evolution (don't skip a grade):

   Big ball of mud  ──▶  Modular Monolith  ──▶  (only when there's a real bottleneck) extract a microservice on demand
   (no boundaries)       (one deployment unit,        (only the context that truly needs
                          but with enforced            independent scaling / independent
                          module boundaries inside)    release gets upgraded to a service)

   ① First draw the boundaries clearly "within one process" — changing a boundary is almost zero-cost; got it wrong? just redraw
   ② Once the boundaries have been pounded on by real business and stabilized, extract the block that "truly needs independent scaling/release"
   ③ The ones that don't need to be independent just stay in the modular monolith — that itself is a good home

The beauty of the Modular Monolith is: it decouples the two things "boundary design" and "distributed deployment." You first use an almost zero-cost approach (in-process module boundaries) to repeatedly polish "where to draw the seams," and only once the seams have been validated as stable by real business do you pay the price of distribution to extract the one or two blocks that "truly need independent scaling or independent release" into services. Take the most valuable part of microservices first (clear boundaries), and defer the most expensive part (distributed operations) until you absolutely must pay it.

Architectural wisdom: "to split into microservices or not" is a badly misframed question. The real question is "did you get the boundaries right" — get the boundaries right, and a modular monolith is good enough; get the boundaries wrong, and splitting into microservices only upgrades 'change one place, disturb many' into 'change one place, disturb many across the network.' So the discipline is: modularize the monolith first, get the seams right and stable; then, on demand and by bottleneck, extract services one block at a time. Service count is never the goal; it's a price paid for some concrete quality attribute (independent scaling, independent release, fault isolation) — go back and measure with that ruler from 06, don't count.

7. Fitness Functions: write architectural constraints as automated tests, so the system "grows up without rotting"

One last question: you've poured your effort into splitting the monolith into a beautiful modular structure and drawing good boundaries — how do you guarantee it won't, over the next few hundred commits, get kneaded back into a ball of mud bit by bit? [08] explained that technical debt accumulates unconsciously, and architectural boundaries are exactly the thing most easily violated on the quiet: one day someone takes a shortcut and has the "order module" directly import an internal class of the "billing module," and the boundary springs a leak — and then the leaks multiply.

Evolutionary Architecture (Neal Ford / Rebecca Parsons / Patrick Kua) gives this cure: the fitness function — write the architectural constraints you care about as a piece of code that runs automatically, can fail, and can block CI. Once an architectural rule can be continuously verified by machine, it goes from "a spec pinned to the wall, relying on self-discipline" to "a hard constraint that turns red and can't be merged if violated."

💧 Optional deep dive (the name sounds scarier than the idea): a "fitness function" is basically an automated health check plus gate for your architecture. Unit tests ask "does the feature behave correctly?"; fitness functions ask "were the architectural rules broken?" — for example, "did the order module secretly import billing internals?" If someone violates the rule, CI turns red and the change can't merge. It is the architecture's immune system: it doesn't stop the system from growing; it stops the system from rotting as it grows.
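The "order module secretly imports billing internals" rule can be checked with the standard-library `ast` module. This is a minimal sketch, assuming a Python codebase and a package named `billing.internal` (a hypothetical name); `forbidden_imports` is likewise an invented helper, not a tool from the chapter.

```python
import ast
from typing import List, Tuple


def forbidden_imports(source: str, forbidden_prefix: str) -> List[Tuple[int, str]]:
    """Return (line number, imported name) for each import under forbidden_prefix."""

    def is_forbidden(name: str) -> bool:
        return name == forbidden_prefix or name.startswith(forbidden_prefix + ".")

    violations: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Cover both "from billing.internal import x" and "from billing import internal".
            names = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
        else:
            continue
        hit = next((name for name in names if is_forbidden(name)), None)
        if hit is not None:
            violations.append((node.lineno, hit))
    return sorted(violations)
```

In CI, a test would read every file of the order module, call this function, and fail if the returned list is non-empty. Relative imports are ignored here for brevity.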

   Fitness function = give the architecture a "continuous health check" that alerts the moment a rule is broken

   Architectural constraint you care about         →   Written as an automated fitness function (into CI)
   ──────────────────────────────────────────         ──────────────────────────────────────────────────
   "order module must not depend on billing internals"  →  dependency check: spot this import → fail
   "no service's p99 may exceed 200ms"                  →  performance test: exceeded → fail
   "domain layer must not import framework/web layer"   →  layering check: violated → fail
   "no circular dependencies"                           →  dependency-graph analysis: a cycle appears → fail
   "the API must not introduce breaking changes"        →  contract test: incompatible → fail

   Run it on every commit → architectural rot is blocked "the moment a leak first appears," instead of discovered after it's grown into a giant hole.
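The "no circular dependencies" row reduces to cycle detection on the module dependency graph. The sketch below assumes the graph has already been extracted into a dictionary mapping each module to the modules it depends on; `find_cycle` is a hypothetical helper using a plain depth-first search.

```python
from typing import Dict, List, Optional, Set


def find_cycle(deps: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a path (first node repeated at the end), or None."""
    in_progress: Set[str] = set()
    done: Set[str] = set()
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        in_progress.add(node)
        stack.append(node)
        for dep in sorted(deps.get(node, ())):
            if dep in in_progress:
                # Reached a module that is still on the current path: a cycle.
                return stack[stack.index(dep):] + [dep]
            if dep not in done:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        in_progress.discard(node)
        done.add(node)
        return None

    for node in sorted(deps):
        if node not in done:
            found = visit(node)
            if found:
                return found
    return None
```

A CI check would assert `find_cycle(graph) is None` and print the returned path on failure, so the offending chain of modules is visible in the build log.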

This is precisely an automation of two of [08]'s ideas: it turns "evolutionary architecture must leave seams" from a slogan into a guardrail that "alerts the instant a seam is broken"; and it turns "technical debt must be booked" from after-the-fact regret into "want to take on this architectural debt? CI stops you first and forces you to confirm it explicitly."

Architectural wisdom: architecture is not a statue that sets once the diagram is drawn; it is a living system that needs continuous maintenance — and a "living" thing without an immune system will inevitably rot. The fitness function is the architecture's immune system: it doesn't stop the system from growing, it only stops the system from going bad as it grows. Without it, every boundary you painstakingly drew clear today is just waiting for some rushed afternoon when someone punches through it.

📌 Real-world cases: three classics of "changing the engine," and one famous "split then glued back together"

① GitHub Scientist — backing a critical-path refactor with confidence (the living model of the parallel run). GitHub once needed to refactor its permission-check critical path — code where a tiny error means "what should be seen isn't, and what shouldn't leak does," a high-stakes hazard. They didn't "change it and pray"; they did a parallel run, and abstracted the mechanism into the open-source library Scientist: the old permission logic's (control's) result was returned to the user as usual, the new logic (candidate) ran against the same request in the background, and whenever the two sides' results mismatched, it was recorded and alerted — but never affected production. Using the "mismatch list" fed by real traffic, they patched, one by one, those edge cases they hadn't even thought of, and only once the mismatch rate dropped low enough did they confidently cut traffic over. This is the precise embodiment of section 4 of this chapter. 📎 github/scientist · design article Move Fast and Fix Things

② Amazon — from the monolith Obidos to service orientation, reshaping the organization along the way. From 1996, Amazon.com was a big monolith called Obidos, with all display, recommendations, Listmania, and reviews mashed in together. By around 2001, this monolith couldn't hold up in the face of scale — "a hundred-some engineers all crammed into the same pile of code," nobody could move freely. Amazon launched a gradual migration toward Service-Oriented Architecture (SOA), slicing functionality into services that could be independently developed, deployed, and tested; and it invented the famous accompanying "two-pizza teams" — a team small enough to be fed by two pizzas (about 8–10 people), each fully owning one slice of service. This is a forward application of Conway's Law ([08]): to get services that evolve independently, first organize people into small squads that can take independent ownership. 📎 Amazon Architecture (High Scalability)

③ Netflix: a seven-year cloud migration, insisting on "re-architecting rather than hauling." In August 2008, Netflix's core Oracle monolith database got corrupted, causing three days of being unable to ship DVDs. Having learned the lesson the hard way, they decided to go to the cloud. This migration took a full seven years (2008 → the last billing system cut over in January 2016), and Netflix repeatedly stressed that they did not "lift-and-shift" (move to the cloud as-is) but refactored into microservices service by service. It is a textbook Strangler-Fig-style gradual migration, with new services growing out block by block and the old monolith shrinking block by block, rather than stopping the service to bet on one big version. 📎 Netflix: Completing the Netflix Cloud Migration

④ Segment's "Goodbye Microservices" — split too finely, then glued back into a monolith. This is the loudest wake-up call for all "microservice worship." In its early days (around 2013), Segment split the system into microservices for fault isolation — one worker per data destination. As the business exploded in 2016–2017, the number of destinations soared (about 3 new ones per month), and each service had its own repo while sharing a pile of common libraries — the result was that changing a common library once cost about a week of dev effort (all bottlenecked on testing); and "theoretically perfect fault isolation" would require tens of thousands of microservices (one per customer per queue), which simply wasn't realistic. In 2017 they rolled back to a monolith called Centrifuge, deliberately giving up some isolation in exchange for "one repo, unified versions, minute-level deploys, and the ability to keep building new features."

The most cutting line in the retrospective: "if microservices are applied in the wrong place, or used as a band-aid for the real problem, you'll drown in complexity and never be able to ship new products again." 📎 Segment: Goodbye Microservices · InfoQ coverage

⑤ A cautionary tale: Netscape's "tear down and rewrite" tragedy. In 2000, Joel Spolsky, in Things You Should Never Do, Part I, lamented bitterly: Netscape made "the single worst strategic mistake a software company can make" — rewriting the browser code from scratch. Between 4.0 and 6.0, there was nearly three years with no usable new release, and in those three years IE devoured almost the entire market share. Joel's two arguments still hold today: that ugly-looking old code hides tacit knowledge bought with years of stepping on landmines, fixing countless bizarre bugs; and a rewrite is a long undertaking, during which you completely stop improving the existing product while competitors sprint ahead. This is exactly the counter-proof of section 1 of this chapter.

Lay the five cases side by side and this chapter's main road is plain: ①③ are victories of "gradual evolution + parallel validation"; ② is a victory of "evolving while reshaping the organization"; ④ warns you that "splitting too far" and splitting blindly are equally fatal; ⑤ warns you of the fate of "tearing down and rewriting." Not one winner won by "stopping, tearing down, and rewriting."

🤖 The AI / vibe coding angle: AI can change the code, but "how to evolve without downtime" must be steered by a human

AI is genuinely changing the cost structure of this thing called "evolution," but it can't change to whom the judgment belongs.

AI makes the "grunt work" of large-scale refactoring/migration cheap. Codemods across thousands of files, batch-rewriting one set of old API calls into a new API, reading and explaining a stretch of legacy nobody dares touch, even auto-generating abstraction layers and adapter code — this dirty, tiring work that used to take a team months to gnaw through, AI can speed up enormously. "The migration workload of the Strangler Fig," "changing all the callers in Branch by Abstraction," "the backfill scripts of data migration" are exactly the parts where AI can help most.
And "parallel run + result comparison" is, by nature, the best partner for backing "code AI changed" with confidence. Don't dare to fully trust the critical path AI refactored? Then don't fully trust it: treat it as the candidate in section 4, and compare its results one by one against the old implementation's on real traffic. AI is responsible for producing the new implementation at high speed; the parallel run is responsible for using data to judge, on your behalf, whether it's correct. This is the muscle memory you most ought to build for refactoring in the AI era: let AI be fast, let comparison be slow; give speed to the model, give confidence to the data.
But that most central layer of judgment, AI can't give — only you can lay it down:

   What AI is great at (hand it over):               What only a human can steer (the real craft of this chapter):
   ────────────────────────────────────────────────  ─────────────────────────────────────────────────────────────
   • generate new impl / codemod / adapters          • should this block be split at all? where to draw the boundary?
   • read and explain legacy code                    • which block to move first? what order is safest?
   • write backfill scripts / migration boilerplate  • what "confidence threshold" do you set before daring to cut traffic?
   • batch-rewrite old API calls into new ones       • on trouble, which step to back out to, and how?
                                                     • is this a modular monolith, or do you truly need to split services?

This is exactly vibe coding's biggest "sweet trap." Vibe coding lets you pile up a working prototype in half a day, and that is tempting. But between a working prototype and a system that can be safely evolved while it keeps serving and its requirements keep changing lie exactly these seven sections of craft. AI can help you build the plane fast, but while the plane is fully loaded with passengers and the engines are still spinning, it can't judge on your behalf which engine to remove first, how to guarantee no stall, and how to back out if something goes wrong. That takes not code but continuous judgment about your business, your risk, and what you're willing to trade for what. "How to evolve a working pile of a prototype into a maintainable system without downtime" is exactly the part AI can't give and a human must steer. This is also the "gradual and controllable" philosophy that runs throughout the AI Agent / Workflow Platform: capability can be piled up by AI at high speed, but the steering wheel, the brakes, and the escape routes of evolution must be gripped in human hands.

Architectural wisdom: in the AI era, writing the new implementation will increasingly feel like "pressing a generate button," while "swapping it in safely in flight" will increasingly become the core value of a human. The more thoroughly implementation gets commoditized, the scarcer and more valuable the judgment of "steering evolution" becomes.

🎯 Quick check

🤔 A core billing module running in production is logically complex and catastrophically costly if wrong, and the team (or AI) has rewritten a theoretically equivalent new implementation. What is the safest architectural judgment before shipping?

A. Once all unit tests pass, cut straight to the new implementation and delete the old
B. Use a parallel run: the old implementation still returns its result to the user, the new implementation runs in the background against the same real traffic and is compared result by result, and you cut traffic over only once the mismatch rate is low enough
C. Open a long-lived feature branch, fully build and thoroughly test the new implementation, then merge and ship it all at once

Chapter summary

The big rewrite is almost doomed: the old system won't stop and wait for you (the target is moving), and its ugly code holds tacit knowledge bought by stepping on landmines — a rewrite means "shooting at a moving target while resetting your blood-and-tears experience to zero, and demanding your opponent stand still." The only main road is gradual evolution.
Strangler Fig pattern: add a layer of facade/routing around the old system, run new features on the new implementation and migrate old features block by block, until the old system receives no traffic and retires naturally. The vital point is that interception layer.
Branch by Abstraction: let old and new implementations coexist behind one abstraction and switch gradually via a flag, all on the trunk, always shippable — replacing the catastrophic long-lived feature branch.
Parallel Run / Dark Launch: run old and new at once, give the old result to the user, compare the differences, and build confidence with real traffic (not guesswork) before cutting traffic over; GitHub Scientist is its living model, and it is the best partner for backing "code AI changed."
Zero-downtime data migration: data is the hardest to change, so go through dual write → backfill → shadow compare → cut read → cleanup (expand-contract), making "the old store is authoritative all along" the safety net you can jump onto anytime, with only the final cleanup being irreversible.
Splitting the monolith: use DDD Bounded Contexts to find seams by business capability, and the Anti-Corruption Layer to isolate old and new models; the discipline is to modularize the monolith first to get the seams right and stable, then extract services one block at a time, on demand and by bottleneck. Service count is never the goal.
Fitness functions: write architectural constraints as automated tests that can fail and block CI, giving the architecture an immune system so it "grows up without rotting."
The AI / vibe coding throughline: AI makes the grunt work of refactoring/migration cheap, and the parallel run is by nature suited to backing code AI changed; but "whether to split, what to move first, when to cut, how to back out, modular monolith or truly split services" — these steering judgments AI can't give, and are exactly the core value of a human that grows ever scarcer in the AI era.
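The data-migration bullet above can be sketched as a small phase machine. This is an illustration under stated assumptions, not a migration framework: the `Phase` and `Migration` names are hypothetical, and it assumes dual writes continue through the read cutover so the old store stays complete until cleanup.

```python
from enum import IntEnum


class Phase(IntEnum):
    DUAL_WRITE = 1      # write to both stores, read from old
    BACKFILL = 2        # copy historical rows into the new store
    SHADOW_COMPARE = 3  # read both, serve old, diff the results
    CUT_READ = 4        # serve reads from new, keep dual writing
    CLEANUP = 5         # stop writing to old and drop it


class Migration:
    def __init__(self) -> None:
        self.phase = Phase.DUAL_WRITE

    def read_source(self) -> str:
        # Users see the old store until reads are explicitly cut over.
        return "new" if self.phase >= Phase.CUT_READ else "old"

    def write_targets(self) -> tuple:
        # Dual writes keep the old store usable as a fallback until cleanup.
        return ("new",) if self.phase == Phase.CLEANUP else ("old", "new")

    def advance(self) -> None:
        if self.phase == Phase.CLEANUP:
            raise RuntimeError("migration already finished")
        self.phase = Phase(self.phase + 1)

    def roll_back(self) -> None:
        if self.phase == Phase.CLEANUP:
            raise RuntimeError("cleanup is irreversible: the old store is gone")
        if self.phase == Phase.DUAL_WRITE:
            raise RuntimeError("already at the first phase")
        self.phase = Phase(self.phase - 1)


migration = Migration()
for _ in range(3):
    migration.advance()          # now at CUT_READ, reads come from "new"
source_after_cut = migration.read_source()
migration.roll_back()            # trouble: step back to SHADOW_COMPARE
source_after_rollback = migration.read_source()
```

The design point is that every transition except the last has an inverse; the irreversible step is isolated at the very end, after the comparison data has already earned your trust.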

Bridging forward: this chapter has been all about "how to evolve a system safely at the technical level." But you've probably caught a recurring whiff (from the Strangler Fig, from splitting the monolith, from Amazon's two-pizza teams) that how the architecture gets split, and whether it can be split at all, is often rooted not in technology but in the organization. In the next chapter (Advanced Part, Chapter 6), 15 · Organization as Architecture, we'll fully unfold the Conway's Law that [08] mentioned: why "a system's architecture will eventually grow to look like the organization that designed it," why many "architecture problems" are really "organization problems" in disguise, and how you can in turn use organization design to shape the architecture you want.
