
21 · Splitting & Migration in Practice

The thesis in one line: by the end of ch. 20, the AI customer-service system has grown into a big monolith where changing one thing breaks everything. This chapter puts every technique from ch. 14 to work: while the system keeps serving customers, processing refunds, and absorbing new requirements, we break it apart piece by piece and swap components one by one, with zero downtime.


🎯 Practice track, ch. 4 · This chapter drills one thing

Drilling the craft of evolution (not the philosophy). Picking up from ch. 14: Strangler Fig, Branch by Abstraction, Parallel Run / Shadow Traffic, zero-downtime data migration, Anti-Corruption Layer (ACL), Modular Monolith, and Fitness Functions, each applied to this AI customer-service system as a worked case.

| After reading you should be able to | How this chapter practices it |
| --- | --- |
| Judge whether to split and which piece to split first | Section 1: finding the seams + Modular Monolith first |
| Use the Strangler Fig to extract a service with zero downtime | Section 2: the step-by-step playbook for extracting the "retrieval service" |
| Build real confidence before swapping models / retrieval | Section 4: Shadow Traffic (the AI migration killer technique) |
| Swap the vector database with zero downtime | Section 5: expand-contract in five steps |

Opening: a plane in flight — don't even think about landing to rebuild it

Ch. 20 reached the end of the scale-up track. This AI customer-service system is making money, surviving flash sales, processing real refunds every day — but internally it looks like this:

   Everything crammed into one deployment unit:
   ┌────────────────────────────────────────────────────────────────────────────────┐
   │  Orchestration  +  RAG Retrieval  +  Tools/Refunds  +  Billing                 │
   │  ↑ Change the retrieval algorithm → redeploy everything, risk killing refunds  │
   │  ↑ Retrieval team and refunds team queuing behind the same codebase to ship    │
   └────────────────────────────────────────────────────────────────────────────────┘

This is exactly the "organization/efficiency" red flag from the Appendix · When to evolve (trigger signals): change one thing and it ripples everywhere, teams block each other, releases grow slower and slower.

A newcomer's instinct is: "This monolith is unsalvageable — tear it down, rewrite a clean version with microservices." Ch. 14 already blocked that road — the old system doesn't stop and wait for you (the target is moving), and all that "ugly" code has implicit knowledge crystallized from hard lessons (a certain refund edge case, a certain injection guard). Rewriting = shooting at a moving target + zeroing out your hard-won experience + demanding the opposition hold still while you reload.

So this chapter has only one main road: don't tear it down, evolve it incrementally. Old and new coexist; the old shrinks piece by piece while the new grows piece by piece; at any step you can stop or roll back, and the system stays online the whole time. The six sections below are the concrete techniques along this road.

In the AI era, this main road matters even more: vibe coding lets you prompt an AI to "generate a cleaner retrieval service" in half a day — but "safely swapping it in while the plane is flying" is something AI cannot give you. That requires continuous judgment on "what to split first, how to hedge, when to cut over, how to roll back if things break." Implementation is getting cheap; being at the helm is getting more valuable.


1. Find the seams first, don't rush to split (Bounded Context + Modular Monolith first)

Before splitting, answer two questions: along which seam? and which piece first?

Find seams along natural business-capability boundaries (DDD Bounded Context, ch. 14) — not along technical layers. This AI customer-service system has four natural seams:

   ┌────────────────────┐  ┌────────────────────┐  ┌────────────────────┐  ┌────────────────────┐
   │ Conversation       │  │ Retrieval (RAG)    │  │ Tools / Refunds    │  │ Billing            │
   │ Orchestration      │  │ (vector / rerank)  │  │ (money / state     │  │ (token accounting) │
   │ (session, routing) │  │                    │  │  machine)          │  │                    │
   └────────────────────┘  └────────────────────┘  └────────────────────┘  └────────────────────┘

Which piece first? Not "split everything" — instead ask: which piece has the strongest case for independent scaling / independent deployment / failure isolation?

| Candidate | Independent need | Should it be extracted first? |
| --- | --- | --- |
| Retrieval RAG | Needs independent scaling (retrieval QPS grows differently from conversation), needs high-frequency iteration on retrieval quality, may be reused by other products | ✅ Extract first |
| Refunds / Tools | Reliability / compliance requirements are completely different from conversation, failures need isolation (retrieval going down must not drag refunds down) | ✅ Extract second |
| Conversation orchestration | Is the hub, changes frequently but doesn't need independent scaling | Stay in the core |
| Billing | Simple, stable | Stay in the core (modular is enough) |

The iron rule (from ch. 14): Modular Monolith first, extract services on demand. Don't split into four microservices from the start, turning function calls into network calls and taking on the full burden of distributed systems out of nowhere (ch. 10). The right sequence:

Big Ball of Mud  ──▶  Modular Monolith (one deployment unit, enforced internal boundaries)  ──▶  extract only "Retrieval / Refunds" as services

First draw the four seams cleanly within the process (shifting boundaries costs almost nothing), let the seams be hammered by real business until they're stable, then pay the distributed-systems price only for retrieval and refunds, which genuinely need independent scaling / deployment. The other two pieces are perfectly at home in the Modular Monolith.


2. Strangler Fig: extract the "retrieval service" with zero downtime

Decided to extract retrieval first — but you can't take the service down to move it. Use the Strangler Fig pattern (ch. 14): wrap a routing layer around the old monolith and gradually redirect "retrieval" traffic toward the new service.

   The key is this "facade / router" layer — without it you have no switch to
   "quietly divert a small slice of traffic":

   Orchestration ──▶ ┌───────────────────────────────────────┐
                     │  Retrieval router (switch + % weight) │
                     └──────────┬─────────────────┬──────────┘
                                │ 90% (default)   │ 10% (canary)
                                ▼                 ▼
                     ┌───────────────────┐  ┌──────────────────┐
                     │ Old retrieval     │  │ New retrieval    │
                     │ inside monolith   │  │ service          │
                     │ (shrinking)       │  │ (growing)        │
                     └───────────────────┘  └──────────────────┘

Step-by-step migration playbook (every step is stoppable and rollback-able):

  1. Establish the facade: funnel all "retrieval" calls inside the monolith through one internal interface retrieve(query, tenant) → chunks (this is step ① of Branch by Abstraction, detailed in the next section). Behavior is completely unchanged at this point.
  2. Build the new service: copy / rewrite the retrieval logic as a standalone "retrieval service" with its own database and its own deployment. No traffic yet.
  3. Shadow comparison (detailed in section 4): the new service "shadow-runs" against real queries, comparing recall quality — results are not returned to users.
  4. Canary cut-over: the router switch steps from 1% → 10% → 50% → 100%, with each tier observed for recall quality, latency, and error rate. If any tier looks wrong, flip the switch back to 0 — the old retrieval stays online the whole time.
  5. Strangle complete: once 100% flows through the new service and it's stable, delete the old retrieval code from the monolith. It has been strangled offline.
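
The playbook above can be sketched as a small router. This is a minimal illustration in Python, not the chapter's actual implementation: the names `RetrievalRouter`, `bucket_for`, and `canary_percent` are hypothetical, and the fallback to the old path on error is an assumption that only makes sense for read-only calls such as retrieval.

```python
import hashlib
from typing import Callable, List

# Assumed shape of the facade from step 1: (query, tenant) -> chunks
Retriever = Callable[[str, str], List[str]]


def bucket_for(key: str) -> int:
    """Map a key to a stable bucket in [0, 100) so the same key always routes the same way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


class RetrievalRouter:
    """Facade in front of old and new retrieval; canary_percent is the switch."""

    def __init__(self, old: Retriever, new: Retriever, canary_percent: int = 0) -> None:
        self.old = old
        self.new = new
        self.canary_percent = canary_percent  # 0 keeps everything on the old path

    def retrieve(self, query: str, tenant: str) -> List[str]:
        if bucket_for(f"{tenant}:{query}") < self.canary_percent:
            try:
                return self.new(query, tenant)
            except Exception:
                # Assumption: retrieval is read-only, so retrying on the old path is safe.
                return self.old(query, tenant)
        return self.old(query, tenant)
```

Stepping through the canary tiers then means changing one number (1, 10, 50, 100), and rolling back means setting it to 0. Hashing the key instead of drawing a random number keeps a given query on the same path between requests, which makes quality comparisons easier to read.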

Architectural wisdom: the Strangler Fig isn't about being "fast" — it routinely means maintaining old and new in parallel for a period — but it breaks "betting the whole system in one shot" into a controlled-risk sequence of "cut over one small increment at a time." You're not dueling the old monolith; you're growing a new retrieval system around it until the old one has zero traffic and naturally withers.


3. Branch by Abstraction: swap the provider behind the "model abstraction layer"

Ch. 20's scale-up calls for an AI gateway (multi-provider failover). But "calling the model" is entangled across dozens of direct calls scattered through the orchestration layer — you can't intercept it with an external wrapper. This is where you use Branch by Abstraction (ch. 14): no long-lived feature branch — stay on main the whole time, let an abstraction layer make old and new implementations coexist.

   Branch by Abstraction in five steps, always releasable from main:

   ① Insert abstraction layer: every call now goes through ModelClient, no
        more direct connections to a provider SDK
        Orchestration ──▶ [ModelClient abstraction] ──▶ direct provider A (old)

   ② Write new implementation behind the abstraction: connect the AI gateway
        (multi-provider + failover + semantic cache)
        Orchestration ──▶ [ModelClient] ─┬─▶ direct provider A (default, flag-gated)
                                         └─▶ AI gateway (new, no traffic yet)

   ③④ Feature flag gradually shifts traffic to the gateway; roll back any time something breaks
   ⑤ Gateway stable → delete the "direct provider" old implementation

Architectural wisdom: "Open a long-lived branch, change everything slowly, merge when done" is the instinct — and the most expensive anti-pattern for large-scale refactoring. You're building a merge bomb whose fuse is held by others (colleagues also changing main). Branch by Abstraction turns it around: first lay down a "socket" (ModelClient) that both old and new implementations can plug into, then calmly cut over on main without panic. The cost of one extra abstraction layer buys the peace of mind of "can stop, can ship, can revert at any moment."
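
A minimal sketch of the "socket" in Python may help make the five steps concrete. Only the name `ModelClient` comes from the text; `DirectProviderClient`, `GatewayClient`, `make_model_client`, and the `complete` method are hypothetical names, and both implementations are stubs that return strings instead of calling a real provider or gateway.

```python
from typing import Protocol


class ModelClient(Protocol):
    """The abstraction every model call goes through (step 1)."""

    def complete(self, prompt: str) -> str: ...


class DirectProviderClient:
    """Old implementation: stands in for the existing direct provider call."""

    def complete(self, prompt: str) -> str:
        return f"[provider-A] {prompt}"  # stub, no real SDK call


class GatewayClient:
    """New implementation: stands in for a call through the AI gateway (step 2)."""

    def complete(self, prompt: str) -> str:
        return f"[gateway] {prompt}"  # stub, no real gateway call


def make_model_client(use_gateway: bool) -> ModelClient:
    """The feature flag picks which implementation sits behind the abstraction (steps 3-4)."""
    return GatewayClient() if use_gateway else DirectProviderClient()


def answer_customer(client: ModelClient, question: str) -> str:
    # Orchestration code depends only on ModelClient, never on a provider SDK.
    return client.complete(question)
```

Both implementations live on main at the same time. Step 5 is then a small, boring change: delete `DirectProviderClient` and the flag.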

Incidentally, this ModelClient abstraction is itself the seam AI systems should always have — provider price hikes, rate limits, bans, new model releases… "swapping models" is a high-frequency event in the AI era. Encapsulating it behind one interface is the standard fulfillment of ch. 08's "leave a good seam at the most likely point of change."


4. Parallel Run / Shadow Traffic: build real confidence before swapping models or retrieval (the AI migration killer technique)

Sections 2 and 3 both gave you a "cut-over switch." But before you cut, what gives you confidence that the new retrieval / new model is actually correct? Passing tests? Real traffic is always more devious than your test cases — AI systems even more so, because they have no "correct answer" you can assert against.

The most robust approach is Parallel Run / Shadow Traffic (ch. 14): run old and new against the same real requests simultaneously, return the old result to users (transparently), and quietly compare the new result against the old one in the background.

   Scenario A: swap retrieval algorithm (pure vector → hybrid + rerank)

                 ┌──▶ Old retrieval ──▶ chunks_A ────────────▶ returned to user ✅
   Real query ──┬─┤
                │ └──▶ New retrieval ──▶ chunks_B ──┐
                │                                   ▼
                │                    eval scores both sides' recall
                │                    consistent / better → confidence +1; worse → log for investigation

   Scenario B: swap model / connect gateway
                 ┌──▶ Old model ──▶ answer_A ─────────────────▶ returned to user ✅
   Real prompt ──┬─┤
                 │ └──▶ Candidate model ──▶ answer_B ──┐
                 │                                     ▼
                 │                  LLM-as-judge / rule scoring compares answer quality
                 │                  quality not degraded → safe to cut; degraded → don't cut, keep fixing

This is exactly the muscle memory that AI migrations should build — and it echoes the AI perspective of ch. 14:

Let AI move fast; let comparison move slowly. Speed to the model, confidence to the data.

AI lets you rapidly produce "new retrieval / new implementation," but whether it's correct and whether it has regressed — that verdict belongs to "real traffic + eval," not to your confidence. Key discipline (same as GitHub Scientist): the old implementation's result is always what users get; the candidate runs in the background, swallows its own exceptions (never let experiment code take down production); only when the "inconsistency / quality-degradation rate" drops low enough do you actually cut over.
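
The discipline above fits in a few lines. The sketch below is an assumption-laden illustration, not the Scientist library's API: `shadow_retrieve`, `overlap`, and `record` are hypothetical names, Jaccard overlap of chunk IDs is only a stand-in for a real eval, and the candidate runs inline here for brevity where a production system would more likely run it off the request path.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("shadow")

Retriever = Callable[[str], List[str]]


def overlap(a: List[str], b: List[str]) -> float:
    """Jaccard overlap of two chunk-id lists; 1.0 means identical sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def shadow_retrieve(
    query: str,
    old: Retriever,
    new: Retriever,
    record: Callable[[str, float], None],
) -> List[str]:
    """Return the old result to the caller; score the candidate on the side."""
    control = old(query)  # errors on the old path surface exactly as before
    try:
        candidate = new(query)
        record(query, overlap(control, candidate))
    except Exception:
        # The experiment swallows its own failures and never affects the user.
        logger.exception("shadow candidate failed")
    return control
```

The recorded scores are what the cut-over decision should read from: a candidate that fails or disagrees shows up in logs and metrics, never in a customer's answer.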

📎 The living showcase of this mechanism is GitHub's Scientist library — ch. 14's real cases cover it in detail. In the AI era its value has only risen: it is, by design, the perfect backstop for giving you confidence in "AI-modified / AI-generated code."


5. Zero-downtime data migration: swap the vector database

After scale-up, the MVP's makeshift vector store (say, pgvector) can no longer carry the load and needs to be replaced with a dedicated vector database (Milvus / Qdrant). Data is the hardest part: code can be rolled back with a blue-green deployment, but data has only one copy, and corrupting it is often unrecoverable. So follow ch. 14's expand → contract five steps, every step rollback-able:

   ① Dual write: new documents are written to both [old vector store] and [new vector store].
        Retrieval still reads from the old.
        → Escape hatch: stop writing to the new store; the old is always authoritative.

   ② Backfill: batch-load all existing documents into the new store. ⚠️ AI-specific cost —
        this step requires re-embedding every historical document chunk: real money, real time.
        Run in batches with rate limiting, interruptible and resumable.
        → Escape hatch: backfill only writes to the new store; stop any time.

   ③ Shadow-read verification: query both sides during retrieval; users still get the old store's result.
        Compare "new store recall" vs "old store recall" for consistency
        (this is the shadow traffic from section 4!).
        → Escape hatch: compare without cutting over; keep fixing if recall falls short.

   ④ Cut over reads: once consistency rate is high enough, switch retrieval to the new vector store.
        Dual write is still active.
        → Escape hatch: switch reads back to the old store, which has been continuously dual-written and stays fresh.

   ⑤ Cleanup: after observing stability, stop dual-writing to the old store and decommission it.
        → The only irreversible step — save it for last, with a generous observation window.

Architectural wisdom: data migration disasters almost always stem from "big bang" execution — take the service down at midnight, run the migration script, go live, pray. The right posture is to stretch it into a long chain, keeping "the old store is always authoritative" as a safety net you can jump back to at any moment, until the very last step when you cut the cord. The only thing special about AI systems here is that step ② backfill requires re-embedding — a real compute bill that needs to be planned as a cost line item (echoing step ② of ch. 19).
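
One way to picture the five steps is a wrapper whose single `phase` field decides where writes and reads go. This is a sketch under loud assumptions: `Phase`, `InMemoryStore`, `MigratingStore`, and `backfill` are hypothetical names, the in-memory store with substring search stands in for a real vector database, and no embedding is computed.

```python
from enum import Enum
from typing import Dict, List


class Phase(Enum):
    OLD_ONLY = "old_only"      # before the migration starts
    DUAL_WRITE = "dual_write"  # steps 1-3: write both, read old
    READ_NEW = "read_new"      # step 4: write both, read new
    NEW_ONLY = "new_only"      # step 5: old store retired


class InMemoryStore:
    """Stand-in for a vector store: doc_id -> text, substring match as 'search'."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text

    def search(self, query: str) -> List[str]:
        return sorted(doc_id for doc_id, text in self.docs.items() if query in text)


class MigratingStore:
    def __init__(self, old: InMemoryStore, new: InMemoryStore) -> None:
        self.old, self.new = old, new
        self.phase = Phase.OLD_ONLY

    def upsert(self, doc_id: str, text: str) -> None:
        if self.phase != Phase.NEW_ONLY:
            self.old.upsert(doc_id, text)
        if self.phase != Phase.OLD_ONLY:
            self.new.upsert(doc_id, text)

    def search(self, query: str) -> List[str]:
        read_new = self.phase in (Phase.READ_NEW, Phase.NEW_ONLY)
        return (self.new if read_new else self.old).search(query)


def backfill(old: InMemoryStore, new: InMemoryStore, start_after: str = "", limit: int = 100) -> str:
    """Copy up to `limit` docs with id > start_after; return the last id copied as a checkpoint."""
    last = start_after
    for doc_id in sorted(i for i in old.docs if i > start_after)[:limit]:
        new.upsert(doc_id, old.docs[doc_id])  # a real backfill would re-embed here
        last = doc_id
    return last
```

Rolling back from step 4 is just setting `phase` back to `Phase.DUAL_WRITE`, which works because the old store kept receiving every write. Persisting the checkpoint returned by `backfill` is what makes the job interruptible and resumable.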


6. Connecting to legacy systems without being polluted + keeping the architecture from rotting

Anti-Corruption Layer (ACL): The AI customer-service system needs to connect to a decade-old order / payment system — bizarre field names, jumbled concepts. Don't let its dirty model seep into your clean refunds service. Build an Anti-Corruption Layer to translate and isolate (ch. 14):

   ┌───────────────────────┐    ┌────────┐    ┌────────────────────────────────┐
   │ Refunds service       │───▶│  ACL   │───▶│ Old order / payment system     │
   │ (clean new model)     │◀───│        │◀───│ (dirty fields, bizarre states) │
   └───────────────────────┘    └────────┘    └────────────────────────────────┘
        The new service only talks to a "cleanly translated interface";
        the old system's rot cannot seep through.
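
In code, an ACL is often little more than a translation function plus a clean domain type. The sketch below is purely illustrative: the legacy field names (`ORD_NO`, `AMT_CENTS`, `STAT_CD`) and status codes (`"01"`, `"09"`) are invented for the example, and the real mapping would come from the old system's documentation.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class OrderStatus(Enum):
    PAID = "paid"
    REFUNDED = "refunded"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Order:
    """The clean model the refunds service works with."""
    order_id: str
    amount: Decimal
    status: OrderStatus


# Hypothetical legacy status codes, invented for this example.
_LEGACY_STATUS = {"01": OrderStatus.PAID, "09": OrderStatus.REFUNDED}


def to_order(legacy_row: dict) -> Order:
    """Translate one legacy record; this is the only place that knows legacy field names."""
    return Order(
        order_id=str(legacy_row["ORD_NO"]),
        amount=Decimal(legacy_row["AMT_CENTS"]) / 100,
        status=_LEGACY_STATUS.get(legacy_row["STAT_CD"], OrderStatus.UNKNOWN),
    )
```

Unrecognized legacy states map to `UNKNOWN` instead of leaking a raw code into the refunds service, so the new model decides explicitly how to handle them.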

Fitness Functions: You worked hard to carve out clean boundaries — how do you ensure a few hundred future commits don't mash them back into a ball of mud? Write architectural constraints as automated tests that fail and can gate CI (ch. 14), giving the architecture an immune system. A few fitness functions this AI customer-service system should have:

| Architectural constraint you care about | Fitness Function (add to CI) |
| --- | --- |
| Orchestration layer must not directly import vector-store internals | Dependency check: fail on detection |
| All model calls must go through the ModelClient abstraction (no direct provider connections scattered around) | Dependency check: fail on direct provider SDK import |
| Every write operation in the refunds service must carry an idempotency key | Static check / contract test: fail if key is missing |
| Retrieval service p99 < 200 ms | Performance test: fail if exceeded |
| Retrieval must not cross-tenant recall | Integration test: fail if cross-tenant data is reachable (security red line) |
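
The two dependency checks can be sketched with Python's standard `ast` module. This is one possible approach under stated assumptions, not a prescribed tool: `forbidden_imports` is a hypothetical helper, and `provider_sdk` is a placeholder for whatever provider package the codebase really uses.

```python
import ast
from typing import List, Set


def forbidden_imports(source: str, forbidden: Set[str]) -> List[str]:
    """Return the forbidden top-level packages that `source` imports."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue  # not an import, or a relative "from . import x"
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.append(root)
    return found
```

A CI test could read every file under the orchestration package, call `forbidden_imports(text, {"provider_sdk"})`, and fail the build if any list comes back non-empty.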

Architectural wisdom: architecture is not a statue that's fixed the moment the diagram is drawn — it's a living system that needs continuous maintenance. Living things without an immune system will inevitably rot. Fitness Functions don't stop the system from growing; they only stop it from rotting as it grows. Without them, every boundary you painstakingly carved out today is just waiting for someone's rushed afternoon and a stray import.


📌 Real cases

All the craft in this chapter comes from the real cases in ch. 14. Two worth committing to memory from the migration road:

  • GitHub Scientist (the living showcase of Parallel Run): when refactoring high-risk permission checks, the old logic result was returned to users as normal, while the new logic ran against real traffic in the background — building confidence from data, not self-confidence. In the AI era, it's the perfect backstop for "swapping models / retrieval / AI-modified code."
  • Segment's "Goodbye Microservices" (a warning about over-splitting): they split the system so finely "for failure isolation" that a change to a shared library took a week to propagate, and they ultimately rolled back to a monolith. The reminder: service count is never the goal — first get the boundaries right with a Modular Monolith, then extract services on demand based on actual bottlenecks.

🎯 Quick check

🤔 To switch the AI customer-service retrieval from 'pure vector' to 'hybrid retrieval + rerank', what is the safest approach before going live?
  • A. Pass offline evaluation and then cut straight to the new retrieval at full traffic, deleting the old one
  • B. Use shadow traffic: the old retrieval result is returned to users as normal; the new retrieval shadow-runs against the same real queries in the background; use eval to compare recall quality; only cut over gradually once there is no degradation or quality is better
  • C. Open a long-lived branch, finish the new retrieval completely, then merge and deploy in one shot
🤔 When restructuring a monolithic AI customer-service system into a better shape, what should the very first step be?
  • A. Immediately split into four independently deployed microservices: orchestration / retrieval / tools / billing
  • B. First draw the four business boundaries (Bounded Contexts) cleanly as a Modular Monolith within one deployment unit, then extract only retrieval and refunds as services on demand if they genuinely need independent scaling / deployment
  • C. Tear it down and rewrite a clean microservices architecture from scratch

Chapter summary

  • Don't rewrite from scratch: the ugly code in the old AI customer-service system carries crystallized hard-won experience; rewriting = chasing a moving target + zeroing out those lessons + demanding the opposition hold still. The only road is incremental evolution.
  • Find the seams first, Modular Monolith first: find Bounded Contexts along business capabilities (conversation / retrieval / tools+refunds / billing); first draw the boundaries cleanly within the process, then extract only retrieval and refunds — which genuinely need independent scaling / deployment — as services.
  • Strangler Fig: add a routing switch between the orchestration layer and retrieval, gradually shifting traffic from old retrieval to the new retrieval service; the old is strangled offline.
  • Branch by Abstraction: funnel "call the model" through a ModelClient abstraction, then cut over from "direct provider" to "AI gateway" with zero downtime behind it — "swapping models" is a high-frequency event in the AI era, this seam is mandatory.
  • Parallel Run / Shadow Traffic is the AI migration killer technique: for both retrieval and model swaps, let the new implementation shadow-run and compare quality against real traffic with eval. Let AI move fast; let comparison move slowly. Speed to the model, confidence to the data.
  • Zero-downtime vector database swap: dual write → backfill (watch the re-embedding compute bill) → shadow-read verification → cut over reads → cleanup — keep the old store as the authoritative safety net the whole time.
  • ACL isolates the dirty model of the legacy order system; Fitness Functions encode architectural constraints (all model calls must go through the abstraction layer, refunds must carry idempotency keys, no cross-tenant recall) into CI, giving the architecture an immune system.

Bridging forward: at this point you've taken this AI customer-service system through a full cycle — read it (18), design it (19), evolve it (20), split and migrate it (21) — but it's still fundamentally "conversation + controlled actions." The next chapter, 22 · AI-native system design, pushes autonomy up another notch: designing a fully autonomous agent that plans on its own, calls tools on its own, and drives an entire support ticket to resolution across multiple steps. The stronger the autonomy, the harder ch. 17's three new constraints bite — and that leads us directly into the final AI-collaborative design track.

