20 · Evolution Playbook: MVP → Scale
The thesis in one line: the "first version" designed in ch. 19 will never be its final form. This chapter follows that same AI customer-service system through its three lives — watching how quantified signals push it to grow, one stage at a time, and how every "should we touch this?" is answered with a ruler, not a gut feeling.
🎯 Practice track, ch. 3 · This chapter drills one thing
Drill evolution judgment: not "how to design," but "when to upgrade, when to hold back." It picks up from ch. 08 (ADRs + when to evolve) and the Appendix · When to Evolve (trigger signals), and lands them on the AI customer-service system from ch. 19.
By the end you should be able to This chapter trains it via Judge when to upgrade with quantified signals (not feelings) The "trigger signals" between each stage Write one ADR for every upgrade Three sample ADRs Recognize the two signals unique to AI system evolution Cost drift + quality drift
Intro: don't force an MVP into the mature-phase blueprint
Ch. 07 said it, and it bites harder in AI systems: architecture grows — never start by building toward the "tens-of-millions, multi-tenant" end state. The diagram from ch. 19 is your mental target, not the day-one look.
In the real world, this AI customer-service system grows like this:
Stage 1: MVP Stage 2: Growth Stage 3: Scale
┌──────────┐ cost / ┌──────────┐ multi-tenant / ┌──────────┐
│ call API │ latency / │ +routing │ resilience / │ +gateway │
│ pure- │ complaints│ +caching │ quality drift │ +multi- │
│ vector │ ─────────▶ │ +hybrid │ ─────────────▶ │ active │
│ RAG │ │ retriev.│ │ +eval │
│ refunds │ signals │ +small │ signals │ gate │
│ manual │ trigger │ auto- │ trigger │ +tenant │
└──────────┘ │ refund │ │ isolat. │
▲ └──────────┘ └──────────┘
└────────── every step is pushed by real signals, not feelings ──────────┘The discipline of this chapter (and the taste from ch. 09): no signal, no move. Upgrading without a trigger is almost always over-engineering or chasing novelty. Each of the three stages below starts with trigger signals, then gives upgrade actions, and closes with an ADR.
Stage 1 · MVP: call the API before building your own, keep humans before automating
One goal only: validate whether "AI customer service" has actual users and produces trustworthy answers. The rougher the better:
| Dimension | How MVP does it (deliberately minimal) |
|---|---|
| Model | Call a hosted vendor API — no self-built GPU |
| RAG | Off-the-shelf framework + fixed-size chunking + pure vector search — just make it run |
| Session | Client carries history, server is stateless and easily scalable |
| Refunds | Model only answers questions, does not auto-refund — anything touching money goes to human approval first |
| Cost | No routing / caching yet — prove people use it before talking about saving money |
Note the last two rows: MVP hands the most dangerous things directly to humans. That's not laziness — it's the most extreme, most friction-free version of ch. 19's "keep uncertainty outside the side-effect boundary" — the human is the gate. Once volume grows, automate the safe parts.
This version deliberately racks up conscious technical debt (no cache, no routing, coarse chunking, manual refunds) — but ch. 08 said it: intentional MVP debt is often the right call, as long as you know what you owe and write it down.
First ADR — recording "why so rough":
┌────────────────────────────────────────────────────────────┐
│ ADR-001: Use hosted model API in MVP; no self-built │
│ inference │
├────────────────────────────────────────────────────────────┤
│ Status: Accepted Date: (pre-launch) Deciders: arch. │
│ │
│ ── Context ── │
│ The product is still validating "whether anyone uses AI │
│ customer service." Traffic is unknown; the team has no │
│ GPU ops experience. The biggest risk right now is │
│ "built it and nobody came," not "it's too expensive." │
│ │
│ ── Decision ── │
│ Call a hosted model API; use an off-the-shelf framework │
│ for RAG; all refunds require human confirmation. │
│ Explicitly defer cost reduction (routing/caching/ │
│ self-hosting) and automated refunds. │
│ │
│ ── Options considered ── │
│ A. Self-build GPU + continuous batching from day one │
│ → Rejected: no traffic yet; self-building = burning │
│ money on over-engineering, and it slows launch │
│ B. Let the model auto-refund → Rejected: no baseline, │
│ no monitoring — letting AI touch money is too risky │
│ │
│ ── Trade-offs & consequences ── │
│ + Fastest launch, lowest risk; money handled by humans │
│ is safest │
│ − Per-token cost is higher; manual refunds don't scale │
│ (known debt — see trigger signals) │
└────────────────────────────────────────────────────────────┘Trigger signals ① → Stage 2 · Growth
The MVP is working, users are growing, and these signals all turn red at once — that's the license to upgrade, not "the boss says we need to be more AI-native":
| Measured signal (experience ranges) | Notes | Likely bottleneck |
|---|---|---|
| Monthly model bill exceeds budget (e.g., hits 1.5× budget) | AI-specific! Token burn is too high | No routing, no caching — even simple questions hit the big model |
| Time-to-first-token P99 over threshold (users getting impatient) | Experience-layer signal (Appendix · When to Evolve) | Model call is slow + no caching |
| "Off-topic answer" complaint rate > threshold | AI-specific! Bad retrieval → hallucination | Pure vector + coarse chunking, poor recall |
| Manual refund queue backing up (human reviewers drowning) | Org / efficiency signal | All refunds manual — can't absorb the volume |
Upgrade actions (fix only what's blocking you — don't rewrite everything at once):
MVP Growth
─── ──────
pure vector + coarse chunks ──▶ hybrid retrieval (vector + keyword) + reranking + semantic chunking
recompute system prompt each ──▶ prompt caching (system prompt / knowledge prefix never recomputed)
turn
big model for everything ──▶ model routing: FAQ → small model, complex questions only → big model
client carries history ──▶ session stored server-side (cross-device, enables long-term memory)
all refunds manual ──▶ small, low-risk refunds "automated" (the idempotent gate from ch. 19);
large amounts still go to human review
(no eval) ──▶ build first eval set: watch retrieval recall + answer qualityNote the "small-amount auto-refund": this is the moment we actually use the deterministic + idempotent + state-machine gate designed in ch. 19 — human review shrinks from "every transaction" to "large amounts only", absorbing the volume while protecting the money.
Two ADRs — recording the two most critical changes, cost and refunds:
┌────────────────────────────────────────────────────────────┐
│ ADR-002: Introduce model routing + prompt caching │
│ (cost trigger) │
├────────────────────────────────────────────────────────────┤
│ Status: Accepted Context: Monthly bill hit 1.6× │
│ budget; FAQ-type queries are 70% of volume yet all │
│ routed to the big model. │
│ Decision: Simple FAQ / knowledge-base hits route to │
│ small model or semantic cache; only complex questions │
│ reach the big model + cached system-prompt prefix. │
│ Options: A. Just swap to a cheaper big model → quality │
│ falls below bar; │
│ B. Remove RAG to save tokens → hallucination │
│ spikes, unacceptable. │
│ Trade-offs: + per-turn cost drops significantly; │
│ − routing logic itself needs tuning; │
│ occasional mis-routes. │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ ADR-003: Automate small refunds; keep human review for │
│ large amounts (queue-backlog trigger) │
├────────────────────────────────────────────────────────────┤
│ Status: Accepted Context: Manual refund queue backs │
│ up hours daily; experience is poor and it doesn't scale. │
│ Decision: Refunds ≤ ¥500 with passing order-status │
│ validation are executed idempotently by the deterministic │
│ refund service; > ¥500 still enters the human review │
│ queue. The model only "proposes" — it never executes. │
│ Options: A. Fully automated refunds → Rejected: large │
│ erroneous refunds are irreversible; │
│ B. Keep all-manual → Rejected: signals already │
│ prove it can't scale. │
│ Trade-offs: + human-review volume plummets, experience │
│ improves; │
│ − threshold must be tuned against risk data; │
│ reconciliation backstop is mandatory to │
│ catch automation errors. │
└────────────────────────────────────────────────────────────┘💡 Look at the "Context" fields of ADR-002 and ADR-003 — every entry is a quantified signal; not one says "I have a feeling we should upgrade." That's the ruler from ch. 08.
Trigger signals ② → Stage 3 · Scale
The growth stage has stabilized, the business keeps growing, and a new class of signals appears — many unique to AI systems and multi-tenant systems:
| Signal | Notes | Bottleneck |
|---|---|---|
| Onboarding multiple merchant tenants | Each tenant's knowledge base must not bleed into another's | Missing tenant isolation |
| A single model provider outage takes the entire customer-service platform down | Availability step-change (Appendix · When to Evolve: eliminate single points) | Model calls are a SPOF |
| After a model upgrade or prompt change, production quality silently regresses | AI-specific! Nondeterminism causes quality drift | No eval gate in CI |
| Traffic spikes during promotions + cost continues to grow | The mechanics of scale (ch. 13) | Single region, no elasticity |
Upgrade actions:
Growth Scale
────── ─────
direct connection to one provider ──▶ AI gateway: multi-provider failover + unified billing & rate-limiting
+ semantic caching (see the ai-gateway template)
single knowledge base ──▶ multi-tenant isolation: knowledge base namespaced per tenant;
retrieval enforced per-tenant (no cross-tenant recall)
eval set run manually ──▶ eval as CI gate: model upgrades or prompt changes must pass
eval first; score drop blocks the deploy (prevents quality drift)
single region ──▶ multi-active + elastic scaling (absorbs promotion spikes)
logs via grep ──▶ full-chain trace: every round's retrieval / routing / tool / token
is replayableThe most critical ADR of Stage 3 — also the one unique to AI system evolution:
┌────────────────────────────────────────────────────────────┐
│ ADR-005: Eval set becomes CI gate (quality-drift trigger) │
├────────────────────────────────────────────────────────────┤
│ Status: Accepted │
│ Context: After the last model version upgrade, answers │
│ to a certain class of policy questions silently │
│ degraded — discovered two weeks later from complaints. │
│ Nondeterminism means "quality regression" cannot be │
│ caught by traditional assertions (same input does not │
│ guarantee same output). │
│ Decision: Maintain a representative "question → expected │
│ key points" eval set. Before upgrading the model, │
│ changing a prompt, or changing retrieval strategy, CI │
│ runs eval (rules + LLM-as-judge scoring). Scores below │
│ baseline are blocked unconditionally — no deploy. │
│ Options: A. Manual spot-checks → Rejected: misses │
│ regressions, slow, unsustainable; │
│ B. Freeze model version, no upgrades → │
│ Rejected: forfeits all gains from model │
│ improvement. │
│ Trade-offs: + "good enough" is quantified; silent │
│ regressions are blocked; │
│ − eval set needs continuous maintenance; │
│ scoring itself has a cost (this is the │
│ theme of ch. 22). │
└────────────────────────────────────────────────────────────┘Architectural wisdom: traditional systems guard quality with "same input → same output" assertions; AI systems have had that bedrock pulled out from under them (ch. 17), so quality can only be guarded by "eval set + scoring" over a distribution. That's why "eval as a gate" for AI systems isn't a nice-to-have — it's the brakes against regression during evolution. Without it, every "model upgrade" is a blind gamble.
Master table: signal → upgrade → ADR
| Stage | Trigger signals | Upgrade actions | ADR |
|---|---|---|---|
| MVP | (starting point) | Call API, pure-vector RAG, manual refunds | ADR-001 |
| → Growth | Bill over budget / P99 over threshold / high complaint rate / human-review backlog | Routing + caching, hybrid retrieval + reranking, server-side sessions, small-amount auto-refunds, build eval set | ADR-002/003 |
| → Scale | Multi-tenant / provider SPOF / quality drift / traffic spikes | AI gateway failover, multi-tenant isolation, eval in CI gate, multi-active, full-chain trace | ADR-004/005 |
Two signals unique to AI system evolution — signals traditional systems never have. Watch them closely:
- Cost drift: token bills burn linearly or even super-linearly with scale. Traditional systems can "launch first, optimize cost later" — AI systems that ignore cost will see their margins evaporate overnight.
- Quality drift: any change to the model version, prompt, or retrieval strategy can cause quality to silently regress with no warning. Must use eval as a gate to catch it proactively.
🎯 Quick check
- AThe team feels the current code is not AI-native enough and should be refactored to something more sophisticated
- BThe monthly token bill hit 1.6× budget, with FAQ queries making up 70% of volume yet all routed to the big model — cost is bottlenecked by a real, measured constraint
- CA competitor shipped multi-agent, so we should too
- ABecause AI system code is more complex
- BBecause AI systems are nondeterministic: the same input does not guarantee the same output, so traditional equality assertions break down; model upgrades or prompt changes can silently regress quality, and only eval set + scoring can guard the quality distribution
- CBecause CI runs faster
Chapter summary
- Don't force the mature-phase blueprint onto an MVP: the same AI customer-service system has three lives — a call-API MVP → a growth phase with routing / caching / hybrid retrieval / small-amount auto-refunds → a scale phase with gateway / multi-tenant / eval gate / multi-active.
- MVP hands the most dangerous things to humans: all-manual refunds are the most friction-free version of "keep uncertainty outside the side-effect boundary"; automate the safe parts only once volume warrants it.
- Every upgrade is triggered by quantified signals, not feelings: billing, P99, complaint rate, human-review backlog, provider SPOF, quality drift — no signal, no move (the taste of ch. 08 / ch. 09).
- Write one ADR for every upgrade: the Context field is all signals — that's the decision audit trail.
- Two signals unique to AI system evolution: cost drift (token burn scales linearly) and quality drift (nondeterminism causes silent regressions) — managed with "cost monitoring + model routing" and "eval as a gate," respectively.
Bridging forward: as you reach the tail end of the scale stage, a new problem surfaces: orchestration, RAG, tools, billing, refunds — all crammed into one deployment unit for speed. Now, changing anything breaks everything, nobody dares touch it, and multiple teams block each other — the "org / efficiency" red lights in the Appendix · When to Evolve. The next chapter, 21 · Splitting & Migration in Practice, swaps the engine on this AI customer-service system mid-flight: using the strangler fig, parallel run, and zero-downtime migration from ch. 14 to safely pull it apart piece by piece.
Related links
- The system being evolved: 19 · Full Design Walkthrough: Medium Complexity — this chapter evolves it
- Core methodology: 08 · ADRs & Evolution (ADRs + when to evolve), Appendix · When to Evolve (trigger signals) (quantified signal quick-reference)
- AI-specific constraints: 17 · Architecting in the Age of LLMs (cost / nondeterminism)
- Scale mechanics: 13 · The Mechanics of Scale · compare with the AI Chat Product evolution roadmap
💬 Comments