18 · Reading the Map: Deconstruct Unfamiliar Systems

The thesis in one line: a good designer must first know how to "read" — faced with an unfamiliar architecture diagram, they peel out "why it's wired this way" layer by layer, with the same set of questions every time. AI systems look new, but the reading moves don't change by a single word.

🎯 Practice track, ch. 1 · This chapter drills one thing
No designing, no coding — drill only reverse map-reading: given an architecture description, be able to state its constraints, soul path, key trade-offs, and what will break first.
After reading you should be able to How this chapter trains it
Read any template with the four-step method — no sequential reading Section 1: method + map-reading note template
Walk through a full example Section 2: RAG Knowledge Base (demo, ~15 min)
Fill in a map-reading note yourself Section 3: AI Chat Product (10-minute exercise)
Complete one independently on a different system Section 4: homework
Why does the practice track open with AI systems? Because they're exactly the unfamiliar systems most people are "least familiar with and most easily bluffed by jargon" — and the map-reading method exists precisely to handle the unfamiliar. From the next chapter (19) onward we'll forward-design an AI customer-service bot; this chapter first trains your eye with its "close relative." Learn to read before you talk about design.

After reading you should be able to	How this chapter trains it
Read any template with the four-step method — no sequential reading	Section 1: method + map-reading note template
Walk through a full example	Section 2: RAG Knowledge Base (demo, ~15 min)
Fill in a map-reading note yourself	Section 3: AI Chat Product (10-minute exercise)
Complete one independently on a different system	Section 4: homework

1. Method and tools: four steps + three questions + one-page note

Ch. 07 goes forward: constraints → solution → diagram. This chapter goes backward: diagram → reverse-engineer constraints and trade-offs.

   Design:   constraints ──▶ solution ──▶ diagram
   Reading:  diagram ──▶ constraints ──▶ trade-offs ──▶ "would it still look this way with different constraints?"

Recognizing vs. understanding: recognizing is "there's a vector store, an orchestration layer, a GPU"; understanding means being able to answer three questions for every box (the reverse of the 02 framework):

Which quality attribute or constraint is it serving?
What did it give up?
If scale grows 100×, will this be the first thing to crack?

Four-step reading (mantra: essence → panorama → trade-offs → weak point). When reading any template in the repo, align to the sections: requirements & constraints → panorama & data flow → key decisions → bottlenecks & anti-patterns.

Step	What you're doing
① Grab the essence	Functional requirements + quality attributes + hard constraints
② See the panorama	Which path is the soul data flow (follow one request end to end)
③ Dig trade-offs	2–3 "chose X, gave up Y" decisions
④ Find the weak point	First scaling bottleneck + one common anti-pattern

Map-reading note template (fill in this one sheet for the demo, exercise, and homework):

System name: _______________

① Essence
  · One-line business purpose:
  · Quality attributes (top 2):
  · Hard constraint (1):

② Panorama
  · Soul path (one sentence):

③ Trade-offs (at least 2)
  · Chose ___ / gave up ___ / because constraint ___
  · Chose ___ / gave up ___ / because constraint ___

④ Weak point
  · What breaks first:
  · Common anti-pattern (one I've seen):

Reading AI systems — watch for three extra new constraints (all from ch. 17, recurring in the chapters ahead): cost (every call burns tokens), nondeterminism (same input may not yield the same output), context (the model's working memory is limited and expensive). When reading an AI map, add one more question to "quality attributes" and "trade-offs": is this block saving money, preventing hallucination, or managing context?

2. Demo: reading the "RAG Knowledge Base" together (~15 min)

What it is: chunk your own documents, vectorize them, and build an index; when a user asks a question, retrieve the most relevant chunks first and stuff them into the prompt so the LLM takes an "open-book exam" — answering from your material instead of making things up from training memory.
Reading goal: fill in the note template above. Follow the four steps below; no need to memorize RAG details — memorize the reading moves.

① Grab the essence → fill in the "Essence" section

	Key point
Function	Document ingestion, chunking + vectorization, retrieval (vector + keyword), reranking, assembling context to generate cited answers
Quality attributes (main line)	Retrieval quality (poor retrieval forces hallucination) · Traceability (answers can point back to source) · Cost (chunking / retrieval / generation all burn money)
Hard constraint	Retrieval quality determines the ceiling of answers; context window can't hold all material; retrieved content is untrusted input (may harbor injection)

✅ At the end of this step, you should be able to say: this isn't "asking a smarter model" — it's "letting the model take an open-book exam — and 80% of the skill is in the retrieval."

② See the panorama → fill in the "Panorama" section

Offline: documents → parse → chunk → vectorize → write to vector store (chunk vectors + source)
Online:  question → ① vector + keyword recall  ② rerank to top-K → ③ assemble prompt → LLM → cited answer

Follow one "What's our refund policy?" request: question is vectorized → vector store recalls ~20 chunks → reranking selects 3–5 chunks → packed into prompt → model writes the answer + appends "source: Policy Doc §3.2."

Key block in the diagram	What it serves
Offline indexing (chunking + vectorization)	Transforms material into retrievable form
Recall + reranking (two-stage funnel)	Retrieval quality: cast wide, then narrow
Source citations	Traceability (trustworthy, verifiable)
Treating retrieval results as untrusted text	Prevents prompt injection

✅ At the end of this step, you should be able to state the soul path: "organize material offline → find the right material online → model writes from it" — not "throw the question straight at the model."

③ Dig trade-offs → fill in the "Trade-offs" section (demo: 2 entries)

Decision	Chose	Gave up	Because
RAG vs. long context vs. fine-tuning	RAG (on-demand retrieval)	The simplicity of stuffing all material into the prompt	Material is large, changes often, requires traceability — long context is expensive and "loses things in the middle"
Pure vector vs. hybrid retrieval	Hybrid (vector + keyword) + reranking	The simpler implementation of pure vector	Vector excels at semantics but misfires on exact terms like product codes and names

(Two more you'll often notice when reading: how to chunk is the invisible dial on retrieval quality; retrieval results are always untrusted input.)

✅ At the end of this step, you should be able to write: at least 2 "chose / gave up / because" entries.

④ Find the weak point → fill in the "Weak point" section

What breaks first	Direction to fix
Poor retrieval quality → model hallucinates too	Hybrid retrieval + reranking + continuous recall evaluation (high-frequency in interviews and production)
Vector-store scale and embedding cost grow	Vector DB sharding + incremental updates (only recompute changed chunks)

Anti-pattern examples: blame the "dumb model" when retrieval is the real culprit; arbitrary chunking (too large → noisy; too small → loses context); no citations.

✅ After all four steps, you should have a complete map-reading note.

Architectural wisdom: however fancy the RAG diagram, remember one line — the ceiling of RAG = the ceiling of retrieval (garbage in, garbage out). The model is just the last step that "reads the material and writes the answer"; 80% of the skill is in "finding the right material." Every other box exists in service of that.
📎 When you need more details (chunking, reranking, Contextual Retrieval), refer to the RAG Knowledge Base template.

3. 10-minute exercise: AI Chat Product (design ↔ map reading)

This section is not a second long read. You bring half a page of material + the map-reading note template, fill it in yourself, then check against the reference answers.

Why this system? Because it's the "parent" of the AI customer-service bot we'll forward-design in the next chapter. Understand it now, and you'll be twice as fast when you start designing in ch. 19.

Material A: How the MVP is set up (summary)

	How the MVP was scoped
Scope	Must have: multi-turn conversation, streaming output · Cut (for now): RAG, own GPU, tool calls
Panorama	Client carries history → thin orchestration layer → directly calls a model-provider API → SSE streaming tokens
Cost	Pay per call, no optimization yet — first validate whether anyone uses it

Material B: Key points of the mature "production map" (summary)

	How the production map was scoped
Soul component	Inference serving (GPU) — most expensive and scarce; the entire architecture exists to "feed it, saturate it, economize it"
Panorama	Gateway (auth / rate-limiting / maintain streaming connections) → orchestration (assemble context + safety + Agent loop + billing) →
The cost-saving trio	Continuous batching (keep GPU busy), prompt caching (don't recompute repeated prefixes), model routing (send simple questions to a smaller model)
Weak point	GPU cluster saturated → TTFT spikes; longer context eats more VRAM per request; cost itself is the bottleneck

Your task (time yourself: 10 minutes)

Use Material B, and fill in an AI Chat Product map-reading note following the four steps (don't peek at the reference answers below yet).
Answer the three comparison questions (answer first, then expand).

Three comparison questions:

Why do systems like this almost always choose streaming output (SSE) rather than buffering the full answer before returning it?
Why is "cost" an architectural concern from day one here, while a typical website can "launch first, optimize later"?
The MVP calls an API directly; the mature system builds its own GPU cluster + continuous batching — was the MVP wrong?

Reference answer: AI Chat Product map-reading note (example)

System name: AI Chat Product (mature production map)

① Essence
  · One-line purpose: wrap an expensive "reasoning brain" (LLM) into a conversational,
    streaming, tool-capable product
  · Quality attributes: time-to-first-token (TTFT) must be minimal; generation throughput high;
    cost per thousand tokens as low as possible
  · Hard constraint: GPU is expensive and scarce (constraint #1); context window is limited;
    inference is stateful computation (KV cache eats VRAM)

② Panorama
  · Soul path: gateway maintains streaming connection → orchestration assembles context + safety
    → inference serving batches multiple requests to feed GPU → tokens SSE-streamed back one by one

③ Trade-offs
  · Chose streaming SSE / gave up one-shot return / because perceived latency matters more than true total latency
  · Chose continuous batching + prompt caching + model routing / gave up the simplicity of computing
    each request separately / because GPU is the #1 scarce resource and cost equals gross margin

④ Weak point
  · What breaks first: GPU cluster saturated → request queuing → TTFT spikes
  · Anti-pattern: designed like a plain CRUD site, no streaming, full history recomputed every turn
    with no caching, no token accounting

Reference answer: three comparison questions

1. Why almost always streaming? One-shot return makes users stare at a blank screen for ten-plus seconds — a UX disaster. Streaming gets the first token out in under a second, slashing perceived latency. The cost is more complex connection management and retry recovery — but it's worth it.

2. Why manage cost from day one? A typical website "costs almost nothing per extra request." Here, every 1,000 tokens generated is real GPU cost. Skip the optimizations (caching / routing / batching) and gross margin goes negative. Cost is therefore an architectural concern from day one, not an after-launch optimization.

3. Was the MVP wrong? No — the constraints were different.

Constraint	Choice
Still validating whether anyone uses it (MVP)	Call an API directly — fastest to launch, don't build your own GPU
Stable traffic, cost squeezing gross margin (mature)	Own GPU + batching + routing — drive per-token cost down

Get users first, then talk about saving money — that's the main thread of ch. 20, "Evolution Playbook."

When it conflicts with Material A (MVP)? First ask: what problem are we actually solving right now? Validating demand → call an API, perfectly reasonable; cost squeezing gross margin → then build. Neither is wrong; the constraints changed with the stage.

4. Homework: a different system, 15 minutes

Pick any unread template from templates/ (start with the AI systems):

AI Agent Platform — ch. 22 will forward-design it; read it first to get your eye in.
Inference Serving — zooms in on the "inference serving" box above.
For contrast, you can also pick a classic system (payments, online ticketing) and feel how "reading an AI system vs. a classic system — the moves are identical."

Steps as before:

Close the "key decisions" section.
Use the map-reading note template from section 1 to fill in the blanks.
Open the template to compare — differences are fine; being able to explain different constraints is what counts.

🎯 Quick check

🤔Using this chapter's four-step map-reading method on an architecture diagram — what is the correct order?

ARead the document sequentially from top to bottom, skipping no section
BGrab the essence (requirements / constraints) → see the panorama and data flow → dig key decisions → find bottlenecks and anti-patterns
COnly look at the architecture panorama diagram — once you understand the picture, you have understood the system

🤔Reading a RAG Knowledge Base architecture diagram — which component should you identify as 'the ceiling of answer quality'?

AHow large and powerful the underlying LLM is
BHow accurate the retrieval is — whether it finds the right material
CHow fancy the prompt template is written

Chapter summary

This chapter drills only reverse map-reading — the deliverable is one page of map-reading notes, not a design proposal.
Four steps + three questions: essence → panorama → trade-offs → weak point; for AI systems, also watch three extra new constraints: cost / nondeterminism / context.
RAG = the demo: soul path is "build the index offline → retrieve online → model writes from it"; in one line — the ceiling of RAG = the ceiling of retrieval.
AI Chat Product = the 10-minute exercise: half a page of material + self-filled notes + three comparison questions; focus on practicing "MVP calls an API, mature system builds its own GPU — the difference = constraints change with the stage."
Homework: pick any template (AI Agent Platform or Inference Serving recommended) and fill in the notes in 15 minutes.

Bridging forward: you can now read someone else's AI map. The next chapter, 19 · Full Design Walkthrough: Medium Complexity, has you draw the map forward — using the eight steps from ch. 07, design an AI customer-service bot that "can look up orders and process refunds" from scratch: manage the new AI constraints (prevent hallucination, burn rate, nondeterminism) while holding up the classic hard constraints (money can't be wrong). Read maps to train your eye; design to train your hand.

18 · Reading the Map: Deconstruct Unfamiliar Systems ​

1. Method and tools: four steps + three questions + one-page note ​

2. Demo: reading the "RAG Knowledge Base" together (~15 min) ​

① Grab the essence → fill in the "Essence" section ​

② See the panorama → fill in the "Panorama" section ​

③ Dig trade-offs → fill in the "Trade-offs" section (demo: 2 entries) ​

④ Find the weak point → fill in the "Weak point" section ​

3. 10-minute exercise: AI Chat Product (design ↔ map reading) ​

Material A: How the MVP is set up (summary) ​

Material B: Key points of the mature "production map" (summary) ​

Your task (time yourself: 10 minutes) ​

4. Homework: a different system, 15 minutes ​

🎯 Quick check ​

Chapter summary ​

💬 Comments