33 · AI Infrastructure Technology Stack Selection
The thesis in one line: AI infrastructure selection is not wiring every hot tool together. First identify your scarce resource: model capability, GPU, context, retrieval quality, cost, or controllability. The goal is not tool names; it is using the fewest components to contain AI system risk.
🧰 Technology Stack Selection Track, Chapter 7 · One thing to practice
Chapter 17 introduced nondeterminism, context, and agentic hard parts. Chapter 22 designed AI-native systems. This chapter asks how to choose the AI stack underneath: when a hosted API is enough, and when model gateways, self-hosted inference, GPUs, or eval platforms become justified.
Opening: what are you actually self-building?
When people hear AI infrastructure, they often list:
- GPU (compute commonly used for model training/inference)
- Vector database
- Agent framework
- Model gateway
- Inference engine
- Eval platform
But the first question is:
Are you building infrastructure, or just building an AI application?
For an early product, the sane MVP default is usually a hosted model API plus minimal logs and cost tracking. Sink lower into gateways, self-hosted inference, and GPU pools only when a clear trigger signal appears: cost runs out of control, data cannot leave your boundary, latency misses its target, a provider outage is unacceptable, or model customization is necessary.
Architectural wisdom: lower-level AI infrastructure is not automatically more advanced. It gives control, but also hands you cost, capacity planning, failure recovery, security isolation, and operations.
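To make "minimal logs and cost tracking" concrete, here is a small sketch of a usage log that sits next to a hosted API call. The model names, the prices in `PRICE_PER_1K`, and the `UsageLog` class are all hypothetical; real token counts would come from whatever usage data your provider returns.

```python
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1,000 tokens: (input, output).
# Real prices differ by provider and model; treat these as placeholders.
PRICE_PER_1K = {
    "small-model": (0.0005, 0.0015),
    "large-model": (0.01, 0.03),
}


@dataclass
class UsageLog:
    """In-memory usage log; a real system would write to a database or log stream."""
    records: list = field(default_factory=list)

    def record(self, feature: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Store one call and return its estimated cost in USD."""
        in_price, out_price = PRICE_PER_1K[model]
        cost = input_tokens / 1000 * in_price + output_tokens / 1000 * out_price
        self.records.append({
            "feature": feature,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
        })
        return cost

    def cost_by_feature(self) -> dict:
        """Sum estimated cost per product feature."""
        totals: dict = {}
        for r in self.records:
            totals[r["feature"]] = totals.get(r["feature"], 0.0) + r["cost"]
        return totals


log = UsageLog()
log.record("search", "large-model", input_tokens=1000, output_tokens=1000)
log.record("search", "small-model", input_tokens=2000, output_tokens=0)
print(log.cost_by_feature())
```

A per-feature total like this is often enough to notice the cost trigger before any gateway exists.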
1. Four layers of an AI stack
Entry governance:
AI Gateway / auth / rate limit / cost logging / model routing
Context:
RAG / vector store / document permissions / rerank / citations
Inference:
inference serving / GPU / KV cache / batching
Guardrail:
observability / eval / trace / human approval
You do not need all four from day one. Add the layer that addresses the risk you can see:
- Cost invisible? Add gateway or usage logs.
- Retrieval quality limits answers? Add RAG evals.
- Many teams call models? Add a model gateway.
- API spend exceeds the total cost of running your own GPUs? Consider self-hosted inference.
- Agent can take actions? Add permissions, human approval, and audit.
2. API or self-hosted inference is a cost/control trade
| Choice | Strengths | Costs |
|---|---|---|
| Hosted model API | Fast, stable, low ops, updated models | Vendor lock-in, data path limits, unit cost at scale |
| Self-hosted inference | Control model, data, cost structure, deployment | GPU, memory, batching, scaling, failures, capacity |
| Hybrid routing | Cheap model for simple tasks, stronger model for hard tasks | Routing, evals, fallback, cost accounting |
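The hybrid routing row can be sketched as a small function, assuming the two models are plain callables and that difficulty is judged by a caller-supplied heuristic. The names `route`, `cheap`, `strong`, and `is_hard` are hypothetical, and the length-based heuristic in the usage lines is only a stand-in for a real classifier or eval-derived rule.

```python
from typing import Callable


def route(prompt: str,
          cheap: Callable[[str], str],
          strong: Callable[[str], str],
          is_hard: Callable[[str], bool]) -> tuple[str, str]:
    """Return (route_taken, answer).

    Hard prompts go straight to the strong model. Simple prompts try the
    cheap model first and fall back to the strong model if it raises.
    """
    if is_hard(prompt):
        return "strong", strong(prompt)
    try:
        return "cheap", cheap(prompt)
    except Exception:
        # Fallback path: this is part of the routing cost listed in the table.
        return "strong-fallback", strong(prompt)


# Stub models for illustration only.
def cheap_model(prompt: str) -> str:
    return f"cheap answer to: {prompt}"


def strong_model(prompt: str) -> str:
    return f"strong answer to: {prompt}"


# Hypothetical heuristic: treat long prompts as hard.
print(route("What are your opening hours?", cheap_model, strong_model,
            is_hard=lambda p: len(p.split()) > 50))
```

Even this tiny version needs evals to check that the heuristic sends the right prompts to the cheap model, which is why the table lists evals and cost accounting as costs.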
Do not ask which is more advanced. Ask:
- Must data stay inside your boundary?
- Does the hosted API miss your latency target?
- Is volume high enough that self-hosting is cheaper?
- Can the team operate GPU serving?
One yes is usually not enough. Two or three yes answers start to justify going lower.
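The volume question comes down to arithmetic. The sketch below compares a monthly API bill with a monthly self-hosting bill and computes the break-even token volume. Every number is a made-up assumption for illustration (price per million tokens, GPU hourly rate, operations overhead, 730 hours per month); substitute your own figures.

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Hosted API bill for a month at a flat blended token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def monthly_self_host_cost(gpu_count: int, usd_per_gpu_hour: float,
                           ops_usd_per_month: float,
                           hours_per_month: float = 730) -> float:
    """GPU rental plus people/ops overhead. Assumes GPUs run all month."""
    return gpu_count * usd_per_gpu_hour * hours_per_month + ops_usd_per_month


def break_even_tokens(self_host_usd_per_month: float,
                      usd_per_million_tokens: float) -> float:
    """Monthly token volume at which the API bill equals the self-host bill."""
    return self_host_usd_per_month / usd_per_million_tokens * 1_000_000


# Hypothetical inputs: 2 billion tokens/month, $5 per million tokens,
# 2 GPUs at $2/hour, $8,000/month of engineering and on-call time.
api = monthly_api_cost(2_000_000_000, 5.0)
own = monthly_self_host_cost(gpu_count=2, usd_per_gpu_hour=2.0, ops_usd_per_month=8000.0)
print(f"API: ${api:,.0f}  self-host: ${own:,.0f}")
print(f"break-even: {break_even_tokens(own, 5.0):,.0f} tokens/month")
```

With these placeholder numbers the API is still cheaper, mostly because of the operations overhead. The sketch also ignores whether the assumed GPUs can actually serve the volume at the required latency, which is a separate capacity question.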
3. RAG, long context, and fine-tuning solve different problems
| Route | Solves | Fits |
|---|---|---|
| RAG | Retrieve external knowledge at answer time | Many documents, frequent updates, citations, permission filtering |
| Long context | Put all the material into a single request | One-off material small enough to fit the context window |
| Fine-tuning | Change stable behavior, style, or format | Stable output format, strong samples, domain style |
Common mistake: poor retrieval gets blamed on the model, or a knowledge-update problem gets sent to fine-tuning.
Rule: use retrieval for knowledge; use fine-tuning for behavior. If you need citations, updates, and permissions, make RAG work first.
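A toy retrieval step shows why citations and permissions come naturally with RAG. This sketch uses bag-of-words cosine similarity instead of real embeddings, and the documents, group names, and the `retrieve` function are all hypothetical. The point is the order of operations: filter by permission first, then rank, then return ids that the answer can cite.

```python
import math
from collections import Counter

# Hypothetical corpus: each document carries the groups allowed to read it.
DOCS = [
    {"id": "refund-policy", "text": "Refunds are allowed within 30 days of purchase",
     "groups": {"support", "finance"}},
    {"id": "salary-bands", "text": "Salary bands for engineering levels",
     "groups": {"hr"}},
    {"id": "shipping-faq", "text": "Shipping takes five business days",
     "groups": {"support"}},
]


def _vec(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for an embedding model."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, user_groups: set, docs: list, k: int = 2) -> list:
    """Return up to k readable documents, best match first, with ids for citation."""
    q = _vec(query)
    # Permission filter runs before ranking, so forbidden text never reaches the prompt.
    allowed = [d for d in docs if d["groups"] & user_groups]
    scored = [(_cosine(q, _vec(d["text"])), d) for d in allowed]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"id": d["id"], "text": d["text"], "score": s} for s, d in scored[:k]]


for hit in retrieve("refunds within 30 days", {"support"}, DOCS):
    print(hit["id"], round(hit["score"], 2))
```

Updating knowledge here means editing `DOCS`, not retraining anything, which is the practical reason to try retrieval before fine-tuning.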
4. Agent framework: workflow first, autonomy later
Use the Chapter 22 fork:
Can a deterministic workflow solve it?
|-- yes -> workflow first
\-- no -> Agent, with permissions, budget, human approval, trace, eval
If the flow is fixed, such as "look up order -> check refund rules -> call refund API -> send notification," do not rush to an autonomous Agent. Workflow is more predictable, testable, and auditable. Agent becomes useful when steps are open, tools are many, and the task needs dynamic planning.
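The fixed refund flow can be written as ordinary code with no model in the loop. The sketch below assumes an in-memory `ORDERS` table, a 30-day refund window, and injected `refund_api` and `notify` callables; all of these are hypothetical stand-ins for real services.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    amount: float
    days_since_purchase: int


# Stand-in for an order service lookup.
ORDERS = {
    "A1": Order("A1", 40.0, 10),
    "B2": Order("B2", 40.0, 45),
}

REFUND_WINDOW_DAYS = 30  # hypothetical business rule


def refund_workflow(order_id: str, refund_api, notify) -> str:
    """look up order -> check refund rules -> call refund API -> send notification."""
    order = ORDERS.get(order_id)
    if order is None:
        return "order_not_found"
    if order.days_since_purchase > REFUND_WINDOW_DAYS:
        notify(order_id, "refund denied: outside refund window")
        return "denied"
    refund_api(order_id, order.amount)
    notify(order_id, "refund issued")
    return "refunded"


calls = []
status = refund_workflow(
    "A1",
    refund_api=lambda oid, amount: calls.append(("refund", oid, amount)),
    notify=lambda oid, msg: calls.append(("notify", oid, msg)),
)
print(status, calls)
```

Every branch here can be unit-tested and audited, which is the predictability the text refers to.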
Agent selection should focus on:
- Can tool permissions be tiered?
- Is human approval supported?
- Are all steps traceable?
- Can budget and max steps be limited?
- Are context compaction and task recovery supported?
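The first four checklist items can be illustrated with a small guard around tool execution. This is a sketch, not any particular framework's API: `Tool`, `GuardedRunner`, the two permission tiers, and the per-call `cost` units are hypothetical, and the plan is passed in as a list where a real agent would produce steps from a model.

```python
from dataclasses import dataclass, field
from typing import Callable

READ, WRITE = "read", "write"  # hypothetical permission tiers


@dataclass
class Tool:
    name: str
    tier: str                    # READ runs freely, WRITE needs human approval
    cost: float                  # budget units charged per call
    run: Callable[[dict], str]


@dataclass
class GuardedRunner:
    tools: dict
    budget: float
    max_steps: int
    approve: Callable[[str, dict], bool]   # human approval hook
    trace: list = field(default_factory=list)
    spent: float = 0.0

    def execute(self, plan: list) -> str:
        """Run (tool_name, args) steps until done or a limit stops the run."""
        for step, (name, args) in enumerate(plan):
            if step >= self.max_steps:
                self.trace.append({"tool": name, "status": "max_steps"})
                return "max_steps"
            tool = self.tools[name]
            if self.spent + tool.cost > self.budget:
                self.trace.append({"tool": name, "status": "budget"})
                return "budget"
            if tool.tier == WRITE and not self.approve(name, args):
                self.trace.append({"tool": name, "status": "rejected"})
                return "rejected"
            self.spent += tool.cost
            result = tool.run(args)
            self.trace.append({"tool": name, "status": "ok", "result": result})
        return "done"


tools = {
    "lookup_order": Tool("lookup_order", READ, 1.0, lambda a: "order found"),
    "issue_refund": Tool("issue_refund", WRITE, 1.0, lambda a: "refund issued"),
}
runner = GuardedRunner(tools, budget=5.0, max_steps=3,
                       approve=lambda name, args: False)  # reviewer says no
print(runner.execute([("lookup_order", {}), ("issue_refund", {})]))
print(runner.trace)
```

Context compaction and task recovery are not shown; they depend heavily on the framework being evaluated.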
5. Guardrail layer is production baseline
AI systems differ from conventional services because output is nondeterministic and quality drifts over time. After launch, API success rate alone is not enough. You need to see:
- Prompt and context.
- Retrieved chunks.
- Model cost.
- Tool authorization.
- Whether the final answer has citations or hallucinations.
- Whether model/prompt/retrieval changes regress quality.
That is Chapter 25's eval discipline. If the system touches money, user data, or automatic actions, eval is not a future nice-to-have. Without eval, every model or prompt change is a blind release.
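A release gate over a small eval set is one way to avoid the blind release. The sketch below is an assumption-heavy minimum: the two eval cases, the substring check in `pass_rate`, and the `release_gate` function are hypothetical, and real evals usually need richer scoring than "answer contains this phrase."

```python
from typing import Callable

# Hypothetical eval set: each case names a phrase the answer must contain.
EVAL_SET = [
    {"question": "What is the refund window?", "must_include": "30 days"},
    {"question": "How long does shipping take?", "must_include": "five business days"},
]


def pass_rate(answer_fn: Callable[[str], str], eval_set: list) -> float:
    """Fraction of cases whose answer contains the required phrase."""
    passed = sum(
        1 for case in eval_set
        if case["must_include"].lower() in answer_fn(case["question"]).lower()
    )
    return passed / len(eval_set)


def release_gate(candidate_fn: Callable[[str], str], baseline_rate: float,
                 eval_set: list, tolerance: float = 0.0) -> bool:
    """Allow the release only if the candidate does not regress past the tolerance."""
    return pass_rate(candidate_fn, eval_set) >= baseline_rate - tolerance


# Stub "systems" standing in for model + prompt + retrieval configurations.
def current_system(question: str) -> str:
    if "refund" in question.lower():
        return "Refunds are accepted within 30 days."
    return "Shipping takes five business days."


def candidate_system(question: str) -> str:
    return "Refunds are accepted within 30 days."  # regressed on shipping


baseline = pass_rate(current_system, EVAL_SET)
print("baseline:", baseline)
print("release candidate?", release_gate(candidate_system, baseline, EVAL_SET))
```

Running the same gate on every model, prompt, or retrieval change turns "did quality regress?" into a check that can block a deploy.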
6. AI stack selection table
| Question | Starting choice | Upgrade trigger | Cost reminder |
|---|---|---|---|
| Just validating product? | Hosted model API + basic logs | Many apps share usage, cost invisible, provider outage hurts | Do not self-host GPU first |
| Many models/teams? | AI Gateway | Need auth, rate limit, billing, failover | Gateway sits on critical path |
| Need private knowledge? | RAG + simple vector retrieval | Retrieval quality unstable, permissions complex, corpus grows | RAG ceiling is retrieval quality |
| Small vector scale? | pgvector / single-node vector search | Millions of vectors, complex filters, tight latency | Dedicated vector DB is new ops surface |
| Model cost high? | Model routing + cache + quotas | API cost exceeds self-host total cost, data cannot leave | Self-hosting means GPU and memory ops |
| Need automatic action? | Deterministic workflow | Open-ended steps, dynamic planning required | Agent needs permissions, budget, human approval |
| Need stable iteration? | Trace + small eval set | Production, money/data, frequent model changes | No eval, no reliable upgrade |
🎯 Quick check
- A. Buy GPUs, self-host inference, add distributed vector DB and multi-Agent platform immediately
- B. Start with hosted model API, basic logs and cost tracking, and the fewest components needed to validate the business loop
- C. Fine-tune a model first, otherwise it is not real AI infrastructure
- A. RAG, because it retrieves current material at answer time and can point back to sources
- B. Fine-tuning, because all knowledge should be trained into the model
- C. Only increase context window and paste all documents every time
Chapter summary
- AI infrastructure starts with the scarce resource: model, GPU, context, retrieval quality, cost, controllability.
- Hosted API is the default start unless data, cost, latency, customization, or availability force an upgrade.
- Think in four layers: entry governance, context, inference, guardrails. Add what risk requires.
- Use RAG for knowledge and fine-tuning for behavior.
- Production AI needs guardrails: traces show the path, evals guard quality.
Next: language, data, middle layers, API, deployment, observability, and AI infrastructure are now on the table. Chapter 34 turns them into one decision tree.
Related links
- AI method: 17 · Architecting in the age of LLMs · 22 · AI-native design · 25 · Eval-driven
- Templates: AI Gateway · Inference Serving · RAG Knowledge Base · Vector Database
- Cases: DocuMind · CodePilot