
33 · AI Infrastructure Technology Stack Selection

The thesis in one line: AI infrastructure selection is not about wiring every hot tool together. First identify your scarce resource: model capability, GPU, context, retrieval quality, cost, or controllability. The goal is not a list of tool names; it is to contain AI system risk with the fewest components.


🧰 Technology Stack Selection Track, Chapter 7 · One thing to practice

Chapter 17 introduced nondeterminism, context, and agentic hard parts. Chapter 22 designed AI-native systems. This chapter asks how to choose the AI stack underneath: when a hosted API is enough, and when model gateways, self-hosted inference, GPUs, or eval platforms become justified.


Opening: what are you actually self-building?

When people hear AI infrastructure, they often list:

  • GPU (compute commonly used for model training/inference)
  • Vector database
  • Agent framework
  • Model gateway
  • Inference engine
  • Eval platform

But the first question is:

Are you building infrastructure, or just building an AI application?

For an early product, the sane MVP default is usually a hosted model API plus minimal logging and cost tracking. Sink lower into gateways, self-hosted inference, and GPU pools only when a clear trigger signal appears: costs run out of control, data cannot leave your boundary, latency misses its target, a provider outage is unacceptable, or model customization is necessary.

Architectural wisdom: lower-level AI infrastructure is not automatically more advanced. It gives control, but also hands you cost, capacity planning, failure recovery, security isolation, and operations.
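
As a rough illustration of the MVP default above (hosted API plus minimal logs and cost tracking), here is a minimal sketch. The names `tracked_call`, `UsageLog`, and `fake_model`, the shape of the `call_model` return value, and the prices are all assumptions made for this example; they are not any provider's real SDK or pricing.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical USD prices per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K_USD = {"input": 0.0005, "output": 0.0015}


@dataclass
class UsageRecord:
    feature: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float


@dataclass
class UsageLog:
    records: List[UsageRecord] = field(default_factory=list)

    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)


def tracked_call(
    log: UsageLog,
    feature: str,
    prompt: str,
    call_model: Callable[[str], Tuple[str, int, int]],
) -> str:
    """Call the model once and record tokens, latency, and estimated cost.

    call_model stands in for a provider SDK call and is assumed to return
    (text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    text, tokens_in, tokens_out = call_model(prompt)
    latency = time.perf_counter() - start
    cost = (
        tokens_in / 1000 * PRICE_PER_1K_USD["input"]
        + tokens_out / 1000 * PRICE_PER_1K_USD["output"]
    )
    log.records.append(UsageRecord(feature, tokens_in, tokens_out, latency, cost))
    return text


def fake_model(prompt: str) -> Tuple[str, int, int]:
    # Test double: counts whitespace-separated words instead of real tokens.
    return "ok", len(prompt.split()), 1
```

Tagging each record with a `feature` name is what later makes "cost invisible" answerable per feature, without introducing a gateway yet.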


1. Four layers of an AI stack

   Entry governance:
      AI Gateway / auth / rate limit / cost logging / model routing

   Context:
      RAG / vector store / document permissions / rerank / citations

   Inference:
      inference serving / GPU / KV cache / batching

   Guardrail:
      observability / eval / trace / human approval

You do not need all four from day one. Add the layer that addresses the risk you can see:

  • Cost invisible? Add gateway or usage logs.
  • Retrieval quality limits answers? Add RAG evals.
  • Many teams call models? Add a model gateway.
  • Self-hosted GPU cost lower than API cost? Consider self-hosted inference.
  • Agent can take actions? Add permissions, human approval, and audit.
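
The "add only what the visible risk requires" rule can be sketched as a lookup. The risk labels below are hypothetical names for the bullets above, and the assignment of each component to one of the four layers is this example's reading of the diagram, not something the chapter states explicitly.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Layer(Enum):
    ENTRY_GOVERNANCE = "entry governance"
    CONTEXT = "context"
    INFERENCE = "inference"
    GUARDRAIL = "guardrail"


# Hypothetical risk labels mapped to (layer, component to add).
RISK_TO_COMPONENT = {
    "cost_invisible": (Layer.ENTRY_GOVERNANCE, "gateway or usage logs"),
    "retrieval_limits_answers": (Layer.GUARDRAIL, "RAG evals"),
    "many_teams_call_models": (Layer.ENTRY_GOVERNANCE, "model gateway"),
    "self_hosting_cheaper_than_api": (Layer.INFERENCE, "self-hosted inference"),
    "agent_takes_actions": (Layer.GUARDRAIL, "permissions, human approval, audit"),
}


def layers_to_add(observed_risks: Iterable[str]) -> Dict[Layer, List[str]]:
    """Return only the components justified by risks actually observed."""
    plan: Dict[Layer, List[str]] = {}
    for risk in observed_risks:
        layer, component = RISK_TO_COMPONENT[risk]
        plan.setdefault(layer, []).append(component)
    return plan
```

An empty list of observed risks yields an empty plan, which is the point: no visible risk, no new layer.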

2. API or self-hosted inference is a cost/control trade

| Choice | Strengths | Costs |
|---|---|---|
| Hosted model API | Fast, stable, low ops, updated models | Vendor lock-in, data path limits, unit cost at scale |
| Self-hosted inference | Control model, data, cost structure, deployment | GPU, memory, batching, scaling, failures, capacity |
| Hybrid routing | Cheap model for simple tasks, stronger model for hard tasks | Routing, evals, fallback, cost accounting |
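
To make the hybrid-routing row concrete, here is a minimal sketch of "cheap model first, stronger model as fallback". The word-count heuristic in `is_simple`, the `looks_ok` answer check, and the model callables are placeholders assumed for illustration; a real router would rely on evals to decide what counts as simple or acceptable.

```python
from typing import Callable, Tuple

ModelFn = Callable[[str], str]


def is_simple(prompt: str, max_words: int = 30) -> bool:
    # Placeholder heuristic (assumption): short prompts count as simple.
    return len(prompt.split()) <= max_words


def route(
    prompt: str,
    cheap: ModelFn,
    strong: ModelFn,
    looks_ok: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (model_used, answer).

    Simple prompts try the cheap model first. The strong model is used when
    the prompt is not simple, the cheap call raises, or the answer check fails.
    """
    if is_simple(prompt):
        try:
            answer = cheap(prompt)
            if looks_ok(answer):
                return "cheap", answer
        except Exception:
            # A real system would log this failure before falling back.
            pass
    return "strong", strong(prompt)
```

Returning the model name alongside the answer keeps cost accounting possible, which the table lists as one of the costs of this choice.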

Do not ask which is more advanced. Ask:

  1. Must data stay inside the boundary?
  2. Does the hosted API miss the latency target?
  3. Is volume high enough that self-hosting is cheaper?
  4. Can the team operate GPU serving?

One yes is usually not enough. Two or three yes answers start to justify going lower.
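
A sketch of this check, with each field phrased so that True points toward self-hosting. The field names, the threshold of two, the 730-hour month, and all prices in the usage note are assumptions for illustration, not real quotes.

```python
from dataclasses import dataclass


@dataclass
class SelfHostSignals:
    data_must_stay_inside: bool
    hosted_latency_misses_target: bool
    self_host_cheaper_at_volume: bool
    team_can_operate_gpus: bool


def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def monthly_self_host_cost(
    gpu_count: int,
    usd_per_gpu_hour: float,
    ops_usd_per_month: float,
    hours_per_month: int = 730,
) -> float:
    # GPUs are billed whether or not they are busy, plus people and tooling.
    return gpu_count * usd_per_gpu_hour * hours_per_month + ops_usd_per_month


def recommend(signals: SelfHostSignals, threshold: int = 2) -> str:
    """One signal is usually not enough; two or more start to justify it."""
    score = sum(
        [
            signals.data_must_stay_inside,
            signals.hosted_latency_misses_target,
            signals.self_host_cheaper_at_volume,
            signals.team_can_operate_gpus,
        ]
    )
    return "consider self-hosting" if score >= threshold else "stay on hosted API"
```

With made-up numbers, 2 billion tokens a month at 1 USD per million tokens costs 2,000 USD through the API, while two GPUs at 2 USD per hour plus 5,000 USD of monthly ops cost 7,920 USD, so the cost signal would be False at that volume.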


3. RAG, long context, and fine-tuning solve different problems

| Route | Solves | Fits |
|---|---|---|
| RAG | Retrieve external knowledge at answer time | Many documents, frequent updates, citations, permission filtering |
| Long context | Put lots of material into one request | Small one-off material, fits in context |
| Fine-tuning | Change stable behavior, style, or format | Stable output format, strong samples, domain style |

Common mistake: poor retrieval gets blamed on the model, or a knowledge-update problem gets sent to fine-tuning.

Rule: use retrieval for knowledge; use fine-tuning for behavior. If you need citations, updates, and permissions, make RAG work first.
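
The rule can be written as a small decision function. The field names and the order of the checks are assumptions of this sketch; it simply encodes "knowledge needs go to retrieval first, behavior needs go to fine-tuning".

```python
from dataclasses import dataclass


@dataclass
class Need:
    needs_citations: bool = False
    updates_frequently: bool = False
    needs_permission_filtering: bool = False
    changes_behavior_or_format: bool = False
    material_fits_in_context: bool = False


def choose_route(need: Need) -> str:
    # Knowledge-side requirements: make RAG work first.
    if need.needs_citations or need.updates_frequently or need.needs_permission_filtering:
        return "RAG"
    # Stable behavior, style, or output format is a fine-tuning problem.
    if need.changes_behavior_or_format:
        return "fine-tuning"
    # Small one-off material can simply be placed in the request.
    if need.material_fits_in_context:
        return "long context"
    # Remaining case: a corpus too large for the context window.
    return "RAG"
```

A need that mixes knowledge and behavior returns "RAG" here, reflecting the advice to get retrieval working before considering fine-tuning.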


4. Agent framework: workflow first, autonomy later

Use the Chapter 22 fork:

   Can a deterministic workflow solve it?
        |-- yes -> workflow first
        \-- no  -> Agent, with permissions, budget, human approval, trace, eval

If the flow is fixed, such as "look up order -> check refund rules -> call refund API -> send notification," do not rush to an autonomous Agent. A workflow is more predictable, testable, and auditable. An Agent becomes useful when steps are open-ended, tools are many, and the task needs dynamic planning.
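
The refund flow quoted above, written as a deterministic workflow. This is a sketch: the `Order` fields, the 30-day rule, and the three injected callables (`lookup_order`, `call_refund_api`, `notify`) are hypothetical stand-ins for real services.

```python
from dataclasses import dataclass, field
from typing import Callable, List

REFUND_WINDOW_DAYS = 30  # Hypothetical business rule.


@dataclass
class Order:
    order_id: str
    amount: float
    days_since_purchase: int


@dataclass
class RefundResult:
    approved: bool = False
    steps: List[str] = field(default_factory=list)


def refund_workflow(
    order_id: str,
    lookup_order: Callable[[str], Order],
    call_refund_api: Callable[[Order], None],
    notify: Callable[[Order, bool], None],
) -> RefundResult:
    """Run the fixed steps in order and record each one for audit."""
    result = RefundResult()

    order = lookup_order(order_id)
    result.steps.append("look up order")

    eligible = order.days_since_purchase <= REFUND_WINDOW_DAYS
    result.steps.append("check refund rules")

    if eligible:
        call_refund_api(order)
        result.steps.append("call refund API")
        result.approved = True

    notify(order, result.approved)
    result.steps.append("send notification")
    return result
```

Because the step order is fixed in code, each branch can be covered by an ordinary unit test, which is the predictability the paragraph above refers to.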

Agent selection should focus on:

  • Can tool permissions be tiered?
  • Is human approval supported?
  • Are all steps traceable?
  • Can budget and max steps be limited?
  • Are context compaction and task recovery supported?
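
Several of these questions (tiered permissions, human approval, traceable steps, budget and max-step limits) can be pictured as one guard that every tool call passes through. This is a sketch under assumed names (`AgentGuard`, `Tier`); it does not reflect any particular Agent framework's API and does not cover context compaction or task recovery.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List, Optional, Tuple


class Tier(IntEnum):
    READ = 1
    WRITE = 2
    IRREVERSIBLE = 3


@dataclass
class AgentGuard:
    max_steps: int
    budget_usd: float
    auto_approve_up_to: Tier = Tier.READ
    steps: int = 0
    spent_usd: float = 0.0
    trace: List[Tuple[str, str, bool, Optional[str]]] = field(default_factory=list)

    def authorize(
        self,
        tool: str,
        tier: Tier,
        est_cost_usd: float,
        human_approves: Callable[[str], bool],
    ) -> bool:
        """Allow or deny one tool call, and record the decision either way."""
        reason: Optional[str] = None
        if self.steps >= self.max_steps:
            reason = "max steps reached"
        elif self.spent_usd + est_cost_usd > self.budget_usd:
            reason = "budget exceeded"
        elif tier > self.auto_approve_up_to and not human_approves(tool):
            reason = "human approval denied"

        allowed = reason is None
        if allowed:
            self.steps += 1
            self.spent_usd += est_cost_usd
        self.trace.append((tool, tier.name, allowed, reason))
        return allowed
```

Denied calls are traced as well as allowed ones, so an audit can show what the Agent attempted, not only what it did.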

5. Guardrail layer is production baseline

AI systems differ from conventional services because their output is unstable and quality drifts. After launch, API success rate alone is not enough. You need to see:

  • Prompt and context.
  • Retrieved chunks.
  • Model cost.
  • Tool authorization.
  • Whether the final answer has citations or hallucinations.
  • Whether model/prompt/retrieval changes regress quality.

That is Chapter 25's eval discipline. If the system touches money, user data, or automatic actions, eval is not a future nice-to-have. Without eval, every model or prompt change is a blind release.
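
A minimal sketch of an eval gate that keeps a model or prompt change from being a blind release. The two cases in `EVAL_SET` and the substring check are illustrative assumptions; real evals usually need richer scoring than string matching.

```python
from typing import Callable, Dict, List

# Hypothetical eval cases; a real set would come from production traces.
EVAL_SET: List[Dict[str, str]] = [
    {"question": "What is the refund window?", "must_contain": "30 days"},
    {"question": "Which plan includes SSO?", "must_contain": "Enterprise"},
]


def pass_rate(answer_fn: Callable[[str], str], cases: List[Dict[str, str]]) -> float:
    """Fraction of cases whose answer contains the expected phrase."""
    passed = sum(
        1
        for case in cases
        if case["must_contain"].lower() in answer_fn(case["question"]).lower()
    )
    return passed / len(cases)


def safe_to_release(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    cases: List[Dict[str, str]],
    max_drop: float = 0.0,
) -> bool:
    """Block the change if the candidate scores lower than baseline minus max_drop."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - max_drop
```

Running `safe_to_release` with the current system as `baseline` and the changed system as `candidate` before each model, prompt, or retrieval change turns "did quality regress?" into a check that can run in CI.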


6. AI stack selection table

| Question | Starting choice | Upgrade trigger | Cost reminder |
|---|---|---|---|
| Just validating product? | Hosted model API + basic logs | Many apps share usage, cost invisible, provider outage hurts | Do not self-host GPU first |
| Many models/teams? | AI Gateway | Need auth, rate limit, billing, failover | Gateway sits on critical path |
| Need private knowledge? | RAG + simple vector retrieval | Retrieval quality unstable, permissions complex, corpus grows | RAG ceiling is retrieval quality |
| Small vector scale? | pgvector / single-node vector search | Millions of vectors, complex filters, tight latency | Dedicated vector DB is new ops surface |
| Model cost high? | Model routing + cache + quotas | API cost exceeds self-host total cost, data cannot leave | Self-hosting means GPU and memory ops |
| Need automatic action? | Deterministic workflow | Open-ended steps, dynamic planning required | Agent needs permissions, budget, human approval |
| Need stable iteration? | Trace + small eval set | Production, money/data, frequent model changes | No eval, no reliable upgrade |

🎯 Quick check

🤔 A team is building an AI customer-service MVP. Daily calls are low and user demand is not validated. What is the best AI infrastructure choice?
  • A. Buy GPUs, self-host inference, add distributed vector DB and multi-Agent platform immediately
  • B. Start with hosted model API, basic logs and cost tracking, and the fewest components needed to validate the business loop
  • C. Fine-tune a model first, otherwise it is not real AI infrastructure

🤔 Documents are many, frequently updated, and answers must cite sources. Which route should be preferred?
  • A. RAG, because it retrieves current material at answer time and can point back to sources
  • B. Fine-tuning, because all knowledge should be trained into the model
  • C. Only increase context window and paste all documents every time

Chapter summary

  • AI infrastructure starts with the scarce resource: model, GPU, context, retrieval quality, cost, controllability.
  • Hosted API is the default start unless data, cost, latency, customization, or availability force an upgrade.
  • Think in four layers: entry governance, context, inference, guardrails. Add what risk requires.
  • Use RAG for knowledge and fine-tuning for behavior.
  • Production AI needs guardrails: traces show the path, evals guard quality.

Next: language, data, middle layers, API, deployment, observability, and AI infrastructure are now on the table. Chapter 34 turns them into one decision tree.

