
34 · Technology Selection Decision Tree

The thesis in one line: technology selection is not choosing the strongest tool from a pile. It is pruning along requirements, constraints, stage, team capability, and exit cost. A mature selection does not prove a technology is good; it proves its benefit is worth the cost under current constraints.


🧰 Technology Stack Selection Track, Chapter 8 · Track wrap-up

The previous seven chapters covered language, database, cache/queue/events, API, deployment, observability, and AI infrastructure. This chapter adds no new tools. It gives one decision tree you can use whenever someone asks "should we use X?"


Opening: the root is not A versus B

The first question is not:

   PostgreSQL or MongoDB?
   REST or gRPC?
   PaaS or K8s?
   API or self-hosted inference?

It is:

Do we really need a new technology?

If the current stack can meet target performance, cost, reliability, and delivery time, default to keeping it. Every new technology brings learning, integration, operations, and migration cost.
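
One way to make the root question concrete is to write the targets down and compare them with measurements of the current stack. The sketch below is only an illustration: the `Targets` fields (`p99_ms`, `monthly_cost`, `availability`, `delivery_weeks`) and the function names are assumptions chosen to mirror "performance, cost, reliability, and delivery time", not a prescribed metric set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Hypothetical targets; the field names are illustrative assumptions."""
    p99_ms: float          # performance: latency ceiling
    monthly_cost: float    # cost: budget ceiling
    availability: float    # reliability: floor, e.g. 0.999
    delivery_weeks: float  # delivery time: ceiling


def unmet_targets(measured: Targets, target: Targets) -> list[str]:
    """Return the names of the targets the current stack misses."""
    misses = []
    if measured.p99_ms > target.p99_ms:
        misses.append("p99_ms")
    if measured.monthly_cost > target.monthly_cost:
        misses.append("monthly_cost")
    if measured.availability < target.availability:
        misses.append("availability")
    if measured.delivery_weeks > target.delivery_weeks:
        misses.append("delivery_weeks")
    return misses


def can_keep_current_stack(measured: Targets, target: Targets) -> bool:
    """Default to keeping the stack when no target is missed."""
    return not unmet_targets(measured, target)
```

A missed target does not by itself justify a new technology. It only means the question moves further down the tree.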

This is the same restraint as earlier chapters: monolith before microservices, workflow before Agent, hosted API before self-hosted GPU. Architects treat selection as paying a clear cost for a clear problem.


1. First cut: what stage are you in?

| Stage | Scarce resource | Selection tendency |
| --- | --- | --- |
| MVP | Validation speed | Few components, mainstream stack, managed first, low migration cost |
| Growth | Controlled scaling | Observability, gradual release, clear boundaries, local scalability |
| Scale | Efficiency and cost | Deep optimization, platformization, unit cost, automation |
| Critical | Stability and compliance | Audit, isolation, DR, SLO, incident process |

A technology that is right for a mature system can be over-engineering in an MVP. While validating, choose fewer components. While growing, choose control. Only at scale should you pay for unit-cost and throughput optimization.


2. Second cut: where will the system die first?

If the current stack is insufficient, locate the failure mode before comparing tools:

| Failure mode | Look first at |
| --- | --- |
| Data wrong or state mismatch | Data model, transaction boundary, idempotency, Outbox, reconciliation |
| Read hotspot crushes DB | Cache, read model, CDN, rate limit |
| Write spike crushes backend | Queue, backpressure, smoothing, async state |
| P99 amplified by fan-out | API boundary, timeout budget, degradation, trace |
| Releases cause incidents | Deployment platform, canary, rollback, config governance |
| Incidents are hard to locate | Metrics, logs, traces, SLO alerts |
| AI quality drifts | Eval, trace, RAG evaluation, model routing |
| Team collaboration blocks | Module boundaries, platform engineering, service ownership |

Rule: tools are the outer shell of the answer. Failure mode is the question.
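
The rule can be sketched as a lookup that is keyed by failure mode rather than by tool name. The entries come from the table above; the `FailureMode` enum and the `look_first_at` helper are hypothetical names introduced for illustration.

```python
from enum import Enum


class FailureMode(Enum):
    DATA_WRONG = "data wrong or state mismatch"
    READ_HOTSPOT = "read hotspot crushes DB"
    WRITE_SPIKE = "write spike crushes backend"
    FANOUT_P99 = "P99 amplified by fan-out"
    RISKY_RELEASES = "releases cause incidents"
    HARD_TO_LOCATE = "incidents are hard to locate"
    AI_DRIFT = "AI quality drifts"
    COLLABORATION_BLOCKED = "team collaboration blocks"


# Where to look first for each failure mode, copied from the table.
LOOK_FIRST_AT = {
    FailureMode.DATA_WRONG: ["data model", "transaction boundary", "idempotency",
                             "Outbox", "reconciliation"],
    FailureMode.READ_HOTSPOT: ["cache", "read model", "CDN", "rate limit"],
    FailureMode.WRITE_SPIKE: ["queue", "backpressure", "smoothing", "async state"],
    FailureMode.FANOUT_P99: ["API boundary", "timeout budget", "degradation", "trace"],
    FailureMode.RISKY_RELEASES: ["deployment platform", "canary", "rollback",
                                 "config governance"],
    FailureMode.HARD_TO_LOCATE: ["metrics", "logs", "traces", "SLO alerts"],
    FailureMode.AI_DRIFT: ["eval", "trace", "RAG evaluation", "model routing"],
    FailureMode.COLLABORATION_BLOCKED: ["module boundaries", "platform engineering",
                                        "service ownership"],
}


def look_first_at(mode: FailureMode) -> list[str]:
    """Return the areas to examine before any tool comparison starts."""
    return LOOK_FIRST_AT[mode]
```

The function deliberately has no parameter for a candidate tool: the input is the way the system fails, and the tool discussion starts only after that is named.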


3. Third cut: can the team operate it?

Benchmarks can look great while your team cannot run the system. Operating means:

  • Can you deploy it?
  • Can you debug it?
  • Is it monitored?
  • Who fixes it when production breaks?
  • Will upgrades break you?
  • Are enough people able to understand it?

A system with lower performance but a team that can operate it often beats a faster system nobody can repair. Technology selection is not a lab contest. It is a long-term operating contract.
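
The six questions above can be recorded as an explicit checklist so that a "no" is visible rather than assumed away. This is a minimal sketch: the `OperabilityCheck` field names and the rule that any single "no" blocks adoption are assumptions, and a team may reasonably choose a different threshold.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OperabilityCheck:
    """One flag per question in the list above; names are illustrative."""
    can_deploy: bool
    can_debug: bool
    is_monitored: bool
    has_production_owner: bool
    upgrades_are_safe: bool
    enough_people_understand_it: bool


def open_questions(check: OperabilityCheck) -> list[str]:
    """Return the names of the questions still answered 'no'."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


def team_can_operate(check: OperabilityCheck) -> bool:
    """Assumed rule: any 'no' is a blocker."""
    return not open_questions(check)
```

For a candidate that benchmarks well but that nobody has run in production, `open_questions` lists exactly what has to change before the benchmark result matters.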


4. Fourth cut: can you exit?

Mature selections have exit paths:

| Technology | Exit question |
| --- | --- |
| New database | How do we migrate the data? How do we verify dual writes? Where is the rollback point? |
| Model provider | Can the API be adapted? Can prompts and evals be reused? |
| Framework | Is business logic swallowed by the framework? Can it be layered away? |
| Message system | How do we migrate topics, schemas, and offsets? |
| Cloud platform | Can images, config, secrets, storage, and network be moved? |

A selection with no exit path locks in your future options. Before an important technology enters production, you need a spike, a rollout plan, a rollback plan, and an ADR.
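
For the model-provider row, "can the API be adapted?" usually comes down to whether business code depends on a small interface or on a vendor client directly. The sketch below illustrates that layering with entirely hypothetical names (`TextModel`, `HostedApiModel`, `SelfHostedModel`, `summarize_ticket`); the two model classes are stand-ins that return canned strings and make no network calls.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface business code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class HostedApiModel:
    """Stand-in for a hosted provider adapter (hypothetical, no real API call)."""

    def complete(self, prompt: str) -> str:
        return f"[hosted] {prompt}"


class SelfHostedModel:
    """Stand-in for a self-hosted inference adapter (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return f"[self-hosted] {prompt}"


def summarize_ticket(model: TextModel, ticket: str) -> str:
    """Business logic: knows the prompt, not the provider."""
    return model.complete(f"Summarize: {ticket}")
```

Swapping providers then means writing one new adapter and rerunning the same prompts and evals, which is the exit path the table asks about.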


5. The unified decision tree

Need a new technology?
  |
  |-- Existing stack meets target? -- yes --> keep it + local optimization
  |
  \-- no
      |
      |-- MVP? -- yes --> fewest components, fastest validation, low migration cost
      |
      \-- no
          |
          |-- What is the failure mode?
          |    |-- data/consistency -> storage and transaction boundaries
          |    |-- latency/throughput -> cache, batching, scaling
          |    |-- availability/failure -> redundancy, degradation, isolation
          |    |-- AI quality -> eval, RAG, model routing
          |    \-- team collaboration -> module boundaries, platform capability
          |
          \-- Can the team operate it and exit?
               |-- no  -> choose a lighter option
               \-- yes -> spike -> ADR -> gradual rollout
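
The same tree can be written as a small function, which makes the order of the questions hard to skip. This is a sketch under assumptions: `Situation`, `decide`, and the field names are hypothetical, and the "locate the failure mode first" branch for an unnamed failure mode is an addition that the diagram only implies.

```python
from dataclasses import dataclass
from typing import Optional

# Failure mode -> where to look, copied from the tree above.
FOCUS_BY_FAILURE_MODE = {
    "data/consistency": "storage and transaction boundaries",
    "latency/throughput": "cache, batching, scaling",
    "availability/failure": "redundancy, degradation, isolation",
    "AI quality": "eval, RAG, model routing",
    "team collaboration": "module boundaries, platform capability",
}


@dataclass
class Situation:
    existing_stack_meets_target: bool
    is_mvp: bool
    failure_mode: Optional[str] = None
    team_can_operate: bool = False
    can_exit: bool = False


def decide(s: Situation) -> str:
    """Walk the tree top to bottom and return the recommended next step."""
    if s.existing_stack_meets_target:
        return "keep it + local optimization"
    if s.is_mvp:
        return "fewest components, fastest validation, low migration cost"
    if s.failure_mode not in FOCUS_BY_FAILURE_MODE:
        # Assumed branch: no named failure mode means no tool comparison yet.
        return "locate the failure mode first"
    focus = FOCUS_BY_FAILURE_MODE[s.failure_mode]
    if not (s.team_can_operate and s.can_exit):
        return f"choose a lighter option for: {focus}"
    return f"spike -> ADR -> gradual rollout for: {focus}"
```

For example, `decide(Situation(existing_stack_meets_target=True, is_mvp=False))` returns `"keep it + local optimization"` without ever reaching the tool-specific branches.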

6. Technology selection ADR template

```md
### ADR-034: Introduce OpenTelemetry for distributed tracing

- Background: order requests cross 7 services. P99 sometimes exceeds 2s. Each service has local logs only, and one investigation takes about 3 hours.
- Goal: connect request path and per-hop latency, reducing MTTR to under 30 minutes.
- Candidates:
  - Add more logs: cheap, but cannot reliably reconstruct paths.
  - Build private tracing: flexible, but migration risk is high.
  - Use OpenTelemetry: standardized instrumentation, replaceable backend.
- Decision: use OpenTelemetry traces, starting with order, inventory, and payment paths.
- Trade-off: short-term instrumentation and sampling governance cost.
- Benefit: slow requests become traceable across services, backend remains replaceable.
- Review trigger: telemetry cost exceeds budget, or critical path coverage stays below 90%.
- Exit plan: keep standard trace context; observability backend can change; business code does not bind to one vendor SDK.
```

The format matters less than making the reason and exit explicit.
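
A review trigger is most useful when it can be checked mechanically. As a sketch, the trigger in the example ADR could be evaluated like this; the function name and parameters are assumptions, and the check looks at a single snapshot, so how long coverage must "stay" below 90% is left for the team to define.

```python
def review_triggered(telemetry_cost: float, budget: float,
                     critical_path_coverage: float) -> bool:
    """Return True when either review condition from the example ADR holds.

    critical_path_coverage is a fraction between 0 and 1.
    """
    cost_exceeded = telemetry_cost > budget
    coverage_too_low = critical_path_coverage < 0.90
    return cost_exceeded or coverage_too_low
```

With a budget of 1000, a cost of 1200 triggers a review regardless of coverage, while a cost of 800 with 92% coverage does not.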


7. One table for the whole track

| Chapter | Do not ask first | Ask first |
| --- | --- | --- |
| 27 Language/framework | Which language is more advanced? | Do team, ecosystem, runtime, and business complexity fit? |
| 28 Database/storage | Which database is strongest? | Who is the source of truth, and what is the query shape? |
| 29 Cache/queue/events | Should we use Kafka? | Is this a read hotspot, a time mismatch, or a fact broadcast? |
| 30 API/communication | REST or gRPC? | Sync or async, internal or external, how strong a contract? |
| 31 Deployment platform | Should we use K8s? | Does the team need platform capability, and can it support it? |
| 32 Observability/reliability | Which monitoring tool? | What is the user SLO, and how does incident response work? |
| 33 AI infrastructure | Should we self-host GPU? | Is the scarce resource model, context, cost, quality, or control? |

🎯 Quick check

🤔 A team wants to introduce a very new database because benchmarks are fast. The current DB has not hit a bottleneck, and nobody has production ops experience with the new one. What does the decision tree say?
  • A. Introduce it immediately; performance is the answer
  • B. Do not introduce it yet. There is no clear failure mode and the team cannot operate it, so this is buying future incidents
  • C. Move all data at once to force the team to learn

🤔 Before an important technology selection enters production, what should remain?
  • A. A note saying the technology is advanced
  • B. An ADR: selection reason, alternatives, accepted costs, review triggers, and exit plan
  • C. A passing demo is enough

Chapter summary

  • The root is "do we need new technology?" If the current stack meets targets, keep it.
  • Stage changes the answer: MVP buys speed, growth buys control, scale buys efficiency, critical systems buy stability and compliance.
  • Locate the failure mode before comparing tools: data, latency, availability, quality, and collaboration are different problems.
  • The team must be able to operate it: operability beats paper performance in production.
  • Good selection can exit: spike, ADR, gradual rollout, migration path.

Technology stack track wrap-up: these 8 chapters are not about memorizing more tool names. They train one sentence: read constraints first, select technology second; acknowledge the cost before enjoying the benefit. When you read templates/ and cases/, ask the reverse question: why this stack, and would the answer change if constraints changed?

