32 · Observability and Reliability Stack Selection
The thesis in one line: observability is not a wall of dashboards. It is whether, during a failure, you can answer with evidence: who is affected, where is it slow, why is it slow, and should someone be woken up? Reliability stack selection gives the online system a nervous system and an immune system.
🧰 Technology Stack Selection Track, Chapter 6 · One thing to practice
Chapter 31 covered deployment. This chapter covers how you know whether production is healthy, and how you recover when it is not. Do not start from Prometheus, Grafana, ELK, or OpenTelemetry. Start from SLOs.
Opening: monitoring and observability differ
Monitoring asks:
I know what to watch. Has it crossed a threshold?
Examples: CPU above 90%, error rate above 5%, queue backlog above 10,000.
Observability asks:
I do not know where the problem will appear. Did the system leave enough evidence to investigate?
Example: one user's checkout is slow. The request crosses gateway, order, inventory, payment, and a third-party API. Where did it slow down? A tenant? Version? Availability zone? Database index?
Rule: small systems can survive on monitoring alone. Distributed systems go blind if monitoring is all they have. The more services, calls, and releases you have, the more you need observability, not just more dashboards.
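To make the difference concrete, here is a minimal sketch of the kind of question observability has to answer: "which dimension explains the slow requests?" The event fields (`tenant`, `version`, `zone`, `duration_ms`), the helper name `slow_rate_by`, and all values are hypothetical; a real system would query them from its telemetry backend rather than build them in memory.

```python
from collections import defaultdict
from itertools import product


def slow_rate_by(events, dimension, threshold_ms):
    """Fraction of requests slower than threshold_ms, grouped by one attribute."""
    totals = defaultdict(int)
    slow = defaultdict(int)
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if event["duration_ms"] > threshold_ms:
            slow[key] += 1
    return {key: slow[key] / totals[key] for key in totals}


# Hypothetical request events: every tenant/version/zone combination once,
# with version "v42" made slow on purpose for the illustration.
events = [
    {"tenant": t, "version": v, "zone": z, "duration_ms": 2400 if v == "v42" else 180}
    for t, v, z in product(("acme", "globex"), ("v41", "v42"), ("az-1", "az-2"))
]

for dimension in ("tenant", "version", "zone"):
    print(dimension, slow_rate_by(events, dimension, threshold_ms=1000))
```

In this made-up data only the version split separates cleanly, which points at the release rather than a tenant or a zone. A threshold alert on average latency would say "slow" but could not say that.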
1. Work backward from SLOs
Define three things:
SLI (Service Level Indicator)
-> what you measure: success rate, P99 latency, error rate, availability
SLO (Service Level Objective)
-> your internal target: 99.9% of requests complete in under 300 ms
Error budget
-> allowed failure budget: if budget remains, ship; if burned, fix reliability
This turns "is the system stable" from a feeling into numbers. Alerts should also work backward from SLOs:
Wake people when users are being hurt.
CPU, memory, and thread counts are candidate causes. If they do not affect the user journey, they should not directly become midnight pages.
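The error-budget rule can be sketched with a little arithmetic. This is an illustration under stated assumptions, not a prescribed implementation: the names `BudgetReport` and `error_budget_report`, the 99.9% target, and the request counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    allowed_bad: float  # bad requests the SLO tolerates in the window
    consumed: float     # fraction of the budget already spent (1.0 = exhausted)
    can_ship: bool      # True while budget remains


def error_budget_report(slo_target: float, total: int, bad: int) -> BudgetReport:
    """Compare observed bad requests against the budget implied by the SLO.

    slo_target is the fraction of requests that must be good (e.g. 0.999).
    A "bad" request is one that failed or exceeded the latency threshold.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0 or bad < 0 or bad > total:
        raise ValueError("need total > 0 and 0 <= bad <= total")
    allowed_bad = (1.0 - slo_target) * total
    consumed = bad / allowed_bad
    return BudgetReport(allowed_bad, consumed, can_ship=consumed < 1.0)


# Hypothetical 30-day window: 10 million requests, 4,000 of them bad.
report = error_budget_report(slo_target=0.999, total=10_000_000, bad=4_000)
print(f"allowed={report.allowed_bad:.0f} consumed={report.consumed:.0%} ship={report.can_ship}")
```

With these assumed numbers the budget is roughly 10,000 bad requests and about 40% of it is spent, so shipping continues. The same window with 12,000 bad requests would flip `can_ship` to `False`.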
2. Three evidence types: metrics, logs, traces
| Signal | Good for | Cost |
|---|---|---|
| Metrics | Trends: QPS, error rate, P95/P99 latency, queue depth | Cheap, alert-friendly, low detail |
| Logs | Event detail: why an order failed, why auth rejected | Detailed, but costly and noisy |
| Traces | Full path of one request across services | Great for distributed debugging, needs sampling and context propagation |
OpenTelemetry (OTel) is valuable because it decouples instrumentation from storage and query backends. You generate telemetry in a standard way, then swap backend vendors or open-source systems at a lower migration cost.
Tools can change. Instrumentation habits are hard to change. Trace IDs, structured logs, and key business metrics matter more than dashboard colors.
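As one possible illustration of the habit, the sketch below emits structured logs that carry a trace ID, using only the Python standard library. It is not the OpenTelemetry API: in a real system an OTel SDK would normally create and propagate the trace context. The logger name `checkout`, the `fields` key, the order data, and the trace ID value are all assumptions made for the example.

```python
import contextvars
import io
import json
import logging
import uuid
from typing import Optional

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so a log backend can index fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        }
        # Fields passed via extra={"fields": {...}} become top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_trace_id: Optional[str], order_id: str) -> str:
    """Reuse the caller's trace ID if present; otherwise start a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info(
            "payment declined",
            extra={"fields": {"order_id": order_id, "reason": "card_expired"}},
        )
    finally:
        trace_id_var.reset(token)
    return trace_id


buffer = io.StringIO()
log = build_logger(buffer)
# Hypothetical trace ID, as if received from an upstream service.
handle_request(log, incoming_trace_id="4bf92f3577b34da6a3ce929d0e0e4736", order_id="o-123")
print(buffer.getvalue().strip())
```

Because every line is a JSON object with a `trace_id` key, a log query for one ID can be lined up with the trace for the same request, whatever backend stores them.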
3. Reliability is not just seeing; it is recovery
Many teams buy observability tools but reliability does not improve because:
Seeing a problem is not the same as handling it.
Reliability also needs response:
Alerting -> wake only people who can act
On-call -> clear response owner
Runbook -> first steps for an alert
Incident -> severity, communication, escalation, review
Release gov -> canary, rollback, feature flags, circuit breakers
So stack selection includes incident process. A serious system needs actionable alerts, service owners, runbooks for key alerts, post-incident reviews, and review actions that feed back into alerts, code, or platform.
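One way to enforce "actionable alerts with owners and runbooks" is a small check that runs in CI against the alert definitions. The sketch below is an assumption about how such a check might look: `AlertRule`, `lint_alert_rules`, the team name, and the runbook URL are hypothetical and not tied to any particular alerting product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlertRule:
    name: str
    pages: bool                        # True if this alert wakes someone up
    owner: Optional[str] = None        # team or rotation that responds
    runbook_url: Optional[str] = None  # first steps for the responder


def lint_alert_rules(rules: List[AlertRule]) -> List[str]:
    """Return a list of problems; empty means every paging alert is actionable."""
    problems = []
    for rule in rules:
        if not rule.pages:
            continue  # non-paging alerts are clues, so they are not checked here
        if not rule.owner:
            problems.append(f"{rule.name}: paging alert has no owner")
        if not rule.runbook_url:
            problems.append(f"{rule.name}: paging alert has no runbook")
    return problems


# Hypothetical rule set for illustration.
rules = [
    AlertRule("checkout-p99-slo", pages=True, owner="payments-oncall",
              runbook_url="https://runbooks.example.com/checkout-p99"),
    AlertRule("login-success-slo", pages=True),
    AlertRule("cpu-high", pages=False),
]

for problem in lint_alert_rules(rules):
    print(problem)
```

Here the login alert is flagged twice (no owner, no runbook), while the CPU alert passes because it does not page anyone.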
4. Alerts: fewer, sharper, user-centered
Low-quality alerts:
- CPU is high.
- Memory is high.
- Disk is at 80%.
- Thread count increased.
These are clues, not necessarily incidents. Better alerts focus on user symptoms:
| User symptom | Actionable alert |
|---|---|
| Login fails | Login success rate below SLO |
| Checkout is slow | Checkout P99 above target for 10 minutes |
| Messages delayed | Queue backlog causes notification delay beyond promise |
| Search unavailable | Search error rate and empty-result rate abnormal |
Architectural wisdom: do not alert "machines feel bad." Alert "users are being hurt." Noise trains people to ignore real incidents.
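The "Checkout P99 above target for 10 minutes" row can be sketched as a sustained-breach check. This is a simplified illustration, assuming raw latency samples are grouped per minute. A production system would usually evaluate this in its metrics backend. The function names and the 300 ms target are assumptions.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_page(latencies_by_minute: List[List[float]], target_ms: float,
                sustain_minutes: int = 10) -> bool:
    """Page only if P99 exceeded the target in each of the last N minutes.

    Minutes with no traffic count as healthy, so an idle service cannot page.
    """
    if len(latencies_by_minute) < sustain_minutes:
        return False
    recent = latencies_by_minute[-sustain_minutes:]
    return all(bool(samples) and percentile(samples, 99) > target_ms for samples in recent)


# Hypothetical one-minute buckets of checkout latencies in milliseconds.
healthy = [120.0] * 99 + [250.0]    # P99 is 120 ms
slow = [120.0] * 90 + [900.0] * 10  # P99 is 900 ms

short_spike = [healthy] * 7 + [slow] * 3
sustained = [healthy] * 2 + [slow] * 10

print(should_page(short_spike, target_ms=300.0))
print(should_page(sustained, target_ms=300.0))
```

The three-minute spike does not page; the ten-minute breach does. Requiring the condition to hold for a window is what keeps a brief blip from waking someone.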
5. Choose by maturity
| Stage | Stack tendency | Goal |
|---|---|---|
| MVP / small team | Managed logs + error tracking + uptime checks + a few core metrics | Someone knows when it breaks and can find rough cause |
| Standard online system | Metrics + structured logs + key traces + SLO alerts + runbooks | Locate, respond, roll back when users are affected |
| Many services / teams | OTel + correlated metrics/logs/traces + service catalog + owners | Debug across teams without shouting |
| Critical high-reliability path | SLO platform + canary analysis + synthetic monitoring + incident drills | Catch regressions early, limit blast radius, reduce MTTR |
Watch cost: full log retention, full trace sampling, and high-cardinality labels become expensive fast. Observability is not "collect everything." It is "leave enough evidence for important questions."
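"Leave enough evidence for important questions" can be sketched as a keep-or-drop decision made once a trace has finished. The version below is only an illustration: `keep_trace`, the 1000 ms slow threshold, and the 5% baseline rate are assumptions, and real tracing backends offer their own sampling configuration.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, duration_ms: float,
               slow_threshold_ms: float = 1000.0, baseline_rate: float = 0.05) -> bool:
    """Decide whether to retain a finished trace.

    Errors and slow requests are always kept because they answer the
    important questions; everything else is sampled at baseline_rate.
    """
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    # Hash the trace ID so the same trace always gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < baseline_rate


# Hypothetical traffic: 10,000 fast, successful requests.
kept = sum(keep_trace(f"trace-{i}", is_error=False, duration_ms=80.0) for i in range(10_000))
print(f"kept {kept} of 10000 ordinary traces")
print(keep_trace("trace-err", is_error=True, duration_ms=80.0))
```

Roughly one in twenty ordinary traces survives, while every failed or slow request is retained, so the bill shrinks without discarding the traces most likely to be needed in an investigation.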
6. Selection table
| Scenario | Better stack | Why | Watch out |
|---|---|---|---|
| MVP / internal tool | Managed logs + error tracking + uptime | Fast feedback loop | Do not self-build the full platform |
| Standard web app | Metrics + structured logs + SLO alerts | See errors, latency, and core business impact | Alerts must be few and precise |
| Microservices / many teams | OTel + unified metrics/logs/traces | Debug across services, reduce owner-hunting | Requires instrumentation standards and owner governance |
| Payment / transaction / core path | SLO + error budget + canary + runbook + incident process | Reliability becomes measurable and actionable | Higher cost and on-call discipline required |
| Huge telemetry volume / cost sensitive | Tiered storage + sampling + aggregated metrics + cardinality limits | Keep debugging value while controlling the bill | Sampling too aggressively can lose key evidence |
🎯 Quick check
- A. Keep waking people; high CPU is always an incident
- B. Move alerting toward user-impact SLOs. CPU high can be an investigation clue or low-priority alert; pages should come from success rate, latency, and error budget symptoms
- C. Turn off all monitoring

- A. Twenty more dashboards
- B. Distributed traces with propagated trace IDs, plus key metrics and structured logs for correlation
- C. Print all logs as unstructured text
Chapter summary
- Observability is not dashboards: it is enough evidence to investigate unknown problems.
- Work backward from SLOs: define user-good, then metrics/logs/traces/alerts.
- Signals differ: metrics for trends and alerts, logs for details, traces for cross-service paths.
- Reliability needs response: alerting, on-call, runbooks, incidents, reviews, rollback.
- Invest by maturity: small teams need a feedback loop; large systems need unified standards and platform support.
Next: general systems need to be visible and recoverable. AI systems add model behavior, context, retrieval quality, cost, and evals. Chapter 33 moves stack selection into the LLM era.
Related links
- Method core: 06 · Quality attributes · 12 · Designing for failure · 13 · Mechanics of scale
- AI collaboration: 24 · Review checklist · 25 · Eval-driven
- Cases: StarArena · SyncRoom · CodePilot