32 · Observability and Reliability Stack Selection

The thesis in one line: observability is not a wall of dashboards. It is whether, during a failure, you can answer with evidence: who is affected, where is it slow, why is it slow, and should someone be woken up? Reliability stack selection gives the online system a nervous system and an immune system.

🧰 Technology Stack Selection Track, Chapter 6 · One thing to practice
Chapter 31 covered deployment. This chapter covers how you know whether production is healthy, and how you recover when it is not. Do not start from Prometheus, Grafana, ELK, or OpenTelemetry. Start from SLOs.

Opening: monitoring and observability differ

Monitoring asks:

I know what to watch. Has it crossed a threshold?

Examples: CPU above 90%, error rate above 5%, queue backlog above 10,000.

Observability asks:

I do not know where the problem will appear. Did the system leave enough evidence to investigate?

Example: one user's checkout is slow. The request crosses gateway, order, inventory, payment, and a third-party API. Where did it slow down? Is the problem tied to one tenant, one version, one availability zone, or one database index?

Rule: small systems can survive on monitoring alone. Distributed systems go blind with monitoring only. The more services, calls, and releases you have, the more you need observability, not just more dashboards.

1. Work backward from SLOs

Define three things:

   SLI (Service Level Indicator)
      -> what you measure: success rate, P99 latency, error rate, availability

   SLO (Service Level Objective)
      -> your internal target: 99.9% of requests complete in under 300 ms

   Error budget
      -> allowed failure budget: if budget remains, ship; if burned, fix reliability

This turns "is the system stable" from a feeling into numbers. Alerts should also work backward from the SLO:

Wake people when users are being hurt.

CPU, memory, and thread counts are candidate causes. If they do not affect the user journey, they should not directly become midnight pages.
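The error budget idea above can be sketched as simple arithmetic. This is a minimal illustration, not a standard API: `SloWindow`, `error_budget_remaining`, and `release_decision` are hypothetical names, and the request counts are assumed to come from whatever metrics backend you already use.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (for example, the last 30 days)."""
    total_requests: int
    good_requests: int  # requests that met the SLI, e.g. succeeded in under 300 ms


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    allowed_bad = window.total_requests * (1.0 - slo_target)
    actual_bad = window.total_requests - window.good_requests
    if allowed_bad == 0:
        # No traffic or a 100% target: any failure means the budget is gone.
        return 0.0 if actual_bad else 1.0
    return 1.0 - actual_bad / allowed_bad


def release_decision(window: SloWindow, slo_target: float) -> str:
    """Ship while budget remains; switch to reliability work once it is burned."""
    remaining = error_budget_remaining(window, slo_target)
    return "ship" if remaining > 0 else "fix reliability"
```

With 1,000,000 requests and a 99.9% target, about 1,000 requests may fail. If 600 have failed, roughly 40% of the budget is left and the team keeps shipping; at 1,500 failures the budget is overspent and reliability work takes priority.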

2. Three evidence types: metrics, logs, traces

Signal	Good for	Cost
Metrics	Trends: QPS, error rate, P95/P99 latency, queue depth	Cheap, alert-friendly, low detail
Logs	Event detail: why an order failed, why auth rejected a request	Detailed, but costly and noisy
Traces	Full path of one request across services	Great for distributed debugging, needs sampling and context propagation

OpenTelemetry / OTel is valuable because it decouples instrumentation from storage/query backends. You can generate telemetry in a standard way, then swap backend vendors or open-source systems with less migration cost.

Tools can change. Instrumentation habits are hard to change. Trace IDs, structured logs, and key business metrics matter more than dashboard colors.
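As one illustration of the instrumentation habits mentioned above, here is a minimal sketch of structured logs that carry a trace ID, using only the Python standard library. It does not use the OpenTelemetry SDK; `current_trace_id`, `JsonFormatter`, `handle_request`, and the `tenant` field are hypothetical names chosen for this example.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# Holds the trace ID for the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        }
        # Extra business fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def handle_request(logger: logging.Logger,
                   incoming_trace_id: Optional[str] = None) -> str:
    """Reuse the caller's trace ID if present, otherwise start a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logger.info("checkout started", extra={"fields": {"tenant": "acme"}})
        return trace_id
    finally:
        # Restore the previous value so IDs never leak between requests.
        current_trace_id.reset(token)
```

Because every log line for a request shares one `trace_id`, logs from different services can be joined to the matching trace, provided each service forwards the ID to the next hop.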

3. Reliability is not just seeing; it is recovery

Many teams buy observability tools, yet reliability does not improve, because:

Seeing a problem is not the same as handling it.

Reliability also needs response:

   Alerting     -> wake only people who can act
   On-call      -> clear response owner
   Runbook      -> first steps for an alert
   Incident     -> severity, communication, escalation, review
   Release gov  -> canary, rollback, feature flags, circuit breakers

So stack selection includes incident process. A serious system needs actionable alerts, service owners, runbooks for key alerts, post-incident reviews, and review actions that feed back into alerts, code, or platform.
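One way to make "actionable alerts, service owners, runbooks for key alerts" checkable is a small lint over alert definitions. The sketch below is an assumption about how such a check could look: `AlertRule` and `unactionable_pages` are hypothetical names, and in practice the rules would be loaded from your alerting configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    pages: bool            # True if this alert wakes someone up
    owner: str = ""        # team or rotation that can act on it
    runbook_url: str = ""  # first steps for the responder


def unactionable_pages(rules: list[AlertRule]) -> list[str]:
    """Return names of paging alerts that lack an owner or a runbook."""
    return [
        rule.name
        for rule in rules
        if rule.pages and not (rule.owner and rule.runbook_url)
    ]
```

Running a check like this in CI keeps a new page from reaching production without someone to answer it and a first step to take. Non-paging alerts are deliberately exempt, since they are investigation clues rather than wake-up calls.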

4. Alerts: fewer, sharper, user-centered

Low-quality alerts:

CPU is high.
Memory is high.
Disk is at 80%.
Thread count increased.

These are clues, not necessarily incidents. Better alerts focus on user symptoms:

User symptom	Actionable alert
Login fails	Login success rate below SLO
Checkout is slow	Checkout P99 above target for 10 minutes
Messages delayed	Queue backlog causes notification delay beyond promise
Search unavailable	Search error rate and empty-result rate abnormal

Architectural wisdom: do not alert "machines feel bad." Alert "users are being hurt." Noise trains people to ignore real incidents.
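An alert such as "Checkout P99 above target for 10 minutes" can be sketched as a sustained-breach check. This is a simplified illustration that assumes latency samples are already grouped into one-minute windows; `p99` and `should_page` are hypothetical helpers, and real alerting systems usually express the same idea as a rule with a hold duration.

```python
import math


def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of one window's latency samples."""
    if not latencies_ms:
        # No traffic in this window: treat it as not breaching.
        return 0.0
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def should_page(per_minute_latencies: list[list[float]],
                target_ms: float,
                sustain_minutes: int = 10) -> bool:
    """Page only if every one of the most recent windows breaches the target."""
    if len(per_minute_latencies) < sustain_minutes:
        return False
    recent = per_minute_latencies[-sustain_minutes:]
    return all(p99(window) > target_ms for window in recent)
```

Requiring the breach to hold for the whole period filters out one-minute spikes, which is one way to keep pages few and tied to sustained user pain.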

5. Choose by maturity

Stage	Stack tendency	Goal
MVP / small team	Managed logs + error tracking + uptime checks + a few core metrics	Someone knows when it breaks and can find rough cause
Standard online system	Metrics + structured logs + key traces + SLO alerts + runbooks	Locate, respond, roll back when users are affected
Many services / teams	OTel + correlated metrics/logs/traces + service catalog + owners	Debug across teams without shouting
Critical high-reliability path	SLO platform + canary analysis + synthetic monitoring + incident drills	Catch regressions early, limit blast radius, reduce MTTR

Watch cost: full log retention, full trace sampling, and high-cardinality labels become expensive fast. Observability is not "collect everything." It is "leave enough evidence for important questions."
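"Leave enough evidence for important questions" often translates into a sampling policy. The sketch below shows one possible policy under stated assumptions: the decision is made after the request finishes (tail-based), every error and slow trace is kept, and the rest are kept at a fixed rate. `keep_trace`, the 1000 ms threshold, and the 5% rate are hypothetical choices, not recommendations.

```python
import hashlib


def keep_trace(trace_id: str,
               is_error: bool,
               duration_ms: float,
               slow_threshold_ms: float = 1000.0,
               sample_rate: float = 0.05) -> bool:
    """Keep every error and slow trace; keep a deterministic fraction of the rest."""
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    # Hash the trace ID so every service makes the same decision for a trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Hashing the trace ID, rather than calling a random number generator, means a trace is either kept in full or dropped in full across services, so sampled traces are never missing hops.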

6. Selection table

Scenario	Better stack	Why	Watch out
MVP / internal tool	Managed logs + error tracking + uptime	Fast feedback loop	Do not self-build the full platform
Standard web app	Metrics + structured logs + SLO alerts	See errors, latency, and core business impact	Alerts must be few and precise
Microservices / many teams	OTel + unified metrics/logs/traces	Debug across services, reduce owner-hunting	Requires instrumentation standards and owner governance
Payment / transaction / core path	SLO + error budget + canary + runbook + incident process	Reliability becomes measurable and actionable	Higher cost and on-call discipline required
Huge telemetry volume / cost sensitive	Tiered storage + sampling + aggregated metrics + cardinality limits	Keep debugging value while controlling the bill	Overly aggressive sampling can drop key evidence

🎯 Quick check

🤔 A system often wakes the on-call engineer because CPU reaches 90%, but user success rate and latency are normal. What is the better change?

A. Keep waking people; high CPU is always an incident
B. Move alerting toward user-impact SLOs. High CPU can be an investigation clue or a low-priority alert; pages should come from success rate, latency, and error budget symptoms
C. Turn off all monitoring

🤔 A microservice request crosses 6 services and sometimes has high P99 latency. Each service only has local logs, so the team cannot locate the slow hop. What should be added?

A. Twenty more dashboards
B. Distributed traces with propagated trace IDs, plus key metrics and structured logs for correlation
C. Print all logs as unstructured text

Chapter summary

Observability is not dashboards: it is enough evidence to investigate unknown problems.
Work backward from SLOs: define what "good" means for users first, then choose metrics, logs, traces, and alerts.
Signals differ: metrics for trends and alerts, logs for details, traces for cross-service paths.
Reliability needs response: alerting, on-call, runbooks, incidents, reviews, rollback.
Invest by maturity: small teams need a feedback loop; large systems need unified standards and platform support.

Next: general systems need to be visible and recoverable. AI systems add model behavior, context, retrieval quality, cost, and evals. Chapter 33 moves stack selection into the LLM era.

Method core: 06 · Quality attributes · 12 · Designing for failure · 13 · Mechanics of scale
AI collaboration: 24 · Review checklist · 25 · Eval-driven
Cases: StarArena · SyncRoom · CodePilot
