
32 · Observability and Reliability Stack Selection

The thesis in one line: observability is not a wall of dashboards. It is whether, during a failure, you can answer with evidence: who is affected, where is it slow, why is it slow, and should someone be woken up? Reliability stack selection gives the online system a nervous system and an immune system.


🧰 Technology Stack Selection Track, Chapter 6 · One thing to practice

Chapter 31 covered deployment. This chapter covers how you know whether production is healthy, and how you recover when it is not. Do not start from Prometheus, Grafana, ELK, or OpenTelemetry. Start from SLOs.


Opening: monitoring and observability differ

Monitoring asks:

I know what to watch. Has it crossed a threshold?

Examples: CPU above 90%, error rate above 5%, queue backlog above 10,000.

Observability asks:

I do not know where the problem will appear. Did the system leave enough evidence to investigate?

Example: one user's checkout is slow. The request crosses gateway, order, inventory, payment, and a third-party API. Where did it slow down? Is the slowness tied to one tenant, one version, one availability zone, or a missing database index?

Rule: small systems can survive on monitoring alone. Distributed systems go blind with monitoring only. The more services, calls, and releases you have, the more you need observability, not just more dashboards.


1. Work backward from SLOs

Define three things:

   SLI (Service Level Indicator)
      -> what you measure: success rate, P99 latency, error rate, availability

   SLO (Service Level Objective)
      -> your internal target: 99.9% of requests complete in under 300 ms

   Error budget
      -> allowed failure budget: if budget remains, ship; if burned, fix reliability
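
The arithmetic behind an error budget can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `error_budget`, its parameters, and the request counts in the usage note are assumptions made for the example.

```python
def error_budget(slo_target: float, total_requests: int, bad_requests: int) -> dict:
    """Summarize how much of the error budget a window has consumed.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    A "bad" request is one that violated the SLI, e.g. failed or was too slow.
    """
    # The budget is the share of requests the SLO allows to be bad.
    allowed_bad = (1.0 - slo_target) * total_requests
    remaining = allowed_bad - bad_requests
    consumed = bad_requests / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad": allowed_bad,
        "consumed_fraction": consumed,
        "remaining": remaining,
        # Budget left: keep shipping. Budget burned: prioritize reliability work.
        "can_ship": remaining > 0,
    }
```

With a 99.9% target and 1,000,000 requests in the window, roughly 1,000 bad requests are allowed. A window with 400 bad requests has used about 40% of the budget; a window with 1,200 has overspent it.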

This turns "is the system stable" from a feeling into numbers. Alerts should also work backward from the SLO:

Wake people when users are being hurt.

CPU, memory, and thread counts are candidate causes. If they do not affect the user journey, they should not directly become midnight pages.
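
One common way to derive a page from an SLO is a burn-rate check: how fast the error budget is being spent relative to plan. The sketch below assumes a two-window rule (for example a 1-hour and a 5-minute window) and a threshold of 14.4, which corresponds to spending about 2% of a 30-day budget in one hour. The function names, window choice, and threshold are assumptions for illustration, not values taken from this chapter.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Return how many times faster than planned the budget is burning.

    A burn rate of 1.0 spends exactly the whole budget over the SLO period.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so a recovered incident stops paging.
    """
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)
```

Note that CPU and memory never appear as inputs here: the decision is driven entirely by the user-facing error rate, which keeps resource signals in the role of investigation clues.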


2. Three evidence types: metrics, logs, traces

Signal | Good for | Cost
Metrics | Trends: QPS, error rate, P95/P99 latency, queue depth | Cheap, alert-friendly, low detail
Logs | Event detail: why an order failed, why auth rejected | Detailed, but costly and noisy
Traces | Full path of one request across services | Great for distributed debugging, needs sampling and context propagation

OpenTelemetry (OTel) is valuable because it decouples instrumentation from storage and query backends. You can generate telemetry in a standard way, then swap backend vendors or open-source systems with less migration cost.

Tools can change. Instrumentation habits are hard to change. Trace IDs, structured logs, and key business metrics matter more than dashboard colors.
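
The habit of attaching a trace ID to every structured log line can be sketched with the Python standard library alone. This is an illustrative sketch, not an OpenTelemetry integration: the logger name `checkout`, the `fields` convention, and the sample order data are all hypothetical.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        }
        # Extra business fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_trace_id=None) -> str:
    # Reuse the caller's trace ID when present; otherwise start a new one.
    trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    logger.info("payment declined",
                extra={"fields": {"order_id": "o-123", "reason": "card_expired"}})
    return trace_id_var.get()


buffer = io.StringIO()
logger = build_logger(buffer)
handle_request(logger, incoming_trace_id="4bf92f3577b34da6")
print(buffer.getvalue().strip())
```

Because every line is JSON with a `trace_id` field, logs from different services can be joined on that one key, whichever backend ends up storing them.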


3. Reliability is not just seeing; it is recovery

Many teams buy observability tools but reliability does not improve because:

Seeing a problem is not the same as handling it.

Reliability also needs response:

   Alerting     -> wake only people who can act
   On-call      -> clear response owner
   Runbook      -> first steps for an alert
   Incident     -> severity, communication, escalation, review
   Release gov  -> canary, rollback, feature flags, circuit breakers

So stack selection includes incident process. A serious system needs actionable alerts, service owners, runbooks for key alerts, post-incident reviews, and review actions that feed back into alerts, code, or platform.
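
One way to keep alerts actionable is to reject alert definitions that lack an owner or a runbook before they reach production. The sketch below is a hypothetical lint check: the `AlertRule` fields, the severity names, and the example URL are assumptions, not a real alerting tool's schema.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    severity: str            # "page" wakes someone, "ticket" waits for work hours
    owner: str = ""          # team or rotation that can act on the alert
    runbook_url: str = ""    # first steps for whoever responds


def validate(rule: AlertRule) -> list:
    """Return a list of problems; an empty list means the rule is acceptable."""
    problems = []
    if rule.severity not in ("page", "ticket"):
        problems.append(f"{rule.name}: unknown severity {rule.severity!r}")
    if not rule.owner:
        problems.append(f"{rule.name}: no owner, nobody can act on it")
    if rule.severity == "page" and not rule.runbook_url:
        problems.append(f"{rule.name}: paging alert without a runbook")
    return problems


good = AlertRule("checkout-slo-burn", "page", owner="payments-oncall",
                 runbook_url="https://example.com/runbooks/checkout-slo-burn")
bad = AlertRule("cpu-high", "page")
print(validate(good))
print(validate(bad))
```

Running a check like this in code review or CI turns "every page has an owner and a runbook" from a policy statement into something that is enforced.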


4. Alerts: fewer, sharper, user-centered

Low-quality alerts:

  • CPU is high.
  • Memory is high.
  • Disk is at 80%.
  • Thread count increased.

These are clues, not necessarily incidents. Better alerts focus on user symptoms:

User symptom | Actionable alert
Login fails | Login success rate below SLO
Checkout is slow | Checkout P99 above target for 10 minutes
Messages delayed | Queue backlog causes notification delay beyond promise
Search unavailable | Search error rate and empty-result rate abnormal

Architectural wisdom: do not alert "machines feel bad." Alert "users are being hurt." Noise trains people to ignore real incidents.
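
An alert such as "checkout P99 above target for 10 minutes" has two parts: a percentile per time window and a duration condition. A minimal sketch follows, assuming latencies are already grouped into one list per minute and using the nearest-rank percentile definition; the function names and the 300 ms target in the usage note are illustrative assumptions.

```python
def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list, with pct in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def sustained_breach(per_minute_latencies_ms: list,
                     target_ms: float,
                     minutes: int = 10,
                     pct: int = 99) -> bool:
    """True only if every one of the last `minutes` windows breached the target.

    A single slow minute does not fire; a minute with no traffic counts
    as "no evidence of breach", so it also prevents the alert from firing.
    """
    recent = per_minute_latencies_ms[-minutes:]
    if len(recent) < minutes:
        return False
    return all(bool(window) and percentile(window, pct) > target_ms
               for window in recent)
```

For example, ten consecutive minutes in which 2 of every 100 checkout requests take around 900 ms breach a 300 ms P99 target, while nine such minutes followed by one healthy minute do not. The duration condition is what keeps a brief spike from paging anyone.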


5. Choose by maturity

Stage | Stack tendency | Goal
MVP / small team | Managed logs + error tracking + uptime checks + a few core metrics | Someone knows when it breaks and can find rough cause
Standard online system | Metrics + structured logs + key traces + SLO alerts + runbooks | Locate, respond, roll back when users are affected
Many services / teams | OTel + correlated metrics/logs/traces + service catalog + owners | Debug across teams without shouting
Critical high-reliability path | SLO platform + canary analysis + synthetic monitoring + incident drills | Catch regressions early, limit blast radius, reduce MTTR

Watch cost: full log retention, full trace sampling, and high-cardinality labels become expensive fast. Observability is not "collect everything." It is "leave enough evidence for important questions."
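
Why high-cardinality labels get expensive is easiest to see with arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses made-up label counts, and `series_count` is a hypothetical helper name.

```python
import math


def series_count(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric.

    Each key is a label name and each value is how many distinct values
    that label can take. Real counts are often lower because not every
    combination occurs, but the bound shows how fast growth can be.
    """
    return math.prod(label_cardinalities.values())


base = {"service": 20, "endpoint": 50, "status": 5}
with_user = {**base, "user_id": 100_000}

print(series_count(base))        # 5000
print(series_count(with_user))   # 500000000
```

Adding a single per-user label multiplies the bound by the number of users. That kind of detail usually belongs in logs or traces, where it is paid for per event rather than per series.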


6. Selection table

Scenario | Better stack | Why | Watch out
MVP / internal tool | Managed logs + error tracking + uptime | Fast feedback loop | Do not self-build the full platform
Standard web app | Metrics + structured logs + SLO alerts | See errors, latency, and core business impact | Alerts must be few and precise
Microservices / many teams | OTel + unified metrics/logs/traces | Debug across services, reduce owner-hunting | Requires instrumentation standards and owner governance
Payment / transaction / core path | SLO + error budget + canary + runbook + incident process | Reliability becomes measurable and actionable | Higher cost and on-call discipline required
Huge telemetry volume / cost sensitive | Tiered storage + sampling + aggregated metrics + cardinality limits | Keep debugging value while controlling bill | Overly aggressive sampling can lose key evidence
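
The sampling trade-off in the last row can be sketched as a keep-or-drop decision made after a trace completes: always keep errors and slow requests, and keep a small deterministic share of the rest. The function name, the 1000 ms slow threshold, and the 5% rate are assumptions for illustration only.

```python
import hashlib


def keep_trace(trace_id: str,
               is_error: bool,
               duration_ms: float,
               slow_ms: float = 1000,
               sample_rate: float = 0.05) -> bool:
    """Decide whether to store a finished trace.

    Errors and slow traces are the evidence most likely to be needed,
    so they bypass sampling entirely.
    """
    if is_error or duration_ms >= slow_ms:
        return True
    # Hash the trace ID into [0, 1) so every service that sees the same
    # ID reaches the same decision without coordination.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hashing the trace ID rather than rolling a random number keeps the decision consistent across services, so a sampled request is stored along its whole path instead of in fragments.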

🎯 Quick check

🤔 A system often pages on-call because CPU reaches 90%, but user success rate and latency are normal. What is the better change?
  • A. Keep waking people; high CPU is always an incident
  • B. Move alerting toward user-impact SLOs. High CPU can be an investigation clue or a low-priority alert; pages should come from success rate, latency, and error budget symptoms
  • C. Turn off all monitoring
🤔 A microservice request crosses 6 services and sometimes has high P99 latency. Each service only has local logs, so the team cannot locate the slow hop. What should be added?
  • A. Twenty more dashboards
  • B. Distributed traces with propagated trace IDs, plus key metrics and structured logs for correlation
  • C. Print all logs as unstructured text
Chapter summary

  • Observability is not dashboards: it is enough evidence to investigate unknown problems.
  • Work backward from SLOs: define what "good" means for users, then choose metrics, logs, traces, and alerts.
  • Signals differ: metrics for trends and alerts, logs for details, traces for cross-service paths.
  • Reliability needs response: alerting, on-call, runbooks, incidents, reviews, rollback.
  • Invest by maturity: small teams need a feedback loop; large systems need unified standards and platform support.

Next: general systems need to be visible and recoverable. AI systems add model behavior, context, retrieval quality, cost, and evals. Chapter 33 moves stack selection into the LLM era.

