Observability MOC

Navigation hub for 9 observability pattern notes across two sections: three pillars (logging, metrics, tracing) and reliability engineering (SLO, alerting, runbooks, incidents, on-call, post-mortems). See Design-Patterns-MOC for the root vault entry point.


Three Pillars

| Pattern | Intent | Use When |
| --- | --- | --- |
| Structured-Logging | JSON-first structured logging with a mandatory field schema (timestamp, level, service, traceId, correlationId), PII exclusion, and async context propagation via MDC. | Any service that produces log output: structured JSON supports machine parsing, correlation, and centralized log aggregation at scale in a way free-text logs do not |
| Metrics-and-Dashboards | Quantitative measurement using the RED (request rate, error rate, duration), USE (utilization, saturation, errors), and Four Golden Signals frameworks, with dashboard design principles. | Monitoring service health, capacity planning, and SLO tracking; choose the framework by system type: RED for request-driven services, USE for infrastructure resources |
| Distributed-Tracing-Patterns | Sampling strategies (head-based, tail-based, probabilistic) and async context propagation failure modes; these are pattern-level decisions, distinct from the OTel SDK implementation in Distributed-Tracing. | Understanding request flow across service boundaries; debugging latency in distributed systems; choosing a sampling strategy based on traffic volume and cost constraints |

Three pillars selection guide: All three pillars are complementary, not alternatives — a production system needs all three. Logs tell you WHAT happened (event details), metrics tell you HOW MUCH is happening (aggregated counts/durations), traces tell you WHERE time was spent (request path across services). Start with structured logging (lowest barrier), add metrics for alerting and dashboards, add distributed tracing when debugging cross-service latency. Distributed-Tracing-Patterns covers sampling and propagation decisions; Distributed-Tracing covers OTel SDK implementation — read both.
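As a starting point for the "structured logging first" advice above, here is a minimal sketch of a JSON log line carrying the mandatory fields named in the Structured-Logging row. It is written under several assumptions: Python's `contextvars` stands in for MDC, the `PII_KEYS` deny-list and the `fields` key passed through `extra` are hypothetical names chosen for illustration, and the real schema lives in the Structured-Logging note.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Request-scoped context. contextvars plays the role MDC plays on the JVM
# (an assumption of this sketch, not something the notes prescribe).
trace_id = contextvars.ContextVar("trace_id", default=None)
correlation_id = contextvars.ContextVar("correlation_id", default=None)

# Hypothetical deny-list; a real PII policy would be more thorough.
PII_KEYS = {"email", "password", "ssn"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the mandatory fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "traceId": trace_id.get(),
            "correlationId": correlation_id.get(),
            "message": record.getMessage(),
        }
        # Optional fields arrive via extra={"fields": {...}}. Drop PII keys
        # and never let a caller overwrite a mandatory field.
        extra = getattr(record, "fields", {})
        entry.update(
            {k: v for k, v in extra.items()
             if k not in PII_KEYS and k not in entry}
        )
        return json.dumps(entry)


def build_logger(service, stream):
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


# Usage: set the context once per request, then log as usual.
demo_stream = io.StringIO()
demo_log = build_logger("checkout", demo_stream)
trace_id.set("abc123")
correlation_id.set("req-42")
demo_log.info("order placed", extra={"fields": {"orderId": 7, "email": "a@b.c"}})
print(demo_stream.getvalue().strip())
```

Because the trace and correlation IDs are read from context at format time, call sites never pass them explicitly, which is the same property MDC gives JVM services.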


Reliability Engineering

| Pattern | Intent | Use When |
| --- | --- | --- |
| SLO-SLI-SLA | Service Level Objectives defined from SLI ratios (good events / valid events), error budget derivation, burn rate computation, and multi-window alerting as the prerequisite for actionable alerts. | Defining reliability targets and error budgets; prerequisite for all alerting, because without SLOs alerts lack a threshold grounded in user impact |
| Alerting-Strategies | Symptom-based alerting with multi-window multi-burn-rate rules (fast page, slow page, fast ticket, slow ticket), a severity taxonomy, and the principle of alerting on symptoms, not causes. | Configuring alerts that wake humans only for user-facing impact; reducing alert fatigue by eliminating cause-based pages |
| Runbook-Design | Runbook anatomy (context, diagnostics, remediation, escalation) with a decision tree structure, distinct from playbooks (investigative, no predetermined steps). | Creating operational documentation that on-call engineers follow during incidents; every alert should link to a runbook |
| Incident-Response | Incident commander role (coordination, not fixing), severity classification (P1-P4), communication cadence, and an Excalidraw command structure diagram. | Handling production incidents with clear roles and communication patterns: the commander coordinates, domain experts fix |
| On-Call-Practices | Rotation design, escalation policy, and on-call health measurement via toil percentage (target < 50%) and MTTA (target < 5 min for P1). | Building sustainable on-call rotations; measuring and reducing operational toil; preventing burnout |
| Post-Mortem | Blameless post-mortem process: timeline reconstruction, 5-whys root cause analysis, the counterfactual test ("would a different competent engineer have done the same?"), and action items classified as Prevention/Detection/Mitigation. | Learning from incidents without blame; producing action items that improve the system |

Reliability engineering selection guide: The reliability engineering workflow follows a specific chain: define SLOs first (what does "reliable enough" mean?), then configure alerts against those SLOs (burn rate), then write runbooks for each alert, then establish incident response for when runbooks are insufficient, then run post-mortems to learn and improve. SLO-SLI-SLA is the foundation — without SLI ratios and error budgets, alerting thresholds are arbitrary. Alerting-Strategies depends on SLO burn rates — read SLO-SLI-SLA first. On-Call-Practices is the human sustainability layer — if toil exceeds 50% over 3 rotations, trigger a reliability sprint.
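The SLO-to-alert step of this chain can be made concrete with a little arithmetic. The sketch below assumes the usual definitions: the SLI is good events / valid events, the error budget is 1 minus the SLO, and the burn rate is the observed error rate divided by the error budget. The `Window` type, the function names, and the 14.4 threshold are illustrative assumptions (the threshold is in the style of the Site Reliability Workbook); the actual rule set belongs to Alerting-Strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Event counts observed over one lookback window."""
    good: int
    valid: int

    @property
    def error_rate(self) -> float:
        # SLI = good / valid, so the error rate is 1 - SLI.
        if self.valid == 0:
            return 0.0  # no traffic consumes no budget
        return 1.0 - self.good / self.valid


def burn_rate(window: Window, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    error_budget = 1.0 - slo
    return window.error_rate / error_budget


def should_alert(long_w: Window, short_w: Window, slo: float,
                 threshold: float) -> bool:
    """Multi-window rule: fire only if BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets soon after recovery.
    """
    return (burn_rate(long_w, slo) >= threshold
            and burn_rate(short_w, slo) >= threshold)


SLO = 0.999                # 99.9% target, so the error budget is 0.1%
FAST_PAGE_THRESHOLD = 14.4  # illustrative value, an assumption of this sketch

# 2% of requests failing in both windows: burn rate is about 20x.
ongoing = should_alert(Window(98_000, 100_000), Window(9_800, 10_000),
                       SLO, FAST_PAGE_THRESHOLD)

# Same long window, but the short window has fully recovered.
recovered = should_alert(Window(98_000, 100_000), Window(10_000, 10_000),
                         SLO, FAST_PAGE_THRESHOLD)

print(ongoing, recovered)
```

Requiring both windows is what keeps the rule from paging on an incident that has already ended while the long window is still elevated.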


Observability Concern Selector

Is the concern about understanding what individual requests or events are doing?
  • Yes, structured event details (who, what, when, context): Structured-Logging
  • Yes, request flow and latency across service boundaries: Distributed-Tracing-Patterns (patterns) + Distributed-Tracing (OTel SDK)
  • No: Continue

Is the concern about aggregate system health and capacity?
  • Yes, request-driven service (rate, errors, duration): Metrics-and-Dashboards (RED framework)
  • Yes, infrastructure resource (CPU, memory, disk): Metrics-and-Dashboards (USE framework)
  • No: Continue

Is the concern about defining what "reliable enough" means?
  • Yes, setting reliability targets from user-facing SLIs: SLO-SLI-SLA
  • Yes, computing error budget and burn rate for alerting thresholds: SLO-SLI-SLA
  • No: Continue

Is the concern about when and how to alert humans?
  • Yes, configuring alerts that fire on user impact, not internal causes: Alerting-Strategies
  • No: Continue

Is the concern about what to do when an alert fires?
  • Yes, step-by-step diagnostic and remediation documentation: Runbook-Design
  • No: Continue

Is the concern about handling a production incident?
  • Yes, roles, severity classification, communication cadence: Incident-Response
  • No: Continue

Is the concern about sustainable on-call operations?
  • Yes, rotation design, escalation policy, toil measurement: On-Call-Practices
  • No: Continue

Is the concern about learning from incidents?
  • Yes, blameless analysis, 5-whys, action item classification: Post-Mortem
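For the aggregate-health branch of the selector, the RED framing (rate, errors, duration) reduces to three numbers per time window. The following is a rough sketch only: the `Request` record and `red_summary` function are hypothetical names, the median stands in for whatever latency percentile a dashboard would actually chart, and a real service would export these through a metrics library rather than compute them from raw records.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """One completed request as seen by the service (hypothetical record)."""
    duration_ms: float
    ok: bool


@dataclass(frozen=True)
class RedSummary:
    rate_per_s: float                    # R: requests per second
    error_ratio: float                   # E: failed / total
    median_duration_ms: Optional[float]  # D: a simple latency summary


def red_summary(requests: Sequence[Request],
                window_seconds: float) -> RedSummary:
    """Aggregate raw request records into RED numbers for one window."""
    total = len(requests)
    if total == 0:
        # No traffic: rate and error ratio are zero, duration is undefined.
        return RedSummary(0.0, 0.0, None)
    errors = sum(1 for r in requests if not r.ok)
    return RedSummary(
        rate_per_s=total / window_seconds,
        error_ratio=errors / total,
        median_duration_ms=median(r.duration_ms for r in requests),
    )


sample = [
    Request(10.0, True),
    Request(20.0, True),
    Request(30.0, False),
    Request(40.0, True),
]
print(red_summary(sample, window_seconds=2.0))
```

The USE framing for infrastructure resources would need different inputs (utilization and saturation gauges rather than per-request records), which is why the selector splits the two cases.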


Cross-Domain Lineage Chains

| Chain | Lineage |
| --- | --- |
| Three Pillars to Reliability | Structured-Logging -> Distributed-Tracing-Patterns -> SLO-SLI-SLA -> Alerting-Strategies (structured logs feed trace correlation, trace data informs SLI measurement, SLO burn rates drive alerting rules) |
| Metrics to Alerting | Metrics-and-Dashboards -> Micrometer -> SLO-SLI-SLA -> Alerting-Strategies -> Runbook-Design (metrics collected via Micrometer define SLIs, SLO burn rates trigger alerts, each alert links to a runbook) |
| Incident Lifecycle | Alerting-Strategies -> Incident-Response -> Post-Mortem -> SLO-SLI-SLA (alerts trigger incident response, incidents produce post-mortems, post-mortem action items refine SLOs, forming a continuous improvement loop) |


Sources

  • Google SRE Book — Site Reliability Engineering (O'Reilly, 2016)
  • Google SRE Workbook — The Site Reliability Workbook (O'Reilly, 2018)
  • Charity Majors, Liz Fong-Jones, George Miranda — Observability Engineering (O'Reilly, 2022)
  • NIST SP 800-92 — Guide to Computer Security Log Management