Observability MOC
Navigation hub for 9 observability pattern notes across two sections: three pillars (logging, metrics, tracing) and reliability engineering (SLO, alerting, runbooks, incidents, on-call, post-mortems). See Design-Patterns-MOC for the root vault entry point.
Three Pillars
| Pattern | Intent | Use When |
|---|---|---|
| Structured-Logging | JSON-first structured logging with mandatory field schema (timestamp, level, service, traceId, correlationId), PII exclusion, and async context propagation via MDC. | Any service that produces log output — structured JSON is the only format that supports machine parsing, correlation, and centralized log aggregation at scale |
| Metrics-and-Dashboards | Quantitative measurement using RED (request rate, error rate, duration), USE (utilization, saturation, errors), and Four Golden Signals frameworks with dashboard design principles. | Monitoring service health, capacity planning, and SLO tracking — choose framework by system type: RED for request-driven services, USE for infrastructure resources |
| Distributed-Tracing-Patterns | Sampling strategies (head-based, tail-based, probabilistic) and async context propagation failure modes — pattern-level decisions distinct from OTel SDK implementation in Distributed-Tracing. | Understanding request flow across service boundaries; debugging latency in distributed systems; choosing sampling strategy based on traffic volume and cost constraints |
Three pillars selection guide: All three pillars are complementary, not alternatives — a production system needs all three. Logs tell you WHAT happened (event details), metrics tell you HOW MUCH is happening (aggregated counts/durations), traces tell you WHERE time was spent (request path across services). Start with structured logging (lowest barrier), add metrics for alerting and dashboards, add distributed tracing when debugging cross-service latency. Distributed-Tracing-Patterns covers sampling and propagation decisions; Distributed-Tracing covers OTel SDK implementation — read both.
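As a starting point for the "lowest barrier" pillar, the mandatory field schema listed for Structured-Logging can be sketched as follows. This is a minimal illustration, not the implementation from that note: it assumes Python's standard `logging` module, uses `contextvars` as a stand-in for MDC (which is a Java/SLF4J concept), and the `PII_KEYS` deny-list, the `fields` attribute, and the `JsonFormatter` name are hypothetical.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Request-scoped context. contextvars plays the role MDC plays in Java:
# values set here follow the current task across async boundaries.
trace_id = contextvars.ContextVar("trace_id", default=None)
correlation_id = contextvars.ContextVar("correlation_id", default=None)

# Hypothetical deny-list; a real schema would define its own PII fields.
PII_KEYS = frozenset({"email", "password", "ssn"})


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the mandatory fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "traceId": trace_id.get(),
            "correlationId": correlation_id.get(),
            "message": record.getMessage(),
        }
        # Optional structured fields: drop PII and never overwrite
        # a mandatory field.
        for key, value in getattr(record, "fields", {}).items():
            if key not in PII_KEYS and key not in entry:
                entry[key] = value
        return json.dumps(entry)
```

A handler configured with `JsonFormatter("checkout")` would then emit one JSON line per call such as `logger.info("order created", extra={"fields": {"orderId": 7}})`, with `traceId` and `correlationId` taken from whatever the current request context has set.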
Reliability Engineering
| Pattern | Intent | Use When |
|---|---|---|
| SLO-SLI-SLA | Service Level Objectives defined from SLI ratios (good events / valid events), error budget derivation, burn rate computation, and multi-window alerting as a prerequisite for actionable alerts. | Defining reliability targets and error budgets; prerequisite for all alerting, because without SLOs alerts lack a threshold grounded in user impact |
| Alerting-Strategies | Symptom-based alerting with multi-window multi-burn-rate rules (fast page, slow page, fast ticket, slow ticket), severity taxonomy, and the principle of alerting on symptoms not causes. | Configuring alerts that wake humans only for user-facing impact; reducing alert fatigue by eliminating cause-based pages |
| Runbook-Design | Runbook anatomy (context, diagnostics, remediation, escalation) with decision tree structure, distinct from playbooks (investigative, no predetermined steps). | Creating operational documentation that on-call engineers follow during incidents — every alert should link to a runbook |
| Incident-Response | Incident commander role (coordination, not fixing), severity classification (P1-P4), communication cadence, and Excalidraw command structure diagram. | Handling production incidents with clear roles and communication patterns — the commander coordinates, domain experts fix |
| On-Call-Practices | Rotation design, escalation policy, and on-call health measurement via toil percentage (target < 50%) and mean time to acknowledge, MTTA (target < 5 min for P1). | Building sustainable on-call rotations; measuring and reducing operational toil; preventing burnout |
| Post-Mortem | Blameless post-mortem process: timeline reconstruction, 5-whys root cause analysis, counterfactual test, and action items classified as Prevention/Detection/Mitigation. | Learning from incidents without blame; producing action items that improve the system. The counterfactual test asks: "would a different competent engineer have done the same?" |
Reliability engineering selection guide: The reliability engineering workflow follows a specific chain: define SLOs first (what does "reliable enough" mean?), then configure alerts against those SLOs (burn rate), then write runbooks for each alert, then establish incident response for when runbooks are insufficient, then run post-mortems to learn and improve. SLO-SLI-SLA is the foundation — without SLI ratios and error budgets, alerting thresholds are arbitrary. Alerting-Strategies depends on SLO burn rates — read SLO-SLI-SLA first. On-Call-Practices is the human sustainability layer — if toil exceeds 50% over 3 rotations, trigger a reliability sprint.
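The first two links of that chain (SLI ratio, error budget, burn rate, then a multi-window alert condition) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the rules from SLO-SLI-SLA or Alerting-Strategies: the function names are hypothetical, the burn-rate threshold is passed in rather than fixed, and treating a window with no valid events as fully good is an assumption.

```python
def sli(good: int, valid: int) -> float:
    """SLI as the ratio of good events to valid events."""
    if valid == 0:
        return 1.0  # assumption: no valid events means nothing failed
    return good / valid


def error_budget(slo: float) -> float:
    """Fraction of valid events allowed to be bad under the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def burn_rate(good: int, valid: int, slo: float) -> float:
    """Observed error rate divided by the error budget.

    A burn rate of 1.0 spends the budget exactly over the SLO period;
    higher values exhaust it proportionally faster.
    """
    return (1.0 - sli(good, valid)) / error_budget(slo)


def should_alert(
    long_window: tuple[int, int],
    short_window: tuple[int, int],
    slo: float,
    threshold: float,
) -> bool:
    """Multi-window condition: fire only if both windows burn too fast.

    Each window is a (good, valid) pair. The long window shows the burn is
    significant; the short window shows it is still happening.
    """
    return (
        burn_rate(*long_window, slo) >= threshold
        and burn_rate(*short_window, slo) >= threshold
    )
```

For example, with a 99.9% SLO the budget is 0.1% of valid events, so a window with a 2% error rate burns at roughly 20 times the sustainable rate. Pairing a long and a short window keeps an alert from continuing to fire after the problem has already stopped.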
Observability Concern Selector
- Is the concern about understanding what individual requests or events are doing?
  - Yes, structured event details (who, what, when, context): Structured-Logging
  - Yes, request flow and latency across service boundaries: Distributed-Tracing-Patterns (patterns) + Distributed-Tracing (OTel SDK)
  - No: Continue
- Is the concern about aggregate system health and capacity?
  - Yes, request-driven service (rate, errors, duration): Metrics-and-Dashboards (RED framework)
  - Yes, infrastructure resource (CPU, memory, disk): Metrics-and-Dashboards (USE framework)
  - No: Continue
- Is the concern about defining what "reliable enough" means?
  - Yes, setting reliability targets from user-facing SLIs: SLO-SLI-SLA
  - Yes, computing error budget and burn rate for alerting thresholds: SLO-SLI-SLA
  - No: Continue
- Is the concern about when and how to alert humans?
  - Yes, configuring alerts that fire on user impact, not internal causes: Alerting-Strategies
  - No: Continue
- Is the concern about what to do when an alert fires?
  - Yes, step-by-step diagnostic and remediation documentation: Runbook-Design
  - No: Continue
- Is the concern about handling a production incident?
  - Yes, roles, severity classification, communication cadence: Incident-Response
  - No: Continue
- Is the concern about sustainable on-call operations?
  - Yes, rotation design, escalation policy, toil measurement: On-Call-Practices
  - No: Continue
- Is the concern about learning from incidents?
  - Yes, blameless analysis, 5-whys, action item classification: Post-Mortem
Cross-Domain Lineage Chains
| Chain | Lineage |
|---|---|
| Three Pillars to Reliability | Structured-Logging -> Distributed-Tracing-Patterns -> SLO-SLI-SLA -> Alerting-Strategies (structured logs feed trace correlation, trace data informs SLI measurement, SLO burn rates drive alerting rules) |
| Metrics to Alerting | Metrics-and-Dashboards -> Micrometer -> SLO-SLI-SLA -> Alerting-Strategies -> Runbook-Design (metrics collected via Micrometer define SLIs, SLO burn rates trigger alerts, each alert links to a runbook) |
| Incident Lifecycle | Alerting-Strategies -> Incident-Response -> Post-Mortem -> SLO-SLI-SLA (alerts trigger incident response, incidents produce post-mortems, post-mortem action items refine SLOs — a continuous improvement loop) |
Sources
- Google SRE Book — Site Reliability Engineering (O'Reilly, 2016)
- Google SRE Workbook — The Site Reliability Workbook (O'Reilly, 2018)
- Charity Majors, Liz Fong-Jones, George Miranda — Observability Engineering (O'Reilly, 2022)
- NIST SP 800-92 — Guide to Computer Security Log Management