Post-Mortem

A post-mortem is a structured investigation conducted after an incident closes. Its purpose is to understand what happened, determine root cause, and produce action items that prevent recurrence — not to identify who is responsible. The blameless post-mortem is the foundational principle: people acted rationally given what they knew; the system is what failed.

Scope: This note covers the blameless philosophy, the five-step post-mortem process (timeline → impact → root cause → contributing factors → action items), the 5-whys technique with a worked example, the post-mortem document template, and a Mermaid causal chain diagram. For the incident that triggers the post-mortem, see Incident-Response. For the SLO error budget consumed during the incident, see SLO-SLI-SLA.


When NOT to Use

A full post-mortem is not appropriate for every incident:

  • P2/P3 incidents without customer data impact — use a lightweight retrospective (5–10 minutes, no formal document) instead; full post-mortem overhead is disproportionate.
  • Incidents fully explained by a known issue with an existing runbook — if the runbook was consulted, the issue was resolved as expected, and no runbook gaps were found, a lightweight update to the runbook is sufficient.
  • Planned degradations with no surprise failures — if a maintenance window degraded service exactly as expected, a post-mortem adds no learning; document the outcome in the change record.

Required for:

  • All P0 incidents (always)
  • All P1 incidents (always)
  • Any P2 incident with data integrity questions (customer data affected or potentially affected)
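The criteria above can be sketched as a small decision function. This is a minimal illustration, not an official policy engine: the `Incident` fields and the returned labels are hypothetical names, and the precedence (mandatory cases checked before any exemption) is an assumption this note does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    severity: str                             # "P0", "P1", "P2", or "P3"
    data_integrity_question: bool = False     # customer data affected or potentially affected
    known_issue_runbook_worked: bool = False  # runbook resolved it as expected, no gaps found
    planned_as_expected: bool = False         # maintenance window degraded exactly as planned


def review_type(incident: Incident) -> str:
    """Return the kind of review the criteria above call for."""
    # Assumption: mandatory cases win over every exemption.
    if incident.severity in ("P0", "P1"):
        return "full post-mortem"
    if incident.severity == "P2" and incident.data_integrity_question:
        return "full post-mortem"
    if incident.planned_as_expected:
        return "change record"
    if incident.known_issue_runbook_worked:
        return "runbook update"
    # Remaining P2/P3 incidents. A P3 with data-integrity questions is not
    # covered by the criteria and also lands here in this sketch.
    return "lightweight retrospective"


print(review_type(Incident("P1", known_issue_runbook_worked=True)))
print(review_type(Incident("P3")))
```

Checking the mandatory cases first means a P1 that matched a known runbook still gets a full post-mortem, which is one reading of "always".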

Blameless Philosophy

The blameless post-mortem is the philosophical foundation of modern incident analysis. It is not a rhetorical stance — it is an epistemological one.

Core Principle

Assume people acted rationally given the information they had at the time.

When an engineer makes a decision that contributes to an incident, they made the best decision available to them given: the tools they had, the documentation that existed, the monitoring that was present, the time pressure they were under, and the organisational context they were operating in.

Root Cause is Systemic

The blameless approach does not ask "who made the mistake?" It asks "what allowed a rational person to make this mistake?"

  • Root cause lives in the system: missing guardrails, unclear documentation, inadequate monitoring, perverse incentives, organisational pressure to move fast.
  • Root cause does not live in the individual.

Counterfactual Test

Ask: "Would a different competent engineer have done the same thing given the same context?"

  • If the answer is yes — the system is the problem. A different person in the same situation makes the same mistake. The fix must be systemic.
  • If the answer is no — either the context was unusual (document it), or the individual lacked information that should have been provided (fix the information gap).

The counterfactual test prevents post-mortems from becoming accountability theatre.

Why Blame is Counterproductive

  • Engineers who fear blame hide near-misses and early warning signals.
  • Hidden near-misses accumulate into large failures.
  • Blame creates the appearance of resolution ("we fired the person who caused the outage") while leaving the systemic cause intact — the next person in that role will make the same mistake.
  • Blameless cultures catch problems earlier because people report them earlier.

Authority: the Google SRE Workbook, Chapter 10 (on blameless postmortems), establishes blamelessness as a load-bearing organisational principle, not a courtesy.

Pitfall: "Blameless" does not mean "accountability-free." Engineers are accountable for how they respond to incidents, for completing post-mortem action items, and for not repeating known mistakes. Blamelessness applies to the analysis of past decisions made under uncertainty, not to future decisions made with full knowledge of the risk.


Post-Mortem Process

Five sequential steps. Steps 1 and 2 must precede step 3. Step 3 feeds step 4. Step 5 is the deliverable.


Step 1 — Timeline Construction

Reconstruct the chronological event sequence from all available data sources:

  • Alert logs (exact timestamps from alerting system)
  • Chat messages (war room channel in Incident-Response)
  • Deploy records (CI/CD system, deployment timestamps)
  • Metrics graphs (annotate exact timestamps of anomaly onset)
  • On-call escalation records

Format each entry:

[timestamp] [event] [who detected or acted] [what they believed at the time]

Key distinction: record both what actually happened (observable) and what responders believed was happening at the time (mental model). These often diverge, and the divergence reveals monitoring gaps.

Example timeline entries:

[14:02] Alert fired: checkout-service 5xx rate > 1% on 5min window (automated)
[14:04] On-call paged (PagerDuty)
[14:07] Responder acknowledged; believed: "likely a deployment artifact"
[14:09] Checked recent deploys — no deploys in past 2h; hypothesis updated
[14:12] Connection pool metrics checked; pool at 98% saturation (responder)
[14:15] Incident declared P1; IC assigned; war room opened
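Merging entries from several sources into one chronological list can be sketched as follows. This is an illustrative outline only: `TimelineEntry`, `build_timeline`, the `at` helper, and the incident date are hypothetical, and real sources would need parsers for their own export formats.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    event: str
    actor: str
    belief: str = ""  # responder's mental model at the time, if recorded

    def render(self) -> str:
        line = f"[{self.timestamp:%H:%M}] {self.event} ({self.actor})"
        if self.belief:
            line += f'; believed: "{self.belief}"'
        return line


def build_timeline(*sources: list[TimelineEntry]) -> list[TimelineEntry]:
    """Merge entries from every source into one chronological list."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry.timestamp)


def at(hour: int, minute: int) -> datetime:
    return datetime(2024, 1, 15, hour, minute)  # hypothetical incident date


alert_log = [TimelineEntry(at(14, 2), "Alert fired: checkout-service 5xx rate > 1%", "automated")]
paging_log = [TimelineEntry(at(14, 4), "On-call paged", "PagerDuty")]
chat_log = [
    TimelineEntry(at(14, 12), "Connection pool at 98% saturation", "responder"),
    TimelineEntry(at(14, 7), "Responder acknowledged", "responder", "likely a deployment artifact"),
]

timeline = build_timeline(chat_log, alert_log, paging_log)
for entry in timeline:
    print(entry.render())
```

Keeping the belief as its own field, rather than folding it into the event text, makes it easy to scan for the points where the observed state and the responder's mental model diverged.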

Step 2 — Impact Quantification

Before root cause analysis, quantify the full scope of what happened:

  • Duration: from first observable symptom (not first alert — alerts may fire after the symptom) to full resolution
  • User impact: percentage of users affected; specific user-visible failures (checkout failures, login errors, data not loading)
  • SLO impact: error budget consumed — use burn rate formula from SLO-SLI-SLA
    • budget_consumed = burn_rate × (duration_minutes / SLO_window_minutes), where the result is the fraction of the window's error budget that was spent
  • Business impact: if applicable — transaction failures, data loss events, SLA breach

Impact quantification makes action items proportionate. A 2-minute P1 that consumed 0.1% of monthly budget warrants different investment than a 4-hour P1 that consumed 30%.
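As a rough worked example of the budget formula, the sketch below uses the standard definition of burn rate (observed error rate divided by the error rate the SLO allows). The 99.9% SLO target and the 30-day window are assumptions chosen for illustration, and treating "35% of users affected" as a 35% request error rate is a simplification.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo_target)


def budget_consumed(rate: float, duration_minutes: float, slo_window_minutes: float) -> float:
    """Fraction of the window's error budget spent during the incident."""
    return rate * (duration_minutes / slo_window_minutes)


WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day SLO window (43,200 minutes)

rate = burn_rate(error_rate=0.35, slo_target=0.999)  # assumed 99.9% SLO
consumed = budget_consumed(rate, duration_minutes=28, slo_window_minutes=WINDOW_MINUTES)

print(f"burn rate {rate:.0f}x, budget consumed {consumed:.1%}")
# prints: burn rate 350x, budget consumed 22.7%
```

Under these assumptions a 28-minute incident spends roughly a fifth of the monthly budget, which is the kind of figure that justifies prevention work rather than only a runbook tweak.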


Step 3 — Root Cause Analysis: The 5-Whys Technique

The 5-whys is the primary technique for moving from symptom to systemic root cause. It is a heuristic, not a mechanical algorithm — stop when you reach a cause you can act on systemically.

Technique: Ask "Why did X happen?" five times, using each answer as the starting point for the next question. The goal is to move from the observable symptom to the systemic condition that allowed it.

Worked Example:

Starting symptom: The service returned 500 errors for 28 minutes, affecting 35% of users.

  1. Why did the service return 500 errors? → Database connection pool was exhausted; new requests could not acquire connections.

  2. Why was the connection pool exhausted? → A traffic spike significantly exceeded the pool's configured maximum capacity.

  3. Why did the traffic spike exceed pool capacity? → A load test ran against the production environment instead of the staging environment.

  4. Why did the load test run against production? → The environment selection was made manually by the engineer running the test; no validation was performed.

  5. Why was environment selection not validated? → The CI pipeline had no guardrail check to prevent load test jobs from targeting the production environment.

Root cause: The CI pipeline lacks an environment guardrail for load test execution.

Systemic fix: Add an environment validation gate to the CI load test job that rejects production as a target.

Why 5 whys stops here: The root cause is a systemic gap (missing guardrail) that a process change can address. Going deeper ("Why was the guardrail never added?") enters organisational history that does not produce actionable fixes.
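The systemic fix from the worked example could look something like the sketch below: a check that a CI load test job runs before generating any traffic. The blocked environment names, the function names, and the way the target is passed in are all hypothetical; a real pipeline would read the target from its own job configuration.

```python
import sys

BLOCKED_TARGETS = {"production", "prod"}  # hypothetical environment names


def validate_load_test_target(target_env: str) -> None:
    """Raise ValueError if a load test is aimed at a blocked or missing target."""
    normalized = target_env.strip().lower()
    if not normalized:
        raise ValueError("load test target environment is not set")
    if normalized in BLOCKED_TARGETS:
        raise ValueError(f"load tests may not target {normalized!r}")


def main(argv: list[str]) -> int:
    """Return a process exit code: 0 lets the job continue, 1 fails it."""
    target = argv[1] if len(argv) > 1 else ""
    try:
        validate_load_test_target(target)
    except ValueError as err:
        print(f"guardrail failed: {err}", file=sys.stderr)
        return 1
    print(f"guardrail passed: target is {target!r}")
    return 0


main(["check_target", "staging"])     # returns 0
main(["check_target", "Production"])  # returns 1
```

Failing closed on a missing target matters here: the original incident came from a manual selection with no validation, so an unset value should stop the job rather than fall back to a default.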

Pitfall: The 5-whys can produce multiple valid causal chains. A complex incident may have 2–3 separate causal chains that converged. Trace all chains that contributed — each chain may produce distinct action items. Do not collapse multiple causes into a single chain to keep the document tidy.


Step 4 — Contributing Factors

Identify all conditions that were necessary (but not sufficient alone) for the incident to reach its observed severity.

Three categories:

  • Direct cause: the mechanism that produced the failure (exhausted connection pool)
  • Contributing factors: conditions that enabled the direct cause to occur (no environment guardrail, manual environment selection, no alert on connection pool saturation)
  • Exacerbating factors: conditions that increased the severity or duration (alert fired 7 minutes after symptom onset due to multi-window delay; on-call engineer spent 5 minutes on a hypothesis that was already disproven; no runbook existed for this scenario)

Contributing factors reveal monitoring gaps. The absence of a connection pool saturation alert (P3 threshold, not a wake-up) is a contributing factor. Adding it becomes a detection action item.
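The missing saturation alert can be illustrated with a minimal threshold check. The 80% threshold comes from the detection action item in Step 5; the function names and the message format are hypothetical, and a real deployment would express this as a rule in its alerting system rather than in application code.

```python
from typing import Optional

SATURATION_THRESHOLD = 0.80  # from the detection action item: saturation > 80%


def pool_saturation(in_use: int, max_size: int) -> float:
    """Fraction of the connection pool currently in use."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return in_use / max_size


def saturation_alert(in_use: int, max_size: int) -> Optional[str]:
    """Return a P3 alert message when saturation exceeds the threshold, else None."""
    saturation = pool_saturation(in_use, max_size)
    if saturation > SATURATION_THRESHOLD:
        return f"P3: connection pool saturation {saturation:.0%} exceeds {SATURATION_THRESHOLD:.0%}"
    return None


print(saturation_alert(98, 100))  # the 98% reading from the example timeline
print(saturation_alert(60, 100))  # below threshold, prints None
```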


Step 5 — Action Items

Every action item must be:

  • Described specifically — "add environment validation to CI load test pipeline" not "improve CI pipeline"
  • Assigned to a role — not a person's name (roles survive team changes)
  • Given a due date — without a due date, action items do not complete
  • Classified by type

Three action item types:

| Type | Purpose | Example |
|------|---------|---------|
| Prevention | Reduce probability of recurrence | Add environment guardrail to CI load test job |
| Detection | Reduce time-to-detect when it does occur | Add P3 alert: connection pool saturation > 80% |
| Mitigation | Reduce blast radius when detected | Write runbook for connection pool exhaustion scenario |

Tracking: a post-mortem whose action items are never completed is worse than no post-mortem — it creates false assurance that the issue is addressed. Assign action items to the sprint backlog immediately. Review completion at the next post-mortem review.
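The four requirements on an action item can be checked mechanically before a post-mortem is marked complete. The sketch below is one possible shape: `ActionItem`, `problems`, and the list of vague verbs are hypothetical, and the specificity check is only a crude heuristic that cannot replace a reviewer.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class ActionType(Enum):
    PREVENTION = "Prevention"
    DETECTION = "Detection"
    MITIGATION = "Mitigation"


@dataclass
class ActionItem:
    description: str
    owner_role: str
    action_type: Optional[ActionType]
    due: Optional[date]


VAGUE_VERBS = ("improve", "investigate", "look into")  # hypothetical heuristic


def problems(item: ActionItem) -> list[str]:
    """List the requirements this action item fails to meet."""
    found = []
    if item.description.strip().lower().startswith(VAGUE_VERBS):
        found.append("description is not specific")
    if not item.owner_role.strip():
        found.append("no owner role")
    if item.due is None:
        found.append("no due date")
    if item.action_type is None:
        found.append("no type")
    return found


good = ActionItem("Add environment validation to CI load test pipeline",
                  "Platform on-call lead", ActionType.PREVENTION, date(2024, 2, 1))
vague = ActionItem("Improve CI pipeline", "", None, None)

print(problems(good))   # []
print(problems(vague))
```

The role and date in the example are placeholders. A check like this catches missing fields; whether the owner is a role rather than a named person still needs a human review.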


Mermaid Causal Chain Diagram

The causal chain diagram shows the 5-whys derivation from observable symptom to systemic root cause, with action items branching from the causes they address.

flowchart TD
    A[Symptom:\n500 errors for 35% of users\nfor 28 minutes] --> B[Mechanism:\nDB connection pool exhausted]
    B --> C[Proximate cause:\nTraffic spike exceeded pool capacity]
    C --> D[Contributing cause:\nLoad test ran against production]
    D --> E[Root cause:\nNo environment guardrail in CI pipeline]
    E --> F1[Prevention action:\nAdd environment validation\ngate to CI load test job]
    E --> F2[Detection action:\nAdd P3 alert: connection\npool saturation > 80%]
    B --> F2
    D --> F3[Mitigation action:\nWrite runbook for connection\npool exhaustion scenario]
    B --> F3

The diagram makes three things visible:

  1. The causal chain from symptom to root cause (vertical flow)
  2. Where each action item intervenes in the chain (branching leaves)
  3. Causes addressed by more than one action item (the mechanism B feeds both F2 and F3)

Post-Mortem Document Template

Copy this template when drafting a post-mortem after an incident closes:

# Post-Mortem: [INCIDENT_NAME]
 
**Date**: [date of incident]
**Severity**: P[N]
**Duration**: [HH:MM start] to [HH:MM end] ([N] minutes)
**Authors**: [names/roles]
**Status**: Draft | In Review | Complete
 
## Impact
 
- Users affected: [%] ([N] users / [N] requests)
- Error budget consumed: [minutes / percentage of monthly budget]
- User-visible failure: [description]
 
## Timeline
 
| Time | Event | Actor | Belief at the time |
|------|-------|-------|--------------------|
| [HH:MM] | Alert fired: [alert name] | Automated | n/a |
| [HH:MM] | Incident declared P[N] | [IC role] | [what they believed] |
| [HH:MM] | [investigation step] | [resolver role] | [what they believed] |
| [HH:MM] | Mitigation applied: [description] | [resolver role] | [what they believed] |
| [HH:MM] | SLI recovering | Automated | n/a |
| [HH:MM] | Incident resolved | [IC role] | [what they believed] |
 
## Root Cause
 
[5-whys derivation — use the format from the worked example above]
 
1. Why did [symptom] happen? → [mechanism]
2. Why did [mechanism] happen? → [proximate cause]
3. Why did [proximate cause] happen? → [contributing cause]
4. Why did [contributing cause] happen? → [deeper cause]
5. Why did [deeper cause] happen? → [systemic root cause]
 
**Root cause:** [one sentence]
 
## Contributing Factors
 
- [Contributing factor 1 — condition that enabled the direct cause]
- [Contributing factor 2 — exacerbating factor that increased severity/duration]
 
## Action Items
 
| Action | Type | Owner | Priority | Due |
|--------|------|-------|----------|-----|
| [action description] | Prevention | [role] | P1 | [YYYY-MM-DD] |
| [action description] | Detection | [role] | P2 | [YYYY-MM-DD] |
| [action description] | Mitigation | [role] | P2 | [YYYY-MM-DD] |

Related Notes

  • Incident-Response — post-mortem is triggered at the end of the incident lifecycle; the incident timeline and war room channel are primary inputs to Step 1
  • SLO-SLI-SLA — Step 2 (impact quantification) uses error budget consumed; burn rate formula provides the calculation
  • Alerting-Strategies — detection action items from post-mortems frequently produce new alert requirements; symptom-based alert gaps are a common contributing factor
  • Runbook-Design — mitigation action items from post-mortems often result in new or updated runbooks; the post-mortem is the primary trigger for the living document update cycle