Incident Response

Incident response is a structured process for declaring, coordinating, and resolving service degradations that breach or threaten to breach user-facing SLOs. Its defining characteristic is role separation: the incident commander coordinates without debugging; resolvers debug without communicating externally. This separation prevents the most common failure mode — a responder who is simultaneously fixing the problem, updating stakeholders, and managing the timeline, doing all three poorly.

Scope: This note covers severity classification, the incident commander role, communication pattern, incident lifecycle, and the minimal declaration template. For alerts that trigger incident declaration, see Alerting-Strategies. For the runbooks consulted during investigation, see Runbook-Design. For the post-incident analysis, see Post-Mortem.


When NOT to Use (Incident Response Process)

Full incident response protocol is inappropriate for:

  • Minor bugs fixed in normal sprint cycle — if the issue requires a story card and a sprint, it is not an incident; it is a defect backlog item.
  • Planned maintenance windows — use change management and pre-communication; maintenance is scheduled degradation, not an unplanned SLO event.
  • Single-user issues — one user experiencing a problem in an otherwise healthy system does not indicate systemic failure; route to support instead.
  • Internal tooling degradation with no user-facing SLO — developer experience issues are not user incidents unless they directly cause user-facing failures.

Threshold for incident declaration: use incident response when a user-facing SLO breach is confirmed or imminent. The burn rate signal from Alerting-Strategies is the trigger: P1 alert fired = incident declared.


Severity Classification

Severity is declared by the incident commander — not auto-assigned. Automated alerts carry an initial severity suggestion from burn rate (see Alerting-Strategies and SLO-SLI-SLA), but humans evaluate ambiguous signals and may upgrade or downgrade.

| Severity | Name | Criteria | User Impact | Response |
|---|---|---|---|---|
| P0 | Critical | Complete service outage; data loss or corruption possible; SLO burn rate > 14.4x; affects all users | All users unable to use core functionality | Immediate response any hour; wake entire on-call rotation if needed |
| P1 | Major | Significant feature degradation; > 50% of users affected; burn rate > 14.4x on multi-window check | Majority of users experiencing failures or severe latency | Wake-up page within 15 min; declare incident immediately |
| P2 | Minor | Feature degradation for subset of users; < 50% affected; burn rate > 6x on multi-window check | Some users experiencing failures; core functionality intact | Business hours response; create incident; notify on-call via chat |
| P3 | Low | No user-facing impact; internal degradation; burn rate elevated but within budget | No user visible impact; monitoring and tracking only | No page; create backlog ticket; review in next sprint |
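
The severity criteria above can be sketched as a function that produces the initial suggestion an alert might carry. This is a minimal illustration, not a definitive classifier: the `Signal` fields and the `suggest_severity` name are hypothetical, and the handling of cases the criteria leave open (exactly 50% of users affected, or a majority affected at a burn rate between 6x and 14.4x) is an assumption that falls back to the lower severity. The incident commander still makes the final call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Inputs to the suggestion; all field names are illustrative."""
    burn_rate: float            # multiple of the sustainable error-budget burn
    affected_fraction: float    # 0.0 to 1.0, share of users seeing failures
    user_facing: bool           # False for internal-only degradation
    full_outage: bool = False   # True when core functionality is down for everyone


def suggest_severity(sig: Signal) -> str:
    """Return an initial severity suggestion for the IC to confirm or change."""
    if not sig.user_facing:
        return "P3"  # internal degradation only: ticket, no page
    if sig.full_outage and sig.burn_rate > 14.4:
        return "P0"
    if sig.affected_fraction > 0.5 and sig.burn_rate > 14.4:
        return "P1"
    if sig.burn_rate > 6:
        return "P2"  # assumption: unlisted combinations fall through to P2
    return "P3"      # elevated but within budget
```

For example, `suggest_severity(Signal(burn_rate=8.0, affected_fraction=0.2, user_facing=True))` yields a P2 suggestion, which the commander may still upgrade if the situation worsens.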

Severity escalation: commanders must upgrade severity when the situation worsens. A P2 that does not respond to initial remediation within 30 minutes should be re-evaluated for P1 upgrade.

Severity downgrade: acceptable after mitigation confirms the blast radius is smaller than initially assessed. Document the downgrade in the incident timeline.
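
The escalation and downgrade rules above could be supported by two small helpers: one that flags a P2 for P1 review once 30 minutes of remediation have passed without recovery, and one that writes a severity change into the timeline. Both function names and the timeline line format are assumptions made for illustration; the sketch only flags the review and leaves the decision with the commander.

```python
from datetime import datetime, timedelta

# 30-minute window taken from the escalation rule; the constant name is illustrative.
REEVALUATION_WINDOW = timedelta(minutes=30)


def needs_p1_review(severity: str, remediation_started: datetime,
                    now: datetime, recovering: bool) -> bool:
    """True when a P2 has not responded to remediation within the window."""
    if severity != "P2" or recovering:
        return False
    return now - remediation_started >= REEVALUATION_WINDOW


def record_severity_change(timeline: list[str], at: datetime,
                           old: str, new: str, reason: str) -> None:
    """Append an upgrade or downgrade to the incident timeline."""
    timeline.append(f"[{at:%H:%M}] Severity changed {old} -> {new}: {reason}")
```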

Pitfall: Defaulting all paging alerts to P1 regardless of user impact destroys on-call health. Reserve P0/P1 for confirmed or strongly suspected user-facing SLO breach. See On-Call-Practices for toil percentage thresholds.


Incident Commander Role

The incident commander (IC) is the single person who holds coordination authority during an active incident. This role is the load-bearing pattern in incident response: everything else is coordinated through the IC.

Commander Responsibilities

  • Declare incident and assign initial severity — after receiving the alert or escalation, the IC is the first decision-maker
  • Assign resolver(s) — designate who investigates and applies fixes; do not investigate personally
  • Assign communications lead — designate who manages external status page and stakeholder updates
  • Maintain the incident timeline — record what happened when; this is the primary input to the Post-Mortem
  • Decide on severity changes — upgrade or downgrade based on incoming data from resolvers
  • Declare resolution — confirm SLI recovery and close the incident
  • Trigger post-mortem — required for all P0, all P1, and any P2 with data integrity questions

The commander does NOT fix the incident. If the IC is debugging, nobody is coordinating. This is the single most violated principle in incident response. When the IC is in the terminal, the timeline stops, external updates stop, and severity assessment freezes.

Resolver Responsibilities

Resolvers are distinct from the commander in role and communication flow:

  • Diagnose using runbooks and observability data — consult Runbook-Design first; reach for dashboards and logs second
  • Apply fixes and rollbacks — execute remediation steps from the runbook or, when novel, experimental mitigations with IC approval
  • Report outcomes to the commander only — resolvers do not communicate externally; all external information flow goes through the IC
  • State hypotheses explicitly — "I think the connection pool is exhausted" is better than silent debugging; the IC needs to track mental models

Communications Lead Responsibilities

  • Write status page updates at defined intervals — P0/P1: every 30 minutes minimum; P2: at declaration and resolution
  • Notify affected stakeholders — internal slack, email lists, executive summary if required
  • Never speculate about cause in external communications — say "we are investigating elevated error rates" not "we think the database crashed"
  • Coordinate customer support — help support team answer user inquiries without technical speculation
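
The update cadence and the no-speculation rule for the communications lead can be sketched as follows. The helper names and the phrase list are hypothetical; a phrase check like this is only a rough aid for a human reviewer and will miss speculation worded differently.

```python
from datetime import datetime, timedelta
from typing import Optional

# P0/P1: every 30 minutes minimum. P2 updates happen only at declaration
# and resolution, so P2 has no recurring interval here.
UPDATE_INTERVAL = {"P0": timedelta(minutes=30), "P1": timedelta(minutes=30)}

# Illustrative markers of cause speculation; not an exhaustive list.
SPECULATIVE_MARKERS = ("we think", "we suspect", "probably", "might be")


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return the latest time the next status page update may be sent, if any."""
    interval = UPDATE_INTERVAL.get(severity)
    return last_update + interval if interval else None


def looks_speculative(message: str) -> bool:
    """Flag draft external messages that appear to guess at the cause."""
    lowered = message.lower()
    return any(marker in lowered for marker in SPECULATIVE_MARKERS)
```

With these helpers, "We are investigating elevated error rates" passes the check, while "We think the database crashed" is flagged for rewording before it reaches the status page.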

Communication Pattern: Hub-and-Spoke

The incident command structure is a hub-and-spoke model. All information flows through the incident commander.

[Diagram: hub-and-spoke incident command structure (Incident-Response-diagram.excalidraw.md)]

Hub (Incident Commander):

  • Receives status updates from resolvers
  • Receives confirmation from communications lead that stakeholders are updated
  • Issues directives to resolvers (priorities, new hypotheses to investigate)
  • Issues approval to communications lead (what to say and when)

Spokes:

  • Resolvers report to commander; do not communicate externally
  • Communications lead reports to commander; does not direct resolvers

War room discipline:

  • Single Slack channel or bridge call per incident — the commander controls it
  • Single thread of record — no side conversations that split the information stream
  • Commander pins key updates in the war room channel for the post-mortem timeline

External communications:

  • Only from the communications lead
  • Never from resolvers — even if a resolver knows exactly what happened
  • Status page updates contain facts, not hypotheses
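
The hub-and-spoke rules above amount to a small set of allowed communication edges. A minimal sketch, assuming hypothetical `Role` and `may_send` names, makes the allowed and forbidden paths explicit:

```python
from enum import Enum


class Role(Enum):
    COMMANDER = "commander"
    RESOLVER = "resolver"
    COMMS_LEAD = "comms_lead"
    EXTERNAL = "external"  # status page, stakeholders, customers


# (sender, recipient) pairs permitted by the hub-and-spoke model.
ALLOWED = {
    (Role.RESOLVER, Role.COMMANDER),    # status updates and hypotheses
    (Role.COMMS_LEAD, Role.COMMANDER),  # confirmation that stakeholders are updated
    (Role.COMMANDER, Role.RESOLVER),    # directives and priorities
    (Role.COMMANDER, Role.COMMS_LEAD),  # approval of what to say and when
    (Role.COMMS_LEAD, Role.EXTERNAL),   # the only external channel
}


def may_send(sender: Role, recipient: Role) -> bool:
    """True when the model permits sender to message recipient directly."""
    return (sender, recipient) in ALLOWED
```

Under this sketch `may_send(Role.RESOLVER, Role.EXTERNAL)` is false even when the resolver knows exactly what happened, and `may_send(Role.COMMS_LEAD, Role.RESOLVER)` is false because the communications lead does not direct resolvers.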

Incident Lifecycle

Five sequential phases. The commander is responsible for transitioning the incident between phases and communicating each transition to the war room.
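
One way to picture the sequential lifecycle is as a forward-only state machine. The sketch below assumes phases never move backwards, which is a simplification: a real incident may need to return to investigation when a mitigation fails. The `Phase` and `advance` names are illustrative.

```python
from enum import IntEnum


class Phase(IntEnum):
    DETECTION = 1
    DECLARATION = 2
    INVESTIGATION = 3
    MITIGATION = 4
    RESOLUTION = 5


def advance(current: Phase) -> Phase:
    """Move the incident to the next phase; only the commander should call this."""
    if current is Phase.RESOLUTION:
        raise ValueError("incident is already resolved")
    return Phase(current + 1)
```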

Phase 1 — Detection

  • Alert fires from Alerting-Strategies (P1 burn rate alert on multi-window check), OR
  • User report escalated through support, OR
  • Internal monitoring observation

Detection ends when an incident commander is designated.

Phase 2 — Declaration

  • IC designated (the on-call engineer becomes IC, or explicitly hands the role off to someone else)
  • Severity assigned (initial P-level based on burn rate and known impact)
  • War room channel opened
  • Resolvers assigned (minimum one; P0/P1 may warrant multiple resolvers in parallel)
  • Communications lead assigned (may be the same person as IC for P2; should be separate for P0/P1)
  • Initial timeline entry created: [timestamp] Incident declared P[N]
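
The declaration checklist above can be expressed as a validation pass that reports what is still missing. This is a sketch under assumptions: the `Declaration` fields and the wording of each problem message are hypothetical, and the rule that the commander must not also be a resolver is carried over from the commander responsibilities rather than stated in this phase.

```python
from dataclasses import dataclass, field


@dataclass
class Declaration:
    severity: str
    commander: str
    resolvers: list[str] = field(default_factory=list)
    comms_lead: str = ""
    war_room: str = ""


def declaration_problems(d: Declaration) -> list[str]:
    """Return the checklist items that are not yet satisfied (empty when complete)."""
    problems = []
    if not d.commander:
        problems.append("no incident commander designated")
    if not d.resolvers:
        problems.append("at least one resolver must be assigned")
    if d.commander in d.resolvers:
        problems.append("commander must not also be a resolver")
    if not d.comms_lead:
        problems.append("no communications lead assigned")
    elif d.severity in ("P0", "P1") and d.comms_lead == d.commander:
        problems.append("communications lead should be separate from the commander for P0/P1")
    if not d.war_room:
        problems.append("no war room channel opened")
    return problems
```

A P2 where the commander doubles as communications lead passes this check, while the same arrangement on a P0 or P1 is reported as a problem.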

Phase 3 — Investigation

  • Resolvers consult Runbook-Design for known alert types
  • Commander tracks timeline; updates severity if data warrants
  • Communications lead sends first status update within 30 minutes of declaration (P0/P1)

Phase 4 — Mitigation

  • Fix or workaround applied by resolver(s)
  • SLI metrics observed — burn rate should be decreasing
  • Commander confirms: "Mitigation applied, monitoring for recovery"
  • Communications lead sends update: "We have applied a fix and are monitoring"

Phase 5 — Resolution

  • SLI returns within SLO thresholds for a defined stabilisation period (typically 30 minutes for P0/P1)
  • IC declares incident resolved
  • War room channel archived (not deleted — post-mortem needs the history)
  • Post-mortem triggered for P0, P1, and P2 with data integrity questions — see Post-Mortem
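
The resolution criteria can be sketched as two checks: whether the SLI has stayed within target for the whole stabilisation period, and whether a post-mortem is required. The sketch assumes a "higher is better" SLI such as a success ratio, readings sorted oldest first, and hypothetical function names; the 30-minute default mirrors the typical P0/P1 period mentioned above.

```python
from datetime import datetime, timedelta


def sli_stable(samples: list[tuple[datetime, float]], slo_target: float,
               window: timedelta = timedelta(minutes=30)) -> bool:
    """True when readings span the full window and all of them meet the target."""
    if not samples:
        return False
    cutoff = samples[-1][0] - window
    if samples[0][0] > cutoff:
        return False  # not enough history to cover the stabilisation period
    recent = [value for ts, value in samples if ts >= cutoff]
    return all(value >= slo_target for value in recent)


def postmortem_required(severity: str, data_integrity_question: bool = False) -> bool:
    """All P0 and P1 incidents, plus any P2 with data integrity questions."""
    return severity in ("P0", "P1") or (severity == "P2" and data_integrity_question)
```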

Minimal Incident Declaration Template

## Incident: [SHORT_NAME]
- **Severity**: P[N]
- **Commander**: [role/name]
- **Resolver(s)**: [role/name]
- **Comms Lead**: [role/name]
- **Declared**: [timestamp]
- **Status page**: [URL]
- **War room**: [channel/bridge]
 
### Timeline
- [HH:MM] Alert fired: [alert name]
- [HH:MM] Incident declared P[N]
- [HH:MM] [actions taken]
- [HH:MM] Mitigation applied: [description]
- [HH:MM] SLI recovering
- [HH:MM] Incident resolved
 
### Current hypothesis
[What we think is happening]
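
To keep declarations consistent, the template above could be filled in programmatically. The following is a sketch only: the `Incident` dataclass, the `render` function, and the example values are hypothetical, and the output simply mirrors the template's fields and headings.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    short_name: str
    severity: str
    commander: str
    resolvers: list[str]
    comms_lead: str
    declared: str
    status_page: str
    war_room: str
    timeline: list[tuple[str, str]] = field(default_factory=list)  # (HH:MM, event)
    hypothesis: str = ""


def render(i: Incident) -> str:
    """Render the declaration in the same layout as the template."""
    lines = [
        f"## Incident: {i.short_name}",
        f"- **Severity**: {i.severity}",
        f"- **Commander**: {i.commander}",
        f"- **Resolver(s)**: {', '.join(i.resolvers)}",
        f"- **Comms Lead**: {i.comms_lead}",
        f"- **Declared**: {i.declared}",
        f"- **Status page**: {i.status_page}",
        f"- **War room**: {i.war_room}",
        "",
        "### Timeline",
    ]
    lines += [f"- [{hhmm}] {event}" for hhmm, event in i.timeline]
    lines += ["", "### Current hypothesis", i.hypothesis or "[none stated yet]"]
    return "\n".join(lines)
```

Appending to `timeline` as events occur and re-rendering keeps the pinned declaration current, which in turn gives the post-mortem a ready-made record.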

Related

  • Runbook-Design — resolvers consult runbooks during the investigation phase; Section 4 (escalation) of runbooks triggers incident declaration
  • Alerting-Strategies — P1 alerts trigger incident declaration; burn rate thresholds determine initial severity
  • On-Call-Practices — on-call engineer becomes commander or resolver; on-call health depends on incident frequency
  • Post-Mortem — P0/P1 incidents require a blameless post-mortem after resolution; the incident timeline is the primary input
  • SLO-SLI-SLA — severity classification uses burn rate thresholds; error budget consumed is quantified during the post-mortem