Incident Response

Incident response is a structured process for declaring, coordinating, and resolving service degradations that breach or threaten to breach user-facing SLOs. Its defining characteristic is role separation: the incident commander coordinates without debugging; resolvers debug without communicating externally. This separation prevents the most common failure mode — a responder who is simultaneously fixing the problem, updating stakeholders, and managing the timeline, doing all three poorly.

Scope: This note covers severity classification, the incident commander role, communication pattern, incident lifecycle, and the minimal declaration template. For alerts that trigger incident declaration, see Alerting-Strategies. For the runbooks consulted during investigation, see Runbook-Design. For the post-incident analysis, see Post-Mortem.

When NOT to Use (Incident Response Process)

The full incident response protocol is inappropriate for:

Minor bugs fixed in normal sprint cycle — if the issue requires a story card and a sprint, it is not an incident; it is a defect backlog item.
Planned maintenance windows — use change management and pre-communication; maintenance is scheduled degradation, not an unplanned SLO event.
Single-user issues — one user experiencing a problem in an otherwise healthy system does not indicate systemic failure; route to support instead.
Internal tooling degradation with no user-facing SLO — developer experience issues are not user incidents unless they directly cause user-facing failures.

Threshold for incident declaration: use incident response when a user-facing SLO breach is confirmed or imminent. The burn rate signal from Alerting-Strategies is the trigger: when a P1 alert fires, an incident is declared.

Severity Classification

Severity is declared by the incident commander — not auto-assigned. Automated alerts carry an initial severity suggestion from burn rate (see Alerting-Strategies and SLO-SLI-SLA), but humans evaluate ambiguous signals and may upgrade or downgrade.

Severity	Name	Criteria	User Impact	Response
P0	Critical	Complete service outage; data loss or corruption possible; SLO burn rate > 14.4x; affects all users	All users unable to use core functionality	Immediate response any hour; wake entire on-call rotation if needed
P1	Major	Significant feature degradation; > 50% of users affected; burn rate > 14.4x on multi-window check	Majority of users experiencing failures or severe latency	Wake-up page within 15 min; declare incident immediately
P2	Minor	Feature degradation for subset of users; < 50% affected; burn rate > 6x on multi-window check	Some users experiencing failures; core functionality intact	Business hours response; create incident; notify on-call via chat
P3	Low	No user-facing impact; internal degradation; burn rate elevated but within budget	No user-visible impact; monitoring and tracking only	No page; create backlog ticket; review in next sprint
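
The table can be read as a decision rule for the automated severity suggestion. The following is a minimal sketch, not the note's actual alerting logic: the `Signal` fields and the way the criteria are combined are assumptions. A complete outage or possible data loss maps to P0, the P1 and P2 rows use the burn rate and affected-user thresholds from the table, and anything that matches no row falls through to P3 for human review. The result is only a suggestion; the incident commander still declares the severity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Hypothetical inputs for a severity suggestion."""
    burn_rate: float                 # error-budget burn rate as a multiple (14.4 means 14.4x)
    affected_fraction: float         # share of users affected, 0.0 to 1.0
    user_facing: bool = True
    complete_outage: bool = False
    data_loss_possible: bool = False


def suggest_severity(s: Signal) -> str:
    """Return an initial severity suggestion; the IC makes the final call."""
    if not s.user_facing:
        return "P3"  # internal degradation only
    if s.complete_outage or s.data_loss_possible:
        return "P0"
    if s.burn_rate > 14.4 and s.affected_fraction > 0.5:
        return "P1"
    if s.burn_rate > 6.0:
        return "P2"
    return "P3"  # matches no row above: track it and let a human review
```

For example, `suggest_severity(Signal(burn_rate=15.0, affected_fraction=0.6))` suggests P1, while the same burn rate with 30% of users affected suggests P2.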

Severity escalation: commanders must upgrade severity when the situation worsens. A P2 that does not respond to initial remediation within 30 minutes should be re-evaluated for P1 upgrade.
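
The 30-minute re-evaluation rule is easy to forget during an active incident, so a reminder check may help. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and the clock is assumed to start when initial remediation begins. It only flags the incident for review; the upgrade decision stays with the commander.

```python
from datetime import datetime, timedelta

REEVALUATION_WINDOW = timedelta(minutes=30)


def needs_p1_reevaluation(severity: str, remediation_started: datetime,
                          now: datetime, responding: bool) -> bool:
    """True when a P2 has not responded to remediation within the window."""
    if severity != "P2" or responding:
        return False
    return now - remediation_started >= REEVALUATION_WINDOW
```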

Severity downgrade: acceptable after mitigation confirms the blast radius is smaller than initially assessed. Document the downgrade in the incident timeline.

Pitfall: Defaulting all paging alerts to P1 regardless of user impact destroys on-call health. Reserve P0/P1 for confirmed or strongly suspected user-facing SLO breach. See On-Call-Practices for toil percentage thresholds.

Incident Commander Role

The incident commander (IC) is the single person who holds coordination authority during an active incident. This role is the load-bearing pattern in incident response: every other role coordinates through the IC.

Commander Responsibilities

Declare incident and assign initial severity — after receiving the alert or escalation, the IC is the first decision-maker
Assign resolver(s) — designate who investigates and applies fixes; do not investigate personally
Assign communications lead — designate who manages external status page and stakeholder updates
Maintain the incident timeline — record what happened when; this is the primary input to the Post-Mortem
Decide on severity changes — upgrade or downgrade based on incoming data from resolvers
Declare resolution — confirm SLI recovery and close the incident
Trigger post-mortem — required for all P0, all P1, and any P2 with data integrity questions
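
The post-mortem trigger in the last item reduces to a small predicate. A minimal sketch, assuming severities are plain strings and that "data integrity questions" is captured as a boolean flag set by the commander (both are illustrative choices):

```python
def postmortem_required(severity: str, data_integrity_question: bool = False) -> bool:
    """All P0 and P1 incidents, plus any P2 with data integrity questions."""
    if severity in ("P0", "P1"):
        return True
    return severity == "P2" and data_integrity_question
```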

The commander does NOT fix the incident. If the IC is debugging, nobody is coordinating. This is the single most violated principle in incident response. When the IC is in the terminal, the timeline stops, external updates stop, and severity assessment freezes.

Resolver Responsibilities

Resolvers are distinct from the commander in role and communication flow:

Diagnose using runbooks and observability data — consult Runbook-Design first; reach for dashboards and logs second
Apply fixes and rollbacks — execute remediation steps from the runbook or, when novel, experimental mitigations with IC approval
Report outcomes to the commander only — resolvers do not communicate externally; all external information flow goes through the IC
State hypotheses explicitly — "I think the connection pool is exhausted" is better than silent debugging; the IC needs to track mental models

Communications Lead Responsibilities

Write status page updates at defined intervals: at least every 30 minutes for P0/P1; at declaration and resolution for P2
Notify affected stakeholders: internal Slack, email lists, and an executive summary if required
Never speculate about cause in external communications: say "we are investigating elevated error rates", not "we think the database crashed"
Coordinate customer support: help the support team answer user inquiries without technical speculation
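
The update cadence above can be turned into a simple deadline calculation for the communications lead. This is an illustrative sketch; `UPDATE_INTERVAL` and `next_update_due` are hypothetical names. P2 has no recurring interval because its updates are tied to declaration and resolution, so the function returns `None` for it.

```python
from datetime import datetime, timedelta
from typing import Optional

# Recurring status page interval per severity; severities not listed have none.
UPDATE_INTERVAL = {
    "P0": timedelta(minutes=30),
    "P1": timedelta(minutes=30),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Latest time by which the next status page update should be posted."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None  # event-driven updates only (declaration and resolution)
    return last_update + interval
```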

Communication Pattern: Hub-and-Spoke

The incident command structure is a hub-and-spoke model. All information flows through the incident commander.

Diagram: Incident-Response-diagram.excalidraw.md

Hub (Incident Commander):

Receives status updates from resolvers
Receives confirmation from communications lead that stakeholders are updated
Issues directives to resolvers (priorities, new hypotheses to investigate)
Issues approval to communications lead (what to say and when)

Spokes:

Resolvers report to commander; do not communicate externally
Communications lead reports to commander; does not direct resolvers

War room discipline:

Single Slack channel or bridge call per incident — the commander controls it
Single thread of record — no side conversations that split the information stream
Commander pins key updates in the war room channel for the post-mortem timeline

External communications:

Only from the communications lead
Never from resolvers — even if a resolver knows exactly what happened
Status page updates contain facts, not hypotheses
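
The hub-and-spoke rules amount to an allow-list of who may send information to whom. The sketch below encodes that list; the `Role` enum and `is_allowed` function are hypothetical names, and `EXTERNAL` stands in for the status page, stakeholders, and customers. A chat bot or review checklist could use something like this to flag messages that bypass the commander.

```python
from enum import Enum


class Role(Enum):
    COMMANDER = "commander"
    RESOLVER = "resolver"
    COMMS_LEAD = "comms_lead"
    EXTERNAL = "external"  # status page, stakeholders, customers


# Every permitted direction of information flow; anything else is a violation.
ALLOWED_FLOWS = frozenset({
    (Role.RESOLVER, Role.COMMANDER),    # status updates and hypotheses
    (Role.COMMANDER, Role.RESOLVER),    # directives and priorities
    (Role.COMMS_LEAD, Role.COMMANDER),  # confirmation that stakeholders are updated
    (Role.COMMANDER, Role.COMMS_LEAD),  # approval of what to say and when
    (Role.COMMS_LEAD, Role.EXTERNAL),   # the only outward-facing channel
})


def is_allowed(sender: Role, recipient: Role) -> bool:
    """Check whether a message follows the hub-and-spoke pattern."""
    return (sender, recipient) in ALLOWED_FLOWS
```

Under these assumptions, `is_allowed(Role.RESOLVER, Role.EXTERNAL)` is `False` even when the resolver knows exactly what happened.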

Incident Lifecycle

An incident moves through five sequential phases. The commander is responsible for transitioning the incident between phases and communicating each transition to the war room.
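
Because the phases are strictly sequential, they can be modeled as a small state machine. This is a sketch rather than a prescribed tool: `Phase` and `advance` are hypothetical names, and the model only allows forward moves of one phase at a time, which is a deliberate simplification.

```python
from enum import IntEnum


class Phase(IntEnum):
    DETECTION = 1
    DECLARATION = 2
    INVESTIGATION = 3
    MITIGATION = 4
    RESOLUTION = 5


def advance(current: Phase) -> Phase:
    """Move to the next phase; phases cannot be skipped or repeated."""
    if current is Phase.RESOLUTION:
        raise ValueError("incident is already in the final phase")
    return Phase(current + 1)
```

Keeping the transition in one function gives the commander a single place to record the change and announce it to the war room.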

Phase 1 — Detection

Alert fires from Alerting-Strategies (P1 burn rate alert on multi-window check), OR
User report escalated through support, OR
Internal monitoring observation

Detection ends when an incident commander is designated.

Phase 2 — Declaration

IC designated (the on-call engineer becomes IC, or explicitly hands off the role to someone else)
Severity assigned (initial P-level based on burn rate and known impact)
War room channel opened
Resolvers assigned (minimum one; P0/P1 may warrant multiple resolvers in parallel)
Communications lead assigned (may be the same person as IC for P2; should be separate for P0/P1)
Initial timeline entry created: [timestamp] Incident declared P[N]

Phase 3 — Investigation

Resolvers consult Runbook-Design for known alert types
Commander tracks timeline; updates severity if data warrants
Communications lead sends first status update within 30 minutes of declaration (P0/P1)

Phase 4 — Mitigation

Fix or workaround applied by resolver(s)
SLI metrics observed — burn rate should be decreasing
Commander confirms: "Mitigation applied, monitoring for recovery"
Communications lead sends update: "We have applied a fix and are monitoring"

Phase 5 — Resolution

SLI returns within SLO thresholds for a defined stabilisation period (typically 30 minutes for P0/P1)
IC declares incident resolved
War room channel archived (not deleted — post-mortem needs the history)
Post-mortem triggered for P0, P1, and P2 with data integrity questions — see Post-Mortem
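
The stabilisation criterion in Phase 5 can be checked mechanically before the IC declares resolution. The sketch below assumes the SLI is available as timestamped samples already reduced to a within-SLO boolean; the function name and sample shape are illustrative. It requires both that every sample in the window is healthy and that the samples actually cover the whole window, so a short run of good data does not count as recovery.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def is_stable(samples: Sequence[Tuple[datetime, bool]], now: datetime,
              period: timedelta = timedelta(minutes=30)) -> bool:
    """True if the SLI stayed within SLO for the whole stabilisation period.

    samples: (timestamp, within_slo) pairs in any order.
    """
    if not samples:
        return False
    window_start = now - period
    # The data must reach back at least to the start of the window.
    covers_window = min(t for t, _ in samples) <= window_start
    recent = [ok for t, ok in samples if t >= window_start]
    return covers_window and bool(recent) and all(recent)
```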

Minimal Incident Declaration Template

## Incident: [SHORT_NAME]
- **Severity**: P[N]
- **Commander**: [role/name]
- **Resolver(s)**: [role/name]
- **Comms Lead**: [role/name]
- **Declared**: [timestamp]
- **Status page**: [URL]
- **War room**: [channel/bridge]
 
### Timeline
- [HH:MM] Alert fired: [alert name]
- [HH:MM] Incident declared P[N]
- [HH:MM] [actions taken]
- [HH:MM] Mitigation applied: [description]
- [HH:MM] SLI recovering
- [HH:MM] Incident resolved
 
### Current hypothesis
[What we think is happening]
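
Filling in the template by hand under pressure invites omissions, so it may be worth generating it. The following is a minimal sketch that renders the template above from structured fields; the `Declaration` dataclass and `render` function are hypothetical, and all values in the usage example are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Declaration:
    short_name: str
    severity: str
    commander: str
    resolvers: List[str]
    comms_lead: str
    declared: str      # timestamp string, format left to the team
    status_page: str
    war_room: str
    timeline: List[Tuple[str, str]] = field(default_factory=list)  # (HH:MM, event)
    hypothesis: str = ""


def render(d: Declaration) -> str:
    """Render the declaration in the same layout as the template."""
    lines = [
        f"## Incident: {d.short_name}",
        f"- **Severity**: {d.severity}",
        f"- **Commander**: {d.commander}",
        f"- **Resolver(s)**: {', '.join(d.resolvers)}",
        f"- **Comms Lead**: {d.comms_lead}",
        f"- **Declared**: {d.declared}",
        f"- **Status page**: {d.status_page}",
        f"- **War room**: {d.war_room}",
        "",
        "### Timeline",
    ]
    lines += [f"- [{hhmm}] {event}" for hhmm, event in d.timeline]
    lines += ["", "### Current hypothesis",
              d.hypothesis or "[What we think is happening]"]
    return "\n".join(lines)


# Usage with placeholder values:
example = Declaration(
    short_name="checkout-errors", severity="P1", commander="on-call SRE",
    resolvers=["database engineer"], comms_lead="support lead",
    declared="2024-01-01T12:00Z", status_page="https://status.example.com",
    war_room="#inc-checkout-errors",
    timeline=[("12:00", "Incident declared P1")],
)
print(render(example))
```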

Backlinks

Runbook-Design — resolvers consult runbooks during the investigation phase; Section 4 (escalation) of runbooks triggers incident declaration
Alerting-Strategies — P1 alerts trigger incident declaration; burn rate thresholds determine initial severity
On-Call-Practices — on-call engineer becomes commander or resolver; on-call health depends on incident frequency
Post-Mortem — P0/P1 incidents require a blameless post-mortem after resolution; the incident timeline is the primary input
SLO-SLI-SLA — severity classification uses burn rate thresholds; error budget consumed is quantified during the post-mortem
