Runbook Design
A runbook is a step-by-step procedure for a specific, recurring operational task. It reduces decision-making overhead during high-stress incidents by externalising the diagnostic reasoning into a written, executable artifact. The four-section anatomy (Context, Diagnostics, Remediation, Escalation) is the load-bearing structure that makes a runbook useful rather than decorative.
Scope: This note covers runbook anatomy, the runbook/playbook distinction, and the living-document maintenance cycle. For the alert conditions that trigger runbook lookup, see Alerting-Strategies. For the incident declaration process triggered by runbook escalation, see Incident-Response.
When NOT to Use
Runbooks are inappropriate or premature in these situations:
- First-time incidents with unknown root cause — you cannot write steps for a situation you have not yet diagnosed. Run the incident response process first; write the runbook after the postmortem.
- One-off manual tasks — if a task runs once per year, a runbook adds maintenance overhead without providing repetition value. Document it in a wiki page instead.
- Development workflows — runbooks are operational artifacts for production systems, not development checklists or CI/CD procedures.
- Speculative scenarios — a runbook written before the first incident is speculation. The first version should emerge from the postmortem of the first occurrence.
Runbook vs Playbook
These terms are often conflated. The distinction matters because they operate at different scopes.
| Dimension | Runbook | Playbook |
|---|---|---|
| Scope | One specific alert or symptom | A class of incident situation |
| Trigger | "Service X 500 error rate exceeded threshold" | "Database tier outage" |
| Depth | Narrow, step-by-step, executable | Broad, references multiple runbooks |
| Audience | On-call engineer with 5 minutes to spare | Incident commander coordinating response |
| Format | Decision tree + copy-pasteable commands | Sections for stakeholders, comms, coordination |
Relationship: A playbook invokes runbooks. A runbook is executable without its playbook; a playbook without runbooks is just a narrative.
Analogy: playbook = training manual for a class of situation; runbook = checklist for a specific procedure.
Pitfall: Calling every procedure a "runbook" collapses the distinction. When responders cannot tell at a glance whether they are reading a step-by-step procedure or a coordination guide, they lose time in high-stakes moments.
Four-Section Anatomy
Every runbook must contain exactly these four sections, in this order. A section may be short, but it may not be absent.
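The "exactly four sections, in this order" rule is mechanical enough to check automatically. The sketch below is one possible lint, not an established tool: the function name `checkRunbookSections` is hypothetical, and it assumes each section is written as a level-3 markdown heading (`### Context` and so on), as in the template later in this note.

```typescript
// Required section names, in order, taken from the four-section anatomy.
const REQUIRED_SECTIONS = ["Context", "Diagnostics", "Remediation", "Escalation"] as const;

// Returns a list of problems; an empty list means the structure is valid.
// Assumption: every section is a level-3 markdown heading ("### Name").
function checkRunbookSections(markdown: string): string[] {
  const found = markdown
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("### "))
    .map((line) => line.slice(4).trim());

  const problems: string[] = [];
  for (const name of REQUIRED_SECTIONS) {
    if (!found.includes(name)) problems.push(`missing section: ${name}`);
  }
  for (const name of found) {
    if (!(REQUIRED_SECTIONS as readonly string[]).includes(name)) {
      problems.push(`unexpected section: ${name}`);
    }
  }
  // Only compare order once the set of sections is already correct.
  if (problems.length === 0 && found.join("|") !== REQUIRED_SECTIONS.join("|")) {
    problems.push(`sections out of order or duplicated: ${found.join(", ")}`);
  }
  return problems;
}
```

A check like this could run in CI against the runbook repository, so that a runbook with a missing Escalation section is caught at review time rather than during an incident.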
Section 1 — Context
Purpose: orient the responder before they touch anything.
- Alert name and description: what fired, what the condition means in plain English
- Service scope: which service, which environment, which region
- SLO impact: which SLI is affected, current burn rate — link to SLO-SLI-SLA for burn rate formula
- Links: primary dashboard URL, recent deploy history, related alerts, prior incidents
Rule: The responder should be able to answer "what is broken and how bad is it?" from Section 1 alone, before issuing any commands.
Section 2 — Diagnostics
Purpose: guide the responder from symptom to probable cause via structured investigation.
- Initial triage: the first three commands to run — log query, metric check, health endpoint
- Decision tree: symptom → likely cause → next step (see Mermaid template below)
- Data to collect: specific log queries, metric queries, trace queries to gather before acting
- Scope guidance: what this runbook covers versus when to jump to a different runbook
Pitfall: Embedding dashboard screenshots or raw log output in the runbook creates maintenance debt, because both go stale as soon as the system changes. Link to dashboards and diagnose from the live view. The runbook documents the questions, not the answers.
Section 3 — Remediation
Purpose: resolve the probable cause identified in Section 2.
- Per-cause remediation: for each branch in the decision tree, specific steps
- Copy-pasteable commands: exact CLI commands, not paraphrases — responders under pressure make transcription errors
- Rollback instructions: if the remediation itself can worsen the situation, include revert steps
- Expected outcome: after each step, state what the responder should observe: "After restarting the pool, error rate should drop within 60 seconds"
Rule: If a responder needs to interpret or adapt a step, the step is not yet written. Steps that require judgment belong in diagnostics.
Section 4 — Escalation
Purpose: define the conditions under which the runbook is no longer sufficient and a handoff is required.
- When to escalate: explicit conditions that exceed this runbook's scope (e.g., "If error rate does not drop after completing all remediation steps")
- Who to escalate to: role, not person name — roles survive team changes
- Information to hand off: what to include in the escalation message — current state, steps already attempted, data collected
- Incident declaration criteria: "If X and Y are both true, declare a P1 incident using Incident-Response"
Pitfall: Runbooks without escalation paths create decision paralysis. Responders who cannot solve a problem need a clear exit, not an open-ended "try harder."
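The handoff fields listed above can be captured in a small structure so that every escalation message carries the same information. This is a minimal sketch under assumptions: the `EscalationHandoff` interface and `buildHandoffMessage` function are hypothetical names introduced for illustration, and the message layout is only one reasonable choice.

```typescript
// Hypothetical shape for the information Section 4 asks the responder to hand off.
interface EscalationHandoff {
  escalateTo: string;       // a role such as "Database on-call", never a person's name
  currentState: string;     // e.g. the current error rate
  stepsAttempted: string[]; // remediation steps already tried, in order
  dataCollected: string[];  // links or log excerpts gathered during diagnostics
}

// Renders the handoff as plain text suitable for a chat or paging message.
function buildHandoffMessage(runbookName: string, handoff: EscalationHandoff): string {
  // Number the items, or say "(none)" explicitly so an empty list is not mistaken for an omission.
  const numbered = (items: string[]): string[] =>
    items.length > 0 ? items.map((item, i) => `  ${i + 1}. ${item}`) : ["  (none)"];

  return [
    `Escalation from runbook: ${runbookName}`,
    `To: ${handoff.escalateTo}`,
    `Current state: ${handoff.currentState}`,
    "Steps already attempted:",
    ...numbered(handoff.stepsAttempted),
    "Data collected:",
    ...numbered(handoff.dataCollected),
  ].join("\n");
}
```

Making every field required in the type means a responder cannot send an escalation that silently omits the steps already attempted, which is the information the next person most needs.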
Mermaid Decision Tree (Section 2 Pattern)
The diagnostic decision tree is the backbone of Section 2. It externalises the expert's branching logic into a navigable structure.
flowchart TD
A[Alert fires: Service X\n500 error rate > threshold] --> B{Check error logs:\nwhich component?}
B -- Database timeout --> C{Connection pool\nexhausted?}
B -- Upstream dependency --> D[Check dependency\nhealth endpoint]
B -- Application exception --> E[Check recent\ndeploy history]
C -- Yes --> F[Scale connection pool\nor restart pool]
C -- No --> G[Check DB query\nperformance]
D -- Unhealthy --> H[Activate circuit breaker\nsee Runbook: Dep-X]
E -- Recent deploy --> I[Rollback deploy\nusing rollback runbook]
F --> J{Error rate\ndropped?}
G --> J
H --> J
I --> J
J -- Yes --> K[Monitor 15 min\nthen resolve]
J -- No --> L[Escalate to\nSection 4]
Tree design rules:
- Root node: the exact alert condition
- First branch: the highest-signal distinguishing question
- Leaf nodes: either a concrete action or "Escalate to Section 4"
- Maximum depth: 4 levels — deeper trees are a signal the runbook covers too many scenarios
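Two of these rules (leaf type and maximum depth) can be checked mechanically if the tree is also kept as data. The sketch below assumes a hypothetical `DiagnosticNode` type and one particular way of counting depth: the root counts as level 1 and each branch adds one level. Other counting conventions are equally defensible, so treat the threshold as adjustable.

```typescript
// Hypothetical in-memory form of a diagnostic tree.
// A leaf is either a concrete action or an escalation to Section 4.
type DiagnosticNode =
  | { kind: "question"; text: string; branches: { answer: string; next: DiagnosticNode }[] }
  | { kind: "action"; text: string }
  | { kind: "escalate" };

const MAX_DEPTH = 4; // from the tree design rules; the root counts as level 1 (assumption)

// Returns a list of rule violations; an empty list means the tree passes.
function validateTree(root: DiagnosticNode): string[] {
  const problems: string[] = [];

  const visit = (node: DiagnosticNode, depth: number, path: string): void => {
    if (depth > MAX_DEPTH) {
      problems.push(`${path}: depth ${depth} exceeds ${MAX_DEPTH}`);
      return; // do not descend further into an already oversized branch
    }
    if (node.kind === "question") {
      if (node.branches.length === 0) {
        // A question with no branches is a dead end: neither an action nor an escalation.
        problems.push(`${path}: question has no branches`);
      }
      for (const branch of node.branches) {
        visit(branch.next, depth + 1, `${path} > ${branch.answer}`);
      }
    }
  };

  visit(root, 1, "root");
  return problems;
}
```

The union type already makes "a leaf is an action or an escalation" hard to violate; the validator only needs to catch dead-end questions and overly deep branches.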
Minimal Runbook Template
Copy this skeleton when creating a new runbook:
## [ALERT_NAME] Runbook
### Context
- **Alert**: [alert name and description]
- **Service**: [service name] / [environment]
- **SLO impact**: [which SLI, current burn rate]
- **Dashboard**: [URL]
- **Recent deploys**: [deploy history link]
### Diagnostics
[Decision tree or numbered diagnostic steps]
1. Check [first signal]: `[command]`
2. If [condition], proceed to Cause A remediation
3. If [other condition], proceed to Cause B remediation
### Remediation
**Cause A: [probable cause description]**
1. [Exact step with copy-pasteable command]
2. [Expected outcome: "You should see X within Y seconds"]
**Cause B: [probable cause description]**
1. [Exact step]
2. [Rollback if needed: `[rollback command]`]
### Escalation
- **Escalate if**: [specific condition, e.g., "error rate does not drop after Cause A and B remediation"]
- **Escalate to**: [role, e.g., "Database on-call"]
- **Include in handoff**: current error rate, steps attempted, relevant log lines
- **Declare P1 if**: [condition] — use [[Incident-Response]]
Living Document Pattern
A runbook that is not updated after incidents becomes misinformation.
Post-incident update triggers:
- Any step was missing from the runbook
- Any step was wrong or misleading
- Any step took more than twice the expected time
- The responder had to ask a question not answered by the runbook
Quarterly review checklist:
- Are all commands still valid in the current environment?
- Do dashboard URLs still resolve?
- Are escalation roles still accurate?
- Has any step been automated since the last review?
Signal that a runbook needs splitting: if a single runbook covers more than three distinct probable causes, it is covering too many scenarios. Split by cause.
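The quarterly review is easier to enforce if overdue runbooks are surfaced automatically. The following sketch assumes each runbook records a `lastReviewed` ISO date and treats "quarterly" as 90 days; the `ReviewRecord` type, the `findOverdueReviews` name, and the 90-day default are all assumptions made for illustration.

```typescript
// Hypothetical minimal record: only the fields the review check needs.
interface ReviewRecord {
  alertName: string;
  lastReviewed: string; // ISO date, e.g. "2026-03-01"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the alert names whose runbooks were last reviewed more than maxAgeDays ago.
function findOverdueReviews(records: ReviewRecord[], today: Date, maxAgeDays = 90): string[] {
  return records
    .filter((record) => {
      const reviewed = Date.parse(record.lastReviewed);
      // An unparseable date is treated as overdue so that it gets human attention.
      if (Number.isNaN(reviewed)) return true;
      return (today.getTime() - reviewed) / MS_PER_DAY > maxAgeDays;
    })
    .map((record) => record.alertName);
}
```

Passing `today` in as a parameter, rather than calling `new Date()` inside the function, keeps the check deterministic and easy to test.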
Automation Target Pattern
When a runbook step is executed by humans more than twice per month:
- Flag it as an automation candidate
- Implement the automation (script, CI job, alert auto-remediation action)
- Update the runbook step to: "Automation X handles this; if automation fails, manual steps are: [steps]"
The runbook step does not disappear — it becomes the fallback for when automation fails. Runbooks survive automation failures.
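The "more than twice per month" threshold can be evaluated from a log of manual executions, if one is kept. This is a sketch under several assumptions: the `StepExecution` record and `findAutomationCandidates` function are hypothetical, timestamps are ISO strings in a single time zone, and "month" means calendar month.

```typescript
// Hypothetical log entry: one row per manual execution of a runbook step.
interface StepExecution {
  runbook: string;    // runbook identifier
  step: string;       // step identifier within that runbook
  executedAt: string; // ISO timestamp, e.g. "2026-03-14T09:30:00Z"
}

// Returns "runbook#step" identifiers executed more than `threshold` times in any calendar month.
function findAutomationCandidates(log: StepExecution[], threshold = 2): string[] {
  const counts = new Map<string, number>();
  for (const execution of log) {
    const month = execution.executedAt.slice(0, 7); // "YYYY-MM"
    const key = `${execution.runbook}#${execution.step}@${month}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  const candidates = new Set<string>();
  counts.forEach((count, key) => {
    // Strip the "@YYYY-MM" suffix so each step is reported once, however many months qualify.
    if (count > threshold) candidates.add(key.slice(0, key.lastIndexOf("@")));
  });
  return Array.from(candidates).sort();
}
```

The output is only a list of candidates; whether a flagged step is safe to automate remains a human decision, and the manual steps stay in the runbook as the fallback.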
TypeScript — Runbook Registry Pattern
interface RunbookEntry {
  alertName: string;    // human-readable alert title
  url: string;
  sloImpact: string[];  // which SLIs are affected
  owner: string;        // team role, not person
  lastReviewed: string; // ISO date
}
// Keyed by alert key (the identifier the alerting system emits), not by display name.
const runbookRegistry: Record<string, RunbookEntry> = {
  "service-x-500-rate": {
    alertName: "Service X 500 Error Rate Exceeded",
    url: "https://wiki.example.com/runbooks/service-x-500",
    sloImpact: ["availability"],
    owner: "backend-on-call",
    lastReviewed: "2026-03-01",
  },
};
// Builds the notification text for an alert, surfacing the runbook link.
// alertKey is the registry key (e.g. "service-x-500-rate"), not the display name.
function buildAlertNotification(alertKey: string): string {
  const entry = runbookRegistry[alertKey];
  // Unknown alerts still produce a notification, with a pointer to the wiki.
  if (!entry) return `Alert: ${alertKey}\nNo runbook found — check wiki`;
  return [
    `Alert: ${entry.alertName}`,
    `Runbook: ${entry.url}`,
    `Owner: ${entry.owner}`,
    `SLO Impact: ${entry.sloImpact.join(", ")}`,
  ].join("\n");
}
Java — Runbook Registry Pattern
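import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Package-private so that both types compile from a single RunbookRegistry.java file.
record RunbookEntry(
    String alertName,
    String url,
    List<String> sloImpact, // which SLIs are affected
    String owner,           // team role, not person
    String lastReviewed     // ISO date
) {}

public class RunbookRegistry {
    // Keyed by alert key (the identifier the alerting system emits).
    private final Map<String, RunbookEntry> entries = new HashMap<>();

    public void register(String alertKey, RunbookEntry entry) {
        entries.put(alertKey, entry);
    }

    // Builds the notification text for an alert, surfacing the runbook link.
    public String buildAlertNotification(String alertKey) {
        RunbookEntry entry = entries.get(alertKey);
        if (entry == null) {
            return "Alert: " + alertKey + "\nNo runbook found — check wiki";
        }
        return String.join("\n",
            "Alert: " + entry.alertName(),
            "Runbook: " + entry.url(),
            "Owner: " + entry.owner(),
            "SLO Impact: " + String.join(", ", entry.sloImpact())
        );
    }
}
Backlinks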
- Alerting-Strategies — every P1/P2 alert links to a runbook; the alert annotation is the trigger for runbook lookup
- Incident-Response — Section 4 (escalation) triggers incident declaration; the incident commander role presupposes runbooks exist
- On-Call-Practices — runbook discipline is the infrastructure of on-call; runbooks are how tribal knowledge becomes executable procedure
- SLO-SLI-SLA — Section 1 (context) references the current burn rate from the SLO framework