Runbook Design

A runbook is a step-by-step procedure for a specific, recurring operational task. It reduces decision-making overhead during high-stress incidents by externalising diagnostic reasoning into a written, executable artifact. The four-section anatomy (Context, Diagnostics, Remediation, Escalation) is the load-bearing structure that makes a runbook useful rather than decorative.

Scope: This note covers runbook anatomy, the runbook/playbook distinction, and the living-document maintenance cycle. For the alert conditions that trigger runbook lookup, see Alerting-Strategies. For the incident declaration process triggered by runbook escalation, see Incident-Response.

When NOT to Use

Runbooks are inappropriate or premature in these situations:

First-time incidents with unknown root cause — you cannot write steps for a situation you have not yet diagnosed. Run the incident response process first; write the runbook after the postmortem.
One-off manual tasks — if a task runs once per year, a runbook adds maintenance overhead without providing repetition value. Document it in a wiki page instead.
Development workflows — runbooks are operational artifacts for production systems, not development checklists or CI/CD procedures.
Speculative scenarios — a runbook written before the first incident is speculation. The first version should emerge from the postmortem of the first occurrence.

Runbook vs Playbook

These terms are often conflated. The distinction matters because they operate at different scopes.

Dimension	Runbook	Playbook
Scope	One specific alert or symptom	A class of incident situation
Trigger	"Service X 500 error rate exceeded threshold"	"Database tier outage"
Depth	Narrow, step-by-step, executable	Broad, references multiple runbooks
Audience	On-call engineer with 5 minutes to spare	Incident commander coordinating response
Format	Decision tree + copy-pasteable commands	Sections for stakeholders, comms, coordination

Relationship: A playbook invokes runbooks. A runbook is executable without its playbook; a playbook without runbooks is just a narrative.

Analogy: playbook = training manual for a class of situation; runbook = checklist for a specific procedure.

Pitfall: Calling every procedure a "runbook" collapses the distinction. When responders cannot tell at a glance whether they are reading a step-by-step procedure or a coordination guide, they lose time in high-stakes moments.
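One way to keep the two document types distinguishable at a glance is to tag them explicitly in whatever index tracks them. The sketch below is a hypothetical model, not part of any existing tool: the `kind` field, `runbookKeys`, and `missingRunbooks` are names invented for illustration. It also shows the "playbook invokes runbooks" relationship as a checkable reference.

```typescript
// Hypothetical document model: the `kind` tag makes the scope explicit.
interface Runbook {
  kind: "runbook";
  key: string;     // e.g. "service-x-500-rate"
  trigger: string; // one specific alert or symptom
}

interface Playbook {
  kind: "playbook";
  key: string;
  situation: string;     // a class of incident, e.g. "Database tier outage"
  runbookKeys: string[]; // runbooks this playbook invokes
}

// Returns the runbook keys a playbook references that do not exist.
// A playbook whose references all dangle is, per the note, "just a narrative".
function missingRunbooks(playbook: Playbook, runbooks: Runbook[]): string[] {
  const known = runbooks.map((r) => r.key);
  return playbook.runbookKeys.filter((key) => known.indexOf(key) === -1);
}
```

Running a check like this in CI would catch a playbook that points at a runbook which was renamed or deleted.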

Four-Section Anatomy

Every runbook must contain exactly these four sections, in this order. A section may be short, but it may not be absent.
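The ordering rule lends itself to a mechanical check. The following is a minimal sketch under one assumption: runbooks are stored as Markdown with each section as a level-3 heading, as in the template later in this note. The function name `checkSections` is hypothetical.

```typescript
// The four required sections, in the required order.
const REQUIRED_SECTIONS: readonly string[] = ["Context", "Diagnostics", "Remediation", "Escalation"];

// Returns a list of problems; an empty list means the runbook passes.
// Assumption: each section is a level-3 Markdown heading ("### Context").
function checkSections(markdown: string): string[] {
  const found: string[] = [];
  for (const line of markdown.split("\n")) {
    const match = /^###\s+(.+?)\s*$/.exec(line);
    if (match && match[1] !== undefined) found.push(match[1]);
  }

  const problems: string[] = [];
  for (const name of REQUIRED_SECTIONS) {
    if (found.indexOf(name) === -1) problems.push(`missing section: ${name}`);
  }
  for (const name of found) {
    if (REQUIRED_SECTIONS.indexOf(name) === -1) problems.push(`unexpected section: ${name}`);
  }
  // Only check order once the set of sections is otherwise correct.
  if (problems.length === 0 && found.join(",") !== REQUIRED_SECTIONS.join(",")) {
    problems.push(`sections duplicated or out of order: ${found.join(", ")}`);
  }
  return problems;
}
```

A short section still passes; an absent one does not, which matches the rule above.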

Section 1 — Context

Purpose: orient the responder before they touch anything.

Alert name and description: what fired, what the condition means in plain English
Service scope: which service, which environment, which region
SLO impact: which SLI is affected, current burn rate — link to SLO-SLI-SLA for burn rate formula
Links: primary dashboard URL, recent deploy history, related alerts, prior incidents

Rule: The responder should be able to answer "what is broken and how bad is it?" from Section 1 alone, before issuing any commands.

Section 2 — Diagnostics

Purpose: guide the responder from symptom to probable cause via structured investigation.

Initial triage: the first three commands to run — log query, metric check, health endpoint
Decision tree: symptom → likely cause → next step (see Mermaid template below)
Data to collect: specific log queries, metric queries, trace queries to gather before acting
Scope guidance: what this runbook covers versus when to jump to a different runbook

Pitfall: Embedding dashboard screenshots or raw log output in the runbook creates maintenance debt, because both go stale as the system changes. Link to the live dashboards and queries instead. The runbook documents the questions, not the answers.

Section 3 — Remediation

Purpose: resolve the probable cause identified in Section 2.

Per-cause remediation: for each branch in the decision tree, specific steps
Copy-pasteable commands: exact CLI commands, not paraphrases — responders under pressure make transcription errors
Rollback instructions: if the remediation itself can worsen the situation, include revert steps
Expected outcome: after each step, state what the responder should observe: "After restarting the pool, error rate should drop within 60 seconds"

Rule: If a responder needs to interpret or adapt a step, the step is not yet written. Steps that require judgment belong in diagnostics.
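The elements of a remediation step (exact command, expected outcome, optional rollback) can be captured in a small structure so that none is forgotten. This is an illustrative sketch only: the type and function names are invented here, and `poolctl` is a placeholder, not a real CLI.

```typescript
interface RemediationStep {
  command: string;         // exact, copy-pasteable CLI command
  expectedOutcome: string; // what the responder should observe afterwards
  rollback?: string;       // revert command, when the step can make things worse
}

// Renders the steps for one probable cause as numbered, copy-pasteable text.
function renderRemediation(cause: string, steps: RemediationStep[]): string {
  const lines: string[] = [`Cause: ${cause}`];
  steps.forEach((step, index) => {
    lines.push(`${index + 1}. Run: ${step.command}`);
    lines.push(`   Expect: ${step.expectedOutcome}`);
    if (step.rollback !== undefined) {
      lines.push(`   Rollback: ${step.rollback}`);
    }
  });
  return lines.join("\n");
}

// Example with a placeholder command; `poolctl` is hypothetical.
const exampleRemediation = renderRemediation("Connection pool exhausted", [
  {
    command: "poolctl restart primary",
    expectedOutcome: "error rate drops within 60 seconds",
    rollback: "poolctl restore primary",
  },
]);
```

Making `expectedOutcome` a required field means a step without an observable result fails type checking rather than review.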

Section 4 — Escalation

Purpose: define the conditions under which the runbook is no longer sufficient and a handoff is required.

When to escalate: explicit conditions that exceed this runbook's scope (e.g., "If error rate does not drop after completing all remediation steps")
Who to escalate to: role, not person name — roles survive team changes
Information to hand off: what to include in the escalation message — current state, steps already attempted, data collected
Incident declaration criteria: "If X and Y are both true, declare a P1 incident using Incident-Response"

Pitfall: Runbooks without escalation paths create decision paralysis. Responders who cannot solve a problem need a clear exit, not an open-ended "try harder."
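The handoff contents listed above can be assembled into a consistent message so the receiving role never has to ask what was already tried. The sketch below assumes nothing about the paging or chat tool in use; `EscalationHandoff` and `buildHandoffMessage` are hypothetical names.

```typescript
interface EscalationHandoff {
  escalateTo: string;       // role, not a person's name
  currentState: string;     // e.g. current error rate
  stepsAttempted: string[]; // remediation steps already tried
  dataCollected: string[];  // links or log excerpts gathered in diagnostics
}

// Formats the handoff as plain text suitable for a page or chat message.
function buildHandoffMessage(h: EscalationHandoff): string {
  const list = (items: string[]): string =>
    items.length === 0 ? "  (none)" : items.map((i) => `  - ${i}`).join("\n");
  return [
    `Escalating to: ${h.escalateTo}`,
    `Current state: ${h.currentState}`,
    "Steps attempted:",
    list(h.stepsAttempted),
    "Data collected:",
    list(h.dataCollected),
  ].join("\n");
}
```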

Mermaid Decision Tree (Section 2 Pattern)

The diagnostic decision tree is the backbone of Section 2. It externalises the expert's branching logic into a navigable structure.

flowchart TD
    A[Alert fires: Service X\n500 error rate > threshold] --> B{Check error logs:\nwhich component?}
    B -- Database timeout --> C{Connection pool\nexhausted?}
    B -- Upstream dependency --> D[Check dependency\nhealth endpoint]
    B -- Application exception --> E[Check recent\ndeploy history]
    C -- Yes --> F[Scale connection pool\nor restart pool]
    C -- No --> G[Check DB query\nperformance]
    D -- Unhealthy --> H[Activate circuit breaker\nsee Runbook: Dep-X]
    D -- Healthy --> L
    E -- Recent deploy --> I[Rollback deploy\nusing rollback runbook]
    E -- No recent deploy --> L
    F --> J{Error rate\ndropped?}
    G --> J
    H --> J
    I --> J
    J -- Yes --> K[Monitor 15 min\nthen resolve]
    J -- No --> L[Escalate to\nSection 4]

Tree design rules:

Root node: the exact alert condition
First branch: the highest-signal distinguishing question
Leaf nodes: either a concrete action or "Escalate to Section 4"
Maximum depth: 4 levels — deeper trees are a signal the runbook covers too many scenarios
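The design rules above can be checked automatically if the tree is also kept as data. The sketch below is one possible representation, with two assumptions that are not stated in this note: depth is counted in question nodes along the longest path, and every question needs at least two branches so that no answer is a dead end. All names are hypothetical.

```typescript
// A node is a question with labelled branches, a concrete action, or an escalation.
type DiagnosticNode =
  | { kind: "question"; text: string; branches: { answer: string; next: DiagnosticNode }[] }
  | { kind: "action"; text: string }
  | { kind: "escalate" };

// Depth counted in question nodes along the longest path (an assumption).
function questionDepth(node: DiagnosticNode): number {
  if (node.kind !== "question") return 0;
  let deepest = 0;
  for (const branch of node.branches) {
    deepest = Math.max(deepest, questionDepth(branch.next));
  }
  return 1 + deepest;
}

// Returns a list of rule violations; an empty list means the tree passes.
function checkTree(root: DiagnosticNode, maxDepth = 4): string[] {
  const problems: string[] = [];
  const visit = (node: DiagnosticNode): void => {
    if (node.kind !== "question") return; // leaves are actions or escalations by construction
    if (node.branches.length < 2) {
      problems.push(`question has fewer than two branches: ${node.text}`);
    }
    node.branches.forEach((b) => visit(b.next));
  };
  visit(root);
  const depth = questionDepth(root);
  if (depth > maxDepth) {
    problems.push(`tree depth ${depth} exceeds ${maxDepth}; consider splitting the runbook`);
  }
  return problems;
}
```

Because the type only allows `action` or `escalate` as leaves, the leaf-node rule is enforced by the compiler rather than by the checker.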

Minimal Runbook Template

Copy this skeleton when creating a new runbook:

## [ALERT_NAME] Runbook
 
### Context
- **Alert**: [alert name and description]
- **Service**: [service name] / [environment]
- **SLO impact**: [which SLI, current burn rate]
- **Dashboard**: [URL]
- **Recent deploys**: [deploy history link]
 
### Diagnostics
 
[Decision tree or numbered diagnostic steps]
 
1. Check [first signal]: `[command]`
2. If [condition], proceed to Cause A remediation
3. If [other condition], proceed to Cause B remediation
 
### Remediation
 
**Cause A: [probable cause description]**
1. [Exact step with copy-pasteable command]
2. [Expected outcome: "You should see X within Y seconds"]
 
**Cause B: [probable cause description]**
1. [Exact step]
2. [Rollback if needed: `[rollback command]`]
 
### Escalation
- **Escalate if**: [specific condition, e.g., "error rate does not drop after Cause A and B remediation"]
- **Escalate to**: [role, e.g., "Database on-call"]
- **Include in handoff**: current error rate, steps attempted, relevant log lines
- **Declare P1 if**: [condition] — use [[Incident-Response]]

Living Document Pattern

A runbook that is not updated after incidents becomes misinformation.

Post-incident update triggers:

Any step was missing from the runbook
Any step was wrong or misleading
Any step took more than twice the expected time
The responder had to ask a question not answered by the runbook
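These four triggers can be evaluated from a structured post-incident review. The following is a sketch under the assumption that the review records per-step timings and the responder's open questions; the `IncidentReview` shape and `updateReasons` are invented for illustration.

```typescript
interface StepReview {
  step: string;
  expectedSeconds: number;
  actualSeconds: number;
  wasWrongOrMisleading: boolean;
}

interface IncidentReview {
  missingSteps: string[];        // steps the responder needed but the runbook lacked
  unansweredQuestions: string[]; // questions the runbook did not answer
  steps: StepReview[];
}

// Returns one reason per trigger that fired; an empty list means no update is required.
function updateReasons(review: IncidentReview): string[] {
  const reasons: string[] = [];
  review.missingSteps.forEach((s) => reasons.push(`missing step: ${s}`));
  review.steps.forEach((s) => {
    if (s.wasWrongOrMisleading) reasons.push(`wrong or misleading step: ${s.step}`);
    // "More than twice the expected time" is a strict comparison.
    if (s.actualSeconds > 2 * s.expectedSeconds) reasons.push(`slow step: ${s.step}`);
  });
  review.unansweredQuestions.forEach((q) => reasons.push(`unanswered question: ${q}`));
  return reasons;
}
```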

Quarterly review checklist:

Are all commands still valid in the current environment?
Do dashboard URLs still resolve?
Are escalation roles still accurate?
Has any step been automated since the last review?
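The checklist itself needs a human, but finding which runbooks are overdue for it does not. The sketch below assumes each runbook records an ISO `lastReviewed` date, as the registry entries later in this note do, and approximates "quarterly" as 92 days; both the threshold and the function name are assumptions.

```typescript
interface ReviewedRunbook {
  alertName: string;
  lastReviewed: string; // ISO date, e.g. "2026-03-01"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the alert names whose last review is older than maxAgeDays.
// Assumption: "quarterly" is approximated as 92 days.
function overdueForReview(entries: ReviewedRunbook[], today: Date, maxAgeDays = 92): string[] {
  return entries
    .filter((e) => {
      const reviewed = Date.parse(e.lastReviewed);
      // Treat unparseable dates as overdue so they get human attention.
      if (isNaN(reviewed)) return true;
      return (today.getTime() - reviewed) / MS_PER_DAY > maxAgeDays;
    })
    .map((e) => e.alertName);
}
```

Passing `today` in as a parameter, rather than reading the clock inside the function, keeps the check deterministic in tests.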

Signal that a runbook needs splitting: if a single runbook covers more than three distinct probable causes, it is covering too many scenarios. Split by cause.

Automation Target Pattern

When a runbook step is executed by humans more than twice per month:

Flag it as an automation candidate
Implement the automation (script, CI job, alert auto-remediation action)
Update the runbook step to: "Automation X handles this; if automation fails, manual steps are: [steps]"

The runbook step does not disappear — it becomes the fallback for when automation fails. Runbooks survive automation failures.
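Flagging candidates requires knowing how often each step is actually run by hand. The sketch below assumes responders' manual executions are logged somewhere with a timestamp; the `StepExecution` shape and `automationCandidates` are hypothetical, and the default threshold of 2 mirrors the "more than twice per month" rule above.

```typescript
interface StepExecution {
  runbook: string;
  step: string;
  executedAt: string; // ISO timestamp, e.g. "2026-03-14T09:30:00Z"
}

// Returns "runbook: step" labels executed manually more than `threshold`
// times in the given month (formatted "YYYY-MM").
function automationCandidates(log: StepExecution[], month: string, threshold = 2): string[] {
  const counts: Record<string, number> = {};
  for (const entry of log) {
    if (entry.executedAt.slice(0, 7) !== month) continue;
    const key = `${entry.runbook}: ${entry.step}`;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return Object.keys(counts).filter((key) => (counts[key] ?? 0) > threshold);
}
```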

TypeScript — Runbook Registry Pattern

interface RunbookEntry {
  alertName: string;
  url: string;
  sloImpact: string[];   // which SLIs are affected
  owner: string;         // team role, not person
  lastReviewed: string;  // ISO date
}
 
const runbookRegistry: Record<string, RunbookEntry> = {
  "service-x-500-rate": {
    alertName: "Service X 500 Error Rate Exceeded",
    url: "https://wiki.example.com/runbooks/service-x-500",
    sloImpact: ["availability"],
    owner: "backend-on-call",
    lastReviewed: "2026-03-01",
  },
};
 
// Alert routing function that surfaces runbook link in notification
function buildAlertNotification(alertName: string): string {
  const entry = runbookRegistry[alertName];
  if (!entry) return `Alert: ${alertName}\nNo runbook found — check wiki`;
  return [
    `Alert: ${entry.alertName}`,
    `Runbook: ${entry.url}`,
    `Owner: ${entry.owner}`,
    `SLO Impact: ${entry.sloImpact.join(", ")}`,
  ].join("\n");
}

Java — Runbook Registry Pattern

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public record RunbookEntry(
    String alertName,
    String url,
    List<String> sloImpact,
    String owner,
    String lastReviewed
) {}
 
public class RunbookRegistry {
    private final Map<String, RunbookEntry> entries = new HashMap<>();
 
    public void register(String alertKey, RunbookEntry entry) {
        entries.put(alertKey, entry);
    }
 
    public String buildAlertNotification(String alertKey) {
        RunbookEntry entry = entries.get(alertKey);
        if (entry == null) {
            return "Alert: " + alertKey + "\nNo runbook found — check wiki";
        }
        return String.join("\n",
            "Alert: " + entry.alertName(),
            "Runbook: " + entry.url(),
            "Owner: " + entry.owner(),
            "SLO Impact: " + String.join(", ", entry.sloImpact())
        );
    }
}

Backlinks

Alerting-Strategies — every P1/P2 alert links to a runbook; the alert annotation is the trigger for runbook lookup
Incident-Response — Section 4 (escalation) triggers incident declaration; the incident commander role presupposes runbooks exist
On-Call-Practices — runbook discipline is the infrastructure of on-call; runbooks are how tribal knowledge becomes executable procedure
SLO-SLI-SLA — Section 1 (context) references the current burn rate from the SLO framework
