Alerting Strategies

Alerting translates SLO burn rate signals into actionable human responses. Effective alerting is not about capturing every anomaly — it is about surfacing the right signal at the right severity level so the on-call engineer can act before the error budget is exhausted. The alert pipeline runs: SLI computation → burn rate → multi-window threshold evaluation → severity tier → on-call routing.

Scope: This note covers symptom vs cause alert classification, multi-window multi-burn-rate alerting with both time windows specified, the alert severity taxonomy, routing patterns, and runbook linkage. Burn rate derivation and error budget math are in SLO-SLI-SLA. Runbook anatomy is in Runbook-Design.


When NOT to Use (Specific Alerting Patterns)

Cause-based alerts as primary (wake-up) pages:

  • CPU high, memory elevated, queue depth growing — these are internal signals, not evidence of user impact.
  • Cause-based primary alerts have a high false-positive rate: CPU spikes during scheduled jobs, memory grows during normal batch processing.
  • The on-call engineer is woken for a cause that resolves itself before investigation begins, burning goodwill and attention.
  • Use cause-based signals as low-severity (P3/P4) diagnostics, not as P1 pages.

Single-window burn rate alerts:

  • A single observation window cannot simultaneously achieve low false-positive and low false-negative rates.
  • A tight threshold on a short window catches every transient spike — deployment blips, brief traffic bursts, health check failures — as false positives.
  • A loose threshold on a long window misses fast-burning incidents that exhaust the monthly budget in hours.
  • Single-window alerting forces an unresolvable sensitivity tradeoff. See the multi-window pattern below.
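
The tradeoff can be made concrete with a small simulation. The sketch below is illustrative only: the per-minute error rates are invented sample data, the 99.9% SLO is an assumption, and `windowBurnRate` is a hypothetical helper rather than part of any monitoring library.

```typescript
// Assumed SLO of 99.9%, so the allowed error rate is 0.1%.
const ALLOWED_ERROR_RATE = 0.001;

/** Mean burn rate over the last `minutes` samples of a per-minute error-rate series. */
function windowBurnRate(errorRates: number[], minutes: number): number {
  const recent = errorRates.slice(-minutes);
  const mean = recent.reduce((sum, rate) => sum + rate, 0) / recent.length;
  return mean / ALLOWED_ERROR_RATE;
}

// Invented data: 55 healthy minutes (0.05% errors), then a 5-minute spike (2% errors).
const transientSpike: number[] = [];
for (let minute = 0; minute < 60; minute++) {
  transientSpike.push(minute < 55 ? 0.0005 : 0.02);
}

const shortBurn = windowBurnRate(transientSpike, 5);  // about 20x
const longBurn = windowBurnRate(transientSpike, 60);  // about 2.1x

// A 5min-only rule at 14.4x pages on the spike; requiring the 1h window as well does not.
const singleWindowFires = shortBurn > 14.4;
const multiWindowFires = shortBurn > 14.4 && longBurn > 14.4;
```

With this data the single-window rule fires and the two-window rule stays quiet, which is the behaviour the bullets above describe for transient spikes.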

"Alert on everything" approach:

  • Alert fatigue is the systematic desensitisation of on-call engineers caused by a high volume of low-signal pages.
  • When every metric threshold fires a page, the engineer learns to dismiss alerts before investigating.
  • The result: real incidents are missed because the signal is buried in noise.
  • Govern alert quantity: if a service's page rate exceeds three pages per shift, treat that as a reliability signal that calls for engineering investment, not more alert rules.
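
One way to enforce the page-rate guideline is to count pages per service per shift. The following is a minimal sketch; the `PageEvent` shape, the `noisyServices` helper, and the 12-hour shift length in the usage lines are assumptions made for illustration.

```typescript
/** Hypothetical page record; `firedAt` is epoch milliseconds. */
interface PageEvent {
  service: string;
  firedAt: number;
}

const MAX_PAGES_PER_SHIFT = 3; // limit taken from the guideline above

/** Returns services whose page count within [shiftStart, shiftEnd) exceeds the limit. */
function noisyServices(pages: PageEvent[], shiftStart: number, shiftEnd: number): string[] {
  const counts: Record<string, number> = {};
  for (const page of pages) {
    if (page.firedAt >= shiftStart && page.firedAt < shiftEnd) {
      counts[page.service] = (counts[page.service] ?? 0) + 1;
    }
  }
  return Object.keys(counts).filter(
    (service) => (counts[service] ?? 0) > MAX_PAGES_PER_SHIFT,
  );
}

// Usage with invented data: four checkout pages and one search page in one shift.
const shiftEnd = 12 * 60 * 60 * 1000; // assumed 12-hour shift starting at t = 0
const samplePages: PageEvent[] = [
  { service: "checkout", firedAt: 1_000 },
  { service: "checkout", firedAt: 2_000 },
  { service: "checkout", firedAt: 3_000 },
  { service: "checkout", firedAt: 4_000 },
  { service: "search", firedAt: 5_000 },
];
const flagged = noisyServices(samplePages, 0, shiftEnd); // ["checkout"]
```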

Symptom-Based vs Cause-Based Alerts

This is the foundational distinction in alert classification. Every alert can be placed on the spectrum from pure symptom to pure cause.

Symptom-Based Alerts

A symptom-based alert fires on evidence of user impact — something the user observes, not something the system internally records.

| Symptom Signal | Example Alert Condition |
|---|---|
| User-facing error rate | 5xx rate > 1% on /api/checkout for 5 minutes |
| User-facing latency | p99 latency > 2s on user-facing endpoints |
| Successful request rate drop | RPS on /api/orders drops > 30% below 7-day baseline |
| SLO burn rate | Multi-window burn rate > 14.4x simultaneously on 1h and 5min windows |
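
Two of the symptom conditions above can be sketched as plain predicates. This is an illustration under stated assumptions: the helper names, the per-minute sampling, and the sample values are all hypothetical, and a real system would evaluate these conditions in its monitoring backend.

```typescript
/** True when every one of the last `minutes` per-minute error-rate samples exceeds `threshold`. */
function errorRateSustained(samples: number[], threshold: number, minutes: number): boolean {
  if (samples.length < minutes) return false; // not enough data to decide
  return samples.slice(-minutes).every((rate) => rate > threshold);
}

/** True when current RPS is more than `dropFraction` below the baseline. */
function requestRateDropped(currentRps: number, baselineRps: number, dropFraction: number): boolean {
  if (baselineRps <= 0) return false; // no meaningful baseline
  return currentRps < baselineRps * (1 - dropFraction);
}

// Invented per-minute 5xx rates for /api/checkout; the last five all exceed 1%.
const checkout5xx = [0.002, 0.015, 0.02, 0.018, 0.03, 0.012];
const checkoutAlert = errorRateSustained(checkout5xx, 0.01, 5); // true

// Invented RPS values against a 7-day baseline of 1000 RPS.
const ordersAlert = requestRateDropped(650, 1000, 0.3); // true: more than 30% below baseline
const ordersQuiet = requestRateDropped(750, 1000, 0.3); // false: only 25% below baseline
```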

Cause-Based Alerts

A cause-based alert fires on an internal signal that might lead to user impact but does not yet confirm it.

| Cause Signal | Example Alert Condition |
|---|---|
| CPU utilization | CPU > 80% for 10 minutes |
| Memory utilization | Heap usage > 85% |
| Queue depth | Message queue depth > 10,000 |
| Connection pool exhaustion | DB pool active > 90% of max |

The Rule

Alert on symptoms, investigate causes.

Cause-based alerts are diagnostic tools. They answer "what changed" after a symptom alert fires. They are not appropriate as primary wake-up pages because a cause with no user impact is not an incident.

Exception: cause-based alerts are acceptable as P3 or P4 warnings — low-severity notifications that create a ticket for the team to review during business hours, with no paging.

SLO connection: symptom-based alerts are SLI threshold violations — the SLI ratio has degraded to the point that the error budget is burning. Cause-based alerts are implementation details of the service internals.
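
The rule and its exception lend themselves to an automated check over alert definitions. The sketch below assumes a hypothetical `AlertRule` shape with a `kind` label; nothing here corresponds to a specific alerting tool.

```typescript
type AlertKind = "symptom" | "cause";
type SeverityTier = "P1" | "P2" | "P3" | "P4";

/** Hypothetical alert definition carrying its classification and severity. */
interface AlertRule {
  name: string;
  kind: AlertKind;
  tier: SeverityTier;
}

/** Returns names of cause-based rules assigned above P3, which violates "alert on symptoms". */
function misclassifiedRules(rules: AlertRule[]): string[] {
  return rules
    .filter((rule) => rule.kind === "cause" && (rule.tier === "P1" || rule.tier === "P2"))
    .map((rule) => rule.name);
}

// Usage with invented rules: only the CPU page breaks the rule.
const sampleRules: AlertRule[] = [
  { name: "checkout-5xx-burn", kind: "symptom", tier: "P1" },
  { name: "cpu-high", kind: "cause", tier: "P1" },
  { name: "queue-depth", kind: "cause", tier: "P3" },
];
const violations = misclassifiedRules(sampleRules); // ["cpu-high"]
```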


Alert Severity Taxonomy

Severity tiers define the expected human response time and the cost of that response to the engineer's time and sleep.

| Tier | Name | Trigger | Response Expectation |
|---|---|---|---|
| P1 | Critical / Page | Burn rate > 14.4x on both 1h and 5min windows | Wake-up page; immediate response any hour; escalate in 10 min if no ack |
| P2 | High / Ticket+Notify | Burn rate > 6x on both 6h and 30min windows | Response within business hours; on-call notified via chat; no sleep disruption |
| P3 | Warning | Burn rate > 2x on 3h window | Monitor; create team backlog ticket; no immediate human action required |
| P4 | Informational | Metrics anomaly with no SLO impact | Log to observability platform; no human notification |

Calibration principle: Severity is proportional to budget consumption rate, not metric magnitude. A 50% CPU spike that causes no error rate increase is P4. Against a 99.9% SLO, a 1.44% error rate burns budget at 14.4x and is P1 if it is sustained across both windows.
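
The taxonomy can be read as a function from burn rates to a tier. The sketch below hard-codes the triggers from the table above; the `WindowBurnRates` shape and its field names are assumptions. Note that no cause metric (CPU, memory) appears in the input, which is the calibration principle in code form.

```typescript
/** Burn rates per observation window; the field names are illustrative. */
interface WindowBurnRates {
  m5: number;  // 5-minute window
  m30: number; // 30-minute window
  h1: number;  // 1-hour window
  h3: number;  // 3-hour window
  h6: number;  // 6-hour window
}

/** Maps burn rates to the highest matching tier, checking the most severe tier first. */
function classifySeverity(b: WindowBurnRates): "P1" | "P2" | "P3" | "P4" {
  if (b.m5 > 14.4 && b.h1 > 14.4) return "P1";
  if (b.m30 > 6 && b.h6 > 6) return "P2";
  if (b.h3 > 2) return "P3";
  return "P4";
}

// Invented readings.
const fastBurn = classifySeverity({ m5: 16.2, m30: 15.5, h1: 15.0, h3: 5.8, h6: 3.0 });   // "P1"
const slowBurn = classifySeverity({ m5: 7.0, m30: 6.8, h1: 6.5, h3: 6.4, h6: 6.2 });      // "P2"
const warning = classifySeverity({ m5: 1.0, m30: 2.2, h1: 2.0, h3: 2.5, h6: 1.5 });       // "P3"
const cpuSpikeOnly = classifySeverity({ m5: 0.3, m30: 0.3, h1: 0.3, h3: 0.3, h6: 0.3 });  // "P4"
```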


Multi-Window Multi-Burn-Rate Alerting

This is the canonical alerting pattern for SLO-based services. An alert fires only when two time windows exceed the burn rate threshold at the same time. A single-window approach is explicitly the anti-pattern.

Why Two Windows Are Required

| Window | Strength | Weakness if Used Alone |
|---|---|---|
| Short window (5min / 30min) | Detects fast-burning incidents quickly | High false-positive rate from transient spikes |
| Long window (1h / 6h) | Confirms sustained burn; filters transient noise | Slow to react; misses fast burns early in the incident |

The rule: the alert fires only when BOTH the short window AND the long window simultaneously exceed their respective thresholds.

P1 alert condition:
  burn_rate(short_window = 5min)  > 14.4
  AND
  burn_rate(long_window  = 1h)    > 14.4

The short window detects the incident. The long window confirms it is not a transient spike.

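The following four rules cover fast and slow burn for both paging and ticketing alerts. Budget figures assume a 30-day SLO window.

| Alert | Short Window | Long Window | Burn Rate | Severity | Budget Consumed at Alert |
|---|---|---|---|---|---|
| Fast page | 5 min | 1 hour | 14.4x | P1 | ~2% in 1h |
| Slow page | 30 min | 6 hours | 6x | P2 | ~5% in 6h |
| Fast ticket | 1 hour | 3 hours | 3x | P3 | ~1.25% in 3h |
| Slow ticket | 3 hours | 24 hours | 1x | P3 | ~3.3% in 24h (on pace to spend exactly the full budget over the window) |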
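
The "Budget Consumed at Alert" column follows from simple arithmetic: burn rate multiplied by the long window, divided by the SLO period. A minimal sketch, assuming a 30-day (720-hour) SLO window; `budgetConsumed` is a hypothetical helper.

```typescript
const SLO_PERIOD_HOURS = 30 * 24; // 720 hours, assuming a 30-day SLO window

/** Fraction of the period's error budget spent by burning at `burnRate` for `hours`. */
function budgetConsumed(burnRate: number, hours: number): number {
  return (burnRate * hours) / SLO_PERIOD_HOURS;
}

const fastPageBudget = budgetConsumed(14.4, 1); // about 0.02, i.e. 2% in 1h
const slowPageBudget = budgetConsumed(6, 6);    // 0.05, i.e. 5% in 6h
```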

Fast-Burn Alert (P1) — Worked Example

Scenario: 99.9% SLO, 30-day window.

Short window: 5 minutes
  observed_error_rate / allowed_error_rate > 14.4
  → At burn rate 14.4, monthly budget exhausts in 2.08 days

Long window: 1 hour
  Same burn rate threshold: 14.4

Alert fires: ONLY when both windows are simultaneously above 14.4

The 5min window catches the incident within minutes. The 1h window confirms the rate is sustained, not a deployment artifact. When both exceed the threshold simultaneously, the on-call engineer has ~2 days of budget remaining — enough time to investigate and mitigate before the SLO is breached.
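
The exhaustion figure in this example is the SLO period divided by the burn rate. A short sketch of that calculation, with `daysToExhaustion` as a hypothetical helper and a constant burn rate assumed:

```typescript
/** Days until a full budget is exhausted at a constant burn rate, for an SLO period of `periodDays`. */
function daysToExhaustion(burnRate: number, periodDays: number): number {
  if (burnRate <= 0) return Infinity; // nothing is burning, so the budget never runs out
  return periodDays / burnRate;
}

const fastBurnDays = daysToExhaustion(14.4, 30); // about 2.08 days
const slowBurnDays = daysToExhaustion(6, 30);    // 5 days
```

Real incidents rarely burn at a constant rate, so the result is best read as an order-of-magnitude estimate of how much time remains.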

Slow-Burn Alert (P2) — Worked Example

Short window: 30 minutes
  burn_rate > 6x
  → At burn rate 6, monthly budget exhausts in 5 days

Long window: 6 hours
  Same burn rate threshold: 6x

Alert fires: ONLY when both windows are simultaneously above 6x

At this rate, the budget will not be exhausted in hours. Response within business hours is appropriate. The 6h window filters short-lived issues from deployments, making this alert high-signal.

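Burn rate is defined as burn_rate = observed_error_rate / (1 - SLO_target). The thresholds follow from the SLO window: 14.4x spends 2% of a 30-day budget in 1 hour (0.02 * 720h / 1h), and 6x spends 5% in 6 hours (0.05 * 720h / 6h). See SLO-SLI-SLA for the full derivation.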
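
Both the burn rate formula and the threshold arithmetic are short enough to sketch directly. The function names and the request counts below are assumptions for illustration; the formula itself is the one stated above.

```typescript
/** burn_rate = observed_error_rate / (1 - SLO_target), computed from request counts. */
function burnRateFromCounts(failed: number, total: number, sloTarget: number): number {
  if (total === 0) return 0; // no traffic, so no measurable burn
  const observedErrorRate = failed / total;
  return observedErrorRate / (1 - sloTarget);
}

/** Burn rate that spends `budgetFraction` of the budget within `alertWindowHours`. */
function burnRateThreshold(
  budgetFraction: number,
  alertWindowHours: number,
  sloPeriodHours: number,
): number {
  return (budgetFraction * sloPeriodHours) / alertWindowHours;
}

// 144 failures in 10,000 requests against a 99.9% SLO.
const observedBurn = burnRateFromCounts(144, 10_000, 0.999); // about 14.4

// Thresholds for a 30-day (720h) SLO window.
const fastThreshold = burnRateThreshold(0.02, 1, 720); // about 14.4
const slowThreshold = burnRateThreshold(0.05, 6, 720); // about 6
```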


Alert Routing Pattern

Every alert severity tier routes to a different response channel. The routing below is conceptual: the tools named are examples, and no tool-specific configuration is implied.

| Severity | Routing | Channel | Escalation |
|---|---|---|---|
| P1 | Page on-call engineer (any hour) | PagerDuty / OpsGenie voice call | If no ack in 10 min: page secondary, then manager |
| P2 | Create incident ticket; notify on-call via chat | Slack / Teams channel | If no response in 2h: escalate to team lead |
| P3 | Create ticket in team backlog | Jira / Linear | No escalation; reviewed in next sprint |
| P4 | Log to observability platform | Grafana annotation / log entry | No human notification |

Routing discipline: every alert definition must specify exactly one target tier. Alerts without tier assignment are P4 by default (log-only). Undifferentiated routing — where all alerts go to Slack — is an anti-pattern that creates alert fatigue.
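
The routing table and the default-to-P4 rule can be expressed as a lookup. This is a tool-agnostic sketch: the `Route` shape and `routeAlert` function are hypothetical, and the strings simply restate the table above.

```typescript
type RoutedTier = "P1" | "P2" | "P3" | "P4";

interface Route {
  action: string;
  escalation: string;
}

// One route per tier, restating the routing table.
const ROUTES: Record<RoutedTier, Route> = {
  P1: { action: "page on-call engineer", escalation: "no ack in 10 min: page secondary, then manager" },
  P2: { action: "create incident ticket and notify on-call via chat", escalation: "no response in 2h: escalate to team lead" },
  P3: { action: "create ticket in team backlog", escalation: "none" },
  P4: { action: "log to observability platform", escalation: "none" },
};

/** Alerts without a tier fall back to P4 (log-only), per the routing discipline. */
function routeAlert(alert: { name: string; tier?: RoutedTier }): { tier: RoutedTier; route: Route } {
  const tier: RoutedTier = alert.tier ?? "P4";
  return { tier, route: ROUTES[tier] };
}

const paged = routeAlert({ name: "checkout-fast-burn", tier: "P1" });
const untiered = routeAlert({ name: "disk-usage-note" }); // falls back to P4
```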


Runbook Linkage

Every P1 and P2 alert must carry a runbook_url field in its metadata. An alert without a runbook link is incomplete.

Why: on-call engineers, especially those unfamiliar with the service, need a documented investigation path. Alerts that lack runbooks either result in slow, costly investigation or in escalation to the service owner at 2am.

Standard: include both fields in every P1 and P2 alert definition:

runbook_url: https://internal-docs/runbooks/service-name/alert-name
description: "Brief description of what this alert means and initial triage steps"

For the anatomy of a well-designed runbook (symptoms → diagnosis → remediation → escalation path), see Runbook-Design.
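
The linkage requirement can be checked mechanically, for example in CI. The sketch below assumes a hypothetical `AlertDefinition` shape whose `runbookUrl` mirrors the runbook_url field; it is not tied to any particular alerting tool's schema.

```typescript
/** Hypothetical alert definition; `runbookUrl` mirrors the runbook_url metadata field. */
interface AlertDefinition {
  name: string;
  tier: "P1" | "P2" | "P3" | "P4";
  runbookUrl?: string;
  description?: string;
}

/** Returns the names of P1/P2 alerts whose runbook URL is missing or blank. */
function alertsMissingRunbook(definitions: AlertDefinition[]): string[] {
  return definitions
    .filter((def) => def.tier === "P1" || def.tier === "P2")
    .filter((def) => def.runbookUrl === undefined || def.runbookUrl.trim() === "")
    .map((def) => def.name);
}

// Usage with invented definitions; the URL follows the placeholder pattern shown above.
const definitions: AlertDefinition[] = [
  { name: "checkout-fast-burn", tier: "P1", runbookUrl: "https://internal-docs/runbooks/checkout/fast-burn" },
  { name: "orders-slow-burn", tier: "P2" },
  { name: "cpu-high", tier: "P4" },
];
const incomplete = alertsMissingRunbook(definitions); // ["orders-slow-burn"]
```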


Mermaid Decision Tree

flowchart TD
    A[Anomaly detected] --> B{User impact\nobservable?}
    B -- Yes --> C[Symptom-based alert]
    B -- No --> D[Cause-based diagnostic]
    C --> E{Burn rate\nthreshold?}
    E -- ">14.4x 1h + 5min" --> F[P1: Page immediately]
    E -- ">6x 6h + 30min" --> G[P2: Ticket + notify]
    E -- ">2x 3h" --> H[P3: Monitor]
    D --> I[P4: Log only]

TypeScript Example — MultiWindowBurnRateAlert

interface BurnRateReading {
  windowLabel: string; // e.g. '1h', '5min'
  burnRate: number;    // observed_error_rate / allowed_error_rate
}
 
interface AlertDecision {
  shouldAlert: boolean;
  severity: 'P1' | 'P2' | 'P3' | null;
  shortWindow: BurnRateReading;
  longWindow: BurnRateReading;
}
 
/**
 * Evaluates a multi-window burn rate alert.
 * Alert fires ONLY when BOTH windows simultaneously exceed their thresholds.
 *
 * @param shortWindow - short observation window (e.g. 5min or 30min)
 * @param longWindow  - long confirmation window (e.g. 1h or 6h)
 * @param shortWindowThreshold - burn rate threshold for the short window
 * @param longWindowThreshold  - burn rate threshold for the long window
 */
function evaluateMultiWindowAlert(
  shortWindow: BurnRateReading,
  longWindow: BurnRateReading,
  shortWindowThreshold: number,
  longWindowThreshold: number,
): AlertDecision {
  const shortBreached = shortWindow.burnRate > shortWindowThreshold;
  const longBreached  = longWindow.burnRate  > longWindowThreshold;
  const shouldAlert   = shortBreached && longBreached; // BOTH required
 
  let severity: AlertDecision['severity'] = null;
  if (shouldAlert) {
    if (longWindowThreshold >= 14.4) severity = 'P1';
    else if (longWindowThreshold >= 6) severity = 'P2';
    else severity = 'P3';
  }
 
  return { shouldAlert, severity, shortWindow, longWindow };
}
 
// P1 fast-burn check: short window 5min, long window 1h, threshold 14.4x
const p1Alert = evaluateMultiWindowAlert(
  { windowLabel: '5min', burnRate: 16.2 },
  { windowLabel: '1h',   burnRate: 15.0 },
  14.4,
  14.4,
);
// { shouldAlert: true, severity: 'P1', ... }

Java Example — AlertEvaluator

public final class AlertEvaluator {
 
    public record BurnRateReading(String windowLabel, double burnRate) {}
 
    public record AlertDecision(
        boolean shouldAlert,
        String severity,         // "P1" | "P2" | "P3" | null
        BurnRateReading shortWindow,
        BurnRateReading longWindow
    ) {}
 
    /**
     * Evaluates a multi-window burn rate alert.
     * Alert fires ONLY when BOTH windows simultaneously exceed their thresholds.
     *
     * @param shortWindow          short observation window reading (e.g. 5min)
     * @param longWindow           long confirmation window reading (e.g. 1h)
     * @param shortWindowThreshold burn rate threshold for the short window
     * @param longWindowThreshold  burn rate threshold for the long window
     */
    public static AlertDecision evaluate(
            BurnRateReading shortWindow,
            BurnRateReading longWindow,
            double shortWindowThreshold,
            double longWindowThreshold) {
 
        boolean shortBreached = shortWindow.burnRate() > shortWindowThreshold;
        boolean longBreached  = longWindow.burnRate()  > longWindowThreshold;
        boolean shouldAlert   = shortBreached && longBreached; // BOTH required
 
        String severity = null;
        if (shouldAlert) {
            if (longWindowThreshold >= 14.4)     severity = "P1";
            else if (longWindowThreshold >= 6.0) severity = "P2";
            else                                 severity = "P3";
        }
        return new AlertDecision(shouldAlert, severity, shortWindow, longWindow);
    }
}
 
// P1 fast-burn: short=5min burnRate=16.2, long=1h burnRate=15.0, threshold=14.4
// AlertDecision result = AlertEvaluator.evaluate(
//     new BurnRateReading("5min", 16.2),
//     new BurnRateReading("1h",   15.0),
//     14.4, 14.4
// );
// result.shouldAlert() == true, result.severity() == "P1"

Suitability — When to Use Multi-Window Alerting

Multi-window multi-burn-rate alerting is the right approach when:

  1. The service has a defined SLO — the burn rate thresholds (14.4x, 6x) are derived from the SLO window; without an SLO, the thresholds are arbitrary.
  2. The service has sufficient traffic for statistically meaningful SLI ratios — at very low traffic, individual failed requests produce burn rate spikes that are mathematical noise, not incidents.
  3. The on-call team is willing to define P1 runbooks for every alert — multi-window alerting produces high-signal, low-noise pages; each page must have a documented investigation path.
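
The low-traffic caveat in point 2 can be handled with a minimum-volume guard before a burn rate is trusted. In this sketch the floor of 1000 requests per window is an arbitrary placeholder, not a recommended value, and `guardedBurnRate` is a hypothetical helper.

```typescript
/** Minimum requests in a window before its burn rate is trusted (placeholder value). */
const MIN_REQUESTS_PER_WINDOW = 1000;

/** Returns the burn rate, or null when the window holds too few requests to be meaningful. */
function guardedBurnRate(failed: number, total: number, sloTarget: number): number | null {
  if (total < MIN_REQUESTS_PER_WINDOW) return null; // too little traffic to judge
  return failed / total / (1 - sloTarget);
}

// One failure in 20 requests would read as roughly 50x against a 99.9% SLO, so it is suppressed.
const lowTraffic = guardedBurnRate(1, 20, 0.999); // null
// 30 failures in 10,000 requests is a meaningful reading.
const normalTraffic = guardedBurnRate(30, 10_000, 0.999); // about 3
```

Returning null rather than 0 keeps "no reliable reading" distinct from "no burn", so downstream rules can decide how to treat quiet periods.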