Alerting Strategies

Alerting translates SLO burn rate signals into actionable human responses. Effective alerting is not about capturing every anomaly — it is about surfacing the right signal at the right severity level so the on-call engineer can act before the error budget is exhausted. The alert pipeline runs: SLI computation → burn rate → multi-window threshold evaluation → severity tier → on-call routing.

Scope: This note covers symptom vs cause alert classification, multi-window multi-burn-rate alerting with both time windows specified, the alert severity taxonomy, routing patterns, and runbook linkage. Burn rate derivation and error budget math are in SLO-SLI-SLA. Runbook anatomy is in Runbook-Design.

When NOT to Use (Specific Alerting Patterns)

Cause-based alerts as primary (wake-up) pages:

CPU high, memory elevated, queue depth growing — these are internal signals, not evidence of user impact.
Cause-based primary alerts have a high false-positive rate: CPU spikes during scheduled jobs, memory grows during normal batch processing.
The on-call engineer is woken for a cause that resolves itself before investigation begins, burning goodwill and attention.
Use cause-based signals as P4 diagnostics, not as P1 pages.

Single-window burn rate alerts:

A single observation window cannot simultaneously achieve low false-positive and low false-negative rates.
A tight threshold on a short window catches every transient spike — deployment blips, brief traffic bursts, health check failures — as false positives.
A loose threshold on a long window misses fast-burning incidents that exhaust the monthly budget in hours.
Single-window alerting forces an unresolvable sensitivity tradeoff. See the multi-window pattern below.

"Alert on everything" approach:

Alert fatigue is the systematic desensitisation of on-call engineers caused by a high volume of low-signal pages.
When every metric threshold fires a page, the engineer learns to dismiss alerts before investigating.
The result: real incidents are missed because the signal is buried in noise.
Govern alert quantity: if a service's page rate exceeds three pages per shift, treat that as a reliability signal requiring engineering investment, not as a reason to add more alert rules.
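The page-rate guideline above can be checked mechanically. The sketch below is one possible way to do that, assuming pages are exported as simple records; `PageEvent`, `shiftId`, and `servicesOverPageBudget` are hypothetical names, not part of any paging tool's API.

```typescript
// Hypothetical record of one page delivered to the on-call engineer.
interface PageEvent {
  service: string;
  shiftId: string; // e.g. '2024-05-01-night'; the shift naming scheme is an assumption
}

// Returns the services that exceeded `maxPagesPerShift` in at least one shift.
function servicesOverPageBudget(pages: PageEvent[], maxPagesPerShift = 3): string[] {
  // counts[service][shiftId] = number of pages in that shift
  const counts: Record<string, Record<string, number>> = {};
  for (const page of pages) {
    const perShift = counts[page.service] ?? (counts[page.service] = {});
    perShift[page.shiftId] = (perShift[page.shiftId] ?? 0) + 1;
  }
  return Object.keys(counts)
    .filter((service) =>
      Object.keys(counts[service]).some(
        (shift) => counts[service][shift] > maxPagesPerShift,
      ),
    )
    .sort();
}

const samplePages: PageEvent[] = [
  { service: 'checkout', shiftId: 'mon-night' },
  { service: 'checkout', shiftId: 'mon-night' },
  { service: 'checkout', shiftId: 'mon-night' },
  { service: 'checkout', shiftId: 'mon-night' }, // fourth page in one shift
  { service: 'orders', shiftId: 'mon-night' },
];
const overBudget = servicesOverPageBudget(samplePages);
```

A service that appears in the result is a candidate for reliability work or alert pruning rather than for additional rules.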

Symptom-Based vs Cause-Based Alerts

This is the foundational distinction in alert classification. Every alert can be placed on the spectrum from pure symptom to pure cause.

Symptom-Based Alerts

A symptom-based alert fires on evidence of user impact — something the user observes, not something the system internally records.

Symptom Signal	Example Alert Condition
User-facing error rate	`5xx rate > 1% on /api/checkout for 5 minutes`
User-facing latency	`p99 latency > 2s on user-facing endpoints`
Successful request rate drop	`RPS on /api/orders drops > 30% below 7-day baseline`
SLO burn rate	`multi-window burn rate > 14.4x simultaneously on 1h and 5min windows`

Cause-Based Alerts

A cause-based alert fires on an internal signal that might lead to user impact but does not yet confirm it.

Cause Signal	Example Alert Condition
CPU utilization	`CPU > 80% for 10 minutes`
Memory utilization	`Heap usage > 85%`
Queue depth	`Message queue depth > 10,000`
Connection pool exhaustion	`DB pool active > 90% of max`

The Rule

Alert on symptoms, investigate causes.

Cause-based alerts are diagnostic tools. They answer "what changed" after a symptom alert fires. They are not appropriate as primary wake-up pages because a cause with no user impact is not an incident.

Exception: cause-based alerts are acceptable at P3 (a backlog ticket the team reviews during business hours) or P4 (log only). Neither tier pages anyone.

SLO connection: symptom-based alerts are SLI threshold violations — the SLI ratio has degraded to the point that the error budget is burning. Cause-based alerts are implementation details of the service internals.
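The rule can be enforced as a lint check on alert definitions. The following is a minimal sketch, assuming each alert is tagged with a kind and a severity; `ClassifiedAlert` and `pagesOnCause` are illustrative names introduced here.

```typescript
type AlertKind = 'symptom' | 'cause';
type AlertSeverity = 'P1' | 'P2' | 'P3' | 'P4';

// Hypothetical shape of an alert definition after classification.
interface ClassifiedAlert {
  alertName: string;
  kind: AlertKind;
  severity: AlertSeverity;
}

// True when a cause-based alert is assigned a tier that notifies the on-call
// engineer (P1 or P2). Cause-based alerts are allowed only at P3 or P4.
function pagesOnCause(alert: ClassifiedAlert): boolean {
  return alert.kind === 'cause' && (alert.severity === 'P1' || alert.severity === 'P2');
}

const cpuPage: ClassifiedAlert = { alertName: 'cpu-high', kind: 'cause', severity: 'P1' };
const cpuTicket: ClassifiedAlert = { alertName: 'cpu-high', kind: 'cause', severity: 'P3' };
const errorPage: ClassifiedAlert = { alertName: 'checkout-5xx', kind: 'symptom', severity: 'P1' };
```

Here `pagesOnCause(cpuPage)` flags a violation, while the ticket-level cause alert and the symptom page both pass.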

Alert Severity Taxonomy

Severity tiers define the expected human response time and the cost of that response to the engineer's time and sleep.

Tier	Name	Trigger	Response Expectation
P1	Critical / Page	Burn rate > 14.4x on both 1h and 5min windows	Wake-up page; immediate response any hour; escalate in 10 min if no ack
P2	High / Ticket+Notify	Burn rate > 6x on both 6h and 30min windows	Response within business hours; on-call notified via chat; no sleep disruption
P3	Warning	Burn rate > 2x on 3h window	Monitor; create team backlog ticket; no immediate human action required
P4	Informational	Metrics anomaly with no SLO impact	Log to observability platform; no human notification

Calibration principle: Severity is proportional to budget consumption rate, not metric magnitude. A CPU spike to 50% that causes no increase in error rate is P4. Against a 99.9% SLO, a 1.44% error rate burns budget at 14.4x and is P1.
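The taxonomy table can be read as an ordered list of checks, highest severity first. The sketch below encodes the thresholds from the table; the `BurnRates` field names are assumptions, and returning P4 when no threshold is breached is a simplification chosen for this example.

```typescript
// Burn rates already computed for each window used by the taxonomy.
// Field names (m5 = 5 minutes, h1 = 1 hour, ...) are illustrative.
interface BurnRates {
  m5: number;
  m30: number;
  h1: number;
  h3: number;
  h6: number;
}

type SeverityTier = 'P1' | 'P2' | 'P3' | 'P4';

// Checks tiers from most to least severe and returns the first match.
function classifySeverity(rates: BurnRates): SeverityTier {
  if (rates.h1 > 14.4 && rates.m5 > 14.4) return 'P1'; // both windows required
  if (rates.h6 > 6 && rates.m30 > 6) return 'P2';      // both windows required
  if (rates.h3 > 2) return 'P3';
  return 'P4'; // no burn rate threshold breached: log only
}

const fastBurn = classifySeverity({ m5: 16.2, m30: 15.5, h1: 15.0, h3: 8.0, h6: 4.0 });
const slowBurn = classifySeverity({ m5: 7.0, m30: 6.5, h1: 6.8, h3: 6.4, h6: 6.2 });
const blip = classifySeverity({ m5: 20.0, m30: 3.5, h1: 1.7, h3: 0.6, h6: 0.3 });
```

With these inputs `fastBurn` is P1, `slowBurn` is P2, and `blip` is P4: a short spike with a large magnitude does not raise severity unless the longer window confirms it.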

Multi-Window Multi-Burn-Rate Alerting

This is the canonical alerting pattern for SLO-based services. An alert fires only when two time windows, one short and one long, exceed the burn rate threshold at the same time. Single-window alerting is the anti-pattern this replaces.

Why Two Windows Are Required

Window	Strength	Weakness if Used Alone
Short window (5min / 30min)	Detects fast-burning incidents quickly	High false-positive rate from transient spikes
Long window (1h / 6h)	Confirms sustained burn; filters transient noise	False negatives for fast burns early in the incident

The rule: the alert fires only when BOTH the short window AND the long window simultaneously exceed their respective thresholds.

P1 alert condition:
  burn_rate(short_window = 5min)  > 14.4
  AND
  burn_rate(long_window  = 1h)    > 14.4

The short window detects the incident. The long window confirms it is not a transient spike.
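To see why the long window filters transient spikes, consider a hypothetical series of per-minute request counts: 55 healthy minutes followed by a 5-minute spike at a 2% error rate, measured against a 99.9% SLO. The sketch below is illustrative only; `MinuteSample` and `windowBurnRate` are names invented for this example.

```typescript
// Hypothetical per-minute sample: total and failed request counts.
interface MinuteSample {
  total: number;
  failed: number;
}

// Burn rate over the last `windowMinutes` samples:
// (failed / total) / (1 - sloTarget). Returns 0 when there is no traffic.
function windowBurnRate(
  samples: MinuteSample[],
  windowMinutes: number,
  sloTarget: number,
): number {
  const recent = samples.slice(-windowMinutes);
  let total = 0;
  let failed = 0;
  for (const sample of recent) {
    total += sample.total;
    failed += sample.failed;
  }
  if (total === 0) return 0;
  return failed / total / (1 - sloTarget);
}

// 55 healthy minutes, then 5 minutes with 20 failures per 1000 requests.
const minuteSeries: MinuteSample[] = [];
for (let minute = 0; minute < 60; minute++) {
  minuteSeries.push({ total: 1000, failed: minute >= 55 ? 20 : 0 });
}

const shortBurn = windowBurnRate(minuteSeries, 5, 0.999);  // about 20
const longBurn = windowBurnRate(minuteSeries, 60, 0.999);  // about 1.67

const shortOnlyFires = shortBurn > 14.4;                     // true
const bothWindowsFire = shortBurn > 14.4 && longBurn > 14.4; // false
```

A rule on the 5min window alone would page for this spike. The combined rule stays silent unless the burn persists long enough to lift the 1h window above 14.4 as well.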

Four Alert Rules (Recommended Baseline)

These four rules cover fast and slow burn at two severity tiers.

Alert	Short Window	Long Window	Burn Rate	Severity	Budget Consumed at Alert
Fast page	5 min	1 hour	14.4x	P1	~2% in 1h
Slow page	30 min	6 hours	6x	P2	~5% in 6h
Fast ticket	1 hour	3 hours	3x	P3	~1.25% in 3h
Slow ticket	3 hours	24 hours	1x	P3	~3.3% in 24h (on pace to exhaust the budget exactly at the end of the 30-day window)
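The "Budget Consumed at Alert" column follows from a single relation: a burn rate sustained for a full long window consumes burn rate × window length / SLO period of the budget. A minimal sketch, assuming the 30-day SLO window used elsewhere in this note; the function and constant names are illustrative.

```typescript
// Fraction of the error budget consumed when a burn rate is sustained
// for a whole window: burnRate * windowHours / sloPeriodHours.
function budgetConsumedFraction(
  burnRate: number,
  windowHours: number,
  sloPeriodHours: number,
): number {
  return (burnRate * windowHours) / sloPeriodHours;
}

const SLO_PERIOD_HOURS = 30 * 24; // 30-day SLO window (assumed)

const fastPageBudget = budgetConsumedFraction(14.4, 1, SLO_PERIOD_HOURS); // about 0.02
const slowPageBudget = budgetConsumedFraction(6, 6, SLO_PERIOD_HOURS);    // about 0.05
```

The same function can be used to sanity-check any other row, or to derive a threshold for a different window by solving for the burn rate.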

Fast-Burn Alert (P1) — Worked Example

Scenario: 99.9% SLO, 30-day window.

Short window: 5 minutes
  observed_error_rate / allowed_error_rate > 14.4
  → At burn rate 14.4, monthly budget exhausts in 2.08 days

Long window: 1 hour
  Same burn rate threshold: 14.4

Alert fires: ONLY when both windows are simultaneously above 14.4

The 5min window catches the incident within minutes. The 1h window confirms the rate is sustained, not a deployment artifact. When both exceed the threshold simultaneously, about 2% of the monthly budget has been consumed, and at a sustained 14.4x the remainder lasts roughly 2 days: enough time to investigate and mitigate before the SLO is breached.

Slow-Burn Alert (P2) — Worked Example

Short window: 30 minutes
  burn_rate > 6x
  → At burn rate 6, monthly budget exhausts in 5 days

Long window: 6 hours
  Same burn rate threshold: 6x

Alert fires: ONLY when both windows are simultaneously above 6x

At this rate, the budget will not be exhausted in hours. Response within business hours is appropriate. The 6h window filters short-lived issues from deployments, making this alert high-signal.

Burn rate thresholds are derived from the SLO window. See SLO-SLI-SLA for the derivation formula: burn_rate = observed_error_rate / (1 - SLO_target).
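Inverting the formula gives the error rate at which each threshold trips, and dividing the SLO period by the burn rate gives the time to exhaustion quoted in the worked examples. The helper names below are hypothetical; this is a sketch of the arithmetic, not a reference implementation.

```typescript
// Error rate at which a given burn rate is reached:
// observed_error_rate = burn_rate * (1 - SLO_target).
function errorRateAtBurnRate(burnRate: number, sloTarget: number): number {
  return burnRate * (1 - sloTarget);
}

// Days until the whole period's budget is gone if the burn rate holds.
function daysToExhaustion(burnRate: number, sloPeriodDays: number): number {
  return burnRate > 0 ? sloPeriodDays / burnRate : Infinity;
}

// 99.9% SLO over a 30-day window, as in the worked examples.
const p1ErrorRate = errorRateAtBurnRate(14.4, 0.999); // about 0.0144 (1.44%)
const p2ErrorRate = errorRateAtBurnRate(6, 0.999);    // about 0.006 (0.6%)
const p1Days = daysToExhaustion(14.4, 30);            // about 2.08
const p2Days = daysToExhaustion(6, 30);               // 5
```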

Alert Routing Pattern

Every alert severity tier routes to a different response channel. Routing is conceptual — no tool-specific config.

Severity	Routing	Channel	Escalation
P1	Page on-call engineer (any hour)	PagerDuty / OpsGenie voice call	If no ack in 10 min: page secondary then manager
P2	Create incident ticket; notify on-call via chat	Slack / Teams channel	If no response in 2h: escalate to team lead
P3	Create ticket in team backlog	Jira / Linear	No escalation; reviewed in next sprint
P4	Log to observability platform	Grafana annotation / log entry	No human notification

Routing discipline: every alert definition must specify exactly one target tier. Alerts without tier assignment are P4 by default (log-only). Undifferentiated routing — where all alerts go to Slack — is an anti-pattern that creates alert fatigue.
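The routing table and the default-to-P4 rule can be expressed as a lookup. This is a tool-agnostic sketch: `Route`, `ROUTES`, and `routeFor` are hypothetical names, and the strings simply restate the table above.

```typescript
type Tier = 'P1' | 'P2' | 'P3' | 'P4';

interface Route {
  action: string;
  escalation: string | null; // null means no escalation
}

// One route per tier, mirroring the routing table.
const ROUTES: Record<Tier, Route> = {
  P1: { action: 'page on-call', escalation: 'no ack in 10 min: page secondary, then manager' },
  P2: { action: 'ticket + chat notification', escalation: 'no response in 2h: team lead' },
  P3: { action: 'backlog ticket', escalation: null },
  P4: { action: 'log only', escalation: null },
};

// Hypothetical alert definition; `tier` is optional to model missing assignments.
interface AlertDefinition {
  alertName: string;
  tier?: Tier;
}

// Alerts without a tier fall back to P4 (log only).
function routeFor(alert: AlertDefinition): { tier: Tier; route: Route } {
  const tier: Tier = alert.tier ?? 'P4';
  return { tier, route: ROUTES[tier] };
}

const untiered = routeFor({ alertName: 'disk-latency' });
const paged = routeFor({ alertName: 'checkout-fast-burn', tier: 'P1' });
```

Because the lookup is keyed by tier, an alert cannot be routed to more than one channel, which is the discipline the paragraph above asks for.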

Runbook Linkage

Every P1 and P2 alert must carry a runbook_url field in its metadata. An alert without a runbook link is incomplete.

Why: on-call engineers, especially those unfamiliar with the service, need a documented investigation path. Alerts that lack runbooks either result in slow, costly investigation or in escalation to the service owner at 2am.

Standard: include in every alert definition:

runbook_url: https://internal-docs/runbooks/service-name/alert-name
description: "Brief description of what this alert means and initial triage steps"
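The completeness rule lends itself to an automated check, for example in CI over the alert definitions. The sketch below assumes alert metadata is available as plain objects; `AlertMetadata` and `findAlertsMissingRunbook` are illustrative names, and only the `runbook_url` and `description` fields come from this note.

```typescript
// Hypothetical in-memory form of an alert definition's metadata.
interface AlertMetadata {
  alertName: string;
  severity: 'P1' | 'P2' | 'P3' | 'P4';
  runbook_url?: string;
  description?: string;
}

// Returns the names of P1/P2 alerts with a missing or blank runbook_url.
function findAlertsMissingRunbook(alerts: AlertMetadata[]): string[] {
  return alerts
    .filter((alert) => alert.severity === 'P1' || alert.severity === 'P2')
    .filter((alert) => !alert.runbook_url?.trim())
    .map((alert) => alert.alertName);
}

const alertCatalog: AlertMetadata[] = [
  {
    alertName: 'checkout-fast-burn',
    severity: 'P1',
    runbook_url: 'https://internal-docs/runbooks/service-name/alert-name',
    description: 'Brief description of what this alert means and initial triage steps',
  },
  { alertName: 'orders-slow-burn', severity: 'P2' },                // missing link
  { alertName: 'search-fast-burn', severity: 'P1', runbook_url: ' ' }, // blank link
  { alertName: 'cpu-high', severity: 'P4' },                        // not required
];

const incompleteAlerts = findAlertsMissingRunbook(alertCatalog);
```

Here `incompleteAlerts` contains `orders-slow-burn` and `search-fast-burn`. The P4 alert is ignored because the requirement applies only to P1 and P2.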

For the anatomy of a well-designed runbook (symptoms → diagnosis → remediation → escalation path), see Runbook-Design.

Mermaid Decision Tree

flowchart TD
    A[Anomaly detected] --> B{User impact\nobservable?}
    B -- Yes --> C[Symptom-based alert]
    B -- No --> D[Cause-based diagnostic]
    C --> E{Burn rate\nthreshold?}
    E -- ">14.4x 1h + 5min" --> F[P1: Page immediately]
    E -- ">6x 6h + 30min" --> G[P2: Ticket + notify]
    E -- ">2x 3h" --> H[P3: Monitor]
    D --> I[P4: Log only]

TypeScript Example — MultiWindowBurnRateAlert

interface BurnRateReading {
  windowLabel: string; // e.g. '1h', '5min'
  burnRate: number;    // observed_error_rate / allowed_error_rate
}
 
interface AlertDecision {
  shouldAlert: boolean;
  severity: 'P1' | 'P2' | 'P3' | null;
  shortWindow: BurnRateReading;
  longWindow: BurnRateReading;
}
 
/**
 * Evaluates a multi-window burn rate alert.
 * Alert fires ONLY when BOTH windows simultaneously exceed their thresholds.
 *
 * @param shortWindow - short observation window (e.g. 5min or 30min)
 * @param longWindow  - long confirmation window (e.g. 1h or 6h)
 * @param shortWindowThreshold - burn rate threshold for the short window
 * @param longWindowThreshold  - burn rate threshold for the long window
 */
function evaluateMultiWindowAlert(
  shortWindow: BurnRateReading,
  longWindow: BurnRateReading,
  shortWindowThreshold: number,
  longWindowThreshold: number,
): AlertDecision {
  const shortBreached = shortWindow.burnRate > shortWindowThreshold;
  const longBreached  = longWindow.burnRate  > longWindowThreshold;
  const shouldAlert   = shortBreached && longBreached; // BOTH required
 
  let severity: AlertDecision['severity'] = null;
  if (shouldAlert) {
    if (longWindowThreshold >= 14.4) severity = 'P1';
    else if (longWindowThreshold >= 6) severity = 'P2';
    else severity = 'P3';
  }
 
  return { shouldAlert, severity, shortWindow, longWindow };
}
 
// P1 fast-burn check: short window 5min, long window 1h, threshold 14.4x
const p1Alert = evaluateMultiWindowAlert(
  { windowLabel: '5min', burnRate: 16.2 },
  { windowLabel: '1h',   burnRate: 15.0 },
  14.4,
  14.4,
);
// { shouldAlert: true, severity: 'P1', ... }

Java Example — AlertEvaluator

public final class AlertEvaluator {
 
    public record BurnRateReading(String windowLabel, double burnRate) {}
 
    public record AlertDecision(
        boolean shouldAlert,
        String severity,         // "P1" | "P2" | "P3" | null
        BurnRateReading shortWindow,
        BurnRateReading longWindow
    ) {}
 
    /**
     * Evaluates a multi-window burn rate alert.
     * Alert fires ONLY when BOTH windows simultaneously exceed their thresholds.
     *
     * @param shortWindow          short observation window reading (e.g. 5min)
     * @param longWindow           long confirmation window reading (e.g. 1h)
     * @param shortWindowThreshold burn rate threshold for the short window
     * @param longWindowThreshold  burn rate threshold for the long window
     */
    public static AlertDecision evaluate(
            BurnRateReading shortWindow,
            BurnRateReading longWindow,
            double shortWindowThreshold,
            double longWindowThreshold) {
 
        boolean shortBreached = shortWindow.burnRate() > shortWindowThreshold;
        boolean longBreached  = longWindow.burnRate()  > longWindowThreshold;
        boolean shouldAlert   = shortBreached && longBreached; // BOTH required
 
        String severity = null;
        if (shouldAlert) {
            if (longWindowThreshold >= 14.4)     severity = "P1";
            else if (longWindowThreshold >= 6.0) severity = "P2";
            else                                 severity = "P3";
        }
        return new AlertDecision(shouldAlert, severity, shortWindow, longWindow);
    }
}
 
// P1 fast-burn: short=5min burnRate=16.2, long=1h burnRate=15.0, threshold=14.4
// AlertDecision result = AlertEvaluator.evaluate(
//     new BurnRateReading("5min", 16.2),
//     new BurnRateReading("1h",   15.0),
//     14.4, 14.4
// );
// result.shouldAlert() == true, result.severity() == "P1"

Suitability — When to Use Multi-Window Alerting

Multi-window multi-burn-rate alerting is the right approach when:

The service has a defined SLO — the burn rate thresholds (14.4x, 6x) are derived from the SLO window; without an SLO, the thresholds are arbitrary.
The service has sufficient traffic for statistically meaningful SLI ratios — at very low traffic, individual failed requests produce burn rate spikes that are mathematical noise, not incidents.
The on-call team is willing to define P1 runbooks for every alert — multi-window alerting produces high-signal, low-noise pages; each page must have a documented investigation path.

Backlinks

SLO-SLI-SLA — burn rate derivation and error budget math
Runbook-Design — every P1/P2 alert links to a runbook
Incident-Response — P1 alerts trigger the incident response flow
On-Call-Practices — alert routing feeds the on-call escalation policy
Metrics-and-Dashboards — SLI metrics are the data source for alerting
