Alerting Strategies
Alerting translates SLO burn rate signals into actionable human responses. Effective alerting is not about capturing every anomaly — it is about surfacing the right signal at the right severity level so the on-call engineer can act before the error budget is exhausted. The alert pipeline runs: SLI computation → burn rate → multi-window threshold evaluation → severity tier → on-call routing.
Scope: This note covers symptom vs cause alert classification, multi-window multi-burn-rate alerting with both time windows specified, the alert severity taxonomy, routing patterns, and runbook linkage. Burn rate derivation and error budget math are in SLO-SLI-SLA. Runbook anatomy is in Runbook-Design.
When NOT to Use (Specific Alerting Patterns)
Cause-based alerts as primary (wake-up) pages:
- CPU high, memory elevated, queue depth growing — these are internal signals, not evidence of user impact.
- Cause-based primary alerts have a high false-positive rate: CPU spikes during scheduled jobs, memory grows during normal batch processing.
- The on-call engineer is woken for a cause that resolves itself before investigation begins, burning goodwill and attention.
- Use cause-based signals as P4 diagnostics, not as P1 pages.
Single-window burn rate alerts:
- A single observation window cannot simultaneously achieve low false-positive and low false-negative rates.
- A tight threshold on a short window catches every transient spike — deployment blips, brief traffic bursts, health check failures — as false positives.
- A loose threshold on a long window misses fast-burning incidents that exhaust the monthly budget in hours.
- Single-window alerting forces an unresolvable sensitivity tradeoff. See the multi-window pattern below.
"Alert on everything" approach:
- Alert fatigue is the systematic desensitisation of on-call engineers caused by a high volume of low-signal pages.
- When every metric threshold fires a page, the engineer learns to dismiss alerts before investigating.
- The result: real incidents are missed because the signal is buried in noise.
- Govern alert quantity: if a service's page rate exceeds three pages per shift, treat that as a reliability signal requiring engineering investment, not more alert rules.
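The three-pages-per-shift guideline can be checked mechanically against page history. The following is a minimal sketch, assuming pages are recorded with a timestamp and that shifts are fixed-length intervals. `PageEvent`, `pagesPerShift`, `noisyShifts`, and the shift length are illustrative names and values, not part of any paging tool's API.

```typescript
// Hypothetical record of one page sent to the on-call engineer.
interface PageEvent {
  service: string;
  firedAtMs: number; // epoch milliseconds
}

const MAX_PAGES_PER_SHIFT = 3; // guideline stated in the text

/** Counts pages per shift, bucketing by a fixed shift length from shiftStartMs. */
function pagesPerShift(
  pages: PageEvent[],
  shiftStartMs: number,
  shiftLengthMs: number,
): Record<number, number> {
  const counts: Record<number, number> = {};
  for (const page of pages) {
    if (page.firedAtMs < shiftStartMs) continue; // before the first shift
    const shiftIndex = Math.floor((page.firedAtMs - shiftStartMs) / shiftLengthMs);
    counts[shiftIndex] = (counts[shiftIndex] ?? 0) + 1;
  }
  return counts;
}

/** Returns the shift indices whose page count exceeds the guideline. */
function noisyShifts(counts: Record<number, number>): number[] {
  return Object.keys(counts)
    .map(Number)
    .filter((shiftIndex) => counts[shiftIndex] > MAX_PAGES_PER_SHIFT)
    .sort((a, b) => a - b);
}
```

A shift that appears in `noisyShifts` is a candidate for a reliability review of the service rather than for further alert rules.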
Symptom-Based vs Cause-Based Alerts
This is the foundational distinction in alert classification. Every alert can be placed on the spectrum from pure symptom to pure cause.
Symptom-Based Alerts
A symptom-based alert fires on evidence of user impact — something the user observes, not something the system internally records.
| Symptom Signal | Example Alert Condition |
|---|---|
| User-facing error rate | 5xx rate > 1% on /api/checkout for 5 minutes |
| User-facing latency | p99 latency > 2s on user-facing endpoints |
| Successful request rate drop | RPS on /api/orders drops > 30% below 7-day baseline |
| SLO burn rate | multi-window burn rate > 14.4x simultaneously on 1h and 5min windows |
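Two of the conditions in the table reduce to simple ratio checks. The sketch below is illustrative only: `TrafficSample`, `errorRateBreached`, and `trafficDropBreached` are assumed names, and the "for 5 minutes" duration requirement is assumed to be handled by whatever aggregates the sample.

```typescript
// Aggregated request counts for one evaluation window (assumed shape).
interface TrafficSample {
  totalRequests: number;
  serverErrors: number; // 5xx responses
}

/** True when the 5xx ratio exceeds maxErrorRatio (e.g. 0.01 for 1%). */
function errorRateBreached(sample: TrafficSample, maxErrorRatio: number): boolean {
  if (sample.totalRequests === 0) return false; // no traffic, so no evidence of errors
  return sample.serverErrors / sample.totalRequests > maxErrorRatio;
}

/** True when current RPS is more than maxDropRatio below the baseline RPS. */
function trafficDropBreached(
  currentRps: number,
  baselineRps: number,
  maxDropRatio: number,
): boolean {
  if (baselineRps <= 0) return false; // no usable baseline
  return (baselineRps - currentRps) / baselineRps > maxDropRatio;
}
```

Both checks look only at what the user experiences (failed or missing requests), which is what makes them symptom signals.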
Cause-Based Alerts
A cause-based alert fires on an internal signal that might lead to user impact but does not yet confirm it.
| Cause Signal | Example Alert Condition |
|---|---|
| CPU utilization | CPU > 80% for 10 minutes |
| Memory utilization | Heap usage > 85% |
| Queue depth | Message queue depth > 10,000 |
| Connection pool exhaustion | DB pool active > 90% of max |
The Rule
Alert on symptoms, investigate causes.
Cause-based alerts are diagnostic tools. They answer "what changed" after a symptom alert fires. They are not appropriate as primary wake-up pages because a cause with no user impact is not an incident.
Exception: cause-based alerts are acceptable as P3 or P4 warnings — low-severity notifications that create a ticket for the team to review during business hours, with no paging.
SLO connection: symptom-based alerts are SLI threshold violations — the SLI ratio has degraded to the point that the error budget is burning. Cause-based alerts are implementation details of the service internals.
Alert Severity Taxonomy
Severity tiers define the expected human response time and the cost of that response to the engineer's time and sleep.
| Tier | Name | Trigger | Response Expectation |
|---|---|---|---|
| P1 | Critical / Page | Burn rate > 14.4x on both 1h and 5min windows | Wake-up page; immediate response any hour; escalate in 10 min if no ack |
| P2 | High / Ticket+Notify | Burn rate > 6x on both 6h and 30min windows | Response within business hours; on-call notified via chat; no sleep disruption |
| P3 | Warning | Burn rate > 2x on 3h window | Monitor; create team backlog ticket; no immediate human action required |
| P4 | Informational | Metrics anomaly with no SLO impact | Log to observability platform; no human notification |
Calibration principle: Severity is proportional to budget consumption rate, not metric magnitude. A 50% CPU spike that causes no error rate increase is P4. A 1% error rate spike that burns budget at 14.4x is P1.
Multi-Window Multi-Burn-Rate Alerting
This is the canonical alerting pattern for SLO-based services. An alert fires only when two time windows exceed the burn rate threshold at the same time. Single-window alerting is the anti-pattern described earlier in this note.
Why Two Windows Are Required
| Window | Strength | Weakness if Used Alone |
|---|---|---|
| Short window (5min / 30min) | Detects fast-burning incidents quickly | High false-positive rate from transient spikes |
| Long window (1h / 6h) | Confirms sustained burn; filters transient noise | False negatives for fast burns early in the incident |
The rule: the alert fires only when BOTH the short window AND the long window simultaneously exceed their respective thresholds.
P1 alert condition:
burn_rate(short_window = 5min) > 14.4
AND
burn_rate(long_window = 1h) > 14.4
The short window detects the incident. The long window confirms it is not a transient spike.
Four Alert Rules (Recommended Baseline)
These four rules cover fast and slow burn at two severity tiers.
| Alert | Short Window | Long Window | Burn Rate | Severity | Budget Consumed at Alert |
|---|---|---|---|---|---|
| Fast page | 5 min | 1 hour | 14.4x | P1 | ~2% in 1h |
| Slow page | 30 min | 6 hours | 6x | P2 | ~5% in 6h |
| Fast ticket | 1 hour | 3 hours | 3x | P3 | ~1.25% in 3h |
| Slow ticket | 3 hours | 24 hours | 1x | P3 | ~3.3% in 24h (budget exhausts exactly at the end of the SLO window) |
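The "Budget Consumed at Alert" column follows from the burn rate, the long window, and the SLO period. A minimal sketch, assuming a 30-day (720-hour) SLO period; `budgetConsumedFraction` is an illustrative helper name.

```typescript
const SLO_PERIOD_HOURS = 30 * 24; // 720h, assuming a 30-day SLO window

/**
 * Fraction of the error budget consumed when a given burn rate is
 * sustained for the whole long window.
 */
function budgetConsumedFraction(
  burnRate: number,
  longWindowHours: number,
  sloPeriodHours: number = SLO_PERIOD_HOURS,
): number {
  return (burnRate * longWindowHours) / sloPeriodHours;
}

// Fast page: 14.4x sustained for 1h
const fastPage = budgetConsumedFraction(14.4, 1);
// Slow page: 6x sustained for 6h
const slowPage = budgetConsumedFraction(6, 6);
```

For the two page rows this gives 14.4 × 1 / 720 = 2% and 6 × 6 / 720 = 5%.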
Fast-Burn Alert (P1) — Worked Example
Scenario: 99.9% SLO, 30-day window.
Short window: 5 minutes
observed_error_rate / allowed_error_rate > 14.4
→ At burn rate 14.4, monthly budget exhausts in 2.08 days
Long window: 1 hour
Same burn rate threshold: 14.4
Alert fires: ONLY when both windows are simultaneously above 14.4
The 5min window catches the incident within minutes. The 1h window confirms the rate is sustained, not a deployment artifact. When both exceed the threshold simultaneously, the on-call engineer has ~2 days of budget remaining — enough time to investigate and mitigate before the SLO is breached.
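The exhaustion figure in the worked example is the SLO period divided by the burn rate. A small sketch, assuming a 30-day period; `daysToExhaustion` is an illustrative name.

```typescript
/**
 * Days until a full error budget is exhausted if the given burn rate
 * is sustained. A burn rate of 1 exhausts the budget exactly at the
 * end of the SLO period.
 */
function daysToExhaustion(burnRate: number, sloPeriodDays: number = 30): number {
  if (burnRate <= 0) return Infinity; // no budget is being consumed
  return sloPeriodDays / burnRate;
}

const fastBurnDays = daysToExhaustion(14.4); // 30 / 14.4, roughly 2.08 days
```

The same helper with a burn rate of 6 gives 5 days, which is why that tier tolerates a business-hours response.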
Slow-Burn Alert (P2) — Worked Example
Short window: 30 minutes
burn_rate > 6x
→ At burn rate 6, monthly budget exhausts in 5 days
Long window: 6 hours
Same burn rate threshold: 6x
Alert fires: ONLY when both windows are simultaneously above 6x
At this rate, the budget will not be exhausted in hours. Response within business hours is appropriate. The 6h window filters short-lived issues from deployments, making this alert high-signal.
Burn rate thresholds are derived from the SLO window. See SLO-SLI-SLA for the derivation formula: burn_rate = observed_error_rate / (1 - SLO_target).
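As a concrete instance of the formula, the sketch below computes a burn rate from request counts. `computeBurnRate` and its parameters are illustrative names; returning 0 for an empty window is an assumption of this sketch, not a rule from SLO-SLI-SLA.

```typescript
/**
 * burn_rate = observed_error_rate / (1 - SLO_target)
 *
 * @param failedRequests requests counted as errors in the window
 * @param totalRequests  all requests in the window
 * @param sloTarget      SLO as a fraction, e.g. 0.999 for 99.9%
 */
function computeBurnRate(
  failedRequests: number,
  totalRequests: number,
  sloTarget: number,
): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError('sloTarget must be strictly between 0 and 1');
  }
  if (totalRequests === 0) return 0; // assumption: empty window burns nothing
  const observedErrorRate = failedRequests / totalRequests;
  const allowedErrorRate = 1 - sloTarget;
  return observedErrorRate / allowedErrorRate;
}

// 144 failures in 10,000 requests against a 99.9% SLO:
// 1.44% observed / 0.1% allowed, approximately 14.4x
const exampleBurnRate = computeBurnRate(144, 10000, 0.999);
```

Because of floating-point rounding in `1 - sloTarget`, compare the result against thresholds with `>` as the document does, not with strict equality.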
Alert Routing Pattern
Every alert severity tier routes to a different response channel. Routing is described conceptually here; the tools named in the table are examples, not required configuration.
| Severity | Routing | Channel | Escalation |
|---|---|---|---|
| P1 | Page on-call engineer (any hour) | PagerDuty / OpsGenie voice call | If no ack in 10 min: page secondary then manager |
| P2 | Create incident ticket; notify on-call via chat | Slack / Teams channel | If no response in 2h: escalate to team lead |
| P3 | Create ticket in team backlog | Jira / Linear | No escalation; reviewed in next sprint |
| P4 | Log to observability platform | Grafana annotation / log entry | No human notification |
Routing discipline: every alert definition must specify exactly one target tier. Alerts without tier assignment are P4 by default (log-only). Undifferentiated routing — where all alerts go to Slack — is an anti-pattern that creates alert fatigue.
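The routing table and the default-to-P4 rule can be expressed as a lookup. This is a sketch under the assumptions of the table above; `Route`, `ROUTES`, and `routeFor` are illustrative names and do not correspond to any specific paging or ticketing API.

```typescript
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

interface Route {
  action: string;
  escalation: string | null; // null means no escalation
}

// One route per tier, mirroring the routing table.
const ROUTES: Record<Severity, Route> = {
  P1: {
    action: 'page on-call engineer (any hour)',
    escalation: 'no ack in 10 min: page secondary, then manager',
  },
  P2: {
    action: 'create incident ticket and notify on-call via chat',
    escalation: 'no response in 2h: escalate to team lead',
  },
  P3: { action: 'create ticket in team backlog', escalation: null },
  P4: { action: 'log to observability platform', escalation: null },
};

/** Alerts without a tier assignment fall back to P4 (log-only). */
function routeFor(severity?: Severity): Route {
  return ROUTES[severity ?? 'P4'];
}
```

Keeping the mapping total over `Severity` means an alert can never resolve to more than one tier, and an untagged alert can never page anyone.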
Runbook Linkage
Every P1 and P2 alert must carry a runbook_url field in its metadata. An alert without a runbook link is incomplete.
Why: on-call engineers, especially those unfamiliar with the service, need a documented investigation path. Alerts that lack runbooks either result in slow, costly investigation or in escalation to the service owner at 2am.
Standard: include in every alert definition:
runbook_url: https://internal-docs/runbooks/service-name/alert-name
description: "Brief description of what this alert means and initial triage steps"
For the anatomy of a well-designed runbook (symptoms → diagnosis → remediation → escalation path), see Runbook-Design.
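The completeness rule for P1 and P2 alerts lends itself to an automated check, for example in a CI step over alert definitions. A minimal sketch, assuming alert definitions are available as plain objects; `AlertDefinition` and `alertsMissingRunbook` are hypothetical names.

```typescript
// Assumed in-memory shape of an alert definition.
interface AlertDefinition {
  name: string;
  severity: 'P1' | 'P2' | 'P3' | 'P4';
  runbookUrl?: string;
  description?: string;
}

/** Returns the names of P1/P2 alerts that have no usable runbook link. */
function alertsMissingRunbook(definitions: AlertDefinition[]): string[] {
  return definitions
    .filter((def) => def.severity === 'P1' || def.severity === 'P2')
    .filter((def) => !def.runbookUrl || def.runbookUrl.trim() === '')
    .map((def) => def.name);
}
```

A non-empty result would mark those alert definitions as incomplete under the rule stated in this section.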
Mermaid Decision Tree
flowchart TD
A[Anomaly detected] --> B{User impact\nobservable?}
B -- Yes --> C[Symptom-based alert]
B -- No --> D[Cause-based diagnostic]
C --> E{Burn rate\nthreshold?}
E -- ">14.4x 1h + 5min" --> F[P1: Page immediately]
E -- ">6x 6h + 30min" --> G[P2: Ticket + notify]
E -- ">2x 3h" --> H[P3: Monitor]
D --> I[P4: Log only]
TypeScript Example — MultiWindowBurnRateAlert
interface BurnRateReading {
  windowLabel: string; // e.g. '1h', '5min'
  burnRate: number; // observed_error_rate / allowed_error_rate
}

interface AlertDecision {
  shouldAlert: boolean;
  severity: 'P1' | 'P2' | 'P3' | null;
  shortWindow: BurnRateReading;
  longWindow: BurnRateReading;
}

/**
 * Evaluates a multi-window burn rate alert.
 * Alert fires ONLY when BOTH windows simultaneously exceed their thresholds.
 *
 * @param shortWindow - short observation window (e.g. 5min or 30min)
 * @param longWindow - long confirmation window (e.g. 1h or 6h)
 * @param shortWindowThreshold - burn rate threshold for the short window
 * @param longWindowThreshold - burn rate threshold for the long window
 */
function evaluateMultiWindowAlert(
  shortWindow: BurnRateReading,
  longWindow: BurnRateReading,
  shortWindowThreshold: number,
  longWindowThreshold: number,
): AlertDecision {
  const shortBreached = shortWindow.burnRate > shortWindowThreshold;
  const longBreached = longWindow.burnRate > longWindowThreshold;
  const shouldAlert = shortBreached && longBreached; // BOTH required

  let severity: AlertDecision['severity'] = null;
  if (shouldAlert) {
    if (longWindowThreshold >= 14.4) severity = 'P1';
    else if (longWindowThreshold >= 6) severity = 'P2';
    else severity = 'P3';
  }
  return { shouldAlert, severity, shortWindow, longWindow };
}

// P1 fast-burn check: short window 5min, long window 1h, threshold 14.4x
const p1Alert = evaluateMultiWindowAlert(
  { windowLabel: '5min', burnRate: 16.2 },
  { windowLabel: '1h', burnRate: 15.0 },
  14.4,
  14.4,
);
// { shouldAlert: true, severity: 'P1', ... }
Java Example — AlertEvaluator
public final class AlertEvaluator {

    public record BurnRateReading(String windowLabel, double burnRate) {}

    public record AlertDecision(
        boolean shouldAlert,
        String severity, // "P1" | "P2" | "P3" | null
        BurnRateReading shortWindow,
        BurnRateReading longWindow
    ) {}

    /**
     * Evaluates a multi-window burn rate alert.
     * Alert fires ONLY when BOTH windows simultaneously exceed their thresholds.
     *
     * @param shortWindow short observation window reading (e.g. 5min)
     * @param longWindow long confirmation window reading (e.g. 1h)
     * @param shortWindowThreshold burn rate threshold for the short window
     * @param longWindowThreshold burn rate threshold for the long window
     */
    public static AlertDecision evaluate(
            BurnRateReading shortWindow,
            BurnRateReading longWindow,
            double shortWindowThreshold,
            double longWindowThreshold) {
        boolean shortBreached = shortWindow.burnRate() > shortWindowThreshold;
        boolean longBreached = longWindow.burnRate() > longWindowThreshold;
        boolean shouldAlert = shortBreached && longBreached; // BOTH required

        String severity = null;
        if (shouldAlert) {
            if (longWindowThreshold >= 14.4) severity = "P1";
            else if (longWindowThreshold >= 6.0) severity = "P2";
            else severity = "P3";
        }
        return new AlertDecision(shouldAlert, severity, shortWindow, longWindow);
    }
}

// P1 fast-burn: short=5min burnRate=16.2, long=1h burnRate=15.0, threshold=14.4
// AlertDecision result = AlertEvaluator.evaluate(
//     new BurnRateReading("5min", 16.2),
//     new BurnRateReading("1h", 15.0),
//     14.4, 14.4
// );
// result.shouldAlert() == true, result.severity() == "P1"
Suitability — When to Use Multi-Window Alerting
Multi-window multi-burn-rate alerting is the right approach when:
- The service has a defined SLO — the burn rate thresholds (14.4x, 6x) are derived from the SLO window; without an SLO, the thresholds are arbitrary.
- The service has sufficient traffic for statistically meaningful SLI ratios — at very low traffic, individual failed requests produce burn rate spikes that are mathematical noise, not incidents.
- The on-call team is willing to define P1 runbooks for every alert — multi-window alerting produces high-signal, low-noise pages; each page must have a documented investigation path.
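The low-traffic caveat can be made concrete with a minimum-sample guard in front of the burn rate calculation. This is a sketch only: `WindowCounts`, `burnRateOrNull`, and the `minRequests` value are assumptions, and a suitable minimum depends on the service's traffic profile.

```typescript
// Request counts observed in one window (assumed shape).
interface WindowCounts {
  total: number;
  failed: number;
}

/**
 * Returns the burn rate for the window, or null when the window holds
 * too few requests for the ratio to be meaningful.
 */
function burnRateOrNull(
  counts: WindowCounts,
  sloTarget: number,
  minRequests: number,
): number | null {
  if (counts.total < minRequests) return null; // too little data: do not evaluate
  const observedErrorRate = counts.failed / counts.total;
  return observedErrorRate / (1 - sloTarget);
}

// 1 failure in 20 requests against a 99.9% SLO would read as roughly 50x burn,
// far above the 14.4x page threshold, from a single failed request.
const lowTraffic = burnRateOrNull({ total: 20, failed: 1 }, 0.999, 1000);
```

With the guard, `lowTraffic` is `null` and the window is skipped instead of paging on one failed request.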
Backlinks
- SLO-SLI-SLA — burn rate derivation and error budget math
- Runbook-Design — every P1/P2 alert links to a runbook
- Incident-Response — P1 alerts trigger the incident response flow
- On-Call-Practices — alert routing feeds the on-call escalation policy
- Metrics-and-Dashboards — SLI metrics are the data source for alerting