SLO-SLI-SLA
Service Level Indicators, Objectives, and Agreements form the foundational vocabulary of site reliability engineering. They translate vague notions of "the service is healthy" into precise, measurable targets with explicit consequences. The three terms form a hierarchy: SLIs are the measurements, SLOs are the targets set on those measurements, and SLAs are the contractual commitments backed by those targets.
Scope: This note covers the reliability math: SLI ratio definition, error budget derivation, burn rate formula, and the multiwindow alerting prerequisite. For dashboard instrumentation of the underlying metrics, see Metrics-and-Dashboards. For alert implementation, see Alerting-Strategies.
When NOT to Use
SLO-driven reliability engineering is premature or counterproductive in these contexts:
- Pre-production and prototype systems — there is no user journey to protect; setting SLOs creates false urgency and distorts prioritisation.
- Systems without a defined latency or availability contract — an SLO without a target is meaningless; if no one has agreed on "what good looks like," the math has no anchor.
- Single-developer scripts and internal tooling — the overhead of error budget tracking exceeds the benefit; direct incident response is sufficient.
- Teams unwilling to act on error budget data — SLOs only produce value when exhausted budget triggers a real response (feature freeze, incident review). Without that commitment, the machinery is theatre.
Core Definitions
SLA — Service Level Agreement
A contractual commitment between a service provider and a customer, with defined consequences (financial penalties, service credits, contract termination) on breach. SLAs are external and legal.
Example: "99.9% monthly uptime or 10% service credit."
SLO — Service Level Objective
An internal reliability target set by the engineering team, typically stricter than the SLA, that drives operational and engineering decisions. Breaching an SLO triggers engineering action; it does not trigger a legal consequence directly.
Example: "99.95% availability target — measured on 30-day rolling window."
SLI — Service Level Indicator
A ratio of good events to valid events — not a raw metric, not a count, and not a gauge value.
SLI = good_events / valid_events
The distinction from raw metrics is load-bearing: a counter of errors is not an SLI. Dividing that counter by the total valid request count produces an SLI. This ratio must be between 0 and 1 (or expressed as a percentage: 0–100%).
| Term | What it Represents | Example |
|---|---|---|
| good_events | Requests that met the quality criteria | HTTP 2xx responses with latency < 500 ms |
| valid_events | All requests that should have met the criteria | All HTTP requests except health checks and warmup traffic |
Why ratio, not count? Traffic volume fluctuates. An error count of 500 per second is alarming on a 1,000 RPS service (50% error rate) but routine on a 500,000 RPS service (0.1% error rate). The SLI normalises for traffic, making the signal meaningful across operating conditions.
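As a minimal sketch of the ratio, assuming a hypothetical `RequestRecord` shape (the field names `status`, `latencyMs`, and `isHealthCheck` are illustrative and not taken from any real library), the SLI can be computed by filtering down to valid events first and then counting the good ones. Warmup traffic is left out of the sketch for brevity and would be excluded the same way as health checks.

```typescript
// Hypothetical request record; field names are assumptions for illustration.
interface RequestRecord {
  status: number;         // HTTP status code
  latencyMs: number;      // end-to-end latency in milliseconds
  isHealthCheck: boolean; // health checks are excluded from valid_events
}

// Latency criterion from the table above (assumed: strictly under 500 ms).
const LATENCY_THRESHOLD_MS = 500;

function computeSli(requests: RequestRecord[]): number {
  // valid_events: everything except health checks
  const valid = requests.filter((r) => !r.isHealthCheck);
  if (valid.length === 0) {
    throw new Error('no valid events; SLI is undefined');
  }
  // good_events: valid requests with a 2xx status and latency under the threshold
  const good = valid.filter(
    (r) => r.status >= 200 && r.status < 300 && r.latencyMs < LATENCY_THRESHOLD_MS,
  );
  return good.length / valid.length;
}

const sampleSli = computeSli([
  { status: 200, latencyMs: 120, isHealthCheck: false }, // good
  { status: 200, latencyMs: 900, isHealthCheck: false }, // valid but too slow
  { status: 503, latencyMs: 40, isHealthCheck: false },  // valid but failed
  { status: 204, latencyMs: 80, isHealthCheck: false },  // good
  { status: 200, latencyMs: 5, isHealthCheck: true },    // not a valid event
]);
// sampleSli is 2 good / 4 valid = 0.5
```

Filtering before dividing matters: counting the health check as a valid event would inflate the denominator and overstate reliability.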
Error Budget Derivation
The error budget is the allowable failure quota implied by the SLO target. It is derived — not chosen separately.
Step-by-Step Derivation (99.9% availability SLO)
SLO target: 99.9% → SLO as decimal = 0.999
Allowed failure rate: 1 - 0.999 = 0.001 (0.1%)
Monthly window:
30 days × 86,400 seconds/day = 2,592,000 total seconds
Error budget (seconds):
2,592,000 × 0.001 = 2,592 seconds ≈ 43.2 minutes/month
Error budget (requests at 1,000 RPS):
2,592,000 seconds × 1,000 requests/second = 2,592,000,000 total requests
2,592,000,000 total requests × 0.001 = 2,592,000 bad requests/month
Reading the derivation:
- If the service is unavailable for more than 43.2 cumulative minutes in any 30-day window, the SLO is breached.
- At a steady 1,000 RPS, the team can afford 2,592,000 bad requests per month before breaching.
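The derivation can be sketched as two small helpers, one for a time-based budget and one for a request-based budget. The function names and the 10,000,000-request volume in the usage lines are assumptions chosen for illustration.

```typescript
const SECONDS_PER_DAY = 86_400;

function assertValidTarget(sloTarget: number): void {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new Error('sloTarget must be strictly between 0 and 1');
  }
}

// Allowed downtime, in seconds, for a time-based availability SLO.
function errorBudgetSeconds(sloTarget: number, windowDays: number): number {
  assertValidTarget(sloTarget);
  return windowDays * SECONDS_PER_DAY * (1 - sloTarget);
}

// Allowed bad requests, given the number of valid requests expected in the window.
function errorBudgetRequests(sloTarget: number, expectedValidRequests: number): number {
  assertValidTarget(sloTarget);
  return expectedValidRequests * (1 - sloTarget);
}

const budgetSeconds = errorBudgetSeconds(0.999, 30);
// about 2,592 seconds, i.e. about 43.2 minutes (subject to floating-point rounding)

const budgetRequests = errorBudgetRequests(0.999, 10_000_000);
// about 10,000 bad requests for a hypothetical 10,000,000 valid requests in the window
```

Both helpers multiply the same allowed failure rate, `1 - sloTarget`, by a different measure of the window, which is why the budget is derived rather than chosen.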
Common SLO Targets
| SLO Target | Allowed Failure Rate | Error Budget (30-day window) |
|---|---|---|
| 99% | 1% | ~7.2 hours |
| 99.5% | 0.5% | ~3.6 hours |
| 99.9% | 0.1% | ~43.2 minutes |
| 99.95% | 0.05% | ~21.6 minutes |
| 99.99% | 0.01% | ~4.3 minutes |
Burn Rate
Burn rate measures how fast the error budget is being consumed relative to the rate at which it should be consumed to exhaust exactly at the end of the SLO window.
Definition
Burn rate = (observed error rate) / (error rate at which the budget exhausts over the full SLO window)
A burn rate of 1 means the budget is being consumed at exactly the target pace — it will exhaust at the end of the window. A burn rate of 2 means the budget will exhaust in half the window. A burn rate less than 1 means the budget will last beyond the window.
Formula
burn_rate = (bad_events / valid_events) / (1 - SLO_target)
= observed_error_rate / allowed_error_rate
Where:
- bad_events / valid_events is the observed error rate (the complement of the SLI)
- 1 - SLO_target is the allowed error rate derived from the SLO
Worked Example
SLO target: 99.9% → allowed error rate = 1 - 0.999 = 0.001
Observed error rate: 0.3% → bad_events/valid_events = 0.003
burn_rate = 0.003 / 0.001 = 3
Interpretation: budget is being consumed 3x faster than the target rate.
At burn rate 3, the 30-day budget exhausts in: 30 days / 3 = 10 days.
Burn Rate and Time to Exhaustion
The relationship between burn rate and time to exhaustion is direct:
time_to_exhaustion = SLO_window / burn_rate
Examples for a 30-day window:
burn_rate 1 → 30 days (exactly on target)
burn_rate 3 → 10 days
burn_rate 6 → 5 days
burn_rate 14.4 → 2.08 days ≈ 50 hours (Google SRE Workbook fast-page threshold)
burn_rate 30 → 1 day
The value 14.4 appears in the Google SRE Workbook (Chapter 5) as the threshold for a high-severity page alert: at burn rate 14.4 the monthly budget exhausts in approximately 2 days, which is fast enough to page an on-call engineer immediately. Full alert threshold selection is in Alerting-Strategies.
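A short sketch of both directions of this relationship follows. The helper names are assumptions for illustration. The second helper reflects how the Workbook arrives at 14.4: it is the burn rate that spends 2% of a 30-day budget within one hour.

```typescript
// Hours until the budget is gone if the current burn rate holds.
function timeToExhaustionHours(sloWindowHours: number, burnRate: number): number {
  if (burnRate <= 0) {
    return Infinity; // nothing is being consumed, so the budget never exhausts
  }
  return sloWindowHours / burnRate;
}

// Burn rate that consumes `budgetFraction` of the budget within `alertWindowHours`.
function burnRateForBudgetFraction(
  budgetFraction: number,
  sloWindowHours: number,
  alertWindowHours: number,
): number {
  return (budgetFraction * sloWindowHours) / alertWindowHours;
}

const THIRTY_DAYS_IN_HOURS = 30 * 24; // 720

const hoursLeftAtFastBurn = timeToExhaustionHours(THIRTY_DAYS_IN_HOURS, 14.4);
// about 50 hours

const fastPageBurnRate = burnRateForBudgetFraction(0.02, THIRTY_DAYS_IN_HOURS, 1);
// about 14.4: spending 2% of the budget in 1 hour
```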
Multiwindow Alerting Prerequisite
A single time window is insufficient for burn rate alerting. This is a foundational constraint, not a preference.
Why Two Windows Are Required
| Window | Sensitivity | Problem if Used Alone |
|---|---|---|
| Short (e.g., 1 hour) | High — catches fast-burning incidents quickly | False positives: a brief traffic spike or deployment blip can momentarily push burn rate above threshold, firing an alert that resolves itself |
| Long (e.g., 6 hours) | Low: confirms sustained burn over time | Slow detection and slow reset: averaging over 6 hours delays the moment a fast burn crosses the threshold, and the alert keeps firing for hours after the incident has ended |
Rule: the alert fires ONLY when BOTH windows simultaneously exceed their respective thresholds.
Alert condition:
burn_rate(1h window) > threshold
AND
burn_rate(6h window) > threshold
The short window detects the incident quickly. The long window confirms it is sustained, not a transient spike. Together they achieve:
- Low false-positive rate (long window filters transient spikes)
- Low false-negative rate (short window catches fast burns)
- Actionable alerts with enough budget remaining to act
Single-window alerting forces an unresolvable tradeoff: a tight threshold catches everything but generates constant noise; a loose threshold misses slow burns entirely. Two windows eliminate this tradeoff.
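The two-window rule can be sketched as a single predicate. This is an illustration under stated assumptions, not an alerting implementation: the `WindowCounts` shape, the function names, and the choice to treat a window with no traffic as zero burn are all hypothetical.

```typescript
// Event counts observed over one lookback window (hypothetical shape).
interface WindowCounts {
  badEvents: number;
  validEvents: number;
}

function windowBurnRate(counts: WindowCounts, sloTarget: number): number {
  if (counts.validEvents <= 0) {
    return 0; // assumption: no traffic means no measurable burn
  }
  return counts.badEvents / counts.validEvents / (1 - sloTarget);
}

// Fires only when BOTH windows exceed their respective thresholds.
function shouldAlert(
  shortWindow: WindowCounts,
  longWindow: WindowCounts,
  sloTarget: number,
  shortThreshold: number,
  longThreshold: number,
): boolean {
  return (
    windowBurnRate(shortWindow, sloTarget) > shortThreshold &&
    windowBurnRate(longWindow, sloTarget) > longThreshold
  );
}

// Transient spike: the last hour burns at about 20x, but the 6-hour view is only 3.5x.
const spikeFires = shouldAlert(
  { badEvents: 2_000, validEvents: 100_000 },
  { badEvents: 2_100, validEvents: 600_000 },
  0.999, 14.4, 14.4,
); // false

// Sustained burn: both windows are above 14.4.
const sustainedFires = shouldAlert(
  { badEvents: 2_000, validEvents: 100_000 },
  { badEvents: 10_000, validEvents: 600_000 },
  0.999, 14.4, 14.4,
); // true
```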
Note: Full multi-window multi-burn-rate alerting implementation — including the four alert rules for fast page, slow page, fast ticket, and slow ticket — is covered in Alerting-Strategies.
SLA vs SLO vs SLI Summary Table
| Term | Audience | Consequence of Breach | Example |
|---|---|---|---|
| SLA | Customer / Legal | Financial penalty, service credit, contract termination | "99.9% uptime or 10% monthly credit" |
| SLO | Engineering / Operations | Feature freeze, incident review, engineering sprint | "99.95% availability on 30-day rolling window" |
| SLI | Engineering (measurement) | Input to SLO; no direct consequence — it is a number | good_requests / total_requests = 0.9997 |
Mermaid Chain Diagram
flowchart LR
A[SLI\ngood/valid events] --> B[SLO\ntarget ratio]
B --> C["Error Budget\n(1 - SLO_target) × window"]
C --> D[Burn Rate\nbudget consumption rate]
D --> E[Alert Threshold\nmultiwindow trigger]
Each node in the chain is a prerequisite for the next. You cannot define an error budget without an SLO target. You cannot compute a burn rate without an error budget. You cannot set a meaningful alert threshold without a burn rate.
TypeScript Example — computeBurnRate
interface BurnRateResult {
  sli: number; // ratio of good events to valid events (0–1)
  errorRate: number; // 1 - sli (the complement)
  burnRate: number; // errorRate / (1 - sloTarget)
  budgetRemainingPct: number; // percentage of error budget remaining (0–100)
}

/**
 * Compute burn rate metrics for a given observation window.
 *
 * budgetRemainingPct equals the remaining budget only when the counts cover
 * the full SLO window; over a shorter window it is not meaningful.
 *
 * @param goodEvents - count of requests that met quality criteria
 * @param validEvents - count of all requests that should have met criteria (must be > 0)
 * @param sloTarget - SLO target as a decimal strictly between 0 and 1 (e.g. 0.999 for 99.9%)
 */
function computeBurnRate(
  goodEvents: number,
  validEvents: number,
  sloTarget: number,
): BurnRateResult {
  if (validEvents <= 0) {
    throw new Error('validEvents must be > 0; cannot compute SLI on zero events');
  }
  if (sloTarget <= 0 || sloTarget >= 1) {
    // sloTarget = 1 would make the allowed error rate zero and the division undefined
    throw new Error('sloTarget must be strictly between 0 and 1');
  }
  const sli = goodEvents / validEvents;
  const errorRate = 1 - sli;
  const allowedErrorRate = 1 - sloTarget;
  const burnRate = errorRate / allowedErrorRate;
  // Over the full SLO window, burnRate is the fraction of budget consumed,
  // so 1 - burnRate is the fraction remaining (clamped to 0–100%).
  const budgetRemainingPct = Math.max(0, Math.min(100, (1 - burnRate) * 100));
  return { sli, errorRate, burnRate, budgetRemainingPct };
}

// Example: 1,000,000 valid requests, 300 bad (0.03% error rate), SLO = 99.9%
const result = computeBurnRate(999_700, 1_000_000, 0.999);
// sli = 0.9997, errorRate ≈ 0.0003, burnRate ≈ 0.3, budgetRemainingPct ≈ 70 (floating-point rounding applies)
Java Example — SloMetrics.computeBurnRate
public final class SloMetrics {
    private SloMetrics() {}

    public record BurnRateResult(
            double sli,
            double errorRate,
            double burnRate,
            double budgetRemainingPct
    ) {}

    /**
     * Compute burn rate metrics for a given observation window.
     *
     * budgetRemainingPct equals the remaining budget only when the counts
     * cover the full SLO window; over a shorter window it is not meaningful.
     *
     * @param goodEvents count of requests that met quality criteria
     * @param validEvents count of all requests that should have met criteria (must be > 0)
     * @param sloTarget SLO target as a decimal strictly between 0 and 1 (e.g. 0.999 for 99.9%)
     */
    public static BurnRateResult computeBurnRate(
            long goodEvents, long validEvents, double sloTarget) {
        if (validEvents <= 0) {
            throw new IllegalArgumentException(
                    "validEvents must be > 0; cannot compute SLI on zero events");
        }
        if (sloTarget <= 0.0 || sloTarget >= 1.0) {
            // sloTarget = 1 would make the allowed error rate zero
            throw new IllegalArgumentException(
                    "sloTarget must be strictly between 0 and 1");
        }
        double sli = (double) goodEvents / validEvents;
        double errorRate = 1.0 - sli;
        double allowedErrorRate = 1.0 - sloTarget;
        double burnRate = errorRate / allowedErrorRate;
        // Over the full SLO window, burnRate is the fraction of budget consumed,
        // so 1 - burnRate is the fraction remaining (clamped to 0–100%).
        double budgetRemainingPct = Math.max(0.0, Math.min(100.0, (1.0 - burnRate) * 100.0));
        return new BurnRateResult(sli, errorRate, burnRate, budgetRemainingPct);
    }
}

// Example: 999_700 good out of 1_000_000 valid, SLO = 99.9%
// BurnRateResult r = SloMetrics.computeBurnRate(999_700L, 1_000_000L, 0.999);
// r.burnRate() ≈ 0.3, r.budgetRemainingPct() ≈ 70.0 (compare with a tolerance, not ==)
Suitability — When to Use SLO-Driven Reliability
Use SLOs when all three conditions hold:
- Production services with a defined user-facing latency or availability contract — the contract gives the SLO its target; without one, any threshold is arbitrary.
- Services with enough traffic to produce statistically meaningful SLI ratios — at very low volumes (fewer than 100 requests/day), individual failed requests dominate the SLI and produce noisy burn rate signals. Consider synthetic probes instead of traffic-based SLIs.
- Teams willing to act on error budget data — the commitment is: when budget is exhausted, features freeze and reliability work takes priority. Without this commitment, SLOs are decoration.
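To make the low-traffic condition concrete, the sketch below computes the burn rate that a single failed request produces over one day at different volumes. The function name is an assumption for illustration, and the figures follow directly from the burn rate formula.

```typescript
// Burn rate over one day caused by exactly one bad request.
function singleFailureBurnRate(dailyValidRequests: number, sloTarget: number): number {
  if (dailyValidRequests <= 0) {
    throw new Error('dailyValidRequests must be > 0');
  }
  const observedErrorRate = 1 / dailyValidRequests;
  return observedErrorRate / (1 - sloTarget);
}

const lowTrafficBurn = singleFailureBurnRate(100, 0.999);
// about 10: one failure in 100 requests looks like a serious burn

const highTrafficBurn = singleFailureBurnRate(1_000_000, 0.999);
// about 0.001: the same single failure is invisible at high volume
```

At 100 requests per day, one unlucky request pushes the daily burn rate to roughly 10, which is why a traffic-based SLI is a noisy signal at that scale.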
Related Concepts
- Metrics-and-Dashboards — SLI is measured via the metrics pillar; RED method error rate feeds directly into SLI computation
- Alerting-Strategies — burn rate drives multi-window alert thresholds; fast-burn and slow-burn alert rules
- Distributed-Tracing-Patterns — trace sampling decisions are often conditioned on error budget state; reduce sampling when budget is healthy, increase when burning fast