SLO-SLI-SLA

Service Level Indicators, Objectives, and Agreements form the foundational vocabulary of site reliability engineering. They translate vague notions of "the service is healthy" into precise, measurable targets with explicit consequences. The three terms form a hierarchy: SLIs are the measurements, SLOs are the targets set on those measurements, and SLAs are the contractual commitments backed by those targets.

Scope: This note covers the reliability math: SLI ratio definition, error budget derivation, burn rate formula, and the multiwindow alerting prerequisite. For dashboard instrumentation of the underlying metrics, see Metrics-and-Dashboards. For alert implementation, see Alerting-Strategies.

When NOT to Use

SLO-driven reliability engineering is premature or counterproductive in these contexts:

Pre-production and prototype systems — there is no user journey to protect; setting SLOs creates false urgency and distorts prioritisation.
Systems without a defined latency or availability contract — an SLO without a target is meaningless; if no one has agreed on "what good looks like," the math has no anchor.
Single-developer scripts and internal tooling — the overhead of error budget tracking exceeds the benefit; direct incident response is sufficient.
Teams unwilling to act on error budget data — SLOs only produce value when exhausted budget triggers a real response (feature freeze, incident review). Without that commitment, the machinery is theatre.

Core Definitions

SLA — Service Level Agreement

A contractual commitment between a service provider and a customer, with defined consequences (financial penalties, service credits, contract termination) on breach. SLAs are external and legal.

Example: "99.9% monthly uptime or 10% service credit."

SLO — Service Level Objective

An internal reliability target set by the engineering team, typically stricter than the SLA, that drives operational and engineering decisions. Breaching an SLO triggers engineering action; it does not trigger a legal consequence directly.

Example: "99.95% availability target — measured on 30-day rolling window."

SLI — Service Level Indicator

A ratio of good events to valid events — not a raw metric, not a count, and not a gauge value.

SLI = good_events / valid_events

The distinction from raw metrics is load-bearing: a counter of errors is not an SLI. Dividing that counter by the total valid request count produces an SLI. This ratio must be between 0 and 1 (or expressed as a percentage: 0–100%).

Term	What it Represents	Example
`good_events`	Requests that met the quality criteria	HTTP 2xx responses with latency < 500 ms
`valid_events`	All requests that should have met the criteria	All HTTP requests except health checks and warmup traffic

Why ratio, not count? Traffic volume fluctuates. An error count of 500 per second is alarming on a 1,000 RPS service (50% error rate) but routine on a 500,000 RPS service (0.1% error rate). The SLI normalises for traffic, making the signal meaningful across operating conditions.
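As a minimal sketch of the ratio, the criteria in the table above can be applied to a batch of request records. The `RequestRecord` shape and its field names (`status`, `latencyMs`, `isHealthCheck`) are assumptions made for illustration and do not come from any particular framework.

```typescript
// Hypothetical request record; field names are assumptions for illustration.
interface RequestRecord {
  status: number;         // HTTP status code
  latencyMs: number;      // end-to-end latency in milliseconds
  isHealthCheck: boolean; // health checks and warmup traffic are not valid events
}

// valid_events: every request that should have met the quality criteria.
function isValid(r: RequestRecord): boolean {
  return !r.isHealthCheck;
}

// good_events: valid requests with a 2xx status and latency under the threshold.
function isGood(r: RequestRecord, latencyThresholdMs = 500): boolean {
  return isValid(r) && r.status >= 200 && r.status < 300 && r.latencyMs < latencyThresholdMs;
}

// SLI = good_events / valid_events; undefined when there are no valid events.
function computeSli(requests: RequestRecord[]): number | undefined {
  const valid = requests.filter(isValid);
  if (valid.length === 0) return undefined;
  const good = valid.filter((r) => isGood(r)).length;
  return good / valid.length;
}

const sample: RequestRecord[] = [
  { status: 200, latencyMs: 120, isHealthCheck: false }, // good
  { status: 200, latencyMs: 900, isHealthCheck: false }, // valid, but too slow
  { status: 503, latencyMs: 40, isHealthCheck: false },  // valid, but a server error
  { status: 204, latencyMs: 80, isHealthCheck: false },  // good
  { status: 200, latencyMs: 5, isHealthCheck: true },    // excluded from valid_events
];
console.log(computeSli(sample)); // 0.5 (2 good out of 4 valid)
```

Returning `undefined` for an empty window is one possible design choice; it keeps "no traffic" distinct from "0% good" so that a quiet period is not reported as an outage.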

Error Budget Derivation

The error budget is the allowable failure quota implied by the SLO target. It is derived — not chosen separately.

Step-by-Step Derivation (99.9% availability SLO)

SLO target:           99.9%  →  SLO as decimal = 0.999
Allowed failure rate: 1 - 0.999 = 0.001  (0.1%)

Monthly window:
  30 days × 86,400 seconds/day = 2,592,000 total seconds

Error budget (seconds):
  2,592,000 × 0.001 = 2,592 seconds ≈ 43.2 minutes/month

Error budget (requests at 1,000 RPS):
  2,592,000 seconds × 1,000 requests/second = 2,592,000,000 total requests
  2,592,000,000 × 0.001 = 2,592,000 bad requests/month

Reading the derivation:

If the service is unavailable for more than 43.2 cumulative minutes in any 30-day window, the SLO is breached.
At a steady 1,000 RPS, the team can afford 2,592,000 bad requests per month before breaching.
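The derivation can be sketched as two small functions. The function names are hypothetical, and the request-based budget assumes a constant request rate over the whole window, which real traffic rarely has. Results are floating-point values, so they are formatted before printing.

```typescript
const SECONDS_PER_DAY = 86_400;

// Allowed failure rate implied by the SLO target (e.g. 0.999 gives roughly 0.001).
function allowedFailureRate(sloTarget: number): number {
  if (!(sloTarget > 0 && sloTarget < 1)) {
    throw new Error("sloTarget must be strictly between 0 and 1");
  }
  return 1 - sloTarget;
}

// Time-based budget: seconds of full unavailability the window can absorb.
function errorBudgetSeconds(sloTarget: number, windowDays: number): number {
  return windowDays * SECONDS_PER_DAY * allowedFailureRate(sloTarget);
}

// Request-based budget, assuming a constant request rate over the window.
function errorBudgetRequests(
  sloTarget: number,
  windowDays: number,
  requestsPerSecond: number,
): number {
  const totalRequests = windowDays * SECONDS_PER_DAY * requestsPerSecond;
  return totalRequests * allowedFailureRate(sloTarget);
}

const budgetSeconds = errorBudgetSeconds(0.999, 30);
console.log(budgetSeconds.toFixed(1));        // "2592.0" seconds
console.log((budgetSeconds / 60).toFixed(1)); // "43.2" minutes
```

`errorBudgetRequests` scales the same allowed failure rate by the total request count, so the request budget grows linearly with traffic while the time budget stays fixed.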

Common SLO Targets

SLO Target	Allowed Failure Rate	Monthly Error Budget (30-day window)
99%	1%	~7.2 hours
99.5%	0.5%	~3.6 hours
99.9%	0.1%	~43.2 minutes
99.95%	0.05%	~21.6 minutes
99.99%	0.01%	~4.3 minutes

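The error budget shrinks in direct proportion to the allowed failure rate, so each additional nine cuts the budget by a factor of ten. Moving from 99.9% to 99.99% shrinks the budget from about 43 minutes to about 4 minutes per month, leaving far less room for incidents, deployments, and maintenance. The engineering investment needed to stay inside that smaller budget typically rises steeply.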

Burn Rate

Burn rate measures how fast the error budget is being consumed relative to the rate at which it should be consumed to exhaust exactly at the end of the SLO window.

Definition

Burn rate = (observed error rate) / (error rate at which the budget exhausts over the full SLO window)

A burn rate of 1 means the budget is being consumed at exactly the target pace — it will exhaust at the end of the window. A burn rate of 2 means the budget will exhaust in half the window. A burn rate less than 1 means the budget will last beyond the window.

Formula

burn_rate = (bad_events / valid_events) / (1 - SLO_target)
          = observed_error_rate / allowed_error_rate

Where:

bad_events / valid_events is the observed error rate (the complement of the SLI)
1 - SLO_target is the allowed error rate derived from the SLO

Worked Example

SLO target:           99.9%  →  allowed error rate = 1 - 0.999 = 0.001
Observed error rate:  0.3%   →  bad_events/valid_events = 0.003

burn_rate = 0.003 / 0.001 = 3

Interpretation: budget is being consumed 3x faster than the target rate.
At burn rate 3, the 30-day budget exhausts in: 30 days / 3 = 10 days.

Burn Rate and Time to Exhaustion

The relationship between burn rate and time to exhaustion is direct:

time_to_exhaustion = SLO_window / burn_rate

Examples for a 30-day window:
  burn_rate 1   →  30 days  (exactly on target)
  burn_rate 3   →  10 days
  burn_rate 6   →   5 days
  burn_rate 14.4 →  2.08 days  ≈ 50 hours  (Google SRE Workbook fast-page threshold)
  burn_rate 30  →   1 day

The value 14.4 appears in the Google SRE Workbook (Chapter 5) as the threshold for a high-severity page alert. It is the burn rate at which 2% of a 30-day budget is consumed in one hour (0.02 × 720 hours = 14.4). If sustained, it exhausts the monthly budget in approximately 2 days, which justifies paging an on-call engineer immediately. Full alert threshold selection is in Alerting-Strategies.
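The exhaustion relationship is simple enough to sketch directly. The function names below are hypothetical. `burnRateForBudgetFraction` is an illustrative helper that works backwards from "what fraction of the budget may be spent within an alert window" to the corresponding burn rate.

```typescript
// time_to_exhaustion = SLO_window / burn_rate, in the same unit as the window.
function timeToExhaustionDays(sloWindowDays: number, burnRate: number): number {
  if (burnRate <= 0) return Infinity; // no errors: the budget is never exhausted
  return sloWindowDays / burnRate;
}

// Burn rate that consumes `budgetFraction` of the budget within `alertWindowHours`.
function burnRateForBudgetFraction(
  budgetFraction: number,
  sloWindowDays: number,
  alertWindowHours: number,
): number {
  return (budgetFraction * sloWindowDays * 24) / alertWindowHours;
}

// Reproduce the 30-day examples listed above, one line per burn rate.
for (const rate of [1, 3, 6, 14.4, 30]) {
  console.log(`burn_rate ${rate} -> ${timeToExhaustionDays(30, rate).toFixed(2)} days`);
}

// 2% of a 30-day budget spent in 1 hour corresponds to a burn rate of 14.4.
console.log(burnRateForBudgetFraction(0.02, 30, 1).toFixed(1)); // "14.4"
```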

Multiwindow Alerting Prerequisite

A single time window is insufficient for burn rate alerting. This is a foundational constraint, not a preference.

Why Two Windows Are Required

Window	Sensitivity	Problem if Used Alone
Short (e.g., 1 hour)	High: catches fast-burning incidents quickly	False positives: a brief traffic spike or deployment blip can momentarily push burn rate above threshold, firing an alert that resolves itself
Long (e.g., 6 hours)	Low: confirms sustained burn over time	Slow to fire and slow to reset: a severe outage takes several times longer to lift the 6-hour average above the threshold, and the alert keeps firing for hours after the incident ends because the bad period is still inside the window

Rule: the alert fires ONLY when BOTH windows simultaneously exceed their respective thresholds.

Alert condition:
  burn_rate(1h window)  > threshold
  AND
  burn_rate(6h window)  > threshold

The short window detects the incident quickly. The long window confirms it is sustained, not a transient spike. Together they achieve:

Low false-positive rate (long window filters transient spikes)
Low false-negative rate (short window catches fast burns)
Actionable alerts with enough budget remaining to act

Single-window alerting forces an unresolvable tradeoff: a tight threshold catches everything but generates constant noise; a loose threshold misses slow burns entirely. Requiring two windows to agree largely removes this tradeoff.
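The alert condition can be sketched as a pure function over two sets of window counts. The `WindowCounts` type and function names are hypothetical, and for brevity the sketch applies a single shared threshold to both windows, although each window may carry its own.

```typescript
// Event counts observed over one lookback window.
interface WindowCounts {
  badEvents: number;
  validEvents: number;
}

// burn_rate = (bad_events / valid_events) / (1 - SLO_target)
function windowBurnRate(counts: WindowCounts, sloTarget: number): number {
  if (counts.validEvents <= 0) return 0; // no traffic: treat as not burning
  return counts.badEvents / counts.validEvents / (1 - sloTarget);
}

// Fires only when BOTH the short and the long window exceed the threshold.
function shouldAlert(
  shortWindow: WindowCounts,
  longWindow: WindowCounts,
  sloTarget: number,
  burnThreshold: number,
): boolean {
  return (
    windowBurnRate(shortWindow, sloTarget) > burnThreshold &&
    windowBurnRate(longWindow, sloTarget) > burnThreshold
  );
}

const sloTarget = 0.999;
const burnThreshold = 14.4;

// Transient spike: the last hour is bad, but the 6-hour window is mostly clean.
const spike = shouldAlert(
  { badEvents: 2_000, validEvents: 100_000 }, // 1h: 2% errors, burn rate about 20
  { badEvents: 2_100, validEvents: 600_000 }, // 6h: 0.35% errors, burn rate about 3.5
  sloTarget,
  burnThreshold,
);

// Sustained burn: both windows sit well above the threshold.
const sustained = shouldAlert(
  { badEvents: 2_000, validEvents: 100_000 },  // 1h: burn rate about 20
  { badEvents: 12_000, validEvents: 600_000 }, // 6h: 2% errors, burn rate about 20
  sloTarget,
  burnThreshold,
);

console.log(spike, sustained); // false true
```

Treating an empty window as a burn rate of 0 is an assumption of this sketch; a service that should always have traffic may prefer to alert on the absence of data separately.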

Note: Full multi-window multi-burn-rate alerting implementation — including the four alert rules for fast page, slow page, fast ticket, and slow ticket — is covered in Alerting-Strategies.

SLA vs SLO vs SLI Summary Table

Term	Audience	Consequence of Breach	Example
SLA	Customer / Legal	Financial penalty, service credit, contract termination	"99.9% uptime or 10% monthly credit"
SLO	Engineering / Operations	Feature freeze, incident review, engineering sprint	"99.95% availability on 30-day rolling window"
SLI	Engineering (measurement)	Input to SLO; no direct consequence, since it is only a number	`good_events / valid_events = 0.9997`

Mermaid Chain Diagram

flowchart LR
    A[SLI\ngood/valid events] --> B[SLO\ntarget ratio]
    B --> C["Error Budget\n(1 - SLO_target) × window"]
    C --> D[Burn Rate\nbudget consumption rate]
    D --> E[Alert Threshold\nmultiwindow trigger]

Each node in the chain is a prerequisite for the next. You cannot define an error budget without an SLO target. You cannot compute a burn rate without an error budget. You cannot set a meaningful alert threshold without a burn rate.

TypeScript Example — computeBurnRate

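interface BurnRateResult {
  sli: number;             // ratio of good events to valid events (0–1)
  errorRate: number;       // bad events / valid events (the complement of the SLI)
  burnRate: number;        // errorRate / (1 - sloTarget)
  // Percentage of error budget remaining (0–100). Only meaningful when the
  // counts span the full SLO window, not a shorter alerting window.
  budgetRemainingPct: number;
}

/**
 * Compute burn rate metrics for a given observation window.
 *
 * @param goodEvents  - count of requests that met quality criteria
 * @param validEvents - count of all requests that should have met criteria
 * @param sloTarget   - SLO target as a decimal (e.g. 0.999 for 99.9%)
 */
function computeBurnRate(
  goodEvents: number,
  validEvents: number,
  sloTarget: number,
): BurnRateResult {
  if (validEvents <= 0) {
    throw new Error('validEvents must be > 0; cannot compute SLI on zero events');
  }
  if (goodEvents < 0 || goodEvents > validEvents) {
    throw new Error('goodEvents must be between 0 and validEvents');
  }
  // A target of exactly 1 leaves no budget and would divide by zero below.
  if (!(sloTarget > 0 && sloTarget < 1)) {
    throw new Error('sloTarget must be strictly between 0 and 1');
  }
  const sli = goodEvents / validEvents;
  // Derive the error rate from the bad-event count rather than 1 - sli,
  // which loses precision in floating point when sli is close to 1.
  const errorRate = (validEvents - goodEvents) / validEvents;
  const allowedErrorRate = 1 - sloTarget;
  const burnRate = errorRate / allowedErrorRate;
  // Over the full SLO window, burnRate equals the fraction of budget consumed,
  // so 1 - burnRate is the fraction remaining; clamp to 0–100.
  const budgetRemainingPct = Math.max(0, Math.min(100, (1 - burnRate) * 100));
  return { sli, errorRate, burnRate, budgetRemainingPct };
}

// Example: 1,000,000 valid requests, 300 bad (0.03% error rate), SLO = 99.9%
const result = computeBurnRate(999_700, 1_000_000, 0.999);
// sli = 0.9997, errorRate = 0.0003, burnRate ≈ 0.3, budgetRemainingPct ≈ 70
// (approximate: 1 - 0.999 is not exactly representable in floating point)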

Java Example — SloMetrics.computeBurnRate

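public final class SloMetrics {

    private SloMetrics() {}

    /** budgetRemainingPct is only meaningful when the counts span the full SLO window. */
    public record BurnRateResult(
        double sli,
        double errorRate,
        double burnRate,
        double budgetRemainingPct
    ) {}

    /**
     * Compute burn rate metrics for a given observation window.
     *
     * @param goodEvents  count of requests that met quality criteria (0 to validEvents)
     * @param validEvents count of all requests that should have met criteria (must be > 0)
     * @param sloTarget   SLO target as a decimal, strictly between 0 and 1 (e.g. 0.999 for 99.9%)
     * @throws IllegalArgumentException if any argument is out of range
     */
    public static BurnRateResult computeBurnRate(
            long goodEvents, long validEvents, double sloTarget) {

        if (validEvents <= 0) {
            throw new IllegalArgumentException(
                "validEvents must be > 0; cannot compute SLI on zero events");
        }
        if (goodEvents < 0 || goodEvents > validEvents) {
            throw new IllegalArgumentException(
                "goodEvents must be between 0 and validEvents");
        }
        // Written as a negated range check so that NaN is rejected too;
        // a target of exactly 1 would divide by zero below.
        if (!(sloTarget > 0.0 && sloTarget < 1.0)) {
            throw new IllegalArgumentException(
                "sloTarget must be strictly between 0 and 1");
        }
        double sli = (double) goodEvents / validEvents;
        // Derive the error rate from the bad-event count rather than 1 - sli,
        // which loses precision in floating point when sli is close to 1.
        double errorRate = (double) (validEvents - goodEvents) / validEvents;
        double allowedErrorRate = 1.0 - sloTarget;
        double burnRate = errorRate / allowedErrorRate;
        // Over the full SLO window, burnRate equals the fraction of budget consumed,
        // so 1 - burnRate is the fraction remaining; clamp to 0–100.
        double budgetRemainingPct = Math.max(0.0, Math.min(100.0, (1.0 - burnRate) * 100.0));
        return new BurnRateResult(sli, errorRate, burnRate, budgetRemainingPct);
    }
}

// Example: 999_700 good out of 1_000_000 valid, SLO = 99.9%
// BurnRateResult r = SloMetrics.computeBurnRate(999_700L, 1_000_000L, 0.999);
// r.burnRate() ≈ 0.3, r.budgetRemainingPct() ≈ 70.0 (compare with a tolerance, not ==)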

Suitability — When to Use SLO-Driven Reliability

Use SLOs when all three conditions hold:

Production services with a defined user-facing latency or availability contract — the contract gives the SLO its target; without one, any threshold is arbitrary.
Services with enough traffic to produce statistically meaningful SLI ratios — at very low volumes (fewer than 100 requests/day), individual failed requests dominate the SLI and produce noisy burn rate signals. Consider synthetic probes instead of traffic-based SLIs.
Teams willing to act on error budget data — the commitment is: when budget is exhausted, features freeze and reliability work takes priority. Without this commitment, SLOs are decoration.

Related Concepts

Metrics-and-Dashboards — SLI is measured via the metrics pillar; RED method error rate feeds directly into SLI computation
Alerting-Strategies — burn rate drives multi-window alert thresholds; fast-burn and slow-burn alert rules
Distributed-Tracing-Patterns — trace sampling decisions are often conditioned on error budget state; reduce sampling when budget is healthy, increase when burning fast

Backlinks

Metrics-and-Dashboards — SLI is measured via the metrics pillar
Alerting-Strategies — burn rate drives multi-window alert thresholds
Distributed-Tracing-Patterns — trace sampling decisions often conditioned on error budget state
