On-Call Practices

On-call is the human layer of the reliability system. It connects automated alerting (Alerting-Strategies defines which alerts page and at what severity) to structured human response (Incident-Response defines what the engineer does once paged). This note covers the organisational layer between the two: rotation design, escalation policy, and the measurement of on-call health over time.

Scope: This note covers rotation design patterns (follow-the-sun vs single-team), escalation policy levels, and on-call health measurement via toil percentage, MTTA, and pages per shift. For alert routing that feeds the escalation policy, see Alerting-Strategies. For what happens after the page, see Incident-Response. For runbook anatomy, see Runbook-Design.


When NOT to Use (Formal On-Call Rotation)

Formal on-call rotation adds organisational overhead that is only justified by incident frequency and user impact severity.

  • Solo developers — a single engineer cannot rotate; incident response is ad hoc and documented instead.
  • Pre-production and prototype systems — there is no user journey to protect; downtime during business hours is acceptable by definition.
  • Services without an SLO — without a defined availability target, there is no basis for declaring an incident or setting escalation thresholds. Establish the SLO before establishing the rotation.
  • Services with fewer than 2-3 incidents per month — low incident frequency does not justify the overhead of a paging schedule, runbook maintenance, and rotation tooling. Use a lightweight "whoever owns the service responds" policy instead.

Rotation Design Patterns

Two primary models cover the vast majority of team configurations.

Follow-the-Sun Rotation

The team spans at least three geographic timezones. Each region handles its own daytime on-call hours. Pages are handed off at shift boundaries, eliminating sleep disruption for every on-call engineer.

Requirements:

  • Minimum three regional teams with business-hours coverage that chains together (e.g., APAC → EMEA → Americas)
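  • Overlap windows between adjacent shifts to support warm handoff at shift boundaries (three 8-hour shifts already cover the full day, so each overlap extends a shift slightly beyond 8 hours)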
  • A documented warm handoff protocol: incident summary, current error budget state, open investigations, and actions taken
  • Shared runbooks and tooling accessible to all regional teams

Tradeoffs:

  • Eliminates sleep disruption (primary advantage)
  • Requires significant investment in handoff discipline — context loss at shift boundaries is the primary failure mode
  • Needs roughly 6-9 engineers across three regions (2-3 per region) to maintain reasonable rotation depth
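
As a rough illustration of the follow-the-sun model above, the sketch below maps a moment in time to the region that holds the pager. The shift boundaries in SHIFTS and the function name region_on_call are assumptions made for this example; real boundaries depend on where the regional teams actually sit.

```python
from datetime import datetime, timezone

# Hypothetical shift table: (start_hour_utc, end_hour_utc, region).
# These boundaries are illustrative assumptions, not values from this note.
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "Americas"),
]


def region_on_call(now: datetime) -> str:
    """Return the region whose shift covers the given timezone-aware moment."""
    if now.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"no shift covers hour {hour}")
```

For example, region_on_call(datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)) falls in the 08:00-16:00 UTC window and returns "EMEA". The sketch ignores overlap windows; a real schedule would page both regions during a handoff.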

Single-Team Rotation

All engineers on the team rotate regardless of timezone. This is the standard model for teams below the follow-the-sun threshold.

Standard rotation period: 1 week.

  • Shorter periods (3 days) reduce burnout for high-incident systems — the on-call burden is spread more evenly.
  • Longer periods (2 weeks) reduce context-switching cost and allow the engineer to build deeper incident familiarity — viable only for low-incident services.

Timezone risk: non-local engineers are paged during their night hours. Mitigate with:

  • Explicit comp time policy: every after-hours P1 page earns time off
  • Incident frequency targets: if any engineer is paged > 3 times during sleep hours per rotation, the service needs a reliability sprint

Primary + Secondary pattern: add a secondary on-call role on all rotations.

  • Primary engineer responds first.
  • Secondary engineer receives a page if the primary does not acknowledge within 5-10 minutes.
  • Prevents single points of failure in the response chain.
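
The primary + secondary pattern can be sketched as a simple round-robin schedule. This is a minimal sketch under stated assumptions: the function name build_rotation, the engineer names, and the choice to make each secondary the next shift's primary are all illustrative and not prescribed by this note.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, shifts, period_days=7):
    """Return a list of (shift_start, primary, secondary) tuples.

    Assumption: the secondary for a shift is the engineer who becomes
    primary on the following shift, so they arrive with fresh context.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * period_days)
        primary = engineers[i % len(engineers)]
        secondary = engineers[(i + 1) % len(engineers)]
        schedule.append((shift_start, primary, secondary))
    return schedule
```

Calling build_rotation(["ana", "ben", "chi", "dev"], date(2024, 1, 1), shifts=3) yields one entry per week; passing period_days=3 or period_days=14 models the shorter and longer rotation periods described above.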

Escalation Policy Design

Escalation policy defines what happens when the expected response does not occur within the defined time window.

| Level | Role | Trigger | Time from Alert |
|---|---|---|---|
| Level 1 | On-call engineer (primary) | Auto-paged on P1 alert | 0 minutes |
| Level 2 | Secondary on-call / team lead | No ack from Level 1 within 10 min | +10 minutes |
| Level 3 | Engineering manager | No ack from Level 2 within 20 min; OR P0 declared | +30 minutes |
| Level 4 | VP Engineering / Incident Commander | P0 with customer impact; declared by manager | Situational |
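
To make the timing in the escalation table concrete, the sketch below computes which levels should have been paged for an alert that nobody has acknowledged yet. The names LEVEL_OFFSETS and levels_paged are hypothetical, and Level 4 is left out because the table marks it as situational rather than time-based.

```python
# Minutes after the alert at which each level is paged if nobody has
# acknowledged. Values are taken from the escalation table; Level 4 is
# situational and therefore not modelled here.
LEVEL_OFFSETS = {1: 0, 2: 10, 3: 30}


def levels_paged(minutes_since_alert: float, p0_declared: bool = False) -> list[int]:
    """Levels that should have been paged by now, assuming no acknowledgement."""
    paged = [
        level
        for level, offset in LEVEL_OFFSETS.items()
        if minutes_since_alert >= offset
    ]
    # A declared P0 pulls in Level 3 immediately, regardless of elapsed time.
    if p0_declared and 3 not in paged:
        paged.append(3)
    return sorted(paged)
```

For instance, levels_paged(12) returns [1, 2], while levels_paged(2, p0_declared=True) returns [1, 3] because the P0 declaration bypasses the timer.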

Critical requirement: the escalation chain must be documented in the runbook, not only in the alerting tool configuration. Tool configs are invisible to the engineers being escalated to. Runbook documentation is readable by anyone who needs to escalate manually.

P0 criteria (declare when any applies):

  • Complete service unavailability (0% success rate)
  • Data loss or data corruption risk
  • Security breach or exposure
  • SLO breach imminent (error budget < 10% remaining in the window)
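
The P0 criteria above amount to a logical OR over four signals. A minimal sketch follows, assuming the signals are already available as fields; the IncidentSignals type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical snapshot of the signals the P0 criteria refer to."""
    success_rate: float            # fraction of successful requests, 0.0 to 1.0
    data_loss_risk: bool           # data loss or corruption is possible
    security_exposure: bool        # breach or exposure suspected
    error_budget_remaining: float  # fraction of the window's budget left


def is_p0(signals: IncidentSignals) -> bool:
    """Declare P0 when any single criterion applies."""
    return (
        signals.success_rate == 0.0
        or signals.data_loss_risk
        or signals.security_exposure
        or signals.error_budget_remaining < 0.10
    )
```

A degraded but reachable service with 40% of its budget left is not a P0 under this check; the same service with 5% of its budget left is.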

On-Call Health Measurement

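On-call health is measurable, not a matter of feeling. Track two primary metrics per engineer per rotation, toil percentage and MTTA, with pages per shift and the no-action page rate as supporting signals.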

Toil Percentage

toil_percentage = toil_hours / total_on_call_hours × 100

Toil is manual, repetitive, automatable work triggered by a production system. Responding to a P1 alert, restarting a service, running a manual data migration script — these are toil. They differ from engineering work because they produce no lasting improvement.

Target: toil < 50% per rotation. This adapts Google's SRE guidance, which caps toil at 50% of an engineer's working time.

Interpretation:

  • toil < 25% — healthy; engineers are doing mostly engineering work
  • toil 25-50% — acceptable; monitor for trend
  • toil > 50% — critical signal; toil is crowding out engineering work

Actionable threshold: if any engineer reports > 50% toil over 3 consecutive rotations, a reliability sprint is required. The sprint goal is to automate or eliminate the top toil sources (most frequent page types, most common manual remediation steps).
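
The toil formula and the three-rotation rule can be sketched together as follows. The function names and the shape of the history (a list of percentages, oldest first) are assumptions for illustration.

```python
def toil_percentage(toil_hours: float, total_on_call_hours: float) -> float:
    """toil_percentage = toil_hours / total_on_call_hours x 100."""
    if total_on_call_hours <= 0:
        raise ValueError("total_on_call_hours must be positive")
    return toil_hours / total_on_call_hours * 100


def needs_reliability_sprint(toil_history, threshold=50.0, consecutive=3) -> bool:
    """True if toil exceeded the threshold in `consecutive` rotations in a row.

    toil_history is one engineer's toil percentages, oldest rotation first.
    """
    streak = 0
    for pct in toil_history:
        streak = streak + 1 if pct > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

With 10 toil hours out of 40 on-call hours, toil_percentage returns 25.0. A history of [55, 40, 60, 70] does not trigger a sprint because the streak resets at 40, whereas [55, 60, 52] does.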

Mean Time to Acknowledge (MTTA)

MTTA = average(time from alert fire to engineer acknowledgement)

Target: MTTA < 5 minutes for P1 alerts.

Interpretation:

  • MTTA > 5 minutes: the on-call engineer may have missed the page; confirm that escalation to the secondary fired as configured
  • MTTA > 10 minutes: escalation policy is not working; review alerting tool configuration and rotation assignment
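
A small sketch of the MTTA calculation and the interpretation bands above, assuming each page is recorded as a (fired_at, acknowledged_at) pair of datetimes. The function names and the returned labels are illustrative, not part of any alerting tool's API.

```python
from datetime import datetime, timedelta
from statistics import mean


def mtta_minutes(pages) -> float:
    """Average minutes from alert fire to acknowledgement.

    pages is a list of (fired_at, acknowledged_at) datetime pairs.
    """
    if not pages:
        raise ValueError("no pages to average")
    return mean([(acked - fired).total_seconds() / 60 for fired, acked in pages])


def interpret_mtta(minutes: float) -> str:
    """Map an MTTA value onto the interpretation bands in this note."""
    if minutes > 10:
        return "escalation policy not working"
    if minutes > 5:
        return "page possibly missed"
    return "within target"
```

Two pages acknowledged after 2 and 6 minutes give an MTTA of 4.0 minutes, which interpret_mtta reports as "within target".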

Pages per Shift

Target: < 3 pages per on-call shift, where a shift is either an 8-hour follow-the-sun window or a 1-week single-team rotation.

When any engineer exceeds 3 pages per shift on average over a month, this is a reliability signal: the service is generating too many alerts for the team to absorb sustainably. This requires engineering investment, not more alert rules.

Alert fatigue signal: a no-action rate above 20% (pages that result in "no action taken") indicates that alert thresholds need recalibration. A high no-action rate means the alert is firing on false positives; adjust thresholds to cut the false positives or apply the multi-window pattern (see Alerting-Strategies).
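
The no-action rate is a simple ratio. The sketch below assumes each page is logged with an outcome label; the label "no_action" and the function names are hypothetical.

```python
def no_action_rate(outcomes) -> float:
    """Percentage of pages whose recorded outcome was "no_action"."""
    if not outcomes:
        raise ValueError("no pages recorded")
    return 100 * outcomes.count("no_action") / len(outcomes)


def alert_fatigue(outcomes, threshold_pct: float = 20.0) -> bool:
    """True when the no-action rate exceeds the recalibration threshold."""
    return no_action_rate(outcomes) > threshold_pct
```

For a shift with outcomes ["fixed", "no_action", "fixed", "no_action", "fixed"], the rate is 40.0 and alert_fatigue returns True.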


Runbook Discipline for On-Call

Runbooks are the on-call engineer's primary tool. On-call practices determine which runbooks exist and how they are maintained.

Automation candidate threshold: every P1 alert that fires more than twice per quarter becomes an automatic candidate for runbook automation. If an alert requires the same manual steps repeatedly, those steps should be encoded in an automated remediation script.
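
One way to surface automation candidates is to count P1 alert firings per quarter. This is a sketch assuming the input is a plain list of alert names, one entry per firing; the alert names and the function name are made up for the example.

```python
from collections import Counter


def automation_candidates(p1_alert_names, max_fires: int = 2):
    """Return alert names that fired more than max_fires times in the quarter."""
    counts = Counter(p1_alert_names)
    return sorted(name for name, fired in counts.items() if fired > max_fires)
```

Given ["disk_full", "disk_full", "disk_full", "cert_expiry"], only "disk_full" is returned, since it fired three times and "cert_expiry" fired once.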

Runbook currency policy: after every P1 or P2 incident, the on-call engineer is responsible for updating the runbook with any new diagnostic steps, changed commands, or modified escalation contacts. Stale runbooks are a reliability risk — they slow investigation and produce false confidence.

For the anatomy of a runbook (symptoms → diagnosis → remediation → escalation path), see Runbook-Design.


Mermaid Rotation Diagram

stateDiagram-v2
    [*] --> Primary: Rotation assigned
    Primary --> Acknowledging: P1 alert fires
    Acknowledging --> Investigating: ACK within 10min
    Acknowledging --> Secondary: No ACK 10min
    Secondary --> Investigating: Secondary ACKs
    Secondary --> Manager: No ACK 20min
    Investigating --> Resolved: Mitigation applied
    Resolved --> [*]: Handoff + blameless notes

On-Call Health Dashboard

Track these per engineer per rotation period (weekly cadence):

| Metric | Target | Critical Threshold |
|---|---|---|
| Toil percentage | < 50% | > 50% for 3 consecutive rotations |
| MTTA P1 (minutes) | < 5 min | > 10 min average |
| Pages per shift | < 3 | > 5 |
| No-action page rate | < 10% | > 20% |

Review this data in the weekly reliability sync. Trends matter more than single data points: a single high-toil rotation is not a signal; three consecutive high-toil rotations are.
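
For the weekly review, the single-value rows of the dashboard can be classified mechanically. In this sketch the "watch" band between target and critical is an assumption added for illustration, as are the metric keys. Toil is omitted because its critical threshold depends on three consecutive rotations rather than a single value.

```python
# (target, critical) pairs from the dashboard table; keys are hypothetical.
THRESHOLDS = {
    "mtta_p1_minutes": (5, 10),
    "pages_per_shift": (3, 5),
    "no_action_rate_pct": (10, 20),
}


def metric_status(value: float, target: float, critical: float) -> str:
    """Classify a metric where lower is better."""
    if value > critical:
        return "critical"
    if value < target:
        return "ok"
    return "watch"  # assumed label for values between target and critical


def review(metrics: dict) -> dict:
    """Return a status per dashboard metric for one engineer's rotation."""
    return {
        name: metric_status(metrics[name], target, critical)
        for name, (target, critical) in THRESHOLDS.items()
    }
```

A rotation with an MTTA of 4 minutes, 4 pages per shift, and a 25% no-action rate comes back as ok, watch, and critical respectively.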


Suitability — When Formal On-Call Is Justified

Formalise the on-call rotation when all three conditions hold:

  1. The service has an SLO with a defined error budget — on-call escalation thresholds are derived from SLO burn rate; without an SLO, the triggering criteria are undefined.
  2. The service has a minimum incident frequency that justifies rotation overhead — at least 2-3 paging incidents per month across the team.
  3. The team has sufficient size to form a sustainable rotation — a minimum of 4-5 engineers for a single-team rotation (prevents individual over-rotation); 6-9 engineers across three regions for follow-the-sun.
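
The three conditions above can be expressed as a single check. This sketch takes the lower bound of each range in the list (2 incidents per month, 4 engineers for single-team, 6 for follow-the-sun) as the cut-off, which is an interpretation rather than something the conditions state exactly; the function name is hypothetical.

```python
def formal_on_call_justified(
    has_slo: bool,
    incidents_per_month: float,
    team_size: int,
    follow_the_sun: bool = False,
) -> bool:
    """True only when all three suitability conditions hold."""
    # Assumed cut-offs: the lower end of the ranges given in the conditions.
    min_team = 6 if follow_the_sun else 4
    return has_slo and incidents_per_month >= 2 and team_size >= min_team
```

A five-person team with an SLO and three paging incidents a month qualifies for a single-team rotation, but the same team would not qualify for follow-the-sun.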

Related Notes

  • Alerting-Strategies — alert severity tiers feed the escalation policy; P1 pages trigger Level 1 response
  • Incident-Response — P1 pages trigger the incident response flow; on-call engineer becomes incident commander
  • Runbook-Design — runbook discipline is the operational infrastructure on-call depends on