Distributed Tracing

Distributed tracing tracks a single request as it travels through multiple services in a distributed system. Each service adds timing and metadata, producing a complete picture of latency and failures across service boundaries — something traditional logging and metrics alone cannot provide.

Core Concepts

Trace

A trace represents the full journey of one request. It is identified by a globally unique trace ID (128-bit, represented as a 32-character hex string). All spans belonging to the same request share this trace ID.

Span

A span represents one unit of work within a trace — typically one service's handling of the request, or one function call. Each span has:

  • A unique span ID (64-bit / 16-character hex)
  • A parent span ID (the span that caused this work)
  • A start timestamp and duration
  • A name (operation name)
  • Tags / attributes (key-value metadata)
  • Events (timestamped log entries within the span)

A root span has no parent. The root span of a trace typically starts in the client (browser, mobile app) or at the entry point gateway.

Span Hierarchy

Trace: 4bf92f3577b34da6a3ce929d0e0e4736
│
├─ [root] Angular HTTP request (spanId: a1b2)
│   │
│   └─ [child] BFF Gateway receive (spanId: c3d4, parent: a1b2)
│       │
│       ├─ [child] BFF → User Service (spanId: e5f6, parent: c3d4)
│       │     └─ [child] User Service DB query (spanId: g7h8, parent: e5f6)
│       │
│       └─ [child] BFF → Order Service (spanId: i9j0, parent: c3d4)
│             └─ [child] Order Service DB query (spanId: k1l2, parent: i9j0)
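The hierarchy above is not stored as a tree: a backend receives a flat list of spans and rebuilds the tree from the parent IDs. A minimal sketch of that reconstruction follows, using only the JDK. The `Span` record and the method names are illustrative assumptions, and the short IDs are the diagram's abbreviations rather than valid 16-character span IDs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SpanTreeExample {

    /** Minimal span model (hypothetical): only the fields needed to rebuild the hierarchy. */
    record Span(String spanId, String parentSpanId, String name) {
        boolean isRoot() {
            return parentSpanId == null;
        }
    }

    /** Groups non-root spans by their parent span ID, preserving arrival order. */
    static Map<String, List<Span>> childrenByParent(List<Span> spans) {
        Map<String, List<Span>> index = new LinkedHashMap<>();
        for (Span s : spans) {
            if (!s.isRoot()) {
                index.computeIfAbsent(s.parentSpanId(), k -> new ArrayList<>()).add(s);
            }
        }
        return index;
    }

    /** Renders the trace depth-first, two spaces of indentation per level. */
    static String render(List<Span> spans) {
        Map<String, List<Span>> index = childrenByParent(spans);
        StringBuilder out = new StringBuilder();
        for (Span s : spans) {
            if (s.isRoot()) {
                append(out, s, index, 0);
            }
        }
        return out.toString();
    }

    private static void append(StringBuilder out, Span s, Map<String, List<Span>> index, int depth) {
        out.append("  ".repeat(depth)).append(s.name()).append(" (").append(s.spanId()).append(")\n");
        for (Span child : index.getOrDefault(s.spanId(), List.of())) {
            append(out, child, index, depth + 1);
        }
    }

    /** The spans from the diagram, deliberately listed out of tree order. */
    static List<Span> sampleTrace() {
        return List.of(
                new Span("k1l2", "i9j0", "Order Service DB query"),
                new Span("a1b2", null, "Angular HTTP request"),
                new Span("c3d4", "a1b2", "BFF Gateway receive"),
                new Span("e5f6", "c3d4", "BFF -> User Service"),
                new Span("g7h8", "e5f6", "User Service DB query"),
                new Span("i9j0", "c3d4", "BFF -> Order Service"));
    }

    public static void main(String[] args) {
        System.out.print(render(sampleTrace()));
    }
}
```

Because only parent IDs link the spans, they can arrive in any order and from any service; the tree is recoverable as long as every span carries the shared trace ID and its parent's span ID.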

Sampling

Tracing every request at 100% is expensive. Production systems typically sample a percentage:

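  • Head-based sampling: the decision is made when the root span starts and is propagated downstream, so each trace is kept or dropped as a whole
  • Tail-based sampling: the decision is made after the trace completes, so it can depend on the outcome (e.g., always keep traces that contain errors)
  • Probabilistic sampling: the usual head-based strategy, keeping a fixed fraction of traces (e.g., probability: 0.1 = 10% of requests traced)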
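The difference between the two decision points can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the algorithm of any particular SDK: `headSample` derives a deterministic verdict from the low 64 bits of the trace ID, and `tailSample` inspects a hypothetical `FinishedSpan` summary once the trace is complete.

```java
import java.util.List;

final class SamplingExample {

    /** Head-based, probabilistic: decide once from the trace ID so every service agrees. */
    static boolean headSample(String traceIdHex, double probability) {
        if (probability <= 0.0) {
            return false;
        }
        if (probability >= 1.0) {
            return true;
        }
        // Treat the low 64 bits of the 128-bit trace ID as a uniform pseudo-random value.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long nonNegative = low >>> 1; // keep 63 bits so the value is never negative
        long threshold = (long) (probability * Long.MAX_VALUE);
        return nonNegative < threshold;
    }

    /** Hypothetical summary of a finished span, used only for the tail-based decision. */
    record FinishedSpan(String name, long durationMillis, boolean error) {}

    /** Tail-based: decide after the trace completes; keep traces with errors or slow spans. */
    static boolean tailSample(List<FinishedSpan> spans, long slowThresholdMillis) {
        for (FinishedSpan s : spans) {
            if (s.error() || s.durationMillis() >= slowThresholdMillis) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("head @0.1: " + headSample(traceId, 0.1));
        List<FinishedSpan> trace = List.of(
                new FinishedSpan("BFF Gateway receive", 42, false),
                new FinishedSpan("Order Service DB query", 12, true));
        System.out.println("tail: " + tailSample(trace, 500));
    }
}
```

Deriving the head-based verdict from the trace ID, rather than from a fresh random number, means any service can recompute the same answer without coordination. Tail-based sampling needs the whole trace buffered somewhere before it can decide, which is why it is typically done in a collector rather than in the application.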

W3C Trace Context Standard

The W3C Trace Context specification defines interoperable HTTP headers:

traceparent Header

Format: <version>-<traceId>-<parentSpanId>-<traceFlags>

Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

  • 00 = version
  • 4bf92f3577b34da6a3ce929d0e0e4736 = 128-bit trace ID (hex)
  • 00f067aa0ba902b7 = 64-bit parent span ID (hex)
  • 01 = flags (01 = sampled, 00 = not sampled)
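The field layout above maps directly onto a small parser. The following is a minimal sketch for version 00 headers only; the `TraceParent` record and `parse` method are illustrative names, not a library API. It also applies the specification's validity rules that the all-zero trace ID, the all-zero parent ID, and version ff are invalid.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceparentExample {

    // Version-00 layout: four lowercase-hex fields separated by '-'.
    private static final Pattern FORMAT =
            Pattern.compile("([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** Parsed header fields (hypothetical type for this sketch). */
    record TraceParent(String version, String traceId, String parentSpanId, int flags) {
        boolean sampled() {
            return (flags & 0x01) != 0; // the lowest bit is the "sampled" flag
        }
    }

    static TraceParent parse(String header) {
        if (header == null) {
            throw new IllegalArgumentException("traceparent header is missing");
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        String version = m.group(1);
        String traceId = m.group(2);
        String parentSpanId = m.group(3);
        if (version.equals("ff")) {
            throw new IllegalArgumentException("Version ff is invalid");
        }
        if (isAllZero(traceId) || isAllZero(parentSpanId)) {
            throw new IllegalArgumentException("All-zero IDs are invalid: " + header);
        }
        return new TraceParent(version, traceId, parentSpanId, Integer.parseInt(m.group(4), 16));
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }

    public static void main(String[] args) {
        TraceParent tp = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(tp.traceId() + " sampled=" + tp.sampled());
    }
}
```

A receiver that cannot parse the header is expected to ignore it and start a new trace rather than fail the request, so in a real service the exceptions here would be caught at the edge.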

tracestate Header

Vendor-specific additional state, comma-separated key=value pairs. Often empty.

Example: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7
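Reading the header into an ordered map is enough for most uses. The sketch below is a simplified illustration: it splits on commas and on the first "=", keeps the first occurrence of a duplicate key, and skips malformed members, but it does not enforce the specification's key and value character rules or its size limits. The class and method names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TracestateExample {

    /** Parses a tracestate header into key/value pairs, preserving their order. */
    static Map<String, String> parse(String header) {
        Map<String, String> entries = new LinkedHashMap<>();
        if (header == null || header.isBlank()) {
            return entries; // the header is optional and often absent
        }
        for (String member : header.split(",")) {
            String trimmed = member.trim();
            int eq = trimmed.indexOf('=');
            if (trimmed.isEmpty() || eq <= 0) {
                continue; // skip empty members and members without a key
            }
            // Keep the first occurrence if a key is repeated.
            entries.putIfAbsent(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(parse("congo=t61rcWkgMzE,rojo=00f067aa0ba902b7"));
    }
}
```

Order is preserved because the specification gives it meaning: the leftmost entry belongs to the system that touched the trace most recently.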

OpenTelemetry

OpenTelemetry (OTel) is the CNCF standard for producing, collecting, and exporting telemetry data (traces, metrics, logs). It replaces vendor-specific agents (Zipkin client, Jaeger client, etc.) with a unified API.

Key components:

  • OTel SDK: instruments code and manages trace context
  • OTel Collector: receives, processes, and exports telemetry to backends
  • Exporters: send data to Zipkin, Jaeger, Tempo, Datadog, etc.

The W3C traceparent header is the default context propagation format in OTel.

Backends

Backend        | Protocol              | License      | Notes
Zipkin         | HTTP /api/v2/spans    | Apache 2.0   | Simple, easy to run locally
Jaeger         | OTLP / UDP (legacy)   | Apache 2.0   | CNCF project, richer UI
Grafana Tempo  | OTLP                  | AGPLv3       | Integrates with Grafana stack
Datadog        | OTLP                  | Proprietary  | SaaS; full observability platform

Distributed Tracing in Spring Boot 3.x

Spring Cloud Sleuth is discontinued. It was the tracing solution for Spring Boot 2.x and is not compatible with Spring Boot 3.x; its core was moved into Micrometer Tracing.

Spring Boot 3.x uses Micrometer Tracing — a vendor-neutral tracing facade (analogous to SLF4J for logging). See Micrometer for the implementation details.

Key auto-configurations provided by spring-boot-starter-actuator + micrometer-tracing-bridge-otel:

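  • Incoming HTTP requests get a server span automatically in both MVC and WebFlux: a root span if no traceparent header arrives, otherwise a child of the caller's span
  • WebClient calls become child spans automatically when the client is built from the auto-configured WebClient.Builder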
  • Trace and span IDs injected into log output (via MDC / Reactor context)
  • W3C traceparent header read on ingress and written on egress

Trace Context in Reactive Pipelines

WebFlux (Project Reactor) does not use thread-per-request, so ThreadLocal-based context propagation (including MDC) does not work reliably across operator boundaries.

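Micrometer Tracing solves this by storing the active span in the Reactor Context object, which travels with the reactive chain regardless of which thread executes each operator. With micrometer-tracing-bridge-otel on the classpath, the instrumented WebFlux server and WebClient do this automatically. Restoring the span into ThreadLocals (and therefore MDC) inside operators additionally requires Reactor's automatic context propagation: Hooks.enableAutomaticContextPropagation(), or spring.reactor.context-propagation=auto from Spring Boot 3.2.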
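The ThreadLocal problem described above can be reproduced with the JDK alone, without Reactor. In this sketch a plain executor stands in for a scheduler hop between operators. `CURRENT_TRACE_ID` and both method names are hypothetical, and the explicit hand-off only illustrates the idea of carrying context with the work, not how Reactor Context is implemented.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadLocalLossExample {

    /** Stand-in for MDC or any other ThreadLocal-based trace holder. */
    static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

    /** Reads the ThreadLocal on the worker thread: the caller's value is not visible there. */
    static String readOnWorkerImplicit(ExecutorService worker) {
        Callable<String> task = () -> CURRENT_TRACE_ID.get();
        return await(worker.submit(task));
    }

    /** Captures the value on the calling thread and hands it to the task explicitly. */
    static String readOnWorkerExplicit(ExecutorService worker) {
        String captured = CURRENT_TRACE_ID.get();
        Callable<String> task = () -> captured;
        return await(worker.submit(task));
    }

    private static String await(Future<String> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            CURRENT_TRACE_ID.set("4bf92f3577b34da6a3ce929d0e0e4736");
            System.out.println("implicit: " + readOnWorkerImplicit(worker));
            System.out.println("explicit: " + readOnWorkerExplicit(worker));
        } finally {
            CURRENT_TRACE_ID.remove();
            worker.shutdown();
        }
    }
}
```

The implicit read returns null because the value never left the calling thread. Reactor Context applies the explicit approach systematically: the context is attached to the subscription, so it reaches every operator regardless of the executing thread.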

See P5-BFF-Observability-Testing for the MdcContextLifter pattern that propagates Reactor context values into MDC for structured logging.

Correlation ID vs. Trace ID

Concept          | Trace ID                     | Correlation ID
Managed by       | Tracing infrastructure       | Application code
Format           | 128-bit hex (W3C standard)   | UUID or custom string
Purpose          | Distributed trace linkage    | Request linkage in logs
Visibility       | Tracing UI (Zipkin/Jaeger)   | Application logs
Required header  | traceparent                  | X-Correlation-ID (custom)

In a BFF, both are useful: trace ID for tracing tools, correlation ID for log grep/search.
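As a rough illustration of carrying both identifiers, the sketch below reuses an incoming X-Correlation-ID or generates a UUID when none is present, and reads the trace ID out of a traceparent value. The helper names are assumptions, and the plain `Map` lookup is a simplification: real HTTP header names are case-insensitive.

```java
import java.util.Map;
import java.util.UUID;

final class CorrelationIdExample {

    static final String CORRELATION_HEADER = "X-Correlation-ID";

    /** Application-managed: reuse the caller's correlation ID, or mint one at the edge. */
    static String resolveCorrelationId(Map<String, String> headers) {
        String incoming = headers.get(CORRELATION_HEADER);
        if (incoming == null || incoming.isBlank()) {
            return UUID.randomUUID().toString();
        }
        return incoming;
    }

    /** Infrastructure-managed: the trace ID is the second field of a traceparent value. */
    static String traceIdOf(String traceparent) {
        return traceparent.split("-")[1];
    }

    public static void main(String[] args) {
        Map<String, String> headers = Map.of(CORRELATION_HEADER, "order-checkout-7731");
        String traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
        System.out.println("correlationId=" + resolveCorrelationId(headers)
                + " traceId=" + traceIdOf(traceparent));
    }
}
```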

BFF-Specific Tracing Concerns

The BFF-Pattern sits at the boundary between the frontend and all backend microservices. This makes it a critical tracing point:

  1. Receives traceparent from Angular (or generates root span if absent)
  2. Creates child span for its own processing
  3. Forwards updated traceparent to each downstream service call
  4. Issues downstream calls in parallel when aggregating; they appear as sibling spans under one BFF span
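The four steps can be sketched without any tracing library, to show which IDs change at each hop. This is a simplified model, not what Micrometer Tracing does internally: `SpanContext`, `startBffSpan`, and `startChildSpan` are hypothetical names, header validation is reduced to a field count, and a new root trace is marked as sampled where a real system would consult its sampler.

```java
import java.security.SecureRandom;

final class BffPropagationExample {

    private static final SecureRandom RANDOM = new SecureRandom();

    /** Identity of one span plus the link to its parent (hypothetical type). */
    record SpanContext(String traceId, String spanId, String parentSpanId, boolean sampled) {
        /** Header value to send downstream: this span becomes the callee's parent. */
        String toTraceparent() {
            return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
        }
    }

    static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (byte b : buf) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Steps 1 and 2: continue the caller's trace, or start a root span if no header arrived. */
    static SpanContext startBffSpan(String incomingTraceparent) {
        if (incomingTraceparent == null || incomingTraceparent.isBlank()) {
            // Assumption: new traces are always sampled in this sketch.
            return new SpanContext(randomHex(16), randomHex(8), null, true);
        }
        String[] parts = incomingTraceparent.trim().split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Malformed traceparent: " + incomingTraceparent);
        }
        boolean sampled = (Integer.parseInt(parts[3], 16) & 0x01) != 0;
        return new SpanContext(parts[1], randomHex(8), parts[2], sampled);
    }

    /** Steps 3 and 4: each downstream call gets its own span under the same BFF span. */
    static SpanContext startChildSpan(SpanContext parent) {
        return new SpanContext(parent.traceId(), randomHex(8), parent.spanId(), parent.sampled());
    }

    public static void main(String[] args) {
        SpanContext bff = startBffSpan("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        SpanContext userCall = startChildSpan(bff);
        SpanContext orderCall = startChildSpan(bff);
        System.out.println("to User Service:  " + userCall.toTraceparent());
        System.out.println("to Order Service: " + orderCall.toTraceparent());
    }
}
```

The trace ID and sampled flag pass through unchanged, while the span ID field of each outgoing header is replaced. Both downstream spans name the same BFF span as parent, which is what makes them render as siblings in the tracing UI.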

See P5-BFF-Observability-Testing — OBS-01 and OBS-02 for complete implementation.

Related Notes

  • Micrometer — Spring Boot 3.x tracing and metrics facade
  • Spring-Cloud-Gateway — the reactive gateway layer that auto-propagates trace context
  • Project-Reactor — reactive runtime; requires special context propagation handling
  • BFF-Pattern — the architectural pattern being instrumented
  • Distributed-Tracing-Patterns — covers pattern-level sampling decisions (head-based, tail-based, probabilistic) and async context propagation failure modes; this note covers OTel SDK implementation — read both
  • Structured-Logging — trace context (traceId, spanId) must be injected into structured log fields via MDC for log-trace correlation; Structured-Logging defines the mandatory JSON field schema
  • Metrics-and-Dashboards — trace data feeds RED metrics (request rate, error rate, duration); Metrics-and-Dashboards covers the measurement frameworks that consume trace-derived data