Distributed Tracing
Distributed tracing tracks a single request as it travels through multiple services in a distributed system. Each service adds timing and metadata, producing a complete picture of latency and failures across service boundaries — something traditional logging and metrics alone cannot provide.
Core Concepts
Trace
A trace represents the full journey of one request. It is identified by a globally unique trace ID (128-bit, represented as a 32-character hex string). All spans belonging to the same request share this trace ID.
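As a rough illustration of the ID format described above, the sketch below generates a random 128-bit value and renders it as 32 lowercase hex characters. The class name `TraceIds` is hypothetical; real tracers ship their own generators.

```java
import java.security.SecureRandom;

/** Hypothetical helper, not part of any tracing library. */
final class TraceIds {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns a random 128-bit trace ID as 32 lowercase hex characters. */
    static String newTraceId() {
        long high;
        long low;
        do {
            high = RANDOM.nextLong();
            low = RANDOM.nextLong();
        } while (high == 0 && low == 0); // W3C Trace Context treats the all-zero ID as invalid
        // %016x prints each long as 16 zero-padded hex digits (negative values as two's complement)
        return String.format("%016x%016x", high, low);
    }
}
```

Calling `TraceIds.newTraceId()` once at the root span and reusing the result for every child span is what ties the spans of one request together.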
Span
A span represents one unit of work within a trace — typically one service's handling of the request, or one function call. Each span has:
- A unique span ID (64-bit / 16-character hex)
- A parent span ID (the span that caused this work)
- A start timestamp and duration
- A name (operation name)
- Tags / attributes (key-value metadata)
- Events (timestamped log entries within the span)
A root span has no parent. The root span of a trace typically starts in the client (browser, mobile app) or at the entry point gateway.
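The fields listed above map naturally onto a small data type. The following is a minimal sketch only: `SpanData`, `SpanEvent`, and their field names are assumptions made for illustration and do not mirror any SDK's actual classes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;

/** A timestamped log entry recorded inside a span (illustrative type). */
record SpanEvent(Instant timestamp, String message) {}

/** Illustrative span model; not a real SDK type. */
record SpanData(
        String traceId,                 // shared by every span in the trace
        String spanId,                  // unique per span
        String parentSpanId,            // null for the root span
        String name,                    // operation name
        Instant start,
        Duration duration,
        Map<String, String> attributes, // tags / key-value metadata
        List<SpanEvent> events) {

    boolean isRoot() {
        return parentSpanId == null;
    }

    /** Creates a child span in the same trace, parented to this span. */
    SpanData child(String childSpanId, String childName, Instant childStart, Duration childDuration) {
        return new SpanData(traceId, childSpanId, spanId, childName,
                childStart, childDuration, Map.of(), List.of());
    }
}
```

The only link between a child and its parent is the `parentSpanId` value, which is why spans can be reported independently by different services and still be assembled into one trace later.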
Span Hierarchy
Trace: 4bf92f3577b34da6a3ce929d0e0e4736
│
├─ [root] Angular HTTP request (spanId: a1b2)
│   │
│   └─ [child] BFF Gateway receive (spanId: c3d4, parent: a1b2)
│       │
│       ├─ [child] BFF → User Service (spanId: e5f6, parent: c3d4)
│       │   └─ [child] User Service DB query (spanId: g7h8, parent: e5f6)
│       │
│       └─ [child] BFF → Order Service (spanId: i9j0, parent: c3d4)
│           └─ [child] Order Service DB query (spanId: k1l2, parent: i9j0)
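Spans usually reach a backend as a flat, unordered list, and the hierarchy is rebuilt from the parent IDs. The sketch below shows one way that reconstruction could work; `SpanTree` and `Node` are hypothetical names, and the code assumes a single trace with no cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that rebuilds a span hierarchy from parent links. */
final class SpanTree {
    /** Minimal span view: only the fields needed to rebuild the hierarchy. */
    record Node(String spanId, String parentSpanId, String name) {}

    /** Renders spans as an indented tree, one line per span. */
    static List<String> render(List<Node> spans) {
        // Group children under their parent ID; root spans are stored under "".
        Map<String, List<Node>> childrenOf = new LinkedHashMap<>();
        for (Node span : spans) {
            String parent = span.parentSpanId() == null ? "" : span.parentSpanId();
            childrenOf.computeIfAbsent(parent, key -> new ArrayList<>()).add(span);
        }
        List<String> lines = new ArrayList<>();
        walk("", 0, childrenOf, lines);
        return lines;
    }

    // Depth-first walk: emit a span, then its children one level deeper.
    private static void walk(String parentId, int depth,
                             Map<String, List<Node>> childrenOf, List<String> out) {
        for (Node span : childrenOf.getOrDefault(parentId, List.of())) {
            out.add("  ".repeat(depth) + span.name() + " (" + span.spanId() + ")");
            walk(span.spanId(), depth + 1, childrenOf, out);
        }
    }
}
```

Feeding it the spans from the diagram in any order yields the same nesting, because only the `parentSpanId` links determine the shape.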
Sampling
Tracing every request at 100% is expensive. Production systems typically sample a percentage:
- Head-based sampling: the decision is made at the root span and applies to the whole trace (all-or-nothing)
- Tail-based sampling: the decision is made after the trace completes (e.g., always keep traces that contain errors)
- Probabilistic sampling: probability: 0.1 means 10% of requests are traced
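The difference between the strategies can be sketched in a few lines. This is an illustration under stated assumptions, not the algorithm of any particular SDK: the hypothetical `Samplers` class derives the head-based decision from the trace ID itself, so every service that sees the same trace ID reaches the same verdict.

```java
/** Hypothetical sampling helpers for illustration only. */
final class Samplers {
    /**
     * Head-based probabilistic decision: maps the last 8 hex characters of the
     * trace ID onto the range [0, 2^32) and keeps the trace if the value falls
     * below probability * 2^32. Assumes a valid 32-character hex trace ID.
     */
    static boolean headSample(String traceId, double probability) {
        long bucket = Long.parseLong(traceId.substring(traceId.length() - 8), 16);
        return bucket < probability * 0x1_0000_0000L;
    }

    /**
     * Tail-based decision: made once the trace is complete, so it can use
     * information a head-based sampler does not have, such as whether any span failed.
     */
    static boolean tailSample(String traceId, boolean anySpanFailed, double probability) {
        return anySpanFailed || headSample(traceId, probability);
    }
}
```

With `probability` set to `0.1`, roughly one in ten randomly generated trace IDs passes `headSample`, while `tailSample` additionally keeps every trace that recorded a failure.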
W3C Trace Context Standard
The W3C Trace Context specification defines interoperable HTTP headers:
traceparent Header
Format: <version>-<traceId>-<parentSpanId>-<traceFlags>
Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
- 00 = version
- 4bf92f3577b34da6a3ce929d0e0e4736 = 128-bit trace ID (hex)
- 00f067aa0ba902b7 = 64-bit parent span ID (hex)
- 01 = flags (01 = sampled, 00 = not sampled)
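To make the field layout concrete, here is a minimal parsing sketch for the version 00 format. The `TraceParent` record is a hypothetical type written for this note; instrumentation libraries perform this parsing internally.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value type for a parsed traceparent header (version 00 layout only). */
record TraceParent(String version, String traceId, String parentSpanId, String flags) {
    // <version>-<traceId>-<parentSpanId>-<traceFlags>, all lowercase hex
    private static final Pattern FORMAT =
            Pattern.compile("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    /** Returns the parsed header, or empty if it is missing or malformed. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        TraceParent parsed = new TraceParent(m.group(1), m.group(2), m.group(3), m.group(4));
        // The spec rejects all-zero IDs and the reserved version "ff".
        boolean valid = !parsed.traceId.equals("0".repeat(32))
                && !parsed.parentSpanId.equals("0".repeat(16))
                && !parsed.version.equals("ff");
        return valid ? Optional.of(parsed) : Optional.empty();
    }

    /** The lowest bit of the flags byte is the "sampled" flag. */
    boolean sampled() {
        return (Integer.parseInt(flags, 16) & 0x01) != 0;
    }
}
```

Parsing the example header above yields the trace ID `4bf92f3577b34da6a3ce929d0e0e4736`, the parent span ID `00f067aa0ba902b7`, and `sampled()` returning `true`.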
tracestate Header
Vendor-specific additional state, comma-separated key=value pairs. Often empty.
Example: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7
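A simplified reading of that header could look like the sketch below. `TraceState` is a hypothetical helper; it keeps the entries in header order and skips malformed members, but it does not enforce the character and length limits that the specification defines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified tracestate parser for illustration. */
final class TraceState {
    /** Splits a tracestate header into ordered key/value entries. */
    static Map<String, String> parse(String header) {
        Map<String, String> entries = new LinkedHashMap<>();
        if (header == null || header.isBlank()) {
            return entries;
        }
        for (String member : header.split(",")) {
            int eq = member.indexOf('=');
            if (eq <= 0) {
                continue; // no key or no '=': skip the malformed member
            }
            // putIfAbsent keeps the first occurrence if a key is repeated
            entries.putIfAbsent(member.substring(0, eq).trim(), member.substring(eq + 1).trim());
        }
        return entries;
    }
}
```

For the example above, the result maps `congo` to `t61rcWkgMzE` and `rojo` to `00f067aa0ba902b7`, in that order.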
OpenTelemetry
OpenTelemetry (OTel) is the CNCF standard for producing, collecting, and exporting telemetry data (traces, metrics, logs). It replaces vendor-specific agents (Zipkin client, Jaeger client, etc.) with a unified API.
Key components:
- OTel SDK: instruments code and manages trace context
- OTel Collector: receives, processes, and exports telemetry to backends
- Exporters: send data to Zipkin, Jaeger, Tempo, Datadog, etc.
The W3C traceparent header is the default context propagation format in OTel.
Backends
| Backend | Protocol | License | Notes |
|---|---|---|---|
| Zipkin | HTTP (POST /api/v2/spans) | Apache 2.0 | Simple, easy to run locally |
| Jaeger | OTLP / UDP (legacy) | Apache 2.0 | CNCF project, richer UI |
| Grafana Tempo | OTLP | AGPLv3 | Integrates with Grafana stack |
| Datadog | OTLP | Proprietary | SaaS; full observability platform |
Distributed Tracing in Spring Boot 3.x
Spring Cloud Sleuth is dead. It was the tracing solution for Spring Boot 2.x and is not compatible with Spring Boot 3.x.
Spring Boot 3.x uses Micrometer Tracing — a vendor-neutral tracing facade (analogous to SLF4J for logging). See Micrometer for the implementation details.
Key auto-configurations provided by spring-boot-starter-actuator + micrometer-tracing-bridge-otel:
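- Incoming HTTP requests get a server span automatically, which is the root span when no traceparent header arrives (both MVC and WebFlux)
- WebClient calls become child spans automatically when the client is built from the auto-configured WebClient.Builder
- Trace and span IDs injected into log output (via MDC / Reactor context)
- W3C traceparent header read on ingress and written on egress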
Trace Context in Reactive Pipelines
WebFlux (Project Reactor) does not use thread-per-request, so ThreadLocal-based context propagation (including MDC) does not work reliably across operator boundaries.
Micrometer Tracing solves this by storing the active span in the Reactor Context object, which travels with the reactive chain regardless of which thread executes each operator. The micrometer-tracing-bridge-otel bridge handles this automatically.
See P5-BFF-Observability-Testing for the MdcContextLifter pattern that propagates Reactor context values into MDC for structured logging.
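The underlying problem can be reproduced with the plain JDK, without Reactor. The sketch below is an analogy only and does not use Reactor's Context or Micrometer: the hypothetical `ContextHandoffDemo` class sets a ThreadLocal on the calling thread, then shows that a task running on another thread cannot see it unless the value is captured and carried along with the work.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical JDK-only demo of why ThreadLocal state is lost across thread hops. */
final class ContextHandoffDemo {
    static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

    /**
     * Runs two tasks on a worker thread and returns what each one sees:
     * [0] reads the ThreadLocal directly, [1] uses a value captured up front.
     */
    static String[] observe(String traceId) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CURRENT_TRACE_ID.set(traceId);
        try {
            // The worker thread has its own, empty ThreadLocal slot.
            String fromThreadLocal =
                    CompletableFuture.supplyAsync(() -> CURRENT_TRACE_ID.get(), worker).join();

            // Capturing the value and carrying it with the task survives the hop.
            String captured = CURRENT_TRACE_ID.get();
            String fromCarriedContext =
                    CompletableFuture.supplyAsync(() -> captured, worker).join();

            return new String[] {fromThreadLocal, fromCarriedContext};
        } finally {
            CURRENT_TRACE_ID.remove();
            worker.shutdown();
        }
    }
}
```

`observe("4bf92f3577b34da6a3ce929d0e0e4736")` returns `null` in the first slot and the trace ID in the second. Carrying context with the work, rather than with the thread, is the same idea the Reactor Context applies to a reactive chain.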
Correlation ID vs. Trace ID
| Concept | Trace ID | Correlation ID |
|---|---|---|
| Managed by | Tracing infrastructure | Application code |
| Format | 128-bit hex (W3C standard) | UUID or custom string |
| Purpose | Distributed trace linkage | Request linkage in logs |
| Visibility | Tracing UI (Zipkin/Jaeger) | Application logs |
| Required header | traceparent | X-Correlation-ID (custom) |
In a BFF, both are useful: trace ID for tracing tools, correlation ID for log grep/search.
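Because the correlation ID is managed by application code, the application has to decide what happens when the header is missing. One possible policy is sketched below: reuse the caller's value if present, otherwise mint a UUID. The `CorrelationIds` class and the plain `Map` of headers are assumptions for illustration; a real BFF would read the header from its web framework's request object.

```java
import java.util.Map;
import java.util.UUID;

/** Hypothetical helper for resolving an application-level correlation ID. */
final class CorrelationIds {
    static final String HEADER = "X-Correlation-ID";

    /** Reuses the caller's correlation ID when present, otherwise mints a UUID. */
    static String resolve(Map<String, String> headers) {
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            // HTTP header names are case-insensitive.
            boolean matches = HEADER.equalsIgnoreCase(entry.getKey());
            if (matches && entry.getValue() != null && !entry.getValue().isBlank()) {
                return entry.getValue().trim();
            }
        }
        return UUID.randomUUID().toString();
    }
}
```

The resolved value would then be written to the logging context and forwarded on downstream calls, independently of whatever the tracing infrastructure does with traceparent.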
BFF-Specific Tracing Concerns
The BFF-Pattern sits at the boundary between the frontend and all backend microservices. This makes it a critical tracing point:
- Receives traceparent from Angular (or generates root span if absent)
- Creates child span for its own processing
- Forwards updated traceparent to each downstream service call
- Aggregation spans show parallel downstream calls as sibling spans under one BFF span
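The header handling in the list above can be sketched by hand to show what the instrumentation does on the BFF's behalf. `BffPropagation` is a hypothetical class, and marking a freshly started trace as sampled (`01`) is an assumption; a real setup would consult the configured sampler.

```java
import java.security.SecureRandom;

/** Hypothetical sketch of traceparent handling at a BFF boundary. */
final class BffPropagation {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns the traceparent for the BFF's own span: joins the incoming trace or starts a new one. */
    static String bffSpan(String incoming) {
        String[] parts = incoming == null ? new String[0] : incoming.split("-");
        boolean usable = parts.length == 4
                && parts[1].matches("[0-9a-f]{32}")
                && parts[2].matches("[0-9a-f]{16}")
                && parts[3].matches("[0-9a-f]{2}");
        String traceId = usable ? parts[1] : newSpanId() + newSpanId(); // new 128-bit trace ID
        String flags = usable ? parts[3] : "01"; // assumption: new traces are sampled
        return "00-" + traceId + "-" + newSpanId() + "-" + flags;
    }

    /** Returns the header for one downstream call: same trace ID and flags, fresh span ID. */
    static String downstream(String bffTraceparent) {
        String[] parts = bffTraceparent.split("-");
        return parts[0] + "-" + parts[1] + "-" + newSpanId() + "-" + parts[3];
    }

    // Random non-zero 64-bit span ID as 16 lowercase hex characters.
    private static String newSpanId() {
        long value;
        do {
            value = RANDOM.nextLong();
        } while (value == 0);
        return String.format("%016x", value);
    }
}
```

Calling `downstream` twice for the same BFF span produces two headers with the same trace ID but different span IDs, which is what makes parallel calls to the User Service and Order Service appear as siblings in the tracing UI.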
See P5-BFF-Observability-Testing — OBS-01 and OBS-02 for complete implementation.
Related Concepts
- Micrometer — Spring Boot 3.x tracing and metrics facade
- Spring-Cloud-Gateway — the reactive gateway layer that auto-propagates trace context
- Project-Reactor — reactive runtime; requires special context propagation handling
- BFF-Pattern — the architectural pattern being instrumented
Related Observability Patterns
- Distributed-Tracing-Patterns — covers pattern-level sampling decisions (head-based, tail-based, probabilistic) and async context propagation failure modes; this note covers OTel SDK implementation — read both
- Structured-Logging — trace context (traceId, spanId) must be injected into structured log fields via MDC for log-trace correlation; Structured-Logging defines the mandatory JSON field schema
- Metrics-and-Dashboards — trace data feeds RED metrics (request rate, error rate, duration); Metrics-and-Dashboards covers the measurement frameworks that consume trace-derived data