Metrics and Dashboards

Metrics are aggregated numerical measurements of system behaviour over time. They differ from the other two observability pillars in a fundamental way: a metric discards the individual event and retains only the aggregate — a counter, a gauge value, a histogram bucket count. This makes metrics cheap to store, fast to query, and easy to alert on, but they cannot answer why a specific request failed.

  • Metrics — aggregated numbers over time windows; ideal for alerting, SLO tracking, and capacity planning
  • Logs — discrete event records with full context; ideal for post-incident investigation
  • Traces — end-to-end request journeys with span timing; ideal for latency attribution across services

Dashboards make metrics actionable by placing the right aggregations in front of the right operator at the right moment.

Scope:

This note covers measurement framework selection (RED / USE / Four Golden Signals), dashboard design principles, and application-level metric instrumentation patterns. For the Java meter API (Counter, Timer, Gauge, DistributionSummary), MeterRegistry implementation, and Spring Boot auto-configuration, see Micrometer.


When NOT to Use

Metrics alone are insufficient for:

  • Debugging a specific request failure — metrics can tell you the error rate spiked; they cannot show which request failed or what the stack trace was. Use distributed traces (Distributed-Tracing) and structured logs for that.
  • Understanding why an error occurred — you need logs and traces that carry per-request context.
  • Verifying business-logic correctness — a zero-error-rate metric does not mean the order total was calculated correctly.

Cardinality Explosion Warning

Never use high-cardinality values as label/tag values.

Label values such as user_id, request_id, email, or ip_address create one time-series per unique value. As a rule of thumb, beyond roughly 10,000 unique label combinations per metric, Prometheus becomes prone to out-of-memory kills, query timeouts, and rapid storage growth. Safe label cardinalities: HTTP method (5–10 values), status code (~20 values), route path (~50–200 values).


Measurement Frameworks

Three frameworks cover the vast majority of instrumentation needs. Each targets a different system type.

RED Method (Request-oriented Services)

Origin: Tom Wilkie / Weaveworks (2018).

| Signal | What to Measure |
|---|---|
| Rate | Requests per second (throughput) |
| Errors | Error rate: (4xx + 5xx) / total requests |
| Duration | Latency distribution: p50, p95, p99 |

Target systems: HTTP services, gRPC services, message consumers, async workers.

Key question: Is my service handling requests correctly and fast enough?

USE Method (Resource-oriented Infrastructure)

Origin: Brendan Gregg (2012).

| Signal | What to Measure |
|---|---|
| Utilization | Percentage of resource capacity in use (CPU %, memory %, disk %) |
| Saturation | Queue depth or backpressure (run queue length, connection pool waiters) |
| Errors | Hardware or driver errors |

Target systems: CPUs, memory, disks, network interfaces, connection pools, thread pools.

Key question: Is a resource becoming a bottleneck?
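As a rough illustration of the three USE signals, the sketch below derives them for a connection pool from two consecutive snapshots. The PoolSnapshot shape and its field names are assumptions made for this example and do not come from any particular driver or pool library.

```typescript
// Hypothetical pool snapshot; field names are illustrative only.
interface PoolSnapshot {
  active: number;             // connections currently checked out
  max: number;                // configured pool size
  waiting: number;            // callers queued for a connection
  acquireErrorsTotal: number; // cumulative failed acquisitions
}

interface UseSignals {
  utilization: number; // fraction of capacity in use, 0..1
  saturation: number;  // work queued beyond capacity
  errors: number;      // new errors since the previous snapshot
}

function useSignals(prev: PoolSnapshot, curr: PoolSnapshot): UseSignals {
  return {
    utilization: curr.max > 0 ? curr.active / curr.max : 0,
    saturation: curr.waiting,
    errors: Math.max(0, curr.acquireErrorsTotal - prev.acquireErrorsTotal),
  };
}

const before: PoolSnapshot = { active: 6, max: 10, waiting: 0, acquireErrorsTotal: 3 };
const now: PoolSnapshot = { active: 10, max: 10, waiting: 4, acquireErrorsTotal: 5 };
const signals = useSignals(before, now);
console.log(signals); // utilization 1, saturation 4, errors 2
```

Utilization alone would report "100% busy" here; the saturation value is what shows that callers are already queueing.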

Four Golden Signals (Google SRE)

Origin: Google SRE Book, Chapter 6.

| Signal | What to Measure |
|---|---|
| Latency | Time to serve a request; distinguish successful latency from error latency |
| Traffic | Request rate or throughput |
| Errors | Rate of failed requests (explicit HTTP 5xx, implicit wrong results, or caught exceptions) |
| Saturation | How full the service is: CPU throttling %, queue depth, memory pressure |

Target systems: user-facing services where SLO adherence matters.

Key question: Is my service meeting user-visible quality targets?

Four Golden Signals is similar to RED but elevates Saturation to a first-class signal alongside Latency/Traffic/Errors, and explicitly distinguishes error latency from success latency.

Framework Comparison Table

| Framework | Target System | Key Question | Typical Metrics |
|---|---|---|---|
| RED | HTTP/gRPC services, consumers | Handling requests correctly? | http_requests_total, http_request_duration_seconds |
| USE | Infrastructure resources | Resource becoming a bottleneck? | cpu_utilization, db_pool_active, disk_io_saturation |
| Four Golden Signals | User-facing, SLO-driven | Meeting user-visible quality targets? | latency (p99), traffic (RPS), error rate, saturation % |

Selection Flowchart

flowchart TD
    A[What are you measuring?] --> B{User-facing HTTP or gRPC service?}
    B -- Yes --> C[Use RED\nRate · Errors · Duration]
    B -- No --> D{Infrastructure resource?\nCPU / memory / disk / pool?}
    D -- Yes --> E[Use USE\nUtilization · Saturation · Errors]
    D -- No --> F{SRE / SLO-driven service?}
    F -- Yes --> G[Use Four Golden Signals\nLatency · Traffic · Errors · Saturation]
    F -- No --> H{Complex service needing both\nservice-level and resource-level view?}
    H -- Yes --> I[RED for service layer\n+ USE for resource layer]
    H -- No --> J[Start with RED —\nextend as needed]

Metric Types Reference

These are conceptual types, not API reference. For the Java API (Counter.builder, Timer.builder, Gauge), see Micrometer.

| Type | Behaviour | Use For | Notes |
|---|---|---|---|
| Counter | Monotonically increasing value; resets to zero on process restart | Requests, errors, events published | Compute rate as rate(counter[5m]) in PromQL |
| Gauge | Point-in-time snapshot, can go up or down | Queue depth, active connections, cache size | Do not use for event rates or latency distributions; use counters and histograms for those |
| Histogram | Distribution of values with configurable buckets | Latency, payload sizes | Enables server-side percentile computation (preferred) |
| Summary | Pre-computed percentiles client-side | When PromQL recording rules are unavailable | Percentiles do NOT aggregate correctly across instances |
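To make the Counter row concrete, here is a simplified sketch of how a per-second rate can be derived from raw counter samples, including the case where the counter resets after a restart. It is not the PromQL rate() implementation (which also extrapolates to the window boundaries); the Sample shape and the simpleRate name are assumptions for illustration.

```typescript
interface Sample {
  timestampSec: number; // scrape time in seconds
  value: number;        // cumulative counter value at that time
}

// Per-second increase across time-ordered samples.
// A drop between consecutive samples is treated as a counter reset,
// so the post-reset value is counted as new increase.
function simpleRate(samples: Sample[]): number {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].value;
    const curr = samples[i].value;
    increase += curr >= prev ? curr - prev : curr;
  }
  const elapsed = samples[samples.length - 1].timestampSec - samples[0].timestampSec;
  return elapsed > 0 ? increase / elapsed : NaN;
}

const scraped: Sample[] = [
  { timestampSec: 0, value: 100 },
  { timestampSec: 15, value: 160 },
  { timestampSec: 30, value: 20 }, // reset: the process restarted
  { timestampSec: 45, value: 80 },
];

// Increase is 60 + 20 + 60 = 140 over 45 seconds.
const perSecond = simpleRate(scraped);
console.log(perSecond.toFixed(2)); // "3.11"
```

This is why dashboards graph the rate of a counter and never its raw value: the raw value jumps to zero on every restart, while the rate stays meaningful.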

Decision rule: Prefer Histogram over Summary for latency. Prometheus computes percentiles server-side from histogram buckets, and histograms aggregate correctly across replicas. Summaries aggregate incorrectly — the p99 of a fleet is not the average of per-instance p99s.
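The aggregation claim can be checked with a small sketch. It estimates a quantile from cumulative buckets by linear interpolation, in the spirit of histogram_quantile, and compares the fleet p99 computed from merged buckets against the average of per-replica p99s. The bucket bounds, the counts, and the function names are all invented for illustration, and q is assumed to lie in (0, 1].

```typescript
interface Bucket {
  le: number;              // upper bound in seconds; Infinity for the +Inf bucket
  cumulativeCount: number; // observations <= le
}

// Linear interpolation inside the bucket that contains the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulativeCount;
  if (total === 0) return NaN;
  const rank = q * total;
  let lowerBound = 0;
  let countBelow = 0;
  for (const b of buckets) {
    if (b.cumulativeCount >= rank) {
      if (!Number.isFinite(b.le)) return lowerBound; // +Inf bucket: report highest finite bound
      const inBucket = b.cumulativeCount - countBelow;
      return lowerBound + (b.le - lowerBound) * ((rank - countBelow) / inBucket);
    }
    lowerBound = b.le;
    countBelow = b.cumulativeCount;
  }
  return NaN;
}

// Buckets from different replicas merge by simple addition (same bounds assumed).
function mergeBuckets(a: Bucket[], b: Bucket[]): Bucket[] {
  return a.map((bucket, i) => ({
    le: bucket.le,
    cumulativeCount: bucket.cumulativeCount + b[i].cumulativeCount,
  }));
}

// Replica A: 1000 requests, almost all fast. Replica B: 100 requests, mostly slow.
const replicaA: Bucket[] = [
  { le: 0.1, cumulativeCount: 990 },
  { le: 0.5, cumulativeCount: 1000 },
  { le: 1, cumulativeCount: 1000 },
  { le: Infinity, cumulativeCount: 1000 },
];
const replicaB: Bucket[] = [
  { le: 0.1, cumulativeCount: 10 },
  { le: 0.5, cumulativeCount: 50 },
  { le: 1, cumulativeCount: 100 },
  { le: Infinity, cumulativeCount: 100 },
];

const p99A = estimateQuantile(0.99, replicaA); // about 0.10 s
const p99B = estimateQuantile(0.99, replicaB); // about 0.99 s
const averaged = (p99A + p99B) / 2;            // about 0.545 s
const fleet = estimateQuantile(0.99, mergeBuckets(replicaA, replicaB)); // about 0.89 s
console.log({ averaged, fleet });
```

Averaging the two per-replica values understates the fleet p99 here because it ignores how many requests each replica served; summing bucket counts keeps that weighting.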


TypeScript Example — prom-client 15.x

The following snippet instruments an Express service using the RED method. All labels are low-cardinality.

import { Registry, Counter, Histogram, Gauge } from 'prom-client'; // prom-client 15.x
import { Request, Response, NextFunction } from 'express';
 
const registry = new Registry();
 
// RED: Rate + Errors (Counter) — low-cardinality labels only
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry],
});
 
// RED: Duration (Histogram) — p50/p95/p99 computed by Prometheus from buckets
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});
 
// USE: Saturation (Gauge) — connection pool pressure
const activeConnections = new Gauge({
  name: 'db_pool_active_connections',
  help: 'Active database pool connections',
  registers: [registry],
});
 
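export function metricsMiddleware(req: Request, res: Response, next: NextFunction): void {
  // Start the timer without labels: when mounted with app.use(), this middleware
  // runs before routing, so req.route is not populated yet.
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    // By 'finish' the router has matched, so req.route holds the route template.
    // Use the template, NOT req.url, to keep label cardinality bounded.
    const route: string = req.route?.path ?? 'unknown';
    httpRequests.inc({
      method: req.method,
      route,
      status_code: String(res.statusCode),
    });
    end({ method: req.method, route }); // record the duration with the final labels
  });
  next();
}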
 
// Expose /metrics endpoint for Prometheus scrape
export async function metricsHandler(_req: Request, res: Response): Promise<void> {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
}

Key decisions above:

  • req.route?.path (template like /api/orders/:id) instead of req.url (actual URL like /api/orders/abc-123) — prevents per-request cardinality explosion
  • buckets array is the prom-client default set with the 10-second bucket removed, a reasonable starting point for typical API latency; adjust the bounds to match your own latency targets

Java Example — Micrometer / Spring Boot

The following snippet instruments a Spring Boot service using the Micrometer Observation API, which produces both metrics and traces from a single @Observed annotation. For Counter.builder, Timer.builder, and Gauge API details, see Micrometer.

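import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.observation.annotation.Observed;
import org.springframework.stereotype.Service;

// Order, OrderRequest, OrderRepository, and ValidationException are application types (not shown).
@Service
public class OrderService {

    private final OrderRepository repository;
    private final Counter ordersFailed;

    public OrderService(OrderRepository repository, MeterRegistry meterRegistry) {
        this.repository = repository;
        // low-cardinality tag "reason": fixed here to the "validation" category
        this.ordersFailed = Counter.builder("orders.failed")
            .description("Orders that failed processing")
            .tag("reason", "validation")
            .register(meterRegistry);
    }

    // @Observed auto-records rate (count), duration (timer), and errors,
    // producing both a metric AND a trace span from a single instrumentation point.
    // It only takes effect when Spring AOP and an ObservedAspect bean are active.
    @Observed(name = "order.process", contextualName = "processOrder")
    public Order processOrder(OrderRequest request) {
        return repository.save(mapToOrder(request)); // mapToOrder: application mapper, not shown
    }

    public void rejectOrder(OrderRequest request, String reason) {
        ordersFailed.increment(); // explicit RED Errors counter
        // The free-form reason goes into the exception message only, never into a tag value.
        throw new ValidationException("Rejected: " + reason);
    }
}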


Dashboard Design Principles

Golden rule: Every panel must answer a specific operational question. If you cannot state the question in one sentence, the panel does not belong on the dashboard.

Layout Structure

Dashboard Layout Diagram

A spatial layout diagram for this structure is at Topics/Metrics-and-Dashboards-diagram.excalidraw.md.

Recommended three-row layout:

| Row | Purpose | Panels |
|---|---|---|
| Row 1: Service Health (RED) | Is the service handling requests correctly? | Request rate (RPS), Error rate (%), p99 latency |
| Row 2: Resource Health (USE) | Is a resource becoming a bottleneck? | CPU utilization %, Memory %, Connection pool active / waiting |
| Row 3: Business Metrics | Are business outcomes healthy? | Order rate, user signups, conversion rate |

Rationale for separation: Row 1 and Row 2 serve different audiences (on-call engineer vs capacity planner). Mixing them forces both audiences to filter noise.

Principles

  • Percentiles over averages: p99 latency reveals tail problems; the arithmetic mean hides them. One 5-second request among 99 requests of 1 ms each yields a mean of about 51 ms, which looks healthy even though one caller waited 5 seconds.
  • Error rate before error count — "0.1% error rate" is more actionable than "500 errors/min" in isolation. Always show both rate and count.
  • Time range alignment — all panels must be aligned to the same time range. Misaligned ranges create false correlations between unrelated spikes.
  • USE and RED on separate rows — infrastructure metrics and service metrics have different slopes, scales, and audiences.
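The "percentiles over averages" principle can be seen with a few lines of arithmetic. The sketch below uses a made-up sample of 100 latencies and a nearest-rank percentile; the data and the helper names are assumptions for illustration only.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical latencies in ms: 98 fast requests and 2 that took 5 seconds.
const latenciesMs: number[] = [...new Array<number>(98).fill(1), 5000, 5000];

const avg = mean(latenciesMs);           // 100.98
const p50 = percentile(latenciesMs, 50); // 1
const p99 = percentile(latenciesMs, 99); // 5000
console.log({ avg, p50, p99 });
```

The mean sits at roughly 101 ms, a value that no request actually experienced, while p50 and p99 together show the real shape: most requests are fast and the tail is 5 seconds.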

Anti-Patterns to Avoid

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Christmas tree dashboard (> 20 panels per page) | Decision paralysis; operators cannot identify the critical signal | One dashboard per audience (service health, infra health, business metrics) |
| High-cardinality labels in PromQL (e.g. sum by (user_id)) | Query timeouts; Prometheus cardinality explosion | Group by route, method, status_code only |
| Gauge panels for rates | Readers interpret a point-in-time value as a trend | Use time-series graphs for rates; reserve stat panels for current values |
| SLO burn rate mixed with noise metrics | Operators miss budget exhaustion alerts among irrelevant panels | Dedicate a separate row or dashboard to SLO error budget burn rate |

Cardinality Management

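Cardinality is the number of unique time-series a metric produces. In the worst case it equals the product of the number of unique values of each label; in practice it is the number of label-value combinations actually observed.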

Rule: Label values must be bounded and low-cardinality (fewer than 100 unique values per label, ideally fewer than 20).

| Label Type | Example Values | Safe? |
|---|---|---|
| HTTP method | GET, POST, DELETE | Yes, ~5 values |
| HTTP status code | 200, 400, 500 | Yes, ~20 values |
| Route template | /api/orders, /api/users | Yes, bounded by code |
| User ID | user-abc123, user-def456, ... | No, unbounded |
| Request ID | UUID per request | No, one time-series per request |
| Email address | Customer email | No, unbounded PII |
| IP address | Client IP | No, unbounded |
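A quick way to sanity-check a label set before shipping it is to multiply out the per-label cardinalities, which gives the worst-case series count for one metric. The sketch below does exactly that; the worstCaseSeries helper and the 100,000-user figure are assumptions for illustration.

```typescript
// Upper bound on the series count of one metric:
// the product of the number of distinct values of each label.
function worstCaseSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Bounded labels only.
const bounded = worstCaseSeries({ method: 5, status_code: 20, route: 50 });

// The same metric with a user_id label, assuming 100,000 distinct users.
const withUserId = worstCaseSeries({ method: 5, status_code: 20, route: 50, user_id: 100000 });

console.log(bounded);    // 5000
console.log(withUserId); // 500000000
```

Real series counts are usually below this bound because not every combination occurs, but a single unbounded label still multiplies whatever is already there.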

Detection: Monitor prometheus_tsdb_head_series — alert when the series count grows unexpectedly week-over-week. A sudden spike indicates a cardinality bug in newly deployed instrumentation.

Mitigation for existing high-cardinality metrics: Use metric_relabel_configs in the Prometheus scrape config to drop or hash the offending label before storage.


Metrics vs Logs vs Traces — Tradeoffs

| Dimension | Metrics | Logs | Traces |
|---|---|---|---|
| Storage cost | Low (aggregates only) | High (full event text) | Medium (sampled spans) |
| Query speed | Fast (pre-aggregated) | Slow (full-text scan) | Medium (indexed by trace ID) |
| Cardinality | Fixed at instrument time | Unlimited (structured fields) | Bounded by sampling rate |
| Per-request detail | None | Full | Full (within sample) |
| SLO/alerting | Native | Requires log-based metrics | Not directly |
| Root cause analysis | Insufficient alone | Sufficient | Sufficient |
| Best for | Alerting, capacity, SLOs | Investigation, audits | Latency attribution |

The three pillars are complementary. A full observability stack uses all three: metrics to detect, traces to locate, logs to explain.


  • Micrometer — Java meter API (Counter, Timer, Gauge), Spring Boot auto-configuration, BFF-specific metric names
  • Distributed-Tracing — request-scoped latency measurement via spans; complements metrics for per-request attribution
  • Structured-Logging — event-level observability; the third pillar that explains why metrics changed
  • SLO-SLI-SLA — metrics as the measurement basis for SLIs; error budgets derived from metric time-series
  • Alerting-Strategies — turning metrics (burn rate, threshold, anomaly) into actionable on-call alerts