Metrics and Dashboards

Metrics are aggregated numerical measurements of system behaviour over time. They differ from the other two observability pillars in a fundamental way: a metric discards the individual event and retains only the aggregate — a counter, a gauge value, a histogram bucket count. This makes metrics cheap to store, fast to query, and easy to alert on, but they cannot answer why a specific request failed.

Metrics — aggregated numbers over time windows; ideal for alerting, SLO tracking, and capacity planning
Logs — discrete event records with full context; ideal for post-incident investigation
Traces — end-to-end request journeys with span timing; ideal for latency attribution across services

Dashboards make metrics actionable by placing the right aggregations in front of the right operator at the right moment.

Scope:

This note covers measurement framework selection (RED / USE / Four Golden Signals), dashboard design principles, and application-level metric instrumentation patterns. For the Java meter API (Counter, Timer, Gauge, DistributionSummary), MeterRegistry implementation, and Spring Boot auto-configuration, see Micrometer.

When NOT to Use

Metrics alone are insufficient for:

Debugging a specific request failure — metrics can tell you the error rate spiked; they cannot show which request failed or what the stack trace was. Use distributed traces (Distributed-Tracing) and structured logs for that.
Understanding why an error occurred — you need logs and traces that carry per-request context.
Verifying business-logic correctness — a zero-error-rate metric does not mean the order total was calculated correctly.

Cardinality Explosion Warning

Never use high-cardinality values as label/tag values.

Label values such as user_id, request_id, email, or ip_address create one time-series per unique value, multiplied by every combination of the other labels. As a rule of thumb, once a single metric exceeds roughly 10,000 series, Prometheus becomes prone to out-of-memory failures, query timeouts, and rapid storage growth. Safe label cardinalities: HTTP method (5–10 values), status code (~20 values), route path (~50–200 values).

Measurement Frameworks

Three frameworks cover the vast majority of instrumentation needs. Each targets a different system type.

RED Method (Request-oriented Services)

Origin: Tom Wilkie / Weaveworks (2018).

Signal	What to Measure
Rate	Requests per second (throughput)
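Errors	Error ratio: `failed requests / total requests` (typically 5xx; count 4xx only where they indicate a fault in the service rather than in the caller)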
Duration	Latency distribution — p50, p95, p99

Target systems: HTTP services, gRPC services, message consumers, async workers.

Key question: Is my service handling requests correctly and fast enough?
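As a rough illustration of how Rate and Errors fall out of raw counters, the sketch below derives both from two counter snapshots taken a known interval apart. The `CounterSnapshot` shape and the function names are hypothetical, and the reset handling is a simplification of what a metrics backend does. Duration is not covered here because it needs a histogram rather than a counter.

```typescript
// Hypothetical shape: cumulative counters read at one point in time.
interface CounterSnapshot {
  requestsTotal: number; // all requests since process start
  errorsTotal: number;   // failed requests since process start
}

// Increase of a monotonic counter between two reads.
// If the counter went backwards, assume the process restarted and the
// counter began again from zero (a simplification of reset handling).
function counterIncrease(prev: number, curr: number): number {
  return curr >= prev ? curr - prev : curr;
}

// Derive Rate (requests/second) and Errors (ratio) over the interval.
function redFromSnapshots(
  prev: CounterSnapshot,
  curr: CounterSnapshot,
  intervalSeconds: number,
): { ratePerSecond: number; errorRatio: number } {
  const requests = counterIncrease(prev.requestsTotal, curr.requestsTotal);
  const errors = counterIncrease(prev.errorsTotal, curr.errorsTotal);
  return {
    ratePerSecond: requests / intervalSeconds,
    // Avoid dividing by zero when no requests arrived in the window.
    errorRatio: requests === 0 ? 0 : errors / requests,
  };
}

// Made-up numbers: 3,000 requests and 15 errors in a 60-second window.
const red = redFromSnapshots(
  { requestsTotal: 12000, errorsTotal: 30 },
  { requestsTotal: 15000, errorsTotal: 45 },
  60,
);
console.log(red.ratePerSecond); // 50
console.log(red.errorRatio);    // 0.005
```

In practice the backend performs this calculation at query time; the point of the sketch is only that both signals come from the same pair of monotonically increasing counters.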

USE Method (Resource-oriented Infrastructure)

Origin: Brendan Gregg (2012).

Signal	What to Measure
Utilization	Percentage of resource capacity in use (CPU %, memory %, disk %)
Saturation	Queue depth or backpressure (run queue length, connection pool waiters)
Errors	Hardware or driver errors

Target systems: CPUs, memory, disks, network interfaces, connection pools, thread pools.

Key question: Is a resource becoming a bottleneck?
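To make the three USE signals concrete, here is a minimal sketch that maps a connection-pool snapshot onto them. The `PoolStats` fields are hypothetical and do not correspond to any specific pool library.

```typescript
// Hypothetical snapshot of a connection pool; field names are illustrative.
interface PoolStats {
  active: number;        // connections currently checked out
  max: number;           // configured pool size
  waiting: number;       // callers queued for a connection
  acquireErrors: number; // failed acquisitions in the sample window
}

interface UseSignals {
  utilization: number; // fraction of capacity in use, 0..1
  saturation: number;  // queued work the resource cannot serve yet
  errors: number;      // error events attributed to the resource
}

function useSignals(pool: PoolStats): UseSignals {
  return {
    utilization: pool.max === 0 ? 0 : pool.active / pool.max,
    saturation: pool.waiting,
    errors: pool.acquireErrors,
  };
}

const signals = useSignals({ active: 18, max: 20, waiting: 7, acquireErrors: 0 });
console.log(signals.utilization); // 0.9
console.log(signals.saturation);  // 7
```

Utilization and saturation are kept separate on purpose: a pool at 90% with no waiters is busy but healthy, while any sustained number of waiters means callers are already being delayed.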

Four Golden Signals (Google SRE)

Origin: Google SRE Book, Chapter 6.

Signal	What to Measure
Latency	Time to serve a request — distinguish successful latency from error latency
Traffic	Request rate or throughput
Errors	Rate of failed requests (explicit HTTP 5xx, implicit wrong results, or caught exceptions)
Saturation	How full the service is — CPU throttling %, queue depth, memory pressure

Target systems: user-facing services where SLO adherence matters.

Key question: Is my service meeting user-visible quality targets?

Four Golden Signals is similar to RED but elevates Saturation to a first-class signal alongside Latency/Traffic/Errors, and explicitly distinguishes error latency from success latency.

Framework Comparison Table

Framework	Target System	Key Question	Typical Metrics
RED	HTTP/gRPC services, consumers	Handling requests correctly?	`http_requests_total`, `http_request_duration_seconds`
USE	Infrastructure resources	Resource becoming a bottleneck?	`cpu_utilization`, `db_pool_active`, `disk_io_saturation`
Four Golden Signals	User-facing, SLO-driven	Meeting user-visible quality targets?	latency (p99), traffic (RPS), error rate, saturation %

Selection Flowchart

flowchart TD
    A[What are you measuring?] --> B{User-facing HTTP or gRPC service?}
    B -- Yes --> C[Use RED\nRate · Errors · Duration]
    B -- No --> D{Infrastructure resource?\nCPU / memory / disk / pool?}
    D -- Yes --> E[Use USE\nUtilization · Saturation · Errors]
    D -- No --> F{SRE / SLO-driven service?}
    F -- Yes --> G[Use Four Golden Signals\nLatency · Traffic · Errors · Saturation]
    F -- No --> H{Complex service needing both\nservice-level and resource-level view?}
    H -- Yes --> I[RED for service layer\n+ USE for resource layer]
    H -- No --> J[Start with RED —\nextend as needed]

Metric Types Reference

These are conceptual types, not an API reference. For the Java API (Counter.builder, Timer.builder, Gauge), see Micrometer.

Type	Behaviour	Use For	Notes
Counter	Monotonically increasing value; resets to zero only when the process restarts	Requests, errors, events published	Compute rate as `rate(counter[5m])` in PromQL
Gauge	Point-in-time snapshot, can go up or down	Queue depth, active connections, cache size	Captures only the value at scrape time; use a histogram when you need the distribution of observations
Histogram	Distribution of values with configurable buckets	Latency, payload sizes	Enables server-side percentile computation (preferred)
Summary	Pre-computed percentiles client-side	When PromQL recording rules are unavailable	Percentiles do NOT aggregate correctly across instances

Decision rule: Prefer Histogram over Summary for latency. Prometheus computes percentiles server-side from histogram buckets, and histograms aggregate correctly across replicas. Summaries aggregate incorrectly — the p99 of a fleet is not the average of per-instance p99s.
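The aggregation argument can be checked with a small sketch. It merges cumulative bucket counts from two replicas and estimates a quantile by linear interpolation inside the target bucket. This is a simplified illustration of the idea behind server-side percentile estimation, not a reimplementation of any backend's function; the bucket boundaries and counts are made up.

```typescript
// One cumulative bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number;    // upper bound; Infinity for the catch-all bucket
  count: number; // cumulative count up to and including this bucket
}

// Merge replicas by summing counts bucket by bucket.
// Assumes every replica exposes the same boundaries in the same order.
function mergeBuckets(replicas: Bucket[][]): Bucket[] {
  const [first, ...rest] = replicas;
  if (first === undefined) return [];
  return first.map((b, i) => ({
    le: b.le,
    count: b.count + rest.reduce((sum, r) => sum + (r[i]?.count ?? 0), 0),
  }));
}

// Estimate quantile q (0..1) by interpolating inside the bucket that
// contains the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const last = buckets[buckets.length - 1];
  const total = last ? last.count : 0;
  if (total === 0) return NaN;
  const rank = q * total;
  let lowerBound = 0;
  let countBelow = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      const inBucket = b.count - countBelow;
      // The open-ended bucket has no upper bound to interpolate towards.
      if (!isFinite(b.le) || inBucket === 0) return lowerBound;
      return lowerBound + (b.le - lowerBound) * ((rank - countBelow) / inBucket);
    }
    lowerBound = b.le;
    countBelow = b.count;
  }
  return lowerBound;
}

const replicaA: Bucket[] = [
  { le: 0.1, count: 90 }, { le: 0.5, count: 98 }, { le: 1, count: 100 }, { le: Infinity, count: 100 },
];
const replicaB: Bucket[] = [
  { le: 0.1, count: 10 }, { le: 0.5, count: 50 }, { le: 1, count: 100 }, { le: Infinity, count: 100 },
];

const p99A = estimateQuantile(0.99, replicaA);
const p99B = estimateQuantile(0.99, replicaB);
const fleetP99 = estimateQuantile(0.99, mergeBuckets([replicaA, replicaB]));

console.log((p99A + p99B) / 2); // about 0.87: average of per-instance p99s
console.log(fleetP99);          // about 0.98: p99 of the merged distribution
```

Bucket counts are plain sums, so merging them loses nothing. A summary exposes only the finished per-instance percentile, and averaging those gives a different and generally wrong answer.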

TypeScript Example — prom-client 15.x

The following snippet instruments an Express service using the RED method. All labels are low-cardinality.

import { Registry, Counter, Histogram, Gauge } from 'prom-client'; // prom-client 15.x
import { Request, Response, NextFunction } from 'express';
 
const registry = new Registry();
 
// RED: Rate + Errors (Counter) — low-cardinality labels only
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry],
});
 
// RED: Duration (Histogram) — p50/p95/p99 computed by Prometheus from buckets
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});
 
// USE: Saturation (Gauge) — connection pool pressure
const activeConnections = new Gauge({
  name: 'db_pool_active_connections',
  help: 'Active database pool connections',
  registers: [registry],
});
 
export function metricsMiddleware(req: Request, res: Response, next: NextFunction): void {
  // Start timing now; attach labels when the response finishes.
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    // req.route is populated only after Express has matched a route, so read
    // it here rather than when the middleware runs. Use the route template,
    // NOT req.url, to avoid high cardinality.
    const route = req.route?.path ?? 'unknown';
    httpRequests.inc({
      method: req.method,
      route,
      status_code: String(res.statusCode),
    });
    end({ method: req.method, route });
  });
  next();
}
 
// Expose /metrics endpoint for Prometheus scrape
export async function metricsHandler(_req: Request, res: Response): Promise<void> {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
}

Key decisions above:

req.route?.path (template like /api/orders/:id) instead of req.url (actual URL like /api/orders/abc-123) — prevents per-request cardinality explosion
buckets array is the Prometheus client default set with the 10-second bucket dropped, on the assumption that typical API latency sits well below that

Java Example — Micrometer / Spring Boot

The following snippet instruments a Spring Boot service using the Micrometer Observation API, which produces both metrics and traces from a single @Observed annotation. For Counter.builder, Timer.builder, and Gauge API details, see Micrometer.

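import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.observation.ObservationRegistry;
import io.micrometer.observation.annotation.Observed;
import org.springframework.stereotype.Service;

// Order, OrderRequest, OrderRepository, and ValidationException are application types (not shown).
@Service
public class OrderService {

    private final ObservationRegistry observationRegistry;
    private final OrderRepository repository;
    private final Counter ordersFailed;

    public OrderService(ObservationRegistry registry, MeterRegistry meterRegistry,
                        OrderRepository repository) {
        this.observationRegistry = registry;
        this.repository = repository;
        // low-cardinality tag "reason": bounded set of validation failure categories
        this.ordersFailed = Counter.builder("orders.failed")
            .description("Orders that failed processing")
            .tag("reason", "validation")
            .register(meterRegistry);
    }

    // @Observed auto-records: rate (count), duration (timer), and errors
    // produces both a metric AND a trace span from a single instrumentation point
    // takes effect only when Spring AOP and an ObservedAspect bean are present
    @Observed(name = "order.process", contextualName = "processOrder")
    public Order processOrder(OrderRequest request) {
        // mapToOrder is an application-specific mapper, omitted here
        return repository.save(mapToOrder(request));
    }

    public void rejectOrder(OrderRequest request, String reason) {
        ordersFailed.increment(); // explicit RED Errors counter
        // the free-form reason goes into the exception message, never into a tag
        throw new ValidationException("Rejected: " + reason);
    }
}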

Dashboard Design Principles

Golden rule: Every panel must answer a specific operational question. If you cannot state the question in one sentence, the panel does not belong on the dashboard.

Layout Structure

Dashboard Layout Diagram

A spatial layout diagram for this structure is at Topics/Metrics-and-Dashboards-diagram.excalidraw.md.

Recommended three-row layout:

Row	Purpose	Panels
Row 1 — Service Health (RED)	Is the service handling requests correctly?	Request rate (RPS), Error rate (%), p99 latency
Row 2 — Resource Health (USE)	Is a resource becoming a bottleneck?	CPU utilization %, Memory %, Connection pool active / waiting
Row 3 — Business Metrics	Are business outcomes healthy?	Order rate, user signups, conversion rate

Rationale for separation: Row 1 and Row 2 serve different audiences (on-call engineer vs capacity planner). Mixing them forces both audiences to filter noise.

Principles

Percentiles over averages — p99 latency reveals tail problems; the arithmetic mean hides them. One 5-second request among 99 requests of 1 ms each yields a mean of about 51 ms, which looks healthy even though one caller waited 5 seconds.
Error rate before error count — "0.1% error rate" is more actionable than "500 errors/min" in isolation. Always show both rate and count.
Time range alignment — all panels must be aligned to the same time range. Misaligned ranges create false correlations between unrelated spikes.
USE and RED on separate rows — infrastructure metrics and service metrics have different slopes, scales, and audiences.
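The "percentiles over averages" principle is easy to verify numerically. The sketch below uses made-up latencies in which 2% of requests are slow and compares the mean with nearest-rank percentiles; the helper names are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] ?? NaN;
}

function mean(samples: number[]): number {
  if (samples.length === 0) return NaN;
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Made-up latencies in milliseconds: 980 fast requests and 20 slow ones.
const latenciesMs: number[] = [];
for (let i = 0; i < 980; i++) latenciesMs.push(1);
for (let i = 0; i < 20; i++) latenciesMs.push(5000);

console.log(mean(latenciesMs));           // 100.98
console.log(percentile(latenciesMs, 50)); // 1
console.log(percentile(latenciesMs, 99)); // 5000
```

A mean of roughly 100 ms and a median of 1 ms both suggest a healthy service, while p99 shows that the slowest requests take 5 seconds.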

Anti-Patterns to Avoid

Anti-Pattern	Problem	Fix
Christmas tree dashboard (> 20 panels per page)	Decision paralysis; operators cannot identify the critical signal	One dashboard per audience (service health, infra health, business metrics)
High-cardinality labels in PromQL (`sum by (user_id)`)	Query timeouts; Prometheus cardinality explosion	Aggregate by route, method, status_code only
Gauge panels for rates	Readers interpret a point-in-time value as a trend	Use time-series graphs for rates; reserve stat panels for current values
SLO burn rate mixed with noise metrics	Operators miss budget exhaustion alerts among irrelevant panels	Dedicate a separate row or dashboard to SLO error budget burn rate

Cardinality Management

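Cardinality is the number of unique time-series a metric produces. Its upper bound is the product of the number of unique values of each label; the actual count is the number of label-value combinations that are observed in practice.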
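A quick worst-case estimate makes the effect of a single unbounded label visible. The sketch below multiplies per-label value counts; the counts are illustrative, not measurements, and the function names are hypothetical. The histogram helper assumes the Prometheus exposition layout of one series per bucket plus `+Inf`, `_sum`, and `_count`.

```typescript
// Worst-case series count: every label value combined with every other.
function worstCaseSeries(valuesPerLabel: Record<string, number>): number {
  let product = 1;
  for (const label of Object.keys(valuesPerLabel)) {
    product *= valuesPerLabel[label] ?? 1;
  }
  return product;
}

// A Prometheus-style histogram exposes one series per explicit bucket,
// plus the +Inf bucket, plus _sum and _count, for each label combination.
function histogramSeries(labelCombinations: number, explicitBuckets: number): number {
  return labelCombinations * (explicitBuckets + 3);
}

// Illustrative value counts per label.
const bounded = worstCaseSeries({ method: 5, route: 50, status_code: 20 });
const withUserId = worstCaseSeries({ method: 5, route: 50, status_code: 20, user_id: 100000 });
const durationSeries = histogramSeries(worstCaseSeries({ method: 5, route: 50 }), 10);

console.log(bounded);        // 5000
console.log(withUserId);     // 500000000
console.log(durationSeries); // 3250
```

Adding one unbounded label multiplies the whole product, which is why a single `user_id` tag can push an otherwise modest metric past what the server can hold in memory.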

Rule: Label values must be bounded and low-cardinality (fewer than 100 unique values per label, ideally fewer than 20).

Label Type	Example Values	Safe?
HTTP method	`GET`, `POST`, `DELETE`	Yes — ~5 values
HTTP status code	`200`, `400`, `500`	Yes — ~20 values
Route template	`/api/orders`, `/api/users`	Yes — bounded by code
User ID	`user-abc123`, `user-def456`, ...	No — unbounded
Request ID	UUID per request	No — one time-series per request
Email address	Customer email	No — unbounded PII
IP address	Client IP	No — unbounded
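One way to keep a label bounded is to collapse anything outside a known set into a single fallback value before it reaches the metrics client. The following is a minimal sketch of that guard; `boundedLabel`, the fallback value `other`, and the route list are hypothetical names chosen for illustration.

```typescript
// Collapse any value outside a known set into one fallback value, so the
// label can never take more than allowed.length + 1 distinct values.
function boundedLabel(
  value: string,
  allowed: readonly string[],
  fallback: string = 'other',
): string {
  return allowed.indexOf(value) >= 0 ? value : fallback;
}

// Hypothetical allow-lists for illustration.
const knownRoutes: readonly string[] = ['/api/orders', '/api/orders/:id', '/api/users'];
const knownMethods: readonly string[] = ['GET', 'POST', 'PUT', 'PATCH', 'DELETE'];

console.log(boundedLabel('/api/orders/:id', knownRoutes)); // /api/orders/:id
console.log(boundedLabel('/wp-admin.php', knownRoutes));   // other
console.log(boundedLabel('PROPFIND', knownMethods));       // other
```

The guard trades detail for safety: unexpected values are still counted, but they share one series instead of creating a new one each.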

Detection: Monitor prometheus_tsdb_head_series — alert when the series count grows unexpectedly week-over-week. A sudden spike indicates a cardinality bug in newly deployed instrumentation.

Mitigation for existing high-cardinality metrics: Use metric_relabel_configs in the Prometheus scrape config to drop or hash the offending label before storage.

Metrics vs Logs vs Traces — Tradeoffs

Dimension	Metrics	Logs	Traces
Storage cost	Low — aggregates only	High — full event text	Medium — sampled spans
Query speed	Fast — pre-aggregated	Slow — full-text scan	Medium — indexed by trace ID
Cardinality	Fixed at instrument time	Unlimited (structured fields)	Bounded by sampling rate
Per-request detail	None	Full	Full (within sample)
SLO/alerting	Native	Requires log-based metrics	Not directly
Root cause analysis	Insufficient alone	Sufficient	Sufficient
Best for	Alerting, capacity, SLOs	Investigation, audits	Latency attribution

The three pillars are complementary. A full observability stack uses all three: metrics to detect, traces to locate, logs to explain.

Related Concepts

Micrometer — Java meter API (Counter, Timer, Gauge), Spring Boot auto-configuration, BFF-specific metric names
Distributed-Tracing — request-scoped latency measurement via spans; complements metrics for per-request attribution
Structured-Logging — event-level observability; the third pillar that explains why metrics changed
SLO-SLI-SLA — metrics as the measurement basis for SLIs; error budgets derived from metric time-series
Alerting-Strategies — turning metrics (burn rate, threshold, anomaly) into actionable on-call alerts
