Metrics and Dashboards
Metrics are aggregated numerical measurements of system behaviour over time. They differ from the other two observability pillars in a fundamental way: a metric discards the individual event and retains only the aggregate — a counter, a gauge value, a histogram bucket count. This makes metrics cheap to store, fast to query, and easy to alert on, but they cannot answer why a specific request failed.
- Metrics — aggregated numbers over time windows; ideal for alerting, SLO tracking, and capacity planning
- Logs — discrete event records with full context; ideal for post-incident investigation
- Traces — end-to-end request journeys with span timing; ideal for latency attribution across services
Dashboards make metrics actionable by placing the right aggregations in front of the right operator at the right moment.
This note covers measurement framework selection (RED / USE / Four Golden Signals), dashboard design principles, and application-level metric instrumentation patterns. For the Java meter API (Counter, Timer, Gauge, DistributionSummary), MeterRegistry implementation, and Spring Boot auto-configuration, see Micrometer.
When NOT to Use
Metrics alone are insufficient for:
- Debugging a specific request failure — metrics can tell you the error rate spiked; they cannot show which request failed or what the stack trace was. Use distributed traces (Distributed-Tracing) and structured logs for that.
- Understanding why an error occurred — you need logs and traces that carry per-request context.
- Verifying business-logic correctness — a zero-error-rate metric does not mean the order total was calculated correctly.
Cardinality Explosion Warning
Label values such as user_id, request_id, email, or ip_address create one time-series per unique value. Beyond roughly 10,000 label combinations for a single metric, Prometheus is at risk of out-of-memory failures, query timeouts, and runaway storage growth.
Safe label cardinalities: HTTP method (5–10 values), status code (~20 values), route path (~50–200 values).
Measurement Frameworks
Three frameworks cover the vast majority of instrumentation needs. Each targets a different system type.
RED Method (Request-oriented Services)
Origin: Tom Wilkie / Weaveworks (2018).
| Signal | What to Measure |
|---|---|
| Rate | Requests per second (throughput) |
| Errors | Error rate — (4xx + 5xx) / total requests |
| Duration | Latency distribution — p50, p95, p99 |
Target systems: HTTP services, gRPC services, message consumers, async workers.
Key question: Is my service handling requests correctly and fast enough?
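As a rough illustration of how the Rate and Errors signals fall out of cumulative counters, the sketch below derives both from two counter readings. The `CounterSnapshot` shape and the sample numbers are assumptions made for this example, and it deliberately ignores counter resets; Duration needs a histogram and is not covered here.

```typescript
// Hypothetical snapshot shape: cumulative counter values read at one instant.
interface CounterSnapshot {
  timestampSeconds: number; // when the counters were read
  requestsTotal: number;    // cumulative request count
  errorsTotal: number;      // cumulative failed-request count
}

interface RateAndErrors {
  requestsPerSecond: number;
  errorRatio: number; // 0..1; reported as 0 when the window saw no requests
}

// Derives Rate and Errors from two readings of monotonically increasing counters.
// Assumes no counter reset (process restart) happened between the readings.
function rateAndErrors(earlier: CounterSnapshot, later: CounterSnapshot): RateAndErrors {
  const seconds = later.timestampSeconds - earlier.timestampSeconds;
  if (seconds <= 0) throw new Error('snapshots must be in chronological order');
  const requests = later.requestsTotal - earlier.requestsTotal;
  const errors = later.errorsTotal - earlier.errorsTotal;
  return {
    requestsPerSecond: requests / seconds,
    errorRatio: requests === 0 ? 0 : errors / requests,
  };
}

// Illustrative five-minute window: 3,000 new requests, 30 of them failed.
const fiveMinutes = rateAndErrors(
  { timestampSeconds: 0, requestsTotal: 1_000, errorsTotal: 10 },
  { timestampSeconds: 300, requestsTotal: 4_000, errorsTotal: 40 },
);
console.log(fiveMinutes);
```

With these sample readings the window works out to 10 requests per second and a 1% error ratio.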
USE Method (Resource-oriented Infrastructure)
Origin: Brendan Gregg (2012).
| Signal | What to Measure |
|---|---|
| Utilization | Percentage of resource capacity in use (CPU %, memory %, disk %) |
| Saturation | Queue depth or backpressure (run queue length, connection pool waiters) |
| Errors | Hardware or driver errors |
Target systems: CPUs, memory, disks, network interfaces, connection pools, thread pools.
Key question: Is a resource becoming a bottleneck?
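A minimal sketch of the three USE signals for a connection pool follows. The `PoolReading` fields are hypothetical names chosen for this example, not the API of any particular pool library.

```typescript
// Hypothetical point-in-time reading from a database connection pool.
interface PoolReading {
  active: number;        // connections currently checked out
  maxSize: number;       // configured pool capacity
  waiting: number;       // callers queued for a connection
  acquireErrors: number; // failed or timed-out acquisitions since the last reading
}

interface UseSignals {
  utilization: number; // fraction of capacity in use, 0..1
  saturation: number;  // queued work the resource cannot serve yet
  errors: number;
}

function useSignals(reading: PoolReading): UseSignals {
  if (reading.maxSize <= 0) throw new Error('maxSize must be positive');
  return {
    utilization: reading.active / reading.maxSize,
    saturation: reading.waiting,
    errors: reading.acquireErrors,
  };
}

// Two readings with the same utilization but very different saturation.
const busy = useSignals({ active: 20, maxSize: 20, waiting: 0, acquireErrors: 0 });
const overloaded = useSignals({ active: 20, maxSize: 20, waiting: 35, acquireErrors: 4 });
console.log(busy, overloaded);
```

Both readings report 100% utilization; only the saturation value separates a fully used pool from an overloaded one, which is why the two signals are tracked separately.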
Four Golden Signals (Google SRE)
Origin: Google SRE Book, Chapter 6.
| Signal | What to Measure |
|---|---|
| Latency | Time to serve a request — distinguish successful latency from error latency |
| Traffic | Request rate or throughput |
| Errors | Rate of failed requests (explicit HTTP 5xx, implicit wrong results, or caught exceptions) |
| Saturation | How full the service is — CPU throttling %, queue depth, memory pressure |
Target systems: user-facing services where SLO adherence matters.
Key question: Is my service meeting user-visible quality targets?
Four Golden Signals is similar to RED but elevates Saturation to a first-class signal alongside Latency/Traffic/Errors, and explicitly distinguishes error latency from success latency.
Framework Comparison Table
| Framework | Target System | Key Question | Typical Metrics |
|---|---|---|---|
| RED | HTTP/gRPC services, consumers | Handling requests correctly? | http_requests_total, http_request_duration_seconds |
| USE | Infrastructure resources | Resource becoming a bottleneck? | cpu_utilization, db_pool_active, disk_io_saturation |
| Four Golden Signals | User-facing, SLO-driven | Meeting user-visible quality targets? | latency (p99), traffic (RPS), error rate, saturation % |
Selection Flowchart
flowchart TD
A[What are you measuring?] --> B{User-facing HTTP or gRPC service?}
B -- Yes --> C[Use RED\nRate · Errors · Duration]
B -- No --> D{Infrastructure resource?\nCPU / memory / disk / pool?}
D -- Yes --> E[Use USE\nUtilization · Saturation · Errors]
D -- No --> F{SRE / SLO-driven service?}
F -- Yes --> G[Use Four Golden Signals\nLatency · Traffic · Errors · Saturation]
F -- No --> H{Complex service needing both\nservice-level and resource-level view?}
H -- Yes --> I[RED for service layer\n+ USE for resource layer]
H -- No --> J[Start with RED —\nextend as needed]
Metric Types Reference
These are conceptual types, not API reference. For the Java API (Counter.builder, Timer.builder, Gauge), see Micrometer.
| Type | Behaviour | Use For | Notes |
|---|---|---|---|
| Counter | Monotonically increasing value; resets to zero only when the process restarts | Requests, errors, events published | Compute rate as rate(counter[5m]) in PromQL |
| Gauge | Point-in-time snapshot, can go up or down | Queue depth, active connections, cache size | Holds only the latest value; use a histogram when the distribution of values matters |
| Histogram | Distribution of values with configurable buckets | Latency, payload sizes | Enables server-side percentile computation (preferred) |
| Summary | Pre-computed percentiles client-side | When PromQL recording rules are unavailable | Percentiles do NOT aggregate correctly across instances |
Decision rule: Prefer Histogram over Summary for latency. Prometheus computes percentiles server-side from histogram buckets, and histograms aggregate correctly across replicas. Summaries aggregate incorrectly — the p99 of a fleet is not the average of per-instance p99s.
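The aggregation claim can be checked with a small sketch. The code below estimates a quantile from cumulative bucket counts by linear interpolation, which is the same idea as PromQL's histogram_quantile but not its exact implementation. The bucket bounds and the two-instance fleet are assumptions invented for this example.

```typescript
// Bucket upper bounds in seconds; the last bucket is +Infinity, as in Prometheus.
const upperBounds = [0.1, 0.5, 1, 5, Infinity];

// Estimates quantile q (0 < q <= 1) from cumulative bucket counts by linear
// interpolation inside the bucket that contains the target rank.
function quantileFromBuckets(q: number, cumulativeCounts: number[]): number {
  if (q <= 0 || q > 1) throw new Error('q must be in (0, 1]');
  const total = cumulativeCounts[cumulativeCounts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  const i = cumulativeCounts.findIndex((count) => count >= rank);
  if (upperBounds[i] === Infinity) return upperBounds[i - 1]; // highest finite bound
  const lowerBound = i === 0 ? 0 : upperBounds[i - 1];
  const countBelow = i === 0 ? 0 : cumulativeCounts[i - 1];
  const inBucket = cumulativeCounts[i] - countBelow;
  return lowerBound + (upperBounds[i] - lowerBound) * ((rank - countBelow) / inBucket);
}

// Bucket counts are plain counters, so replicas merge by element-wise addition.
function mergeCounts(a: number[], b: number[]): number[] {
  return a.map((count, i) => count + b[i]);
}

// Hypothetical fleet: instance A served 995 fast requests, instance B served 5 slow ones.
const instanceA = [995, 995, 995, 995, 995];
const instanceB = [0, 0, 0, 5, 5];

const fleetP99 = quantileFromBuckets(0.99, mergeCounts(instanceA, instanceB));
const averagedP99 =
  (quantileFromBuckets(0.99, instanceA) + quantileFromBuckets(0.99, instanceB)) / 2;

console.log({ fleetP99, averagedP99 });
```

With this data the merged histogram puts the fleet p99 just under 0.1 s, while averaging the two per-instance p99 values gives roughly 2.5 s. The average is dominated by an instance that handled only 0.5% of the traffic.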
TypeScript Example — prom-client 15.x
The following snippet instruments an Express service using the RED method. All labels are low-cardinality.
import { Registry, Counter, Histogram, Gauge } from 'prom-client'; // prom-client 15.x
import type { Request, Response, NextFunction } from 'express';

const registry = new Registry();

// RED: Rate + Errors (Counter), low-cardinality labels only
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [registry],
});

// RED: Duration (Histogram), p50/p95/p99 computed by Prometheus from buckets
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

// USE: Utilization (Gauge), connections currently in use; waiting callers would measure saturation
const activeConnections = new Gauge({
  name: 'db_pool_active_connections',
  help: 'Active database pool connections',
  registers: [registry],
});

export function metricsMiddleware(req: Request, res: Response, next: NextFunction): void {
  // Start the timer without labels: in app-level middleware, req.route is not
  // populated until Express has matched a route, so labels are resolved on 'finish'.
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    // Route template, NOT req.url, to avoid high cardinality
    const labels = { method: req.method, route: req.route?.path ?? 'unknown' };
    httpRequests.inc({ ...labels, status_code: String(res.statusCode) });
    end(labels); // observes the elapsed seconds with the labels resolved after routing
  });
  next();
}

// Expose /metrics endpoint for Prometheus scrape
export async function metricsHandler(_req: Request, res: Response): Promise<void> {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
}
Key decisions above:
- req.route?.path (a template like /api/orders/:id) instead of req.url (an actual URL like /api/orders/abc-123), which prevents per-request cardinality explosion
- The buckets array follows the Prometheus default recommendations, adjusted downward for typical API latency
Java Example — Micrometer / Spring Boot
The following snippet instruments a Spring Boot service using the Micrometer Observation API, which produces both metrics and traces from a single @Observed annotation. For Counter.builder, Timer.builder, and Gauge API details, see Micrometer.
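import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.observation.ObservationRegistry;
import io.micrometer.observation.annotation.Observed;
import org.springframework.stereotype.Service;

// Order, OrderRequest, OrderRepository, ValidationException, and the mapToOrder helper
// are application code not shown here.
// @Observed takes effect only when Spring AOP is enabled and an ObservedAspect bean is registered.
@Service
public class OrderService {
    private final ObservationRegistry observationRegistry;
    private final OrderRepository repository;
    private final Counter ordersFailed;

    public OrderService(ObservationRegistry registry, MeterRegistry meterRegistry,
                        OrderRepository repository) {
        this.observationRegistry = registry;
        this.repository = repository;
        // Low-cardinality tag "reason": this counter covers the "validation" category only
        this.ordersFailed = Counter.builder("orders.failed")
            .description("Orders that failed processing")
            .tag("reason", "validation")
            .register(meterRegistry);
    }

    // @Observed auto-records rate (count), duration (timer), and errors,
    // and produces both a metric AND a trace span from a single instrumentation point
    @Observed(name = "order.process", contextualName = "processOrder")
    public Order processOrder(OrderRequest request) {
        return repository.save(mapToOrder(request));
    }

    public void rejectOrder(OrderRequest request, String reason) {
        ordersFailed.increment(); // explicit RED Errors counter
        throw new ValidationException("Rejected: " + reason);
    }
}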
Dashboard Design Principles
Golden rule: Every panel must answer a specific operational question. If you cannot state the question in one sentence, the panel does not belong on the dashboard.
Layout Structure
A spatial layout diagram for this structure is at Topics/Metrics-and-Dashboards-diagram.excalidraw.md.
Recommended three-row layout:
| Row | Purpose | Panels |
|---|---|---|
| Row 1 — Service Health (RED) | Is the service handling requests correctly? | Request rate (RPS), Error rate (%), p99 latency |
| Row 2 — Resource Health (USE) | Is a resource becoming a bottleneck? | CPU utilization %, Memory %, Connection pool active / waiting |
| Row 3 — Business Metrics | Are business outcomes healthy? | Order rate, user signups, conversion rate |
Rationale for separation: Row 1 and Row 2 serve different audiences (on-call engineer vs capacity planner). Mixing them forces both audiences to filter noise.
Principles
- Percentiles over averages: p99 latency reveals tail problems; the arithmetic mean hides them. A single 5-second request among 99 requests of 1 ms each yields a mean of about 51 ms, which looks healthy even though one caller waited 5 seconds.
- Error rate before error count — "0.1% error rate" is more actionable than "500 errors/min" in isolation. Always show both rate and count.
- Time range alignment — all panels must be aligned to the same time range. Misaligned ranges create false correlations between unrelated spikes.
- USE and RED on separate rows — infrastructure metrics and service metrics have different slopes, scales, and audiences.
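The percentiles-over-averages principle can be made concrete with a few lines of arithmetic. The sketch below uses a nearest-rank percentile, which is one of several common definitions, and an invented latency sample of 98 fast requests and 2 slow ones.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Illustrative latencies in milliseconds: 98 requests at 1 ms, 2 requests at 5,000 ms.
const latenciesMs: number[] = [
  ...Array.from({ length: 98 }, () => 1),
  ...Array.from({ length: 2 }, () => 5000),
];

const meanMs = mean(latenciesMs);
const p50Ms = percentile(latenciesMs, 50);
const p99Ms = percentile(latenciesMs, 99);
console.log({ meanMs, p50Ms, p99Ms });
```

For this sample the mean is about 101 ms and the median is 1 ms, while p99 is 5,000 ms. Only the p99 panel would show that 2% of callers waited five seconds.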
Anti-Patterns to Avoid
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Christmas tree dashboard (> 20 panels per page) | Decision paralysis; operators cannot identify the critical signal | One dashboard per audience (service health, infra health, business metrics) |
| High-cardinality labels in PromQL aggregations (sum by (user_id)) | Query timeouts; Prometheus cardinality explosion | Aggregate by route, method, status_code only |
| Gauge panels for rates | Readers interpret a point-in-time value as a trend | Use time-series graphs for rates; reserve stat panels for current values |
| SLO burn rate mixed with noise metrics | Operators miss budget exhaustion alerts among irrelevant panels | Dedicate a separate row or dashboard to SLO error budget burn rate |
Cardinality Management
Cardinality is the number of unique time-series a metric produces. Its upper bound is the product of the number of distinct values of each label; the actual count is the number of label combinations that have been observed.
Rule: Label values must be bounded and low-cardinality (fewer than 100 unique values per label, ideally fewer than 20).
| Label Type | Example Values | Safe? |
|---|---|---|
| HTTP method | GET, POST, DELETE | Yes — ~5 values |
| HTTP status code | 200, 400, 500 | Yes — ~20 values |
| Route template | /api/orders, /api/users | Yes — bounded by code |
| User ID | user-abc123, user-def456, ... | No — unbounded |
| Request ID | UUID per request | No — one time-series per request |
| Email address | Customer email | No — unbounded PII |
| IP address | Client IP | No — unbounded |
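To see how quickly one unbounded label changes the picture, the sketch below multiplies per-label value counts to get the worst-case series count for a metric. The counts for method, status_code, and route are taken from the safe ranges mentioned in this note; the figure of 50,000 users is an assumption chosen only for illustration.

```typescript
// Upper bound on the series one metric can produce: the product of the
// number of distinct values each label can take.
function maxSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Bounded labels only.
const boundedSeries = maxSeries({ method: 5, status_code: 20, route: 50 });

// The same metric after adding a user_id label (hypothetical 50,000 users).
const withUserIdSeries = maxSeries({ method: 5, status_code: 20, route: 50, user_id: 50_000 });

console.log({ boundedSeries, withUserIdSeries });
```

The bounded variant tops out at 5,000 series, whereas the user_id variant has a ceiling of 250,000,000. Real series counts are lower than the ceiling because not every combination occurs, but the unbounded label still grows with every new user.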
Detection: Monitor prometheus_tsdb_head_series — alert when the series count grows unexpectedly week-over-week. A sudden spike indicates a cardinality bug in newly deployed instrumentation.
Mitigation for existing high-cardinality metrics: Use metric_relabel_configs in the Prometheus scrape config to drop or hash the offending label before storage.
Metrics vs Logs vs Traces — Tradeoffs
| Dimension | Metrics | Logs | Traces |
|---|---|---|---|
| Storage cost | Low — aggregates only | High — full event text | Medium — sampled spans |
| Query speed | Fast — pre-aggregated | Slow — full-text scan | Medium — indexed by trace ID |
| Cardinality | Fixed at instrument time | Unlimited (structured fields) | Bounded by sampling rate |
| Per-request detail | None | Full | Full (within sample) |
| SLO/alerting | Native | Requires log-based metrics | Not directly |
| Root cause analysis | Insufficient alone | Sufficient | Sufficient |
| Best for | Alerting, capacity, SLOs | Investigation, audits | Latency attribution |
The three pillars are complementary. A full observability stack uses all three: metrics to detect, traces to locate, logs to explain.
Related Concepts
- Micrometer — Java meter API (Counter, Timer, Gauge), Spring Boot auto-configuration, BFF-specific metric names
- Distributed-Tracing — request-scoped latency measurement via spans; complements metrics for per-request attribution
- Structured-Logging — event-level observability; the third pillar that explains why metrics changed
- SLO-SLI-SLA — metrics as the measurement basis for SLIs; error budgets derived from metric time-series
- Alerting-Strategies — turning metrics (burn rate, threshold, anomaly) into actionable on-call alerts