Load Balancer
Infrastructure component that distributes incoming network traffic across multiple backend servers to improve throughput, reduce latency, and provide fault tolerance through redundancy.
When NOT to Use
- Single-server applications — load balancer adds latency and cost with no benefit; add one only when horizontal scaling is required
- Stateful protocols with no session management — if each request requires full session state and no session store exists, consistent hashing or sticky sessions must be configured, adding complexity that often exceeds the benefit
- Sub-millisecond latency requirements where LB hop overhead matters — use service mesh sidecar with direct routing or client-side load balancing instead; the extra network hop through a centralised LB is unacceptable when the entire latency budget is 1ms
Core Mechanism
L4 vs L7 Load Balancing
Two distinct layers at which load balancing can operate:
- L4 (Transport layer): routes by IP address + TCP/UDP port; does not inspect the request payload; lower latency, lower CPU overhead per connection
- L7 (Application layer): routes by HTTP header, URL path, cookie, gRPC method, or request body content; enables content-based routing, SSL termination, and request rewriting
Decision criterion (required by INFRA-01): Use L4 when minimum latency matters and all traffic is homogeneous (all requests go to identical backend pools). Use L7 when routing must be content-aware — different URL paths to different backend pools, A/B routing, authentication offload, gRPC method routing, or WebSocket upgrade handling.
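A minimal sketch of the difference in what each layer can see when routing. The pool names, path prefixes, the `X-Experiment` header, and the listener address are all hypothetical and exist only for illustration; real load balancers express this as configuration rather than code.

```python
from typing import Dict, Optional

# Hypothetical path-prefix routing table; pool names are illustrative only.
ROUTES = [
    ("/api/", "api-pool"),
    ("/static/", "static-pool"),
]
DEFAULT_POOL = "web-pool"


def l7_route(path: str, headers: Optional[Dict[str, str]] = None) -> str:
    """Pick a backend pool from request content (what an L7 LB can inspect)."""
    headers = headers or {}
    # Header-based override, e.g. for A/B routing (hypothetical header name).
    if headers.get("X-Experiment") == "b":
        return "experiment-pool"
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL


def l4_route(dst_ip: str, dst_port: int) -> str:
    """An L4 LB sees only addresses and ports, so one listener maps to one pool."""
    listeners = {("203.0.113.10", 443): "web-pool"}
    return listeners[(dst_ip, dst_port)]
```

If every request would end up in the same pool regardless of path or headers, the L7 function adds nothing over the L4 one, which is the homogeneous-traffic case in the criterion above.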
Health Checks
The LB continuously monitors backend health to route traffic only to healthy servers:
- Active health checks: LB sends periodic probe requests (HTTP GET /health, TCP connect attempt) to each backend on a schedule; marks a server unhealthy after N consecutive failures, healthy again after M consecutive successes
- Passive health checks: LB monitors real client traffic; marks a server unhealthy after K connection errors or timeouts within window T; no extra probe traffic, but detects failures only when real traffic triggers them
Configuration tradeoffs: a shorter interval and a lower failure threshold detect failures faster but increase the false-positive rate during transient network jitter. A typical starting point: 5-second interval, 3-failure threshold, 2-success recovery threshold.
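The consecutive-failure and consecutive-success thresholds for active checks can be sketched as a small state machine. This is an illustrative model, not any particular load balancer's implementation; the class and function names are hypothetical, and the detection-time helper ignores probe timeouts.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one backend's health from a stream of probe results."""
    failure_threshold: int = 3   # N consecutive failures -> unhealthy
    success_threshold: int = 2   # M consecutive successes -> healthy again
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


def worst_case_detection_seconds(interval_s: float, failure_threshold: int) -> float:
    """Rough upper bound on time from failure to removal, ignoring probe timeouts."""
    return interval_s * failure_threshold
```

With the starting point above, `worst_case_detection_seconds(5, 3)` gives 15, so a failed backend may keep receiving traffic for roughly that long. A single failed probe followed by a success resets the failure counter, which is what absorbs transient jitter.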
Component Diagram
Load-Balancer-diagram.excalidraw
Key Variants
Algorithm comparison — how the LB selects the next backend for each incoming request:
| Algorithm | Mechanism | Use When |
|---|---|---|
| Round-robin | Cyclic assignment to next server in sequence | Requests are stateless and roughly equal cost |
| Weighted round-robin | Round-robin with weight per server; higher-weight servers receive proportionally more requests | Servers have unequal capacity (e.g., different hardware generations) |
| Least connections | Route to the server with the fewest active connections at routing time | Variable request duration (long-lived connections, streaming, WebSocket) |
| Consistent hashing | Hash(client_ip or session_id) maps to a position on a ring that resolves to a server | Session affinity required; stateful protocol (WebSocket, sticky sessions); minimises remapping on server add/remove |
| IP hash | Hash(source_ip) modulo server count → server index | Simple session affinity without a full ring; deterministic but breaks if server count changes |
| Random | Select a server uniformly at random | Simple stateless load distribution; performs similarly to round-robin at scale |
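Three of the algorithms in the table can be sketched in a few lines. The function names are hypothetical, and the weighted variant is deliberately naive: it only demonstrates the proportions, whereas production balancers typically interleave picks more smoothly.

```python
import itertools
from typing import Dict, Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Cycle through servers in order."""
    return itertools.cycle(servers)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Naive weighted round-robin: repeat each server `weight` times per cycle."""
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def least_connections(active: Dict[str, int]) -> str:
    """Pick the server with the fewest active connections (ties: first listed)."""
    return min(active, key=active.get)
```

Round-robin and its weighted form need no per-request state beyond a cursor, while least connections requires the LB to track active connection counts per backend, which is why it suits workloads with variable request duration.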
Design Decisions
Active/Passive Failover Models
How traffic is handled when a backend or entire datacenter fails:
- Active-active (multi-active): traffic is distributed across all active backends simultaneously; when one fails, the LB removes it from the pool and remaining backends absorb the load; highest utilisation and throughput
- Active-passive (hot standby): one active server handles all traffic; standby server receives no traffic until the active fails; provides a clean consistency model at the cost of lower utilisation; standby must stay synchronised to be ready
- Multi-region active-active: traffic routed to nearest region; each region runs an active pool; reduces latency globally; requires cross-region data consistency strategy
- Multi-region active-passive: one region is primary; a secondary region is on standby; simpler consistency model; higher latency for users far from primary region
SSL Termination
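L7 load balancers can terminate TLS at the LB and forward unencrypted (or re-encrypted) traffic to backends. Benefit: centralises certificate management and offloads TLS handshake CPU from backends. Tradeoff: traffic between LB and backends travels in plaintext unless the LB re-encrypts it (TLS re-encryption). TLS passthrough is a separate option in which the LB forwards the encrypted stream without terminating it; this preserves end-to-end encryption but rules out L7 inspection and content-based routing.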
Connection Draining (Graceful Deregistration)
When removing a backend from the pool (planned maintenance, rolling deploy), the LB should stop routing new requests to it and let existing connections drain before the backend is shut down. Draining allows in-flight requests to complete while no new requests are routed to the deregistered server. Drain timeout is typically 30-60 seconds; connections alive past the timeout are forcibly closed. Without draining, rolling deploys cause visible client errors.
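The draining rule can be sketched as a small state model: a draining backend accepts nothing new and becomes removable once it is idle or the timeout has elapsed. The class, its fields, and the 30-second default are hypothetical; time is passed in explicitly to keep the sketch deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrainingBackend:
    """Hypothetical model of one backend's registration state in the LB."""
    name: str
    in_flight: int = 0
    drain_started_at: Optional[float] = None

    @property
    def draining(self) -> bool:
        return self.drain_started_at is not None

    def accepts_new_requests(self) -> bool:
        # A draining backend finishes existing work but gets nothing new.
        return not self.draining

    def start_drain(self, now: float) -> None:
        self.drain_started_at = now

    def can_remove(self, now: float, drain_timeout_s: float = 30.0) -> bool:
        """Safe to remove once idle, or once the drain timeout has elapsed."""
        if not self.draining:
            return False
        timed_out = now - self.drain_started_at >= drain_timeout_s
        return self.in_flight == 0 or timed_out
```

A deploy script would call `start_drain`, poll `can_remove`, and only then stop the process; any requests still in flight when the timeout fires are the ones that get forcibly closed.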
Session Affinity (Sticky Sessions)
For applications that store session state locally (in-process), the LB can be configured to always route a given client to the same backend — "sticky sessions." Implemented via:
- Cookie-based affinity: LB injects a session cookie on first request; routes subsequent requests from the same cookie to the same backend
- IP-based affinity: consistent hash on source IP; less reliable if clients are behind NAT (many clients appear to share one IP)
Sticky sessions are a partial solution; the preferred approach is externalising session state to a distributed cache (Distributed-Cache) so that any backend can serve any client.
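Cookie-based affinity can be sketched as follows. The cookie name and function are hypothetical, and storing the backend name directly in the cookie is a simplification; real load balancers generally use an opaque or encoded identifier.

```python
from typing import Dict, List, Tuple

COOKIE_NAME = "lb_backend"  # hypothetical cookie name


def route_with_cookie(
    cookies: Dict[str, str], healthy: List[str], next_index: int
) -> Tuple[str, Dict[str, str]]:
    """Return (backend, cookies_to_set) for one request.

    `next_index` stands in for whatever algorithm picks a backend for new clients.
    """
    pinned = cookies.get(COOKIE_NAME)
    if pinned in healthy:
        return pinned, {}  # existing affinity still valid
    # First request, or the pinned backend is gone: pick a backend and re-pin.
    backend = healthy[next_index % len(healthy)]
    return backend, {COOKIE_NAME: backend}
```

The re-pin branch is where the limitation shows: when the pinned backend leaves the pool, the client lands on a different server and any in-process session state is lost, which is the case externalised session state avoids.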
Rate Limiting and DDoS Mitigation at the LB Layer
L7 load balancers can inspect request headers and apply coarse rate limits per source IP or per authenticated identity (for example, an API key or token carried in a header) before requests reach backends. This offloads simple rate limiting logic from each backend service. However, for application-aware rate limiting (per-user quotas, per-endpoint limits), see Operational-API-Patterns; the HTTP 429/Retry-After contract is an application-layer concern, not an LB infrastructure concern.
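A minimal sketch of per-source-IP limiting at the LB, assuming a token-bucket algorithm. The note above does not name an algorithm, so token bucket is an illustrative choice, and the class name and parameters are hypothetical. Time is passed in explicitly so the behaviour is deterministic.

```python
from typing import Dict, Tuple


class PerIpTokenBucket:
    """Token-bucket limiter keyed by source IP; time is passed in explicitly."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        # ip -> (tokens remaining, timestamp of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, source_ip: str, now: float) -> bool:
        tokens, last = self._buckets.get(source_ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate_per_s)
        if tokens >= 1.0:
            self._buckets[source_ip] = (tokens - 1.0, now)
            return True
        self._buckets[source_ip] = (tokens, now)
        return False
```

Keying on source IP shares one bucket among all clients behind the same NAT, which is one reason finer-grained limits belong in the application layer.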
Pitfalls
Describing L4 and L7 without stating when to use each is incomplete and unhelpful. The decision criterion is: use L4 for homogeneous, latency-sensitive traffic where payload inspection is unnecessary; use L7 when routing must be content-aware (URL routing, cookie-based affinity, SSL termination, A/B testing, gRPC method routing). Without this criterion, teams default to L7 for all cases and pay an unnecessary latency penalty.
Aggressive health check intervals (1-second interval, 1-failure threshold) cause flapping — healthy servers repeatedly marked down and up during transient network jitter. This generates alert noise and briefly drops capacity on every jitter event. Use a 3+ failure threshold with 5-10 second intervals as a starting point. Tune based on observed false positive rate under normal traffic.
Existing Pattern Connections
- Circuit-Breaker-Pattern — circuit breaker at the service layer detects downstream failures and opens the circuit to stop cascading calls; load balancer health checks at the infrastructure layer detect backend unavailability and route around it; they are complementary layers of health management operating at different scopes
- Service-Mesh-Pattern — a service mesh (Istio, Linkerd) provides L7 load balancing via sidecar proxies distributed to every service instance; the mesh makes LB behaviour programmable via a control plane rather than requiring per-LB configuration; mesh-based LB operates within a cluster, infrastructure LB operates at the cluster boundary
- Ambassador-Pattern — the Ambassador sidecar handles retry, timeout, and load balancing for a single service's outbound calls; an infrastructure load balancer handles it at the cluster or data-centre boundary; they operate at different scopes and can coexist in the same system
- Consistent-Hashing — consistent hashing is one of the LB routing algorithms listed above; understanding the ring mechanics — how keys map to nodes and how node addition/removal minimises remapping — is a prerequisite for session-affinity LB design
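The remapping difference between IP hash (modulo) and a consistent-hash ring can be sketched as below. The virtual-node count, the SHA-256 hash, and all names are illustrative assumptions rather than a reference implementation.

```python
import bisect
import hashlib
from typing import Callable, List, Sequence


def _hash(key: str) -> int:
    # Stable hash (Python's built-in hash() is randomised per process).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_pick(key: str, servers: List[str]) -> str:
    """IP-hash style: hash modulo server count."""
    return servers[_hash(key) % len(servers)]


class HashRing:
    """Consistent-hash ring with virtual nodes (vnode count is an arbitrary choice)."""

    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        self._points = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def pick(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]


def moved_fraction(
    before: Callable[[str], str], after: Callable[[str], str], keys: Sequence[str]
) -> float:
    """Fraction of keys whose server changes between two pickers."""
    return sum(before(k) != after(k) for k in keys) / len(keys)
```

Growing a pool from four to five servers, the modulo scheme is expected to remap about four in five keys, while the ring is expected to remap about one in five, and every key that moves on the ring moves to the newly added server. Comparing `moved_fraction` for the two pickers over a sample of client IPs shows the gap directly.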