Load Balancer

Infrastructure component that distributes incoming network traffic across multiple backend servers to improve throughput, reduce latency, and provide fault tolerance through redundancy.

When NOT to Use

  • Single-server applications — load balancer adds latency and cost with no benefit; add one only when horizontal scaling is required
  • Stateful protocols with no session management — if each request requires full session state and no session store exists, consistent hashing or sticky sessions must be configured, adding complexity that often exceeds the benefit
  • Sub-millisecond latency requirements where LB hop overhead matters — use service mesh sidecar with direct routing or client-side load balancing instead; the extra network hop through a centralised LB is unacceptable when the entire latency budget is 1ms

Core Mechanism

L4 vs L7 Load Balancing

Two distinct layers at which load balancing can operate:

  • L4 (Transport layer): routes by IP address + TCP/UDP port; does not inspect the request payload; lower latency, lower CPU overhead per connection
  • L7 (Application layer): routes by HTTP header, URL path, cookie, gRPC method, or request body content; enables content-based routing, SSL termination, and request rewriting

Decision criterion (required by INFRA-01): Use L4 when minimum latency matters and all traffic is homogeneous (all requests go to identical backend pools). Use L7 when routing must be content-aware — different URL paths to different backend pools, A/B routing, authentication offload, gRPC method routing, or WebSocket upgrade handling.
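
As an illustration of the content-aware case, the following minimal sketch routes by URL path prefix, which an L4 balancer cannot do because it never sees the path. The route table, pool names, and the `pick_pool` helper are hypothetical and chosen only for this example.

```python
from typing import Dict, List

# Hypothetical route table: URL path prefix -> backend pool.
ROUTES: Dict[str, List[str]] = {
    "/api/": ["api-1:8080", "api-2:8080"],
    "/static/": ["static-1:8080"],
    "/": ["web-1:8080", "web-2:8080"],
}


def pick_pool(path: str, routes: Dict[str, List[str]] = ROUTES) -> List[str]:
    """Return the pool whose prefix is the longest match for the request path."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        raise LookupError(f"no route for {path!r}")
    # Longest prefix wins, so "/api/users" goes to the API pool, not the "/" pool.
    return routes[max(matches, key=len)]
```

If every path would resolve to the same pool, the route table carries no information and L4 is likely sufficient.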

Health Checks

The LB continuously monitors backend health to route traffic only to healthy servers:

  • Active health checks: LB sends periodic probe requests (HTTP GET /health, TCP connect attempt) to each backend on a schedule; marks a server unhealthy after N consecutive failures, healthy again after M consecutive successes
  • Passive health checks: LB monitors real client traffic; marks a server unhealthy after K connection errors or timeouts within window T; no extra probe traffic, but detects failures only when real traffic triggers them

Configuration tradeoffs: shorter interval and lower failure threshold detect failures faster but increase false positive rate during transient network jitter. A typical starting point: 5-second interval, 3-failure threshold, 2-success recovery threshold.
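
The N-failure / M-success rule for active checks can be sketched as a small state machine. This is a minimal illustration, assuming the probe result is supplied by the caller; the `HealthTracker` class and its field names are hypothetical, and the defaults mirror the starting point above (3 failures, 2 successes).

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    failure_threshold: int = 3   # N consecutive failures -> unhealthy
    success_threshold: int = 2   # M consecutive successes -> healthy again
    healthy: bool = True
    _fails: int = 0
    _oks: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok:
            self._fails = 0
            self._oks += 1
            if not self.healthy and self._oks >= self.success_threshold:
                self.healthy = True
        else:
            self._oks = 0
            self._fails += 1
            if self.healthy and self._fails >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

Because a single success resets the failure counter, two isolated failures separated by a success do not take the server out of the pool. This is what damps false positives during transient jitter.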

Component Diagram

Load-Balancer-diagram.excalidraw

Key Variants

Algorithm comparison — how the LB selects the next backend for each incoming request:

| Algorithm | Mechanism | Use When |
|---|---|---|
| Round-robin | Cyclic assignment to next server in sequence | Requests are stateless and roughly equal cost |
| Weighted round-robin | Round-robin with weight per server; higher-weight servers receive proportionally more requests | Servers have unequal capacity (e.g., different hardware generations) |
| Least connections | Route to the server with the fewest active connections at routing time | Variable request duration (long-lived connections, streaming, WebSocket) |
| Consistent hashing | Hash(client_ip or session_id) maps to a position on a ring that resolves to a server | Session affinity required; stateful protocol (WebSocket, sticky sessions); minimises remapping on server add/remove |
| IP hash | Hash(source_ip) modulo server count → server index | Simple session affinity without a full ring; deterministic but breaks if server count changes |
| Random | Select a server uniformly at random | Simple stateless load distribution; performs similarly to round-robin at scale |
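
Three of the selection rules above can be sketched in a few lines. This is a simplified illustration, not how any particular load balancer implements them: the function names are hypothetical, and the weighted variant uses naive list expansion, which sends requests to a heavy server in bursts, whereas real implementations typically interleave the choices more evenly.

```python
import itertools
from typing import Dict, Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Yield servers in a repeating cycle."""
    return itertools.cycle(servers)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Naive weighting: each server appears `weight` times per cycle."""
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def least_connections(active: Dict[str, int]) -> str:
    """Pick the server with the fewest active connections (first one on ties)."""
    return min(active, key=active.get)
```

For example, `least_connections({"a": 5, "b": 2, "c": 7})` selects `"b"`. Round-robin would ignore the connection counts entirely.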

Design Decisions

Active/Passive Failover Models

How traffic is handled when a backend or entire datacenter fails:

  • Active-active (multi-active): traffic is distributed across all active backends simultaneously; when one fails, the LB removes it from the pool and remaining backends absorb the load; highest utilisation and throughput
  • Active-passive (hot standby): one active server handles all traffic; standby server receives no traffic until the active fails; provides a clean consistency model at the cost of lower utilisation; standby must stay synchronised to be ready
  • Multi-region active-active: traffic routed to nearest region; each region runs an active pool; reduces latency globally; requires cross-region data consistency strategy
  • Multi-region active-passive: one region is primary; a secondary region is on standby; simpler consistency model; higher latency for users far from primary region

SSL Termination

L7 load balancers can terminate TLS at the LB and forward unencrypted (or re-encrypted) traffic to backends. Benefit: centralises certificate management, offloads TLS handshake CPU from backends. Tradeoff: traffic between LB and backends travels in plaintext unless the LB re-encrypts it. The alternative, TLS passthrough, forwards the encrypted stream untouched to the backend, so the LB never sees plaintext but also cannot perform L7 inspection or routing.

Connection Draining (Graceful Deregistration)

When removing a backend from the pool (planned maintenance, rolling deploy), the LB should stop routing new requests to it and let existing connections drain before the backend is shut down. Draining allows in-flight requests to complete while no new requests are routed to the deregistered server. Drain timeout is typically 30-60 seconds; connections alive past the timeout are forcibly closed. Without draining, rolling deploys cause visible client errors.
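
The draining rule can be sketched as bookkeeping on one backend: refuse new requests once draining starts, and report the backend as safe to stop when either no requests remain in flight or the drain timeout has elapsed. The `DrainingBackend` class is hypothetical, and the injectable `clock` parameter is an assumption made so the behaviour can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class DrainingBackend:
    def __init__(self, drain_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.drain_timeout_s = drain_timeout_s
        self.clock = clock
        self.in_flight = 0
        self.draining_since: Optional[float] = None

    def accepts_new_requests(self) -> bool:
        return self.draining_since is None

    def start_request(self) -> None:
        if not self.accepts_new_requests():
            raise RuntimeError("backend is draining; route elsewhere")
        self.in_flight += 1

    def finish_request(self) -> None:
        self.in_flight -= 1

    def begin_drain(self) -> None:
        self.draining_since = self.clock()

    def safe_to_stop(self) -> bool:
        """True once drained, or once the timeout forces closure."""
        if self.draining_since is None:
            return False
        timed_out = self.clock() - self.draining_since >= self.drain_timeout_s
        return self.in_flight == 0 or timed_out
```

A deploy script would call `begin_drain()`, poll `safe_to_stop()`, and only then terminate the process.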

Session Affinity (Sticky Sessions)

For applications that store session state locally (in-process), the LB can be configured to always route a given client to the same backend — "sticky sessions." Implemented via:

  • Cookie-based affinity: LB injects a session cookie on first request; routes subsequent requests from the same cookie to the same backend
  • IP-based affinity: consistent hash on source IP; less reliable if clients are behind NAT (many clients appear to share one IP)

Sticky sessions are a partial solution; the preferred approach is externalising session state to a distributed cache (Distributed-Cache) so that any backend can serve any client.
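
Cookie-based affinity can be sketched as follows. The cookie name `LB_BACKEND` and the `StickyRouter` class are hypothetical. The sketch stores the backend name directly in the cookie for readability; a real deployment would more likely use an opaque or signed identifier.

```python
import itertools
from typing import Dict, List, Optional, Tuple

COOKIE = "LB_BACKEND"  # hypothetical cookie name


class StickyRouter:
    def __init__(self, backends: List[str]) -> None:
        self.backends = list(backends)
        self._rr = itertools.cycle(self.backends)

    def route(self, cookies: Dict[str, str]) -> Tuple[str, Optional[str]]:
        """Return (backend, Set-Cookie value or None)."""
        pinned = cookies.get(COOKIE)
        if pinned in self.backends:
            return pinned, None  # honour the existing affinity
        # First request, or the pinned backend is gone: pick one and pin it.
        chosen = next(self._rr)
        return chosen, f"{COOKIE}={chosen}"
```

The fallback branch shows why sticky sessions are only a partial solution: when the pinned backend disappears, the client is re-pinned elsewhere and any in-process session state is lost.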

Rate Limiting and DDoS Mitigation at the LB Layer

L7 load balancers can inspect request headers and apply rate limits per source IP or per authenticated user before requests reach backends. This offloads simple rate limiting logic from each backend service. However, for rate limiting that depends on application semantics (per-user quotas, per-endpoint limits), see Operational-API-Patterns: the HTTP 429/Retry-After contract is an application-layer concern, not an LB infrastructure concern.
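
A simple per-source-IP limit of the kind an LB might apply can be sketched with a token bucket. The token bucket is one common choice and is an assumption here, since the text above does not name an algorithm; the `PerIpRateLimiter` class and its parameters are likewise hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class PerIpRateLimiter:
    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s      # tokens refilled per second
        self.burst = burst          # bucket capacity
        self.clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # ip -> (tokens, last_ts)

    def allow(self, ip: str) -> bool:
        """Consume one token for this IP if available."""
        now = self.clock()
        tokens, last = self._buckets.get(ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[ip] = (tokens, now)
        return allowed
```

Each IP gets an independent bucket, so one noisy client exhausting its tokens does not affect others. Clients behind a shared NAT address would, however, share a bucket.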

Pitfalls

Missing L4 vs L7 decision criterion

Describing L4 and L7 without stating when to use each is incomplete and unhelpful. The decision criterion is: use L4 for homogeneous, latency-sensitive traffic where payload inspection is unnecessary; use L7 when routing must be content-aware (URL routing, cookie-based affinity, SSL termination, A/B testing, gRPC method routing). Without this criterion, teams default to L7 for all cases and pay an unnecessary latency penalty.

Health check false positives causing flapping

Aggressive health check intervals (1-second interval, 1-failure threshold) cause flapping — healthy servers repeatedly marked down and up during transient network jitter. This generates alert noise and briefly drops capacity on every jitter event. Use a 3+ failure threshold with 5-10 second intervals as a starting point. Tune based on observed false positive rate under normal traffic.

Existing Pattern Connections

  • Circuit-Breaker-Pattern — circuit breaker at the service layer detects downstream failures and opens the circuit to stop cascading calls; load balancer health checks at the infrastructure layer detect backend unavailability and route around it; they are complementary layers of health management operating at different scopes
  • Service-Mesh-Pattern — a service mesh (Istio, Linkerd) provides L7 load balancing via sidecar proxies distributed to every service instance; the mesh makes LB behaviour programmable via a control plane rather than requiring per-LB configuration; mesh-based LB operates within a cluster, infrastructure LB operates at the cluster boundary
  • Ambassador-Pattern — the Ambassador sidecar handles retry, timeout, and load balancing for a single service's outbound calls; an infrastructure load balancer handles it at the cluster or data-centre boundary; they operate at different scopes and can coexist in the same system
  • Consistent-Hashing — consistent hashing is one of the LB routing algorithms listed above; understanding the ring mechanics — how keys map to nodes and how node addition/removal minimises remapping — is a prerequisite for session-affinity LB design
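
The ring mechanics mentioned in the last item can be sketched as below, alongside the modulo scheme used by plain IP hash for comparison. This is a minimal illustration under stated assumptions: SHA-256 as the hash, 100 virtual nodes per server, and the `HashRing` and `modulo_lookup` names are all choices made for this example rather than anything prescribed above.

```python
import bisect
import hashlib
from typing import List


def _h(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        # Each server is placed at `vnodes` positions to even out the load.
        self._points = sorted(
            (_h(f"{server}#{i}"), server) for server in servers for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the next server point."""
        idx = bisect.bisect_right(self._keys, _h(key)) % len(self._keys)
        return self._points[idx][1]


def modulo_lookup(key: str, servers: List[str]) -> str:
    """IP-hash style: hash modulo server count."""
    return servers[_h(key) % len(servers)]
```

When a server is added to the ring, a key either keeps its old server or moves to the new one; it never moves between two pre-existing servers. With `modulo_lookup`, changing the server count changes the divisor, so most keys can land on a different server.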