Database Replication
Database Replication
Maintaining copies of the same data on multiple database nodes to improve read throughput, provide fault tolerance, and reduce read latency by serving reads from geographically closer replicas.
When NOT to Use
- Write throughput exceeds what a single primary can handle — replication scales reads, not writes; use Database-Sharding to scale writes. A single primary is the write bottleneck regardless of how many replicas exist. Replication is the wrong tool when the constraint is write capacity.
- Writes require cross-region low latency — single-leader from a remote region has unacceptable write latency for users in that region; consider multi-leader or geo-partitioning (sharding by region). Replication does not solve write latency for geographically distributed users who need to write locally.
Core Mechanism
Replication keeps multiple copies of the same data on different nodes. The defining question is: which nodes are allowed to accept writes?
Single-Leader (Primary-Replica)
All writes go to one primary; replicas receive changes via replication log. Simple mental model. Write throughput limited by single primary. Read throughput scales horizontally via replicas — adding replicas adds read capacity linearly. Most common model:
- PostgreSQL streaming replication — physical WAL (Write-Ahead Log) streaming
- MySQL binlog replication — logical binlog events
- MongoDB replica sets — primary election with Raft-based consensus
The primary maintains a replication log; replicas replay it to stay current. The lag between primary write and replica application is replication lag.
Multi-Leader (Multi-Master)
Multiple nodes accept writes simultaneously; changes propagate between leaders via asynchronous replication. Enables geo-distributed writes — each region has a local leader, so writes do not cross region boundaries. Requires write conflict resolution because two leaders can accept conflicting writes for the same key:
- Last-write-wins (LWW): each write is timestamped; the write with the later timestamp wins. Simple but risks discarding recent writes on nodes with skewed clocks.
- Application-specific merge: application defines a merge function for conflicting writes (e.g., union of sets for collaborative editing).
- CRDTs (Conflict-free Replicated Data Types): data structures that commute — concurrent writes always merge to a deterministic result without conflict.
Complexity is high; use only when single-leader write latency from remote regions is unacceptable.
Leaderless (Dynamo-Style)
Any replica accepts reads and writes; coordinator routes to N replicas using quorum (W writes + R reads, W+R > N for strong consistency). No single point of failure — no primary election is required on failure; the coordinator simply routes to healthy replicas.
Quorum formula: With N total replicas, W write acknowledgements, R read replicas:
- W + R > N → at least one replica overlaps between write and read set → strong consistency
- Typical defaults: N=3, W=2, R=2 (strong); N=3, W=1, R=1 (low latency, eventual)
Products: Cassandra, Riak, DynamoDB
Conflict resolution:
- Last-write-wins (timestamp) — risk of discarding writes on nodes with clock skew
- Read-repair: coordinator detects stale replica during a read, writes back the latest value
- Anti-entropy: background process continuously compares replicas and repairs divergence
Component Diagram
Database-Replication-diagram.excalidraw
Key Variants
Replication models differ by which nodes accept writes. Replication methods differ by when the primary considers a write acknowledged.
| Method | Mechanism | Consistency | Write latency | Use when |
|---|---|---|---|---|
| Synchronous | Primary waits for replica acknowledgement before acknowledging client | Strong | Higher | Replica is a hot standby; zero data loss required |
| Asynchronous | Primary acknowledges client immediately; replicas receive changes eventually | Eventual | Lower | Default for most read-scaling; acceptable replica lag |
| Semi-synchronous | One replica is synchronous; others are async | One strong replica; others eventual | Medium | Balance between durability and latency (MySQL default) |
Key insight: Synchronous replication trades write latency for durability — if the synchronous replica is unavailable, the primary must wait or refuse writes. Asynchronous replication achieves lower write latency but accepts a data loss window equal to the replication lag if the primary fails.
Design Decisions
Replication Lag Consequences
Async replication introduces lag between the primary and its replicas. Two consistency anomalies emerge when reads are served from replicas:
- Read-your-writes violation: user writes a value then reads a stale replica — perceives their write was lost. Common in social features: user posts an update, refreshes page, sees stale content.
- Monotonic reads violation: reading from different replicas at different lag states — perceives data moving backward in time. User queries a value, refreshes, and gets an older value because the second read hit a more-lagged replica.
Lag Mitigations
| Violation | Mitigation | Tradeoff |
|---|---|---|
| Read-your-writes | Route each user's reads to the same replica (sticky session) | Uneven replica load if user access is skewed |
| Read-your-writes | Route writes and immediately-subsequent reads to the primary | Increases primary read load |
| Both | Use timestamp-based reads (read replica only after it has caught up past the write timestamp) | Read may stall waiting for replica to catch up |
| Monotonic reads | Consistent replica assignment per user session | Requires session-to-replica mapping state |
Failure Modes
Two nodes simultaneously believe they are primary; both accept writes; data diverges. Prevention: fencing (STONITH — Shoot The Other Node In The Head); quorum-based election (only elect if majority of replicas agree). Automatic failover (Patroni, MHA) is faster but carries split-brain risk if network partition makes primary appear dead while still running. Manual failover is slower but reduces split-brain risk.
If the primary fails before un-replicated writes propagate to any replica, those writes are lost. The window size equals the replication lag. For zero data loss, use synchronous replication to at least one replica — at the cost of higher write latency.
Failover strategies:
- Manual failover: operations team promotes replica to primary after verifying replica lag is acceptable. Slower to execute; reduces risk of split-brain. Requires on-call human in the recovery path.
- Automatic failover: orchestrator (Patroni, MHA) detects primary failure via heartbeat; elects a new primary from the replica set. Faster recovery (seconds vs minutes); split-brain risk requires fencing or quorum-based election to mitigate. Typical production choice for high-availability systems.
Failover edge cases:
- Replica with stale data is promoted — lost writes from un-replicated lag become the new baseline
- New primary is announced, old primary recovers and rejoins — two nodes believe they are primary simultaneously
- Network partition heals, old primary sees the new primary — requires fencing to prevent old primary from continuing to serve writes
Existing Pattern Connections
- CAP-Theorem — asynchronous replication is an AP choice (availability over consistency); synchronous replication is a CP choice (consistency at the cost of write availability if replica is unreachable)
- Consistent-Hashing — leaderless replication uses a hash ring to determine which N replicas own each key; the ring mechanics directly apply
- CQRS-Pattern — database replication (read replicas) is an infrastructure-layer implementation of CQRS read/write separation at the storage tier; CQRS provides the application-level pattern on top