Database Sharding

Database Sharding

Horizontal partitioning of data across multiple database nodes (shards) to scale write throughput and storage capacity beyond the limits of a single server.

When NOT to Use

  • Data volume fits on a single well-provisioned server — sharding adds operational complexity (connection routing, cross-shard queries, schema migrations) without benefit. Vertical scaling (larger hardware) and Database-Replication for read scaling should be exhausted first.
  • Write throughput can be addressed with a primary + async replicas — try Database-Replication before sharding; sharding adds an order-of-magnitude more operational complexity. Replication scales reads; if that is the constraint, sharding is the wrong tool.
  • Queries frequently span shard boundaries — cross-shard scatter-gather may perform worse than a single-node full table scan. If the dominant query pattern cannot be served by a single shard, sharding hurts rather than helps. Model your access patterns before committing to a shard key.

Core Mechanism

Sharding routes each row of data to exactly one node (shard) based on a shard key. The shard key is a column or composite of columns chosen at design time — it determines data locality and query routing for the lifetime of the schema.

Horizontal vs Vertical Sharding

Two fundamentally different partitioning strategies:

  • Horizontal sharding (partitioning): split rows across multiple nodes based on a shard key; each shard holds a subset of rows for the same schema. Every shard has the same table structure but contains different data. Scales data volume and write throughput — this is the primary meaning of "sharding" in system design interviews. Choose this when data exceeds single-server limits or write throughput exceeds single-primary capacity.
  • Vertical sharding: split tables or columns across different nodes by domain (e.g., user service owns the users table, order service owns the orders table). Natural with microservices; bounded context maps to a shard boundary (see Bounded-Context). Not the primary meaning of "sharding" in system design interviews — clarify the term when discussing with interviewers.

Cross-Shard Queries

When a query spans multiple shards, the coordinator must fan out to all relevant shards and merge results — this is scatter-gather. Cross-shard queries are the primary operational cost of sharding:

  • JOINs across shards are expensive — requires scatter-gather (fan-out to all shards + aggregate results). Avoid by co-locating related data on the same shard. For example, shard both users and their orders by userId — all of a user's data lives on the same shard, so most queries never cross shard boundaries.
  • Global secondary indexes: each shard maintains a local index; a global index requires scatter-gather across all shards. DynamoDB global secondary indexes replicate index entries to a separate partition. Accept the scatter-gather cost or denormalise the access pattern to avoid the global index entirely.
  • Aggregation queries (COUNT, SUM): require fan-out to all shards. Push aggregation down where possible — partial aggregation per shard, merge at coordinator. This is the map-reduce pattern at the storage layer.

Component Diagram

Database-Sharding-diagram.excalidraw

Key Variants

StrategyMechanismWhen to UseHotspot Risk
Range shardingShard key is a range (e.g., userId 0-1M -> Shard 1, 1M-2M -> Shard 2)Efficient range scans; geographic partitioningHIGH — monotonically increasing keys concentrate writes on last shard
Hash shardingShard key = hash(key) % N; distribute to shard by hash valueUniform write distribution; no range scanLOW — but no efficient range queries; entire table scan for range predicates
Directory (lookup) shardingA separate lookup service maps each key to its shardMaximum flexibility; supports non-uniform distributionMEDIUM — lookup service is a bottleneck and single point of failure
Consistent hash shardingUses consistent hash ring to map keys to shardsSame as hash but minimises key movement during shard add/removeLOW — same as hash; preferred over modulo hash for resharding efficiency

Choosing a strategy:

  • Default: hash or consistent hash sharding — avoids hotspots, distributes writes uniformly
  • When range queries dominate: range sharding — but use non-monotonic keys (UUIDs, Snowflake IDs) to avoid the write hotspot
  • When data distribution is highly irregular and flexibility is needed: directory sharding — at the cost of the lookup service being a critical dependency
  • When resharding is expected: consistent hash sharding — O(key_count/N) key moves vs full remapping with modulo hash

Design Decisions

Shard Key Selection

The shard key is the single most important design decision. A poor shard key causes hotspots, cross-shard queries, or difficult resharding. Selection criteria:

  • High cardinality: shard key must have enough distinct values to distribute data across all shards
  • Access pattern alignment: choose a key that keeps frequently queried related data on the same shard (co-location)
  • Write distribution: avoid monotonically increasing keys (timestamps, auto-increment IDs) as the primary shard key
  • Immutability: changing a row's shard key requires moving it to a new shard — design the shard key to not change over the row's lifetime

Hotspot Causes and Mitigations

Three distinct causes of hotspots, each requiring a different mitigation:

Cause 1 — monotonic shard key: auto-increment ID or timestamp as shard key means all new writes go to the same shard. The last shard becomes a write hotspot while all other shards sit idle.

  • Mitigation: add random prefix to shard key (e.g., prefix_{0..N}_userId); use UUIDs or Snowflake IDs; use hash sharding to scatter writes uniformly.

Cause 2 — celebrity / hot partition: one key receives disproportionate traffic (e.g., a viral post ID, a famous user's profile). Even a perfectly distributed shard scheme will have a hot shard when one key dominates traffic.

  • Mitigation: split hot partition into sub-shards; add a suffix 0-N to the hot key and route to N shards (scatter-gather on read); cache hot keys in Distributed-Cache to absorb reads before they reach the shard.

Cause 3 — imbalanced range boundaries: range shard sizes grow unevenly over time. Shards covering popular key ranges overflow while shards covering sparse ranges stay empty. This is especially common in multi-tenant systems where tenant activity is skewed — one tenant generates 80% of all writes.

  • Mitigation: dynamic split (split overfull shards into two without downtime); consistent hashing (see Consistent-Hashing) — node addition remaps only adjacent keys, rebalancing naturally. Monitor shard sizes and trigger splits proactively before a shard becomes a bottleneck.

Resharding

Resharding is the process of adding or removing shards as data grows or shrinks. It is operationally expensive — plan shard key selection to minimise resharding frequency.

Adding a shard with hash sharding (key % N): all keys must be remapped — N becomes N+1, and almost every hash(key) % N changes. This requires a full data migration across all shards. Operationally expensive; migration window can be hours for large datasets.

Adding a shard with consistent hashing: only the keys in the new shard's clockwise range on the hash ring are remapped — O(key_count/N) moves. For a 10-shard cluster with 1M keys, adding a shard remaps ~100K keys, not ~1M. This is the primary advantage of consistent hash sharding over modulo hash sharding. See Consistent-Hashing for ring mechanics.

CAP tradeoff during resharding (CRITICAL — per INFRA-07 Pitfall 6): during resharding, reads may return pre-migration data — an eventual consistency window. Consistent hashing minimises this window compared to full remapping because fewer keys are in-flight at any time. While migration is in progress, use double-write (write to both old and new shard) or read-from-both strategies; new shard serves reads only after migration of its range is complete. See CAP-Theorem for the formal framework for reasoning about consistency guarantees during this transition.

Removing a shard: requires migrating that shard's key range to a neighbouring shard before decommissioning. With consistent hashing, the key range is simply assigned to the next shard clockwise on the ring. Coordinate removal during low-traffic periods and verify data integrity before decommissioning.

Operational Concerns

Schema migrations across shards: any DDL change (new column, new index, dropping a column) must be applied to every shard. With 50 shards, a migration requires 50 separate ALTER TABLE executions — each with its own lock window. Use online schema change tools (pt-online-schema-change, gh-ost) to avoid table locks, and apply migrations shard-by-shard with health checks between each.

Cross-shard transactions: distributed transactions across shards require 2PC (Two-Phase Commit) or saga coordination. 2PC introduces a coordinator and is susceptible to coordinator failure leaving shards in a blocked state. For most sharding use cases, redesign the data model to avoid cross-shard writes — if two entities always participate in the same transaction, they should share a shard key.

Connection routing: application code (or a proxy layer) must route each query to the correct shard. Options:

  • Application-layer routing: application computes the shard key and connects to the correct shard directly. Low latency; requires shard topology knowledge in every service.
  • Proxy-layer routing: a proxy (ProxySQL, Vitess, Citus) receives all queries and routes to the correct shard. Simpler for applications; proxy is a new bottleneck and single point of failure to manage.

Pitfalls

Using key % N for shard assignment

Modulo hashing (hash(key) % N) requires full data migration when adding or removing a shard — every key must be remapped. Use consistent hash sharding instead: only O(1/N) keys move when the shard count changes. See Consistent-Hashing for ring mechanics.

Hotspot from monotonic shard keys

Auto-increment IDs or timestamps as shard keys concentrate all new writes on the last shard. This is the most common sharding failure mode. Use hash-based shard keys (UUIDs, Snowflake IDs) or add a random prefix to distribute writes uniformly.

Existing Pattern Connections

  • CAP-Theorem — during resharding, consistency guarantees are weakened (reads may return data from the pre-migration state); CAP-Theorem provides the formal framework for understanding this tradeoff
  • Consistent-Hashing — consistent hash sharding uses the hash ring mechanics to minimise key movement during shard addition/removal; the Phase 28 note is a direct prerequisite
  • Aggregate — DDD aggregates co-located on the same shard by aggregate ID are a natural unit for shard key selection; aggregate boundary and shard boundary align
  • Bounded-Context — bounded context boundaries are natural vertical shard boundaries (each context owns its data); the CAP-Theorem consistency model applies within each context

Summary: When to Shard

Sharding is appropriate when the data tier has exhausted vertical scaling and read-replica scaling, AND when the primary constraint is write throughput or total data volume — not read throughput or query complexity. The shard key selection determines the operational burden for the lifetime of the system. Choose carefully; shard keys are expensive to change.