Chat System Design

System that enables real-time messaging between users via persistent WebSocket connections, with message ordering guarantees, presence tracking, and offline delivery.

Clarify First

Before designing, lock these assumptions with the interviewer:

  1. 1:1 chat only or group chat support? — Group chat introduces fan-out amplification; groups with 1K+ members require the same hybrid push/pull strategy as the news feed celebrity problem — see News-Feed-Design for the parallel.
  2. Message types? — Text only vs text + images + files + voice; media handling requires a separate blob storage path and CDN; scope to text-only initially and note extensions.
  3. Presence required? — Online/offline/last-seen indicator requires a heartbeat service and pub-sub fan-out for status updates; confirm if this is in scope.
  4. Message retention? — 30 days vs indefinite; storage sizing diverges substantially; determine before committing to a storage tier.
  5. End-to-end encryption? — Signal protocol (keys held by clients; server cannot read) vs server-side at-rest encryption; E2E changes the key management architecture significantly.

Capacity Estimation

Derivation chain for a large-scale messaging platform (2026):

Assumption: 500M DAU (WhatsApp scale, 2026 estimate)

Concurrent connections:
  peak_concurrent_users = 500M x 0.1 (10% online simultaneously) = 50M connections
  connections_per_server = 50,000 (WebSocket server capacity estimate)
  chat_servers_needed = 50M / 50,000 = 1,000 chat servers

Message volume:
  messages_per_DAU_per_day = 40
  total_messages_per_day = 500M x 40 = 20B messages/day
  message_QPS = 20B / 86,400 ~ 231,000 QPS

Storage (1-year retention):
  message_size ~ 100 bytes (text only)
  daily_storage = 20B x 100 bytes = 2 TB/day
  yearly_storage = 2 TB x 365 = 730 TB (~0.73 PB; rounded up to 1 PB/year for planning)
  -> NoSQL (wide-column) preferred; message reads are by (conversation_id, seq_id range)

Cross-reference: Capacity-Estimation for the shared DAU-to-QPS-to-storage methodology. Wide-column vs relational tradeoff is covered in SQL-vs-NoSQL.

Conclusion: 1,000 WebSocket chat servers needed at peak. 231K message QPS requires sharded NoSQL storage. 1 PB/year for 1-year retention demands wide-column architecture with efficient range queries.
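
The derivation chain above can be reproduced as a short script. This is only a sketch of the arithmetic: every constant is one of the assumptions stated in the estimate, and the function and variable names are illustrative.

```python
# Back-of-envelope capacity math. All inputs are assumptions from the estimate above.
DAU = 500_000_000
PEAK_ONLINE_PERCENT = 10          # 10% of DAU online simultaneously
CONNECTIONS_PER_SERVER = 50_000   # WebSocket server capacity estimate
MESSAGES_PER_DAU_PER_DAY = 40
MESSAGE_SIZE_BYTES = 100          # text only
SECONDS_PER_DAY = 86_400


def estimate_capacity() -> dict:
    peak_connections = DAU * PEAK_ONLINE_PERCENT // 100
    # Ceiling division so a partial server still counts as a whole one.
    chat_servers = -(-peak_connections // CONNECTIONS_PER_SERVER)
    messages_per_day = DAU * MESSAGES_PER_DAU_PER_DAY
    message_qps = messages_per_day / SECONDS_PER_DAY
    daily_storage_tb = messages_per_day * MESSAGE_SIZE_BYTES / 1e12
    yearly_storage_tb = daily_storage_tb * 365
    return {
        "peak_connections": peak_connections,
        "chat_servers": chat_servers,
        "messages_per_day": messages_per_day,
        "message_qps": message_qps,
        "daily_storage_tb": daily_storage_tb,
        "yearly_storage_tb": yearly_storage_tb,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.0f}")
```

Changing a single input (for example, message size once media metadata is included) shows how quickly the storage figure moves relative to the server count.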

Central Technical Problem

Message ordering guarantees and offline delivery in a distributed chat system.

Messages from different senders arrive at servers in non-deterministic order. Without monotonic sequence numbers, recipients see messages out of order. Offline users need messages queued and delivered in order upon reconnection. These two problems are coupled — the same seq_id that enforces ordering also identifies message delivery gaps.

Message ordering

Two approaches:

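Server-assigned seq_id (chosen approach): The chat server assigns a monotonically increasing seq_id per conversation, and recipients order messages by seq_id. This requires a sequence number service: either a single writer per conversation (a potential bottleneck) or a distributed Snowflake-style ID generator. See Unique-ID-Generator-Design for Snowflake mechanics. Gap detection: the client tracks last_received_seq_id, and any gap in the sequence triggers a re-sync request to the server. Note that detecting gaps by contiguity assumes dense per-conversation counters; Snowflake-style IDs are time-ordered but not contiguous, so with them the client instead re-syncs by requesting everything after last_received_seq_id.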

Client-assigned timestamps (rejected): Clients attach their local wall-clock timestamps to messages. These are subject to clock skew: two messages sent at nearly the same time from different clients may carry identical or inverted timestamps. Mitigating skew requires NTP synchronization or vector clocks, which adds coordination complexity without eliminating the problem.

Decision: Server-assigned seq_id per conversation is the standard approach. Snowflake-style distributed IDs handle ID assignment at scale without a coordination bottleneck. Re-delivered messages on reconnect are deduplicated by seq_id — see Idempotent-Consumer for the idempotent delivery pattern.
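
A minimal sketch of the ordering scheme, assuming a dense per-conversation counter held by a single writer (not Snowflake-style IDs, which are not contiguous). The class names `ConversationSequencer` and `ClientInbox` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ConversationSequencer:
    """Hypothetical single-writer counter: one dense sequence per conversation."""
    _last: dict = field(default_factory=lambda: defaultdict(int))

    def assign(self, conversation_id: str) -> int:
        self._last[conversation_id] += 1
        return self._last[conversation_id]


@dataclass
class ClientInbox:
    """Client-side view of one conversation: dedup plus gap detection."""
    last_received_seq_id: int = 0
    delivered: list = field(default_factory=list)

    def receive(self, seq_id: int, body: str) -> str:
        if seq_id <= self.last_received_seq_id:
            return "duplicate"   # already shown; drop silently
        if seq_id > self.last_received_seq_id + 1:
            return "gap"         # caller should re-sync from last_received_seq_id
        self.delivered.append((seq_id, body))
        self.last_received_seq_id = seq_id
        return "accepted"
```

In this sketch a "gap" result does not buffer the out-of-order message; the client is expected to ask the server for everything after `last_received_seq_id`, which also returns the message that revealed the gap.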

Offline delivery

When a recipient is offline, messages are stored in a durable offline queue per recipient. See Message-Queue for at-least-once delivery guarantees.

On reconnect, the client sends its last_received_seq_id to the server. The server delivers all messages with seq_id > last_received in order. Message retention window (e.g., 30 days) bounds offline queue storage. After the retention window, undelivered messages are dropped — this must be communicated in the product (e.g., "Messages older than 30 days may not have been delivered").
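
The reconnect flow can be sketched as an in-memory queue. This is an illustration under assumptions: the queue is keyed by (recipient, conversation) because seq_id is per conversation, entries are kept until acknowledged to preserve at-least-once delivery, and `OfflineQueue` with its method names is hypothetical.

```python
import time
from collections import defaultdict

RETENTION_SECONDS = 30 * 24 * 3600  # 30-day window from the example above


class OfflineQueue:
    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self.retention_seconds = retention_seconds
        # (recipient_id, conversation_id) -> list of (seq_id, stored_at, body)
        self._queues = defaultdict(list)

    def enqueue(self, recipient_id, conversation_id, seq_id, body, now=None):
        stored_at = time.time() if now is None else now
        self._queues[(recipient_id, conversation_id)].append((seq_id, stored_at, body))

    def replay_after(self, recipient_id, conversation_id, last_received_seq_id, now=None):
        """Return pending messages newer than the client's cursor, in seq_id order."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        pending = [
            (seq_id, body)
            for seq_id, stored_at, body in self._queues[(recipient_id, conversation_id)]
            if seq_id > last_received_seq_id and stored_at >= cutoff
        ]
        return sorted(pending)

    def ack(self, recipient_id, conversation_id, acked_seq_id):
        """Trim entries only after the client confirms receipt (at-least-once)."""
        key = (recipient_id, conversation_id)
        self._queues[key] = [e for e in self._queues[key] if e[0] > acked_seq_id]
```

Because entries survive until `ack`, a client that disconnects mid-replay simply receives the same messages again and deduplicates them by seq_id.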

WebSocket connection management

Each online user maintains a long-lived WebSocket to a chat server. Chat servers are stateful: they hold the active connection-to-user mapping. A connection service maps user_id -> chat_server_id, enabling message routing across the server pool. Connection routing uses Consistent-Hashing to spread users across chat servers so that no single server becomes a hotspot.
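
One way to picture the consistent-hashing assignment is a hash ring with virtual nodes. The sketch below is an assumption about how such a ring could look, not the note's prescribed implementation; `HashRing`, the `vnodes` parameter, and the choice of SHA-256 are illustrative.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent-hash ring mapping user_id -> chat_server_id."""

    def __init__(self, servers, vnodes=100):
        ring = []
        for server in servers:
            # Virtual nodes smooth out the distribution across servers.
            for i in range(vnodes):
                ring.append((self._hash(f"{server}#{i}"), server))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [owner for _, owner in ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, user_id: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(user_id)) % len(self._points)
        return self._owners[idx]
```

The property that matters for a stateful pool is that adding a server only moves the users that land on the new server; everyone else keeps their existing connection target.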

Message routing: sender's server receives the message → looks up recipient's chat_server_id via the connection service → forwards the message to the correct server via an internal service bus → recipient's server delivers over the open WebSocket.
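
The routing steps above can be sketched as follows. The registry is an in-memory stand-in, and `forward` and `enqueue_offline` are hypothetical callbacks representing the internal service bus and the offline queue write.

```python
class ConnectionService:
    """In-memory stand-in for the user_id -> chat_server_id registry."""

    def __init__(self):
        self._user_to_server = {}

    def register(self, user_id, chat_server_id):
        self._user_to_server[user_id] = chat_server_id

    def unregister(self, user_id):
        self._user_to_server.pop(user_id, None)

    def lookup(self, user_id):
        return self._user_to_server.get(user_id)


def route_message(connections, recipient_id, message, forward, enqueue_offline):
    """Forward to the recipient's chat server, or queue if no connection exists."""
    server_id = connections.lookup(recipient_id)
    if server_id is None:
        enqueue_offline(recipient_id, message)
        return "queued"
    forward(server_id, recipient_id, message)
    return "forwarded"
```

A real deployment would also have to handle a stale registry entry (the lookup succeeds but the target server no longer holds the socket); this sketch leaves that case out.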

The chat server acts as a mediator (see Mediator-Pattern): senders and receivers do not connect directly, and all message exchange goes through the chat server pool.

Presence service

Heartbeat-based: client sends a heartbeat every N seconds (e.g., 5 seconds). The presence service marks a user offline after a missed heartbeat timeout (e.g., 3 missed heartbeats = 15 seconds). Presence updates are published via pub-sub — see Observer-Pattern for the publish-subscribe model. Subscribed contacts receive status change notifications. The presence service can be AP under CAP-Theorem — slightly stale online status is acceptable; message ordering is CP (strong consistency on seq_id).
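
A minimal sketch of the heartbeat timeout, using the example values above (5-second interval, 3 missed heartbeats). `PresenceTracker` is a hypothetical name, timestamps are passed in explicitly to keep the example deterministic, and the pub-sub publish step is represented only by the list that `sweep` returns.

```python
class PresenceTracker:
    def __init__(self, heartbeat_interval=5.0, missed_allowed=3):
        # 3 missed heartbeats at 5 seconds each gives a 15-second timeout.
        self.timeout = heartbeat_interval * missed_allowed
        self._last_seen = {}

    def heartbeat(self, user_id, now):
        self._last_seen[user_id] = now

    def is_online(self, user_id, now):
        last = self._last_seen.get(user_id)
        return last is not None and (now - last) < self.timeout

    def sweep(self, now):
        """Return users whose heartbeats have timed out; callers publish these as offline."""
        expired = [u for u, seen in self._last_seen.items() if now - seen >= self.timeout]
        for user_id in expired:
            del self._last_seen[user_id]
        return sorted(expired)
```

With a cache that supports per-key expiry, the same effect can be had by setting the key's TTL to the timeout on every heartbeat, so that `sweep` becomes unnecessary.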

Group chat fan-out

Small groups (< 100 members): fan-out-on-write — message is written to all member offline queues. Write amplification is bounded by group size.

Large groups (1K+ members): the same hybrid push/pull decision as the news feed celebrity problem applies — see News-Feed-Design for the hybrid fan-out strategy. Above a group size threshold, messages are written to a shared group message store and members pull on read rather than receiving individual queue entries.
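
The push/pull split can be sketched as a single decision on group size. The threshold value, the `GroupFanout` class, and the cursor-based `pull` method are assumptions for illustration; the note itself only fixes the behavior at the two ends (under 100 members, and 1K+ members).

```python
from collections import defaultdict

FANOUT_THRESHOLD = 100  # hypothetical cutoff between push and pull


class GroupFanout:
    def __init__(self, threshold=FANOUT_THRESHOLD):
        self.threshold = threshold
        self.member_queues = defaultdict(list)  # push path: per-member queues
        self.group_store = defaultdict(list)    # pull path: shared per-group log

    def publish(self, group_id, members, sender_id, message):
        if len(members) < self.threshold:
            # Fan-out-on-write: one queue entry per recipient.
            for member_id in members:
                if member_id != sender_id:
                    self.member_queues[member_id].append((group_id, message))
            return "push"
        # Large group: a single write; members read from the shared log.
        self.group_store[group_id].append(message)
        return "pull"

    def pull(self, group_id, cursor):
        """Return messages at or after the member's cursor in the shared log."""
        return self.group_store[group_id][cursor:]
```

The write cost on the push path grows with group size, while the pull path costs one write regardless of size and shifts the work to each member's read.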

Component Design

[Client] <--WebSocket--> [Chat Server Pool]
                               |
                    [Connection Service]
                    (user_id -> server_id)
                               |
              +----------------+----------------+
              |                |                |
     [Message Service]  [Presence Service] [Group Service]
              |                |                |
     [Message Store]    [Status Cache]   [Member Store]
     (NoSQL, sharded    (Redis)          (relational)
      by conversation_id)
              |
     [Offline Queue]
     (per recipient)

Component responsibilities:

  • Chat Server Pool — maintains long-lived WebSocket connections; routes messages between senders and recipients; stateful per connection
  • Connection Service — service registry mapping user_id -> chat_server_id; queried on each message send to route to the correct chat server
  • Message Service — persists messages to durable store; assigns seq_id; manages offline queue writes for recipients who are not connected
  • Presence Service — tracks online/offline status via heartbeats; publishes status change events to subscriber contacts
  • Group Service — manages group membership; handles fan-out decisions (push for small groups, pull for large groups)
  • Message Store — wide-column NoSQL store sharded by conversation_id; range queries by (conversation_id, seq_id) retrieve conversation history efficiently — see Database-Sharding
  • Offline Queue — durable per-recipient message queue; at-least-once delivery on reconnect
  • Status Cache — Redis store for live presence state; TTL matches heartbeat timeout
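
To make the Message Store access pattern concrete, here is a small in-memory sketch of sharding by conversation_id with sorted (conversation_id, seq_id) keys. It stands in for a wide-column store and is not any particular database's API; `MessageStore`, `read_range`, and the shard count are hypothetical.

```python
import bisect
import hashlib


class MessageStore:
    def __init__(self, shard_count=4):
        # Each shard keeps parallel lists: sorted keys and their message bodies.
        self._shards = [([], []) for _ in range(shard_count)]

    def _shard_for(self, conversation_id):
        digest = hashlib.sha256(conversation_id.encode()).digest()
        return self._shards[int.from_bytes(digest[:8], "big") % len(self._shards)]

    def put(self, conversation_id, seq_id, body):
        keys, rows = self._shard_for(conversation_id)
        key = (conversation_id, seq_id)
        idx = bisect.bisect_left(keys, key)
        if idx < len(keys) and keys[idx] == key:
            rows[idx] = body  # same key again: overwrite, so retries are idempotent
        else:
            keys.insert(idx, key)
            rows.insert(idx, body)

    def read_range(self, conversation_id, start_seq, end_seq):
        """Return (seq_id, body) pairs with start_seq <= seq_id <= end_seq."""
        keys, rows = self._shard_for(conversation_id)
        lo = bisect.bisect_left(keys, (conversation_id, start_seq))
        hi = bisect.bisect_right(keys, (conversation_id, end_seq))
        return [(keys[i][1], rows[i]) for i in range(lo, hi)]
```

Because every message of a conversation hashes to the same shard and the keys are sorted, a history read touches one shard and one contiguous slice.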

System Diagram

Chat-System-Design-diagram.excalidraw

Alternatives Considered

| Decision | Alternative A | Alternative B | Why Chosen Approach Wins |
|---|---|---|---|
| Connection protocol | HTTP polling (client polls every N seconds) | HTTP long-polling (server holds request open) | WebSocket provides true bidirectional communication with low overhead; polling adds latency; long-polling is a workaround that wastes connections |
| Message ordering | Client-assigned timestamps | Vector clocks (distributed logical time) | Server-assigned seq_id eliminates clock skew without coordination overhead; vector clocks are complex to implement and debug; Snowflake IDs provide distributed assignment without bottleneck |
| Message storage | SQL relational (messages table) | Key-value store (message per key) | Wide-column NoSQL enables efficient range queries by (conversation_id, seq_id); SQL struggles at 1 PB scale; key-value has no efficient range query |
| Group fan-out | Fan-out-on-write for all groups | Fan-out-on-read for all groups | Write-only fails for large groups (1K+ members); read-only fails for small groups (high read latency); hybrid bounded by group size threshold |

Likely Follow-Up Questions

  1. How do you handle read receipts (delivered vs read)? — Delivered status: confirmed when the recipient's client acknowledges receipt (for offline recipients, when the offline queue is drained on reconnect). Read status: the recipient's client sends a read receipt event when the user views the message; the server updates a read_at timestamp; the sender sees the "read" indicator.
  2. How do you support message search across conversations? — Full-text search requires an inverted index (Elasticsearch-style); message store write path also writes to a search index; query path routes search requests to the search cluster. Cross-link: search indexing is covered in the Search-Autocomplete-Design note.
  3. How would you add end-to-end encryption? — Signal protocol: each client holds its own private keys; the server stores only public keys and encrypted message blobs, so it cannot decrypt. Initial key agreement uses X3DH, and the Double Ratchet algorithm then derives fresh keys for each message. Server-side search becomes impossible with E2E encryption.
  4. What happens when a chat server crashes (connection migration)? — Clients detect WebSocket disconnect and reconnect to any available chat server; the connection service updates the user_id -> chat_server_id mapping on reconnect; in-flight messages are re-sent from the offline queue.
  5. How do you handle message deletion/editing after send? — Deletion: a message_deleted event is broadcast to all conversation participants; clients hide the message. Editing: a message_edited event carries the new content and a version number; seq_id remains unchanged.
  6. How do you scale the presence service (millions of subscribers)? — Presence fan-out is bounded by the contact list size, not global user count. Shard the presence service by user_id; each shard handles pub-sub for its assigned users. Consistent-Hashing routes presence queries to the correct shard.
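
For the deletion/editing follow-up, a client-side event handler might look like the sketch below. The event names come from the answer above; the dict-based event shape, the `StoredMessage` fields, and the rule that stale versions are ignored are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoredMessage:
    seq_id: int
    body: str
    version: int = 1
    deleted: bool = False


def apply_event(messages: dict, event: dict) -> bool:
    """Apply a message_deleted or message_edited event; return True if state changed."""
    msg = messages.get(event["seq_id"])
    if msg is None or msg.deleted:
        return False
    if event["type"] == "message_deleted":
        msg.deleted = True  # clients hide the message; seq_id stays in the sequence
        return True
    if event["type"] == "message_edited" and event["version"] > msg.version:
        # Only newer versions win, so replayed or reordered edits are harmless.
        msg.body = event["body"]
        msg.version = event["version"]
        return True
    return False
```

Keeping seq_id fixed means an edit or deletion never disturbs ordering or gap detection; only the content and version change.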

Existing Pattern Connections

| Design Decision | Existing Pattern | Relationship |
|---|---|---|
| WebSocket connection multiplexed to chat server | Mediator-Pattern | Chat server mediates all message exchange; senders and receivers do not connect directly (classic Mediator topology) |
| Offline message queue per recipient | Message-Queue | Undelivered messages queued in durable store; at-least-once delivery guarantee; client deduplicates via seq_id |
| Presence service (online/offline status) | Observer-Pattern | Presence is a publish-subscribe pattern: user status changes published; subscribed contacts receive updates |
| Message sharding by conversation_id | Database-Sharding | All messages for a conversation co-located on one shard; range queries by (conversation_id, seq_id) are efficient |
| seq_id gap detection + resync | Idempotent-Consumer | Re-delivered messages on reconnect must be idempotent; client deduplicates by seq_id to avoid duplicate display |