Chat System Design
Chat System Design
System that enables real-time messaging between users via persistent WebSocket connections, with message ordering guarantees, presence tracking, and offline delivery.
Clarify First
Before designing, lock these assumptions with the interviewer:
- 1:1 chat only or group chat support? — Group chat introduces fan-out amplification; groups with 1K+ members require the same hybrid push/pull strategy as the news feed celebrity problem — see News-Feed-Design for the parallel.
- Message types? — Text only vs text + images + files + voice; media handling requires a separate blob storage path and CDN; scope to text-only initially and note extensions.
- Presence required? — Online/offline/last-seen indicator requires a heartbeat service and pub-sub fan-out for status updates; confirm if this is in scope.
- Message retention? — 30 days vs indefinite; storage sizing diverges substantially; determine before committing to a storage tier.
- End-to-end encryption? — Signal protocol (keys held by clients; server cannot read) vs server-side at-rest encryption; E2E changes the key management architecture significantly.
Capacity Estimation
Derivation chain for a large-scale messaging platform (2026):
Assumption: 500M DAU (WhatsApp scale, 2026 estimate)
Concurrent connections:
peak_concurrent_users = 500M x 0.1 (10% online simultaneously) = 50M connections
connections_per_server = 50,000 (WebSocket server capacity estimate)
chat_servers_needed = 50M / 50,000 = 1,000 chat servers
Message volume:
messages_per_DAU_per_day = 40
total_messages_per_day = 500M x 40 = 20B messages/day
message_QPS = 20B / 86,400 ~ 231,000 QPS
Storage (1-year retention):
message_size ~ 100 bytes (text only)
daily_storage = 20B x 100 bytes = 2 TB/day
yearly_storage = 2 TB x 365 = 730 TB ~ 1 PB/year
-> NoSQL (wide-column) preferred; message reads are by (conversation_id, seq_id range)
Cross-reference: Capacity-Estimation for the shared DAU-to-QPS-to-storage methodology. Wide-column vs relational tradeoff is covered in SQL-vs-NoSQL.
Conclusion: 1,000 WebSocket chat servers needed at peak. 231K message QPS requires sharded NoSQL storage. 1 PB/year for 1-year retention demands wide-column architecture with efficient range queries.
Central Technical Problem
Message ordering guarantees and offline delivery in a distributed chat system.
Messages from different senders arrive at servers in non-deterministic order. Without monotonic sequence numbers, recipients see messages out of order. Offline users need messages queued and delivered in order upon reconnection. These two problems are coupled — the same seq_id that enforces ordering also identifies message delivery gaps.
Message ordering
Two approaches:
Server-assigned seq_id (chosen approach): The chat server assigns a monotonically increasing seq_id per conversation. In-order delivery is guaranteed by seq_id. Requires a sequence number service — either a single-writer (bottleneck) or a distributed Snowflake-style ID. See Unique-ID-Generator-Design for Snowflake mechanics. Gap detection: client tracks last_received_seq_id; any gap in the sequence triggers a re-sync request to the server.
Client-assigned timestamps (rejected): Clients attach logical timestamps to messages. Subject to clock skew — two messages sent simultaneously on different clients may have identical or inverted timestamps. Mitigating clock skew requires NTP or vector clocks, adding coordination complexity without eliminating the problem.
Decision: Server-assigned seq_id per conversation is the standard approach. Snowflake-style distributed IDs handle ID assignment at scale without a coordination bottleneck. Re-delivered messages on reconnect are deduplicated by seq_id — see Idempotent-Consumer for the idempotent delivery pattern.
Offline delivery
When a recipient is offline, messages are stored in a durable offline queue per recipient. See Message-Queue for at-least-once delivery guarantees.
On reconnect, the client sends its last_received_seq_id to the server. The server delivers all messages with seq_id > last_received in order. Message retention window (e.g., 30 days) bounds offline queue storage. After the retention window, undelivered messages are dropped — this must be communicated in the product (e.g., "Messages older than 30 days may not have been delivered").
WebSocket connection management
Each online user maintains a long-lived WebSocket to a chat server. Chat servers are stateful — they hold the active connection-to-user mapping. A connection service maps user_id -> chat_server_id, enabling message routing across the server pool. Connection routing uses Consistent-Hashing to distribute users across chat servers without hotspot routing.
Message routing: sender's server receives the message → looks up recipient's chat_server_id via the connection service → forwards the message to the correct server via an internal service bus → recipient's server delivers over the open WebSocket.
The chat server acts as a Mediator-Pattern — senders and receivers do not connect directly; all message exchange is mediated through the chat server pool.
Presence service
Heartbeat-based: client sends a heartbeat every N seconds (e.g., 5 seconds). The presence service marks a user offline after a missed heartbeat timeout (e.g., 3 missed heartbeats = 15 seconds). Presence updates are published via pub-sub — see Observer-Pattern for the publish-subscribe model. Subscribed contacts receive status change notifications. The presence service can be AP under CAP-Theorem — slightly stale online status is acceptable; message ordering is CP (strong consistency on seq_id).
Group chat fan-out
Small groups (< 100 members): fan-out-on-write — message is written to all member offline queues. Write amplification is bounded by group size.
Large groups (1K+ members): the same hybrid push/pull decision as the news feed celebrity problem applies — see News-Feed-Design for the hybrid fan-out strategy. Above a group size threshold, messages are written to a shared group message store and members pull on read rather than receiving individual queue entries.
Component Design
[Client] <--WebSocket--> [Chat Server Pool]
|
[Connection Service]
(user_id -> server_id)
|
+----------------+----------------+
| | |
[Message Service] [Presence Service] [Group Service]
| | |
[Message Store] [Status Cache] [Member Store]
(NoSQL, sharded (Redis) (relational)
by conversation_id)
|
[Offline Queue]
(per recipient)
Component responsibilities:
- Chat Server Pool — maintains long-lived WebSocket connections; routes messages between senders and recipients; stateful per connection
- Connection Service — service registry mapping
user_id -> chat_server_id; queried on each message send to route to the correct chat server - Message Service — persists messages to durable store; assigns
seq_id; manages offline queue writes for recipients who are not connected - Presence Service — tracks online/offline status via heartbeats; publishes status change events to subscriber contacts
- Group Service — manages group membership; handles fan-out decisions (push for small groups, pull for large groups)
- Message Store — wide-column NoSQL store sharded by
conversation_id; range queries by(conversation_id, seq_id)retrieve conversation history efficiently — see Database-Sharding - Offline Queue — durable per-recipient message queue; at-least-once delivery on reconnect
- Status Cache — Redis store for live presence state; TTL matches heartbeat timeout
System Diagram
Chat-System-Design-diagram.excalidraw
Alternatives Considered
| Decision | Alternative A | Alternative B | Why Chosen Approach Wins |
|---|---|---|---|
| Connection protocol | HTTP polling (client polls every N seconds) | HTTP long-polling (server holds request open) | WebSocket provides true bidirectional communication with low overhead; polling adds latency; long-polling is a workaround that wastes connections |
| Message ordering | Client-assigned timestamps | Vector clocks (distributed logical time) | Server-assigned seq_id eliminates clock skew without coordination overhead; vector clocks are complex to implement and debug; Snowflake IDs provide distributed assignment without bottleneck |
| Message storage | SQL relational (messages table) | Key-value store (message per key) | Wide-column NoSQL enables efficient range queries by (conversation_id, seq_id); SQL struggles at 1 PB scale; key-value has no efficient range query |
| Group fan-out | Fan-out-on-write for all groups | Fan-out-on-read for all groups | Write-only fails for large groups (1K+ members); read-only fails for small groups (high read latency); hybrid bounded by group size threshold |
Likely Follow-Up Questions
- How do you handle read receipts (delivered vs read)? — Delivered status: confirmed when the offline queue is drained on reconnect. Read status: client sends a read receipt event when the user views the message; server updates a
read_attimestamp; recipient sees the "read" indicator. - How do you support message search across conversations? — Full-text search requires an inverted index (Elasticsearch-style); message store write path also writes to a search index; query path routes search requests to the search cluster. Cross-link: search indexing is covered in the Search-Autocomplete-Design note.
- How would you add end-to-end encryption? — Signal protocol: each client holds a private key; server stores only public keys and encrypted message blobs; server cannot decrypt. Key exchange uses the Double Ratchet algorithm. Server-side search becomes impossible with E2E encryption.
- What happens when a chat server crashes (connection migration)? — Clients detect WebSocket disconnect and reconnect to any available chat server; the connection service updates the
user_id -> chat_server_idmapping on reconnect; in-flight messages are re-sent from the offline queue. - How do you handle message deletion/editing after send? — Deletion: a
message_deletedevent is broadcast to all conversation participants; clients hide the message. Editing: amessage_editedevent carries the new content and a version number;seq_idremains unchanged. - How do you scale the presence service (millions of subscribers)? — Presence fan-out is bounded by the contact list size, not global user count. Shard the presence service by
user_id; each shard handles pub-sub for its assigned users. Consistent-Hashing routes presence queries to the correct shard.
Existing Pattern Connections
| Design Decision | Existing Pattern | Relationship |
|---|---|---|
| WebSocket connection multiplexed to chat server | Mediator-Pattern | Chat server mediates all message exchange; senders and receivers do not connect directly — classic Mediator topology |
| Offline message queue per recipient | Message-Queue | Undelivered messages queued in durable store; at-least-once delivery guarantee; client deduplicates via seq_id |
| Presence service (online/offline status) | Observer-Pattern | Presence is a publish-subscribe pattern: user status changes published; subscribed contacts receive updates |
| Message sharding by conversation_id | Database-Sharding | All messages for a conversation co-located on one shard; range queries by (conversation_id, seq_id) are efficient |
| seq_id gap detection + resync | Idempotent-Consumer | Re-delivered messages on reconnect must be idempotent; client deduplicates by seq_id to avoid duplicate display |