Web Crawler Design
Web Crawler Design
Distributed system that systematically fetches web pages by managing a URL frontier with priority ranking, per-domain politeness enforcement, and content deduplication to build a searchable index.
Clarify First
Before designing, lock these assumptions with the interviewer:
- Crawl scope? — Single domain (site auditor) vs entire web (search engine crawler); affects frontier size and politeness requirements by orders of magnitude.
- Scale? — 1M pages/month (small site crawler) vs 1B pages/month (mid-tier search crawler) vs 100B pages/month (Google-scale); drives infrastructure choices.
- Content types? — HTML only vs HTML + PDF + images + JavaScript-rendered pages; JS-rendered pages require headless browser infrastructure and significantly higher compute.
- Freshness requirement? — One-time crawl vs continuous re-crawl; continuous re-crawl requires priority scoring based on page change frequency and must manage re-crawl budget.
- Robots.txt compliance? — Strict (skip non-compliant pages) vs best-effort (log violations, crawl anyway); strict compliance requires per-domain robots.txt fetch and cache before any crawl.
Capacity Estimation
Derivation chain for a mid-tier search crawler (2026):
Assumption: Crawl 1B pages/month (mid-tier search crawler, 2026 estimate)
pages_per_second = 1B / (30 x 86,400) ~ 386 pages/sec
Page size:
average_page_size = 500 KB
bandwidth = 386 x 500 KB ~ 193 MB/s inbound
URL storage:
URLs per page (outbound links) = 10 average
new_URLs_per_second = 386 x 10 = 3,860 URLs/sec
URLs_per_month = 3,860 x 30 x 86,400 ~ 10B URLs in frontier
URL avg length = 100 bytes
URL frontier storage = 10B x 100 bytes = 1 TB
-> Distributed frontier required; single-node queue cannot hold 1 TB in memory
Cross-reference: Capacity-Estimation for the shared DAU-to-QPS-to-storage methodology.
Conclusion: At 386 pages/sec and 1 TB of URL frontier state, a distributed crawl architecture is mandatory. The frontier cannot fit on a single node; a distributed Message-Queue backed frontier with consistent-hash partitioning is required.
Central Technical Problem
Distributed URL frontier + politeness without coordination bottleneck.
A distributed crawler must partition the URL frontier across crawler nodes without any single node becoming a coordination bottleneck. Simultaneously, each crawler node must respect per-domain crawl rate limits (politeness) without needing to check with a central rate limiter for every fetch.
URL Frontier Design -- Two Approaches
Approach 1: Single-queue frontier
Simple BFS queue of URLs. All crawler nodes read from and write to the same queue. Bottleneck at scale -- all writes (new URLs discovered) and reads (next URL to crawl) contend on the same queue. No politeness enforcement: crawler may hit the same domain repeatedly in rapid succession, triggering rate limiting or bans.
Approach 2: Two-tier (priority + per-domain politeness) queue
Front tier: priority queues rank URLs by importance score (PageRank estimate, freshness signal, domain authority). URLs flow from the global frontier into the priority queue tier.
Back tier: per-domain politeness queues. Each domain gets a dedicated queue. A queue router selects the next URL to crawl by cycling through back-tier queues with a minimum inter-fetch interval per domain.
Decision: Two-tier (priority + per-domain) queue is the standard approach. Per-domain queues decouple politeness enforcement from the global frontier -- each crawler worker selects from its assigned domain queue without central coordination. See Message-Queue for the underlying delivery semantics.
[Seed URLs] --> [URL Frontier]
|
+----------+----------+
| |
[Priority Queues] [Per-Domain Queues]
(front tier) (back tier)
| |
+----------+----------+
|
[Queue Router]
(round-robin or weighted)
|
[Crawler Workers] --> [Content Store]
| |
[URL Extractor] [Fingerprint Store]
|
[Bloom Filter] --> [URL Frontier] (loop)
URL-to-crawler-node assignment uses Consistent-Hashing so the same URL always routes to the same crawler node, preventing duplicate crawl of the same URL by different nodes.
Politeness Enforcement
Politeness is a legal and operational requirement -- crawling too aggressively can constitute a denial-of-service attack and violates robots.txt directives:
- Robots.txt: Fetch and cache per domain with a TTL of 24 hours. Respect
Crawl-delayandDisallowdirectives. Robots.txt is cached using cache-aside strategy -- see Distributed-Cache for the caching pattern. - Per-domain rate limiting: The back-tier queue enforces a minimum inter-fetch interval per domain (e.g., 1 second between fetches to the same domain). The queue router enforces this by only re-selecting from a domain queue after the cooldown period expires.
- No central coordination: Each crawler worker owns a set of domain queues. Politeness is enforced locally within the worker -- no central rate limiter or lock is required.
Content Deduplication
URL deduplication (before fetch):
Bloom filter check before enqueuing a new URL. O(1) lookup, space-efficient, and tunable false positive rate. False positives (incorrectly flagging a new URL as seen) are acceptable -- the URL is skipped, a small loss. False negatives (missing a seen URL) are not tolerable -- they cause re-crawl waste. See Bloom-Filter for the data structure detail. The Bloom filter is the primary dedup mechanism at scale; an exact hash set would require hundreds of gigabytes for 10B URLs.
Content deduplication (after fetch):
Simhash or MD5 fingerprint of the page content detects duplicate pages published at different URLs (mirrors, copied content). The fingerprint is stored in a key-value store; after each fetch, the crawler computes the fingerprint and checks the store. If the fingerprint already exists, the page is not re-indexed. This is an application of the Idempotent-Consumer pattern -- re-fetching the same content is a no-op by design.
Spider Trap Avoidance
Spider traps are URLs that generate infinite redirect chains or dynamically generated content with unique parameters (e.g., calendar pages that generate ?date=YYYY-MM-DD+1 indefinitely):
- Max URL depth limit: Do not crawl URLs beyond N hops from the seed URL (typically 10-15).
- Max URL length limit: Discard URLs exceeding a threshold length (e.g., 2,048 characters); dynamically generated trap URLs tend to be very long.
- URL normalisation: Strip session parameters (
?sessionid=,?sid=), fragment identifiers (#section), and UTM tracking parameters before enqueuing. Normalised URLs collapse many variants of the same page to a single canonical URL.
Component Design
[Seed URLs] --> [URL Frontier (Distributed)]
|
+----------+----------+
| |
[Priority Queues] [Per-Domain Queues]
(front tier: rank by (back tier: 1 queue per
PageRank / freshness) domain, min-interval enforced)
| |
+----------+----------+
|
[Queue Router]
|
+-----------+----------+
| | |
[Worker 1] [Worker 2] [Worker N]
|
[DNS Resolver] --> [HTTP Fetcher]
|
[URL Extractor] --> [Bloom Filter Check] --> [URL Frontier] (loop)
|
[Fingerprint Check] --> [Content Store] (if new content)
(sharded by URL hash)
Component responsibilities:
- URL Frontier -- distributed queue (backed by Message-Queue) holding all discovered URLs; partitioned across nodes for 1 TB scale.
- Priority Queues (front tier) -- rank URLs by importance score; prevent low-priority pages from consuming crawl budget.
- Per-Domain Queues (back tier) -- enforce per-domain crawl-delay; decouple politeness from global scheduling.
- Queue Router -- selects next URL from back-tier queues with round-robin cycling; enforces minimum inter-fetch interval.
- Crawler Workers -- stateless fetch workers; DNS resolution, HTTP fetch, URL extraction, Bloom filter check, fingerprint check; scale horizontally.
- Bloom Filter -- URL-seen probabilistic check; prevents re-enqueuing already-discovered URLs; see Bloom-Filter.
- Content Store -- stores crawled page content; sharded by URL hash using Database-Sharding.
- Fingerprint Store -- key-value store mapping content fingerprint to first-seen URL; used for duplicate content detection.
System Diagram
Web-Crawler-Design-diagram.excalidraw
Alternatives Considered
| Decision | Alternative A | Alternative B | Why Chosen Approach Wins |
|---|---|---|---|
| Crawl order: priority-based two-tier queue | BFS single queue | DFS traversal | BFS covers breadth but has no politeness; DFS risks going arbitrarily deep into low-value paths; priority queue maximises crawl budget on high-value pages |
| Frontier: distributed two-tier | Single centralized queue | Peer-to-peer frontier gossip | Two-tier decouples priority ranking from politeness; distributed avoids single bottleneck; gossip protocol adds consistency complexity |
| URL dedup: Bloom filter | Exact hash set (Redis SET) | Relational DB unique index | Bloom filter uses ~1.2 bytes/entry vs ~100 bytes/entry for hash set; at 10B URLs = 12 GB vs 1 TB; false positive rate tunable; hash set becomes cost-prohibitive at scale |
| Content dedup: simhash fingerprint | Exact MD5 hash | TF-IDF similarity | Simhash detects near-duplicate content (minor edits, ads injected); MD5 only catches exact duplicates; TF-IDF too computationally expensive per page |
Likely Follow-Up Questions
- How do you handle JavaScript-rendered pages? -- Headless browser (Puppeteer / Playwright) renders JS before URL extraction; requires 10-100x more compute per page than static HTML fetch; typically only applied to a high-value URL subset.
- How do you detect and avoid honeypot spider traps? -- Signature-based detection: URLs that match known trap patterns (calendar generators, infinite pagination without content change); fingerprint-based: if 100 consecutive URLs from a domain produce identical content fingerprints, mark domain as trap and blacklist.
- How do you prioritize re-crawl frequency for updated pages? -- Track last-modified header and content fingerprint change frequency per URL; use change probability score to set re-crawl interval; frequently changing pages get higher priority in the front-tier queue.
- How do you handle crawler node failures? -- URL assignments (via Consistent-Hashing) are automatically rebalanced across remaining nodes on failure; in-flight URLs from the failed node are re-enqueued via heartbeat timeout detection in the queue router.
- How do you scale to 100B pages/month (Google scale)? -- 100x the 386 pages/sec baseline = ~38,600 pages/sec; requires thousands of crawler workers; URL frontier grows to ~100 TB requiring a distributed database-backed frontier rather than a queue; consistent-hash partitioning across hundreds of nodes.
- How do you handle DNS resolution bottlenecks? -- Maintain a local DNS cache per crawler worker with TTL equal to the domain's DNS TTL; share a distributed DNS cache across workers for popular domains; avoid resolving the same domain on every fetch.
Existing Pattern Connections
| Design Decision | Existing Pattern | Relationship |
|---|---|---|
| Bloom filter for URL deduplication | Bloom-Filter | Space-efficient probabilistic check before enqueuing; false positives acceptable (skip a URL); false negatives not tolerable (recrawl) |
| Per-domain politeness queue | Message-Queue | Each domain queue is a message queue with rate-controlled consumption; politeness = consumer-side throttle per queue |
| URL-to-crawler routing via consistent hash | Consistent-Hashing | URL hashed to crawler node; same URL always routes to same node -- prevents duplicate crawl of same URL by different nodes |
| Content fingerprint deduplication | Idempotent-Consumer | Page fetch is idempotent by design: fingerprint stored in KV store; re-fetching same content is a no-op |
| robots.txt cache per domain | Distributed-Cache | Cache-aside: fetch robots.txt once per domain, cache with TTL (24h default); reduces re-fetch overhead on high-volume crawls |
| Priority queue for crawl ordering | min-heap-construction | The min-heap data structure that backs the priority queue tier in the two-tier frontier; URLs are ranked by importance score and the highest-priority URL is extracted in O(log n) via heap extract-min |
Additional cross-links woven into the design: Database-Sharding (crawled page storage sharded by URL hash), Capacity-Estimation (shared estimation methodology).