Web Crawler Design

Web Crawler Design

Distributed system that systematically fetches web pages by managing a URL frontier with priority ranking, per-domain politeness enforcement, and content deduplication to build a searchable index.

Clarify First

Before designing, lock these assumptions with the interviewer:

  1. Crawl scope? — Single domain (site auditor) vs entire web (search engine crawler); affects frontier size and politeness requirements by orders of magnitude.
  2. Scale? — 1M pages/month (small site crawler) vs 1B pages/month (mid-tier search crawler) vs 100B pages/month (Google-scale); drives infrastructure choices.
  3. Content types? — HTML only vs HTML + PDF + images + JavaScript-rendered pages; JS-rendered pages require headless browser infrastructure and significantly higher compute.
  4. Freshness requirement? — One-time crawl vs continuous re-crawl; continuous re-crawl requires priority scoring based on page change frequency and must manage re-crawl budget.
  5. Robots.txt compliance? — Strict (skip non-compliant pages) vs best-effort (log violations, crawl anyway); strict compliance requires per-domain robots.txt fetch and cache before any crawl.

Capacity Estimation

Derivation chain for a mid-tier search crawler (2026):

Assumption: Crawl 1B pages/month (mid-tier search crawler, 2026 estimate)
  pages_per_second = 1B / (30 x 86,400) ~ 386 pages/sec

Page size:
  average_page_size = 500 KB
  bandwidth = 386 x 500 KB ~ 193 MB/s inbound

URL storage:
  URLs per page (outbound links) = 10 average
  new_URLs_per_second = 386 x 10 = 3,860 URLs/sec
  URLs_per_month = 3,860 x 30 x 86,400 ~ 10B URLs in frontier
  URL avg length = 100 bytes
  URL frontier storage = 10B x 100 bytes = 1 TB
  -> Distributed frontier required; single-node queue cannot hold 1 TB in memory

Cross-reference: Capacity-Estimation for the shared DAU-to-QPS-to-storage methodology.

Conclusion: At 386 pages/sec and 1 TB of URL frontier state, a distributed crawl architecture is mandatory. The frontier cannot fit on a single node; a distributed Message-Queue backed frontier with consistent-hash partitioning is required.

Central Technical Problem

Distributed URL frontier + politeness without coordination bottleneck.

A distributed crawler must partition the URL frontier across crawler nodes without any single node becoming a coordination bottleneck. Simultaneously, each crawler node must respect per-domain crawl rate limits (politeness) without needing to check with a central rate limiter for every fetch.

URL Frontier Design -- Two Approaches

Approach 1: Single-queue frontier

Simple BFS queue of URLs. All crawler nodes read from and write to the same queue. Bottleneck at scale -- all writes (new URLs discovered) and reads (next URL to crawl) contend on the same queue. No politeness enforcement: crawler may hit the same domain repeatedly in rapid succession, triggering rate limiting or bans.

Approach 2: Two-tier (priority + per-domain politeness) queue

Front tier: priority queues rank URLs by importance score (PageRank estimate, freshness signal, domain authority). URLs flow from the global frontier into the priority queue tier.

Back tier: per-domain politeness queues. Each domain gets a dedicated queue. A queue router selects the next URL to crawl by cycling through back-tier queues with a minimum inter-fetch interval per domain.

Decision: Two-tier (priority + per-domain) queue is the standard approach. Per-domain queues decouple politeness enforcement from the global frontier -- each crawler worker selects from its assigned domain queue without central coordination. See Message-Queue for the underlying delivery semantics.

[Seed URLs] --> [URL Frontier]
                     |
          +----------+----------+
          |                     |
  [Priority Queues]    [Per-Domain Queues]
  (front tier)         (back tier)
          |                     |
          +----------+----------+
                     |
              [Queue Router]
              (round-robin or weighted)
                     |
              [Crawler Workers] --> [Content Store]
                     |                     |
              [URL Extractor]       [Fingerprint Store]
                     |
              [Bloom Filter] --> [URL Frontier] (loop)

URL-to-crawler-node assignment uses Consistent-Hashing so the same URL always routes to the same crawler node, preventing duplicate crawl of the same URL by different nodes.

Politeness Enforcement

Politeness is a legal and operational requirement -- crawling too aggressively can constitute a denial-of-service attack and violates robots.txt directives:

  • Robots.txt: Fetch and cache per domain with a TTL of 24 hours. Respect Crawl-delay and Disallow directives. Robots.txt is cached using cache-aside strategy -- see Distributed-Cache for the caching pattern.
  • Per-domain rate limiting: The back-tier queue enforces a minimum inter-fetch interval per domain (e.g., 1 second between fetches to the same domain). The queue router enforces this by only re-selecting from a domain queue after the cooldown period expires.
  • No central coordination: Each crawler worker owns a set of domain queues. Politeness is enforced locally within the worker -- no central rate limiter or lock is required.

Content Deduplication

URL deduplication (before fetch):

Bloom filter check before enqueuing a new URL. O(1) lookup, space-efficient, and tunable false positive rate. False positives (incorrectly flagging a new URL as seen) are acceptable -- the URL is skipped, a small loss. False negatives (missing a seen URL) are not tolerable -- they cause re-crawl waste. See Bloom-Filter for the data structure detail. The Bloom filter is the primary dedup mechanism at scale; an exact hash set would require hundreds of gigabytes for 10B URLs.

Content deduplication (after fetch):

Simhash or MD5 fingerprint of the page content detects duplicate pages published at different URLs (mirrors, copied content). The fingerprint is stored in a key-value store; after each fetch, the crawler computes the fingerprint and checks the store. If the fingerprint already exists, the page is not re-indexed. This is an application of the Idempotent-Consumer pattern -- re-fetching the same content is a no-op by design.

Spider Trap Avoidance

Spider traps are URLs that generate infinite redirect chains or dynamically generated content with unique parameters (e.g., calendar pages that generate ?date=YYYY-MM-DD+1 indefinitely):

  • Max URL depth limit: Do not crawl URLs beyond N hops from the seed URL (typically 10-15).
  • Max URL length limit: Discard URLs exceeding a threshold length (e.g., 2,048 characters); dynamically generated trap URLs tend to be very long.
  • URL normalisation: Strip session parameters (?sessionid=, ?sid=), fragment identifiers (#section), and UTM tracking parameters before enqueuing. Normalised URLs collapse many variants of the same page to a single canonical URL.

Component Design

[Seed URLs] --> [URL Frontier (Distributed)]
                     |
          +----------+----------+
          |                     |
  [Priority Queues]    [Per-Domain Queues]
  (front tier: rank by  (back tier: 1 queue per
   PageRank / freshness)  domain, min-interval enforced)
          |                     |
          +----------+----------+
                     |
              [Queue Router]
                     |
         +-----------+----------+
         |           |          |
   [Worker 1]  [Worker 2]  [Worker N]
         |
   [DNS Resolver] --> [HTTP Fetcher]
         |
   [URL Extractor] --> [Bloom Filter Check] --> [URL Frontier] (loop)
         |
   [Fingerprint Check] --> [Content Store] (if new content)
                           (sharded by URL hash)

Component responsibilities:

  • URL Frontier -- distributed queue (backed by Message-Queue) holding all discovered URLs; partitioned across nodes for 1 TB scale.
  • Priority Queues (front tier) -- rank URLs by importance score; prevent low-priority pages from consuming crawl budget.
  • Per-Domain Queues (back tier) -- enforce per-domain crawl-delay; decouple politeness from global scheduling.
  • Queue Router -- selects next URL from back-tier queues with round-robin cycling; enforces minimum inter-fetch interval.
  • Crawler Workers -- stateless fetch workers; DNS resolution, HTTP fetch, URL extraction, Bloom filter check, fingerprint check; scale horizontally.
  • Bloom Filter -- URL-seen probabilistic check; prevents re-enqueuing already-discovered URLs; see Bloom-Filter.
  • Content Store -- stores crawled page content; sharded by URL hash using Database-Sharding.
  • Fingerprint Store -- key-value store mapping content fingerprint to first-seen URL; used for duplicate content detection.

System Diagram

Web-Crawler-Design-diagram.excalidraw

Alternatives Considered

DecisionAlternative AAlternative BWhy Chosen Approach Wins
Crawl order: priority-based two-tier queueBFS single queueDFS traversalBFS covers breadth but has no politeness; DFS risks going arbitrarily deep into low-value paths; priority queue maximises crawl budget on high-value pages
Frontier: distributed two-tierSingle centralized queuePeer-to-peer frontier gossipTwo-tier decouples priority ranking from politeness; distributed avoids single bottleneck; gossip protocol adds consistency complexity
URL dedup: Bloom filterExact hash set (Redis SET)Relational DB unique indexBloom filter uses ~1.2 bytes/entry vs ~100 bytes/entry for hash set; at 10B URLs = 12 GB vs 1 TB; false positive rate tunable; hash set becomes cost-prohibitive at scale
Content dedup: simhash fingerprintExact MD5 hashTF-IDF similaritySimhash detects near-duplicate content (minor edits, ads injected); MD5 only catches exact duplicates; TF-IDF too computationally expensive per page

Likely Follow-Up Questions

  1. How do you handle JavaScript-rendered pages? -- Headless browser (Puppeteer / Playwright) renders JS before URL extraction; requires 10-100x more compute per page than static HTML fetch; typically only applied to a high-value URL subset.
  2. How do you detect and avoid honeypot spider traps? -- Signature-based detection: URLs that match known trap patterns (calendar generators, infinite pagination without content change); fingerprint-based: if 100 consecutive URLs from a domain produce identical content fingerprints, mark domain as trap and blacklist.
  3. How do you prioritize re-crawl frequency for updated pages? -- Track last-modified header and content fingerprint change frequency per URL; use change probability score to set re-crawl interval; frequently changing pages get higher priority in the front-tier queue.
  4. How do you handle crawler node failures? -- URL assignments (via Consistent-Hashing) are automatically rebalanced across remaining nodes on failure; in-flight URLs from the failed node are re-enqueued via heartbeat timeout detection in the queue router.
  5. How do you scale to 100B pages/month (Google scale)? -- 100x the 386 pages/sec baseline = ~38,600 pages/sec; requires thousands of crawler workers; URL frontier grows to ~100 TB requiring a distributed database-backed frontier rather than a queue; consistent-hash partitioning across hundreds of nodes.
  6. How do you handle DNS resolution bottlenecks? -- Maintain a local DNS cache per crawler worker with TTL equal to the domain's DNS TTL; share a distributed DNS cache across workers for popular domains; avoid resolving the same domain on every fetch.

Existing Pattern Connections

Design DecisionExisting PatternRelationship
Bloom filter for URL deduplicationBloom-FilterSpace-efficient probabilistic check before enqueuing; false positives acceptable (skip a URL); false negatives not tolerable (recrawl)
Per-domain politeness queueMessage-QueueEach domain queue is a message queue with rate-controlled consumption; politeness = consumer-side throttle per queue
URL-to-crawler routing via consistent hashConsistent-HashingURL hashed to crawler node; same URL always routes to same node -- prevents duplicate crawl of same URL by different nodes
Content fingerprint deduplicationIdempotent-ConsumerPage fetch is idempotent by design: fingerprint stored in KV store; re-fetching same content is a no-op
robots.txt cache per domainDistributed-CacheCache-aside: fetch robots.txt once per domain, cache with TTL (24h default); reduces re-fetch overhead on high-volume crawls
Priority queue for crawl orderingmin-heap-constructionThe min-heap data structure that backs the priority queue tier in the two-tier frontier; URLs are ranked by importance score and the highest-priority URL is extracted in O(log n) via heap extract-min

Additional cross-links woven into the design: Database-Sharding (crawled page storage sharded by URL hash), Capacity-Estimation (shared estimation methodology).