Bloom Filter

Bloom Filter

A space-efficient probabilistic data structure that answers "definitely not in set" or "probably in set" — false negatives are impossible, false positives are tunable via bit array size and hash function count.

When NOT to Use

  • When exact membership is required: Bloom filters cannot answer "is this element definitely in the set." The "probably in set" answer means a positive result may be wrong. Use a hash set when you need certainty.
  • When false positives have serious consequences: Medical records lookup, security authorization, or fraud detection cannot accept false positives. Use an exact data structure where a wrong "present" answer causes harm.
  • When elements need to be deleted: Standard Bloom filters do not support deletion. Clearing a bit position may invalidate other elements that share those bit positions. Use a Counting Bloom Filter (counters instead of bits) or Cuckoo Filter (supports deletion with better lookup performance in some regimes) — both exist but are out of scope for this note.
  • When the set fits in memory anyway: If n < 10M elements at 10 bytes each = 100MB, just use a HashSet. Bloom filters are a space optimization; if RAM is not the constraint, the probabilistic tradeoff is unnecessary.

Core Mechanism

How It Works

A Bloom filter maintains a bit array of m bits, initialized to all zeros. To insert an element, k independent hash functions each map the element to a position in the array and set that bit to 1. To query, the same k hash functions are applied — if all k positions are 1, the element is "probably in set"; if any position is 0, the element is "definitely not in set."

Key properties:

  • False negatives are impossible: If an element was inserted, all k bit positions were set to 1 at insert time. A query will always find all k positions set — the filter can never say "no" to a present element.
  • False positives are possible: Two different elements may share some bit positions. A queried element that was never inserted may still find all k of its positions set to 1 by coincidence.
  • Space-efficient: O(m) bits instead of O(n) hash table entries. For 10M URLs at 10 bits/element = 12.5MB vs ~320MB for a hash set.

False Positive Rate Formula

p = (1 - e^(-kn/m))^k

Where:
  p = false positive probability
  k = number of hash functions
  n = number of elements inserted
  m = size of bit array in bits

Optimal k to minimize p given fixed m and n:

k_optimal = (m/n) * ln(2) = 0.693 * (m/n)

At optimal k, the false positive rate simplifies to approximately (0.6185)^(m/n). This is why doubling the bits per element dramatically reduces the false positive rate — the relationship is exponential.

Component Diagram

Bloom-Filter-diagram.excalidraw

Key Variants

Sizing Guide

Bits per element (m/n)False positive rate (at optimal k)
6 bits~5.6%
8 bits~2.3%
10 bits~1.2%
16 bits~0.05%
20 bits~0.009%

Rule of thumb: 10 bits per element achieves ~1% false positive rate — a common production target.

Example sizing calculation: 10M URLs, 10 bits/element → m = 100M bits = 12.5MB. Optimal k = 0.693 * 10 ≈ 7 hash functions. Expected false positive rate ≈ 0.8%.

Use Cases

  • Cache penetration prevention: Before querying the database for a key, check the Bloom filter. If "definitely not in DB," skip the DB query. Populated at startup or insert time with all valid keys. Prevents attackers or missing-key thundering herds from exhausting the database with guaranteed-miss queries.
  • URL deduplication in web crawlers: Check if a URL has been visited. "Definitely not visited" means crawl it. "Probably visited" means skip (rare false positives lead to missed pages — acceptable tradeoff vs storing all URLs in RAM). See [[Web-Crawler-Design]] (Phase 31).
  • Weak password checking: HaveIBeenPwned uses a Bloom filter for the k-anonymity partial hash check before server-side lookup. The filter pre-screens billions of known compromised passwords without storing them all in a lookup table on the client side.
  • Database query optimization: BigTable, HBase, Cassandra use Bloom filters to avoid reading SSTable files that cannot contain the queried key. Before opening a potentially large on-disk file, the filter determines if the key could possibly be there — if not, skip the file I/O entirely.

Skeletal Snippet

TypeScript — false positive rate calculator:

// Bloom Filter false positive rate calculator
// p = (1 - e^(-k*n/m))^k
function bloomFalsePositiveRate(m: number, k: number, n: number): number {
  // m = bit array size, k = hash functions, n = elements inserted
  return Math.pow(1 - Math.exp(-k * n / m), k);
}
 
// Optimal number of hash functions for given m and n
function optimalK(m: number, n: number): number {
  return Math.round((m / n) * Math.LN2);
}
 
// Example: 10M URLs, 10 bits/element = 100Mbits = 12.5MB
const n = 10_000_000; // elements
const bitsPerElement = 10;
const m = n * bitsPerElement;
const k = optimalK(m, n); // = 7
const p = bloomFalsePositiveRate(m, k, n); // = 0.0082 (0.8%)

Java — Guava BloomFilter production reference:

// PRODUCTION NOTE: Use Guava BloomFilter, not hand-rolled implementation.
// com.google.common.hash.BloomFilter handles hash function selection and bit array sizing.
// BloomFilter<String> filter = BloomFilter.create(
//   Funnels.stringFunnel(Charsets.UTF_8), 10_000_000, 0.01);
// Above: 10M expected insertions, 1% false positive rate -> Guava auto-sizes m and k.

Pitfalls

No deletion support

Standard Bloom filters do not support element deletion. Clearing a bit may invalidate other elements that share those bit positions. If deletion is required, consider a Counting Bloom Filter (each position is a counter, not a bit) or a Cuckoo Filter (supports deletion with better lookup performance in some regimes). Both are out of scope for this note.

Hand-rolling in production

Hash function selection and bit array sizing have correctness subtleties. Use Guava BloomFilter (Java), bloomfilter npm package (TypeScript), or Redis BF.ADD command. The TypeScript calculator above is a teaching tool for understanding the math — not a production Bloom filter implementation.

Existing Pattern Connections

  • [[Dead-Letter-Queue]] — DLQ deduplication can use a Bloom filter as a fast pre-check before the expensive persistent idempotency store lookup
  • [[Idempotent-Consumer]] — same pattern: Bloom filter as probabilistic fast-path before the persistent deduplication store; reduces database reads for duplicate message detection
  • [[Web-Crawler-Design]] — URL-seen detection is the canonical web crawling use case; "probably visited" means skip, "definitely not visited" means crawl (Phase 31 forward link)