Bloom Filter
Bloom Filter
A space-efficient probabilistic data structure that answers "definitely not in set" or "probably in set" — false negatives are impossible, false positives are tunable via bit array size and hash function count.
When NOT to Use
- When exact membership is required: Bloom filters cannot answer "is this element definitely in the set." The "probably in set" answer means a positive result may be wrong. Use a hash set when you need certainty.
- When false positives have serious consequences: Medical records lookup, security authorization, or fraud detection cannot accept false positives. Use an exact data structure where a wrong "present" answer causes harm.
- When elements need to be deleted: Standard Bloom filters do not support deletion. Clearing a bit position may invalidate other elements that share those bit positions. Use a Counting Bloom Filter (counters instead of bits) or Cuckoo Filter (supports deletion with better lookup performance in some regimes) — both exist but are out of scope for this note.
- When the set fits in memory anyway: If n < 10M elements at 10 bytes each = 100MB, just use a HashSet. Bloom filters are a space optimization; if RAM is not the constraint, the probabilistic tradeoff is unnecessary.
Core Mechanism
How It Works
A Bloom filter maintains a bit array of m bits, initialized to all zeros. To insert an element, k independent hash functions each map the element to a position in the array and set that bit to 1. To query, the same k hash functions are applied — if all k positions are 1, the element is "probably in set"; if any position is 0, the element is "definitely not in set."
Key properties:
- False negatives are impossible: If an element was inserted, all k bit positions were set to 1 at insert time. A query will always find all k positions set — the filter can never say "no" to a present element.
- False positives are possible: Two different elements may share some bit positions. A queried element that was never inserted may still find all k of its positions set to 1 by coincidence.
- Space-efficient: O(m) bits instead of O(n) hash table entries. For 10M URLs at 10 bits/element = 12.5MB vs ~320MB for a hash set.
False Positive Rate Formula
p = (1 - e^(-kn/m))^k
Where:
p = false positive probability
k = number of hash functions
n = number of elements inserted
m = size of bit array in bits
Optimal k to minimize p given fixed m and n:
k_optimal = (m/n) * ln(2) = 0.693 * (m/n)
At optimal k, the false positive rate simplifies to approximately (0.6185)^(m/n). This is why doubling the bits per element dramatically reduces the false positive rate — the relationship is exponential.
Component Diagram
Bloom-Filter-diagram.excalidraw
Key Variants
Sizing Guide
| Bits per element (m/n) | False positive rate (at optimal k) |
|---|---|
| 6 bits | ~5.6% |
| 8 bits | ~2.3% |
| 10 bits | ~1.2% |
| 16 bits | ~0.05% |
| 20 bits | ~0.009% |
Rule of thumb: 10 bits per element achieves ~1% false positive rate — a common production target.
Example sizing calculation: 10M URLs, 10 bits/element → m = 100M bits = 12.5MB. Optimal k = 0.693 * 10 ≈ 7 hash functions. Expected false positive rate ≈ 0.8%.
Use Cases
- Cache penetration prevention: Before querying the database for a key, check the Bloom filter. If "definitely not in DB," skip the DB query. Populated at startup or insert time with all valid keys. Prevents attackers or missing-key thundering herds from exhausting the database with guaranteed-miss queries.
- URL deduplication in web crawlers: Check if a URL has been visited. "Definitely not visited" means crawl it. "Probably visited" means skip (rare false positives lead to missed pages — acceptable tradeoff vs storing all URLs in RAM). See
[[Web-Crawler-Design]](Phase 31). - Weak password checking: HaveIBeenPwned uses a Bloom filter for the k-anonymity partial hash check before server-side lookup. The filter pre-screens billions of known compromised passwords without storing them all in a lookup table on the client side.
- Database query optimization: BigTable, HBase, Cassandra use Bloom filters to avoid reading SSTable files that cannot contain the queried key. Before opening a potentially large on-disk file, the filter determines if the key could possibly be there — if not, skip the file I/O entirely.
Skeletal Snippet
TypeScript — false positive rate calculator:
// Bloom Filter false positive rate calculator
// p = (1 - e^(-k*n/m))^k
function bloomFalsePositiveRate(m: number, k: number, n: number): number {
// m = bit array size, k = hash functions, n = elements inserted
return Math.pow(1 - Math.exp(-k * n / m), k);
}
// Optimal number of hash functions for given m and n
function optimalK(m: number, n: number): number {
return Math.round((m / n) * Math.LN2);
}
// Example: 10M URLs, 10 bits/element = 100Mbits = 12.5MB
const n = 10_000_000; // elements
const bitsPerElement = 10;
const m = n * bitsPerElement;
const k = optimalK(m, n); // = 7
const p = bloomFalsePositiveRate(m, k, n); // = 0.0082 (0.8%)Java — Guava BloomFilter production reference:
// PRODUCTION NOTE: Use Guava BloomFilter, not hand-rolled implementation.
// com.google.common.hash.BloomFilter handles hash function selection and bit array sizing.
// BloomFilter<String> filter = BloomFilter.create(
// Funnels.stringFunnel(Charsets.UTF_8), 10_000_000, 0.01);
// Above: 10M expected insertions, 1% false positive rate -> Guava auto-sizes m and k.Pitfalls
Standard Bloom filters do not support element deletion. Clearing a bit may invalidate other elements that share those bit positions. If deletion is required, consider a Counting Bloom Filter (each position is a counter, not a bit) or a Cuckoo Filter (supports deletion with better lookup performance in some regimes). Both are out of scope for this note.
Hash function selection and bit array sizing have correctness subtleties. Use Guava BloomFilter (Java), bloomfilter npm package (TypeScript), or Redis BF.ADD command. The TypeScript calculator above is a teaching tool for understanding the math — not a production Bloom filter implementation.
Existing Pattern Connections
[[Dead-Letter-Queue]]— DLQ deduplication can use a Bloom filter as a fast pre-check before the expensive persistent idempotency store lookup[[Idempotent-Consumer]]— same pattern: Bloom filter as probabilistic fast-path before the persistent deduplication store; reduces database reads for duplicate message detection[[Web-Crawler-Design]]— URL-seen detection is the canonical web crawling use case; "probably visited" means skip, "definitely not visited" means crawl (Phase 31 forward link)