HyperLogLog
HyperLogLog
A probabilistic cardinality estimation algorithm that answers "how many distinct elements are in a stream?" using O(log log n) space — enabling unique-count queries on billions of elements with ~1% error at just 12 KB of memory (Redis implementation).
When NOT to Use
- When exact counts are required (financial transactions, compliance reporting) — HyperLogLog has inherent error (~0.81% at 16384 registers). A 0.81% error on 10 million users means an uncertainty of ±81,000 — unacceptable when regulatory audits require precise figures.
- When you need to answer membership queries ("is element X in the set?") — use Bloom-Filter instead. HyperLogLog cannot test whether a specific element was seen; it can only estimate how many distinct elements were seen. Bloom Filter solves membership; HyperLogLog solves cardinality.
- When the cardinality is small (< 1000 elements) — a simple
HashSetis cheaper, exact, and easier to reason about. HyperLogLog's value emerges at millions+ cardinality where storing all elements would require gigabytes of memory.
Core Mechanism
How It Works
- Hash function maps each incoming element to a binary string (e.g.,
hash("user_42") = 00101101...). - Extract register index from the first k bits of the hash → selects which of the
2^kregisters (buckets) to update. - Count leading zeros (the rho function, ρ) in the remaining bits of the hash. More leading zeros = rarer event = evidence of larger cardinality.
- Update register
M[j] = max(M[j], ρ)— each register tracks the maximum leading-zero count seen so far for that bucket. - Estimate cardinality using the harmonic mean across all
2^kregisters: the harmonic mean corrects for outlier buckets that happened to see an element with many leading zeros. - Apply bias correction constant
α_mto remove systematic bias in the harmonic mean estimator.
Error Rate
Standard error = 1.04 / sqrt(m) where m = number of registers.
| Registers (m) | Error rate | Memory (at 6 bits/register) |
|---|---|---|
| 256 | ~6.5% | ~192 bytes |
| 1024 | ~3.25% | ~768 bytes |
| 4096 | ~1.63% | ~3 KB |
| 16384 | ~0.81% | ~12 KB (Redis default) |
Redis uses 16384 registers (12 KB) to achieve ~0.81% standard error regardless of whether the cardinality is 1000 or 1 billion.
Component Diagram
flowchart TD
A[Input element] --> B[Hash function h]
B --> C["Binary string e.g. 00101..."]
C --> D["Extract register index\nfirst k bits → bucket j"]
C --> E["Count leading zeros\nin remaining bits → ρ"]
D --> F["Register M_j"]
E --> F
F --> G{"ρ > M_j?"}
G -->|Yes| H["Update M_j = ρ"]
G -->|No| I[No update]
H --> J["Harmonic mean of\nall 2^k registers"]
I --> J
J --> K["Apply bias correction\nalpha_m constant"]
K --> L["Cardinality estimate n̂"]
System Design Application
Redis PFADD / PFCOUNT — Redis natively implements HyperLogLog as a first-class data type. PFADD key element adds an element; PFCOUNT key returns the estimated distinct count. The 12 KB maximum footprint is fixed regardless of cardinality.
Common use cases in production systems:
- Unique visitor counts: "How many distinct users visited this page today?" — streaming billions of events; storing each user ID would require gigabytes. HyperLogLog counts distinct users at ~12 KB per counter.
- Distinct query counts: Log all search queries; count unique queries per hour without storing each query string. Used in search analytics pipelines.
- Distinct item views in analytics: "How many unique users viewed product SKU-123 this week?" — a per-product HyperLogLog counter, merged across shards with
PFMERGE.
The infrastructure layer where these Redis HyperLogLog counters live is described in Distributed-Cache — Redis is a distributed in-memory cache/store that exposes PFADD/PFCOUNT as native commands.
Scope Boundaries
-
NOT the same as Bloom-Filter — Bloom Filter answers "is this specific element in the set?" (membership query); HyperLogLog answers "how many distinct elements have been seen in total?" (cardinality estimation). These solve different problems with different space complexity characteristics. A Bloom Filter retains enough information to test individual elements; HyperLogLog discards element identity after hashing and can only estimate a count.
-
NOT the same as Distributed-Cache — HyperLogLog is a data structure stored IN a Redis instance; Distributed-Cache describes the caching infrastructure layer (cache-aside, write-through, TTL, cache stampede mitigation, Redis vs Memcached selection). HyperLogLog is one of the data types Redis exposes — not a description of the cache infrastructure itself.
Skeletal Snippet
// PRODUCTION NOTE: use Redis PFADD/PFCOUNT via ioredis — this snippet is for illustration only
// import Redis from 'ioredis'; const redis = new Redis(); await redis.pfadd('visitors:2026-03-26', userId);
// Conceptual: leading-zeros counting (the rho function)
function countLeadingZeros(n: number, numBits = 32): number {
if (n === 0) return numBits;
let count = 0;
for (let bit = numBits - 1; bit >= 0; bit--) {
if ((n >>> bit) & 1) break;
count++;
}
return count;
}
// Conceptual: register-based cardinality estimator (simplified, no bias correction)
class HyperLogLogSketch {
private registers: number[];
private k: number; // number of bits for register index
constructor(k = 10) { // 2^10 = 1024 registers
this.k = k;
this.registers = new Array(1 << k).fill(0);
}
add(hashedValue: number): void {
const registerIndex = hashedValue >>> (32 - this.k); // first k bits
const remainingBits = hashedValue << this.k;
const rho = countLeadingZeros(remainingBits, 32 - this.k) + 1;
this.registers[registerIndex] = Math.max(this.registers[registerIndex], rho);
}
estimate(): number {
const m = this.registers.length;
// Harmonic mean of 2^register_value across all registers
const harmonicSum = this.registers.reduce((sum, r) => sum + Math.pow(2, -r), 0);
const alphaMSquared = 0.7213 / (1 + 1.079 / m) * m * m; // alpha_m * m^2
return alphaMSquared / harmonicSum;
// NOTE: production HyperLogLog++ includes small/large cardinality corrections
}
}Pitfalls
The basic HyperLogLog algorithm overestimates cardinality for small sets (< 2.5 * m elements) and has precision limits at very large cardinalities approaching 2^32. Redis implements HyperLogLog++ (the extended algorithm by Heule, Nunkesser, and Hall, 2013) which includes small-range and large-range corrections. The snippet above is conceptual — it does NOT include these corrections. Always use Redis PFADD/PFCOUNT in production.
PFMERGE dest key1 key2 merges two HyperLogLog registers into a union estimate (|A ∪ B|). It does NOT compute intersection cardinality (|A ∩ B|). There is no direct HyperLogLog operation for intersection. Approximations require inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|, but this accumulates estimation errors and should not be used for small intersections where error dominates the signal.
Existing Pattern Connections
| Note | Relationship |
|---|---|
| Bloom-Filter | Both are probabilistic space-efficient data structures; HyperLogLog counts distinct elements, Bloom Filter tests membership — different problems, different space complexity |
| Distributed-Cache | Redis stores HyperLogLog as a native data type (PFADD/PFCOUNT); Distributed-Cache describes the infrastructure layer where these operations execute |
Backlinks
- Algorithms-MOC — listed in System Design Application Index: O(log log n) cardinality estimation via Redis PFADD/PFCOUNT
- Distributed-Cache — Redis stores HyperLogLog as a native data type; PFADD/PFCOUNT commands operate on these registers
- Bloom-Filter — both are probabilistic space-efficient data structures; Bloom Filter answers membership, HyperLogLog answers cardinality