HyperLogLog

HyperLogLog

A probabilistic cardinality estimation algorithm that answers "how many distinct elements are in a stream?" using O(log log n) space — enabling unique-count queries on billions of elements with ~1% error at just 12 KB of memory (Redis implementation).

When NOT to Use

  1. When exact counts are required (financial transactions, compliance reporting) — HyperLogLog has inherent error (~0.81% at 16384 registers). A 0.81% error on 10 million users means an uncertainty of ±81,000 — unacceptable when regulatory audits require precise figures.
  2. When you need to answer membership queries ("is element X in the set?") — use Bloom-Filter instead. HyperLogLog cannot test whether a specific element was seen; it can only estimate how many distinct elements were seen. Bloom Filter solves membership; HyperLogLog solves cardinality.
  3. When the cardinality is small (< 1000 elements) — a simple HashSet is cheaper, exact, and easier to reason about. HyperLogLog's value emerges at millions+ cardinality where storing all elements would require gigabytes of memory.

Core Mechanism

How It Works

  1. Hash function maps each incoming element to a binary string (e.g., hash("user_42") = 00101101...).
  2. Extract register index from the first k bits of the hash → selects which of the 2^k registers (buckets) to update.
  3. Count leading zeros (the rho function, ρ) in the remaining bits of the hash. More leading zeros = rarer event = evidence of larger cardinality.
  4. Update register M[j] = max(M[j], ρ) — each register tracks the maximum leading-zero count seen so far for that bucket.
  5. Estimate cardinality using the harmonic mean across all 2^k registers: the harmonic mean corrects for outlier buckets that happened to see an element with many leading zeros.
  6. Apply bias correction constant α_m to remove systematic bias in the harmonic mean estimator.

Error Rate

Standard error = 1.04 / sqrt(m) where m = number of registers.

Registers (m)Error rateMemory (at 6 bits/register)
256~6.5%~192 bytes
1024~3.25%~768 bytes
4096~1.63%~3 KB
16384~0.81%~12 KB (Redis default)

Redis uses 16384 registers (12 KB) to achieve ~0.81% standard error regardless of whether the cardinality is 1000 or 1 billion.

Component Diagram

flowchart TD
    A[Input element] --> B[Hash function h]
    B --> C["Binary string e.g. 00101..."]
    C --> D["Extract register index\nfirst k bits → bucket j"]
    C --> E["Count leading zeros\nin remaining bits → ρ"]
    D --> F["Register M_j"]
    E --> F
    F --> G{"ρ > M_j?"}
    G -->|Yes| H["Update M_j = ρ"]
    G -->|No| I[No update]
    H --> J["Harmonic mean of\nall 2^k registers"]
    I --> J
    J --> K["Apply bias correction\nalpha_m constant"]
    K --> L["Cardinality estimate n̂"]

System Design Application

Redis PFADD / PFCOUNT — Redis natively implements HyperLogLog as a first-class data type. PFADD key element adds an element; PFCOUNT key returns the estimated distinct count. The 12 KB maximum footprint is fixed regardless of cardinality.

Common use cases in production systems:

  • Unique visitor counts: "How many distinct users visited this page today?" — streaming billions of events; storing each user ID would require gigabytes. HyperLogLog counts distinct users at ~12 KB per counter.
  • Distinct query counts: Log all search queries; count unique queries per hour without storing each query string. Used in search analytics pipelines.
  • Distinct item views in analytics: "How many unique users viewed product SKU-123 this week?" — a per-product HyperLogLog counter, merged across shards with PFMERGE.

The infrastructure layer where these Redis HyperLogLog counters live is described in Distributed-Cache — Redis is a distributed in-memory cache/store that exposes PFADD/PFCOUNT as native commands.

Scope Boundaries

  • NOT the same as Bloom-Filter — Bloom Filter answers "is this specific element in the set?" (membership query); HyperLogLog answers "how many distinct elements have been seen in total?" (cardinality estimation). These solve different problems with different space complexity characteristics. A Bloom Filter retains enough information to test individual elements; HyperLogLog discards element identity after hashing and can only estimate a count.

  • NOT the same as Distributed-Cache — HyperLogLog is a data structure stored IN a Redis instance; Distributed-Cache describes the caching infrastructure layer (cache-aside, write-through, TTL, cache stampede mitigation, Redis vs Memcached selection). HyperLogLog is one of the data types Redis exposes — not a description of the cache infrastructure itself.

Skeletal Snippet

// PRODUCTION NOTE: use Redis PFADD/PFCOUNT via ioredis — this snippet is for illustration only
// import Redis from 'ioredis'; const redis = new Redis(); await redis.pfadd('visitors:2026-03-26', userId);
 
// Conceptual: leading-zeros counting (the rho function)
function countLeadingZeros(n: number, numBits = 32): number {
  if (n === 0) return numBits;
  let count = 0;
  for (let bit = numBits - 1; bit >= 0; bit--) {
    if ((n >>> bit) & 1) break;
    count++;
  }
  return count;
}
 
// Conceptual: register-based cardinality estimator (simplified, no bias correction)
class HyperLogLogSketch {
  private registers: number[];
  private k: number; // number of bits for register index
 
  constructor(k = 10) { // 2^10 = 1024 registers
    this.k = k;
    this.registers = new Array(1 << k).fill(0);
  }
 
  add(hashedValue: number): void {
    const registerIndex = hashedValue >>> (32 - this.k); // first k bits
    const remainingBits = hashedValue << this.k;
    const rho = countLeadingZeros(remainingBits, 32 - this.k) + 1;
    this.registers[registerIndex] = Math.max(this.registers[registerIndex], rho);
  }
 
  estimate(): number {
    const m = this.registers.length;
    // Harmonic mean of 2^register_value across all registers
    const harmonicSum = this.registers.reduce((sum, r) => sum + Math.pow(2, -r), 0);
    const alphaMSquared = 0.7213 / (1 + 1.079 / m) * m * m; // alpha_m * m^2
    return alphaMSquared / harmonicSum;
    // NOTE: production HyperLogLog++ includes small/large cardinality corrections
  }
}

Pitfalls

HyperLogLog++ bias correction for small cardinalities

The basic HyperLogLog algorithm overestimates cardinality for small sets (< 2.5 * m elements) and has precision limits at very large cardinalities approaching 2^32. Redis implements HyperLogLog++ (the extended algorithm by Heule, Nunkesser, and Hall, 2013) which includes small-range and large-range corrections. The snippet above is conceptual — it does NOT include these corrections. Always use Redis PFADD/PFCOUNT in production.

PFMERGE computes union, not intersection

PFMERGE dest key1 key2 merges two HyperLogLog registers into a union estimate (|A ∪ B|). It does NOT compute intersection cardinality (|A ∩ B|). There is no direct HyperLogLog operation for intersection. Approximations require inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|, but this accumulates estimation errors and should not be used for small intersections where error dominates the signal.

Existing Pattern Connections

NoteRelationship
Bloom-FilterBoth are probabilistic space-efficient data structures; HyperLogLog counts distinct elements, Bloom Filter tests membership — different problems, different space complexity
Distributed-CacheRedis stores HyperLogLog as a native data type (PFADD/PFCOUNT); Distributed-Cache describes the infrastructure layer where these operations execute
  • Algorithms-MOC — listed in System Design Application Index: O(log log n) cardinality estimation via Redis PFADD/PFCOUNT
  • Distributed-Cache — Redis stores HyperLogLog as a native data type; PFADD/PFCOUNT commands operate on these registers
  • Bloom-Filter — both are probabilistic space-efficient data structures; Bloom Filter answers membership, HyperLogLog answers cardinality