HyperLogLog Explained for Data Engineers
Counting distinct values is the foundation of every uniqueness check. SELECT COUNT(DISTINCT id) looks innocent. On a 5-billion-row table, it requires hashing every row and keeping every distinct value in an in-memory hash set, which is O(n) space in the number of distinct values. On a warehouse with hundreds of large tables, that is a budget and latency problem.
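For contrast, the exact approach boils down to something like the following sketch (the file of one ID per line is illustrative); every distinct value stays resident in memory at once:

```python
# Exact distinct count: the whole distinct set is held in memory simultaneously.
# Fine for small tables; O(n) space in the number of distinct values otherwise.
distinct_ids = set()
with open("customer_ids.txt") as f:  # illustrative file, one ID per line
    for line in f:
        distinct_ids.add(line.strip())

print(f"Exact distinct customer IDs: {len(distinct_ids):,}")
```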
HyperLogLog (Flajolet, Fusy, Gandouet, Meunier, 2007) solves this with a probabilistic data structure that uses kilobytes of memory regardless of cardinality, with a tunable error bound of approximately 1.04 / √m where m is the number of registers.
The Core Idea: Leading Zeros of Hash Values
Take a hash function that maps values uniformly to fixed-length binary strings. For each hashed value, count how many zero bits appear before the first 1 bit (the "leading-zero count"). In a truly random stream:
- The probability of seeing 0 leading zeros is 1/2.
- The probability of seeing 1 leading zero is 1/4.
- The probability of seeing k leading zeros is 1/2^(k+1).
If the maximum leading-zero count observed so far is k_max, then 2^(k_max) is a rough estimate of the number of distinct elements. This is the LogLog idea from Durand and Flajolet (2003).
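A minimal sketch of that estimator, using SHA-1 truncated to 32 bits as an illustrative hash (not what production implementations necessarily use):

```python
import hashlib

def leading_zeros(value: bytes, bits: int = 32) -> int:
    """Leading zero bits in a 32-bit hash of value."""
    h = int.from_bytes(hashlib.sha1(value).digest()[:4], "big")
    return bits - h.bit_length()  # zeros before the first 1 bit

k_max = 0
for i in range(100_000):  # a stream with 100,000 distinct values
    k_max = max(k_max, leading_zeros(str(i).encode()))

print(2 ** k_max)  # very rough estimate; a single observer has huge variance
```

Because k_max only moves in whole-bit steps, a single estimate can easily be off by a factor of two in either direction, which is exactly what the multiple registers below fix.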
HyperLogLog improves accuracy by using m independent sub-streams (registers), applying the harmonic mean to the 2^(k_max) estimates across registers, and applying bias corrections for small and large cardinality ranges.
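Putting those pieces together, here is a toy, unoptimized version of the algorithm under the same illustrative hashing; real implementations also apply the small-range (linear counting) and large-range corrections, which this sketch omits:

```python
import hashlib

p = 10                       # 2^10 = 1024 registers
m = 1 << p
registers = [0] * m

def add(value: bytes) -> None:
    h = int.from_bytes(hashlib.sha1(value).digest()[:8], "big")  # 64-bit hash
    idx = h >> (64 - p)                        # first p bits choose the register
    rest = h & ((1 << (64 - p)) - 1)           # remaining bits feed the rank
    rank = (64 - p) - rest.bit_length() + 1    # position of the first 1 bit
    registers[idx] = max(registers[idx], rank)

def estimate() -> float:
    alpha = 0.7213 / (1 + 1.079 / m)           # standard bias correction for m >= 128
    harmonic = sum(2.0 ** -r for r in registers)
    return alpha * m * m / harmonic            # harmonic mean across registers

for i in range(1_000_000):
    add(str(i).encode())

print(round(estimate()))   # typically lands within a few percent of 1,000,000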
Error Bound and Register Count
| Registers (m) | Memory | Std Error |
|---|---|---|
| 16 | 16 bytes | ~26 % |
| 512 | 512 bytes | ~4.6 % |
| 4096 | 4 KB | ~1.6 % |
| 65536 | 64 KB | ~0.4 % |
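The standard-error column is just the 1.04 / √m formula from above evaluated at each register count; a quick check (output shown as comments):

```python
from math import sqrt

for m in (16, 512, 4096, 65536):
    print(f"m={m:>6}: std error ~{1.04 / sqrt(m):.1%}")
# m=    16: std error ~26.0%
# m=   512: std error ~4.6%
# m=  4096: std error ~1.6%
# m= 65536: std error ~0.4%
```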
For data quality purposes, 1–2 % error is acceptable when checking whether a primary key column has near-100 % uniqueness. Exact counts are unnecessary.
Python Example with datasketch
```python
from datasketch import HyperLogLog

hll = HyperLogLog(p=14)  # 2^14 = 16,384 registers, ~0.8% standard error

with open("customer_ids.txt") as f:
    for line in f:
        hll.update(line.strip().encode("utf8"))

estimated_distinct = hll.count()
print(f"Estimated distinct customer IDs: {estimated_distinct:,.0f}")
```
p=14 uses 16,384 registers and roughly 16 KB of RAM. You can estimate the cardinality of 5 billion rows in a single pass without loading anything into a hash set.
Merging HLL Sketches
One underused property: two HLL sketches built on different partitions can be merged losslessly by taking the register-wise maximum, and the merged sketch estimates the cardinality of the union. This means you can compute warehouse-wide uniqueness by combining per-partition sketches, with no full-table scan required.
```python
# hll_partition_1 and hll_partition_2 are sketches built per partition,
# exactly as in the example above (same p on all sketches).
hll_combined = HyperLogLog(p=14)
hll_combined.merge(hll_partition_1)
hll_combined.merge(hll_partition_2)
print(hll_combined.count())
```
Where HLL Fits in Practice
HLL is used in: Presto/Trino's approx_distinct(), PostgreSQL's hll extension, Redis's PFADD/PFCOUNT, BigQuery's APPROX_COUNT_DISTINCT(), and Apache Spark's approx_count_distinct().
For more on the uniqueness dimension, see /dimensions/uniqueness and the broader six-dimensions overview.
DQ uses HyperLogLog for every uniqueness check — no full-table scans.
FAQ
Q: When should I use exact COUNT(DISTINCT) instead of HLL? A: When cardinality is small (under ~100,000) and the table fits in memory, exact counts are fine. Use HLL when profiling large tables on a schedule.
Q: Can HLL undercount? A: Estimates can fall slightly below or above the true count, within the stated error bound. At very low cardinalities (small relative to the register count) the raw estimator is actually biased upward, which is why most implementations automatically switch to linear counting for the small range.
Q: Is HLL the same as MinHash? A: No. MinHash (Broder, 1997) estimates set similarity (Jaccard index), not cardinality. Both use hashing tricks but solve different problems. See /blog/auto-cataloging-with-minhash-and-jaccard.
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.