HyperLogLog Explained for Data Engineers
Counting distinct values is the foundation of every uniqueness check. SELECT COUNT(DISTINCT id) looks innocent. On a 5-billion-row table, it requires hashing every row and keeping every distinct value in an in-memory hash set, which is O(n) space in the number of distinct values. On a warehouse with hundreds of large tables, that is a budget and latency problem.
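For contrast, the exact approach boils down to something like the following sketch (the file of one ID per line is illustrative); every distinct value stays resident in memory at once:

```python
# Exact distinct count: the whole distinct set is held in memory simultaneously.
# Fine for small tables; O(n) space in the number of distinct values otherwise.
distinct_ids = set()
with open("customer_ids.txt") as f:  # illustrative file, one ID per line
    for line in f:
        distinct_ids.add(line.strip())

print(f"Exact distinct customer IDs: {len(distinct_ids):,}")
```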
HyperLogLog (Flajolet, Fusy, Gandouet, Meunier, 2007) solves this with a probabilistic data structure that uses kilobytes of memory regardless of cardinality, with a tunable error bound of approximately 1.04 / √m where m is the number of registers.
The Core Idea: Leading Zeros of Hash Values
Take a hash function that maps values uniformly to fixed-length binary strings. For each hashed value, count how many zero bits appear before the first 1 bit (the "leading-zero count"). In a truly random stream:
- The probability of seeing 0 leading zeros is 1/2.
- The probability of seeing 1 leading zero is 1/4.
- The probability of seeing k leading zeros is 1/2^(k+1).
If the maximum leading-zero count observed so far is k_max, then 2^(k_max) is a rough estimate of the number of distinct elements. This is the LogLog idea from Durand and Flajolet (2003).
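A minimal sketch of that estimator, using SHA-1 truncated to 32 bits as an illustrative hash (not what production implementations necessarily use):

```python
import hashlib

def leading_zeros(value: bytes, bits: int = 32) -> int:
    """Leading zero bits in a 32-bit hash of value."""
    h = int.from_bytes(hashlib.sha1(value).digest()[:4], "big")
    return bits - h.bit_length()  # zeros before the first 1 bit

k_max = 0
for i in range(100_000):  # a stream with 100,000 distinct values
    k_max = max(k_max, leading_zeros(str(i).encode()))

print(2 ** k_max)  # very rough estimate; a single observer has huge variance
```

Because k_max only moves in whole-bit steps, a single estimate can easily be off by a factor of two in either direction, which is exactly what the multiple registers below fix.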
HyperLogLog improves accuracy by using m independent sub-streams (registers), applying the harmonic mean to the 2^(k_max) estimates across registers, and applying bias corrections for small and large cardinality ranges.
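Putting those pieces together, here is a toy, unoptimized version of the algorithm under the same illustrative hashing; real implementations also apply the small-range (linear counting) and large-range corrections, which this sketch omits:

```python
import hashlib

p = 10                       # 2^10 = 1024 registers
m = 1 << p
registers = [0] * m

def add(value: bytes) -> None:
    h = int.from_bytes(hashlib.sha1(value).digest()[:8], "big")  # 64-bit hash
    idx = h >> (64 - p)                        # first p bits choose the register
    rest = h & ((1 << (64 - p)) - 1)           # remaining bits feed the rank
    rank = (64 - p) - rest.bit_length() + 1    # position of the first 1 bit
    registers[idx] = max(registers[idx], rank)

def estimate() -> float:
    alpha = 0.7213 / (1 + 1.079 / m)           # standard bias correction for m >= 128
    harmonic = sum(2.0 ** -r for r in registers)
    return alpha * m * m / harmonic            # harmonic mean across registers

for i in range(1_000_000):
    add(str(i).encode())

print(round(estimate()))   # typically lands within a few percent of 1,000,000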
Error Bound and Register Count
| Registers (m) | Memory | Std Error |
|---|---|---|
| 16 | 16 bytes | ~26 % |
| 512 | 512 bytes | ~4.6 % |
| 4096 | 4 KB | ~1.6 % |
| 65536 | 64 KB | ~0.4 % |
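The standard-error column is just the 1.04 / √m formula from above evaluated at each register count; a quick check (output shown as comments):

```python
from math import sqrt

for m in (16, 512, 4096, 65536):
    print(f"m={m:>6}: std error ~{1.04 / sqrt(m):.1%}")
# m=    16: std error ~26.0%
# m=   512: std error ~4.6%
# m=  4096: std error ~1.6%
# m= 65536: std error ~0.4%
```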
For data quality purposes, 1–2 % error is acceptable when checking whether a primary key column has near-100 % uniqueness. Exact counts are unnecessary.
Python Example with datasketch
```python
from datasketch import HyperLogLog

hll = HyperLogLog(p=14)  # 2^14 = 16,384 registers, ~0.8% standard error

with open("customer_ids.txt") as f:
    for line in f:
        hll.update(line.strip().encode("utf8"))

estimated_distinct = hll.count()
print(f"Estimated distinct customer IDs: {estimated_distinct:,.0f}")
```
p=14 uses 16,384 registers and roughly 16 KB of RAM. You can estimate the cardinality of 5 billion rows in a single pass without loading anything into a hash set.
Merging HLL Sketches
One underused property: two HLL sketches built on different partitions can be merged losslessly by taking the register-wise maximum, and the merged sketch estimates the cardinality of the union. This means you can compute warehouse-wide uniqueness by combining per-partition sketches, with no full-table scan required.
```python
# hll_partition_1 and hll_partition_2 are sketches built per partition,
# exactly as in the example above (same p on all sketches).
hll_combined = HyperLogLog(p=14)
hll_combined.merge(hll_partition_1)
hll_combined.merge(hll_partition_2)
print(hll_combined.count())
```
Where HLL Fits in Practice
HLL is used in: Presto/Trino's approx_distinct(), PostgreSQL's hll extension, Redis's PFADD/PFCOUNT, BigQuery's APPROX_COUNT_DISTINCT(), and Apache Spark's approx_count_distinct().
For more on the uniqueness dimension, see /dimensions/uniqueness and the broader six-dimensions overview.
DQ uses HyperLogLog for every uniqueness check — no full-table scans.
FAQ
Q: When should I use exact COUNT(DISTINCT) instead of HLL? A: When cardinality is small (under ~100,000) and the table fits in memory, exact counts are fine. Use HLL when profiling large tables on a schedule.
Q: Can HLL undercount? A: Estimates can fall slightly below or above the true count, within the stated error bound. At very low cardinalities (small relative to the register count) the raw estimator is actually biased upward, which is why most implementations automatically switch to linear counting for the small range.
Q: Is HLL the same as MinHash? A: No. MinHash (Broder, 1997) estimates set similarity (Jaccard index), not cardinality. Both use hashing tricks but solve different problems. See /blog/auto-cataloging-with-minhash-and-jaccard.
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.