Auto-Cataloging with MinHash and Jaccard Similarity

A data warehouse with 200 tables accumulates semantic duplicates: customer_email, email_address, cust_email, usr_email. Same data, different names, no catalog entry tying them together. Downstream joins and deduplication logic silently diverge.

Manual cataloging does not scale. The solution is MinHash (Broder, 1997) combined with Locality-Sensitive Hashing (LSH) to find near-duplicate columns automatically.

Jaccard Similarity

The Jaccard index between two sets A and B is:

J(A, B) = |A ∩ B| / |A ∪ B|

For two columns, treat their value sets as A and B. J = 1.0 means identical value sets. J = 0.0 means no overlap. For near-duplicate detection, a threshold of 0.85–0.95 works well in practice.
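
As a baseline, the exact computation is a one-liner over Python sets (a minimal sketch; the two toy value sets are illustrative):

def jaccard(a: set, b: set) -> float:
    # Exact Jaccard index: shared values over all distinct values.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

a = {"ann@x.com", "bob@y.com", "cat@z.com"}
b = {"ann@x.com", "bob@y.com", "dan@w.com"}
print(jaccard(a, b))  # 0.5: two shared values out of four distinct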

Computing Jaccard exactly requires materializing both value sets: O(n) work per pair, and O(n × m²) across all pairs of m columns. At warehouse scale this is impractical.

MinHash: Approximate Jaccard in O(k) Time

MinHash (Broder, 1997) approximates Jaccard with k hash functions. For each column:

  1. Hash every value with k independent hash functions.
  2. Keep the minimum hash per function: this is the MinHash signature (a k-length vector).

The key theorem: for a single hash function, the probability that two columns share the same minimum hash equals their Jaccard similarity. Averaging agreement across k functions therefore gives an unbiased estimate with variance J(1-J)/k.
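
To see the estimator at work, here is a minimal from-scratch sketch. It salts BLAKE2 to simulate k independent hash functions, which is fine for illustration; production libraries such as datasketch instead derive k permutations from a single base hash.

import hashlib

def minhash_signature(values, k=128):
    # One entry per hash function: the minimum hash over all values.
    sig = []
    for i in range(k):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(hashlib.blake2b(salt + str(v).encode(), digest_size=8).digest(), "big")
            for v in values
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {f"user{i}@example.com" for i in range(1000)}
b = {f"user{i}@example.com" for i in range(100, 1100)}
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
# ≈ 0.82 (true Jaccard: 900 shared / 1100 distinct ≈ 0.818)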

LSH: Scale to Thousands of Columns

With LSH, split the k-length signature into b bands of r rows each (k = b × r). Two columns become candidates if they match on at least one full band. This cuts candidate generation from O(m²) pairwise comparisons to near-linear time in the number of columns.
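
The banding trade-off has a closed form: two columns with Jaccard similarity J share at least one band with probability 1 − (1 − J^r)^b. A quick sketch of that S-curve, using b = 8 bands of r = 16 rows (one possible split of k = 128; datasketch picks b and r for you from the threshold):

def candidate_probability(j, b=8, r=16):
    # P(at least one of b bands matches), where a band matches iff all r rows agree.
    return 1 - (1 - j**r) ** b

for j in (0.95, 0.85, 0.60):
    print(f"J={j:.2f}: P(candidate) = {candidate_probability(j):.3f}")
# J=0.95 → 0.990, J=0.85 → 0.461, J=0.60 → 0.002

The datasketch library wraps signature construction, banding, and querying: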

from datasketch import MinHash, MinHashLSH

# column_value_sets: {column_name: set of distinct values}, produced by profiling
lsh = MinHashLSH(threshold=0.85, num_perm=128)
minhashes = {}

for col_name, values in column_value_sets.items():
    m = MinHash(num_perm=128)  # num_perm must match the LSH index
    for v in values:
        m.update(str(v).encode("utf8"))
    lsh.insert(col_name, m)    # index this column's signature
    minhashes[col_name] = m

# Query: find columns similar to customer_email
results = lsh.query(minhashes["customer_email"])
print(results)
# → ['customer_email', 'email_address', 'cust_email', 'usr_email']

Real Example: customer_email vs email_address

In a real-world SaaS warehouse with 11 tables containing email-like columns:

| Column | Table | Jaccard vs customer_email |
|---|---|---|
| customer_email | customers | 1.00 |
| email_address | subscriptions | 0.97 |
| cust_email | legacy_users | 0.91 |
| usr_email | auth_log | 0.87 |
| contact_email | leads | 0.62 |

The first four columns are near-duplicates. The fifth comes from a different domain (leads, not registered users) and is correctly excluded at the 0.85 threshold.

This is the information DQ surfaces in the catalog view: column families, inferred semantic types, and similarity scores — without any manual tagging.

Limitations

MinHash operates on value sets, not column names or schemas. A column with low cardinality (e.g., a status enum) will produce false positives when compared against another low-cardinality column with overlapping values. DQ applies a minimum cardinality filter (default: 50 distinct values) before including a column in LSH indexing.
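
A sketch of that pre-filter, reusing the column_value_sets mapping from the example above (the function name is illustrative, not DQ's API):

MIN_CARDINALITY = 50  # default: status enums, booleans, and flags fall below this

def lsh_eligible(column_value_sets, min_cardinality=MIN_CARDINALITY):
    # Keep only columns with enough distinct values for a meaningful estimate.
    return {
        name: values
        for name, values in column_value_sets.items()
        if len(values) >= min_cardinality
    }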


FAQ

Q: How many hash functions (num_perm) should I use?
A: 128 is the standard default. More permutations reduce variance but increase memory and compute. For warehouse-scale profiling, 64–128 is a practical range.
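
The variance formula from earlier makes this concrete: the standard error of the estimate at similarity J is sqrt(J(1-J)/k). For example, near J = 0.9:

import math

for k in (64, 128, 256):
    se = math.sqrt(0.9 * 0.1 / k)  # standard error of the MinHash estimate at J = 0.9
    print(f"num_perm={k}: ±{se:.3f}")
# ≈ ±0.037, ±0.027, ±0.019: halving the error costs 4x the permutations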

Q: Can MinHash detect column-level lineage (derived columns)?
A: Not directly. MinHash measures value overlap, not transformation history. For lineage from SQL, see /blog/data-lineage-from-sql-alone and /docs/lineage.

Q: Is there a risk of cataloging false positives at 0.85?
A: Yes. DQ surfaces candidates with confidence scores; a human confirms or rejects each cluster. Confirmed clusters persist in the catalog. Rejections raise the effective threshold for that column pair.


About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.