Blog
Data quality insights and engineering notes.
Auto-Cataloging with MinHash and Jaccard Similarity
Find near-duplicate columns across hundreds of tables using MinHash signatures and LSH. Columns like customer_email and email_address score 0.97 similarity, with no manual mapping.
Data Lineage from SQL Alone
You don't need an agent to trace data lineage. sqlglot's lineage() parses CREATE VIEW and SELECT statements to produce table-to-table and column-to-column edges.
Habeas Data Colombia: A Compliance Guide for Data Teams
What Ley 1581/2012 requires from data teams: sensitive data categories, consent obligations, data quality's role in compliance, and an 8-point auditor checklist.
Habeas Data Colombia: A Compliance Guide for Data Teams (Spanish edition)
What Ley 1581/2012 requires of data teams: sensitive data, consent, the role of data quality in compliance, and an 8-point audit checklist.
How to Monitor Data Quality Without Writing Code
Auto-profiling beats YAML rule files for time-to-value. Connect Postgres in 4 lines, get a full scorecard. No Great Expectations configs required.
HyperLogLog Explained for Data Engineers
Why exact distinct counting needs O(n) memory and breaks down at warehouse scale. HyperLogLog's leading-zero trick, the 1.04/√m standard error, and a Python datasketch example.
Mahalanobis Distance for Tabular Outlier Detection
Why Mahalanobis distance beats z-score for multivariate tabular data, the math behind it, pitfalls with singular covariance matrices, and a NumPy implementation.
Repairing Dirty City, Country, and Currency Fields
Cluster BOG, bogota, Bogotá, Bogt´a into canonical Bogotá using rapidfuzz, reference dictionaries, and Soundex/Metaphone. Production-grade remediation without manual mapping.
The Six Dimensions of Data Quality
Completeness, Uniqueness, Validity, Consistency, Integrity, Accuracy — definitions, formulas, and real examples for every data team.