The Six Dimensions of Data Quality
Data quality is not a single number. It decomposes into six measurable dimensions. Each dimension exposes a different class of problem. Ignoring any one of them leaves a blind spot in your pipeline.
The Six Dimensions
1. Completeness — /dimensions/completeness
Definition: The proportion of expected values that are present and non-null.
Formula: completeness = 1 − (null_count / total_rows)
Real example: An email column in a CRM table has 4,200 nulls out of 50,000 rows. Completeness = 1 − 0.084 = 0.916. Downstream email campaigns silently drop those 4,200 contacts.
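The completeness formula can be sketched in a few lines of plain Python. This is a minimal illustration of the metric, not DQ's implementation; the column data below reproduces the CRM example:

```python
def completeness(values):
    """Fraction of entries that are present (non-None, non-empty)."""
    nulls = sum(1 for v in values if v is None or v == "")
    return 1 - nulls / len(values)

# CRM email column from the example: 4,200 nulls out of 50,000 rows
emails = [None] * 4_200 + ["user@example.com"] * 45_800
completeness(emails)  # ≈ 0.916
```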
2. Uniqueness — /dimensions/uniqueness
Definition: The proportion of values in a column (or column combination) that are distinct.
Formula: uniqueness = distinct_count / total_rows
Real example: A transaction_id column expected to be a primary key has 312 duplicates. Uniqueness < 1.0 flags a likely ETL re-run bug. DQ uses HyperLogLog (Flajolet et al., 2007) for large-table estimates — see the HLL explainer.
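For tables that fit in memory, the uniqueness ratio reduces to a set count; the `Counter` pass also surfaces the duplicated keys themselves. A minimal sketch with illustrative IDs (the HLL estimation DQ uses at scale is a separate topic):

```python
from collections import Counter

def uniqueness(values):
    """Ratio of distinct values to total rows; 1.0 for a valid key."""
    return len(set(values)) / len(values)

txn_ids = ["t1", "t2", "t3", "t2"]  # "t2" loaded twice, e.g. by an ETL re-run
score = uniqueness(txn_ids)  # 0.75
dupes = [v for v, c in Counter(txn_ids).items() if c > 1]  # ["t2"]
```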
3. Validity — /dimensions/validity
Definition: The proportion of values that conform to an expected format, type, or value set.
Formula: validity = valid_count / total_rows
Real example: A country_code column should contain ISO 3166-1 alpha-2 codes. Values like "USA", "colombia", and "" are invalid. Validity = 0.73 means 27% of rows will fail to match when joined against the ISO reference table.
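A format-level validity check can be expressed as a regex match ratio. Note this sketch only checks the two-uppercase-letter shape; a complete rule would also test membership in the actual ISO 3166-1 code set, which is omitted here:

```python
import re

ALPHA2 = re.compile(r"[A-Z]{2}")  # shape only, not ISO set membership

def validity(values, pattern=ALPHA2):
    """Fraction of values that fully match the expected format."""
    return sum(1 for v in values if pattern.fullmatch(v)) / len(values)

codes = ["CO", "US", "USA", "colombia", ""]
validity(codes)  # 0.4 — only "CO" and "US" conform
```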
4. Consistency — /dimensions/consistency
Definition: Agreement between two related values, either within a row or across tables.
Formula: consistency = consistent_row_count / total_rows
Real example: order_total = sum(line_items.amount) should hold for every row. When 1.2 % of orders fail this check, a rounding or currency-conversion bug is usually to blame.
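A cross-row consistency rule like this one compares each order's stored total against the sum of its line items, within a small tolerance to absorb legitimate rounding. A minimal sketch with hypothetical dict-shaped inputs (not DQ's rule syntax):

```python
def consistency(orders, line_items, tol=0.005):
    """orders: {order_id: order_total}; line_items: {order_id: [amounts]}.
    Returns the fraction of orders whose total matches its line-item sum."""
    ok = sum(
        1 for oid, total in orders.items()
        if abs(total - sum(line_items.get(oid, []))) <= tol
    )
    return ok / len(orders)

orders = {"o1": 30.00, "o2": 19.99}
items = {"o1": [10.00, 20.00], "o2": [10.00, 9.98]}  # o2 off by a cent
consistency(orders, items)  # 0.5
```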
5. Integrity — /dimensions/integrity
Definition: The proportion of foreign-key references that resolve to an existing parent record.
Formula: integrity = matched_fk_count / total_fk_count
Real example: orders.customer_id references customers.id. After a bulk delete of inactive customers, 8,000 orphaned orders remain. Integrity = 0.84, so 16% of orders silently disappear from any inner join with customers.
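The integrity formula is a membership test of each foreign key against the parent key set. A minimal sketch with illustrative IDs:

```python
def integrity(fk_values, parent_keys):
    """Fraction of foreign keys that resolve to an existing parent row."""
    parents = set(parent_keys)  # O(1) lookups
    return sum(1 for fk in fk_values if fk in parents) / len(fk_values)

customer_ids = [1, 2, 3, 99]  # 99 points at a deleted customer
integrity(customer_ids, [1, 2, 3])  # 0.75
```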
6. Accuracy — /dimensions/accuracy
Definition: The degree to which values reflect the true real-world state. Requires a ground-truth reference.
Formula: accuracy = matching_value_count / total_rows (vs. reference)
Real example: A city column contains "Bogota", "BOG", "Bogotá" — all representing the same city. Without normalization, GROUP BY city produces fragmented results. DQ's remediation engine clusters these via rapidfuzz and a reference dictionary.
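The clustering idea can be sketched with the standard library's `difflib` as a stand-in for rapidfuzz (this is not DQ's remediation engine; the reference dictionary and cutoff below are illustrative assumptions):

```python
import difflib

REFERENCE = ["Bogotá", "Medellín", "Cali"]  # illustrative reference dictionary

def normalize(value, reference=REFERENCE, cutoff=0.6):
    """Map a raw value to its closest reference entry, or leave it unchanged."""
    folded = {r.casefold(): r for r in reference}  # case-insensitive matching
    match = difflib.get_close_matches(value.casefold(), list(folded),
                                      n=1, cutoff=cutoff)
    return folded[match[0]] if match else value

[normalize(v) for v in ["Bogota", "BOG", "Bogotá"]]  # all map to "Bogotá"
```

With the variants collapsed onto one canonical spelling, the fragmented GROUP BY city buckets merge back into a single group.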
How DQ Uses All Six
DQ computes all six dimensions on every profiling run. Each dimension gets a score from 0 to 1, weighted into a composite score. Rules can be set per-dimension, per-column, or at the table level.
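A composite of per-dimension scores is, in the simplest case, a weighted mean. DQ's actual weighting scheme is not specified here; this sketch assumes equal weights by default and accepts an optional override:

```python
def composite(scores, weights=None):
    """Weighted mean of per-dimension scores in [0, 1].
    scores: {dimension: score}; weights default to equal."""
    if weights is None:
        weights = {d: 1.0 for d in scores}
    return sum(scores[d] * weights[d] for d in scores) / sum(weights.values())

scores = {"completeness": 0.916, "uniqueness": 0.99, "validity": 0.73,
          "consistency": 0.988, "integrity": 0.84, "accuracy": 0.95}
composite(scores)  # equal-weight mean of the six scores
```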
Browse each dimension in detail: Completeness · Uniqueness · Validity · Consistency · Integrity · Accuracy.
See /pricing for plan limits on how many tables DQ can monitor simultaneously.
FAQ
Q: Which dimension causes the most downstream failures in practice? A: Validity and Completeness account for the majority of pipeline failures observed in production. Integrity failures are less frequent but catastrophic when they occur.
Q: Can DQ check Consistency across two separate databases? A: Yes. DQ supports cross-source consistency rules when both sources are registered as datasources in the same workspace.
Q: How often should dimension scores be recomputed? A: For transactional tables, daily is a reasonable baseline. DQ supports per-table schedules down to hourly.
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.