
The Six Dimensions of Data Quality

Data quality is not a single number. It decomposes into six measurable dimensions. Each dimension exposes a different class of problem. Ignoring any one of them leaves a blind spot in your pipeline.

The Six Dimensions

1. Completeness — /dimensions/completeness

Definition: The proportion of expected values that are present and non-null.

Formula: completeness = 1 − (null_count / total_rows)

Real example: An email column in a CRM table has 4,200 nulls out of 50,000 rows. Completeness = 1 − 0.084 = 0.916. Downstream email campaigns silently drop those 4,200 contacts.
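As a sketch of the arithmetic (not DQ's internal implementation), completeness reduces to a null count:

```python
def completeness(values):
    """completeness = 1 - (null_count / total_rows)"""
    nulls = sum(1 for v in values if v is None)
    return 1 - nulls / len(values)

# Toy CRM email column: 2 nulls out of 5 rows -> 0.6
emails = ["a@x.com", None, "b@x.com", None, "c@x.com"]
print(completeness(emails))
```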


2. Uniqueness — /dimensions/uniqueness

Definition: The proportion of values in a column (or column combination) that are distinct.

Formula: uniqueness = distinct_count / total_rows

Real example: A transaction_id column expected to be a primary key has 312 duplicates. Uniqueness < 1.0 flags a likely ETL re-run bug. DQ uses HyperLogLog (Flajolet et al., 2007) for large-table estimates — see the HLL explainer.
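For a table that fits in memory, the exact ratio is a one-liner; the HLL estimate only matters at scale. A minimal sketch:

```python
def uniqueness(values):
    """uniqueness = distinct_count / total_rows"""
    return len(set(values)) / len(values)

# One duplicated transaction_id out of five rows -> 4 distinct / 5 total
txn_ids = [101, 102, 103, 103, 104]
print(uniqueness(txn_ids))
```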


3. Validity — /dimensions/validity

Definition: The proportion of values that conform to an expected format, type, or value set.

Formula: validity = valid_count / total_rows

Real example: A country_code column should contain ISO 3166-1 alpha-2 codes. Values like "USA", "colombia", and "" are invalid. Validity = 0.73 means 27% of rows will fail any join against a reference table.
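A minimal format check for the alpha-2 example might look like the sketch below (format only; a full validity rule would also test membership in the ISO value set):

```python
import re

ALPHA2 = re.compile(r"[A-Z]{2}")

def validity(values, pattern=ALPHA2):
    """validity = valid_count / total_rows"""
    valid = sum(1 for v in values if v is not None and pattern.fullmatch(v))
    return valid / len(values)

# "USA" (three letters), "colombia" (lowercase words), and "" all fail -> 2/5
codes = ["US", "CO", "USA", "colombia", ""]
print(validity(codes))
```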


4. Consistency — /dimensions/consistency

Definition: The proportion of rows in which two related values agree, either within a row or across tables.

Formula: consistency = consistent_row_count / total_rows

Real example: order_total = sum(line_items.amount) should hold for every row. When 1.2 % of orders fail this check, a rounding or currency-conversion bug is usually to blame.
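Sketched as a row-level check with a small tolerance for rounding (the tolerance value here is an assumption, not a DQ default):

```python
def consistency(orders, tol=0.01):
    """consistency = consistent_row_count / total_rows"""
    ok = sum(1 for total, items in orders if abs(total - sum(items)) <= tol)
    return ok / len(orders)

# The second order's total disagrees with its line items by 0.50 -> 2/3 consistent
orders = [
    (30.00, [10.00, 20.00]),
    (15.50, [10.00, 5.00]),
    (9.99,  [4.99, 5.00]),
]
print(consistency(orders))
```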


5. Integrity — /dimensions/integrity

Definition: The proportion of foreign-key references that resolve to an existing parent record.

Formula: integrity = matched_fk_count / total_fk_count

Real example: orders.customer_id references customers.id. After a bulk delete of inactive customers, 8,000 orphaned orders remain. Integrity = 0.84, breaking every customer-order join.
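The orphan check is a set-membership test against the parent key column. A minimal sketch, not DQ's actual scan:

```python
def integrity(fk_values, parent_keys):
    """integrity = matched_fk_count / total_fk_count"""
    parents = set(parent_keys)
    matched = sum(1 for fk in fk_values if fk in parents)
    return matched / len(fk_values)

# customer_id 9 was bulk-deleted from customers -> 3 of 4 references resolve
order_customer_ids = [1, 2, 3, 9]
customer_ids = [1, 2, 3]
print(integrity(order_customer_ids, customer_ids))
```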


6. Accuracy — /dimensions/accuracy

Definition: The degree to which values reflect the true real-world state. Requires a ground-truth reference.

Formula: accuracy = matching_value_count / total_rows (vs. reference)

Real example: A city column contains "Bogota", "BOG", "Bogotá" — all representing the same city. Without normalization, GROUP BY city produces fragmented results. DQ's remediation engine clusters these via rapidfuzz and a reference dictionary.
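The clustering idea can be sketched with the standard library, using difflib as a stand-in for rapidfuzz and a made-up three-entry reference dictionary:

```python
import difflib

REFERENCE = ["bogotá", "medellín", "cali"]  # hypothetical reference dictionary

def normalize_city(value, reference=REFERENCE, cutoff=0.6):
    """Map a messy value to its closest reference entry; leave it unchanged if nothing is close."""
    match = difflib.get_close_matches(value.lower(), reference, n=1, cutoff=cutoff)
    return match[0] if match else value

# "Bogota", "BOG", and "Bogotá" all collapse to one reference entry
for raw in ["Bogota", "BOG", "Bogotá"]:
    print(raw, "->", normalize_city(raw))
```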


How DQ Uses All Six

DQ computes all six dimensions on every profiling run. Each dimension gets a score from 0 to 1, weighted into a composite score. Rules can be set per-dimension, per-column, or at the table level.
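The weighting step amounts to a weighted average of the six per-dimension scores. A sketch with illustrative weights (assumptions, not DQ's defaults):

```python
WEIGHTS = {  # illustrative weights summing to 1.0; DQ's actual defaults may differ
    "completeness": 0.25, "uniqueness": 0.15, "validity": 0.20,
    "consistency": 0.15, "integrity": 0.15, "accuracy": 0.10,
}

def composite_score(scores, weights=WEIGHTS):
    """Weighted composite of the six per-dimension scores, each in [0, 1]."""
    return sum(scores[dim] * w for dim, w in weights.items())

scores = {"completeness": 0.916, "uniqueness": 0.99, "validity": 0.73,
          "consistency": 0.988, "integrity": 0.84, "accuracy": 0.95}
print(composite_score(scores))
```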

Browse each dimension in detail: Completeness · Uniqueness · Validity · Consistency · Integrity · Accuracy.

See /pricing for plan limits on how many tables DQ can monitor simultaneously.


FAQ

Q: Which dimension causes the most downstream failures in practice? A: Validity and Completeness account for the majority of pipeline failures observed in production. Integrity failures are less frequent but catastrophic when they occur.

Q: Can DQ check Consistency across two separate databases? A: Yes. DQ supports cross-source consistency rules when both sources are registered as datasources in the same workspace.

Q: How often should dimension scores be recomputed? A: For transactional tables, daily is a reasonable baseline. DQ supports per-table schedules down to hourly.


About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.