Integrity — Data Quality Dimension
Integrity
Referential integrity failures are silent killers. A table with perfect completeness, uniqueness, and validity scores can still return empty results on every JOIN if its foreign keys point to records that no longer exist. Integrity is the dimension that catches orphaned rows.
Definition
Integrity measures the proportion of foreign key values that resolve to an existing record in the referenced (parent) table.
integrity = matched_fk_count / total_fk_count
A score of 1.0 means every foreign key reference has a corresponding parent record. Any score below 1.0 means orphaned rows exist — rows that reference parents that are missing, deleted, or never existed.
Integrity also covers broader structural relationships: a user_id in an events table that references no row in the users table, or a product_sku in an orders table that references no row in the product catalog.
How DQ Measures It
DQ checks integrity by performing a reference lookup for each foreign key relationship defined in the workspace:
- During profiling, DQ reads
information_schemato discover declared foreign key constraints automatically. - For tables in data warehouses (where FK constraints are not enforced at the database level), DQ infers likely relationships from column names and types, and surfaces them for user confirmation.
- For each confirmed relationship, DQ counts unmatched values: child rows whose FK value does not appear in the parent's primary key column.
-- DQ generates and executes a check equivalent to:
SELECT COUNT(*) AS orphaned_rows
FROM child_table c
LEFT JOIN parent_table p ON c.foreign_key = p.primary_key
WHERE p.primary_key IS NULL
AND c.foreign_key IS NOT NULL;
The result is orphaned_rows / total_child_rows subtracted from 1.0.
Integrity rules can be set cross-source: orders.customer_id checked against crm.customers.id even when the two tables are in different registered datasources.
Common Failure Modes
Bulk deletes without cascade. A cleanup job deletes inactive customer records from the customers table. Referential integrity constraints are not enforced in the warehouse. Orders, events, and activity logs for those customers become orphaned. Every customer-facing aggregate report now undercounts or errors.
Import order errors. A data migration loads child records before parent records are loaded. Integrity is 0.0 during the load window. If monitoring runs during this window, spurious alerts fire. DQ handles this by supporting an "expected loading window" flag that suppresses alerts during defined maintenance periods.
Soft deletes treated as hard deletes by downstream. A users table uses deleted_at for soft deletion. A downstream table adds WHERE deleted_at IS NULL to its source query, effectively hiding records that child tables still reference.
Cross-environment ID drift. Production IDs are copied to a staging environment, but staging has additional test records with the same ID range. A foreign key that resolves in production fails in staging due to ID collision or absence.
For a full overview of all six dimensions, see /blog/six-dimensions-of-data-quality.
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.