Validity — Data Quality Dimension
Validity
A column can be fully complete and uniquely populated yet entirely invalid. "not-an-email", "9999-99-99", and "USD_DOLLAR" are present, non-null, and unique — and all three are wrong. Validity catches the class of problems that completeness and uniqueness cannot.
Definition
Validity measures the proportion of values that conform to an expected format, type, allowable value set, or business rule.
validity = valid_count / total_rows
A score of 1.0 means every value passes all validity rules for that column. Validity rules include: regex patterns, type constraints, range checks, enum membership, date format checks, and cross-field logical constraints.
How DQ Measures It
During profiling, DQ infers validity rules automatically for each column:
- Type inference: Detects whether a string column actually contains integers, dates, UUIDs, or emails, and flags values that deviate.
- Format detection: Identifies patterns (ISO dates, phone numbers, RFC-5322 emails, ISO 3166 country codes) and computes the proportion of values that match.
- Range checks: For numeric columns, DQ detects statistical outliers (Isolation Forest, Liu et al., 2008) and values outside a learned [min, max] range.
- Enum detection: For low-cardinality columns, DQ surfaces the observed value set and lets users lock it as an allowable enum.
Inferred rules can be reviewed and promoted to persistent checks with one click. Custom rules (regex, SQL expressions) can also be authored directly.
Example validity rule:
column: country_code
rule: matches_regex
pattern: "^[A-Z]{2}$"
# ISO 3166-1 alpha-2: two uppercase letters only
Common Failure Modes
Free-text entry into structured fields. A city field accepts free text in a web form. Users enter "NYC", "New York City", "new york", "N.Y.". All are present and unique. None match the canonical "New York". See /blog/repairing-dirty-cities-and-countries for the remediation approach.
Date format inconsistency. One upstream system sends YYYY-MM-DD; another sends MM/DD/YYYY. After union, half the date column fails ISO format checks. Date arithmetic on the mixed column produces incorrect or null results.
Numeric overflow stored as string. A BigDecimal from a Java service is serialized as scientific notation: "1.234e7". The downstream column expects "12340000". The value is technically correct but fails the expected format.
Enum drift. A status column is validated against ["active", "inactive", "pending"]. A new engineer adds "suspended" to the application code without updating the reference set. DQ flags the new value as invalid until the rule is updated.
For a full overview of all six dimensions, see /blog/six-dimensions-of-data-quality.
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.