Accuracy — Data Quality Dimension
Accuracy is the hardest dimension to measure because it requires a ground truth. A column can be complete, unique, valid, consistent, and referentially intact — and still be factually wrong. price = 0.99 when the actual price is 9.99 passes every other dimension check. Only accuracy catches it.
Definition
Accuracy measures the degree to which values in a dataset reflect the true real-world state of the entities they represent.
accuracy = matching_value_count / total_rows
Where matching_value_count is the number of rows where the stored value agrees with a reference source (another table, an API response, a manually curated ground truth, or a derived logical truth).
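As a minimal sketch of the formula (the column values and reference mapping here are illustrative, not DQ's API):

```python
# Accuracy ratio: stored values compared row-by-row against a reference source.
stored = {"row1": "Bogota", "row2": "Medellin", "row3": "Bogata"}   # typo in row3
reference = {"row1": "Bogota", "row2": "Medellin", "row3": "Bogota"}

matching_value_count = sum(1 for k, v in stored.items() if reference.get(k) == v)
accuracy = matching_value_count / len(stored)
print(accuracy)  # 2 of 3 rows agree with the reference
```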
When no external reference is available, accuracy is estimated statistically: distribution checks (Kolmogorov-Smirnov), Benford's law analysis for numeric columns whose first-digit distribution is expected to be logarithmic, and outlier detection (Mahalanobis distance, Isolation Forest).
How DQ Measures It
DQ supports two accuracy measurement modes:
Reference-based accuracy. Compare column values against a curated reference table:
-- Example: compare stored city names against a canonical reference.
-- The join normalizes case and whitespace; a row counts as accurate
-- when a normalized match exists in the reference table.
SELECT
  COUNT(*) FILTER (WHERE r.canonical_city IS NOT NULL) AS accurate_rows,
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE r.canonical_city IS NOT NULL)::float
    / NULLIF(COUNT(*), 0) AS accuracy
FROM your_table t
LEFT JOIN city_reference r
  ON LOWER(TRIM(t.city)) = LOWER(TRIM(r.canonical_city));
This is the most precise method. DQ can join against any registered datasource as the reference.
Statistical accuracy estimation. For numeric columns without a reference, DQ applies:
- Benford's law: For naturally occurring numbers (amounts, populations, counts), the first significant digit should follow a log distribution. Deviation flags potential fabrication or systemic data entry errors.
- Distribution drift: Kolmogorov-Smirnov test comparing the current column distribution against the baseline established at first profile. Significant drift triggers an accuracy warning.
- Mahalanobis distance: See /blog/mahalanobis-distance-for-tabular-outliers for the full explanation.
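The first two checks can be sketched with the standard library alone. This is a rough illustration of the underlying math, not DQ's implementation; function names and thresholds are assumptions.

```python
import bisect
import math
from collections import Counter

def benford_deviation(values):
    """Chi-square statistic of observed first-digit frequencies against
    Benford's expected log10(1 + 1/d) distribution (larger = more deviant)."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    n = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

def ks_statistic(baseline, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    def ecdf(xs, x):
        # fraction of xs less than or equal to x
        return bisect.bisect_right(xs, x) / len(xs)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Identical distributions show no drift; disjoint ones drift maximally.
print(ks_statistic([1, 2, 3], [1, 2, 3]))    # 0.0
print(ks_statistic([1, 2, 3], [10, 20, 30])) # 1.0
```

In practice the chi-square statistic is compared against a critical value (8 degrees of freedom for the nine first digits), and the KS statistic against a threshold derived from the sample sizes.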
Common Failure Modes
Unit errors. A temperature column stores Celsius values but a new data source writes Fahrenheit without conversion. Validity passes (both are numeric). Accuracy fails because 98.6 (body temperature in Fahrenheit) appears as an outlier in a column that should range 36–38.
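A unit error like this is often caught with a simple physical-range check. A minimal sketch, assuming illustrative bounds for human body temperature in Celsius:

```python
def flag_unit_outliers(temps_c, low=35.0, high=42.0):
    """Return readings outside a plausible Celsius range; values near
    98.6 in such a column suggest unconverted Fahrenheit."""
    return [t for t in temps_c if not (low <= t <= high)]

readings = [36.6, 37.2, 98.6, 38.0]
print(flag_unit_outliers(readings))  # [98.6] — likely Fahrenheit
```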
Stale reference data. A product_price column is populated from a price table that is refreshed weekly. Mid-week price changes are not reflected. Values are technically valid but no longer accurate.
Geocoding errors. A pipeline geocodes address strings to (lat, lng) pairs. A batch geocoding API returns 0.0, 0.0 for unresolvable addresses (a common error sentinel). Completeness = 1.0 (no nulls) and validity = 1.0 (valid floats), but accuracy is well below 1.0: coordinates pointing to the Gulf of Guinea are wrong for Colombian addresses.
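Because the sentinel is a fixed, known value, a targeted equality check bounds the accuracy loss. A sketch (coordinates and function name are illustrative):

```python
def count_sentinel_coords(coords, sentinel=(0.0, 0.0)):
    """Count geocoded points equal to the API's error sentinel.
    (0, 0) sits in the Gulf of Guinea; real addresses almost never resolve there."""
    return sum(1 for c in coords if c == sentinel)

points = [(4.711, -74.072), (0.0, 0.0), (6.244, -75.581), (0.0, 0.0)]
bad = count_sentinel_coords(points)
print(1 - bad / len(points))  # upper bound on accuracy: 0.5
```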
Manual data entry drift. Human-entered values accumulate transcription errors over time. A revenue column entered manually has systematic rounding (always ending in 00 or 50). Benford's law analysis flags the unnatural first-digit distribution.
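Alongside the first-digit test, the trailing-digit pattern itself is easy to measure. A crude sketch (the threshold for "suspicious" is an assumption):

```python
def round_number_share(values):
    """Fraction of integer amounts ending in 00 or 50 — a rough signal
    of manual rounding in hand-entered figures."""
    rounded = sum(1 for v in values if v % 100 in (0, 50))
    return rounded / len(values)

revenues = [1200, 350, 417, 5000, 850, 263]
print(round_number_share(revenues))  # 4 of 6 values end in 00 or 50
```

Naturally occurring amounts should not cluster on round endings; a share far above what chance predicts warrants a closer look at the entry process.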
For a full overview of all six dimensions, see /blog/six-dimensions-of-data-quality.
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.