Uniqueness

Duplicate records are one of the most common and most damaging data quality problems. Duplicate customer_id values break primary key constraints. Duplicate transaction_id values double-count revenue. Duplicate email values cause users to receive the same communication multiple times.

Definition

Uniqueness measures the proportion of values in a column (or column combination) that are distinct.

uniqueness = distinct_count / total_rows

A score of 1.0 means every value is unique. For primary key columns, the target must be 1.0. For columns that should be unique but are not primary keys (e.g., email in a users table), the target is also 1.0 in most cases.

Composite uniqueness checks two or more columns together: (order_id, line_item_id) must be unique as a pair even if neither column is unique alone.
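
To make the formula concrete, here is a minimal Python sketch of both checks (the function and the sample rows are illustrative, not DQ's API):

    # Minimal sketch of uniqueness = distinct_count / total_rows, for one key
    # column or a composite key (column names here are illustrative).
    def uniqueness(rows: list[dict], *keys: str) -> float:
        if not rows:
            return 1.0          # empty table: nothing can be duplicated
        distinct = {tuple(row[k] for k in keys) for row in rows}
        return len(distinct) / len(rows)

    line_items = [
        {"order_id": 1, "line_item_id": 1},
        {"order_id": 1, "line_item_id": 2},
        {"order_id": 2, "line_item_id": 1},
    ]
    uniqueness(line_items, "order_id")                   # 0.67 -- not unique alone
    uniqueness(line_items, "order_id", "line_item_id")   # 1.0  -- unique as a pair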

How DQ Measures It

For small tables (under ~1 million rows), DQ computes exact distinct counts. For large tables, DQ uses HyperLogLog (Flajolet et al., 2007), a probabilistic cardinality estimator with ~0.8 % standard error at 16,384 registers — sufficient for detecting uniqueness violations without full-table hash scans.
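
For intuition, here is a toy HyperLogLog in Python. It is an illustrative sketch of the core idea, not DQ's internals; with p = 14 it uses 2^14 = 16,384 registers, which corresponds to the ~0.8 % standard error (1.04 / sqrt(16384)) quoted above.

    import hashlib
    import math

    class HyperLogLog:
        """Toy HyperLogLog: 2**p registers; standard error is roughly 1.04 / sqrt(2**p)."""

        def __init__(self, p: int = 14):
            self.p = p
            self.m = 1 << p                    # 16,384 registers at p = 14
            self.registers = [0] * self.m

        def add(self, value: str) -> None:
            # 64-bit hash of the value
            h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
            idx = h >> (64 - self.p)                       # first p bits pick a register
            rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
            rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
            self.registers[idx] = max(self.registers[idx], rank)

        def count(self) -> float:
            alpha = 0.7213 / (1 + 1.079 / self.m)
            raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
            zeros = self.registers.count(0)
            if raw <= 2.5 * self.m and zeros:              # small-range (linear counting) correction
                return self.m * math.log(self.m / zeros)
            return raw

    hll = HyperLogLog(p=14)
    for i in range(1_000_000):
        hll.add(f"user-{i}")
    print(hll.count())   # roughly 1,000,000, within ~0.8 %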

The process:

  1. Profile computes distinct count (exact or HLL) and total row count.
  2. Uniqueness ratio is stored with a timestamp.
  3. If uniqueness < 1.0 on a column flagged as a key, DQ immediately raises an alert.
  4. DQ also identifies the duplicate values themselves (top-N most frequent duplicates) for investigation, as sketched after this list.
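
A rough Python sketch of step 4, with an illustrative function signature rather than DQ's actual interface:

    from collections import Counter

    # Step 4 sketch: surface the most frequent duplicate values for investigation.
    def top_duplicates(values: list, top_n: int = 10) -> list[tuple]:
        counts = Counter(values)
        return [(v, c) for v, c in counts.most_common() if c > 1][:top_n]

    top_duplicates(["a", "b", "a", "c", "b", "a"])   # [('a', 3), ('b', 2)]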

Rules support both absolute and relative thresholds: "uniqueness on transaction_id must be 1.0" or "uniqueness on session_id must be ≥ 0.995".
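
In code, those two rule styles reduce to the same comparison. A hedged sketch, with illustrative rule fields rather than DQ's configuration syntax:

    # Illustrative rule definitions (not DQ's configuration syntax). An absolute
    # rule is simply a relative rule with min_ratio = 1.0.
    rules = [
        {"column": "transaction_id", "min_ratio": 1.0},
        {"column": "session_id", "min_ratio": 0.995},
    ]

    def violations(rules: list[dict], measured: dict[str, float]) -> list[str]:
        """Return the columns whose measured uniqueness falls below their rule."""
        return [r["column"] for r in rules
                if measured.get(r["column"], 0.0) < r["min_ratio"]]

    violations(rules, {"transaction_id": 1.0, "session_id": 0.993})   # ['session_id']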

See /blog/hyperloglog-explained-for-data-engineers for the technical details of HLL cardinality estimation.

Common Failure Modes

ETL re-runs without deduplication. A pipeline is re-executed after a failure. Records from the failed partial run are not cleaned up before the retry. The table ends up with 2x the expected rows for the affected time window.

Merge logic errors. An UPSERT is written with the wrong key. Rows that should update instead insert, producing duplicates with different updated_at timestamps.

Cross-system ID collisions. Two source systems use overlapping integer ID sequences. A naive union of both tables produces duplicate IDs with different business meanings.

Soft deletes masking duplicates. A deleted_at column marks rows as inactive rather than removing them. Active + deleted rows for the same entity coexist. Queries without a WHERE deleted_at IS NULL filter produce apparent duplicates.
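
One way to keep a uniqueness check honest in that setup is to exclude soft-deleted rows before computing the ratio. A minimal sketch, assuming rows carry a deleted_at field as described above:

    # Sketch: exclude soft-deleted rows before measuring uniqueness, so an active
    # row plus its deleted predecessors is not reported as a duplicate. The
    # column names are illustrative.
    def active_uniqueness(rows: list[dict], key: str) -> float:
        active = [r for r in rows if r.get("deleted_at") is None]
        if not active:
            return 1.0
        return len({r[key] for r in active}) / len(active)

    users = [
        {"user_id": 7, "deleted_at": "2024-01-02"},   # superseded, soft-deleted row
        {"user_id": 7, "deleted_at": None},           # current row
        {"user_id": 8, "deleted_at": None},
    ]
    active_uniqueness(users, "user_id")   # 1.0 once deleted rows are excluded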

For a full overview of all six dimensions, see /blog/six-dimensions-of-data-quality.


About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.