Mahalanobis Distance for Tabular Outlier Detection
P. C. Mahalanobis introduced his distance metric in 1936 to measure how far a point lies from the mean of a multivariate distribution, in units of standard deviation and accounting for correlations between dimensions. It is still the right tool for numeric outlier detection in tabular data, nearly ninety years later.
Why Z-Score Fails on Multivariate Data
Z-score treats each column independently: z = (x − μ) / σ. A row where age = 95 and income = 12000 might be unremarkable in isolation but anomalous as a combination. Z-score cannot capture this. It ignores the covariance structure of the data.
Mahalanobis distance incorporates the full covariance matrix Σ, so it flags rows that are unusual given what we know about how the columns relate to each other.
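A toy illustration of the difference, using synthetic data (the column names and numbers here are purely illustrative): a point that looks modest on each axis separately can still badly violate the correlation between the two columns.
import numpy as np
from scipy.stats import chi2

# Two strongly correlated synthetic columns, e.g. age and income
rng = np.random.default_rng(0)
age = rng.normal(40, 10, 1000)
income = age * 1000 + rng.normal(0, 2000, 1000)
X = np.column_stack([age, income])

# A point that is modest on each axis but violates the correlation
point = np.array([55.0, 30000.0])

# Per-column z-scores are both well under 2...
z = (point - X.mean(axis=0)) / X.std(axis=0)

# ...but the squared Mahalanobis distance far exceeds the chi-squared cutoff
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = (point - mu) @ cov_inv @ (point - mu)
print(z, d2, chi2.ppf(0.975, df=2))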
The Math
Given a row vector x, mean vector μ, and covariance matrix Σ:
D_M(x) = sqrt( (x − μ)^T · Σ^{-1} · (x − μ) )
Under multivariate normality, D_M² follows a chi-squared distribution with k degrees of freedom (where k is the number of columns). A common threshold is the 97.5th percentile of χ²(k): rows exceeding this are flagged as outliers.
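For a sense of scale, the cutoffs at alpha = 0.025 can be read straight from scipy (values below are rounded):
from scipy.stats import chi2

# 97.5th-percentile cutoffs for the squared distance at a few column counts
for k in (2, 5, 10):
    print(k, round(chi2.ppf(0.975, df=k), 2))
# prints roughly 7.38, 12.83, and 20.48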
NumPy Implementation
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2
def mahalanobis_outliers(df_numeric, alpha=0.025):
    # Note: rows with NaNs are dropped, so the returned mask indexes
    # the rows that survive dropna(), not the original DataFrame.
    X = df_numeric.dropna().values
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Regularize to avoid a singular covariance matrix
    cov += np.eye(cov.shape[0]) * 1e-6
    cov_inv = np.linalg.inv(cov)
    # Chi-squared cutoff for the squared distance (k degrees of freedom)
    k = X.shape[1]
    threshold = chi2.ppf(1 - alpha, df=k)
    # Squared Mahalanobis distance of each row from the column means
    distances = np.array([
        mahalanobis(row, mu, cov_inv) ** 2
        for row in X
    ])
    return distances > threshold  # Boolean mask of outlier rows
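A usage sketch; the file name and DataFrame here are hypothetical:
import pandas as pd

df = pd.read_csv("events.csv")                     # hypothetical input table
numeric = df.select_dtypes(include="number")
mask = mahalanobis_outliers(numeric, alpha=0.025)

# The mask indexes rows that survive dropna(), so align against the same frame
outlier_rows = numeric.dropna()[mask]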
Pitfalls
Singular covariance matrix. When columns are perfectly correlated, or when the number of rows is smaller than the number of columns, Σ is not invertible. The fix: regularize with Σ + εI (as shown above), or use the Moore-Penrose pseudoinverse via np.linalg.pinv.
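A minimal sketch of the pseudoinverse variant, reusing the setup of the function above; mahalanobis_sq_pinv is an illustrative name, not part of DQ:
import numpy as np

def mahalanobis_sq_pinv(X):
    # Squared Mahalanobis distances using the Moore-Penrose pseudoinverse,
    # which is defined even when the covariance matrix is singular
    mu = X.mean(axis=0)
    cov_pinv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mu
    return np.einsum('ij,jk,ik->i', diff, cov_pinv, diff)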
Sample size requirements. The chi-squared approximation degrades badly below roughly 10 × k rows. On small tables, fall back to z-score or Isolation Forest (Liu et al., 2008).
Non-normality. Mahalanobis assumes an approximately multivariate normal distribution. Heavy-tailed columns (e.g., revenue, session duration) should be log-transformed before applying the metric.
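A sketch of the transform step, assuming a DataFrame named df_numeric as in the function above and non-negative columns; the column names are illustrative:
import numpy as np

# log1p maps 0 to 0 and compresses heavy right tails; requires non-negative values
for col in ("revenue", "session_duration"):        # illustrative column names
    df_numeric[col] = np.log1p(df_numeric[col])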
High dimensionality. Beyond ~50 columns, the curse of dimensionality renders covariance estimates unreliable. Use PCA to reduce to the top components first, then apply Mahalanobis on the reduced space.
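One way to do the reduction, assuming scikit-learn is available (a sketch, not DQ's implementation):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df_numeric.dropna().values                  # df_numeric as above
X_scaled = StandardScaler().fit_transform(X)    # PCA is scale-sensitive
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
# Apply the Mahalanobis routine to X_reduced instead of X
On the PCA scores the covariance matrix is already diagonal, so the inversion step is trivial and numerically stable.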
Where This Fits in a DQ Pipeline
Outlier detection at the row level complements column-level dimension scoring. A table can have 100 % completeness and 100 % validity yet still contain rows that are statistically impossible given the joint distribution of all columns. Mahalanobis catches these.
For context on the broader quality framework, see /blog/six-dimensions-of-data-quality and /dimensions/accuracy.
DQ runs Mahalanobis distance automatically on the numeric column subset of every table, on every run.
FAQ
Q: Does Mahalanobis distance work on categorical columns?
A: No. It is defined only for numeric data. DQ applies separate cardinality and frequency-distribution checks (using Kolmogorov-Smirnov) for categorical columns.
Q: What threshold should I use?
A: The chi-squared 97.5th percentile (alpha=0.025) is the standard. For production alerts where false positives are costly, use the 99th percentile (alpha=0.01).
Q: How does this compare to Isolation Forest?
A: Isolation Forest (Liu et al., 2008) is non-parametric and handles non-normal distributions better. Mahalanobis is faster and more interpretable. DQ uses both and reports agreement/disagreement between them.
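A minimal sketch of running both detectors and checking agreement, assuming scikit-learn; the contamination rate is an illustrative choice, not a DQ default:
from sklearn.ensemble import IsolationForest

X = df_numeric.dropna().values
iso_mask = IsolationForest(contamination=0.025, random_state=0).fit_predict(X) == -1
maha_mask = mahalanobis_outliers(df_numeric, alpha=0.025)

# Rows flagged by both detectors are the highest-confidence outliers
both = iso_mask & maha_mask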
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.