Repairing Dirty City, Country, and Currency Fields
Geo and currency fields attract the most creative misspellings in any dataset. BOG, bogota, Bogotá, Bogt´a, Santa Fe de Bogota, Bogotá D.C. are all the same city. GROUP BY city produces 6 rows where there should be 1. INNER JOIN against a reference table returns zero matches.
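The symptom is easy to reproduce in pandas (a minimal sketch with hypothetical data):

```python
import pandas as pd

# Six spellings of the same city, as they might appear in a raw table
df = pd.DataFrame({"city": [
    "BOG", "bogota", "Bogotá", "Bogt´a",
    "Santa Fe de Bogota", "Bogotá D.C.",
]})

# The GROUP BY equivalent: six groups where there should be one
print(df.groupby("city").size())
print(df["city"].nunique())  # 6
```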
The fix is canonical clustering: map every variant to a single normalized form using fuzzy string matching against a reference dictionary.
The Repair Pipeline
Three components work together:
- Reference dictionary. A curated list of canonical values (city names, ISO country codes, ISO 4217 currency codes). GeoNames and the UN M49 standard are reliable sources for cities and countries.
- Fuzzy distance function. Levenshtein distance for short strings, token sort ratio for multi-word place names. rapidfuzz (a C extension) is 10–100x faster than fuzzywuzzy for bulk operations.
- Phonetic fallback. Soundex and Metaphone group strings that sound alike but are spelled differently — useful for proper nouns with diacritics dropped or transliterated inconsistently.
rapidfuzz.process.cdist for Bulk Matching
from rapidfuzz import process, fuzz

# Reference canonical city names (subset for illustration)
reference_cities = [
    "Bogotá", "Medellín", "Cali", "Barranquilla",
    "Buenos Aires", "Lima", "Ciudad de México",
]

# Dirty input values from your table
dirty_values = ["BOG", "bogota", "Bogotá", "Bogt´a", "Medellin", "medellin "]

# Compute similarity matrix: shape (len(dirty), len(reference))
scores = process.cdist(
    dirty_values,
    reference_cities,
    scorer=fuzz.WRatio,
    workers=-1,  # use all CPU cores
)

# Pick the best match for each dirty value
for i, dirty in enumerate(dirty_values):
    best_idx = scores[i].argmax()
    best_score = scores[i][best_idx]
    canonical = reference_cities[best_idx]
    print(f"{dirty!r:20s} → {canonical!r} (score={best_score:.0f})")
Output:
'BOG' → 'Bogotá' (score=90)
'bogota' → 'Bogotá' (score=95)
'Bogotá' → 'Bogotá' (score=100)
'Bogt´a' → 'Bogotá' (score=88)
'Medellin' → 'Medellín' (score=96)
'medellin ' → 'Medellín' (score=96)
Use a threshold (e.g., score ≥ 85) to auto-repair high-confidence matches and queue low-confidence ones for human review.
Phonetic Grouping with Soundex and Metaphone
For columns where diacritics are systematically dropped or replaced, phonetic algorithms add a second layer:
import jellyfish
values = ["Bogota", "Bogotá", "Bogotta", "Bukota"]
codes = [(v, jellyfish.soundex(v)) for v in values]
# Codes: B230 for all four values; even "Bukota" lands in the same group
Metaphone handles non-English phonology better for Spanish proper nouns. For Colombian city names, a small custom dictionary override (e.g., BOG → Bogotá) handles the common abbreviations that neither fuzzy distance nor phonetics can resolve.
Currency Field Repair
ISO 4217 has 180 active currency codes. Common dirty variants: usd, US Dollar, U.S.D., dolar, $. Repair strategy:
- Uppercase and strip punctuation: U.S.D. → USD.
- Look up the result in the ISO 4217 table.
- If there is no exact match, fuzzy match against currency names (US Dollar, Euro, Colombian Peso).
import pycountry
from rapidfuzz import process, fuzz

def normalize_currency(raw: str) -> str | None:
    raw_clean = raw.strip().upper().replace(".", "").replace(" ", "")
    try:
        # Exact ISO 4217 lookup by code or name
        return pycountry.currencies.lookup(raw_clean).alpha_3
    except LookupError:
        # Fall back to fuzzy matching against full currency names
        result = process.extractOne(
            raw, [c.name for c in pycountry.currencies], scorer=fuzz.WRatio
        )
        if result and result[1] >= 80:
            return pycountry.currencies.get(name=result[0]).alpha_3
        return None
Integration with DQ
DQ applies this pipeline in /docs/remediation. After profiling detects low validity on a city or country column, DQ suggests a remediation job: select a reference dictionary, set a confidence threshold, preview the mapping, then apply. The repaired values are written back or exported as a patch file, with full audit trail.
See /dimensions/validity for how DQ scores validity before and after remediation.
FAQ
Q: What WRatio threshold is safe for auto-repair without human review? A: 92 or above is conservative and appropriate for production auto-repair. Scores between 80 and 91 should be queued for review. Below 80, flag the value as unresolvable and fill with NULL rather than a wrong canonical value.
Q: How does DQ handle abbreviations like "COL" for Colombia vs "CO" (ISO 3166)? A: DQ maintains a curated abbreviation table per field type (city, country, currency). Abbreviations take priority over fuzzy distance, since "COL" would fuzzy-match several unrelated strings.
Q: Can this pipeline handle transliterations from Arabic or Chinese place names?
A: Soundex and Metaphone are English-centric. For non-Latin scripts, use unidecode for transliteration first, then apply the fuzzy pipeline against transliterated reference names.
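For example (a sketch; assumes the unidecode package is installed):

```python
from unidecode import unidecode

# Transliterate the reference list to ASCII once, then run the
# fuzzy pipeline from earlier against the ASCII forms
ascii_ref = [unidecode(c) for c in ["Bogotá", "Medellín", "Ciudad de México"]]
print(ascii_ref)  # ['Bogota', 'Medellin', 'Ciudad de Mexico']
```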
About DQ. DQ is the data quality engine that profiles, validates, and remediates your tables in 90 seconds. Built by K/20X Labs.