Migrating from Unidecode¶
disarm provides a drop-in replacement for both Unidecode and text-unidecode.
Already wrapping
unidecodein a pipeline? Most hand-rolledunidecode(...)pipelines (slugs, filenames, search keys, URL-encoding) have a single-call disarm equivalent. See Unidecode → disarm recipes for the pattern-by-pattern mapping.
Quick migration¶
Option 1: Drop-in alias¶
# Before
from unidecode import unidecode
# After — one-line change
from disarm import unidecode
The disarm.unidecode() function is a direct alias for transliterate() with default settings. It accepts a single string argument and returns ASCII text.
Coverage compatibility, not endorsement. The alias exists to make migration a one-line change. It is the right tool for romanization (slugs, ASCII keys, search-fold) — but the wrong tool for security. See Unidecode is not a security tool below.
Option 2: Use transliterate directly¶
# Before
from unidecode import unidecode
result = unidecode("café")
# After
from disarm import transliterate
result = transliterate("café")
transliterate() provides additional features not available in Unidecode:
from disarm import transliterate
# Language-specific transliteration
assert transliterate("München", lang="de") == 'Muenchen'
# Error handling modes
assert transliterate("♠", errors="ignore") == ''
assert transliterate("♠", errors="preserve") == '♠'
assert transliterate("♠", errors="replace",
replace_with="?") == '?'
API comparison¶
| Unidecode | disarm | Notes |
|---|---|---|
unidecode(s) |
unidecode(s) |
Direct alias |
unidecode(s) |
transliterate(s) |
Full-featured alternative |
unidecode_expect_ascii(s) |
transliterate(s, errors="replace") |
Default behavior |
unidecode_expect_nonascii(s) |
transliterate(s, errors="preserve") |
Keep unmapped chars |
Behavioral differences¶
Transliteration tables¶
disarm uses its own hand-curated transliteration tables. Most common mappings are identical to Unidecode, but some edge cases may differ. A detailed character-level comparison across all 83 supported languages shows:
- 49,089 codepoints across all Unicode blocks tested comprehensively (no sampling)
- 48,415 mapped by disarm vs 47,408 by Unidecode — disarm has broader coverage overall, with 1,136 characters only disarm maps vs 129 only Unidecode maps
- Most differences are systematic: CJK pinyin casing (~20K), Korean romanization (~3.7K), inherent vowel handling in Brahmic scripts, and language-specific national standards
from disarm import unidecode
# Identical in both
assert unidecode("café") == 'cafe'
assert unidecode("北京") == 'bei jing'
# May differ for obscure characters
# disarm aims for more linguistically accurate results
License¶
| Unidecode | text-unidecode | disarm | |
|---|---|---|---|
| License | GPL-2.0 | Artistic-1.0 | MIT |
If your project requires MIT licensing, disarm is a safe replacement.
Unidecode is not a security tool¶
If you reach for unidecode to "sanitize" untrusted text — to strip homoglyphs,
invisible characters, or other Unicode trickery — switch to disarm's defense
functions, and not just for the speed.
Unidecode (like anyascii, cyrtranslit, uroman) maps confusable characters
phonetically: Cyrillic р (U+0440) → Latin r, by sound. A homoglyph attacker
replaces Latin p with the identical-looking Cyrillic р, so phonetic mapping yields
r — not the original p — and the attack survives. Worse, on invisible-character
attacks unidecode expands zero-width characters into visible ASCII sequences,
introducing spurious tokens that can degrade downstream model accuracy.
disarm maps visually per Unicode TR39
(Cyrillic р → Latin p), which reverses the substitution for confusables in the TR39
table. In a controlled benchmark, visual TR39 mapping reached XMR = 1.000 on the tested
TR39 pairs where phonetic tools recovered roughly half. It is a defense-in-depth layer,
not a complete control — see the Threat Model.
# Wrong tool for defense — phonetic mapping, attack survives
from disarm import unidecode
assert unidecode("рroduсt") == 'rrodust'
# Right tools — visual TR39 mapping
from disarm import strip_obfuscation, normalize_confusables
assert normalize_confusables("рroduсt") == 'product'
assert strip_obfuscation("рroduсt") == 'product'
See Adversarial-Text Defense for the full evidence and the XMR benchmark.
text-unidecode migration¶
text-unidecode has the same API as Unidecode. The migration is identical:
# Before
from text_unidecode import unidecode
# After
from disarm import unidecode