Core Transforms¶

Functions that transform text. All are pure functions — they never mutate the input.

transliterate¶

transliterate `module-attribute` ¶

transliterate = _transliterate_dispatch

slugify¶

slugify ¶

slugify(text: str, *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ..., default: str | None = ...) -> str

slugify(text: list[str], *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ..., default: str | None = ...) -> list[str]

slugify(text: str | list[str], *, separator: str = '-', lowercase: bool = True, max_length: int = 0, word_boundary: bool = False, save_order: bool = False, stopwords: Iterable[str] = (), regex_pattern: str | None = None, replacements: Iterable[tuple[str, str]] = (), allow_unicode: bool = False, lang: str | None = None, entities: bool = True, decimal: bool = True, hexadecimal: bool = True, default: str | None = None) -> str | list[str]

Generate a URL-safe slug from Unicode text.

Full pipeline: decode entities → transliterate → lowercase → strip non-alphanumeric → collapse separators → apply stopwords/max_length.

Shares python-slugify's core keyword parameters (separator, max_length, word_boundary, save_order, stopwords, lowercase, etc.), so slugify(text, ...) calls port directly. Note that disarm makes every parameter past text keyword-only, whereas python-slugify accepts some positionally.

Parameters:

text (str | list[str]) –

Input Unicode string, or a list of strings for batch processing.
separator (str, default: '-' ) –

Character(s) between slug words.
lowercase (bool, default: True ) –

Convert to lowercase.
max_length (int, default: 0 ) –

Maximum slug length in bytes (0 = unlimited). With allow_unicode=True, multi-byte characters count as 2–4 bytes each; use grapheme_truncate for character-aware limiting.
word_boundary (bool, default: False ) –

When truncating via max_length, cut at word boundaries.
save_order (bool, default: False ) –

When True, only leading and trailing stopwords are removed; interior stopwords are kept so relative word order is preserved (python-slugify compatible). When False (default), all matching stopwords are removed wherever they appear. (#118)
stopwords (Iterable[str], default: () ) –

Words to remove from the slug.
regex_pattern (str | None, default: None ) –

Custom regex for stripping characters.
replacements (Iterable[tuple[str, str]], default: () ) –

Pre-transliteration (old, new) substitution pairs.
allow_unicode (bool, default: False ) –

Keep non-ASCII letters instead of transliterating.
lang (str | None, default: None ) –

Language code for transliteration (e.g. "de", "ru", "auto").
entities (bool, default: True ) –

Decode HTML entities before processing.
decimal (bool, default: True ) –

Decode HTML decimal entities (e.g. &#123;).
hexadecimal (bool, default: True ) –

Decode HTML hex entities (e.g. &#x7B;).
default (str | None, default: None ) –

Fallback when the slug would be empty — i.e. the input has no sluggable characters (emoji, punctuation, or zero-width only). The value is itself run through the same slug pipeline (#193), so it is sanitized to a URL-safe slug and is subject to the same max_length truncation as normal output; a default that has no sluggable characters therefore yields the empty string. When None (the default), the empty string is returned, preserving prior behaviour. Use this to avoid the routing hazard of empty slugs colliding on one URL (#97).

Returns:	`str \| list[str]` – URL-safe slug string (or the sanitized `default` when it would otherwise be empty). Returns `list[str]` when given `list[str]`.

Raises:

ValueError –

If max_length is negative (validated for both scalar and list input, #193).
TypeError –

If text is neither str nor list[str].
DisarmError –

If an internal Rust error occurs (e.g. an invalid regex_pattern). An unknown lang does not raise — it is treated as best-effort and falls back to the default transliterator; pre-check against list_langs() if you need strict validation.

Examples:

>>> slugify("Hello World!")
'hello-world'
>>> slugify("Straße nach München", lang="de")
'strasse-nach-muenchen'
>>> slugify("My Title", separator="_")
'my_title'
>>> slugify("The Big Fox", stopwords=["the"])
'big-fox'
>>> slugify("Very Long Title Here", max_length=10, word_boundary=True)
'very-long'
>>> slugify("🔥🔥🔥")
''
>>> slugify("🔥🔥🔥", default="n-a")
'n-a'
>>> slugify("🔥", default="N/A")  # default is sanitized, not returned raw
'n-a'
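
The pipeline described above can be approximated with the standard library alone. The sketch below is not disarm's implementation: `sketch_slugify` is a hypothetical name, "transliteration" is reduced to NFKD decomposition followed by dropping every non-ASCII character, and stopwords, max_length, and lang are left out.

```python
import html
import re
import unicodedata


def sketch_slugify(text: str, *, separator: str = "-", lowercase: bool = True) -> str:
    """Rough stdlib approximation of the slug pipeline (hypothetical helper)."""
    # 1. Decode HTML entities such as &amp; or &#123;.
    text = html.unescape(text)
    # 2. Crude stand-in for transliteration: decompose, then drop non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # 3. Lowercase.
    if lowercase:
        text = text.lower()
    # 4. Replace each run of non-alphanumerics with one separator, trim the ends.
    text = re.sub(r"[^A-Za-z0-9]+", separator, text)
    return text.strip(separator)
```

This sketch turns "Straße" into 'strae', because ß has no Unicode decomposition; closing that gap is what a real transliteration step is for.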

normalize¶

normalize ¶

normalize(text: str, *, form: NormalizationForm = ...) -> str

normalize(text: list[str], *, form: NormalizationForm = ...) -> list[str]

normalize(text: str | list[str], *, form: NormalizationForm = 'NFC') -> str | list[str]

Unicode normalization.

Accepts a single string or a list of strings.

Parameters:

text (str | list[str]) –

Input string, or list of strings for batch processing.
form (NormalizationForm, default: 'NFC' ) –

Normalization form: "NFC", "NFD", "NFKC", or "NFKD".

Returns:	`str \| list[str]` – Normalized string(s). Returns `str` when given `str`, `list[str]` when given `list[str]`.

Examples:

>>> normalize("é", form="NFC")
'é'
>>> normalize(["é", "ño"], form="NFC")
['é', 'ño']
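
The four forms are the standard Unicode ones, so their effect can be seen with Python's own `unicodedata.normalize`. This sketch uses only the standard library and assumes nothing about disarm's internals.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# NFC composes and NFD decomposes; both strings render identically.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# The compatibility forms (NFKC/NFKD) also fold presentation variants.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "fi"
```

NFC is a sensible default for storage and comparison because it keeps text compact without discarding compatibility distinctions.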

normalize_confusables¶

normalize_confusables ¶

normalize_confusables(text: str, *, target_script: str = 'latin') -> str

Replace Unicode confusable homoglyphs with target-script equivalents.

Uses Unicode TR39 confusables table. Characters without a confusable equivalent in the target script pass through unchanged (visual mapping only, not transliteration).

Parameters:

text (str) –

Input string potentially containing homoglyphs.
target_script (str, default: 'latin' ) –

Script to normalize toward. Supported values: "latin" (default, ~2,063 mappings) and "cyrillic" (~1,369 mappings).

Returns:	`str` – String with confusable characters replaced by target-script equivalents.

Raises:	`DisarmError` – If target_script is not a supported value.

Examples:

>>> normalize_confusables("Ηello")  # Greek Η looks like Latin H
'Hello'
>>> normalize_confusables("раypal")  # Cyrillic р/а look like Latin p/a
'paypal'
>>> normalize_confusables("paypal", target_script="cyrillic")
'раураӏ'
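
To make "visual mapping only, not transliteration" concrete, here is a minimal sketch using a tiny hand-picked table. The five entries and the name `sketch_normalize_confusables` are illustrative assumptions; the real function draws on the full TR39 confusables data.

```python
# Hypothetical excerpt: look-alike code points mapped to Latin letters.
CONFUSABLES_TO_LATIN = {
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}
_TABLE = str.maketrans(CONFUSABLES_TO_LATIN)


def sketch_normalize_confusables(text: str) -> str:
    """Replace listed look-alikes; every other character passes through."""
    return text.translate(_TABLE)
```

Cyrillic ж (U+0436) has no Latin look-alike in this table, so it is left unchanged rather than being romanized to "zh".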

sanitize_filename¶

sanitize_filename ¶

sanitize_filename(text: str, *, separator: str = '_', max_length: int = 255, platform: Platform = 'universal', lang: str | None = None, preserve_extension: bool = True, replacement_text: str | None = None, max_len: int | None = None) -> str

Sanitize a string into a safe filename.

Transliterate → strip OS-illegal chars → collapse separators → handle reserved names (CON, NUL, etc.) → truncate respecting extension.

Parameters:

text (str) –

Input string (title, user input, etc.).
separator (str, default: '_' ) –

Replacement for spaces and stripped characters. Also accepted as replacement_text (pathvalidate compatibility).
max_length (int, default: 255 ) –

Maximum filename length measured in bytes (UTF-8 encoded), not characters. Default 255 matches the ext4/APFS/NTFS filesystem limit. Truncation always lands on a character boundary to avoid splitting multi-byte sequences. Also accepted as max_len (pathvalidate compatibility).
platform (Platform, default: 'universal' ) –

Target platform — "universal", "windows", or "posix".
lang (str | None, default: None ) –

Language code for transliteration (e.g. "de", "ja").
preserve_extension (bool, default: True ) –

When True (default), the file extension is kept intact within max_length. If the extension alone (including the leading .) is ≥ max_length, the extension is dropped and the whole result is truncated to max_length bytes. When False, the entire string is truncated to max_length bytes without special treatment of the extension.

Returns:	`str` – Safe filename string.

Raises:	`DisarmError` – If an internal Rust error occurs.

Examples:

>>> sanitize_filename("My Report (final).pdf")
'My_Report_(final).pdf'
>>> sanitize_filename("CON.txt")  # reserved on Windows
'_CON.txt'
>>> sanitize_filename("résumé.docx", lang="fr")
'resume.docx'
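
The byte-based max_length and preserve_extension rules can be sketched with the standard library. This is one reading of the behaviour described above, not disarm's code: `truncate_utf8` and `truncate_filename` are hypothetical helpers, and `os.path.splitext` stands in for whatever extension detection the library uses.

```python
import os


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return encoded.decode("utf-8", errors="ignore")


def truncate_filename(name: str, max_bytes: int) -> str:
    """Truncate the stem but keep the extension when it fits."""
    stem, ext = os.path.splitext(name)
    ext_bytes = len(ext.encode("utf-8"))
    if ext_bytes >= max_bytes:
        # The extension alone does not fit: drop it and truncate the rest.
        return truncate_utf8(stem, max_bytes)
    return truncate_utf8(stem, max_bytes - ext_bytes) + ext
```

For example, `truncate_filename("résumé.docx", 9)` leaves 4 bytes for the stem, which fits "rés" (é takes 2 bytes) and yields 'rés.docx'.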

strip_accents¶

strip_accents ¶

strip_accents(text: str) -> str

strip_accents(text: list[str]) -> list[str]

strip_accents(text: str | list[str]) -> str | list[str]

Remove diacritical marks while preserving base characters.

NFD decompose → strip combining marks → NFC recompose. Accepts a single string or a list of strings.

Parameters:	`text` (`str \| list[str]`) – Input string, or list of strings for batch processing.

Returns:	`str \| list[str]` – String(s) with diacritical marks removed.

Examples:

>>> strip_accents("café résumé naïve")
'cafe resume naive'
>>> strip_accents(["café", "naïve"])
['cafe', 'naive']
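
The decompose, strip, recompose sequence described above maps directly onto `unicodedata`. The sketch below is a stdlib illustration under one assumption: "combining marks" is taken to mean general category Mn, which may differ in edge cases from the set disarm strips. `sketch_strip_accents` is a hypothetical name.

```python
import unicodedata


def sketch_strip_accents(text: str) -> str:
    """NFD decompose, drop nonspacing marks (category Mn), NFC recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)
```

Letters such as ø (U+00F8) have no canonical decomposition, so this approach leaves them untouched; only marks that NFD actually separates from the base letter are removed.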

fold_case¶

fold_case ¶

fold_case(text: str) -> str

Full Unicode case folding per CaseFolding.txt (Unicode 16.0).

Unlike str.lower(), this implements the complete Unicode Case Folding algorithm with all 1,557 status-C and status-F mappings. Covers Latin (ß→ss, ſ→s, İ→i̇), Greek (ς→σ, variant forms ϐ→β, ϑ→θ, ϕ→φ, ϖ→π, ϰ→κ, ϱ→ρ), Cyrillic, Armenian (ligature և→եւ), Georgian Mtavruli, Cherokee, Adlam, Deseret, Osage, Warang Citi, fullwidth Latin, and all Latin ligature expansions (ﬁ→fi, ﬂ→fl, ﬀ→ff, ﬃ→ffi, ﬄ→ffl, ﬅ→st, ﬆ→st).

Equivalent to str.casefold() but executed in Rust via a compile-time PHF (perfect hash function) table. Pure-ASCII strings take a branchless fast path with no table lookup.

Parameters:	`text` (`str`) – Input string.

Returns:	`str` – Case-folded string. Characters not in CaseFolding.txt map to themselves. Output satisfies `fold_case(fold_case(x)) == fold_case(x)` (idempotent).

Examples:

>>> fold_case("Straße")
'strasse'
>>> fold_case("ΣΟΦΙΑ")
'σοφια'
>>> fold_case("ﬁnd")
'find'
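
Since the text above describes fold_case as equivalent to `str.casefold()`, the difference from `str.lower()` can be shown with the built-in methods alone. `caseless_equal` is a hypothetical helper added for illustration.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case, using full Unicode case folding."""
    return a.casefold() == b.casefold()


# lower() leaves ß and the fi ligature alone; casefold() expands them.
assert "Stra\u00dfe".lower() == "stra\u00dfe"
assert "Stra\u00dfe".casefold() == "strasse"
assert "\ufb01nd".lower() == "\ufb01nd"
assert "\ufb01nd".casefold() == "find"

# Folding is idempotent: folding twice changes nothing further.
assert "Stra\u00dfe".casefold().casefold() == "Stra\u00dfe".casefold()
```

Case folding is meant for comparison keys rather than display, which is why "STRASSE" and "Straße" compare equal under `caseless_equal`.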

collapse_whitespace¶

collapse_whitespace ¶

collapse_whitespace(text: str) -> str

Fold all Unicode whitespace runs to single ASCII spaces, trimming the ends.

Folds whitespace only (#433): the line controls (TAB/LF/VT/FF/CR), the information separators (U+001C–U+001F), NEL, the Zs/Zl/Zp spaces, and the blank-rendering set (Braille blank, the Hangul fillers) each fold to a single space. It does not delete control or zero-width characters; to do that, run a TextPipeline with the strip_control / strip_zero_width steps (the canonicalize / canonicalize_strict presets already do).

Folding the line controls (rather than deleting them) means a carriage return between two tokens becomes a space, never a silent join: "a\rb" → "a b".

Parameters:	`text` (`str`) – Input string.

Returns:	`str` – String with whitespace runs folded to single spaces and ends trimmed.

Examples:

>>> collapse_whitespace("  hello   world  ")
'hello world'
>>> collapse_whitespace("tabs\there\ttoo")
'tabs here too'
>>> collapse_whitespace("a\rb")  # carriage return folds, not deletes
'a b'
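
A close stdlib approximation of this folding is `" ".join(text.split())`. It is only a sketch: Python's notion of whitespace covers the line controls, the information separators, NEL, and the Zs/Zl/Zp spaces, but not the blank-rendering set (Braille blank, Hangul fillers) listed above. `sketch_collapse_whitespace` is a hypothetical name.

```python
def sketch_collapse_whitespace(text: str) -> str:
    """Fold runs of Python-defined whitespace to one space and trim the ends."""
    # str.split() with no argument splits on whitespace runs and drops
    # leading/trailing empties, so joining with " " folds and trims at once.
    return " ".join(text.split())
```

As in the real function, zero-width characters such as U+200B are not whitespace and therefore survive this step.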

demojize¶

demojize ¶

demojize(text: str, *, strip_modifiers: bool = False, errors: ErrorMode = 'replace', replace_with: str = '[?]', provider: EmojiProvider | None = None, delimiters: tuple[str, str] | None = None) -> str

Expand emoji sequences to their CLDR short-name text descriptions.

Output is always the bare CLDR short name as plain text.

Parameters:

text (str) –

Input string potentially containing emoji.
strip_modifiers (bool, default: False ) –

If True, collapse skin tone and hair style variants to their base form (e.g. "woman raising hand" instead of "woman raising hand: medium-dark skin tone").
errors (ErrorMode, default: 'replace' ) –

How to handle emoji not in the provider's data. "replace" — substitute with replace_with. "ignore" — silently drop. "preserve" — keep the original emoji.
replace_with (str, default: '[?]' ) –

Replacement string when errors="replace".
provider (EmojiProvider | None, default: None ) –

An object implementing the EmojiProvider protocol. Overrides the global provider for this call. None uses the global provider or the built-in default.
delimiters (tuple[str, str] | None, default: None ) –

Accepted for compatibility with the emoji library but ignored; passing it explicitly emits a DeprecationWarning. disarm always outputs bare CLDR short names without delimiters; wrap the result yourself if you need delimiters (e.g. f":{name}:").

Returns:	`str` – Text with emoji replaced by their descriptions.

Raises:	`DisarmError` – If an internal Rust error occurs.

Warns:	`UserWarning` – If the provider raises an exception or returns a non-string value. The built-in CLDR tables are used as a fallback for that sequence.

Examples:

>>> demojize("I ❤️ Python 🐍")
'I red heart Python snake'

set_emoji_provider¶

set_emoji_provider ¶

set_emoji_provider(provider: EmojiProvider | None = None) -> None

Set a global emoji provider for all demojize calls.

The provider must implement the EmojiProvider protocol.

Pass None to reset to the built-in default (latest English CLDR).

Note: Sequence-length cap (#199). The provider's lookup() is offered a look-ahead window of at most 9 codepoints, the length of the longest built-in CLDR emoji sequence. A provider cannot match a sequence longer than 9 codepoints: the extra codepoints fall through to the built-in tables / per-codepoint handling. This cap is fixed (it sizes a stack-allocated scan window, so widening it would cost every demojize call); design custom mappings to key on ≤ 9 codepoints. Skin-tone and variation-selector modifiers trailing a matched sequence are consumed separately and do not count toward the 9.

Parameters:	`provider` (`EmojiProvider \| None`, default: `None` ) – An object implementing the `EmojiProvider` protocol, or None to reset to the built-in default.

Examples:

>>> set_emoji_provider(None)  # reset to default provider

strip_bidi¶

strip_bidi ¶

strip_bidi(text: str) -> str

Strip bidirectional override and formatting characters (UAX #9).

Removes: soft hyphen (U+00AD), Arabic Letter Mark (U+061C), LRM/RLM (U+200E/F), bidi embeddings/overrides (U+202A–U+202E), bidi isolates (U+2066–U+2069).

Parameters:	`text` (`str`) – Input string.

Returns:	`str` – String with bidi override and formatting characters removed.

Examples:

>>> strip_bidi("hello\u200eworld")  # remove LRM
'helloworld'
>>> strip_bidi("hello\u061cworld")  # remove Arabic Letter Mark
'helloworld'
>>> strip_bidi("safe text")  # no bidi chars → unchanged
'safe text'
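
The removal set listed above is small enough to express as a `str.translate` table. This is a stdlib sketch built from those code points only, with `sketch_strip_bidi` as a hypothetical name; it makes no claim about how the Rust implementation is structured.

```python
BIDI_CODE_POINTS = (
    [0x00AD, 0x061C, 0x200E, 0x200F]  # soft hyphen, ALM, LRM, RLM
    + list(range(0x202A, 0x202F))     # embeddings/overrides U+202A..U+202E
    + list(range(0x2066, 0x206A))     # isolates U+2066..U+2069
)
# Mapping a code point to None tells str.translate to delete it.
_DELETE_TABLE = dict.fromkeys(BIDI_CODE_POINTS)


def sketch_strip_bidi(text: str) -> str:
    """Delete the bidi control and formatting characters listed above."""
    return text.translate(_DELETE_TABLE)
```

A right-to-left override (U+202E) hidden in a filename is a typical target: `sketch_strip_bidi("abc\u202etxt")` gives 'abctxt'.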

strip_tags¶

strip_tags ¶

strip_tags(text: str) -> str

Strip the Unicode Tags block (U+E0000–U+E007F) — the "ASCII smuggling" channel.

Preserves well-formed emoji subdivision flag sequences (U+1F3F4 + tag letters + U+E007F, e.g. the Scotland flag); stray tag characters (including the deprecated language tag U+E0001) are removed.

Examples:

>>> strip_tags("hi\U000e0050\U000e0057\U000e004e")  # tag-encoded "PWN"
'hi'
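
The "keep well-formed flags, drop stray tags" rule can be sketched with one regular expression. This is an illustration, not the library's algorithm: `sketch_strip_tags` is a hypothetical name, and "tag letters" is assumed here to mean the tag digits and lowercase tag letters used by subdivision flags.

```python
import re

_TAG_OR_FLAG = re.compile(
    # Group 1: black flag + tag digits/letters + cancel tag (kept as-is).
    "(\U0001F3F4[\U000E0030-\U000E0039\U000E0061-\U000E007A]+\U000E007F)"
    # Otherwise: any single character from the Tags block (dropped).
    "|[\U000E0000-\U000E007F]"
)


def sketch_strip_tags(text: str) -> str:
    """Remove Tags-block characters except inside well-formed flag sequences."""
    return _TAG_OR_FLAG.sub(lambda m: m.group(1) or "", text)
```

Because the flag alternative is tried first, a complete sequence is consumed in one match and returned unchanged, while tag characters found anywhere else match the second alternative and are replaced with the empty string.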

strip_variation_selectors¶

strip_variation_selectors ¶

strip_variation_selectors(text: str) -> str

Strip every variation selector (VS1–VS16 and VS17–VS256).

Variation selectors form an arbitrary-byte smuggling channel. Use strip_format if you need to keep the VS15/VS16 presentation selectors for rendering.

Examples:

>>> strip_variation_selectors("g\ufe01data")  # VS2
'gdata'
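
The 256 selectors line up with the 256 possible byte values, which is what makes them usable as a hidden data channel. The sketch below shows one possible encoding (an assumption, chosen for illustration) alongside a stdlib stripper; `byte_to_selector` and `sketch_strip_variation_selectors` are hypothetical names.

```python
import re

# VS1..VS16 live at U+FE00..U+FE0F, VS17..VS256 at U+E0100..U+E01EF.
_VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")


def byte_to_selector(value: int) -> str:
    """Map a byte 0..255 onto one of the 256 selectors (illustrative encoder)."""
    return chr(0xFE00 + value) if value < 16 else chr(0xE0100 + value - 16)


def sketch_strip_variation_selectors(text: str) -> str:
    """Delete every variation selector from text."""
    return _VARIATION_SELECTORS.sub("", text)


# A visible "g" followed by an invisible payload encoding b"hi".
carrier = "g" + "".join(byte_to_selector(b) for b in b"hi")
```

The `carrier` string displays as a plain "g" in most renderers yet holds three code points; stripping the selectors removes the payload and leaves only the visible letter.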

strip_noncharacters¶

strip_noncharacters ¶

strip_noncharacters(text: str) -> str

Strip every Unicode noncharacter (U+FDD0–U+FDEF, and U+xFFFE/U+xFFFF per plane).

Examples:

>>> strip_noncharacters("a\ufffeb")
'ab'
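
The noncharacter set has a compact arithmetic definition, sketched here with the standard library. `is_noncharacter` and `sketch_strip_noncharacters` are hypothetical names used for illustration.

```python
def is_noncharacter(code_point: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    # Masking with 0xFFFE matches both xFFFE and xFFFF in any plane.
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def sketch_strip_noncharacters(text: str) -> str:
    """Drop every noncharacter code point from text."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))
```

The predicate accepts exactly 66 code points: 32 in the U+FDD0 block plus 2 at the end of each of the 17 planes.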

strip_pua¶

strip_pua ¶

strip_pua(text: str) -> str

Strip every Private Use Area code point (BMP and planes 15/16).

PUA renders as arbitrary, font-defined glyphs (icon fonts, platform logos). Stripped by the comparison presets; use this helper to apply the same policy directly, or strip_format to preserve PUA for rendering.

Examples:

>>> strip_pua("a\ue000b")
'ab'
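
Private Use code points all carry the Unicode general category Co, so a stdlib sketch of the same policy is a one-line filter. `sketch_strip_pua` is a hypothetical name; the category test covers the BMP range U+E000–U+F8FF and both supplementary private use planes.

```python
import unicodedata


def sketch_strip_pua(text: str) -> str:
    """Drop every code point whose general category is Co (private use)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Co")
```

Checking the category instead of hard-coding ranges keeps the sketch short.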

strip_zalgo¶

strip_zalgo ¶

strip_zalgo(text: str, *, max_marks: int = 2) -> str

Strip excessive combining marks, preserving legitimate diacritics.

Caps the number of combining marks per base character at max_marks. Operates in NFD space and recomposes to NFC.

Parameters:

text (str) –

Input string (may contain zalgo abuse).
max_marks (int, default: 2 ) –

Maximum combining marks to keep per base character. Set to 0 to strip all combining marks (equivalent to strip_accents).

Returns:	`str` – String with excess combining marks removed.

Examples:

>>> strip_zalgo("café")  # 1 combining mark — preserved
'café'
>>> strip_zalgo("Việt Nam")  # 2 marks — preserved
'Việt Nam'

Caps the number of combining marks per base character, preserving legitimate diacritics (é, ñ, ệ) while removing zalgo stacking abuse.

from disarm import strip_zalgo

assert strip_zalgo("café") == 'café'
assert strip_zalgo("Việt Nam") == 'Việt Nam'

# Strip all combining marks (like strip_accents)
assert strip_zalgo("café", max_marks=0) == 'cafe'
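
The capping rule itself (work in NFD, keep at most max_marks marks after each base character, recompose to NFC) can be sketched with `unicodedata`. This is a stdlib approximation under one assumption: a "combining mark" is any character whose general category starts with M. `sketch_strip_zalgo` is a hypothetical name.

```python
import unicodedata


def sketch_strip_zalgo(text: str, *, max_marks: int = 2) -> str:
    """Keep at most max_marks combining marks after each base character."""
    kept = []
    marks_seen = 0
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            marks_seen += 1
            if marks_seen > max_marks:
                continue  # excess mark: drop it
        else:
            marks_seen = 0  # a new base character resets the count
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))
```

Counting in NFD space matters: a precomposed ệ only reveals its two marks (dot below, circumflex) after decomposition, so counting on the NFC form would treat it as having none.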

List input (batch processing)¶

transliterate, slugify, normalize, and strip_accents accept either a single str or a list[str]. When a list is passed, all strings are processed in a single Rust call, amortizing the Python → Rust boundary overhead. The return type matches the input type.

Two transliterate modes are the exception and instead process a list item by item: reverse transliteration (target=...) and context-aware transliteration (context=True).

from disarm import transliterate, slugify

titles = ["café résumé", "Straße nach München", "Москва"]

assert transliterate(titles) == ['cafe resume', 'Strasse nach Munchen', 'Moskva']

assert slugify(titles, lang="de") == ['cafe-resume', 'strasse-nach-muenchen', 'moskva']

For large datasets, passing a list is significantly faster than calling the function in a Python loop. See Performance for benchmarks.

Compatibility aliases¶

The following aliases are provided for migration convenience:

| Alias | Target | Matches |
|---|---|---|
| `unidecode` | `transliterate` | Unidecode / text-unidecode |
| `ascii_fold` | `transliterate` | Elasticsearch ICU folding |
| `casefold` | `fold_case` | `str.casefold()` |
| `remove_accents` | `strip_accents` | sklearn / ML ecosystems |

from disarm import unidecode, casefold, remove_accents

assert unidecode("café") == 'cafe'
assert casefold("Straße") == 'strasse'
assert remove_accents("café") == 'cafe'
