Core Transforms

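Functions that transform text. The transforms never mutate their input and always return a new value; the one stateful helper on this page is set_emoji_provider, which changes the global provider used by demojize.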

transliterate

transliterate module-attribute

transliterate = _transliterate_dispatch

slugify

slugify(text: str, *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ..., default: str | None = ...) -> str
slugify(text: list[str], *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ..., default: str | None = ...) -> list[str]
slugify(text: str | list[str], *, separator: str = '-', lowercase: bool = True, max_length: int = 0, word_boundary: bool = False, save_order: bool = False, stopwords: Iterable[str] = (), regex_pattern: str | None = None, replacements: Iterable[tuple[str, str]] = (), allow_unicode: bool = False, lang: str | None = None, entities: bool = True, decimal: bool = True, hexadecimal: bool = True, default: str | None = None) -> str | list[str]

Generate a URL-safe slug from Unicode text.

Full pipeline: decode entities → transliterate → lowercase → strip non-alphanumeric → collapse separators → apply stopwords/max_length.

Shares python-slugify's core keyword parameters (separator, max_length, word_boundary, save_order, stopwords, lowercase, etc.), so slugify(text, ...) calls that pass options by keyword port directly. Unlike python-slugify, which accepts some of these parameters positionally, disarm makes every parameter after text keyword-only.

Parameters:
  • text (str | list[str]) –

    Input Unicode string, or list of strings for batch processing.

  • separator (str, default: '-' ) –

    Character(s) between slug words.

  • lowercase (bool, default: True ) –

    Convert to lowercase.

  • max_length (int, default: 0 ) –

    Maximum slug length in bytes (0 = unlimited). With allow_unicode=True, multi-byte characters count as 2–4 bytes each; use grapheme_truncate for character-aware limiting.

  • word_boundary (bool, default: False ) –

    When truncating via max_length, cut at word boundaries.

  • save_order (bool, default: False ) –

    When True, only leading and trailing stopwords are removed; interior stopwords are kept so relative word order is preserved (python-slugify compatible). When False (default), all matching stopwords are removed wherever they appear. (#118)

  • stopwords (Iterable[str], default: () ) –

    Words to remove from the slug.

  • regex_pattern (str | None, default: None ) –

    Custom regex for stripping characters.

  • replacements (Iterable[tuple[str, str]], default: () ) –

    Pre-transliteration (old, new) substitution pairs.

  • allow_unicode (bool, default: False ) –

    Keep non-ASCII letters instead of transliterating.

  • lang (str | None, default: None ) –

    Language code for transliteration (e.g. "de", "ru", "auto").

  • entities (bool, default: True ) –

    Decode HTML entities before processing.

  • decimal (bool, default: True ) –

    Decode HTML decimal entities (e.g. &#123;).

  • hexadecimal (bool, default: True ) –

    Decode HTML hex entities (e.g. &#x7B;).

  • default (str | None, default: None ) –

    Fallback when the slug would be empty — i.e. the input has no sluggable characters (emoji, punctuation, or zero-width only). The value is itself run through the same slug pipeline (#193), so it is sanitized to a URL-safe slug and is subject to the same max_length truncation as normal output; a default that has no sluggable characters therefore yields the empty string. When None (the default), the empty string is returned, preserving prior behaviour. Use this to avoid the routing hazard of empty slugs colliding on one URL (#97).
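
The difference between the two save_order modes can be sketched in plain Python. This is only an illustration of the behaviour described above, operating on an already-split word list; remove_stopwords is a hypothetical helper, not part of disarm.

```python
def remove_stopwords(words, stopwords, *, save_order=False):
    """Hypothetical helper mirroring the documented save_order semantics."""
    drop = set(stopwords)
    if not save_order:
        # Default: drop every matching word wherever it appears.
        return [w for w in words if w not in drop]
    # save_order=True: trim stopwords from both ends only,
    # leaving interior stopwords (and word order) untouched.
    start, end = 0, len(words)
    while start < end and words[start] in drop:
        start += 1
    while end > start and words[end - 1] in drop:
        end -= 1
    return words[start:end]


words = ["the", "fox", "and", "the", "hound", "the"]
print(remove_stopwords(words, ["the", "and"]))                   # ['fox', 'hound']
print(remove_stopwords(words, ["the", "and"], save_order=True))  # ['fox', 'and', 'the', 'hound']
```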

Returns:
  • str | list[str]

    URL-safe slug string (or the sanitized default when it would otherwise be empty). Returns list[str] when given list[str].

Raises:
  • ValueError

    If max_length is negative (validated for both scalar and list input, #193).

  • TypeError

    If text is neither str nor list[str].

  • DisarmError

    If an internal Rust error occurs (e.g. an invalid regex_pattern or unknown lang code).

Examples:

>>> slugify("Hello World!")
'hello-world'
>>> slugify("Straße nach München", lang="de")
'strasse-nach-muenchen'
>>> slugify("My Title", separator="_")
'my_title'
>>> slugify("The Big Fox", stopwords=["the"])
'big-fox'
>>> slugify("Very Long Title Here", max_length=10, word_boundary=True)
'very-long'
>>> slugify("🔥🔥🔥")
''
>>> slugify("🔥🔥🔥", default="n-a")
'n-a'
>>> slugify("🔥", default="N/A")  # default is sanitized, not returned raw
'n-a'
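
For readers who want a feel for the pipeline stages listed at the top of this entry, here is a rough standard-library approximation. It is a sketch under stated assumptions, not disarm's implementation: sketch_slugify is a hypothetical name, "transliteration" is reduced to NFKD decomposition plus dropping non-ASCII (so ß and non-Latin scripts are lost rather than transliterated), and max_length simply cuts the string and trims a trailing separator.

```python
import html
import re
import unicodedata


def sketch_slugify(text, *, separator="-", stopwords=(), max_length=0):
    """Stdlib approximation of the documented stages (not disarm's code)."""
    text = html.unescape(text)                       # decode entities
    text = unicodedata.normalize("NFKD", text)       # crude transliteration:
    text = text.encode("ascii", "ignore").decode()   # keep ASCII only
    text = text.lower()                              # lowercase
    words = re.findall(r"[a-z0-9]+", text)           # strip non-alphanumeric
    drop = {w.lower() for w in stopwords}
    words = [w for w in words if w not in drop]      # apply stopwords
    slug = separator.join(words)                     # single separator between words
    if max_length > 0:
        slug = slug[:max_length].rstrip(separator)   # naive truncation
    return slug


print(sketch_slugify("Hello World!"))                   # hello-world
print(sketch_slugify("Caf&eacute; &amp; Bar"))          # cafe-bar
print(sketch_slugify("The Big Fox", stopwords=["the"])) # big-fox
```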

normalize

normalize(text: str, *, form: NormalizationForm = ...) -> str
normalize(text: list[str], *, form: NormalizationForm = ...) -> list[str]
normalize(text: str | list[str], *, form: NormalizationForm = 'NFC') -> str | list[str]

Unicode normalization.

Accepts a single string or a list of strings.

Parameters:
  • text (str | list[str]) –

    Input string, or list of strings for batch processing.

  • form (NormalizationForm, default: 'NFC' ) –

    Normalization form — "NFC", "NFD", "NFKC", or "NFKD".

Returns:
  • str | list[str]

    Normalized string(s). Returns str when given str, list[str] when given list[str].

Examples:

>>> normalize("é", form="NFC")
'é'
>>> normalize(["é", "ño"], form="NFC")
['é', 'ño']
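
The four forms are the standard Unicode normalization forms, so their effect can be previewed with Python's built-in unicodedata module. The sketch below uses only the standard library and assumes disarm's normalize follows the same standard; it does not call disarm.

```python
import unicodedata

composed = "\u00e9"       # é as one precomposed code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# NFC composes, NFD decomposes; both preserve canonical equivalence.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# The K (compatibility) forms additionally fold compatibility characters,
# such as the "fi" ligature U+FB01, which NFC leaves alone.
assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
```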

normalize_confusables

normalize_confusables(text: str, *, target_script: str = 'latin') -> str

Replace Unicode confusable homoglyphs with target-script equivalents.

Uses the Unicode TR39 confusables table. Characters without a confusable equivalent in the target script pass through unchanged (visual mapping only, not transliteration).

Parameters:
  • text (str) –

    Input string potentially containing homoglyphs.

  • target_script (str, default: 'latin' ) –

    Script to normalize toward. Supported values: "latin" (default, ~2,063 mappings) and "cyrillic" (~1,369 mappings).

Returns:
  • str

    String with confusable characters replaced by target-script equivalents.

Raises:
  • DisarmError

    If target_script is not a supported value.

Examples:

>>> normalize_confusables("Ηello")  # Greek Η looks like Latin H
'Hello'
>>> normalize_confusables("раypal")  # Cyrillic р/а look like Latin p/a
'paypal'
>>> normalize_confusables("paypal", target_script="cyrillic")
'раураӏ'
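
Conceptually this is a per-character table lookup with pass-through for unmapped characters. The sketch below shows that idea with a three-entry, hand-written table; the table and the function name are hypothetical stand-ins for the full TR39 data that disarm ships.

```python
# Tiny illustrative subset, NOT the TR39 table.
TO_LATIN = {
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
}


def sketch_normalize_confusables(text, table=TO_LATIN):
    # Unmapped characters pass through unchanged (no transliteration).
    return "".join(table.get(ch, ch) for ch in text)


print(sketch_normalize_confusables("\u0397ello"))        # Hello
print(sketch_normalize_confusables("\u0440\u0430ypal"))  # paypal
```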

sanitize_filename

sanitize_filename(text: str, *, separator: str = '_', max_length: int = 255, platform: Platform = 'universal', lang: str | None = None, preserve_extension: bool = True, replacement_text: str | None = None, max_len: int | None = None) -> str

Sanitize a string into a safe filename.

Transliterate → strip OS-illegal chars → collapse separators → handle reserved names (CON, NUL, etc.) → truncate respecting extension.

Parameters:
  • text (str) –

    Input string (title, user input, etc.).

  • separator (str, default: '_' ) –

    Replacement for spaces and stripped characters. Also accepted as replacement_text (pathvalidate compatibility).

  • max_length (int, default: 255 ) –

    Maximum filename length measured in bytes (UTF-8 encoded), not characters. Default 255 matches the ext4/APFS/NTFS filesystem limit. Truncation always lands on a character boundary to avoid splitting multi-byte sequences. Also accepted as max_len (pathvalidate compatibility).

  • platform (Platform, default: 'universal' ) –

    Target platform — "universal", "windows", or "posix".

  • lang (str | None, default: None ) –

    Language code for transliteration (e.g. "de", "ja").

  • preserve_extension (bool, default: True ) –

    When True (default), the file extension is kept intact within max_length. If the extension alone (including the leading .) is ≥ max_length, the extension is dropped and the whole result is truncated to max_length bytes. When False, the entire string is truncated to max_length bytes without special treatment of the extension.

Returns:
  • str

    Safe filename string.

Raises:
  • DisarmError

    If an internal Rust error occurs.

Examples:

>>> sanitize_filename("My Report (final).pdf")
'My_Report_(final).pdf'
>>> sanitize_filename("CON.txt")  # reserved on Windows
'_CON.txt'
>>> sanitize_filename("résumé.docx", lang="fr")
'resume.docx'
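
Because max_length is measured in UTF-8 bytes, truncation has to avoid cutting a multi-byte character in half while still leaving room for the extension. A minimal standard-library sketch of that idea follows. Both helper names are hypothetical, and the edge case where the extension alone does not fit is handled by dropping the extension and truncating the stem, which is one reading of the description above rather than confirmed behaviour.

```python
import os


def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" discards the incomplete sequence left at the cut point.
    return data[:max_bytes].decode("utf-8", errors="ignore")


def truncate_filename(name, max_bytes=255, *, preserve_extension=True):
    stem, ext = os.path.splitext(name)
    ext_bytes = len(ext.encode("utf-8"))
    if not preserve_extension or not ext:
        return truncate_utf8(name, max_bytes)
    if ext_bytes >= max_bytes:
        # Assumption: the extension is dropped and only the stem is truncated.
        return truncate_utf8(stem, max_bytes)
    # Reserve room for the extension, then truncate the stem.
    return truncate_utf8(stem, max_bytes - ext_bytes) + ext


print(truncate_utf8("\u00e9\u00e9\u00e9", 5))                  # éé (4 bytes; a third é would not fit)
print(truncate_filename("r\u00e9sum\u00e9_final.docx", 10))    # résu.docx
```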

strip_accents

strip_accents(text: str) -> str
strip_accents(text: list[str]) -> list[str]
strip_accents(text: str | list[str]) -> str | list[str]

Remove diacritical marks while preserving base characters.

NFD decompose → strip combining marks → NFC recompose. Accepts a single string or a list of strings.

Parameters:
  • text (str | list[str]) –

    Input string, or list of strings for batch processing.

Returns:
  • str | list[str]

    String(s) with diacritical marks removed.

Examples:

>>> strip_accents("café résumé naïve")
'cafe resume naive'
>>> strip_accents(["café", "naïve"])
['cafe', 'naive']
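
The three steps named above (NFD decompose → strip combining marks → NFC recompose) can be reproduced with the standard library. This is a sketch of the technique, not disarm's Rust code; it treats "combining marks" as the Unicode Mn (nonspacing mark) category, which is an assumption about the exact set disarm removes.

```python
import unicodedata


def sketch_strip_accents(text):
    decomposed = unicodedata.normalize("NFD", text)   # é -> e + U+0301
    kept = "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"           # drop nonspacing marks
    )
    return unicodedata.normalize("NFC", kept)         # recompose what remains


print(sketch_strip_accents("caf\u00e9 r\u00e9sum\u00e9 na\u00efve"))  # cafe resume naive
```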

fold_case

fold_case(text: str) -> str

Full Unicode case folding per CaseFolding.txt (Unicode 16.0).

Unlike str.lower(), this implements the complete Unicode Case Folding algorithm with all 1,557 status-C and status-F mappings. Covers Latin (ß→ss, ſ→s, İ→i̇), Greek (ς→σ, variant forms ϐ→β, ϑ→θ, ϕ→φ, ϖ→π, ϰ→κ, ϱ→ρ), Cyrillic, Armenian (ligature և→եւ), Georgian Mtavruli, Cherokee, Adlam, Deseret, Osage, Warang Citi, fullwidth Latin, and all Latin ligature expansions (ﬁ→fi, ﬂ→fl, ﬀ→ff, ﬃ→ffi, ﬄ→ffl, ﬅ→st, ﬆ→st).

Equivalent to str.casefold() but executed in Rust via a compile-time PHF (perfect hash function) table. Pure-ASCII strings take a branchless fast path with no table lookup.

Parameters:
  • text (str) –

    Input string.

Returns:
  • str

    Case-folded string. Characters not in CaseFolding.txt map to themselves. Output satisfies fold_case(fold_case(x)) == fold_case(x) (idempotent).

Examples:

>>> fold_case("Straße")
'strasse'
>>> fold_case("ΣΟΦΙΑ")
'σοφια'
>>> fold_case("ﬁnd")  # starts with the ﬁ ligature (U+FB01)
'find'
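
Since the description states that fold_case is equivalent to str.casefold(), the practical difference from str.lower() can be checked with the built-in methods alone. The snippet below uses only standard Python and does not call disarm.

```python
# lower() keeps characters that have no simple lowercase mapping;
# casefold() applies the full folding needed for caseless comparison.
assert "Stra\u00dfe".lower() == "stra\u00dfe"       # ß survives lower()
assert "Stra\u00dfe".casefold() == "strasse"        # ß -> ss
assert "\u03c2".lower() == "\u03c2"                 # final sigma unchanged
assert "\u03c2".casefold() == "\u03c3"              # ς -> σ
assert "\ufb01nd".casefold() == "find"              # ﬁ ligature expands

# Caseless comparison: fold both sides first.
assert "STRASSE".casefold() == "Stra\u00dfe".casefold()

# Folding is idempotent.
folded = "Stra\u00dfe".casefold()
assert folded.casefold() == folded
```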

collapse_whitespace

collapse_whitespace(text: str, *, strip_control: bool = True, strip_zero_width: bool = True) -> str

Normalize all Unicode whitespace variants to single ASCII spaces.

Optionally strip control characters and zero-width characters.

Parameters:
  • text (str) –

    Input string.

  • strip_control (bool, default: True ) –

    Remove C0/C1 control characters (U+0000–U+001F, U+007F–U+009F) except tab and newline. Carriage return (\r) is stripped, so Windows-style \r\n becomes \n.

  • strip_zero_width (bool, default: True ) –

    Remove zero-width space (U+200B), zero-width non-joiner (U+200C), zero-width joiner (U+200D), and word joiner (U+2060).

Returns:
  • str

    String with whitespace collapsed and optionally cleaned.

Examples:

>>> collapse_whitespace("  hello   world  ")
'hello world'
>>> collapse_whitespace("tabs\there\ttoo")
'tabs here too'
>>> collapse_whitespace("a\u200Bb\u200Bc")  # zero-width spaces
'abc'
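
A standard-library sketch of the same idea is shown below. It is illustrative only: sketch_collapse_whitespace is a hypothetical name, control-character stripping is omitted, and the sketch collapses every whitespace run (including newlines) to one space, which may differ from disarm's handling of newlines.

```python
# Zero-width characters listed for strip_zero_width above.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060"))


def sketch_collapse_whitespace(text, *, strip_zero_width=True):
    if strip_zero_width:
        text = text.translate(ZERO_WIDTH)   # None values delete the characters
    # str.split() with no argument splits on any Unicode whitespace run
    # and drops leading/trailing whitespace.
    return " ".join(text.split())


print(sketch_collapse_whitespace("  hello   world  "))     # hello world
print(sketch_collapse_whitespace("a\u00a0\u3000b"))        # a b
print(sketch_collapse_whitespace("a\u200bb\u200bc"))       # abc
```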

demojize

demojize(text: str, *, strip_modifiers: bool = False, errors: ErrorMode = 'replace', replace_with: str = '[?]', provider: EmojiProvider | None = None, delimiters: tuple[str, str] | None = None) -> str

Expand emoji sequences to their CLDR short-name text descriptions.

Output is always the bare CLDR short name as plain text.

Parameters:
  • text (str) –

    Input string potentially containing emoji.

  • strip_modifiers (bool, default: False ) –

    If True, collapse skin tone and hair style variants to their base form (e.g. "woman raising hand" instead of "woman raising hand: medium-dark skin tone").

  • errors (ErrorMode, default: 'replace' ) –

    How to handle emoji not in the provider's data. "replace" — substitute with replace_with. "ignore" — silently drop. "preserve" — keep the original emoji.

  • replace_with (str, default: '[?]' ) –

    Replacement string when errors="replace".

  • provider (EmojiProvider | None, default: None ) –

    An object implementing the EmojiProvider protocol. Overrides the global provider for this call. None uses the global provider or the built-in default.

  • delimiters (tuple[str, str] | None, default: None ) –

    Accepted for compatibility with the emoji library but ignored, with a DeprecationWarning. disarm always outputs bare CLDR short names without delimiters; wrap the result yourself if you need delimiters (e.g. f":{name}:").

Returns:
  • str

    Text with emoji replaced by their descriptions.

Raises:
  • DisarmError

    If an internal Rust error occurs.

Warns:
  • UserWarning

    If the provider raises an exception or returns a non-string value. The built-in CLDR tables are used as a fallback for that sequence.

Examples:

>>> demojize("I ❤️ Python 🐍")
'I red heart Python snake'
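
The three errors modes differ only in what happens to an emoji the provider cannot name. The toy model below makes that concrete. Everything in it is hypothetical: the two-entry data set, the single-codepoint matching, and the function name are illustration only and say nothing about how disarm scans emoji sequences.

```python
# Hypothetical data: the "provider" knows the snake but not the fire emoji.
KNOWN_NAMES = {"\U0001F40D": "snake"}
EMOJI = {"\U0001F40D", "\U0001F525"}


def sketch_demojize(text, *, errors="replace", replace_with="[?]"):
    out = []
    for ch in text:
        if ch not in EMOJI:
            out.append(ch)                    # ordinary text passes through
        elif ch in KNOWN_NAMES:
            out.append(KNOWN_NAMES[ch])       # named emoji -> description
        elif errors == "replace":
            out.append(replace_with)          # unknown -> placeholder
        elif errors == "preserve":
            out.append(ch)                    # unknown -> kept as-is
        elif errors != "ignore":              # "ignore" silently drops it
            raise ValueError(f"unknown errors mode: {errors!r}")
    return "".join(out)


sample = "hi \U0001F40D \U0001F525"
print(sketch_demojize(sample))                      # hi snake [?]
print(sketch_demojize(sample, errors="preserve"))   # hi snake 🔥
```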

set_emoji_provider

set_emoji_provider(provider: EmojiProvider | None = None) -> None

Set a global emoji provider for all demojize calls.

The provider must implement the EmojiProvider protocol.

Pass None to reset to the built-in default (latest English CLDR).

Note: Sequence-length cap (#199). The provider's lookup() is offered a look-ahead window of at most 9 codepoints, the length of the longest built-in CLDR emoji sequence. A provider cannot match a sequence longer than 9 codepoints: the extra codepoints fall through to the built-in tables or per-codepoint handling. This cap is fixed (it sizes a stack-allocated scan window, so widening it would cost every demojize call); design custom mappings to key on ≤ 9 codepoints. Skin-tone and variation-selector modifiers trailing a matched sequence are consumed separately and do not count toward the 9.

Parameters:
  • provider (EmojiProvider | None, default: None ) –

    An object implementing the EmojiProvider protocol, or None to reset to the built-in default.

Examples:

>>> set_emoji_provider(None)  # reset to default provider

strip_bidi

strip_bidi(text: str) -> str

Strip bidirectional override and formatting characters (UAX #9).

Removes: soft hyphen (U+00AD), Arabic Letter Mark (U+061C), LRM/RLM (U+200E/F), bidi embeddings/overrides (U+202A–U+202E), bidi isolates (U+2066–U+2069).

Parameters:
  • text (str) –

    Input string.

Returns:
  • str

    String with bidi override and formatting characters removed.

Examples:

>>> strip_bidi("hello\u200eworld")  # remove LRM
'helloworld'
>>> strip_bidi("hello\u061cworld")  # remove Arabic Letter Mark
'helloworld'
>>> strip_bidi("safe text")  # no bidi chars → unchanged
'safe text'
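
The removal set listed above is small enough to express as a translation table. The following standard-library sketch deletes exactly those code points; it mirrors the documented list and is not disarm's implementation.

```python
# Code points listed in the description above.
BIDI_CODEPOINTS = [
    0x00AD,                      # soft hyphen
    0x061C,                      # Arabic Letter Mark
    0x200E, 0x200F,              # LRM, RLM
    *range(0x202A, 0x202F),      # embeddings/overrides U+202A..U+202E
    *range(0x2066, 0x206A),      # isolates U+2066..U+2069
]
BIDI_TABLE = dict.fromkeys(BIDI_CODEPOINTS)  # None values mean "delete"


def sketch_strip_bidi(text):
    return text.translate(BIDI_TABLE)


print(sketch_strip_bidi("hello\u200eworld"))   # helloworld
print(sketch_strip_bidi("abc\u202edef"))       # abcdef
```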

strip_zalgo

strip_zalgo(text: str, *, max_marks: int = 2) -> str

Strip excessive combining marks, preserving legitimate diacritics.

Caps the number of combining marks per base character at max_marks. Operates in NFD space and recomposes to NFC.

Parameters:
  • text (str) –

    Input string (may contain zalgo abuse).

  • max_marks (int, default: 2 ) –

    Maximum combining marks to keep per base character. Set to 0 to strip all combining marks (equivalent to strip_accents).

Returns:
  • str

    String with excess combining marks removed.

Examples:

>>> strip_zalgo("café")  # 1 combining mark — preserved
'café'
>>> strip_zalgo("Việt Nam")  # 2 marks — preserved
'Việt Nam'

Caps the number of combining marks per base character, preserving legitimate diacritics (é, ñ, ệ) while removing zalgo stacking abuse.

from disarm import strip_zalgo

assert strip_zalgo("café") == 'café'
assert strip_zalgo("Việt Nam") == 'Việt Nam'

# Strip all combining marks (like strip_accents)
assert strip_zalgo("café", max_marks=0) == 'cafe'
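
The capping logic (work in NFD, keep at most max_marks marks after each base character, recompose to NFC) can be sketched with the standard library. This is an illustration of the technique, not disarm's code: sketch_strip_zalgo is a hypothetical name, and "combining mark" is taken to mean a character with a non-zero canonical combining class.

```python
import unicodedata


def sketch_strip_zalgo(text, max_marks=2):
    out = []
    run = 0  # combining marks seen since the last base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            run += 1
            if run > max_marks:
                continue          # excess mark: drop it
        else:
            run = 0               # new base character resets the count
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


zalgo = "e" + "\u0301" * 10       # one base letter, ten stacked accents
cleaned = sketch_strip_zalgo(zalgo)
print(len(unicodedata.normalize("NFD", cleaned)))   # 3: base letter plus two marks
```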

List input (batch processing)

transliterate, slugify, normalize, and strip_accents accept either a single str or a list[str]. When a list is passed, all strings are processed in a single Rust call, amortizing the Python → Rust boundary overhead. The return type matches the input type.

Two transliterate modes are the exception and instead process a list item by item: reverse transliteration (target=...) and context-aware transliteration (context=True).

from disarm import transliterate, slugify

titles = ["café résumé", "Straße nach München", "Москва"]

assert transliterate(titles) == ['cafe resume', 'Strasse nach Munchen', 'Moskva']

assert slugify(titles, lang="de") == ['cafe-resume', 'strasse-nach-muenchen', 'moskva']

For large datasets, passing a list is significantly faster than calling the function in a Python loop. See Performance for benchmarks.
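
The "return type matches the input type" contract corresponds to the overloaded signatures shown for these functions. A pure-Python sketch of that dispatch pattern is given below, assuming Python 3.9+ for the list[str] annotations. sketch_normalize is a hypothetical stand-in built on unicodedata; it shows the typing and dispatch only and does not reproduce the single-call Rust batching that makes list input fast.

```python
import unicodedata
from typing import overload


@overload
def sketch_normalize(text: str) -> str: ...
@overload
def sketch_normalize(text: list[str]) -> list[str]: ...


def sketch_normalize(text):
    """Return str for str input and list[str] for list[str] input."""
    if isinstance(text, str):
        return unicodedata.normalize("NFC", text)
    if isinstance(text, list):
        return [unicodedata.normalize("NFC", item) for item in text]
    raise TypeError("text must be str or list[str]")


print(sketch_normalize("e\u0301"))             # é
print(sketch_normalize(["e\u0301", "n\u0303o"]))  # ['é', 'ño']
```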

Compatibility aliases

The following aliases are provided for migration convenience:

| Alias | Target | Matches |
|---|---|---|
| unidecode | transliterate | Unidecode / text-unidecode |
| ascii_fold | transliterate | Elasticsearch ICU folding |
| casefold | fold_case | str.casefold() |
| remove_accents | strip_accents | sklearn / ML ecosystems |

from disarm import unidecode, casefold, remove_accents

assert unidecode("café") == 'cafe'
assert casefold("Straße") == 'strasse'
assert remove_accents("café") == 'cafe'