Precompiled Pipelines¶

Ready-to-use multi-step text processing pipelines. Each is a single compiled Rust function with no pipeline construction overhead at call time.

Renamed in 0.11 (#430)

Three presets were renamed to describe their mechanism rather than imply a safety outcome. The old names remain as deprecated aliases that behave identically; they will be removed in 1.0:

Old name	New name
`security_clean`	`canonicalize`
`display_clean`	`strip_format`
`normalize_user_input`	`canonicalize_strict`
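
When migrating a codebase off the deprecated names, the table above can be applied mechanically. The following is a minimal sketch, not part of disarm: `RENAMES` and `migrate_names` are hypothetical helpers, and the rewrite is purely textual, so it would also touch matching names inside strings and comments.

```python
import re

# Old preset name -> new preset name, copied from the rename table.
RENAMES = {
    "security_clean": "canonicalize",
    "display_clean": "strip_format",
    "normalize_user_input": "canonicalize_strict",
}

# Match only whole identifiers, so "my_security_clean_v2" is left alone.
_OLD_NAME = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_names(source: str) -> str:
    """Rewrite deprecated preset names in a piece of source text."""
    return _OLD_NAME.sub(lambda match: RENAMES[match.group(1)], source)
```

Review the resulting diff before committing, since a textual rewrite cannot tell a call site from an unrelated identifier with the same name.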

canonicalize¶

canonicalize(text: str) -> str

Canonicalize text for security-sensitive comparison.

Pipeline: NFKC → strip bidi/format → strip invisible classes (#413) → strip_control → strip_zero_width → collapse_whitespace → cap combining marks (anti-zalgo, #429) → NFC → confusables → NFC (the confusable fold is sandwiched between two NFC passes so TR39 skeletoning is normalization-stable and the preset is idempotent — #416)

Collapses fullwidth bypasses, neutralizes homoglyph spoofing, strips dangerous bidi overrides and soft hyphens, then normalizes whitespace (collapsing runs, stripping control chars and zero-width injections).

Warning: canonicalize normalizes Unicode for comparison; it is not an output sanitizer and provides no XSS, HTML, SQL, or other injection protection. By design, the NFKC step maps fullwidth lookalikes to live ASCII metacharacters (＜ → <), so the output needs context-aware encoding at least as much as the raw input did. Encode at the sink; never emit this result into markup or a query unescaped.

Parameters:	`text` (`str`) – Input string (user-submitted, network-received, etc.).

Returns:	`str` – A canonicalized string suitable for security-sensitive comparison (e.g. against a denylist). Not safe to emit unescaped into any execution or markup context; see the warning above.

Examples:

>>> canonicalize("Ηello Ꮤorld")  # Greek Η + Cherokee Ꮤ → Latin
'Hello World'

Pipeline steps¶

NFKC → strip bidi/format → strip invisibles (#413) → strip_control → strip_zero_width → collapse_whitespace → strip_zalgo (#429) → NFC → confusables → NFC

from disarm import canonicalize

assert canonicalize("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥") == 'Real text'
assert canonicalize("Ηello Ꮤorld") == 'Hello World'
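
The warning for this preset says that NFKC turns fullwidth lookalikes into live ASCII metacharacters. That effect can be reproduced with the standard library alone. The sketch below uses `unicodedata.normalize` as a stand-in for the NFKC step only (it is not the disarm pipeline) and `html.escape` as one example of encoding at an HTML sink.

```python
import html
import unicodedata

# Fullwidth angle brackets (U+FF1C, U+FF1E) around otherwise ASCII text.
payload = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"

# NFKC folds the fullwidth brackets to real "<" and ">".
folded = unicodedata.normalize("NFKC", payload)

# Encoding happens at the output sink, after normalization.
escaped = html.escape(folded)
```

Here `folded` is live markup, while `escaped` is inert text for an HTML context. Other sinks (SQL, shells, URLs) need their own encoders.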

ml_normalize¶

ml_normalize(text: str, *, lang: str | None = None, emoji: str = 'cldr') -> str

ML/NLP text normalization pipeline.

NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace

Produces clean, accent-free, lowercased text suitable for tokenizers, embeddings, and feature extraction. Emoji are expanded to their CLDR short-name descriptions.

Parameters:	`text` (`str`) – Input Unicode string.
	`lang` (`str | None`, default: `None`) – Optional language code for transliteration (e.g. "de", "ja").
	`emoji` (`str`, default: `'cldr'`) – Emoji handling mode: `"cldr"` expands emoji to CLDR short names (default); `"none"` leaves emoji characters unchanged.

Returns:	`str` – Clean, accent-free, lowercased text.

Raises:	`InvalidArgumentError` – If `emoji` is not `"cldr"` or `"none"`.
	`DisarmError` – If an internal Rust error occurs (base class of `InvalidArgumentError`).

Examples:

>>> ml_normalize("Café RÉSUMÉ")
'cafe resume'
>>> ml_normalize("München", lang="de")
'muenchen'

Pipeline steps¶

NFKC → emoji→text → [transliterate] → strip_accents → fold_case → strip_control → strip_zero_width → collapse_whitespace

from disarm import ml_normalize

assert ml_normalize("Café RÉSUMÉ") == 'cafe resume'
assert ml_normalize("München", lang="de") == 'muenchen'
assert ml_normalize("I ❤️ Python 🐍") == 'i red heart python snake'
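
To make the accent-stripping and case-folding stages concrete, here is a rough standard-library approximation of NFKC → strip_accents → fold_case → collapse_whitespace. It is a sketch under stated assumptions, not the library's implementation: `fold_text` is a hypothetical name, and emoji expansion and language-specific transliteration are omitted.

```python
import unicodedata


def fold_text(text: str) -> str:
    """Approximate NFKC -> strip accents -> fold case -> collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Decompose so each accent becomes a separate nonspacing mark (Mn).
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # casefold() is the Unicode-aware lowercase used for caseless matching.
    return " ".join(without_marks.casefold().split())
```

Without a language profile, "München" folds to "munchen" in this sketch; the "muenchen" result shown above depends on the `lang="de"` transliteration step, which has no stand-in here.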

catalog_key¶

catalog_key(text: str, *, lang: str | None = None, strict_iso9: bool = False) -> str

Library catalog key generation pipeline.

NFKC → fold_case → transliterate → confusables → strip_accents → fold_case → collapse_whitespace

Produces a canonical deduplication key for bibliographic titles.

Parameters:	`text` (`str`) – Input title or heading.
	`lang` (`str | None`, default: `None`) – Language code for transliteration (e.g. "ru", "ja").
	`strict_iso9` (`bool`, default: `False`) – Use ISO 9:1995 scholarly transliteration for Cyrillic.

Returns:	`str` – Canonical deduplication key string.

Raises:	`DisarmError` – If an internal Rust error occurs.

Examples:

>>> catalog_key("  Café  RÉSUMÉ  ")
'cafe resume'
>>> catalog_key("ΩMEGA  café")
'omega cafe'

Pipeline steps¶

NFKC → fold_case → transliterate → confusables → strip_accents → fold_case → strip_control → strip_zero_width → collapse_whitespace

from disarm import catalog_key

assert catalog_key("  Café  RÉSUMÉ  ") == 'cafe resume'
assert catalog_key("Москва", lang="ru") == 'moskva'
assert catalog_key("Москва", lang="auto") == 'moskva'
assert catalog_key("Müller", lang="de") == 'mueller'
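
A deduplication key is typically used to bucket records that should be treated as the same title. The sketch below shows that pattern with a pluggable key function. `group_duplicates` and `stand_in_key` are hypothetical names; the stand-in only folds case and whitespace, and in real use you would pass `catalog_key` instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def stand_in_key(title: str) -> str:
    """Placeholder key: case-fold and collapse whitespace only."""
    return " ".join(title.casefold().split())


def group_duplicates(
    titles: Iterable[str], key: Callable[[str], str] = stand_in_key
) -> Dict[str, List[str]]:
    """Return only the keys shared by more than one title."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        buckets[key(title)].append(title)
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

Keeping the key function as a parameter makes it easy to compare how many duplicates each preset finds on the same catalog.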

strip_format¶

strip_format(text: str) -> str

Strip bidi/format and invisible-injection vectors from rendered content.

strip bidi/format → strip invisibles (#413, rendering policy) → strip control → strip zero-width → collapse_whitespace

Lightweight cleanup for user-submitted content destined for rendering. Strips bidirectional overrides (which can visually reorder text to hide malicious content), soft hyphens, control characters, and zero-width injections, then collapses runs of whitespace to single spaces.

Warning: "Display-safe" means visual hygiene (no bidi reordering, no invisible injections), not markup safety. This does no HTML escaping and does not strip <, >, or &. When rendering into HTML, still escape at the template/output layer; disarm is not an XSS defense.

Parameters:	`text` (`str`) – Input string (user-submitted content).

Returns:	`str` – A visually cleaned string. Escape it at the output layer before rendering into HTML or any other markup context (see warning above).

Examples:

>>> strip_format("hello\x00world\u200b!")
'helloworld!'
>>> strip_format("  spaced   out  ")
'spaced out'

Pipeline steps¶

strip_bidi → strip invisibles (#413, rendering policy) → strip_control → strip_zero_width → collapse_whitespace

from disarm import strip_format

assert strip_format("hello\x00world\u200b!") == 'helloworld!'
assert strip_format("  spaced   out  ") == 'spaced out'
assert strip_format("admin\u202Euser") == 'adminuser'
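
The general idea behind this preset can be approximated with Unicode general categories. The sketch below is an assumption-laden stand-in, not disarm's actual policy: it drops every format character (category Cf, which covers bidi controls, zero-width characters, and the soft hyphen) and every non-whitespace control character (Cc), then collapses whitespace. Dropping all of Cf is broader than a rendering policy may want, since it also removes joiners used inside emoji sequences.

```python
import unicodedata


def strip_format_sketch(text: str) -> str:
    """Drop format/control characters, then collapse runs of whitespace."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cf":
            continue  # bidi overrides, zero-width characters, soft hyphen
        if category == "Cc" and not ch.isspace():
            continue  # NUL and other controls, but keep tabs and newlines
        kept.append(ch)
    return " ".join("".join(kept).split())
```

Markup characters pass straight through this sketch, which mirrors the warning for the real preset: visual cleanup is not escaping.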

search_key¶

search_key(text: str, *, lang: str | None = None) -> str

Search index key generation pipeline.

NFKC → fold_case → transliterate → strip_accents → fold_case → collapse_whitespace

Produces a case-insensitive, accent-insensitive, script-insensitive lookup key. Like catalog_key but without confusable normalization, so it is lighter and faster for search indexes.

Parameters:	`text` (`str`) – Input text to generate a search key from.
	`lang` (`str | None`, default: `None`) – Language code for transliteration (e.g. "ru", "de").

Returns:	`str` – Normalized search key string.

Examples:

>>> search_key("  Café  RÉSUMÉ  ")
'cafe resume'
>>> search_key("Москва")
'moskva'
>>> search_key("Über allen Gipfeln")
'uber allen gipfeln'

Pipeline steps¶

NFKC → fold_case → transliterate → strip_accents → fold_case → strip_control → strip_zero_width → collapse_whitespace

from disarm import search_key

assert search_key("Café RÉSUMÉ") == 'cafe resume'
assert search_key("Москва", lang="ru") == 'moskva'
assert search_key("ΩMEGA", lang="auto") == 'omega'

sort_key¶

sort_key(text: str, *, lang: str | None = None) -> str

Sort key generation pipeline.

Pipeline: NFKC → strip_bidi → fold_case → transliterate-non-Latin → fold_case → collapse_whitespace

A case-insensitive collation key that, unlike search_key, preserves base accented characters rather than folding them away, so accented and unaccented forms stay distinct ("Über" folds to "über", not "uber") and the accent survives for a locale-aware collator. Non-Latin scripts are still folded to a consistent Latin form ("Война" → "voyna") so cross-script titles interfile. This is the collation counterpart to search_key, which folds accents away for exact-match lookup; the two are deliberately not interchangeable for accented Latin input.

Note: the result is a normalized string, not a UCA collation-weight key, so comparing keys with plain codepoint ordering will not interfile über with ASCII u… words. Pass the key to a Unicode/locale collator when linguistically-correct order matters; the value here is that the accent is preserved for it rather than folded away.

Because Latin letters are preserved verbatim, lang only affects transliteration of non-Latin runs; an accented Latin letter is never expanded by a language profile here (e.g. sort_key("Über", lang="de") is "über", whereas search_key("Über", lang="de") is "ueber").

Parameters:	`text` (`str`) – Input text to generate a sort key from.
	`lang` (`str | None`, default: `None`) – Language code for transliteration of non-Latin scripts (e.g. "ru", "de").

Returns:	`str` – Normalized sort key string.

Examples:

>>> sort_key("Война и мир")
'voyna i mir'
>>> sort_key("Über allen Gipfeln")
'über allen gipfeln'
>>> sort_key("  Café  ")
'café'

Pipeline steps¶

NFKC → strip_bidi → fold_case → transliterate-non-Latin → fold_case → strip_control → strip_zero_width → collapse_whitespace

Unlike search_key, sort_key preserves base accented characters so accented and unaccented forms stay distinct and the accent survives for a locale-aware collator. Non-Latin scripts are still folded to a consistent Latin form; Latin letters (including accented ones) are kept verbatim, so lang only affects non-Latin runs. (The key is a normalized string, not a UCA weight key — pass it to a Unicode collator when linguistically-correct order matters.)

from disarm import search_key, sort_key

# accents preserved for ordering (contrast search_key, which folds them away)
assert sort_key("Über") == 'über'
assert search_key("Über") == 'uber'
# a language profile never expands an accented Latin letter in a sort key
assert sort_key("Über", lang="de") == 'über'
# non-Latin scripts are still folded to Latin so titles interfile
assert sort_key("Война и мир", lang="ru") == 'voyna i mir'
assert sort_key("Café") == 'café'
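
The note about codepoint ordering is easy to see with plain `sorted`. The sketch below sorts a few already-normalized keys two ways: by raw codepoints, and with a crude two-level comparison (base letters first, accents as a tie-breaker). `approx_collation_key` is a hypothetical helper and only a rough approximation; it is not a UCA implementation, and a real Unicode or locale collator should be used when linguistically correct order matters.

```python
import unicodedata


def base_letters(key: str) -> str:
    """Return the key with nonspacing marks (accents) removed."""
    decomposed = unicodedata.normalize("NFD", key)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def approx_collation_key(key: str):
    """Compare base letters first; use the accented form only to break ties."""
    return (base_letters(key), key)


keys = ["zebra", "über", "apple", "uber"]
by_codepoint = sorted(keys)  # "über" lands after "zebra"
by_approx = sorted(keys, key=approx_collation_key)  # "über" files next to "uber"
```

The second ordering is only possible because the accent is still present in the key; had it been folded away, "über" and "uber" would be indistinguishable.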

canonicalize_strict¶

canonicalize_strict(text: str) -> str

Strict Unicode canonicalization of user input — not an injection defense.

Warning: This normalizes Unicode; it does not make text safe to emit into HTML, JS, URLs, SQL, or shells. It performs no escaping and does not strip <, >, or &: <script>alert(1)</script> passes through unchanged, and the NFKC step can surface ASCII metacharacters from fullwidth lookalikes (＜script＞ → <script>). This is not XSS or injection protection: encode at the output sink (framework auto-escaping, DOMPurify, parameterized queries). Run this before that encoder, never instead of it. The deprecated alias normalize_user_input predates this clarification.

Preserves the original script (no transliteration) while neutralizing Unicode-level attack vectors: zalgo stacking, homoglyph spoofing, bidi overrides, zero-width injections, and control characters.

Pipeline: NFKC → strip_bidi → strip_zero_width → strip_control → strip invisible classes (#413) → strip_zalgo → confusables → collapse_whitespace → NFC (invisibles are stripped before zalgo-capping so they cannot split combining-mark runs, and the terminal NFC recomposes any base+mark left adjacent by a stripped invisible — keeping the output idempotent, #416/#413)

Parameters:	`text` (`str`) – User-submitted input string.

Returns:	`str` – A Unicode-normalized string. Safe for storage/comparison; encode it before emitting into any markup or query context (see warning above).

Examples:

>>> canonicalize_strict("Hello, world!")
'Hello, world!'
>>> canonicalize_strict("p\u0430ypal")  # Cyrillic а → Latin a
'paypal'
>>> canonicalize_strict("admin\u202euser")  # RLO stripped
'adminuser'

Pipeline steps¶

NFKC → strip_bidi → strip_zero_width → strip_control → strip invisibles (#413) → strip_zalgo → confusables → collapse_whitespace → NFC

from disarm import canonicalize_strict

assert canonicalize_strict("Hello, world!") == 'Hello, world!'
assert canonicalize_strict("p\u0430ypal") == 'paypal'
assert canonicalize_strict("admin\u202Euser") == 'adminuser'

Both canonicalize and canonicalize_strict cap excessive combining-mark stacking (zalgo); they differ in step order, as the PRESETS listing shows. Unlike catalog_key/search_key, this pipeline does not transliterate, so the original script is preserved.
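
The pipeline note for this preset says the terminal NFC pass keeps the output idempotent after invisibles are stripped. The effect can be demonstrated with the standard library. In this sketch, a zero-width space stands in for any stripped invisible, and both functions are hypothetical illustrations rather than disarm internals.

```python
import unicodedata

ZWSP = "\u200b"  # zero-width space, standing in for any stripped invisible


def strip_without_renormalizing(text: str) -> str:
    """Normalize first, strip last: a base and mark can be left uncomposed."""
    return unicodedata.normalize("NFC", text).replace(ZWSP, "")


def strip_then_nfc(text: str) -> str:
    """Strip first, normalize last: the terminal NFC recomposes base + mark."""
    return unicodedata.normalize("NFC", text.replace(ZWSP, ""))


def is_idempotent(fn, sample: str) -> bool:
    """True when applying fn a second time changes nothing."""
    once = fn(sample)
    return fn(once) == once


# "e", an invisible separator, then a combining acute accent.
sample = "e" + ZWSP + "\u0301"
```

With `sample`, the first function returns a decomposed "e" plus accent on the first call and a composed "é" on the second, so a stored value would not match its own re-normalization. Ending on NFC avoids that.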

strip_obfuscation¶

strip_obfuscation(text: str) -> str

Maximum-strength text deobfuscation.

Neutralizes homoglyph spoofing, zalgo abuse, invisible character injection, and bidi attacks. Uses TR39 confusable mapping (visual similarity) — Cyrillic р→p, с→c, В→B — not phonetic transliteration.

Not an output sanitizer. Resolves Unicode obfuscation only; performs no HTML/JS/SQL escaping and does not strip <, >, &. Encode at the output sink — this is not XSS or injection protection.

Does not transliterate. Non-Latin scripts that have no Latin confusable equivalent pass through unchanged. Chain with transliterate() explicitly if you also need romanization.

Preserves case. Case is not deception — proper nouns, acronyms, and sentence boundaries are meaningful. Chain with fold_case() if lowercasing is also needed.

Pipeline: NFKC → strip_zalgo(max_marks=0) → strip_bidi → strip_zero_width → demojize → confusables → strip_accents → collapse_whitespace (confusables runs after demojize so typographic punctuation in emoji names is folded too, keeping the output idempotent)

Parameters:	`text` (`str`) – Input text (user-generated, adversarial, multilingual).

Returns:	`str` – Deobfuscated string with homoglyphs resolved, zalgo stripped, and invisible characters removed. Case is preserved.

Examples:

>>> strip_obfuscation("P\u0430yP\u0430l")  # Cyrillic а → Latin a
'PayPal'
>>> strip_obfuscation("\u0420rodu\u0441t")  # Cyrillic Р→P, с→c
'Product'
>>> strip_obfuscation("H\u0338a\u0338t\u0338e\u0338 speech")
'Hate speech'

Pipeline steps¶

NFKC → strip_zalgo(0) → strip_bidi → strip_zero_width → demojize → strip invisibles (#413) → confusables → strip_accents → strip_control → collapse_whitespace

from disarm import strip_obfuscation

# Homoglyphs (Greek/Cyrillic) folded, bidi override removed, emoji expanded.
assert strip_obfuscation("Ηеllо\u202eWоrld \U0001F600") == "HelloWorld grinning face"
# Strips ALL combining marks (zalgo and accents) but preserves case.
assert strip_obfuscation("Cáfé") == "Cafe"

Maximum-strength deobfuscation for content moderation, anti-phishing, and spam/NLP preprocessing. Strips every combining mark (zalgo and accents), resolves homoglyphs by TR39 visual similarity (Cyrillic р→p, not phonetic р→r), and expands emoji to text. Preserves case — case is meaningful, not deception. Confusable normalization runs after demojize so typographic punctuation inside emoji names is folded too. Does not transliterate; chain transliterate() on the result if you also need phonetic romanization.
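
Stripping or capping combining marks can be sketched with `unicodedata.combining`. The function below is a hypothetical stand-in for a zalgo cap, not the library's strip_zalgo: it keeps at most `max_marks` combining characters after each base character, and with the default of 0 it removes accents as well as zalgo stacks, matching the behaviour described above. It relies on the canonical combining class, so marks with class 0 are treated as base characters.

```python
import unicodedata


def cap_marks(text: str, max_marks: int = 0) -> str:
    """Keep at most max_marks combining characters per base character."""
    kept = []
    run = 0  # combining marks seen since the last base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            run += 1
            if run > max_marks:
                continue  # over the cap: drop this mark
        else:
            run = 0
        kept.append(ch)
    # Recompose whatever base + mark pairs survived.
    return unicodedata.normalize("NFC", "".join(kept))
```

A cap of 0 is the aggressive setting; a small positive cap keeps ordinary accents while still cutting down long stacks.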

PRESETS¶

from disarm import PRESETS

Dict mapping preset function names to their ordered pipeline steps. Each value is a list of (step_name, parameter) tuples in execution order.

assert PRESETS["canonicalize"] == [('normalize', 'NFKC'), ('strip_bidi', None), ('strip_invisibles', 'comparison'), ('strip_control', None), ('strip_zero_width', None), ('collapse_whitespace', None), ('strip_zalgo', None), ('normalize', 'NFC'), ('confusables', 'latin'), ('normalize', 'NFC')]
assert PRESETS["canonicalize_strict"] == [('normalize', 'NFKC'), ('strip_bidi', None), ('strip_zero_width', None), ('strip_control', None), ('strip_invisibles', 'comparison'), ('strip_zalgo', None), ('confusables', 'latin'), ('collapse_whitespace', None), ('normalize', 'NFC')]

Use PRESETS to audit exactly which transforms a preset applies, or to build equivalent TextPipeline configurations.
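
As an illustration of such an audit, the sketch below works on a literal copy of the two entries shown above, so it runs without disarm installed. `PRESETS_SAMPLE`, `presets_using`, and `runs_before` are hypothetical names; in real use you would pass the imported `PRESETS` dict instead of the copy.

```python
# Literal copy of the two PRESETS entries listed above.
PRESETS_SAMPLE = {
    "canonicalize": [
        ("normalize", "NFKC"), ("strip_bidi", None),
        ("strip_invisibles", "comparison"), ("strip_control", None),
        ("strip_zero_width", None), ("collapse_whitespace", None),
        ("strip_zalgo", None), ("normalize", "NFC"),
        ("confusables", "latin"), ("normalize", "NFC"),
    ],
    "canonicalize_strict": [
        ("normalize", "NFKC"), ("strip_bidi", None),
        ("strip_zero_width", None), ("strip_control", None),
        ("strip_invisibles", "comparison"), ("strip_zalgo", None),
        ("confusables", "latin"), ("collapse_whitespace", None),
        ("normalize", "NFC"),
    ],
}


def presets_using(presets, step):
    """Names of presets whose pipeline contains the given step."""
    return sorted(
        name for name, steps in presets.items()
        if any(step_name == step for step_name, _ in steps)
    )


def runs_before(presets, name, first, second):
    """True if the first occurrence of `first` precedes that of `second`."""
    names = [step_name for step_name, _ in presets[name]]
    return names.index(first) < names.index(second)
```

Checks like `runs_before` are useful in a test suite to pin down ordering guarantees you depend on, such as whitespace collapsing relative to confusable folding.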

Policy Profiles¶

Named policy profiles provide pre-configured TextPipeline instances for common institutional and application workflows.

get_pipeline¶

from disarm import get_pipeline

pipe = get_pipeline("scholarly_cyrillic_iso9")
assert pipe("Москва") == 'moskva'

Returns a fresh TextPipeline configured for the named profile. Raises DisarmError for unknown profiles.

list_profiles¶

from disarm import list_profiles

print(list_profiles())
# ['library_catalog_key_eu', 'llm_guardrail', 'ml_corpus_normalize',
#  'normalize_web_input', 'rag_ingest', 'scholarly_cyrillic_iso9', 'search_index']

Returns a sorted list of available profile names.

Available profiles¶

Profile	Steps	Output
`scholarly_cyrillic_iso9`	NFKC → transliterate (ISO 9) → fold_case → collapse_whitespace	UTF-8
`library_catalog_key_eu`	NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace	ASCII
`normalize_web_input`	NFKC → confusables → collapse_whitespace	UTF-8
`ml_corpus_normalize`	NFKC → demojize → strip_accents → fold_case → collapse_whitespace	ASCII
`search_index`	NFKC → transliterate → strip_accents → fold_case → collapse_whitespace	ASCII
`llm_guardrail`	NFKC → strip_zalgo(0) → strip_bidi → demojize → strip_accents → confusables → fold_case → strip_control → strip_zero_width → collapse_whitespace	UTF-8
`rag_ingest`	NFKC → strip_bidi → strip_accents → transliterate → strip_control → strip_zero_width → collapse_whitespace	ASCII

llm_guardrail hardens text against prompt-injection and homoglyph/zalgo/bidi obfuscation before it reaches an LLM (digits are never remapped to letters). rag_ingest canonicalizes documents for retrieval pipelines while preserving case.

Homoglyph handling: rag_ingest romanizes, it does not visually-fold (#258)

llm_guardrail and rag_ingest canonicalize homoglyphs differently, and the distinction matters for spoof resistance:

llm_guardrail runs confusables without transliterate, so a Cyrillic look-alike of "paypal" (раураl) is visually folded to paypal — it collides with the real Latin term (good for "treat the spoof as the word it imitates").
rag_ingest runs transliterate, which phonetically romanizes the same input to raural — a distinct key, so the spoof does not impersonate the real term, and legitimate non-Latin text still romanizes for retrieval (Москва → Moskva).

These are deliberate trade-offs of the fixed step order (transliterate runs before confusables; running confusables first would mangle legitimate Cyrillic/Greek into mixed-script gibberish). Adding confusables to rag_ingest would be a no-op — transliterate has already consumed the non-Latin characters. If you need homoglyph spoofs folded onto the term they imitate, use llm_guardrail (or a dedicated confusables pass), not rag_ingest.
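
The difference between visual folding and phonetic romanization can be shown with two tiny lookup tables. This is a toy sketch: `VISUAL_FOLD` and `PHONETIC` are hypothetical three-entry tables covering only the letters in the example, not the TR39 data or disarm's transliteration rules.

```python
# Cyrillic er (U+0440), a (U+0430), u (U+0443): the letters used in the spoof.
VISUAL_FOLD = {"\u0440": "p", "\u0430": "a", "\u0443": "y"}  # by appearance
PHONETIC = {"\u0440": "r", "\u0430": "a", "\u0443": "u"}     # by sound


def map_chars(text: str, table: dict) -> str:
    """Replace each character found in the table; leave the rest untouched."""
    return "".join(table.get(ch, ch) for ch in text)


# Cyrillic look-alike of "paypal" with a Latin "l" at the end.
spoof = "\u0440\u0430\u0443\u0440\u0430l"

folded = map_chars(spoof, VISUAL_FOLD)   # collides with the real term
romanized = map_chars(spoof, PHONETIC)   # stays a distinct key
```

Which outcome is preferable depends on the goal: collision helps when a spoof should be treated as the word it imitates, while a distinct key keeps the spoof from impersonating the real term in an index.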

See Policy Templates for detailed usage guidance and institutional recipes.