Precompiled Pipelines

Ready-to-use multi-step text processing pipelines. Each is a single compiled Rust function with no pipeline construction overhead at call time.

security_clean

security_clean(text: str) -> str

Security-focused text canonicalization.

Pipeline: NFKC → confusables → strip bidi/format → collapse_whitespace → (path-separator neutralization)

Collapses fullwidth bypasses, neutralizes homoglyph spoofing, strips dangerous bidi overrides and soft hyphens, then normalizes whitespace (collapsing runs, stripping control chars and zero-width injections).

.. warning:: Canonicalizes Unicode for comparison; it is not an output sanitizer and provides no XSS/HTML/SQL/injection protection. The NFKC step maps fullwidth lookalikes to live ASCII metacharacters by design (＜ → <), so context-encoding the output on the way out becomes more important, not less. Encode at the sink; never emit this result into markup or a query unescaped.

Parameters:
  • text (str) –

    Input string (user-submitted, network-received, etc.).

Returns:
  • str

    A canonicalized string suitable for security-sensitive comparison (e.g. against a denylist). Not safe to emit unescaped into any execution or markup context; see the warning above.

Examples:

>>> security_clean("Ηello Ꮤorld")  # Greek Η + Cherokee Ꮤ → Latin
'Hello World'

Pipeline steps

NFKC → confusables → strip bidi/format → collapse_whitespace → (path-separator neutralization)

from disarm import security_clean

assert security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥") == 'Real text'
assert security_clean("Ηello Ꮤorld") == 'Hello World'
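
The warning above can be illustrated without disarm itself. The following is a minimal sketch using only the Python standard library: `unicodedata.normalize` stands in for the NFKC step alone (it is not the library's implementation and skips the other steps), `html.escape` plays the role of the output encoder, and the helper name `render_comment` is hypothetical.

```python
import html
import unicodedata


def render_comment(raw: str) -> str:
    """Canonicalize for comparison, then encode at the HTML sink."""
    # Stand-in for the NFKC step only; security_clean does more than this.
    canonical = unicodedata.normalize("NFKC", raw)
    # Encoding happens at the point of output, after canonicalization.
    return "<p>" + html.escape(canonical) + "</p>"


# Fullwidth less-than and greater-than signs around the word "script".
fullwidth = "\uff1cscript\uff1e"

# NFKC turns the fullwidth lookalikes into live ASCII metacharacters.
surfaced = unicodedata.normalize("NFKC", fullwidth)
print(surfaced)
print(render_comment(fullwidth))
```

The first line printed shows why canonicalized text still needs encoding: the angle brackets only become real markup characters after normalization.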

ml_normalize

ml_normalize(text: str, *, lang: str | None = None, emoji: str = 'cldr') -> str

ML/NLP text normalization pipeline.

Pipeline: NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace

Produces clean, accent-free, lowercased text suitable for tokenizers, embeddings, and feature extraction. Emoji are expanded to their CLDR short-name descriptions.

Parameters:
  • text (str) –

    Input Unicode string.

  • lang (str | None, default: None ) –

    Optional language code for transliteration (e.g. "de", "ja").

  • emoji (str, default: 'cldr' ) –

    Emoji handling mode. "cldr" — expand emoji to CLDR short names (default). "none" — leave emoji characters unchanged.

Returns:
  • str

    Clean, accent-free, lowercased text.

Raises:
  • InvalidArgumentError

    If emoji is not "cldr" or "none".

  • DisarmError

    If an internal Rust error occurs. DisarmError is also the base class of InvalidArgumentError.

Examples:

>>> ml_normalize("Café RÉSUMÉ")
'cafe resume'
>>> ml_normalize("München", lang="de")
'muenchen'

Pipeline steps

NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace

from disarm import ml_normalize

assert ml_normalize("Café RÉSUMÉ") == 'cafe resume'
assert ml_normalize("München", lang="de") == 'muenchen'
assert ml_normalize("I ❤️ Python 🐍") == 'i red heart python snake'
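
To make the step order concrete, here is a rough standard-library approximation of four of the steps (NFKC, strip_accents, fold_case, collapse_whitespace). It is a sketch under assumptions, not how disarm implements them: the function name `approx_ml_normalize` is hypothetical, and the emoji and transliteration steps are omitted entirely.

```python
import unicodedata


def approx_ml_normalize(text: str) -> str:
    """Stdlib approximation: NFKC, strip accents, fold case, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Decompose so accents become separate combining marks, then drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.casefold()
    # split() with no arguments splits on any whitespace run and trims both ends.
    return " ".join(text.split())


print(approx_ml_normalize("Café RÉSUMÉ"))
print(approx_ml_normalize("München"))
```

Because this sketch has no transliteration step, "München" comes out with a plain "u" here, whereas the documented `lang="de"` call expands the umlaut to "ue".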

catalog_key

catalog_key(text: str, *, lang: str | None = None, strict_iso9: bool = False) -> str

Library catalog key generation pipeline.

Pipeline: NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace

Produces a canonical deduplication key for bibliographic titles.

Parameters:
  • text (str) –

    Input title or heading.

  • lang (str | None, default: None ) –

    Language code for transliteration (e.g. "ru", "ja").

  • strict_iso9 (bool, default: False ) –

    Use ISO 9:1995 scholarly transliteration for Cyrillic.

Returns:
  • str

    Canonical deduplication key string.

Raises:
  • DisarmError

    If an internal Rust error occurs.

Examples:

>>> catalog_key("  Café  RÉSUMÉ  ")
'cafe resume'
>>> catalog_key("ΩMEGA  café")
'omega cafe'

Pipeline steps

NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace

from disarm import catalog_key

assert catalog_key("  Café  RÉSUMÉ  ") == 'cafe resume'
assert catalog_key("Москва", lang="ru") == 'moskva'
assert catalog_key("Москва", lang="auto") == 'moskva'
assert catalog_key("Müller", lang="de") == 'mueller'
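
A deduplication key is typically used to group records that normalize to the same string. The sketch below shows that grouping pattern with the key function passed in as a parameter; `group_duplicates` and `simple_key` are hypothetical names, and `simple_key` is a deliberately crude stand-in (case folding and whitespace collapsing only) so the example runs without disarm installed.

```python
from collections import defaultdict
from typing import Callable


def simple_key(title: str) -> str:
    # Crude stand-in for a real key function: fold case, collapse whitespace.
    return " ".join(title.casefold().split())


def group_duplicates(
    titles: list[str], key_fn: Callable[[str], str]
) -> dict[str, list[str]]:
    """Return only the keys shared by more than one title."""
    groups: dict[str, list[str]] = defaultdict(list)
    for title in titles:
        groups[key_fn(title)].append(title)
    return {key: members for key, members in groups.items() if len(members) > 1}


titles = ["  War  and PEACE ", "war and peace", "Anna Karenina"]
duplicates = group_duplicates(titles, simple_key)
print(duplicates)
```

In real use one would presumably pass `catalog_key` (or a `functools.partial` of it with `lang` set) as `key_fn`.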

display_clean

display_clean(text: str) -> str

Display-safe text cleaning pipeline.

Pipeline: strip bidi/format → collapse_whitespace (strip control + strip zero-width)

Lightweight cleanup for user-submitted content destined for rendering. Strips bidirectional overrides (which can visually reorder text to hide malicious content), soft hyphens, control characters, and zero-width injections, then collapses runs of whitespace to single spaces.

.. warning:: "Display-safe" means visual hygiene (no bidi reordering, no invisible injections) — not markup-safe. This does no HTML escaping and does not strip <, >, &. When rendering into HTML, still escape at the template/output layer; disarm is not an XSS defense.

Parameters:
  • text (str) –

    Input string (user-submitted content).

Returns:
  • str

    A visually cleaned string. Escape it at the output layer before rendering into HTML or any other markup context (see warning above).

Examples:

>>> display_clean("hello\x00world\u200b!")
'helloworld!'
>>> display_clean("  spaced   out  ")
'spaced out'

Pipeline steps

strip_bidi → strip_control → strip_zero_width → collapse_whitespace

from disarm import display_clean

assert display_clean("hello\x00world\u200b!") == 'helloworld!'
assert display_clean("  spaced   out  ") == 'spaced out'
assert display_clean("admin\u202Euser") == 'adminuser'
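
The split between visual hygiene and markup safety can be sketched with the standard library. The function below is a coarse approximation, not disarm's implementation: it drops every character in Unicode category Cf (which covers bidi controls, zero-width characters, and the soft hyphen, but also joiners that some emoji sequences rely on) and non-whitespace control characters. The names `approx_display_clean` and `render` are hypothetical.

```python
import html
import unicodedata


def approx_display_clean(text: str) -> str:
    """Drop format and control characters, then collapse whitespace."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cf":  # bidi controls, zero-width chars, soft hyphen
            continue
        if category == "Cc" and not ch.isspace():  # NUL and similar controls
            continue
        kept.append(ch)
    return " ".join("".join(kept).split())


def render(text: str) -> str:
    # Visual cleanup first, HTML escaping at the output layer second.
    return html.escape(approx_display_clean(text))


print(approx_display_clean("admin\u202euser"))
print(render("<b>hi</b>\u00ad"))
```

The cleaned string still contains the angle brackets; only the `html.escape` call in `render` neutralizes them.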

search_key

search_key(text: str, *, lang: str | None = None) -> str

Search index key generation pipeline.

Pipeline: NFKC → transliterate → strip_accents → fold_case → collapse_whitespace

Produces a case-insensitive, accent-insensitive, script-insensitive lookup key. Like :func:catalog_key but without confusable normalization — lighter and faster for search indexes.

Parameters:
  • text (str) –

    Input text to generate a search key from.

  • lang (str | None, default: None ) –

    Language code for transliteration (e.g. "ru", "de").

Returns:
  • str

    Normalized search key string.

Examples:

>>> search_key("  Café  RÉSUMÉ  ")
'cafe resume'
>>> search_key("Москва")
'moskva'
>>> search_key("Über allen Gipfeln")
'uber allen gipfeln'

Pipeline steps

NFKC → transliterate → strip_accents → fold_case → collapse_whitespace

from disarm import search_key

assert search_key("Café RÉSUMÉ") == 'cafe resume'
assert search_key("Москва", lang="ru") == 'moskva'
assert search_key("ΩMEGA", lang="auto") == 'omega'

sort_key

sort_key(text: str, *, lang: str | None = None) -> str

Sort key generation pipeline.

Pipeline: NFKC → strip_bidi → transliterate → fold_case → collapse_whitespace

Produces a case-insensitive ASCII key for alphabetical ordering. Transliteration folds accented characters to their ASCII base (é → e, ü → u), so the result is accent-folded, not accent-preserving.

.. note:: In practice this currently produces the same output as :func:search_key: search_key adds an explicit accent-strip pass, but transliteration has already removed accents by that point, so the two keys coincide for typical input. Use whichever name documents intent at the call site. (Distinct accent-preserving ordering is tracked for a future release.)

Parameters:
  • text (str) –

    Input text to generate a sort key from.

  • lang (str | None, default: None ) –

    Language code for transliteration (e.g. "ru", "de").

Returns:
  • str

    Normalized sort key string.

Examples:

>>> sort_key("Война и мир")
'voyna i mir'
>>> sort_key("Über allen Gipfeln")
'uber allen gipfeln'
>>> sort_key("  Café  ")
'cafe'

Pipeline steps

NFKC → transliterate → fold_case → collapse_whitespace

from disarm import sort_key

assert sort_key("Über", lang="de") == 'ueber'
assert sort_key("Война и мир", lang="ru") == 'voyna i mir'
assert sort_key("Café") == 'cafe'
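
A sort key is meant to be passed as the `key=` argument of `sorted` or `list.sort`. The sketch below contrasts plain code-point ordering with ordering by an accent-folded key. `approx_sort_key` is a hypothetical standard-library stand-in (decompose, drop combining marks, fold case) and does no transliteration, so it only approximates the documented behaviour for Latin-script input.

```python
import unicodedata


def approx_sort_key(text: str) -> str:
    """Accent-folded, case-folded key for Latin-script strings."""
    decomposed = unicodedata.normalize("NFKD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(base.casefold().split())


# "\u00c9mile" is "Émile" and "\u00dcber" is "Über", written with escapes.
names = ["zebra", "\u00c9mile", "apple", "\u00dcber"]

by_code_point = sorted(names)                 # accented capitals sort last
by_key = sorted(names, key=approx_sort_key)   # accents and case are ignored

print(by_code_point)
print(by_key)
```

Sorting by key leaves the original strings untouched; only the comparison uses the folded form.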

normalize_user_input

normalize_user_input(text: str) -> str

Unicode hygiene for user-submitted input — not an injection defense.

.. warning:: This normalizes Unicode; it does not make text safe to emit into HTML, JS, URLs, SQL, or shells. It performs no escaping and does not strip <, >, or &; <script>alert(1)</script> passes through unchanged, and the NFKC step can surface ASCII metacharacters from fullwidth lookalikes (＜script＞ → <script>). This is not XSS or injection protection: encode at the output sink (framework auto-escaping, DOMPurify, parameterized queries). Run this before that encoder, never instead of it. The name predates this clarification.

Preserves the original script (no transliteration) while neutralizing Unicode-level attack vectors: zalgo stacking, homoglyph spoofing, bidi overrides, zero-width injections, and control characters.

Pipeline: NFKC → strip_bidi → strip_zero_width → strip_control → strip_zalgo → confusables → collapse_whitespace → (path-separator neutralization)

Invisibles are stripped before zalgo-capping so they cannot split combining-mark runs, which keeps the output idempotent.

Parameters:
  • text (str) –

    User-submitted input string.

Returns:
  • str

    A Unicode-normalized string. Safe for storage/comparison; **encode it before emitting into any markup or query context** (see warning above).

Examples:

>>> normalize_user_input("Hello, world!")
'Hello, world!'
>>> normalize_user_input("p\u0430ypal")  # Cyrillic а → Latin a
'paypal'
>>> normalize_user_input("admin\u202euser")  # RLO stripped
'adminuser'

Pipeline steps

NFKC → strip_bidi → strip_zero_width → strip_control → strip_zalgo → confusables → collapse_whitespace → (path-separator neutralization)

from disarm import normalize_user_input

assert normalize_user_input("Hello, world!") == 'Hello, world!'
assert normalize_user_input("p\u0430ypal") == 'paypal'
assert normalize_user_input("admin\u202Euser") == 'adminuser'

Unlike security_clean, this pipeline also strips zalgo text (excessive combining mark stacking). Unlike catalog_key/search_key, it does not transliterate — the original script is preserved.
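
The ordering rationale given for this pipeline (strip invisibles before capping combining marks) can be demonstrated with a small standard-library sketch. Everything here is an assumption made for illustration: the helper names, the set of zero-width characters, and the cap of two marks per base character are not taken from disarm.

```python
import unicodedata

# Assumed set of invisible characters for this sketch.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def cap_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most max_marks consecutive combining marks."""
    out, run = [], 0
    for ch in text:
        if unicodedata.combining(ch):
            run += 1
            if run > max_marks:
                continue  # drop marks beyond the cap
        else:
            run = 0  # any non-combining character starts a new run
        out.append(ch)
    return "".join(out)


def strip_then_cap(text: str) -> str:
    return cap_marks(strip_zero_width(text))


def cap_then_strip(text: str) -> str:
    return strip_zero_width(cap_marks(text))


# Four acute accents on "e", split into two runs by a zero-width space.
sample = "e" + "\u0301" * 2 + "\u200b" + "\u0301" * 2

good = strip_then_cap(sample)
bad = cap_then_strip(sample)
print(len(good), len(bad))
```

With the cap applied first, the zero-width space hides the fact that four marks sit on one base character, so a second pass would change the result again. Stripping first gives a fixed point after a single pass.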


PRESETS

from disarm import PRESETS

Dict mapping preset function names to their ordered pipeline steps. Each value is a list of (step_name, parameter) tuples in execution order.

assert PRESETS["security_clean"] == [('normalize', 'NFKC'), ('confusables', 'latin'), ('strip_bidi', None), ('collapse_whitespace', None)]
assert PRESETS["normalize_user_input"] == [('normalize', 'NFKC'), ('strip_bidi', None), ('strip_zero_width', None), ('strip_control', None), ('strip_zalgo', None), ('confusables', 'latin'), ('collapse_whitespace', None)]

Use PRESETS to audit exactly which transforms a preset applies, or to build equivalent TextPipeline configurations.
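
As one way to audit a preset, the step list can be rendered as a readable arrow chain or queried for a particular step. The sketch below works on a local copy of the `security_clean` value shown in the assertion above, so it runs without disarm installed; `describe` and `uses_step` are hypothetical helper names, not part of the library.

```python
from typing import Optional

Step = tuple[str, Optional[str]]

# Copied from the documented PRESETS["security_clean"] value.
SECURITY_CLEAN_STEPS: list[Step] = [
    ("normalize", "NFKC"),
    ("confusables", "latin"),
    ("strip_bidi", None),
    ("collapse_whitespace", None),
]


def describe(steps: list[Step]) -> str:
    """Render (step_name, parameter) tuples as a single arrow-separated line."""
    parts = [name if param is None else f"{name}({param})" for name, param in steps]
    return " → ".join(parts)


def uses_step(steps: list[Step], step_name: str) -> bool:
    return any(name == step_name for name, _ in steps)


print(describe(SECURITY_CLEAN_STEPS))
print(uses_step(SECURITY_CLEAN_STEPS, "strip_zalgo"))
```

The same helpers should work on any `PRESETS[...]` value, given that each is documented as a list of `(step_name, parameter)` tuples.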


Policy Profiles

Named policy profiles provide pre-configured TextPipeline instances for common institutional and application workflows.

get_pipeline

from disarm import get_pipeline

pipe = get_pipeline("scholarly_cyrillic_iso9")
assert pipe("Москва") == 'moskva'

Returns a fresh TextPipeline configured for the named profile. Raises DisarmError for unknown profiles.

list_profiles

from disarm import list_profiles

print(list_profiles())
# ['library_catalog_key_eu', 'llm_guardrail', 'ml_corpus_normalize',
#  'normalize_web_input', 'rag_ingest', 'scholarly_cyrillic_iso9', 'search_index']

Returns a sorted list of available profile names.

Available profiles

| Profile | Steps | Output |
|---|---|---|
| scholarly_cyrillic_iso9 | NFKC → transliterate (ISO 9) → fold_case → collapse_whitespace | UTF-8 |
| library_catalog_key_eu | NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace | ASCII |
| normalize_web_input | NFKC → confusables → collapse_whitespace | UTF-8 |
| ml_corpus_normalize | NFKC → demojize → strip_accents → fold_case → collapse_whitespace | ASCII |
| search_index | NFKC → transliterate → strip_accents → fold_case → collapse_whitespace | ASCII |
| llm_guardrail | NFKC → strip_zalgo(0) → strip_bidi → demojize → strip_accents → confusables → fold_case → strip_control → strip_zero_width → collapse_whitespace | UTF-8 |
| rag_ingest | NFKC → strip_bidi → strip_accents → transliterate → strip_control → strip_zero_width → collapse_whitespace | ASCII |

llm_guardrail hardens text against prompt-injection and homoglyph/zalgo/bidi obfuscation before it reaches an LLM (digits are never remapped to letters). rag_ingest canonicalizes documents for retrieval pipelines while preserving case.

Homoglyph handling: rag_ingest romanizes, it does not visually-fold (#258)

The two guardrail profiles canonicalize homoglyphs differently, and the distinction matters for spoof resistance:

  • llm_guardrail runs confusables without transliterate, so a Cyrillic look-alike of "paypal" (раураl) is visually folded to paypal — it collides with the real Latin term (good for "treat the spoof as the word it imitates").
  • rag_ingest runs transliterate, which phonetically romanizes the same input to raural — a distinct key, so the spoof does not impersonate the real term, and legitimate non-Latin text still romanizes for retrieval (Москва → Moskva).

These are deliberate trade-offs of the fixed step order (transliterate runs before confusables; running confusables first would mangle legitimate Cyrillic/Greek into mixed-script gibberish). Adding confusables to rag_ingest would be a no-op — transliterate has already consumed the non-Latin characters. If you need homoglyph spoofs folded onto the term they imitate, use llm_guardrail (or a dedicated confusables pass), not rag_ingest.
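
The difference between visual folding and phonetic romanization can be shown with two toy lookup tables. These tables are hypothetical and cover only the three Cyrillic letters in the "paypal" example; they are not disarm's confusables or transliteration data.

```python
# Toy tables for the Cyrillic letters er (U+0440), a (U+0430), and u (U+0443).
VISUAL_FOLD = {"\u0440": "p", "\u0430": "a", "\u0443": "y"}  # maps by appearance
ROMANIZE = {"\u0440": "r", "\u0430": "a", "\u0443": "u"}     # maps by sound


def apply_table(text: str, table: dict[str, str]) -> str:
    """Replace each character found in the table, leave the rest alone."""
    return "".join(table.get(ch, ch) for ch in text)


# Five Cyrillic letters followed by a Latin "l": looks like "paypal".
spoof = "\u0440\u0430\u0443\u0440\u0430l"

folded = apply_table(spoof, VISUAL_FOLD)
romanized = apply_table(spoof, ROMANIZE)

# Folding after romanizing finds no Cyrillic left to act on.
after_both = apply_table(romanized, VISUAL_FOLD)

print(folded, romanized, after_both)
```

The last value matches the romanized one, which is the no-op described above: once transliteration has consumed the non-Latin characters, a later confusables pass has nothing to fold.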

See Policy Templates for detailed usage guidance and institutional recipes.