Precompiled Pipelines¶
Ready-to-use multi-step text processing pipelines. Each is a single compiled Rust function with no pipeline construction overhead at call time.
security_clean¶
security_clean(text: str) -> str
Security-focused text canonicalization.
Pipeline: NFKC → confusables → strip bidi/format → collapse_whitespace → (path-separator neutralization)
Collapses fullwidth bypasses, neutralizes homoglyph spoofing, strips dangerous bidi overrides and soft hyphens, then normalizes whitespace (collapsing runs, stripping control chars and zero-width injections).
.. warning::
This function canonicalizes Unicode for comparison; it is not an output
sanitizer and provides no XSS/HTML/SQL/injection protection. The NFKC
step maps fullwidth lookalikes to live ASCII metacharacters by design
(＜ → <), so context-encoding the output matters more after this step,
not less. Encode at the sink; never emit this result into markup or a
query unescaped.
Examples:
>>> security_clean("Ηello Ꮤorld") # Greek Η + Cherokee Ꮤ → Latin
'Hello World'
Pipeline steps¶
NFKC → confusables → strip bidi/format → collapse_whitespace → (path-separator neutralization)
from disarm import security_clean
assert security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥") == 'Real text'
assert security_clean("Ηello Ꮤorld") == 'Hello World'
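The warning above can be illustrated without disarm itself. The sketch below uses the standard library's `unicodedata.normalize` as a stand-in for the NFKC step only (an assumption for illustration; `security_clean` applies further steps) and `html.escape` as one possible sink encoder.

```python
import html
import unicodedata

# Fullwidth angle brackets (U+FF1C, U+FF1E) around an otherwise ASCII payload.
raw = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"

# Stand-in for the NFKC step: fullwidth brackets become live ASCII < and >.
canonical = unicodedata.normalize("NFKC", raw)

# Encode at the sink: escape only when the value is written into HTML.
rendered = html.escape(canonical)

print(canonical)
print(rendered)
```

The canonical form is the right value to compare or store; the escaped form is only appropriate at the point where HTML is produced.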
ml_normalize¶
ml_normalize(text: str, *, lang: str | None = None, emoji: str = 'cldr') -> str
ML/NLP text normalization pipeline.
NFKC → emoji→text → [transliterate] → strip_accents →
fold_case → collapse_whitespace
Produces clean, accent-free, lowercased text suitable for tokenizers, embeddings, and feature extraction. Emoji are expanded to their CLDR short-name descriptions.
Examples:
>>> ml_normalize("Café RÉSUMÉ")
'cafe resume'
>>> ml_normalize("München", lang="de")
'muenchen'
Pipeline steps¶
NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace
from disarm import ml_normalize
assert ml_normalize("Café RÉSUMÉ") == 'cafe resume'
assert ml_normalize("München", lang="de") == 'muenchen'
assert ml_normalize("I ❤️ Python 🐍") == 'i red heart python snake'
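For intuition, the non-optional text steps can be approximated with the standard library. This is a rough sketch, not disarm's implementation: `ml_normalize_sketch` is a hypothetical name, and it skips the emoji and transliterate steps entirely.

```python
import unicodedata


def ml_normalize_sketch(text: str) -> str:
    """Approximate NFKC -> strip_accents -> fold_case -> collapse_whitespace."""
    # NFKC: fold compatibility forms such as fullwidth letters.
    text = unicodedata.normalize("NFKC", text)
    # strip_accents: decompose, then drop nonspacing combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # fold_case: casefold is the Unicode-aware lowercase for comparison.
    text = text.casefold()
    # collapse_whitespace: split on any whitespace run and rejoin.
    return " ".join(text.split())


print(ml_normalize_sketch("Café RÉSUMÉ"))
print(ml_normalize_sketch("München"))
```

Because the sketch has no transliteration step, "München" comes out as "munchen" rather than the 'muenchen' that `lang="de"` produces above.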
catalog_key¶
catalog_key(text: str, *, lang: str | None = None, strict_iso9: bool = False) -> str
Library catalog key generation pipeline.
NFKC → transliterate → confusables → strip_accents →
fold_case → collapse_whitespace
Produces a canonical deduplication key for bibliographic titles.
Examples:
>>> catalog_key(" Café RÉSUMÉ ")
'cafe resume'
>>> catalog_key("ΩMEGA café")
'omega cafe'
Pipeline steps¶
NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace
from disarm import catalog_key
assert catalog_key(" Café RÉSUMÉ ") == 'cafe resume'
assert catalog_key("Москва", lang="ru") == 'moskva'
assert catalog_key("Москва", lang="auto") == 'moskva'
assert catalog_key("Müller", lang="de") == 'mueller'
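A deduplication key is typically used to group records that should be treated as the same title. The sketch below shows that grouping pattern with the key function passed in as a parameter. `group_duplicates` and `stand_in_key` are hypothetical names; the stand-in only folds case and whitespace so the example runs without disarm, and in real use `catalog_key` would be passed instead.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_duplicates(
    titles: Iterable[str], key_fn: Callable[[str], str]
) -> dict[str, list[str]]:
    """Group titles by key and keep only keys shared by two or more titles."""
    groups: dict[str, list[str]] = defaultdict(list)
    for title in titles:
        groups[key_fn(title)].append(title)
    return {key: items for key, items in groups.items() if len(items) > 1}


def stand_in_key(title: str) -> str:
    # Hypothetical stand-in: case and whitespace folding only.
    return " ".join(title.casefold().split())


titles = ["War and Peace", "  WAR AND   PEACE ", "Anna Karenina"]
duplicates = group_duplicates(titles, stand_in_key)
print(duplicates)
```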
display_clean¶
display_clean(text: str) -> str
Display-safe text cleaning pipeline.
Pipeline: strip bidi/format → collapse_whitespace (strip control + strip zero-width)
Lightweight cleanup for user-submitted content destined for rendering. Strips bidirectional overrides (which can visually reorder text to hide malicious content), soft hyphens, control characters, and zero-width injections, then collapses runs of whitespace to single spaces.
.. warning::
"Display-safe" means visual hygiene (no bidi reordering, no invisible
injections) — not markup-safe. This does no HTML escaping and does
not strip <, >, &. When rendering into HTML, still escape at
the template/output layer; disarm is not an XSS defense.
Examples:
>>> display_clean("hello\x00world\u200b!")
'helloworld!'
>>> display_clean(" spaced out ")
'spaced out'
Pipeline steps¶
strip_bidi → strip_control → strip_zero_width → collapse_whitespace
from disarm import display_clean
assert display_clean("hello\x00world\u200b!") == 'helloworld!'
assert display_clean(" spaced out ") == 'spaced out'
assert display_clean("admin\u202Euser") == 'adminuser'
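The steps above can be approximated with Unicode general categories. This is a minimal sketch under the assumption that "control" maps to category Cc and "bidi/format/zero-width" maps to category Cf; `display_clean_sketch` is a hypothetical name, and disarm's actual character sets may differ.

```python
import unicodedata


def display_clean_sketch(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, then collapse whitespace."""
    kept = []
    for ch in text:
        if ch.isspace():
            # Keep whitespace (including tabs and newlines) as a plain space.
            kept.append(" ")
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Cc: control characters; Cf: bidi overrides, zero-width, soft hyphen.
            continue
        else:
            kept.append(ch)
    return " ".join("".join(kept).split())


print(display_clean_sketch("hello\x00world\u200b!"))
print(display_clean_sketch("admin\u202euser"))
```

Dropping all of Cf is deliberately blunt: it also removes the zero-width joiner used inside some emoji sequences, which a production filter may want to keep.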
search_key¶
search_key(text: str, *, lang: str | None = None) -> str
Search index key generation pipeline.
NFKC → transliterate → strip_accents → fold_case →
collapse_whitespace
Produces a case-insensitive, accent-insensitive, script-insensitive
lookup key. Like catalog_key but without confusable
normalization, so it is lighter and faster for search indexes.
Examples:
>>> search_key(" Café RÉSUMÉ ")
'cafe resume'
>>> search_key("Москва")
'moskva'
>>> search_key("Über allen Gipfeln")
'uber allen gipfeln'
Pipeline steps¶
NFKC → transliterate → strip_accents → fold_case → collapse_whitespace
from disarm import search_key
assert search_key("Café RÉSUMÉ") == 'cafe resume'
assert search_key("Москва", lang="ru") == 'moskva'
assert search_key("ΩMEGA", lang="auto") == 'omega'
sort_key¶
sort_key(text: str, *, lang: str | None = None) -> str
Sort key generation pipeline.
Pipeline: NFKC → strip_bidi → transliterate → fold_case → collapse_whitespace
Produces a case-insensitive ASCII key for alphabetical ordering.
Transliteration folds accented characters to their ASCII base (é → e,
ü → u), so the result is accent-folded, not accent-preserving.
.. note::
In practice this currently produces the same output as search_key.
search_key adds an explicit accent-strip pass, but transliteration has
already removed accents by that point, so the two keys coincide for
typical input. Use whichever name documents intent at the call site.
(Distinct accent-preserving ordering is tracked for a future release.)
Examples:
>>> sort_key("Война и мир")
'voyna i mir'
>>> sort_key("Über allen Gipfeln")
'uber allen gipfeln'
>>> sort_key(" Café ")
'cafe'
Pipeline steps¶
NFKC → transliterate → fold_case → collapse_whitespace
from disarm import sort_key
assert sort_key("Über", lang="de") == 'ueber'
assert sort_key("Война и мир", lang="ru") == 'voyna i mir'
assert sort_key("Café") == 'cafe'
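A key like this is normally handed to `sorted` through its `key` argument. The sketch below contrasts plain code-point ordering with ordering by an accent-folded key. `accent_folded_key` is a hypothetical standard-library stand-in (no transliteration), used so the example runs without disarm; `sort_key` would take its place in real code.

```python
import unicodedata


def accent_folded_key(text: str) -> str:
    # Hypothetical stand-in for sort_key: NFKC, drop accents, fold case, trim.
    decomposed = unicodedata.normalize("NFD", unicodedata.normalize("NFKC", text))
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return " ".join(base.casefold().split())


names = ["zebra", "Éclair", "apple", "Über"]

# Default ordering compares code points, so accented capitals sort after "z".
by_code_point = sorted(names)

# Ordering by the folded key places them where a reader expects.
by_key = sorted(names, key=accent_folded_key)

print(by_code_point)
print(by_key)
```

The original strings are preserved in the result; only the comparison uses the folded form.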
normalize_user_input¶
normalize_user_input(text: str) -> str
Unicode hygiene for user-submitted input — not an injection defense.
.. warning::
This normalizes Unicode; it does not make text safe to emit into
HTML, JS, URLs, SQL, or shells. It performs no escaping and does not
strip <, >, or &, so <script>alert(1)</script> passes through
unchanged, and the NFKC step can surface ASCII metacharacters from
fullwidth lookalikes (＜script＞ → <script>). This is not XSS
or injection protection: encode at the output sink (framework
auto-escaping, DOMPurify, parameterized queries). Run this before that
encoder, never instead of it. The name predates this clarification.
Preserves the original script (no transliteration) while neutralizing Unicode-level attack vectors: zalgo stacking, homoglyph spoofing, bidi overrides, zero-width injections, and control characters.
Pipeline: NFKC → strip_bidi → strip_zero_width → strip_control → strip_zalgo → confusables → collapse_whitespace → (path-separator neutralization)
Invisibles are stripped before zalgo-capping so they cannot split combining-mark runs, which keeps the output idempotent.
Examples:
>>> normalize_user_input("Hello, world!")
'Hello, world!'
>>> normalize_user_input("p\u0430ypal") # Cyrillic а → Latin a
'paypal'
>>> normalize_user_input("admin\u202euser") # RLO stripped
'adminuser'
Pipeline steps¶
NFKC → strip_bidi → strip_zero_width → strip_control → strip_zalgo → confusables → collapse_whitespace → (path-separator neutralization)
from disarm import normalize_user_input
assert normalize_user_input("Hello, world!") == 'Hello, world!'
assert normalize_user_input("p\u0430ypal") == 'paypal'
assert normalize_user_input("admin\u202Euser") == 'adminuser'
Unlike security_clean, this pipeline also strips zalgo text (excessive combining mark stacking). Unlike catalog_key/search_key, it does not transliterate — the original script is preserved.
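The ordering of invisible-stripping and zalgo-capping matters for idempotency, and a small sketch makes the reason visible. The helpers below are hypothetical stand-ins written for illustration (a five-character zero-width set and a cap of two combining marks are assumptions, not disarm's settings).

```python
import unicodedata

# Assumed zero-width set for illustration only.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def cap_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most max_marks consecutive combining marks after a base char."""
    out = []
    run = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue
        else:
            run = 0  # any non-mark, including an invisible, resets the run
        out.append(ch)
    return "".join(out)


def strip_then_cap(text: str) -> str:
    return cap_marks(strip_zero_width(text))


def cap_then_strip(text: str) -> str:
    return strip_zero_width(cap_marks(text))


# Two acute accents, a zero-width space, then two more accents.
noisy = "e" + "\u0301" * 2 + "\u200b" + "\u0301" * 2

print(len(strip_then_cap(noisy)), len(cap_then_strip(noisy)))
```

With capping first, the zero-width space splits the marks into two runs that each pass the cap, and removing it afterwards leaves four stacked marks that a second pass would trim again. Stripping first yields one run, so a second pass changes nothing.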
PRESETS¶
from disarm import PRESETS
Dict mapping preset function names to their ordered pipeline steps. Each value is a list of (step_name, parameter) tuples in execution order.
assert PRESETS["security_clean"] == [('normalize', 'NFKC'), ('confusables', 'latin'), ('strip_bidi', None), ('collapse_whitespace', None)]
assert PRESETS["normalize_user_input"] == [('normalize', 'NFKC'), ('strip_bidi', None), ('strip_zero_width', None), ('strip_control', None), ('strip_zalgo', None), ('confusables', 'latin'), ('collapse_whitespace', None)]
Use PRESETS to audit exactly which transforms a preset applies, or to build equivalent TextPipeline configurations.
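As a sketch of such an audit, the snippet below renders a preset's steps as a readable chain and lists the steps one preset has that another lacks. To stay runnable without disarm it works on `PRESETS_EXCERPT`, a literal copy of the two entries shown above; `describe` and `extra_steps` are hypothetical helper names.

```python
# Literal copy of the two PRESETS entries shown above (not imported from disarm).
PRESETS_EXCERPT = {
    "security_clean": [
        ("normalize", "NFKC"),
        ("confusables", "latin"),
        ("strip_bidi", None),
        ("collapse_whitespace", None),
    ],
    "normalize_user_input": [
        ("normalize", "NFKC"),
        ("strip_bidi", None),
        ("strip_zero_width", None),
        ("strip_control", None),
        ("strip_zalgo", None),
        ("confusables", "latin"),
        ("collapse_whitespace", None),
    ],
}


def describe(steps):
    """Render (step_name, parameter) tuples as 'name(param) → name → ...'."""
    return " → ".join(
        name if param is None else f"{name}({param})" for name, param in steps
    )


def extra_steps(base, other):
    """Steps present in other but not in base, in other's order."""
    return [step for step in other if step not in base]


summary = describe(PRESETS_EXCERPT["security_clean"])
extras = extra_steps(
    PRESETS_EXCERPT["security_clean"], PRESETS_EXCERPT["normalize_user_input"]
)
print(summary)
print(extras)
```

With the real dict, replacing `PRESETS_EXCERPT` by the imported `PRESETS` should be the only change needed, assuming the value shape documented above.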
Policy Profiles¶
Named policy profiles provide pre-configured TextPipeline instances for common institutional and application workflows.
get_pipeline¶
from disarm import get_pipeline
pipe = get_pipeline("scholarly_cyrillic_iso9")
assert pipe("Москва") == 'moskva'
Returns a fresh TextPipeline configured for the named profile. Raises DisarmError for unknown profiles.
list_profiles¶
from disarm import list_profiles
print(list_profiles())
# ['library_catalog_key_eu', 'llm_guardrail', 'ml_corpus_normalize',
# 'normalize_web_input', 'rag_ingest', 'scholarly_cyrillic_iso9', 'search_index']
Returns a sorted list of available profile names.
Available profiles¶
| Profile | Steps | Output |
|---|---|---|
| scholarly_cyrillic_iso9 | NFKC → transliterate (ISO 9) → fold_case → collapse_whitespace | UTF-8 |
| library_catalog_key_eu | NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace | ASCII |
| normalize_web_input | NFKC → confusables → collapse_whitespace | UTF-8 |
| ml_corpus_normalize | NFKC → demojize → strip_accents → fold_case → collapse_whitespace | ASCII |
| search_index | NFKC → transliterate → strip_accents → fold_case → collapse_whitespace | ASCII |
| llm_guardrail | NFKC → strip_zalgo(0) → strip_bidi → demojize → strip_accents → confusables → fold_case → strip_control → strip_zero_width → collapse_whitespace | UTF-8 |
| rag_ingest | NFKC → strip_bidi → strip_accents → transliterate → strip_control → strip_zero_width → collapse_whitespace | ASCII |
llm_guardrail hardens text against prompt-injection and homoglyph/zalgo/bidi obfuscation before it reaches an LLM (digits are never remapped to letters). rag_ingest canonicalizes documents for retrieval pipelines while preserving case.
Homoglyph handling: rag_ingest romanizes, it does not visually-fold (#258)
The two guardrail profiles canonicalize homoglyphs differently, and the distinction matters for spoof resistance:
- llm_guardrail runs confusables without transliterate, so a Cyrillic look-alike of "paypal" (раураl) is visually folded to paypal. It collides with the real Latin term (good for "treat the spoof as the word it imitates").
- rag_ingest runs transliterate, which phonetically romanizes the same input to raural, a distinct key, so the spoof does not impersonate the real term, and legitimate non-Latin text still romanizes for retrieval (Москва → Moskva).
These are deliberate trade-offs of the fixed step order (transliterate runs
before confusables; running confusables first would mangle legitimate
Cyrillic/Greek into mixed-script gibberish). Adding confusables to
rag_ingest would be a no-op — transliterate has already consumed the
non-Latin characters. If you need homoglyph spoofs folded onto the term
they imitate, use llm_guardrail (or a dedicated confusables pass), not
rag_ingest.
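The difference between visual folding and phonetic romanization can be reproduced with two tiny lookup tables. The tables below cover only the three Cyrillic letters in the example and are illustrative assumptions, not disarm's confusables or transliteration data.

```python
# Illustrative three-letter tables only (Cyrillic er, a, u).
VISUAL = {"\u0440": "p", "\u0430": "a", "\u0443": "y"}    # looks like
PHONETIC = {"\u0440": "r", "\u0430": "a", "\u0443": "u"}  # sounds like


def map_chars(text: str, table: dict) -> str:
    """Replace each character found in table; leave everything else as-is."""
    return "".join(table.get(ch, ch) for ch in text)


# Cyrillic look-alike of "paypal" with a Latin final "l".
spoof = "\u0440\u0430\u0443\u0440\u0430l"

folded = map_chars(spoof, VISUAL)        # confusables-style result
romanized = map_chars(spoof, PHONETIC)   # transliterate-style result

# Folding after romanizing finds no Cyrillic left to fold.
romanized_then_folded = map_chars(romanized, VISUAL)

print(folded, romanized, romanized_then_folded)
```

The last line mirrors the point about step order: once the phonetic pass has consumed the non-Latin characters, a later visual pass has nothing to act on.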
See Policy Templates for detailed usage guidance and institutional recipes.