Predicates¶

Functions that inspect text and return boolean or structured results without modifying the input.

detect_scripts¶

detect_scripts(text: str) -> list[Script]

Return the distinct Unicode scripts present in text, in order of first appearance.

Parameters:	`text` (`str`) – Input string.

Returns:	`list[Script]` – List of `Script` enum values, ordered by first appearance.

Examples:

>>> detect_scripts("Hello")
[Script.LATIN]
>>> detect_scripts("Hello Мир")
[Script.LATIN, Script.CYRILLIC]
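
The first-appearance ordering can be illustrated with the standard library alone. This is a rough sketch, not disarm's implementation: Python's `unicodedata` module exposes no Script property, so the hypothetical helper `scripts_by_first_appearance` approximates the script by taking the first word of each letter's Unicode character name, and it returns plain strings rather than `Script` enum values.

```python
import unicodedata


def scripts_by_first_appearance(text: str) -> list[str]:
    """Approximate script names for the letters in text, ordered by first appearance."""
    seen: list[str] = []
    for ch in text:
        if not ch.isalpha():
            # Digits, punctuation, spaces and combining marks carry no script here.
            continue
        name = unicodedata.name(ch, "")
        if not name:
            continue
        # Heuristic (assumption): the first word of the character name,
        # e.g. "LATIN" or "CYRILLIC", stands in for the script.
        script = name.split()[0]
        if script not in seen:
            seen.append(script)
    return seen


print(scripts_by_first_appearance("Hello \u041c\u0438\u0440"))  # ['LATIN', 'CYRILLIC']
```

Under the same approximation, a mixed-script test amounts to checking whether the returned list has more than one entry.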

inspect_auto_lang¶

inspect_auto_lang(text: str) -> dict[str, str | list[str] | None]

Inspect how lang="auto" would resolve for the given text.

Use this to audit or log the detection decision made by the three-stage auto-detection pipeline.

Parameters:	`text` (`str`) – Input string.

Returns:	`dict[str, str | list[str] | None]` – Dict with keys:

- script: primary non-Latin script name, or None
- chosen_lang: resolved language code, or None
- reason: one of "unambiguous_script", "discriminator", "script_default", "latin_discriminator", "no_detection"
- discriminators_hit: list of discriminator characters found

Examples:

>>> inspect_auto_lang("Київ")["chosen_lang"]
'uk'
>>> inspect_auto_lang("Москва")["reason"]
'script_default'

from disarm import inspect_auto_lang

inspect_auto_lang("Київ")
# {'script': 'Cyrillic', 'chosen_lang': 'uk', 'reason': 'discriminator', 'discriminators_hit': ['ї']}

inspect_auto_lang("Москва")
# {'script': 'Cyrillic', 'chosen_lang': 'ru', 'reason': 'script_default', 'discriminators_hit': []}

inspect_auto_lang("hello")
# {'script': None, 'chosen_lang': None, 'reason': 'no_detection', 'discriminators_hit': []}

See Language Detection for details.
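
To make the `reason` values concrete, here is a toy sketch that reproduces only the three example results shown above. It is not the library's pipeline: the function `sketch_auto_lang` and both lookup tables are hypothetical, the discriminator table holds just the one character seen in the example, and the other stages ("unambiguous_script", "latin_discriminator") are omitted.

```python
import unicodedata

# Hypothetical tables for illustration only; not disarm's real data.
DISCRIMINATORS = {"\u0457": "uk"}        # CYRILLIC SMALL LETTER YI, from the example above
SCRIPT_DEFAULTS = {"Cyrillic": "ru"}     # fallback when no discriminator is hit


def sketch_auto_lang(text: str) -> dict:
    """Return a dict shaped like the documented result, for Cyrillic input only."""
    is_cyrillic = any(
        unicodedata.name(ch, "").startswith("CYRILLIC") for ch in text
    )
    if not is_cyrillic:
        return {"script": None, "chosen_lang": None,
                "reason": "no_detection", "discriminators_hit": []}

    # Collect discriminator characters in order, without repeats.
    hits = list(dict.fromkeys(ch for ch in text if ch in DISCRIMINATORS))
    if hits:
        return {"script": "Cyrillic", "chosen_lang": DISCRIMINATORS[hits[0]],
                "reason": "discriminator", "discriminators_hit": hits}

    return {"script": "Cyrillic", "chosen_lang": SCRIPT_DEFAULTS["Cyrillic"],
            "reason": "script_default", "discriminators_hit": []}


print(sketch_auto_lang("\u041a\u0438\u0457\u0432")["reason"])  # discriminator
```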

is_mixed_script¶

is_mixed_script(text: str) -> bool

True if text contains characters from more than one Unicode script.

Parameters:	`text` (`str`) – Input string.

Returns:	`bool` – True if multiple scripts detected (excluding Common/Inherited).

Examples:

>>> is_mixed_script("Hello")
False
>>> is_mixed_script("Hello Мир")  # Latin + Cyrillic
True

has_bidi_conflict¶

has_bidi_conflict(text: str) -> bool

True if text mixes strong left-to-right and strong right-to-left characters.

This is the precondition for Unicode Bidi display reordering (UAX #9) and the structural signal behind "BiDi Swap"-style spoofs, where an LTR brand label sits beside an RTL domain (e.g. "varonis.com.ו.קום"). Unlike a bidi-override (U+202A–U+202E) check, it fires on the real letters. Latin, Cyrillic, Greek and CJK are left-to-right; Hebrew, Arabic, Syriac, Thaana and N'Ko are right-to-left; digits, punctuation and combining marks are neutral and never create a conflict on their own.

A False result is not a safety guarantee.

Parameters:	`text` (`str`) – Input string.

Returns:	`bool` – True if both a strong-LTR and a strong-RTL character are present.

Examples:

>>> has_bidi_conflict("hello")
False
>>> has_bidi_conflict("helloא")  # Latin + Hebrew
True
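
The strong-direction test described above can be approximated with `unicodedata.bidirectional`, which returns each character's Bidi class. The sketch below is an assumption about how such a check might look, not the library's code; the helper name `mixes_strong_directions` is hypothetical. It treats class `L` as strong left-to-right and classes `R` and `AL` as strong right-to-left, so digits, spaces and marks never trigger it.

```python
import unicodedata

STRONG_LTR = {"L"}          # e.g. Latin, Cyrillic, Greek, CJK letters
STRONG_RTL = {"R", "AL"}    # e.g. Hebrew (R), Arabic (AL)


def mixes_strong_directions(text: str) -> bool:
    """True if text holds at least one strong-LTR and one strong-RTL character."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return bool(classes & STRONG_LTR) and bool(classes & STRONG_RTL)


print(mixes_strong_directions("hello"))          # False
print(mixes_strong_directions("hello\u05d0"))    # True (Latin + Hebrew alef)
print(mixes_strong_directions("123 \u05d0"))     # False (digits are not strong)
```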

is_confusable¶

is_confusable(text: str, *, target_script: str = 'latin', greedy: bool | None = None, preferred_aliases: list[str] | None = None) -> bool

True if text contains characters confusable with target-script characters.

Parameters:

`text` (`str`) – Input string.
`target_script` (`str`, default: `'latin'`) – Script to check confusability against. Currently only "latin" is supported; any other value raises DisarmError.
`greedy` (`bool | None`, default: `None`) – confusable_homoglyphs compatibility. Ignored, with a DeprecationWarning when explicitly passed; disarm always checks all characters.
`preferred_aliases` (`list[str] | None`, default: `None`) – confusable_homoglyphs compatibility. Ignored, with a DeprecationWarning when explicitly passed; disarm uses its own script detection engine.

Returns:	`bool` – True if any confusable homoglyphs are present.

Raises:	`DisarmError` – If target_script is not `"latin"`.

Examples:

>>> is_confusable("pаypal")  # Cyrillic а looks like Latin a
True
>>> is_confusable("paypal")  # all genuine Latin
False
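
As an illustration of what a confusable check looks for, the sketch below uses a five-entry lookup of Cyrillic letters that resemble Latin ones. The table is a tiny hand-picked excerpt and the helper names are hypothetical; the library's bundled table is far larger and is not reproduced here.

```python
# Tiny illustrative excerpt (assumption); a real check would use the full
# Unicode TR39 confusables data rather than five hand-picked letters.
LATIN_LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def has_latin_lookalike(text: str) -> bool:
    """True if any character is a known non-Latin lookalike of a Latin letter."""
    return any(ch in LATIN_LOOKALIKES for ch in text)


def to_latin(text: str) -> str:
    """Replace each known lookalike with the Latin letter it imitates."""
    return "".join(LATIN_LOOKALIKES.get(ch, ch) for ch in text)


print(has_latin_lookalike("p\u0430ypal"))  # True
print(to_latin("p\u0430ypal"))             # paypal
```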

is_ascii¶

is_ascii(text: str) -> bool

True if all characters are in U+0000–U+007F.

Parameters:	`text` (`str`) – Input string.

Returns:	`bool` – True if the string is pure ASCII.

Examples:

>>> is_ascii("hello 123")
True
>>> is_ascii("café")
False
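
The range check is simple enough to spell out by hand. The following sketch (the name `all_ascii` is hypothetical) tests every code point against U+007F, which should agree with Python's built-in `str.isascii()`.

```python
def all_ascii(text: str) -> bool:
    """True if every code point lies in U+0000 to U+007F (empty string included)."""
    return all(ord(ch) <= 0x7F for ch in text)


print(all_ascii("hello 123"))   # True
print(all_ascii("caf\u00e9"))   # False
```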

is_normalized¶

is_normalized(text: str, *, form: NormalizationForm = 'NFC') -> bool

True if text is already in the specified normalization form.

Parameters:

`text` (`str`) – Input string.
`form` (`NormalizationForm`, default: `'NFC'`) – Normalization form: "NFC", "NFD", "NFKC", or "NFKD".

Returns:	`bool` – True if the string is already normalized.

Examples:

>>> is_normalized("café")  # NFC by default
True
>>> is_normalized("e\u0301", form="NFC")  # NFD decomposed
False
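
For comparison, the standard library offers the same kind of test through `unicodedata.is_normalized` (Python 3.8+), which takes the form first and the string second. The wrapper below, with the hypothetical name `already_normalized`, mirrors the argument order used on this page; it is a sketch and makes no claim about how the library implements its check.

```python
import unicodedata


def already_normalized(text: str, form: str = "NFC") -> bool:
    """True if normalizing text to the given form would leave it unchanged."""
    return unicodedata.is_normalized(form, text)


print(already_normalized("caf\u00e9"))              # True: precomposed e-acute is NFC
print(already_normalized("e\u0301", form="NFC"))    # False: decomposed sequence
print(already_normalized("e\u0301", form="NFD"))    # True
```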

is_zalgo¶

is_zalgo(text: str, *, threshold: int = 3) -> bool

Detect whether text contains zalgo-style combining mark abuse.

Returns True if any base character has more than threshold consecutive combining marks in NFD decomposition.

Parameters:

`text` (`str`) – Input string to check.
`threshold` (`int`, default: `3`) – Maximum allowed combining marks per base character. Vietnamese `ệ` has 2 marks in NFD, so the default of 3 is meant to be safe for legitimate text.

Returns:	`bool` – `True` if zalgo-style stacking is detected.

Examples:

>>> is_zalgo("café")
False
>>> is_zalgo("Việt Nam")
False
>>> is_zalgo("ḧ̸̡̢̧̛̗̱̜̼̯̞̙́̑̾̊̿̏̒̓̕ě̵̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕l̸̡̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕l̸̡̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕ơ̵̢̧̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕")
True

from disarm import is_zalgo

is_zalgo("café")          # False (1 combining mark — normal)
is_zalgo("Việt Nam")      # False (2 combining marks — normal)
# Zalgo: 'a' with 20 stacked combining graves
is_zalgo("a" + "\u0300" * 20)  # True
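
The rule stated above (more than `threshold` consecutive combining marks after NFD decomposition) can be sketched with `unicodedata`. This is one possible reading of that rule, not the library's source; the name `has_mark_stacking` is hypothetical, and counting every character whose general category starts with `M` as a combining mark is an assumption.

```python
import unicodedata


def has_mark_stacking(text: str, threshold: int = 3) -> bool:
    """True if any run of consecutive combining marks is longer than threshold."""
    run = 0
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):  # Mn, Mc or Me
            run += 1
            if run > threshold:
                return True
        else:
            run = 0  # a non-mark character starts a new run
    return False


print(has_mark_stacking("caf\u00e9"))            # False (1 mark after NFD)
print(has_mark_stacking("Vi\u1ec7t Nam"))        # False (2 marks after NFD)
print(has_mark_stacking("a" + "\u0300" * 20))    # True
```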

is_suspicious_hostname¶

is_suspicious_hostname(hostname: str) -> tuple[bool, HostnameAnalysis]

Flag a hostname as suspicious for Unicode homoglyph spoofing.

Returns (suspicious, analysis) where analysis is a HostnameAnalysis with attributes:

suspicious: bool — True if a problem was detected (mixed-script, a bundled-table confusable, or a bidi-direction conflict).
scripts: list[str] — Unicode scripts found across all labels.
mixed_script: bool — True if any single label contains more than one script.
has_confusables: bool — True if confusable homoglyphs found.
bidi_conflict: bool — True if the decoded hostname mixes strong left-to-right and strong right-to-left characters (the "BiDi Swap" reorder precondition). Folded into suspicious.
cross_label_script: bool — True if the labels span more than one distinct script. Broader and noisier than bidi_conflict (it fires on benign IDN ccTLDs like google.рф), so it is not folded into suspicious; exposed for caller policy.
label_scripts: list[list[str]] — per-label resolved scripts, left to right.
canonical: str — Latin-normalized form of the hostname.

A hostname is flagged suspicious if any single label is mixed-script (draws on more than one Unicode script, excluding Common/Inherited), contains confusable homoglyphs, or has a bidi-direction conflict (bidi_conflict). The mixed-script rule is conservative and fails closed: it flags benign combinations such as Latin+CJK as well as spoofing ones, so a caller wanting a more permissive policy can inspect the mixed_script and scripts fields and decide for itself.

A False (not-suspicious) result is not a safety guarantee. It means only that no mixed-script label, no confusable from the bundled TR39 table, and no bidi-direction conflict was found. Whole-script spoofs that use no bundled-table confusable, and confusables outside the bundled table, are out of scope (see the Threat Model) and report not-suspicious. Base allow/deny decisions on the granular findings plus your own policy: a detector can attest the presence of a problem, never the absence of all problems.

Parameters:	`hostname` (`str`) – Hostname string to check (e.g. "example.com").

Returns:	`tuple[bool, HostnameAnalysis]` – Tuple of (suspicious, analysis) where analysis is a HostnameAnalysis.

Examples:

>>> suspicious, analysis = is_suspicious_hostname("google.com")
>>> suspicious
False
>>> analysis.canonical
'google.com'

HostnameAnalysis¶

The second element of the tuple returned by is_suspicious_hostname():

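Attribute	Type	Description
`suspicious`	`bool`	`True` if a problem was detected (mixed-script, a bundled-table confusable, or a bidi-direction conflict)
`scripts`	`list[str]`	Unicode scripts found across all labels
`mixed_script`	`bool`	`True` if any single label contains more than one script
`has_confusables`	`bool`	`True` if confusable homoglyphs found
`bidi_conflict`	`bool`	`True` if the decoded hostname mixes strong left-to-right and strong right-to-left characters; folded into `suspicious`
`cross_label_script`	`bool`	`True` if the labels span more than one distinct script; not folded into `suspicious`
`label_scripts`	`list[list[str]]`	Per-label resolved scripts, left to right
`canonical`	`str`	Latin-normalized form of the hostname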

from disarm import is_suspicious_hostname

suspicious, analysis = is_suspicious_hostname("google.com")
# suspicious = False, analysis.canonical = "google.com"

suspicious, analysis = is_suspicious_hostname("gооgle.com")  # Cyrillic о's
# suspicious = True, analysis.mixed_script = True, analysis.has_confusables = True

A hostname is flagged suspicious if any single label is mixed-script (draws on more than one Unicode script, excluding Common/Inherited) or contains confusable homoglyphs, or if the hostname has a bidi-direction conflict. A not-suspicious result is not a safety guarantee: whole-script spoofs with no bundled-table confusable, and confusables outside the bundled table, are out of scope (see Threat Model). Branch on the granular fields plus your own policy.
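
Branching on the granular fields might look like the sketch below. Everything in it is illustrative: `Findings` is a hypothetical stand-in that copies four of the documented attribute names so the example runs without the library, and the policy in `decide` is one arbitrary choice, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Hypothetical stand-in mirroring some documented HostnameAnalysis attributes."""
    mixed_script: bool = False
    has_confusables: bool = False
    bidi_conflict: bool = False
    cross_label_script: bool = False


def decide(f: Findings, *, allow_cross_label: bool = True) -> str:
    """Map findings to 'deny', 'review' or 'allow' under an example policy."""
    if f.has_confusables or f.bidi_conflict:
        return "deny"      # strongest spoofing signals
    if f.mixed_script:
        return "review"    # may be benign, e.g. Latin+CJK in one label
    if f.cross_label_script and not allow_cross_label:
        return "review"    # noisy signal, opt-in only
    return "allow"         # no finding; still not a safety guarantee


print(decide(Findings(has_confusables=True, mixed_script=True)))            # deny
print(decide(Findings(cross_label_script=True)))                            # allow
print(decide(Findings(cross_label_script=True), allow_cross_label=False))   # review
```

In real use the same function could read the attributes directly from the analysis object returned by `is_suspicious_hostname`, assuming the attribute names match those listed above.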