Normalization

Unicode normalization ensures that equivalent sequences of characters are represented identically. disarm provides fast normalization using the Rust unicode-normalization crate.

Why normalize?

The same visible text can have multiple Unicode representations:

# These render identically but are different code point sequences:
a = "\u00e9"       # U+00E9 (precomposed)
b = "\u0065\u0301" # U+0065 U+0301 (decomposed: e + combining acute)

assert a != b

Normalization resolves this by converting to a canonical form.
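
To make the effect concrete, here is a minimal sketch that uses Python's standard-library unicodedata module as a stand-in for disarm (an assumption made so the example runs without disarm installed). It shows the two spellings comparing unequal until both are brought to the same form.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# The raw strings differ because their code point sequences differ.
raw_equal = precomposed == decomposed

# Once both sides are normalized to the same form, they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
nfd_equal = (
    unicodedata.normalize("NFD", precomposed)
    == unicodedata.normalize("NFD", decomposed)
)

print(raw_equal, nfc_equal, nfd_equal)
```

Which form you pick matters less than applying the same one to both sides of the comparison.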

Normalization forms

Form  Name                                       Description
NFC   Canonical Decomposition + Composition      Precomposed characters. Most common for storage and comparison.
NFD   Canonical Decomposition                    Decomposed characters. Useful for accent stripping.
NFKC  Compatibility Decomposition + Composition  Like NFC, but also normalizes compatibility characters (ﬁ (U+FB01) → fi, ² → 2).
NFKD  Compatibility Decomposition                Like NFD, but with compatibility decomposition.
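
The differences between the forms are easiest to see at the code point level. The sketch below again relies on the standard-library unicodedata module rather than disarm, and the codepoints() helper is a hypothetical name introduced only for this illustration.

```python
import unicodedata

def codepoints(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Sample input: the fi ligature (U+FB01) followed by precomposed é (U+00E9).
sample = "\ufb01\u00e9"

# Normalize the same input under each of the four forms.
by_form = {
    form: codepoints(unicodedata.normalize(form, sample))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, points in by_form.items():
    print(f"{form:<4} {' '.join(points)}")
```

The canonical forms (NFC, NFD) leave the ligature untouched, while the compatibility forms (NFKC, NFKD) replace it with the letters f and i.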

Basic usage

from disarm import normalize

# NFC: compose into single codepoints
assert normalize("e\u0301") == "\u00e9"

# NFD: decompose into base + combining marks
assert normalize("\u00e9", form="NFD") == "e\u0301"

# NFKC: compatibility + compose
assert normalize("\ufb01nance", form="NFKC") == 'finance'  # U+FB01 is the fi ligature
assert normalize("2²", form="NFKC") == '22'

# NFKD: compatibility + decompose
assert normalize("\ufb01", form="NFKD") == 'fi'

Checking normalization

Test whether a string is already in a given form without performing the full normalization:

from disarm import is_normalized

assert is_normalized("hello")
assert is_normalized("\u00e9", form="NFC")      # precomposed é
assert not is_normalized("\u00e9", form="NFD")
assert is_normalized("e\u0301", form="NFD")     # e + combining acute

The NF enum

For programmatic use, the NF enum provides the four forms:

from disarm import NF, normalize

assert normalize("\ufb01", form=NF.KC.value) == 'fi'  # U+FB01 is the fi ligature

Member  Value
NF.C    "NFC"
NF.D    "NFD"
NF.KC   "NFKC"
NF.KD   "NFKD"
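
One place an enum like this helps is validating a form name that arrives as a string, for example from configuration. The sketch below does not import disarm: FormSketch is a hypothetical stand-in that copies the member/value table above, and parse_form() is an illustrative helper, not part of the library. Whether NF itself supports lookup by value this way is an assumption.

```python
from enum import Enum

class FormSketch(Enum):
    """Hypothetical stand-in mirroring the NF member/value table."""
    C = "NFC"
    D = "NFD"
    KC = "NFKC"
    KD = "NFKD"

def parse_form(name):
    """Map a user-supplied form name to an enum member.

    Raises ValueError for anything that is not one of the four forms.
    """
    return FormSketch(name.strip().upper())

member = parse_form("nfkc")
print(member, member.value)

try:
    parse_form("NFX")
except ValueError as exc:
    print("rejected:", exc)
```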

When to use which form

  • NFC: Default for most applications. Store and compare text in NFC.
  • NFD: Use when you need to manipulate combining marks (e.g., strip_accents() uses NFD internally).
  • NFKC: Use for search indexes and text matching where the ligature ﬁ (U+FB01) should match the two letters fi.
  • NFKD: Use when you need full compatibility decomposition before further processing.
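
Two of the use cases above can be sketched with the standard library alone. Both helpers below are hypothetical illustrations built on unicodedata. They are not disarm's implementation of strip_accents(), which is only known from this page to use NFD internally.

```python
import unicodedata

def strip_accents_sketch(text):
    """Remove combining marks by decomposing (NFD) and filtering them out."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains so the result is in NFC.
    return unicodedata.normalize("NFC", kept)

def search_key_sketch(text):
    """Build a matching key in which compatibility characters are folded (NFKC)."""
    return unicodedata.normalize("NFKC", text)

print(strip_accents_sketch("caf\u00e9"))
print(search_key_sketch("\ufb01nance") == search_key_sketch("finance"))
```

A real search pipeline would usually add further steps such as case folding. They are left out here to keep the focus on the choice of normalization form.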

Performance

Normalization is implemented in Rust via the unicode-normalization crate. Strings that are already in the target form are detected quickly via is_normalized() without allocation.
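
The check-before-normalize idea can be sketched as follows. This uses unicodedata.is_normalized() from the Python standard library (Python 3.8 or later) as a stand-in, and normalize_if_needed() is a hypothetical helper name. It illustrates the pattern only and says nothing about how disarm implements it in Rust.

```python
import unicodedata

def normalize_if_needed(text, form="NFC"):
    """Return text unchanged when it is already in the target form."""
    if unicodedata.is_normalized(form, text):
        # Fast path: no new string is built.
        return text
    return unicodedata.normalize(form, text)

already_nfc = "caf\u00e9"
needs_work = "cafe\u0301"

same_object = normalize_if_needed(already_nfc) is already_nfc
fixed = normalize_if_needed(needs_work)

print(same_object, fixed == already_nfc)
```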