Exhaustive Testing & Compile-Time Assurance¶

disarm's assurance is described using the methodology of the technical note Provably Lossless Reversible Transliteration: A Formal Specification and an Exhaustive-Verification Methodology (DOI 10.5281/zenodo.20613272). Its central idea is to separate what is proven from what is only tested, and to tag every guarantee with the strength of assurance behind it.

Scope: the shipping transforms are lossy, not reversible

The paper specifies a reversible mode (its requirements R1–R7). disarm does not ship that mode — its transforms are lossy by design: ASCII output (I2) and idempotence (I3) are canonicalization properties that preclude reversibility. This page adopts the paper's verified-vs-tested language to describe the existing testing; it makes no claim that transliterate (or any shipping transform) is reversible.

Three strengths of assurance¶

Every guarantee below is tagged with one of these:

(a) Proof by exhaustion. Enumerate every element of a finite domain and check a decidable predicate. This is a constructive proof over that domain, not a sample — it leaves zero untested inputs (e.g. all 11,172 Hangul syllables, the full BMP, all CJK ideographs).
(b) Structural proof. An argument over the structure of the computation that reaches properties quantifying over unbounded strings (e.g. "the output of a character-wise map that emits ASCII for every character is ASCII for every string"). Exhaustion cannot reach these because the input set is infinite.
(c) Property-based testing. Randomized/fuzz testing (Hypothesis, proptest) of unbounded-input properties not reduced to (a) or (b). It is sound but incomplete evidence — label it "tested, not proven."

The shipping library rests on (a) and (c), with (b) used only where a finite per-character result lifts structurally to all strings (I2's ASCII output). The paper's structural proofs of reversibility / unique decodability describe its reversible mode and are out of scope here.

Compile-Time Guarantees (build.rs) — (a) exhaustion at build time¶

The build script enumerates the generated tables and fails the build if a decidable predicate does not hold for every entry:

Assertion	Scope	What it proves
All default BMP table values are ASCII	5,000+ mappings	No transliteration introduces non-ASCII output
All SMP table values are ASCII	All SMP mappings	Same guarantee for characters above U+FFFF
All language override values are ASCII	22 language tables	Language-specific overrides are pure ASCII
All Hanzi pinyin values are ASCII	20,924 entries	Chinese romanization is pure ASCII
Confusables table count ≥ 1,000	TR39 table	Confusables data not truncated
Default BMP table count ≥ 5,000	BMP translations	Default table not truncated
Hanzi pinyin count ≥ 20,000	CJK mappings	Pinyin table not truncated

src/tables/hangul.rs adds const assertions: JUNGSEONG_COUNT == 21, JONGSEONG_COUNT == 28, total Hangul = 19 × 21 × 28 = 11,172, compatibility jamo = 51. If any assertion fails, cargo build fails. No runtime overhead.

Exhaustive Domain Coverage — (a) proof by exhaustion¶

Each property below is checked for every element of a finite domain, so it is proven over that domain.

Hangul Syllables (11,172 characters)¶

Every precomposed syllable (U+AC00–U+D7A3): romanize_hangul() returns Some, output is pure ASCII and non-empty, decomposition indices are in bounds (cho < 19, jung < 21, jong < 28), and the index identity cho·21·28 + jung·28 + jong == syllable_index holds. (This identity is the Hangul decomposition arithmetic — it is not a transliteration-reversibility claim.)

Compatibility Jamo (51 characters)¶

Every standalone jamo (U+3131–U+3163): lookup_compat_jamo() returns Some and output is pure ASCII.

Full BMP — ASCII Output (63,488 characters)¶

Every non-surrogate codepoint U+0080–U+FFFF with ErrorMode::Ignore yields pure ASCII — proves I2 over the BMP by exhaustion.

Full BMP — Idempotence (63,488 characters)¶

Every non-surrogate codepoint U+0080–U+FFFF: transliterate(transliterate(ch)) == transliterate(ch) — proves I3 over the BMP by exhaustion.

CJK Unified Ideographs (20,992 characters)¶

Every character U+4E00–U+9FFF maps to non-empty ASCII (every ideograph has a pinyin mapping).

Indic Block Structure (15 scripts)¶

For each Brahmic block, structural roles are checked exhaustively over the block: virama at the expected offset is IndicRole::Virama, the full consonant range is Consonant, the full dependent-vowel range is DependentVowel. Scripts: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Tibetan, Myanmar, Khmer, Balinese, Javanese.

Stated Invariants (I1–I7) — the lossy-normalizer specification¶

I1–I7 are the invariants of disarm's lossy normalizer. Two of them — I2 (ASCII output) and I3 (idempotence) — are canonicalization properties: they say the transform collapses input toward a canonical ASCII form, which is precisely why the transform is not reversible. None of I1–I7 asserts reversibility.

Each invariant is tagged with the strongest assurance that discharges it:

ID	Invariant	Statement	Assurance
I1	ASCII Passthrough	`∀s: s.is_ascii() → transliterate(s) = s`	(a) exhaustion over the 128 ASCII chars + (c) property-tested at string level
I2	ASCII Output	`∀s: transliterate(s, errors='ignore').is_ascii()`	(a) exhaustion over the BMP + (b) structural (concatenation of ASCII is ASCII); (c) tested for SMP
I3	Idempotence	`∀s: f(f(s)) = f(s)`, `f = transliterate(·, errors='ignore')`	(a) exhaustion over the BMP + (c) property-tested (Python)
I4	No Exceptions	`∀s ∈ UTF-8, \|s\| ≤ 10 MiB: transliterate(s) does not throw`	(c) property-tested (Hypothesis + edge cases)
I5	Deterministic	`∀s, n>0: n calls of transliterate(s) → identical result`	(c) property-tested (100× over mixed-script inputs)
I6	Input Size Bounded	`∀s: \|s\| > 10 MiB → DisarmError`	(b) structural guard (explicit length check), confirmed by a boundary test
I7	Output Length Bounded	`∀s: \|f(s)\| ≤ \|s\|_bytes × 4 + \|s\|_chars`	(c) property-tested (Hypothesis)

Proven vs tested at a glance. I1–I3 are proven over their finite domains (exhaustion, with a structural lift for I2); I2 above the BMP and I4, I5, I7 are tested, not proven (unbounded inputs); I6 is a structural guard confirmed by a boundary test.

Reversibility is out of scope¶

disarm deliberately ships a lossy transform. The paper's reversible mode (R1–R7) — and the structural proofs of round-trip identity and unique decodability it carries — apply to a specified reversible encoding, which disarm does not implement. The exhaustive and structural results above are about canonicalization (I1–I7); none of them imply that transliterate can be inverted. Do not read "exhaustively verified" as "reversible."

Property-Based Testing — (c) tested, not proven¶

proptest (Rust): property tests in tests/integration_transliterate.rs.
Hypothesis (Python): property tests in tests/test_hypothesis.py across transliteration, slugification, normalization, and confusables.
Fuzz testing: random Unicode generation across the input space.

These run across the three test tiers (deterministic CI, Hypothesis, and the formal/exhaustive pre-release tier) — thousands of tests across Rust and Python. They are sound but incomplete: evidence, not proof.

What Is NOT Verified¶

Area	Why not verified	Mitigation
PHF hash correctness	Trusted from `phf_codegen` (widely used)	Functional tests exercise every lookup path
Linguistic accuracy	Correctness is empirical, not provable by testing	Native-speaker corpus; regression tests; see #173 quality benchmark
Unicode version drift	New Unicode versions add codepoints	CI tracks the Unicode version; new chars fall through to `ErrorMode`
Memory safety (UB)	Requires Miri (nightly only)	`unsafe_code = "forbid"`; no unsafe anywhere

Future: Nightly CI Extensions¶

Kani bounded model checking — would add machine-checked structural proofs (absence of panics, overflow, out-of-bounds) for indic_char_role, romanize_hangul, and the decomposition arithmetic.
Miri UB detection — run the suite under Miri to detect undefined behavior.

Reference¶

Provably Lossless Reversible Transliteration: A Formal Specification and an Exhaustive-Verification Methodology. DOI 10.5281/zenodo.20613272. This page adopts its verified-vs-tested methodology to describe disarm's existing, lossy testing — not its reversible-mode requirements.