Exhaustive Testing & Compile-Time Assurance¶
disarm's assurance is described using the methodology of the technical note Provably Lossless Reversible Transliteration: A Formal Specification and an Exhaustive-Verification Methodology (DOI 10.5281/zenodo.20613272). Its central idea is to separate what is proven from what is only tested, and to tag every guarantee with the strength of assurance behind it.
Scope: the shipping transforms are lossy, not reversible
The paper specifies a reversible mode (its requirements R1–R7). disarm
does not ship that mode — its transforms are lossy by design: ASCII
output (I2) and idempotence (I3) are canonicalization properties that
preclude reversibility. This page adopts the paper's verified-vs-tested
language to describe the existing testing; it makes no claim that
transliterate (or any shipping transform) is reversible.
Three strengths of assurance¶
Every guarantee below is tagged with one of these:
- (a) Proof by exhaustion. Enumerate every element of a finite domain and check a decidable predicate. This is a constructive proof over that domain, not a sample — it leaves zero untested inputs (e.g. all 11,172 Hangul syllables, the full BMP, all CJK ideographs).
- (b) Structural proof. An argument over the structure of the computation that reaches properties quantifying over unbounded strings (e.g. "the output of a character-wise map that emits ASCII for every character is ASCII for every string"). Exhaustion cannot reach these because the input set is infinite.
- (c) Property-based testing. Randomized/fuzz testing (Hypothesis, proptest) of unbounded-input properties not reduced to (a) or (b). It is sound but incomplete evidence — label it "tested, not proven."
The shipping library rests on (a) and (c), with (b) used only where a finite per-character result lifts structurally to all strings (I2's ASCII output). The paper's structural proofs of reversibility / unique decodability describe its reversible mode and are out of scope here.
Compile-Time Guarantees (build.rs) — (a) exhaustion at build time¶
The build script enumerates the generated tables and fails the build if a decidable predicate does not hold for every entry:
| Assertion | Scope | What it proves |
|---|---|---|
| All default BMP table values are ASCII | 5,000+ mappings | No transliteration introduces non-ASCII output |
| All SMP table values are ASCII | All SMP mappings | Same guarantee for characters above U+FFFF |
| All language override values are ASCII | 22 language tables | Language-specific overrides are pure ASCII |
| All Hanzi pinyin values are ASCII | 20,924 entries | Chinese romanization is pure ASCII |
| Confusables table count ≥ 1,000 | TR39 table | Confusables data not truncated |
| Default BMP table count ≥ 5,000 | BMP translations | Default table not truncated |
| Hanzi pinyin count ≥ 20,000 | CJK mappings | Pinyin table not truncated |
src/tables/hangul.rs adds const assertions: JUNGSEONG_COUNT == 21,
JONGSEONG_COUNT == 28, total Hangul = 19 × 21 × 28 = 11,172, compatibility
jamo = 51. If any assertion fails, cargo build fails. No runtime overhead.
Exhaustive Domain Coverage — (a) proof by exhaustion¶
Each property below is checked for every element of a finite domain, so it is proven over that domain.
Hangul Syllables (11,172 characters)¶
Every precomposed syllable (U+AC00–U+D7A3): romanize_hangul() returns Some,
output is pure ASCII and non-empty, decomposition indices are in bounds
(cho < 19, jung < 21, jong < 28), and the index identity
cho·21·28 + jung·28 + jong == syllable_index holds. (This identity is the
Hangul decomposition arithmetic — it is not a transliteration-reversibility
claim.)
Compatibility Jamo (51 characters)¶
Every standalone jamo (U+3131–U+3163): lookup_compat_jamo() returns Some and
output is pure ASCII.
Full BMP — ASCII Output (63,488 characters)¶
Every non-surrogate codepoint U+0080–U+FFFF with ErrorMode::Ignore yields pure
ASCII — proves I2 over the BMP by exhaustion.
Full BMP — Idempotence (63,488 characters)¶
Every non-surrogate codepoint U+0080–U+FFFF: transliterate(transliterate(ch)) ==
transliterate(ch) — proves I3 over the BMP by exhaustion.
CJK Unified Ideographs (20,992 characters)¶
Every character U+4E00–U+9FFF maps to non-empty ASCII (every ideograph has a pinyin mapping).
Indic Block Structure (15 scripts)¶
For each Brahmic block, structural roles are checked exhaustively over the block:
virama at the expected offset is IndicRole::Virama, the full consonant range is
Consonant, the full dependent-vowel range is DependentVowel. Scripts:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Sinhala, Tibetan, Myanmar, Khmer, Balinese, Javanese.
Stated Invariants (I1–I7) — the lossy-normalizer specification¶
I1–I7 are the invariants of disarm's lossy normalizer. Two of them — I2 (ASCII output) and I3 (idempotence) — are canonicalization properties: they say the transform collapses input toward a canonical ASCII form, which is precisely why the transform is not reversible. None of I1–I7 asserts reversibility.
Each invariant is tagged with the strongest assurance that discharges it:
| ID | Invariant | Statement | Assurance |
|---|---|---|---|
| I1 | ASCII Passthrough | ∀s: s.is_ascii() → transliterate(s) = s |
(a) exhaustion over the 128 ASCII chars + (c) property-tested at string level |
| I2 | ASCII Output | ∀s: transliterate(s, errors='ignore').is_ascii() |
(a) exhaustion over the BMP + (b) structural (concatenation of ASCII is ASCII); (c) tested for SMP |
| I3 | Idempotence | ∀s: f(f(s)) = f(s), f = transliterate(·, errors='ignore') |
(a) exhaustion over the BMP + (c) property-tested (Python) |
| I4 | No Exceptions | ∀s ∈ UTF-8, |s| ≤ 10 MiB: transliterate(s) does not throw |
(c) property-tested (Hypothesis + edge cases) |
| I5 | Deterministic | ∀s, n>0: n calls of transliterate(s) → identical result |
(c) property-tested (100× over mixed-script inputs) |
| I6 | Input Size Bounded | ∀s: |s| > 10 MiB → DisarmError |
(b) structural guard (explicit length check), confirmed by a boundary test |
| I7 | Output Length Bounded | ∀s: |f(s)| ≤ |s|_bytes × 4 + |s|_chars |
(c) property-tested (Hypothesis) |
Proven vs tested at a glance. I1–I3 are proven over their finite domains (exhaustion, with a structural lift for I2); I2 above the BMP and I4, I5, I7 are tested, not proven (unbounded inputs); I6 is a structural guard confirmed by a boundary test.
Reversibility is out of scope¶
disarm deliberately ships a lossy transform. The paper's reversible mode
(R1–R7) — and the structural proofs of round-trip identity and unique
decodability it carries — apply to a specified reversible encoding, which
disarm does not implement. The exhaustive and structural results above are
about canonicalization (I1–I7); none of them imply that transliterate can be
inverted. Do not read "exhaustively verified" as "reversible."
Property-Based Testing — (c) tested, not proven¶
- proptest (Rust): property tests in
tests/integration_transliterate.rs. - Hypothesis (Python): property tests in
tests/test_hypothesis.pyacross transliteration, slugification, normalization, and confusables. - Fuzz testing: random Unicode generation across the input space.
These run across the three test tiers (deterministic CI, Hypothesis, and the formal/exhaustive pre-release tier) — thousands of tests across Rust and Python. They are sound but incomplete: evidence, not proof.
What Is NOT Verified¶
| Area | Why not verified | Mitigation |
|---|---|---|
| PHF hash correctness | Trusted from phf_codegen (widely used) |
Functional tests exercise every lookup path |
| Linguistic accuracy | Correctness is empirical, not provable by testing | Native-speaker corpus; regression tests; see #173 quality benchmark |
| Unicode version drift | New Unicode versions add codepoints | CI tracks the Unicode version; new chars fall through to ErrorMode |
| Memory safety (UB) | Requires Miri (nightly only) | unsafe_code = "forbid"; no unsafe anywhere |
Future: Nightly CI Extensions¶
- Kani bounded model checking — would add machine-checked structural proofs
(absence of panics, overflow, out-of-bounds) for
indic_char_role,romanize_hangul, and the decomposition arithmetic. - Miri UB detection — run the suite under Miri to detect undefined behavior.
Reference¶
- Provably Lossless Reversible Transliteration: A Formal Specification and an Exhaustive-Verification Methodology. DOI 10.5281/zenodo.20613272. This page adopts its verified-vs-tested methodology to describe disarm's existing, lossy testing — not its reversible-mode requirements.