Changelog¶
All notable changes to this project will be documented in this file.
The format follows Keep a Changelog.
Version numbers use the MAJOR.MINOR.PATCH shape but follow disarm's own
release policy — patch = fixes/cleanups/docs, minor = features
or major refactors, and the major component denotes support status, not API
compatibility (see RELEASING.md).
Project renamed
translit→disarm(#264). Historical entries below predate the rename and refer to the old identity (translit-rson PyPI, thetranslitimport package, the_translitnative module); they are left unchanged because they were accurate for their release. Entries from this point on use thedisarmidentity.
[Unreleased]¶
[0.9.1] — 2026-06-13¶
Added¶
strip_log_injection(text, *, replacement='\ufffd', keep_tab=False)(#307). A stateless, character-level encoder that makes untrusted text safe to write as a log line: it replaces CR/LF/NEL/LS/PS (record forging), NUL/C0/C1 controls (parser corruption), and ESC/DEL (terminal hijack via ANSI escapes) withreplacement(default U+FFFD).\tis neutralized by default (keep_tab=False) to block TSV/logfmt column injection. Idempotent; ASCII-clean fast-path returns the original object; never emits a raw CR/LF/ESC. It owns the log-record and operator-terminal sinks but makes no HTML-log-viewer-safety claim (that is stored XSS — encode at the viewer withescape_html) and is not a log4shell defense (see Threat Model).escape_html(text)andpercent_encode(text, *, component)output encoders (#311). Standalone terminal encoders applied at the output sink — deliberately notTextPipeline/PROFILESsteps (a pipeline is context-free; baking encoding in invites double-encoding and wrong-context escaping).escape_htmlescapes the five HTML metacharacters for element/quoted-attribute context (ASCII fast-path returns the original object; not idempotent by design).percent_encodedoes RFC 3986 percent-encoding for a requiredComponent(PATH/SEGMENT/QUERY/FORM; UTF-8 byte-based, ASCII output,FORMuses space→+). Both are mechanism-named and carry the #306 scope-boundary discipline: they are the narrow, context-pinned exception to "disarm is not an output sanitizer," not a general XSS/injection defense (see Threat Model).
Changed (breaking)¶
- Renamed
is_safe_hostname()→is_suspicious_hostname()and inverted its boolean. The old name asserted a safety it cannot guarantee —safe=Trueonly meant "no mixed-script label and no bundled-table confusable found," yet whole-script spoofs and out-of-table confusables still returnedsafe=True(the false-assurance pattern #306/#308/#309 removed elsewhere, but as a literalsafeboolean a caller branches on). The function now returns(suspicious, analysis)wheresuspicious=Truemeans a problem was detected; the result structSafeHostnameDetails→HostnameAnalysis, fieldsafe→suspicious(inverted). The granularscripts/mixed_script/has_confusables/canonicalfields are unchanged. No alias — invert call sites:safe, d = is_safe_hostname(h)→suspicious, a = is_suspicious_hostname(h). (#313) - Renamed policy profile
web_input_sanitize→normalize_web_input. Follows thesanitize_user_input → normalize_user_inputrename: "sanitize" wrongly implied output/injection safety, and was especially misleading here because this profile is lighter thannormalize_user_input()(NFKC + confusables only; no bidi/zero-width/control/zalgo stripping). Useget_pipeline("normalize_web_input"). No alias is kept. - Renamed
sanitize_user_input()→normalize_user_input(). The old name implied output sanitization (injection safety); this preset performs input Unicode normalization only and is not an XSS/SQL defense (see Threat Model). ThePRESETSregistry key changes to match ("normalize_user_input"). No alias is kept — update call sites directly.
Documentation¶
- Stated the XSS/injection scope boundary explicitly (#306): README, the docs site, and THREAT_MODEL now say plainly that disarm normalizes input and is not an output sanitizer — it performs no HTML/JS/SQL/shell escaping and never replaces context-aware output encoding at the sink (NFKC can even surface ASCII metacharacters from fullwidth lookalikes). This boundary is the conceptual basis for the renames and the new output encoders in this release.
Security¶
- Supply-chain hardening (#260): added
cargo deny(license allow-list, banned/wildcard crates, crates.io-only sources viadeny.toml) to the required Rust checks passed gate, alongside the existingcargo audit. Releases now attach a CycloneDX SBOM (*.cdx.json) of the Rust dependency graph, and PyPI distributions carry PEP 740 build-provenance attestations via OIDC Trusted Publishing. Verification is documented in SECURITY.md. - Bumped
pyo30.24 → 0.29, resolving two upstream advisories:GHSA-36hh-v3qg-5jq4(HIGH — out-of-bounds read innth/nth_backforPyList/PyTupleiterators) andGHSA-chgr-c6px-7xpp(MEDIUM — missingSyncbound onPyCFunction::new_closureclosures). Includes the binding-layer API migration the bump requires (GILwith_gil/allow_threads→attach/detach,PyObject→Py<PyAny>,downcast_exact→cast_exact); no functional change to any transform. (#315)
Internal¶
- Docs: build the MkDocs site in CI and deploy to Cloudflare Pages (served at the unchanged
docs.disarm.dev), replacing the Read the Docs trigger.mkdocs build --strictruns in GitHub Actions (Python-only — mkdocstrings parses source statically); push tomaindeploys production, PRs get preview deploys. Legacy/en/latest/*URLs 301 to root viadocs/_redirects. Removed.readthedocs.yamlandRTD_TOKEN. (#314) - CI: replaced the custom
conversations-resolved.ymlworkflow with GitHub's native Require conversation resolution before merging branch-protection setting. The bespoke "Conversations resolved" status check (#55) was flaky — stale check runs lingered after threads were resolved and blocked otherwise-green PRs. Behavior is unchanged (unresolved review threads still block merge), now enforced by the built-in gate instead of a workflow + required status check.
[0.9.0] — 2026-06-11¶
The first release under the disarm name — the continuation of translit-rs
(last released as 0.8.1). See #264 for the rename rationale. The 0.0.0 entries
on PyPI / crates.io / npm are name-reservation placeholders, not releases; 0.9.0
is the first functional disarm release.
Changed¶
- Renamed the project from
translittodisarm(#264). This unifies the distribution and import names under a singledisarm: - PyPI distribution
translit-rs→disarm;import translit→import disarm. - Native module
translit._translit→disarm._disarm; cratetranslit→disarm. - Console script
translit→disarm. - Breaking: the public base exception
TranslitError→DisarmError(the subclassesInvalidArgumentError/ResourceLimitError/UnsupportedErrorkeep their names).DisarmErrorremains aValueErrorsubclass, soexcept ValueErrorkeeps working. - Breaking: the context-dictionary environment variable
TRANSLIT_DICT_DIR→DISARM_DICT_DIR. - Canonical URLs moved to
https://disarm.dev/https://docs.disarm.dev; the repository moved tohttps://github.com/raeq/disarm.
Fixed¶
uv.locknow declaresrequires-python = ">=3.10", matchingpyproject.toml(it had drifted to>=3.9after the 3.10 floor landed in #277).
[0.8.1] — 2026-06-11¶
The final translit-rs release and the close of the 0.8 performance-hardening
arc. The project continues as disarm from 0.9.0 (#264); 0.8.1 exists to
publish honest, production-true benchmark numbers before the rename.
Changed¶
- Benchmarks now run in the fresh-string regime (#277, #302): every timed
call receives a newly constructed
str, the way production traffic always does. The prior cached-object measurement let CPython's per-objectAsUTF8cache hide ~105–137 ns/call of UTF-8 encode cost that onlytranslitpays (pure-Python comparators never callAsUTF8), flattering it. JSON records now carryregime: fresh-string/v2; pre-flip history is the cachedv1regime and must not be compared across regimes. - README short-string figures updated to the measured fresh-regime values: ~17× vs Unidecode (Latin), ~14× (mixed scripts), ~13× (Cyrillic/Greek); ~65 ns ASCII passthrough; the four-cell Unidecode-own sweep still holds (~1.3× on Unidecode's strongest case to ~25×), with a methodology note explaining the regime.
[0.8.0] — 2026-06-11¶
A performance and hardening release. The headline is a benchmark-gated
optimisation programme (#233) that makes short-string transliterate roughly
15–21× faster than Unidecode (up from ~7–9×) and beats Unidecode on its
own benchmark, while shrinking the library's static and resident memory.
Alongside it, a Unicode-security hardening sweep tightens is_safe_hostname,
the security presets, and the stateful slugifiers. Most changes are
behaviour-preserving; the exceptions are called out under Upgrade notes.
Upgrade notes¶
- Minimum Python is now 3.10 (was 3.9). The extension targets the stable-ABI
floor
abi3-py310, so a single wheel runs on 3.10+ and the per-call Python→Rust path crosses the boundary only once (#277). Python 3.9 wheels are no longer produced. is_safe_hostnamenow flags every mixed-script label as unsafe (#254), not only the four Latin-paired high-risk combinations. A label combining two scripts with no Latin confusable (e.g. Greek + Cyrillic) previously reportedsafe=True; it now returnssafe=False. This also flags benign combinations (e.g. Latin + CJK) — read themixed_script/scriptsfields if you need a more permissive policy. The check fails closed by design.- Security presets no longer synthesise path separators (#248): confusable
characters that normalise to
/,\, or..can no longer pass through the security/filename presets to forge path structure. rag_ingestnow runs the confusables step (#258): Unicode homoglyph spoofs are canonicalised during RAG ingestion instead of surviving it. Output of therag_ingestpreset may change for homoglyph-bearing input.- Stateful slugifiers validate
langat construction (Slugify,UniqueSlugify), closing the gap the 0.7.0 validation pushdown missed (#257); an invalidlang=now raises instead of being silently ignored.UniqueSlugifyalso honours property mutations made after construction (#249). - Auto-language discriminator behaviour was reconciled with its documented contract (#253) — auto-detection results may differ for a few ambiguous inputs.
- Correctness edge cases fixed (#255), which may change output: reverse
transliteration of all-caps digraphs and a
grapheme_truncateoverflow case.
Performance¶
- Short-string
transliterate: ~15–21× faster than Unidecode (#277). A call now crosses the Python→Rust boundary exactly once with Rust-side keyword defaults, extracts UTF-8 zero-copy, and returns already-ASCII input as the originalstrobject via a borrowedCow— roughly 70 ns with no allocation. - Beats Unidecode on its own benchmark (#281): translit wins all four cells
of Unidecode's
expect_ascii/expect_nonascii× ASCII/non-ASCII matrix, including Unidecode's strongest (ASCII-passthrough) case. - Smaller static tables (#237): the default BMP transliteration table became
a two-level page-table + interned-blob trie (~1 MB → ~58 KB), hanzi→pinyin
a dense interned array (~600 KB → ~50 KB), and the 11,172 Hangul
romanisations a single packed blob. No runtime data loading; no
unsafe. - Zero-copy context dictionaries (#238): the Arabic/Persian/Hebrew
dictionaries are read once and indexed by
(offset, len)spans instead of parsed into nestedHashMaps of owned strings — roughly halving their resident memory. Lookup is binary search; the two-step bigram path allocates no per-token key. - Linear-time scanning via Aho-Corasick (#242): global and slug replacements
use longest/first-match automata instead of repeated per-position probing; the
UniqueSlugifycollision counter is amortised; and multi-codepoint emoji are matched through a code-point trie. - Per-character hot-loop improvements — resolve-once language tables, block-table dispatch, ASCII-run skipping (#235); fewer copies on the ASCII/identity path (#236); chunked batch extraction that caps peak memory (#239); single-pass strict mode, O(u)→O(1) in time and space (#240); further ASCII fast-paths and removal of O(n·k) scans (#252).
- A benchmark harness with a deterministic iai-callgrind estimated-cycle gate guards every PR against regressions in CI (#234).
Note: the batch (
list[str]) API's advantage over a Python loop has narrowed for short strings now that a scalar call is ~70 ns — for tiny inputs it is at rough parity. Its durable value is the single GIL-released crossing (thread parallelism), not a raw per-call speedup. Seedocs/performance.md.
Added¶
TextPipeline(preset=…)constructor and related new-surface ergonomics (#259).- CLI:
slugifyhonours--lang; thestrip_bidi/strip_zalgosteps are exposed; error output is cleaned up (#250). - The
errorsparameter annotation now includes"strict"in the callable-module andTextwrappers (#247).
Changed¶
docs/performance.mdrewritten so every claim is CI-executed (Sybil) or linked to a recorded measurement, with a stated margin policy, varied scenarios, a prominent "where we are slower" section, and a credit paragraph for Unidecode and its lineage (#291).
Internal¶
- Resource-limit constants centralised in a single
src/limits.rsmodule so the library's resource posture has one audit surface (#256). - Cross-cutting Rust-core helpers (
apply_replacements,emit_warning) de-duplicated (#251). - Incorrect docstring examples in the Python wrapper modules corrected (#246).
[0.7.0] — 2026-06-10¶
A feature and architecture release. Headlines: a unified, catchable exception
hierarchy; terminal column-width measurement (terminal_width /
grapheme_width); native errors="strict" transliteration; LLM/RAG
guardrail pipeline presets; and a substantial push of validation and
configuration logic down into the Rust core, so the upcoming multi-language
bindings inherit one behaviour instead of reimplementing it. Most changes are
behaviour-preserving; the exceptions are called out under Upgrade notes.
Upgrade notes¶
- Exceptions now form a hierarchy. Every library error subclasses
TranslitError, withInvalidArgumentError,ResourceLimitError, andUnsupportedErrorbeneath it.TranslitErrorremains aValueErrorsubclass, so existingexcept ValueErrorkeeps working. Several error message strings were enriched/standardised (#186, #187) — code matching exact message text may need updating; code matching exception types is unaffected. lang=is validated even for ASCII input (#197). A binding-side ASCII fast path previously skipped language validation, sotransliterate("abc", lang="zz")silently returned the input; it now raisesInvalidArgumentError, matching how non-ASCII input always behaved.slugify_filename/Slugify(safe_chars=…)output corrected (see Fixed):slugify_filename("My Report.pdf")now returns"My_Report.pdf", not"My.Report_pdf". Output for inputs that usesafe_charsmay change.- New modes:
errors="strict"fortransliterate(#184) anddecode_to_utf8(strict=True)(#189).
Added¶
terminal_width/grapheme_width(#224): terminal column width per grapheme cluster (UAX #11 East Asian Width). Wide/fullwidth and emoji-presented clusters are 2 columns; combining marks, controls, and zero-width characters are 0. Ambiguous characters are 1 by default, or 2 withambiguous_wide=True. Width data is generated at build time from the pinned UCD (no runtime data, nounsafe). Measures cells, not pixels; tabs are not expanded.errors="strict"+find_untranslatable(#184): strict transliteration raises on the first untranslatable character (reporting it and its byte offset);find_untranslatablereturns all of them without raising.- Guardrail pipeline presets (#139):
TextPipelinegainsstrip_bidiandstrip_zalgosteps and thellm_guardrail/rag_ingestnamed profiles for LLM/RAG input sanitisation. get_pipeline/list_profiles(#229): the named policy-profile registry now lives in the Rust core; the Python helpers are thin wrappers over it.decode_to_utf8(strict=True)(#189): raise on lossy/replacement decoding instead of silently substituting U+FFFD.
Changed¶
- Unified exception hierarchy (#183): the Python error surface is a
TranslitErrorbase with categorised subclasses; sites that previously raised bareValueErrorare unified (foundation laid in 0.6.3 via #181). - Validation moved into the Rust core (#185, #217, #229, #230, #231): enum
validation, the
transliterate()argument-conflict matrix, non-negativemax_length/max_graphemeschecks,safe_chars, andmin_confidencerange-checking now live in the core, so other bindings enforce the identical contract without reimplementing it. The Python layer keeps only type guards. - Actionable error messages (#186, #187): weak messages now name the offending value, list valid options, and suggest a "did you mean…?" where applicable; message style is standardised across the surface.
- Error cause chains (#188): wrapped errors surface the underlying cause via
__cause__rather than flattening it into the message. TextPipelinestep ordering (#174) is derived from a single source of truth, removing drift between configuration and execution order.- All-ASCII preset fast path (#198): presets skip the NFKC pass for pure-ASCII input (behaviour-preserving).
Fixed¶
slugify_filename/Slugify(safe_chars=…)preserved safe characters at the wrong positions —slugify_filename("My Report.pdf")returned"My.Report_pdf"instead of the awesome-slugify-correct"My_Report.pdf".safe_charsare now handled natively in the Rust core: kept verbatim and treated as word characters so they hold their position (#156, #230). The prior test only covered a dot-free input, so the bug was uncaught; regression tests now cover filenames with extensions, multiple dots, andUniqueSlugify+max_length.slugify(default=…)is now sanitised through the same slug pipeline (so a caller-supplied fallback cannot smuggle path-traversal or URL metacharacters into output documented as URL-safe), threads through the statefulSlugifier/UniqueSlugifierforms, and a negativemax_lengthnow raises a catchableInvalidArgumentErroron both the scalar and batch paths instead of an uncatchableOverflowError(#193, #169).- Low-severity hardening bundle (#200): eight small robustness fixes (bounds, overflow, and edge-case handling) gathered into one pass.
Security¶
- The RustSec advisory audit (
cargo-audit) now blocks merge via the required "Rust checks passed" gate on every PR — an advisory can land on a dependency without any code change here (#195).
Removed¶
- Docker image build/publish and its Trivy CVE scan (#138). translit is a
pip install-first library; previously published images remain as historical artifacts, but no new ones are produced. Install the CLI viapip install translit-rs.
Documentation¶
- Executable cookbook (#154, #91, #140, #156, #172): a Sybil doc-test harness
with a CI gate, unidecode→translit migration recipes, an "LLM pipelines" page,
a tokenizer-preprocessing page, and an anti-rot lint that turned 307 decorative
# =>claims into checked assertions. - normalize-first canonicalisation recipe (#174) and a formal-verification assurance taxonomy (#223 — proof-by-exhaustion / structural / property-tested, tagging each I1–I7 invariant), plus grapheme-integrity property tests (#174).
- The project adopted the Developer Certificate of Origin (#165); all commits are signed off. The custom-emoji-provider 9-codepoint window cap is now documented (#199).
[0.6.3] — 2026-06-08¶
A correctness, maintenance, and architecture-foundation release. No output-affecting
changes — every fix is behaviour-preserving and the one new public behaviour
(slugify(default=...)) is opt-in. Headline: a pure-Rust error model is now in place,
laying the foundation for the multi-language bindings on the roadmap.
Upgrade notes¶
- No output-affecting changes. Existing output and every exception type/message are unchanged.
- New opt-in:
slugify(text, default="…")returns the fallback when the input has no sluggable characters (emoji / punctuation / zero-width) instead of"".default=None(the default) preserves the prior empty-string behaviour.
Added¶
slugify(default=...)— opt-in fallback for inputs that would otherwise slug to the empty string, closing an empty-slug routing hazard (#97).
Fixed¶
PRESETS["strip_obfuscation"]metadata now reflects the real pipeline order (confusablesruns afterdemojize), matchingsrc/presets.rs(#141).- Lock-poison recovery now emits a Python
UserWarningnaming the recovered table, instead of a silent stderr line (#117). docs/api/exceptions.mdcorrected —TranslitErrorinherits fromValueError(notException), and every example message string now matches the real output (#182).
Changed (internal — behaviour-preserving)¶
- Error model (#181, part of #180): a pure-Rust
Errorenum (thiserror) with a stablecode()per variant and a singleFrom<Error> for PyErrboundary; ~35 error sites migrated off in-corePyErrconstruction. Removes the core↔PyO3 coupling and lays the foundation for non-Python bindings. Python exception types and messages are unchanged. - Dependencies:
phf/phf_codegen0.11 → 0.13,criterion0.5 → 0.8,chardetng0.1 → 1.0 — each migrated and verified behaviour-preserving (#146, #153, #164). build.rsnow auto-discovers language override tables — adding a language is just dropping in atranslit_lang_*.tsv(#74).- Generated
.pyistubs are now guarded by a stub/binary signature drift-check, which caught and fixed 18 stale stub signatures (#76).
Maintenance¶
- Split
python/translit/__init__.py(2,683 lines) into_api.py+_presets.py(#73). - Split
tests/integration_transliterate.rsby script family (#75). - Process: a required "Conversations resolved" merge gate (#55); a documented
dependency-upgrade methodology with Dependabot cooldown + auto-merge
(
DEPENDENCY_UPGRADES.md,RELEASING.md).
[0.6.2] — 2026-06-07¶
A correctness, security, performance and maintenance release triaged from a
post-0.6.1 issue sweep (#101–#132). No public API removed; one small new public
behaviour (slugify(save_order=True) now functions). Two output-affecting
fixes — see Upgrade notes.
Upgrade notes (output-affecting)¶
slugify(save_order=True)was an accepted no-op; it now strips only leading/trailing stopwords (preserving interior word order), matching python-slugify (#118). If you passedsave_order=True, slug output changes.decode_to_utf8defaultmin_confidence0.5→0.95(#103). The old default was inert (the detector only reports0.50/0.95, and0.50 < 0.50is false), so it never rejected. It now requires high confidence by default; passmin_confidence=0.0to accept any guess. (No practical change today — the detector currently always reports0.95.)
Fixed¶
- #102 —
UniqueSlugifyno longer panics across the FFI boundary on a multibyte separator + smallmax_length(byte slice landed mid-codepoint; now usesfloor_char_boundary). - #101 — context bigram disambiguation tier was unreachable (it reset on every inter-word space); it now resets only on hard boundaries, so the tier fires in normal prose.
- #104 —
set_emoji_providernow obeysseal_registrations()(the provider swap previously defeated the seal). - #103 —
decode_to_utf8default confidence now actually gates (see notes). - #107 — a corrupt context dictionary now reports a distinct "corrupt" error
instead of the misleading "not found" remedy (
DictStateenum). - #121 —
PRESETS["sanitize_user_input"]now reflects the real pipeline order (strip invisibles before zalgo); Python registry and Rust doc aligned. - #129 —
Text.transliterate()stub now declares thetones/contextparameters the implementation accepts. - #131 —
Slugify(uids=...)emits a correct wrong-class warning rather than a spurious deprecation warning. - #122 — disambiguated the
_compatshould_warnnested ternary.
Security¶
- #105 — added a
cargo audit(RustSec advisory) CI job and acargoDependabot ecosystem. - #132 — added a Trivy CVE scan of the published image to the release
workflow (SARIF → Security tab, fails on fixable HIGH/CRITICAL) +
.trivyignore. - #106 — Rust diagnostics now route through Python
warningsinstead of bareeprintln!, so applications can capture/suppress them.
Performance (output-preserving)¶
- #108 codepoint-range diacritic checks in
tokenize(); #109mem::takeper token boundary; #110 singlech.nfkc()pass on the NFKC fallback; #111 loweredMAX_CAPACITY_HINT256 MiB → 8 MiB; #112/#113 emoji matching uses stack buffers + a fixed sliding window (no per-charVec/String); #114 slugify usesCow(no eagerto_owned); #115 contexttokenize()returns borrowed (Cow) slices of the input — zero per-token allocation (Rust API: the crate-internalcontext::Token.textchanged fromStringtoCow<'_, str>; no effect on the Python API); #116 clamped theContextDictcapacity hint.
Maintenance¶
- #118 implemented
slugify(save_order=True); #119SlugConfig::from_pyargsdedupes the four slugify PyO3 entrypoints; #120_build_slug_kwargshelper; #123 seal-enforcement docs on eachtables::mutator; #124 infallibility comments; #125 typed_CallableModule.__call__kwargs; #126 correctedrecover_lockdoc; #127 documented the lazy-import workaround; #128 renamed_mutation_generation→_registration_generation; #130 annotated the defence-in-depth conflict check.
[0.6.1] — 2026-06-07¶
A bug-fix and test-hardening release. No public API was removed and no new public names were added. One fix changes key output for inputs containing invisible characters — see Upgrade notes.
Upgrade notes (output-affecting fix)¶
search_key/catalog_key/sort_keynow strip bidi overrides and soft-hyphen / format characters (#93). Previously a value stored with an invisible character (e.g."password","usertxt") produced a different key from its clean equivalent, so dedup and lookup silently missed. The new key is the correct one; if you persist these keys, regenerate any that were computed over text that could contain invisible characters.
Fixed¶
- #93 — key functions (
search_key/catalog_key/sort_key) leaked bidi and soft-hyphen characters, so visually-identical inputs produced non-colliding keys. They nowstrip_bidiafter NFKC, matching the other canonicalization presets. - #82 — Greek reverse transliteration (
transliterate(text, target="el")) left literal Latin letters in the output ("psychi"→"ψyχη"). The forward direction romanizes Υ/υ asY/y(including the ου/αυ/ευ diphthongs), so theelreverse table now mapsY/yback to Greek; round-trips no longer leak Latin letters. - #69 —
transliterate()resolved conflicting kwargs differently forstrvslistinput (one path silently droppedtarget, the othercontext). Conflicts are now checked once, before the dispatch, so both raise identically:context+targetandcontext+tonesraiseValueError. - #72 —
translit.unidecode()now mirrors the Unidecode 1.3 signatureunidecode(string, errors="ignore", replace_str="?"), mapping Unidecode'serrorsmodes (ignore/replace/preserve/strict) onto the native error handling, instead of raisingTypeErroron those kwargs. - #95 — Greek Extended polytonic capitals for omicron/upsilon/omega/rho
were corrupted, emitting unrelated Latin letters (
Ὅμηρος→Xmiros,Ὑγίεια→Pgieia). Corrected all 50 affected entries to the proper base romanization, consistent with the monotonic forms (Ὅμηρος→Omiros). - #99.3 — a typo'd
form=/errors=value now raises even for pure-ASCII input. Previously the ASCII fast-path returned before reaching Rust, so the bad enum silently no-opped on ASCII and only raised on the first non-ASCII string. Validation now runs before the fast-path innormalize()andtransliterate().
Performance¶
- #70 — the batch entry points (
transliterate,slugify,normalize,strip_accentsonlist[str]) now release the GIL around their pure-Rust compute loop viapy.allow_threads. Multi-threaded callers processing large batches now get real parallelism (~1.8× wall-clock with two threads) instead of serialising on the interpreter lock. Output is unchanged. Documented in the new "Concurrency (GIL)" section ofdocs/performance.md.
Documentation¶
- #94 —
strict_iso9is no longer described as "ISO 9:1995". It emits ASCII digraphs (ж→zh, ч→ch, ш→sh), not the standard's diacritics (ž/č/š) — translit tables are ASCII-only by design. Docstrings, the data-file header, and the docs now describe it as a scholarly ASCII (ISO 9-style) transliteration and warn it is not ISO 9-conformant. No behavior change. - #98 —
docs/user-guide/transliteration.mdno longer instructs users topip install translit-rs[arabic|hebrew|context](those empty extras were removed in 0.6.0); it now documents thebootstrap_dicts.sh/TRANSLIT_DICT_DIRpath, matching the README and the runtime error message. -
#99.1 / #99.2 — fixed two false docstrings:
sort_keyno longer claims to preserve accents (it folds them via transliteration, coinciding withsearch_key), andslugifyno longer documents apretranslatekwarg it never had. -
#84 — corrected the README throughput table (Cyrillic ~106M chars/sec, slugify ~712K slugs/sec on commodity 4-vCPU hardware) and added a hardware/methodology footnote; added a matching variance note to
docs/performance.md. - #77 — fixed the
Textfluent-builder docstring example (normalizeis keyword-only:.normalize(form="NFC")), reconciled the language-profile count (README now agrees with the docs at 83), and documented thecontextkwarg in thetransliterate()docstring.
Internal / tests¶
- #78 — added adversarial coverage for the raw-bytes decode path
(
detect_encoding/decode_to_utf8): deterministic hostile-byte cases in CI plus a Hypothesisst.binary()fuzz suite proving no-panic and invariant-preservation. Documented inTHREAT_MODEL.mdthat the decode path has no input-size cap (caller's responsibility, per the 0.6.0 cap removal). - #79 — added a single-vs-batch kwarg parity regression test across the full
kwarg matrix and a multi-script corpus (the
tonesbatch drop fixed in 0.6.0 can no longer recur silently).
[0.6.0] — 2026-06-07¶
A hardening and bug-fix release. Two new opt-in helpers (dedup_batch,
make_cached_transliterator) make this a minor bump; no public API was
removed. Several fixes change output for specific inputs — read Upgrade
notes before upgrading if you cache or persist transliterator/normalizer output.
Upgrade notes (output-affecting fixes)¶
Each of these was a bug; the new output is the correct one. If you store or cache results that were keyed on the old (buggy) behaviour, regenerate them:
register_replacements()now actually applies. It was a silent no-op — the registered table was never consulted. Registered replacements now take effect acrosstransliterate()(scalar, list, andcontext=True). If you registered replacements and (knowingly or not) relied on them being ignored, output changes.transliterate(list, tones=True)now returns toned pinyin (was silently toneless on the list path);transliterate(list, target=…, tones=True)now raisesValueErrorfor the forward-only parameter (was silently ignored).normalize_confusables(text, target="cyrillic")no longer maps characters onto invisible combining marks (28 such mappings removed).strip_obfuscationnow folds intra-Latin ASCII homoglyphs (þ→p,ſ→f,ı→i, …) and is idempotent;sanitize_user_inputis idempotent for control/invisible characters between combining marks;demojizeno longer inserts a stray space after a tab/newline that precedes an emoji.- Context-aware transliteration (
context=True, ar/fa/he) distribution changed. The emptyarabic/hebrew/contextpip extras have been removed (they never installed anything). The ~37 MB dictionaries are no longer tracked in git, and are not shipped in the wheel. Context mode now loads dictionaries from$TRANSLIT_DICT_DIR(build them withscripts/bootstrap_dicts.sh), or use theembed-dictsCargo feature for a self-contained build. A packaged pip-installable distribution is tracked in #56/#60. decode_to_utf8defaultmin_confidencechanged0.0→0.5. Low-confidence encoding guesses are now rejected by default instead of silently accepted; passmin_confidence=0.0to restore the old behaviour. (#66)- Unknown
langcodes now raise instead of silently falling back (#68). A typo'd code (lang="RU",lang="russian") used to behave exactly likelang=None— quietly-wrong output — whileerrors=/form=rejected bad values.transliterate,slugify,sanitize_filename,catalog_key,search_key,sort_key, andml_normalizenow raiseTranslitErrorlisting the valid codes."auto", thenb/nn/daaliases, andregister_lang()codes are accepted. (target=already validated.)
Changed¶
- No library-imposed input-size limit (#80, #65). The 10 MiB input cap on
transliterate,normalize,fold_case, and the preset pipelines has been removed — it was paternalistic, inconsistently applied (the ASCII fast path bypassed it;slugify/normalize_confusables/strip_zalgonever had it), and the threat model already disclaims DoS. All operations are linear time and memory; bounding untrusted input is the caller's responsibility, documented in the threat model and docstrings. The single retained size guard is theregister_replacementsoutput amplification bound (a tiny input can expand to an enormous string via a caller-registered value — an amplification a caller's own input check cannot foresee). Backward-compatible: only previously-rejected large inputs now succeed. - External wording: capability, not promise. Security-relevant features are now described as mechanisms (TR39 confusable mapping, bidi/zalgo stripping, hostname analysis) rather than outcome guarantees. Package descriptions, README, and docs no longer claim to "prevent"/"neutralize" attacks or achieve "perfect" recovery; the XMR benchmark figure is always stated with its tested-pairs scope. Engineering rigor is held to a high internal bar (see below); the external surface promises nothing it cannot measure.
Added¶
dedup_batch(texts, …)— transliterate a list, processing each distinct value once and mapping back (large win for repeated/categorical data; ~146× on a high-locality column). Stateless — no cache to invalidate; unique values are chunked at the 100k batch cap. (#31)make_cached_transliterator(maxsize=…, …)— opt-in LRU-cached single-string transliterator with options fixed at construction. Self-invalidating: the next call after anyregister_lang/register_replacements/remove_replacement/clear_replacementsclears the cache (via an internal table-generation counter), so it never serves stale results. Never enabled by default. (#31)THREAT_MODEL.md— defines in-scope mechanisms, explicit out-of-scope items (confusables outside the bundled TR39 table, whole-script and multi-character confusables, Unicode-version skew, semantic attacks, DoS), and a vulnerability-vs- known-limitation policy, grounded in the literature (Holgers 2006, Deng 2020, BitAbuse 2025).SECURITY.mdrewritten on real footing: supported-version policy stated, triage scope defined, and linked to the threat model.- Security-invariant property tests + fuzzing.
proptestinvariants in Rust (src/presets.rs) assert no-panic, idempotence, and "no bidi/format control survives" forstrip_obfuscation/security_clean/sanitize_user_input/strip_bidiacross the Unicode input space; a deterministic, CI-gating adversarial attack-corpus regression (tests/test_attack_corpus.py: homoglyph / zalgo / invisible / bidi / combined, XMR-style); and acargo-fuzzharness (fuzz/) for continuous coverage-guided fuzzing of the defense pipelines. - Confusable coverage for intra-Latin homoglyphs of basic ASCII letters
(e.g.
þ→p,ſ→f,ı→i,ƒ→f,Ɩ→l,ꜱ→s). The TR39 generator previously skipped all Latin-script sources for the Latin target, dropping ~83 genuine homoglyphs of A–Z/a–z;normalize_confusables/strip_obfuscationnow fold them. Single-letter Latin confusable coverage of UTS#39 is now complete. - Pinned
data/confusables.txt(UTS#39 17.0.0) as the reproducible, version- controlled input forscripts/gen_confusables.py(--downloadrefreshes it), and atests/test_confusable_coverage.pygate against Unicode-version drift.
Fixed¶
register_replacements()was a silent no-op — the global table was stored but never consulted bytransliterate(). It now applies as a longest-match pre-pass (no cascade) across the scalar, list, andcontext=Trueforward paths, including ASCII-keyed replacements that previously bypassed Rust via the Python fast path. (#51)tones=on the list/batch path was dropped:transliterate(["北京"], tones=True)returned toneless pinyin while the scalar path returned toned, andtransliterate([...], target=…, tones=True)silently ignored the forward-only parameter instead of raising. Both now match the scalar path. (#14, #15)normalize_confusables(target="cyrillic")emitted invisible combining marks — 28 mappings folded a visible character onto a combining Cyrillic-Extended mark (an obfuscation vector). The generator now excludes combining-mark targets. (#24)script_info("CanadianAboriginal")["context_aware"]raisedKeyError— the entry omitted a requiredScriptMetafield; a completeness guard now prevents recurrence. (#18)- Context path skipped
strict_iso9/gost7034mutual-exclusion validation —transliterate(text, context=True, strict_iso9=True, gost7034=True)now raisesValueErrorlike the non-context path; the missing-dictionary error hint is now language-specific (he→hebrew). (#18) demojizeinserted a stray space after a tab/newline preceding an emoji ("a\t😀"→"a\t grinning face"); it now checks for any whitespace. (#12)- Compatibility digit variants fold to digits, not letters (#89). The
confusables table mapped Mathematical Alphanumeric digits
𝟎/𝟏(and the other four families, plus superscripts) to the look-alike lettersO/l, sonormalize_confusables("𝟏𝟎")gave"lO"andstrip_obfuscationcorrupted digit runs. The generator now folds any character whose NFKC form is an ASCII digit to that digit. They remain detected as confusable (is_confusable), but canonicalize to the correct number. (ASCII0/1were already unaffected.) - NFKC-compatible Latin is recovered instead of dropped to
[?](#81). Mathematical Alphanumeric Symbols (𝕳𝖊𝖑𝖑𝖔 𝟙𝟚𝟛→Hello 123), presentation ligatures (fi/fl→fi/fl), and superscripts (x²→x2) now transliterate: an unmapped non-ASCII char is NFKC-decomposed and re-tried before the error fallback. This matches unidecode/anyascii and closes a filter-evasion ("fancy text") gap. Purely additive — only chars that were previously[?]are affected; emoji (no ASCII decomposition) still map to[?]. - Defense pipelines are now idempotent (bugs found by the property tests):
strip_obfuscation: emoji whose CLDR name contains typographic punctuation (e.g.👒→woman’s hat, U+2019’) weren't folded because confusables ran before demojize; a second pass folded’→'. Confusables now runs after demojize.sanitize_user_input: an invisible or control character between combining marks (e.g. soft-hyphen, NUL) split a mark-run, so removing it after zalgo-capping merged runs that a second pass then capped differently. Bidi, zero-width, and control characters are now stripped before zalgo-capping.- Build-time and doc corrections:
build.rsnow rejects malformed\u{…}escapes in TSV data; embedded-dictionary parse errors are logged (not silently dropped); and numerous stale docstrings/comments were corrected (script_to_langreturns ISO 639-1 or 639-3;normalize()ASCII fast-path; list single-Rust-call caveats).
Security¶
seal_registrations()/registrations_sealed()(#64, high). Theregister_lang/register_replacementsAPIs mutate process-global tables consulted by everytransliterate/slugify/catalog_key/… call, so in a multi-tenant or web process one import or request handler could silently alter everyone's canonicalization.seal_registrations()is a one-way latch: after it is called, register/remove/clear raiseTranslitError. The registration APIs are now documented as startup-only/single-writer. Separately, a poisoned lock no longer resets registrations to defaults (a panic in one thread could previously wipe another caller's registered languages) — it now recovers the data as-is.is_safe_hostnamenow decodes IDN/xn--labels (#63, high). Previously anxn--ACE label was pure ASCII → single-script → reported safe, so the on-the-wire form of the IDN homograph attack (a Cyrillicxn--80ak6aa92e.com"apple" spoof) sailed through — the exact blind spot for a library marketingidn/anti-spoofing. ACE labels are now UTS#46-decoded (via theidnacrate) before script/confusable analysis; a malformed ACE label is treated as unsafe. Non-xn--labels are untouched (no false positives on, e.g.,my_host.local).is_safe_hostnamefails closed (#67.1). A confusable-check error no longer silently degrades to "not confusable" (unwrap_or(false)) → "safe"; it now marks the hostname unsafe.strip_bidi/display_cleannow also strip deprecated format controls (U+206A–U+206F) and interlinear annotation marks (U+FFF9–U+FFFB) (#67.2), which were previously only handled as transliteration-table entries.- NFKC×confusables composition pinned (#67.3). Added a regression test fixing
the exact set of NFKC-ASCII results that
normalize_confusablesre-maps (`→',"→'',|→l) so a data/ordering change — e.g. reintroducing digit→letter — fails loudly; and that presets resolve NFKC/TR39 conflicts (ſ→s) via NFKC. - Context dictionaries are no longer loaded from a CWD-relative path (#61).
load_dict_from_fspreviously probed./data/{name}_dict.binfirst, so a process whose working directory an attacker influences (or where they can drop./data/) could inject a substitute dictionary and silently change ar/fa/he output. Dictionaries now load only from$TRANSLIT_DICT_DIR(explicit opt-in) or the crate's own absolutedata/path in source builds. - Supply-chain: corpus inputs are verified/pinned (#62). The Tashkeela corpus
archive is now checksum-verified before it feeds the builders (fail-closed — an
unpinned checksum aborts unless
ALLOW_UNVERIFIED_CORPUS=1), and the Project Ben Yehuda corpus is fetched at a pinned commit instead of an unpinned live HEAD. ContextDict::from_bytesis fully bounds-checked. A malformed or truncated context dictionary previously caused an out-of-bounds panic (the crate isunsafe_code = forbid, so a panic aborts the process). Every read is now bounds-checked and section offsets are validated; capacity hints are clamped. Added truncation/bogus-offset/u32::MAX-count unit tests. (#18)register_replacementsexpansion is bounded. Replacement values are caller-controlled and unbounded; a small input with a large value could expand past the transliterate input cap. Output is now bounded during construction and rejected once it would exceedMAX_TRANSLITERATE_INPUT_BYTES. (#51)
Internal / tests¶
- 170 deterministic tests were excluded from CI. A module-level
pytestmark = pytest.mark.hypothesisintest_filename_regressions.pyandtest_case_folding.py(filename-security and case-folding regressions) deselected the entire files under CI's-m "not hypothesis"filter; only ~10 were actual property tests. The mark is now scoped to the property-test class in each file, so the deterministic tests run in CI. (#12) - New tests:
register_replacements(unit + Hypothesis property), context-dict parser robustness,resolve_auto_langfor all 18 scripts added in v0.3.0+, and aSCRIPT_METAfield-completeness guard. - CI/workflow hygiene: concurrency group on secret-scan,
uv.lockin the benchmark path filter, and CodeQL no longer triggered by Rust-only changes.
[0.5.0] — 2026-06-06¶
Added¶
- Context-aware transliteration for abjad scripts (Arabic, Persian, Hebrew).
transliterate(text, context=True)uses dictionary-based vowel restoration with bigram context disambiguation to produce readable romanized text instead of consonant skeletons. - Arabic: Tashkeela corpus (65.7M words), 182K unigrams + 200K bigrams. Covers 99%+ of newspaper vocabulary.
- Hebrew: Project Ben Yehuda corpus (11.4M words), 227K unigrams + 200K bigrams. Covers literary Hebrew.
- Persian: 266 curated common words + optional Wiktionary expansion (14.9K entries available via harvester script).
list_context_langs(): returns language codes that supportcontext=True(currently["ar", "fa", "he"]).LangMeta.contextfield:"full","partial", or"none"— enables web/WASM clients to show/hide a context toggle per language.ScriptMeta.context_awarefield:bool— enables toggle per detected script.- Dictionary build tooling:
scripts/build_arabic_dict.py— corpus-based Arabic dictionary builderscripts/build_hebrew_dict.py— corpus-based Hebrew dictionary builderscripts/build_persian_dict.py— curated vocabulary Persian builderscripts/harvest_wiktionary_persian.py— Wiktionary Persian harvesterscripts/bootstrap_dicts.sh— reproducible bootstrap from zero with pinned checksums. All parameters auditable, no manual steps.- Abjad transliteration documentation (
docs/user-guide/abjad-transliteration.md) covering all three languages, standards used, comparison with other systems. - pip extras:
pip install translit-rs[arabic],[hebrew],[context]for optional context dictionary installation. - Rust context engine (
src/context.rs): binary dictionary reader, Arabic/Hebrew tokenizer, three-tier resolve (bigram → unigram → context-free fallback), lazy-loaded global singletons viaOnceLock. - 28 context-aware tests (8 Arabic, 14 Persian, 6 Hebrew).
Changed¶
- Repositioning (docs + metadata only — no API or coverage changes). The project now leads with its differentiated, proven core: Unicode adversarial-text defense and canonicalization (TR39 visual confusable mapping), with standards-based Latin/Cyrillic/Greek transliteration as the supporting pillar and CJK/Indic/other scripts framed as best-effort, unidecode-compatible coverage.
- Rewrote the package description, keywords, and classifiers (added
Topic :: Security) acrosspyproject.toml,Cargo.toml, andmkdocs.ymlto surface the security use case for discovery. - Restructured
README.md/docs/index.mdto lead with defense; introduced an explicit three-tier coverage model (core / compatibility / best-effort). - Added an Adversarial-Text Defense guide (
docs/security/adversarial-defense.md) documenting the phonetic-vs-visual distinction, the XMR metric, and benchmark evidence; elevated security to a top-level docs navigation section. - Reframed the Unidecode migration guide: the
unidecodealias is for romanization compatibility, not security (it cannot reverse homoglyph attacks).
Fixed¶
- Linux x86_64 wheels are now built as
cp39-abi3instead of a version-specificcp38-cp38wheel. Previously the only published x86_64 Linux wheel targeted CPython 3.8, sopipfell back to a source build (requiring a Rust toolchain) on Linux x86_64 for Python 3.9+. The publish workflow now pins the build interpreter and guards against the regression. (#26) - Documentation: corrected the built-in language-profile count (inconsistently
reported as 64 in one place; now consistently 83), and fixed several homoglyph code
examples whose expected output was wrong (e.g. leading-character ordering in
strip_obfuscationexamples). All README/doc examples are now verified against the built library.
Security¶
- Pinned all third-party GitHub Actions to commit SHAs across the CI and release
workflows (resolves the CodeQL
actions/unpinned-tagfindings) and added.github/dependabot.ymlto keep them current. This hardens the release pipeline, which uses PyPI trusted publishing (id-token: write). - Bumped dev/docs dependencies flagged by Dependabot: Pygments → 2.20.0 and pytest → 9.0.3 (the pytest bump applies on Python ≥ 3.10; Python 3.9 stays on pytest 8.4.2, since pytest 9 requires ≥ 3.10). Both are development-only — the package has no runtime dependencies.
Notes¶
- No public API, language registry, or script coverage was removed. All existing imports, language codes, and the pinned API surface are unchanged.
[0.4.0] — 2026-03-29¶
Added¶
strip_obfuscation()preset pipeline: maximum-strength text deobfuscation using TR39 confusable mapping (visual similarity). Neutralizes homoglyph spoofing, zalgo abuse, invisible character injection, and bidi attacks. Does NOT transliterate — chain withtransliterate()explicitly if romanization is also needed. Pipeline: NFKC → strip_zalgo(max_marks=0) → confusables → strip_bidi → strip_zero_width → demojize → strip_accents → fold_case → collapse_whitespace.lang_info()andscript_info()APIs: return structured metadata (display name, script, region) for any language code or script. Backed byLANG_META(83 entries) andSCRIPT_META(55 entries) with import-time drift assertions.- 18 new language codes: ban (Balinese), bax (Bamum), bug (Buginese), chr (Cherokee), cjm (Cham), cop (Coptic), khb (Tai Lue), lis (Lisu), mni (Meitei), nod (Northern Thai), nqo (N'Ko), sat (Santali), su (Sundanese), syr (Syriac), tdd (Tai Le), tl (Tagalog), tzm (Tamazight), vai (Vai). Total: 83 languages.
- 10 new Script enum members: Bamum, Buginese, Cham, Lisu, MeeteiMayek, OlChiki, Sundanese, Tagalog, TaiTham, Tifinagh. Total: 57 scripts.
- Transliteration provenance documentation (
docs/provenance.md): per-block audit of which formal romanization standard each Unicode block follows. - API surface stability tests (
tests/test_api_stability.py): 133 tests locking down function signatures, class methods, enum members, TypedDicts, protocol interfaces, and__all__exports. - Mutation testing survivor killers (
tests/test_mutant_killers.py): 92 tests targeting forward-only parameter validation, default parameter sensitivity, pipeline step tuples, and boundary checks. - Language consistency audit (
scripts/audit_language_consistency.py): checks 11 registration points for Rust/Python/docs/test alignment. Wired into pre-push gate. - 283 empty-string mappings for combining marks and zero-width characters in
translit_default.tsv— these are now silently stripped instead of producing[?]. docs/index.mdis now generated fromREADME.mdviascripts/generate_docs_index.sh— single source of truth, no more drift.
Fixed¶
strip_obfuscation()homoglyph resolution: used phonetic transliteration (Cyrillic р→r, с→s) instead of TR39 visual confusable mapping (р→p, с→c). Removed transliterate from the pipeline; confusables now handles homoglyphs.- Combining marks produce
[?]:transliterate("n\u0303")returned"n[?]"instead of"n". Added empty-string TSV mappings for all Combining Diacritical Marks (U+0300–U+036F), Extended (U+1AB0–U+1AFF), Supplement (U+1DC0–U+1DFF), Symbols (U+20D0–U+20F0), and Half Marks (U+FE20–U+FE2F). - Zero-width characters produce
[?]:transliterate("a\u200Bb")returned"a[?]b". Added empty-string mappings for ZWS, ZWNJ, ZWJ, word joiner, BOM, soft hyphen, bidi marks, and line/paragraph separators. TextPipelineconfusable ordering: confusables ran before transliterate, creating mixed-script gibberish on Cyrillic/Greek input. Swapped execution order so transliterate runs first (matchingcatalog_keypreset).demojize()adjacent emoji concatenation:demojize("🔥🔥")returned"firefire"instead of"fire fire". Added space padding between adjacent emoji-to-text replacements.- SCRIPT_RANGES sort order: MeeteiMayek Extensions was misplaced, breaking
binary search for Ethiopic Extended-A. Added
test_script_ranges_sortedinvariant. - Tibetan incorrectly documented as Wylie: actual mappings use Indic-phonetic romanization (ཅ→cha, not Wylie's ca).
Changed¶
- BREAKING:
transliterate_batch(),slugify_batch(),normalize_batch(), andstrip_accents_batch()removed. The base functions now accept bothstrandlist[str]via@typing.overload. Pass a list to get batch processing:transliterate(["café", "naïve"])→["cafe", "naive"]. - BREAKING:
strip_obfuscation()no longer transliterates. Uses TR39 confusables (visual mapping) instead.lang=parameter removed. Chain withtransliterate()explicitly if romanization is also needed. - CI restructured: lint/test on PRs only (not push-to-main), hypothesis tests excluded (~4s vs ~46s), CodeQL moved to workflow file with path filtering, benchmarks split to own workflow.
- Pinned
ruff==0.15.4in CI andpyproject.tomlto prevent format drift. - Python 3.9 remains a supported runtime (
requires-python = ">=3.9", abi3-py39) but was removed from the release CI matrix; CI runs on Python 3.10+ because tests use PEP 604 (X | Y) syntax withoutfrom __future__ import annotations.
[0.3.0] — 2026-03-28¶
Added¶
- Unicode coverage expansion: 2,553 new codepoints across 33 Unicode blocks,
bringing total
translit_default.tsventries from 6,633 to 9,186.
Tier 1 — Forms and extensions (~1,741 codepoints): - Fullwidth ASCII (FF01–FF5E): 94 characters, mechanical offset mapping - Halfwidth Hangul (FFA0–FFDC): 66 characters via compatibility jamo - Enclosed/Circled Alphanumerics (2460–24FF): 160 characters (①→1, Ⓐ→A) - Superscript/Subscript (2070–209F): 29 characters mapped to base forms - Roman Numerals (2160–2188): 41 characters (Ⅰ→I, Ⅱ→II, ... Ⅻ→XII) - Modifier Letters (02B0–02FF): 80 characters (ʰ→h, ʷ→w) - IPA/Phonetic Extensions (0250–02AF): 96 characters (ɑ→a, ʃ→sh, ŋ→ng) - Greek Extended (1F00–1FFF): 233 characters (polytonic → base Greek → Latin) - Hangul Jamo (1100–11FF): 256 individual jamo components - Kangxi Radicals (2F00–2FD5): 214 radical forms → pinyin via CJK decomposition - CJK Compatibility Ideographs (F900–FAFF): 472 characters → pinyin via canonical decomposition targets
Tier 2 — Living scripts (~812 codepoints): - Gap-filling for 7 partially-covered scripts: Balinese, Canadian Syllabics, Cherokee, Coptic, N'Ko, Syriac, Vai - 10 new abugida scripts with virama/inherent-vowel handling: Sundanese, Tai Tham, Cham, Batak, Buginese, Tagalog, Hanunoo, Buhid, Tagbanwa, Meetei Mayek - 4 new alphabetic/syllabic scripts: Tifinagh, Lisu, Ol Chiki, Bamum
- Unicode range constants for 12 new scripts in
src/unicode_ranges.rs:SUNDANESE,TAI_THAM,CHAM,BATAK,BUGINESE,TAGALOG,HANUNOO,BUHID,TAGBANWA,MEETEI_MAYEK,MEETEI_MAYEK_EXT. - 10 new
*_char_role()functions insrc/transliterate.rsfor abugida virama handling (Sundanese, Tai Tham, Cham, Batak, Buginese, Tagalog, Hanunoo, Buhid, Tagbanwa, Meetei Mayek). scripts/generate_unicode_expansion.py: reproducible generator script for all Tier 1 and Tier 2 TSV entries (1,310 lines).cargo-clippypre-commit hook mirroring CI-D warningsto catch lints before push.- Callable module:
import translit; translit("Москва", lang="auto")now works as a shorthand fortranslit.transliterate(...). Uses in-place__class__mutation to preserveunittest.mock.patchcompatibility.
Fixed¶
- Finnish transliteration: removed incorrect alias
fi→sv. Finnish ä/ö are independent phonemes (→a/o via default table), not ae/oe variants as in Swedish/German.Hämäläinennow correctly producesHamalainen. - Icelandic transliteration: removed incorrect ð→dh and Ð→Dh overrides. Default table already maps ð→d (ICAO/passport standard). Retained Æ→Ae override (differs from default AE). Icelandic override count reduced from 6 to 2.
- clippy
manual_range_patternslint inbuginese_char_role: collapsed0x1A17 | 0x1A18 | 0x1A19..=0x1A1Bto0x1A17..=0x1A1B. errors="preserve"dropping visible characters: characters with explicit empty-string TSV mappings (e.g. U+060E Arabic Poetic Verse Sign, U+30FC Katakana Prolonged Sound Mark) are now preserved instead of silently dropped whenerrors="preserve"is set.
Changed¶
is_indic()andindic_char_role()expanded to cover all 11 new Brahmic/abugida script ranges.lookup_lang(): Finnish no longer dispatches to Swedish override table; falls through to default.- Icelandic language TSV (
translit_lang_is.tsv) reduced from 6 to 2 entries. ml_normalizepreset: switched transliteration fromPreservetoIgnoreerror mode — ML pipelines need clean ASCII output, not preserved non-ASCII.
[0.2.0] — 2026-03-27¶
Added¶
- Exhaustive testing framework — three layers of machine-verifiable assurance:
- Compile-time assertions (
build.rs): all transliteration table values asserted ASCII-only, entry count sanity checks (Hanzi ≥20k, BMP ≥5k, confusables ≥1k). Build fails if any assertion is violated. - Exhaustive domain tests (Rust): 16 tests covering all 11,172 Hangul syllables, full BMP (63,488 codepoints) for ASCII output and idempotence, all 20,992 CJK ideographs, all 51 compatibility jamo, and structural verification of 15 Indic script blocks. Zero sampling gaps.
- Stated invariant specifications (Python): 7 stated invariants (I1–I7) verified via exhaustive enumeration and Hypothesis — ASCII passthrough, ASCII output, idempotence, no exceptions, determinism, input size bound, output length bound.
- Two-tier test architecture: formal tests gated behind
#[ignore](Rust) and@pytest.mark.formal(Python) so they don't slow everyday development. Run before release withcargo test -- --ignoredandpytest -m formal. - CLAUDE.md: project-level development guide for automated agents — documents build commands, test tiers, and code conventions.
list_scripts()function for programmatic script discovery.docs/formal-verification.md: specification document for exhaustive testing methodology.- Comprehensive overhaul of
docs/architecture/testing-guarantees.mdwith exhaustive testing differentiator analysis and alternative library comparison.
Changed¶
IndicRoleenum andindic_char_role()/ script-specific char_role functions changed from private topubfor integration test access (parent modules remain#[doc(hidden)]).tables::hangulmodule changed frommodtopub modfor integration test access.- Hangul const assertions added:
JUNGSEONG_COUNT,JONGSEONG_COUNT, total syllable count, and compatibility jamo range verified at compile time. - Total test count: 2,900+ (up from 1,678 in 0.1.5).
[0.1.5] — 2026-03-27¶
Added¶
- Reverse transliteration:
transliterate(text, target="ru")converts Latin → native script for Russian, Ukrainian, and Greek. PHF tables generated at build time from inverted language TSV data. - Toned pinyin:
transliterate("北京", tones=True)returns"běi jīng"with tone marks. Toned readings sourced from UnihankMandarinfield for all 20,924 CJK Unified Ideographs. - ISO 9:1995 scholarly Cyrillic:
transliterate(text, strict_iso9=True)for scholarly romanization. GOST R 7.0.34 variant viagost7034=True. - Japanese Kunrei-shiki (
lang="ja-kunrei"): alternative romanization profile, bringing total language count to 65. - Ancient scripts: Coptic, Gothic, Old Italic, Runic, Ogham transliteration tables.
- CLI short aliases:
t(transliterate),s(slugify),n(normalize),p(pipeline),d(demojize) — e.g.translit t "café". - CLI
--targetflag:translit t --target ru "Moskva"for reverse transliteration. - CLI
--tones,--strict-iso9,--gost7034flags for transliterate subcommand. - CLI
--langflag for slugify subcommand. console_scriptsentry point:translitcommand available afterpip install translit-rs.docs/cli.md: comprehensive CLI documentation with piping, exit codes, examples.- Links section in README.md and docs/index.md for RTD ↔ GitHub cross-references.
Changed¶
transliterate()API unified:reverse_transliterate()merged intotransliterate()viatargetparameter. Old function removed.transliterate_implRust signature now takes 7 arguments (addedtones: bool).- Updated benchmark numbers after
tonesparameter addition (15–46% regression in transliteration hot path due to additional branch; throughput now 450M chars/sec Latin, 130M chars/sec Cyrillic). - Performance documentation updated across 4 files to reflect current benchmark results.
Fixed¶
- clippy
format_push_stringlint inbuild.rs— replacedpush_str(&format!())withwrite!(). - clippy
unreadable_literalin PHF-generatedreverse_translit_phf.rs— suppressed via inner attribute insrc/reverse.rs. - All 219 integration test call sites updated for 7-argument
transliterate_impl.
[0.1.4] — 2026-03-25¶
Added¶
lang="auto"script-based language detection: Whenlang="auto"is passed totransliterate(),slugify(),TextPipeline,Slugifier, or any other call site, the library detects the dominant non-Latin script in the input and maps it to a default language code automatically. Maps 28 scripts to language codes (e.g. Cyrillic→ru, Han→zh, Hiragana/Katakana→ja, Thai→th). Zero overhead forlang=Noneor explicit lang codes.LANG_AUTOconstant ("auto") intranslit._enums.- Georgian transliteration (
lang="ka"): 114 TSV entries covering Mkhedruli, Mtavruli, and supplement ranges. BGN/PCGN national romanization. - Armenian transliteration (
lang="hy"): 86 TSV entries covering uppercase, lowercase, and 5 ligatures (U+FB13–FB17). BGN/PCGN romanization. - Sinhala transliteration (
lang="si"): 90 TSV entries. Extended Indic Brahmic engine range from0x0900..=0x0D7Fto0x0900..=0x0DFFwith dedicatedsinhala_char_role()function for Sinhala-specific offsets. - Thai transliteration (
lang="th"): 87 TSV entries using RTGS romanization. NewScriptClass::Taiwith tone-mark stripping and cancellation handling. - Lao transliteration (
lang="lo"): 67 TSV entries using BGN/PCGN romanization. Shares Tai engine with Thai via offset masking. - Ethiopic transliteration (
lang="am"): 307 TSV entries for Ge'ez alphasyllabary (34 consonant bases × 7 vowel orders + labialized forms + digits). Pure data addition — no engine changes needed. - Myanmar transliteration (
lang="my"): 89 TSV entries. Newmyanmar_char_role()for Brahmic engine with virama (U+1039) and asat (U+103A) support. Medials (U+103B–103E) classified as dependent vowels. - Khmer transliteration (
lang="km"): 110 TSV entries. Newkhmer_char_role()for Brahmic engine with coeng (U+17D2) as virama. All consonants normalized to inherent 'a' regardless of series. - Tibetan transliteration (
lang="bo"): 147 TSV entries. Newtibetan_char_role()for Brahmic engine with halanta (U+0F84) and subjoined consonants (U+0F90–0FBC). - Unicode range constants:
TIBETAN(0x0F00–0x0FFF),MYANMAR(0x1000–0x109F),KHMER(0x1780–0x17FF) insrc/unicode_ranges.rs. - Comprehensive test coverage: example-based tests for all 9 new scripts, property-based tests (hypothesis + proptest), multi-script mixture tests.
- Built-in language count: 51 → 60.
Changed¶
is_indic()extended to include Tibetan, Myanmar, and Khmer ranges for Brahmic abugida processing.indic_char_role()dispatches to script-specific functions for Sinhala, Tibetan, Myanmar, and Khmer codepoint ranges.
[0.1.3] — 2026-03-25¶
Added¶
strip_controlandstrip_zero_widthnow work as independent pipeline steps without requiringcollapse_whitespace=True. Previously they were silently ignored whencollapse_whitespacewas disabled.strip_control_chars()andstrip_zero_width_chars()standalone Rust functions for filtering without whitespace collapsing.decimalandhexadecimalflags inSlugConfigare now functional. Settingdecimal=Falsepreserves&#NNN;entities;hexadecimal=Falsepreserves&#xHHH;entities. Previously these flags were accepted but silently ignored.- Rust integration tests:
tests/integration_emoji.rs(10 tests),tests/integration_slugify.rs(20 tests),tests/integration_transliterate.rs(21 tests),tests/integration_whitespace.rs(12 tests).
Changed¶
TextPipelineparametersstrip_controlandstrip_zero_widthchanged frombool(defaultTrue) tobool | None(defaultNone). WhenNone, they inherit fromcollapse_whitespace—Trueifcollapse_whitespace=True,Falseotherwise. Set explicitly toTruefor standalone use withoutcollapse_whitespace. This is backward compatible: existing code that passescollapse_whitespace=Truegets the same behavior as before.steps()now reportsstrip_controlandstrip_zero_widthas separate entries when active, giving full visibility into pipeline behavior.- Pipeline step order updated:
normalize → confusables → demojize → strip_accents → transliterate → fold_case → strip_control → strip_zero_width → collapse_whitespace. - Migrated from
once_celltostd::sync::LazyLock/OnceLock; MSRV bumped to 1.80. Removedonce_celldependency. needs_cjk_space()match arm tightened from wildcard_to explicitIdeograph | Hangul | Kanato match the call-siteis_cjkguard.
Fixed¶
decode_entities()corrupting multi-byte UTF-8 characters (BUG-1). The function usedbytes[i] as charwhich treated each continuation byte as a separate Latin-1 codepoint (e.g.café→café). Now advances by full UTF-8 characters.decode_numeric_entity_skip()panicking on malformed&#followed by multi-byte UTF-8 (BUG-2). The skip function walked through continuation bytes looking for;, landing inside a multi-byte character. Now stops at the first non-ASCII byte.
Performance¶
- ASCII fast-path in
demojize_implanddemojize_rust: pure-ASCII text returns immediately withoutVec<char>allocation or emoji scanning. filter_stopwordsreplaced intermediateVec<_>+.join()with a pre-allocatedStringfold, removing one allocation per slugify call.
[0.1.2] — 2026-03-25¶
Added¶
- Python 3.14 support (classifier and CI test matrix).
ruff check --fixpre-commit hook for automatic lint fixing.- CI publish workflow using
pypa/gh-action-pypi-publishwith OIDC trusted publishers. - Multi-platform wheel builds: Linux (x86_64, aarch64), macOS (Intel, ARM64), Windows.
steps()method on_TextPipelinetype stub.
Changed¶
- Resolved all clippy pedantic warnings instead of suppressing them — reduced
lint suppressions from 48 to 22 (remaining are genuine PyO3 constraints).
Fixes include: combined identical match arms, replaced manual counters with
.enumerate(), moved item declarations before statements, usedclone_into(), merged identical branches, fixed doc comment formatting. - Widened
stopwordsandreplacementstype stubs from stricttuple/listtoSequencefor better mypy compatibility. - Applied
ruff formatto all Python source and test files. - Switched docs publish from deprecated
maturin uploadtopypa/gh-action-pypi-publish. - macOS Intel wheels now cross-compiled on ARM64 runner (macos-14) instead of deprecated macos-13.
- CI doctests now run against installed package (not source tree) with explicit
shell: bashfor Windows compatibility.
Fixed¶
TextPipeline.explain()doctest: output format isnormalize (NFC)notnormalize (form=NFC).from __future__ import annotationsplacement in test files (must follow module docstring, not precede it).- Malformed HTML entity test expectation:
decode_entities("&#xyz;")correctly returns"", not"yz;". - Rust benchmark CI: target
bench_corebinary explicitly to avoid passing Criterion flags to the test harness. - Ruff lint fixes: unsorted imports in
test_encoding.py, unused importis_mixed_scriptintest_security_invariants.py. - Read the Docs trigger workflow: simplified curl status handling, graceful
warning when
RTD_TOKENis missing. - Removed incorrect PyPy classifier (abi3 is CPython-only).
[0.1.1] — 2026-03-25¶
Added¶
src/unicode_ranges.rs— named constants for all Unicode codepoint ranges used by the library, eliminating magic numbers scattered across modules.tests/test_concurrency.py— concurrent access tests forLANG_TABLESandHANGUL_CACHE, plus malformed Unicode input tests.- Code coverage reporting in CI (
pytest-cov, XML report uploaded as artifact). CLOCK$,KEYBD$,SCREEN$,COM0,LPT0added to Windows reserved filename list.casefold()alias forfold_case()— matchesstr.casefold()naming.remove_accents()alias forstrip_accents()— matches sklearn/ML ecosystem naming.- Compatibility parameter aliases:
replacement_text/max_lenonsanitize_filename()(pathvalidate),greedy/preferred_aliasesonis_confusable()(confusable_homoglyphs),delimitersondemojize()(emoji library). - Complete API documentation for 19 previously undocumented exported functions:
precompiled pipelines, grapheme clusters, encoding detection,
Textbuilder,is_safe_hostname,demojize,strip_bidi,EmojiProviderprotocol. - Three new API reference pages: Precompiled Pipelines, Grapheme Clusters, Encoding.
- "Guides by role" section in
docs/index.mdandREADME.md. - Performance section in
README.mdwith benchmark numbers. Scriptenum documentation expanded from 28 to all 41 members.
Changed¶
transliterate_implrefactored: capacity estimation extracted toestimate_capacity(), character classification toclassify_char(), and CJK spacing logic toneeds_cjk_space().- All
RwLockaccesses now recover from lock poisoning using.unwrap_or_else(|e| e.into_inner())instead of silently falling through. - Lambda closures in
_compat.pyreplaced with named inner functions for clarity. emoji.rswrite!()call no longer uses.unwrap()(infallible, documented with a// SAFETYcomment).- MkDocs theme switched from
materialtoreadthedocs. - All documentation references updated from "unirust" to "translit".
- Development status promoted from Alpha to Beta.
- Package renamed from
translittotranslit-rson PyPI (interim until PEP 541 grants thetranslitname). Python import remainsimport translit.
Fixed¶
- Type stub
_text.pyiimported from wrong module name (unirust→translit). - Type stub
_translit.pyimissingmin_confidenceparameter on_decode_to_utf8. - Type stub
_text.pyimissinggrapheme_split,grapheme_truncate,catalog_keymethods. security_clean()pipeline step order corrected in 5+ locations: strip_bidi runs before collapse_whitespace (matching Rust implementation).catalog_key()step order corrected: transliterate before strip_accents.- Stale PyO3 boundary overhead corrected from ~4µs to ~240ns in docs and code comments.
Deprecated¶
translit._compatawesome-slugify compatibility layer (Slugify,UniqueSlugify,slugify_*instances) — planned removal in v1.0.
[0.1.0] — 2026-01-01¶
Added¶
- Initial release.
- Unicode transliteration for 60 language profiles.
- Slugification, normalization, confusable detection, filename sanitization.
- Emoji demojization with ZWJ sequence support.
- Backward-compatible layers for Unidecode and awesome-slugify.