Encoding Detection & Decoding

Functions for detecting the encoding of byte sequences and converting them to UTF-8. Auto-detection uses the chardetng algorithm (Firefox's encoding detector).

detect_encoding

detect_encoding(data: bytes) -> tuple[str, float]

Detect the encoding of a byte sequence.

Returns (encoding_name, confidence) where confidence is 0.0–1.0. Uses the chardetng algorithm (Firefox's encoding detector).

Note (#194): chardetng (since the 1.0 migration, #164) does not expose a graded score — it reports a fixed confidence of 0.95 for every successful detection. The float is kept for API stability and to align with chardet-style ranges, but callers cannot use it to rank detection quality.

Important: automatic encoding detection is inherently probabilistic. A high confidence score does NOT guarantee correctness. For critical pipelines, always prefer explicit encoding metadata over detection.

Parameters:	`data` (`bytes`) – Raw byte sequence to analyze.

Returns:	`tuple[str, float]` – Tuple of (encoding_name, confidence) where confidence is 0.0–1.0.

Raises:	`DisarmError` – If the byte sequence cannot be analyzed.

Examples:

>>> enc, conf = detect_encoding(b"Hello World")
>>> enc
'UTF-8'

from disarm import detect_encoding

enc, confidence = detect_encoding(b"Hello World")
# enc = "UTF-8", confidence = 0.95

# Windows-1252 encoded text
enc, confidence = detect_encoding("café".encode("windows-1252"))
# enc = "windows-1252", confidence = 0.95

Warning

Automatic encoding detection is inherently probabilistic. A high confidence score does not guarantee correctness. For critical pipelines, always prefer explicit encoding metadata (HTTP headers, BOM, schema definitions) over detection.
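
One way to act on this advice is to consult explicit metadata first and fall back to detection only when nothing else is available. The sketch below is illustrative and not part of disarm: `sniff_bom` and `choose_encoding` are hypothetical helper names, `declared` stands in for whatever your transport or schema supplies (for example an HTTP charset), `detect` is any callable with the same return shape as detect_encoding, and the priority order (declared value, then BOM, then detection) is an assumption you may want to change.

```python
from __future__ import annotations

import codecs
from typing import Callable

# Byte-order marks for the Unicode encodings that announce themselves this way.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
)


def sniff_bom(data: bytes) -> str | None:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def choose_encoding(
    data: bytes,
    declared: str | None = None,
    detect: Callable[[bytes], tuple[str, float]] | None = None,
) -> str | None:
    """Prefer declared metadata, then a BOM, and only then a detector's guess."""
    if declared:
        return declared
    from_bom = sniff_bom(data)
    if from_bom is not None:
        return from_bom
    if detect is not None:
        # The confidence is ignored here because it carries no ranking signal.
        name, _confidence = detect(data)
        return name
    return None
```

The returned name could then be passed as the explicit `encoding` argument when decoding, so that detection is used only as a last resort.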

Confidence is a fixed 0.95 (#194)

chardetng (since the 1.0 migration) does not expose a graded score — it reports a fixed 0.95 for every successful detection. The value is kept for API stability, but you cannot use it to rank detection quality. Consequently min_confidence below is an accept/reject switch, not a quality threshold.
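
The accept/reject behaviour can be modelled in a few lines. This is a sketch of the rule as described above, not disarm's own code; `FIXED_CONFIDENCE` and `detection_accepted` are illustrative names.

```python
# Model of the gate: the detector always reports 0.95, and a guess is
# accepted unless that fixed value is below the requested min_confidence.
FIXED_CONFIDENCE = 0.95


def detection_accepted(min_confidence: float) -> bool:
    """Return True if an auto-detected guess would pass the threshold."""
    return FIXED_CONFIDENCE >= min_confidence


# Thresholds up to and including 0.95 accept; anything above rejects.
for threshold in (0.0, 0.5, 0.95, 0.96, 1.0):
    print(threshold, detection_accepted(threshold))
```

Only the last two thresholds in the loop are rejected, which is why the parameter behaves as a switch and not as a quality dial.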

decode_to_utf8

decode_to_utf8(data: bytes, encoding: str | None = None, *, min_confidence: float = 0.95, strict: bool = False) -> tuple[str, bool]

Decode a byte sequence to UTF-8.

Returns (decoded_text, had_errors) where had_errors is True if a U+FFFD replacement character was inserted during decoding.

had_errors=False is not a fidelity guarantee: single-byte encodings such as windows-1252 map every byte to some codepoint without ever inserting U+FFFD, so a wrong-encoding decode can produce mojibake with had_errors=False and no exception. For critical data, prefer explicit encoding metadata over auto-detection (and see strict below).

If encoding is None, auto-detects using the chardetng algorithm. Note that min_confidence is effectively a binary accept/reject knob (see #194 and the argument docs below), not a quality grade.

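Supports all WHATWG encodings (UTF-8, windows-1252, ISO-8859-1, Shift_JIS, EUC-JP, EUC-KR, Big5, GB18030, etc.). Note that the WHATWG Encoding Standard treats ISO-8859-1 as a label for windows-1252, not as a separate encoding.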

Parameters:

`data` (`bytes`) – Raw byte sequence to decode.

`encoding` (`str | None`, default: `None`) – Encoding name (e.g. "windows-1252"). None to auto-detect.

`min_confidence` (`float`, default: `0.95`) – Confidence threshold (0.0–1.0) applied when auto-detecting; raises DisarmError if the detected confidence is below it. When encoding is given explicitly the confidence gate is bypassed (nothing is detected), but the value is still range-validated: an out-of-range min_confidence raises regardless (#217).

Effectively a binary knob (#194). Since the chardetng 1.0 migration (#164) the detector reports a fixed 0.95 for every successful detection, so min_confidence cannot grade detection quality: any value <= 0.95 (including the 0.95 default) accepts every guess, and any value > 0.95 (e.g. 1.0) rejects auto-detection outright. The default therefore does not reject low-quality detections. To require high-quality input, pass the encoding explicitly instead of relying on this threshold. Pass 0.0 to be explicit about accepting any guess.

`strict` (`bool`, default: `False`) – When True, raise DisarmError instead of silently returning had_errors=True if the input contains byte sequences that decode to the U+FFFD replacement character (#189). Use this to turn lossy decodes, a common source of silent data loss, into a hard failure. Note that had_errors is a replacement-character flag, not a full fidelity guarantee (see the module docs), so strict catches malformed input, not every lossy remapping.

Returns:	`tuple[str, bool]` – Tuple of (decoded_text, had_errors). With `strict=True` the second element is always `False` (any error raises instead).

Raises:	`DisarmError` – If the encoding name is unknown, decoding fails, auto-detection confidence is below min_confidence, or `strict=True` and the decode was lossy.
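
The interaction between had_errors and strict can be approximated with Python's built-in codecs. The sketch below is a stand-in under stated assumptions: `decode_with_flag` is a hypothetical helper, it raises ValueError where disarm raises DisarmError, and Python's codec tables are not identical to the WHATWG ones.

```python
from __future__ import annotations


def decode_with_flag(data: bytes, encoding: str, strict: bool = False) -> tuple[str, bool]:
    """Decode data and report whether any byte sequence had to be replaced."""
    try:
        # Clean path: every byte sequence is valid in the given encoding.
        return data.decode(encoding), False
    except UnicodeDecodeError as exc:
        if strict:
            # Stand-in for DisarmError: refuse a lossy decode outright.
            raise ValueError(f"lossy decode as {encoding}") from exc
        # Lenient path: malformed sequences become U+FFFD.
        return data.decode(encoding, errors="replace"), True


# Well-formed UTF-8 decodes cleanly; a stray 0xE9 byte does not.
clean = decode_with_flag(b"caf\xc3\xa9", "utf-8")
lossy = decode_with_flag(b"caf\xe9", "utf-8")
```

Here `clean` carries had_errors=False and `lossy` carries had_errors=True with a U+FFFD in place of the bad byte; with strict=True the second call raises instead of returning.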

Examples:

>>> text, had_errors = decode_to_utf8(b"caf\xe9", "windows-1252")
>>> text
'café'
>>> had_errors
False

from disarm import decode_to_utf8

# Explicit encoding
text, had_errors = decode_to_utf8(b"caf\xe9", "windows-1252")
# text = "café", had_errors = False

# Auto-detection (accepts the guess; detection confidence is always 0.95)
raw_bytes = b"Hello World"  # example input; substitute your own bytes
text, had_errors = decode_to_utf8(raw_bytes)

# min_confidence is an accept/reject switch, not a quality grade (#194):
# any value > 0.95 refuses auto-detection outright (use 1.0 to require an
# explicit encoding); any value <= 0.95 (the 0.95 default) accepts every guess.
text, had_errors = decode_to_utf8(raw_bytes, min_confidence=1.0)
# Raises DisarmError: detection's fixed 0.95 is below the required 1.0

had_errors=False is not a fidelity guarantee — windows-1252 and other single-byte encodings map every byte to a codepoint without inserting U+FFFD, so a wrong-encoding decode yields mojibake with had_errors=False. Pass strict=True to raise on U+FFFD insertion, but prefer explicit encoding metadata for critical data.
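
The pitfall is easy to reproduce with Python's standard codecs alone. The sketch below does not call disarm; it uses the built-in cp1252 codec as a stand-in for windows-1252 to show a wrong-encoding decode that produces no U+FFFD.

```python
# UTF-8 bytes for "café", mistakenly decoded as windows-1252 (Python's cp1252).
utf8_bytes = "café".encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

# The two UTF-8 bytes of "é" (0xC3 0xA9) each map to a valid cp1252 character,
# so the result is wrong but contains no replacement character.
has_replacement = "\ufffd" in mojibake

print(ascii(mojibake))
print(has_replacement)
```

Because every byte found a mapping, a check for U+FFFD cannot tell this decode apart from a correct one.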

Supports all WHATWG encodings: UTF-8, windows-1252, ISO-8859-1, Shift_JIS, EUC-JP, EUC-KR, Big5, GB18030, and more.

Malformed `str` input (lone surrogates)

decode_to_utf8 handles raw bytes. A separate, always-on contract covers a str that itself carries unpaired surrogates (e.g. from surrogatepass / WTF-8 decoding). Such a string cannot be encoded to UTF-8 for the Rust core, so every text entry point converts it from WTF-8 to UTF-8 at the boundary: a well-formed high+low pair recombines into its astral scalar, and each lone surrogate code unit becomes exactly one U+FFFD. No call raises, and the result equals that of the same call on the scrubbed string. This neutralization is silently lossy: the U+FFFD is terminal, and the original code unit cannot be recovered. The full contract, uniform across the Python, Node, and Ruby bindings, is in the Threat Model.
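
The per-code-unit rule can be sketched in pure Python. `scrub_surrogates` is a hypothetical name used for illustration only; the real conversion happens inside disarm at the binding boundary, and this sketch only mirrors the behaviour described above.

```python
def scrub_surrogates(text: str) -> str:
    """Recombine valid surrogate pairs and replace lone surrogates with U+FFFD."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        cp = ord(text[i])
        is_high = 0xD800 <= cp <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            # High + low pair: rebuild the astral scalar value.
            low = ord(text[i + 1])
            out.append(chr(0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        elif 0xD800 <= cp <= 0xDFFF:
            # Lone surrogate code unit: exactly one replacement character.
            out.append("\ufffd")
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# A lone high surrogate is replaced; a well-formed pair becomes U+1F600.
print(ascii(scrub_surrogates("a\ud800b")))
print(ascii(scrub_surrogates("\ud83d\ude00")))
```

The scrubbed result always encodes to UTF-8 without error, which is the property the boundary conversion needs.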