Encoding Detection & Decoding
Functions for detecting and converting byte sequences to UTF-8. Uses the chardetng algorithm (Firefox's encoding detector) for auto-detection.
detect_encoding
detect_encoding(data: bytes) -> tuple[str, float]
Detect the encoding of a byte sequence.
Returns (encoding_name, confidence), where confidence is a float in the range 0.0–1.0.
Note (#194): chardetng (since the 1.0 migration, #164) does not expose a
graded score — it reports a fixed confidence of 0.95 for every
successful detection. The float is kept for API stability and to align with
chardet-style ranges, but callers cannot use it to rank detection quality.
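| Section | Details |
|---|---|
| Parameters: | `data` (`bytes`): the byte sequence to inspect. |
| Returns: | `tuple[str, float]`: `(encoding_name, confidence)`. |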
Examples:

```python
>>> enc, conf = detect_encoding(b"Hello World")
>>> enc
'UTF-8'
```

```python
from disarm import detect_encoding

enc, confidence = detect_encoding(b"Hello World")
# enc = "UTF-8", confidence = 0.95

# Windows-1252 encoded text
enc, confidence = detect_encoding("café".encode("windows-1252"))
# enc = "windows-1252", confidence = 0.95
```
Warning
Automatic encoding detection is inherently probabilistic. A high confidence score does not guarantee correctness. For critical pipelines, always prefer explicit encoding metadata (HTTP headers, BOM, schema definitions) over detection.
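One way to act on this warning is to consult explicit metadata first and fall back to detection only as a last resort. The sketch below is illustrative and not part of disarm: `pick_encoding`, `fake_detect`, and the `detect` parameter are hypothetical names, and the detector is passed in as a callable with the same return shape as `detect_encoding` so the example runs without the library installed.

```python
import codecs
from typing import Callable, Optional, Tuple

# Byte-order marks for the Unicode encodings this sketch recognises.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)

Detector = Callable[[bytes], Tuple[str, float]]


def pick_encoding(data: bytes, declared: Optional[str], detect: Detector) -> Tuple[str, str]:
    """Return (encoding, source), preferring explicit metadata over guessing."""
    # 1. A BOM is explicit, in-band metadata.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, "bom"
    # 2. Out-of-band metadata, e.g. a charset from an HTTP header or a schema.
    if declared:
        return declared, "declared"
    # 3. Last resort: a probabilistic guess.
    guessed, _confidence = detect(data)
    return guessed, "detected"


# Hypothetical stand-in detector with the same return shape as detect_encoding.
def fake_detect(data: bytes) -> Tuple[str, float]:
    return "windows-1252", 0.95


print(pick_encoding(codecs.BOM_UTF8 + b"hi", None, fake_detect))  # ('UTF-8', 'bom')
print(pick_encoding(b"caf\xe9", "ISO-8859-1", fake_detect))       # ('ISO-8859-1', 'declared')
print(pick_encoding(b"caf\xe9", None, fake_detect))               # ('windows-1252', 'detected')
```

Returning the source alongside the encoding lets a pipeline log or reject values that came from a guess rather than from metadata.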
Confidence is a fixed 0.95 (#194)
chardetng (since the 1.0 migration) does not expose a graded score — it reports a fixed 0.95 for every successful detection. The value is kept for API stability, but you cannot use it to rank detection quality. Consequently min_confidence below is an accept/reject switch, not a quality threshold.
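Because the reported confidence never varies, the threshold check reduces to a single comparison. The following is a minimal model of that behaviour, not disarm's actual code; `FIXED_CONFIDENCE` and `detection_accepted` are hypothetical names introduced for illustration.

```python
FIXED_CONFIDENCE = 0.95  # the constant value detection is documented to report


def detection_accepted(min_confidence: float) -> bool:
    """Model of the accept/reject switch: accept if 0.95 meets the threshold."""
    return FIXED_CONFIDENCE >= min_confidence


for threshold in (0.0, 0.5, 0.95, 0.96, 1.0):
    print(threshold, detection_accepted(threshold))
# 0.0 True
# 0.5 True
# 0.95 True
# 0.96 False
# 1.0 False
```

Every threshold at or below 0.95 behaves identically, which is why tuning the value within that range has no effect.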
decode_to_utf8
decode_to_utf8(data: bytes, encoding: str | None = None, *, min_confidence: float = 0.95, strict: bool = False) -> tuple[str, bool]
Decode a byte sequence to UTF-8.
Returns (decoded_text, had_errors) where had_errors is True if a U+FFFD replacement character was inserted during decoding.
had_errors=False is not a fidelity guarantee: single-byte encodings
such as windows-1252 map every byte to some codepoint without ever inserting
U+FFFD, so a wrong-encoding decode can produce mojibake with
had_errors=False and no exception. For critical data, prefer explicit
encoding metadata over auto-detection (and see strict below).
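The effect is easy to reproduce with Python's built-in codecs, used here as a stand-in for decode_to_utf8. This is an assumption for illustration only: Python's `cp1252` codec is not identical to the WHATWG windows-1252 decoder. UTF-8 bytes decoded as windows-1252 produce mojibake without a single U+FFFD:

```python
utf8_bytes = "café".encode("utf-8")  # b'caf\xc3\xa9'

# Decode with the wrong single-byte encoding, replacing undecodable bytes.
wrong = utf8_bytes.decode("cp1252", errors="replace")
had_errors = "\ufffd" in wrong  # mirrors the had_errors definition above

print(wrong)       # cafÃ©
print(had_errors)  # False

# The correct encoding recovers the original text.
right = utf8_bytes.decode("utf-8")
print(right)       # café
```

Both bytes of the UTF-8 sequence for "é" are valid windows-1252 characters, so the decoder has nothing to flag.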
If encoding is None, auto-detects using the chardetng algorithm. Note that
min_confidence is effectively a binary accept/reject knob (see #194 and
the argument docs below), not a quality grade.
Supports all WHATWG encodings (UTF-8, windows-1252, ISO-8859-1, Shift_JIS, EUC-JP, EUC-KR, Big5, GB18030, etc.).
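| Section | Details |
|---|---|
| Parameters: | `data` (`bytes`): the byte sequence to decode. |
| | `encoding` (`str \| None`, default `None`): explicit encoding name; `None` triggers auto-detection. |
| | `min_confidence` (`float`, default `0.95`): accept/reject switch for auto-detection. Values `<= 0.95` accept every guess; values `> 0.95` refuse auto-detection. |
| | `strict` (`bool`, default `False`): raise when a U+FFFD replacement character would be inserted. |
| Returns: | `tuple[str, bool]`: `(decoded_text, had_errors)`. |
| Raises: | `DisarmError` when auto-detection is refused because its fixed 0.95 confidence is below `min_confidence`. |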
Examples:

```python
>>> text, had_errors = decode_to_utf8(b"caf\xe9", "windows-1252")
>>> text
'café'
>>> had_errors
False
```

```python
from disarm import decode_to_utf8

# Explicit encoding
text, had_errors = decode_to_utf8(b"caf\xe9", "windows-1252")
# text = "café", had_errors = False

# Placeholder input for the auto-detection calls below
raw_bytes = "café".encode("windows-1252")

# Auto-detection (accepts the guess; detection confidence is always 0.95)
text, had_errors = decode_to_utf8(raw_bytes)

# min_confidence is an accept/reject switch, not a quality grade (#194):
# any value > 0.95 refuses auto-detection outright (use 1.0 to require an
# explicit encoding); any value <= 0.95 (the 0.95 default) accepts every guess.
text, had_errors = decode_to_utf8(raw_bytes, min_confidence=1.0)
# Raises DisarmError: detection's fixed 0.95 is below the required 1.0
```
had_errors=False is not a fidelity guarantee: windows-1252 and similar single-byte encodings map every byte to a codepoint without inserting U+FFFD, so a wrong-encoding decode yields mojibake with had_errors=False. Passing strict=True raises on U+FFFD insertion, so it cannot catch that case either; prefer explicit encoding metadata for critical data.