Output Encoders
Context-explicit output encoders — correct for a specific output sink, applied at the sink, exactly once.
These are deliberately standalone terminal functions, not pipeline steps: output encoding depends on the destination context (HTML element vs. attribute vs. URL component), which a context-free pipeline cannot know. Baking an encoder into a pipeline invites double-encoding, wrong-context escaping, and storing pre-escaped text.
They do not make disarm an XSS/injection framework — they are narrow, context-pinned encoders, the explicit exception to disarm's "not an output sanitizer" positioning. Run them at output, after (not instead of) the input-normalization layer.
escape_html
escape_html(text: str) -> str
Escape the five HTML metacharacters for element/quoted-attribute context.
& -> &amp;, < -> &lt;, > -> &gt;, " -> &quot;, and
' -> a character reference for the apostrophe. Everything else passes through unchanged.
Correct for HTML element-body and quoted-attribute context. It is not
correct inside <script>/<style>, unquoted attributes, URL/href/
src attributes, or HTML comments -- there, entity escaping is insufficient
or corrupting. This is a terminal output encoder: apply it at the sink,
exactly once. It is not idempotent (encoding twice double-encodes &),
and disarm is not an XSS framework -- see the Threat Model.
Parameters:
text (str): The string to escape.
Returns:
str: The escaped string.
Examples:
>>> escape_html("<b>a & b</b>")
'&lt;b&gt;a &amp; b&lt;/b&gt;'
>>> escape_html("plain text")
'plain text'
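The "exactly once" rule can be seen with a short sketch. It uses the standard library's `html.escape` as a stand-in for `escape_html`; that substitution is an assumption made for illustration, and the two functions may differ in detail (for example in how they encode the apostrophe).

```python
import html

raw = "Tom & Jerry <3"

# Encode once, at the sink: the browser shows the original text.
once = html.escape(raw, quote=True)

# Encode again (for example, escaping text that was stored pre-escaped):
# every "&" produced by the first pass is escaped a second time.
twice = html.escape(once, quote=True)

print(once)   # Tom &amp; Jerry &lt;3
print(twice)  # Tom &amp;amp; Jerry &amp;lt;3

# Decoding once recovers the input only from the singly encoded form.
print(html.unescape(once) == raw)   # True
print(html.unescape(twice) == raw)  # False
```

Storing the raw text and encoding only when rendering avoids the second case.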
percent_encode
percent_encode(text: str, *, component: Component) -> str
RFC 3986 percent-encode text for a named URL component.
The input is UTF-8 encoded first, then every byte outside the component's
safe set becomes %XX (é -> %C3%A9); the output is
pure ASCII. component is required because the safe set depends on where
the value is placed (Component: PATH/SEGMENT/QUERY/
FORM; FORM uses application/x-www-form-urlencoded space -> +).
Percent-encoding is not a defense against javascript:/data:
scheme injection or open redirects -- those are URL-construction concerns,
out of scope. Apply at the output sink, exactly once.
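Parameters:
text (str): The string to encode.
component (Component): The URL component the value is placed in. Required, keyword-only.
Returns:
str: The percent-encoded, pure-ASCII string.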
Examples:
>>> from disarm import Component
>>> percent_encode("a b&c", component=Component.QUERY)
'a%20b%26c'
>>> percent_encode("a b&c", component=Component.FORM)
'a+b%26c'
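Why the safe set must depend on the component can be sketched with the standard library's `urllib.parse.quote` and `quote_plus`. These are stand-ins chosen for illustration; their safe sets are an assumption and are not necessarily identical to those behind `Component.PATH`, `SEGMENT`, `QUERY`, or `FORM`.

```python
from urllib.parse import quote, quote_plus

value = "a b&c/é"

# Nothing reserved is safe: suitable when the value must stay one opaque unit.
as_single_unit = quote(value, safe="")

# "/" kept literal: suitable when the value is a multi-segment path.
as_path = quote(value, safe="/")

# Form style: space becomes "+", everything else reserved is %XX.
as_form = quote_plus(value, safe="")

print(as_single_unit)  # a%20b%26c%2F%C3%A9
print(as_path)         # a%20b%26c/%C3%A9
print(as_form)         # a+b%26c%2F%C3%A9
```

The same input yields three different encodings, which is why a default component would be wrong for at least some call sites.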
Component
Bases: Enum
URL component for disarm.percent_encode.
Selects the RFC 3986 safe set; the encoding differs by where the value is placed, so the component must be stated explicitly (there is no default).
strip_log_injection
Neutralizes the characters that let untrusted text forge log records (CRLF / NEL / LS / PS), corrupt parsers (NUL / C0 / C1 controls), or hijack a terminal that views the log (ANSI escape introducers / DEL). It makes a log line safe to write — it owns the log-record and operator-terminal sinks.
It makes no HTML-log-viewer-safety claim: rendering attacker text in an HTML dashboard (Kibana/Grafana) is stored/second-order XSS, which the viewer must output-encode with escape_html. It preserves <, >, & precisely so nothing mistakes it for viewer-safe output, and it is not a defense against logging-framework interpolation (log4shell). See the Threat Model.
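The split of responsibilities can be sketched as follows: the log line is stored with `<`, `>`, and `&` intact, and the viewer encodes it when it builds HTML. The helper name `render_log_line_as_html` is hypothetical, and `html.escape` from the standard library stands in for `escape_html`; both are assumptions for illustration.

```python
import html


def render_log_line_as_html(line: str) -> str:
    """Hypothetical viewer-side helper: HTML-escape a stored log line at render time."""
    return "<li>" + html.escape(line, quote=True) + "</li>"


# A line that is safe to write to a log file still carries markup characters.
stored_line = "user=<script>alert(1)</script> status=ok"

print(render_log_line_as_html(stored_line))
# <li>user=&lt;script&gt;alert(1)&lt;/script&gt; status=ok</li>
```

Encoding happens in the viewer because only the viewer knows the text is about to land in an HTML element body.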
strip_log_injection(text: str, *, replacement: str = '�', keep_tab: bool = False) -> str
Neutralize log-injection / terminal-control characters in text.
Replaces -- rather than dropping, so a redaction stays visible -- every CR,
LF, NEL (U+0085), LS (U+2028), PS (U+2029), NUL, C0/C1 control, ESC, and DEL
with ``replacement`` (default U+FFFD; pass ``replacement=""`` to drop). ``\t`` is **also** neutralized by
default (``keep_tab=False``): a tab is a field separator in TSV/logfmt logs,
so keeping it permits column injection; pass ``keep_tab=True`` for
human-readable tabular logs. ANSI escape sequences are neutralized by
replacing their introducer (``ESC``), leaving the inert ``[31m`` residue.
Idempotent; the output never contains a raw CR/LF/ESC. This makes a log line
safe to *write*, not safe to later *render as HTML*: it is **not** an
HTML/SQL output sanitizer (it preserves ``< > &`` -- encode those at the log
*viewer* with :func:`escape_html`), and **not** a defense against
logging-framework interpolation (log4shell). See the Threat Model.
Args:
text: The (untrusted) string destined for a log line.
replacement: String substituted for each neutralized character (``""``
drops them). Must not itself contain a neutralized character (else
``DisarmError``).
keep_tab: Keep ``\t`` instead of neutralizing it.
Returns:
The neutralized string (the original object when nothing needs it).
Examples:
>>> strip_log_injection("user=admin\nFAKE LOG ENTRY")
'user=admin�FAKE LOG ENTRY'
>>> strip_log_injection("a\x1b[31mb")
'a�[31mb'
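The behaviour described above can be approximated in a few lines. This is a minimal sketch, not disarm's implementation: the function name `neutralize_log_text` is hypothetical, the character classes are inferred from the description (C0 controls, DEL, C1 controls, LS, PS), and `ValueError` stands in for ``DisarmError``.

```python
import re

# C0 controls (NUL, TAB, LF, CR, ESC, ...), DEL, C1 controls (incl. NEL U+0085),
# plus LS (U+2028) and PS (U+2029).
_WITH_TAB = re.compile(r"[\x00-\x1f\x7f-\x9f\u2028\u2029]")
# Same set, but with TAB (U+0009) left out of the C0 range.
_KEEP_TAB = re.compile(r"[\x00-\x08\x0a-\x1f\x7f-\x9f\u2028\u2029]")


def neutralize_log_text(
    text: str, *, replacement: str = "\ufffd", keep_tab: bool = False
) -> str:
    """Replace each log-forging or terminal-control character with `replacement`."""
    pattern = _KEEP_TAB if keep_tab else _WITH_TAB
    # A replacement containing a neutralized character would break idempotence.
    if pattern.search(replacement):
        raise ValueError("replacement must not contain a neutralized character")
    # A callable avoids backslash processing of the replacement string.
    return pattern.sub(lambda _match: replacement, text)


print(neutralize_log_text("user=admin\nFAKE LOG ENTRY"))
print(neutralize_log_text("a\tb", keep_tab=True) == "a\tb")  # True
```

Because U+FFFD is not itself in the neutralized set, running the function on its own output changes nothing.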