Output Encoders

Context-explicit output encoders — correct for a specific output sink, applied at the sink, exactly once.

These are deliberately standalone terminal functions, not pipeline steps: output encoding depends on the destination context (HTML element vs. attribute vs. URL component), which a context-free pipeline cannot know. Baking an encoder into a pipeline invites double-encoding, wrong-context escaping, and storing pre-escaped text.

They do not make disarm an XSS/injection framework — they are narrow, context-pinned encoders, the explicit exception to disarm's "not an output sanitizer" positioning. Run them at output, after (not instead of) the input-normalization layer.

escape_html

escape_html(text: str) -> str

Escape the five HTML metacharacters for element/quoted-attribute context.

& -> &amp;, < -> &lt;, > -> &gt;, " -> &quot;, ' -> &#x27;. Everything else passes through unchanged.

Correct for HTML element-body and quoted-attribute context. It is not correct inside <script>/<style>, unquoted attributes, URL/href/src attributes, or HTML comments -- there, entity escaping is insufficient or corrupting. This is a terminal output encoder: apply it at the sink, exactly once. It is not idempotent (encoding twice double-encodes &), and disarm is not an XSS framework -- see the Threat Model.

Parameters:
  • text (str) –

    The string to escape.

Returns:
  • str

    The escaped string (the original object when nothing needs escaping).

Examples:

>>> escape_html("<b>a & b</b>")
'&lt;b&gt;a &amp; b&lt;/b&gt;'
>>> escape_html("plain text")
'plain text'
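
The exactly-once rule and the context limits can be checked with the standard library alone. This is a minimal sketch, assuming Python's html.escape (with its default quote=True) as a stand-in because it performs the same five replacements listed above; it is not disarm's implementation, and render_cell is a hypothetical helper:

```python
from html import escape  # stdlib stand-in for escape_html (assumption)

raw = "<b>a & b</b>"

once = escape(raw)    # encode at the sink, one time
twice = escape(once)  # encoding again re-escapes every "&"

# An unquoted attribute value needs no metacharacter to break out:
# none of the five characters occur, so the text passes through unchanged.
payload = "x onmouseover=alert(1)"
unquoted = "<input value=" + escape(payload) + ">"


def render_cell(stored_text: str) -> str:
    """Hypothetical sink: keep raw text in storage, encode only while rendering."""
    return "<td>" + escape(stored_text) + "</td>"


print(once)
print(twice)
print(unquoted)
print(render_cell(raw))
```

The second print shows the double-encoding that results from escaping stored, already-escaped text; the third shows why entity escaping alone does not protect an unquoted attribute.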

percent_encode

percent_encode(text: str, *, component: Component) -> str

RFC 3986 percent-encode text for a named URL component.

The input is UTF-8 encoded first, then every byte outside the component's safe set becomes %XX (é, e with an acute accent, -> %C3%A9); the output is pure ASCII. component is required because the safe set depends on where the value is placed (Component: PATH/SEGMENT/QUERY/FORM). FORM follows application/x-www-form-urlencoded, so a space becomes +.

Percent-encoding is not a defense against javascript:/data: scheme injection or open redirects -- those are URL-construction concerns, out of scope. Apply at the output sink, exactly once.

Parameters:
  • text (str) –

    The string to encode.

  • component (Component) –

    Which URL component the value will be placed in.

Returns:
  • str

    The percent-encoded ASCII string.

Examples:

>>> from disarm import Component
>>> percent_encode("a b&c", component=Component.QUERY)
'a%20b%26c'
>>> percent_encode("a b&c", component=Component.FORM)
'a+b%26c'
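
Why the placement matters can be seen with the standard library. This is a minimal sketch, assuming urllib.parse.quote and quote_plus as stand-ins; the safe arguments are illustrative and are not claimed to match the safe sets disarm uses for each Component:

```python
from urllib.parse import quote, quote_plus  # stdlib stand-ins (assumption)

# UTF-8 first, then %XX per byte: "é" is the two bytes C3 A9.
accent = quote("é", safe="")

# Same text, two placements: the safe set changes the result.
as_path = quote("a/b c", safe="/")    # "/" kept as a path separator
as_segment = quote("a/b c", safe="")  # "/" is data inside a single segment

# Form encoding writes a space as "+"; plain percent-encoding uses %20.
as_query = quote("a b&c", safe="")
as_form = quote_plus("a b&c")

for label, value in [
    ("accent", accent),
    ("path", as_path),
    ("segment", as_segment),
    ("query", as_query),
    ("form", as_form),
]:
    print(f"{label:8} {value}")
```

Encoding a segment with the path-level safe set would leave "/" intact and let the value add path levels, which is the kind of wrong-context mistake an explicit component argument is meant to prevent.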

Component

Bases: Enum

URL component for disarm.percent_encode.

Selects the RFC 3986 safe set; the encoding differs by where the value is placed, so the component must be stated explicitly (there is no default).
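
The "no default" design can be illustrated with a keyword-only parameter. This is a sketch under stated assumptions: Placement and encode_for are hypothetical stand-ins (only the four member names come from the text above), the enum values are arbitrary, and the safe sets are illustrative rather than disarm's:

```python
from enum import Enum, auto
from urllib.parse import quote, quote_plus


class Placement(Enum):
    """Hypothetical stand-in for Component; member values are assumptions."""

    PATH = auto()
    SEGMENT = auto()
    QUERY = auto()
    FORM = auto()


def encode_for(text: str, *, placement: Placement) -> str:
    """Keyword-only and without a default, so callers must name the placement."""
    if placement is Placement.FORM:
        return quote_plus(text)
    # Illustrative safe sets only: keep "/" for a whole path, nothing otherwise.
    return quote(text, safe="/" if placement is Placement.PATH else "")


try:
    encode_for("a b")  # placement omitted on purpose
except TypeError as exc:
    missing = str(exc)

print(missing)
print(encode_for("a b", placement=Placement.QUERY))
print(encode_for("a b", placement=Placement.FORM))
```

Omitting the argument fails at the call site instead of silently picking a safe set that may be wrong for the destination.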


strip_log_injection

Neutralizes the characters that let untrusted text forge log records (CRLF / NEL / LS / PS), corrupt parsers (NUL / C0 / C1 controls), or hijack a terminal that views the log (ANSI escape introducers / DEL). It makes a log line safe to write — it owns the log-record and operator-terminal sinks.

It makes no HTML-log-viewer-safety claim: rendering attacker text in an HTML dashboard (Kibana/Grafana) is stored/second-order XSS, which the viewer must output-encode with escape_html. It preserves <, >, & precisely so nothing mistakes it for viewer-safe output, and it is not a defense against logging-framework interpolation (log4shell). See the Threat Model.
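
The split between the write-time sink and the viewer-time sink can be sketched as two separate steps. This is a simplified illustration, not disarm's code: neutralize is a hypothetical stand-in that replaces C0 controls, DEL, C1 controls, LS, and PS with U+FFFD, and html.escape stands in for escape_html:

```python
import re
from html import escape  # stdlib stand-in for escape_html (assumption)

# Simplified stand-in for the write-time step: C0 controls, DEL, C1 controls
# (which include NEL), LS and PS all become U+FFFD.
_CONTROLS = re.compile(r"[\x00-\x1f\x7f-\x9f\u2028\u2029]")


def neutralize(text: str) -> str:
    return _CONTROLS.sub("\ufffd", text)


untrusted = "guest\n2024-01-01 INFO login ok user=<script>alert(1)</script>"

log_line = neutralize(untrusted)  # safe to write: one record, no raw LF
html_cell = escape(log_line)      # safe to render: done by the viewer, once

print(log_line)
print(html_cell)
```

The written line still contains the literal <script> text; only the viewer's encoding step removes the markup meaning.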

strip_log_injection

strip_log_injection(text: str, *, replacement: str = '�', keep_tab: bool = False) -> str

Neutralize log-injection / terminal-control characters in text.

Replaces -- rather than dropping, so a redaction stays visible -- every CR, LF, NEL (U+0085), LS (U+2028), PS (U+2029), NUL, C0/C1 control, ESC, and DEL with replacement (default U+FFFD; pass replacement="" to drop). \t is also neutralized by default (keep_tab=False): a tab is a field separator in TSV/logfmt logs, so keeping it permits column injection; pass keep_tab=True for human-readable tabular logs. ANSI escape sequences are neutralized by replacing their introducer (ESC), which leaves inert residue such as [31m.

Idempotent; the output never contains a raw CR/LF/ESC. This makes a log line safe to write, not safe to later render as HTML: it is not an HTML/SQL output sanitizer (it preserves < > & -- encode those at the log viewer with escape_html), and not a defense against logging-framework interpolation (log4shell). See the Threat Model.

Parameters:
  • text (str) –

    The (untrusted) string destined for a log line.

  • replacement (str) –

    String substituted for each neutralized character ("" drops them). Must not itself contain a neutralized character (else DisarmError).

  • keep_tab (bool) –

    Keep \t instead of neutralizing it.

Returns:
  • str

    The neutralized string (the original object when nothing needs it).

Examples:

>>> strip_log_injection("user=admin\nFAKE LOG ENTRY")
'user=admin�FAKE LOG ENTRY'
>>> strip_log_injection("a\x1b[31mb")
'a�[31mb'
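
The documented behavior can be approximated in a few lines for experimentation. This is a minimal sketch, not disarm's implementation: neutralize_log_text is a hypothetical name, the character classes are read off the description above, and ValueError stands in for the documented DisarmError:

```python
import re

# Assumed character classes, taken from the description above: C0 controls
# (NUL, TAB, LF, CR, ESC, ...), DEL, C1 controls (including NEL), LS and PS.
_WITH_TAB = re.compile(r"[\x00-\x1f\x7f-\x9f\u2028\u2029]")
_WITHOUT_TAB = re.compile(r"[\x00-\x08\x0a-\x1f\x7f-\x9f\u2028\u2029]")


def neutralize_log_text(
    text: str, *, replacement: str = "\ufffd", keep_tab: bool = False
) -> str:
    pattern = _WITHOUT_TAB if keep_tab else _WITH_TAB
    if pattern.search(replacement):
        # The real function is documented to raise DisarmError here.
        raise ValueError("replacement contains a neutralized character")
    # A callable avoids re's backslash processing of the replacement string.
    return pattern.sub(lambda _match: replacement, text)


forged = neutralize_log_text("user=admin\nFAKE LOG ENTRY")
colored = neutralize_log_text("a\x1b[31mb")
tabbed = neutralize_log_text("a\tb", keep_tab=True)
dropped = neutralize_log_text("a\r\nb", replacement="")

print(repr(forged))
print(repr(colored))
print(repr(tabbed))
print(repr(dropped))
```

Rejecting a replacement that contains a neutralized character is what keeps the function idempotent: a second pass finds nothing left to replace.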