Grapheme Clusters

Functions for working with user-perceived characters (extended grapheme clusters) as defined by UAX #29. These give correct results for emoji, combining characters, and complex scripts, where Python's len() counts code points and therefore overcounts.
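The overcounting is easy to reproduce with the standard library alone. The sketch below does not use this library; the variable names are illustrative, and the strings are written with escapes so the code points are unambiguous.

```python
import unicodedata

# Each string below renders as ONE user-perceived character.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # family emoji
flag = "\U0001F1EB\U0001F1F7"  # French flag: two regional indicator symbols
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# len() counts code points, not grapheme clusters.
family_len = len(family)          # 4 emoji joined by 3 zero-width joiners
flag_len = len(flag)
decomposed_len = len(decomposed)

# Counting UTF-8 bytes is further off still.
family_bytes = len(family.encode("utf-8"))

# NFC normalization composes "e" + accent into one code point,
# but no normalization form merges the emoji sequences.
composed = unicodedata.normalize("NFC", decomposed)
family_nfc_len = len(unicodedata.normalize("NFC", family))

print(family_len, flag_len, decomposed_len, family_bytes, len(composed), family_nfc_len)
```

Normalization helps only in the accented-letter case, which is why counting has to follow grapheme cluster boundaries rather than code points.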

grapheme_len

grapheme_len(text: str) -> int

Count the number of user-perceived characters (extended grapheme clusters).

This is the correct answer to "how many characters does the user see?" A single grapheme cluster may span multiple codepoints (e.g., flag emoji, skin-toned emoji, Hangul syllables with combining jamo, Zalgo text).

Parameters:
  • text (str) –

    Input string.

Returns:
  • int –

    Number of extended grapheme clusters.

Examples:

>>> grapheme_len("café")
4
>>> grapheme_len("👨‍👩‍👧‍👦")  # family emoji = 1 grapheme cluster
1

from disarm import grapheme_len

assert grapheme_len("café") == 4
assert grapheme_len("👨‍👩‍👧‍👦") == 1
assert grapheme_len("🇫🇷") == 1
assert grapheme_len("é") == 1

grapheme_split

grapheme_split(text: str) -> list[str]

Split text into a list of extended grapheme clusters.

Each element is a user-perceived character.

Parameters:
  • text (str) –

    Input string.

Returns:
  • list[str] –

    List of grapheme cluster strings.

Examples:

>>> grapheme_split("café")
['c', 'a', 'f', 'é']
>>> len(grapheme_split("👨‍👩‍👧‍👦!"))  # family emoji + "!"
2

from disarm import grapheme_split

assert grapheme_split("café") == ['c', 'a', 'f', 'é']
assert grapheme_split("👨‍👩‍👧‍👦!") == ['👨\u200d👩\u200d👧\u200d👦', '!']
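To give a feel for what "cluster" means, here is a deliberately simplified, standard-library-only sketch that attaches combining marks to the preceding base character. It is not how grapheme_split is implemented and it is not a UAX #29 segmenter: it ignores ZWJ emoji sequences, flags, and Hangul jamo. The function name split_combining is hypothetical.

```python
import unicodedata


def split_combining(text: str) -> list[str]:
    """Group each code point with the combining marks that follow it.

    Simplified illustration only: covers marks with a nonzero canonical
    combining class, not the full extended grapheme cluster rules.
    """
    clusters: list[str] = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for marks
        # such as U+0301 COMBINING ACUTE ACCENT.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # "café" spelled with a decomposed é
parts = split_combining(word)
print(len(word), len(parts))
```

Even this narrow case shows the gap between code points and clusters: the word has five code points but four visible characters.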

Note

Input is limited to 10 MB to prevent excessive memory allocation. Larger inputs raise DisarmError.


grapheme_truncate

grapheme_truncate(text: str, max_graphemes: int) -> str

Truncate text to at most max_graphemes user-perceived characters.

Unlike byte-level or codepoint-level truncation, this never splits a grapheme cluster (which could corrupt emoji, combining sequences, or Hangul syllables).

Parameters:
  • text (str) –

    Input string.

  • max_graphemes (int) –

    Maximum number of grapheme clusters to keep.

Returns:
  • str –

    Truncated string containing at most max_graphemes grapheme clusters.

Examples:

>>> grapheme_truncate("Hello World", 5)
'Hello'
>>> grapheme_truncate("café", 3)
'caf'

from disarm import grapheme_truncate

assert grapheme_truncate("Hello World", 5) == 'Hello'
assert grapheme_truncate("café", 3) == 'caf'
assert grapheme_truncate("👨‍👩‍👧‍👦🎉", 1) == '👨\u200d👩\u200d👧\u200d👦'
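For contrast, the sketch below shows what naive truncation does to the same kinds of input. It uses only built-in string and bytes operations, not this library, and the variable names are illustrative.

```python
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # family emoji
party = "\U0001F389"  # party popper
text = family + party

# Code-point slicing keeps only the first code point: a lone "man" emoji.
by_code_point = text[:1]

# Taking two code points leaves a dangling zero-width joiner at the end.
with_dangling_joiner = text[:2]

# Byte-level truncation: "café" is 5 bytes in UTF-8, so cutting at 4 bytes
# splits the 2-byte "é" and leaves invalid UTF-8.
raw = "caf\u00e9".encode("utf-8")[:4]
try:
    raw.decode("utf-8")
    byte_cut_ok = True
except UnicodeDecodeError:
    byte_cut_ok = False

print(repr(by_code_point), repr(with_dangling_joiner), byte_cut_ok)
```

Neither result is what a user would recognize as "the first character" of the original text.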