Grapheme Clusters
Functions for working with user-perceived characters (extended grapheme clusters) as defined by UAX #29. These give correct results for emoji, combining characters, and complex scripts, where Python's len() counts code points and therefore overcounts.
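To make the overcounting concrete, the following sketch uses only the standard library (no disarm calls) to show how many code points Python's len() reports for strings a reader would perceive as shorter. The variable names are illustrative only.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displays as one letter.
decomposed = "cafe\u0301"
# Man + ZWJ + woman + ZWJ + girl + ZWJ + boy: displays as one family emoji.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
# Two regional indicator symbols (F, R): displays as one flag.
flag = "\U0001F1EB\U0001F1F7"

# len() counts code points, not user-perceived characters.
decomposed_len = len(decomposed)  # 5, although the reader sees 4 characters
family_len = len(family)          # 7, although the reader sees 1 emoji
flag_len = len(flag)              # 2, although the reader sees 1 flag

# NFC normalization composes the accent, but it does not merge emoji sequences.
nfc_word_len = len(unicodedata.normalize("NFC", decomposed))  # 4
nfc_family_len = len(unicodedata.normalize("NFC", family))    # still 7
```

Normalization alone therefore helps with some combining sequences but not with emoji, which is why a grapheme-aware count is needed.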
grapheme_len
grapheme_len(text: str) -> int
Count the number of user-perceived characters (extended grapheme clusters).
This is the correct answer to "how many characters does the user see?" A single grapheme cluster may span multiple codepoints (e.g., flag emoji, skin-toned emoji, Hangul syllables built from conjoining jamo, Zalgo text).
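| Parameters: | `text` (`str`): the string to measure. |
|---|---|

| Returns: | `int`: the number of extended grapheme clusters in `text`. |
|---|---|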
Examples:
>>> grapheme_len("café")
4
>>> grapheme_len("👨‍👩‍👧‍👦") # family emoji = 1 grapheme cluster
1
from disarm import grapheme_len
assert grapheme_len("café") == 4
assert grapheme_len("👨‍👩‍👧‍👦") == 1
assert grapheme_len("🇫🇷") == 1
assert grapheme_len("é") == 1
grapheme_split
grapheme_split(text: str) -> list[str]
Split text into a list of extended grapheme clusters.
Each element is a user-perceived character.
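| Parameters: | `text` (`str`): the string to split. |
|---|---|

| Returns: | `list[str]`: the extended grapheme clusters of `text`, in order. |
|---|---|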
Examples:
>>> grapheme_split("café")
['c', 'a', 'f', 'é']
>>> len(grapheme_split("👨‍👩‍👧‍👦!")) # family emoji + "!"
2
from disarm import grapheme_split
assert grapheme_split("café") == ['c', 'a', 'f', 'é']
assert grapheme_split("👨‍👩‍👧‍👦!") == ['👨\u200d👩\u200d👧\u200d👦', '!']
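To give a feel for what splitting into clusters involves, here is a deliberately simplified, standard-library-only approximation. It is not the library's implementation and not a full UAX #29 segmenter: the helper names (`naive_grapheme_split`, `_is_extender`, `_is_regional_indicator`) are hypothetical, and the sketch handles only combining marks, ZWJ sequences, and regional-indicator pairs, ignoring Hangul jamo, CRLF, skin-tone modifiers, and other rules.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def _is_extender(ch: str) -> bool:
    # Combining marks (including variation selectors) and ZWJ attach to the previous cluster.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me") or ch == ZWJ


def _is_regional_indicator(ch: str) -> bool:
    # U+1F1E6..U+1F1FF are the regional indicator symbols used for flags.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def naive_grapheme_split(text: str) -> list[str]:
    """Rough approximation of grapheme splitting; NOT a UAX #29 implementation."""
    clusters: list[str] = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            joins = (
                _is_extender(ch)
                or prev.endswith(ZWJ)  # whatever follows a ZWJ stays in the cluster
                or (
                    _is_regional_indicator(ch)
                    and len(prev) == 1
                    and _is_regional_indicator(prev)  # pair indicators two at a time
                )
            )
            if joins:
                clusters[-1] += ch
                continue
        clusters.append(ch)
    return clusters


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(naive_grapheme_split("cafe\u0301"))     # ['c', 'a', 'f', 'é'] (last item is 2 code points)
print(len(naive_grapheme_split(family + "!")))  # 2
```

The gaps listed above are the reason to prefer grapheme_split over a hand-rolled loop in real code.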
Note
Input is limited to 10 MB to prevent excessive memory allocation. Raises DisarmError for larger inputs.
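If you would rather reject oversized input yourself than handle the error, a pre-check along these lines may help. This is a sketch under two assumptions that the note above does not confirm: that the limit is measured in UTF-8 bytes, and that "10 MB" means 10 * 1024 * 1024 bytes. The names `ASSUMED_LIMIT_BYTES` and `fits_assumed_limit` are hypothetical and not part of disarm.

```python
# Assumption: the limit is counted in UTF-8 bytes and 10 MB = 10 * 1024 * 1024.
ASSUMED_LIMIT_BYTES = 10 * 1024 * 1024


def fits_assumed_limit(text: str, limit: int = ASSUMED_LIMIT_BYTES) -> bool:
    """Return True if the UTF-8 encoding of text is at most `limit` bytes."""
    # Each code point encodes to between 1 and 4 bytes in UTF-8,
    # so these two bounds settle most cases without encoding anything.
    if len(text) > limit:
        return False
    if len(text) * 4 <= limit:
        return True
    # Borderline case: measure the encoded size exactly.
    # surrogatepass keeps lone surrogates from raising during the measurement.
    return len(text.encode("utf-8", errors="surrogatepass")) <= limit
```

A caller could check `fits_assumed_limit(user_input)` before calling grapheme_split and return a validation error instead of relying on the exception.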
grapheme_truncate
grapheme_truncate(text: str, max_graphemes: int) -> str
Truncate text to at most max_graphemes user-perceived characters.
Unlike byte-level or codepoint-level truncation, this never splits a grapheme cluster (which could corrupt emoji, combining sequences, or Hangul syllables).
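| Parameters: | `text` (`str`): the string to truncate. `max_graphemes` (`int`): the maximum number of grapheme clusters to keep. |
|---|---|

| Returns: | `str`: `text` cut to at most `max_graphemes` grapheme clusters. |
|---|---|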
Examples:
>>> grapheme_truncate("Hello World", 5)
'Hello'
>>> grapheme_truncate("café", 3)
'caf'
from disarm import grapheme_truncate
assert grapheme_truncate("Hello World", 5) == 'Hello'
assert grapheme_truncate("café", 3) == 'caf'
assert grapheme_truncate("👨‍👩‍👧‍👦🎉", 1) == '👨\u200d👩\u200d👧\u200d👦'
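For contrast, the sketch below shows what goes wrong with the two naive alternatives. It uses only built-in string and bytes operations, so it illustrates the failure modes rather than anything about how disarm works internally.

```python
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

# Code-point slicing keeps only the first person; the family emoji is gone.
sliced_family = family[:1]

# Slicing after four code points silently drops the accent.
sliced_word = decomposed[:4]

# Byte-level truncation can cut a multi-byte UTF-8 sequence in half,
# leaving bytes that no longer decode.
raw = "caf\u00e9".encode("utf-8")[:4]
try:
    raw.decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False

print(sliced_family == "\U0001F468")  # True: just the man emoji remains
print(sliced_word)                    # cafe
print(byte_cut_is_valid)              # False
```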