Grapheme Clusters

Unicode text is more complex than it appears. A single user-perceived "character" can be composed of multiple Unicode codepoints — combining accents, emoji modifiers, ZWJ sequences, regional indicator pairs, and Hangul jamo all create situations where Python's len() gives a misleading count.

disarm provides three functions for working with extended grapheme clusters as defined by UAX #29, giving correct results where len() overcounts.

The Problem

text = "café"            # 4 characters, right?
assert len(text) == 4

# But with decomposed é (e + combining acute accent):
import unicodedata
text_nfd = unicodedata.normalize("NFD", "café")
assert len(text_nfd) == 5

# Emoji are worse:
assert len("👨‍👩‍👧‍👦") == 7
assert len("🇬🇧") == 2
assert len("👋🏽") == 2

Python's len() counts codepoints, not user-perceived characters. For correct character counting, splitting, and truncation, you need grapheme cluster segmentation.
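
To see where the extra codepoints come from, it can help to list them one by one. The sketch below uses only the standard library's unicodedata module; describe is a hypothetical helper written for this illustration and is not part of disarm.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per codepoint in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# Decomposed "é" is two codepoints: the base letter and a combining mark.
for entry in describe("e\u0301"):
    print(entry)

# Waving hand + skin-tone modifier: two codepoints, one visible emoji.
for entry in describe("\U0001F44B\U0001F3FD"):
    print(entry)
```

Each printed line is one unit that len() counts, even though the reader sees a single character in both cases.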

Functions

grapheme_len

Count the number of user-perceived characters:

from disarm import grapheme_len

assert grapheme_len("café") == 4
assert grapheme_len("cafe\u0301") == 4

# Emoji
assert grapheme_len("👨‍👩‍👧‍👦") == 1
assert grapheme_len("🇬🇧") == 1
assert grapheme_len("👋🏽") == 1
assert grapheme_len("🏳️‍🌈") == 1

# Complex scripts
assert grapheme_len("\u1100\u1161\u11A8") == 1
assert grapheme_len("नमस्ते") == 3
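
The Hangul example above is three conjoining jamo that render as one syllable block. A quick standard-library check (it does not call disarm) suggests why a cluster count of 1 matches what the reader sees: NFC normalization composes the three jamo into a single precomposed codepoint.

```python
import unicodedata

# Leading consonant + vowel + trailing consonant, as separate jamo.
jamo = "\u1100\u1161\u11A8"
syllable = unicodedata.normalize("NFC", jamo)

assert len(jamo) == 3                 # three codepoints before composition
assert len(syllable) == 1             # one precomposed syllable after NFC
assert syllable == "\uAC01"           # HANGUL SYLLABLE GAG

# Decomposing the syllable gives back the original jamo sequence.
assert unicodedata.normalize("NFD", syllable) == jamo
```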

grapheme_split

Split text into individual grapheme clusters:

from disarm import grapheme_split

assert grapheme_split("café") == ['c', 'a', 'f', 'é']
assert grapheme_split("cafe\u0301") == ['c', 'a', 'f', 'é']

assert grapheme_split("👨‍👩‍👧‍👦!") == ['👨\u200d👩\u200d👧\u200d👦', '!']
assert grapheme_split("🇫🇷🇬🇧") == ['🇫🇷', '🇬🇧']
assert grapheme_split("Hi 👋🏽") == ['H', 'i', ' ', '👋🏽']

Note

Input is limited to 10 MB to prevent excessive memory allocation; larger inputs raise DisarmError.

grapheme_truncate

Truncate text to a maximum number of grapheme clusters without splitting any cluster:

from disarm import grapheme_truncate

assert grapheme_truncate("Hello World", 5) == 'Hello'
assert grapheme_truncate("café", 3) == 'caf'
assert grapheme_truncate("cafe\u0301s", 4) == 'café'

# Emoji are never split
assert grapheme_truncate("👨‍👩‍👧‍👦🎉", 1) == '👨\u200d👩\u200d👧\u200d👦'
assert grapheme_truncate("Hi 👩‍👩‍👧‍👦!", 4) == 'Hi 👩\u200d👩\u200d👧\u200d👦'
assert grapheme_truncate("🇬🇧🇫🇷🇩🇪", 2) == '🇬🇧🇫🇷'

Unlike codepoint-level slicing (text[:n]) or byte-level slicing of encoded text, grapheme_truncate never cuts through a cluster: no broken emoji, no orphaned combining marks, no split Hangul syllables.
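
For contrast, here is a small standard-library sketch of the failure modes that cluster-aware truncation avoids. It does not call disarm; the variable names are illustrative only.

```python
decomposed = "cafe\u0301"             # "café" with a combining acute accent
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# Codepoint slicing drops the combining mark, so the accent is lost.
sliced_word = decomposed[:4]
assert sliced_word == "cafe"

# Codepoint slicing inside a ZWJ sequence leaves a dangling joiner.
sliced_family = family[:2]
assert sliced_family == "\U0001F468\u200D"

# Byte slicing can cut a UTF-8 sequence in half, which no longer decodes.
raw = "\u00e9".encode("utf-8")[:1]
try:
    raw.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False
assert decodes is False
```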

Text Builder

All grapheme functions are also available on the Text builder:

from disarm import Text

t = Text("Hello 👨‍👩‍👧‍👦!")

# Queries (non-chaining)
assert t.grapheme_len() == 8
assert t.grapheme_split() == ['H', 'e', 'l', 'l', 'o', ' ', '👨\u200d👩\u200d👧\u200d👦', '!']

# Transform (chaining)
assert t.grapheme_truncate(7).value == 'Hello 👨\u200d👩\u200d👧\u200d👦'

When to Use Grapheme Functions

Use grapheme_len instead of len() when:

  • Enforcing character limits — user-facing limits like "280 characters" should count what users see, not codepoints
  • Validating input length — username or field length validation
  • Character-level ML tokenization — splitting text into "characters" for character-level models
  • Display width estimation — though note that display width also depends on font metrics, not just grapheme count

Use grapheme_truncate instead of slicing when:

  • Truncating user-visible text — preview snippets, title shortening
  • Database field length enforcement — preventing corruption of combining sequences at boundaries
  • API response truncation — ensuring valid Unicode output
  • Slug length limits — though slugify(max_length=) already handles this for ASCII output

Use grapheme_split instead of list() when:

  • Character-level tokenization — NLP pipelines that need individual characters
  • Character frequency analysis — counting character distributions
  • Grapheme-aware iteration — processing text one user-perceived character at a time
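
As a rough illustration of the character frequency case, the sketch below compares counting over codepoints with counting over clusters. To stay self-contained it uses a hand-written cluster list as a stand-in for what grapheme_split is documented to return; only collections.Counter from the standard library is used.

```python
from collections import Counter

text = "cafe\u0301 cafe\u0301"

# Hand-written stand-in for grapheme_split(text).
clusters = ["c", "a", "f", "e\u0301", " ", "c", "a", "f", "e\u0301"]
assert "".join(clusters) == text

by_codepoint = Counter(text)      # iterating a str yields codepoints
by_cluster = Counter(clusters)

# Codepoint counting reports a bare "e" and a free-standing accent.
assert by_codepoint["e"] == 2 and by_codepoint["\u0301"] == 2

# Cluster counting reports the accented letter as one unit.
assert by_cluster["e\u0301"] == 2 and by_cluster["e"] == 0
```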

Codepoints vs Graphemes vs Bytes

A comparison showing how different counting methods diverge:

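| Text | UTF-8 bytes, len(b) | Codepoints, len(s) | Graphemes, grapheme_len(s) |
|---|---|---|---|
| "hello" | 5 | 5 | 5 |
| "café" (NFC) | 5 | 4 | 4 |
| "café" (NFD) | 6 | 5 | 4 |
| "👨‍👩‍👧‍👦" | 25 | 7 | 1 |
| "🇬🇧" | 8 | 2 | 1 |
| "👋🏽" | 8 | 2 | 1 |
| "नमस्ते" | 18 | 6 | 3 |
| "한" (precomposed) | 3 | 1 | 1 |
| "한" (jamo) | 9 | 3 | 1 |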

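The byte and codepoint columns of the comparison can be reproduced with the standard library alone. The helper name sizes is hypothetical; the grapheme column needs a segmenter such as grapheme_len and is deliberately left out of this sketch.

```python
import unicodedata


def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, codepoint count) for text."""
    return len(text.encode("utf-8")), len(text)


nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

assert sizes(nfc) == (5, 4)
assert sizes(nfd) == (6, 5)
assert sizes("\U0001F1EC\U0001F1E7") == (8, 2)   # regional indicators G + B
assert sizes("\uD55C") == (3, 1)                 # precomposed Hangul syllable
assert sizes("\u1112\u1161\u11AB") == (9, 3)     # the same syllable as jamo
```
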
Normalization Interaction

The NFC and NFD forms of the same text contain different codepoints, and grapheme cluster boundaries can differ between them. For consistent results, normalize before counting:

from disarm import normalize, grapheme_len

text = "é"  # might be NFC or NFD depending on source
normalized = normalize(text, form="NFC")
count = grapheme_len(normalized)
assert count == 1

In practice, grapheme_len gives the same count for NFC and NFD forms of the same text, because the grapheme cluster algorithm handles both. Normalizing first still matters for grapheme_split and grapheme_truncate: it ensures they return the same codepoints, and therefore the same bytes, for canonically equivalent inputs.
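
A minimal sketch of the underlying issue, using the standard library's unicodedata.normalize as a stand-in for disarm's normalize (no assumption is made here about how disarm implements it):

```python
import unicodedata

composed = "\u00e9"        # é as one codepoint (NFC)
decomposed = "e\u0301"     # e + combining acute accent (NFD)

# Same perceived character, different codepoint sequences.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# After normalizing both to one form they compare equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

Two strings that look identical can therefore fail an equality check or produce different stored bytes unless they are normalized to the same form first.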

Best Practices

Username validation

Sanitize input first, then enforce a grapheme-aware length limit:

from disarm import normalize_user_input, grapheme_len, grapheme_truncate

def validate_username(raw: str, max_graphemes: int = 30) -> str:
    clean = normalize_user_input(raw)
    if grapheme_len(clean) > max_graphemes:
        clean = grapheme_truncate(clean, max_graphemes)
    return clean
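
Some applications may prefer to reject an overlong username rather than silently shorten it. The following is a sketch of that variant, not a disarm API: check_username is a hypothetical helper, and the counting function is passed in as a parameter so the example runs with the standard library alone.

```python
from typing import Callable


def check_username(
    clean: str,
    count_graphemes: Callable[[str], int],
    max_graphemes: int = 30,
) -> str:
    """Return clean unchanged, or raise ValueError if it is empty or too long."""
    n = count_graphemes(clean)
    if n == 0:
        raise ValueError("username is empty")
    if n > max_graphemes:
        raise ValueError(f"username has {n} characters; limit is {max_graphemes}")
    return clean


# len is only a stand-in counter here; pass grapheme_len in real use.
assert check_username("alice", len) == "alice"
```

Rejecting keeps the stored name identical to what the user typed, at the cost of one more error path in the sign-up flow.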

Post/tweet fields

Use display_clean for lightweight sanitization and grapheme_truncate for the character limit:

from disarm import display_clean, grapheme_truncate

def prepare_post(raw: str, max_graphemes: int = 280) -> str:
    clean = display_clean(raw)
    return grapheme_truncate(clean, max_graphemes)

Database column truncation

When storing text in a column with a character limit, truncate by grapheme clusters — never by bytes or codepoints, which can split emoji or combining sequences:

from disarm import security_clean, grapheme_truncate

def safe_for_db(raw: str, max_graphemes: int = 255) -> str:
    clean = security_clean(raw)
    return grapheme_truncate(clean, max_graphemes)
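
Some column limits are expressed in bytes rather than characters. One way to respect such a limit without splitting a cluster is to add whole clusters until the byte budget is used up. The sketch below assumes the clusters come from grapheme_split; fit_to_bytes is a hypothetical helper, and the cluster list is written by hand so the example is self-contained.

```python
def fit_to_bytes(clusters: list[str], max_bytes: int) -> str:
    """Join whole clusters until the next one would exceed max_bytes of UTF-8."""
    kept: list[str] = []
    used = 0
    for cluster in clusters:
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(cluster)
        used += size
    return "".join(kept)


# Hand-written stand-in for grapheme_split("cafe\u0301s").
clusters = ["c", "a", "f", "e\u0301", "s"]

# "e" + accent needs 3 bytes but only 2 remain, so the whole cluster is dropped.
assert fit_to_bytes(clusters, 5) == "caf"
assert fit_to_bytes(clusters, 6) == "cafe\u0301"
```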

ML corpus preparation

Normalize text before truncating to a token-budget-friendly length:

from disarm import ml_normalize, grapheme_truncate

def prepare_for_model(raw: str, max_graphemes: int = 4096) -> str:
    clean = ml_normalize(raw)
    return grapheme_truncate(clean, max_graphemes)

Terminal column width

grapheme_len counts clusters; it does not tell you how many terminal columns text occupies (a CJK character is one cluster but two columns). Use terminal_width and grapheme_width for that — measured per grapheme cluster over UAX #11 East Asian Width:

from disarm import terminal_width, grapheme_width

assert terminal_width("hello") == 5
assert terminal_width("世界") == 4  # wide CJK: 2 columns each
assert terminal_width("cafe\u0301") == 4  # NFD: "e" + combining acute (U+0301, 0 columns)
assert terminal_width("a😀") == 3  # emoji cluster occupies 2 columns
assert grapheme_width("👨‍👩‍👧‍👦") == 2  # one ZWJ cluster, 2 columns

East Asian Ambiguous characters are 1 column by default (matching modern UTF-8 terminals); pass ambiguous_wide=True for legacy double-width CJK terminals:

assert terminal_width("¡") == 1
assert terminal_width("¡", ambiguous_wide=True) == 2

This measures terminal cells, not pixels or font metrics. Tabs are not expanded and newlines are not modelled — layout that depends on tab stops or wrapping is the caller's responsibility. Emoji-ZWJ and ambiguous widths are inherently terminal-dependent; disarm's policy is fixed (ambiguous = 1 unless ambiguous_wide, and an emoji-presented cluster = 2).
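
To make the East Asian Width idea concrete, here is a deliberately simplified standard-library approximation. It is not disarm's implementation: approx_width is a hypothetical helper that sums a per-codepoint estimate, so it overcounts ZWJ emoji sequences, which is exactly the case that measuring per grapheme cluster is meant to handle.

```python
import unicodedata


def approx_width(text: str, ambiguous_wide: bool = False) -> int:
    """Sum a per-codepoint column estimate based on East Asian Width."""
    total = 0
    for ch in text:
        # Combining marks and format characters (such as ZWJ) take no column.
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and ambiguous_wide):
            total += 2
        else:
            total += 1
    return total


assert approx_width("hello") == 5
assert approx_width("\u4e16\u754c") == 4          # two wide CJK characters
assert approx_width("cafe\u0301") == 4            # combining accent adds nothing
assert approx_width("\u00a1") == 1                # ambiguous, narrow by default
assert approx_width("\u00a1", ambiguous_wide=True) == 2
```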

Limitations

  • Display width is terminal cells, not pixels. terminal_width / grapheme_width report monospace column counts (UAX #11), not font-metric or pixel widths, which depend on the rendering stack.
  • Newer emoji sequences. The unicode-segmentation crate's tables must be updated to correctly segment newly standardized ZWJ emoji sequences. Between updates, a brand-new emoji may be split across multiple clusters.
  • Rendering varies. "User-perceived character" is ultimately a rendering question. Not all systems agree on cluster boundaries, particularly for complex emoji. See Limitations for details.

Performance

Grapheme operations use the Rust unicode-segmentation crate, which implements UAX #29 with precomputed lookup tables. Performance is in the sub-microsecond range for typical inputs:

| Function | Input | Time |
|---|---|---|
| grapheme_len | ASCII string | ~100 ns |
| grapheme_len | Emoji string | ~260 ns |
| grapheme_split | ASCII string | ~285 ns |
| grapheme_split | Emoji string | ~516 ns |
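
Timings like these depend on hardware, input length, and build settings, so it may be worth measuring on your own machine. The harness below is a minimal sketch using the standard library's timeit; ns_per_call is a hypothetical helper, and len stands in for the function under test so the example runs without disarm installed.

```python
import timeit
from typing import Callable


def ns_per_call(fn: Callable[[str], object], arg: str, number: int = 100_000) -> float:
    """Return the best-of-five mean time per call, in nanoseconds."""
    best = min(timeit.repeat(lambda: fn(arg), number=number, repeat=5))
    return best / number * 1e9


# len is a stand-in; substitute grapheme_len or grapheme_split to measure disarm.
print(f"{ns_per_call(len, 'hello world'):.0f} ns per call")
```

The lambda wrapper adds a small constant overhead to every call, so treat the result as an upper bound when comparing against very fast functions.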