Character-level models represent and process text at the granularity of individual characters (or bytes), dispensing with the need for explicit tokenization into words or subwords. Instead of maintaining a vocabulary of tens of thousands of word or subword tokens, a character-level model works with a vocabulary of at most a few hundred characters (or the 256 possible byte values), learning to compose these into meaningful higher-level representations through its network architecture. This approach has several compelling advantages: it is truly open-vocabulary (any string can be processed), it naturally shares parameters across morphologically related words, and it eliminates the need for language-specific tokenization decisions.
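The open-vocabulary property is easy to see concretely. A minimal sketch, assuming nothing beyond Python's built-in UTF-8 encoding: every string, in any script, maps to IDs drawn from a fixed 256-entry byte vocabulary.

```python
# Sketch of the open-vocabulary property: any string, in any script,
# maps to IDs from a fixed 256-entry byte vocabulary.
def byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 byte IDs (vocabulary size: 256)."""
    return list(text.encode("utf-8"))

for s in ["the", "naïve", "東京", "🙂"]:
    ids = byte_ids(s)
    assert all(0 <= i < 256 for i in ids)   # always in-vocabulary
    assert bytes(ids).decode("utf-8") == s  # lossless round trip
    print(s, ids)
```

A subword tokenizer must fall back to an unknown token for unseen scripts or symbols; the byte view never does.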
Architectures for Character-Level Processing
Character-to-word composition (CNN):
    word_emb = Highway(MaxPool(CNN(char_embs)))

Character LSTM language model:
    hₜ = LSTM(hₜ₋₁, emb(cₜ₋₁))
    P(cₜ | c₁,...,cₜ₋₁) = softmax(W · hₜ)

Character-level Transformer:
    process the character sequence c₁...cₙ directly, with self-attention over character positions
Several architectural strategies have been developed for character-level processing. Character-to-word models (Kim et al., 2016) use convolutional neural networks or recurrent networks to compose character sequences into word-level representations, which then feed into a standard word-level model. Fully character-level models (Al-Rfou et al., 2019) process entire texts as character sequences without any word-level component, requiring the model to learn word boundaries and word-level semantics implicitly. Byte-level models operate on raw UTF-8 bytes, handling any script or encoding without character-level preprocessing.
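As an illustration of the character-to-word strategy, here is a minimal NumPy sketch of convolution-plus-max-pooling composition in the spirit of Kim et al. (2016). The layer sizes, random weights, and `ord`-based character indexing are illustrative assumptions, and the Highway layer from the original model is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from any particular paper).
VOCAB, CHAR_DIM, N_FILTERS, WIDTH = 128, 16, 32, 3

char_emb = rng.standard_normal((VOCAB, CHAR_DIM))           # character embeddings
conv_w = rng.standard_normal((N_FILTERS, WIDTH, CHAR_DIM))  # width-3 conv filters

def word_embedding(word: str) -> np.ndarray:
    """Compose a word vector from characters: convolution + max-over-time pooling."""
    ids = [ord(c) % VOCAB for c in word]  # toy character indexing
    x = char_emb[ids]                     # (word_len, CHAR_DIM)
    if len(ids) < WIDTH:                  # pad so at least one window fits
        x = np.vstack([x, np.zeros((WIDTH - len(ids), CHAR_DIM))])
    # Slide a width-3 window over the characters and apply every filter.
    windows = np.stack([x[i:i + WIDTH] for i in range(len(x) - WIDTH + 1)])
    feats = np.einsum("twd,fwd->tf", windows, conv_w)  # (positions, N_FILTERS)
    return np.tanh(feats).max(axis=0)     # max over character positions

print(word_embedding("play").shape)  # (32,)
```

The resulting fixed-size vector can feed into a standard word-level model, exactly as the character-to-word architectures above do.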
Advantages for Morphologically Rich Languages
Character-level models offer particular advantages for morphologically rich languages. Because morphological relationships are reflected in shared character sequences — "play," "plays," "played," "playing" all share the substring "play" — character-level models automatically share representations across related forms. This reduces the effective vocabulary size and mitigates data sparsity. Experiments have shown that character-level models can outperform word-level models in agglutinative languages like Turkish and Finnish, where the combinatorial explosion of word forms makes word-level modeling impractical.
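The sharing intuition can be made concrete with character n-grams: related forms overlap heavily, unrelated words barely at all. The boundary-marker convention below is an illustrative choice, not a claim about any specific model.

```python
def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Character n-grams with boundary markers '<' and '>' (illustrative convention)."""
    w = f"<{word}>"
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Set-overlap similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

related = jaccard(char_ngrams("play"), char_ngrams("played"))   # shares <pl, pla, lay
unrelated = jaccard(char_ngrams("play"), char_ngrams("table"))  # shares nothing
print(related, unrelated)  # 0.428... 0.0
```

Any parameters tied to the shared n-grams are automatically reused across "play," "played," and the rest of the paradigm, which is precisely the data-sparsity benefit described above.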
A practical advantage of character-level models is their robustness to noise and misspellings. Because they process individual characters rather than looking up tokens in a fixed vocabulary, they can gracefully handle typos ("teh" for "the"), spelling variations ("colour" vs. "color"), and informal text ("gooood" for "good"). Belinkov and Bisk (2018) showed that character-level machine translation models are substantially more robust to synthetic noise (character swaps, insertions, deletions) than word-level or BPE-level models. This robustness is valuable for processing user-generated content from social media and messaging platforms.
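A quick way to see the robustness argument: at the word level, "teh" is simply a different (likely out-of-vocabulary) token from "the," while at the character level most of the signal survives. The sketch below uses difflib's character-overlap ratio as a stand-in for the similarity a character-level encoder can exploit.

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Fraction of matching characters (0..1) between two strings."""
    return SequenceMatcher(None, a, b).ratio()

# Word level: the typo is a different token entirely.
assert "teh" != "the"

# Character level: most of the signal survives typos and elongation.
print(char_similarity("the", "teh"))       # ~0.67
print(char_similarity("gooood", "good"))   # 0.8
print(char_similarity("colour", "color"))  # ~0.91
```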
Challenges and Tradeoffs
The main disadvantage of character-level models is computational cost. A character-level representation of a sentence is roughly 5-7 times longer than a subword representation, and the quadratic complexity of transformer self-attention makes this a significant burden. Various strategies mitigate this: hierarchical architectures that first compose characters into word-level representations, local attention patterns that restrict attention to nearby characters, and downsampling layers that reduce the sequence length at intermediate layers. CANINE (Clark et al., 2022) and ByT5 (Xue et al., 2022) demonstrate that character-level and byte-level transformers can achieve performance competitive with subword-level models, though at higher computational cost.
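The cost argument is simple arithmetic. A sketch, assuming a 6x expansion factor (mid-range of the 5-7x figure above) and a hypothetical 128-character local-attention window:

```python
# The 6x expansion and 128-character window are assumed, illustrative values.
subword_len = 512
char_len = 6 * subword_len  # 5-7x longer at the character level

def full_attention_cost(n: int) -> int:
    return n * n            # quadratic self-attention

def local_attention_cost(n: int, window: int = 128) -> int:
    return n * window       # attend only to nearby characters

print(full_attention_cost(char_len) // full_attention_cost(subword_len))  # 36
print(full_attention_cost(char_len) // local_attention_cost(char_len))    # 24
```

A 6x longer sequence costs 36x more full-attention compute, which is the pressure behind the local-attention and downsampling strategies just described.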
The relationship between character-level models and subword tokenization is not strictly competitive. Many modern architectures use character-level components within a subword framework — for example, using character CNNs to produce embeddings for out-of-vocabulary subword tokens, or fine-tuning pretrained subword models with character-level auxiliary losses. The question of optimal text granularity remains open, with the answer likely depending on the language, task, and available computational resources.