Computational Linguistics

Character-Level Models

Character-level models process text as sequences of individual characters rather than words or subwords, naturally handling any vocabulary without explicit tokenization and sharing parameters across morphologically related forms.

P(w) = ∏ᵢ P(cᵢ | c₁, ..., cᵢ₋₁)
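As an illustration of this chain-rule factorization, the following sketch scores strings with a toy character bigram estimator (the corpus, smoothing, and start marker "^" are illustrative choices; a real character-level model would estimate each conditional with a neural network):

```python
import math
from collections import defaultdict

# Toy corpus for illustration only; a neural model would replace
# these bigram counts with learned conditionals P(c_i | c_1..c_{i-1}).
corpus = "the cat sat on the mat"

# Count character bigrams, using "^" as a start-of-sequence marker.
counts = defaultdict(lambda: defaultdict(int))
prev = "^"
for ch in corpus:
    counts[prev][ch] += 1
    prev = ch

def log_prob(text):
    """Chain-rule log-probability under the bigram approximation,
    with add-one smoothing over the small character vocabulary."""
    vocab = set(corpus) | {"^"}
    lp, p = 0.0, "^"
    for ch in text:
        total = sum(counts[p].values())
        lp += math.log((counts[p][ch] + 1) / (total + len(vocab)))
        p = ch
    return lp

print(log_prob("the"))   # less negative than an unseen character string
print(log_prob("xqz"))
```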

Character-level models represent and process text at the granularity of individual characters (or bytes), dispensing with the need for explicit tokenization into words or subwords. Instead of maintaining a vocabulary of tens of thousands of word or subword tokens, a character-level model works with a vocabulary of at most a few hundred characters (or 256 bytes), learning to compose these into meaningful higher-level representations through its network architecture. This approach has several compelling advantages: it is truly open-vocabulary (any string can be processed), it naturally shares parameters across morphologically related words, and it eliminates the need for language-specific tokenization decisions.
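The open-vocabulary property is easy to demonstrate at the byte level: every Unicode string maps losslessly to IDs in 0–255, so the model's input vocabulary is fixed at 256 regardless of language or script. A minimal sketch:

```python
# Byte-level "tokenization": any Unicode string becomes a sequence of
# IDs in 0..255, so no string is ever out of vocabulary.
def to_byte_ids(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def from_byte_ids(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

ids = to_byte_ids("naïve 猫")        # mixed scripts, no OOV possible
assert all(0 <= i < 256 for i in ids)
assert from_byte_ids(ids) == "naïve 猫"
print(len("naïve 猫"), len(ids))     # the byte sequence is longer: 7 characters, 10 bytes
```

Note the tradeoff visible in the last line: multi-byte UTF-8 characters make byte sequences longer than character sequences, a cost discussed below under computational tradeoffs.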

Architectures for Character-Level Processing

Character CNN (Kim et al., 2016):
word_emb = Highway(MaxPool(CNN(char_embs)))

Character LSTM language model:
hₜ = LSTM(hₜ₋₁, emb(cₜ₋₁))
P(cₜ | c₁,...,cₜ₋₁) = softmax(W · hₜ)

Character-level Transformer:
Process character sequence c₁...cₙ directly
with self-attention over character positions

Several architectural strategies have been developed for character-level processing. Character-to-word models (Kim et al., 2016) use convolutional neural networks or recurrent networks to compose character sequences into word-level representations, which then feed into a standard word-level model. Fully character-level models (Al-Rfou et al., 2019) process entire texts as character sequences without any word-level component, requiring the model to learn word boundaries and word-level semantics implicitly. Byte-level models operate on raw UTF-8 bytes, handling any script or encoding without character-level preprocessing.
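The character-to-word composition can be sketched in a few lines of numpy. This is a simplified, untrained version of the Kim et al. (2016) pipeline (embed characters, convolve over character windows, max-over-time pool); the dimensions are hypothetical and the highway layer is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; a real character CNN learns these parameters.
CHAR_VOCAB, CHAR_DIM, N_FILTERS, WIDTH = 100, 16, 32, 3

char_emb = rng.normal(size=(CHAR_VOCAB, CHAR_DIM))      # character embeddings
filters = rng.normal(size=(N_FILTERS, WIDTH, CHAR_DIM)) # conv filters

def char_cnn_word_embedding(char_ids):
    """Compose a word's characters into a single vector:
    embed -> convolve over width-3 character windows -> max-over-time pool."""
    x = char_emb[char_ids]                      # (word_len, CHAR_DIM)
    feats = []
    for t in range(len(char_ids) - WIDTH + 1):  # slide windows over the word
        window = x[t:t + WIDTH]                 # (WIDTH, CHAR_DIM)
        feats.append(np.einsum("fwc,wc->f", filters, window))
    conv = np.stack(feats)                      # (positions, N_FILTERS)
    return np.tanh(conv).max(axis=0)            # max-over-time pooling

vec = char_cnn_word_embedding([5, 17, 23, 42])  # a 4-character "word"
print(vec.shape)                                # (32,)
```

The resulting fixed-size vector would then feed into a standard word-level model, exactly as in the `word_emb = Highway(MaxPool(CNN(char_embs)))` schema above.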

Advantages for Morphologically Rich Languages

Character-level models offer particular advantages for morphologically rich languages. Because morphological relationships are reflected in shared character sequences — "play," "plays," "played," "playing" all share the substring "play" — character-level models automatically share representations across related forms. This reduces the effective vocabulary size and mitigates data sparsity. Experiments have shown that character-level models outperform word-level models for tasks in agglutinative languages like Turkish and Finnish, where the combinatorial explosion of word forms makes word-level modeling impractical.
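The degree of sharing among related forms can be made concrete by comparing character n-gram overlap, which is what character-level parameters implicitly exploit (the trigram-Jaccard measure here is an illustrative proxy, not a component of any particular model):

```python
# Word-level models assign unrelated IDs to inflected forms; character
# n-grams expose the shared stem. Toy Jaccard overlap of character trigrams.
def trigrams(word):
    w = f"#{word}#"                        # boundary markers
    return {w[i:i + 3] for i in range(len(w) - 2)}

base = trigrams("play")
for form in ["plays", "played", "playing", "jump"]:
    overlap = len(base & trigrams(form)) / len(base | trigrams(form))
    print(form, round(overlap, 2))         # inflections overlap; "jump" does not
```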

Character-Level Models and Noise Robustness

A practical advantage of character-level models is their robustness to noise and misspellings. Because they process individual characters rather than looking up tokens in a fixed vocabulary, they can gracefully handle typos ("teh" for "the"), spelling variations ("colour" vs. "color"), and informal text ("gooood" for "good"). Belinkov and Bisk (2018) showed that character-level machine translation models are substantially more robust to synthetic noise (character swaps, insertions, deletions) than word-level or BPE-level models. This robustness is valuable for processing user-generated content from social media and messaging platforms.
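Synthetic noise of this kind is straightforward to generate. The sketch below implements simplified versions of the three perturbations mentioned (adjacent swaps, deletions, insertions), in the spirit of the probes used by Belinkov and Bisk (2018) rather than their exact procedure:

```python
import random

rng = random.Random(0)

# Simplified character perturbations for robustness testing.
def swap(word):
    """Swap two adjacent characters, e.g. "the" -> "teh"."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def delete(word):
    """Drop one character at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def insert(word):
    """Insert a random lowercase letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]

print(swap("the"), delete("the"), insert("the"))
```

A word-level model sees each perturbed form as an unknown token, while a character-level model still observes most of the original characters.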

Challenges and Tradeoffs

The main disadvantage of character-level models is computational cost. A character-level representation of a sentence is roughly 5-7 times longer than a subword representation, and the quadratic attention complexity of transformers makes this a significant burden. Various strategies mitigate this: hierarchical architectures that first compose characters into word-level representations, local attention patterns that restrict attention to nearby characters, and downsampling layers that reduce the sequence length at intermediate layers. The CANINE model (Clark et al., 2022) and ByT5 (Xue et al., 2022) demonstrate that character-level and byte-level transformers can achieve performance competitive with subword-level models, though at higher computational cost.
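The quadratic blow-up is worth seeing in numbers. In this back-of-the-envelope sketch, whitespace word count stands in for subword count (a crude proxy, since subword tokenizers typically produce somewhat more tokens than words):

```python
# Self-attention cost scales with sequence length squared, so a sequence
# that is k times longer costs ~k^2 times more attention computation.
sentence = "character level models avoid out of vocabulary problems"

n_tokens = len(sentence.split())   # crude stand-in for a subword count
n_chars = len(sentence)

length_ratio = n_chars / n_tokens
attention_ratio = length_ratio ** 2
print(f"length ratio: {length_ratio:.1f}x, attention cost: {attention_ratio:.0f}x")
```

Even a modest 5x length increase thus implies roughly 25x more attention computation, which is why local attention and downsampling are central to models like CANINE.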

The relationship between character-level models and subword tokenization is not strictly competitive. Many modern architectures use character-level components within a subword framework — for example, using character CNNs to produce embeddings for out-of-vocabulary subword tokens, or fine-tuning pretrained subword models with character-level auxiliary losses. The question of optimal text granularity remains open, with the answer likely depending on the language, task, and available computational resources.

References

  1. Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-aware neural language models. Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2741–2749.
  2. Al-Rfou, R., Choe, D., Constant, N., Guo, M., & Jones, L. (2019). Character-level language modeling with deeper self-attention. Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 3159–3166. doi:10.1609/aaai.v33i01.33013159
  3. Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., ... & Raffel, C. (2022). ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the ACL, 10, 291–306. doi:10.1162/tacl_a_00461
  4. Belinkov, Y., & Bisk, Y. (2018). Synthetic and natural noise both break neural machine translation. Proceedings of the 6th International Conference on Learning Representations.
  5. Clark, J. H., Garrette, D., Turc, I., & Wieting, J. (2022). Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the ACL, 10, 73–91. doi:10.1162/tacl_a_00448