Computational Linguistics

Hidden Markov Models for POS Tagging

Hidden Markov Models for POS tagging model the tag sequence as a hidden Markov chain whose states emit the observed words, and use the Viterbi algorithm to find the most probable tag sequence.

t* = argmax_{t_1..t_n} ∏_i P(w_i|t_i) · P(t_i|t_{i-1})

The Hidden Markov Model (HMM) was one of the first successful statistical approaches to POS tagging. In this framework, the tag sequence is modeled as a first-order Markov chain (each tag depends only on the previous tag), and each tag generates an observed word according to an emission distribution. The tagging problem becomes finding the most probable hidden state sequence (tag sequence) given the observed sequence (word sequence), which is efficiently solved by the Viterbi algorithm.

HMM Formulation

HMM for POS Tagging:

t* = argmax_{t_1..t_n} P(t_1..t_n | w_1..w_n)
   ≈ argmax_{t_1..t_n} ∏_{i=1}^{n} P(w_i|t_i) · P(t_i|t_{i-1})

Transition probabilities: P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})
Emission probabilities: P(w_i | t_i) = count(t_i, w_i) / count(t_i)

The model parameters are the transition probabilities (probability of each tag given the previous tag) and emission probabilities (probability of each word given its tag). These are estimated from a tagged corpus using maximum likelihood estimation (simple counting and normalization). The Baum-Welch algorithm (a special case of EM) can estimate parameters from untagged text, though supervised estimation from tagged corpora produces much better results.
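The counting-and-normalization estimation described above can be sketched in a few lines of Python. The corpus format (lists of (word, tag) pairs) and the "<s>" start symbol are assumptions of this sketch, not part of any standard API.

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences):
    """MLE estimates of HMM transition and emission probabilities.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    A hypothetical "<s>" symbol marks the sentence start.
    """
    trans_counts = defaultdict(int)  # count(t_{i-1}, t_i)
    emit_counts = defaultdict(int)   # count(t, w)
    tag_counts = defaultdict(int)    # count(t), denominator for both

    for sent in tagged_sentences:
        prev = "<s>"
        tag_counts[prev] += 1  # "<s>" occurs once per sentence
        for word, tag in sent:
            trans_counts[(prev, tag)] += 1
            emit_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag

    def p_trans(prev, tag):
        # P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})
        return trans_counts[(prev, tag)] / tag_counts[prev]

    def p_emit(word, tag):
        # P(w_i | t_i) = count(t_i, w_i) / count(t_i)
        return emit_counts[(tag, word)] / tag_counts[tag]

    return p_trans, p_emit
```

On a toy corpus of "the/DT dog/NN" and "the/DT cat/NN", this gives P(NN | DT) = 2/2 = 1.0 and P(dog | NN) = 1/2 = 0.5.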

Viterbi Decoding

The Viterbi algorithm finds the most probable tag sequence in O(nT²) time, where n is the sentence length and T is the number of tags. It uses dynamic programming, maintaining for each position and tag the probability of the best tag sequence ending with that tag. Back-pointers allow the optimal sequence to be recovered. The forward algorithm computes the total probability of the observed sequence (summing over all possible tag sequences), which is useful for training and evaluation.
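A minimal sketch of Viterbi decoding in log space, following the recurrence above. The `p_trans`/`p_emit` callables and the "<s>" start symbol are assumptions of this sketch; the two nested loops over tags make the O(nT²) cost visible.

```python
import math

def _log(p):
    # log(0) is -inf, so impossible paths are never selected by max()
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words, tags, p_trans, p_emit, start="<s>"):
    """Return the most probable tag sequence for `words` in O(n * T^2)."""
    n = len(words)
    best = [{} for _ in range(n)]  # best[i][t]: log-prob of best sequence ending in t
    back = [{} for _ in range(n)]  # back-pointers for recovering that sequence

    for t in tags:
        best[0][t] = _log(p_trans(start, t)) + _log(p_emit(words[0], t))
    for i in range(1, n):
        for t in tags:
            # best previous tag to transition from into t
            prev = max(tags, key=lambda p: best[i - 1][p] + _log(p_trans(p, t)))
            best[i][t] = (best[i - 1][prev] + _log(p_trans(prev, t))
                          + _log(p_emit(words[i], t)))
            back[i][t] = prev

    # follow back-pointers from the highest-scoring final tag
    tag = max(tags, key=lambda t: best[n - 1][t])
    seq = [tag]
    for i in range(n - 1, 0, -1):
        tag = back[i][tag]
        seq.append(tag)
    return seq[::-1]
```

Working in log space avoids numerical underflow on long sentences, since the product of many small probabilities quickly rounds to zero in floating point.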

Smoothing for HMM Taggers
Unseen word-tag pairs are a major challenge for HMM taggers. Smoothing techniques include add-one (Laplace) smoothing, Good-Turing smoothing, and interpolation with suffix-based emission models. For unknown words, emission probabilities are often estimated from word morphology: words ending in "-ed" are likely VBD or VBN, words ending in "-ly" are likely RB.
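The morphology heuristic above can be sketched as a fallback emission estimate for unknown words. The suffix table and probability values below are hand-picked for illustration, not estimated from data as a real suffix model would be.

```python
def unknown_emission(word, tag):
    """Suffix-based emission estimate for words unseen in training.

    The suffix-to-tag table is a hand-picked illustration; a real tagger
    would estimate these distributions from rare words in the training data.
    """
    suffix_tags = {
        "ing": {"VBG": 0.7},               # running
        "ed": {"VBD": 0.4, "VBN": 0.4},    # walked (past tense or participle)
        "ly": {"RB": 0.8},                 # quickly
        "s": {"NNS": 0.4, "VBZ": 0.3},     # dogs, runs
    }
    # try longer suffixes first so "-ing" wins over "-g"-like short matches
    for suffix in sorted(suffix_tags, key=len, reverse=True):
        if word.endswith(suffix):
            return suffix_tags[suffix].get(tag, 0.01)
    return 0.01  # small floor probability for any tag
```

In practice this fallback replaces P(w_i | t_i) only when count(t_i, w_i) = 0 for every tag, i.e. when the word never occurred in the training corpus.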

Limitations and Legacy

HMM taggers achieve approximately 95-96% accuracy on the Penn Treebank, which was state of the art in the early 1990s but has since been surpassed by discriminative models (MEMMs, CRFs) and neural taggers. The main limitations of HMMs are the strong independence assumptions (each word depends only on its tag, not on neighboring words or tags) and the generative nature of the model, which prevents the use of arbitrary overlapping features. Despite these limitations, HMM taggers remain an important pedagogical tool and a building block for understanding more advanced sequence models.

