The Naive Bayes classifier is a probabilistic generative model that applies Bayes' theorem to compute the posterior probability of each class given a document, making the simplifying assumption that features (typically words) are conditionally independent given the class label. Despite this assumption being clearly violated in natural language — words are highly correlated with one another — Naive Bayes classifiers perform remarkably well on text classification tasks. This robustness arises because the classifier only needs to rank classes correctly, not estimate calibrated probabilities, and the independence assumption often preserves the correct ranking even when it distorts the magnitudes.
Multinomial and Bernoulli Models
Under the multinomial model, the posterior for class c given document d is

P(c | d) ∝ P(c) ∏ᵢ P(wᵢ | c)^tf(wᵢ, d)

where tf(wᵢ, d) is the term frequency of word wᵢ in document d. The class-conditional word probabilities are estimated by maximum likelihood with Laplace (add-one) smoothing:

P(wᵢ | c) = (count(wᵢ, c) + 1) / (∑_w count(w, c) + |V|)

where |V| is the vocabulary size.
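The two formulas above translate directly into code. The following is a minimal sketch, assuming documents are pre-tokenised lists of strings; the function names are illustrative, not from any particular library:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods
    from tokenised documents (lists of words)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # count(w, c) per class
    vocab = set()
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
        vocab.update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())  # ∑_w count(w, c)
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab

def classify(doc, log_prior, log_lik, vocab):
    """Return the class maximising log P(c) + ∑ tf(w, d) · log P(w | c).
    Repeated words contribute once per occurrence, giving the tf weighting."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in doc if w in vocab)
    return max(scores, key=scores.get)
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause, and leaves the argmax over classes unchanged.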
Two main variants of Naive Bayes are used for text. The multinomial Naive Bayes model treats a document as a bag of words and models word frequencies, making it natural for longer documents where word repetition carries information. The multivariate Bernoulli model instead represents documents as binary vectors indicating word presence or absence, ignoring frequency. McCallum and Nigam (1998) showed that the multinomial model generally outperforms the Bernoulli model for text classification, particularly on longer documents and larger vocabularies, because it exploits the additional information in word counts.
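The Bernoulli variant differs in both estimation and scoring: parameters are document frequencies rather than term frequencies, and absent words contribute evidence too. A minimal sketch of that difference, with illustrative function names and add-one smoothing over the two outcomes (present/absent):

```python
import math

def bernoulli_presence_probs(docs, labels, cls, vocab):
    """Smoothed P(w present | c) for the Bernoulli model: the fraction of
    class-c documents containing w, not the word's total count."""
    cls_docs = [set(d) for d, c in zip(docs, labels) if c == cls]
    n = len(cls_docs)
    return {w: (sum(w in d for d in cls_docs) + 1) / (n + 2) for w in vocab}

def bernoulli_score(doc, log_prior_c, p_present):
    """Score one class: present words contribute log p, and, unlike the
    multinomial model, absent vocabulary words contribute log(1 - p)."""
    present = set(doc)
    score = log_prior_c
    for w, p in p_present.items():
        score += math.log(p) if w in present else math.log(1 - p)
    return score
```

Because every vocabulary word enters the score, the Bernoulli model penalises a class for words it expects but does not see, which is one reason it degrades on long documents with large vocabularies.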
Strengths and Limitations
Naive Bayes has several practical advantages that explain its enduring popularity. Training requires only a single pass through the data to compute word counts per class, making it extremely fast — linear in the number of training documents and vocabulary size. The model is highly interpretable: the most discriminative features for each class can be identified by examining the likelihood ratios P(w | c₁) / P(w | c₂). It also performs well with small training sets, since the independence assumption acts as a strong regulariser that prevents overfitting.
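The likelihood-ratio inspection mentioned above can be sketched as follows, again assuming tokenised documents; the helper name is hypothetical:

```python
import math
from collections import Counter

def top_features(docs, labels, c1, c2, k=3):
    """Rank words by the smoothed log likelihood ratio
    log [P(w | c1) / P(w | c2)]; large values are strong evidence for c1."""
    counts = {c: Counter() for c in (c1, c2)}
    for d, c in zip(docs, labels):
        if c in counts:
            counts[c].update(d)
    vocab = set(counts[c1]) | set(counts[c2])
    tot = {c: sum(counts[c].values()) for c in (c1, c2)}
    def llr(w):
        p1 = (counts[c1][w] + 1) / (tot[c1] + len(vocab))
        p2 = (counts[c2][w] + 1) / (tot[c2] + len(vocab))
        return math.log(p1 / p2)
    return sorted(vocab, key=llr, reverse=True)[:k]
```

Sorting the vocabulary by this ratio gives an immediate, human-readable summary of what the classifier has learned about each class.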
Naive Bayes became widely known through its application to email spam filtering. Sahami et al. (1998) demonstrated that a simple Naive Bayes classifier trained on word features could effectively distinguish spam from legitimate email. Paul Graham's 2002 essay "A Plan for Spam" popularised the approach and led to its adoption in major email clients. The success of Naive Bayes in spam filtering illustrated how a theoretically naive model could solve a practical problem of enormous scale.
The primary limitation of Naive Bayes is the independence assumption itself. Correlated features — such as the bigram "New York" — are treated as independent evidence, which can lead to overconfident predictions. Complement Naive Bayes (Rennie et al., 2003) addresses some of these issues by estimating parameters using data from all classes except the target class, yielding improved performance on imbalanced datasets. Nevertheless, for many text classification tasks, discriminative models such as logistic regression and SVMs achieve higher accuracy by modelling feature interactions implicitly.
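The complement estimate is simple to state in code. The sketch below shows only the core parameter estimate and scoring rule, omitting the additional weight-normalisation steps Rennie et al. also propose; names are illustrative:

```python
import math
from collections import Counter

def complement_log_lik(docs, labels, cls):
    """Complement estimate: word likelihoods from every class EXCEPT cls.
    With imbalanced data, the complement pools far more text than a small
    target class would provide on its own."""
    comp = Counter()
    vocab = set()
    for d, c in zip(docs, labels):
        vocab.update(d)
        if c != cls:
            comp.update(d)
    total = sum(comp.values())
    return {w: math.log((comp[w] + 1) / (total + len(vocab))) for w in vocab}

def cnb_score(doc, comp_ll):
    """Score is higher when the document looks UNLIKE the complement of the
    class, hence the negation."""
    return -sum(comp_ll.get(w, 0.0) for w in doc)
```

Classification then picks the class whose complement the document matches least well, which sidesteps the poorly-estimated parameters of under-represented classes.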