Logistic regression is a discriminative linear classifier that directly models the posterior probability P(c | x) as a logistic (sigmoid) function of a linear combination of features. Unlike Naive Bayes, which models the generative process by which documents are produced, logistic regression learns the decision boundary directly by maximising the conditional log-likelihood of the training labels. For text classification, logistic regression operates on the same feature representations as other linear classifiers — bag-of-words, TF-IDF, or n-gram vectors — and consistently achieves strong performance, particularly when combined with appropriate regularisation.
Maximum Conditional Likelihood and Regularisation
Model: P(y = 1 | x; w) = σ(w · x) = 1 / (1 + e^{-w · x})
Gradient of the L2-regularised log-likelihood: ∂L/∂w = ∑ᵢ (yᵢ − σ(w · xᵢ)) xᵢ − 2λw
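The model and gradient above can be sketched directly in NumPy. This is a minimal illustration of gradient ascent on the L2-regularised log-likelihood using a tiny hypothetical dataset; the learning rate, λ, and iteration count are illustrative choices, not tuned values.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function sigma(z) = 1 / (1 + e^{-z}).
    return np.where(z >= 0, 1.0 / (1.0 + np.exp(-z)), np.exp(z) / (1.0 + np.exp(z)))

def fit_logreg(X, y, lam=0.1, lr=0.1, n_iter=500):
    """Gradient ascent on the L2-regularised conditional log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                  # sigma(w . x_i) for every example
        grad = X.T @ (y - p) - 2 * lam * w  # sum_i (y_i - sigma(w . x_i)) x_i - 2*lam*w
        w += lr * grad
    return w

# Toy linearly separable data: the positive class has a larger first feature.
X = np.array([[2.0, 1.0], [1.5, 0.5], [-1.0, 0.2], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
w = fit_logreg(X, y)
probs = sigmoid(X @ w)  # posterior P(y = 1 | x) for each training example
```

Because the objective is convex, this simple first-order loop converges to the unique global maximum; production implementations use L-BFGS or SGD for speed, as noted below.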
The parameters of logistic regression are estimated by maximising the conditional log-likelihood of the training data, which is a convex function and therefore has a unique global maximum. Optimisation is typically performed using gradient-based methods such as L-BFGS or stochastic gradient descent. For text classification, where the feature space is high-dimensional and sparse, L2 regularisation prevents overfitting by penalising large weights, while L1 regularisation induces sparsity in the weight vector, effectively performing feature selection by driving irrelevant feature weights to zero.
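The contrast between L2 shrinkage and L1-induced sparsity can be seen with scikit-learn. A minimal sketch, assuming scikit-learn is available; the toy documents and C value are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus with binary sentiment labels.
docs = ["great movie loved it", "wonderful acting great plot",
        "terrible movie hated it", "awful plot bad acting"]
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF features

# L2 penalty is smooth, so L-BFGS applies; all weights are shrunk but stay nonzero.
l2_clf = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0).fit(X, labels)

# L1 penalty is non-smooth and needs a coordinate-style solver (liblinear/saga);
# it drives irrelevant feature weights exactly to zero.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, labels)
```

Comparing `(clf.coef_ != 0).sum()` for the two models shows the L1 weight vector is sparser, which is why L1 is described as performing implicit feature selection.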
Multiclass Extension and Comparison to Other Models
Logistic regression extends naturally to the multiclass setting through the softmax function, which generalises the sigmoid to k classes. The resulting model, known as multinomial logistic regression or maximum entropy classification, computes P(c | x) = exp(w_c · x) / ∑_{c'} exp(w_{c'} · x). Maximum entropy models have been widely used in NLP for tasks including part-of-speech tagging, named entity recognition, and text classification, valued for their ability to incorporate arbitrary overlapping features without the independence assumptions of generative models.
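The softmax computation P(c | x) = exp(w_c · x) / ∑_{c'} exp(w_{c'} · x) can be sketched in a few lines of NumPy. The weight matrix and feature vector here are hypothetical values for a three-class problem:

```python
import numpy as np

def softmax(scores):
    # Subtract the max score before exponentiating for numerical stability;
    # this leaves the normalised probabilities unchanged.
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

W = np.array([[1.0, -0.5],   # w_c: one weight vector per class c
              [0.2,  0.8],
              [-1.0, 0.3]])
x = np.array([2.0, 1.0])     # feature vector for one document

p = softmax(W @ x)           # P(c | x) for k = 3 classes; sums to 1
```

With k = 2 classes the softmax reduces to the sigmoid, which is why multinomial logistic regression is the natural generalisation of the binary model.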
The maximum entropy (MaxEnt) framework, mathematically equivalent to multinomial logistic regression, was introduced to NLP by Berger, Della Pietra, and Della Pietra (1996). The MaxEnt principle states that among all distributions consistent with observed feature expectations, the one with maximum entropy makes the fewest additional assumptions. This principle provided a principled way to combine diverse, overlapping features in NLP models, overcoming the limitations of the independence assumptions in generative models such as HMMs and Naive Bayes.
Compared to SVMs, logistic regression provides calibrated probability estimates rather than just class decisions, making it preferable when confidence scores are needed — for instance, in applications where the cost of misclassification varies by class or where the classifier's output feeds into a larger probabilistic system. Compared to Naive Bayes, logistic regression is its discriminative counterpart: Ng and Jordan (2002) showed that logistic regression achieves lower asymptotic error than Naive Bayes but approaches that asymptote more slowly, so Naive Bayes can outperform it when training data are scarce. In modern practice, logistic regression applied to contextual embeddings from pretrained transformers serves as the standard classification head in fine-tuned language models.
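The use of calibrated posteriors for cost-sensitive decisions can be sketched with scikit-learn's `predict_proba`. The data and the asymmetric costs below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-dimensional training set.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

# predict_proba returns posterior probabilities, not just a hard class label.
probs = clf.predict_proba(np.array([[1.5]]))[0]  # [P(y=0|x), P(y=1|x)]

# Illustrative asymmetric costs: a false negative costs 5x a false positive.
cost_fp, cost_fn = 1.0, 5.0
# Predict positive when the expected cost of saying "negative" is higher.
decide_positive = probs[1] * cost_fn > probs[0] * cost_fp
```

An SVM's raw decision score cannot be thresholded this way without a separate calibration step, which is the practical force of the comparison above.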