Support vector machines (SVMs) are discriminative classifiers that find the hyperplane in feature space that maximises the margin — the distance between the decision boundary and the nearest training examples from each class. For text classification, SVMs operate in the high-dimensional space defined by the vocabulary, where each dimension corresponds to a word or n-gram feature. Joachims (1998) demonstrated that SVMs are particularly well-suited to text classification because text data exhibits properties that align with SVM strengths: high dimensionality, sparse feature vectors, and approximate linear separability.
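To make the vocabulary-as-dimensions picture concrete, here is a minimal sketch (not from the source; tokenisation by whitespace is an assumption for illustration) of how documents become sparse vectors in a space with one dimension per word:

```python
from collections import Counter

def bag_of_words(docs):
    """Map each document to a sparse word-count vector (a dict from
    dimension index to count). Dimensions are vocabulary words; the
    vast majority of entries in each vector are implicitly zero."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = [
        {index[w]: c for w, c in Counter(d.lower().split()).items()}
        for d in docs
    ]
    return vocab, vectors

vocab, vecs = bag_of_words(["the cat sat", "the dog sat down"])
# each document touches only a handful of the |V| dimensions
```

With a realistic corpus the vocabulary runs to tens of thousands of dimensions while each document activates only a few dozen, which is exactly the high-dimensional, sparse regime the paragraph above describes.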
The Maximum Margin Principle

minimise: ½‖w‖² + C Σᵢ ξᵢ
subject to: yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0 for all i

where C is the regularisation parameter and the ξᵢ are slack variables.
The maximum margin principle provides a geometric intuition for why SVMs generalise well. Among all hyperplanes that correctly separate the training data, the one with the largest margin is least sensitive to perturbations of individual data points. Statistical learning theory formalises this intuition: the generalisation error of a linear classifier is bounded by a function of the margin and the radius of the smallest sphere enclosing the data, independent of the dimensionality. This property makes SVMs robust in high-dimensional spaces where the curse of dimensionality might otherwise cause overfitting. The soft-margin formulation introduces slack variables ξᵢ to allow misclassifications, with the regularisation parameter C controlling the trade-off between margin width and training error.
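The soft-margin trade-off can be sketched as a stochastic subgradient descent on the objective above (a Pegasos-style sketch, not from the source; the toy data, learning rate, and epoch count are illustrative assumptions):

```python
import random

def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Minimise 0.5*||w||^2 + C * sum(max(0, 1 - y_i(w.x_i + b)))
    by stochastic subgradient descent over the training points."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for i in random.sample(range(n), n):  # shuffled pass
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # the regulariser's subgradient (w itself) always applies
            grad_w, grad_b = list(w), 0.0
            if margin < 1:  # point violates the margin: hinge term active
                grad_w = [wj - C * y[i] * xj for wj, xj in zip(w, X[i])]
                grad_b = -C * y[i]
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b

# toy 2-D linearly separable data
random.seed(0)
X = [[2, 2], [3, 3], [-2, -2], [-3, -1]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

A larger C drives the hinge term harder (fewer margin violations, narrower margin); a smaller C favours a wider margin at the cost of more slack, which is the trade-off described above.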
Kernel Methods and Text Kernels
While linear SVMs are often sufficient for text classification, kernel methods extend SVMs to capture nonlinear decision boundaries by implicitly mapping data into higher-dimensional feature spaces. The kernel trick computes inner products in the transformed space without explicitly computing the transformation, making nonlinear classification computationally tractable. For text, specialised kernels such as the string kernel (Lodhi et al., 2002) compute similarity based on shared subsequences of characters, capturing morphological and sub-word patterns that bag-of-words features miss.
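As a concrete illustration of character-level similarity, here is a k-spectrum string kernel (a simplified, contiguous-substring variant of the gappy subsequence kernel of Lodhi et al.; the simplification and the choice k=3 are assumptions for brevity):

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """k-spectrum string kernel: the inner product of the two strings'
    character k-gram count vectors, computed without ever building
    the full (exponentially large) feature space explicitly."""
    grams_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    grams_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(c * grams_t[g] for g, c in grams_s.items())

# morphologically related words overlap in k-grams
# even though bag-of-words treats them as unrelated tokens
sim = spectrum_kernel("classification", "classifier", k=3)
```

The kernel returns the count of shared 3-grams (here "cla", "las", "ass", "ssi", "sif", "ifi"), capturing the shared stem that a word-level representation would miss.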
Thorsten Joachims's seminal 1998 paper demonstrated that SVMs trained on TF-IDF features achieved the best performance on the Reuters-21578 text classification benchmark, outperforming Naive Bayes, k-nearest neighbours, and decision trees. The paper also argued that SVMs handle the high dimensionality of text naturally: the margin-based generalisation bound depends on the margin and the norm of the weight vector rather than the number of features, making SVMs largely robust to the curse of dimensionality in practice.
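The TF-IDF weighting mentioned above can be computed by hand on a toy corpus (a minimal sketch of one common variant, tf × log(N/df); smoothing and normalisation details vary across implementations and are omitted here):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights: raw term frequency scaled by log inverse
    document frequency, log(N / df), for each term in each document."""
    tokenised = [d.lower().split() for d in docs]
    N = len(tokenised)
    # df[w] = number of documents containing w at least once
    df = Counter(w for toks in tokenised for w in set(toks))
    weights = []
    for toks in tokenised:
        tf = Counter(toks)
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights

w = tfidf(["the cat sat", "the dog ran", "the cat ran"])
# "the" appears in every document, so its idf, and hence its weight, is zero
```

Terms that occur in every document get weight zero, while rarer, more discriminative terms are up-weighted, which is why TF-IDF vectors pair so well with a margin-based classifier.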
SVMs dominated text classification benchmarks from the late 1990s through the early 2010s. Efficient implementations such as SVMlight and LIBSVM made training practical even on large corpora. The linear SVM remains a strong baseline for text classification, often competitive with neural approaches when labelled data is limited. However, the rise of pretrained language models has shifted the state of the art, as contextual embeddings capture semantic information that no fixed feature representation can match.