Computational Linguistics

Text Classification

Text classification assigns predefined categorical labels to documents or passages, serving as one of the most fundamental tasks in natural language processing with applications spanning spam filtering, topic categorization, and language identification.

ŷ = argmax_{c ∈ C} P(c | d)

Text classification is the task of assigning one or more labels from a predefined set C to a document d based on its content. Formally, a classifier is a function f: D → C that maps documents to categories, where the mapping is learned from a training set of labelled examples. The task encompasses binary classification (e.g., spam versus not-spam), multiclass classification (e.g., topic categorization into one of k mutually exclusive categories), and multilabel classification (e.g., tagging a news article with multiple relevant topics). Text classification is among the oldest and most practically important problems in NLP, with roots in library science and information retrieval.
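The definition of a classifier as a function f: D → C can be made concrete with a toy example. This is a sketch only: the label set and the keyword rule below are invented for illustration, not drawn from any real system — a trained classifier learns the mapping from labelled examples rather than hard-coding it.

```python
# Toy binary classifier f: D -> C, illustrating the formal definition.
# The label set C and the cue words are hypothetical, chosen for illustration.
C = {"spam", "not-spam"}

def classify(d: str) -> str:
    """Map a document d to a single label in C."""
    spam_cues = {"winner", "free", "prize"}  # hypothetical cue words
    return "spam" if set(d.lower().split()) & spam_cues else "not-spam"
```

Multilabel classification would instead return a subset of C rather than a single element.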

Probabilistic and Discriminative Approaches

Bayes Decision Rule for Classification

ŷ = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c)

Generative: model P(d | c) directly (e.g., Naive Bayes)
Discriminative: model P(c | d) directly (e.g., Logistic Regression)

Text classifiers fall into two broad paradigms. Generative classifiers such as Naive Bayes model the joint distribution P(d, c) = P(d | c) P(c) and apply Bayes' rule to compute the posterior P(c | d); the evidence term P(d) is constant across classes, so it can be dropped from the argmax, as in the decision rule above. Discriminative classifiers such as logistic regression and support vector machines model the decision boundary directly, learning P(c | d) or a discriminant function without modelling how documents are generated. Empirically, discriminative models tend to achieve higher accuracy when training data is plentiful, while generative models can be more effective with limited labelled data due to their stronger inductive bias.
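The discriminative side of this contrast can be sketched with a miniature logistic regression over bag-of-words features. The weights below are hand-set for a hypothetical spam task rather than learned; a real model would fit them by maximising the conditional likelihood of the training labels.

```python
import math

# Hand-set weights for a hypothetical spam/not-spam task (illustration only).
weights = {"free": 2.0, "winner": 1.5, "meeting": -1.0}
bias = -1.0

def p_spam(d: str) -> float:
    """Model P(c = spam | d) directly: a sigmoid over a linear score."""
    z = bias + sum(weights.get(w, 0.0) for w in d.lower().split())
    return 1.0 / (1.0 + math.exp(-z))
```

Nothing here models how documents are generated; the function only carves the feature space into regions, which is exactly the discriminative shortcut described above.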

Feature Representation and Deep Learning

The performance of any text classifier depends critically on how documents are represented. Traditional approaches use bag-of-words vectors, where each document is a vector of word counts or TF-IDF weights. More sophisticated representations include n-gram features, which capture local word order, and latent semantic features derived from dimensionality reduction techniques such as LSA or LDA. The advent of deep learning brought distributed representations: convolutional neural networks (CNNs) for capturing local patterns, recurrent neural networks (RNNs) for sequential modelling, and most recently pretrained transformer models such as BERT, which achieve state-of-the-art performance by fine-tuning on downstream classification tasks.
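As a concrete illustration of the traditional representations, the following sketch builds TF-IDF weights from scratch on a three-document toy corpus. The exact IDF formula varies between libraries; this uses the plain log(N/df) variant, so the toy corpus and formula choice are assumptions for illustration.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus
tokenised = [d.split() for d in docs]
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter()
for toks in tokenised:
    df.update(set(toks))

def tfidf(toks):
    """Bag-of-words vector with TF-IDF weights (log N/df variant)."""
    tf = Counter(toks)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

vec = tfidf(tokenised[0])
# "the" occurs in every document, so its weight is zero,
# while rarer terms like "cat" receive positive weight.
```

This is the sense in which TF-IDF down-weights uninformative function words: terms shared by all documents contribute nothing to distinguishing classes.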

The Reuters Benchmark

The Reuters-21578 corpus, a collection of 21,578 newswire articles manually categorised with topic codes (evaluation conventionally uses the 90 categories that have at least one training and one test document in the standard ModApte split), served as the standard benchmark for text classification research throughout the 1990s and 2000s. The dataset's skewed class distribution and multilabel structure made it a realistic testbed for classification algorithms. More recent benchmarks such as the AG News corpus and the DBpedia ontology classification dataset have extended evaluation to larger scales, but Reuters established the experimental methodology still used in the field.

Modern text classification systems typically leverage transfer learning, where a large language model pretrained on vast unlabelled corpora is fine-tuned on a smaller task-specific labelled dataset. This paradigm, exemplified by BERT and its successors, has dramatically reduced the amount of labelled data needed for high-quality classification and has made it feasible to build accurate classifiers for specialised domains such as biomedical text and legal documents with only hundreds of training examples.

Interactive Calculator

Enter labeled training examples (one per line, format label,text) followed by a blank line and a single test line to classify. The calculator trains a Naive Bayes classifier with Laplace smoothing and shows posterior probabilities for each class.

Click Calculate to see results, or Animate to watch the statistics update one record at a time.
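The calculator's internals aren't shown here, but the computation it describes — Naive Bayes training with Laplace (add-one) smoothing followed by normalised posteriors — can be sketched as follows.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: iterable of (label, text) pairs, as in the calculator input."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for label, text in examples:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def posteriors(text, class_counts, word_counts, vocab):
    """Return P(c | text) for each class, with add-one (Laplace) smoothing."""
    total = sum(class_counts.values())
    log_scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / total)          # log prior P(c)
        denom = sum(word_counts[c].values()) + len(vocab)  # smoothed denominator
        for w in text.lower().split():
            score += math.log((word_counts[c][w] + 1) / denom)
        log_scores[c] = score
    # Normalise log scores into posterior probabilities.
    m = max(log_scores.values())
    exps = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}
```

Unseen test words still receive nonzero likelihood (1 over the smoothed denominator), which is the point of Laplace smoothing: without it, a single word never seen with a class would zero out that class's posterior.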


References

  1. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47. doi:10.1145/505282.505283
  2. Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of ECML, 137–142. doi:10.1007/BFb0026683
  3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT, 4171–4186.
  4. Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28, 649–657.
