Computational Linguistics

Feature Engineering for Text

Feature engineering for text involves designing and selecting the input representations — from simple word counts and n-grams to syntactic patterns and domain-specific lexicons — that enable machine learning models to capture the linguistic properties most relevant to a given task.

x = [x_unigram; x_bigram; x_pos; x_lex; x_syntax]

Feature engineering is the process of transforming raw text into a set of measurable properties — features — that capture the information relevant to a particular NLP task. Before the deep learning era, feature engineering was widely considered the most important step in building an NLP system, often contributing more to performance than the choice of learning algorithm. Even in the era of pretrained language models, understanding feature engineering remains essential: it provides insight into what linguistic properties matter for different tasks and remains necessary when labelled data is scarce or computational resources are limited.
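The block concatenation shown above can be sketched in a few lines of Python (the vocabularies here are hypothetical; a real system would build them from the training corpus and would typically include further blocks for POS, lexicon, and syntactic features):

```python
from collections import Counter

def unigram_bigram_vector(tokens, vocab_uni, vocab_bi):
    """Concatenate unigram and bigram count blocks into one feature
    vector, mirroring x = [x_unigram; x_bigram]."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    x_uni = [uni[w] for w in vocab_uni]
    x_bi = [bi[p] for p in vocab_bi]
    return x_uni + x_bi  # concatenation of the feature blocks

# Toy example with invented vocabularies:
tokens = "the cat sat on the mat".split()
vocab_uni = ["the", "cat", "mat"]
vocab_bi = [("the", "cat"), ("on", "the")]
vec = unigram_bigram_vector(tokens, vocab_uni, vocab_bi)
print(vec)  # [2, 1, 1, 1, 1]
```

Each block occupies a fixed span of the vector, so a linear model learns a separate weight per feature regardless of which block it came from.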

Lexical, Syntactic, and Semantic Features

Common Feature Types for Text

Lexical: unigrams, bigrams, character n-grams, word shapes

Syntactic: POS tags, dependency triples, constituency parse fragments

Semantic: word embeddings, WordNet synsets, sentiment lexicon scores

Domain: task-specific dictionaries, gazetteer membership, regex patterns

Lexical features are the most basic and include individual words (unigrams), word pairs (bigrams), and character-level n-grams. Character n-grams are robust to spelling variations and morphological inflections and have proven especially effective for language identification and authorship attribution. Syntactic features capture grammatical structure: part-of-speech tag sequences, dependency relation triples (e.g., nsubj(wrote, author)), and parse tree fragments. Semantic features encode meaning using resources such as WordNet or using distributional representations such as word embeddings. Domain-specific features might include entries from medical ontologies, legal terminology databases, or sentiment lexicons.
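Two of the lexical features above, character n-grams and word shapes, are simple to extract. The sketch below uses one common convention (a `#` boundary marker and X/x/d shape classes); the details vary between systems:

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, robust to
    inflection and minor spelling variation."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_shape(word):
    """Map each character to a shape class: X (upper), x (lower),
    d (digit); other characters pass through unchanged."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)

print(char_ngrams("cats"))    # ['#ca', 'cat', 'ats', 'ts#']
print(word_shape("COVID-19")) # 'XXXXX-dd'
```

Word shapes collapse many distinct tokens onto one feature, which is useful when, say, any capitalised-plus-digits token behaves alike for the task (e.g., named entity recognition).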

Feature Selection and Dimensionality Reduction

Text data typically yields extremely high-dimensional feature spaces — a vocabulary of 100,000 words produces 100,000-dimensional unigram vectors, and adding bigrams can increase dimensionality to millions. Feature selection methods reduce this dimensionality by retaining only the most informative features. Mutual information measures the statistical dependence between a feature and the class label: I(X; C) = Σ_{x,c} P(x, c) log( P(x, c) / (P(x) P(c)) ). Chi-squared tests evaluate whether the occurrence of a feature is independent of the class. Document frequency thresholding simply removes features that occur in too few or too many documents.
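The mutual information criterion can be sketched for a binary term-presence feature, using maximum-likelihood estimates of the probabilities (the spam/ham toy data is invented for illustration):

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """I(X; C) in bits for the binary feature X = (term present)
    against class labels C, with ML probability estimates."""
    n = len(docs)
    joint = Counter((term in doc, c) for doc, c in zip(docs, labels))
    px = Counter(term in doc for doc in docs)
    pc = Counter(labels)
    mi = 0.0
    for (x, c), count in joint.items():  # only non-zero cells
        p_xc = count / n
        mi += p_xc * math.log2(p_xc / ((px[x] / n) * (pc[c] / n)))
    return mi

docs = [{"win", "prize"}, {"meeting", "agenda"}, {"win", "cash"}, {"agenda"}]
labels = ["spam", "ham", "spam", "ham"]
print(mutual_information(docs, labels, "win"))  # 1.0: "win" fully determines the class
```

Ranking all vocabulary terms by this score and keeping the top k is the standard selection procedure studied by Yang and Pedersen (1997).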

Dimensionality reduction techniques such as Latent Semantic Analysis (LSA) project high-dimensional term-document matrices into lower-dimensional spaces that capture latent semantic structure. Principal Component Analysis (PCA) and its probabilistic variants find orthogonal directions of maximum variance. Non-negative matrix factorisation (NMF) decomposes the term-document matrix into factors with non-negative entries, producing interpretable topic-like dimensions. These methods can improve classification performance by smoothing noisy features and revealing latent structure, though they add computational cost and can obscure the interpretability of individual features.

The Feature Engineering Bottleneck

The labour-intensive nature of feature engineering was historically the primary bottleneck in NLP system development. Building a state-of-the-art system for a new task or domain required extensive experimentation with feature combinations, often consuming months of researcher effort. Deep learning models, particularly pretrained transformers, have largely automated this process by learning task-relevant features from data. However, Ribeiro et al. (2020) showed that even modern models can fail on systematic linguistic phenomena that hand-crafted features would capture, suggesting that human linguistic insight remains valuable.
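A minimal sketch of the LSA projection via truncated SVD, using NumPy; the toy term-document matrix and vocabulary are invented for illustration:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
A = np.array([
    [3, 1, 0, 0],   # "car"
    [1, 3, 0, 0],   # "engine"
    [0, 0, 2, 1],   # "ballot"
    [0, 0, 1, 2],   # "vote"
], dtype=float)

# LSA keeps only the top-k singular triplets of the SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # each row: a document in latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vectors[0], doc_vectors[1]))  # ≈ 1: same latent dimension
print(cosine(doc_vectors[0], doc_vectors[2]))  # ≈ 0: different latent dimensions
```

Documents 0 and 1 share no compressed-away detail here, but in realistic matrices the truncation is what lets LSA score documents as similar even when they share no literal terms (Deerwester et al., 1990).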


References

  1. Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proceedings of ICML, 412–420.
  2. Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1305.
  3. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
  4. Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. Proceedings of ACL, 4902–4912.
