Feature engineering is the process of transforming raw text into a set of measurable properties — features — that capture the information relevant to a particular NLP task. Before the deep learning era, feature engineering was widely considered the most important step in building an NLP system, often contributing more to performance than the choice of learning algorithm. Even in the era of pretrained language models, understanding feature engineering remains essential: it provides insight into what linguistic properties matter for different tasks and remains necessary when labelled data is scarce or computational resources are limited.
Lexical, Syntactic, and Semantic Features
Lexical: unigrams, bigrams, character n-grams
Syntactic: POS tags, dependency triples, constituency parse fragments
Semantic: word embeddings, WordNet synsets, sentiment lexicon scores
Domain: task-specific dictionaries, gazetteer membership, regex patterns
Lexical features are the most basic and include individual words (unigrams), word pairs (bigrams), and character-level n-grams. Character n-grams are robust to spelling variations and morphological inflections and have proven especially effective for language identification and authorship attribution. Syntactic features capture grammatical structure: part-of-speech tag sequences, dependency relation triples (e.g., nsubj(wrote, author)), and parse tree fragments. Semantic features encode meaning using resources such as WordNet or using distributional representations such as word embeddings. Domain-specific features might include entries from medical ontologies, legal terminology databases, or sentiment lexicons.
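As a concrete illustration of the lexical case, the following sketch extracts character n-gram counts in plain Python (the function names are hypothetical; in practice a library such as scikit-learn's CountVectorizer with analyzer='char_wb' does the same job at scale). Padding with spaces makes word-boundary affixes explicit, which is part of why character n-grams tolerate spelling variation:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, padding the ends with
    spaces so prefixes and suffixes appear as distinct n-grams."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_features(text, n=3):
    """Count character n-grams into a feature dictionary."""
    return Counter(char_ngrams(text.lower(), n))

# Spelling variants still share most of their trigrams,
# whereas their unigram features would not match at all.
a = ngram_features("colour")
b = ngram_features("color")
shared = set(a) & set(b)   # {" co", "col", "olo"}
```

The overlap between variant spellings is what makes these features robust for tasks like language identification, where no fixed vocabulary can cover all surface forms.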
Feature Selection and Dimensionality Reduction
Text data typically yields extremely high-dimensional feature spaces — a vocabulary of 100,000 words produces 100,000-dimensional unigram vectors, and adding bigrams can increase dimensionality to millions. Feature selection methods reduce this dimensionality by retaining only the most informative features. Mutual information measures the statistical dependence between a feature and the class label: I(X; C) = ∑_x ∑_c P(x, c) log(P(x, c) / (P(x) P(c))). Chi-squared tests evaluate whether the occurrence of a feature is independent of the class. Document frequency thresholding simply removes features that occur in too few or too many documents.
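For a binary feature (a word present or absent in a document) and a binary class, the mutual information sum above reduces to four terms computable from a 2×2 contingency table of document counts. A minimal sketch (the function name is illustrative, not a library API):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(X; C) = sum over x, c of P(x, c) * log2(P(x, c) / (P(x) P(c))),
    estimated from a 2x2 table of document counts:
      n11: feature present, class positive    n10: present, negative
      n01: feature absent,  class positive    n00: absent,  negative
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each tuple: (joint count, marginal count of x, marginal count of c)
    for nxc, nx, nc in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if nxc > 0:  # 0 * log 0 is taken as 0
            mi += (nxc / n) * math.log2((nxc * n) / (nx * nc))
    return mi

# A feature perfectly correlated with the class carries 1 bit:
#   mutual_information(50, 0, 0, 50) == 1.0
# An independent feature carries none:
#   mutual_information(25, 25, 25, 25) == 0.0
```

Feature selection then amounts to scoring every vocabulary item this way against each class and keeping the top-k scorers.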
The labour-intensive nature of feature engineering was historically the primary bottleneck in NLP system development. Building a state-of-the-art system for a new task or domain required extensive experimentation with feature combinations, often consuming months of researcher effort. Deep learning models, particularly pretrained transformers, have largely automated this process by learning task-relevant features from data. However, Ribeiro et al. (2020) showed that even modern models can fail on systematic linguistic phenomena that hand-crafted features would capture, suggesting that human linguistic insight remains valuable.
Dimensionality reduction techniques such as Latent Semantic Analysis (LSA) project high-dimensional term-document matrices into lower-dimensional spaces that capture latent semantic structure. Principal Component Analysis (PCA) and its probabilistic variants find orthogonal directions of maximum variance. Non-negative matrix factorisation (NMF) decomposes the term-document matrix into factors with non-negative entries, producing interpretable topic-like dimensions. These methods can improve classification performance by smoothing noisy features and revealing latent structure, though they add computational cost and can obscure the interpretability of individual features.
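The LSA step can be sketched with a truncated SVD on a toy term-document matrix (assuming NumPy; for the large sparse matrices of real corpora, something like scikit-learn's TruncatedSVD is the usual tool). The illustrative counts below are invented for the example:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# Two latent "topics" are baked in: nautical terms co-occur in
# documents 0-1, legal terms in documents 2-3.
X = np.array([
    [2, 1, 0, 0],   # "ship"
    [1, 2, 0, 0],   # "boat"
    [0, 0, 2, 1],   # "court"
    [0, 0, 1, 2],   # "judge"
], dtype=float)

# Full SVD, then keep only the top-k singular values/vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# Document representations in the k-dimensional latent space:
# one k-vector per document instead of one count per vocabulary term.
docs_k = np.diag(s[:k]) @ Vt[:k, :]           # shape (k, n_docs)
```

Keeping k of the singular values discards the directions of least variance, which is the smoothing effect described above: documents that share latent structure end up close in the reduced space even if their raw term vectors barely overlap.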