Corpus linguistics studies language through the analysis of corpora — large, systematically collected bodies of naturally occurring text or speech. Rather than relying on introspective judgments about what sentences are grammatical or what words mean, corpus linguistics grounds linguistic analysis in empirical evidence drawn from authentic language use. This empirical orientation has made corpus linguistics foundational to computational linguistics, providing both the data on which NLP systems are trained and the methodological framework for evaluating linguistic hypotheses against usage patterns at scale.
Corpus Design and Compilation
Basic corpus statistics (N = total tokens in the corpus, f(w) = raw frequency of word w):
Type frequency: |V| = number of distinct word types
Normalized frequency: f_norm(w) = f(w) / N × 10⁶ (per million words)
Collocational strength (pointwise mutual information):
MI(w₁,w₂) = log₂[P(w₁,w₂) / (P(w₁) · P(w₂))]
Log-likelihood ratio: G² = 2 Σ O · ln(O/E), where O and E are observed and expected frequencies
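The frequency and MI formulas above can be computed directly from token counts. The sketch below is illustrative only: it assumes the corpus is a flat list of tokens and treats adjacent pairs as bigrams, ignoring sentence boundaries and smoothing.

```python
import math
from collections import Counter

def collocation_stats(tokens, w1, w2):
    """Pointwise mutual information and per-million frequency for the
    bigram (w1, w2), following the formulas above.

    Illustrative sketch: assumes a flat token list, counts only
    adjacent bigrams, and applies no smoothing."""
    N = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p1 = unigrams[w1] / N                      # P(w1)
    p2 = unigrams[w2] / N                      # P(w2)
    p12 = bigrams[(w1, w2)] / (N - 1)          # P(w1, w2) over N-1 bigram slots
    mi = math.log2(p12 / (p1 * p2))            # MI(w1, w2)
    f_norm = bigrams[(w1, w2)] / (N - 1) * 1_000_000   # per-million frequency
    return mi, f_norm
```

A high MI score indicates the two words co-occur far more often than their individual frequencies would predict by chance; note that MI is known to overweight rare word pairs, which is one reason log-likelihood is often preferred.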
A well-designed corpus is not simply a collection of text but a carefully curated sample designed to represent a particular language variety, register, or domain. Key design decisions include the target population of texts, the sampling strategy, the balance across genres and time periods, and the total size. The Brown Corpus (1961) pioneered balanced corpus design with one million words of American English across 15 genres. The British National Corpus (BNC) scaled to 100 million words. Modern web-crawled corpora like Common Crawl contain trillions of words but sacrifice principled sampling for scale. The tension between representativeness and size remains a fundamental issue in corpus design.
Annotation and Markup
Raw text corpora gain analytical power through annotation — the addition of linguistic information such as part-of-speech tags, syntactic parses, named entity labels, semantic roles, and discourse structure. The Penn Treebank (Marcus et al., 1993) established standards for syntactic annotation that influenced corpus linguistics for decades. Annotation schemes must balance theoretical commitments, practical reliability (inter-annotator agreement), and computational utility. The development of annotation guidelines, the training of annotators, and the measurement of annotation quality are core methodological concerns in corpus linguistics that directly impact the quality of NLP systems trained on annotated data.
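Inter-annotator agreement is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch for two annotators (real annotation projects typically use established library implementations, or multi-rater variants such as Fleiss' kappa):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each annotator's label distribution.
    Sketch for illustration; assumes both label lists are aligned."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of marginal label probabilities, summed.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Values near 1 indicate near-perfect agreement, values near 0 indicate chance-level agreement; annotation guidelines are typically revised until kappa reaches a project-defined threshold.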
The concordance — a listing of all occurrences of a word or phrase in its immediate context — is the fundamental tool of corpus linguistics. Key Word in Context (KWIC) displays align instances of a search term with surrounding text, enabling analysts to identify patterns in usage, collocational preferences, and semantic prosody. Modern corpus query tools like CQPweb, Sketch Engine, and AntConc support complex pattern searches using regular expressions and structural queries, enabling linguists to test hypotheses about collocational patterns, grammatical constructions, and register variation across millions of examples.
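A KWIC display of the kind described above can be sketched in a few lines. This toy version works over a flat token list with a fixed context window; tools like AntConc and CQPweb add indexing, regular-expression queries, and sorting by left or right context.

```python
def kwic(tokens, keyword, window=4):
    """Minimal Key Word in Context concordance: one line per hit,
    with `window` tokens of left and right context aligned around
    the keyword. Illustrative sketch; matching is case-insensitive."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left context so keywords line up vertically.
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines
```

Scanning the aligned contexts is how analysts spot collocational preferences and semantic prosody, e.g. whether a verb habitually occurs with negative-valence objects.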
Corpus Methods in Computational Linguistics
Corpus linguistics provides the empirical backbone of modern NLP. Language models are trained on corpora; parsers are evaluated against treebanks; word embeddings are learned from distributional patterns in text. The corpus-driven approach — letting patterns emerge from data rather than imposing pre-existing categories — aligns naturally with machine learning methodology. Keyword analysis identifies statistically significant vocabulary in a corpus compared to a reference corpus, revealing the distinctive features of a text type. Collocation analysis, using measures like mutual information and log-likelihood, discovers significant word combinations that are essential for lexicography, language teaching, and compositional semantics research.
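The keyword analysis described above can be sketched with the G² statistic from the formula block, in the two-corpus form popularized by Rayson and Garside: compare a word's frequency in the study corpus against a reference corpus, with expected frequencies derived from the pooled counts. The function below is a sketch under that assumption.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score G2 = 2 * sum(O * ln(O / E)).

    a, b: observed frequency of the word in the study and reference
    corpora; c, d: total tokens in each corpus. Expected frequencies
    distribute the pooled count (a + b) in proportion to corpus size.
    Sketch of the two-corpus comparison; zero counts contribute 0."""
    e1 = c * (a + b) / (c + d)   # expected frequency in study corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in reference corpus
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2
```

Words are ranked by G²; a score above 3.84 is significant at p < 0.05 (chi-squared, one degree of freedom), and the top-ranked words characterize what is distinctive about the study corpus relative to the reference.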
Contemporary corpus linguistics faces new challenges and opportunities. Ethical concerns about privacy, consent, and representation in web-scraped corpora have prompted the development of more carefully curated alternatives. Multimodal corpora that combine text with audio, video, and gesture data enable the study of language in its full communicative context. Learner corpora — collections of non-native speaker text — support computational approaches to language education. The relationship between corpus size and model quality has become a central question in the era of large language models, with scaling laws suggesting that even current massive corpora may be insufficient for training models that match human linguistic competence.