Computational Linguistics
Document Representation

Document representation transforms unstructured text into mathematical objects — vectors, matrices, or embeddings — that machine learning algorithms can process, with the choice of representation profoundly affecting the performance of all downstream text analysis tasks.
Document representation is the process of converting text into a structured mathematical form suitable for computation. Every text analysis algorithm operates not on raw text but on some representation of it, making the choice of representation one of the most consequential decisions in any NLP pipeline. The history of document representation traces a trajectory from sparse, discrete representations based on word counts to dense, continuous representations learned by neural networks, each successive paradigm capturing increasingly rich linguistic information.

Bag-of-Words and TF-IDF

TF-IDF Weighting tf-idf(t, d) = tf(t, d) × idf(t)

tf(t, d) = count of term t in document d
idf(t) = log(N / df(t))

where N is the total number of documents and df(t) is the number of documents containing term t

The bag-of-words (BoW) model represents a document as a vector of word counts, discarding word order entirely. Despite this drastic simplification, BoW representations are effective for many classification and retrieval tasks because the distribution of words in a document carries substantial information about its topic. TF-IDF weighting refines BoW by upweighting terms that are frequent in a document but rare across the corpus, capturing term specificity. Karen Sparck Jones introduced IDF in 1972, and TF-IDF remains one of the most widely used weighting schemes in information retrieval and text classification.
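The weighting scheme above can be checked with a short worked example. This is a minimal sketch of the formula as defined here (the base of the logarithm varies by convention; natural log is used below, and the function name is ours):

```python
import math

def tf_idf(tf, df, n_docs):
    """TF-IDF weight: term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)

# A term appearing 3 times in a document, and found in 10 of 1000 documents,
# gets weight 3 * log(1000 / 10) = 3 * log(100).
weight = tf_idf(3, 10, 1000)
```

A term that occurs in every document has df(t) = N, so its IDF is log(1) = 0 and it contributes nothing to the representation, which is exactly how common function words are downweighted.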

Distributed and Contextual Representations

Distributed representations address the fundamental limitations of sparse BoW vectors: their inability to capture semantic similarity (the vectors for "car" and "automobile" are orthogonal) and their high dimensionality (equal to the vocabulary size). Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) learn dense vector representations where semantically similar words occupy nearby points in a continuous vector space. Document representations can be constructed by averaging word embeddings, using Doc2Vec, or applying more sophisticated composition functions.
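The contrast with sparse BoW vectors can be illustrated with toy embeddings. The three-dimensional vectors below are invented for illustration, not trained; real embeddings typically have hundreds of dimensions:

```python
import numpy as np

# Toy 3-dimensional embeddings (illustrative values, not trained vectors).
embeddings = {
    "car":        np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.8, 0.2, 0.1]),
    "banana":     np.array([0.0, 0.9, 0.4]),
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def doc_vector(tokens):
    """Represent a document as the mean of its word embeddings."""
    return np.mean([embeddings[t] for t in tokens], axis=0)

# In a dense space, "car" and "automobile" score high, where their
# one-hot BoW vectors would be orthogonal (similarity exactly 0).
sim_related = cosine(embeddings["car"], embeddings["automobile"])
sim_unrelated = cosine(embeddings["car"], embeddings["banana"])
```

Averaging is the simplest composition function; Doc2Vec and learned pooling layers replace it with representations trained end to end.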

The Distributional Hypothesis

All distributed representations rest on the distributional hypothesis articulated by Zellig Harris (1954) and popularised by J. R. Firth's dictum that "you shall know a word by the company it keeps." Words that appear in similar contexts tend to have similar meanings, and this statistical regularity provides the signal that embedding algorithms exploit. The success of word embeddings validated decades of theoretical work in distributional semantics and established vector space models as the dominant paradigm in computational semantics.

Contextual embeddings from pretrained language models such as ELMo, BERT, and GPT represent the current frontier of document representation. Unlike static embeddings, which assign a single vector to each word type, contextual embeddings produce a different representation for each word token depending on its surrounding context (ELMo via a bidirectional LSTM, BERT and GPT via transformer layers), naturally handling polysemy and context-dependent meaning. A document can be represented by the special [CLS] token embedding in BERT, by averaging token embeddings, or by pooling strategies that capture different aspects of the document's content. These representations have set new state-of-the-art results across virtually all text analysis benchmarks.
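The pooling strategies mentioned above can be sketched with a toy stand-in for a transformer's token outputs. The 2-dimensional vectors and the padding convention below are assumptions for illustration; real models such as BERT emit, e.g., 768-dimensional vectors and an attention mask marking padding tokens:

```python
import numpy as np

# Toy stand-in for contextual token embeddings: one row per token
# (real models produce high-dimensional vectors).
token_embeddings = np.array([
    [0.2, 0.5],   # [CLS]
    [0.7, 0.1],   # "bank"
    [0.4, 0.9],   # "account"
    [0.0, 0.0],   # padding
])
attention_mask = np.array([1, 1, 1, 0])  # 1 = real token, 0 = padding

def cls_pool(hidden):
    """Document vector = embedding of the first ([CLS]) token."""
    return hidden[0]

def mean_pool(hidden, mask):
    """Document vector = mean of token embeddings, ignoring padding."""
    m = mask[:, None]
    return (hidden * m).sum(axis=0) / m.sum()

doc_cls = cls_pool(token_embeddings)
doc_mean = mean_pool(token_embeddings, attention_mask)
```

Masking matters: without it, padding tokens would drag the mean toward zero, so the mask-weighted average is the standard form of mean pooling.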

Interactive Calculator

Enter multiple documents, one per line. The calculator computes TF-IDF vectors for each document and pairwise cosine similarity between all document pairs.
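The computation the calculator performs can be reproduced in a few lines. This sketch assumes whitespace tokenisation and lowercasing, and uses the unsmoothed IDF defined earlier (production libraries typically add smoothing to avoid zero weights):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors for a list of documents (one string each)."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n = len(docs)
    # df(t): number of documents containing term t.
    df = {t: sum(t in toks for toks in tokenised) for t in vocab}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity between two TF-IDF vectors, 0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat", "the dog sat", "quantum field theory"]
vocab, vecs = tfidf_vectors(docs)
# Documents sharing terms get positive similarity; disjoint ones get 0.
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
```

Pairwise similarity over all document pairs is then just a double loop over the vector list.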


References

  1. Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620. doi:10.1145/361219.361220
  2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111–3119.
  3. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of EMNLP, 1532–1543. doi:10.3115/v1/D14-1162
  4. Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11–21.
