Topic modeling is a family of unsupervised machine learning methods that automatically discover the abstract "topics" that pervade a collection of documents. The key insight is that documents typically address multiple themes simultaneously, and each theme is characterized by a distinctive vocabulary. A news article about the economy might combine a "finance" topic (with words like "market," "stocks," "growth") with a "politics" topic (with words like "government," "policy," "regulation"). Topic models learn these latent thematic structures from the observed word patterns, providing both a compact representation of document content and an interpretable summary of the themes present in a corpus.
Probabilistic Topic Models
The generative process for LDA, with K topics over a vocabulary of size V:

For each topic k = 1, …, K:
    φ_k ~ Dir(β) — draw the topic's word distribution
For each document d:
    θ_d ~ Dir(α) — draw the document's topic proportions
    For each word position i in d:
        z_i ~ Mult(θ_d) — draw a topic assignment
        w_i ~ Mult(φ_{z_i}) — draw the word from that topic

Marginalizing over the latent assignment z_i gives the probability of a word in document d:

P(w_i | d) = Σ_{k=1}^{K} φ_{k,w_i} · θ_{d,k}
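The generative process above can be sketched directly in code by sampling a small synthetic corpus. This is an illustrative sketch, not a reference implementation; the function name and the hyperparameter values (α = 0.1, β = 0.01) are assumptions chosen for demonstration.

```python
import numpy as np

def generate_corpus(num_docs=100, doc_len=50, K=5, V=1000,
                    alpha=0.1, beta=0.01, seed=0):
    """Sample a synthetic corpus from the LDA generative process.

    K topics over a vocabulary of size V; alpha and beta are
    symmetric Dirichlet hyperparameters (illustrative values).
    """
    rng = np.random.default_rng(seed)
    # phi_k ~ Dir(beta): one word distribution per topic
    phi = rng.dirichlet(np.full(V, beta), size=K)
    docs, thetas = [], []
    for _ in range(num_docs):
        # theta_d ~ Dir(alpha): topic proportions for this document
        theta = rng.dirichlet(np.full(K, alpha))
        # z_i ~ Mult(theta_d): a topic for each word position
        z = rng.choice(K, size=doc_len, p=theta)
        # w_i ~ Mult(phi_{z_i}): a word from the assigned topic
        words = [rng.choice(V, p=phi[k]) for k in z]
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), phi
```

Fitting a topic model is the inverse of this process: given only the observed words, recover θ and φ.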
Topic models grew out of Latent Semantic Analysis (LSA), a non-probabilistic method that applies singular value decomposition to the term-document matrix to discover latent dimensions. Probabilistic LSA (pLSA), introduced by Hofmann (1999), provided the first generative probabilistic interpretation, modeling each document as a mixture of topics. Latent Dirichlet Allocation (LDA), proposed by Blei, Ng, and Jordan (2003), added Dirichlet priors over the document-topic and topic-word distributions, yielding a fully generative Bayesian model that mitigates overfitting and allows principled inference about new documents.
Inference Methods
Exact posterior inference in topic models is intractable, so approximate methods are required. Variational inference approximates the posterior with a simpler distribution and optimizes the variational bound, as in the original LDA paper. Collapsed Gibbs sampling (Griffiths and Steyvers, 2004) iteratively resamples the topic assignment of each word conditioned on all other assignments, providing a simple and effective MCMC approach. Online variational inference (Hoffman et al., 2010) processes documents in mini-batches, enabling topic modeling of massive corpora that cannot fit in memory. Stochastic variational inference further scales these methods to millions of documents.
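A collapsed Gibbs sampler can be written compactly: each word's topic is resampled from its full conditional, which is proportional to (n_{d,k} + α)(n_{k,w} + β)/(n_k + Vβ), where the counts exclude the current word. The sketch below is a minimal, unoptimized illustration under assumed symmetric hyperparameters; real implementations vectorize this loop and add burn-in and convergence checks.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01,
                        iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of documents, each a list of word ids in [0, V).
    Returns topic assignments, doc-topic counts, topic-word counts.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # doc-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # total words per topic
    # random initial assignments, then tally the counts
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z_i = k | z_-i, w), up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw
```

After sampling, θ and φ are estimated from the smoothed counts, e.g. θ_{d,k} ≈ (n_{d,k} + α) / (n_d + Kα).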
Evaluating topic quality is a persistent challenge. Held-out perplexity — the standard measure for language models — correlates poorly with human judgments of topic interpretability (Chang et al., 2009). Topic coherence metrics, which measure the semantic relatedness of a topic's top words using co-occurrence statistics from an external corpus, better predict human ratings. The NPMI (Normalized Pointwise Mutual Information) coherence measure has become a standard automatic evaluation metric, though human evaluation of topic labels and intruder word detection remains the gold standard for assessing whether discovered topics are genuinely meaningful.
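NPMI coherence can be computed from document-level co-occurrence counts in a reference corpus: for a pair of top words, NPMI(w_i, w_j) = PMI(w_i, w_j) / (−log p(w_i, w_j)), averaged over all pairs. The sketch below is a simplified illustration; the function name is an assumption, and production metrics (e.g. in gensim) typically use a sliding window over the reference corpus rather than whole documents.

```python
import numpy as np
from itertools import combinations

def npmi_coherence(top_words, ref_docs, eps=1e-12):
    """NPMI coherence for one topic (illustrative sketch).

    top_words: the topic's top words.
    ref_docs: reference corpus, each document a set of words;
    document co-occurrence estimates the probabilities.
    """
    N = len(ref_docs)
    def p(*ws):
        # fraction of reference documents containing all words in ws
        return sum(all(w in doc for w in ws) for doc in ref_docs) / N
    scores = []
    for wi, wj in combinations(top_words, 2):
        pij = p(wi, wj)
        if pij == 0:
            scores.append(-1.0)   # convention: never co-occur -> -1
            continue
        pmi = np.log(pij / (p(wi) * p(wj) + eps))
        scores.append(pmi / (-np.log(pij) + eps))
    return float(np.mean(scores))
```

Scores lie in [−1, 1]: words that always co-occur score near 1, independent words near 0, and words that never co-occur score −1.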
Extensions and Applications
The basic LDA framework has been extended in numerous directions. Dynamic topic models (Blei and Lafferty, 2006) capture how topics evolve over time. Correlated topic models relax the independence assumption between topics. Supervised topic models incorporate document labels to learn topics that are predictive of outcomes. Hierarchical topic models discover topic hierarchies at multiple levels of granularity. Author-topic models jointly model authorship and topic content. These extensions demonstrate the flexibility of the probabilistic topic modeling framework.
Topic models have been applied across a remarkable range of disciplines. In digital humanities, they reveal thematic trends in literary and historical corpora. In information retrieval, topic representations improve document ranking and recommendation. In computational social science, topic models analyze political discourse, media framing, and social media trends. In bioinformatics, analogous models discover patterns in genomic data where "documents" are genes and "words" are sequence motifs. Despite the rise of neural text representations, topic models remain valued for their interpretability, their ability to work with limited data, and their provision of explicit, human-readable topic descriptions.