# Latent Dirichlet Allocation
Topic modelling is the problem; Latent Dirichlet Allocation (LDA) is its most influential solution. Introduced by Blei, Ng, and Jordan in 2003, it gives the topic-modelling intuition a precise probabilistic form, with the Dirichlet distribution at its heart.
## The generative story
LDA describes how documents are (hypothetically) generated, then works backward from observed documents to infer the hidden structure. The story goes:
**For each topic** $k = 1, \ldots, K$:

- Draw a distribution over words: $\phi_k \sim \text{Dir}(\beta)$
This gives each topic its vocabulary. The Dirichlet prior $\beta$ controls how concentrated or diffuse the word distributions are. Small $\beta$ means each topic focuses on a few words; large $\beta$ means topics are more uniform.
**For each document** $d$:

- Draw a distribution over topics: $\theta_d \sim \text{Dir}(\alpha)$
This gives each document its topic mixture. The Dirichlet prior $\alpha$ controls how many topics each document uses. Small $\alpha$ means documents focus on one or two topics; large $\alpha$ means documents mix many topics.
**For each word position** $n$ in document $d$:

1. Choose a topic: $z_{dn} \sim \text{Categorical}(\theta_d)$
2. Choose a word: $w_{dn} \sim \text{Categorical}(\phi_{z_{dn}})$
That's the complete model. Every word in the corpus was generated by first picking a topic from the document's mixture, then picking a word from that topic's vocabulary.
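The generative story can be run directly as code. This is a sketch with made-up sizes and hyperparameters, using numpy's `dirichlet` and `choice` samplers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3 topics, 8 vocabulary words, 2 documents.
K, V, D = 3, 8, 2
alpha, beta = 0.5, 0.1        # Dirichlet concentration parameters
doc_lengths = [6, 5]

# For each topic k: draw a word distribution phi_k ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)

corpus = []
for d in range(D):
    # For each document d: draw topic proportions theta_d ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(doc_lengths[d]):
        z = rng.choice(K, p=theta)      # choose a topic from the mixture
        w = rng.choice(V, p=phi[z])     # choose a word from that topic
        words.append(int(w))
    corpus.append(words)

print(corpus)   # two lists of word ids
```

Running the story forward like this is exactly what LDA assumes happened; inference is the reverse direction.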
## Why Dirichlet?
The Dirichlet distribution appears at two levels:
- **Document → topics ($\alpha$):** each document's topic proportions are drawn from $\text{Dir}(\alpha)$. This is the Dirichlet as a prior over probability vectors, exactly what it was designed for.
- **Topic → words ($\beta$):** each topic's word distribution is drawn from $\text{Dir}(\beta)$. Same role, different level.
The Dirichlet is the natural choice because:

- It produces probability vectors (non-negative, summing to 1), which is exactly what we need.
- Its concentration parameter controls sparsity, and real topics and documents *are* sparse.
- It is conjugate to the categorical distribution, which makes inference tractable.
## Plate notation
LDA is often shown as a plate diagram, a compact visual notation for probabilistic models:
```
┌─ K ────────────────┐
│  β ──► [φ_k]       │
└────────────────────┘
          ↑
┌─ D ─────────────────────────┐
│  α ──► [θ_d]                │
│          ↓                  │
│   ┌─ N_d ───────────────┐   │
│   │  [z_dn] ──► [w_dn]  │   │
│   └─────────────────────┘   │
└─────────────────────────────┘
```
The top plate repeats over the $K$ topics; the large plate repeats over the $D$ documents; the inner plate repeats over the $N_d$ words in each document. Shaded nodes ($w$) are observed; unshaded nodes ($\theta$, $z$, $\phi$) are latent.
## Inference: learning topics from data
You observe the words $w$. You want to infer the hidden topics $z$, the document-topic proportions $\theta$, and the topic-word distributions $\phi$. This is a posterior inference problem, and it is intractable to solve exactly.
Two main approximation strategies:
Gibbs sampling: Iterate over every word in the corpus, sampling its topic assignment $z_{dn}$ conditioned on all other assignments. After many iterations, the samples converge to the posterior. Slow but simple and accurate.
Variational inference: Approximate the true posterior with a simpler distribution, then optimise the approximation. Faster than Gibbs, scales to larger corpora, but the approximation introduces bias.
Both approaches give you the same output: estimated topics (word distributions) and document-topic proportions.
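The Gibbs route is short enough to sketch end to end. Below is a minimal collapsed Gibbs sampler (a teaching sketch, not a production implementation; the tiny corpus, hyperparameters, and iteration count are illustrative). Conjugacy is what makes each sampling step a simple ratio of counts plus Dirichlet pseudo-counts.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; `docs` is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # document-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # total words assigned to each topic
    z = []                          # topic assignment per word position
    for d, doc in enumerate(docs):  # random initialisation
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                # Remove this word's current assignment from the counts.
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # p(z = k | everything else): counts + Dirichlet pseudo-counts.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    # Posterior mean estimates of the topic-word and document-topic simplexes.
    phi = (nkw + beta) / (nk[:, None] + V * beta)
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta

# Tiny synthetic corpus: two word groups that should separate into two topics.
docs = [[0, 1, 2, 0, 1], [2, 0, 1, 2], [3, 4, 5, 3, 4], [5, 3, 4, 5]]
phi, theta = gibbs_lda(docs, K=2, V=6)
print(theta.round(2))   # each row: one document's topic proportions
```

The estimates come from the final sample's counts; a more careful implementation would average over many post-burn-in samples.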
## How LDA relates to factor analysis
Both LDA and factor analysis discover latent structure from observed data, but they make different assumptions:
| | Factor Analysis | LDA |
|---|---|---|
| Data type | Continuous (feature counts/rates) | Discrete (word counts) |
| Latent variables | Continuous factors | Discrete topic assignments |
| Mixing | Linear combination | Categorical selection |
| Prior on mixture | Often none (frequentist) or normal (Bayesian) | Dirichlet |
| Sparsity | Achieved via rotation (Varimax) | Built into the Dirichlet prior |
| Output per document | Factor scores (continuous) | Topic proportions (simplex) |
Factor analysis says: "each text's features are a weighted sum of continuous dimensions." LDA says: "each word was generated by one specific topic, and the document is a mixture of topics." Different generative assumptions, same underlying goal: finding latent structure.