Synapse

An interconnected graph of micro-tutorials

Latent Dirichlet Allocation

This is an early draft. Content may change as it gets reviewed.

Topic modelling is the problem. Latent Dirichlet Allocation (LDA) is the most influential solution. Introduced by Blei, Ng, and Jordan in 2003, it gives the topic modelling intuition a precise probabilistic form, with the Dirichlet distribution at its heart.

The generative story

LDA describes how documents are (hypothetically) generated, then works backward from observed documents to infer the hidden structure. The story goes:

For each topic $k = 1, \ldots, K$:

  - Draw a distribution over words: $\phi_k \sim \text{Dir}(\beta)$

This gives each topic its vocabulary. The Dirichlet prior $\beta$ controls how concentrated or diffuse the word distributions are. Small $\beta$ means each topic focuses on a few words; large $\beta$ means topics are more uniform.

For each document $d$:

  - Draw a distribution over topics: $\theta_d \sim \text{Dir}(\alpha)$

This gives each document its topic mixture. The Dirichlet prior $\alpha$ controls how many topics each document uses. Small $\alpha$ means documents focus on one or two topics; large $\alpha$ means documents mix many topics.
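The effect of the concentration parameter is easy to see by simulation. A small sketch with NumPy (the values of $K$ and the two $\alpha$ settings are illustrative, not from the text): a symmetric Dirichlet with small concentration produces spiky, sparse vectors, while a large concentration produces near-uniform ones.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics

# One draw from a sparse prior (alpha = 0.1) and one from a diffuse prior (alpha = 10).
sparse_theta = rng.dirichlet(np.full(K, 0.1))
diffuse_theta = rng.dirichlet(np.full(K, 10.0))

# Both are valid probability vectors over the K topics,
# but the sparse draw piles most of its mass on a few entries.
print("sparse  max proportion:", sparse_theta.max())
print("diffuse max proportion:", diffuse_theta.max())
```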

For each word position $n$ in document $d$:

  1. Choose a topic: $z_{dn} \sim \text{Categorical}(\theta_d)$
  2. Choose a word: $w_{dn} \sim \text{Categorical}(\phi_{z_{dn}})$

That's the complete model. Every word in the corpus was generated by first picking a topic from the document's mixture, then picking a word from that topic's vocabulary.
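The generative story translates almost line for line into code. A minimal sketch with NumPy, using small illustrative values for $K$, $V$, $D$ and a fixed document length $N$ (real documents vary in length; the fixed $N$ is just for brevity):

```python
import numpy as np

rng = np.random.default_rng(42)

K, V, D, N = 3, 20, 5, 50    # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1       # symmetric Dirichlet hyperparameters

# Topic level: each topic draws a distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=K)     # shape (K, V)

docs = []
for d in range(D):
    # Document level: a distribution over topics.
    theta = rng.dirichlet(np.full(K, alpha))
    # Word level: pick a topic for each position, then a word from that topic.
    z = rng.choice(K, size=N, p=theta)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])
    docs.append(w)
```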

Why Dirichlet?

The Dirichlet distribution appears at two levels:

  1. Document → topics ($\alpha$): Each document's topic proportions are drawn from $\text{Dir}(\alpha)$. This is the Dirichlet as a prior over probability vectors, exactly what it was designed for.

  2. Topic → words ($\beta$): Each topic's word distribution is drawn from $\text{Dir}(\beta)$. Same role, different level.

The Dirichlet is the natural choice because:

  - It produces probability vectors (non-negative, summing to 1), exactly what we need
  - Its concentration parameter controls sparsity, and real topics and documents *are* sparse
  - It's conjugate to the categorical distribution, making inference tractable
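Conjugacy is worth seeing concretely: with a $\text{Dir}(\alpha)$ prior over a categorical distribution and observed per-category counts $n$, the posterior is simply $\text{Dir}(\alpha + n)$. A minimal sketch (the specific numbers are illustrative):

```python
import numpy as np

# Dirichlet-categorical conjugacy: if theta ~ Dir(alpha) and we observe
# categorical draws with per-category counts n, then theta | data ~ Dir(alpha + n).
# The Bayesian update is just elementwise addition.

alpha = np.full(3, 0.5)          # symmetric prior over 3 topics
counts = np.array([12, 3, 0])    # observed topic assignments in a document

posterior = alpha + counts       # parameters of Dir(12.5, 3.5, 0.5)

# Posterior mean proportions: smoothed relative frequencies.
posterior_mean = posterior / posterior.sum()
```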

Plate notation

LDA is often shown as a plate diagram, a compact visual notation for probabilistic models:

┌─── K ───────┐
│ β → [φ_k]   │
└─────────────┘
        ↓
┌─── D ──────────────────┐
│ α → [θ_d]              │
│       ↓                │
│   ┌─── N_d ──────────┐ │
│   │ [z_dn] → [w_dn]  │ │
│   └──────────────────┘ │
└────────────────────────┘

The top plate repeats over $K$ topics. The large plate repeats over $D$ documents, and the nested plate repeats over the $N_d$ words in each document. In the standard diagram, shaded nodes ($w$) are observed; unshaded nodes ($\theta$, $z$, $\phi$) are latent.

Inference: learning topics from data

You observe the words $w$. You want to infer the hidden topics $z$, the document-topic proportions $\theta$, and the topic-word distributions $\phi$. This is a posterior inference problem, and it's intractable to solve exactly.

Two main approximation strategies:

Gibbs sampling: Iterate over every word in the corpus, sampling its topic assignment $z_{dn}$ conditioned on all other assignments. After many iterations, the samples converge to the posterior. Slow but simple and accurate.
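The Gibbs update above can be sketched in a few lines. This is an illustrative toy implementation of one collapsed Gibbs sweep (the `gibbs_pass` helper, the toy corpus, and the hyperparameter values are all invented for the example): each word's topic is resampled in proportion to how much its document likes topic $k$ times how much topic $k$ likes that word.

```python
import numpy as np

def gibbs_pass(docs, z, ndk, nkw, nk, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling over every word in the corpus.

    docs[d][n] is a word id; z[d][n] its current topic assignment.
    ndk: (D, K) document-topic counts; nkw: (K, V) topic-word counts;
    nk: (K,) topic totals. Count tables are updated in place.
    """
    V = nkw.shape[1]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            # Remove this word's current assignment from the counts.
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # p(z = k | everything else): document preference x topic-word fit.
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(len(nk), p=p / p.sum()))
            z[d][n] = k
            # Add the new assignment back.
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z

# Toy corpus: word ids in a vocabulary of size 5, with K = 2 topics.
rng = np.random.default_rng(0)
docs = [[0, 1, 1, 2], [2, 3, 4], [0, 4, 4, 3, 3]]
K, V = 2, 5
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
for d, doc in enumerate(docs):
    for w, k in zip(doc, z[d]):
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(20):   # a few sweeps; real runs use hundreds
    z = gibbs_pass(docs, z, ndk, nkw, nk, alpha=0.5, beta=0.1, rng=rng)
```

After enough sweeps, $\phi$ and $\theta$ are estimated from the count tables plus the priors, exactly the conjugate update shown earlier.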

Variational inference: Approximate the true posterior with a simpler distribution, then optimise the approximation. Faster than Gibbs, scales to larger corpora, but the approximation introduces bias.

Both approaches yield the same kind of output: estimated topics (word distributions) and document-topic proportions.
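As a quick end-to-end illustration, here is a sketch using scikit-learn's `LatentDirichletAllocation`, which implements variational inference (assumes scikit-learn is installed; the four-document corpus and hyperparameter values are invented for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "court ruled the defendant must appear pursuant to the order",
    "the defendant filed a motion in court",
    "glucose and insulin levels guide diabetes management",
    "the insulin dose was adjusted for glucose control",
]

X = CountVectorizer().fit_transform(docs)    # document-term count matrix
lda = LatentDirichletAllocation(
    n_components=2,          # K topics
    doc_topic_prior=0.5,     # alpha
    topic_word_prior=0.1,    # beta
    random_state=0,
)
theta = lda.fit_transform(X)                 # (D, K) document-topic proportions
# components_ holds unnormalised topic-word weights; normalise to get phi.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

With only four documents the topics are noisy, but the shapes of the outputs match the model: one probability vector over topics per document, one probability vector over words per topic.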

How LDA relates to factor analysis

Both LDA and factor analysis discover latent structure from observed data, but they make different assumptions:

|                     | Factor Analysis                                | LDA                             |
|---------------------|------------------------------------------------|---------------------------------|
| Data type           | Continuous (feature counts/rates)              | Discrete (word counts)          |
| Latent variables    | Continuous factors                             | Discrete topic assignments      |
| Mixing              | Linear combination                             | Categorical selection           |
| Prior on mixture    | Often none (frequentist) or normal (Bayesian)  | Dirichlet                       |
| Sparsity            | Achieved via rotation (Varimax)                | Built into the Dirichlet prior  |
| Output per document | Factor scores (continuous)                     | Topic proportions (simplex)     |

Factor analysis says: "each text's features are a weighted sum of continuous dimensions." LDA says: "each word was generated by one specific topic, and the document is a mixture of topics." Different generative assumptions, same underlying goal: finding latent structure.

Applications

**Discovering registers.** Run LDA on a mixed corpus with $K = 10$ topics. Topics emerge that correspond to discourse types: an academic topic (*analysis*, *significant*, *methodology*), a narrative topic (*said*, *walked*, *door*, *eyes*), a legal topic (*court*, *defendant*, *pursuant*). Each document's topic proportions give a soft multi-register classification, useful for texts that don't fit neatly into one register.

**Mining electronic health records.** Run LDA on 100,000 clinical notes. Topics emerge as clinical phenotypes: cardiovascular (*chest pain*, *troponin*, *catheterisation*), diabetes management (*A1C*, *insulin*, *glucose*), post-surgical (*wound*, *drain*, *ambulation*). A patient's topic proportions become a compact clinical summary, useful for cohort discovery and risk stratification.

**Community ecology.** Treat sites as documents, species as words, abundance as word count. LDA discovers latent "community types" that are probabilistic: each site is a mixture. This handles ecotones (transition zones) naturally, unlike hard clustering methods. The Dirichlet prior with small $\alpha$ encourages each site to be dominated by one or two community types, matching ecological reality.

**Galaxy spectral typing.** Treat galaxy spectra as documents, with discretised spectral features as words. LDA discovers latent "spectral components": old stellar populations, star-forming regions, AGN emission. Each galaxy is a mixture, naturally handling composite systems. This approach (spectral LDA) has been applied to SDSS data to discover galaxy types without hand-crafted templates.

**Automatic genre/style analysis.** Represent songs as bags of audio features (chroma, MFCCs, rhythmic patterns). LDA discovers latent "style components": a blues-shuffle topic, a four-on-the-floor dance topic, a fingerpicking acoustic topic. Each song's topic proportions give a nuanced style profile that handles genre blending (jazz-funk, folk-rock) more naturally than single-label classification.