Going Infinite: The Dirichlet Process
Climbing the ladder
Let’s pause and see how far we’ve come. We’ve built a ladder of increasingly powerful distributions:
Step 1 — Beta: Uncertainty about a single probability (one coin). Draw from Beta and you get a number like 0.7.
Step 2 — Dirichlet: Uncertainty about a vector of probabilities (a die with $K$ faces). Draw from Dirichlet and you get a list like (0.4, 0.35, 0.25) that sums to 1.
Step 3 — Dirichlet process: Uncertainty about an entire probability distribution over infinitely many possible outcomes. Draw from a DP and you get a complete assignment of probabilities to every possible word — infinitely many of them — summing to 1.
Each step up the ladder handles more outcomes. The DP is the limit: it handles infinitely many.
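The first two rungs can be sampled directly with NumPy; a minimal sketch (Step 3 has no one-line sampler, since a full DP draw has infinitely many atoms — the next node's constructive recipe handles that):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 -- Beta: uncertainty about a single probability (one coin).
p = rng.beta(2.0, 2.0)               # a single number in (0, 1)

# Step 2 -- Dirichlet: uncertainty about a K-sided die's probabilities.
q = rng.dirichlet([1.0, 1.0, 1.0])   # a length-3 vector summing to 1

print(p)   # one probability
print(q)   # one probability vector
```

Each call is one "crank" of the machine: rerunning with a different seed gives a different probability (Step 1) or a different probability vector (Step 2).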
What “distribution over distributions” means
The phrase sounds paradoxical, but it’s just Step 3 of the ladder above. A DP is a random machine that, each time you crank it, spits out a different probability distribution over words. One crank might give a distribution where *the* has probability 0.07 and *serendipity* has probability 0.000001. Another crank might give different numbers. The DP tells you which distributions are more or less likely to come out.
The two knobs
The DP was introduced by Thomas Ferguson in 1973. It has two parameters — think of them as two knobs on the machine:
Knob 1: $\theta$ (concentration) — This controls how spread out the distribution tends to be.
- Turn $\theta$ down (small, like 1): The DP tends to produce distributions where a few words hog most of the probability. Very concentrated. Like a language where *the*, *of*, and *and* account for most of the text.
- Turn $\theta$ up (large, like 100): The DP tends to produce distributions where probability is spread across many words. Very diffuse. Like a technical glossary where no single term dominates.
This is the same role that $\alpha$ played in the Dirichlet: small = sparse/concentrated, large = uniform/diffuse.
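Because the finite Dirichlet plays the same role, the knob's effect can be sketched without the DP at all. Here a small symmetric $\alpha$ versus a large one over a hypothetical 50-word vocabulary, comparing how much probability the single biggest word grabs:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50  # size of a toy vocabulary (an illustrative choice)

# Small total concentration: a few words hog most of the mass.
sparse = rng.dirichlet([1.0 / K] * K, size=1000)

# Large total concentration: mass spreads out toward uniform.
diffuse = rng.dirichlet([100.0 / K] * K, size=1000)

# Average share of the single most probable word per draw:
print(sparse.max(axis=1).mean())    # large: one word tends to dominate
print(diffuse.max(axis=1).mean())   # small: no single word dominates
```

Turning the knob down pushes the first number toward 1 (concentrated draws); turning it up pushes the second toward $1/K$ (diffuse draws).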
Knob 2: $H$ (base distribution) — This is a “template” distribution that tells the DP which words are more likely a priori.
If $H$ gives high probability to common English words, then draws from the DP will tend to give high probability to those words too. If $H$ is uniform (all words equally likely), then the DP has no preference.
In the finite Dirichlet, the relative sizes of the $\alpha_k$ played this role. If $\alpha_1$ was much bigger than $\alpha_2$, outcome 1 tended to get more probability. The base distribution $H$ is the infinite-dimensional version of this.
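This finite-dimensional analogy can be checked numerically: setting $\alpha_k = \theta \, H(k)$ for a hypothetical four-word vocabulary, the average of many Dirichlet draws tracks the template $H$ (the Dirichlet mean is exactly $H$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical base distribution H favoring common words:
# "the", "of", "cat", "serendipity" (labels are illustrative).
H = np.array([0.6, 0.25, 0.1, 0.05])
theta = 10.0

# Finite analog of DP(theta, H): Dirichlet with alpha_k = theta * H(k).
draws = rng.dirichlet(theta * H, size=5000)

print(draws.mean(axis=0))   # hovers around H: draws follow the template
```

Individual draws still wobble around $H$ (more wobble for small $\theta$, less for large $\theta$), but the template sets where they land on average.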
The formal definition (optional)
You can skip this box without losing the thread. It’s here for completeness.
A random probability measure $G$ is drawn from $\text{DP}(\theta, H)$ if, for every finite partition $(A_1, \ldots, A_K)$ of the word space:
$$(G(A_1), \ldots, G(A_K)) \sim \text{Dir}(\theta H(A_1), \ldots, \theta H(A_K))$$
In words: no matter how you group the words, the probabilities assigned to the groups follow a finite Dirichlet distribution. The Dirichlet process is defined by saying that all of its finite slices are Dirichlet-distributed — it’s Dirichlets all the way down.
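The "finite slices" property can be sanity-checked in the finite analog, since grouping coordinates of a Dirichlet draw yields a Dirichlet with summed parameters. A sketch over a hypothetical four-word space, merging the first two words into one group and comparing the grouped probability's moments against the predicted $\text{Beta}(\theta H(A_1), \theta H(A_2))$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0
H = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative base distribution

draws = rng.dirichlet(theta * H, size=20000)

# Slice the space into A1 = {word 1, word 2} and A2 = {word 3, word 4}.
g1 = draws[:, :2].sum(axis=1)        # G(A1) for each draw

# Predicted law: G(A1) ~ Beta(theta*H(A1), theta*H(A2)).
a, b = theta * H[:2].sum(), theta * H[2:].sum()
print(g1.mean(), a / (a + b))                          # means match
print(g1.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # variances match
```

The empirical mean and variance of the grouped probability line up with the Beta prediction, which is exactly what the definition demands of every slice.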
This definition is elegant but abstract. The next node gives a concrete, step-by-step recipe for actually constructing a draw from the DP.