The Stick-Breaking Construction
Stick-breaking (Sethuraman 1994) is the most visual way to understand the Dirichlet process: it gives you a recipe for constructing a draw from the DP, one step at a time.
The idea
Imagine you have a stick of length 1. This stick represents all the probability in the world — 100% — waiting to be divided up among words. You’re going to break this stick into pieces, one at a time, and hand each piece to a word.
Step by step
Break 1: Pick a random fraction $\beta_1$ of the stick (using a Beta distribution!). Break off that fraction. This piece has length $\pi_1 = \beta_1$. Hand it to a word drawn from the base distribution $H$. That word just got its probability.
Break 2: You now have a shorter stick — the leftover, of length $1 - \pi_1$. Pick another random fraction $\beta_2$. Break off that fraction of the remaining stick. This piece has length $\pi_2 = \beta_2 \times (1 - \pi_1)$. Hand it to another word from $H$.
Break 3: The remaining stick is shorter still. Break off another random fraction $\beta_3$ of what’s left. This piece has length $\pi_3 = \beta_3 \times (\text{what’s left after breaks 1 and 2})$. Hand it to another word.
And so on, forever. Each break takes a random fraction of whatever stick remains. The pieces get smaller on average (because the remaining stick keeps shrinking), but they never quite reach zero — there’s always a little sliver left.
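In the standard notation for this construction, the $k$-th piece is the random fraction $\beta_k$ applied to whatever survived the first $k-1$ breaks:

$$\pi_k = \beta_k \prod_{i=1}^{k-1} (1 - \beta_i)$$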
The key detail: each random fraction comes from $\text{Beta}(1, \theta)$ — and there’s our concentration parameter! The Beta(1, $\theta$) distribution controls how greedy each break is:
- $\theta$ small: Beta(1, 0.5) tends to produce large fractions (0.6, 0.8). The first few breaks eat most of the stick. A few words get big pieces; the rest get crumbs.
- $\theta$ large: Beta(1, 50) tends to produce small fractions (0.01, 0.03). Each break nibbles off a tiny piece. The probability gets spread across many, many words.
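The whole procedure fits in a few lines of Python. This is a minimal sketch using only the standard library's `random.betavariate`; the seed, the number of breaks, and the $\theta$ values are illustrative choices:

```python
import random

def stick_breaking(theta, n_breaks, rng):
    """Sample the first n_breaks stick-breaking weights for a DP with
    concentration theta. Returns the pieces and the unbroken remainder."""
    weights = []
    remaining = 1.0
    for _ in range(n_breaks):
        beta_k = rng.betavariate(1, theta)   # random fraction of the *remaining* stick
        piece = beta_k * remaining
        weights.append(piece)
        remaining -= piece
    return weights, remaining

for theta in (0.5, 5, 50):
    w, leftover = stick_breaking(theta, 20, random.Random(0))
    print(f"theta={theta:>4}: biggest piece={max(w):.3f}, "
          f"stick left after 20 breaks={leftover:.3f}")
```

Running this shows the pattern from the bullets above: with small $\theta$ the first few breaks devour the stick, while with $\theta = 50$ most of the stick survives all 20 breaks.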
Try it yourself
The top bar is the stick. Coloured segments are pieces already broken off (each one = a word’s probability). Grey is the remaining stick.
Click “Break once” a few times and watch the pieces appear. Then try “Break 20×” to see the full picture. Reset and try different $\theta$ values:
- $\theta = 1$: Three or four breaks and the stick is mostly gone. A very concentrated distribution.
- $\theta = 5$: A nice spread — maybe 15–20 visible pieces.
- $\theta = 50$: Even 20 breaks barely dent the stick. The pieces are tiny.
A concrete example
To make this tangible, here’s what a specific stick-breaking might look like with $\theta = 2$. Each $\beta$ is drawn from Beta(1, 2), which has a mean of 1/3:
| Break | Fraction $\beta_k$ | Piece size $\pi_k$ | Stick remaining |
|---|---|---|---|
| 1 | 0.41 | 41.0% | 59.0% |
| 2 | 0.28 | 16.5% | 42.5% |
| 3 | 0.35 | 14.9% | 27.6% |
| 4 | 0.52 | 14.4% | 13.2% |
| 5 | 0.19 | 2.5% | 10.7% |
| 6 | 0.63 | 6.7% | 4.0% |
| … | … | … | … |
Notice: each $\beta_k$ is a fraction of the remaining stick, not the original. So the third break takes 35% of a 42.5% stick, yielding a 14.9% piece. The pieces shrink on average, but a large $\beta_k$ late in the sequence can still produce a piece bigger than the ones just before it (break 6: 63% of 10.7% = 6.7%, more than double break 5’s 2.5%).
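The table’s arithmetic is easy to check directly. (The table rounds each row to one decimal before computing the next, so the last digit of the later rows can differ very slightly from an exact running computation.)

```python
# Fractions beta_k copied from the table above; everything else is derived.
betas = [0.41, 0.28, 0.35, 0.52, 0.19, 0.63]

remaining = 1.0
for k, b in enumerate(betas, start=1):
    piece = b * remaining        # beta_k times the *remaining* stick
    remaining -= piece
    print(f"break {k}: beta={b:.2f}  piece={piece:.1%}  remaining={remaining:.1%}")
```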
Why the result is always discrete
No matter what the base distribution $H$ looks like — even if it’s a smooth, continuous distribution — the output of stick-breaking is always a collection of point masses: specific words with specific probabilities. This is a deep property of the DP, and it’s exactly what we want for language. Words are discrete things.
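One way to see the discreteness in code: build a single draw $G$ by stick-breaking with a continuous base distribution $H = \text{Uniform}(0, 1)$, then sample many observations from $G$. Under a continuous distribution, exact repeats have probability zero, so any repeated values can only come from $G$’s point masses. This is a sketch with an illustrative seed, extending the stick lazily as observations land past the pieces broken so far:

```python
import random

def draws_from_G(theta, n_obs, rng):
    """Sample n_obs observations from one DP draw G, built lazily by
    stick-breaking; atoms are drawn from a continuous H = Uniform(0, 1)."""
    atoms, weights = [], []
    leftover = 1.0
    out = []
    for _ in range(n_obs):
        u = rng.random()          # where on the unit stick this observation lands
        acc, i = 0.0, 0
        while True:
            if i == len(atoms):   # landed past the existing pieces: break once more
                frac = rng.betavariate(1, theta)
                weights.append(frac * leftover)
                leftover -= weights[-1]
                atoms.append(rng.random())   # new atom from the continuous H
            acc += weights[i]
            if u < acc:
                out.append(atoms[i])
                break
            i += 1
    return out

draws = draws_from_G(theta=2.0, n_obs=200, rng=random.Random(42))
print(f"{len(draws)} draws, only {len(set(draws))} distinct values")
```

The draws repeat heavily: 200 observations collapse onto a small set of distinct values, even though $H$ itself would never produce a tie.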
The GEM distribution
The sequence of weights $(\pi_1, \pi_2, \pi_3, \ldots)$ produced by stick-breaking has its own name: the GEM distribution (after Griffiths, Engen, and McCloskey):
$$(\pi_1, \pi_2, \pi_3, \ldots) \sim \text{GEM}(\theta)$$
You can think of the GEM as the “probability allocation” part of the DP, separate from the “which words get which weights” part (which comes from $H$).
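One handy fact about GEM weights: each break keeps a fraction $1 - \beta_k$ of the stick, and since $\beta_k \sim \text{Beta}(1, \theta)$ has mean $1/(\theta+1)$, the expected leftover after $k$ breaks is $(\theta/(\theta+1))^k$ — it shrinks geometrically, which is why the weights sum to 1. A quick Monte Carlo check of this claim (seed and settings are illustrative):

```python
import random

theta, k, trials = 5.0, 10, 20000
rng = random.Random(0)
total = 0.0
for _ in range(trials):
    leftover = 1.0
    for _ in range(k):
        leftover *= 1 - rng.betavariate(1, theta)   # each break keeps (1 - beta) of the stick
    total += leftover

empirical = total / trials
theory = (theta / (theta + 1)) ** k
print(f"mean leftover after {k} breaks: {empirical:.4f} (theory: {theory:.4f})")
```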