The Stick-Breaking Construction
Stick-breaking (Sethuraman 1994) is the most visual way to understand the Dirichlet process: it gives you a recipe for constructing a draw from the DP, one step at a time.
The idea
Imagine you have a stick of length 1. This stick represents all the probability in the world — 100% — waiting to be divided up among words. You’re going to break this stick into pieces, one at a time, and hand each piece to a word.
Step by step
Break 1: Pick a random fraction $\beta_1$ of the stick (using a Beta distribution!). Break off that fraction. This piece has length $\pi_1 = \beta_1$. Hand it to a word drawn from the base distribution $H$. That word just got its probability.
Break 2: You now have a shorter stick — the leftover, of length $1 - \pi_1$. Pick another random fraction $\beta_2$. Break off that fraction of the remaining stick. This piece has length $\pi_2 = \beta_2 \times (1 - \pi_1)$. Hand it to another word from $H$.
Break 3: The remaining stick is shorter still. Break off another random fraction $\beta_3$ of what’s left. This piece has length $\pi_3 = \beta_3 \times (\text{what’s left after breaks 1 and 2})$. Hand it to another word.
And so on, forever. Each break takes a random fraction of whatever stick remains. The pieces get smaller on average (because the remaining stick keeps shrinking), but they never quite reach zero — there’s always a little sliver left.
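In the standard notation for this construction, the $k$-th piece is the random fraction $\beta_k$ applied to whatever survived the first $k-1$ breaks:

$$\pi_k = \beta_k \prod_{i=1}^{k-1} (1 - \beta_i)$$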
The key detail: each random fraction comes from $\text{Beta}(1, \theta)$ — and there’s our concentration parameter! The Beta(1, $\theta$) distribution controls how greedy each break is:
- $\theta$ small: Beta(1, 0.5) tends to produce large fractions (0.6, 0.8). The first few breaks eat most of the stick. A few words get big pieces; the rest get crumbs.
- $\theta$ large: Beta(1, 50) tends to produce small fractions (0.01, 0.03). Each break nibbles off a tiny piece. The probability gets spread across many, many words.
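The whole procedure fits in a few lines of Python. This is a minimal sketch using only the standard library's `random.betavariate`; the seed, the number of breaks, and the $\theta$ values are illustrative choices:

```python
import random

def stick_breaking(theta, n_breaks, rng):
    """Sample the first n_breaks stick-breaking weights for a DP with
    concentration theta. Returns the pieces and the unbroken remainder."""
    weights = []
    remaining = 1.0
    for _ in range(n_breaks):
        beta_k = rng.betavariate(1, theta)   # random fraction of the *remaining* stick
        piece = beta_k * remaining
        weights.append(piece)
        remaining -= piece
    return weights, remaining

for theta in (0.5, 5, 50):
    w, leftover = stick_breaking(theta, 20, random.Random(0))
    print(f"theta={theta:>4}: biggest piece={max(w):.3f}, "
          f"stick left after 20 breaks={leftover:.3f}")
```

Running this shows the pattern from the bullets above: with small $\theta$ the first few breaks devour the stick, while with $\theta = 50$ most of the stick survives all 20 breaks.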
Try it yourself
The top bar is the stick. Coloured segments are pieces already broken off (each one = a word’s probability). Grey is the remaining stick.
Click “Break once” a few times and watch the pieces appear. Then try “Break 20×” to see the full picture. Reset and try different $\theta$ values:
- $\theta = 1$: Three or four breaks and the stick is mostly gone. A very concentrated distribution.
- $\theta = 5$: A nice spread — maybe 15–20 visible pieces.
- $\theta = 50$: Even 20 breaks barely dent the stick. The pieces are tiny.
A concrete example
To make this tangible, here’s what a specific stick-breaking might look like with $\theta = 2$. Each $\beta$ is drawn from Beta(1, 2), which has a mean of 1/3:
| Break | Fraction $\beta_k$ | Piece size $\pi_k$ | Stick remaining |
|---|---|---|---|
| 1 | 0.41 | 41.0% | 59.0% |
| 2 | 0.28 | 16.5% | 42.5% |
| 3 | 0.35 | 14.9% | 27.6% |
| 4 | 0.52 | 14.4% | 13.2% |
| 5 | 0.19 | 2.5% | 10.7% |
| 6 | 0.63 | 6.7% | 4.0% |
| … | … | … | … |
Notice: each $\beta_k$ is a fraction of the remaining stick, not the original. So the third break takes 35% of a 42.5% stick, yielding a 14.9% piece. The pieces shrink on average, but a large $\beta_k$ late in the sequence can still produce a piece bigger than the ones just before it (break 6: 63% of 10.7% = 6.7%, more than double break 5’s 2.5%).
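The table’s arithmetic is easy to check directly. (The table rounds each row to one decimal before computing the next, so the last digit of the later rows can differ very slightly from an exact running computation.)

```python
# Fractions beta_k copied from the table above; everything else is derived.
betas = [0.41, 0.28, 0.35, 0.52, 0.19, 0.63]

remaining = 1.0
for k, b in enumerate(betas, start=1):
    piece = b * remaining        # beta_k times the *remaining* stick
    remaining -= piece
    print(f"break {k}: beta={b:.2f}  piece={piece:.1%}  remaining={remaining:.1%}")
```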
Why the result is always discrete
No matter what the base distribution $H$ looks like — even if it’s a smooth, continuous distribution — the output of stick-breaking is always a collection of point masses: specific words with specific probabilities. This is a deep property of the DP, and it’s exactly what we want for language. Words are discrete things.
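One way to see the discreteness in code: build a single draw $G$ by stick-breaking with a continuous base distribution $H = \text{Uniform}(0, 1)$, then sample many observations from $G$. Under a continuous distribution, exact repeats have probability zero, so any repeated values can only come from $G$’s point masses. This is a sketch with an illustrative seed, extending the stick lazily as observations land past the pieces broken so far:

```python
import random

def draws_from_G(theta, n_obs, rng):
    """Sample n_obs observations from one DP draw G, built lazily by
    stick-breaking; atoms are drawn from a continuous H = Uniform(0, 1)."""
    atoms, weights = [], []
    leftover = 1.0
    out = []
    for _ in range(n_obs):
        u = rng.random()          # where on the unit stick this observation lands
        acc, i = 0.0, 0
        while True:
            if i == len(atoms):   # landed past the existing pieces: break once more
                frac = rng.betavariate(1, theta)
                weights.append(frac * leftover)
                leftover -= weights[-1]
                atoms.append(rng.random())   # new atom from the continuous H
            acc += weights[i]
            if u < acc:
                out.append(atoms[i])
                break
            i += 1
    return out

draws = draws_from_G(theta=2.0, n_obs=200, rng=random.Random(42))
print(f"{len(draws)} draws, only {len(set(draws))} distinct values")
```

The draws repeat heavily: 200 observations collapse onto a small set of distinct values, even though $H$ itself would never produce a tie.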
The GEM distribution
The sequence of weights $(\pi_1, \pi_2, \pi_3, \ldots)$ produced by stick-breaking has its own name: the GEM distribution (after Griffiths, Engen, and McCloskey):
$$(\pi_1, \pi_2, \pi_3, \ldots) \sim \text{GEM}(\theta)$$
You can think of the GEM as the “probability allocation” part of the DP, separate from the “which words get which weights” part (which comes from $H$).
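One handy fact about GEM weights: each break keeps a fraction $1 - \beta_k$ of the stick, and since $\beta_k \sim \text{Beta}(1, \theta)$ has mean $1/(\theta+1)$, the expected leftover after $k$ breaks is $(\theta/(\theta+1))^k$ — it shrinks geometrically, which is why the weights sum to 1. A quick Monte Carlo check of this claim (seed and settings are illustrative):

```python
import random

theta, k, trials = 5.0, 10, 20000
rng = random.Random(0)
total = 0.0
for _ in range(trials):
    leftover = 1.0
    for _ in range(k):
        leftover *= 1 - rng.betavariate(1, theta)   # each break keeps (1 - beta) of the stick
    total += leftover

empirical = total / trials
theory = (theta / (theta + 1)) ** k
print(f"mean leftover after {k} breaks: {empirical:.4f} (theory: {theory:.4f})")
```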