Synapse

An interconnected graph of micro-tutorials

The Stick-Breaking Construction

This is an early draft. Content may change as it gets reviewed.

The most visual way to understand the Dirichlet process. Stick-breaking (Sethuraman 1994) gives you a recipe for constructing a draw from the DP step by step.

The idea

Imagine you have a stick of length 1. This stick represents all the probability in the world — 100% — waiting to be divided up among words. You’re going to break this stick into pieces, one at a time, and hand each piece to a word.

Step by step

Break 1: Pick a random fraction $\beta_1$ of the stick (using a Beta distribution!). Break off that fraction. This piece has length $\pi_1 = \beta_1$. Hand it to a word drawn from the base distribution $H$. That word just got its probability.

Break 2: You now have a shorter stick — the leftover, of length $1 - \pi_1$. Pick another random fraction $\beta_2$. Break off that fraction of the remaining stick. This piece has length $\pi_2 = \beta_2 \times (1 - \pi_1)$. Hand it to another word from $H$.

Break 3: The remaining stick is shorter still. Break off another random fraction $\beta_3$ of what’s left. This piece has length $\pi_3 = \beta_3 \times (\text{what’s left after breaks 1 and 2})$. Hand it to another word.

And so on, forever. Each break takes a random fraction of whatever stick remains. The pieces get smaller on average (because the remaining stick keeps shrinking), but they never quite reach zero — there’s always a little sliver left.

The key detail: each random fraction comes from $\text{Beta}(1, \theta)$ — and there’s our concentration parameter! The Beta(1, $\theta$) distribution, whose mean is $1/(1+\theta)$, controls how greedy each break is: a small $\theta$ means large fractions on average, so a few pieces eat most of the stick; a large $\theta$ means small fractions, so the probability spreads across many small pieces.
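The procedure above is only a few lines of code. Here is a minimal Python sketch (the function name `stick_break` is mine, not from the text), using the standard library's `random.betavariate` for the $\text{Beta}(1, \theta)$ draws:

```python
import random

def stick_break(theta, n_breaks, seed=None):
    """Simulate n_breaks steps of stick-breaking with Beta(1, theta) fractions."""
    rng = random.Random(seed)
    pieces = []
    remaining = 1.0
    for _ in range(n_breaks):
        beta = rng.betavariate(1, theta)  # random fraction of what's left
        piece = beta * remaining          # pi_k = beta_k * (stick remaining)
        pieces.append(piece)
        remaining -= piece
    return pieces, remaining

pieces, leftover = stick_break(theta=2.0, n_breaks=10, seed=0)
print([round(p, 3) for p in pieces], "leftover:", round(leftover, 3))
```

The pieces plus the leftover always sum to 1, and the leftover never hits exactly zero — the "little sliver" from above.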

Try it yourself

Try It: Stick-Breaking

The top bar is the stick. Coloured segments are pieces already broken off (each one = a word’s probability). Grey is the remaining stick.

Click “Break once” a few times and watch the pieces appear. Then try “Break 20×” to see the full picture. Reset and try different $\theta$ values:

- $\theta = 1$: Three or four breaks and the stick is mostly gone. A very concentrated distribution.
- $\theta = 5$: A nice spread — maybe 15–20 visible pieces.
- $\theta = 50$: Even 20 breaks barely dent the stick. The pieces are tiny.
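Those three behaviours follow from a simple expectation: each break keeps a fraction $1 - \beta$ of the remainder, and for $\beta \sim \text{Beta}(1, \theta)$ the mean of $1 - \beta$ is $\theta/(1+\theta)$, so by independence the expected stick remaining after $n$ breaks is $(\theta/(1+\theta))^n$. A quick check (illustrative code, not part of the demo):

```python
def expected_remaining(theta, n_breaks):
    # Each break keeps E[1 - beta] = theta / (1 + theta) of the stick,
    # and the betas are independent, so expectations multiply.
    return (theta / (1 + theta)) ** n_breaks

for theta in (1, 5, 50):
    print(theta, expected_remaining(theta, 20))
```

For $n = 20$ this gives roughly 1e-6, 0.026, and 0.67, matching the three bullets: the stick is essentially gone, mostly broken up, or barely dented.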

A concrete example

To make this tangible, here’s what a specific stick-breaking might look like with $\theta = 2$. Each $\beta$ is drawn from Beta(1, 2), which has a mean of 1/3:

| Break | Fraction $\beta_k$ | Piece size $\pi_k$ | Stick remaining |
|------:|-------------------:|-------------------:|----------------:|
| 1 | 0.41 | 41.0% | 59.0% |
| 2 | 0.28 | 16.5% | 42.5% |
| 3 | 0.35 | 14.9% | 27.6% |
| 4 | 0.52 | 14.4% | 13.2% |
| 5 | 0.19 | 2.5% | 10.7% |
| 6 | 0.63 | 6.7% | 4.0% |
| … | … | … | … |

Notice: each $\beta_k$ is a fraction of the remaining stick, not the original. So the third break takes 35% of a 42.5% stick, yielding a 14.9% piece. The pieces get smaller on average, but occasionally a large $\beta$ on a small remainder produces a surprisingly big piece (break 6: 63% of 10.7% = 6.7%).
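You can reproduce the table's arithmetic directly from the listed fractions (small last-digit differences come from the table's rounding of intermediate values):

```python
betas = [0.41, 0.28, 0.35, 0.52, 0.19, 0.63]  # the fractions from the table

pieces, remaining = [], 1.0
for k, beta in enumerate(betas, start=1):
    piece = beta * remaining  # e.g. break 3: 0.35 of a 42.5% stick = 14.9%
    pieces.append(piece)
    remaining -= piece
    print(f"break {k}: piece {piece:.1%}, remaining {remaining:.1%}")
```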

Why the result is always discrete

No matter what the base distribution $H$ looks like — even if it’s a smooth, continuous distribution — the output of stick-breaking is always a collection of point masses: specific words with specific probabilities. This is a deep property of the DP, and it’s exactly what we want for language. Words are discrete things.
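A quick simulation makes the discreteness visible. Below is a truncated sketch (100 atoms rather than infinitely many, and a Gaussian base distribution chosen purely as an illustrative stand-in for $H$): draws from the resulting measure hit the same atoms again and again, which could never happen when sampling directly from a continuous distribution.

```python
import random

rng = random.Random(42)
theta = 2.0

# Truncated stick-breaking: weights for the first 100 atoms.
weights, remaining = [], 1.0
for _ in range(100):
    beta = rng.betavariate(1, theta)
    weights.append(beta * remaining)
    remaining *= 1 - beta

# Atoms drawn from a continuous base distribution H = Normal(0, 1).
atoms = [rng.gauss(0, 1) for _ in weights]

# 1000 draws from the (approximate) DP sample: exact repeats abound.
draws = rng.choices(atoms, weights=weights, k=1000)
print(f"{len(set(draws))} distinct values among {len(draws)} draws")
```

Even though every atom came from a smooth Gaussian, the output concentrates on a handful of point masses — the defining discreteness of a DP draw.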

The GEM distribution

The sequence of weights $(\pi_1, \pi_2, \pi_3, \ldots)$ produced by stick-breaking has its own name: the GEM distribution (after Griffiths, Engen, and McCloskey):

$$(\pi_1, \pi_2, \pi_3, \ldots) \sim \text{GEM}(\theta)$$

You can think of the GEM as the “probability allocation” part of the DP, separate from the “which words get which weights” part (which comes from $H$).