Synapse

An interconnected graph of micro-tutorials

Prerequisites: The Beta Distribution

The Dirichlet Distribution

This is an early draft. Content may change as it gets reviewed.

A coin has two outcomes. But language has thousands of words. We need a way to express uncertainty about probability distributions with many outcomes, not just two.

A concrete example: three outcomes

Suppose you have a bag with red, blue, and green marbles. You don’t know the proportions. You want to express your uncertainty — not over a single number (like the coin), but over a triple of numbers: (probability of red, probability of blue, probability of green).

These three numbers must be non-negative and add up to 1. So if red has probability 0.5 and blue has probability 0.3, then green must have probability 0.2. The triple (0.5, 0.3, 0.2) is called a probability vector.

The Dirichlet distribution is a probability distribution over such probability vectors. Just as the Beta distribution gave us a curve over coin-bias values, the Dirichlet distribution gives us a “surface” over all possible probability vectors.

The parameters

The Dirichlet distribution for $K$ outcomes has $K$ parameters, written $(\alpha_1, \alpha_2, \ldots, \alpha_K)$. Each $\alpha_k$ is a positive number that plays the same role as $a$ and $b$ did for the Beta: it’s a pseudo-count for outcome $k$.

When all the $\alpha$ values are the same (the symmetric case), the single value controls how the samples look:

See it on the simplex

For three outcomes, the set of all probability vectors forms a triangle. Each corner represents one outcome getting 100% of the probability. The centre is the uniform (1/3, 1/3, 1/3).

The interactive below draws 200 random samples from a symmetric Dirichlet distribution. Each dot is one sample — one possible “truth” about the marble proportions.

Try It: Dirichlet Samples on the Simplex

Drag the slider slowly from left to right:

- At $\alpha = 0.1$: all the dots are in the corners. Each “possible world” has one dominant colour.
- At $\alpha = 1.0$: dots scatter everywhere. Every distribution is equally plausible.
- At $\alpha = 5.0$: dots drift toward the centre. You believe the colours are roughly equal.
- At $\alpha = 20$: a tight cluster at the centre. You’re very confident the distribution is uniform.
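If you want to reproduce the interactive off the page, NumPy can draw symmetric Dirichlet samples directly. A minimal sketch (the $\alpha$ values match the slider stops above; the “spread” statistic is just one convenient summary, not part of the original):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 200 samples from a symmetric 3-outcome Dirichlet at each slider stop.
for alpha in [0.1, 1.0, 5.0, 20.0]:
    samples = rng.dirichlet([alpha] * 3, size=200)
    # Every row is a probability vector: non-negative, sums to 1.
    assert np.allclose(samples.sum(axis=1), 1.0)
    # Average distance from the uniform centre (1/3, 1/3, 1/3) summarises
    # the pattern you see on the simplex: corners for small alpha,
    # a tight central cluster for large alpha.
    spread = np.linalg.norm(samples - 1 / 3, axis=1).mean()
    print(f"alpha = {alpha:5.1f}   mean distance from centre = {spread:.3f}")
```

The printed spread shrinks as $\alpha$ grows, mirroring the dots drifting from the corners toward the centre.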

This is the same pattern as the Beta distribution — small parameters give extreme distributions, large parameters give uniform-ish ones — but now in multiple dimensions.

The connection to Beta

Here’s something satisfying: when there are only $K = 2$ outcomes, the Dirichlet distribution is the Beta distribution:

$$(\pi_1, \pi_2) \sim \text{Dir}(\alpha_1, \alpha_2) \iff \pi_1 \sim \text{Beta}(\alpha_1, \alpha_2)$$

(since $\pi_2 = 1 - \pi_1$, it’s fully determined by $\pi_1$). The Dirichlet is just the multi-dimensional version of Beta. Everything we learned about coins carries over.
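You can check this equivalence numerically: the first coordinate of a two-outcome Dirichlet draw and a Beta draw with the same parameters should have matching statistics. A quick sketch (the parameter values $a = 2, b = 5$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 5.0
n = 100_000

# pi_1 from Dir(a, b): take the first coordinate (pi_2 = 1 - pi_1).
dir_pi1 = rng.dirichlet([a, b], size=n)[:, 0]

# pi_1 drawn directly from Beta(a, b).
beta_pi1 = rng.beta(a, b, size=n)

# Both sample means should sit near the Beta mean a / (a + b) = 2/7.
print(dir_pi1.mean(), beta_pi1.mean())
```

The two empirical means agree to within Monte Carlo noise, as the identity predicts.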

Learning from data: just add counts

The Dirichlet has a wonderful property. Suppose your prior for the marble bag is $\text{Dir}(\alpha_1, \alpha_2, \alpha_3)$ and you observe some data: $c_1$ red marbles, $c_2$ blue, $c_3$ green. Your updated belief (the posterior) is:

$$\text{Dir}(\alpha_1 + c_1, \;\; \alpha_2 + c_2, \;\; \alpha_3 + c_3)$$

You just add the counts to the pseudo-counts. That’s it. No complicated formulas, no numerical methods. This property is called conjugacy — the prior and posterior have the same form — and it makes computation trivially easy.
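The update really is just element-wise addition. A minimal sketch (the prior and the observed counts here are illustrative, not from the interactive):

```python
# Conjugate update for the marble bag: pseudo-counts plus observed counts.
alpha_prior = [1.0, 1.0, 1.0]   # symmetric prior over (red, blue, green)
counts = [5, 3, 2]              # observed: 5 red, 3 blue, 2 green

alpha_post = [a + c for a, c in zip(alpha_prior, counts)]
print(alpha_post)  # [6.0, 4.0, 3.0]

# The posterior mean for each colour is its alpha over the total:
total = sum(alpha_post)
print([a / total for a in alpha_post])  # red = 6/13, blue = 4/13, green = 3/13
```

No integrals, no optimisation: the posterior is fully specified by three additions.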

The problem: you have to know $K$

The Dirichlet distribution works beautifully when you know how many outcomes there are. A coin: $K = 2$. A die: $K = 6$. Marbles in three colours: $K = 3$.

But for language, what is $K$? The number of possible words?

There’s no good answer. English dictionaries list hundreds of thousands of entries, but people coin new words all the time. Proper nouns, technical jargon, borrowings from other languages, onomatopoeia — the set of possible words is open-ended.

We need a version of the Dirichlet distribution where $K$ can be infinity. That’s the Dirichlet process.