Exchangeability and de Finetti’s Theorem
This node explains why the Dirichlet process is the right tool for frequency analysis. It’s more philosophical than the other nodes — if you prefer the practical material, you can skip ahead to posterior updating without losing the thread.
A question about word order
Does the order of words matter?
For meaning: obviously yes. “The dog bit the man” and “the man bit the dog” are very different.
For frequency analysis: no. When we count how many times “the” appears, or compute the frequency spectrum, or draw a Zipf plot, we throw away all word order. The text “the dog bit the man” and any rearrangement of those same five words give identical frequency counts.
This property has a name in probability theory: exchangeability. A sequence of values is exchangeable if every reordering of the sequence is equally probable. It’s a weaker property than independence — the values can be correlated (and in language, they obviously are: seeing “the” makes “dog” more likely) — but the correlations don’t depend on position.
Every frequency-based analysis implicitly assumes exchangeability. Zipf’s law, Heaps’ law, the frequency spectrum — they’re all functions of counts, blind to order.
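The order-blindness claim is easy to verify mechanically. A minimal sketch, using the example sentence from above:

```python
import random
from collections import Counter

tokens = "the dog bit the man".split()
baseline = Counter(tokens)  # {'the': 2, 'dog': 1, 'bit': 1, 'man': 1}

# Any rearrangement of the same five tokens yields identical counts,
# so every statistic computed from counts (Zipf plot, frequency
# spectrum, vocabulary size) is blind to word order.
for _ in range(10):
    shuffled = tokens[:]
    random.shuffle(shuffled)
    assert Counter(shuffled) == baseline
```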
De Finetti’s theorem: exchangeability implies a hidden distribution
In 1931, Bruno de Finetti proved a remarkable result:
If an infinite sequence of observations is exchangeable, then there must be some underlying probability distribution $G$ such that, given $G$, the observations are independent draws from $G$.
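In symbols: exchangeability forces the joint law of the observations to be a mixture of i.i.d. laws, mixed by some prior $\pi$ over distributions,

$$P(X_1 = x_1, \ldots, X_n = x_n) = \int \prod_{i=1}^{n} G(x_i) \, d\pi(G)$$

for every $n$. The “hidden distribution” $G$ is the random draw from $\pi$, and conditioning on it makes the observations independent.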
What does this mean for language? If you treat word tokens as exchangeable (which you do, every time you count frequencies), then de Finetti says you are implicitly assuming:
- There is some “true” distribution $G$ over words (with specific probabilities for “the”, “of”, “dog”, etc.)
- Each word token is an independent draw from $G$
- You have some prior belief about what $G$ looks like
De Finetti doesn’t tell you what your prior should be. He just says: if you’re doing frequency analysis at all, you have one. The question is whether to leave it implicit or make it explicit.
The Dirichlet process is a way to make it explicit. It says: “My prior over word distributions is a DP with concentration $\theta$ and base $H$.” This is more honest than pretending you have no prior — and it gives you all the computational machinery (conjugacy, stick-breaking, Pólya urn) as a bonus.
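The Pólya urn view of the DP makes this concrete and is simple to simulate. Below is a hypothetical sketch (the function name, the $\theta = 10$ setting, and the integer “word ids” standing in for the base distribution $H$ are all illustrative choices, not anything fixed by the theory):

```python
import random

def polya_urn_sample(n, theta, base_draw, seed=0):
    """Sample n tokens from DP(theta, H) via the Polya urn scheme:
    after k tokens have been drawn, the next token is a fresh draw
    from the base distribution H with probability theta / (k + theta),
    and otherwise repeats a uniformly chosen earlier token."""
    rng = random.Random(seed)
    tokens = []
    for k in range(n):
        if rng.random() < theta / (k + theta):
            tokens.append(base_draw(rng))      # brand-new word from H
        else:
            tokens.append(rng.choice(tokens))  # rich-get-richer repeat
    return tokens

# Hypothetical base H: uniform over a large set of integer word ids,
# so two independent base draws essentially never collide.
draws = polya_urn_sample(1000, theta=10.0,
                         base_draw=lambda rng: rng.randrange(10**9))
```

The repeat step is the rich-get-richer dynamic: frequent words are chosen more often, while the vocabulary `len(set(draws))` keeps growing slowly, on the order of $\theta \log n$.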
The chain of reasoning
Putting it together:
- You count word frequencies (treating order as irrelevant) → you assume exchangeability
- De Finetti: exchangeability implies a hidden distribution $G$ and a prior over it
- The DP is the simplest prior that allows infinitely many possible words (Ferguson 1973)
- The Pitman-Yor process is the simplest prior that also produces power laws
Each step is forced by the previous one. The entire Bayesian framework falls out of a single decision: “I’m going to count words.”
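The last two steps of the chain can also be illustrated with a sketch. The urn below is the two-parameter Pitman-Yor generalization of the Pólya urn (function name and parameter settings are illustrative): with $K$ distinct words among $k$ tokens so far, a brand-new word appears with probability $(\theta + dK)/(k + \theta)$, and setting $d = 0$ recovers the plain DP urn.

```python
import random

def pitman_yor_counts(n, theta, d, seed=0):
    """Generate word counts for n tokens from a Pitman-Yor urn.
    With K distinct words among k tokens so far, the next token is a
    new word with probability (theta + d*K) / (k + theta), and an
    existing word w with probability (count(w) - d) / (k + theta).
    d = 0 reduces to the DP / Polya urn."""
    rng = random.Random(seed)
    counts = []  # counts[w] = number of tokens of word w so far
    k = 0
    for _ in range(n):
        if rng.random() * (k + theta) < theta + d * len(counts):
            counts.append(1)  # new word (a fresh draw from the base H)
        else:
            # existing word, chosen with prob proportional to count - d
            u = rng.random() * (k - d * len(counts))
            acc = 0.0
            for w, c in enumerate(counts):
                acc += c - d
                if u < acc:
                    counts[w] += 1
                    break
            else:
                counts[-1] += 1  # guard against float rounding
        k += 1
    return counts

dp_counts = pitman_yor_counts(2000, theta=10.0, d=0.0)
py_counts = pitman_yor_counts(2000, theta=10.0, d=0.5)
# With d > 0 the vocabulary grows like n**d, which is what produces
# power-law (Zipf-like) behavior; with d = 0 it grows only like
# theta * log(n).
```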