Exchangeability and de Finetti’s Theorem
This node explains why the Dirichlet process is the right tool for frequency analysis. It’s more philosophical than the other nodes — if you prefer the practical material, you can skip ahead to posterior updating without losing the thread.
A question about word order
Does the order of words matter?
For meaning: obviously yes. “The dog bit the man” and “the man bit the dog” are very different.
For frequency analysis: no. When we count how many times “the” appears, or compute the frequency spectrum, or draw a Zipf plot, we throw away all word order. The text “the dog bit the man” and any rearrangement of those same five words give identical frequency counts.
This property has a name in probability theory: exchangeability. A sequence of values is exchangeable if every reordering of the sequence is equally probable. It’s a weaker property than independence — the values can be correlated (and in language, they obviously are: seeing “the” makes “dog” more likely) — but the correlations don’t depend on position.
Every frequency-based analysis implicitly assumes exchangeability. Zipf’s law, Heaps’ law, the frequency spectrum — they’re all functions of counts, blind to order.
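The order-blindness claim is easy to verify mechanically. A minimal sketch, using the example sentence from above:

```python
import random
from collections import Counter

tokens = "the dog bit the man".split()
baseline = Counter(tokens)  # {'the': 2, 'dog': 1, 'bit': 1, 'man': 1}

# Any rearrangement of the same five tokens yields identical counts,
# so every statistic computed from counts (Zipf plot, frequency
# spectrum, vocabulary size) is blind to word order.
for _ in range(10):
    shuffled = tokens[:]
    random.shuffle(shuffled)
    assert Counter(shuffled) == baseline
```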
De Finetti’s theorem: exchangeability implies a hidden distribution
In 1931, Bruno de Finetti proved a remarkable result:
If an infinite sequence of observations is exchangeable, then there must be some underlying probability distribution $G$ such that, given $G$, the observations are independent draws from $G$.
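In symbols: exchangeability forces the joint law of the observations to be a mixture of i.i.d. laws, mixed by some prior $\pi$ over distributions,

$$P(X_1 = x_1, \ldots, X_n = x_n) = \int \prod_{i=1}^{n} G(x_i) \, d\pi(G)$$

for every $n$. The “hidden distribution” $G$ is the random draw from $\pi$, and conditioning on it makes the observations independent.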
What does this mean for language? If you treat word tokens as exchangeable (which you do, every time you count frequencies), then de Finetti says you are implicitly assuming:
- There is some “true” distribution $G$ over words (with specific probabilities for “the”, “of”, “dog”, etc.)
- Each word token is an independent draw from $G$
- You have some prior belief about what $G$ looks like
De Finetti doesn’t tell you what your prior should be. He just says: if you’re doing frequency analysis at all, you have one. The question is whether to leave it implicit or make it explicit.
The Dirichlet process is a way to make it explicit. It says: “My prior over word distributions is a DP with concentration $\theta$ and base $H$.” This is more honest than pretending you have no prior — and it gives you all the computational machinery (conjugacy, stick-breaking, Pólya urn) as a bonus.
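The Pólya urn view of the DP makes this concrete and is simple to simulate. Below is a hypothetical sketch (the function name, the $\theta = 10$ setting, and the integer “word ids” standing in for the base distribution $H$ are all illustrative choices, not anything fixed by the theory):

```python
import random

def polya_urn_sample(n, theta, base_draw, seed=0):
    """Sample n tokens from DP(theta, H) via the Polya urn scheme:
    after k tokens have been drawn, the next token is a fresh draw
    from the base distribution H with probability theta / (k + theta),
    and otherwise repeats a uniformly chosen earlier token."""
    rng = random.Random(seed)
    tokens = []
    for k in range(n):
        if rng.random() < theta / (k + theta):
            tokens.append(base_draw(rng))      # brand-new word from H
        else:
            tokens.append(rng.choice(tokens))  # rich-get-richer repeat
    return tokens

# Hypothetical base H: uniform over a large set of integer word ids,
# so two independent base draws essentially never collide.
draws = polya_urn_sample(1000, theta=10.0,
                         base_draw=lambda rng: rng.randrange(10**9))
```

The repeat step is the rich-get-richer dynamic: frequent words are chosen more often, while the vocabulary `len(set(draws))` keeps growing slowly, on the order of $\theta \log n$.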
The chain of reasoning
Putting it together:
- You count word frequencies (treating order as irrelevant) → you assume exchangeability
- De Finetti: exchangeability implies a hidden distribution $G$ and a prior over it
- The DP is the simplest prior that allows infinitely many possible words (Ferguson 1973)
- The Pitman-Yor process is the simplest prior that also produces power laws
Each step is forced by the previous one. The entire Bayesian framework falls out of a single decision: “I’m going to count words.”
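The last two steps of the chain can also be illustrated with a sketch. The urn below is the two-parameter Pitman-Yor generalization of the Pólya urn (function name and parameter settings are illustrative): with $K$ distinct words among $k$ tokens so far, a brand-new word appears with probability $(\theta + dK)/(k + \theta)$, and setting $d = 0$ recovers the plain DP urn.

```python
import random

def pitman_yor_counts(n, theta, d, seed=0):
    """Generate word counts for n tokens from a Pitman-Yor urn.
    With K distinct words among k tokens so far, the next token is a
    new word with probability (theta + d*K) / (k + theta), and an
    existing word w with probability (count(w) - d) / (k + theta).
    d = 0 reduces to the DP / Polya urn."""
    rng = random.Random(seed)
    counts = []  # counts[w] = number of tokens of word w so far
    k = 0
    for _ in range(n):
        if rng.random() * (k + theta) < theta + d * len(counts):
            counts.append(1)  # new word (a fresh draw from the base H)
        else:
            # existing word, chosen with prob proportional to count - d
            u = rng.random() * (k - d * len(counts))
            acc = 0.0
            for w, c in enumerate(counts):
                acc += c - d
                if u < acc:
                    counts[w] += 1
                    break
            else:
                counts[-1] += 1  # guard against float rounding
        k += 1
    return counts

dp_counts = pitman_yor_counts(2000, theta=10.0, d=0.0)
py_counts = pitman_yor_counts(2000, theta=10.0, d=0.5)
# With d > 0 the vocabulary grows like n**d, which is what produces
# power-law (Zipf-like) behavior; with d = 0 it grows only like
# theta * log(n).
```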