Limitations: Why We Need Pitman-Yor
Everything about the Dirichlet process works beautifully. It’s elegant, computationally simple, and theoretically grounded. So what’s missing?
The clue: the tail curves
If you generate text from a DP and plot the word frequencies against rank on a log-log scale (a Zipf plot), the dots curve downward. For text that follows Zipf’s law, they should fall on a straight line.
This means the DP generates too few rare words. The common words are about right, but the tail — the long list of words that appear only once or twice — drops off too fast.
Why: geometric decay
In the stick-breaking construction, each break takes a fraction $\beta_k$ of the remaining stick, where $\beta_k \sim \text{Beta}(1, \theta)$. The average fraction is always $1/(1 + \theta)$ — it doesn’t depend on which break you’re doing.
Since each break takes (on average) the same fraction of what’s left, the pieces shrink by a constant ratio at each step. This is geometric (or equivalently, exponential) decay:
$$E[\pi_k] = \frac{1}{\theta + 1} \cdot \left(\frac{\theta}{\theta + 1}\right)^{k-1}$$
Each piece is about $\theta / (\theta + 1)$ times the size of the previous one. On a log scale, the sizes decrease at a constant rate — a straight line on a “log-linear” plot.
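The geometric decay is easy to check numerically. The following is a minimal sketch, not code from the original text: the function name `dp_expected_sizes` and the value $\theta = 5$ are assumptions chosen for illustration. It evaluates the expected piece sizes from the formula above and computes the ratio between consecutive pieces.

```python
def dp_expected_sizes(theta, n_pieces):
    """Expected stick-breaking piece sizes E[pi_k] for a DP, k = 1..n_pieces.

    Each break takes an expected fraction 1/(theta+1) of the remaining stick,
    leaving an expected fraction theta/(theta+1) for later breaks.
    """
    take = 1.0 / (theta + 1.0)
    keep = theta / (theta + 1.0)
    return [take * keep ** (k - 1) for k in range(1, n_pieces + 1)]


theta = 5.0  # illustrative concentration value (an assumption)
sizes = dp_expected_sizes(theta, 10)
ratios = [sizes[k + 1] / sizes[k] for k in range(len(sizes) - 1)]

print([round(s, 4) for s in sizes[:4]])
print([round(r, 4) for r in ratios[:3]])
```

Every ratio equals $\theta / (\theta + 1) = 5/6$, whichever pair of neighbouring pieces you pick. That constant ratio is what "geometric" means here.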
But Zipf’s law says frequencies should decay as a power law, which gives a straight line on a “log-log” plot. Power laws have much heavier tails: each rare word is individually unlikely, but there are vastly more of them, so together they hold real probability mass. An exponential tail can’t match the sheer abundance of hapax legomena (roughly 40–60% of all types) that we see in real text.
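To see why the two plot types separate the two kinds of decay, take logarithms of each form. The symbols $C$, $r$, and $s$ are generic constants introduced only for this illustration.

$$\text{exponential: } f(k) = C\,r^{k} \;\Rightarrow\; \log f(k) = \log C + k \log r$$

$$\text{power law: } f(k) = C\,k^{-s} \;\Rightarrow\; \log f(k) = \log C - s \log k$$

The first is linear in $k$, so it is straight on a log-linear plot. The second is linear in $\log k$, so it is straight on a log-log plot. Plotted against $\log k$, the exponential form bends downward, because $k$ grows much faster than $\log k$.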
What would fix it?
The problem is that every stick break uses the same Beta(1, $\theta$) distribution — the “greediness” of each break is constant. What if later breaks were less greedy? What if the stick-breaking process slowed down, allowing more probability mass to survive into the tail?
The Pitman-Yor process makes exactly this change: instead of Beta(1, $\theta$), the $k$-th break uses $\text{Beta}(1 - d, \;\theta + kd)$, where $d$ is a new parameter (the discount, with $0 \le d < 1$) and $k = 1, 2, \dots$ is the break number.
Two things change:

1. The first Beta parameter becomes $1 - d$ instead of 1, which lowers the average fraction each break takes
2. The second Beta parameter grows with $k$ ($\theta + kd$ instead of just $\theta$), making later breaks take smaller fractions
The result: the decay switches from geometric (exponential) to polynomial (power-law). The tail becomes heavy enough to produce Zipf’s law.
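Here is a sketch of why the decay becomes a power law. It assumes $d > 0$ and that the breaks are independent (as in the DP case), so the expectation of the product factorizes. The shorthand $a$ and $b$ is introduced only for this derivation.

$$E[\beta_k] = \frac{1-d}{\theta + 1 + (k-1)d}, \qquad E[1-\beta_j] = \frac{\theta + jd}{\theta + 1 + (j-1)d}$$

$$E[\pi_k] = E[\beta_k]\prod_{j=1}^{k-1} E[1-\beta_j] = \frac{1-d}{\theta+1+(k-1)d}\cdot\frac{\Gamma(a+k-1)\,\Gamma(b)}{\Gamma(a)\,\Gamma(b+k-1)}, \qquad a = \frac{\theta}{d}+1,\quad b = \frac{\theta+1}{d}$$

For large $k$, the ratio $\Gamma(a+k-1)/\Gamma(b+k-1)$ behaves like $k^{a-b} = k^{1 - 1/d}$, and the leading factor behaves like $(1-d)/(dk)$. Multiplying the two gives

$$E[\pi_k] \propto k^{-1/d} \quad \text{as } k \to \infty$$

On a log-log plot this is a straight line with slope $-1/d$. For example, $d = 0.5$ gives a slope of $-2$.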
See the difference
This is a log-log plot of the expected piece sizes. The grey line is the DP — notice how it curves downward. The purple line is the Pitman-Yor process — it’s straight.
Things to try:

- Change $\theta$ from 5 to 50: Both lines shift up in the tail (more types overall), but the DP still curves. The curvature is not a tuning problem.
- Change $d$ from 0.1 to 0.9: The PY line’s slope changes, but it stays straight. The discount controls the Zipf exponent.
- Set $d$ very small (like 0.1): The PY line is barely different from the DP; a small discount barely helps.
- Set $d = 0.5$: Clean separation. The PY generates orders of magnitude more probability for rare words.
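If the interactive plot is not available, the same comparison can be approximated in a few lines. This is a hedged sketch rather than the code behind the plot: the helper names `expected_sizes` and `loglog_slope`, and the choices $\theta = 5$, $d = 0.5$, and 1000 pieces, are assumptions made for illustration. It computes the expected piece sizes for both processes and measures the slope of each curve on log-log axes over two ranges of $k$.

```python
import math


def expected_sizes(theta, d, n_pieces):
    """Expected piece sizes E[pi_k], k = 1..n_pieces, for breaks
    beta_k ~ Beta(1 - d, theta + k*d). Setting d = 0 gives the DP."""
    sizes = []
    remaining = 1.0  # expected fraction of the stick still unbroken
    for k in range(1, n_pieces + 1):
        mean_break = (1.0 - d) / (1.0 + theta + (k - 1) * d)  # E[beta_k]
        sizes.append(remaining * mean_break)
        remaining *= 1.0 - mean_break
    return sizes


def loglog_slope(sizes, k1, k2):
    """Slope of log E[pi_k] against log k between ranks k1 and k2 (1-indexed)."""
    return (math.log(sizes[k2 - 1]) - math.log(sizes[k1 - 1])) / (
        math.log(k2) - math.log(k1)
    )


theta = 5.0  # illustrative values, not taken from the original plot
dp = expected_sizes(theta, 0.0, 1000)
py = expected_sizes(theta, 0.5, 1000)

for label, sizes in (("DP", dp), ("PY d=0.5", py)):
    early = loglog_slope(sizes, 10, 100)
    late = loglog_slope(sizes, 100, 1000)
    print(f"{label}: slope 10-100 = {early:.2f}, slope 100-1000 = {late:.2f}")
```

A straight line on log-log axes has the same slope in both ranges. The DP slope becomes far steeper from one decade to the next, which is the downward curve. The PY slope settles toward $-2$ for $d = 0.5$, consistent with a power law whose exponent is set by the discount. Changing `theta` or `d` in the sketch is a rough stand-in for moving the sliders.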
Summary
| | Dirichlet Process | Pitman-Yor Process |
|---|---|---|
| Stick-breaking | $\beta_k \sim \text{Beta}(1, \theta)$ | $\beta_k \sim \text{Beta}(1-d, \theta + kd)$ |
| Piece decay | Geometric (exponential) | Power law |
| Zipf plot shape | Curves downward | Straight line |
| Tail | Too light — not enough rare words | Heavy — matches real language |
| Hapax legomena | Under-predicted | Correctly predicted |
| Parameters | 1: $\theta$ | 2: $\theta$ and $d$ |
When $d = 0$, the Pitman-Yor process reduces to the DP exactly. Everything about the DP (stick-breaking, the Pólya urn, exchangeability, conjugacy, posterior updating) has a Pitman-Yor counterpart, with the discount entering each formula as a small correction. The discount is a single, clean addition.
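To make the urn side of this concrete, here is a minimal simulation sketch. It uses the standard two-parameter urn rule, which this section does not spell out: with $K$ types seen among $n$ tokens, the next token is a new type with probability $(\theta + dK)/(\theta + n)$ and repeats existing type $j$ with probability $(n_j - d)/(\theta + n)$. The function name `sample_type_counts`, the seed, and the settings $\theta = 5$ and $n = 5000$ are assumptions chosen for illustration.

```python
import random


def sample_type_counts(n_tokens, theta, d, seed=0):
    """Draw n_tokens tokens from the two-parameter urn; return counts per type.

    d = 0 gives the ordinary Polya urn of the DP.
    """
    rng = random.Random(seed)
    counts = []  # counts[j] = number of tokens of type j so far
    for n in range(n_tokens):
        r = rng.random() * (theta + n)  # total weight is theta + n
        new_weight = theta + d * len(counts)
        if r < new_weight:
            counts.append(1)  # open a new type
            continue
        r -= new_weight
        for j, c in enumerate(counts):
            r -= c - d  # existing type j has weight count_j - d
            if r < 0:
                counts[j] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point round-off
    return counts


for d in (0.0, 0.5):
    counts = sample_type_counts(5000, theta=5.0, d=d, seed=1)
    hapax = sum(1 for c in counts if c == 1)
    print(f"d={d}: {len(counts)} types, {hapax} hapax legomena")
```

With the discount switched on, the same number of tokens should yield many more distinct types and a much larger share of words seen exactly once, which is the heavier tail described above. The exact counts depend on the seed.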