Synapse

An interconnected graph of micro-tutorials

Limitations: Why We Need Pitman-Yor

This is an early draft. Content may change as it gets reviewed.

Everything about the Dirichlet process works beautifully. It’s elegant, computationally simple, and theoretically grounded. So what’s missing?

The clue: the tail curves

If you generate text from a DP and plot the word frequencies on a log-log scale (a Zipf plot), the dots curve downward. On a true Zipf plot, they should form a straight line.

This means the DP generates too few rare words. The common words are about right, but the tail — the long list of words that appear only once or twice — drops off too fast.
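The clue can be checked directly. Below is a minimal simulation sketch (assumed values: $\theta = 10$, 5,000 tokens, a fixed seed) that samples from a DP via its Chinese restaurant process representation and counts the resulting word frequencies; the share of types seen only once comes out noticeably small.

```python
import random

def crp_sample(n, theta, seed=0):
    """Draw n tokens from a Dirichlet process via the Chinese
    restaurant process; return the count for each table (word type)."""
    rng = random.Random(seed)
    counts = []
    for i in range(n):
        # i customers seated so far; total weight is i + theta
        u = rng.random() * (i + theta)
        for t, c in enumerate(counts):
            u -= c                      # existing table: weight = its count
            if u < 0:
                counts[t] += 1
                break
        else:
            counts.append(1)            # new table: a previously unseen word
    return counts

counts = crp_sample(5000, theta=10.0)
hapax = sum(1 for c in counts if c == 1)
print(len(counts), "types,", hapax, "hapax legomena")
```

The exact numbers depend on the seed, but the hapax share among types stays far below what real text exhibits.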

Why: geometric decay

In the stick-breaking construction, each break takes a fraction $\beta_k$ of the remaining stick, where $\beta_k \sim \text{Beta}(1, \theta)$. The average fraction is always $1/(1 + \theta)$ — it doesn’t depend on which break you’re doing.

Since each break takes (on average) the same fraction of what’s left, the pieces shrink by a constant ratio at each step. This is geometric (or equivalently, exponential) decay:

$$E[\pi_k] \approx \frac{1}{\theta + 1} \cdot \left(\frac{\theta}{\theta + 1}\right)^{k-1}$$

Each piece is about $\theta / (\theta + 1)$ times the size of the previous one. On a log scale, the sizes decrease at a constant rate — a straight line on a “log-linear” plot.
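The formula can be verified by Monte Carlo: a small sketch (assuming $\theta = 5$ and a fixed seed) that simulates stick-breaking many times and compares the empirical mean piece sizes against the geometric formula.

```python
import random

# Monte Carlo check of geometric decay in DP stick-breaking.
# Assumed parameters: theta = 5, first 6 pieces, 200,000 replicates.
theta, K, R = 5.0, 6, 200_000
rng = random.Random(42)
mean_pi = [0.0] * K
for _ in range(R):
    remaining = 1.0
    for k in range(K):
        beta = rng.betavariate(1.0, theta)   # fraction taken at this break
        mean_pi[k] += remaining * beta / R
        remaining *= 1.0 - beta

# Compare against E[pi_k] = (1/(theta+1)) * (theta/(theta+1))**(k-1)
for k in range(K):
    expected = (1 / (theta + 1)) * (theta / (theta + 1)) ** k
    print(k + 1, round(mean_pi[k], 4), round(expected, 4))
```

Each empirical mean should land within Monte Carlo noise of the formula, and successive pieces shrink by the constant ratio $\theta/(\theta+1)$.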

But Zipf’s law says frequencies should decay as a power law — a straight line on a “log-log” plot. Power laws have much heavier tails: the rare words are rarer, but there are vastly more of them. An exponential tail can’t match the sheer abundance of hapax legomena (40–60% of all types) that we see in real text.

What would fix it?

The problem is that every stick break uses the same Beta(1, $\theta$) distribution — the “greediness” of each break is constant. What if later breaks were less greedy? What if the stick-breaking process slowed down, allowing more probability mass to survive into the tail?

The Pitman-Yor process makes exactly this change: instead of Beta(1, $\theta$), each break uses $\text{Beta}(1 - d, \;\theta + kd)$, where $d$ is a new parameter (the discount, between 0 and 1) and $k$ is the break number.

Two things change:

1. The first Beta parameter becomes $1 - d$ instead of 1, so each break takes a smaller expected fraction and more mass survives past it
2. The second Beta parameter grows with $k$ ($\theta + kd$ instead of just $\theta$), making later breaks take ever smaller fractions

The result: the decay switches from geometric (exponential) to polynomial (power-law). The tail becomes heavy enough to produce Zipf’s law.
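The switch from geometric to power-law decay can be made concrete with exact expected piece sizes: since the $\beta_k$ are independent, $E[\pi_k]$ is just a product of Beta means. A sketch with assumed parameters $\theta = 1$, $d = 0.5$: the PY log-log slope stabilizes near $-1/d$, while the DP log-log slope keeps steepening.

```python
import math

def dp_mean_sizes(theta, K):
    """Exact E[pi_k] for DP stick-breaking: geometric decay."""
    return [(1 / (theta + 1)) * (theta / (theta + 1)) ** k for k in range(K)]

def py_mean_sizes(theta, d, K):
    """Exact E[pi_k] for Pitman-Yor stick-breaking,
    where beta_k ~ Beta(1 - d, theta + k*d)."""
    sizes, survive = [], 1.0
    for k in range(1, K + 1):
        take = (1 - d) / (1 - d + theta + k * d)              # E[beta_k]
        sizes.append(survive * take)
        survive *= (theta + k * d) / (1 - d + theta + k * d)  # E[1 - beta_k]
    return sizes

def loglog_slope(sizes, k1, k2):
    """Slope of log E[pi_k] vs log k between ranks k1 and k2."""
    return (math.log(sizes[k2 - 1]) - math.log(sizes[k1 - 1])) / math.log(k2 / k1)

dp = dp_mean_sizes(theta=1.0, K=200)
py = py_mean_sizes(theta=1.0, d=0.5, K=200)

# Power law: the log-log slope is stable, close to -1/d = -2.
print("PY slopes:", loglog_slope(py, 50, 100), loglog_slope(py, 100, 200))
# Geometric decay: the log-log slope keeps getting steeper (curves down).
print("DP slopes:", loglog_slope(dp, 50, 100), loglog_slope(dp, 100, 200))
```

A stable log-log slope is exactly what a straight Zipf plot means; the discount $d$ sets that slope.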

See the difference

Decay Comparison: DP vs Pitman-Yor (interactive plot; slider defaults $\theta = 10$, $d = 0.50$)

This is a log-log plot of the expected piece sizes. The grey line is the DP — notice how it curves downward. The purple line is the Pitman-Yor process — it’s straight.

Things to try:

- Change $\theta$ from 5 to 50: both lines shift up (more types overall), but the DP still curves. The curvature is not a tuning problem.
- Change $d$ from 0.1 to 0.9: the PY line's slope changes, but it stays straight. The discount controls the Zipf exponent.
- Set $d$ very small (like 0.1): the PY line is barely different from the DP; a small discount barely helps.
- Set $d = 0.5$: clean separation. The PY process gives orders of magnitude more probability to rare words.
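The hapax prediction can also be tested by simulation using the two-parameter (Pitman-Yor) generalization of the Chinese restaurant process: an occupied table with $n_i$ customers has weight $n_i - d$, and a new table has weight $\theta + dT$ where $T$ is the current number of tables. A sketch with assumed values $\theta = 1$, $d = 0.5$, 20,000 tokens, and a fixed seed; the hapax share among types lands in the 40-60% range seen in real text.

```python
import random

def py_crp(n, theta, d, seed=1):
    """Two-parameter (Pitman-Yor) Chinese restaurant process.
    Existing table i has weight n_i - d; a new table has theta + d*T.
    These weights sum to (customers so far) + theta, so one uniform
    draw on that scale selects a table."""
    rng = random.Random(seed)
    counts = []
    for i in range(n):
        u = rng.random() * (i + theta)
        for t in range(len(counts)):
            u -= counts[t] - d
            if u < 0:
                counts[t] += 1
                break
        else:
            counts.append(1)   # open a new table (a new word type)
    return counts

counts = py_crp(20_000, theta=1.0, d=0.5)
hapax_share = sum(1 for c in counts if c == 1) / len(counts)
print(len(counts), "types; hapax share:", round(hapax_share, 2))
```

The exact share fluctuates with the seed, but it hovers near $d$: the discount directly governs how many once-only words the process produces.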

Summary

|  | Dirichlet Process | Pitman-Yor Process |
|---|---|---|
| Stick-breaking | $\beta_k \sim \text{Beta}(1, \theta)$ | $\beta_k \sim \text{Beta}(1-d, \theta + kd)$ |
| Piece decay | Geometric (exponential) | Power law |
| Zipf plot shape | Curves downward | Straight line |
| Tail | Too light: not enough rare words | Heavy: matches real language |
| Hapax legomena | Under-predicted | Correctly predicted |
| Parameters | 1: $\theta$ | 2: $\theta$ and $d$ |

When $d = 0$, the Pitman-Yor process reduces to the DP exactly. Everything about the DP — stick-breaking, the Pólya urn, exchangeability, conjugacy, posterior updating — carries over unchanged. The discount is a single, clean addition.
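The reduction at $d = 0$ is visible directly in the stick-breaking draws: with the discount zeroed out, $\text{Beta}(1-d, \theta + kd)$ is numerically $\text{Beta}(1, \theta)$ for every $k$, so two identically seeded generators produce identical breaks. A small sketch:

```python
import random

# With d = 0, Beta(1 - d, theta + k*d) = Beta(1, theta) for every k,
# so Pitman-Yor stick-breaking collapses to the DP's exactly.
theta, d = 5.0, 0.0
rng_py = random.Random(7)
rng_dp = random.Random(7)
matches = []
for k in range(1, 6):
    b_py = rng_py.betavariate(1 - d, theta + k * d)  # PY break k
    b_dp = rng_dp.betavariate(1.0, theta)            # DP break
    matches.append(b_py == b_dp)
print(matches)  # identical parameters, identical stream: all True
```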