Synapse

An interconnected graph of micro-tutorials

Posterior Updating

This is an early draft. Content may change as it gets reviewed.

One of the DP’s most practically useful properties: when you observe data, updating your beliefs is trivially easy.

The question

You start with a prior: your belief about the word distribution before seeing any text. This prior is a $\text{DP}(\theta, H)$.

Then you read some text: $n$ words. Some words repeat, some are new. You now want your posterior — your updated belief about the word distribution, incorporating the evidence.

The answer: just mix

The posterior is another Dirichlet process:

$$\text{DP}\left(\theta + n, \;\; \frac{\theta}{\theta + n} H \;+\; \frac{n}{\theta + n} \hat{F}_n\right)$$

The symbol $\hat{F}_n$ just means “the distribution of what you actually observed” — it assigns probability $c_w / n$ to each word $w$ that appeared $c_w$ times.

Let’s read this formula in plain English:

New concentration: $\theta + n$. You started with $\theta$ pseudo-observations, now you’ve added $n$ real ones. You’re more confident.

New base distribution: a weighted average of two things:

- What you believed before (the prior $H$), weighted by $\theta / (\theta + n)$
- What you actually saw (the data $\hat{F}_n$), weighted by $n / (\theta + n)$

As you observe more data, the data weight $n/(\theta + n)$ grows toward 1 and the prior weight $\theta/(\theta + n)$ shrinks toward 0. The data gradually overwhelms the prior. But the prior never fully disappears — even after a million observations, there’s still a tiny contribution from $H$, keeping the door open for words you haven’t seen yet.
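The mixture above can be sketched in a few lines of Python. Note that the finite dictionary `base_probs` is an assumption for illustration only; a real $H$ over an open-ended vocabulary cannot be enumerated like this.

```python
from collections import Counter

def posterior_dp(theta, base_probs, observations):
    """Posterior DP parameters after observing data.

    Returns the new concentration theta + n and the new base
    distribution: a theta/(theta+n) vs n/(theta+n) mixture of the
    prior base measure H and the empirical distribution F_n.
    """
    n = len(observations)
    counts = Counter(observations)
    new_theta = theta + n
    w_prior = theta / new_theta   # weight on the prior H
    w_data = n / new_theta        # weight on the observed data

    # Mix the prior base probabilities with the empirical frequencies.
    words = set(base_probs) | set(counts)
    new_base = {w: w_prior * base_probs.get(w, 0.0)
                   + w_data * counts[w] / n
                for w in words}
    return new_theta, new_base
```

With $\theta = 2$, a toy uniform prior over five words, and the six observations from the example below, the posterior concentration is 8 and the new base distribution still sums to 1.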

The predictive distribution

The most useful consequence: what’s the probability of the next word?

For a word $w$ you’ve seen $c_w$ times:

$$P(\text{next word} = w) = \frac{c_w}{\theta + n}$$

For a brand-new word you’ve never seen:

$$P(\text{next word is new}) = \frac{\theta}{\theta + n}$$

The probability of seeing a known word is proportional to how often you’ve already seen it ($c_w$). The probability of seeing something entirely new is proportional to $\theta$. This is the same “rich get richer, but novelty is always possible” dynamic as the Pólya urn — because it is the Pólya urn, derived from the posterior.
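A minimal sketch of the two predictive formulas, assuming the observations are given as a plain list of word strings:

```python
from collections import Counter

def predictive(theta, observations):
    """Predictive distribution under the DP posterior:
    P(next = w)    = c_w / (theta + n) for each seen word w,
    P(next is new) = theta / (theta + n)."""
    n = len(observations)
    counts = Counter(observations)
    probs = {w: c / (theta + n) for w, c in counts.items()}
    p_new = theta / (theta + n)
    return probs, p_new
```

The seen-word probabilities and the new-word probability always sum to 1, since $\sum_w c_w = n$.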

Watch it update

Try It: Posterior Updating

Things to try:

Start with $\theta = 2$ and click: +the, +the, +the, +dog, +bit, +man. Watch the bar for *the* climb to ~37.5% and the “new” bar drop to 25%.

Reset. Set $\theta = 0.5$. Observe the same six words. Now the “new” bar drops much faster — with a weak prior ($\theta$ small), the model quickly becomes dominated by the data it’s seen.

Reset. Set $\theta = 20$. Same six words. The “new” bar barely budges — a strong prior ($\theta$ large) resists being swayed by a small amount of data.

This is the Bayesian tradeoff: $\theta$ controls how much you trust your prior versus the evidence.
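The three experiments above can be checked numerically. This snippet (assuming the same six observations, so $n = 6$) prints the “new word” probability $\theta/(\theta + n)$ for each concentration:

```python
# Probability mass reserved for unseen words after six observations,
# for a weak, medium, and strong prior.
n = 6
for theta in (0.5, 2.0, 20.0):
    p_new = theta / (theta + n)
    print(f"theta={theta:>4}: P(new) = {p_new:.3f}")
# theta= 0.5: P(new) = 0.077
# theta= 2.0: P(new) = 0.250
# theta=20.0: P(new) = 0.769
```

The weak prior ($\theta = 0.5$) has handed most of its mass to the data; the strong prior ($\theta = 20$) has barely moved.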

A worked example

Let’s trace the arithmetic. Suppose $\theta = 2$ and we observe: the the dog bit the man (6 words, 4 types).

| Word | Count $c_w$ | $P(\text{next}) = c_w / (\theta + n) = c_w / 8$ |
|------|-------------|-------------------------------------------------|
| the | 3 | $3/8 = 37.5\%$ |
| dog | 1 | $1/8 = 12.5\%$ |
| bit | 1 | $1/8 = 12.5\%$ |
| man | 1 | $1/8 = 12.5\%$ |
| ✨ new word | — | $\theta/8 = 2/8 = 25.0\%$ |

The model has learned that *the* is the most common word, while reserving a quarter of its probability budget for words it hasn’t seen yet. After 1,000 observations, $\theta / (\theta + n) = 2/1002 \approx 0.2\%$ — small, but it never reaches zero.
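The shrinking-but-never-zero prior weight is easy to verify with a quick sketch, reusing $\theta = 2$ from the worked example:

```python
# Weight on the prior H as the observation count n grows.
theta = 2
prior_weights = {}
for n in (6, 1_000, 1_000_000):
    prior_weights[n] = theta / (theta + n)
    print(f"n={n:>9,}: prior weight on H = {prior_weights[n]:.6f}")
# n=        6: prior weight on H = 0.250000
# n=    1,000: prior weight on H = 0.001996
# n=1,000,000: prior weight on H = 0.000002
```

However large $n$ gets, the weight stays strictly positive, so a genuinely new word always has some probability.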