Topic Modelling
You have a collection of documents. You haven’t labelled them. You want the computer to discover what they’re about — automatically, from the words alone. That’s topic modelling.
The intuition: documents as mixtures
A newspaper article about a politician’s healthcare proposal is partly about politics and partly about medicine. It uses words from both domains: senator, vote, legislation alongside patients, treatment, hospital. A different article about medical research funding is also a mixture — mostly medicine, some politics, some science.
Topic modelling formalises this intuition:
- A topic is a distribution over words. The “politics” topic gives high probability to government, election, policy. The “medicine” topic gives high probability to patient, diagnosis, treatment.
- A document is a mixture of topics. Each document has its own blend — 60% politics and 40% medicine, or 10% politics and 90% medicine.
- Each word in a document was (conceptually) generated by first picking a topic from the document’s mixture, then picking a word from that topic’s distribution.
The topics are latent variables — they’re never observed directly. You only see the words. The model’s job is to reverse-engineer the hidden topics from the observed word patterns.
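To make the two-step story concrete, here is a minimal sketch of the generative process in Python. The vocabulary, the two hand-made topic distributions, and the Dirichlet prior are illustrative assumptions, not values any real model has learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hand-made "topics": each is a probability distribution over the vocabulary.
vocab = ["senator", "vote", "legislation", "patient", "treatment", "hospital"]
topics = np.array([
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],   # a "politics"-flavoured topic
    [0.01, 0.02, 0.02, 0.35, 0.30, 0.30],   # a "medicine"-flavoured topic
])

def generate_document(n_words, alpha=(1.0, 1.0)):
    """Generate one document by the two-step story above."""
    # The document's own blend of topics, drawn from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)      # first pick a topic from the mixture
        w = rng.choice(len(vocab), p=topics[z])   # then pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(10)
print("document mixture:", theta.round(2))   # e.g. a 60/40 blend
print(" ".join(words))
```

Inference runs this story in reverse: given only the words, recover plausible values for `topics` and each document's `theta`.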
The bag-of-words assumption
Topic modelling ignores word order. The document “the senator visited the hospital” and “hospital the the senator visited” are treated identically — only the counts matter (senator: 1, visited: 1, hospital: 1, the: 2).
This is a strong simplification, and it’s obviously wrong for understanding meaning. But it’s the same assumption behind Zipf’s law, frequency spectra, and all the other count-based analyses. The technical name is exchangeability — any rearrangement of the words is equally probable. It’s wrong for syntax, but powerful for discovering thematic structure.
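You can see the assumption in two lines: under a bag-of-words representation, the two strings above reduce to identical counts.

```python
from collections import Counter

a = Counter("the senator visited the hospital".split())
b = Counter("hospital the the senator visited".split())

print(a)       # Counter({'the': 2, 'senator': 1, 'visited': 1, 'hospital': 1})
print(a == b)  # True: word order is gone, only the counts remain
```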
What topic modelling discovers
Given a collection of documents and a number of topics $K$, a topic model discovers:
- Topic-word distributions: What each topic is “about” (which words have high probability)
- Document-topic proportions: What mixture of topics each document contains
Nobody labels the topics. The model finds coherent clusters of co-occurring words and calls them “topics.” You interpret them afterwards by looking at the highest-probability words in each topic.
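As one concrete instance, here is a minimal sketch using scikit-learn’s `LatentDirichletAllocation` on a toy corpus. The corpus and the choice of $K = 2$ are illustrative assumptions, and with this little data the recovered topics will be rough:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the senator proposed healthcare legislation after the vote",
    "patients received treatment at the hospital after diagnosis",
    "the senator visited the hospital to discuss patient funding",
]

# Bag-of-words counts: one row per document, one column per vocabulary word.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # document-topic proportions, one row per document

# Topic-word distributions: normalise components_ so each row sums to 1.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

words = vectorizer.get_feature_names_out()
for k, row in enumerate(topic_word):
    top = row.argsort()[::-1][:4]
    print(f"topic {k}:", ", ".join(words[i] for i in top))

print(doc_topic.round(2))  # each row is that document's mixture of the two topics
```

Other libraries (gensim, for instance) expose a different interface, but the two outputs are the same pair described above: topic-word distributions and document-topic proportions.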
Choosing $K$
As with factor analysis, you must choose how many topics to look for. Too few and distinct themes get merged. Too many and topics become incoherent or redundant. There’s no single right answer — the choice depends on your purpose and your data.
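One rough heuristic is to fit models at several values of $K$ and compare held-out perplexity, which scikit-learn exposes directly; lower is better, but the curve rarely gives a sharp answer, and topic coherence (gensim’s `CoherenceModel`, for example) plus plain human inspection should get the final vote. A sketch, with an illustrative corpus far smaller than anything you would use in practice:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Toy corpus; a real comparison needs far more documents to be meaningful.
docs = [
    "the senator proposed healthcare legislation after the vote",
    "patients received treatment at the hospital after diagnosis",
    "the election campaign focused on government policy",
    "medical research funding depends on federal policy",
    "the hospital expanded its treatment programme for patients",
    "voters rejected the legislation in the election",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    # Held-out perplexity: lower is better, but read the curve loosely.
    print(f"K={k}: perplexity={lda.perplexity(X_test):.1f}")
```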