Topic Modelling
You have a collection of documents. You haven’t labelled them. You want the computer to discover what they’re about — automatically, from the words alone. That’s topic modelling.
The intuition: documents as mixtures
A newspaper article about a politician’s healthcare proposal is partly about politics and partly about medicine. It uses words from both domains: senator, vote, legislation alongside patients, treatment, hospital. A different article about medical research funding is also a mixture — mostly medicine, some politics, some science.
Topic modelling formalises this intuition:
- A topic is a distribution over words. The “politics” topic gives high probability to government, election, policy. The “medicine” topic gives high probability to patient, diagnosis, treatment.
- A document is a mixture of topics. Each document has its own blend — 60% politics and 40% medicine, or 10% politics and 90% medicine.
- Each word in a document was (conceptually) generated by first picking a topic from the document’s mixture, then picking a word from that topic’s distribution.
The topics are latent variables — they’re never observed directly. You only see the words. The model’s job is to reverse-engineer the hidden topics from the observed word patterns.
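To make the two-step story concrete, here is a minimal sketch of the generative process in Python. The vocabulary, the two hand-made topic distributions, and the Dirichlet prior are illustrative assumptions, not values any real model has learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hand-made "topics": each is a probability distribution over the vocabulary.
vocab = ["senator", "vote", "legislation", "patient", "treatment", "hospital"]
topics = np.array([
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],   # a "politics"-flavoured topic
    [0.01, 0.02, 0.02, 0.35, 0.30, 0.30],   # a "medicine"-flavoured topic
])

def generate_document(n_words, alpha=(1.0, 1.0)):
    """Generate one document by the two-step story above."""
    # The document's own blend of topics, drawn from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)      # first pick a topic from the mixture
        w = rng.choice(len(vocab), p=topics[z])   # then pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(10)
print("document mixture:", theta.round(2))   # e.g. a 60/40 blend
print(" ".join(words))
```

Inference runs this story in reverse: given only the words, recover plausible values for `topics` and each document's `theta`.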
The bag-of-words assumption
Topic modelling ignores word order. The document “the senator visited the hospital” and “hospital the the senator visited” are treated identically — only the counts matter (senator: 1, visited: 1, hospital: 1, the: 2).
This is a strong simplification, and it’s obviously wrong for understanding meaning. But it’s the same assumption behind Zipf’s law, frequency spectra, and all the other count-based analyses. The technical name is exchangeability — any rearrangement of the words is equally probable. It’s wrong for syntax, but powerful for discovering thematic structure.
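You can see the assumption in two lines: under a bag-of-words representation, the two strings above reduce to identical counts.

```python
from collections import Counter

a = Counter("the senator visited the hospital".split())
b = Counter("hospital the the senator visited".split())

print(a)       # Counter({'the': 2, 'senator': 1, 'visited': 1, 'hospital': 1})
print(a == b)  # True: word order is gone, only the counts remain
```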
What topic modelling discovers
Given a collection of documents and a number of topics $K$, a topic model discovers:
- Topic-word distributions: What each topic is “about” (which words have high probability)
- Document-topic proportions: What mixture of topics each document contains
Nobody labels the topics. The model finds coherent clusters of co-occurring words and calls them “topics.” You interpret them afterwards by looking at the highest-probability words in each topic.
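As one concrete instance, here is a minimal sketch using scikit-learn’s `LatentDirichletAllocation` on a toy corpus. The corpus and the choice of $K = 2$ are illustrative assumptions, and with this little data the recovered topics will be rough:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the senator proposed healthcare legislation after the vote",
    "patients received treatment at the hospital after diagnosis",
    "the senator visited the hospital to discuss patient funding",
]

# Bag-of-words counts: one row per document, one column per vocabulary word.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # document-topic proportions, one row per document

# Topic-word distributions: normalise components_ so each row sums to 1.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

words = vectorizer.get_feature_names_out()
for k, row in enumerate(topic_word):
    top = row.argsort()[::-1][:4]
    print(f"topic {k}:", ", ".join(words[i] for i in top))

print(doc_topic.round(2))  # each row is that document's mixture of the two topics
```

Other libraries (gensim, for instance) expose a different interface, but the two outputs are the same pair described above: topic-word distributions and document-topic proportions.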
Choosing $K$
As with factor analysis, you must choose how many topics to look for. Too few and distinct themes get merged. Too many and topics become incoherent or redundant. There’s no single right answer — the choice depends on your purpose and your data.
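One rough heuristic is to fit models at several values of $K$ and compare held-out perplexity, which scikit-learn exposes directly; lower is better, but the curve rarely gives a sharp answer, and topic coherence (gensim’s `CoherenceModel`, for example) plus plain human inspection should get the final vote. A sketch, with an illustrative corpus far smaller than anything you would use in practice:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Toy corpus; a real comparison needs far more documents to be meaningful.
docs = [
    "the senator proposed healthcare legislation after the vote",
    "patients received treatment at the hospital after diagnosis",
    "the election campaign focused on government policy",
    "medical research funding depends on federal policy",
    "the hospital expanded its treatment programme for patients",
    "voters rejected the legislation in the election",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    # Held-out perplexity: lower is better, but read the curve loosely.
    print(f"K={k}: perplexity={lda.perplexity(X_test):.1f}")
```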