Synapse

An interconnected graph of micro-tutorials

Latent Variables

This is an early draft. Content may change as it gets reviewed.

Some of the most important things in data are things you can’t directly measure.

The idea

You can measure a student’s score on a maths test, a reading test, and a science test. You can’t directly measure their “general intelligence” — but you suspect it exists, because students who score high on one test tend to score high on the others. The scores are observed variables. The underlying ability is a latent variable — hidden, but detectable through its effects.

Latent variables are everywhere:

You can count passives, nominalisations, pronoun rates, and hedges in a text. The underlying **register dimension** (like “informational vs involved”) is latent — a hidden factor that explains why these features co-occur.
You can’t measure “anxiety” directly, but you can measure heart rate, self-reported worry, avoidance behaviour, and cortisol levels. Anxiety is latent; the measurements are observed.
GDP, unemployment, and consumer confidence are observed. “Economic health” is latent — a hidden state of the economy that manifests through the indicators you can actually measure.
You can measure soil pH, nitrogen content, moisture, and organic matter at each site. The underlying **fertility gradient** is latent — an unobservable property that causes these measured variables to correlate.
You can measure tempo, dynamic range, harmonic density, and rhythmic complexity in a piece. The underlying **stylistic dimension** (e.g., “Baroque vs Romantic”) is latent — a hidden property that explains why these features tend to co-occur.
You can measure a galaxy’s brightness across multiple wavelength bands, its redshift, and its angular size. The underlying **galaxy type** (spiral, elliptical, irregular) is latent — an unobservable classification that causes these measured properties to correlate in characteristic ways.

Observed vs latent

Observed variables Latent variables
What Directly measured Inferred from patterns
Examples Test scores, feature counts, sensor readings Intelligence, anxiety, register dimensions
How many Often many (67 linguistic features, 20 survey questions) Usually few (3–6 factors)
Relationship The data you have The structure you’re looking for

The key assumption: the reason observed variables are correlated is that they share latent causes. Students’ test scores correlate because of a common underlying ability. Linguistic features correlate because of a common underlying register dimension.

The modelling distinction

This is where latent variables become more than just a concept — they define a whole approach to data analysis:

The mathematics look similar — both involve eigenvalues of covariance matrices. But the interpretation is fundamentally different. PCA says “here’s a useful compression.” Factor analysis says “here’s a theory of what’s generating the data.”

Why this matters for register analysis

Biber’s Multi-Dimensional Analysis doesn’t just compress 67 features into fewer dimensions — it claims that the dimensions correspond to real situational factors. Dimension 1 (“Involved vs Informational”) isn’t just a data reduction convenience; it reflects a genuine communicative distinction. That’s a latent variable claim.

The entire Bayesian extension of MDA takes this further: instead of point estimates of factor loadings, you get probability distributions — acknowledging that you’re uncertain about exactly how the latent dimensions relate to the observed features. But the latent variable framework is the foundation.