Latent Variables
Some of the most important things in data are things you can’t directly measure.
The idea
You can measure a student’s score on a maths test, a reading test, and a science test. You can’t directly measure their “general intelligence” — but you suspect it exists, because students who score high on one test tend to score high on the others. The scores are observed variables. The underlying ability is a latent variable — hidden, but detectable through its effects.
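This generative story is easy to simulate. The sketch below (all numbers and the noise level are illustrative assumptions, not from any real dataset) draws a latent "ability" per student, produces three noisy test scores from it, and shows that the shared hidden cause appears as correlation between the observed scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Latent variable: one "ability" per student, never observed directly.
ability = rng.normal(0, 1, n)

# Observed variables: each test score is ability plus independent noise.
maths   = ability + rng.normal(0, 0.5, n)
reading = ability + rng.normal(0, 0.5, n)
science = ability + rng.normal(0, 0.5, n)

# The shared latent cause shows up as correlation among the scores.
corr = np.corrcoef([maths, reading, science])
print(corr.round(2))
```

With these noise settings the theoretical pairwise correlation is 1/(1 + 0.25) = 0.8, even though `ability` never appears in the observed data.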
Latent variables are everywhere: intelligence in psychometrics, anxiety in clinical psychology, register dimensions in corpus linguistics. You never measure them directly; you infer them from patterns in what you can measure.
Observed vs latent
| | Observed variables | Latent variables |
|---|---|---|
| What | Directly measured | Inferred from patterns |
| Examples | Test scores, feature counts, sensor readings | Intelligence, anxiety, register dimensions |
| How many | Often many (67 linguistic features, 20 survey questions) | Usually few (3–6 factors) |
| Relationship | The data you have | The structure you’re looking for |
The key assumption: the reason observed variables are correlated is that they share latent causes. Students’ test scores correlate because of a common underlying ability. Linguistic features correlate because of a common underlying register dimension.
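This assumption has a standard mathematical form. In the conventional factor-analysis notation (the symbols here are the textbook ones, not taken from the text above), each observed variable $x_i$ is a loading $\lambda_i$ times a latent factor $f$ plus independent noise, which forces the observed covariance into a low-rank-plus-diagonal shape:

```latex
x_i = \lambda_i f + \varepsilon_i,
\qquad
\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi
```

Here $\Lambda$ stacks the loadings and $\Psi$ is the diagonal matrix of noise variances. Any correlation between two observed variables must flow through the shared factor $f$.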
The modelling distinction
This is where latent variables become more than just a concept — they define a whole approach to data analysis:
- Without latent variables (e.g., PCA): “I want to reduce 67 features to a few summary dimensions.” The dimensions are mathematical constructs — convenient rotations of the data. No claim about hidden causes.
- With latent variables (e.g., factor analysis): “I believe there are a few hidden factors causing the patterns I see.” The model posits that each observed variable is a noisy function of the latent factors. The factors are hypothesised entities, not just data summaries.
The mathematics looks similar — both involve eigendecompositions of covariance matrices. But the interpretation is fundamentally different. PCA says “here’s a useful compression.” Factor analysis says “here’s a theory of what’s generating the data.”
Why this matters for register analysis
Biber’s Multi-Dimensional Analysis doesn’t just compress 67 features into fewer dimensions — it claims that the dimensions correspond to real situational factors. Dimension 1 (“Involved vs Informational”) isn’t just a data reduction convenience; it reflects a genuine communicative distinction. That’s a latent variable claim.
The Bayesian extension of MDA takes this further: instead of point estimates of factor loadings, you get posterior distributions, acknowledging that you’re uncertain about exactly how the latent dimensions relate to the observed features. Either way, the latent variable framework is the foundation.