Prerequisites: Variance: Measuring Spread

Latent Variables

This is an early draft. Content may change as it gets reviewed.

Some of the most important quantities behind a dataset are ones you can’t measure directly.

The idea

You can measure a student’s score on a maths test, a reading test, and a science test. You can’t directly measure their “general intelligence” — but you suspect it exists, because students who score high on one test tend to score high on the others. The scores are observed variables. The underlying ability is a latent variable — hidden, but detectable through its effects.

Latent variables are everywhere:

You can count passives, nominalisations, pronoun rates, and hedges in a text. The underlying **register dimension** (like “informational vs involved”) is latent — a hidden factor that explains why these features co-occur.

You can’t measure “anxiety” directly, but you can measure heart rate, self-reported worry, avoidance behaviour, and cortisol levels. Anxiety is latent; the measurements are observed.

GDP, unemployment, and consumer confidence are observed. “Economic health” is latent — a hidden state of the economy that manifests through the indicators you can actually measure.

You can measure soil pH, nitrogen content, moisture, and organic matter at each site. The underlying **fertility gradient** is latent — an unobservable property that causes these measured variables to correlate.

You can measure tempo, dynamic range, harmonic density, and rhythmic complexity in a piece. The underlying **stylistic dimension** (e.g., “Baroque vs Romantic”) is latent — a hidden property that explains why these features tend to co-occur.

You can measure a galaxy’s brightness across multiple wavelength bands, its redshift, and its angular size. The underlying **galaxy type** (spiral, elliptical, irregular) is latent — an unobservable classification that causes these measured properties to correlate in characteristic ways.

Observed vs latent

	Observed variables	Latent variables
What	Directly measured	Inferred from patterns
Examples	Test scores, feature counts, sensor readings	Intelligence, anxiety, register dimensions
How many	Often many (67 linguistic features, 20 survey questions)	Usually few (3–6 factors)
Relationship	The data you have	The structure you’re looking for

The key assumption: the reason observed variables are correlated is that they share latent causes. Students’ test scores correlate because of a common underlying ability. Linguistic features correlate because of a common underlying register dimension.
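A small simulation can make this assumption concrete. The sketch below is illustrative only: the loadings (0.8, 0.7, 0.75), the noise level, and the sample size are made-up values, not estimates from real test data. It simulates a hidden ability, builds three test scores from it, and compares the correlation between two scores with the correlation between their test-specific noise terms.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_students = 5000

# Hypothetical latent ability: only available because this is a simulation.
ability = [rng.gauss(0, 1) for _ in range(n_students)]

# Assumed loadings: how strongly each test reflects the ability.
loadings = {"maths": 0.8, "reading": 0.7, "science": 0.75}

# Each observed score = loading * ability + independent test-specific noise.
noise = {t: [rng.gauss(0, 1) for _ in range(n_students)] for t in loadings}
scores = {
    t: [loadings[t] * a + e for a, e in zip(ability, noise[t])]
    for t in loadings
}

# Scores correlate because they share the ability ...
r_scores = pearson(scores["maths"], scores["reading"])
# ... while what is left after removing the ability does not.
r_noise = pearson(noise["maths"], noise["reading"])

print(f"maths vs reading scores:         r = {r_scores:.2f}")
print(f"maths vs reading, ability removed: r = {r_noise:.2f}")
```

With these settings the population correlation between the maths and reading scores is about 0.36, while the noise terms are uncorrelated by construction, so the second value should land near zero. In real data the noise terms are not available separately, which is why the latent variable has to be inferred from the correlation pattern.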

The modelling distinction

This is where latent variables become more than just a concept — they define a whole approach to data analysis:

Without latent variables (e.g., PCA): “I want to reduce 67 features to a few summary dimensions.” The dimensions are mathematical constructs — convenient rotations of the data. No claim about hidden causes.
With latent variables (e.g., factor analysis): “I believe there are a few hidden factors causing the patterns I see.” The model posits that each observed variable is a noisy function of the latent factors; in standard factor analysis, that means a weighted sum of the factors plus noise specific to that variable. The factors are hypothesised entities, not just data summaries.
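The factor-analysis claim can be written as a short model. The notation below is an assumption introduced for illustration (it does not appear elsewhere in this draft): x is the vector of centred observed variables, f the vector of latent factors, Λ the matrix of loadings, and ε the variable-specific noise. This is the common linear form with uncorrelated, unit-variance factors.

```latex
\begin{aligned}
x &= \Lambda f + \varepsilon, \\
\operatorname{Cov}(f) &= I, \qquad
\operatorname{Cov}(\varepsilon) = \Psi \ \text{(diagonal)}, \qquad
\operatorname{Cov}(f, \varepsilon) = 0, \\
\operatorname{Cov}(x) &= \Lambda \Lambda^{\top} + \Psi .
\end{aligned}
```

Because Ψ is diagonal, every off-diagonal entry of Cov(x) comes from ΛΛᵀ. In other words, under this model all correlation between observed variables is attributed to the shared factors.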

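The mathematics looks similar: both methods start from the covariance (or correlation) matrix of the observed variables, and both are often computed through eigendecompositions. But the interpretation is fundamentally different, and so is the treatment of noise. PCA decomposes all the variance in the data, while factor analysis separates the variance that variables share from the variance unique to each one. PCA says “here’s a useful compression.” Factor analysis says “here’s a theory of what’s generating the data.”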
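The difference shows up in numbers even in a tiny case. The sketch below assumes a one-factor model with made-up loadings (0.7, 0.6, 0.5) on standardised variables and works with the population correlation matrix it implies. It is a simplification: the factor-analysis step uses a single principal-axis-style decomposition with the shared variances (communalities) taken as known, whereas real factor analysis has to estimate them. The helper `top_eigenpair` is a hypothetical name for a plain power iteration.

```python
import math


def top_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix,
    found by power iteration (adequate for this small positive matrix)."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return eigenvalue, v


true_loadings = [0.7, 0.6, 0.5]  # illustrative values, not from real data
n = len(true_loadings)

# Correlation matrix implied by the one-factor model: off-diagonal entries
# are products of loadings, and the diagonal is 1 (standardised variables).
corr = [
    [1.0 if i == j else true_loadings[i] * true_loadings[j] for j in range(n)]
    for i in range(n)
]

# PCA view: decompose the full matrix, unique (noise) variance included.
pca_value, pca_vector = top_eigenpair(corr)
pca_loadings = [math.sqrt(pca_value) * x for x in pca_vector]

# Factor-analysis view: replace the diagonal with each variable's shared
# variance (assumed known here), so only common variance is decomposed.
reduced = [row[:] for row in corr]
for i in range(n):
    reduced[i][i] = true_loadings[i] ** 2
fa_value, fa_vector = top_eigenpair(reduced)
fa_loadings = [math.sqrt(fa_value) * x for x in fa_vector]

print("true loadings:", true_loadings)
print("PCA loadings: ", [round(x, 2) for x in pca_loadings])
print("FA loadings:  ", [round(x, 2) for x in fa_loadings])
```

In this setup the factor-analysis decomposition recovers the loadings that generated the matrix, while the first principal component reports larger values because it also absorbs each variable's unique variance. Both are valid summaries. Only the second is an estimate of the hypothesised generating structure.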

Why this matters for register analysis

Biber’s Multi-Dimensional Analysis doesn’t just compress 67 features into fewer dimensions — it claims that the dimensions correspond to real situational factors. Dimension 1 (“Involved vs Informational”) isn’t just a data reduction convenience; it reflects a genuine communicative distinction. That’s a latent variable claim.

The Bayesian extension of MDA takes this further: instead of point estimates of factor loadings, you get probability distributions over them, which acknowledges that you’re uncertain about exactly how the latent dimensions relate to the observed features. But the latent variable framework is the foundation.