Synapse

An interconnected graph of micro-tutorials

Factor Analysis

This is an early draft. Content may change as it gets reviewed.

Factor analysis starts from a question: why do these variables correlate?

If you measure 67 linguistic features across thousands of texts, many features will co-occur. Texts with lots of passives tend to have lots of nominalisations. Texts with lots of first-person pronouns tend to have lots of contractions. Factor analysis proposes an answer: there are a small number of hidden factors that cause these co-occurrence patterns.

The idea (no math)

Imagine you’re a detective. You observe that certain clues tend to appear together at crime scenes — fingerprints, broken glass, footprints by the window. You hypothesise a hidden cause: a burglar. The clues correlate because they share a common cause.

Factor analysis does the same thing with data. You observe that certain variables tend to co-occur — past tense verbs with third-person pronouns, or high blood pressure with high cholesterol. Factor analysis hypothesises hidden factors that cause the co-occurrence. The factors are never directly measured — they’re inferred from the pattern of correlations.

The key difference from PCA: PCA says “here are the main directions in the data.” Factor analysis says “here’s why the data has those directions.”

The model

Factor analysis posits that each observed variable $x_j$ is a linear combination of $k$ latent factors $F_1, F_2, \ldots, F_k$ plus noise:

$$x_j = \lambda_{j1} F_1 + \lambda_{j2} F_2 + \cdots + \lambda_{jk} F_k + \epsilon_j$$

The $\lambda_{jk}$ values are loadings — they tell you how strongly factor $k$ influences variable $j$. The $\epsilon_j$ is the unique variance — the part of $x_j$ that isn’t explained by any factor.

In matrix form:

$$\mathbf{x} = \Lambda \mathbf{F} + \boldsymbol{\epsilon}$$

where $\Lambda$ is the $p \times k$ loading matrix ($p$ variables, $k$ factors).
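The generative story can be sketched by simulating from the model: draw latent factors, mix them through a hypothetical loading matrix, add unique noise, and watch correlations appear only between variables that share a factor. The loading values here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 5000, 6, 2  # observations, observed variables, latent factors

# Hypothetical loading matrix Lambda (p x k): variables 0-2 load on
# factor 1, variables 3-5 on factor 2.
Lam = np.array([
    [0.9, 0.0],
    [0.8, 0.0],
    [0.7, 0.0],
    [0.0, 0.9],
    [0.0, 0.8],
    [0.0, 0.7],
])

F = rng.standard_normal((n, k))          # latent factors (never observed)
eps = 0.5 * rng.standard_normal((n, p))  # unique variance epsilon
X = F @ Lam.T + eps                      # observed data: x = Lambda F + eps

corr = np.corrcoef(X, rowvar=False)
# Variables sharing a factor correlate strongly; variables on
# different factors are nearly uncorrelated.
print(round(corr[0, 1], 2), round(corr[0, 3], 2))
```

Variables 0 and 1 correlate because they share factor 1; variables 0 and 3 have no common cause, so their correlation hovers near zero.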

How it differs from PCA

PCA and factor analysis look similar on the surface — both take many variables and reduce them to fewer dimensions. But the models are fundamentally different:

| | PCA | Factor analysis |
|---|---|---|
| Goal | Data reduction | Discovering latent structure |
| Model | No model, just rotation | Generative model: factors cause observations |
| Components/factors | Mathematical constructs | Hypothesised latent variables |
| Unique variance | Not separated out | Explicitly modelled ($\epsilon_j$) |
| Number of dimensions | Up to $p$ (one per variable) | Must choose $k$ (how many factors?) |
| Rotation | Unrotated solution is unique | Multiple rotations possible (Varimax, Promax) |
| Claim | "This is a useful summary" | "This is what's generating the data" |

The practical consequence: PCA always gives you an answer. Factor analysis requires you to decide how many factors there are — and that decision is one of the hardest parts.
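The unique-variance row of the table can be seen directly in scikit-learn: `FactorAnalysis` estimates a per-variable noise term (`noise_variance_`), while `PCA` only reports variance along its components. The data here is a toy construction, one factor driving four indicators with deliberately unequal noise.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(1)
n = 2000
f = rng.standard_normal(n)  # one latent factor

# Four indicators of the same factor: two clean, two noisy.
X = np.column_stack([
    0.9 * f + 0.1 * rng.standard_normal(n),
    0.8 * f + 0.2 * rng.standard_normal(n),
    0.9 * f + 1.0 * rng.standard_normal(n),
    0.8 * f + 1.0 * rng.standard_normal(n),
])

fa = FactorAnalysis(n_components=1).fit(X)
pca = PCA(n_components=1).fit(X)

# FA recovers per-variable unique variance (epsilon_j): small for the
# clean indicators, large for the noisy ones.
print("FA noise variances:", fa.noise_variance_.round(2))
# PCA has no such concept; it only decomposes total variance.
print("PCA explained variance:", pca.explained_variance_.round(2))
```

The factor model separates "signal the variables share" from "noise unique to each variable"; PCA mixes both into its components.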

Choosing the number of factors

Several criteria exist, and they often disagree:

  • Kaiser criterion: Keep factors with eigenvalue > 1 (common but often over-extracts)
  • Scree plot: Plot eigenvalues, look for the “elbow” where they level off (subjective)
  • Parallel analysis: Compare your eigenvalues to random data — keep factors that beat chance (better, but still frequentist)
  • Bayesian approaches: Let the model determine the number of factors stochastically (most principled, but computationally harder)
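Parallel analysis is simple enough to sketch with numpy alone. This is a minimal version that compares observed eigenvalues against the mean eigenvalues of random same-shaped data; production implementations usually use a high percentile rather than the mean.

```python
import numpy as np

def parallel_analysis(X, n_sims=100, seed=0):
    """Count eigenvalues of the correlation matrix that exceed the
    mean eigenvalue (at the same rank) of random normal data of the
    same shape. A minimal sketch of Horn's parallel analysis."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    rand = np.zeros((n_sims, p))
    for i in range(n_sims):
        R = rng.standard_normal((n, p))
        rand[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(R, rowvar=False)))[::-1]
    return int(np.sum(obs > rand.mean(axis=0)))

# Data with two genuine factors buried in six variables:
rng = np.random.default_rng(2)
F = rng.standard_normal((1000, 2))
Lam = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
X = F @ Lam.T + 0.5 * rng.standard_normal((1000, 6))
print(parallel_analysis(X))
```

On this simulated data the procedure retains exactly the two factors that were planted, where the Kaiser criterion or a scree plot might waver.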

This is one of the “six ad hoc decisions” in classical MDA that a Bayesian approach can improve: instead of a human looking at a scree plot, the posterior distribution tells you how many factors the data supports.

Rotation

The raw factor solution is mathematically valid but often hard to interpret — variables load on multiple factors. Rotation transforms the solution to make the loadings cleaner:

  • Varimax (orthogonal): Maximises the variance of loadings within each factor. Each variable loads strongly on one factor and weakly on others. Factors stay perpendicular.
  • Promax (oblique): Allows factors to correlate. More realistic (linguistic dimensions probably do correlate) but harder to interpret.

Biber uses Promax rotation in his MDA — the dimensions are allowed to be correlated because “informational” and “narrative” aren’t independent communicative functions.
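Varimax rotation is available directly in scikit-learn's `FactorAnalysis` via the `rotation` parameter (oblique rotations like Promax are not; they need a dedicated package such as `factor_analyzer`). A sketch on simulated two-factor data:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
F = rng.standard_normal((1000, 2))
Lam = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
X = F @ Lam.T + 0.4 * rng.standard_normal((1000, 6))

# Varimax pushes each variable toward loading on a single factor.
rot = FactorAnalysis(n_components=2, rotation="varimax").fit(X)

# components_ is (k, p); transpose to get the familiar variable-by-factor
# loading matrix. Signs are arbitrary, so inspect absolute values.
print(np.abs(rot.components_.T).round(2))
```

After rotation, each row of the loading matrix has one strong loading and one near-zero cross-loading, which is exactly the "simple structure" that makes factors interpretable.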

The loading matrix

After extraction and rotation, the loading matrix $\Lambda$ is what you interpret. Each row is a variable; each column is a factor. High loadings (positive or negative) mean the variable is important for that factor:

| Feature | Dimension 1 | Dimension 2 |
|---|---|---|
| Private verbs | +0.86 | −0.02 |
| 1st/2nd person pronouns | +0.74 | +0.11 |
| Nominalisations | −0.71 | +0.09 |
| Prepositions | −0.68 | −0.15 |
| Past tense | +0.05 | +0.78 |
| 3rd person pronouns | −0.10 | +0.73 |

Dimension 1 separates involved language (pronouns, private verbs) from informational language (nominalisations, prepositions). Dimension 2 picks up narrative features. Each dimension is a latent variable — a hidden communicative function that explains why these features pattern together.
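Reading a loading matrix mechanically amounts to thresholding: keep, per factor, the features whose absolute loading clears a cutoff. The 0.35 cutoff below is a common interpretive convention (Biber uses it), not a statistical test; the numbers are the loadings from the table above.

```python
import numpy as np

features = ["private verbs", "1st/2nd person pronouns", "nominalisations",
            "prepositions", "past tense", "3rd person pronouns"]
# Loadings from the table: rows = features, columns = dimensions.
L = np.array([
    [ 0.86, -0.02],
    [ 0.74,  0.11],
    [-0.71,  0.09],
    [-0.68, -0.15],
    [ 0.05,  0.78],
    [-0.10,  0.73],
])

# Convention: treat |loading| >= 0.35 as salient for a dimension.
for d in range(L.shape[1]):
    salient = [(feat, L[i, d]) for i, feat in enumerate(features)
               if abs(L[i, d]) >= 0.35]
    print(f"Dimension {d + 1}:", salient)
```

Dimension 1 collects four salient features with opposed signs (involved vs informational); Dimension 2 collects the two narrative features, both positive.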

Why factor analysis, not PCA?

For exploratory data summary, PCA is fine. But when you want to make claims about why features co-occur, you need factor analysis. The latent variable model gives those claims formal grounding.

Register analysis. Biber’s Multi-Dimensional Analysis measures 67 linguistic features and uses factor analysis to discover that the co-occurrence of private verbs, first-person pronouns, and contractions reflects a hidden communicative function — “involved” language. This isn’t just a useful data compression; it’s a claim that a latent dimension of communicative purpose causes these features to pattern together.

Psychometrics. A depression questionnaire has 20 items: sleep quality, appetite, concentration, mood, energy, etc. Factor analysis reveals two latent factors — “somatic” (sleep, appetite, energy) and “cognitive” (concentration, guilt, hopelessness). The claim: depression isn’t one thing, it’s two underlying dimensions that manifest differently across patients.

Environmental gradients. An ecologist measures soil pH, moisture, nitrogen, phosphorus, organic matter, and 10 other variables across 200 sites. Factor analysis reveals two latent factors: “fertility” (nitrogen, phosphorus, organic matter load together) and “drainage” (moisture, clay content, depth to water table). The claim: these two hidden gradients explain why the measured variables correlate.

Musical style dimensions. Analyse 500 compositions on 30 features: tempo, dynamic range, harmonic complexity, rhythmic regularity, voice count, etc. Factor analysis reveals latent dimensions like “complexity” (harmonic density, voice count, tempo variation load together) and “energy” (tempo, dynamic range, rhythmic regularity). The claim: composers vary along hidden stylistic dimensions that cause surface features to co-occur.

Galaxy morphology. Measure 15 photometric and structural properties (luminosity, colour index, concentration, asymmetry, surface brightness profile, ...) for 5,000 galaxies. Factor analysis reveals latent factors like “star formation activity” (blue colour, asymmetry, and clumpiness load together) and “bulge dominance” (concentration, red colour, smooth profile). The claim: galaxies vary along hidden physical dimensions that cause the observed properties to correlate.