You have data with many variables — maybe 67 linguistic features measured across 500 texts. Each text is a point in 67-dimensional space. You can’t visualise 67 dimensions. You can’t easily see which features go together or which texts are similar.
PCA finds the most important directions in this high-dimensional cloud and projects the data onto them, reducing the number of dimensions while preserving as much of the structure as possible.
## The core idea
Think of a cloud of 3D data points that’s shaped like a pancake — flat and wide. Most of the variation is in two directions (across the pancake), and very little is in the third direction (through the pancake’s thickness).
PCA finds these directions automatically:
1. The first principal component (PC1) points in the direction of maximum variance — across the widest spread of the pancake
2. The second (PC2) is perpendicular to PC1 and captures the next most variance — across the second-widest direction
3. The third (PC3) is perpendicular to both and captures whatever’s left — through the thickness
If the pancake is very flat, PC3 captures almost no variance. You can drop it and describe the data in just two dimensions with barely any loss.
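The pancake picture can be checked numerically. A minimal sketch with NumPy, using a synthetic 3D cloud stretched far more in two directions than the third (the scale factors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A "pancake" cloud: wide in x and y, thin in z.
points = rng.normal(size=(1000, 3)) * np.array([10.0, 6.0, 0.5])

# The eigenvalues of the covariance matrix give the variance
# along each principal direction.
cov = np.cov(points, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]   # sort largest first

explained = eigenvalues / eigenvalues.sum()
print(explained.round(3))   # PC3's share is close to zero
```

The third value is tiny: dropping PC3 and keeping two coordinates per point loses almost nothing.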
## What PCA does (no math)
Imagine photographing a sculpture. From one angle, you see its full shape — height, depth, contour. From another angle, it looks like a flat silhouette. PCA finds the **best camera angle** for your data: the viewpoint that shows the most structure.
Your data has many measurements per item (67 features per text, 20 blood markers per patient). PCA asks: if I could only look at this data from *one* direction, which direction would show me the most variation? That’s PC1. If I could add a second direction (perpendicular to the first), what would show me the next most variation? That’s PC2. Two directions from 67 measurements, capturing the essential shape.
The output: a scatter plot where similar items cluster together — even though PCA knew nothing about categories, genres, or types. It found the structure purely from the numbers.
## The algorithm
PCA is surprisingly simple:
1. **Centre the data**: Subtract the mean of each variable so everything is centred at zero
2. **Compute the covariance matrix**: A $p \times p$ matrix (where $p$ is the number of variables) capturing all pairwise relationships
3. **Find the eigenvalues and eigenvectors** of the covariance matrix
4. **Sort by eigenvalue** (largest first). Each eigenvector is a principal component; its eigenvalue is the variance it explains
5. **Project**: Multiply the data by the top $k$ eigenvectors to get $k$-dimensional coordinates
That’s it. Steps 2–4 decompose the covariance structure. Step 5 reduces the dimensionality.
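The five steps map almost line-for-line onto NumPy. A sketch (the function name and the random test data are illustrative):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    # 1. Centre the data.
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix (p x p).
    cov = np.cov(Xc, rowvar=False)
    # 3. Eigenvalues and eigenvectors (eigh: cov is symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue, largest first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Project onto the top k eigenvectors.
    scores = Xc @ eigvecs[:, :k]
    return scores, eigvecs[:, :k], eigvals

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 67))          # 500 texts, 67 features
scores, components, variances = pca(X, k=2)
print(scores.shape)                     # (500, 2)
```

In practice you would reach for a library implementation (e.g. `sklearn.decomposition.PCA`, which uses the SVD), but the eigendecomposition above is the textbook version of the algorithm.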
## What the output looks like
After PCA, you have:
**Scores**: Each data point gets new coordinates in PC space. A 500-text × 67-feature dataset becomes 500-text × $k$-PC. You can plot PC1 vs PC2 as a scatter plot and see clustering, outliers, and gradients.
**Loadings**: Each principal component is a weighted combination of the original variables. The **loading** of variable $j$ on PC $k$ tells you how much variable $j$ contributes to that component. High loadings (positive or negative) mean the variable is important for that component.
**Explained variance**: Each eigenvalue tells you how much of the total variance that PC captures. You might find that PC1 explains 25%, PC2 explains 15%, and together the first five PCs explain 70%. This tells you how much structure is captured by your reduced representation.
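All three outputs fall out of the same decomposition. A sketch on synthetic correlated data (the latent-signal setup and variable counts are illustrative, chosen so the first two PCs dominate):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two latent signals driving six observed variables, plus noise --
# a toy stand-in for a set of correlated features.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 6))

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs[:, :2]          # new coordinates (300 x 2)
loadings = eigvecs[:, :2]             # weight of each variable on each PC
explained = eigvals / eigvals.sum()   # share of variance per PC

print(explained[:2].round(3))         # first two PCs carry nearly everything
```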
## Try it yourself
**Try It: PCA on 2D Data**
The purple arrow is PC1 (direction of maximum variance). The green arrow is PC2 (perpendicular, remaining variance). Arrow length is proportional to the eigenvalue (how much variance that direction captures).
Drag points to reshape the cloud. Make it long and thin — PC1 explains almost everything. Make it circular — both PCs explain roughly equal amounts. Click “Show projections” to see each point’s shadow on PC1 — that’s the dimensionality reduction in action.
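The computation behind the arrows is the same eigendecomposition, in two dimensions. A sketch with an elongated synthetic cloud (the shape parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# An elongated 2D cloud whose long axis runs at 45 degrees.
t = rng.normal(size=(200, 1)) * np.array([[3.0, 3.0]])
pts = t + rng.normal(size=(200, 2)) * 0.5

Xc = pts - pts.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]              # the "purple arrow"
shadows = Xc @ pc1               # each point's shadow on PC1

explained = eigvals / eigvals.sum()
print(explained.round(3))        # PC1 dominates for a long thin cloud
```

Because the cloud is long and thin, PC1 explains nearly all the variance; make the cloud circular and the two shares approach 50/50.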
## Concrete examples
**Function words across text types.** Measure the frequencies of 50 common function words (*the*, *of*, *and*, *to*, *a*, ...) across 200 texts. PCA might reveal that PC1 separates texts with high *I*, *you*, *my* from texts with high *the*, *of*, *which* — personal vs impersonal language. PC2 separates *was*, *had*, *he* from *is*, *are*, *will* — past-tense narrative vs present-tense exposition. Plot PC1 vs PC2 and clusters appear: fiction, academic prose, conversation. PCA found this structure purely from word counts.
**Patient lab results.** A hospital measures 20 blood markers (cholesterol, glucose, white cell count, ...) for 500 patients. Each patient is a point in 20-dimensional space. PCA reveals that PC1 separates patients with metabolic syndrome markers (high glucose, high triglycerides, high BMI) from healthy profiles. PC2 picks up inflammatory markers. Plotting PC1 vs PC2, oncology patients cluster separately from cardiac patients — visible in two dimensions from 20 measurements.
**Species abundance across sites.** An ecologist counts 80 plant species across 150 survey sites. PCA reveals that PC1 separates dry upland sites (dominated by grasses and heather) from wet lowland sites (rushes, sedges). PC2 picks up soil acidity. The 80-dimensional species data collapses into an interpretable gradient — and sites with similar ecology cluster together.
**Audio features across recordings.** Extract 13 MFCCs (spectral features) plus tempo, loudness, and spectral centroid from 300 recordings. PCA reveals that PC1 separates orchestral pieces (rich harmonics, wide spectral range) from solo vocal (narrow range, strong fundamental). PC2 picks up tempo-related variation. The scatter plot shows genre clusters emerging from purely acoustic measurements.
**Stellar spectra.** Measure flux at 1,000 wavelength bins for 10,000 stars. Each star is a point in 1,000-dimensional space. PCA reveals that PC1 separates hot blue stars from cool red stars (essentially the temperature axis of the HR diagram). PC2 picks up surface gravity (giants vs dwarfs). Two components from 1,000 wavelengths recover the fundamental physical parameters of stars — and the technique works even for spectra too noisy for traditional model-fitting.
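All five studies share the same pipeline: a data matrix (items × features), centre, decompose, inspect PC1. A minimal sketch in the spirit of the function-word example, on synthetic data with two planted "genres" (the group sizes and shift are illustrative, not from any real corpus):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two synthetic "genres" of 100 texts each: the same 10 (of 50)
# features are shifted in opposite directions. Real studies would
# use counts from actual corpora.
shift = np.where(np.arange(50) < 10, 3.0, 0.0)
group_a = rng.normal(size=(100, 50)) + shift
group_b = rng.normal(size=(100, 50)) - shift
X = np.vstack([group_a, group_b])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]   # direction of maximum variance
scores = Xc @ pc1                      # each text's PC1 coordinate

# The two groups land on opposite sides of zero on PC1, even though
# PCA was never told which text belonged to which group.
print(scores[:100].mean(), scores[100:].mean())
```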
In every case, PCA is revealing structure you can see, not claiming to explain why it exists. The components are convenient axes for viewing the data. They’re not latent causes.
## What PCA doesn’t do
PCA tells you the directions of maximum variance. It doesn’t tell you why features co-occur. It doesn’t model noise separately from signal. It doesn’t hypothesise hidden causes.
If you want to go further — to claim that there are hidden communicative dimensions causing the co-occurrence patterns — you need a different tool. That’s factor analysis.