PCA Doesn't Remove Features: It Finds Better Ones

pca dimensionality-reduction linear-algebra unsupervised-learning eigenvectors

When a dataset has 50 features, many ML models struggle: more features often mean more noise, slower training, and harder interpretation. The naive solution is to drop features. PCA does something smarter: it finds new features that are linear combinations of the original ones, ranked by how much variance they capture.

Principal Component Analysis transforms your data into a new coordinate system where the axes (principal components) are ordered by importance. The first principal component points in the direction of maximum variance in the data. The second points in the direction of maximum remaining variance, orthogonal to the first. And so on.

The math: eigenvectors define the directions (the new axes), and eigenvalues define how much variance each direction captures. PCA computes the covariance matrix of your features, finds its eigenvectors and eigenvalues, sorts by eigenvalue, and takes the top $K$ eigenvectors:

$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}, \quad \mathbf{C} \mathbf{v} = \lambda \mathbf{v}$

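where $\mathbf{X}$ is the mean-centered data matrix with $n$ rows (each feature column has its mean subtracted first, otherwise $\mathbf{X}^T \mathbf{X}$ is not the covariance), $\mathbf{v}$ are the eigenvectors (principal components), and $\lambda$ are the eigenvalues (the variance captured along each $\mathbf{v}$).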
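To make the recipe concrete, here is a minimal NumPy sketch of those steps: center, build the covariance matrix, eigendecompose, sort, keep the top $K$. The function name `pca_eig` and the synthetic data are my own placeholders, not from any library, and in practice a library implementation (typically SVD-based) would usually be the better choice.

```python
import numpy as np

def pca_eig(X, k):
    """Project X (n samples x d features) onto its top-k principal components.

    Hypothetical helper for illustration; returns (scores, components, eigenvalues).
    """
    # Center each feature so that X^T X / (n - 1) is the covariance matrix.
    X_centered = X - X.mean(axis=0)
    n = X_centered.shape[0]
    C = X_centered.T @ X_centered / (n - 1)

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # Re-sort so the direction with the largest variance comes first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Keep the top-k eigenvectors (as columns) and project the data onto them.
    components = eigenvectors[:, :k]
    scores = X_centered @ components
    return scores, components, eigenvalues

# Synthetic correlated data, made up purely for the demo.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

scores, components, eigenvalues = pca_eig(X, k=2)
print(scores.shape)      # shape of the reduced data
print(eigenvalues)       # variance captured by each direction, largest first
```

One thing this makes visible: the variance of each column of `scores` equals the matching eigenvalue, which is what "eigenvalues define how much variance each direction captures" means in practice.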

What clicked

The principal components are no longer features you can read off directly. PC1 might be "a weighted mix of age, income, and years of education," which is not as human-readable as any single original column. The tradeoff is compression: when the original features are correlated, you can often retain 90%+ of the variance with a fraction of the original features.
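A small sketch of how the "retain 90%+ of the variance" idea can be turned into a choice of $K$: divide each eigenvalue by the total, take the running sum, and stop at the first point that reaches the target. The helper name `components_for_variance` and the eigenvalues below are invented for illustration only.

```python
import numpy as np

def components_for_variance(eigenvalues, threshold=0.90):
    """Smallest number of components whose cumulative variance ratio reaches threshold.

    Hypothetical helper; returns (k, cumulative_ratios).
    """
    # Sort descending so the largest-variance directions are counted first.
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    ratios = eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(ratios)

    # First index where the running total reaches the threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off leaving the final sum just under 1.0.
    k = min(k, len(eigenvalues))
    return k, cumulative

# Made-up eigenvalues for a 5-feature dataset.
example_eigenvalues = [6.0, 2.5, 1.0, 0.3, 0.2]
k, cumulative = components_for_variance(example_eigenvalues, threshold=0.90)
print(k)
print(cumulative)
```

With these invented numbers, the first two components cover about 85% of the variance and the third pushes the total past 90%, so three of five features' worth of columns would be kept.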

Still shaky on

The requirement that principal components be orthogonal (perpendicular to each other) ensures that the data projected onto them is uncorrelated, so each PC captures genuinely new variance. Without orthogonality you'd get redundant components. I understand this conceptually but haven't done the linear algebra derivation yet.
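Short of the full derivation, the claim can at least be checked numerically. The idea, as I understand it: for two eigenvectors $\mathbf{v}_i$ and $\mathbf{v}_j$ of $\mathbf{C}$, the covariance between the projections is $\mathbf{v}_i^T \mathbf{C} \mathbf{v}_j = \lambda_j \mathbf{v}_i^T \mathbf{v}_j$, which is zero when the two are orthogonal. The sketch below uses synthetic data (made up for the demo) and checks both halves: that the eigenvectors are orthonormal, and that the covariance of the projected data is diagonal.

```python
import numpy as np

# Synthetic correlated features, invented for this check.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

# Covariance matrix of the mean-centered data.
X_centered = X - X.mean(axis=0)
C = X_centered.T @ X_centered / (X_centered.shape[0] - 1)

# Columns of V are the eigenvectors of the symmetric matrix C.
eigenvalues, V = np.linalg.eigh(C)

# Orthonormal eigenvectors: V^T V should be the identity matrix.
gram = V.T @ V
is_orthonormal = np.allclose(gram, np.eye(4))

# Project onto all components and look at the covariance of the result.
scores = X_centered @ V
score_cov = np.cov(scores, rowvar=False)

# Uncorrelated projections: off-diagonal entries vanish, diagonal holds the eigenvalues.
is_diagonal = np.allclose(score_cov, np.diag(eigenvalues))

print(is_orthonormal, is_diagonal)
```

The original features in `X` are correlated with each other, but the columns of `scores` are not, which is the "no redundant components" property in numeric form.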

What's next

Now that I know how these models work, how do you actually measure whether they're any good? Starting with regression metrics: MSE and $R^2$.