Feature Selection Picks What to Keep, Feature Extraction Builds Something New
Not all features in a dataset are worth keeping. Some are redundant. Some are noise. Some slow down training without adding any predictive signal. Feature engineering addresses this through two distinct strategies: feature selection (pick the useful ones, drop the rest) and feature extraction (transform everything into a new, smaller set that captures what actually matters).
These sound similar but they operate differently.
Feature selection leaves the original variables intact; you only choose which ones to include in the model. There are three main families of methods.
Filter methods score each feature independently before any model is trained, using statistical measures such as Chi-Square, Fisher's Score, or mutual information. They are fast and model-agnostic, but they do not account for how features interact with each other.
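As a rough illustration of a filter method, the sketch below scores features with the chi-square statistic using scikit-learn's `SelectKBest`. The Iris dataset and `k=2` are assumptions made for the example, not choices from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris is a stand-in dataset (an assumption for this sketch). Its features
# are all non-negative, which the chi-square score requires.
X, y = load_iris(return_X_y=True)

# Score each feature against the target on its own, then keep the top k.
# No predictive model is trained at any point.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # one chi-square score per original feature
print(selector.get_support())  # boolean mask of the features that were kept
print(X_selected.shape)        # same rows, only k columns
```

Because each score is computed per feature, two highly correlated features can both rank near the top even though the second adds little once the first is included.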
Wrapper methods evaluate feature subsets using actual model performance. Forward selection starts empty and adds one feature at a time, keeping each addition only if it improves the score. Backward elimination starts with all features and removes one at a time. Because they measure how features perform in combination, wrappers often find better subsets than filter methods, but they are expensive: a model is trained for every candidate subset.
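Forward selection can be sketched as a short greedy loop. The helper `forward_select` below is hypothetical (written for this example), and the logistic regression model, the Iris data, and 5-fold cross-validation are all assumptions; any estimator and scoring scheme could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

def forward_select(X, y, model, cv=5):
    """Greedy forward selection driven by mean cross-validated score."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Train and score one model per candidate feature added to the
        # current subset. This is where the cost of wrappers comes from.
        scores = {
            j: cross_val_score(model, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best_j = max(scores, key=scores.get)
        if scores[best_j] <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score

selected, score = forward_select(X, y, model)
print(selected, round(score, 3))
```

scikit-learn also ships `SequentialFeatureSelector`, which covers both the forward and backward directions; the hand-written loop is shown only to make the number of model fits visible.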
Embedded methods perform selection during training itself. Lasso (L1 regularization) is the textbook case: with a strong enough penalty, coefficients for irrelevant features get pushed to exactly zero during loss minimization. Decision trees and gradient boosting select features implicitly too, since both learn to split on the features that reduce error most and may never use the rest.
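The Lasso behavior can be seen on a small synthetic problem. In this sketch the data, the true coefficients, and `alpha=0.1` are all assumptions chosen so that the effect is easy to observe.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# By construction, only features 0 and 1 influence the target;
# features 2, 3 and 4 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty (strength alpha) shrinks every coefficient and can set
# the weakest ones to exactly zero while the model is being fitted.
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)

# Indices of the features whose coefficients survived the penalty.
kept = np.flatnonzero(lasso.coef_)
print(kept)
```

The surviving coefficients are also shrunk toward zero, so a common follow-up is to refit an unpenalized model on the kept features only.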
Feature extraction does something different. It does not pick from the existing set. It creates new variables by transforming the original ones. PCA finds the directions of maximum variance and projects the data onto them. t-SNE also produces new coordinates, but it is non-linear, aims to preserve local neighborhoods rather than overall variance, and is used primarily for visualization. The statistical properties of the raw data (mean, standard deviation, correlation, covariance) define what gets captured: PCA, for example, is computed from the covariance matrix of the mean-centered data.
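To make the link between PCA and the data's covariance concrete, the sketch below computes two principal components by hand with NumPy and checks them against scikit-learn's `PCA`. The Iris dataset and the choice of two components are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Manual PCA: center the data, then eigendecompose its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to largest variance first
components = eigvecs[:, order[:2]]       # top two directions, one per column
X_manual = Xc @ components               # project onto those directions

# Library version for comparison.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# The two projections agree up to the sign of each component.
print(np.allclose(np.abs(X_manual), np.abs(X_reduced)))
print(pca.explained_variance_ratio_)
```

The sign of a principal component is arbitrary, which is why the comparison uses absolute values.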
The key trade-off is interpretability. After feature selection, the variables in your model are still the original ones. You can explain what each one means. After PCA, your axes are linear combinations of everything. They capture variance efficiently but they are harder to explain.
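The difference in interpretability shows up as soon as you ask what each output column means. This sketch, again assuming the Iris dataset and two output features, prints the names that survive selection next to the weights that define each principal component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X, y = data.data, data.target
feature_names = list(data.feature_names)

# Selection: the outputs are original columns, so they keep their names.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected_names = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
print("Selected:", selected_names)

# Extraction: each component is a weighted mix of all original columns.
pca = PCA(n_components=2).fit(X)
for i, weights in enumerate(pca.components_, start=1):
    loadings = {n: round(float(w), 2) for n, w in zip(feature_names, weights)}
    print(f"PC{i}:", loadings)
```

A selected feature can be described by its name alone, while describing a component means describing every weight in its row.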
| Aspect | Feature Selection | Feature Extraction |
|---|---|---|
| Output | Subset of original features | New transformed features |
| Interpretability | High | Lower |
| Compute cost | Varies: low for filters, high for wrappers | Depends on the algorithm: modest for PCA, higher for t-SNE |
| Typical methods | Filter, wrapper, embedded | PCA, t-SNE, and other transformations |