Ridge Shrinks Coefficients, Lasso Zeros Them Out: Why That Difference Matters

regularization linear-regression math optimization feature-selection

Regularization is not about making your model more accurate on training data. It is about making it less wrong on data it has never seen. When a model fits too tightly, it chases noise in the training set rather than learning the actual pattern. The fix is to add a penalty term to the loss function that punishes large coefficients.

Ordinary linear regression minimizes the Sum of Squared Residuals:

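$$SSR = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Regularization adds a penalty term on top. For a model with coefficients $m_1, \dots, m_p$, there are two common versions:

Ridge (L2) adds a penalty proportional to the square of each coefficient:

$$SSR + \lambda \sum_{j=1}^{p} m_j^2$$

Lasso (L1) adds a penalty proportional to the absolute value of each coefficient:

$$SSR + \lambda \sum_{j=1}^{p} |m_j|$$

With a single slope $m$, these reduce to $SSR + \lambda m^2$ and $SSR + \lambda |m|$.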

$\lambda$ controls how hard the penalty bites. Set it to zero and you have ordinary least squares. Increase it and coefficients get forced smaller. Set it too high and you underfit.
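To see $\lambda$ at work, here is a minimal Python sketch for the simplest case: one feature, no intercept, and the Ridge penalty. Under those assumptions the penalized loss has a closed-form minimizer, $m = \sum x_i y_i / (\sum x_i^2 + \lambda)$. The toy data and the helper names `ridge_slope` and `ssr` are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - m*x)^2) + lam * m^2 (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def ssr(xs, ys, m):
    """Sum of squared residuals on the training data for slope m."""
    return sum((y - m * x) ** 2 for x, y in zip(xs, ys))


# Toy data (hypothetical), roughly y = 2x plus a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
slopes = [ridge_slope(xs, ys, lam) for lam in lams]
errors = [ssr(xs, ys, m) for m in slopes]

for lam, m, err in zip(lams, slopes, errors):
    print(f"lambda={lam:7.1f}  slope={m:.4f}  training SSR={err:.3f}")
```

At $\lambda = 0$ the slope is the ordinary least squares fit. As $\lambda$ grows, the slope shrinks toward zero and the training SSR climbs, which is the underfitting end of the dial.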

The distinction between Ridge and Lasso is geometric. With Ridge, the penalty is a smooth bowl. For a single coefficient $m$, the derivative of $\lambda m^2$ is $2\lambda m$, which approaches zero as $m$ approaches zero. The pull toward zero weakens as the coefficient shrinks, so the penalty keeps nudging coefficients toward zero but almost never lands them exactly there. Ridge shrinks.

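Lasso uses the absolute value. The derivative of $\lambda |m|$ is $\pm\lambda$ everywhere except exactly at zero, where it is undefined, so the pull toward zero stays constant no matter how small the coefficient gets. Viewed as a constraint, the Lasso penalty confines the coefficients to a diamond where Ridge confines them to a circle, and optimal solutions frequently land at the corners of that diamond. Those corners sit on the axes, where one or more coefficients are exactly zero. Lasso eliminates.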
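The contrast between the two pulls can be sketched by applying only the penalty to a single coefficient, with no data term at all (a deliberate simplification). The Ridge step below is a plain gradient step on $\lambda m^2$. For Lasso, the sketch uses a soft-threshold (proximal) step, a standard way to handle the kink at zero; it is one common choice, not the only one. The function names, learning rate, and starting value are illustrative.

```python
def ridge_step(m, lam, lr):
    """Gradient step on lam * m^2: the pull 2*lam*m fades as m shrinks."""
    return m - lr * 2.0 * lam * m


def lasso_step(m, lam, lr):
    """Soft-threshold step on lam * |m|: a constant pull of lr*lam, clamped at zero."""
    pull = lr * lam
    if abs(m) <= pull:
        return 0.0
    return m - pull if m > 0 else m + pull


# Start both coefficients at the same (arbitrary) value.
m_ridge = m_lasso = 1.0
lam, lr = 1.0, 0.1

for step in range(1, 51):
    m_ridge = ridge_step(m_ridge, lam, lr)
    m_lasso = lasso_step(m_lasso, lam, lr)
    if step % 10 == 0:
        print(f"step {step:2d}  ridge={m_ridge:.6f}  lasso={m_lasso:.6f}")
```

Each Ridge step multiplies the coefficient by a constant factor below one, so it gets small but stays nonzero. The Lasso step subtracts a fixed amount and stops at exactly zero.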

This is the practical payoff: Ridge works when most features are genuinely useful but noisy. Lasso works when you suspect many features are irrelevant. A dataset with 50 input variables where only 10 actually matter is a Lasso problem. Ridge keeps all 50 with small weights. With a well-chosen $\lambda$, Lasso can set most or all of the other 40 to exactly zero.
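Here is a rough sketch of the 50-feature scenario on synthetic data, in plain Python. Everything in it is an assumption made for illustration: the data generator, the noise level, the value of $\lambda$, and the choice of coordinate descent as the solver (libraries such as scikit-learn ship tuned implementations). The function name `coordinate_descent` is hypothetical.

```python
import random


def coordinate_descent(cols, y, lam, penalty, sweeps=50):
    """Minimize SSR + lam * penalty(w), updating one coefficient at a time.

    cols[j] holds feature j for every sample; penalty is "l1" or "l2".
    """
    w = [0.0] * len(cols)
    resid = list(y)  # residual y - Xw; equals y while w is all zeros
    z = [sum(v * v for v in col) for col in cols]
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = sum(v * r for v, r in zip(col, resid)) + z[j] * w[j]
            if penalty == "l2":
                new = rho / (z[j] + lam)  # Ridge: rescale
            else:
                mag = max(abs(rho) - lam / 2.0, 0.0)  # Lasso: soft-threshold
                new = (mag if rho > 0 else -mag) / z[j]
            delta = new - w[j]
            if delta != 0.0:
                resid = [r - delta * v for r, v in zip(resid, col)]
            w[j] = new
    return w


# Synthetic data: 50 features, only the first 10 carry signal.
random.seed(0)
n, p, k = 200, 50, 10
true_w = [3.0] * k + [0.0] * (p - k)
cols = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [
    sum(true_w[j] * cols[j][i] for j in range(k)) + random.gauss(0.0, 0.5)
    for i in range(n)
]

lam = 80.0
ridge_w = coordinate_descent(cols, y, lam, "l2")
lasso_w = coordinate_descent(cols, y, lam, "l1")

ridge_zeros = sum(1 for c in ridge_w if c == 0.0)
lasso_zeros = sum(1 for c in lasso_w if c == 0.0)
print(f"ridge: {ridge_zeros} of {p} coefficients exactly zero")
print(f"lasso: {lasso_zeros} of {p} coefficients exactly zero")
```

The only difference between the two solvers is a single update rule: Ridge divides by a larger number, while Lasso subtracts a threshold and clamps at zero. That clamp is where the exact zeros come from.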

With a well-chosen $\lambda$, both methods trade a small increase in bias for a larger reduction in variance. The model becomes slightly less accurate on training data but more stable across different samples.
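The trade can be made concrete with a small simulation. This sketch assumes a one-feature model with no intercept, a fixed set of inputs, a known true slope, and heavy noise, then refits on many resampled datasets. All constants and names are hypothetical, chosen so that the noise is large relative to the signal.

```python
import random
import statistics

random.seed(1)
TRUE_SLOPE = 0.5  # hypothetical ground truth
NOISE_SD = 3.0    # deliberately noisy
xs = [1.0, 2.0, 3.0, 4.0]
sxx = sum(x * x for x in xs)


def simulate(lam, trials=2000):
    """Fit OLS and Ridge slopes on the same freshly sampled noisy datasets."""
    ols, ridge = [], []
    for _ in range(trials):
        ys = [TRUE_SLOPE * x + random.gauss(0.0, NOISE_SD) for x in xs]
        sxy = sum(x * y for x, y in zip(xs, ys))
        ols.append(sxy / sxx)
        ridge.append(sxy / (sxx + lam))
    return ols, ridge


def summarize(estimates):
    """Return (bias, variance, mean squared error) of the slope estimates."""
    bias = statistics.fmean(estimates) - TRUE_SLOPE
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean((m - TRUE_SLOPE) ** 2 for m in estimates)
    return bias, variance, mse


ols, ridge = simulate(lam=10.0)
ols_bias, ols_var, ols_mse = summarize(ols)
ridge_bias, ridge_var, ridge_mse = summarize(ridge)

print(f"OLS    bias={ols_bias:+.3f}  variance={ols_var:.3f}  mse={ols_mse:.3f}")
print(f"Ridge  bias={ridge_bias:+.3f}  variance={ridge_var:.3f}  mse={ridge_mse:.3f}")
```

In this noisy setup the Ridge estimates are pulled away from the true slope on average (bias) but scatter less (variance), and the mean squared error comes out lower. With cleaner data or a badly chosen $\lambda$, the trade can go the other way.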