Why One Decision Tree Always Overfits: And How Random Forests Fix It

random-forest decision-trees ensemble-learning overfitting supervised-learning

Yesterday I worked through how Decision Trees pick their splits using the Gini index. Today I ran into their biggest problem face-first: a fully grown Decision Tree almost always overfits. It memorizes the training data instead of learning from it. The fix is one of the most elegant ideas in ML: Random Forests.

Here's why a single tree fails. Left unconstrained, the tree keeps splitting until every leaf contains a single class (or until it hits a depth limit, if you set one). On training data, this looks perfect. On test data, it falls apart because those leaves captured noise specific to the training set, not the actual pattern. Tweak the training data slightly and you can get a very different tree. The technical name for this instability is high variance.
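A minimal sketch of that train/test gap, assuming scikit-learn is available. The synthetic dataset, the amount of label noise, and the 70/30 split are illustrative choices, not anything canonical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 reassigns the label of about 20% of rows at random,
# so there is genuine noise for the tree to memorize.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# No max_depth: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The training score should come out perfect while the test score lands well below it. Changing `random_state` in the split is a quick way to watch the learned tree change shape.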

Random Forest attacks variance with two stacked ideas:

Bagging (Bootstrap Aggregating): Train $N$ trees, but each tree sees a different bootstrap sample of the training data: rows drawn with replacement, usually as many as the original set has, so some rows appear multiple times and roughly a third don't appear at all. Each tree learns a slightly different version of the problem.

Feature randomness: At every split, each tree can only consider a random subset of features. This prevents all trees from splitting on the same dominant feature early on, which would leave them highly correlated, and averaging correlated trees removes very little variance.
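Both sources of randomness are easy to poke at with the standard library alone. This is a rough sketch, not how any particular library implements it: `n_rows` and `n_features` are made-up sizes, and taking $\sqrt{p}$ features per split is a common convention for classification that I'm assuming here.

```python
import math
import random

rng = random.Random(0)

n_rows = 10_000      # hypothetical training-set size
n_features = 16      # hypothetical feature count

# Bagging: draw n_rows row indices WITH replacement for one tree.
bootstrap = [rng.randrange(n_rows) for _ in range(n_rows)]
unique_fraction = len(set(bootstrap)) / n_rows
# For large n this fraction approaches 1 - 1/e, about 0.632.
print(f"unique rows in this tree's sample: {unique_fraction:.3f}")

# Feature randomness: at one split, only a random subset of features is eligible.
k = int(math.sqrt(n_features))   # assumed convention: sqrt(p) features per split
candidates = sorted(rng.sample(range(n_features), k))
print(f"features considered at this split: {candidates}")
```

The rows a given tree never sees (a bit over a third of them) are that tree's out-of-bag rows, which is what makes out-of-bag error estimates possible.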

When predicting, you take a majority vote (classification) or average (regression) across all $N$ trees. Individual trees are often wrong, but they tend to be wrong in different directions, so many of the errors cancel out.
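A toy simulation of the voting effect, under a strong assumption: each voter is right 60% of the time and, crucially, independently of the others. Real trees trained on overlapping data are correlated, so the gain in practice is smaller than what this shows. The function name and the numbers are hypothetical.

```python
import random

def majority_vote_accuracy(n_voters, p_correct, n_trials, rng):
    """Fraction of trials in which more than half of the voters are right."""
    wins = 0
    for _ in range(n_trials):
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials

rng = random.Random(0)
# Odd voter counts avoid ties in a two-class vote.
results = {n: majority_vote_accuracy(n, 0.6, 2000, rng) for n in (1, 11, 101)}
for n, acc in results.items():
    print(f"{n:>3} voters -> majority-vote accuracy ~ {acc:.3f}")
```

With one voter the accuracy sits near 0.6, and it climbs steadily as voters are added, even though no individual voter got any better.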

What clicked

This is the core insight of ensemble learning: combining many unstable models can create a stable one. Each tree is skewed toward whatever random slice of data it saw (that's variance, not bias in the technical sense), but the skews point in different directions, so the average lands close to the truth.
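One way to make the averaging argument precise, under simplifying assumptions: at a fixed input, treat each tree's prediction $T_i$ as a random variable with the same variance $\sigma^2$, and suppose any two trees have the same pairwise correlation $\rho$. The symbols $T_i$, $\sigma^2$ and $\rho$ are introduced here for illustration. The variance of the averaged prediction is then

$$
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} T_i\right) = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2
$$

The second term shrinks as $N$ grows, which is what bagging buys. The first term is a floor that more trees cannot remove, and it only drops if the trees become less correlated, which is the job of feature randomness.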

Still shaky on

How many trees is enough? At some point adding more trees gives diminishing returns and just costs compute. I've seen 100 as a common default (it's what scikit-learn uses for `n_estimators`), but I want to understand the actual diagnostic: probably out-of-bag error plotted against tree count.
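A sketch of that diagnostic, assuming scikit-learn is available. It grows one forest incrementally with `warm_start=True` and records the out-of-bag error at a few tree counts. The dataset and the list of tree counts are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# warm_start=True keeps the trees already built and only adds new ones on refit.
# oob_score=True scores each row using only the trees that never saw it.
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_errors = {}
for n_trees in (25, 50, 100, 200, 400):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    oob_errors[n_trees] = 1.0 - forest.oob_score_

for n_trees, err in oob_errors.items():
    print(f"{n_trees:>4} trees -> OOB error {err:.3f}")
```

Plotting `oob_errors` against tree count and looking for where the curve flattens is one reasonable way to pick a stopping point. The curve is noisy, so I wouldn't read much into small wiggles between neighboring counts.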

What's next

SVMs take a completely different geometric approach to classification: instead of building a tree of questions, they find the single boundary (a line in 2D, a hyperplane in general) that separates the classes with the widest possible margin.