How Decision Trees Pick the Right Question to Ask: The Gini Index Explained

decision-trees gini-index information-gain classification supervised-learning

A decision tree classifies data the way a doctor works through a differential diagnosis: by asking a sequence of yes/no questions, narrowing down the possibilities with each answer until it reaches a conclusion. What makes the algorithm interesting isn't the structure (a flowchart); it's the selection problem: how does it know which question to ask first?

The tree is hierarchical. It starts at a root node: the first question. Each question splits the data into branches. The process recurses until you reach leaf nodes: terminal nodes that give a final prediction. At each node, the algorithm has to pick the feature and threshold that create the most useful split.
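To make the root/branch/leaf vocabulary concrete, here is a minimal sketch of a tree as a data structure. The class names (`Node`, `Leaf`), the feature names, and the thresholds are all made up for illustration; a real library stores trees differently.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: str  # final answer at a terminal node


@dataclass
class Node:
    feature: str      # which feature the question is about
    threshold: float  # the question is "is feature <= threshold?"
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def predict(tree, sample):
    # Walk down from the root, following one branch per answer,
    # until a leaf is reached.
    while isinstance(tree, Node):
        tree = tree.yes if sample[tree.feature] <= tree.threshold else tree.no
    return tree.prediction


# Hypothetical hand-built tree, not learned from data.
root = Node(
    "temperature", 38.0,
    yes=Leaf("healthy"),
    no=Node("cough_days", 3.0, yes=Leaf("cold"), no=Leaf("flu")),
)

print(predict(root, {"temperature": 39.2, "cough_days": 5}))  # flu
```

Prediction is the easy half. The hard half is deciding which feature and threshold go into each `Node`, and that is what the impurity measures are for.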

Two metrics measure "most useful":

Information Gain measures how much a split reduces entropy (disorder). High entropy means the data is a mix of classes. After a good split, each branch should be purer. $\text{Information Gain} = \text{entropy before} - \text{weighted entropy after}$, where each branch's entropy is weighted by the fraction of samples that land in it.
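A small sketch of that calculation, assuming the usual Shannon entropy $H = -\sum_i P_i \log_2 P_i$ with $P_i$ the proportion of class $i$ in a set of labels. The function names and the toy labels are mine, chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    # Shannon entropy in bits of the class distribution in `labels`.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    # Entropy before the split minus the size-weighted entropy after it.
    n = len(parent)
    weighted_after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_after


# Toy data: a 50/50 parent split into two mostly-pure children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(entropy(parent))                                  # 1.0
print(round(information_gain(parent, left, right), 3))  # 0.278
```

The parent is as mixed as a two-class set can be (1 bit), and the split recovers about 0.28 bits of that.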

Gini Index measures the probability that a randomly picked element would be misclassified if it were labeled at random according to the node's class distribution:

$\text{Gini} = 1 - \sum_i P_i^2$

where $P_i$ is the proportion of class $i$ in the node. Gini of 0 means perfectly pure. With two classes, the maximum is 0.5 (a 50/50 mix); with $k$ classes, the maximum is $1 - 1/k$.
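The formula is short enough to check by hand. Here is a sketch with a hypothetical `gini` helper and made-up labels:

```python
from collections import Counter


def gini(labels):
    # 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


print(gini(["a"] * 10))             # 0.0  (pure node)
print(gini(["a"] * 5 + ["b"] * 5))  # 0.5  (two classes, evenly mixed)
print(gini(["a", "b", "c", "d"]))   # 0.75 (four classes, evenly mixed)
```

The last line shows that the ceiling depends on the number of classes: four equally likely classes give 0.75, not 0.5.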

What clicked

These two measures usually pick the same split. The practical difference: Gini skips the logarithm, so it's slightly cheaper to compute. scikit-learn's DecisionTreeClassifier uses Gini by default (criterion="gini"), which fits that picture. I wasted 20 minutes thinking there was some deep philosophical difference.
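To see the agreement on a concrete case, here is a sketch that scans every threshold on a single made-up feature and scores each candidate split under both criteria. The helper names and the data are invented for illustration, and one toy dataset proves nothing in general.

```python
from collections import Counter
from math import log2


def impurity(labels, kind):
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    if kind == "gini":
        return 1.0 - sum(p * p for p in probs)
    return -sum(p * log2(p) for p in probs)  # entropy, in bits


def best_threshold(xs, ys, kind):
    # Try "x <= t" for every distinct value except the largest
    # (splitting at the maximum would leave the right branch empty).
    best_t, best_score = None, None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Size-weighted impurity of the two children; lower is better.
        score = (len(left) * impurity(left, kind)
                 + len(right) * impurity(right, kind)) / len(ys)
        if best_score is None or score < best_score:
            best_t, best_score = t, score
    return best_t


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = ["a", "a", "a", "a", "b", "a", "b", "b"]

print(best_threshold(xs, ys, "gini"), best_threshold(xs, ys, "entropy"))  # 4 4
```

Both criteria land on the threshold 4 here, even though the scores themselves are on different scales.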

Still shaky on

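A split with lower Gini is preferred, but the number being compared is the weighted average of the two children's Gini, with each child weighted by the share of samples it receives. I assumed an impure branch automatically makes a split bad. It doesn't: a small impure child barely moves the average, while a large one dominates it. I still need to build intuition for when looking at each child separately tells you something the average hides.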
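One way to probe this is to compute the size-weighted Gini that CART-style trees use as the split score and compare two hypothetical splits of the same 100 samples. The class counts below are invented.

```python
def gini_from_counts(counts):
    # Gini impurity from per-class sample counts, e.g. [40, 10].
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def weighted_gini(left_counts, right_counts):
    # Average of the children's Gini, weighted by child size.
    n_left, n_right = sum(left_counts), sum(right_counts)
    return (n_left * gini_from_counts(left_counts)
            + n_right * gini_from_counts(right_counts)) / (n_left + n_right)


# Split A: one big pure child, one small child that is a coin flip.
split_a = weighted_gini([90, 0], [5, 5])
# Split B: two equal-sized children, each moderately impure.
split_b = weighted_gini([40, 10], [10, 40])

print(round(split_a, 3), round(split_b, 3))  # 0.05 0.32
```

Split A contains a branch at the two-class maximum of 0.5, yet it scores far better than split B because that branch holds only 10 of the 100 samples.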

What's next

A fully grown tree memorizes the training data. Every leaf becomes so specific that it captures noise, not the underlying pattern. One common remedy is tomorrow's topic: Random Forests.