How Decision Trees Pick the Right Question to Ask: The Gini Index Explained
A decision tree classifies data the same way a doctor does a differential diagnosis: by asking a sequence of yes/no questions, narrowing down possibilities with each answer until reaching a conclusion. What makes the algorithm interesting isn't the structure (a flowchart) but the choice it has to make: how does it know which question to ask first?
The tree is hierarchical. It starts at a root node: the first question. Each question splits the data into branches. The process recurses until you reach leaf nodes: terminal nodes that give a final prediction. At each node, the algorithm has to pick the feature and threshold that create the most useful split.
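To make the root/branch/leaf structure concrete, here's a minimal sketch of a hand-built tree and the loop that walks it. The feature names, thresholds, and labels are made up for illustration, and a real library stores its trees differently.

```python
# A hypothetical hand-built tree. Internal nodes ask "feature <= threshold?",
# leaves carry the final prediction.
tree = {
    "feature": "temperature",
    "threshold": 38.0,
    "left": {"label": "healthy"},   # temperature <= 38.0
    "right": {                      # temperature > 38.0
        "feature": "cough_days",
        "threshold": 3,
        "left": {"label": "cold"},  # cough_days <= 3
        "right": {"label": "flu"},  # cough_days > 3
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, answering one yes/no question per node."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(tree, {"temperature": 39.2, "cough_days": 5}))  # flu
```

Training a tree is the reverse problem: deciding which feature and threshold go in each internal node.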
Two metrics measure "most useful":
Information Gain measures how much a split reduces entropy (disorder). High entropy means the data is a mix of classes. After a good split, each branch should be purer. The gain is the parent's entropy minus the size-weighted average entropy of the children.
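Here's a minimal sketch of entropy and information gain in plain Python, assuming base-2 logarithms and a binary split. The function names are my own, not any library's API.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


parent = ["yes", "yes", "no", "no"]
print(entropy(parent))                                              # 1.0
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))       # 1.0
```

A 50/50 parent has 1 bit of entropy, and a split that separates the classes perfectly recovers all of it.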
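Gini Index measures the probability that a randomly picked element would be misclassified if it were labeled at random according to the node's class distribution:
$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
where $p_i$ is the proportion of class $i$ in the node and $C$ is the number of classes. Gini of 0 means perfectly pure. With two classes, Gini of 0.5 means maximum impurity (a 50/50 mix); with $C$ classes the maximum is $1 - 1/C$.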
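The Gini definition fits in a few lines. This is a sketch with a function name of my choosing, computing one minus the sum of squared class proportions:

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


print(gini(["a", "a", "a", "a"]))          # 0.0  (perfectly pure)
print(gini(["a", "a", "b", "b"]))          # 0.5  (two classes, 50/50)
print(round(gini(["a", "b", "c"]), 3))     # 0.667 (three classes, evenly mixed)
```

The last line shows why 0.5 is only the ceiling for two classes: an even three-way mix scores higher.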
What clicked
These two measures almost always pick the same split. The practical difference: Gini skips the logarithm, so it's slightly cheaper to compute. sklearn's DecisionTreeClassifier uses Gini by default (criterion="gini"), and speed is the usual explanation. I wasted 20 minutes thinking there was some deep philosophical difference.
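A quick way to see the agreement is to score every candidate threshold on a toy dataset with both measures. This is a sketch on made-up one-feature data, with helper names of my own. Minimizing the weighted child entropy is the same as maximizing information gain, since the parent's entropy is fixed.

```python
from collections import Counter
from math import log2


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(xs, ys, impurity):
    """Return the threshold t (split: x <= t) with the lowest size-weighted impurity."""
    n = len(xs)

    def score(t):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        return (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)

    # Skip the largest value: splitting there would leave the right side empty.
    candidates = sorted(set(xs))[:-1]
    return min(candidates, key=score)


# Made-up data: one numeric feature, two classes, one "noisy" point at x=5.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1]

print(best_threshold(xs, ys, gini))     # 3
print(best_threshold(xs, ys, entropy))  # 3
```

One toy dataset doesn't prove anything general, but it matches the "almost always the same split" observation.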
Still shaky on
A split with lower Gini is preferred, and the usual score is the size-weighted average of the two children's Gini. But you still need to check both children separately: even if the average looks small, a split that leaves high impurity on one branch is still bad.
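To see how an average can hide one messy branch, here's a sketch that reports each child's Gini alongside the size-weighted average. The split sizes are invented for illustration, and the helper names are mine.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_report(left, right):
    """Return (left Gini, right Gini, size-weighted average) for a binary split."""
    n = len(left) + len(right)
    g_left, g_right = gini(left), gini(right)
    weighted = (len(left) / n) * g_left + (len(right) / n) * g_right
    return g_left, g_right, weighted


# Invented split: a large pure branch and a small 50/50 branch.
left = ["a"] * 90
right = ["a"] * 5 + ["b"] * 5

g_left, g_right, weighted = split_report(left, right)
print(g_left, g_right)       # 0.0 0.5
print(round(weighted, 3))    # 0.05
```

The weighted score of 0.05 looks great, yet the right branch is as impure as a two-class node can be, so it would need further splitting.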
What's next
A fully grown tree memorizes the training data. Every leaf becomes so specific that it captures noise, not the underlying pattern. One fix is tomorrow's topic: Random Forests.