# Machine Learning FAQ

## Why are we growing decision trees via entropy instead of the classification error?

Before we get to the main question – the real interesting part – let's take a look at some of the (classification) decision tree basics to make sure that we are on the same page. Growing a tree works roughly as follows:

1. Split the parent node at the feature $x_i$ that minimizes the sum of the child node impurities (i.e., maximizes the information gain).
2. Assign the training samples to the new child nodes.
3. Stop if the leaf nodes are pure or an early stopping criterion is satisfied; otherwise, repeat steps 1 and 2 for each new child node.

One early stopping criterion is that splitting a node does not lead to an information gain.*

(* This is the important part, as we will see later.)

## Impurity Metrics and Information Gain

Formally, we can write the "information gain" for a split of the parent dataset $D_p$ on feature $f$ as

$$\mathrm{IG}(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j),$$

where $I$ is the impurity measure, $N_p$ and $N_j$ are the numbers of samples at the parent node and at the $j$-th child node, and $m$ is the number of child nodes. (Note that since the parent impurity is a constant, we could also simply compute the average child node impurities, which would have the same effect.)

For simplicity, we will only compare the "entropy" criterion to the classification error; however, the same concepts apply to the Gini index as well. The entropy of a node $t$ is defined as

$$I_H(t) = -\sum_{i=1}^{C} p(i \mid t)\, \log_2 p(i \mid t)$$

for all non-empty classes ($p(i \mid t) \neq 0$), where $p(i \mid t)$ is the proportion (or frequency, or probability) of the samples that belong to class $i$ at a particular node $t$, and $C$ is the number of unique class labels. Although we are all very familiar with the classification error, we write it down for completeness:

$$I_E(t) = 1 - \max_i \, p(i \mid t)$$

Here comes the more interesting part, just as promised. Let's consider the following binary tree, starting with a training set of 40 "positive" training samples ($y = 1$) and 80 training samples from the "negative" class ($y = 0$). Further, let's assume that it is possible to come up with 3 splitting criteria (based on 3 binary features $x_1$, $x_2$, and $x_3$) that can separate the training samples perfectly:

*[Figure: the decision tree model. The root node (40 positive, 80 negative samples) is split on $x_1$ into two child nodes containing 28/70 and 12/50 positive samples; splitting those on $x_2$ and $x_3$ yields pure leaf nodes.]*

Now, is it possible to learn this hypothesis (i.e., tree model) by minimizing the classification error as a criterion function? Let's do the math for the first split:

$$I_E(\text{parent}) = 1 - \tfrac{80}{120} = \tfrac{40}{120} \approx 0.3333$$

$$I_E(\text{child}_1) = 1 - \tfrac{42}{70} = \tfrac{28}{70} = 0.4, \qquad I_E(\text{child}_2) = 1 - \tfrac{38}{50} = \tfrac{12}{50} = 0.24$$

$$\tfrac{70}{120} \cdot 0.4 + \tfrac{50}{120} \cdot 0.24 = \tfrac{40}{120} \approx 0.3333$$

As we can see, the information gain after the first split is exactly 0, since the average classification error of the 2 child nodes is exactly the same as the classification error of the parent node ($40/120 \approx 0.3333$). In this case, splitting the initial training set wouldn't yield any improvement in terms of our classification error criterion, and thus the tree algorithm would stop at this point. (For this statement to be true, we have to make the assumption that neither splitting on feature $x_2$ nor on $x_3$ would lead to an information gain either.)

Next, let's see what happens if we use entropy as the impurity metric:

$$I_H(\text{parent}) = -\tfrac{40}{120}\log_2\tfrac{40}{120} - \tfrac{80}{120}\log_2\tfrac{80}{120} \approx 0.918$$

$$I_H(\text{child}_1) \approx 0.971, \qquad I_H(\text{child}_2) \approx 0.795$$

$$\tfrac{70}{120} \cdot 0.971 + \tfrac{50}{120} \cdot 0.795 \approx 0.898 < 0.918$$

In contrast to the average classification error, the average child node entropy is *not* equal to the entropy of the parent node: the information gain here is roughly $0.02 > 0$. Thus, the splitting rule would continue until the child nodes are pure (after the next 2 splits).

So, why is this happening? For an intuitive explanation, let's zoom in on the entropy plot: the green square shapes are the entropy values for $p = 28/70$ and $p = 12/50$ of the first two child nodes in the decision tree model above, connected by a green (dashed) line.

*[Figure: the entropy $I_H(p)$ plotted over the class proportion $p$, with the two child node entropies marked as green squares connected by a dashed green line.]*
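The arithmetic above is easy to reproduce in a few lines of Python. This is a minimal sketch (the function and variable names are my own, not from the original post): it computes the information gain of the first split, once with the classification error and once with the entropy, for the parent set of 40 positive out of 120 samples and the 28/70 and 12/50 child nodes.

```python
import math

def entropy(p):
    """Binary entropy (base 2); p is the proportion of positive samples at the node."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def class_error(p):
    """Classification error for a binary node: 1 - max(p, 1 - p)."""
    return 1 - max(p, 1 - p)

def info_gain(impurity, parent, children):
    """parent and each child are (n_positive, n_total) pairs."""
    n_parent = parent[1]
    avg_child = sum(n / n_parent * impurity(pos / n) for pos, n in children)
    return impurity(parent[0] / parent[1]) - avg_child

parent = (40, 120)               # 40 positive out of 120 samples
children = [(28, 70), (12, 50)]  # the first split from the example

# Classification error: gain is 0 (up to floating point noise), so splitting stops.
print(info_gain(class_error, parent, children))
# Entropy: gain is positive (about 0.02), so splitting continues.
print(info_gain(entropy, parent, children))
```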
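The intuition behind the zoomed-in entropy plot can also be checked numerically. Entropy is strictly concave in $p$, so the chord connecting the two child-node entropies lies strictly below the curve at the parent's proportion; the classification error is piecewise linear, so when both child proportions fall on the same linear segment the chord coincides with the curve and the gain is zero. The snippet below (my own illustrative names, not code from the original post) verifies this for the example's numbers.

```python
import math

def entropy(p):
    """Binary entropy (base 2)."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def class_error(p):
    """Classification error for a binary node."""
    return 1 - max(p, 1 - p)

# The parent proportion p = 1/3 is the weighted average of the child
# proportions 28/70 = 0.4 (weight 70/120) and 12/50 = 0.24 (weight 50/120).
w1, w2 = 70 / 120, 50 / 120
p1, p2 = 28 / 70, 12 / 50
p_parent = w1 * p1 + w2 * p2   # = 1/3

# Classification error is linear on [0, 0.5] and both children lie in that
# interval, so the chord value equals the curve value -> zero information gain:
print(class_error(p_parent), w1 * class_error(p1) + w2 * class_error(p2))

# Entropy is strictly concave, so the chord lies strictly below the curve
# -> positive information gain:
print(entropy(p_parent), w1 * entropy(p1) + w2 * entropy(p2))
```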
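Finally, the three-step growing procedure from the top of the post (split on the best feature, assign samples, stop when nodes are pure or no split yields an information gain) can be sketched as a small recursive function. This is an illustrative sketch for binary 0/1 features and labels, with hypothetical names of my own; it is not the implementation from the original post or from any library.

```python
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def node_impurity(labels):
    """Entropy of a list of 0/1 labels."""
    return entropy(sum(labels) / len(labels)) if labels else 0.0

def info_gain(X, y, feature):
    """Gain from splitting rows of X (lists of 0/1 feature values) on one feature."""
    left = [label for row, label in zip(X, y) if row[feature] == 0]
    right = [label for row, label in zip(X, y) if row[feature] == 1]
    if not left or not right:
        return 0.0
    n = len(y)
    avg = len(left) / n * node_impurity(left) + len(right) / n * node_impurity(right)
    return node_impurity(y) - avg

def grow(X, y, features):
    # Stop if the node is pure or no candidate features remain.
    if node_impurity(y) == 0.0 or not features:
        return {"leaf": True, "label": round(sum(y) / len(y))}
    best = max(features, key=lambda f: info_gain(X, y, f))
    # The starred stopping rule: no split yields an information gain.
    if info_gain(X, y, best) <= 0.0:
        return {"leaf": True, "label": round(sum(y) / len(y))}
    split = lambda v: ([r for r in X if r[best] == v],
                      [l for r, l in zip(X, y) if r[best] == v])
    rest = [f for f in features if f != best]
    return {"leaf": False, "feature": best,
            "left": grow(*split(0), rest), "right": grow(*split(1), rest)}
```

For example, on the toy dataset `X = [[0, 0], [0, 1], [1, 0], [1, 1]]`, `y = [0, 0, 1, 1]`, the sketch splits on feature 0 (gain 1.0) and produces two pure leaves.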