Comprehensive Guide to Decision Tree Learning for Classification
Decision trees are a divide-and-conquer method that uses an inverted, tree-like structure to predict the outcome of a problem. The model predicts the value of a target variable by applying simple decision rules inferred from the available features. Decision trees are among the most powerful predictive analytics methods for generating business rules, and they can be used for both regression and classification.
It starts with a root node containing the complete dataset and uses impurity measures to split nodes into branches, and further into child nodes. The idea is to keep splitting until we reach homogeneity, i.e. a particular branch or leaf contains observations belonging to only one class. We use a function to measure the quality of a split; the two most important measures are Gini impurity and information gain (entropy).
We’ll see both methods by using sklearn’s decision tree on the breast cancer dataset.
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#importing data from sklearn
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

#converting data into a pandas dataframe
canc_data = pd.DataFrame(data.data, columns=data.feature_names)

#adding the target field to the dataset
canc_data['target'] = pd.Series(data.target)

#creating X and y
X_feature = list(canc_data.columns)
X_feature.remove('target')
X = canc_data[X_feature]
y = canc_data['target']

#splitting data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)
Gini Impurity Index :
Gini impurity measures how likely a randomly chosen observation from a node is to be misclassified if it is labelled randomly according to the class distribution in that node. In other words, Gini denotes purity, whereas Gini impurity tells us how mixed the node is.
Formally, Gini impurity is the summation over all classes of the product of pᵢ (the probability of an item with label i being chosen) and 1 - pᵢ (the probability of misclassifying it): Gini impurity = Σᵢ pᵢ(1 - pᵢ) = 1 - Σᵢ pᵢ². Its minimum value is 0, reached when the node is pure, i.e. all the elements in the node belong to one particular class. For a two-class problem its maximum value is 0.5, reached when the two classes are equally likely.
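To make the formula concrete, here is a minimal, hypothetical helper (not part of the original walkthrough) that computes Gini impurity from raw class counts:
import numpy as np

def gini_impurity(class_counts):
    #Gini impurity = 1 - sum(p_i^2), where p_i are the class probabilities in the node
    p = np.asarray(class_counts, dtype=float) / sum(class_counts)
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([50, 50]))    #0.5 -> two equally likely classes (maximum impurity)
print(gini_impurity([100, 0]))    #0.0 -> pure node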
Decision Tree Using Gini Index & Entropy Criteria
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

#importing libraries to check model performance
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn import metrics

for criterion in ['gini', 'entropy']:
    print("Output when criterion is :", criterion)
    print('----------------------------------')

    clf = DecisionTreeClassifier(criterion=criterion, random_state=0, max_depth=3).fit(X_train, y_train)

    #predicting for our test data
    y_pred = clf.predict(X_test)

    #generating classification report
    print(classification_report(y_test, y_pred))

    print("Accuracy score on test: ", round(clf.score(X_test, y_test), 3))
    print("Accuracy score on train: ", round(clf.score(X_train, y_train), 3))

    #printing log loss for the model
    print('log_loss : ', log_loss(y_test, y_pred))

    #let us find the ROC AUC score
    #before we call roc_auc_score(), we need the predicted probabilities for the test data
    pred_prob = pd.DataFrame(clf.predict_proba(X_test))

    #we'll also add the actual label
    test_result = pd.DataFrame({'actual': y_test})
    test_result = test_result.reset_index()
    test_result['prob_0'] = pred_prob.iloc[:, 0]
    test_result['prob_1'] = pred_prob.iloc[:, 1]

    #to calculate the ROC AUC score we pass the actual class labels and the predicted probability of class 1
    auc_score = round(metrics.roc_auc_score(test_result.actual, test_result.prob_1), 3)
    print("AUC Score : ", auc_score)

    #generating the confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap(cf_matrix, annot=True, cmap='Blues')
    plt.ylabel("True Label")
    plt.xlabel("Predicted Label")
    plt.show()
    print("\n")
#decision tree visualization. We will use the Graphviz software for this purpose.
from sklearn import tree
from sklearn.tree import export_graphviz
import pydotplus as pdot
from IPython.display import Image

#exporting the tree model into an odt file
gini = DecisionTreeClassifier(criterion='gini', random_state=0, max_depth=3).fit(X_train, y_train)
export_graphviz(gini, out_file='clf_tree_gini.odt', feature_names=X_train.columns, filled=True)
graph = pdot.graphviz.graph_from_dot_file('clf_tree_gini.odt')
graph.write_png('clf_tree_gini.png')
Image(filename='clf_tree_gini.png')
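As a side note, if Graphviz or pydotplus is not available, recent versions of scikit-learn (0.21+) ship tree.plot_tree, which renders the same tree with matplotlib only. A minimal sketch, reusing the fitted gini model and the training columns from above:
#alternative visualization without Graphviz (requires scikit-learn 0.21+)
plt.figure(figsize=(16, 8))
tree.plot_tree(gini, feature_names=list(X_train.columns),
               class_names=list(data.target_names), filled=True)
plt.show()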
We can also calculate the Gini index ourselves as a cross-check. In the top node, the probability of finding benign cancer is 255/398 and that of malignant is 143/398.
gini_imp_node1 = 1 - (pow(255/398,2) + pow(143/398,2))
print(round(gini_imp_node1,2))
We get a Gini index of 0.46, which matches the value shown in the top node of the decision tree.
Entropy :
Entropy is a measure of impurity or randomness in the observed data points. The higher the entropy, the lower the purity, so the goal of the model is to reduce impurity, which is achieved through information gain. For a node with class probabilities pᵢ, Entropy = - Σᵢ pᵢ log₂(pᵢ).
Entropy is maximum (1 for a two-class problem) when the probabilities of the two classes are equal, and minimum (0) when the node is pure.
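For reference, a small hypothetical helper (mirroring the manual calculation we do below) that computes entropy from class counts:
import math

def entropy(class_counts):
    #Entropy = sum over classes of -p_i * log2(p_i), skipping empty classes
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(entropy([50, 50]))    #1.0 -> two equally likely classes (maximum for two classes)
print(entropy([100, 0]))    #0.0 -> pure node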
Let us plot the decision tree using entropy as the measurement criterion.
#exporting the tree model into an odt file
entropy_clf = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=3).fit(X_train, y_train)
export_graphviz(entropy_clf, out_file='clf_tree_entropy.odt', feature_names=X_train.columns, filled=True)
graph = pdot.graphviz.graph_from_dot_file('clf_tree_entropy.odt')
graph.write_png('clf_tree_entropy.png')
Image(filename='clf_tree_entropy.png')
We can again calculate the entropy by hand. In the top node, the probability of finding benign cancer is 255/398 and that of malignant is 143/398.
import math
entropy_imp_node1 = -((255/398)*math.log2(255/398) + (143/398)*math.log2(143/398))
print(round(entropy_imp_node1,3))
We get an entropy of about 0.942 for the root node.
Information Gain : Information gain is the reduction in entropy from the parent node to the child nodes for a given feature. It measures how much splitting the dataset on that feature reduces the entropy, i.e. how much the parent's entropy decreases after the split. The decision tree model selects the feature that provides the highest information gain and splits the node based on that feature.
Information Gain (I.G.) = Entropy before branching - weighted average Entropy after branching
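To tie this back to the entropy calculation above, here is a small hypothetical sketch that computes the information gain of one binary split from class counts; the child counts below are made up purely for illustration:
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, left_counts, right_counts):
    #I.G. = entropy(parent) - weighted average entropy of the two children
    n = sum(parent_counts)
    weighted_child_entropy = (sum(left_counts) / n) * entropy(left_counts) \
                             + (sum(right_counts) / n) * entropy(right_counts)
    return entropy(parent_counts) - weighted_child_entropy

#root node with 255 / 143 samples, split into two hypothetical child nodes
print(round(information_gain([255, 143], [250, 20], [5, 123]), 3))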
Gini Index or Entropy
Entropy uses logarithms and is therefore computationally more expensive than the Gini index, so Gini-based splitting will generally be faster. In practice the two criteria often produce very similar trees, so we should also look at the model's performance metrics before deciding which measure to use.
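If you want to sanity-check the speed claim on this dataset, a rough, machine-dependent timing sketch like the one below can be used with the X_train and y_train created earlier; absolute numbers will vary, and the difference is usually small for a dataset of this size:
import time
from sklearn.tree import DecisionTreeClassifier

for criterion in ['gini', 'entropy']:
    start = time.time()
    DecisionTreeClassifier(criterion=criterion, random_state=0, max_depth=3).fit(X_train, y_train)
    print(criterion, ':', round(time.time() - start, 4), 'seconds')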
Every model has its pros and cons. Please go through the excellent scikit-learn documentation explaining decision trees.