Decision trees

Decision trees are simple yet powerful supervised learning methods that construct a decision tree model, which is then used to make predictions.

The main advantage of this model is that a human being can easily understand and reproduce the sequence of decisions (especially if the number of attributes is small) taken to predict the target class of a new instance. This is very important for tasks such as medical diagnosis or credit approval, where we want to show a reason for the decision, rather than just saying this is what the training data suggests.

The problem we would like to solve is to determine whether a Titanic passenger would have survived, given their age, passenger class, and sex.

Preprocessing the data

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import csv
import numpy as np
with open('titanic.csv') as csvfile:
    titanic_reader = csv.reader(csvfile, delimiter=',', quotechar='"')

    # Header contains feature names
    row = next(titanic_reader)
    feature_names = np.array(row)

    # Load dataset, and target classes
    titanic_X, titanic_y = [], []
    for row in titanic_reader:
        titanic_X.append(row)
        titanic_y.append(row[2]) # The target value is "survived"

    titanic_X = np.array(titanic_X)
    titanic_y = np.array(titanic_y)
In [3]:
print (feature_names, titanic_X[0], titanic_y[0])
['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest' 'room'
 'ticket' 'boat' 'sex'] ['1' '1st' '1' 'Allen, Miss Elisabeth Walton' '29.0000' 'Southampton'
 'St Louis, MO' 'B-5' '24160 L221' '2' 'female'] 1

Keep only class (1st, 2nd, 3rd), age (float), and sex (female, male).

In [4]:
titanic_X = titanic_X[:, [1, 4, 10]]
feature_names = feature_names[[1, 4, 10]]

We have selected feature numbers 1, 4, and 10, that is, class, age, and sex, based on the assumption that the remaining attributes have no effect on the passenger’s survival. Feature selection is an extremely important step when creating a machine learning solution. If the algorithm is not given good features as input, it will not have good enough material to learn from, and the results will be poor, no matter how good the machine learning algorithm itself is. Sometimes feature selection is done manually, based on our knowledge of the problem’s domain and the machine learning method we plan to use; sometimes it may be done with automatic tools that evaluate the features and select the most promising ones.
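For instance, scikit-learn’s feature_selection module provides tools for ranking features automatically. The following is only a minimal sketch on a tiny synthetic numeric matrix (at this point titanic_X still contains strings, so it cannot be fed to the selector directly):

from sklearn.feature_selection import SelectKBest, chi2
import numpy as np

# Tiny synthetic example: 6 instances, 3 non-negative numeric features, binary target
X = np.array([[1, 0, 3], [2, 1, 3], [1, 0, 2], [3, 1, 1], [2, 0, 1], [3, 1, 2]])
y = np.array([0, 0, 0, 1, 1, 1])

selector = SelectKBest(chi2, k=2)   # keep the two highest-scoring features
X_new = selector.fit_transform(X, y)
print(selector.scores_)             # per-feature chi-squared scores
print(X_new.shape)                  # (6, 2): only two features remain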

In [5]:
print (feature_names)
print (titanic_X[12],titanic_y[12])
['pclass' 'age' 'sex']
['1st' 'NA' 'female'] 1

We have missing values, a common problem with real-world datasets. In this case, we decided to substitute missing values with the mean age in the training data. We could have taken a different approach, for example, using the most common value in the training data, or the median value. When we substitute missing values, we have to understand that we are modifying the original problem, so we have to be very careful about what we are doing. This is a general rule in machine learning: when we change the data, we should have a clear idea of what we are changing, to avoid skewing the final results.
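As an illustration of one of the alternatives mentioned above (a sketch only, not part of the original code; the next cell uses the mean), substituting the median would look very similar:

# Illustrative alternative (not used below): fill missing ages with the median
ages = titanic_X[:, 1]
median_age = np.median(ages[ages != 'NA'].astype(float))
titanic_X[ages == 'NA', 1] = median_age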

In [6]:
# We have missing values for age
# Assign the mean value
ages = titanic_X[:, 1]
mean_age = np.mean(titanic_X[ages != 'NA', 1].astype(np.float))
titanic_X[titanic_X[:, 1] == 'NA', 1] = mean_age
In [7]:
print (feature_names)
print (titanic_X[12],titanic_y[12])
['pclass' 'age' 'sex']
['1st' '31.19418104265403' 'female'] 1

Our attributes (except for age) are categorical; that is, they correspond to values taken from a discrete set, such as male and female. So, we have to convert categorical data into real values. Let’s start with the sex feature. The preprocessing module of scikit-learn includes a LabelEncoder class, whose fit method converts a categorical set into integers from 0 to K-1, where K is the number of different classes in the set (in the case of sex, just 0 or 1). Note that this transformation implicitly introduces an ordering between classes.

In [8]:
# Encode sex 
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
label_encoder = enc.fit(titanic_X[:, 2])
print ("Categorical classes:", label_encoder.classes_)
integer_classes = label_encoder.transform(label_encoder.classes_)
print ("Integer classes:", integer_classes)
t = label_encoder.transform(titanic_X[:, 2])
titanic_X[:, 2] = t
Categorical classes: ['female' 'male']
Integer classes: [0 1]

The last two lines of the preceding cell transform the values of the sex attribute into 0/1 values and modify the training set accordingly.

In [9]:
print (feature_names)
print (titanic_X[12],titanic_y[12])
['pclass' 'age' 'sex']
['1st' '31.19418104265403' '0'] 1

We will now try a more general approach that does not assume an ordering and that is widely used to convert categorical classes into real-valued attributes. We will introduce an additional encoder and convert the class attribute into three new binary features, each of them indicating whether the instance has that value (1) or not (0). This is called one-hot encoding, and it is a very common way of handling categorical attributes for real-valued methods.

Now, we have to convert the class. Since we have three different classes, we cannot convert it to a single binary value (and using 0/1/2 values would imply an ordering, something we do not want). We use OneHotEncoder to get three different attributes.


In [10]:
from sklearn.preprocessing import OneHotEncoder

enc = LabelEncoder()
label_encoder = enc.fit(titanic_X[:, 0])
print ("Categorical classes:", label_encoder.classes_)
integer_classes = label_encoder.transform(label_encoder.classes_).reshape(3, 1)
print ("Integer classes:", integer_classes)
enc = OneHotEncoder()
one_hot_encoder = enc.fit(integer_classes)
# First, convert classes to 0-(N-1) integers using label_encoder
num_of_rows = titanic_X.shape[0]
t = label_encoder.transform(titanic_X[:, 0]).reshape(num_of_rows, 1)
# Second, create a sparse matrix with three columns, each one indicating if the instance belongs to the class
new_features = one_hot_encoder.transform(t)
# Add the new features to titanic_X
titanic_X = np.concatenate([titanic_X, new_features.toarray()], axis = 1)
#Eliminate converted columns
titanic_X = np.delete(titanic_X, [0], 1)
# Update feature names
feature_names = ['age', 'sex', 'first_class', 'second_class', 'third_class']
# Convert to numerical values
titanic_X = titanic_X.astype(float)
titanic_y = titanic_y.astype(float)
Categorical classes: ['1st' '2nd' '3rd']
Integer classes: [[0]
 [1]
 [2]]

The preceding code first converts the classes into integers and then uses the OneHotEncoder class to create the three new attributes, which are added to the array of features. Finally, it eliminates the original class feature from the training data.

In [13]:
print (feature_names)
print (titanic_X[12],titanic_y[12])
['age', 'sex', 'first_class', 'second_class', 'third_class']
[ 31.19418104   0.           1.           0.           0.        ] 1.0

Training a decision tree classifier

Separate training and test sets

In [27]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(titanic_X, titanic_y, test_size=0.25, random_state=33)

Decision Trees

Fit a decision tree with the data.

In [32]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3,min_samples_leaf=5)
clf = clf.fit(X_train,y_train)

DecisionTreeClassifier accepts (as do most learning methods) several hyperparameters that control its behavior. In this case, we used the Information Gain (IG) criterion (entropy) for splitting the training data, told the method to build a tree of at most three levels, and to accept a node as a leaf only if it includes at least five training instances.
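Since the main selling point of decision trees is that a human can follow the learned sequence of decisions, it can be useful to inspect the tree we just fitted. Below is a minimal sketch (not part of the original code) using scikit-learn’s export_graphviz; the output file name titanic_tree.dot is arbitrary, and Graphviz is assumed to be available for rendering:

# Export the fitted tree in Graphviz .dot format so it can be inspected or rendered
with open('titanic_tree.dot', 'w') as f:
    tree.export_graphviz(clf, out_file=f,
                         feature_names=['age', 'sex', 'first_class', 'second_class', 'third_class'])
# Render with, for example: dot -Tpng titanic_tree.dot -o titanic_tree.png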

Entropy is a measure of disorder in a set: if we have zero entropy, all values are the same (in our case, all instances have the same target class), while entropy reaches its maximum when there is an equal number of instances of each class (in our case, when half of the instances correspond to survivors and the other half to non-survivors). At each node, we have a certain number of instances (starting from the whole dataset), and we measure their entropy. The method selects the questions that yield the most homogeneous partitions (those with the lowest entropy) when we separately consider the instances for which the answer to the question is yes and those for which it is no, that is, the questions after which the entropy decreases the most.
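To make the measure concrete, the following small snippet (not part of the original code; entropy is a hypothetical helper) computes the entropy of an array of target labels:

def entropy(labels):
    # Entropy = -sum over classes of p * log2(p), where p is each class proportion
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0, 1, 0, 1])))   # 1.0: half survivors, half non-survivors (maximum disorder)
print(entropy(np.array([0, 0, 0, 1])))   # ~0.811: mostly one class, so lower entropy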

Let’s measure the accuracy of our method on the training set (we will first define a helper function to measure the performance of a classifier).
In [33]:
from sklearn import metrics
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred=clf.predict(X)
    if show_accuracy:
        print ("Accuracy:{0:.3f}".format(metrics.accuracy_score(y,y_pred)),"\n")

    if show_classification_report:
        print ("Classification report")
        print (metrics.classification_report(y,y_pred),"\n")

    if show_confusion_matrix:
        print ("Confusion matrix")
        print (metrics.confusion_matrix(y,y_pred),"\n")

measure_performance(X_train,y_train,clf, show_classification_report=False, show_confusion_matrix=False)
Accuracy:0.838

Our tree has an accuracy of 0.838 on the training set. But remember that this is not a good performance indicator, especially for decision trees, since this method is highly susceptible to overfitting. Since we did not set aside a separate validation set, we should apply cross-validation.

For this example, we will use an extreme case of cross-validation, named leave-one-out cross-validation. For each instance in the training sample, we train on the rest of the sample and evaluate the resulting model on the single instance left out. After performing as many classifications as there are training instances, we calculate the accuracy simply as the proportion of times our method correctly predicted the class of the left-out instance; it turns out to be a little lower (as expected) than the resubstitution accuracy on the training set.

Perform leave-one-out cross validation to better measure performance, reducing variance
In [34]:
from sklearn.cross_validation import cross_val_score, LeaveOneOut
from scipy.stats import sem

def loo_cv(X_train,y_train,clf):
    # Perform Leave-One-Out cross validation
    # We are performing as many classifications as there are training instances!
    loo = LeaveOneOut(X_train[:].shape[0])
    scores=np.zeros(X_train[:].shape[0])
    for train_index,test_index in loo:
        X_train_cv, X_test_cv= X_train[train_index], X_train[test_index]
        y_train_cv, y_test_cv= y_train[train_index], y_train[test_index]
        clf = clf.fit(X_train_cv,y_train_cv)
        y_pred=clf.predict(X_test_cv)
        scores[test_index]=metrics.accuracy_score(y_test_cv.astype(int), y_pred.astype(int))
    print (("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores)))
In [35]:
loo_cv(X_train, y_train,clf)
Mean score: 0.837 (+/-0.012)

The main advantage of leave-one-out cross-validation is that it allows almost as much data for training as we have available, so it is particularly well suited to cases where data is scarce. Its main problem is that training a separate classifier for each instance can be very costly in terms of computation time.
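When leave-one-out becomes too expensive, a common compromise is k-fold cross-validation, which trains only k classifiers. A minimal sketch (not part of the original code) using the cross_val_score helper:

# 10-fold cross-validation: trains 10 classifiers instead of one per training instance
from sklearn.cross_validation import cross_val_score   # sklearn.model_selection in newer versions
from scipy.stats import sem
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))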

A common criticism of decision trees is that once the training set has been divided after answering a question, it is not possible to reconsider that decision. For example, if we divide men and women, every subsequent question would involve only men or only women, and the method could not consider another type of question (say, age less than one year, irrespective of sex).

Random Forests – randomizing decisions

Random Forests try to introduce some level of randomization at each step, proposing alternative trees and combining them to get the final prediction. Algorithms of this kind, which consider several classifiers answering the same question, are called ensemble methods.

Random Forests build each decision tree on a subset of the training instances (selected randomly, with replacement), using only a small random subset of the features at each split. This tree-growing process is repeated several times, producing a set of classifiers. At prediction time, each grown tree, given an instance, predicts its target class exactly as a decision tree does. The class receiving the most votes (that is, the class predicted by the largest number of trees) is the one suggested by the ensemble classifier.

In [36]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10,random_state=33)
clf = clf.fit(X_train,y_train)
loo_cv(X_train,y_train,clf)
Mean score: 0.817 (+/-0.012)

We find that results are actually worse for Random Forests. It seems that introducing randomization was, after all, not a good idea here, because the number of features was too small. However, for bigger datasets with a larger number of features, Random Forests are a very fast, simple, and popular method for improving accuracy while retaining the virtues of decision trees.

Evaluating the performance

Let’s measure the performance of the decision tree on the test data.
In [37]:
clf_dt=tree.DecisionTreeClassifier(criterion='entropy', max_depth=3,min_samples_leaf=5)
clf_dt.fit(X_train,y_train)
measure_performance(X_test,y_test,clf_dt)
Accuracy:0.793

Classification report
             precision    recall  f1-score   support

        0.0       0.77      0.96      0.85       202
        1.0       0.88      0.54      0.67       127

avg / total       0.81      0.79      0.78       329


Confusion matrix
[[193   9]
 [ 59  68]]

From the classification results and the confusion matrix, it seems that our method tends to predict too often that the person did not survive (59 of the 127 actual survivors in the test set were classified as non-survivors).

excerpt from

9 Responses

  1. Hello,

    thanks for this neat introduction to decision trees. I have a question relative to your section on cross-validation. For each iteration, you are re-constructing the tree (via clf.fit) so that the mean accuracy you get is really the mean accuracy of the best tree you can get over each sub-sample. But these trees might be very different from each other. Personally, I would prefer to construct one tree over the training dataset and test that same tree over the validation dataset; to do this, I would not use clf.fit during cross-validation, but only clf.predict. Do you agree? Maybe my objective is different than yours, hence the different methodology?

    Francois

    1. Hi Francois,
      You are correct; it is a different methodology in how the dataset is divided. The goal of cross-validation is to define a dataset to “test” the model during the training phase (i.e., the validation dataset), in order to limit problems like overfitting. Piush

  2. Thanks for this article and tutorial. It’s really helpful to me to understand more about decision trees with scikit! 🙂

  3. Great explanation. However, I am not sure how the features with string values get converted to float. I tried to reproduce it but still fail to transform them.

    Any idea why?

    1. The features with strings are converted using LabelEncoder. It encodes labels with the value between 0 and n_classes-1. It must be used prior to one-hot encoding, as the OneHotEncoder cannot handle categorical data.

  4. In [34]: the function loo_cv(X_train, y_train, clf):

    The line with loo = LeaveOneOut(X_train[:].shape[0]) gives the error:
    __init__() takes 1 positional argument but 2 were given

    How could I solve this error?

    1. The “new” LeaveOneOut no longer seems to take an argument. You can replace it with KFold…
      > from sklearn.model_selection import KFold
      > loo = KFold(X_train[:].shape[0])

  5. Thank you very much! This is extremely helpful to me as I am learning how to normalize variables in Python and use decision trees. I have looked up the sklearn docs and Stack Overflow, but they don’t explain the usage of normalization with decision trees in detail.