In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))
from formats import load_style
load_style(css_style = 'custom2.css', plot_style = False)
In [2]:

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,matplotlib,sklearn
Ethen 2018-09-15 14:53:27 

CPython 3.6.4
IPython 6.4.0

numpy 1.14.1
pandas 0.23.0
matplotlib 2.2.2
sklearn 0.19.1

Ensemble Trees

Some of the materials builds on top of the previous documentation/implementation on decision trees, thus it might be best to walk through that one first.

Ensembling is a very popular method for improving the predictive performance of machine learning models. Let's pretend that instead of building a single model to solve a binary classification problem, you created five independent models, and each model was correct about 70% of the time. If you combined these models into an "ensemble" and used their majority vote as a prediction, how often would the ensemble be correct?

In [3]:
# generate 1000 random numbers (between 0 and 1) for each model, 
# representing 1000 observations
mod1 = np.random.rand(1000)
mod2 = np.random.rand(1000)
mod3 = np.random.rand(1000)
mod4 = np.random.rand(1000)
mod5 = np.random.rand(1000)

# each model independently predicts 1 (the "correct response") 
# if random number was at least 0.3
preds1 = np.where(mod1 > 0.3, 1, 0)
preds2 = np.where(mod2 > 0.3, 1, 0)
preds3 = np.where(mod3 > 0.3, 1, 0)
preds4 = np.where(mod4 > 0.3, 1, 0)
preds5 = np.where(mod5 > 0.3, 1, 0)

# how accurate was each individual model?
In [4]:
# average the predictions, and then round to 0 or 1
# you can also do a weighted average, as long as the weight adds up to 1 
ensemble_preds = np.round((preds1 + preds2 + preds3 + preds4 + preds5) / 5).astype(int)

# how accurate was the ensemble?


The primary weakness of decision trees is that they don't tend to have the best predictive accuracy and the result can be very unstable. This is partially due to the fact that we were using greedy algorithm to choose the rule/feature to split the tree. Hence a small variations in the data might result in a completely different tree being generated. Fortunately, this problem can be mitigated by training an ensemble of decision trees and use these trees to form a "forest".

This first idea we'll introduce is Bagging. Bagging, short for bootstrap aggregation is a general procedure for reducing the variance of a machine learning algorithm, although it can used with any type of method, it is most commonly applied to tree-based models. The way it works is:

Given a training set $X = x_1, ..., x_n$ with responses $Y = y_1, ..., y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement (a.k.a bootstrap sample) of the training set and fits trees to these newly generated samples:

For $b = 1, ..., B$:

  1. Sample, with replacement, $n$ training examples from $X$, $Y$; call these $X_b$, $Y_b$. Note that the bootstrap sample should be the same size as the original training set
  2. Train a tree, $f_b$, on $X_b$, $Y_b$. For these individual tree, we should allow them to grow deeper (increase the max_depth parameter) so that they have low bias/high variance

After training, predictions for unseen samples $x'$ can be made by averaging the predictions from all the individual regression trees on $x'$:

$$ \begin{align} f' = \frac {1}{B}\sum _{b=1}^{B}{f}_{b}(x') \end{align} $$

Or by taking the majority vote in the case of classification trees. If you are wondering why bootstrapping is a good idea, the rationale is:

We wish to ask a question of a population but we can't. Instead, we take a sample and ask the question to it instead. Now, how confident we should be that the sample answer is close to the population answer obviously depends on the structure of population. One way we might learn about this is to take samples from the population again and again, ask them the question, and see how variable the sample answers tended to be. But often times this isn't possible (we wouldn't relaunch the Titanic and crash it into another iceberg), thus we can use the information in the sample we actually have to learn about it.

This is a reasonable thing to do because not only is the sample you have the best and the only information you have about what the population actually looks like, but also because most samples will, if they're randomly chosen, look quite like the population they came from. In the end, sampling with replacement is just a convenient way to treat the sample like it's a population and to sample from it in a way that reflects its shape.

Random Forest

Random Forest is very similar to bagged trees. Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set. When building each tree, however, each time a split is considered, a random sample of $m$ features is chosen as split candidates from the full set of $p$ features. The split is only allowed to use one of those $m$ features to generate the best rule/feature to split on.

  • For classification, $m$ is typically chosen to be, $\sqrt{p}$, the square root of $p$.
  • For regression, $m$ is typically chosen to be somewhere between $p/3$ and $p$.

The whole point of choosing a new random sample of features for every single tree at every single split is to correct for decision trees' habit of overfitting to their training set. Suppose there is one very strong feature in the data set, when using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated. By randomly leaving out candidate features from each split, Random Forest "decorrelates" the trees, such that the averaging process can further reduce the variance of the resulting model.


Here, we will use the Wine Quality Data Set to test our implementation. This link should download the .csv file. The task is to predict the quality of the wine (a scale of 1 ~ 10) given some of its features. We'll build three types of regression model, decision tree, bagged decision tree and random forest on the training set and compare the result on the test set.

In [5]:
wine = pd.read_csv('winequality-white.csv', sep = ';')

# train/test split the features and response column
y = wine['quality'].values
X = wine.drop('quality', axis = 1).values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 1234)

print('dimension of the dataset: ', wine.shape)
dimension of the dataset:  (4898, 12)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
In [6]:
# this cell simply demonstrates how to create boostrap samples
# we create an array of numbers from 1 to 20
# create the boostrap sample on top of that
nums = np.arange(1, 21)
print('original:', nums)
print('bootstrap: ', np.random.choice(nums, size = 20, replace = True))
original: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
bootstrap:  [ 6 12 13  9 10 12  6 16  1 17  2 13  8 14  7 19  6 19 12 11]
In [7]:
class RandomForest:
    Regression random forest using scikit learn's 
    decision tree as the base tree
    n_estimators: int 
        the number of trees that you're going built
        on the bagged sample (you can even shutoff
        the bagging procedure for some packages)
    max_features: int 
        the number of features that you allow
        when deciding which feature to split on 
    all the other parameters for a decision tree like
    max_depth or min_sample_split also applies to Random Forest, 
    it is just not used here as that is more
    related to a single decision tree

    def __init__(self, n_estimators, max_features):
        self.n_estimators = n_estimators
        self.max_features = max_features
    def fit(self, X, y):
        # for each base-tree models:
        # 1. draw bootstrap samples from the original data
        # 2. train the tree model on that bootstrap sample, and
        #    during training, randomly select a number of features to 
        #    split on each node
        self.estimators = []
        for i in range(self.n_estimators):
            boot = np.random.choice(y.shape[0], size = y.shape[0], replace = True)
            X_boot, y_boot = X[boot], y[boot]
            tree = DecisionTreeRegressor(max_features = self.max_features)
  , y_boot)
        return self

    def predict(self, X):
        # for the prediction, we average the
        # predictions made by each of the bagged tree
        pred = np.empty((X.shape[0], self.n_estimators))
        for i, tree in enumerate(self.estimators):
            pred[:, i] = tree.predict(X)
        pred = np.mean(pred, axis = 1)
        return pred
In [8]:
# compare the results between a single decision tree,
# bagging and random forest, the lower the mean square
# error, the better
tree = DecisionTreeRegressor(), y_train)
tree_y_pred = tree.predict(X_test)
print('tree: ', mean_squared_error(y_test, tree_y_pred))

# bagged decision tree
# max_feature = None simply uses all features
bag = RandomForest(n_estimators = 50, max_features = None), y_train)
bag_y_pred = bag.predict(X_test)
print('bagged tree: ', mean_squared_error(y_test, bag_y_pred))

# random forest using a random one third of the features at every split
rf = RandomForest(n_estimators = 50, max_features = 1 / 3), y_train)
rf_y_pred = rf.predict(X_test)
print('random forest: ',  mean_squared_error(y_test, rf_y_pred))

# use library to confirm results are comparable
rf_reg = RandomForestRegressor(n_estimators = 50, max_features = 1 / 3), y_train)
rf_reg_y_pred = rf_reg.predict(X_test)
print('random forest library: ', mean_squared_error(y_test, rf_reg_y_pred))
tree:  0.7908163265306123
bagged tree:  0.40405469387755105
random forest:  0.39058653061224485
random forest library:  0.3984877551020408

Feature Importance

When using Bagging with decision tree or using Random Forest, we can increase the predictive accuracy of individual tree. These methods, however, do decrease model interpretability, because it is no longer possible to visualize all the trees that are built to form the "forest". Fortunately, we can still obtain an overall summary of feature importance from these models. The way feature importance works is as follows (there are many ways to do it, this is the implementation that scikit-learn uses):

We first compute the feature importance values of a single tree:

  • We can initialize an array feature_importances of all zeros with size n_features
  • We start building the tree and for each internal node that splits on feature $i$ we compute the information gain (error reduction) of that node multiplied by the proportion of samples that were routed to the node and add this quantity to feature_importances[i]

The information gain (error reduction) depends on the impurity criterion that you use (e.g. Gini, Entropy for classification, MSE for regression). Its the impurity of the set of examples that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.

Now, recall that these Ensemble Tree models simply consists of a bunch of individual trees, hence after computing the feature_importance values across all individual trees, we sum them up and take the average across all of them (normalize the values to sum up to 1 if necessary).

Building on top of the previous documentation/implementation on decision trees, we add the code to compute the feature importance. The code is not shown here, but can be obtained here for those that are interested in the implementation.

In [13]:
from tree import Tree

# load a sample dataset
iris = load_iris()
iris_X =
iris_y =

# train model and print the feature importance
tree = Tree(), iris_y)

# use library to confirm result
# note that the result might not always be the same
# because of decision tree's high variability
clf = DecisionTreeClassifier(criterion = 'entropy', min_samples_split = 10, max_depth = 3), iris_y)
[0.         0.         0.07060267 0.92939733]
[0.         0.         0.07060267 0.92939733]

For ensemble tree, we simply sum all the feauture importance up and take the average (normalize it to sum up to 1 if necessary). Thus, we will not go through the process of building that from scratch, we'll simply visualize the feature importance of the regression Random Forest that we've previously trained on the wine dataset.

In [14]:
def vis_importance(estimator, feature_names, threshold = 0.05):
    Visualize the relative importance of predictors.

    estimator : sklearn-like ensemble tree model
        A tree estimator that contains the attribute

    feature_names : str 1d array or list[str]
        Feature names that corresponds to the
        feature importance.

    threshold : float, default 0.05
        Features that have importance scores lower than this
        threshold will not be presented in the plot, this assumes
        the feature importance sum up to 1.
    if not hasattr(estimator, 'feature_importances_'):
        msg = '{} does not have the feature_importances_ attribute'
        raise ValueError(msg.format(estimator.__class__.__name__))

    imp = estimator.feature_importances_
    feature_names = np.asarray(feature_names)
    mask = imp > threshold
    importances = imp[mask]
    idx = np.argsort(importances)
    scores = importances[idx]
    names = feature_names[mask]
    names = names[idx]

    y_pos = np.arange(1, len(scores) + 1)
    if hasattr(estimator, 'estimators_'):
        # apart from the mean feature importance, for scikit-learn we can access
        # each individual tree's feature importance and compute the standard deviation
        tree_importances = np.asarray([tree.feature_importances_
                                       for tree in estimator.estimators_])
        importances_std = np.std(tree_importances[:, mask], axis = 0)
        scores_std = importances_std[idx]
        plt.barh(y_pos, scores, align = 'center', xerr = scores_std)
        plt.barh(y_pos, scores, align = 'center')

    plt.yticks(y_pos, names)
    plt.title('Feature Importance Plot')
In [15]:
# change default figure and font size
plt.rcParams['figure.figsize'] = 8, 6 
plt.rcParams['font.size'] = 12

# visualize the feature importance of every variable
vis_importance(rf_reg, wine.columns[:-1])


One thing to keep in mind when using the impurity based feature importance ranking is that when the dataset has two (or more) correlated features, then from the model's point of view, any of these correlated features can be used as the predictor, with no preference of one over the others. But once one of them is used, the importance of others is significantly reduced since the impurity they can effectively remove has already been removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when we're interpreting the data, it can lead to incorrect conclusions that one of the variables is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the response variable.

The effect of this phenomenon for Random Forest is somewhat reduced thanks to random selection of features at each node creation, but in general the effect is not removed completely. In the following example, we have three correlated variables $X_0$, $X_1$, $X_2$, and no noise in the data, with the output variable being the sum of the three features:

In [16]:
size = 10000
X_seed = np.random.normal(0, 1, size)
X0 = X_seed + np.random.normal(0, 0.1, size)
X1 = X_seed + np.random.normal(0, 0.1, size)
X2 = X_seed + np.random.normal(0, 0.1, size)
X_012 = np.array([ X0, X1, X2 ]).T
Y = X0 + X1 + X2

rf = RandomForestRegressor(n_estimators = 20, max_features = 2), Y)
print('Scores for X0, X1, X2:', np.round(rf.feature_importances_, 3))
Scores for X0, X1, X2: [0.272 0.548 0.179]

When we compute the feature importances, we see that some of the features have higher importance than the others, while their “true” importance should be very similar. One thing to point out though is that the difficulty of interpreting the importance/ranking of correlated variables is not Random Forest specific, but applies to most model based feature selection methods. This is why it often best practice to remove correlated features prior to training the model.

Advantages of Random Forests:

  • Require very little feature engineering (e.g. standardization)
  • Easy to use, as it rarely requires parameter tuning to achieve compelling and robust performance
  • Provides a more reliable estimate of feature importance compare to other black-box methods (e.g. deep learning, support vector machine)
  • Performance and computation wise, it is very competitive. Although you can typically find a model that beats Random Forest for any given dataset (typically a deep learning or gradient boosting algorithm), it’s never by much, and it usually takes much longer to train and tune those model

Extra Trees

What distinguishes Extra Trees from Random Forest is:

  • We use the entire training set instead of a bootstrap sample of the training set (but can also be trained on a bootstrapped sample as well if we wish)
  • Just like Random Forest, when choosing rules/features at a split, a random subset of candidate features is used, but now, instead of looking at all the thresholds to find the best the best split, thresholds (for the split) are chosen completely at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. We all know that tree-based methods employ a greedy algorithm when choosing the feature to split on. Thus, we can think of this as taking an extra step in trying to migitate this drawback

Based on Stackoverflow: RandomForestClassifier vs ExtraTreesClassifier in scikit learn In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometime generalize better than RFs but it's hard to guess when it's the case without trying both first (and tuning n_estimators, max_features and min_samples_split by cross-validated grid search).

In [17]:
# grid search on a range of max features and compare 
# the performance between Extra Trees and Random Forest
param_name = 'max_features'
max_features_options = np.arange(4, 10)
fit_param = {param_name: max_features_options}

rf_reg = RandomForestRegressor(n_estimators = 30)
et_reg = ExtraTreesRegressor(n_estimators = 30)
gs_rf = GridSearchCV(rf_reg, fit_param, n_jobs = -1)
gs_et = GridSearchCV(et_reg, fit_param, n_jobs = -1), y_train), y_train)

# visualize the performance on the cross validation test score
gs_rf_mean_score = gs_rf.cv_results_['mean_test_score']
gs_et_mean_score = gs_et.cv_results_['mean_test_score']
mean_scores = [gs_rf_mean_score, gs_et_mean_score]
labels = ['RF', 'ET']

for score, label in zip(mean_scores, labels):
    plt.plot(max_features_options, score, label = label)

plt.xlim( np.min(max_features_options), np.max(max_features_options) )

It is not always the case that Random Forest will outperform Extra Trees making it a method that's worth trying. As for interpretation, Extra Trees is simply another kind of Ensemble Tree method, hence we can still access the feature_importance_ attribute to see which predictors were contributing a lot to explaining the response.