# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(plot_style=False)
os.chdir(path)
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.metrics import precision_recall_curve, average_precision_score
%watermark -a 'Ethen' -d -t -v -p numpy,pandas,matplotlib,sklearn
For this documentation, we'll be working with a human resource dataset. Our goal is to identify the employees that are likely to leave in the future and act upon our findings, i.e. retain them before they choose to leave. This dataset contains 12000 observations and 7 variables, each representing:
- S : the satisfaction level, on a scale of 0 to 1
- LPE : last project evaluation by a client, on a scale of 0 to 1
- NP : number of projects the employee worked on in the last 12 months
- ANH : average number of hours worked in the last 12 months for that employee
- TIC : amount of time the employee has spent in the company, measured in years
- Newborn : takes the value 1 if the employee had a newborn within the last 12 months and 0 otherwise
- left : 1 if the employee left the company, 0 if they're still working here. This is our response variable
filename = 'HR.csv'
data = pd.read_csv(filename)
print('dimensions: ', data.shape)
data.head()
To train and evaluate the model, we'll perform a simple train/test split. 80 percent of the dataset will be used to actually train the model, while the rest will be used to evaluate the model, i.e. estimate the out-of-sample error. Note that the best practice is a three-way train/validation/test split.
label_col = 'left'
label = data[label_col].values
data = data.drop(label_col, axis = 1)
print('labels distribution:', np.bincount(label) / label.size)
test_size = 0.2
random_state = 1234
data_train, data_test, y_train, y_test = train_test_split(
data, label, test_size = test_size, random_state = random_state, stratify = label)
This label distribution tells us that around 16 percent of the employees that joined the company have left! If those employees are all the ones that are performing well in the company, then this is probably not a good sign. We'll leave the rest of the exploratory analysis to you ...
We then perform some generic data preprocessing, including standardizing the numeric columns and one-hot encoding the categorical columns (the "Newborn" variable is treated as a categorical variable), and convert everything into a numpy array that sklearn expects. This generic preprocessing step is written as a custom sklearn Transformer. You don't have to follow this structure if you prefer your own way of doing it.
To roll out our own Transformer that adheres to the sklearn API, we need to:
- Make the __init__ method explicit: i.e. *args or **kwargs should be avoided, as they will not be correctly handled within cross-validation routines.
- Subclass BaseEstimator to get some free stuff. It gives us class representations that are more informative when printing the class object, and it provides the get_params and set_params methods. These functionalities are used in sklearn's methods such as GridSearch and RandomSearch.
- Subclass TransformerMixin. For a transformer, we need to implement a .fit method that fits some stuff on the training data and a .transform method that can perform the transformation on both the training and test data. Note that we don't need to subclass TransformerMixin for this to work, but it does give the end-user the idea that this is a Transformer, and it gives us the .fit_transform method, which does the fitting and transforming on the training data in one shot, for free.
- Store every attribute learned during the fit method with a trailing underscore (e.g. self.colnames_). This is a convention used in sklearn so that we can quickly scan the members of an estimator and distinguish which members are fitted during training time.
from collections import defaultdict
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
class Preprocess(BaseEstimator, TransformerMixin):
"""
Generic data preprocessing including:
- standardize numeric columns
- one-hot encode categorical columns
Parameters
----------
num_cols : list
Numeric column's name
cat_cols : list
Categorical column's name
Attributes
----------
colnames_ : list
Column name of the transformed numpy array
label_encode_dict_ : dict of sklearn's LabelEncoder
LabelEncoder that was used to encode the value
of the categorical columns into with value between
0 and n_classes-1. Categorical columns will go through
this encoding process before being one-hot encoded
cat_encode_ : sklearn's OneHotEncoder
OneHotEncoder that was used to one-hot encode the
categorical columns
scaler_ : sklearn's StandardScaler
Scaler that was used to standardize the numeric columns
"""
def __init__(self, num_cols = None, cat_cols = None):
self.num_cols = num_cols
self.cat_cols = cat_cols
def fit(self, data):
"""
Fit the Preprocess Transformer
Parameters
----------
data : DataFrame
"""
data = data.copy()
# Label encoding across multiple columns in scikit-learn
# https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
if self.cat_cols is not None:
self.label_encode_dict_ = defaultdict(LabelEncoder)
label_encoded = (data[self.cat_cols].
apply(lambda x: self.label_encode_dict_[x.name].fit_transform(x)))
self.cat_encode_ = OneHotEncoder(sparse = False)
self.cat_encode_.fit(label_encoded)
if self.num_cols is not None:
self.scaler_ = StandardScaler().fit(data[self.num_cols])
        # store the column names (numeric columns come before the
        # categorical columns) so we can refer to them later
if self.num_cols is not None:
colnames = self.num_cols.copy()
else:
colnames = []
if self.cat_cols is not None:
for col in self.cat_cols:
cat_colnames = [col + '_' + str(classes)
for classes in self.label_encode_dict_[col].classes_]
colnames += cat_colnames
self.colnames_ = colnames
return self
def transform(self, data):
"""
        Transform the data using the fitted Preprocess Transformer
Parameters
----------
data : DataFrame
"""
if self.cat_cols is not None:
label_encoded = (data[self.cat_cols].
apply(lambda x: self.label_encode_dict_[x.name].transform(x)))
cat_encoded = self.cat_encode_.transform(label_encoded)
if self.num_cols is not None:
scaled = self.scaler_.transform(data[self.num_cols])
        # combine the one-hot encoded categorical columns and the scaled
        # numeric columns, i.e. concatenate them along axis 1
if self.cat_cols is not None and self.num_cols is not None:
X = np.hstack((scaled, cat_encoded))
elif self.num_cols is None:
X = cat_encoded
else:
X = scaled
return X
num_cols = ['S', 'LPE', 'NP', 'ANH', 'TIC']
cat_cols = ['Newborn']
preprocess = Preprocess(num_cols, cat_cols)
X_train = preprocess.fit_transform(data_train)
X_test = preprocess.transform(data_test)
print('colnames', preprocess.colnames_)
X_train
# pick your favorite classification model
tree = RandomForestClassifier(max_depth = 4)
tree.fit(X_train, y_train)
After training our model, we need to evaluate whether it's any good or not, and the most straightforward and intuitive metric for a supervised classifier's performance is accuracy. Unfortunately, there are circumstances where simple accuracy does not work well. For example, with a disease that only affects 1 in a million people, a completely bogus screening test that always reports "negative" will be 99.9999% accurate. Unlike accuracy, ROC curves are less sensitive to class imbalance; the bogus screening test would have an AUC of 0.5, which is like not having a test at all.
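To make this concrete, here is a minimal sketch of that claim (using a made-up 1-in-10,000 prevalence instead of 1-in-a-million, simply to keep the arrays small): a constant "always negative" score earns near-perfect accuracy but an AUC of exactly 0.5.
from sklearn.metrics import accuracy_score

# hypothetical screening data: 1 positive case out of 10,000
bogus_y_true = np.zeros(10000, dtype = int)
bogus_y_true[0] = 1

# a bogus test that always reports "negative", i.e. a constant score of 0
bogus_y_score = np.zeros(10000)

print('accuracy:', accuracy_score(bogus_y_true, (bogus_y_score > 0.5).astype(int)))
print('roc auc :', roc_auc_score(bogus_y_true, bogus_y_score))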
ROC curve (Receiver Operating Characteristic) is a commonly used way to visualize the performance of a binary classifier, and AUC (Area Under the ROC Curve) is used to summarize its performance in a single number. Most machine learning algorithms have the ability to produce probability scores that tell us how strongly they think a given observation is positive. Turning these probability scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and vice versa. Different threshold values lead to different results.
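As a small illustration of that thresholding step (a sketch with made-up scores, not output from the model above), notice how the same probability scores turn into different yes/no predictions as the cutoff moves:
# made-up probability scores from some hypothetical classifier
example_scores = np.array([0.15, 0.45, 0.51, 0.72, 0.93])

for threshold in (0.3, 0.5, 0.7):
    predictions = (example_scores >= threshold).astype(int)
    print('threshold {:.1f} -> predictions {}'.format(threshold, predictions))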
A quick refresher on terminology:
\begin{align} [\text{true positive rate}] &= \frac{[\text{# positive data points with positive predictions}]}{[\text{# all positive data points}]} \\ &= \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false negatives}]} \end{align}
True positive rate is also known as recall or sensitivity.
\begin{align} [\text{false positive rate}] &= \frac{[\text{# negative data points with positive predictions}]}{[\text{# all negative data points}]} \\ &= \frac{[\text{# false positives}]}{[\text{# false positives}] + [\text{# true negatives}]} \end{align}
The ROC curve is created by plotting the true positive rate (when it's actually a yes, how often does it predict yes?) on the y axis against the false positive rate (when it's actually a no, how often does it predict yes?) on the x axis at various cutoff settings, giving us a picture of the whole spectrum of the trade-off we're making between the two measures.
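As a quick worked example of these two rates (using a small set of made-up labels and hard predictions at some fixed threshold, not the HR model above), we can count the confusion-matrix cells directly:
# made-up labels and hard predictions at a fixed threshold
example_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
example_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = np.sum((example_true == 1) & (example_pred == 1))  # 3
fn = np.sum((example_true == 1) & (example_pred == 0))  # 1
fp = np.sum((example_true == 0) & (example_pred == 1))  # 1
tn = np.sum((example_true == 0) & (example_pred == 0))  # 3

print('true positive rate :', tp / (tp + fn))  # 3 / 4 = 0.75
print('false positive rate:', fp / (fp + tn))  # 1 / 4 = 0.25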
If all this true/false positive terminology is confusing to you, consider reading the material at the following link. Blog: Simple guide to confusion matrix terminology
There are packages to plot ROC curves and to compute metrics from them, but it can still be worthwhile to work through how these curves are calculated from scratch, to better understand what exactly they are showing us.
def _binary_clf_curve(y_true, y_score):
"""
Calculate true and false positives per binary classification
threshold (can be used for roc curve or precision/recall curve);
    the calculation makes the assumption that the positive case
will always be labeled as 1
Parameters
----------
y_true : 1d ndarray, shape = [n_samples]
True targets/labels of binary classification
y_score : 1d ndarray, shape = [n_samples]
Estimated probabilities or scores
Returns
-------
tps : 1d ndarray
True positives counts, index i records the number
of positive samples that got assigned a
score >= thresholds[i].
The total number of positive samples is equal to
tps[-1] (thus false negatives are given by tps[-1] - tps)
fps : 1d ndarray
False positives counts, index i records the number
of negative samples that got assigned a
score >= thresholds[i].
The total number of negative samples is equal to
fps[-1] (thus true negatives are given by fps[-1] - fps)
thresholds : 1d ndarray
Predicted score sorted in decreasing order
References
----------
Github: scikit-learn _binary_clf_curve
- https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/metrics/ranking.py#L263
"""
# sort predicted scores in descending order
# and also reorder corresponding truth values
desc_score_indices = np.argsort(y_score)[::-1]
y_score = y_score[desc_score_indices]
y_true = y_true[desc_score_indices]
# y_score typically consists of tied values. Here we extract
# the indices associated with the distinct values. We also
# concatenate a value for the end of the curve
distinct_indices = np.where(np.diff(y_score))[0]
end = np.array([y_true.size - 1])
threshold_indices = np.hstack((distinct_indices, end))
thresholds = y_score[threshold_indices]
tps = np.cumsum(y_true)[threshold_indices]
    # (1 + threshold_indices) = the total number of samples with a
    # score >= each threshold, thus subtracting the true positive
    # count gives us the false positive count
fps = (1 + threshold_indices) - tps
return tps, fps, thresholds
# we'll work with some toy data so it's easier to
# show and confirm the calculated result
y_true = np.array([1, 0, 1, 0, 1])
y_score = np.array([0.45, 0.4, 0.35, 0.35, 0.8])
tps, fps, thresholds = _binary_clf_curve(y_true, y_score)
print('thresholds:', thresholds)
print('true positive count:', tps)
print('false positive count:', fps)
From the result above, we can see that the function computes the true/false positive count for every unique threshold in the predicted score y_score. We can validate the result by hand to confirm that the calculation is in fact correct.
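One way to do that check is a brute-force re-count over the reported thresholds (a sketch that reuses the toy y_true, y_score and the outputs from the cell above): for every threshold, count how many positive and negative samples scored at or above it.
for threshold, tp, fp in zip(thresholds, tps, fps):
    above = y_score >= threshold
    tp_check = np.sum(above & (y_true == 1))
    fp_check = np.sum(above & (y_true == 0))
    print('threshold {:.2f}: tp = {} (expected {}), fp = {} (expected {})'.format(
        threshold, tp_check, tp, fp_check, fp))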
Recall that the ROC curve plots the true positive rate on the y-axis and the false positive rate on the x-axis. Thus all we need to do is convert the counts into rates and we have our ROC curve.
# convert count to rate, prepend 0 to
# both the true positive and false positive rates
# so the visualization will start from the origin (0, 0)
tpr = np.hstack((0, tps / tps[-1]))
fpr = np.hstack((0, fps / fps[-1]))
print('true positive rate:', tpr)
print('false positive rate:', fpr)
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12
fig = plt.figure()
plt.plot(fpr, tpr, marker = 'o', lw = 1)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('Receiver Operator Characteristic')
plt.show()
Now, to calculate the AUC (Area Under the Curve) for the ROC curve, we need to sum up the rectangular and triangular areas under the curve, as depicted by the visualization below:
import matplotlib.patches as patches
fig, ax = plt.subplots(1, 2, figsize = (10, 6))
fig.suptitle('Receiver Operator Characteristic', y = 1.02)
# this part is hard-coded for illustration purpose
fpr_diff = fpr[3] - fpr[0]
tpr_diff = tpr[3] - tpr[0]
rect1 = patches.Rectangle(xy = (0, 0), width = fpr_diff,
height = tpr_diff, alpha = 0.3)
ax[0].add_patch(rect1)
fpr_diff = fpr[-1] - fpr[-2]
tpr_diff = tpr[-1] - tpr[-2]
rect2 = patches.Rectangle(xy = (fpr[-2], tpr[-2]), width = fpr_diff,
height = tpr_diff, alpha = 0.3)
ax[1].add_patch(rect2)
for i in range(len(ax)):
ax[i].plot(fpr, tpr, marker = 'o', lw = 1)
ax[i].set_xlabel('false positive rate')
ax[i].set_ylabel('true positive rate')
plt.tight_layout()
plt.show()
def _roc_auc_score(y_true, y_score):
"""
Compute Area Under the Curve (AUC) from prediction scores
Parameters
----------
y_true : 1d ndarray, shape = [n_samples]
True targets/labels of binary classification
y_score : 1d ndarray, shape = [n_samples]
Estimated probabilities or scores
Returns
-------
auc : float
"""
# ensure the target is binary
if np.unique(y_true).size != 2:
        raise ValueError('Only two classes should be present in y_true. ROC AUC score '
                         'is not defined in that case.')
tps, fps, _ = _binary_clf_curve(y_true, y_score)
# convert count to rate
tpr = tps / tps[-1]
fpr = fps / fps[-1]
# compute AUC using the trapezoidal rule;
# appending an extra 0 is just to ensure the length matches
zero = np.array([0])
tpr_diff = np.hstack((np.diff(tpr), zero))
fpr_diff = np.hstack((np.diff(fpr), zero))
auc = np.dot(tpr, fpr_diff) + np.dot(tpr_diff, fpr_diff) / 2
return auc
auc_score = _roc_auc_score(y_true, y_score)
print('auc score:', auc_score)
# confirm with scikit-learn's result
auc_score = roc_auc_score(y_true, y_score)
print('package auc score:', auc_score)
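As one more sanity check, numpy's trapezoidal-rule integrator should agree with both results (this sketch reuses the tpr and fpr arrays with the prepended zeros computed a few cells above):
# np.trapz(y, x) integrates y over x with the trapezoidal rule,
# which is exactly what _roc_auc_score does by hand
print('np.trapz auc:', np.trapz(tpr, fpr))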
After working through the implementation of the ROC curve and AUC score from scratch, we now pull back and visualize them for our actual model:
# call roc_curve, extracting the probability of
# the positive class from the predicted probabilities
tree_test_pred = tree.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, tree_test_pred, pos_label = 1)
# AUC score that summarizes the ROC curve
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw = 2, label = 'ROC AUC: {:.2f}'.format(roc_auc))
plt.plot([0, 1], [0, 1],
linestyle = '--',
color = (0.6, 0.6, 0.6),
label = 'random guessing')
plt.plot([0, 0, 1], [0, 1, 1],
linestyle = ':',
color = 'black',
label = 'perfect performance')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('Receiver Operator Characteristic')
plt.legend(loc = "lower right")
plt.tight_layout()
plt.show()
The goal of visualizing the ROC curve is to let us know how well our classifier can be expected to perform in general, across a variety of different baseline probabilities (percentages of the majority class).
The diagonal line depicts a completely random classifier and ideally our model's ROC curve should be toward the top-left corner and stay as far away from the diagonal line as possible.
Side note: Apart from comparing the model's ROC curve against the ROC curve of a classifier that does random guessing, it's also useful to plot the ROC curve of different classifiers to compare performance against each other.
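A sketch of that kind of side-by-side comparison (assuming a LogisticRegression as the second model purely for illustration; any other classifier would slot in the same way) might look like:
from sklearn.linear_model import LogisticRegression

# train a second model on the same preprocessed data for comparison
logistic = LogisticRegression()
logistic.fit(X_train, y_train)
logistic_test_pred = logistic.predict_proba(X_test)[:, 1]

for name, pred in [('random forest', tree_test_pred),
                   ('logistic regression', logistic_test_pred)]:
    model_fpr, model_tpr, _ = roc_curve(y_test, pred, pos_label = 1)
    model_auc = auc(model_fpr, model_tpr)
    plt.plot(model_fpr, model_tpr, lw = 2,
             label = '{} (AUC: {:.2f})'.format(name, model_auc))

plt.plot([0, 1], [0, 1], linestyle = '--', color = (0.6, 0.6, 0.6), label = 'random guessing')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('ROC Curve Comparison')
plt.legend(loc = 'lower right')
plt.tight_layout()
plt.show()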
The probabilistic interpretation of the AUC metric is that if we randomly choose a positive case and a negative case, AUC is the probability that the positive case outranks the negative case according to the classifier's prediction. Hopefully, this is evident from the ROC curve figure, where the plot enumerates all possible combinations of positive and negative cases, and the fraction under the curve is the portion where the positive case outranks the negative one. I personally find this interpretation extremely useful when conveying what AUC is measuring to a non-technical audience.
def auc_probability(y_true, y_score, size = 100000):
"""probabilistic interpretation of AUC"""
    labels = y_true.astype(bool)
pos = np.random.choice(y_score[labels], size = size, replace = True)
neg = np.random.choice(y_score[~labels], size = size, replace = True)
auc = np.sum(pos > neg) + np.sum(pos == neg) / 2
auc /= size
return auc
# we want this to be close to the score
# returned by roc_auc_score
auc_probability(y_true = y_test, y_score = tree_test_pred)
Apart from the ROC curve, there is also the precision-recall curve. Instead of plotting the true positive rate (a.k.a. recall) versus the false positive rate, we now plot precision versus recall.
\begin{align} [\text{precision}] &= \frac{[\text{# positive data points with positive predictions}]}{[\text{# all data points with positive predictions}]} \\ &= \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false positives}]} \end{align}
tree_test_pred = tree.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(
y_test, tree_test_pred, pos_label = 1)
# AUC score that summarizes the precision recall curve
avg_precision = average_precision_score(y_test, tree_test_pred)
label = 'Precision Recall AUC: {:.2f}'.format(avg_precision)
plt.plot(recall, precision, lw = 2, label = label)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Curve')
plt.legend()
plt.tight_layout()
plt.show()
A classifier with high recall but low precision flags many positive results, but most of its predictions are incorrect when compared to the actual labels. On the other hand, a classifier with high precision but low recall is just the opposite, returning very few positive results, but most of its predictions are correct when compared to the actual labels. An ideal system with high precision and high recall will return many results, with all results labeled correctly.
The precision-recall curve answers a fundamentally different question compared to the ROC curve. By definition, precision directly answers the question: "What is the probability that this is a real hit, given that my classifier says it is?" Thus it is useful in practice for needle-in-a-haystack type problems, or problems where the "positive" class is more interesting than the negative class.
You can also think about it in the following way. ROC AUC looks at TPR and FPR, i.e. the entire confusion matrix for all thresholds. On the other hand, Precision-Recall AUC looks at Precision and Recall (TPR); it doesn't look at the True Negative Rate (TNR). Because of that, PR AUC can be a better choice when you care only about the "positive" class, while ROC AUC cares about both the "positive" and the "negative" class. Since PR AUC doesn't use TNR directly, it can also be better for highly imbalanced problems. You may want to take a look at this Blog: F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?.
Although the ROC curve is presumably the more popular choice when evaluating binary classifiers, it is highly recommended to use the precision-recall curve as a supplement to get a full picture when evaluating and comparing classifiers. For more discussion on this topic, consider taking a look at another notebook. Notebook: Evaluating Imbalanced Datasets
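To see why the two summaries can diverge on imbalanced data, here is a small sketch on synthetic data (assuming scikit-learn's make_classification with a made-up roughly 1:99 class ratio and a plain logistic regression); the ROC AUC will typically look considerably more flattering than the average precision on a problem like this:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic, highly imbalanced binary problem (~1% positives)
X_imb, y_imb = make_classification(n_samples = 20000, n_features = 20,
                                   weights = [0.99, 0.01], random_state = 1234)
X_imb_train, X_imb_test, y_imb_train, y_imb_test = train_test_split(
    X_imb, y_imb, test_size = 0.2, random_state = 1234, stratify = y_imb)

imb_model = LogisticRegression(max_iter = 1000).fit(X_imb_train, y_imb_train)
imb_pred = imb_model.predict_proba(X_imb_test)[:, 1]

print('roc auc          :', roc_auc_score(y_imb_test, imb_pred))
print('average precision:', average_precision_score(y_imb_test, imb_pred))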
Many real-world binary classification applications need a threshold to convert the model score into a business decision, i.e. all cases with a model score above the threshold get some sort of special treatment.
Thresholding is popular because of its simplicity and ease of implementation: We translate a continuous score to a binary yes/no decision, and act on it in a predetermined way. The biggest question for the thresholding pattern is: Where should I set the threshold point?
Up until this point, we've been using AUC to give us a single-number summary of classifier performance. This might be suitable in some circumstances, but for binary classifiers, evaluation metrics that take into account the actual costs of false positive and false negative errors may be much more appropriate than AUC. If we know these costs, we can use them not only to tie the evaluation metric more directly to the business value, but also to choose an appropriate final cutoff threshold for the classifier.
In real world applications, the costs that come along with making these two mistakes (false positive and false negative) are usually very different. Take our case for example: a false negative (FN) means an employee left our company but our model failed to detect that, while a false positive (FP) means an employee is still currently working at our company and our model told us that they will be leaving. The former mistake would be a tragedy, since, well, the employee left and we didn't do anything about it! The latter mistake might waste 20 minutes of an HR manager's time arranging a face-to-face interview with an employee, asking how the company can do better to retain him/her, while he/she is perfectly fine with the current situation.
In the code cell below, we assign a cost of 100 to a false positive (FP) and 1000 to a false negative (FN). Given the cost associated with each of the two mistakes, we multiply them by the false positive and false negative rates at each threshold to figure out the best cutoff value. Note that the cost associated with each mistake can just be a back-of-the-envelope number, as long as we're sure about which one is more "expensive" than the other.
fpr, tpr, thresholds = roc_curve(y_test, tree_test_pred, pos_label = 1)
fnr = tpr[-1] - tpr
# fn = false negative, meaning the employee left and we
# didn't do anything about it, in our case this will be
# substantially more expensive than a false positive
fp_cost = 100
fn_cost = 1000
costs = (fpr * fp_cost + fnr * fn_cost) * y_test.size
best = np.argmin(costs)
best_cost = np.round(costs[best])
best_threshold = np.round(thresholds[best], 2)
plt.plot(thresholds, costs, label = 'cost ($)')
label = 'best cost: {} at threshold {}'.format(best_cost, best_threshold)
plt.axvline(best_threshold, linestyle = '--', lw = 2, color = 'black', label = label)
plt.xlabel('threshold')
plt.ylabel('$')
plt.title('Cost as a Function of Threshold')
plt.legend()
plt.tight_layout()
plt.show()
Just to drive the notion home: when executing on a project, if we are able to compute the expected cost of each mistake, consider optimizing that directly instead of AUC or other general-purpose metrics.
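For instance, a sketch of computing that cost directly from the confusion matrix at the chosen cutoff (reusing best_threshold, fp_cost and fn_cost from the cells above) could look like this; it uses exact error counts, so the number will differ slightly from the rate-times-total-size approximation used earlier:
from sklearn.metrics import confusion_matrix

# hard predictions at the business-chosen cutoff
final_pred = (tree_test_pred >= best_threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, final_pred).ravel()

# total dollar cost of the mistakes we actually make at this threshold
total_cost = fp * fp_cost + fn * fn_cost
print('false positives: {}, false negatives: {}'.format(fp, fn))
print('total cost at threshold {}: ${}'.format(best_threshold, total_cost))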