In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
In [2]:

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,matplotlib,lightgbm
Ethen 2019-01-27 09:15:00 

CPython 3.6.4
IPython 6.4.0

numpy 1.14.2
pandas 0.23.4
sklearn 0.20.2
matplotlib 3.0.2
lightgbm 2.1.0

Quantile Regression

When working with real-world regression model, often times knowing the uncertainty behind each point estimation can make our predictions more actionable in a business settings. One method of going from a single point estimation to a range estimation or so called prediction interval is known as Quantile Regression.

For example, consider historical sales of an item under a certain circumstance are (10000, 10, 50, 100). Standard least squares method would gives us an estimate of 2540. If we were to restock based on that prediction, we would likely going to significantly overstock 75% of the time. But if we estimate the quantiles of the data distribution, the estimated 5th, 50th, and 95th percentiles are 16, 75, 8515, which are much more informative than the 2540 single estimation. After producing these range estimation we can also leverage some business context that might be hard to incorporate into our data analysis and use that to determine the final business decision.

Objective Function

As we might recall, for linear regression or so called ordinary least squares (OLS), we assume the relationship between our input variable $X$ and our output label $Y$ can be modeled by a linear function.

\begin{align} Y = \theta_0 + \theta_1X_1 + \theta_2X_2 + \ldots + \theta_pX_p + \epsilon \end{align}

And the most common objective function is squared error.

\begin{align} L = (y - X\theta)^2 \end{align}

With quantile regression, we have an additional parameter $\tau$, which specifies the $\tau_{th}$ quantile of our target variable $Y$ that we're interested in modeling, where $\tau \in (0, 1)$ and our objective function becomes:

\begin{align} L &= \begin{cases} \tau(y - X\theta), \; if \; y - X\theta \geq 0 \\ (\tau - 1)(y - X\theta), \; if \; y - X\theta < 0 \end{cases} \end{align}

Let's try and get some intuition of what this objective function is telling us. The quantile loss differs depending on the evaluated quantile. Such that more negative errors are penalized more when we specify a higher quantiles and more positive errors are penalized more for lower quantiles. To confirm that this is actually the case, the code chunk below simulates the quantile loss at different quantile values.

In [3]:
def compute_quantile_loss(y_true, y_pred, quantile):
    y_true : 1d ndarray
        Target value.
    y_pred : 1d ndarray
        Predicted value.
    quantile : float, 0. ~ 1.
        Quantile to be evaluated, e.g., 0.5 for median.
    residual = y_true - y_pred
    return np.maximum(quantile * residual, (quantile - 1) * residual)
In [4]:
# Suppose we've made a prediction for a single point with a true value of zero,
# and our predictions range from -1 to +1 that is, our errors also range from -1 to +1.
n_samples = 1000
y_true = np.zeros(n_samples)
y_pred = np.linspace(-1, 1, n_samples)

quantiles = [0.1, 0.5, 0.9]
quantile_losses = [
    compute_quantile_loss(y_true, y_pred, quantile)
    for quantile in quantiles
In [5]:
# change default style figure and font size
plt.rcParams['figure.figsize'] = 12, 8
plt.rcParams['font.size'] = 12

for quantile_loss in quantile_losses:
    plt.plot(y_pred, quantile_loss)

plt.legend([str(int(q * 100)) + 'th percentile' for q in quantiles])
plt.ylabel('Quantile loss')
plt.title('Quantile loss by error and quantile', loc='left')

Let's look at each line separately:

  • The orange line shows the median, which is symmetric around zero. The median aims to bisect the set of predictions, so we want to weigh underestimates equally to overestimates. As it turns out choosing a quantile of 0.5 is equivalent to modeling the absolute values of the residuals $\big| (y - X\theta) \big|$.
  • The blue line shows the 10th percentile, which assigns a lower loss to negative errors and a higher loss to positive errors. The 10th percentile means we think there's a 10 percent chance that the true value is below the predicted value, so it makes sense to assign less weight to underestimates than to overestimates.
  • The green blue line shows the 90th percentile, which is the opposite of the 10th percentile.

Quantile Regression With LightGBM

In the following section, we generate a sinoide function + random gaussian noise, with 80% of the data points being our training samples (blue points) and the rest being our test samples (red points). Generating a 1 dimension fake data allows us to easily visualize it and gain intuition on what sort of black magic our algorithm is doing.

In [6]:
def ground_truth(x):
    """Ground truth -- function to approximate"""
    return x * np.sin(x) + np.sin(2 * x)

def gen_data(low, high, n_samples, scale=4, test_size=0.2, random_state=3):
    """generate training and testing data from the ground truth function"""
    X = np.random.uniform(low, high, size=n_samples)

    # generate the response from the ground truth function and add
    # some random noise to it, scale controls the variance of the noise.
    y = ground_truth(X) + np.random.normal(scale=scale, size=n_samples)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)

    return X_train, X_test, y_train, y_test

def plot_data(x_plot, X_train, X_test, y_train, y_test, low, high):
    """plot training and testing data"""
    s = 15
    plt.plot(x_plot, ground_truth(x_plot), alpha=0.5, label='ground truth')
    plt.scatter(X_train, y_train, s=s, alpha=0.2)
    plt.scatter(X_test, y_test, s=s, alpha=0.2, color='red')
    plt.xlim((low, high))
    plt.legend(loc='upper left')
In [7]:
low = 0
high = 20
x_plot = np.linspace(low, high, 500)

n_samples = 2000
X_train, X_test, y_train, y_test = gen_data(low=low, high=high, n_samples=n_samples)
plot_data(x_plot, X_train, X_test, y_train, y_test, low, high)

We first use the squared error loss as our objective function to train our tree model and visualize the predicted value versus the ground truth.

In [8]:
# feel free to do some hyperparamter tuning,
# these parameters were picked randomly
lgb_params = {
    'n_jobs': 1,
    'max_depth': 4,
    'min_data_in_leaf': 10,
    'subsample': 0.9,
    'n_estimators': 80,
    'learning_rate': 0.1,
    'colsample_bytree': 0.9,
    'boosting_type': 'gbdt'
lgb_l2 = LGBMRegressor(objective='regression', **lgb_params)[:, np.newaxis], y_train)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=0.9,
       learning_rate=0.1, max_depth=4, min_child_samples=20,
       min_child_weight=0.001, min_data_in_leaf=10, min_split_gain=0.0,
       n_estimators=80, n_jobs=1, num_leaves=31, objective='regression',
       random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
       subsample=0.9, subsample_for_bin=200000, subsample_freq=1)
In [9]:
plt.plot(x_plot, lgb_l2.predict(x_plot[:, np.newaxis]),
         label='LGB L2 Loss', alpha=0.9, linewidth=2)

plot_data(x_plot, X_train, X_test, y_train, y_test, low, high)