In [1]:
# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2

import os
import numpy as np
import pandas as pd
import m2cgen as m2c
import sklearn.datasets as datasets
from xgboost import XGBClassifier, XGBRegressor

import onnxruntime as rt
from skl2onnx import convert_sklearn, update_registered_converter
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx.common.shape_calculator import calculate_linear_classifier_output_shapes
from onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost

# prevent scientific notations
pd.set_option('display.float_format', lambda x: '%.3f' % x)

%watermark -a 'Ethen' -u -d -v -p numpy,pandas,sklearn,m2cgen,xgboost
Author: Ethen

Last updated: 2022-06-02

Python implementation: CPython
Python version       : 3.7.11
IPython version      : 7.27.0

numpy  : 1.21.6
pandas : 1.3.5
sklearn: 1.0.2
m2cgen : 0.10.0
xgboost: 1.6.1

Gradient Boosted Tree Inferencing

Once we train our machine learning model, depending on the use case, we may wish to operationalize it by putting it behind a service for (near) real time inferencing. We can definitely generate predictions in batch offline, store them in downstream tables or lookup services, and pull out the pre-computed predictions when needed. Although this batch prediction approach might sound easier to implement, and we don't have to worry about the latency issues that come with real time services, this paradigm does come with its limitations, e.g.

  • Cold start problem. For a new entity, whether it's a user coming to the website or an item being listed on a marketplace, there will be no pre-computed predictions available.
  • No access to real time features. Dynamic features are features based on what’s happening right now – what a user is watching, what people just liked. Knowing these allows us to generate more accurate or relevant predictions based on the latest information.
  • Potentially wasted computation/storage. If we generate predictions for every possible user each day, and only 5% of them log in to use our website, then the compute used to generate 95% of our predictions is wasted.

Translating to Production Language

It's very common in an industry setting to prototype a machine learning model in Python and translate it into other languages such as C++ or Java when it comes to deployment. This usually happens when the core application is written in those languages, and the application is so latency sensitive that we can't afford the cost of calling an external API to fetch the model prediction.

In this section, we'll be looking at how we can achieve this with Gradient Boosted Trees, specifically XGBoost. Different libraries might have different ways of doing this, but the concept should be similar.

Tree Structure

A typical model dump from XGBoost looks like the following:

booster[0]:
0:[bmi<0.00942232087] yes=1,no=2,missing=1
    1:[bmi<-0.0218342301] yes=3,no=4,missing=3
        3:[bmi<-0.0584798381] yes=7,no=8,missing=7
            7:leaf=25.84091
            8:leaf=33.0292702
        4:[bp<0.0270366594] yes=9,no=10,missing=9
            9:leaf=38.7487526
            10:leaf=51.0882378
    2:[bp<0.0235937908] yes=5,no=6,missing=5
        5:leaf=53.0696678
        6:leaf=69.4000015
booster[1]:
0:[bmi<0.00511107268] yes=1,no=2,missing=1
    1:[bp<0.0390867069] yes=3,no=4,missing=3
        3:[bmi<-0.0207564179] yes=7,no=8,missing=7
            7:leaf=21.0474758
            8:leaf=27.7326946
        4:[bmi<0.000799824367] yes=9,no=10,missing=9
            9:leaf=36.1850548
            10:leaf=14.9188232
    2:[bmi<0.0730132312] yes=5,no=6,missing=5
        5:[bp<6.75072661e-05] yes=11,no=12,missing=11
            11:leaf=31.3889732
            12:leaf=43.4056664
        6:[bp<-0.0498541184] yes=13,no=14,missing=13
            13:leaf=13.0395498
            14:leaf=59.377037

There are 3 distinct pieces of information:

  • booster Gradient Boosted Tree is an ensemble tree method, and each new booster marks the start of a new tree. The number of boosters in the model dump equals the number of boosting rounds we specified (e.g. for the sklearn XGBoost API, n_estimators controls this) multiplied by the number of model outputs: 1 for regression and binary classification, and the number of classes for multi-class classification. e.g. for a 3-class model, tree 0 contributes to the raw prediction of class 0, tree 1 to class 1, tree 2 to class 2, tree 3 to class 0, and so on.
  • node Following the booster is each tree's if/else structure, e.g. for node 0, if the feature bmi is less than a threshold, we branch to node 1, else we branch to node 2.
  • leaf Once we reach a leaf, we accumulate the response prediction, e.g. node 7 is a leaf, and the prediction for this node is 25.84091. The sketch right after this list walks through booster[0] to make this concrete.
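
To make the structure concrete, here is a minimal, illustrative sketch (not part of the original notebook) that hard-codes booster[0] from the dump above as plain if/else statements and returns the leaf value for a given input. For simplicity it ignores the missing directions, i.e. it assumes bmi and bp are never missing.

def score_booster0(bmi: float, bp: float) -> float:
    """Evaluate booster[0] from the dump above by walking its if/else structure."""
    if bmi < 0.00942232087:              # node 0
        if bmi < -0.0218342301:          # node 1
            if bmi < -0.0584798381:      # node 3
                return 25.84091          # node 7
            return 33.0292702            # node 8
        if bp < 0.0270366594:            # node 4
            return 38.7487526            # node 9
        return 51.0882378                # node 10
    if bp < 0.0235937908:                # node 2
        return 53.0696678                # node 5
    return 69.4000015                    # node 6

score_booster0(bmi=0.062, bp=0.022)  # node 0 -> node 2 -> node 5, returns 53.0696678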

Raw Prediction

We mentioned that to get the prediction for a given input, we sum up the response prediction associated with each tree's leaf node. This holds true for regression models, but for other models, we need to perform a transformation on top of the raw prediction to get to the probabilities, e.g. when building a binary classification model, a logistic (sigmoid) transformation is applied on top of the raw prediction, whereas for multi-class classification, a softmax transformation is needed.
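
To make this concrete, here is a minimal self-contained sketch (not part of the original notebook, variable names are illustrative): it fits a small binary classifier on synthetic data and checks that applying the logistic sigmoid to the raw margin, i.e. the accumulated leaf values, reproduces predict_proba.

import numpy as np
import sklearn.datasets as datasets
from scipy.special import expit  # logistic sigmoid
from xgboost import XGBClassifier

X_demo, y_demo = datasets.make_classification(n_samples=1000, n_features=5, random_state=0)
clf = XGBClassifier(n_estimators=3, max_depth=3).fit(X_demo, y_demo)

# raw prediction (margin) before the logistic transformation
raw_margin = clf.predict(X_demo[:1], output_margin=True)
assert np.allclose(expit(raw_margin), clf.predict_proba(X_demo[:1])[:, 1])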

Preparation

All the examples below, be it regression, binary classification or multi-class classification, follow the same structure:

  • We load some pre-processed data.
  • Train a quick XGBoost model.
  • Dump the raw model to disk.
  • Generate a sample prediction so we can later verify whether it matches the prediction from the model converted to C++.

Regression

In [2]:
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["age", "sex", "bmi", "bp"]]
X.head()
Out[2]:
age sex bmi bp
0 0.038 0.051 0.062 0.022
1 -0.002 -0.045 -0.051 -0.026
2 0.085 0.051 0.044 -0.006
3 -0.089 -0.045 -0.012 -0.037
4 0.005 -0.045 -0.036 0.022
In [3]:
regression_model_params = {
    'n_estimators': 2,
    'max_depth': 3,
    'base_score': 0.0
}
regression_model = XGBRegressor(**regression_model_params).fit(X, y)
regression_model
Out[3]:
XGBRegressor(base_score=0.0, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=2, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)
In [4]:
regression_model.get_booster().dump_model("regression.txt")
In [5]:
regression_model.predict(X.iloc[[0]])
Out[5]:
array([96.475334], dtype=float32)
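
As a sanity check (not part of the original notebook), we can trace this first row (bmi = 0.062, bp = 0.022) through the two trees in the dump shown in the Tree Structure section, which is exactly this regression model's dump, and confirm that the summed leaf values reproduce the prediction; base_score was set to 0.0, so no offset is involved.

# booster[0]: bmi (0.062) >= 0.00942 -> node 2, bp (0.022) < 0.02359 -> leaf 5
leaf_tree0 = 53.0696678
# booster[1]: bmi >= 0.00511 -> node 2, bmi < 0.07301 -> node 5, bp >= 6.75e-05 -> leaf 12
leaf_tree1 = 43.4056664
print(leaf_tree0 + leaf_tree1)  # 96.4753342, matching regression_model.predict above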

Binary Classification

In [6]:
X, y = datasets.make_classification(n_samples=10000, n_features=5, random_state=42, n_classes=2)
X
Out[6]:
array([[-2.24456934, -1.36232827,  1.55433334, -2.0869092 , -1.27760482],
       [-0.46503462, -0.57657929, -0.2033143 ,  0.43042571,  1.98019634],
       [ 1.0967453 ,  1.31568265,  0.40073014, -0.88575625,  0.72711376],
       ...,
       [-3.17646599, -2.97878542,  0.32401442,  0.12710402, -0.63318634],
       [-0.41224819,  0.17380558,  1.04229889, -1.62625451, -1.24718999],
       [-1.02487223, -0.70828082,  0.55578021, -0.70007904, -0.43269446]])
In [7]:
binary_model_params = {
    'n_estimators': 3,
    'max_depth': 3,
    'tree_method': 'hist',
    'grow_policy': 'lossguide'
}
binary_model = XGBClassifier(**binary_model_params).fit(X, y)
binary_model
Out[7]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='lossguide',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=3, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [8]:
binary_model.get_booster().dump_model("binary_class.txt")
In [9]:
inputs = np.array([[0.0, 0.2, 0.4, 0.6, 0.8]])
binary_model.predict_proba(inputs)
Out[9]:
array([[0.2894203, 0.7105797]], dtype=float32)

Multiclass Classification

In [10]:
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X.head()
Out[10]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.100 3.500 1.400 0.200
1 4.900 3.000 1.400 0.200
2 4.700 3.200 1.300 0.200
3 4.600 3.100 1.500 0.200
4 5.000 3.600 1.400 0.200
In [11]:
multi_class_model_params = {
    'n_estimators': 2,
    'max_depth': 3
}
multi_class_model = XGBClassifier(**multi_class_model_params).fit(X, y)
multi_class_model
Out[11]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=2, n_jobs=0,
              num_parallel_tree=1, objective='multi:softprob', predictor='auto',
              random_state=0, reg_alpha=0, ...)
In [12]:
multi_class_model.get_booster().dump_model("multi_class.txt")
In [13]:
inputs = np.array([[5.1, 3.5, 1.4, 0.2]])
multi_class_model.predict_proba(inputs)
Out[13]:
array([[0.6092037 , 0.19627655, 0.19451974]], dtype=float32)
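
As a quick hedged check (not part of the original notebook), the per-class raw margins, i.e. the accumulated leaf values for each class, can be pushed through a softmax to recover the probabilities above; output_margin=True is assumed to return those untransformed margins.

from scipy.special import softmax

# per-class raw margins for the same iris sample, before the softmax transformation
margins = multi_class_model.predict(inputs, output_margin=True)
print(softmax(margins, axis=1))  # ~[[0.609, 0.196, 0.195]], matching predict_proba above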

C++ Implementation

The rest of the content is about implementing the boosted tree inferencing logic in C++; all the code resides in the gbt_inference folder for those interested. In practice, we don't always have to rely on naive code that we've implemented ourselves to solidify our understanding, e.g. the m2cgen (Model 2 Code Generator) project is one of many projects out there that focus on converting a trained model into native code. If we export our regression model, we can see that the inferencing logic is indeed a bunch of if/else statements followed by a summation at the very end.

In [14]:
code = m2c.export_to_c(regression_model)
print(code)
double score(double * input) {
    double var0;
    if (input[2] >= 0.009422321) {
        if (input[3] >= 0.02359379) {
            var0 = 69.4;
        } else {
            var0 = 53.069668;
        }
    } else {
        if (input[2] >= -0.02183423) {
            if (input[3] >= 0.02703666) {
                var0 = 51.088238;
            } else {
                var0 = 38.748753;
            }
        } else {
            if (input[2] >= -0.058479838) {
                var0 = 33.02927;
            } else {
                var0 = 25.84091;
            }
        }
    }
    double var1;
    if (input[2] >= 0.0051110727) {
        if (input[2] >= 0.07301323) {
            if (input[3] >= -0.04985412) {
                var1 = 59.377037;
            } else {
                var1 = 13.03955;
            }
        } else {
            if (input[3] >= 0.000067507266) {
                var1 = 43.405666;
            } else {
                var1 = 31.388973;
            }
        }
    } else {
        if (input[3] >= 0.039086707) {
            if (input[2] >= 0.00079982437) {
                var1 = 14.918823;
            } else {
                var1 = 36.185055;
            }
        } else {
            if (input[2] >= -0.020756418) {
                var1 = 27.732695;
            } else {
                var1 = 21.047476;
            }
        }
    }
    return var0 + var1;
}
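
As a hedged sketch (not part of the original notebook), m2cgen can also export the same model to pure Python via export_to_python, which gives us a quick way to sanity check that the generated logic reproduces the model's earlier prediction of 96.475334; the generated code is assumed to define a score(input) function just like the C version above.

python_code = m2c.export_to_python(regression_model)

# execute the generated source and pull out its score(input) function
namespace = {}
exec(python_code, namespace)

# first row of the diabetes data used earlier: age, sex, bmi, bp
generated_score = namespace['score']([0.038, 0.051, 0.062, 0.022])
print(generated_score)  # ~96.4753, matching regression_model.predict above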

ONNX

Another way of achieving this is through ONNX. Directly quoting from its documentation:

ONNX Runtime provides an easy way to run machine learned models with high performance on CPU or GPU without dependencies on the training framework. Machine learning frameworks are usually optimized for batch training rather than for prediction, which is a more common scenario in applications, sites, and services

We'll walk through the process of converting our boosted tree model into ONNX format and benchmark the inference runtime. Here, we are doing it for a classification model, but the process should be similar for regression models.

In [15]:
n_features = 5
X, y = datasets.make_classification(n_samples=10000, n_features=n_features, random_state=42, n_classes=2)
feature_names = [f'f{i}' for i in range(n_features)]
print(f'num rows: {X.shape[0]}, num cols: {X.shape[1]}')
X
num rows: 10000, num cols: 5
Out[15]:
array([[-2.24456934, -1.36232827,  1.55433334, -2.0869092 , -1.27760482],
       [-0.46503462, -0.57657929, -0.2033143 ,  0.43042571,  1.98019634],
       [ 1.0967453 ,  1.31568265,  0.40073014, -0.88575625,  0.72711376],
       ...,
       [-3.17646599, -2.97878542,  0.32401442,  0.12710402, -0.63318634],
       [-0.41224819,  0.17380558,  1.04229889, -1.62625451, -1.24718999],
       [-1.02487223, -0.70828082,  0.55578021, -0.70007904, -0.43269446]])
In [16]:
tree = XGBClassifier(
    n_estimators=20,
    max_depth=3,
    learning_rate=0.2,
    tree_method='hist',
    verbosity=1
)
tree.fit(X, y, eval_set=[(X, y)])
[0]	validation_0-logloss:0.57050
[1]	validation_0-logloss:0.48945
[2]	validation_0-logloss:0.42726
[3]	validation_0-logloss:0.38211
[4]	validation_0-logloss:0.34932
[5]	validation_0-logloss:0.32240
[6]	validation_0-logloss:0.29620
[7]	validation_0-logloss:0.27869
[8]	validation_0-logloss:0.26133
[9]	validation_0-logloss:0.25024
[10]	validation_0-logloss:0.24013
[11]	validation_0-logloss:0.23271
[12]	validation_0-logloss:0.22679
[13]	validation_0-logloss:0.22081
[14]	validation_0-logloss:0.21672
[15]	validation_0-logloss:0.21081
[16]	validation_0-logloss:0.20629
[17]	validation_0-logloss:0.20226
[18]	validation_0-logloss:0.19999
[19]	validation_0-logloss:0.19766
Out[16]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.2, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=20, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [17]:
tree.predict_proba(X[:1])
Out[17]:
array([[0.9862067, 0.0137933]], dtype=float32)
In [18]:
xgboost_checkpoint = 'model.json'
tree.save_model(xgboost_checkpoint)
tree_loaded = XGBClassifier()
tree_loaded.load_model(xgboost_checkpoint)

assert np.allclose(tree.predict_proba(X[:1]), tree_loaded.predict_proba(X[:1]))
In [19]:
input_payloads = [
    {
        'f0': -2.24456934,
        'f1': -1.36232827,
        'f2': 1.55433334,
        'f3': -2.0869092,
        'f4': -1.27760482
    }
]
In [20]:
rows = []
for input_payload in input_payloads:
    row = [input_payload[feature] for feature in feature_names]
    rows.append(row)

np_rows = np.array(rows, dtype=np.float32)
tree.predict_proba(np_rows)[:, 1]
Out[20]:
array([0.0137933], dtype=float32)
In [21]:
%%timeit
rows = []
for input_payload in input_payloads:
    row = [input_payload[feature] for feature in feature_names]
    rows.append(row)

np_rows = np.array(rows, dtype=np.float32)
tree.predict_proba(np_rows)[:, 1]
896 µs ± 29.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [22]:
def convert_xgboost_to_onnx(model, num_features: int, checkpoint: str):
    
    # boiler plate code for registering the xgboost converter
    update_registered_converter(
        XGBClassifier, 'XGBoostXGBClassifier',
        calculate_linear_classifier_output_shapes, convert_xgboost,
        options={'nocl': [True, False], 'zipmap': [True, False, 'columns']}
    )
    # perform the actual conversion specifying the types of our inputs,
    # at the time of writing this, it doesn't support categorical types
    # that are common in boosted tree libraries such as xgboost or lightgbm
    model_onnx = convert_sklearn(
        model, 'xgboost',
        [('input', FloatTensorType([None, num_features]))],
        target_opset={'': 15, 'ai.onnx.ml': 2}
    )

    with open(checkpoint, "wb") as f:
        f.write(model_onnx.SerializeToString())
In [23]:
onnx_model_checkpoint = 'xgboost.onnx'
convert_xgboost_to_onnx(tree, len(feature_names), onnx_model_checkpoint)

Upon porting our model to onnx format, we can use it for inferencing. This section uses the Python API for benchmarking.

In [24]:
sess = rt.InferenceSession(onnx_model_checkpoint)
input_name = sess.get_inputs()[0].name
output_names = [output.name for output in sess.get_outputs()]
In [25]:
np_rows = np.array(rows, dtype=np.float32)
onnx_predict_label, onnx_predict_score = sess.run(output_names, {input_name: np_rows})
onnx_predict_score
Out[25]:
[{0: 0.9862067103385925, 1: 0.01379328966140747}]
In [26]:
%%timeit
rows = []
for input_payload in input_payloads:
    row = [input_payload[feature] for feature in feature_names]
    rows.append(row)

np_rows = np.array(rows, dtype=np.float32)
onnx_predict_label, onnx_predict_score = sess.run(output_names, {input_name: np_rows})
16.6 µs ± 365 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note, at the time of writing this document, the onnx converter doesn't support categorical variable splits from common boosted tree libraries such as xgboost or lightgbm. We will have to rely on other ways of dealing with categorical variables if we wish to leverage onnx for inferencing.
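
One common workaround, shown below as a hedged sketch (not part of the original notebook, the color column is purely hypothetical), is to encode categorical columns numerically up front, e.g. with scikit-learn's OrdinalEncoder, so the model and the exported ONNX graph only ever see a float tensor input.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical categorical feature, mapped to floats before training/conversion
X_cat = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_encoded = encoder.fit_transform(X_cat).astype('float32')
print(X_encoded.ravel())  # [2. 1. 0. 1.] given alphabetical category ordering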

Reference