In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(css_style = 'custom2.css', plot_style = False)
In [2]:

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

%watermark -a 'Ethen' -d -t -v -p numba,numpy,pandas,sklearn,matplotlib
Ethen 2018-02-24 20:09:44 

CPython 3.6.3
IPython 6.1.0

numba 0.37.0
numpy 1.14.1
pandas 0.22.0
sklearn 0.19.1
matplotlib 2.1.0

Factorization Machine (FM)

Factorization Machine type algorithms are a combination of linear regression and matrix factorization, the cool idea behind this type of algorithm is it aims model interactions between features (a.k.a attributes, explanatory variables) using factorized parameters. By doing so it has the ability to estimate all interactions between features even with extremely sparse data.


Normally, when we think of linear regression, we would think of the following formula:

\begin{align} \hat{y}(\textbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} \end{align}


  • $w_0$ is the bias term, a.k.a intercept.
  • $w_i$ are weights corresponding to each feature vector $x_i$, here we assume we have $n$ total features.

This formula's advantage is that it can computed in linear time, $O(n)$. The drawback, however, is that it does not handle feature interactions. To capture interactions, we could introduce a weight for each feature combination. This is sometimes referred to as a $2_{nd}$ ordered polynomial. The resulting model is shown below:

\begin{align} \hat{y}(\textbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} + \sum_{i=1}^n \sum_{j=i+1}^n w_{ij} x_{i} x_{j} \end{align}

Compared to our previous model, this formulation has the advantages that it can capture feature interactions at least for two features at a time. But we have now ended up with a $O(n^2)$ complexity which means that to train the model, we now require a lot more time and memory. Another issue is that when we have categorical variables with high cardinality, after one-hot encoding them, we would end up with a lot of columns that are sparse, making it harder to actually capture their interactions (not enough data).

To solve this complexity issue, Factorization Machines takes inspiration from matrix factorization, and models the feature interaction using latent factors. Every feature $f_i$ has a corresponding latent factor $v_i$, and two features' interactions are modelled as $\langle \textbf{v}_i, \textbf{v}_{j} \rangle$, where $\langle \cdot \;,\cdot \rangle$ refers to the dot product of the two feature vector. If we assume its of size $k$ (this is a hyperparameter that we can tune). Then:

\begin{align} \langle \textbf{v}_i, \textbf{v}_{j} \rangle = \sum_{f=1}^k v_{i,f} v_{j,f} \end{align}

This leads of our new equation:

\begin{align} \hat{y}(\textbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} + \sum_{i=1}^{n} \sum_{j=i+1}^n \langle \textbf{v}_i , \textbf{v}_{j} \rangle x_i x_{j} \end{align}

This is an improvement from our previous model (when we modeled each pair of interaction terms with weight $w_{ij}$) as the number of parameters is reduced from $n^2$ to $n \times k$, since $k \ll n$, which also helps mitigate overfitting issues. Using the naive way of formulating factorization machine results in a complexity of $O(kn^2)$, because all pairwise interactions have to be computed, but we can reformulate it to make it run in $O(kn)$.

\begin{align} \sum_{i=1}^n \sum_{j=i+1}^n \langle \textbf{v}_i, \textbf{v}_{j} \rangle x_{i} x_{j} &= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \langle \textbf{v}_i, \textbf{v}_{j} \rangle x_{i} x_{j} - \frac{1}{2} \sum_{i=1}^n \langle \textbf{v}_i , \textbf{v}_{i} \rangle x_{i} x_{i} \\ &= \frac{1}{2}\left(\sum_{i=1}^n \sum_{j=1}^n \sum_{f=1}^k v_{i,f} v_{j,f} x_{i} x_{j} \right) - \frac{1}{2}\left( \sum_{i=1}^n \sum_{f=1}^k v_{i,f} v_{i,f} x_{i} x_{i} \right) \\ &= \frac{1}{2}\left(\sum_{i=1}^n \sum_{j=1}^n \sum_{f=1}^k v_{i,f} v_{j,f} x_{i} x_{j} - \sum_{i=1}^n \sum_{f=1}^k v_{i,f} v_{i,f} x_{i} x_{i} \right) \\ &= \frac{1}{2} \sum_{f=1}^{k} \left( \left(\sum_{i=1}^n v_{i,f}x_{i} \right) \left( \sum_{j=1}^n v_{j,f}x_{j} \right) - \sum_{i=1}^{n} v_{i,f}^2 x_{i}^2 \right) \\ &= \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i}^{n} v_{i,f}x_{i} \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_{i}^2 \right) \end{align}

Note, summing over different pairs is the same as summing over all pairs minus the self-interactions (divided by two). This is the reason why the value 1/2 is introduced from the beginning of the derivation.

This reformulated equation has a linear complexity in both $k$ and $n$, i.e. its computation is in $O(kn)$, substituting this new equation into the existing factorization machine formula, we end up with:

\begin{align} \hat{y}(\textbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} + \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i}^{n} v_{i,f}x_{i} \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_{i}^2 \right) \end{align}

In a machine learning setting, factorization machine can be applied to different supervised prediction tasks:

  • Regression:, in this case $\hat{y}(\textbf{x})$ can be used directly by minimizing the mean squared error between the model prediction and target value, e.g. $\frac{1}{N}\sum^{N}\big(y - \hat{y}(\textbf{x})\big)^2$
  • Classification:, if we were to use it in a binary classification setting, we could then minimize the log loss, $\ln \big(e^{-y \cdot \hat{y}(\textbf{x})} + 1 \big)$, where $\sigma$ is the sigmoid/logistic function and $y \in {-1, 1}$.

To train factorization machine, we can use a gradient descent based optimization techniques, the parameters to be learned are $(w_0, \mathbf{w},$ and $\mathbf{V}$).

\begin{align} \frac{\partial}{\partial\theta}\hat{y}(\textbf{x}) = \begin{cases} 1, & \text{if $\theta$ is $w_0$} \\ x_i, & \text{if $\theta$ is $w_i$} \\ x_i\sum_{j=1}^{n} v_{j,f}x_j - v_{i,f}x_{i}^2 & \text{if $\theta$ is $v_{i,f}$} \end{cases} \end{align}
  • Notice that $\sum_{j=1}^n v_{j, f} x_j$ does not depend on $i$, thus it can be computed independently.
  • The last formula above, can also be written as $x_i(\sum_{j=1}^{n} v_{j,f}x_j - v_{i,f}x_{i})$.
  • In practice, we would throw in some L2 regularization to prevent overfitting.

As the next section contains implementation of the algorithm from scratch, the gradient of the log loss is also provided here for completeness. The predicted value $\hat{y}(\textbf{x})$ is replaced with $x$ for making the notation cleaner.

\begin{align} \frac{d}{dx}\left[ \ln \big(e^{-yx} + 1 \big) \right] &= \frac{1}{e^{-yx} + 1} \cdot \frac{d}{dx}\left[e^{-yx} + 1 \right] \\ &= \frac{\frac{d}{dx}\left[e^{-yx} \right] + \frac{d}{dx}\left[1 \right]}{e^{-yx} + 1} \\ &= \frac{e^{-yx} \cdot \frac{d}{dx}\left[-yx \right] + 0}{e^{-yx} + 1} \\ &= \frac{e^{-yx} \cdot -y}{e^{-yx} + 1} \\ &= -\frac{ye^{-yx}}{e^{-yx} + 1} \\ &= -\frac{y}{e^{yx} + 1} \end{align}

Advantages: We'll now wrap up the theoretical section of factorization machine, with some of its advantages:

  • We can observe from the model equation that it can be computed in linear time.
  • By leveraging ideas from matrix factorization, we can estimate higher order interaction effects even under very sparse data.
  • Compared to traditional matrix factorization methods, which is restricted to modeling a user-item matrix, we can leverage other user or item specific features making factorization machine more flexible.


For the implementation of factorization machine, we'll use a for loop based code as I personally find it easier to comprehend for the gradient update section. There are different ways to speed up for loop based code in Python, such as using Cython or Numba, here we'll be using Numba.

In [3]:
# using the example spam dataset
# read it in, extract the input and output columns
label_col = 'label_num'
sms = pd.read_table('sms.tsv', header = None, names = ['label', 'message'])
sms[label_col] = sms['label'].map({'ham': 0, 'spam': 1})
X = sms['message']
y = sms[label_col].values

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 1)

# convert both sets' text column to document-term matrix;
# ideally, we would want to perform some preprocessing on
# our text data, but let's be lazy here as that's not
# the goal of this documentation
tfidf = TfidfVectorizer(min_df = 2, max_df = 0.5)
X_train_dtm = tfidf.fit_transform(X_train)
X_test_dtm = tfidf.transform(X_test)
<4179x3508 sparse matrix of type '<class 'numpy.float64'>'
	with 51261 stored elements in Compressed Sparse Row format>
In [4]:
import numpy as np
from numba import njit
from tqdm import trange
from sklearn.base import BaseEstimator, ClassifierMixin

class FactorizationMachineClassifier(BaseEstimator, ClassifierMixin):
    Factorization Machine [1]_ using Stochastic Gradient Descent.
    For binary classification only.

    n_iter : int, default 10
        Number of iterations to train the algorithm.

    n_factors : int, default 10
        Number/dimension of features' latent factors.

    learning_rate : float, default 0.1
        Learning rate for the gradient descent optimizer.

    reg_coef : float, default 0.01
        Regularization strength for weights/coefficients.

    reg_factors : float, default 0.01
        Regularization strength for features' latent factors.

    random_state : int, default 1234
        Seed for the randomly initialized features latent factors

    verbose : bool, default True
        Whether to print progress bar while training.

    intercept_ : double
        Intercept term, w0 based on the original notations.

    coef_ : 1d ndarray, shape [n_features,]
        Coefficients, w based on the original notations.

    feature_factors_ : 2d ndarray, shape [n_factors, n_features]
        Latent factors for all features. v based on the original
        notations. The learned factors can be viewed as the
        embeddings for each features. If a pair of features tends
        to co-occur often, then their embeddings should be
        close/similar (in terms of cosine similarity) to each other.

    history_ : list
        Loss function's history at each iteration, useful
        for evaluating whether the algorithm converged or not.

    .. [1] `S. Rendle Factorization Machines (2010)

    def __init__(self, n_iter = 10, n_factors = 10,
                 learning_rate = 0.1, reg_coef = 0.01,
                 reg_factors = 0.01, random_state = 1234, verbose = False):
        self.n_iter = n_iter
        self.verbose = verbose
        self.reg_coef = reg_coef
        self.n_factors = n_factors
        self.reg_factors = reg_factors
        self.random_state = random_state
        self.learning_rate = learning_rate

    def fit(self, X, y):
        Fit the model to the input data and label.

        X : scipy sparse csr_matrix, shape [n_samples, n_features]
            Data in sparse matrix format.

        y : 1d ndarray, shape [n_samples,]
            Training data's corresponding label.


        n_samples, n_features = X.shape
        self.coef_ = np.zeros(n_features)
        self.intercept_ = 0.0

        # the factors are often initialized with a mean of 0 and standard deviation
        # of 1 / sqrt(number of latent factor specified)
        self.feature_factors_ = np.random.normal(
            scale = 1 / np.sqrt(self.n_factors), size = (self.n_factors, n_features))
        # the gradient is implemented in a way that requires
        # the negative class to be labeled as -1 instead of 0
        y = y.copy().astype(np.int32)
        y[y == 0] = -1

        loop = range(self.n_iter)
        if self.verbose:
            loop = trange(self.n_iter)

        self.history_ = []
        for _ in loop:
            loss = _sgd_update(, X.indptr, X.indices,
                               y, n_samples, n_features,
                               self.intercept_, self.coef_,
                               self.feature_factors_, self.n_factors,
                               self.learning_rate, self.reg_coef, self.reg_factors)

        return self

    def predict_proba(self, X):
        Probability estimates. The returned estimates for
        all classes are ordered by the label of classes.

        X : scipy sparse csr_matrix, shape [n_samples, n_features]
            Data in sparse matrix format.

        proba : 2d ndarray, shape [n_samples, n_classes]
            The probability of the sample for each class in the model.
        pred = self._predict(X)
        pred_proba = 1.0 / (1.0 + np.exp(-pred))
        proba = np.vstack((1 - pred_proba, pred_proba)).T
        return proba

    def _predict(self, X):
        """Similar to _predict_instance but vectorized for all samples"""
        linear_output = X * self.coef_
        v = self.feature_factors_.T
        term = (X * v) ** 2 - (X.power(2) * (v ** 2))
        factor_output = 0.5 * np.sum(term, axis = 1)
        return self.intercept_ + linear_output + factor_output

    def predict(self, X):
        Predict class labels for samples in X.

        X : scipy sparse csr_matrix, shape [n_samples, n_features]
            Data in sparse matrix format.

        Predicted class label per sample.
        pred_proba = self.predict_proba(X)[:, 1]
        return pred_proba.round().astype(

def _sgd_update(data, indptr, indices, y, n_samples, n_features,
                w0, w, v, n_factors, learning_rate, reg_w, reg_v):
    Compute the loss of the current iteration and update
    gradients accordingly.
    loss = 0.0
    for i in range(n_samples):
        pred, summed = _predict_instance(data, indptr, indices, w0, w, v, n_factors, i)
        # calculate loss and its gradient
        loss += _log_loss(pred, y[i])
        loss_gradient = -y[i] / (np.exp(y[i] * pred) + 1.0)
        # update bias/intercept term
        w0 -= learning_rate * loss_gradient

        # update weight
        for index in range(indptr[i], indptr[i + 1]):
            feature = indices[index]
            w[feature] -= learning_rate * (loss_gradient * data[index] + 2 * reg_w * w[feature])

        # update factor
        for factor in range(n_factors):
            for index in range(indptr[i], indptr[i + 1]):
                feature = indices[index]
                term = summed[factor] - v[factor, feature] * data[index]
                v_gradient = loss_gradient * data[index] * term
                v[factor, feature] -= learning_rate * (v_gradient + 2 * reg_v * v[factor, feature])
    loss /= n_samples
    return loss

def _predict_instance(data, indptr, indices, w0, w, v, n_factors, i):
    """predicting a single instance"""
    summed = np.zeros(n_factors)
    summed_squared = np.zeros(n_factors)

    # linear output w * x
    pred = w0
    for index in range(indptr[i], indptr[i + 1]):
        feature = indices[index]
        pred += w[feature] * data[index]

    # factor output
    for factor in range(n_factors):
        for index in range(indptr[i], indptr[i + 1]):
            feature = indices[index]
            term = v[factor, feature] * data[index]
            summed[factor] += term
            summed_squared[factor] += term * term

        pred += 0.5 * (summed[factor] * summed[factor] - summed_squared[factor])
    # summed is the independent term that can be re-used
    # during the gradient update stage
    return pred, summed

def _log_loss(pred, y):
    negative log likelihood of the
    current prediction and label, y.
    return np.log(np.exp(-pred * y) + 1.0)
In [5]:
fm = FactorizationMachineClassifier(n_iter = 30, learning_rate = 0.1), y_train)
FactorizationMachineClassifier(learning_rate=0.1, n_factors=10, n_iter=30,
                random_state=1234, reg_coef=0.01, reg_factors=0.01,
In [6]:
# change default style figure and font size
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12

# one quick way to check that we've implemented
# the gradient descent is to ensure that the loss
# curve is steadily decreasing
plt.title('Loss Curve Per Iteration')
In [7]:
# predict on the test set and output the auc score
y_pred_prob = fm.predict_proba(X_test_dtm)[:, 1]
auc = roc_auc_score(y_test, y_pred_prob)
print('auc', auc)
auc 0.9973867907642742
In [8]:
# we can compare it with a logistic regression,
logreg = LogisticRegression(), y_train)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
auc = roc_auc_score(y_test, y_pred_prob)
print('auc', auc)
auc 0.9949615178092

There are various open-sourced implementations floating around the web, here are the links to some of them:

I personally haven't tested which one is more efficient, feel free to grab one of them as see if it helps solve your problem.