In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))
from formats import load_style
load_style(css_style = 'custom2.css', plot_style = False)
Out[1]:
In [2]:
os.chdir(path)

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload
%autoreload 2

import numpy as np
from time import time
from scipy.stats import geom
from lightfm import LightFM
from lightfm.evaluation import auc_score
from lightfm.evaluation import precision_at_k
from lightfm.datasets import fetch_movielens

%watermark -a 'Ethen' -d -t -v -p scipy,numpy,lightfm
Ethen 2018-04-07 14:21:10 

CPython 3.6.4
IPython 6.2.1

scipy 1.0.0
numpy 1.14.2
lightfm 1.14

WARP (Weighted Approximate Rank Pairwise Loss)

Geometric Distribution

We'll start with a quick recap of the geometric distribution, as we'll be leveraging one of its properties later. The geometric distribution measures the number of trials needed to get the first success in repeated Bernoulli trials. That is, suppose we have a sequence of independent trials, each resulting in one of two possible outcomes, success or failure. The probability of success, denoted $P(success) = p$, stays constant from trial to trial, and $X$ represents the number of trials needed to get the first success.

For the first success to occur on the $x^{th}$ trial:

  • The first $x - 1$ trials must be failures.
  • The $x^{th}$ trial must be a success.

This gives us the probability mass function of the geometric distribution:

\begin{align} P(X = x) = (1 - p)^{x - 1}p \end{align}

For the geometric distribution, the minimum value $x$ can take is 1; there is no upper bound to that number.

Example: In a large population of adults, 30% have received CPR training. If adults from this population are randomly selected, what is the probability that the 6th person sampled is the first one that has received CPR training?

In [3]:
# we could do it manually
p = 0.3
prob = (1 - p) ** 5 * p
print(prob)

# or use scipy stats' geometric distribution
# the first success occurring on the 6th trial
print(geom(p = p).pmf(k = 6))
0.05042099999999998
0.05042099999999998

Last but not least, one useful property to know about the geometric distribution is that the mean/expectation of the random variable $X$ is:

\begin{align} E[X] = \frac{1}{p} \end{align}
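
As a quick sanity check, we can confirm this property numerically (a minimal sketch, not part of the original derivation):

In [ ]:
# the analytical mean 1 / p should match both scipy's frozen
# distribution and the average of simulated geometric draws
p = 0.3
print(1 / p)               # 3.333...
print(geom(p = p).mean())  # 3.333...
print(np.random.geometric(p, size = 100000).mean())  # roughly 3.33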

Building Intuition for WARP

Like the Bayesian Personalized Ranking (BPR) model, WARP deals with (user, positive item, negative item) triplets. Unlike BPR, the negative items in the triplet are not chosen by random sampling: they are chosen from among those negative items which would violate the desired item ranking given the state of the model. This approximates a form of active learning where the model selects those triplets that it cannot currently rank correctly.

This procedure yields roughly the following algorithm:

  • For a given (user, positive item) pair, sample a negative item at random from all the remaining items. Compute predictions for both items; if the negative item's prediction exceeds that of the positive item plus a margin, perform a gradient update to rank the positive item higher and the negative item lower. If there is no rank violation, continue sampling negative items until a violation is found.
  • If we find a violating negative example on the first try, make a large gradient update: this indicates that a lot of negative items are ranked higher than positive items given the current state of the model, and the model should be updated by a large amount. If it took a lot of sampling to find a violating example, perform a small update, since the model is likely close to the optimum and should be updated at a low rate. A short sketch of this loop follows the list.
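
The sketch below illustrates this sampling loop for a single (user, positive item) pair. Note that `predict` and `update` are hypothetical stand-ins for the model's scoring function and gradient step, not lightfm's actual internals:

In [ ]:
# a minimal sketch of WARP's negative sampling loop; `predict` and
# `update` are hypothetical placeholders, not lightfm's internals
def warp_step(predict, update, user, pos_item, neg_items, max_trials = 100):
    pos_score = predict(user, pos_item)
    for trials in range(1, max_trials + 1):
        neg_item = np.random.choice(neg_items)
        neg_score = predict(user, neg_item)
        # a rank violation occurs when the margin-based hinge is positive
        if 1 - pos_score + neg_score > 0:
            # few trials -> large estimated rank -> large update;
            # many trials -> small estimated rank -> small update
            rank = max(1, len(neg_items) // trials)
            update(user, pos_item, neg_item, weight = np.log(rank))
            return
    # no violating negative item was found, skip this pair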

Given this intuition, we can formalize this idea with some notations in the next section.

Introducing WARP

In the context of making recommendations using implicit feedback data, we are interested in ranking items $i$ given a user $u$. We let $f(u) \in \mathbb{R}^Y$ be a vector function providing a score for each of the $Y$ items, where $f_u(i)$ denotes the score of item $i$ for user $u$. At the end of the day, after the scoring function has assigned a score to every item, the item with the highest score is recommended.
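
To make this concrete: in latent factor models such as LightFM, this score is typically the dot product between user and item embeddings. A minimal illustration with made-up dimensions:

In [ ]:
# illustrative only: score every item for one user as the dot
# product between the user's latent vector and all item vectors
n_items, n_factors = 5, 3
rng = np.random.RandomState(1)
user_vector = rng.rand(n_factors)
item_vectors = rng.rand(n_items, n_factors)

scores = item_vectors @ user_vector  # f_u(i) for every item i
print(scores)
print('recommended item:', scores.argmax())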

The class of ranking loss function we'll be looking at today is defined as:

\begin{align} \sum_{i \in D_u} L \big( rank_i(f_u) \big) \end{align}

Where:

  • $D_u$ represents user $u$'s set of positive feedback items.
  • $rank_i(f_u)$ is the margin-based rank of item $i$. i.e.
\begin{align} rank_i(f_u) = \sum_{j \notin D_u} I \big( f_u(j) \ge f_u(i) \big) \end{align}
  • $I$ is the indicator function.
  • In other words, $rank_i(f_u)$ counts, for every item $j$ that does not have the user's positive feedback, whether its score is at least as high as that of an item $i$ the user gave positive feedback to; every such pair counts towards the loss function (see the toy example below).
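
A toy numerical example of this rank definition (made-up scores, not from any real model):

In [ ]:
# count how many negative items score at least as high as the positive item
scores = np.array([0.9, 0.3, 0.7, 0.8, 0.2])  # f_u over 5 items
positive_item = 2           # i, an item in D_u
negative_items = [1, 3, 4]  # j not in D_u

rank = sum(scores[j] >= scores[positive_item] for j in negative_items)
print(rank)  # 1, only item 3 (score 0.8) outranks item 2 (score 0.7)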

Now going back to $L$, $L(\cdot)$ is in charge of transforming this rank into the actual loss:

\begin{align} L(k) = \sum_{t = 1}^k a_t, \enspace a_1 \ge a_2 \ge \ldots \ge 0 \end{align}

Different settings of $a_t$ allow the loss function to optimize for different objectives. e.g.

  • When $a_1 = 1$ and $a_{t > 1} = 0$, precision at 1 is optimized for.
  • To generalize this a bit, when $a_{t \le k} = 1$ and $a_{t > k} = 0$, precision at $k$ is optimized for.
  • On the other hand, $a_t = 1 / t$ gives a smooth weighting over positions, with most of the weight given to the top position and rapidly decaying weight for lower positions. The resulting loss $L(k) = \sum_{t = 1}^k 1 / t$ is the $k^{th}$ harmonic number, which is approximately the natural logarithm $\ln(k)$; we'll be using this choice later in our implementation, and we check the approximation numerically below.
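
A quick numerical check of that approximation:

In [ ]:
# the harmonic number sum(1 / t) grows like the natural logarithm,
# which is why a_t = 1 / t leads to the log(rank) loss used below
for k in [1, 10, 100, 1000]:
    harmonic = sum(1 / t for t in range(1, k + 1))
    print(k, harmonic, np.log(k))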

This property of optimizing for the top-k ranked items is often a desired property in recommender systems: in these scenarios we typically only recommend a few items to each user, so we don't care much about the ranking of the entire item set.

The problem with this version of the loss function is that, while it does depend on the model's parameters, this dependence is not continuous (our rank is integer-valued), so we can't derive gradients to directly optimize it. The approach the original authors took is to derive a differentiable approximation to the logarithm of the rank. The rank loss for a specific item $i \in D_u$ is defined as:

\begin{align} L \big( rank_i(f_u) \big) &= log \big( rank_i(f_u) \big) \frac{rank_i(f_u)}{rank_i(f_u)} \\ &= log \big( rank_i(f_u) \big) \frac{\sum_{j \notin D_u} I \big( f_u(j) \ge f_u(i) \big)}{rank_i(f_u)} \\ &= \sum_{j \notin D_u} log \big( rank_i(f_u) \big) \frac{I \big( f_u(j) \ge f_u(i) \big)}{rank_i(f_u)} \\ &= \sum_{j \notin D_u} log \big( rank_i(f_u) \big) \frac{\big| 1 - f_u(i) + f_u(j) \big|_+}{rank_i(f_u)} \end{align}
  • Notice that in the last step, we've replaced the indicator function with the hinge loss, so gradient descent based algorithms can be adopted to learn the parameters.
  • Based on the equation above, after randomly sampling a positive item $i \in D_u$, we would need to compute the margin rank over all of its negative items. Computing this exactly, however, can be very time-consuming when the number of items is very large. Thus, we instead uniformly sample negative feedback instances $j \notin D_u$ until a pairwise violation is found, that is, until $1 - f_u(i) + f_u(j) > 0$.
  • Using this procedure, each violating item $j$ has a probability of $1 / rank_i(f_u)$ of being the one drawn during the negative sampling step, which accounts for the denominator in the last line of the derivation above.

Now, writing the loss in terms of the single sampled violating item $j$, our equation becomes:

\begin{align} L \big( rank_i(f_u) \big) \approx log \big( rank_i(f_u) \big) \big| 1 - f_u(i) + f_u(j) \big|_+ \end{align}

The next brilliant idea is how to estimate $rank_i(f_u)$, the rank of item $i$. Recall that during the sampling procedure mentioned above, we sample negative items uniformly with replacement until we find one for which the pairwise violation holds. Hopefully, we can see that this step follows a geometric distribution: each draw is a Bernoulli trial, and its probability of success, i.e. of drawing a negative item whose score violates the pairwise comparison, is:

\begin{align} p = \frac{rank_i(f_u)}{\big|D_u'\big|} \end{align}

where $\big|D_u'\big|$ denotes the number of negative items. Now say we needed to sample $Q$ times before finding such a violation. Utilizing the geometric distribution's mean/expectation property, $E[Q] = 1 / p = |D_u'| / rank_i(f_u)$, which suggests the value of $rank_i(f_u)$ can be approximated by $\big\lfloor{\frac{|D_u'|}{Q}}\big\rfloor$, where $\lfloor{\cdot}\rfloor$ is the floor function.
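
A quick simulation with toy numbers confirms the expectation property driving this approximation: if $rank_i(f_u)$ of the $|D_u'|$ negative items violate the margin, it takes $|D_u'| / rank_i(f_u)$ draws on average to find one:

In [ ]:
# draws needed until the first success follow a geometric distribution
rng = np.random.RandomState(42)
n_negatives = 1000
true_rank = 50  # suppose 50 of the 1000 negatives are violations
p_violation = true_rank / n_negatives

Q = rng.geometric(p_violation, size = 100000)
print('average number of draws:', Q.mean())                 # roughly 20
print("|D_u'| / rank          :", n_negatives / true_rank)  # 20.0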

Given all of that, the final formula for estimating the WARP loss for a sampled pair of items $i$ and $j$ becomes:

\begin{align} L \big( rank_i(f_u) \big) \approx log \big( \big\lfloor{\frac{|D_u'|}{Q}}\big\rfloor \big) \big| 1 - f_u(i) + f_u(j) \big|_+ \end{align}
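
As a toy computation of this loss estimate for a single sampled triplet (made-up scores and sampling count):

In [ ]:
# estimated WARP loss for one (user, positive, negative) triplet
n_negatives = 1000
Q = 4                       # draws needed to find a violating negative
pos_score, neg_score = 0.2, 0.5

rank_estimate = n_negatives // Q             # floor(|D_u'| / Q) = 250
hinge = max(0.0, 1 - pos_score + neg_score)  # size of the margin violation
loss = np.log(rank_estimate) * hinge
print(loss)  # log(250) * 1.3, roughly 7.18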

Implementation

For the implementation part, we'll leverage the lightfm library.

LightFM provides a function for fetching the MovieLens 100K dataset, a small recommender dataset consisting of 943 users, 1682 movies, and 100,000 ratings. The ratings are on a scale from 1 to 5, but we'll treat them all as implicit positive feedback in this example.

In [4]:
# movielens is a dictionary and we
# can access the train and test key
# to obtain the training and test set
movielens = fetch_movielens()
train = movielens['train']
test = movielens['test']
train
Out[4]:
<943x1682 sparse matrix of type '<class 'numpy.int32'>'
	with 90570 stored elements in COOrdinate format>

The next two code chunks compare the performance of a model trained with the BPR loss against one trained with the WARP loss.

We'll use two metrics to evaluate the performance: precision@k and ROC AUC. Both are ranking metrics: to compute them, we'll construct recommendation lists for all of our users and check the ranking of known positive movies. For precision@k we'll check whether they appear within the first k results on the list; for AUC, we'll calculate the probability that any known positive is ranked higher on the list than a random negative example.

In [5]:
start = time()
model = LightFM(learning_rate = 0.05, loss = 'bpr')
model.fit(train, epochs = 50)
print(time() - start)

train_precision = precision_at_k(model, train, k = 10).mean()
test_precision = precision_at_k(model, test, k = 10).mean()

train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))
2.9854891300201416
Precision: train 0.62, test 0.09.
AUC: train 0.92, test 0.87.
In [6]:
start = time()
model = LightFM(learning_rate = 0.05, loss = 'warp')
model.fit(train, epochs = 50)
print(time() - start)

train_precision = precision_at_k(model, train, k = 10).mean()
test_precision = precision_at_k(model, test, k = 10).mean()

train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))
3.049008846282959
Precision: train 0.65, test 0.11.
AUC: train 0.95, test 0.91.

From the example above, we can see that even though WARP optimizes for precision@k, it obtains both a higher precision and a higher ROC AUC score than BPR. This won't necessarily be the case for every dataset, but based on this result we can infer that taking the extra step of ensuring the sampled negative item actually violates the pairwise comparison, and penalizing the loss more when a violating item is found quickly, does in fact give a performance boost.

Reference