# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))
from formats import load_style
load_style(css_style = 'custom2.css', plot_style = False)
os.chdir(path)
# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload
%autoreload 2
import numpy as np
from time import time
from scipy.stats import geom
from lightfm import LightFM
from lightfm.evaluation import auc_score
from lightfm.evaluation import precision_at_k
from lightfm.datasets import fetch_movielens
%watermark -a 'Ethen' -d -t -v -p scipy,numpy,lightfm
We'll start with a quick recap of the geometric distribution, as we'll be leveraging one of its properties later. The geometric distribution measures the number of trials needed to get the first success in repeated Bernoulli trials. Suppose we have a sequence of independent trials, where each trial results in one of two possible outcomes, success or failure. The probability of success, denoted $P(success) = p$, stays constant from trial to trial, and $X$ represents the number of trials needed to get the first success.
For the first success to occur on the $x_{th}$ trial:

- The first $x - 1$ trials must all be failures.
- The $x_{th}$ trial must be a success.

This gives us the probability mass function of the geometric distribution:
\begin{align}
P(X = x) = (1 - p)^{x - 1}p
\end{align}

For the geometric distribution, the minimum value $x$ can take is 1; however, there is no upper bound to that number.
Example: In a large population of adults, 30% have received CPR training. If adults from this population are randomly selected, what is the probability that the 6th person sampled is the first who has received CPR training?
# we could do it manually
p = 0.3
prob = (1 - p) ** 5 * p
print(prob)
# or use scipy stats' geometric distribution
# the first success occurring on the 6th trial
print(geom(p = p).pmf(k = 6))
Last but not least, one useful property of the geometric distribution is that the mean/expectation of the random variable $X$ is:
\begin{align}
E[X] = \frac{1}{p}
\end{align}
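As a quick sanity check, scipy's frozen geometric distribution exposes this expectation directly:

# the expected number of trials until the first success is 1 / p
print(geom(p = 0.3).mean())  # 3.333..., i.e. 1 / 0.3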
With that recap in place, let's turn to the WARP (Weighted Approximate-Rank Pairwise) loss. Like the Bayesian Personalized Ranking (BPR) model, WARP deals with (user, positive item, negative item) triplets. Unlike BPR, however, the negative items in the triplet are not chosen by pure random sampling: they are chosen from among those negative items that would violate the desired item ranking given the current state of the model. This approximates a form of active learning, where the model selects the triplets that it cannot currently rank correctly.

This procedure yields roughly the following algorithm (a toy sketch of the sampling loop follows the list):

1. For a given (user, positive item) pair, sample a negative item at random from all the remaining items. Compute the model's score for both items; if the negative item's score is within a fixed margin of (or above) the positive item's score, a rank violation has occurred, and we perform a gradient update to rank the positive item higher and the negative item lower. If there is no violation, keep sampling negative items until one is found.
2. If we found a violating negative item on the first try, make a large gradient update: this indicates that many negative items are currently ranked above the positive item, so the model needs to change by a large amount. If it took many samples to find a violating item, perform a small update: the model is likely close to optimal and should be updated at a low rate.
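Here is a minimal, illustrative numpy sketch of that sampling step for a single (user, positive item) pair. It is not lightfm's actual implementation; the scores, margin, and max_trials cutoff are hypothetical stand-ins:

# toy sketch of WARP's negative sampling step (not lightfm's implementation)
def warp_sample(scores, positive, negatives, margin = 1.0, max_trials = 100):
    """Sample negatives until one violates the margin against the positive item.

    scores    : hypothetical model scores for every item, shape (n_items,)
    positive  : index of the positive item
    negatives : array of indices of items the user has not interacted with
    Returns (violating negative item, number of trials Q), or None if no
    violation was found within max_trials.
    """
    for q in range(1, max_trials + 1):
        j = np.random.choice(negatives)
        # a violation occurs when the negative item scores within
        # `margin` of (or above) the positive item's score
        if scores[j] > scores[positive] - margin:
            return j, q
    return None  # no violation found, skip the gradient update

# early in training, scores are small, so violations are easy to find
np.random.seed(1)
scores = np.array([0.02, 0.6, 0.09, 0.4, 0.01])
print(warp_sample(scores, positive = 1, negatives = np.array([0, 2, 3, 4])))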
Given this intuition, we can formalize the idea with some notation in the next section.
In the context of making recommendations from implicit feedback data, we are interested in ranking items $i$ for a given user $u$. We let $f_u \in \mathbb{R}^Y$ be a vector-valued scoring function providing a score for each of the $Y$ items, where $f_u(i)$ is the score for item $i$. At the end of the day, after the scoring function has scored all the items, the items with the highest scores are recommended.
The class of ranking loss functions we'll be looking at today is defined as:
\begin{align}
\sum_{i \in D_u} L \big( rank_i(f_u) \big)
\end{align}

Where:

- $D_u$ denotes the set of positive items for user $u$.
- $rank_i(f_u)$ is the rank of positive item $i$ induced by the scores, i.e. the number of negative items scored at least as high: $rank_i(f_u) = \sum_{j \notin D_u} I \big( f_u(j) \ge f_u(i) \big)$, where $I(\cdot)$ is the indicator function.
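As a concrete toy example (the scores and indices here are made up):

# rank of a positive item = number of negative items scoring at least as high
scores = np.array([0.2, 0.9, 0.5, 0.7, 0.1])  # hypothetical f_u over 5 items
positive = 3                                   # D_u = {3}
negatives = np.array([0, 1, 2, 4])             # all other items
rank = np.sum(scores[negatives] >= scores[positive])
print(rank)  # 1, only item 1 (score 0.9) outranks the positive item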
Now, going back to $L$: $L(\cdot)$ is in charge of transforming the rank into the actual loss:
\begin{align}
L(k) = \sum_{t = 1}^k a_t, \enspace a_1 \ge a_2 \ge \ldots \ge 0
\end{align}

Different settings of $a_t$ allow the loss function to optimize for different objectives, e.g.:

- $a_t = 1$ for all $t$ gives $L(k) = k$, penalizing a positive item linearly in its rank and thus optimizing the mean rank.
- $a_1 = 1$ and $a_t = 0$ for $t > 1$ only cares whether a positive item is ranked first, optimizing the proportion of top-ranked correct items (precision@1).
- $a_t = 1/t$ assigns large weights to the top positions and rapidly decaying weights to lower positions, optimizing the top of the ranked list (roughly, precision@k).
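One way to see why a logarithm shows up in the derivation below: with the $a_t = 1/t$ weighting, $L(k)$ is the $k_{th}$ harmonic number, which grows roughly like the logarithm of the rank:

\begin{align}
L(k) = \sum_{t = 1}^k \frac{1}{t} \approx \log(k)
\end{align}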
Optimizing for the top-k ranked items is often a desired property in recommender systems: we typically only recommend a handful of items to each user, so we rarely care about the ranking of the entire item set.
The problem with this version of the loss function is that, while it does depend on the model's parameters, the dependence is not continuous (the rank is integer-valued), so we can't derive gradients to optimize it directly. The approach the original authors took is to derive a differentiable approximation to the logarithm of the rank. The rank loss for a specific item $i \in D_u$ is rewritten as:
\begin{align}
L \big( rank_i(f_u) \big)
&= \log \big( rank_i(f_u) \big) \frac{rank_i(f_u)}{rank_i(f_u)} \\
&= \log \big( rank_i(f_u) \big) \frac{\sum_{j \notin D_u} I \big( f_u(j) \ge f_u(i) \big)}{rank_i(f_u)} \\
&= \sum_{j \notin D_u} \log \big( rank_i(f_u) \big) \frac{I \big( f_u(j) \ge f_u(i) \big)}{rank_i(f_u)} \\
&\approx \sum_{j \notin D_u} \log \big( rank_i(f_u) \big) \frac{\big| 1 - f_u(i) + f_u(j) \big|_+}{rank_i(f_u)}
\end{align}

In the last step, the discontinuous indicator function is replaced by its continuous upper bound, the margin hinge $\big| 1 - f_u(i) + f_u(j) \big|_+$, where $|\cdot|_+$ denotes the positive part, $max(\cdot, 0)$. Next, notice that during training we sample a single violating negative item $j$, uniformly from the set of violating items, whose size is exactly $rank_i(f_u)$. In expectation, this sampling cancels both the summation over $j$ and the $1 / rank_i(f_u)$ term, so for the sampled triplet the loss becomes:
\begin{align}
L \big( rank_i(f_u) \big) \approx \log \big( rank_i(f_u) \big) \big| 1 - f_u(i) + f_u(j) \big|_+
\end{align}

The next brilliant idea is how to estimate $rank_i(f_u)$, the rank of item $i$. Recall that during the sampling procedure mentioned above, we sample negative items uniformly with replacement until we find one whose score produces a pairwise violation. The number of draws this takes follows a geometric distribution. To formalize this: on any single draw, the probability of sampling a negative item whose score violates the pairwise comparison is:
\begin{align}
p = \frac{rank_i(f_u)}{\big|D_u'\big|}
\end{align}

where $\big|D_u'\big|$ denotes the number of negative items. Now, say we needed to sample $Q$ times before finding such a violation. Utilizing the geometric distribution's mean/expectation property, $E[Q] = 1/p$, we have $p \approx 1/Q$, which suggests the value of $rank_i(f_u)$ can be approximated by $\big\lfloor{\frac{|D_u'|}{Q}}\big\rfloor$, where $\lfloor{\cdot}\rfloor$ is the floor function.
Given all of that, the final formula for estimating the WARP loss for a given item $i$ becomes:
\begin{align}
L \big( rank_i(f_u) \big) \approx \log \Big( \big\lfloor{\frac{|D_u'|}{Q}}\big\rfloor \Big) \big| 1 - f_u(i) + f_u(j) \big|_+
\end{align}
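Putting the pieces together, here's a small illustrative computation of this estimated loss for a single triplet; the item counts and scores are made up:

# estimated WARP loss for one (user, positive, sampled negative) triplet
n_negatives = 1000   # |D_u'|, hypothetical
Q = 4                # number of draws needed to find a violating negative
f_i, f_j = 0.6, 0.3  # hypothetical scores for the positive / negative item

estimated_rank = n_negatives // Q  # floor(1000 / 4) = 250
hinge = max(0.0, 1.0 - f_i + f_j)  # |1 - f_u(i) + f_u(j)|_+
loss = np.log(estimated_rank) * hinge
print(loss)  # log(250) * 0.7, roughly 3.87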
For the implementation, we'll leverage the lightfm library, which implements the WARP loss described above.
LightFM provides a function for fetching the MovieLens 100K dataset, a small recommender dataset consisting of around 950 users, 1700 movies, and 100,000 ratings. The ratings are on a scale from 1 to 5, but we'll treat them all as implicit positive feedback in this example.
# movielens is a dictionary and we
# can access the train and test key
# to obtain the training and test set
movielens = fetch_movielens()
train = movielens['train']
test = movielens['test']
train
The next two code chunks compare the performance of a model trained with the BPR loss as its objective against one trained with the WARP loss.
We'll use two metrics to evaluate the performance: precision@k and ROC AUC. Both are ranking metrics: to compute them, we'll construct a recommendation list for each of our users and check the ranking of the known positive movies. For precision@k, we'll check whether they are within the first k results on the list; for AUC, we'll calculate the probability that any known positive is ranked higher on the list than a random negative example.
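For intuition, here's a minimal sketch of precision@k for a single user; the ranked list and positives are made-up placeholders (the actual evaluation below uses lightfm's built-in functions):

# precision@k for one user: fraction of the top-k recommendations
# that are known positives
def precision_at_k_single(ranked_items, positives, k = 10):
    top_k = ranked_items[:k]
    return len(set(top_k) & set(positives)) / k

# hypothetical ranked list and positives: 3 of the top 10 are hits
print(precision_at_k_single(list(range(20)), positives = [0, 5, 9, 42], k = 10))  # 0.3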
start = time()
model = LightFM(learning_rate = 0.05, loss = 'bpr')
model.fit(train, epochs = 50)
print(time() - start)
train_precision = precision_at_k(model, train, k = 10).mean()
test_precision = precision_at_k(model, test, k = 10).mean()
train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()
print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))
start = time()
model = LightFM(learning_rate = 0.05, loss = 'warp')
model.fit(train, epochs = 50)
print(time() - start)
train_precision = precision_at_k(model, train, k = 10).mean()
test_precision = precision_at_k(model, test, k = 10).mean()
train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()
print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))
From the example above, we can see that even though WARP optimizes for precision@k, it obtains both a higher precision and a higher ROC AUC score than BPR here. This won't necessarily hold for every dataset, but based on this result we can infer that taking the extra step of ensuring the sampled negative item actually violates the pairwise comparison, and penalizing the loss more when a violating item is found quickly, does in fact give a performance boost.