In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))
from formats import load_style
load_style(css_style = 'custom2.css', plot_style = False)
Out[1]:
In [2]:
os.chdir(path)

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from math import ceil
from tqdm import trange
from subprocess import call
from scipy.sparse import csr_matrix, dok_matrix

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,tqdm,scipy
Ethen 2018-04-07 14:28:04 

CPython 3.6.4
IPython 6.2.1

numpy 1.14.2
pandas 0.22.0
sklearn 0.19.1
tqdm 4.19.8
scipy 1.0.0

Collaborative Filtering for Implicit Feedback Datasets

One common scenario in real-world recommendation system is we only have implicit instead of explicit user-item interaction data. To elaborate on this a little bit more, a user may be searching for an item on the web, or listening to songs. Unlike a rating data, where we have direct access to the user's preference towards an item, these type of actions do not explicitly state or quantify any preference of the user for the item, but instead gives us implicit confidence about the user's opinion.

Even when we do have explicit data, it might still be a good idea to incorporate implicit data into the model. Consider, for example, listening to songs. When users listen to music on a streaming service, they might rarely ever rate a song that he/she like or dislike. But more often they skip a song, or listen only halfway through it if they dislike it. If the user really liked a song, they will often come back and listen to it. So, to infer a user's musical taste profile, their listens, repeat listens, skips and fraction of tracks listened to, etc. might be far more valuable signals than explicit ratings.

Formulation

Recall from the previous notebook that the loss function for training the recommendation model on explicit feedback data was:

$$ \begin{align} L_{explicit} &= \sum\limits_{u,i \in S}( r_{ui} - x_{u} y_{i}^{T} )^{2} + \lambda \big( \sum\limits_{u} \left\Vert x_{u} \right\Vert^{2} + \sum\limits_{i} \left\Vert y_{i} \right\Vert^{2} \big) \end{align} $$

Where:

  • $r_{ui}$ is the true rating given by user $u$ to the item $i$
  • $x_u$ and $y_i$ are user u's and item i's latent factors, both are $1×d$ dimensional, where $d$ the number of latent factors that the user can specify
  • $S$ was the set of all user-item ratings
  • $\lambda$ controls the regularization strength that prevents overfitting the user and item vectors

To keep it concrete, let's assume we're working music data and the value of our $r_{ui}$ will consists of implicit ratings that counts the number of times a user has listened to a song (song listen count). Then new formulation becomes:

$$ \begin{align} L_{implicit} &= \sum\limits_{u,i} c_{ui}( p_{ui} - x_{u} y_{i}^{T} )^2 + \lambda \big( \sum\limits_{u} \left\Vert x_{u} \right\Vert^{2} + \sum\limits_{i} \left\Vert y_{i} \right\Vert^{2} \big) \end{align} $$

Recall that with implicit feedback, we do not have ratings anymore; rather, we have users' preferences for items. Therefore, in the new loss function, the ratings $r_{ui}$ has been replaced with a preference $p_{ui}$ indicating the preference of user $u$ to item $i$. $p_{ui}$ is a set of binary variables and is computed by binarizing $r_{ui}$.

$$ \begin{align} p_{ui} &= \begin{cases} 1 &\mbox{if } r_{ui} > 0 \\ 0 & \mbox{otherwise} \end{cases} \end{align} $$

We make the assumption that if a user has interacted at all with an item ($r_{ui} > 0$), then we set $p_{ui} = 1$ to indicate that user $u$ has a liking/preference for item $i$. Otherwise, we set $p_{ui} = 0$. However, these assumptions comes with varying degrees of confidence. First of all, when $p_{ui} = 0$, we assume that it should be associated with a lower confidence, as there are many reasons beyond disliking the item as to why the user has not interacted with it. e.g. Unaware of it's existence. On the other hand, as the number of implicit feedback, $r_{ui}$, grows, we have a stronger indication that the user does indeed like the item (regardless of whether he/she if buying a gift for someone else). So to measure the level of confidence mentioned above, we introduce another set of variables $c_{ui}$ that measures our confidence in observing $p_{ui}$:

$$ \begin{align} c_{ui} = 1 + \alpha r_{ui} \end{align} $$

Where the 1 ensures we have some minimal confidence for every user-item pair, and as we observe more and more implicit feedback (as $r_{ui}$ gets larger and larger), our confidence in $p_{ui} = 1$ increases accordingly. The term $\alpha$ is a parameter that we have to specify to control the rate of the increase. This formulation makes intuitive sense when we look back at the $c_{ui}( p_{ui} - x_{u} y_{i}^{T} )^2$ term in the loss function. A larger $c_{ui}$ means that the prediction $x_{u} y_{i}^{T}$ has to be that much closer to $p_{ui}$ so that term will not contribute too much to the total loss.

The implementation in the later section will be based on the formula above, but note that there are many ways in which we can tune the formulation above. For example, we can derive $p_{ui}$ from $r_{ui}$ differently. So instead of setting the binary cutoff at 0, we can set it at another threshold that we feel is appropriate for the domain. Similarly, there are many ways to transform $r_{ui}$ into the confidence level $c_{ui}$. e.g. we can use:

$$ \begin{align} c_{ui} = 1 + \alpha log \left( 1 + r_{ui} / \epsilon \right) \end{align} $$

Regardless of the scheme, it's important to realize that we are transforming the raw observation $r_{ui}$ into two distinct representation, the preference $p_{ui}$ and the confidence levels of the preference $c_{ui}$.

Alternating Least Squares

Let's assume we have $m$ users and $n$ items. Now, to solve for the loss function above, we start by treating $y_i$ as constant and solve the loss function with respect to $x_u$. To do this, we rewrite and expand the first term in the loss function (excluding the regularization terms), $\sum\limits_{u,i} c_{ui}( p_{ui} - x_{u} y_{i}^{T} )^2$ part as:

$$ \begin{align} \sum\limits_{u,i} c_{ui}( p_{ui} - x_{u} y_{i}^{T} )^2 &= \sum_u c_u( p_u^T - x_u Y^T )^2 \\ &= \sum_u p_u^T C^u p_u - 2 x_u Y^T C^u p_u + x_u Y^T C^u Y x_u^T \end{align} $$

Where:

  • $Y \in \mathbb{R}^{n \times d}$ represents all item row vectors vertically stacked on each other
  • $p_u \in \mathbb{R^{n \times 1}}$ contains element all of the preferences of the user
  • The diagonal matrix $C^u \in \mathbb{R^{n \times n}}$ consists of $c_{ui}$ in row/column $i$, which is the user's confidence across all the items. e.g. if $u = 0$ then the matrix for user $u_0$ will look like:
$$ \begin{align} {C}^{u_0} = \begin{bmatrix} c_{u_{01}} & 0 & 0 & 0 & ... & 0 \\ 0 & c_{u_{02}} & 0 & 0 & ... &0\\ ... \\ ... \\ 0 & 0 & 0 & 0 & ... & c_{u_{0n}}\end{bmatrix} \end{align} $$

The formula above can also be used to monitor the loss function at each iteration. If we set $A = Y^T C^u Y$ and $b = Y^T C^u$, the last two terms can be rewritten as $(A x_u^T - 2b p_u) x_u$. As for the first term $p_u^T C^u p_u$ we can leverage the fact that $p_u$ is 1 for all positive items, and just sum the confidence term $C^u$.

Now for the derivation of the partial derivative.

$$ \begin{align} \frac{\partial L_{implicit}}{\partial x_u} &= -2 Y^T C^u p_u + 2 Y^T C^u Y x_u + 2 \lambda x_u = 0 \\ &= (Y^T C^u Y + \lambda I)x_u = Y^T C^u p_{u} \\ &= x_u = (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p_u \end{align} $$

The main computational bottleneck in the expression above is the need to compute $Y^T C^u Y$ for every user. Speedup can be obtained by re-writing the expression as:

$$ \begin{align} {Y}^T {C}^{u} {Y} &= Y^T Y + {Y}^T \left( C^u - I \right) Y \end{align} $$

Now the term $Y^T Y$ becomes independent of each user and can be computed independently, next notice $\left(C^u - I \right)$ has only $n_u$ non-zero elements, where $n_u$ is the number of items for which $r_{ui} > 0$. Similarly, $C^u p_u$ contains only $n_u$ non-zero elements since $p_u$ is a binary transformation of $r_{ui}$. Thus the final formulation becomes:

$$ \begin{align} \frac{\partial L_{implicit}}{\partial x_u} &= x_u = (Y^T Y + Y^T \left( C^u - I \right) Y + \lambda I)^{-1} Y^T C^u p_u \end{align} $$

After solving for $x_u$ the same procedure can be carried out to solve for $y_i$ giving a similar expression:

$$ \begin{align} \frac{\partial L_{implicit}}{\partial y_i} &= y_i = (X^T X + X^T \left( C^i - I \right) X + \lambda I)^{-1} X^T C^i p_i \end{align} $$

Implementation

We'll use the same movielens dataset like the previous notebook. The movielens data is not an implicit feedback dataset as the user did provide explicit ratings, but we will use it for now to test out our implementation. The overall preprocessing procedure of loading the data and doing the train/test split is the same as the previous notebook. But here we'll do it in a sparse matrix fashion.

In [3]:
file_dir = 'ml-100k'
file_path = os.path.join(file_dir, 'u.data')
if not os.path.isdir(file_dir):
    call(['curl', '-O', 'http://files.grouplens.org/datasets/movielens/' + file_dir + '.zip'])
    call(['unzip', file_dir + '.zip'])

names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv(file_path, sep = '\t', names = names)
print('data dimension: \n', df.shape)
df.head()
data dimension: 
 (100000, 4)
Out[3]:
user_id item_id rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
In [4]:
def create_matrix(data, user_col, item_col, rating_col):
    """
    creates the sparse user-item interaction matrix
    
    Parameters
    ----------
    data : DataFrame
        implicit rating data

    user_col : str
        user column name

    item_col : str
        item column name
    
    ratings_col : str
        implicit rating column name

    Returns
    -------
    ratings : scipy sparse csr_matrix [n_users, n_items]
        user/item ratings matrix

    data : DataFrame
        the implict rating data that retains only the positive feedback
        (if specified to do so)
    """
    # map each item and user to a unique numeric value
    for col in (item_col, user_col):
        data[col] = data[col].astype('category')
    
    # create a sparse matrix of using the (rating, (rows, cols)) format
    rows = data[user_col].cat.codes
    cols = data[item_col].cat.codes
    rating = data[rating_col]
    ratings = csr_matrix((rating, (rows, cols)))
    ratings.eliminate_zeros()
    return ratings, data
In [5]:
user_col = 'user_id'
item_col = 'item_id'
rating_col = 'rating'
X, df = create_matrix(df, user_col, item_col, rating_col)
X
Out[5]:
<943x1682 sparse matrix of type '<class 'numpy.int64'>'
	with 100000 stored elements in Compressed Sparse Row format>

The following train and test set split function is assuming that you're doing a train/test split using the current dataset. Though it's probably better to use time to perform the train/test split. For example, using the year 2016's data as training and the 1 first month of 2017's data as testing.

In [6]:
def create_train_test(ratings, test_size = 0.2, seed = 1234):
    """
    split the user-item interactions matrix into train and test set
    by removing some of the interactions from every user and pretend
    that we never seen them
    
    Parameters
    ----------
    ratings : scipy sparse csr_matrix
        The user-item interactions matrix
    
    test_size : float between 0.0 and 1.0, default 0.2
        Proportion of the user-item interactions for each user
        in the dataset to move to the test set; e.g. if set to 0.2
        and a user has 10 interactions, then 2 will be moved to the
        test set
    
    seed : int, default 1234
        Seed for reproducible random splitting the 
        data into train/test set
    
    Returns
    -------
    train : scipy sparse csr_matrix
        Training set
    
    test : scipy sparse csr_matrix
        Test set
    """
    assert test_size < 1.0 and test_size > 0.0

    # Dictionary Of Keys based sparse matrix is more efficient
    # for constructing sparse matrices incrementally compared with csr_matrix
    train = ratings.copy().todok()
    test = dok_matrix(train.shape)
    
    # 1. for all the users assign randomly chosen interactions
    # to the test and assign those interactions to zero in the training;
    # when computing the interactions to go into the test set, 
    # remember to round up the numbers (e.g. a user has 4 ratings, if the
    # test_size is 0.2, then 0.8 ratings will go to test, thus we need to
    # round up to ensure the test set gets at least 1 rating);
    # 2. note that we can easily the parallelize the for loop if we were to
    # aim for a more efficient implementation
    rstate = np.random.RandomState(seed)
    for u in range(ratings.shape[0]):
        split_index = ratings[u].indices
        n_splits = ceil(test_size * split_index.shape[0])
        test_index = rstate.choice(split_index, size = n_splits, replace = False)
        test[u, test_index] = ratings[u, test_index]
        train[u, test_index] = 0
    
    train, test = train.tocsr(), test.tocsr()
    return train, test
In [7]:
seed = 1234
test_size = 0.2
X_train, X_test = create_train_test(X, test_size, seed)
X_train
Out[7]:
<943x1682 sparse matrix of type '<class 'numpy.int64'>'
	with 79619 stored elements in Compressed Sparse Row format>

The following implementation uses some tricks to speed up the procedure. First of all, when we need to solve $Ax = b$ where $A$ is an $n \times n$ matrix, a lot of books might write the solution as $x = A^{-1} b$, however, in practice there is hardly ever a good reason to calculate that it that way as solving the equation $Ax = b$ is faster than finding $A^{-1}$.

The next one is the idea of computing matrix product $X^T X$ using a outer product of each row.

In [8]:
# example matrix
X = np.array([[9, 3, 5], [4, 1, 2]]).T
X
Out[8]:
array([[9, 4],
       [3, 1],
       [5, 2]])
In [9]:
# normal matrix product
X.T.dot(X)
Out[9]:
array([[115,  49],
       [ 49,  21]])
In [10]:
# intialize an empty array
end_result = np.zeros((2, 2))

# loop through each row add up the outer product
for i in range(X.shape[0]):
    out = np.outer(X[i], X[i])
    end_result += out
    print('row:\n', X[i])
    print('outer product of row:\n', out)

end_result
row:
 [9 4]
outer product of row:
 [[81 36]
 [36 16]]
row:
 [3 1]
outer product of row:
 [[9 3]
 [3 1]]
row:
 [5 2]
outer product of row:
 [[25 10]
 [10  4]]
Out[10]:
array([[115.,  49.],
       [ 49.,  21.]])

The reason why this can speed things up is that the matrix product is now the sum of the outer products of the rows, where each row's computation is independent of another can be computed in the parallelized fashion then added back together!

Last but not least, is exploiting the property of scipy's Compressed Sparse Row Matrix to access the non-zero elements. For those that are unfamiliar with it, the following link has a pretty decent quick introduction. Blog: Empty rows in sparse arrays.

In [11]:
class ALSWR:
    """
    Alternating Least Squares with Weighted Regularization
    for implicit feedback

    Parameters
    ----------
    n_iters : int
        number of iterations to train the algorithm

    n_factors : int
        number/dimension of user and item latent factors

    alpha : int
        scaling factor that indicates the level of confidence in preference

    reg : int
        regularization term for the user and item latent factors

    seed : int
        seed for the randomly initialized user, item latent factors

    Reference
    ---------
    Y. Hu, Y. Koren, C. Volinsky Collaborative Filtering for Implicit Feedback Datasets
    http://yifanhu.net/PUB/cf.pdf
    """
    def __init__(self, n_iters, n_factors, alpha, reg, seed):
        self.reg = reg
        self.seed = seed
        self.alpha = alpha
        self.n_iters = n_iters
        self.n_factors = n_factors
    
    def fit(self, ratings):
        """
        ratings : scipy sparse csr_matrix [n_users, n_items]
            sparse matrix of user-item interactions
        """        
        # the original confidence vectors should include a + 1,
        # but this direct addition is not allowed when using sparse matrix,
        # thus we'll have to deal with this later in the computation
        Cui = ratings.copy().tocsr()
        Cui.data *= self.alpha
        Ciu = Cui.T.tocsr()
        self.n_users, self.n_items = Cui.shape
        
        # initialize user latent factors and item latent factors
        # randomly with a specified set seed
        rstate = np.random.RandomState(self.seed)
        self.user_factors = rstate.normal(size = (self.n_users, self.n_factors))
        self.item_factors = rstate.normal(size = (self.n_items, self.n_factors))
        
        for _ in trange(self.n_iters, desc = 'training progress'):
            self._als_step(Cui, self.user_factors, self.item_factors)
            self._als_step(Ciu, self.item_factors, self.user_factors)  
        
        return self
    
    def _als_step(self, Cui, X, Y):
        """
        when solving the user latent vectors,
        the item vectors will be fixed and vice versa
        """
        # the variable name follows the notation when holding
        # the item vector Y constant and solving for user vector X
        
        # YtY is a d * d matrix that is computed
        # independently of each user
        YtY = Y.T.dot(Y)
        data = Cui.data
        indptr, indices = Cui.indptr, Cui.indices

        # for every user build up A and b then solve for Ax = b,
        # this for loop is the bottleneck and can be easily parallized
        # as each users' computation is independent of one another
        for u in range(self.n_users):
            # initialize a new A and b for every user
            b = np.zeros(self.n_factors)
            A = YtY + self.reg * np.eye(self.n_factors)
            
            for index in range(indptr[u], indptr[u + 1]):
                # indices[index] stores non-zero positions for a given row
                # data[index] stores corresponding confidence,
                # we also add 1 to the confidence, since we did not 
                # do it in the beginning, when we were to give every 
                # user-item pair and minimal confidence
                i = indices[index]
                confidence = data[index] + 1
                factor = Y[i]

                # for b, Y^T C^u p_u
                # there should be a times 1 for the preference 
                # Pui = 1
                # b += confidence * Y[i] * Pui
                # but the times 1 can be dropped
                b += confidence * factor
                
                # for A, Y^T (C^u - I) Y
                A += (confidence - 1) * np.outer(factor, factor)

            X[u] = np.linalg.solve(A, b)
        
        return self

    def predict(self):
        """predict ratings for every user and item"""
        prediction = self.user_factors.dot(self.item_factors.T)
        return prediction
    
    def _predict_user(self, user):
        """
        returns the predicted ratings for the specified user,
        this is mainly used in computing evaluation metric
        """
        user_pred = self.user_factors[user].dot(self.item_factors.T)
        return user_pred
In [12]:
als = ALSWR(n_iters = 15, n_factors = 20, alpha = 15, reg = 0.01, seed = 1234)
als.fit(X_train)
training progress: 100%|██████████| 15/15 [00:28<00:00,  1.93s/it]
Out[12]:
<__main__.ALSWR at 0x1079186d8>

Ranking Metrics

Now that we've built our recommendation engine, the next important thing to do is to evaluate our model's offline performance.

Let's say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. In other words, what we're after is the top-N recommendation for each user and after recommending these items to the user, we need a way to measure whether the recommendation is useful or not. One key thing to note is that metrics such as RMSE might not be the best at assessing the quality of recommendations because the training focused on items with the most ratings, achieving a good fit for those. The items with few ratings don't mean much in terms of their impact on the loss. As a result, predictions for them will be off.

Long story short, we need metrics specifically crafted for ranking evaluation and the two most popular ranking metrics are MAP (Mean Average Precision) and NDCG (Normalized Discounted Cumulative Gain). The main difference between the two is that MAP assumes binary relevance (an item is either of interest or not, e.g. a user clicked a link, watched a video, purchased a product), while NDCG can be used in any case where we can assign relevance score to a recommended item (binary, integer or real). The relationship is just like classification and regression.

Mean Average Precision

For this section of the content, a large portion is based on the excellent blog post at the following link. Blog: What you wanted to know about Mean Average Precision. This documentation builds on top of it by carrying out the educational implementation.

Let's say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. After recommending the items to the user, we need a way to measure whether the recommendation is useful or not. One way to do this is using MAP@k (Mean Average Precision at k) .

The intuition behind this evaluation metric is that:

  • We can recommend at most k items for each user (this is essentially the @k part), but we will be penalized for bad recommendations
  • Order matters, so it's better to submit more certain recommendations first, followed by recommendations we are less sure about

Diving a bit deeper, we will first ignore @k and get M out of the way. MAP is just the mean of APs, or average precision, for all users. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This is MAP. So now, what is AP, or average precision?

One way to understand average precision is this way: we type something in Google and it shows us 10 results. It's probably best if all of them were relevant. If only some are relevant, say five of them, then it's much better if the relevant ones are shown first. It would be bad if first five were irrelevant and good ones only started from sixth, wouldn't it? The formula for computing AP is:

sum i=1:k of (precision at i * change in recall at i)

Where precision at i is a percentage of correct items among first i recommendations. Change in recall is 1/k if item at i is correct (for every correct item), otherwise zero. Note that this is base on the assumption that the number of relevant items is bigger or equal to k: r >= k. If not, change in recall is 1/r for each correct i instead of 1/k.

For example, If the actual items were [1 2 3 4 5] and we recommend [6 4 7 1 2]. In this case we get 4, 1 and 2 right, but with some incorrect guesses in between. Now let's say we were to compute AP@2, so only two first predictions matter: 6 and 4. First is wrong, so precision@1 is 0. Second is right, so precision@2 is 0.5. Change in recall is 0 and 0.5 (that's 1/k) respectively, so AP@2 = 0 0 + 0.5 0.5 = 0.25

In [13]:
def compute_apk(y_true, y_pred, k):
    """
    average precision at k, y_pred is assumed 
    to be truncated to length k prior to feeding
    it to the function
    """
    # convert to set since membership 
    # testing in a set is vastly faster
    actual = set(y_true)
    
    # precision at i is a percentage of correct 
    # items among first i recommendations; the
    # correct count will be summed up by n_hit
    n_hit = 0
    precision = 0
    for i, p in enumerate(y_pred, 1):
        if p in actual:
            n_hit += 1
            precision += n_hit / i

    # divide by recall at the very end
    avg_precision = precision / min(len(actual), k)
    return avg_precision
In [14]:
# example 1

# y_true, is the true interaction of a user
# and y_pred is the recommendation we decided
# to give to the user
k = 2
y_true = np.array([1, 2, 3, 4, 5])
y_pred = np.array([6, 4, 7, 1, 2])
compute_apk(y_true, y_pred[:k], k) # 0.25
Out[14]:
0.25
In [15]:
# example 2

k = 5
y_true = np.array([1, 2])
y_pred = np.array([6, 4, 7, 1, 2])
compute_apk(y_true, y_pred[:k], k) # 0.325
Out[15]:
0.325

After computing the average precision for this individual user, we then compute this for every single user and take the mean of these values, then that would essentially be our MAP@k, mean average precision at k. For this metric, the bigger the better.

In [16]:
def mapk_score(model, ratings, k):
    """
    mean average precision at rank k for the ALS model

    Parameters
    ----------
    model : ALSWR instance
        fitted ALSWR model

    ratings : scipy sparse csr_matrix [n_users, n_items]
        sparse matrix of user-item interactions

    k : int
        mean average precision at k's k
        
    Returns
    -------
    mapk : float
        the mean average precision at k's score
    """
    # compare the top k predictions' index to the actual index,
    # the model is assumed to have the _predict_user method
    mapk = 0
    n_users = ratings.shape[0]
    for u in range(n_users):
        y_true = ratings[u].indices
        u_pred = model._predict_user(u)
        y_pred = np.argsort(u_pred)[::-1][:k]
        mapk += compute_apk(y_true, y_pred, k)

    mapk /= n_users
    return mapk
In [17]:
k = 5
mapk_train = mapk_score(als, X_train, k)
mapk_test = mapk_score(als, X_test, k)
print('mapk training:', mapk_train)
print('mapk testing:', mapk_test)
mapk training: 0.20038882997525595
mapk testing: 0.05916489925768833

Note that it's normal for this metric to be low. We can compare this metric with a baseline to get a sense of how well the algorithm is performing. And a nice baseline for recommendation engine is to simply recommend every user the most popular items (items that has the most user interaction)

NDCG

Suppose that on a four-point scale, we give a 0 score for an irrelevant result, 1 for a partially relevant, 2 for relevant, and 3 for perfect. Suppose also that a query is judged by one of our judges, and the first four results that the search engine returns are assessed as relevant (2), irrelevant (0), perfect (3), and relevant (2) by a judge.

The idea behind NDCG is: A recommender returns some items and we'd like to compute how good the list is. Each item has a relevance score, usually a non-negative number. That's gain (G). For items we don't have user feedback for we usually set the gain to zero. Now we add up those scores, that's cumulative gain (CG). So, the cumulative gain for the four results is the sum of the scores for each result: 2 + 0 + 3 + 2 = 7. In mathematical notations, the CG at a particular rank ${\displaystyle k}$ is defined as:

\begin{align} CG_k = \sum_{i=1}^k rel_i \end{align}

Where $rel_i$ is the graded relevance of the result at position ${\displaystyle i}$. As we can see from the formula, the value computed with this function is unaffected by changes in the ordering of search results, thus DCG is used in place of CG for a more accurate measure about the usefulness of results' ranking.

When evaluating rankings, we’d prefer to see the most relevant items at the top of the list, i.e the first result in our search results is more important than the second, the second more important than the third, and so on. Therefore before summing the scores we divide each by a growing number, which is the discount (D). One simple way to make position one more important than two (and so on) is to divide each score by the rank. So, for example, if the third result is "great", its contribution is $3 / 3 = 1$ (since the score for "great" is 3 , and the rank of the result is 3). If "great" were the first result, its contribution would be 3 / 1 = 3. Though in practice, it's more common to discount it using a logarithm of the item position, giving us:

\begin{align} DCG_k = \sum_{i=1}^k \frac{rel_i}{\log_2{\left(i+1\right)}} \end{align}
In [18]:
def dcg_at_k(score, k = None):
    """
    discounted cumulative gain (dcg)
    
    Parameters
    ----------
    score : 1d nd.array
        ranking/relevance score
        
    k : int, default None
        evaluate the measure for the top-k ranking score,
        default None evaluates all
        
    Returns
    -------
    dcg: float
    """
    if k is not None:
        score = score[:k]

    discounts = np.log2(np.arange(2, score.size + 2))
    dcg = np.sum(score / discounts)
    return dcg


score = np.array([2, 0, 3, 2])
dcg_at_k(score)
Out[18]:
4.361353116146786

There's an alternative formulation of DCG that places stronger emphasis on retrieving relevant documents:

\begin{align} DCG_k = \sum_{i=1}^k \frac{2^{rel_i} - 1}{\log_2{\left(i+1\right)}} \end{align}

This formula is commonly used in industry including major web search companies and data science competition platform such as Kaggle.

In [19]:
def dcg_at_k(score, k = None):
    """
    discounted cumulative gain (dcg)
    
    Parameters
    ----------
    score : 1d nd.array
        ranking/relevance score
        
    k : int, default None
        evaluate the measure for the top-k ranking score,
        default None evaluates all
        
    Returns
    -------
    dcg: float
    """
    if k is not None:
        score = score[:k]

    gain = 2 ** score - 1
    discounts = np.log2(np.arange(2, score.size + 2))
    dcg = np.sum(gain / discounts)
    return dcg


score = np.array([2, 0, 3, 2])
dcg_at_k(score)
Out[19]:
7.79202967422018

The final touch to this metric is Normalized (N). It's not fair to compare DCG values across queries because some queries are easier than others or result lists vary in length depending on the query, so we normalize them by: First, figure out what the best ranking score is for this result and compute DCG for that, then we divide the raw DCG by this ideal DCG to get NDCG@K, a number between 0 and 1. In our previous example, we had 2, then 0, 3, and a 2. The best arrangement of these same results would have been: 3, 2, 2, 0, that is, if the "great" result had been ranked first, followed by the two "relevant" ones, and then the "irrelevant". So we compute the DCG score for the rank 3, 2, 2, 0 to obtain our ideal DCG (IDCG) and simply perform:

\begin{align} NDCG_k = \frac{DCG_k}{IDCG_k} \end{align}

to obtain our final ndcg.

In [20]:
def ndcg_at_k(score, k = None):
    """
    normalized discounted cumulative gain (ndcg)
    
    Parameters
    ----------
    score : 1d nd.array
        ranking/relevance score
        
    k : int, default None
        evaluate the measure for the top-k ranking score,
        default None evaluates all
        
    Returns
    -------
    ndcg: float, 0.0 ~ 1.0
    """
    actual_dcg = dcg_at_k(score, k)
    sorted_score = np.sort(score)[::-1]
    best_dcg = dcg_at_k(sorted_score, k)
    ndcg = actual_dcg / best_dcg
    return ndcg


ndcg_at_k(score)
Out[20]:
0.7497534568197889

The next section modifies the function API a little bit so it becomes more suitable for evaluating the recommendation engine.

In [21]:
def ndcg_score(model, ratings, k):
    """
    Normalized discounted cumulative gain (NDCG) at rank k
    for the ALS model; which computes the ndcg score for
    each users' recommendation and does a simply average
    
    Parameters
    ----------
    model : ALSWR instance
        fitted ALSWR model

    ratings : scipy sparse csr_matrix [n_users, n_items]
        sparse matrix of user-item interactions

    k : int
        rank k's k
        
    Returns
    -------
    avg_ndcg : float
        ndcg at k's score averaged across all users
    """
    ndcg = 0.0
    n_users, n_items = ratings.shape
    for u in range(n_users):
        y_true = np.zeros(n_items)
        y_true[ratings[u].indices] = 1
        u_pred = model._predict_user(u)
        ndcg += ndcg_at_k(y_true, u_pred, k)
        
    avg_ndcg = ndcg / n_users
    return avg_ndcg


def ndcg_at_k(y_true, y_score, k = 10):
    """
    Normalized discounted cumulative gain (NDCG) at rank k
    
    Parameters
    ----------
    y_true : array-like, shape = [n_samples]
        Ground truth (true relevance labels)
    
    y_score : array-like, shape = [n_samples]
        Predicted scores
    
    k : int
        Rank

    Returns
    -------
    ndcg : float, 0.0 ~ 1.0
    """
    actual = dcg_at_k(y_true, y_score, k)
    best = dcg_at_k(y_true, y_true, k) 
    ndcg = actual / best
    return ndcg


def dcg_at_k(y_true, y_score, k = 10):
    """
    Discounted cumulative gain (DCG) at rank k
    
    Parameters
    ----------
    y_true : array-like, shape = [n_samples]
        Ground truth (true relevance labels)
    
    y_score : array-like, shape = [n_samples]
        Predicted scores
    
    k : int
        Rank

    Returns
    -------
    dcg : float
    """
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])
    gains = 2 ** y_true - 1
    discounts = np.log2(np.arange(2, gains.size + 2))
    dcg = np.sum(gains / discounts)
    return dcg
In [22]:
k = 5
ndcg_train = ndcg_score(als, X_train, k)
ndcg_test = ndcg_score(als, X_test, k)
print('ndcg training:', ndcg_train)
print('ndcg testing:', ndcg_test)
ndcg training: 0.2959125755797325
ndcg testing: 0.11226091289209723

Reference