In [1]:

```
# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(css_style='custom2.css', plot_style=False)
```

Out[1]:

In [2]:

```
os.chdir(path)
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from implicit.bpr import BayesianPersonalizedRanking
from implicit.evaluation import train_test_split, precision_at_k
%watermark -a 'Ethen' -d -t -v -p scipy,numpy,pandas,matplotlib,implicit
```

When a user has watched, say, 70% romance movies and 30% action movies in the past, then it is reasonable to expect the personalized list of recommended movies to be comprised of 70% romance and 30% action movies as well since we would like to cover the user's diverse set of interests. A recommendation that actually reflect most if not all of the user's interest is considered a **Calibrated Recommendation**. But the question is, does our recommendation exhibit this trait?

Recommendation algorithm provides personalized user experience past on the user's past historical interaction with the product/system/website. However, when serving the recommendation such as recommendation top 10 movies that we think the user might be interested in, a recommendation engine that is solely measured based on ranking metrics can easily generate recommendations that focus on the main area of interests, resulting the user's other area of interest to be under-represented, or worse, absent in the final recommendation.

To hit the notion home, using the example above, given a user that has watched 70% romance movies and 30% action movies, if we were to solely measure the metric based on precision, we can say we can achieve the best performance by predicting the majority genre, i.e. we will recommend 100% romance movies and we can expect the user to interact with those recommendations 70% of the time. On other other hand, if we were to recommend 70% romance movies and 30% action movies, then we would expect our recommendation to only be correct 0.7 * 0.7 + 0.3 * 0.3 = 58% of the time.

Throughout the rest of this notebook, we will take a look at if the phenomenon of crowding out user's sub-interest occurs with our recommendation, develop a quantitative metric to measure how severe this issue is and implement a post-preprocessing logic that is agnostic of the underlying recommendation algorithm we decided to use to ensure the recommendation becomes more calibrated.

We'll be using the publicly available movielens-20m dataset throughout this experiment. We can download it via the following link. There's multiple data under that folder, we can select download all to make things easier.

The algorithm we will be using to generate our recommendation is Bayesian Personalized Ranking, which is a matrix factorization based collaborative filtering algorithm. Readers don't need to be acquainted with this model per se to continue with this notebook as the discussion is model-agnostic and we'll be explaining the syntax. That said, this link contains some resources on this algorithm if it is of interest.

Preparation steps in the next few code chunks involve the following steps:

`rating.csv`

contains user's rating for each movie. Here, we will focus on implicit data, and follow the usual procedure of simulating binary implicit feedback data (i.e. whether the user enjoyed the movie) by retaining only ratings of 4 stars and higher, while dropping lower ratings.`movie.csv`

contains each movies genre tag. We will also eliminate movies that had no genre information attached and create a mapping that stores each movies' genre distribution. In this dataset, each movie $i$ typically has several genres $g$ associated with it, thus we assign equal probabilities $p(g|i)$ to each genre such that $\sum_g p(g|i) = 1$ for each movie $i$. This genre distribution will play a strong role in determining whether our recommendation is well calibrated or not.

In [2]:

```
data_dir = 'movielens-20m-dataset'
# we are working with movie data, but we'll name
# the movie as item to make it more generic to
# all use-cases
user_col = 'userId'
item_col = 'movieId'
value_col = 'rating'
time_col = 'timestamp'
rating_path = os.path.join(data_dir, 'rating.csv')
df_raw = pd.read_csv(rating_path)
print('dimension: ', df_raw.shape)
df_raw.head()
```

Out[2]:

In [3]:

```
title_col = 'title'
genre_col = 'genres'
item_info_path = os.path.join(data_dir, 'movie.csv')
df_item = pd.read_csv(item_info_path)
df_item = df_item[df_item[genre_col] != '(no genres listed)']
print('dimension: ', df_item.shape)
df_item.head()
```

Out[3]:

In [4]:

```
class Item:
"""
Data holder for our item.
Parameters
----------
id : int
title : str
genre : dict[str, float]
The item/movie's genre distribution, where the key
represents the genre and value corresponds to the
ratio of that genre.
score : float
Score for the item, potentially generated by some
recommendation algorithm.
"""
def __init__(self, _id, title, genres, score=None):
self.id = _id
self.title = title
self.score = score
self.genres = genres
def __repr__(self):
return self.title
def create_item_mapping(df_item, item_col, title_col, genre_col):
"""Create a dictionary of item id to Item lookup."""
item_mapping = {}
for row in df_item.itertuples():
item_id = getattr(row, item_col)
item_title = getattr(row, title_col)
item_genre = getattr(row, genre_col)
splitted = item_genre.split('|')
genre_ratio = 1. / len(splitted)
item_genre = {genre: genre_ratio for genre in splitted}
item = Item(item_id, item_title, item_genre)
item_mapping[item_id] = item
return item_mapping
item_mapping = create_item_mapping(df_item, item_col, title_col, genre_col)
item_mapping[1]
```

Out[4]:

In [5]:

```
# convert to implicit feedback data and filter out
# movies that doesn't have any genre
df_rating = df_raw[df_raw[value_col] >= 4.0].copy()
df_rating = df_rating.merge(df_item, on=item_col)
for col in (user_col, item_col):
df_rating[col] = df_rating[col].astype('category')
# the original id are converted to indices to create
# the sparse matrix, so we keep track of the mappings here
# e.g. a userId 1 will correspond to index 0 in our sparse matrix
index2user = df_rating[user_col].cat.categories
index2item = df_rating[item_col].cat.categories
print('dimension: ', df_rating.shape)
df_rating.head()
```

Out[5]:

Given this dataframe we will use the `userId`

, `movieId`

and `rating`

to construct a sparse matrix, perform the random train/test split (we can split based on the time if preferred) and feed the training set into a collaborative filtering based algorithm to train the model, so we can generate item recommendations for users.

In [6]:

```
def create_user_item_csr_matrix(data, user_col, item_col, value_col):
rows = data[user_col].cat.codes
cols = data[item_col].cat.codes
values = data[value_col].astype(np.float32)
return csr_matrix((values, (rows, cols)))
user_item = create_user_item_csr_matrix(df_rating, user_col, item_col, value_col)
user_item
```

Out[6]:

In [7]:

```
np.random.seed(1234)
user_item_train, user_item_test = train_test_split(user_item, train_percentage=0.8)
user_item_train
```

Out[7]:

In [8]:

```
user_item_test
```

Out[8]:

In [9]:

```
# the model expects item-user sparse matrix,
# i.e. the rows represents item and the column
# represents users
np.random.seed(1234)
bpr = BayesianPersonalizedRanking(iterations=70)
bpr.fit(user_item_train.T.tocsr())
```

we will look at a precision_at_k metric just to make sure our recommender is reasonable, feel free to tune the model's hyperparameter to squeeze out performance, but that is not the focus here.

In [10]:

```
precision = precision_at_k(bpr, user_item_train, user_item_test, K=10)
precision
```

Out[10]:

We will take the first user as an example to see whether our recommendations are calibrated or not. Once we're familiar with the procedure for one user, we can repeat the process for all of the users if we'd like to.

Let's start of by defining the problem. We are given the distribution genres $g$ for each movie $i$, $p(g|i)$, what we are interested is whether $p(g|u)$ is similar to $q(g|u)$. Where:

- $p(g|u)$ is the distribution over genre $g$ of the set of movies $H$ played by user $u$ in the past.

- $q(g|u)$ is the distribution over genre $g$ of the set of movies $I$ we recommended to user $u$.

For these distributions, we can have a weighted version if we liked to get sophisticated. e.g. the $p(g|i)$ can be weighted by recency saying something like the item/movie interaction matters more if its a more recent interaction, indicating that item/movie's genre should also be weighted more, but let's not go there yet.

Let's first look at some code to generate these information.

In [11]:

```
# look a the first user
user_id = 0
# find the index that the user interacted with,
# we can then map this to a list of Item, note that we need to first
# map the recommended index to the actual itemId/movieId first
interacted_ids = user_item_train[user_id].nonzero()[1]
interacted_items = [item_mapping[index2item[index]] for index in interacted_ids]
interacted_items[:10]
```

Out[11]:

For the same user, we can use the .recommend method to recommend the topn recommendation for him/her, note that we also passed in the original sparse matrix, and by default, the items/movies that the user has already played will be filtered from the list (controlled by a `filter_already_liked_items`

argument, which defaults to `True`

).

In [12]:

```
# it returns the recommended index and their corresponding score
topn = 20
reco = bpr.recommend(user_id, user_item_train, N=topn)
reco[:10]
```

Out[12]:

In [13]:

```
# map the index to Item
reco_items = [item_mapping[index2item[index]] for index, _ in reco]
reco_items[:10]
```

Out[13]:

The next code chunk defines a function to obtain the genre distribution for a list of items. Given that we now have the list of interacted items and recommended items, we can pass it to the function to obtain the two genre distributions.

In [14]:

```
def compute_genre_distr(items):
"""Compute the genre distribution for a given list of Items."""
distr = {}
for item in items:
for genre, score in item.genres.items():
genre_score = distr.get(genre, 0.)
distr[genre] = genre_score + score
# we normalize the summed up probability so it sums up to 1
# and round it to three decimal places, adding more precision
# doesn't add much value and clutters the output
for item, genre_score in distr.items():
normed_genre_score = round(genre_score / len(items), 3)
distr[item] = normed_genre_score
return distr
```

In [15]:

```
# we can check that the probability does in fact add up to 1
# np.array(list(interacted_distr.values())).sum()
interacted_distr = compute_genre_distr(interacted_items)
interacted_distr
```

Out[15]:

In [16]:

```
reco_distr = compute_genre_distr(reco_items)
reco_distr
```

Out[16]:

In [17]:

```
# change default style figure and font size
plt.rcParams['figure.figsize'] = 10, 8
plt.rcParams['font.size'] = 12
def distr_comparison_plot(interacted_distr, reco_distr, width=0.3):
# the value will automatically be converted to a column with the
# column name of '0'
interacted = pd.DataFrame.from_dict(interacted_distr, orient='index')
reco = pd.DataFrame.from_dict(reco_distr, orient='index')
df = interacted.join(reco, how='outer', lsuffix='_interacted')
n = df.shape[0]
index = np.arange(n)
plt.barh(index, df['0_interacted'], height=width, label='interacted distr')
plt.barh(index + width, df['0'], height=width, label='reco distr')
plt.yticks(index, df.index)
plt.legend(bbox_to_anchor=(1, 0.5))
plt.title('Genre Distribution between User Historical Interaction v.s. Recommendation')
plt.ylabel('Genre')
plt.show()
distr_comparison_plot(interacted_distr, reco_distr)
```

Looking at the results above, we can see that according to $p(g|u)$, the user has interacted with genres such as War, Western, however, they are nowhere to be seen in the topn recommendation to the user, hence we can argue based on the output that our recommendation might not be that well calibrated to the user's past interaction.

To scale this type of comparison, we'll now define our calibration metric $C$. There are various methods to compare whether two distributions are similar to each other, and one popular choice is KL-divergence.

\begin{align} C(p,q) = D_{KL}(p || q) = \sum_{g} p(g|u) \cdot \log \frac{p(g|u)}{\tilde{q}(g|u)} \end{align}The denominator in the formula should be $q(g|u)$, but given that the formula would be undefined if $q(g|u) = 0$ and $p(g|u) > 0$ for a genre $g$. We instead use:

\begin{align} \tilde{q}(g|u) = (1 - \alpha) \cdot q(g|u) + \alpha \cdot p(g|u) \end{align}with a small $\alpha$ such as 0.01, so that $q(g|u) \approx \tilde{q}(g|u)$.

In [18]:

```
def compute_kl_divergence(interacted_distr, reco_distr, alpha=0.01):
"""
KL (p || q), the lower the better.
alpha is not really a tuning parameter, it's just there to make the
computation more numerically stable.
"""
kl_div = 0.
for genre, score in interacted_distr.items():
reco_score = reco_distr.get(genre, 0.)
reco_score = (1 - alpha) * reco_score + alpha * score
kl_div += score * np.log2(score / reco_score)
return kl_div
compute_kl_divergence(interacted_distr, reco_distr)
```

Out[18]:

Being able to compute the calibration metric between $p(g|u)$ and $q(g|u)$ is all well and good, but how can we generate a recommendation list that is more calibrated becomes the next important and interesting question.

Different recommendation algorithm's objective function might be completely different, thus instead of going to hard-route of incorporating it into the objective function right off the bat and spend two weeks writing the customized algorithm in an efficient manner, we will start with an alternative approach of re-ranking the predicted list of a recommender system in a post-processing step.

To determine the optimal set $I^*$ of $N$ recommended items, we'll be using maximum marginal relevance.

\begin{align} I^* = \underset{I, |I|=N}{\text{argmax}} \; (1 - \lambda) \cdot s(I) - \lambda \cdot C(p, q(I)) \end{align}Where

- $s(i)$ is the score of the items $i \in I$ predicted by the recommender system and $s(I) = \sum_{i \in I} s(i)$, i.e. the sum of all the items' score in the recommendation list.
- $\lambda \in [0, 1]$ is a tuning parameter that determines the trade-off between the score generated by the recommender and the calibration score, notice that since the calibration score is measured by KL-divergence, which is a metric that's the lower the better we use its negative in the maximization formula.

Finding the optimal set $I^*$ is a combinatorial optimization problem and can be a topic by itself. We won't do a deep dive into it, but instead leverage a popular greedy submodular optimization to solve this problem. The process is as follows:

- We start out with the empty set.
- Iteratively append one item $i$ at a time, and at step $n$, when we already have the set $I_{n-1}$ comprised of $n - 1$ items, the item $i$ that maximizes the objective function defined above for the set $I_{n-1} \cup {i}$ is added to obtain $I_n$
- Repeat the process the generate the full $I^*$ of size $N$.

From a theoretical standpoint, this procedure guarantees a solution that has a score of 0.63 of the optimal set.

With these information at hand, let's look at the implementation part:

In [19]:

```
def generate_item_candidates(model, user_item, user_id, index2item, item_mapping,
filter_already_liked_items=True):
"""
For a given user, generate the list of items that we can recommend, during this
step, we will also attach the recommender's score to each item.
"""
n_items = user_item.shape[1]
# this is how implicit's matrix factorization generates
# the scores for each item for a given user, modify this
# part of the logic if we were to use a completely different
# algorithm to generate the ranked items
user_factor = model.user_factors[user_id]
scores = model.item_factors.dot(user_factor)
liked = set()
if filter_already_liked_items:
liked = set(user_item[user_id].indices)
item_ids = set(np.arange(n_items))
item_ids -= liked
items = []
for item_id in item_ids:
item = item_mapping[index2item[item_id]]
item.score = scores[item_id]
items.append(item)
return items
```

In [20]:

```
items = generate_item_candidates(bpr, user_item_train, user_id, index2item, item_mapping)
print('number of item candidates:', len(items))
items[:5]
```

Out[20]:

In [21]:

```
def compute_utility(reco_items, interacted_distr, lmbda=0.5):
"""
Our objective function for computing the utility score for
the list of recommended items.
lmbda : float, 0.0 ~ 1.0, default 0.5
Lambda term controls the score and calibration tradeoff,
the higher the lambda the higher the resulting recommendation
will be calibrated. Lambda is keyword in Python, so it's
lmbda instead ^^
"""
reco_distr = compute_genre_distr(reco_items)
kl_div = compute_kl_divergence(interacted_distr, reco_distr)
total_score = 0.0
for item in reco_items:
total_score += item.score
# kl divergence is the lower the better, while score is
# the higher the better so remember to negate it in the calculation
utility = (1 - lmbda) * total_score - lmbda * kl_div
return utility
```

In [22]:

```
def calib_recommend(items, interacted_distr, topn, lmbda=0.5):
"""
start with an empty recommendation list,
loop over the topn cardinality, during each iteration
update the list with the item that maximizes the utility function.
"""
calib_reco = []
for _ in range(topn):
max_utility = -np.inf
for item in items:
if item in calib_reco:
continue
utility = compute_utility(calib_reco + [item], interacted_distr, lmbda)
if utility > max_utility:
max_utility = utility
best_item = item
calib_reco.append(best_item)
return calib_reco
```

In [23]:

```
start = time.time()
calib_reco_items = calib_recommend(items, interacted_distr, topn, lmbda=0.99)
elapsed = time.time() - start
print('elapsed: ', elapsed)
calib_reco_items
```

Out[23]:

In the code chunk above, we turned the $\lambda$ knob extremely high to generate the most calibrated recommendation list possible. Let's now compare the calibrated recommendation (which only optimizes for score, $s$), the original recommendation and the user's interaction distribution.

In [24]:

```
calib_reco_distr = compute_genre_distr(calib_reco_items)
calib_reco_kl_div = compute_kl_divergence(interacted_distr, calib_reco_distr)
reco_kl_div = compute_kl_divergence(interacted_distr, reco_distr)
print('\noriginal reco kl-divergence score:', reco_kl_div)
print('calibrated reco kl-divergence score:', calib_reco_kl_div)
distr_comparison_plot(interacted_distr, calib_reco_distr)
```

Printing out the genre distribution from the calibrated recommendation list shows that this list covers more genre and its distribution closely resembles the distribution of the user's past historical interaction and our quantitative calibration metric, KL-divergence also confirms this. i.e. the calibrated recommendation's KL-divergence is lower than the original recommendation's score.

Thankfully from the results above, it seems that the re-ranked recommendation list that aims to maximize calibration score does in fact generate a more calibrated list. But the question is at what cost? Does other ranking metrics that recommender system often optimize for drop? Let's take a look at precision_at_k. Here the number for `k`

is the `topn`

parameter that we've defined earlier. i.e. the number of recommendations to generate for the user.

In [25]:

```
def precision(user_item, user_id, reco_items, index2item):
indptr = user_item.indptr
indices = user_item.indices
reco_ids = {item.id for item in reco_items}
likes = {index2item[indices[i]] for i in range(indptr[user_id], indptr[user_id + 1])}
relevant = len(reco_ids & likes)
total = min(len(reco_items), len(likes))
return relevant / total
```

In [26]:

```
reco_precision = precision(user_item_test, user_id, reco_items, index2item)
calib_reco_precision = precision(user_item_test, user_id, calib_reco_items, index2item)
print('original reco precision score:', reco_precision)
print('calibrated reco precision score:', calib_reco_precision)
```

Well ..., it's not a surprise that the calibrated recommendation list's precision score is a bit disappointing compared to the original recommendation. But let's see if we try a different value of $\lambda$, this time turning it down a bit to strike a balance between calibration and precision.

In [27]:

```
start = time.time()
calib_reco_items = calib_recommend(items, interacted_distr, topn, lmbda=0.5)
elapsed = time.time() - start
print('elapsed: ', elapsed)
calib_reco_items
```

Out[27]:

In [28]:

```
calib_reco_distr = compute_genre_distr(calib_reco_items)
calib_reco_kl_div = compute_kl_divergence(interacted_distr, calib_reco_distr)
calib_reco_precision = precision(user_item_test, user_id, calib_reco_items, index2item)
print('calibrated reco kl-divergence score:', calib_reco_kl_div)
print('calibrated reco precision score:', calib_reco_precision)
```

In [29]:

```
calib_reco_distr = compute_genre_distr(calib_reco_items)
distr_comparison_plot(interacted_distr, calib_reco_distr)
```

Well, well, well. It turns out calibration can be improved considerably while accuracy is reduced only slightly if we find the sweet spot for the tuning parameter $\lambda$.

The following code chunk curates all the code to generate the calibrated recommendation, the original recommendation and compare it with the user's historical interaction in one place for ease of tracking the flow. This process is outlined for 1 user, feel free to modify the code to perform this comparison across all users and due to the randomness in the recommendation algorithm, the results might differ across runs, but the underlying trend should remain the same.

In [30]:

```
topn = 20
user_id = 0
lmbda = 0.99
reco = bpr.recommend(user_id, user_item_train, N=topn)
reco_items = [item_mapping[index2item[index]] for index, _ in reco]
reco_distr = compute_genre_distr(reco_items)
interacted_ids = user_item_train[user_id].nonzero()[1]
interacted_items = [item_mapping[index2item[index]] for index in interacted_ids]
interacted_distr = compute_genre_distr(interacted_items)
items = generate_item_candidates(bpr, user_item_train, user_id, index2item, item_mapping)
calib_reco_items = calib_recommend(items, interacted_distr, topn, lmbda)
calib_reco_distr = compute_genre_distr(calib_reco_items)
calib_reco_kl_div = compute_kl_divergence(interacted_distr, calib_reco_distr)
calib_reco_precision = precision(user_item_test, user_id, calib_reco_items, index2item)
print('calibrated reco kl-divergence score:', calib_reco_kl_div)
print('calibrated reco precision score:', calib_reco_precision)
distr_comparison_plot(interacted_distr, calib_reco_distr)
reco_kl_div = compute_kl_divergence(interacted_distr, reco_distr)
reco_precision = precision(user_item_test, user_id, reco_items, index2item)
print('original reco kl-divergence score:', reco_kl_div)
print('original reco precision score:', reco_precision)
distr_comparison_plot(interacted_distr, reco_distr)
```

We have calibrated our recommendation here based on movies' genre but the same idea can be applied if we wish to calibrate our recommendation based on some other traits that matters for the problem at hand.

The approach we took here is from a user-centric perspective, i.e. we are generating calibrated recommendations for each user, we can also tackle the problem from a item-centric perspective to see if items are recommended properly, e.g. if popular items just gets recommended a lot more times than other items.

Although we used Bayesian Personalized Ranking to generated recommendations for each user, the technique that we used here to generated the calibrated ranked items for each user is independent of the recommendation algorithm we use. We're only using the algorithm here to demonstrate the process, so feel free is try this out on a recommendation algorithm of your liking/interest.

The discussion here focused on generating calibrated recommendation anchored on a user's past interaction. So if a user played 70% romance movies and 30% action movies, then when we say our recommendation is well-calibrated, that means the top ranked movies for the user should also consists of 70% romance movies and 30% action movies. We can, however, expand this work to recommend diversified items/movies. In other words, instead of limiting ourselves to only work with romance and action movies for a user that has only played movies falling under these two genres, we recommend movies from genres that are outside the user's historical play list.

To elaborate, we were using $p(g|u)$ to represent the user's historical genre interaction, we can extend that distribution to $\tilde{p}(g|u)$, where:

\begin{align} \tilde{p}(g|u) = \beta \cdot p_0(g) + (1 - \beta) \cdot p(g|u) \end{align}Here $p_0(g)$ denote a prior distribution that takes positive values for all the genres $g$ in which we would like to promote diversity in the recommendation. Again we also have a tuning parameter $\beta \in [0, 1]$ to control how "diversified" we would like our recommendations to be. As to how we decide which genres to promote diversity for? Well, that's a topic for another day, but to start with we can use a uniform distribution across all genres or the average over all the users' genre distribution.

After reading a lot of tutorials/papers that introduces new algorithms or new libraries/packages that boosts model performance in terms of ranking metric, coming across a paper on a different topic, calibrating recommendation, is a breath of fresh air. This sheds a new light on how I think about serving recommendations, so kudos to the original authors who brought this well-written paper to light. I encourage people to read the original paper if interested. Paper: H. Steck - Calibrated Recommendation (2018)

**Side Note:**

- In the generating calibrated recommendation section, we introduced the maximum marginal relevance formula that has the knob for us to tune the balance between our original recommendation algorithm's score versus the calibration score. This "balancing" notion can actually be applied to many different areas. For example, In the original paper that introduced this idea, they were using this formula in the context of a search engine. To elaborate, when issuing a query on a search engine, the search engine often times assembles results that maximizes the relevance to the user query. By introducing the linear combination "marginal relevance" - a result is now said to have high marginal relevance if it is both relevant to the query and contains minimal similarity to previously selected results.
- The Greedy Algorithm we leveraged to actually generate our top-N items that maximizes the maximum marginal relevance formula is a discrete optimization method that's often times referred to as a submodular optimization. Submodular optimization is a generic optimization method that can also be applied to other areas such as influence maximization and can be a topic by itself and is discussed lightly deeper in another notebook. Notebook: Submodular Optimization & Influence Maximization