# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(plot_style = False)
os.chdir(path)
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from gensim import corpora
from gensim.models import LdaModel
from operator import itemgetter
from nltk.corpus import stopwords
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
%watermark -a 'Ethen' -d -t -v -p numpy,pandas,matplotlib,gensim,nltk,sklearn
Topic modeling is a technique for taking some unstructured text and automatically extracting its common themes; it is a great way to get a bird's eye view of a large text collection.
The major feature distinguishing topic models from other clustering methods is the notion of mixed membership. Many clustering models assume that each data point belongs to a single cluster: K-means determines membership simply by the shortest distance to the cluster center, and Gaussian mixture models assume that each data point is drawn from one of their component mixture distributions. In many cases, though, it is more realistic to think of data as genuinely belonging to more than one cluster or category. For example, if we have a model for text data that includes both "Politics" and "World News" categories, then an article about a recent meeting of the United Nations should have membership in both categories rather than being forced into just one.
A topic model assumes that each topic is a probability distribution over the vocabulary. For example, if we were to manually create three topics for the Harry Potter series of books, we might come up with something like this:
The way to interpret these words is that each one is listed together with the probability of it appearing in the topic to its left. So an example interpretation of the output above would be: there is a 42% chance that the word "Harry Potter" came from the Harry Potter topic. Note that the probabilities over the vocabulary sum to 1 for every topic, but oftentimes words with lower weights are truncated from the output.
In the same way, we can represent individual documents as a probability distribution over topics. For example, Chapter 1 of Harry Potter book 1 introduces the Dursley family and has Dumbledore discuss Harry’s parents’ death. If we take this chapter to be a single document, it could be broken up into topics like this: 40% Muggle topic, 30% Voldemort topic, and the remaining 30% Harry topic.
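To make the two kinds of distributions concrete, here is a minimal sketch in plain Python. The word weights are made up for illustration (except for the numbers quoted above); a real model would assign some probability to every word in the vocabulary.
# a topic is a probability distribution over the vocabulary;
# words with low weight are usually truncated from printed output
# (these weights are hypothetical, for illustration only)
harry_topic = {'harry potter': 0.42, 'wand': 0.10, 'hogwarts': 0.08}

# a document (e.g. chapter 1 of book 1) is a probability distribution over topics
chapter_one = {'muggle': 0.40, 'voldemort': 0.30, 'harry': 0.30}

# the topic weights of a document sum to 1
assert abs(sum(chapter_one.values()) - 1.0) < 1e-9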
Of course, we don't want to extract the topics and document probabilities by hand like this. We want the machine to do it automatically using our unlabelled text collection as the only input. Because there is no document labeling nor human annotations, topic modeling is an example of an unsupervised machine learning technique.
We'll now start exploring one popular algorithm for doing topic modeling, namely Latent Dirichlet Allocation. Latent Dirichlet Allocation (LDA) requires documents to be represented as a bag of words (for the gensim library, some of the API calls will shorten it to bow, hence we'll use the two interchangeably). This representation ignores word ordering in the document but retains information on how many times each word appears.
We'll first play around with a toy corpus of 11 documents to familiarize ourselves with the gensim API. In this toy corpus, 5 documents are river related and 6 are finance related. The interesting part is that the corpus contains the word "bank", which could mean "a financial institution" or "a river bank". A good topic model should be able to tell the difference between these two meanings based on context.
# each nested list represents the words of a document
texts = [['bank','river','shore','water'],
['river','water','flow','fast','tree'],
['bank','water','fall','flow'],
['bank','bank','water','rain','river'],
['river','water','mud','tree'],
['money','transaction','bank','finance'],
['bank','borrow','money'],
['bank','finance'],
['finance','money','sell','bank'],
['borrow','sell'],
['bank','loan','sell']]
# build the dictionary and convert the documents
# to bag of words (bow) representation using the dictionary
texts_dictionary = corpora.Dictionary(texts)
texts_corpus = [texts_dictionary.doc2bow(text) for text in texts]
The following code chunk trains the LDA model using our corpus and dictionary. We set the number of topics to be 2, and expect to see one related to river banks and one related to financial banks.
# train the model
# the more iterations, the more stable the model
# becomes, but of course it takes longer to train
np.random.seed(431)
texts_model = LdaModel(
texts_corpus,
id2word = texts_dictionary,
num_topics = 2,
passes = 5,
iterations = 50)
# we can pass the num_words argument to limit the listed
# number of most probable words
texts_model.show_topics()
Calling the show_topics method on the model outputs the most probable words that appear in each topic. For the gensim library, the default printing behavior is to print a linear combination of the top words, sorted in decreasing order of the probability of the word appearing in that topic. Thus words that appear towards the left are the ones that are more indicative of the topic.
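If the formatted strings are hard to read, we can also pull the raw (word, probability) pairs for a single topic with show_topic; a quick sketch (topic id 0 chosen arbitrarily):
# show_topic returns (word, probability) tuples for one topic,
# sorted by decreasing probability
texts_model.show_topic(0, topn = 5)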
We see our LDA model has given us a pretty intuitive result. Bank is the most influential word in both topics, and the other words help define what kind of bank we are talking about. We can also use the methods get_term_topics and get_document_topics to further evaluate our result. get_term_topics returns the odds of a particular word belonging to each topic. A few examples:
print(texts_model.get_term_topics('water'))
print(texts_model.get_term_topics('bank'))
Since the word bank is likely to appear in both topics, the values returned are also very similar.
The get_document_topics method outputs the topic distribution of the document. Apart from this, it also lets us know the topic distribution for each word in the document. Let us test this with two different documents that contain the word bank, one in the finance context and one in the river context.
# before we can infer topic distributions on new, unseen documents,
# we need to convert them to bag of words format first
bow_water = ['bank', 'water', 'bank']
bow = texts_model.id2word.doc2bow(bow_water)
doc_topics, word_topics, phi_values = texts_model.get_document_topics(
bow, per_word_topics = True)
# note that doc_topics is equivalent to simply calling model[bow]
print('document topics: ', doc_topics)
print()
for word_id, topic in word_topics:
# access the word with the word id
print(texts_model.id2word[word_id], topic)
print()
for word_id, topic in phi_values:
print(texts_model.id2word[word_id], topic)
What do all these outputs mean?

- doc_topics shows that this document has a higher probability of belonging to topic 1.
- Since per_word_topics is set to True, it also returns word_topics. This variable captures each word id followed by a list of topic ids sorted with the most likely topic first. From the output, it means both our words bank and water are more likely to belong to topic 1 than topic 0.
- phi_values, compared with word_topics, adds the information of the probability of the word belonging to each particular topic. Note that it is scaled by feature length (the word bank appeared 2 times, hence its probabilities will add up to 2).

Let's now do the same thing with a second document, bow_finance.
bow_finance = ['bank', 'finance']
bow = texts_model.id2word.doc2bow(bow_finance) # convert to bag of words format first
doc_topics, word_topics, phi_values = texts_model.get_document_topics(
bow, per_word_topics = True)
word_topics
Since the word bank is now used in a financial context, the most probable topic for the word immediately swaps to topic 0. We've seen quite clearly that, based on the context, the most likely topic associated with a word can change. This differs from our previous method, get_term_topics, which returns a 'static' topic distribution.
But note that each word in a document is only given one topic distribution, meaning the model can't tell the difference when a word with multiple meanings occurs more than once in the same document (every 'bank' in the document below will have the same distribution).
bow = texts_model.id2word.doc2bow(['the', 'bank', 'by', 'the', 'river', 'bank'])
doc_topics, word_topics, phi_values = texts_model.get_document_topics(
bow, per_word_topics = True)
word_topics
Let's dig deeper into this topic modeling technique using a larger dataset. To follow along please download the file from this dropbox link.
# the file is placed one directory above this notebook,
# since it's also used by other notebooks
# change this path if you'd like
file_path = os.path.join('..', 'people_wiki.csv')
wiki = pd.read_csv(file_path)
wiki.head()
# build the id2word dictionary and the corpus (map the word to id)
texts = wiki['text'].apply(lambda x: x.split(' ')).tolist()
dictionary = corpora.Dictionary(texts)
print('number of unique tokens: ', len(dictionary))
# remove stop words using a stop word set merged from
# nltk's and scikit-learn's built-in stop word lists
stoplist = set(stopwords.words('english')).union(set(ENGLISH_STOP_WORDS))
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
if stopword in dictionary.token2id]
dictionary.filter_tokens(stop_ids)
# filter out words that appear in fewer than 2 documents (i.e. appear only once),
# there's also a no_above argument that we could specify, e.g.
# no_above = 0.5 would remove words that appear in more than 50% of the documents
dictionary.filter_extremes(no_below = 2)
# remove gaps in id sequence after words that were removed
dictionary.compactify()
print('number of unique tokens: ', len(dictionary))
# convert words to the "learned" word id
corpus = [dictionary.doc2bow(text) for text in texts]
After building the dictionary of word-to-id mappings and the corpus, we can now train our model.
Taken from the gensim LDA documentation.
With gensim we can run online LDA, which is an algorithm that takes a chunk of documents, updates the LDA model, takes another chunk, updates the model etc. Online LDA can be contrasted with batch LDA, which processes the whole corpus (one full pass), then updates the model, then another pass, another update... The difference is that given a reasonably stationary document stream (not much topic drift), the online updates over the smaller chunks (subcorpora) are pretty good in themselves, so that the model estimation converges faster. As a result, we will perhaps only need a single full pass over the corpus: if the corpus has 3 million articles, and we update once after every 10,000 articles, this means we will have done 300 updates in one pass, quite likely enough to have a very accurate topics estimate.
The default parameters for LdaModel are chunksize=2000, passes=1, update_every=1.
We'll not be discussing the EM algorithm here, but in general a chunksize of 100k with update_every set to 1 is equivalent to a chunksize of 50k with update_every set to 2. The primary difference is that we will save some memory using the smaller chunksize, but we will be doing multiple loading/processing steps prior to moving on to the maximization step. Passes are not related to chunksize or update_every; passes is the number of times we want to go through the entire corpus. Below are a few examples of different combinations of the 3 parameters and the number of online training updates that will occur while training LDA.
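Here is a sketch of how a couple of those combinations translate into LdaModel calls. These models are illustrative only and are not used elsewhere in this notebook, so the calls are left commented out; the update counts assume the 3-million-article stream from the quote above.
# online LDA: update after every chunk of 10,000 documents in a single pass
# over a 3-million-article corpus -> roughly 300 updates in total
# online_lda = LdaModel(corpus, id2word = dictionary, num_topics = 10,
#                       chunksize = 10000, update_every = 1, passes = 1)

# batch LDA: accumulate statistics over the whole corpus before each update
# (update_every = 0), and make 5 full passes -> 5 updates in total
# batch_lda = LdaModel(corpus, id2word = dictionary, num_topics = 10,
#                      update_every = 0, passes = 5)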
The method used to fit the LDA model is a randomized algorithm, which means that it involves steps that are random. Because of these random steps, the algorithm is expected to yield slightly different output for different runs on the same data. Hence, to make sure the output is consistent and to save some time, we will save the model so we don't have to rebuild it every single time.
# for saving and loading the gensim LdaModel
model.save('model.lda')
model = LdaModel.load('model.lda')
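As a side note, depending on the gensim version, LdaModel also accepts a random_state argument that fixes gensim's internal random number generator, which is another way to make runs reproducible. A small sketch on the toy corpus from earlier (the variable name is ours, and the seed value is arbitrary):
# with a fixed random_state, re-training on the same corpus
# reproduces the same topics
reproducible_model = LdaModel(
    texts_corpus, id2word = texts_dictionary, num_topics = 2,
    passes = 5, iterations = 50, random_state = 431)
reproducible_model.show_topics()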
# directory for storing all lda models
model_dir = 'lda_checkpoint'
if not os.path.isdir(model_dir):
os.mkdir(model_dir)
# load the model if we've already trained it before
n_topics = 10
path = os.path.join(model_dir, 'topic_model.lda')
if not os.path.isfile(path):
# training LDA can take some time, we could set
# eval_every = None to not evaluate the model perplexity
topic_model = LdaModel(
corpus, id2word = dictionary, num_topics = 10, iterations = 200)
topic_model.save(path)
topic_model = LdaModel.load(path)
After building the model, we'll try to identify the topics learned by our model. It is reasonable to hope that the model has been able to learn topics that correspond to recognizable categories. In order to do this, we must first recall what exactly a 'topic' is in the context of LDA.
Recall that, to LDA, a topic is a probability distribution over words in the vocabulary; that is, each topic assigns a particular probability to every one of the unique words that appears in our data. Different topics will assign different probabilities to the same word: for instance, a topic that ends up describing science and technology articles might place more probability on the word 'university' than a topic that describes sports or politics. Looking at the highest probability words in each topic will thus give us a sense of its major themes. Ideally we would find that each topic is identifiable with some clear theme and all the topics are relatively distinct. In the following code chunk, we print out the top 10 words associated with each topic.
# each element of the list is a tuple
# containing the topic and word / probability list
topics = topic_model.show_topics(num_words = 10, formatted = False)
topics
We can identify themes for each topic; some examples:
Recall that the main distinguishing feature for LDA is it allows for mixed membership, which means that each document can partially belong to several different topics. For each document, topic membership is expressed as a vector of weights that sum to one; the magnitude of each weight indicates the degree to which the document represents that particular topic.
We'll explore this in our fitted model by looking at the topic distributions for a few example Wikipedia articles from our data set. We should find that these articles have the highest weights on the topics whose themes are most relevant to the subject of the article - for example, we'd expect an article on a politician to place relatively high weight on topics related to government, while an article about an athlete should place higher weight on topics related to sports or competition.
Here we'll predict the topic distribution for the article on Barack Obama:
# extract one document to serve as an example
obama = wiki.loc[wiki['name'] == 'Barack Obama', 'text'].tolist()[0].split()
# topic distribution for this document
obama_bow = topic_model.id2word.doc2bow(obama)
topic_model[obama_bow]
As we can see, the most probable topic associated with a politician is topic 6, which we labeled as politics. And if we look at the second most probable topic, topic 1, we see words such as government and court, which thankfully also make sense (this is probably a topic about the government and the law).
We can learn more about topics by exploring how they place probability mass (which we can think of as a weight) on each of their top words. We'll do this with two visualizations of the weights for the top words in each topic:
# change default figure and font size
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12
# top 100 words by weight in each topic
top_n_words = 100
topics = topic_model.show_topics(
num_topics = n_topics, num_words = top_n_words, formatted = False)
for _, infos in topics:
probs = [prob for _, prob in infos]
plt.plot(range(top_n_words), probs)
plt.xlabel('Word rank')
plt.ylabel('Probability')
plt.title('Probabilities of Top 100 Words in each Topic')
plt.show()
In the above plot, each line corresponds to one of our ten topics. Notice how for each topic, the weights drop off sharply as we move down the ranked list of most important words. This shows that the top 10 to 20 words in each topic are assigned a much greater weight than the remaining words (remember from the preprocessing step that our vocabulary size was 100000).
Next we plot the total weight assigned by each topic to its top 10 words:
# total weight assigned by each topic to its top 10 words
top_probs = []
top_n_words = 10
topics = topic_model.show_topics(num_words = top_n_words, formatted = False)
for _, infos in topics:
prob = sum([prob for _, prob in infos])
top_probs.append(prob)
ind = np.arange(n_topics)  # one bar per topic
width = 0.5
fig, ax = plt.subplots()
ax.bar(ind - (width / 2), top_probs, width, color = 'lightcoral')
ax.set_xticks(ind)
plt.xlabel('Topic')
plt.ylabel('Probability')
plt.title('Total Probability of Top 10 Words in each Topic')
plt.xlim(-0.5, 9.5)
plt.ylim(0, 0.15)
plt.show()
Here we see that, for our topic model, the top 10 words only account for a small fraction (in this case, between 4% and 12%) of their topic's total probability mass. So while we can use the top words to identify broad themes for each topic, we should keep in mind that in reality these topics are more complex than a simple 10-word summary.
Finally, we'll take a look at the effect of the LDA model hyperparameters alpha and eta on the characteristics of our topic model. alpha is a parameter that controls the prior distribution over topic weights in each document, while eta is a parameter for the prior distribution over word weights in each topic. In gensim, both default to a symmetric 1 / num_topics prior.
alpha and eta can be thought of as smoothing parameters when we compute how much each document "likes" a topic (in the case of alpha) or how much each topic "likes" a word (in the case of eta). A higher alpha makes the document preferences "smoother" over topics, and a higher eta makes the topic preferences "smoother" over words.
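To build some intuition for what "smoother" means here, we can sample from Dirichlet distributions with different concentration parameters using numpy. This is just an illustration of the prior itself, not a gensim call; each sample plays the role of one document's topic-weight vector over 10 topics.
np.random.seed(431)  # arbitrary seed for reproducibility

# a small concentration parameter yields peaked, sparse topic weights,
# while a large one spreads the weight almost evenly across the 10 topics
for concentration in [0.1, 1.0, 10.0]:
    sample = np.random.dirichlet(concentration * np.ones(10))
    print('concentration = {}: largest topic weight = {:.2f}'.format(
        concentration, sample.max()))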
Our goal in this section will be to understand how changing these parameter values affects the characteristics of the resulting topic model.
We'll start with alpha. Since alpha is responsible for smoothing document preferences over topics, the impact of changing its value should be visible when we plot the distribution of topic weights for the same document under models fit with different alpha values. In the code below, we train (or load, if they have already been saved) topic models using different settings of alpha, and plot the (sorted) topic weights for our example document (the same Barack Obama article we used to predict the topic distribution earlier) under models with high (10), original (0.1), and low (0.001) settings of alpha.
path_alpha_high = os.path.join(model_dir, 'topic_model_alpha_high.lda')
path_alpha_low = os.path.join(model_dir, 'topic_model_alpha_low.lda')
if not os.path.isfile(path_alpha_high):
topic_model_alpha_high = LdaModel(corpus, id2word = dictionary,
num_topics = n_topics, iterations = 200, alpha = 10)
topic_model_alpha_low = LdaModel(corpus, id2word = dictionary,
num_topics = n_topics, iterations = 200, alpha = 0.001)
topic_model_alpha_high.save(path_alpha_high)
topic_model_alpha_low.save(path_alpha_low)
topic_model_alpha_high = LdaModel.load(path_alpha_high)
topic_model_alpha_low = LdaModel.load(path_alpha_low)
def sort_doc_topics(topic_model, doc):
"""
given a gensim LDA topic model and
a document, obtain the predicted probability
for each topic in sorted order
"""
bow = topic_model.id2word.doc2bow(doc)
# the default minimum_probability will clip out topics whose
# probability is too small, which is not what we want here
doc_topics = topic_model.get_document_topics(bow, minimum_probability = 0)
doc_topics.sort(key = itemgetter(1), reverse = True)
probs = [prob for _, prob in doc_topics]
return probs
alpha_low = sort_doc_topics(topic_model_alpha_low, obama)
alpha_default = sort_doc_topics(topic_model, obama)
alpha_high = sort_doc_topics(topic_model_alpha_high, obama)
def param_barplot(a, b, c, ylim, param, ylab):
"""
plotting function for three topic models that
have a different `param` value
"""
width = 0.3
ind = np.arange(len(a))
fig = plt.figure()
ax = fig.add_subplot(111)
b1 = ax.bar(ind, a, width, color = 'lightskyblue')
b2 = ax.bar(ind + width, b, width, color = 'lightcoral')
b3 = ax.bar(ind + (2 * width), c, width, color = 'gold')
ax.set_xticks(ind + width)
ax.set_xticklabels(range(len(a)))
ax.set_ylabel(ylab)
ax.set_xlabel('topics')
ax.set_ylim(0, ylim)
ax.legend(handles = [b1, b2, b3],
labels = ['low ' + param, 'default ' + param, 'high ' + param])
plt.tight_layout()
plt.show()
param_barplot(alpha_low, alpha_default, alpha_high,
ylim = 1.0, param = 'alpha',
ylab = 'Topic Probability for Obama Article')
Here we can clearly see the smoothing enforced by the alpha parameter - notice that when alpha is low, most of the weight in the topic distribution for this article goes to a single topic, but when it is high, the weight is much more evenly distributed across the topics.
Just as we were able to see the effect of alpha by plotting topic weights for a document, we expect to be able to visualize the impact of changing eta by plotting word weights for each topic. In this case, however, there are far too many words in our vocabulary to do this effectively. Instead, we'll plot the total weight of the top 10 words and bottom 500 words for each topic in the high (10), original (0.1), and low (0.001) eta models.
path_eta_high = os.path.join(model_dir, 'topic_model_eta_high.lda')
path_eta_low = os.path.join(model_dir, 'topic_model_eta_low.lda')
if not os.path.isfile(path_eta_high):
topic_model_eta_high = LdaModel(corpus, id2word = dictionary,
num_topics = n_topics, iterations = 200, eta = 10)
topic_model_eta_low = LdaModel(corpus, id2word = dictionary,
num_topics = n_topics, iterations = 200, eta = 0.001)
topic_model_eta_high.save(path_eta_high)
topic_model_eta_low.save(path_eta_low)
topic_model_eta_high = LdaModel.load(path_eta_high)
topic_model_eta_low = LdaModel.load(path_eta_low)
def get_top_word_weight(topic_model, n_topics, top_n_words):
"""
total weight assigned by each topic to its top `top_n_words` words
"""
top_probs = []
topics = topic_model.show_topics(num_topics = n_topics,
num_words = top_n_words,
formatted = False)
for _, infos in topics:
prob = sum([prob for _, prob in infos])
top_probs.append(prob)
return top_probs
top_n_words = 10
eta_low_top = get_top_word_weight(topic_model_eta_low, n_topics, top_n_words)
eta_default_top = get_top_word_weight(topic_model, n_topics, top_n_words)
eta_high_top = get_top_word_weight(topic_model_eta_high, n_topics, top_n_words)
def get_bottom_word_weight(topic_model, n_topics, bottom_n_words):
"""
total weight assigned by each topic to its bottom
`bottom_n_words` words
"""
bottom_probs = []
num_words = len(topic_model.id2word)
topics = topic_model.show_topics(num_topics = n_topics,
num_words = num_words,
formatted = False)
for _, infos in topics:
prob = sum([prob for _, prob in infos[-bottom_n_words:]])
bottom_probs.append(prob)
return bottom_probs
bottom_n_words = 500
eta_low_bottom = get_bottom_word_weight(topic_model_eta_low, n_topics, bottom_n_words)
eta_default_bottom = get_bottom_word_weight(topic_model, n_topics, bottom_n_words)
eta_high_bottom = get_bottom_word_weight(topic_model_eta_high, n_topics, bottom_n_words)
param_barplot(eta_low_top, eta_default_top, eta_high_top,
ylim = 0.15, param = 'eta',
ylab = 'Total Probability of Top 10 Words')
param_barplot(eta_low_bottom, eta_default_bottom, eta_high_bottom,
ylim = 0.007, param = 'eta',
ylab = 'Total Probability of Bottom 500 Words')
From these two plots we can see that the low eta model results in higher weight placed on the top words and lower weight placed on the bottom words for each topic (or more intuitively, topics are composed of few words). On the other hand, the high eta model places relatively less weight on the top words and more weight on the bottom words. Thus increasing eta results in topics that have a smoother distribution of weight across all the words in the vocabulary.
We have now seen how the hyperparameters influence the characteristics of our LDA topic model, but we haven't said anything about which settings are best. We know that these parameters are responsible for controlling the smoothness of the topic distributions for documents (alpha) and the word distributions for topics (eta), but there's no simple conversion between the smoothness of these distributions and the quality of the topic model.
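If we want something more quantitative to complement eyeballing the topics, one common aid is topic coherence, which gensim exposes through CoherenceModel. It is not used elsewhere in this notebook; the sketch below assumes the tokenized texts, dictionary and the models trained above.
from gensim.models import CoherenceModel

# higher coherence generally corresponds to more interpretable topics,
# though it's only a proxy and shouldn't replace manual inspection
for name, model in [('default alpha', topic_model),
                    ('high alpha', topic_model_alpha_high),
                    ('low alpha', topic_model_alpha_low)]:
    cm = CoherenceModel(model = model, texts = texts,
                        dictionary = dictionary, coherence = 'c_v')
    print(name, round(cm.get_coherence(), 3))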
Hyperparameter: Just like with all other models, there is no universally "best" choice for these hyperparameters. Finding a good topic model really requires some exploration of the output to see if it makes sense (as we did by looking at the top words for each topic and checking some topic predictions for documents). If the top words look like complete gibberish, consider looking at the documents that got assigned to that topic and see if that helps decipher the contextual meaning of the topic. Or simply scratch the whole thing and re-run the model, but during the re-run, add the uninterpretable words that appeared in the topic's top words to the stop word list so that they won't distort the interpretation again. If it still doesn't work, then try lemmatizing the words or use feature selection methods (the simplest being setting a cap on the number of words/tokens that the document-term matrix can use). If that still doesn't work, well, machine learning is garbage in, garbage out, so maybe the data is simply way too outdated or messy to be utilized.
Word Representation: Although LDA assumes the documents to be in bag of words (bow) representation, from this post, Quora: Why is the performance improved by using TFIDF instead of bag-of-words in LDA clustering?, it seems people have also found success using the tf-idf representation, as it can be considered a weighted bag of words.
Memory Considerations: Gensim can only do so much to limit the amount of memory used by our analysis. Our program may take an extended amount of time or possibly crash if we do not take into account the amount of memory the program will consume. Prior to training our model we can get a ballpark estimate of memory use by using the following formula:
$$\text{8 bytes} \times \text{num_terms} \times \text{num_topics} \times 3$$

The magic number 3: the $\text{8 bytes} \times \text{num_terms} \times \text{num_topics}$ accounts for the model output, but Gensim will need to make temporary copies while modeling. The scaling factor of 3 gives an idea of how much memory Gensim will be consuming while running with the temporary copies present. One quick way to cut down the memory usage is to limit the size of the vocabulary (the number of unique tokens). After constructing the dictionary, we can do print(dictionary) to see how many unique tokens it contains and perform filtering to reduce the size if needed.
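As a quick sanity check, here is a small sketch that plugs our own dictionary into the ballpark formula above and shows one way to cap the vocabulary if the estimate looks too large (the keep_n value below is arbitrary):
# ballpark memory estimate for the current dictionary and number of topics
num_terms = len(dictionary)
estimated_bytes = 8 * num_terms * n_topics * 3
print('estimated memory usage: {:.1f} MB'.format(estimated_bytes / 1024 ** 2))

# if needed, cap the vocabulary before building the corpus, e.g. keep only
# the 50000 most frequent tokens (left commented out so it doesn't alter
# the dictionary used earlier in this notebook)
# dictionary.filter_extremes(no_below = 2, keep_n = 50000)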