In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(plot_style = False)
Out[1]:
In [2]:
os.chdir(path)

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
%matplotlib inline
%load_ext watermark
%load_ext autoreload 
%autoreload 2

import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn
Ethen 2017-10-24 14:51:01 

CPython 3.5.2
IPython 6.2.1

numpy 1.13.3
pandas 0.20.3
sklearn 0.19.1

Naive Bayes

Naive Bayes classifiers are based on Bayes’ theorem, and the adjective naive comes from the assumption that the features in a dataset are mutually independent. In practice, the independence assumption is often violated, but Naive Bayes classifiers still tend to perform very well in the field of text/document classification. Common applications include spam filtering (categorizing a text message as spam or not spam) and sentiment analysis (categorizing a text message as a positive or negative review). More importantly, the simplicity of the method means that it takes orders of magnitude less time to train than more complex models such as support vector machines.

Text/Document Representations

Text classifiers often don't use any kind of deep representation of language: oftentimes a document is represented as a bag of words. (A bag is like a set that allows repeating elements.) This is an extremely simple representation: it throws away the word order and only keeps which words are included in the document and how many times each word occurs.

We shall look at two probabilistic models of documents, both of which represent documents as a bag of words, using the Naive Bayes assumption. Both models represent documents using feature vectors whose components correspond to word types. If our vocabulary contains $|V|$ distinct words, then the feature vector dimension is $d=|V|$.

  • Bernoulli document model: a document is represented by a feature vector with binary elements taking value 1 if the corresponding word is present in the document and 0 if the word is not present.
  • Multinomial document model: a document is represented by a feature vector with integer elements whose value is the frequency of that word in the document.

Example: Consider the vocabulary V = {blue, red, dog, cat, biscuit, apple}. In this case |V| = d = 6. Now consider the (short) document "the blue dog ate a blue biscuit". If $d^B$ is the Bernoulli feature vector for this document, and $d^M$ is the Multinomial feature vector, then we would have:

In [3]:
vocab = ['blue', 'red', 'dog', 'cat', 'biscuit', 'apple']
doc = "the blue dog ate a blue biscuit"

# split the document into individual words; words that don't appear
# in the vocabulary (e.g. 'the', 'ate', 'a') are simply discarded
words = doc.split()
bernoulli = [1 if v in words else 0 for v in vocab]
multinomial = [words.count(v) for v in vocab]
print('bernoulli', bernoulli)
print('multinomial', multinomial)
bernoulli [1, 0, 1, 0, 1, 0]
multinomial [2, 0, 1, 0, 1, 0]

Bernoulli Model

Consider a corpus of documents (training data) whose classes are given by $C = 1, 2, ..., K$. Using Naive Bayes (whether it's the Bernoulli model or the multinomial model, which we'll see later), we classify a document $D$ as the class with the highest posterior probability $argmax_{k = 1, 2, ..., K} \, p(C = k|D)$, which can be re-expressed using Bayes’ Theorem:

$$p(C = k|D) = \frac{ p(C = k) \, p(D|C = k) }{p(D)} \ \propto p(C = k) \, p(D|C = k)$$

Where:

  • $\propto$ means "is proportional to".
  • $p(C = k)$ is class $k$'s prior probability.
  • $p(D|C = k)$ is the likelihood of the document given class $k$.
  • $p(D)$ is the normalizing factor, which we don't have to compute since it does not depend on the class $C$, i.e. this factor is the same across all classes, thus the numerator alone is enough to determine which $p(C = k|D)$ is the largest.

Starting with $p(D|C)$: the spirit of Naive Bayes is that it assumes each of the features it uses is conditionally independent of the others given some class. More formally, if we wish to calculate the probability of observing features $X_1$ through $X_d$ given some class $C$, we can do so with the following formula:

$$p(x_{1},x_{2},...,x_{d} \mid C) = \prod_{i=1}^{d}p(x_{i} \mid C)$$

Suppose we have a vocabulary (feature set) $V$ containing $|V|$ words, where the $t^{th}$ dimension of a document vector corresponds to word $w_t$ in the vocabulary. Following the Naive Bayes assumption that the probability of each word occurring in the document is independent of the occurrences of the other words, we can re-write the $i^{th}$ document's likelihood $p(D_i \mid C)$ as:

$$p(D_i \mid C ) = \prod_{t=1}^{d} \left[ b_{it} \, p(w_t \mid C) + ( 1 - b_{it} ) (1 - p(w_t \mid C)) \right]$$

Where:

  • $p(w_t \mid C)$ is the probability of word $w_t$ occurring in a document of class $C$.
  • $1- p(w_t \mid C)$ is the probability of $w_t$ not occurring in a document of class $C$.
  • $b_{it}$ is either 0 or 1, representing the absence or presence of word $w_t$ in the $i^{th}$ document.

This product goes over all words in the vocabulary. If word $w_t$ is present, then $b_{it} = 1$ and the associated probability is $p(w_t \mid C)$; if word $w_t$ is not present, then $b_{it} = 0$ and the associated probability becomes $1 - p(w_t \mid C)$.

As for the word likelihood $p(w_t \mid C)$, we can learn (estimate) these parameters from a training set of documents labelled with class $C=k$:

$$p(w_t \mid C = k) = \frac{n_k(w_t)}{N_k}$$

Where:

  • $n_k(w_t)$ is the number of class $C=k$ documents in which $w_t$ is observed.
  • $N_k$ is the number of documents that belongs to class $k$.

Last, calculating $p(C)$ is relatively simple: if there are $N$ documents in total in the training set, then the prior probability of class $C=k$ may be estimated as the relative frequency of documents of class $C=k$:

$$p(C = k)\,= \frac{N_k}{N}$$

Where $N$ is the total number of documents in the training set.
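
To make these two estimates concrete, here is a minimal numpy sketch (separate from the full implementation below), assuming a binary document-term matrix X with one row per document and an integer label vector y; the toy values are purely illustrative:

import numpy as np

# toy binary document-term matrix (3 documents, 4 words) and class labels
X = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 1]])
y = np.array([0, 0, 1])

# prior p(C = k): relative frequency of each class, here [2/3, 1/3]
priors = np.bincount(y) / y.shape[0]

# word likelihood p(w_t | C = k): fraction of class k's documents containing w_t
word_likelihood = np.array([X[y == k].mean(axis = 0) for k in np.unique(y)])
print(priors)
print(word_likelihood)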

Bernoulli Model Implementation

Consider a set of documents, each of which is related either to Class 1 or to Class 0. Given a training set of 11 documents, we would like to train a Naive Bayes classifier, using the Bernoulli document model, to classify unlabelled documents as Class 1 or 0. We define a vocabulary of eight words.

Thus the training data $X$ is presented below as an $11 \times 8$ matrix, in which each row represents an 8-dimensional document vector, and $y$ represents the class of each document. We would then like to classify the two test documents.

In [4]:
train = np.genfromtxt('bernoulli.txt', dtype = np.int)
X_train = train[:, :-1]
y_train = train[:, -1] # the last column is the class
print('training data:')
print(X_train)
print()
print(y_train)
print()
print('testing data:')
X_test = np.array([[1, 0, 0, 1, 1, 1, 0, 1], [0, 1, 1, 0, 1, 0, 1, 0]])
print(X_test)
training data:
[[1 0 0 0 1 1 1 1]
 [0 0 1 0 1 1 0 0]
 [0 1 0 1 0 1 1 0]
 [1 0 0 1 0 1 0 1]
 [1 0 0 0 1 0 1 1]
 [0 0 1 1 0 0 1 1]
 [0 1 1 0 0 0 1 0]
 [1 1 0 1 0 0 1 1]
 [0 1 1 0 0 1 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 1 0 1 0 1 0]]

[1 1 1 1 1 1 0 0 0 0 0]

testing data:
[[1 0 0 1 1 1 0 1]
 [0 1 1 0 1 0 1 0]]
In [5]:
def bernoulli_nb(X_train, y_train, X_test):
    """
    Pass in the training data, it's label and 
    predict the testing data's class using bernoulli naive bayes
    """
    
    # calculate the prior proabability p(C=k)
    N = X_train.shape[0]
    priors = np.bincount(y_train) / N
    
    # obtain the unique class labels (since they may not be 0 and 1)
    class_type = np.unique(y_train)
    class_nums = class_type.shape[0]
    word_likelihood = np.zeros((class_nums, X_train.shape[1]))
    
    # compute the word likelihood p(w_t∣C)
    for index, output in enumerate(class_type):
        subset = X_train[np.equal(y_train, output)]
        word_likelihood[index, :] = np.sum(subset, axis = 0) / subset.shape[0]
        
    # make predictions on the test set
    # note that this code will break if the test set happens to 
    # be a 1d-array, since the first for loop will not be 
    # looping through each document, but each document's feature instead
    predictions = np.zeros(X_test.shape[0], dtype = np.int)
    for index1, document in enumerate(X_test):
        
        # stores the p(C|D) for each class
        posteriors = np.zeros(class_nums)
        
        # compute p(C = k|D) for the document for all class
        # and return the predicted class with the maximum probability
        for c in range(class_nums):
            
            # start with p(C = k)
            posterior = priors[c]
            word_likelihood_subset = word_likelihood[c, :]
            
            # loop through features to compute p(D∣C = k)
            for index2, feature in enumerate(document):
                if feature:
                    prob = word_likelihood_subset[index2]
                else:
                    prob = 1 - word_likelihood_subset[index2]
                
                posterior *= prob

            posteriors[c] = posterior
        
        # compute the maximum p(C|D)
        predicted_class = class_type[np.argmax(posteriors)]
        predictions[index1] = predicted_class
    
    return predictions
In [6]:
predictions = bernoulli_nb(X_train, y_train, X_test)
predictions
Out[6]:
array([1, 0])

Multinomial Distribution

Before discussing the multinomial document model, it is important to be familiar with the multinomial distribution. The multinomial distribution can be used to compute probabilities in situations in which there are more than two possible outcomes. For example, suppose that two chess players had played numerous games and it was determined that the probability that Player A would win is 0.40, the probability that Player B would win is 0.35, and the probability that the game would end in a draw is 0.25. The multinomial distribution can be used to answer questions such as: "If these two chess players played 12 games, what is the probability that Player A would win 7 games, Player B would win 2 games, and the remaining 3 games would be drawn?" The following generalized formula gives the probability of obtaining a specific set of outcomes when each of $n$ independent events has $d$ possible outcomes:

$$P = \frac{n!}{n_1!n_2!...n_d!}p_1^{n_1}p_2^{n_2}...p_d^{n_d}$$
  • $n$ is the total number of events.
  • $n_1, ..., n_d$ are the number of times outcomes 1 to $d$ occurred.
  • $p_1, ..., p_d$ are the probabilities that outcomes 1 to $d$ occur.

Or more compactly written as:

$$P = \frac{n!}{\prod_{t=1}^{d}n_t!}\prod_{t=1}^{d}p_t^{n_t}$$

If all of that is still unclear, refer to the following link for a worked example. Youtube: Introduction to the Multinomial Distribution.
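
As a quick sanity check on this formula, here is a small sketch that plugs in the chess example above (12 games with outcome probabilities 0.40, 0.35 and 0.25, asking for counts of 7, 2 and 3); the cross-check at the end assumes a scipy version that provides scipy.stats.multinomial:

from math import factorial

counts = [7, 2, 3]          # wins for A, wins for B, draws
probs = [0.40, 0.35, 0.25]
n = sum(counts)             # 12 games in total

# multinomial coefficient n! / (n_1! n_2! ... n_d!)
coef = factorial(n)
for count in counts:
    coef //= factorial(count)

# multiply in p_t ^ n_t for each outcome
prob = coef
for count, p in zip(counts, probs):
    prob *= p ** count

print(prob)  # roughly 0.0248

# optional cross-check against scipy
from scipy.stats import multinomial
print(multinomial.pmf(counts, n = n, p = probs))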

Multinomial Model

Recall that for Naive Bayes, $argmax_{k = 1, 2, ..., K} \, p(D|C = k) \, p(C = k)$ is the quantity we're maximizing. In the multinomial case, calculating $p(D|C = k)$ for the $i^{th}$ document becomes:

$$p(D_i|C = k) = \frac{x_i!}{\prod_{t=1}^{d}x_{it}!}\prod_{t=1}^{d}p(w_t|C)^{x_{it}} \propto \prod_{t=1}^{d}p(w_t|C)^{x_{it}}$$

Where:

  • $x_{it}$ is the count of the number of times word $w_t$ occurs in document $D_i$.
  • $x_i = \sum_t x_{it}$ is the total number of words in document $D_i$.
  • Oftentimes, we don't need the normalization term $\frac{x_i!}{\prod_{t=1}^{d}x_{it}!}$, because it does not depend on the class, $C$.
  • $p(w_t \mid C)$ is the probability of word $w_t$ occurring in a document of class $C$, this time estimated using the word frequency information from the documents' feature vectors. More specifically, this is: $\text{Number of word } w_t \text{ in class } C \big/ \text{Total number of words in class } C$.
  • $\prod_{t=1}^{d}p(w_t|C)^{x_{it}}$ can be interpreted as the product of word likelihoods for each word in the document.

Laplace Smoothing

One drawback of the equation for the multinomial model is that the likelihood $p(D_i|C = k)$ involves taking a product of probabilities $p(w_t \mid C)$. Hence if any one of the terms of the product is zero, then the whole product becomes zero. This means that the probability of the document belonging to the class in question is zero (impossible). Intuitively, just because a word does not occur in a document class in the training data does not mean that it cannot occur in any document of that class.

Therefore, one way to alleviate the problem is Laplace smoothing, or add-one smoothing, where we add a count of one to each word type and increase the denominator by $|V|$, the size of the vocabulary (number of features), to ensure that the probabilities are still normalized. More formally, our $p(w_t \mid C)$ becomes:

$$p(w_t \mid C) = \frac{( \text{Number of word } w_t \text{ in class } C + 1 )}{( \text{Total number of words in class } C) + |V|} $$

In sum, by performing Laplace smoothing, we ensure that our $p(w_t \mid C)$ will never equal 0.
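
A minimal sketch of this smoothed estimate, where counts is assumed to hold one class's word counts summed over its training documents (the values below are illustrative only):

import numpy as np

# word counts for one class; the third word never appears in this class
counts = np.array([3, 1, 0, 2])

# unsmoothed estimate: the zero count forces p(w_t | C) = 0 for that word
unsmoothed = counts / counts.sum()

# Laplace (add-one) smoothing: add 1 to every count and |V| to the denominator
smoothed = (counts + 1) / (counts.sum() + counts.shape[0])

print(unsmoothed)  # roughly [0.5, 0.167, 0, 0.333]
print(smoothed)    # [0.4, 0.2, 0.1, 0.3]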

Log-Transformation

Our original formula for classifying a document into a class using Multinomial Naive Bayes was:

$$p(C|D) \propto p(C)\prod_{t=1}^{d}p(w_t|C)^{x_{it}}$$

In practice, when we have a lot of unique words, we create very small values by computing the product of many $p(w_t \mid C)$ terms. On a computer, these values may become so small that they "underflow" (become smaller than the smallest floating point number the machine can represent, and thus get rounded to zero). To prevent this, we note that the logarithm is monotonically increasing, so maximizing the unnormalized posterior is equivalent to maximizing its logarithm:

$$\log \left( p(C)\prod_{t=1}^{d}p(w_t|C)^{x_{it}}\right)$$

Using the property that $\log(ab) = \log(a) + \log(b)$, the formula above then becomes:

$$\log \, p(C) + \sum_{t=1}^d x_{it} \log \, p(w_t|C)$$
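
The underflow problem and the log-space fix are easy to see directly; a tiny illustration with made-up word likelihoods:

import numpy as np

# 2000 made-up word likelihoods, each fairly small
probs = np.full(2000, 0.01)

# multiplying them directly underflows to exactly 0.0
print(np.prod(probs))

# summing the logs instead stays comfortably representable (roughly -9210.3)
print(np.sum(np.log(probs)))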

Multinomial Model Implementation

In [7]:
text = pd.read_table('multinomial.txt', sep = ',', header = None, names = ['message', 'label'])
X_train = text['message']
y_train = text['label']
text.head()
Out[7]:
                    message label
0   Chinese Beijing Chinese     c
1  Chinese Chinese Shanghai     c
2             Chinese Macao     c
3       Tokyo Japan Chinese     j

Given the four documents and their corresponding classes (labels), which class does the document with the message "Chinese Chinese Chinese Tokyo Japan" more likely belong to?

In [8]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(['Chinese Chinese Chinese Tokyo Japan'])

print('feature name: ', vect.get_feature_names())

# convert to a dense array for a better visual representation
print('training:')
print(X_train_dtm.toarray())
print('\ntesting:')
print(X_test_dtm.toarray())
feature name:  ['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
training:
[[1 2 0 0 0 0]
 [0 2 0 0 1 0]
 [0 1 0 1 0 0]
 [0 1 1 0 0 1]]

testing:
[[0 3 1 0 0 1]]
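
Before jumping to the code, we can also work out the answer by hand with the Laplace-smoothed estimates: class c contains 8 words in total, class j contains 3, the vocabulary size is 6, and the test document contains Chinese three times plus Tokyo and Japan once each. The (unnormalized) posteriors are then:

$$p(C = c) \, p(D|C = c) \propto \frac{3}{4} \times \left(\frac{5 + 1}{8 + 6}\right)^3 \times \frac{0 + 1}{8 + 6} \times \frac{0 + 1}{8 + 6} \approx 0.0003$$

$$p(C = j) \, p(D|C = j) \propto \frac{1}{4} \times \left(\frac{1 + 1}{3 + 6}\right)^3 \times \frac{1 + 1}{3 + 6} \times \frac{1 + 1}{3 + 6} \approx 0.0001$$

Since the first score is larger, the test document should be assigned to class c, which is what all three implementations below agree on.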

The next code chunk is a very crude implementation, while the implementation two code chunks below is more efficient and robust, leveraging sparse matrices and matrix multiplication.

In [9]:
def multinomial_nb(X_train_dtm, y_train, X_test_dtm):
    """
    Pass in the training data and its labels, and
    predict the test data's class using multinomial naive bayes
    """
    
    # compute the priors
    # convert the character class to numbers (easier to work with)
    le = LabelEncoder()
    y = le.fit_transform(y_train)
    priors = np.bincount(y) / y.shape[0]

    class_type = np.unique(y)
    class_nums = class_type.shape[0]
    feature_nums = X_train_dtm.shape[1]
    likelihood = np.zeros((class_nums, feature_nums))

    # compute the word likelihood p(w_t∣C)
    # apply Laplace smoothing
    for index, output in enumerate(class_type):
        subset = X_train_dtm[np.equal(y, output)]
        likelihood[index, :] = (np.sum(subset, axis = 0) + 1) / (np.sum(subset) + feature_nums)
        
    # make prediction on test set
    predictions = np.zeros(X_test_dtm.shape[0], dtype = np.int)
    for index1, document in enumerate(X_test_dtm):
        
        # stores the p(C|D) for each class
        posteriors = np.zeros(class_nums)

        # compute p(C = k|D) for the document for all class
        # and return the predicted class with the maximum probability
        for c in range(class_nums):

            # start with p(C = k)
            posterior = np.log(priors[c])
            likelihood_subset = likelihood[c, :]

            # compute p(D∣C = k)
            prob = document * np.log(likelihood_subset)
            posterior += np.sum(prob)
            posteriors[c] = posterior

        # compute the maximum p(C|D)
        prediction = np.argmax(posteriors)
        predictions[index1] = prediction
    
    # convert the prediction to the original class label
    predicted_class = le.inverse_transform(predictions)
    return predicted_class
In [10]:
import numpy as np
from scipy.misc import logsumexp
from sklearn.preprocessing import LabelBinarizer


class NaiveBayes:
    """
    Multinomial Naive Bayes classifier [1]_.

    Parameters
    ----------
    smooth : float, default 1.0
        Additive Laplace smoothing.

    Attributes
    ----------
    classes_ : 1d ndarray, shape [n_classes]
        Holds the original label for each class.

    class_log_prior_ : 1d ndarray, shape [n_classes]
        Empirical log probability for each class.

    feature_log_prob_ : 2d ndarray, shape [n_classes, n_features]
        Smoothed empirical log probability of features given a class,
        ``P(feature | class)``.

    class_count_ : 1d ndarray, shape [n_classes]
        Number of samples encountered for each class during fitting.

    feature_count_ : 2d ndarray, shape [n_classes, n_features]
        Number of samples encountered for each class and feature
        during fitting.

    References
    ----------
    .. [1] `Scikit-learn MultinomialNB
        <http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html>`_
    """

    def __init__(self, smooth = 1.0):
        self.smooth = smooth

    def fit(self, X, y):
        """
        Fit the model according to the training data X and
        training label y.

        Parameters
        ----------
        X : scipy sparse csr_matrix, shape [n_samples, n_features]
            Training data.

        y : 1d ndarray, shape [n_samples]
            Label values.

        Returns
        -------
        self
        """

        # one hot encode the label column and for binary
        # label, also expand it to two columns since it
        # only returns a single column vector
        labelbin = LabelBinarizer()
        Y = labelbin.fit_transform(y).astype(np.float64)
        if Y.shape[1] == 1:
            Y = np.concatenate((1 - Y, Y), axis = 1)

        self.classes_ = labelbin.classes_

        # for sparse matrix, the "*" operator performs matrix multiplication
        # https://stackoverflow.com/questions/36782588/dot-product-sparse-matrices
        self.feature_count_ = Y.T * X
        self.class_count_ = Y.sum(axis = 0)

        # compute feature log probability:
        # number of a particular word in a particular class / total number of words in that class
        smoothed_count = self.feature_count_ + self.smooth
        smoothed_class = np.sum(smoothed_count, axis = 1)
        self.feature_log_prob_ = (np.log(smoothed_count) -
                                  np.log(smoothed_class.reshape(-1, 1)))

        # compute class log prior:
        # number of observation in a particular class / total number of observation
        self.class_log_prior_ = (np.log(self.class_count_) -
                                 np.log(self.class_count_.sum()))
        return self

    def predict(self, X):
        """
        Perform classification for input data X.

        Parameters
        ----------
        X : 2d ndarray, shape [n_samples, n_features]
            Input data

        Returns
        -------
        pred_class : 1d ndarray, shape [n_samples]
            Predicted label for X
        """
        joint_prob = self._joint_log_likelihood(X)
        pred_class = self.classes_[np.argmax(joint_prob, axis = 1)]
        return pred_class

    def predict_proba(self, X):
        """
        Return probability estimates for input data X.

        Parameters
        ----------
        X : 2d ndarray, shape [n_samples, n_features]
            Input data

        Returns
        -------
        pred_proba : 2d ndarray, shape [n_samples, n_classes]
            Returns the probability of the samples for each class.
            The columns correspond to the classes in sorted
            order, as they appear in the attribute `classes_`.
        """
        joint_prob = self._joint_log_likelihood(X)

        # a crude implementation would be to take a exponent
        # and perform a normalization
        # temp = np.exp(joint_prob)
        # temp / temp.sum(axis = 1, keepdims = True)
        # but this would be numerically unstable
        # https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp/
        joint_prob_norm = logsumexp(joint_prob, axis = 1, keepdims = True)
        pred_proba = np.exp(joint_prob - joint_prob_norm)
        return pred_proba

    def _joint_log_likelihood(self, X):
        """
        Compute the unnormalized posterior log probability of X, which is
        the features' joint log probability (feature log probability times
        the number of times that word appeared in that document) times the
        class prior (since we're working in log space, it becomes an addition)
        """
        joint_prob = X * self.feature_log_prob_.T + self.class_log_prior_
        return joint_prob
In [11]:
pred = multinomial_nb(
    X_train_dtm.toarray(), y_train, X_test_dtm.toarray())
print('crude implementation', pred)

nb = NaiveBayes()
nb.fit(X_train_dtm, y_train)
pred = nb.predict(X_test_dtm)
print('efficient implementation', pred)

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
pred = nb.predict(X_test_dtm)
print('library implementation', pred)
crude implementation ['c']
efficient implementation ['c']
library implementation ['c']
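
The custom NaiveBayes class also implements predict_proba, so as a final check we can compare its normalized class probabilities against scikit-learn's (the nb_custom / nb_sklearn names below are only for illustration; both should come out at roughly 0.69 for class c and 0.31 for class j, consistent with the hand computation above):

nb_custom = NaiveBayes()
nb_custom.fit(X_train_dtm, y_train)
print('custom probabilities ', nb_custom.predict_proba(X_test_dtm))

nb_sklearn = MultinomialNB()
nb_sklearn.fit(X_train_dtm, y_train)
print('sklearn probabilities', nb_sklearn.predict_proba(X_test_dtm))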

Pros and Cons of Naive Bayes

We'll end this notebook with this algorithm's pros and cons.

Pros:

  • Extremely fast to train and apply, and reliably a high-bias/low-variance classifier (less likely to overfit).
  • Handles extraneous features well, meaning it's robust to irrelevant features.
  • Famously good at text classification, e.g. spam filtering, and at domains with many equally important features, which tend to be a problem for other kinds of classifiers, in particular tree-based algorithms.
  • No parameter tuning is required.

Cons:

  • Conditional independence is not always a valid assumption, thus can be outperformed by other methods.
  • Predicted probabilities are not well-calibrated.

Reference