In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))


Out[1]:
In [2]:
os.chdir(path)
import numpy as np
import pandas as pd

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules

from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.feature_extraction.text import CountVectorizer

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn

Ethen 2017-06-22 13:40:53

CPython 3.5.2
IPython 5.4.1

numpy 1.12.1
pandas 0.20.2
sklearn 0.18.1


# Chi-Square Feature Selection¶

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested. The benefits of performing feature selection before modeling your data are:

• Avoid Overfitting: Less redundant data gives performance boost to the model and results in less opportunity to make decisions based on noise
• Reduces Training Time: Less data means that algorithms train faster
In [3]:
# suppose we have the following toy text data
X = np.array(['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'he will call me'])
y = [1, 1, 2, 0]

# we'll convert it to a dense document-term matrix,
# so we can print a more readable output
vect = CountVectorizer()
X_dtm = vect.fit_transform(X)
X_dtm = X_dtm.toarray()
pd.DataFrame(X_dtm, columns = vect.get_feature_names())

Out[3]:
cab call he me please tonight will you
0 0 1 0 0 0 1 0 1
1 1 1 0 1 0 0 0 0
2 0 1 0 1 2 0 0 0
3 0 1 1 1 0 0 1 0

One common feature selection method that is used with text data is the Chi-Square feature selection. The $\chi^2$ test is used in statistics to test the independence of two events. More specifically in feature selection we use it to test whether the occurrence of a specific term and the occurrence of a specific class are independent. More formally, given a document $D$, we estimate the following quantity for each term and rank them by their score:

$$\chi^2(D, t, c) = \sum_{e_t \in \{0, 1\}} \sum_{e_c \in \{0, 1\}} \frac{ ( N_{e_te_c} - E_{e_te_c} )^2 }{ E_{e_te_c} }$$

Where

• $N$ is the observed frequency in and $E$ the expected frequency
• $e_t$ takes the value 1 if the document contains term $t$ and 0 otherwise
• $e_c$ takes the value 1 if the document is in class $c$ and 0 otherwise

For each feature (term), a corresponding high $\chi^2$ score indicates that the null hypothesis $H_0$ of independence (meaning the document class has no influence over the term's frequency) should be rejected and the occurrence of the term and class are dependent. In this case, we should select the feature for the text classification.

## Implementation¶

We first compute the observed count for each class. This is done by building a contingency table from an input $X$ (feature values) and $y$ (class labels). Each entry $i$, $j$ corresponds to some feature $i$ and some class $j$, and holds the sum of the $i_{th}$ feature's values across all samples belonging to the class $j$.

Note that although the feature values here are represented as frequencies, this method also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.

In [4]:
# binarize the output column,
# this makes computing the observed value a
# simple dot product
y_binarized = LabelBinarizer().fit_transform(y)
print(y_binarized)
print()

# our observed count for each class (the row)
# and each feature (the column)
observed = np.dot(y_binarized.T, X_dtm)
print(observed)

[[0 1 0]
[0 1 0]
[0 0 1]
[1 0 0]]

[[0 1 1 1 0 0 1 0]
[1 2 0 1 0 1 0 1]
[0 1 0 1 2 0 0 0]]


e.g. the second row of the observed array refers to the total count of the terms that belongs to class 1. Then we compute the expected frequencies of each term for each class.

In [5]:
# compute the probability of each class and the feature count;
# keep both as a 2 dimension array using reshape
class_prob = y_binarized.mean(axis = 0).reshape(1, -1)
feature_count = X_dtm.sum(axis = 0).reshape(1, -1)
expected = np.dot(class_prob.T, feature_count)
print(expected)

[[ 0.25  1.    0.25  0.75  0.5   0.25  0.25  0.25]
[ 0.5   2.    0.5   1.5   1.    0.5   0.5   0.5 ]
[ 0.25  1.    0.25  0.75  0.5   0.25  0.25  0.25]]

In [7]:
chisq = (observed - expected) ** 2 / expected
chisq_score = chisq.sum(axis = 0)
print(chisq_score)

[ 1.          0.          3.          0.33333333  6.          1.          3.
1.        ]


We can confirm our result with the scikit-learn library using the chi2 function. The following code chunk computes chi-square value for each feature. For the returned tuple, the first element is the chi-square scores, the scores are better if greater. The second element is the p-values, they are better if smaller.

In [8]:
chi2score = chi2(X_dtm, y)
chi2score

Out[8]:
(array([ 1.        ,  0.        ,  3.        ,  0.33333333,  6.        ,
1.        ,  3.        ,  1.        ]),
array([ 0.60653066,  1.        ,  0.22313016,  0.84648172,  0.04978707,
0.60653066,  0.22313016,  0.60653066]))

Scikit-learn provides a SelectKBest class that can be used with a suite of different statistical tests. It will rank the features with the statistical test that we've specified and select the top k performing ones (meaning that these terms is considered to be more relevant to the task at hand than the others), where k is also a number that we can tweak.

In [9]:
kbest = SelectKBest(score_func = chi2, k = 4)
X_dtm_kbest = kbest.fit_transform(X_dtm, y)
X_dtm_kbest

Out[9]:
array([[0, 0, 0, 1],
[0, 0, 0, 0],
[0, 2, 0, 0],
[1, 0, 1, 0]], dtype=int64)

For the Chi-Square feature selection we should expect that out of the total selected features, a small part of them are still independent from the class. In text classification, however, it rarely matters when a few additional terms are included the in the final feature set. All is good as long as the feature selection is ranking features with respect to their usefulness and is not used to make statements about statistical dependence or independence of variables.