Text Similarity

All the code (text_similarity.R) and data (the data folder) used in this documentation can be found in this folder.

Text Similarity

One of the most common data-mining problems is finding similar items in a given dataset, and this applies to text data as well. Examples that come to mind include detecting plagiarism and news recommendation (don't show identical news articles to users), and the list goes on.

We'll use three documents as our toy example. The following section loads them in and preprocesses them by:

  • Removing all punctuation marks.
  • Transforming all letters to lowercase.
  • Collapsing extra whitespace into a single blank and trimming whitespace at the front and the end.
  • Splitting the text string into separate words.

A link to the dataset is provided at the end.

# environment setting
library(dplyr)
library(proxy)
library(stringr)
library(data.table)
setwd("/Users/ethen/machine-learning/clustering_old/text_similarity/data")

# read in original text 
( doc <- lapply( list.files(), readLines ) )
## [[1]]
## [1] "The sky is blue and the sun is bright."
## 
## [[2]]
## [1] "The sun in the sky is bright."
## 
## [[3]]
## [1] "We can see sun is bright, the sky is blue."
# preprocess text
doc1 <- lapply(doc, function(x) {
    text <- gsub("[[:punct:]]", "", x) %>% tolower()
    text <- gsub("\\s+", " ", text) %>% str_trim()  
    word <- strsplit(text, " ") %>% unlist()
    return(word)
})
# print only the first text to conserve space
doc1[[1]]
## [1] "the"    "sky"    "is"     "blue"   "and"    "the"    "sun"    "is"    
## [9] "bright"

K-Shingling

The first notion we'll introduce is Shingling, a common technique for representing documents as sets. Given a document, its k-shingles are all the possible consecutive substrings of length k found within it. An example with k = 3 is given below:

Shingling <- function(document, k) {
    shingles <- character( length = length(document) - k + 1 )

    for( i in 1:( length(document) - k + 1 ) ) {
        shingles[i] <- paste( document[ i:(i + k - 1) ], collapse = " " )
    }

    return( unique(shingles) )  
}

# "shingle" our example document, with k = 3
doc1 <- lapply(doc1, function(x) {
    Shingling(x, k = 3)
})
list( Original = doc[[1]], Shingled = doc1[[1]] )
## $Original
## [1] "The sky is blue and the sun is bright."
## 
## $Shingled
## [1] "the sky is"    "sky is blue"   "is blue and"   "blue and the" 
## [5] "and the sun"   "the sun is"    "sun is bright"

As you can see from the printed first document, by picking our k to be 3, the k-shingles of the first document consist of substrings of length 3. The first 3-shingle is the sky is, as it's the first consecutive substring of length 3 in the document; the second 3-shingle, sky is blue, is the next 3-word-long substring, obtained after excluding the first word of the document. And the list goes on.

Another thing to note is that a document's k-shingle set should consist only of unique k-shingles. For example, if the first document above contained more than one the sky is, it would still appear only once in that document's set of k-shingles.
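As a tiny illustration of this deduplication (a made-up vector of shingles, not part of our pipeline):

# a repeated shingle is kept only once in the shingle set
unique( c("the sky is", "sky is blue", "the sky is") )
## [1] "the sky is"  "sky is blue"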

Now that we have that in mind, we’ll construct a “characteristic” matrix that visualizes the relationships between our three documents. The matrix will be a boolean matrix, where its

  • rows = the elements of the universal set (every unique shingle appearing across all documents).
  • columns = one column per document.

Thus the matrix will have a 1 in row i and column j if and only if document j contains shingle i, and a 0 otherwise. An example is given below. (I will use the data frame data structure in place of a matrix throughout this documentation.)

# unique shingles sets across all documents
doc_dict <- unlist(doc1) %>% unique()

# "characteristic" matrix
M <- lapply(doc1, function(set, dict) {
    as.integer(dict %in% set)
}, dict = doc_dict) %>% data.frame() 

# set the names for both rows and columns
setnames( M, paste( "doc", 1:length(doc1), sep = "_" ) )
rownames(M) <- doc_dict
M
##                doc_1 doc_2 doc_3
## the sky is         1     1     1
## sky is blue        1     0     1
## is blue and        1     0     0
## blue and the       1     0     0
## and the sun        1     0     0
## the sun is         1     0     0
## sun is bright      1     0     1
## the sun in         0     1     0
## sun in the         0     1     0
## in the sky         0     1     0
## sky is bright      0     1     0
## we can see         0     0     1
## can see sun        0     0     1
## see sun is         0     0     1
## is bright the      0     0     1
## bright the sky     0     0     1

Looking at the first row of the matrix above, all three columns record a 1. This denotes that all three documents contain the 3-shingle the sky is. As for the second row, the cell values of [1, 0, 1] mean that document 2 does not contain the 3-shingle sky is blue, while documents 1 and 3 do. And so on.

  • Note that unlike our toy example, these “characteristic” matrices are almost always sparse. Therefore, in practice we usually represent this kind of matrix only by the positions in which the 1s appear, as that is far more space efficient. A minimal sketch of this idea follows below.
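For instance, a sparse view of our toy matrix M (purely illustrative, not used later on):

# store only the (row, column) positions of the 1 entries
sparse_entries <- which( as.matrix(M) == 1, arr.ind = TRUE )
head(sparse_entries)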

Jaccard Similarity

Now that we've transformed all the documents into shingle sets and used the characteristic matrix to visualize the relationships between them, the next question is: how do we measure the similarity between documents?

One well-known measure for determining the degree of similarity is the Jaccard Similarity. Given two sets of shingles, set1 and set2, the formula for this measurement is:

\[ \frac{ \text{| } set1 \cap set2 \text{ |} }{ \text{| } set1 \cup set2 \text{ |} } \]

This is equivalent to saying that for any two given sets, their jaccard similarity is the size of their intersection divided by the size of their union. For our “characteristic” matrix, if we restrict ourselves to only two columns (documents), then the rows can be divided into three types:

  • Type 1 : rows have 1s in both columns.
  • Type 2 : rows have 1 in one of the columns and 0 in the other.
  • Type 3 : rows have 0s in both columns.

Applying the jaccard similarity concept here, the similarity between the two sets is then \(\frac{ \text{Type 1} }{ \text{Type 1 + Type 2} }\), since Type 3 rows contribute to neither the intersection nor the union.
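To convince ourselves that this row-type view agrees with the set-based definition, here's a quick cross-check (a minimal sketch, using the shingle sets of documents 1 and 3 from above):

# jaccard similarity computed directly on two shingle sets
jaccard <- function(set1, set2) {
    length( intersect(set1, set2) ) / length( union(set1, set2) )
}
jaccard( doc1[[1]], doc1[[3]] )
## [1] 0.25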

The following section calculates the pairwise jaccard similarities for all three documents and prints out the original documents' content to get a feel for the performance. One quick way to calculate pairwise distances in R is the dist function, which computes and returns the distance/similarity matrix between either the rows or the columns of a matrix/data frame. We can also register our own distance/similarity measures in the proxy library's database.

# how similar are two given documents, jaccard similarity
JaccardSimilarity <- function(x, y) {
    non_zero <- which(x | y)
    set_intersect <- sum( x[non_zero] & y[non_zero] )
    set_union <- length(non_zero)
    return(set_intersect / set_union)
}

# create a new entry in the registry
pr_DB$set_entry( FUN = JaccardSimilarity, names = c("JaccardSimilarity") )

# jaccard similarity distance matrix 
d1 <- dist( t(M), method = "JaccardSimilarity" )

# delete the new entry
pr_DB$delete_entry("JaccardSimilarity")
d1
##            doc_1      doc_2
## doc_2 0.09090909           
## doc_3 0.25000000 0.08333333
doc
## [[1]]
## [1] "The sky is blue and the sun is bright."
## 
## [[2]]
## [1] "The sun in the sky is bright."
## 
## [[3]]
## [1] "We can see sun is bright, the sky is blue."

The similarity matrix d1 tells us that documents 1 and 3 are the most similar of the three. Looking at their original content, this matches our intuition, since both documents contain the substrings “sun is bright” and “the sky is blue” (despite the different word order).

Now that we have calculated the pairwise similarity between each pair of documents in a pretty straightforward way, we're done, right?! Well, yes and no. For small datasets this method works perfectly fine, but imagine we had a large number of documents to compare instead of just three, say N of them. We would then be forced to perform N choose 2 comparisons, which is obviously not going to scale well.
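To get a feel for the magnitude, suppose N were one million documents:

# number of pairwise comparisons for one million documents
choose(1e6, 2)  # roughly 5e11 comparisons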

Thus in the following sections, we’ll discuss techniques that can be used to help us save computations so that we can compare document similarities on a large scale. The first is Minhashing.

MinHash

Recall what we said about the “characteristic” matrix: these matrices tend to be sparse. That is, the set of unique shingles across all documents will be fairly large, making the computation of the jaccard similarity between documents a heavy burden.

This is where MinHash comes in. This algorithm provides a fast approximation to the jaccard similarity. The concept is to condense the large sets of unique shingles into much smaller representations called “signatures”. We then use these signatures alone to measure the similarity between documents. Note that it is impossible for these signatures to give the exact similarity, but the estimates they provide are close (the larger the number of signatures you choose, the more accurate the estimate).

For our toy example, suppose we wish to minhash our 16-row characteristic matrix into 4 signatures. The first step is then to generate 4 columns of randomly permuted rows (independent of each other).

An Implementation Trick:

To implement the idea of generating randomly permuted rows, we don't actually generate random permutations, since that is not feasible for large datasets. e.g. For a million shingles you would have to generate a million integers ..., not to mention that you have to do this for each signature you wish to generate. One way to avoid generating n permuted rows (here n is 4, because we chose the number of signatures to be 4) is to pick n hash functions of the form:

\[h(x) = (ax + b) \bmod c\]

Where :

  • x is the row numbers of your original characteristic matrix.
  • a and b are random numbers smaller than or equal to the maximum value of x, and each must be unique within a signature. e.g. For signature 1, if you generated a 5 to serve as your a coefficient, you must ensure that 5 does not serve as the a coefficient again within signature 1, though you can still use 5 as signature 1's b coefficient. This restriction refreshes for the next signature; that is, you can use 5 as the a or b coefficient for signature 2, but again, no duplicate 5s among signature 2's a coefficients, and so on.
  • c is a prime number slightly larger than the total number of shingle sets. You can look up a suitable prime number for your dataset, or compute one with the helper sketched right after this list. For our example dataset, the total row count is 16, thus the prime number 17 will do just fine.
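If you'd rather compute such a prime programmatically, below is a minimal trial-division sketch (next_prime is a hypothetical helper, not part of the original script):

# smallest prime strictly greater than n, found by trial division
next_prime <- function(n) {
    is_prime <- function(x) {
        if( x < 4 ) return( x > 1 )
        all( x %% 2:floor( sqrt(x) ) != 0 )
    }
    candidate <- n + 1
    while( !is_prime(candidate) ) {
        candidate <- candidate + 1
    }
    return(candidate)
}
next_prime( nrow(M) )
## [1] 17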

We can see for ourselves that this simple hash function does in fact generate randomly permuted rows:

# number of hash functions (signature number)
signature_num <- 4

# prime number
prime <- 17

# generate the unique coefficients  
set.seed(12345)
coeff_a <- sample( nrow(M), signature_num )
coeff_b <- sample( nrow(M), signature_num )

To be clear, we'll print out our 4 randomly generated hash functions:

  • \(( 12x + 8) \bmod 17\)
  • \(( 14x + 3) \bmod 17\)
  • \(( 11x + 5) \bmod 17\)
  • \(( 16x + 7) \bmod 17\)
# see if the hash functions do generate permutations
permute <- lapply(1:signature_num, function(s) {
    hash <- numeric( length = nrow(M) )
    for( i in 1:nrow(M) ) {
        hash[i] <- ( coeff_a[s] * i + coeff_b[s] ) %% prime
    }
    return(hash)
})
# convert to data frame
permute_df <- structure( permute, names = paste0( "hash_", 1:length(permute) ) ) %>%
              data.frame()
permute_df
##    hash_1 hash_2 hash_3 hash_4
## 1       3      0     16      6
## 2      15     14     10      5
## 3      10     11      4      4
## 4       5      8     15      3
## 5       0      5      9      2
## 6      12      2      3      1
## 7       7     16     14      0
## 8       2     13      8     16
## 9      14     10      2     15
## 10      9      7     13     14
## 11      4      4      7     13
## 12     16      1      1     12
## 13     11     15     12     11
## 14      6     12      6     10
## 15      1      9      0      9
## 16     13      6     11      8

From the output shown above, we can see that all 4 hash functions do in fact produce permuted row numbers. There are 0s, but they will not affect our computation, and you'll see why later.

Using our randomly permuted rows, we'll carry out the second step of the MinHash algorithm: calculating the signatures. The signature value of a column (document) under a given hash function is the hash value assigned to the first row, in the permuted order, in which that column has a 1.

See the example below. We'll combine our randomly permuted rows (generated by the hash functions) with our original characteristic matrix, and change the matrix's row names to row numbers to illustrate the calculation:

# use the first two signatures as an example
# bind with the original characteristic matrix
M1 <- cbind( M, permute_df[1:2] )
rownames(M1) <- 1:nrow(M1)
M1
##    doc_1 doc_2 doc_3 hash_1 hash_2
## 1      1     1     1      3      0
## 2      1     0     1     15     14
## 3      1     0     0     10     11
## 4      1     0     0      5      8
## 5      1     0     0      0      5
## 6      1     0     0     12      2
## 7      1     0     1      7     16
## 8      0     1     0      2     13
## 9      0     1     0     14     10
## 10     0     1     0      9      7
## 11     0     1     0      4      4
## 12     0     0     1     16      1
## 13     0     0     1     11     15
## 14     0     0     1      6     12
## 15     0     0     1      1      9
## 16     0     0     1     13      6

Consider the matrix shown above. We'll start with our first hash function (hash_1).

According to our first hash function's permuted row order, the first row is row 5 (why row 5? because 0 is the smallest value in our randomly generated permutation, and hash_1 has a 0 in row 5, making it the first row). We then look at row 5's entry for all three documents and ask "which document's entry at row 5 is a 1?". Aha!! Document 1's (doc_1) row 5 is a 1, thus the signature value for document 1 generated by our first hash function is 0. But document 2 and 3's entries at row 5 are both 0, thus we'll have to keep looking.

According to our first hash function's permuted row order, the second row is row 15 (1 is the second smallest value in our randomly generated permutation, and hash_1 takes the value 1 at row 15). We apply the same concept as above and find that document 3's (doc_3) entry at row 15 is a 1, thus the signature value for document 3 generated by our first hash function is 1. Note that since we're already done with document 1, we do not need to check whether it contains a 1 anymore. But we're still not done: document 2's entry at row 15 is still a 0, so we'll have to look deeper.

Again, checking the permuted row order for our first hash function, the third row is row 8. Is document 2's entry at row 8 a 1? Yes it is! And hash_1's value at row 8 is 2, so we're done calculating the signature values for all three columns using our first hash function: they are [0, 2, 1].

We can then apply the same notion to calculate the signature value of each column (document) using the second hash function, and so on for the third and fourth. A quick look at the second hash function (hash_2) shows that the first row according to its permuted row order is row 1, and all three documents have a 1 in row 1. Hence, the signature values generated by our second hash function for all three documents are [0, 0, 0].

We store these calculated signature values in a signature matrix along the way; it will later replace our original characteristic matrix. The following section calculates the signature values for all 3 columns using all 4 hash functions and prints out the resulting signature matrix.

# obtain the non-zero row indices for all columns
non_zero_rows <- lapply(1:ncol(M), function(j) {
    return( which( M[, j] != 0 ) )
})

# initialize signature matrix
SM <- matrix( data = NA, nrow = signature_num, ncol = ncol(M) )

# for each column (document)
for( i in 1:ncol(M) ) {
    # for each hash function (signature)'s value 
    for( s in 1:signature_num ) {
        SM[ s, i ] <- min( permute_df[, s][ non_zero_rows[[i]] ] )
    }
}

# set names for clarity
colnames(SM) <- paste( "doc", 1:length(doc), sep = "_" )
rownames(SM) <- paste( "minhash", 1:signature_num, sep = "_" )  
SM
##           doc_1 doc_2 doc_3
## minhash_1     0     2     1
## minhash_2     0     0     0
## minhash_3     3     2     0
## minhash_4     0     6     0

Note that although our signature matrix has the same number of columns as the original characteristic matrix, it only has n rows, where n is the number of hash functions we chose to generate (in this case, 4).
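We can verify this reduction directly:

# characteristic matrix versus the condensed signature matrix
dim(M)   # 16 shingles x 3 documents
dim(SM)  # 4 signatures x 3 documents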

As for calculating the pairwise similarity using this condensed signature matrix: a fancy way of putting it is that we calculate the fraction of hash functions that hash the two documents into the same bucket. In plain English, this is equivalent to calculating the fraction of rows in which the two columns take the same value. e.g. For documents 1 and 2 (columns 1 and 2), the similarity would be 0.25, because they agree in only 1 row out of a total of 4 (both columns' row 2 is 0).

# signature similarity
SigSimilarity <- function(x, y) mean( x == y )

# same trick to calculate the pairwise similarity 
pr_DB$set_entry( FUN = SigSimilarity, names = c("SigSimilarity") )
d2 <- dist( t(SM), method = "SigSimilarity" )
pr_DB$delete_entry("SigSimilarity")

list(SigSimilarity = d2, JaccardSimilarity = d1)
## $SigSimilarity
##       doc_1 doc_2
## doc_2  0.25      
## doc_3  0.50  0.25
## 
## $JaccardSimilarity
##            doc_1      doc_2
## doc_2 0.09090909           
## doc_3 0.25000000 0.08333333

Given the difference between the original jaccard similarity and the new similarity obtained from the signatures, you might be doubting whether this is a true estimate. Well, the short answer is that we said right at the beginning of this section that MinHash's purpose is to provide a fast “approximation” to the true jaccard similarity, and this example is simply way too small for the law of large numbers to assure that the estimates are close.
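As a sanity check, here is a minimal sketch (reusing M, prime and non_zero_rows from above; the count of 12 hash functions is an arbitrary choice, bounded by the 16 rows we sample coefficients from) that reruns the same pipeline with more hash functions, whose estimates should, on average, sit closer to the true jaccard similarities:

# rerun minhashing with 12 hash functions instead of 4
set.seed(54321)
signature_num2 <- 12
coeff_a2 <- sample( nrow(M), signature_num2 )
coeff_b2 <- sample( nrow(M), signature_num2 )

# hash value of every row under every hash function
permute_df2 <- sapply(1:signature_num2, function(s) {
    ( coeff_a2[s] * 1:nrow(M) + coeff_b2[s] ) %% prime
}) %>% data.frame()

# signature matrix: minimum hash value over each document's non-zero rows
SM2 <- sapply(non_zero_rows, function(idx) {
    apply( permute_df2[idx, , drop = FALSE], 2, min )
})

# estimated similarity for the doc_1 / doc_3 pair
mean( SM2[, 1] == SM2[, 3] )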

If your final goal is to compute the similarity of every possible pair (for text clustering, perhaps), then the next section will most likely be useless to you. However, if you simply wish to find the pairs that are most likely to be similar, then stay tuned, because in the next section we will be discussing Locality Sensitive Hashing, a technique that allows us to do just that.

Locality Sensitive Hashing

While the information necessary to compute the similarity between documents has been condensed from the original sparse characteristic matrix into a much smaller signature matrix, the underlying problem of needing to perform pairwise comparisons on all the documents still remains.

The concept of locality sensitive hashing (LSH) is that, given our signature matrix with n rows, we partition it into b bands, each consisting of r rows. This corresponds to the simple formula n = br, so when doing the partition, be sure to choose a b that divides n evenly. Using our signature matrix above, choosing the number of bands to be 2 gives:

# number of bands and rows
bands <- 2
rows <- nrow(SM) / bands

data.frame(SM) %>% 
mutate( band = rep( 1:bands, each = rows ) ) %>%
select( band, everything() )
##   band doc_1 doc_2 doc_3
## 1    1     0     2     1
## 2    1     0     0     0
## 3    2     3     2     0
## 4    2     0     6     0

What locality sensitive hashing tells us is: if the signature values of two documents agree in all the rows of at least one band, then these two documents are likely to be similar and should be compared (we list them as a candidate pair). Our small set of documents is probably a really bad example, since none of them will get chosen as a candidate pair. For instance, if the signature of document 2 in band 1 were [0, 0] instead of the current [2, 0], then documents 1 and 2 would become a candidate pair, as both of their rows in band 1 would take the same values of [0, 0].
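To make this concrete, here's a minimal sketch of how the candidate pairs could be enumerated (candidate_pairs and the pasted band keys are illustrative names, not from the original script). For our toy example every bucket holds a single document, so no pairs are returned:

# bucket the documents by their per-band signatures; any bucket holding
# more than one document yields candidate pairs
candidate_pairs <- lapply(1:bands, function(b) {
    band_rows <- ( (b - 1) * rows + 1 ):( b * rows )
    # one key per document: that band's signature values pasted together
    keys <- apply( SM[band_rows, , drop = FALSE], 2, paste, collapse = "-" )
    buckets <- split( colnames(SM), keys )
    Filter( function(docs) length(docs) > 1, buckets )
})
candidate_pairs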

Takeaways:

  1. MinHash, or by its formal name, the min-wise independent permutations locality sensitive hashing scheme, quickly estimates the jaccard similarity coefficient between two large sets. This is because the signatures tend to be much shorter than the number of unique shingles across all documents, making the comparison operation simpler and quicker.
  2. Also note that there's an R library, textreuse, that implements the techniques we've walked through in this documentation. You should be able to understand its vignette with no problem.

R Session Information

sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.0 stringr_1.0.0     proxy_0.4-15      dplyr_0.5.0      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5     rstudioapi_0.6  knitr_1.14      magrittr_1.5   
##  [5] xtable_1.8-2    R6_2.1.2        highr_0.6       tools_3.2.4    
##  [9] miniUI_0.1.1    DBI_0.4-1       htmltools_0.3.5 lazyeval_0.2.0 
## [13] yaml_2.1.13     assertthat_0.1  digest_0.6.9    tibble_1.2     
## [17] bookdown_0.1    shiny_0.13.2    questionr_0.5   formatR_1.4    
## [21] evaluate_0.9    mime_0.4        rmarkdown_1.1   stringi_1.0-1  
## [25] rmdformats_0.3  httpuv_1.3.3

Ethen Liu

November 19, 2015