Unsupervised learning - 4

Discovering interpretable features

Non-negative matrix factorization (NMF)

● NMF = "non-negative matrix factorization"

● Dimension reduction technique

● NMF models are interpretable (unlike PCA)

● Easy to interpret means easy to explain!

● However, all sample features must be non-negative (>= 0)
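A minimal sketch of the API (the array below is made up for illustration): scikit-learn's NMF decomposes a non-negative matrix X into non-negative factors W and H whose product approximates X.

import numpy as np
from sklearn.decomposition import NMF

# a tiny made-up non-negative array: 4 samples x 3 features
X = np.array([[1.0, 0.5, 0.0],
              [0.2, 1.3, 0.1],
              [0.0, 0.4, 2.0],
              [1.1, 0.0, 0.3]])

model = NMF(n_components=2)
W = model.fit_transform(X)    # transformed samples, shape (4, 2)
H = model.components_         # learned parts, shape (2, 3)

print(W.shape, H.shape)       # W.dot(H) approximates X; W and H are non-negative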

Interpretable parts

● NMF expresses documents as combinations of topics (or "themes")

● NMF expresses images as combinations of patterns
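As a rough illustration of the topic idea, here is a sketch on a tiny made-up word-count array (the vocabulary and counts are invented): each row of components_ can be read as a topic, and its largest entries are that topic's characteristic words.

import numpy as np
from sklearn.decomposition import NMF

words = ['cat', 'dog', 'economy', 'market']      # hypothetical vocabulary
docs = np.array([[3, 2, 0, 0],                   # pet-themed documents
                 [4, 1, 0, 0],
                 [0, 0, 5, 3],                   # finance-themed documents
                 [0, 1, 2, 4]], dtype=float)

nmf = NMF(n_components=2)
doc_topics = nmf.fit_transform(docs)             # documents as mixtures of topics

for k, topic in enumerate(nmf.components_):
    top = np.argsort(topic)[::-1][:2]            # the two strongest words per topic
    print('topic %d: %s' % (k, [words[i] for i in top]))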

In [1]:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(plt.imread('img.jpg'))
plt.axis('off')
Out[1]:
(-0.5, 740.5, 246.5, -0.5)

Example word-frequency array

● Word frequency array, 4 words, many documents

● Measure presence of words in each document using "tf-idf"

● "tf" = frequency of word in document

● "idf" reduces influence of frequent words

Source: http://www.tfidf.com/

Term Frequency

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in short ones. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
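A small sketch of this computation on an invented sentence:

# hypothetical example document, split into terms
doc = 'the cat sat on the mat with the other cat'.split()

tf_cat = doc.count('cat') / float(len(doc))   # 2 occurrences / 10 terms
print(tf_cat)                                  # 0.2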

Inverse Document Frequency

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log(Total number of documents / Number of documents with term t in it).

Example:

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (i.e., tf) for cat is then 3 / 100 = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. The inverse document frequency (i.e., idf), using a base-10 logarithm, is log(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
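Checking the arithmetic directly (the 4 assumes a base-10 logarithm, as noted above):

import math

tf = 3 / 100.0                          # 'cat' appears 3 times in a 100-word document
idf = math.log10(10000000.0 / 1000.0)   # base-10 log of (total docs / docs containing 'cat')

print(tf, idf, tf * idf)                # 0.03, 4.0, 0.12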

In [2]:
from sklearn.decomposition import PCA
# extract the bare class name from the class's repr string
print str(PCA).split('.')[-1].replace("'>",'')
PCA

Curious

Apply dimension reduction techniques to an image:

  • PCA
  • NMF
  • SVD (TruncatedSVD in scikit-learn)
In [3]:
cat = plt.imread('cat.jpg')
print cat.shape
plt.imshow(cat)
(194L, 259L, 3L)
Out[3]:
<matplotlib.image.AxesImage at 0xc951e80>

Write a function for comparison

  • take an image matrix, reduce and reconstruct it with each model, and plot the results
In [4]:
def curiosity(image, n):
    from sklearn.decomposition import PCA, NMF, TruncatedSVD

    # turn a color image into grayscale (NMF needs a 2-D non-negative matrix)
    if image.ndim == 3:
        image = image.mean(axis=2)

    models = [PCA, NMF, TruncatedSVD]
    results = []

    plt.figure(figsize=(12, 12))

    for c, model_class in enumerate(models, start=1):
        # fit the model with n components, then reconstruct the image from them
        model = model_class(n_components=n)
        reduced = model.fit_transform(image)
        reconstructed = model.inverse_transform(reduced)
        results.append(reconstructed)

        plt.subplot(1, 3, c)
        plt.imshow(reconstructed, cmap='gray')
        plt.axis('off')

        # extract the bare class name for the subplot title
        title = str(model_class).split('.')[-1].replace("'>", '')
        plt.title(title, size=15)

    print 'Comparison of 3 dimension reduction algorithms'
    print 'Number of components: ' + str(n)

    return results

In [5]:
test10 = curiosity(cat,10)
Comparison of 3 dimension reduction algorithms
Number of components: 10
In [6]:
test30 = curiosity(cat,30)
Comparison of 3 dimension reduction algorithms
Number of components: 30
In [7]:
test50 = curiosity(cat,50)
Comparison of 3 dimension reduction algorithms
Number of components: 50
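
One way to go beyond eyeballing the plots is to score the reconstructions numerically. A rough sketch, reusing the cat image and the test10 results defined above, computes the mean squared reconstruction error for each model:

import numpy as np

gray = cat.mean(axis=2)                      # same grayscale conversion as in curiosity()
names = ['PCA', 'NMF', 'TruncatedSVD']

for name, recon in zip(names, test10):       # test10 holds the n=10 reconstructions
    mse = np.mean((gray - recon) ** 2)
    print('%s: MSE = %.2f' % (name, mse))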