Unsupervised Learning - 4

Discovering interpretable features

Non-negative matrix factorization (NMF)

● NMF = "non-negative matrix factorization"

● Dimension reduction technique

● NMF models are interpretable (unlike PCA)

● Easy to interpret means easy to explain!

● However, all sample features must be non-negative (>= 0)

Interpretable parts

● NMF expresses documents as combinations of topics (or "themes")

● NMF expresses images as combinations of patterns

In [1]:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(plt.imread('img.jpg'))
plt.axis('off')
Out[1]:
(-0.5, 740.5, 246.5, -0.5)

Example word-frequency array

● Word frequency array, 4 words, many documents

● Measure presence of words in each document using "tf-idf"

● "tf" = frequency of word in document

● "idf" reduces influence of frequent words

http://www.tfidf.com/

Term Frequency

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in short ones. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

Inverse Document Frequency

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. We therefore weigh down frequent terms and scale up rare ones by computing the following:

IDF(t) = log(Total number of documents / Number of documents with term t in it).

Example:

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then 3 / 100 = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4 (note the example uses a base-10 logarithm). Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
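A quick check of this arithmetic (the snippet below is illustrative, not part of the original notebook):

In [ ]:
import math

tf = 3 / 100.0                       # term frequency of "cat"
idf = math.log10(10000000 / 1000.0)  # base-10 log, matching the worked example
print(tf * idf)                      # 0.12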

In [2]:
from sklearn.decomposition import PCA
print(str(PCA).split('.')[-1].replace("'>", ''))
PCA

Curious

Apply dimensionality reduction techniques to an image:

  • PCA
  • NMF
  • SVD
In [3]:
cat = plt.imread('cat.jpg')
print(cat.shape)
plt.imshow(cat)
(194L, 259L, 3L)
Out[3]:
<matplotlib.image.AxesImage at 0xc951e80>

Write a function for comparison

  • read image matrix
In [4]:
def curiosity(image, n):
    from sklearn.decomposition import PCA, NMF, TruncatedSVD
    import numpy as np

    ## turn a color image into grayscale by averaging the channels
    if image.ndim == 3:
        image = image.mean(axis=2)

    models = [PCA, NMF, TruncatedSVD]
    results = []

    plt.figure(figsize=(12, 12))

    for c, model_class in enumerate(models, start=1):
        m = model_class(n_components=n)
        d = m.fit_transform(image)   # reduce to n components
        a = m.inverse_transform(d)   # reconstruct the image
        results.append(a)

        plt.subplot(1, 3, c)
        plt.imshow(a, cmap='gray')
        plt.axis('off')

        title = str(model_class).split('.')[-1].replace("'>", '')
        plt.title(title, size=15)

    print('3 dimension reduction algorithms, each compressing the image to')
    print('Number of components: ' + str(n))

    return results
    
In [5]:
test10 = curiosity(cat,10)
3 dimension reduction algorithms, each compressing the image to
Number of components: 10
In [6]:
test30 = curiosity(cat,30)
3 dimension reduction algorithms, each compressing the image to
Number of components: 30
In [7]:
test50 = curiosity(cat,50)
3 dimension reduction algorithms, each compressing the image to
Number of components: 50
In [8]:
test5 = curiosity(cat,5)
3 dimension reduction algorithms, each compressing the image to
Number of components: 5

Using scikit-learn NMF

● Follows fit() / transform() pattern

● Must specify number of components e.g. NMF(n_components=2)

● Works with NumPy arrays and with csr_matrix

NMF components

● NMF has components

● ... just like PCA has principal components

● Dimension of components = dimension of samples

● Entries are non-negative
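A minimal sketch of the fit() / transform() pattern on a toy array (the array X below is made up for illustration; every entry must be >= 0):

In [ ]:
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.2, 1.0]])

model = NMF(n_components=2)
features = model.fit_transform(X)  # shape (2 samples, 2 components)
print(model.components_)           # shape (2 components, 3 features); entries are non-negative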

In [9]:
df = pd.read_csv('tweets.csv')
df = df['text']
df.shape
Out[9]:
(615L,)
In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

sparse = tfidf.fit_transform(df)
print(sparse.shape)
sparse
(615, 1286)
Out[10]:
<615x1286 sparse matrix of type '<type 'numpy.float64'>'
	with 10947 stored elements in Compressed Sparse Row format>

apply NMF

  • assume there are 4 topics (categories) in the documents
In [11]:
# Import NMF
from sklearn.decomposition import NMF
# Import pandas
import pandas as pd

# Create an NMF instance: model
model = NMF(n_components=4)

# Fit the model to articles
model.fit(sparse)

# Transform the articles: nmf_features
nmf_features = model.transform(sparse)

# Print the NMF features
print(nmf_features.shape)

# Create a pandas DataFrame: nmf
nmf = pd.DataFrame(nmf_features)
(615L, 4L)

Each row (article) gets a weight for each column (topic group):

In [12]:
nmf.tail()
Out[12]:
            0         1         2         3
610  0.009678  0.006492  0.021078  0.007579
611  0.011301  0.004056  0.019930  0.004390
612  0.280871  0.000000  0.000000  0.000000
613  0.280871  0.000000  0.000000  0.000000
614  0.280871  0.000000  0.000000  0.000000
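One way to read this table: the column with the largest weight is the article's dominant topic. A quick sketch:

In [ ]:
# dominant topic per article; idxmax returns the column label of the largest weight
print(nmf.idxmax(axis=1).tail())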

NMF

Reconstruction of a sample

In [13]:
print(nmf_features.shape)
print(model.components_.shape)
print()
print(sparse.shape)
(615L, 4L)
(4L, 1286L)

(615, 1286)
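Multiplying one article's feature row by the components approximately reconstructs its tf-idf row; a sketch:

In [ ]:
# reconstruct article 0 from its 4 NMF features
reconstruction = nmf_features[0].dot(model.components_)
print(reconstruction.shape)  # (1286,), the same dimension as one tf-idf row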

NMF fits non-negative data only

● Word frequencies in each document

● Images encoded as arrays

● Audio spectrograms

● Purchase histories on e-commerce sites

● … and many more!

Sample reconstruction

● Multiply components by feature values, and add up

● Can also be expressed as a product of matrices

● This is the "Matrix Factorization" in "NMF"

https://en.wikipedia.org/wiki/Non-negative_matrix_factorization

Simulate

  • the product of the two factor matrices
In [15]:
import numpy as np
ori = np.random.randint(1,100,30).reshape(5,6)
ori.shape
Out[15]:
(5L, 6L)
In [16]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=77)  # more components than either matrix dimension, so reconstruction is nearly exact

after = nmf.fit_transform(ori)
after.shape
Out[16]:
(5L, 77L)
In [17]:
nmf.components_.shape
Out[17]:
(77L, 6L)
In [18]:
ori
Out[18]:
array([[14, 64, 97, 64, 92, 95],
       [42, 23, 94, 10, 26, 89],
       [35, 98, 41, 13, 18, 82],
       [42, 53, 38, 73, 10, 55],
       [79,  9, 82, 16, 56, 32]])
In [19]:
np.dot(after, nmf.components_)
Out[19]:
array([[ 14.00000006,  63.99994588,  97.00000111,  63.99999669,
         92.        ,  94.99999999],
       [ 41.99999999,  23.03224165,  93.99999996,   9.99996018,
         26.        ,  88.99999991],
       [ 35.        ,  98.        ,  41.00001475,  13.        ,
         18.        ,  82.        ],
       [ 42.        ,  53.00004969,  37.99999898,  73.00000304,
         10.        ,  55.        ],
       [ 79.00000001,   9.00063528,  82.00000003,  16.00005237,
         56.        ,  32.00000009]])

NMF components

● For documents:

● NMF components represent topics

● NMF features combine topics into documents

● For images, NMF components are parts of images

In [20]:
plt.imshow(plt.imread('digit.jpg'))
Out[20]:
<matplotlib.image.AxesImage at 0x14406e80>

Grayscale images

● "Grayscale" image = no colors, only shades of gray

● Measure pixel brightness

● Represent with value between 0 and 1 (0 is black)

● Convert to 2D array

In [25]:
test = np.array([[0., 1., 0.5], [1., 0., 1.]])
test
Out[25]:
array([[ 0. ,  1. ,  0.5],
       [ 1. ,  0. ,  1. ]])
In [31]:
plt.figure(figsize=(3,2))
plt.imshow(test, cmap='gray', interpolation='nearest')
Out[31]:
<matplotlib.image.AxesImage at 0xd910198>

Grayscale image example

● An 8x8 grayscale image of the moon, written as an array

In [36]:
test2 = np.array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
                  [0. , 0. , 0. , 0.7, 0.8, 0. , 0. , 0. ],
                  [0. , 0. , 0.8, 0.8, 0.9, 1. , 0. , 0. ],
                  [0. , 0.7, 0.9, 0.9, 1. , 1. , 1. , 0. ],
                  [0. , 0.8, 0.9, 1. , 1. , 1. , 1. , 0. ],
                  [0. , 0. , 0.9, 1. , 1. , 1. , 0. , 0. ],
                  [0. , 0. , 0. , 0.9, 1. , 0. , 0. , 0. ],
                  [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
test2
Out[36]:
array([[ 0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0.7,  0.8,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0.8,  0.8,  0.9,  1. ,  0. ,  0. ],
       [ 0. ,  0.7,  0.9,  0.9,  1. ,  1. ,  1. ,  0. ],
       [ 0. ,  0.8,  0.9,  1. ,  1. ,  1. ,  1. ,  0. ],
       [ 0. ,  0. ,  0.9,  1. ,  1. ,  1. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0.9,  1. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ]])
In [37]:
plt.figure(figsize=(3,2))
plt.imshow(test2, cmap='gray', interpolation='nearest')
Out[37]:
<matplotlib.image.AxesImage at 0xa21a588>

Grayscale images as flat arrays

● Enumerate the entries

● Row-by-row

● From left to right

In [38]:
test
Out[38]:
array([[ 0. ,  1. ,  0.5],
       [ 1. ,  0. ,  1. ]])
In [39]:
test.flatten()
Out[39]:
array([ 0. ,  1. ,  0.5,  1. ,  0. ,  1. ])

Encoding a collection of images

● Collection of images of the same size

● Encode as 2D array

● Each row corresponds to an image

● Each column corresponds to a pixel

● ... can apply NMF!

In [50]:
new = np.vstack([test.flatten(), test.flatten()+1, test.flatten()-1])
new
Out[50]:
array([[ 0. ,  1. ,  0.5,  1. ,  0. ,  1. ],
       [ 1. ,  2. ,  1.5,  2. ,  1. ,  2. ],
       [-1. ,  0. , -0.5,  0. , -1. ,  0. ]])
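Note that the third row above contains negative entries, so NMF could not actually be fit to this array as-is. A sketch that clips to non-negative values first:

In [ ]:
import numpy as np
from sklearn.decomposition import NMF

non_neg = np.clip(new, 0, None)  # NMF requires every entry >= 0
model = NMF(n_components=2)
features = model.fit_transform(non_neg)
print(features.shape)  # (3, 2)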

NMF learns topics of documents

  • When NMF is applied to documents, the components correspond to topics, and the NMF features reconstruct each document as a combination of those topics.

Build an NMF model on tweets

In [53]:
df = pd.read_csv('tweets.csv')
print(df.shape)
text = df['text']
print(text.shape)
(615, 33)
(615L,)
In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

matrix = tfidf.fit_transform(text)
matrix
Out[57]:
<615x1286 sparse matrix of type '<type 'numpy.float64'>'
	with 10947 stored elements in Compressed Sparse Row format>

NMF

  • assume 6 topics
In [58]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=6)
nmf
Out[58]:
NMF(alpha=0.0, beta=1, eta=0.1, init=None, l1_ratio=0.0, max_iter=200,
  n_components=6, nls_max_iter=2000, random_state=None, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)
In [60]:
nmf_data = nmf.fit_transform(matrix)
print(nmf_data.shape)
print(nmf.components_.shape)
(615L, 6L)
(6L, 1286L)

nmf.components_

  • stores information about each of the 6 topics
    • the weight of every word in each topic
In [63]:
# Create a DataFrame: components_df
components_df = pd.DataFrame(nmf.components_, columns=['words_'+str(i) for i in range(1,nmf.components_.shape[-1]+1)])

# Print the shape of the DataFrame
print(components_df.shape)
(6, 1286)
In [66]:
components_df.head(3)
Out[66]:
    words_1  words_2   words_3   words_4   words_5  words_6  words_7  words_8   words_9  words_10  ...
0  0.000000      0.0  0.000053  0.000000  0.000000      0.0      0.0      0.0  0.000094  0.000000  ...
1  0.000000      0.0  0.000271  0.000034  0.000103      0.0      0.0      0.0  0.000000  0.000105  ...
2  0.000443      0.0  0.003419  0.000000  0.000641      0.0      0.0      0.0  0.002472  0.000000  ...

   ...  words_1277  words_1278  words_1279  words_1280  words_1281  words_1282  words_1283  words_1284  words_1285  words_1286
0  ...    0.000000    0.000136    0.000644    0.000000         0.0    0.000000         0.0         0.0    0.000000    0.000000
1  ...    0.000000    0.001443    0.000851    0.000078         0.0    0.000020         0.0         0.0    0.000000    0.000000
2  ...    0.005707    0.009409    0.002365    0.000000         0.0    0.004375         0.0         0.0    0.000137    0.000326

3 rows × 1286 columns

In [73]:
# Select row 3: component
component = components_df.iloc[3]
# Print result of nlargest
print(component.nlargest(25))
words_154     0.753039
words_1275    0.499014
words_213     0.484263
words_523     0.484263
words_614     0.484263
words_721     0.484263
words_1047    0.484263
words_1089    0.484263
words_1136    0.484263
words_1115    0.482416
words_1005    0.481395
words_706     0.480707
words_1038    0.480517
words_146     0.480323
words_372     0.440320
words_913     0.166741
words_433     0.037140
words_1088    0.026446
words_478     0.025819
words_792     0.024422
words_85      0.019610
words_214     0.019610
words_264     0.019610
words_1011    0.019610
words_1163    0.019610
Name: 3, dtype: float64
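These labels are just column positions. To see the actual vocabulary terms behind a topic, map the indices back through the vectorizer; a sketch (get_feature_names is the older scikit-learn API; newer versions use get_feature_names_out):

In [ ]:
import numpy as np

words = tfidf.get_feature_names()  # vocabulary, aligned with the component columns
top = np.argsort(nmf.components_[3])[::-1][:10]
for idx in top:
    # note: column label 'words_154' above corresponds to words[153]
    print(words[idx], nmf.components_[3][idx])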

Explore the image-parts dataset; each image is 2x3

In [181]:
image = np.array([0,1,0,1,0,1]).reshape(2,-1)
In [182]:
plt.imshow(image, cmap='gray', interpolation='nearest')
Out[182]:
<matplotlib.image.AxesImage at 0x18557f98>

Simulate 100 samples of random 2x3 binary images

In [183]:
sample = np.random.binomial(1,.5,6*100)
sample = sample.reshape(100,-1)
sample.shape
Out[183]:
(100L, 6L)

Write a function to display an image

In [186]:
def show(array):
    plt.imshow(array.reshape(2,-1), cmap='gray', interpolation='nearest')
In [187]:
show(sample[3,:])

NMF learns the parts of images

  • each image has 6 pixels (features)
In [190]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF model: model
model = NMF(n_components=6)

features = model.fit_transform(sample)
model.components_.shape
Out[190]:
(6L, 6L)
In [193]:
# Call show on each component
plt.figure(figsize=(12,10))
c=0
n = len(model.components_)
for component in model.components_:
    c+=1
    plt.subplot(n, 1, c)
    show(component)

PCA doesn't learn parts

  • Unlike NMF, PCA doesn't learn the parts of things. Its components do not correspond to topics (for documents) or to parts of images (when trained on images). Verify this for yourself by inspecting the components of a PCA model fit to the dataset.

  • Unlike NMF components, PCA components may contain negative values; a plot that colors a pixel red where the value is negative makes this easy to see.

In [195]:
# Import PCA
from sklearn.decomposition import PCA

# Create a PCA instance: model
model = PCA(n_components=6)

# Apply fit_transform to samples: features
features = model.fit_transform(sample)

# Call show on each component
plt.figure(figsize=(12,10))
c=0
n = len(model.components_)
for component in model.components_:
    c+=1
    plt.subplot(n, 1, c)
    show(component)
    

Building recommender systems using NMF

Finding similar articles

● Engineer at a large online newspaper

● Task: recommend articles similar to article being read by customer

● Similar articles should have similar topics

Strategy

● Apply NMF to the word-frequency array

● NMF feature values describe the topics

● ... so similar documents have similar NMF feature values

● Compare NMF feature values?

Apply NMF to the word-frequency array

● articles is a word frequency array

In [244]:
df = pd.read_csv('tweets.csv')
df = df['text']
df.shape

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

sparse = tfidf.fit_transform(df)
print(sparse.shape)

from sklearn.decomposition import NMF
nmf = NMF(n_components=6)

nmf_data = nmf.fit_transform(sparse)
print(nmf_data.shape)
print(nmf.components_.shape)
(615, 1286)
(615L, 6L)
(6L, 1286L)

As you can see above, there are 615 posts, 1286 words, and 6 topics.

Versions of articles

● Different versions of the same document have same topic proportions

● ... exact feature values may be different!

● E.g. because one version uses many meaningless words

● But all versions lie on the same line through the origin

In [245]:
plt.figure(figsize=(8,6))
plt.imshow(plt.imread('topic.jpg'))
plt.axis('off')
Out[245]:
(-0.5, 857.5, 294.5, -0.5)

Cosine similarity

● Uses the angle between the lines

● Higher values mean more similar

● Maximum value is 1, when angle is 0°

In [246]:
plt.figure(figsize=(4,4))
plt.imshow(plt.imread('cosine.jpg'))
plt.axis('off')
Out[246]:
(-0.5, 343.5, 210.5, -0.5)
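For normalized feature vectors, the dot product equals the cosine of the angle between them; a minimal sketch:

In [ ]:
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a . b / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# vectors on the same line through the origin have similarity ~1.0
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))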

Calculating the cosine similarities

In [247]:
from sklearn.preprocessing import normalize

norm_features = normalize(nmf_data)
In [248]:
print(norm_features.shape)
print(nmf_data.shape)
norm_features
(615L, 6L)
(615L, 6L)
Out[248]:
array([[ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.],
       ..., 
       [ 1.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.]])
In [249]:
df[0]
Out[249]:
'RT @MeckeringBoy: Does Pauline secretly wear her\r\n\r\nGrab my pussy\r\n\r\nTee shirt? https://t.co/DquB05XVMk'
In [250]:
current_post = norm_features[0]

similarities = norm_features.dot(current_post)


print(similarities[:10])
print(similarities.shape)
[ 1.          0.          0.          0.99005859  0.66814589  0.95633953
  0.          0.31329499  0.          0.9309553 ]
(615L,)
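To pull out the most similar posts directly from this array, sort the similarities; a sketch:

In [ ]:
import numpy as np

# indices of the 5 posts most similar to post 0 (post 0 itself comes first)
print(np.argsort(similarities)[::-1][:5])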
In [256]:
norm_features_df = pd.DataFrame(norm_features, columns=['cat_'+str(i) for i in range(1,7)])
norm_features_df.head()
Out[256]:
     cat_1  cat_2     cat_3     cat_4     cat_5     cat_6
0  0.00000    0.0  0.000000  0.000000  1.000000  0.000000
1  0.00000    0.0  1.000000  0.000000  0.000000  0.000000
2  1.00000    0.0  0.000000  0.000000  0.000000  0.000000
3  0.00000    0.0  0.000000  0.140656  0.990059  0.000000
4  0.03113    0.0  0.249109  0.000000  0.668146  0.700397

Find the posts that most resemble a chosen post (here, post 250) using cosine similarity

In [269]:
current_article = norm_features_df.iloc[250]
current_article
Out[269]:
cat_1    0.023034
cat_2    0.000000
cat_3    0.259325
cat_4    0.000000
cat_5    0.965515
cat_6    0.000000
Name: 250, dtype: float64
In [270]:
similar = norm_features_df.dot(current_article)
type(similar)
Out[270]:
pandas.core.series.Series
In [274]:
similar.nlargest(10).index
Out[274]:
Int64Index([250, 502, 557, 524, 499, 18, 293, 108, 383, 60], dtype='int64')
In [277]:
for i in df.iloc[similar.nlargest(10).index]: print(i)
@SherryForChange @FoxNews @PaulBabeuAZ @POTUS I give more than I take, I care for my family , I like my Trump pussy hat.
They are a bunch of Fools just used to cause division! This is also the Tactics of George Soros!  President Trump n… https://t.co/ba0NfSp0x4
Trump is backing off on China.  Next back off on ObamaCare repeal, building a wall, and grabbing pussy. https://t.co/mxTmZMD924
@CTVNews THIS IS WHY . TRUMP IS A REAL MAN WHO TAKES CARE OF HIS OWN FIRST. UNLIKE OUR POLITICALLY CORRECT PUSSY. https://t.co/OCsAYwwctC
Of course, man who turns blind eye from #rapeculture could get job with pussy grabbing president. https://t.co/bLKDK2YC3m
@lordaedonis slaying https://t.co/CBYmVbRXzO
@lordaedonis slaying https://t.co/CBYmVbRXzO
BLACK PEOPLE CANT LIVE HERE! FYP CALLING ME DUMB! https://t.co/4jom9xATcj
BLACK PEOPLE CANT LIVE HERE! FYP CALLING ME DUMB! https://t.co/4jom9xATcj
The rock is the biggest pussy of them all https://t.co/zfy2FoA22X