● NMF = "non-negative matrix factorization"
● Dimension reduction technique
● NMF models are interpretable (unlike PCA)
● Easy to interpret means easy to explain!
● However, all sample features must be non-negative (>= 0)
● NMF expresses documents as combinations of topics (or "themes")
● NMF expresses images as combinations of patterns
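The non-negativity requirement above can be illustrated with a minimal sketch (toy values assumed): scikit-learn's NMF fits a non-negative matrix, but rejects one containing a negative entry.

```python
# Minimal sketch: NMF fits non-negative data; a negative entry is rejected.
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1.0, 0.5],
              [0.2, 0.8],
              [0.4, 0.1]])          # toy non-negative data

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(X)          # works: every entry of X is >= 0

try:
    model.fit_transform(np.array([[1.0, -0.5], [0.2, 0.8]]))
except ValueError:
    print('negative input rejected')
```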
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(plt.imread('img.jpg'))
plt.axis('off')
● Word frequency array, 4 words, many documents
● Measure presence of words in each document using "tf-idf"
● "tf" = frequency of word in document
● "idf" reduces influence of frequent words
TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:
IDF(t) = log(Total number of documents / Number of documents with term t in it), where the logarithm base varies by convention (the example below uses base 10).
Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
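The arithmetic in the worked example above can be checked directly (note the base-10 logarithm):

```python
# Check the worked example: "cat" occurs 3 times in a 100-word document,
# and appears in 1,000 out of 10,000,000 documents.
import math

tf = 3 / 100                          # term frequency: 0.03
idf = math.log10(10_000_000 / 1_000)  # inverse document frequency: 4.0
tfidf = tf * idf                      # tf-idf weight: 0.12
print(tf, idf, tfidf)
```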
from sklearn.decomposition import PCA
print(str(PCA).split('.')[-1].replace("'>", ''))
cat = plt.imread('cat.jpg')
print(cat.shape)
plt.imshow(cat)
def curiosity(image, n):
    from sklearn.decomposition import PCA, NMF, TruncatedSVD
    import numpy as np
    # turn a color image into grayscale by averaging the channels
    if image.ndim == 3:
        image = image.mean(axis=2)
    models = [PCA, NMF, TruncatedSVD]
    results = []
    c = 0
    plt.figure(figsize=(12, 12))
    for i in models:
        c += 1
        m = i(n)
        d = m.fit_transform(image)
        a = m.inverse_transform(d)
        results.append(a)
        plt.subplot(1, 3, c)
        plt.imshow(a, cmap='gray')
        plt.axis('off')
        title = str(i).split('.')[-1].replace("'>", '')
        plt.title(title, size=15)
    print('3 dimension reduction algorithms compressing the image')
    print('Number of components: ' + str(n))
    return results
test10 = curiosity(cat, 10)
test30 = curiosity(cat, 30)
test50 = curiosity(cat, 50)
test5 = curiosity(cat, 5)
● Follows fit() / transform() pattern
● Must specify number of components e.g. NMF(n_components=2)
● Works with NumPy arrays and with csr_matrix
● NMF has components
● ... just like PCA has principal components
● Dimension of components = dimension of samples
● Entries are non-negative
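The bullets above can be sketched with a toy non-negative matrix (values assumed); NMF accepts a dense NumPy array or a csr_matrix, and its components have the same dimension as the samples:

```python
# Sketch of the fit()/transform() pattern on a toy csr_matrix.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 3.0, 1.0],
                         [2.0, 1.0, 0.0],
                         [1.0, 2.0, 2.0]]))

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
model.fit(X)                           # learn the components
features = model.transform(X)          # NMF features for each sample

print(model.components_.shape)         # (2, 3): one row per component,
                                       # one column per sample feature
print((model.components_ >= 0).all())  # entries are non-negative
```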
df = pd.read_csv('tweets.csv')
df = df['text']
df.shape
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
sparse = tfidf.fit_transform(df)
print(sparse.shape)
sparse
# Import NMF
from sklearn.decomposition import NMF
# Import pandas
import pandas as pd
# Create an NMF instance: model
model = NMF(n_components=4)
# Fit the model to articles
model.fit(sparse)
# Transform the articles: nmf_features
nmf_features = model.transform(sparse)
# Print the NMF features
print(nmf_features.shape)
# Create a pandas DataFrame: nmf
nmf = pd.DataFrame(nmf_features)
nmf.tail()
print(nmf_features.shape)
print(model.components_.shape)
print()
print(sparse.shape)
● Word frequencies in each document
● Images encoded as arrays
● Audio spectrograms
● Purchase histories on e-commerce sites
● … and many more!
● Multiply components by feature values, and add up
● Can also be expressed as a product of matrices
● This is the "Matrix Factorization" in "NMF"
import numpy as np
ori = np.random.randint(1,100,30).reshape(5,6)
ori.shape
from sklearn.decomposition import NMF
nmf = NMF(5)  # at most min(ori.shape) components makes sense for a 5x6 matrix
after = nmf.fit_transform(ori)
after.shape
nmf.components_.shape
ori
np.dot(after, nmf.components_)
● For documents:
● NMF components represent topics
● NMF features combine topics into documents
● For images, NMF components are parts of images
plt.imshow(plt.imread('digit.jpg'))
● "Grayscale" image = no colors, only shades of gray
● Measure pixel brightness
● Represent with value between 0 and 1 (0 is black)
● Convert to 2D array
test = np.array([[ 0. ,1., 0.5], [ 1., 0., 1. ]])
test
plt.figure(figsize=(3,2))
plt.imshow(test, cmap='gray', interpolation='nearest')
● An 8x8 grayscale image of the moon, written as an array
test2 = np.array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
                  [0. , 0. , 0. , 0.7, 0.8, 0. , 0. , 0. ],
                  [0. , 0. , 0.8, 0.8, 0.9, 1. , 0. , 0. ],
                  [0. , 0.7, 0.9, 0.9, 1. , 1. , 1. , 0. ],
                  [0. , 0.8, 0.9, 1. , 1. , 1. , 1. , 0. ],
                  [0. , 0. , 0.9, 1. , 1. , 1. , 0. , 0. ],
                  [0. , 0. , 0. , 0.9, 1. , 0. , 0. , 0. ],
                  [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
test2
plt.figure(figsize=(3,2))
plt.imshow(test2, cmap='gray', interpolation='nearest')
test
test.flatten()
● Collection of images of the same size
● Encode as 2D array
● Each row corresponds to an image
● Each column corresponds to a pixel
● ... can apply NMF!
new = np.vstack([test.flatten(), test.flatten()+1, test.flatten()-1])
new
df = pd.read_csv('tweets.csv')
print(df.shape)
text = df['text']
print(text.shape)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(text)
matrix
from sklearn.decomposition import NMF
nmf = NMF(6)
nmf
nmf_data = nmf.fit_transform(matrix)
print(nmf_data.shape)
print(nmf.components_.shape)
# Create a DataFrame with one column per vocabulary term: components_df
components_df = pd.DataFrame(nmf.components_, columns=tfidf.get_feature_names_out())
# Print the shape of the DataFrame
print(components_df.shape)
components_df.head(3)
# Select row 3: component
component = components_df.iloc[3]
# Print result of nlargest
print(component.nlargest(25))
image = np.array([0,1,0,1,0,1]).reshape(2,-1)
plt.imshow(image, cmap='gray', interpolation='nearest')
sample = np.random.binomial(1,.5,6*100)
sample = sample.reshape(100,-1)
sample.shape
def show(array):
    plt.imshow(array.reshape(2, -1), cmap='gray', interpolation='nearest')
show(sample[3,:])
# Import NMF
from sklearn.decomposition import NMF
# Create an NMF model: model
model = NMF(n_components=6)
features = model.fit_transform(sample)
model.components_.shape
# Call show_as_image on each component
plt.figure(figsize=(12,10))
c=0
n = len(model.components_)
for component in model.components_:
    c += 1
    plt.subplot(n, 1, c)
    show(component)
Unlike NMF, PCA doesn't learn the parts of things. Its components do not correspond to topics (in the case of documents) or to parts of images, when trained on images. Verify this for yourself by inspecting the components of a PCA model fit to the dataset: unlike NMF components, PCA components can contain negative values (a plotting helper could, for example, color a pixel red if its value is negative).
# Import PCA
from sklearn.decomposition import PCA
# Create a PCA instance: model
model = PCA(n_components=6)
# Apply fit_transform to samples: features
features = model.fit_transform(sample)
# Call show_as_image on each component
plt.figure(figsize=(12,10))
c=0
n = len(model.components_)
for component in model.components_:
    c += 1
    plt.subplot(n, 1, c)
    show(component)
● Apply NMF to the word-frequency array
● NMF feature values describe the topics
● ... so similar documents have similar NMF feature values
● Compare NMF feature values?
● articles is a word frequency array
df = pd.read_csv('tweets.csv')
df = df['text']
df.shape
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
sparse = tfidf.fit_transform(df)
print(sparse.shape)
sparse
from sklearn.decomposition import NMF
nmf = NMF(6)
nmf
nmf_data = nmf.fit_transform(sparse)
print(nmf_data.shape)
print(nmf.components_.shape)
● Different versions of the same document have same topic proportions
● ... exact feature values may be different!
● E.g. because one version uses many meaningless words
● But all versions lie on the same line through the origin
plt.figure(figsize=(8,6))
plt.imshow(plt.imread('topic.jpg'))
plt.axis('off')
● Uses the angle between the lines
● Higher values mean more similar
● Maximum value is 1, when angle is 0°
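A minimal sketch (toy feature values assumed) of the two points above: scaling a document's feature values does not change its direction, so its cosine similarity with the original stays at the maximum value of 1.

```python
# Sketch: cosine similarity is scale-invariant, so a document and a
# "longer version" of it (feature values tripled) score a perfect 1.
import numpy as np

doc = np.array([0.2, 0.0, 0.5, 0.1])   # toy NMF feature values
longer = 3 * doc                        # same topic proportions, scaled up

# cosine of the angle between the two feature vectors
cos = doc.dot(longer) / (np.linalg.norm(doc) * np.linalg.norm(longer))
print(round(cos, 6))  # 1.0
```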
plt.figure(figsize=(4,4))
plt.imshow(plt.imread('cosine.jpg'))
plt.axis('off')
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_data)
print(norm_features.shape)
print(nmf_data.shape)
norm_features
df[0]
current_post = norm_features[0]
similarities = norm_features.dot(current_post)
print(similarities[:10])
print(similarities.shape)
norm_features_df = pd.DataFrame(norm_features, columns=['cat_'+str(i) for i in range(1,7)])
norm_features_df.head()
current_article = norm_features_df.iloc[250]
current_article
similar = norm_features_df.dot(current_article)
type(similar)
similar.nlargest(10).index
for i in df.iloc[similar.nlargest(10).index]:
    print(i)