● NMF = "non-negative matrix factorization"
● Dimension reduction technique
● NMF models are interpretable (unlike PCA)
● Easy to interpret means easy to explain!
● However, all sample features must be non-negative (>= 0)
● NMF expresses documents as combinations of topics (or "themes")
● NMF expresses images as combinations of patterns
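The non-negativity requirement above can be illustrated with a minimal sketch (toy values assumed): scikit-learn's NMF fits a non-negative matrix, but rejects one containing a negative entry.

```python
# Minimal sketch: NMF fits non-negative data; a negative entry is rejected.
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1.0, 0.5],
              [0.2, 0.8],
              [0.4, 0.1]])          # toy non-negative data

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(X)          # works: every entry of X is >= 0

try:
    model.fit_transform(np.array([[1.0, -0.5], [0.2, 0.8]]))
except ValueError:
    print('negative input rejected')
```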
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(plt.imread('img.jpg'))
plt.axis('off')
● Word frequency array, 4 words, many documents
● Measure presence of words in each document using "tf-idf"
● "tf" = frequency of word in document
● "idf" reduces influence of frequent words
TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:
IDF(t) = log(Total number of documents / Number of documents with term t in it), where the logarithm base varies by convention (the example below uses base 10).
Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
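The arithmetic in the worked example above can be checked directly (note the base-10 logarithm):

```python
# Check the worked example: "cat" occurs 3 times in a 100-word document,
# and appears in 1,000 out of 10,000,000 documents.
import math

tf = 3 / 100                          # term frequency: 0.03
idf = math.log10(10_000_000 / 1_000)  # inverse document frequency: 4.0
tfidf = tf * idf                      # tf-idf weight: 0.12
print(tf, idf, tfidf)
```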
from sklearn.decomposition import PCA
print(str(PCA).split('.')[-1].replace("'>", ''))
cat = plt.imread('cat.jpg')
print(cat.shape)
plt.imshow(cat)
def curiosity(image, n):
    from sklearn.decomposition import PCA, NMF, TruncatedSVD
    import numpy as np
    # turn a color image into grayscale by averaging the channels
    if image.ndim == 3:
        image = image.mean(axis=2)
    models = [PCA, NMF, TruncatedSVD]
    results = []
    c = 0
    plt.figure(figsize=(12, 12))
    for i in models:
        c += 1
        m = i(n)
        d = m.fit_transform(image)
        a = m.inverse_transform(d)
        results.append(a)
        plt.subplot(1, 3, c)
        plt.imshow(a, cmap='gray')
        plt.axis('off')
        title = str(i).split('.')[-1].replace("'>", '')
        plt.title(title, size=15)
    print('3 dimension reduction algorithms compressing the image')
    print('Number of components: ' + str(n))
    return results
test10 = curiosity(cat, 10)
test30 = curiosity(cat, 30)
test50 = curiosity(cat, 50)
test5 = curiosity(cat, 5)
● Follows fit() / transform() pattern
● Must specify number of components e.g. NMF(n_components=2)
● Works with NumPy arrays and with csr_matrix
● NMF has components
● ... just like PCA has principal components
● Dimension of components = dimension of samples
● Entries are non-negative
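The bullets above can be sketched with a toy non-negative matrix (values assumed); NMF accepts a dense NumPy array or a csr_matrix, and its components have the same dimension as the samples:

```python
# Sketch of the fit()/transform() pattern on a toy csr_matrix.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 3.0, 1.0],
                         [2.0, 1.0, 0.0],
                         [1.0, 2.0, 2.0]]))

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
model.fit(X)                           # learn the components
features = model.transform(X)          # NMF features for each sample

print(model.components_.shape)         # (2, 3): one row per component,
                                       # one column per sample feature
print((model.components_ >= 0).all())  # entries are non-negative
```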
df = pd.read_csv('tweets.csv')
df = df['text']
df.shape
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
sparse = tfidf.fit_transform(df)
print(sparse.shape)
sparse
# Import NMF
from sklearn.decomposition import NMF
# Import pandas
import pandas as pd
# Create an NMF instance: model
model = NMF(n_components=4)
# Fit the model to articles
model.fit(sparse)
# Transform the articles: nmf_features
nmf_features = model.transform(sparse)
# Print the NMF features
print(nmf_features.shape)
# Create a pandas DataFrame: nmf
nmf = pd.DataFrame(nmf_features)
nmf.tail()
print(nmf_features.shape)
print(model.components_.shape)
print()
print(sparse.shape)
● Word frequencies in each document
● Images encoded as arrays
● Audio spectrograms
● Purchase histories on e-commerce sites
● … and many more!
● Multiply components by feature values, and add up
● Can also be expressed as a product of matrices
● This is the "Matrix Factorization" in "NMF"
import numpy as np
ori = np.random.randint(1,100,30).reshape(5,6)
ori.shape
from sklearn.decomposition import NMF
nmf = NMF(5)  # at most min(ori.shape) components makes sense for a 5x6 matrix
after = nmf.fit_transform(ori)
after.shape
nmf.components_.shape
ori
np.dot(after, nmf.components_)
● For documents:
● NMF components represent topics
● NMF features combine topics into documents
● For images, NMF components are parts of images
plt.imshow(plt.imread('digit.jpg'))
● "Grayscale" image = no colors, only shades of gray
● Measure pixel brightness
● Represent with value between 0 and 1 (0 is black)
● Convert to 2D array
test = np.array([[ 0. ,1., 0.5], [ 1., 0., 1. ]])
test
plt.figure(figsize=(3,2))
plt.imshow(test, cmap='gray', interpolation='nearest')
● An 8x8 grayscale image of the moon, written as an array
test2 = np.array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
                  [0. , 0. , 0. , 0.7, 0.8, 0. , 0. , 0. ],
                  [0. , 0. , 0.8, 0.8, 0.9, 1. , 0. , 0. ],
                  [0. , 0.7, 0.9, 0.9, 1. , 1. , 1. , 0. ],
                  [0. , 0.8, 0.9, 1. , 1. , 1. , 1. , 0. ],
                  [0. , 0. , 0.9, 1. , 1. , 1. , 0. , 0. ],
                  [0. , 0. , 0. , 0.9, 1. , 0. , 0. , 0. ],
                  [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
test2
plt.figure(figsize=(3,2))
plt.imshow(test2, cmap='gray', interpolation='nearest')
test
test.flatten()
● Collection of images of the same size
● Encode as 2D array
● Each row corresponds to an image
● Each column corresponds to a pixel
● ... can apply NMF!
new = np.vstack([test.flatten(), test.flatten()+1, test.flatten()-1])
new
df = pd.read_csv('tweets.csv')
print(df.shape)
text = df['text']
print(text.shape)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(text)
matrix
from sklearn.decomposition import NMF
nmf = NMF(6)
nmf
nmf_data = nmf.fit_transform(matrix)
print(nmf_data.shape)
print(nmf.components_.shape)
# Create a DataFrame with one column per vocabulary term: components_df
components_df = pd.DataFrame(nmf.components_, columns=tfidf.get_feature_names_out())
# Print the shape of the DataFrame
print(components_df.shape)
components_df.head(3)
# Select row 3: component
component = components_df.iloc[3]
# Print result of nlargest
print(component.nlargest(25))
image = np.array([0,1,0,1,0,1]).reshape(2,-1)
plt.imshow(image, cmap='gray', interpolation='nearest')
sample = np.random.binomial(1,.5,6*100)
sample = sample.reshape(100,-1)
sample.shape
def show(array):
    plt.imshow(array.reshape(2, -1), cmap='gray', interpolation='nearest')
show(sample[3,:])
# Import NMF
from sklearn.decomposition import NMF
# Create an NMF model: model
model = NMF(n_components=6)
features = model.fit_transform(sample)
model.components_.shape
# Call show_as_image on each component
plt.figure(figsize=(12,10))
c=0
n = len(model.components_)
for component in model.components_:
    c += 1
    plt.subplot(n, 1, c)
    show(component)
Unlike NMF, PCA doesn't learn the parts of things. Its components do not correspond to topics (in the case of documents) or to parts of images, when trained on images. Verify this for yourself by inspecting the components of a PCA model fit to the dataset: unlike NMF components, PCA components can contain negative values (a plotting helper could, for example, color a pixel red if its value is negative).
# Import PCA
from sklearn.decomposition import PCA
# Create a PCA instance: model
model = PCA(n_components=6)
# Apply fit_transform to samples: features
features = model.fit_transform(sample)
# Call show_as_image on each component
plt.figure(figsize=(12,10))
c=0
n = len(model.components_)
for component in model.components_:
    c += 1
    plt.subplot(n, 1, c)
    show(component)
● Apply NMF to the word-frequency array
● NMF feature values describe the topics
● ... so similar documents have similar NMF feature values
● Compare NMF feature values?
● articles is a word frequency array
df = pd.read_csv('tweets.csv')
df = df['text']
df.shape
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
sparse = tfidf.fit_transform(df)
print(sparse.shape)
sparse
from sklearn.decomposition import NMF
nmf = NMF(6)
nmf
nmf_data = nmf.fit_transform(sparse)
print(nmf_data.shape)
print(nmf.components_.shape)
● Different versions of the same document have same topic proportions
● ... exact feature values may be different!
● E.g. because one version uses many meaningless words
● But all versions lie on the same line through the origin
plt.figure(figsize=(8,6))
plt.imshow(plt.imread('topic.jpg'))
plt.axis('off')
● Uses the angle between the lines
● Higher values mean more similar
● Maximum value is 1, when angle is 0°
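A minimal sketch (toy feature values assumed) of the two points above: scaling a document's feature values does not change its direction, so its cosine similarity with the original stays at the maximum value of 1.

```python
# Sketch: cosine similarity is scale-invariant, so a document and a
# "longer version" of it (feature values tripled) score a perfect 1.
import numpy as np

doc = np.array([0.2, 0.0, 0.5, 0.1])   # toy NMF feature values
longer = 3 * doc                        # same topic proportions, scaled up

# cosine of the angle between the two feature vectors
cos = doc.dot(longer) / (np.linalg.norm(doc) * np.linalg.norm(longer))
print(round(cos, 6))  # 1.0
```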
plt.figure(figsize=(4,4))
plt.imshow(plt.imread('cosine.jpg'))
plt.axis('off')
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_data)
print(norm_features.shape)
print(nmf_data.shape)
norm_features
df[0]
current_post = norm_features[0]
similarities = norm_features.dot(current_post)
print(similarities[:10])
print(similarities.shape)
norm_features_df = pd.DataFrame(norm_features, columns=['cat_'+str(i) for i in range(1,7)])
norm_features_df.head()
current_article = norm_features_df.iloc[250]
current_article
similar = norm_features_df.dot(current_article)
type(similar)
similar.nlargest(10).index
for i in df.iloc[similar.nlargest(10).index]:
    print(i)