● More efficient storage and computation
● Remove less-informative "noise" features
● ... which cause problems for prediction tasks, e.g. classification, regression
● PCA = "Principal Component Analysis"
● Fundamental dimension reduction technique
● First step "decorrelation" (considered here)
● Second step reduces dimension (considered later)
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
iris = datasets.load_iris()
print(iris.feature_names)
● PCA is a scikit-learn component, like KMeans or StandardScaler
● fit() learns the transformation from given data
● transform() applies the learned transformation
● transform() can also be applied to new data
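The last point can be sketched as follows (a minimal illustration, not from the original notebook; the names X_train, X_new and pca_demo are ours): fit PCA on one subset of the iris measurements, then reuse the learned transformation on held-out samples.
from sklearn.model_selection import train_test_split
# Split the iris measurements into a "training" part and a "new data" part
X_train, X_new = train_test_split(iris.data, random_state=0)
pca_demo = PCA(n_components=2)
pca_demo.fit(X_train)                      # learn the rotation from X_train only
new_features = pca_demo.transform(X_new)   # apply the same rotation to unseen samples
print(new_features.shape)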
data = iris.data
print(data.shape)
print(data[:4])
plt.scatter(data[:,0], data[:,2], c=iris.target)
plt.axhline(color='r', y=0, linewidth=5)
plt.axvline(color='r')
pca = PCA(2)
pca_data = pca.fit_transform(data[:,[0,2]])
plt.scatter(pca_data[:,0], pca_data[:,1], c=iris.target)
plt.axhline(color='b')
plt.axvline(color='b')
● Rows of the transformed array correspond to samples
● Columns of the transformed array are the "PCA features"
● Each row gives the PCA feature values of the corresponding sample
● Measures linear correlation of features
● Value between -1 and 1
● Value of 0 means no linear correlation
● Features of a dataset are often correlated, e.g. total_phenols and od280
● PCA aligns the data with axes
● Resulting PCA features are not linearly correlated ("decorrelation")
from scipy.stats import pearsonr
# Calculate the Pearson correlation
correlation, pvalue = pearsonr(data[:,0], data[:,2])
print(pvalue)
print(correlation)
np.corrcoef(data[:,0], data[:,2])
pd.DataFrame(data[:,[0,2]]).corr()
df = pd.DataFrame(data[:,[0,2]], columns= ['x','y'])
df.head(3)
sns.lmplot(x='x', y='y', data=df, height=4, aspect=1.5)
plt.scatter(data[:,0], data[:,2], alpha=.6, c ='gold')
plt.axis('equal')
np.corrcoef(pca_data[:,0], pca_data[:,1])
pd.DataFrame(pca_data).corr()
print(data[:,[0,2]].shape, pca_data.shape)
pair = np.hstack([data[:,[0,2]], pca_data])
print pair.shape
df2 = pd.DataFrame(pca_data, columns= ['x','y'])
df2.head(3)
df['cat']='before_pca'
df2['cat']='after_pca'
df3 = pd.concat([df,df2])
print(df3.shape)
df3.tail(3)
sns.lmplot(x='x', y='y', data=df3, height=4, aspect=1.5, hue='cat')
sns.lmplot(x='x', y='y', data=df3, height=3.5, col='cat')
● "Principal components" = directions of variance
● PCA aligns principal components with the axes
● Available as the components_ attribute of the PCA object
● Each row defines a displacement from the mean
print(pca.components_)
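To check the "displacement from the mean" picture (an illustrative sketch, not part of the original notebook), each sample can be rebuilt as the mean plus a weighted sum of the principal components; this is exactly what inverse_transform() computes.
# Reconstruct the original 2-feature samples from the PCA features
reconstructed = pca.mean_ + pca_data.dot(pca.components_)
print(np.allclose(reconstructed, data[:, [0, 2]]))               # exact here: 2 components, 2 features
print(np.allclose(reconstructed, pca.inverse_transform(pca_data)))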
Intrinsic dimension of a flight path
● 2 features: longitude and latitude at points along a flight path
● Dataset appears to be 2-dimensional
● But can approximate using one feature: displacement along flight path
● Is intrinsically 1-dimensional
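A small synthetic sketch of this idea (the data below is made up for illustration): points scattered tightly around a straight "flight path" in (longitude, latitude) space have one dominant PCA variance, so their intrinsic dimension is 1.
rng = np.random.RandomState(0)
t = np.linspace(0, 10, 100)                          # displacement along the path
path = np.column_stack([t, 0.5 * t + rng.normal(scale=0.05, size=t.size)])
pca_path = PCA()
pca_path.fit(path)
print(pca_path.explained_variance_)                  # first value dwarfs the second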
plt.imshow(plt.imread('ch3_slides.jpg'))
plt.grid(False)
print(iris.feature_names)
# Make a scatter plot of the untransformed points
plt.scatter(iris.data[:,2], iris.data[:,3], alpha=.7, c='g')
# Create a PCA instance: model
model = PCA()
# Fit model to points
model.fit(iris.data[:,2:4])
# Get the mean of the samples: mean
mean = model.mean_
print mean
# Get the first principal component: first_pc
first_pc = model.components_[0,:]
print first_pc
# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
sec_pc = model.components_[1,:]
print sec_pc
# Plot second_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], sec_pc[0], sec_pc[1], color='b', width=0.01)
# Keep axes on same scale
plt.axis('equal')
plt.show()
● Intrinsic dimension = number of features needed to approximate the dataset
● Essential idea behind dimension reduction
● What is the most compact representation of the samples?
● Can be detected with PCA
● "versicolor", one of the iris species
● Only 3 features: sepal length, sepal width, and petal width
● Samples are points in 3D space
print(iris.target_names)
print(iris.feature_names)
iri = pd.DataFrame(iris.data, columns = ['sepal_l','sepal_w','petal_l','petal_w'])
iri['target']=iris.target
versicolor = iri[iri['target']==1]
versicolor = versicolor.iloc[:,[0,1,3]]
versicolor.head()
● Samples lie close to a flat 2-dimensional sheet
● So can be approximated using 2 features
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(6,4))
ax = Axes3D(fig)
ax.scatter(versicolor.iloc[:, 0], versicolor.iloc[:, 1], versicolor.iloc[:, 2])
● Scatter plots work only if samples have 2 or 3 features
● PCA identifies intrinsic dimension when samples have any number of features
● Intrinsic dimension = number of PCA features with significant variance
pca2 = PCA()
pca2.fit(versicolor)
versicolor_pca = pca2.transform(versicolor)
versicolor_pca[:3]
fig = plt.figure(figsize=(10,4))
ax= fig.add_subplot(1,2,1, projection='3d')
l1 = ax.scatter(versicolor.iloc[:, 0], versicolor.iloc[:, 1], versicolor.iloc[:, 2])
ax.set_title('before PCA')
ax = fig.add_subplot(1,2,2, projection='3d')
l2 = ax.scatter(versicolor_pca[:, 0], versicolor_pca[:, 1], versicolor_pca[:, 2])
ax.set_title('after PCA')
plt.show()
● Intrinsic dimension is number of PCA features with significant variance
● In our example: the first two PCA features
● So intrinsic dimension is 2
plt.figure(figsize=(4,4))
features = range(pca2.n_components_)
plt.bar(features, pca2.explained_variance_, alpha = .6, color='r')
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
bos = datasets.load_boston()
bos_data = bos.data
bos_data.shape
pca3 = PCA()
pca3.fit(bos_data)
plt.figure(figsize=(7,4))
features = range(pca3.n_components_)
plt.bar(features, pca3.explained_variance_, alpha = .6, color='g')
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.axvline(x=2, ls = '--', c='r')
plt.axvline(x=3, ls = '--', c='r')
The Boston dataset is 13-dimensional. But what is its intrinsic dimension?
Make a plot of the variances of the PCA features to find out.
As before, the samples form a 2D array, where each row represents one sample. You'll need to standardize the features first.
# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
# Create scaler: scaler
scaler = StandardScaler()
# Create a PCA instance: pca
pca4 = PCA()
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca4)
# Fit the pipeline to the Boston data
pipeline.fit(bos_data)
# Plot the explained variances
features = range(pca4.n_components_)
plt.figure(figsize=(9,4))
plt.bar(features, pca4.explained_variance_, color='r', alpha=.6)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
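A complementary way to read the same plot (a sketch we add here; the 0.95 threshold is just an illustrative choice): the cumulative explained variance ratio shows how many PCA features are needed to cover most of the variance in the standardized Boston data.
cumulative = np.cumsum(pca4.explained_variance_ratio_)
print(cumulative)
print('components for 95% of the variance:', np.argmax(cumulative >= 0.95) + 1)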
Dimension reduction
● Represents the same data, using fewer features
● Important part of machine-learning pipelines
● Can be performed using PCA
● PCA features are in decreasing order of variance
● Assumes the low variance features are "noise"
● … and high-variance features are informative
● E.g. PCA(n_components=2)
● Keeps the first 2 PCA features
● Intrinsic dimension is a good choice
● samples = array of iris measurements (4 features)
● species = list of iris species numbers
pca5 = PCA(2)
pca5.fit(iris.data)
transform = pca5.transform(iris.data)
print (transform.shape)
● PCA has reduced the dimension to 2
● Retained the 2 PCA features with highest variance
● Important information preserved: species remain distinct
plt.scatter(transform[:,0], transform[:,1], c=iris.target)
● Discards low variance PCA features
● Assumes the high variance features are informative
● Assumption typically holds in practice (e.g. for iris)
● Rows represent documents, columns represent words
● Entries measure presence of each word in each document
● ... measure using "tf-idf" (more later)
plt.imshow(plt.imread('tfidf.jpg'))
plt.grid(False)
plt.axis('off')
Create a tf-idf word-frequency array for a toy collection of documents.
For this, use TfidfVectorizer from sklearn. It transforms a list of documents into a word-frequency array, which it outputs as a csr_matrix.
It has fit() and transform() methods like other sklearn objects.
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()
# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)
# Print result of toarray() method
print(csr_mat.toarray())
# Get the words: words
words = tfidf.get_feature_names()
# Print words
print()
print(words)
csr_mat
print(csr_mat)
● Array is "sparse": most entries are zero
● Can use scipy.sparse.csr_matrix instead of NumPy array
● csr_matrix remembers only the non-zero entries (saves space!)
● scikit-learn PCA doesn't support csr_matrix
● Use scikit-learn TruncatedSVD instead
● Performs same transformation
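As a minimal sketch of the last two points (the names svd_demo and reduced are ours), TruncatedSVD accepts the sparse csr_mat from above directly, which PCA would not, and the sparse format only stores the non-zero entries.
from sklearn.decomposition import TruncatedSVD
print(csr_mat.nnz, 'non-zero entries out of', csr_mat.shape[0] * csr_mat.shape[1])
svd_demo = TruncatedSVD(n_components=2)              # same kind of transformation as PCA
reduced = svd_demo.fit_transform(csr_mat)            # works on the sparse tf-idf array
print(reduced.shape)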
plt.imshow(plt.imread('svd.jpg'))
plt.grid(False)
plt.axis('off')
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)
# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)
# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)
twi = pd.read_csv('tweets.csv')
twi.shape
text = twi['text']
text.shape
text.head()
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()
# Apply fit_transform to document: csr_mat
tweet_mat = tfidf.fit_transform(text)
# Get the words: words
word = tfidf.get_feature_names()
# Count the distinct words
len(word)
tweet_mat
# Import pandas
import pandas as pd
# Fit the pipeline to articles
pipeline.fit(tweet_mat)
# Calculate the cluster labels: labels
labels = pipeline.predict(tweet_mat)
# Create a DataFrame aligning cluster labels and tweet indices: df
df = pd.DataFrame({'label': labels, 'article': range(len(labels))})
# Display df sorted by cluster label
print(df.sort_values('label').head())
df['label'].unique()
df['label'].value_counts(ascending = False)
df.head()
index = list(df[df['label']==4]['article'])
index
text.iloc[index]
for i in text.iloc[index]: print(i + '\n')
text.iloc[df[df['label']==0]['article'].values]
for i in text.iloc[df[df['label']==0]['article'].values]: print(i + '\n')