● More efficient storage and computation
● Remove less-informative "noise" features
● ... which cause problems for prediction tasks, e.g. classification, regression
● PCA = "Principal Component Analysis"
● Fundamental dimension reduction technique
● First step "decorrelation" (considered here)
● Second step reduces dimension (considered later)
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
iris = datasets.load_iris()
print(iris.feature_names)
● PCA is a scikit-learn component, like KMeans or StandardScaler
● fit() learns the transformation from given data
● transform() applies the learned transformation
● transform() can also be applied to new data
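The last point can be sketched as follows (a minimal illustration, not from the original notebook; the names X_train, X_new and pca_demo are ours): fit PCA on one subset of the iris measurements, then reuse the learned transformation on held-out samples.
from sklearn.model_selection import train_test_split
# Split the iris measurements into a "training" part and a "new data" part
X_train, X_new = train_test_split(iris.data, random_state=0)
pca_demo = PCA(n_components=2)
pca_demo.fit(X_train)                      # learn the rotation from X_train only
new_features = pca_demo.transform(X_new)   # apply the same rotation to unseen samples
print(new_features.shape)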
data = iris.data
print(data.shape)
print(data[:4])
plt.scatter(data[:,0], data[:,2], c=iris.target)
plt.axhline(color='r', y=0, linewidth=5)
plt.axvline(color='r')
pca = PCA(2)
pca_data = pca.fit_transform(data[:,[0,2]])
plt.scatter(pca_data[:,0], pca_data[:,1], c=iris.target)
plt.axhline(color='b')
plt.axvline(color='b')
● Rows of the transformed array correspond to samples
● Columns of the transformed array are the "PCA features"
● Each row gives the PCA feature values of the corresponding sample
● Measures linear correlation of features
● Value between -1 and 1
● Value of 0 means no linear correlation
● Features of a dataset are often correlated, e.g. total_phenols and od280
● PCA aligns the data with axes
● Resulting PCA features are not linearly correlated ("decorrelation")
from scipy.stats import pearsonr
# Calculate the Pearson correlation
correlation, pvalue = pearsonr(data[:,0], data[:,2])
print(pvalue)
print(correlation)
np.corrcoef(data[:,0], data[:,2])
pd.DataFrame(data[:,[0,2]]).corr()
df = pd.DataFrame(data[:,[0,2]], columns= ['x','y'])
df.head(3)
sns.lmplot(x='x', y='y', data=df, height=4, aspect=1.5)
plt.scatter(data[:,0], data[:,2], alpha=.6, c ='gold')
plt.axis('equal')
np.corrcoef(pca_data[:,0], pca_data[:,1])
pd.DataFrame(pca_data).corr()
print(data[:,[0,2]].shape, pca_data.shape)
pair = np.hstack([data[:,[0,2]], pca_data])
print pair.shape
df2 = pd.DataFrame(pca_data, columns= ['x','y'])
df2.head(3)
df['cat']='before_pca'
df2['cat']='after_pca'
df3 = pd.concat([df,df2])
print(df3.shape)
df3.tail(3)
sns.lmplot(x='x', y='y', data=df3, height=4, aspect=1.5, hue='cat')
sns.lmplot(x='x', y='y', data=df3, height=3.5, col='cat')
● "Principal components" = directions of variance
● PCA aligns principal components with the axes
● Available as the components_ attribute of the PCA object
● Each row defines a displacement from the mean
print(pca.components_)
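To check the "displacement from the mean" picture (an illustrative sketch, not part of the original notebook), each sample can be rebuilt as the mean plus a weighted sum of the principal components; this is exactly what inverse_transform() computes.
# Reconstruct the original 2-feature samples from the PCA features
reconstructed = pca.mean_ + pca_data.dot(pca.components_)
print(np.allclose(reconstructed, data[:, [0, 2]]))               # exact here: 2 components, 2 features
print(np.allclose(reconstructed, pca.inverse_transform(pca_data)))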
Intrinsic dimension of a flight path
● 2 features: longitude and latitude at points along a flight path
● Dataset appears to be 2-dimensional
● But can approximate using one feature: displacement along flight path
● Is intrinsically 1-dimensional
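A small synthetic sketch of this idea (the data below is made up for illustration): points scattered tightly around a straight "flight path" in (longitude, latitude) space have one dominant PCA variance, so their intrinsic dimension is 1.
rng = np.random.RandomState(0)
t = np.linspace(0, 10, 100)                          # displacement along the path
path = np.column_stack([t, 0.5 * t + rng.normal(scale=0.05, size=t.size)])
pca_path = PCA()
pca_path.fit(path)
print(pca_path.explained_variance_)                  # first value dwarfs the second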
plt.imshow(plt.imread('ch3_slides.jpg'))
plt.grid(False)
print(iris.feature_names)
# Make a scatter plot of the untransformed points
plt.scatter(iris.data[:,2], iris.data[:,3], alpha=.7, c='g')
# Create a PCA instance: model
model = PCA()
# Fit model to points
model.fit(iris.data[:,2:4])
# Get the mean of the samples: mean
mean = model.mean_
print mean
# Get the first principal component: first_pc
first_pc = model.components_[0,:]
print first_pc
# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
sec_pc = model.components_[1,:]
print sec_pc
# Plot second_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], sec_pc[0], sec_pc[1], color='b', width=0.01)
# Keep axes on same scale
plt.axis('equal')
plt.show()
● Intrinsic dimension = number of features needed to approximate the dataset
● Essential idea behind dimension reduction
● What is the most compact representation of the samples?
● Can be detected with PCA
● "versicolor", one of the iris species
● Only 3 features: sepal length, sepal width, and petal width
● Samples are points in 3D space
print(iris.target_names)
print(iris.feature_names)
iri = pd.DataFrame(iris.data, columns = ['sepal_l','sepal_w','petal_l','petal_w'])
iri['target']=iris.target
versicolor = iri[iri['target']==1]
versicolor = versicolor.iloc[:,[0,1,3]]
versicolor.head()
● Samples lie close to a flat 2-dimensional sheet
● So can be approximated using 2 features
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(6,4))
ax = Axes3D(fig)
ax.scatter(versicolor.iloc[:, 0], versicolor.iloc[:, 1], versicolor.iloc[:, 2])
● Scatter plots work only if samples have 2 or 3 features
● PCA identifies intrinsic dimension when samples have any number of features
● Intrinsic dimension = number of PCA features with significant variance
pca2 = PCA()
pca2.fit(versicolor)
versicolor_pca = pca2.transform(versicolor)
versicolor_pca[:3]
fig = plt.figure(figsize=(10,4))
ax= fig.add_subplot(1,2,1, projection='3d')
l1 = ax.scatter(versicolor.iloc[:, 0], versicolor.iloc[:, 1], versicolor.iloc[:, 2])
ax.set_title('before PCA')
ax = fig.add_subplot(1,2,2, projection='3d')
l2 = ax.scatter(versicolor_pca[:, 0], versicolor_pca[:, 1], versicolor_pca[:, 2])
ax.set_title('after PCA')
plt.show()
● Intrinsic dimension is number of PCA features with significant variance
● In our example: the first two PCA features
● So intrinsic dimension is 2
plt.figure(figsize=(4,4))
features = range(pca2.n_components_)
plt.bar(features, pca2.explained_variance_, alpha = .6, color='r')
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
bos = datasets.load_boston()
bos_data = bos.data
bos_data.shape
pca3 = PCA()
pca3.fit(bos_data)
plt.figure(figsize=(7,4))
features = range(pca3.n_components_)
plt.bar(features, pca3.explained_variance_, alpha = .6, color='g')
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.axvline(x=2, ls = '--', c='r')
plt.axvline(x=3, ls = '--', c='r')
The Boston dataset is 13-dimensional. But what is its intrinsic dimension?
Make a plot of the variances of the PCA features to find out.
As before, the samples form a 2D array, where each row represents one sample. You'll need to standardize the features first.
# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
# Create scaler: scaler
scaler = StandardScaler()
# Create a PCA instance: pca
pca4 = PCA()
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca4)
# Fit the pipeline to the Boston data
pipeline.fit(bos_data)
# Plot the explained variances
features = range(pca4.n_components_)
plt.figure(figsize=(9,4))
plt.bar(features, pca4.explained_variance_, color='r', alpha=.6)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
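A complementary way to read the same plot (a sketch we add here; the 0.95 threshold is just an illustrative choice): the cumulative explained variance ratio shows how many PCA features are needed to cover most of the variance in the standardized Boston data.
cumulative = np.cumsum(pca4.explained_variance_ratio_)
print(cumulative)
print('components for 95% of the variance:', np.argmax(cumulative >= 0.95) + 1)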
Dimension reduction
● Represents the same data, using fewer features
● Important part of machine-learning pipelines
● Can be performed using PCA
● PCA features are in decreasing order of variance
● Assumes the low variance features are "noise"
● … and high-variance features are informative
● E.g. PCA(n_components=2)
● Keeps the first 2 PCA features
● Intrinsic dimension is a good choice
● samples = array of iris measurements (4 features)
● species = list of iris species numbers
pca5 = PCA(2)
pca5.fit(iris.data)
transform = pca5.transform(iris.data)
print (transform.shape)
● PCA has reduced the dimension to 2
● Retained the 2 PCA features with highest variance
● Important information preserved: species remain distinct
plt.scatter(transform[:,0], transform[:,1], c=iris.target)
● Discards low variance PCA features
● Assumes the high variance features are informative
● Assumption typically holds in practice (e.g. for iris)
● Rows represent documents, columns represent words
● Entries measure presence of each word in each document
● ... measure using "tf-idf" (more later)
plt.imshow(plt.imread('tfidf.jpg'))
plt.grid(False)
plt.axis('off')
Create a tf-idf word-frequency array for a toy collection of documents.
For this, use TfidfVectorizer from sklearn. It transforms a list of documents into a word-frequency array, which it outputs as a csr_matrix.
It has fit() and transform() methods like other sklearn objects.
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()
# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)
# Print result of toarray() method
print(csr_mat.toarray())
# Get the words: words
words = tfidf.get_feature_names()
# Print words
print()
print(words)
csr_mat
print(csr_mat)
● Array is "sparse": most entries are zero
● Can use scipy.sparse.csr_matrix instead of NumPy array
● csr_matrix remembers only the non-zero entries (saves space!)
● scikit-learn PCA doesn't support csr_matrix
● Use scikit-learn TruncatedSVD instead
● Performs same transformation
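As a minimal sketch of the last two points (the names svd_demo and reduced are ours), TruncatedSVD accepts the sparse csr_mat from above directly, which PCA would not, and the sparse format only stores the non-zero entries.
from sklearn.decomposition import TruncatedSVD
print(csr_mat.nnz, 'non-zero entries out of', csr_mat.shape[0] * csr_mat.shape[1])
svd_demo = TruncatedSVD(n_components=2)              # same kind of transformation as PCA
reduced = svd_demo.fit_transform(csr_mat)            # works on the sparse tf-idf array
print(reduced.shape)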
plt.imshow(plt.imread('svd.jpg'))
plt.grid(False)
plt.axis('off')
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)
# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)
# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)
twi = pd.read_csv('tweets.csv')
twi.shape
text = twi['text']
text.shape
text.head()
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()
# Apply fit_transform to document: csr_mat
tweet_mat = tfidf.fit_transform(text)
# Get the words: words
word = tfidf.get_feature_names()
# Count the distinct words
len(word)
tweet_mat
# Import pandas
import pandas as pd
# Fit the pipeline to articles
pipeline.fit(tweet_mat)
# Calculate the cluster labels: labels
labels = pipeline.predict(tweet_mat)
# Create a DataFrame aligning cluster labels and tweet indices: df
df = pd.DataFrame({'label': labels, 'article': range(len(labels))})
# Display df sorted by cluster label
print(df.sort_values('label').head())
df['label'].unique()
df['label'].value_counts(ascending = False)
df.head()
index = list(df[df['label']==4]['article'])
index
text.iloc[index]
for i in text.iloc[index]: print(i + '\n')
text.iloc[df[df['label']==0]['article'].values]
for i in text.iloc[df[df['label']==0]['article'].values]: print(i + '\n')