● "t-SNE" : Creates a 2D map of a dataset (later)
● "Hierarchical clustering" (this video)
● Groups of living things can form a hierarchy
● Clusters are contained in one another
● Dendrograms are read from the bottom up
● Vertical lines represent clusters
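A minimal sketch of what "bottom up" means in terms of merge order, using four made-up 1D points (the data and the names `points` and `merges` are assumptions for illustration, not the course data). Each row of the SciPy linkage matrix records one merge, and the rows are in merge order, so the first row is the lowest join in the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four made-up 1D samples: two tight pairs that are far from each other
points = np.array([[0.0], [1.0], [10.0], [12.0]])

# Each row: [cluster_i, cluster_j, merge_distance, size_of_new_cluster]
merges = linkage(points, method='complete')

# Rows are listed in merge order, i.e. from the bottom of the dendrogram up
for step, (i, j, dist, size) in enumerate(merges):
    print(f"merge {step}: clusters {int(i)} and {int(j)} "
          f"at height {dist:.1f} -> {int(size)} samples")
```

Original samples are numbered 0 to 3 here; clusters created by a merge get the next free index (4, 5, ...).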
from data import sample, variety
print(sample.shape, len(variety))
# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
%matplotlib inline
# Calculate the linkage: mergings
mergings = linkage(sample, method='complete')
# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=variety,
           leaf_rotation=90,
           leaf_font_size=6)
plt.show()
# Calculate the linkage: mergings
mergings = linkage(sample, method='single')
# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=variety,
           leaf_rotation=90,
           leaf_font_size=6)
plt.show()
# Import normalize
from sklearn.preprocessing import normalize
# Normalize the samples: normalized_sample
normalized_sample = normalize(sample)
# Calculate the linkage: mergings
mergings = linkage(normalized_sample, method='complete')
# Plot the dendrogram
dendrogram(
    mergings,
    labels=variety,
    leaf_rotation=90.,
    leaf_font_size=6
)
plt.show()
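The cell above relies on `normalize`, which rescales each sample (row) to unit length, using the L2 norm by default. A small sketch on a made-up array (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up samples: the second row is the first row scaled by 10
toy = np.array([[3.0, 4.0],
                [30.0, 40.0]])

# normalize() rescales each row to unit L2 norm by default
toy_normalized = normalize(toy)

print(toy_normalized)
print(np.linalg.norm(toy_normalized, axis=1))
```

Rows that differ only in overall scale become identical after normalization, so the linkage then compares the direction of each sample rather than its magnitude.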
● Not only a visualisation tool!
● Cluster labels at any intermediate stage can be recovered
● For use in e.g. cross-tabulations
Dendrograms show cluster distances
● Choosing a height on the dendrogram specifies the max. distance between merging clusters
● Clusters further apart than this height are not merged
● Distance between clusters is defined by a "linkage method"
● Specified via method parameter, e.g. linkage(samples, method="complete")
● In "complete" linkage: distance between clusters is max. distance between their samples
● Different linkage method, different hierarchical clustering!
● Use the fcluster() function
● Returns a NumPy array of cluster labels
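The "complete" rule in the bullets can be checked by hand. The sketch below uses two made-up clusters (the names `cluster_a`, `cluster_b` and their values are assumptions for illustration). It computes the farthest-pair and closest-pair distances directly and compares them with the height of the final merge that `linkage` reports for each method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist

# Two made-up clusters of 2D samples
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[5.0, 0.0], [7.0, 0.0]])

# All pairwise distances between a sample in A and a sample in B
pairwise = cdist(cluster_a, cluster_b)

complete_distance = pairwise.max()  # "complete": farthest pair of samples
single_distance = pairwise.min()    # "single": closest pair of samples
print(complete_distance, single_distance)

# Height of the final merge reported by linkage for each method
samples = np.vstack([cluster_a, cluster_b])
complete_height = linkage(samples, method='complete')[-1, 2]
single_height = linkage(samples, method='single')[-1, 2]
print(complete_height, single_height)
```

In this toy case the two methods merge the same pairs but at different heights; on real data the merge order itself can change.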
# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster
# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')
# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': variety})
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])
# Display ct
print(ct)
labels
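To see how the height passed to `fcluster` controls the number of clusters, here is a self-contained sketch on made-up data (the samples, the group names, and the cut heights 3 and 15 are assumptions chosen for this toy example, not values from the course data).

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1D samples with made-up group names (illustrative only)
toy_samples = np.array([[0.0], [1.0], [10.0], [12.0], [30.0]])
toy_names = ['a', 'a', 'b', 'b', 'c']

toy_mergings = linkage(toy_samples, method='complete')

# Cutting at a low height keeps more clusters than cutting at a high one
labels_low = fcluster(toy_mergings, 3, criterion='distance')
labels_high = fcluster(toy_mergings, 15, criterion='distance')
print(labels_low)
print(labels_high)

# Cross-tabulate the low-height labels against the group names
toy_df = pd.DataFrame({'labels': labels_low, 'names': toy_names})
print(pd.crosstab(toy_df['labels'], toy_df['names']))
```

The cluster numbers returned by `fcluster` are arbitrary identifiers; only which samples share a number is meaningful.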
from sklearn.manifold import TSNE
# Create a TSNE instance: model
model = TSNE(learning_rate=200)
print(sample.shape)
# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(sample)
print(tsne_features.shape)
print(variety)
pd.factorize(variety)
# Map each variety name to a plot color: colors
color_map = {'Kama wheat': 'r', 'Rosa wheat': 'g', 'Canadian wheat': 'b'}
colors = [color_map[v] for v in variety]
print(colors)
# Scatter plot, coloring by variety
plt.scatter(tsne_features[:, 0], tsne_features[:, 1], alpha=.7, c=colors)
# Annotate the points
for x, y, name in zip(tsne_features[:, 0], tsne_features[:, 1], variety):
    plt.annotate(name, (x, y), fontsize=5, alpha=0.75)
plt.show()
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
sample_pca = pca.fit_transform(sample)
print(sample_pca.shape)
# Scatter plot of the two PCA features, coloring by variety
color_map = {'Kama wheat': 'r', 'Rosa wheat': 'g', 'Canadian wheat': 'b'}
plt.scatter(sample_pca[:, 0], sample_pca[:, 1], alpha=.7,
            c=[color_map[v] for v in variety])
plt.show()
from sklearn import datasets
iris = datasets.load_iris()
iri = iris.data
print(iri.shape)
from sklearn.manifold import TSNE
# Create a TSNE instance: model
model = TSNE(learning_rate=200)
print(iri.shape)
# Apply fit_transform to samples: tsne_features
tsne_iri = model.fit_transform(iri)
print(tsne_iri.shape)
# Scatter plot, coloring by species (iris.target)
plt.scatter(tsne_iri[:, 0], tsne_iri[:, 1], alpha=.7,
            c=iris.target)
plt.show()
from sklearn.decomposition import PCA
pca = PCA(2)
print(iri.shape)
iri_pca = pca.fit_transform(iri)
print(iri_pca.shape)
# Scatter plot, coloring by species (iris.target)
plt.scatter(iri_pca[:, 0], iri_pca[:, 1], alpha=.7,
            c=iris.target)
plt.show()
● Has a fit_transform() method
● Simultaneously fits the model and transforms the data
● Has no separate transform() method
● Can’t extend the map to include new data samples
● Must start over each time!
● Choose learning rate for the dataset
● Wrong choice: points bunch together
● Try values between 50 and 200
● t-SNE features are different every time
● Piedmont wines, 3 runs, 3 different scatter plots!
● … however: The wine varieties (=colors) have same position relative to one another
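A rough sketch of the two points above: scikit-learn's `TSNE` has no `transform()` for new samples, and separate runs give different coordinates. The blob data, `perplexity=10`, `init='random'`, and the two `random_state` values are assumptions chosen for this small example, not settings from the slides.

```python
import numpy as np
from sklearn.manifold import TSNE

# Made-up data: two well-separated blobs of 20 samples each in 5 dimensions
rng = np.random.RandomState(0)
blobs = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
                   rng.normal(10.0, 1.0, size=(20, 5))])

# TSNE offers fit_transform() but no transform() for new samples
print(hasattr(TSNE, 'fit_transform'), hasattr(TSNE, 'transform'))

# init='random' and the two random_state values are choices made for this sketch
run_a = TSNE(learning_rate=200, perplexity=10, init='random',
             random_state=0).fit_transform(blobs)
run_b = TSNE(learning_rate=200, perplexity=10, init='random',
             random_state=1).fit_transform(blobs)

# Same shape, but the coordinates differ from run to run
print(run_a.shape, run_b.shape)
print(np.allclose(run_a, run_b))
```

Because the axes of a t-SNE plot have no fixed meaning, compare runs by how the groups sit relative to one another rather than by raw coordinates.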