### Visualisations communicate insight¶

● "t-SNE" : Creates a 2D map of a dataset (later)

● "Hierarchical clustering" (this video)

### A hierarchy of groups¶

● Groups of living things can form a hierarchy

● Clusters are contained in one another

### The dendrogram of a hierarchical clustering¶

● Read from the bottom up

● Vertical lines represent clusters

## Hierarchical clustering with SciPy¶

• SciPy linkage() function performs hierarchical clustering on an array of samples.
• Use linkage() function to obtain a hierarchical clustering of the grain samples, and use dendrogram() to visualize the result.
In [8]:
from data import sample, variety
print sample.shape, len(variety)

(42, 7) 42

In [20]:
# Perform the necessary imports
import matplotlib.pyplot as plt
%matplotlib inline


### method='complete'¶

• If the hierarchical clustering were stopped at height 6 on the dendrogram, how many clusters would there be?
• 3
In [36]:
# Calculate the linkage: mergings

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
labels=variety,
leaf_rotation=90,
leaf_font_size=6,)

plt.show()


### method='single'¶

In [32]:
# Calculate the linkage: mergings

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
labels=variety,
leaf_rotation=90,
leaf_font_size=6,)

plt.show()


### nomalize¶

SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so you'll need to use the normalize() function from sklearn.preprocessing instead of Normalizer.

### y-axis represents the distance between clusters¶

In [23]:
# Import normalize
from sklearn.preprocessing import normalize

# Normalize the movements: normalized_movements
normalized_sample = normalize(sample)

# Plot the dendrogram
dendrogram(
mergings,
labels=variety,
leaf_rotation=90.,
leaf_font_size=6
)
plt.show()


### Cluster labels in hierarchical clustering¶

● Not only a visualisation tool!

● Cluster labels at any intermediate stage can be recovered

● For use in e.g. cross-tabulations

### Intermediate clusterings & height on dendrogram¶

• Dendrograms show cluster distances

• Height on dendrogram = distance between merging clusters

### Intermediate clusterings & height on dendrogram¶

● Height on dendrogram specifies max. distance between merging clusters

● Don't merge clusters further apart than this

## Distance between clusters¶

● Defined by a "linkage method"

● Specified via method parameter, e.g. linkage(samples, method="complete")

● In "complete" linkage: distance between clusters is max. distance between their samples

● Different linkage method, different hierarchical clustering!

## Extracting cluster labels¶

● Use the fcluster method

● Returns a NumPy array of cluster labels

• In the previous exercise, you saw that the intermediate clustering of the grain samples at height 6 has 3 clusters.
• Now, use the fcluster() function to extract the cluster labels for this intermediate clustering,
• and compare the labels with the grain varieties using a cross-tabulation.
In [37]:
# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': variety})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)

varieties  Canadian wheat  Kama wheat  Rosa wheat
labels
1                      14           3           0
2                       0           0          14
3                       0          11           0

In [38]:
labels

Out[38]:
array([3, 3, 3, 3, 3, 1, 3, 3, 1, 3, 3, 3, 1, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

# t-SNE for 2-dimensional maps¶

### t-SNE for 2-dimensional maps¶

● t-SNE = “t-distributed stochastic neighbor embedding”

● Maps samples to 2D space (or 3D)

● Map approximately preserves nearness of samples

● Great for inspecting datasets

### test on iris dataset¶

t-SNE on the iris dataset ● Iris dataset has 4 measurements, so samples are 4-dimensional

● t-SNE maps samples to 2D space

● t-SNE didn't know that there were different species

● ... yet kept the species mostly separate

## apply t-SNE on wheat dataset¶

In [44]:
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

print sample.shape
# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(sample)
print tsne_features.shape

(42, 7)
(42, 2)


## or use pd.factorize¶

In [56]:
print variety

['Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat']

In [57]:
import pandas as pd
pd.factorize(variety)

Out[57]:
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
array(['Kama wheat', 'Rosa wheat', 'Canadian wheat'], dtype=object))
In [60]:
print map(lambda x:{'Kama wheat':'r', 'Rosa wheat':'g', 'Canadian wheat':'b'}[x], variety)

['r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b']

In [104]:
# Scatter plot, coloring by variety_numbers
plt.scatter(tsne_features[:,0], tsne_features[:,1], alpha = .7,
c=map(lambda x:{'Kama wheat':'r', 'Rosa wheat':'g', 'Canadian wheat':'b'}[x], variety))

# Annotate the points
for x, y, name in zip(tsne_features[:,0], tsne_features[:,1], variety):
plt.annotate(name, (x, y), fontsize=5, alpha=0.75)

plt.show()


### compare with PCA¶

In [69]:
from sklearn.decomposition import PCA
pca = PCA(2)
sampel_pca = pca.fit_transform(sample)
print sampel_pca.shape

(42, 2)

In [70]:
# Scatter plot, coloring by variety_numbers
plt.scatter(sampel_pca[:,0], sampel_pca[:,1], alpha = .7,
c=map(lambda x:{'Kama wheat':'r', 'Rosa wheat':'g', 'Canadian wheat':'b'}[x], variety))
plt.show()


## apply t-SNE on iris dataset¶

In [80]:
from sklearn import datasets
iri = iris.data
print iri.shape

(150, 4)

In [101]:
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

print iri.shape
# Apply fit_transform to samples: tsne_features
tsne_iri = model.fit_transform(iri)
print tsne_iri.shape

# Scatter plot, coloring by variety_numbers
plt.scatter(tsne_iri[:,0], tsne_iri[:,1], alpha = .7,
c=iris.target)
plt.show()

(150, 4)
(150, 2)


### compare with PCA¶

In [103]:
from sklearn.decomposition import PCA
pca = PCA(2)

print iri.shape
iri_pca = pca.fit_transform(iri)
print iri_pca.shape

# Scatter plot, coloring by variety_numbers
plt.scatter(iri_pca[:,0], iri_pca[:,1], alpha = .7,
c=iris.target)
plt.show()

(150, 4)
(150, 2)


## t-SNE has only fit_transform()¶

● Has a fit_transform() method

● Simultaneously fits the model and transforms the data

● Has no separate fit() or transform() methods

● Can’t extend the map to include new data samples

● Must start over each time!

## t-SNE learning rate¶

● Choose learning rate for the dataset

● Wrong choice: points bunch together

● Try values between 50 and 200

### Different every time¶

● t-SNE features are different every time

● Piedmont wines, 3 runs, 3 different sca!er plots!

● … however: The wine varieties (=colors) have same position relative to one another

In [ ]: