Visualizing hierarchies

Visualisations communicate insight

● "t-SNE" : Creates a 2D map of a dataset (later)

● "Hierarchical clustering" (this video)

A hierarchy of groups

● Groups of living things can form a hierarchy

● Clusters are contained in one another

The dendrogram of a hierarchical clustering

● Read from the bottom up

● Vertical lines represent clusters

Hierarchical clustering with SciPy

  • SciPy's linkage() function performs hierarchical clustering on an array of samples.
    • Use the linkage() function to obtain a hierarchical clustering of the grain samples, and use dendrogram() to visualize the result.
In [8]:
from data import sample, variety
print(sample.shape, len(variety))
(42, 7) 42
In [20]:
# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
%matplotlib inline

method='complete'

  • If the hierarchical clustering were stopped at height 6 on the dendrogram, how many clusters would there be?
    • 3
In [36]:
# Calculate the linkage: mergings
mergings = linkage(sample, method='complete')

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=variety,
           leaf_rotation=90,
           leaf_font_size=6,)

plt.show()

method='single'

In [32]:
# Calculate the linkage: mergings
mergings = linkage(sample, method='single')

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=variety,
           leaf_rotation=90,
           leaf_font_size=6,)

plt.show()

normalize

SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so you'll need to use the normalize() function from sklearn.preprocessing instead of Normalizer.

y-axis represents the distance between clusters

In [23]:
# Import normalize
from sklearn.preprocessing import normalize

# Normalize the samples: normalized_sample
normalized_sample = normalize(sample)

# Calculate the linkage: mergings
mergings = linkage(normalized_sample, method='complete')

# Plot the dendrogram
dendrogram(
    mergings,
    labels=variety,
    leaf_rotation=90.,
    leaf_font_size=6
)
plt.show()

Cluster labels in hierarchical clustering

● Not only a visualisation tool!

● Cluster labels at any intermediate stage can be recovered

● For use in e.g. cross-tabulations

Intermediate clusterings & height on dendrogram

● Dendrograms show cluster distances: height on dendrogram = distance between merging clusters

● Height on dendrogram specifies max. distance between merging clusters

● Don't merge clusters further apart than this
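A quick sketch (an addition, assuming the mergings linkage matrix from the cells above): SciPy's cut_tree() is an alternative to fcluster() for cutting the tree, either at a height or into a requested number of clusters.

from scipy.cluster.hierarchy import cut_tree

# Cut at height 6: merges above this distance are undone
labels_by_height = cut_tree(mergings, height=6).ravel()

# Or ask directly for 3 clusters
labels_by_count = cut_tree(mergings, n_clusters=3).ravel()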

Distance between clusters

● Defined by a "linkage method"

● Specified via method parameter, e.g. linkage(samples, method="complete")

● In "complete" linkage: distance between clusters is max. distance between their samples

● Different linkage method, different hierarchical clustering!
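A quick sketch of this last point, looping the same samples through several methods; 'average' and 'ward' are further options accepted by linkage(), beyond the 'complete' and 'single' used above:

from scipy.cluster.hierarchy import linkage, fcluster

# Same data, different linkage method -> (generally) different labels
for method in ('complete', 'single', 'average', 'ward'):
    Z = linkage(sample, method=method)
    print(method, fcluster(Z, 3, criterion='maxclust'))  # cut into 3 clusters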

Extracting cluster labels

● Use the fcluster() function

● Returns a NumPy array of cluster labels

  • In the previous exercise, you saw that the intermediate clustering of the grain samples at height 6 has 3 clusters.
    • Now, use the fcluster() function to extract the cluster labels for this intermediate clustering, and compare the labels with the grain varieties using a cross-tabulation.
In [37]:
# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': variety})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)
varieties  Canadian wheat  Kama wheat  Rosa wheat
labels                                           
1                      14           3           0
2                       0           0          14
3                       0          11           0
In [38]:
labels
Out[38]:
array([3, 3, 3, 3, 3, 1, 3, 3, 1, 3, 3, 3, 1, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

t-SNE for 2-dimensional maps

● t-SNE = “t-distributed stochastic neighbor embedding”

● Maps samples to 2D space (or 3D)

● Map approximately preserves nearness of samples

● Great for inspecting datasets

test on iris dataset

t-SNE on the iris dataset

● Iris dataset has 4 measurements, so samples are 4-dimensional

● t-SNE maps samples to 2D space

● t-SNE didn't know that there were different species

● ... yet kept the species mostly separate

t-SNE in sklearn

apply t-SNE on wheat dataset

In [44]:
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

print(sample.shape)
# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(sample)
print(tsne_features.shape)
(42, 7)
(42, 2)

map string names to integer indices

or use pd.factorize

In [56]:
print(variety)
['Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Kama wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Rosa wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat', 'Canadian wheat']
In [57]:
import pandas as pd
pd.factorize(variety)
Out[57]:
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 array(['Kama wheat', 'Rosa wheat', 'Canadian wheat'], dtype=object))
In [60]:
print([{'Kama wheat': 'r', 'Rosa wheat': 'g', 'Canadian wheat': 'b'}[x] for x in variety])
['r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b']
In [104]:
# Scatter plot, coloring by variety
plt.scatter(tsne_features[:, 0], tsne_features[:, 1], alpha=.7,
            c=[{'Kama wheat': 'r', 'Rosa wheat': 'g', 'Canadian wheat': 'b'}[x] for x in variety])

# Annotate the points with the variety names
for x, y, name in zip(tsne_features[:, 0], tsne_features[:, 1], variety):
    plt.annotate(name, (x, y), fontsize=5, alpha=0.75)

plt.show()

compare with PCA

In [69]:
from sklearn.decomposition import PCA
pca = PCA(2)
sample_pca = pca.fit_transform(sample)
print(sample_pca.shape)
(42, 2)
In [70]:
# Scatter plot, coloring by variety
plt.scatter(sample_pca[:, 0], sample_pca[:, 1], alpha=.7,
            c=[{'Kama wheat': 'r', 'Rosa wheat': 'g', 'Canadian wheat': 'b'}[x] for x in variety])
plt.show()

apply t-SNE on iris dataset

In [80]:
from sklearn import datasets
iris = datasets.load_iris()
iri = iris.data
print(iri.shape)
(150, 4)
In [101]:
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

print(iri.shape)
# Apply fit_transform to the iris samples: tsne_iri
tsne_iri = model.fit_transform(iri)
print(tsne_iri.shape)

# Scatter plot, coloring by species (iris.target)
plt.scatter(tsne_iri[:, 0], tsne_iri[:, 1], alpha=.7,
            c=iris.target)
plt.show()
(150, 4)
(150, 2)

compare with PCA

In [103]:
from sklearn.decomposition import PCA
pca = PCA(2)

print(iri.shape)
iri_pca = pca.fit_transform(iri)
print(iri_pca.shape)

# Scatter plot, coloring by species (iris.target)
plt.scatter(iri_pca[:, 0], iri_pca[:, 1], alpha=.7,
            c=iris.target)
plt.show()
(150, 4)
(150, 2)

t-SNE has only fit_transform()

● Has a fit_transform() method

● Simultaneously fits the model and transforms the data

● Has no separate fit() or transform() methods

● Can’t extend the map to include new data samples

● Must start over each time!
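A quick sketch of what "start over" means; new_samples is a hypothetical array with the same 7 features, not part of the grain data:

import numpy as np
from sklearn.manifold import TSNE

new_samples = np.random.rand(5, 7)  # hypothetical extra rows

# There is no model.transform(new_samples): to place new points on the
# map, refit t-SNE on the combined data from scratch.
combined_2d = TSNE(learning_rate=200).fit_transform(np.vstack([sample, new_samples]))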

t-SNE learning rate

● Choose learning rate for the dataset

● Wrong choice: points bunch together

● Try values between 50 and 200
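A quick sketch of how to choose: scan a few values and keep the map where points don't all bunch together (the 50/100/200 grid is an arbitrary choice):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Try a few learning rates side by side; a bad value collapses the map
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, lr in zip(axes, (50, 100, 200)):
    xy = TSNE(learning_rate=lr).fit_transform(sample)
    ax.scatter(xy[:, 0], xy[:, 1], alpha=.7)
    ax.set_title('learning_rate={}'.format(lr))
plt.show()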

Different every time

● t-SNE features are different every time

● Piedmont wines, 3 runs, 3 different scatter plots!

● … however: The wine varieties (=colors) have same position relative to one another
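Runs can be made repeatable by fixing TSNE's random_state parameter; a quick sketch on the grain samples:

from sklearn.manifold import TSNE

# Two unseeded runs give two different maps (the axes carry no fixed
# meaning), though relative positions of the varieties look alike.
run1 = TSNE(learning_rate=200).fit_transform(sample)
run2 = TSNE(learning_rate=200).fit_transform(sample)

# Fixing the seed makes the embedding identical from run to run
fixed = TSNE(learning_rate=200, random_state=42).fit_transform(sample)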