Clustering

Unsupervised learning

  • Unsupervised learning finds patterns in data
  • E.g. clustering customers by their purchases
  • Compressing the data using purchase patterns (dimension reduction)

Supervised vs unsupervised learning

  • Supervised learning finds patterns for a prediction task
    • E.g. classify tumors as benign or cancerous (labels)
  • Unsupervised learning finds patterns in data
    • ... but without a specific prediction task in mind

Arrays, features & samples

  • 2D NumPy array
    • Columns are measurements (the features)
    • Rows represent iris plants (the samples)
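A tiny sketch of this convention (the numbers are illustrative iris-style measurements, not loaded from the dataset):

import numpy as np

# 3 iris plants (rows = samples) x 4 measurements (columns = features)
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.3, 3.3, 6.0, 2.5]])
print(X.shape)   # (3, 4): 3 samples, 4 features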

k-means clustering

  • k-means clustering
    • Number of clusters must be specified
    • Implemented in sklearn ("scikit-learn")
In [1]:
import pandas as pd
import numpy as np

from sklearn import datasets
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA

%matplotlib inline

You are given an array points of size 300x2, where each row gives the (x, y) co-ordinates of a point on a map. Make a scatter plot of these points, and use the scatter plot to guess how many clusters there are.

In [2]:
from data import points, new_points
print(points.shape, new_points.shape)
(300, 2) (300, 2)
In [3]:
plt.figure(figsize=(6,8))

plt.subplot(2,1,1)
plt.scatter(points[:,0], points[:,1], alpha=.6, c='r')
plt.xlim([-3,3])


plt.subplot(2,1,2)
plt.scatter(new_points[:,0], new_points[:,1], alpha=.6, c='g')
Out[3]:
<matplotlib.collections.PathCollection at 0x7f18dca222d0>

Clustering 2D points

  • the points seem to separate into 3 clusters
    • create a KMeans model to find 3 clusters, and fit it to the data points
    • after the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method
In [4]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)
[2 1 0 2 1 2 1 1 1 0 2 1 1 0 0 1 0 0 1 1 0 1 2 1 2 0 1 0 0 2 2 1 1 1 0 2 1
 1 2 1 0 2 2 0 2 1 0 0 1 1 1 1 0 0 2 2 0 0 0 2 2 1 1 1 2 1 0 1 2 0 2 2 2 1
 2 0 0 2 1 0 2 0 2 1 0 1 0 2 1 1 1 2 1 1 2 0 0 0 0 2 1 2 0 0 2 2 1 2 0 0 2
 0 0 0 1 1 1 1 0 0 1 2 1 0 1 2 0 1 0 0 1 0 1 0 2 1 2 2 1 0 2 1 2 2 0 1 1 2
 0 2 0 1 2 0 0 2 0 1 1 0 1 0 0 1 1 2 1 1 0 2 0 2 2 1 2 1 1 2 2 0 2 2 2 0 1
 1 2 0 2 0 0 1 1 1 2 1 1 1 0 0 2 1 2 2 2 0 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0
 1 1 2 0 2 2 0 2 0 2 0 1 1 0 1 1 1 0 2 2 0 1 1 0 1 0 0 1 0 0 2 0 2 2 2 1 0
 0 0 2 1 2 0 2 0 0 1 2 2 2 0 1 1 1 2 1 0 0 1 2 2 0 2 2 0 2 1 2 0 0 0 0 1 0
 0 1 1 2]

Inspect your clustering

  • Let's now inspect the clustering you performed in the previous exercise!
    • A solution to the previous exercise has already run, so new_points is an array of points and labels is the array of their cluster labels.

examine the results

  • color each point by its cluster label
In [5]:
model.cluster_centers_
Out[5]:
array([[-1.57568905, -0.22531944],
       [ 1.01378685,  0.98288627],
       [ 0.18034887, -0.81701955]])
In [6]:
plt.figure(figsize=(8,5))

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()

Iris: clusters vs species

  • k-means found 3 clusters amongst the iris samples
    • ... but do the clusters correspond to the species?

Iris dataset

  • Measurements of many iris plants
    • 3 species of iris: setosa, versicolor, virginica
    • Petal length, petal width, sepal length, sepal width (the features of the dataset)
  • Iris data is 4-dimensional
    • Dimension = number of features
    • Dimension too high to visualize!
    • ... but unsupervised learning gives insight
In [7]:
from sklearn import datasets

iris = datasets.load_iris()
print(type(iris.data))
iris.data.shape
<class 'numpy.ndarray'>
Out[7]:
(150, 4)

build a DataFrame for the iris dataset from the NumPy ndarray

In [8]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target']=iris.target

print(iris.target_names)
df['species'] = df['target'].map({0:iris.target_names[0],1:iris.target_names[1],2:iris.target_names[2]})
df.head()
['setosa' 'versicolor' 'virginica']
Out[8]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target species
0 5.1 3.5 1.4 0.2 0 setosa
1 4.9 3.0 1.4 0.2 0 setosa
2 4.7 3.2 1.3 0.2 0 setosa
3 4.6 3.1 1.5 0.2 0 setosa
4 5.0 3.6 1.4 0.2 0 setosa

make samples for clustering, remove labels

In [9]:
samples = df.iloc[:,:4]
samples.head()
Out[9]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [10]:
from sklearn.cluster import KMeans

model2 = KMeans(n_clusters=3)
model2.fit(samples)
Out[10]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
In [11]:
labels2 = model2.predict(samples)
labels2
Out[11]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2,
       0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,
       2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)

Cluster labels for new samples

  • New samples can be assigned to existing clusters
  • k-means remembers the mean of each cluster (the "centroids")
  • Finds the nearest centroid to each new sample
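As a rough sketch of what .predict() does under the hood (assuming model2 is the KMeans model fitted on the iris samples above; the new measurement values are made up for illustration):

import numpy as np

# assign a new sample to the cluster whose centroid is closest
centroids = model2.cluster_centers_                      # shape (3, 4)
new_sample = np.array([[5.8, 2.7, 4.1, 1.0]])            # hypothetical new iris measurements
dists = np.linalg.norm(centroids - new_sample, axis=1)   # distance to each centroid
print(np.argmin(dists))                                  # index of the nearest centroid
print(model2.predict(new_sample))                        # should agree with the manual assignment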

Scatter plots

  • Scatter plot of sepal length vs petal length
  • Each point represents an iris sample
  • Color points by cluster labels
  • PyPlot (matplotlib.pyplot)
In [12]:
# sepal length (column 0) vs petal length (column 2), colored by cluster label
plt.scatter(samples.iloc[:,0], samples.iloc[:,2], c=labels2, alpha=.7)
Out[12]:
<matplotlib.collections.PathCollection at 0x7f18dc4d4d10>

Evaluating a clustering

  • Can check correspondence with e.g. iris species
    • … but what if there are no species to check against?
  • Measure quality of a clustering
  • Informs choice of how many clusters to look for

Cross tabulation with pandas

  • Clusters vs species is a "cross-tabulation"
  • Use the pandas library
  • Given the species of each sample as a list species
In [13]:
pd.crosstab(df['target'], df['species'])
Out[13]:
species setosa versicolor virginica
target
0 50 0 0
1 0 50 0
2 0 0 50

train test split

  • split the data to try a supervised classifier for comparison
In [14]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df.iloc[:,:4],df.species, test_size = .33, random_state=7)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
x_train.head(3)
(100, 4) (50, 4) (100,) (50,)
Out[14]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
3 4.6 3.1 1.5 0.2
39 5.1 3.4 1.5 0.2
117 7.7 3.8 6.7 2.2

try a little bit of supervised learning

In [15]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

# Fit the classifier to the training data
model.fit(x_train, y_train)

# Predict the species of the test samples
labels = model.predict(x_test)

# Print the predicted species
print(labels)
['virginica' 'versicolor' 'setosa' 'versicolor' 'versicolor' 'setosa'
 'versicolor' 'versicolor' 'setosa' 'versicolor' 'virginica' 'versicolor'
 'setosa' 'virginica' 'setosa' 'versicolor' 'virginica' 'virginica'
 'setosa' 'setosa' 'versicolor' 'virginica' 'versicolor' 'versicolor'
 'virginica' 'versicolor' 'versicolor' 'versicolor' 'virginica' 'virginica'
 'virginica' 'versicolor' 'setosa' 'virginica' 'versicolor' 'setosa'
 'setosa' 'setosa' 'setosa' 'virginica' 'virginica' 'versicolor'
 'virginica' 'virginica' 'versicolor' 'setosa' 'versicolor' 'versicolor'
 'virginica' 'setosa']
In [16]:
pd.crosstab(labels, y_test)
Out[16]:
species setosa versicolor virginica
row_0
setosa 14 0 0
versicolor 0 17 3
virginica 0 1 15

unsupervised learning

kmeans

In [32]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to the iris measurements
model.fit(df.iloc[:,:4])

# Determine the cluster label of each sample: labels
labels = model.predict(df.iloc[:,:4])

# Print the cluster labels
print(labels)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 0 0 0 2 0 0 0 0
 0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 0 0 0 2 0
 0 2]

Measuring clustering quality

  • Using only samples and their cluster labels
  • A good clustering has tight clusters
  • ... and samples in each cluster bunched together
In [33]:
pd.crosstab(labels, df.species)
Out[33]:
species setosa versicolor virginica
row_0
0 0 2 36
1 50 0 0
2 0 48 14

Inertia measures clustering quality

  • Measures how spread out the clusters are (lower is better)
  • Distance from each sample to centroid of its cluster
  • After fit(), available as attribute inertia_
  • k-means attempts to minimize the inertia when choosing clusters
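As a sanity check on the definition, inertia can be recomputed by hand as the sum of squared distances from each sample to the centroid of its assigned cluster; the result should match the inertia_ value printed in the next cell (assuming model and labels come from the fit on df.iloc[:,:4] above):

import numpy as np

# hand-rolled inertia: squared distance from each sample to its own cluster's centroid
X = df.iloc[:, :4].values
assigned_centroids = model.cluster_centers_[labels]     # centroid of each sample's cluster
manual_inertia = ((X - assigned_centroids) ** 2).sum()
print(manual_inertia)                                   # ~ model.inertia_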
In [34]:
print(model.inertia_)
78.9408414261

PCA: reduce to 2 dimensions to examine the clustering

In [35]:
from sklearn.decomposition import PCA
In [36]:
pca = PCA(2)
pca
Out[36]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [37]:
df_pca = pca.fit_transform(df.iloc[:,:4])
df_pca.shape
Out[37]:
(150, 2)
In [39]:
plt.scatter(df_pca[:,0], df_pca[:,1],c = labels)
Out[39]:
<matplotlib.collections.PathCollection at 0x7f18d8da2690>

The number of clusters

  • Clusterings of the iris dataset with different numbers of clusters
  • More clusters means lower inertia
  • What is the best number of clusters?

How many clusters to choose?

  • good clustering has tight clusters (so low inertia)
  • ... but not too many clusters!
  • Choose an "elbow" in the inertia plot
  • Where inertia begins to decrease more slowly
  • E.g. for iris dataset, 3 is a good choice
In [41]:
model.score(df.iloc[:,:4])
Out[41]:
-78.940841426145937

How many clusters? (elbow plot)

  • Append the value of the inertia_ attribute of model to the list inertias.
In [42]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(df.iloc[:,:4])
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
In [44]:
df.describe()
Out[44]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667 1.000000
std 0.828066 0.433594 1.764420 0.763161 0.819232
min 4.300000 2.000000 1.000000 0.100000 0.000000
25% 5.100000 2.800000 1.600000 0.300000 0.000000
50% 5.800000 3.000000 4.350000 1.300000 1.000000
75% 6.400000 3.300000 5.100000 1.800000 2.000000
max 7.900000 4.400000 6.900000 2.500000 2.000000
In [45]:
df.var()
Out[45]:
sepal length (cm)    0.685694
sepal width (cm)     0.188004
petal length (cm)    3.113179
petal width (cm)     0.582414
target               0.671141
dtype: float64
In [47]:
import numpy as np
np.sqrt(df.var())
Out[47]:
sepal length (cm)    0.828066
sepal width (cm)     0.433594
petal length (cm)    1.764420
petal width (cm)     0.763161
target               0.819232
dtype: float64

Transforming features for better clusterings

Some models, like KMeans, are strongly influenced by differences in feature variance

  • To give all features (columns) mean 0 and variance 1, use StandardScaler (see the sketch below)
    • from sklearn.preprocessing import StandardScaler

StandardScaler
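
A minimal sketch of the effect, using the iris samples DataFrame built earlier: after fit_transform, every column has mean ~0 and unit variance.

from sklearn.preprocessing import StandardScaler

# each column of the scaled array should have mean ~0 and standard deviation ~1
scaler = StandardScaler()
samples_scaled = scaler.fit_transform(samples)
print(samples_scaled.mean(axis=0).round(6))
print(samples_scaled.std(axis=0).round(6))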

pipeline

In [66]:
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=3)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)
In [75]:
names = '''Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium,
Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline'''

# strip stray whitespace (and the line break) so the column names come out clean
names = [n.strip() for n in names.split(',')]
In [87]:
wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',names = names)
wine.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 1 to 3
Data columns (total 13 columns):
Alcohol                         178 non-null float64
Malic acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity of ash               178 non-null float64
Magnesium                       178 non-null int64
Total phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315 of diluted wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(2)
memory usage: 19.5 KB

feature variance

In [80]:
wine.var()
Out[80]:
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64
In [81]:
classes = [59, 71, 48]   # number of samples of each wine class: Barolo, Grignolino, Barbera
In [89]:
# only 13 column names were given, so pandas used the first column (the class label) as the index;
# reset and drop it, then add the class names back by hand below
wine = wine.reset_index()
del wine['index']

add labels

In [100]:
wine['class'] = ['Barolo' for i in range(59)]+ ['Grignolino' for i in range(71)] + ['Barbera' for i in range(48)]
In [101]:
wine['class'].value_counts()
Out[101]:
Grignolino    71
Barolo        59
Barbera       48
Name: class, dtype: int64

Clustering the wines

In [103]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(wine.iloc[:,:-1])
In [105]:
pd.crosstab(labels, wine['class'])
Out[105]:
class Barbera Barolo Grignolino
row_0
0 19 0 50
1 0 46 1
2 29 13 20

clustering accuracy suffers because the features have very different variances

  • Variance of a feature measures spread of its values
In [106]:
wine.var()
Out[106]:
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64

Feature variances

relationship between the last two numeric features (OD280/OD315 of diluted wines and Proline)

In [125]:
plt.scatter(wine.iloc[:,-3], wine.iloc[:,-2], 
            c= wine.iloc[:,-1].map({'Barbera':'r', 'Barolo':'g', 'Grignolino':'b'}), 
            alpha=.7)

plt.axis('equal')
Out[125]:
(1.0, 4.5, 200.0, 1800.0)

after StandardScaler

In [126]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
wine.iloc[:,:-1] = scaler.fit_transform(wine.iloc[:,:-1])

plt.scatter(wine.iloc[:,-3], wine.iloc[:,-2], 
            c= wine.iloc[:,-1].map({'Barbera':'r', 'Barolo':'g', 'Grignolino':'b'}), 
            alpha=.7)

plt.axis('equal')
Out[126]:
(-3.0, 3.0, -2.0, 4.0)

sklearn StandardScaler

In [1]: from sklearn.preprocessing import StandardScaler

In [2]: scaler = StandardScaler()

In [3]: scaler.fit(samples)

Out[3]: StandardScaler(copy=True, with_mean=True, with_std=True)

In [4]: samples_scaled = scaler.transform(samples)

Similar methods

  • StandardScaler and KMeans have similar methods

  • Use fit() / transform() with StandardScaler

  • Use fit() / predict() with KMeans

StandardScaler, then KMeans

  • Need to perform two steps: StandardScaler, then KMeans

  • Use sklearn pipeline to combine multiple steps

  • Data flows from one step into the next (see the sketch below)
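
For comparison, a minimal sketch of doing the two steps by hand (assuming the wine DataFrame from above, with the class names in its last column); the pipeline in the next cell automates exactly this hand-off:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# step 1: scale the features; step 2: cluster the scaled output
wine_scaled = StandardScaler().fit_transform(wine.iloc[:, :-1])
manual_labels = KMeans(n_clusters=3).fit_predict(wine_scaled)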

Pipelines combine multiple steps

In [129]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)


from sklearn.pipeline import make_pipeline
pipe = make_pipeline(scaler, kmeans)
pipe
Out[129]:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kmeans', KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0))])
In [130]:
pipe.fit(wine.iloc[:,:-1])
Out[130]:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kmeans', KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0))])
In [134]:
labels = pipe.predict(wine.iloc[:,:-1])
labels
Out[134]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

huge accuracy improvement after StandardScaler

In [135]:
pd.crosstab(labels, wine['class'])
Out[135]:
class Barbera Barolo Grignolino
row_0
0 48 0 3
1 0 0 65
2 0 59 3

sklearn preprocessing steps

  • StandardScaler is a "preprocessing" step

  • MaxAbsScaler and Normalizer are other examples

Note that Normalizer() is different from StandardScaler(), which was used above. While StandardScaler() standardizes features (columns) by removing the mean and scaling to unit variance, Normalizer() rescales each sample (row) independently of the others.
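
A small illustrative contrast on made-up numbers: StandardScaler works column-wise (per feature), while Normalizer works row-wise (per sample).

import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

print(StandardScaler().fit_transform(X))   # each column now has mean 0 and unit variance
print(Normalizer().fit_transform(X))       # each row now has unit Euclidean (L2) norm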

In [ ]: