Monday, February 20, 2017

[ Intro2ML ] Ch3. Unsupervised Learning and Preprocessing - Clustering

Clustering 
As we described above, clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster a particular point belongs to. 

k-Means clustering 
k-Means clustering is one of the simplest and most commonly used clustering algorithms. It tries to find cluster centers that are representative of certain regions of the data. 

The algorithm alternates between two steps: assigning each data point to the closest cluster center, and then setting each cluster center as the mean of the data points that are assigned to it. The algorithm is finished when the assignment of instances to clusters no longer changes. Figure kmeans_algorithm illustrates the algorithm on a synthetic dataset: 
import mglearn
import matplotlib.pyplot as plt
mglearn.plots.plot_kmeans_algorithm()
plt.suptitle("kmeans_algorithm")

We specified that we are looking for three clusters, so the algorithm was initialized by declaring three data points as cluster centers (see “Initialization”). Then the iterative algorithm starts: Each data point is assigned to the cluster center it is closest to (see “Assign Points (1)”). Next, the cluster centers are updated to be the mean of the assigned points (see “Recompute Centers (1)”). Then the process is repeated. After the second iteration, the assignment of points to cluster centers remained unchanged, so the algorithm stops. 
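To make the two alternating steps concrete, here is a minimal NumPy sketch of the k-Means loop (an illustrative re-implementation under simplifying assumptions, not the figure's or scikit-learn's actual code; it uses random data points as initial centers and omits refinements such as k-means++ and empty-cluster handling):

import numpy as np

def simple_kmeans(X, n_clusters=3, n_iter=10, seed=0):
    """Illustrative k-Means: alternate assignment and center updates."""
    rng = np.random.RandomState(seed)
    # initialize by declaring randomly chosen data points as cluster centers
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # step 1: assign each point to the closest cluster center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = distances.argmin(axis=1)
        # step 2: move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
        if np.allclose(new_centers, centers):   # centers stopped moving
            break
        centers = new_centers
    return labels, centers

Run on the blob data used below, this sketch should give assignments similar to the figure, up to a permutation of the cluster numbers.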

Given new data points, k-Means will assign them to the closest cluster center. Here are the boundaries of the cluster centers that were learned in the diagram above: 
mglearn.plots.plot_kmeans_boundaries()

Applying k-Means with scikit-learn is quite straightforward. Here we apply it to the synthetic data that we used for the plots above. We instantiate the KMeans class and set the number of clusters we are looking for (if you don’t provide n_clusters, it is set to eight by default; there is no particular reason why you should use eight). Then we call the fit method with the data: 
- ch3_t2.py 
#!/usr/bin/env python
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# generate synthetic two-dimensional data
X, y = make_blobs(random_state=1)
# build the clustering model
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
During the algorithm, each training data point in X is assigned a cluster label. You can find these labels in the kmeans.labels_ attribute: 
>>> from ch3_t2 import *
>>> print(kmeans.labels_[:10])  # print the cluster labels of the first 10 data points
[0 1 1 1 2 2 2 1 0 0]

As we asked for three clusters, the clusters are numbered 0 to 2. You can also assign cluster labels to new points using the predict method. Each new point is assigned to the closest cluster center when predicting, but the existing model is not changed. Running predict on the training set returns the same result as labels_: 
>>> print(kmeans.predict(X)[:10])
[0 1 1 1 2 2 2 1 0 0]

You can see that clustering is somewhat similar to classification, in that each item gets a label. However, there is no ground truth, and consequently the labels themselves have no a priori meaning. Let’s go back to the example of clustering face images that we discussed before. It might be that cluster 3 found by the algorithm contains only faces of your friend Bela. You can only know that after you look at the pictures, though, and the number 3 is arbitrary. The only information the algorithm gives you is that all faces labeled as 3 are similar. 

Here is a plot of the two-dimensional toy dataset we just clustered. The cluster centers are stored in the cluster_centers_ attribute, and we plot them as triangles: 
import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap=mglearn.cm3, s=60)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='^', s=100, linewidth=2, c=[0, 1, 2], cmap=mglearn.cm3)
    plt.show()

We can also use more or fewer cluster centers, as the sketch below illustrates: 
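A hedged sketch of this, reusing X and the imports from ch3_t2.py above; the figure layout and cluster counts (two and five) are illustrative choices:

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# using two cluster centers
kmeans_2 = KMeans(n_clusters=2).fit(X)
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans_2.labels_, cmap=mglearn.cm2, s=60)
axes[0].set_title("n_clusters=2")

# using five cluster centers
kmeans_5 = KMeans(n_clusters=5).fit(X)
axes[1].scatter(X[:, 0], X[:, 1], c=kmeans_5.labels_, s=60)
axes[1].set_title("n_clusters=5")
plt.show()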

Failure cases of k-Means 
Even if you know the “right” number of clusters for a given dataset, k-Means might not always be able to recover them. Each cluster is defined solely by its center, which means that each cluster is a convex shape. As a result, k-Means can only capture relatively simple shapes. k-Means also assumes that all clusters have the same “diameter” in some sense; it always draws the boundary between clusters to be exactly in the middle between the cluster centers. That can sometimes lead to surprising results, as the data below and the clustering sketch that follows it illustrate: 
X, y = make_blobs(random_state=0)
plt.scatter(X[:, 0], X[:, 1])
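To see the “same diameter” assumption fail, here is a hedged sketch (the cluster_std values and random_state are illustrative choices, not taken from the text) that generates blobs with very different spreads and clusters them with k-Means; the boundaries then cut through points that visually belong to the spread-out cluster:

import matplotlib.pyplot as plt
import mglearn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three blobs with very different standard deviations (illustrative values)
X_varied, y_varied = make_blobs(n_samples=200, cluster_std=[1.0, 2.5, 0.5], random_state=170)
y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X_varied)

plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred, cmap=mglearn.cm3, s=60)
plt.title("k-Means on clusters with different densities")
plt.show()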

k-Means also assumes that all directions are equally important for each cluster. The plot below shows a two-dimensional dataset where there are three clearly separated parts in the data. However, these groups are stretched towards the diagonal. As k-Means only considers the distance to the nearest cluster center, it can’t handle this kind of data. 
- ch3_t3.py 
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# generate some random cluster data
X, y = make_blobs(random_state=170, n_samples=600)
rng = np.random.RandomState(74)
# transform the data so that it is stretched along the diagonal
transformation = rng.normal(size=(2, 2))
X = np.dot(X, transformation)
# cluster the data into three clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_pred = kmeans.predict(X)

import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    # plot the cluster assignments and cluster centers
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm3)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='^', c=['b', 'r', 'g'], s=60, linewidth=2)
    plt.show()

k-Means also performs poorly if the clusters have more complex shapes, like the two_moons data we encountered in Chapter 2: 
- ch3_t4.py 
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# generate some random cluster data
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# cluster the data into two clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
y_pred = kmeans.predict(X)

import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    # plot the cluster assignments and cluster centers
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm3, s=60)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='^', c=['b', 'g'], s=60, linewidth=2)
    plt.show()

Here, we would hope that the clustering algorithm can discover the two half-moon shapes. However, this is not possible using the k-Means algorithm. 

Vector Quantization - Or Seeing k-Means as Decomposition 
Even though k-Means is a clustering algorithm, there are interesting parallels between k-Means and decomposition methods like PCA and NMF that we discussed above. You might remember that PCA tries to find directions of maximum variance in the data, while NMF tries to find additive components, which often correspond to “extremes” or “parts” of the data (see Figure nmf_illustration). Both methods tried to express data points as a sum over some components. 

k-Means, on the other hand, tries to represent each data point using a cluster center. You can think of this as each point being represented using only a single component, which is given by the cluster center. This view of k-Means as a decomposition method, where each point is represented using a single component, is called vector quantization. 
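Viewed this way, “reconstructing” a point from its k-Means encoding simply means looking up the center of its assigned cluster. A minimal sketch, assuming the kmeans model fitted on X in ch3_t2.py above:

# each point is approximated by the center of the cluster it was assigned to
X_reconstructed = kmeans.cluster_centers_[kmeans.labels_]
print(X_reconstructed[:3])   # the "decoded" version of the first three points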

An interesting aspect of vector quantization using k-Means is that we can use many more clusters than input dimensions to encode our data. Let’s go back to the two_moons data. Using PCA or NMF, there is nothing much we can do to this data, as it lives in only two dimensions. Reducing it to one dimension with PCA or NMF would completely destroy the structure of the data. But we can find a more expressive representation using k-Means, by using more cluster centers: 
- ch3_t5.py 
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# generate some random cluster data
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# cluster the data into ten clusters
kmeans = KMeans(n_clusters=10)
kmeans.fit(X)
y_pred = kmeans.predict(X)

import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    # plot the cluster assignments and cluster centers
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=60, cmap='Paired')
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                marker='^', c=np.arange(kmeans.n_clusters), s=60, linewidth=2, cmap='Paired')
    plt.show()

We used 10 cluster centers, which means each point is now assigned a number between 0 and 9. We can see this as the data being represented using 10 components (that is, we have ten new features), with all features being zero apart from the one that represents the cluster center the point is assigned to. Using this 10-dimensional representation, it would now be possible to separate the two half-moon shapes using a linear model, which would not have been possible using the original two features; the sketch below illustrates this. 
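A hedged sketch of that claim, assuming the X, y, y_pred, and kmeans objects from ch3_t5.py above (LogisticRegression is used here as an arbitrary linear model, and the one-hot encoding is built by hand for clarity):

import numpy as np
from sklearn.linear_model import LogisticRegression

# build the 10-dimensional one-hot representation by hand:
# row i has a 1 in the column of the cluster that point i was assigned to
X_onehot = np.zeros((len(X), kmeans.n_clusters))
X_onehot[np.arange(len(X)), y_pred] = 1

# a linear model can now separate the two half-moons in this representation
logreg = LogisticRegression().fit(X_onehot, y)
print("Training accuracy on one-hot cluster features: %.2f" % logreg.score(X_onehot, y))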

It is also possible to get an even more expressive representation of the data by using the distances to each of the cluster centers as features. This can be done using the transform method of kmeans: 
>>> from ch3_t5 import *
>>> distance_features = kmeans.transform(X)
>>> print "Distance features shape: %s" % (str(distance_features.shape))
Distance features shape: (200, 10)
>>> print(distance_features[:3])
[[ 0.23340263 1.52608012 0.47968982 1.14060718 1.79608327 0.82573711
1.03376805 1.11075329 1.16555129 0.92529628]
[ 0.98271691 2.55646013 1.68487883 0.149581 2.66745979 1.64355994
0.50806038 2.26381633 0.72381874 1.12665041]
[ 0.94399739 0.80023234 0.75737333 1.76451208 0.94281523 0.13801913
1.35912551 0.71680935 1.97438819 0.80418292]]

k-Means is a very popular algorithm for clustering, not only because it is relatively easy to understand and implement, but also because it runs relatively quickly. k-Means scales easily to large datasets, and scikit-learn even includes a more scalable variant in the MiniBatchKMeans class, which can handle very large datasets. 

One of the drawbacks of k-Means is that it relies on a random initialization, which means the outcome of the algorithm depends on a random seed. By default, scikit-learn runs the algorithm 10 times with 10 different random initializations and returns the best result (best here meaning that the sum of variances of the clusters is smallest). The sketch below shows how this behavior can be controlled. 
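A minimal sketch (the parameter values are illustrative; n_clusters matches the earlier examples):

from sklearn.cluster import KMeans

# n_init controls how many random restarts are performed (10 is the default),
# and random_state fixes the seed so the chosen initialization is reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)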

Further downsides of k-Means are the relatively restrictive assumptions made on the shape of clusters, and the requirement to specify the number of clusters you are looking for (which might not be known in a real-world application). Next, we will look at two more clustering algorithms that improve upon these properties in some ways. 

Agglomerative Clustering 
Agglomerative clustering refers to a collection of clustering algorithms that all build upon the same principles: The algorithm starts by declaring each point its own cluster, and then merges the two most similar clusters until some stopping criterion is satisfied. The stopping criterion implemented in scikit-learn is the number of clusters, so similar clusters are merged until only the specified number of clusters is left. 

There are several linkage criteria that specify exactly how the “most similar cluster” is measured; the following three choices are implemented in scikit-learn (a short usage sketch follows the list): 
* “ward”, which is the default choice. Ward picks the two clusters to merge such that the variance within all clusters increases the least. This often leads to clusters that are relatively equally sized.
* “average” linkage merges the two clusters that have the smallest average distance between all their points.
* “complete” linkage (also known as maximum linkage) merges the two clusters that have the smallest maximum distance between their points.
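A hedged sketch of selecting the linkage criterion (reusing the kind of blob data used throughout this section; printing the cluster sizes is just one way to compare the results):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=1)

# 'ward' is the default; 'average' and 'complete' are the other built-in choices
for linkage in ['ward', 'average', 'complete']:
    agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    assignment = agg.fit_predict(X)
    print("%s linkage -> cluster sizes %s" % (linkage, np.bincount(assignment)))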

Ward is generally a good default; all our examples below will use ward. The plot below illustrates the progression of agglomerative clustering on a two-dimensional dataset, looking for three clusters. 
>>> import mglearn
>>> import matplotlib.pyplot as plt
>>> mglearn.plots.plot_agglomerative_algorithm()
>>> plt.suptitle("agglomerative_algorithm")
>>> plt.show()


In the beginning, each point is its own cluster. Then, in each step, the two clusters that are closest are merged. In the first four steps, two single-point clusters are picked and joined into two-point clusters. In step 5, one of the two-point clusters is extended to a third point, and so on. In step 9, there are only three clusters remaining. As we specified that we are looking for three clusters, the algorithm then stops. 

Let’s have a look at how agglomerative clustering performs on the simple three-cluster data we used above. Because of the way the algorithm works, agglomerative clustering cannot make predictions for new data points. Therefore, agglomerative clustering has no predict method. To build the model and get the cluster memberships on the training set, use the fit_predict method instead. (We could also use the labels_ attribute, as we did for k-Means.) 
- ch3_t6.py 
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# generate some random cluster data
X, y = make_blobs(random_state=1)

# cluster the data into three clusters
agg = AgglomerativeClustering(n_clusters=3)
assignment = agg.fit_predict(X)

import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    # plot the cluster assignments
    plt.scatter(X[:, 0], X[:, 1], c=assignment, cmap=mglearn.cm3, s=60)
    plt.show()

As expected, the algorithm recovers the clustering perfectly. While the scikit-learn implementation of agglomerative clustering requires you to specify the number of clusters you want the algorithm to find, agglomerative clustering methods provide some help with choosing the right number, which we will discuss next. 

Hierarchical Clustering and Dendrograms 
Agglomerative clustering produces what is known as a hierarchical clustering. The clustering proceeds iteratively, and every point makes a journey from being a single point cluster to belonging to some final cluster. Each intermediate step provides a clustering of the data (with a different number of clusters). It is sometimes helpful to look at all possible clusterings jointly. The figure below shows an overlay of all possible clusterings shown in Figure agglomerative_algorithm, providing some insight into how each cluster breaks up into smaller clusters. 
mglearn.plots.plot_agglomerative()
Figure agglomerative_algorithm 

While this visualization provides a very detailed view of the hierarchical clustering, it relies on the two-dimensional nature of the data, and can therefore not be used on datasets that have more than two features. There is, however, another tool to visualize hierarchical clustering, called a dendrogram (as shown in Figure dendrogram below). 

Unfortunately, scikit-learn currently does not have the functionality to draw dendrograms. However, you can generate them easily using scipy. The scipy clustering algorithms have a slightly different interface from the scikit-learn clustering algorithms: scipy provides a function that takes a data array X and computes a linkage array, which encodes hierarchical cluster similarities. We can then feed this linkage array into the scipy dendrogram function to plot the dendrogram: 
- ch3_t7.py 
from scipy.cluster.hierarchy import dendrogram, ward
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# generate some random cluster data
X, y = make_blobs(random_state=0, n_samples=12)

# apply the scipy ward clustering to the data array X
# the ward function returns an array that specifies the distances bridged
# when performing agglomerative clustering
linkage_array = ward(X)
# now we plot the dendrogram for the linkage_array containing the distances between clusters
dendrogram(linkage_array)

import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    # mark the cuts in the tree that signify two or three clusters
    ax = plt.gca()
    bounds = ax.get_xbound()
    ax.plot(bounds, [7.25, 7.25], '--', c='k')
    ax.plot(bounds, [4, 4], '--', c='k')
    ax.text(bounds[1], 7.25, ' two clusters', verticalalignment='center', fontdict={'size': 15})
    ax.text(bounds[1], 4, ' three clusters', verticalalignment='center', fontdict={'size': 15})
    plt.title("dendrogram")
    plt.show()
Figure dendrogram 

The dendrogram shows the data points as points on the bottom (numbered from zero to eleven). Then, a tree is plotted with these points (representing single-point clusters) as the leaves, and a new parent node is added for each two clusters that are joined. Reading from bottom to top, the data points 1 and 4 are joined first (as you could see in Figure agglomerative_algorithm). Next, points 6 and 9 are joined into a cluster, and so on. At the top level, there are two branches, one consisting of points 11, 0, 5, 10, 7, 6 and 9, and the other consisting of points 1, 4, 3, 2 and 8. These correspond to the two largest clusters on the left-hand side of the plot. 

The y-axis in the dendrogram not only specifies when in the agglomerative algorithm two clusters get merged; the length of each branch also shows how far apart the merged clusters are. The longest branches in this dendrogram are the three lines around the “8” mark on the y-axis. That these are the longest branches indicates that going from three to two clusters meant merging some very far-apart points. We see this again at the top of the chart, where merging the two remaining clusters into a single cluster again bridges a large distance. 

Unfortunately, agglomerative clustering still fails at separating complex shapes like the two_moons dataset. The same is not true for the next algorithm we will look at, DBSCAN. 

DBSCAN 
Another very useful clustering algorithm is DBSCAN (which stands for “density-based spatial clustering of applications with noise”). The main benefits of DBSCAN are that: 
* It does not require the user to set the number of clusters a priori,
* It can capture clusters of complex shapes, and
* It can identify points that are not part of any cluster.

DBSCAN is somewhat slower than agglomerative clustering and k-Means, but still scales to relatively large datasets. The way DBSCAN works is by identifying points that are in “crowded” regions of the feature space, where many data points are close together. These regions are referred to as dense regions in feature space. The idea behind DBSCAN is that clusters form dense regions of data, separated by regions that are relatively empty. 

Points that are within a dense region are called core samples, and they are defined as follows: there are two parameters in DBSCAN, min_samples and eps. If there are at least min_samples many data points within a distance of eps to a given data point, that point is called a core sample. Core samples that are closer to each other than the distance eps are put into the same cluster by DBSCAN. 

The algorithm works by picking an arbitrary point to start with. It then finds all points within distance eps of that point. If there are fewer than min_samples points within distance eps, the point is labeled as noise, meaning that it doesn’t belong to any cluster. If there are at least min_samples points within distance eps, the point is labeled a core sample and assigned a new cluster label. Then, all neighbors (within eps) of the point are visited. If they have not been assigned a cluster yet, they are assigned the new cluster label that was just created. If they are core samples, their neighbors are visited in turn, and so on. 

The cluster grows until there are no more core samples within distance eps of the cluster. Then another point that hasn’t been visited yet is picked, and the same procedure is repeated. 

In the end, there are three kinds of points: core points, points that are within distance eps of core points (called boundary points), and noise. When running the DBSCAN algorithm on a particular dataset multiple times, the clustering of the core points is always the same, and the same points will always be labeled as noise. However, a boundary point might be a neighbor of core samples from more than one cluster. Therefore, the cluster membership of boundary points depends on the order in which points are visited. Usually there are only a few boundary points, and this slight dependence on the order of points is not important. 
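To make the procedure concrete, here is a minimal, unoptimized sketch of the loop described above (an illustrative re-implementation, not scikit-learn's actual code; it uses brute-force distance computations and a simple queue):

import numpy as np
from collections import deque

def simple_dbscan(X, eps=0.5, min_samples=5):
    """Illustrative DBSCAN: returns labels with -1 for noise."""
    n = len(X)
    labels = np.full(n, -2)          # -2 = not yet visited, -1 = noise
    # precompute neighborhoods with brute-force pairwise distances
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]

    cluster = 0
    for i in range(n):
        if labels[i] != -2:          # already processed
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1           # not a core sample: noise (for now)
            continue
        # i is a core sample: start a new cluster and expand it
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:      # previously noise -> becomes a boundary point
                labels[j] = cluster
            if labels[j] != -2:      # already assigned
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:   # j is also a core sample
                queue.extend(neighbors[j])
        cluster += 1
    return labels

In practice you would of course use sklearn.cluster.DBSCAN; the sketch is only meant to mirror the description above.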

Let’s apply DBSCAN on the synthetic data from above. As in agglomerative clustering, DBSCAN does not allow predictions on new test data, so we will use the fit_predict method to perform clustering and return the cluster labels in one step: 
- ch3_t8.py 
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# generate some random cluster data
X, y = make_blobs(random_state=0, n_samples=12)

# apply DBSCAN with its default parameters to the data array X
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)
print("Clusters: %s" % str(clusters))
Execution output: 
Clusters: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]

As you can see, all data points were assigned the label -1, which stands for noise. This is a consequence of the default parameter settings for eps and min_samples, which are not attuned to small toy datasets. Let’s investigate the effect of changing eps and min_samples: 
import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    fig, axes = plt.subplots(3, 4, figsize=(11, 8), subplot_kw={'xticks': (), 'yticks': ()})
    # plot clusters as red, green and blue, and outliers (-1) as white
    colors = np.array(['r', 'g', 'b', 'w'])
    # iterate over settings of min_samples and eps
    for i, min_samples in enumerate([2, 3, 5]):
        for j, eps in enumerate([1, 1.5, 2, 3]):
            # instantiate DBSCAN with a particular setting
            dbscan = DBSCAN(min_samples=min_samples, eps=eps)
            # get cluster assignments
            clusters = dbscan.fit_predict(X)
            print("min_samples: %d eps: %f cluster: %s" % (min_samples, eps, clusters))
            # visualize core samples and clusters
            sizes = 60 * np.ones(X.shape[0])
            # size is given by whether something is a core sample
            sizes[dbscan.core_sample_indices_] *= 4
            axes[i, j].scatter(X[:, 0], X[:, 1], c=colors[clusters], s=sizes)
            axes[i, j].set_title("min_samples: %d eps: %.1f" % (min_samples, eps))
    fig.tight_layout()
    plt.show()

In this plot, points that belong to clusters are colored, while the noise points are shown in white. Core samples are shown as large points, while boundary points are displayed as smaller points. 

Increasing eps (going from left to right in the figure) means that more points will be included in a cluster. This makes clusters grow, but might also lead to multiple clusters joining into one. Increasing min_samples (going from top to bottom in the figure) means that fewer points will be core points, and more points will be labeled as noise. 

The parameter eps is somewhat more important, as it determines what it means for points to be “close”. Setting eps to be very small will mean that no points are core samples, and may lead to all points being labeled as noise. Setting eps to be very large will result in all points forming a single cluster. 

The setting of min_samples mostly determines whether points in less dense regions will be labeled as outliers, or as their own cluster. If you increase min_samples, anything that would have been a cluster with fewer than min_samples samples will now be labeled as noise. min_samples therefore determines the minimum cluster size. You can see this very clearly in the plot above, when going from min_samples=3 to min_samples=5 with eps=1.5. With min_samples=3, there are three clusters: one of four points, one of five points, and one of three points. Using min_samples=5, the two smaller clusters (with three and four points) are now labeled as noise, and only the cluster with five samples remains. 

While DBSCAN doesn’t require setting the number of clusters explicitly, setting eps implicitly controls how many clusters will be found. Finding a good setting for eps is sometimes easier after scaling the data using StandardScaler or MinMaxScaler, as using these scaling techniques will ensure that all features have similar ranges. 

Below is the result of DBSCAN running on the two_moons dataset. The algorithm actually finds the two half-circles and separates them using the default settings. 
- ch3_t9.py 
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# generate some random cluster data
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# apply DBSCAN with its default parameters to the scaled data
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X_scaled)
print("Clusters: %s" % str(clusters))

import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    # plot the cluster assignments
    plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm2, s=60)
    plt.show()

As the algorithm produced the desired number of clusters (two), the parameter settings seem to work well. 

If we decrease eps to 0.2 (from the default of 0.5), we will get 8 clusters, which is clearly too many. Increasing eps to 0.7 results in a single cluster. When using DBSCAN, you need to be careful about handling the returned cluster assignments: the use of -1 to indicate noise can lead to unexpected effects when using the cluster labels to index another array, as the sketch below shows. 
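For example (a hedged sketch; the arrays are invented purely for illustration), indexing a color array with labels that contain -1 silently wraps around to the last entry instead of raising an error:

import numpy as np

colors = np.array(['red', 'green', 'blue'])
labels = np.array([0, 1, -1, 2])      # -1 marks a noise point

# -1 indexes the *last* color, so the noise point silently
# gets the same color as cluster 2
print(colors[labels])                 # ['red' 'green' 'blue' 'blue']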

Comparing and evaluating clustering algorithms 
After talking about the algorithms behind k-Means, agglomerative clustering and DBSCAN, we will now compare them on some real-world datasets. One of the challenges in applying clustering algorithms is that it is very hard to assess how well a clustering algorithm worked, and to compare outcomes between different algorithms. 

Evaluating clustering with ground truth 
There are metrics that can be used to assess the outcome of a clustering algorithm relative to a ground truth clustering, the most important ones being the adjusted rand index (ARI) and normalized mutual information (NMI), which both provide a quantitative measure between 0 and 1. 

Below we compare the k-Means, agglomerative clustering and DBSCAN algorithms using ARI. We also include what it looks like when we randomly assign points to two clusters for comparison. 
- ch3_t10.py 
from sklearn.metrics.cluster import adjusted_rand_score
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# generate some random cluster data
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})
    # make a list of algorithms to use
    algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]

    # create a random cluster assignment for reference
    random_state = np.random.RandomState(seed=0)
    random_clusters = random_state.randint(low=0, high=2, size=len(X))

    # plot the random assignment
    axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
    axes[0].set_title("Random assignment - ARI: %.2f" % adjusted_rand_score(y, random_clusters))

    for ax, algorithm in zip(axes[1:], algorithms):
        # plot the cluster assignments found by each algorithm
        clusters = algorithm.fit_predict(X_scaled)
        ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
        ax.set_title("%s - ARI: %.2f" % (algorithm.__class__.__name__, adjusted_rand_score(y, clusters)))
    plt.show()

The adjusted rand index provides intuitive results, with a random cluster assignment having a score of 0, and DBSCAN (which recovers the desired clustering perfectly) having a score of 1. A common mistake when evaluating clustering in this way is to use accuracy_score instead of a clustering metric like adjusted_rand_score or normalized_mutual_info_score. 

The problem in using accuracy is that it requires the assigned cluster labels to exactly match the ground truth. However, the cluster labels themselves are meaningless, and only which points are in the same cluster matters: 
>>> from sklearn.metrics import accuracy_score
>>> clusters1 = [0, 0, 1, 1, 0]
>>> clusters2 = [1, 1, 0, 0, 1]
>>> print("Accuracy: %.2f" % accuracy_score(clusters1, clusters2))
Accuracy: 0.00
>>> from sklearn.metrics.cluster import adjusted_rand_score
>>> print("ARI: %.2f" % adjusted_rand_score(clusters1, clusters2))
ARI: 1.00

Evaluating clustering without ground truth 
Although we have just shown one way to evaluate clustering algorithms, in practice there is a big problem with evaluation using measures like ARI: when applying clustering algorithms, there is usually no ground truth to compare to. If we knew the right clustering of the data, we could use this information to build a supervised model like a classifier. Therefore, using metrics like ARI and NMI usually only helps in developing algorithms, not in assessing success in an application. 

There are scoring metrics for clustering that don’t require ground truth, like the silhouette coefficient. However, these often don’t work well in practice. The silhouette score computes the compactness of a cluster, where higher is better, with a perfect score of 1. While compact clusters are good, compactness doesn’t allow for complex shapes. Here is an example comparing the outcome of k-Means, agglomerative clustering and DBSCAN on the two moons using the silhouette score: 
- ch3_t11.py 
from sklearn.metrics.cluster import silhouette_score
import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# generate some random cluster data
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

import os
dmode = os.environ.get('DISPLAY', '')
if dmode:
    fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})
    # make a list of algorithms to use
    algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]

    # create a random cluster assignment for reference
    random_state = np.random.RandomState(seed=0)
    random_clusters = random_state.randint(low=0, high=2, size=len(X))

    # plot the random assignment
    axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
    axes[0].set_title("Random assignment - %.2f" % silhouette_score(X_scaled, random_clusters))

    for ax, algorithm in zip(axes[1:], algorithms):
        # plot the cluster assignments found by each algorithm
        clusters = algorithm.fit_predict(X_scaled)
        ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
        ax.set_title("%s - %.2f" % (algorithm.__class__.__name__, silhouette_score(X_scaled, clusters)))
    plt.show()

As you can see, k-Means gets the highest silhouette score, even though we might prefer the result produced by DBSCAN. 

A slightly better strategy for evaluating clusterings is to use robustness-based clustering metrics. These run an algorithm after adding some noise to the data, or using different parameter settings, and compare the outcomes. The idea is that if many algorithm parameters and many perturbations of the data return the same result, it is likely to be trustworthy. Unfortunately, this strategy is not implemented in scikit-learn at the time of writing; a sketch of the idea follows below. Even if we get a very robust clustering, or a very high silhouette score, we still don’t know if there is any semantic meaning in the clustering, or whether the clustering reflects an aspect of the data that we are interested in. 
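A hedged sketch of this idea (not a scikit-learn feature; the noise scale and number of repetitions are arbitrary illustrative choices): re-run a clustering on perturbed copies of the data and compare each result with the clustering of the original data using ARI.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.cluster import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

reference = DBSCAN().fit_predict(X_scaled)
rng = np.random.RandomState(0)

# cluster several noisy copies of the data and compare each result
# with the clustering of the original data
scores = []
for _ in range(10):
    X_noisy = X_scaled + rng.normal(scale=0.05, size=X_scaled.shape)
    perturbed = DBSCAN().fit_predict(X_noisy)
    scores.append(adjusted_rand_score(reference, perturbed))

print("Mean ARI between original and perturbed clusterings: %.2f" % np.mean(scores))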

Supplement 
Unsupervised learning : The k-means clustering algorithm (1) 
Unsupervised learning : The k-means clustering algorithm (2) 
[ML in DH] Clustering - Basic Agglomerative Clustering
