As we described above, clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster a particular point belongs to.
k-Means clustering
k-Means clustering is one of the simplest and most commonly used clustering algorithms. It tries to find cluster centers that are representative of certain regions of the data.
The algorithm alternates between two steps: assigning each data point to the closest cluster center, and then setting each cluster center as the mean of the data points that are assigned to it. The algorithm is finished when the assignment of instances to clusters no longer changes. Figure kmeans_algorithm illustrates the algorithm on a synthetic dataset:
- mglearn.plots.plot_kmeans_algorithm()
- plt.suptitle("kmeans_algorithm");
We specified that we are looking for three clusters, so the algorithm was initialized by declaring three data points as cluster centers (see “Initialization”). Then the iterative algorithm starts: Each data point is assigned to the cluster center it is closest to (see “Assign Points (1)”). Next, the cluster centers are updated to be the mean of the assigned points (see “Recompute Centers (1)”). Then the process is repeated. After the second iteration, the assignment of points to cluster centers remained unchanged, so the algorithm stops.
Given new data points, k-Means will assign them to the closest cluster center. Here are the boundaries of the cluster centers that were learned in the diagram above:
- mglearn.plots.plot_kmeans_boundaries()
Applying k-Means with scikit-learn is quite straightforward. Here we apply it to the synthetic data that we used for the plots above. We instantiate the KMeans class and set the number of clusters we are looking for (if you don’t provide n_clusters, it defaults to eight; there is no particular reason why you should use eight). Then we call the fit method with the data:
- ch3_t2.py
- #!/usr/bin/env python
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_blobs
- from sklearn.cluster import KMeans
- # generate synthetic two-dimensional data
- X, y = make_blobs(random_state=1)
- # build the clustering model:
- kmeans = KMeans(n_clusters=3)
- kmeans.fit(X)
As we asked for three clusters, the clusters are numbered 0 to 2. You can also assign cluster labels to new points, using the predict method. Each new point is assigned to the closest cluster center when predicting, but the existing model is not changed. Running predict on the training set returns the same as labels_:
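As a minimal check, continuing from the kmeans model fitted above, we can print both and confirm they agree:
- # cluster memberships assigned during fitting
- print("Cluster memberships:\n%s" % kmeans.labels_)
- # predicting on the training data returns the same assignments
- print("Predicted memberships:\n%s" % kmeans.predict(X))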
You can see that clustering is somewhat similar to classification, in that each item gets a label. However, there is no ground truth, and consequently the labels themselves have no a priori meaning. Let’s go back to the example of clustering face images that we discussed before. It might be that cluster 3 found by the algorithm contains only faces of your friend Bela. You can only know that after you have looked at the pictures, though, and the number 3 is arbitrary. The only information the algorithm gives you is that all faces labeled as 3 are similar.
For the clustering we just computed on the two-dimensional toy dataset, this means that each point in X receives one of the labels 0, 1 or 2. Here is a plot of the data again, colored by cluster assignment. The cluster centers are stored in the cluster_centers_ attribute, and we plot them as triangles:
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap=mglearn.cm3, s=60)
-     plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='^', s=100, linewidth=2, c=[0, 1, 2], cmap=mglearn.cm3)
-     plt.show()
We can also use more or fewer cluster centers, as sketched below:
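For example, here is a minimal sketch (continuing from the same data X) that clusters the points into two and into five groups; the figure layout is illustrative:
- fig, axes = plt.subplots(1, 2, figsize=(10, 5))
- # using two cluster centers:
- kmeans = KMeans(n_clusters=2)
- kmeans.fit(X)
- axes[0].scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap=mglearn.cm2, s=60)
- # using five cluster centers:
- kmeans = KMeans(n_clusters=5)
- kmeans.fit(X)
- axes[1].scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=60)
- plt.show()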
Failure cases of k-Means
Even if you know the “right” number of clusters for a given dataset, k-Means might not always be able to recover them. Each cluster is defined solely by its center, which means that each cluster is a convex shape. As a result, k-Means can only capture relatively simple shapes. k-Means also assumes that all clusters have the same “diameter” in some sense; it always draws the boundary between clusters exactly in the middle between the cluster centers. That can sometimes lead to surprising results, as shown below:
- X, y = make_blobs(random_state=0)
- plt.scatter(X[:, 0], X[:, 1]);
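The snippet above only generates and plots the blob data. To actually see the surprising boundary placement, we also need to run k-Means on blobs whose spreads differ; the following sketch uses illustrative cluster_std values chosen only to make the effect visible:
- # blobs with very different standard deviations (illustrative values)
- X_varied, y_varied = make_blobs(n_samples=200, cluster_std=[1.0, 2.5, 0.5], random_state=170)
- y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X_varied)
- plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred, cmap=mglearn.cm3, s=60)
- plt.show()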
k-Means also assumes that all directions are equally important for each cluster. The plot below shows a two-dimensional dataset where there are three clearly separated parts in the data. However, these groups are stretched towards the diagonal. As k-Means only considers the distance to the nearest cluster center, it can’t handle this kind of data.
- ch3_t3.py
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_blobs
- from sklearn.cluster import KMeans
- # generate some random cluster data
- X, y = make_blobs(random_state=170, n_samples=600)
- rng = np.random.RandomState(74)
- # transform the data to be stretched
- transformation = rng.normal(size=(2, 2))
- X = np.dot(X, transformation)
- # cluster the data into three clusters
- kmeans = KMeans(n_clusters=3)
- kmeans.fit(X)
- y_pred = kmeans.predict(X)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # plot the cluster assignments and cluster centers
-     plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm3)
-     plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='^', c=['b', 'r', 'g'], s=60, linewidth=2);
-     plt.show()
k-Means also performs poorly if the clusters have more complex shapes, like the two_moons data we encountered in Chapter 2:
- ch3_t4.py
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_moons
- from sklearn.cluster import KMeans
- # generate some random cluster data
- X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
- # cluster the data into two clusters
- kmeans = KMeans(n_clusters=2)
- kmeans.fit(X)
- y_pred = kmeans.predict(X)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # plot the cluster assignments and cluster centers
-     plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm3, s=60)
-     plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='^', c=['b', 'g'], s=60, linewidth=2);
-     plt.show()
Here, we would hope that the clustering algorithm can discover the two half-moon shapes. However, this is not possible using the k-Means algorithm.
Vector Quantization - Or Seeing k-Means as Decomposition
Even though k-Means is a clustering algorithm, there are interesting parallels between k-Means and the decomposition methods like PCA and NMF that we discussed above. You might remember that PCA tries to find directions of maximum variance in the data, while NMF tries to find additive components, which often correspond to “extremes” or “parts” of the data (see Figure nmf_illustration). Both methods try to express each data point as a sum over some components.
k-Means on the other hand tries to represent each data point using a cluster center. You can think of that as each point being represented using only a single component, which is given by the cluster center. This view of k-Means as a decomposition method, where each point is represented using a single component, is called vector quantization.
An interesting aspect of vector quantization using k-Means is that we can use many more clusters than input dimensions to encode our data. Let’s go back to the two_moons data. Using PCA or NMF, there is nothing much we can do to this data, as it lives in only two dimensions. Reducing it to one dimension with PCA or NMF would completely destroy the structure of the data. But we can find a more expressive representation using k-Means, by using more cluster centers:
- ch3_t5.py
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_moons
- from sklearn.cluster import KMeans
- # generate some random cluster data
- X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
- # cluster the data into ten clusters
- kmeans = KMeans(n_clusters=10)
- kmeans.fit(X)
- y_pred = kmeans.predict(X)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # plot the cluster assignments and cluster centers
-     plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=60, cmap='Paired')
-     plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
-                 marker='^', c=range(kmeans.n_clusters), s=60, linewidth=2, cmap='Paired')
-     plt.show()
We used 10 cluster centers, which means each point is now assigned a number between 0 and 9. We can see this as the data being represented using 10 components (that is, we have ten new features), with all features being zero apart from the one that represents the cluster center the point is assigned to. Using this 10-dimensional representation, it would now be possible to separate the two half-moon shapes using a linear model, which would not have been possible using the original two features. A rough sketch of this idea follows below.
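As a sanity check of that claim (this example is not from the original text), we can build the one-hot cluster-membership features by hand, continuing from the 10-cluster model above, and fit a linear classifier on them, using the moon labels y only for evaluation:
- from sklearn.linear_model import LogisticRegression
- # build the 10-dimensional one-hot representation of the cluster memberships
- X_onehot = np.zeros((len(X), 10))
- X_onehot[np.arange(len(X)), y_pred] = 1
- # a linear classifier on these indicator features can separate the two half-moons
- logreg = LogisticRegression().fit(X_onehot, y)
- print("training accuracy on the one-hot cluster features: %.2f" % logreg.score(X_onehot, y))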
It is also possible to get an even more expressive representation of the data by using the distances to each of the cluster centers as features. This can be done using the transform method of kmeans:
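A minimal sketch, again continuing from the 10-cluster kmeans model above:
- # one distance per cluster center: shape (n_samples, n_clusters)
- distance_features = kmeans.transform(X)
- print("Distance feature shape: %s" % (distance_features.shape,))
- print("Distance features:\n%s" % distance_features)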
k-Means is a very popular algorithm for clustering, not only because it is relatively easy to understand and implement, but also because it runs relatively quickly. k-Means scales easily to large datasets, and scikit-learn even includes a more scalable variant in the MiniBatchKMeans class, which can handle very large datasets.
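As a small illustrative sketch (the parameter values here are arbitrary), the mini-batch variant is used in exactly the same way as KMeans:
- from sklearn.cluster import MiniBatchKMeans
- # MiniBatchKMeans updates the cluster centers using small random subsets of the data
- mbk = MiniBatchKMeans(n_clusters=3, batch_size=100)
- mbk.fit(X)
- print("Cluster memberships:\n%s" % mbk.labels_)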
One of the drawbacks of k-Means is that it relies on a random initialization, which means the outcome of the algorithm depends on a random seed. By default, scikit-learn runs the algorithm 10 times with 10 different random initializations, and returns the best result (best here meaning that the sum of squared distances of the points to their cluster centers is small).
Further downsides of k-Means are the relatively restrictive assumptions made on the shape of clusters, and the requirement to specify the number of clusters you are looking for (which might not be known in a real-world application). Next, we will look at two more clustering algorithms that improve upon these properties in some ways.
Agglomerative Clustering
Agglomerative clustering refers to a collection of clustering algorithms that all build upon the same principle: the algorithm starts by declaring each point its own cluster, and then merges the two most similar clusters until some stopping criterion is satisfied. The stopping criterion implemented in scikit-learn is the number of clusters, so similar clusters are merged until only the specified number of clusters is left.
There are several linkage criteria that specify exactly how the “most similar cluster” is measured. The following three choices are implemented in scikit-learn:
- ward, the default, which picks the two clusters to merge such that the variance within all clusters increases the least; this often leads to relatively equally sized clusters.
- average, which merges the two clusters that have the smallest average distance between all their points.
- complete (also known as maximum linkage), which merges the two clusters that have the smallest maximum distance between their points.
Ward is generally a good default; all our examples below will use ward. The plot below illustrates the progression of agglomerative clustering on a two-dimensional dataset, looking for three clusters.
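The illustration can be reproduced with an mglearn plotting helper; this assumes your mglearn version provides plot_agglomerative_algorithm, analogous to the plot_kmeans_algorithm helper used earlier:
- mglearn.plots.plot_agglomerative_algorithm()
- plt.suptitle("agglomerative_algorithm");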
In the beginning, each point is its own cluster. Then, in each step, the two clusters that are closest are merged. In the first four steps, two single-point clusters are picked and joined into two-point clusters. In step five, one of the two-point clusters is extended to a third point, and so on. In step nine, there are only three clusters remaining. As we specified that we are looking for three clusters, the algorithm then stops.
Let’s have a look at how agglomerative clustering performs on the simple three-cluster data we used above. Because of the way the algorithm works, agglomerative clustering cannot make predictions for new data points. Therefore, agglomerative clustering has no predict method. To build the model and get the cluster memberships on the training set, use the fit_predict method instead. (We could also use the labels_ attribute, as we did for k-Means.)
- ch3_t6.py
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_blobs
- from sklearn.cluster import AgglomerativeClustering
- # generate some random cluster data
- X, y = make_blobs(random_state=1)
- # cluster the data into three clusters
- agg = AgglomerativeClustering(n_clusters=3)
- assignment = agg.fit_predict(X)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # plot the cluster assignments
-     plt.scatter(X[:, 0], X[:, 1], c=assignment, cmap=mglearn.cm3, s=60)
-     plt.show()
As expected, the algorithm recovers the clustering perfectly. While the scikit-learn implementation of agglomerative clustering requires you to specify the number of clusters you want the algorithm to find, agglomerative clustering methods provide some help with choosing the right number, which we will discuss next.
Hierarchical Clustering and Dendrograms
Agglomerative clustering produces what is known as a hierarchical clustering. The clustering proceeds iteratively, and every point makes a journey from being a single point cluster to belonging to some final cluster. Each intermediate step provides a clustering of the data (with a different number of clusters). It is sometimes helpful to look at all possible clusterings jointly. The figure below shows an overlay of all possible clusterings shown in Figure agglomerative_algorithm, providing some insight into how each cluster breaks up into smaller clusters.
- mglearn.plots.plot_agglomerative()
While this visualization provides a very detailed view of the hierarchical clustering, it relies on the two-dimensional nature of the data, and can therefore not be used on datasets that have more than two features. There is, however, another tool to visualize hierarchical clustering, called a dendrogram (as shown in Figure dendrogram below).
Unfortunately, scikit-learn currently does not have the functionality to draw dendrograms. However, you can generate them easily using scipy. The scipy clustering algorithms have a slightly different interface from the scikit-learn clustering algorithms: scipy provides a function that takes a data array X and computes a linkage array, which encodes the hierarchical cluster similarities. We can then feed this linkage array into the scipy dendrogram function to plot the dendrogram:
- ch3_t7.py
- from scipy.cluster.hierarchy import dendrogram, ward
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_blobs
- from sklearn.cluster import AgglomerativeClustering
- # generate some random cluster data
- X, y = make_blobs(random_state=0, n_samples=12)
- # apply the ward clustering to the data array X
- # the scipy ward function returns an array that specifies the distances bridged when performing agglomerative clustering
- linkage_array = ward(X)
- # now we plot the dendrogram for the linkage_array containing the distances between clusters
- dendrogram(linkage_array)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # mark the cuts in the tree that signify two or three clusters
-     ax = plt.gca()
-     bounds = ax.get_xbound()
-     ax.plot(bounds, [7.25, 7.25], '--', c='k')
-     ax.plot(bounds, [4, 4], '--', c='k')
-     ax.text(bounds[1], 7.25, ' two clusters', verticalalignment='center', fontdict={'size': 15})
-     ax.text(bounds[1], 4, ' three clusters', verticalalignment='center', fontdict={'size': 15})
-     plt.title("dendrogram")
-     plt.show()
The dendrogram shows the data points as points on the bottom (numbered from zero to eleven). Then, a tree is plotted with these points (representing single-point clusters) as the leaves, and a new parent node is added for every two clusters that are joined. Reading from bottom to top, data points 1 and 4 are joined first (as you could see in Figure agglomerative_algorithm). Next, points 6 and 9 are joined into a cluster, and so on. At the top level, there are two branches, one consisting of points 11, 0, 5, 10, 7, 6 and 9, and the other consisting of points 1, 4, 3, 2 and 8. These correspond to the two largest clusters on the left-hand side of the plot.
The y-axis in the dendrogram not only specifies when in the agglomerative algorithm two clusters get merged; the length of each branch also shows how far apart the merged clusters are. The longest branches in this dendrogram are the three lines around the “8” mark on the y-axis. That these are the longest branches indicates that going from three to two clusters meant merging some very far-apart points. We see this again at the top of the chart, where merging the two remaining clusters into a single cluster again bridges a large distance.
Unfortunately, agglomerative clustering still fails at separating complex shapes like the two_moons dataset. The same is not true for the next algorithm we will look at, DBSCAN.
DBSCAN
Another very useful clustering algorithm is DBSCAN (which stands for “density-based spatial clustering of applications with noise”). The main benefits of DBSCAN are that it does not require the user to set the number of clusters a priori, it can capture clusters of complex shapes, and it can identify points that are not part of any cluster.
DBSCAN is somewhat slower than agglomerative clustering and k-Means, but still scales to relatively large datasets. The way DBSCAN works is by identifying points that are in “crowded” regions of the feature space, where many data points are close together. These regions are referred to as dense regions in feature space. The idea behind DBSCAN is that clusters form dense regions of data, separated by regions that are relatively empty.
Points that are within a dense region are called core samples, and they are defined as follows: There are two parameters in DBSCAN, min_samples and eps. If there are at least min_samples many data points within a distance of eps to a given data point, it’s called a core sample. Core samples that are closer than the distance eps are put into the same cluster by DBSCAN.
The algorithm works by picking an arbitrary point to start with. It then finds all points within distance eps of that point. If there are fewer than min_samples points within distance eps, the point is labeled as noise, meaning that it doesn’t belong to any cluster. If there are at least min_samples points within distance eps, the point is labeled a core sample and assigned a new cluster label. Then, all neighbors (within eps) of the point are visited. If they have not been assigned a cluster yet, they are assigned the new cluster label we just created. If they are core samples, their neighbors are visited in turn, and so on.
The cluster grows, until there are no more core-samples within distance eps of the cluster. Then another point, which hasn’t yet been visited, is picked, and the same procedure is repeated.
In the end, there are three kinds of points: core points, points that are within distance eps of core points (called boundary points), and noise. When running the DBSCAN algorithm on a particular dataset multiple times, the clustering of the core points is always the same, and the same points will always be labeled as noise. However, a boundary point might be a neighbor of core samples of more than one cluster. Therefore, the cluster membership of boundary points depends on the order in which points are visited. Usually there are only a few boundary points, and this slight dependence on the order of points is not important.
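To make the procedure concrete, here is a minimal brute-force sketch of the algorithm just described. This is illustrative code only, not the scikit-learn implementation (which uses efficient neighbor queries):
- import numpy as np
- def dbscan_sketch(X, eps=0.5, min_samples=5):
-     n = len(X)
-     labels = np.full(n, -1)             # -1 means noise until proven otherwise
-     visited = np.zeros(n, dtype=bool)
-     cluster = 0
-     for i in range(n):
-         if visited[i]:
-             continue
-         visited[i] = True
-         # all points within distance eps (the point itself is included)
-         neighbors = np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]
-         if len(neighbors) < min_samples:
-             continue                    # not a core sample; stays noise for now
-         labels[i] = cluster             # start a new cluster from this core sample
-         queue = list(neighbors)
-         while queue:
-             j = queue.pop()
-             if labels[j] == -1:
-                 labels[j] = cluster     # boundary or core point joins the cluster
-             if not visited[j]:
-                 visited[j] = True
-                 j_neighbors = np.where(np.linalg.norm(X - X[j], axis=1) <= eps)[0]
-                 if len(j_neighbors) >= min_samples:
-                     queue.extend(j_neighbors)   # j is itself a core sample: keep growing
-         cluster += 1
-     return labels
Running dbscan_sketch on the toy data below with the same eps and min_samples settings should group the core points the same way DBSCAN does, up to the arbitrary label numbering and the order-dependent boundary points.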
Let’s apply DBSCAN on the synthetic data from above. As in agglomerative clustering, DBSCAN does not allow predictions on new test data, so we will use the fit_predict method to perform clustering and return the cluster labels in one step:
- ch3_t8.py
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_blobs
- from sklearn.cluster import DBSCAN
- # generate some random cluster data
- X, y = make_blobs(random_state=0, n_samples=12)
- # apply DBSCAN with default parameters to the data array X
- dbscan = DBSCAN()
- clusters = dbscan.fit_predict(X)
- print "Clusters: %s" % (str(clusters))
As you can see, all data points were assigned the label -1, which stands for noise. This is a consequence of the default parameter settings for eps and min_samples, which are not attuned to small toy datasets. Let’s investigate the effect of changing eps and min_samples.
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     fig, axes = plt.subplots(3, 4, figsize=(11, 8), subplot_kw={'xticks': (), 'yticks': ()})
-     # Plot clusters as red, green and blue, and outliers (-1) as white
-     colors = np.array(['r', 'g', 'b', 'w'])
-     # iterate over settings of min_samples and eps
-     for i, min_samples in enumerate([2, 3, 5]):
-         for j, eps in enumerate([1, 1.5, 2, 3]):
-             # instantiate DBSCAN with a particular setting
-             dbscan = DBSCAN(min_samples=min_samples, eps=eps)
-             # get cluster assignments
-             clusters = dbscan.fit_predict(X)
-             print("min_samples: %d eps: %f cluster: %s" % (min_samples, eps, clusters))
-             # visualize core samples and clusters
-             sizes = 60 * np.ones(X.shape[0])
-             # size is given by whether something is a core sample
-             sizes[dbscan.core_sample_indices_] *= 4
-             axes[i, j].scatter(X[:, 0], X[:, 1], c=colors[clusters], s=sizes)
-             axes[i, j].set_title("min_samples: %d eps: %.1f" % (min_samples, eps))
-     fig.tight_layout()
-     plt.show()
In this plot, points that belong to clusters are colored, while the noise points are shown in white. Core samples are shown as large markers, while boundary points are displayed as smaller ones.
Increasing eps (going from left to right in the figure) means that more points will be included in a cluster. This makes clusters grow, but might also lead to multiple clusters joining into one. Increasing min_samples (going from top to bottom in the figure) means that fewer points will be core points, and more points will be labeled as noise.
The parameter eps is somewhat more important, as it determines what it means for points to be “close”. Setting eps to be very small will mean that no points are core samples, and may lead to all points being labeled as noise. Setting eps to be very large will result in all points forming a single cluster.
The setting of min_samples mostly determines whether points in less dense regions will be labeled as outliers, or as their own cluster. If you increase min_samples, anything that would have been a cluster with fewer than min_samples points will now be labeled as noise. min_samples therefore determines the minimum cluster size. You can see this very clearly in the plot above, when going from min_samples=3 to min_samples=5 with eps=1.5. With min_samples=3, there are three clusters: one of four points, one of five points and one of three points. With min_samples=5, the two smaller clusters (with three and four points) are now labeled as noise, and only the cluster with five points remains.
While DBSCAN doesn’t require setting the number of clusters explicitly, setting eps implicitly controls how many clusters will be found. Finding a good setting for eps is sometimes easier after scaling the data using StandardScaler or MinMaxScaler, as using these scaling techniques will ensure that all features have similar ranges.
Below is the result of DBSCAN running on the two_moons dataset. Using the default settings, the algorithm actually finds the two half-moon shapes and separates them, provided we first rescale the data (as done below).
- ch3_t9.py
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_moons
- from sklearn.cluster import DBSCAN
- # generate some random cluster data
- X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
- # Rescale the data to zero mean and unit variance
- from sklearn.preprocessing import StandardScaler
- scaler = StandardScaler()
- scaler.fit(X)
- X_scaled = scaler.transform(X)
- # apply DBSCAN with default parameters to the rescaled data
- dbscan = DBSCAN()
- clusters = dbscan.fit_predict(X_scaled)
- print("Clusters: %s" % clusters)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # plot the cluster assignments
-     plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm2, s=60)
-     plt.show()
As the algorithm produced the desired number of clusters (two), the parameter settings seem to work well.
If we decrease eps to 0.2 (from the default of 0.5), we get eight clusters, which is clearly too many. Increasing eps to 0.7 results in a single cluster (see the sketch below). When using DBSCAN, you need to be careful about handling the returned cluster assignments: the use of -1 to indicate noise might result in unexpected effects when using the cluster labels to index another array.
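A quick sketch of those two settings, continuing from the scaled two_moons data above:
- # smaller eps: the half-moons get shattered into many small clusters (plus noise)
- print("eps=0.2: %s" % np.unique(DBSCAN(eps=0.2).fit_predict(X_scaled)))
- # larger eps: everything is merged into a single cluster
- print("eps=0.7: %s" % np.unique(DBSCAN(eps=0.7).fit_predict(X_scaled)))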
Comparing and evaluating clustering algorithms
After talking about the algorithms behind k-Means, agglomerative clustering and DBSCAN, we will now compare them on some real-world datasets. One of the challenges in applying clustering algorithms is that it is very hard to assess how well a clustering algorithm worked, and to compare outcomes between different algorithms.
Evaluating clustering with ground truth
There are metrics that can be used to assess the outcome of a clustering algorithm relative to a ground truth clustering, the most important ones being the adjusted rand index (ARI) and normalized mutual information (NMI), which both provide a quantitative measure with an optimum of 1 and values around 0 for unrelated clusterings.
Below we compare the k-Means, agglomerative clustering and DBSCAN algorithms using ARI. We also include what it looks like when we randomly assign points to two clusters for comparison.
- ch3_t10.py
- from sklearn.metrics.cluster import adjusted_rand_score
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_moons
- from sklearn.cluster import DBSCAN
- from sklearn.cluster import KMeans
- from sklearn.cluster import AgglomerativeClustering
- # generate some random cluster data
- X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
- # Rescale the data to zero mean and unit variance
- from sklearn.preprocessing import StandardScaler
- scaler = StandardScaler()
- scaler.fit(X)
- X_scaled = scaler.transform(X)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # plot the cluster assignments
-     fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})
-     # make a list of algorithms to use
-     algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]
-     # create a random cluster assignment for reference:
-     random_state = np.random.RandomState(seed=0)
-     random_clusters = random_state.randint(low=0, high=2, size=len(X))
-     # plot random assignment:
-     axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
-     axes[0].set_title("Random assignment - ARI: %.2f" % adjusted_rand_score(y, random_clusters))
-     for ax, algorithm in zip(axes[1:], algorithms):
-         # plot the cluster assignments for each algorithm
-         clusters = algorithm.fit_predict(X_scaled)
-         ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
-         ax.set_title("%s - ARI: %.2f" % (algorithm.__class__.__name__, adjusted_rand_score(y, clusters)))
-     plt.show()
The adjusted rand index provides intuitive results, with a random cluster assignment having a score of 0, and DBSCAN (which recovers the desired clustering perfectly) having a score of 1. A common mistake when evaluating clustering in this way is to use accuracy_score instead of a clustering metric like adjusted_rand_score and normalized_mutual_info_score.
The problem in using accuracy is that it requires the assigned cluster labels to exactly match the ground truth. However, the cluster labels themselves are meaningless, and only which points are in the same cluster matters:
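A minimal illustration of the difference (the label vectors here are made up for the example):
- from sklearn.metrics import accuracy_score
- from sklearn.metrics.cluster import adjusted_rand_score
- # these two labelings group the points identically; only the label names differ
- clusters1 = [0, 0, 1, 1, 0]
- clusters2 = [1, 1, 0, 0, 1]
- # accuracy is zero because not a single label matches...
- print("Accuracy: %.2f" % accuracy_score(clusters1, clusters2))
- # ...but the ARI is 1 because the two groupings are identical
- print("ARI: %.2f" % adjusted_rand_score(clusters1, clusters2))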
Evaluating clustering without ground truth
Although we have just shown one way to evaluate clustering algorithms, in practice there is a big problem with evaluation using measures like ARI: when applying clustering algorithms, there is usually no ground truth to compare to. If we knew the right clustering of the data, we could use this information to build a supervised model like a classifier. Therefore, metrics like ARI and NMI usually only help in developing algorithms, not in assessing success in an application.
There are scoring metrics for clustering that don’t require ground truth, like the silhouette coefficient. However, these often don’t work well in practice. The silhouette score computes the compactness of a cluster, where higher is better, with a perfect score of 1. While compact clusters are good, compactness doesn’t allow for complex shapes. Here is an example comparing the outcome of k-Means, agglomerative clustering and DBSCAN on the two moons using the silhouette score:
- ch3_t11.py
- from sklearn.metrics.cluster import silhouette_score
- import numpy as np
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_moons
- from sklearn.cluster import DBSCAN
- from sklearn.cluster import KMeans
- from sklearn.cluster import AgglomerativeClustering
- # generate some random cluster data
- X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
- # Rescale the data to zero mean and unit variance
- from sklearn.preprocessing import StandardScaler
- scaler = StandardScaler()
- scaler.fit(X)
- X_scaled = scaler.transform(X)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # plot the cluster assignments
-     fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})
-     # make a list of algorithms to use
-     algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]
-     # create a random cluster assignment for reference:
-     random_state = np.random.RandomState(seed=0)
-     random_clusters = random_state.randint(low=0, high=2, size=len(X))
-     # plot random assignment:
-     axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
-     axes[0].set_title("Random assignment - %.2f" % silhouette_score(X_scaled, random_clusters))
-     for ax, algorithm in zip(axes[1:], algorithms):
-         # plot the cluster assignments for each algorithm
-         clusters = algorithm.fit_predict(X_scaled)
-         ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
-         ax.set_title("%s - %.2f" % (algorithm.__class__.__name__, silhouette_score(X_scaled, clusters)))
-     plt.show()
As you can see, k-Means gets the highest silhouette score, even though we might prefer the result produced by DBSCAN.
A slightly better strategy for evaluating clusters is to use robustness-based clustering metrics. These run an algorithm after adding some noise to the data, or using different parameter settings, and compare the outcomes. The idea is that if many algorithm parameters and many perturbations of the data return the same result, it is likely to be trustworthy. Unfortunately, this strategy is not implemented in scikit-learn at the time of writing. Even if we get a very robust clustering, or a very high silhouette score, we still don’t know whether there is any semantic meaning in the clustering, or whether the clustering reflects an aspect of the data that we are interested in.
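Although this is not implemented in scikit-learn, a crude version of the idea is easy to sketch by hand; the noise scale and number of reruns below are arbitrary choices:
- import numpy as np
- from sklearn.cluster import KMeans
- from sklearn.datasets import make_moons
- from sklearn.metrics.cluster import adjusted_rand_score
- X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
- rng = np.random.RandomState(0)
- labelings = []
- for i in range(10):
-     # rerun the algorithm on a slightly perturbed copy of the data
-     X_noisy = X + rng.normal(scale=0.05, size=X.shape)
-     labelings.append(KMeans(n_clusters=2, random_state=i).fit_predict(X_noisy))
- # pairwise ARI between runs: consistently high values indicate a stable clustering
- scores = [adjusted_rand_score(labelings[i], labelings[j])
-           for i in range(10) for j in range(i + 1, 10)]
- print("mean pairwise ARI across reruns: %.2f" % np.mean(scores))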
Supplement
* Unsupervised learning : The k-means clustering algorithm (1)
* Unsupervised learning : The k-means clustering algorithm (2)
* [ML in DH] Clustering: Basic Agglomerative Clustering