## Thursday, March 7, 2013

### [ ML In Action ] Unsupervised learning : The k-means clustering algorithm (1)

Preface:

The k-means clustering algorithm:

Pros: Easy to implement
Cons: Can converge at local minima; slow on very large datasets
Works with: Numeric values

1. Create k points as starting centroids (often randomly)
2. While any point has changed cluster assignment:
3.     for every point in our dataset:
4.         for every centroid:
5.             calculate the distance between the centroid and the point
6.         assign the point to the cluster with the lowest distance
7.     for every cluster, calculate the mean of the points in that cluster
8.         assign the centroid to that mean

1. Collect: Any method.
2. Prepare: Numeric values are needed for a distance calculation, and nominal values can be mapped into binary values for distance calculations.
3. Analyze: Any method.
4. Train: Doesn’t apply to unsupervised learning.
5. Test: Apply the clustering algorithm and inspect the results. Quantitative error measurements such as sum of squared error (introduced later) can be used.
6. Use: Anything you wish. Often, the cluster centers can be treated as representative data for the whole cluster when making decisions.
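Step 2 above says nominal values can be mapped into binary values for distance calculations. A minimal sketch of that idea (one-hot encoding; the `colors` feature here is made up for illustration):

```python
import numpy as np

# Hypothetical nominal feature with three possible values
categories = ['red', 'green', 'blue']

def one_hot(value, categories):
    """Map a nominal value to a binary indicator vector usable in a distance calculation."""
    return np.array([1.0 if value == c else 0.0 for c in categories])

a = one_hot('red', categories)   # array([1., 0., 0.])
b = one_hot('blue', categories)  # array([0., 0., 1.])
dist = np.sqrt(np.sum((a - b) ** 2))  # Euclidean distance between the encodings: sqrt(2)
```

Two different nominal values are always the same distance apart under this encoding, which is usually what you want when the categories have no inherent order.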

- Listing 10.1 k-means support functions

from numpy import *

def loadDataSet(fileName):      # general function to parse tab-delimited floats
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))  # map all elements to float (list() needed in Python 3)
        dataMat.append(fltLine)
    return dataMat

def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))  # Euclidean distance; equivalent to la.norm(vecA-vecB)

def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # create centroid mat
    for j in range(n):  # create random cluster centers, within bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids
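A quick sanity check of these two helpers on a tiny made-up dataset (the functions are repeated here from Listing 10.1 so the snippet stands alone):

```python
from numpy import *

# distEclud and randCent as in Listing 10.1
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids

datMat = mat([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])  # tiny illustrative dataset
d = distEclud(datMat[0], datMat[1])  # distance between the first two points: sqrt(2)
cents = randCent(datMat, 2)          # 2 random centroids, each within the data's bounds
```

Note that `randCent` only guarantees the random centroids fall inside the bounding box of the data in each dimension; it does not pick actual data points.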

- Listing 10.2 The k-means clustering algorithm

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]  # instance count
    # clusterAssment records, for each instance, the cluster it belongs to
    # and the squared distance to that cluster's centroid
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each data point to the closest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex: clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        print(centroids)
        for cent in range(k):  # recalculate centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # all points in this cluster
            centroids[cent, :] = mean(ptsInClust, axis=0)  # assign centroid to the mean of its points
    return centroids, clusterAssment
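Because the second column of `clusterAssment` stores each point's squared error, the sum of squared error (SSE) mentioned in the preparation steps can be read off directly after `kMeans` returns. A small sketch with a made-up assignment matrix:

```python
from numpy import mat

# Hypothetical clusterAssment: column 0 = cluster index, column 1 = squared error
clusterAssment = mat([[0, 0.25],
                      [1, 0.09],
                      [0, 0.16],
                      [1, 0.04]])
sse = clusterAssment[:, 1].sum()  # total sum of squared error, here 0.54
```

A lower SSE means the points are closer to their centroids; comparing SSE across runs (or across values of k) is a simple quantitative way to inspect clustering results.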

The following line from Listing 10.2 deserves a closer look:

ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]
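How this indexing works is easiest to see on a small example: `.A` converts the matrix column to an array, the comparison yields a boolean array, and `nonzero(...)[0]` returns the row indices where it is True, which are then used to select rows of the dataset:

```python
from numpy import mat, nonzero

clusterAssment = mat([[0, 0.1], [2, 0.3], [0, 0.2]])  # made-up assignments
rows = nonzero(clusterAssment[:, 0].A == 0)[0]        # row indices assigned to cluster 0
print(rows)  # [0 2]

dataSet = mat([[1.0, 1.0], [5.0, 5.0], [2.0, 2.0]])
ptsInClust = dataSet[rows]  # the two points belonging to cluster 0
```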

>>> a = np.array([[1, 2], [3, 4]])
>>> np.mean(a) # mean of all elements (1,2,3,4)
2.5
>>> np.mean(a, axis=0) # column means: (1+3)/2=2 and (2+4)/2=3
array([ 2., 3.])
>>> np.mean(a, axis=1) # row means: (1+2)/2=1.5 and (3+4)/2=3.5
array([ 1.5, 3.5])

Experiment:

>>> import matplotlib.pyplot as plt
>>> import kMeans
>>> from numpy import mat
>>> datMat = mat(kMeans.loadDataSet('testSet.txt')) # sample data file from the book
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datMat[:,0].A.flatten(), datMat[:,1].A.flatten())
>>> plt.show()

#-*- coding: utf-8 -*-
from numpy import *
import matplotlib.pyplot as plt
import kMeans

# 1) Execute the k-means clustering algorithm
k = 4  # How many clusters?
datMat = mat(kMeans.loadDataSet('testSet.txt'))  # sample data file from the book
rst = kMeans.kMeans(datMat, k)

# 2) Plot the centroids
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x=array(rst[0][:,0]).flatten(), y=array(rst[0][:,1]).flatten(), c='pink', s=80)

# 3) Plot the data points, colored according to their cluster
color = ['red', 'blue', 'green', 'yellow']
for i in range(k):
    smc = datMat[nonzero(rst[1][:,0].A==i)[0]]  # sub data set of each cluster
    ax.scatter(x=array(smc[:,0]).flatten(), y=array(smc[:,1]).flatten(), c=color[i], s=30)
plt.show()

Supplement:
How to make a Scatter Plot using matplotlib in Python
numpy.ndarray.flatten: Return a copy of the array collapsed into one dimension.
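For example:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
flat = a.flatten()  # a one-dimensional copy: [1 2 3 4]
```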
