When the k-means clustering algorithm was introduced earlier, the sharp-eyed reader will have noticed that the number of clusters is fixed by k. The question is: how do we know what k should be? Another problem with the k-means clustering algorithm is that it only finds a local optimum, so each run may give a different result (because the initial centroids are chosen at random). The figure below shows a completed run of k-means clustering, yet just by eye you can tell it is not the best possible result, the global optimum:
One common way to measure how good a clustering output is is called SSE (sum of squared errors): for each cluster, sum the squared distances from every instance to that cluster's centroid. The smaller this value, the tighter (denser) the cluster, and the better the clustering. So would using more clusters (a larger k) simply make every cluster denser and the result better? In practice you cannot have it both ways: you always trade cluster tightness against the number of clusters, because the more clusters you have, the smaller the separation between them becomes, and some groups you expect to tell apart can no longer be distinguished.
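Written out, for clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

This is exactly the quantity the code below keeps track of: one squared distance per instance, stored in the second column of clusterAssment.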
To fix the kind of problem shown in the figure above, here is a variant of the k-means clustering approach. It sets a threshold: whenever a cluster's SSE exceeds that threshold (meaning the cluster is too loose), all the instances belonging to that cluster are run through k-means again, this time with k=2. The number of clusters necessarily grows this way, so some scheme is then needed to merge clusters back together. A rough sketch of the splitting part is shown below.
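This is a hypothetical sketch only (not code from the original post): the helper name splitLooseClusters is made up, the merge step is left out, and it assumes the kMeans() helper and the clusterAssment layout used later in this post (column 0: cluster index, column 1: squared error).

# Hypothetical sketch of the threshold idea above; not part of the original kMeans.py.
from numpy import *
import kMeans   # assumed to provide kMeans() and distEclud(), as in the rest of this post

def splitLooseClusters(dataSet, clusterAssment, numClusters, threshold):
    splits = []
    for i in range(numClusters):
        idx = nonzero(clusterAssment[:, 0].A == i)[0]   # rows belonging to cluster i
        sse = sum(clusterAssment[idx, 1])               # SSE of cluster i
        if sse > threshold:                             # cluster i is too loose: split it in two
            cents, subAssment = kMeans.kMeans(dataSet[idx, :], 2)
            splits.append((i, cents, subAssment))
    # Merging clusters back together (to keep the total number of clusters fixed) is not shown here.
    return splits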
Bisecting k-means:
To avoid the poor clustering result caused by the local optimum mentioned above, here is another method, bisecting k-means: it starts with a single cluster, splits it into 2 clusters, and then keeps picking a cluster to split, with the goal of lowering the global SSE, until the desired number of clusters is reached. Below is the pseudocode of this algorithm:
- Start with all the points in one cluster
- While the number of clusters is less than k:
    - for every cluster:
        - measure total error
        - perform k-means clustering with k=2 on the given cluster
        - measure total error after k-means has split the cluster in two
    - choose the cluster split that gives the lowest error and commit this split
# Add this function to kMeans.py: it uses the kMeans() and distEclud() functions defined there.
def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))          # column 0: cluster index, column 1: squared error
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]                       # start with one centroid: the mean of the whole data set
    for j in range(m):                           # initial error: squared distance of every point to that centroid
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :])**2
    while (len(centList) < k):
        lowestSSE = inf
        for i in range(len(centList)):           # try splitting every cluster, keep the split with the smallest total SSE
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]  # points currently in cluster i
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])  # SSE of cluster i after splitting it in two
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])  # SSE of all the other clusters
            print "sseSplit, and notSplit: ", sseSplit, sseNotSplit
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)    # the new cluster takes the next free index
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit  # the other half keeps the old index
        print 'the bestCentToSplit is: ', bestCentToSplit
        print 'the len of bestClustAss is: ', len(bestClustAss)
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]  # replace the split centroid with the first new centroid
        centList.append(bestNewCents[1, :].tolist()[0])             # and append the second new centroid
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss  # write back new cluster indices and SSE
    return mat(centList), clusterAssment
You can then use the code above as follows:
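A minimal sketch of such a run (assuming, as in the plotting script further down, that kMeans.py provides loadDataSet() and the biKmeans() above, and that testSet2.txt is in the working directory):

>>> from numpy import *
>>> import kMeans
>>> datMat3 = mat(kMeans.loadDataSet('testSet2.txt'))
>>> centList, clusterAssment = kMeans.biKmeans(datMat3, 3)
>>> centList              # one centroid per cluster
>>> clusterAssment[:5]    # column 0: cluster index, column 1: squared error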
You can also use the following code to plot how the test data set is distributed:
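A sketch of what that could look like (hypothetical code, not the original snippet; it only assumes loadDataSet() from kMeans.py and matplotlib, as in the script below):

from numpy import *
import matplotlib.pyplot as plt
import kMeans

# Plot every instance before clustering to see how the test data set is spread out.
datMat3 = mat(kMeans.loadDataSet('testSet2.txt'))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x=array(datMat3[:, 0]).flatten(), y=array(datMat3[:, 1]).flatten(), s=30)
plt.show()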
Or you can use the following code to plot the centroids together with the instances of each cluster, one color per cluster:
#-*- coding: utf-8 -*-
from numpy import *
import matplotlib.pyplot as plt
import kMeans

# 1) Run the bisecting k-means clustering algorithm
k = 3  # How many clusters?
datMat3 = mat(kMeans.loadDataSet('testSet2.txt'))
rst = kMeans.biKmeans(datMat3, k)  # rst[0]: centroids, rst[1]: cluster assignment and squared error

# 2) Plot the centroids
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x=array(rst[0][:, 0]).flatten(), y=array(rst[0][:, 1]).flatten(), c='pink', s=80)

# 3) Plot the instances, with a different color for each cluster
color = ['red', 'blue', 'green', 'yellow']
for i in range(k):
    smc = datMat3[nonzero(rst[1][:, 0].A == i)[0]]  # sub data set of cluster i
    ax.scatter(x=array(smc[:, 0]).flatten(), y=array(smc[:, 1]).flatten(), c=color[i], s=30)
plt.show()