程式扎記: [ ML In Action ] Unsupervised learning : The k-means clustering algorithm (2)

2013年4月8日星期一

[ ML In Action ] Unsupervised learning : The k-means clustering algorithm (2)

Preface:
在前面介紹 k-means clustering algorithm 時, 眼尖的你應該會發現 cluster 的數目是由 k 決定. 問題是我們怎麼知道 k 值是什麼? 另外使用 k-means clustering algorithm 有個問題就是它得到的是 local optimal, 也就是每次跑的結果可能都不一樣 (因為 centroid 是 random 決定) . 下面的圖是一個已經完成 k-means clustering algorithm 的結果, 但是透過你肉眼就可以發現它並非最佳的結果 global optimal :

而有一種可以量測你 clustering output 的結果好壞稱為 SSE (sum of squared error), 也就是計算每個 cluster 的每一個 instance 到該 cluster centroid 的距離的和. 照理說該值如果越小, 說明該 cluster 的密度越高, 也就越好. 那這樣說如果我取多一點 cluster (k 值越大), 不就密度越高越好嗎? 事實上魚與熊掌不可兼得, 你總是需要在密度與 cluster 數目間做取捨. 因為 cluster 越多則 cluster 間的區別度便會變小, 而造成無法將某些預期的結果區分開來.

為了解決上圖的問題, 我們要介紹另一種 k-means clustering 的方法. 它會設一個 threshold, 當某個 cluster 的 SSE 大於這個 threshold 時 (說明該 cluster 密度過大), 則將屬於該 cluster 的所有 instance 再跑一次 k-means, 但是這次設定 k=2. 如此 cluster 的數量必定會增加, 因次就需要一些做法來 merging clustering.

Bisecting k-means:
為了解決上面提到的應為 local optimal 造成 poor 的 clustering result, 這邊提出另一個方法 bisecting k-means: 它一開始使用一個 cluster, 接著 split 成 2 個 clusters; 再接著根據以降低 global SSE 值為目標挑選 cluster 繼續 split 直到達到預期的 cluster 數目. 底下是該 algorithm 的 pseudo code:

view plaincopy to clipboardprint?
Start with all the points in one cluster  
While the number of clusters is less than k  
    for every cluster  
        measure total error  
        perform k-means clustering with k=2 on the given cluster  
        measure total error after k-means has split the cluster in two  
    choose the cluster split that gives the lowest error and commit this split  

接著底下是實作的代碼:

view plaincopy to clipboardprint?
def biKmeans(dataSet, k, distMeas=distEclud):  
    m = shape(dataSet)[0]  
    clusterAssment = mat(zeros((m,2)))  
    centroid0 = mean(dataSet, axis=0).tolist()[0]  
    centList =[centroid0] #create a list with one centroid  
    for j in range(m): # calc initial Error  
        clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2  
    while (len(centList) < k):  
        lowestSSE = inf  
        for i in range(len(centList)): # Search the best group to do k-means with best/smallest SSE.  
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:] #get the data points currently in cluster i  
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)  
            sseSplit = sum(splitClustAss[:,1]) # compare the SSE to the SSE of currrent split   
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1]) # compare the SSE of instances except ones in cluster i  
            print "sseSplit, and notSplit: ",sseSplit,sseNotSplit  
            if (sseSplit + sseNotSplit) < lowestSSE:  
                bestCentToSplit = i  
                bestNewCents = centroidMat  
                bestClustAss = splitClustAss.copy()  
                lowestSSE = sseSplit + sseNotSplit  
        bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) # Increase the cluster index  
        bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit  
        print 'the bestCentToSplit is: ',bestCentToSplit  
        print 'the len of bestClustAss is: ', len(bestClustAss)  
        centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0] # replace a centroid with two best centroids   
        centList.append(bestNewCents[1,:].tolist()[0])  
        clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss #reassign new clusters, and SSE  
    return mat(centList), clusterAssment  

Experiment:
接著你可以如下來使用上面的代碼:

>>> import kMeans # 剛剛的代碼存放在 kMeans.py, 這邊載入 kMeans module
>>> from numpy import * # 載入 numpy 模組
>>> datMat3 = mat(kMeans.loadDataSet('testSet2.txt')) # 載入 test data set
>>> centList, myNewAssments=kMeans.biKmeans(datMat3, 3) # 執行 bisecting k-means, 切割成 3 個 clusters
>>> centList # 列印出 3 個 cluster 的 centroid
matrix([[-0.45965615, -2.7782156 ],
[ 2.93386365, 3.12782785],
[-2.94737575, 3.3263781 ]])

接著你可以使用下面代碼顯示 test data set 的分佈:

>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(x=array(datMat3[:,0]).flatten(), y=array(datMat3[:,1]).flatten())

>>> plt.show()

或者你可以使用下面代碼將 Centroid 與每個 Cluster 的 instance 以不同顏色標示出來:

view plaincopy to clipboardprint?
#-*- coding: utf-8 -*-      
from numpy import *    
import matplotlib.pyplot as plt    
import kMeans    
    
# 1) Execute the k-means cluster algorithm    
k = 3 # How many clusters?    
datMat3 = mat(kMeans.loadDataSet('testSet2.txt'))    
rst = kMeans.biKmeans(datMat3,k)    
    
# 2) Output the centroid    
fig = plt.figure()    
ax = fig.add_subplot(111)    
ax.scatter(x=array(rst[0][:,0]).flatten(), y=array(rst[0][:,1]).flatten(), c='pink', s=80)    
    
# 3) Output the dataset with different according to its cluster.    
color = ['red', 'blue', 'green', 'yellow']    
for i in range(k):    
    smc = datMat3[nonzero(rst[1][:,0].A==i)[0]] # sub data set of each cluster.    
    ax.scatter(x=array(smc[:,0]).flatten(), y=array(smc[:,1]).flatten(), c=color[i], s=30)    
plt.show()   

執行結果:

程式扎記

標籤

2013年4月8日星期一

[ ML In Action ] Unsupervised learning : The k-means clustering algorithm (2)

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2013年4月8日 星期一

[ ML In Action ] Unsupervised learning : The k-means clustering algorithm (2)

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

2013年4月8日星期一