程式扎記: [ ML In Action ] Unsupervised learning : The k-means clustering algorithm (1)

Preface:
先前我們學的 Supervised learning - Classification 透過設定的相似度函式將一個個的 instance 進行分類, 而接下來的內容 Unsupervised learning 一樣是要進行分類, 不同的是我們不知道有幾類? 怎麼分? 而這樣的應用可以幫助我們迅速將相似的 instances group 到相同 cluster 中; 不相似的 instances 則 group 到另一個 cluster. 更有幫助的是我們可以學到是什麼樣的特徵讓這些原先看似沒有關係的 instances 可以找到關聯的條件. 而第一個我們要介紹的 k-means Clustering algorithm.

The k-means clustering algorithm:
一樣先來看看這個 Algorithm 的特性:

Pros: Easy to implement
Cons: Can converge at local minima; slow on very large datasets
Works with: Numeric values

顧名思義 k-means 就是要從 dataset 找出 k 個 clusters, 而其工作原理也相當直覺. 首先我們從 dataset 找出 k 個 instances 當作 centroid (姑且你可以把它當作中心點), 接著將 dataset 所有的 instances 依據距離最近或是相似度最接近的 centroid 並加入由該 centroid 形成的 cluster. 接著重新計算出屬於該 cluster 的 centroid (計算該 cluster 所有 instances 得出的 means), 並以新的 centroid 重新進行 cluster 的動作. 整個演算法的 pseudo code 如下:

view plaincopy to clipboardprint?
Create k points for starting centroids (often randomly)  
While any point has changed cluster assignment  
    for every point in our dataset:  
        for every centroid  
           calculate the distance between the centroid and point  
        assign the point to the cluster with the lowest distance   
    for every cluster calculate the mean of the points in that cluster  
        assign the centroid to the mean  

底下為進行 k-means clustering 的準備與注意事項:

1. Collect: Any method.
2. Prepare: Numeric values are needed for a distance calculation, and nominal values can be mapped into binary values for distance calculations.
3. Analyze: Any method.
4. Train: Doesn’t apply to unsupervised learning.
5. Test: Apply the clustering algorithm and inspect the results. Quantitative error measurements such as sum of squared error (introduced later) can be used.
6. Use: Anything you wish. Often, the clusters centers can be treated as representative data of the whole cluster to make decisions.

接著前置作業我們就來準備 k-means 會使用到的函數:
- Listing 10.1 k-means support functions

view plaincopy to clipboardprint?
from numpy import *  
  
def loadDataSet(fileName):      #general function to parse tab -delimited floats  
    dataMat = []                #assume last column is target value  
    fr = open(fileName)  
    for line in fr.readlines():  
        curLine = line.strip().split('\t')  
        fltLine = map(float,curLine) #map all elements to float()  
        dataMat.append(fltLine)  
    return dataMat  
  
def distEclud(vecA, vecB):  
    return sqrt(sum(power(vecA - vecB, 2))) #la.norm(vecA-vecB)  
  
def randCent(dataSet, k):  
    n = shape(dataSet)[1]  
    centroids = mat(zeros((k,n)))#create centroid mat  
    for j in range(n):#create random cluster centers, within bounds of each dimension  
        minJ = min(dataSet[:,j])   
        rangeJ = float(max(dataSet[:,j]) - minJ)  
        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))  
    return centroids  

接著來看看如何使用這些函數:

有了上面的基礎工程(載入 dataset, 產生 random centroids 與計算兩個 instances 間的距離), 接著我們便可以來實作 k-means algorithm:
- Listing 10.2 The k-means clustering algorithm

view plaincopy to clipboardprint?
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):  
    m = shape(dataSet)[0]  # instance count  
    # 紀錄每個 instance 屬於的 cluster 與 距離該 cluster centroid 的距離.  
    clusterAssment = mat(zeros((m,2)))#create mat to assign data points   
                                      #to a centroid, also holds SE of each point  
    centroids = createCent(dataSet, k)  
    clusterChanged = True  
    while clusterChanged:  
        clusterChanged = False  
        for i in range(m):#for each data point assign it to the closest centroid  
            minDist = inf; minIndex = -1  
            for j in range(k):  
                distJI = distMeas(centroids[j,:],dataSet[i,:])  
                if distJI < minDist:  
                    minDist = distJI; minIndex = j  
            if clusterAssment[i,0] != minIndex: clusterChanged = True  
            clusterAssment[i,:] = minIndex,minDist**2  
        print centroids  
        for cent in range(k):#recalculate centroids  
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster  
            centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean   
    return centroids, clusterAssment  

這邊有兩行程式不容易懂, 首先來看:

view plaincopy to clipboardprint?
ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]  

這行程式是要取出所有屬於某個 cluster 的所有 instances. 要懂這行程式需要分開來看裡面的每一個步驟, 來看一個範例:

而 "centroids[cent,:] = mean(ptsInClust, axis=0)" 這行是要重新計算第 'cent' 個 cluster 的 centroid, 這行只要你知道 numpy.mean 的用法就會懂:

>>> a = np.array([[1, 2], [3, 4]])
>>> np.mean(a) # 計算 (1,2,3,4) 的平均值
2.5
>>> np.mean(a, axis=0) # 計算 (1,3)=2 與 (2,4)=3 的平均值 = (2, 3)
array([ 2., 3.])
>>> np.mean(a, axis=1) # 計算 (1,2)=1.5 與 (3,4)=3.5 的平均值 = (1.5, 3.5)
array([ 1.5, 3.5])

Experiment:
先在程式已經 Ready, 在處理資料前我們先來看看資料的分布, 在載入資料後 (datMat), 使用下面代碼產生資料分布圖:

>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datMat[:,0], datMat[:,1])
>>> plt.show()

可以看出來 instance 有集中分布到某四個區域, 接著我們直接來寫代碼並輸出 clustering 的結果吧:

view plaincopy to clipboardprint?
#-*- coding: utf-8 -*-    
from numpy import *  
import matplotlib.pyplot as plt  
import kMeans  
  
# 1) Execute the k-means cluster algorithm  
k = 4 # How many clusters?  
datMat = mat(kMeans.loadDataSet('testSet.txt'))  
rst = kMeans.kMeans(datMat,k)  
  
# 2) Output the centroid  
fig = plt.figure()  
ax = fig.add_subplot(111)  
ax.scatter(x=array(rst[0][:,0]).flatten(), y=array(rst[0][:,1]).flatten(), c='pink', s=80)  
  
# 3) Output the dataset with different according to its cluster.  
color = ['red', 'blue', 'green', 'yellow']  
for i in range(k):  
    smc = datMat[nonzero(rst[1][:,0].A==i)[0]] # sub data set of each cluster.  
    ax.scatter(x=array(smc[:,0]).flatten(), y=array(smc[:,1]).flatten(), c=color[i], s=30)  
plt.show()  

輸出分布如下, 粉紅色的點是每個 cluster 的 centroid; 其他不同顏色的點代表是屬於各自的 cluster 的 instances:

Supplement:
* How to make a Scatter Plot using matplotlib in Python
* numpy.ndarray.flatten: Return a copy of the array collapsed into one dimension.

程式扎記

標籤

2013年3月7日星期四

[ ML In Action ] Unsupervised learning : The k-means clustering algorithm (1)

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2013年3月7日 星期四