In the supervised-learning classification methods we covered earlier, we assigned each instance to a known class using a similarity function we defined. Unsupervised learning, which we turn to next, also groups instances, but this time we do not know in advance how many groups there are or how they should be split. This kind of technique lets us quickly group similar instances into the same cluster and dissimilar instances into different clusters. Even more useful, we can learn which features connect instances that at first appear unrelated. The first algorithm we will introduce is k-means clustering.
The k-means clustering algorithm:
As usual, let's first look at how this algorithm works.
As the name suggests, k-means finds k clusters in a dataset, and its working principle is quite intuitive. First we choose k points as the starting centroids (often at random within the bounds of the data; you can loosely think of a centroid as a cluster's center point). Then every instance in the dataset is assigned to the closest, or most similar, centroid and joins the cluster formed around that centroid. Next, each cluster's centroid is recomputed as the mean of all instances in that cluster, and the assignment step is repeated with the new centroids until no instance changes its cluster. The pseudocode for the whole algorithm is as follows:
- Create k points for starting centroids (often randomly)
- While any point has changed cluster assignment
  - for every point in our dataset:
    - for every centroid
      - calculate the distance between the centroid and the point
    - assign the point to the cluster with the lowest distance
  - for every cluster calculate the mean of the points in that cluster
    - assign the centroid to the mean
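To make the assignment and update steps concrete, here is a minimal one-iteration sketch on a tiny made-up 2-D dataset (the points, the initial centroids, and the use of numpy.linalg.norm are purely for illustration; the full implementation comes below):

```python
import numpy as np

# toy data: four 2-D points that visually form two pairs
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
# two made-up starting centroids
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# assignment step: give each point the index of its closest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignment = dists.argmin(axis=1)

# update step: move each centroid to the mean of its assigned points
for c in range(len(centroids)):
    centroids[c] = points[assignment == c].mean(axis=0)

print(assignment)   # [0 0 1 1]
print(centroids)    # [[1.25 1.5 ], [8.5  8.75]]
```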
As preparation, let's first write the support functions that k-means will use:
Listing 10.1 k-means support functions

```python
from numpy import *

def loadDataSet(fileName):
    """General function to parse a tab-delimited file of floats."""
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))  # map all elements to float
        dataMat.append(fltLine)
    return dataMat

def distEclud(vecA, vecB):
    """Euclidean distance between two vectors (same as la.norm(vecA - vecB))."""
    return sqrt(sum(power(vecA - vecB, 2)))

def randCent(dataSet, k):
    """Create k random centroids, each within the bounds of every dimension."""
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # k x n centroid matrix
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids
```
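As a quick sanity check of these helpers, a sketch like the following can be used (assuming the functions above are saved as kMeans.py and that the testSet.txt sample file used later is in the working directory):

```python
from numpy import *
import kMeans

datMat = mat(kMeans.loadDataSet('testSet.txt'))
print(min(datMat[:, 0]), max(datMat[:, 0]))    # value range of the first dimension
print(kMeans.randCent(datMat, 2))              # two random centroids inside those bounds
print(kMeans.distEclud(datMat[0], datMat[1]))  # distance between the first two instances
```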
With the groundwork above in place (loading the dataset, generating random centroids, and computing the distance between two instances), we can now implement the k-means algorithm itself:
Listing 10.2 The k-means clustering algorithm

```python
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]  # number of instances
    # clusterAssment records, for every instance, the index of the cluster it
    # belongs to and the squared distance (SE) to that cluster's centroid.
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each instance to the closest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex: clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        print(centroids)  # print the centroids each iteration so we can watch them move
        for cent in range(k):  # recalculate the centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # all points in this cluster
            centroids[cent, :] = mean(ptsInClust, axis=0)  # move the centroid to the mean
    return centroids, clusterAssment
```
The line `ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]` selects every instance currently assigned to cluster `cent`, and `centroids[cent,:] = mean(ptsInClust, axis=0)` then recomputes the centroid of the `cent`-th cluster. This line is easy to follow once you know how numpy.mean works:
Experiment:
Now that the code is ready, let's look at how the data is distributed before we cluster it. After loading the data (datMat), the code below produces a scatter plot of the raw distribution:
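A minimal sketch of that plotting code might look like this (it reuses the same loadDataSet and scatter-plot calls as the full experiment script below):

```python
#-*- coding: utf-8 -*-
from numpy import *
import matplotlib.pyplot as plt
import kMeans

# load the raw data and plot every instance in a single color
datMat = mat(kMeans.loadDataSet('testSet.txt'))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x=array(datMat[:, 0]).flatten(), y=array(datMat[:, 1]).flatten(), s=30)
plt.show()
```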
From the plot you can see that the instances are concentrated in roughly four regions. Next, let's write the code that runs the clustering and plots the result:
```python
#-*- coding: utf-8 -*-
from numpy import *
import matplotlib.pyplot as plt
import kMeans

# 1) Execute the k-means clustering algorithm
k = 4  # how many clusters?
datMat = mat(kMeans.loadDataSet('testSet.txt'))
rst = kMeans.kMeans(datMat, k)

# 2) Plot the centroids
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x=array(rst[0][:, 0]).flatten(), y=array(rst[0][:, 1]).flatten(), c='pink', s=80)

# 3) Plot the instances of each cluster in a different color
color = ['red', 'blue', 'green', 'yellow']
for i in range(k):
    smc = datMat[nonzero(rst[1][:, 0].A == i)[0]]  # sub data set of cluster i
    ax.scatter(x=array(smc[:, 0]).flatten(), y=array(smc[:, 1]).flatten(), c=color[i], s=30)
plt.show()
```
Supplement:
* How to make a Scatter Plot using matplotlib in Python
* numpy.ndarray.flatten: Return a copy of the array collapsed into one dimension.