2014年6月11日 星期三

[ML in DH] Clustering 之 k-Means Clustering Algorithm

內容來自課程 這裡 
Preface:
給定一些資料點,利用它們之間的距離關係(或相似程度),將類似的點聚集在一起(成為一群


由於並不需額外指定「每個點屬於哪一類」,所以 分群 (Clustering) 的的方法屬於「非監督式學習」(unsupervised learning) 法:


k-Means Clustering Algorithm:
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

• Input: D (a set of documents), k (# of desired clusters)
• Output: C (set of clusters)
• Algorithm:


Example: k-Means Clustering
考慮有 Input: P = {(1, 1), (2, 2), (3, 5), (4, 4)}, 並設定 k=2:


The 1'st Iteration
* Randomly choose (1, 1), (2, 2) as the single points of initial clusters #1, #2
* Assign points in P to the closest cluster:

* Calculate new cluster means:




The 2'nd Iteration
* Re-assign points in P to the closest cluster

* Calculate new means:




The 3'nd Iteration
* Re-assign points in P to the closest cluster

* Calculate new means:

* Meet convergence criteria fl DONE.
Since the cluster means (or clusters) do not change in this iteration. Cluster means before and after the 3'nd iteration: {(1.5, 1.5), (3.5, 4.5)}

Toolkit Usage: 
這個演算法的實作可以使用工具 "KMeans.groovy" 進行 k-Means Clustering. 使用方法如下:
  1. // 1) Prepare data  
  2. def datas = [[11], [22], [35], [44]]  
  3.   
  4. // 2) Run k-Means clustering algorithm  
  5. KMeans km = new KMeans()  
  6. JTuple rt = km.kMeans(datas, 2, [[11], [22]]);  
  7.   
  8. // 3) Print clustering result  
  9. def clusterAssment = rt.get(0)  // [Cluster Num, Distance]  
  10. def centroids = rt.get(1)       // [key:ClusterNum,Value:Centroid]   
  11. printf "\t[Info] Cluster Result:\n"  
  12. clusterAssment.eachWithIndex{ v, i->  
  13.     printf "\t\t%s is assigned to Cluster%d (D=%.02f; C=%s)...\n", datas[i],   
  14.                                                                    v[0],   
  15.                                                                    v[1],   
  16.                                                                    centroids[v[0]]  
  17. }  
執行結果如下:
[Info] KMeans with K=2...
[Info] Randomly pickup 2 centroid at:
[1, 1]
[2, 2]
***** Round 1 *****
[1, 1] to Centroid0 has dist=0.000...
[1, 1] to Centroid1 has dist=1.414...
Assign [1, 1] to Centroid0...
[2, 2] to Centroid0 has dist=1.414...
[2, 2] to Centroid1 has dist=0.000...
Assign [2, 2] to Centroid1...
[3, 5] to Centroid0 has dist=4.472...
[3, 5] to Centroid1 has dist=3.162...
Assign [3, 5] to Centroid1...
[4, 4] to Centroid0 has dist=4.243...
[4, 4] to Centroid1 has dist=2.828...
Assign [4, 4] to Centroid1...
***** Recalculate centroids *****
...
[Info] Done! 139 ms
[Info] Cluster Result:
[1, 1] is assigned to Cluster0 (Dist=0.71; Centroid=[1.5, 1.5])...
[2, 2] is assigned to Cluster0 (Dist=0.71; Centroid=[1.5, 1.5])...
[3, 5] is assigned to Cluster1 (Dist=0.71; Centroid=[3.5, 4.5])...
[4, 4] is assigned to Cluster1 (Dist=0.71; Centroid=[3.5, 4.5])...

Supplement:
Unsupervised learning : The k-means clustering algorithm (1)
先前我們學的 Supervised learning - Classification 透過設定的相似度函式將一個個的 instance 進行分類, 而接下來的內容 Unsupervised learning 一樣是要進行分類, 不同的是我們不知道有幾類? 怎麼分? 而這樣的應用可以幫助我們迅速將相似的 instances group 到相同 cluster 中; 不相似的 instances 則 group 到另一個 cluster...

Unsupervised learning : The k-means clustering algorithm (2)
在前面介紹 k-means clustering algorithm 時, 眼尖的你應該會發現 cluster 的數目是由 k 決定. 問題是我們怎麼知道 k 值是什麼? 另外使用 k-means clustering algorithm 有個問題就是它得到的是 local optimal, 也就是每次跑的結果可能都不一樣 (因為 centroid 是 random 決定) ...

This message was edited 29 times. Last update was at 12/06/2014 10:02:26

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

  Source From  Here 方案1: // x -----删除忽略文件已经对 git 来说不识别的文件 // d -----删除未被添加到 git 的路径中的文件 // f -----强制运行 #   git clean -d -fx 方案2: 今天在服务器上  gi...