程式扎記: [ML in DH] Clustering 之 Basic Agglomerative Clustering

標籤

2014年6月12日 星期四

[ML in DH] Clustering 之 Basic Agglomerative Clustering

內容來自課程 這裡 
Preface: 
想法:一開始,將每個點都視為一個群集。接下來,則找出最相近的兩個群集,將其合併為一個新群集。重覆以上的合併動作,直到結束.(Example for Agglomerative Clustering): 
• Input: D = {d1, d2, ..., dn}
• Output: DE // dendrogram as a set of clustering results in each step

Algorithm: 
 

Example: 
考慮 4 個點: P = {(1, 1), (2, 2), (3, 5), (4, 4)}, k=2. 在此,我們將兩個群集的距離,定義為它們重心的距離: 
 

底下為套用 Basic Agglomerative Clustering 演算法的過程: 
 

幾種 Clusters 的距離測度方式: 
兩個點的距離很容易定義,但如何定義兩個群集的距離 但如何定義兩個群集的距離 但如何定義兩個群集的距離 但如何定義兩個群集的距離? 
* Single Link 
Smallest distance between an element in one cluster and an element in the other. Mathematically, the linkage function – the distance D(X,Y) between clusters X and Y – is described by the expression:

* Complete Link 
Largest distance between an element in one cluster and an element in the other. Mathematically, the complete linkage function:

* Average Link 
Average distance between an element in one cluster and an element in the other. Mathematically, the average linkage function:

* Centroid (center of cluster) 
If clusters have a representative centroid, then the cluster distance is defined to be the distance between their centroids.

* Medoid (a centrally located point in the cluster) 
Distance between the cluster medoids. Medoids are representative objects of a data set or a cluster with a data set whose average dissimilarity to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always members of the data set.

Toolkit Usage: 
這個演算法可以使用工具 "BAgg.groovy" 來實現, 使用範例如下: 
  1. // 1) Prepare data  
  2. def datas = [[11], [22], [31], [24],  
  3.              [11,4], [12,4], [13,4], [14,4],  
  4.              [2,14], [3,13], [3,14],  
  5.              [12,13], [13,12], [13,13], [14,14]]  
  6.   
  7. // 2) Run Basic Agglomerative Clustering algorithm  
  8. printf "\t[Info] Running Basic Agglomerative Clustering...\n"  
  9. BAgg ba = new BAgg()  
  10. def rt = ba.run(datas,                  /*Input data vector list*/   
  11.                 4,                      /*How many clusters being output*/    
  12.                 ba.completeLinkDistClu  /*Choose Complete Linkage*/  
  13.                 )   
  14. printf "\n"  
  15.   
  16. // 3) Output Cluster Result  
  17. printf "\t[Info] Output Clustering Result:\n"  
  18. def K = rt[0]  
  19. def I = rt[1]  
  20. K.eachWithIndex{ v, p->  
  21.     printf "\t\tC%d: [%s]\n", p, I[p].sort().join(",")  
  22.     v.each{  
  23.         printf "\t\t\t[%s]\n", it.join(",")  
  24.     }  
  25. }  
執行結果: 

沒有留言:

張貼留言

網誌存檔

關於我自己

我的相片
Where there is a will, there is a way!