程式扎記

內容來自課程這裡

Preface:
給定一些資料點，利用它們之間的距離關係（或相似程度），將類似的點聚集在一起（成為一群）

由於並不需額外指定「每個點屬於哪一類」，所以分群 (Clustering) 的的方法屬於「非監督式學習」(unsupervised learning) 法:

k-Means Clustering Algorithm:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

• Input: D (a set of documents), k (# of desired clusters)
• Output: C (set of clusters)
• Algorithm:

Example: k-Means Clustering
考慮有 Input: P = {(1, 1), (2, 2), (3, 5), (4, 4)}, 並設定 k=2:

The 1'st Iteration
* Randomly choose (1, 1), (2, 2) as the single points of initial clusters #1, #2
* Assign points in P to the closest cluster:

* Calculate new cluster means:

The 2'nd Iteration
* Re-assign points in P to the closest cluster

* Calculate new means:

The 3'nd Iteration
* Re-assign points in P to the closest cluster

* Calculate new means:

* Meet convergence criteria ﬂ DONE.

Since the cluster means (or clusters) do not change in this iteration. Cluster means before and after the 3'nd iteration: {(1.5, 1.5), (3.5, 4.5)}

Toolkit Usage:
這個演算法的實作可以使用工具 "KMeans.groovy" 進行 k-Means Clustering. 使用方法如下:

view plaincopy to clipboardprint?
// 1) Prepare data  
def datas = [[1, 1], [2, 2], [3, 5], [4, 4]]  
  
// 2) Run k-Means clustering algorithm  
KMeans km = new KMeans()  
JTuple rt = km.kMeans(datas, 2, [[1, 1], [2, 2]]);  
  
// 3) Print clustering result  
def clusterAssment = rt.get(0)  // [Cluster Num, Distance]  
def centroids = rt.get(1)       // [key:ClusterNum,Value:Centroid]   
printf "\t[Info] Cluster Result:\n"  
clusterAssment.eachWithIndex{ v, i->  
    printf "\t\t%s is assigned to Cluster%d (D=%.02f; C=%s)...\n", datas[i],   
                                                                   v[0],   
                                                                   v[1],   
                                                                   centroids[v[0]]  
}  

執行結果如下:

[Info] KMeans with K=2...
[Info] Randomly pickup 2 centroid at:
[1, 1]
[2, 2]
***** Round 1 *****
[1, 1] to Centroid0 has dist=0.000...
[1, 1] to Centroid1 has dist=1.414...
Assign [1, 1] to Centroid0...
[2, 2] to Centroid0 has dist=1.414...
[2, 2] to Centroid1 has dist=0.000...
Assign [2, 2] to Centroid1...
[3, 5] to Centroid0 has dist=4.472...
[3, 5] to Centroid1 has dist=3.162...
Assign [3, 5] to Centroid1...
[4, 4] to Centroid0 has dist=4.243...
[4, 4] to Centroid1 has dist=2.828...
Assign [4, 4] to Centroid1...
***** Recalculate centroids *****
...
[Info] Done! 139 ms
[Info] Cluster Result:
[1, 1] is assigned to Cluster0 (Dist=0.71; Centroid=[1.5, 1.5])...
[2, 2] is assigned to Cluster0 (Dist=0.71; Centroid=[1.5, 1.5])...
[3, 5] is assigned to Cluster1 (Dist=0.71; Centroid=[3.5, 4.5])...
[4, 4] is assigned to Cluster1 (Dist=0.71; Centroid=[3.5, 4.5])...

Supplement:
* Unsupervised learning : The k-means clustering algorithm (1)

先前我們學的 Supervised learning - Classification 透過設定的相似度函式將一個個的 instance 進行分類, 而接下來的內容 Unsupervised learning 一樣是要進行分類, 不同的是我們不知道有幾類? 怎麼分? 而這樣的應用可以幫助我們迅速將相似的 instances group 到相同 cluster 中; 不相似的 instances 則 group 到另一個 cluster...

* Unsupervised learning : The k-means clustering algorithm (2)

在前面介紹 k-means clustering algorithm 時, 眼尖的你應該會發現 cluster 的數目是由 k 決定. 問題是我們怎麼知道 k 值是什麼? 另外使用 k-means clustering algorithm 有個問題就是它得到的是 local optimal, 也就是每次跑的結果可能都不一樣 (因為 centroid 是 random 決定) ...

This message was edited 29 times. Last update was at 12/06/2014 10:02:26

程式扎記

標籤

2014年6月11日星期三

[ML in DH] Clustering 之 k-Means Clustering Algorithm

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2014年6月11日 星期三