Preface:
Represent each document as a "vector", then define a "distance" between vectors. For example, we can use the Euclidean distance between two document vectors to measure how similar the documents are (the smaller the distance, the more similar). Let the predefined set of categories be C = {C1, C2, ..., Cn}, and suppose we are given a collection of already-classified documents (the training set):
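To make this concrete, here is a minimal sketch (in Java; the class name and the term-frequency vectors are hypothetical, assuming documents are already encoded as numeric feature vectors) of the Euclidean distance between two document vectors:

```java
// Euclidean distance between two document vectors (illustrative sketch).
public class EuclideanDistance {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] d1 = {1.0, 0.0, 2.0};  // hypothetical term-frequency vectors
        double[] d2 = {1.0, 1.0, 0.0};
        // sqrt(0 + 1 + 4) = sqrt(5)
        System.out.println(distance(d1, d2));
    }
}
```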
Simple Distance-Based Algorithm:
We must first define distance(Ci, d). For example, if the m documents t1, ..., tm in the training set are pre-classified into Ci, we can let center(Ci) be the centroid of the m vectors t1, ..., tm, and then define distance(Ci, d) as the distance from d to center(Ci). Pseudocode of the algorithm follows:
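The centroid-based distance(Ci, d) just described amounts to a nearest-centroid classifier, which can be sketched as follows (a minimal Java sketch with hypothetical names, not the pseudocode from the original post):

```java
import java.util.*;

// Nearest-centroid classifier sketch: classify d into the class whose centroid is closest.
public class CentroidClassifier {
    // center(Ci): the mean of the class's training vectors t1, ..., tm.
    static double[] center(List<double[]> docs) {
        double[] c = new double[docs.get(0).length];
        for (double[] d : docs)
            for (int i = 0; i < c.length; i++) c[i] += d[i];
        for (int i = 0; i < c.length; i++) c[i] /= docs.size();
        return c;
    }

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // classByDocs maps each class label Ci to its pre-classified training documents.
    static String classify(Map<String, List<double[]>> classByDocs, double[] d) {
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, List<double[]>> e : classByDocs.entrySet()) {
            double dd = dist(center(e.getValue()), d);
            if (dd < bestDist) { bestDist = dd; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<double[]>> training = new HashMap<>();
        training.put("C1", Arrays.asList(new double[]{0, 0}, new double[]{0, 2}));
        training.put("C2", Arrays.asList(new double[]{10, 10}));
        // (0, 1) sits exactly on C1's centroid, so it is classified as C1.
        System.out.println(classify(training, new double[]{0, 1}));
    }
}
```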
k-Nearest Neighbors:
Next, the following graphical example illustrates the difference between kNN and the Simple Distance-Based approach:
Example: 3-NN
Suppose we define the distance function as dist(x, y) = |age-level difference| + |income-level difference| + |student-flag difference| + |credit-rating difference|
Given a new record describing some person, we want to predict whether that person will buy a computer:
Let xi denote the record with ID i. Then the distance dist(d, xi) is:
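With the integer encoding used in the Groovy example below (Age: <=30 → 1, 31-40 → 2, >40 → 3; Income: High → 1, Medium → 2, Low → 3; Student: Y → 1, N → 2; CC: G → 1, O → 2), this level-difference distance is simply a Manhattan (L1) distance on the encoded vectors. A minimal sketch with a hypothetical class name:

```java
public class LevelDistance {
    // dist(x, y) = sum of absolute level differences (Manhattan distance).
    static int dist(int[] x, int[] y) {
        int d = 0;
        for (int i = 0; i < x.length; i++) d += Math.abs(x[i] - y[i]);
        return d;
    }

    public static void main(String[] args) {
        int[] d  = {2, 1, 2, 1};  // query: Age 31-40, Income High, Student N, CC G
        int[] x1 = {1, 1, 2, 2};  // training record 1: <=30, High, N, O
        // |2-1| + |1-1| + |2-2| + |1-2| = 2
        System.out.println(dist(d, x1));
    }
}
```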
Toolkit Usage:
I implemented the kNN algorithm in "KNN.groovy"; a usage example follows:
def labels = ["Age", "Income", "Student?", "CC", "Class"]
def datas = [["<=30", "High", "N", "O", "N"],
             ["<=30", "High", "N", "G", "N"],
             ["31-40", "High", "N", "O", "Y"],
             [">40", "Medium", "N", "O", "Y"],
             [">40", "Low", "Y", "O", "Y"],
             [">40", "Low", "Y", "G", "N"],
             ["31-40", "Low", "Y", "G", "Y"],
             ["<=30", "Medium", "N", "O", "N"],
             ["<=30", "Low", "Y", "O", "Y"],
             [">40", "Medium", "Y", "O", "Y"],
             ["<=30", "Medium", "Y", "G", "Y"],
             ["31-40", "Medium", "N", "G", "Y"],
             ["31-40", "High", "Y", "O", "Y"],
             [">40", "Medium", "N", "G", "N"]]
// Collect each instance's category (the last column)
def cateList = datas.collect { it[4] }
printf "\t[Info] kNN with k=3:\n"
KNN knn = new KNN()
// Encode each categorical attribute as an integer level
def kNNData = datas.collect { d ->
    def inst = []
    switch(d[0]) {               // Age: <=30 -> 1, 31-40 -> 2, >40 -> 3
        case "<=30":  inst.add(1); break
        case "31-40": inst.add(2); break
        case ">40":   inst.add(3)
    }
    switch(d[1]) {               // Income: High -> 1, Medium -> 2, Low -> 3
        case "High":   inst.add(1); break
        case "Medium": inst.add(2); break
        case "Low":    inst.add(3)
    }
    switch(d[2]) {               // Student?: Y -> 1, N -> 2
        case "Y": inst.add(1); break
        case "N": inst.add(2)
    }
    switch(d[3]) {               // Credit rating (CC): G -> 1, O -> 2
        case "G": inst.add(1); break
        case "O": inst.add(2)
    }
    return inst
}
def testInst = [2, 1, 2, 1]
printf "\t[Info] For KNN, [%s] is classified as %s\n", testInst.join(","), knn.classify1(testInst, kNNData, cateList, 3)
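KNN.groovy itself is not reproduced in this post, so the classify1 call above can be understood from the following minimal sketch of what a kNN classifier does (a Java sketch with hypothetical names, not the actual KNN.groovy implementation): rank all training instances by distance to the query, then take a majority vote among the k nearest.

```java
import java.util.*;

public class KnnSketch {
    // Manhattan distance on encoded attribute levels, as in the example above.
    static int dist(int[] a, int[] b) {
        int s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }

    // Classify inst by majority vote among its k nearest training instances.
    static String classify(int[] inst, List<int[]> data, List<String> cates, int k) {
        // Sort training indices by distance to the query instance.
        Integer[] idx = new Integer[data.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingInt(i -> dist(inst, data.get(i))));
        // Count category votes among the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++)
            votes.merge(cates.get(idx[i]), 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<int[]> data = Arrays.asList(
            new int[]{1, 1}, new int[]{1, 2}, new int[]{5, 5}, new int[]{5, 6});
        List<String> cates = Arrays.asList("N", "N", "Y", "Y");
        // Two of the three nearest neighbors of (1, 1) are labeled "N".
        System.out.println(classify(new int[]{1, 1}, data, cates, 3));
    }
}
```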
Supplement:
* [ ML In Action ] Classifying with k-Nearest Neighbors