程式扎記: [ ML In Action ] Classifying with k-Nearest Neighbors

Preface :
這裡我們要討論第一個 machine-learning 的演算法: k-Nearest Neighbors. 它使用 distance-measurement 個概念來 classify items. 而透過代碼一步步的說明, 藉以了解實際運作的原理與演算法. 而底下是該演算法的 Pros & Cons :

k-Nearest Neighbors
Pros : High accuracy, insensitive to outliers, no assumptions about data
Cons : Computationally expensive, requires a lot of memory
Works with: Numeric values, nominal values

而在開始實際的範例前, 我們先來了解一下 kNN 的理念. 假設我們要對電影進行分類: action movie or romance movie. 可能我們會以電影中 kiss 與 action 的多寡來斷定某部電影是屬於 action movie 或者是 romance movie. 但是問題來了, romance movie 中除了 kiss 的場合也有可能有 action 的片段 ; 同樣的 action movie 中也可能有 kiss 的情節發生. 因此我們不能藉由只發生 kiss 或是 action 來判斷某個電影是屬於哪一類. 而我們可以做的是統計已知電影類別中 kiss 與 action 的分布, 然後將之標示在座標上. 因此當有未知電影要進行分類, 便將之標記於座標上, 看到底是離 romance movie 的 group 近一點 ; 還是離 action movie 的 group 近一點來辨別該未知電影的類別 :

而 kNN 第一步便是需要決定 k 值 (通常為奇數), 再拿上面的 "綠點" 來做說明. 決定 k 值後, 接著會從上面的座標中找出 k 個做接近 "綠點" 的座標 ; 接著由這些座標多數的類別決定 "綠點" 的類別, 假設 k 值為 3 , 經過比對後我們可以知道 "綠點" 是屬於 Romance movie :

Prepare - importing data with Python :
在使用 Machine learning 的第一步, 我們要準備的就是 data. 在這裡我們建立了 kNN.py 做為後續我們使用 kNN algorithm 的模組. 第一要我們要寫的是可以返回測試資料的函數 :
- createDataSet() :

view plaincopy to clipboardprint?
from numpy import *  
import operator  
  
def createDataSet():  
        group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  
        labels = ['A', 'A', 'B', 'B']  
        return group, labels  

如此我們便能如下取回測試使用的 data :

接著你便可以把測試使用的 data 撰寫於上面的函數體中, 並透過呼叫該函數取回測試用的 data.

- Putting the kNN classification algorithm into action
底下是 kNN 演算法的 pseudo code :

# inX 為待測點 ; dataset 為已經 classified 的點集合
For every point in our dataset:
----calculate the distance between inX and the current point
----sort the distance in increasing order
----take k items with lowest distance to inX
----find the majority class among these items
----return the majority class as our prediction for the class of inX

實作的代碼如下 :
- Listing 2.1 k-Nearest Neighbors algorithm

view plaincopy to clipboardprint?
def classify0(inX, dataSet, labels, k):  
        """  
        - Argument  
          * inX : 待測點  
          * dataSet : 已經分類過的點集合.  
          * labels : Class 的 labels  
          * k : kNN 演算法的 k 值.  
        - Reference  
          * array API : http://www.hjcb.nl/python/Arrays.html  
          * sorted() : http://docs.python.org/library/functions.html#sorted  
          * Coding issue : http://puremonkey2010.blogspot.tw/2011/12/python-tutorial-2-using-python.html  
        """  
        dataSetSize = dataSet.shape[0]  
  
        # 1) Distance calculation  
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet # tile() build matrix with dimension : dataSetSize X 1  
        sqDiffMat = diffMat ** 2 # 將每個元素平方  
        sqDistances = sqDiffMat.sum(axis=1) # 將 matrix 裡的元素取合  
        distances = sqDistances ** 0.5 # 取平方根  
        sortedDistIndices = distances.argsort() # 返回 sorting 後的 index array (increasing order)  
  
        # 2) Voting with lowest k distances  
        classCount = {}  
        for i in range(k):  
                voteIlabel = labels[sortedDistIndices[i]]  
                classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  
        sortedClassCount = sorted(classCount.iteritems(),  
                                  key = operator.itemgetter(1),  
                                  reverse=True)  # reverse=True 指使用降序  
        return sortedClassCount[0][0]  

前面介紹我們知道 kNN 是使用 distance 來決定點與點間的相似程度, 上面的公式距離的計算是使用 Euclidian distance - 考慮有點 p, q. 則距離函數 d(p, q) 等於 :

接下來我們便可以使用測試的 dataset 來測試剛剛寫的函數 classify0() :

>>> import kNN
>>> group, labels = kNN.createDataSet() # 取回測試資料
>>> kNN.classify0([0,0], group, labels, 3) # 對點 [0,0] 進行 classify ; 取 kNN 的 k 值為 3
'B'

- How to test a classifier
在 kNN 演算法中, 透過取出最近的 k 個點來投票未知點的類別. 因此標記可能是錯誤, 因此我們需要對該演算法進行測試或是 evaluation, 來確定結果裡我們預期的不會太遠. 而使用的方法便是從我們的 data set (已經分類好) 取出一部分來當作 test set, 這樣我們可以使用 test set 透過 classify 後取得的 class 與 ground truth 即已知的 class 進行比對, 並計算出 error rate = error classified count / total test set size. 當 error rate 低於某個程度, 我們會說該 classify 的 accuracy 是可接受的.

Example - improving matches from a dating site with kNN :
這邊我們要來處理看起來比較像現實上會遇到的問題. 考慮有個交友網站, 提供的 data set 有包含 features 如 :

* Number of frequent flyer miles earned per year
* Percentage of time spent playing video games
* Liters of ice cream consumed per week

而標示的 Labels 可能為 :

* People she didn't like
* People she liked in small does
* People she liked in large does

因此我們的目的便是透過上面的 feature 來判斷某個 People 被喜歡的程度. 底下為該 example 更為詳細的說明 :

Example: using kNN on results from a dating site
1. Collect: Text file provided
2. Prepare: Parse a text file in Python
3. Analyze: Use Matplotlib to make 2D plots of our data
4. Train: Doesn't apply to kNN algorithm
5. Test: Write a function to use some portion of the given data as test examples. The test examples are classified against the non-test examples. If the predicted class doesn't match the real class, we'll count that as an error.
6. Use: Build a simple command-line program other people can use to predict what ever he/she is labelled.

- Prepare: parsing data from a text file
因為 data set 數據共有 1000 筆, 我們不可能一筆筆自己輸入, 因此已經存在檔案中. 所以下面的函數便是從檔案中還原接下來要使用的 data set :
- Listing 2.2 :

view plaincopy to clipboardprint?
def file2matrix(filename):  
        """ Text record to Numpy parsing code"""  
        fr = open(filename)  
        numberOfLines = len(fr.readlines())  # 1) Get number of lines in file  
        returnMat = zeros((numberOfLines, 3)) # 2) Create Numpy matrix to return  
        classLabelVector = []  
        fr = open(filename)  
        index = 0  
        for line in fr.readlines():  # 3) Parse line to a list  
                line = line.strip()  
                listFromLine = line.split('\t');  
                returnMat[index,:] = listFromLine[0:3]  
                classLabelVector.append(int(listFromLine[-1]))  
                #classLabelVector.append(listFromLine[-1])  
                index += 1  
        fr.close()  
        return returnMat,classLabelVector  

而使用的 data set 存放檔案的內容如下, 第一欄為 Number of frequent flyer miles earned per year ; 第二欄為 Percentage of time spent playing video games ; 第三欄為 Liters of ice cream consumed per week, 最後一欄代表 Class :
- datingTestSet2.txt :

29928 3.596002 1.485952 1
9734 2.755530 1.420655 2
7344 7.780991 0.513048 2
7387 0.093705 0.391834 2
33957 8.481567 0.520078 3
9936 3.865584 0.110062 2
...(略)...

因此接著你便能如下從檔案中取出 data set :

- Analyze: creating scatter plots with Matplotlib
接著我們要初步的來分析數據, 但從文字介面來看會有點痛苦, 因此我們藉由 Matplotlib 將數據實際用坐標軸視覺化 :

>>> import kNN
>>> datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2])
>>> plt.show()

接著你應該能看到下圖 :

Fig 2.3 Dating data without class labels. From this plot it's difficult to discern which dot belongs to which group.

從上圖很難看出 grouping 的概念, 因此我們將 Class label 加入到圖表藉以改變點的顏色與大小, 執行代碼如下 :

>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0*array(datingLabels), 15.0*array(datingLabels))

則圖表應該會如下顯示出較有意義的 grouping 趨勢 :

Fig 2.4 It is easier to identify the different classes, but it is difficult to draw conclusion from looking at this data.

- Prepare: normalizing numeric values
從原始數據應該可以發現在 feature: Frequent Flyier Miles Earned Per Year. 上的數值比其他 features 的數值大很多, 這樣在計算 distance 時會造成很大的 biased, 而造成只由某一個 feature 去 dominate 整個 kNN 的結果而不是我們預期的每個 feature 都扮演著同樣重要的角色. 因此我們需要對原始數據進行正規化 (Normalizing) 的動作. 而其概念很簡單, 就是從每一個 feature 中取出最大值 max ; 最小值 min. 再透過下面公式計算出每一筆 feature 的新值 newValue :

newValue = (oldValue - min) / (max - min)

經過正規畫的新數值, 將會介於 0 到 1 間. 我們為此撰寫了新函數 autoNorm() :
- Listing 2.3

view plaincopy to clipboardprint?
def autoNorm(dataSet):  
        """  
        - min() : http://www.hjcb.nl/python/Arrays.html#.min  
        - max() : http://www.hjcb.nl/python/Arrays.html#.max  
        - tile() : http://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html  
        - shape() : http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html  
        - zeros() : http://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html  
        """  
        minVals = dataSet.min(0) # 取每個 element, 每個 column 最小的值並以 matrix 回傳.  
        maxVals = dataSet.max(0) # 取每個 element, 每個 column 最大的值並以 matrix 回傳.  
        ranges = maxVals - minVals # 取出 (max-min) = range  
        normDataSet = zeros(shape(dataSet)) # 建立回傳的 matrix  
        m = dataSet.shape[0] # 取出 data set 的筆數  
        normDataSet = dataSet - tile(minVals, (m,1)) # 將每個 element 的每個 column 減掉對應該 column 的 minimum value  
        normDataSet = normDataSet/tile(ranges, (m,1)) # 做 Normailize ; Element-wise division  
        return normDataSet, ranges, minVals  

這樣我們便能如下對原始數據進行正規化動作 :

- Test: testing the classifier as a whole program
接著我們要來寫 code 測我們的 kNN 演算法, 透過下面函數 datingClassTest(hoRatio=0.1, k=3) 我們預設使用 10% 的 data set 當作 test set ; 並取 kNN 的 k 值為 3 :
- Listing 2.4

view plaincopy to clipboardprint?
def datingClassTest(hoRatio=0.1, k=3):  
        """ Classifier testing code for dating site """  
        datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  
        normMat, ranges, minVals = autoNorm(datingDataMat)  
        m = normMat.shape[0]  
        numTestVecs = int(m*hoRatio)  
        errorCount = 0  
        for i in range(numTestVecs):  
                classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:], datingLabels[numTestVecs:m], k)  
                print("\t[Info] The classifier came back with: {0}, the real answery is: {1}.".format(classifierResult, datingLabels[i]))  
                if classifierResult != datingLabels[i]: errorCount+=1.0  
        print("\t[Info] The total error rate is {0:02}.".format(errorCount/float(numTestVecs)))  

接著執行上面函數可以發現 error rate 約為 5% :

>>> kNN.datingClassTest()
[Info] The classifier came back with: 3, the real answery is: 3.
[Info] The classifier came back with: 2, the real answery is: 2.
[Info] The classifier came back with: 1, the real answery is: 1.
...(略)...
[Info] The classifier came back with: 3, the real answery is: 1.
[Info] The total error rate is 0.05.

實際應用你可以依據專案的需求定義可接受最高的 error rate ; 通常 error rate 過高可能是 sampling 不夠或是使用的 data set 不適合使用 kNN 來進行 classify, 這時你可以考慮使用別種的 Machine learning algorithm.

- Use: putting together a useful system
最後通常使用我們系統的一般使用者是不會寫 code 的, 因此我們需要將之轉為可以從命令列或提供 UI 讓使用者可以方便的使用我們的系統. 這邊我們提供一個函數 classifyPerson() 讓使用者可以透過 Console 的互動輸入 data 後, 由我們的系統對之進行 classify 並將結果印於 Console 中 :
- Listing 2.5

view plaincopy to clipboardprint?
def classifyPerson(k=3):  
        resultList = ['not at all', 'in small does', 'in large does']  
        percentTats = float(raw_input("\t> Percentage of time spent playing video games? "))  
        ffMiles = float(raw_input("\t> Frequent flier miles earned per year? "))  
        iceCream = float(raw_input("\t> Liters of ice cream consumed per year? "))  
        datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  
        normMat, ranges, minVals = autoNorm(datingDataMat)  
        inArr = array([ffMiles, percentTats, iceCream])  
        classifierResult = classify0(inArr, normMat, datingLabels, k)  
        print("\t[Info] You will probably like this person: {0}!".format(resultList[classifierResult-1]))  

執行範例如下 :

完整的代碼與測試資料. 都可以在這裡下載 (為 Machine Learning In Action 第二章的內容)

程式扎記

標籤

2012年7月13日星期五

[ ML In Action ] Classifying with k-Nearest Neighbors

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2012年7月13日 星期五