## Friday, July 13, 2012

### [ ML In Action ] Classifying with k-Nearest Neighbors

Preface :

k-Nearest Neighbors
Pros : High accuracy, insensitive to outliers, no assumptions about data
Cons : Computationally expensive, requires a lot of memory
Works with: Numeric values, nominal values

Prepare - importing data with Python :

- createDataSet() :

```python
from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
```

- Putting the kNN classification algorithm into action

# inX is the input point to classify ; dataset is the set of already-classified points
For every point in our dataset:
----calculate the distance between inX and the current point
----sort the distance in increasing order
----take k items with lowest distance to inX
----find the majority class among these items
----return the majority class as our prediction for the class of inX

- Listing 2.1 k-Nearest Neighbors algorithm

```python
def classify0(inX, dataSet, labels, k):
    """
    - Arguments
      * inX : the input vector to classify
      * dataSet : the matrix of already-classified training examples
      * labels : the class labels of the training examples
      * k : the k value for the kNN algorithm
    - Reference
      * array API : http://www.hjcb.nl/python/Arrays.html
      * sorted() : http://docs.python.org/library/functions.html#sorted
      * Coding issue : http://puremonkey2010.blogspot.tw/2011/12/python-tutorial-2-using-python.html
    """
    dataSetSize = dataSet.shape[0]

    # 1) Distance calculation
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # tile() repeats inX into a dataSetSize x len(inX) matrix
    sqDiffMat = diffMat ** 2                 # square every element
    sqDistances = sqDiffMat.sum(axis=1)      # sum the elements of each row
    distances = sqDistances ** 0.5           # take the square root
    sortedDistIndices = distances.argsort()  # indices that sort the distances in increasing order

    # 2) Voting with lowest k distances
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1),
                              reverse=True)  # reverse=True sorts in descending order
    return sortedClassCount[0][0]
```

>>> import kNN
>>> group, labels = kNN.createDataSet() # load the test data
>>> kNN.classify0([0,0], group, labels, 3) # classify the point [0,0] with k = 3
'B'
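For reference, NumPy broadcasting can stand in for the `tile()` call in classify0. The sketch below is an equivalent, shorter variant of the same algorithm (`classify0_bc` is just an illustrative name, not from the book):

```python
import numpy as np

def classify0_bc(inX, dataSet, labels, k):
    # dataSet - inX broadcasts inX across every row, replacing tile()
    diff = dataSet - np.asarray(inX)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # vote among the k nearest neighbors
    votes = {}
    for idx in distances.argsort()[:k]:
        votes[labels[idx]] = votes.get(labels[idx], 0) + 1
    return max(votes, key=votes.get)

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0_bc([0, 0], group, labels, 3))  # 'B'
```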

- How to test a classifier

Example - improving matches from a dating site with kNN :

* Number of frequent flyer miles earned per year
* Percentage of time spent playing video games
* Liters of ice cream consumed per week

* People she didn't like
* People she liked in small doses
* People she liked in large doses

Example: using kNN on results from a dating site
1. Collect: Text file provided
2. Prepare: Parse a text file in Python
3. Analyze: Use Matplotlib to make 2D plots of our data
4. Train: Doesn't apply to kNN algorithm
5. Test: Write a function to use some portion of the given data as test examples. The test examples are classified against the non-test examples. If the predicted class doesn't match the real class, we'll count that as an error.
6. Use: Build a simple command-line program other people can use to predict how a given person would be labeled.

- Prepare: parsing data from a text file

- Listing 2.2 :

```python
def file2matrix(filename):
    """ Parse a text record into a NumPy matrix and a label vector """
    fr = open(filename)
    arrayOLines = fr.readlines()
    fr.close()
    numberOfLines = len(arrayOLines)       # 1) Get number of lines in file
    returnMat = zeros((numberOfLines, 3))  # 2) Create NumPy matrix to return
    classLabelVector = []
    index = 0
    for line in arrayOLines:               # 3) Parse each line into a list
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
```
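As a quick self-check, the same parsing logic can be exercised on a small synthetic file, so it can be verified without the full datingTestSet2.txt. A condensed sketch (`parse_dating_file` is an illustrative name; the two rows come from the sample data shown below):

```python
import os
import tempfile
import numpy as np

def parse_dating_file(filename):
    """Condensed variant of file2matrix: 3 feature columns + 1 label column."""
    with open(filename) as fr:
        rows = [ln.strip().split('\t') for ln in fr if ln.strip()]
    mat = np.zeros((len(rows), 3))
    labels = []
    for i, fields in enumerate(rows):
        mat[i, :] = [float(x) for x in fields[0:3]]  # explicit float conversion
        labels.append(int(fields[-1]))
    return mat, labels

# write two tab-separated sample rows to a temporary file
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write("29928\t3.596002\t1.485952\t1\n")
    f.write("9734\t2.755530\t1.420655\t2\n")

mat, labels = parse_dating_file(path)
print(mat.shape, labels)  # (2, 3) [1, 2]
```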

- datingTestSet2.txt :
29928 3.596002 1.485952 1
9734 2.755530 1.420655 2
7344 7.780991 0.513048 2
7387 0.093705 0.391834 2
33957 8.481567 0.520078 3
9936 3.865584 0.110062 2
...(omitted)...

- Analyze: creating scatter plots with Matplotlib

>>> import kNN
>>> datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2])
>>> plt.show()

Fig 2.3 Dating data without class labels. From this plot it's difficult to discern which dot belongs to which group.

>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0*array(datingLabels), 15.0*array(datingLabels))

Fig 2.4 It is easier to identify the different classes, but it is difficult to draw conclusions from looking at this data.

- Prepare: normalizing numeric values

newValue = (oldValue - min) / (max - min)

- Listing 2.3

```python
def autoNorm(dataSet):
    """
    - min() : http://www.hjcb.nl/python/Arrays.html#.min
    - max() : http://www.hjcb.nl/python/Arrays.html#.max
    - tile() : http://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
    - shape() : http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html
    - zeros() : http://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html
    """
    minVals = dataSet.min(0)             # column-wise minimums, returned as a vector
    maxVals = dataSet.max(0)             # column-wise maximums, returned as a vector
    ranges = maxVals - minVals           # (max - min) = range of each column
    normDataSet = zeros(shape(dataSet))  # build the matrix to return
    m = dataSet.shape[0]                 # number of records in the data set
    normDataSet = normDataSet = dataSet - tile(minVals, (m, 1))  # subtract each column's minimum from every record
    normDataSet = normDataSet / tile(ranges, (m, 1))             # normalize ; element-wise division
    return normDataSet, ranges, minVals
```
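The tile()-based arithmetic in autoNorm can be checked by hand on a tiny matrix; the expected result follows directly from newValue = (oldValue - min) / (max - min). A minimal sketch using broadcasting, which performs the same column-wise operation as the tile() calls:

```python
import numpy as np

data = np.array([[0.0, 10.0],
                 [5.0, 20.0],
                 [10.0, 30.0]])

minVals = data.min(0)             # column minimums: [0., 10.]
ranges = data.max(0) - minVals    # column ranges:   [10., 20.]
norm = (data - minVals) / ranges  # broadcasting does what tile() does in autoNorm

print(norm)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```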

- Test: testing the classifier as a whole program

- Listing 2.4

```python
def datingClassTest(hoRatio=0.1, k=3):
    """ Classifier testing code for the dating site """
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)  # hold out the first hoRatio portion of the data for testing
    errorCount = 0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:], datingLabels[numTestVecs:m], k)
        print("\t[Info] The classifier came back with: {0}, the real answer is: {1}.".format(classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("\t[Info] The total error rate is {0:.2f}.".format(errorCount / float(numTestVecs)))
```

>>> kNN.datingClassTest()
[Info] The classifier came back with: 3, the real answer is: 3.
[Info] The classifier came back with: 2, the real answer is: 2.
[Info] The classifier came back with: 1, the real answer is: 1.
...(omitted)...
[Info] The classifier came back with: 3, the real answer is: 1.
[Info] The total error rate is 0.05.

- Use: putting together a useful system

- Listing 2.5

```python
def classifyPerson(k=3):
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(raw_input("\t> Percentage of time spent playing video games? "))
    ffMiles = float(raw_input("\t> Frequent flier miles earned per year? "))
    iceCream = float(raw_input("\t> Liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # the new vector must be scaled with the same minVals/ranges as the training data
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, k)
    print("\t[Info] You will probably like this person: {0}!".format(resultList[classifierResult - 1]))
```
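Listing 2.5 depends on interactive input and the data file, so its key step is worth isolating: the new input vector must be rescaled with the *same* minimums and ranges that autoNorm computed from the training data, otherwise the raw feature magnitudes would swamp the distance calculation. A minimal non-interactive sketch, where the training statistics and the sample person are hypothetical values:

```python
import numpy as np

# hypothetical training statistics, as autoNorm would return them
minVals = np.array([0.0, 0.0, 0.0])
ranges = np.array([92000.0, 21.0, 1.7])

# hypothetical person: flyer miles, % gaming time, liters of ice cream
inArr = np.array([40000.0, 10.0, 0.5])

# apply the same scaling the training set received before calling classify0
scaled = (inArr - minVals) / ranges
print(scaled)
```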
