## Sunday, July 22, 2012

### [ ML In Action ] Logistic Regression

Preface:

Finding the best fit is similar to regression, and in this method it's how we train our classifier. We'll use optimization algorithms to find these best-fit parameters. This best-fit stuff is where the name regression comes from.

General approach to logistic regression
1. Collect: Any method
2. Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
3. Analyze: Any method
4. Train: We will spend most of the time training, where we try to find optimal coefficients to classify our data.
5. Test: Classification is quick and easy once the training step is done.
6. Use: This application needs to get some input data and output structured numeric values. Next, the application applies a simple regression calculation to this input data and determines which class the input data belongs to.

Classification with logistic regression:

Logistic regression
Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret.
Cons: Prone to underfitting, may have low accuracy
Works with: Numeric values, nominal values.
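Logistic regression feeds a weighted sum of the input features through the sigmoid function and reads the result as a class probability: above 0.5 means class 1, below means class 0. A minimal sketch of the sigmoid's behavior (the probe values are arbitrary):

```python
import numpy as np

def sigmoid(in_x):
    # Logistic function: squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-in_x))

print(sigmoid(0.0))     # 0.5 -- exactly on the decision boundary
print(sigmoid(10.0))    # ~0.99995 -- saturates toward 1 for large positive inputs
print(sigmoid(-10.0))   # ~0.00005 -- saturates toward 0 for large negative inputs
```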

1. If the error is negative (expected value < computed value), the weight is reduced by the product of this difference and the chosen α; the weight shrinks, so the next sigmoid() output is smaller (closer to the expected value) and the process gradually converges.
2. If the error is positive (expected value > computed value), the weight is increased by the product of this difference and the chosen α; the weight grows, so the next sigmoid() output is larger (closer to the expected value) and the process gradually converges.
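The two cases above can be checked with a single hand-rolled update step; a minimal sketch where the sample x, the labels, and alpha are made-up values:

```python
import numpy as np

def sigmoid(in_x):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-in_x))

alpha = 0.5
x = np.array([1.0, 2.0, 3.0])     # one sample; x[0] is the constant 1.0 term
w = np.ones(3)                    # initial weights
h = sigmoid(np.dot(x, w))         # computed value, ~0.998 for this sample

# Case 1: expected value 0 < computed value -> negative error -> every weight shrinks
w_down = w + alpha * (0 - h) * x
print(all(w_down < w))            # True

# Case 2: expected value 1 > computed value -> positive error -> every weight grows
w_up = w + alpha * (1 - h) * x
print(all(w_up > w))              # True
```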

- Train: Using gradient ascent to find the best parameters

- Listing 5.1 (logRegres.py)

```python
# -*- coding: utf-8 -*-
from numpy import *

def plotDataSet(dataArr, labelArr):
    import matplotlib.pyplot as plt
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    dataSize = len(dataArr)
    for i in range(dataSize):
        if labelArr[i] == 1:
            xcord1.append(dataArr[i][1]); ycord1.append(dataArr[i][2])
        else:
            xcord2.append(dataArr[i][1]); ycord2.append(dataArr[i][2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    plt.show()

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0/(1+exp(-inX))
```

>>> import logRegres
>>> dataArr, labelMat = logRegres.loadDataSet()
>>> logRegres.plotDataSet(dataArr, labelMat)

```python
def gradAscent(dataMatIn, classLabels, alpha=0.001, maxCycles=500):
    dataMatrix = mat(dataMatIn)             # convert the list to a NumPy matrix
    labelMat = mat(classLabels).transpose() # convert the class-label list to a column matrix
    m,n = shape(dataMatrix)                 # m = data set size; n = number of features
    weights = ones((n,1))                   # initialize the weight matrix to all 1s
    for k in range(maxCycles):              # run the gradient ascent updates
        h = sigmoid(dataMatrix*weights)     # compute t = w0x0 + w1x1 + ... + wnxn, feed t into sigmoid
        error = (labelMat - h)              # error: expected 1,0 - computed 1,0 = 0;
                                            # expected 1 - computed 0 = 1 (grow weight);
                                            # expected 0 - computed 1 = -1 (shrink weight)
        weights = weights + alpha * dataMatrix.transpose() * error  # recompute the weight matrix
    return weights
```

>>> import logRegres
>>> dataArr, labelMat = logRegres.loadDataSet()
>>> weights = logRegres.gradAscent(dataArr, labelMat)
>>> weights
matrix([[ 4.12414349],
        [ 0.48007329],
        [-0.6168482 ]])
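The same update loop can be exercised without testSet.txt on a tiny synthetic set; this is a sketch, with made-up data and a larger alpha so a few hundred cycles suffice:

```python
import numpy as np

def sigmoid(in_x):
    return 1.0 / (1.0 + np.exp(-in_x))

def grad_ascent(data, labels, alpha=0.001, max_cycles=500):
    # Batch gradient ascent, same structure as gradAscent in the listing
    data_mat = np.mat(data)
    label_mat = np.mat(labels).transpose()
    m, n = np.shape(data_mat)
    weights = np.ones((n, 1))
    for _ in range(max_cycles):
        h = sigmoid(data_mat * weights)
        error = label_mat - h
        weights = weights + alpha * data_mat.transpose() * error
    return weights

# Four linearly separable points: class 1 when x1 + x2 > 0 (column 0 is the constant term)
data = [[1.0, 2.0, 2.0], [1.0, 1.0, 2.0], [1.0, -2.0, -2.0], [1.0, -1.0, -2.0]]
labels = [1, 1, 0, 0]
w = grad_ascent(data, labels, alpha=0.1, max_cycles=1000)
preds = [1 if sigmoid(np.dot(row, w)) > 0.5 else 0 for row in data]
print(preds)   # should match labels
```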

The decision boundary is the line where w0 + w1*x1 + w2*x2 = 0; solving for x2 gives the line plotted by plotBestFit (x1 -> x axis; x2 -> y axis).

```python
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]   # points where the sigmoid input is 0
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
```

>>> weights.getA()
array([[ 4.12414349],
       [ 0.48007329],
       [-0.6168482 ]])

>>> logRegres.plotBestFit(weights.getA())

```python
def stocGradAscent0(dataMatrix, classLabels):
    m,n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)   # initialize to all ones
    for i in range(m):  # one update per sample, one pass over the data
        h = sigmoid(sum(dataMatrix[i]*weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights
```
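Unlike gradAscent, the stochastic version updates the weights one sample at a time, so it can consume data as a stream and works on plain 1-D arrays; a minimal sketch (the two stream samples are made up):

```python
import numpy as np

def sigmoid(in_x):
    return 1.0 / (1.0 + np.exp(-in_x))

def stoc_update(weights, x, label, alpha=0.01):
    # The body of stocGradAscent0's loop, factored out as a single-sample update
    h = sigmoid(np.sum(x * weights))
    return weights + alpha * (label - h) * x

weights = np.ones(3)              # same all-ones initialization as the listing
stream = [(np.array([1.0, 2.0, 2.0]), 1),
          (np.array([1.0, -2.0, -2.0]), 0)]
for x, label in stream:           # one pass, one update per sample
    weights = stoc_update(weights, x, label)
print(weights.shape)              # (3,) -- a plain array, no matrix algebra involved
```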

```python
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m,n = shape(dataMatrix)
    weights = ones(n)   # initialize to all ones
    for j in range(numIter):
        dataIndex = range(m)
        for i in range(m):
            alpha = 4/(1.0+j+i)+0.0001    # alpha decreases with iteration but never
                                          # reaches 0, because of the constant term
            randIndex = int(random.uniform(0,len(dataIndex)))  # pick a random sample
            h = sigmoid(sum(dataMatrix[randIndex]*weights))
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            del(dataIndex[randIndex])
    return weights
```
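Once weights are trained, classifying a new sample is a single sigmoid call; a minimal sketch (classify_vector and the weight values below are illustrative, not part of the listings above):

```python
import numpy as np

def sigmoid(in_x):
    return 1.0 / (1.0 + np.exp(-in_x))

def classify_vector(in_x, weights):
    # Probability of class 1; threshold at the sigmoid midpoint 0.5
    prob = sigmoid(np.sum(in_x * weights))
    return 1.0 if prob > 0.5 else 0.0

w = np.array([4.12, 0.48, -0.62])   # weights of the same shape gradAscent returns
print(classify_vector(np.array([1.0, 1.0, 10.0]), w))   # 0.0
print(classify_vector(np.array([1.0, 1.0, 1.0]), w))    # 1.0
```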
