程式扎記

Preface :
如果你要做一個重要的決定, 你可能會找人提供建議. 如果你找的人不只一個, 可能重要的人的意見會比其他人的意見對你的影響比重會大一點. 在 Machine learning 類似的概念稱之為 meta-algorithm : 它被使用來組合不同的 algorithms, 並將準確率較高的 algorithm 的權重調高於其它的 algorithm. 而在這邊我們要介紹的是 AdaBoost 演算法.

Classifiers using multiple samples of the dataset :

AdaBoost
Pros: Low generalization error, easy to code, works with most classifiers, no parameters to adjust.
Cons: Sensitive to outliers
Works with: Numeric values, nominal values

在這之前我們看過了一些 Classification 的演算法如 kNN, Decision tree, naive Bayes 與 Logistic regression. 這些演算法有各自的特點與優缺點, 如果要組合不同的演算法, 利用他們的結果來產出更準確的 Classification, 這個動作稱之為 ensemble methods or meta-algorithms.

- Building classifiers from randomly resampled data:bagging
Bootstrap aggregating 是 meta-algorithms 中的一個做法, wiki 上的簡單介紹如下 :

Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm to improve machine learning of statistical classification and regression models in terms of stability and classification accuracy. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree models, it can be used with any type of model. Bagging is a special case of the model averaging approach.

詳細的說明可以參考 wiki.

- Boosting
Boosting 與 bagging 相似, 不同的是它使用了 weighting, 在 iteration 的過程中, 上一輪較準確的 classifier 會被給定較高的 weighting. 底下為其書上的說明 :

Boosting is a technique similar to bagging. In boosting and bagging, you always use the same type of classifier. But in boosting, the different classifiers are trained sequentially. Each new classifier is trained based on the performance of those already trained. Boosting makes new classifiers focus on data that was previously misclassified by previous classifiers.

有需多種類的 Boosting , 這邊我們要介紹的是 AdaBoost :

General approach to AdaBoost
1. Collect: Any method.
2. Prepare: It depends on which type of weak learner you're going to use. In this chapter, we'll use decision stumps, which can take any type of data. You could use any classifier, so any of the classifiers from chapters would work. Simple classifiers work better for a weak learner.
3. Analyze: Any method.
4. Train: The majority of the time will be spent here. The classifier will train the weak learner multiple times over the same dataset.
5. Test: Calculate the error rate.
6. Use: Like support vector machines, AdaBoost predicts one of two classes. If you want to use it for classification involving more than two classes, then you'll need to apply some of the same methods as for support vector machines.

Improving the classifier by focusing on errors :
底下是 Boosting 的起因, 我們有可能透過一些 Weak classifer 來組合出一個 Strong classifer 嗎?

Boosting is based on the question posed by Kearns: can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

AdaBoost 是 adaptive boosting 的縮寫, 而它便是上述問題的解答之一. 它透過給定 weight vector D (一開始 vector 內值為相等) 到 training data 的每一筆 record 來給 decision tree stump 決定如何取出最佳的結果, 每次 Iteration 後利用重新計算 weight vector 來讓下一個 stump 建立可以 focus 在那些 error 的 records 上藉以找出組合出來的最佳解.

而在計算 weight vector D 有些參數定義如下 :

在每次 Iterator 後重新計算 weight vector D 時, 會降低 correct record 的 weighting ; 增加 error record 的 weighting :

而 α 除了用來計算 weight vector D 外, 它也是每次產生出來的 Weak classifier 的權重 :

Creating a weak learner with a decision stump :
這邊的 decision stump 是一個最簡單的 decision tree, 只有一個 split 意味只對一個 feature 進行 classify. 首先按照慣例我們撰寫函數如下來取回 test data set :
- Listing 7.0 of adaboost.py :

view plaincopy to clipboardprint?
# -*- coding: utf-8 -*-  
from numpy import *  
  
def plotDataSet(dataArr, labelArr):  
    import matplotlib.pyplot as plt  
    xcord1 = []; ycord1 = [];  
    xcord2 = []; ycord2 = [];  
    dataSize = len(dataArr)  
    for i in range(dataSize):  
        if labelArr[i] == 1:  
            xcord1.append(dataArr[i][0]); ycord1.append(dataArr[i][1])  
        else:  
            xcord2.append(dataArr[i][0]); ycord2.append(dataArr[i][1])  
    fig = plt.figure()  
    ax = fig.add_subplot(111)  
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  
    ax.scatter(xcord2, ycord2, s=30, c='green')  
    plt.show()  
  
def loadSimpData():  
    datMat = matrix([[ 1. ,  2.1],  
        [ 2. ,  1.1],  
        [ 1.3,  1. ],  
        [ 1. ,  1. ],  
        [ 2. ,  1. ]])  
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]  

接著你可以如下觀察 test data set 的分布 :

>>> import adaboost
>>> datMat, classLabels = adaboost.loadSimpData()
>>> adaboost.plotDataSet(datMat.A, classLabels)

data set 分布如下 :

底下我們要來撰寫建立 decision stump 的代碼, 第一個函式用來將 data set 使用建立好的 decision stump 來做 classification 並以 array 回傳. 第二個函數則為使用給定的 weight vector D 來找出最佳 (error 最小) 的 decision stump, 其 pseudo code 如下 :

底下是代碼的實作 :

view plaincopy to clipboardprint?
def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#just classify the data  
    """ 使用給定的 stump(dimen, threshVal, threshIneq) 對 data set=dataMatrix 進行 classification. """  
    retArray = ones((shape(dataMatrix)[0],1))  
    if threshIneq == 'lt':  
        retArray[dataMatrix[:,dimen] <= threshVal] = -1.0  
    else:  
        retArray[dataMatrix[:,dimen] > threshVal] = -1.0  
    return retArray  
  
  
def buildStump(dataArr,classLabels,D):  
    """ 根據 weight matrix D 建立最佳 stump."""  
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T  
    m,n = shape(dataMatrix) # m 為 data set size ; n 為 feature size.  
    numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))  
    minError = inf # init error sum, to +infinity  
    for i in range(n):#loop over all dimensions  
        rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max();  
        stepSize = (rangeMax-rangeMin)/numSteps  
        for j in range(-1,int(numSteps)+1):#loop over all range in current dimension  
            for inequal in ['lt', 'gt']: #go over less than and greater than  
                threshVal = (rangeMin + float(j) * stepSize)  
                predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)#call stump classify with i, j, lessThan  
                errArr = mat(ones((m,1)))  
                errArr[predictedVals == labelMat] = 0  
                weightedError = D.T*errArr  #calc total error multiplied by D  
                #print "split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError)  
                if weightedError < minError:  
                    minError = weightedError  
                    bestClasEst = predictedVals.copy()  
                    bestStump['dim'] = i  
                    bestStump['thresh'] = threshVal  
                    bestStump['ineq'] = inequal  
    return bestStump,minError,bestClasEst  

接著可以如下測試代碼 :

Implementing the full AdaBoost algorithm :
有了上面的前置準備, 接下來我們要來進行完整的 AdaBoost algorithm, 底下為其 Pseudo-code :

代碼實作如下 :
- Listing 7.2 of adaboost.py :

view plaincopy to clipboardprint?
def adaBoostTrainDS(dataArr,classLabels,numIt=40):  
    """ AdaBoost training with decision stumps. """  
    weakClassArr = []  
    m = shape(dataArr)[0]  
    D = mat(ones((m,1))/m)   #init D to all equal  
    aggClassEst = mat(zeros((m,1)))  
    for i in range(numIt):  
        bestStump,error,classEst = buildStump(dataArr,classLabels,D)#build Stump  
        #print "D:",D.T  
        alpha = float(0.5*log((1.0-error)/max(error,1e-16)))#calc alpha, throw in max(error,eps) to account for error=0  
        bestStump['alpha'] = alpha  
        weakClassArr.append(bestStump)                  #store Stump Params in Array  
        #print "classEst: ",classEst.T  
        expon = multiply(-1*alpha*mat(classLabels).T,classEst) #exponent for D calc, getting messy  
        D = multiply(D,exp(expon))                              #Calc New D for next iteration  
        D = D/D.sum()  
        #calc training error of all classifiers, if this is 0 quit for loop early (use break)  
        aggClassEst += alpha*classEst  
        #print "aggClassEst: ",aggClassEst.T  
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))  
        errorRate = aggErrors.sum()/m  
        print "total error: ",errorRate  
        if errorRate == 0.0: break  
    return weakClassArr, aggClassEst  

接著你可以如下取回 AdaBoost algorithm training 完後的結果 :

最後我們要做的就是撰寫函數, 傳入 data set 與 training 後的 classifiers 並回傳 classified 後的結果 :
- Listing 7.3 of adaboost.py :

view plaincopy to clipboardprint?
def adaClassify(datToClass,classifierArr):  
    dataMatrix = mat(datToClass)#do stuff similar to last aggClassEst in adaBoostTrainDS  
    m = shape(dataMatrix)[0]  
    aggClassEst = mat(zeros((m,1)))  
    for i in range(len(classifierArr)):  
        classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],\  
                                 classifierArr[i]['thresh'],\  
                                 classifierArr[i]['ineq'])#call stump classify  
        aggClassEst += classifierArr[i]['alpha']*classEst  
        print aggClassEst  
    return sign(aggClassEst)  

接著可以如下利用剛剛建立的 classifiers 測試 data 的 classification (假設-1 為類別 C1; +1為類別 C2):

Supplement :
* numpy.sign : Returns an element-wise indication of the sign of a number.
* numpy.multiply : Multiply arguments element-wise.

>>> a = mat([[1, 1, 1]])
>>> b = mat([[2, 3, 4]])
>>> c = multiply(a, b)
>>> c
matrix([[2, 3, 4]])

Preface :
以下為 Machine Learning in Action 第四章內容, 相關代碼可以在這裡下載 (Ch05).
這一章要提的 Machine Learning 是跟 Optimization algorithms 相關的 Logistic Regression. 而底下是有關這個演算法的描述 :

Finding the best fit is similar to regression, and in this method it's how we train our classifier. We'll use optimization algorithms to find these best-fit parameters. This best-fit stuff is where the name regression comes from.

簡單來說就是我們希望找出一條方程式構成的 hyperplane , 並透過它將標示於座標上不同 class 的 data point group 可以區隔開來. 底下為此 Classification 的簡介 :

General approach to logistic regression
1. Collect: Any method
2. Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
3. Analyze: Any method
4. Train: We will spend most of the time training, where we try to find optional coefficients to classify our data.
5. Test: Classification is quick and easy once the training step is done.
6. Use:
This application needs to get some input data and output structured numeric values. Next, the application applies the simple regress calculation on this input data and determines which class the input data should belong to.

Classification with logistic regression :
在開始了解 logistic regression 先要來知道一些常識. 底下為使用 logistic regression 的說明 :

Logistic regression
Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret.
Cons: Prone to underfitting, may have low accuracy
Works with: Numeric values, nominal values.

在 logistic regression 中, 我們透過一個函式稱為 sigmoid 來幫我們界定某個 data 的類別, 而該函式的長相如下 :

而上述方程式的 t 實際上為 logistic regression 中 features 值的合. 透過 sigmoid 計算出來的值如果大於 0.5 則歸類為類別 C1 ; 反之則歸類為類別 C2. 而 features 通常會用下面的方程式代表 :

而 x0 到 xn 的值為 input data ; 而我們的目的即是找出最佳的 coefficients w0~wn, 讓計算出來的 t 值套入sigmoid 得到的類別與預期的類別能有最大的相似度.

- Gradient ascent
第一個要接觸的 optimization algorithm 是 Gradient ascent. 詳細的演算法過程不在這裡介紹, 而這邊只說明概念. 考慮下面的式子 :

這個函式的概念就是透過不斷的計算預期值與運算值的差(預期值 - 運算值) :

1. 如果是負值 (預期值 < 運算值) 則原本的 weight 將會扣掉這個差與設定的 α 乘積, 如此 weight 將變小, 則下次計算 sigmoid() 的運算值將會變小 (更接近預期值) 而逐漸收斂.
2. 如果是正值 (預期值 > 運算值) 則原本的 weight 將會加上這個差與設定的 α 乘積, 如此 weight 將變大, 則下次計算 sigmoid() 的運算值將會變大 (更接近預期值) 而逐漸收斂.

- Train: Using gradient ascent to find the best parameters
首先我們先來撰寫載入 dataset , 計算 sigmoid 與 plot dataset 的函數 :
- Listing 5.1 (logRegres.py)

view plaincopy to clipboardprint?
# -*- coding: utf-8 -*-  
from numpy import *  
  
def plotDataSet(dataArr, labelArr):  
    import matplotlib.pyplot as plt  
    xcord1 = []; ycord1 = [];  
    xcord2 = []; ycord2 = [];  
    dataSize = len(dataArr)  
    for i in range(dataSize):  
        if labelArr[i] == 1:  
            xcord1.append(dataArr[i][1]); ycord1.append(dataArr[i][2])  
        else:  
            xcord2.append(dataArr[i][1]); ycord2.append(dataArr[i][2])  
    fig = plt.figure()  
    ax = fig.add_subplot(111)  
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  
    ax.scatter(xcord2, ycord2, s=30, c='green')  
    plt.show()  
  
def loadDataSet():  
    dataMat = []; labelMat = []  
    fr = open('testSet.txt')  
    for line in fr.readlines():  
        lineArr = line.strip().split()  
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  
        labelMat.append(int(lineArr[2]))  
    return dataMat,labelMat  
  
def sigmoid(inX):  
    return 1.0/(1+exp(-inX))  

接著可以如下畫出 data set 的分布 :

>>> import logRegres
>>> dataArr, labelMat = logRegres.loadDataSet()
>>> logRegres.plotDataSet(dataArr, labelMat)

執行後會出現下圖 :

接著是 Gradient ascent algorithm 的實作 :

view plaincopy to clipboardprint?
def gradAscent(dataMatIn, classLabels, alpha=0.001, maxCycles=500):  
    dataMatrix = mat(dataMatIn)             # 轉 list 為 matrix  
    labelMat = mat(classLabels).transpose() # 轉 class label list 為 matrix, 並做 transpose  
    m,n = shape(dataMatrix)                 # m 為 data set size ; n 為 features size  
    weights = ones((n,1))                   # 初始 weight matrix 值為 1.  
    for k in range(maxCycles):              # for loop 進行 Gradient ascent algorithm 的計算.  
        h = sigmoid(dataMatrix*weights)     # 計算 t=w0x0 + w1x1 + ... + wnxn. 並將 t 帶入 sigmoid.  
        error = (labelMat - h)              # 計算 error > exp=1,0 - trt=1,0 = 0 ;  
                                            #              exp=1 - trt=0 = 1 (加大 weight) ; exp=0 - trt=1 = -1 縮小 weight)  
        weights = weights + alpha * dataMatrix.transpose()* error # 重新計算 weight matrix  
    return weights  

接著可以如下測試 :

>>> import logRegres
>>> dataArr, labelsArr = logRegres.loadDataSet()
>>> logRegres.gradAscent(dataArr, labelsArr)
matrix([[ 4.12414349],
[ 0.48007329],
[-0.6168482 ]])

接著有了 weight matrix, 我們便可以計算 sigmoid(), 因為最佳界定類別 C1 與類別 C2 在 t=0 的位置上故 :

(x1->x ; x2->y)

我們找到分開類別 C1 與類別 C2 的 hyperplane 的方程式, 接著便可以透過下面函數畫出 logistic regression best-fin line :

view plaincopy to clipboardprint?
def plotBestFit(weights):  
    import matplotlib.pyplot as plt  
    dataMat,labelMat=loadDataSet()  
    dataArr = array(dataMat)  
    n = shape(dataArr)[0]  
    xcord1 = []; ycord1 = []  
    xcord2 = []; ycord2 = []  
    for i in range(n):  
        if int(labelMat[i])== 1:  
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])  
        else:  
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])  
    fig = plt.figure()  
    ax = fig.add_subplot(111)  
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  
    ax.scatter(xcord2, ycord2, s=30, c='green')  
    x = arange(-3.0, 3.0, 0.1)  
    y = (-weights[0]-weights[1]*x)/weights[2]  
    ax.plot(x, y)  
    plt.xlabel('X1'); plt.ylabel('X2');  
    plt.show()  

接著可以如下測試

>>> dataArr, labelsArr = logRegres.loadDataSet()
>>> weights = logRegres.gradAscent(dataArr, labelsArr)
>>> weights.getA()
array([[ 4.12414349],
[ 0.48007329],
[-0.6168482 ]])
>>> logRegres.plotBestFit(weights.getA())

產生座標圖如下 :

- Train: stochastic gradient ascent
在使用 Gradient ascent algorithm 可以發現 loop 中的 data set 的每一筆 record 都會被拿來計算 sigmoid(). 這裡的 data set 只有 100 筆所以感覺不出來影響, 在處理大量的 data set (billions), 就會有 performance concern. 因此有另一個替代的 Stochastic gradient ascent algorithm, 一次只拿 data set 的一筆 record 來重新計算 weight matrix, 而透過它當有新的 data set 進來, 我們可以利用已經算好的 weight matrix 繼續往下運算而不用重新計算舊的 data set. 底下為其 pseudo code :

接著底下為其代碼實作 :

view plaincopy to clipboardprint?
def stocGradAscent0(dataMatrix, classLabels):  
    m,n = shape(dataMatrix)  
    alpha = 0.01  
    weights = ones(n)   #initialize to all ones  
    for i in range(m):  
        h = sigmoid(sum(dataMatrix[i]*weights))  
        error = classLabels[i] - h  
        weights = weights + alpha * error * dataMatrix[i]  
    return weights  

但 α(alpha) 值應該隨著 loop 過程中而變小, 因為 error 會逐漸收斂, 故底下為優化過後的代碼 :

view plaincopy to clipboardprint?
def stocGradAscent1(dataMatrix, classLabels, numIter=150):  
    m,n = shape(dataMatrix)  
    weights = ones(n)   #initialize to all ones  
    for j in range(numIter):  
        dataIndex = range(m)  
        for i in range(m):  
            alpha = 4/(1.0+j+i)+0.0001    #apha decreases with iteration, does not  
            randIndex = int(random.uniform(0,len(dataIndex)))#go to 0 because of the constant  
            h = sigmoid(sum(dataMatrix[randIndex]*weights))  
            error = classLabels[randIndex] - h  
            weights = weights + alpha * error * dataMatrix[randIndex]  
            del(dataIndex[randIndex])  
    return weights  

程式扎記

標籤

2012年7月23日星期一

[ ML In Action ] Improving classification with the AdaBoost meta-algorithm

2012年7月22日星期日

[ ML In Action ] Logistic Regression

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2012年7月23日 星期一