程式扎記: [ ML In Action ] Improving classification with the AdaBoost meta-algorithm

標籤

2012年7月23日 星期一

[ ML In Action ] Improving classification with the AdaBoost meta-algorithm


Preface :
如果你要做一個重要的決定, 你可能會找人提供建議. 如果你找的人不只一個, 可能重要的人的意見會比其他人的意見對你的影響比重會大一點. 在 Machine learning 類似的概念稱之為 meta-algorithm : 它被使用來組合不同的 algorithms, 並將準確率較高的 algorithm 的權重調高於其它的 algorithm. 而在這邊我們要介紹的是 AdaBoost 演算法.

Classifiers using multiple samples of the dataset :
AdaBoost
Pros:
 Low generalization error, easy to code, works with most classifiers, no parameters to adjust.
Cons: Sensitive to outliers
Works with: Numeric values, nominal values

在這之前我們看過了一些 Classification 的演算法如 kNNDecision treenaive Bayes 與 Logistic regression. 這些演算法有各自的特點與優缺點, 如果要組合不同的演算法, 利用他們的結果來產出更準確的 Classification, 這個動作稱之為 ensemble methods or meta-algorithms.

- Building classifiers from randomly resampled data:bagging
Bootstrap aggregating 是 meta-algorithms 中的一個做法, wiki 上的簡單介紹如下 :
Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm to improve machine learning of statistical classification and regression models in terms of stability and classification accuracy. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree models, it can be used with any type of model. Bagging is a special case of the model averaging approach.

詳細的說明可以參考 wiki.

- Boosting
Boosting 與 bagging 相似, 不同的是它使用了 weighting, 在 iteration 的過程中, 上一輪較準確的 classifier 會被給定較高的 weighting. 底下為其書上的說明 :
Boosting is a technique similar to bagging. In boosting and bagging, you always use the same type of classifier. But in boosting, the different classifiers are trained sequentially. Each new classifier is trained based on the performance of those already trained. Boosting makes new classifiers focus on data that was previously misclassified by previous classifiers.

有需多種類的 Boosting , 這邊我們要介紹的是 AdaBoost :
General approach to AdaBoost
1. Collect: Any method.
2. Prepare: It depends on which type of weak learner you're going to use. In this chapter, we'll use decision stumps, which can take any type of data. You could use any classifier, so any of the classifiers from chapters would work. Simple classifiers work better for a weak learner.
3. Analyze: Any method.
4. Train: The majority of the time will be spent here. The classifier will train the weak learner multiple times over the same dataset.
5. Test: Calculate the error rate.
6. Use: Like support vector machines, AdaBoost predicts one of two classes. If you want to use it for classification involving more than two classes, then you'll need to apply some of the same methods as for support vector machines.

Improving the classifier by focusing on errors :
底下是 Boosting 的起因, 我們有可能透過一些 Weak classifer 來組合出一個 Strong classifer 嗎?
Boosting is based on the question posed by Kearns: can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

AdaBoost 是 adaptive boosting 的縮寫, 而它便是上述問題的解答之一. 它透過給定 weight vector D (一開始 vector 內值為相等) 到 training data 的每一筆 record 來給 decision tree stump 決定如何取出最佳的結果, 每次 Iteration 後利用重新計算 weight vector 來讓下一個 stump 建立可以 focus 在那些 error 的 records 上藉以找出組合出來的最佳解.

而在計算 weight vector D 有些參數定義如下 :


在每次 Iterator 後重新計算 weight vector D 時, 會降低 correct record 的 weighting ; 增加 error record 的 weighting :


而 α 除了用來計算 weight vector D 外, 它也是每次產生出來的 Weak classifier 的權重 :


Creating a weak learner with a decision stump :
這邊的 decision stump 是一個最簡單的 decision tree, 只有一個 split 意味只對一個 feature 進行 classify. 首先按照慣例我們撰寫函數如下來取回 test data set :
- Listing 7.0 of adaboost.py :
  1. # -*- coding: utf-8 -*-  
  2. from numpy import *  
  3.   
  4. def plotDataSet(dataArr, labelArr):  
  5.     import matplotlib.pyplot as plt  
  6.     xcord1 = []; ycord1 = [];  
  7.     xcord2 = []; ycord2 = [];  
  8.     dataSize = len(dataArr)  
  9.     for i in range(dataSize):  
  10.         if labelArr[i] == 1:  
  11.             xcord1.append(dataArr[i][0]); ycord1.append(dataArr[i][1])  
  12.         else:  
  13.             xcord2.append(dataArr[i][0]); ycord2.append(dataArr[i][1])  
  14.     fig = plt.figure()  
  15.     ax = fig.add_subplot(111)  
  16.     ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  
  17.     ax.scatter(xcord2, ycord2, s=30, c='green')  
  18.     plt.show()  
  19.   
  20. def loadSimpData():  
  21.     datMat = matrix([[ 1. ,  2.1],  
  22.         [ 2. ,  1.1],  
  23.         [ 1.3,  1. ],  
  24.         [ 1. ,  1. ],  
  25.         [ 2. ,  1. ]])  
  26.     classLabels = [1.01.0, -1.0, -1.01.0]  
接著你可以如下觀察 test data set 的分布 :
>>> import adaboost
>>> datMat, classLabels = adaboost.loadSimpData()
>>> adaboost.plotDataSet(datMat.A, classLabels)

data set 分布如下 :


底下我們要來撰寫建立 decision stump 的代碼, 第一個函式用來將 data set 使用建立好的 decision stump 來做 classification 並以 array 回傳. 第二個函數則為使用給定的 weight vector D 來找出最佳 (error 最小) 的 decision stump, 其 pseudo code 如下 :


底下是代碼的實作 :
  1. def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#just classify the data  
  2.     """ 使用給定的 stump(dimen, threshVal, threshIneq) 對 data set=dataMatrix 進行 classification. """  
  3.     retArray = ones((shape(dataMatrix)[0],1))  
  4.     if threshIneq == 'lt':  
  5.         retArray[dataMatrix[:,dimen] <= threshVal] = -1.0  
  6.     else:  
  7.         retArray[dataMatrix[:,dimen] > threshVal] = -1.0  
  8.     return retArray  
  9.   
  10.   
  11. def buildStump(dataArr,classLabels,D):  
  12.     """ 根據 weight matrix D 建立最佳 stump."""  
  13.     dataMatrix = mat(dataArr); labelMat = mat(classLabels).T  
  14.     m,n = shape(dataMatrix) # m 為 data set size ; n 為 feature size.  
  15.     numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))  
  16.     minError = inf # init error sum, to +infinity  
  17.     for i in range(n):#loop over all dimensions  
  18.         rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max();  
  19.         stepSize = (rangeMax-rangeMin)/numSteps  
  20.         for j in range(-1,int(numSteps)+1):#loop over all range in current dimension  
  21.             for inequal in ['lt''gt']: #go over less than and greater than  
  22.                 threshVal = (rangeMin + float(j) * stepSize)  
  23.                 predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)#call stump classify with i, j, lessThan  
  24.                 errArr = mat(ones((m,1)))  
  25.                 errArr[predictedVals == labelMat] = 0  
  26.                 weightedError = D.T*errArr  #calc total error multiplied by D  
  27.                 #print "split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError)  
  28.                 if weightedError < minError:  
  29.                     minError = weightedError  
  30.                     bestClasEst = predictedVals.copy()  
  31.                     bestStump['dim'] = i  
  32.                     bestStump['thresh'] = threshVal  
  33.                     bestStump['ineq'] = inequal  
  34.     return bestStump,minError,bestClasEst  
接著可以如下測試代碼 :


Implementing the full AdaBoost algorithm :
有了上面的前置準備, 接下來我們要來進行完整的 AdaBoost algorithm, 底下為其 Pseudo-code :


代碼實作如下 :
- Listing 7.2 of adaboost.py :
  1. def adaBoostTrainDS(dataArr,classLabels,numIt=40):  
  2.     """ AdaBoost training with decision stumps. """  
  3.     weakClassArr = []  
  4.     m = shape(dataArr)[0]  
  5.     D = mat(ones((m,1))/m)   #init D to all equal  
  6.     aggClassEst = mat(zeros((m,1)))  
  7.     for i in range(numIt):  
  8.         bestStump,error,classEst = buildStump(dataArr,classLabels,D)#build Stump  
  9.         #print "D:",D.T  
  10.         alpha = float(0.5*log((1.0-error)/max(error,1e-16)))#calc alpha, throw in max(error,eps) to account for error=0  
  11.         bestStump['alpha'] = alpha  
  12.         weakClassArr.append(bestStump)                  #store Stump Params in Array  
  13.         #print "classEst: ",classEst.T  
  14.         expon = multiply(-1*alpha*mat(classLabels).T,classEst) #exponent for D calc, getting messy  
  15.         D = multiply(D,exp(expon))                              #Calc New D for next iteration  
  16.         D = D/D.sum()  
  17.         #calc training error of all classifiers, if this is 0 quit for loop early (use break)  
  18.         aggClassEst += alpha*classEst  
  19.         #print "aggClassEst: ",aggClassEst.T  
  20.         aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))  
  21.         errorRate = aggErrors.sum()/m  
  22.         print "total error: ",errorRate  
  23.         if errorRate == 0.0break  
  24.     return weakClassArr, aggClassEst  
接著你可以如下取回 AdaBoost algorithm training 完後的結果 :


最後我們要做的就是撰寫函數, 傳入 data set 與 training 後的 classifiers 並回傳 classified 後的結果 :
- Listing 7.3 of adaboost.py :
  1. def adaClassify(datToClass,classifierArr):  
  2.     dataMatrix = mat(datToClass)#do stuff similar to last aggClassEst in adaBoostTrainDS  
  3.     m = shape(dataMatrix)[0]  
  4.     aggClassEst = mat(zeros((m,1)))  
  5.     for i in range(len(classifierArr)):  
  6.         classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],\  
  7.                                  classifierArr[i]['thresh'],\  
  8.                                  classifierArr[i]['ineq'])#call stump classify  
  9.         aggClassEst += classifierArr[i]['alpha']*classEst  
  10.         print aggClassEst  
  11.     return sign(aggClassEst)  
接著可以如下利用剛剛建立的 classifiers 測試 data 的 classification (假設-1 為類別 C1; +1為類別 C2):


Supplement :
numpy.sign : Returns an element-wise indication of the sign of a number.
numpy.multiply : Multiply arguments element-wise.
>>> a = mat([[1, 1, 1]])
>>> b = mat([[2, 3, 4]])
>>> c = multiply(a, b)
>>> c
matrix([[2, 3, 4]])



沒有留言:

張貼留言

網誌存檔

關於我自己

我的相片
Where there is a will, there is a way!