程式扎記: [ ML In Action ] Classifying with probability theory : naive Bayes

Preface :
Naive Bayes 是 Bayesian decision theory 的延伸應用, 假設我們有兩個 features x, y 與兩個類別 C1, C2 ; 而 p1(x,y) 為給定 x,y 後是 C1 的機率, p2(x,y) 為給定 x,y 後是 C2 的機率. 因此我們可以這麼看 :

If p1(x,y) > p2(x,y), then the class is C1.
If p2(x,y) > p1(x,y), then the class is C2.

簡單來說就是給定 x,y 後, 看是那個類別的可能性高, 就判成該類別. 底下是使用 Naive Bayes 的概觀 :

Naive Bayes
Pros: Works with a small amount of data, handles multiple classes
Cons: Sensitive to how the input data is prepared
Works with: Nominal values

Conditional probability :
這邊要簡單介紹一下 "條件機率". 當你看到 P(A|B), 你可以想像成當發生 B 事件後, 事件 A 發生的機率. 公式定義成 :

而上面的公式你可以用下面的範例來理解 :

而上面的公式透過簡單的代數法則, 可以得到下面的公式 (貝氏定理):

通常使用時機為我們不容易計算 P(A|B), 因此透過上面公式求得較易計算的 P(B|A). 接著有條件機率概念, 一開始的 Classify 可以改寫成 :

If P(C1|x,y) > P(C2|x,y), the class is C1
If P(C2|x,y) > P(C1|x,y), the class is C2

Document classification with naive Bayes :
這邊我們將利用 Machine learning 的技術來自動對 document 進行 classification. 使用的方法為透過 Naive Bayes 對 document 中的 word 作為 feature 進行 training :
General approach to naive Bayes

1. Collect: Any method.
2. Prepare: Numeric or Boolean values are needed
3. Analyze: With many features, plotting features isn't helpful. Looking at histograms is a better idea.
4. Train: Calculate the conditional probabilities of the independent features
5. Test: Calculate the error rate
6. Use: One common application of naive Bayes is document classification.

在使用 naive Bayes 的 algorithm 時, 我們有以下的 assumptions :

1. Statistical independent: One feature or word is just as likely by itself as it is next to other words.
2. Each feature is equally important.

下面的範例, 我們會透過 training data 來告訴我們哪些 text 是 abusive; 那些不是. 透過取出這些 text 的 word (tokens) 作為 features, 我們使用 naive Bayes 來 training 我們的 model.

- Prepare: making word vectors from text
第一步要處理的, 便是將我們的 text 轉成 word vector. 下面的函數 loadDataSet() 已經將 text 轉成 token 的 list, 並將每個 posting 的類別 (abusive-1 or not-0) vector 與 posting list 一起回傳; 而函數 createVocabList() 則是將 posting 中所有 unique 的 word 取出並以 list 回傳; 最後一個函數 setOfWordsVec() 則是依據 unique word set, 將 posting 準成 feature list. 例如我們的 unique word set 是 ['A', 'B', 'C'] 而 posting 是 ['A', 'C'], 透過該函數會回傳 [1, 0, 1] :
- bayes.py :

view plaincopy to clipboardprint?
#!/usr/local/bin/python  
# -*- coding: utf-8 -*-  
from numpy import *  
  
def loadDataSet():  
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],  
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],  
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],  
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],  
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],  
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]  
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not  
    return postingList,classVec  
  
def createVocabList(dataSet):  
    vocabSet = set([])  #create empty set  
    for document in dataSet:  
        vocabSet = vocabSet | set(document) #union of the two sets  
    return list(vocabSet)  
  
def setOfWords2Vec(vocabList, inputSet):  
    returnVec = [0]*len(vocabList)  
    for word in inputSet:  
        if word in vocabList:  
            returnVec[vocabList.index(word)] = 1  
        else: print "the word: %s is not in my Vocabulary!" % word  
    return returnVec  

接著便可以如下測試 :

- Train: calculating probabilities from word vectors
還記得剛剛我們前面的式子 : If P(C1|x,y) > P(C2|x,y), the class is C1. 這邊我們將 x,y 換成 w (features), 透過貝氏定理我們改寫如下 :

接著從我們有的 training corpus, 我們需要計算 :

先來看看 p(w|ci) 這項, w 是 features 的總稱, 因此可以看成 w = w0, w1, w2, ..., wN. 又因為我們的 assumption 是 feature 間是 independent, 故 :

而接下來我們要實作代碼的 pseudo code 為 :

實作代碼為 :

view plaincopy to clipboardprint?
def trainNB0(trainMatrix,trainCategory):  
    numTrainDocs = len(trainMatrix)                     # 求出總 Posting 數目.  
    numWords = len(trainMatrix[0])                      # 求出總 Features 數目.  
    pAbusive = sum(trainCategory)/float(numTrainDocs)   # 求出 Abusive probability = P(ci)  
    p0Num = zeros(numWords); p1Num = zeros(numWords)    # 建立長度為 feature 長度的 zero array  
    p0Denom = 0.0; p1Denom = 0.0                        # Class1 與 Class0 的總 token 數初始值為 0  
    for i in range(numTrainDocs):                       # 開始 looping posting  
        if trainCategory[i] == 1:  
            p1Num += trainMatrix[i]  
            p1Denom += sum(trainMatrix[i])  
        else:  
            p0Num += trainMatrix[i]  
            p0Denom += sum(trainMatrix[i])  
    p1Vect = p1Num/p1Denom                              # 計算 Class1 中每個 token 出現的 Probability->P(w|c1)  
    p0Vect = p0Num/p0Denom                              # 計算 Class0 中每個 token 出現的 Probability->P(w|c0)  
    return p0Vect,p1Vect,pAbusive                       # 返回 : P(w|c0), P(w|c1), P(c1)  

接著可以如下 training 我們的 Model :

- Test: modifying the classifier for real-world condition
還記得上面在計算 P(w|c) 時, 是各個 feature 的連乘 (Assumption: 各個 feature 互相獨立):

這樣會遇到一個問題, 就是我們在計算過程中, 只要有一個 feature 為 0, 則不管其他 feature 為何, 乘積就是為零. 因此我們需要做一些 Smoothing. 這邊使用最簡單的方式就是將每個 token 的 count 的初始值設為 1 而不是 0. 這樣就算該 token 在某個 class 從來沒出現, 至少也還有 1 的基底, 因此在計算機率過程中就不會是 0. 這樣會動到的代碼為 :

view plaincopy to clipboardprint?
p0Num = ones(numWords); p1Num = ones(numWords)      # change to ones()  
p0Denom = 2.0; p1Denom = 2.0                        # change to 2.0.  

接著是因為機率是小於一的值, 因此一堆小於一的值的乘積可能造成 underflow! 因此我們使用 log 來對機率值作後置處理, 這樣可以利用到好處 : ln(a*b) = ln(a) + ln(b). 影響到的代碼如下 :

view plaincopy to clipboardprint?
p1Vect = log(p1Num/p1Denom)          #change to log()  
p0Vect = log(p0Num/p0Denom)          #change to log()  

這樣重新的 train 函數修改如下 :

view plaincopy to clipboardprint?
def trainNB1(trainMatrix,trainCategory):  
    numTrainDocs = len(trainMatrix)                     # 求出總 Posting 數目.  
    numWords = len(trainMatrix[0])                      # 求出總 Features 數目.  
    pAbusive = sum(trainCategory)/float(numTrainDocs)   # 求出 Abusive probability = P(ci)  
    p0Num = ones(numWords); p1Num = ones(numWords)      # change to ones()  
    p0Denom = 2.0; p1Denom = 2.0                        # change to 2.0  
    for i in range(numTrainDocs):  
        if trainCategory[i] == 1:  
            p1Num += trainMatrix[i]  
            p1Denom += sum(trainMatrix[i])  
        else:  
            p0Num += trainMatrix[i]  
            p0Denom += sum(trainMatrix[i])  
    p1Vect = log(p1Num/p1Denom)          #change to log()  
    p0Vect = log(p0Num/p0Denom)          #change to log()  
    return p0Vect,p1Vect,pAbusive  

接著要來撰寫我們的 classify 函數, 代碼如下 :

view plaincopy to clipboardprint?
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):  
    """ P(ci|w) = P(w|ci) * P(ci) / P(w) | 因為我們目的是比較值的大小, 故省略 P(w)."""  
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)           # log(P(w|c1) * P(c1)) = log(P(w|c1)) + log(P(c1))  
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)     # log(P(w|c0) * P(c0)) = log(P(w|c0)) + log(1-P(c1))  
    if p1 > p0:  
        return 1  
    else:  
        return 0  

接著來寫測試用的函數, 來將整個 Process 串起來並餵入 Test posting :

view plaincopy to clipboardprint?
def testingNB():  
    listOPosts,listClasses = loadDataSet()  
    myVocabList = createVocabList(listOPosts)  
    trainMat=[]  
    for postinDoc in listOPosts:  
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))  
    p0V,p1V,pAb = trainNB1(array(trainMat),array(listClasses))  
    testEntry = ['love', 'my', 'dalmation']  
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))  
    print("\t[Info] {0} is classified as: {1}!", testEntry, classifyNB(thisDoc,p0V,p1V,pAb))  
    testEntry = ['stupid', 'garbage']  
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))  
    print("\t[Info] {0} is classified as: {1}!", testEntry, classifyNB(thisDoc,p0V,p1V,pAb))  

接著我們可以如下測試它 :

>>> reload(bayes) # 重新載入 bayes.py, 讓剛剛新加的代碼生效.

>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1

- Prepare: the bag-of-words document model
在前面函數 setOfWords2Vec() 用來對 posting 建立 feature vector, 過程是取出 posting unique word 再對 unique word vector 對應出現的位置上標示為 1. 這樣的作法會有一個盲點, 就是如果該 token (word) 不只出現一次, 那這樣的 information 將會遺失. 因此有所謂的 bag-of-words model 將這樣的 information 保留. 因而我們寫了另一個函數來作為函數 setOfWords2Vec() 的替換 (當 word 的出現次數也是重要的 feature 時, 可以使用此函數替代) :

view plaincopy to clipboardprint?
def bagOfWords2VecMN(vocabList, inputSet):  
    returnVec = [0]*len(vocabList)  
    for word in inputSet:  
        if word in vocabList:  
            returnVec[vocabList.index(word)] += 1  
    return returnVec  

以上為 Machine Learning in Action 第四章內容, 相關代碼可以在這裡下載 (Ch04).

程式扎記

標籤

2012年7月18日星期三

[ ML In Action ] Classifying with probability theory : naive Bayes

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2012年7月18日 星期三