## 2012年7月18日 星期三

### [ ML In Action ] Classifying with probability theory : naive Bayes

Preface :
Naive Bayes builds on Bayesian decision theory. Suppose we have two features x, y and two classes C1, C2. Let p1(x,y) be the probability that a sample with features x, y belongs to C1, and p2(x,y) the probability that it belongs to C2. The decision rule is then:
If p1(x,y) > p2(x,y), then the class is C1.
If p2(x,y) > p1(x,y), then the class is C2.
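The rule above can be sketched as a tiny Python function; `p1` and `p2` here are hypothetical callables standing in for the two probability functions (ties fall back to C2 in this sketch):

```python
def classify(p1, p2, x, y):
    """Return 'C1' if p1(x, y) > p2(x, y), else 'C2'.

    p1 and p2 are hypothetical callables returning the class
    probabilities for the feature pair (x, y).
    """
    return 'C1' if p1(x, y) > p2(x, y) else 'C2'

# Toy example with constant probability functions.
print(classify(lambda x, y: 0.7, lambda x, y: 0.3, 1.0, 2.0))  # -> C1
```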

Naive Bayes
Pros: Works with a small amount of data, handles multiple classes
Cons: Sensitive to how the input data is prepared
Works with: Nominal values

Conditional probability :

If P(C1|x,y) > P(C2|x,y), the class is C1.
If P(C2|x,y) > P(C1|x,y), the class is C2.
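These posteriors come from Bayes' rule, P(Ci|x,y) = P(x,y|Ci) * P(Ci) / P(x,y). A tiny worked example with made-up likelihoods and priors, purely for illustration:

```python
# Bayes' rule: P(ci | x, y) = P(x, y | ci) * P(ci) / P(x, y).
# The numbers below are made up for illustration only.
p_xy_given_c1, p_c1 = 0.2, 0.6   # likelihood and prior for C1
p_xy_given_c2, p_c2 = 0.5, 0.4   # likelihood and prior for C2
p_xy = p_xy_given_c1 * p_c1 + p_xy_given_c2 * p_c2  # law of total probability

p_c1_given_xy = p_xy_given_c1 * p_c1 / p_xy  # posterior for C1
p_c2_given_xy = p_xy_given_c2 * p_c2 / p_xy  # posterior for C2
print(p_c1_given_xy, p_c2_given_xy)  # the two posteriors sum to 1
```

Here p_c2_given_xy > p_c1_given_xy, so the sample would be labeled C2.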

Document classification with naive Bayes :

General approach to naive Bayes
1. Collect: Any method.
2. Prepare: Numeric or Boolean values are needed.
3. Analyze: With many features, plotting features isn't helpful. Looking at histograms is a better idea.
4. Train: Calculate the conditional probabilities of the independent features.
5. Test: Calculate the error rate.
6. Use: One common application of naive Bayes is document classification.

The two "naive" assumptions behind the model:
1. Statistical independence: one feature or word is just as likely by itself as it is next to other words.
2. Each feature is equally important.

- Prepare: making word vectors from text

- bayes.py :

```python
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])  # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec
```
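Together, these three functions turn each post into a fixed-length 0/1 vector over the vocabulary. A standalone sketch of the same idea (this uses `sorted()` to make the vocabulary order deterministic, unlike the set-ordered list `createVocabList` returns):

```python
# Standalone sketch of the word-vector pipeline above.
posts = [['my', 'dog', 'has', 'flea', 'problems'],
         ['stop', 'posting', 'stupid', 'garbage']]
vocab = sorted(set(w for post in posts for w in post))  # deterministic vocabulary
vec = [1 if w in posts[1] else 0 for w in vocab]        # set-of-words vector for post 1
print(vocab)
print(vec)   # 1 marks each vocabulary word present in the second post
```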

- Train: calculating probabilities from word vectors

```python
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)                     # total number of postings
    numWords = len(trainMatrix[0])                      # total number of features
    pAbusive = sum(trainCategory)/float(numTrainDocs)   # abusive probability = P(c1)
    p0Num = zeros(numWords); p1Num = zeros(numWords)    # zero arrays, one slot per feature
    p0Denom = 0.0; p1Denom = 0.0                        # token totals for class 0 / class 1 start at 0
    for i in range(numTrainDocs):                       # loop over the postings
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom                              # probability of each token in class 1 -> P(w|c1)
    p0Vect = p0Num/p0Denom                              # probability of each token in class 0 -> P(w|c0)
    return p0Vect, p1Vect, pAbusive                     # return P(w|c0), P(w|c1), P(c1)
```
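What trainNB0 returns can be checked by hand on a tiny matrix. A standalone sketch of the same computation on toy two-document data (NumPy assumed):

```python
import numpy as np

# Two word vectors (rows) over a 4-word vocabulary; class 1 is abusive.
trainMatrix = np.array([[1, 1, 0, 0],   # class 0 document
                        [0, 1, 1, 1]])  # class 1 document
trainCategory = np.array([0, 1])

pAbusive = trainCategory.sum() / float(len(trainMatrix))  # P(c1) = 0.5
p1Num = trainMatrix[trainCategory == 1].sum(axis=0)       # per-token counts in class 1
p1Denom = p1Num.sum()                                     # total tokens in class 1
p1Vect = p1Num / float(p1Denom)                           # P(w|c1)
print(pAbusive)  # -> 0.5
print(p1Vect)    # each class-1 token appears 1 time out of 3
```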

- Test: modifying the classifier for real-world conditions

Initializing the counts to ones() and the denominators to 2.0 (Laplace smoothing) keeps a single unseen word, whose probability would otherwise be 0, from zeroing out the whole product P(w1|ci)P(w2|ci)...:

```python
p0Num = ones(numWords); p1Num = ones(numWords)      # change to ones()
p0Denom = 2.0; p1Denom = 2.0                        # change to 2.0
```

Taking the logarithm turns the product of probabilities into a sum and avoids floating-point underflow:

```python
p1Vect = log(p1Num/p1Denom)          # change to log()
p0Vect = log(p0Num/p0Denom)          # change to log()
```
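To see why the log() change matters: a product of many small per-word probabilities underflows to 0.0 in double-precision floating point, while the equivalent sum of logs stays finite. A quick standalone check:

```python
import math

probs = [1e-5] * 100      # 100 small per-word probabilities
product = 1.0
for p in probs:
    product *= p          # 1e-500 is below double precision: underflows to 0.0

log_sum = sum(math.log(p) for p in probs)  # same quantity in log space: finite
print(product)   # -> 0.0
print(log_sum)   # about -1151.3
```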

```python
def trainNB1(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)                     # total number of postings
    numWords = len(trainMatrix[0])                      # total number of features
    pAbusive = sum(trainCategory)/float(numTrainDocs)   # abusive probability = P(c1)
    p0Num = ones(numWords); p1Num = ones(numWords)      # change to ones()
    p0Denom = 2.0; p1Denom = 2.0                        # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          # change to log()
    p0Vect = log(p0Num/p0Denom)          # change to log()
    return p0Vect, p1Vect, pAbusive
```

```python
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """P(ci|w) = P(w|ci) * P(ci) / P(w); since we only compare the two
    values, the common denominator P(w) is dropped."""
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)           # log(P(w|c1) * P(c1)) = log(P(w|c1)) + log(P(c1))
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)     # log(P(w|c0) * P(c0)) = log(P(w|c0)) + log(1 - P(c1))
    if p1 > p0:
        return 1
    else:
        return 0
```

```python
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB1(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print "\t[Info] %s is classified as: %s!" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print "\t[Info] %s is classified as: %s!" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb))
```

>>> reload(bayes) # reload bayes.py so the newly added code takes effect.

>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1

- Prepare: the bag-of-words document model

```python
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
```
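The only difference from setOfWords2Vec is `+= 1` instead of `= 1`: the bag-of-words model keeps word counts, while the set-of-words model only records membership. A standalone comparison on a document with a repeated word:

```python
# Set-of-words records presence; bag-of-words records frequency.
vocab = ['stupid', 'garbage', 'dog']
doc = ['stupid', 'stupid', 'garbage']

setVec = [1 if w in doc else 0 for w in vocab]  # membership only
bagVec = [doc.count(w) for w in vocab]          # counts
print(setVec)  # -> [1, 1, 0]
print(bagVec)  # -> [2, 1, 0]
```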
