Preface:
Suppose we have a document whose representation is an n-dimensional feature vector X = (x1, x2, ..., xn). The goal is to classify it into one of m classes C1, ..., Cm. The intuitive approach is to estimate P(Ci|X) for each class: if P(Ci|X) is higher than every other P(Cj|X), the naive Bayesian classifier assigns document X to Ci. By Bayes' rule, P(Ci|X) = P(X|Ci) P(Ci) / P(X); since P(X) is the same for every class, it suffices to compare P(X|Ci) P(Ci).
Naive Bayesian Training:
How do we estimate P(X|Ci) and P(Ci) from the training documents? P(Ci) is estimated as the proportion of training documents that belong to class Ci. That is, if T(Ci) denotes the set of training documents classified as Ci, we estimate P(Ci) = si/S, where S = |T| and si = |T(Ci)|. To simplify the estimation of P(X|Ci), we naively assume that the feature dimensions are conditionally independent given the class, so that P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).
Let T(Ci, xj) be the set of documents in T(Ci) that contain feature xj. We estimate P(xj|Ci) as the proportion of T(Ci) that T(Ci, xj) accounts for; that is, P(xj|Ci) = sij/si, where sij = |T(Ci, xj)|.
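The frequency estimates above (P(Ci) = si/S and P(xj|Ci) = sij/si) can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `estimate` is not part of NaiveBayes.groovy:

```python
from collections import Counter, defaultdict

def estimate(instances, categories):
    """Frequency estimates P(Ci) = si/S and P(xj|Ci) = sij/si."""
    S = len(categories)                           # S = |T|
    si = Counter(categories)                      # si = |T(Ci)|
    priors = {c: n / S for c, n in si.items()}    # P(Ci) = si/S
    sij = defaultdict(Counter)                    # sij = |T(Ci, xj)|
    for inst, c in zip(instances, categories):
        for j, xj in enumerate(inst):
            sij[c][(j, xj)] += 1
    likelihoods = {c: {f: n / si[c] for f, n in fc.items()}
                   for c, fc in sij.items()}      # P(xj|Ci) = sij/si
    return priors, likelihoods
```

Features are keyed by (position, value) pairs so that the same value in different columns is counted separately.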
Example: Naive Bayesian Categorization
Given a record describing a person, we want to predict whether that person will buy a computer.
Approach: let C1 stand for "YES: will buy a computer" and C2 for "NO: will not buy a computer". Compare P(C1|X) with P(C2|X) and take the class with the larger value. For the 14-record training set used in the toolkit example below and the test record X = ["<=30", "中", "Y", "O"], the counts give P(C1) = 9/14 and P(C2) = 5/14, so P(X|C1)P(C1) = (2/9)(4/9)(6/9)(6/9)(9/14) ≈ 0.028 and P(X|C2)P(C2) = (3/5)(2/5)(1/5)(2/5)(5/14) ≈ 0.007; X is therefore classified as C1.
Toolkit Usage:
Following the explanation above, I implemented Naive Bayesian classification in "NaiveBayes.groovy". Its usage is illustrated by the example below:
def labels = ["Age", "Income", "Student?", "CC", "Class"]
def datas = [["<=30", "高", "N", "O", "N"],
             ["<=30", "高", "N", "G", "N"],
             ["31-40", "高", "N", "O", "Y"],
             [">40", "中", "N", "O", "Y"],
             [">40", "低", "Y", "O", "Y"],
             [">40", "低", "Y", "G", "N"],
             ["31-40", "低", "Y", "G", "Y"],
             ["<=30", "中", "N", "O", "N"],
             ["<=30", "低", "Y", "O", "Y"],
             [">40", "中", "Y", "O", "Y"],
             ["<=30", "中", "Y", "G", "Y"],
             ["31-40", "中", "N", "G", "Y"],
             ["31-40", "高", "Y", "O", "Y"],
             [">40", "中", "N", "G", "N"]]
NaiveBayes nb = new NaiveBayes()
// Collect the category of each instance
def cateList = datas.collect {it[4]}
def instances = []
datas.each { data->
    //printf "\t[Test] %s\n", data[0..3].join(",")
    instances.add(data[0..3])
}
printf "\t[Info] NaiveBayes training...%s\n", nb.train2(instances, cateList)
def testInst = ["<=30", "中", "Y", "O"]
printf "\t[Info] Instance=[%s] is classified as %s!\n", testInst.join(","), nb.classify2(testInst)
testInst = ["31-40", "高", "N", "O"]
printf "\t[Info] Instance=[%s] is classified as %s!\n", testInst.join(","), nb.classify2(testInst)
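For readers without a Groovy environment, the same example can be reproduced with a small Python sketch. The `train2`/`classify2` names here merely mirror the Groovy usage above; this is not the NaiveBayes.groovy implementation:

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal categorical naive Bayes mirroring the Groovy usage above."""

    def train2(self, instances, cate_list):
        self.S = len(cate_list)
        self.si = Counter(cate_list)          # si = |T(Ci)|
        self.sij = defaultdict(Counter)       # sij = |T(Ci, xj)|
        for inst, c in zip(instances, cate_list):
            for j, xj in enumerate(inst):
                self.sij[c][(j, xj)] += 1

    def classify2(self, inst):
        best, best_score = None, -1.0
        for c in self.si:
            score = self.si[c] / self.S                          # P(Ci)
            for j, xj in enumerate(inst):
                score *= self.sij[c][(j, xj)] / self.si[c]       # P(xj|Ci)
            if score > best_score:
                best, best_score = c, score
        return best

datas = [["<=30", "高", "N", "O", "N"], ["<=30", "高", "N", "G", "N"],
         ["31-40", "高", "N", "O", "Y"], [">40", "中", "N", "O", "Y"],
         [">40", "低", "Y", "O", "Y"], [">40", "低", "Y", "G", "N"],
         ["31-40", "低", "Y", "G", "Y"], ["<=30", "中", "N", "O", "N"],
         ["<=30", "低", "Y", "O", "Y"], [">40", "中", "Y", "O", "Y"],
         ["<=30", "中", "Y", "G", "Y"], ["31-40", "中", "N", "G", "Y"],
         ["31-40", "高", "Y", "O", "Y"], [">40", "中", "N", "G", "N"]]
nb = NaiveBayes()
nb.train2([d[:4] for d in datas], [d[4] for d in datas])
print(nb.classify2(["<=30", "中", "Y", "O"]))    # Y
print(nb.classify2(["31-40", "高", "N", "O"]))   # Y
```

Note that the second test instance gets score 0 for class N (no training record of class N has Age 31-40); a production implementation would typically apply Laplace smoothing to avoid zero probabilities.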
Supplement:
* [ ML In Action ] Classifying with probability theory: naive Bayes