2014年5月29日 星期四

[ML in DH] Decision Tree (決策樹)

Introduction:
A decision tree is a flowchart-like tree structure:
– each internal node denotes a test on an attribute,
– each branch represents an outcome of the test, and
– each leaf node represents a class or a class distribution.

Entropy: a measure of disorder and randomness. Compare the following three distributions of a binary random variable X:

– P(X=0)=0, P(X=1)=1
– P(X=0)=0.2, P(X=1)=0.8
– P(X=0)=0.5, P(X=1)=0.5

Entropy: Definition

Entropy can be used to estimate how "random" a random variable is under its probability distribution: the higher the entropy, the greater the variable's disorder. Another interpretation treats entropy as the variable's uncertainty before a random experiment. Since the outcome is known after the experiment (the disorder or uncertainty decreases), we say the experiment provides an amount of information equal to the pre-experiment entropy, measured in bits.
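The definition H(X) = −Σ p(x) log2 p(x) is easy to check against the three distributions listed earlier; a minimal Python sketch (the helper name is illustrative, not part of the toolkit used below):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The three distributions from the illustration above:
# P(X=0)=0,   P(X=1)=1    -> H = 0 bits   (no uncertainty at all)
# P(X=0)=0.2, P(X=1)=0.8  -> H ~= 0.722 bits
# P(X=0)=0.5, P(X=1)=0.5  -> H = 1 bit    (maximum for two outcomes)
```

As expected, the uniform distribution is the "most random" one and carries the most uncertainty.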

Entropy: Illustration

Branching: Idea

Building a Decision Tree:

– Let S be the set of training samples, with |S| = s.
– Suppose there are m distinct classes {C1, ..., Cm}. Let Si denote the subset of S classified as Ci, with |Si| = si. The expected information needed to classify a sample is then I(s1, ..., sm) = −Σi (si/s) log2(si/s).

Decision-Tree Algorithm:

Example: Test Attribute Selection

Among the 14 training examples, 9 are YES and 5 are NO, so the entropy is I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.940.
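That number is easy to verify numerically; a short Python check (the `info` helper is illustrative, not part of the toolkit):

```python
import math

def info(counts):
    """Expected information I(s1, ..., sm) = -sum (si/s) * log2(si/s)."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

root_entropy = info([9, 5])   # 9 YES and 5 NO out of 14 examples
print(f"I(9, 5) = {root_entropy:.3f}")  # -> I(9, 5) = 0.940
```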

Using the toolkit:

def labels = ["Age", "Income", "Student?", "CC", "Class"]
def datas = [["<=30",  "High",   "N", "O", "N"],
             ["<=30",  "High",   "N", "G", "N"],
             ["31-40", "High",   "N", "O", "Y"],
             [">40",   "Medium", "N", "O", "Y"],
             [">40",   "Low",    "Y", "O", "Y"],
             [">40",   "Low",    "Y", "G", "N"],
             ["31-40", "Low",    "Y", "G", "Y"],
             ["<=30",  "Medium", "N", "O", "N"],
             ["<=30",  "Low",    "Y", "O", "Y"],
             [">40",   "Medium", "Y", "O", "Y"],
             ["<=30",  "Medium", "Y", "G", "Y"],
             ["31-40", "Medium", "N", "G", "Y"],
             ["31-40", "High",   "Y", "O", "Y"],
             [">40",   "Medium", "N", "G", "N"]]

// Part 1: information gain of each attribute
DTBuilder dtb = new DTBuilder(dataSet:datas, labels:labels)
def ageSet = [] as Set
def incSet = [] as Set
def issSet = [] as Set
def iccSet = [] as Set
// Collect the distinct values of each attribute column
// (reconstructed; these lines were missing from the original listing)
datas.each {
    ageSet << it[0]; incSet << it[1]; issSet << it[2]; iccSet << it[3]
}
def origE = dtb.calcShannonEnt(datas)
printf "\t[Info] Original E=%.03f\n", origE
printf "\t[Info] Calculate Age E:\n"
double aftrE = 0
for (age in ageSet)
{
    def sd = dtb.splitDataSet(datas, 0, age)
    def se = dtb.calcShannonEnt(sd)
    printf "\t\tAge='%s' with E=%.03f\n", age, se
    aftrE += (((double) sd.size()) / datas.size()) * se
}
printf "\t[Info] IG(Age)=%.04f!\n\n", origE - aftrE

printf "\t[Info] Calculate Income E:\n"
aftrE = 0
for (inc in incSet)
{
    def sd = dtb.splitDataSet(datas, 1, inc)
    def se = dtb.calcShannonEnt(sd)
    printf "\t\tIncome='%s' with E=%.03f\n", inc, se
    aftrE += (((double) sd.size()) / datas.size()) * se
}
printf "\t[Info] IG(Income)=%.04f!\n\n", origE - aftrE

printf "\t[Info] Calculate Student? E:\n"
aftrE = 0
for (i in issSet)
{
    def sd = dtb.splitDataSet(datas, 2, i)
    def se = dtb.calcShannonEnt(sd)
    printf "\t\tStudent?='%s' with E=%.03f\n", i, se
    aftrE += (((double) sd.size()) / datas.size()) * se
}
printf "\t[Info] IG(Student?)=%.04f!\n\n", origE - aftrE

printf "\t[Info] Calculate CC E:\n"
aftrE = 0
for (i in iccSet)
{
    def sd = dtb.splitDataSet(datas, 3, i)
    def se = dtb.calcShannonEnt(sd)
    printf "\t\tCC='%s' with E=%.03f\n", i, se
    aftrE += (((double) sd.size()) / datas.size()) * se
}
printf "\t[Info] IG(CC)=%.04f!\n\n", origE - aftrE

// Part 2: build the full decision tree
def decisionTree = dtb.createTree()
printf "\t[Info] Decision Tree: %s\n\n", decisionTree
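The DTBuilder class itself is not shown here, but the ID3-style procedure it is assumed to implement (compute entropy, split on each attribute value, pick the attribute with the highest information gain, recurse) can be sketched compactly in Python; the function names below are illustrative, not part of the original toolkit:

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy of the class labels (last column) in bits."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split(rows, col, value):
    """Rows whose column `col` equals `value`, with that column removed."""
    return [r[:col] + r[col + 1:] for r in rows if r[col] == value]

def best_attribute(rows):
    """Index of the attribute column with the highest information gain."""
    base = entropy(rows)
    gains = []
    for col in range(len(rows[0]) - 1):
        values = set(r[col] for r in rows)
        after = sum(len(split(rows, col, v)) / len(rows) * entropy(split(rows, col, v))
                    for v in values)
        gains.append((base - after, col))
    return max(gains)[1]

def create_tree(rows, labels):
    """Recursive ID3 build: nested dicts for tests, strings for leaves."""
    classes = [r[-1] for r in rows]
    if classes.count(classes[0]) == len(classes):
        return classes[0]                              # pure node -> leaf
    if len(rows[0]) == 1:
        return Counter(classes).most_common(1)[0][0]   # no attributes left -> majority vote
    col = best_attribute(rows)
    sub_labels = labels[:col] + labels[col + 1:]
    return {labels[col]: {v: create_tree(split(rows, col, v), sub_labels)
                          for v in set(r[col] for r in rows)}}
```

Run on the 14-example dataset above, the root test attribute comes out as Age, matching the information-gain comparison in the example (IG(Age) is the largest of the four).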

Supplement:
[ ML In Action ] Decision Tree Construction

[ Python Articles ] List Comprehensions and Generator Expressions
