## Monday, July 16, 2012

### [ ML In Action ] Decision Tree Construction

Preface:

Decision Trees
Pros: Computationally cheap to use, easy for humans to understand learned results, missing values OK, can deal with irrelevant features.
Cons: Prone to overfitting
Works with: Numeric values, nominal values

Tree Construction:

1. Check if every item in the dataset is in the same class:
2.     If so return the class label
3.     Else
4.         find the "best feature" to split the data
5.         split the dataset
6.         create a branch node
7.             for each split
8.                 call createBranch and add the result to the branch node
9.         return branch node

General approach to decision trees
1. Collect: Any method.
2. Prepare: This tree-building algorithm works only on nominal values, so any continuous values will need to be quantized.
3. Analyze: Any method. You should visually inspect the tree after it is built.
4. Train: Construct a tree data structure.
5. Test: Calculate the error rate with the learned tree.
6. Use: This can be used in any supervised learning task.
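Step 2 matters because this ID3-style algorithm branches on each distinct feature value, so continuous features must first be binned into nominal categories. A minimal sketch of such quantization (the `quantize` helper and thresholds here are hypothetical, not from the book):

```python
def quantize(values, thresholds):
    """Map each continuous value to a nominal bin index.

    thresholds must be sorted ascending; a value falls into the first bin
    whose threshold it does not exceed.
    """
    bins = []
    for v in values:
        for i, t in enumerate(thresholds):
            if v <= t:
                bins.append(i)
                break
        else:
            bins.append(len(thresholds))  # above every threshold
    return bins

weights = [1.2, 3.5, 2.0, 7.8]
print(quantize(weights, [2.0, 5.0]))  # -> [0, 1, 0, 2]
```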

- Information gain

Shannon defined the entropy H of a discrete random variable X with possible values {x1, ..., xn} and probability mass function p(X) as

    H(X) = E[I(X)] = -Σi p(xi) * log2 p(xi)

Here I(X) = -log2 p(X) is itself a random variable: I is the information content (self-information) of X, and the entropy is its expected value.
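As a quick numeric check of the formula, a two-class distribution with probabilities 0.4 and 0.6 works out to roughly 0.971 bits:

```python
from math import log

# H(X) = -sum(p_i * log2(p_i)), worked out for p = (0.4, 0.6)
probs = [0.4, 0.6]
H = -sum(p * log(p, 2) for p in probs)
print(round(H, 4))  # -> 0.971
```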

- Listing 3.1 Function to calculate the Shannon entropy of a dataset

    from math import log

    def calcShannonEnt(dataSet):
        """Calculate the Shannon entropy of a dataset."""
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet:           # count occurrences of each class label
            currentLabel = featVec[-1]    # the label is the last column
            labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key]) / numEntries
            shannonEnt -= prob * log(prob, 2)   # H = -sum(p * log2(p))
        return shannonEnt

    def createTestDataSet():
        dataSet = [[1, 1, 'yes'],
                   [1, 1, 'yes'],
                   [1, 0, 'no'],
                   [0, 1, 'no'],
                   [0, 1, 'no']]
        labels = ['no surfacing', 'flippers']
        return dataSet, labels
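The entropy of this toy dataset can be checked by hand without the helper functions; a minimal sketch using `collections.Counter` on the same 2-yes/3-no label mix:

```python
from collections import Counter
from math import log

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

counts = Counter(row[-1] for row in dataSet)   # Counter({'no': 3, 'yes': 2})
n = len(dataSet)
entropy = -sum(c / n * log(c / n, 2) for c in counts.values())
print(round(entropy, 3))  # -> 0.971
```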

- Splitting the dataset

- Listing 3.2 Dataset splitting on a given feature

    def splitDataSet(dataSet, axis, value):
        """Split the dataset on a given feature.
        - Arguments
          * dataSet: the original dataset
          * axis: column index of the feature to split on
          * value: the feature value to match."""
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedFeatVec = featVec[:axis]           # cut out the feature split on
                reducedFeatVec.extend(featVec[axis + 1:])
                retDataSet.append(reducedFeatVec)
        return retDataSet
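To see what the split produces, here is the same function applied to the toy dataset, keeping the rows where feature 0 equals a given value and dropping that column:

```python
def splitDataSet(dataSet, axis, value):
    # keep rows whose feature at `axis` equals `value`, with that column removed
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(splitDataSet(myDat, 0, 1))  # -> [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))  # -> [[1, 'no'], [1, 'no']]
```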

- Listing 3.3 Choosing the best feature to split on

    def chooseBestFeatureToSplit(dataSet):
        """Choose the best feature to split on.
        - Argument
          * dataSet: the original dataset
        - Return
          the column index of the feature with the largest information gain."""
        numFeatures = len(dataSet[0]) - 1        # the last column is the label
        baseEntropy = calcShannonEnt(dataSet)
        bestInfoGain = 0.0; bestFeature = -1
        for i in range(numFeatures):
            featList = [example[i] for example in dataSet]
            uniqueVals = set(featList)
            newEntropy = 0.0
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy  # a good split makes the data more organized, so entropy drops
            if infoGain > bestInfoGain:
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature

>>> myDat, labels = tree.createTestDataSet() # load the test dataset
>>> tree.chooseBestFeatureToSplit(myDat)
0
# so the first split is made on the feature in column 0!
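The per-feature numbers behind that choice can be worked out directly. A minimal re-computation (dataset inlined from createTestDataSet, entropy written out inline rather than reusing the module):

```python
from math import log

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

def entropy(rows):
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum(c / len(rows) * log(c / len(rows), 2) for c in counts.values())

base = entropy(dataSet)                    # ~0.971 bits before any split
gains = {}
for i in range(2):                         # two features: 'no surfacing', 'flippers'
    gain = base
    for v in set(row[i] for row in dataSet):
        sub = [row for row in dataSet if row[i] == v]
        gain -= len(sub) / len(dataSet) * entropy(sub)
    gains[i] = gain
    print(i, round(gain, 3))
# -> 0 0.42    (feature 0 wins, matching chooseBestFeatureToSplit)
# -> 1 0.171
```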

- Recursively building the tree

    import operator

    def majorityCnt(classList):
        """Return the majority class.
        - Argument
          * classList: a list of class labels.
        - Return
          the class that occurs most often in the list."""
        classCnt = {}
        for vote in classList:
            classCnt[vote] = classCnt.get(vote, 0) + 1
        sortedClassCnt = sorted(classCnt.items(),        # iteritems() in Python 2
                                key=operator.itemgetter(1),
                                reverse=True)
        return sortedClassCnt[0][0]
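This helper only kicks in when the features are exhausted but the labels still conflict; a quick check of its behavior:

```python
import operator

def majorityCnt(classList):
    # count votes per class and return the most frequent label
    classCnt = {}
    for vote in classList:
        classCnt[vote] = classCnt.get(vote, 0) + 1
    sortedClassCnt = sorted(classCnt.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCnt[0][0]

print(majorityCnt(['yes', 'no', 'no', 'yes', 'no']))  # -> no
```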

- Listing 3.4 Tree-building code

    def createTree(dataSet, labels):
        """Build the decision tree recursively.
        - Arguments
          * dataSet: the original dataset
          * labels: human-readable names of the features.
        - Return
          the decision tree, as nested dictionaries."""
        classList = [example[-1] for example in dataSet]
        if classList.count(classList[0]) == len(classList):   # all items belong to the same class
            return classList[0]
        if len(dataSet[0]) == 1:                              # no more features to split on
            return majorityCnt(classList)

        bestFeat = chooseBestFeatureToSplit(dataSet)
        bestFeatLabel = labels[bestFeat]
        myTree = {bestFeatLabel: {}}
        del(labels[bestFeat])
        featValues = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featValues)
        for value in uniqueVals:
            subLabels = labels[:]                             # copy so recursion doesn't mangle the caller's list
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        return myTree

>>> myDat, labels = tree.createTestDataSet()
>>> myTree = tree.createTree(myDat, labels)
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

Plotting trees in Python with Matplotlib annotations:

- Listing 3.7

    import matplotlib.pyplot as plt

    decisionNode = dict(boxstyle="sawtooth", fc="0.8")
    leafNode = dict(boxstyle="round4", fc="0.8")
    arrow_args = dict(arrowstyle="<-")

    def getNumLeafs(myTree):
        numLeafs = 0
        firstStr = list(myTree.keys())[0]
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':   # dict nodes are subtrees; anything else is a leaf
                numLeafs += getNumLeafs(secondDict[key])
            else:
                numLeafs += 1
        return numLeafs

    def getTreeDepth(myTree):
        maxDepth = 0
        firstStr = list(myTree.keys())[0]
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':   # a subtree adds one level of depth
                thisDepth = 1 + getTreeDepth(secondDict[key])
            else:
                thisDepth = 1
            if thisDepth > maxDepth:
                maxDepth = thisDepth
        return maxDepth

    def plotNode(nodeTxt, centerPt, parentPt, nodeType):
        createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                                xytext=centerPt, textcoords='axes fraction',
                                va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

    def plotMidText(cntrPt, parentPt, txtString):
        # label the branch between parent and child with the feature value
        xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
        yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
        createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

    def plotTree(myTree, parentPt, nodeTxt):
        numLeafs = getNumLeafs(myTree)        # determines the x width of this subtree
        depth = getTreeDepth(myTree)
        firstStr = list(myTree.keys())[0]     # the text label for this node (the feature split on)
        cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
        plotMidText(cntrPt, parentPt, nodeTxt)
        plotNode(firstStr, cntrPt, parentPt, decisionNode)
        secondDict = myTree[firstStr]
        plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':
                plotTree(secondDict[key], cntrPt, str(key))   # recurse into the subtree
            else:                                             # a leaf node: plot it directly
                plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
                plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
                plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
        plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

    def createPlot(inTree):
        fig = plt.figure(1, facecolor='white')
        fig.clf()
        axprops = dict(xticks=[], yticks=[])
        createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    # no ticks
        #createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks for demo purposes
        plotTree.totalW = float(getNumLeafs(inTree))
        plotTree.totalD = float(getTreeDepth(inTree))
        plotTree.xOff = -0.5 / plotTree.totalW; plotTree.yOff = 1.0
        plotTree(inTree, (0.5, 1.0), '')
        plt.show()

    def retrieveTree(i):
        listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                       {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}]
        return listOfTrees[i]

>>> import tree, treePlotter
>>> myTree = treePlotter.retrieveTree(0)
>>> treePlotter.createPlot(myTree)
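Before plotting, the layout helpers can be sanity-checked on the first stored tree, which has 3 leaves and depth 2. An `isinstance`-based variant of the two helpers above, inlined so it runs standalone:

```python
def getNumLeafs(myTree):
    # leaves are any branch values that are not nested dicts
    firstStr = list(myTree.keys())[0]
    return sum(getNumLeafs(branch) if isinstance(branch, dict) else 1
               for branch in myTree[firstStr].values())

def getTreeDepth(myTree):
    # each nested dict adds one level below the current decision node
    firstStr = list(myTree.keys())[0]
    return max(1 + (getTreeDepth(b) if isinstance(b, dict) else 0)
               for b in myTree[firstStr].values())

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(getNumLeafs(myTree), getTreeDepth(myTree))  # -> 3 2
```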

Testing and storing the classifier:

- Listing 3.8

    def classify(inputTree, featLabels, testVec):
        """Classification function for an existing decision tree."""
        firstStr = list(inputTree.keys())[0]       # the first decision block
        secondDict = inputTree[firstStr]           # the subtree under that decision block
        featIndex = featLabels.index(firstStr)     # column index of the feature this block tests
        for key in secondDict.keys():
            if testVec[featIndex] == key:                        # testVec matches this branch's feature value
                if type(secondDict[key]).__name__ == 'dict':     # the branch is a subtree: keep classifying
                    classLabel = classify(secondDict[key], featLabels, testVec)
                else:                                            # the branch is a class label: return it
                    return secondDict[key]
        return classLabel

>>> import tree, treePlotter
>>> myDat, labels = tree.createTestDataSet()
>>> myTree = treePlotter.retrieveTree(0)
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
>>> tree.classify(myTree, labels, [1, 0]) # surfacing & no flippers
'no'
>>> tree.classify(myTree, labels, [1, 1]) # surfacing & has flippers
'yes'

- Use: persisting the decision tree

- Listing 3.9

    def storeTree(inTree, filename):
        import pickle
        fw = open(filename, 'wb')     # pickle writes bytes
        pickle.dump(inTree, fw)
        fw.close()

    def grabTree(filename):
        import pickle
        fr = open(filename, 'rb')
        return pickle.load(fr)

>>> tree.storeTree(myTree, 'classifierStoreage.bin')
>>> tree.grabTree('classifierStoreage.bin')
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

### [ Python FAQ ] python generator "send" function purpose?

Source From Here. Question: Can someone give me an example of why the "send" function associated with Python generator function...