在上一章 Association analysis with the Apriori algorithm 我們使用 Apriori 演算法從 transaction list 找到 data 間的關係, 常見的應用是便利商店可以用來判斷當你買了商品 A 後接著會買其他商品的可能組合. 而在這個過程中我們第一步要找的是所謂的 frequent item set, 並透過設定 minimum support 來 filter 不常見的 item(s). 而在 Apriori 是使用窮舉法找出所有可能的 data set 組合, 雖然有利用 pruning 的技巧, 但是效率仍然不是很理想, 在處理大量資料時可能會遇到處理速度的瓶頸. 這邊要介紹另一種演算法, 讓你能更快速的從 transaction set 中找出 frequent item set. 該演算法需要建立一種特殊的資料結構稱為 FP-Tree (FP means Frequent Pattern) 而該演算法為 FP-growth algorithm.
FP-trees: an efficient way to encode a dataset:
在使用 FP-growth algorithm 前必須先建立 FP-Tree, 接著使用 FP-Tree 來 mining 出 frequent item set. 該演算法有以下特性:
首先我們來看一個簡單的範例, 來了解 FP-Tree 是如何建立. 假設你有如下的 transaction list:
在使用 transaction list 前, 我們要做一些前置處理. 就如在 Apriori 中處理 frequent item set 一樣, 我們會根據 minimum support 移除不常出現的 item (這邊 minimum support=3); 接著將 transaction 紀錄依據出現頻率從高到低進行 sorting. 處理完結果如下:
接著我們就要用處理過後的 transaction list 來建立 FP-Tree 時. 在 FP-Tree 在最上層的 root 是空集合的符號. 底下是前兩筆 transaction 建立 FP-Tree 的過程:
在 item 旁邊的數字說明該 item 出現的頻率 (z:1 說明 'z' 目前頻率為 1), 可能會隨著加入越來越多的 transaction 而增加. 另外為了迅速找到某個 item 在 FP-Tree 中的位置, 我們會建立一個 Header Table (鍵為 item, 值為該 item 在 FP-Tree 中的位置), 而每個 Node 都會有一個 link 連到同樣 item 的下一個 Node 而形成類似 linked list 的結構, 完整的 FP-Tree 建立結果長相如下:
Figure 12.2 FP-tree with header table. The header table serves as a starting point to find similar items.
下一個部份我們要來看如何透過代碼來建立 FP-Tree 並利用 FP-Tree 來 mining frequent items. 在開始 coding 前, 我們定義下面的類別作為 FP-Tree 的 node 以記載相關資料:
- class treeNode:
- def __init__(self, nameValue, numOccur, parentNode):
- self.name = nameValue # item name
- self.count = numOccur # frequency
- self.nodeLink = None # pointer to next node with same item name
- self.parent = parentNode # needs to be updated
- self.children = {} # next node according to sorting transaction set
- def inc(self, numOccur):
- self.count += numOccur
- def disp(self, ind=1):
- print ' '*ind, self.name, ' ', self.count
- for child in self.children.values():
- child.disp(ind+1)
- Listing 12.2 FP-tree creation code
- """Create tree based on given
. - Args:
- dataSet: Input transaction set.
- minSup: Used to filter out infrequent item(s).
- Returns:
- retTree: FP-Tree(treeNode) or None if no frequent item
- headerTable: Header table of FP-Tree or None if no frequent item
- headerTable[item] = (frequency,)
- Raises:
- None
- """
- def createTree(dataSet, minSup=1): #create FP-tree from dataset but don't mine
- headerTable = {}
- #go over dataSet twice
- for trans in dataSet:#first pass counts frequency of occurance
- for item in trans:
- headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
- for k in headerTable.keys(): #remove items not meeting minSup
- if headerTable[k] < minSup:
- del(headerTable[k])
- freqItemSet = set(headerTable.keys())
- #print 'freqItemSet: ',freqItemSet
- if len(freqItemSet) == 0: return None, None #if no items meet min support -->get out
- for k in headerTable:
- headerTable[k] = [headerTable[k], None] #reformat headerTable to use Node link
- #print 'headerTable: ',headerTable
- retTree = treeNode('Null Set', 1, None) #create tree
- for tranSet, count in dataSet.items(): #go through dataset 2nd time
- localD = {}
- for item in tranSet: #put transaction items in order
- if item in freqItemSet:
- localD[item] = headerTable[item][0]
- if len(localD) > 0:
- orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
- updateTree(orderedItems, retTree, headerTable, count)#populate tree with ordered freq itemset
- return retTree, headerTable #return tree and header table
- """ Recursively update the FP-Tree
- Args:
- items: Ordered frequent items
- inTree: FP-Tree node
- headerTable: Header table of FP-Tree
- count: Count of exist same transaction of input items.
- Returns:
- None
- Raises:
- None
- """
- def updateTree(items, inTree, headerTable, count):
- if items[0] in inTree.children:#check if orderedItems[0] in retTree.children
- inTree.children[items[0]].inc(count) #incrament count
- else: #add items[0] to inTree.children
- inTree.children[items[0]] = treeNode(items[0], count, inTree)
- if headerTable[items[0]][1] == None: #update header table
- headerTable[items[0]][1] = inTree.children[items[0]]
- else:
- updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
- if len(items) > 1:#call updateTree() with remaining ordered items
- updateTree(items[1::], inTree.children[items[0]], headerTable, count)
- def updateHeader(nodeToTest, targetNode): #this version does not use recursion
- while (nodeToTest.nodeLink != None): #Do not use recursion to traverse a linked list!
- nodeToTest = nodeToTest.nodeLink
- nodeToTest.nodeLink = targetNode
- Listing 12.3 Simple dataset and data wrapper
- """Create testing transaction set
- Args:
- None
- Returns:
- simpDat: Transaction set
- Raises:
- None
- """
- def loadSimpDat():
- simpDat = [['r', 'z', 'h', 'j', 'p'],
- ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
- ['z'],
- ['r', 'x', 'n', 'o', 's'],
- ['y', 'r', 'x', 'z', 'q', 't', 'p'],
- ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
- return simpDat
- """Build up directory to contains (transaction->frequency)
- Args:
- dataSet: Transaction set created by API:loadSimpDat()
- Return:
- retDict:
- Raise:
- None
- """
- def createInitSet(dataSet):
- retDict = {}
- for trans in dataSet:
- retDict[frozenset(trans)] = 1
- return retDict
Mining frequent items from an FP-tree:
當你建立 FP-Tree, 接下來要 mining 出 frequent items 就相當 straight forward, 以下是後續的三個動作:
那什麼是 "conditional pattern bases"? 你可以想像成剛剛產生的 frequent item set 中每個 item 從剛剛建立的 FP-Tree 可以走出所有可能的 path, 而該 path 的最後一個 item 為 target item. 舉例來說, 當 target item="z" 時, 從 FP-Tree 能走到 "z" 的所有走法只有 {Φ->z}5 (5 指的是頻率):
如果 target item="r", 則走法有 {Φ->z->r}1 , {Φ->z->x->y->r}1, {Φ->x->s->r}1:
如果拿掉 path 的頭尾, 底下整理出所有 frequent item set 的 "conditional pattern bases":
而地下是找出 "conditional pattern bases" 的代碼, 而在找的過程中代碼利用 Header table 指向每一個 frequent item, 並利用其 nodeLink 屬性指向下一個相同 item 的 Node; 並利用 Node 上面的的 parentNode 屬性往上走直到 root (parentNode=None):
- Listing 12.4 A function to find all paths ending with a given item
- """Ascends from the left node to root and store the path to
. - Args:
- leafNode: Node of PF-Tree
- prefixPath: list to store the traversal path
- Returns:
- None
- Raises:
- None
- """
- def ascendTree(leafNode, prefixPath): #ascends from leaf node to root
- if leafNode.parent != None:
- prefixPath.append(leafNode.name)
- ascendTree(leafNode.parent, prefixPath)
- """Find prefix path according to given FP-Tree node as
- Args:
- basePat: ?
- treeNode: Target FP-Tree node.
- Return:
- condPats: The directory to hold {conditional path->count}
- Raises:
- None
- """
- def findPrefixPath(basePat, treeNode): #treeNode comes from header table
- condPats = {}
- while treeNode != None:
- prefixPath = []
- ascendTree(treeNode, prefixPath)
- if len(prefixPath) > 1:
- condPats[frozenset(prefixPath[1:])] = treeNode.count
- treeNode = treeNode.nodeLink
- return condPats
接下來利用這些找到的 "conditional pattern bases" 當作 transaction 在建立 "conditional FP-trees". 以 target item='t' 為範例, 底下是建立的過程:
(item 's' 與 'r' 沒有出現在 "conditional FP-trees" 是因為它們不滿足 minimum support=3 的限制)
如此第三步只是重複 一, 二步. 因此我們只要利用 recursive 的方法便可以完成, 而 stop condition 為當 "conditional FP-trees" 為 None 時並終止 recursive. 底下為 mining 過程代碼:
- Listing 12.5 The mineTree function recursively finds frequent itemsets.
- """Recursively finds frequent itemsets.
- Args:
- inTree: FP-Tree
- headerTable: Header table of FP-Tree
- preFix: Prefix paths
- freqItemList: Frequent item list (Final result)
- Returns:
- None
- Raises:
- None
- """
- def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
- bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])]#(sort header table)
- for basePat in bigL: #start from bottom of header table
- newFreqSet = preFix.copy()
- newFreqSet.add(basePat)
- #print 'finalFrequent Item: ',newFreqSet #append to set
- freqItemList.append(newFreqSet)
- condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
- #print 'condPattBases :',basePat, condPattBases
- #2. construct cond FP-tree from cond. pattern base
- myCondTree, myHead = createTree(condPattBases, minSup)
- #print 'head from conditional tree: ', myHead
- if myHead != None: #3. mine cond. FP-tree
- #print 'conditional tree for: ',newFreqSet
- #myCondTree.disp(1)
- mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)
Supplement:
* [ Python 範例代碼 ] Sorting directory into item key list according to the item value
if you are finding for php training in Noida then you may click here
回覆刪除Great article essay typer toronto
回覆刪除