程式扎記: [ ML In Action ] Unsupervised learning : Efficiently finding frequent itemsets with FP-growth

Preface:
在上一章 Association analysis with the Apriori algorithm 我們使用 Apriori 演算法從 transaction list 找到 data 間的關係, 常見的應用是便利商店可以用來判斷當你買了商品 A 後接著會買其他商品的可能組合. 而在這個過程中我們第一步要找的是所謂的 frequent item set, 並透過設定 minimum support 來 filter 不常見的 item(s). 而在 Apriori 是使用窮舉法找出所有可能的 data set 組合, 雖然有利用 pruning 的技巧, 但是效率仍然不是很理想, 在處理大量資料時可能會遇到處理速度的瓶頸. 這邊要介紹另一種演算法, 讓你能更快速的從 transaction set 中找出 frequent item set. 該演算法需要建立一種特殊的資料結構稱為 FP-Tree (FP means Frequent Pattern) 而該演算法為 FP-growth algorithm.

FP-trees: an efficient way to encode a dataset:
在使用 FP-growth algorithm 前必須先建立 FP-Tree, 接著使用 FP-Tree 來 mining 出 frequent item set. 該演算法有以下特性:

Pros: Usually faster than Apriori.
Cons: Difficult to implement; certain datasets degrade the performance.
Works with: Nominal values.

首先我們來看一個簡單的範例, 來了解 FP-Tree 是如何建立. 假設你有如下的 transaction list:

在使用 transaction list 前, 我們要做一些前置處理. 就如在 Apriori 中處理 frequent item set 一樣, 我們會根據 minimum support 移除不常出現的 item (這邊 minimum support=3); 接著將 transaction 紀錄依據出現頻率從高到低進行 sorting. 處理完結果如下:

接著我們就要用處理過後的 transaction list 來建立 FP-Tree 時. 在 FP-Tree 在最上層的 root 是空集合的符號. 底下是前兩筆 transaction 建立 FP-Tree 的過程:

在 item 旁邊的數字說明該 item 出現的頻率 (z:1 說明 'z' 目前頻率為 1), 可能會隨著加入越來越多的 transaction 而增加. 另外為了迅速找到某個 item 在 FP-Tree 中的位置, 我們會建立一個 Header Table (鍵為 item, 值為該 item 在 FP-Tree 中的位置), 而每個 Node 都會有一個 link 連到同樣 item 的下一個 Node 而形成類似 linked list 的結構, 完整的 FP-Tree 建立結果長相如下:

Figure 12.2 FP-tree with header table. The header table serves as a starting point to find similar items.

下一個部份我們要來看如何透過代碼來建立 FP-Tree 並利用 FP-Tree 來 mining frequent items. 在開始 coding 前, 我們定義下面的類別作為 FP-Tree 的 node 以記載相關資料:

view plaincopy to clipboardprint?
class treeNode:  
    def __init__(self, nameValue, numOccur, parentNode):  
        self.name = nameValue       # item name  
        self.count = numOccur       # frequency   
        self.nodeLink = None        # pointer to next node with same item name  
        self.parent = parentNode    # needs to be updated  
        self.children = {}          # next node according to sorting transaction set  
      
    def inc(self, numOccur):  
        self.count += numOccur  
          
    def disp(self, ind=1):  
        print '  '*ind, self.name, ' ', self.count  
        for child in self.children.values():  
            child.disp(ind+1)  

底下我們要先來看看從 transaction set 建立 FP-Tree 的部分代碼:
- Listing 12.2 FP-tree creation code

view plaincopy to clipboardprint?
"""Create tree based on given .  
    Args:  
        dataSet: Input transaction set.  
        minSup: Used to filter out infrequent item(s).  
          
    Returns:  
        retTree: FP-Tree(treeNode) or None if no frequent item  
        headerTable: Header table of FP-Tree or None if no frequent item  
                     headerTable[item] = (frequency,)  
          
    Raises:  
        None  
"""  
def createTree(dataSet, minSup=1): #create FP-tree from dataset but don't mine  
    headerTable = {}  
    #go over dataSet twice  
    for trans in dataSet:#first pass counts frequency of occurance  
        for item in trans:  
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]  
    for k in headerTable.keys():  #remove items not meeting minSup  
        if headerTable[k] < minSup:   
            del(headerTable[k])  
    freqItemSet = set(headerTable.keys())  
    #print 'freqItemSet: ',freqItemSet  
    if len(freqItemSet) == 0: return None, None  #if no items meet min support -->get out  
    for k in headerTable:  
        headerTable[k] = [headerTable[k], None] #reformat headerTable to use Node link   
    #print 'headerTable: ',headerTable  
    retTree = treeNode('Null Set', 1, None) #create tree  
    for tranSet, count in dataSet.items():  #go through dataset 2nd time  
        localD = {}  
        for item in tranSet:  #put transaction items in order  
            if item in freqItemSet:  
                localD[item] = headerTable[item][0]  
        if len(localD) > 0:  
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]  
            updateTree(orderedItems, retTree, headerTable, count)#populate tree with ordered freq itemset  
    return retTree, headerTable #return tree and header table  
  
""" Recursively update the FP-Tree  
    Args:  
        items: Ordered frequent items  
        inTree: FP-Tree node  
        headerTable: Header table of FP-Tree  
        count: Count of exist same transaction of input items.  
          
    Returns:  
        None  
      
    Raises:  
        None  
"""  
def updateTree(items, inTree, headerTable, count):  
    if items[0] in inTree.children:#check if orderedItems[0] in retTree.children  
        inTree.children[items[0]].inc(count) #incrament count  
    else:   #add items[0] to inTree.children  
        inTree.children[items[0]] = treeNode(items[0], count, inTree)  
        if headerTable[items[0]][1] == None: #update header table   
            headerTable[items[0]][1] = inTree.children[items[0]]  
        else:  
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])  
    if len(items) > 1:#call updateTree() with remaining ordered items  
        updateTree(items[1::], inTree.children[items[0]], headerTable, count)  
          
def updateHeader(nodeToTest, targetNode):   #this version does not use recursion  
    while (nodeToTest.nodeLink != None):    #Do not use recursion to traverse a linked list!  
        nodeToTest = nodeToTest.nodeLink  
    nodeToTest.nodeLink = targetNode  

底下是產生測試 transaction set 的代碼:
- Listing 12.3 Simple dataset and data wrapper

view plaincopy to clipboardprint?
"""Create testing transaction set  
    Args:  
        None  
          
    Returns:  
        simpDat: Transaction set  
          
    Raises:  
        None  
"""  
def loadSimpDat():  
    simpDat = [['r', 'z', 'h', 'j', 'p'],  
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],  
               ['z'],  
               ['r', 'x', 'n', 'o', 's'],  
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],  
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]  
    return simpDat  
  
"""Build up directory to contains (transaction->frequency)  
    Args:  
        dataSet: Transaction set created by API:loadSimpDat()  
          
    Return:  
        retDict:  
          
    Raise:  
        None  
"""  
def createInitSet(dataSet):  
    retDict = {}  
    for trans in dataSet:  
        retDict[frozenset(trans)] = 1  
    return retDict  

接著假設代碼撰寫在 "fpGrowth.py", 我們可以如下測試 FP-Tree 的建立過程:

Mining frequent items from an FP-tree:
當你建立 FP-Tree, 接下來要 mining 出 frequent items 就相當 straight forward, 以下是後續的三個動作:

1. Get conditional pattern bases from the FP-tree.
2. From the conditional pattern base, construct a conditional FP-tree.
3. Recursively repeat steps 1 and 2 on until the tree contains a single item.

那什麼是 "conditional pattern bases"? 你可以想像成剛剛產生的 frequent item set 中每個 item 從剛剛建立的 FP-Tree 可以走出所有可能的 path, 而該 path 的最後一個 item 為 target item. 舉例來說, 當 target item="z" 時, 從 FP-Tree 能走到 "z" 的所有走法只有 {Φ->z}5 (5 指的是頻率):

如果 target item="r", 則走法有 {Φ->z->r}1 , {Φ->z->x->y->r}1, {Φ->x->s->r}1:

如果拿掉 path 的頭尾, 底下整理出所有 frequent item set 的 "conditional pattern bases":

而地下是找出 "conditional pattern bases" 的代碼, 而在找的過程中代碼利用 Header table 指向每一個 frequent item, 並利用其 nodeLink 屬性指向下一個相同 item 的 Node; 並利用 Node 上面的的 parentNode 屬性往上走直到 root (parentNode=None):
- Listing 12.4 A function to find all paths ending with a given item

view plaincopy to clipboardprint?
"""Ascends from the left node to root and store the path to .  
    Args:  
        leafNode: Node of PF-Tree  
        prefixPath: list to store the traversal path  
          
    Returns:  
        None  
          
    Raises:  
        None  
"""          
def ascendTree(leafNode, prefixPath): #ascends from leaf node to root  
    if leafNode.parent != None:  
        prefixPath.append(leafNode.name)  
        ascendTree(leafNode.parent, prefixPath)  
          
"""Find prefix path according to given FP-Tree node as   
    Args:  
        basePat: ?  
        treeNode: Target FP-Tree node.  
          
    Return:  
        condPats: The directory to hold {conditional path->count}  
          
    Raises:  
        None  
"""  
def findPrefixPath(basePat, treeNode): #treeNode comes from header table  
    condPats = {}  
    while treeNode != None:  
        prefixPath = []  
        ascendTree(treeNode, prefixPath)  
        if len(prefixPath) > 1:   
            condPats[frozenset(prefixPath[1:])] = treeNode.count  
        treeNode = treeNode.nodeLink  
    return condPats  

接著底下是上面代碼的使用範例:

接下來利用這些找到的 "conditional pattern bases" 當作 transaction 在建立 "conditional FP-trees". 以 target item='t' 為範例, 底下是建立的過程:

(item 's' 與 'r' 沒有出現在 "conditional FP-trees" 是因為它們不滿足 minimum support=3 的限制)

如此第三步只是重複一, 二步. 因此我們只要利用 recursive 的方法便可以完成, 而 stop condition 為當 "conditional FP-trees" 為 None 時並終止 recursive. 底下為 mining 過程代碼:
- Listing 12.5 The mineTree function recursively finds frequent itemsets.

view plaincopy to clipboardprint?
"""Recursively finds frequent itemsets.  
    Args:  
        inTree: FP-Tree  
        headerTable: Header table of FP-Tree  
        preFix: Prefix paths  
        freqItemList: Frequent item list (Final result)  
          
    Returns:  
        None  
          
    Raises:  
        None  
"""  
def mineTree(inTree, headerTable, minSup, preFix, freqItemList):  
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])]#(sort header table)  
    for basePat in bigL:  #start from bottom of header table  
        newFreqSet = preFix.copy()  
        newFreqSet.add(basePat)  
        #print 'finalFrequent Item: ',newFreqSet    #append to set  
        freqItemList.append(newFreqSet)  
        condPattBases = findPrefixPath(basePat, headerTable[basePat][1])  
        #print 'condPattBases :',basePat, condPattBases  
        #2. construct cond FP-tree from cond. pattern base  
        myCondTree, myHead = createTree(condPattBases, minSup)  
        #print 'head from conditional tree: ', myHead  
        if myHead != None: #3. mine cond. FP-tree  
            #print 'conditional tree for: ',newFreqSet  
            #myCondTree.disp(1)              
            mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)  

最後的使用範例也相當簡單, 而上面完整代碼可到這裡下載:

Supplement:
* [ Python 範例代碼 ] Sorting directory into item key list according to the item value

程式扎記

標籤

2013年9月10日星期二

[ ML In Action ] Unsupervised learning : Efficiently finding frequent itemsets with FP-growth

2 則留言:

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2013年9月10日 星期二