程式扎記: [ ML In Action ] Unsupervised learning : Association analysis with the Apriori algorithm


Tuesday, June 18, 2013


Preface: 
Association rule learning has a wide range of applications. A common one is a retail store analyzing which other products a customer is most likely to buy after buying product A; the two products can then be placed next to each other to raise the chance of a purchase. As a simple example, consider the following list of customer transactions:
(figure: a sample list of five grocery transactions)

It is hard to spot any rule or pattern at a glance, but through association analysis you can discover that a customer who bought "diapers" is very likely to also buy "wine". The most famous story of this kind is the rule below:
diapers ➞ beer
The most famous example of association analysis is diapers ➞ beer. It has been reported that a grocery store chain in the Midwest of the United States noticed that men bought diapers and beer on Thursdays. The store could have profited from this by placing diapers and beer close together and making sure they were full price on Thursdays, but they did not.

Basic concept of association analysis: 
With the opening example in mind, let us look at what association analysis actually does. The first important parameter is called support. You do not want a rule that appears only once or twice among tens of thousands of transactions, so you set a minimum support to tell the analysis the minimum frequency a pattern must reach. In the example above, "diapers" and "wine" appear together in 3 transactions, so the support of {diapers, wine} is 3/5 (5 being the total number of transactions). The other important parameter is called confidence, which you can think of as a conditional probability: "diapers" appears in 4 transactions, and "wine" appears together with "diapers" in 3 of them, so the confidence of the rule diapers ➞ wine is 3/4.
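The two numbers can be checked with a few lines of Python. This is a minimal sketch; the item names and transactions are made up for illustration, arranged so that {diapers, wine} appears in 3 of 5 transactions and diapers in 4, matching the counts in the text.

```python
# Illustrative transactions (item names are assumed, not from any real data).
transactions = [
    {'soy milk', 'lettuce'},
    {'lettuce', 'diapers', 'wine', 'chard'},
    {'soy milk', 'diapers', 'wine', 'orange juice'},
    {'lettuce', 'soy milk', 'diapers', 'wine'},
    {'lettuce', 'soy milk', 'diapers', 'orange juice'},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability: support(A ∪ B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({'diapers', 'wine'}))       # 3/5 = 0.6
print(confidence({'diapers'}, {'wine'}))  # 3/4 = 0.75
```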

With these concepts in place, we can now look at the Apriori algorithm, which, given support and confidence thresholds, finds the association rules for you.

Apriori algorithm: 
First, the purpose and characteristics of this algorithm:
(figure: general characteristics of the Apriori algorithm)

Simply put, the algorithm takes records, each of which is a set of items, and finds association rules among them (when A appears, B is very likely to appear as well, etc.). The first step is therefore to enumerate the possible itemsets the items can form. Suppose our items are {0,1,2,3}; the possible itemsets are:
(figure: all candidate itemsets over {0,1,2,3})

The problem is not generating these itemsets per se, but that doing so becomes very time-consuming once the item set is large. We therefore need pruning techniques to cut down the time spent generating them. For example, once we find that the support of {2,3} falls below the minimum support, every itemset derived from it, such as {2,3,1}, {2,3,0}, and {0,1,2,3}, can be skipped without ever being generated:
(figure: itemset lattice with the supersets of {2,3} pruned away)
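The pruning idea (the "Apriori principle") can be sketched in a few lines. `has_infrequent_subset` is a hypothetical helper, not part of the book's listings: a k-itemset can only be frequent if every one of its (k-1)-subsets is frequent, so any candidate containing an infrequent subset is dropped without scanning the data.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori principle: if any (k-1)-subset of a k-itemset is not
    frequent, the k-itemset cannot be frequent and can be pruned."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Suppose {2,3} failed the minimum-support test, so only these
# 2-itemsets survived the previous pass:
frequent_2 = {frozenset(s) for s in [{0, 1}, {0, 2}, {1, 2}, {1, 3}]}

# {0,1,2}: all of its 2-subsets are frequent, so it is kept;
# {1,2,3}: contains the infrequent subset {2,3}, so it is pruned.
print(has_infrequent_subset({0, 1, 2}, frequent_2))  # False -> keep
print(has_infrequent_subset({1, 2, 3}, frequent_2))  # True  -> prune
```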

Next, the pseudo code for the first step of the algorithm, which generates the frequent candidate itemsets:
For every transaction tran in the dataset:
    For each candidate itemset can:
        Check to see if can is a subset of tran
        If so, increment the count of can
For each candidate itemset:
    If the support meets the minimum, keep this itemset
Return the list of frequent itemsets
The corresponding implementation (in Python 3):
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    """Build C1, the list of candidate 1-itemsets."""
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    return [frozenset(c) for c in C1]  # frozensets can be used as dict keys

def scanD(D, Ck, minSupport):
    """Scan dataset D, counting each candidate in Ck and keeping those
    whose support meets minSupport."""
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems  # Calculate support for every itemset
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData
An example run:
(console session: running createC1 and scanD on the sample dataset)
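The session can be reproduced with the snippet below, a standalone Python 3 restatement of `loadDataSet` / `createC1` / `scanD` so that it runs on its own:

```python
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    # Collect the distinct items, sorted, as single-element frozensets.
    items = sorted({item for tran in dataSet for item in tran})
    return [frozenset([item]) for item in items]

def scanD(D, Ck, minSupport):
    # Count each candidate, then keep those meeting the support threshold.
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList, supportData = [], {}
    for key, cnt in ssCnt.items():
        support = cnt / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

dataSet = loadDataSet()
D = list(map(set, dataSet))
L1, suppData = scanD(D, createC1(dataSet), 0.5)
print(L1)  # the frequent 1-itemsets: {1}, {2}, {3}, {5}; {4} is dropped (support 0.25)
```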

We now have the 1-itemsets that satisfy the support threshold; the next step is to use them to generate larger itemsets.

Putting together the full Apriori algorithm 
The pseudo code for generating itemsets with more than one item:
While the number of items in the set is greater than 0:
    Create a list of candidate itemsets of length k
    Scan the dataset to see if each itemset is frequent
    Keep the frequent itemsets to create itemsets of length k+1
The implementation:
def aprioriGen(Lk, k):  # creates Ck
    """Generate C(k).

    Join pairs of frequent (k-1)-itemsets whose first k-2 items are
    equal, so every k-candidate is produced exactly once.

    Args:
        Lk: List of frequent (k-1)-itemsets, i.e. L(k-1)
        k: Size of the candidate itemsets to generate

    Returns:
        A list containing C(k)
    """
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):  # Join sets if first k-2 items are equal
            L1 = sorted(Lk[i])[:k - 2]
            L2 = sorted(Lk[j])[:k - 2]
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

def apriori(dataSet, minSupport=0.5):
    """Generate L(1)~L(k).

    Args:
        dataSet: List of transactions
        minSupport: Minimum support for a frequent itemset

    Returns:
        A tuple (L, supportData)
        L: A list of frequent-itemset lists; L[0] is L(1), L[1] is L(2), ...
        supportData: Dict mapping each itemset to its support value
    """
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))  # Materialize, since D is scanned once per pass
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
An example run that generates all the frequent itemsets:
(console session: running apriori on the sample dataset)
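The candidate-generation step is worth a standalone look. The sketch below restates `aprioriGen` so it runs on its own; for k=2 the first k-2 items of any pair trivially agree, so every pair of frequent 1-itemsets is merged, while for larger k the prefix test keeps each candidate from being produced twice.

```python
def aprioriGen(Lk, k):
    """Merge two frequent (k-1)-itemsets into a k-candidate when their
    first k-2 items (in sorted order) are equal."""
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            L1 = sorted(Lk[i])[:k - 2]
            L2 = sorted(Lk[j])[:k - 2]
            if L1 == L2:  # join only when the shared prefix agrees
                retList.append(Lk[i] | Lk[j])
    return retList

# The frequent 1-itemsets found for the sample dataset:
L1 = [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]
print(aprioriGen(L1, 2))
# six 2-item candidates: {2,5}, {3,5}, {1,5}, {2,3}, {1,2}, {1,3}
```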

Mining association rules from frequent item sets: 
Now that we have the itemsets that meet the support threshold, the next step is to mine association rules from them. As before, the first step is to generate the candidate rule sets. Suppose we have the frequent itemset {0,1,2,3}; below is the process of generating rule sets from it, where the grayed-out rules fail the confidence threshold and are pruned:
(figure: rule lattice for {0,1,2,3} with low-confidence rules grayed out)

The code below performs exactly these steps: generate the candidate rule sets -> evaluate their confidence -> collect the acceptable rules:
def generateRules(L, supportData, minConf=0.7):
    """Generate association rules from the frequent itemsets.

    Args:
        L: List of frequent-itemset lists returned by apriori()
        supportData: Dict with frequent itemset as key and support as value
        minConf: Minimum confidence

    Returns:
        Rule list of tuples (antecedent, consequent, confidence)
    """
    bigRuleList = []
    for i in range(1, len(L)):  # Only itemsets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]  # One-item consequents
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    """Prune the consequent set (H) based on the minimum confidence.

    Args:
        freqSet: Frequent itemset
        H: List of candidate consequents
        supportData: Dict with frequent itemset as key and support as value
        brl: List collecting tuples (antecedent, consequent, confidence)
        minConf: Minimum confidence

    Returns:
        The consequents that pass the confidence test
    """
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    """Generate candidate rule sets.

    Shift one more item from the antecedent to the consequent to
    generate new candidate rules, recursing while possible.

    Args:
        freqSet: Frequent itemset of a specific length
        H: List of candidate consequents
        supportData: Dict with frequent itemset as key and support as value
        brl: List collecting the accepted rules
        minConf: Minimum confidence

    Returns:
        None
    """
    m = len(H[0])
    if len(freqSet) > (m + 1):  # Try further merging
        Hmp1 = aprioriGen(H, m + 1)  # Create consequents of size m+1
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
An example of the association rules generated with support=0.5 and confidence=0.7:


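The rule-evaluation step can be sketched on its own. The snippet below hard-codes the support values that scanD computes for the sample dataset (verifiable by hand against [[1,3,4],[2,3,5],[1,2,3,5],[2,5]]) and applies the confidence test from calcConf; `rule_conf` is a hypothetical helper name.

```python
# Support values for the sample dataset [[1,3,4],[2,3,5],[1,2,3,5],[2,5]].
supportData = {
    frozenset({1}): 0.5,  frozenset({2}): 0.75,
    frozenset({3}): 0.75, frozenset({5}): 0.75,
    frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5,
}

def rule_conf(freqSet, conseq):
    """Confidence of (freqSet - conseq) --> conseq."""
    return supportData[freqSet] / supportData[freqSet - conseq]

minConf = 0.7
rules = []
for freqSet in [frozenset({1, 3}), frozenset({2, 5}), frozenset({3, 5})]:
    for item in freqSet:
        conseq = frozenset([item])
        conf = rule_conf(freqSet, conseq)
        if conf >= minConf:  # keep only high-confidence rules
            rules.append((freqSet - conseq, conseq, conf))
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
# Three rules survive: {1}-->{3}, {5}-->{2}, {2}-->{5}, each with conf 1.0;
# e.g. 3-->1 is pruned because 0.5/0.75 < 0.7.
```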
Supplement: 
* [ ML Articles ] Apriori Algorithm Introduction
