程式扎記: [ ML In Action ] Unsupervised learning : Association analysis with the Apriori algorithm


Tuesday, June 18, 2013


Preface: 
Association rule learning has a wide range of applications. A common one is a retail store analyzing which other products a customer is most likely to buy after buying product A; the two products can then be placed next to each other to raise the chance of a purchase. As a simple example, consider the following list of customer transactions:
(figure: a sample list of five grocery transactions)

It is hard to spot any rule or pattern at a glance, but through association analysis you can discover that a customer who bought "diapers" is very likely to also buy "wine". The most famous story of this kind is the rule below:
diapers ➞ beer
The most famous example of association analysis is diapers ➞ beer. It has been reported that a grocery store chain in the Midwest of the United States noticed that men bought diapers and beer on Thursdays. The store could have profited from this by placing diapers and beer close together and making sure they were full price on Thursdays, but they did not.

Basic concept of association analysis: 
With the opening example in mind, let us look at what association analysis actually does. The first important parameter is called support. You do not want a rule that appears only once or twice among tens of thousands of transactions, so you set a minimum support to tell the analysis the minimum frequency a pattern must reach. In the example above, "diapers" and "wine" appear together in 3 transactions, so the support of {diapers, wine} is 3/5 (5 being the total number of transactions). The other important parameter is called confidence, which you can think of as a conditional probability: "diapers" appears in 4 transactions, and "wine" appears together with "diapers" in 3 of them, so the confidence of the rule diapers ➞ wine is 3/4.
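The two numbers can be checked with a few lines of Python. This is a minimal sketch; the item names and transactions are made up for illustration, arranged so that {diapers, wine} appears in 3 of 5 transactions and diapers in 4, matching the counts in the text.

```python
# Illustrative transactions (item names are assumed, not from any real data).
transactions = [
    {'soy milk', 'lettuce'},
    {'lettuce', 'diapers', 'wine', 'chard'},
    {'soy milk', 'diapers', 'wine', 'orange juice'},
    {'lettuce', 'soy milk', 'diapers', 'wine'},
    {'lettuce', 'soy milk', 'diapers', 'orange juice'},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability: support(A ∪ B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({'diapers', 'wine'}))       # 3/5 = 0.6
print(confidence({'diapers'}, {'wine'}))  # 3/4 = 0.75
```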

With these concepts in place, we can now look at the Apriori algorithm, which, given support and confidence thresholds, finds the association rules for you.

Apriori algorithm: 
First, the purpose and characteristics of this algorithm:
(figure: general characteristics of the Apriori algorithm)

Simply put, the algorithm takes records, each of which is a set of items, and finds association rules among them (when A appears, B is very likely to appear as well, etc.). The first step is therefore to enumerate the possible itemsets the items can form. Suppose our items are {0,1,2,3}; the possible itemsets are:
(figure: all candidate itemsets over {0,1,2,3})

The problem is not generating these itemsets per se, but that doing so becomes very time-consuming once the item set is large. We therefore need pruning techniques to cut down the time spent generating them. For example, once we find that the support of {2,3} falls below the minimum support, every itemset derived from it, such as {2,3,1}, {2,3,0}, and {0,1,2,3}, can be skipped without ever being generated:
(figure: itemset lattice with the supersets of {2,3} pruned away)
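The pruning idea (the "Apriori principle") can be sketched in a few lines. `has_infrequent_subset` is a hypothetical helper, not part of the book's listings: a k-itemset can only be frequent if every one of its (k-1)-subsets is frequent, so any candidate containing an infrequent subset is dropped without scanning the data.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori principle: if any (k-1)-subset of a k-itemset is not
    frequent, the k-itemset cannot be frequent and can be pruned."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Suppose {2,3} failed the minimum-support test, so only these
# 2-itemsets survived the previous pass:
frequent_2 = {frozenset(s) for s in [{0, 1}, {0, 2}, {1, 2}, {1, 3}]}

# {0,1,2}: all of its 2-subsets are frequent, so it is kept;
# {1,2,3}: contains the infrequent subset {2,3}, so it is pruned.
print(has_infrequent_subset({0, 1, 2}, frequent_2))  # False -> keep
print(has_infrequent_subset({1, 2, 3}, frequent_2))  # True  -> prune
```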

Next, the pseudo code for the first step of the algorithm, which generates the frequent candidate itemsets:
For every transaction tran in the dataset:
    For each candidate itemset can:
        Check to see if can is a subset of tran
        If so, increment the count of can
For each candidate itemset:
    If the support meets the minimum, keep this itemset
Return the list of frequent itemsets
The corresponding implementation (in Python 3):
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    """Build C1, the list of candidate 1-itemsets."""
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    return [frozenset(c) for c in C1]  # frozensets can be used as dict keys

def scanD(D, Ck, minSupport):
    """Scan dataset D, counting each candidate in Ck and keeping those
    whose support meets minSupport."""
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems  # Calculate support for every itemset
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData
An example run:
(console session: running createC1 and scanD on the sample dataset)
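The session can be reproduced with the snippet below, a standalone Python 3 restatement of `loadDataSet` / `createC1` / `scanD` so that it runs on its own:

```python
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    # Collect the distinct items, sorted, as single-element frozensets.
    items = sorted({item for tran in dataSet for item in tran})
    return [frozenset([item]) for item in items]

def scanD(D, Ck, minSupport):
    # Count each candidate, then keep those meeting the support threshold.
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList, supportData = [], {}
    for key, cnt in ssCnt.items():
        support = cnt / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

dataSet = loadDataSet()
D = list(map(set, dataSet))
L1, suppData = scanD(D, createC1(dataSet), 0.5)
print(L1)  # the frequent 1-itemsets: {1}, {2}, {3}, {5}; {4} is dropped (support 0.25)
```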

We now have the 1-itemsets that satisfy the support threshold; the next step is to use them to generate larger itemsets.

Putting together the full Apriori algorithm 
The pseudo code for generating itemsets with more than one item:
While the number of items in the set is greater than 0:
    Create a list of candidate itemsets of length k
    Scan the dataset to see if each itemset is frequent
    Keep the frequent itemsets to create itemsets of length k+1
The implementation:
def aprioriGen(Lk, k):  # creates Ck
    """Generate C(k).

    Join pairs of frequent (k-1)-itemsets whose first k-2 items are
    equal, so every k-candidate is produced exactly once.

    Args:
        Lk: List of frequent (k-1)-itemsets, i.e. L(k-1)
        k: Size of the candidate itemsets to generate

    Returns:
        A list containing C(k)
    """
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):  # Join sets if first k-2 items are equal
            L1 = sorted(Lk[i])[:k - 2]
            L2 = sorted(Lk[j])[:k - 2]
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

def apriori(dataSet, minSupport=0.5):
    """Generate L(1)~L(k).

    Args:
        dataSet: List of transactions
        minSupport: Minimum support for a frequent itemset

    Returns:
        A tuple (L, supportData)
        L: A list of frequent-itemset lists; L[0] is L(1), L[1] is L(2), ...
        supportData: Dict mapping each itemset to its support value
    """
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))  # Materialize, since D is scanned once per pass
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
An example run that generates all the frequent itemsets:
(console session: running apriori on the sample dataset)
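The candidate-generation step is worth a standalone look. The sketch below restates `aprioriGen` so it runs on its own; for k=2 the first k-2 items of any pair trivially agree, so every pair of frequent 1-itemsets is merged, while for larger k the prefix test keeps each candidate from being produced twice.

```python
def aprioriGen(Lk, k):
    """Merge two frequent (k-1)-itemsets into a k-candidate when their
    first k-2 items (in sorted order) are equal."""
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            L1 = sorted(Lk[i])[:k - 2]
            L2 = sorted(Lk[j])[:k - 2]
            if L1 == L2:  # join only when the shared prefix agrees
                retList.append(Lk[i] | Lk[j])
    return retList

# The frequent 1-itemsets found for the sample dataset:
L1 = [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]
print(aprioriGen(L1, 2))
# six 2-item candidates: {2,5}, {3,5}, {1,5}, {2,3}, {1,2}, {1,3}
```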

Mining association rules from frequent item sets: 
Now that we have the itemsets that meet the support threshold, the next step is to mine association rules from them. As before, the first step is to generate the candidate rule sets. Suppose we have the frequent itemset {0,1,2,3}; below is the process of generating rule sets from it, where the grayed-out rules fail the confidence threshold and are pruned:
(figure: rule lattice for {0,1,2,3} with low-confidence rules grayed out)

The code below performs exactly these steps: generate the candidate rule sets -> evaluate their confidence -> collect the acceptable rules:
def generateRules(L, supportData, minConf=0.7):
    """Generate association rules from the frequent itemsets.

    Args:
        L: List of frequent-itemset lists returned by apriori()
        supportData: Dict with frequent itemset as key and support as value
        minConf: Minimum confidence

    Returns:
        Rule list of tuples (antecedent, consequent, confidence)
    """
    bigRuleList = []
    for i in range(1, len(L)):  # Only itemsets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]  # One-item consequents
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    """Prune the consequent set (H) based on the minimum confidence.

    Args:
        freqSet: Frequent itemset
        H: List of candidate consequents
        supportData: Dict with frequent itemset as key and support as value
        brl: List collecting tuples (antecedent, consequent, confidence)
        minConf: Minimum confidence

    Returns:
        The consequents that pass the confidence test
    """
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    """Generate candidate rule sets.

    Shift one more item from the antecedent to the consequent to
    generate new candidate rules, recursing while possible.

    Args:
        freqSet: Frequent itemset of a specific length
        H: List of candidate consequents
        supportData: Dict with frequent itemset as key and support as value
        brl: List collecting the accepted rules
        minConf: Minimum confidence

    Returns:
        None
    """
    m = len(H[0])
    if len(freqSet) > (m + 1):  # Try further merging
        Hmp1 = aprioriGen(H, m + 1)  # Create consequents of size m+1
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
An example of the association rules generated with support=0.5 and confidence=0.7:


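The rule-evaluation step can be sketched on its own. The snippet below hard-codes the support values that scanD computes for the sample dataset (verifiable by hand against [[1,3,4],[2,3,5],[1,2,3,5],[2,5]]) and applies the confidence test from calcConf; `rule_conf` is a hypothetical helper name.

```python
# Support values for the sample dataset [[1,3,4],[2,3,5],[1,2,3,5],[2,5]].
supportData = {
    frozenset({1}): 0.5,  frozenset({2}): 0.75,
    frozenset({3}): 0.75, frozenset({5}): 0.75,
    frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5,
}

def rule_conf(freqSet, conseq):
    """Confidence of (freqSet - conseq) --> conseq."""
    return supportData[freqSet] / supportData[freqSet - conseq]

minConf = 0.7
rules = []
for freqSet in [frozenset({1, 3}), frozenset({2, 5}), frozenset({3, 5})]:
    for item in freqSet:
        conseq = frozenset([item])
        conf = rule_conf(freqSet, conseq)
        if conf >= minConf:  # keep only high-confidence rules
            rules.append((freqSet - conseq, conseq, conf))
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
# Three rules survive: {1}-->{3}, {5}-->{2}, {2}-->{5}, each with conf 1.0;
# e.g. 3-->1 is pruned because 0.5/0.75 < 0.7.
```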
Supplement: 
* [ ML Articles ] Apriori Algorithm Introduction
