Association rule learning has a wide range of applications. A common one: a convenience store analyzes which other products you are most likely to buy after buying product A, so that the two products can be placed together to raise customers' willingness to buy. Below is a simple example; consider the following list of customer purchases:
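(Something like the five transactions below, in the spirit of the example in *Machine Learning in Action*, is consistent with the numbers computed later in this post:)

1. soy milk, lettuce
2. lettuce, diapers, wine, chard
3. soy milk, diapers, wine, orange juice
4. lettuce, soy milk, diapers, wine
5. lettuce, soy milk, diapers, orange juice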
At first glance it is hard to spot any rule or pattern, but through association analysis you can discover that a customer who buys "diapers" is very likely to also buy "wine". The oft-cited explanation for findings like this is the classic "diapers and beer" retail story: parents sent out to buy diapers tend to pick up a drink for themselves on the same trip.
Basic concept of association analysis:
With that opening out of the way, let's look at what association analysis actually does. The first important parameter is called support. You don't want a rule you find to appear only once or twice among tens of thousands of records, so you set a minimum support to tell the analysis the minimum frequency a rule must have. In the example above, 3 records contain both "diapers" and "wine", so the support of {diapers, wine} = 3/5 (5 being the total number of records). The other important parameter is confidence, which you can think of as a conditional probability. In the example, "diapers" appears in 4 records, and "diapers" and "wine" appear together in 3, so the confidence of diapers -> wine = 3/4.
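As a quick sanity check, a few lines of Python reproduce those two numbers from the (hypothetical) transaction list above:

```python
# Hypothetical transactions from the example above; item names are illustrative.
transactions = [
    {"soy milk", "lettuce"},
    {"lettuce", "diapers", "wine", "chard"},
    {"soy milk", "diapers", "wine", "orange juice"},
    {"lettuce", "soy milk", "diapers", "wine"},
    {"lettuce", "soy milk", "diapers", "orange juice"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"diapers", "wine"} <= t)
diapers = sum(1 for t in transactions if "diapers" in t)
print(both / n)        # support({diapers, wine}) = 3/5 = 0.6
print(both / diapers)  # confidence(diapers -> wine) = 3/4 = 0.75
```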
With these concepts in hand, let's look at the Apriori algorithm, which, given a support and a confidence threshold, finds the association rules for you.
Apriori algorithm:
First, let's look at what this algorithm is for and its characteristics:
Simply put, the algorithm takes records, each of which is a set of items, and finds association rules among them (when A appears, B is very likely to appear as well, etc.). So the first thing the algorithm must do is enumerate the possible itemsets formed from the full item set. Suppose the item set is {0,1,2,3}; the possible combinations are enumerated below:
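A minimal sketch of that enumeration (the variable names here are mine, not part of the algorithm below):

```python
from itertools import combinations

items = [0, 1, 2, 3]
for k in range(1, len(items) + 1):
    print(list(combinations(items, k)))
# k=1: (0,), (1,), (2,), (3,)
# k=2: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
# k=3: (0,1,2), (0,1,3), (0,2,3), (1,2,3)
# k=4: (0,1,2,3)
# 2**4 - 1 = 15 non-empty itemsets in total
```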
The problem is not generating these combinations per se; it is that when the item set is large enough, generating all of them (2^n - 1 non-empty itemsets) gets very expensive. We therefore need some pruning tricks to save time. This is the Apriori property: if an itemset is infrequent, every superset of it is infrequent too. So once we find that the support computed for {2,3} is below the threshold, derived combinations such as {1,2,3}, {0,2,3}, and {0,1,2,3} can be skipped without ever generating them:
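The pruning itself amounts to a subset test; a tiny sketch, continuing the hypothetical enumeration above:

```python
from itertools import combinations

infrequent = frozenset({2, 3})  # assume {2,3} fell below the support threshold
candidates = [frozenset(c) for c in combinations([0, 1, 2, 3], 3)]
survivors = [c for c in candidates if not infrequent <= c]
print(survivors)  # [frozenset({0, 1, 2}), frozenset({0, 1, 3})]
```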
Next, the pseudocode for the first step of the algorithm, which generates the frequent itemsets from the candidate itemsets:
- For each transaction tran in the dataset:
    - For each candidate itemset can:
        - Check whether can is a subset of tran
        - If so, increment the count of can
- For each candidate itemset:
    - If its support meets the minimum, keep this itemset
- Return the list of frequent itemsets
```python
def loadDataSet():
    """Return a toy transaction dataset."""
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]


def createC1(dataSet):
    """Build C1, the candidate itemsets of size 1."""
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    # frozenset makes each itemset hashable, so it can be a dict key later
    return list(map(frozenset, C1))


def scanD(D, Ck, minSupport):
    """Scan dataset D and keep the itemsets of Ck that meet minSupport."""
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems  # Calculate support for every itemset
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support  # Record the support of every itemset
    return retList, supportData
```
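A quick run of the two functions above (the itemset order in the output may vary):

```python
dataSet = loadDataSet()
C1 = createC1(dataSet)
D = list(map(set, dataSet))
L1, supportData = scanD(D, C1, 0.5)
print(L1)
# [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]
# Item 4 appears in only 1 of the 4 transactions (support 0.25 < 0.5), so it is dropped.
```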
We now have the itemsets of size 1 that satisfy the support threshold; next we use these elements to generate larger combinations.
Putting together the full Apriori algorithm:
The pseudocode for generating the combinations of size greater than one is as follows:
- While the number of items in the set is greater than 0:
    - Create a list of candidate itemsets of length k
    - Scan the dataset to see if each itemset is frequent
    - Keep frequent itemsets to create itemsets of length k+1
```python
def aprioriGen(Lk, k):
    """Generate the candidate itemsets C(k).

    Joins pairs of frequent (k-1)-itemsets whose first k-2 items are
    equal; this produces each candidate of size k exactly once.

    Args:
        Lk: A list of frequent itemsets of size k-1.
        k: Size of the candidates to generate.
    Returns:
        A list containing C(k).
    """
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):  # Join sets if first k-2 items are equal
            L1 = list(Lk[i])[:k - 2]
            L2 = list(Lk[j])[:k - 2]
            L1.sort()
            L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])  # Set union
    return retList


def apriori(dataSet, minSupport=0.5):
    """Generate the frequent itemsets L(1)~L(k).

    Args:
        dataSet: A list of transactions.
        minSupport: Minimum support for a candidate itemset.
    Returns:
        A tuple (L, supportData).
        L: A list whose (i-1)-th entry holds the frequent itemsets of size i.
        supportData: A dict mapping each itemset to its support value.
    """
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))  # list() so D can be scanned repeatedly
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:  # Stop once no larger frequent itemsets exist
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
```
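Running the full algorithm on the toy dataset (again, the order within each level may vary):

```python
L, supportData = apriori(loadDataSet(), minSupport=0.5)
for i, Lk in enumerate(L, start=1):
    print(i, Lk)
# Frequent itemsets found:
#   size 1: {1}, {3}, {2}, {5}
#   size 2: {1,3}, {2,3}, {3,5}, {2,5}
#   size 3: {2,3,5}
#   size 4: none, so the while loop stopped
```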
Mining association rules from frequent item sets:
Now that we have the combinations that satisfy the support threshold, the next step is to mine association rules from them. The first step, as before, is to generate the candidate rule set. Suppose we have the frequent itemset {0,1,2,3}; rules are generated by moving items from the antecedent to the consequent, and any rule that fails the confidence threshold is pruned, together with every rule whose consequent is a superset of a failing rule's consequent (the confidence analogue of the support-based pruning above):
The code below carries out exactly these steps: generate the candidate rule set -> evaluate their confidence -> collect the acceptable rules:
```python
def generateRules(L, supportData, minConf=0.7):
    """Generate rules from the frequent itemsets.

    Only rules whose confidence is at least minConf are kept.

    Args:
        L: List of frequent-itemset lists, as returned by apriori().
        supportData: Dict mapping each frequent itemset to its support.
        minConf: Minimum confidence.
    Returns:
        A list of tuples (antecedent, consequent, confidence).
    """
    bigRuleList = []
    for i in range(1, len(L)):  # Only sets with two or more items form rules
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]  # 1-item consequents
            H1 = calcConf(freqSet, H1, supportData, bigRuleList, minConf)
            # Note: the book's original code called calcConf() only when
            # i == 1, so rules with a 1-item consequent were never reported
            # for itemsets of size > 2; evaluating H1 first fixes that.
            if i > 1 and len(H1) > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList
```
```python
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    """Prune the consequent set H.

    Evaluate every candidate consequent against the minimum confidence
    and keep the ones that pass.

    Args:
        freqSet: Frequent itemset.
        H: List of candidate consequent sets.
        supportData: Dict mapping each frequent itemset to its support.
        brl: List collecting tuples (antecedent, consequent, confidence).
        minConf: Minimum confidence.
    Returns:
        The consequents whose rules meet the confidence threshold.
    """
    prunedH = []
    for conseq in H:
        # conf(A -> B) = support(A | B) / support(A)
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH


def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    """Generate further candidate rules.

    Shifts one more item from the antecedent to the consequent to form
    new candidate rules, recursing while the consequent can still grow.

    Args:
        freqSet: Frequent itemset the rules are built from.
        H: List of current consequent sets.
        supportData: Dict mapping each frequent itemset to its support.
        brl: List collecting tuples (antecedent, consequent, confidence).
        minConf: Minimum confidence.
    Returns:
        None
    """
    m = len(H[0])
    if len(freqSet) > (m + 1):  # Try further merging
        Hmp1 = aprioriGen(H, m + 1)  # Create the new candidates of size m+1
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:  # Need at least two consequents to merge again
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
```
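Finally, mining the rules from the toy dataset with the default minConf=0.7 (print order may vary; the last two rules appear thanks to the 1-item-consequent fix noted in generateRules):

```python
L, supportData = apriori(loadDataSet())
rules = generateRules(L, supportData, minConf=0.7)
# frozenset({1}) --> frozenset({3}) conf: 1.0
# frozenset({5}) --> frozenset({2}) conf: 1.0
# frozenset({2}) --> frozenset({5}) conf: 1.0
# frozenset({3, 5}) --> frozenset({2}) conf: 1.0
# frozenset({2, 3}) --> frozenset({5}) conf: 1.0
```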
Supplement:
* [ ML Article Collection ] Apriori Algorithm Introduction