Tuesday, June 18, 2013

[ ML In Action ] Unsupervised learning : Association analysis with the Apriori algorithm

Preface:

diapers ➞ beer
The most famous example of association analysis is diapers ➞ beer. It has been reported that a grocery store chain in the midwestern United States noticed that men were buying diapers and beer together on Thursdays. The store could have profited from this by placing diapers and beer close together and making sure they were full price on Thursdays, but it did not.

Basic concept of association analysis:
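
Two measures drive everything below: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A ➞ B is support(A ∪ B) / support(A). A tiny hypothetical basket dataset makes them concrete (the transactions below are illustrative only, not the dataset used in the code listings):

```python
# Hypothetical 4-transaction dataset, for illustration only
transactions = [{'diapers', 'beer'},
                {'diapers'},
                {'beer', 'milk'},
                {'diapers', 'beer', 'milk'}]
n = len(transactions)

# support({diapers, beer}) = fraction of transactions containing both items
support_db = sum({'diapers', 'beer'} <= t for t in transactions) / n  # 2/4
# confidence(diapers -> beer) = support({diapers, beer}) / support({diapers})
support_d = sum({'diapers'} <= t for t in transactions) / n           # 3/4
confidence = support_db / support_d
```

Here diapers ➞ beer holds with confidence 2/3: of the three baskets containing diapers, two also contain beer.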

Apriori algorithm:

For each transaction tran in the dataset:
    For each candidate itemset can:
        Check to see whether can is a subset of tran
        If so, increment the count of can
For each candidate itemset:
    If its support meets the minimum, keep this itemset
Return the list of frequent itemsets

def loadDataSet():
    """Return a toy transaction dataset."""
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    """Build C1, the list of candidate itemsets of size one."""
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    return list(map(frozenset, C1)) # Use frozenset so itemsets can be dict keys

def scanD(D, Ck, minSupport):
    """Scan dataset D, count each candidate in Ck, and keep those meeting minSupport."""
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key]/numItems # Calculate support for every itemset
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData
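
The C1 ➞ L1 step can also be traced with a condensed, self-contained sketch (the Counter-based counting below is a compact stand-in for scanD's nested loops; names mirror the listing but this runs on its own):

```python
from collections import Counter

dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
D = [set(t) for t in dataSet]

# createC1: every individual item becomes a frozenset candidate
C1 = sorted({frozenset([item]) for t in dataSet for item in t}, key=min)

# scanD: count how many transactions contain each candidate, then filter
minSupport = 0.5
counts = Counter(can for tid in D for can in C1 if can <= tid)
supportData = {can: counts[can] / len(D) for can in C1}
L1 = [can for can in C1 if supportData[can] >= minSupport]
# {4} appears in only one of the four transactions (support 0.25), so it is pruned
```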

Putting together the full Apriori algorithm:

While the number of items in the set is greater than 0:
    Create a list of candidate itemsets of length k
    Scan the dataset to see whether each itemset is frequent
    Keep frequent itemsets to create itemsets of length k+1

def aprioriGen(Lk, k): # creates Ck
    """Generate the candidate itemsets Ck.

    Joins pairs of frequent (k-1)-itemsets whose first k-2 items are
    equal, which produces each candidate k-itemset exactly once.

    Args:
        Lk: A list of frequent itemsets of size k-1
        k: Size of the candidate itemsets to generate

    Returns:
        A list of candidate itemsets of size k
    """
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk): # Join sets if first k-2 items are equal
            L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]
            L1.sort(); L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

def apriori(dataSet, minSupport=0.5):
    """Generate all frequent itemsets L1 ~ Lk.

    Args:
        dataSet: A list of transactions
        minSupport: Minimum support for a frequent itemset

    Returns:
        A tuple (L, supportData)
        L: A list of lists; L[k-1] holds the frequent itemsets of size k
        supportData: Dict mapping each candidate itemset to its support value
    """
    C1 = createC1(dataSet)
    D = list(map(set, dataSet)) # Materialize so D can be scanned repeatedly
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Ck = aprioriGen(L[k-2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
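
On the toy dataset, apriori(dataSet) with the default minSupport of 0.5 should leave nine frequent itemsets in total. A self-contained brute-force enumeration (illustrative, not from the book) can serve as a cross-check:

```python
from itertools import combinations

dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
D = [set(t) for t in dataSet]

def support(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= tid for tid in D) / len(D)

# Enumerate every possible itemset and keep those meeting minSupport;
# Apriori reaches the same result while generating far fewer candidates.
items = sorted({i for t in dataSet for i in t})
frequent = {frozenset(c): support(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= 0.5}
```

The nine survivors are the four frequent single items ({4} is pruned), the four frequent pairs, and {2, 3, 5}.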

Mining association rules from frequent item sets:

def generateRules(L, supportData, minConf=0.7):
    """Generate association rules from the frequent itemsets.

    Args:
        L: List of lists of frequent itemsets, as returned by apriori()
        supportData: Support dict with key as frequent itemset and value as support.
        minConf: Minimum confidence

    Returns:
        A list of (antecedent, consequent, confidence) tuples
    """
    bigRuleList = []
    for i in range(1, len(L)):   # Get only sets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet] # Single-item consequents
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList,
                                minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    """Prune the consequent set H.

    Prunes the consequents based on the given minimum confidence.

    Args:
        freqSet: Frequent itemset
        H: List of candidate consequents
        supportData: Support dict with key as frequent itemset and value as support.
        brl: List to hold tuple(antecedent set, consequent set, confidence)
        minConf: Minimum confidence

    Returns:
        The consequents whose rules meet minConf
    """
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH
46.
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    """Generate candidate rule sets.

    Shifts items from the antecedent to the consequent to generate new
    rules, recursing to larger consequents while the antecedent stays
    non-empty.

    Args:
        freqSet: Frequent itemset the rules are drawn from
        H: List of candidate consequents, all of the same size
        supportData: Support dict with key as frequent itemset and value as support.
        brl: List to hold tuple(antecedent set, consequent set, confidence)
        minConf: Minimum confidence

    Returns:
        None
    """
    m = len(H[0])
    if (len(freqSet) > m):
        # Test the current consequents first. The book's version merged
        # before testing, which silently dropped rules with single-item
        # consequents from itemsets of three or more items.
        Hmp1 = calcConf(freqSet, H, supportData, brl, minConf)
        if (len(Hmp1) > 1 and len(freqSet) > (m + 1)):
            Hmp1 = aprioriGen(Hmp1, m + 1) # Merge surviving consequents
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
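
With minSupport = 0.5 and minConf = 0.7, the toy dataset admits five rules. A self-contained brute-force enumeration (illustrative names, not from the book) that tests every antecedent/consequent split of every frequent itemset:

```python
from itertools import combinations

dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
D = [set(t) for t in dataSet]

def support(itemset):
    return sum(itemset <= tid for tid in D) / len(D)

minSupport, minConf = 0.5, 0.7
items = sorted({i for t in dataSet for i in t})
# All frequent itemsets of size >= 2 (a rule needs both sides non-empty)
freq = [frozenset(c) for k in range(2, len(items) + 1)
        for c in combinations(items, k) if support(frozenset(c)) >= minSupport]

rules = []
for fs in freq:
    for m in range(1, len(fs)):                        # consequent size
        for conseq in map(frozenset, combinations(sorted(fs), m)):
            conf = support(fs) / support(fs - conseq)  # conf(A -> B) = sup(A|B)/sup(A)
            if conf >= minConf:
                rules.append((fs - conseq, conseq, conf))
```

Note that the original listing in the book reports only three of the five (1 ➞ 3, 5 ➞ 2, and 2 ➞ 5): because rulesFromConseq never tests single-item consequents for itemsets of three or more items, {3, 5} ➞ {2} and {2, 3} ➞ {5} are missed.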

Supplement:
* [ ML Article Collection ] Apriori Algorithm Introduction
