程式扎記: [ DM Practical MLT] (4) Algorithms - Mining association rules

標籤

2015年10月15日 星期四

[ DM Practical MLT] (4) Algorithms - Mining association rules

Introduction (p145)
Association rules are like classification rules. You could find them in the same way, by executing a divide-and-conquer rule-induction procedure for each possible expression that could occur on the right-hand side of the rule. But not only might any attribute occur on the right-hand side with any possible value; a single association rule often predicts the value of more than one attribute. To find such rules, you would have to execute the rule-induction procedure once for every possible combination of attributes, with every possible combination of values, on the right-hand side. That would result in an enormous number of association rules, which would then have to be pruned down on the basis of their coverage (the number of instances that they predict correctly) and their accuracy (the same number expressed as a proportion of the number of instances to which the rule applies). This approach is quite infeasible. (Note that, as we mentioned before, what we are calling coverage is often called support and what we are calling accuracy is often called confidence.)

Instead, we capitalize on the fact that we are only interested in association rules with high coverage. We ignore, for the moment, the distinction between the left-and right-hand sides of a rule and seek combinations of attribute–value pairs that have a prespecified minimum coverage. These are called item setsan attribute–value pair is an item. The terminology derives from market basket analysis, in which the items are articles in your shopping cart and the supermarket manager is looking for associations among these purchases.

Item sets (p146)
The first column of Table 4.10 shows the individual items for the weather data of Table 1.2, with the number of times each item appears in the dataset given at the right. These are the one-item sets. The next step is to generate the two item sets by making pairs of one-item ones. Of course, there is no point in generating a set containing two different values of the same attribute (such as outlook = sunny and outlook = overcast), because that cannot occur in any actual instance.

































Assume that we seek association rules with minimum coverage 2: thus we discard any item sets that cover fewer than two instances. This leaves 47 two item sets, some of which are shown in the second column along with the number of times they appear. The next step is to generate the three-item sets, of which 39 have a coverage of 2 or greater. There are 6 four-item sets, and no five-item sets—for this data, a five-item set with coverage 2 or greater could only correspond to a repeated instance. The first row of the table, for example, shows that there are five days when outlook = sunny, two of which have temperature = mild, and, in fact, on both of those days humidity = high and play = no as well.

Association rules (p146)
Shortly we will explain how to generate these item sets efficiently. But first let us finish the story. Once all item sets with the required coverage have been generated, the next step is to turn each into a rule, or set of rules, with at least the specified minimum accuracy. Some item sets will produce more than one rule; others will produce none. For example, there is one three-item set with a coverage of 4 (row 38 of Table 4.10):




















humidity = normal, windy = false, play = yes

This set leads to seven potential rules:
If humidity = normal and windy = false then play = yes 4/4
If humidity = normal and play = yes then windy = false 4/6
If windy = false and play = yes then humidity = normal 4/6
If humidity = normal then windy = false and play = yes 4/7
If windy = false then humidity = normal and play = yes 4/8
If play = yes then humidity = normal and windy = false 4/9
If – then humidity = normal and windy = false and play = yes 4/12

The figures at the right show the number of instances for which all three conditions are true—that is, the coverage—divided by the number of instances for which the conditions in the antecedent are true. Interpreted as a fraction, they represent the proportion of instances on which the rule is correct—that is, its accuracy.Assuming that the minimum specified accuracy is 100%, only the first of these rules will make it into the final rule set. The denominators of the fractions are readily obtained by looking up the antecedent expression in Table 4.10 (though some are not shown in the Table). The final rule above has no conditions in the antecedent, and its denominator is the total number of instances in the dataset.

Table 4.11 shows the final rule set for the weather data, with minimum coverage 2 and minimum accuracy 100%, sorted by coverage. There are 58 rules, 3 with coverage 4, 5 with coverage 3, and 50 with coverage 2. Only 7 have two conditions in the consequent, and none has more than two. The first rule comes from the item set described previously. Sometimes several rules arise from the same item set. For example, rules 9, 10, and 11 all arise from the four-item set in row 6 of Table 4.10:
temperature = cool, humidity = normal, windy = false, play = yes



































which has coverage 2. Three subsets of this item set also have coverage 2:
temperature = cool, windy = false
temperature = cool, humidity = normal, windy = false
temperature = cool, windy = false, play = yes

and these lead to rules 9, 10, and 11, all of which are 100% accurate (on the training data).

Generating rules efficiently (p150)
We now consider in more detail an algorithm for producing association rules with specified minimum coverage and accuracy. There are two stages: generating item sets with the specified minimum coverage, and from each item set determining the rules that have the specified minimum accuracy.

The first stage proceeds by generating all one-item sets with the given minimum coverage (the first column of Table 4.10and then using this to generate the two-item sets (second column), three-item sets (third column), and so on. Each operation involves a pass through the dataset to count the items in each set, and after the pass the surviving item sets are stored in a hash table — a standard data structure that allows elements stored in it to be found very quickly. From the one-item sets, candidate two-item sets are generated, and then a pass is made through the dataset, counting the coverage of each two-item set; at the end the candidate sets with less than minimum coverage are removed from the table. The candidate two-item sets are simply all of the one-item sets taken in pairs, because a two-item set cannot have the minimum coverage unless both its constituent one-item sets have minimum coverage, too. This applies in general: a three-item set can only have the minimum coverage if all three of its two-item subsets have minimum coverage as well, and similarly for four-item sets.

An example will help to explain how candidate item sets are generated. Suppose there are five three-item sets—(A B C), (A B D), (A C D), (A C E), and (B C D)—where, for example, A is a feature such as outlook = sunny. The union of the first two, (A B C D), is a candidate four-item set because its other three item subsets (A C D) and (B C D) have greater than minimum coverage. If the three-item sets are sorted into lexical order, as they are in this list, then we need only consider pairs whose first two members are the same. For example, we do not consider (A C D) and (B C D) because (A B C D) can also be generated from (A B C) and (A B D), and if these two are not candidate three-item sets then (A B C D) cannot be a candidate four-item set. This leaves the pairs (A B C) and (A B D), which we have already explained, and (A C D) and (A C E). This second pair leads to the set (A C D E) whose three-item subsets do not all have the minimum coverage, so it is discarded. The hash table assists with this check: we simply remove each item from the set in turn and check that the remaining three-item set is indeed present in the hash table. Thus in this example there is only one candidate four-item set, (A B C D). Whether or not it actually has minimum coverage can only be determined by checking the instances in the dataset.

The second stage of the procedure takes each item set and generates rules from it, checking that they have the specified minimum accuracy. If only rules with a single test on the right-hand side were sought, it would be simply a matter of considering each condition in turn as the consequent of the rule, deleting it from the item set, and dividing the coverage of the entire item set by the coverage of the resulting subset—obtained from the hash table—to yield the accuracy of the corresponding rule. Given that we are also interested in association rules with multiple tests in the consequent, it looks like we have to evaluate the effect of placing each subset of the item set on the right-hand side, leaving the remainder of the set as the antecedent.

This brute-force method will be excessively computation intensive unless item sets are small, because the number of possible subsets grows exponentially with the size of the item set. However, there is a better way.We observed when describing association rules before that if the double-consequent rule
If windy = false and play = no then outlook = sunny and humidity = high

holds with a given minimum coverage and accuracy, then both single consequent rules formed from the same item set must also hold:
If humidity = high and windy = false and play = no then outlook = sunny
If outlook = sunny and windy = false and play = no then humidity = high

Conversely, if one or other of the single-consequent rules does not hold, there is no point in considering the double-consequent one. This gives a way of building up from single-consequent rules to candidate double-consequent ones, from double-consequent rules to candidate triple-consequent ones, and so on. Of course, each candidate rule must be checked against the hash table to see if it really does have more than the specified minimum accuracy. But this generally involves checking far fewer rules than the brute force method. It is interesting that this way of building up candidate (n + 1)-consequent rules from actual n-consequent ones is really just the same as building up candidate (n + 1)-item sets from actual n-item sets, described earlier.

Discussion (p151)
Association rules are often sought for very large datasets, and efficient algorithms are highly valued. The method described previously makes one pass through the dataset for each different size of item set. Sometimes the dataset is too large to read in to main memory and must be kept on disk; then it may be worth reducing the number of passes by checking item sets of two consecutive sizes in one go. For example, once sets with two items have been generated, all sets of three items could be generated from them before going through the instance set to count the actual number of items in the sets. More three-item sets than necessary would be considered, but the number of passes through the entire dataset would be reduced.

In practice, the amount of computation needed to generate association rules depends critically on the minimum coverage specified. The accuracy has less influence because it does not affect the number of passes that we must make through the dataset. In many situations we will want to obtain a certain number of rules—say 50—with the greatest possible coverage at a prespecified minimum accuracy level. One way to do this is to begin by specifying the coverage to be rather high and to then successively reduce it, re-executing the entire rule-finding algorithm for each coverage value and repeating this until the desired number of rules has been generated.

The tabular input format that we use throughout this book, and in particular a standard ARFF file based on it, is very inefficient for many association-rule problems. Association rules are often used when attributes are binary—either present or absent—and most of the attribute values associated with a given instance are absent. This is a case for the sparse data representation described in Section 2.4; the same algorithm for finding association rules applies.

Lab Demo
Here will use GML package to explain the usage of class Apriori.

Demo1 - How class Apriori works
Here we simulate how Apriori works to generate the association rules step by step. The sample code as below:
  1. static void Demo1()  
  2. {  
  3.     Apriori ai = new Apriori(confidence:0.5, minSupport:1)  
  4.     def dtSet = [[134], [235], [1235], [25]]  
  5.     printf("\t[Info] Data Set:\n%s\n", dtSet.toString())  
  6.     def C1 = ai.createC1(dtSet)         // 1) Build up C1  
  7.     printf("\t[Info] C1:\n%s\n", C1.toString())  
  8.     JTuple t = ai.scanD(dtSet, C1)      // 2) Build L1  
  9.     printf("\t[Info] Scan with support=1:\n%s\n", t)  
  10.     t = ai.genLk(dtSet)                 // 3) Build L=[L1,L2...,Lk], S=Support of each element from L1~Lk  
  11.     def L = t.get(0)                  
  12.     def S = t.get(1)  
  13.     printf("\t[Info] L,S:\n%s\n%s\n", L.toString(), S.toString())  
  14.     def rules = ai.generateRules(L, S)  // 4) Create Rules based on L,S  
  15.     printf("\t[Info] Generate Rule(%d):\n", rules.size())  
  16.     rules = ai.sortByConf(rules, S)     // 5) Sorting rules based on confidence desc  
  17.     for(JTuple rt in rules)  
  18.     {  
  19.         def ms = rt.get(0); def conseq = rt.get(1); def conf = rt.get(2)  
  20.         printf("\t\t%s --> %s (%.02f/%d)\n", ms, conseq, conf, S.get(ai.unSet(ms, conseq)))  
  21.     }  
  22. }  
The execution output look like:
[Info] Data Set:
[[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
[Info] C1:
[[1], [2], [3], [4], [5]]
[Info] Scan with support=1:
['[[1], [3], [2], [5]]','{[1]=2, [3]=3, [4]=1, [2]=3, [5]=3}']
[Info] L,S:
[[[1], [3], [2], [5]], [[1, 3], [2, 3], [3, 5], [2, 5]], [[2, 3, 5]]]
[[1]:2, [3]:3, [4]:1, [2]:3, [5]:3, [1, 3]:2, [2, 3]:2, [3, 5]:2, [2, 5]:3, [1, 2]:1, [1, 5]:1, [2, 3, 5]:2]
[Info] Generate Rule(11):
[1] --> [3] (1.00/2)
[2] --> [5] (1.00/3)
[5] --> [2] (1.00/3)
[5] --> [2, 3] (0.67/2)
[2] --> [3] (0.67/2)
[3] --> [2] (0.67/2)
[3] --> [5] (0.67/2)
[5] --> [3] (0.67/2)
[3] --> [1] (0.67/2)
[3] --> [2, 5] (0.67/2)
[2] --> [3, 5] (0.67/2)
 # [2] then [3,5] has confidence=0.67 and coverage=2

Demo2 - Normally, the user of GML will use Apriori this way:
  1. static void Demo2()  
  2. {  
  3.     // 1) Setup confidence/miniumu support of Apriori algorithm  
  4.     Apriori ai = new Apriori(confidence:1.0, minSupport:1)  
  5.     // 2) Provide input data wrapped in SimpleIn object  
  6.     JTuple t = ai.run(new SimpleIn(new File("data/weathers.dat")))  
  7.     // 3) Extract the rules  
  8.     def rules = t.get(0); def L = t.get(1); def S = t.get(2)  
  9.     printf("\t[Info] Generate Rule(%d):\n", rules.size())  
  10.     rules = ai.sortByConf(rules, S)  
  11.     for(JTuple rt in rules)  
  12.     {  
  13.         def ms = rt.get(0); def conseq = rt.get(1); def conf = rt.get(2)  
  14.         printf("\t\t%s --> %s (%.02f/%d)\n", ms, conseq, conf, S.get(ai.unSet(ms, conseq)))  
  15.     }  
  16. }  
Here using weather data as input data. The execution output look like:
[Info] Generate Rule(9):
[Outlook_overcast] --> [*Play_yes] (1.00/4)
[Temperature_cool, Windy_false] --> [*Play_yes, Humidity_normal] (1.00/2)
[Temperature_hot, *Play_yes] --> [Outlook_overcast, Windy_false] (1.00/2)
[Temperature_hot, Outlook_overcast] --> [*Play_yes, Windy_false] (1.00/2)
[Outlook_overcast, Windy_false] --> [Temperature_hot, *Play_yes] (1.00/2)
[Temperature_hot, Outlook_sunny] --> [Humidity_high, *Play_no] (1.00/2)
[Temperature_hot, *Play_no] --> [Humidity_high, Outlook_sunny] (1.00/2)
[Windy_false, *Play_no] --> [Humidity_high, Outlook_sunny] (1.00/2)
[Temperature_cool] --> [Humidity_normal] (1.00/4)


Supplement
Unsupervised learning : Association analysis with the Apriori algorithm
有關 Association rules learning 的應用相當廣泛, 常見的如便利商店用來分析當你買了 A 產品後, 最可能買的其他商品會是什麼, 如此便可以將這兩種商品擺在一塊以提高消費者的購買意願...


沒有留言:

張貼留言

網誌存檔

關於我自己

我的相片
Where there is a will, there is a way!