程式扎記: [ DM Practical MLT] (4) Algorithms - Inferring rudimentary rules


Thursday, August 27, 2015


Introduction 
Here’s an easy way to find very simple classification rules from a set of instances. Called 1R for 1-rule, it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute. 1R is a simple, cheap method that often comes up with quite good rules for characterizing the structure in data. It turns out that simple rules frequently achieve surprisingly high accuracy. Perhaps this is because the structure underlying many real-world datasets is quite rudimentary, and just one attribute is sufficient to determine the class of an instance quite accurately. In any event, it is always a good plan to try the simplest things first. 

The idea is this: we make rules that test a single attribute and branch accordingly. Each branch corresponds to a different value of the attribute. It is obvious what is the best classification to give each branch: use the class that occurs most often in the training data. Then the error rate of the rules can easily be determined. Just count the errors that occur on the training data, that is, the number of instances that do not have the majority class. 

Each attribute generates a different set of rules, one rule for every value of the attribute. Evaluate the error rate for each attribute’s rule set and choose the best. It’s that simple! Figure 4.1 shows the algorithm in the form of pseudocode. 
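As a concrete illustration of the procedure just described, here is a minimal Python sketch of 1R (the function name `one_r` and the dict-based data representation are my own choices, not from the book; missing values can simply be encoded as an ordinary value such as "missing"):

```python
from collections import Counter

def one_r(instances, target):
    """Return (best_attribute, rules, errors) chosen by the 1R method."""
    best = None
    for attr in instances[0]:
        if attr == target:
            continue
        # For every value of this attribute, count the classes that co-occur with it.
        counts = {}
        for inst in instances:
            counts.setdefault(inst[attr], Counter())[inst[target]] += 1
        # Each value predicts its majority class; errors are the non-majority rows.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# The same records as the weather data file used in the lab demo, inlined here.
HEADERS = ["outlook", "temperature", "humidity", "windy", "play"]
ROWS = """sunny,hot,high,false,no
sunny,hot,high,true,no
overcast,hot,high,false,yes
rainy,mild,high,false,yes
rainy,cool,normal,false,yes
rainy,cool,normal,true,no
overcast,cool,normal,true,yes
sunny,mild,high,false,no
sunny,cool,normal,false,yes
rainy,mild,normal,false,yes
sunny,mild,normal,true,yes
overcast,mild,high,true,yes
overcast,hot,normal,false,yes
rainy,mild,high,true,no"""

data = [dict(zip(HEADERS, line.split(","))) for line in ROWS.splitlines()]
attr, rules, errors = one_r(data, "play")
print(attr, rules, errors)   # outlook is chosen, with 4 errors on the training data
```

On the weather data this selects outlook with four training errors, matching the discussion that follows.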
 

To see the 1R method at work, consider the weather data of Table 1.2 (we will encounter it many times again when looking at how learning algorithms work). To classify on the final column, play, 1R considers four sets of rules, one for each attribute. These rules are shown in Table 4.1.
 

An asterisk indicates that a random choice has been made between two equally likely outcomes. The number of errors is given for each rule, along with the total number of errors for the rule set as a whole. 1R chooses the attribute that produces rules with the smallest number of errors—that is, the first and third rule sets. Arbitrarily breaking the tie between these two rule sets gives: 
  outlook: sunny -> no
           overcast -> yes
           rainy -> yes
We noted at the outset that the game for the weather data is unspecified. Oddly enough, it is apparently played when it is overcast or rainy but not when it is sunny. Perhaps it’s an indoor pursuit. 

Missing values and numeric attributes 
Although a very rudimentary learning method, 1R does accommodate both missing values and numeric attributes. It deals with these in simple but effective ways. Missing is treated as just another attribute value so that, for example, if the weather data had contained missing values for the outlook attribute, a rule set formed on outlook would specify four possible class values, one each for sunny, overcast, and rainy, and a fourth for missing.

We can convert numeric attributes into nominal ones using a simple discretization method. First, sort the training examples according to the values of the numeric attribute. This produces a sequence of class values. For example, sorting the numeric version of the weather data (Table 1.3) according to the values of temperature produces the sequence:

  64  65  68  69  70  71  72  72  75  75  80  81  83  85
  yes no  yes yes yes no  no  yes yes yes no  yes yes no

Discretization involves partitioning this sequence by placing breakpoints in it. One possibility is to place breakpoints wherever the class changes: 
  yes | no | yes yes yes | no no | yes yes yes | no | yes yes | no
Choosing breakpoints halfway between the examples on either side places them at 64.5, 66.5, 70.5, 72, 77.5, 80.5, and 84. However, the two instances with value 72 cause a problem because they have the same value of temperature but fall into different classes. The simplest fix is to move the breakpoint at 72 up one example, to 73.5, producing a mixed partition in which no is the majority class. 
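The breakpoint computation is easy to express in code. A small sketch of my own, using the sorted temperature values from the numeric weather data (Table 1.3):

```python
def class_change_breakpoints(pairs):
    """Place a breakpoint halfway between neighbouring examples whose class
    differs.  `pairs` is a list of (value, class) tuples sorted by value."""
    return [(v1 + v2) / 2
            for (v1, c1), (v2, c2) in zip(pairs, pairs[1:])
            if c1 != c2]

# Temperature column of the numeric weather data, sorted, with its classes.
temps   = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
classes = "yes no yes yes yes no no yes yes yes no yes yes no".split()
print(class_change_breakpoints(list(zip(temps, classes))))
# [64.5, 66.5, 70.5, 72.0, 77.5, 80.5, 84.0]
```

Note that the two instances with value 72 produce the problematic breakpoint at 72.0 itself, which is why the text moves it up to 73.5.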

A more serious problem is that this procedure tends to form a large number of categories. The 1R method will naturally gravitate toward choosing an attribute that splits into many categories, because this will partition the dataset into many classes, making it more likely that instances will have the same class as the majority in their partition. In fact, the limiting case is an attribute that has a different value for each instance, that is, an identification code attribute that pinpoints instances uniquely, and this will yield a zero error rate on the training set because each partition contains just one instance. Of course, highly branching attributes do not usually perform well on test examples; indeed, the identification code attribute will never predict any examples outside the training set correctly. This phenomenon is known as overfitting.

For 1R, overfitting is likely to occur whenever an attribute has a large number of possible values. Consequently, when discretizing a numeric attribute a rule is adopted that dictates a minimum number of examples of the majority class in each partition. Suppose that minimum is set at three. This eliminates all but two of the preceding partitions. Instead, the partitioning process begins 
  yes no yes yes | yes...
ensuring that there are three occurrences of yes, the majority class, in the first partition. However, because the next example is also yes, we lose nothing by including that in the first partition, too. This leads to a new division: 
  yes no yes yes yes | no no yes yes yes | no yes yes no
where each partition contains at least three instances of the majority class, except the last one, which will usually have fewer. Partition boundaries always fall between examples of different classes.

Whenever adjacent partitions have the same majority class, as do the first two partitions above, they can be merged together without affecting the meaning of the rule sets. Thus the final discretization is: 
  yes no yes yes yes no no yes yes yes | no yes yes no
which leads to the rule set 
  temperature: <= 77.5 -> yes
               > 77.5 -> no
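The partition-then-merge procedure described above can be sketched as follows; `partition_min_majority` and `merge_same_majority` are hypothetical helper names of my own, not from the book:

```python
from collections import Counter

def partition_min_majority(classes, min_majority=3):
    """Grow each partition until its majority class occurs `min_majority`
    times, keep absorbing a following run of that same class, then cut.
    Left-over examples form the (usually smaller) final partition."""
    parts, current = [], []
    for i, c in enumerate(classes):
        current.append(c)
        label, count = Counter(current).most_common(1)[0]
        if count >= min_majority:
            if i + 1 < len(classes) and classes[i + 1] == label:
                continue   # Next example is also the majority class: include it too.
            parts.append(current)
            current = []
    if current:            # The remainder becomes the last partition.
        parts.append(current)
    return parts

def merge_same_majority(parts):
    """Merge adjacent partitions that share the same majority class."""
    merged = [list(parts[0])]
    for p in parts[1:]:
        if Counter(merged[-1]).most_common(1)[0][0] == Counter(p).most_common(1)[0][0]:
            merged[-1] += p
        else:
            merged.append(list(p))
    return merged

seq = "yes no yes yes yes no no yes yes yes no yes yes no".split()
parts = partition_min_majority(seq)
print(parts)                       # the three partitions shown in the text
print(merge_same_majority(parts))  # the first two (both majority yes) merge
```

On the temperature sequence this reproduces the three partitions above and then merges the first two, leaving the single cut that corresponds to the 77.5 breakpoint.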
Discussion 
Despite its simplicity, 1R did astonishingly, even embarrassingly, well in comparison with state-of-the-art learning methods: on almost all of the datasets, the rules it produced were just a few percentage points less accurate than the decision trees induced by a state-of-the-art decision tree induction scheme, and those trees were in general considerably larger than 1R's rules. Rules that test a single attribute are often a viable alternative to more complex structures. This strongly encourages a simplicity-first methodology: establish the baseline performance using simple, rudimentary techniques before progressing to more sophisticated learning methods, which inevitably generate output that is harder for people to interpret.

The 1R procedure learns a one-level decision tree whose leaves represent the various different classes. A slightly more expressive technique is to use a different rule for each class. Each rule is a conjunction of tests, one for each attribute. For numeric attributes the test checks whether the value lies within a given interval; for nominal ones it checks whether it is in a certain subset of that attribute's values. These two types of test, intervals and subsets, are learned from the training data pertaining to each class. For a numeric attribute, the endpoints of the interval are the minimum and maximum values that occur in the training data for that class. For a nominal one, the subset contains just those values that occur for that attribute in the training data for the class. Rules representing different classes usually overlap, and at prediction time the class whose rule has the most matching tests is predicted. This simple technique often gives a useful first impression of a dataset. It is extremely fast and can be applied to very large quantities of data.
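This per-class technique can be sketched in a few lines of Python (the helper names `fit_class_rules` and `predict` are my own, and the tiny datasets below are illustrative):

```python
def fit_class_rules(instances, target, numeric_attrs=()):
    """One rule per class: for a numeric attribute the test is the [min, max]
    interval seen for that class; for a nominal one, the set of values seen."""
    rules = {}
    for inst in instances:
        rule = rules.setdefault(inst[target], {})
        for attr, val in inst.items():
            if attr == target:
                continue
            if attr in numeric_attrs:
                lo, hi = rule.get(attr, (val, val))
                rule[attr] = (min(lo, val), max(hi, val))
            else:
                rule.setdefault(attr, set()).add(val)
    return rules

def predict(rules, inst, numeric_attrs=()):
    """Predict the class whose rule has the most matching tests."""
    def matches(rule):
        score = 0
        for attr, test in rule.items():
            if attr in numeric_attrs:
                score += test[0] <= inst[attr] <= test[1]
            else:
                score += inst[attr] in test
        return score
    return max(rules, key=lambda cls: matches(rules[cls]))

data = [
    {"outlook": "sunny",    "humidity": "high",   "play": "no"},
    {"outlook": "sunny",    "humidity": "normal", "play": "yes"},
    {"outlook": "overcast", "humidity": "high",   "play": "yes"},
    {"outlook": "rainy",    "humidity": "high",   "play": "no"},
]
rules = fit_class_rules(data, "play")
print(predict(rules, {"outlook": "overcast", "humidity": "normal"}))  # yes
```

Because overlapping rules are resolved by the match count, an instance such as the overcast/normal one above is assigned to yes, whose rule passes both tests.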

Lab Demo 
First of all, we have to create our input training data (the first line holds the headers; the one marked with '*' is the target header to predict): 
- data/weathers.dat 
  Outlook,Temperature,Humidity,Windy,*Play
  sunny,hot,high,false,no
  sunny,hot,high,true,no
  overcast,hot,high,false,yes
  rainy,mild,high,false,yes
  rainy,cool,normal,false,yes
  rainy,cool,normal,true,no
  overcast,cool,normal,true,yes
  sunny,mild,high,false,no
  sunny,cool,normal,false,yes
  rainy,mild,normal,false,yes
  sunny,mild,normal,true,yes
  overcast,mild,high,true,yes
  overcast,hot,normal,false,yes
  rainy,mild,high,true,no
Then we can see how 1R works with the sample code below (the whole source code can be reached from GML): 
- OneR.groovy 
  package dm.basic.oner

  import dm.input.SimpleIn
  import flib.util.CountMap
  import flib.util.Tuple

  class OneR {
      public static void main(args)
      {
          // Reading training data
          SimpleIn si = new SimpleIn(new File("data/weathers.dat"))

          int ei = -1
          def headers = si.getHeaders()
          ei = headers.findIndexOf { h -> h.startsWith('*') }
          printf("\t[Info] Target Header: %d (%s)\n", ei, headers[ei])
          printf("\t[Info] Start 1R Algorithm...\n")
          def trainMap = [:]
          printf("\t\t0) Initialize...\n")
          headers.size().times { i ->
              if (i == ei) return
              def cmm = [:].withDefault { k -> return new CountMap() }
              trainMap[i] = new Tuple(headers[i], [], cmm, new TreeSet())
          }

          printf("\t\t1) Start Analyze data...\n")
          Iterator datIter = si.iterator()
          while (datIter.hasNext())
          {
              def r = datIter.next()
              def p = r[ei]                       // Class value of this record
              for (int i = 0; i < r.size(); i++)  // Visit every attribute column
              {
                  if (i == ei) continue
                  Tuple t = trainMap[i]
                  t.get(1).add(r[i])              // Collect the attribute value
                  t.get(2)[r[i]].count(p)         // Count class under this value
                  t.get(3).add(r[i])              // Record the distinct values
              }
          }

          printf("\t\t2) Pick attribute with less error...\n")
          int gErr = Integer.MAX_VALUE
          String attr = null
          for (Tuple t : trainMap.values())
          {
              int err = 0
              printf("\t\t\tHeader='%s':\n", t.get(0))
              for (String v : t.get(3))
              {
                  CountMap cm = t.get(2)[v]
                  def maj = cm.major()            /* Majority category/predict list */
                  def mc = maj[0]                 /* Pick up first majority category/predict */
                  def mcc = cm.getCount(mc)       /* Majority category/predict count */
                  def tc = cm.size()              /* Total count under this attribute value */
                  printf("\t\t\t\tValue='%s'->%s (%d/%d)\n", v, mc, mcc, tc)
                  err += (tc - mcc)
              }
              if (err < gErr)                     // Keep the attribute with fewest errors
              {
                  gErr = err
                  attr = t.get(0)
              }
          }

          printf("\t\t3) Generate 1R: %s\n", attr)
      }
  }
Execution Result: 
