Neural networks can provide profound insights into the data supplied to them. However, you can’t just feed any sort of data directly into a neural network. This “raw” data must usually be normalized into a form that the neural network can process. This chapter will show how to normalize “raw” data for use by Encog. Before data can be normalized, we must first have data. Once you decide what the neural network should do, you must find data to teach the neural network how to perform a task. Fortunately, the Internet provides a wealth of information that can be used with neural networks.
Where to Get Data for Neural Networks
The Internet can be a great source of data for neural networks. Data found on the Internet can be in many different formats. One of the most convenient formats is the comma-separated value (CSV) format. Other times it may be necessary to create a spider or bot to obtain the data. One very useful source of data for neural networks is the Machine Learning Repository (http://kdd.ics.uci.edu/), which is run by the University of California at Irvine. The Machine Learning Repository site is a repository of various datasets that have been donated to the University of California. Several of these datasets will be used in this book.
Data obtained from sites such as those listed above often cannot be fed directly into neural networks. Neural networks can be very "intelligent," but they cannot receive just any sort of data and produce a meaningful result. Often the data must first be normalized. We will begin by defining normalization. Neural networks are designed to accept floating-point numbers as their input. Usually these input numbers should be in either the range -1 to +1 or 0 to +1 for maximum efficiency. The choice of range is often dictated by the activation function used, as certain activation functions have an exclusively positive range while others span both negative and positive values. The sigmoid activation function, for example, has a range of only positive numbers. Conversely, the hyperbolic tangent activation function has a range of both positive and negative numbers. The most common case is to use a hyperbolic tangent activation function with a normalization range of -1 to +1.
Recall the iris dataset from Chapter 1. This dataset could be applied to a classification problem. However, we have not yet seen how the data must actually be processed to make it useful to a neural network.
A sampling of the dataset is shown here. Each row holds four numeric measurements (sepal length, sepal width, petal length and petal width, in centimeters) followed by the species:

5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
There are really two different attribute types to consider. First, there are four numeric attributes. Each of these will simply map to an input neuron, and the values will need to be scaled to the range -1 to +1. Class attributes, sometimes called nominal attributes, present a unique challenge. In the example, the species of iris must be represented as one or more floating-point numbers. Because a three-member class is involved, the species will not map to a single neuron. Rather, it will be represented by either two or three neurons, depending on the normalization type used.
Normalizing Numeric Values
Normalizing a numeric value is essentially a process of mapping the existing numeric value to a well-defined numeric range, such as -1 to +1. Normalization causes all of the attributes to be in the same range, with no one attribute more powerful than the others. To normalize, the current numeric range of every attribute must be known. The current numeric ranges for each of the iris attributes are shown here:

Sepal length: 4.3 to 7.9
Sepal width: 2.0 to 4.4
Petal length: 1.0 to 6.9
Petal width: 0.1 to 2.5
The normalization equation is shown in Equation 2.2:

f(x) = ((x - d_L)(n_H - n_L)) / (d_H - d_L) + n_L

This equation normalizes a value x, where d_L and d_H represent the low and high values of the data, and n_L and n_H represent the low and high values of the desired normalization range. For example, to normalize a petal length of 2.0 to the range -1 to +1, the equation becomes:

f(2.0) = ((2.0 - 1.0)(1 - (-1))) / (6.9 - 1.0) + (-1)

This results in a value of approximately -0.66. This is the value that will be fed to the neural network. For regression, the neural network will return values, and these returned values will be normalized. To denormalize a value, Equation 2.3 is used:

x = (f(x) - n_L)(d_H - d_L) / (n_H - n_L) + d_L

To denormalize the value of -0.66, Equation 2.3 becomes:

x = ((-0.66) - (-1))(6.9 - 1.0) / (1 - (-1)) + 1.0

Once denormalized, the value of -0.66 becomes 2.0 again. It is important to note that the -0.66 value was rounded for the calculation here, so the result is only approximately 2.0. Encog provides built-in classes to perform both normalization and denormalization. These classes will be introduced later in this chapter.
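The two equations above are easy to capture in code. The following sketch is not part of Encog; it is a plain-Java helper (the class and method names are my own) that implements Equations 2.2 and 2.3 directly, using the petal-length range from the iris dataset:

```java
public class RangeNormalizer {
    private final double dataLow, dataHigh;  // observed data range (d_L, d_H)
    private final double normLow, normHigh;  // target range (n_L, n_H)

    public RangeNormalizer(double dataLow, double dataHigh,
                           double normLow, double normHigh) {
        this.dataLow = dataLow;
        this.dataHigh = dataHigh;
        this.normLow = normLow;
        this.normHigh = normHigh;
    }

    // Equation 2.2: map x from [dataLow, dataHigh] into [normLow, normHigh].
    public double normalize(double x) {
        return (x - dataLow) * (normHigh - normLow)
                / (dataHigh - dataLow) + normLow;
    }

    // Equation 2.3: invert the mapping, recovering the original value.
    public double denormalize(double n) {
        return (n - normLow) * (dataHigh - dataLow)
                / (normHigh - normLow) + dataLow;
    }

    public static void main(String[] args) {
        // Petal length ranges from 1.0 to 6.9 in the iris dataset.
        RangeNormalizer petalLength = new RangeNormalizer(1.0, 6.9, -1, 1);
        double n = petalLength.normalize(2.0);
        System.out.println(n);                          // about -0.661
        System.out.println(petalLength.denormalize(n)); // recovers 2.0 (up to rounding)
    }
}
```

Because denormalize is the algebraic inverse of normalize, a round trip returns the original value exactly, apart from floating-point rounding.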
Normalizing Nominal Values
Nominal values are used to name things. One very common example of a simple nominal value is gender. Something is either male or female. Another is any sort of Boolean question. Nominal values also include values that are either “yes/true” or “no/false.” However, not all nominal values have only two values. Nominal values can also be used to describe an attribute of something, such as color. Neural networks deal best with nominal values where the set is fixed. For the iris dataset, the nominal value to be normalized is the species. There are three different species to consider for the iris dataset and this value cannot change. If the neural network is trained with three species, it cannot be expected to recognize five species.
Encog supports two different ways to encode nominal values. The simplest means of representing nominal values is called “one-of-n” encoding. One-of-n encoding can often be hard to train, especially if there are more than a few nominal types to encode. Equilateral encoding is usually a better choice than the simpler one-of-n encoding. Both encoding types will be explored in the next two sections.
Understanding One-of-n Normalization
One-of-n is a very simple form of normalization. For an example, consider the iris dataset again. The input to the neural network is statistics about an individual iris. The output signifies the species to which that iris belongs. The three iris species are listed as follows:

Setosa
Versicolor
Virginica
If using one-of-n normalization, the neural network would have three output neurons. Each of these three neurons would represent one iris species. The iris species predicted by the neural network would correspond to the output neuron with the highest activation. Generating training data for one-of-n is relatively easy. Simply assign a +1 to the neuron that corresponds to the chosen iris and a -1 to the remaining neurons. For example, assuming Setosa is assigned to the first output neuron, the Setosa iris species would be encoded as follows:

1, -1, -1
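One-of-n encoding is simple enough to implement directly. The sketch below is not Encog code (the class and method names are my own); it encodes a class index as an array of -1s with a single +1, and decodes by picking the neuron with the highest activation:

```java
import java.util.Arrays;

public class OneOfN {
    // Encode class index 'target' among 'classCount' classes:
    // +1 for the chosen neuron, -1 for every other neuron.
    public static double[] encode(int target, int classCount) {
        double[] result = new double[classCount];
        Arrays.fill(result, -1.0);
        result[target] = 1.0;
        return result;
    }

    // Decode by choosing the output neuron with the highest activation.
    public static int decode(double[] output) {
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Species order assumed: 0=Setosa, 1=Versicolor, 2=Virginica.
        System.out.println(Arrays.toString(encode(0, 3))); // [1.0, -1.0, -1.0]
        // A raw network output rarely matches the ideal exactly;
        // the highest neuron still identifies the class.
        System.out.println(decode(new double[] {0.9, -0.7, -0.2})); // 0
    }
}
```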
Understanding Equilateral Normalization
The output neurons are constantly checked against the ideal output values provided in the training set. The error between the actual output and the ideal output is represented as a percentage. This can cause a problem for the one-of-n normalization method. Consider if the neural network had predicted a Versicolor iris when it should have predicted a Virginica iris. With the species ordered Setosa, Versicolor, Virginica, the actual output and ideal output would be as follows:

Ideal output: -1, -1, 1
Actual output: -1, 1, -1
The problem is that only two of the three output neurons are incorrect. We would like to spread the "guilt" for this error over a larger percentage of the neurons. To do this, a unique set of values for each class must be determined. Each set of values should have an equal Euclidean distance from the others. The equal distance makes sure that incorrectly choosing Setosa for Versicolor has the same error weight as incorrectly choosing Setosa for Virginica.
This can be done using the Equilateral class. The following code segment shows how to use the Equilateral class to generate these values:
Listing 2.1: Calculated Class Equilateral Values 3 Classes
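The exact values in such a listing come from Encog's Equilateral class, which computes equidistant codes for any class count. As a self-contained illustration of the idea, the sketch below is plain Java, not the Equilateral class: for three classes, any three mutually equidistant points will do, such as the vertices of an equilateral triangle inscribed in the unit circle. The specific values it prints are illustrative, not necessarily Encog's exact output:

```java
public class EquilateralSketch {
    // One valid equilateral encoding for 3 classes in 2 dimensions:
    // vertices of an equilateral triangle inscribed in the unit circle.
    public static double[][] encode3() {
        double[][] codes = new double[3][2];
        for (int i = 0; i < 3; i++) {
            double angle = Math.PI / 2 + i * 2 * Math.PI / 3; // 120 degrees apart
            codes[i][0] = Math.cos(angle);
            codes[i][1] = Math.sin(angle);
        }
        return codes;
    }

    // Euclidean distance between two codes.
    public static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] codes = encode3();
        for (double[] c : codes) {
            System.out.printf(java.util.Locale.ROOT, "%.4f, %.4f%n", c[0], c[1]);
        }
        // Every pair of class codes is the same distance apart.
        System.out.println(distance(codes[0], codes[1]));
        System.out.println(distance(codes[1], codes[2]));
        System.out.println(distance(codes[0], codes[2]));
    }
}
```

The three printed distances are identical, which is exactly the property that spreads the error evenly across classes.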
Notice that there are two outputs for each of the three classes. Equilateral encoding always requires one fewer output neuron than one-of-n encoding would. Equilateral encoding is never used for fewer than three classes.
Look at the earlier example again, this time with equilateral normalization. Just as before, consider if the neural network had predicted a Versicolor iris when it should have predicted a Virginica iris. The output and ideal are as follows:
In this case there are only two neurons, as is consistent with equilateral encoding. Now all of the neurons are producing incorrect values. Additionally, there are only two output neurons to process, slightly decreasing the complexity of the neural network. Neural networks will rarely produce output that exactly matches any of the training values. To deal with this in one-of-n encoding, look at which output neuron produced the highest output. This method does not work for equilateral encoding. Instead, for equilateral encoding, find which calculated class equilateral value (Listing 2.1) has the shortest distance to the actual output of the neural network; that class is the network's prediction.
What is meant by each of the sets being equal in distance from each other? It means that their Euclidean distance is equal. The Euclidean distance can be calculated using Equation 2.5:

d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )
In the above equation, the variable q represents the ideal output value and the variable p represents the actual output value; there are n pairs of ideal and actual values. Equilateral normalization is implemented by the Equilateral class in Encog. It is usually unnecessary to deal with the Equilateral class directly. Rather, one of the higher-level normalization methods described later in this chapter is used.
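Putting Equation 2.5 to work, decoding an equilateral output is a nearest-code search. The sketch below is not Encog's Equilateral.decode method, and the class codes it uses are illustrative values chosen for this example:

```java
public class EquilateralDecode {
    // Return the index of the class code closest (Euclidean) to the output.
    public static int decode(double[][] codes, double[] output) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < codes.length; i++) {
            double sum = 0;
            for (int j = 0; j < output.length; j++) {
                double d = output[j] - codes[i][j]; // (p_j - q_j)
                sum += d * d;
            }
            double dist = Math.sqrt(sum); // Equation 2.5
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative equidistant codes for three classes
        // (not necessarily Encog's exact values).
        double[][] codes = {
            { 0.0, 1.0 },      // class 0
            { -0.866, -0.5 },  // class 1
            { 0.866, -0.5 }    // class 2
        };
        double[] actual = { 0.7, -0.3 }; // raw network output
        System.out.println(decode(codes, actual)); // closest code is class 2
    }
}
```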
If you are interested in the precise means by which the equilateral numbers are calculated, visit the following URL: http://www.heatonresearch.com/wiki/Equilateral
Encog provides a number of different means of normalizing data. The exact means used will be determined by what you are trying to accomplish. The three methods for normalization are summarized here:

Normalizing individual numbers
Normalizing memory arrays
Normalizing CSV files
The next three sections will look at all three, beginning with normalizing individual numbers.
Normalizing Individual Numbers
Very often you will simply want to normalize or denormalize a single number whose range of values is already known. In this case, it is unnecessary to incur the overhead of having Encog automatically discover the ranges for you. The "Lunar Lander" program is a good example of this. You can find the "Lunar Lander" example here.
Normalizing Memory Arrays
To quickly normalize an array, the NormalizeArray class can be useful. This class works by normalizing one attribute at a time. An example of the NormalizeArray class at work can be seen in the sunspot prediction example. This example can be found here:
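The sketch below is not Encog's NormalizeArray class itself; it is a self-contained illustration of what that class does: scan an array once to discover its low and high values, then map every element into a target range (here -1 to +1) using Equation 2.2. The sunspot figures are made-up sample values:

```java
import java.util.Arrays;

public class ArrayNormalizer {
    // Normalize an entire array into [normLow, normHigh], discovering
    // the data's own low and high values automatically.
    public static double[] process(double[] input, double normLow, double normHigh) {
        double dataLow = Double.MAX_VALUE;
        double dataHigh = -Double.MAX_VALUE;
        for (double v : input) {
            dataLow = Math.min(dataLow, v);
            dataHigh = Math.max(dataHigh, v);
        }
        double[] result = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            result[i] = (input[i] - dataLow) * (normHigh - normLow)
                    / (dataHigh - dataLow) + normLow;
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative monthly sunspot counts, normalized before being
        // fed to a predictive network.
        double[] raw = { 5.0, 11.0, 16.0, 23.0, 36.0 };
        System.out.println(Arrays.toString(process(raw, -1, 1)));
    }
}
```

The smallest element always maps to the low end of the target range and the largest to the high end, so no attribute dominates the others.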
Normalizing CSV Files
If the data to be normalized is already stored in CSV files, Encog Analyst should be used to normalize the data. Encog Analyst can be used both through the Encog Workbench and directly from Java and C#. This section explains how to use it through Java to normalize the Iris data set. To normalize a file, look at the file normalization example found at the following location:
The output will be a normalized version of the input file, as shown below:
Implementing Basic File Normalization
In the last section, you saw how Encog Analyst normalizes a file. In this section, you will learn the programming code necessary to accomplish this. Begin by accessing the source and target files:
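At its core, file normalization is two passes over the data: one pass to discover each column's numeric range, and a second pass to rewrite every value with the normalization equation. The sketch below illustrates that idea in plain Java; it is not the Encog Analyst API, and it simplifies heavily (all columns are assumed numeric, there are no headers, and rows are kept in memory rather than streamed from files):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CsvNormalizeSketch {
    // Normalize every numeric column of simple CSV content into [-1, 1].
    public static List<String> normalize(List<String> rows) {
        int cols = rows.get(0).split(",").length;
        double[] low = new double[cols];
        double[] high = new double[cols];
        Arrays.fill(low, Double.MAX_VALUE);
        Arrays.fill(high, -Double.MAX_VALUE);

        // First pass: discover each column's range.
        for (String row : rows) {
            String[] parts = row.split(",");
            for (int c = 0; c < cols; c++) {
                double v = Double.parseDouble(parts[c]);
                low[c] = Math.min(low[c], v);
                high[c] = Math.max(high[c], v);
            }
        }

        // Second pass: rewrite each value using the normalization equation.
        List<String> result = new ArrayList<>();
        for (String row : rows) {
            String[] parts = row.split(",");
            StringBuilder line = new StringBuilder();
            for (int c = 0; c < cols; c++) {
                double v = Double.parseDouble(parts[c]);
                double n = (v - low[c]) * 2.0 / (high[c] - low[c]) - 1.0;
                if (c > 0) line.append(",");
                line.append(String.format(Locale.ROOT, "%.4f", n));
            }
            result.add(line.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows (sepal length, sepal width).
        List<String> rows = List.of("5.1,3.5", "4.9,3.0", "7.0,3.2");
        for (String row : normalize(rows)) {
            System.out.println(row);
        }
    }
}
```

Encog Analyst automates both passes, and additionally handles headers, nominal columns, and the bookkeeping of the discovered ranges.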
Saving the Normalization Script
Encog keeps statistics on normalized data. This data, called the normalization stats, tells Encog the numeric ranges for each attribute that was normalized. This data can be saved so that it does not need to be renormalized each time. To save a stats file, use the following command:
Customizing File Normalization
The Encog Analyst contains a collection of AnalystField objects. These objects hold the normalization type and range for each attribute. This collection can be accessed directly to change how the attributes are normalized. AnalystField objects can also be removed, excluding them from the final output. The following code shows how to access each of the fields determined by the wizard.
System.out.println("Fields found in file:");
for (AnalystField field : analyst.getScript().getNormalize().getNormalizedFields()) {
    System.out.println(field.getName());
}