Neural networks can provide profound insights into the data supplied to them. However, you can’t just feed any sort of data directly into a neural network. This “raw” data must usually be normalized into a form that the neural network can process. This chapter will show how to normalize “raw” data for use by Encog. Before data can be normalized, we must first have data. Once you decide what the neural network should do, you must find data to teach the neural network how to perform a task. Fortunately, the Internet provides a wealth of information that can be used with neural networks.
Where to Get Data for Neural Networks
The Internet can be a great source of data for neural networks. Data found on the Internet can be in many different formats. One of the most convenient formats is the comma-separated value (CSV) format. Other times it may be necessary to create a spider or bot to obtain the data. One very useful source of data for neural networks is the Machine Learning Repository (http://kdd.ics.uci.edu/), which is run by the University of California at Irvine. The Machine Learning Repository site is a repository of various datasets that have been donated to the University of California. Several of these datasets will be used in this book.
Data obtained from sites such as those listed above often cannot be fed directly into neural networks. Neural networks can be very "intelligent," but they cannot receive just any sort of data and produce a meaningful result. Often the data must first be normalized. We will begin by defining normalization. Neural networks are designed to accept floating-point numbers as their input. Usually these input numbers should be in either the range -1 to +1 or 0 to +1 for maximum efficiency. The choice of range is often dictated by the activation function used, as certain activation functions have an exclusively positive range while others span both negative and positive values. The sigmoid activation function, for example, has a range of only positive numbers. Conversely, the hyperbolic tangent activation function has a range of both positive and negative numbers. The most common case is to use a hyperbolic tangent activation function with a normalization range of -1 to +1.
Recall the iris dataset from Chapter 1. This dataset could be applied to a classification problem. However, we have not yet seen how the data must actually be processed to make it useful to a neural network.
A sampling of the dataset is shown here. Each row holds four numeric measurements (sepal length, sepal width, petal length and petal width, in centimeters) followed by the species:

5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
There are really two different attribute types to consider. First, there are four numeric attributes. Each of these will simply map to an input neuron, and the values will need to be scaled to the range -1 to +1. Class attributes, sometimes called nominal attributes, present a unique challenge. In the example, the species of iris must be represented as one or more floating-point numbers. Because a three-member class is involved, the species will not map to a single neuron. Rather, it will be represented by either two or three neurons, depending on the normalization type used.
Normalizing Numeric Values
Normalizing a numeric value is essentially a process of mapping the existing numeric value to a well-defined numeric range, such as -1 to +1. Normalization causes all of the attributes to be in the same range, with no one attribute more powerful than the others. To normalize, the current numeric range of every attribute must be known. The current numeric ranges for each of the iris attributes are shown here:

Sepal length: 4.3 to 7.9
Sepal width: 2.0 to 4.4
Petal length: 1.0 to 6.9
Petal width: 0.1 to 2.5
The normalization equation is shown in Equation 2.2:

f(x) = ((x - d_L)(n_H - n_L)) / (d_H - d_L) + n_L

This equation normalizes a value x, where d_L and d_H represent the low and high values of the data, and n_L and n_H represent the low and high values of the desired normalization range. For example, to normalize a petal length of 2.0 to the range -1 to +1, the equation becomes:

f(2.0) = ((2.0 - 1.0)(1 - (-1))) / (6.9 - 1.0) + (-1)

This results in a value of approximately -0.66. This is the value that will be fed to the neural network. For regression, the neural network will return values, and these returned values will be normalized. To denormalize a value, Equation 2.3 is used:

x = (f(x) - n_L)(d_H - d_L) / (n_H - n_L) + d_L

To denormalize the value of -0.66, Equation 2.3 becomes:

x = ((-0.66) - (-1))(6.9 - 1.0) / (1 - (-1)) + 1.0

Once denormalized, the value of -0.66 becomes 2.0 again. It is important to note that the -0.66 value was rounded for the calculation here, so the result is only approximately 2.0. Encog provides built-in classes to perform both normalization and denormalization. These classes will be introduced later in this chapter.
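The two equations above are easy to capture in code. The following sketch is not part of Encog; it is a plain-Java helper (the class and method names are my own) that implements Equations 2.2 and 2.3 directly, using the petal-length range from the iris dataset:

```java
public class RangeNormalizer {
    private final double dataLow, dataHigh;  // observed data range (d_L, d_H)
    private final double normLow, normHigh;  // target range (n_L, n_H)

    public RangeNormalizer(double dataLow, double dataHigh,
                           double normLow, double normHigh) {
        this.dataLow = dataLow;
        this.dataHigh = dataHigh;
        this.normLow = normLow;
        this.normHigh = normHigh;
    }

    // Equation 2.2: map x from [dataLow, dataHigh] into [normLow, normHigh].
    public double normalize(double x) {
        return (x - dataLow) * (normHigh - normLow)
                / (dataHigh - dataLow) + normLow;
    }

    // Equation 2.3: invert the mapping, recovering the original value.
    public double denormalize(double n) {
        return (n - normLow) * (dataHigh - dataLow)
                / (normHigh - normLow) + dataLow;
    }

    public static void main(String[] args) {
        // Petal length ranges from 1.0 to 6.9 in the iris dataset.
        RangeNormalizer petalLength = new RangeNormalizer(1.0, 6.9, -1, 1);
        double n = petalLength.normalize(2.0);
        System.out.println(n);                          // about -0.661
        System.out.println(petalLength.denormalize(n)); // recovers 2.0 (up to rounding)
    }
}
```

Because denormalize is the algebraic inverse of normalize, a round trip returns the original value exactly, apart from floating-point rounding.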
Normalizing Nominal Values
Nominal values are used to name things. One very common example of a simple nominal value is gender. Something is either male or female. Another is any sort of Boolean question. Nominal values also include values that are either “yes/true” or “no/false.” However, not all nominal values have only two values. Nominal values can also be used to describe an attribute of something, such as color. Neural networks deal best with nominal values where the set is fixed. For the iris dataset, the nominal value to be normalized is the species. There are three different species to consider for the iris dataset and this value cannot change. If the neural network is trained with three species, it cannot be expected to recognize five species.
Encog supports two different ways to encode nominal values. The simplest means of representing nominal values is called “one-of-n” encoding. One-of-n encoding can often be hard to train, especially if there are more than a few nominal types to encode. Equilateral encoding is usually a better choice than the simpler one-of-n encoding. Both encoding types will be explored in the next two sections.
Understanding One-of-n Normalization
One-of-n is a very simple form of normalization. For an example, consider the iris dataset again. The input to the neural network is statistics about an individual iris. The output signifies the species to which that iris belongs. The three iris species are listed as follows:

Setosa
Versicolor
Virginica
If using one-of-n normalization, the neural network would have three output neurons. Each of these three neurons would represent one iris species. The iris species predicted by the neural network would correspond to the output neuron with the highest activation. Generating training data for one-of-n is relatively easy. Simply assign a +1 to the neuron that corresponds to the chosen iris and a -1 to the remaining neurons. For example, assuming Setosa is assigned to the first output neuron, the Setosa iris species would be encoded as follows:

1, -1, -1
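One-of-n encoding is simple enough to implement directly. The sketch below is not Encog code (the class and method names are my own); it encodes a class index as an array of -1s with a single +1, and decodes by picking the neuron with the highest activation:

```java
import java.util.Arrays;

public class OneOfN {
    // Encode class index 'target' among 'classCount' classes:
    // +1 for the chosen neuron, -1 for every other neuron.
    public static double[] encode(int target, int classCount) {
        double[] result = new double[classCount];
        Arrays.fill(result, -1.0);
        result[target] = 1.0;
        return result;
    }

    // Decode by choosing the output neuron with the highest activation.
    public static int decode(double[] output) {
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Species order assumed: 0=Setosa, 1=Versicolor, 2=Virginica.
        System.out.println(Arrays.toString(encode(0, 3))); // [1.0, -1.0, -1.0]
        // A raw network output rarely matches the ideal exactly;
        // the highest neuron still identifies the class.
        System.out.println(decode(new double[] {0.9, -0.7, -0.2})); // 0
    }
}
```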
Understanding Equilateral Normalization
The output neurons are constantly checked against the ideal output values provided in the training set. The error between the actual output and the ideal output is represented as a percentage. This can cause a problem for the one-of-n normalization method. Consider if the neural network had predicted a Versicolor iris when it should have predicted a Virginica iris. With the species ordered Setosa, Versicolor, Virginica, the actual output and ideal output would be as follows:

Ideal output: -1, -1, 1
Actual output: -1, 1, -1
The problem is that only two of the three output neurons are incorrect. We would like to spread the "guilt" for this error over a larger percentage of the neurons. To do this, a unique set of values for each class must be determined. Each set of values should have an equal Euclidean distance from the others. The equal distance makes sure that incorrectly choosing Setosa for Versicolor has the same error weight as incorrectly choosing Setosa for Virginica.
This can be done using the Equilateral class. The following code segment shows how to use the Equilateral class to generate these values:
Listing 2.1: Calculated Class Equilateral Values 3 Classes
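The exact values in such a listing come from Encog's Equilateral class, which computes equidistant codes for any class count. As a self-contained illustration of the idea, the sketch below is plain Java, not the Equilateral class: for three classes, any three mutually equidistant points will do, such as the vertices of an equilateral triangle inscribed in the unit circle. The specific values it prints are illustrative, not necessarily Encog's exact output:

```java
public class EquilateralSketch {
    // One valid equilateral encoding for 3 classes in 2 dimensions:
    // vertices of an equilateral triangle inscribed in the unit circle.
    public static double[][] encode3() {
        double[][] codes = new double[3][2];
        for (int i = 0; i < 3; i++) {
            double angle = Math.PI / 2 + i * 2 * Math.PI / 3; // 120 degrees apart
            codes[i][0] = Math.cos(angle);
            codes[i][1] = Math.sin(angle);
        }
        return codes;
    }

    // Euclidean distance between two codes.
    public static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] codes = encode3();
        for (double[] c : codes) {
            System.out.printf(java.util.Locale.ROOT, "%.4f, %.4f%n", c[0], c[1]);
        }
        // Every pair of class codes is the same distance apart.
        System.out.println(distance(codes[0], codes[1]));
        System.out.println(distance(codes[1], codes[2]));
        System.out.println(distance(codes[0], codes[2]));
    }
}
```

The three printed distances are identical, which is exactly the property that spreads the error evenly across classes.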
Notice that there are two outputs for each of the three classes. Equilateral encoding always requires one fewer output neuron than one-of-n encoding would. Equilateral encoding is never used for fewer than three classes.
Look at the earlier example again, this time with equilateral normalization. Just as before, consider if the neural network had predicted a Versicolor iris when it should have predicted a Virginica iris. The output and ideal are as follows:
In this case there are only two neurons, as is consistent with equilateral encoding. Now all of the neurons are producing incorrect values. Additionally, there are only two output neurons to process, slightly decreasing the complexity of the neural network. Neural networks will rarely produce output that exactly matches any of the training values. To deal with this in one-of-n encoding, look at which output neuron produced the highest output. This method does not work for equilateral encoding. Instead, for equilateral encoding, find which calculated class equilateral value (Listing 2.1) has the shortest distance to the actual output of the neural network; that class is the network's prediction.
What is meant by each of the sets being equal in distance from each other? It means that their Euclidean distance is equal. The Euclidean distance can be calculated using Equation 2.5:

d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )
In the above equation, the variable q represents the ideal output value and the variable p represents the actual output value; there are n pairs of ideal and actual values. Equilateral normalization is implemented by the Equilateral class in Encog. It is usually unnecessary to deal with the Equilateral class directly. Rather, one of the higher-level normalization methods described later in this chapter is used.
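Putting Equation 2.5 to work, decoding an equilateral output is a nearest-code search. The sketch below is not Encog's Equilateral.decode method, and the class codes it uses are illustrative values chosen for this example:

```java
public class EquilateralDecode {
    // Return the index of the class code closest (Euclidean) to the output.
    public static int decode(double[][] codes, double[] output) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < codes.length; i++) {
            double sum = 0;
            for (int j = 0; j < output.length; j++) {
                double d = output[j] - codes[i][j]; // (p_j - q_j)
                sum += d * d;
            }
            double dist = Math.sqrt(sum); // Equation 2.5
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative equidistant codes for three classes
        // (not necessarily Encog's exact values).
        double[][] codes = {
            { 0.0, 1.0 },      // class 0
            { -0.866, -0.5 },  // class 1
            { 0.866, -0.5 }    // class 2
        };
        double[] actual = { 0.7, -0.3 }; // raw network output
        System.out.println(decode(codes, actual)); // closest code is class 2
    }
}
```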
If you are interested in the precise means by which the equilateral numbers are calculated, visit the following URL: http://www.heatonresearch.com/wiki/Equilateral
Encog provides a number of different means of normalizing data. The exact means used will be determined by what you are trying to accomplish. The three methods for normalization are summarized here:

Normalizing individual numbers
Normalizing memory arrays
Normalizing CSV files
The next three sections will look at all three, beginning with normalizing individual numbers.
Normalizing Individual Numbers
Very often you will simply want to normalize or denormalize a single number whose range of values is already known. In this case, it is unnecessary to incur the overhead of having Encog automatically discover the ranges for you. The "Lunar Lander" program is a good example of this. You can find the "Lunar Lander" example here.
Normalizing Memory Arrays
To quickly normalize an array, the NormalizeArray class can be useful. This class works by normalizing one attribute at a time. An example of the NormalizeArray class at work can be seen in the sunspot prediction example. This example can be found here:
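The sketch below is not Encog's NormalizeArray class itself; it is a self-contained illustration of what that class does: scan an array once to discover its low and high values, then map every element into a target range (here -1 to +1) using Equation 2.2. The sunspot figures are made-up sample values:

```java
import java.util.Arrays;

public class ArrayNormalizer {
    // Normalize an entire array into [normLow, normHigh], discovering
    // the data's own low and high values automatically.
    public static double[] process(double[] input, double normLow, double normHigh) {
        double dataLow = Double.MAX_VALUE;
        double dataHigh = -Double.MAX_VALUE;
        for (double v : input) {
            dataLow = Math.min(dataLow, v);
            dataHigh = Math.max(dataHigh, v);
        }
        double[] result = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            result[i] = (input[i] - dataLow) * (normHigh - normLow)
                    / (dataHigh - dataLow) + normLow;
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative monthly sunspot counts, normalized before being
        // fed to a predictive network.
        double[] raw = { 5.0, 11.0, 16.0, 23.0, 36.0 };
        System.out.println(Arrays.toString(process(raw, -1, 1)));
    }
}
```

The smallest element always maps to the low end of the target range and the largest to the high end, so no attribute dominates the others.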
Normalizing CSV Files
If the data to be normalized is already stored in CSV files, Encog Analyst should be used to normalize the data. Encog Analyst can be used both through the Encog Workbench and directly from Java and C#. This section explains how to use it through Java to normalize the Iris data set. To normalize a file, look at the file normalization example found at the following location:
The output will be a normalized version of the input file, as shown below:
Implementing Basic File Normalization
In the last section, you saw how Encog Analyst normalizes a file. In this section, you will learn the programming code necessary to accomplish this. Begin by accessing the source and target files:
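At its core, file normalization is two passes over the data: one pass to discover each column's numeric range, and a second pass to rewrite every value with the normalization equation. The sketch below illustrates that idea in plain Java; it is not the Encog Analyst API, and it simplifies heavily (all columns are assumed numeric, there are no headers, and rows are kept in memory rather than streamed from files):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CsvNormalizeSketch {
    // Normalize every numeric column of simple CSV content into [-1, 1].
    public static List<String> normalize(List<String> rows) {
        int cols = rows.get(0).split(",").length;
        double[] low = new double[cols];
        double[] high = new double[cols];
        Arrays.fill(low, Double.MAX_VALUE);
        Arrays.fill(high, -Double.MAX_VALUE);

        // First pass: discover each column's range.
        for (String row : rows) {
            String[] parts = row.split(",");
            for (int c = 0; c < cols; c++) {
                double v = Double.parseDouble(parts[c]);
                low[c] = Math.min(low[c], v);
                high[c] = Math.max(high[c], v);
            }
        }

        // Second pass: rewrite each value using the normalization equation.
        List<String> result = new ArrayList<>();
        for (String row : rows) {
            String[] parts = row.split(",");
            StringBuilder line = new StringBuilder();
            for (int c = 0; c < cols; c++) {
                double v = Double.parseDouble(parts[c]);
                double n = (v - low[c]) * 2.0 / (high[c] - low[c]) - 1.0;
                if (c > 0) line.append(",");
                line.append(String.format(Locale.ROOT, "%.4f", n));
            }
            result.add(line.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows (sepal length, sepal width).
        List<String> rows = List.of("5.1,3.5", "4.9,3.0", "7.0,3.2");
        for (String row : normalize(rows)) {
            System.out.println(row);
        }
    }
}
```

Encog Analyst automates both passes, and additionally handles headers, nominal columns, and the bookkeeping of the discovered ranges.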
Saving the Normalization Script
Encog keeps statistics on normalized data. This data, called the normalization stats, tells Encog the numeric ranges for each attribute that was normalized. This data can be saved so that it does not need to be renormalized each time. To save a stats file, use the following command:
Customizing File Normalization
The Encog Analyst contains a collection of AnalystField objects. These objects hold the normalization type and range for each attribute. This collection can be accessed directly to change how the attributes are normalized. AnalystField objects can also be removed, excluding them from the final output. The following code shows how to access each of the fields determined by the wizard.
System.out.println("Fields found in file:");
for (AnalystField field : analyst.getScript().getNormalize().getNormalizedFields()) {
    System.out.println(field.getName());
}