**Preface**

Training is the means by which neural network weights are adjusted to give desirable outputs. This book will cover both supervised and unsupervised training. This chapter will discuss propagation training, a form of supervised training where the expected output is given to the training algorithm. Encog also supports unsupervised training. With unsupervised training, the neural network is not provided with the expected output. Rather, the neural network draws insights from the data with limited direction. Chapter 10 will discuss unsupervised training.

Propagation training can be a very effective form of training for feedforward, simple recurrent and other types of neural networks. While there are several different forms of propagation training, this chapter will focus on the forms of propagation currently supported by Encog. These six forms are listed as follows:

- Backpropagation
- The Manhattan Update Rule
- Quick propagation (QPROP)
- Resilient propagation (RPROP)
- Scaled Conjugate Gradient (SCG)
- Levenberg-Marquardt (LMA)

All six of these methods work somewhat similarly. However, there are some important differences. The next section will explore propagation training in general.

**Understanding Propagation Training**

Propagation training algorithms use supervised training. This means that the training algorithm is given a training set of inputs and the ideal output for each input. The propagation training algorithm will go through a series of iterations, each of which will most likely reduce the neural network’s error by some degree. The error is the degree to which the actual output from the neural network differs from the ideal output provided by the training data. Each iteration will completely loop through the training data.
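The error measurement described above can be illustrated with a small, self-contained sketch. The `mse` helper below is not Encog’s implementation; it simply shows one common way, mean squared error, to measure the difference between the actual and ideal outputs:

```java
class ErrorRate {
    // Mean squared error between the network's actual outputs and the
    // ideal outputs from the training data: the average of the squared
    // differences, so larger disagreements are penalized more heavily.
    static double mse(double[] actual, double[] ideal) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = ideal[i] - actual[i];
            sum += diff * diff;
        }
        return sum / actual.length;
    }
}
```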

For each item of training data, some change to the weight matrix will be calculated. These changes will be applied in batches using Encog’s batch training. Therefore, Encog updates the weight matrix values at the end of an iteration.

Each training iteration begins by looping over all of the training elements in the training set. For each of these training elements, a two-pass process is executed: a forward pass and a backward pass. The forward pass simply presents data to the neural network as it normally would if no training had occurred. The input data is presented and the algorithm calculates the error, i.e. the difference between the actual and ideal outputs. The output from each of the layers is also kept in this pass. This allows the training algorithms to see the output from each of the neural network layers.
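As a rough illustration of the forward pass, the sketch below computes one layer’s output with a sigmoid activation and returns it, so that a later backward pass could use it. The class and method names are hypothetical; this is not Encog’s internal code:

```java
class ForwardPass {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Forward pass through one layer: weights[j][i] connects input i to
    // neuron j. The returned outputs are kept so that the backward pass
    // can later compute this layer's error gradient.
    static double[] layerOutput(double[] input, double[][] weights) {
        double[] out = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < input.length; i++) {
                sum += weights[j][i] * input[i];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }
}
```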

The backward pass starts at the output layer and works its way back to the input layer. The backward pass begins by examining the difference between each of the ideal and actual outputs from each of the neurons. The gradient of this error is then calculated. To calculate this gradient, the neural network’s actual output is applied to the derivative of the activation function used for that layer. This value is then multiplied by the error.
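The gradient calculation just described can be sketched for a single output neuron. This is an illustration of the math, assuming a sigmoid activation; the names are hypothetical and this is not Encog’s internal code:

```java
class OutputGradient {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // The sigmoid's derivative, expressed in terms of the neuron's
    // output value (a standard identity for the sigmoid).
    static double sigmoidDerivative(double output) {
        return output * (1.0 - output);
    }

    // Gradient for an output neuron: the actual output is applied to the
    // derivative of the activation function, then multiplied by the error.
    static double outputDelta(double actual, double ideal) {
        double error = ideal - actual;
        return sigmoidDerivative(actual) * error;
    }
}
```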

Because the algorithm uses the derivative of the activation function, propagation training can only be used with activation functions that actually have a derivative. The derivative is used to calculate the error gradient for each connection in the neural network. How exactly this value is used depends on the training algorithm.

**Understanding Backpropagation**

Backpropagation is one of the oldest training methods for feedforward neural networks.

Backpropagation uses two parameters in conjunction with the gradient descent calculated in the previous section. The first parameter is the learning rate, which is essentially a percent that determines how directly the gradient descent should be applied to the weight matrix. The gradient is multiplied by the learning rate and then added to the weight matrix. This slowly optimizes the weights to values that will produce a lower error.

One of the problems with the backpropagation algorithm is that the gradient descent algorithm will seek out local minima. These local minima are points of low error, but may not be a global minimum.

The second parameter provided to the backpropagation algorithm helps the backpropagation out of local minima. The second parameter is called momentum. Momentum specifies to what degree the previous iteration’s weight changes should be applied to the current iteration.

The momentum parameter is essentially a percent, just like the learning rate. To use momentum, the backpropagation algorithm must keep track of what changes were applied to the weight matrix in the previous iteration. These changes will be reapplied in the current iteration, scaled by the momentum parameter.

Usually the momentum parameter will be less than one, so the weight changes from the previous training iteration are less significant than the changes calculated for the current iteration. For example, setting the momentum to 0.5 would cause 50% of the previous training iteration’s changes to be applied to the weights for the current weight matrix.
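A minimal sketch of the weight update just described, with hypothetical names (not Encog’s internals): the gradient is scaled by the learning rate, and the previous iteration’s change is scaled by the momentum and re-applied.

```java
class BackpropUpdate {
    // The weight change from the previous iteration, kept for momentum.
    static double previousChange = 0.0;

    // One backpropagation weight update: learning rate scales the
    // gradient, momentum re-applies a fraction of the previous change.
    static double updateWeight(double weight, double gradient,
                               double learningRate, double momentum) {
        double change = learningRate * gradient + momentum * previousChange;
        previousChange = change;
        return weight + change;
    }
}
```

With a learning rate of 0.7 and momentum of 0.3, a repeated gradient of 1.0 moves the weight by 0.7 on the first update and by 0.91 on the second, since 30% of the previous change carries over.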

The following code will set up a backpropagation trainer, given a training set and neural network:

```java
import org.encog.neural.networks.training.propagation.back.Backpropagation;
// ...
// 0.7 is the learning rate; 0.3 is the momentum
final MLTrain train = new Backpropagation(network, trainingSet, 0.7, 0.3);
```

The **XORHelloWorld** example can easily be modified to use backpropagation training by replacing the resilient propagation training line with the above training line.

**Understanding the Manhattan Update Rule**

One of the problems with the backpropagation training algorithm is the degree to which the weights are changed. The gradient descent can often apply too large of a change to the weight matrix.

The Manhattan Update Rule and resilient propagation training algorithms only use the sign of the gradient. The magnitude is discarded. This means it is only important whether the gradient is positive, negative or near zero. For the Manhattan Update Rule, this sign is used to determine how to update the weight matrix value. If the gradient is near zero, then no change is made to the weight value. If the gradient is positive, then the weight value is increased by a specific amount. If the gradient is negative, then the weight value is decreased by a specific amount. The amount by which the weight value is changed is defined as a constant. You must provide this constant to the Manhattan Update Rule algorithm.
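The update rule just described can be sketched as follows. The zero tolerance is an assumed cutoff for “near zero,” and the names are hypothetical; this is an illustration, not Encog’s implementation:

```java
class ManhattanUpdate {
    // Assumed cutoff below which a gradient counts as "near zero".
    static final double ZERO_TOLERANCE = 1e-10;

    // The magnitude of the gradient is discarded: only its sign decides
    // whether the weight moves up or down by the fixed step size.
    static double updateWeight(double weight, double gradient, double stepSize) {
        if (Math.abs(gradient) < ZERO_TOLERANCE) {
            return weight; // near zero: no change
        }
        return gradient > 0 ? weight + stepSize : weight - stepSize;
    }
}
```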

The following code will set up a Manhattan Update Rule trainer, given a training set and neural network:

```java
final MLTrain train = new ManhattanPropagation(network, trainingSet, 0.00001);
```

**Understanding Quick Propagation Training**

Quick propagation (QPROP) is another variant of propagation training. Quick propagation is based on Newton’s Method, which is a means of finding a function’s roots. This can be adapted to the task of minimizing the error of a neural network. Typically QPROP performs much better than backpropagation. The user must provide QPROP with a learning rate parameter. However, there is no momentum parameter as QPROP is typically more tolerant of higher learning rates. A learning rate of 2.0 is generally a good starting point.

The following code will set up a quick propagation trainer, given a training set and neural network:

```java
final MLTrain train = new QuickPropagation(network, trainingSet, 2.0);
```

**Understanding Resilient Propagation Training**

The resilient propagation training (RPROP) algorithm is often the most efficient training algorithm provided by Encog for supervised feedforward neural networks. One particular advantage of the RPROP algorithm is that it requires no parameters to be set before using it. There are no learning rates, momentum values or update constants that need to be determined. This is good, because it can be difficult to determine the exact optimal learning rate. The RPROP algorithm works similarly to the Manhattan Update Rule in that only the sign of the gradient is used. However, rather than using a fixed constant to update the weight values, a much more granular approach is taken: each weight has its own update amount, or delta. These deltas do not remain fixed, like the constant in the Manhattan Update Rule or the learning rate in backpropagation. Rather, these delta values change as training progresses.

The RPROP algorithm does not keep one global update value, or delta. Rather, individual deltas are kept for every weight matrix value. These deltas are first initialized to a very small number. Every iteration of the RPROP algorithm updates the weight values according to these delta values. However, as previously mentioned, these delta values do not remain fixed. The sign of the gradient is used to determine how each delta should change: if the gradient keeps the same sign from one iteration to the next, the delta is increased; if the sign flips, the delta is decreased. This allows every individual weight matrix value to be individually trained, an advantage not provided by either the backpropagation algorithm or the Manhattan Update Rule.
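The delta adaptation just described can be sketched as follows. The growth and shrink factors (1.2 and 0.5) and the delta bounds are the values commonly used in the standard RPROP formulation; this is an illustration, not Encog’s code:

```java
class RpropDelta {
    // Standard RPROP growth/shrink factors and delta bounds.
    static final double INCREASE = 1.2, DECREASE = 0.5;
    static final double MAX_DELTA = 50.0, MIN_DELTA = 1e-6;

    // Adapt one weight's individual delta: if the gradient kept its sign
    // since the last iteration, grow the delta; if the sign flipped
    // (a minimum was overshot), shrink it; if either gradient is zero,
    // leave it unchanged.
    static double nextDelta(double prevGradient, double gradient, double delta) {
        double signChange = prevGradient * gradient;
        if (signChange > 0) {
            return Math.min(delta * INCREASE, MAX_DELTA);
        }
        if (signChange < 0) {
            return Math.max(delta * DECREASE, MIN_DELTA);
        }
        return delta;
    }
}
```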

The following code will set up a resilient propagation trainer, given a training set and neural network:

```java
final MLTrain train = new ResilientPropagation(network, trainingSet);
```

Encog supports several variants of the RPROP algorithm; the variant to use is specified with the enum **RPROPType**.

By default, Encog uses RPROP+, the most standard RPROP. Some research indicates that iRPROP+ is the most efficient RPROP algorithm. To set Encog to use iRPROP+, use the following command:

```java
train.setRPROPType(RPROPType.iRPROPp);
```

**Understanding SCG Training**

Scaled Conjugate Gradient (SCG) is a fast and efficient training method for Encog. SCG is based on a class of optimization algorithms called **Conjugate Gradient Methods** (CGM). SCG is not applicable to all data sets, but when it is used within its applicability, it is quite efficient. Like RPROP, SCG has the advantage that there are no parameters that must be set.

The following code will set up an SCG trainer, given a training set and neural network:

```java
final MLTrain train = new ScaledConjugateGradient(network, trainingSet);
```

**Understanding LMA Training**

The **Levenberg Marquardt algorithm** (LMA) is a very efficient training method for neural networks. In many cases, LMA will outperform resilient propagation.

LMA is a hybrid algorithm based on both the Gauss-Newton algorithm (GNA) and gradient descent (backpropagation), integrating the strengths of both. Gradient descent is guaranteed to converge to a local minimum, albeit slowly. GNA is quite fast but often fails to converge. By using a damping factor to interpolate between the two, a hybrid method is created.
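The damping idea can be illustrated with a one-dimensional sketch. For a single parameter with gradient `g` and curvature `h`, the damped step below behaves like a Newton-style step when the damping factor is small and like a small gradient descent step when it is large. This is purely illustrative; the real LMA works on matrices of partial derivatives:

```java
class LevenbergMarquardtStep {
    // One-dimensional damped step: lambda = 0 gives the Newton-style
    // step g / h; a very large lambda gives a tiny gradient-descent-like
    // step g / lambda. Interpolating between the two is the LMA idea.
    static double step(double g, double h, double lambda) {
        return g / (h + lambda);
    }
}
```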

The following code shows how to use Levenberg-Marquardt with Encog for Java:

```java
final MLTrain train = new LevenbergMarquardtTraining(network, trainingSet);
```

**Encog Method & Training Factories**

This chapter illustrated how to instantiate trainers for many different training methods using objects such as **Backpropagation**, **ScaledConjugateGradient** or **ResilientPropagation**. In previous chapters, we learned to create different types of neural networks using **BasicNetwork** and **BasicLayer**. We can also create training methods and neural networks using factories.

Factories create neural networks and training methods from text strings, saving time by eliminating the need to instantiate all of the objects otherwise necessary. For an example of factory usage, see **XORFactory**, which uses factories to create both neural networks and training methods. This section will show how to create both neural networks and training methods using factories.

**Creating Neural Networks with Factories**

The following code uses a factory to create a feedforward neural network:

```java
String METHOD_FEEDFORWARD_A = "?:B->SIGMOID->4:B->SIGMOID->?";
MLMethodFactory methodFactory = new MLMethodFactory();
MLMethod method = methodFactory.create(
    MLMethodFactory.TYPE_FEEDFORWARD, // the method to create
    METHOD_FEEDFORWARD_A,             // the architecture string
    2,   // input count
    1);  // output count
```

This creates a feedforward network with a single hidden layer of four neurons. A **sigmoid activation function** is used between the input and hidden layers, as well as between the hidden and output layers.

You may notice the two question marks in the neural network architecture string. These will be filled in by the input and output layer sizes specified in the create method. The question marks are optional: you can hard-code the input and output sizes instead, in which case the numbers specified in the create call will be ignored.
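The two forms of the architecture string can be shown side by side. The hard-coded string below is a hypothetical example built from the placeholder string above, assuming the same two inputs and one output:

```java
class ArchitectureStrings {
    // With "?" placeholders, the create() call supplies the input and
    // output sizes (2 and 1 in the example above).
    static final String WITH_PLACEHOLDERS = "?:B->SIGMOID->4:B->SIGMOID->?";

    // Hard-coded sizes: the numbers passed to create() are then ignored.
    static final String HARD_CODED = "2:B->SIGMOID->4:B->SIGMOID->1";
}
```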

**Creating Training Methods with Factories**

It is also possible to create a training method using a factory. The following code creates a backpropagation trainer using a factory:

```java
MLTrainFactory trainFactory = new MLTrainFactory();
MLTrain train = trainFactory.create(
    network,                      // the method to train
    dataSet,                      // the training data
    MLTrainFactory.TYPE_BACKPROP, // type of trainer
    "LR=0.7,MOM=0.3");            // the training arguments
```

**How Multithreaded Training Works**

Multithreaded training works particularly well with larger training sets and machines with multiple cores. If Encog does not detect that both are present, it will fall back to single-threaded training. When there is more than one processing core and enough training set items to keep them busy, multithreaded training will function significantly faster than single-threaded training. This chapter has already introduced three propagation training techniques, all of which work similarly. Whether it is backpropagation, resilient propagation or the Manhattan Update Rule, the technique is similar. There are three distinct steps:

First, a regular feed forward pass is performed. The output from each level is kept so the error for each level can be evaluated independently. Second, the errors are calculated at each level and the derivatives of each activation function are used to calculate gradient descents. These gradients show the direction that the weight must be modified to improve the error of the network. These gradients will be used in the third step.

The third step is what varies among the different training algorithms. Backpropagation simply scales the gradient descents by a learning rate. The scaled gradient descents are then directly applied to the weights. The Manhattan Update Rule only uses the gradient sign to decide in which direction to affect the weight. The weight is then changed in either the positive or negative direction by a fixed constant. RPROP keeps an individual delta value for every weight and only uses the sign of the gradient descent to increase or decrease the delta amounts. The delta amounts are then applied to the weights.

The multithreaded algorithm uses threads to perform Steps 1 and 2. The training data is broken into packets that are distributed among the threads. At the beginning of each iteration, threads are started to handle each of these packets. Once all threads have completed, a single thread aggregates all of the results and applies them to the neural network. At the end of the iteration, there is a very brief amount of time where only one thread is executing. This can be seen from Figure 5.1.
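The packet scheme can be sketched with plain Java threads: each worker sums its packet of per-item gradient contributions (steps 1 and 2), and only after every worker has been joined does a single thread aggregate the results (step 3). The names are hypothetical and the per-item values are simplified to plain numbers; this illustrates the idea, not Encog’s implementation:

```java
class PacketTraining {
    // Split per-item gradient contributions into packets, one per worker
    // thread, then aggregate the partial sums in a single thread after
    // all workers have finished (the brief single-threaded moment at the
    // end of each iteration).
    static double aggregateGradients(double[] grads, int threadCount) {
        double[] partial = new double[threadCount];
        Thread[] workers = new Thread[threadCount];
        int packet = (grads.length + threadCount - 1) / threadCount;
        for (int t = 0; t < threadCount; t++) {
            final int idx = t;
            final int start = t * packet;
            final int end = Math.min(start + packet, grads.length);
            workers[t] = new Thread(() -> {
                for (int i = start; i < end; i++) {
                    partial[idx] += grads[i]; // steps 1 and 2, per packet
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) { // barrier: wait for every packet
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        double total = 0.0; // step 3: single-threaded aggregation
        for (double p : partial) {
            total += p;
        }
        return total;
    }
}
```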

As shown in the above image, the i7 is running at 100%. The end of each iteration is clearly identified by the point where each processor’s utilization briefly dips. Fortunately, this is a very brief time and does not have a large impact on overall training efficiency. In attempting to overcome this, implementations were tested that did not force the threads to wait at the end of the iteration for resynchronization. These did not provide efficient training, because the propagation training algorithms need all changes applied before the next iteration begins.

**Using Multithreaded Training**

To see multithreaded training really shine, a larger training set is needed. In the next chapter we will see how to gather information for Encog using larger training sets. For now, we will look at a simple benchmarking example that generates a random training set and compares multithreaded and single-threaded training times. The benchmark uses an input layer of 40 neurons, a hidden layer of 60 neurons, and an output layer of 20 neurons, with a training set of 50,000 elements. This example can be found in **MultiBench**. Executing this program on a quad-core i7 with hyper-threading produced the following result:

As shown by the above results, the single-threaded RPROP algorithm finished in 15 seconds and the multithreaded RPROP algorithm finished in only 6 seconds. Multithreading improved performance by a factor of two and a half. Your results running the above example will depend on how many cores your computer has. If your computer is single core with no hyper-threading, then the factor will be close to one, because the multithreaded training will fall back to a single thread.
