
In International Journal of Neural Systems, vol. 12, no.3&4, pp. 303-318.

IMPROVED LEARNING IN NEURAL NETWORKS WITH WINDOWED MOMENTUM

Ernest Istook and Tony Martinez
[email protected], [email protected]
Computer Science Department, Brigham Young University

Backpropagation, which is frequently used in neural network training, often takes a great deal of time to converge on an acceptable solution. Momentum is a standard technique used to speed up convergence and maintain generalization performance. In this paper we present the Windowed Momentum algorithm, which achieves greater speed-up than standard momentum. Windowed Momentum uses a fixed-width history of recent weight updates for each connection in a neural network. By using this additional information, Windowed Momentum gives significant speed-up over a set of applications with the same or improved accuracy. Windowed Momentum achieved an average speed-up of 32% in convergence time on 15 data sets, including a large OCR data set with over 500,000 samples. In addition to this speed-up, we examine the consequences of sample presentation order and show that Windowed Momentum is able to overcome the adverse effects of poor presentation order while maintaining its speed-up advantages.

Keywords: backprop, neural networks, momentum, second-order, sample presentation order


1. Introduction

Due to the time required to train a neural network, many researchers have devoted their efforts to developing speedup techniques [1-7]. Various efforts range from optimizations of current algorithms to development of original algorithms. One of the most commonly discussed extensions is momentum [3, 8-14].

This paper presents the Windowed Momentum algorithm, with analysis of its benefits.

Windowed Momentum is designed to overcome some of the problems associated with standard backprop training. Many algorithms use information from previous weight updates to determine how large an update can be made without diverging [1,13,15]. This typically uses some form of historical information about a particular weight’s gradient.

Unfortunately, some of these algorithms introduce difficulties of their own [3, 10], as discussed in Section 2.

Windowed Momentum relies on historical update information for each weight. This information is used to judge the worth of individual updates, resulting in better overall weight changes. Windowed Momentum achieved an average speed-up of 32% in convergence time on 15 data sets, including a large OCR data set with over 500,000 samples, with the same or improved accuracy. The computational complexity of Windowed Momentum is identical to that of standard momentum, with only a minor increase in memory requirements.

A background of standard neural network (NN) training with momentum and other speed-up techniques is discussed in Section 2. In Section 3, Windowed Momentum is analyzed. Section 4 describes the methodology of the experimentation. The results of the experiments are presented in Section 5. Section 6 summarizes the efficacy of Windowed Momentum and presents further directions for research.

2. Background

With the Generalized Delta Rule (GDR), small updates are made to each weight such that the updates are proportional to the backpropagated error term at the node. The update rule for the GDR is

(1)  \Delta w_{ij}(t) = \eta \delta_j x_{ji}

where i is the index of the source node, j is the index of the target node, η is the learning rate, δ_j is the backpropagated error term, and x_{ji} is the value of the input into the weight.

This update rule is effective, but slow in practice. For standard backprop the error term for output nodes is equal to (t_j - o_j) f'(net_j), where t_j is the target for output node j and net_j is the activation value at node j. With a sigmoid activation function, f'(net_j) = o_j (1 - o_j), where o_j is the actual value for output node j. This makes the final local gradient term

(1b)  \delta_j = (t_j - o_j)\, o_j (1 - o_j)

The gradient for hidden nodes, where I is the input at that node, is calculated using

(1c)  \delta_h = I\, o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh} \delta_k


In a binary classification problem typical values are 0 ≤ t_j, o_j ≤ 1. Therefore, the maximum this local gradient term can take on is 0.1481. When combined with typical values for η (approximately 0.2 to 0.5) and non-saturated weights, the change to each weight per update is on the order of 10^-2. Although increasing the learning rate will result in larger updates, the likelihood of convergence decreases [3, 16]. Due to the nature of backprop, many iterations are required for convergence and the individual steps must be small. An excessive learning rate can disrupt the gradient descent and possibly miss the local minimum; therefore a solution is not guaranteed.
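The 0.1481 figure can be verified by maximizing the local gradient term over the unit interval. The short derivation below is our own restatement of that arithmetic, using the definitions above with t_j = 1:

    \delta_j = (t_j - o_j)\, o_j (1 - o_j) \le o_j (1 - o_j)^2, \qquad o_j \in [0, 1]

    \frac{d}{d o_j}\left[ o_j (1 - o_j)^2 \right] = (1 - o_j)(1 - 3 o_j) = 0 \;\Rightarrow\; o_j = \tfrac{1}{3}

    \max_{o_j} \delta_j = \tfrac{1}{3}\left(\tfrac{2}{3}\right)^2 = \tfrac{4}{27} \approx 0.1481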

2.1 Update Styles

When training a NN, weight updates can occur at several different times. The two most common methods are called online and batch training. In batch training, all samples are presented and the sum of the error at each weight is used to perform one single update. In other words, the error is calculated for all samples before a single update is made, so the update is based on the sum of the individual error terms over all samples. Conversely, online training updates weights after each sample is presented, thus providing one update for each sample. This update style was used for the experiments presented here. By using batch training, fewer updates are made, but each update considers the entire set of samples. In online training, the current sample and the most recent samples (in the case of using momentum) are the only factors.
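As an illustration only (not the authors' code), the two update styles can be sketched as follows; gradients(weights, sample) is a hypothetical helper returning the per-weight update terms of equation (1) for one sample:

    # Illustrative sketch of online vs. batch updates. 'gradients' is a hypothetical
    # function returning {weight_id: update} for one training sample (eq. 1 terms).
    def online_epoch(weights, samples, gradients, lr):
        for s in samples:                          # one update per presented sample
            for w_id, g in gradients(weights, s).items():
                weights[w_id] += lr * g

    def batch_epoch(weights, samples, gradients, lr):
        total = dict.fromkeys(weights, 0.0)
        for s in samples:                          # accumulate error over all samples first
            for w_id, g in gradients(weights, s).items():
                total[w_id] += g
        for w_id, g in total.items():              # then make one single update
            weights[w_id] += lr * g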

In addition to batch and online training, there is also semi-batch training. In semi-batch training, updates are made after every k samples. This gives more frequent updates than batch mode and more accurate estimates of the true error gradient. Wilson and Martinez [17] discuss the efficacy of batch training when compared to online and semi-batch training. They conclude that batch training is ineffectual for large training sets and that online training is more efficient due to the increased frequency of updates.

Although frequent updates provide a speed-up relative to batch mode, there are still other algorithms that can be used for improved performance. One of these is known as standard momentum.

2.2 Standard Momentum

When using momentum the update rule from equation (1) is modified to be

(2)  \Delta w_{ij}(t) = \eta \delta_j x_{ji} + \alpha \Delta w_{ij}(t-1)

where α is the momentum term. Momentum in NNs behaves similarly to momentum in physics. In Machine Learning by Tom Mitchell [18, p. 100], momentum is described as follows:

“To see the effect of this momentum term, consider that the gradient descent search trajectory is analogous to that of a (momentumless) ball rolling down the error surface. The effect of α is to add momentum that tends to keep the ball rolling down the error surface.”

In order to speed up training, many researchers augment each weight update based on the previous weight update. This effectively increases the learning rate [9]. Because many updates to a particular weight are in the same direction, adding momentum will typically result in a speedup of training time for many applications. This speedup has been shown to be several orders of magnitude in simpler applications with lower complexity decision boundaries [18].
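The following minimal sketch shows the standard momentum update of equation (2) for a single weight; it is our own illustration rather than a reproduction of any implementation used in the experiments, with the learning rate and momentum values (0.2 and 0.4) taken from the settings reported in Section 5.1:

    # Sketch of equation (2): delta_w(t) = lr * grad_term + alpha * delta_w(t-1),
    # where 'grad_term' stands for the backpropagated product delta_j * x_ji.
    def momentum_update(prev_update, grad_term, lr=0.2, alpha=0.4):
        return lr * grad_term + alpha * prev_update

    # Usage: each weight keeps its own previous update.
    prev = 0.0
    for grad_term in [0.05, 0.04, 0.06]:           # hypothetical gradient terms
        prev = momentum_update(prev, grad_term)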


Even though momentum is a well-known algorithm in the neural network community, there are certain criteria that have been considered when extending momentum. Jacobs [1] enumerates these criteria and analyzes the problems with basic backprop by giving several heuristics that should be used in neural network design:

• Each weight should have a local learning rate
• Each learning rate should be allowed to vary over time
• Consecutive updates to a weight should increase the weight's learning rate
• Alternating updates to a weight should decrease the weight's learning rate

Because Jacobs' Delta Bar Delta algorithm has been criticized for being overly sensitive to parameter selection [3], the approach taken in this research seeks to minimize the difficulty associated with parameter selection. This goal is given higher priority than the four principles just mentioned.

Another attempt to improve momentum is called Second-Order Momentum. The idea is that by using second-order information, a larger update can be made to the weights. This particular algorithm was shown to be ineffective in [10], but Magoulas and Vrahatis [19] have found success using second-order error information in other algorithms to adapt local learning rates.

2.3 Momentum Performance

Although momentum speeds up performance, there are times when an improper selection of the momentum term causes a network to diverge [3, 16]. This means that instead of distinguishing one class from another, the network will always output a single selection.

In one batch of experiments done by Leonard and Kramer [2], 24% of the trials did not converge to a solution in a reasonable amount of time.


Because of these problems, various efforts have focused on deriving additional forms of momentum. One such method is to give more weight to previously presented examples and less weight to the current example. By so doing, momentum is preserved when a sample attempts to update contrary to the most recent updates. Jacobs' [1] efforts to address these problems resulted in:

(2b)  \Delta w_{ij}(t) = (1 - \alpha) \eta \delta_j x_{ji} + \alpha \Delta w_{ij}(t-1)

This variant can give more influence to the historical updates based on the value of α. It should be noted that setting α to zero in equation (2b) results in equation (1). Unfortunately, the performance of equation (2b) in Jacobs' work was comparable to the general performance of equation (2).

2.4 Sample Presentation in Standard Momentum

Momentum is a locally adaptive approach. Each weight remembers the most recent update. By doing this, each weight is able to update independently of other weights. This helps adhere to the principles previously mentioned by Jacobs [1]. Building on momentum can show improved performance, as seen in the SuperSAB algorithm by Tollenaere [6]. He integrates the standard form of momentum from (1) with a locally adaptive learning rate per node. His work, like that of Jacobs, uses a linear increase in local learning rates. Locally adaptive approaches, however, put a stronger emphasis on the most recent updates.

Because the presentation order is often randomized in machine learning algorithms, there will be circumstances where consecutive samples produce weight updates in alternating directions. The problem with these approaches relates to presentation order. The worst-case scenario occurs if updates to any particular weight alternate directions. In those situations, only the most recent update will be important.

Many algorithms are susceptible to this problem. By using successive updates to determine changes in momentum, the presentation order becomes critical. All momentum-based algorithms suffer from this same shortcoming. In order to alleviate this problem, an algorithm should be robust to the sample presentation order.

2.5 Momentum Parameters

Neural networks have other parameters that can cause difficulty. In addition to having a momentum rate and learning rate, the network topology, activation function, and error function must be chosen. The large number of these parameters can make network design difficult. Neural networks appear to be especially sensitive to the selection of learning rate and momentum. One specific property that Tollenaere mentions is that the higher the rate of momentum, the lower the learning rate should be. This dependence between learning rate and momentum makes network initialization more complex.

Because of the potential of momentum, researchers have devoted their efforts to optimizing its effectiveness. Momentum can speed up learning, but there is still a large set of parameters that must be tuned for many momentum-based equations. For example, the SuperSAB algorithm by Tollenaere [6] has various parameters that govern the increase and decrease of each local momentum rate, with separate parameters for increasing and decreasing the rate.


Although momentum can have a positive effect, introducing additional parameters can mitigate these improvements. By increasing the number and sensitivity of parameters, a learning task becomes prefaced by a phase of parameter optimization. This phase complicates the problem and can negate the targeted speed gains.

3. Windowed Momentum

In this paper we propose the Windowed Momentum algorithm that is derived from basic momentum and locally adaptive algorithms. By augmenting the historical scope of the local weights, there is more freedom to use information obtained on previous updates.

Windowed Momentum uses a fixed width window that captures more information than that used by standard momentum. By using more memory it is possible to overcome the problems of presentation order.

By using Windowed Momentum, one can establish how much of a history to use. This approaches batch training by allowing more samples to affect each update, but the regularity of updates is akin to online training. This gives the Windowed Momentum algorithm additional performance over batch training because more updates are made with fewer computations.

3.1 Principles of Windowed Momentum

The Windowed Momentum algorithm functions in a way similar to standard momentum. Momentum functions by remembering the most recent update to each weight and adding a fraction of that value to the next update. Windowed Momentum remembers the most recent n updates and uses that information in the current update for each weight. With standard momentum, the error from one previous update is partially applied. In the worst case, some consecutive samples will have opposite updates. This situation can disrupt the momentum that may have built up, and it could take longer to train. Windowed Momentum is able to look at a broader history to determine how much effect momentum should have.

In order to remove the problems related to presentation order, a sliding window is used which treats the most recent k updates as members of a set as opposed to elements in an ordered list. This eliminates the problems of alternating updates, but the computation still needs to be manageable. This is achieved with rolling averages. An accumulator is used to store the current sum of the sliding window. Then, as the window slides forward, the oldest element is removed from that accumulator and the newest element can be added with a minimal number of operations. An example showing individual updates and their rolling averages is shown in Figure 1. This figure shows that averaging the previous 5 updates can smooth a chaotic sample presentation.
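A minimal sketch of this bookkeeping, assuming a per-weight window implemented with a deque plus a running sum (our illustration of the idea, not the authors' code; the class and method names are ours):

    from collections import deque

    # Per-weight sliding window with a rolling sum, as described above.
    class WindowedHistory:
        def __init__(self, k):
            self.window = deque(maxlen=k)          # the most recent k updates
            self.total = 0.0                       # accumulator: sum of stored updates

        def add(self, update):
            if len(self.window) == self.window.maxlen:
                self.total -= self.window[0]       # drop the oldest update from the sum
            self.window.append(update)
            self.total += update                   # add the newest update

        def average(self):
            return self.total / len(self.window) if self.window else 0.0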

[Figure 1: plot of individual weight update amounts and their rolling average over time.]

Figure 1 – Using a window size of 5 can smooth chaotic presentations


3.2 Analysis of Windowed Momentum

A performance analysis of the Windowed Momentum algorithm begins with the weight level. Assume w is a single weight and T is the set of all training points. Let Δw_e be the change in the weight that would come from training on e ∈ T. Batch mode operation would compute a Δw_T that considers all e ∈ T.

For Windowed Momentum, a set S ⊆ T is used to approximate the gradient at weight w. The set S is comprised of the most recent samples that have been trained. In traditional batch mode training, the update to w is

(4)  \Delta w_T = \frac{1}{|T|} \sum_{x \in T} \Delta w_x

Windowed Momentum computes

(5)  a = \frac{1}{|S|} \sum_{x \in S} \Delta w_x

which gives an estimate of the traditional batch mode gradient. At each presentation of a sample e, Δw_e is compared to a. If a and Δw_e are in the same direction, either both positive or both negative, then the weight is updated as w = w + Δw_e. Otherwise Δw_e is considered spurious and w remains unchanged. By using an approximation to the true gradient, the likelihood of updating a particular weight contrary to the true gradient is reduced.

In order to maintain efficiency, the set S is comprised of the most recent samples trained. After a sample e is presented and its update Δw_e is computed, the oldest element of S is removed and e is added to S. The benefit of Windowed Momentum is realized because an update can be made with every sample presented and any updates made are more likely to be in the direction of the true gradient. Additionally, training occurs much more quickly because each presentation of a sample is likely to update several weights. Instead of waiting until the end of an epoch, Windowed Momentum updates weights for each presented sample. By approximating the true gradient and not postponing weight updates, Windowed Momentum is able to converge on a solution considerably faster than batch mode training. This is shown empirically in Section 5. In the extreme case, S = T, the information being used would be similar to batch training. The primary difference is that more updates will be made than with batch training.

3.3 Windowed Momentum Formula

To create Windowed Momentum, equation (1) is altered to

(6)  \Delta w_{ij}(t) = \eta \delta_j x_{ji} + f(\eta \delta_j x_{ji}, \Delta w_{ij}(t-1), \Delta w_{ij}(t-2), \ldots, \Delta w_{ij}(t-k))

Equation (6) makes use of a function ƒ, creating a family of Windowed Momentum functions. This function determines how to use the history of weight updates for each particular weight. There are k+1 arguments to the function ƒ: the first is the current update and the remainder are the k previous updates, where k is the window size of the Windowed Momentum algorithm.

In order to minimize the sensitivity of parameters, the ƒ function used here has no adjustable parameters aside from the window size. An analysis of the window size is presented in Section 5.2. However, the goal is to obtain an algorithm that will still do reasonably well with sub-optimal parameters. To optimize learning speed, simple formulae are favored.

The following ƒ function is tested:

(7)  f(z, \delta_1, \ldots, \delta_k) = \begin{cases} -z & \text{if } z \cdot \sum_i \delta_i < 0 \\ 0 & \text{if } z \cdot \sum_i \delta_i \ge 0 \end{cases}

Equation (7) ignores any updates that are in the opposite direction from the sum of the most recent k updates. Updates that do not conform to the recent gradient are ignored.
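A sketch of the resulting update rule for one weight, combining equations (6) and (7) with the WindowedHistory bookkeeping sketched in Section 3.1 (again our own illustration; we record the raw per-sample update in the window, which is one reading of equations (5) and (6)):

    # Sketch of equations (6) and (7) for a single weight.
    # 'history' holds the previous updates; 'grad_term' stands for delta_j * x_ji.
    def windowed_momentum_update(weight, history, grad_term, lr=0.2):
        z = lr * grad_term                  # current update, eta * delta_j * x_ji
        if z * history.total < 0:           # opposes the sum of the recent window
            f = -z                          # equation (7): cancel the spurious update
        else:
            f = 0.0
        history.add(z)                      # record the raw per-sample update
        return weight + z + f               # equation (6)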

3.4 Computational Complexity of Standard Momentum

When training a multi-layer perceptron, there are several factors that determine the computational complexity. We will assume that any multi-layer network can be computed with a single hidden layer network [20-22], thus we only need to consider the single hidden layer scenario. It is assumed that the network is fully connected.

For this discussion, the following terms are used:

i = number of input nodes
n = number of hidden nodes
o = number of output nodes

The dominating factor in training is the number of weights. For the fully connected case, there are n * i weights from the inputs to the hidden nodes, and n * o weights from the hidden nodes to the output nodes. This gives a total of n * (i + o) weights. Because the number of inputs and outputs for a given problem is generally fixed, the only variable term is n.


In terms of complexity, the feed-forward phase of a multi-layer perceptron is fixed. This leaves the backpropagation of error phase. The computation of error at the output nodes uses the same backpropagation error term as mentioned in equation (1b), which is repeated here for convenience.

(1b)  \delta_j = (t_j - o_j)\, o_j (1 - o_j)

This error term is O(1) for each output node and there are o*n output weights, which yields O(o*n) for the output layer. After computing the error at the output nodes, the error at each hidden node is computed. Using the same assumptions as for equation (1b),

(3)  \delta_h = o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh} \delta_k

where w_{kh} represents the weight from node h in the hidden layer to node k in the output layer. Note that this computation is O(o) for each hidden node and that there are n*i weights, yielding O(i*o*n) for the entire hidden layer.

The total order of complexity is O(i*o*n + n*o), or O(n * o * (i+1)), for training a single epoch. When using standard momentum, this order of complexity is unchanged but fewer epochs are required for convergence than without momentum. Standard momentum merely requires additional storage for each weight. Storage requirements are directly proportional to the number of weights. With n * (i + o) weights, there are 2 * n * (i + o) values to store.


3.5 Computational Complexity of Windowed Momentum

Windowed Momentum has the same time complexity as Standard momentum. In order to compute the complexity of Windowed Momentum, the complexity of Standard momentum is used as a starting point. As stated in section 3.4, the order of complexity for training is O(n * o * (i+1)) per epoch. When training with Windowed Momentum, each weight update is conditional on the previous k updates, where k is the window size used. This raises the complexity for training to O(n * o * (i+1) * k) per epoch.

In order to efficiently compute the weight updates with Windowed momentum, there are additional storage requirements. At each weight, an additional k - 1 values are needed. This raises the memory requirements from 2 * n * (i + o) numbers with standard momentum to k * n * (i + o) numbers for Windowed momentum.

There are additional optimizations that can improve the computation speed in exchange for additional memory requirements. Since the most recent k weight changes for each weight are already stored it is simple to iterate and compute the sum. However, instead of iteratively computing the direction of the previous k weight changes, an accumulator can be used to store the sum. Since the most recent k weight changes must be stored already, it is a trivial task to subtract the oldest weight change from the accumulator and add the newest weight change to the accumulator. By doing this, the computational complexity of Windowed momentum becomes the same as standard momentum. The cost for this is adding an accumulator for each weight, thereby increasing the required storage space to (k + 1) * n * (i + o) numbers.
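As a concrete illustration of these storage figures, consider a fully connected network with hypothetical sizes i = 64, n = 96, and o = 1; the input count matches the 8 x 8 OCR grid described in Section 4.1.1, while the hidden and output counts are our own assumptions, and k = 100 is the default window size used in the experiments:

    n(i + o) = 96 \times 65 = 6240 \text{ weights}

    \text{standard momentum: } 2 \times 6240 = 12{,}480 \text{ stored values}

    \text{Windowed Momentum with accumulator: } (k + 1) \times 6240 = 101 \times 6240 = 630{,}240 \text{ stored values}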


3.6 Combination Momentum

With no additional time complexity and only a minor increase in memory, one can use Windowed Momentum in conjunction with Standard momentum. We call this merging of algorithms Combination momentum. Combination momentum starts with Windowed Momentum, which is defined in equation (6) from section 3.3:

(6)  \Delta w_{ij}(t) = \eta \delta_j x_{ji} + f(\eta \delta_j x_{ji}, \Delta w_{ij}(t-1), \Delta w_{ij}(t-2), \ldots, \Delta w_{ij}(t-k))

For simplicity, let

(8)  f_t = f(\eta \delta_j x_{ji}, \Delta w_{ij}(t-1), \Delta w_{ij}(t-2), \ldots, \Delta w_{ij}(t-k))

Then combining equation (6) with (8) results in

(9)  \Delta w_{ij}(t) = \eta \delta_j x_{ji} + f_t

Next incorporate the momentum parameter α and the momentum term:

(10)  \Delta w_{ij}(t) = \begin{cases} \eta \delta_j x_{ji} + \alpha \Delta w_{ij}(t-1) & \text{if } f_t = 0 \\ 0 & \text{if } f_t \ne 0 \end{cases}

Note that the update becomes a standard momentum based update or no update at all.
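A sketch of the Combination Momentum rule of equation (10), reusing the WindowedHistory bookkeeping from Section 3.1 (our own illustrative code; η = 0.2 and α = 0.4 are the experimental settings from Section 5.1):

    # Sketch of equation (10): a standard momentum update when the window agrees
    # with the current update (f_t = 0), and no update at all otherwise.
    def combination_momentum_update(weight, prev_update, history, grad_term,
                                    lr=0.2, alpha=0.4):
        z = lr * grad_term                  # current update, eta * delta_j * x_ji
        opposed = z * history.total < 0     # window of previous updates opposes z
        history.add(z)                      # record the raw per-sample update
        if opposed:                         # f_t != 0: equation (10) gives no update
            return weight, 0.0              # delta_w(t) = 0 under a literal reading
        update = z + alpha * prev_update    # standard momentum update (eq. 2)
        return weight + update, update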

4. Experimental Methods

[Note: An on-line appendix found at http://axon.cs.byu.edu/~butch/wm_appendix01.ps contains additional information relating to the experiments discussed in this paper. The full tables of results pertaining to the UCI data sets are found first, followed by the results of varying window size for the UCI data sets. Next is a reproduction of the letter distribution found in Hahn [1994]. Finally, the full results pertaining to the character recognition experiments are listed.]

There were two major rounds of experimentation. The preliminary round was designed to explore the properties of Windowed Momentum using relatively small data sets. During this round of experiments we tested the effects of window size, learning rate, and sample presentation order. The final round of experiments used the results of round one and applied them to a large real-world data set. In all cases a standard sigmoidal activation function was used.

4.1 Preliminary Experiments

In this round of experimentation we performed three main trials. Windowed Momentum was tested in conjunction with varying learning rates and window sizes.

Unless otherwise noted, each experiment was run 20 times with different random seeds and the results were averaged to produce the reported values. All stopping criteria were based purely on the results from the training data and each test was run until either of two stopping criteria was met. If the sum-squared error did not decrease on the training set for 50 epochs then the results were computed and the test halted. Additionally, if the sum-squared error decreased below a certain threshold the results were computed and the test halted. This threshold was arbitrarily fixed at 4 for the duration of the experiments.

The error and accuracy values reported come solely from test set performance.
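The two stopping criteria can be summarized in a small check such as the following (an illustrative sketch, not the original experiment code; sse_history is assumed to hold the training-set sum-squared error after each epoch):

    # Stop when the training SSE has not decreased for 50 epochs,
    # or when it drops below the fixed threshold of 4.
    def should_stop(sse_history, patience=50, threshold=4.0):
        if not sse_history:
            return False
        if sse_history[-1] < threshold:
            return True
        best_epoch = sse_history.index(min(sse_history))
        return (len(sse_history) - 1 - best_epoch) >= patience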


A comparison was not made to batch or semi-batch training because the comparison would not be accurate. Batch training has been shown to be less effective [17] and the use of semi-batch would perform k times slower with a k sample batch size.

Separate experiments were conducted with varying learning rates and varying window sizes to determine general performance of Windowed Momentum as compared to Combination Momentum and Standard Momentum. The number of epochs required to converge and the accuracy were both measured.

4.1.1 Data Sets

The following data sets were used for the experiments in this round. Each data set was separated into a training set and a test set with 70% of the data used for training. All data sets listed come from the UC Irvine Machine Learning Database.

abalone – 4,177 instances with 7 numeric attributes and one nominal attribute. The goal is to predict the age of an abalone from physical measurements.

balance – 625 instances with 8 numeric attributes. The output class represents the direction the scale will tip towards.

cmc – 1,473 instances with 5 numeric attributes and 4 nominal attributes. The inputs represent socio-economic characteristics of married women and the goal is to determine the contraceptive method currently used.

derm – 366 instances with 34 numeric attributes. The task is to determine the presence of various skin diseases based on personal and family history information.

digit – 5,620 instances with 64 numeric attributes. Standard recognition of hand-written digits.

ecoli – 336 instances with 7 numeric attributes. The goal is to determine the localization site of various proteins based on the input attributes.

glass – 214 instances with 9 numeric attributes. Various measurements of the contents of glass samples are used to determine the usage of the glass sample.

haberman – 306 instances with 3 numeric attributes. The output classes are whether or not a patient survived 5 years after breast cancer surgery.

iris – 150 instances with 4 numeric attributes. This classic machine learning data set classifies the species of various iris plants based on physical measurements. This data set was only used in the sample presentation order experiments.

ionosphere – 351 instances with 34 numeric attributes. This data set classifies the presence of free electrons in the ionosphere.

pendigits – 10,992 instances with 16 numeric attributes. This is another set for handwritten digit recognition.

pima – 768 instances with 8 numeric attributes. The predictive class in this data set is whether or not the tested individual has diabetes.

spam – 4,601 instances with 57 numeric attributes. The goal is to distinguish unsolicited emails from normal emails based on word frequencies.

wine – 178 instances with 13 numeric attributes. The attributes give various parts of the chemical composition of the wine and the task is to determine the wines' origins.

yeast – 1,484 instances with 8 numeric attributes. The task is to predict the localization site of various proteins based on the input attributes.

Further experiments tested Windowed Momentum on a large data set consisting of samples of handwritten characters. This OCR data consists of over 500,000 samples of characters and numbers that have been hand classified. The letter distribution followed that of the English language. The results of the preliminary experiments were used to determine learning rate, window size, and momentum settings. A separate NN was trained for each desired output class using approximately 80% of the data for training and the remaining 20% as test data. Each sample was partitioned into an 8 x 8 grid for inputs. Unless otherwise mentioned, a window size of 100 was used.

The number of hidden nodes used for each data set is listed below.


Data Set      Hidden Nodes Used
Abalone       11
Balance       12
Cmc           14
Derm          51
Digit         96
Ecoli         11
Glass         13
Haberman      5
Iris          6
Ionosphere    51
Pendigits     24
Pima          12
Spam          86
Wine          20
Yeast         12

4.2 Sample Presentation Order

When training a NN, the standard approach has been to randomly shuffle the training samples before presentation. However, different approaches to sample presentation order can affect the training speed. This set of experiments was designed to determine the effects of presentation order.

The presentation order used was either shuffled or alternated. Shuffled data is synonymous with random presentation order within an epoch. Alternated data trains alternately on positive and negative examples. Both presentation orders used the same random seed so that all data was arranged in identical orderings. Every sample in the training set was trained on with shuffled data, but with alternated data, samples of the class last trained on are skipped. For example, when a positive sample was presented immediately after another positive sample it was skipped. It is worth noting that the number of samples trained on per epoch using shuffled presentation order is constant, while the number of samples trained on per epoch varied using alternated presentation order. The number of samples used in a single epoch was approximately 40% less than that used in shuffled presentation.
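An illustrative sketch of the two presentation orders as we read the description above (not the original experiment code; samples are assumed to be (features, label) pairs with binary labels):

    import random

    def shuffled_order(samples, seed):
        order = list(samples)
        random.Random(seed).shuffle(order)
        return order                            # every sample is trained on each epoch

    def alternated_order(samples, seed):
        order = shuffled_order(samples, seed)   # same seed, same underlying ordering
        kept, last_label = [], None
        for x, label in order:
            if label == last_label:             # skip samples of the class last trained on
                continue
            kept.append((x, label))
            last_label = label
        return kept                             # roughly 40% fewer samples per epoch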

The momentum rule used was Standard, Combination, or no momentum. Standard momentum is the same as mentioned in equation (2) in section 2.2. Combination momentum is the conjunction of both Standard momentum and Windowed momentum, as shown in equation (10) in section 3.6. Combination Momentum was chosen in lieu of Windowed Momentum because Combination Momentum empirically performed better than Windowed Momentum for low learning rates.

The remaining parameters in each NN were identical. In order to easily determine the results, a low learning rate (0.05) was used. A subset of the UCI data was chosen for this experiment.

5. Experimental Results

This section contains the results and basic analysis of the experiments described in Section 4. The preliminary experiments tested accuracy and convergence speed of Standard Momentum, Combination Momentum and Windowed Momentum. First the results from the UC Irvine data sets are presented, followed by the OCR data results. Finally, the results related to sample presentation order are presented.


5.1 Learning Rate

In order to determine the effectiveness of Windowed and Combination Momentum relative to Standard Momentum, we tested the UCI data sets at a variety of learning rates ranging from 0.05 to 1.0 at increments of 0.05. The momentum parameter was held constant at 0.4. This value was determined based on initial experimentation. It was shown to be enough momentum to recognize its effect without increasing the risk of divergence. For each learning rate and data set combination, 20 separate test runs were conducted. The topology of the NN was a single hidden layer that was fully connected to all input and output nodes. The size of the hidden layer was arbitrarily set at 1.5 times the number of inputs.

Figures 2 and 3 show summary information that was obtained by averaging the performance over all UCI data sets and all test runs. Figure 2 shows that the quickest convergence consistently came from Combination Momentum. At the lowest tested learning rate of 0.05, Standard Momentum took approximately 370 epochs to converge, while Combination Momentum took about 205 epochs and Windowed Momentum took around 295 epochs. As the learning rate increased, all three algorithms tended to converge in approximately the same number of epochs, with Windowed and Combination Momentum taking slightly fewer epochs.

Although speed of convergence is important, one cannot ignore the accuracy of the tested NNs. The Combination Momentum algorithm performed poorest when subjected to a high learning rate, but it had the greatest performance for low learning rates. Overall, Windowed and Standard Momentum produced equivalent accuracy. The highest accuracy for Windowed and Standard Momentum was obtained at a learning rate of 0.2, and at this point Windowed Momentum took about 16% fewer epochs. The window size experiments further improve on this speed-up and used 0.2 as the learning rate because the best accuracy obtained using Standard Momentum was at this point.

[Figure 2: average epochs to converge (0-400) plotted against learning rate (0.05-1.0) for Standard, Combination, and Windowed Momentum.]

Figure 2 – Comparison of convergence time on the UCI data sets

[Figure 3: average accuracy (85-90%) plotted against learning rate for Standard, Combination, and Windowed Momentum.]

Figure 3 – Comparison of accuracy on the UCI data sets


After analyzing the results, it becomes apparent that Combination Momentum is overly aggressive when the learning rate gets large. This occurs because the individual weight updates get too large too quickly. By having such large updates, the ability to perform a successful gradient descent is lost. Windowed Momentum shows the best balance of speed and accuracy.

5.2 Window Size

The next experiment was designed to observe the effects of varying the window size of Windowed Momentum. The window sizes used ranged from 5 to 50 at increments of 5 and from 60 to 150 at increments of 10. The learning rate was held constant at 0.2. Again we used the UCI data sets, and the network topology was identical to that listed in Section 4.1.1.

Figures 4 and 5 show the behavior of convergence speed and accuracy as a function of window size. These results are averaged over all test runs in the same manner described in section 5.1.

By increasing the window size, the time to converge decreased by an additional 16% (from 170 epochs down to 142), as shown in Figure 4. Also note in Figure 5 that accuracy increases as window size increases. These experiments confirm that a larger window size improves training speed. When comparing against the results of Standard Momentum with the same learning rate of 0.2, we are able to decrease the time to converge from 202 epochs down to 142. This shows that Standard Momentum required approximately 31% more epochs on average to converge.


When looking at the behavior relative to window size, it becomes apparent that accuracy improves as the window size increases. This occurs because each weight change considers more data points at each update. This allows the NN to make better-informed updates and achieve more accurate recognition. The effect of varying the window size on a large data set is shown in the next section.

[Figure 4: average convergence time (epochs) plotted against window sizes from 5 to 150 for Windowed Momentum.]

Figure 4 – Convergence time on UCI data sets with varying window size


[Figure 5: average accuracy (roughly 88.9-90%) plotted against window sizes from 5 to 150 for Windowed Momentum.]

Figure 5 – Accuracy on UCI data sets with varying window size

5.3 OCR Experiments

An initial experiment was conducted to determine the required time to train since run- times are so long. A NN was trained with the letter ‘a’ as the positive class. The performance on the training set at each epoch was recorded. After approximately 45 epochs there was little to no improvement on the test set. Based on these results we report only the accuracy of the other letters for the first 50 epochs. The letters were still trained on until completion so that the convergence results could be reported.

Due to the large time requirements for training, a subset of letters was selected. This selection was made based on letter frequencies in the English language. There are many sources of letter distributions, so we arbitrarily selected one that was based on occurrences in Dickens' A Tale of Two Cities [23]. This letter distribution has been used in cryptographic research [24]. We then selected every other letter from that list so that a variety of common and rare letters could be trained. This resulted in half of the 26 English letters being trained. Letters with distinct upper and lower case variants were trained on both versions.

Further experiments were conducted to determine convergence speed. Due to the extreme time requirements on such a large data set each letter variant was only tested twice. The parameter settings were identical to those of the 50-epoch limited experiment.

Finally the window size was varied from 25 to 300 in increments of 25. This was done on a single letter with 3 random runs for each value of window size.

Over all the letters trained, Standard Momentum took 1409 epochs to converge on average. Windowed Momentum required 32% fewer epochs and Combination Momentum required 52% fewer epochs. Table 1 shows the average convergence time for all the letters.


Table 1 – Total Epochs to Converge on OCR data (based on training set)

Letter    Combination   Windowed   Standard
a         1184          946        1440
A         448           613        1508
c         591           1147       1657
C         993           1442       1470
e         1114          877        1344
E         384           1065       1639
g         197           1695       2204
G         672           597        608
h         894           671        1526
H         543           1074       1230
j         786           1167       1467
J         582           928        1947
l         610           1895       2730
L         607           1060       1492
M         717           1211       1360
n         1364          1206       1162
N         448           967        1844
P         526           678        1245
q         298           466        444
Q         467           303        455
r         1412          1181       1694
R         470           564        1003
V         293           331        952
Average   678.26        960.17     1409.60

The next experiment was the accuracy test. Over all the letters trained, there seemed to be two types of results. On 59% of the letters the Windowed Momentum algorithm clearly achieved better results; one example of this behavior is shown in Figure 6. The remaining 41% appeared to have no significant differences in performance. Combination Momentum gave consistently higher error over all letters. The average error over all letters is shown in Figure 7. The full set of individual letter graphs is available in the online appendix.


The general performance of Windowed Momentum varies from comparable to better depending on the particular letter. Windowed Momentum never performed consistently worse than Standard Momentum for any of the letters tested.

[Figure 6: average % error for the letter 'J' over the first 50 epochs for Windowed, Standard, and Combination Momentum.]

Figure 6 – Classification accuracy on OCR data for a single letter based on test set results

[Figure 7: average % error over all letters for the first 50 epochs for Windowed, Standard, and Combination Momentum.]

Figure 7 – Classification accuracy on OCR data averaged over all letters based on test set results


In order to determine how varying window sizes affected the time to converge, the letter 'e' was trained with window sizes ranging from 25 to 300 inclusive. Each window size was started with three separate random seeds. The results are shown in Figure 8.

[Figure 8: average epochs to converge for the letter 'e' (0-800) plotted against window sizes from 25 to 300.]

Figure 8 – Convergence time on OCR data with varying window size

The minimum convergence time occurs with a window size of 100. This closely correlates with the optimum window size of 90 for the UCI data sets. An increased window size tends to lengthen the convergence time because the gradient information becomes out-of-date. Because the information used with a window size of 200 was computed over the previous 200 samples presented, it becomes increasingly less reliable.


5.4 Sample Presentation Order

For this experiment the derm, iris, and ionosphere data sets were arbitrarily selected. The results are found in Table 2 below. The results for the Ionosphere data set with the Shuffled presentation order (the Shuffled ionosphere column of Table 2) diverged in a few of the test runs. In these cases, there was a single sample that was being misclassified. The epoch at which all other points were correctly classified was used for determining convergence time for the Ionosphere data set.

Table 2 – Average epochs to converge with varying presentation order

                  Alternated                        Shuffled
                  derm    iris    ionosphere        derm    iris    ionosphere
Windowed M.       57.3    16.1    133.2             56.2    18.2    128.2
Combination M.    48.1    13.7    122.7             49.1    9.4     120.8
Standard M.       68      111.4   669.2             87.8    288.1   1218.4
No Momentum       89.5    200.5   1267.4            113.1   577.7   3712.8

To explain the difference in convergence times we must consider the behavior associated with training the NN. Assume a positive sample is presented to a single output NN and is misclassified. During training, this will result in the output node being updated to produce a more positive value. The net effect of training any sample is to bias the NN in the direction of that sample. Training on several consecutive samples from a single class tends to over bias the output nodes against the other output classes. This hinders the training of the NN and results in a longer training time when compared to Alternated presentation.


For the Standard and No momentum cases, Alternated presentation order took between 21% and 66% fewer epochs to converge and never took longer than Shuffled presentation order. However, it is interesting to note that the Windowed and Combination momentum algorithms converged more quickly than the Standard or No momentum cases. Additionally, note the similarity in convergence times for Alternated and Shuffled presentation order within the Combination and Windowed momentum tests. The small difference in the number of epochs required for Windowed Momentum to converge shows that the algorithm is able to overcome the effects of sample presentation order.

6. Conclusion

Windowed Momentum achieved an average speed-up of 32% in convergence time on 15 data sets, including a large OCR data set with over 500,000 samples. Windowed Momentum is also able to overcome the effects that can occur with poor presentation order and still maintain its speed-up advantages. Accuracy on all data sets was the same as or improved over Standard momentum. This algorithm suggests several new directions for research and can be used anywhere that Standard momentum is used.

6.1 Future Research

Future work will examine additional ƒ functions. One additional ƒ function was examined during the course of the original experimentation but was unsuccessful. Further optimization could determine appropriate constants to improve this new ƒ function. For this equation, let n equal the fraction of the δ_i that are in the same direction as z, and let θ be a scaling parameter:


(11)  f(z, \delta_1, \ldots, \delta_k) = \begin{cases} 2 z n & \text{if } n < 0.5 \\ z\,(1 + (2n - 1)(\theta - 1)) & \text{if } n > 0.5 \end{cases}

The effect of this particular ƒ function is to diminish the update when less than half of the recent updates are in the same direction. Conversely, the update is increased in proportion to the percentage of historical updates that are in the same direction. If all updates from the window are in the same direction then the update will be multiplied by θ. If all updates from the window are in the opposite direction from the update being considered then there will be no change applied to the weights.
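The following sketch shows one way to implement this candidate ƒ function (equation 11), treating its result as the full update applied to the weight, which is how its effect is described above; the value of θ is a placeholder, since the text leaves the constants to future optimization:

    # Sketch of equation (11). 'z' is the current update, 'recent' is the list of the
    # k most recent updates, and 'n' is the fraction of them sharing z's direction.
    def scaled_f(z, recent, theta=2.0):     # theta = 2.0 is an arbitrary placeholder
        if not recent or z == 0:
            return z
        n = sum(1 for d in recent if d * z > 0) / len(recent)
        if n < 0.5:
            return 2 * z * n                # diminish the update
        return z * (1 + (2 * n - 1) * (theta - 1))   # scale up to z * theta when n = 1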

In addition to the investigation of Windowed Momentum, the behavior related to sample presentation order merits further analysis. Larger and more complex data sets, or data sets with large (10+) numbers of classes, may require a different style of algorithm to improve performance.

One alternative ƒ function that was tested used a history that was diminished over time. An update from n time steps ago was decayed by a factor of 0.95^n. This produced poor results, but a full examination and optimization phase was not performed.

Finally, the Windowed Momentum algorithm can be altered to use the historical information for the weight updates. Instead of comparing the average from the previous k updates to the current update, the average can be used in place of the current update.


7. Bibliography

1. Jacobs, Robert A., “Increased Rates of Convergence Through Learning Rate Adaption”, Neural Networks, Vol. 1, pp 295-307, 1988.

2. Leonard, J. and Kramer, M. A.: Improvement of the Backpropagation Algorithm for Training Neural Networks, Computers Chem. Engng., Volume 14, No. 3, pp 337-341, 1990.

3. Minai, A. A., and Williams, R. D., Acceleration of Back-Propagation Through Learning Rate and Momentum Adaptation, in International Joint Conference on Neural Networks, IEEE, pp 676-679, 1990.

4. Schiffmann, W., Joost, M., and Werner, R., "Comparison of Optimized Backprop Algorithms", Artificial Neural Networks. European Symposium, D-Facto Publications, Brussels, Belgium, 1993.

5. Silva, Fernando M., & Almeida, Luis B.: “Speeding up Backpropagation”, Advanced Neural Computers, Eckmiller R. (Editor), page 151-158, 1990.

6. Tollenaere, Tom, “SuperSAB: Fast Adaptive Backpropagation with Good Scaling Properties”, Neural Networks, Vol. 3, pp 561-573, 1990.

7. Wilamowski, Bogdan W., Chen, Yixin, and Malinowski, Aleksander, “Efficient Algorithm for Training Neural Networks with one Hidden Layer”, Proceedings on the International Conference on Neural Networks, San Diego, CA, 1997.

8. Ampazis, N., and Perantonis, S. J., "Levenberg-Marquardt Algorithm with Adaptive Momentum for the Efficient Training of Feedforward Networks", in International Joint Conference on Neural Networks, IEEE, 2000.

9. Hagiwara, M., Theoretical Derivation of Momentum Term in Back-Propagation, in International Joint Conference on Neural Networks, IEEE, pp682-686, 1992.

10. Pearlmutter, Barak A., Gradient descent: Second-order momentum and saturating error. In Advances in Neural Information Processing Systems 4, pp 887-894. Morgan Kaufmann, 1992.

11. Rattray, M., and Saad, D., The Dynamics of Matrix Momentum, in Proceedings of the Eighth International Conference on Neural Networks, New York, 2 vols, pp 183-188, 1998.

12. Scarpetta, S., Rattray, M., and Saad, D., Natural Gradient Matrix Momentum, in Proceedings of the Ninth International Conference on Neural Networks, The Institution of Electrical Engineers, London, pp 43-48, 1999.


13. Swanston, D. J., Bishop, J. M., and Mitchell, R. J., "Simple Adaptive Momentum: new algorithm for training multilayer perceptrons", Electronics Letters, Vol. 30, pp 1498-1500, 1994.

14. Wiegerinck, W., Komoda, A., and Heskes, T., Stochastic dynamics of learning with momentum in neural networks, Journal of Physics A, Vol. 27, pp 4425-4437, 1994.

15. Schraudolph, Nicol N., “Fast Second-Order Gradient Descent via O(n) Curvature Matrix-Vector Products”, Neural Computation 2000.

16. Qiu, G., Varley, M. R., and Terrell, T. J., Accelerated Training of Backpropagation Networks by Using Adaptive Momentum Step, IEE Electronics Letters, Vol. 28, No. 4, pp 377-379, 1992.

17. Wilson, D. Randall, and Tony R. Martinez, The Inefficiency of Batch Training for Large Training Sets, In Proceedings of the International Joint Conference on Neural Networks (IJCNN2000), Vol. II, pp. 113-117, July 2000.

18. Mitchell, Tom M., Machine Learning, McGraw-Hill, Boston, MA, 1997. p 119

19. Magoulas, G.D., Androulakis, G. S., and Vrahatis, M.N., Improving the Convergence of the Backpropagation Algorithm Using Learning Rate Adaptation Methods., in Neural Computation, Vol. 11, pp 1769-1796, 1999, MIT Press.

20. Cybenko, G., Approximation by Superpositions of a sigmoidal function, Mathematical Control Signals Systems, Vol. 2, pp 303-314, 1989.

21. Hornik, K., Stinchcombe, M., and White, H., Multilayer feedforward networks are universal approximators, Neural Networks, Vol. 2, pp 359-366, 1989.

22. Funahashi, K.-I., On the approximate realization of continuous mappings by neural networks, Neural Networks,Vol. 2, pp 183-192, 1989.

23. Hahn, Karl, “Frequency of Letters”, English Letter Usage Statistics using as a sample, “A Tale of Two Cities” by Charles Dickens, Usenet Sci.Crypt, 4 Aug 1994. (statistics archived at http://www.arachnaut.org/archive/freq.html)

24. Nichols, Randall K., Classical Cryptography Course, Aegean Park Press, Laguna Hills, CA, 1996.
