Stochastic Gradient Descent Learning and the Backpropagation Algorithm

Stochastic Gradient Descent Learning and the Backpropagation Algorithm Oliver K. Ernst Department of Physics University of California, San Diego La Jolla, CA 92093-0354 [email protected] Abstract Many learning rules minimize an error or energy function, both in supervised and unsupervised learning. We review common learning rules and their relation to gradient and stochastic gradient descent methods. Recent work generalizes the mean square error rule for supervised learning to an Ising-like energy function. For a specific set of parameters, the energy and error landscapes are compared, and convergence behavior is examined in the context of single linear neuron experiments. We discuss the physical interpretation of the energy function, and the limitations of this description. The backpropgation algorithm is modified to accommodate the energy function, and numerical simulations demonstrate that the learning rule captures the distribution of the network’s inputs. 1 Gradient and Stochastic Gradient Descent A learning rule is an algorithm for updating the weights of a network in order to achieve a particular goal. One particular common goal is to minimize a error function associated with the network, also referred to as an objective or cost function. Let Q(w) denote the error function associated with a network with connection weights w. A first approach to minimize the error is the method of gradient descent, where at each iteration a step is taken in the direction corresponding to Q(w). The learning rule that follows is −∇ w := w η Q(w); (1) − r where η is a constant parameter called the learning rate. Consider being given a set of n observations xi; ti , where i = 1; 2:::n. Each observation consists of inputs x into the network and correspond- ingf outputsg t. A particular case of gradient descent may be formulated for an error function of the form n X Q(w) = Qi(w); (2) i=1 where Qi(w) is the error corresponding to the i-th observation. In this case, the learning rule becomes n X w := w η Qi(w): (3) − r i=1 For most error functions, this standard (or batch) gradient descent method is guaranteed to converge to a local minimum, provided one exists and the learning rate is small. Furthermore, it does so along the fastest route, taking steps at every iteration perpendicular to the contour lines. A key challenge is computational power - evaluating the sum in 3 may be computationally intensive, particularly for large training sets. A more economical method for minimizing Q is that of stochastic 1 (or on-line) gradient descent. At each iteration, the gradient of Q(w) is approximated by the gradient of the single observation’s associated error, Qi(w). The learning rule at each step becomes w := w η Qi(w); (4) − r where we implicitly iterate over the data set, with the possibility of numerous passes. While at each iteration, the algorithm takes a step in a direction perpendicular to the contour lines for that observations error function Qi, it does not necessarily move perpendicular to the contour lines of Q. Furthermore convergence to the minimum of Q is not guaranteed, further demonstrated in Sec- tion 2.1. 1.1 Mean Square Error (MSE) A logical choice for the error function is the mean square error, n n X 1 X 2 E(w) = Ei(w) = ti oi(w) ; (5) 2 jj − jj i=1 i=1 where ti is the expected outcome of the network for the i-th observation, and oi(w) is the actual output of the network, and z 2 denotes the 2-norm [2]. jj jj As an illustrative example, consider the case of a network consisting of a single linear neuron. Let > xi denote the vector of inputs, w the weights corresponding to each input, and hi = w xi the weighted sum of neuron’s inputs. For simplicity, assume a linear activation function g(hi) = hi > such that the actual output of the neuron is yi = g(hi) = hi = w xi. Given a set of n inputs and expected outputs xi; ti , the mean square error thus becomes f g n n X 1 X > 2 E(w) = Ei(w) = (ti w xi) : (6) 2 − i=1 i=1 Note that since ti; yi we may drop the absolute value in 5. The learning rule from the stochastic gradient descent method2 R 4 becomes > w := w + η(ti w xi)xi: (7) − 2 Energy Cost Function Begin by expanding the MSE for a single observation of a linear neuron as 1 > 2 1 2 > 1 > > Ei(w) = (ti w xi) = t tiw xi + w xix w: (8) 2 − 2 i − 2 i Define a similar error function > > i(w) = ctiw xi bw Vw; (9) E − − n×n > where c; b are constants and V . Note that for c = 1; b = 1=2; V = xixi , we 2 R 2 R 2 − recover the MSE function, up to a constant term ti =2 which does not affect the learning rule [6]. The learning rule associated with this error function is > w := w + η ctixi + b(V + V )w ; (10) derived using 4 and the identity @x>Bx=@x = (B + B>)x. This work explores a particular choice for the parameters b; c; V, and it’s relation to the MSE rule. As before for the MSE, let c = 1; b = 1=2. Note that the covariance matrix for general random vector x with mean x is − > > Σxx = xx x x : (11) h i − As we iterate over data pairs xi; ti in stochastic gradient descent, at each step we add a correction f g term to the weights according to 10. This suggests that if the covariance matrix Σxx were known, 2 > > > an intuitive choice for the learning rule is to replace V = xx by V = xx = Σxx + xx . The error function and learning rule then become h i > 1 > > i(w) = tiw xi + w Σxx + xx w; (12) E − 2 1 > > w := w + η tixi (Σxx + Σ + 2 x x )w : (13) − 2 xx This similar form has been discussed previously [6] - we briefly depart to clear up several mis- conceptions. It has been pointed out that this error for resembles the energy of the Ising model in statistical mechanics. This view is limited in several points. In 9, the analogous role of the state of the system is played by w, where generally elements wi , while in the Ising model the state is 2 R described by a vector σ where σi 1; 1 . This suggests that an analogy to a continuous spin Ising model may be more appropriate.2 The {− constantsg c; b would play the analogous roles of the magnetic moment and spin-spin coupling strength, respectively, and tixi in the first term the local magnetic field. The role of the matrix V , however, presents a refutation of this analogy. In the Ising model, the structure of the matrix designates which spins in the lattice couple to another. Several conditions are physically imposed upon the matrix, in particular that it is symmetric (spins cannot couple in only one direction) and that diagonal elements are zero (spins cannot couple with themselves). The > choice of V = xixi violates the second condition, and in general the symmetry condition need not be met. Nonetheless, the remainder of this work maintains the title of energy function for 9 solely for clarity, rather than a physical interpretation. It has been suggested [6] that Hebb’s and Oja’s rules may be recovered for particular choices of b; c; V. However, this claim is based on poor terminology, as these are unsupervised learning rules, while the energy form is clearly a supervised method. The true Hebb’s rule may indeed be understood as maximizing the variance of the output, or iden- tically minimizing E(w) = (w>x)2 [2]. On the other hand, it has been extensively shown that Oja’s rule may not be expressed−h as thei gradient of any function [7]. As a brief summary, re- call that a conservative vector field satisfies @(∆w)i=@wj = @(∆w)j=@wi. Oja’s learning rule is 2 2 ∆w = (x wy)y. Since @(yxi y wi)=@wj = xixj 2wiyxj δijy , the last condition is clearly not met.− Consequently the− field is not conservative− and equivalently− can not be interpreted as the gradient of some function [2]. 2.1 Single Neuron Comparisons of the Energy and MSE Functions As a simple comparison of the energy 12 and the MSE 6 error functions, consider the case of a 2 single linear neuron with two inputs, i.e. x; w . Given a large set of observations xi; ti , consider first the case where there exists a unique2 R solution w~ in weight space where thef MSEg is zero. In the implementation, this corresponds to generating the expected outputs from given inputs as ti = xi;0w~0 + xi;1w~1 for some fixed w~ . Figure 1 shows contour plots of the energy and MSE surfaces in w-space. The MSE surface for a single observation pair is shown in Figure 1a, and is the familiar trench shape. Summing over all observations to attain the total MSE function yields the bowl shape shown in Figure 1b. A sample stochastic gradient descent path is shown to converge to the exact solution, indicated by the green ‘X’. Given sufficient passes over the data set, any desired precision may be attained. The energy function surface for the same single observation pair is shown in 1c, and is a bowl shape similar to the total error function.

Stochastic Gradient Descent Learning and the Backpropagation Algorithm

Training Autoencoders by Alternating Minimization

Learning to Learn by Gradient Descent by Gradient Descent

Training Neural Networks Without Gradients: a Scalable ADMM Approach

Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments

Arxiv:2107.04562V1 [Stat.ML] 9 Jul 2021 ¯ ¯ Θ∗ = Arg Min `(Θ) Where `(Θ) = `(Yi, Fθ(Xi)) + R(Θ)

CSE 152: Computer Vision Manmohan Chandraker

Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches

GEE: a Gradient-Based Explainable Variational Autoencoder for Network Anomaly Detection

Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder

Gradient Descent (PDF)

Training Plastic Neural Networks with Backpropagation

Simultaneous Unsupervised and Supervised Learning of Cognitive