
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Learning in the Machine: Random Backpropagation and the Deep Learning Channel (Extended Abstract)∗

Pierre Baldi1†, Peter Sadowski1 and Zhiqin Lu2
1Department of Computer Science, University of California, Irvine
2Department of Mathematics, University of California, Irvine
[email protected], [email protected], [email protected]

∗This paper is an extended abstract of the article Baldi et al., Artificial Intelligence, 260:1–35, 2018.
†Contact Author

Abstract

Random backpropagation (RBP) is a variant of the backpropagation algorithm for training neural networks, where the transpose of the forward matrices is replaced by fixed random matrices in the calculation of the weight updates. It is remarkable both because of its effectiveness, in spite of using random matrices to communicate error information, and because it completely removes the requirement of maintaining symmetric weights in a physical neural system. To better understand RBP, we compare different algorithms in terms of the information available locally to each neuron. In the process, we derive several alternatives to RBP, including skipped RBP (SRBP), adaptive RBP (ARBP), and sparse RBP, and study their behavior through simulations. These simulations show that many variants are also robust deep learning algorithms, but that the derivative of the transfer function is important in the learning rule. Finally, we prove several mathematical results, including the convergence to fixed points of linear chains of arbitrary length, the convergence to fixed points of linear autoencoders with decorrelated data, the long-term existence of solutions for linear systems with a single hidden layer and convergence in special cases, and the convergence to fixed points of non-linear chains when the derivative of the activation functions is included.

1 Introduction

Modern artificial neural networks are optimized using gradient-based algorithms. Gradients can be computed relatively efficiently via the backpropagation algorithm, but the gradient at each weight generally depends on both the data and all other weights in the network. This high degree of interdependence costs energy, both in biological neural systems and in artificial neural networks simulated using digital hardware. Furthermore, the calculation of the gradients in the backpropagation algorithm includes the forward weight matrices, a requirement known as the weight symmetry problem that has long been an objection to the hypothesis that biological neural systems learn via gradient descent (e.g. [Crick, 1989]). New learning algorithms that do not require full gradient calculations could lead to more efficient neuromorphic hardware and could help explain learning in the brain.

Are gradients really needed for learning in deep neural networks (NNs)? Recent work suggests they are not (e.g. [Jaderberg et al., 2017]). In the random backpropagation algorithm (RBP) [Lillicrap et al., 2016], deep layers of a NN learn useful representations even when the forward weight matrices are replaced with fixed, random matrices in the backpropagation equations. This algorithm differs from greedy unsupervised layer-wise approaches [Hinton et al., 2006; Bengio et al., 2007] because the deep weights depend on information about the targets, and it differs from greedy supervised layer-wise approaches [Gilmer et al., 2017; Mostafa et al., 2017] because the deep weights depend on the NN output layer, and hence on all the other weights.

In this work we connect the RBP algorithm to the notion of the deep learning channel that communicates error information from the output layer to the deep hidden layers [Baldi and Sadowski, 2016]. This channel is necessary to converge to critical points of the objective, and can be studied using tools from information theory. We classify learning algorithms by the information that is transmitted along this channel, and our analysis leads to several new learning algorithms, which we analyze through experiments on the MNIST [LeCun et al., 1998], CIFAR-10 [Krizhevsky and Hinton, 2009], and HIGGS [Baldi et al., 2014] benchmark data sets. Furthermore, we prove that these algorithms converge to a global optimum of the objective function for important special cases.

2 Random Backpropagation Algorithms

In this work we consider layered, feed-forward neural networks in which the neurons in layer h are fully connected to the neurons in the previous layer h − 1. The layer activity O^h is computed as a function of the preceding layer as

$$O^h \triangleq f^h(S^h), \qquad S^h \triangleq W^h O^{h-1} \quad \text{for } 1 < h \le L \tag{1}$$

where O^0 = I is the input data and f^h is a non-linear activation function. We focus here on supervised learning with typical output activation functions and loss functions, including linear, sigmoid, and softmax output layers, for which the derivative of the loss E with respect to S^L for a single input-target pair (I, T) is given by

$$\frac{\partial E}{\partial S^L} = O^L - T. \tag{2}$$

The backpropagation algorithm works by first computing the gradients at each neuron recursively, then computing the gradients at the weights. These gradients are then used to update the weights:

$$B^h \triangleq -\frac{\partial E}{\partial S^h} = \begin{cases} T - O^L, & \text{for } h = L\\ (f^h)'\,(W^{h+1})^T B^{h+1}, & \text{for } h < L \end{cases}$$
$$\Delta W^h = -\eta \frac{\partial E}{\partial W^h} = \eta\, O^{h-1} B^h \qquad \text{(BP)}$$

where the derivative (f^h)' is evaluated at O^h, and η is the learning rate. In random backpropagation, the gradients B^h are replaced with a randomized error signal R^h defined recursively by

$$R^h \triangleq \begin{cases} T - O^L, & \text{for } h = L\\ (f^h)'\,C^h R^{h+1}, & \text{for } h < L \end{cases}$$
$$\Delta W^h = \eta\, O^{h-1} R^h \qquad \text{(RBP)}$$

with constant matrices {C^h}_{1≤h≤L} replacing the transpose of the weight matrices in each layer. In skip random backpropagation [Baldi et al., 2018], the randomized error signals are sent directly to the deep layers rather than propagating through each intermediate layer:

$$R^h \triangleq \begin{cases} T - O^L, & \text{for } h = L\\ (f^h)'\,C^h R^L, & \text{for } h < L \end{cases}$$
$$\Delta W^h = \eta\, O^{h-1} R^h \qquad \text{(SRBP)}$$

where the random matrix C^h now connects layer h to the output layer.

Figure 1: The path of the error signal (red) from an output neuron to a deep, hidden neuron in backpropagation (BP), random backpropagation (RBP), and skip random backpropagation (SRBP).

These learning rules can be compared in terms of the information required to update each weight. RBP solves the weight-symmetry problem by removing the dependency of the update on the forward weights in the backpropagation step; the updates still depend on every other weight in the network, but all that information is subsumed by the error signal at the output, T − O^L. In SRBP, we also remove the dependency of ∆W^h on the derivative of the transfer function in the downstream layers, (f^l)', l > h. Despite these differences, we show that these learning algorithms (as well as adaptive variants) still converge to critical points of the objective function in network architectures conducive to mathematical analysis, unlike other alternative deep learning algorithms such as greedy, layer-wise "pre-training."

In addition, we introduce the idea of adaptive random backpropagation (ARBP), where the backpropagation matrices in the learning channel are initialized randomly, then progressively adapted during learning using the product of the corresponding forward and backward signals, so that

$$\Delta C^h = \eta\, O^h R^{h+1}.$$

In this case, the forward channel becomes the learning channel for the backward weights. This adaptive behavior can also be used with the skip version (ASRBP).

In all these algorithms, the weight updates in the last layer are equivalent to those of BP, so BP = RBP = SRBP = ARBP = ASRBP in the top layer. They differ only in the way they train the hidden layers. In experiments, we also compare to the case where only the top layer is trained, and the hidden layers remain fixed after a random initialization.
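To make these update rules concrete, the following is a minimal NumPy sketch (our illustration, not the authors' implementation) of the forward pass of Eq. (1) and of the BP, RBP, SRBP, and ARBP updates described above. The tanh hidden units, sigmoid output, weight scaling, outer-product form of the updates, and the shapes chosen for the learning-channel matrices C^h are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(sizes):
    """Forward weights W[h]: layer h-1 -> layer h (biases omitted for brevity)."""
    return [None] + [rng.normal(0.0, np.sqrt(1.0 / m), size=(n, m))
                     for m, n in zip(sizes[:-1], sizes[1:])]

def init_channel(sizes, mode="RBP"):
    """Fixed random learning-channel matrices C[h]: layer h+1 -> h (RBP)
    or output -> h (SRBP). Shapes and scaling are illustrative assumptions."""
    L = len(sizes) - 1
    C = [None] * (L + 1)
    for h in range(1, L):
        src = sizes[h + 1] if mode == "RBP" else sizes[L]
        C[h] = rng.normal(0.0, np.sqrt(1.0 / src), size=(sizes[h], src))
    return C

def forward(W, x):
    """Eq. (1): O^h = f^h(S^h), S^h = W^h O^{h-1}, with O^0 = x.
    tanh hidden layers and a sigmoid output layer (an assumed choice)."""
    O = [x]
    for h in range(1, len(W)):
        S = W[h] @ O[h - 1]
        O.append(np.tanh(S) if h < len(W) - 1 else 1.0 / (1.0 + np.exp(-S)))
    return O

def error_signals(W, C, O, T, mode="BP"):
    """B^h (BP) or R^h (RBP/SRBP); the top-layer signal is T - O^L by Eq. (2)."""
    L = len(W) - 1
    D = [None] * (L + 1)
    D[L] = T - O[L]
    for h in range(L - 1, 0, -1):
        deriv = 1.0 - O[h] ** 2                     # (f^h)' for tanh, evaluated at O^h
        if mode == "BP":
            D[h] = deriv * (W[h + 1].T @ D[h + 1])  # transpose of the forward weights
        elif mode == "RBP":
            D[h] = deriv * (C[h] @ D[h + 1])        # fixed random matrix, layer h+1 -> h
        elif mode == "SRBP":
            D[h] = deriv * (C[h] @ D[L])            # fixed random matrix, output -> h
    return D

def sgd_step(W, C, x, T, eta=0.05, mode="BP"):
    """One online update, the outer-product form of Delta W^h = eta O^{h-1} B^h
    (or eta O^{h-1} R^h) in the text."""
    O = forward(W, x)
    D = error_signals(W, C, O, T, mode=mode)
    for h in range(1, len(W)):
        W[h] += eta * np.outer(D[h], O[h - 1])
    return W

def arbp_channel_step(C, O, D, eta=0.05):
    """ARBP: adapt the channel matrices with Delta C^h = eta O^h R^{h+1}."""
    for h in range(1, len(C) - 1):
        C[h] += eta * np.outer(O[h], D[h + 1])
    return C

# Hypothetical toy usage: a [8, 16, 16, 1] network updated with SRBP on one example.
sizes = [8, 16, 16, 1]
W, C = init_weights(sizes), init_channel(sizes, mode="SRBP")
x, T = rng.normal(size=8), np.array([1.0])
W = sgd_step(W, C, x, T, mode="SRBP")
```

Note that with a sigmoid (or softmax) output paired with its matching loss, the top-layer signal reduces to T − O^L by Eq. (2), which is why no output-layer derivative appears in error_signals.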
3 Results

3.1 Mathematical Results
Through mathematical analysis, we prove that RBP and SRBP converge to a fixed point corresponding to the global optimum of the training set loss for the following neural network architectures, starting from almost any set of initial weights (except for a set of measure 0). Proofs for the case of ARBP are provided in [Baldi et al., 2017].

• A chain of single linear neurons of arbitrary length ([1, . . . , 1]).

• An expansive architecture of linear neurons [1, N, 1].

• A compressive architecture of linear neurons [N, 1, N].

• A simple [1, 1, 1] architecture, with a power function non-linearity in the hidden neuron of the form f(x) = x^µ. Setting µ = 1/3, for instance, gives an S-shaped activation. Furthermore, we show that this system generally does not converge for µ ≠ 1 when the derivative of the transfer function is omitted from the learning rule.

For the linear architectures, under a set of standard assumptions, we can derive a set of polynomial, autonomous, ordinary differential equations (ODEs) for the average time evolution of the weights under the different learning algorithms. As soon as there is more than one variable and the system is non-linear, there is no general theory to understand the corresponding behavior. In fact, even in two dimensions, the problem of understanding the upper bound on the number and relative position of the limit cycles of a system of the form dx/dt = P(x, y) and dy/dt = Q(x, y), where P and Q are polynomials of degree n, is open; indeed, this is Hilbert's 16th problem in the field of dynamical systems [Smale, 1998; Ilyashenko, 2002]. Thus, for the general N_0, . . . , N_L network architectures used in practice, we turn to experiments.

Figure 2: Vector field of RBP dynamics in the [1, 1, 1] linear case with two network weights, a1 and a2, plotted on the x and y axes, and a fixed random backprop weight c1 = 1. The critical points of the learning equations correspond to the two hyperbolas, and all critical points are fixed points and global minima of the error function. Arrows are colored according to the value of dP/dt = d(a1 a2)/dt, showing how the critical points inside the parabola a2 = −a1^2/c1 are unstable. All other critical points are attractors. Reversing the sign of c1 leads to a reflection across both the a1 and a2 axes.
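To make these averaged equations concrete in the simplest case, consider the [1, 1, 1] linear network of Figure 2, with hidden weight a1, output weight a2, and fixed random backward weight c1. The following is an illustrative sketch rather than the full derivation, and assumes online RBP updates with a small learning rate and normalized data statistics E[I^2] = 1 and E[IT] = α. Averaging the RBP updates ∆a1 = η c1 I (T − a1 a2 I) and ∆a2 = η (a1 I)(T − a1 a2 I) then gives

$$\frac{da_1}{dt} = \eta\, c_1 (\alpha - a_1 a_2), \qquad \frac{da_2}{dt} = \eta\, a_1 (\alpha - a_1 a_2),$$

and for the product P = a1 a2,

$$\frac{dP}{dt} = a_2\frac{da_1}{dt} + a_1\frac{da_2}{dt} = \eta\,(\alpha - P)\left(a_1^2 + c_1 a_2\right).$$

Under these assumptions the critical points satisfy a1 a2 = α, the hyperbola of global minima of the quadratic error, and the factor a1^2 + c1 a2 determines their stability: critical points attract where a1^2 + c1 a2 > 0 and repel where a1^2 + c1 a2 < 0, that is, inside the parabola a2 = −a1^2/c1, consistent with the vector field of Figure 2.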

Figure 3: MNIST test accuracy for backpropagation (BP), random backpropagation (RBP), skip random backpropagation (SRBP), and a skipped variant of backpropagation (SBP) in a 5-layer architecture. BP, RBP, and SRBP converge to high accuracy while training the top layer only (Top only) does not. Variants without transfer derivatives in the learning rule (dashed lines) also fail.

3.2 Empirical Results
We performed extensive experiments on variants of random backpropagation algorithms. We tested the following variations through experiments on the MNIST, CIFAR-10, and HIGGS benchmark classification data sets, using architectures with 5-10 layers of tanh or relu neurons. The overall conclusion is that RBP and its variants are surprisingly robust to these variations.

• Activations: Random backpropagation algorithms work with different transfer functions, including linear, logistic, tanh, and relu units.

• Derivatives: The derivative of the transfer function, f', appears to be an important factor in the learning rule. Our mathematical analysis found a case where RBP only converges if this term is included, and in experiments with real data we observe similar results (Figure 3).

• Adaptation: ARBP and ASRBP, the adaptive versions, are both able to learn, along with a variant reminiscent of Spike-Time Dependent Plasticity (STDP).

• Convolutions: RBP and SRBP can both be used to train convolutional neural network architectures on CIFAR-10. (Confirmed also by [Nøkland, 2016].)

• Precision: We show that reducing the precision of the error signals {R^h}_{1≤h≤L} leads to graceful degradation, as opposed to catastrophic failure, as the error signals are quantized down to a single bit (Figure 4). Similar results are observed when we limit the precision of the weight updates ∆W^h rather than that of R^h.

• Sparsity: RBP and SRBP work when the random learning-channel matrices {C^h}_{1≤h≤L} are sparse, even if each element is a single bit (Figure 5; a sampling scheme of this kind is sketched below).

• Rank: The performance of both RBP and SRBP degrades as the rank of the matrices decreases to zero (Figure 6).

• Dropout: The dropout algorithm can be used in the learning channel in both RBP and SRBP. This does not appear to have a large impact on learning, and it is unclear whether it has the same regularizing effect.

• Stochasticity: If the matrices of the learning channel are randomly sampled from a Normal or Uniform distribution with mean zero at each stochastic mini-batch update, then performance is poor and similar to training only the top layer.

• Sign-Concordance: If each element of C^h is constrained to have the same sign as the corresponding forward weight in RBP, then the system learns. This is the sign-concordance algorithm explored by Liao et al. [Liao et al., 2016].

• Initialization: If the elements of the matrices of the learning channel in RBP or SRBP are sampled from a uniform or normal distribution with non-zero mean, performance is unchanged. This is also consistent with the sparsity experiments above, where the means of the sampling distributions are not zero.


• Sign: If we remove the sign information from the weight updates, keeping only the absolute value, the system does not learn.

• Weight Cooperation: If a different random backward weight is used to send an error signal to each individual weight, rather than to a hidden neuron which then updates all its incoming weights using the same signal, the system does not learn.

• Fixing Layers: If we fix the top layer(s) of a network (those closest to the output) and only train the lower layers, the network does not learn. This is different from BP.

Figure 4: MNIST test accuracy with low-precision error signals (gradients in the case of BP). The error signals B^h and R^h are quantized on a log scale symmetric around 0. Performance is close to full precision at 7-bit quantization, and progressively degrades as the number of bits is decreased.

Figure 5: MNIST accuracy after training with sparse SRBP, plotted against the sparsity in the random weight matrices {C^h}_{1≤h≤5}. The random backpropagation matrix connecting any two layers is created by sampling each entry using a (0,1) Bernoulli distribution, where each element is 1 with probability p = n/(fan-in) and 0 otherwise. For extreme values of n, sparse SRBP fails: for n = 0, all the backward weights are set to zero and no error signals are sent; for n = 100 all the backward weights are set to 1, and all the neurons in a given layer receive the same error signal. The performance of the algorithm is surprisingly robust in between these extremes.

Figure 6: MNIST accuracy after training with SRBP, plotted against the rank of the random weight matrices in a network with seven hidden layers of 100 tanh units each. Random weight matrices were initialized using a Glorot-scaled uniform distribution, then a rank-r approximation to the matrix was computed using the truncated singular value decomposition. Both SRBP and RBP (not shown) perform best with full-rank matrices in all layers.
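To make the constructions behind Figures 4-6 concrete, here is a minimal NumPy sketch (our illustration, not the authors' code) of sparse, low-rank, and quantized learning-channel variants; the Bernoulli probability, the power-of-two quantization grid, and the helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_channel(rows, cols, n, fan_in=100):
    """Sparse C^h as in Figure 5: each entry is 1 with probability p = n / fan_in
    and 0 otherwise (fan_in = 100 here, matching the 100-unit layers)."""
    p = n / fan_in
    return (rng.random((rows, cols)) < p).astype(float)

def low_rank(C, r):
    """Rank-r approximation of a channel matrix via truncated SVD (Figure 6)."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def quantize_log(R, bits):
    """Quantize an error signal on a log scale symmetric around 0 (Figure 4).
    The base-2 grid and clipping range are illustrative choices."""
    sign = np.sign(R)
    exponent = np.round(np.log2(np.clip(np.abs(R), 1e-8, None)))
    exponent = np.clip(exponent, -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return sign * 2.0 ** exponent

# Hypothetical usage: a sparse 100x10 channel matrix, a rank-3 approximation,
# and a 3-bit quantized error signal.
C = sparse_channel(100, 10, n=20)
C3 = low_rank(rng.normal(size=(100, 10)), r=3)
R = quantize_log(rng.normal(size=10), bits=3)
```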

4 Conclusion
Here we have derived several variants of RBP and studied them through simulations and mathematical analyses. The emerging picture is that many of the variants lead to robust deep learning even without the computation of gradients. These algorithms often lead to slower learning when compared to backpropagation on networks of a fixed size, but they should be useful in the future both to better understand biological neural systems, and to implement new neural physical systems in silicon or other substrates.

Additional variants are studied in two followup papers. [Baldi et al., 2017] considers symmetry issues, such as having a learning channel with an architecture that is not a symmetric version of the forward architecture, or having non-linear units in the learning channel that are similar to the non-linear units of the forward architecture. [Baldi and Sadowski, 2018] connects RBP to the recirculation learning algorithm for autoencoder networks [Hinton and McClelland, 1988], showing that they use the same learning principle. Recirculation is noteworthy because it removes a second major objection to the biological plausibility of backpropagation: the use of fundamentally different types of signals in the forward and backward propagation steps.

The robustness and other properties of these algorithms cry out for explanations and more general principles. We have provided both intuitive and formal explanations for several of these properties. On the mathematical side, polynomial learning rules in linear networks lead to systems of polynomial differential equations. We have shown in several cases that the corresponding ODEs converge to an optimal solution. However, these polynomial systems of ODEs rapidly become complex and, while the results provided are useful, they are not yet complete, thus providing directions for future research.


References

[Baldi and Sadowski, 2016] Pierre Baldi and Peter Sadowski. A theory of local learning, the learning channel, and the optimality of backpropagation. Neural Networks, 83:51–74, 2016.

[Baldi and Sadowski, 2018] Pierre Baldi and Peter Sadowski. Learning in the machine: Recirculation is random backpropagation. Neural Networks, 108:479–494, 2018.

[Baldi et al., 2014] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.

[Baldi et al., 2017] Pierre Baldi, Peter Sadowski, and Zhiqin Lu. Learning in the machine: The symmetries of the deep learning channel. Neural Networks, 95:110–133, 2017.

[Baldi et al., 2018] Pierre Baldi, Peter Sadowski, and Zhiqin Lu. Learning in the machine: Random backpropagation and the deep learning channel. Artificial Intelligence, 260:1–35, 2018.

[Bengio et al., 2007] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.

[Crick, 1989] Francis Crick. The recent excitement about neural networks. Nature, 337(6203):129–132, 1989.

[Gilmer et al., 2017] Justin Gilmer, Colin Raffel, Samuel S. Schoenholz, Maithra Raghu, and Jascha Sohl-Dickstein. Explaining the learning dynamics of direct feedback alignment. 2017.

[Hinton and McClelland, 1988] Geoffrey E. Hinton and James L. McClelland. Learning representations by recirculation. In Neural Information Processing Systems, pages 358–366. New York: American Institute of Physics, 1988.

[Hinton et al., 2006] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[Ilyashenko, 2002] Yu. Ilyashenko. Centennial history of Hilbert's 16th problem. Bulletin of the American Mathematical Society, 39(3):301–354, 2002.

[Jaderberg et al., 2017] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1627–1635. JMLR.org, 2017.

[Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[LeCun et al., 1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Liao et al., 2016] Qianli Liao, Joel Leibo, and Tomaso Poggio. How important is weight symmetry in backpropagation? In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1837–1844, 2016.

[Lillicrap et al., 2016] Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276, November 2016.

[Mostafa et al., 2017] Hesham Mostafa, Vishwajith Ramesh, and Gert Cauwenberghs. Deep supervised learning using local errors. arXiv preprint arXiv:1711.06756, 2017.

[Nøkland, 2016] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045, 2016.

[Smale, 1998] Steve Smale. Mathematical problems for the next century. The Mathematical Intelligencer, 20(2):7–15, 1998.
