Boosted Backpropagation Learning for Training Deep Modular Networks
Alexander Grubb [email protected]
J. Andrew Bagnell [email protected]
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Divide-and-conquer is key to building sophisticated learning machines: hard problems are solved by composing a network of modules that solve simpler problems (LeCun et al., 1998; Rohde, 2002; Bradley, 2009). Many such existing systems rely on learning algorithms which are based on simple parametric gradient descent, where the parametrization must be predetermined, or more specialized per-application algorithms which are usually ad-hoc and complicated. We present a novel approach for training generic modular networks that uses two existing techniques: the error propagation strategy of backpropagation and more recent research on descent in spaces of functions (Mason et al., 1999; Scholkopf & Smola, 2001). Combining these two methods of optimization gives a simple algorithm for training heterogeneous networks of functional modules using simple gradient propagation mechanics and established learning algorithms. The resulting separation of concerns between learning individual modules and error propagation mechanics eases implementation, enables a larger class of modular learning strategies, and allows per-module control of complexity/regularization. We derive and demonstrate this functional backpropagation and contrast it with traditional gradient descent in parameter space, observing that in our example domain the method is significantly more robust to local optima.

1. Introduction

For difficult learning problems that necessitate complex internal structure, it is common to use an estimator which is itself a network of simpler modules. Approaches leveraging such modular networks have been successfully applied to real-world problems like natural language processing (NLP) (Lawrence et al., 2000; Rohde, 2002), optical character recognition (OCR) (LeCun et al., 1998; Hinton et al., 2006), and robotics (Bradley, 2009; Zucker et al., 2010). Figure 1 shows an example network for a robotic autonomy system where imitation learning is used for feedback.

These deep networks are typically composed of layers of learning modules with multiple inputs and outputs, along with various transforming modules, e.g. the activation functions typically found in the neural network literature, with the end goal of globally optimizing network behavior to perform a given task. These constructions offer a number of advantages over single learning modules, such as the ability to compactly represent highly non-linear hypotheses. A modular network is also a powerful method of representing and building in prior knowledge about problem structure and inherent sub-problems (Bradley, 2009).

Some approaches to network optimization rely on strictly local information, training each module using either synthetic or collected data specific to the function of that module. This is a common approach in NLP and vision systems, where modules correlate to individual tasks such as part-of-speech classification or image segmentation. The problem with this and other local training methods is the lack of end-to-end optimization of the system as a whole, which can lead to a compounding of errors and a degradation in performance. These local-only training algorithms can, however, be useful as good initializations prior to global optimization.
A traditional approach to global network optimization is the well-studied technique of backpropagation (Rumelhart et al., 1986; Werbos, 1994), which has been used for neural network training for over two decades. While initially used for training acyclic networks, extensions for recurrent and time-varying networks (Werbos, 1994) have been developed. Backpropagation solves the problem of compounding errors between interacting modules by propagating error information throughout a network, allowing for end-to-end optimization with respect to a global measure of error. Further, it can be interpreted as a completely modular (Bottou & Gallinari, 1991), object-oriented approach to semi-automatic differentiation that provides a separation of concerns between modules in the network.

Other approaches for complete optimization of networks (Hinton et al., 2006; Larochelle et al., 2009) have also shown promise as alternatives to backpropagation, but many of these algorithms are restricted to specific system architectures and often rely upon a "fine-tuning" step based on backpropagation. There have, however, been compelling results (Bengio et al., 2007) on the usefulness of local module training as an initial optimization step, allowing for rapid learning prior to the traditionally slower global optimization step.

Figure 1. Modular network from the UPI perception and planning system for an off-road autonomous vehicle. Image courtesy (Bradley, 2009).

The basic backpropagation algorithm has previously been used to provide error signals for gradient descent in parameter space, sometimes making network optimization sensitive to the specific parametrization chosen. Recently, powerful methods for performing gradient descent directly in a space of functions have been developed, both for Reproducing Kernel Hilbert spaces (Scholkopf & Smola, 2001) and for Euclidean function spaces (Mason et al., 1999; Friedman, 2000).

The former, known as kernel methods, are well studied in the literature and have been shown to be a powerful means for learning non-linear hypotheses. The latter methods have been shown to be a generalization of the AdaBoost algorithm (Freund & Schapire, 1996), another powerful non-linear learning method where complex hypotheses are built from arbitrary weak learners.

In the following sections, we present a method for combining functional gradient descent with backpropagation. Just as backpropagation allows a separation of concerns between modules, the proposed approach cleanly separates the problem of credit assignment for modules in the network from the problem of learning. This separation allows a broader class of learning machines to be applied within the network architecture than standard backpropagation enables, and it allows complexity control and generalization performance to be managed independently by each module in the network, preventing the usual combinatorial search over all modules' internal complexities simultaneously. The approach further elucidates the notion of structural local optima (minima that hold in the space of functions and hence are tied to the modular structure) as contrasted with parametric local optima, which are "accidents" of the chosen parameterization.

We have selected Euclidean functional gradients because of the flexibility provided in choosing base learners and the simplicity of implementing the algorithm in a modular manner. We begin by briefly reviewing Euclidean functional gradient descent, followed by the modified backpropagation algorithm for functional gradients. Following that, we present a comparison of parameterized gradient descent and our functional gradient based method.

2. Euclidean Functional Gradient Descent

In the Euclidean function optimization setting, we seek to minimize a cost functional R[f], defined over a set of N sampled points \{x_n\}_{n=1}^N and accompanying loss functions \{l_n\}_{n=1}^N defined over possible predictions,

    R[f] = \sum_{n=1}^{N} l_n(f(x_n)),

by searching over a Euclidean function space F (an L^2 space of square-integrable functions) of possible functions f.

The desired minimizing function for this cost functional can be found using a steepest descent optimization procedure in function space directly, in contrast to parameterized gradient descent, where the gradient is evaluated with respect to the parameters of the function. The functional gradient of R[f] in this Euclidean function space is given as (Gelfand & Fomin, 2000):

    \nabla_f R[f] = \sum_{n=1}^{N} \nabla_f l_n(f(x_n)) = \sum_{n=1}^{N} l_n'(f(x_n)) \, \delta_{x_n}

using the chain rule and the fact that \nabla_f f(x_n) = \delta_{x_n} (Ratliff et al., 2009), where \delta_{x_n} is the Dirac delta function centered at x_n. The resulting gradient is itself a function composed of the sum of zero-width impulses centered at the points x_n, each scaled by the derivative of the corresponding loss, l_n'(f(x_n)).

Algorithm 1 Projected Functional Gradient Descent
  Given: initial function value f_0, step size schedule \{\alpha_t\}_{t=1}^T
  for t = 1, ..., T do
    Compute gradient \nabla_f R[f].
    Project gradient to hypothesis space H using least squares projection to find h^*.
    Update f: f_t \leftarrow f_{t-1} - \alpha_t h^*.
  end for
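As a concrete illustration of Algorithm 1, the following is a minimal sketch of projected functional gradient descent for a squared-error cost R[f] = \sum_n \frac{1}{2}(f(x_n) - y_n)^2, so that l_n'(f(x_n)) = f(x_n) - y_n. The choice of shallow regression trees as the hypothesis space H, and all function names below, are assumptions made for illustration rather than details taken from the paper.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def projected_functional_gradient_descent(X, y, T=100, step_size=0.1, max_depth=3):
    """Algorithm 1 for R[f] = sum_n 0.5*(f(x_n) - y_n)^2, starting from f_0 = 0."""
    ensemble = []                # f_t is stored as a list of (-alpha_t, h_t) pairs
    f_of_x = np.zeros(len(y))    # current predictions f_t(x_n) at the training points

    for t in range(T):
        # Pointwise functional gradient: l_n'(f(x_n)) at each sample x_n.
        grad = f_of_x - y

        # Least-squares projection of the gradient onto the hypothesis space H
        # (here H = shallow regression trees, an illustrative assumption).
        h_star = DecisionTreeRegressor(max_depth=max_depth)
        h_star.fit(X, grad)

        # Update f: f_t <- f_{t-1} - alpha_t * h_star.
        ensemble.append((-step_size, h_star))
        f_of_x -= step_size * h_star.predict(X)

    return ensemble

def evaluate(ensemble, X):
    """Evaluate the learned function f_T at new inputs."""
    out = np.zeros(X.shape[0])
    for weight, h in ensemble:
        out += weight * h.predict(X)
    return out

# Example usage on a toy 1-D regression problem:
# X = np.linspace(-3, 3, 200).reshape(-1, 1); y = np.sin(X).ravel()
# f = projected_functional_gradient_descent(X, y, T=50)
# predictions = evaluate(f, X)

Swapping in a different differentiable loss changes only the line that computes grad; the projection and update steps stay the same, which is what makes the same boosting machinery reusable across modules.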
…space and reproducing kernel Hilbert space (RKHS). In this setting we have a layered network of functions f_k, with x_{nk} = f_k(x_{n(k-1)}), where n \in [1, N] indexes training examples and k \in [1, K] indexes layers. Here x_{nk} represents the output of layer k for exemplar n, with x_{n0} defined to be the training input and x_{nK} the network output.

We seek to optimize a subset F \subseteq \{f_k\}_{k=1}^K of these functions directly, while the rest of the functions f_k \notin F remain fixed. These fixed functions are arbitrary activation or intermediate transformation functions in the network, and can range from a simple sigmoid function to an A* planner.

The optimization of F is with respect to a set of loss functions defined over the network outputs, l_n(x_{nK}).
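To make this layered setting concrete, here is a rough sketch of the kind of module interface it implies: a chain x_{nk} = f_k(x_{n(k-1)}) in which some layers belong to the optimized subset F (represented here as boosted ensembles) while others, such as a sigmoid activation, stay fixed. The class names and interfaces are hypothetical and only mirror the notation above; they are not the paper's implementation.

import numpy as np

class FixedSigmoid:
    """A fixed transforming module (an f_k not in F): nothing to learn."""
    learnable = False

    def forward(self, x):
        return 1.0 / (1.0 + np.exp(-x))

class BoostedModule:
    """A learnable module (an f_k in F) represented as a boosted ensemble
    f_k(x) = sum_t alpha_t * h_t(x), where each h_t is any callable weak learner."""
    learnable = True

    def __init__(self):
        self.ensemble = []       # list of (alpha_t, h_t) pairs

    def forward(self, x):
        return sum(alpha * h(x) for alpha, h in self.ensemble)

def forward_pass(layers, x_n0):
    """Compute x_nk = f_k(x_n(k-1)) for k = 1, ..., K and return all layer outputs."""
    outputs = [x_n0]
    for layer in layers:
        outputs.append(layer.forward(outputs[-1]))
    return outputs               # outputs[-1] is the network output x_nK

# Example: K = 3 layers, with only the first and last in the optimized subset F.
layers = [BoostedModule(), FixedSigmoid(), BoostedModule()]
x_nK = forward_pass(layers, np.array([0.5, -1.2]))[-1]
# The global objective is then R = sum_n l_n(x_nK) for per-example losses l_n.

Caching the intermediate outputs x_nk in forward_pass is what would later allow an error signal at x_nK to be propagated back to each learnable module.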