
Boosted Backpropagation Learning for Training Deep Modular Networks

Alexander Grubb [email protected]
J. Andrew Bagnell [email protected]
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Divide-and-conquer is key to building sophisticated learning machines: hard problems are solved by composing a network of modules that solve simpler problems (LeCun et al., 1998; Rohde, 2002; Bradley, 2009). Many existing systems rely on learning algorithms which are based on simple parametric gradient descent, where the parametrization must be predetermined, or on more specialized per-application algorithms which are usually ad hoc and complicated. We present a novel approach for training generic modular networks that uses two existing techniques: the error propagation strategy of backpropagation and more recent research on descent in spaces of functions (Mason et al., 1999; Scholkopf & Smola, 2001). Combining these two methods of optimization gives a simple algorithm for training heterogeneous networks of functional modules using simple gradient propagation mechanics and established learning algorithms. The resulting separation of concerns between learning individual modules and error propagation mechanics eases implementation, enables a larger class of modular learning strategies, and allows per-module control of complexity/regularization. We derive and demonstrate this functional backpropagation and contrast it with traditional gradient descent in parameter space, observing that in our example domain the method is significantly more robust to local optima.

1. Introduction

For difficult learning problems that necessitate complex internal structure, it is common to use an estimator which is itself a network of simpler modules. Approaches leveraging such modular networks have been successfully applied to real-world problems like natural language processing (NLP) (Lawrence et al., 2000; Rohde, 2002), optical character recognition (OCR) (LeCun et al., 1998; Hinton et al., 2006), and robotics (Bradley, 2009; Zucker et al., 2010). Figure 1 shows an example network for a robotic autonomy system where imitation learning is used for feedback.

Figure 1. Modular network from the UPI perception and planning system for an off-road autonomous vehicle. Image courtesy of Bradley (2009).

These deep networks are typically composed of layers of learning modules with multiple inputs and outputs, along with various transforming modules, e.g. the activation functions typically found in the neural network literature, with the end goal of globally optimizing network behavior to perform a given task. These constructions offer a number of advantages over single learning modules, such as the ability to compactly represent highly non-linear hypotheses. A modular network is also a powerful way of representing and building in prior knowledge about problem structure and inherent sub-problems (Bradley, 2009).

Some approaches to network optimization rely on strictly local information, training each module using either synthetic or collected data specific to the function of that module. This is a common approach in NLP and vision systems, where modules correspond to individual tasks such as part-of-speech classification or image segmentation. The problem with this and other local training methods is the lack of end-to-end optimization of the system as a whole, which can lead to a compounding of errors and a degradation in performance. These local-only training algorithms can, however, be useful as good initializations prior to global optimization.

A traditional approach to global network optimization is the well-studied technique of backpropagation (Rumelhart et al., 1986; Werbos, 1994), which has been used for neural network training for over two decades. While initially used for training acyclic networks, extensions for recurrent and time-varying networks (Werbos, 1994) have been developed. Backpropagation solves the problem of compounding errors between interacting modules by propagating error information throughout a network, allowing for end-to-end optimization with respect to a global measure of error. Further, it can be interpreted as a completely modular (Bottou & Gallinari, 1991), object-oriented approach to semi-automatic differentiation that provides a separation of concerns between modules in the network.

Other approaches for complete optimization of networks (Hinton et al., 2006; Larochelle et al., 2009) have also shown promise as alternatives to backpropagation, but many of these algorithms are restricted to specific system architectures and, further, often rely upon a "fine-tuning" step based on backpropagation. There have, however, been compelling results (Bengio et al., 2007) on the usefulness of local module training as an initial optimization step, allowing for rapid learning prior to the traditionally slower global optimization step.

The basic backpropagation algorithm has previously been used to provide error signals for gradient descent in parameter space, sometimes making network optimization sensitive to the specific parametrization chosen. Recently, powerful methods for performing gradient descent directly in a space of functions have been developed, both for Reproducing Kernel Hilbert Spaces (Scholkopf & Smola, 2001) and for Euclidean function spaces (Mason et al., 1999; Friedman, 2000). The former, known as kernel methods, are well studied in the literature and have been shown to be powerful means for learning non-linear hypotheses. The latter methods have been shown to be a generalization of the AdaBoost algorithm (Freund & Schapire, 1996), another powerful non-linear learning method in which complex hypotheses are built from arbitrary weak learners.

In the following sections, we present a method for combining functional gradient descent with backpropagation. Just as backpropagation allows a separation of concerns between modules, the proposed approach cleanly separates the problem of credit assignment for modules in the network from the problem of learning. This separation allows a broader class of learning machines to be applied within the network architecture than standard backpropagation enables, and it allows complexity control and generalization performance to be managed independently by each module in the network, preventing the usual combinatorial search over all modules' internal complexities simultaneously. The approach further elucidates the notion of structural local optima (minima that hold in the space of functions and hence are tied to the modular structure) as contrasted with parametric local optima, which are "accidents" of the chosen parameterization.

We have selected Euclidean functional gradient descent because of the flexibility provided in choosing base learners and the simplicity of implementing the algorithm in a modular manner. We begin by briefly reviewing Euclidean functional gradient descent, followed by the modified backpropagation algorithm for functional gradients. Following that, we present a comparison of parameterized gradient descent and our functional gradient based method.
2. Euclidean Functional Gradient Descent

In the Euclidean function optimization setting, we seek to minimize a cost functional R[f], defined over a set of N sampled points {x_n}_{n=1}^N and accompanying loss functions {l_n}_{n=1}^N defined over possible predictions,

R[f] = \sum_{n=1}^{N} l_n(f(x_n))

by searching over a Euclidean function space F (an L2 space of square-integrable functions) of possible functions f.

The desired minimizing function for this cost functional can be found using a steepest descent optimization procedure in function space directly, in contrast to parameterized gradient descent where the gradient is evaluated with respect to the parameters of the function. The functional gradient of R[f] in this Euclidean function space is given as (Gelfand & Fomin, 2000):

\nabla_f R[f] = \sum_{n=1}^{N} \nabla_f l_n(f(x_n)) = \sum_{n=1}^{N} l_n'(f(x_n))\, \delta_{x_n}

using the chain rule and the fact that ∇_f f(x_n) = δ_{x_n} (Ratliff et al., 2009), where δ_{x_n} is the Dirac delta function centered at x_n. The resulting gradient is itself a function composed of a sum of zero-width impulses centered at the points x_n, each scaled by the derivative ∂l_n/∂f(x_n).

Instead of using the explicit functional gradient as the direction for the gradient step, the gradient is projected onto a space of functions H, both to allow for generalization and to constrain the search space to some reasonable hypothesis set. The resulting projected direction h* can be found by minimizing the functional least squares projection of the gradient in L2 function space (Friedman, 2000; Ratliff et al., 2009):

h^* = \arg\min_{h \in H} \sum_{n=1}^{N} \left( h(x_n) - \nabla_f R[f](x_n) \right)^2 \qquad (1)

which is equivalent to the familiar least squares regression problem over the dataset {(x_n, ∇_f R[f](x_n))}.

These projected gradients are then used to repeatedly update the function, giving the gradient update rule for f as f(x) ← f(x) − α h*(x). The final learned function is a sum of gradient steps over time,

f(x) = f_0(x) - \sum_{t=1}^{T} \alpha_t h_t(x)

where h_t(x) is the function representing the gradient step taken at time t, α_t is the corresponding step size, and f_0 is the starting point. A brief description of this procedure, in a manner that generalizes AdaBoost, is given in Algorithm 1 (Mason et al., 1999; Friedman, 2000).

Algorithm 1 Projected Functional Gradient Descent
  Given: initial function value f_0, step size schedule {α_t}_{t=1}^T
  for t = 1, ..., T do
    Compute gradient ∇_f R[f].
    Project gradient onto hypothesis space H using least squares projection to find h*.
    Update f: f_t ← f_{t−1} − α_t h*.
  end for

3. Backpropagation for Functional Gradients

Using the Lagrangian framework previously developed by LeCun (1988), we now present the first part of our contribution: a derivation of backpropagation mechanics for functional gradients, both in Euclidean function space and in a reproducing kernel Hilbert space (RKHS).

In this setting we have a layered network of functions f_k, with x_{nk} = f_k(x_{n(k−1)}), where n ∈ [1, N] indexes training examples and k ∈ [1, K] indexes layers. Here x_{nk} represents the output of layer k for exemplar n, with x_{n0} defined to be the training input and x_{nK} the network output.

We seek to optimize a subset F ⊆ {f_k}_{k=1}^K of these functions directly, while the rest of the functions f_k ∉ F remain fixed. These fixed functions are arbitrary activation or intermediate transformation functions in the network, and can range from a simple sigmoid function to an A* planner.

The optimization of F is with respect to a set of loss functions defined over network outputs, l_n(x_{nK}). We can define the local Lagrange function for example n and the complete Lagrange function as

L_n(F, X_n, \lambda_n) = l_n(x_{nK}) + \sum_{k=1}^{K} \lambda_{nk}^T \left( x_{nk} - f_k(x_{n(k-1)}) \right)

L(F, X, \Lambda) = \sum_{n=1}^{N} L_n(F, X_n, \lambda_n)

with Lagrange multipliers λ_{nk} enforcing the forward propagation mechanics of the network.

As discussed by LeCun (1988), ∇L(F, X, Λ) = 0 is a necessary condition for any set of functions which is a stationary point with respect to the loss functions l_n while still satisfying the constraints. This results in three separate conditions which must hold at the stationary point:

\frac{\partial L(F,X,\Lambda)}{\partial \Lambda} = \frac{\partial L(F,X,\Lambda)}{\partial X} = \frac{\partial L(F,X,\Lambda)}{\partial F} = 0 \qquad (2)

3.1. Forward Propagation

Satisfying the first condition from (2) yields a separate constraint for each example n and layer k:

(2) \Longrightarrow \frac{\partial L(F,X,\Lambda)}{\partial \lambda_{nk}} = 0 \;\; \forall n, k \;\Longrightarrow\; x_{nk} = f_k(x_{n(k-1)}) \;\; \forall n, k

These constraints simply re-state the forward propagation mechanics of the network.

3.2. Backward Propagation

Similarly, satisfying the second part of (2) provides another set of constraints over the training data and layers:

(2) \Longrightarrow \frac{\partial L(F,X,\Lambda)}{\partial x_{nk}} = 0 \;\; \forall n, k

\Longrightarrow \lambda_{nK} = l_n'(x_{nK}) \;\; \forall n

\lambda_{nk} = J_{f_{k+1}}(x_{nk})\, \lambda_{n(k+1)} \;\; \forall n, \; k < K

where J_f(X) is the Jacobian matrix of f at X. These constraints define the mechanics for backwards error propagation. The Lagrange multipliers λ_{nk} store the accumulated result of applying the chain rule to the original derivatives of the loss. Using the ordered derivative notation of Werbos (1994), each element λ_{nki} represents the derivative of the loss with respect to output x_{nki}, i.e. ∂⁺l_n / ∂x_{nki}.
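The backward recursion can be checked numerically on a tiny fixed chain. The sketch below is our own illustration (the modules and names `f1`, `f2`, `lam0`-`lam2` are invented); note that with the Jacobian convention J[i, j] = ∂(output_i)/∂(input_j) the multiplication is by the transposed Jacobian, i.e. a vector-Jacobian product.

```python
import numpy as np

# Numeric check of the Section 3.2 recursion on a two-module chain:
#   lambda_K = l'(x_K),   lambda_k = J_{f_{k+1}}(x_k)^T lambda_{k+1}.
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))           # f1: linear module, R^4 -> R^3
target = rng.standard_normal(3)

f1 = lambda x: W @ x
f2 = np.tanh                              # f2: elementwise tanh, R^3 -> R^3
loss = lambda x2: 0.5 * np.sum((x2 - target) ** 2)

x0 = rng.standard_normal(4)
x1, x2 = f1(x0), f2(f1(x0))               # forward propagation (Section 3.1)

lam2 = x2 - target                        # lambda_K = l'(x_K)
J_f2 = np.diag(1.0 - np.tanh(x1) ** 2)    # Jacobian of elementwise tanh at x1
lam1 = J_f2.T @ lam2                      # lambda_1 = J_{f2}(x1)^T lambda_2
J_f1 = W                                  # Jacobian of the linear module
lam0 = J_f1.T @ lam1                      # derivative of the loss w.r.t. x0

# Finite-difference check that lam0 really is d loss / d x0.
eps = 1e-6
num = np.array([(loss(f2(f1(x0 + eps * e))) - loss(f2(f1(x0 - eps * e)))) / (2 * eps)
                for e in np.eye(4)])
print(np.allclose(lam0, num, atol=1e-5))  # True
```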

3.3. Functional Gradient Update

The final condition in (2) gives a necessary constraint on the final optimized functions in F:

(2) \Longrightarrow \frac{\partial L(F,X,\Lambda)}{\partial f_k} = 0 \;\; \forall f_k \in F

\Longrightarrow \nabla_f [L(F,X,\Lambda)] = 0 \;\; \forall f_k \in F

\Longrightarrow \sum_{n=1}^{N} \lambda_{nk} \left( \nabla_f [f_k(x_{n(k-1)})] \right) = 0 \;\; \forall f_k \in F

These constraints necessitate that each f_k be a fixed point of the Lagrange equation L. Since we seek to minimize the loss functions, a steepest descent procedure can be used to find the minimum, with function update rule:

f_k \leftarrow f_k - \alpha \sum_{n=1}^{N} \lambda_{nk} \left( \nabla_f [f_k(x_{n(k-1)})] \right) \;\; \forall f_k \in F

For RKHS function spaces, the functional gradient of a function itself evaluated at x is the kernel centered at that point, K(x, ·) (Scholkopf & Smola, 2001). Applying this to our functional update we get the following update rule:

f_k \leftarrow f_k - \alpha \sum_{n=1}^{N} \lambda_{nk} K(x_{n(k-1)}, \cdot) \;\; \forall f_k \in F

And for Euclidean function spaces (in the idealized case) the functional gradient of a function itself is again the Dirac delta function. Correspondingly, we get the following update rule:

f_k \leftarrow f_k - \alpha \sum_{n=1}^{N} \lambda_{nk} \delta_{x_{n(k-1)}} \;\; \forall f_k \in F

In practice the equivalent projected version of this gradient step is used. This amounts to building a dataset {(x_{n(k−1)}, λ_{nk})}_{n=1}^N and using it to train a weak learner h* as in (1).
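As a rough sketch of the two update rules for a single-output module f_k, assuming an RBF kernel for the RKHS case and an affine hypothesis space for the projected Euclidean case (all names, data, and constants below are our own assumptions, not the paper's):

```python
import numpy as np

# One module f_k, given its stored inputs xs (= x_{n(k-1)}) and the errors
# lams (= lambda_{nk}) delivered by backpropagation.
rng = np.random.default_rng(2)
xs = rng.standard_normal((50, 4))          # inputs seen by this module
lams = rng.standard_normal(50)             # backpropagated errors
alpha = 0.1

# (a) RKHS rule: f_k <- f_k - alpha * sum_n lams_n K(x_n, .).  The increment is
#     stored nonparametrically as (centers, coefficients).
def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

centers, coefs = xs.copy(), -alpha * lams
rkhs_increment = lambda q: np.sum(coefs * rbf(centers, q))

# (b) Euclidean rule, projected form: fit a weak learner h* to the dataset
#     {(x_{n(k-1)}, lambda_{nk})} by least squares (Eq. 1), then step
#     f_k <- f_k - alpha * h*.
A = np.column_stack([xs, np.ones(len(xs))])
w, *_ = np.linalg.lstsq(A, lams, rcond=None)
h_star = lambda q: np.append(q, 1.0) @ w
euclidean_increment = lambda q: -alpha * h_star(q)

q = rng.standard_normal(4)
print(rkhs_increment(q), euclidean_increment(q))
```

The RKHS increment grows with the number of training points, while the projected Euclidean increment is as compact as one member of H; this is one reason the Euclidean form is convenient in the modular setting below.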

3.4. Generalization to Other Network Topologies

The derivation here is presented for a sequential layered network, but it extends with no complications to directed acyclic graphs of modules. For any DAG, we can convert it into a layered network as above by first sorting the modules using a topological ordering and then modifying each layer to only accept the values z ⊆ x_{n(k−1)} that were originally used by that function: x_{nk} = (f_k(z), x_{n(k−1)} \ z). The backwards propagation rules similarly only apply the Jacobian to a subset of the errors being passed back, while the others are simply passed on in the topological ordering.

From there the derivation is fundamentally the same, with the functional update rule operating only on a subset of the inputs x_{n(k−1)} and error terms λ_{nk}. In a network of this form, the backpropagation mechanics naturally follow the topology created in the network.
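A minimal sketch of the DAG-to-layers conversion, using Python's `graphlib` and a hypothetical module graph of our own; the topological order gives the forward schedule and its reverse gives the backward schedule along which each module routes errors only to the producers of its inputs.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical network: two feature modules feeding a cost module feeding a
# planner.  Each entry lists the modules whose outputs it consumes.
consumes = {
    "input": [],
    "feat_a": ["input"],
    "feat_b": ["input"],
    "cost": ["feat_a", "feat_b"],
    "planner": ["cost"],
}
order = list(TopologicalSorter(consumes).static_order())
print("forward schedule: ", order)
print("backward schedule:", list(reversed(order)))
```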

4. Implementation for a Modular Network

Using the formal derivation from the previous section, we now present an algorithm for training a series of boosted learning modules, by applying the standard boosting technique to functional backpropagation.

Algorithm 2 Modular Functional Gradient Update
  Functional Gradient Forward Step:
  for all x_{n(k−1)} do {Step 1}
    Compute outputs x_{nk} = f_k(x_{n(k−1)}). {Step 2}
  end for
  Functional Gradient Backward Step:
  for all λ_{nk} do {Step 3}
    Compute λ_{n(k−1)} = J_{f_k}(x_{n(k−1)}) λ_{nk}. {Step 6}
  end for
  Compute ∇_f L[f_k] = Σ_{n=1}^N λ_{nk} δ_{x_{n(k−1)}}. {Step 4}
  Project gradient ∇_f L[f_k] onto H using least squares projection to find h_k*. {Step 5}
  Update f_k: f_{kt} ← f_{k(t−1)} − α_t h_k*.

Algorithm 2 gives an outline for computing the forward and backward propagation steps for each functional learner in the network. The algorithm for training the complete network is the same as in backpropagation: a forward pass through the entire network is computed for the training data, the gradient of the loss function is evaluated, and then the backward pass propagates gradient information through the network and updates individual modules. Like any gradient-based procedure, this can be repeated for a fixed number of steps or until some measure of convergence is reached.

Unlike standard boosting, there are some restrictions on the weak hypotheses which can be used. To accommodate the backpropagation of gradients, the functions in the hypothesis space H must be differentiable. Specifically, we need to be able to calculate the Jacobian J_h for every function h ∈ H. From there, the Jacobian of each function f_k can be easily computed, as each f_k is a linear combination of functions in H. This restriction does preclude some weak learners commonly employed in boosting, notably decision stumps, but still allows for a wide range of possible hypothesis spaces. If needed, this restriction can be relaxed for the first functional learner in a network, as no gradient needs to be propagated through this layer.

A single functional module as described here is pictured in Figure 2. Each learning module is composed of machinery for computing the forward step and the backward gradient propagation step, along with an internal gradient projection module which performs the boosting steps necessary to actually update the module.

Figure 2. Example learning module illustrating backpropagation machinery and Euclidean functional gradient projection.
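The sketch below puts Algorithm 2 and the module of Figure 2 together for a toy two-module chain. It is our own minimal design rather than the authors' implementation: the hypothesis space is affine maps, the step size is fixed, and the class and variable names are invented.

```python
import numpy as np

# One boosted module: f_k(x) = -alpha * sum_t h_t(x) with affine weak learners
# h_t(x) = W_t^T [x; 1].  `forward` caches inputs, `backward` applies the
# transposed Jacobian to incoming errors and caches the gradient dataset
# {(x_{n(k-1)}, lambda_{nk})}, and `update` projects that dataset onto H by
# least squares (Eq. 1) and takes a step.
class BoostedModule:
    def __init__(self, d_in, d_out, alpha=0.1):
        self.d_in, self.d_out, self.alpha = d_in, d_out, alpha
        self.weights = []                       # one (d_in+1, d_out) matrix per step

    def _augment(self, X):
        return np.column_stack([X, np.ones(len(X))])

    def forward(self, X):                       # Steps 1-2: compute and cache outputs
        self.inputs = X
        out = np.zeros((len(X), self.d_out))
        for W in self.weights:
            out -= self.alpha * self._augment(X) @ W
        return out

    def backward(self, lam):                    # Steps 3, 6: route errors downward
        self.lam = lam                          # cache lambda_{nk} for the update
        if not self.weights:                    # affine H: Jacobian is input-independent
            J = np.zeros((self.d_out, self.d_in))
        else:
            J = -self.alpha * sum(W[:-1] for W in self.weights).T
        return lam @ J                          # lambda_{n(k-1)} for the layer below

    def update(self):                           # Steps 4-5: project and step
        W_star, *_ = np.linalg.lstsq(self._augment(self.inputs), self.lam, rcond=None)
        self.weights.append(W_star)             # f_k <- f_k - alpha * h*

# Two stacked modules trained end-to-end on a toy regression target.
rng = np.random.default_rng(3)
X, Y = rng.standard_normal((100, 5)), rng.standard_normal((100, 2))
m1, m2 = BoostedModule(5, 8), BoostedModule(8, 2)
for _ in range(25):
    out = m2.forward(m1.forward(X))
    lam_out = out - Y                           # lambda_K = l'(x_K) for squared loss
    m1.backward(m2.backward(lam_out))           # return value (errors at X) unused here
    m1.update()
    m2.update()
print("loss:", 0.5 * np.mean(np.sum((m2.forward(m1.forward(X)) - Y) ** 2, axis=1)))
```

Because the weak learners are affine, each module's Jacobian does not depend on its cached inputs; with non-linear weak learners the backward step would evaluate J_{f_k} at each x_{n(k−1)}.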
4.1. Single Output Weak Learners

While the above formalism and algorithm consider each function f_k as a multi-output function, in practice it may be more convenient to treat each function f_k as several single-output functions f_{kj} with outputs x_{nk} = (x_{nk1}, x_{nk2}, ...), where x_{nkj} = f_{kj}(x_{n(k−1)}). This is fundamentally equivalent to the multi-output formulation, but with the restriction that the hypothesis space H used for projection is itself a product of a given single-output hypothesis space, H = G^m, where m is the output dimension. The gradient projection step in this restricted hypothesis space is equivalent to m independent projections over the datasets {(x_{n(k−1)}, λ_{nkj})}_{n=1}^N, for all j.

4.2. Online and Stochastic Learning

The literature on parametric gradient-based learning has shown that stochastic and online versions of the standard backpropagation algorithm are highly effective and convenient methods of learning, providing performance improvements and enabling practical learning from large or even infinite data sources. Both of these approaches extend to the functional gradient versions of backpropagation presented here.

For Euclidean functional gradient boosting, online learning at the per-example level is not feasible, but an intuitive way of achieving online behavior is to use "mini-batch" learning, where a group of examples is collected or sampled from the underlying dataset and this small dataset is used for one iteration of the algorithm presented above. Using batches of examples is necessary in practice to obtain a reasonable and robust functional gradient projection.

In the RKHS setting, online learning generalizes easily and is a well-studied problem in the literature (Kivinen et al., 2004).

4.3. Benefits of Modularity

This algorithm is inherently modular in two ways: it separates the individual pieces of the network from each other, and it separates the structural aspects of the network from the learning in individual modules. This feature makes implementing complex networks of heterogeneous modules straightforward and provides a number of mechanisms for improving learning performance.

In neural network-based architectures the complexity of the network is usually regulated by changing the structure of the network in some way. In contrast, the division between gradient propagation and gradient projection when using boosted backpropagation provides a means for varying the complexity of each layer without having to alter the structure of the network.

Another key benefit of the separation is that the local weak learners can use the gradient being projected to validate various local parameters, reducing the number of parameters and models that need to be globally optimized and validated. For example, if the weak learner being used is a regularized least squares regressor, the regularization parameter can be selected using the gradient dataset and cross-validation, as sketched below. This removes the need for an additional combinatorial search for regularization parameters at the global level, potentially saving a large amount of computation.
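A sketch of that local validation step, assuming ridge (Tikhonov-regularized) least squares weak learners; the candidate grid, fold count, and helper names are our own choices rather than the paper's.

```python
import numpy as np

# The module's gradient dataset {(x_{n(k-1)}, lambda_{nk})} doubles as a
# validation set for the weak learner's own Tikhonov parameter, so no global
# search over per-module regularizers is needed.

def ridge_fit(X, y, reg):
    A = np.column_stack([X, np.ones(len(X))])
    return np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ y)

def ridge_predict(w, X):
    return np.column_stack([X, np.ones(len(X))]) @ w

def select_regularization(X, lam_targets, grid=(1e-2, 1e-1, 1.0, 10.0, 100.0), folds=5):
    """Pick the ridge parameter by K-fold cross-validation on the gradient dataset."""
    idx = np.arange(len(X)) % folds              # simple fold assignment
    scores = []
    for reg in grid:
        err = 0.0
        for f in range(folds):
            tr, va = idx != f, idx == f
            w = ridge_fit(X[tr], lam_targets[tr], reg)
            err += np.mean((ridge_predict(w, X[va]) - lam_targets[va]) ** 2)
        scores.append(err / folds)
    return grid[int(np.argmin(scores))]

# Usage inside a module's update step (hypothetical data):
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 6))                                  # cached x_{n(k-1)}
lam = X @ rng.standard_normal(6) + 0.1 * rng.standard_normal(200)  # errors lambda_{nk}
best = select_regularization(X, lam)
h_star = ridge_fit(X, lam, best)                                   # projected gradient step
print("selected Tikhonov parameter:", best)
```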
5. Experimental Results

Our experimental application is a simplified path planning system for an autonomous vehicle using Maximum Margin Planning (MMP) (Ratliff et al., 2009), a method for estimating optimal controllers which exhibit the same behavior as demonstrated human examples. The planning system, depicted in Figure 3, consists of feature extraction from overhead data, cost function mapping, and optimal planning (A*, here) modules. We seek to learn both the feature extraction module, where raw terrain data is transformed into a set of high-level features, and a cost mapping function which takes the generated high-level features and produces costs appropriate for planning.

Figure 3. Maximum Margin Planning network for an autonomous vehicle (raw terrain data, feature extraction, cost function, A* planner). Learning modules are colored in blue, while the planning module is a fixed optimal planner. In this case, both a cost function and a feature extraction layer are learned simultaneously to improve overall performance.

Formally, we are given a set of example maps M_i with locations x in these maps (essentially terrain feature examples). For the cost function module, we define the input φ(x) as the output of the feature extraction layer and then compute the output c(φ(x)). The MMP cost functional R is defined as the difference between planned and demonstrated cost:

R[c] = \frac{1}{M} \sum_{i=1}^{M} \left[ \sum_{x \in M_i} c(\phi(x))\, \mu_i(x) - \min_{\mu \in G_i} \left\{ \sum_{x \in M_i} \left( c(\phi(x)) - \ell_i(x) \right) \mu(x) \right\} \right]

where μ_i is the demonstrated path and the minimization over μ ∈ G_i corresponds to the optimal path returned by the planning algorithm. This cost functional mathematically expresses the desired constraint that the behavior of the system after training duplicates the demonstrated behavior, by ensuring that the lowest cost paths in the examples are in fact the demonstrated paths. The extra ℓ_i(x) term corresponds to a margin function designed to ensure the demonstrated behavior is achieved by a significant margin. In our experiments we use ℓ_i(x) = 0 if x is on path μ_i and 1 otherwise.

The cost functional as given does not appear to fit our previous model of a sum of individual loss functions l_n, but we can derive the appropriate initial backpropagation gradient by considering the functional gradient of R directly. This first functional gradient is equivalent to the first λ term from the formal derivation above. Replacing the minimization with the actual optimal path according to the planner, μ_i*, we get:

\nabla_f R[c] = \frac{1}{M} \sum_{i=1}^{M} \left( \sum_{x \in M_i} \mu_i(x)\, \delta_{\phi(x)} - \sum_{x \in M_i} \mu_i^*(x)\, \delta_{\phi(x)} \right) = \sum_{x \in \bigcup_{i=1}^{M} M_i} \frac{1}{M} \left( \mu_{i(x)}(x) - \mu_{i(x)}^*(x) \right) \delta_{\phi(x)}

Intuitively, this is equivalent to defining a loss function over the outputs y_x = c(φ(x)):

l_x(y_x) = y_x\, \mu_i(x) - (y_x - \ell_i(x))\, \mu_i^*(x), \quad \text{where } i : x \in M_i

and using the same machinery formally outlined in Section 3.

Exponentiated Functional Gradient Descent. A number of empirical results in the MMP literature (Ratliff et al., 2009) have shown exponentiated functional gradient descent to be superior in performance, so we use this method of steepest descent for the costing module. The gradient is calculated in the same way as before; however, instead of using an additive model, we now update the function c(·) using the appropriate exponentiated gradient rule:

c(x) = e^{c_0(x)} \prod_{t=1}^{T} e^{\alpha_t h_t(x)}

Similar results can be derived for the parameterized gradient descent version of this network. In both cases the initial gradient passed in to the network is identical, and only the form of the update changes.

In the following experiments, the terrain features x are 5-by-5 patches of image data around each location, taken from the satellite imagery. For the feature extraction module, φ(x), a two-layer neural network was used in the parameterized gradient case, while an identically structured network using least squares linear regressors as weak learners was used in the functional gradient case.
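The sketch below shows how these pieces fit together for the costing module alone: per-cell visitation counts give the gradient targets μ_i(x) − μ_i*(x), a least-squares fit plays the role of the projection in Eq. (1), and the exponentiated rule updates the costs. The planner is only a stub standing in for loss-augmented A*, and all data, dimensions, and constants are invented for illustration.

```python
import numpy as np

# mu_demo and mu_plan are per-cell visitation counts of the demonstrated and
# planned paths, so the per-cell gradient target is mu_i(x) - mu_i^*(x); the
# costing module is then updated multiplicatively, c <- c * exp(-alpha * h).
rng = np.random.default_rng(5)
H, W, d = 20, 20, 6
features = rng.standard_normal((H, W, d))             # phi(x) for every map cell
log_cost = np.zeros((H, W))                            # c(x) = exp(log_cost(x))

def plan_stub(cost_map):
    """Stand-in for the A* module: marks one 'visited' cell per column."""
    mu = np.zeros_like(cost_map)
    mu[cost_map.argmin(axis=0), np.arange(cost_map.shape[1])] = 1.0
    return mu

mu_demo = np.zeros((H, W))
mu_demo[H // 2, :] = 1.0                               # demonstrated path
margin = 1.0 - mu_demo                                 # l_i(x): 0 on the path, 1 off it

alpha = 0.3
for t in range(10):
    cost = np.exp(log_cost)
    mu_plan = plan_stub(cost - margin)                 # plan on loss-augmented costs
    target = mu_demo - mu_plan                         # mu_i(x) - mu_i^*(x)
    A = np.column_stack([features.reshape(-1, d), np.ones(H * W)])
    w, *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)   # projection, Eq. (1)
    h = (A @ w).reshape(H, W)
    log_cost -= alpha * h                              # exponentiated gradient step

cost = np.exp(log_cost)
print("demonstrated path cost:", float((cost * mu_demo).sum()))
print("planned path cost:     ", float((cost * plan_stub(cost)).sum()))
```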

5.1. Comparison of Parametric and Functional Gradients

Results for optimizing both networks using 4 example paths are shown in Figures 4 and 5. In this instance, parameterized backpropagation gets caught in a severe local minimum early on, while functional backpropagation achieves excellent performance.

Figure 4. Performance on 10 test paths on an unseen map for a parameterized backpropagation (left) and a Euclidean functional gradient descent backpropagation (right) network. The parameterized version drives all costs to 0, resulting in a homogeneous map and straight-line paths.

Figure 5. MMP objective function value for 4 training paths vs. wallclock time, for functional and parametric backpropagation.

We hypothesize that the poor performance of parameterized gradient descent is due to the larger number of negative gradient examples in the demonstrated path as compared to the planned path, driving costs down primarily. Essentially, the parametric version gets caught in this local minimum while trying to reduce the objective (the difference between example and planned path cost) by driving the costs of both paths down.

5.2. Local Parameter Validation

We also implemented a cross-validation based parameter selection method for determining the Tikhonov regularization parameter used in the linear least squares weak learners. Figure 6 shows the resulting parameter values as they were selected over time. Here we see small values initially, allowing for rapid initial learning, followed by relatively large values which prevent the network from overfitting. In contrast, we also performed a combinatorial search by hand for a good fixed regularization parameter. Figure 7 displays the final performance for both methods on both the example paths and paths on an unseen map.

Figure 6. Locally optimized Tikhonov regularization parameter values for the top layer of the MMP network, plotted against iteration. Parameter selection was performed using cross-validation.

Figure 7. Comparison of training path (green) and test path (blue) loss for both the locally optimized regularization and a predetermined regularization parameter (fixed λ = 10).

Here the ability of the modular parameter validation to adjust over time is very beneficial: we found that with small fixed global values initial learning is fast but the algorithm is prone to overfitting, while with large fixed values learning is slow to the point of making the optimization infeasible. The locally optimized parameters, however, allow for both good initial behavior and good generalization.

6. Discussion and Future Work

We believe the combination of functional gradients with modular backpropagation holds significant promise. The separation of learning mechanism and structural error propagation in our method provides an important opportunity to keep learning local to an individual module, even in global network optimization. The ability to validate and perform model selection on each component network separately using error information may be crucial to efficiently implement the divide-and-conquer strategy modular systems are meant to use. Additionally, there is experimental indication that functional methods for network optimization provide a degree of robustness against parametric minima that occur when using complicated transformation modules like an optimal planner.

We have also found that on other datasets functional backpropagation is competitive with parametric implementations. Experiments on these datasets were omitted here for space but can be found in an extended technical report (Grubb & Bagnell, 2010).

In this work, we largely focused on simple, linear weak learners to facilitate comparison with the parametric approach, although we have additional extensive experiments with non-linear learners. The non-linear methods offer the promise of greater system performance at a significantly larger computational expense. Future work will focus on achieving the benefits of these learning approaches while limiting the computational impact.

Acknowledgments

We would like to thank Daniel Munoz and Abraham Othman for their valuable feedback and David Bradley for his help with experimental work. This work is supported by the ONR MURI grant N00014-09-1-1052 and the Robotics Institute.

References

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.

Bottou, L. and Gallinari, P. A framework for the cooperation of learning algorithms. In Advances in Neural Information Processing Systems, pp. 781–788, 1991.

Bradley, D. M. Learning in Modular Systems. PhD thesis, The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, 2009.

Freund, Y. and Schapire, R. E. Experiments with a new boosting algorithm. In Proc. of the 13th International Conference on Machine Learning (ICML 1996), pp. 148–156, 1996.

Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.

Gelfand, I. M. and Fomin, S. V. Calculus of Variations. Dover Publications, 2000.

Grubb, A. and Bagnell, J. A. Boosted backpropagation learning for deep modular networks. Technical Report CMU-RI-09-45, 2010.

Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Kivinen, J., Smola, A. J., and Williamson, R. C. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009.

Lawrence, S., Giles, C. L., and Fong, S. Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 12(1):126–140, 2000.

LeCun, Y. A theoretical framework for back-propagation. In Proc. of the 1988 Connectionist Models Summer School, pp. 21–28, 1988.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.

Ratliff, N., Silver, D., and Bagnell, J. A. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.

Rohde, D. L. T. A Connectionist Model of Sentence Comprehension and Production. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2002.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. Computational Models of Cognition and Perception Series, pp. 318–362, 1986.

Scholkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

Werbos, P. J. The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley-Interscience, New York, NY, USA, 1994.

Zucker, M., Bagnell, J. A., Atkeson, C., and Kuffner, J. An optimization and learning approach to rough terrain locomotion. In International Conference on Robotics and Automation, to appear, 2010.