An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks

Qianxiao Li 1 Shuji Hao 1

Abstract

Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on the Pontryagin's maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights that are constrained to take values in a discrete set. We obtain competitive performance and, interestingly, very sparse weights in the case of ternary networks, which may be useful for model deployment in low-memory devices.

1. Introduction

The problem of training deep feed-forward neural networks is often studied as a nonlinear programming problem (Bazaraa et al., 2013; Bertsekas, 1999; Kuhn & Tucker, 2014)

\[ \min_\theta J(\theta), \]

where θ represents the set of trainable parameters and J is the empirical loss function. In the general unconstrained case, necessary optimality conditions are given by the condition ∇_θ J(θ*) = 0 for an optimal set of training parameters θ*. This is largely the basis for (stochastic) gradient-descent based optimization algorithms in deep learning (Robbins & Monro, 1951; Duchi et al., 2011; Zeiler, 2012; Kingma & Ba, 2014). When there are additional constraints, e.g. on the trainable parameters, one can instead employ projected versions of the above algorithms. More broadly, necessary conditions for optimality can be derived in the form of the Karush-Kuhn-Tucker conditions (Kuhn & Tucker, 2014). Such approaches are quite general and typically do not rely on the structure of the objectives encountered in deep learning. However, in deep learning the objective function J often has a specific structure: it is derived from feeding a batch of inputs recursively through a sequence of trainable transformations, which can be adjusted so that the final outputs are close to some fixed target set. This process resembles an optimal control problem (Bryson, 1975; Bertsekas, 1995; Athans & Falb, 2013), a subject that originates from the study of the calculus of variations.

In this paper, we exploit this optimal control viewpoint of deep learning. First, we introduce the discrete-time Pontryagin's maximum principle (PMP) (Halkin, 1966), which is an extension of the central result in optimal control due to Pontryagin and coworkers (Boltyanskii et al., 1960; Pontryagin, 1987). This is an alternative set of necessary conditions characterizing optimality, and we discuss the extent of its validity in the context of deep learning. Next, we introduce the discrete method of successive approximations (MSA), based on the PMP, to optimize deep neural networks. A rigorous error estimate is proved that elucidates the dynamics of the MSA and aids us in designing optimization algorithms under rather general conditions. We apply our method to train a class of unconventional networks, i.e. those with discrete-valued weights, to illustrate the usefulness of this approach. In the process, we discover that in the case of ternary networks, our training algorithm obtains trained models that are very sparse, which is an attractive feature in practice.

The rest of the paper is organized as follows: in Sec. 2, we introduce the optimal control viewpoint and the discrete-time Pontryagin's maximum principle. We then introduce the method of successive approximations in Sec. 3 and prove
our main estimate, Theorem 2. In Sec. 4, we derive algorithms based on the developed theory to train binary and ternary neural networks. Finally, we end with a discussion of related work and a conclusion in Sec. 5 and 6, respectively. Various details on proofs and algorithms are provided in Appendix A-D, which also contains a link to a software implementation of our algorithms that reproduces all experiments in this paper.

¹Institute of High Performance Computing, Singapore. Correspondence to: Qianxiao Li. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Hereafter, we denote the usual Euclidean norm by ‖·‖ and the corresponding induced matrix norm by ‖·‖₂. The Frobenius norm is written as ‖·‖_F. Throughout this work, we use a bold-faced version of a variable to represent a collection of the same variable, indexed additionally by t, e.g. θ := {θ_t : t = 0, …, T − 1}.

2. The Optimal Control Viewpoint

In this section, we formalize the problem of training a deep neural network as an optimal control problem. Let T ∈ Z₊ denote the number of layers and {x_{s,0} ∈ R^{d_0} : s = 1, …, S} represent a collection of fixed inputs (images, time-series). Here, S ∈ Z₊ is the sample size. Consider the dynamical system

\[ x_{s,t+1} = f_t(x_{s,t}, \theta_t), \qquad t = 0, 1, \dots, T-1, \tag{1} \]

where for each t, f_t : R^{d_t} × Θ_t → R^{d_{t+1}} is a transformation on the state. For example, in typical neural networks, it can represent a trainable affine transformation or a non-linear activation (in which case it is not trainable and f_t does not depend on θ). We assume that each trainable parameter set Θ_t is a subset of a Euclidean space. The goal of training a neural network is to adjust the weights θ := {θ_t : t = 0, …, T − 1} so as to minimize some loss function that measures the difference between the final network output x_{s,T} and the true targets y_s of x_{s,0}, which are fixed. Thus, we may define a family of real-valued functions Φ_s : R^{d_T} → R acting on x_{s,T} (the y_s are absorbed into the definition of Φ_s), and the average loss function is Σ_s Φ_s(x_{s,T})/S. In addition, we may consider regularization terms for each layer, L_t : R^{d_t} × Θ_t → R, that have to be simultaneously minimized. In typical applications, regularization is only performed on the trainable parameters, so that L_t(x, θ) ≡ L_t(θ), but here we consider the slightly more general case where it is also possible to regularize the state at each layer. In summary, we wish to solve the following problem:

\[
\begin{aligned}
\min_{\theta \in \Theta}\; & J(\theta) := \frac{1}{S}\sum_{s=1}^{S} \Phi_s(x_{s,T}) + \frac{1}{S}\sum_{s=1}^{S}\sum_{t=0}^{T-1} L_t(x_{s,t}, \theta_t) \\
\text{subject to:}\; & x_{s,t+1} = f_t(x_{s,t}, \theta_t), \quad t = 0, \dots, T-1,\; s \in [S],
\end{aligned}
\tag{2}
\]

where we have defined for shorthand Θ := Θ_0 × ⋯ × Θ_{T−1} and [S] := {1, …, S}. One may recognize problem (2) as a classical fixed-time, variable-terminal-state optimal control problem in discrete time (Ogata, 1995), in fact a special one with almost decoupled dynamics across the samples in [S].
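To make the correspondence concrete, the following minimal sketch evaluates J(θ) of problem (2) by running the dynamics (1) on a batch of inputs. This is our own illustration rather than the paper's released implementation; the function and argument names are hypothetical, and the layer maps, regularizers and terminal loss are assumed to be supplied by the user in batched form.

```python
import numpy as np

def evaluate_objective(x0, thetas, layers, regs, Phi):
    """Evaluate J(theta) of problem (2) by running the dynamics (1).

    x0:     array of shape (S, d_0) holding the fixed inputs x_{s,0}
    thetas: list [theta_0, ..., theta_{T-1}] of per-layer parameters
    layers: list of batched maps f_t(x, theta) -> next state
    regs:   list of batched regularizers L_t(x, theta) -> sum over samples
    Phi:    terminal loss, Phi(x_T) -> sum_s Phi_s(x_{s,T}) (targets absorbed)
    """
    S = x0.shape[0]
    x, reg_total = x0, 0.0
    for f_t, L_t, theta_t in zip(layers, regs, thetas):
        reg_total += L_t(x, theta_t)   # accumulate sum_s L_t(x_{s,t}, theta_t)
        x = f_t(x, theta_t)            # forward dynamics (1)
    return (Phi(x) + reg_total) / S    # the objective J(theta) in (2)
```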
2.1. The Pontryagin's Maximum Principle

Maximum principles of the Pontryagin type (Boltyanskii et al., 1960; Pontryagin, 1987) usually consist of necessary conditions for optimality in the form of the maximization of a certain Hamiltonian function. The distinguishing feature is that they do not assume differentiability (or even continuity) of f_t with respect to θ. Consequently, the optimality condition and the algorithms based on it need not rely on gradient-descent type updates. This is an attractive feature for certain classes of applications.

Let θ* = {θ*_0, …, θ*_{T−1}} ∈ Θ be a solution of (2). We now outline informally the Pontryagin's maximum principle (PMP) that characterizes θ*. First, for each t we define the Hamiltonian function H_t : R^{d_t} × R^{d_{t+1}} × Θ_t → R by

\[ H_t(x, p, \theta) := p \cdot f_t(x, \theta) - \tfrac{1}{S} L_t(x, \theta). \tag{3} \]

One can show the following necessary conditions.

Theorem 1 (Discrete PMP, Informal Statement). Let f_t and Φ_s, s = 1, …, S, be sufficiently smooth in x. Assume further that for each t and x ∈ R^{d_t}, the sets {f_t(x, θ) : θ ∈ Θ_t} and {L_t(x, θ) : θ ∈ Θ_t} are convex. Then, there exist co-state processes p*_s := {p*_{s,t} : t = 0, …, T} such that the following holds for t = 0, …, T − 1 and s ∈ [S]:

\[ x^*_{s,t+1} = \nabla_p H_t(x^*_{s,t}, p^*_{s,t+1}, \theta^*_t), \qquad x^*_{s,0} = x_{s,0}, \tag{4} \]
\[ p^*_{s,t} = \nabla_x H_t(x^*_{s,t}, p^*_{s,t+1}, \theta^*_t), \qquad p^*_{s,T} = -\tfrac{1}{S}\nabla\Phi_s(x^*_{s,T}), \tag{5} \]
\[ \sum_{s=1}^{S} H_t(x^*_{s,t}, p^*_{s,t+1}, \theta^*_t) \ge \sum_{s=1}^{S} H_t(x^*_{s,t}, p^*_{s,t+1}, \theta), \qquad \forall\, \theta \in \Theta_t. \tag{6} \]

The full statement of Theorem 1 involves explicit smoothness assumptions and additional technicalities (such as the inclusion of an abnormal multiplier). In Appendix A, we state these assumptions and give a sketch of the proof based on Halkin (1966).

Let us discuss the PMP in detail. The state equation (4) is simply the forward propagation equation (1) under the optimal parameters θ*. Eq. (5) defines the dynamics of the co-state p*_s. To draw an analogy with nonlinear programming, the co-state can be interpreted as a set of Lagrange multipliers that enforces the constraint (1) when the optimization problem (2) is regarded as a joint optimization problem in θ and x_s, s ∈ [S]. In the optimal control and PMP viewpoint, it is perhaps more appropriate to think of the dynamics (5) as the evolution of the normal vector of a separating hyper-plane, which separates the set of reachable states from the set of states where the objective function takes values smaller than the optimum (see Appendix A).

The Hamiltonian maximization condition (6) is the centerpiece of the PMP. It says that an optimal solution θ* must globally maximize the (summed) Hamiltonian for each layer t = 0, …, T − 1.
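As an aside, the co-state dynamics (5) discussed above are straightforward to realize in code. The following is a minimal sketch under our own naming conventions, assuming the transposed Jacobian-vector products of f_t are available; up to the −1/S scaling of the terminal condition, it is precisely the familiar back-propagation recursion.

```python
import numpy as np

def costate_backward(xs, thetas, vjp_f, grad_Phi, grad_L=None):
    """Backward recursion (5): p_{s,t} = grad_x H_t(x_{s,t}, p_{s,t+1}, theta_t).

    xs:       list [x_0, ..., x_T] of batched states, x_t of shape (S, d_t)
    thetas:   per-layer parameters [theta_0, ..., theta_{T-1}]
    vjp_f:    vjp_f[t](x, theta, p) -> (d f_t / d x)^T p, batched
    grad_Phi: grad_Phi(x_T) -> per-sample gradients, shape (S, d_T)
    grad_L:   optional grad_L[t](x, theta) -> grad_x L_t, batched
    """
    S, T = xs[0].shape[0], len(thetas)
    p = [None] * (T + 1)
    p[T] = -grad_Phi(xs[T]) / S                  # terminal condition in (5)
    for t in reversed(range(T)):
        p[t] = vjp_f[t](xs[t], thetas[t], p[t + 1])
        if grad_L is not None:                   # state-regularization term of H_t
            p[t] = p[t] - grad_L[t](xs[t], thetas[t]) / S
    return p
```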

Let us contrast this statement with the usual first-order optimality conditions of the form ∇_θ J(θ*) = 0. A key observation is that in Theorem 1, we make no reference to the derivative of any quantity with respect to θ. In fact, the PMP holds even if f_t is not differentiable, or even continuous, with respect to θ, as long as the convexity assumptions are satisfied. On the other hand, if we assume for each t that 1) f_t is differentiable with respect to θ, 2) H_t is concave in θ, and 3) θ*_t lies in the interior of Θ_t, then the Hamiltonian maximization condition (6) is equivalent to the condition ∇_θ Σ_s H_t = 0 for all t, which one can then show is equivalent to ∇_θ J = 0 (see Appendix C, proof of Prop. C.1). In other words, the PMP can be viewed as a stronger set of necessary conditions (at optimality, H_t is not just stationary, but globally maximized) and has meaning in more general scenarios, e.g. when stationarity with respect to θ is not achievable due to constraints, or not defined due to non-differentiability.

Remark 1. It may occur that Σ_s H_t(x*_{s,t}, p*_{s,t+1}, θ) is constant for all θ ∈ Θ_t, in which case the problem is singular (Athans & Falb, 2013). In such cases, the PMP is trivially satisfied by any θ, and so it does not tell us anything useful. This may arise especially when there are no regularization terms.

2.2. The Convexity Assumption

The most crucial assumption in Theorem 1 is the convexity of the sets {f_t(x, θ) : θ ∈ Θ_t} and {L_t(x, θ) : θ ∈ Θ_t} for each fixed x.¹ We now discuss how restrictive these assumptions are with regard to deep neural networks. Let us first assume that the admissible sets Θ_t are convex. Then, the assumption with respect to L_t is not restrictive, since most regularizers (e.g. ℓ₁, ℓ₂) satisfy it. Let us consider the convexity of {f_t(x, θ) : θ ∈ Θ_t}. In classical feed-forward neural networks, there are two types of layers: trainable ones and non-trainable ones. Suppose layer t is non-trainable (e.g. f_t(x_t, θ_t) = σ(x_t), where σ is a non-linear activation function); then for each x the set {f_t(x, θ) : θ ∈ Θ_t} is a singleton, and hence trivially convex. On the other hand, in trainable layers f_t is usually affine in θ. This includes fully connected layers, convolution layers and batch normalization layers (Ioffe & Szegedy, 2015). In these cases, as long as the admissible set Θ_t is convex, we again satisfy the convexity assumption. Residual networks also satisfy the convexity constraint if one introduces auxiliary variables (see Appendix A.1). When the set Θ_t is not convex, it is in general not true that the PMP constitutes necessary conditions.

¹Note that this is in general unrelated to the convexity, in the sense of functions, of f_t with respect to either x or θ. For example, the scalar function f(x, θ) = θ³ sin(x) is evidently non-convex in both arguments, but {f(x, θ) : θ ∈ R} is convex for each x. On the other hand, {θx : θ ∈ {−1, 1}} is non-convex because of a non-convex admissible set.

Finally, we remark that in the original derivation of the PMP for continuous-time control systems (Boltyanskii et al., 1960) (i.e. ẋ_{s,t} = f_t(x_{s,t}, θ_t), t ∈ [0, T], in place of Eq. (1)), the convexity condition can be removed due to the "convexifying" effect of integration with respect to time (Halkin, 1966; Warga, 1962). Hence, the convexity condition is purely an artifact of discrete-time dynamical systems.

3. The Method of Successive Approximations

The PMP (Eq. (4)-(6)) gives us a set of necessary conditions that an optimal solution to (2) must satisfy. However, it does not tell us how to find one such solution. The goal of this section is to discuss algorithms for solving (2) based on the maximum principle.

On closer inspection of Eq. (4)-(6), one can see that they each represent a manifold in the solution space consisting of all possible θ, {x_s, s ∈ [S]} and {p_s, s ∈ [S]}, and the intersection of these three manifolds must contain an optimal solution, if one exists. Consequently, an iterative projection method that successively projects a guessed solution onto each of the manifolds is natural. This is the method of successive approximations (MSA), which was first introduced to solve continuous-time optimal control problems (Krylov & Chernousko, 1962; Chernousko & Lyubushin, 1982). Let us now outline a discrete-time version.

Start from an initial guess θ⁰ := {θ⁰_t : t = 0, …, T − 1}. For each sample s, we define x^{θ⁰}_s := {x^{θ⁰}_{s,t} : t = 0, …, T} by the dynamics

\[ x^{\theta^0}_{s,t+1} = f_t(x^{\theta^0}_{s,t}, \theta^0_t), \qquad x^{\theta^0}_{s,0} = x_{s,0}, \tag{7} \]

for t = 0, …, T − 1. Intuitively, this is a projection onto the manifold defined by Eq. (4).
Next, we perform the projection onto the manifold defined by Eq. (5), i.e. we define p^{θ⁰}_s := {p^{θ⁰}_{s,t} : t = 0, …, T} by the backward dynamics

\[ p^{\theta^0}_{s,t} = \nabla_x H_t(x^{\theta^0}_{s,t}, p^{\theta^0}_{s,t+1}, \theta^0_t), \qquad p^{\theta^0}_{s,T} = -\tfrac{1}{S}\nabla\Phi_s(x^{\theta^0}_{s,T}), \tag{8} \]

for t = T − 1, …, 0. Finally, we project onto the manifold defined by Eq. (6) by performing the Hamiltonian maximization to obtain θ¹ := {θ¹_t : t = 0, …, T − 1} with

\[ \theta^1_t = \arg\max_{\theta \in \Theta_t} \sum_{s=1}^{S} H_t(x^{\theta^0}_{s,t}, p^{\theta^0}_{s,t+1}, \theta), \qquad t = 0, \dots, T-1. \tag{9} \]

The steps (7)-(9) are then repeated until convergence. We summarize the basic MSA in Alg. 1.
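One sweep of the basic MSA is compactly expressed in code as follows — a sketch with hypothetical function names, where the forward, backward and per-layer maximization routines are assumed to implement (7), (8) and (9), respectively.

```python
def basic_msa(x0, theta0, forward, backward, argmax_hamiltonian, iters):
    """Basic MSA (Alg. 1): alternate the projections (7)-(9).

    forward(x0, theta)  -> states [x_0, ..., x_T]             (Eq. (7))
    backward(xs, theta) -> co-states [p_0, ..., p_T]          (Eq. (8))
    argmax_hamiltonian(t, x_t, p_t1) -> maximizer in Theta_t  (Eq. (9))
    """
    theta = list(theta0)
    for _ in range(iters):
        xs = forward(x0, theta)        # project onto the manifold of (4)
        ps = backward(xs, theta)       # project onto the manifold of (5)
        # the maximization decouples across layers and can be parallelized
        theta = [argmax_hamiltonian(t, xs[t], ps[t + 1])
                 for t in range(len(theta))]
    return theta
```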

Algorithm 1 Basic MSA
  Initialize: θ⁰ = {θ⁰_t ∈ Θ_t : t = 0, …, T − 1};
  for k = 0 to #Iterations do
    x^{θᵏ}_{s,t+1} = f_t(x^{θᵏ}_{s,t}, θᵏ_t), x^{θᵏ}_{s,0} = x_{s,0}, ∀ s, t;
    p^{θᵏ}_{s,t} = ∇_x H_t(x^{θᵏ}_{s,t}, p^{θᵏ}_{s,t+1}, θᵏ_t), p^{θᵏ}_{s,T} = −(1/S)∇Φ_s(x^{θᵏ}_{s,T}), ∀ s, t;
    θ^{k+1}_t = argmax_{θ∈Θ_t} Σ_{s=1}^S H_t(x^{θᵏ}_{s,t}, p^{θᵏ}_{s,t+1}, θ) for t = 0, …, T − 1;
  end for

Let us contrast the MSA with gradient-descent based methods. As in the formulation of the PMP, at no point did we take the derivative of any quantity with respect to θ. Hence, we can in principle apply this method to problems that are not differentiable with respect to θ. The catch, however, is that the Hamiltonian maximization step (9) may not be trivial to evaluate. Nevertheless, observe that the maximization step is decoupled across the different layers of the neural network; hence it is a much smaller problem than the original optimization problem, and its solution can be parallelized. Alternatively, as seen in Sec. 4, one can exploit cases where the maximization step has explicit solutions.

The basic MSA (Alg. 1) can be shown to converge for problems where f_t is linear and the costs Φ_s, L_t are quadratic (Aleksandrov, 1968). In general, however, unless a good initial condition is given, the MSA may diverge. Let us understand the nature of such phenomena by obtaining rigorous per-iteration error estimates for the steps (7)-(9).

3.1. An Error Estimate for the MSA

In this section, we derive a rigorous error estimate for the MSA, which can help us understand its dynamics. Let us define W_t := conv{x ∈ R^{d_t} : ∃ θ and s s.t. x^θ_{s,t} = x}, where x^θ_t is defined according to Eq. (7). This is the convex hull of all states reachable at layer t by some initial sample and some choice of the values of the trainable parameters. Let us now make the following assumptions:

(A1) Φ_s is twice continuously differentiable, with Φ_s and ∇Φ_s satisfying a Lipschitz condition, i.e. there exists K > 0 such that for all x, x′ ∈ W_T and s ∈ [S],

\[ |\Phi_s(x) - \Phi_s(x')| + \|\nabla\Phi_s(x) - \nabla\Phi_s(x')\| \le K\|x - x'\|. \]

(A2) f_t(·, θ) and L_t(·, θ) are twice continuously differentiable in x, with f_t, ∇_x f_t, L_t, ∇_x L_t satisfying Lipschitz conditions in x uniformly in t and θ, i.e. there exists K > 0 such that

\[ \|f_t(x,\theta) - f_t(x',\theta)\| + \|\nabla_x f_t(x,\theta) - \nabla_x f_t(x',\theta)\|_2 + |L_t(x,\theta) - L_t(x',\theta)| + \|\nabla_x L_t(x,\theta) - \nabla_x L_t(x',\theta)\| \le K\|x - x'\| \]

for all x, x′ ∈ W_t, θ ∈ Θ_t and t = 0, …, T − 1.

Again, let us discuss these assumptions with respect to neural networks. Note that both assumptions are easily satisfied if each W_t is bounded, which is in turn usually implied by the boundedness of Θ_t. Although this need not hold in principle, we can safely assume it in practice by truncating weights that are too large in magnitude. Consequently, (A1) is not very restrictive, since many commonly employed loss functions (mean-square, soft-max with cross-entropy) satisfy these assumptions. In (A2), the regularity assumption on L_t is again not an issue, because we mostly take L_t to be independent of x. On the other hand, the regularity of f_t with respect to x is sometimes restrictive. For example, ReLU activations do not satisfy (A2) due to non-differentiability; nevertheless, any suitably mollified version (like Soft-plus) does satisfy it, and tanh and sigmoid activations also satisfy (A2). Finally, unlike in Theorem 1, we do not assume the convexity of the sets {f_t(x, θ) : θ ∈ Θ_t} and {L_t(x, θ) : θ ∈ Θ_t}; hence the results in this section apply to the discrete-weight neural networks considered in Sec. 4. With the above assumptions, we prove the following estimate.

Theorem 2 (Error Estimate for Discrete MSA). Let assumptions (A1) and (A2) be satisfied. Then, there exists a constant C > 0, independent of S, θ and φ, such that for any θ, φ ∈ Θ, we have

\[
\begin{aligned}
J(\phi) - J(\theta) \le\; & -\sum_{t=0}^{T-1}\sum_{s=1}^{S}\Big[H_t(x^{\theta}_{s,t}, p^{\theta}_{s,t+1}, \phi_t) - H_t(x^{\theta}_{s,t}, p^{\theta}_{s,t+1}, \theta_t)\Big] & (10)\\
& + \frac{C}{S}\sum_{t=0}^{T-1}\sum_{s=1}^{S}\big\|f_t(x^{\theta}_{s,t}, \phi_t) - f_t(x^{\theta}_{s,t}, \theta_t)\big\|^2 & (11)\\
& + \frac{C}{S}\sum_{t=0}^{T-1}\sum_{s=1}^{S}\big\|\nabla_x f_t(x^{\theta}_{s,t}, \phi_t) - \nabla_x f_t(x^{\theta}_{s,t}, \theta_t)\big\|_2^2 & (12)\\
& + \frac{C}{S}\sum_{t=0}^{T-1}\sum_{s=1}^{S}\big\|\nabla_x L_t(x^{\theta}_{s,t}, \phi_t) - \nabla_x L_t(x^{\theta}_{s,t}, \theta_t)\big\|^2, & (13)
\end{aligned}
\]

where x^θ_s, p^θ_s are defined by Eq. (7) and (8).

Proof. The proof follows from elementary estimates and a discrete Gronwall's lemma. See Appendix B.

Theorem 2 relates the decrement of the total objective function J to the iterative projection steps of the MSA. Intuitively, it says that the Hamiltonian maximization step (9) is generally the right direction, because a large magnitude of (10) results in a higher loss improvement. However, whenever we change the parameters from θ to φ (e.g. during the maximization step (9)), we incur the non-negative penalty terms (11)-(13). Observe that these penalty terms vanish if φ = θ, or more generally, when the state and co-state equations (Eq. (7), (8)) are still satisfied when θ is replaced by φ. In other words, these terms measure the distance from the manifolds defined by the state and co-state equations when the parameters change. Alg. 1 diverges when these penalty terms dominate the gains from (10). This insight points us in the right direction for developing convergent modifications of the basic MSA. We shall now discuss this in the context of some specific applications.
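Since the penalty terms are computable from quantities already produced by an MSA sweep, one practical use of Theorem 2 is to monitor them when assessing a candidate update φ. Below is a sketch under our own naming assumptions; it evaluates the sums in (11) and (12) (bounding the induced 2-norm in (12) by the Frobenius norm, as is also done in Sec. 4.1) and omits (13) for the common case where L_t does not depend on x.

```python
import numpy as np

def msa_penalty_terms(xs, theta, phi, fs, dfs):
    """Feasibility penalties (11)-(12) of Theorem 2 along the theta-trajectory.

    xs:  states [x_0, ..., x_T] computed under theta via Eq. (7)
    fs:  fs[t](x, th) -> batched layer map f_t
    dfs: dfs[t](x, th) -> batched Jacobians of f_t with respect to x
    Returns the sums in (11) and (12) up to the (unknown) constant C.
    """
    S = xs[0].shape[0]
    pen_f = pen_df = 0.0
    for t in range(len(theta)):
        pen_f += np.sum((fs[t](xs[t], phi[t]) - fs[t](xs[t], theta[t])) ** 2)
        # the Frobenius norm upper-bounds the induced 2-norm in (12)
        pen_df += np.sum((dfs[t](xs[t], phi[t]) - dfs[t](xs[t], theta[t])) ** 2)
    return pen_f / S, pen_df / S
```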
4. Neural Networks with Discrete Weights

We now turn to applications of the theory developed in the previous section on the MSA, a PMP-based numerical method for training deep neural networks. As discussed previously, the main strength of the PMP and MSA formalism is that it does not rely on gradient-descent type updates. This is particularly useful when one considers neural networks with (some) trainable parameters that can only take values in a discrete set: any small gradient update to such parameters will almost always be infeasible. In this section, we consider two such cases: binary networks, where weights are restricted to {−1, +1}, and ternary networks, where weights are selected from {−1, +1, 0}. These networks are potentially useful for low-memory devices, since storing the trained weights requires less memory. We now modify the MSA so that we can train these networks in a principled way.

4.1. Binary Networks

Binary neural networks are those with binary trainable layers, e.g. in the fully connected case,

\[ f_t(x, \theta) = \theta x, \tag{14} \]

where θ ∈ Θ_t = {−1, +1}^{d_t × d_{t+1}} is a binary matrix. A similar form of f_t holds for convolutional neural networks after reshaping, except that Θ_t is then a set of binary Toeplitz matrices. Hereafter, we consider the fully connected case for simplicity of exposition. It is also natural to set the regularization to 0, since there is in general no preference between +1 and −1. Thus, the Hamiltonian has the form

\[ H_t(x, p, \theta) = p \cdot \theta x. \]

Consequently, the Hamiltonian maximization step (9) has an explicit solution, given by

\[ \arg\max_{\theta \in \Theta_t} \sum_{s=1}^{S} H_t(x^{\theta^k}_{s,t}, p^{\theta^k}_{s,t+1}, \theta) = \operatorname{sign}(M^{\theta^k}_t), \]

where M^θ_t := Σ_{s=1}^S p^θ_{s,t+1} (x^θ_{s,t})ᵀ and the sign function is applied element-wise. If [M^θ_t]_{ij} = 0, then the arg-max is arbitrary in that entry. Using Theorem 2 with the form of f_t given by (14) and the fact that L_t ≡ 0, we get

\[ J(\phi) - J(\theta) \le -\sum_{t=0}^{T-1}\sum_{s=1}^{S}\Big[H_t(x^{\theta}_{s,t}, p^{\theta}_{s,t+1}, \phi_t) - H_t(x^{\theta}_{s,t}, p^{\theta}_{s,t+1}, \theta_t)\Big] + \frac{C}{S}\sum_{t=0}^{T-1}\sum_{s=1}^{S}\big(1 + \|x^{\theta}_{s,t}\|^2\big)\,\|\phi_t - \theta_t\|_F^2, \]

where we have used the inequality ‖·‖₂ ≤ ‖·‖_F. Assuming that ‖x^θ_{s,t}‖ is O(1), we may then decrease J by not only maximizing the Hamiltonian, but also penalizing the difference ‖φ_t − θ_t‖_F, i.e. for each k and t we set

\[ \theta^{k+1}_t = \arg\max_{\theta \in \Theta_t}\left[ \sum_{s=1}^{S} H_t(x^{\theta^k}_{s,t}, p^{\theta^k}_{s,t+1}, \theta) - \rho_{k,t}\,\|\theta - \theta^k_t\|_F^2 \right] \tag{15} \]

for some penalization parameters ρ_{k,t} > 0. This again has the explicit solution

\[ [\theta^{k+1}_t]_{ij} = \begin{cases} \operatorname{sign}([M^{\theta^k}_t]_{ij}) & \text{if } |[M^{\theta^k}_t]_{ij}| \ge 2\rho_{k,t}, \\ [\theta^k_t]_{ij} & \text{otherwise.} \end{cases} \tag{16} \]

Therefore, we simply replace the parameter update step in Alg. 1 with (16). Furthermore, to deal with mini-batches, we keep a moving average of M^{θᵏ}_t across different mini-batches and use the averaged value to update our parameters.
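In code, the update (16), including the mini-batch moving average just described, can be sketched as follows (our own vectorized illustration; the array names are assumptions):

```python
import numpy as np

def binary_update(theta_prev, M_avg, M_batch, rho, alpha):
    """Binary MSA update, Eq. (16), with a moving average of M_t.

    theta_prev: current binary weights, entries in {-1, +1}
    M_avg:      running average of M_t across mini-batches
    M_batch:    sum_s p_{s,t+1} x_{s,t}^T from the current mini-batch
    rho, alpha: penalty and averaging hyper-parameters rho_{k,t}, alpha_{k,t}
    """
    M_avg = alpha * M_avg + (1.0 - alpha) * M_batch
    flip = np.abs(M_avg) >= 2.0 * rho          # entries confident enough to move
    theta_new = np.where(flip, np.sign(M_avg), theta_prev)
    return theta_new, M_avg
```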
It is found empirically that this further stabilizes the algorithm. Note that the assumption that ‖x^θ_{s,t}‖ is O(1) can be enforced by normalization, e.g. batch-normalization (Ioffe & Szegedy, 2015). We summarize the algorithm in Alg. 2.

S 2009) and SVHN (Netzer et al., 2011) datasets. The network X θk θk θk structures are mostly identical to those considered in Cour- arg max Ht(xs,t, ps,t+1, θ) = sign(Mt ) θ∈Θt s=1 bariaux et al.(2015) for ease of comparison. Complete implementation and model details are found in Appendix D. The graphs of training/testing loss and error rates are θ PS θ θ T where Mt := s=1 ps,t+1(xs,t) . Note that the sign func- shown in Fig.1. We observe that our algorithm performs θ tion is applied element-wise. If [Mt ]ij = 0, then the arg- well in terms of an optimization algorithm, as measured max is arbitrary. Using Theorem2 with the form of ft given by the training loss and error rates. For the harder datasets An Optimal Control Approach to Deep Learning

The graphs of training/testing loss and error rates are shown in Fig. 1. We observe that our algorithm performs well as an optimization algorithm, as measured by the training loss and error rates. For the harder datasets (CIFAR-10 and SVHN), we have rapid convergence but worse test loss and error rates at the end, possibly due to overfitting. We note that in Courbariaux et al. (2015), many regularization strategies are employed, and we expect that similar techniques must be adopted to improve the testing performance; these issues are, however, outside the scope of the optimization framework of this paper. Note that we also compared with the results of BinaryConnect without regularization strategies such as stochastic binarization; the results are similar, in that our algorithm converges very fast with very low training losses, but sometimes overfits.

[Figure 1: training/testing loss (log scale) and error rate (%) vs. epoch on MNIST, CIFAR-10 and SVHN, for MSA and BinaryConnect.]
Figure 1. Comparison of binary MSA (Alg. 2) with BinaryConnect (Courbariaux et al., 2015) (with binary variables for inference). We observe that MSA has good convergence in terms of the training loss and error rates, showing that it is an efficient optimization algorithm. Note that to avoid broken lines, when the loss equals 0 exactly, we replace it by 1e-8 on the log scale. The test loss for the bigger datasets (CIFAR-10, SVHN) eventually becomes worse due to overfitting; hence, some regularization techniques are needed for applications which are prone to overfitting.

4.2. Ternary Networks

We now consider the case where the network weights are allowed to take values in {−1, +1, 0}; here, our goal is to explore the sparsification of the network. To this end, we take L_t(x, θ) = λ_t‖θ‖²_F for some parameter λ_t. Note that since the weights are restricted to the ternary set, all component-wise ℓ_p regularizations with p > 0 are identical. The higher the value of λ_t, the sparser the solution will be.

As in Sec. 4.1, we can write down the Hamiltonian for a fully connected ternary layer as

\[ H_t(x, p, \theta) = p \cdot \theta x - \tfrac{1}{S}\lambda_t\|\theta\|_F^2. \]

The derivation of the ternary algorithm then follows directly from that in Sec. 4.1, but with this new form of the Hamiltonian and with Θ_t = {−1, +1, 0}^{d_t × d_{t+1}}. Maximizing the augmented Hamiltonian (15) with H_t as defined above, we obtain the ternary update rule

\[ [\theta^{k+1}_t]_{ij} = \begin{cases} +1 & \text{if } [M^{\theta^k}_t]_{ij} \ge \rho_{k,t}(1 - 2[\theta^k_t]_{ij}) + \lambda_t, \\ -1 & \text{if } [M^{\theta^k}_t]_{ij} \le -\rho_{k,t}(1 + 2[\theta^k_t]_{ij}) - \lambda_t, \\ 0 & \text{otherwise.} \end{cases} \tag{17} \]

We replace the parameter update step in Alg. 2 by (17) to obtain the MSA algorithm for ternary networks. For completeness, we give the full ternary algorithm in Alg. 3.
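A sketch of the update (17) in code, sharing the moving-average convention of Alg. 2 (again our own illustration with assumed array names), reads:

```python
import numpy as np

def ternary_update(theta_prev, M_avg, M_batch, rho, alpha, lam):
    """Ternary MSA update, Eq. (17), with the moving average of Alg. 2.

    lam is the sparsity penalty lambda_t; larger values push more
    entries of theta towards 0. theta_prev has entries in {-1, 0, +1}.
    """
    M_avg = alpha * M_avg + (1.0 - alpha) * M_batch
    up = M_avg >= rho * (1.0 - 2.0 * theta_prev) + lam     # promote to +1
    down = M_avg <= -rho * (1.0 + 2.0 * theta_prev) - lam  # demote to -1
    theta_new = np.where(up, 1.0, np.where(down, -1.0, 0.0))
    return theta_new, M_avg
```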
We now test the ternary algorithm on the same benchmarks used in Sec. 4.1; the results are shown in Fig. 2. Observe that the performance on the training and testing datasets is similar to the binary case (Fig. 1), but the ternary networks achieve high degrees of sparsity in the weights, with only 0.5-2.5% of the trained weights being non-zero, depending on the dataset. This potentially offers significant memory savings compared to binary or full floating-point precision counterparts.

Algorithm 3 Ternary MSA
  Initialize: θ⁰, M̄⁰; hyper-parameters: ρ_{k,t}, α_{k,t}, λ_t;
  for k = 0 to #Iterations do
    x^{θᵏ}_{s,t+1} = f_t(x^{θᵏ}_{s,t}, θᵏ_t) ∀ s, t, with x^{θᵏ}_{s,0} = x_{s,0};
    p^{θᵏ}_{s,t} = ∇_x H_t(x^{θᵏ}_{s,t}, p^{θᵏ}_{s,t+1}, θᵏ_t) ∀ s, t, with p^{θᵏ}_{s,T} = −(1/S)∇Φ_s(x^{θᵏ}_{s,T});
    M̄^{k+1}_t = α_{k,t} M̄ᵏ_t + (1 − α_{k,t}) Σ_{s=1}^S p^{θᵏ}_{s,t+1}(x^{θᵏ}_{s,t})ᵀ;
    [θ^{k+1}_t]_{ij} = +1 if [M̄^{k+1}_t]_{ij} ≥ ρ_{k,t}(1 − 2[θᵏ_t]_{ij}) + λ_t; −1 if [M̄^{k+1}_t]_{ij} ≤ −ρ_{k,t}(1 + 2[θᵏ_t]_{ij}) − λ_t; 0 otherwise; ∀ t, i, j;
  end for

[Figure 2: error rate (%) and weight sparsity (%) vs. epoch on MNIST, CIFAR-10 and SVHN, for ternary MSA and BinaryConnect.]
Figure 2. Performance of the ternary MSA (Alg. 3) vs. BinaryConnect with a simple thresholding procedure described in Li et al. (2016). In the second column of plots, we show the sparsity of the networks (defined as the percentage of all weights that are non-zero) as training proceeds. Observe that the MSA algorithm finds solutions with error rates comparable to the binary case (whose final test error is plotted as a gray horizontal line) but very sparse. In comparison, the simple thresholding of BinaryConnect does not produce sparse solutions. In fact, the final sparsities for MSA are approximately: MNIST: <1.0%; CIFAR-10: 0.9%; SVHN: 2.4%. It is expected that sparser solutions can be found by adjusting the penalty parameters λ_t.

5. Discussion and Related Work

We begin with a discussion of the results presented thus far. We first introduced the viewpoint that deep learning can be regarded as a discrete-time optimal control problem. Consequently, an important result in optimal control theory, the Pontryagin's maximum principle, can be applied to give a set of necessary conditions for optimality. These are in general stronger conditions than the usual optimality conditions based on the vanishing of first-order partial derivatives. Moreover, they apply to broader contexts, such as problems with constraints on the trainable parameters or problems that are non-differentiable in the trainable parameters. However, we note that specific assumptions regarding the convexity of some sets must be satisfied. We showed that they are justified for conventional neural networks, but not necessarily for all neural networks (e.g. binary and ternary networks).

Next, based on the PMP, we introduced an iterative projection technique, the discrete method of successive approximations (MSA), to find an optimal solution of the learning problem. A rigorous error estimate (Theorem 2) is derived for the discrete MSA, which can be used both to understand its dynamics and to derive useful algorithms. This should be viewed as the main theoretical result of the present paper. Note that the usual back-propagation with gradient descent can be regarded as a simple modification of the MSA, if differentiability conditions are assumed (see Appendix C). Nevertheless, we note that Theorem 2 itself does not assume any regularity conditions with respect to the trainable parameters. Moreover, neither does it require the convexity conditions in Theorem 1; hence it applies to a wider range of neural networks, including those in Sec. 4. All results up to this point apply to general neural networks (assuming that the respective conditions are satisfied), and are not specific to the applications presented subsequently.

In the last part of this work, we apply our results to devise training algorithms for discrete-weight neural networks, i.e. those with trainable parameters that can only take values in a discrete set. Besides potential applications in model deployment on low-memory devices, the main reasons for choosing such applications are two-fold. First, gradient-descent updates are not applicable by themselves, because small updates to parameters are prohibited by the discrete equality constraint on the trainable parameters; our method based on the MSA is applicable, since it does not perform gradient-descent updates. Second, in such applications the potentially expensive Hamiltonian maximization steps in the MSA have explicit solutions. This makes the MSA an attractive optimization method for problems of this nature. In Sec. 4, we demonstrated the effectiveness of our methods on various benchmark datasets. Interestingly, the ternary network exhibits extremely sparse weights that perform almost as well as its binary counterpart (see Fig. 2). Also, the phenomena of overfitting in Fig. 1 and 2 are interesting, as overfitting is generally less common in stochastic gradient based optimization approaches. This seems to suggest that MSA based methods optimize neural networks in a rather different way.

Let us now put our work in the context of the existing literature. First, the optimal control approach we adopt is quite different from the prevailing viewpoint of nonlinear programming (Bertsekas, 1999; Bazaraa et al., 2013; Kuhn & Tucker, 2014) and the analysis of the derived gradient-based algorithms (Moulines & Bach, 2011; Shamir & Zhang, 2013; Bach & Moulines, 2013; Xiao & Zhang, 2014; Shalev-Shwartz & Zhang, 2014) for the training of deep neural networks. In particular, the PMP (Thm. 1) and the MSA error estimate (Thm. 2) do not assume differentiability and do not characterize optimality via gradients (or sub-gradients) with respect to the trainable parameters. In this sense, they constitute a stronger and more robust condition, albeit sometimes requiring different assumptions. The optimal control and dynamical systems viewpoint has been discussed in the context of deep learning in E (2017); Li et al. (2018), and dynamical-systems based discretization schemes have been introduced in Haber & Ruthotto (2017); Chang et al. (2017). Most of these works have their theoretical basis in continuous-time dynamical systems. In particular, Li et al. (2018) analyzed continuous-time analogues of neural networks in the optimal control framework and derived MSA-based algorithms in continuous time. In contrast, the present work presents a discrete-time formulation, which is natural in the usual context of deep learning. The discrete PMP turns out to be more subtle, as it requires additional assumptions of convexity of reachable sets (Thm. 1). Note also that, unlike the estimates derived in Li et al. (2018), Thm. 2 holds rigorously for discrete-time neural networks. The present method for stabilizing the MSA is also different from that in Li et al. (2018), where augmented Lagrangian type modifications are employed; the latter would not be effective here, because weights cannot be updated infinitesimally without violating the binary/ternary constraint. Moreover, the present methods that rely on explicit solutions of the Hamiltonian maximization are fast (comparable to SGD) on a wall-clock basis.
In the deep learning literature, the connection between optimal control and deep learning has been qualitatively discussed in LeCun (1988) and applied to the development of automatic differentiation and back-propagation (Bryson, 1975; Baydin et al., 2015). However, there are relatively few works relating optimal control algorithms to the training of neural networks beyond the classical gradient-descent with back-propagation. Optimal control based strategies for hyper-parameter tuning have been discussed in Li et al. (2017b).

In the continuous-time setting, the Pontryagin's maximum principle and the method of successive approximations have a long history, with a large body of relevant literature including, but not limited to, Boltyanskii et al. (1960); Pontryagin (1987); Bryson (1975); Bertsekas (1995); Athans & Falb (2013); Krylov & Chernousko (1962); Aleksandrov (1968); Krylov & Chernousko (1972); Chernousko & Lyubushin (1982); Lyubushin (1982). The discrete-time PMP has been studied in Halkin (1966); Holtzman (1966a); Holtzman & Halkin (1966); Holtzman (1966b); Canon et al. (1970), where Theorem 1 and its extensions are proved. To the best of our knowledge, the discrete-time MSA and its quantitative analysis have not been performed in either the deep learning or the optimal control literature.
Sec. 4 concerns the application of the MSA, in particular Thm. 2, to develop training algorithms for binary and ternary neural networks. There are a number of prior works exploring the training of similar neural networks, such as Courbariaux et al. (2015); Hubara et al. (2016); Rastegari et al. (2016); Tang et al. (2017); Li et al. (2016); Zhu et al. (2016). Theoretical analysis for the case of convex loss functions is carried out in Li et al. (2017a). Our point of numerical comparison for the binary MSA algorithm is Courbariaux et al. (2015), where the optimization of binary networks is based on shadow variables with full floating-point precision that are iteratively truncated to obtain gradients. We showed in Sec. 4.1 that the binary MSA is competitive as a training algorithm, but is in need of modifications to reduce overfitting for certain datasets. Training ternary networks has been discussed in Hwang & Fan (1967); Kim et al. (2014); Li et al. (2016); Zhu et al. (2016). The difference in our ternary formulation is that we explore the sparsification of networks using a regularization parameter. In this sense it is related to the compression of neural networks (e.g. Han et al. (2015)), but our approach trains a network that is naturally ternary, and compression is achieved during training by a regularization term. Generally, a contrasting aspect of our approach relative to the aforementioned literature is that the theory of optimal control, together with Theorem 2, provides a theoretical basis for the development of our algorithms. Nevertheless, further work is required to rigorously establish the convergence of these algorithms. We also mention a recent work (Yin et al., 2018) which analyzes quantized networks and develops algorithms based on relaxing the discrete-weight constraints into continuous regularizers. Lastly, there are also analyses of quantized networks from a statistical-mechanical viewpoint (Baldassi et al., 2015; 2016a;b; 2017).

6. Conclusion and Outlook

In this paper, we have introduced the discrete-time optimal control viewpoint of deep learning. In particular, the PMP and the MSA form an alternative theoretical and algorithmic basis for deep learning that may apply to broader contexts. As an application of our framework, we considered the training of binary and ternary neural networks, for which we developed effective algorithms based on optimal control.

There are certainly many avenues of future work. An interesting mathematical question is the applicability of the PMP to discrete-weight neural networks, which do not satisfy the convexity assumptions in Theorem 1. It would be desirable to find the conditions under which rigorous statements can be made. Another question is to establish the convergence of the algorithms presented.

References

Aleksandrov, V. V. On the accumulation of perturbations in the linear systems with two coordinates. Vestnik MGU, 3, 1968.

Athans, M. and Falb, P. L. Optimal control: an introduction to the theory and its applications. Courier Corporation, 2013.

Bach, F. and Moulines, E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pp. 773–781, 2013.

Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Physical Review Letters, 115(12):128101, 2015.

Baldassi, C., Borgs, C., Chayes, J. T., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48):E7655–E7662, 2016a.

Baldassi, C., Gerace, F., Lucibello, C., Saglietti, L., and Zecchina, R. Learning may need only a few bits of synaptic precision. Physical Review E, 93(5):052313, 2016b.

Baldassi, C., Gerace, F., Kappen, H. J., Lucibello, C., Saglietti, L., Tartaglione, E., and Zecchina, R. On the role of synaptic stochasticity in training low-precision neural networks. arXiv preprint arXiv:1710.09825, 2017.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767, 2015.

Bazaraa, M. S., Sherali, H. D., and Shetty, C. M. Nonlinear programming: theory and algorithms. John Wiley & Sons, 2013.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.

Bertsekas, D. P. Nonlinear programming. Athena Scientific, Belmont, 1999.

Boltyanskii, V. G., Gamkrelidze, R. V., and Pontryagin, L. S. The theory of optimal processes. I. The maximum principle. Technical report, TRW Space Technology Labs, Los Angeles, California, 1960.

Bryson, A. E. Applied optimal control: optimization, estimation and control. CRC Press, 1975.

Canon, M. D., Cullum Jr, C. D., and Polak, E. Theory of optimal control and mathematical programming. McGraw-Hill Book Company, 1970.

Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.

Chernousko, F. L. and Lyubushin, A. A. Method of successive approximations for solution of optimal control problems. Optimal Control Applications and Methods, 3(2):101–114, 1982.

Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

E, W. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. arXiv preprint arXiv:1705.03341, 2017.

Halkin, H. A maximum principle of the Pontryagin type for systems described by nonlinear difference equations. SIAM Journal on Control, 4(1):90–111, 1966.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Holtzman, J. Convexity and the maximum principle for discrete systems. IEEE Transactions on Automatic Control, 11(1):30–35, 1966a.

Holtzman, J. On the maximum principle for nonlinear discrete-time systems. IEEE Transactions on Automatic Control, 11(2):273–274, 1966b.

Holtzman, J. M. and Halkin, H. Directional convexity and the maximum principle for discrete systems. SIAM Journal on Control, 4(2):263–275, 1966.

Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107–4115, 2016.
Hwang, C. and Fan, L. A discrete version of Pontryagin's maximum principle. Operations Research, 15(1):139–146, 1967.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Kim, J., Hwang, K., and Sung, W. X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 7510–7514. IEEE, 2014.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Krylov, I. A. and Chernousko, F. L. On the method of successive approximations for solution of optimal control problems. J. Comp. Mathem. and Mathematical Physics, 2(6), 1962.

Krylov, I. A. and Chernousko, F. L. An algorithm for the method of successive approximations in optimal control problems. USSR Computational Mathematics and Mathematical Physics, 12(1):15–38, 1972.

Kuhn, H. W. and Tucker, A. W. Nonlinear programming. In Traces and Emergence of Nonlinear Programming, pp. 247–258. Springer, 2014.

LeCun, Y. A theoretical framework for back-propagation. In The Connectionist Models Summer School, volume 1, pp. 21–28, 1988.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Li, F., Zhang, B., and Liu, B. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

Li, H., De, S., Xu, Z., Studer, C., Samet, H., and Goldstein, T. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, pp. 5813–5823, 2017a.

Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017b.

Li, Q., Chen, L., Tai, C., and E, W. Maximum principle based algorithms for deep learning. Journal of Machine Learning Research, 18:1–29, 2018.

Lyubushin, A. A. Modifications of the method of successive approximations for solving optimal control problems. USSR Computational Mathematics and Mathematical Physics, 22(1):29–34, 1982.

Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 5, 2011.

Ogata, K. Discrete-time control systems, volume 2. Prentice Hall, Englewood Cliffs, NJ, 1995.

Pontryagin, L. S. Mathematical theory of optimal processes. CRC Press, 1987.

Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Shalev-Shwartz, S. and Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pp. 1–41, 2014.

Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML (1), pp. 71–79, 2013.

Tang, W., Hua, G., and Wang, L. How to train a compact binary neural network with high accuracy? In AAAI, pp. 2625–2631, 2017.

Warga, J. Relaxed variational problems. Journal of Mathematical Analysis and Applications, 4(1):111–128, 1962.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., and Xin, J. BinaryRelax: A relaxation approach for training deep neural networks with quantized weights. arXiv preprint arXiv:1801.06313, 2018.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.