An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks

Qianxiao Li 1 Shuji Hao 1

Abstract

Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on the Pontryagin's maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights that are constrained to take values in a discrete set. We obtain competitive performance and, interestingly, very sparse weights in the case of ternary networks, which may be useful for model deployment in low-memory devices.

1. Introduction

The problem of training deep feed-forward neural networks is often studied as a nonlinear programming problem (Bazaraa et al., 2013; Bertsekas, 1999; Kuhn & Tucker, 2014)

\[ \min_\theta J(\theta), \]

where θ represents the set of trainable parameters and J is the empirical loss function. In the general unconstrained case, necessary optimality conditions are given by the condition ∇_θ J(θ*) = 0 for an optimal set of training parameters θ*. This is largely the basis for (stochastic) gradient-descent based optimization algorithms in deep learning (Robbins & Monro, 1951; Duchi et al., 2011; Zeiler, 2012; Kingma & Ba, 2014). When there are additional constraints, e.g. on the trainable parameters, one can instead employ projected versions of the above algorithms. More broadly, necessary conditions for optimality can be derived in the form of the Karush-Kuhn-Tucker conditions (Kuhn & Tucker, 2014). Such approaches are quite general and typically do not rely on the structure of the objectives encountered in deep learning. However, in deep learning the objective function J often has a specific structure: it is derived from feeding a batch of inputs recursively through a sequence of trainable transformations, which can be adjusted so that the final outputs are close to some fixed target set. This process resembles an optimal control problem (Bryson, 1975; Bertsekas, 1995; Athans & Falb, 2013), a subject that originates from the study of the calculus of variations.

In this paper, we exploit this optimal control viewpoint of deep learning. First, we introduce the discrete-time Pontryagin's maximum principle (PMP) (Halkin, 1966), which is an extension of the central result in optimal control due to Pontryagin and coworkers (Boltyanskii et al., 1960; Pontryagin, 1987). This is an alternative set of necessary conditions characterizing optimality, and we discuss the extent of its validity in the context of deep learning. Next, we introduce the discrete method of successive approximations (MSA), based on the PMP, to optimize deep neural networks. A rigorous error estimate is proved that elucidates the dynamics of the MSA and aids us in designing optimization algorithms under rather general conditions. We apply our method to train a class of unconventional networks, i.e. those with discrete-valued weights, to illustrate the usefulness of this approach. In the process, we discover that in the case of ternary networks, our training algorithm obtains trained models that are very sparse, which is an attractive feature in practice.

The rest of the paper is organized as follows: in Sec. 2, we introduce the optimal control viewpoint and the discrete-time Pontryagin's maximum principle. We then introduce the method of successive approximations in Sec. 3 and prove
our main estimate, Theorem 2. In Sec. 4, we derive algorithms based on the developed theory to train binary and ternary neural networks. Finally, we end with a discussion of related work and a conclusion in Sec. 5 and 6, respectively. Various details on proofs and algorithms are provided in Appendix A-D, which also contains a link to a software implementation of our algorithms that reproduces all experiments in this paper.

¹Institute of High Performance Computing, Singapore. Correspondence to: Qianxiao Li. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Hereafter, we denote the usual Euclidean norm by ‖·‖ and the corresponding induced matrix norm by ‖·‖₂. The Frobenius norm is written as ‖·‖_F. Throughout this work, we use a bold-faced version of a variable to represent a collection of the same variable, indexed additionally by t, e.g. θ := {θ_t : t = 0, …, T − 1}.

2. The Optimal Control Viewpoint

In this section, we formalize the problem of training a deep neural network as an optimal control problem. Let T ∈ Z₊ denote the number of layers and {x_{s,0} ∈ R^{d_0} : s = 1, …, S} represent a collection of fixed inputs (images, time-series). Here, S ∈ Z₊ is the sample size. Consider the dynamical system

\[ x_{s,t+1} = f_t(x_{s,t}, \theta_t), \qquad t = 0, 1, \dots, T-1, \tag{1} \]

where for each t, f_t : R^{d_t} × Θ_t → R^{d_{t+1}} is a transformation on the state. For example, in typical neural networks, it can represent a trainable affine transformation or a non-linear activation (in which case it is not trainable and f_t does not depend on θ). We assume that each trainable parameter set Θ_t is a subset of a Euclidean space. The goal of training a neural network is to adjust the weights θ := {θ_t : t = 0, …, T − 1} so as to minimize some loss function that measures the difference between the final network output x_{s,T} and the true targets y_s of x_{s,0}, which are fixed. Thus, we may define a family of real-valued functions Φ_s : R^{d_T} → R acting on x_{s,T} (the y_s are absorbed into the definition of Φ_s), and the average loss function is Σ_s Φ_s(x_{s,T})/S. In addition, we may consider regularization terms for each layer, L_t : R^{d_t} × Θ_t → R, that have to be simultaneously minimized. In typical applications, regularization is only performed on the trainable parameters, so that L_t(x, θ) ≡ L_t(θ), but here we consider the slightly more general case where it is also possible to regularize the state at each layer. In summary, we wish to solve the following problem:

\[
\begin{aligned}
\min_{\theta \in \Theta}\; & J(\theta) := \frac{1}{S}\sum_{s=1}^{S} \Phi_s(x_{s,T}) + \frac{1}{S}\sum_{s=1}^{S}\sum_{t=0}^{T-1} L_t(x_{s,t}, \theta_t) \\
\text{subject to:}\; & x_{s,t+1} = f_t(x_{s,t}, \theta_t), \quad t = 0, \dots, T-1,\; s \in [S],
\end{aligned}
\tag{2}
\]

where we have defined for shorthand Θ := Θ_0 × ⋯ × Θ_{T−1} and [S] := {1, …, S}. One may recognize problem (2) as a classical fixed-time, variable-terminal-state optimal control problem in discrete time (Ogata, 1995), in fact a special one with almost decoupled dynamics across the samples in [S].
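To make the correspondence concrete, the following minimal sketch evaluates J(θ) of problem (2) by running the dynamics (1) on a batch of inputs. This is our own illustration rather than the paper's released implementation; the function and argument names are hypothetical, and the layer maps, regularizers and terminal loss are assumed to be supplied by the user in batched form.

```python
import numpy as np

def evaluate_objective(x0, thetas, layers, regs, Phi):
    """Evaluate J(theta) of problem (2) by running the dynamics (1).

    x0:     array of shape (S, d_0) holding the fixed inputs x_{s,0}
    thetas: list [theta_0, ..., theta_{T-1}] of per-layer parameters
    layers: list of batched maps f_t(x, theta) -> next state
    regs:   list of batched regularizers L_t(x, theta) -> sum over samples
    Phi:    terminal loss, Phi(x_T) -> sum_s Phi_s(x_{s,T}) (targets absorbed)
    """
    S = x0.shape[0]
    x, reg_total = x0, 0.0
    for f_t, L_t, theta_t in zip(layers, regs, thetas):
        reg_total += L_t(x, theta_t)   # accumulate sum_s L_t(x_{s,t}, theta_t)
        x = f_t(x, theta_t)            # forward dynamics (1)
    return (Phi(x) + reg_total) / S    # the objective J(theta) in (2)
```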
2.1. The Pontryagin's Maximum Principle

Maximum principles of the Pontryagin type (Boltyanskii et al., 1960; Pontryagin, 1987) usually consist of necessary conditions for optimality in the form of the maximization of a certain Hamiltonian function. The distinguishing feature is that they do not assume differentiability (or even continuity) of f_t with respect to θ. Consequently, the optimality condition and the algorithms based on it need not rely on gradient-descent type updates. This is an attractive feature for certain classes of applications.

Let θ* = {θ*_0, …, θ*_{T−1}} ∈ Θ be a solution of (2). We now outline informally the Pontryagin's maximum principle (PMP) that characterizes θ*. First, for each t we define the Hamiltonian function H_t : R^{d_t} × R^{d_{t+1}} × Θ_t → R by

\[ H_t(x, p, \theta) := p \cdot f_t(x, \theta) - \tfrac{1}{S} L_t(x, \theta). \tag{3} \]

One can show the following necessary conditions.

Theorem 1 (Discrete PMP, Informal Statement). Let f_t and Φ_s, s = 1, …, S, be sufficiently smooth in x. Assume further that for each t and x ∈ R^{d_t}, the sets {f_t(x, θ) : θ ∈ Θ_t} and {L_t(x, θ) : θ ∈ Θ_t} are convex. Then, there exist co-state processes p*_s := {p*_{s,t} : t = 0, …, T} such that the following holds for t = 0, …, T − 1 and s ∈ [S]:

\[ x^*_{s,t+1} = \nabla_p H_t(x^*_{s,t}, p^*_{s,t+1}, \theta^*_t), \qquad x^*_{s,0} = x_{s,0}, \tag{4} \]
\[ p^*_{s,t} = \nabla_x H_t(x^*_{s,t}, p^*_{s,t+1}, \theta^*_t), \qquad p^*_{s,T} = -\tfrac{1}{S}\nabla\Phi_s(x^*_{s,T}), \tag{5} \]
\[ \sum_{s=1}^{S} H_t(x^*_{s,t}, p^*_{s,t+1}, \theta^*_t) \ge \sum_{s=1}^{S} H_t(x^*_{s,t}, p^*_{s,t+1}, \theta), \qquad \forall\, \theta \in \Theta_t. \tag{6} \]

The full statement of Theorem 1 involves explicit smoothness assumptions and additional technicalities (such as the inclusion of an abnormal multiplier). In Appendix A, we state these assumptions and give a sketch of the proof based on Halkin (1966).

Let us discuss the PMP in detail. The state equation (4) is simply the forward propagation equation (1) under the optimal parameters θ*. Eq. (5) defines the dynamics of the co-state p*_s. To draw an analogy with nonlinear programming, the co-state can be interpreted as a set of Lagrange multipliers that enforces the constraint (1) when the optimization problem (2) is regarded as a joint optimization problem in θ and x_s, s ∈ [S]. In the optimal control and PMP viewpoint, it is perhaps more appropriate to think of the dynamics (5) as the evolution of the normal vector of a separating hyper-plane, which separates the set of reachable states from the set of states where the objective function takes values smaller than the optimum (see Appendix A).

The Hamiltonian maximization condition (6) is the centerpiece of the PMP. It says that an optimal solution θ* must globally maximize the (summed) Hamiltonian for each layer t = 0, …, T − 1.
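As an aside, the co-state dynamics (5) discussed above are straightforward to realize in code. The following is a minimal sketch under our own naming conventions, assuming the transposed Jacobian-vector products of f_t are available; up to the −1/S scaling of the terminal condition, it is precisely the familiar back-propagation recursion.

```python
import numpy as np

def costate_backward(xs, thetas, vjp_f, grad_Phi, grad_L=None):
    """Backward recursion (5): p_{s,t} = grad_x H_t(x_{s,t}, p_{s,t+1}, theta_t).

    xs:       list [x_0, ..., x_T] of batched states, x_t of shape (S, d_t)
    thetas:   per-layer parameters [theta_0, ..., theta_{T-1}]
    vjp_f:    vjp_f[t](x, theta, p) -> (d f_t / d x)^T p, batched
    grad_Phi: grad_Phi(x_T) -> per-sample gradients, shape (S, d_T)
    grad_L:   optional grad_L[t](x, theta) -> grad_x L_t, batched
    """
    S, T = xs[0].shape[0], len(thetas)
    p = [None] * (T + 1)
    p[T] = -grad_Phi(xs[T]) / S                  # terminal condition in (5)
    for t in reversed(range(T)):
        p[t] = vjp_f[t](xs[t], thetas[t], p[t + 1])
        if grad_L is not None:                   # state-regularization term of H_t
            p[t] = p[t] - grad_L[t](xs[t], thetas[t]) / S
    return p
```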

Let us contrast this statement with the usual first-order optimality conditions of the form ∇_θ J(θ*) = 0. A key observation is that in Theorem 1, we make no reference to the derivative of any quantity with respect to θ. In fact, the PMP holds even if f_t is not differentiable, or even continuous, with respect to θ, as long as the convexity assumptions are satisfied. On the other hand, if we assume for each t that 1) f_t is differentiable with respect to θ, 2) H_t is concave in θ, and 3) θ*_t lies in the interior of Θ_t, then the Hamiltonian maximization condition (6) is equivalent to the condition ∇_θ Σ_s H_t = 0 for all t, which one can then show is equivalent to ∇_θ J = 0 (see Appendix C, proof of Prop. C.1). In other words, the PMP can be viewed as a stronger set of necessary conditions (at optimality, H_t is not just stationary, but globally maximized) and has meaning in more general scenarios, e.g. when stationarity with respect to θ is not achievable due to constraints, or not defined due to non-differentiability.

Remark 1. It may occur that Σ_s H_t(x*_{s,t}, p*_{s,t+1}, θ) is constant for all θ ∈ Θ_t, in which case the problem is singular (Athans & Falb, 2013). In such cases, the PMP is trivially satisfied by any θ, and so it does not tell us anything useful. This may arise especially when there are no regularization terms.

2.2. The Convexity Assumption

The most crucial assumption in Theorem 1 is the convexity of the sets {f_t(x, θ) : θ ∈ Θ_t} and {L_t(x, θ) : θ ∈ Θ_t} for each fixed x.¹ We now discuss how restrictive these assumptions are with regard to deep neural networks. Let us first assume that the admissible sets Θ_t are convex. Then, the assumption with respect to L_t is not restrictive, since most regularizers (e.g. ℓ₁, ℓ₂) satisfy it. Let us consider the convexity of {f_t(x, θ) : θ ∈ Θ_t}. In classical feed-forward neural networks, there are two types of layers: trainable ones and non-trainable ones. Suppose layer t is non-trainable (e.g. f_t(x_t, θ_t) = σ(x_t), where σ is a non-linear activation function); then for each x the set {f_t(x, θ) : θ ∈ Θ_t} is a singleton, and hence trivially convex. On the other hand, in trainable layers f_t is usually affine in θ. This includes fully connected layers, convolution layers and batch normalization layers (Ioffe & Szegedy, 2015). In these cases, as long as the admissible set Θ_t is convex, we again satisfy the convexity assumption. Residual networks also satisfy the convexity constraint if one introduces auxiliary variables (see Appendix A.1). When the set Θ_t is not convex, it is in general not true that the PMP constitutes necessary conditions.

¹Note that this is in general unrelated to the convexity, in the sense of functions, of f_t with respect to either x or θ. For example, the scalar function f(x, θ) = θ³ sin(x) is evidently non-convex in both arguments, but {f(x, θ) : θ ∈ R} is convex for each x. On the other hand, {θx : θ ∈ {−1, 1}} is non-convex because of a non-convex admissible set.

Finally, we remark that in the original derivation of the PMP for continuous-time control systems (Boltyanskii et al., 1960) (i.e. ẋ_{s,t} = f_t(x_{s,t}, θ_t), t ∈ [0, T], in place of Eq. (1)), the convexity condition can be removed due to the "convexifying" effect of integration with respect to time (Halkin, 1966; Warga, 1962). Hence, the convexity condition is purely an artifact of discrete-time dynamical systems.

3. The Method of Successive Approximations

The PMP (Eq. (4)-(6)) gives us a set of necessary conditions that an optimal solution to (2) must satisfy. However, it does not tell us how to find one such solution. The goal of this section is to discuss algorithms for solving (2) based on the maximum principle.

On closer inspection of Eq. (4)-(6), one can see that they each represent a manifold in the solution space consisting of all possible θ, {x_s, s ∈ [S]} and {p_s, s ∈ [S]}, and the intersection of these three manifolds must contain an optimal solution, if one exists. Consequently, an iterative projection method that successively projects a guessed solution onto each of the manifolds is natural. This is the method of successive approximations (MSA), which was first introduced to solve continuous-time optimal control problems (Krylov & Chernousko, 1962; Chernousko & Lyubushin, 1982). Let us now outline a discrete-time version.

Start from an initial guess θ⁰ := {θ⁰_t : t = 0, …, T − 1}. For each sample s, we define x^{θ⁰}_s := {x^{θ⁰}_{s,t} : t = 0, …, T} by the dynamics

\[ x^{\theta^0}_{s,t+1} = f_t(x^{\theta^0}_{s,t}, \theta^0_t), \qquad x^{\theta^0}_{s,0} = x_{s,0}, \tag{7} \]

for t = 0, …, T − 1. Intuitively, this is a projection onto the manifold defined by Eq. (4).
Next, we perform the projection onto the manifold defined by Eq. (5), i.e. we define p^{θ⁰}_s := {p^{θ⁰}_{s,t} : t = 0, …, T} by the backward dynamics

\[ p^{\theta^0}_{s,t} = \nabla_x H_t(x^{\theta^0}_{s,t}, p^{\theta^0}_{s,t+1}, \theta^0_t), \qquad p^{\theta^0}_{s,T} = -\tfrac{1}{S}\nabla\Phi_s(x^{\theta^0}_{s,T}), \tag{8} \]

for t = T − 1, …, 0. Finally, we project onto the manifold defined by Eq. (6) by performing the Hamiltonian maximization to obtain θ¹ := {θ¹_t : t = 0, …, T − 1} with

\[ \theta^1_t = \arg\max_{\theta \in \Theta_t} \sum_{s=1}^{S} H_t(x^{\theta^0}_{s,t}, p^{\theta^0}_{s,t+1}, \theta), \qquad t = 0, \dots, T-1. \tag{9} \]

The steps (7)-(9) are then repeated until convergence. We summarize the basic MSA in Alg. 1.
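One sweep of the basic MSA is compactly expressed in code as follows — a sketch with hypothetical function names, where the forward, backward and per-layer maximization routines are assumed to implement (7), (8) and (9), respectively.

```python
def basic_msa(x0, theta0, forward, backward, argmax_hamiltonian, iters):
    """Basic MSA (Alg. 1): alternate the projections (7)-(9).

    forward(x0, theta)  -> states [x_0, ..., x_T]             (Eq. (7))
    backward(xs, theta) -> co-states [p_0, ..., p_T]          (Eq. (8))
    argmax_hamiltonian(t, x_t, p_t1) -> maximizer in Theta_t  (Eq. (9))
    """
    theta = list(theta0)
    for _ in range(iters):
        xs = forward(x0, theta)        # project onto the manifold of (4)
        ps = backward(xs, theta)       # project onto the manifold of (5)
        # the maximization decouples across layers and can be parallelized
        theta = [argmax_hamiltonian(t, xs[t], ps[t + 1])
                 for t in range(len(theta))]
    return theta
```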

Algorithm 1 Basic MSA
  Initialize: θ⁰ = {θ⁰_t ∈ Θ_t : t = 0, …, T − 1};
  for k = 0 to #Iterations do
    x^{θᵏ}_{s,t+1} = f_t(x^{θᵏ}_{s,t}, θᵏ_t), x^{θᵏ}_{s,0} = x_{s,0}, ∀ s, t;
    p^{θᵏ}_{s,t} = ∇_x H_t(x^{θᵏ}_{s,t}, p^{θᵏ}_{s,t+1}, θᵏ_t), p^{θᵏ}_{s,T} = −(1/S)∇Φ_s(x^{θᵏ}_{s,T}), ∀ s, t;
    θ^{k+1}_t = argmax_{θ∈Θ_t} Σ_{s=1}^S H_t(x^{θᵏ}_{s,t}, p^{θᵏ}_{s,t+1}, θ) for t = 0, …, T − 1;
  end for

Let us contrast the MSA with gradient-descent based methods. As in the formulation of the PMP, at no point did we take the derivative of any quantity with respect to θ. Hence, we can in principle apply this method to problems that are not differentiable with respect to θ. The catch, however, is that the Hamiltonian maximization step (9) may not be trivial to evaluate. Nevertheless, observe that the maximization step is decoupled across the different layers of the neural network; hence it is a much smaller problem than the original optimization problem, and its solution can be parallelized. Alternatively, as seen in Sec. 4, one can exploit cases where the maximization step has explicit solutions.

The basic MSA (Alg. 1) can be shown to converge for problems where f_t is linear and the costs Φ_s, L_t are quadratic (Aleksandrov, 1968). In general, however, unless a good initial condition is given, the MSA may diverge. Let us understand the nature of such phenomena by obtaining rigorous per-iteration error estimates for the steps (7)-(9).

3.1. An Error Estimate for the MSA

In this section, we derive a rigorous error estimate for the MSA, which can help us understand its dynamics. Let us define W_t := conv{x ∈ R^{d_t} : ∃ θ and s s.t. x^θ_{s,t} = x}, where x^θ_t is defined according to Eq. (7). This is the convex hull of all states reachable at layer t by some initial sample and some choice of the values of the trainable parameters. Let us now make the following assumptions:

(A1) Φ_s is twice continuously differentiable, with Φ_s and ∇Φ_s satisfying a Lipschitz condition, i.e. there exists K > 0 such that for all x, x′ ∈ W_T and s ∈ [S],

\[ |\Phi_s(x) - \Phi_s(x')| + \|\nabla\Phi_s(x) - \nabla\Phi_s(x')\| \le K\|x - x'\|. \]

(A2) f_t(·, θ) and L_t(·, θ) are twice continuously differentiable in x, with f_t, ∇_x f_t, L_t, ∇_x L_t satisfying Lipschitz conditions in x uniformly in t and θ, i.e. there exists K > 0 such that

\[ \|f_t(x,\theta) - f_t(x',\theta)\| + \|\nabla_x f_t(x,\theta) - \nabla_x f_t(x',\theta)\|_2 + |L_t(x,\theta) - L_t(x',\theta)| + \|\nabla_x L_t(x,\theta) - \nabla_x L_t(x',\theta)\| \le K\|x - x'\| \]

for all x, x′ ∈ W_t, θ ∈ Θ_t and t = 0, …, T − 1.

Again, let us discuss these assumptions with respect to neural networks. Note that both assumptions are easily satisfied if each W_t is bounded, which is in turn usually implied by the boundedness of Θ_t. Although this need not hold in principle, we can safely assume it in practice by truncating weights that are too large in magnitude. Consequently, (A1) is not very restrictive, since many commonly employed loss functions (mean-square, soft-max with cross-entropy) satisfy these assumptions. In (A2), the regularity assumption on L_t is again not an issue, because we mostly take L_t to be independent of x. On the other hand, the regularity of f_t with respect to x is sometimes restrictive. For example, ReLU activations do not satisfy (A2) due to non-differentiability; nevertheless, any suitably mollified version (like Soft-plus) does satisfy it, and tanh and sigmoid activations also satisfy (A2). Finally, unlike in Theorem 1, we do not assume the convexity of the sets {f_t(x, θ) : θ ∈ Θ_t} and {L_t(x, θ) : θ ∈ Θ_t}; hence the results in this section apply to the discrete-weight neural networks considered in Sec. 4. With the above assumptions, we prove the following estimate.

Theorem 2 (Error Estimate for Discrete MSA). Let assumptions (A1) and (A2) be satisfied. Then, there exists a constant C > 0, independent of S, θ and φ, such that for any θ, φ ∈ Θ, we have

\[
\begin{aligned}
J(\phi) - J(\theta) \le\; & -\sum_{t=0}^{T-1}\sum_{s=1}^{S}\Big[H_t(x^{\theta}_{s,t}, p^{\theta}_{s,t+1}, \phi_t) - H_t(x^{\theta}_{s,t}, p^{\theta}_{s,t+1}, \theta_t)\Big] & (10)\\
& + \frac{C}{S}\sum_{t=0}^{T-1}\sum_{s=1}^{S}\big\|f_t(x^{\theta}_{s,t}, \phi_t) - f_t(x^{\theta}_{s,t}, \theta_t)\big\|^2 & (11)\\
& + \frac{C}{S}\sum_{t=0}^{T-1}\sum_{s=1}^{S}\big\|\nabla_x f_t(x^{\theta}_{s,t}, \phi_t) - \nabla_x f_t(x^{\theta}_{s,t}, \theta_t)\big\|_2^2 & (12)\\
& + \frac{C}{S}\sum_{t=0}^{T-1}\sum_{s=1}^{S}\big\|\nabla_x L_t(x^{\theta}_{s,t}, \phi_t) - \nabla_x L_t(x^{\theta}_{s,t}, \theta_t)\big\|^2, & (13)
\end{aligned}
\]

where x^θ_s, p^θ_s are defined by Eq. (7) and (8).

Proof. The proof follows from elementary estimates and a discrete Gronwall's lemma. See Appendix B.

Theorem 2 relates the decrement of the total objective function J to the iterative projection steps of the MSA. Intuitively, it says that the Hamiltonian maximization step (9) is generally the right direction, because a large magnitude of (10) results in a higher loss improvement. However, whenever we change the parameters from θ to φ (e.g. during the maximization step (9)), we incur the non-negative penalty terms (11)-(13). Observe that these penalty terms vanish if φ = θ, or more generally, when the state and co-state equations (Eq. (7), (8)) are still satisfied when θ is replaced by φ. In other words, these terms measure the distance from the manifolds defined by the state and co-state equations when the parameters change. Alg. 1 diverges when these penalty terms dominate the gains from (10). This insight points us in the right direction for developing convergent modifications of the basic MSA. We shall now discuss this in the context of some specific applications.
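Since the penalty terms are computable from quantities already produced by an MSA sweep, one practical use of Theorem 2 is to monitor them when assessing a candidate update φ. Below is a sketch under our own naming assumptions; it evaluates the sums in (11) and (12) (bounding the induced 2-norm in (12) by the Frobenius norm, as is also done in Sec. 4.1) and omits (13) for the common case where L_t does not depend on x.

```python
import numpy as np

def msa_penalty_terms(xs, theta, phi, fs, dfs):
    """Feasibility penalties (11)-(12) of Theorem 2 along the theta-trajectory.

    xs:  states [x_0, ..., x_T] computed under theta via Eq. (7)
    fs:  fs[t](x, th) -> batched layer map f_t
    dfs: dfs[t](x, th) -> batched Jacobians of f_t with respect to x
    Returns the sums in (11) and (12) up to the (unknown) constant C.
    """
    S = xs[0].shape[0]
    pen_f = pen_df = 0.0
    for t in range(len(theta)):
        pen_f += np.sum((fs[t](xs[t], phi[t]) - fs[t](xs[t], theta[t])) ** 2)
        # the Frobenius norm upper-bounds the induced 2-norm in (12)
        pen_df += np.sum((dfs[t](xs[t], phi[t]) - dfs[t](xs[t], theta[t])) ** 2)
    return pen_f / S, pen_df / S
```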
4. Neural Networks with Discrete Weights

We now turn to applications of the theory developed in the previous section on the MSA, a PMP-based numerical method for training deep neural networks. As discussed previously, the main strength of the PMP and MSA formalism is that it does not rely on gradient-descent type updates. This is particularly useful when one considers neural networks with (some) trainable parameters that can only take values in a discrete set: any small gradient update to such parameters will almost always be infeasible. In this section, we consider two such cases: binary networks, where weights are restricted to {−1, +1}, and ternary networks, where weights are selected from {−1, +1, 0}. These networks are potentially useful for low-memory devices, since storing the trained weights requires less memory. We now modify the MSA so that we can train these networks in a principled way.

4.1. Binary Networks

Binary neural networks are those with binary trainable layers, e.g. in the fully connected case,

\[ f_t(x, \theta) = \theta x, \tag{14} \]

where θ ∈ Θ_t = {−1, +1}^{d_t × d_{t+1}} is a binary matrix. A similar form of f_t holds for convolutional neural networks after reshaping, except that Θ_t is then a set of binary Toeplitz matrices. Hereafter, we consider the fully connected case for simplicity of exposition. It is also natural to set the regularization to 0, since there is in general no preference between +1 and −1. Thus, the Hamiltonian has the form

\[ H_t(x, p, \theta) = p \cdot \theta x. \]

Consequently, the Hamiltonian maximization step (9) has an explicit solution, given by

\[ \arg\max_{\theta \in \Theta_t} \sum_{s=1}^{S} H_t(x^{\theta^k}_{s,t}, p^{\theta^k}_{s,t+1}, \theta) = \operatorname{sign}(M^{\theta^k}_t), \]

where M^θ_t := Σ_{s=1}^S p^θ_{s,t+1} (x^θ_{s,t})ᵀ and the sign function is applied element-wise. If [M^θ_t]_{ij} = 0, then the arg-max is arbitrary in that entry. Using Theorem 2 with the form of f_t given by (14) and the fact that L_t ≡ 0, we get

\[ J(\phi) - J(\theta) \le -\sum_{t=0}^{T-1}\sum_{s=1}^{S}\Big[H_t(x^{\theta}_{s,t}, p^{\theta}_{s,t+1}, \phi_t) - H_t(x^{\theta}_{s,t}, p^{\theta}_{s,t+1}, \theta_t)\Big] + \frac{C}{S}\sum_{t=0}^{T-1}\sum_{s=1}^{S}\big(1 + \|x^{\theta}_{s,t}\|^2\big)\,\|\phi_t - \theta_t\|_F^2, \]

where we have used the inequality ‖·‖₂ ≤ ‖·‖_F. Assuming that ‖x^θ_{s,t}‖ is O(1), we may then decrease J by not only maximizing the Hamiltonian, but also penalizing the difference ‖φ_t − θ_t‖_F, i.e. for each k and t we set

\[ \theta^{k+1}_t = \arg\max_{\theta \in \Theta_t}\left[ \sum_{s=1}^{S} H_t(x^{\theta^k}_{s,t}, p^{\theta^k}_{s,t+1}, \theta) - \rho_{k,t}\,\|\theta - \theta^k_t\|_F^2 \right] \tag{15} \]

for some penalization parameters ρ_{k,t} > 0. This again has the explicit solution

\[ [\theta^{k+1}_t]_{ij} = \begin{cases} \operatorname{sign}([M^{\theta^k}_t]_{ij}) & \text{if } |[M^{\theta^k}_t]_{ij}| \ge 2\rho_{k,t}, \\ [\theta^k_t]_{ij} & \text{otherwise.} \end{cases} \tag{16} \]

Therefore, we simply replace the parameter update step in Alg. 1 with (16). Furthermore, to deal with mini-batches, we keep a moving average of M^{θᵏ}_t across different mini-batches and use the averaged value to update our parameters.
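In code, the update (16), including the mini-batch moving average just described, can be sketched as follows (our own vectorized illustration; the array names are assumptions):

```python
import numpy as np

def binary_update(theta_prev, M_avg, M_batch, rho, alpha):
    """Binary MSA update, Eq. (16), with a moving average of M_t.

    theta_prev: current binary weights, entries in {-1, +1}
    M_avg:      running average of M_t across mini-batches
    M_batch:    sum_s p_{s,t+1} x_{s,t}^T from the current mini-batch
    rho, alpha: penalty and averaging hyper-parameters rho_{k,t}, alpha_{k,t}
    """
    M_avg = alpha * M_avg + (1.0 - alpha) * M_batch
    flip = np.abs(M_avg) >= 2.0 * rho          # entries confident enough to move
    theta_new = np.where(flip, np.sign(M_avg), theta_prev)
    return theta_new, M_avg
```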
It is found empirically that this further stabilizes the algorithm. Note that the assumption that ‖x^θ_{s,t}‖ is O(1) can be enforced by normalization, e.g. batch-normalization (Ioffe & Szegedy, 2015). We summarize the algorithm in Alg. 2.

S 2009) and SVHN (Netzer et al., 2011) datasets. The network X θk θk θk structures are mostly identical to those considered in Cour- arg max Ht(xs,t, ps,t+1, θ) = sign(Mt ) θ∈Θt s=1 bariaux et al.(2015) for ease of comparison. Complete implementation and model details are found in Appendix D. The graphs of training/testing loss and error rates are θ PS θ θ T where Mt := s=1 ps,t+1(xs,t) . Note that the sign func- shown in Fig.1. We observe that our algorithm performs θ tion is applied element-wise. If [Mt ]ij = 0, then the arg- well in terms of an optimization algorithm, as measured max is arbitrary. Using Theorem2 with the form of ft given by the training loss and error rates. For the harder datasets An Optimal Control Approach to Deep Learning

The graphs of training/testing loss and error rates are shown in Fig. 1. We observe that our algorithm performs well as an optimization algorithm, as measured by the training loss and error rates. For the harder datasets (CIFAR-10 and SVHN), we have rapid convergence but worse test loss and error rates at the end, possibly due to overfitting. We note that in Courbariaux et al. (2015), many regularization strategies are employed, and we expect that similar techniques must be adopted to improve the testing performance; these issues are, however, outside the scope of the optimization framework of this paper. Note that we also compared with the results of BinaryConnect without regularization strategies such as stochastic binarization; the results are similar, in that our algorithm converges very fast with very low training losses, but sometimes overfits.

[Figure 1: training/testing loss (log scale) and error rate (%) vs. epoch on MNIST, CIFAR-10 and SVHN, for MSA and BinaryConnect.]
Figure 1. Comparison of binary MSA (Alg. 2) with BinaryConnect (Courbariaux et al., 2015) (with binary variables for inference). We observe that MSA has good convergence in terms of the training loss and error rates, showing that it is an efficient optimization algorithm. Note that to avoid broken lines, when the loss equals 0 exactly, we replace it by 1e-8 on the log scale. The test loss for the bigger datasets (CIFAR-10, SVHN) eventually becomes worse due to overfitting; hence, some regularization techniques are needed for applications which are prone to overfitting.

4.2. Ternary Networks

We now consider the case where the network weights are allowed to take values in {−1, +1, 0}; here, our goal is to explore the sparsification of the network. To this end, we take L_t(x, θ) = λ_t‖θ‖²_F for some parameter λ_t. Note that since the weights are restricted to the ternary set, all component-wise ℓ_p regularizations with p > 0 are identical. The higher the value of λ_t, the sparser the solution will be.

As in Sec. 4.1, we can write down the Hamiltonian for a fully connected ternary layer as

\[ H_t(x, p, \theta) = p \cdot \theta x - \tfrac{1}{S}\lambda_t\|\theta\|_F^2. \]

The derivation of the ternary algorithm then follows directly from that in Sec. 4.1, but with this new form of the Hamiltonian and with Θ_t = {−1, +1, 0}^{d_t × d_{t+1}}. Maximizing the augmented Hamiltonian (15) with H_t as defined above, we obtain the ternary update rule

\[ [\theta^{k+1}_t]_{ij} = \begin{cases} +1 & \text{if } [M^{\theta^k}_t]_{ij} \ge \rho_{k,t}(1 - 2[\theta^k_t]_{ij}) + \lambda_t, \\ -1 & \text{if } [M^{\theta^k}_t]_{ij} \le -\rho_{k,t}(1 + 2[\theta^k_t]_{ij}) - \lambda_t, \\ 0 & \text{otherwise.} \end{cases} \tag{17} \]

We replace the parameter update step in Alg. 2 by (17) to obtain the MSA algorithm for ternary networks. For completeness, we give the full ternary algorithm in Alg. 3.
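A sketch of the update (17) in code, sharing the moving-average convention of Alg. 2 (again our own illustration with assumed array names), reads:

```python
import numpy as np

def ternary_update(theta_prev, M_avg, M_batch, rho, alpha, lam):
    """Ternary MSA update, Eq. (17), with the moving average of Alg. 2.

    lam is the sparsity penalty lambda_t; larger values push more
    entries of theta towards 0. theta_prev has entries in {-1, 0, +1}.
    """
    M_avg = alpha * M_avg + (1.0 - alpha) * M_batch
    up = M_avg >= rho * (1.0 - 2.0 * theta_prev) + lam     # promote to +1
    down = M_avg <= -rho * (1.0 + 2.0 * theta_prev) - lam  # demote to -1
    theta_new = np.where(up, 1.0, np.where(down, -1.0, 0.0))
    return theta_new, M_avg
```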
We now test the ternary algorithm on the same benchmarks used in Sec. 4.1; the results are shown in Fig. 2. Observe that the performance on the training and testing datasets is similar to the binary case (Fig. 1), but the ternary networks achieve high degrees of sparsity in the weights, with only 0.5-2.5% of the trained weights being non-zero, depending on the dataset. This potentially offers significant memory savings compared to binary or full floating-point precision counterparts.

Algorithm 3 Ternary MSA
  Initialize: θ⁰, M̄⁰; hyper-parameters: ρ_{k,t}, α_{k,t}, λ_t;
  for k = 0 to #Iterations do
    x^{θᵏ}_{s,t+1} = f_t(x^{θᵏ}_{s,t}, θᵏ_t) ∀ s, t, with x^{θᵏ}_{s,0} = x_{s,0};
    p^{θᵏ}_{s,t} = ∇_x H_t(x^{θᵏ}_{s,t}, p^{θᵏ}_{s,t+1}, θᵏ_t) ∀ s, t, with p^{θᵏ}_{s,T} = −(1/S)∇Φ_s(x^{θᵏ}_{s,T});
    M̄^{k+1}_t = α_{k,t} M̄ᵏ_t + (1 − α_{k,t}) Σ_{s=1}^S p^{θᵏ}_{s,t+1}(x^{θᵏ}_{s,t})ᵀ;
    [θ^{k+1}_t]_{ij} = +1 if [M̄^{k+1}_t]_{ij} ≥ ρ_{k,t}(1 − 2[θᵏ_t]_{ij}) + λ_t; −1 if [M̄^{k+1}_t]_{ij} ≤ −ρ_{k,t}(1 + 2[θᵏ_t]_{ij}) − λ_t; 0 otherwise; ∀ t, i, j;
  end for

[Figure 2: error rate (%) and weight sparsity (%) vs. epoch on MNIST, CIFAR-10 and SVHN, for ternary MSA and BinaryConnect.]
Figure 2. Performance of the ternary MSA (Alg. 3) vs. BinaryConnect with a simple thresholding procedure described in Li et al. (2016). In the second column of plots, we show the sparsity of the networks (defined as the percentage of all weights that are non-zero) as training proceeds. Observe that the MSA algorithm finds solutions with error rates comparable to the binary case (whose final test error is plotted as a gray horizontal line) but very sparse. In comparison, the simple thresholding of BinaryConnect does not produce sparse solutions. In fact, the final sparsities for MSA are approximately: MNIST: <1.0%; CIFAR-10: 0.9%; SVHN: 2.4%. It is expected that sparser solutions can be found by adjusting the penalty parameters λ_t.

5. Discussion and Related Work

We begin with a discussion of the results presented thus far. We first introduced the viewpoint that deep learning can be regarded as a discrete-time optimal control problem. Consequently, an important result in optimal control theory, the Pontryagin's maximum principle, can be applied to give a set of necessary conditions for optimality. These are in general stronger conditions than the usual optimality conditions based on the vanishing of first-order partial derivatives. Moreover, they apply to broader contexts, such as problems with constraints on the trainable parameters or problems that are non-differentiable in the trainable parameters. However, we note that specific assumptions regarding the convexity of some sets must be satisfied. We showed that they are justified for conventional neural networks, but not necessarily for all neural networks (e.g. binary and ternary networks).

Next, based on the PMP, we introduced an iterative projection technique, the discrete method of successive approximations (MSA), to find an optimal solution of the learning problem. A rigorous error estimate (Theorem 2) is derived for the discrete MSA, which can be used both to understand its dynamics and to derive useful algorithms. This should be viewed as the main theoretical result of the present paper. Note that the usual back-propagation with gradient descent can be regarded as a simple modification of the MSA, if differentiability conditions are assumed (see Appendix C). Nevertheless, we note that Theorem 2 itself does not assume any regularity conditions with respect to the trainable parameters. Moreover, neither does it require the convexity conditions in Theorem 1; hence it applies to a wider range of neural networks, including those in Sec. 4. All results up to this point apply to general neural networks (assuming that the respective conditions are satisfied), and are not specific to the applications presented subsequently.

In the last part of this work, we apply our results to devise training algorithms for discrete-weight neural networks, i.e. those with trainable parameters that can only take values in a discrete set. Besides potential applications in model deployment on low-memory devices, the main reasons for choosing such applications are two-fold. First, gradient-descent updates are not applicable by themselves, because small updates to parameters are prohibited by the discrete equality constraint on the trainable parameters; our method based on the MSA is applicable, since it does not perform gradient-descent updates. Second, in such applications the potentially expensive Hamiltonian maximization steps in the MSA have explicit solutions. This makes the MSA an attractive optimization method for problems of this nature. In Sec. 4, we demonstrated the effectiveness of our methods on various benchmark datasets. Interestingly, the ternary network exhibits extremely sparse weights that perform almost as well as its binary counterpart (see Fig. 2). Also, the phenomena of overfitting in Fig. 1 and 2 are interesting, as overfitting is generally less common in stochastic gradient based optimization approaches. This seems to suggest that MSA based methods optimize neural networks in a rather different way.

Let us now put our work in the context of the existing literature. First, the optimal control approach we adopt is quite different from the prevailing viewpoint of nonlinear programming (Bertsekas, 1999; Bazaraa et al., 2013; Kuhn & Tucker, 2014) and the analysis of the derived gradient-based algorithms (Moulines & Bach, 2011; Shamir & Zhang, 2013; Bach & Moulines, 2013; Xiao & Zhang, 2014; Shalev-Shwartz & Zhang, 2014) for the training of deep neural networks. In particular, the PMP (Thm. 1) and the MSA error estimate (Thm. 2) do not assume differentiability and do not characterize optimality via gradients (or sub-gradients) with respect to the trainable parameters. In this sense, they constitute a stronger and more robust condition, albeit sometimes requiring different assumptions. The optimal control and dynamical systems viewpoint has been discussed in the context of deep learning in E (2017); Li et al. (2018), and dynamical-systems based discretization schemes have been introduced in Haber & Ruthotto (2017); Chang et al. (2017). Most of these works have their theoretical basis in continuous-time dynamical systems. In particular, Li et al. (2018) analyzed continuous-time analogues of neural networks in the optimal control framework and derived MSA-based algorithms in continuous time. In contrast, the present work presents a discrete-time formulation, which is natural in the usual context of deep learning. The discrete PMP turns out to be more subtle, as it requires additional assumptions of convexity of reachable sets (Thm. 1). Note also that, unlike the estimates derived in Li et al. (2018), Thm. 2 holds rigorously for discrete-time neural networks. The present method for stabilizing the MSA is also different from that in Li et al. (2018), where augmented Lagrangian type modifications are employed; the latter would not be effective here, because weights cannot be updated infinitesimally without violating the binary/ternary constraint. Moreover, the present methods that rely on explicit solutions of the Hamiltonian maximization are fast (comparable to SGD) on a wall-clock basis.
In the deep learning literature, the connection between optimal control and deep learning has been qualitatively discussed in LeCun (1988) and applied to the development of automatic differentiation and back-propagation (Bryson, 1975; Baydin et al., 2015). However, there are relatively few works relating optimal control algorithms to the training of neural networks beyond the classical gradient-descent with back-propagation. Optimal control based strategies for hyper-parameter tuning have been discussed in Li et al. (2017b).

In the continuous-time setting, the Pontryagin's maximum principle and the method of successive approximations have a long history, with a large body of relevant literature including, but not limited to, Boltyanskii et al. (1960); Pontryagin (1987); Bryson (1975); Bertsekas (1995); Athans & Falb (2013); Krylov & Chernousko (1962); Aleksandrov (1968); Krylov & Chernousko (1972); Chernousko & Lyubushin (1982); Lyubushin (1982). The discrete-time PMP has been studied in Halkin (1966); Holtzman (1966a); Holtzman & Halkin (1966); Holtzman (1966b); Canon et al. (1970), where Theorem 1 and its extensions are proved. To the best of our knowledge, the discrete-time MSA and its quantitative analysis have not been performed in either the deep learning or the optimal control literature.
Sec. 4 concerns the application of the MSA, in particular Thm. 2, to develop training algorithms for binary and ternary neural networks. There are a number of prior works exploring the training of similar neural networks, such as Courbariaux et al. (2015); Hubara et al. (2016); Rastegari et al. (2016); Tang et al. (2017); Li et al. (2016); Zhu et al. (2016). Theoretical analysis for the case of convex loss functions is carried out in Li et al. (2017a). Our point of numerical comparison for the binary MSA algorithm is Courbariaux et al. (2015), where the optimization of binary networks is based on shadow variables with full floating-point precision that are iteratively truncated to obtain gradients. We showed in Sec. 4.1 that the binary MSA is competitive as a training algorithm, but is in need of modifications to reduce overfitting for certain datasets. Training ternary networks has been discussed in Hwang & Fan (1967); Kim et al. (2014); Li et al. (2016); Zhu et al. (2016). The difference in our ternary formulation is that we explore the sparsification of networks using a regularization parameter. In this sense it is related to the compression of neural networks (e.g. Han et al. (2015)), but our approach trains a network that is naturally ternary, and compression is achieved during training by a regularization term. Generally, a contrasting aspect of our approach relative to the aforementioned literature is that the theory of optimal control, together with Theorem 2, provides a theoretical basis for the development of our algorithms. Nevertheless, further work is required to rigorously establish the convergence of these algorithms. We also mention a recent work (Yin et al., 2018) which analyzes quantized networks and develops algorithms based on relaxing the discrete-weight constraints into continuous regularizers. Lastly, there are also analyses of quantized networks from a statistical-mechanical viewpoint (Baldassi et al., 2015; 2016a;b; 2017).

6. Conclusion and Outlook

In this paper, we have introduced the discrete-time optimal control viewpoint of deep learning. In particular, the PMP and the MSA form an alternative theoretical and algorithmic basis for deep learning that may apply to broader contexts. As an application of our framework, we considered the training of binary and ternary neural networks, for which we developed effective algorithms based on optimal control.

There are certainly many avenues of future work. An interesting mathematical question is the applicability of the PMP to discrete-weight neural networks, which do not satisfy the convexity assumptions in Theorem 1. It would be desirable to find the conditions under which rigorous statements can be made. Another question is to establish the convergence of the algorithms presented.

References

Aleksandrov, V. V. On the accumulation of perturbations in the linear systems with two coordinates. Vestnik MGU, 3, 1968.

Athans, M. and Falb, P. L. Optimal control: an introduction to the theory and its applications. Courier Corporation, 2013.

Bach, F. and Moulines, E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pp. 773–781, 2013.

Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Physical Review Letters, 115(12):128101, 2015.

Baldassi, C., Borgs, C., Chayes, J. T., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48):E7655–E7662, 2016a.

Baldassi, C., Gerace, F., Lucibello, C., Saglietti, L., and Zecchina, R. Learning may need only a few bits of synaptic precision. Physical Review E, 93(5):052313, 2016b.

Baldassi, C., Gerace, F., Kappen, H. J., Lucibello, C., Saglietti, L., Tartaglione, E., and Zecchina, R. On the role of synaptic stochasticity in training low-precision neural networks. arXiv preprint arXiv:1710.09825, 2017.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767, 2015.

Bazaraa, M. S., Sherali, H. D., and Shetty, C. M. Nonlinear programming: theory and algorithms. John Wiley & Sons, 2013.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.

Bertsekas, D. P. Nonlinear programming. Athena Scientific, Belmont, 1999.

Boltyanskii, V. G., Gamkrelidze, R. V., and Pontryagin, L. S. The theory of optimal processes. I. The maximum principle. Technical report, TRW Space Technology Labs, Los Angeles, California, 1960.

Bryson, A. E. Applied optimal control: optimization, estimation and control. CRC Press, 1975.

Canon, M. D., Cullum Jr, C. D., and Polak, E. Theory of optimal control and mathematical programming. McGraw-Hill Book Company, 1970.

Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.

Chernousko, F. L. and Lyubushin, A. A. Method of successive approximations for solution of optimal control problems. Optimal Control Applications and Methods, 3(2):101–114, 1982.

Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

E, W. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. arXiv preprint arXiv:1705.03341, 2017.

Halkin, H. A maximum principle of the Pontryagin type for systems described by nonlinear difference equations. SIAM Journal on Control, 4(1):90–111, 1966.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Holtzman, J. Convexity and the maximum principle for discrete systems. IEEE Transactions on Automatic Control, 11(1):30–35, 1966a.

Holtzman, J. On the maximum principle for nonlinear discrete-time systems. IEEE Transactions on Automatic Control, 11(2):273–274, 1966b.

Holtzman, J. M. and Halkin, H. Directional convexity and the maximum principle for discrete systems. SIAM Journal on Control, 4(2):263–275, 1966.

Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107–4115, 2016.
Hwang, C. and Fan, L. A discrete version of Pontryagin's maximum principle. Operations Research, 15(1):139–146, 1967.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Kim, J., Hwang, K., and Sung, W. X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 7510–7514. IEEE, 2014.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Krylov, I. A. and Chernousko, F. L. On the method of successive approximations for solution of optimal control problems. J. Comp. Mathem. and Mathematical Physics, 2(6), 1962.

Krylov, I. A. and Chernousko, F. L. An algorithm for the method of successive approximations in optimal control problems. USSR Computational Mathematics and Mathematical Physics, 12(1):15–38, 1972.

Kuhn, H. W. and Tucker, A. W. Nonlinear programming. In Traces and Emergence of Nonlinear Programming, pp. 247–258. Springer, 2014.

LeCun, Y. A theoretical framework for back-propagation. In The Connectionist Models Summer School, volume 1, pp. 21–28, 1988.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Li, F., Zhang, B., and Liu, B. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

Li, H., De, S., Xu, Z., Studer, C., Samet, H., and Goldstein, T. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, pp. 5813–5823, 2017a.

Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110, 2017b.

Li, Q., Chen, L., Tai, C., and E, W. Maximum principle based algorithms for deep learning. Journal of Machine Learning Research, 18:1–29, 2018.

Lyubushin, A. A. Modifications of the method of successive approximations for solving optimal control problems. USSR Computational Mathematics and Mathematical Physics, 22(1):29–34, 1982.

Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 5, 2011.

Ogata, K. Discrete-time control systems, volume 2. Prentice Hall, Englewood Cliffs, NJ, 1995.

Pontryagin, L. S. Mathematical theory of optimal processes. CRC Press, 1987.

Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Shalev-Shwartz, S. and Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pp. 1–41, 2014.

Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML (1), pp. 71–79, 2013.

Tang, W., Hua, G., and Wang, L. How to train a compact binary neural network with high accuracy? In AAAI, pp. 2625–2631, 2017.

Warga, J. Relaxed variational problems. Journal of Mathematical Analysis and Applications, 4(1):111–128, 1962.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., and Xin, J. BinaryRelax: A relaxation approach for training deep neural networks with quantized weights. arXiv preprint arXiv:1801.06313, 2018.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.