arXiv:2102.13566v1 [cs.LG] 26 Feb 2021

SPARSE APPROXIMATION IN LEARNING VIA NEURAL ODES

CARLOS ESTEVE YAGÜE* AND BORJAN GESHKOVSKI*

Abstract. We consider the continuous-time, neural ordinary differential equation (neural ODE) perspective of deep supervised learning, and study the impact of the final time horizon $T$ in training. We focus on a cost consisting of an integral of the empirical risk over the time interval, and $L^1$–parameter regularization. Under homogeneity assumptions on the dynamics (typical for ReLU activations), we prove that any global minimizer is sparse, in the sense that there exists a positive stopping time $T^*$ beyond which the optimal parameters vanish. Moreover, under appropriate interpolation assumptions on the neural ODE, we provide quantitative estimates of the stopping time $T^*$, and of the training error of the trajectories at the stopping time. The latter stipulates a quantitative approximation property of neural ODE flows with sparse parameters. In practical terms, a shorter time-horizon in the training problem can be interpreted as considering a shallower residual neural network (ResNet), and since the optimal parameters are concentrated over a shorter time horizon, such a consideration may lower the computational cost of training without discarding relevant information.

Contents
1. Introduction
2. Preliminary lemmas
3. Proof of Theorem 1.1
4. Asymptotic interpolation
5. Concluding remarks
References

Keywords. Deep Learning; Neural ODEs; Supervised Learning; Sparsity; Optimal control; Nonlinear systems.

AMS Subject Classification. 49J15; 49M15; 49J20; 49K20; 93C20; 49N05.

1. Introduction

Sparsity is a highly desirable property in many machine learning and optimization tasks due to the inherent reduction of computational complexity.
When induced by $\ell^1$–regularization for instance, it has been used extensively for simplifying machine learning tasks by selecting a strict subset of the available features to be used in an automatized manner. An illustrative example is the well-known Lasso (least absolute shrinkage and selection operator, [Santosa and Symes, 1986; Tibshirani, 1996]), which consists in minimizing a least squares cost function and an $\ell^1$–penalty for an affine parametric model, and enforces a subset of the trainable parameters to become zero. As a consequence, the associated features may be pruned. Following this line of reasoning, in this work, we study supervised learning problems viewed from a continuous-time, neural ODE perspective, and we demonstrate the appearance of sparsity patterns for $L^1$–regularized minimization problems.

Date: March 1, 2021. *Equal contribution.

1.1. Background. We recall that supervised learning addresses the problem of predicting from data, which consists in approximating an unknown function $f : \mathcal{X} \to \mathcal{Y}$ from $N$ known and possibly noisy samples $\{\vec{x}_i, \vec{y}_i = f(\vec{x}_i)\}_{i=1}^N$. Depending on the nature of the space of labels $\mathcal{Y}$, one distinguishes two types of supervised learning tasks, namely that of classification (labels take values in a finite set of $m$ classes, e.g. $\mathcal{Y} = \{1, \ldots, m\}$) and regression (labels take continuous values in $\mathcal{Y} \subset \mathbb{R}^m$). Heuristically, supervised learning consists in constructing a map
\[
f_{\mathrm{approx}} : \mathcal{X} \to \mathcal{P}(\mathcal{Y}),
\]
which, desirably, is such that for any $x \in \mathcal{X}$ and for any Borel measurable $A \subset \mathcal{Y}$, $f_{\mathrm{approx}}(x)(A) \simeq 1$ whenever $f(x) \in A$, and $f_{\mathrm{approx}}(x)(A) \simeq 0$ whenever $f(x) \notin A$; here, $\mathcal{P}(\mathcal{Y})$ denotes the space of probability measures on $\mathcal{Y}$. In other words, one looks for a map $f_{\mathrm{approx}}$ which approximates the map $x \mapsto \delta_{f(x)}$, where $\delta_z$ stands for the Dirac measure centered at $z$.
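For a finite label set $\mathcal{Y} = \{1, \ldots, m\}$, the heuristic above can be made concrete: a probability measure on $\mathcal{Y}$ is a vector of $m$ non-negative weights summing to one, and $\delta_y$ is the corresponding one-hot vector. The following is a minimal sketch in plain Python, not taken from the paper; the scores, the helper names `softmax` and `measure_of`, and the choice of a softmax to produce the measure are illustrative assumptions.

```python
import math

def softmax(scores):
    """Map real scores to a probability vector on the labels {1, ..., m}."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def measure_of(prob, subset):
    """Evaluate the measure `prob` on a subset A of the labels {1, ..., m}."""
    return sum(prob[label - 1] for label in subset)

# Hypothetical scores for one input x whose true label is f(x) = 2, with m = 3.
scores = [0.5, 6.0, -1.0]
f_approx_x = softmax(scores)

mass_on_A = measure_of(f_approx_x, {2, 3})  # f(x) lies in A: close to 1
mass_off_A = measure_of(f_approx_x, {1})    # f(x) lies outside A: close to 0
```

Here `f_approx_x` plays the role of $f_{\mathrm{approx}}(x)$: it puts almost all of its mass on the true label, so it is close to the Dirac measure $\delta_{f(x)}$ in the sense described above.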
The map $f_{\mathrm{approx}}$ is often chosen from a class of parametric functions, and, as one only has $N$ samples of $f$, the parameters are tuned in order to fit $f_{\mathrm{approx}}$ to these data by minimizing a specific loss functional. Deep neural networks constitute a popular method for constructing $f_{\mathrm{approx}}$: they are parametrized computational architectures which propagate each individual sample of the input data $\{\vec{x}_i\}_{i=1}^N \in \mathbb{R}^{d \times N}$ across a sequence of affine parametric maps and simple nonlinearities. The so-called residual neural networks (ResNets, [He et al., 2016]) may, in the simplest case, be cast as schemes of the mould
\[
\begin{cases}
x_i^{k+1} = x_i^k + \sigma\big(w^k x_i^k + b^k\big) & \text{for } k \in \{0, \ldots, N_{\mathrm{layers}} - 1\}, \\
x_i^0 = \vec{x}_i \in \mathbb{R}^d
\end{cases}
\tag{1.1}
\]
for all $i \in \{1, \ldots, N\} := [N]$. The unknown states are $x_i^k \in \mathbb{R}^d$ for any $i \in [N]$, $\sigma$ is an explicit scalar, Lipschitz continuous nonlinear function defined component-wise in (1.1), $\{w^k, b^k\}_{k=0}^{N_{\mathrm{layers}} - 1}$ are optimizable parameters (controls) with $w^k \in \mathbb{R}^{d \times d}$ and $b^k \in \mathbb{R}^d$, and $N_{\mathrm{layers}} > 1$ designates the number of layers, referred to as the depth.

Due to the inherent dynamical systems nature of ResNets, several recent works have aimed at studying an associated continuous-time formulation in some detail, a trend started with the works [E, 2017; Haber and Ruthotto, 2017]. This perspective is motivated by the simple observation that for any $i \in [N]$ and for $T > 0$, (1.1) is roughly the forward Euler scheme for the neural ordinary differential equation (neural ODE)
\[
\begin{cases}
\dot{x}_i(t) = \sigma\big(w(t) x_i(t) + b(t)\big) & \text{for } t \in (0, T), \\
x_i(0) = \vec{x}_i \in \mathbb{R}^d.
\end{cases}
\tag{1.2}
\]
We shall focus our interest on parametrizing $f_{\mathrm{approx}}$ by the flows of neural ODEs such as (1.2). This may be done by setting $f_{\mathrm{approx}} : x \mapsto \mu(x(T))$, where $x(T)$ solves (1.2) with $x(0) = x$, and $\mu : \mathbb{R}^d \to \mathcal{P}(\mathcal{Y})$ is chosen appropriately. In practice, the time-dependent parameters $[w, b]$ are found by solving the regularized empirical risk minimization problem
\[
\min_{[w, b]} \; \underbrace{\frac{1}{N} \sum_{i=1}^N \mathrm{loss}\big(P x_i(T), \vec{y}_i\big)}_{:= E(x(T))} + \big\| [w, b] \big\|^p_{L^p(0, T; \mathbb{R}^{d_u})},
\tag{1.3}
\]
where $p \in \{1, 2\}$, $P : \mathbb{R}^d \to \mathbb{R}^m$ is assumed to be a given affine map¹, and $\mathrm{loss}(\cdot, \cdot) : \mathbb{R}^m \times \mathcal{Y} \to \mathbb{R}_+$ is such that $x \mapsto \mathrm{loss}(x, y)$ is continuous for all $y \in \mathcal{Y}$, $\mathrm{loss}(x, y) \neq 0$ whenever $\mu(x) \neq \delta_y$, and $\mathrm{loss}(x, y) \to 0$ when $\mu(x) \to \delta_y$ in an appropriate sense of measures (e.g., for the Wasserstein distance). Common examples of loss functions include the cross-entropy loss for classification tasks
\[
\mathrm{loss}\big(Px, \vec{y}\big) := -\log\left( \frac{e^{(Px)_{\vec{y}}}}{\sum_{j=1}^m e^{(Px)_j}} \right),
\tag{1.4}
\]
where $Px \in \mathbb{R}^m$ and $\vec{y} \in [m]$, in which case $\mu := \mathrm{softmax} \circ P$, or the mean squared error (MSE) loss for regression tasks
\[
\mathrm{loss}\big(Px, \vec{y}\big) := \big\| Px - \vec{y} \big\|^2_{\ell^2},
\]
where now $\vec{y} \in \mathcal{Y} \subset \mathbb{R}^m$, in which case $\mu(x) := \delta_{Px}$.

Note that in (1.1) the time-step $h = \frac{T}{N_{\mathrm{layers}}}$ is fixed (equal to 1), and each time-instance of a discretization to (1.2) would represent a different layer of the derived neural network (1.1). We therefore see that when the time-step is fixed, the time horizon $T$ in (1.2) may serve as an indicator of the number of layers $N_{\mathrm{layers}}$ in the discrete-time context. Thus, a good a priori knowledge of the dynamics of the learning problem over longer time horizons is desirable in view of discovering approximation and generalization properties of the trained neural ODE flow. This perspective has been taken in [Esteve et al., 2020a] for $L^2$–regularized supervised learning problems. Herein, we complete this study with new results and insights for $L^1$–regularized learning problems.

1.2. Problem setting. We assume we are given a training dataset $\{\vec{x}_i, \vec{y}_i\}_{i=1}^N$ where $\vec{x}_i \in \mathcal{X} \subset \mathbb{R}^d$ and $\vec{y}_i \in \mathcal{Y}$.
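Before the stacked formulation is introduced, the correspondence between the ResNet (1.1), the forward Euler scheme for (1.2), and the cost (1.3) can be illustrated numerically. The following is a minimal sketch in plain Python under assumptions that are not the paper's: the activation is ReLU, $P$ is taken linear rather than affine, the $L^1(0, T)$ norm is approximated by a Riemann sum with the entrywise $\ell^1$ norm on $[w, b]$, and all function names and numerical values are hypothetical.

```python
import math

def relu(v):
    """Component-wise ReLU, a 1-homogeneous activation."""
    return [max(c, 0.0) for c in v]

def matvec(w, x):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * c for a, c in zip(row, x)) for row in w]

def euler_flow(x0, ws, bs, h=1.0):
    """Forward Euler for (1.2); with h = 1 each step is one ResNet layer of (1.1)."""
    x = list(x0)
    for w, b in zip(ws, bs):
        pre = [a + c for a, c in zip(matvec(w, x), b)]
        x = [xc + h * sc for xc, sc in zip(x, relu(pre))]
    return x

def cross_entropy(z, label):
    """Cross-entropy (1.4) for scores z = P x and a label in {1, ..., m}."""
    shift = max(z)  # log-sum-exp shift for numerical stability
    log_sum = shift + math.log(sum(math.exp(c - shift) for c in z))
    return log_sum - z[label - 1]

def l1_penalty(ws, bs, h=1.0):
    """Riemann sum for the L^1(0, T) norm of [w, b], using the entrywise l^1 norm."""
    return sum(
        h * (sum(abs(c) for row in w for c in row) + sum(abs(c) for c in b))
        for w, b in zip(ws, bs)
    )

def regularized_cost(xs, ys, ws, bs, P, h=1.0):
    """Discrete analogue of (1.3) with p = 1 and a linear (not affine) P."""
    risk = sum(
        cross_entropy(matvec(P, euler_flow(x, ws, bs, h)), y)
        for x, y in zip(xs, ys)
    ) / len(xs)
    return risk + l1_penalty(ws, bs, h)

# Hypothetical data: d = 2, two layers, the second with vanishing parameters.
ws = [[[1.0, 0.0], [0.0, -1.0]], [[0.0, 0.0], [0.0, 0.0]]]
bs = [[0.0, 0.5], [0.0, 0.0]]
P = [[1.0, 0.0], [0.0, 1.0]]
x_T = euler_flow([1.0, 2.0], ws, bs)
cost = regularized_cost([[1.0, 2.0]], [1], ws, bs, P)
```

In this toy instance the second layer has zero parameters and leaves the state unchanged, since ReLU(0) = 0. Dropping that layer therefore changes neither `x_T` nor the penalty, which is the discrete counterpart of parameters that vanish on part of the time interval.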
We henceforth set $d_x := d \times N$, and consider stacked neural ODEs of the form
\[
\begin{cases}
\dot{x}(t) = f(x(t), u(t)) & \text{for } t \in (0, T), \\
x(0) = x^0 \in \mathbb{R}^{d_x},
\end{cases}
\tag{1.5}
\]
where $T > 0$ and $x^0 = [\vec{x}_1, \ldots, \vec{x}_N] \in \mathbb{R}^{d_x}$. The nonlinearity $f : \mathbb{R}^{d_x} \times \mathbb{R}^{d_u} \to \mathbb{R}^{d_x}$ may take the form
\[
f(x, u) = \sigma\left( \begin{bmatrix} w & & \\ & \ddots & \\ & & w \end{bmatrix} x + \begin{bmatrix} b \\ \vdots \\ b \end{bmatrix} \right)
\tag{1.6}
\]
for $x \in \mathbb{R}^{d_x}$ and $u = [w, b] \in \mathbb{R}^{d_u}$ with $d_u := d^2 + d$, and $\sigma \in \mathrm{Lip}(\mathbb{R})$ is defined component-wise so that each component of $f$ coincides with the canonical neural ODE given in (1.2). Permutations may also be considered, e.g.
\[
f(x, u) = \begin{bmatrix} w & & \\ & \ddots & \\ & & w \end{bmatrix} \sigma(x) + \begin{bmatrix} b \\ \vdots \\ b \end{bmatrix}.
\tag{1.7}
\]

¹ In practice, $P$ is either part of the trainable parameters, or its coefficients may be chosen at random. Whilst we fix $P$ for technical purposes, numerical experiments indicate that the results presented in what follows persist when $P$ is optimized as well.

The key assumption we make in what follows is that $f$ is $1$–homogeneous with respect to the parameters $u$, i.e.
\[
f(x, \alpha u) = \alpha f(x, u) \quad \text{for all } (x, u) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_u} \text{ and } \alpha > 0.
\tag{1.8}
\]
This is clearly the case for $f$ parametrized as in (1.7), whilst for (1.6), we shall moreover assume that $\sigma$ is $1$–homogeneous; a canonical example of such an activation function is the ReLU $\sigma(x) = \max\{x, 0\}$.

Remark 1. Since $\sigma \in \mathrm{Lip}(\mathbb{R})$, for any $x^0 \in \mathbb{R}^{d_x}$ and $u \in L^1(0, T; \mathbb{R}^{d_u})$, (1.5) with $f$ as above admits a unique solution $x \in C^0([0, T]; \mathbb{R}^{d_x})$.
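The homogeneity assumption (1.8) can be checked numerically for the two parametrizations (1.6) and (1.7). The following is a minimal sketch in plain Python, not taken from the paper: the stacked state $x \in \mathbb{R}^{d_x}$ is stored as a list of $N$ blocks in $\mathbb{R}^d$, and the function names, the tanh comparison, and all numerical values are illustrative assumptions.

```python
import math

def relu(v):
    """Component-wise ReLU, which is 1-homogeneous."""
    return [max(c, 0.0) for c in v]

def tanh(v):
    """Component-wise tanh, Lipschitz but not 1-homogeneous."""
    return [math.tanh(c) for c in v]

def matvec(w, x):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * c for a, c in zip(row, x)) for row in w]

def f_outer(blocks, w, b, sigma):
    """(1.6): every block x_i in R^d is mapped to sigma(w x_i + b)."""
    return [sigma([a + c for a, c in zip(matvec(w, xi), b)]) for xi in blocks]

def f_inner(blocks, w, b, sigma):
    """(1.7): every block x_i in R^d is mapped to w sigma(x_i) + b."""
    return [[a + c for a, c in zip(matvec(w, sigma(xi)), b)] for xi in blocks]

def homogeneity_gap(f, blocks, w, b, sigma, alpha):
    """Largest entry of |f(x, alpha u) - alpha f(x, u)| for u = [w, b]."""
    w_scaled = [[alpha * c for c in row] for row in w]
    b_scaled = [alpha * c for c in b]
    lhs = f(blocks, w_scaled, b_scaled, sigma)
    rhs = f(blocks, w, b, sigma)
    return max(
        abs(l - alpha * r)
        for lhs_block, rhs_block in zip(lhs, rhs)
        for l, r in zip(lhs_block, rhs_block)
    )

# Hypothetical values: N = 2 samples in dimension d = 2, so d_x = 4 and d_u = 6.
blocks = [[1.0, -2.0], [0.5, 3.0]]
w = [[1.0, -0.5], [2.0, 0.25]]
b = [0.5, -1.0]
alpha = 3.0

gap_relu_outer = homogeneity_gap(f_outer, blocks, w, b, relu, alpha)
gap_tanh_outer = homogeneity_gap(f_outer, blocks, w, b, tanh, alpha)
gap_tanh_inner = homogeneity_gap(f_inner, blocks, w, b, tanh, alpha)
```

Up to rounding, the gap vanishes for (1.6) with ReLU and for (1.7) with either activation, because (1.7) is linear in $u$. For (1.6) with tanh the gap is clearly nonzero, which is why the additional assumption on $\sigma$ is needed in that case.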
