arXiv:2002.02208v1
GLOBAL CONVERGENCE OF FRANK WOLFE ON ONE HIDDEN LAYER NETWORKS

ALEXANDRE D’ASPREMONT AND MERT PILANCI

ABSTRACT. We derive global convergence bounds for the Frank Wolfe algorithm when training one hidden layer neural networks. When using the ReLU activation function, and under tractable preconditioning assumptions on the sample data set, the linear minimization oracle used to incrementally form the solution can be solved explicitly as a second order cone program. The classical Frank Wolfe algorithm then converges with rate $O(1/T)$ where $T$ is both the number of neurons and the number of calls to the oracle.

1. INTRODUCTION

We focus on the problem of training one hidden layer neural networks using incremental algorithms, and in particular the Frank-Wolfe method. While they are of course more toy models than effective classification tools, one hidden layer neural networks have been heavily used to study the complexity of the neural network training problem in a variety of regimes and algorithmic settings. Incremental methods in particular are classical tools for training one hidden layer networks, starting at least with the results in [Breiman, 1993] and [Lee et al., 1996].

The Frank-Wolfe algorithm [Frank and Wolfe, 1956], also known as conditional gradients [Levitin and Polyak, 1966], is one of the most well known methods of this type, and is used in constrained minimization problems where projection on the feasible set is hard, but solving a linear minimization oracle (LMO) over this set is tractable. This method has a long list of applications in machine learning, with recent examples including [Joulin et al., 2014, Shah et al., 2015, Osokin et al., 2016, Locatello et al., 2017a, Freund et al., 2017, Locatello et al., 2017b, Miech et al., 2017].
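To make the projection-free idea concrete, here is a minimal sketch, not taken from the paper, of Frank-Wolfe on a finite-dimensional stand-in problem: least squares over an $\ell_1$ ball, where the linear minimization oracle simply returns a signed vertex of the ball. The function names `lmo_l1_ball` and `frank_wolfe_l1`, the open-loop step size $2/(t+2)$, and the toy data are assumptions made for illustration only.

```python
from typing import List, Tuple


def lmo_l1_ball(grad: List[float], delta: float) -> List[float]:
    """Return a minimizer of <grad, s> over the l1 ball of radius delta.

    A minimizer is always attained at a vertex: a single coordinate set to
    +/- delta, with sign opposite to the largest-magnitude gradient entry.
    """
    j = max(range(len(grad)), key=lambda k: abs(grad[k]))
    s = [0.0] * len(grad)
    if grad[j] != 0.0:
        s[j] = -delta if grad[j] > 0 else delta
    return s


def frank_wolfe_l1(
    A: List[List[float]], y: List[float], delta: float, iters: int
) -> Tuple[List[float], float]:
    """Minimize ||A x - y||^2 subject to ||x||_1 <= delta (illustrative).

    Returns the final iterate and the Frank-Wolfe gap measured at the last
    point where the gradient was evaluated.
    """
    n, d = len(A), len(A[0])
    x = [0.0] * d
    gap = float("inf")
    for t in range(iters):
        # Residual r = A x - y and gradient 2 A^T r of the squared loss.
        r = [sum(A[i][k] * x[k] for k in range(d)) - y[i] for i in range(n)]
        grad = [2.0 * sum(A[i][k] * r[i] for i in range(n)) for k in range(d)]
        # One oracle call: no projection onto the ball is ever needed.
        s = lmo_l1_ball(grad, delta)
        # <grad, x - s> upper-bounds the suboptimality of x by convexity.
        gap = sum(grad[k] * (x[k] - s[k]) for k in range(d))
        # Convex combination keeps the iterate inside the ball.
        lam = 2.0 / (t + 2.0)
        x = [(1.0 - lam) * x[k] + lam * s[k] for k in range(d)]
    return x, gap


if __name__ == "__main__":
    A_toy = [[1.0, 0.0], [0.0, 1.0]]
    y_toy = [1.0, 0.0]
    x_hat, gap_hat = frank_wolfe_l1(A_toy, y_toy, delta=0.5, iters=5)
    print(x_hat, gap_hat)
```

Each iteration adds at most one new nonzero coordinate to the iterate, which is the finite-dimensional analogue of adding atoms one oracle call at a time.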
Several other approaches have recently been used to produce convergence results on one hidden layer training problems, including gradient descent schemes [Vempala and Wilmes, 2018] and discretized gradient flows [Chizat and Bach, 2018, Chizat, 2019]. Here, in the spirit of [Bengio et al., 2006, Rosset et al., 2007, Bach, 2017], we focus on training infinitely wide neural networks which are asymptotically convex. Following [Bach, 2017], we use an $\ell_1$-like penalty to let the algorithm decide on the location of the neurons via the solutions of the linear minimization oracle. In this setting, each iteration of Frank Wolfe, i.e. each solution of the LMO, adds a fixed number of neurons to the network.

Our contribution is twofold. While the one hidden layer training problem [Song et al., 2017] and the linear minimization oracle problem [Guruswami and Raghavendra, 2009] are both intractable in general, we first show, using recent results by [Ergen and Pilanci, 2019], that the LMO can be solved efficiently under overparameterization and mild preconditioning assumptions. Second, we discuss convexity properties of one hidden layer neural networks in a broader setting, showing in particular that the overparameterized problem has a convex epigraph and no duality gap. Using results derived from the Shapley-Folkman theorem, we derive non-asymptotic convergence bounds on this duality gap when converging towards the mean field limit. Overall, these results seem to further confirm recent empirical findings in e.g. [Zhang et al., 2016] that overparameterized networks, in the “modern regime” described in e.g. [Belkin et al., 2019], are inherently easier to train.

Date: February 7, 2020. arXiv:2002.02208v1 [math.OC] 6 Feb 2020.

2. FRANK WOLFE ON ONE HIDDEN LAYER NETWORKS

We are given $n$ real multivariate data samples $A \in \mathbb{R}^{n \times d}$, with rows $a_i$, and a label vector $y \in \mathbb{R}^n$, together with activation functions $\sigma_\theta : \mathbb{R}^d \to \mathbb{R}$, parameterized by $\theta \in \mathcal{V}$ where $\mathcal{V}$ is a compact topological vector space. For a continuous function $h(\theta) : \mathcal{V} \to \mathbb{R}$, we write $\int h(\theta)\,d\mu(\theta)$ the action of the Radon measure $\mu$ on the function $h$. As in [Rosset et al., 2007, Bach, 2017] we focus on the following problem
\[
\begin{array}{ll}
\mbox{minimize} & \displaystyle\sum_{i=1}^n \left( \int \sigma_\theta(a_i)\,d\mu(\theta) - y_i \right)^2 \\
\mbox{subject to} & \displaystyle\gamma_1\left( \int \sigma_\theta(\cdot)\,d\mu(\theta) \right) \leq \delta
\end{array}
\tag{1}
\]
in the variable $\mu$, a Radon measure on $\mathcal{V}$, with parameter $\delta > 0$. Here $\gamma_1$ is the variation norm, a natural extension of the $\ell_1$ norm to the infinite dimensional setting, which we describe in detail below.

2.1. Variation Norm. For a Radon measure $\mu$, we write
\[
|\mu|(\mathcal{V}) \triangleq \sup_{\substack{h(\theta) : \mathcal{V} \to [-1,1], \\ h \ \mathrm{continuous}}} \int h(\theta)\,d\mu(\theta)
\]
its total variation. When $\mu$ has a density, with $d\mu(\theta) = p(\theta)\,d\tau(\theta)$, then $|\mu|(\mathcal{V})$ is simply equal to the $L_1$ norm of $p$.

As in [Bach, 2017], we now write $\mathcal{F}_1$ the space of functions $f(x) : \mathbb{R}^d \to \mathbb{R}$ such that
\[
f(x) = \int \sigma_\theta(x)\,d\mu(\theta)
\]
where $\mu$ is a Radon measure on $\mathcal{V}$ with finite total variation. The infimum of the total variation of $\mu$ over all representations of $f$, written
\[
\gamma_1(f) \triangleq \inf \left\{ |\mu|(\mathcal{V}) : f(x) = \int \sigma_\theta(x)\,d\mu(\theta) \right\},
\]
is a norm called the variation norm of $f$ (see e.g. [Kůrková and Sanguineti, 2001], or the discussion on atomic norms in [Chandrasekaran et al., 2012]). Note that when $f$ is decomposable on a finite number of basis functions, with
\[
f(x) = \sum_{i=1}^k \eta_i \sigma_{\theta_i}(x),
\]
we have
\[
\mu(\theta) = \sum_{i=1}^k \eta_i \delta_{\{\theta = \theta_i\}}
\]
and the total variation of $\mu$ is simply $\|\eta\|_1$, the $\ell_1$ norm of $\eta$. In this context, we can rewrite problem (1) as an equivalent problem
\[
\begin{array}{ll}
\mbox{minimize} & \displaystyle\sum_{i=1}^n (f(a_i) - y_i)^2 \\
\mbox{subject to} & \gamma_1(f) \leq \delta
\end{array}
\tag{2}
\]
which is a convex problem in the variable $f \in \mathcal{F}_1$.

2.2. Incremental Algorithm: Frank Wolfe. Problem (2) is an infinite dimensional problem, but it can be solved efficiently using the Frank Wolfe method (aka conditional gradients) provided we can solve a linear minimization oracle over a $\gamma_1$ ball. The Frank Wolfe algorithm solves problem (2) by invoking a linear minimization oracle involving the gradient at each iteration, then taking convex combinations of iterates.

Gradients.
The objective of problem (2), namely
\[
L(f) \triangleq \sum_{i=1}^n \left( \int \sigma_\theta(a_i)\,d\mu(\theta) - y_i \right)^2 = \sum_{i=1}^n (f(a_i) - y_i)^2,
\]
is a smooth convex functional, whose gradient is given by
\[
L'(f)(x) = \sum_{i=1}^n g_i \delta_{\{x = a_i\}}
\]
where
\[
g_i = 2 \left( \int \sigma_\theta(a_i)\,d\mu(\theta) - y_i \right), \quad i = 1,\ldots,n,
\tag{3}
\]
if we write
\[
f(a_i) = \int \sigma_\theta(a_i)\,d\mu(\theta), \quad i = 1,\ldots,n,
\]
for a given Radon measure $\mu$.

Linear Minimization Oracle. Given a gradient vector $g \in \mathbb{R}^n$ as in (3), because the input space is finite, each iteration of the Frank Wolfe algorithm seeks to solve the following linear minimization oracle
\[
\begin{array}{ll}
\mbox{minimize} & \displaystyle\sum_{i=1}^n g_i f(a_i) \\
\mbox{subject to} & \gamma_1(f) \leq \delta
\end{array}
\]
in the variable $f \in \mathcal{F}_1$. By definition of $\mathcal{F}_1$, this is equivalent to solving
\[
\begin{array}{ll}
\mbox{minimize} & \displaystyle\sum_{i=1}^n g_i \int \sigma_\theta(a_i)\,d\mu(\theta) \\
\mbox{subject to} & \displaystyle\gamma_1\left( \int \sigma_\theta(\cdot)\,d\mu(\theta) \right) \leq \delta
\end{array}
\tag{LMO}
\]
in the variable $\mu$, a Radon measure on $\mathcal{V}$, with parameter $\delta > 0$. We have, switching sums,
\[
\begin{aligned}
\inf_{\gamma_1\left(\int \sigma_\theta(\cdot)\,d\mu(\theta)\right) \leq 1} \ \sum_{i=1}^n g_i \int \sigma_\theta(a_i)\,d\mu(\theta)
&= \inf_{\gamma_1\left(\int \sigma_\theta(\cdot)\,d\mu(\theta)\right) \leq 1} \ \int \left( \sum_{i=1}^n g_i \sigma_\theta(a_i) \right) d\mu(\theta) \\
&\geq - \max_{\theta \in \mathcal{V}} \left| \sum_{i=1}^n g_i \sigma_\theta(a_i) \right|,
\end{aligned}
\]
with equality if and only if $\mu = \mu_- - \mu_+$ where both $\mu_+$ and $\mu_-$ are nonnegative measures supported on the set of maximizers of
\[
\max_{\theta \in \mathcal{V}} \left| \sum_{i=1}^n g_i \sigma_\theta(a_i) \right|
\tag{4}
\]
with the value inside the absolute value positive for $\mu_+$ (respectively negative for $\mu_-$). This means that the key to solving (LMO) is solving problem (4). We will discuss how to solve (4) for specific activation functions in Section 2.3. We first describe the overall structure of the Frank Wolfe algorithm for solving (2) (hence (1)).

Frank Wolfe Algorithm. Given a linear minimization oracle, the Frank Wolfe algorithm (aka conditional gradient method, or Fedorov’s algorithm) is detailed in Algorithm 1 and, calling $L^\star$ the optimum value of problem (2), we have the following convergence bound.

Algorithm 1 Frank-Wolfe Algorithm
Input: A target precision $\varepsilon > 0$
1: Set $t := 1$, $\mu_1(\theta) = 0$.
2: repeat
3:   Get $\mu_d(\theta)$ solving (LMO) for $g_i = 2 \left( \int \sigma_\theta(a_i)\,d\mu_t(\theta) - y_i \right)$, $i = 1,\ldots,n$
4:   Set $\mu_{t+1}(\theta) := (1 - \lambda_t)\mu_t(\theta) + \lambda_t \mu_d(\theta)$, for $\lambda_t = 2/(t+1)$
5:   Set $t := t + 1$
6: until $\mathrm{gap}_t \leq \varepsilon$
Output: $\mu_{t_{\max}}(\theta)$

Proposition 2.1. After $T$ iterations of Algorithm 1 we have
\[
L\left( \int \sigma_\theta(\cdot)\,d\mu_T(\theta) \right) - L^\star \leq \frac{4 R^2 \delta^2}{T + 1}
\tag{5}
\]
where $R^2 = \sup_{\theta \in \mathcal{V}} \sum_{i=1}^n \sigma_\theta(a_i)^2$.

Proof. The objective function is 2-smooth and the result directly follows from e.g. [Jaggi, 2013] or [Bach, 2017, §2.5].

By construction, Algorithm 1 is designed to add a constant number of atoms to the measure $\mu(\theta)$ at each iteration. After $T$ iterations, where the method reaches a precision measured by the bound (5), the solution $f$ thus has $O(T)$ neurons.

Duality Gap. One of the key benefits of the Frank Wolfe algorithm is that, invoking convexity of the objective, it outputs an upper bound on the duality gap as a byproduct of the linear minimization oracle [Jaggi, 2013], computed as
\[
\mathrm{gap}_t = \sum_{i=1}^n g_i \left( \int \sigma_\theta(a_i)\,d\mu_t(\theta) - \int \sigma_\theta(a_i)\,d\mu_d(\theta) \right)
\tag{6}
\]
where $\mu_t(\theta)$ is the current iterate in Algorithm 1, and $\mu_d(\theta)$ the solution of the linear minimization oracle.

2.3.