arXiv:2002.02208v1 [math.OC] 6 Feb 2020

GLOBAL CONVERGENCE OF FRANK WOLFE ON ONE HIDDEN LAYER NETWORKS

ALEXANDRE D'ASPREMONT AND MERT PILANCI

Date: February 7, 2020.

ABSTRACT. We derive global convergence bounds for the Frank Wolfe algorithm when training one hidden layer neural networks. When using the ReLU activation function, and under tractable preconditioning assumptions on the sample data set, the linear minimization oracle used to incrementally form the solution can be solved explicitly as a second order cone program. The classical Frank Wolfe algorithm then converges with rate $O(1/T)$ where $T$ is both the number of neurons and the number of calls to the oracle.

1. INTRODUCTION

We focus on the problem of training one hidden layer neural networks using incremental algorithms, and in particular the Frank-Wolfe method. While they are of course more toy models than effective classification tools, one hidden layer neural networks have been heavily used to study the complexity of the neural network training problem in a variety of regimes and algorithmic settings.

Incremental methods in particular are classical tools for training one hidden layer networks, starting at least with the results in [Breiman, 1993] and [Lee et al., 1996]. The Frank-Wolfe algorithm [Frank and Wolfe, 1956], also known as conditional gradients [Levitin and Polyak, 1966], is one of the most well known methods of this type, and is used in constrained minimization problems where projection on the feasible set is hard, but solving a linear minimization oracle (LMO) over this set is tractable. This method has a long list of applications in machine learning, with recent examples including [Joulin et al., 2014, Shah et al., 2015, Osokin et al., 2016, Locatello et al., 2017a, Freund et al., 2017, Locatello et al., 2017b, Miech et al., 2017].

Several other approaches have been recently used to produce convergence results on one hidden layer training problems, including gradient flows [Chizat and Bach, 2018, Chizat, 2019] and discretized gradient descent schemes [Vempala and Wilmes, 2018]. Here, in the spirit of [Bengio et al., 2006, Rosset et al., 2007, Bach, 2017], we focus on training infinitely wide neural networks which are asymptotically convex. Following [Bach, 2017], we use an $\ell_1$ like penalty to let the algorithm decide on the location of the neurons via the solutions of the linear minimization oracle. In this setting, each iteration of Frank Wolfe, i.e. each solution of the LMO, adds a fixed number of neurons to the network.

Our contribution is twofold. While the one hidden layer training problem [Song et al., 2017] and the linear minimization oracle problem [Guruswami and Raghavendra, 2009, Bach, 2017] are both intractable in general, we first show, using recent results by [Ergen and Pilanci, 2019], that the LMO can be solved efficiently under mild overparameterization and preconditioning assumptions. Second, we discuss convexity properties of one hidden layer neural networks in a broader setting, showing in particular that the overparameterized problem has a convex epigraph and no duality gap. Using results derived from the Shapley-Folkman theorem, we derive non-asymptotic convergence bounds on this gap when converging towards the mean field limit. Overall, these results seem to further confirm recent empirical findings in e.g. [Zhang et al., 2016, Belkin et al., 2019], that overparameterized networks, in the "modern regime" described in e.g. [Belkin et al., 2019], are inherently easier to train.

2. FRANK WOLFE ON ONE HIDDEN LAYER NETWORKS

We are given real multivariate data samples $A \in \mathbb{R}^{n \times d}$ and a label vector $y \in \mathbb{R}^n$, together with activation functions $\sigma_\theta : \mathbb{R}^d \to \mathbb{R}$, parameterized by $\theta \in \mathcal{V}$ where $\mathcal{V}$ is a compact topological vector space. For a continuous function $h(\theta) : \mathcal{V} \to \mathbb{R}$, we write $\int h(\theta)\,d\mu(\theta)$ the action of the Radon measure $\mu$ on the function $h$. As in [Rosset et al., 2007, Bach, 2017] we focus on the following problem

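$$
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{i=1}^n \left( \int \sigma_\theta(a_i)\,d\mu(\theta) - y_i \right)^2 \\
\text{subject to} & \displaystyle\gamma_1\left( \int \sigma_\theta(\cdot)\,d\mu(\theta) \right) \le \delta
\end{array}
\tag{1}
$$

in the variable $\mu$, a Radon measure on $\mathcal{V}$, with parameter $\delta > 0$. Here $\gamma_1$ is the variation norm, a natural extension of the $\ell_1$ norm to the infinite dimensional setting, which we describe in detail below.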

2.1. Variation Norm. For a Radon measure $\mu$, we write

$$
|\mu|(\mathcal{V}) \triangleq \sup_{\substack{h : \mathcal{V} \to [-1,1],\\ h \text{ continuous}}} \int h(\theta)\,d\mu(\theta)
$$

its total variation. When $\mu$ has a density, with $d\mu(\theta) = p(\theta)\,d\tau(\theta)$, then $|\mu|(\mathcal{V})$ is simply equal to the $L_1$ norm of $p$. As in [Bach, 2017], we now write $\mathcal{F}_1$ the space of functions $f(x) : \mathbb{R}^d \to \mathbb{R}$ such that

$$
f(x) = \int \sigma_\theta(x)\,d\mu(\theta)
$$

where $\mu$ is a Radon measure on $\mathcal{V}$ with finite total variation. The infimum of the total variation of $\mu$ over all representations of $f$, written

$$
\gamma_1(f) \triangleq \inf \left\{ |\mu|(\mathcal{V}) : f(x) = \int \sigma_\theta(x)\,d\mu(\theta) \right\}
$$

is a norm called the variation norm of $f$ (see e.g. [Kurková and Sanguineti, 2001], or the discussion on atomic norms in [Chandrasekaran et al., 2012]). Note that when $f$ is decomposable on a finite number of basis functions, with

$$
f(x) = \sum_{i=1}^k \eta_i \sigma_{\theta_i}(x)
$$

we have

$$
\mu(\theta) = \sum_{i=1}^k \eta_i \delta_{\{\theta = \theta_i\}}
$$

and the total variation of $\mu$ is simply $\|\eta\|_1$, the $\ell_1$ norm of $\eta$. In this context, we can rewrite problem (1) as an equivalent problem

$$
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{i=1}^n (f(a_i) - y_i)^2 \\
\text{subject to} & \gamma_1(f) \le \delta
\end{array}
\tag{2}
$$

which is a convex problem in the variable $f \in \mathcal{F}_1$.

2.2. Incremental Algorithm: Frank Wolfe. Problem (2) is an infinite dimensional problem, but it can be solved efficiently using the Frank Wolfe method (aka conditional gradients) provided we can solve a linear minimization oracle over a $\gamma_1$ ball. The Frank Wolfe algorithm solves problem (2) by invoking a linear minimization oracle involving the gradient at each iteration, then takes convex combinations of iterates.

Gradients. The objective of problem (2), namely

$$
L(f) \triangleq \sum_{i=1}^n \left( \int \sigma_\theta(a_i)\,d\mu(\theta) - y_i \right)^2 = \sum_{i=1}^n (f(a_i) - y_i)^2
$$

is a smooth convex functional, whose gradient is given by

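$$
L'(f)(x) = \sum_{i=1}^n g_i \delta_{\{x = a_i\}}
$$

where

$$
g_i = 2 \left( \int \sigma_\theta(a_i)\,d\mu(\theta) - y_i \right), \quad i = 1,\ldots,n, \tag{3}
$$

if we write

$$
f(a_i) = \int \sigma_\theta(a_i)\,d\mu(\theta), \quad i = 1,\ldots,n,
$$

for a given Radon measure $\mu$.

Linear Minimization Oracle. Given a gradient vector $g \in \mathbb{R}^n$ as in (3), because the input space is finite, each iteration of the Frank Wolfe algorithm seeks to solve the following linear minimization oracle

$$
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{i=1}^n g_i f(a_i) \\
\text{subject to} & \gamma_1(f) \le \delta
\end{array}
$$

in the variable $f \in \mathcal{F}_1$. By definition of $\mathcal{F}_1$, this is equivalent to solving

$$
\begin{array}{ll}
\text{minimize} & \displaystyle\sum_{i=1}^n g_i \left( \int \sigma_\theta(a_i)\,d\mu(\theta) \right) \\
\text{subject to} & \displaystyle\gamma_1\left( \int \sigma_\theta(\cdot)\,d\mu(\theta) \right) \le \delta
\end{array}
\tag{LMO}
$$

in the variable $\mu$, a Radon measure on $\mathcal{V}$, and parameter $\delta > 0$. Since the objective is linear in $\mu$, we can normalize $\delta = 1$ by homogeneity. We have, switching sums,

$$
\begin{aligned}
\inf_{\gamma_1\left(\int \sigma_\theta(\cdot)\,d\mu(\theta)\right) \le 1} \ \sum_{i=1}^n g_i \left( \int \sigma_\theta(a_i)\,d\mu(\theta) \right)
&= \inf_{\gamma_1\left(\int \sigma_\theta(\cdot)\,d\mu(\theta)\right) \le 1} \ \int \left( \sum_{i=1}^n g_i \sigma_\theta(a_i) \right) d\mu(\theta) \\
&\ge - \max_{\theta \in \mathcal{V}} \left| \sum_{i=1}^n g_i \sigma_\theta(a_i) \right|,
\end{aligned}
$$

with equality if and only if $\mu = \mu_- - \mu_+$ where both $\mu_+$ and $\mu_-$ are nonnegative measures supported on the set of maximizers of

$$
\max_{\theta \in \mathcal{V}} \left| \sum_{i=1}^n g_i \sigma_\theta(a_i) \right| \tag{4}
$$

with the value inside the absolute value positive for $\mu_+$ (respectively negative for $\mu_-$). This means that the key to solving (LMO) is solving problem (4). We will discuss how to solve (4) for specific activation functions in Section 2.3. We first describe the overall structure of the Frank Wolfe algorithm for solving (2) (hence (1)).

Frank Wolfe Algorithm. Given a linear minimization oracle, the Frank Wolfe algorithm (aka conditional gradient method, or Fedorov's algorithm) is then detailed as Algorithm 1 and, calling $L^\star$ the optimum value of problem (2), we have the following convergence bound.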

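Algorithm 1 Frank-Wolfe Algorithm
Input: A target precision $\varepsilon > 0$
1: Set $t := 1$, $\mu_1(\theta) = 0$.
2: repeat
3:   Get $\mu_d(\theta)$ solving (LMO) for
     $g_i = 2 \left( \int \sigma_\theta(a_i)\,d\mu_t(\theta) - y_i \right)$, $i = 1,\ldots,n$.
4:   Set $\mu_{t+1}(\theta) := (1 - \lambda_t)\mu_t(\theta) + \lambda_t \mu_d(\theta)$, for $\lambda_t = 2/(t+1)$
5:   Set $t := t + 1$
6: until $\mathrm{gap}_t \le \varepsilon$
Output: $\mu_{t_{\max}}(\theta)$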
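The structure of Algorithm 1 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the parameter set $\mathcal{V}$ is replaced by a finite list of candidate parameters `thetas`, so that problem (4) is solved by enumeration; this stand-in is not the oracle discussed in Section 2.3. The helper names `relu_atoms` and `frank_wolfe` are hypothetical. The loop tracks the values $z_i = f(a_i)$, the signed weights $\eta$ of the atomic measure, and the duality gap of (6).

```python
def relu_atoms(A, thetas):
    """Atom k holds the vector (sigma_theta_k(a_i))_i for the ReLU (theta^T a)_+."""
    return [[max(0.0, sum(t * a for t, a in zip(theta, row))) for row in A]
            for theta in thetas]


def frank_wolfe(atoms, y, delta, T):
    """Frank-Wolfe for min ||z - y||^2 over z in delta * conv(+/- atoms).

    Assumption: `atoms` comes from a finite candidate set standing in for V,
    so the linear minimization oracle reduces to an argmax over the atoms.
    Returns the signed weights eta and the list of (loss, gap) per iteration.
    """
    n, K = len(y), len(atoms)
    eta = [0.0] * K              # mu_1 = 0
    z = [0.0] * n                # z_i = f(a_i) for the current iterate
    history = []
    for t in range(1, T + 1):
        g = [2.0 * (z[i] - y[i]) for i in range(n)]              # gradient (3)
        scores = [sum(g[i] * atom[i] for i in range(n)) for atom in atoms]
        k = max(range(K), key=lambda j: abs(scores[j]))           # problem (4)
        sign = -1.0 if scores[k] > 0 else 1.0                     # minimize g^T z_d
        z_d = [sign * delta * atoms[k][i] for i in range(n)]      # LMO solution
        loss = sum((z[i] - y[i]) ** 2 for i in range(n))
        gap = sum(g[i] * (z[i] - z_d[i]) for i in range(n))       # gap (6)
        history.append((loss, gap))
        lam = 2.0 / (t + 1)                                       # step 4
        eta = [(1.0 - lam) * e for e in eta]
        eta[k] += lam * sign * delta
        z = [(1.0 - lam) * z[i] + lam * z_d[i] for i in range(n)]
    return eta, history
```

Each pass adds mass to a single atom, so after $T$ iterations the iterate is supported on at most $T$ candidate neurons, and the weights always satisfy $\|\eta\|_1 \le \delta$ because they are convex combinations of points on the $\ell_1$ sphere of radius $\delta$.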

Proposition 2.1. After $T$ iterations of Algorithm 1 we have

$$
L\left( \int \sigma_\theta(\cdot)\,d\mu_T(\theta) \right) - L^\star \le \frac{4 R^2 \delta^2}{T + 1} \tag{5}
$$

where $R^2 = \sup_{\theta \in \mathcal{V}} \sum_{i=1}^n \sigma_\theta(a_i)^2$.

Proof. The objective function is 2-smooth and the result directly follows from e.g. [Jaggi, 2013] or [Bach, 2017, §2.5].

By construction, Algorithm 1 is designed to add a constant number of atoms to the measure $\mu(\theta)$ at each iteration. After $T$ iterations, where the method reaches a precision measured by the bound (5), the solution $f$ thus has $O(T)$ neurons.

Duality Gap. One of the key benefits of the Frank Wolfe algorithm is that, invoking convexity of the objective, it outputs an upper bound on the duality gap as a byproduct of the linear minimization oracle [Jaggi, 2013], computed as

$$
\mathrm{gap}_t = \sum_{i=1}^n g_i \left( \int \sigma_\theta(a_i)\,d\mu_t(\theta) - \int \sigma_\theta(a_i)\,d\mu_d(\theta) \right) \tag{6}
$$

where $\mu_t(\theta)$ is the current iterate in Algorithm 1, and $\mu_d(\theta)$ the solution of the linear minimization oracle.
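The reason the quantity in (6) bounds the suboptimality can be spelled out in a few lines. The derivation below is a standard convexity argument, written with the shorthand $f_t$, $f_d$ and $f^\star$ (notation introduced here for convenience) for the functions represented by $\mu_t$, $\mu_d$ and by a minimizer of (2).

```latex
% f_t, f_d: functions represented by \mu_t and \mu_d; f^\star: a minimizer of (2).
% g_i = 2 (f_t(a_i) - y_i) is the gradient in (3).
\begin{align*}
L^\star = L(f^\star)
  &\ge L(f_t) + \sum_{i=1}^n g_i \bigl( f^\star(a_i) - f_t(a_i) \bigr)
     && \text{(convexity of $L$)} \\
  &\ge L(f_t) + \min_{\gamma_1(f) \le \delta} \sum_{i=1}^n g_i \bigl( f(a_i) - f_t(a_i) \bigr)
     && \text{($f^\star$ is feasible)} \\
  &= L(f_t) - \sum_{i=1}^n g_i \bigl( f_t(a_i) - f_d(a_i) \bigr)
     && \text{($f_d$ solves the LMO)} \\
  &= L(f_t) - \mathrm{gap}_t .
\end{align*}
```

Rearranging gives $L(f_t) - L^\star \le \mathrm{gap}_t$, so the stopping test in Algorithm 1 certifies an $\varepsilon$-accurate solution whenever the oracle is solved exactly.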

2.3. Solving the LMO for ReLU Activation Functions. The key to making Algorithm 1 tractable is efficiently solving the (LMO) problem. When the activation function is the ReLU, given by $\sigma_\theta(a_i) = (\theta^T a_i)_+$, and $\mathcal{V}$ is the Euclidean unit ball, we have

$$
\sup_{\theta \in \mathcal{V}} \left| \sum_{i=1}^n g_i \sigma_\theta(a_i) \right|
= \max\left\{ \max_{\|\theta\|_2 \le 1} g^T (A\theta)_+ ,\ \max_{\|\theta\|_2 \le 1} -g^T (A\theta)_+ \right\}.
$$

Under certain conditions on the data set $A$, this last maximization problem is tractable as a second order cone program.

2.3.1. Spike-free matrices. [Ergen and Pilanci, 2019] define spike-free matrices as follows.

Definition 2.2. A matrix $A \in \mathbb{R}^{n \times d}$ is spike-free if and only if

$$
\left\{ (Au)_+ : u \in \mathbb{R}^d,\ \|u\|_2 \le 1 \right\} = A\mathcal{B}_2 \cap \mathbb{R}^n_+ \tag{7}
$$

where $\mathcal{B}_2$ is the Euclidean unit ball.

The set on the left in Definition 2.2 is precisely the set over which we minimize in (LMO), while the set on the right is convex. For spike free matrices, the (LMO) is thus a convex minimization problem, with

$$
\max_{\|\theta\|_2 \le 1} \pm g^T (A\theta)_+ = \max_{\|\theta\|_2 \le 1,\ A\theta \ge 0} \pm g^T A\theta \tag{8}
$$

where the problem on the right is a (convex) second order cone program [Boyd and Vandenberghe, 2004]. [Ergen and Pilanci, 2019, Lem. 2.4] shows for example that whitened matrices $A \in \mathbb{R}^{n \times d}$ with $n \le d$, for which $\sigma_{\min}(A) = \sigma_{\max}(A) = 1$, are spike free. In practice then, if we let $\theta_+, \theta_-$ be the optimal solutions to the right hand side of (8), the corresponding optimal measures read

$$
\mu(\theta) = \lambda \delta_{\theta_-}(\theta) - (1 - \lambda)\delta_{\theta_+}(\theta)
$$

where $0 \le \lambda \le 1$.

2.3.2. Certifying Spike-Free Matrices. The (LMO) problem is tractable when the matrix $A$ is spike-free. [Ergen and Pilanci, 2019, Lem. 2.3] shows that this is the case when $A$ has full row rank, $n \le d$ and

$$
\max_{\|u\|_2 \le 1} \|A^\dagger (Au)_+\|_2 \le 1. \tag{9}
$$

We can relax the left-hand side as follows

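$$
\begin{aligned}
\max_{\|u\|_2 \le 1} \|A^\dagger (Au)_+\|_2
&\le \max_{z \in [0,1]^n} \ \max_{\|u\|_2 \le 1} \|A^\dagger \operatorname{diag}(z) A u\|_2 \\
&= \max_{z \in [0,1]^n} \|A^\dagger \operatorname{diag}(z) A\|_2
\end{aligned}
$$

by convexity of the norm. Now, checking

$$
\max_{z \in [0,1]^n} \|A^\dagger \operatorname{diag}(z) A\|_2 \le 1
$$

is equivalent to deciding whether

$$
\begin{pmatrix} I & A^\dagger \operatorname{diag}(z) A \\ A^T \operatorname{diag}(z) A^{\dagger T} & I \end{pmatrix} \succeq 0, \quad \forall z \in [0,1]^n, \tag{10}
$$

which is a matrix cube problem, and admits a semidefinite relaxation [Ben-Tal and Nemirovski, 2001, Prop. 4.4.5] which we detail in the following proposition.

Proposition 2.3. Suppose $A \in \mathbb{R}^{n \times d}$ has full row rank and $n \le d$. Let us call $M_A(z) \in \mathbf{S}_{2d}$ the matrix in (10) and assume the following linear matrix inequality

$$
\left\{
\begin{array}{l}
X_i \succeq \dfrac{\rho}{2} M_A(e_i), \quad X_i \succeq -\dfrac{\rho}{2} M_A(e_i), \quad i = 1,\ldots,n \\[2mm]
\displaystyle\sum_{i=1}^n X_i \preceq \frac{1}{2} M_A(\mathbf{1}) + I
\end{array}
\right.
\tag{11}
$$

in the variables $X_i \in \mathbf{S}_{2d}$ is feasible for $\rho = 1$, where $e_i$ is the Euclidean basis, then both condition (10) and a fortiori (9) hold and $A$ is spike free.

Proof. See [Ben-Tal and Nemirovski, 2001, Prop. 4.4.5].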
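As a sanity check on condition (9), the whitened case mentioned in Section 2.3.1 can be verified by hand. The short derivation below assumes only that $A$ has $n \le d$ orthonormal rows, which is what $\sigma_{\min}(A) = \sigma_{\max}(A) = 1$ means for a matrix of full row rank; it is a direct computation, not a substitute for the lemma in [Ergen and Pilanci, 2019].

```latex
% Assumption: A \in R^{n \times d}, n \le d, with A A^T = I_n (orthonormal rows).
% Then A^\dagger = A^T (A A^T)^{-1} = A^T, and \|A^T v\|_2 = \|v\|_2 for all v \in R^n.
\begin{align*}
\|A^\dagger (Au)_+\|_2
  &= \|A^T (Au)_+\|_2 = \|(Au)_+\|_2
     && \text{($A^T$ has orthonormal columns)} \\
  &\le \|Au\|_2
     && \text{(clipping negative entries cannot increase the norm)} \\
  &\le \|A\|_2 \, \|u\|_2 = \|u\|_2 \le 1 .
\end{align*}
```

Taking the maximum over $\|u\|_2 \le 1$ recovers (9), which is consistent with whitened matrices being spike free when $n \le d$.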

The semidefinite relaxation in (11) for checking condition (10) has a constant approximation ratio equal to $\pi/2$, as we recall below.

Proposition 2.4. If the linear matrix inequality in (11) is infeasible for $\rho = 2/\pi$, then

$$
\begin{pmatrix} I & A^\dagger \operatorname{diag}(z) A \\ A^T \operatorname{diag}(z) A^{\dagger T} & I \end{pmatrix} \not\succeq 0,
$$

for some $z \in [0,1]^n$.

Proof. The matrices $M_A(e_i)$ all have rank at most two, hence the approximation ratio in [Ben-Tal and Nemirovski, 2001, Th. 4.4.1] is equal to $\pi/2$.

3. STOCHASTIC FRANK WOLFE

When the number of samples $n$ is larger than the dimension $d$, the conditions that guarantee tightness of the SOCP for solving the linear minimization oracle in Section 2.3 cannot hold. We recall in Algorithm 2 the stochastic Frank Wolfe algorithm for minimizing objectives that are finite sums, i.e.

$$
\begin{array}{ll}
\text{minimize} & f(x) = \displaystyle\sum_{i=1}^n f_i(x) \\
\text{subject to} & x \in \mathcal{C}
\end{array}
$$

in the variable $x \in \mathbb{R}^d$, discussed in e.g. [Hazan and Luo, 2016]. This algorithm admits the following convergence bound

$$
\mathbf{E}[f(w_t) - f(w^\star)] \le \frac{4 L D^2}{t + 2}
$$

where $L$ is the Lipschitz constant of $\nabla f$ and $D$ is the diameter of the feasible set $\mathcal{C}$, provided the size $m_t$ of the minibatch is set to

$$
m_t = \left( \frac{G(t+1)}{LD} \right)^2
$$

where $G$ is an upper bound on the Lipschitz constant of the gradients $\nabla f_i$.

Algorithm 2 Stochastic Frank-Wolfe (SFW)
Input: A target precision $\varepsilon > 0$, objective function $f = \sum_{i=1}^n f_i / n$, feasible set $\mathcal{C}$ and parameters $m_t$.
1: Set $t := 1$.
2: repeat
3:   Estimate the stochastic gradient
     $\tilde{\nabla} f = \frac{1}{|I|} \sum_{i \in I} \nabla f_i(w_t)$
     for $I$ an i.i.d. sample of indices in $[1,n]$ of size $m_t$.
4:   Solve the linear minimization oracle
     $w_d := \operatorname{argmin}_{w \in \mathcal{C}} \tilde{\nabla} f^\top w$
5:   Take step $w_{t+1} := (1 - \lambda_t) w_t + \lambda_t w_d$, for $\lambda_t = 2/(t+1)$
6:   Set $t := t + 1$
7: until $t \ge t_{\max}$
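The steps of Algorithm 2 can be sketched in a few lines of Python. The feasible set used in the helper `l1_ball_lmo` (an $\ell_1$ ball of given radius) is an assumption chosen only because its linear minimization oracle has a closed form; it is not the oracle of Section 2.3. The function names, the `seed` argument and the way the minibatch schedule is passed as a callable are all hypothetical.

```python
import random


def l1_ball_lmo(radius):
    """LMO for the hypothetical feasible set C = {w : ||w||_1 <= radius}."""
    def lmo(g):
        k = max(range(len(g)), key=lambda j: abs(g[j]))
        w = [0.0] * len(g)
        if g[k] != 0.0:
            w[k] = -radius if g[k] > 0 else radius   # vertex minimizing g^T w
        return w
    return lmo


def stochastic_frank_wolfe(grads, lmo, w0, batch_size, t_max, seed=0):
    """Sketch of Algorithm 2.

    grads:      list of callables, grads[i](w) returns the gradient of f_i at w
    lmo:        callable g -> argmin over C of g^T w
    batch_size: callable t -> m_t (a positive integer)
    """
    rng = random.Random(seed)
    n, w = len(grads), list(w0)
    for t in range(1, t_max + 1):
        # Step 3: minibatch gradient over i.i.d. indices
        idx = [rng.randrange(n) for _ in range(batch_size(t))]
        g = [0.0] * len(w)
        for i in idx:
            g = [a + b / len(idx) for a, b in zip(g, grads[i](w))]
        w_d = lmo(g)                                   # step 4
        lam = 2.0 / (t + 1)                            # step 5
        w = [(1.0 - lam) * a + lam * b for a, b in zip(w, w_d)]
    return w
```

Because the first step uses $\lambda_1 = 1$, the iterate lands on an oracle output and every later iterate is a convex combination of points of $\mathcal{C}$, so feasibility is maintained without any projection.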

Focusing on problem (1), when solving problems where $n$ is larger than $d$, i.e. problems that are not overparameterized, Algorithm 2 solves a linear minimization oracle at each iteration on a subset of the samples that we write $A_I \in \mathbb{R}^{m_t \times d}$. For small values of $m_t$, this matrix is much more likely to satisfy the spike-free condition in (7).

Let us define $m_A$ as the largest value for which $A_I \in \mathbb{R}^{|I| \times d}$ satisfies the spike-free condition in (7) for all subsets $I \subset [1,n]$ with $|I| \le m_A$. To ensure that the LMO is always tractable we can limit the number of iterations so that

$$
m_t = \left( \frac{G(t+1)}{LD} \right)^2 \le m_A
$$

or again $t_{\max} \le LD\sqrt{m_A}/G - 1$. The stochastic Frank Wolfe Algorithm 2 will then solve (1) and yield an iterate satisfying

$$
\mathbf{E}[f(w_{t_{\max}}) - f(w^\star)] \le \frac{4 L D^2}{\frac{LD\sqrt{m_A}}{G} + 1} = O\left( \frac{GD}{\sqrt{m_A}} \right) \tag{12}
$$

which is the precision limit imposed on the algorithm by the spike-free properties of the matrix $A$. In other words, depending on the spike free properties of $A$ measured by $m_A$, the stochastic Frank Wolfe algorithm will be guaranteed to reach a precision at least equal to the bound in (12).
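The arithmetic behind the iteration cap and the precision limit in (12) can be made concrete with a small Python sketch. The function names are hypothetical, and rounding the minibatch size up to an integer with `math.ceil` is an assumption, since the text states the schedule as a real number.

```python
import math


def minibatch_size(t, G, L, D):
    """m_t = (G (t + 1) / (L D))^2, rounded up to an integer (assumption)."""
    return math.ceil((G * (t + 1) / (L * D)) ** 2)


def max_iterations(m_A, G, L, D):
    """Largest t with m_t <= m_A, i.e. t_max = floor(L D sqrt(m_A) / G - 1)."""
    return math.floor(L * D * math.sqrt(m_A) / G - 1)


def precision_floor(m_A, G, L, D):
    """Bound 4 L D^2 / (t_max + 2) reached when the iteration cap is hit."""
    return 4.0 * L * D ** 2 / (max_iterations(m_A, G, L, D) + 2)


# Illustrative constants, not taken from the experiments.
G, L, D, m_A = 1.0, 2.0, 1.0, 16
t_max = max_iterations(m_A, G, L, D)
print(t_max, minibatch_size(t_max, G, L, D), precision_floor(m_A, G, L, D))
```

With these illustrative constants the cap is $t_{\max} = 7$, the last minibatch has exactly $m_A = 16$ samples, and one more iteration would require a minibatch larger than $m_A$, for which the spike-free condition is no longer guaranteed.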

4. HIDDEN CONVEXITY

The results of the previous section highlight the fact that solving problem (1) becomes easier as the network becomes increasingly overparameterized. The Frank Wolfe algorithm adds a couple of neurons per iteration and we have seen above that the linear minimization oracle becomes easier when $d$ is relatively large. This phenomenon, akin to the hidden convexity of the S-Lemma [Ben-Tal and Nemirovski, 2001, §4.10.5] for example, has been observed empirically in many settings, and has several geometrical roots which we discuss below.

4.1. Large Dimensional Regime. Of course, as in [Ergen and Pilanci, 2019, Lem. 2.4], if $d$ is large enough, the number of neurons $m$ is larger than $n$ and the vectors $\theta_k$ are picked in general position, then the matrix with columns $(A\theta_k)_+$ for $k = 1,\ldots,m$ has full rank, and we can solve (1) by solving a simple linear system. This of course offers no guarantee that an algorithm such as Frank Wolfe or the stochastic gradient method will converge, since the problem is still nonconvex, even though the results of Section 2 show that Frank Wolfe does indeed converge in this scenario. It does show, however, that this overparameterized regime is inherently easier.

4.2. Large Number of Neurons. Perhaps more surprisingly, a similar phenomenon occurs when the number of neurons gets larger relative to the number of samples, and the training problem becomes increasingly close to being convex. In the large number of heterogeneous neurons regime, the Shapley-Folkman theorem, a classical result from convex analysis, shows that the Minkowski sum of arbitrary sets of about the same size becomes arbitrarily close to its convex hull as the number of sets grows while the dimension remains fixed. We briefly recall this result and its consequences in optimization in what follows.

4.2.1. The Shapley-Folkman Theorem. Given functions $f_i$, a vector $b \in \mathbb{R}^m$, and vector-valued functions $g_i$, $i \in [n]$ that take values in $\mathbb{R}^m$, we consider the following separable optimization problem

$$
\begin{array}{rll}
h_P(u) := & \text{minimize} & \displaystyle\sum_{i=1}^n f_i(x_i) \\
& \text{subject to} & \displaystyle\sum_{i=1}^n g_i(x_i) \le b + u
\end{array}
\tag{P}
$$

in the variables $x_i \in \mathbb{R}^{d_i}$, with perturbation parameter $u \in \mathbb{R}^m$. We first recall some basic results about conjugate functions and convex envelopes.

Biconjugate and convex envelope. Given a function $f$, not identically $+\infty$, minorized by an affine function, we write

$$
f^*(y) \triangleq \sup_{x \in \operatorname{dom} f} \{ y^\top x - f(x) \}
$$

the conjugate of $f$, and $f^{**}(y)$ its biconjugate. The biconjugate of $f$ (aka the convex envelope of $f$) is the pointwise supremum of all affine functions majorized by $f$ (see e.g. [Rockafellar, 1970, Th. 12.1] or [Hiriart-Urruty and Lemaréchal, 1993, Th. X.1.3.5]); a corollary then shows that $\operatorname{epi}(f^{**}) = \overline{\mathbf{Co}(\operatorname{epi}(f))}$. For simplicity, we write $S^{**} = \overline{\mathbf{Co}(S)}$ for any set $S$ in what follows. We will make the following technical assumptions on the functions $f_i$ and $g_i$ in our problem.

Assumption 4.1. The functions $f_i : \mathbb{R}^{d_i} \to \mathbb{R}$ are proper, 1-coercive, lower semicontinuous and there exists an affine function minorizing them.

Note that coercivity trivially holds if $\operatorname{dom}(f_i)$ is compact (since $f$ can be set to $+\infty$ outside w.l.o.g.). When Assumption 4.1 holds, $\operatorname{epi}(f^{**})$, $f_i^{**}$ and hence $\sum_{i=1}^n f_i^{**}(x_i)$ are closed [Hiriart-Urruty and Lemaréchal, 1993, Lem. X.1.5.3]. Also, as in e.g. [Ekeland and Temam, 1999], we define the lack of convexity of a function as follows.

Definition 4.2. Let $f : \mathbb{R}^d \to \mathbb{R}$, we let

$$
\rho(f) \triangleq \sup_{x \in \operatorname{dom}(f)} \{ f(x) - f^{**}(x) \} \tag{13}
$$

Many other quantities measure lack of convexity (see e.g. [Aubin and Ekeland, 1976, Bertsekas, 2014] for further examples). In particular, the nonconvexity measure $\rho(f)$ can be rewritten as

$$
\rho(f) = \sup_{\substack{x_i \in \operatorname{dom}(f) \\ \mu \in \mathbb{R}^{d+1}_+,\ \mathbf{1}^\top \mu = 1}} \left\{ f\left( \sum_{i=1}^{d+1} \mu_i x_i \right) - \sum_{i=1}^{d+1} \mu_i f(x_i) \right\} \tag{14}
$$

when $f$ satisfies Assumption 4.1 (see [Hiriart-Urruty and Lemaréchal, 1993, Th. X.1.5.4]).

Bounds on the duality gap. Let $h_P(u)^{**}$ be the biconjugate of $h_P(u)$ defined in (P), then $h_P(0)^{**}$ is the optimal value of the dual to (P) (this is the perturbation view on duality, see [Ekeland and Temam, 1999, Chap. III] for more details). Then, [Ekeland and Temam, 1999, Lem. 2.3], and [Ekeland and Temam, 1999, Th. I.3] show the following result.

Theorem 4.3. Suppose the functions $f_i$, $g_{ji}$ in problem (P) satisfy Assumption 4.1 for $i = 1,\ldots,n$, $j = 1,\ldots,m$. Let

$$
\bar{p}_j = (m + 1) \max_i \rho(g_{ji}), \quad \text{for } j = 1,\ldots,m \tag{15}
$$

then

$$
h_P(\bar{p}) \le h_P(0)^{**} + (m + 1) \max_i \rho(f_i) \tag{16}
$$

where $\rho(\cdot)$ is defined in Def. 4.2.

This last result shows that the optimal value of problem (P) is bounded above and below by the optimal values of convex problems, with the gap between these bounds decreasing in relative scale when the number of terms $n$ increases relative to the number of constraints $m$. The proof of [Ekeland and Temam, 1999, Th. I.3] shows in fact a much stronger result, which is that the epigraphs of those three optimization problems are nested.

4.2.2. Wide Neural Networks. The results above show that separable optimization problems become increasingly convex as the number of terms increases. See [Kerdreux et al., 2017] for an application of these results to multitask problems and [Zhang et al., 2019] for an extension to training multi-branch neural networks. This has direct implications for generic one hidden layer neural networks, as we detail below.

Given samples $a_l \in \mathbb{R}^d$ and labels $y_l$ for $l = 1,\ldots,n$, consider the following unregularized (unconstrained) one hidden layer network training problem

$$
\min \ \sum_{l=1}^n \left( \sum_{i=1}^p \theta_{i0}\, \sigma\left( \theta_i^\top a_l \right) - y_l \right)^2 \tag{17}
$$

in the variables $(\theta_{i0}, \theta_i) \in \mathbb{R}^{d+1}$ for $i = 1,\ldots,p$, where $\sigma(\cdot)$ is an activation function. Defining

$$
g_l(\theta) \triangleq \theta_1\, \sigma\left( \theta_2^\top a_l \right), \quad \text{for } l = 1,\ldots,n, \tag{18}
$$

in the variable $\theta = (\theta_1, \theta_2) \in \mathbb{R} \times \mathbb{R}^d$, the problem can be rewritten

$$
\begin{array}{ll}
\text{minimize} & \sum_{l=1}^n (z_l - y_l)^2 \\
\text{subject to} & \sum_{i=1}^p g(\theta_i) = z
\end{array}
\tag{19}
$$

in the variables $\theta_i \in \mathbb{R}^{d+1}$ for $i = 1,\ldots,p$ and $z \in \mathbb{R}^n$. Suppose we add an $\ell_\infty$ constraint on the parameters $\theta_i$, solving instead

$$
\begin{array}{ll}
\text{minimize} & \sum_{l=1}^n (z_l - y_l)^2 \\
\text{subject to} & \sum_{i=1}^p g(\theta_i) = z \\
& \|\theta_i\|_\infty \le \delta, \quad i = 1,\ldots,p
\end{array}
\tag{20}
$$

in the variables $\theta_i \in \mathbb{R}^{d+1}$ for $i = 1,\ldots,p$ and $z \in \mathbb{R}^n$. This is equivalent to

$$
\begin{array}{ll}
\text{minimize} & \sum_{l=1}^n (z_l - y_l)^2 + \sum_{i=1}^p \mathbf{1}_{\{\|\theta_i\|_\infty \le \delta\}} \\
\text{subject to} & \sum_{i=1}^p g(\theta_i) = z
\end{array}
\tag{21}
$$

Now, let

$$
\begin{array}{rll}
h_P((u,v)) = & \text{min.} & \sum_{l=1}^n (z_l - y_l)^2 + \sum_{i=1}^p \mathbf{1}_{\{\|\theta_i\|_\infty \le \delta\}} \\
& \text{s.t.} & \sum_{i=1}^p g(\theta_i) \le z + u \\
& & \sum_{i=1}^p g(\theta_i) \ge z - v
\end{array}
$$

where the inequalities hold componentwise for $l = 1,\ldots,n$, then Theorem 4.3 shows

$$
h_P((u,v))^{**} \le h_P((u,v)) \le h_P(0)^{**} \tag{22}
$$

with

$$
(u,v) = \left( \sum_{i=1}^{2n+1} |\theta_{[i]0}| \right) \rho(\sigma)\, \mathbf{1}
$$

with $|\theta_{[1]0}| \ge |\theta_{[2]0}| \ge \ldots$ and

$$
\rho(\sigma) = \max_{l=1,\ldots,n} \rho\left( \theta_1 \sigma(\theta_2^\top a_l) \right)
$$

for $\|\theta\|_\infty \le \delta$. Note that $\rho(f_i) = 0$ in (16) as the objective function in (21) is convex. This means that when $n$ remains constant and $\delta \to 0$ as the number of neurons $p \to \infty$, i.e. the trained model is not sparse or atomic (the mean field limit), then

$$
\left( \sum_{i=1}^{2n+1} |\theta_{[i]0}| \right) \to 0,
$$

hence $(u,v) \to 0$, the bound in (22) is asymptotically tight and problem (17) is asymptotically equal to its convex relaxation. Overall then, the bound in (16) precisely quantifies the convergence rate of the duality gap in problem (20) in the mean field limit, when the number of neurons goes to infinity. Note that, when all activation functions are identical, the convergence is actually finite, but the bound also allows us to quantify convergence in the case of heterogeneous networks.
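The nonconvexity measure $\rho$ of Definition 4.2, which drives the bounds above, is easy to estimate numerically for a function of one variable. The Python sketch below assumes a function sampled on a strictly increasing grid with at least two points, and approximates the convex envelope $f^{**}$ by the lower convex hull of the sampled points; both the grid restriction and the function names are illustrative choices, not part of the results above.

```python
def lower_convex_envelope(xs, fs):
    """Convex envelope of the points (xs[i], fs[i]), evaluated at each xs[i].

    Assumes xs is strictly increasing and has at least two entries.
    """
    hull = []
    for p in zip(xs, fs):
        # Drop the last hull point while it lies on or above the chord
        # joining its predecessor to the new point p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    env, j = [], 0
    for x in xs:
        # Move to the hull segment [hull[j], hull[j + 1]] containing x.
        while j + 2 < len(hull) and hull[j + 1][0] <= x:
            j += 1
        (x1, y1), (x2, y2) = hull[j], hull[j + 1]
        env.append(y1 + (y2 - y1) * (x - x1) / (x2 - x1))
    return env


def lack_of_convexity(xs, fs):
    """Grid approximation of rho(f) = sup_x f(x) - f**(x) from (13)."""
    env = lower_convex_envelope(xs, fs)
    return max(f - e for f, e in zip(fs, env))


grid = [k / 8.0 for k in range(-16, 17)]
double_well = [min((x - 1.0) ** 2, (x + 1.0) ** 2) for x in grid]
print(lack_of_convexity(grid, double_well))
```

For the double-well function used here, the envelope is flat between the two minima at $x = \pm 1$, so the largest deviation occurs at $x = 0$; a convex function such as $x^2$ coincides with its envelope and has zero lack of convexity on the same grid.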

4.3. Convex Relaxation. Yet another take on the hidden convexity properties of problem (1) is given by the results in [Lemaréchal and Renaud, 2001]. Suppose we start with a problem involving a single unit

$$
\begin{array}{ll}
\text{minimize} & \|z - y\|_2^2 \\
\text{subject to} & \sigma(\theta^\top a_i) = z_i, \quad i = 1,\ldots,n
\end{array}
\tag{23}
$$

in the variable $\theta \in \mathbb{R}^d$. If we directly form a convex relaxation for this last problem as in e.g. [Lemaréchal and Renaud, 2001, §2.2], by taking the convex hull of its epigraph (splitting the equality into two inequality constraints), we obtain

$$
\begin{array}{ll}
\text{minimize} & \|z - y\|_2^2 \\
\text{subject to} & \sum_{j=1}^{n+2} \alpha_j \sigma(\theta_j^\top a_i) = z_i, \quad i = 1,\ldots,n \\
& \|\alpha\|_1 \le 2
\end{array}
$$

in the variables $\theta_j \in \mathbb{R}^d$ for $j = 1,\ldots,n+2$ and $\alpha \in \mathbb{R}^{n+2}$. Even though this last problem is still nonconvex, its epigraph is convex by construction and it is an explicit (geometric) convex relaxation of problem (23). This last problem also happens to exactly match an unconstrained version of the original one hidden layer training problem in (1). This shows once more that, in a sense, one hidden layer neural networks where the number of neurons exceeds the number of samples are just convex problems, parameterized in a nonconvex manner.

5. NUMERICAL RESULTS

5.1. Linear Minimization Oracle. The results discussed in Section 2.3 on the linear minimization oracle guarantee that whitened matrices $A$ with $n \le d$ are spike free, hence satisfy

$$
\left\{ (Au)_+ : u \in \mathbb{R}^d,\ \|u\|_2 \le 1 \right\} = A\mathcal{B}_2 \cap \mathbb{R}^n_+ . \tag{24}
$$

Solving the LMO under this equivalence means solving a second order cone program. To get a sense of how far this equivalence is likely to hold beyond this regime, we first check a necessary condition on whitened matrices with $n \ge d$. While the solution of the original LMO, given by

$$
\max_{\|\theta\|_2 \le 1} g^T (A\theta)_+
$$

is always nonzero when $g \not\le 0$, that of its SOCP counterpart, written

$$
\max_{\|\theta\|_2 \le 1,\ A\theta \ge 0} \pm g^T A\theta
$$

can only be nonzero if there is a nonzero vector $\theta$ such that $A\theta \ge 0$. This means that the SOCP cannot solve the LMO if $\{\theta : A\theta \ge 0\} = \{0\}$, in other words, $\{\theta : A\theta \ge 0\} \ne \{0\}$ is a necessary condition for $A$ being spike-free and (24) to hold.

We sample Gaussian matrices $A \in \mathbb{R}^{n \times d}$ with $d = 20$ and $n$ ranging from 20 to 75, with 200 samples at each $n$. We then whiten these matrices and check if $\{\theta : A\theta \ge 0\} \ne \{0\}$. In Figure 1, we plot the resulting empirical probability and notice a phase transition starting a bit after $n = d$ which seems to indicate that, for Gaussian matrices at least, the overparameterization requirement is tight.

[Figure: probability that $\{\theta : A\theta \ge 0\} \ne \{0\}$ (vertical axis, 0 to 1) versus $n$ (horizontal axis, 20 to 70).]

FIGURE 1. Probability of SOCP solving the linear minimization oracle having a nonzero solution versus number of samples $n$ for $d = 20$. When the solution to the SOCP is zero, it cannot be a tight solution of the LMO.

We also tested the matrix cube relaxation in (11) on ten sample Gaussian matrices $A \in \mathbb{R}^{n \times d}$ with $d = 10$. After whitening, the linear matrix inequality in (11) was always feasible on these samples for $n = 5$ and $n = 10$, showing that, in these toy examples at least, the SDP relaxation is tight enough to certify that the whitened matrices are spike-free.

On the other hand, when repeating this last experiment on Gaussian matrices that were not whitened, the linear matrix inequality in (11) was always infeasible, showing that these matrices are potentially not spike-free. This means that some form of normalization is critical to the tractability of the linear minimization oracle.

5.2. Frank Wolfe. We now test the convergence of the Frank Wolfe Algorithm 1 on toy examples. In Figure 2 the ground truth is generated using ten neurons in dimension 25 using Gaussian weights, observing 20 data points and no whitening. In Figure 3, we repeat the same experiment using ten neurons, this time whitening the data. In Figure 4, we repeat this last experiment once more at the edge of the overparameterization regime, with $d = n = 20$. Convergence seems faster in the whitened examples, where the guarantees hold.

Finally, in Figure 5 we test convergence of the Stochastic Frank Wolfe Algorithm 2 on a toy network example where the ground truth is generated by ten neurons, in dimension $d = 20$ using $n = 25$ samples and whitening. Note that the stochastic variant produces no valid gap. In this setting, the results in Section 3 only guarantee convergence until a fixed (a priori intractable) precision threshold, which is indeed what we observe in this experiment. In cases where the spike free condition in (9) does not hold, the SOCP typically returns a solution equal to zero (cf. Figure 1) and convergence stalls.

[Figure: Loss and Gap (logarithmic vertical axis) versus Iterations (0 to 100).]

FIGURE 2. Convergence of Frank Wolfe on a toy network example where the ground truth is generated using ten neurons, in dimension $d = 25$ using $n = 20$ samples and no whitening. We plot both loss and duality gap bound versus number of iterations (and a proportional number of neurons).

ACKNOWLEDGEMENTS

AA is at CNRS & département d'informatique, École normale supérieure, UMR CNRS 8548, 45 rue d'Ulm 75005 Paris, France, INRIA and PSL Research University. AA acknowledges support from the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), the ML & Optimisation joint research initiative with the fonds AXA pour la recherche and Kamet Ventures, as well as a Google focused award.

[Figure: Loss and Gap (logarithmic vertical axis) versus Iterations (0 to 100).]

FIGURE 3. Convergence of Frank Wolfe on a toy network example where the ground truth is generated by ten neurons, in dimension $d = 25$ using $n = 20$ samples and whitening. We plot both loss and duality gap bound versus number of iterations (and a proportional number of neurons).

[Figure: Loss and Gap (logarithmic vertical axis) versus Iterations (0 to 100).]

FIGURE 4. Convergence of Frank Wolfe on a toy network example where the ground truth is generated by ten neurons, in dimension $d = 20$ using $n = 20$ samples and whitening. We plot both loss and duality gap bound versus number of iterations (and a proportional number of neurons).

[Figure: Loss (logarithmic vertical axis) versus Iterations (20 to 100).]

FIGURE 5. Convergence of the Stochastic Frank Wolfe Algorithm on a toy network example where the ground truth is generated by ten neurons, in dimension $d = 20$ using $n = 25$ samples and whitening. We plot loss versus number of iterations (and a proportional number of neurons).

REFERENCES

Jean-Pierre Aubin and Ivar Ekeland. Estimates of the duality gap in nonconvex optimization. Mathematics of Operations Research, 1(3):225–245, 1976.

Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications. MPS-SIAM series on optimization. SIAM, 2001.

Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems, pages 123–130, 2006.

Dimitri P Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic Press, 2014.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Leo Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory, 39(3):999–1013, 1993.

V. Chandrasekaran, B. Recht, P. Parrilo, and A.S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

Lenaic Chizat. Sparse optimization on measures with over-parameterized gradient descent. arXiv preprint arXiv:1907.10300, 2019.

Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036–3046, 2018.

Ivar Ekeland and Roger Temam. Convex analysis and variational problems. SIAM, 1999.

T. Ergen and M. Pilanci. Convex duality and cutting plane methods for over-parameterized neural networks. OPT-ML workshop, 2019.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

Robert M Freund, Paul Grigas, and Rahul Mazumder. An extended Frank–Wolfe method with "in-face" directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1):319–346, 2017.

Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.

Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Convex Analysis and Minimization Algorithms. Springer, 1993.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435, 2013.

Armand Joulin, Kevin Tang, and Li Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In European Conference on Computer Vision, pages 253–268. Springer, 2014.

Thomas Kerdreux, Igor Colin, and Alexandre d'Aspremont. An approximate Shapley-Folkman theorem. arXiv preprint arXiv:1712.08559, 2017.

Vera Kurková and Marcello Sanguineti. Bounds on rates of variable-basis and neural-network approximation. IEEE Transactions on Information Theory, 47(6):2659–2665, 2001.

Wee Sun Lee, Peter L Bartlett, and Robert C Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118–2132, 1996.

Claude Lemaréchal and Arnaud Renaud. A geometric study of duality gaps, with applications. Mathematical Programming, 90(3):399–427, 2001.

Evgeny S Levitin and Boris T Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.

Francesco Locatello, Rajiv Khanna, Michael Tschannen, and Martin Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. arXiv preprint arXiv:1702.06457, 2017a.

Francesco Locatello, Michael Tschannen, Gunnar Rätsch, and Martin Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems, pages 773–784, 2017b.

Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, and Josef Sivic. Learning from video and text via large-scale discriminative clustering. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5267–5276. IEEE, 2017.

Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K Dokania, and Simon Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. ICML 2016 International Conference on Machine Learning / arXiv preprint arXiv:1605.09346, 2016.

R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.

Saharon Rosset, Grzegorz Swirszcz, Nathan Srebro, and Ji Zhu. l1 regularization in infinite dimensional feature spaces. In International Conference on Computational Learning Theory, pages 544–558. Springer, 2007.

Neel Shah, Vladimir Kolmogorov, and Christoph H Lampert. A multi-plane block-coordinate Frank-Wolfe algorithm for training structural SVMs with a costly max-oracle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2737–2745, 2015.

Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. In Advances in Neural Information Processing Systems, pages 5514–5522, 2017.

Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and SQ lower bounds. arXiv preprint arXiv:1805.02677, 2018.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Hongyang Zhang, Junru Shao, and Ruslan Salakhutdinov. Deep neural networks with multi-branch architectures are intrinsically less non-convex. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1099–1109, 2019.

CNRS & D.I., UMR 8548, ÉCOLE NORMALE SUPÉRIEURE, PARIS, FRANCE.
E-mail address: [email protected]

ELECTRICAL ENGINEERING DEPARTMENT, STANFORD UNIVERSITY, STANFORD, CA 94305, USA.
E-mail address: [email protected]
