<<

Gradient Descent Learns One-hidden-layer CNN: Don’t be Afraid of Spurious Local Minima

Simon S. Du 1  Jason D. Lee 2  Yuandong Tian 3  Barnabás Póczos 1  Aarti Singh 1

1 Machine Learning Department, Carnegie Mellon University  2 Department of Data Sciences and Operations, University of Southern California  3 Facebook Artificial Intelligence Research. Correspondence to: Simon S. Du.

Abstract

We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(Z, w, a) = \sum_j a_j \sigma(w^\top Z_j)$, in which both the convolutional weights $w$ and the output weights $a$ are parameters to be learned. When the labels are the outputs from a teacher network of the same architecture with fixed weights $(w^*, a^*)$, we prove that with Gaussian input $Z$, there is a spurious local minimizer. Surprisingly, in the presence of the spurious local minimizer, gradient descent with weight normalization from randomly initialized weights can still be proven to recover the true parameters with constant probability, which can be boosted to probability 1 with multiple restarts. We also show that with constant probability, the same procedure could also converge to the spurious local minimum, showing that the local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics has two phases: it starts off slow, but converges much faster after several iterations.

1. Introduction

Deep convolutional neural networks (DCNN) have achieved the state-of-the-art performance in many applications such as computer vision (Krizhevsky et al., 2012), natural language processing (Dauphin et al., 2016) and reinforcement learning applied in classic games like Go (Silver et al., 2016). Despite the highly non-convex nature of the objective function, simple first-order algorithms like stochastic gradient descent and its variants often train such networks successfully. Why such simple methods are successful in learning DCNN remains elusive from the optimization perspective.

Recently, a line of research (Tian, 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Soltanolkotabi, 2017; Shalev-Shwartz et al., 2017b) assumed the input distribution is Gaussian and showed that stochastic gradient descent with random or 0 initialization is able to train a neural network $f(Z, \{w_j\}) = \sum_j a_j \sigma(w_j^\top Z)$ with ReLU activation $\sigma(x) = \max(x, 0)$ in polynomial time. However, these results all assume there is only one unknown layer $\{w_j\}$, while $a$ is a fixed vector. A natural question thus arises:

Does randomly initialized (stochastic) gradient descent learn neural networks with multiple layers?

In this paper, we take an important step by showing that randomly initialized gradient descent learns a non-linear convolutional neural network with two unknown layers $w$ and $a$. To our knowledge, our work is the first of its kind.

Formally, we consider the convolutional case in which a filter $w$ is shared among different hidden nodes. Let $x \in \mathbb{R}^d$ be an input sample, e.g., an image. We generate $k$ patches from $x$, each with size $p$: $Z \in \mathbb{R}^{p \times k}$, where the $i$-th column is the $i$-th patch generated by selecting some coordinates of $x$: $Z_i = Z_i(x)$. We further assume there is no overlap between patches. Thus, the neural network has the following form:
$$f(Z, w, a) = \sum_{i=1}^k a_i \sigma\left(w^\top Z_i\right).$$

We focus on the realizable case, i.e., the label is generated according to $y = f(Z, w^*, a^*)$ for some true parameters $w^*$ and $a^*$, and we use the $\ell_2$ loss to learn the parameters:
$$\min_{w, a}\ \ell(Z, w, a) := \frac{1}{2}\left(f(Z, w, a) - f(Z, w^*, a^*)\right)^2.$$

We assume $x$ is sampled from a Gaussian distribution and there is no overlap between patches. This assumption is equivalent to assuming that each entry of $Z$ is sampled from a Gaussian distribution (Brutzkus & Globerson, 2017; Zhong et al., 2017b). Following (Zhong et al., 2017a;b; Li & Yuan, 2017; Tian, 2017; Brutzkus & Globerson, 2017; Shalev-Shwartz et al., 2017b), in this paper we mainly focus on the population loss:
$$\ell(w, a) := \frac{1}{2}\mathbb{E}_Z\left[\left(f(Z, w, a) - f(Z, w^*, a^*)\right)^2\right].$$
We study whether the global convergence $w \to w^*$ and $a \to a^*$ can be achieved when optimizing $\ell(w, a)$ using randomly initialized gradient descent.
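To fix notation, the following is a minimal NumPy sketch of the non-overlapping architecture and the realizable ℓ2 loss defined above. The concrete values of p and k, and the random draws, are arbitrary and only for illustration.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(Z, w, a):
    """f(Z, w, a) = sum_i a_i * sigma(w^T Z_i), with Z of shape (p, k)."""
    return a @ relu(Z.T @ w)

def loss(Z, w, a, w_star, a_star):
    """Realizable l2 loss: the label is the output of a teacher network with (w*, a*)."""
    y = forward(Z, w_star, a_star)
    return 0.5 * (forward(Z, w, a) - y) ** 2

# Example: k non-overlapping patches of size p, each entry i.i.d. standard Gaussian.
p, k = 6, 20
rng = np.random.default_rng(0)
Z = rng.standard_normal((p, k))
w_star, a_star = rng.standard_normal(p), rng.standard_normal(k)
w, a = rng.standard_normal(p), rng.standard_normal(k)
print(loss(Z, w, a, w_star, a_star))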

(a) Convolutional neural network with an unknown non-overlapping filter and an unknown output layer. In the first (hidden) layer, a filter w is applied to non-overlapping parts of the input x, which then passes through a ReLU activation function. The final output is the inner product between an output weight vector a and the hidden layer outputs.

(b) The convergence of gradient descent (logarithm of prediction errors versus epochs) for learning a CNN described in Figure 1a with Gaussian input using different initializations. The success case and the failure case correspond to convergence to the global minimum and the spurious local minimum, respectively. In the first ∼ 50 iterations the convergence is slow. After that gradient descent converges at a fast linear rate.

Figure 1. Network architecture that we consider in this paper and convergence of gradient descent for learning the parameters of this network.

A crucial difference between our two-layer network and previous one-layer models is that there is a positive-homogeneity issue. That is, for any $c > 0$, $f\left(Z, cw, \frac{a}{c}\right) = f(Z, w, a)$. This interesting property allows the network to be rescaled without changing the function computed by the network. As reported by (Neyshabur et al., 2015), it is desirable to have a scaling-invariant learning rate to stabilize the training process.

One commonly used technique to achieve stability is weight normalization, introduced by Salimans & Kingma (2016). As reported in (Salimans & Kingma, 2016), this re-parametrization improves the conditioning of the gradient because it decouples the magnitude of the weight vector from its direction, and it empirically accelerates stochastic gradient descent optimization.

In our setting, we re-parametrize the first layer as $w = \frac{v}{\|v\|_2}$ and the prediction function becomes
$$f(Z, v, a) = \sum_{i=1}^k a_i \frac{\sigma\left(Z_i^\top v\right)}{\|v\|_2}. \quad (1)$$
The population loss is
$$\ell(v, a) = \frac{1}{2}\mathbb{E}_Z\left[\left(f(Z, v, a) - f(Z, v^*, a^*)\right)^2\right]. \quad (2)$$
In this paper we focus on using randomly initialized gradient descent to learn this convolutional neural network. The pseudo-code is listed in Algorithm 1.

Main Contributions. Our paper has three contributions. First, we show that if $(v, a)$ is initialized by a specific random initialization, then with high probability, gradient descent from $(v, a)$ converges to the teacher's parameters $(v^*, a^*)$.¹ We can further boost the success rate with more trials.

Second, perhaps surprisingly, we prove that the objective function (Equation (2)) does have a spurious local minimum: using the same random initialization scheme, there exists a pair $(\tilde{v}^0, \tilde{a}^0) \in S_{\pm}(v, a)$ so that gradient descent from $(\tilde{v}^0, \tilde{a}^0)$ converges to this bad local minimum. In contrast to previous works on guarantees for non-convex objective functions whose landscape satisfies the "no spurious local minima" property (Li et al., 2016; Ge et al., 2017a; 2016; Bhojanapalli et al., 2016; Ge et al., 2017b; Kawaguchi, 2016), our result provides a concrete counter-example and highlights a conceptually surprising phenomenon:

Randomly initialized local search can find a global minimum in the presence of spurious local minima.

¹With some simple calculations, we can see that the optimal solution for $a$ is unique, which we denote as $a^*$, whereas the optimal solution for $v$ is not unique, because for every optimal solution $v^*$, $cv^*$ for $c > 0$ is also an optimal solution. In this paper, with a little abuse of notation, we use $v^*$ to denote the equivalence class of optimal solutions.

Finally, we conduct a quantitative study of the dynamics of gradient descent. We show that the dynamics of Algorithm 1 has two phases. At the beginning (around the first 50 iterations in Figure 1b), because the magnitude of the initial signal (the angle between $v$ and $w^*$) is small, the prediction error drops slowly. After that, when the signal becomes stronger, gradient descent converges at a much faster rate and the prediction error drops quickly.

Algorithm 1 Gradient Descent for Learning One-Hidden-Layer CNN with Weight Normalization
1: Input: Initialization $v^0 \in \mathbb{R}^p$, $a^0 \in \mathbb{R}^k$, step size $\eta$.
2: for $t = 1, 2, \ldots$ do
3:   $v^{t+1} \leftarrow v^t - \eta\,\frac{\partial \ell(v^t, a^t)}{\partial v^t}$,
4:   $a^{t+1} \leftarrow a^t - \eta\,\frac{\partial \ell(v^t, a^t)}{\partial a^t}$.
5: end for
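A runnable sketch of Algorithm 1 with the weight-normalized prediction function (1) is given below. Since the population gradients are expectations over Gaussian inputs, the sketch approximates them by averaging the empirical gradient over a fresh batch of Gaussian patch matrices at every step; the batch size, step size, and iteration count are illustrative choices, not values prescribed by the paper.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def predict(Z, v, a):
    """Weight-normalized prediction f(Z, v, a) = sum_i a_i * sigma(Z_i^T v) / ||v||_2."""
    return a @ relu(Z.T @ v) / np.linalg.norm(v)

def grads(Z, v, a, y):
    """Gradients of 0.5 * (f(Z, v, a) - y)^2 w.r.t. v and a (ReLU subgradient 0 at 0)."""
    nv = np.linalg.norm(v)
    pre = Z.T @ v                       # pre-activations, shape (k,)
    h = relu(pre) / nv                  # hidden-layer output
    r = a @ h - y                       # prediction error
    grad_a = r * h
    grad_v = r * (Z @ (a * (pre > 0)) / nv - (a @ relu(pre)) * v / nv ** 3)
    return grad_v, grad_a

def gradient_descent(v0, a0, w_star, a_star, eta=0.05, steps=2000, batch=256, seed=0):
    """Algorithm 1 with the population gradient replaced by a fresh-batch estimate."""
    rng = np.random.default_rng(seed)
    p, k = len(v0), len(a0)
    v, a = v0.copy(), a0.copy()
    for _ in range(steps):
        gv, ga = np.zeros(p), np.zeros(k)
        for _ in range(batch):
            Z = rng.standard_normal((p, k))
            y = a_star @ relu(Z.T @ w_star)      # teacher label
            dv, da = grads(Z, v, a, y)
            gv += dv / batch
            ga += da / batch
        v, a = v - eta * gv, a - eta * ga
    return v, a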
Technical Insights. The main difficulty in analyzing the convergence is the presence of the local minimum. Note that the local minimum and the global minimum are disjoint (c.f. Figure 1b). The key technique we adopt is to characterize the attraction basin of each minimum. We consider the sequence $\{(v^t, a^t)\}_{t=0}^{\infty}$ generated by Algorithm 1 with step size $\eta$ from the initialization point $(v^0, a^0)$. The attraction basin of a minimum $(v^*, a^*)$ is defined as the set
$$B(v^*, a^*) = \left\{(v^0, a^0) : \lim_{t \to \infty}(v^t, a^t) \to (v^*, a^*)\right\}.$$
The goal is to find a distribution $\mathcal{G}$ for weight initialization so that the probability that the initial weights lie in $B(v^*, a^*)$ of the global minimum is bounded below:
$$\mathbb{P}_{(v^0, a^0) \sim \mathcal{G}}\left[B(v^*, a^*)\right] \ge c$$
for some absolute constant $c > 0$. While it is hard to characterize $B(v^*, a^*)$ directly, we find that the set $\widetilde{B}(v^*, a^*) \equiv \{(v^0, a^0) : (v^0)^\top v^* \ge 0,\ (a^0)^\top a^* \ge 0,\ |1^\top a^0| \le |1^\top a^*|\}$ is a subset of $B(v^*, a^*)$ (c.f. Lemma 5.2-Lemma 5.4). Furthermore, when the learning rate $\eta$ is sufficiently small, we can design a specific distribution $\mathcal{G}$ so that
$$\mathbb{P}_{(v^0, a^0) \sim \mathcal{G}}\left[B(v^*, a^*)\right] \ge \mathbb{P}_{(v^0, a^0) \sim \mathcal{G}}\left[\widetilde{B}(v^*, a^*)\right] \ge c.$$
This analysis emphasizes that for non-convex problems, we need to carefully characterize both the trajectory of the algorithm and the initialization. We believe that this idea is applicable to other non-convex problems.

To obtain the convergence rate, we propose a potential function (also called a Lyapunov function in the literature). For this problem we consider the quantity $\sin^2\phi^t$, where $\phi^t = \theta(v^t, v^*)$, and we show it shrinks at a geometric rate (c.f. Lemma 5.5).

Organization. This paper is organized as follows. In Section 3 we introduce the necessary notations and analytical formulas of the gradient updates in Algorithm 1. In Section 4, we provide our main theorems on the performance of the algorithm and their implications. In Section 6, we use simulations to verify our theories. In Section 5, we give a proof sketch of our main theorem. We conclude and list future directions in Section 7. We place most of our detailed proofs in the appendix.

2. Related Works

From the point of view of learning theory, it is well known that training a neural network is hard in the worst case (Blum & Rivest, 1989; Livni et al., 2014; Šíma, 2002; Shalev-Shwartz et al., 2017a;b), and recently Shamir (2016) showed that assumptions on both the target function and the input distribution are needed for optimization algorithms used in practice to succeed.

Solve NN without gradient descent. With some additional assumptions, many works tried to design algorithms that provably learn a neural network with polynomial time and sample complexity (Goel et al., 2016; Zhang et al., 2015; Sedghi & Anandkumar, 2014; Janzamin et al., 2015; Goel & Klivans, 2017a;b). However, these algorithms are specially designed for certain architectures and cannot explain why (stochastic) gradient based optimization algorithms work well in practice.

Gradient-based optimization with Gaussian input. Focusing on gradient-based algorithms, a line of research analyzed the behavior of (stochastic) gradient descent for the Gaussian input distribution. Tian (2017) showed that population gradient descent is able to find the true weight vector with random initialization for a one-layer one-neuron model. Soltanolkotabi (2017) later improved this result by showing that the true weights can be exactly recovered by empirical projected gradient descent with enough samples in linear time. Brutzkus & Globerson (2017) showed population gradient descent recovers the true weights of a convolution filter with non-overlapping input in polynomial time. Zhong et al. (2017b;a) proved that with a sufficiently good initialization, which can be implemented by the tensor method, gradient descent can find the true weights of a one-hidden-layer fully connected and convolutional neural network. Li & Yuan (2017) showed SGD can recover the true weights of a one-layer ResNet model with ReLU activation under the assumption that the spectral norm of the true weights is within a small constant of the identity mapping. Panigrahy et al. (2018) also analyzed gradient descent for learning a two-layer neural network but with different activation functions. This paper also follows this line of approach that studies the behavior of the gradient descent algorithm with Gaussian inputs.

Local minimum and global minimum. Finding the optimal weights of a neural network is a non-convex problem. Recently, researchers found that if the objective function satisfies the following two key properties, (1) all saddle points and local maxima are strict (i.e., there exists a direction with negative curvature), and (2) all local minima are global (no spurious local minimum), then perturbed (stochastic) gradient descent (Ge et al., 2015) or methods with second order information (Carmon et al., 2016; Agarwal et al., 2017) can find a global minimum in polynomial time.² Combined with geometric analyses, these algorithmic results have shown that a large number of problems, including tensor decomposition (Ge et al., 2015), dictionary learning (Sun et al., 2017), matrix sensing (Bhojanapalli et al., 2016; Park et al., 2017), matrix completion (Ge et al., 2017a; 2016) and matrix factorization (Li et al., 2016), can be solved in polynomial time with local search algorithms.

This motivates the research of studying the landscape of neural networks (Kawaguchi, 2016; Choromanska et al., 2015; Hardt & Ma, 2016; Haeffele & Vidal, 2015; Mei et al., 2016; Freeman & Bruna, 2016; Safran & Shamir, 2016; Zhou & Feng, 2017; Nguyen & Hein, 2017a;b; Ge et al., 2017b; Safran & Shamir, 2017). In particular, Kawaguchi (2016); Hardt & Ma (2016); Zhou & Feng (2017); Nguyen & Hein (2017a;b); Feizi et al. (2017) showed that under some conditions, all local minima are global. Recently, Ge et al. (2017b) showed that, using a modified objective function satisfying the two properties above, a one-hidden-layer neural network can be learned by noisy perturbed gradient descent. However, for the nonlinear activation function, when the number of samples is larger than the number of nodes at every layer, which is usually the case in most deep neural networks, and for natural objective functions like $\ell_2$, it is still unclear whether the strict saddle and "all locals are global" properties are satisfied. In this paper, we show that even for a one-hidden-layer neural network with ReLU activation, there exists a spurious local minimum. However, we further show that randomly initialized local search can achieve the global minimum with constant probability.

²Lee et al. (2016) showed vanilla gradient descent only converges to minimizers, with no convergence rate guarantees. Recently, Du et al. (2017a) gave an exponential time lower bound for vanilla gradient descent. In this paper, we give a polynomial convergence guarantee for vanilla gradient descent.

3. Preliminaries

We use bold-faced letters for vectors and matrices. We use $\|\cdot\|_2$ to denote the Euclidean norm of a finite-dimensional vector. We let $w^t$ and $a^t$ be the parameters at the $t$-th iteration and $w^*$ and $a^*$ be the optimal weights. For two vectors $w_1$ and $w_2$, we use $\theta(w_1, w_2)$ to denote the angle between them. $a_i$ is the $i$-th coordinate of $a$, and $Z_i$ is the transpose of the $i$-th row of $Z$ (thus a column vector). We denote by $\mathbb{S}^{p-1}$ the $(p-1)$-dimensional unit sphere and by $B(0, r)$ the ball centered at $0$ with radius $r$.

In this paper we assume every patch $Z_i$ is a vector of i.i.d. Gaussian random variables. The following theorem gives an explicit formula for the population loss. The proof uses the basic rotational invariance property and the polar decomposition of Gaussian random variables. See Section A for details.

Theorem 3.1. If every entry of $Z$ is i.i.d. sampled from a Gaussian distribution with mean 0 and variance 1, then the population loss is
$$\ell(v, a) = \frac{1}{2}\Bigg[\frac{(\pi - 1)\|w^*\|_2^2}{2\pi}\|a^*\|_2^2 + \frac{\pi - 1}{2\pi}\|a\|_2^2 - \frac{2(g(\phi) - 1)\|w^*\|_2}{2\pi}a^\top a^* + \frac{\|w^*\|_2^2}{2\pi}\left(1^\top a^*\right)^2 + \frac{1}{2\pi}\left(1^\top a\right)^2 - \frac{2\|w^*\|_2}{2\pi}\,1^\top a \cdot 1^\top a^*\Bigg] \quad (3)$$
where $\phi = \theta(v, w^*)$ and $g(\phi) = (\pi - \phi)\cos\phi + \sin\phi$.

Using similar techniques, we can show the gradient also has an analytical form.

Theorem 3.2. Suppose every entry of $Z$ is i.i.d. sampled from a Gaussian distribution with mean 0 and variance 1. Denote $\phi = \theta(w, w^*)$. Then the expected gradients with respect to $v$ and $a$ can be written as
$$\mathbb{E}_Z\left[\frac{\partial \ell(Z, v, a)}{\partial v}\right] = -\frac{1}{2\pi\|v\|_2}\left(I - \frac{vv^\top}{\|v\|_2^2}\right)a^\top a^*\,(\pi - \phi)\,w^*,$$
$$\mathbb{E}_Z\left[\frac{\partial \ell(Z, v, a)}{\partial a}\right] = \frac{1}{2\pi}\left(11^\top + (\pi - 1)I\right)a - \frac{1}{2\pi}\left(11^\top + (g(\phi) - 1)I\right)\|w^*\|_2\,a^*.$$

As a remark, if the second layer is fixed, upon proper scaling, the formulas for the population loss and the gradient of $v$ are equivalent to the corresponding formulas derived in (Brutzkus & Globerson, 2017; Cho & Saul, 2009). However, when the second layer is not fixed, the gradient of $v$ depends on $a^\top a^*$, which plays an important role in deciding whether gradient descent converges to the global or the local minimum.
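The closed form in Theorem 3.1 is easy to sanity-check numerically. The sketch below compares Equation (3) with a Monte Carlo average of the squared prediction error over Gaussian patches; the dimensions and sample size are arbitrary choices, and the two numbers should agree only up to Monte Carlo error.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def population_loss_closed_form(v, a, w_star, a_star):
    """Equation (3): closed-form population loss for Gaussian patches."""
    nw = np.linalg.norm(w_star)
    phi = np.arccos(np.clip(v @ w_star / (np.linalg.norm(v) * nw), -1.0, 1.0))
    g = (np.pi - phi) * np.cos(phi) + np.sin(phi)
    return 0.5 / (2 * np.pi) * (
        (np.pi - 1) * nw ** 2 * (a_star @ a_star)
        + (np.pi - 1) * (a @ a)
        - 2 * (g - 1) * nw * (a @ a_star)
        + nw ** 2 * np.sum(a_star) ** 2
        + np.sum(a) ** 2
        - 2 * nw * np.sum(a) * np.sum(a_star)
    )

def population_loss_monte_carlo(v, a, w_star, a_star, n=100_000, seed=0):
    """Monte Carlo estimate of 0.5 * E_Z[(f(Z, v, a) - f(Z, w*, a*))^2]."""
    rng = np.random.default_rng(seed)
    p, k = len(v), len(a)
    Z = rng.standard_normal((n, p, k))
    student = relu(np.einsum('npk,p->nk', Z, v)) @ a / np.linalg.norm(v)
    teacher = relu(np.einsum('npk,p->nk', Z, w_star)) @ a_star
    return 0.5 * np.mean((student - teacher) ** 2)

rng = np.random.default_rng(1)
p, k = 5, 8
v, a = rng.standard_normal(p), rng.standard_normal(k)
w_star, a_star = rng.standard_normal(p), rng.standard_normal(k)
print(population_loss_closed_form(v, a, w_star, a_star))
print(population_loss_monte_carlo(v, a, w_star, a_star))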

4. Main Result

We begin with our main theorem about the convergence of gradient descent.

Theorem 4.1. Suppose the initialization satisfies $(a^0)^\top a^* > 0$, $|1^\top a^0| \le |1^\top a^*|$, $\phi^0 < \pi/2$, and the step size satisfies
$$\eta = O\left(\min\left\{\frac{(a^0)^\top a^* \cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{(g(\phi^0) - 1)\|a^*\|_2^2\cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{\cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{1}{k}\right\}\right).$$
Then the convergence of gradient descent has two phases.

(Phase I: Slow Initial Rate) There exists $T_1 = O\left(\frac{1}{\eta\cos\phi^0\beta^0} + \frac{1}{\eta}\right)$ such that $\phi^{T_1} = \Theta(1)$ and $(a^{T_1})^\top a^*\|w^*\|_2 = \Theta\left(\|a^*\|_2^2\|w^*\|_2^2\right)$, where $\beta^0 = \min\left\{(a^0)^\top a^*,\ (g(\phi^0) - 1)\|a^*\|_2^2\right\}\|w^*\|_2^2$.

(Phase II: Fast Rate) Suppose at the $T_1$-th iteration $\phi^{T_1} = \Theta(1)$ and $(a^{T_1})^\top a^*\|w^*\|_2 = \Theta\left(\|a^*\|_2^2\|w^*\|_2^2\right)$; then there exists $T_2 = \widetilde{O}\left(\frac{1}{\eta\|w^*\|_2^2\|a^*\|_2^2} + \frac{1}{\eta}\log\frac{1}{\epsilon}\right)$³ such that $\ell\left(v^{T_1+T_2}, a^{T_1+T_2}\right) \le \epsilon\,\|w^*\|_2^2\|a^*\|_2^2$.

Theorem 4.1 shows that under certain conditions on the initialization, gradient descent converges to the global minimum. The convergence has two phases: at the beginning, because the initial signal ($\cos\phi^0\beta^0$) is small, the convergence is quite slow; after $T_1$ iterations, the signal becomes stronger and we enter a regime with a faster convergence rate. See Lemma 5.5 for technical details.

Initialization plays an important role in the convergence. First, Theorem 4.1 needs the initialization to satisfy $(a^0)^\top a^* > 0$, $|1^\top a^0| \le |1^\top a^*|$ and $\phi^0 < \pi/2$. Second, the step size $\eta$ and the convergence rate in the first phase also depend on the initialization. If the initial signal is very small, for example $\phi^0 \approx \pi/2$, which makes $\cos\phi^0$ close to 0, we can only choose a very small step size, and because $T_1$ depends on the inverse of $\cos\phi^0$, we need a large number of iterations to enter phase II. We provide the following initialization scheme, which ensures the conditions required by Theorem 4.1 and a large enough initial signal.

Theorem 4.2. Let $v \sim \mathrm{unif}\left(\mathbb{S}^{p-1}\right)$ and $a \sim \mathrm{unif}\left(B\left(0, \frac{|1^\top a^*|\,\|w^*\|_2}{\sqrt{k}}\right)\right)$. Then there exists $(v^0, a^0) \in \{(v, a), (v, -a), (-v, a), (-v, -a)\}$ such that $(a^0)^\top a^* > 0$, $|1^\top a^0| \le |1^\top a^*|$ and $\phi^0 < \pi/2$. Further, with high probability, the initialization satisfies $(a^0)^\top a^*\|w^*\|_2 = \Theta\left(\frac{|1^\top a^*|\,\|a^*\|_2\|w^*\|_2^2}{k}\right)$ and $\phi^0 = \Theta\left(\frac{1}{\sqrt{p}}\right)$.

Theorem 4.2 shows that after generating a pair of random vectors $(v, a)$ and trying out all 4 sign combinations of $(v, a)$, we can find the global minimum by gradient descent. Further, because the initial signal is not too small, we only need to set the step size to be $O(1/\mathrm{poly}(k, p, \|w^*\|_2\|a\|_2))$, and the number of iterations in phase I is at most $O(\mathrm{poly}(k, p, \|w^*\|_2\|a\|_2))$. Therefore, Theorem 4.1 and Theorem 4.2 together show that randomly initialized gradient descent learns a one-hidden-layer convolutional neural network in polynomial time. The proof of the first part of Theorem 4.2 uses the symmetry of the unit sphere and the ball, and the second part is a standard application of random vectors in high-dimensional spaces; see Lemma 2.5 of (Hardt & Price, 2014) for example.

Remark 1: For the second layer we use an $O\left(\frac{1}{\sqrt{k}}\right)$-type initialization, in line with common initialization techniques (Glorot & Bengio, 2010; He et al., 2015; LeCun et al., 1998).

Remark 2: The Gaussian input assumption is not necessarily true in practice, although it is a common assumption in previous papers (Brutzkus & Globerson, 2017; Li & Yuan, 2017; Zhong et al., 2017a;b; Tian, 2017; Xie et al., 2017; Shalev-Shwartz et al., 2017b) and is also considered plausible in (Choromanska et al., 2015). Our result can be easily generalized to rotation-invariant distributions. However, extending it to more general distributional assumptions, e.g., the structural conditions used in (Du et al., 2017b), remains a challenging open problem.

Remark 3: Since we only require the initialization to be smaller than certain quantities of $a^*$ and $w^*$, in practice, if the optimization fails, i.e., the initialization is too large, one can halve the initialization size, and eventually these conditions will be met.

³$\widetilde{O}(\cdot)$ hides logarithmic factors in $|1^\top a^*|\,\|w^*\|_2$ and $\|a^*\|_2\|w^*\|_2$.
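The initialization scheme of Theorem 4.2 can be sketched as follows: draw one direction uniformly from the unit sphere and one second-layer vector uniformly from a small ball, and try all four sign combinations. The ball radius below follows the theorem and therefore involves the (in practice unknown) teacher quantities; in an actual run this scale would be treated as a tuning parameter, halved when needed as in Remark 3.

import numpy as np

def unit_sphere(p, rng):
    """Uniform sample from the (p-1)-dimensional unit sphere."""
    v = rng.standard_normal(p)
    return v / np.linalg.norm(v)

def uniform_ball(k, radius, rng):
    """Uniform sample from the k-dimensional ball of the given radius."""
    u = rng.standard_normal(k)
    u /= np.linalg.norm(u)
    return radius * u * rng.uniform() ** (1.0 / k)

def candidate_initializations(p, k, a_star, w_star, rng):
    """The four sign combinations {(+-v, +-a)} considered in Theorem 4.2."""
    radius = abs(np.sum(a_star)) * np.linalg.norm(w_star) / np.sqrt(k)
    v, a = unit_sphere(p, rng), uniform_ball(k, radius, rng)
    return [(s_v * v, s_a * a) for s_v in (1.0, -1.0) for s_a in (1.0, -1.0)]

Theorem 4.2 guarantees that at least one of the four candidates lands in the attraction basin of the global minimum, so one can run Algorithm 1 from each candidate and keep the run with the smallest final loss.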
4.1. Gradient Descent Can Converge to the Spurious Local Minimum

Theorem 4.2 shows that among $\{(v, a), (v, -a), (-v, a), (-v, -a)\}$, there is a pair that enables gradient descent to converge to the global minimum. Perhaps surprisingly, the next theorem shows that under some conditions on the underlying truth, there is also a pair that makes gradient descent converge to the spurious local minimum.

Theorem 4.3. Without loss of generality, we let $\|w^*\|_2 = 1$. Suppose $\left(1^\top a^*\right)^2 < \frac{1}{\mathrm{poly}(p)}\|a^*\|_2^2$ and $\eta$ is sufficiently small. Let $v \sim \mathrm{unif}\left(\mathbb{S}^{p-1}\right)$ and $a \sim \mathrm{unif}\left(B\left(0, \frac{|1^\top a^*|}{\sqrt{k}}\right)\right)$. Then with high probability, there exists $(v^0, a^0) \in \{(v, a), (v, -a), (-v, a), (-v, -a)\}$ such that $(a^0)^\top a^* < 0$, $|1^\top a^0| \le |1^\top a^*|$ and $g(\phi^0) \le \frac{-2(1^\top a^*)^2}{\|a^*\|_2^2} + 1$. If $(v^0, a^0)$ is used as the initialization, when Algorithm 1 converges we have
$$\theta(v, w^*) = \pi,\quad a = \left(11^\top + (\pi - 1)I\right)^{-1}\left(11^\top - I\right)a^*,\quad \text{and}\quad \ell(v, a) = \Omega\left(\|a^*\|_2^2\right).$$

Unlike Theorem 4.1, which requires no assumption on the underlying truth $a^*$, Theorem 4.3 assumes $\left(1^\top a^*\right)^2 < \frac{1}{\mathrm{poly}(p)}\|a^*\|_2^2$. This technical condition comes from the proof, which requires the invariance $g(\phi^t) \le \frac{-2(1^\top a^*)^2}{\|a^*\|_2^2} + 1$ to hold for all iterations. To ensure there exists $(v^0, a^0)$ which makes $g(\phi^0) \le \frac{-2(1^\top a^*)^2}{\|a^*\|_2^2} + 1$, we need $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ to be relatively small. See Section E for more technical insights.

A natural question is whether, as the ratio $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ becomes larger, the probability that randomly initialized gradient descent converges to the global minimum becomes larger as well. We verify this phenomenon empirically in Section 6.

5. Proof Sketch

In Section 5.1, we give qualitative, high-level intuition on why the initial conditions are sufficient for gradient descent to converge to the global minimum. In Section 5.2, we explain why gradient descent has two phases.

5.1. Qualitative Analysis of Convergence

The convergence to the global optimum relies on a geometric characterization of saddle points and a series of invariances throughout the gradient descent dynamics. The next lemma gives the analysis of stationary points. The main step is to check the first order condition of stationary points using Theorem 3.2.

Lemma 5.1 (Stationary Point Analysis). When gradient descent converges, if $a^\top a^* \ne 0$ and $\|v\|_2 < \infty$, we have either
$$\theta(v, w^*) = 0,\quad a = \|w^*\|_2\,a^*$$
or
$$\theta(v, w^*) = \pi,\quad a = \left(11^\top + (\pi - 1)I\right)^{-1}\left(11^\top - I\right)\|w^*\|_2\,a^*.$$

This lemma shows that when the algorithm converges and $a$ and $a^*$ are not orthogonal, then we arrive at either a globally optimal point or a local minimum.

Now recall the gradient formula of $v$: $\frac{\partial \ell(v, a)}{\partial v} = -\frac{1}{2\pi\|v\|_2}\left(I - \frac{vv^\top}{\|v\|_2^2}\right)a^\top a^*(\pi - \phi)w^*$. Notice that $\phi \le \pi$ and $\left(I - \frac{vv^\top}{\|v\|_2^2}\right)$ is just the projection matrix onto the complement of $v$. Therefore, the sign of the inner product between $a$ and $a^*$ plays a crucial role in the dynamics of Algorithm 1: if the inner product is positive, the gradient update will decrease the angle between $v$ and $w^*$, and if it is negative, the angle will increase. This observation is formalized in the lemma below.

Lemma 5.2 (Invariance I: The Angle between $v$ and $w^*$ always decreases.). If $(a^t)^\top a^* > 0$, then $\phi^{t+1} \le \phi^t$.

This lemma shows that when $(a^t)^\top a^* > 0$ for all $t$, gradient descent converges to the global minimum. Thus, we need to study the dynamics of $(a^t)^\top a^*$. For ease of presentation, without loss of generality, we assume $\|w^*\|_2 = 1$. By the gradient formula of $a$, we have
$$(a^{t+1})^\top a^* = \left(1 - \frac{\eta(\pi - 1)}{2\pi}\right)(a^t)^\top a^* + \frac{\eta\left(g(\phi^t) - 1\right)}{2\pi}\|a^*\|_2^2 + \frac{\eta}{2\pi}\left(\left(1^\top a^*\right)^2 - 1^\top a^t \cdot 1^\top a^*\right). \quad (4)$$
We can use induction to prove the invariance. If $(a^t)^\top a^* > 0$ and $\phi^t < \frac{\pi}{2}$, the first term of Equation (4) is non-negative. For the second term, notice that if $\phi^t < \frac{\pi}{2}$, we have $g(\phi^t) > 1$, so the second term is non-negative. Therefore, as long as $\left(1^\top a^*\right)^2 - 1^\top a^t \cdot 1^\top a^*$ is also non-negative, we have the desired invariance. The next lemma summarizes the above analysis.

Lemma 5.3 (Invariance II: Positive Signal from the Second Layer.). If $(a^t)^\top a^* > 0$, $0 \le 1^\top a^* \cdot 1^\top a^t \le \left(1^\top a^*\right)^2$, $0 < \phi^t < \pi/2$ and $\eta < 2$, then $(a^{t+1})^\top a^* > 0$.

It remains to prove $\left(1^\top a^*\right)^2 - 1^\top a^t \cdot 1^\top a^* > 0$. Again, we study the dynamics of this quantity. Using the gradient formula and some algebra, we have
$$1^\top a^{t+1} \cdot 1^\top a^* \le \left(1 - \frac{\eta(k + \pi - 1)}{2\pi}\right)1^\top a^t \cdot 1^\top a^* + \frac{\eta\left(k + g(\phi^t) - 1\right)}{2\pi}\left(1^\top a^*\right)^2 \le \left(1 - \frac{\eta(k + \pi - 1)}{2\pi}\right)1^\top a^t \cdot 1^\top a^* + \frac{\eta(k + \pi - 1)}{2\pi}\left(1^\top a^*\right)^2,$$
where we have used the fact that $g(\phi) \le \pi$ for all $0 \le \phi \le \frac{\pi}{2}$. Therefore we have
$$\left(1^\top a^*\right)^2 - 1^\top a^{t+1} \cdot 1^\top a^* \ge \left(1 - \frac{\eta(k + \pi - 1)}{2\pi}\right)\left(\left(1^\top a^*\right)^2 - 1^\top a^t \cdot 1^\top a^*\right).$$
These imply the third invariance.

Lemma 5.4 (Invariance III: Summation of Second Layer Always Small.). If $1^\top a^* \cdot 1^\top a^t \le \left(1^\top a^*\right)^2$ and $\eta < \frac{2\pi}{k + \pi - 1}$, then $1^\top a^* \cdot 1^\top a^{t+1} \le \left(1^\top a^*\right)^2$.
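The three invariances are easy to monitor numerically along a run of Algorithm 1. The sketch below computes, for a single iterate (v, a), the quantities tracked by Lemmas 5.2-5.4; calling it at every iteration (e.g., inside the gradient-descent loop sketched earlier) and checking that the angle is non-increasing, the inner product stays positive, and the summation quantity stays non-negative gives a direct empirical check of the induction argument. The function itself is our own illustration, not part of the paper.

import numpy as np

def invariance_report(v, a, w_star, a_star):
    """Quantities tracked by the three invariances (Lemmas 5.2-5.4).

    Returns (phi, inner, sum_gap) where
      phi     = theta(v, w*)                  (Invariance I: non-increasing over t),
      inner   = a^T a*                        (Invariance II: stays positive),
      sum_gap = (1^T a*)^2 - 1^T a* * 1^T a   (Invariance III: stays non-negative).
    """
    cos_phi = v @ w_star / (np.linalg.norm(v) * np.linalg.norm(w_star))
    phi = np.arccos(np.clip(cos_phi, -1.0, 1.0))
    inner = a @ a_star
    sum_gap = np.sum(a_star) ** 2 - np.sum(a_star) * np.sum(a)
    return phi, inner, sum_gap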

To sum up, if the initialization satisfies (1) $\phi^0 < \frac{\pi}{2}$, (2) $(a^0)^\top a^* > 0$ and (3) $1^\top a^* \cdot 1^\top a^0 \le \left(1^\top a^*\right)^2$, then with Lemmas 5.2, 5.3 and 5.4, by induction we can show convergence to the global minimum. Further, Theorem 4.2 shows that these three conditions hold with constant probability under random initialization.

5.2. Quantitative Analysis of the Two Phase Phenomenon

In this section we demonstrate why there is a two-phase phenomenon. Throughout this section, we assume the conditions in Section 5.1 hold. We first consider the convergence of the first layer. Because we are using weight normalization, only the angle between $v$ and $w^*$ affects the prediction. Therefore, in this paper, we study the dynamics of $\sin^2\phi^t$. The following lemma quantitatively characterizes the shrinkage of this quantity in one iteration.

Lemma 5.5 (Convergence of Angle between $v$ and $w^*$). Under the same assumptions as in Theorem 4.1, let $\beta^0 = \min\left\{(a^0)^\top a^*,\ (g(\phi^0) - 1)\|a^*\|_2^2\right\}\|w^*\|_2^2$. If the step size satisfies
$$\eta = O\left(\min\left\{\frac{\beta^0\cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{\cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{1}{k}\right\}\right),$$
we have
$$\sin^2\phi^{t+1} \le \left(1 - \eta\cos\phi^t\,\lambda^t\right)\sin^2\phi^t,\qquad \text{where}\quad \lambda^t = \frac{\|w^*\|_2\,(a^t)^\top a^*\,(\pi - \phi^t)}{2\pi\|v^t\|_2^2}.$$

This lemma shows the convergence rate depends on two crucial quantities, $\cos\phi^t$ and $\lambda^t$. At the beginning, both $\cos\phi^t$ and $\lambda^t$ are small. Nevertheless, Lemma C.3 shows $\lambda^t$ is universally lower bounded by $\Omega(\beta^0)$. Therefore, after $O\left(\frac{1}{\eta\cos\phi^0\beta^0}\right)$ iterations we have $\cos\phi^t = \Omega(1)$. Once $\cos\phi^t = \Omega(1)$, Lemma C.2 shows that, after $O\left(\frac{1}{\eta}\right)$ iterations, $(a^t)^\top a^*\|w^*\|_2 = \Omega\left(\|w^*\|_2^2\|a^*\|_2^2\right)$. Combining this with the facts that $\|v^t\|_2 \le 2$ (Lemma C.3) and $\phi^t < \pi/2$, we have $\cos\phi^t\lambda^t = \Omega\left(\|w^*\|_2^2\|a^*\|_2^2\right)$. Now we enter phase II.

In phase II, Lemma 5.5 shows
$$\sin^2\phi^{t+1} \le \left(1 - \eta C\|w^*\|_2^2\|a^*\|_2^2\right)\sin^2\phi^t$$
for some positive absolute constant $C$. Therefore, we have a much faster convergence rate than that in Phase I. After only $\widetilde{O}\left(\frac{1}{\eta\|w^*\|_2^2\|a\|_2^2}\log\frac{1}{\epsilon}\right)$ iterations, we obtain $\phi^t \le \epsilon$. Once we have this, we can use Lemma C.4 to show $\left|1^\top a - 1^\top a^*\right| \le O\left(\epsilon\|a^*\|_2\right)$ after $\widetilde{O}\left(\frac{1}{\eta k}\log\frac{1}{\epsilon}\right)$ iterations. Next, using Lemma C.5, we can show that after $\widetilde{O}\left(\frac{1}{\eta}\log\frac{1}{\epsilon}\right)$ iterations, $\|a - a^*\|_2 = O\left(\epsilon\|a^*\|_2\right)$. Lastly, Lemma C.6 shows that if $\|a - a^*\|_2 = O\left(\epsilon\|a^*\|_2\right)$ and $\phi = O(\epsilon)$, we have $\ell(v, a) = O\left(\epsilon^2\|a^*\|_2^2\right)$.

6. Experiments

In this section, we illustrate our theoretical results with numerical experiments. Again, without loss of generality, we assume $\|w^*\|_2 = 1$ in this section.

6.1. Multi-phase Phenomenon

In Figure 2, we set $k = 20$, $p = 25$ and we consider 4 key quantities in proving Theorem 4.1, namely, the angle between $v$ and $w^*$ (c.f. Lemma 5.5), $\|a - a^*\|_2$ (c.f. Lemma C.5), $\left|1^\top a - 1^\top a^*\right|$ (c.f. Lemma C.4) and the prediction error (c.f. Lemma C.6).

When we achieve the global minimum, all these quantities are 0. At the beginning (first ∼ 10 iterations), $\left|1^\top a - 1^\top a^*\right|$ and the prediction error drop quickly. This is because, for the gradient of $a$, $11^\top a^*$ is the dominating term, which will quickly make $11^\top a$ close to $11^\top a^*$. After that, for the next ∼ 200 iterations, all quantities decrease at a slow rate. This corresponds to the Phase I stage in Theorem 4.1; the rate is slow because the initial signal is small. After ∼ 200 iterations, all quantities drop at a much faster rate. This is because the signal is now strong and, since the convergence rate is proportional to this signal, we have a much faster convergence rate (c.f. Phase II of Theorem 4.1).

Figure 2. Convergence of the different measures we considered in proving Theorem 4.1. In the first ∼ 200 iterations, all quantities drop slowly. After that, these quantities converge at much faster linear rates.

6.2. Probability of Converging to the Global Minimum

In this section we test the probability of converging to the global minimum using the random initialization scheme described in Theorem 4.2. We set $p = 6$ and vary $k$ and $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$. We run 5000 random initializations for each $\left(k, \frac{(1^\top a^*)^2}{\|a^*\|_2^2}\right)$ and compute the probability of converging to the global minimum.

In Theorem 4.3, we showed that if $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ is sufficiently small, randomly initialized gradient descent converges to the spurious local minimum with constant probability. Table 1 empirically verifies the importance of this assumption.

(1ᵀa*)²/‖a*‖₂²:    0     1     4     9     16    25
k = 25           0.50  0.55  0.73  1     1     1
k = 36           0.50  0.53  0.66  0.89  1     1
k = 49           0.50  0.53  0.61  0.78  1     1
k = 64           0.50  0.51  0.59  0.71  0.89  1
k = 81           0.50  0.53  0.57  0.66  0.81  0.97
k = 100          0.50  0.50  0.57  0.63  0.75  0.90

Table 1. Probability of converging to the global minimum with different $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ and $k$. For every fixed $k$, when $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ becomes larger, the probability of converging to the global minimum becomes larger, and for every fixed ratio $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$, when $k$ becomes larger, the probability of converging to the global minimum becomes smaller.
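A sketch of the experiment behind Table 1: for a given k and target ratio (1ᵀa*)²/‖a*‖₂², draw many random initializations, run gradient descent from each, and record the fraction of runs whose final angle θ(v, w*) is close to 0 rather than π. The helper names (`uniform_ball`, `gradient_descent`) refer to the earlier sketches; the trial count, the success threshold, and the small floor on the ball radius (needed when 1ᵀa* = 0) are our own illustrative choices, not the paper's settings.

import numpy as np

def make_a_star(k, ratio, rng):
    """Construct a* with ||a*||_2 = 1 and (1^T a*)^2 = ratio (requires ratio <= k)."""
    alpha = np.sqrt(ratio / k)
    u = rng.standard_normal(k)
    u -= u.mean()                          # make u orthogonal to the all-ones vector
    u /= np.linalg.norm(u)
    return alpha * np.ones(k) / np.sqrt(k) + np.sqrt(1.0 - alpha ** 2) * u

def success_probability(p, k, ratio, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(p)
    w_star /= np.linalg.norm(w_star)
    a_star = make_a_star(k, ratio, rng)
    radius = max(abs(np.sum(a_star)) / np.sqrt(k), 0.1)   # hypothetical floor when 1^T a* = 0
    hits = 0
    for _ in range(trials):
        v0 = rng.standard_normal(p)
        v0 /= np.linalg.norm(v0)
        a0 = uniform_ball(k, radius, rng)                 # helper from the earlier sketch
        v, a = gradient_descent(v0, a0, w_star, a_star)   # helper from the earlier sketch
        cos_phi = v @ w_star / np.linalg.norm(v)
        hits += cos_phi > 0.9          # angle near 0: global minimum; near pi: spurious one
    return hits / trials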

For every fixed $k$, if $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ becomes larger, the probability of converging to the global minimum becomes larger.

An interesting phenomenon is that for every fixed ratio $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$, when $k$ becomes larger, the probability of converging to the global minimum becomes smaller. How to quantitatively characterize the relationship between the success probability and the dimension of the second layer is an open problem.

7. Conclusion and Future Works

In this paper we proved the first polynomial convergence guarantee of a randomly initialized gradient descent algorithm for learning a one-hidden-layer convolutional neural network. Our result reveals an interesting phenomenon: randomly initialized local search can converge to a global minimum or a spurious local minimum. We give a quantitative characterization of the gradient descent dynamics to explain the two-phase convergence phenomenon. Experimental results also verify our theoretical findings. Here we list some future directions.

Our analysis focused on the population loss with Gaussian input. In practice one uses (stochastic) gradient descent on the empirical loss. Concentration results in (Mei et al., 2016; Soltanolkotabi, 2017) are useful to generalize our results to the empirical version. A more challenging question is how to extend the analysis of the gradient dynamics beyond rotationally invariant input distributions. Du et al. (2017b) proved the convergence of gradient descent under some structural input distribution assumptions for the one-layer convolutional neural network. It would be interesting to bring their insights to our setting.

Another interesting direction is to generalize our result to deeper and wider architectures. Specifically, an open problem is under what conditions randomly initialized gradient descent algorithms can learn a one-hidden-layer fully connected neural network or a convolutional neural network with multiple kernels. Existing results often require sufficiently good initialization (Zhong et al., 2017a;b). We believe the insights from this paper, especially the invariance principles in Section 5.1, are helpful for understanding the behavior of gradient-based algorithms in these settings.

8. Acknowledgment

This research was partly funded by NSF grant IIS1563887, AFRL grant FA8750-17-2-0212 and DARPA D17AP00001. J.D.L. acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. The authors thank Xiaolong Wang and Kai Zhong for useful discussions.

References

Agarwal, Naman, Allen-Zhu, Zeyuan, Bullins, Brian, Hazan, Elad, and Ma, Tengyu. Finding approximate local minima faster than gradient descent. In STOC, 2017. Full version available at http://arxiv.org/abs/1611.01146.

Bhojanapalli, Srinadh, Neyshabur, Behnam, and Srebro, Nati. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pp. 3873–3881, 2016.

Blum, Avrim and Rivest, Ronald L. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pp. 494–501, 1989.

Brutzkus, Alon and Globerson, Amir. Globally optimal gradient descent for a ConvNet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

Carmon, Yair, Duchi, John C, Hinder, Oliver, and Sidford, Aaron. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

Cho, Youngmin and Saul, Lawrence K. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.

Dauphin, Yann N, Fan, Angela, Auli, Michael, and Grangier, David. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.

Du, Simon S, Jin, Chi, Lee, Jason D, Jordan, Michael I, Poczos, Barnabas, and Singh, Aarti. Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412, 2017a.

Du, Simon S, Lee, Jason D, and Tian, Yuandong. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017b.

Feizi, Soheil, Javadi, Hamid, Zhang, Jesse, and Tse, David. Porcupine neural networks: (almost) all local optima are global. arXiv preprint arXiv:1710.02196, 2017.

Freeman, C Daniel and Bruna, Joan. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.

Ge, Rong, Huang, Furong, Jin, Chi, and Yuan, Yang. Escaping from saddle points − online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pp. 797–842, 2015.

Ge, Rong, Lee, Jason D, and Ma, Tengyu. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pp. 2973–2981, 2016.

Ge, Rong, Jin, Chi, and Zheng, Yi. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pp. 1233–1242, 2017a.

Ge, Rong, Lee, Jason D, and Ma, Tengyu. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017b.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Goel, Surbhi and Klivans, Adam. Eigenvalue decay implies polynomial-time learnability for neural networks. arXiv preprint arXiv:1708.03708, 2017a.

Goel, Surbhi and Klivans, Adam. Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010, 2017b.

Goel, Surbhi, Kanade, Varun, Klivans, Adam, and Thaler, Justin. Reliably learning the ReLU in polynomial time. arXiv preprint arXiv:1611.10258, 2016.

Haeffele, Benjamin D and Vidal, René. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.

Hardt, Moritz and Ma, Tengyu. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

Hardt, Moritz and Price, Eric. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pp. 2861–2869, 2014.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Janzamin, Majid, Sedghi, Hanie, and Anandkumar, Anima. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.

Kawaguchi, Kenji. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Yann, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–50. Springer, 1998.

Lee, Jason D, Simchowitz, Max, Jordan, Michael I, and Recht, Benjamin. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246–1257, 2016.

Li, Xingguo, Wang, Zhaoran, Lu, Junwei, Arora, Raman, Haupt, Jarvis, Liu, Han, and Zhao, Tuo. Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296, 2016.

Li, Yuanzhi and Yuan, Yang. Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886, 2017.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pp. 855–863, 2014.

Mei, Song, Bai, Yu, and Montanari, Andrea. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Neyshabur, Behnam, Salakhutdinov, Ruslan R, and Srebro, Nati. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2015.

Nguyen, Quynh and Hein, Matthias. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017a.

Nguyen, Quynh and Hein, Matthias. The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017b.

Panigrahy, Rina, Rahimi, Ali, Sachdeva, Sushant, and Zhang, Qiuyi. Convergence results for neural networks via electrodynamics. In LIPIcs-Leibniz International Proceedings in Informatics, volume 94. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

Park, Dohyung, Kyrillidis, Anastasios, Carmanis, Constantine, and Sanghavi, Sujay. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. In Artificial Intelligence and Statistics, pp. 65–74, 2017.

Safran, Itay and Shamir, Ohad. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Safran, Itay and Shamir, Ohad. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Sedghi, Hanie and Anandkumar, Anima. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.

Shalev-Shwartz, Shai, Shamir, Ohad, and Shammah, Shaked. Failures of gradient-based deep learning. In International Conference on Machine Learning, pp. 3067–3075, 2017a.

Shalev-Shwartz, Shai, Shamir, Ohad, and Shammah, Shaked. Weight sharing is crucial to succesful optimization. arXiv preprint arXiv:1706.00687, 2017b.

Shamir, Ohad. Distribution-specific hardness of learning neural networks. arXiv preprint arXiv:1609.01037, 2016.

Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Šíma, Jiří. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728, 2002.

Soltanolkotabi, Mahdi. Learning ReLUs via gradient descent. arXiv preprint arXiv:1705.04591, 2017.

Sun, Ju, Qu, Qing, and Wright, John. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.

Tian, Yuandong. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.

Xie, Bo, Liang, Yingyu, and Song, Le. Diverse neural network learns true target functions. In Artificial Intelligence and Statistics, pp. 1216–1224, 2017.

Zhang, Yuchen, Lee, Jason D, Wainwright, Martin J, and Jordan, Michael I. Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948, 2015.

Zhong, Kai, Song, Zhao, and Dhillon, Inderjit S. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.

Zhong, Kai, Song, Zhao, Jain, Prateek, Bartlett, Peter L, and Dhillon, Inderjit S. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017b.

Zhou, Pan and Feng, Jiashi. The landscape of deep learning algorithms. arXiv preprint arXiv:1705.07038, 2017.

A. Proofs of Section 3

Proof of Theorem 3.1. We first expand the loss function directly.

` (v, a)

1 2 = y − a>σ (Z) w E 2 ∗ > h ∗ ∗ >i ∗ > h >i > h ∗ >i ∗ = (a ) E σ (Zw ) σ (Zw ) a + a E σ (Zw) σ (Zw) a − 2a E σ (Zw) σ (Zw ) a = (a∗)> A (w∗) a∗ + a>A (w) a − 2a>B (w, w∗) w∗. where for simplicity, we denote

h >i A(w) =E σ (Zw) σ (Zw) (5)

∗ h ∗ >i B (w, w ) =E σ (Zw) σ (Zw ) . (6)

For i 6= j, using the second identity of Lemma A.1, we can compute

1 A(w) = σ Z>w σ Z>w = kwk2 ij E i E j 2π 2

For i = j, using the second moment formula of half-Gaussian distribution we can compute

1 A (w) = kwk2 . ii 2 2

Therefore 1 A(w) = kwk2 11> + (π − 1) I . 2π 2

Now let us compute B (w, w∗). For i 6= j, similar to A(w)ij, using the independence property of Gaussian, we have

1 B (w, w ) = kwk kw∗k . ∗ ij 2π 2 2

Next, using the fourth identity of Lemma A.1, we have

1 B (w, w∗) = (cos φ (π − φ) + sin φ) kwk kw∗k . ii 2π 2 2

Therefore, we can also write B (w, w∗) in a compact form

1 B (w, w∗) = kwk kw∗k 11> + (cos φ (π − φ) + sin φ − 1) I . 2π 2 2

Plugging in the formulas of A(w) and B (w, w∗) and w = v , we obtain the desired result. kvk2

Proof of Theorem 3.2. We first compute the expect gradient for v. From(Salimans & Kingma, 2016), we know

! ∂` (v, a) 1 vv> ∂` (w, a) = I − . ∂v kvk 2 ∂w 2 kvk2 Gradient Descent Learns One-hidden-layer CNN

Recall the gradient formula,

∂` (Z, w, a) ∂w k k ! k ! X ∗ X ∗ ∗ X  > = ai σ (Ziw) − ai σ (Zw ) aiZiI Zi w i=1 i=1 i=1  k  X 2 >  > X >  > > =  ai ZiZi I Zi w ≥ 0 + aiajZiZj I Zi w ≥ 0, Zj w ≥ 0  w (7) i=1 i6=j

 k  X ∗ >  > > ∗ X ∗ ∗  > > ∗ ∗ −  aiai ZiZi I Zi w ≥ 0, Zi w ≥ 0 + aiaj ZiZj I Zi w ≥ 0, Zj w ≥ 0  w . (8) i=1 i6=j

Now we calculate expectation of Equation (7) and (8) separately. For (7), by first two formulas of Lemma A.1, we have

 k  X 2 >  > X >  > >  ai ZiZi I Zi w ≥ 0 + aiajZiZj I Zi w ≥ 0, Zj w ≥ 0  w i=1 i6=j k X w X w = a2 · + a a . i 2 i j 2π i=1 i6=j

For (8), we use the second and third formula in Lemma A.1 to obtain

 k  X ∗ >  > > ∗ X ∗ ∗  > > ∗ ∗  aiai ZiZi I Zi w ≥ 0, Zi w ≥ 0 + aiaj ZiZj I Zi w ≥ 0, Zj w ≥ 0  w i=1 i6=j  1 1 kw∗k  X 1 kw∗k =a>a∗ (π − φ) w∗ + sin φ 2 w + a a∗ 2 w. π π kwk i j 2π kwk 2 i6=j 2

In summary, aggregating them together we have

∂` (Z, w, a) EZ ∂w 2 P > ∗ ∗ P ∗ ! 1 kak aiaj a a sin φ kw k aja kw k = a>a∗ (π − φ) w∗ + 2 + i6=j + 2 + i6=j j ∗ 2 w. 2π 2 2π 2π kwk2 2π kwk2

As a sanity check, this formula matches Equation (16) of (Brutzkus & Globerson, 2017) when a = a∗ = 1. Next, we calculate the expected gradient of a. Recall the gradient formula of a

∂`(Z, w, a) = a>σ (Zw) − (a∗)>σ (Zw∗) σ (Zw) a =σ (Zw) σ (Zw)> a − σ (Zw) σ (Zw∗)> a∗

Taking expectation we have

∂` (w, a) = A (w) a − B (w, w∗) a∗ ∂a where A (w) and B (w, w∗) are defined in Equation (5) and (6). Plugging in the formulas for A (w) and B (w, w∗) derived in the proof of Theorem 3.1 we obtained the desired result. Gradient Descent Learns One-hidden-layer CNN

Lemma A.1 (Useful Identities). Given $w$, $w^*$ with angle $\phi$ and a standard Gaussian random vector $z$, we have
$$\mathbb{E}\left[zz^\top \mathbb{I}\left\{z^\top w \ge 0\right\}\right]w = \frac{1}{2}w,$$
$$\mathbb{E}\left[z\,\mathbb{I}\left\{z^\top w \ge 0\right\}\right] = \frac{1}{\sqrt{2\pi}}\frac{w}{\|w\|_2},$$
$$\mathbb{E}\left[zz^\top \mathbb{I}\left\{z^\top w \ge 0,\ z^\top w_* \ge 0\right\}\right]w_* = \frac{1}{2\pi}(\pi - \phi)w_* + \frac{1}{2\pi}\sin\phi\,\frac{\|w^*\|_2}{\|w\|_2}\,w,$$
$$\mathbb{E}\left[\sigma\left(z^\top w\right)\sigma\left(z^\top w_*\right)\right] = \frac{1}{2\pi}\left(\cos\phi\,(\pi - \phi) + \sin\phi\right)\|w\|_2\|w^*\|_2.$$
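Before the proof, these identities are straightforward to sanity-check by Monte Carlo. The sketch below uses unit-norm directions and an arbitrary dimension, sample size, and tolerance; it only illustrates the statements and is not part of the proof.

import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal(4); w /= np.linalg.norm(w)
w_star = rng.standard_normal(4); w_star /= np.linalg.norm(w_star)
phi = np.arccos(np.clip(w @ w_star, -1.0, 1.0))

n = 500_000
z = rng.standard_normal((n, 4))
ind_w = (z @ w > 0)[:, None]                      # 1{z^T w >= 0}
ind_both = ((z @ w > 0) & (z @ w_star > 0))[:, None]

lhs1 = (z * ind_w).T @ z @ w / n                  # E[z z^T 1{z^T w >= 0}] w
lhs2 = (z * ind_w).mean(axis=0)                   # E[z 1{z^T w >= 0}]
lhs3 = (z * ind_both).T @ z @ w_star / n          # E[z z^T 1{both}] w*
lhs4 = np.mean(np.maximum(z @ w, 0) * np.maximum(z @ w_star, 0))

print(np.allclose(lhs1, 0.5 * w, atol=0.01))
print(np.allclose(lhs2, w / np.sqrt(2 * np.pi), atol=0.01))
rhs3 = ((np.pi - phi) * w_star + np.sin(phi) * w) / (2 * np.pi)
print(np.allclose(lhs3, rhs3, atol=0.01))
rhs4 = (np.cos(phi) * (np.pi - phi) + np.sin(phi)) / (2 * np.pi)
print(np.allclose(lhs4, rhs4, atol=0.01))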

d×d  > Proof. Consider an orthonormal basis of R : eiej with e1 k w. Then for i 6= j, we know

 >  >  heiej, E zz I z w ≥ 0 i = 0 by the independence properties of Gaussian random vector. For i = j = 1,

h 2 i 1 he e>, zz> z>w ≥ 0 i = z>w z>w ≥ 0 = i j E I E I 2

>  >  >  where the last step is by the property of half-Gaussian. For i = j 6= j, heiej , E zz I z w ≥ 0 i = 1 by standard Gaussian second moment formula. Therefore, zz> z>w ≥ 0  w = 1 w. z z>w ≥ 0  = √1 w can be E I 2 E I 2π proved by mean formula of half-normal distribution. To prove the third identity, consider an orthonormal basis of Rd×d:  > eiej with e1 k w∗ and w lies in the plane spanned by e1 and e2. Using the polar representation of 2D Gaussian random 2 1 variables (r is the radius and θ is the angle with dPr = r exp(−r /2) and dPθ = 2π ): Z ∞ Z π/2 >  >  > >  1 3 2  2 he1e1 , E zz I z w ≥ 0, z w∗ ≥ 0 i = r exp −r /2 dr · cos θdθ 2π 0 −π/2+φ 1 = (π − φ + sin φ cos φ) , 2π Z ∞ Z π/2 >  >  > >  1 3 2  he1e2 , E zz I z w ≥ 0, z w∗ ≥ 0 i = r exp −r /2 dr · sin θ cos θdθ 2π 0 −π/2+φ 1 = sin2 φ , 2π Z ∞ Z π/2 >  >  > >  1 3 2  2 he2e2 , E zz I z w ≥ 0, z w∗ ≥ 0 i = r exp −r /2 dr · sin θdθ 2π 0 −π/2+φ 1 = (π − φ − sin φ cos φ) . 2π

w¯ −cos φe1 Also note that e2 = sin φ . Therefore 1 1 w¯ − cos φe zz> z>w ≥ 0, z>w ≥ 0  w = (π − φ + sin φ cos φ) w∗ + sin2 φ · 1 kw∗k E I ∗ ∗ 2π 2π sin φ 2 1 1 kw∗k = (π − φ) w∗ + sin φ 2 w. 2π 2π kwk2

For the fourth identity, focusing on the plane spanned by w and w∗, using the polar decomposition, we have Z ∞ Z π/2  >  >  1 3 2  ∗ E σ z w σ z w∗ = r exp −r /2 dr · (cos θ cos φ + sin θ sin φ) cos θdθ kwk2 kw k2 2π 0 −π/2+φ 1 = cos φ (π − φ + sin φ cos φ) + sin3 φ kwk kw∗k . 2π 2 2 Gradient Descent Learns One-hidden-layer CNN B. Proofs of Qualitative Convergence Results

Proof of Lemma 5.1. When Algorithm 1 converges, since $a^\top a^* \ne 0$ and $\|v\|_2 < \infty$, using the gradient formula in Theorem 3.2, we know that either $\pi - \phi = 0$ or $\left(I - \frac{vv^\top}{\|v\|_2^2}\right)w^* = 0$. For the second case, since $I - \frac{vv^\top}{\|v\|_2^2}$ is a projection matrix onto the complement space of $v$, $\left(I - \frac{vv^\top}{\|v\|_2^2}\right)w^* = 0$ is equivalent to $\theta(v, w^*) = 0$. Once the angle between $v$ and $w^*$ is fixed, using the gradient formula for $a$ we obtain the desired formulas for the stationary points.

Proof of Lemma 5.2. By the gradient formula for $v$, if $a^\top a^* > 0$, the gradient descent update moves $v$ in the direction $c\left(I - \frac{vv^\top}{\|v\|_2^2}\right)w^*$ for some $c > 0$. Because $I - \frac{vv^\top}{\|v\|_2^2}$ is the projection matrix onto the complement space of $v$, the gradient update always makes the angle smaller.

C. Proofs of Quantitative Convergence Results

C.1. Useful Technical Lemmas

We first prove the lemma about the convergence of $\phi^t$.

Proof of Lemma 5.5. We consider the dynamics of sin2 φt.

sin2 φt+1

 > 2 vt+1 w∗ =1 − t+1 2 ∗ 2 kv k2 kw k2 2  t ∂` > ∗ v − η ∂vt w =1 −   t 2 2 ∂` 2 ∗ 2 kv k2 + η ∂vt kw k2

 t > ∗ t 2 > (a ) a (π−φ ) 2 (vt) v + η · sin2 φt kwk 2πkvk2 2 =1 −  t > ∗ t 2 2 t ∗ 4 t 2 ∗ 2 2 (a ) a (π−φ ) sin φ kw k2 kv k2 kw k2 + η 2π t 2 kv k2 t > 2 2 3 (a ) a(π−φ) kvtk kw∗k cos2 φt + 2η kw∗k · · sin2 φt cos φt ≤1 − 2 2 2 2π  t > ∗ t 2 2 t ∗ 4 t 2 ∗ 2 2 (a ) a (π−φ ) sin φ kw k2 kv k2 kw k2 + η 2π t 2 kv k2

> > 2 ∗ at a(π−φ)  at a∗(π−φ)   ∗ 2 2 t kw k2 ( ) 2 t t 2 ( ) 2 t kw k2 sin φ − 2η t 2 · 2π · sin φ cos φ + η 2π sin φ 2 kv k2 kvk2 =  t > ∗ 2  ∗ 2 2 (a ) a (π−φ) 2 t kw k2 1 + η 2π sin φ t 2 kv k2 !2 !2 kw∗k (at)> a (π − φ) (at)> a∗ (π − φ) kw∗k ≤ sin2 φt − 2η 2 · · sin2 φt cos φt + η2 sin2 φt 2 t 2 2π 2π t 2 kv k2 kv k2 where in the first inequality we dropped term proportional to O(η4) because it is negative, in the last equality, we divided t 2 ∗ 2 numerator and denominator by kv k2 kw k2 and the last inequality we dropped the denominator because it is bigger than 1.  >  kw∗k at a∗ π−φt t 2 ( ) ( ) Therefore, recall λ = t 2 and we have 2πkv k2

 2 sin2 φt+1 ≤ 1 − 2η cos φtλt + η2 λt sin2 φt. (9)

cos φt t 2 To this end, we need to make sure η ≤ λt . Note that since kv k2 is monotonically increasing, it is lower bounded by 1. t t > ∗  ∗ 2 > ∗2 2 Next notice φ ≤ π/2. Finally, from Lemma C.2, we know (a ) a ≤ ka k2 + 1 a kwk2. Combining these, we Gradient Descent Learns One-hidden-layer CNN have an upper bound  ∗ 2 > ∗2 ∗ 2 ka k2 + 1 a kw k2 λt ≤ . 4 Plugging this back to Equation (9) and use our assumption on η, we have sin2 φt+1 ≤ 1 − η cos φtλt sin2 φt.

t t t+1> ∗ n t > ∗  g(φ )−1 ∗ 2 t > ∗ g(φ )−1 ∗ 2o Lemma C.1. a a ≥ min (a ) a + η π−1 ka k2 − (a ) a , π−1 ka k2

Proof. Recall the dynamics of (at)> a∗. t >  η (π − 1) > η (g(φ ) − 1) η  2  at+1 a∗ = 1 − at a∗ + ka∗k2 + 1>a∗ − 1>a∗ 1>at 2π 2π 2 2π t  η (π − 1) > η (g(φ ) − 1) ≥ 1 − at a∗ + ka∗k2 2π 2π 2

t t > ∗ g(φ )−1 ∗ 2 where the inequality is due to Lemma 5.4. If (a ) a ≥ π−1 ka k2, t t >  η (π − 1) g(π ) − 1 η (g(φ )) at+1 a∗ ≥ 1 − ka∗k2 + ka∗k2 2π π − 1 2 π − 1 2 g(φt) − 1 = ka∗k2 . π − 1 2

t t > ∗ g(φ )−1 ∗ 2 t+1> ∗ If (a ) a ≤ π−1 ka k2, simple algebra shows a a increases by at least t g(φ ) − 1 >  η ka∗k2 − at a∗ . π − 1 2

A simple corollary is a>a∗ is uniformly lower bounded. 0 t > ∗ n 0> ∗ g(φ )−1 ∗ 2o Corollary C.1. For all t = 1, 2,..., (a ) a ≥ min a a , π−1 ka k2 .

> ∗  ∗ 2 This lemma also gives an upper bound of number of iterations to make a a = Θ ka k2 .

1 > ∗  ∗ 2 Corollary C.2. If g(φ) − 1 = Ω (1), then after η iterations, a a = Θ ka k2 .

> ∗ 1 g(φ) ∗ 2 > ∗ g(φ) ∗ 2 Proof. Note if g(φ) − 1 = Ω (1) and a a ≤ 2 · π−1 ka k2, each iteration a a increases by η π−1 ka k2.

We also need an upper bound of (at)> a∗. t > ∗  ∗ 2 > ∗2 ∗ 2 Lemma C.2. For t = 0, 1,..., (a ) a ≤ ka k2 + 1 a kw k2.

∗ t > ∗ Proof. Without loss of generality, assume kw k2 = 1. Again, recall the dynamics of (a ) a . t >  η (π − 1) > η (g(φ ) − 1) η  2  at+1 a∗ = 1 − at a∗ + ka∗k2 + 1>a∗ − 1>a∗ 1>at 2π 2π 2 2π

 η (π − 1) > η (π − 1) η (π − 1) 2 ≤ 1 − at a∗ + ka∗k2 + 1>a∗ . 2π 2π 2 2π

t > ∗ ∗ 2 > ∗2 Now we prove by induction, suppose the conclusion holds at iteration t, (a ) a ≤ ka k2 + 1 a . Plugging in we have the desired result. Gradient Descent Learns One-hidden-layer CNN

C.2. Convergence of Phase I In this section we prove the convergence of Phase I.

 1  t Proof of Convergence of Phase I. Lemma C.3 implies after O cos φ0β0 iterations, cos φ = Ω (1), which implies t g(φ )−1  1  t > ∗ ∗  ∗ 2 ∗ 2 π−1 = Ω (1). Using Corollary C.2, we know after O η iterations we have (a ) a kw k = Ω kw k2 ka k2 .

The main ingredient of the proof of phase I is the follow lemma where we use a joint induction argument to show the t t convergence of φ and a uniform upper bound of kv k2. 0 n 0> ∗ 0  ∗ 2o ∗ 2 Lemma C.3. Let β = min a a , g(φ ) − 1 ka k2 kw k2. If the step size satisfies η ≤   β∗ cos φ0 cos φ0 2π min ∗ 2 > ∗ 2 ∗ 2 , ∗ 2 > ∗ 2 ∗ 2 , k+π−1 , we have for t = 0, 1,... 8(ka k2+(1 a ) )kw k2 (ka k2+(1 a ) )kw k2

 0 0 t 2 t cos φ β t sin φ ≤ 1 − η · and v ≤ 2. 8 2

Proof. We prove by induction. The initialization ensure when t = 0, the conclusion is correct. Now we consider the t 2 dynamics of kv k2. Note because the gradient of v is orthogonal to v (Salimans & Kingma, 2016), we have a simple t 2 dynamic of kv k2. 2 t 2 t−1 2 2 ∂` (v, a) v 2 = v 2 + η ∂v 2 2 t > ∗ t−1! 2 t ∗ 2 t−1 2 2 (a ) a π − φ sin φ kw k2 = v + η 2 2π t 2 kv k2 t−1 2 2  ∗ 2 > ∗2 ∗ 2 2 t−1 ≤ v 2 + η ka k2 + 1 a kw k2 sin φ t−1 2  ∗ 2 > ∗2 ∗ 2 X 2 i =1 + η ka k2 + 1 a kw k2 sin φ i=1  2 8 ≤1 + η2 ka∗k2 + 1>a∗ kw∗k2 2 2 η cos φ0β0 ≤2 where the first inequality is by Lemma C.2 and the second inequality we use our induction hypothesis. Recall λt = ∗  t > ∗ t kw k a a π−φ 0 2 ( ) ( ) t t β t 2 . The uniform upper bound of kvk2 and the fact that φ ≤ π/2 imply a lower bound λ ≥ 8 . 2πkv k2 Plugging in Lemma 5.5, we have

 cos φ0β0   cos φ0β0 t+1 sin2 φt+1 ≤ 1 − η sin2 φt ≤ 1 − η . 8 8

We finish our joint induction proof.

C.3. Analysis of Phase II In this section we prove the convergence of phase II and necessary auxiliary lemmas.

>   T1  ∗ ∗ ∗ 2 ∗ 2 T1 Proof of Convergence of Phase II. At the beginning of Phase II, a a kw k = Ω kw k2 ka k2 and g(φ ) − 1 = t > ∗ ∗  ∗ 2 ∗ 2 Ω (1). Therefore, Lemma C.1 implies for all t = T1,T1 + 1,..., (a ) a kw k = Ω kw k2 ka k2 . Combining with the   ∗ 2 ∗ 2 T1 fact that kvk2 ≤ 2 (c.f. Lemma C.3), we obtain a lower bound λt ≥ Ω kw k2 ka k2 We also know that cos φ = Ω (1) Gradient Descent Learns One-hidden-layer CNN

t t and cos φ is monotinically increasing (c.f. Lemma 5.2), so for all t = T1,T1 + 1,..., cos φ = Ω (1). Plugging in these two lower bounds into Theorem 5.5, we have

2 t+1  ∗ 2 ∗ 2 2 t sin φ ≤ 1 − ηC kw k2 ka k2 sin φ .

    ∗ 10 1 1  2 t 10 ka k2 for some absolute constant C. Thus, after O ∗ 2 ∗ 2 log  iterations, we have sin φ ≤ min  ,  > ∗ , ηkw k2ka k2 |1 a | n ∗ o t ka k2 which implies π − g(φ ) ≤ min ,  |1>a∗| . Now using Lemma C.4,Lemma C.5 and Lemma C.6, we have after  1 1  ∗ 2 ∗ 2 Oe ηk log  iterations ` (v, a) ≤ C1 ka k2 kw k2 for some absolute constant C1. Rescaling  properly we obtain the desired result.

C.3.1. TECHNICAL LEMMASFOR ANALYZING PHASE II In this section we provide some technical lemmas for analyzing Phase II. Because of the positive homogeneity property, ∗ without loss of generality, we assume kw k2 = 1. ∗   1>a∗−1>a0  0 ka k2 1 | | > ∗ > T ∗ Lemma C.4. If π − g(φ ) ≤  > ∗ , after T = O log ∗ iterations, 1 a − 1 a ≤ 2 ka k . |1 a | ηk ka k2 2

Proof. Recall the dynamics of 1>at.

 η (k + π − 1) η (k + g(φt) − 1) 1>at+1 = 1 − 1>at + 1>a∗ 2π 2π  η (k + π − 1) η (k + g(φt) − 1) = 1 − 1>at + 1>a∗. 2π 2π

Assume 1>a∗ > 0 (the other case is similar). By Lemma 5.4 we know 1>at < 1>a∗ for all t. Consider

 η (k + π − 1) η (π − g(φt)) 1>a∗ − 1>at+1 = 1 − 1>a∗ − 1>a∗ + 1>a∗. 2π 2π

Therefore we have

(π − g (φt)) 1>a∗  η (k + π − 1)  (π − g (φt)) 1>a∗  1>a∗ − 1>at+1 − = 1 − 1>a∗ − 1>a∗ − . k + π − 1 2π k + π − 1

  > ∗ > 0  t > ∗ 1 |1 a −1 a | > ∗ > t (π−g(φ ))1 a ∗ After T = O log ∗ iterations, we have 1 a − 1 a − ≤  ka k , which ηk ka k2 k+π−1 2 > ∗ > t ∗ implies1 a − 1 a ≤ 2 ka k2 .

∗   a∗−a0  0 ka k2 > ∗ > 0  ∗ 1 k k2 Lemma C.5. If π − g(φ ) ≤  > ∗ and 1 a − 1 a ≤ ka k , then after T = O log ∗ iterations, |1 a | k 2 η ka k2 ∗ 0 ∗ a − a 2 ≤ C ka k2 for some absolute constant C.

Proof. We first consider the inner product

∂` (vt, at) h , at − a∗i at t π − 1 t ∗ 2 g(φ ) − π ∗ > t ∗ t ∗ > > ∗ = a − a − (a ) a − a + a − a 11 a − a 2π 2 2π t π − 1 t ∗ 2 g(φ ) − π ∗ t ∗ ≥ a − a − ka k a − a . 2π 2 2π 2 2 Gradient Descent Learns One-hidden-layer CNN

Next we consider the squared norm of gradient 2 ∂` (v, a) 1 t ∗ t  ∗ > t ∗ 2 = 2 (π − 1) a − a + π − g(φ ) a + 11 a − a 2 ∂a 2 4π 3  2 t ∗ 2 t 2 ∗ 2 2 > t > ∗2 ≤ (π − 1) a − a + π − g(φ ) ka k + k 1 a − 1 a . 4π2 2 2 t ∗ ∗ Suppose ka − a k2 ≤  ka k2, then t t 2 ∂` (v , a ) t ∗ π − 1 t ∗ 2  ∗ 2 h , a − a i ≥ a − a − ka k at 2π 2 2π 2 2 ∂` (v, a) 2 ∗ 2 ≤ 3 ka k2 . ∂a 2 Therefore we have   t+1 ∗ 2 η (π − 1) t ∗ 2 2 2 a − a ≤ 1 − a − a + 4η kak 2 2π 2 2 ∗ 2   2 ∗ 2 ! t+1 ∗ 2 8 (π − 1)  ka k2 η (π − 1) t ∗ 2 8 (π − 1)  ka k2 ⇒ a − a − ≤ 1 − a − a − . 2 π − 1 2π 2 π − 1

 1 1  t+1 ∗ 2 ∗ Thus after O η  iterations, we must have a − a 2 ≤ C ka k2 for some large absolute constant C. Rescaling , we obtain the desired result.

∗ ∗ ∗ ∗ Lemma C.6. If π − g(φ) ≤  and ka − a kw k2k ≤  ka k2 kw k2, then the population loss satisfies ` (v, a) ≤ ∗ 2 ∗ 2 C ka k2 kw k2 for some constant C > 0.

Proof. The result follows by plugging in the assumptions in Theorem 3.1.

D. Proofs of Initialization Scheme Proof of Theorem 4.2. The proof of the first part of Theorem 4.2 just uses the symmetry of unit sphere and ball and the  > ∗  |1 a | second part is a direct application of Lemma 2.5 of (Hardt & Price, 2014). Lastly, since a0 ∼ B 0 √ , we have k √ > 0 0 0 > ∗ ∗ 1 a ≤ a 1 ≤ k a 2 ≤ 1 a kw k2 where the second inequality is due to Holder’s¨ inequality.

E. Proofs of Converging to Spurious Local Minimum Proof of Theorem 4.3. The main idea is similar to Theorem 4.1 but here we show w → −w∗ (without loss of generality, ∗ > ∗ we assume kw k2 = 1). Different from Theorem 4.1, here we need to prove the invariance a a < 0, which implies our > 2 t > ∗ > t > ∗ 0 −2(1 a) k+π−1 desired result. We prove by induction, suppose (a ) a > 0, 1 a ≤ 1 a , g φ ≤ ∗ 2 + 1 and η < 2π . ka k2 > 2 > t > ∗ 0 −2(1 a) Note 1 a ≤ 1 a are satisfied by Lemma 5.4 and g φ ≤ ∗ 2 + 1 by our initialization condition and induction ka k2 hypothesis that implies φt is increasing. Recall the dynamics of (at)> a∗. t >  η (π − 1) > η (g (φ ) − 1) η  2  at+1 a∗ = 1 − at a∗ + ka∗k2 + 1>a∗ − 1>at 1>a∗ 2π 2π 2 2π  t ∗ > ∗2 η (g(φ ) − 1) ka k2 + 2 1 a ≤ < 0 2π where the first inequality we used our induction hypothesis on inner product between at and a∗ and 1>at ≤ 1>a∗ and the second inequality is by induction hypothesis on φt. Thus when gradient descent algorithm converges, according ∗ > −1 >  ∗ ∗ Lemma 5.1, θ (v, w ) = π, a = 11 + (π − 1) I 11 − I kw k2 a . Plugging these into Theorem 3.1, with some  ∗ 2 ∗ 2 routine algebra, we show ` (v, a) = Ω kw k2 ka k2 .