<<

Gradient Descent Learns One-hidden-layer CNN: Don’t be Afraid of Spurious Local Minima

Simon S. Du 1  Jason D. Lee 2  Yuandong Tian 3  Barnabás Póczos 1  Aarti Singh 1

1 Machine Learning Department, Carnegie Mellon University  2 Department of Data Sciences and Operations, University of Southern California  3 Facebook Artificial Intelligence Research. Correspondence to: Simon S. Du.

Abstract

We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(Z, w, a) = \sum_j a_j \sigma(w^\top Z_j)$, in which both the convolutional weights $w$ and the output weights $a$ are parameters to be learned. When the labels are the outputs from a teacher network of the same architecture with fixed weights $(w^*, a^*)$, we prove that with Gaussian input $Z$, there is a spurious local minimizer. Surprisingly, in the presence of the spurious local minimizer, gradient descent with weight normalization from randomly initialized weights can still be proven to recover the true parameters with constant probability, which can be boosted to probability 1 with multiple restarts. We also show that with constant probability, the same procedure could also converge to the spurious local minimum, showing that the local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics has two phases: it starts off slow, but converges much faster after several iterations.

1. Introduction

Deep convolutional neural networks (DCNN) have achieved the state-of-the-art performance in many applications such as computer vision (Krizhevsky et al., 2012), natural language processing (Dauphin et al., 2016) and reinforcement learning applied in classic games like Go (Silver et al., 2016). Despite the highly non-convex nature of the objective function, simple first-order algorithms like stochastic gradient descent and its variants often train such networks successfully. Why such simple methods are successful in learning DCNN remains elusive from the optimization perspective.

Recently, a line of research (Tian, 2017; Brutzkus & Globerson, 2017; Li & Yuan, 2017; Soltanolkotabi, 2017; Shalev-Shwartz et al., 2017b) assumed the input distribution is Gaussian and showed that stochastic gradient descent with random or 0 initialization is able to train a neural network $f(Z, \{w_j\}) = \sum_j a_j \sigma(w_j^\top Z)$ with ReLU activation $\sigma(x) = \max(x, 0)$ in polynomial time. However, these results all assume there is only one unknown layer $\{w_j\}$, while $a$ is a fixed vector. A natural question thus arises:

Does randomly initialized (stochastic) gradient descent learn neural networks with multiple layers?

In this paper, we take an important step by showing that randomly initialized gradient descent learns a non-linear convolutional neural network with two unknown layers $w$ and $a$. To our knowledge, our work is the first of its kind.

Formally, we consider the convolutional case in which a filter $w$ is shared among different hidden nodes. Let $x \in \mathbb{R}^d$ be an input sample, e.g., an image. We generate $k$ patches from $x$, each with size $p$: $Z \in \mathbb{R}^{p \times k}$, where the $i$-th column is the $i$-th patch generated by selecting some coordinates of $x$: $Z_i = Z_i(x)$. We further assume there is no overlap between patches. Thus, the neural network has the following form:
$$f(Z, w, a) = \sum_{i=1}^k a_i \sigma\left(w^\top Z_i\right).$$

We focus on the realizable case, i.e., the label is generated according to $y = f(Z, w^*, a^*)$ for some true parameters $w^*$ and $a^*$, and we use the $\ell_2$ loss to learn the parameters:
$$\min_{w, a}\ \ell(Z, w, a) := \frac{1}{2}\left(f(Z, w, a) - f(Z, w^*, a^*)\right)^2.$$

We assume $x$ is sampled from a Gaussian distribution and there is no overlap between patches. This assumption is equivalent to assuming that each entry of $Z$ is sampled from a Gaussian distribution (Brutzkus & Globerson, 2017; Zhong et al., 2017b). Following (Zhong et al., 2017a;b; Li & Yuan, 2017; Tian, 2017; Brutzkus & Globerson, 2017; Shalev-Shwartz et al., 2017b), in this paper we mainly focus on the population loss:
$$\ell(w, a) := \frac{1}{2}\mathbb{E}_Z\left[\left(f(Z, w, a) - f(Z, w^*, a^*)\right)^2\right].$$
We study whether the global convergence $w \to w^*$ and $a \to a^*$ can be achieved when optimizing $\ell(w, a)$ using randomly initialized gradient descent.
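To fix notation, the following is a minimal NumPy sketch of the non-overlapping architecture and the realizable ℓ2 loss defined above. The concrete values of p and k, and the random draws, are arbitrary and only for illustration.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(Z, w, a):
    """f(Z, w, a) = sum_i a_i * sigma(w^T Z_i), with Z of shape (p, k)."""
    return a @ relu(Z.T @ w)

def loss(Z, w, a, w_star, a_star):
    """Realizable l2 loss: the label is the output of a teacher network with (w*, a*)."""
    y = forward(Z, w_star, a_star)
    return 0.5 * (forward(Z, w, a) - y) ** 2

# Example: k non-overlapping patches of size p, each entry i.i.d. standard Gaussian.
p, k = 6, 20
rng = np.random.default_rng(0)
Z = rng.standard_normal((p, k))
w_star, a_star = rng.standard_normal(p), rng.standard_normal(k)
w, a = rng.standard_normal(p), rng.standard_normal(k)
print(loss(Z, w, a, w_star, a_star))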

(a) Convolutional neural network with an unknown non-overlapping filter and an unknown output layer. In the first (hidden) layer, a filter w is applied to non-overlapping parts of the input x, which then passes through a ReLU activation function. The final output is the inner product between an output weight vector a and the hidden layer outputs.

(b) The convergence of gradient descent (logarithm of prediction errors versus epochs) for learning a CNN described in Figure 1a with Gaussian input using different initializations. The success case and the failure case correspond to convergence to the global minimum and the spurious local minimum, respectively. In the first ∼ 50 iterations the convergence is slow. After that gradient descent converges at a fast linear rate.

Figure 1. Network architecture that we consider in this paper and convergence of gradient descent for learning the parameters of this network.

A crucial difference between our two-layer network and previous one-layer models is that there is a positive-homogeneity issue. That is, for any $c > 0$, $f\left(Z, cw, \frac{a}{c}\right) = f(Z, w, a)$. This interesting property allows the network to be rescaled without changing the function computed by the network. As reported by (Neyshabur et al., 2015), it is desirable to have a scaling-invariant learning rate to stabilize the training process.

One commonly used technique to achieve stability is weight normalization, introduced by Salimans & Kingma (2016). As reported in (Salimans & Kingma, 2016), this re-parametrization improves the conditioning of the gradient because it decouples the magnitude of the weight vector from its direction, and it empirically accelerates stochastic gradient descent optimization.

In our setting, we re-parametrize the first layer as $w = \frac{v}{\|v\|_2}$ and the prediction function becomes
$$f(Z, v, a) = \sum_{i=1}^k a_i \frac{\sigma\left(Z_i^\top v\right)}{\|v\|_2}. \quad (1)$$
The population loss is
$$\ell(v, a) = \frac{1}{2}\mathbb{E}_Z\left[\left(f(Z, v, a) - f(Z, v^*, a^*)\right)^2\right]. \quad (2)$$
In this paper we focus on using randomly initialized gradient descent to learn this convolutional neural network. The pseudo-code is listed in Algorithm 1.

Main Contributions. Our paper has three contributions. First, we show that if $(v, a)$ is initialized by a specific random initialization, then with high probability, gradient descent from $(v, a)$ converges to the teacher's parameters $(v^*, a^*)$.¹ We can further boost the success rate with more trials.

Second, perhaps surprisingly, we prove that the objective function (Equation (2)) does have a spurious local minimum: using the same random initialization scheme, there exists a pair $(\tilde{v}^0, \tilde{a}^0) \in S_{\pm}(v, a)$ so that gradient descent from $(\tilde{v}^0, \tilde{a}^0)$ converges to this bad local minimum. In contrast to previous works on guarantees for non-convex objective functions whose landscape satisfies the "no spurious local minima" property (Li et al., 2016; Ge et al., 2017a; 2016; Bhojanapalli et al., 2016; Ge et al., 2017b; Kawaguchi, 2016), our result provides a concrete counter-example and highlights a conceptually surprising phenomenon:

Randomly initialized local search can find a global minimum in the presence of spurious local minima.

¹With some simple calculations, we can see that the optimal solution for $a$ is unique, which we denote as $a^*$, whereas the optimal solution for $v$ is not unique, because for every optimal solution $v^*$, $cv^*$ for $c > 0$ is also an optimal solution. In this paper, with a little abuse of notation, we use $v^*$ to denote the equivalence class of optimal solutions.

Finally, we conduct a quantitative study of the dynamics of gradient descent. We show that the dynamics of Algorithm 1 has two phases. At the beginning (around the first 50 iterations in Figure 1b), because the magnitude of the initial signal (the angle between $v$ and $w^*$) is small, the prediction error drops slowly. After that, when the signal becomes stronger, gradient descent converges at a much faster rate and the prediction error drops quickly.

Algorithm 1 Gradient Descent for Learning One-Hidden-Layer CNN with Weight Normalization
1: Input: Initialization $v^0 \in \mathbb{R}^p$, $a^0 \in \mathbb{R}^k$, step size $\eta$.
2: for $t = 1, 2, \ldots$ do
3:   $v^{t+1} \leftarrow v^t - \eta\,\frac{\partial \ell(v^t, a^t)}{\partial v^t}$,
4:   $a^{t+1} \leftarrow a^t - \eta\,\frac{\partial \ell(v^t, a^t)}{\partial a^t}$.
5: end for
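A runnable sketch of Algorithm 1 with the weight-normalized prediction function (1) is given below. Since the population gradients are expectations over Gaussian inputs, the sketch approximates them by averaging the empirical gradient over a fresh batch of Gaussian patch matrices at every step; the batch size, step size, and iteration count are illustrative choices, not values prescribed by the paper.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def predict(Z, v, a):
    """Weight-normalized prediction f(Z, v, a) = sum_i a_i * sigma(Z_i^T v) / ||v||_2."""
    return a @ relu(Z.T @ v) / np.linalg.norm(v)

def grads(Z, v, a, y):
    """Gradients of 0.5 * (f(Z, v, a) - y)^2 w.r.t. v and a (ReLU subgradient 0 at 0)."""
    nv = np.linalg.norm(v)
    pre = Z.T @ v                       # pre-activations, shape (k,)
    h = relu(pre) / nv                  # hidden-layer output
    r = a @ h - y                       # prediction error
    grad_a = r * h
    grad_v = r * (Z @ (a * (pre > 0)) / nv - (a @ relu(pre)) * v / nv ** 3)
    return grad_v, grad_a

def gradient_descent(v0, a0, w_star, a_star, eta=0.05, steps=2000, batch=256, seed=0):
    """Algorithm 1 with the population gradient replaced by a fresh-batch estimate."""
    rng = np.random.default_rng(seed)
    p, k = len(v0), len(a0)
    v, a = v0.copy(), a0.copy()
    for _ in range(steps):
        gv, ga = np.zeros(p), np.zeros(k)
        for _ in range(batch):
            Z = rng.standard_normal((p, k))
            y = a_star @ relu(Z.T @ w_star)      # teacher label
            dv, da = grads(Z, v, a, y)
            gv += dv / batch
            ga += da / batch
        v, a = v - eta * gv, a - eta * ga
    return v, a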
Technical Insights. The main difficulty in analyzing the convergence is the presence of the local minimum. Note that the local minimum and the global minimum are disjoint (c.f. Figure 1b). The key technique we adopt is to characterize the attraction basin of each minimum. We consider the sequence $\{(v^t, a^t)\}_{t=0}^{\infty}$ generated by Algorithm 1 with step size $\eta$ from the initialization point $(v^0, a^0)$. The attraction basin of a minimum $(v^*, a^*)$ is defined as the set
$$B(v^*, a^*) = \left\{(v^0, a^0) : \lim_{t \to \infty}(v^t, a^t) \to (v^*, a^*)\right\}.$$
The goal is to find a distribution $\mathcal{G}$ for weight initialization so that the probability that the initial weights lie in $B(v^*, a^*)$ of the global minimum is bounded below:
$$\mathbb{P}_{(v^0, a^0) \sim \mathcal{G}}\left[B(v^*, a^*)\right] \ge c$$
for some absolute constant $c > 0$. While it is hard to characterize $B(v^*, a^*)$ directly, we find that the set $\widetilde{B}(v^*, a^*) \equiv \{(v^0, a^0) : (v^0)^\top v^* \ge 0,\ (a^0)^\top a^* \ge 0,\ |1^\top a^0| \le |1^\top a^*|\}$ is a subset of $B(v^*, a^*)$ (c.f. Lemma 5.2-Lemma 5.4). Furthermore, when the learning rate $\eta$ is sufficiently small, we can design a specific distribution $\mathcal{G}$ so that
$$\mathbb{P}_{(v^0, a^0) \sim \mathcal{G}}\left[B(v^*, a^*)\right] \ge \mathbb{P}_{(v^0, a^0) \sim \mathcal{G}}\left[\widetilde{B}(v^*, a^*)\right] \ge c.$$
This analysis emphasizes that for non-convex problems, we need to carefully characterize both the trajectory of the algorithm and the initialization. We believe that this idea is applicable to other non-convex problems.

To obtain the convergence rate, we propose a potential function (also called a Lyapunov function in the literature). For this problem we consider the quantity $\sin^2\phi^t$, where $\phi^t = \theta(v^t, v^*)$, and we show it shrinks at a geometric rate (c.f. Lemma 5.5).

Organization. This paper is organized as follows. In Section 3 we introduce the necessary notations and analytical formulas of the gradient updates in Algorithm 1. In Section 4, we provide our main theorems on the performance of the algorithm and their implications. In Section 6, we use simulations to verify our theories. In Section 5, we give a proof sketch of our main theorem. We conclude and list future directions in Section 7. We place most of our detailed proofs in the appendix.

2. Related Works

From the point of view of learning theory, it is well known that training a neural network is hard in the worst case (Blum & Rivest, 1989; Livni et al., 2014; Šíma, 2002; Shalev-Shwartz et al., 2017a;b), and recently Shamir (2016) showed that assumptions on both the target function and the input distribution are needed for optimization algorithms used in practice to succeed.

Solve NN without gradient descent. With some additional assumptions, many works tried to design algorithms that provably learn a neural network with polynomial time and sample complexity (Goel et al., 2016; Zhang et al., 2015; Sedghi & Anandkumar, 2014; Janzamin et al., 2015; Goel & Klivans, 2017a;b). However, these algorithms are specially designed for certain architectures and cannot explain why (stochastic) gradient based optimization algorithms work well in practice.

Gradient-based optimization with Gaussian input. Focusing on gradient-based algorithms, a line of research analyzed the behavior of (stochastic) gradient descent for the Gaussian input distribution. Tian (2017) showed that population gradient descent is able to find the true weight vector with random initialization for a one-layer one-neuron model. Soltanolkotabi (2017) later improved this result by showing that the true weights can be exactly recovered by empirical projected gradient descent with enough samples in linear time. Brutzkus & Globerson (2017) showed population gradient descent recovers the true weights of a convolution filter with non-overlapping input in polynomial time. Zhong et al. (2017b;a) proved that with a sufficiently good initialization, which can be implemented by the tensor method, gradient descent can find the true weights of a one-hidden-layer fully connected and convolutional neural network. Li & Yuan (2017) showed SGD can recover the true weights of a one-layer ResNet model with ReLU activation under the assumption that the spectral norm of the true weights is within a small constant of the identity mapping. Panigrahy et al. (2018) also analyzed gradient descent for learning a two-layer neural network but with different activation functions. This paper also follows this line of approach that studies the behavior of the gradient descent algorithm with Gaussian inputs.

Local minimum and global minimum. Finding the optimal weights of a neural network is a non-convex problem. Recently, researchers found that if the objective function satisfies the following two key properties, (1) all saddle points and local maxima are strict (i.e., there exists a direction with negative curvature), and (2) all local minima are global (no spurious local minimum), then perturbed (stochastic) gradient descent (Ge et al., 2015) or methods with second order information (Carmon et al., 2016; Agarwal et al., 2017) can find a global minimum in polynomial time.² Combined with geometric analyses, these algorithmic results have shown that a large number of problems, including tensor decomposition (Ge et al., 2015), dictionary learning (Sun et al., 2017), matrix sensing (Bhojanapalli et al., 2016; Park et al., 2017), matrix completion (Ge et al., 2017a; 2016) and matrix factorization (Li et al., 2016), can be solved in polynomial time with local search algorithms.

This motivates the research of studying the landscape of neural networks (Kawaguchi, 2016; Choromanska et al., 2015; Hardt & Ma, 2016; Haeffele & Vidal, 2015; Mei et al., 2016; Freeman & Bruna, 2016; Safran & Shamir, 2016; Zhou & Feng, 2017; Nguyen & Hein, 2017a;b; Ge et al., 2017b; Safran & Shamir, 2017). In particular, Kawaguchi (2016); Hardt & Ma (2016); Zhou & Feng (2017); Nguyen & Hein (2017a;b); Feizi et al. (2017) showed that under some conditions, all local minima are global. Recently, Ge et al. (2017b) showed that, using a modified objective function satisfying the two properties above, a one-hidden-layer neural network can be learned by noisy perturbed gradient descent. However, for the nonlinear activation function, when the number of samples is larger than the number of nodes at every layer, which is usually the case in most deep neural networks, and for natural objective functions like $\ell_2$, it is still unclear whether the strict saddle and "all locals are global" properties are satisfied. In this paper, we show that even for a one-hidden-layer neural network with ReLU activation, there exists a spurious local minimum. However, we further show that randomly initialized local search can achieve the global minimum with constant probability.

²Lee et al. (2016) showed vanilla gradient descent only converges to minimizers, with no convergence rate guarantees. Recently, Du et al. (2017a) gave an exponential time lower bound for vanilla gradient descent. In this paper, we give a polynomial convergence guarantee for vanilla gradient descent.

3. Preliminaries

We use bold-faced letters for vectors and matrices. We use $\|\cdot\|_2$ to denote the Euclidean norm of a finite-dimensional vector. We let $w^t$ and $a^t$ be the parameters at the $t$-th iteration and $w^*$ and $a^*$ be the optimal weights. For two vectors $w_1$ and $w_2$, we use $\theta(w_1, w_2)$ to denote the angle between them. $a_i$ is the $i$-th coordinate of $a$, and $Z_i$ is the transpose of the $i$-th row of $Z$ (thus a column vector). We denote by $\mathbb{S}^{p-1}$ the $(p-1)$-dimensional unit sphere and by $B(0, r)$ the ball centered at $0$ with radius $r$.

In this paper we assume every patch $Z_i$ is a vector of i.i.d. Gaussian random variables. The following theorem gives an explicit formula for the population loss. The proof uses the basic rotational invariance property and the polar decomposition of Gaussian random variables. See Section A for details.

Theorem 3.1. If every entry of $Z$ is i.i.d. sampled from a Gaussian distribution with mean 0 and variance 1, then the population loss is
$$\ell(v, a) = \frac{1}{2}\Bigg[\frac{(\pi - 1)\|w^*\|_2^2}{2\pi}\|a^*\|_2^2 + \frac{\pi - 1}{2\pi}\|a\|_2^2 - \frac{2(g(\phi) - 1)\|w^*\|_2}{2\pi}a^\top a^* + \frac{\|w^*\|_2^2}{2\pi}\left(1^\top a^*\right)^2 + \frac{1}{2\pi}\left(1^\top a\right)^2 - \frac{2\|w^*\|_2}{2\pi}\,1^\top a \cdot 1^\top a^*\Bigg] \quad (3)$$
where $\phi = \theta(v, w^*)$ and $g(\phi) = (\pi - \phi)\cos\phi + \sin\phi$.

Using similar techniques, we can show the gradient also has an analytical form.

Theorem 3.2. Suppose every entry of $Z$ is i.i.d. sampled from a Gaussian distribution with mean 0 and variance 1. Denote $\phi = \theta(w, w^*)$. Then the expected gradients with respect to $v$ and $a$ can be written as
$$\mathbb{E}_Z\left[\frac{\partial \ell(Z, v, a)}{\partial v}\right] = -\frac{1}{2\pi\|v\|_2}\left(I - \frac{vv^\top}{\|v\|_2^2}\right)a^\top a^*\,(\pi - \phi)\,w^*,$$
$$\mathbb{E}_Z\left[\frac{\partial \ell(Z, v, a)}{\partial a}\right] = \frac{1}{2\pi}\left(11^\top + (\pi - 1)I\right)a - \frac{1}{2\pi}\left(11^\top + (g(\phi) - 1)I\right)\|w^*\|_2\,a^*.$$

As a remark, if the second layer is fixed, upon proper scaling, the formulas for the population loss and the gradient of $v$ are equivalent to the corresponding formulas derived in (Brutzkus & Globerson, 2017; Cho & Saul, 2009). However, when the second layer is not fixed, the gradient of $v$ depends on $a^\top a^*$, which plays an important role in deciding whether gradient descent converges to the global or the local minimum.
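The closed form in Theorem 3.1 is easy to sanity-check numerically. The sketch below compares Equation (3) with a Monte Carlo average of the squared prediction error over Gaussian patches; the dimensions and sample size are arbitrary choices, and the two numbers should agree only up to Monte Carlo error.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def population_loss_closed_form(v, a, w_star, a_star):
    """Equation (3): closed-form population loss for Gaussian patches."""
    nw = np.linalg.norm(w_star)
    phi = np.arccos(np.clip(v @ w_star / (np.linalg.norm(v) * nw), -1.0, 1.0))
    g = (np.pi - phi) * np.cos(phi) + np.sin(phi)
    return 0.5 / (2 * np.pi) * (
        (np.pi - 1) * nw ** 2 * (a_star @ a_star)
        + (np.pi - 1) * (a @ a)
        - 2 * (g - 1) * nw * (a @ a_star)
        + nw ** 2 * np.sum(a_star) ** 2
        + np.sum(a) ** 2
        - 2 * nw * np.sum(a) * np.sum(a_star)
    )

def population_loss_monte_carlo(v, a, w_star, a_star, n=100_000, seed=0):
    """Monte Carlo estimate of 0.5 * E_Z[(f(Z, v, a) - f(Z, w*, a*))^2]."""
    rng = np.random.default_rng(seed)
    p, k = len(v), len(a)
    Z = rng.standard_normal((n, p, k))
    student = relu(np.einsum('npk,p->nk', Z, v)) @ a / np.linalg.norm(v)
    teacher = relu(np.einsum('npk,p->nk', Z, w_star)) @ a_star
    return 0.5 * np.mean((student - teacher) ** 2)

rng = np.random.default_rng(1)
p, k = 5, 8
v, a = rng.standard_normal(p), rng.standard_normal(k)
w_star, a_star = rng.standard_normal(p), rng.standard_normal(k)
print(population_loss_closed_form(v, a, w_star, a_star))
print(population_loss_monte_carlo(v, a, w_star, a_star))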

4. Main Result

We begin with our main theorem about the convergence of gradient descent.

Theorem 4.1. Suppose the initialization satisfies $(a^0)^\top a^* > 0$, $|1^\top a^0| \le |1^\top a^*|$, $\phi^0 < \pi/2$, and the step size satisfies
$$\eta = O\left(\min\left\{\frac{(a^0)^\top a^* \cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{(g(\phi^0) - 1)\|a^*\|_2^2\cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{\cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{1}{k}\right\}\right).$$
Then the convergence of gradient descent has two phases.

(Phase I: Slow Initial Rate) There exists $T_1 = O\left(\frac{1}{\eta\cos\phi^0\beta^0} + \frac{1}{\eta}\right)$ such that $\phi^{T_1} = \Theta(1)$ and $(a^{T_1})^\top a^*\|w^*\|_2 = \Theta\left(\|a^*\|_2^2\|w^*\|_2^2\right)$, where $\beta^0 = \min\left\{(a^0)^\top a^*,\ (g(\phi^0) - 1)\|a^*\|_2^2\right\}\|w^*\|_2^2$.

(Phase II: Fast Rate) Suppose at the $T_1$-th iteration $\phi^{T_1} = \Theta(1)$ and $(a^{T_1})^\top a^*\|w^*\|_2 = \Theta\left(\|a^*\|_2^2\|w^*\|_2^2\right)$; then there exists $T_2 = \widetilde{O}\left(\frac{1}{\eta\|w^*\|_2^2\|a^*\|_2^2} + \frac{1}{\eta}\log\frac{1}{\epsilon}\right)$³ such that $\ell\left(v^{T_1+T_2}, a^{T_1+T_2}\right) \le \epsilon\,\|w^*\|_2^2\|a^*\|_2^2$.

Theorem 4.1 shows that under certain conditions on the initialization, gradient descent converges to the global minimum. The convergence has two phases: at the beginning, because the initial signal ($\cos\phi^0\beta^0$) is small, the convergence is quite slow; after $T_1$ iterations, the signal becomes stronger and we enter a regime with a faster convergence rate. See Lemma 5.5 for technical details.

Initialization plays an important role in the convergence. First, Theorem 4.1 needs the initialization to satisfy $(a^0)^\top a^* > 0$, $|1^\top a^0| \le |1^\top a^*|$ and $\phi^0 < \pi/2$. Second, the step size $\eta$ and the convergence rate in the first phase also depend on the initialization. If the initial signal is very small, for example $\phi^0 \approx \pi/2$, which makes $\cos\phi^0$ close to 0, we can only choose a very small step size, and because $T_1$ depends on the inverse of $\cos\phi^0$, we need a large number of iterations to enter phase II. We provide the following initialization scheme, which ensures the conditions required by Theorem 4.1 and a large enough initial signal.

Theorem 4.2. Let $v \sim \mathrm{unif}\left(\mathbb{S}^{p-1}\right)$ and $a \sim \mathrm{unif}\left(B\left(0, \frac{|1^\top a^*|\,\|w^*\|_2}{\sqrt{k}}\right)\right)$. Then there exists $(v^0, a^0) \in \{(v, a), (v, -a), (-v, a), (-v, -a)\}$ such that $(a^0)^\top a^* > 0$, $|1^\top a^0| \le |1^\top a^*|$ and $\phi^0 < \pi/2$. Further, with high probability, the initialization satisfies $(a^0)^\top a^*\|w^*\|_2 = \Theta\left(\frac{|1^\top a^*|\,\|a^*\|_2\|w^*\|_2^2}{k}\right)$ and $\phi^0 = \Theta\left(\frac{1}{\sqrt{p}}\right)$.

Theorem 4.2 shows that after generating a pair of random vectors $(v, a)$ and trying out all 4 sign combinations of $(v, a)$, we can find the global minimum by gradient descent. Further, because the initial signal is not too small, we only need to set the step size to be $O(1/\mathrm{poly}(k, p, \|w^*\|_2\|a\|_2))$, and the number of iterations in phase I is at most $O(\mathrm{poly}(k, p, \|w^*\|_2\|a\|_2))$. Therefore, Theorem 4.1 and Theorem 4.2 together show that randomly initialized gradient descent learns a one-hidden-layer convolutional neural network in polynomial time. The proof of the first part of Theorem 4.2 uses the symmetry of the unit sphere and the ball, and the second part is a standard application of random vectors in high-dimensional spaces; see Lemma 2.5 of (Hardt & Price, 2014) for example.

Remark 1: For the second layer we use an $O\left(\frac{1}{\sqrt{k}}\right)$-type initialization, in line with common initialization techniques (Glorot & Bengio, 2010; He et al., 2015; LeCun et al., 1998).

Remark 2: The Gaussian input assumption is not necessarily true in practice, although it is a common assumption in previous papers (Brutzkus & Globerson, 2017; Li & Yuan, 2017; Zhong et al., 2017a;b; Tian, 2017; Xie et al., 2017; Shalev-Shwartz et al., 2017b) and is also considered plausible in (Choromanska et al., 2015). Our result can be easily generalized to rotation-invariant distributions. However, extending it to more general distributional assumptions, e.g., the structural conditions used in (Du et al., 2017b), remains a challenging open problem.

Remark 3: Since we only require the initialization to be smaller than certain quantities of $a^*$ and $w^*$, in practice, if the optimization fails, i.e., the initialization is too large, one can halve the initialization size, and eventually these conditions will be met.

³$\widetilde{O}(\cdot)$ hides logarithmic factors in $|1^\top a^*|\,\|w^*\|_2$ and $\|a^*\|_2\|w^*\|_2$.
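The initialization scheme of Theorem 4.2 can be sketched as follows: draw one direction uniformly from the unit sphere and one second-layer vector uniformly from a small ball, and try all four sign combinations. The ball radius below follows the theorem and therefore involves the (in practice unknown) teacher quantities; in an actual run this scale would be treated as a tuning parameter, halved when needed as in Remark 3.

import numpy as np

def unit_sphere(p, rng):
    """Uniform sample from the (p-1)-dimensional unit sphere."""
    v = rng.standard_normal(p)
    return v / np.linalg.norm(v)

def uniform_ball(k, radius, rng):
    """Uniform sample from the k-dimensional ball of the given radius."""
    u = rng.standard_normal(k)
    u /= np.linalg.norm(u)
    return radius * u * rng.uniform() ** (1.0 / k)

def candidate_initializations(p, k, a_star, w_star, rng):
    """The four sign combinations {(+-v, +-a)} considered in Theorem 4.2."""
    radius = abs(np.sum(a_star)) * np.linalg.norm(w_star) / np.sqrt(k)
    v, a = unit_sphere(p, rng), uniform_ball(k, radius, rng)
    return [(s_v * v, s_a * a) for s_v in (1.0, -1.0) for s_a in (1.0, -1.0)]

Theorem 4.2 guarantees that at least one of the four candidates lands in the attraction basin of the global minimum, so one can run Algorithm 1 from each candidate and keep the run with the smallest final loss.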
4.1. Gradient Descent Can Converge to the Spurious Local Minimum

Theorem 4.2 shows that among $\{(v, a), (v, -a), (-v, a), (-v, -a)\}$, there is a pair that enables gradient descent to converge to the global minimum. Perhaps surprisingly, the next theorem shows that under some conditions on the underlying truth, there is also a pair that makes gradient descent converge to the spurious local minimum.

Theorem 4.3. Without loss of generality, we let $\|w^*\|_2 = 1$. Suppose $\left(1^\top a^*\right)^2 < \frac{1}{\mathrm{poly}(p)}\|a^*\|_2^2$ and $\eta$ is sufficiently small. Let $v \sim \mathrm{unif}\left(\mathbb{S}^{p-1}\right)$ and $a \sim \mathrm{unif}\left(B\left(0, \frac{|1^\top a^*|}{\sqrt{k}}\right)\right)$. Then with high probability, there exists $(v^0, a^0) \in \{(v, a), (v, -a), (-v, a), (-v, -a)\}$ such that $(a^0)^\top a^* < 0$, $|1^\top a^0| \le |1^\top a^*|$ and $g(\phi^0) \le \frac{-2(1^\top a^*)^2}{\|a^*\|_2^2} + 1$. If $(v^0, a^0)$ is used as the initialization, when Algorithm 1 converges we have
$$\theta(v, w^*) = \pi,\quad a = \left(11^\top + (\pi - 1)I\right)^{-1}\left(11^\top - I\right)a^*,\quad \text{and}\quad \ell(v, a) = \Omega\left(\|a^*\|_2^2\right).$$

Unlike Theorem 4.1, which requires no assumption on the underlying truth $a^*$, Theorem 4.3 assumes $\left(1^\top a^*\right)^2 < \frac{1}{\mathrm{poly}(p)}\|a^*\|_2^2$. This technical condition comes from the proof, which requires the invariance $g(\phi^t) \le \frac{-2(1^\top a^*)^2}{\|a^*\|_2^2} + 1$ to hold for all iterations. To ensure there exists $(v^0, a^0)$ which makes $g(\phi^0) \le \frac{-2(1^\top a^*)^2}{\|a^*\|_2^2} + 1$, we need $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ to be relatively small. See Section E for more technical insights.

A natural question is whether, as the ratio $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ becomes larger, the probability that randomly initialized gradient descent converges to the global minimum becomes larger as well. We verify this phenomenon empirically in Section 6.

5. Proof Sketch

In Section 5.1, we give qualitative, high-level intuition on why the initial conditions are sufficient for gradient descent to converge to the global minimum. In Section 5.2, we explain why gradient descent has two phases.

5.1. Qualitative Analysis of Convergence

The convergence to the global optimum relies on a geometric characterization of saddle points and a series of invariances throughout the gradient descent dynamics. The next lemma gives the analysis of stationary points. The main step is to check the first order condition of stationary points using Theorem 3.2.

Lemma 5.1 (Stationary Point Analysis). When gradient descent converges, if $a^\top a^* \ne 0$ and $\|v\|_2 < \infty$, we have either
$$\theta(v, w^*) = 0,\quad a = \|w^*\|_2\,a^*$$
or
$$\theta(v, w^*) = \pi,\quad a = \left(11^\top + (\pi - 1)I\right)^{-1}\left(11^\top - I\right)\|w^*\|_2\,a^*.$$

This lemma shows that when the algorithm converges and $a$ and $a^*$ are not orthogonal, then we arrive at either a globally optimal point or a local minimum.

Now recall the gradient formula of $v$: $\frac{\partial \ell(v, a)}{\partial v} = -\frac{1}{2\pi\|v\|_2}\left(I - \frac{vv^\top}{\|v\|_2^2}\right)a^\top a^*(\pi - \phi)w^*$. Notice that $\phi \le \pi$ and $\left(I - \frac{vv^\top}{\|v\|_2^2}\right)$ is just the projection matrix onto the complement of $v$. Therefore, the sign of the inner product between $a$ and $a^*$ plays a crucial role in the dynamics of Algorithm 1: if the inner product is positive, the gradient update will decrease the angle between $v$ and $w^*$, and if it is negative, the angle will increase. This observation is formalized in the lemma below.

Lemma 5.2 (Invariance I: The Angle between $v$ and $w^*$ always decreases.). If $(a^t)^\top a^* > 0$, then $\phi^{t+1} \le \phi^t$.

This lemma shows that when $(a^t)^\top a^* > 0$ for all $t$, gradient descent converges to the global minimum. Thus, we need to study the dynamics of $(a^t)^\top a^*$. For ease of presentation, without loss of generality, we assume $\|w^*\|_2 = 1$. By the gradient formula of $a$, we have
$$(a^{t+1})^\top a^* = \left(1 - \frac{\eta(\pi - 1)}{2\pi}\right)(a^t)^\top a^* + \frac{\eta\left(g(\phi^t) - 1\right)}{2\pi}\|a^*\|_2^2 + \frac{\eta}{2\pi}\left(\left(1^\top a^*\right)^2 - 1^\top a^t \cdot 1^\top a^*\right). \quad (4)$$
We can use induction to prove the invariance. If $(a^t)^\top a^* > 0$ and $\phi^t < \frac{\pi}{2}$, the first term of Equation (4) is non-negative. For the second term, notice that if $\phi^t < \frac{\pi}{2}$, we have $g(\phi^t) > 1$, so the second term is non-negative. Therefore, as long as $\left(1^\top a^*\right)^2 - 1^\top a^t \cdot 1^\top a^*$ is also non-negative, we have the desired invariance. The next lemma summarizes the above analysis.

Lemma 5.3 (Invariance II: Positive Signal from the Second Layer.). If $(a^t)^\top a^* > 0$, $0 \le 1^\top a^* \cdot 1^\top a^t \le \left(1^\top a^*\right)^2$, $0 < \phi^t < \pi/2$ and $\eta < 2$, then $(a^{t+1})^\top a^* > 0$.

It remains to prove $\left(1^\top a^*\right)^2 - 1^\top a^t \cdot 1^\top a^* > 0$. Again, we study the dynamics of this quantity. Using the gradient formula and some algebra, we have
$$1^\top a^{t+1} \cdot 1^\top a^* \le \left(1 - \frac{\eta(k + \pi - 1)}{2\pi}\right)1^\top a^t \cdot 1^\top a^* + \frac{\eta\left(k + g(\phi^t) - 1\right)}{2\pi}\left(1^\top a^*\right)^2 \le \left(1 - \frac{\eta(k + \pi - 1)}{2\pi}\right)1^\top a^t \cdot 1^\top a^* + \frac{\eta(k + \pi - 1)}{2\pi}\left(1^\top a^*\right)^2,$$
where we have used the fact that $g(\phi) \le \pi$ for all $0 \le \phi \le \frac{\pi}{2}$. Therefore we have
$$\left(1^\top a^*\right)^2 - 1^\top a^{t+1} \cdot 1^\top a^* \ge \left(1 - \frac{\eta(k + \pi - 1)}{2\pi}\right)\left(\left(1^\top a^*\right)^2 - 1^\top a^t \cdot 1^\top a^*\right).$$
These imply the third invariance.

Lemma 5.4 (Invariance III: Summation of Second Layer Always Small.). If $1^\top a^* \cdot 1^\top a^t \le \left(1^\top a^*\right)^2$ and $\eta < \frac{2\pi}{k + \pi - 1}$, then $1^\top a^* \cdot 1^\top a^{t+1} \le \left(1^\top a^*\right)^2$.
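The three invariances are easy to monitor numerically along a run of Algorithm 1. The sketch below computes, for a single iterate (v, a), the quantities tracked by Lemmas 5.2-5.4; calling it at every iteration (e.g., inside the gradient-descent loop sketched earlier) and checking that the angle is non-increasing, the inner product stays positive, and the summation quantity stays non-negative gives a direct empirical check of the induction argument. The function itself is our own illustration, not part of the paper.

import numpy as np

def invariance_report(v, a, w_star, a_star):
    """Quantities tracked by the three invariances (Lemmas 5.2-5.4).

    Returns (phi, inner, sum_gap) where
      phi     = theta(v, w*)                  (Invariance I: non-increasing over t),
      inner   = a^T a*                        (Invariance II: stays positive),
      sum_gap = (1^T a*)^2 - 1^T a* * 1^T a   (Invariance III: stays non-negative).
    """
    cos_phi = v @ w_star / (np.linalg.norm(v) * np.linalg.norm(w_star))
    phi = np.arccos(np.clip(cos_phi, -1.0, 1.0))
    inner = a @ a_star
    sum_gap = np.sum(a_star) ** 2 - np.sum(a_star) * np.sum(a)
    return phi, inner, sum_gap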

To sum up, if the initialization satisfies (1) $\phi^0 < \frac{\pi}{2}$, (2) $(a^0)^\top a^* > 0$ and (3) $1^\top a^* \cdot 1^\top a^0 \le \left(1^\top a^*\right)^2$, then with Lemmas 5.2, 5.3 and 5.4, by induction we can show convergence to the global minimum. Further, Theorem 4.2 shows that these three conditions hold with constant probability under random initialization.

5.2. Quantitative Analysis of the Two Phase Phenomenon

In this section we demonstrate why there is a two-phase phenomenon. Throughout this section, we assume the conditions in Section 5.1 hold. We first consider the convergence of the first layer. Because we are using weight normalization, only the angle between $v$ and $w^*$ affects the prediction. Therefore, in this paper, we study the dynamics of $\sin^2\phi^t$. The following lemma quantitatively characterizes the shrinkage of this quantity in one iteration.

Lemma 5.5 (Convergence of Angle between $v$ and $w^*$). Under the same assumptions as in Theorem 4.1, let $\beta^0 = \min\left\{(a^0)^\top a^*,\ (g(\phi^0) - 1)\|a^*\|_2^2\right\}\|w^*\|_2^2$. If the step size satisfies
$$\eta = O\left(\min\left\{\frac{\beta^0\cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{\cos\phi^0}{\left(\|a^*\|_2^2 + (1^\top a^*)^2\right)\|w^*\|_2^2},\ \frac{1}{k}\right\}\right),$$
we have
$$\sin^2\phi^{t+1} \le \left(1 - \eta\cos\phi^t\,\lambda^t\right)\sin^2\phi^t,\qquad \text{where}\quad \lambda^t = \frac{\|w^*\|_2\,(a^t)^\top a^*\,(\pi - \phi^t)}{2\pi\|v^t\|_2^2}.$$

This lemma shows the convergence rate depends on two crucial quantities, $\cos\phi^t$ and $\lambda^t$. At the beginning, both $\cos\phi^t$ and $\lambda^t$ are small. Nevertheless, Lemma C.3 shows $\lambda^t$ is universally lower bounded by $\Omega(\beta^0)$. Therefore, after $O\left(\frac{1}{\eta\cos\phi^0\beta^0}\right)$ iterations we have $\cos\phi^t = \Omega(1)$. Once $\cos\phi^t = \Omega(1)$, Lemma C.2 shows that, after $O\left(\frac{1}{\eta}\right)$ iterations, $(a^t)^\top a^*\|w^*\|_2 = \Omega\left(\|w^*\|_2^2\|a^*\|_2^2\right)$. Combining this with the facts that $\|v^t\|_2 \le 2$ (Lemma C.3) and $\phi^t < \pi/2$, we have $\cos\phi^t\lambda^t = \Omega\left(\|w^*\|_2^2\|a^*\|_2^2\right)$. Now we enter phase II.

In phase II, Lemma 5.5 shows
$$\sin^2\phi^{t+1} \le \left(1 - \eta C\|w^*\|_2^2\|a^*\|_2^2\right)\sin^2\phi^t$$
for some positive absolute constant $C$. Therefore, we have a much faster convergence rate than that in Phase I. After only $\widetilde{O}\left(\frac{1}{\eta\|w^*\|_2^2\|a\|_2^2}\log\frac{1}{\epsilon}\right)$ iterations, we obtain $\phi^t \le \epsilon$. Once we have this, we can use Lemma C.4 to show $\left|1^\top a - 1^\top a^*\right| \le O\left(\epsilon\|a^*\|_2\right)$ after $\widetilde{O}\left(\frac{1}{\eta k}\log\frac{1}{\epsilon}\right)$ iterations. Next, using Lemma C.5, we can show that after $\widetilde{O}\left(\frac{1}{\eta}\log\frac{1}{\epsilon}\right)$ iterations, $\|a - a^*\|_2 = O\left(\epsilon\|a^*\|_2\right)$. Lastly, Lemma C.6 shows that if $\|a - a^*\|_2 = O\left(\epsilon\|a^*\|_2\right)$ and $\phi = O(\epsilon)$, we have $\ell(v, a) = O\left(\epsilon^2\|a^*\|_2^2\right)$.

6. Experiments

In this section, we illustrate our theoretical results with numerical experiments. Again, without loss of generality, we assume $\|w^*\|_2 = 1$ in this section.

6.1. Multi-phase Phenomenon

In Figure 2, we set $k = 20$, $p = 25$ and we consider 4 key quantities in proving Theorem 4.1, namely, the angle between $v$ and $w^*$ (c.f. Lemma 5.5), $\|a - a^*\|_2$ (c.f. Lemma C.5), $\left|1^\top a - 1^\top a^*\right|$ (c.f. Lemma C.4) and the prediction error (c.f. Lemma C.6).

When we achieve the global minimum, all these quantities are 0. At the beginning (first ∼ 10 iterations), $\left|1^\top a - 1^\top a^*\right|$ and the prediction error drop quickly. This is because, for the gradient of $a$, $11^\top a^*$ is the dominating term, which will quickly make $11^\top a$ close to $11^\top a^*$. After that, for the next ∼ 200 iterations, all quantities decrease at a slow rate. This corresponds to the Phase I stage in Theorem 4.1; the rate is slow because the initial signal is small. After ∼ 200 iterations, all quantities drop at a much faster rate. This is because the signal is now strong and, since the convergence rate is proportional to this signal, we have a much faster convergence rate (c.f. Phase II of Theorem 4.1).

Figure 2. Convergence of the different measures we considered in proving Theorem 4.1. In the first ∼ 200 iterations, all quantities drop slowly. After that, these quantities converge at much faster linear rates.

6.2. Probability of Converging to the Global Minimum

In this section we test the probability of converging to the global minimum using the random initialization scheme described in Theorem 4.2. We set $p = 6$ and vary $k$ and $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$. We run 5000 random initializations for each $\left(k, \frac{(1^\top a^*)^2}{\|a^*\|_2^2}\right)$ and compute the probability of converging to the global minimum.

In Theorem 4.3, we showed that if $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ is sufficiently small, randomly initialized gradient descent converges to the spurious local minimum with constant probability. Table 1 empirically verifies the importance of this assumption.

(1ᵀa*)²/‖a*‖₂²:    0     1     4     9     16    25
k = 25           0.50  0.55  0.73  1     1     1
k = 36           0.50  0.53  0.66  0.89  1     1
k = 49           0.50  0.53  0.61  0.78  1     1
k = 64           0.50  0.51  0.59  0.71  0.89  1
k = 81           0.50  0.53  0.57  0.66  0.81  0.97
k = 100          0.50  0.50  0.57  0.63  0.75  0.90

Table 1. Probability of converging to the global minimum with different $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ and $k$. For every fixed $k$, when $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ becomes larger, the probability of converging to the global minimum becomes larger, and for every fixed ratio $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$, when $k$ becomes larger, the probability of converging to the global minimum becomes smaller.
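A sketch of the experiment behind Table 1: for a given k and target ratio (1ᵀa*)²/‖a*‖₂², draw many random initializations, run gradient descent from each, and record the fraction of runs whose final angle θ(v, w*) is close to 0 rather than π. The helper names (`uniform_ball`, `gradient_descent`) refer to the earlier sketches; the trial count, the success threshold, and the small floor on the ball radius (needed when 1ᵀa* = 0) are our own illustrative choices, not the paper's settings.

import numpy as np

def make_a_star(k, ratio, rng):
    """Construct a* with ||a*||_2 = 1 and (1^T a*)^2 = ratio (requires ratio <= k)."""
    alpha = np.sqrt(ratio / k)
    u = rng.standard_normal(k)
    u -= u.mean()                          # make u orthogonal to the all-ones vector
    u /= np.linalg.norm(u)
    return alpha * np.ones(k) / np.sqrt(k) + np.sqrt(1.0 - alpha ** 2) * u

def success_probability(p, k, ratio, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(p)
    w_star /= np.linalg.norm(w_star)
    a_star = make_a_star(k, ratio, rng)
    radius = max(abs(np.sum(a_star)) / np.sqrt(k), 0.1)   # hypothetical floor when 1^T a* = 0
    hits = 0
    for _ in range(trials):
        v0 = rng.standard_normal(p)
        v0 /= np.linalg.norm(v0)
        a0 = uniform_ball(k, radius, rng)                 # helper from the earlier sketch
        v, a = gradient_descent(v0, a0, w_star, a_star)   # helper from the earlier sketch
        cos_phi = v @ w_star / np.linalg.norm(v)
        hits += cos_phi > 0.9          # angle near 0: global minimum; near pi: spurious one
    return hits / trials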

For every fixed $k$, if $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$ becomes larger, the probability of converging to the global minimum becomes larger.

An interesting phenomenon is that for every fixed ratio $\frac{(1^\top a^*)^2}{\|a^*\|_2^2}$, when $k$ becomes larger, the probability of converging to the global minimum becomes smaller. How to quantitatively characterize the relationship between the success probability and the dimension of the second layer is an open problem.

7. Conclusion and Future Works

In this paper we proved the first polynomial convergence guarantee of a randomly initialized gradient descent algorithm for learning a one-hidden-layer convolutional neural network. Our result reveals an interesting phenomenon: randomly initialized local search can converge to a global minimum or a spurious local minimum. We give a quantitative characterization of the gradient descent dynamics to explain the two-phase convergence phenomenon. Experimental results also verify our theoretical findings. Here we list some future directions.

Our analysis focused on the population loss with Gaussian input. In practice one uses (stochastic) gradient descent on the empirical loss. Concentration results in (Mei et al., 2016; Soltanolkotabi, 2017) are useful to generalize our results to the empirical version. A more challenging question is how to extend the analysis of the gradient dynamics beyond rotationally invariant input distributions. Du et al. (2017b) proved the convergence of gradient descent under some structural input distribution assumptions for the one-layer convolutional neural network. It would be interesting to bring their insights to our setting.

Another interesting direction is to generalize our result to deeper and wider architectures. Specifically, an open problem is under what conditions randomly initialized gradient descent algorithms can learn a one-hidden-layer fully connected neural network or a convolutional neural network with multiple kernels. Existing results often require sufficiently good initialization (Zhong et al., 2017a;b). We believe the insights from this paper, especially the invariance principles in Section 5.1, are helpful for understanding the behavior of gradient-based algorithms in these settings.

8. Acknowledgment

This research was partly funded by NSF grant IIS1563887, AFRL grant FA8750-17-2-0212 and DARPA D17AP00001. J.D.L. acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. The authors thank Xiaolong Wang and Kai Zhong for useful discussions.

References

Agarwal, Naman, Allen-Zhu, Zeyuan, Bullins, Brian, Hazan, Elad, and Ma, Tengyu. Finding approximate local minima faster than gradient descent. In STOC, 2017. Full version available at http://arxiv.org/abs/1611.01146.

Bhojanapalli, Srinadh, Neyshabur, Behnam, and Srebro, Nati. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pp. 3873–3881, 2016.

Blum, Avrim and Rivest, Ronald L. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pp. 494–501, 1989.

Brutzkus, Alon and Globerson, Amir. Globally optimal gradient descent for a ConvNet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

Carmon, Yair, Duchi, John C, Hinder, Oliver, and Sidford, Aaron. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

Cho, Youngmin and Saul, Lawrence K. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.

Dauphin, Yann N, Fan, Angela, Auli, Michael, and Grangier, David. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.

Du, Simon S, Jin, Chi, Lee, Jason D, Jordan, Michael I, Poczos, Barnabas, and Singh, Aarti. Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412, 2017a.

Du, Simon S, Lee, Jason D, and Tian, Yuandong. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017b.

Feizi, Soheil, Javadi, Hamid, Zhang, Jesse, and Tse, David. Porcupine neural networks: (almost) all local optima are global. arXiv preprint arXiv:1710.02196, 2017.

Freeman, C Daniel and Bruna, Joan. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.

Ge, Rong, Huang, Furong, Jin, Chi, and Yuan, Yang. Escaping from saddle points − online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pp. 797–842, 2015.

Ge, Rong, Lee, Jason D, and Ma, Tengyu. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pp. 2973–2981, 2016.

Ge, Rong, Jin, Chi, and Zheng, Yi. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pp. 1233–1242, 2017a.

Ge, Rong, Lee, Jason D, and Ma, Tengyu. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017b.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Goel, Surbhi and Klivans, Adam. Eigenvalue decay implies polynomial-time learnability for neural networks. arXiv preprint arXiv:1708.03708, 2017a.

Goel, Surbhi and Klivans, Adam. Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010, 2017b.

Goel, Surbhi, Kanade, Varun, Klivans, Adam, and Thaler, Justin. Reliably learning the ReLU in polynomial time. arXiv preprint arXiv:1611.10258, 2016.

Haeffele, Benjamin D and Vidal, René. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.

Hardt, Moritz and Ma, Tengyu. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

Hardt, Moritz and Price, Eric. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pp. 2861–2869, 2014.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Janzamin, Majid, Sedghi, Hanie, and Anandkumar, Anima. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.

Kawaguchi, Kenji. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Yann, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–50. Springer, 1998.

Lee, Jason D, Simchowitz, Max, Jordan, Michael I, and Recht, Benjamin. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246–1257, 2016.

Li, Xingguo, Wang, Zhaoran, Lu, Junwei, Arora, Raman, Haupt, Jarvis, Liu, Han, and Zhao, Tuo. Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296, 2016.

Li, Yuanzhi and Yuan, Yang. Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886, 2017.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pp. 855–863, 2014.

Mei, Song, Bai, Yu, and Montanari, Andrea. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Neyshabur, Behnam, Salakhutdinov, Ruslan R, and Srebro, Nati. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2015.

Nguyen, Quynh and Hein, Matthias. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017a.

Nguyen, Quynh and Hein, Matthias. The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017b.

Panigrahy, Rina, Rahimi, Ali, Sachdeva, Sushant, and Zhang, Qiuyi. Convergence results for neural networks via electrodynamics. In LIPIcs-Leibniz International Proceedings in Informatics, volume 94. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

Park, Dohyung, Kyrillidis, Anastasios, Carmanis, Constantine, and Sanghavi, Sujay. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. In Artificial Intelligence and Statistics, pp. 65–74, 2017.

Safran, Itay and Shamir, Ohad. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774–782, 2016.

Safran, Itay and Shamir, Ohad. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Sedghi, Hanie and Anandkumar, Anima. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.

Shalev-Shwartz, Shai, Shamir, Ohad, and Shammah, Shaked. Failures of gradient-based deep learning. In International Conference on Machine Learning, pp. 3067–3075, 2017a.

Shalev-Shwartz, Shai, Shamir, Ohad, and Shammah, Shaked. Weight sharing is crucial to succesful optimization. arXiv preprint arXiv:1706.00687, 2017b.

Shamir, Ohad. Distribution-specific hardness of learning neural networks. arXiv preprint arXiv:1609.01037, 2016.

Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Šíma, Jiří. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728, 2002.

Soltanolkotabi, Mahdi. Learning ReLUs via gradient descent. arXiv preprint arXiv:1705.04591, 2017.

Sun, Ju, Qu, Qing, and Wright, John. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.

Tian, Yuandong. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.

Xie, Bo, Liang, Yingyu, and Song, Le. Diverse neural network learns true target functions. In Artificial Intelligence and Statistics, pp. 1216–1224, 2017.

Zhang, Yuchen, Lee, Jason D, Wainwright, Martin J, and Jordan, Michael I. Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948, 2015.

Zhong, Kai, Song, Zhao, and Dhillon, Inderjit S. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.

Zhong, Kai, Song, Zhao, Jain, Prateek, Bartlett, Peter L, and Dhillon, Inderjit S. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017b.

Zhou, Pan and Feng, Jiashi. The landscape of deep learning algorithms. arXiv preprint arXiv:1705.07038, 2017.

A. Proofs of Section 3

Proof of Theorem 3.1. We first expand the loss function directly.

` (v, a)

1 2 = y − a>σ (Z) w E 2 ∗ > h ∗ ∗ >i ∗ > h >i > h ∗ >i ∗ = (a ) E σ (Zw ) σ (Zw ) a + a E σ (Zw) σ (Zw) a − 2a E σ (Zw) σ (Zw ) a = (a∗)> A (w∗) a∗ + a>A (w) a − 2a>B (w, w∗) w∗. where for simplicity, we denote

h >i A(w) =E σ (Zw) σ (Zw) (5)

∗ h ∗ >i B (w, w ) =E σ (Zw) σ (Zw ) . (6)

For i 6= j, using the second identity of Lemma A.1, we can compute

1 A(w) = σ Z>w σ Z>w = kwk2 ij E i E j 2π 2

For i = j, using the second moment formula of half-Gaussian distribution we can compute

1 A (w) = kwk2 . ii 2 2

Therefore 1 A(w) = kwk2 11> + (π − 1) I . 2π 2

Now let us compute B (w, w∗). For i 6= j, similar to A(w)ij, using the independence property of Gaussian, we have

1 B (w, w ) = kwk kw∗k . ∗ ij 2π 2 2

Next, using the fourth identity of Lemma A.1, we have

1 B (w, w∗) = (cos φ (π − φ) + sin φ) kwk kw∗k . ii 2π 2 2

Therefore, we can also write B (w, w∗) in a compact form

1 B (w, w∗) = kwk kw∗k 11> + (cos φ (π − φ) + sin φ − 1) I . 2π 2 2

Plugging in the formulas of A(w) and B (w, w∗) and w = v , we obtain the desired result. kvk2

Proof of Theorem 3.2. We first compute the expect gradient for v. From(Salimans & Kingma, 2016), we know

! ∂` (v, a) 1 vv> ∂` (w, a) = I − . ∂v kvk 2 ∂w 2 kvk2 Gradient Descent Learns One-hidden-layer CNN

Recall the gradient formula,

∂` (Z, w, a) ∂w k k ! k ! X ∗ X ∗ ∗ X  > = ai σ (Ziw) − ai σ (Zw ) aiZiI Zi w i=1 i=1 i=1  k  X 2 >  > X >  > > =  ai ZiZi I Zi w ≥ 0 + aiajZiZj I Zi w ≥ 0, Zj w ≥ 0  w (7) i=1 i6=j

 k  X ∗ >  > > ∗ X ∗ ∗  > > ∗ ∗ −  aiai ZiZi I Zi w ≥ 0, Zi w ≥ 0 + aiaj ZiZj I Zi w ≥ 0, Zj w ≥ 0  w . (8) i=1 i6=j

Now we calculate expectation of Equation (7) and (8) separately. For (7), by first two formulas of Lemma A.1, we have

 k  X 2 >  > X >  > >  ai ZiZi I Zi w ≥ 0 + aiajZiZj I Zi w ≥ 0, Zj w ≥ 0  w i=1 i6=j k X w X w = a2 · + a a . i 2 i j 2π i=1 i6=j

For (8), we use the second and third formula in Lemma A.1 to obtain

 k  X ∗ >  > > ∗ X ∗ ∗  > > ∗ ∗  aiai ZiZi I Zi w ≥ 0, Zi w ≥ 0 + aiaj ZiZj I Zi w ≥ 0, Zj w ≥ 0  w i=1 i6=j  1 1 kw∗k  X 1 kw∗k =a>a∗ (π − φ) w∗ + sin φ 2 w + a a∗ 2 w. π π kwk i j 2π kwk 2 i6=j 2

In summary, aggregating them together we have

∂` (Z, w, a) EZ ∂w 2 P > ∗ ∗ P ∗ ! 1 kak aiaj a a sin φ kw k aja kw k = a>a∗ (π − φ) w∗ + 2 + i6=j + 2 + i6=j j ∗ 2 w. 2π 2 2π 2π kwk2 2π kwk2

As a sanity check, this formula matches Equation (16) of (Brutzkus & Globerson, 2017) when a = a∗ = 1. Next, we calculate the expected gradient of a. Recall the gradient formula of a

∂`(Z, w, a) = a>σ (Zw) − (a∗)>σ (Zw∗) σ (Zw) a =σ (Zw) σ (Zw)> a − σ (Zw) σ (Zw∗)> a∗

Taking expectation we have

∂` (w, a) = A (w) a − B (w, w∗) a∗ ∂a where A (w) and B (w, w∗) are defined in Equation (5) and (6). Plugging in the formulas for A (w) and B (w, w∗) derived in the proof of Theorem 3.1 we obtained the desired result. Gradient Descent Learns One-hidden-layer CNN

Lemma A.1 (Useful Identities). Given $w$, $w^*$ with angle $\phi$ and a standard Gaussian random vector $z$, we have
$$\mathbb{E}\left[zz^\top \mathbb{I}\left\{z^\top w \ge 0\right\}\right]w = \frac{1}{2}w,$$
$$\mathbb{E}\left[z\,\mathbb{I}\left\{z^\top w \ge 0\right\}\right] = \frac{1}{\sqrt{2\pi}}\frac{w}{\|w\|_2},$$
$$\mathbb{E}\left[zz^\top \mathbb{I}\left\{z^\top w \ge 0,\ z^\top w_* \ge 0\right\}\right]w_* = \frac{1}{2\pi}(\pi - \phi)w_* + \frac{1}{2\pi}\sin\phi\,\frac{\|w^*\|_2}{\|w\|_2}\,w,$$
$$\mathbb{E}\left[\sigma\left(z^\top w\right)\sigma\left(z^\top w_*\right)\right] = \frac{1}{2\pi}\left(\cos\phi\,(\pi - \phi) + \sin\phi\right)\|w\|_2\|w^*\|_2.$$
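Before the proof, these identities are straightforward to sanity-check by Monte Carlo. The sketch below uses unit-norm directions and an arbitrary dimension, sample size, and tolerance; it only illustrates the statements and is not part of the proof.

import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal(4); w /= np.linalg.norm(w)
w_star = rng.standard_normal(4); w_star /= np.linalg.norm(w_star)
phi = np.arccos(np.clip(w @ w_star, -1.0, 1.0))

n = 500_000
z = rng.standard_normal((n, 4))
ind_w = (z @ w > 0)[:, None]                      # 1{z^T w >= 0}
ind_both = ((z @ w > 0) & (z @ w_star > 0))[:, None]

lhs1 = (z * ind_w).T @ z @ w / n                  # E[z z^T 1{z^T w >= 0}] w
lhs2 = (z * ind_w).mean(axis=0)                   # E[z 1{z^T w >= 0}]
lhs3 = (z * ind_both).T @ z @ w_star / n          # E[z z^T 1{both}] w*
lhs4 = np.mean(np.maximum(z @ w, 0) * np.maximum(z @ w_star, 0))

print(np.allclose(lhs1, 0.5 * w, atol=0.01))
print(np.allclose(lhs2, w / np.sqrt(2 * np.pi), atol=0.01))
rhs3 = ((np.pi - phi) * w_star + np.sin(phi) * w) / (2 * np.pi)
print(np.allclose(lhs3, rhs3, atol=0.01))
rhs4 = (np.cos(phi) * (np.pi - phi) + np.sin(phi)) / (2 * np.pi)
print(np.allclose(lhs4, rhs4, atol=0.01))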

d×d  > Proof. Consider an orthonormal basis of R : eiej with e1 k w. Then for i 6= j, we know

 >  >  heiej, E zz I z w ≥ 0 i = 0 by the independence properties of Gaussian random vector. For i = j = 1,

h 2 i 1 he e>, zz> z>w ≥ 0 i = z>w z>w ≥ 0 = i j E I E I 2

>  >  >  where the last step is by the property of half-Gaussian. For i = j 6= j, heiej , E zz I z w ≥ 0 i = 1 by standard Gaussian second moment formula. Therefore, zz> z>w ≥ 0  w = 1 w. z z>w ≥ 0  = √1 w can be E I 2 E I 2π proved by mean formula of half-normal distribution. To prove the third identity, consider an orthonormal basis of Rd×d:  > eiej with e1 k w∗ and w lies in the plane spanned by e1 and e2. Using the polar representation of 2D Gaussian random 2 1 variables (r is the radius and θ is the angle with dPr = r exp(−r /2) and dPθ = 2π ): Z ∞ Z π/2 >  >  > >  1 3 2  2 he1e1 , E zz I z w ≥ 0, z w∗ ≥ 0 i = r exp −r /2 dr · cos θdθ 2π 0 −π/2+φ 1 = (π − φ + sin φ cos φ) , 2π Z ∞ Z π/2 >  >  > >  1 3 2  he1e2 , E zz I z w ≥ 0, z w∗ ≥ 0 i = r exp −r /2 dr · sin θ cos θdθ 2π 0 −π/2+φ 1 = sin2 φ , 2π Z ∞ Z π/2 >  >  > >  1 3 2  2 he2e2 , E zz I z w ≥ 0, z w∗ ≥ 0 i = r exp −r /2 dr · sin θdθ 2π 0 −π/2+φ 1 = (π − φ − sin φ cos φ) . 2π

w¯ −cos φe1 Also note that e2 = sin φ . Therefore 1 1 w¯ − cos φe zz> z>w ≥ 0, z>w ≥ 0  w = (π − φ + sin φ cos φ) w∗ + sin2 φ · 1 kw∗k E I ∗ ∗ 2π 2π sin φ 2 1 1 kw∗k = (π − φ) w∗ + sin φ 2 w. 2π 2π kwk2

For the fourth identity, focusing on the plane spanned by w and w∗, using the polar decomposition, we have Z ∞ Z π/2  >  >  1 3 2  ∗ E σ z w σ z w∗ = r exp −r /2 dr · (cos θ cos φ + sin θ sin φ) cos θdθ kwk2 kw k2 2π 0 −π/2+φ 1 = cos φ (π − φ + sin φ cos φ) + sin3 φ kwk kw∗k . 2π 2 2 Gradient Descent Learns One-hidden-layer CNN B. Proofs of Qualitative Convergence Results

Proof of Lemma 5.1. When Algorithm 1 converges, since $a^\top a^* \ne 0$ and $\|v\|_2 < \infty$, using the gradient formula in Theorem 3.2, we know that either $\pi - \phi = 0$ or $\left(I - \frac{vv^\top}{\|v\|_2^2}\right)w^* = 0$. For the second case, since $I - \frac{vv^\top}{\|v\|_2^2}$ is a projection matrix onto the complement space of $v$, $\left(I - \frac{vv^\top}{\|v\|_2^2}\right)w^* = 0$ is equivalent to $\theta(v, w^*) = 0$. Once the angle between $v$ and $w^*$ is fixed, using the gradient formula for $a$ we obtain the desired formulas for the stationary points.

Proof of Lemma 5.2. By the gradient formula for $v$, if $a^\top a^* > 0$, the gradient descent update moves $v$ in the direction $c\left(I - \frac{vv^\top}{\|v\|_2^2}\right)w^*$ for some $c > 0$. Because $I - \frac{vv^\top}{\|v\|_2^2}$ is the projection matrix onto the complement space of $v$, the gradient update always makes the angle smaller.

C. Proofs of Quantitative Convergence Results

C.1. Useful Technical Lemmas

We first prove the lemma about the convergence of $\phi^t$.

Proof of Lemma 5.5. We consider the dynamics of sin2 φt.

sin2 φt+1

 > 2 vt+1 w∗ =1 − t+1 2 ∗ 2 kv k2 kw k2 2  t ∂` > ∗ v − η ∂vt w =1 −   t 2 2 ∂` 2 ∗ 2 kv k2 + η ∂vt kw k2

 t > ∗ t 2 > (a ) a (π−φ ) 2 (vt) v + η · sin2 φt kwk 2πkvk2 2 =1 −  t > ∗ t 2 2 t ∗ 4 t 2 ∗ 2 2 (a ) a (π−φ ) sin φ kw k2 kv k2 kw k2 + η 2π t 2 kv k2 t > 2 2 3 (a ) a(π−φ) kvtk kw∗k cos2 φt + 2η kw∗k · · sin2 φt cos φt ≤1 − 2 2 2 2π  t > ∗ t 2 2 t ∗ 4 t 2 ∗ 2 2 (a ) a (π−φ ) sin φ kw k2 kv k2 kw k2 + η 2π t 2 kv k2

> > 2 ∗ at a(π−φ)  at a∗(π−φ)   ∗ 2 2 t kw k2 ( ) 2 t t 2 ( ) 2 t kw k2 sin φ − 2η t 2 · 2π · sin φ cos φ + η 2π sin φ 2 kv k2 kvk2 =  t > ∗ 2  ∗ 2 2 (a ) a (π−φ) 2 t kw k2 1 + η 2π sin φ t 2 kv k2 !2 !2 kw∗k (at)> a (π − φ) (at)> a∗ (π − φ) kw∗k ≤ sin2 φt − 2η 2 · · sin2 φt cos φt + η2 sin2 φt 2 t 2 2π 2π t 2 kv k2 kv k2 where in the first inequality we dropped term proportional to O(η4) because it is negative, in the last equality, we divided t 2 ∗ 2 numerator and denominator by kv k2 kw k2 and the last inequality we dropped the denominator because it is bigger than 1.  >  kw∗k at a∗ π−φt t 2 ( ) ( ) Therefore, recall λ = t 2 and we have 2πkv k2

 2 sin2 φt+1 ≤ 1 − 2η cos φtλt + η2 λt sin2 φt. (9)

cos φt t 2 To this end, we need to make sure η ≤ λt . Note that since kv k2 is monotonically increasing, it is lower bounded by 1. t t > ∗  ∗ 2 > ∗2 2 Next notice φ ≤ π/2. Finally, from Lemma C.2, we know (a ) a ≤ ka k2 + 1 a kwk2. Combining these, we Gradient Descent Learns One-hidden-layer CNN have an upper bound  ∗ 2 > ∗2 ∗ 2 ka k2 + 1 a kw k2 λt ≤ . 4 Plugging this back to Equation (9) and use our assumption on η, we have sin2 φt+1 ≤ 1 − η cos φtλt sin2 φt.

t t t+1> ∗ n t > ∗  g(φ )−1 ∗ 2 t > ∗ g(φ )−1 ∗ 2o Lemma C.1. a a ≥ min (a ) a + η π−1 ka k2 − (a ) a , π−1 ka k2

Proof. Recall the dynamics of (at)> a∗. t >  η (π − 1) > η (g(φ ) − 1) η  2  at+1 a∗ = 1 − at a∗ + ka∗k2 + 1>a∗ − 1>a∗ 1>at 2π 2π 2 2π t  η (π − 1) > η (g(φ ) − 1) ≥ 1 − at a∗ + ka∗k2 2π 2π 2

t t > ∗ g(φ )−1 ∗ 2 where the inequality is due to Lemma 5.4. If (a ) a ≥ π−1 ka k2, t t >  η (π − 1) g(π ) − 1 η (g(φ )) at+1 a∗ ≥ 1 − ka∗k2 + ka∗k2 2π π − 1 2 π − 1 2 g(φt) − 1 = ka∗k2 . π − 1 2

t t > ∗ g(φ )−1 ∗ 2 t+1> ∗ If (a ) a ≤ π−1 ka k2, simple algebra shows a a increases by at least t g(φ ) − 1 >  η ka∗k2 − at a∗ . π − 1 2

A simple corollary is a>a∗ is uniformly lower bounded. 0 t > ∗ n 0> ∗ g(φ )−1 ∗ 2o Corollary C.1. For all t = 1, 2,..., (a ) a ≥ min a a , π−1 ka k2 .

> ∗  ∗ 2 This lemma also gives an upper bound of number of iterations to make a a = Θ ka k2 .

1 > ∗  ∗ 2 Corollary C.2. If g(φ) − 1 = Ω (1), then after η iterations, a a = Θ ka k2 .

> ∗ 1 g(φ) ∗ 2 > ∗ g(φ) ∗ 2 Proof. Note if g(φ) − 1 = Ω (1) and a a ≤ 2 · π−1 ka k2, each iteration a a increases by η π−1 ka k2.

We also need an upper bound of (at)> a∗. t > ∗  ∗ 2 > ∗2 ∗ 2 Lemma C.2. For t = 0, 1,..., (a ) a ≤ ka k2 + 1 a kw k2.

∗ t > ∗ Proof. Without loss of generality, assume kw k2 = 1. Again, recall the dynamics of (a ) a . t >  η (π − 1) > η (g(φ ) − 1) η  2  at+1 a∗ = 1 − at a∗ + ka∗k2 + 1>a∗ − 1>a∗ 1>at 2π 2π 2 2π

 η (π − 1) > η (π − 1) η (π − 1) 2 ≤ 1 − at a∗ + ka∗k2 + 1>a∗ . 2π 2π 2 2π

t > ∗ ∗ 2 > ∗2 Now we prove by induction, suppose the conclusion holds at iteration t, (a ) a ≤ ka k2 + 1 a . Plugging in we have the desired result. Gradient Descent Learns One-hidden-layer CNN

C.2. Convergence of Phase I In this section we prove the convergence of Phase I.

 1  t Proof of Convergence of Phase I. Lemma C.3 implies after O cos φ0β0 iterations, cos φ = Ω (1), which implies t g(φ )−1  1  t > ∗ ∗  ∗ 2 ∗ 2 π−1 = Ω (1). Using Corollary C.2, we know after O η iterations we have (a ) a kw k = Ω kw k2 ka k2 .

The main ingredient of the proof of phase I is the follow lemma where we use a joint induction argument to show the t t convergence of φ and a uniform upper bound of kv k2. 0 n 0> ∗ 0  ∗ 2o ∗ 2 Lemma C.3. Let β = min a a , g(φ ) − 1 ka k2 kw k2. If the step size satisfies η ≤   β∗ cos φ0 cos φ0 2π min ∗ 2 > ∗ 2 ∗ 2 , ∗ 2 > ∗ 2 ∗ 2 , k+π−1 , we have for t = 0, 1,... 8(ka k2+(1 a ) )kw k2 (ka k2+(1 a ) )kw k2

 0 0 t 2 t cos φ β t sin φ ≤ 1 − η · and v ≤ 2. 8 2

Proof. We prove by induction. The initialization ensure when t = 0, the conclusion is correct. Now we consider the t 2 dynamics of kv k2. Note because the gradient of v is orthogonal to v (Salimans & Kingma, 2016), we have a simple t 2 dynamic of kv k2. 2 t 2 t−1 2 2 ∂` (v, a) v 2 = v 2 + η ∂v 2 2 t > ∗ t−1! 2 t ∗ 2 t−1 2 2 (a ) a π − φ sin φ kw k2 = v + η 2 2π t 2 kv k2 t−1 2 2  ∗ 2 > ∗2 ∗ 2 2 t−1 ≤ v 2 + η ka k2 + 1 a kw k2 sin φ t−1 2  ∗ 2 > ∗2 ∗ 2 X 2 i =1 + η ka k2 + 1 a kw k2 sin φ i=1  2 8 ≤1 + η2 ka∗k2 + 1>a∗ kw∗k2 2 2 η cos φ0β0 ≤2 where the first inequality is by Lemma C.2 and the second inequality we use our induction hypothesis. Recall λt = ∗  t > ∗ t kw k a a π−φ 0 2 ( ) ( ) t t β t 2 . The uniform upper bound of kvk2 and the fact that φ ≤ π/2 imply a lower bound λ ≥ 8 . 2πkv k2 Plugging in Lemma 5.5, we have

 cos φ0β0   cos φ0β0 t+1 sin2 φt+1 ≤ 1 − η sin2 φt ≤ 1 − η . 8 8

We finish our joint induction proof.

C.3. Analysis of Phase II In this section we prove the convergence of phase II and necessary auxiliary lemmas.

>   T1  ∗ ∗ ∗ 2 ∗ 2 T1 Proof of Convergence of Phase II. At the beginning of Phase II, a a kw k = Ω kw k2 ka k2 and g(φ ) − 1 = t > ∗ ∗  ∗ 2 ∗ 2 Ω (1). Therefore, Lemma C.1 implies for all t = T1,T1 + 1,..., (a ) a kw k = Ω kw k2 ka k2 . Combining with the   ∗ 2 ∗ 2 T1 fact that kvk2 ≤ 2 (c.f. Lemma C.3), we obtain a lower bound λt ≥ Ω kw k2 ka k2 We also know that cos φ = Ω (1) Gradient Descent Learns One-hidden-layer CNN

t t and cos φ is monotinically increasing (c.f. Lemma 5.2), so for all t = T1,T1 + 1,..., cos φ = Ω (1). Plugging in these two lower bounds into Theorem 5.5, we have

2 t+1  ∗ 2 ∗ 2 2 t sin φ ≤ 1 − ηC kw k2 ka k2 sin φ .

    ∗ 10 1 1  2 t 10 ka k2 for some absolute constant C. Thus, after O ∗ 2 ∗ 2 log  iterations, we have sin φ ≤ min  ,  > ∗ , ηkw k2ka k2 |1 a | n ∗ o t ka k2 which implies π − g(φ ) ≤ min ,  |1>a∗| . Now using Lemma C.4,Lemma C.5 and Lemma C.6, we have after  1 1  ∗ 2 ∗ 2 Oe ηk log  iterations ` (v, a) ≤ C1 ka k2 kw k2 for some absolute constant C1. Rescaling  properly we obtain the desired result.

C.3.1. TECHNICAL LEMMASFOR ANALYZING PHASE II In this section we provide some technical lemmas for analyzing Phase II. Because of the positive homogeneity property, ∗ without loss of generality, we assume kw k2 = 1. ∗   1>a∗−1>a0  0 ka k2 1 | | > ∗ > T ∗ Lemma C.4. If π − g(φ ) ≤  > ∗ , after T = O log ∗ iterations, 1 a − 1 a ≤ 2 ka k . |1 a | ηk ka k2 2

Proof. Recall the dynamics of 1>at.

 η (k + π − 1) η (k + g(φt) − 1) 1>at+1 = 1 − 1>at + 1>a∗ 2π 2π  η (k + π − 1) η (k + g(φt) − 1) = 1 − 1>at + 1>a∗. 2π 2π

Assume 1>a∗ > 0 (the other case is similar). By Lemma 5.4 we know 1>at < 1>a∗ for all t. Consider

 η (k + π − 1) η (π − g(φt)) 1>a∗ − 1>at+1 = 1 − 1>a∗ − 1>a∗ + 1>a∗. 2π 2π

Therefore we have

(π − g (φt)) 1>a∗  η (k + π − 1)  (π − g (φt)) 1>a∗  1>a∗ − 1>at+1 − = 1 − 1>a∗ − 1>a∗ − . k + π − 1 2π k + π − 1

  > ∗ > 0  t > ∗ 1 |1 a −1 a | > ∗ > t (π−g(φ ))1 a ∗ After T = O log ∗ iterations, we have 1 a − 1 a − ≤  ka k , which ηk ka k2 k+π−1 2 > ∗ > t ∗ implies1 a − 1 a ≤ 2 ka k2 .

∗   a∗−a0  0 ka k2 > ∗ > 0  ∗ 1 k k2 Lemma C.5. If π − g(φ ) ≤  > ∗ and 1 a − 1 a ≤ ka k , then after T = O log ∗ iterations, |1 a | k 2 η ka k2 ∗ 0 ∗ a − a 2 ≤ C ka k2 for some absolute constant C.

Proof. We first consider the inner product

∂` (vt, at) h , at − a∗i at t π − 1 t ∗ 2 g(φ ) − π ∗ > t ∗ t ∗ > > ∗ = a − a − (a ) a − a + a − a 11 a − a 2π 2 2π t π − 1 t ∗ 2 g(φ ) − π ∗ t ∗ ≥ a − a − ka k a − a . 2π 2 2π 2 2 Gradient Descent Learns One-hidden-layer CNN

Next we consider the squared norm of gradient 2 ∂` (v, a) 1 t ∗ t  ∗ > t ∗ 2 = 2 (π − 1) a − a + π − g(φ ) a + 11 a − a 2 ∂a 2 4π 3  2 t ∗ 2 t 2 ∗ 2 2 > t > ∗2 ≤ (π − 1) a − a + π − g(φ ) ka k + k 1 a − 1 a . 4π2 2 2 t ∗ ∗ Suppose ka − a k2 ≤  ka k2, then t t 2 ∂` (v , a ) t ∗ π − 1 t ∗ 2  ∗ 2 h , a − a i ≥ a − a − ka k at 2π 2 2π 2 2 ∂` (v, a) 2 ∗ 2 ≤ 3 ka k2 . ∂a 2 Therefore we have   t+1 ∗ 2 η (π − 1) t ∗ 2 2 2 a − a ≤ 1 − a − a + 4η kak 2 2π 2 2 ∗ 2   2 ∗ 2 ! t+1 ∗ 2 8 (π − 1)  ka k2 η (π − 1) t ∗ 2 8 (π − 1)  ka k2 ⇒ a − a − ≤ 1 − a − a − . 2 π − 1 2π 2 π − 1

 1 1  t+1 ∗ 2 ∗ Thus after O η  iterations, we must have a − a 2 ≤ C ka k2 for some large absolute constant C. Rescaling , we obtain the desired result.

∗ ∗ ∗ ∗ Lemma C.6. If π − g(φ) ≤  and ka − a kw k2k ≤  ka k2 kw k2, then the population loss satisfies ` (v, a) ≤ ∗ 2 ∗ 2 C ka k2 kw k2 for some constant C > 0.

Proof. The result follows by plugging in the assumptions in Theorem 3.1.

D. Proofs of Initialization Scheme Proof of Theorem 4.2. The proof of the first part of Theorem 4.2 just uses the symmetry of unit sphere and ball and the  > ∗  |1 a | second part is a direct application of Lemma 2.5 of (Hardt & Price, 2014). Lastly, since a0 ∼ B 0 √ , we have k √ > 0 0 0 > ∗ ∗ 1 a ≤ a 1 ≤ k a 2 ≤ 1 a kw k2 where the second inequality is due to Holder’s¨ inequality.

E. Proofs of Converging to Spurious Local Minimum Proof of Theorem 4.3. The main idea is similar to Theorem 4.1 but here we show w → −w∗ (without loss of generality, ∗ > ∗ we assume kw k2 = 1). Different from Theorem 4.1, here we need to prove the invariance a a < 0, which implies our > 2 t > ∗ > t > ∗ 0 −2(1 a) k+π−1 desired result. We prove by induction, suppose (a ) a > 0, 1 a ≤ 1 a , g φ ≤ ∗ 2 + 1 and η < 2π . ka k2 > 2 > t > ∗ 0 −2(1 a) Note 1 a ≤ 1 a are satisfied by Lemma 5.4 and g φ ≤ ∗ 2 + 1 by our initialization condition and induction ka k2 hypothesis that implies φt is increasing. Recall the dynamics of (at)> a∗. t >  η (π − 1) > η (g (φ ) − 1) η  2  at+1 a∗ = 1 − at a∗ + ka∗k2 + 1>a∗ − 1>at 1>a∗ 2π 2π 2 2π  t ∗ > ∗2 η (g(φ ) − 1) ka k2 + 2 1 a ≤ < 0 2π where the first inequality we used our induction hypothesis on inner product between at and a∗ and 1>at ≤ 1>a∗ and the second inequality is by induction hypothesis on φt. Thus when gradient descent algorithm converges, according ∗ > −1 >  ∗ ∗ Lemma 5.1, θ (v, w ) = π, a = 11 + (π − 1) I 11 − I kw k2 a . Plugging these into Theorem 3.1, with some  ∗ 2 ∗ 2 routine algebra, we show ` (v, a) = Ω kw k2 ka k2 .