On the Dynamics of Gradient Descent for Autoencoders

Thanh V. Nguyen⋆, Raymond K. W. Wong†, Chinmay Hegde⋆
⋆Iowa State University, †Texas A&M University

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

Abstract

We provide a series of results for unsupervised learning with autoencoders. Specifically, we study shallow two-layer autoencoder architectures with shared weights. We focus on three generative models for data that are common in statistical machine learning: (i) the mixture-of-Gaussians model, (ii) the sparse coding model, and (iii) the sparsity model with non-negative coefficients. For each of these models, we prove that under suitable choices of hyperparameters, architectures, and initialization, autoencoders learned by gradient descent can successfully recover the parameters of the corresponding model. To our knowledge, this is the first result that rigorously studies the dynamics of gradient descent for weight-sharing autoencoders. Our analysis can be viewed as theoretical evidence that shallow autoencoder modules indeed can be used as feature learning mechanisms for a variety of data models, and may shed insight on how to train larger stacked architectures with autoencoders as basic building blocks.

1 Introduction

1.1 Motivation

Due to the resurgence of neural networks and deep learning, there has been growing interest in the community towards a thorough and principled understanding of training neural networks, in both theoretical and algorithmic aspects. This has led to several important breakthroughs recently, including provable algorithms for learning shallow (1-hidden layer) networks with nonlinear activations [1, 2, 3, 4], deep networks with linear activations [5], and residual networks [6, 7].

A typical approach adopted by this line of work is as follows: assume that the data obeys a ground truth generative model (induced by simple but reasonably expressive data-generating distributions), and prove that the weights learned by the proposed algorithms (either exactly or approximately) recover the parameters of the generative model. Indeed, such distributional assumptions are necessary to overcome known NP-hardness barriers for learning neural networks [8]. Nevertheless, the majority of these approaches have focused on neural network architectures for supervised learning, barring a few exceptions which we detail below.

1.2 Our contributions

In this paper, we complement this line of work by providing new theoretical results for unsupervised learning using neural networks. Our focus here is on shallow two-layer autoencoder architectures with shared weights. Conceptually, we build upon previous theoretical results on learning autoencoder networks [9, 10, 11], and we elaborate on the novelty of our work in the discussion of prior work below.

Our setting is standard: we assume that the training data consists of i.i.d. samples from a high-dimensional distribution parameterized by a generative model, and we train the weights of the autoencoder using ordinary (batch) gradient descent. We consider three families of generative models that are commonly adopted in machine learning: (i) the Gaussian mixture model with well-separated centers [12]; (ii) the $k$-sparse model, specified by sparse linear combinations of atoms [13]; and (iii) the non-negative $k$-sparse model [11]. While these models are traditionally studied separately depending on the application, all of these model families can be expressed via a unified, generic form:

$y = Ax^* + \eta, \qquad (1)$

which we (loosely) dub the generative bilinear model. In this form, $A$ is a ground truth $n \times m$ matrix, $x^*$ is an $m$-dimensional latent code vector, and $\eta$ is an independent $n$-dimensional random noise vector. The samples $y$ are what we observe. Different choices of $n$ and $m$, as well as different assumptions on $A$ and $x^*$, lead to the three aforementioned generative models.

Under these three generative models, and with suitable choices of hyper-parameters, initial estimates, and autoencoder architectures, we rigorously prove that:

Two-layer autoencoders, trained with (normalized) gradient descent over the reconstruction loss, provably learn the parameters of the underlying generative bilinear model.

To the best of our knowledge, our work is the first to analytically characterize the dynamics of gradient descent for training two-layer autoencoders. Our analysis can be viewed as theoretical evidence that shallow autoencoders can be used as feature learning mechanisms (provided the generative modeling assumptions hold), a view that seems to be widely adopted in practice. Our analysis highlights the following interesting conclusions: (i) the activation function of the hidden (encoder) layer influences the choice of bias; (ii) the bias of each hidden neuron in the encoder plays an important role in achieving convergence of gradient descent; and (iii) the gradient dynamics depend on the complexity of the generative model. Further, we speculate that our analysis may shed insight on practical considerations for training deeper networks with stacked autoencoder layers as building blocks [9].

1.3 Techniques

Our analysis is built upon recent algorithmic developments in the sparse coding literature [14, 15, 16]. Sparse coding corresponds to the setting where the synthesis coefficient vector $x^{*(i)}$ in (1) for each data sample $y^{(i)}$ is assumed to be $k$-sparse, i.e., $x^{*(i)}$ has at most $k \ll m$ non-zero elements. The exact algorithms proposed in these papers are all quite different, but at a high level, all of these methods involve establishing a notion that we dub "support consistency". Broadly speaking, for a given data sample $y^{(i)} = Ax^{*(i)} + \eta^{(i)}$, the idea is that when the parameter estimates are close to the ground truth, it is possible to accurately estimate the true support of the synthesis vector $x^{*(i)}$ for each data sample $y^{(i)}$.

We extend this to a broader family of generative models to form a notion that we call "code consistency". We prove that if initialized appropriately, the weights of the hidden (encoder) layer of the autoencoder provide useful information about the sign pattern of the corresponding synthesis vectors for every data sample. Somewhat surprisingly, the choice of activation function of each neuron in the hidden layer plays an important role in establishing code consistency and affects the possible choices of bias.

The code consistency property is crucial for establishing the correctness of gradient descent over the reconstruction loss. This turns out to be rather tedious due to the weight sharing, a complication that requires a substantial departure from the existing machinery for the analysis of sparse coding algorithms, and it forms the bulk of the technical difficulty in our proofs. Nevertheless, we are able to derive explicit linear convergence rates for all the generative models listed above. We do not attempt to analyze other training schemes (such as stochastic gradient descent or dropout), but anticipate that our analysis may lead to further work along those directions.

1.4 Comparison with prior work

Recent advances in algorithmic learning theory have led to numerous provably efficient algorithms for learning Gaussian mixture models, sparse codes, topic models, and ICA (see [12, 13, 14, 15, 16, 17, 18, 19] and references therein). We omit a complete treatment of prior work due to space constraints.

We would like to emphasize that we do not propose a new algorithm or autoencoder architecture, nor are we the first to highlight the applicability of autoencoders to the aforementioned generative models. Indeed, generative models such as $k$-sparsity models have served as the motivation for the development of deep stacked (denoising) autoencoders dating back to the work of [20]. The paper [9] proves that stacked weight-sharing autoencoders can recover the parameters of sparsity-based generative models, but their analysis succeeds only for certain generative models whose parameters are themselves randomly sampled from certain distributions. In contrast, our analysis holds for a broader class of networks; we make no randomness assumptions on the parameters of the generative models themselves.

More recently, autoencoders have been shown to learn sparse representations [21]. The recent paper [11] demonstrates that under the sparse generative model, the standard squared-error reconstruction loss of ReLU autoencoders exhibits (with asymptotically many samples) critical points in a neighborhood of the ground truth dictionary. However, they do not analyze gradient dynamics, nor do they establish convergence rates. We complete this line of work by proving explicitly that gradient descent (with column-wise normalization) in the asymptotic limit exhibits linear convergence up to a radius around the ground truth parameters.

2 Preliminaries

Notation. Denote by $x_S$ the sub-vector of $x \in \mathbb{R}^m$ indexed by the elements of $S \subseteq [m]$. Similarly, let $W_S$ be the sub-matrix of $W \in \mathbb{R}^{n \times m}$ with columns indexed by the elements of $S$. Also, define $\mathrm{supp}(x) \triangleq \{i \in [m] : x_i \neq 0\}$ as the support of $x$, $\mathrm{sgn}(x)$ as the element-wise sign of $x$, and $\mathbb{1}_E$ as the indicator of an event $E$.

We adopt standard asymptotic notation: $f(n) = O(g(n))$ (or $f(n) = \Omega(g(n))$) if there exists some constant $C > 0$ such that $|f(n)| \leq C|g(n)|$ (respectively, $|f(n)| \geq C|g(n)|$). Next, $f(n) = \Theta(g(n))$ is equivalent to $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$. Also, $f(n) = \omega(g(n))$ if $\lim_{n \to \infty} |f(n)/g(n)| = \infty$. In addition, $g(n) = O^*(f(n))$ indicates $|g(n)| \leq K|f(n)|$ for some sufficiently small constant $K$. Throughout, we use the phrase "with high probability" (abbreviated w.h.p.) to describe any event with failure probability at most $n^{-\omega(1)}$.

[Figure 1: Architecture of a shallow 2-layer autoencoder network. The encoder and the decoder share the weights.]

2.1 Two-Layer Autoencoders

We focus on shallow autoencoders with a single hidden layer, $n$ neurons in the input/output layer, and $m$ hidden neurons. We consider the weight-sharing architecture in which the encoder has weights $W^T \in \mathbb{R}^{m \times n}$ and the decoder uses the shared weight $W \in \mathbb{R}^{n \times m}$. The architecture of the autoencoder is shown in Fig. 1. Denote by $b \in \mathbb{R}^m$ the vector of biases for the encoder (we do not consider a decoder bias). As such, for a given data sample $y \in \mathbb{R}^n$, the encoding and decoding can be modeled respectively as:

$x = \sigma(W^T y + b) \quad \text{and} \quad \hat{y} = Wx, \qquad (2)$

where $\sigma(\cdot)$ denotes the activation function of the encoder neurons. We consider two types of activation functions: (i) the rectified linear unit,

$\mathrm{ReLU}(z) = \max(z, 0),$

and (ii) the hard thresholding operator,

$\mathrm{threshold}_\lambda(z) = z \cdot \mathbb{1}_{|z| > \lambda}.$

When applied to a vector (or matrix), these functions operate element-wise and return a vector (respectively, matrix) of the same size. Our choice of the activation function $\sigma(\cdot)$ varies with the data generative model and will be clear from context.

Herein, the loss function is the (squared) reconstruction error:

$L = \tfrac{1}{2}\|y - \hat{y}\|^2 = \tfrac{1}{2}\|y - W\sigma(W^T y + b)\|^2,$

and we analyze the expected loss, where the expectation is taken over the data distribution (specified below). Inspired by the literature on the analysis of sparse coding [11, 14, 22], we investigate the landscape of the expected loss so as to shed light on the dynamics of gradient descent for training the above autoencoder architectures. Indeed, we show that for a variety of data distributions, such autoencoders can recover the distribution parameters via suitably initialized gradient descent.
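To make the setup concrete, the encoder/decoder maps in (2) and the reconstruction loss above can be written in a few lines of NumPy. This is only an illustrative sketch under the conventions of this section; the function names and shapes are our own, not part of the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hard_threshold(z, lam):
    # threshold_lambda(z) = z * 1{|z| > lambda}, applied element-wise
    return z * (np.abs(z) > lam)

def encode(W, b, y, activation):
    # x = sigma(W^T y + b); W is n x m, y has length n, b has length m
    return activation(W.T @ y + b)

def decode(W, x):
    # y_hat = W x; the decoder re-uses the encoder weights W
    return W @ x

def reconstruction_loss(W, b, y, activation):
    # L = 0.5 * ||y - W sigma(W^T y + b)||^2
    y_hat = decode(W, encode(W, b, y, activation))
    return 0.5 * float(np.sum((y - y_hat) ** 2))
```

For the hard-thresholding encoder one would pass, e.g., `lambda z: hard_threshold(z, 0.5)` as the `activation` argument; for the ReLU encoder, simply `relu`.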

2.2 Generative Bilinear Model

We now describe an overarching generative model for the data samples. Specifically, we posit that the data samples $\{y^{(i)}\}_{i=1}^N \subset \mathbb{R}^n$ are drawn according to the following "bilinear" model:

$y = Ax^* + \eta, \qquad (3)$

where $A \in \mathbb{R}^{n \times m}$ is a ground truth set of parameters, $x^* \in \mathbb{R}^m$ is a latent code vector, and $\eta \in \mathbb{R}^n$ represents noise. Depending on the assumptions made on $A$ and $x^*$, this model generalizes various popular cases, such as mixtures of spherical Gaussians, sparse coding, non-negative sparse coding, and independent component analysis (ICA). We will elaborate further on specific cases, but in general our generative model satisfies the following generic assumptions:

A1. The code $x^*$ is supported on a set $S$ of size at most $k$, such that $p_i = \mathbb{P}[i \in S] = \Theta(k/m)$, $p_{ij} = \mathbb{P}[i, j \in S] = \Theta(k^2/m^2)$, and $p_{ijl} = \mathbb{P}[i, j, l \in S] = \Theta(k^3/m^3)$;

A2. Nonzero entries are independent; moreover, $\mathbb{E}[x_i^* \mid i \in S] = \kappa_1$ and $\mathbb{E}[x_i^{*2} \mid i \in S] = \kappa_2 < \infty$;

A3. For $i \in S$, $|x_i^*| \in [a_1, a_2]$ with $0 \leq a_1 \leq a_2 \leq \infty$;

A4. The noise term $\eta$ is distributed according to $\mathcal{N}(0, \sigma_\eta^2 I)$ and is independent of $x^*$.

As special cases of the above model, we consider the following variants.

Mixture of spherical Gaussians: We consider the standard Gaussian mixture model with $m$ centers, which is one of the most popular generative models encountered in machine learning applications. We model the means of the Gaussians as the columns of the matrix $A$. To draw a data sample $y$, we sample $x^*$ uniformly from the canonical basis $\{e_i \in \mathbb{R}^m\}_{i=1}^m$ with probability $p_i = \Theta(1/m)$. As such, $x^*$ has sparsity parameter $k = 1$, with the only nonzero element being 1. That means $\kappa_1 = \kappa_2 = a_1 = a_2 = 1$.

Sparse coding: This is a well-known instance of the above structured linear model, where the goal is to learn an overcomplete dictionary $A$ that sparsely represents the input $y$. It has a rich history in various fields of signal processing, machine learning, and neuroscience [23]. The generative model described above has successfully enabled recent theoretical advances in sparse coding [13, 14, 15, 16, 24]. The latent code vector $x^*$ is assumed to be $k$-sparse, with nonzero entries that are sub-Gaussian and bounded away from zero. Therefore, $a_1 > 0$ and $a_2 = \infty$. We assume that the distribution of the nonzero entries is standardized such that $\kappa_1 = 0$ and $\kappa_2 = 1$. Note that the condition on $\kappa_2$ further implies that $a_1 \leq 1$.

Non-negative sparse coding: This is another variant of the above sparse coding model where the elements of the latent code $x^*$ are additionally required to be non-negative [11]. In some sense this is a generalization of the Gaussian mixture model described above. Since the code vector is non-negative, we do not impose the standardization used in the previous case of general sparse coding ($\kappa_1 = 0$ and $\kappa_2 = 1$); instead, we assume a compact interval for the nonzero entries, that is, $a_1$ and $a_2$ are positive and bounded.

Having established probabilistic settings for these models, we now establish certain deterministic conditions on the true parameters $A$ to enable analysis. First, we require each column $A_i$ to be normalized to unit norm in order to avoid the scaling ambiguity between $A$ and $x^*$. (Technically, this condition is not required for the mixture-of-Gaussians case since $x^*$ is binary; however, we make this assumption anyway to keep the treatment generic.) Second, we require the columns of $A$ to be "sufficiently distinct"; this is formalized by adopting the notion of pairwise incoherence.

Definition 1. Suppose that $A \in \mathbb{R}^{n \times m}$ has unit-norm columns. $A$ is said to be $\mu$-incoherent if for every pair of column indices $(i, j)$, $i \neq j$, we have $|\langle A_i, A_j \rangle| \leq \mu/\sqrt{n}$.

Though this definition is motivated by the sparse coding literature, pairwise incoherence is sufficiently general to enable identifiability of all the aforementioned models. For the mixture of Gaussians with unit-norm means, pairwise incoherence states that the means are well-separated, which is a standard assumption. In the case of Gaussian mixtures, we assume that $m = O(1) \ll n$. For sparse coding, we focus on learning overcomplete dictionaries where $n \leq m = O(n)$. For the sparse coding case, we further require a spectral norm bound on $A$, i.e., $\|A\| \leq O(\sqrt{m/n})$ (in other words, $A$ is well-conditioned).

Our eventual goal is to show that training the autoencoder via gradient descent can effectively recover the generative model parameter $A$. To this end, we need a measure of goodness of recovery. Noting that any recovery method can only recover $A$ up to a permutation ambiguity in the columns (and a sign-flip ambiguity in the case of sparse coding), we first define an operator $\pi$ that permutes the columns of the matrix (and multiplies each column individually by $+1$ or $-1$ in the case of sparse coding). Then, we define our measure of goodness:

Definition 2 ($\delta$-closeness and $(\delta, \xi)$-nearness). A matrix $W$ is said to be $\delta$-close to $A$ if there exists an operator $\pi(\cdot)$ defined as above such that $\|\pi(W)_i - A_i\| \leq \delta$ for all $i$. We say $W$ is $(\delta, \xi)$-near to $A$ if in addition $\|\pi(W) - A\| \leq \xi\|A\|$.

To simplify notation, we simply replace $\pi$ by the identity operator, while keeping in mind that we are only recovering an element of the equivalence class of all permutations and sign-flips of $A$.
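To keep these special cases concrete, the three variants can be summarized by a single sampling routine. The following sketch is one concrete instantiation of assumptions A1-A4; the uniform support, the standard-normal nonzeros, and the interval [0.5, 1.5] are our illustrative choices, not prescribed by the paper.

```python
import numpy as np

def sample_bilinear_model(A, k, sigma_eta, rng, variant="sparse"):
    """Draw one sample y = A x* + eta from the generative bilinear model (3)."""
    n, m = A.shape
    x_star = np.zeros(m)
    if variant == "gaussian_mixture":
        # k = 1 and the single nonzero entry equals 1 (kappa_1 = kappa_2 = 1)
        x_star[rng.integers(m)] = 1.0
    elif variant == "sparse":
        # k-sparse code with standardized nonzeros (kappa_1 = 0, kappa_2 = 1);
        # the paper additionally assumes |x*_i| >= a_1 > 0 on the support
        S = rng.choice(m, size=k, replace=False)
        x_star[S] = rng.standard_normal(k)
    elif variant == "nonnegative":
        # non-negative k-sparse code with entries in a compact interval [a_1, a_2]
        S = rng.choice(m, size=k, replace=False)
        x_star[S] = rng.uniform(0.5, 1.5, size=k)
    eta = sigma_eta * rng.standard_normal(n)   # noise ~ N(0, sigma_eta^2 I)
    return A @ x_star + eta, x_star
```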
Armed with the above definitions and assumptions, we are now ready to state our results. Since the precise mathematical guarantees are somewhat tedious and technical, we summarize them here as informal theorem statements and elaborate more precisely in the following sections.

Our first main result establishes the code consistency of weight-sharing autoencoders under all the generative linear models described above, provided that the weights are suitably initialized.

Theorem 1 (informal). Consider a sample $y = Ax^* + \eta$. Let $x = \sigma(W^T y + b)$ be the output of the encoder part of the autoencoder. Suppose that $W$ is $\delta$-close to $A$ with $\delta = O^*(1/\log n)$.

(i) If $\sigma(\cdot)$ is either the ReLU or the hard thresholding activation, then the support of the true code vector $x^*$ matches that of $x$ for the mixture-of-Gaussians and non-negative sparse coding generative models.

(ii) If $\sigma(\cdot)$ is the hard thresholding activation, then the support of $x^*$ matches that of $x$ for the sparse coding generative model.

Our second main result leverages the above property. We show that iterative gradient descent over the weights $W$ linearly converges to a small neighborhood of the ground truth.

Theorem 2 (informal). Suppose that the initial weight $W^0$ is $(\delta, 2)$-near to $A$. Given asymptotically many samples drawn from the above models, an iterative gradient update of $W$ can linearly converge to a small neighborhood of the ground truth $A$.

We formally present these technical results in the next sections. Note that we analyze the encoding and the gradient given $W$ at iteration $s$; however, we often skip the superscript for clarity.

3 Initialization

Our main result is a local analysis of the learning dynamics of two-layer autoencoders. More specifically, we prove that (batch) gradient descent linearly converges to the ground truth parameter $A$ given an initialization $W^0$ that is $O^*(1/\log n)$ column-wise close to the ground truth. Although the recovery error at convergence is exponentially better than the initial $1/\log n$ order, a natural question is how to achieve this initialization requirement in the first place. In practice, random initialization for autoencoders is a common strategy and often leads to surprisingly good results [25, 26]. In theory, however, the validity of random initialization is still an open problem. For the $k$-sparse model, the authors of [16] introduce an algorithm that provably produces such a coarse estimate of $A$ using spectral methods, and this algorithm applies directly to our autoencoder setting. We conjecture that this spectral algorithm still works in the non-negative sparse case (including the special mixture-of-Gaussians model), although, due to the non-negativity, more complicated treatments involving concentration arguments and sign flips of the columns are required. We leave this to future work.

4 Encoding Stage

Our technical results start with the analysis of the encoding stage in the forward pass. We rigorously prove that the encoding performed by the autoencoder is sufficiently good in the sense that it recovers part of the information in the latent code $x^*$ (specifically, the signed support of $x^*$). This is achieved based on appropriate choices of activation function and biases, and a good $W$ within a close neighborhood of the true parameters $A$. We call this property code consistency:

Theorem 3 (Code consistency). Let $x = \sigma(W^T y + b)$. Suppose $W$ is $\delta$-close to $A$ with $\delta = O^*(1/\log n)$ and the noise satisfies $\sigma_\eta = O(1/\sqrt{n})$. Then the following results hold:

(i) General $k$-sparse code with thresholding activation: Suppose $\mu \leq \sqrt{n}/\log^2 n$ and $k \leq n/\log n$. If $x = \mathrm{threshold}_\lambda(W^T y + b)$ with $\lambda = a_1/2$ and $b = 0$, then with high probability $\mathrm{sgn}(x) = \mathrm{sgn}(x^*)$.

(ii) Non-negative $k$-sparse code with ReLU activation: Suppose $\mu \leq \sqrt{n}/k$ and $k = O(1/\delta^2)$. If $x = \mathrm{ReLU}(W^T y + b)$ and $b_i \in [-(1 - \delta)a_1 + \delta a_2\sqrt{k},\ -\delta a_2\sqrt{k}]$ for all $i$, then with high probability $\mathrm{supp}(x) = \mathrm{supp}(x^*)$.

(iii) Non-negative $k$-sparse code with thresholding activation: Suppose $\mu \leq \sqrt{n}/k$ and $k = O(1/\delta^2)$. If $x = \mathrm{threshold}_\lambda(W^T y + b)$ with $\lambda = a_1/2$ and $b = 0$, then with high probability $\mathrm{supp}(x) = \mathrm{supp}(x^*)$.

The full proof of Theorem 3 is relegated to Appendix A. Here, we provide a short proof for the mixture-of-Gaussians generative model, which is really a special case of (ii) and (iii) above, where $k = 1$ and the nonzero component of $x^*$ equals 1 (i.e., $\kappa_1 = \kappa_2 = a_1 = a_2 = 1$).

Proof. Denote $z = W^T y + b$ and $S = \mathrm{supp}(x^*) = \{j\}$. Let $i$ be fixed and consider two cases. If $i = j$, then

$z_i = \langle W_i, A_i \rangle + \langle W_i, \eta \rangle + b_i \geq (1 - \delta^2/2) - \sigma_\eta\sqrt{\log n} + b_i > 0$

w.h.p., due to the fact that $\langle W_i, A_i \rangle \geq 1 - \delta^2/2$ (Claim 1) and the conditions $\sigma_\eta = O(1/\sqrt{n})$ and $b_i > -1 + \delta$.

On the other hand, if $i \neq j$, then using Claims 1 and 2 in Appendix A, we have w.h.p.

$z_i = \langle W_i, A_j \rangle + \langle W_i, \eta \rangle + b_i \leq \mu/\sqrt{n} + \delta + \sigma_\eta\sqrt{\log n} + b_i < 0,$

for $b_i \leq -2\delta$, $\mu \leq \sqrt{n}/k$, and $\sigma_\eta = O(1/\sqrt{n})$. Due to Claim 2, these bounds hold w.h.p. uniformly over all $i$, and hence $x = \mathrm{ReLU}(z)$ has the same support as $x^*$ w.h.p.

Moreover, one can also see that when $b_i = 0$, then w.h.p. $z_i > 1/2$ if $i = j$ and $z_i < 1/4$ otherwise. This holds w.h.p. uniformly over all $i$, and therefore $x = \mathrm{threshold}_{1/2}(z)$ has the same support as $x^*$ w.h.p. ∎

Note that for the non-negative case, both the ReLU and the thresholding activation lead to a correct support of the code, but this requires $k = O(1/\delta^2)$, which is rather restrictive and might be a limitation of the current analysis. Also, in Theorem 3, $b$ is required to be negative for the ReLU activation for any $\delta > 0$, due to the error in the current estimate $W$. However, this is consistent with the conclusion of [27] that a negative bias is desirable for the ReLU activation to produce sparse codes. Note that such choices of $b$ also introduce a statistical bias (error) into the nonzero code entries and make it harder to construct a provably correct learning procedure (Section 5) for the ReLU activation.

Part (i) of Theorem 3 mirrors the consistency result established for sparse coding in [16].

Next, we apply the above result to show that, given the consistency property, a (batch) gradient update of the weights $W$ (and of the bias in certain cases) converges to the true model parameters.
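Before turning to the learning stage, the consistency claim can be checked numerically. The sketch below estimates, for the mixture-of-Gaussians case ($k = 1$), how often the ReLU code $x = \mathrm{ReLU}(W^T y + b)$ has the same support as $x^*$; the function names and the particular bias value are our own choices for illustration, not part of the paper.

```python
import numpy as np

def support(x, tol=1e-12):
    return set(np.flatnonzero(np.abs(x) > tol))

def code_consistency_rate(A, W, b, sigma_eta, num_trials=1000, seed=0):
    """Fraction of samples y = A_j + eta for which supp(ReLU(W^T y + b)) = {j}."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    hits = 0
    for _ in range(num_trials):
        j = rng.integers(m)                       # x* = e_j (the k = 1 case)
        y = A[:, j] + sigma_eta * rng.standard_normal(n)
        x = np.maximum(W.T @ y + b, 0.0)          # ReLU encoding
        hits += (support(x) == {j})
    return hits / num_trials
```

With incoherent unit-norm columns, a matrix $W$ that is $\delta$-close to $A$, and a bias of roughly $-2\delta$, the returned rate should be close to one, in line with Theorem 3 (ii).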
5 Learning Stage

In this section, we show that a gradient descent update for $W$ of the autoencoder (followed by a normalization of the Euclidean column norms of the updated weights) leads to linear convergence to a small neighborhood of the ground truth $A$ under the aforementioned generative models. For this purpose, we analyze the gradient of the expected loss with respect to $W$. Our analysis involves calculating the expected value of the gradient as if we were given infinitely many samples; the finite-sample analysis is left as future work.

Since both the ReLU and the hard thresholding activation functions are non-differentiable at some points, we will formulate an approximate gradient. Wherever differentiable, the gradient of the loss $L$ with respect to the column $W_i \in \mathbb{R}^n$ of the weight matrix $W$ is given by:

$\nabla_{W_i} L = -\sigma'(W_i^T y + b_i)\,\big[(W_i^T y + b_i)I + yW_i^T\big]\,(y - Wx), \qquad (4)$

where $x = \sigma(W^T y + b)$ and $\sigma'(z_i)$ is the derivative of $\sigma(z_i)$ at $z_i$ wherever $\sigma$ is differentiable. For the rectified linear unit $\mathrm{ReLU}(z_i) = \max(z_i, 0)$, the derivative is

$\sigma'(z_i) = 1$ if $z_i > 0$, and $\sigma'(z_i) = 0$ if $z_i < 0$.

On the other hand, for the hard thresholding activation $\mathrm{threshold}_\lambda(z_i) = z_i\mathbb{1}_{|z_i| > \lambda}$, the derivative is

$\sigma'(z_i) = 1$ if $|z_i| > \lambda$, and $\sigma'(z_i) = 0$ if $|z_i| < \lambda$.

One can see that in both cases, the derivative $\sigma'(\cdot)$ at $z_i = W_i^T y + b_i$ resembles the indicator $\mathbb{1}_{x_i \neq 0} = \mathbb{1}_{\sigma(z_i) \neq 0}$ except where it is not defined. This observation motivates us to approximate $\nabla_{W_i} L$ with a simpler rule obtained by replacing $\sigma'(W_i^T y + b_i)$ with $\mathbb{1}_{x_i \neq 0}$:

$\widetilde{\nabla}_i L = -\mathbb{1}_{x_i \neq 0}\,(W_i^T y\, I + b_i I + yW_i^T)(y - Wx).$

In fact, [11] (Lemma 5.1) shows that this approximate gradient $\widetilde{\nabla}_i L$ is a good approximation of the true gradient (4) in expectation. Since $A$ is assumed to have normalized columns (with $\|A_i\| = 1$), we can enforce this property on the iterates by a simple column normalization after every update; to denote this, we use the operator $\mathrm{normalize}(\cdot)$ that returns a matrix $\mathrm{normalize}(B)$ with unit-norm columns, i.e.,

$\mathrm{normalize}(B)_i = B_i / \|B_i\|,$

for any matrix $B$ that has no all-zero columns.
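Putting these ingredients together, one step of the procedure analyzed below (approximate gradient, column normalization, and, in the ReLU case, a geometrically decaying bias) can be sketched as follows. This is our own batched restatement of the update rule, not the authors' code; the learning rate and the constant C are placeholders to be set as prescribed by Theorem 4 below.

```python
import numpy as np

def normalize_columns(B):
    # normalize(B)_i = B_i / ||B_i||  (assumes no all-zero columns)
    return B / np.linalg.norm(B, axis=0, keepdims=True)

def approx_gradient(W, b, Y, activation):
    """Average of the approximate gradient over the samples in the columns of Y:
    column i is -(1/N) sum_t 1{x_i != 0} [ (W_i^T y + b_i)(y - Wx) + (W_i^T (y - Wx)) y ]."""
    Z_pre = W.T @ Y + b[:, None]              # pre-activations, shape (m, N)
    X = activation(Z_pre)                     # codes x, shape (m, N)
    R = Y - W @ X                             # residuals y - Wx, shape (n, N)
    ind = (X != 0).astype(float)              # indicator 1{x_i != 0}
    Z = Z_pre * ind                           # (W_i^T y + b_i) kept only on the support
    N = Y.shape[1]
    return -(R @ Z.T + Y @ (ind * (W.T @ R)).T) / N   # shape (n, m)

def descent_step(W, b, Y, activation, lr, C=2.0):
    G = approx_gradient(W, b, Y, activation)
    W_new = normalize_columns(W - lr * G)     # W^{s+1} = normalize(W^s - zeta g^s)
    b_new = b / C                             # b^{s+1} = b^s / C (ReLU case; keep b = 0 for thresholding)
    return W_new, b_new
```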

Our convergence result leverages the code consistency property of Theorem 3, but in turn succeeds only under constraints on the biases $b$ of the hidden neurons. For the thresholding activation, we can show that the simple choice of setting all biases to zero leads to both code consistency and linear convergence. However, for the ReLU activation, the range of biases specified in Theorem 3 (ii) has a profound effect on the descent procedure. Roughly speaking, we need a non-zero bias to ensure code consistency, but a large bias can adversely impact gradient descent. Indeed, our current analysis does not succeed for any constant choice of bias (i.e., we do not find a constant bias that leads to both support consistency and linear convergence). To resolve this issue, we propose to use a simple sequence of biases $b$ whose magnitude diminishes along the iterations of the algorithm. Overall, this combination of approximate gradient and normalization leads to an update rule that certifies the existence of a linearly convergent algorithm (up to a neighborhood of $A$). The results are formally stated as follows:

Theorem 4 (Descent property). Suppose that at step $s$ the weight $W^s$ is $(\delta^s, 2)$-near to $A$. There exists an iterative update rule using an approximate gradient $g^s$, namely $W^{s+1} = \mathrm{normalize}(W^s - \zeta g^s)$, that linearly converges to $A$ when given infinitely many fresh samples. More precisely, there exists some $\tau \in (1/2, 1)$ such that:

(i) Mixture of Gaussians: Suppose the conditions in either (ii) or (iii) of Theorem 3 hold. Suppose that the learning rate is $\zeta = \Theta(m)$ and that the bias vector $b$ satisfies: (i.1) $b = 0$ if $x = \mathrm{threshold}_{1/2}(W^T y + b)$; or (i.2) $b^{s+1} = b^s/C$ if $x = \mathrm{ReLU}(W^T y + b)$, for some constant $C > 1$. Then $\|W^{s+1} - A\|_F^2 \leq (1 - \tau)\|W^s - A\|_F^2 + O(mn^{-O(1)})$.

(ii) General $k$-sparse code: Suppose the conditions in Theorem 3 (i) hold and the learning rate is $\zeta = \Theta(m/k)$. Then $\|W^{s+1} - A\|_F^2 \leq (1 - \tau)\|W^s - A\|_F^2 + O(mk^2/n^2)$.

(iii) Non-negative $k$-sparse code: Suppose the conditions in either (ii) or (iii) of Theorem 3 hold. Suppose that the learning rate is $\zeta = \Theta(m/k)$ and that the bias $b$ satisfies: (iii.1) $b = 0$ if $x = \mathrm{threshold}_{a_1/2}(W^T y + b)$; or (iii.2) $b^{s+1} = b^s/C$ if $x = \mathrm{ReLU}(W^T y + b)$, for some constant $C > 1$. Then $\|W^{s+1} - A\|_F^2 \leq (1 - \tau)\|W^s - A\|_F^2 + O(k^3/m)$.

Recall the approximate gradient of the squared loss:

$\widetilde{\nabla}_i L = -\mathbb{1}_{x_i \neq 0}\,(W_i^T y\, I + b_i I + yW_i^T)(y - Wx).$

We will use this form to construct the desired update rule with linear convergence. Let us consider an update step $g^s$ in expectation over the code $x^*$ and the noise $\eta$:

$g_i = \mathbb{E}\big[-\mathbb{1}_{x_i \neq 0}\,(W_i^T y\, I + b_i I + yW_i^T)(y - Wx)\big]. \qquad (5)$

To prove Theorem 4, we compute $g_i$ under the generative models described in (3) and then argue the descent. Here, we provide a proof sketch for (again) the simplest case of the mixture of Gaussians; the full proof is deferred to Appendix B.

Proof of Theorem 4 (i). Based on Theorem 3, one can explicitly compute the expectation in (5). Specifically, the expected gradient $g_i$ is of the form

$g_i = -p_i\lambda_i A_i + p_i(\lambda_i^2 + 2b_i\lambda_i + b_i^2)W_i + \epsilon,$

where $\lambda_i = \langle W_i, A_i \rangle$ and $\|\epsilon\| = O(n^{-\omega(1)})$. If we can choose $b_i$ such that $\lambda_i^2 + 2b_i\lambda_i + b_i^2$ is close to $\lambda_i$ for all $i$, then $-g_i$ roughly points in the desired direction $A_i$, and a descent property can be established via the following result:

Lemma 1. Suppose $W$ is $\delta$-close to $A$ and the bias satisfies $|(b_i + \lambda_i)^2 - \lambda_i| \leq 2\delta(1 - \delta)$. Then

$2\langle g_i, W_i - A_i \rangle \geq p_i\lambda_i(\lambda_i - 2\delta^2)\|W_i - A_i\|^2 + \frac{1}{p_i\lambda_i}\|g_i\|^2 - \frac{1}{p_i\lambda_i}\|\epsilon\|^2.$

From Lemma 1, one can prove the descent property using [16] (Theorem 6). We apply this lemma with learning rate $\zeta = \max_i (1/(p_i\lambda_i))$ and $\tau = \zeta p_i\lambda_i(\lambda_i - 2\delta^2) \in (0, 1)$ to achieve the descent as follows:

$\|\widetilde{W}_i^{s+1} - A_i\|^2 \leq (1 - \tau)\|W_i^s - A_i\|^2 + O(n^{-K}),$

where $\widetilde{W}^{s+1} = W^s - \zeta g^s$ and $K$ is some constant greater than 1. Finally, we use Lemma 5 to obtain the descent property for the normalized $W_i^{s+1}$.

Now, we determine when the bias conditions in Theorem 3 and Lemma 1 hold simultaneously for the different choices of activation function. For the hard thresholding function we do not need a bias (i.e., $b_i = 0$ for every $i$); the condition then reduces to $\lambda_i(1 - \lambda_i) \leq 2\delta(1 - \delta)$, and the lemma clearly applies.

On the other hand, if we encode $x = \mathrm{ReLU}(W^T y + b)$, then we need every bias $b_i$ to satisfy $b_i \in [-1 + 2\delta^s\sqrt{k},\ -\delta^s\sqrt{k}]$ and $|(b_i + \lambda_i^s)^2 - \lambda_i^s| \leq 2\delta(1 - \delta)$. Since $\lambda_i^s = \langle W_i^s, A_i \rangle \to 1$ and $\delta^s \to 0$, these conditions force $b_i \to 0$ as $s$ increases. Hence, a fixed bias does not work for the rectified linear unit. Instead, we design a simple update for the bias (and this is enough to prove convergence in the ReLU case). Here is the intuition. The gradient of $L$ with respect to $b_i$ is given by

$\nabla_{b_i} L = -\sigma'(W_i^T y + b_i)\, W_i^T(y - Wx).$

Similarly to the update for the weight matrix, we approximate this gradient by replacing $\sigma'(W_i^T y + b_i)$ with $\mathbb{1}_{x_i \neq 0}$, calculate the expected gradient, and obtain

$(g_b)_i = \mathbb{E}\big[-W_i^T(y - Wx)\mathbb{1}_{x_i^* \neq 0}\big] + \epsilon = \mathbb{E}\big[-W_i^T\big(y - W_i(W_i^T y + b_i)\big)\mathbb{1}_{x_i^* \neq 0}\big] + \epsilon = \mathbb{E}\big[\big((\|W_i\|^2 - 1)W_i^T y + \|W_i\|^2 b_i\big)\mathbb{1}_{x_i^* \neq 0}\big] + \epsilon = p_i b_i + \epsilon.$

From this expected gradient formula, we design a very simple update for the bias: $b^{s+1} = \sqrt{1 - \tau}\, b^s$ with $b^0 = -1/\log n$, and show by induction that this choice of bias is sufficiently negative to make the consistency results of Theorem 3 (ii) and (iii) hold at each step. At the first step we have $\delta^0 \leq O^*(1/\log n)$, so $b^0 = -1/\log n \approx -\|W_i^0 - A_i\|$. Now, assuming $|b_i^s| \gtrsim \|W_i^s - A_i\|$, we need to show that $|b_i^{s+1}| \gtrsim \|W_i^{s+1} - A_i\|$. From the descent property at step $s$ we have $\|W_i^{s+1} - A_i\| \leq \sqrt{1 - \tau}\,\|W_i^s - A_i\| + o(\delta^s)$. Therefore, $b_i^{s+1} = \sqrt{1 - \tau}\, b_i^s \lesssim -\sqrt{1 - \tau}\,\|W_i^s - A_i\| \leq -\|W_i^{s+1} - A_i\| + o(\delta^s)$. As a result, the bias condition of Lemma 1 is maintained, and the bias condition required for support consistency also holds. By induction, we can guarantee consistency at every update step. Lemma 1, and hence the descent results stated in (i.2) and (iii.2), hold for the special case of the Gaussian mixture model. ∎
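As a quick illustration of the key step in the proof, the closed-form expression for $g_i$ displayed above (ignoring the $O(n^{-\omega(1)})$ term) can be evaluated directly and its correlation with $W_i - A_i$ checked numerically. This is a sanity-check sketch based on our reading of that expression, not part of the paper.

```python
import numpy as np

def lemma1_correlations(A, W, b, p):
    """Return <g_i, W_i - A_i> for each column i, using the mixture-of-Gaussians
    expression g_i = -p lam_i A_i + p (lam_i + b_i)^2 W_i with lam_i = <W_i, A_i>."""
    lam = np.sum(W * A, axis=0)                  # lam_i = <W_i, A_i>
    G = -p * lam * A + p * (lam + b) ** 2 * W    # expected (approximate) gradient, column-wise
    return np.sum(G * (W - A), axis=0)           # positive values support the descent property
```

For unit-norm columns with $W$ close to $A$ and a small negative bias, these correlations come out positive, which is the qualitative content of Lemma 1 (up to the precise constants).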

6 Experiments

We support our theoretical results with experiments on synthetic data sets under the mixture-of-Gaussians model. We stress that these experimental results are not intended to be exhaustive or of practical relevance, but rather to confirm some aspects of our theory and to shed light on where the theory falls short.

We generate samples from a mixture of $m = 10$ Gaussians with dimension $n = 784$ using the model $y = Ax^* + \eta$. The means are the columns of $A$, randomly generated according to $A_i \sim \mathcal{N}(0, \frac{1}{\sqrt{n}}I_n)$. To synthesize each sample $y$, we choose $x^*$ uniformly from the canonical basis $\{e_i\}_{i=1}^m$ and generate a Gaussian noise vector $\eta$ with independent entries and entry-wise standard deviation $\sigma_\eta$. We create a data set of 10,000 samples in total for each Monte Carlo trial.

We consider a two-layer autoencoder with shared weights as described in Section 2.1, such that the hidden layer has 10 units with ReLU activation. We then observe its gradient dynamics on the above data using three different initializations: (i) we initialize $W$ by adding a small random perturbation to the ground truth $A$, i.e., $W^0 = A + \delta E$ with $\delta = 0.5$ and the perturbation $E \in \mathbb{R}^{784 \times 10}$ generated according to $E_{ij} \sim \mathcal{N}(0, 1/\sqrt{n})$; (ii) we perform principal component analysis of the data samples and choose the top 10 singular vectors as $W^0$; (iii) we generate $W^0$ randomly with $W_i^0 \sim \mathcal{N}(0, \frac{1}{\sqrt{n}}I_n)$.

For all three initializations, the biases $b$ of the encoder are initially set to $b = -2.5$. We train the weights $W$ with batch gradient descent and update the bias using the fixed rule $b^{s+1} = b^s/2$. The learning rate for gradient descent is fixed to $\zeta = m$, and the number of descent steps is $T = 50$. We run the batch descent algorithm for each initialization with different levels of noise ($\sigma_\eta = 0.01, 0.02, 0.03$) and record the reconstruction loss over the data samples.
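For completeness, the synthetic experiment just described can be reproduced, up to implementation details we had to guess (e.g., the exact scaling of the random matrices), by a short script along the following lines.

```python
import numpy as np

def run_mixture_experiment(n=784, m=10, sigma_eta=0.01, num_samples=10_000,
                           T=50, init="perturb", delta=0.5, seed=0):
    """Batch gradient descent on the shared-weight autoencoder for the synthetic
    mixture-of-Gaussians data described above. Returns the reconstruction loss
    at every iteration. Our own reproduction sketch, not the authors' code."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))    # ground-truth means (our scaling)
    idx = rng.integers(m, size=num_samples)
    Y = A[:, idx] + sigma_eta * rng.standard_normal((n, num_samples))

    if init == "perturb":                                  # W^0 = A + delta * E
        W = A + delta * rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))
    elif init == "pca":                                    # top-m left singular vectors
        W = np.linalg.svd(Y, full_matrices=False)[0][:, :m]
    else:                                                  # random initialization
        W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))

    b = -2.5 * np.ones(m)
    losses = []
    for _ in range(T):
        X = np.maximum(W.T @ Y + b[:, None], 0.0)          # ReLU encoding
        R = Y - W @ X
        losses.append(0.5 * float(np.mean(np.sum(R ** 2, axis=0))))
        ind = (X > 0).astype(float)
        Z = (W.T @ Y + b[:, None]) * ind
        G = -(R @ Z.T + Y @ (ind * (W.T @ R)).T) / num_samples
        W = W - m * G                                      # learning rate zeta = m
        W = W / np.linalg.norm(W, axis=0, keepdims=True)   # column normalization
        b = b / 2.0                                        # bias update b^{s+1} = b^s / 2
    return losses
```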

[Figure 2: Learning curves (reconstruction loss versus iteration) for different initial estimates $W^0$. From left to right, the autoencoder is initialized by (i) a perturbation of the ground truth, (ii) PCA, and (iii) a random guess; each panel shows $\sigma_\eta = 0.01, 0.02, 0.03$.]

Figure 2 shows the learning curves as a function of the number of iterations. The first (leftmost) plot shows the loss when the initial point is 0.5-close to $A$; the next two plots show learning under the PCA and random initializations. Gradient descent also converges with these initializations when using the same step size and bias schedule as described above. The convergence behavior is somewhat unexpected: even with random initialization, the reconstruction loss decreases to low levels when the noise parameter $\sigma_\eta$ is small. This suggests that the loss surface is perhaps amenable to optimization even at a radius larger than $O(\delta)$ away from the ground truth parameters, although our theory does not account for this.

[Figure 3: Frobenius norm difference between the learned $W$ and the ground truth $A$ for the three initialization schemes.]

In Figure 3 we show the Frobenius norm difference between the ground truth $A$ and the final solution $W$ for the three initialization schemes on a data set with noise $\sigma_\eta = 0.01$. Interestingly, despite the convergence of the loss, neither PCA nor random initialization leads to recovery of the ground truth $A$. Note that since we can only estimate $W$ up to a column permutation, we use the Hungarian algorithm to compute a matching between the columns of $W$ and $A$ before calculating the norm.

Conclusions. To our knowledge, the above analysis is the first to prove rigorous convergence of gradient dynamics for autoencoder architectures under a wide variety of (bilinear) generative models. Numerous avenues for future work remain: finite-sample complexity analysis, extension to more general architectures, and extension to richer classes of generative models.

7 Acknowledgements

This work was supported in part by the National Science Foundation under grants CCF-1566281, CAREER CCF-1750920 and DMS-1612985/1806063, and in part by a Faculty Fellowship from the Black and Veatch Foundation.

References

[1] Yuandong Tian. Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. 2017.
[2] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
[3] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. In International Conference on Machine Learning, pages 605–614, 2017.
[4] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning, pages 4140–4149, 2017.
[5] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
[6] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
[7] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.
[8] Avrim Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pages 494–501, 1989.
[9] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592, 2014.
[10] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. Why are deep nets reversible: A simple theory, with implications for training. arXiv preprint arXiv:1511.05653, 2015.
[11] Akshay Rangamani, Anirbit Mukherjee, Ashish Arora, Tejaswini Ganapathy, Amitabh Basu, Sang Chin, and Trac D. Tran. Sparse coding and autoencoders. arXiv preprint arXiv:1708.03735, 2017.
[12] Sanjeev Arora and Ravi Kannan. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69–92, 2005.
[13] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Conference on Learning Theory, pages 37–1, 2012.
[14] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In Conference on Learning Theory, pages 123–137, 2014.
[15] Rémi Gribonval, Rodolphe Jenatton, Francis Bach, Martin Kleinsteuber, and Matthias Seibert. Sample complexity of dictionary learning and other matrix factorizations. IEEE Transactions on Information Theory, 61(6):3469–3486, 2015.
[16] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Conference on Learning Theory, pages 113–149, 2015.
[17] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 93–102. IEEE, 2010.
[18] Sanjeev Arora, Rong Ge, Ankur Moitra, and Sushant Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In Advances in Neural Information Processing Systems, pages 2375–2383, 2012.
[19] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 584–593. ACM, 2014.
[20] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[21] Devansh Arpit, Yingbo Zhou, Hung Ngo, and Venu Govindaraju. Why regularized auto-encoders learn sparse representation? arXiv preprint arXiv:1505.05561, 2015.
[22] Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. stat, 1050:8, 2013.
[23] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[24] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Conference on Learning Theory, pages 779–806, 2014.
[25] Adam Coates and Andrew Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, pages 921–928, 2011.
[26] Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng. On random weights and unsupervised feature learning. In International Conference on Machine Learning, pages 1089–1096, 2011.
[27] Kishore Konda, Roland Memisevic, and David Krueger. Zero-bias autoencoders and the benefits of co-adapting features. arXiv preprint arXiv:1402.3337, published in ICLR 2015, 2015.
[28] Thanh V. Nguyen, Raymond K. W. Wong, and Chinmay Hegde. A provable approach for double-sparse coding. In Proc. Conf. American Assoc. Artificial Intelligence, Feb. 2018.