
On the Dynamics of Gradient Descent for Autoencoders

Thanh V. Nguyen*, Raymond K. W. Wong†, Chinmay Hegde*
*Iowa State University, †Texas A&M University

Abstract

We provide a series of results for unsupervised learning with autoencoders. Specifically, we study shallow two-layer autoencoder architectures with shared weights. We focus on three generative models for data that are common in statistical machine learning: (i) the mixture-of-Gaussians model, (ii) the sparse coding model, and (iii) the sparsity model with non-negative coefficients. For each of these models, we prove that under suitable choices of hyperparameters, architectures, and initialization, autoencoders learned by gradient descent can successfully recover the parameters of the corresponding model. To our knowledge, this is the first result that rigorously studies the dynamics of gradient descent for weight-sharing autoencoders. Our analysis can be viewed as theoretical evidence that shallow autoencoder modules indeed can be used as feature learning mechanisms for a variety of data models, and may shed insight on how to train larger stacked architectures with autoencoders as basic building blocks.

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

1 Introduction

1.1 Motivation

Due to the resurgence of neural networks and deep learning, there has been growing interest in the community towards a thorough and principled understanding of training neural networks in both theoretical and algorithmic aspects. This has led to several important breakthroughs recently, including provable algorithms for learning shallow (one-hidden-layer) networks with nonlinear activations [1, 2, 3, 4], deep networks with linear activations [5], and residual networks [6, 7].

A typical approach adopted by this line of work is as follows: assume that the data obeys a ground truth generative model (induced by simple but reasonably expressive data-generating distributions), and prove that the weights learned by the proposed algorithms (either exactly or approximately) recover the parameters of the generative model. Indeed, such distributional assumptions are necessary to overcome known NP-hardness barriers for learning neural networks [8]. Nevertheless, the majority of these approaches have focused on neural network architectures for supervised learning, barring a few exceptions which we detail below.

1.2 Our contributions

In this paper, we complement this line of work by providing new theoretical results for unsupervised learning using neural networks. Our focus here is on shallow two-layer autoencoder architectures with shared weights. Conceptually, we build upon previous theoretical results on learning autoencoder networks [9, 10, 11], and we elaborate on the novelty of our work in the discussion of prior work below.

Our setting is standard: we assume that the training data consists of i.i.d. samples from a high-dimensional distribution parameterized by a generative model, and we train the weights of the autoencoder using ordinary (batch) gradient descent. We consider three families of generative models that are commonly adopted in machine learning: (i) the Gaussian mixture model with well-separated centers [12]; (ii) the k-sparse model, specified by sparse linear combinations of atoms [13]; and (iii) the non-negative k-sparse model [11]. While these models are traditionally studied separately depending on the application, all of these model families can be expressed via a unified, generic form:

    y = Ax* + η,    (1)

which we (loosely) dub the generative bilinear model. In this form, A is a ground-truth n × m matrix, x* is an m-dimensional latent code vector, and η is an independent n-dimensional random noise vector. The samples y are what we observe. Different choices of n and m, as well as different assumptions on A and x*, lead to the three aforementioned generative models.

Under these three generative models, and with suitable choices of hyperparameters, initial estimates, and autoencoder architectures, we rigorously prove that:

    Two-layer autoencoders, trained with (normalized) gradient descent over the reconstruction loss, provably learn the parameters of the underlying generative bilinear model.

To the best of our knowledge, our work is the first to analytically characterize the dynamics of gradient descent for training two-layer autoencoders. Our analysis can be viewed as theoretical evidence that shallow autoencoders can be used as feature learning mechanisms (provided the generative modeling assumptions hold), a view that seems to be widely adopted in practice. Our analysis highlights the following interesting conclusions: (i) the choice of activation function of the hidden (encoder) layer influences the choice of bias; (ii) the bias of each hidden neuron in the encoder plays an important role in achieving convergence of the gradient descent; and (iii) the gradient dynamics depend on the complexity of the generative model. Further, we speculate that our analysis may shed insight on practical considerations for training deeper networks with stacked autoencoder layers as building blocks [9].

1.3 Techniques

Our analysis is built upon recent algorithmic developments in the sparse coding literature [14, 15, 16]. Sparse coding corresponds to the setting where the synthesis coefficient vector x*(i) in (1) for each data sample y(i) is assumed to be k-sparse, i.e., x*(i) has at most k ≪ m non-zero elements. The exact algorithms proposed in these papers are all quite different, but at a high level, all of these methods involve establishing a notion that we dub "support consistency". Broadly speaking, for a given data sample y(i) = Ax*(i) + η(i), the idea is that when the parameter estimates are close to the ground truth, it is possible to accurately estimate the true support of the synthesis vector x*(i) for each data sample y(i).

We extend this to a broader family of generative models to form a notion that we call "code consistency". We prove that if initialized appropriately, the weights of the hidden (encoder) layer of the autoencoder provide useful information about the sign pattern of the corresponding synthesis vectors for every data sample. Somewhat surprisingly, the choice of activation function of each neuron in the hidden layer plays an important role in establishing code consistency and affects the possible choices of bias.

The code consistency property is crucial for establishing the correctness of gradient descent over the reconstruction loss. This turns out to be rather tedious due to the weight sharing — a complication which requires a substantial departure from the existing machinery for the analysis of sparse coding algorithms — and indeed forms the bulk of the technical difficulty in our proofs. Nevertheless, we are able to derive explicit linear convergence rates for all the generative models listed above. We do not attempt to analyze other training schemes (such as stochastic gradient descent or dropout) but anticipate that our analysis may lead to further work along those directions.

1.4 Comparison with prior work

Recent advances in algorithmic learning theory have led to numerous provably efficient algorithms for learning Gaussian mixture models, sparse codes, topic models, and ICA (see [12, 13, 14, 15, 16, 17, 18, 19] and references therein). We omit a complete treatment of prior work due to space constraints.

We would like to emphasize that we do not propose a new algorithm or autoencoder architecture, nor are we the first to highlight the applicability of autoencoders to the aforementioned generative models. Indeed, generative models such as k-sparsity models have served as the motivation for the development of deep stacked (denoising) autoencoders dating back to the work of [20]. The paper [9] proves that stacked weight-sharing autoencoders can recover the parameters of sparsity-based generative models, but their analysis succeeds only for certain generative models whose parameters are themselves randomly sampled from certain distributions. In contrast, our analysis holds for a broader class of networks; we make no randomness assumptions on the parameters of the generative models themselves.

More recently, autoencoders have been shown to learn sparse representations [21]. The recent paper [11] demonstrates that under the sparse generative model, the standard squared-error reconstruction loss of ReLU autoencoders exhibits (with asymptotically many samples) critical points in a neighborhood of the ground truth dictionary. However, they do not analyze gradient dynamics, nor do they establish convergence rates. We complete this line of work by proving explicitly that gradient descent (with column-wise normalization) in the asymptotic limit exhibits linear convergence up to a radius around the ground truth parameters.

2 Preliminaries

Notation. Denote by x_S the sub-vector of x ∈ R^m indexed by the elements of S ⊆ [m]. Similarly, let W_S be the sub-matrix of W ∈ R^{n×m} with columns indexed by the elements of S. Also, define supp(x) ≜ {i ∈ [m] : x_i ≠ 0} as the support of x, sgn(x) as the element-wise sign of x, and 1_E as the indicator of an event E.

We adopt standard asymptotic notation: f(n) = O(g(n)) (or f(n) = Ω(g(n))) if there exists some constant C > 0 such that |f(n)| ≤ C|g(n)| (respectively, |f(n)| ≥ C|g(n)|). Next, f(n) = Θ(g(n)) means that both f(n) = O(g(n)) and f(n) = Ω(g(n)) hold. Also, f(n) = ω(g(n)) if lim_{n→∞} |f(n)/g(n)| = ∞. In addition, g(n) = O*(f(n)) indicates |g(n)| ≤ K|f(n)| for some sufficiently small constant K. Throughout, we use the phrase "with high probability" (abbreviated to w.h.p.) to describe any event with failure probability at most n^{-ω(1)}.

Figure 1: Architecture of a shallow two-layer autoencoder network, with input layer y_1, ..., y_n, a hidden layer, and output layer ŷ_1, ..., ŷ_n. The encoder and the decoder share the weights.

2.1 Two-Layer Autoencoders

We focus on shallow autoencoders with a single hidden layer, n neurons in the input/output layer, and m hidden neurons. We consider the weight-sharing architecture in which the encoder has weights W^T ∈ R^{m×n} and the decoder uses the shared weights W ∈ R^{n×m}. The architecture of the autoencoder is shown in Fig. 1. Denote by b ∈ R^m the vector of biases for the encoder (we do not consider a decoder bias). As such, for a given data sample y ∈ R^n, the encoding and decoding can be modeled respectively as:

    x = σ(W^T y + b)  and  ŷ = Wx,    (2)

where σ(·) denotes the activation function of the encoder neurons. We consider two types of activation functions: (i) the rectified linear unit,

    ReLU(z) = max(z, 0),

and (ii) the hard thresholding operator,

    threshold_λ(z) = z · 1_{|z| ≥ λ}.

When applied to a vector (or matrix), these functions operate element-wise and return a vector (respectively, matrix) of the same size. Our choice of the activation function σ(·) varies with the data generative model, and will be clear from context.

Herein, the loss function is the (squared) reconstruction error:

    L = (1/2)‖y − ŷ‖² = (1/2)‖y − W σ(W^T y + b)‖²,

and we analyze the expected loss, where the expectation is taken over the data distribution (specified below). Inspired by the literature on the analysis of sparse coding [11, 14, 22], we investigate the landscape of the expected loss so as to shed light on the dynamics of gradient descent for training the above autoencoder architectures. Indeed, we show that for a variety of data distributions, such autoencoders can recover the distribution parameters via suitably initialized gradient descent.

2.2 Generative Bilinear Model

We now describe an overarching generative model for the data samples. Specifically, we posit that the data samples {y^(i)}_{i=1}^N ⊂ R^n are drawn according to the following "bilinear" model:

    y = Ax* + η,    (3)

where A ∈ R^{n×m} is a ground-truth set of parameters, x* ∈ R^m is a latent code vector, and η ∈ R^n represents noise. Depending on the assumptions made on A and x*, this model generalizes various popular cases, such as the mixture of spherical Gaussians, sparse coding, non-negative sparse coding, and independent component analysis (ICA). We will elaborate further on specific cases, but in general our generative model satisfies the following generic assumptions:

A1. The code x* is supported on a set S of size at most k, such that p_i = P[i ∈ S] = Θ(k/m), p_ij = P[i, j ∈ S] = Θ(k²/m²), and p_ijl = P[i, j, l ∈ S] = Θ(k³/m³);

A2. Nonzero entries are independent; moreover, E[x*_i | i ∈ S] = κ₁ and E[x*_i² | i ∈ S] = κ₂ < ∞;

A3. For i ∈ S, |x*_i| ∈ [a₁, a₂] with 0 ≤ a₁ ≤ a₂ ≤ ∞;

A4. The noise term η is distributed according to N(0, σ_η²I) and is independent of x*.

As special cases of the above model, we consider the following variants.

Mixture of spherical Gaussians: We consider the standard Gaussian mixture model with m centers, which is one of the most popular generative models encountered in machine learning applications. We model the means of the Gaussians as columns of the matrix A. To draw a data sample y, we sample x* uniformly from the canonical basis {e_i}_{i=1}^m ⊂ R^m with probability p_i = Θ(1/m). As such, x* has sparsity parameter k = 1, with the only nonzero element being equal to 1. That means κ₁ = κ₂ = a₁ = a₂ = 1.

Sparse coding: This is a well-known instance of the above structured linear model, where the goal is basically to learn an overcomplete dictionary A that sparsely represents the input y. It has a rich history in various fields of signal processing, machine learning, and neuroscience [23]. The generative model described above has successfully enabled recent theoretical advances in sparse coding [13, 14, 15, 16, 24]. The latent code vector x* is assumed to be k-sparse, with nonzero entries that are sub-Gaussian and bounded away from zero. Therefore, a₁ > 0 and a₂ = ∞. We assume that the distribution of the nonzero entries is standardized such that κ₁ = 0, κ₂ = 1. Note that the condition on κ₂ further implies that a₁ ≤ 1.

Non-negative sparse coding: This is another variant of the above sparse coding model where the elements of the latent code x* are additionally required to be non-negative [11]. In some sense this is a generalization of the Gaussian mixture model described above. Since the code vector is non-negative, we do not impose the standardization used in the general sparse coding case (κ₁ = 0 and κ₂ = 1); instead, we assume a compact interval for the nonzero entries; that is, a₁ and a₂ are positive and bounded.

Having established the probabilistic settings for these models, we now establish certain deterministic conditions on the true parameters A to enable analysis. First, we require each column A_i to be normalized to unit norm in order to avoid the scaling ambiguity between A and x*. (Technically, this condition is not required for the mixture-of-Gaussians case since x* is binary; however, we make this assumption anyway to keep the treatment generic.) Second, we require the columns of A to be "sufficiently distinct"; this is formalized by adopting the notion of pairwise incoherence.

Definition 1. Suppose that A ∈ R^{n×m} has unit-norm columns. A is said to be μ-incoherent if for every pair of column indices (i, j), i ≠ j, we have |⟨A_i, A_j⟩| ≤ μ/√n.

Though this definition is motivated by the sparse coding literature, pairwise incoherence is sufficiently general to enable identifiability of all the aforementioned models. For the mixture of Gaussians with unit-norm means, pairwise incoherence states that the means are well-separated, which is a standard assumption. In the case of Gaussian mixtures, we assume that m = O(1) ≪ n. For sparse coding, we focus on learning overcomplete dictionaries where n ≤ m = O(n). For the sparse coding case, we further require a spectral norm bound on A, i.e., ‖A‖ ≤ O(√(m/n)). (In other words, A is well-conditioned.)

Our eventual goal is to show that training the autoencoder via gradient descent can effectively recover the generative model parameter A. To this end, we need a measure of goodness in recovery. Noting that any recovery method can only recover A up to a permutation ambiguity in the columns (and a sign-flip ambiguity in the case of sparse coding), we first define an operator π that permutes the columns of the matrix (and multiplies each column individually by +1 or −1 in the case of sparse coding). Then, we define our measure of goodness:

Definition 2 (δ-closeness and (δ, ξ)-nearness). A matrix W is said to be δ-close to A if there exists an operator π(·) as defined above such that ‖π(W)_i − A_i‖ ≤ δ for all i. We say W is (δ, ξ)-near to A if in addition ‖π(W) − A‖ ≤ ξ‖A‖.

To simplify notation, we simply replace π by the identity operator while keeping in mind that we are only recovering an element from the equivalence class of all permutations and sign-flips of A.

Armed with the above definitions and assumptions, we are now ready to state our results. Since the actual mathematical guarantees are somewhat tedious and technical, we summarize our results in terms of informal theorem statements, and elaborate more precisely in the following sections.

Our first main result establishes the code consistency of weight-sharing autoencoders under all the generative linear models described above, provided that the weights are suitably initialized.

Theorem 1 (informal). Consider a sample y = Ax* + η. Let x = σ(W^T y + b) be the output of the encoder part of the autoencoder. Suppose that W is δ-close to A with δ = O*(1/log n).

(i) If σ(·) is either the ReLU or the hard thresholding activation, then the support of the true code vector x* matches that of x for the mixture-of-Gaussians and non-negative sparse coding generative models.

(ii) If σ(·) is the hard thresholding activation, then the support of x* matches that of x for the sparse coding generative model.

Our second main result leverages the above property. We show that iterative gradient descent over the weights W linearly converges to a small neighborhood of the ground truth.
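To make the architecture and data model concrete, the following Python sketch (our own illustration, not the authors' code; the dimensions, seed, and perturbation level are assumptions) instantiates the mixture-of-Gaussians special case of model (3) and the weight-sharing encoder/decoder of (2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma_eta = 256, 8, 0.01

# Ground-truth parameters A with unit-norm columns (the Gaussian means).
A = rng.normal(size=(n, m))
A /= np.linalg.norm(A, axis=0)

def sample(A, sigma_eta):
    """Draw y = A x* + eta for the mixture-of-Gaussians case (k = 1)."""
    n, m = A.shape
    x_star = np.zeros(m)
    x_star[rng.integers(m)] = 1.0         # x* uniform over the canonical basis
    eta = sigma_eta * rng.normal(size=n)  # independent Gaussian noise
    return A @ x_star + eta, x_star

def relu(z):
    return np.maximum(z, 0.0)

def hard_threshold(z, lam):
    return z * (np.abs(z) >= lam)

# Weight-sharing autoencoder of (2): encoder W^T, decoder W, encoder bias b.
W = A + 0.01 * rng.normal(size=(n, m))    # a delta-close initial estimate of A
W /= np.linalg.norm(W, axis=0)
b = np.zeros(m)

y, x_star = sample(A, sigma_eta)
x = hard_threshold(W.T @ y + b, lam=0.5)  # encoding with threshold_{1/2}
y_hat = W @ x                             # decoding

# Code consistency (Theorem 1): supp(x) matches supp(x*) w.h.p.
support_ok = set(np.flatnonzero(x)) == set(np.flatnonzero(x_star))
```

Because W is close to A, the coordinate of z = W^T y + b on the true support is near 1 while all others are near 0, so thresholding at 1/2 recovers the support; this is exactly the mechanism of the proof of Theorem 3 below.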

Theorem 2 (informal). Suppose that the initial weight matrix W^0 is (δ, 2)-near to A. Given asymptotically many samples drawn from the above models, an iterative gradient update of W can linearly converge to a small neighborhood of the ground truth A.

We formally present these technical results in the next sections. Note that we analyze the encoding and the gradient given W^s at iteration s; however, we often skip the superscript for clarity.

3 Initialization

Our main result is a local analysis of the learning dynamics for two-layer autoencoders. More specifically, we prove that (batch) gradient descent linearly converges to the ground truth parameter A given an initialization W^0 that is O*(1/log n) column-wise close to the ground truth. Although the recovery error at convergence is exponentially better than the initial 1/log n order, a natural question is how to achieve this initialization requirement. In practice, random initialization for autoencoders is a common strategy and often leads to surprisingly good results [25, 26]. In theory, however, the validity of random initialization is still an open problem. For the k-sparse model, the authors of [16] introduce an algorithm that provably produces such a coarse estimate of A using spectral methods. This algorithm applies perfectly to this context of the autoencoder architecture. We conjecture that this spectral algorithm still works for the non-negative sparse case (including the special mixture-of-Gaussians model), although, due to the non-negativity, more complicated treatments involving concentration arguments and sign flips of the columns are required. We leave this to future work.

4 Encoding Stage

Our technical results start with the analysis of the encoding stage in the forward pass. We rigorously prove that the encoding performed by the autoencoder is sufficiently good in the sense that it recovers part of the information in the latent code x* (specifically, the signed support of x*). This is achieved based on appropriate choices of activation function, biases, and a good W within a close neighborhood of the true parameters A. We call this property code consistency:

Theorem 3 (Code consistency). Let x = σ(W^T y + b). Suppose W is δ-close to A with δ = O*(1/log n) and the noise satisfies σ_η = O(1/√n). Then the following results hold:

(i) General k-sparse code with thresholding activation: Suppose μ ≤ √n/log² n and k ≤ n/log n. If x = threshold_λ(W^T y + b) with λ = a₁/2 and b = 0, then with high probability sgn(x) = sgn(x*).

(ii) Non-negative k-sparse code with ReLU activation: Suppose μ ≤ δ√n/k and k = O(1/δ²). If x = ReLU(W^T y + b), and b_i ∈ [−(1 − δ)a₁ + a₂δ√k, −a₂δ√k] for all i, then with high probability supp(x) = supp(x*).

(iii) Non-negative k-sparse code with thresholding activation: Suppose μ ≤ δ√n/k and k = O(1/δ²). If x = threshold_λ(W^T y + b) with λ = a₁/2 and b = 0, then with high probability supp(x) = supp(x*).

The full proof of Theorem 3 is relegated to Appendix A. Here, we provide a short proof for the mixture-of-Gaussians generative model, which is really a special case of (ii) and (iii) above, where k = 1 and the nonzero component of x* is equal to 1 (i.e., κ₁ = κ₂ = a₁ = a₂ = 1).

Proof. Denote z = W^T y + b and S = supp(x*) = {j}. Let i be fixed and consider two cases: if i = j, then

    z_i = ⟨W_i, A_i⟩ + ⟨W_i, η⟩ + b_i ≥ (1 − δ²/2) − σ_η log n + b_i > 0,

w.h.p., due to the fact that ⟨W_i, A_i⟩ ≥ 1 − δ²/2 (Claim 1), and the conditions σ_η = O(1/√n) and b_i > −1 + δ. On the other hand, if i ≠ j, then using Claims 1 and 2 in Appendix A, we have w.h.p.

    z_i = ⟨W_i, A_j⟩ + ⟨W_i, η⟩ + b_i ≤ μ/√n + δ + σ_η log n + b_i < 0,

for b_i ≤ −2δ, μ ≤ δ√n/k and σ_η = O(1/√n). Due to Claim 2, these results hold w.h.p. uniformly for all i, and hence x = ReLU(z) has the same support as x* w.h.p.

Moreover, one can also see that when b_i = 0, then w.h.p. z_i > 1/2 if i = j and z_i < 1/4 otherwise. This result holds w.h.p. uniformly for all i, and therefore x = threshold_{1/2}(z) has the same support as x* w.h.p.

Note that for the non-negative case, both the ReLU and thresholding activations lead to a correct support of the code, but this requires k = O(1/δ²), which is rather restrictive and might be a limitation of the current analysis. Also, in Theorem 3, b is required to be negative for the ReLU activation for any δ > 0 due to the error of the current estimate W. However, this result is consistent with the conclusion of [27] that a negative bias is desirable for the ReLU activation to produce sparse codes. Note that such choices of b also lead to statistical bias (error) in the nonzero code entries and make it difficult to construct a provably correct learning procedure (Section 5) for the ReLU activation. Part (i) of Theorem 3 mirrors the consistency result established for sparse coding in [16].

Next, we apply the above consistency result to show that a (batch) gradient update of the weights W (and bias in certain cases) converges to the true model parameters.

5 Learning Stage

In this section, we show that a gradient descent update for W of the autoencoder (followed by a normalization in the Euclidean column norm of the updated weights) leads to linear convergence to a small neighborhood of the ground truth A under the aforementioned generative models. For this purpose, we analyze the gradient of the expected loss with respect to W. Our analysis involves calculating the expected value of the gradient as if we were given infinitely many samples. (The finite-sample analysis is left as future work.)

Since both the ReLU and hard thresholding activation functions are non-differentiable at some values, we formulate an approximate gradient. Wherever differentiable, the gradient of the loss L with respect to the column W_i ∈ R^n of the weight matrix W is given by:

    ∇_{W_i} L = −σ'(W_i^T y + b_i)((W_i^T y + b_i)I + yW_i^T)(y − Wx),    (4)

where x = σ(W^T y + b) and σ'(z_i) is the derivative of σ at z_i wherever σ is differentiable. For the rectified linear unit ReLU(z_i) = max(z_i, 0), the derivative is

    σ'(z_i) = 1 if z_i > 0, and 0 if z_i < 0.

On the other hand, for the hard thresholding activation threshold_λ(z_i) = z_i 1_{|z_i| ≥ λ}, the derivative is

    σ'(z_i) = 1 if |z_i| > λ, and 0 if |z_i| < λ.

One can see that in both cases, the derivative σ'(·) at z_i = W_i^T y + b_i resembles the indicator 1_{x_i ≠ 0} = 1_{σ(z_i) ≠ 0}, except where it is not defined. This observation motivates us to approximate ∇_{W_i} L with a simpler rule, by replacing σ'(W_i^T y + b_i) with 1_{x_i ≠ 0}:

    ∇g_i L = −1_{x_i ≠ 0}(W_i^T yI + b_iI + yW_i^T)(y − Wx).

In fact, [11] (Lemma 5.1) shows that this approximate gradient ∇g_i L is a good approximation of the true gradient (4) in expectation. Since A is assumed to have normalized columns (with ‖A_i‖ = 1), we enforce this property on the update via a simple column normalization after every step; to denote this, we use the operator normalize(·) that returns a matrix normalize(B) with unit columns, i.e.,

    normalize(B)_i = B_i/‖B_i‖,

for any matrix B that has no all-zero columns.

Our convergence result leverages the code consistency property in Theorem 3, but in turn succeeds under constraints on the biases b of the hidden neurons. For the thresholding activation, we show that the simple choice of setting all biases to zero leads to both code consistency and linear convergence. However, for the ReLU activation, the range of bias specified in Theorem 3 (ii) has a profound effect on the descent procedure. Roughly speaking, we need non-zero bias in order to ensure code consistency, but large values of bias can adversely impact gradient descent. Indeed, our current analysis does not succeed for any constant choice of bias (i.e., we do not find a constant bias that leads to both support consistency and linear convergence). To resolve this issue, we propose to use a simple diminishing (in magnitude) sequence of biases b along the iterations of the algorithm. Overall, this combination of approximate gradient and normalization leads to an update rule that certifies the existence of a linearly convergent algorithm (up to a neighborhood of A). The results are formally stated as follows:

Theorem 4 (Descent property). Suppose that at step s the weight W^s is (δ_s, 2)-near to A. There exists an iterative update rule using an approximate gradient g^s: W^{s+1} = normalize(W^s − ζg^s) that linearly converges to A when given infinitely many fresh samples. More precisely, there exists some τ ∈ (1/2, 1) such that:

(i) Mixture of Gaussians: Suppose the conditions in either (ii) or (iii) of Theorem 3 hold. Suppose that the learning rate ζ = Θ(m), and that the bias vector b satisfies:

    (i.1) b = 0 if x = threshold_{1/2}(W^T y + b); or
    (i.2) b^{s+1} = b^s/C if x = ReLU(W^T y + b), for some constant C > 1.

Then ‖W^{s+1} − A‖_F² ≤ (1 − τ)‖W^s − A‖_F² + O(mn^{−O(1)}).

(ii) General k-sparse code: Provided that the conditions in Theorem 3 (i) hold and the learning rate ζ = Θ(m/k), then ‖W^{s+1} − A‖_F² ≤ (1 − τ)‖W^s − A‖_F² + O(mk²/n²).

(iii) Non-negative k-sparse code: Suppose the conditions in either (ii) or (iii) of Theorem 3 hold. Suppose that the learning rate ζ = Θ(m/k) and the bias b satisfies:

    (iii.1) b = 0 if x = threshold_{a₁/2}(W^T y + b); or
    (iii.2) b^{s+1} = b^s/C if x = ReLU(W^T y + b), for some constant C > 1.

Then ‖W^{s+1} − A‖_F² ≤ (1 − τ)‖W^s − A‖_F² + O(k³/m).

Recall the approximate gradient of the squared loss:

    ∇g_i L = −1_{x_i ≠ 0}(W_i^T yI + b_iI + yW_i^T)(y − Wx).

We will use this form to construct a desired update rule with linear convergence. Consider an update step g^s in expectation over the code x* and the noise η:

    g_i = −E[1_{x_i ≠ 0}(W_i^T yI + b_iI + yW_i^T)(y − Wx)].    (5)

To prove Theorem 4, we compute g_i according to the generative models described in (3) and then argue the descent. Here, we provide a proof sketch for (again) the simplest case of the mixture of Gaussians; the full proof is deferred to Appendix B.

Proof of Theorem 4 (i). Based on Theorem 3, one can explicitly compute the expectation expressed in (5). Specifically, the expected gradient is of the form:

    g_i = −p_i λ_i A_i + p_i(λ_i² + 2b_iλ_i + b_i²)W_i + γ,

where λ_i = ⟨W_i^s, A_i⟩ and ‖γ‖ = O(n^{−ω(1)}). If we can find b_i such that λ_i² + 2b_iλ_i + b_i² ≈ λ_i for all i, then g_i roughly points in the desired direction toward A_i, and therefore a descent property can be established via the following result:

Lemma 1. Suppose W is δ-close to A and the bias satisfies |(b_i + λ_i)² − λ_i| ≤ 2(1 − λ_i). Then:

    2⟨g_i, W_i − A_i⟩ ≥ p_i(λ_i − 2δ²)‖W_i − A_i‖² + (1/(p_iλ_i))‖g_i‖² − (1/(p_iλ_i))‖γ‖².

Now, we determine when the bias conditions in Theorem 3 and Lemma 1 simultaneously hold for the different choices of activation function. For the hard-thresholding function, since we do not need any bias (i.e., b_i = 0 for every i), we have λ_i(1 − λ_i) ≤ 2(1 − λ_i), and the lemma clearly follows.

On the other hand, if we encode x = ReLU(W^T y + b), then we need every bias to satisfy b_i ∈ [−1 + 3√2 δ^s √k, −δ^s] and |(b_i + λ_i)² − λ_i| ≤ 2(1 − λ_i). Since λ_i = ⟨W_i^s, A_i⟩ → 1 and δ^s → 0, for these conditions on b_i to hold we require b_i → 0 as s increases. Hence, a fixed bias for the rectified linear unit would not work. Instead, we design a simple update for the bias (and this is enough to prove convergence in the ReLU case).

Here is our intuition. The gradient of L with respect to b_i is given by:

    ∇_{b_i} L = −σ'(W_i^T y + b_i) W_i^T(y − Wx).

Similarly to the update for the weight matrix, we approximate this gradient by replacing σ'(W_i^T y + b_i) with 1_{x_i ≠ 0}, calculate the expected gradient, and obtain:

    (g_b)_i = −E[W_i^T(y − Wx) 1_{x*_i ≠ 0}] + γ
            = −E[W_i^T(y − W_i(W_i^T y + b_i)) 1_{x*_i ≠ 0}] + γ
            = −E[((W_i − ‖W_i‖²W_i)^T y + ‖W_i‖² b_i) 1_{x*_i ≠ 0}] + γ
            = −p_i b_i + γ.

From the expected gradient formula, we design a very simple update for the bias: b^{s+1} = √(1 − τ) b^s with b^0 = −1/log n, and show by induction that this choice of bias is sufficiently negative to make the consistency results of Theorem 3 (ii) and (iii) hold at each step. At the first step, we have δ^0 ≤ O*(1/log n), so

    b_i^0 = −1/log n ≤ −‖W_i^0 − A_i‖.

Now, assuming b_i^s ≤ −‖W_i^s − A_i‖, we need to prove that b_i^{s+1} ≤ −‖W_i^{s+1} − A_i‖. From the descent property at step s, we have

    ‖W_i^{s+1} − A_i‖ ≤ √(1 − τ)‖W_i^s − A_i‖ + o(δ_s).

Therefore, b_i^{s+1} = √(1 − τ) b_i^s ≤ −√(1 − τ)‖W_i^s − A_i‖ ≤ −‖W_i^{s+1} − A_i‖ − o(δ_s). As a result, |(b_i + λ_i)² − λ_i| ≈ λ_i(1 − λ_i) ≤ 2(1 − λ_i). In addition, the condition on the bias in the support consistency holds. By induction, we can guarantee consistency at all update steps.

From Lemma 1, one can easily prove the descent property using [16] (Theorem 6). We apply this lemma with learning rate ζ = max_i(1/(p_iλ_i)) and τ = ζp_i(λ_i − 2δ²) ∈ (0, 1) to achieve the descent as follows:

    ‖W̃_i^{s+1} − A_i‖² ≤ (1 − τ)‖W_i^s − A_i‖² + O(n^{−K}),

where W̃^{s+1} = W^s − ζg^s and K is some constant greater than 1. Finally, we use Lemma 5 to obtain the descent property for the normalized W_i^{s+1}. Lemma 1, and hence the descent results stated in (i.2) and (iii.2), hold for the special case of the Gaussian mixture model.

6 Experiments

We support our theoretical results with some experiments on synthetic data sets under the mixture-of-Gaussians model. We stress that these experimental results are not intended to be exhaustive or of practical relevance, but rather only to confirm some aspects of our theoretical results, and shed light on where the theory falls short.
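The update rule analyzed in Section 5 is straightforward to simulate. The sketch below (our own minimal implementation; the batch size, step count, seed, and perturbation level are assumptions, and a finite batch stands in for the expectation in (5)) runs W^{s+1} = normalize(W^s − ζg^s) with the approximate gradient ∇g_i L, the threshold_{1/2} encoder, and zero bias on mixture-of-Gaussians data, and tracks ‖W − A‖_F:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma_eta = 256, 10, 0.01
N, T, zeta = 2000, 30, float(m)   # batch size, steps, learning rate = Theta(m)

A = rng.normal(size=(n, m))
A /= np.linalg.norm(A, axis=0)    # unit-norm ground-truth columns

def normalize(B):
    return B / np.linalg.norm(B, axis=0)

def batch(N):
    """N samples of y = A x* + eta, with x* uniform over the canonical basis."""
    idx = rng.integers(m, size=N)
    X = np.zeros((m, N))
    X[idx, np.arange(N)] = 1.0
    return A @ X + sigma_eta * rng.normal(size=(n, N))

W = normalize(A + 0.01 * rng.normal(size=(n, m)))  # delta-close initialization
b = np.zeros(m)
err0 = np.linalg.norm(W - A)

for s in range(T):
    Y = batch(N)
    Z = W.T @ Y + b[:, None]
    X = Z * (np.abs(Z) >= 0.5)               # threshold_{1/2} encoding
    R = Y - W @ X                            # residuals y - W x
    G = np.zeros_like(W)
    for i in range(m):
        mask = X[i] != 0                     # indicator 1_{x_i != 0}
        if mask.any():
            coef = W[:, i] @ Y[:, mask] + b[i]   # W_i^T y + b_i, per sample
            # g_i = -E[1_{x_i != 0}((W_i^T y + b_i) r + (W_i^T r) y)]
            G[:, i] = -(R[:, mask] * coef
                        + Y[:, mask] * (W[:, i] @ R[:, mask])).sum(axis=1) / N
    W = normalize(W - zeta * G)

err = np.linalg.norm(W - A)   # shrinks toward a small neighborhood of A
```

With ζ = m and p_i ≈ 1/m, one update step moves W_i approximately to λ_i A_i + (1 − λ_i²)W_i before normalization, which is the contraction underlying Theorem 4 (i).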

Figure 2: The learning curve (reconstruction loss vs. iteration, for noise levels ση = 0.01, 0.02, 0.03) during training using different initial estimates W^0. From left to right, the autoencoder is initialized by (i) some perturbation of the ground truth, (ii) PCA, and (iii) a random guess.
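A rough reconstruction of this synthetic setup in Python (our own sketch, not the authors' code; the exact perturbation scaling and the PCA variant are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, N, sigma_eta = 10, 784, 10_000, 0.01

def normalize(B):
    return B / np.linalg.norm(B, axis=0)

# Mixture-of-Gaussians data y = A x* + eta; the means are the columns of A.
A = normalize(rng.normal(scale=1.0 / np.sqrt(n), size=(n, m)))
idx = rng.integers(m, size=N)
X_star = np.zeros((m, N))
X_star[idx, np.arange(N)] = 1.0
Y = A @ X_star + sigma_eta * rng.normal(size=(n, N))

# Three initialization schemes compared in Figures 2 and 3:
delta = 0.5
W_pert = normalize(A + (delta / np.sqrt(n)) * rng.normal(size=(n, m)))  # (i) 0.5-close perturbation
U, _, _ = np.linalg.svd(Y, full_matrices=False)
W_pca = normalize(U[:, :m])                                             # (ii) top-m principal directions
W_rand = normalize(rng.normal(size=(n, m)))                             # (iii) random guess

# Hyperparameters used for batch gradient descent in the experiments.
zeta, T = float(m), 50            # learning rate and number of descent steps
b = -2.5 * delta * np.ones(m)     # initial bias; halved every step, b <- b / 2
```

Any of W_pert, W_pca, or W_rand can be plugged into the training loop sketched in Section 5 to reproduce the qualitative behavior of the learning curves.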

We generate samples from a mixture of $m = 10$ Gaussians with dimension $n = 784$ using the model $y = Ax^* + \eta$. The means are the columns of $A$, randomly generated according to $A_i \sim \mathcal{N}(0, \tfrac{1}{\sqrt{n}} I_n)$. To synthesize each sample $y$, we choose $x^*$ uniformly from the canonical basis vectors $\{e_i\}_{i=1}^m$ and generate a Gaussian noise vector $\eta$ with independent entries and entry-wise standard deviation $\sigma_\eta$. We create a data set of 10,000 samples in total for each Monte Carlo trial.

We consider a two-layer autoencoder with shared weights as described in Section 2.1, such that the hidden layer has 10 units with ReLU activation. Then, we observe its gradient dynamics on the above data using three different initializations: (i) we initialize $W$ by adding a small random perturbation to the ground truth $A$, such that $W^0 = A + \delta E$ for $\delta = 0.5$, with the perturbation $E \in \mathbb{R}^{784\times 10}$ generated according to $E_{ij} \sim \mathcal{N}(0, 1/\sqrt{n})$; (ii) we perform principal component analysis of the data samples and choose the top 10 singular vectors as $W^0$; (iii) we randomly generate $W^0$ with $W_i \sim \mathcal{N}(0, \tfrac{1}{\sqrt{n}} I_n)$.

For all three initializations, the bias $b$ of the encoder is initially set to $b = -2.5\delta$. We train the weights $W$ with batch gradient descent and update the bias using the fixed update rule $b^{s+1} = b^s/2$. The learning rate for gradient descent is fixed to $\zeta = m$, and the number of descent steps is $T = 50$. We run the batch descent algorithm at each initialization with different levels of noise ($\sigma_\eta = 0.01, 0.02, 0.03$) and then observe the reconstruction loss over the data samples.

Figure 2 shows the learning curves against the number of iterations. From the left, the first plot is the loss with the initial point $0.5$-close to $A$. The next two plots represent learning with the PCA and random initializations. Gradient descent also converges when using the same step size and bias as described above. The convergence behavior is somewhat unexpected; even with random initialization, the reconstruction loss decreases to low levels when the noise parameter $\sigma_\eta$ is small. This suggests that the loss surface is perhaps amenable to optimization even at a radius bigger than $O(\delta)$ away from the ground truth parameters, although our theory does not account for this.

Figure 3: Frobenius norm difference $\|W - A\|_F$ between the learned $W$ and the ground truth $A$ over iterations, for the three initialization schemes.

In Figure 3 we show the Frobenius norm difference between the ground truth $A$ and the final solution $W$ using the three initialization schemes on a data set with noise $\sigma_\eta = 0.01$. Interestingly, despite the convergence, neither PCA nor random initialization leads to recovery of the ground truth $A$. Note that since we can only estimate $W$ up to some column permutation, we use the Hungarian algorithm to compute a matching between the columns of $W$ and $A$ and then calculate the norm.

Conclusions. To our knowledge, the above analysis is the first to prove rigorous convergence of gradient dynamics for autoencoder architectures under a wide variety of (bilinear) generative models. Numerous avenues for future work remain: finite-sample complexity analysis, extension to more general architectures, and extension to richer classes of generative models.

7 Acknowledgements

This work was supported in part by the National Science Foundation under grants CCF-1566281, CAREER CCF-1750920 and DMS-1612985, and in part by a Faculty Fellowship from the Black and Veatch Foundation.

References

[1] Yuandong Tian. Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. 2017.
[2] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
[3] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In International Conference on Machine Learning, pages 605–614, 2017.
[4] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning, pages 4140–4149, 2017.
[5] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
[6] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
[7] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. 2017.
[8] Avrim Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pages 494–501, 1989.
[9] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592, 2014.
[10] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. Why are deep nets reversible: A simple theory, with implications for training. arXiv preprint arXiv:1511.05653, 2015.
[11] Akshay Rangamani, Anirbit Mukherjee, Ashish Arora, Tejaswini Ganapathy, Amitabh Basu, Sang Chin, and Trac D. Tran. Sparse coding and autoencoders. arXiv preprint arXiv:1708.03735, 2017.
[12] Sanjeev Arora and Ravi Kannan. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69–92, 2005.
[13] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Conference on Learning Theory, pages 37–1, 2012.
[14] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In Conference on Learning Theory, pages 123–137, 2014.
[15] Rémi Gribonval, Rodolphe Jenatton, Francis Bach, Martin Kleinsteuber, and Matthias Seibert. Sample complexity of dictionary learning and other matrix factorizations. IEEE Transactions on Information Theory, 61(6):3469–3486, 2015.
[16] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Conference on Learning Theory, pages 113–149, 2015.
[17] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 93–102. IEEE, 2010.
[18] Sanjeev Arora, Rong Ge, Ankur Moitra, and Sushant Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In Advances in Neural Information Processing Systems, pages 2375–2383, 2012.
[19] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the forty-sixth annual ACM symposium on Theory of Computing, pages 584–593. ACM, 2014.
[20] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[21] Devansh Arpit, Yingbo Zhou, Hung Ngo, and Venu Govindaraju. Why regularized auto-encoders learn sparse representation? arXiv preprint arXiv:1505.05561, 2015.
[22] Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. stat, 1050:8, 2013.
[23] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[24] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Conference on Learning Theory, pages 779–806, 2014.
[25] Adam Coates and Andrew Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, pages 921–928, 2011.
[26] Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng. On random weights and unsupervised feature learning. In International Conference on Machine Learning, pages 1089–1096, 2011.
[27] Kishore Konda, Roland Memisevic, and David Krueger. Zero-bias autoencoders and the benefits of co-adapting features. arXiv preprint arXiv:1402.3337, published in ICLR 2015, 2015.
[28] Thanh V. Nguyen, Raymond K. W. Wong, and Chinmay Hegde. A provable approach for double-sparse coding. In Proc. Conf. American Assoc. Artificial Intelligence, Feb. 2018.

A Proof of Theorem 3

We start our proof with the following auxiliary claims.

Claim 1. Suppose that $\max_i \|W_i - A_i\| \leq \delta$ and $\|W_i\| = 1$. We have:

(i) $\langle W_i, A_i\rangle \geq 1 - \delta^2/2$ for any $i \in [m]$;
(ii) $|\langle W_i, A_j\rangle| \leq \mu/\sqrt{n} + \delta$ for any $j \neq i \in [m]$;

(iii) $\sum_{j \in S\setminus\{i\}} \langle W_i, A_j\rangle^2 \leq O(\mu^2 k/n + \delta^2)$ for any $S \subset [m]$ of size at most $k$.

Proof. Claims (i) and (ii) clearly follow from the $\delta$-closeness and $\mu$-incoherence properties, as shown below.

\[ \langle W_i, A_i\rangle = 1 - \tfrac{1}{2}\|W_i - A_i\|^2 \geq 1 - \delta^2/2, \quad\text{and}\quad |\langle W_i, A_j\rangle| = |\langle A_i, A_j\rangle + \langle W_i - A_i, A_j\rangle| \leq \mu/\sqrt{n} + \delta. \]
For (iii), we apply Cauchy-Schwarz to bound each term inside the summation. Precisely, for any $j \neq i$,

\[ \langle W_i, A_j\rangle^2 \leq 2\langle A_i, A_j\rangle^2 + 2\langle W_i - A_i, A_j\rangle^2 \leq 2\mu^2/n + 2\langle W_i - A_i, A_j\rangle^2. \]
Together with $\|A\| = O(\sqrt{m/n}) = O(1)$, we finish proving (iii) by noting that

\[ \sum_{j \in S\setminus\{i\}} \langle W_i, A_j\rangle^2 \leq 2\mu^2k/n + 2\|A_S^\top(W_i - A_i)\|^2 \leq 2\mu^2k/n + 2\|A_S\|^2\|W_i - A_i\|^2 \leq O(\mu^2k/n + \delta^2). \]
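Claim 1 (i) rests on the exact identity $\langle W_i, A_i\rangle = 1 - \|W_i - A_i\|^2/2$ for unit vectors; the following sketch (ours, with illustrative dimensions) probes it against randomly drawn unit vectors:

```python
import numpy as np

# Numerical check (ours) of Claim 1 (i): for unit-norm W_i, A_i with
# ||W_i - A_i|| <= delta, the identity <W_i, A_i> = 1 - ||W_i - A_i||^2 / 2
# forces <W_i, A_i> >= 1 - delta^2 / 2.
rng = np.random.default_rng(0)
n, delta = 64, 0.3
ok, checked = True, 0
for _ in range(1000):
    A_i = rng.normal(size=n); A_i /= np.linalg.norm(A_i)
    # random unit vector in a delta-neighborhood of A_i
    W_i = A_i + delta * rng.uniform() * rng.normal(size=n) / np.sqrt(n)
    W_i /= np.linalg.norm(W_i)
    d = np.linalg.norm(W_i - A_i)
    if d <= delta:
        checked += 1
        inner = W_i @ A_i
        ok &= bool(np.isclose(inner, 1 - d**2 / 2))
        ok &= inner >= 1 - delta**2 / 2 - 1e-12
print(ok, checked)
```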



Claim 2. Suppose $\|W_i\| = 1$. Then $\max_i |\langle W_i, \eta\rangle| \leq \sigma_\eta \log n$ holds with high probability.

Proof. Since $\eta$ is a spherical Gaussian random vector and $\|W_i\| = 1$, $\langle W_i, \eta\rangle$ is Gaussian with mean $0$ and variance $\sigma_\eta^2$. Using the Gaussian tail bound for $\langle W_i, \eta\rangle$ and taking a union bound over $i = 1, 2, \dots, m$, we have that $\max_i |\langle W_i, \eta\rangle| \leq \sigma_\eta \log n$ holds with high probability.
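A small simulation (ours; sizes are illustrative) makes Claim 2 concrete: each $\langle W_i, \eta\rangle$ is $\mathcal{N}(0, \sigma_\eta^2)$, so the maximum over $m$ columns rarely exceeds $\sigma_\eta \log n$:

```python
import numpy as np

# Simulation (ours) of Claim 2: for unit-norm W_i and spherical Gaussian
# noise eta with entrywise std sigma, each <W_i, eta> ~ N(0, sigma^2), so
# max_i |<W_i, eta>| <= sigma * log(n) holds in the vast majority of trials.
rng = np.random.default_rng(0)
n, m, sigma, trials = 200, 50, 0.1, 2000
W = rng.normal(size=(n, m))
W /= np.linalg.norm(W, axis=0)          # unit-norm columns

hits = 0
for _ in range(trials):
    eta = sigma * rng.normal(size=n)
    if np.max(np.abs(W.T @ eta)) <= sigma * np.log(n):
        hits += 1
frac = hits / trials
print(frac)
```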

Proof of Theorem 3. Denote $z = W^\top y + b$ and let $i \in [m]$ be fixed for a moment. (Later we use a union bound argument to account for all $i$.) Denote $S = \mathrm{supp}(x^*)$ and $R = S\setminus\{i\}$. Notice that $x_i = 0$ if $i \notin S$ by definition. One can write the $i$-th entry $z_i$ of the weighted sum $z$ as

\[
\begin{aligned}
z_i &= W_i^\top(A_S x_S^* + \eta) + b_i \\
&= \langle W_i, A_i\rangle x_i^* + \sum_{j \in R}\langle W_i, A_j\rangle x_j^* + \langle W_i, \eta\rangle + b_i \\
&= \langle W_i, A_i\rangle x_i^* + Z_i + \langle W_i, \eta\rangle + b_i,
\end{aligned}
\]

where we write $Z_i = \sum_{j \in R}\langle W_i, A_j\rangle x_j^*$. Roughly speaking, since $\langle W_i, A_i\rangle$ is close to $1$, $z_i$ approximately equals $x_i^*$ if we can control the remaining terms. This will be made precise below, separately for the different generative models.

A.1 Case (i): Sparse coding model

For this setting, the hidden code $x^*$ is $k$-sparse and is not restricted to non-negative values. The nonzero entries are mutually independent sub-Gaussian with mean $\kappa_1 = 0$ and variance $\kappa_2 = 1$. Note further that $a_1 \in (0, 1]$ and $a_2 = \infty$, and the dictionary is incoherent and overcomplete. Since the true code is sparse and takes both positive and negative values, it is natural to consider the hard-thresholding activation. The consistency is studied in [16] for the case of sparse coding (see Appendix C; see also [28], Lemma 8, for a treatment of the noise).

A.2 Case (ii) and (iii): Non-negative k-sparse model

Recall that $S = \mathrm{supp}(x^*)$ and that $x_j^* \in [a_1, a_2]$ for $j \in S$. The Cauchy-Schwarz inequality implies
\[ |Z_i| = \Big|\sum_{j\in R}\langle W_i, A_j\rangle x_j^*\Big| \leq \sqrt{\sum_{j\in R}\langle W_i, A_j\rangle^2}\;\|x^*\| \leq a_2\sqrt{\frac{\mu^2k^2}{n} + k\delta^2}, \]
where we use bound (iii) in Claim 1 and $\|x^*\| \leq a_2\sqrt{k}$. If $i \in S$, then w.h.p.

\[ z_i = \langle W_i, A_i\rangle x_i^* + Z_i + \langle W_i, \eta\rangle + b_i \geq (1 - \delta^2/2)a_1 - a_2\sqrt{\frac{\mu^2k^2}{n} + k\delta^2} - \sigma_\eta \log n + b_i > 0 \]
for $b_i \geq -(1-\delta)a_1 + a_2\delta\sqrt{k}$, since $a_2\delta\sqrt{k} \ll (1-\delta)a_1$, $k = O(1/\delta^2) = O(\log n)$, $\mu \leq \delta\sqrt{n}/k$, and $\sigma_\eta = O(1/\sqrt{n})$. On the other hand, when $i \notin S$, then w.h.p.

\[ z_i = Z_i + \langle W_i, \eta\rangle + b_i \leq a_2\sqrt{\frac{\mu^2k^2}{n} + k\delta^2} + \sigma_\eta \log n + b_i \leq 0 \]

for $b_i \leq -a_2\sqrt{\mu^2k^2/n + k\delta^2} - \sigma_\eta \log n \approx -a_2\delta\sqrt{k}$. Due to the use of Claim 2, these results hold w.h.p. uniformly for all $i$, and so $\mathrm{supp}(x) = S$ for $x = \mathrm{ReLU}(W^\top y + b)$ w.h.p. We re-use the tail bound $\mathbb{P}[Z_i \geq \epsilon]$ given in [11], Theorem 3.1. Moreover, one can also see that with high probability $z_i > a_1/2$ if $i \in S$ and $z_i < a_2\delta\sqrt{k} < a_1/4$ otherwise. These results hold w.h.p. uniformly for all $i$, and so $x = \mathrm{threshold}_{1/2}(z)$ has the same support as $x^*$ w.h.p.
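The two threshold conditions above can be exercised numerically. The sketch below (ours; all sizes and constants are illustrative assumptions) draws non-negative $k$-sparse codes with an incoherent random dictionary and checks how often $\mathrm{ReLU}(W^\top y + b)$ recovers the support:

```python
import numpy as np

# Simulation (ours) of the support-consistency argument in Case (ii)/(iii):
# with a non-negative k-sparse code, an incoherent random dictionary, and a
# bias near -a2*delta*sqrt(k), ReLU(W^T y + b) recovers supp(x*) in nearly
# every trial.
rng = np.random.default_rng(0)
n, m, k = 1024, 50, 3
a1, a2 = 0.5, 1.0
delta = 0.1

A = rng.normal(size=(n, m)); A /= np.linalg.norm(A, axis=0)
W = A + delta * rng.normal(size=(n, m)) / np.sqrt(n)
W /= np.linalg.norm(W, axis=0)

b = -a2 * delta * np.sqrt(k)            # threshold between the two regimes
trials, hits = 300, 0
for _ in range(trials):
    S = rng.choice(m, size=k, replace=False)
    x_star = np.zeros(m)
    x_star[S] = rng.uniform(a1, a2, size=k)
    y = A @ x_star                       # noiseless, as in this case
    x = np.maximum(W.T @ y + b, 0.0)
    if set(np.nonzero(x)[0]) == set(S):
        hits += 1
print(hits / trials)
```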

B Proof of Theorem 4

B.1 Case (i): Mixture of Gaussians

We start by simplifying the form of $g_i$ using the generative model (3) and Theorem 3. First, from the model we have $p_i = \mathbb{P}[x_i^* \neq 0] = \Theta(1/m)$, $\mathbb{E}[\eta] = 0$ and $\mathbb{E}[\eta\eta^\top] = \sigma_\eta^2 I$. Second, by Theorem 3 (i), $\mathbf{1}_{x_i \neq 0} = \mathbf{1}_{x_i^* \neq 0} = 1$ with high probability. As such, under this event we have $x_i = \sigma(W_i^\top y + b_i) = W_i^\top y + b_i$ for both choices of $\sigma$ (Theorem 3).

To analyze gi, we observe that

\[ \gamma = \mathbb{E}[(W_i^\top y I + b_i I + y W_i^\top)(y - Wx)(\mathbf{1}_{x_i^* \neq 0} - \mathbf{1}_{x_i \neq 0})] \]
has norm of order $O(n^{-\omega(1)})$, since the failure probability of the support consistency event is sufficiently small for large $n$ and the remaining term has bounded moments. One can write:

\[
\begin{aligned}
g_i &= -\mathbb{E}[\mathbf{1}_{x_i^*\neq 0}(W_i^\top y I + b_i I + y W_i^\top)(y - Wx)] + \gamma \\
&= -\mathbb{E}[\mathbf{1}_{x_i^*\neq 0}(W_i^\top y I + y W_i^\top + b_i I)(y - W_iW_i^\top y - b_iW_i)] + \gamma \\
&= -\mathbb{E}[\mathbf{1}_{x_i^*\neq 0}(W_i^\top y I + y W_i^\top)(I - W_iW_i^\top)y] + b_i\,\mathbb{E}[\mathbf{1}_{x_i^*\neq 0}(W_i^\top y I + y W_i^\top)]W_i \\
&\quad - b_i\,\mathbb{E}[\mathbf{1}_{x_i^*\neq 0}(I - W_iW_i^\top)y] + b_i^2 W_i\,\mathbb{E}[\mathbf{1}_{x_i^*\neq 0}] + \gamma \\
&= g_i^{(1)} + g_i^{(2)} + g_i^{(3)} + p_ib_i^2 W_i + \gamma,
\end{aligned}
\]

Next, we study each of the $g_i^{(t)}$, $t = 1, 2, 3$, using the fact that $y = A_i + \eta$ when $x_i^* = 1$. To simplify notation, denote $\lambda_i = \langle W_i, A_i\rangle$. Then

\[
\begin{aligned}
g_i^{(1)} &= -\mathbb{E}[(W_i^\top(A_i + \eta) I + (A_i + \eta)W_i^\top)(I - W_iW_i^\top)(A_i + \eta)\,\mathbf{1}_{x_i^*\neq 0}] \\
&= -\mathbb{E}[(\lambda_i I + A_iW_i^\top + \langle W_i, \eta\rangle I + \eta W_i^\top)(I - W_iW_i^\top)(A_i + \eta)\,\mathbf{1}_{x_i^*\neq 0}] \\
&= -(\lambda_i I + A_iW_i^\top)(A_i - \lambda_iW_i)\,\mathbb{P}[x_i^* \neq 0] - \mathbb{E}[(\langle W_i, \eta\rangle I + \eta W_i^\top)(I - W_iW_i^\top)\eta\,\mathbf{1}_{x_i^*\neq 0}] \\
&= -p_i\lambda_iA_i + p_i\lambda_i^2 W_i - \mathbb{E}[(\langle W_i, \eta\rangle I + \eta W_i^\top)(I - W_iW_i^\top)\eta\,\mathbf{1}_{x_i^*\neq 0}],
\end{aligned}
\]

where we use $p_i = \mathbb{P}[x_i^* \neq 0]$ and the normalization $\|W_i\| = 1$. Also, since $\eta$ is spherical Gaussian-distributed, we have:

\[ \mathbb{E}[(\langle W_i, \eta\rangle I + \eta W_i^\top)(I - W_iW_i^\top)\eta\,\mathbf{1}_{x_i^*\neq 0}] = p_i\,\mathbb{E}[\langle W_i, \eta\rangle\eta - \langle W_i, \eta\rangle^2 W_i] = p_i\sigma_\eta^2(1 - \|W_i\|^2)W_i = 0. \]
To sum up, we have
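This vanishing noise term is easy to confirm by simulation (ours; sizes are illustrative):

```python
import numpy as np

# Monte Carlo check (ours) that the noise term vanishes when ||W_i|| = 1:
# E[(<W_i,eta> I + eta W_i^T)(I - W_i W_i^T) eta] = 0 for spherical Gaussian eta.
rng = np.random.default_rng(0)
n, N = 16, 100000
W_i = rng.normal(size=n); W_i /= np.linalg.norm(W_i)

eta = rng.normal(size=(N, n))                  # sigma = 1 w.l.o.g.
proj = eta - np.outer(eta @ W_i, W_i)          # (I - W_i W_i^T) eta
# per-sample value of (<W_i,eta> I + eta W_i^T)(I - W_i W_i^T) eta
term = (eta @ W_i)[:, None] * proj + eta * (proj @ W_i)[:, None]
mc = term.mean(axis=0)
print(np.linalg.norm(mc))                      # close to zero
```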

\[ g_i^{(1)} = -p_i\lambda_iA_i + p_i\lambda_i^2 W_i. \quad (6) \]
For the second term,

\[ g_i^{(2)} = b_i\,\mathbb{E}[\mathbf{1}_{x_i^*\neq 0}(W_i^\top y I + y W_i^\top)]W_i = b_i\,\mathbb{E}[\mathbf{1}_{x_i^*\neq 0}(W_i^\top(A_i + \eta) I + (A_i + \eta)W_i^\top)]W_i = b_i\,\mathbb{E}[(\lambda_iW_i + \|W_i\|^2A_i)\,\mathbf{1}_{x_i^*\neq 0}] \]

\[ = p_ib_i\lambda_iW_i + p_ib_iA_i. \quad (7) \]

In the second step, we use the independence of spherical η and x. Similarly, we can compute the third term:

\[ g_i^{(3)} = -b_i(I - W_iW_i^\top)\mathbb{E}[y\,\mathbf{1}_{x_i^*\neq 0}] = -b_i(I - W_iW_i^\top)\mathbb{E}[(A_i + \eta)\,\mathbf{1}_{x_i^*\neq 0}] = -p_ib_i(I - W_iW_i^\top)A_i \]

\[ = -p_ib_iA_i + p_ib_i\lambda_iW_i. \quad (8) \]

Putting (6), (7) and (8) together, we have

\[ g_i = -p_i\lambda_iA_i + p_i(\lambda_i^2 + 2b_i\lambda_i + b_i^2)W_i + \gamma. \]

Having established the closed form for $g_i$, one can observe that when $b_i$ is such that $\lambda_i^2 + 2b_i\lambda_i + b_i^2 \approx \lambda_i$, $g_i$ roughly points in the desired direction toward $A_i$, suggesting a positive correlation of $g_i$ with $W_i - A_i$. Now we prove this result.

Proof of Lemma 1. Denote $v = p_i(\lambda_i^2 + 2b_i\lambda_i + b_i^2 - \lambda_i)W_i + \gamma$. Then

\[
\begin{aligned}
g_i &= -p_i\lambda_iA_i + p_i(\lambda_i^2 + 2b_i\lambda_i + b_i^2)W_i + \gamma \quad (9)\\
&= p_i\lambda_i(W_i - A_i) + v.
\end{aligned}
\]

By expanding (9), we have

\[ 2\langle v, W_i - A_i\rangle = \frac{1}{p_i\lambda_i}\|g_i\|^2 - p_i\lambda_i\|W_i - A_i\|^2 - \frac{1}{p_i\lambda_i}\|v\|^2. \]

Using this equality and taking the inner product of both sides of (9) with $W_i - A_i$, we get

\[ 2\langle g_i, W_i - A_i\rangle = p_i\lambda_i\|W_i - A_i\|^2 + \frac{1}{p_i\lambda_i}\|g_i\|^2 - \frac{1}{p_i\lambda_i}\|v\|^2. \]

We need an upper bound for $\|v\|^2$. Since

\[ |(b_i + \lambda_i)^2 - \lambda_i| \leq 2(1 - \lambda_i) \quad\text{and}\quad 2(1 - \lambda_i) = \|W_i - A_i\|^2, \]
we have
\[ |(b_i + \lambda_i)^2 - \lambda_i| \leq \|W_i - A_i\|^2 \leq \delta\|W_i - A_i\|. \]
Notice that

\[ \|v\|^2 = \|p_i(\lambda_i^2 + 2b_i\lambda_i + b_i^2 - \lambda_i)W_i + \gamma\|^2 \leq 2p_i^2\delta^2\|W_i - A_i\|^2 + 2\|\gamma\|^2. \]

Now one can easily show that

\[ 2\langle g_i, W_i - A_i\rangle \geq p_i(\lambda_i - 2\delta^2)\|W_i - A_i\|^2 + \frac{1}{p_i\lambda_i}\|g_i\|^2 - \frac{2}{p_i\lambda_i}\|\gamma\|^2. \]



B.2 Case (ii): General k-Sparse Coding

For this case, we adopt the same analysis as used in Case (i). The difference lies in the distributional assumption on $x^*$, whose nonzero entries are independent sub-Gaussian. Specifically, given the support $S$ of size at most $k$, with $p_i = \mathbb{P}[i \in S] = \Theta(k/m)$ and $p_{ij} = \mathbb{P}[i, j \in S] = \Theta(k^2/m^2)$, we suppose $\mathbb{E}[x_i^*\,|\,S] = 0$ and $\mathbb{E}[x_S^*x_S^{*\top}\,|\,S] = I$. For simplicity, we choose to skip the noise, i.e., $y = Ax^*$, for this case. Our analysis is robust to i.i.d. additive Gaussian noise in the data; see [28] for a similar treatment. Also, according to Theorem 3, we set $b_i = 0$ to obtain support consistency. With zero bias, the expected update rule $g_i$ becomes

\[ g_i = -\mathbb{E}[(W_i^\top y I + y W_i^\top)(y - Wx)\,\mathbf{1}_{x_i \neq 0}]. \]

For $S = \mathrm{supp}(x^*)$, we have $y = A_Sx_S^*$. Theorem 3 (ii) shows that $\mathrm{supp}(x) = S$ w.h.p., so under that event we can write $Wx = W_Sx_S = W_S(W_S^\top y)$. Similar to the previous cases, $\gamma$ denotes a generic quantity whose norm is of order $n^{-\omega(1)}$ due to the converging probability of the support consistency. Now, we substitute the forms of $y$ and $x$ into $g_i$:

\[
\begin{aligned}
g_i &= -\mathbb{E}[(W_i^\top y I + y W_i^\top)(y - Wx)\,\mathbf{1}_{x_i \neq 0}] \\
&= -\mathbb{E}[(W_i^\top y I + y W_i^\top)(y - W_SW_S^\top y)\,\mathbf{1}_{x_i^* \neq 0}] + \gamma \\
&= -\mathbb{E}[(I - W_SW_S^\top)(W_i^\top A_Sx_S^*)A_Sx_S^*\,\mathbf{1}_{x_i^* \neq 0}] - \mathbb{E}[(A_Sx_S^*)W_i^\top(I - W_SW_S^\top)A_Sx_S^*\,\mathbf{1}_{x_i^* \neq 0}] + \gamma \\
&= g_i^{(1)} + g_i^{(2)} + \gamma.
\end{aligned}
\]

Write
\[ g_{i,S}^{(1)} = -\mathbb{E}[(I - W_SW_S^\top)(W_i^\top A_Sx_S^*)A_Sx_S^*\,\mathbf{1}_{x_i^* \neq 0}\,|\,S] \quad\text{and}\quad g_{i,S}^{(2)} = -\mathbb{E}[(A_Sx_S^*)W_i^\top(I - W_SW_S^\top)A_Sx_S^*\,\mathbf{1}_{x_i^* \neq 0}\,|\,S], \]

so that $g_i^{(1)} = \mathbb{E}[g_{i,S}^{(1)}]$ and $g_i^{(2)} = \mathbb{E}[g_{i,S}^{(2)}]$. It is easy to see that $\mathbb{E}[x_j^*x_l^*\,\mathbf{1}_{x_i^*\neq 0}\,|\,S] = 1$ if $i = j = l \in S$ and $\mathbb{E}[x_j^*x_l^*\,\mathbf{1}_{x_i^*\neq 0}\,|\,S] = 0$ otherwise. Therefore, $g_{i,S}^{(1)}$ becomes

\[
\begin{aligned}
g_{i,S}^{(1)} &= -\mathbb{E}[(I - W_SW_S^\top)(W_i^\top A_Sx_S^*)A_Sx_S^*\,\mathbf{1}_{x_i^* \neq 0}\,|\,S] \quad (10)\\
&= -\sum_{j,l\in S}\mathbb{E}[(I - W_SW_S^\top)(W_i^\top A_j)A_l\,x_j^*x_l^*\,\mathbf{1}_{x_i^* \neq 0}\,|\,S] \\
&= -\lambda_i(I - W_SW_S^\top)A_i, \quad (11)
\end{aligned}
\]

where we use the earlier notation $\lambda_i = W_i^\top A_i$. A similar calculation for the second term results in

\[
\begin{aligned}
g_{i,S}^{(2)} &= -\mathbb{E}[(A_Sx_S^*)W_i^\top(I - W_SW_S^\top)A_Sx_S^*\,\mathbf{1}_{x_i^* \neq 0}\,|\,S] \quad (12)\\
&= -\mathbb{E}\Big[\sum_{j\in S}x_j^*A_j\,W_i^\top(I - W_SW_S^\top)\sum_{l\in S}x_l^*A_l\,\mathbf{1}_{x_i^* \neq 0}\,\Big|\,S\Big] \\
&= -\sum_{j,l\in S}\mathbb{E}[A_jW_i^\top(I - W_SW_S^\top)A_l\,x_j^*x_l^*\,\mathbf{1}_{x_i^* \neq 0}\,|\,S] \\
&= -A_iW_i^\top(I - W_SW_S^\top)A_i. \quad (13)
\end{aligned}
\]
Now we combine the results in (10) and (12) to compute the expectation over $S$.

\[
\begin{aligned}
g_i &= \mathbb{E}[g_{i,S}^{(1)} + g_{i,S}^{(2)}] + \gamma \quad (14)\\
&= -\mathbb{E}[\lambda_i(I - W_SW_S^\top)A_i + A_iW_i^\top(I - W_SW_S^\top)A_i] + \gamma \\
&= -\mathbb{E}\Big[2\lambda_iA_i - \lambda_i\sum_{j\in S}W_jW_j^\top A_i - A_iW_i^\top\sum_{j\in S}W_jW_j^\top A_i\Big] + \gamma \\
&= -2p_i\lambda_iA_i + \mathbb{E}\Big[\lambda_i\sum_{j\in S}W_jW_j^\top A_i + \sum_{j\in S}\langle W_i, W_j\rangle\langle A_i, W_j\rangle A_i\Big] + \gamma \\
&= -2p_i\lambda_iA_i + \mathbb{E}\Big[\lambda_i^2W_i + \lambda_i\sum_{j\in R}\langle A_i, W_j\rangle W_j + \lambda_i\|W_i\|^2A_i + \sum_{j\in R}\langle W_i, W_j\rangle\langle A_i, W_j\rangle A_i\Big] + \gamma,
\end{aligned}
\]
where $p_i = \mathbb{P}[i \in S]$ and $R = S\setminus\{i\}$. Moreover, $\|W_i\| = 1$, hence

\[
\begin{aligned}
g_i &= -p_i\lambda_iA_i + p_i\lambda_i^2W_i + \sum_{j\in[m]\setminus\{i\}}p_{ij}\lambda_i\langle A_i, W_j\rangle W_j + \sum_{j\in[m]\setminus\{i\}}p_{ij}\langle W_i, W_j\rangle\langle A_i, W_j\rangle A_i + \gamma \\
&= -p_i\lambda_iA_i + p_i\lambda_i^2W_i + \lambda_iW_{-i}\,\mathrm{diag}(p_{ij})W_{-i}^\top A_i + (W_i^\top W_{-i}\,\mathrm{diag}(p_{ij})W_{-i}^\top A_i)A_i + \gamma, \quad (15)
\end{aligned}
\]

where $W_{-i} = (W_1, \dots, W_{i-1}, W_{i+1}, \dots, W_m)$ denotes $W$ with the $i$-th column removed, and $\mathrm{diag}(p_{ij})$ denotes the diagonal matrix formed by the $p_{ij}$, $j \in [m]\setminus\{i\}$.

Observe that, ignoring lower-order terms, $g_i$ can be written as $p_i\lambda_i(W_i - A_i) + p_i\lambda_i(\lambda_i - 1)W_i$, which roughly points in the desired direction toward $A_i$. Rigorously, we argue the following:
Lemma 2. Suppose $W$ is $(\delta, 2)$-near to $A$. Then

\[ 2\langle g_i, W_i - A_i\rangle \geq p_i\lambda_i\|W_i - A_i\|^2 + \frac{1}{p_i\lambda_i}\|g_i\|^2 - O(p_ik^2/(n^2\lambda_i)). \]

Proof. We proceed with similar steps as in the proof of Lemma 1. By nearness,
\[ \|W\| \leq \|W - A\| + \|A\| \leq 3\|A\| \leq O(\sqrt{m/n}). \]

Also, $p_i = \Theta(k/m)$ and $p_{ij} = \Theta(k^2/m^2)$. Then

\[ \|W_{-i}\,\mathrm{diag}(p_{ij})W_{-i}^\top A_i\| \leq p_i\|W_{-i}\,\mathrm{diag}(p_{ij}/p_i)W_{-i}^\top\| \leq p_i\|W_{-i}\|^2\max_{j\neq i}(p_{ij}/p_i) = O(p_ik/n). \]

Similarly,
\[ \|(W_i^\top W_{-i}\,\mathrm{diag}(p_{ij})W_{-i}^\top A_i)A_i\| \leq O(p_ik/n). \]

Now we denote

\[ v = p_i\lambda_i(\lambda_i - 1)W_i + \lambda_iW_{-i}\,\mathrm{diag}(p_{ij})W_{-i}^\top A_i + (W_i^\top W_{-i}\,\mathrm{diag}(p_{ij})W_{-i}^\top A_i)A_i + \gamma. \]
Then $g_i = p_i\lambda_i(W_i - A_i) + v$,

where $\|v\| \leq p_i\lambda_i(\delta/2)\|W_i - A_i\| + O(p_ik/n) + \|\gamma\|$. Therefore, we obtain
\[ 2\langle g_i, W_i - A_i\rangle \geq p_i\lambda_i\Big(1 - \frac{\delta}{2}\Big)\|W_i - A_i\|^2 + \frac{1}{p_i\lambda_i}\|g_i\|^2 - O(p_ik^2/(n^2\lambda_i)), \]
where we assume that $\|\gamma\|$ is negligible compared with $O(p_ik/n)$. Adopting the same arguments as in the proof of Case (i), we obtain the descent property column-wise for the normalized gradient update with step size $\zeta = \max_i(1/(p_i\lambda_i))$; that is, there is some $\tau \in (0, 1)$ such that

\[ \|W_i^{s+1} - A_i\|^2 \leq (1 - \tau)\|W_i^s - A_i\|^2 + O(p_ik^2/(n^2\lambda_i)). \]
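The recursion above contracts the squared error geometrically down to a floor set by the additive term; a two-line iteration (ours, with placeholder constants) makes this visible:

```python
# Sketch (ours): iterating e_{s+1} = (1 - tau) * e_s + c, as in the column-wise
# descent bound, contracts the error geometrically down to the floor c / tau.
# Here c stands in for the O(p_i k^2 / (n^2 lambda_i)) term.
tau, c = 0.3, 1e-6
e = 0.25                        # initial squared distance ||W_i^0 - A_i||^2
history = [e]
for _ in range(100):
    e = (1 - tau) * e + c
    history.append(e)
floor = c / tau
print(history[-1], floor)
```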

Since $p_i = \Theta(k/m)$, we consequently obtain the descent in Frobenius norm stated in Theorem 4, item (ii).
Lemma 3 (Maintaining the nearness). $\|W - A\| \leq 2\|A\|$.

Proof. The proof follows from [16] (Lemma 24 and Lemma 32).

B.3 Case (iii): Non-negative k-Sparse Coding

We proceed with the proof similarly to the above case of the general $k$-sparse code. Additional effort is required due to the positive mean of the nonzero coefficients in $x^*$. For $x = \sigma(W^\top y + b)$, we have the support recovery for both choices of $\sigma$, as shown in (ii) and (iii) of Theorem 3. Hence we re-use the expansion in [11] to compute the expected approximate gradient. Note that we standardize $W_i$ such that $\|W_i\| = 1$ and ignore the noise $\eta$. Let $i$ be fixed and consider the approximate gradient for the $i$-th column of $W$. The expected approximate gradient has the following form:

\[ g_i = -\mathbb{E}[\mathbf{1}_{x_i\neq 0}(W_i^\top y I + b_i I + y W_i^\top)(y - Wx)] = \alpha_iW_i - \beta_iA_i + e_i, \]
where

\[
\begin{aligned}
\alpha_i &= \kappa_2p_i\lambda_i^2 + \kappa_2\sum_{j\neq i}p_{ij}\langle W_i, A_j\rangle^2 + 2\kappa_1^2\sum_{j\neq i}p_{ij}\lambda_i\langle W_i, A_j\rangle + \kappa_1^2\sum_{j\neq l\neq i}p_{ijl}\langle W_i, A_j\rangle\langle W_i, A_l\rangle \\
&\quad + 2\kappa_1p_ib_i\lambda_i + 2\kappa_1\sum_{j\neq i}p_{ij}b_i\langle W_i, A_j\rangle + p_ib_i^2;
\end{aligned}
\]

\[
\begin{aligned}
\beta_i &= \kappa_2p_i\lambda_i - \kappa_2\sum_{j\neq i}p_{ij}\langle W_i, W_j\rangle\langle A_i, W_j\rangle + \kappa_1^2\sum_{j\neq i}p_{ij}\langle W_i, A_j\rangle - \kappa_1^2\sum_{j\neq i}p_{ij}\langle W_i, W_j\rangle\langle W_j, A_j\rangle \\
&\quad - \kappa_1^2\sum_{j\neq l\neq i}p_{ijl}\langle W_i, W_j\rangle\langle W_j, A_l\rangle - \kappa_1\sum_{j\neq i}p_{ij}b_i\langle W_i, W_j\rangle;
\end{aligned}
\]

and $e_i$ is a term with norm $\|e_i\| \leq O(\max(\kappa_1^2, \kappa_2)p_ik/m)$, a rough bound obtained in [11] (see the proof of Lemma 5.2 on pages 26 and 35 of [11]). As a sanity check, plugging the parameters of the mixture of Gaussians into $\alpha_i$, $\beta_i$ and $e_i$, we get the same expression for $g_i$ as in Case (i). We will show that only the first term in $\alpha_i$ is dominant, apart from the ones involving the bias $b_i$. The argument for $\beta_i$ follows similarly.
Claim 3.

\[ \alpha_i = \kappa_2p_i\lambda_i^2 + \kappa_2O(p_ik/m) + 2\kappa_1^2p_i\lambda_iO(k/\sqrt{m}) + \kappa_1^2O(p_ik^2/m) + 2\kappa_1p_ib_i\lambda_i + 2\kappa_1p_ib_iO(k/\sqrt{m}) + p_ib_i^2. \]

Proof. We bound the corresponding terms in $\alpha_i$ one by one. We start with the second term:
\[ \sum_{j\neq i}p_{ij}\langle W_i, A_j\rangle^2 \leq \max_{j\neq i}p_{ij}\sum_{j\neq i}\langle W_i, A_j\rangle^2 \leq \max_{j\neq i}p_{ij}\,\|A_{-i}^\top W_i\|^2 \leq O(p_ik/m), \]

since $p_{ij} = \Theta(k^2/m^2) = \Theta(p_ik/m)$. Similarly, we have

\[ \Big|\sum_{j\neq i}p_{ij}\langle W_i, A_j\rangle\Big| = \Big|W_i^\top\sum_{j\neq i}p_{ij}A_j\Big| \leq \|W_i\|\|A\|\sqrt{\sum_{j\neq i}p_{ij}^2} \leq O(p_ik/\sqrt{m}), \]
which leads to a bound on the third and the sixth terms. Note that this bound will be re-used to bound the corresponding term in $\beta_i$. The next term is bounded as follows:

\[ \sum_{\substack{j\neq l\\ j,l\neq i}}p_{ijl}\langle W_i, A_j\rangle\langle W_i, A_l\rangle = W_i^\top\Big(\sum_{\substack{j\neq l\\ j,l\neq i}}p_{ijl}A_jA_l^\top\Big)W_i \leq \Big\|\sum_{\substack{j\neq l\\ j,l\neq i}}p_{ijl}A_jA_l^\top\Big\|\,\|W_i\|^2 \leq O(p_ik^2/m), \]
where $M = \sum_{j\neq l;\,j,l\neq i}p_{ijl}A_jA_l^\top = A_{-i}QA_{-i}^\top$ with $Q_{jl} = p_{ijl}$ for $j \neq l$ and $Q_{jl} = 0$ otherwise. Again, $A_{-i}$ denotes the matrix $A$ with its $i$-th column removed. We have $p_{ijl} = \Theta(k^3/m^3) \leq O(p_ik^2/m^2)$; therefore, $\|M\| \leq \|Q\|_F\|A\|^2 \leq O(p_ik^2/m)$.

Claim 4.

\[ \beta_i = \kappa_2p_i\lambda_i - \kappa_2O(p_ik/m) + \kappa_1^2O(p_ik/\sqrt{m}) - \kappa_1^2O(p_ik/\sqrt{m}) - \kappa_1^2O(p_ik^2/m) - \kappa_1b_iO(p_ik/\sqrt{m}). \]

Proof. We proceed similarly to the proof of Claim 3. Due to nearness and the fact that $\|A\| = O(\sqrt{m/n}) = O(1)$, we can conclude that $\|W\| \leq O(1)$. For the second term, we have

\[ \Big\|\sum_{j\neq i}p_{ij}\langle W_i, W_j\rangle\langle A_i, W_j\rangle\Big\| = \Big\|W_i^\top\sum_{j\neq i}p_{ij}W_jW_j^\top A_i\Big\| \leq \max_{j\neq i}p_{ij}\,\|W_{-i}W_{-i}^\top\|\,\|W_i\|\,\|A_i\| \leq O(p_ik/m), \]

where the $W_jW_j^\top$ are p.s.d., and so $0 \preceq \sum_{j\neq i}p_{ij}W_jW_j^\top \preceq (\max_{j\neq i}p_{ij})\big(\sum_{j\neq i}W_jW_j^\top\big) \preceq \max_{j\neq i}p_{ij}\,W_{-i}W_{-i}^\top$. To bound the third one, we use the fact that $|\lambda_j| = |\langle W_j, A_j\rangle| \leq 1$. Hence, from the proof of Claim 3,

\[ \Big\|\sum_{j\neq i}p_{ij}\langle W_i, W_j\rangle\langle W_j, A_j\rangle\Big\| = \Big\|\sum_{j\neq i}p_{ij}\lambda_j\langle W_i, W_j\rangle\Big\| \leq \|W_i\|\,\|W\|\sqrt{\sum_{j\neq i}(p_{ij}\lambda_j)^2} \leq O(p_ik/\sqrt{m}), \]
which is also the bound for the last term. The remaining term can be bounded as follows:

\[ \Big\|\sum_{j\neq l\neq i}p_{ijl}\langle W_i, W_j\rangle\langle W_j, A_l\rangle\Big\| \leq \Big\|\sum_{j\neq l\neq i}p_{ijl}W_jW_j^\top A_l\Big\| \leq \sum_{l\neq i}\|p_{ijl}W_{-i}W_{-i}^\top\| \leq \max_{j\neq l\neq i}p_{ijl}\sum_{l\neq i}\|W_{-i}\|^2 \leq O(p_ik^2/m). \]



From Claims 3 and 4, and for $b_i \in (-1, 0)$, we have:

\[ \alpha_i = p_i(\kappa_2\lambda_i^2 + 2\kappa_1b_i\lambda_i + b_i^2) + O(\max(\kappa_1^2, \kappa_2)k/\sqrt{m}) \quad\text{and}\quad \beta_i = \kappa_2p_i\lambda_i + O(\max(\kappa_1^2, \kappa_2)k/\sqrt{m}), \]
where we implicitly require that $k \leq O(\sqrt{n})$, which is even weaker than the condition $k = O(1/\delta^2)$ stated in Theorem 3. Now we recall the form of $g_i$:

\[ g_i = -\kappa_2p_i\lambda_iA_i + p_i(\kappa_2\lambda_i^2 + 2\kappa_1b_i\lambda_i + b_i^2)W_i + v, \quad (16) \]

where $v = O(\max(\kappa_1^2, \kappa_2)k/\sqrt{m})A_i + O(\max(\kappa_1^2, \kappa_2)k/\sqrt{m})W_i + e_i$. Therefore, $\|v\| \leq O(\max(\kappa_1^2, \kappa_2)k/\sqrt{m})$.

Lemma 4. Suppose $W$ is $\delta$-close to $A$ and the bias satisfies $|\kappa_2\lambda_i^2 + 2\kappa_1b_i\lambda_i + b_i^2 - \kappa_2\lambda_i| \leq 2\kappa_2(1 - \lambda_i)$. Then

\[ 2\langle g_i, W_i - A_i\rangle \geq \kappa_2p_i(\lambda_i - 2\delta^2)\|W_i - A_i\|^2 + \frac{1}{\kappa_2p_i\lambda_i}\|g_i\|^2 - O\Big(\max(1, \kappa_2/\kappa_1)^2\frac{k}{p_im}\Big). \]

The proof of this lemma, and of the descent, is the same as that of Lemma 1 for the case of the Gaussian mixture. Again, the condition on the bias holds when $b_i = 0$ and the thresholding activation is used, but it breaks down when a nonzero bias is kept fixed across iterations. Now, we give an analysis for a bias update. Similarly to the mixture-of-Gaussians case, the bias is updated as

\[ b^{s+1} = b^s/C, \]
for some $C > 1$. The proof remains the same to guarantee the consistency and also the descent. The last step is to maintain the nearness for the new update. Since it is tedious to argue this for the complicated form of $g_i$, we can instead perform a projection onto the convex set $B = \{W \,|\, W \text{ is } \delta\text{-close to } A \text{ and } \|W\| \leq 2\|A\|\}$ to guarantee the nearness. The details can be found in [16].

B.4 Auxiliary Lemma

In our descent analysis, we assume a normalization of $W$'s columns after each descent update. The descent property is achieved for the unnormalized version and does not directly imply the $\delta$-closeness of the current normalized estimate. The latter is established by the following lemma:

Lemma 5. Suppose that $\|W_i^s\| = \|A_i\| = 1$, that $\|W_i^s - A_i\| \leq \delta_s$, and that the gradient update $\widetilde{W}_i^{s+1}$ satisfies $\|\widetilde{W}_i^{s+1} - A_i\| \leq (1 - \tau)\|W_i^s - A_i\| + o(\delta_s)$. Then, for $\frac{1+\delta_s}{2+\delta_s} \leq \tau < 1$, we have

\[ \|W_i^{s+1} - A_i\| \leq (1 + o(1))\delta_s, \]

where $W_i^{s+1} = \widetilde{W}_i^{s+1}/\|\widetilde{W}_i^{s+1}\|$.

Proof. Denote $w = \|\widetilde{W}_i^{s+1}\|$. Using the triangle inequality and the descent property, we have

\[
\begin{aligned}
\|\widetilde{W}_i^{s+1} - wA_i\| &= \|\widetilde{W}_i^{s+1} - A_i + (1 - w)A_i\| \\
&\leq \|\widetilde{W}_i^{s+1} - A_i\| + |1 - w|\,\|A_i\| \qquad (\|A_i\| = 1) \\
&\leq (1 - \tau)\|W_i^s - A_i\| + (1 - \tau)\|W_i^s - A_i\| + o(\delta_s) \\
&\leq 2(1 - \tau)\|W_i^s - A_i\| + o(\delta_s).
\end{aligned}
\]

At the third step, we use $|1 - w| \leq \|\widetilde{W}_i^{s+1} - A_i\| \leq (1 - \tau)\|W_i^s - A_i\| + o(\delta_s)$. This also implies $w \geq 1 - (1 - \tau + o(1))\delta_s$. Therefore,

\[ \|W_i^{s+1} - A_i\| \leq \frac{2(1 - \tau)}{w}\|W_i^s - A_i\| + o(\delta_s) \leq \frac{2(1 - \tau)}{1 - (1 - \tau + o(1))\delta_s}\|W_i^s - A_i\| + o(\delta_s). \]

This implies that when the condition $\frac{1+\delta_s}{2+\delta_s} \leq \tau < 1$ holds, we get:

\[ \|W_i^{s+1} - A_i\| \leq (1 + o(1))\delta_s. \]
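Lemma 5 can be checked numerically (ours; the update below is any vector satisfying the contraction hypothesis exactly, with $\tau$ at the boundary $(1+\delta_s)/(2+\delta_s)$): after normalization, the iterate never ends up farther from $A_i$ than the previous iterate was.

```python
import numpy as np

# Numeric check (ours) of Lemma 5: if ||Wtilde - A|| <= (1 - tau)*||W - A||
# with tau = (1 + d)/(2 + d), then the normalized Wtilde/||Wtilde|| stays
# within (roughly) d of the unit vector A.
rng = np.random.default_rng(0)
n = 50
worst = 0.0
for _ in range(2000):
    A = rng.normal(size=n); A /= np.linalg.norm(A)
    W = A + 0.05 * rng.normal(size=n)
    W /= np.linalg.norm(W)
    d = np.linalg.norm(W - A)                  # delta_s
    tau = (1 + d) / (2 + d)
    # an update satisfying the contraction assumption exactly
    Wt = A + (1 - tau) * (W - A)
    Wn = Wt / np.linalg.norm(Wt)               # normalized iterate
    worst = max(worst, np.linalg.norm(Wn - A) / d)
print(worst)
```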