A Sample Complexity Separation between Non-Convex and Convex Meta-Learning

Nikunj Saunshi 1   Yi Zhang 1   Mikhail Khodak 2   Sanjeev Arora 1 3

1 Princeton University, Princeton, New Jersey, USA. 2 Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. 3 Institute for Advanced Study, Princeton, New Jersey, USA. Correspondence to: Nikunj Saunshi.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

One popular trend in meta-learning is to learn from many training tasks a common initialization that a gradient-based method can use to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile (Nichol et al., 2018) being for convex models. This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black-box, specifically at properties of the optimization trajectory. We construct a simple meta-learning instance that captures the problem of one-dimensional subspace learning. For the convex formulation of linear regression on this instance, we show that the new task sample complexity of any initialization-based meta-learning algorithm is Ω(d), where d is the input dimension. In contrast, for the non-convex formulation of a two-layer linear network on the same instance, we show that both Reptile and multi-task representation learning can have new task sample complexity of O(1), demonstrating a separation from convex meta-learning. Crucially, analyses of the training dynamics of these methods reveal that they can meta-learn the correct subspace onto which the data should be projected.

1. Introduction

We consider the problem of meta-learning, or learning-to-learn (Thrun & Pratt, 1998), in which the goal is to use the data from numerous training tasks to reduce the sample complexity of an unseen but related test task. Although there is a long history of successful methods in meta-learning and the related areas of multi-task and lifelong learning (Evgeniou & Pontil, 2004; Ruvolo & Eaton, 2013), recent approaches have been developed with the diversity and scale of modern applications in mind. This has given rise to simple, model-agnostic methods that focus on learning a good initialization for some gradient-based method such as stochastic gradient descent (SGD), to be run on samples from a new task (Finn et al., 2017; Nichol et al., 2018). These methods have found widespread applications in a variety of areas such as computer vision (Nichol et al., 2018), reinforcement learning (Finn et al., 2017), and federated learning (McMahan et al., 2017).

Inspired by their popularity, several recent learning-theoretic analyses of meta-learning have followed suit, eschewing customization to specific hypothesis classes such as halfspaces (Maurer & Pontil, 2013; Balcan et al., 2015) and instead favoring the convex-case study of gradient-based algorithms that could potentially be applied to deep neural networks (Denevi et al., 2019; Khodak et al., 2019). This has yielded results showing that meta-learning an initialization by methods similar to Reptile (Nichol et al., 2018) for convex models leads to a reduction in the sample complexity of unseen tasks. These benefits are shown using natural notions of task-similarity, like the average distance between the risk minimizers of tasks drawn from an underlying meta-distribution. A good initialization in these models is one that is close to the population risk minimizers for tasks in this meta-distribution.

In this paper we argue that, even in some simple settings, such convex-case analyses are insufficient to understand the success of initialization-based meta-learning algorithms.
learn (Thrun & Pratt, 1998), in which the goal is to use For this purpose, we pose a simple instance for meta- the data from numerous training tasks to reduce the sample learning linear regressors that share a one-dimensional sub- space, for which we prove a sample complexity separa- 1Princeton University, Princeton, New Jersey, USA 2Carnegie tion between convex and non-convex methods. In the pro- 3 Mellon University, Pittsburgh, Pennsylvania, USA Institute for cess, we provide a first theoretical demonstration of how Advanced Study, Princeton, New Jersey, USA. Correspondence to: initialization-based meta-learning methods can learn good Nikunj Saunshi . representations, as observed empirically (Raghu et al., 2019). Proceedings of the 37 th International Conference on Machine Specifically, our contributions are the following: Learning, Online, PMLR 119, 2020. Copyright 2020 by the au- thor(s). A Sample Complexity Separation between Non-Convex and Convex Meta-Learning

• We show, in the convex formulation of linear regression on this instance, a new task sample complexity lower bound of Ω(d) for any initialization-based meta-learning algorithm. This suggests that no amount of meta-training data can yield an initialization that a common gradient-based within-task algorithm can use to solve a new task with fewer samples than if no meta-learning had been done; thus initialization-based meta-learning in the convex formulation fails to learn the underlying task-similarity. The lower bound also holds for more general meta-learning instances that have the property that the average prediction for an input across tasks is independent of the input, as is true in many standard meta-learning benchmarks.

• We show for the same instance that formulating the model as a two-layer linear network - an over-parameterization of the same hypothesis class - allows a Reptile-like procedure to use training tasks from this meta-learning instance and find an initialization for gradient descent that will have O(1) sample complexity on a new task. To the best of our knowledge, this is the first sample complexity analysis of initialization-based meta-learning algorithms in the non-convex setting. Additionally, this demonstrates that initialization-based methods can capture the strong property of representation learning.

• Central to our proofs is a trajectory-based analysis of the solutions found by specific procedures like Reptile or gradient descent on a representation learning objective. For the latter, we show that looking at the trajectory is crucial, as not all minimizers can learn the subspace structure.

• Finally, we revisit existing upper bounds for the convex case. We show that our lower bound does not contradict these upper bounds, since their task-similarity measure of average parameter distance is large in our case. We complement this observation by proving that the existing bounds are tight, in some sense, and that going beyond them will require additional structural assumptions.

Paper organization: We discuss related work in Section 2. Section 3 sets up notation for the rest of the paper, formalizes initialization-based meta-learning methods and defines the subspace meta-learning instance that we are interested in. The lower bound for linear regression is stated in Section 4, while the corresponding upper bounds for non-convex meta-learning with a two-layer linear network are provided in Section 5. While all proofs are provided in the appendix, we give a sketch of the proofs for the upper bounds in Section 6 to highlight the key steps in the trajectory-based analysis and discuss why such an analysis is important. A discussion about the tightness of existing convex-case upper bounds can be found in Section 7.

2. Related Work

There is a rich history of theoretical analysis of learning-to-learn (Baxter, 2000; Maurer, 2005; Maurer et al., 2016). Our focus is on a well-studied setting in which tasks such as halfspace learning share a common low-dimensional subspace, with the goal of obtaining sample complexity depending on this sparse structure rather than on the ambient dimension (Maurer, 2009; Maurer & Pontil, 2013; Balcan et al., 2015; Denevi et al., 2018; Bullins et al., 2019; Khodak et al., 2019). While these works derive specialized algorithms, we instead focus on learning an initialization for gradient-based methods such as SGD or a few steps of gradient descent (Finn et al., 2017; Nichol et al., 2018). Some of these methods have recently been studied in the convex setting (Denevi et al., 2019; Khodak et al., 2019; Zhou et al., 2019). Our results show that such convex-case analyses cannot hope to show adaptation to an underlying low-dimensional subspace leading to dimension-independent sample complexity bounds. On the other hand, we show that their guarantees using distance-from-initialization are almost tight for the meta-learning of convex Lipschitz functions.

To get around the limitations of convexity for the problem of meta-learning a shared subspace, we instead study non-convex models. While the optimization properties of gradient-based meta-learning algorithms have been recently studied in the non-convex setting (Fallah et al., 2019; Rajeswaran et al., 2019; Zhou et al., 2019), these results only provide stationary-point convergence guarantees and do not show a reduction in sample complexity, the primary goal of meta-learning. Our theory is more closely related to recent empirical work that tries to understand various inherently non-convex properties of learning-to-learn. Most notably, Arnold et al. (2019) hypothesize and show some experimental evidence that the success of gradient-based meta-learning requires non-convexity, a view theoretically supported by our work. Meanwhile, Raghu et al. (2019) demonstrate that the success of the popular MAML algorithm (Finn et al., 2017) is likely due to its ability to learn good data representations rather than to adapt quickly. Our upper bound, in fact, supports these empirical findings by showing that Reptile can indeed learn a good representation at the penultimate layer, so that learning just a final linear layer for a new task suffices to reduce its sample complexity.

Our results draw upon work that analyzes optimization trajectories and implicit regularization in deep linear neural networks (Saxe et al., 2014; Gunasekar et al., 2018; Saxe et al., 2019; Gidel et al., 2019). The analysis of solutions found by gradient flow in deep linear networks by Saxe et al. (2014) and Gidel et al. (2019) forms a core component of our analysis. In this vein, Lampinen & Ganguli (2019) recently studied the dynamics of deep linear networks in the context of transfer learning and showed that jointly learning linear representations using two tasks yields smaller error on each one than individual task learning. However, their guarantees are not for an unseen task drawn from a distribution, but only for two given tasks, and crucially not for gradient-based meta-learning methods.

3. Meta-Learning Setup

3.1. Notations

Let [N] denote the set {1, ..., N}. We use x for vectors, M for matrices, I_d for the d-dimensional identity matrix and 0_d for the all-zero vector in d dimensions.
The norm ‖·‖ is used to denote the ℓ₂ norm. For a function ℓ : X × Y → Z, we use ℓ(x, ·) : Y → Z to denote a function of the second argument when the first argument is set to x. For a finite set S, x ∼ S denotes sampling uniformly from S. We also need the ReLU function [x]₊ = x·1{x ≥ 0}. For a sequence {a_1, ..., a_T}, we use a_{i:j} for j ≥ i to denote the set {a_i, ..., a_j}.

3.2. Task distribution and excess risk

We are interested in regression tasks of the following form:

ℓ_ρ(θ) := E_{(x,y)∼ρ} [(f(x, θ) − y)²]    (1)

where we abuse notation and use ρ to denote a task as well as its associated data distribution. The input x is a vector in R^d and y is a real-valued scalar. The function f : R^d × Θ → R is a regressor of choice, e.g. a linear function or a deep neural network, that is parametrized by θ ∈ Θ. Often one only has access to samples S = {(x_i, y_i)}_{i=1}^n from the unknown distribution ρ, and the empirical risk is defined as

ℓ_S(θ) = E_{(x,y)∼S} [(f(x, θ) − y)²]    (2)

While various formalizations of meta-learning exist, we present one that is most convenient for this work. In our meta-learning setting, we assume that there is an underlying unknown distribution µ over tasks. Given access to T training tasks ρ_1, ..., ρ_T sampled from µ, the goal of a meta-learner Meta is to learn some underlying structure that relates the tasks in µ and to output a within-task algorithm Alg = Meta(ρ_{1:T}) that can be used to solve a new task sampled from µ. To solve a new task ρ ∼ µ using a training set S from ρ, the meta-learned algorithm Alg outputs parameters Alg(S) ∈ Θ. The average risk of an algorithm that uses n samples from a new task is

L_n(Alg, µ) = E_{ρ∼µ} E_{S∼ρⁿ} [ℓ_ρ(Alg(S))]

We define the excess risk of Alg as E_n(Alg, µ) = L_n(Alg, µ) − L*(µ), where L*(µ) = E_{ρ∼µ} inf_{θ∈Θ} ℓ_ρ(θ) is the minimum achievable risk by the class Θ with complete knowledge of the distribution µ.

3.3. Initialization-based meta-learning

We focus on a popular approach in meta-learning that uses training tasks to learn an initialization of the model parameters. This initialization is fed into a pre-specified gradient-based algorithm that updates the model parameters, starting from this initialization, using samples from a new task. We refer to these methods as initialization-based meta-learning methods; they are restricted to return within-task algorithms of the form Alg(·) = GD-Alg(·; θ_init), where GD-Alg runs some gradient-based algorithm starting from the initialization θ_init ∈ Θ on an objective function that depends on the input training set S. For example, we denote by GD(S; θ_init) the algorithm that runs gradient descent to convergence on the empirical risk ℓ_S starting from the initialization θ_init. The definitions of the various initialization-based meta-learning and within-task algorithms that we analyze are in Sections 4.1 and 5.1. In the subsequent sections, we will concretely define the distribution of tasks µ and the meta-learning algorithms we are interested in.

3.4. Meta-learning a subspace

For meta-learning to be meaningful, the tasks must share some common structure. Here we focus on a structure that assumes the existence of a low-dimensional representation of the data that suffices to solve all the tasks; specifically, a linear representation. To capture this idea, we construct a simple but instructive meta-learning instance. We are interested in tasks ρ_w for w ∈ R^d, where the data distribution is defined as follows:

(x, y) ∼ ρ_w :  x ∼ N(0, I_d),  y ∼ N(w^⊤x, σ²)    (3)

The target y for x is a linear function of x plus zero-mean Gaussian noise.¹ The meta-learning instance µ_{w*} is defined as the uniform distribution over the two tasks ρ_{w*} and ρ_{−w*} for a fixed but unknown vector w* ∈ R^d. Note that for every point x ∈ R^d, only the projection of x onto the direction of w* is necessary to solve the tasks in µ_{w*}. Thus the hope is that a meta-learning algorithm picks up on this structure and learns to project data onto this subspace for sample efficiency on a new task. The average risk and excess risk of an algorithm Alg on this instance can be written as

L_n(Alg, µ_{w*}) = E_{s∼{±1}} E_{S∼ρ_{sw*}ⁿ} [ℓ_{sw*}(Alg(S))]
E_n(Alg, µ_{w*}) = L_n(Alg, µ_{w*}) − L*(µ_{w*})    (4)

In the subsequent sections, we describe the convex setting of linear regression and the equally expressive non-convex setting of a two-layer linear network regressor. Our main result shows that while no meta-learning algorithm can learn a meaningful initialization for a gradient-based within-task algorithm in the convex setting, standard meta-learning algorithms like Reptile on a two-layer linear network can in fact learn to project the data onto the one-dimensional subspace and thus reduce the sample complexity of a new task from Ω(d) to O(1).

¹ We can extend all results to y = w^⊤x + ξ, where ξ is independent of x with mean 0 and variance σ².
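To make the setup concrete, the following is a minimal sketch (our own code, not part of the paper; all names and constants are illustrative) of the data-generating process in Equation 3 and the instance µ_{w*}:

```python
# Sketch (ours) of the instance mu_{w*}: a task is rho_{s w*} for a uniform
# sign s, and a sample from rho_w is x ~ N(0, I_d), y ~ N(w^T x, sigma^2).
import numpy as np

def sample_task(w_star, rng):
    """Draw a task from mu_{w*}: its optimal regressor is either w* or -w*."""
    return rng.choice([-1.0, 1.0]) * w_star

def sample_data(w, sigma, n, rng):
    """Draw n i.i.d. samples (x, y) from rho_w."""
    X = rng.standard_normal((n, w.shape[0]))    # x ~ N(0, I_d)
    y = X @ w + sigma * rng.standard_normal(n)  # y ~ N(w^T x, sigma^2)
    return X, y

rng = np.random.default_rng(0)
d, r = 50, 1.0
w_star = rng.standard_normal(d)
w_star *= r / np.linalg.norm(w_star)            # enforce ||w*|| = r
X, y = sample_data(sample_task(w_star, rng), sigma=r, n=10, rng=rng)
```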

4. Convex Meta-Learning Lower Bound

In this section, we use a regression function f that is linear in x to solve the meta-learning instance µ_{w*}. We have Θ = R^d, the parameters are θ = w ∈ R^d, and the regressor is f(x, w) := w^⊤x. Using the definition of the distribution in Equation 3, for s ∈ {±1} we get

ℓ_{sw*}(w) = E_{(x,y)∼ρ_{sw*}} [(w^⊤x − y)²] = ‖w − sw*‖² + σ²    (5)

Thus we have L*(µ_{w*}) = E_{s∼{±1}} inf_{w∈R^d} ℓ_{sw*}(w) = σ².

4.1. Within-task algorithms

As described in Section 3.3, we consider within-task algorithms that are based on gradient descent. A meta-learner is allowed to learn an initialization w_0 ∈ R^d that is used as a starting point to run a gradient-based algorithm on a new task. We will show lower bounds for the following algorithms.

• GD_step^{η,t_0}(S; w_0) - GD for t_0 steps: Runs gradient descent with learning rate η for t_0 steps on ℓ_S (defined in Equation 2). Starting from w_0, follow the dynamics below and return w_{t_0}:

w_{t+1} = w_t − η ∇_w ℓ_S(w_t)

• GD_reg^λ(S; w_0) - λ-regularized GD: Runs gradient descent with vanishingly small learning rate (gradient flow) to convergence on ℓ_{S,λ}, where

ℓ_{S,λ}(w) = E_{(x,y)∼S} [(w^⊤x − y)²] + (λ/2) ‖w‖²    (6)

Starting from w_0, follow the dynamics below and return w_∞:

dw_t/dt = −∇_w ℓ_{S,λ}(w_t)

In the next section we will provide lower bounds on the excess risk for all initialization-based meta-learning algorithms that return initializations for the above algorithms. Note that some of these algorithms have been used in prior work; most notably, GD_step^{η,t_0} is the base-learner used by MAML (Finn et al., 2017), so our convex-case lower bounds hold directly for any initialization it might learn.

4.2. Lower bounds

We use the definition of excess risk E_n from Equation 4 and formally define the sample complexity of a meta-learned within-task algorithm below.

Definition 4.1 (Sample complexity). The minimum number of samples needed from a new task for a within-task algorithm Alg to have excess risk smaller than ε is

n_ε(Alg, µ_{w*}) = min { n ∈ N : E_n(Alg, µ_{w*}) ≤ ε }    (7)

We will proceed to show a lower bound for all meta-learning algorithms that return an initialization to be used by the algorithms GD_reg^λ and GD_step^{η,t_0} described in the previous subsection. We assume that ‖w*‖ = σ = r to make the noise of the same order as the signal and for simplicity of presentation. The lower bounds in more generality can be found in Appendix B.1.

Theorem 4.1. Suppose ‖w*‖ = σ = r and ε ∈ (0, r²/2]. For every initialization w_0 ∈ R^d, the number of samples needed to have ε excess risk on a new task is

min_{λ>0} n_ε(GD_reg^λ(·; w_0), µ_{w*}) = Ω(dr²/ε)
min_{η>0, t_0∈N₊} n_ε(GD_step^{η,t_0}(·; w_0), µ_{w*}) = Ω(dr²/ε)

Remark 4.1. The lower bound is strong because of the following:
• The bound holds even if the meta-learner has seen infinitely many tasks sampled from µ and has access to the population loss for each task.
• Even regularization techniques like explicit ℓ₂-regularization or early stopping cannot benefit from a meta-learned initialization.
• The condition ε ≤ r²/2 is not restrictive, since a trivial learner that outputs 0_d for all tasks has error exactly r².
• The lower bound also holds for more general meta-learning instances. For any distribution over the optimal regressors that satisfies E_{ρ_w∼µ}[w] = 0_d, the lower bound will depend on the variance of the optimal regressors across tasks, i.e. r² = E_{ρ_w∼µ}[‖w‖²]. The zero-mean property mirrors many standard meta-learning benchmarks, where the average prediction across tasks for an input is independent of the input. This holds for sine regression (Finn et al., 2017) due to symmetry around zero, and for Omniglot (Lake et al., 2017) and Mini-ImageNet (Ravi & Larochelle, 2017) due to label-shuffling.

This demonstrates that the convex formulation does not do justice to the practical efficacy of such algorithms. We provide the proof of this result and even tighter lower bounds in the appendix. The proofs are based on finding closed-form expressions for the solutions found by GD_reg^λ and GD_step^{η,t_0} and showing that, in fact, no initialization has better excess risk than the trivial initialization of 0_d.
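As an illustration, the two within-task algorithms above admit direct implementations on the linear model f(x, w) = w^⊤x. The sketch below is ours (names and hyperparameters are illustrative); note that for λ > 0 the regularized objective is strongly convex, so the gradient-flow limit of GD_reg^λ is the unique ridge minimizer, independent of w_0, which is consistent with the second bullet of Remark 4.1.

```python
# Sketch (ours) of the within-task algorithms of Section 4.1.
import numpy as np

def gd_step(X, y, w0, lr, t0):
    """GD_step^{lr,t0}: t0 steps of gradient descent on the empirical risk
    l_S(w) = mean_i (w^T x_i - y_i)^2, starting from the meta-learned w0."""
    w = w0.copy()
    n = X.shape[0]
    for _ in range(t0):
        w -= lr * (2.0 / n) * X.T @ (X @ w - y)
    return w

def gd_reg(X, y, w0, lam):
    """GD_reg^lam: gradient flow to convergence on Equation 6. The argument
    w0 mirrors the definition, but for lam > 0 the limit is the unique ridge
    minimizer and does not depend on it."""
    n, d = X.shape
    return np.linalg.solve((2.0 / n) * X.T @ X + lam * np.eye(d),
                           (2.0 / n) * X.T @ y)

# e.g. adapting to a new task with n samples from a meta-learned w0:
# w_hat = gd_step(X, y, w0, lr=0.1, t0=100)
```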
5. Non-Convex Meta-Learning Upper Bound

We now use a two-layer linear network as the regressor f. The parameters in this case are θ = (A, w), with A ∈ R^{d×d} and w ∈ R^d, and the regressor is defined as f(x, (A, w)) := w^⊤Ax. As before,

ℓ_{sw*}((A, w)) = ‖A^⊤w − sw*‖² + σ²    (8)

Again it is easy to see that L*(µ_{w*}) = σ². We now describe the within-task algorithms of interest and the initialization-based meta-algorithms for which we show guarantees.

5.1. Within-task and meta-learning algorithms

We are interested in the following within-task algorithms.

GD_pop(ρ; (A_0, w_0)) - Population GD: Runs gradient descent with vanishingly small learning rate (gradient flow) to convergence on ℓ_ρ. Starting from (A_0, w_0), follow the dynamics below and return (A_∞, w_∞):

dA_t/dt = −∇_A ℓ_ρ((A_t, w_t));  dw_t/dt = −∇_w ℓ_ρ((A_t, w_t))

GD2_reg^λ(S; (A_0, w_0)) - Second-layer regularized GD: Runs gradient descent with vanishingly small learning rate (gradient flow) to convergence on ℓ_{S,λ}(·; A_0), where

ℓ_{S,λ}(w; A_0) = E_{(x,y)∼S} [(w^⊤A_0x − y)²] + (λ/2) ‖w‖²    (9)

Starting from w_0, follow the dynamics below, updating only w, and return (A_0, w_∞):

dw_t/dt = −∇_w ℓ_{S,λ}(w_t; A_0)

We will show guarantees for initializations learned by two meta-learning algorithms, Reptile and RepLearn. A meta-learner receives T training tasks {ρ_1, ..., ρ_T} sampled independently from µ_{w*}; each task is either ρ_{w*} or ρ_{−w*}. For simplicity of analysis, we assume that the learner has access to the population losses for these tasks, since we are mainly concerned with the new task sample complexity. While simplistic, showing guarantees even in this setting requires a non-trivial analysis. Note that the lower bound for linear regression holds even with access to the population loss functions of any number of training tasks. The first meta-learning algorithm of interest is the following.

Reptile(ρ_{1:T}, (A_0, w_0)) - Reptile: Starting from (A_0, w_0), the initialization maintained by the algorithm is sequentially updated as (A_{i+1}, w_{i+1}) = (1 − τ)(A_i, w_i) + τ GD_pop(ℓ_{ρ_{i+1}}, (A_i, w_i)) for some 0 < τ < 1. At the end of T tasks, return A_T.

On encountering a new task, Reptile slowly interpolates between the current initialization and the solution for the new task obtained by running gradient descent on it starting from the current initialization. As mentioned earlier, this method has enjoyed empirical success (McMahan et al., 2017). The second algorithm of interest is reminiscent of multi-task representation learning.

RepLearn(ρ_{1:T}, (A_0, w_{0,1:T})) - Representation learning: Starting from (A_0, w_{0,1:T}), run gradient flow on the objective L_rep(A, w_{1:T}) = (1/T) Σ_{i=1}^T ℓ_{ρ_i}(A, w_i) and return A_∞ at the end:

dA_t/dt = −∇_A L_rep(A_t, w_{t,1:T})
dw_{t,i}/dt = −∇_{w_{t,i}} L_rep(A_t, w_{t,1:T}),  i ∈ [T]

This is a standard objective for multi-task representation learning used in prior work, occasionally equipped with a regularization term for w_{1:T}. For our analysis we do not need an explicit regularizer, just like Saxe et al. (2014) and Gidel et al. (2019).
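The following rough simulation sketch (ours; the learning rate and step count are illustrative, not from the paper) shows the Reptile meta-update on population losses, approximating the gradient flow GD_pop by many small discrete gradient steps on Equation 8:

```python
# Sketch (ours) of Reptile meta-training on population losses.
import numpy as np

def pop_grad(A, w, target):
    """Gradients of ||A^T w - target||^2 with respect to A and w."""
    u = A.T @ w - target
    return 2.0 * np.outer(w, u), 2.0 * A @ u

def gd_pop(A, w, target, lr=1e-3, steps=20000):
    """Approximate GD_pop: the paper runs gradient flow to convergence; here
    we take many small discrete steps as a stand-in."""
    A, w = A.copy(), w.copy()
    for _ in range(steps):
        gA, gw = pop_grad(A, w, target)
        A -= lr * gA
        w -= lr * gw
    return A, w

def reptile(w_star, T, tau, kappa, rng):
    """Interpolate toward each task's solution: the update of Section 5.1."""
    d = w_star.shape[0]
    A, w = kappa * np.eye(d), np.zeros(d)
    for _ in range(T):
        s = rng.choice([-1.0, 1.0])
        A_bar, w_bar = gd_pop(A, w, s * w_star)
        A = (1 - tau) * A + tau * A_bar
        w = (1 - tau) * w + tau * w_bar
    return A  # the learned first-layer initialization A_T
```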

5.2. Upper bounds

Recall that E_n(GD2_reg^λ(·; (A, 0_d)), µ_{w*}) is the excess risk of GD2_reg^λ when it uses the initialization A. We will show that, with access to a feasible number of training tasks, both Reptile and RepLearn can learn an initialization with small E_n. We first prove the upper bound for Reptile under the assumption that ‖w*‖ = σ = r.

Theorem 5.1. Starting with (A_0, w_0) = (κI_d, 0_d), let A_T = Reptile(ρ_{1:T}, (A_0, w_0)) be the initialization learned from T tasks {ρ_1, ..., ρ_T} i.i.d. ∼ µ_{w*}. If T ≥ poly(d, r, 1/ε, log(1/δ), κ) and τ = O(T^{−1/3}), then with probability at least 1 − δ over the sampling of the T tasks,

min_{λ≥0} E_n(GD2_reg^λ(·; (A_T, 0_d)), µ_{w*}) ≤ ε + cr²/n

for a small constant c. Thus, with the same probability, we have

min_{λ≥0} n_ε(GD2_reg^λ(·; (A_T, 0_d)), µ_{w*}) = O(r²/ε)

Remark 5.1. Our guarantees are for the within-task algorithm GD2_reg^λ that only updates the second layer. This supports the empirical findings of Raghu et al. (2019) that initialization-based methods can learn good representations and just require learning a final linear layer for a new task. The analysis for the setting where both layers are updated for a new task is more non-trivial and is left for future work.
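Since GD2_reg^λ trains only w, and Equation 9 is strongly convex in w for λ > 0, its gradient-flow limit is a ridge solution over the transformed features A_0x. A small sketch (our code, not the paper's):

```python
# Sketch (ours) of the within-task algorithm GD2reg^lam used at test time.
import numpy as np

def gd2_reg(X, y, A0, lam):
    """Return (A0, w_inf): ridge regression on the transformed data A0 x."""
    Z = X @ A0.T                      # rows are z_i = A0 x_i
    n, d = Z.shape
    w = np.linalg.solve((2.0 / n) * Z.T @ Z + lam * np.eye(d),
                        (2.0 / n) * Z.T @ y)
    return A0, w

def predict(A0, w, X):
    return X @ A0.T @ w               # f(x, (A0, w)) = w^T A0 x
```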
The proof can be found in Appendix C.2. Thus we can show that a standard meta-learning method like Reptile can learn a useful initialization for a gradient-based within-task algorithm like GD2_reg^λ. The proof sketch in Section 6 will demonstrate that the Reptile update, surprisingly, amplifies the component along w* in the spectrum of the first layer A while keeping the components orthogonal to w* unchanged. Interestingly, even though both w* and −w* appear as tasks, the meta-initialization κ > 0 ensures that they do not cancel each other out in the first layer, unlike in the second layer. In contrast to the convex-case lower bound, we only need O(r²/ε) samples for a new task, thus showing a gap of d between convex and non-convex meta-learning in our setting. We now show a similar result for RepLearn under the assumption ‖w*‖ = σ = r.

Theorem 5.2. With (A_0, w_{0,1:T}) = (κI_d, 0_d, ..., 0_d), let A_T = RepLearn(ρ_{1:T}, (A_0, w_{0,1:T})) be the initialization learned using T tasks {ρ_1, ..., ρ_T} i.i.d. ∼ µ_{w*}. If T ≥ poly(d, r, 1/ε, log(1/δ), κ), then with probability at least 1 − δ over the sampling of the T tasks,

min_{λ≥0} E_n(GD2_reg^λ(·; (A_T, 0_d)), µ_{w*}) ≤ ε + cr²/n

for a small constant c. With the same probability, we have

min_{λ≥0} n_ε(GD2_reg^λ(·; (A_T, 0_d)), µ_{w*}) = O(r²/ε)

Yet again we show a new task sample complexity of O(r²/ε). We now sketch the proofs of the upper bounds to highlight the interesting parts and to show the need for a trajectory-based analysis.

6. Proof Sketch

We first present proof sketches for the guarantees provided for the Reptile algorithm in Theorem 5.1 and for RepLearn in Theorem 5.2. Following that, we present an argument for why a trajectory-based analysis is necessary, by looking more closely at the representation learning objective.

6.1. Reptile sketch

For simplicity assume ‖w*‖ = 1. Let the T training tasks be ρ_1, ..., ρ_T, where ρ_i = ρ_{s_i w*} for s_i sampled uniformly from {±1}. Recall the update: (A_{i+1}, w_{i+1}) = (1 − τ)(A_i, w_i) + τ GD_pop(ℓ_{ρ_{i+1}}, (A_i, w_i)). The proof involves showing the following key properties of the dynamics of GD_pop and the interpolation updates:

Step 1: Starting from A_0 = κI_d, w_0 = 0_d, the initialization maintained by the meta-learning algorithm always satisfies A_i = (a_i − κ) w*w*^⊤ + κI_d and w_i = b_i w*. Thus the Reptile updates ensure that A is only updated in the direction of w*w*^⊤ and w is only updated in the direction of w*. This is proved by induction, where the crucial step is to show that if at time i we start with A_i, w_i that satisfy the above condition, then interpolating towards the output of GD_pop maintains this condition. Step 2 below shows exactly this and, in fact, gives the exact dynamics of the sequence {a_i, b_i}.

Step 2: Initialized with A = (a − κ) w*w*^⊤ + κI_d and w = bw* for a > b ≥ 0, the solution found by GD_pop is (Ā, w̄) = GD_pop(ρ_{sw*}, (A, w)), where Ā = (ā − κ) w*w*^⊤ + κI_d and w̄ = b̄w* for ā = f(a, b, s) and b̄ = g(a, b, s), with

f(a, b, s) = √[ ( (a² − b²) + √(4 + (a² − b²)²) ) / 2 ]
g(a, b, s) = s √[ ( −(a² − b²) + √(4 + (a² − b²)²) ) / 2 ]

This, along with Step 1, gives us the dynamics of a_i, b_i:

a_{i+1} = a_i + τ (f(a_i, b_i, s_{i+1}) − a_i)
b_{i+1} = b_i + τ (g(a_i, b_i, s_{i+1}) − b_i)    (10)

This is the step where we use the analysis of the trajectory of gradient flow on two-layer linear networks, done first in Saxe et al. (2014) and later made robust in Gidel et al. (2019). While their focus was on the case where the two layers are initialized at exactly the same scale, we need to analyze the case where A and w are initialized differently; this was analyzed in the appendix of Saxe et al. (2014). In fact, as we will see in Step 3, having κ ≠ 0 when w_0 = 0_d is crucial for showing that A can learn the subspace. Refer to Figure 1 for more insights into the dynamics induced by f and g.

Step 3: We show a very important property of the dynamics of a_i, b_i described in Equation 10: a_i is an increasing sequence. Since s_{1:T} is a random sequence in {±1}^T, a_T and b_T are random variables. However, even though s_{i+1} has zero mean, s_{i+1} only affects the sign of b_i but not a_i, as is evident in Equation 10. In fact, we can show that if initialized with κ > 0, a_i always increases; the same is however not true for b_i. We show that for the meta-initialization a_0 = κ and b_0 = 0, with high probability, a_T = Ω̃( min{ 1/(2√τ), (τT)^{1/4} } ). Picking τ = O(T^{−1/3}), we get that a_T = Ω̃(T^{1/6}). Thus, for an appropriate choice of the interpolation parameter τ, a_T → ∞ as T → ∞. So we know that in the limit, A_T is essentially a rank-one matrix in the direction of w*w*^⊤. In the next step we show why such an A_T reduces sample complexity.
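The closed forms f and g make these meta-training dynamics easy to simulate. Below is a sketch (ours) of Equation 10 with the configuration reported in Figure 1 (T = 1000, τ = 0.3, (a_0, b_0) = (0.1, 0), ‖w*‖ = 1); in such runs a_i grows steadily while b_i fluctuates around 0.

```python
# Sketch (ours) simulating the dynamics of Equation 10.
import numpy as np

def f(a, b, s):
    delta = a**2 - b**2
    return np.sqrt((delta + np.sqrt(4.0 + delta**2)) / 2.0)

def g(a, b, s):
    # s does not affect f above; it only sets the sign of g (cf. Step 3).
    delta = a**2 - b**2
    return s * np.sqrt((-delta + np.sqrt(4.0 + delta**2)) / 2.0)

rng = np.random.default_rng(0)
a, b, tau = 0.1, 0.0, 0.3
for _ in range(1000):
    s = rng.choice([-1.0, 1.0])
    a, b = a + tau * (f(a, b, s) - a), b + tau * (g(a, b, s) - b)
print(a, b)  # a has increased throughout; b hovers around its mean of 0
```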

Step 4: To gain intuition for why the learned A_T reduces sample complexity, notice that the only information about an input x that is needed to make predictions for all tasks in µ_{w*} is its projection onto w*. Thus, if all data points were projected onto w*, we could just learn a 1-dimensional regressor on the projected data. After this projection, the task reduces to a 1-dimensional regression problem that has a sample complexity of O(1/ε). With A_T = (a_T − κ) w*w*^⊤ + κI_d, we are instead learning a regressor for a new task on the linearly transformed data A_T x. For large enough T, a_T is large enough that A_T almost acts like a projection onto the subspace of w*, thus leading to a reduction in sample complexity from Ω(d/ε) to O(1/ε).

6.2. RepLearn sketch

Recall that the representation learning algorithm runs gradient flow on L_rep starting from (κI_d, 0_d, ..., 0_d), where

L_rep(A, w_1, ..., w_T) = (1/T) Σ_{i=1}^T ℓ_{ρ_i}(A, w_i)
  = (1/T) Σ_{i=1}^T ‖A^⊤w_i − s_i w*‖² + σ²
  = (1/T) ‖A^⊤W − W*‖_F² + σ²

where W ∈ R^{d×T} has w_i as its i-th column and W* ∈ R^{d×T} has s_i w* as its i-th column. This objective is a special case of the deep linear regression objective studied in Saxe et al. (2014); Gidel et al. (2019), except with an unbalanced initialization for A and W. Using a very similar analysis technique, one can show that gradient flow on this objective converges to A_∞ = (a_∞ − κ) w*w*^⊤ + κI_d, where for sufficiently small κ, a_∞ = Ω(T^{1/4}). Just like in the previous section, the first layer has learned the subspace and reduces the sample complexity of a new task to O(1/ε).

6.3. Why trajectory is important

As is evident from the proof sketches above, we relied heavily on analyzing the specific trajectories of the different methods, whether for gradient descent on a specific objective function or for the interpolation updates in Reptile. A natural question is whether simpler analysis techniques that only look at properties of all minimizers of some objective function can lead to similar conclusions. We answer this question negatively for the representation learning objective. In particular, we construct a minimizer of the objective L_rep whose first layer does not learn any structure about the subspace and that has Ω(d/ε) new task sample complexity. This bad minimizer is very simple: A = I_d and w_i = s_i w* for all i ∈ [T]. While the existence of such a solution is not too surprising, it does illustrate that analyzing the dynamics of the specific algorithms used might be as important as the objective functions themselves.

7. Tightness of Existing Bounds

In providing a first non-convex sample complexity analysis of gradient-based meta-learning, our results have also exposed a fundamental limitation of convex methods: in the presence of a very natural subspace structure, they are unable to learn an initialization that exploits it to obtain good sample complexity. There is thus a tension between this result and recent upper bounds that use other intuitive assumptions on the task distribution to show reduced sample complexity for similar or identical methods (Denevi et al., 2019; Khodak et al., 2019; Zhou et al., 2019). Broadly, these results show that gradient-based meta-learning methods can adapt to a similarity measure that depends on the closeness of the minimizing parameters of the tasks. For convex models they obtain upper bounds on the excess risk of the form

E_n(Alg, µ) = O(GV/√n)    (11)

for a large enough number of training tasks T, where V² = min_{φ∈Θ} E_{ρ∼µ} ‖Proj_{Θ_ρ*}(φ) − φ‖² is the average variation of the optimal task parameters, with Θ_ρ* = argmin_{θ∈Θ} ℓ_ρ(θ), and G is the Lipschitz constant with respect to the Euclidean norm.

These results, however, do not contradict our convex-case lower bounds in Section 4, because our tasks are not similar in this sense: while the parameters lie on a subspace, the average variation of the optimal parameters remains large. However, while the distance-based task-similarity measure is natural and intuitive, we believe that a low-dimensional representation structure such as ours may be more explanatory for the success of gradient-based meta-learning algorithms. In fact, the importance of representation learning to the success of popular gradient-based methods has been shown by existing empirical results (Raghu et al., 2019).

Additionally, we argue that existing upper bounds may not be very meaningful in the context of current practical applications. The term GV in Equation 11 can be lower bounded using Jensen's inequality:

GV = G √( min_{φ∈Θ} E_{ρ∼µ} ‖Proj_{Θ_ρ*}(φ) − φ‖² )
   ≥ G min_{φ∈Θ} E_{ρ∼µ} ‖Proj_{Θ_ρ*}(φ) − φ‖
   ≥ min_{φ∈Θ} E_{ρ∼µ} [ℓ_ρ(φ) − ℓ_ρ*]

where min_{φ∈Θ} E_{ρ∼µ} ℓ_ρ(φ) is the minimum risk achievable by a single common parameter for all tasks from the class Θ and ℓ_ρ* = min_{θ∈Θ} ℓ_ρ(θ); the last inequality uses G-Lipschitzness, since ℓ_ρ(φ) − ℓ_ρ* ≤ ℓ_ρ(φ) − ℓ_ρ(Proj_{Θ_ρ*}(φ)) + 0 ≤ G‖Proj_{Θ_ρ*}(φ) − φ‖. In common meta-learning settings, the average risk E_{ρ∼µ} ℓ_ρ(φ) of any fixed parameter φ ∈ Θ is large, e.g. due to label-shuffling in tasks like Omniglot (Lake et al., 2017) and Mini-ImageNet (Ravi & Larochelle, 2017), or due to symmetry of the tasks around zero as in the sine wave task (Finn et al., 2017).

[Figure 1: two panels; the y-axis corresponds to a and the x-axis corresponds to b. Panel (a): One step update. Panel (b): Evolution for 1000 steps.]

Figure 1. The two figures correspond to a run with T = 1000 tasks, τ = 0.3, (a_0, b_0) = (0.1, 0) and ‖w*‖ = 1. The first figure shows what the updates from Equation 10 look like at step i when s_{i+1} = −1. It can be shown that ā_{i+1} = f(a_i, b_i, s_{i+1}) and b̄_{i+1} = g(a_i, b_i, s_{i+1}) always satisfy ā²_{i+1} − b̄²_{i+1} = a²_i − b²_i, thus the solution (ā_{i+1}, b̄_{i+1}) will be at the intersection of the curves xy = s_{i+1} = −1 and y² − x² = a²_i − b²_i, and (a_{i+1}, b_{i+1}) is the appropriate interpolation. The second figure shows the entire dynamics of (a_i, b_i) for the same setting. As is evident, a_i is always increasing while b_i fluctuates around its mean value of 0.

Given the above drawbacks, it is natural to ask whether this bound of GV can be improved in the convex settings that prior work considers. We answer this negatively. Below we adapt an information-theoretic argument from Agarwal et al. (2012) to show that such a dependence is unavoidable when analyzing a distance-based task-similarity notion for convex G-Lipschitz functions, and thus that existing results are almost tight:

Theorem 7.1. For any G, V > 0, there exists a domain Z, a parameter class Θ ⊆ R^d and a distribution µ over tasks such that every ρ ∼ µ is a distribution over Z and ℓ_ρ(θ) = E_{z∼ρ} ℓ_z(θ), where ℓ_z : Θ → R is convex and G-Lipschitz w.r.t. the Euclidean norm for every z ∈ Z. Additionally, Θ satisfies

min_{φ∈Θ} E_{ρ∼µ} ‖Proj_{Θ_ρ*}(φ) − φ‖ ≤ V

and

E_n(Alg, µ) = Ω( GV min{ 1/√n, 1/√d } )

for any algorithm Alg : Z^n → Θ that returns a parameter given a training set.

A consequence of this theorem is that, without additional assumptions beyond convexity, Lipschitzness and small average parameter variation, one cannot hope to improve upon existing bounds. This, coupled with the fact that the existing bounds can be large in practical settings, makes a case for more structural assumptions and a shift to non-convexity in analyses of meta-learning.

8. Conclusions and Future Work

In this work we look at the family of initialization-based meta-learning methods that has enjoyed empirical success. Using a simple meta-learning problem of linear predictors in a 1-dimensional subspace, we show a gap in the new task sample complexity between meta-learning using linear regression and meta-learning using two-layer linear networks. This is, to our knowledge, the first non-convex sample complexity analysis of initialization-based meta-learning, and it also captures the idea that such methods can learn good representations. There are many interesting future directions to be pursued.

• k-subspace learning: While the lower bound for the convex setting trivially holds if the task predictors come from a k-dimensional subspace for k > 1, showing that an algorithm like Reptile can have sample complexity of O(k) is an open problem. While this can be proved for the representation learning objective using a very similar analysis, showing it for Reptile, which only learns one second layer instead of T, unlike the representation learning objective, might require stronger tools. There is experimental evidence suggesting that such a statement might be true.

• Weaker distributional assumptions: While showing upper bounds was non-trivial under the current assumptions, one would hope to show guarantees under weaker and more realistic assumptions, such as a more general data distribution, different input distributions across tasks, and access to only finitely many samples from the training tasks.

• One common bottleneck for the above directions is a robust analysis of the dynamics of linear networks when the initializations are not appropriately aligned. While Gidel et al. (2019) provide a perturbation analysis for this, an ε-perturbation at the initialization scales as εe^{ct²} in the final solution, where t is the time for which gradient descent/flow is run. It would be nice to have an analysis with a more conservative error propagation, perhaps exploiting structured perturbations.

• Deep neural networks: While the analysis of linear networks can be a first step towards understanding non-convex meta-learning, it would be interesting to see whether the insights gained from this setting are useful in the more interesting setting of non-linear neural networks.

Acknowledgments

Sanjeev Arora, Nikunj Saunshi and Yi Zhang are supported by NSF, ONR, the Simons Foundation, the Schmidt Foundation, Amazon Research, DARPA and SRC. Mikhail Khodak is supported in part by NSF grants CCF-1535967, CCF-1910321, IIS-1618714, IIS-1705121, IIS-1838017, and IIS-1901403, DARPA FA875017C0141, Amazon Research, Amazon Web Services, Microsoft Research, Bloomberg Data Science, the Okawa Foundation, Google, JP Morgan AI Research, and the Carnegie Bosch Institute.

References

Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.

Arnold, S. M. R., Iqbal, S., and Sha, F. Decoupling adaptation from modeling with meta-optimizers for meta learning. arXiv, 2019.

Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Proceedings of the 28th Annual Conference on Learning Theory, 2015.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

Bullins, B., Hazan, E., Kalai, A., and Livni, R. Generalize across tasks: Efficient algorithms for linear representation learning. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, 2019.

Davis, C. and Kahan, W. M. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental learning-to-learn with statistical guarantees. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2018.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Evgeniou, T. and Pontil, M. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. arXiv, 2019.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Gidel, G., Bach, F., and Lacoste-Julien, S. Implicit regularization of discrete gradient dynamics in deep linear neural networks. In Advances in Neural Information Processing Systems, 2019.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Khodak, M., Balcan, M.-F., and Talwalkar, A. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, 2019.

Lake, B. M., Salakhutdinov, R., Gross, J., and Tenenbaum, J. B. One shot learning of simple visual concepts. In Proceedings of the Conference of the Cognitive Science Society (CogSci), 2017.

Lampinen, A. and Ganguli, S. An analytic theory of generalization dynamics and transfer learning in deep linear networks. In Proceedings of the 7th International Conference on Learning Representations, 2019.

Maurer, A. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994, 2005.

Maurer, A. Transfer bounds for linear feature learning. Machine Learning, 2009.

Maurer, A. and Pontil, M. Excess risk bounds for multitask learning with trace norm regularization. In Proceedings of the 26th Annual Conference on Learning Theory, 2013.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(1):2853–2884, 2016.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Aguera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv, 2018.

Raghu, A., Raghu, M., Bengio, S., and Vinyals, O. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv, 2019.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, 2019.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning, 2013.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.

Saxe, A. M., McClelland, J. L., and Ganguli, S. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546, 2019.

Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 1998.

Zhou, P., Yuan, X., Xu, H., Yan, S., and Feng, J. Efficient meta learning via minibatch proximal update. In Advances in Neural Information Processing Systems, 2019.