A Sample Complexity Separation Between Non-Convex and Convex Meta-Learning
Nikunj Saunshi (Princeton University), Yi Zhang (Princeton University), Mikhail Khodak (Carnegie Mellon University), Sanjeev Arora (Princeton University and Institute for Advanced Study). Correspondence to: Nikunj Saunshi <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

One popular trend in meta-learning is to learn from many training tasks a common initialization that a gradient-based method can use to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile (Nichol et al., 2018) being for convex models. This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black-box, specifically at properties of the optimization trajectory. We construct a simple meta-learning instance that captures the problem of one-dimensional subspace learning. For the convex formulation of linear regression on this instance, we show that the new task sample complexity of any initialization-based meta-learning algorithm is Ω(d), where d is the input dimension. In contrast, for the non-convex formulation of a two-layer linear network on the same instance, we show that both Reptile and multi-task representation learning can have new task sample complexity of O(1), demonstrating a separation from convex meta-learning. Crucially, analyses of the training dynamics of these methods reveal that they can meta-learn the correct subspace onto which the data should be projected.

1. Introduction

We consider the problem of meta-learning, or learning-to-learn (Thrun & Pratt, 1998), in which the goal is to use the data from numerous training tasks to reduce the sample complexity of an unseen but related test task. Although there is a long history of successful methods in meta-learning and the related areas of multi-task and lifelong learning (Evgeniou & Pontil, 2004; Ruvolo & Eaton, 2013), recent approaches have been developed with the diversity and scale of modern applications in mind. This has given rise to simple, model-agnostic methods that focus on learning a good initialization for some gradient-based method such as stochastic gradient descent (SGD), to be run on samples from a new task (Finn et al., 2017; Nichol et al., 2018). These methods have found widespread applications in a variety of areas such as computer vision (Nichol et al., 2018), reinforcement learning (Finn et al., 2017), and federated learning (McMahan et al., 2017).

Inspired by their popularity, several recent learning-theoretic analyses of meta-learning have followed suit, eschewing customization to specific hypothesis classes such as halfspaces (Maurer & Pontil, 2013; Balcan et al., 2015) and instead favoring the convex-case study of gradient-based algorithms that could potentially be applied to deep neural networks (Denevi et al., 2019; Khodak et al., 2019). This has yielded results showing that meta-learning an initialization with methods similar to Reptile (Nichol et al., 2018) for convex models leads to a reduction in the sample complexity of unseen tasks. These benefits are shown using natural notions of task-similarity, such as the average distance between the risk minimizers of tasks drawn from an underlying meta-distribution. A good initialization in these models is one that is close to the population risk minimizers for tasks in this meta-distribution.
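To make the initialization-learning loop concrete, the following is a minimal, illustrative sketch of a Reptile-style meta-update for linear regression. The function names, the plain-SGD inner loop, and the step sizes are our own illustrative assumptions and not the exact procedure analyzed in this paper.

import numpy as np

def sgd_adapt(init, task_samples, inner_lr=0.01):
    # Run SGD on one task's (x, y) samples, starting from the meta-learned initialization.
    w = init.copy()
    for x, y in task_samples:
        grad = 2.0 * (np.dot(w, x) - y) * x   # gradient of the squared loss (w.x - y)^2
        w = w - inner_lr * grad
    return w

def reptile_style_meta_train(sample_task, dim, meta_lr=0.1, meta_steps=1000, seed=0):
    # Learn an initialization by repeatedly adapting to a sampled training task and
    # nudging the initialization toward the adapted parameters (Reptile-style outer update).
    rng = np.random.default_rng(seed)
    init = np.zeros(dim)
    for _ in range(meta_steps):
        task_samples = sample_task(rng)           # one training task's few (x, y) samples
        adapted = sgd_adapt(init, task_samples)
        init = init + meta_lr * (adapted - init)  # move the initialization toward the adapted solution
    return init

At meta-test time, the learned initialization is used as the starting point for the same inner gradient-based routine on the few samples of a new task; the question studied in this paper is how many such samples are needed.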
In this paper we argue that, even in some simple settings, such convex-case analyses are insufficient to understand the success of initialization-based meta-learning algorithms. For this purpose, we pose a simple instance for meta-learning linear regressors that share a one-dimensional subspace, for which we prove a sample complexity separation between convex and non-convex methods. In the process, we provide a first theoretical demonstration of how initialization-based meta-learning methods can learn good representations, as observed empirically (Raghu et al., 2019). Specifically, our contributions are the following:

• We show, in the convex formulation of linear regression on this instance, a new task sample complexity lower bound of Ω(d) for any initialization-based meta-learning algorithm. This suggests that no amount of meta-training data can yield an initialization that common gradient-based within-task algorithms can use to solve a new task with fewer samples than if no meta-learning had been done; thus initialization-based meta-learning in the convex formulation fails to learn the underlying task-similarity. The lower bound also holds for more general meta-learning instances with the property that the average prediction for an input across tasks is independent of the input, as is true in many standard meta-learning benchmarks.

• We show for the same instance that formulating the model as a two-layer linear network – an over-parameterization of the same hypothesis class – allows a Reptile-like procedure to use training tasks from this meta-learning instance and find an initialization for gradient descent that will have O(1) sample complexity on a new task (an illustrative sketch of the two formulations is given after this list). To the best of our knowledge, this is the first sample complexity analysis of initialization-based meta-learning algorithms in the non-convex setting. Additionally, this demonstrates that initialization-based methods can capture the strong property of representation learning.

• Central to our proof is a trajectory-based analysis of the properties of the solutions found by specific procedures like Reptile or gradient descent on a representation learning objective. For the latter, we show that looking at the trajectory is crucial, as not all minimizers can learn the subspace structure.

• Finally, we revisit existing upper bounds for the convex case. We show that our lower bound does not contradict these upper bounds, since their task-similarity measure of average parameter distance is large in our case. We complement this observation by proving that the existing bounds are tight, in some sense, and that going beyond them will require additional structural assumptions.
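For intuition about the two formulations contrasted above, one can write them roughly as follows. This is a hedged illustration in our own notation (the shared unit direction u, the task-specific scale α_t, and the layer width k are assumptions introduced here); the precise meta-learning instance is defined in Section 3.

% Convex formulation: each task t is a linear regressor whose parameter lies on a
% shared one-dimensional subspace spanned by a fixed unit vector u in R^d.
f_{w}(x) = \langle w, x \rangle, \qquad w_t = \alpha_t \, u \quad \text{for task } t, \quad u \in \mathbb{R}^d.

% Non-convex formulation: the same hypothesis class of linear predictors, written as a
% two-layer linear network; the product w_2^\top W_1 plays the role of w above.
g_{W_1, w_2}(x) = w_2^\top W_1 x, \qquad W_1 \in \mathbb{R}^{k \times d}, \ w_2 \in \mathbb{R}^{k}.

Both parameterizations express exactly the same class of linear predictors; the separation comes from how gradient-based meta-learning behaves on the two parameterizations, not from any difference in expressive power.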
Paper organization: We discuss related work in Section 2. Section 3 sets up notation for the rest of the paper, formalizes initialization-based meta-learning methods, and defines the subspace meta-learning instance that we are interested in. The lower bound for linear regression is stated in Section 4, while the corresponding upper bounds for non-convex meta-learning with a two-layer linear network are provided in Section 5. While all proofs are provided in the appendix, we give a sketch of the proofs for the upper bounds in Section 6 to highlight the key steps in the analysis.

2. Related Work

There is a rich history of theoretical analysis of learning-to-learn (Baxter, 2000; Maurer, 2005; Maurer et al., 2016). Our focus is on a well-studied setting in which tasks such as halfspace learning share a common low-dimensional subspace, with the goal of obtaining sample complexity depending on this sparse structure rather than on the ambient dimension (Maurer, 2009; Maurer & Pontil, 2013; Balcan et al., 2015; Denevi et al., 2018; Bullins et al., 2019; Khodak et al., 2019). While these works derive specialized algorithms, we instead focus on learning an initialization for gradient-based methods such as SGD or a few steps of gradient descent (Finn et al., 2017; Nichol et al., 2018). Some of these methods have recently been studied in the convex setting (Denevi et al., 2019; Khodak et al., 2019; Zhou et al., 2019). Our results show that such convex-case analyses cannot hope to show adaptation to an underlying low-dimensional subspace, which would lead to dimension-independent sample complexity bounds. On the other hand, we show that their guarantees using distance-from-initialization are almost tight for the meta-learning of convex Lipschitz functions.

To get around the limitations of convexity for the problem of meta-learning a shared subspace, we instead study non-convex models. While the optimization properties of gradient-based meta-learning algorithms have recently been studied in the non-convex setting (Fallah et al., 2019; Rajeswaran et al., 2019; Zhou et al., 2019), these results only provide stationary-point convergence guarantees and do not show a reduction in sample complexity, the primary goal of meta-learning. Our theory is more closely related to recent empirical work that tries to understand various inherently non-convex properties of learning-to-learn. Most notably, Arnold et al. (2019) hypothesize and show some experimental evidence that the success of gradient-based meta-learning requires non-convexity, a view theoretically supported by our work. Meanwhile, Raghu et al. (2019) demonstrate that the success of the popular MAML algorithm (Finn et al., 2017) is likely due to its ability to learn good data-representations rather than to adapt quickly. Our upper bound, in fact, supports these empirical findings by showing that Reptile can indeed learn a good representation at the penultimate layer, and that learning just a final linear layer for a new task will reduce the sample complexity of that task.

Our results draw upon work, motivated by understanding deep learning, that analyzes trajectories and implicit regularization in deep linear neural networks (Saxe et al., 2014; Gunasekar et al., 2018; Saxe et al., 2019; Gidel et al., 2019). The analysis of solutions found by gradient flow in deep linear networks by Saxe et al. (2014) and Gidel et al. (2019) forms a core component of our analysis.