Generalization Error of Generalized Linear Models in High Dimensions
Melikasadat Emami^1, Mojtaba Sahraee-Ardakan^1, Parthe Pandit^1, Sundeep Rangan^2, Alyson K. Fletcher^{1,3}

^1 Department of Electrical and Computer Engineering, University of California, Los Angeles, Los Angeles, USA. ^2 Department of Electrical and Computer Engineering, New York University, Brooklyn, New York, USA. ^3 Department of Statistics, University of California, Los Angeles, Los Angeles, USA.

Abstract

At the heart of machine learning lies the question of generalizability of learned rules over previously unseen data. While over-parameterized models based on neural networks are now ubiquitous in machine learning applications, our understanding of their generalization capabilities is incomplete. This task is made harder by the non-convexity of the underlying learning problems. We provide a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; and (ii) choices of loss function, initialization, and regularizer during learning. Our model also captures mismatch between training and test distributions. As examples, we analyze a few special cases, namely linear regression and logistic regression. We are also able to rigorously and analytically explain the double descent phenomenon in generalized linear models.

1. Introduction

A fundamental goal of machine learning is generalization: the ability to draw inferences about unseen data from finite training examples. Methods to quantify the generalization error are therefore critical in assessing the performance of any machine learning approach.

This paper seeks to characterize the generalization error for a class of generalized linear models (GLMs) of the form

    y = \phi_{out}(\langle x, w^0 \rangle, d),    (1)

where x ∈ R^p is a vector of input features, y is a scalar output, w^0 ∈ R^p are weights to be learned, φ_out(·) is a known link function, and d is random noise. The notation ⟨x, w^0⟩ denotes an inner product. We use the superscript "0" to denote the "true" values in contrast to estimated or postulated quantities. The output may be continuous or discrete to model either regression or classification problems.

We measure the generalization error in a standard manner: we are given training data (x_i, y_i), i = 1, ..., N, from which we learn some parameter estimate ŵ via a regularized empirical risk minimization of the form

    \hat{w} = \operatorname*{argmin}_{w} \; F_{out}(y, Xw) + F_{in}(w),    (2)

where X = [x_1 \ x_2 \ \cdots \ x_N]^T is the data matrix, F_out is some output loss function, and F_in is some regularizer on the weights. We are then given a new test sample, x_ts, for which the true and predicted values are given by

    y_{ts} = \phi_{out}(\langle x_{ts}, w^0 \rangle, d_{ts}), \qquad \hat{y}_{ts} = \phi(\langle x_{ts}, \hat{w} \rangle),    (3)

where d_ts is the noise in the test sample, and φ(·) is a postulated inverse link function that may be different from the true function φ_out(·). The generalization error is then defined as the expectation of some loss between y_ts and ŷ_ts of the form

    \mathbb{E} \, f_{ts}(y_{ts}, \hat{y}_{ts}),    (4)

for some test loss function f_ts(·) such as squared error or prediction error.

Even for this relatively simple GLM model, the behavior of the generalization error is not fully understood.
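To make the setup in (1)-(4) concrete, the sketch below simulates one admissible special case: Gaussian covariates, a linear link φ_out(z, d) = z + d, squared output loss with a ridge penalty, the identity as the postulated inverse link, and squared-error test loss. It fits ŵ by the regularized ERM (2) in closed form and estimates the generalization error (4) by Monte Carlo on fresh test samples. These specific choices and constants are illustrative assumptions rather than requirements of the framework.

```python
# Minimal Monte Carlo sketch of (1)-(4): data generated from a GLM, weights fit
# by regularized ERM, generalization error estimated on fresh test samples.
# Assumed special case: Gaussian x, linear link phi_out(z, d) = z + d,
# squared loss F_out, ridge penalty F_in, squared-error test loss f_ts.
import numpy as np

rng = np.random.default_rng(0)
p, N, N_test = 200, 400, 10_000
sigma = 0.5          # noise standard deviation for d
lam = 1e-1           # ridge regularization strength (illustrative value)

w0 = rng.normal(size=p) / np.sqrt(p)            # "true" weights w^0
X = rng.normal(size=(N, p))                     # training covariates
y = X @ w0 + sigma * rng.normal(size=N)         # y = <x, w^0> + d

# Regularized ERM (2) with F_out = 0.5*||y - Xw||^2 and F_in = 0.5*lam*||w||^2,
# which has the closed-form ridge solution.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Generalization error (4) with f_ts = squared error, estimated by Monte Carlo.
X_ts = rng.normal(size=(N_test, p))
y_ts = X_ts @ w0 + sigma * rng.normal(size=N_test)
y_hat = X_ts @ w_hat                            # postulated inverse link = identity
gen_err = np.mean((y_ts - y_hat) ** 2)
print(f"estimated generalization error: {gen_err:.4f}")
```

The same harness carries over to classification by swapping the link, the training loss, and f_ts (for instance, a logistic link with misclassification rate as the test loss), which is the kind of special case analyzed later in the paper.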
Recent works (Montanari et al., 2019; Deng et al., 2019; Mei & Montanari, 2019; Salehi et al., 2019) have characterized the generalization error of various linear models for classification and regression in certain large random problem instances. Specifically, the number of samples N and the number of features p both grow without bound with their ratio satisfying p/N → β ∈ (0, ∞), and the samples in the training data x_i are drawn randomly. In this limit, the generalization error can be exactly computed. The analysis can explain the so-called double descent phenomenon (Belkin et al., 2019a): in highly under-regularized settings, the test error may initially increase with the number of data samples N before decreasing. See the prior work section below for more details.

Although this paper and several other recent works consider only simple linear models and GLMs, much of the motivation is to understand generalization in deep neural networks where classical intuition may not hold (Belkin et al., 2018; Zhang et al., 2016; Neyshabur et al., 2018). In particular, a number of recent papers have shown the connection between neural networks in the over-parametrized regime and kernel methods. The works (Daniely, 2017; Daniely et al., 2016) showed that gradient descent on over-parametrized neural networks learns a function in the RKHS corresponding to the random feature kernel. Training dynamics of over-parametrized neural networks have been studied by (Jacot et al., 2018; Du et al., 2018; Arora et al., 2019; Allen-Zhu et al., 2019), and it is shown that the function learned is in an RKHS corresponding to the neural tangent kernel.

Prior Work. Many recent works characterize the generalization error of various machine learning models, including special cases of the GLM model considered here. For example, the precise characterization of the asymptotics of the prediction error for least squares regression has been provided in (Belkin et al., 2019b; Hastie et al., 2019; Muthukumar et al., 2019). The former confirmed the double descent curve of (Belkin et al., 2019a) under a Fourier series model and a noisy Gaussian model for data in the over-parameterized regime. The latter also obtained this scenario under both linear and non-linear feature models for ridge regression and min-norm least squares using random matrix theory. Also, (Advani & Saxe, 2017) studied the same setting for deep linear and shallow non-linear networks.

The analysis of generalization for max-margin linear classifiers in the high-dimensional regime has been done in (Montanari et al., 2019). The exact expression for the asymptotic prediction error is derived and, in a specific case of a two-layer neural network with random first-layer weights, the double descent curve is obtained. A similar double descent curve for logistic regression as well as linear discriminant analysis has been reported by (Deng et al., 2019). Random feature learning in the same setting has also been studied for ridge regression in (Mei & Montanari, 2019). The authors have, in particular, shown that highly over-parametrized estimators with zero training error are statistically optimal at high signal-to-noise ratio (SNR). Regularized logistic regression has likewise been analyzed using the Convex Gaussian Min-max Theorem in the under-parametrized regime (Salehi et al., 2019). The results in the current paper can consider all these models as special cases. Bounds on the generalization error of over-parametrized linear models are also given in (Bartlett et al., 2019; Neyshabur et al., 2018).

Summary of Contributions. Our main result (Theorem 1) provides a procedure for exactly computing the asymptotic value of the generalization error (4) for GLM models in a certain random high-dimensional regime called the Large System Limit (LSL). The procedure enables the generalization error to be related to key problem parameters including the sampling ratio β = p/N, the regularizer, the output function, and the distributions of the true weights and noise. Importantly, our result holds under very general settings including: (i) arbitrary test metrics f_ts; (ii) arbitrary training loss functions F_out as well as decomposable regularizers F_in; (iii) arbitrary link functions φ_out; (iv) correlated covariates x; (v) underparameterized (β < 1) and overparameterized (β > 1) regimes; and (vi) distributional mismatch in training and test data. Section 4 discusses in detail the general assumptions on the quantities f_ts, F_out, F_in, and φ_out under which Theorem 1 holds.
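As a quick numerical illustration of the regime described above, where p/N → β and double descent can appear, the sketch below fixes p and sweeps the number of training samples N for min-norm (unregularized) least squares with i.i.d. Gaussian covariates; the test error first rises, peaks near the interpolation threshold N ≈ p (β = 1), and then falls. This is a simulation under these specific distributional assumptions only; Theorem 1 of the paper computes such curves analytically rather than by simulation.

```python
# Simulation sketch of sample-wise double descent for min-norm least squares.
# Assumptions (illustrative only): i.i.d. Gaussian covariates, linear link with
# additive Gaussian noise, no regularization, squared-error test loss.
import numpy as np

rng = np.random.default_rng(1)
p, sigma, n_trials, N_test = 100, 0.5, 20, 2000
w0 = rng.normal(size=p) / np.sqrt(p)

def test_mse(N):
    err = 0.0
    for _ in range(n_trials):
        X = rng.normal(size=(N, p))
        y = X @ w0 + sigma * rng.normal(size=N)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # min-norm least squares
        X_ts = rng.normal(size=(N_test, p))
        y_ts = X_ts @ w0 + sigma * rng.normal(size=N_test)
        err += np.mean((y_ts - X_ts @ w_hat) ** 2)
    return err / n_trials

for N in [25, 50, 75, 90, 100, 110, 150, 200, 400]:
    print(f"beta = p/N = {p/N:4.2f}   test MSE = {test_mse(N):8.3f}")
```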
Approximate Message Passing. Our key tool to study the generalization error is approximate message passing (AMP), a class of inference algorithms originally developed in (Donoho et al., 2009; 2010; Bayati & Montanari, 2011) for compressed sensing. We show that the learning problem for the GLM can be formulated as an inference problem on a certain multi-layer network. Multi-layer AMP methods (He et al., 2017; Manoel et al., 2018; Fletcher et al., 2018; Pandit et al., 2019) can then be applied to perform the inference. The specific algorithm we use in this work is the multi-layer vector AMP (ML-VAMP) algorithm of (Fletcher et al., 2018; Pandit et al., 2019), which itself builds on several works (Opper & Winther, 2005; Fletcher et al., 2016; Rangan et al., 2019; Cakmak et al., 2014; Ma & Ping, 2017). The ML-VAMP algorithm is not necessarily the most computationally efficient procedure for the minimization (2). For our purposes, the key property is that ML-VAMP enables exact predictions of its performance in the large system limit. Specifically, the error of the algorithm's estimates in each iteration can be predicted by a set of deterministic recursive equations called the state evolution or SE. The fixed points of these equations provide a way of computing the asymptotic performance of the algorithm. In certain cases, the algorithm can be proven to be Bayes optimal (Reeves, 2017; Gabrié et al., 2018; Barbier et al., 2019).

This approach of using AMP methods to characterize the generalization error of GLMs was also explored in (Barbier et al., 2019) for i.i.d. distributions on the data. The explicit formulae for the asymptotic mean squared error for the
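The flavor of a state evolution can be conveyed by a toy example. The scalar recursion below is the textbook SE for standard AMP applied to a linear model with an i.i.d. Gaussian prior and an i.i.d. Gaussian data matrix; it is deliberately much simpler than the ML-VAMP state evolution used in this paper, which tracks several error parameters across the layers of the multi-layer network. Still, it shows the key property referred to above: a deterministic recursion predicts the per-iteration error, and its fixed point gives the asymptotic performance.

```python
# Toy illustration of a state evolution (SE): a deterministic scalar recursion
# that predicts the per-iteration error of an AMP-style algorithm.
# Assumed setting (for illustration only): standard AMP on y = X w + noise with
# an N x p data matrix of i.i.d. N(0, 1/N) entries, i.i.d. N(0, 1) prior on the
# entries of w, and MMSE denoising. This is NOT the ML-VAMP SE of this paper.
sigma2 = 0.25   # noise variance
beta = 2.0      # sampling ratio p/N (an over-parameterized example)
tau2 = 10.0     # initial effective error variance

for t in range(25):
    # The MMSE of estimating W0 ~ N(0,1) from W0 + tau*Z is tau2 / (1 + tau2);
    # the SE maps the current effective noise variance to the next one.
    tau2 = sigma2 + beta * tau2 / (1.0 + tau2)
    print(f"iteration {t:2d}: predicted error variance tau^2 = {tau2:.4f}")

# The fixed point of this recursion is the algorithm's asymptotic error, which
# is the quantity from which the generalization error is ultimately computed.
```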