Generalization Error of Generalized Linear Models in High Dimensions

Melikasadat Emami 1 Mojtaba Sahraee-Ardakan 1 Parthe Pandit 1 Sundeep Rangan 2 Alyson K. Fletcher 1 3

Abstract a class of generalized linear models (GLMs) of the form

0 At the heart of lies the question y = φout( x, w , d), (1) of generalizability of learned rules over previously where x ∈ Rp is a vector of input features, y is a scalar out- unseen data. While over-parameterized models 0 p put, w ∈ R are weights to be learned, φout(·) is a known based on neural networks are now ubiquitous in 0 machine learning applications, our understanding link function, and d is random noise. The notation x, w of their generalization capabilities is incomplete. denotes an inner product. We use the superscript “0” to This task is made harder by the non-convexity of denote the “true” values in contrast to estimated or postu- the underlying learning problems. We provide lated quantities. The output may be continuous or discrete a general framework to characterize the asymp- to model either regression or classification problems. totic generalization error for single-layer neural We measure the generalization error in a standard man- networks (i.e., generalized linear models) with ner: we are given training data (xi, yi), i = 1,...,N from arbitrary non-linearities, making it applicable to which we learn some parameter estimate wb via a regularized regression as well as classification problems. This empirical risk minimization of the form framework enables analyzing the effect of (i) over- parameterization and non-linearity during mod- w = argmin Fout(y, Xw) + Fin(w), (2) b w eling; and (ii) choices of , initializa- T tion, and regularizer during learning. Our model where X = [x1 x2 ... xN ] , is the data matrix, Fout is also captures mismatch between training and test some output loss function, and Fin is some regularizer on distributions. As examples, we analyze a few spe- the weights. We are then given a new test sample, xts, for cial cases, namely linear regression and logistic which the true and predicted values are given by regression. We are also able to rigorously and an- 0 alytically explain the double descent phenomenon yts = φout( xts, w , dts), ybts = φ(hxts, wbi), (3) in generalized linear models. where dts is the noise in the test sample, and φ(·) is a pos- tulated inverse link function that may be different from the true function φout(·). The generalization error is then de- 1. Introduction fined as the expectation of some expected loss between yts and y of the form A fundamental goal of machine learning is generalization: bts the ability to draw inferences about unseen data from finite E fts(yts, ybts), (4) arXiv:2005.00180v1 [cs.LG] 1 May 2020 training examples. Methods to quantify the generalization error are therefore critical in assessing the performance of for some test loss function fts(·) such as squared error or any machine learning approach. prediction error. This paper seeks to characterize the generalization error for Even for this relatively simple GLM model, the behavior of the generalization error is not fully understood. Recent 1 Department of Electrical and Computer Engineering, Univer- works (Montanari et al., 2019; Deng et al., 2019; Mei & sity of California, Los Angeles, Los Angeles, USA 2Department of Electrical and Computer Engineering, New York University, Montanari, 2019; Salehi et al., 2019) have characterized Brooklyn, New York, USA 3Department of Statistics, University the generalization error of various linear models for clas- of California, Los Angeles, Los Angeles, USA. Correspondence sification and regression in certain large random problem to: Melikasadat Emami , Mojtaba Sahraee- instances. 
Specifically, the number of samples N and num- Ardakan , Parthe Pandit , Sundeep Rangan , Alyson K. Fletcher . satisfying p/N → β ∈ (0, ∞), and the samples in the training data xi are drawn randomly. In this limit, the gen- eralization error can be exactly computed. The analysis can Generalization Error of GLMs in High Dimensions explain the so-called double descent phenomena (Belkin using the Convex Gaussian Min-max Theorem in the under- et al., 2019a): in highly under-regularized settings, the test parametrized regime. The results in the current paper can error may initially increase with the number of data samples consider all these models as special cases. Bounds on the N before decreasing. See the prior work section below for generalization error of over-parametrized linear models are more details. also given in (Bartlett et al., 2019; Neyshabur et al., 2018). Although this paper and several other recent works consider Summary of Contributions. Our main result (Theo- only simple linear models and GLMs, much of the motiva- rem1) provides a procedure for exactly computing the tion is to understand generalization in deep neural networks asymptotic value of the generalization error (4) for GLM where classical intuition may not hold (Belkin et al., 2018; models in a certain random high-dimensional regime called Zhang et al., 2016; Neyshabur et al., 2018). In particular, the Large System Limit (LSL). The procedure enables the a number of recent papers have shown the connection be- generalization error to be related to key problem parameters tween neural networks in the over-parametrized regime and including the sampling ratio β = p/N, the regularizer, the kernel methods. The works (Daniely, 2017; Daniely et al., output function, and the distributions of the true weights 2016) showed that on over-parametrized and noise. Importantly, our result holds under very general neural networks learns a function in the RKHS correspond- settings including: (i) arbitrary test metrics fts; (ii) arbi- ing to the random feature kernel. Training dynamics of over- trary training loss functions Fout as well as decomposable parametrized neural networks has been studied by (Jacot regularizers Fin; (iii) arbitrary link functions φout; (iv) cor- et al., 2018; Du et al., 2018; Arora et al., 2019; Allen-Zhu related covariates x; (v) underparameterized (β < 1) and et al., 2019), and it is shown that the function learned is in overparameterized regimes (β > 1); and (vi) distributional an RKHS corresponding to the neural tangent kernel. mismatch in training and test data. Section4 discusses in detail the general assumptions on the quantities f , F , ts out Approximate Message Passing. Our key tool to study F , and φ under which Theorem1 holds. in out the generalization error is approximate message passing (AMP), a class of inference algorithms originally developed Prior Work. Many recent works characterize generaliza- in (Donoho et al., 2009; 2010; Bayati & Montanari, 2011) tion error of various machine learning models, including for compressed sensing. We show that the learning problem special cases of the GLM model considered here. For exam- for the GLM can be formulated as an inference problem ple, the precise characterization for asymptotics of predic- on a certain multi-layer network. 
Multi-layer AMP meth- tion error for least squares regression has been provided in ods (He et al., 2017; Manoel et al., 2018; Fletcher et al., (Belkin et al., 2019b; Hastie et al., 2019; Muthukumar et al., 2018; Pandit et al., 2019) can then be applied to perform 2019). The former confirmed the double descent curve of the inference. The specific algorithm we use in this work (Belkin et al., 2019a) under a Fourier series model and a is the multi-layer vector AMP (ML-VAMP) algorithm of noisy Gaussian model for data in the over-parameterized (Fletcher et al., 2018; Pandit et al., 2019) which itself builds regime. The latter also obtained this scenario under both on several works (Opper & Winther, 2005; Fletcher et al., linear and non-linear feature models for ridge regression 2016; Rangan et al., 2019; Cakmak et al., 2014; Ma & Ping, and min-norm least squares using random matrix theory. 2017). The ML-VAMP algorithm is not necessarily the most Also, (Advani & Saxe, 2017) studied the same setting for computationally efficient procedure for the minimization deep linear and shallow non-linear networks. (2). For our purposes, the key property is that ML-VAMP enables exact predictions of its performance in the large sys- The analysis of the the generalization for max-margin linear tem limit. Specifically, the error of the algorithm estimates classifiers in the high dimensional regime has been done in in each iteration can be predicted by a set of deterministic re- (Montanari et al., 2019). The exact expression for asymp- cursive equations called the state evolution or SE. The fixed totic prediction error is derived and in a specific case for points of these equations provide a way of computing the two-layer neural network with random first-layer weights, asymptotic performance of the algorithm. In certain cases, the double descent curve was obtained. A similar dou- the algorithm can be proven to be Bayes optimal (Reeves, ble descent curve for logistic regression as well as linear 2017; Gabrie´ et al., 2018; Barbier et al., 2019). discriminant analysis has been reported by (Deng et al., 2019). Random feature learning in the same setting has This approach of using AMP methods to characterize the also been studied for ridge regression in (Mei & Monta- generalization error of GLMs was also explored in (Barbier nari, 2019). The authors have, in particular, shown that et al., 2019) for i.i.d. distributions on the data. The explicit highly over-parametrized estimators with zero training error formulae for the asymptotic mean squared error for the are statistically optimal at high signal-to-noise ratio (SNR). regularized linear regression with rotationally invarient data The asymptotic performance of regularized logistic regres- matrices is proved in (Gerbelot et al., 2020). The ML-VAMP sion in high dimensions is studied in (Salehi et al., 2019) method in this work enables extensions to correlated features Generalization Error of GLMs in High Dimensions and to mismatch between training and test distributions. convergence can also be satisfied by correlated sequences and deterministic sequences. 2. Generalization Error: System Model Training data input: For each N, we assume that the train- ing input data samples, x ∈ p, i = 1,...,N, are i.i.d. We consider the problem of estimating the weights w in i R and drawn from a p-dimensional Gaussian distribution with the GLM model (1). As stated in the Introduction, we zero mean and covariance Σ ∈ p×p. 
The covariance can suppose we have training data {(x , y )}N arranged as tr R i i i=1 capture the effect of features being correlated. We assume X := [x x ... x ]T ∈ N×p, y := [y y . . . y ]T ∈ 1 2 N R 1 2 N the covariance matrix has an eigenvalue decomposition, RN . Then we can write Σ = 1 VTdiag(s2 )V , (9) 0 tr p 0 tr 0 y = φout(Xw , d), (5) 2 p×p where str are the eigenvalues of Σtr and V0 ∈ R is the where φout(z, d) is the vector-valued function such that orthogonal matrix of eigenvectors. The scaling 1 ensures N p [φout(z, d)]n = φout(zn, dn) and {dn}n=1 are general 2 that the total variance of the samples, Ekxik , does not noise. grow with N. We will place a certain random model on str Given the training data (X, y), we consider estimates of and V0 momentarily. w0 given by a regularized empirical risk minimization of Using the covariance (9), we can write the data matrix as the form (2). We assume that the loss function Fout and regularizer Fin are separable functions, i.e., one can write X = U diag(str)V0, (10)

N p where U ∈ N×p has entries drawn i.i.d. from N (0, 1 ). X X R p Fout(y, z) = fout(yn, zn),Fin(w) = fin(wj), For the purpose of analysis, it is useful to express the matrix n=1 j=1 U in terms of its SVD: (6)   2 diag(smp) 0 for some functions fout : R → R and fin : R → R. Many U = V2SmpV1, Smp := (11) standard optimization problems in machine learning can 0 ∗ be written in this form: logistic regression, support vector N×N p×p where V1 ∈ R and V2 ∈ R are orthogonal and machines, linear regression, Poisson regression. N×p min{N,p} Smp ∈ R with non-zero entries smp ∈ R only Large System Limit: We follow the LSL analysis of (Bay- along the principal diagonal. smp are the singular values ati & Montanari, 2011) commonly used for analyzing AMP- of U. A standard result of random matrix theory is that, 1 based methods. Specifically, we consider a sequence of since U is i.i.d. Gaussian with entries N (0, p ), the matrices problems indexed by the number of training samples N. For V1 and V2 are Haar-distributed on the group of orthogonal each N, we suppose that the number of features p = p(N) matrices and smp is such that grows linearly with N, i.e., PL(2) lim {smp,i} = Smp, (12) p(N) N→∞ lim → β (7) N→∞ N where Smp ≥ 0 is a non-negative random variable such that 2 Smp satisfies the Marcencko-Pastur distribution. Details on for some constant β ∈ (0, ∞). Note that β > 1 corresponds this distribution are in AppendixH. to the over-parameterized regime and β < 1 corresponds to the under-parameterized regime. Training data output: Given the input data X, we assume that the training outputs y are generated from (5), where the 0 True parameter: We assume the true weight vector w noise d is independent of X and has an empirical distribu- has components whose empirical distribution converges as tion which converges as

0 PL(2) 0 PL(2) lim {wn} = W , (8) lim {di} = D. (13) N→∞ N→∞

0 N for some limiting random variable W . The precise defi- Again, the limit (13) will be satisfied if {di}i=1 are i.i.d. nition of empirical convergence is given in AppendixA. It draws of random variable D with bounded second moments. means that the empirical distribution 1 Pp δ converges, p i=1 wi Test data: To measure the generalization error, we assume in the Wasserstein-2 metric (see Chap. 6 (Villani, 2008)), now that we are given a test point xts, and we obtain the to the distribution of the finite-variance random variable true output yts and predicted output yts given by (3). We W 0. Importantly, the limit (8) will hold if the components b 0 p 0 assume that the test data inputs are also Gaussian, i.e., {wi }i=1 are drawn i.i.d. from the distribution of W with 0 2 T T E(W ) < ∞. However, as discussed in AppendixA, the xts = u diag(sts)V0, (14) Generalization Error of GLMs in High Dimensions

p 1 where u ∈ R has i.i.d. Gaussian components, N (0, p ), et al., 2018; Pandit et al., 2019). This algorithm is not nec- and sts and V0 are the eigenvalues and eigenvectors of the essarily the most computationally efficient method. For our test data covariance matrix. That is, the test data sample has purposes, however, the algorithm serves as a constructive a covariance matrix proof technique, i.e., it enables exact predictions for gen- eralization error in the LSL as described above. Moreover, 1 T 2 Σts = p V0 diag(sts)V0. (15) in the case when loss function (2) is strictly convex, the problem has a unique global minimum, whereby the gen- In comparison to (9), we see that we are assuming that the eralization error of this minimum is agnostic to the choice eigenvectors of the training and test data are the same, but of algorithm used to find this minimum. To that end, we the eigenvalues may be different. In this way, we can capture start by reformulating (2) in a form that is amicable to the distributional mismatch between the training and test data. application of ML-VAMP, Algorithm1. For example, we will be able to measure the generalization error when the test sample is outside a subspace explored Multi-Layer Representation. The first step in applying by the training data. ML-VAMP to the GLM learning problem is to represent 0 To capture the relation between the training and test distri- the mapping from the true parameters w to the output y as a certain multi-layer network. We combine (5), (10) and butions, we assume that components of str and sts converge 0 as (11), so that the mapping w 7→ y can be written as the PL(2) following sequence of operations (as illustrated in Fig.1): lim {(str,i, sts,i)} = (Str,Sts), (16) N→∞ 0 0 0 0 z0 := w , p0 := V0z0, to some non-negative, bounded random vector (S ,S ). 0 0 0 0 tr ts z1 := φ1(p0, ξ1), p1 := V1z1, The joint distribution on (S ,S ) captures the relation (18) tr ts z0 := φ (p0, ξ ), p0 := V z0, between the training and test data. 2 2 1 2 2 2 2 0 0 z3 := φ3(p2, ξ3) = y, When Str = Sts, our model corresponds to the case when the training and test distribution are matched. Isotropic where ξ` are the following vectors: Gaussian features in both training and test data correspond 1 2 1 2 ξ1 := str, ξ2 := smp, ξ3 := d, (19) to covariance matrices Σtr = p σtrI, Σts = p σtsI, which can be modeled as Str = σtr, Sts = σts. We also require and the functions φ`(·) are given by that the matrix V0 is uniformly distributed on the set of φ1(p0, str) := diag(str)p0, p × p orthogonal matrices. φ2(p1, smp) := Smpp1, (20) Generalization error: From the training data, we obtain an φ3(p2, d) := φout(p2, d). estimate wb via a regularized empirical risk minimization (2). We see from Fig.1 that the mapping of true parameters Given a test sample xts and parameter estimate wb, the true w0 = z0 to the observed response vector y = z0 is de- output yts and predicted output ybtr are given by equation 0 3 (3). We assume the test noise is distributed as dts ∼ D, scribed by a multi-layer network of alternating orthogonal following the same distribution as the training data. The operators V` and non-linear functions φ`(·). Let L = 3 postulated inverse-link function φ(·) in (3) may be different denote the number of layers in this multi-layer network. from the true inverse-link function φout(·). The minimization (2) can also be represented using a similar The generalization error is defined as the asymptotic ex- signal flow graph. 
Given a parameter candidate w, the pected loss, mapping w 7→ Xw can be written using the sequence of vectors Ets := lim Efts(yts, yts), (17) N→∞ b z0 := w, p0 := V0z0, z1 := Strp0, p1 := V1z1, (21) where f (·) is some loss function relevant for the test er- ts z := S p , p := V z = Xw. ror (which may be different from the training loss). The 2 mp 1 2 2 2 expectation in (17) is with respect to the randomness in the There are L = 3 steps in this sequence, and we let training as well as test data, and the noise. Our main result z = {z , z , z }, p = {p , p , p } provides a formula for the generalization error (17). 0 1 2 0 1 2 denote the sets of vectors across the steps. The minimization 3. Learning GLMs via ML-VAMP in (2) can then be written in the following equivalent form: min F0(z0) + F1(p0, z1) + F2(p1, z1) + F3(p2) There are many methods for solving the minimization prob- z,p (22) lem (2). We apply the ML-VAMP algorithm of (Fletcher subject to p` = V`z`, ` = 0, 1, 2, Generalization Error of GLMs in High Dimensions

ξ1 ξ2 ξ3

0 0 0 z0 = w z3 = y 0 0 0 0 0 0 p0 z1 p1 z2 p2 = Xw V0 φ1(·) V1 φ2(·) V2 φ3(·)

Figure 1. Sequence flow representing the mapping from the unknown parameter values w0 to the vector of responses y on the training data. where the penalty functions F` are defined as where (pb`−1, bz`) are the solutions to the minimization

− F0(·) = Fin(·),F1(·, ·) =δ{z =S p }(·, ·), γ 1 tr 0 (p , z ) := argmin F (p , z ) + ` kz − r−k2 F (·, ·) = δ (·, ·),F (·) =F (y, ·), b`−1 b` ` `−1 ` ` ` 2 {z2=Smpp1} 3 out (p`−1,z`) 2 (23) + c γ`−1 + 2 where δA(·) is 0 on the set A, and +∞ on A . + kp − r k . (27) 2 `−1 `−1

ML-VAMP for GLM Learning. Using this multi-layer The quantity h∂v/∂ui on lines 11 and 23 denotes the em- representation, we can now apply the ML-VAMP algorithm pirical mean 1 PN ∂v /∂u . from (Fletcher et al., 2018; Pandit et al., 2019) to solve N n=1 n n the optimization (22). The steps are shown in Algorithm1. Thus, the ML-VAMP algorithm in Algorithm1 reduces These steps are a special case of the “MAP version” of ML- the joint constrained minimization (22) over variables VAMP in (Pandit et al., 2019), but with a slightly different (z0, z1, z2) and (p0, p1, p2) to a set of proximal operations set-up for the GLM problem. We will call these steps the on pairs of variables (p`−1, z`). As discussed in (Pandit ML-VAMP GLM Learning Algorithm. et al., 2019), this type of minimization is similar to ADMM with adaptive step-sizes. Details of the denoisers g± and k ` The algorithm operates in a set of iterations indexed by . In other aspects of the algorithm are given in AppendixB. each iteration, a “forward pass” through the layers generates 0 estimates bzk` for the hidden variables z` , while a “backward 0 4. Main Result pass” generates estimates pbk` for the variables p` . In each step, the estimates zk` and pk` are produced by functions + − b b We make two assumptions. The first assumption imposes g` (·) and g` (·) called estimators or denoisers. certain regularity conditions on the functions fts, φ, φout, ± For the MAP version of ML-VAMP algorithm in (Pandit and maps g` appearing in Algorithm1. The precise defini- et al., 2019), the denoisers are essentially proximal-type tions of pseudo-Lipschitz continuity and uniform Lipschitz operators defined as continuity are given in AppendixA of the supplementary material. γ 2 proxF/γ (u) := argmin F (x) + 2 kx − uk . (24) x Assumption 1. The denoisers and link functions satisfy the following continuity conditions: An important property of the proximal operator is that for separable functions F of the form (6), we have (a) The proximal operators in (25),

[proxF/γ (u)]i = proxf/γ (ui). + − − − + + g0 (r0 , γ0 ), g3 (r2 , y, γ2 ), In the case of the GLM model, for ` = 0 and L, on lines7 − + and 19, the denoisers are proximal operators given by are uniformly Lipschitz continuous in r0 and (r2 , y) over parameters γ− and γ+. + − − − 0 2 g (r , γ ) = prox − (r ), (25a) 0 0 0 Fin/γ0 0 − + + + (b) The link function φout(p, d) is Lipschitz continuous in g (r , y, γ ) = prox + (r ). (25b) 3 2 2 Fout/γ 2 2 (p, d). The test error function fts(φ(zb), φout(z, d)) is Note that in (25b), there is a dependence on y through the pseduo-Lipschitz continuous in (z,b z, d) of order 2. term F (y, ·). For the middle terms, ` = 1, 2, i.e., lines9 out Our second assumption is that the ML-VAMP algorithm and 21, the denoisers are given by converges. Specifically, let xk = xk(N) be any set of out- + + − + − puts of Algorithm1, at some iteration k and dimension N. g (r , r , γ , γ ) := z`, (26a) ` `−1 ` `−1 ` b x (N) z (N) p (N) − + − + − For example, k could be bk` or bk` for some g` (r`−1, r` , γ`−1, γ` ) := p`−1, (26b)  0  b `, or a concatenation of signals such as z` (N) bzk`(N) . Generalization Error of GLMs in High Dimensions

0 Algorithm 1 ML-VAMP GLM Learning Algorithm (b) The true parameter w and its estimate wb empirically − − converge as 1: Initialize γ0` > 0, r0` = 0 for ` = 0,...,L−1 2: 0 PL(2) 0 3: for k = 0, 1,... do lim {(wi , wi)} = (W , Wc), (30) N→∞ b 4: // Forward Pass 5: for ` = 0,...,L − 1 do where W 0 is the random variable from (8) and 6: if ` = 0 then + − − 0 − 7: zk0 = g0 (rk0, γk0) Wc = prox + (W + Q ), (31) b fin/γ0 0 8: else 9: z = g+(r+ , r− , γ+ , γ− ) − − 0 bk` ` k,`−1 k` k,`−1 k` with Q0 = N (0, τ0 ) independent of W . 10: end if + − (c) The asymptotic generalization error (17) with (y , y ) 11: αk` = ∂bzk`/∂rk` ts bts + − defined as (3) is given by + V`(bzk` − αk`rk`) 12: rk` = + 1 − αk`   + + − Ets = E fts φout(Zts,D), φ(Zbts) , (32) 13: γk` = (1/αk` − 1)γk` 14: end for 15: where (Zts, Zbts) ∼ N (02, M) and independent of D. 16: // Backward Pass 17: for ` = L, . . . , 1 do Part (a) shows that, similar to gradient descent, Algorithm1 18: if ` = L then finds the stationary points of problem (2). These stationary − + + 19: pbk,L−1 = gL (rk,L−1, γk,L−1) points will be unique in strictly convex problems such as 20: else linear and logistic regression. Thus, in such cases, the same − + − + − 21: pbk,`−1 = g` (rk,`−1, rk+1,`, γk,`−1, γk+1,`) results will be true for any algorithm that finds such station- 22: end if ary points. Hence, the fact that we are using ML-VAMP is − D + E 23: αk,`−1 = ∂pbk,`−1/∂rk,`−1 immaterial – our results apply to any solver for (2). Note VT (p − α− r+ ) that the convergence to the fixed points {bz`, pb`} is assumed − `−1 bk,`−1 k,`−1 k,`−1 24: rk+1,`−1 = − from Assumption2. 1 − αk,`−1 25: γ− = (1/α− − 1)γ+ Part (b) provides an exact description of the asymptotic k+1,`−1 k,`−1 k,`−1 statistical relation between the true parameter w0 and its 26: end for − + 27: end for estimate wb. The parameters τ0 , γ0 > 0 and M can be explicitly computed using a set of recursive equations called the state evolution or SE described in AppendixC in the

Assumption 2. Let xk(N) be any finite set of outputs of supplementary material. the ML-VAMP algorithm as above. Then there exist limits We can use the expressions to compute a variety of relevant metrics. For example, the PL(2) convergence shows that x(N) = lim xk(N) (28) k→∞ the MSE on the parameter estimate is satisfying N 1 X 0 2 0 2 lim (wn − wbn) = E(W − Wc) . (33) 1 2 N→∞ N lim lim kxk(N) − x(N)k = 0. (29) n=1 k→∞ N→∞ N The expectation on the right hand side of (33) can then be We are now ready to state our main result. computed via integration over the joint density of (W 0, Wc) Theorem 1. Consider the GLM learning problem (2) solved from part (b). In this way, we have a simple and exact by applying Algorithm1 to the equivalent problem (22) method to compute the parameter error. Other metrics such under the assumptions of Section2 along with Assump- as parameter bias or variance, cosine angle or sparsity de- − + tection can also be computed. tions1 and2. Then, there exist constants τ0 , γ0 > 0 and 2×2 M ∈ R 0 such that the following hold: Part (c) of Theorem1 similarly exactly characterizes the asymptotic generalization error. In this case, we would (a) The fixed points {bz`, pb`}, ` = 0, 1, 2 of Algorithm1 compute the expectation over the three variables (Z, Z,Db ). satisfy the KKT conditions for the constrained opti- In this way, we have provided a methodology for exactly mization problem (22). Equivalently wb := bz0 is a predicting the generalization error from the key parameters stationary point of (2). of the problems such as the sampling ratio β = p/N, the Generalization Error of GLMs in High Dimensions regularizer, the output function, and the distributions of the For all experiments below, the true model coefficients are 0 true weights and noise. We provide several examples such generated as i.i.d. Gaussian wj ∼ N (0, 1) and we use stan- 2 as linear regression, logistic regression and SVM in the dard L2-regularization, fin(w) = λw /2 for some λ > 0. AppendixG. We also recover the result by (Hastie et al., Our framework can incorporate arbitrary i.i.d. distributions 2019) in AppendixG. on wj and regularizers, but we will illustrate just the Gaus- sian case with L2-regularization here. Remarks on Assumptions. Note that Assumption1 is satisfied in many practical cases. For example, it can be Under-regularized linear regression. We first consider verified that it is satisfied in the case when fin(·) and fout(·) the case of under-regularized linear regression where the 2 are convex. Assumption2 is somewhat more restrictive output channel is φout(p, d) = p + d with d ∼ N (0, σd). 2 in that it requires that the ML-VAMP algorithm converges. The noise variance σd is set for an SNR level of 10 dB. The convergence properties of ML-VAMP are discussed We use a standard mean-square error (MSE) output loss, 2 2 in (Fletcher et al., 2016). The ML-VAMP algorithm may fout(y, p) = (y − p) /(2σd). Since we are using the 2 not always converge, and characterizing conditions under L2-regularizer, fin(w) = λw /2, the minimization (2) is which convergence is possible is an open question. However, standard ridge regression. Moreover, if we were to select 0 2 experiments in (Rangan et al., 2019) show that the algorithm λ = 1/E(wj ) , then the ridge regression estimate would does indeed often converge, and in these cases, our analysis correspond to the minimum mean-squared error (MMSE) applies. In any case, we will see below that the predictions estimate of the coefficients w0. 
However, to study the under- −4 0 2 from Theorem1 agree closely with numerical experiments regularized regime, we take λ = (10) /E(wj ) . in several relevant cases. Fig.2 plots the test MSE for the three cases described above In some special cases equation (32) simplifies to yield quan- for the linear model. In the figure, we take p = 1000 titative insights for interesting modeling artifacts. We dis- features and vary the number of samples n from 0.2p (over- cuss these in AppendixG in the supplementary material. parametrized) to 3p (under-paramertrized). For each value of n, we take 100 random instances of the model and com- 5. Experiments pute the ridge regression estimate using the sklearn package and measure the test MSE on the 1000 independent test Training and Test Distributions. We validate our theo- samples. The simulated values in Fig.2 are the median test retical results on a number of synthetic data experiments. error over the 100 random trials. The test MSE is plotted in For all the experiments, the training and test data is gen- a normalized dB scale, erated following the model in Section2. We generate the  2  training and test eigenvalues as i.i.d. with lognormal distri- E(ybts − yts) Test MSE (dB) = 10 log10 2 . butions, Eyts

2 0.1u 2 0.1u Also plotted is the state evolution (SE) theoretical test MSE S = A(10) tr ,S = A(10) ts , tr ts from Theorem1. where (utr, uts) are bivariate zero-mean Gaussian with In all three cases in Fig.2, the SE theory exactly matches the simulated values for the test MSE. Note that the case  1 ρ  cov(u , u ) = σ2 . of match training and test distributions for this problem tr ts u ρ 1 was studied in (Hastie et al., 2019; Mei & Montanari, 2019; Montanari et al., 2019) and we see the double descent phe- 2 In the case when σu = 0, we obtain eigenvalues that are nomenon described in their work. Specifically, with highly 2 equal, corresponding to the i.i.d. case. With σu > 0 we under-regularized linear regression, the test MSE actually can model correlated features. Also, when the correlation increases with more samples n in the over-parametrized coefficient ρ = 1, Str = Sts, so there is no training and test regime (n/p < 1) and then decreases again in the under- mismatch. However, we can also select ρ < 1 to experiment parametrized regime (n/p > 1). with cases when the training and test distributions differ. In Our SE theory can also provide predictions for the corre- the examples below, we consider the following three cases: lated feature case. In this particular setting, we see that (1) i.i.d. features (σu = 0); in the correlated case the test error is slightly lower in the over-parametrized regime since the energy of data is concen- (2) correlated features with matching training and test dis- trated in a smaller sub-space. Interestingly, there is minimal tributions (σ = 3 dB, ρ = 1); and u difference between the correlated and i.i.d. cases for the (3) correlated features with train-test mismatch (σu = under-parametrized regime when the training and test data 3 dB, ρ = 0.5). match. When the training and test data are not matched, Generalization Error of GLMs in High Dimensions

Figure 2. Test error for under-regularized linear regression under Figure 4. Test MSE under a non-linear least square estimation. various train and test distributions

rate (1− accuracy) and compare the simulated values with the SE theoretical estimates. The i.i.d. case was considered in (Salehi et al., 2019). Fig.3 shows that our SE theory is able to predict the test error rate exactly in i.i.d. cases along with a correlated case and a case with training and test mismatch.

Nonlinear Regression. The SE framework can also con- sider non-convex problems. As an example, we consider a non-linear regression problem where the output function is

2 φout(p, d) = tanh(p) + d, d ∼ N (0, σd). (34) The tanh(p) models saturation in the output. Corresponding to this output, we use a non-linear MSE output loss

1 2 fout(y − p) = 2 (y − tanh(p)) . (35) Figure 3. Classification error rate with logistic regression under 2σd various train and test distributions This output loss is non-convex. We scale the data matrix so that the input E(p2) = 9 so that the tanh(p) is driven well 2 the test error increases. In all cases, the SE theory can into the non-linear regime. We also take σd = 0.01. accurately predict these effects. For the simulation, the non-convex loss is minimized using Tensorflow where the non-linear model is described as a Logistic Regression. Fig.3 shows a similar plot as Fig.2 two-layer model. We use the ADAM optimizer (Kingma & for a logistic model. Specifically, we use a logistic output Ba, 2014) with 200 epochs to approach a local minimum of P (y = 1) = 1/(1 + e−p), a binary cross entropy output the objective (2). Fig.4 plots the median test MSE for the loss fout(y, p), and `2-regularization level λ so that the estimate along with the SE theoretical test MSE. We again output corresponds to the MAP estimate (we do not perform see that the SE theory is able to predict the test MSE in all ridgeless regression in this case). The mean of the training cases even for this non-convex problem. 2 2 and test eigenvalues EStr = ESts are selected such that, 0 if the true coefficients w were known, we could obtain 6. Conclusions a 5% prediction error. As in the linear case, we generate random instances of the model, use the sklearn package to In this paper we provide a procedure for exactly computing perform the logistic regression, and evaluate the estimates the asymptotic generalization error of a solution in a gen- on 1000 new test samples. We compute the median error eralized linear model (GLM). This procedure is based on Generalization Error of GLMs in High Dimensions scalar quantities which are fixed points of a recursive itera- Daniely, A. Sgd learns the conjugate kernel class of the tion. The formula holds for a large class of generalization network. In Advances in Neural Information Processing metrics, loss functions, and regularization schemes. Our Systems, pp. 2422–2430, 2017. formula allows analysis of important modeling effects such as (i) overparameterization, (ii) dependence between covari- Daniely, A., Frostig, R., and Singer, Y. Toward deeper under- ates, and (iii) mismatch between train and test distributions, standing of neural networks: The power of initialization which play a significant role in the analysis and design of and a dual view on expressivity. In Advances In Neural machine learning systems. We experimentally validate our Information Processing Systems, pp. 2253–2261, 2016. theoretical results for linear as well as non-linear regression and logistic regression, where a strong agreement is seen Deng, Z., Kammoun, A., and Thrampoulidis, C. A model between our formula and simulated results. of double descent for high-dimensional binary linear clas- sification. arXiv preprint arXiv:1911.05822, 2019.

References Donoho, D. L., Maleki, A., and Montanari, A. Message- Advani, M. S. and Saxe, A. M. High-dimensional dynamics passing algorithms for compressed sensing. Proc. of generalization error in neural networks. arXiv preprint National Academy of Sciences, 106(45):18914–18919, arXiv:1710.03667, 2017. 2009. Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and gener- Donoho, D. L., Maleki, A., and Montanari, A. Message alization in overparameterized neural networks, going passing algorithms for compressed sensing. In Proc. In- beyond two layers. In Advances in Neural Information form. Theory Workshop, pp. 1–5, 2010. Processing Systems, pp. 6155–6166, 2019. Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine- descent provably optimizes over-parameterized neural grained analysis of optimization and generalization for networks. arXiv preprint arXiv:1810.02054, 2018. overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019. Fletcher, A., Sahraee-Ardakan, M., Rangan, S., and Schniter, P. Expectation consistent approximate inference: Gen- Barbier, J., Krzakala, F., Macris, N., Miolane, L., and Zde- eralizations and convergence. In Proc. IEEE Int. Symp. borova,´ L. Optimal errors and phase transitions in high- Information Theory (ISIT), pp. 190–194. IEEE, 2016. dimensional generalized linear models. Proc. National Academy of Sciences, 116(12):5451–5460, 2019. Fletcher, A. K., Rangan, S., and Schniter, P. Inference in Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. deep networks in high dimensions. Proc. IEEE Int. Symp. Benign overfitting in linear regression. arXiv preprint Information Theory, 2018. arXiv:1906.11300, 2019. Gabrie,´ M., Manoel, A., Luneau, C., Barbier, J., Macris, Bayati, M. and Montanari, A. The dynamics of message N., Krzakala, F., and Zdeborova,´ L. Entropy and mutual passing on dense graphs, with applications to compressed information in models of deep neural networks. In Proc. sensing. IEEE Trans. Inform. Theory, 57(2):764–785, NIPS, 2018. February 2011. Gerbelot, C., Abbara, A., and Krzakala, F. Asymptotic er- Belkin, M., Ma, S., and Mandal, S. To understand deep rors for convex penalized linear regression beyond gaus- learning we need to understand kernel learning. arXiv sian matrices. arXiv preprint arXiv:2002.04372, 2020. preprint arXiv:1802.01396, 2018. Givens, C. R., Shortt, R. M., et al. A class of wasserstein Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling metrics for probability distributions. The Michigan Math- modern machine-learning practice and the classical bias– ematical Journal, 31(2):231–240, 1984. variance trade-off. Proc. National Academy of Sciences, 116(32):15849–15854, 2019a. Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Belkin, M., Hsu, D., and Xu, J. Two models of double de- Surprises in high-dimensional ridgeless least squares in- scent for weak features. arXiv preprint arXiv:1903.07571, terpolation. arXiv preprint arXiv:1903.08560, 2019. 2019b. He, H., Wen, C.-K., and Jin, S. Generalized expectation Cakmak, B., Winther, O., and Fleury, B. H. S-AMP: Ap- consistent signal recovery for nonlinear measurements. proximate message passing for general matrix ensembles. In 2017 IEEE International Symposium on Information In Proc. IEEE ITW, 2014. Theory (ISIT), pp. 2333–2337. IEEE, 2017. Generalization Error of GLMs in High Dimensions

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Salehi, F., Abbasi, E., and Hassibi, B. The impact of regu- Convergence and generalization in neural networks. In larization on high-dimensional logistic regression. arXiv Advances in neural information processing systems, pp. preprint arXiv:1906.03761, 2019. 8571–8580, 2018. Tulino, A. M., Verdu,´ S., et al. Random matrix theory and Kingma, D. P. and Ba, J. Adam: A method for stochastic wireless communications. Foundations and Trends R in optimization. arXiv preprint arXiv:1412.6980, 2014. Communications and Information Theory, 1(1):1–182, 2004. Ma, J. and Ping, L. Orthogonal amp. IEEE Access, 5: 2020–2033, 2017. Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. Manoel, A., Krzakala, F., Varoquaux, G., Thirion, B., and Zdeborova,´ L. Approximate message-passing for convex Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. optimization with non-separable penalties. arXiv preprint Understanding requires rethinking general- arXiv:1809.06304, 2018. ization. arXiv preprint arXiv:1611.03530, 2016.

Mei, S. and Montanari, A. The generalization error of ran- dom features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

Montanari, A., Ruan, F., Sohn, Y., and Yan, J. The gen- eralization error of max-margin linear classifiers: High- dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.

Muthukumar, V., Vodrahalli, K., and Sahai, A. Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 2299–2303. IEEE, 2019.

Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. Towards understanding the role of over- parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.

Opper, M. and Winther, O. Expectation consistent approxi- mate inference. Journal of Machine Learning Research, 6(Dec):2177–2204, 2005.

Pandit, P., Sahraee, M., Rangan, S., and Fletcher, A. K. Asymptotics of MAP inference in deep networks. In Proc. IEEE Int. Symp. Information Theory, pp. 842–846, 2019.

Pandit, P., Sahraee-Ardakan, M., Rangan, S., Schniter, P., and Fletcher, A. K. Inference with deep generative priors in high dimensions. arXiv preprint arXiv:1911.03409, 2019.

Rangan, S., Schniter, P., and Fletcher, A. K. Vector approxi- mate message passing. IEEE Trans. Information Theory, 65(10):6664–6684, 2019.

Reeves, G. Additivity of information in multilayer net- works via additive gaussian noise transforms. In Proc. 55th Annual Allerton Conf. Communication, Control, and Computing (Allerton), pp. 1064–1070. IEEE, 2017. Supplementary Materials: Generalization Error of GLMs in High Dimensions

p A. Empirical Convergence of Vector surely. In particular, if xn ∼ X are i.i.d. and EkXkp < ∞, Sequences then x empirically converges to X with pth order moments. The LSL model in Section2 and our main result in Section4 PL(p) convergence is equivalent to weak convergence plus require certain technical definitions. convergence in p moment (Bayati & Montanari, 2011), and hence PL(p) convergence is also equivalent to convergence Definition 1 (Pseudo-Lipschitz continuity). For a given in Wasserstein-p metric (See Chapter 6. (Villani, 2008)). p ≥ 1, a function f : d → m is called Pseudo-Lipschitz R R We use this fact later in proving Theorem1. of order p if

kf(x1) − f(x2)k B. ML-VAMP Denoisers Details ≤ Ckx − x k 1 + kx kp−1 + kx kp−1 (36) 1 2 1 2 Related to Smp and smp from equation (11), we need to + N − p define two quantities smp ∈ R and smp ∈ R that are for some constant C > 0. zero-padded versions of the singular values smp, so that for n > min{N, p}, we set s± = 0. Observe that (s+ )2 Observe that for p = 1, the pseudo-Lipschitz is equivalent mp,n mp T − 2 to the standard definition of Lipschitz continuity. are eigenvalues of UU whereas (smp) are eigenvalues T of U U. Since smp empirically converges to Smp as given Definition 2 (Uniform Lipschitz continuity). Let φ(x, θ) + d s in (12), the vector smp empirically converges to random be a function on r ∈ R and θ ∈ R . We say that φ(x, θ) is + − variable Smp whereas the vector smp empirically converges uniformly Lipschitz continuous in x at θ = θ if there exists − to random variable Smp, where a mass is placed at 0 appro- constants L1,L2 ≥ 0 and an open neighborhood U of θ + priately. Specifically, Smp has a point mass of (1 − β)+δ{0} such that − 1 when β < 1, whereas Smp has a point mass of (1− β )+δ{0}, kφ(x , θ) − φ(x , θ)k ≤ L kx − x k (37) when β > 1. In AppendixH (eqn. (113)), we provide the 1 2 1 1 2 + − densities over positive parts of Smp and Smp. d for all x1, x2 ∈ R and θ ∈ U; and A key property of our analysis will be that the non-linear ± functions (20) and the denoisers g` (·) have simple forms. kφ(x, θ1) − φ(x, θ2)k ≤ L2 (1 + kxk) kθ1 − θ2k, (38) Non-linear functions φ`(·): The non-linear functions all act d for all x ∈ R and θ1, θ2 ∈ U. componentwise. For example, for φ1(·) in (20), we have Definition 3 (Empirical convergence of a sequence). Con- z = φ (p , s ) = diag(s )p ⇐⇒ z = φ (p , s ), N 1 1 0 tr tr 0 1,n 1 0,n tr,n sider a sequence of vectors x(N) = {xn(N)}n=1 with d xn(N) ∈ R . So, each x(N) is a block vector with a total where φ1(·) is the scalar-valued function, of Nd components. For a finite p ≥ 1, we say that the vec- tor sequence x(N) converges empirically with p-th order φ1(p0, s) = sp0. (41) moments if there exists a random variable X ∈ d such R Similarly, for φ (·), that 2 + + z2 = φ2(p1, smp) ⇐⇒ z2,n = φ2(p1,n, smp,n), n < N p (i) EkXkp < ∞; and N where p1 ∈ R is the zero-padded version of p1, and (ii) for any f : Rd → R that is pseudo-Lipschitz continu- ous of order p, φ2(p1, s) = s p1. (42)

N Finally, the function φ3(·) in (20) acts componentwise with 1 X lim f(xn(N)) = E [f(X)] . (39) N→∞ N n=1 φ3(p2, d) = φout(p2, d). (43)

+ In this case, with some abuse of notation, we will write Input denoiser g0 (·): Since F0(z0) = Fin(z0), and Fin(·) given in (6), the denoiser (25a) acts componentwise in that, PL(p) lim {xn} = X, (40) + − − + − − N→∞ bz0 = g0 (r0 , γ0 ) ⇐⇒ zb0,n = g0 (r0,n, γ0 ), where we have omitted the dependence on N in xn(N). + where g0 (·) is the scalar-valued function, We note that the sequence {x(N)} can be random or de- terministic. If it is random, we will require that for every γ− g+(r−, γ−) := argmin f (z ) + 0 (z − r−)2. pseudo-Lipschitz function f(·), the limit (39) holds almost 0 0 0 in 0 0 0 (44) z0 2 Supplementary Materials: Generalization Error of GLMs in High Dimensions

Thus, the vector optimization in (25a) reduces to a set of C. State Evolution Analysis of ML-VAMP scalar optimizations (44) on each component. A key property of the ML-VAMP algorithm is that its perfor- − Output denoiser g3 (·): The output penalty F3(p2, y) = mance in the LSL can be exactly described by a scalar Fout(p2, y) where Fout(p2, y) has the separable form (6). equivalent system. In the scalar equivalent system, the Thus, similar to the case of g0(·), the denoiser g3(·) in (25b) vector-valued outputs of the algorithm are replaced by scalar also acts componentwise with the function, random variables representing the typical behavior of the components of the vectors in the large-scale-limit (LSL). + − + + γ2 + 2 Each of the random variables are described by a set of g3 (r2 , γ2 , y) := argmin fout(p2, y) + 2 (p2 − r2 ) . p2 parameters, where the parameters are given by a set of de- (45) terministic equations called the state evolution or SE. The SE for the general ML-VAMP algorithm are derived ± Linear denoiser g1 (·): The expressions for both denoisers in (Pandit et al., 2019) and the special case of the updates ± ± for ML-VAMP for GLM learning are shown in Algorithm2 g1 and g2 are very similar and can be explained together. with details of functions g± in AppendixB. We see that the The penalty F1(·) restricts z1 = Strp0, where Str is a ` square matrix. Hence, for ` = 1, the minimization in (27) SE updates in Algorithm2 parallel those in the ML-VAMP is given by, algorithm Algo.1, except that vector quantities such as zk`, + − b pbk`, rk` and rk` are replaced by scalar random variables + − + − γ0 + 2 γ1 − 2 such as Zbk`, Pbk`, Rk` and Rk`. Each of these random pb0 := argmin 2 kp0 − r0 k + 2 kStrp0 − r1 k , p0 variables are described by the deterministic parameters such 2×2 0 − (46) as Kk` ∈ R 0 , and τ` , τk` ∈ R+. The updates in the section labeled as “Initial”, provide the and bz1 = Strpb0. This is a simple quadratic minimization scalar equivalent model for the true system (18). In these and the components of p0 and z1 are given by b b updates, Ξ` represent the limits of the vectors ξ` in (19).

− + − + − That is, p0,n = g (r , r , γ , γ , str,n) b 1 0,n 1,n 0 1 + + + − + − Ξ1 := Str, Ξ2 := Smp, Ξ3 := D. (49) zb1,n = g1 (r0,n, r1,n, γ0 , γ1 , str,n), Due to assumptions in Section2, we have that the compo- where nents of ξ` converge empirically as,

+ + − − PL(2) γ r + sγ r lim {ξ`,i} = Ξ`, (50) g−(r+, r−, γ+, γ−, s) := 0 0 1 1 (47a) N→∞ 1 0 1 0 1 + 2 − γ0 + s γ1 So, the Ξ` represent the asymptotic distribution of the com- s(γ+r+ + sγ−r−) g+(r+, r−, γ+, γ−, s) := 0 0 1 1 (47b) ponents of the vectors ξ`. 1 0 1 0 1 γ+ + s2γ− 0 1 The updates in sections labeled “Forward pass” and “Back- ward pass” in the SE equations in Algorithm2 parallel those ± Linear denoiser g2 (·): This denoiser is identical to the case in Algorithm1. The key quantities in these SE equations ± are the error variables, g1 (·) in that we need to impose the linear constraint z2 = Smpp1. However Smp is in general a rectangular matrix + + 0 − − 0 pk` := rk` − p` , qk` := rk` − z` , and the two resulting cases of β ≶ 1 needs to be treated separately. which represent the errors of the estimates to the inputs of + − the denoisers. We will also be interested in their transforms, Recall the definitions of vectors smp and smp at the begin- + T + − − ning of this section. Then, for ` = 2, with the penalty qk` = V` pk,`+1, pk` = V`qk`. F (p , z ) = δ , the solution to (27) has compo- 2 1 2 {z2=Smpp1} The following Theorem is an adapted version of the main nents, result from (Pandit et al., 2019) to the iterates of Algorithms 1 and2. p = g−(r+ , r− , γ+, γ+, s− ) (48a) b1,n 2 1,n 2,n 1 2 mp,n Theorem 2. Consider the outputs of the ML-VAMP for + + − + + + zb2,n = g2 (r1,n, r2,n, γ1 , γ2 , smp,n), (48b) GLM Learning Algorithm under the assumptions of Sec- tion2. Assume the denoisers satisfy the continuity condi- − − + + with the identical functions g2 = g1 and g2 = g1 as given tions in Assumption1. Also, assume that the outputs of the by (47a) and (47b). Note that in (48a), n = 1, . . . , p and in SE satisfy ± (48b), n = 1,...,N. αk` ∈ (0, 1), Supplementary Materials: Generalization Error of GLMs in High Dimensions

for all k and `. Suppose Algo.1 is initialized so that the following convergence holds

− 0 PL(2) − lim {r − z` } = Q N→∞ 0` 0` Algorithm 2 SE for ML-VAMP for GLM Learning where (Q− ,Q− ,Q− ) are independent zero-mean Gaus- 1: // Initial 00 01 02 − − sians, independent of {Ξ`}. Then, 2: Initialize γ0` = γ0` from Algorithm1. − − − 3: Q0` ∼ N (0, τ0`) for some τ0` > 0 for ` = 0, 1, 2 (a) For any fixed iteration k ≥ 0 in the forward direction 0 0 4: Z0 = W and layer ` = 1,...,L−1, we have that, almost surely, 5: for ` = 0,...,L−1 do + − + − 0 0 0 0 lim (γk,`−1, γk`) = (γk,`−1, γk`), and, (51) 6: P` = N (0, τ` ), τ` = var(Z` ) N→∞ 0 0 7: Z = φ`+1(P , Ξ`+1) + 0 0 + − `+1 ` lim {(bzk`, z` , p`−1, rk,`−1, r` )} 8: end for N→∞ 9: PL(2) + 0 0 + − = (Zbk`,Z` ,P`−1,Rk,`−1,R` ) (52) 10: for k = 0, 1,... do 11: // Forward Pass where the variables on the right-hand side are from the 12: for ` = 0,...,L − 1 do SE equations (51) and (52) are the outputs of the SE 13: if ` = 0 then equations in Algorithm2. Similar equations hold for − 0 − ` = 0 with the appropriate variables removed. 14: Rk0 = Z` + Qk0 + − − 15: Zbk0 = g0 (Rk0, γk0) (b) Similarly, in the reverse direction, For any fixed itera- 16: else tion k ≥ 0 and layer ` = 1,...,L − 2, we have that, + 0 + − 0 − 17: Rk,`−1 = P`−1 + Pk,`−1, Rk` = Z` + Qk` almost surely, + + − + − 18: Zbk` = g` (Rk,`−1,Rk`, γk,`−1, γk`, Ξ`) + − + − lim (γk,`−1, γk+1,`) = (γk,`−1, γk+1,`), and (53) 19: end if N→∞ + − + 0 0 + − 20: α = ∂Zbk`/∂Q lim {(p , z` , p`−1, r , r )} k` E k` N→∞ bk+1,`−1 k,`−1 k+1,` 0 + − + Zbk` − Z − α Q 21: Q = ` k` k` PL(2) + 0 0 + − k` + = (Pbk+1,`−1,Z` ,P`−1,Rk,`−1,Rk+1,`). (54) 1 − αk` 22: γ+ = ( 1 − 1)γ− k` α+ k` 0 + − k` Furthermore, (P`−1,Pk`−1) and Qk` are independent. 0 + + + 0 + 23: (P` ,Pk`) ∼ N (0, Kk`), Kk` = cov(Z` ,Qk`) 24: end for Proof. This is a direct application of Theorem 3 from (Pan- 25: dit et al., 2019) to the iterations of Algorithm1. The con- 26: // Backward Pass vergence result in (Pandit et al., 2019) requires the uniform 27: for ` = L, . . . , 1 do ± Lipschitz continuity of functions g` (·). Assumption1 pro- 28: if ` = L then vides the required uniform Lipschitz continuity assumption 29: R+ = P 0 + P + + − k,L−1 L−1 k,L−1 on g0 (·) and g3 (·). For the ”middle” layers, ` = 1, 2, − + + 0 g±(·) 30: Pbk,L−1 = gL (Rk,L−1, γk,L−1,ZL) the denoisers ` are linear and the uniform continuity 31: else assumption is valid since we have made the additional as- + 0 + − 0 − s s 32: Rk,`−1 = P`−1 + Pk,`−1, Rk+1,` = Z` + Qk+1,` sumption that the terms tr and mp are bounded almost − + − + − surely.  33: Pbk,`−1 = g` (Rk,`−1,Rk+1,`, γk,`−1, γk+1,`, Ξ`) 34: end if − + A key use of the Theorem is to compute asymptotic empiri- 35: αk,`−1 = E∂Pbk,`−1/∂Pk,`−1 0 − + cal limits. Specifically, for a componentwise function ψ(·), − Pbk,`−1 − P`−1 − αk,`−1Pk,`−1 1 PN let hψ(x)i denotes the average ψ(xn) The above 36: Pk+1,`−1 = − N n=1 1 − αk,`−1 theorem then states that for any componentwise pseudo- − 1 + 37: γk+1,`−1 = ( − − 1)γk,`−1 Lipschitz function ψ(·) of order 2, as N → ∞, we have the αk,`−1 − − − − 2 following two properties 38: Qk+1,`−1 ∼ N (0, τk+1,`−1), τk,`−1 = E(Pk+1,`−1) + 0 0 + − 39: end for lim hψ(z , z` , p`−1, r , r )i N→∞ bk` k,`−1 ` 40: end for + 0 0 + − = E ψ(Zbk`,Z` ,P`−1,Rk,`−1,R` ) + 0 0 + − lim hψ(p , z` , p`−1, r , r i N→∞ bk+1,`−1 k,`−1 k+1,`) + 0 0 + − = E ψ(Pbk+1,`−1,Z` ,P`−1,Rk,`−1,Rk+1,`). Supplementary Materials: Generalization Error of GLMs in High Dimensions

That is, we can compute the empirical average over compo- Proof. For any  > 0, the limit (59) implies that there exists nents with the expected value of the random variable limit. a k(↑ ∞ as  ↓ 0) such that for all k > k, This convergence is key to proving Theorem1. lim |αN − βk| < , N→∞ D. Empirical Convergence of Fixed Points from which we can conclude,

A consequence of Assumption2 is that we can take the lim inf αN > βk −  N→∞ limit k → ∞ of the random variables in the SE algorithm. for all k > k . Therefore, Specifically, let xk = xk(N) be any set of d outputs from  the ML-VAMP for GLM Learning Algorithm under the lim inf αN ≥ sup βk − . N→∞ assumptions of Theorem2. Under Assumption2, for each k≥k N , there exists a vector Since this is true for all  > 0, it follows that

x(N) = lim xk(N), (55) lim inf αN ≥ lim sup βk. (61) k→∞ N→∞ k→∞ representing the limit over k. For each k, Theorem2 shows Similarly, lim supN→∞ αN ≤ infk>k βk + , whereby there also exists a random vector limit, lim sup αN ≤ lim inf βk. (62) N→∞ k→∞ PL(2) lim {xk,i(N)} = Xk, (56) N→∞ Equations (61) and (62) together show that the limits in (60) exists and are equal.  representing the limit over N. The following proposition shows that we can take the limits of the random variables Xk. Proof of Proposition1 Let f : Rd → R be any pseudo- Proposition 1. Consider the outputs of the ML-VAMP for Lipschitz function of order 2, and define, GLM Learning Algorithm under the assumptions of Theo- N 1 X rem2 and Assumption2. Let xk = xk(N) be any set of d α = f(x (N)), β = f(X ). (63) N N i k E k outputs from the algorithm and let x(N) be its limit from i=1 (55) and let X be the random variable limit (56). Then, k Their difference can be written as, there exists a random variable X ∈ Rd such that, for any d pseudo-Lipschitz continuous f : R → R, αN − βk = AN,k + BN,k, (64)

where
\[
\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^N f(x_i(N)) = \lim_{k\to\infty} \mathbb{E} f(X_k) = \mathbb{E} f(X). \tag{57}
\]
In addition, the SE parameters $\alpha^{\pm}_{k\ell}$ and $\gamma^{\pm}_{k\ell}$ converge to limits,
\[
\lim_{k\to\infty} \alpha^{\pm}_{k\ell} = \alpha^{\pm}_{\ell}, \qquad \lim_{k\to\infty} \gamma^{\pm}_{k\ell} = \gamma^{\pm}_{\ell}. \tag{58}
\]

The proposition shows that, under the convergence assumption, Assumption 2, we can take the limits as $k\to\infty$ of the random variables from the SE. To prove the proposition we first need the following simple lemma.

Lemma 1. If $\alpha_N$ and $\beta_k \in \mathbb{R}$ are sequences such that
\[
\lim_{k\to\infty}\lim_{N\to\infty} |\alpha_N - \beta_k| = 0, \tag{59}
\]
then there exists a constant $C$ such that
\[
\lim_{N\to\infty} \alpha_N = \lim_{k\to\infty}\beta_k = C. \tag{60}
\]
In particular, the two limits in (60) exist.

Define
\[
A_{N,k} := \frac{1}{N}\sum_{i=1}^N f(x_i(N)) - f(x_{k,i}(N)), \tag{65}
\]
\[
B_{N,k} := \frac{1}{N}\sum_{i=1}^N f(x_{k,i}(N)) - \mathbb{E} f(X_k). \tag{66}
\]
Since $\{x_{k,i}(N)\}$ converges PL(2) to $X_k$, we have
\[
\lim_{N\to\infty} B_{N,k} = 0. \tag{67}
\]
For the term $A_{N,k}$,
\[
\lim_{N\to\infty}|A_{N,k}| \stackrel{(a)}{\leq} \lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^N |f(x_i(N)) - f(x_{k,i}(N))|
\stackrel{(b)}{\leq} \lim_{N\to\infty} \frac{C}{N}\sum_{i=1}^N a_{ki}(N)\big(1 + a_{ki}(N)\big)
\stackrel{(c)}{\leq} C \lim_{N\to\infty} \left( \sqrt{\frac{1}{N}\sum_{i=1}^N a_{ki}^2(N)} + \frac{1}{N}\sum_{i=1}^N a_{ki}^2(N) \right)
= C \lim_{N\to\infty} \epsilon_k(N)\big(1 + \epsilon_k(N)\big), \tag{68}
\]
where (a) follows from applying the triangle inequality to the definition of $A_{N,k}$ in (65); (b) follows from the definition of pseudo-Lipschitz continuity in Definition 1, where $C > 0$ is the Lipschitz constant and
\[
a_{ki}(N) := \|x_{k,i}(N) - x_i(N)\|_2;
\]
and (c) follows from the RMS-AM inequality:
\[
\left(\frac{1}{N}\sum_{i=1}^N a_{ki}(N)\right)^2 \leq \frac{1}{N}\sum_{i=1}^N a_{ki}^2(N) =: \epsilon_k^2(N).
\]
By (29), we have that
\[
\lim_{k\to\infty}\lim_{N\to\infty} \epsilon_k(N) = 0.
\]
Hence, from (68), it follows that
\[
\lim_{k\to\infty}\lim_{N\to\infty} A_{N,k} = 0. \tag{69}
\]
Substituting (67) and (69) into (64) shows that $\alpha_N$ and $\beta_k$ satisfy (59). Therefore, applying Lemma 1, we have that for any pseudo-Lipschitz function $f(\cdot)$ there exists a limit $\Phi(f)$ such that
\[
\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N f(x_i(N)) = \lim_{k\to\infty}\mathbb{E} f(X_k) = \Phi(f). \tag{70}
\]
In particular, the two limits in (70) exist. When restricted to the continuous, bounded functions with the $\|f\|_\infty$ norm, it is easily verified that $\Phi(f)$ is a positive, linear, bounded function of $f$, with $\Phi(1) = 1$. Therefore, by the Riesz representation theorem, there exists a random variable $X$ such that $\Phi(f) = \mathbb{E} f(X)$. This fact in combination with (70) shows (57).

It remains to prove the parameter limits in (58). We prove the result for the parameter $\alpha^{+}_{k\ell}$; the proof for the other parameters is similar. Using Stein's lemma, it is shown in (Pandit et al., 2019) that
\[
\alpha^{+}_{k\ell} = \frac{\mathbb{E}(\widehat{Z}_{k\ell} Q^{-}_{k\ell})}{\mathbb{E}(Q^{-}_{\ell})^2}. \tag{71}
\]
Since the numerator and denominator of (71) are PL(2) functions, we have the limit
\[
\alpha^{+}_{\ell} := \lim_{k\to\infty} \alpha^{+}_{k\ell} = \lim_{k\to\infty} \frac{\mathbb{E}(\widehat{Z}_{k\ell} Q^{-}_{k\ell})}{\mathbb{E}(Q^{-}_{k\ell})^2} = \frac{\mathbb{E}(\widehat{Z}_{\ell} Q^{-}_{\ell})}{\mathbb{E}(Q^{-}_{\ell})^2}, \tag{72}
\]
where $\widehat{Z}_{\ell}$ and $Q^{-}_{\ell}$ are the limits of $\widehat{Z}_{k\ell}$ and $Q^{-}_{k\ell}$. This completes the proof.

E. Proof of Theorem 1

From Assumption 2, we know that for every $N$ the group of vectors $x_k$ converges to limits $x := \lim_{k\to\infty} x_k$. The parameters $\gamma^{\pm}_{k\ell}$ also converge to limits $\gamma^{\pm}_{\ell} := \lim_{k\to\infty}\gamma^{\pm}_{k\ell}$ for all $\ell$. By the continuity assumptions on the functions $g^{\pm}_\ell(\cdot)$, the limits $x$ and $\gamma^{\pm}_{\ell}$ are fixed points of the algorithm. A proof similar to that in (Pandit et al., 2019) shows that the fixed points $\widehat{z}_\ell$ and $\widehat{p}_\ell$ satisfy the KKT conditions of the constrained optimization (22). This proves part (a).

The estimate $\widehat{w}$ is the limit
\[
\widehat{w} = \widehat{z}_0 = \lim_{k\to\infty} \widehat{z}_{k0}.
\]
Also, the true parameter is $z^0_0 = w^0$. By Proposition 1, the PL(2) limits of these variables are
\[
\lim_{N\to\infty}\{(\widehat{w}, w^0)\} \stackrel{PL(2)}{=} (\widehat{W}, W^0) := (\widehat{Z}_0, Z^0_0).
\]
From line 15 of the SE Algorithm 2, we have
\[
\widehat{W} = \widehat{Z}_0 = g^{+}_0(R^{-}_0, \gamma^{-}_0) = \mathrm{prox}_{f_{\rm in}/\gamma^{-}_0}(W^0 + Q^{-}_0).
\]
This proves part (b).

To prove part (c), we use the limit
\[
\lim_{N\to\infty}\{(p^0_{0,n}, \widehat{p}_{0,n})\} \stackrel{PL(2)}{=} (P^0_0, \widehat{P}_0). \tag{73}
\]
Since the fixed points are critical points of the constrained optimization (22), $\widehat{p}_0 = V_0 \widehat{w}$. We also have $p^0_0 = V_0 w^0$. Therefore,
\[
[z^{(N)}_{\rm ts}\ \ \widehat{z}^{(N)}_{\rm ts}] := u^{\sf T}\,\mathrm{Diag}(s_{\rm ts})\, V_0\, [w^0\ \ \widehat{w}] = u^{\sf T}\,\mathrm{Diag}(s_{\rm ts})\,[p^0_0\ \ \widehat{p}_0]. \tag{74}
\]
Here, $(N)$ in the superscript denotes the dependence on $N$. Since $u \sim \mathcal{N}(0, \tfrac{1}{p} I)$, the pair $[z^{(N)}_{\rm ts}\ \widehat{z}^{(N)}_{\rm ts}]$ is a zero-mean bivariate Gaussian with covariance matrix
\[
M^{(N)} = \frac{1}{p}\sum_{n=1}^p \begin{bmatrix} s^2_{{\rm ts},n}\, p^0_{0,n} p^0_{0,n} & s^2_{{\rm ts},n}\, p^0_{0,n} \widehat{p}_{0,n} \\ s^2_{{\rm ts},n}\, p^0_{0,n} \widehat{p}_{0,n} & s^2_{{\rm ts},n}\, \widehat{p}_{0,n} \widehat{p}_{0,n} \end{bmatrix}.
\]
The empirical convergence (73) yields the following limit,
\[
\lim_{N\to\infty} M^{(N)} = M := \mathbb{E}\left[ S^2_{\rm ts} \begin{bmatrix} P^0_0 P^0_0 & P^0_0 \widehat{P}_0 \\ P^0_0 \widehat{P}_0 & \widehat{P}_0 \widehat{P}_0 \end{bmatrix} \right]. \tag{75}
\]
It suffices to show that the distribution of $[z^{(N)}_{\rm ts}\ \widehat{z}^{(N)}_{\rm ts}]$ converges to the distribution of $[Z_{\rm ts}\ \widehat{Z}_{\rm ts}]$ in the Wasserstein-2 metric as $N\to\infty$. (See the discussion in Appendix A on the equivalence of convergence in the Wasserstein-2 metric and PL(2) convergence.)

Now, the Wasserstein-2 distance between two probability measures $\nu_1$ and $\nu_2$ is defined as
\[
W_2(\nu_1, \nu_2) = \inf_{\gamma \in \Gamma} \left( \mathbb{E}_{\gamma} \|X_1 - X_2\|^2 \right)^{1/2}, \tag{76}
\]
where $\Gamma$ is the set of probability distributions on the product space with marginals consistent with $\nu_1$ and $\nu_2$. For Gaussian measures $\nu_1 = \mathcal{N}(0, \Sigma_1)$ and $\nu_2 = \mathcal{N}(0, \Sigma_2)$ we have (Givens et al., 1984)
\[
W_2^2(\nu_1, \nu_2) = \mathrm{tr}\big(\Sigma_1 - 2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2} + \Sigma_2\big).
\]
Therefore, for the Gaussian distributions $\nu_1^{(N)} = \mathcal{N}(0, M^{(N)})$ and $\nu_2 = \mathcal{N}(0, M)$, the convergence (75) implies $W_2(\nu_1^{(N)}, \nu_2) \to 0$, i.e., convergence in Wasserstein-2 distance. Hence,
\[
(z^{(N)}_{\rm ts}, \widehat{z}^{(N)}_{\rm ts}) \xrightarrow{W_2} (Z_{\rm ts}, \widehat{Z}_{\rm ts}) \sim \mathcal{N}(0, M),
\]
where $M$ is the covariance matrix in (75). Hence the convergence holds in the PL(2) sense (see the discussion in Appendix A on the equivalence of convergence in $W_2$ and PL(2) convergence).

Hence the asymptotic generalization error (17) is
\[
\mathcal{E}_{\rm ts} := \lim_{N\to\infty} \mathbb{E}\, f_{\rm ts}(y_{\rm ts}, \widehat{y}_{\rm ts})
\stackrel{(a)}{=} \lim_{N\to\infty} \mathbb{E}\, f_{\rm ts}\big(\phi_{\rm out}(z^{(N)}_{\rm ts}, D), \phi(\widehat{z}^{(N)}_{\rm ts})\big)
\stackrel{(b)}{=} \mathbb{E}\, f_{\rm ts}\big(\phi_{\rm out}(Z_{\rm ts}, D), \phi(\widehat{Z}_{\rm ts})\big), \tag{77}
\]
where (a) follows from (3), and step (b) follows from the continuity assumption in Assumption 1(b) along with the definition of PL(2) convergence in Def. 3. This proves part (c).

F. Formula for M

For the special cases in the next Appendix, it is useful to derive expressions for the entries of the covariance matrix $M$ in (75). For the term $m_{11}$,
\[
m_{11} = \mathbb{E}\, S^2_{\rm ts}(P^0_0)^2 = \mathbb{E}\, S^2_{\rm ts}\, \mathbb{E}(P^0_0)^2 = \mathbb{E}\, S^2_{\rm ts} \cdot k_{11}, \tag{78}
\]
where we have used the fact that $P^0_0 \perp\!\!\!\perp (S_{\rm ts}, S_{\rm tr})$. Next,
\[
m_{12} = \mathbb{E}\, S^2_{\rm ts}\, P^0_0 \widehat{P}_0,
\]
where
\[
\widehat{P}_0 = g^{-}_1\big(P^0_0 + P^{+}_0,\ Z^0_1 + Q^{-}_1,\ \gamma^{+}_0, \gamma^{-}_1, S_{\rm tr}\big)
= \frac{\gamma^{+}_0 P^{+}_0 + S_{\rm tr}\gamma^{-}_1 Q^{-}_1}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1} + P^0_0, \tag{79}
\]
and $(P^0_0, P^{+}_0, Q^{-}_1)$ are independent of $(S_{\rm tr}, S_{\rm ts})$. Hence,
\[
m_{12} = \mathbb{E}\, S^2_{\rm ts} \cdot \mathbb{E}(P^0_0)^2 + \mathbb{E}\!\left[\frac{S^2_{\rm ts}\gamma^{+}_0}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1}\right] \mathbb{E}\big[P^0_0 P^{+}_0\big]
= m_{11} + \mathbb{E}\!\left[\frac{S^2_{\rm ts}\gamma^{+}_0}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1}\right] \cdot k_{12}, \tag{80}
\]
since $\mathbb{E}[P^0_0 Q^{-}_1] = 0$ and $K^{+}_0$ is the covariance matrix of $(P^0_0, P^{+}_0)$ from line 23.

Finally, for $m_{22}$ we have
\[
m_{22} = \mathbb{E}\, S^2_{\rm ts}\, \widehat{P}_0 \widehat{P}_0
= \mathbb{E}\!\left[\left(\frac{S_{\rm ts}\gamma^{+}_0}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1}\right)^{\!2}\right]\mathbb{E}(P^{+}_0)^2
+ \mathbb{E}\!\left[\left(\frac{S_{\rm ts} S_{\rm tr}\gamma^{-}_1}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1}\right)^{\!2}\right]\mathbb{E}(Q^{-}_1)^2
+ \mathbb{E}\, S^2_{\rm ts}\, \mathbb{E}(P^0_0)^2 + 2\,\mathbb{E}\!\left[\frac{\gamma^{+}_0 S^2_{\rm ts}}{\gamma^{+}_0 + \gamma^{-}_1 S^2_{\rm tr}}\right]\mathbb{E}\big[P^0_0 P^{+}_0\big]
\]
\[
= k_{22}\,\mathbb{E}\!\left[\left(\frac{S_{\rm ts}\gamma^{+}_0}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1}\right)^{\!2}\right]
+ \tau^{-}_1\,\mathbb{E}\!\left[\left(\frac{S_{\rm ts} S_{\rm tr}\gamma^{-}_1}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1}\right)^{\!2}\right]
- m_{11} + 2 m_{12}. \tag{81}
\]

G.1. Linear Output with Square Error

In this section we examine a few special cases of the GLM problem (2). We first consider a linear output with additive Gaussian noise and a squared-error training and test loss. Specifically, consider the model
\[
y = X w^0 + d. \tag{82}
\]
We consider estimates of $w^0$ of the form
\[
\widehat{w} = \operatorname*{argmin}_{w}\ \tfrac{1}{2}\|y - Xw\|^2 + \tfrac{\lambda}{2\beta}\|w\|^2. \tag{83}
\]
The factor $\beta$ is added above since the two terms scale with a ratio of $\beta$; it does not change the analysis. Consider the ML-VAMP GLM learning algorithm applied to this problem. The following corollary follows from the main result in Theorem 1.

Corollary 1 (Squared error). For linear regression, i.e., $\phi(t) = t$, $\phi_{\rm out}(t, d) = t + d$, $f_{\rm ts}(y, \widehat{y}) = (y_{\rm ts} - \widehat{y}_{\rm ts})^2$, $F_{\rm out}(p_2) = \tfrac{1}{2}\|y - p_2\|^2$, we have
\[
\mathcal{E}^{\rm LR}_{\rm ts} = \mathbb{E}\!\left[\left(\frac{\gamma^{+}_0 S_{\rm ts}}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1}\right)^{\!2}\right] k_{22}
+ \mathbb{E}\!\left[\left(\frac{\gamma^{-}_1 S_{\rm tr} S_{\rm ts}}{\gamma^{+}_0 + S^2_{\rm tr}\gamma^{-}_1}\right)^{\!2}\right] \tau^{-}_1 + \sigma_d^2.
\]
The quantities $k_{22}$, $\tau^{-}_1$, $\gamma^{+}_0$, $\gamma^{-}_1$ depend on the choice of regularizer $\lambda$ and the covariance between features.

Proof. This follows directly from the observation
\[
\mathcal{E}^{\rm LR}_{\rm ts} = \mathbb{E}(Z_{\rm ts} + D - \widehat{Z}_{\rm ts})^2 = \mathbb{E}(Z_{\rm ts} - \widehat{Z}_{\rm ts})^2 + \mathbb{E} D^2
= m_{11} + m_{22} - 2 m_{12} + \sigma_d^2.
\]
Substituting equation (81) proves the claim. $\square$
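To make the role of the state-evolution constants in Corollary 1 concrete, the following minimal Python sketch evaluates the squared-error expression above by Monte Carlo over a joint draw of $(S_{\rm tr}, S_{\rm ts})$. The function name and all numerical values (the constants $k_{22}$, $\tau^{-}_1$, $\gamma^{+}_0$, $\gamma^{-}_1$, $\sigma_d^2$ and the sampled scale factors) are illustrative placeholders, not outputs of the state evolution.

```python
import numpy as np

def ets_lr(k22, tau1m, gamma0p, gamma1m, sigma_d2, s_tr, s_ts):
    """Monte Carlo evaluation of the squared-error expression in Corollary 1.

    s_tr, s_ts : paired samples from the joint distribution of (S_tr, S_ts).
    The remaining arguments are the state-evolution constants, treated here
    as given numbers (placeholders in this sketch).
    """
    denom = gamma0p + s_tr ** 2 * gamma1m
    bias_coef = np.mean((gamma0p * s_ts / denom) ** 2)        # multiplies k22
    var_coef = np.mean((gamma1m * s_tr * s_ts / denom) ** 2)  # multiplies tau1m
    return bias_coef * k22 + var_coef * tau1m + sigma_d2

# Example: i.i.d. features without train-test mismatch, S_tr = S_ts = 1.
s = np.ones(100_000)
print(ets_lr(k22=1.0, tau1m=0.5, gamma0p=0.1, gamma1m=1.0,
             sigma_d2=0.1, s_tr=s, s_ts=s))
```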

G.2. Ridge Regression with i.i.d. Covariates

We next consider the special case when the input features are independent, i.e., (83) where the rows of $X$ corresponding to the training data have i.i.d. Gaussian features with covariance $P_{\rm train} = \tfrac{\sigma_{\rm tr}^2}{p} I$ and $S_{\rm tr} = \sigma_{\rm tr}$.

Although the solution to (83) exists in closed form, $(X^{\sf T}X + \lambda I)^{-1}X^{\sf T}y$, we can study the effect of the regularization parameter $\lambda$ on the generalization error $\mathcal{E}_{\rm ts}$ as detailed in the result below.

Corollary 2. Consider the ridge regression problem (83) with regularization parameter $\lambda > 0$. For the squared loss, i.e., $f_{\rm ts}(y, \widehat{y}) = (y - \widehat{y})^2$, and i.i.d. Gaussian features without train-test mismatch, i.e., $S_{\rm tr} = S_{\rm ts} = \sigma_{\rm tr}$, the generalization error $\mathcal{E}^{\rm RR}_{\rm ts}$ is given by Corollary 1, with constants
\[
k_{22} = \mathrm{Var}(W^0), \qquad \gamma^{+}_0 = \lambda/\beta,
\]
\[
\gamma^{-}_1 = \begin{cases}
\dfrac{1}{G} - \dfrac{\lambda}{\sigma_{\rm tr}^2\beta} & \beta < 1, \\[2mm]
\dfrac{\frac{\lambda}{\sigma_{\rm tr}^2\beta}\left(\frac{1}{G} - \frac{\lambda}{\sigma_{\rm tr}^2\beta}\right)}{\frac{\beta-1}{G} + \frac{\lambda}{\sigma_{\rm tr}^2\beta}} & \beta > 1,
\end{cases}
\]
where $G = G_{\rm mp}\!\big(-\tfrac{\lambda}{\sigma_{\rm tr}^2\beta}\big)$, with $G_{\rm mp}$ given in Appendix H, and $\tau^{-}_1 = \mathbb{E}(P^{-}_1)^2$, where $P^{-}_1$ is given in equation (95) in the proof.
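Before turning to the proof, a quick finite-size check can be run by direct simulation: draw a ridge problem with the stated feature scaling, solve (83) in closed form, and estimate the test error on fresh samples. Corollary 2 describes the limit of this quantity as $N, p \to \infty$ with $p/N = \beta$; the sketch below is only such a finite-$N$ experiment, with illustrative problem sizes and parameter values.

```python
import numpy as np

def ridge_test_error(N, beta, lam, sigma_tr, sigma_d, n_test=10_000, seed=0):
    """Finite-size estimate of the generalization error for the ridge problem (83)."""
    rng = np.random.default_rng(seed)
    p = int(round(beta * N))
    w0 = rng.normal(size=p)                                   # true weights, Var(W0) = 1
    X = rng.normal(scale=sigma_tr / np.sqrt(p), size=(N, p))  # covariance (sigma_tr^2 / p) I
    y = X @ w0 + sigma_d * rng.normal(size=N)
    # Minimizer of (1/2)||y - X w||^2 + (lam / (2 beta)) ||w||^2.
    w_hat = np.linalg.solve(X.T @ X + (lam / beta) * np.eye(p), X.T @ y)
    # Fresh test samples from the same feature distribution (no mismatch).
    X_ts = rng.normal(scale=sigma_tr / np.sqrt(p), size=(n_test, p))
    y_ts = X_ts @ w0 + sigma_d * rng.normal(size=n_test)
    return np.mean((y_ts - X_ts @ w_hat) ** 2)

print(ridge_test_error(N=1000, beta=0.5, lam=0.1, sigma_tr=1.0, sigma_d=0.3))
```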

Proof of Corollary 2. We are interested in identifying the following constants appearing in Corollary 1:
\[
K^{+}_0,\quad \tau^{-}_1,\quad \gamma^{+}_0,\quad \gamma^{-}_1.
\]
These quantities are obtained as fixed points of the state evolution equations in Algorithm 2. We explain below how to obtain expressions for these constants. Since these are fixed points, we ignore the subscript $k$ corresponding to the iteration number in Algorithm 2.

In the case of problem (83), the maps $\mathrm{prox}_{f_{\rm in}}$ and $\mathrm{prox}_{f_{\rm out}}$, i.e., $g^{+}_0$ and $g^{-}_3$ respectively, can be expressed as closed-form formulae. This leads to a simplification of the SE equations, as explained below.

We start by looking at the forward pass (finding quantities with superscript "+") of Algorithm 2 for the different layers, and then the backward pass (finding quantities with superscript "$-$"), to get the parameters $\{K^{+}_\ell, \tau^{-}_\ell, \alpha^{\pm}_\ell, \gamma^{\pm}_\ell\}$ for $\ell = 0, 1, 2$.

To begin with, notice that $f_{\rm in}(w) = \tfrac{\lambda}{2} w^2$, and therefore the denoiser $g^{+}_0(\cdot)$ in (44) is simply
\[
g^{+}_0(r^{-}_0, \gamma^{-}_0) = \frac{\gamma^{-}_0}{\gamma^{-}_0 + \lambda/\beta}\, r^{-}_0, \qquad \frac{\partial g^{+}_0}{\partial r^{-}_0} = \frac{\gamma^{-}_0}{\gamma^{-}_0 + \lambda/\beta}.
\]
Using the random variable $R^{-}_0$ and substituting it into the expression of the denoiser to get $\widehat{Z}_0$, we can now calculate $\alpha^{+}_0$ using lines 20 and 22,
\[
\alpha^{+}_0 = \frac{\gamma^{-}_0}{\gamma^{-}_0 + \lambda/\beta}, \qquad \gamma^{+}_0 = \lambda/\beta. \tag{84}
\]
Similarly, we have $f_{\rm out}(p_2) = \tfrac{1}{2}(p_2 - y)^2$, whereby the output denoiser $g^{-}_3(\cdot)$ in the last layer for ridge regression is given by
\[
g^{-}_3(r^{+}_2, \gamma^{+}_2, y) = \frac{\gamma^{+}_2 r^{+}_2 + y}{\gamma^{+}_2 + 1}. \tag{85}
\]
By substituting this denoiser in line 30 of the algorithm we get $\widehat{P}_2$, and thus, following lines 35-38 of the algorithm, we have
\[
\alpha^{-}_2 = \frac{\gamma^{+}_2}{\gamma^{+}_2 + 1}, \qquad \text{whereby}\quad \gamma^{-}_2 = 1. \tag{86}
\]
Having identified the constants $\alpha^{+}_0, \gamma^{+}_0, \alpha^{-}_2, \gamma^{-}_2$, we now sequentially identify the quantities
\[
(\alpha^{+}_0, \gamma^{+}_0) \to K^{+}_0 \to (\alpha^{+}_1, \gamma^{+}_1) \to K^{+}_1 \to (\alpha^{+}_2, \gamma^{+}_2) \to K^{+}_2
\]
in the forward pass, and then the quantities
\[
\tau^{-}_0 \leftarrow (\alpha^{-}_0, \gamma^{-}_0) \leftarrow \tau^{-}_1 \leftarrow (\alpha^{-}_1, \gamma^{-}_1) \leftarrow \tau^{-}_2 \leftarrow (\alpha^{-}_2, \gamma^{-}_2)
\]
in the backward pass. We also note that
\[
\alpha^{+}_\ell + \alpha^{-}_\ell = 1. \tag{87}
\]

Forward Pass: Observe that $K^{+}_0 = \mathrm{Cov}(Z^0_0, Q^{+}_0)$. Now, from line 21, on simplification we get $Q^{+}_0 = -W^0$, whereby
\[
K^{+}_0 = \mathrm{Var}(W^0)\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}. \tag{88}
\]
Notice that from line 23, the pair $(P^0_0, P^{+}_0)$ is jointly Gaussian with covariance matrix $K^{+}_0$. But the above equation means that $P^{+}_0 = -P^0_0$, whereby $R^{+}_0 = 0$ from line 17.

Now, the linear denoiser $g^{+}_1(\cdot)$ is defined as in (47a). Note that since we are considering i.i.d. Gaussian features for this problem, the random variable $S_{\rm tr}$ in this layer is the constant $\sigma_{\rm tr}$. Therefore, similar to layer $\ell = 0$, by evaluating lines 17-23 of the algorithm we get $Q^{+}_1 = -Z^0_1$, whereby
\[
\alpha^{+}_1 = \frac{\sigma_{\rm tr}^2\gamma^{-}_1}{\gamma^{+}_0 + \sigma_{\rm tr}^2\gamma^{-}_1}, \qquad \gamma^{+}_1 = \frac{\gamma^{+}_0}{\sigma_{\rm tr}^2} = \frac{\lambda}{\sigma_{\rm tr}^2\beta}, \qquad K^{+}_1 = \sigma_{\rm tr}^2 K^{+}_0. \tag{89}
\]

Observe that this means
\[
P^{+}_1 = -P^0_1. \tag{90}
\]
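For reference, the closed-form denoisers used in the forward pass above are elementary scalar maps; the sketch below transcribes (84)-(85) directly (the function and variable names are ours, not the paper's).

```python
def g0_plus(r0_minus, gamma0_minus, lam, beta):
    """Input denoiser for f_in(w) = (lam / 2) w^2: shrinkage toward zero, cf. (84)."""
    return gamma0_minus / (gamma0_minus + lam / beta) * r0_minus

def g3_minus(r2_plus, gamma2_plus, y):
    """Output denoiser for f_out(p2) = (1/2)(p2 - y)^2: convex combination of r2 and y, cf. (85)."""
    return (gamma2_plus * r2_plus + y) / (gamma2_plus + 1.0)

def alpha0_plus(gamma0_minus, lam, beta):
    """Derivative of g0_plus with respect to its first argument, as in (84)."""
    return gamma0_minus / (gamma0_minus + lam / beta)
```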

Backward Pass: Since $Y = \phi_{\rm out}(P^0_2, D) = P^0_2 + D$, line 36 of the algorithm on simplification yields $P^{-}_2 = D$, whereby we can get $\tau^{-}_2$,
\[
\tau^{-}_2 = \mathbb{E}(P^{-}_2)^2 = \mathbb{E}[D^2] = \sigma_d^2. \tag{91}
\]
Next, to calculate the terms $(\alpha^{-}_1, \gamma^{-}_1)$, we use the denoiser $g^{-}_2$ defined in (47a) for line 33 of the algorithm to get $\widehat{P}_1$:
\[
\widehat{P}_1 = \frac{\gamma^{+}_1 R^{+}_1 + S_{\rm mp}\gamma^{-}_2 R^{-}_2}{\gamma^{+}_1 + (S_{\rm mp})^2\gamma^{-}_2}
= \frac{S_{\rm mp}(S_{\rm mp} P^0_1 + Q^{-}_2)}{\gamma^{+}_1 + (S_{\rm mp})^2}, \tag{92}
\]
where we have used $\gamma^{-}_2 = 1$, $R^{+}_1 = P^0_1 + P^{+}_1 = 0$ due to (90), and $R^{-}_2 = Z^0_2 + Q^{-}_2 = S_{\rm mp} P^0_1 + Q^{-}_2$ from lines 17, 32 and 4, respectively.

Then, we calculate $\alpha^{-}_1$ and $\gamma^{-}_1$ as $\alpha^{-}_1 = \mathbb{E}\,\frac{\partial g^{-}_2}{\partial P^{+}_1} = \mathbb{E}\!\left[\frac{\gamma^{+}_1}{\gamma^{+}_1 + (S_{\rm mp})^2}\right]$. This gives
\[
\alpha^{-}_1 = \begin{cases}
\dfrac{\lambda}{\sigma_{\rm tr}^2\beta}\, G & \beta < 1, \\[2mm]
\left(1 - \dfrac{1}{\beta}\right) + \dfrac{1}{\beta}\,\dfrac{\lambda}{\sigma_{\rm tr}^2\beta}\, G & \beta \ge 1.
\end{cases} \tag{93}
\]
Here, in the overparameterized case ($\beta > 1$), the denoiser $g^{-}_2$ outputs $R^{+}_1$ with probability $1 - \tfrac{1}{\beta}$ and $\tfrac{\lambda}{\sigma_{\rm tr}^2\beta} G$ with probability $\tfrac{1}{\beta}$.

Next, from line 37 we get
\[
\gamma^{-}_1 = \left(\frac{1}{\alpha^{-}_1} - 1\right)\gamma^{+}_1
= \begin{cases}
\dfrac{1}{G} - \dfrac{\lambda}{\sigma_{\rm tr}^2\beta} & \beta < 1, \\[2mm]
\dfrac{\frac{\lambda}{\sigma_{\rm tr}^2\beta}\left(\frac{1}{G} - \frac{\lambda}{\sigma_{\rm tr}^2\beta}\right)}{\frac{\beta-1}{G} + \frac{\lambda}{\sigma_{\rm tr}^2\beta}} & \beta > 1.
\end{cases} \tag{94}
\]
Now, from line 36 and equation (87) we get
\[
\alpha^{+}_1 P^{-}_1 = \widehat{P}_1 - P^0_1 - \alpha^{-}_1 P^{+}_1
\stackrel{(a)}{=} \widehat{P}_1 - \alpha^{+}_1 P^0_1
\stackrel{(b)}{=} \underbrace{\left(\frac{S^{-}_{\rm mp} S^{+}_{\rm mp}}{\frac{\lambda}{\sigma_{\rm tr}^2\beta} + (S^{-}_{\rm mp})^2} - \alpha^{+}_1\right)}_{A} P^0_1
+ \underbrace{\left(\frac{S^{-}_{\rm mp}}{\frac{\lambda}{\sigma_{\rm tr}^2\beta} + (S^{-}_{\rm mp})^2}\right)}_{B} Q^{-}_2, \tag{95}
\]
where (a) follows from (90) and (87), and (b) follows from (92). From this one can obtain $\tau^{-}_1 = \mathbb{E}(P^{-}_1)^2$, which can be calculated using the knowledge that $P^0_1, Q^{-}_2$ are independent Gaussians with covariances $\mathbb{E}(P^0_1)^2 = \sigma_{\rm tr}^2\mathrm{Var}(W^0)$ and $\mathbb{E}(Q^{-}_2)^2 = \sigma_d^2$. Further, $P^0_1, Q^{-}_2$ are independent of $(S^{+}_{\rm mp}, S^{-}_{\rm mp})$.

Observe that by (95) we have
\[
\tau^{-}_1 = \frac{1}{(\alpha^{+}_1)^2}\Big( \mathbb{E}(A^2)\,\sigma_{\rm tr}^2\mathrm{Var}(W^0) + \mathbb{E}(B^2)\,\sigma_d^2 \Big). \tag{96}
\]
With some simplification we get
\[
\mathbb{E}(A^2) = \left(\frac{\lambda}{\sigma_{\rm tr}^2\beta}\right)^{\!2} G' - \left(\frac{\lambda}{\sigma_{\rm tr}^2\beta}\, G\right)^{\!2}, \tag{97a}
\]
\[
\mathbb{E}(B^2) = G - \frac{\lambda}{\sigma_{\rm tr}^2\beta}\, G', \tag{97b}
\]
where $G = G_{\rm mp}\!\big(-\tfrac{\lambda}{\sigma_{\rm tr}^2\beta}\big)$, with $G_{\rm mp}$ given in Appendix H, and $G'$ is the derivative of $G_{\rm mp}$ evaluated at $-\tfrac{\lambda}{\sigma_{\rm tr}^2\beta}$.

Now consider the under-parametrized case ($\beta < 1$). Let $u = -\tfrac{\lambda}{\sigma_{\rm tr}^2\beta}$ and $z = G_{\rm mp}(u)$. In this case we have
\[
\alpha^{+}_1 = 1 - \frac{\lambda}{\sigma_{\rm tr}^2\beta}\, G = 1 + uz. \tag{98}
\]
Note that
\[
G_{\rm mp}^{-1}(z) = u \ \stackrel{(a)}{\Longrightarrow}\ R_{\rm mp}(z) + \frac{1}{z} = u \ \stackrel{(b)}{\Longrightarrow}\ \frac{1}{1 - \beta z} + \frac{1}{z} = u, \tag{99a}
\]
where $R_{\rm mp}(\cdot)$ is the R-transform defined in (Tulino et al., 2004), (a) follows from the relationship between the R- and Stieltjes-transforms, and (b) follows from the fact that for the Marchenko-Pastur distribution we have $R_{\rm mp}(z) = \frac{1}{1 - z\beta}$. Therefore,
\[
G_{\rm mp}\!\left(\frac{1}{1-\beta z} + \frac{1}{z}\right) = z
\ \Longrightarrow\ G_{\rm mp}'\!\left(\frac{1}{1-\beta z} + \frac{1}{z}\right) = G' = \frac{1}{\frac{\beta}{(1-\beta z)^2} - \frac{1}{z^2}}. \tag{100}
\]
For the over-parametrized case ($\beta > 1$) we have
\[
\alpha^{+}_1 = \frac{1}{\beta}\left(1 + \frac{\lambda}{\sigma_{\rm tr}^2\beta}\, G\right) = \frac{1 - uz}{\beta}. \tag{101}
\]
In this case, as mentioned in Appendix H and following the results from (Tulino et al., 2004), the measure $\mu_\beta$ scales with $\beta$ and thus $R_{\rm mp}(z) = \frac{\beta}{1-z}$. Therefore, similar to (99a), $z$ satisfies
\[
\frac{\beta}{1-z} + \frac{1}{z} = u \ \Longrightarrow\ G' = \frac{1}{\frac{\beta}{(1-z)^2} - \frac{1}{z^2}}. \tag{102}
\]
Now $\tau^{-}_1$ can be calculated as follows:
\[
\tau^{-}_1 = \eta\Big( u^2 z^2 \sigma_{\rm tr}^2\mathrm{Var}(W^0)(\kappa - 1) + \sigma_d^2\, z(u z \kappa + 1) \Big), \tag{103}
\]
where
\[
\eta = \begin{cases} \dfrac{1}{1+uz} & \beta < 1, \\[2mm] \dfrac{\beta}{1-uz} & \beta \ge 1, \end{cases}
\qquad
\kappa = \begin{cases} \dfrac{(1-\beta z)^2}{\beta z^2 - (1-\beta z)^2} & \beta < 1, \\[2mm] \dfrac{(1-z)^2}{\beta z^2 - (1-z)^2} & \beta \ge 1, \end{cases} \tag{104}
\]
and $z$ is the solution of the fixed-point equation
\[
\begin{cases} \dfrac{1}{1-\beta z} + \dfrac{1}{z} = u & \beta < 1, \\[2mm] \dfrac{\beta}{1-z} + \dfrac{1}{z} = u & \beta \ge 1. \end{cases} \tag{105}
\]
This identifies all the constants required by Corollary 1 and completes the proof of Corollary 2. $\square$
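The constant $G = G_{\rm mp}(-\lambda/(\sigma_{\rm tr}^2\beta))$ entering Corollary 2 can be estimated directly from the definition (115) by sampling the eigenvalues of $U^{\sf T}U$ with $U_{ij} \sim \mathcal{N}(0, 1/p)$. The sketch below does this for $\beta < 1$, where all eigenvalues are positive; for $\beta > 1$ the zero eigenvalues would require the point-mass bookkeeping discussed in the proof of Corollary 3 below. Problem sizes are illustrative.

```python
import numpy as np

def gmp_mc(z, beta, N=1000, seed=0):
    """Monte Carlo estimate of Gmp(z) = E[ 1{Smp > 0} / (Smp^2 - z) ], where Smp^2
    follows the empirical eigenvalue distribution of U^T U and U_ij ~ N(0, 1/p)
    (see Appendix H).  Intended here for beta < 1, where no zero eigenvalues occur."""
    rng = np.random.default_rng(seed)
    p = int(round(beta * N))
    U = rng.normal(size=(N, p)) / np.sqrt(p)
    eigs = np.linalg.eigvalsh(U.T @ U)
    return np.mean(1.0 / (eigs - z))

lam, beta, sigma_tr = 0.1, 0.5, 1.0
G = gmp_mc(z=-lam / (sigma_tr ** 2 * beta), beta=beta)
print(G)
```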

G.3. Ridgeless Linear Regression

Here we consider the case of ridge regression (83) when $\lambda \to 0^+$. Note that the solution $(X^{\sf T}X + \lambda I)^{-1}X^{\sf T}y$ to problem (83) remains unique since $\lambda > 0$. The following result was stated in (Hastie et al., 2019) and can be recovered using our methodology. Note, however, that we calculate the generalization error whereas they calculate the squared error, whereby we obtain an additional additive factor of $\sigma_d^2$. The result explains the double descent phenomenon for ridgeless linear regression.

Corollary 3. For ridgeless linear regression, we have
\[
\lim_{\lambda\to 0^+} \mathcal{E}^{\rm RR}_{\rm ts} = \begin{cases}
\dfrac{1}{1-\beta}\,\sigma_d^2 & \beta < 1, \\[2mm]
\dfrac{\beta}{\beta-1}\,\sigma_d^2 + \left(1 - \dfrac{1}{\beta}\right)\sigma_{\rm tr}^2\mathrm{Var}(W^0) & \beta \ge 1.
\end{cases}
\]

Proof of Corollary 3. We calculate the parameters $\gamma^{+}_0$, $\gamma^{-}_1$, $k_{22}$ and $\tau^{-}_1$ when $\lambda \to 0^+$. Before starting off, we note that
\[
G_0 := \lim_{z\to 0^+} G_{\rm mp}(-z) = \begin{cases} \dfrac{\beta}{1-\beta} & \beta < 1, \\[2mm] \dfrac{\beta}{\beta-1} & \beta > 1, \end{cases} \tag{106}
\]
as described in Appendix H. Following the derivations in Corollary 2, we have
\[
\gamma^{+}_0 = \lambda/\beta, \qquad k_{22} = \mathrm{Var}(W^0). \tag{107}
\]
Now, for $\lambda \to 0^+$, we have
\[
1 - \alpha^{-}_1 = \begin{cases} 1 & \beta < 1, \\[1mm] \dfrac{1}{\beta} & \beta \ge 1, \end{cases}
\qquad
\gamma^{-}_1 = \begin{cases} \dfrac{1}{G_0} = \dfrac{1-\beta}{\beta} & \beta < 1, \\[2mm] \dfrac{\lambda}{(\beta-1)\sigma_{\rm tr}^2\beta} & \beta \ge 1. \end{cases} \tag{108}
\]
Using this in simplifying (95) for $\lambda \to 0^+$, we get
\[
\tau^{-}_1 = \mathbb{E}(P^{-}_1)^2 = \begin{cases} \sigma_d^2\, G_0 & \beta < 1, \\ \beta\sigma_d^2 G_0 + \sigma_{\rm tr}^2\mathrm{Var}(W^0)(\beta - 1) & \beta \ge 1, \end{cases}
\]
where, during the evaluation of $\mathbb{E}\!\left[\frac{S_{\rm mp}}{\gamma^{+}_1 + (S_{\rm mp})^2}\right]$, for the case of $\beta > 1$ we need to account for the point mass at $0$ for $S_{\rm mp}$ with weight $1 - \tfrac{1}{\beta}$.

Next, notice that
\[
a := \frac{\gamma^{+}_0\,\sigma_{\rm tr}}{\gamma^{+}_0 + \gamma^{-}_1\sigma_{\rm tr}^2} = \begin{cases} 0 & \beta < 1, \\[1mm] \left(1 - \dfrac{1}{\beta}\right)\sigma_{\rm tr} & \beta \ge 1, \end{cases}
\]
and
\[
b := \frac{\gamma^{-}_1\sigma_{\rm tr}^2}{\gamma^{+}_0 + \gamma^{-}_1\sigma_{\rm tr}^2} = \begin{cases} 1 & \beta < 1, \\[1mm] \dfrac{1}{\beta} & \beta \ge 1. \end{cases}
\]
Thus, applying Corollary 1, we get
\[
\mathcal{E}^{\rm RR}_{\rm ts} = a^2 k_{22} + b^2 \tau^{-}_1 + \sigma_d^2
= \begin{cases} \dfrac{1}{1-\beta}\,\sigma_d^2 & \beta < 1, \\[2mm] \dfrac{\beta}{\beta-1}\,\sigma_d^2 + \left(1 - \dfrac{1}{\beta}\right)\sigma_{\rm tr}^2\mathrm{Var}(W^0) & \beta \ge 1. \end{cases}
\]
This proves the claim. $\square$

G.4. Train-Test Mismatch

Observe that our formulation allows for analyzing the effect of mismatch between the training and test distributions. One can consider arbitrary joint distributions over $(S_{\rm tr}, S_{\rm ts})$ that model the mismatch between training and test features. Here we give a simple example which highlights the effect of this mismatch.

Definition 4 (Bernoulli $\epsilon$-mismatch). $(S_{\rm ts}, S_{\rm tr})$ has a bivariate Bernoulli distribution with

• $\Pr\{S_{\rm tr} = S_{\rm ts} = 0\} = \Pr\{S_{\rm tr} = S_{\rm ts} = 1\} = (1-\epsilon)/2$

• $\Pr\{S_{\rm tr} = 0, S_{\rm ts} = 1\} = \Pr\{S_{\rm tr} = 1, S_{\rm ts} = 0\} = \epsilon/2$

Notice that the marginal distribution of $S_{\rm tr}$ in the Bernoulli $\epsilon$-mismatch model is such that $\Pr(S_{\rm tr} \neq 0) = \tfrac{1}{2}$. Hence half of the features extracted by the matrix $V_0$ are relevant during training. Similarly, $\Pr(S_{\rm ts} \neq 0) = \tfrac{1}{2}$. However, the features spanned by the test data do not exactly overlap with the features captured in the training data, and the fraction of features common to both the training and test data is $1 - \epsilon$. Hence for $\epsilon = 0$ there is no train-test mismatch, whereas for $\epsilon = 1$ there is a complete mismatch. The following result shows that the generalization error increases linearly with the mismatch parameter $\epsilon$.

Corollary 4 (Mismatch). Consider the problem of linear regression (83) under the conditions of Corollary 1. Additionally, suppose we have Bernoulli $\epsilon$-mismatch between the training and test distributions. Then
\[
\mathcal{E}_{\rm ts} = \frac{k_{22}}{2}\big((1-\epsilon)\gamma^{*2} + \epsilon\big) + \frac{\tau^{-}_1}{2}(1-\gamma^{*})^2(1-\epsilon) + \sigma_d^2,
\]
where $\gamma^{*} := \frac{\gamma^{+}_0}{\gamma^{+}_0 + \gamma^{-}_1}$. The terms $k_{22}$, $\tau^{-}_1$, $\gamma^{*}$ are independent of $\epsilon$.

Proof. This follows directly by calculating the expectations of the terms in Corollary 1, with the joint distribution of $(S_{\rm tr}, S_{\rm ts})$ given in Definition 4. $\square$
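Because $(S_{\rm tr}, S_{\rm ts})$ takes only four values under Definition 4, the expectations in Corollary 1 can be evaluated by direct enumeration. The sketch below does this and compares against the closed form of Corollary 4, illustrating the linear growth of the error in $\epsilon$; the state-evolution constants used are placeholder values, not quantities computed from the paper.

```python
def ets_mismatch_enum(eps, k22, tau1m, gamma0p, gamma1m, sigma_d2):
    """Corollary 1 evaluated by enumerating the four outcomes of Definition 4."""
    outcomes = [  # (probability, S_tr, S_ts)
        ((1 - eps) / 2, 0.0, 0.0),
        ((1 - eps) / 2, 1.0, 1.0),
        (eps / 2, 0.0, 1.0),
        (eps / 2, 1.0, 0.0),
    ]
    c_bias = sum(pr * (gamma0p * sts / (gamma0p + stf ** 2 * gamma1m)) ** 2
                 for pr, stf, sts in outcomes)
    c_var = sum(pr * (gamma1m * stf * sts / (gamma0p + stf ** 2 * gamma1m)) ** 2
                for pr, stf, sts in outcomes)
    return c_bias * k22 + c_var * tau1m + sigma_d2

def ets_mismatch_closed(eps, k22, tau1m, gamma0p, gamma1m, sigma_d2):
    """Closed form of Corollary 4."""
    gstar = gamma0p / (gamma0p + gamma1m)
    return (k22 / 2) * ((1 - eps) * gstar ** 2 + eps) \
        + (tau1m / 2) * (1 - gstar) ** 2 * (1 - eps) + sigma_d2

args = dict(k22=1.0, tau1m=0.5, gamma0p=0.1, gamma1m=1.0, sigma_d2=0.1)
for eps in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(eps, ets_mismatch_enum(eps, **args), ets_mismatch_closed(eps, **args))
```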

The quantities $k_{22}$ and $\tau^{-}_1$ in the result above can be calculated similarly to the derivation in the proof of Corollary 2, and in general depend on the regularization parameter $\lambda$ and the overparameterization parameter $\beta$.

G.5. Logistic Regression

The precise analysis for the special case of the regularized logistic regression estimator with i.i.d. Gaussian features is provided in (Salehi et al., 2019). Consider the logistic regression model,
\[
\Pr(y_i = 1\,|\,x_i) := \rho(x_i^{\sf T} w) \quad \text{for } i = 1, \ldots, N,
\]
where $\rho(x) = \frac{1}{1 + e^{-x}}$ is the standard logistic function. In this problem we consider estimates of $w^0$ of the form
\[
\widehat{w} := \operatorname*{argmin}_{w}\ \mathbf{1}^{\sf T}\log\!\big(1 + e^{Xw}\big) - y^{\sf T} X w + F_{\rm in}(w),
\]
where $F_{\rm in}$ is the regularization function. This is a special case of the optimization problem (2) with
\[
F_{\rm out}(y, Xw) = \mathbf{1}^{\sf T}\log\!\big(1 + e^{Xw}\big) - y^{\sf T} X w. \tag{109}
\]
Similar to the linear regression model, using the ML-VAMP GLM learning algorithm, we can characterize the generalization error for this model with the quantities $K^{+}_0, \tau^{-}_1, \gamma^{+}_0, \gamma^{-}_1$ given by Algorithm 2. We note that in this case the output non-linearity is
\[
\phi_{\rm out}(p_2, d) = \mathbf{1}_{\{\rho(p_2) > d\}}, \tag{110}
\]
where $d \sim \mathrm{Unif}(0, 1)$. Also, the denoisers $g^{+}_0$ and $g^{-}_3$ can be derived as the proximal operators of $F_{\rm in}$ and $F_{\rm out}$ defined in (25).

G.6. Support Vector Machines

The asymptotic generalization error for support vector machines (SVMs) is provided in (Deng et al., 2019). Our model can also handle SVMs. Similar to logistic regression, an SVM finds a linear classifier using the hinge loss instead of the logistic loss. Assuming the class labels are $y = \pm 1$, the hinge loss is
\[
\ell_{\rm hinge}(y, \widehat{y}) = \max(0, 1 - y\widehat{y}). \tag{111}
\]
Therefore, if we take
\[
F_{\rm out}(y, Xw) = \sum_i \max(0, 1 - y_i X_i w), \tag{112}
\]
where $X_i$ is the $i$-th row of the data matrix, the ML-VAMP algorithm for GLMs finds the SVM classifier. The algorithm would use the proximal map of the hinge loss, and our theory provides exact predictions for the estimation and prediction error of the SVM. As with all other models considered in this work, the true underlying data generating model could be anything that can be represented by the graphical model in Figure 1, e.g. a logistic or probit model, and our theory is able to exactly predict the error when the SVM is applied to learn such linear classifiers in the large system limit.
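The output map (110) is simply a way of writing a Bernoulli label in the form $y = \phi_{\rm out}(p_2, d)$ with an independent uniform $d$; the short sketch below makes this explicit and also records the hinge-loss output cost (112). This is an illustration of the two definitions only, not part of the ML-VAMP algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi_out_logistic(p2, d):
    """Output map (110): with d ~ Unif(0,1), returns a Bernoulli(rho(p2)) label."""
    rho = 1.0 / (1.0 + np.exp(-p2))
    return (rho > d).astype(float)

def f_out_hinge(y, Xw):
    """Hinge-loss output cost (112) for labels y in {-1, +1}."""
    return np.sum(np.maximum(0.0, 1.0 - y * Xw))

p2 = np.full(100_000, 0.5)
d = rng.uniform(size=p2.shape)
print(phi_out_logistic(p2, d).mean())   # approximately rho(0.5) = 0.622
```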

H. Marchenko-Pastur distribution

We describe the random variable $S_{\rm mp}$ defined in (12), where $S_{\rm mp}^2$ has a rescaled Marchenko-Pastur distribution. Notice that the positive entries of $s_{\rm mp}$ are the positive eigenvalues of $U^{\sf T}U$ (or $UU^{\sf T}$).

Observe that $U_{ij} \sim \mathcal{N}(0, \tfrac{1}{p})$, whereas the standard scaling while studying the Marchenko-Pastur distribution is for matrices $H$ such that $H_{ij} \sim \mathcal{N}(0, \tfrac{1}{N})$ (see, e.g., equation (1.10) from (Tulino et al., 2004) and the discussion preceding it). Also notice that $\sqrt{\beta}\, U$ has the same distribution as $H$. Thus the results from (Tulino et al., 2004) apply directly to the distributions of eigenvalues of $\beta U^{\sf T}U$ and $\beta UU^{\sf T}$. We state their result below, taking this disparity in scaling into account.

The positive eigenvalues of $\beta U^{\sf T}U$ have an empirical distribution which converges to the following density:
\[
\mu_\beta(x) = \frac{\sqrt{(b_\beta - x)_+ (x - a_\beta)_+}}{2\pi\beta x}, \tag{113}
\]
where $a_\beta := (1 - \sqrt{\beta})^2$ and $b_\beta := (1 + \sqrt{\beta})^2$. Similarly, the positive eigenvalues of $\beta UU^{\sf T}$ have an empirical distribution converging to the density $\beta\mu_\beta$. We note the following integral, which is useful in our analysis:
\[
G_0 := \lim_{z\to 0^-} \mathbb{E}\!\left[\frac{\mathbf{1}_{\{S_{\rm mp} > 0\}}}{S_{\rm mp}^2 - z}\right]
= \lim_{z\to 0^-} \int_{a_\beta}^{b_\beta} \frac{\mu_\beta(x)}{x/\beta - z}\, dx = \frac{\beta}{|\beta - 1|}. \tag{114}
\]
More generally, the Stieltjes transform of the density is given by
\[
G_{\rm mp}(z) = \mathbb{E}\!\left[\frac{\mathbf{1}_{\{S_{\rm mp} > 0\}}}{S_{\rm mp}^2 - z}\right] = \int_{a_\beta}^{b_\beta} \frac{\mu_\beta(x)}{x/\beta - z}\, dx. \tag{115}
\]
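The scaling convention of this appendix can be checked numerically: the sketch below samples $U$ with i.i.d. $\mathcal{N}(0, 1/p)$ entries, compares the extreme positive eigenvalues of $\beta U^{\sf T}U$ with the support endpoints $a_\beta, b_\beta$ of (113), and compares the empirical counterpart of (114) with $\beta/|\beta - 1|$. Matrix sizes are illustrative, and in the $G_0$ check the average is taken over the positive eigenvalues only, which is the convention that reproduces the stated value.

```python
import numpy as np

def mp_check(beta, N=800, seed=0, tol=1e-8):
    """Empirical check of the scaling used in Appendix H (sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    p = int(round(beta * N))
    U = rng.normal(size=(N, p)) / np.sqrt(p)       # U_ij ~ N(0, 1/p)
    eigs = np.linalg.eigvalsh(beta * (U.T @ U))    # spectrum compared against mu_beta
    pos = eigs[eigs > tol]                         # positive eigenvalues only
    a_beta, b_beta = (1 - np.sqrt(beta)) ** 2, (1 + np.sqrt(beta)) ** 2
    # Empirical counterpart of (114): Smp^2 are the eigenvalues of U^T U, i.e. eigs / beta.
    g0_hat = np.mean(beta / pos)
    return {"support": (pos.min(), pos.max(), a_beta, b_beta),
            "G0": (g0_hat, beta / abs(beta - 1))}

for beta in [0.5, 2.0]:
    print(beta, mp_check(beta))
```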