arXiv:2004.14637v2 [stat.ML] 4 May 2020

Generalization Error for Linear Regression under Distributed Learning

Martin Hellkvist, Ayça Özçelikkale, Anders Ahlén
Dept. of Electrical Engineering, Uppsala University, Sweden
{Martin.Hellkvist, Ayca.Ozcelikkale, Anders.Ahlen}@angstrom.uu.se

M. Hellkvist and A. Özçelikkale acknowledge the support from the Swedish Research Council under grant 2015-04011.

Abstract—Distributed learning facilitates the scaling-up of data processing by distributing the computational burden over several nodes. Despite the vast interest in distributed learning, the generalization performance of such approaches is not well understood. We address this gap by focusing on a distributed linear regression setting where the unknowns are distributed over a network of nodes. We present an analytical characterization of the dependence of the generalization error on the partitioning of the unknowns over the nodes. In particular, our results show that, for the overparameterized case, the generalization error of the distributed solution increases dramatically compared to that of the centralized solution when the number of unknowns estimated at any node is close to the number of observations, while the training error remains in the same range. We further provide numerical examples to verify our analytical expressions.

Index Terms—Distributed Learning, Generalization Error.

I. INTRODUCTION

In a standard learning task, the main aim is to be able to estimate the observation y when the corresponding input a is given. Estimation of the unknown model parameters is performed using a set of training data, i.e., pairs (y_i, a_i). How well the trained model can explain the training data is referred to as the training error. A key performance criterion is the generalization error, i.e., how well a trained model can estimate a new observation y for a new input a. In general, a low training error does not always guarantee a low generalization error. If the model performs well on new data, it is said to have low generalization error [4]. Hence, it is of central interest to develop methods that have both low training error and low generalization error.

Distributed learning provides a framework for sharing the high computational burden of the learning task over multiple nodes. The growing need for learning from data within both academia and industry, and the rapid advancement of distributed computing, e.g., over wireless communication networks [1], have led to a vast interest in the field. We contribute to the overall understanding of distributed linear regression [2], [3] by characterizing potential pitfalls of these methods and by providing guidelines in terms of best practice.
Modern machine learning techniques with overparameterized models are often able to exactly fit the training data while still being able to predict well on new data [4]. Although various communication-related challenges of distributed learning, such as energy and quantization constraints [2], [5], [6] and privacy, have been successfully investigated, to the best of our knowledge there has been no attempt to characterize the generalization properties of distributed learning schemes. In this article, we address this gap. In contrast to the setting where the observations are distributed over the nodes (for instance, sensor readings) [5], our approach follows the line of work initiated by the seminal work of [7], where the unknowns are distributed over the network.

We consider a linear model and utilize the successful distributed learning method CoCoA [8]. Our results show that the generalization performance of the distributed solution can heavily depend on the partitioning of the unknowns, although the training error shows no such dependence, i.e., the distributed solution achieves the same level of training accuracy as the centralized approach. Motivated by the success of overparameterized models in machine learning and recent results on the generalization error of such models [4], [9], [10], we pay special attention to the overparameterized case, i.e., the case where the number of unknowns is larger than the number of observations. In particular, if the number of unknowns assigned to any node is close to the number of observations, then the generalization error of the distributed solution takes extremely large values compared to that of the centralized solution. Our main analytical results in Theorem 1 and Lemma 2 present the expectation of the generalization error as a function of the partitioning of the unknowns. These analytical results are verified by numerical results. Using these results, we provide guidelines for the optimal partitioning of the unknowns for distributed learning.

Notation: We denote the Moore-Penrose pseudoinverse and the transpose of a matrix A as A^+ and A^T, respectively. The p × p identity matrix is denoted as I_p, and the matrix of all ones is denoted by 1. We denote a column vector x ∈ R^{p×1} as x = [x_1; ···; x_p], where the semicolon denotes row-wise separation. Throughout the paper, we often partition matrices column-wise and vectors row-wise: the column-wise partitioning of A ∈ R^{n×p} into K blocks is given by A = [A_1, ···, A_K] with A_k ∈ R^{n×p_k}, and the row-wise partitioning of x ∈ R^{p×1} into K blocks is given by x = [x_1; ···; x_K] with x_k ∈ R^{p_k×1}.

II. PROBLEM STATEMENT

We focus on the linear model

    y_i = a_i^T x + w_i,                                                      (1)

where y_i ∈ R is the i-th observation, a_i ∈ R^{p×1} is the i-th regressor, w_i is the unknown disturbance for the i-th observation, and x = [x_1; ···; x_p] ∈ R^{p×1} is the vector of unknown coefficients.

We consider the problem of estimating x given n data points, i.e., pairs of observations and regressors, (y_i, a_i), i = 1, ..., n, by minimizing the following regularized cost function:

    min_{x ∈ R^{p×1}}  (1/2) ‖y − Ax‖_2^2 + (λ/2) ‖x‖_2^2,                    (2)

where A ∈ R^{n×p} is the regressor matrix whose i-th row is given by a_i^T ∈ R^{1×p}. We further denote the first term as f(Ax) = (1/2) ‖y − Ax‖_2^2. The second term (λ/2) ‖x‖_2^2 with λ ≥ 0 denotes the regularization function.
We consider the setting where the regressors a_i^T ∈ R^{1×p} are independent and identically distributed (i.i.d.) with a_i ∼ N(0, I_p). Under this Gaussian regressor model, we focus on the generalization error of the solution to (2) found by the distributed solver CoCoA [8]. Our main focus is on the scenario where λ = 0, w_i = 0, while the solutions with λ > 0 are used for comparison. In the remainder of this section, we define the generalization error. We provide details about our implementation of CoCoA in Section III.

Let w_i = 0, ∀i, and let x̂ be an estimate of x found by using the data pairs (y_i, a_i), i = 1, ..., n. For a given A, the generalization error, i.e., the expected error for estimating y when a new pair (y, a) with a ∼ N(0, I_p) comes, is given by

    E_a[(y − a^T x̂)^2] = E_a[(a^T x − a^T x̂)^2]                              (3)
                       = E_a[tr[(x − x̂)(x − x̂)^T a a^T]]                     (4)
                       = ‖x − x̂‖_2^2,                                         (5)

where a is statistically independent of A and we have used the notation E_a[·] to emphasize that the expectation is over a. Here (5) follows from a ∼ N(0, I_p). We are interested in the expected generalization error over the distribution of the training data,

    ε_G = E_A[‖x − x̂‖_2^2],                                                   (6)

where the expectation is over the regressor matrix A in the training data. In the rest of the paper, we focus on the evolution of ε_G in CoCoA. For notational simplicity, we drop the subscript A from our expectation expressions.
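As a concrete illustration of (5)-(6) (not part of the paper), the sketch below estimates ε_G by Monte Carlo for any estimator that maps training data (A, y) to an estimate x̂; the function names and the estimator interface are assumptions of mine.

```python
import numpy as np

def generalization_error(x, x_hat):
    # Expected squared prediction error for a fresh pair (y, a) with
    # a ~ N(0, I_p) and w = 0; by (3)-(5) it equals ||x - x_hat||^2.
    return np.sum((x - x_hat) ** 2)

def monte_carlo_eps_G(x, estimator, n, num_trials=100, seed=None):
    # Estimate eps_G in (6) by averaging ||x - x_hat||^2 over
    # independently drawn training matrices A with i.i.d. N(0, 1) rows.
    rng = np.random.default_rng(seed)
    p = x.shape[0]
    errors = []
    for _ in range(num_trials):
        A = rng.standard_normal((n, p))
        y = A @ x                      # noiseless observations, w_i = 0
        errors.append(generalization_error(x, estimator(A, y)))
    return np.mean(errors)
```

For instance, estimator = lambda A, y: np.linalg.pinv(A) @ y gives the centralized least-squares baseline used later in Section V.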
III. DISTRIBUTED SOLUTION APPROACH

As the distributed solution approach, we use the iterative approach CoCoA introduced in [8]. In CoCoA, mutually exclusive subsets of the coefficients of x and the associated subsets of columns of A are distributed over K nodes (K ≤ p). Hence, the unknown coefficients are partitioned over the nodes so that each node governs the learning of p_k variables, with Σ_{k=1}^K p_k = p. We denote the part of A available at node k as A_k ∈ R^{n×p_k}. In particular, using this partitioning, y with w_i = 0, ∀i, can be expressed as

    y = Ax = [A_1, ···, A_K] [x_1; ···; x_K] = Σ_{k=1}^K A_k x_k,             (7)

where x_k is the partition at node k. Note that there is no loss of generality due to the specific order of this partitioning structure since the columns of A are i.i.d. (since the rows are i.i.d. with N(0, I_p)).

In CoCoA, at iteration t, node k shares its estimate of y, denoted v_k^t, over the network. Note that the A_k's and the observation vector y are fixed over all iterations. The variables x̂_k^t ∈ R^{p_k×1} and Δx_k^t ∈ R^{p_k×1} are the estimate and its update computed by node k, respectively. Hence, x̂^t and Δx^t are partitioned as x̂^t = [x̂_1^t; ···; x̂_K^t] and Δx^t = [Δx_1^t; ···; Δx_K^t]. The average over all local estimates v_k^t is denoted as v̄^t.

At iteration t, CoCoA solves the following minimization problem at each node [8]:

    min_{Δx_k^t}  ∇f(v̄^t)^T A_k Δx_k^t + (σ'/(2τ)) ‖A_k Δx_k^t‖_2^2 + (λ/2) ‖x̂_k^t + Δx_k^t‖_2^2.   (8)

Using f(Ax) = (1/2)‖y − Ax‖_2^2, we have the smoothness parameter τ = 1 [11]. We set σ' = K since it is considered a safe choice [11]. Only keeping the terms that depend on Δx_k^t reveals that the solution to (8) can be equivalently found by solving the following problem:

    min_{Δx_k^t}  (Δx_k^t)^T ((K/2) A_k^T A_k + (λ/2) I_{p_k}) Δx_k^t + (λ x̂_k^t − A_k^T (y − v̄^t))^T Δx_k^t.   (9)

Taking the derivative with respect to Δx_k^t and setting it to zero, we obtain

    (K A_k^T A_k + λ I_{p_k}) Δx_k^t = −(λ x̂_k^t − A_k^T (y − v̄^t)).          (10)

With λ = 0, the existence of a matrix inverse is not guaranteed. Hence, the local solvers use the Moore-Penrose pseudoinverse to solve (10) as

    Δx_k^t = −(K A_k^T A_k + λ I_{p_k})^+ (λ x̂_k^t − A_k^T (y − v̄^t)).        (11)

The resulting algorithm for estimating x iteratively is presented in Algorithm 1.

Algorithm 1: Implementation of CoCoA [8] (and COLA [11] with W = (1/K) 1 1^T) for (2) with (11).
    Input: Data matrix A distributed column-wise according to partition P. Regularization parameter λ.
    Initialize: x̂^0 = 0 ∈ R^{p×1}, v_k^0 = 0 ∈ R^{n×1}, ∀ k = 1, ..., K
    for t = 0, 1, ..., T do
        v̄^t = (1/K) Σ_{k=1}^K v_k^t
        for k ∈ {1, 2, ..., K} do
            c_k^t = λ x̂_k^t − A_k^T (y − v̄^t)
            Δx_k^t = −(K A_k^T A_k + λ I_{p_k})^+ c_k^t
            x̂_k^{t+1} = x̂_k^t + Δx_k^t
            v_k^{t+1} = v̄^t + K A_k Δx_k^t
        end
    end

In [11], a generalization of CoCoA is presented, named COLA, where a mixing matrix W is introduced to model the quality of the connections between the nodes. For W = (1/K) 1 1^T, COLA reduces to CoCoA, hence our analysis also applies to this special case of COLA.
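The following is a minimal NumPy sketch of Algorithm 1 as reproduced above. It is not the authors' code; the function name cocoa_least_squares, the argument names, and the use of np.linalg.pinv for the pseudoinverse in (11) are my own choices.

```python
import numpy as np

def cocoa_least_squares(A_blocks, y, lam=0.0, T=200):
    """Sketch of Algorithm 1: CoCoA for problem (2) with local update (11).
    A_blocks is the column-wise partition [A_1, ..., A_K] of A."""
    K = len(A_blocks)
    n = y.shape[0]
    x_hat = [np.zeros(Ak.shape[1]) for Ak in A_blocks]   # x_hat^0 = 0
    v = [np.zeros(n) for _ in range(K)]                  # v_k^0 = 0
    for t in range(T):
        v_bar = sum(v) / K                               # shared average estimate of y
        for k, Ak in enumerate(A_blocks):
            c_k = lam * x_hat[k] - Ak.T @ (y - v_bar)
            M_k = K * Ak.T @ Ak + lam * np.eye(Ak.shape[1])
            dx_k = -np.linalg.pinv(M_k) @ c_k            # pseudoinverse, cf. (11)
            x_hat[k] = x_hat[k] + dx_k
            v[k] = v_bar + K * Ak @ dx_k
    return np.concatenate(x_hat)                         # [x_hat_1; ...; x_hat_K]
```

For a partition (p_1, ..., p_K), the blocks can be obtained with np.split(A, np.cumsum([p_1, ..., p_{K-1}]), axis=1).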

IV. PARTITIONING AND THE GENERALIZATION ERROR

This section presents our main results in Theorem 1 and Lemma 2, which reveal how the generalization error changes based on the data partitioning. We first provide a preliminary result to describe the evolution of the estimates of Algorithm 1:

Lemma 1. Using Algorithm 1 with λ = 0, the closed form expression for x̂^{t+1} is given by

    x̂^{t+1} = (I_p − (1/K) [A_1^+; ···; A_K^+] A) x̂^t + (1/K) [A_1^+; ···; A_K^+] y.   (12)

Proof: See Section VII-A.

This result shows that when λ = 0, the estimate in each iteration is a combination of the previous global estimate (x̂^t) and the local least-squares solutions (A_k^+ y) from each node. We now present our main results:

Theorem 1. Let A ∈ R^{n×p} have i.i.d. rows with a_i ∼ N(0, I_p). Using Algorithm 1 with λ = 0, w_i = 0, ∀i, the generalization error in iteration t = 1, i.e., ε_G, is given by

    E[‖x − x̂^1‖_2^2] = Σ_{k=1}^K ‖x_k‖_2^2 α_k,                               (13)

where α_k and γ_k, k = 1, ..., K, are given by

    α_k = (1/K^2) ( K^2 + (1 − 2K) r_{min,k}/p_k + Σ_{i=1, i≠k}^K γ_i ),       (14)

    γ_k = r_{min,k}/(r_{max,k} − r_{min,k} − 1)   for p_k ∉ {n − 1, n, n + 1},  (15a)
    γ_k = +∞                                       otherwise,                   (15b)

and r_{min,k} = min{p_k, n} and r_{max,k} = max{p_k, n}.

Proof: See Section VII-B. Here, while writing the expressions, we have used the notational convention that if any α_k = +∞ and the corresponding ‖x_k‖_2^2 = 0, then that component of (13) is also zero. Note that the infinity, i.e., ∞, in (15b) denotes the indeterminate/infinite values due to divergence of the relevant integrals. Further discussions on this point are provided together with an illustrative example in Section VII-C.
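The quantities in (13)-(15) are straightforward to evaluate numerically. The following sketch (mine, not the authors' code) computes γ_k, α_k and the right-hand side of (13) for a given partition (p_1, ..., p_K); it does not implement the 0·∞ convention stated above and would return NaN in that corner case.

```python
import numpy as np

def gamma(p_k, n):
    # gamma_k in (15): r_min/(r_max - r_min - 1), or +inf when p_k is
    # within one of n and the expression is indeterminate/divergent.
    if p_k in (n - 1, n, n + 1):
        return np.inf
    r_min, r_max = min(p_k, n), max(p_k, n)
    return r_min / (r_max - r_min - 1)

def alpha(partition, n):
    # alpha_k in (14) for a partition (p_1, ..., p_K).
    K = len(partition)
    gammas = [gamma(p_k, n) for p_k in partition]
    alphas = []
    for k, p_k in enumerate(partition):
        r_min_k = min(p_k, n)
        cross = sum(g for i, g in enumerate(gammas) if i != k)
        alphas.append((K**2 + (1 - 2 * K) * r_min_k / p_k + cross) / K**2)
    return alphas

def theoretical_eps_G(x_blocks, partition, n):
    # Right-hand side of (13): sum_k ||x_k||^2 * alpha_k.
    return sum(np.sum(xk**2) * ak
               for xk, ak in zip(x_blocks, alpha(partition, n)))
```

As a sanity check, for K = 1 and n < p this reduces to α_1 = 1 − n/p, the familiar expression for the minimum-norm least-squares solution.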
Theorem 1 shows how the partitioning of x (and hence A) over the nodes affects the generalization error ε_G. Note that the interesting case of p_k ∈ {n − 1, n, n + 1}, K > 1, occurs in the overparameterized scenario of n ≤ p. If x_k ≠ 0 and any p_i ∈ {n − 1, n, n + 1}, i ≠ k, the generalization error after the first iteration will be extremely large, since the corresponding α_k in (13) will be extremely large. In order to avoid large generalization errors, no partition A_k should have a number of columns p_k close to the number of observations n. Note that according to (14), having p_k ∈ {n − 1, n, n + 1} in one node affects the generalization error associated with the partitions at the other nodes.

We now consider the evolution of the generalization error:

Lemma 2. Consider the setting in Theorem 1. For large t, the generalization error associated with x̂^{t+1} is given by

    E[‖x − x̂^{t+1}‖_2^2] ≈ Σ_{k=1}^K α_k E[‖x_k − x̂_k^t‖_2^2],               (16)

where α_k is defined as in (14).

Proof: See Section VII-D.

Lemma 2 reveals that if we have E[‖x_k − x̂_k^t‖_2^2] ≠ 0 with p_i ∈ {n − 1, n, n + 1}, i ≠ k, at a given iteration, then the average generalization error will increase dramatically in the next iteration. Hence, if the average generalization error takes large values, it will not decrease by iterating the algorithm further. Numerical illustrations are provided in Section V.

Following [12], a similar analysis is presented in [10] to explain the "double descent" curves in [9]. The analyses of [10], [12] focus on the centralized problem where only a subset p̄ of the p unknowns is learnt and present how the generalization error increases when p̄ is close to the number of observations n. In this paper we extend these results to distributed learning with CoCoA.

We note that the presence of noise w_i in (1) during training would provide some numerical stability. Similarly, having a non-zero regularization during training, i.e., λ > 0, will make the matrix in (11) invertible, hence replacing the pseudoinverse in (11) with an inverse. With a large enough λ > 0 (compared to the machine precision), this will provide numerical stability which can reduce the large values in the generalization error significantly, at the cost of a larger training error. On the other hand, a too large regularization will make the distributed solution penalize the norm of the solution too much, and the solution will fit neither the training data nor the test data. We illustrate these effects in Section V.
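As a small illustration of the preceding remark (my sketch, not from the paper), the local step (11) can use an ordinary linear solve instead of the pseudoinverse once λ > 0, since K A_k^T A_k + λ I_{p_k} is then positive definite; the helper name is hypothetical.

```python
import numpy as np

def local_update(A_k, y, v_bar, x_hat_k, lam, K):
    # Local step (11); for lam > 0 the matrix is positive definite,
    # so an ordinary solve can replace the pseudoinverse.
    c_k = lam * x_hat_k - A_k.T @ (y - v_bar)
    M_k = K * A_k.T @ A_k + lam * np.eye(A_k.shape[1])
    if lam > 0:
        return -np.linalg.solve(M_k, c_k)
    return -np.linalg.pinv(M_k) @ c_k
```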

V. NUMERICAL EXAMPLES

We now provide numerical results to illustrate the dependence of the generalization error on the partitioning and the effect of regularization. We generate x with x ∼ N(0, I_p) once in the numerical experiments and keep it fixed. We generate the rows of A ∈ R^{n×p} i.i.d. with distribution N(0, I_p). We set n = 50, p = 150, w_i = 0, ∀i. The data is partitioned over K = 2 nodes, so p = p_1 + p_2.

Verification of Theorem 1: We first empirically verify the expression in (13) from Theorem 1. We obtain the empirical results by computing the first iteration of Algorithm 1 for N = 100 simulations. Note that these values correspond to the average generalization error (i.e., the risk) by (6). In Figure 1, we present the analytical value ε_G, i.e., E[‖x − x̂^1‖_2^2] from (13), and the empirical average (1/N) Σ_{i=1}^N ‖x − x̂^1_{(i)}‖_2^2, where the subscript (i) denotes the i-th simulation. Figure 1 illustrates that the empirical average follows the analytical values for all p_1 for which the expression in (15) is finite. When p_k ∈ {n − 1, n, n + 1}, the empirical average increases so drastically that the values are out of the range of the plots. For p_k ≈ n, p_k ∉ {n − 1, n, n + 1}, we see that the empirical values take large values and these values are exactly on the analytical curve. Note that no analytical value is computed for p_k ∈ {n − 1, n, n + 1}, hence the increase in the analytical expressions around p_k ≈ n, p_k ∉ {n − 1, n, n + 1}, comes directly from large but finite values dictated by the analytical expression.

Fig. 1: Comparison of the E[‖x − x̂^1‖_2^2] expression in (13) with the empirical ensemble average for K = 2, λ = 0.
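Below is a hedged sketch of the Figure 1 experiment, assuming the cocoa_least_squares and theoretical_eps_G helpers sketched earlier; the sweep grid and seed are arbitrary choices of mine.

```python
import numpy as np

# Sketch: first-iteration error of Algorithm 1 versus the prediction (13),
# for K = 2, n = 50, p = 150, lambda = 0.
rng = np.random.default_rng(0)
n, p, N = 50, 150, 100
x = rng.standard_normal(p)               # fixed over all simulations

for p1 in range(10, p, 10):
    partition = [p1, p - p1]
    x_blocks = np.split(x, [p1])
    theory = theoretical_eps_G(x_blocks, partition, n)
    empirical = 0.0
    for _ in range(N):
        A = rng.standard_normal((n, p))
        A_blocks = np.split(A, [p1], axis=1)
        x_hat = cocoa_least_squares(A_blocks, A @ x, lam=0.0, T=1)
        empirical += np.sum((x - x_hat) ** 2) / N
    print(p1, theory, empirical)
```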

Generalization error after convergence: We now illustrate that the generalization error does not decrease when Algorithm 1 is run until convergence. We set the number of iterations for Algorithm 1 as T = 200. We note that increasing T further does not change the nature of the results. The average training error is calculated as (1/(Nn)) Σ_{i=1}^N ‖A_{(i)}(x − x̂^T_{(i)})‖_2^2, where the superscript T denotes the final iteration. The generalization error is calculated in a similar fashion but using a new data matrix A' ∈ R^{10n×p} from the same distribution as A ∈ R^{n×p}. Here A, A' are independently sampled for each simulation. The matrix A' is chosen to have 10n rows so that the generalization error is averaged over a large number of data points. For benchmarking, we use the training and the generalization errors of the centralized least-squares (LS) solution, i.e., x̂ = A^+ y using the whole A.

In Figure 2, we plot the empirical average of the generalization error and the training error of Algorithm 1 as a function of p_1 with λ = 0. When either p_1 or p_2 approaches n = 50, there is a large increase in the generalization error. This behaviour is consistent with the general trend of Figure 1, which was obtained using Theorem 1. This numerical result supports the result of Lemma 2, i.e., once the generalization error increases drastically, iterations of Algorithm 1 do not decrease it. In particular, the peak generalization error for Algorithm 1 is on the order of 10^5 (not shown on the plot). On the other hand, the distributed solution fits the training data perfectly, as does the LS solution: the respective training errors are lower than 10^{-25}. In contrast to the distributed case, the LS solution fits the new data well with an average generalization error of ≈ 60. Hence, although Algorithm 1 successfully finds a solution that achieves a training error on the same level as the direct centralized solution, the generalization error is significantly higher when p_k ∈ {n − 1, n, n + 1}.

Fig. 2: The generalization error and the training error for K = 2, λ = 0 after convergence.
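A sketch of the Figure 2 style experiment follows (mine, with an arbitrary seed and a coarse p_1 grid), reusing the cocoa_least_squares helper sketched in Section III and comparing against the centralized LS solution A^+ y.

```python
import numpy as np

# Sketch: training and generalization errors of Algorithm 1 after T = 200
# iterations, with the centralized LS solution A^+ y as a benchmark.
rng = np.random.default_rng(1)
n, p, T = 50, 150, 200
x = rng.standard_normal(p)
A = rng.standard_normal((n, p))
y = A @ x
A_new = rng.standard_normal((10 * n, p))          # fresh data for the generalization error
x_ls = np.linalg.pinv(A) @ y                      # centralized least squares

for p1 in (25, 50, 75, 100, 125):
    x_dist = cocoa_least_squares(np.split(A, [p1], axis=1), y, lam=0.0, T=T)
    print(p1,
          np.mean((y - A @ x_dist) ** 2),          # training error
          np.mean((A_new @ (x - x_dist)) ** 2),    # generalization error
          np.mean((A_new @ (x - x_ls)) ** 2))      # LS benchmark
```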

Effect of regularization: We now investigate the effects of regularization on the peaks of Figure 2. We set a non-zero regularization parameter λ and run the same simulations as in Figure 2. A value of λ between 10^{-4} and 10^{3} dampens the peaks in the generalization error (when p_1 is close to 50, 100) to values between 10^{4} and 10^{2}. As λ is increased beyond 10^{-4}, the training error starts to grow. In particular, for λ = 10^{3}, the training error is on the same level as the generalization error. Any further increase in λ increases both the training and the generalization error. These results are consistent with the discussions in Section IV.
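A sketch of this regularization sweep is given below (not the authors' code; the λ grid and the choice p_1 = 50 are mine), again reusing the cocoa_least_squares helper sketched in Section III.

```python
import numpy as np

# Sketch: sweep the regularization parameter in the Figure 2 setup at the
# peak p_1 = n and track how the training and generalization errors move.
rng = np.random.default_rng(2)
n, p, p1, T = 50, 150, 50, 200
x = rng.standard_normal(p)
A = rng.standard_normal((n, p))
y = A @ x
A_new = rng.standard_normal((10 * n, p))

for lam in [0.0, 1e-4, 1e-2, 1.0, 1e3, 1e5]:
    x_hat = cocoa_least_squares(np.split(A, [p1], axis=1), y, lam=lam, T=T)
    print(lam,
          np.mean((y - A @ x_hat) ** 2),            # training error
          np.mean((A_new @ (x - x_hat)) ** 2))      # generalization error
```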
VI. CONCLUSIONS

We have presented a characterization of the generalization error showing how partitioning plays a major role in distributed linear learning. In particular, our analytical results show that it is crucial for the generalization performance that the partitioning avoids setting the number of unknowns at any node close to the number of available observations. We have presented numerical results, simulating the distributed learning system CoCoA, verifying our analytical results. Extension of this work to the fully decentralized case of COLA is considered an important direction for future work.

VII. APPENDIX

A. Proof of Lemma 1

For λ = 0, the formula for Δx_k^t reduces to

    Δx_k^t = (1/K) A_k^+ (y − v̄^t).                                           (17)

Using x̂^0 = 0, x̂_k^t = x̂_k^{t−1} + Δx_k^{t−1} and the partitioning structure Δx^t = [Δx_1^t; ···; Δx_K^t], we obtain x̂^t = Σ_{i=0}^{t−1} Δx^i. Together with v_k^t = v̄^{t−1} + K A_k Δx_k^{t−1} and A = [A_1, ···, A_K], we have

    v̄^t = (1/K) Σ_{k=1}^K v_k^t = v̄^{t−1} + Σ_{k=1}^K A_k Δx_k^{t−1}
        = v̄^{t−1} + A Δx^{t−1} = v̄^0 + Σ_{i=0}^{t−1} A Δx^i = A x̂^t.         (18)

Combining (17) and (18) with x̂_k^{t+1} = x̂_k^t + Δx_k^t, we obtain

    x̂_k^{t+1} = x̂_k^t − (1/K) A_k^+ A x̂^t + (1/K) A_k^+ y.                    (19)

Putting this result in vector form gives the desired expression.
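A quick numerical check of Lemma 1 (my sketch, assuming the cocoa_least_squares helper sketched in Section III): iterating the closed form (12) from x̂^0 = 0 should coincide, up to numerical precision, with the output of Algorithm 1 for λ = 0.

```python
import numpy as np

# Sketch: compare the closed form (12) with the Algorithm 1 sketch.
rng = np.random.default_rng(3)
n, p, p1, K, T = 20, 30, 12, 2, 5
A = rng.standard_normal((n, p))
x = rng.standard_normal(p)
y = A @ x

A_blocks = np.split(A, [p1], axis=1)
A_bar = np.vstack([np.linalg.pinv(Ak) for Ak in A_blocks])   # [A_1^+; A_2^+]

x_closed = np.zeros(p)
for _ in range(T):
    x_closed = (np.eye(p) - A_bar @ A / K) @ x_closed + A_bar @ y / K

x_algo = cocoa_least_squares(A_blocks, y, lam=0.0, T=T)
print(np.max(np.abs(x_closed - x_algo)))    # should be at numerical precision
```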

B. Proof of Theorem 1

Let us define the matrix consisting of the pseudoinverses of the blocks of A as follows:

    Ā = [A_1^+; ···; A_K^+] ∈ R^{p×n}.                                         (20)

Now we consider the error for the unknown x at iteration t = 1, i.e., x̃^1 = x − x̂^1. Using Lemma 1 and the fact that y = Ax, we have the following expression for x̃^1: x̃^1 = (I_p − (1/K) ĀA) x̃^0 = (I_p − (1/K) ĀA) x, where we have used that x̃^0 = x − x̂^0 = x since the algorithm is initialized with x̂^0 = 0. Hence, the error after one iteration is expressed in terms of x. We now consider the expectation of ‖x̃^1‖_2^2, i.e.,

    E[‖x̃^1‖_2^2] = E[‖(I_p − (1/K) ĀA) x‖_2^2]                                (21)
                 = x^T E[(I_p − (1/K) Q^T)(I_p − (1/K) Q)] x                   (22)
                 = ‖x‖_2^2 − (2/K) x^T E[Q] x + (1/K^2) x^T E[Q^T Q] x,        (23)

where in (22) we have introduced the notation Q = ĀA. In (23), we will first evaluate the term x^T E[Q] x, then x^T E[Q^T Q] x, and finally combine these results to find an expression for E[‖x̃^1‖_2^2].

We now evaluate the term x^T E[Q] x. The matrix Q can be expressed as follows:

    Q = ĀA = [A_1^+; ···; A_K^+] A = [A_1^+ A_1, ···, A_1^+ A_K; ···; A_K^+ A_1, ···, A_K^+ A_K].   (24)

Since A_k and A_i are statistically independent for k ≠ i, and E[A_i] = 0, we obtain

    E[Q] = blockdiag(E[A_1^+ A_1], ···, E[A_K^+ A_K]).                         (25)

The quadratic form x^T E[Q] x can then be expressed as the following summation:

    x^T E[Q] x = Σ_{k=1}^K x_k^T E[A_k^+ A_k] x_k.                             (26)

As an intermediate step, we now present Lemma 3, which will be utilized throughout the proofs:

Lemma 3. Let C ∈ R^{n×p_c} be a random matrix with i.i.d. rows with the distribution N(0, I_{p_c}). Let z ∈ R^{p_c×1}. Then

    z^T E[C^+ C] z = (r̄_min/p_c) ‖z‖_2^2,                                      (27)

where r̄_min = min{n, p_c}.

Proof: See Section VII-E. This type of expression has been utilized before, e.g., in [10] for n ≤ p_c without a proof. The result follows from the unitary invariance of Gaussian matrices (and the invertibility of square Gaussian matrices). We provide a proof for the sake of completeness in Section VII-E.
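Lemma 3 is easy to check by simulation; the following sketch (not part of the paper; sizes and the number of trials are mine) compares a Monte Carlo estimate of z^T E[C^+ C] z with (r̄_min/p_c) ‖z‖_2^2.

```python
import numpy as np

# Monte Carlo check of Lemma 3 for a Gaussian C with i.i.d. N(0, I_{p_c}) rows.
rng = np.random.default_rng(4)
n, p_c, trials = 15, 40, 2000
z = rng.standard_normal(p_c)

acc = 0.0
for _ in range(trials):
    C = rng.standard_normal((n, p_c))
    acc += z @ (np.linalg.pinv(C) @ (C @ z)) / trials   # z^T C^+ C z

print(acc, min(n, p_c) / p_c * np.sum(z**2))            # the two should be close
```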

By definition of A, the rows of A are i.i.d. with N(0, I_p). Hence, the rows of A_k are i.i.d. with N(0, I_{p_k}), for any k. Thus, using Lemma 3 with C = A_k in (26), we obtain

    x^T E[Q] x = Σ_{k=1}^K ‖x_k‖_2^2 r_{min,k}/p_k.                            (28)

We now consider the term x^T E[Q^T Q] x in Lemma 4:

Lemma 4. Let A be an n × p random matrix with i.i.d. rows with the distribution N(0, I_p). Let Ā denote the matrix [A_1^+; ···; A_K^+] ∈ R^{p×n}. Let z = [z_1; ···; z_K] ∈ R^{p×1}, where z_k ∈ R^{p_k×1}, and r_{min,k} = min{n, p_k}, r_{max,k} = max{n, p_k}, k = 1, ..., K. Then

    z^T E[A^T Ā^T Ā A] z = Σ_{k=1}^K ‖z_k‖_2^2 ( r_{min,k}/p_k + Σ_{i=1, i≠k}^K γ_i ),   (29)

with γ_k, k = 1, ..., K, defined in (15).

Proof: See Section VII-F. Using Q = ĀA and Lemma 4, we obtain

    x^T E[Q^T Q] x = Σ_{k=1}^K ‖x_k‖_2^2 ( r_{min,k}/p_k + Σ_{i=1, i≠k}^K γ_i ).   (30)

Combining (30) and (28) with (23), we obtain (13) of Theorem 1. This concludes the proof of Theorem 1.

C. An Illustrative Example

We now consider the special case where n = 1, p = 2, K = 2, p_1 = 1, p_2 = 1, with A = [a_1, a_2] ∈ R^{1×2} and x = [x_1; x_2] ∈ R^{2×1}. Hence, y = a_1 x_1 + a_2 x_2. Now consider the case where a_1 and a_2 are non-zero, so that the pseudoinverses are given by 1/a_1 and 1/a_2, respectively. Note that this is the case with probability one since a_1 and a_2 are Gaussian distributed. By Lemma 1 and x̂^0 = 0, we have x̂_1^1 = (1/(2a_1)) y = (1/2) x_1 + (a_2/(2a_1)) x_2 and x̂_2^1 = (1/(2a_2)) y = (a_1/(2a_2)) x_1 + (1/2) x_2. Hence,

    E[‖x − x̂^1‖_2^2] = E[ |x_1/2 − (a_2/(2a_1)) x_2|^2 + |x_2/2 − (a_1/(2a_2)) x_1|^2 ]
                     = E[ |(a_2/(2a_1)) x_2|^2 + |x_1/2|^2 − (a_2/(2a_1)) x_1 x_2
                         + |(a_1/(2a_2)) x_1|^2 + |x_2/2|^2 − (a_1/(2a_2)) x_1 x_2 ].   (31)

Now consider the individual terms, for instance E[|(a_2/(2a_1)) x_2|^2] = E[a_2^2] E[1/a_1^2] x_2^2/4, where we have used the statistical independence of a_1 and a_2. Here, E[a_2^2] is finite valued. On the other hand, for a_i Gaussian distributed, E[1/a_i^2] diverges (and also note that ∫_ε^∞ (1/a_i^2) exp(−a_i^2) d a_i takes large values for any given finite ε > 0). Similar conclusions can be drawn for the other terms in the expectation. Hence, these observations illustrate the infinite/indeterminate values in Theorem 1.

On the other hand, when at least one of the a_i's is zero (note that this event has probability zero), the associated pseudoinverse is zero. By straightforward calculations, the average generalization error can be found to be finite in this case.
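The divergence of E[1/a_i^2] discussed above can also be observed numerically; in the sketch below (mine, not from the paper), sample averages of 1/a^2 for a ∼ N(0, 1) fluctuate wildly and tend to grow with the sample size instead of settling to a finite value.

```python
import numpy as np

# Sketch: sample averages of 1/a^2 for standard normal a do not converge,
# reflecting the divergence of E[1/a^2].
rng = np.random.default_rng(5)
for m in (10**3, 10**5, 10**7):
    a = rng.standard_normal(m)
    print(m, np.mean(1.0 / a**2))
```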
D. Proof of Lemma 2

We adopt the same notation as in the proof of Theorem 1 in Section VII-B. We consider the error in the estimate x̂^{t+1}, i.e., x̃^{t+1} = x − x̂^{t+1}, for an arbitrary t. Using Lemma 1 and y = Ax, x̃^{t+1} can be written as x̃^{t+1} = (I_p − (1/K) ĀA) x̃^t. Now, consider the evolution of the expected error, i.e.,

    E[‖x̃^{t+1}‖_2^2] = E[‖(I_p − (1/K) ĀA) x̃^t‖_2^2]                          (32)
                     = E[(x̃^t)^T (I_p − (1/K) Q^T)(I_p − (1/K) Q) x̃^t]        (33)
                     ≈ E[‖x̃^t‖_2^2] − (2/K) E[(x̃^t)^T E[Q] x̃^t] + (1/K^2) E[(x̃^t)^T E[Q^T Q] x̃^t],   (34)

where Q = ĀA. In (34), we have used the Independence Assumption [13, Ch. 16], which assumes statistical independence between x̂^t and the regressors in A for large t.

The Independence Assumption [13, Ch. 16] is a widely utilized assumption in the signal processing literature to study the transient and the steady-state behaviour of adaptive filters, yielding extraordinary agreement between analytical studies and empirical values, see for instance [13, Ch. 16.6]. In our particular case, the assumption does not hold to its full extent since we have an overparametrized system of equations with multiple solutions. Instead, numerical studies suggest that for large t there may be a constant finite gap between the actual values and the approximation, for instance between E[(x̃^t)^T Q^T Q x̃^t] and E[(x̃^t)^T E[Q^T Q] x̃^t]. Note that since this gap is finite and constant, the generalization error will still grow whenever γ_i = +∞.

Now, following the proof of Theorem 1 with (34) instead of (23), we obtain the expression in (16) of Lemma 2. This concludes the proof of Lemma 2.

E. Proof of Lemma 3

Let us denote the singular value decomposition of C ∈ R^{n×p_c} as

    C = U^T Λ V,                                                               (35)

where U ∈ R^{n×n} and V ∈ R^{p_c×p_c} are unitary matrices (which reduce to real orthonormal matrices since C is real-valued) and Λ ∈ R^{n×p_c} is the (possibly rectangular) diagonal matrix of singular values. Hence, the pseudoinverse of C is given by

    C^+ = V^T Λ^+ U.                                                           (36)

Note that the diagonal elements of Λ^+ are the reciprocals of the non-zero diagonal values of Λ, so we have

    D = Λ^+ Λ = [I_{r_min}, 0; 0, 0] ∈ R^{p_c×p_c},                            (37)

where r_min = Rank(C) = min{n, p_c} is the rank of C. Here, we have used the fact that a random matrix with i.i.d. Gaussian entries has full rank with probability (w.p.) 1 [14]. In particular, by [14, Eqn. 3.2], a square Gaussian matrix M ∈ R^{r_min×r_min} is invertible w.p. 1. Hence, a rectangular Gaussian matrix in R^{n×p_c} (which has M as a sub-matrix) has full rank.

Hence, C^+ C can be expressed as follows:

    C^+ C = V^T Λ^+ U U^T Λ V = V^T Λ^+ Λ V = V^T D V,                         (38)

where we have used the fact that for a real orthogonal matrix U we have U U^T = I.

Taking the trace of z^T E[C^+ C] z, we obtain

    tr(z^T E[C^+ C] z) = E[tr(D V z z^T V^T)]                                  (39)
                       = tr(D E[V z z^T V^T]).                                 (40)

In (40), we have moved the expectation inside since D is given by (37) w.p. 1. Since D is diagonal and we would like to evaluate a trace type expression, we only need to consider the diagonal elements of E[V z z^T V^T]. Denoting the i-th row of V as v_i, the i-th diagonal element is

    E[(v_{i,1} z_1 + ··· + v_{i,p_c} z_{p_c})^2].                              (41)

Note that V is Haar distributed since C is Gaussian distributed [15]. Hence, by [16, Lemma 1.1], the cross terms of the square in (41) are zero, and by [16, Proposition 1.2], the non-cross terms are 1/p_c. Hence, we have the i-th diagonal element of E[V z z^T V^T] as

    (1/p_c) ‖z‖_2^2.                                                           (42)

We note that (42) does not depend on the order of the indices of z. (This is also a direct consequence of the rotational invariance of Gaussian matrices, i.e., V, U are Haar distributed.) Using (42), we express z^T E[C^+ C] z as given in Lemma 3:

    z^T E[C^+ C] z = (r_min/p_c) ‖z‖_2^2,                                      (43)

where r_min = min{n, p_c}.

F. Proof of Lemma 4

We first focus on G = A^T Ā^T Ā A. Using the definition of Ā in (20), we express the product Ā^T Ā as follows:

    Ā^T Ā = Σ_{k=1}^K (A_k A_k^T)^+,                                           (44)

where we have used the following identities for the pseudoinverse: (M^+)^T = (M^T)^+ and (M^T)^+ M^+ = (M M^T)^+ for any matrix M. Hence, we have

    G = A^T Ā^T Ā A = Σ_{k=1}^K A^T (A_k A_k^T)^+ A.                           (45)

The matrix G, and hence E[G], can be seen as a matrix consisting of K × K blocks of varying sizes. The (k, j)-th block of E[G] (k-th horizontal, j-th vertical block) is given by

    Σ_{i=1}^K E[A_k^T (A_i A_i^T)^+ A_j].                                      (46)

We now consider the cases with k ≠ j and k = j separately. For k ≠ j, (46) can be written as ..., where

    γ'_k = 1/(p_k − n − 1)   for p_k > n + 1,                                  (55)
    γ'_k = p_k/…             for p_k …