arXiv:2004.14637v2 [stat.ML] 4 May 2020

Generalization Error for Linear Regression under Distributed Learning

Martin Hellkvist, Ayça Özçelikkale, Anders Ahlén
Dept. of Electrical Engineering, Uppsala University, Sweden
{Martin.Hellkvist, Ayca.Ozcelikkale, Anders.Ahlen}@angstrom.uu.se

M. Hellkvist and A. Özçelikkale acknowledge the support from the Swedish Research Council under grant 2015-04011.

Abstract—Distributed learning facilitates the scaling-up of data processing by distributing the computational burden over several nodes. Despite the vast interest in distributed learning, the generalization performance of such approaches is not well understood. We address this gap by focusing on a distributed linear regression setting where the unknowns are distributed over a network of nodes. We present an analytical characterization of the dependence of the generalization error on the partitioning of the unknowns over the nodes. In particular, our results show that, for the overparameterized case, the generalization error of the distributed solution increases dramatically compared to that of the centralized solution when the number of unknowns estimated at any node is close to the number of observations, while the training error remains in the same range. We further provide numerical examples to verify our analytical expressions.

Index Terms—Distributed Learning, Generalization Error.

I. INTRODUCTION

In a standard learning task, the main aim is to be able to estimate the observation y when the corresponding input a is given. Estimation of the unknown model parameters is performed using a set of training data, i.e., pairs (y_i, a_i). How well the trained model can explain the training data is referred to as the training error. A key performance criterion is the generalization error, i.e., how well a trained model can estimate a new observation y for a new input a. In general, a low training error does not always guarantee a low generalization error. If the model performs well on new data, it is said to have low generalization error [4]. Hence, it is of central interest to develop methods that have both low training error and low generalization error.

Distributed learning provides a framework for sharing the high computational burden of the learning task over multiple nodes. The growing need for learning from data within both academia and industry, and the rapid advancement of distributed computing, e.g., over wireless communication networks [1], have led to a vast interest in the field. We contribute to the overall understanding of distributed linear regression [2], [3] by characterizing potential pitfalls of these methods and by providing guidelines in terms of best practice.
Modern machine learning techniques with overparameterized models are often able to exactly fit the training data while still being able to predict well on new data [4]. Although various communication-related challenges of distributed learning, such as energy and quantization constraints [2], [5], [6] and privacy, have been successfully investigated, to the best of our knowledge there has been no attempt to characterize the generalization properties of distributed learning schemes. In this article, we address this gap. In contrast to the setting where the observations are distributed over the nodes (for instance, sensor readings) [5], our approach follows the line of work initiated by the seminal work of [7], where the unknowns are distributed over the network.

We consider a linear model and utilize the successful distributed learning method CoCoA [8]. Our results show that the generalization performance of the distributed solution can heavily depend on the partitioning of the unknowns, although the training error shows no such dependence, i.e., the distributed solution achieves the same level of training accuracy as the centralized approach. Motivated by the success of overparameterized models in machine learning and recent results on the generalization error of such models [4], [9], [10], we pay special attention to the overparameterized case, i.e., the case where the number of unknowns is larger than the number of observations. In particular, if the number of unknowns assigned to any node is close to the number of observations, then the generalization error of the distributed solution takes extremely large values compared to that of the centralized solution. Our main analytical results in Theorem 1 and Lemma 2 present the expectation of the generalization error as a function of the partitioning of the unknowns. These analytical results are verified by numerical results. Using these results, we provide guidelines for the optimal partitioning of the unknowns for distributed learning.

Notation: We denote the Moore-Penrose pseudoinverse and the transpose of a matrix A as A^+ and A^T, respectively. The p × p identity matrix is denoted as I_p, and the matrix of all ones is denoted by 1. We denote a column vector x ∈ R^{p×1} as x = [x_1; ···; x_p], where the semicolon denotes row-wise separation. Throughout the paper, we often partition matrices column-wise and vectors row-wise: the column-wise partitioning of A ∈ R^{n×p} into K blocks is given by A = [A_1, ···, A_K] with A_k ∈ R^{n×p_k}, and the row-wise partitioning of x ∈ R^{p×1} into K blocks is given by x = [x_1; ···; x_K] with x_k ∈ R^{p_k×1}.

II. PROBLEM STATEMENT

We focus on the linear model

    y_i = a_i^T x + w_i,                                                      (1)

where y_i ∈ R is the i-th observation, a_i ∈ R^{p×1} is the i-th regressor, w_i is the unknown disturbance for the i-th observation, and x = [x_1; ···; x_p] ∈ R^{p×1} is the vector of unknown coefficients.

We consider the problem of estimating x given n data points, i.e., pairs of observations and regressors, (y_i, a_i), i = 1, ..., n, by minimizing the following regularized cost function:

    min_{x ∈ R^{p×1}}  (1/2) ‖y − Ax‖_2^2 + (λ/2) ‖x‖_2^2,                    (2)

where A ∈ R^{n×p} is the regressor matrix whose i-th row is given by a_i^T ∈ R^{1×p}. We further denote the first term as f(Ax) = (1/2) ‖y − Ax‖_2^2. The second term (λ/2) ‖x‖_2^2 with λ ≥ 0 denotes the regularization function.
We consider the setting where the regressors a_i^T ∈ R^{1×p} are independent and identically distributed (i.i.d.) with a_i ∼ N(0, I_p). Under this Gaussian regressor model, we focus on the generalization error of the solution to (2) found by the distributed solver CoCoA [8]. Our main focus is on the scenario where λ = 0, w_i = 0, while the solutions with λ > 0 are used for comparison. In the remainder of this section, we define the generalization error. We provide details about our implementation of CoCoA in Section III.

Let w_i = 0, ∀i, and let x̂ be an estimate of x found by using the data pairs (y_i, a_i), i = 1, ..., n. For a given A, the generalization error, i.e., the expected error for estimating y when a new pair (y, a) with a ∼ N(0, I_p) comes, is given by

    E_a[(y − a^T x̂)^2] = E_a[(a^T x − a^T x̂)^2]                              (3)
                       = E_a[tr[(x − x̂)(x − x̂)^T a a^T]]                     (4)
                       = ‖x − x̂‖_2^2,                                         (5)

where a is statistically independent of A and we have used the notation E_a[·] to emphasize that the expectation is over a. Here (5) follows from a ∼ N(0, I_p). We are interested in the expected generalization error over the distribution of the training data,

    ε_G = E_A[‖x − x̂‖_2^2],                                                   (6)

where the expectation is over the regressor matrix A in the training data. In the rest of the paper, we focus on the evolution of ε_G in CoCoA. For notational simplicity, we drop the subscript A from our expectation expressions.
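As a concrete illustration of (5)-(6) (not part of the paper), the sketch below estimates ε_G by Monte Carlo for any estimator that maps training data (A, y) to an estimate x̂; the function names and the estimator interface are assumptions of mine.

```python
import numpy as np

def generalization_error(x, x_hat):
    # Expected squared prediction error for a fresh pair (y, a) with
    # a ~ N(0, I_p) and w = 0; by (3)-(5) it equals ||x - x_hat||^2.
    return np.sum((x - x_hat) ** 2)

def monte_carlo_eps_G(x, estimator, n, num_trials=100, seed=None):
    # Estimate eps_G in (6) by averaging ||x - x_hat||^2 over
    # independently drawn training matrices A with i.i.d. N(0, 1) rows.
    rng = np.random.default_rng(seed)
    p = x.shape[0]
    errors = []
    for _ in range(num_trials):
        A = rng.standard_normal((n, p))
        y = A @ x                      # noiseless observations, w_i = 0
        errors.append(generalization_error(x, estimator(A, y)))
    return np.mean(errors)
```

For instance, estimator = lambda A, y: np.linalg.pinv(A) @ y gives the centralized least-squares baseline used later in Section V.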
III. DISTRIBUTED SOLUTION APPROACH

As the distributed solution approach, we use the iterative approach CoCoA introduced in [8]. In CoCoA, mutually exclusive subsets of the coefficients of x and the associated subsets of columns of A are distributed over K nodes (K ≤ p). Hence, the unknown coefficients are partitioned over the nodes so that each node governs the learning of p_k variables, with Σ_{k=1}^K p_k = p. We denote the part of A available at node k as A_k ∈ R^{n×p_k}. In particular, using this partitioning, y with w_i = 0, ∀i, can be expressed as

    y = Ax = [A_1, ···, A_K] [x_1; ···; x_K] = Σ_{k=1}^K A_k x_k,             (7)

where x_k is the partition at node k. Note that there is no loss of generality due to the specific order of this partitioning structure since the columns of A are i.i.d. (since the rows are i.i.d. with N(0, I_p)).

In CoCoA, at iteration t, node k shares its estimate of y, denoted v_k^t, over the network. Note that the A_k's and the observation vector y are fixed over all iterations. The variables x̂_k^t ∈ R^{p_k×1} and Δx_k^t ∈ R^{p_k×1} are the estimate and its update computed by node k, respectively. Hence, x̂^t and Δx^t are partitioned as x̂^t = [x̂_1^t; ···; x̂_K^t] and Δx^t = [Δx_1^t; ···; Δx_K^t]. The average over all local estimates v_k^t is denoted as v̄^t.

At iteration t, CoCoA solves the following minimization problem at each node [8]:

    min_{Δx_k^t}  ∇f(v̄^t)^T A_k Δx_k^t + (σ'/(2τ)) ‖A_k Δx_k^t‖_2^2 + (λ/2) ‖x̂_k^t + Δx_k^t‖_2^2.   (8)

Using f(Ax) = (1/2)‖y − Ax‖_2^2, we have the smoothness parameter τ = 1 [11]. We set σ' = K since it is considered a safe choice [11]. Only keeping the terms that depend on Δx_k^t reveals that the solution to (8) can be equivalently found by solving the following problem:

    min_{Δx_k^t}  (Δx_k^t)^T ((K/2) A_k^T A_k + (λ/2) I_{p_k}) Δx_k^t + (λ x̂_k^t − A_k^T (y − v̄^t))^T Δx_k^t.   (9)

Taking the derivative with respect to Δx_k^t and setting it to zero, we obtain

    (K A_k^T A_k + λ I_{p_k}) Δx_k^t = −(λ x̂_k^t − A_k^T (y − v̄^t)).          (10)

With λ = 0, the existence of a matrix inverse is not guaranteed. Hence, the local solvers use the Moore-Penrose pseudoinverse to solve (10) as

    Δx_k^t = −(K A_k^T A_k + λ I_{p_k})^+ (λ x̂_k^t − A_k^T (y − v̄^t)).        (11)

The resulting algorithm for estimating x iteratively is presented in Algorithm 1.

Algorithm 1: Implementation of CoCoA [8] (and COLA [11] with W = (1/K) 1 1^T) for (2) with (11).
    Input: Data matrix A distributed column-wise according to partition P. Regularization parameter λ.
    Initialize: x̂^0 = 0 ∈ R^{p×1}, v_k^0 = 0 ∈ R^{n×1}, ∀ k = 1, ..., K
    for t = 0, 1, ..., T do
        v̄^t = (1/K) Σ_{k=1}^K v_k^t
        for k ∈ {1, 2, ..., K} do
            c_k^t = λ x̂_k^t − A_k^T (y − v̄^t)
            Δx_k^t = −(K A_k^T A_k + λ I_{p_k})^+ c_k^t
            x̂_k^{t+1} = x̂_k^t + Δx_k^t
            v_k^{t+1} = v̄^t + K A_k Δx_k^t
        end
    end

In [11], a generalization of CoCoA is presented, named COLA, where a mixing matrix W is introduced to model the quality of the connections between the nodes. For W = (1/K) 1 1^T, COLA reduces to CoCoA, hence our analysis also applies to this special case of COLA.
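The following is a minimal NumPy sketch of Algorithm 1 as reproduced above. It is not the authors' code; the function name cocoa_least_squares, the argument names, and the use of np.linalg.pinv for the pseudoinverse in (11) are my own choices.

```python
import numpy as np

def cocoa_least_squares(A_blocks, y, lam=0.0, T=200):
    """Sketch of Algorithm 1: CoCoA for problem (2) with local update (11).
    A_blocks is the column-wise partition [A_1, ..., A_K] of A."""
    K = len(A_blocks)
    n = y.shape[0]
    x_hat = [np.zeros(Ak.shape[1]) for Ak in A_blocks]   # x_hat^0 = 0
    v = [np.zeros(n) for _ in range(K)]                  # v_k^0 = 0
    for t in range(T):
        v_bar = sum(v) / K                               # shared average estimate of y
        for k, Ak in enumerate(A_blocks):
            c_k = lam * x_hat[k] - Ak.T @ (y - v_bar)
            M_k = K * Ak.T @ Ak + lam * np.eye(Ak.shape[1])
            dx_k = -np.linalg.pinv(M_k) @ c_k            # pseudoinverse, cf. (11)
            x_hat[k] = x_hat[k] + dx_k
            v[k] = v_bar + K * Ak @ dx_k
    return np.concatenate(x_hat)                         # [x_hat_1; ...; x_hat_K]
```

For a partition (p_1, ..., p_K), the blocks can be obtained with np.split(A, np.cumsum([p_1, ..., p_{K-1}]), axis=1).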

IV. PARTITIONING AND THE GENERALIZATION ERROR

This section presents our main results in Theorem 1 and Lemma 2, which reveal how the generalization error changes based on the data partitioning. We first provide a preliminary result to describe the evolution of the estimates of Algorithm 1:

Lemma 1. Using Algorithm 1 with λ = 0, the closed form expression for x̂^{t+1} is given by

    x̂^{t+1} = (I_p − (1/K) [A_1^+; ···; A_K^+] A) x̂^t + (1/K) [A_1^+; ···; A_K^+] y.   (12)

Proof: See Section VII-A.

This result shows that when λ = 0, the estimate in each iteration is a combination of the previous global estimate (x̂^t) and the local least-squares solutions (A_k^+ y) from each node. We now present our main results:

Theorem 1. Let A ∈ R^{n×p} have i.i.d. rows with a_i ∼ N(0, I_p). Using Algorithm 1 with λ = 0, w_i = 0, ∀i, the generalization error in iteration t = 1, i.e., ε_G, is given by

    E[‖x − x̂^1‖_2^2] = Σ_{k=1}^K ‖x_k‖_2^2 α_k,                               (13)

where α_k and γ_k, k = 1, ..., K, are given by

    α_k = (1/K^2) ( K^2 + (1 − 2K) r_{min,k}/p_k + Σ_{i=1, i≠k}^K γ_i ),       (14)

    γ_k = r_{min,k}/(r_{max,k} − r_{min,k} − 1)   for p_k ∉ {n − 1, n, n + 1},  (15a)
    γ_k = +∞                                       otherwise,                   (15b)

and r_{min,k} = min{p_k, n} and r_{max,k} = max{p_k, n}.

Proof: See Section VII-B. Here, while writing the expressions, we have used the notational convention that if any α_k = +∞ and the corresponding ‖x_k‖_2^2 = 0, then that component of (13) is also zero. Note that the infinity, i.e., ∞, in (15b) denotes the indeterminate/infinite values due to divergence of the relevant integrals. Further discussions on this point are provided together with an illustrative example in Section VII-C.
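The quantities in (13)-(15) are straightforward to evaluate numerically. The following sketch (mine, not the authors' code) computes γ_k, α_k and the right-hand side of (13) for a given partition (p_1, ..., p_K); it does not implement the 0·∞ convention stated above and would return NaN in that corner case.

```python
import numpy as np

def gamma(p_k, n):
    # gamma_k in (15): r_min/(r_max - r_min - 1), or +inf when p_k is
    # within one of n and the expression is indeterminate/divergent.
    if p_k in (n - 1, n, n + 1):
        return np.inf
    r_min, r_max = min(p_k, n), max(p_k, n)
    return r_min / (r_max - r_min - 1)

def alpha(partition, n):
    # alpha_k in (14) for a partition (p_1, ..., p_K).
    K = len(partition)
    gammas = [gamma(p_k, n) for p_k in partition]
    alphas = []
    for k, p_k in enumerate(partition):
        r_min_k = min(p_k, n)
        cross = sum(g for i, g in enumerate(gammas) if i != k)
        alphas.append((K**2 + (1 - 2 * K) * r_min_k / p_k + cross) / K**2)
    return alphas

def theoretical_eps_G(x_blocks, partition, n):
    # Right-hand side of (13): sum_k ||x_k||^2 * alpha_k.
    return sum(np.sum(xk**2) * ak
               for xk, ak in zip(x_blocks, alpha(partition, n)))
```

As a sanity check, for K = 1 and n < p this reduces to α_1 = 1 − n/p, the familiar expression for the minimum-norm least-squares solution.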
Theorem 1 shows how the partitioning of x (and hence A) over the nodes affects the generalization error ε_G. Note that the interesting case of p_k ∈ {n − 1, n, n + 1}, K > 1, occurs in the overparameterized scenario of n ≤ p. If x_k ≠ 0 and any p_i ∈ {n − 1, n, n + 1}, i ≠ k, the generalization error after the first iteration will be extremely large, since the corresponding α_k in (13) will be extremely large. In order to avoid large generalization errors, no partition A_k should have a number of columns p_k close to the number of observations n. Note that according to (14), having p_k ∈ {n − 1, n, n + 1} in one node affects the generalization error associated with the partitions at the other nodes.

We now consider the evolution of the generalization error:

Lemma 2. Consider the setting in Theorem 1. For large t, the generalization error associated with x̂^{t+1} is given by

    E[‖x − x̂^{t+1}‖_2^2] ≈ Σ_{k=1}^K α_k E[‖x_k − x̂_k^t‖_2^2],               (16)

where α_k is defined as in (14).

Proof: See Section VII-D.

Lemma 2 reveals that if we have E[‖x_k − x̂_k^t‖_2^2] ≠ 0 with p_i ∈ {n − 1, n, n + 1}, i ≠ k, at a given iteration, then the average generalization error will increase dramatically in the next iteration. Hence, if the average generalization error takes large values, it will not decrease by iterating the algorithm further. Numerical illustrations are provided in Section V.

Following [12], a similar analysis is presented in [10] to explain the "double descent" curves in [9]. The analyses of [10], [12] focus on the centralized problem where only a subset p̄ of the p unknowns is learnt and present how the generalization error increases when p̄ is close to the number of observations n. In this paper we extend these results to distributed learning with CoCoA.

We note that the presence of noise w_i in (1) during training would provide some numerical stability. Similarly, having a non-zero regularization during training, i.e., λ > 0, will make the matrix in (11) invertible, hence replacing the pseudoinverse in (11) with an inverse. With a large enough λ > 0 (compared to the machine precision), this will provide numerical stability which can reduce the large values in the generalization error significantly, at the cost of a larger training error. On the other hand, a too large regularization will make the distributed solution penalize the norm of the solution too much, and the solution will fit neither the training data nor the test data. We illustrate these effects in Section V.
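As a small illustration of the preceding remark (my sketch, not from the paper), the local step (11) can use an ordinary linear solve instead of the pseudoinverse once λ > 0, since K A_k^T A_k + λ I_{p_k} is then positive definite; the helper name is hypothetical.

```python
import numpy as np

def local_update(A_k, y, v_bar, x_hat_k, lam, K):
    # Local step (11); for lam > 0 the matrix is positive definite,
    # so an ordinary solve can replace the pseudoinverse.
    c_k = lam * x_hat_k - A_k.T @ (y - v_bar)
    M_k = K * A_k.T @ A_k + lam * np.eye(A_k.shape[1])
    if lam > 0:
        return -np.linalg.solve(M_k, c_k)
    return -np.linalg.pinv(M_k) @ c_k
```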

V. NUMERICAL EXAMPLES

We now provide numerical results to illustrate the dependence of the generalization error on the partitioning and the effect of regularization. We generate x with x ∼ N(0, I_p) once in the numerical experiments and keep it fixed. We generate the rows of A ∈ R^{n×p} i.i.d. with distribution N(0, I_p). We set n = 50, p = 150, w_i = 0, ∀i. The data is partitioned over K = 2 nodes, so p = p_1 + p_2.

Verification of Theorem 1: We first empirically verify the expression in (13) from Theorem 1. We obtain the empirical results by computing the first iteration of Algorithm 1 for N = 100 simulations. Note that these values correspond to the average generalization error (i.e., the risk) by (6). In Figure 1, we present the analytical value ε_G, i.e., E[‖x − x̂^1‖_2^2] from (13), and the empirical average (1/N) Σ_{i=1}^N ‖x − x̂^1_{(i)}‖_2^2, where the subscript (i) denotes the i-th simulation. Figure 1 illustrates that the empirical average follows the analytical values for all p_1 for which the expression in (15) is finite. When p_k ∈ {n − 1, n, n + 1}, the empirical average increases so drastically that the values are out of the range of the plots. For p_k ≈ n, p_k ∉ {n − 1, n, n + 1}, we see that the empirical values take large values and these values are exactly on the analytical curve. Note that no analytical value is computed for p_k ∈ {n − 1, n, n + 1}, hence the increase in the analytical expressions around p_k ≈ n, p_k ∉ {n − 1, n, n + 1}, comes directly from large but finite values dictated by the analytical expression.

Fig. 1: Comparison of the E[‖x − x̂^1‖_2^2] expression in (13) with the empirical ensemble average for K = 2, λ = 0.
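Below is a hedged sketch of the Figure 1 experiment, assuming the cocoa_least_squares and theoretical_eps_G helpers sketched earlier; the sweep grid and seed are arbitrary choices of mine.

```python
import numpy as np

# Sketch: first-iteration error of Algorithm 1 versus the prediction (13),
# for K = 2, n = 50, p = 150, lambda = 0.
rng = np.random.default_rng(0)
n, p, N = 50, 150, 100
x = rng.standard_normal(p)               # fixed over all simulations

for p1 in range(10, p, 10):
    partition = [p1, p - p1]
    x_blocks = np.split(x, [p1])
    theory = theoretical_eps_G(x_blocks, partition, n)
    empirical = 0.0
    for _ in range(N):
        A = rng.standard_normal((n, p))
        A_blocks = np.split(A, [p1], axis=1)
        x_hat = cocoa_least_squares(A_blocks, A @ x, lam=0.0, T=1)
        empirical += np.sum((x - x_hat) ** 2) / N
    print(p1, theory, empirical)
```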

Generalization error after convergence: We now illustrate that the generalization error does not decrease when Algorithm 1 is run until convergence. We set the number of iterations for Algorithm 1 as T = 200. We note that increasing T further does not change the nature of the results. The average training error is calculated as (1/(Nn)) Σ_{i=1}^N ‖A_{(i)}(x − x̂^T_{(i)})‖_2^2, where the superscript T denotes the final iteration. The generalization error is calculated in a similar fashion but using a new data matrix A' ∈ R^{10n×p} from the same distribution as A ∈ R^{n×p}. Here A, A' are independently sampled for each simulation. The matrix A' is chosen to have 10n rows so that the generalization error is averaged over a large number of data points. For benchmarking, we use the training and the generalization errors of the centralized least-squares (LS) solution, i.e., x̂ = A^+ y using the whole A.

In Figure 2, we plot the empirical average of the generalization error and the training error of Algorithm 1 as a function of p_1 with λ = 0. When either p_1 or p_2 approaches n = 50, there is a large increase in the generalization error. This behaviour is consistent with the general trend of Figure 1, which was obtained using Theorem 1. This numerical result supports the result of Lemma 2, i.e., once the generalization error increases drastically, iterations of Algorithm 1 do not decrease it. In particular, the peak generalization error for Algorithm 1 is on the order of 10^5 (not shown on the plot). On the other hand, the distributed solution fits the training data perfectly, as does the LS solution: the respective training errors are lower than 10^{-25}. In contrast to the distributed case, the LS solution fits the new data well with an average generalization error of ≈ 60. Hence, although Algorithm 1 successfully finds a solution that achieves a training error on the same level as the direct centralized solution, the generalization error is significantly higher when p_k ∈ {n − 1, n, n + 1}.

Fig. 2: The generalization error and the training error for K = 2, λ = 0 after convergence.
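A sketch of the Figure 2 style experiment follows (mine, with an arbitrary seed and a coarse p_1 grid), reusing the cocoa_least_squares helper sketched in Section III and comparing against the centralized LS solution A^+ y.

```python
import numpy as np

# Sketch: training and generalization errors of Algorithm 1 after T = 200
# iterations, with the centralized LS solution A^+ y as a benchmark.
rng = np.random.default_rng(1)
n, p, T = 50, 150, 200
x = rng.standard_normal(p)
A = rng.standard_normal((n, p))
y = A @ x
A_new = rng.standard_normal((10 * n, p))          # fresh data for the generalization error
x_ls = np.linalg.pinv(A) @ y                      # centralized least squares

for p1 in (25, 50, 75, 100, 125):
    x_dist = cocoa_least_squares(np.split(A, [p1], axis=1), y, lam=0.0, T=T)
    print(p1,
          np.mean((y - A @ x_dist) ** 2),          # training error
          np.mean((A_new @ (x - x_dist)) ** 2),    # generalization error
          np.mean((A_new @ (x - x_ls)) ** 2))      # LS benchmark
```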

Effect of regularization: We now investigate the effects of regularization on the peaks of Figure 2. We set a non-zero regularization parameter λ and run the same simulations as in Figure 2. A value of λ between 10^{-4} and 10^{3} dampens the peaks in the generalization error (when p_1 is close to 50, 100) to values between 10^{4} and 10^{2}. As λ is increased beyond 10^{-4}, the training error starts to grow. In particular, for λ = 10^{3}, the training error is on the same level as the generalization error. Any further increase in λ increases both the training and the generalization error. These results are consistent with the discussions in Section IV.
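A sketch of this regularization sweep is given below (not the authors' code; the λ grid and the choice p_1 = 50 are mine), again reusing the cocoa_least_squares helper sketched in Section III.

```python
import numpy as np

# Sketch: sweep the regularization parameter in the Figure 2 setup at the
# peak p_1 = n and track how the training and generalization errors move.
rng = np.random.default_rng(2)
n, p, p1, T = 50, 150, 50, 200
x = rng.standard_normal(p)
A = rng.standard_normal((n, p))
y = A @ x
A_new = rng.standard_normal((10 * n, p))

for lam in [0.0, 1e-4, 1e-2, 1.0, 1e3, 1e5]:
    x_hat = cocoa_least_squares(np.split(A, [p1], axis=1), y, lam=lam, T=T)
    print(lam,
          np.mean((y - A @ x_hat) ** 2),            # training error
          np.mean((A_new @ (x - x_hat)) ** 2))      # generalization error
```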
VI. CONCLUSIONS

We have presented a characterization of the generalization error showing how partitioning plays a major role in distributed linear learning. In particular, our analytical results show that it is crucial for the generalization performance that the partitioning avoids setting the number of unknowns at any node close to the number of available observations. We have presented numerical results, simulating the distributed learning system CoCoA, verifying our analytical results. Extension of this work to the fully decentralized case of COLA is considered an important direction for future work.

VII. APPENDIX

A. Proof of Lemma 1

For λ = 0, the formula for Δx_k^t reduces to

    Δx_k^t = (1/K) A_k^+ (y − v̄^t).                                           (17)

Using x̂^0 = 0, x̂_k^t = x̂_k^{t−1} + Δx_k^{t−1} and the partitioning structure Δx^t = [Δx_1^t; ···; Δx_K^t], we obtain x̂^t = Σ_{i=0}^{t−1} Δx^i. Together with v_k^t = v̄^{t−1} + K A_k Δx_k^{t−1} and A = [A_1, ···, A_K], we have

    v̄^t = (1/K) Σ_{k=1}^K v_k^t = v̄^{t−1} + Σ_{k=1}^K A_k Δx_k^{t−1}
        = v̄^{t−1} + A Δx^{t−1} = v̄^0 + Σ_{i=0}^{t−1} A Δx^i = A x̂^t.         (18)

Combining (17) and (18) with x̂_k^{t+1} = x̂_k^t + Δx_k^t, we obtain

    x̂_k^{t+1} = x̂_k^t − (1/K) A_k^+ A x̂^t + (1/K) A_k^+ y.                    (19)

Putting this result in vector form gives the desired expression.
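A quick numerical check of Lemma 1 (my sketch, assuming the cocoa_least_squares helper sketched in Section III): iterating the closed form (12) from x̂^0 = 0 should coincide, up to numerical precision, with the output of Algorithm 1 for λ = 0.

```python
import numpy as np

# Sketch: compare the closed form (12) with the Algorithm 1 sketch.
rng = np.random.default_rng(3)
n, p, p1, K, T = 20, 30, 12, 2, 5
A = rng.standard_normal((n, p))
x = rng.standard_normal(p)
y = A @ x

A_blocks = np.split(A, [p1], axis=1)
A_bar = np.vstack([np.linalg.pinv(Ak) for Ak in A_blocks])   # [A_1^+; A_2^+]

x_closed = np.zeros(p)
for _ in range(T):
    x_closed = (np.eye(p) - A_bar @ A / K) @ x_closed + A_bar @ y / K

x_algo = cocoa_least_squares(A_blocks, y, lam=0.0, T=T)
print(np.max(np.abs(x_closed - x_algo)))    # should be at numerical precision
```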

B. Proof of Theorem 1

Let us define the matrix consisting of the pseudoinverses of the blocks of A as follows:

    Ā = [A_1^+; ···; A_K^+] ∈ R^{p×n}.                                         (20)

Now we consider the error for the unknown x at iteration t = 1, i.e., x̃^1 = x − x̂^1. Using Lemma 1 and the fact that y = Ax, we have the following expression for x̃^1: x̃^1 = (I_p − (1/K) ĀA) x̃^0 = (I_p − (1/K) ĀA) x, where we have used that x̃^0 = x − x̂^0 = x since the algorithm is initialized with x̂^0 = 0. Hence, the error after one iteration is expressed in terms of x. We now consider the expectation of ‖x̃^1‖_2^2, i.e.,

    E[‖x̃^1‖_2^2] = E[‖(I_p − (1/K) ĀA) x‖_2^2]                                (21)
                 = x^T E[(I_p − (1/K) Q^T)(I_p − (1/K) Q)] x                   (22)
                 = ‖x‖_2^2 − (2/K) x^T E[Q] x + (1/K^2) x^T E[Q^T Q] x,        (23)

where in (22) we have introduced the notation Q = ĀA. In (23), we will first evaluate the term x^T E[Q] x, then x^T E[Q^T Q] x, and finally combine these results to find an expression for E[‖x̃^1‖_2^2].

We now evaluate the term x^T E[Q] x. The matrix Q can be expressed as follows:

    Q = ĀA = [A_1^+; ···; A_K^+] A = [A_1^+ A_1, ···, A_1^+ A_K; ···; A_K^+ A_1, ···, A_K^+ A_K].   (24)

Since A_k and A_i are statistically independent for k ≠ i, and E[A_i] = 0, we obtain

    E[Q] = blockdiag(E[A_1^+ A_1], ···, E[A_K^+ A_K]).                         (25)

The quadratic form x^T E[Q] x can then be expressed as the following summation:

    x^T E[Q] x = Σ_{k=1}^K x_k^T E[A_k^+ A_k] x_k.                             (26)

As an intermediate step, we now present Lemma 3, which will be utilized throughout the proofs:

Lemma 3. Let C ∈ R^{n×p_c} be a random matrix with i.i.d. rows with the distribution N(0, I_{p_c}). Let z ∈ R^{p_c×1}. Then

    z^T E[C^+ C] z = (r̄_min/p_c) ‖z‖_2^2,                                      (27)

where r̄_min = min{n, p_c}.

Proof: See Section VII-E. This type of expression has been utilized before, e.g., in [10] for n ≤ p_c without a proof. The result follows from the unitary invariance of Gaussian matrices (and the invertibility of square Gaussian matrices). We provide a proof for the sake of completeness in Section VII-E.
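Lemma 3 is easy to check by simulation; the following sketch (not part of the paper; sizes and the number of trials are mine) compares a Monte Carlo estimate of z^T E[C^+ C] z with (r̄_min/p_c) ‖z‖_2^2.

```python
import numpy as np

# Monte Carlo check of Lemma 3 for a Gaussian C with i.i.d. N(0, I_{p_c}) rows.
rng = np.random.default_rng(4)
n, p_c, trials = 15, 40, 2000
z = rng.standard_normal(p_c)

acc = 0.0
for _ in range(trials):
    C = rng.standard_normal((n, p_c))
    acc += z @ (np.linalg.pinv(C) @ (C @ z)) / trials   # z^T C^+ C z

print(acc, min(n, p_c) / p_c * np.sum(z**2))            # the two should be close
```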

By definition of A, the rows of A are i.i.d. with N(0, I_p). Hence, the rows of A_k are i.i.d. with N(0, I_{p_k}), for any k. Thus, using Lemma 3 with C = A_k in (26), we obtain

    x^T E[Q] x = Σ_{k=1}^K ‖x_k‖_2^2 r_{min,k}/p_k.                            (28)

We now consider the term x^T E[Q^T Q] x in Lemma 4:

Lemma 4. Let A be an n × p random matrix with i.i.d. rows with the distribution N(0, I_p). Let Ā denote the matrix [A_1^+; ···; A_K^+] ∈ R^{p×n}. Let z = [z_1; ···; z_K] ∈ R^{p×1}, where z_k ∈ R^{p_k×1}, and r_{min,k} = min{n, p_k}, r_{max,k} = max{n, p_k}, k = 1, ..., K. Then

    z^T E[A^T Ā^T Ā A] z = Σ_{k=1}^K ‖z_k‖_2^2 ( r_{min,k}/p_k + Σ_{i=1, i≠k}^K γ_i ),   (29)

with γ_k, k = 1, ..., K, defined in (15).

Proof: See Section VII-F. Using Q = ĀA and Lemma 4, we obtain

    x^T E[Q^T Q] x = Σ_{k=1}^K ‖x_k‖_2^2 ( r_{min,k}/p_k + Σ_{i=1, i≠k}^K γ_i ).   (30)

Combining (30) and (28) with (23), we obtain (13) of Theorem 1. This concludes the proof of Theorem 1.

C. An Illustrative Example

We now consider the special case where n = 1, p = 2, K = 2, p_1 = 1, p_2 = 1, with A = [a_1, a_2] ∈ R^{1×2} and x = [x_1; x_2] ∈ R^{2×1}. Hence, y = a_1 x_1 + a_2 x_2. Now consider the case where a_1 and a_2 are non-zero, so that the pseudoinverses are given by 1/a_1 and 1/a_2, respectively. Note that this is the case with probability one since a_1 and a_2 are Gaussian distributed. By Lemma 1 and x̂^0 = 0, we have x̂_1^1 = (1/(2a_1)) y = (1/2) x_1 + (a_2/(2a_1)) x_2 and x̂_2^1 = (1/(2a_2)) y = (a_1/(2a_2)) x_1 + (1/2) x_2. Hence,

    E[‖x − x̂^1‖_2^2] = E[ |x_1/2 − (a_2/(2a_1)) x_2|^2 + |x_2/2 − (a_1/(2a_2)) x_1|^2 ]
                     = E[ |(a_2/(2a_1)) x_2|^2 + |x_1/2|^2 − (a_2/(2a_1)) x_1 x_2
                         + |(a_1/(2a_2)) x_1|^2 + |x_2/2|^2 − (a_1/(2a_2)) x_1 x_2 ].   (31)

Now consider the individual terms, for instance E[|(a_2/(2a_1)) x_2|^2] = E[a_2^2] E[1/a_1^2] x_2^2/4, where we have used the statistical independence of a_1 and a_2. Here, E[a_2^2] is finite valued. On the other hand, for a_i Gaussian distributed, E[1/a_i^2] diverges (and also note that ∫_ε^∞ (1/a_i^2) exp(−a_i^2) d a_i takes large values for any given finite ε > 0). Similar conclusions can be drawn for the other terms in the expectation. Hence, these observations illustrate the infinite/indeterminate values in Theorem 1.

On the other hand, when at least one of the a_i's is zero (note that this event has probability zero), the associated pseudoinverse is zero. By straightforward calculations, the average generalization error can be found to be finite in this case.
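The divergence of E[1/a_i^2] discussed above can also be observed numerically; in the sketch below (mine, not from the paper), sample averages of 1/a^2 for a ∼ N(0, 1) fluctuate wildly and tend to grow with the sample size instead of settling to a finite value.

```python
import numpy as np

# Sketch: sample averages of 1/a^2 for standard normal a do not converge,
# reflecting the divergence of E[1/a^2].
rng = np.random.default_rng(5)
for m in (10**3, 10**5, 10**7):
    a = rng.standard_normal(m)
    print(m, np.mean(1.0 / a**2))
```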
D. Proof of Lemma 2

We adopt the same notation as in the proof of Theorem 1 in Section VII-B. We consider the error in the estimate x̂^{t+1}, i.e., x̃^{t+1} = x − x̂^{t+1}, for an arbitrary t. Using Lemma 1 and y = Ax, x̃^{t+1} can be written as x̃^{t+1} = (I_p − (1/K) ĀA) x̃^t. Now, consider the evolution of the expected error, i.e.,

    E[‖x̃^{t+1}‖_2^2] = E[‖(I_p − (1/K) ĀA) x̃^t‖_2^2]                          (32)
                     = E[(x̃^t)^T (I_p − (1/K) Q^T)(I_p − (1/K) Q) x̃^t]        (33)
                     ≈ E[‖x̃^t‖_2^2] − (2/K) E[(x̃^t)^T E[Q] x̃^t] + (1/K^2) E[(x̃^t)^T E[Q^T Q] x̃^t],   (34)

where Q = ĀA. In (34), we have used the Independence Assumption [13, Ch. 16], which assumes statistical independence between x̂^t and the regressors in A for large t.

The Independence Assumption [13, Ch. 16] is a widely utilized assumption in the signal processing literature to study the transient and the steady-state behaviour of adaptive filters, yielding extraordinary agreement between analytical studies and empirical values, see for instance [13, Ch. 16.6]. In our particular case, the assumption does not hold to its full extent since we have an overparametrized system of equations with multiple solutions. Instead, numerical studies suggest that for large t there may be a constant finite gap between the actual values and the approximation, for instance between E[(x̃^t)^T Q^T Q x̃^t] and E[(x̃^t)^T E[Q^T Q] x̃^t]. Note that since this gap is finite and constant, the generalization error will still grow whenever γ_i = +∞.

Now, following the proof of Theorem 1 with (34) instead of (23), we obtain the expression in (16) of Lemma 2. This concludes the proof of Lemma 2.

E. Proof of Lemma 3

Let us denote the singular value decomposition of C ∈ R^{n×p_c} as

    C = U^T Λ V,                                                               (35)

where U ∈ R^{n×n} and V ∈ R^{p_c×p_c} are unitary matrices (which reduce to real orthonormal matrices since C is real-valued) and Λ ∈ R^{n×p_c} is the (possibly rectangular) diagonal matrix of singular values. Hence, the pseudoinverse of C is given by

    C^+ = V^T Λ^+ U.                                                           (36)

Note that the diagonal elements of Λ^+ are the reciprocals of the non-zero diagonal values of Λ, so we have

    D = Λ^+ Λ = [I_{r_min}, 0; 0, 0] ∈ R^{p_c×p_c},                            (37)

where r_min = Rank(C) = min{n, p_c} is the rank of C. Here, we have used the fact that a random matrix with i.i.d. Gaussian entries has full rank with probability (w.p.) 1 [14]. In particular, by [14, Eqn. 3.2], a square Gaussian matrix M ∈ R^{r_min×r_min} is invertible w.p. 1. Hence, a rectangular Gaussian matrix in R^{n×p_c} (which has M as a sub-matrix) has full rank.

Hence, C^+ C can be expressed as follows:

    C^+ C = V^T Λ^+ U U^T Λ V = V^T Λ^+ Λ V = V^T D V,                         (38)

where we have used the fact that for a real orthogonal matrix U we have U U^T = I.

Taking the trace of z^T E[C^+ C] z, we obtain

    tr(z^T E[C^+ C] z) = E[tr(D V z z^T V^T)]                                  (39)
                       = tr(D E[V z z^T V^T]).                                 (40)

In (40), we have moved the expectation inside since D is given by (37) w.p. 1. Since D is diagonal and we would like to evaluate a trace type expression, we only need to consider the diagonal elements of E[V z z^T V^T]. Denoting the i-th row of V as v_i, the i-th diagonal element is

    E[(v_{i,1} z_1 + ··· + v_{i,p_c} z_{p_c})^2].                              (41)

Note that V is Haar distributed since C is Gaussian distributed [15]. Hence, by [16, Lemma 1.1], the cross terms of the square in (41) are zero, and by [16, Proposition 1.2], the non-cross terms are 1/p_c. Hence, we have the i-th diagonal element of E[V z z^T V^T] as

    (1/p_c) ‖z‖_2^2.                                                           (42)

We note that (42) does not depend on the order of the indices of z. (This is also a direct consequence of the rotational invariance of Gaussian matrices, i.e., V, U are Haar distributed.) Using (42), we express z^T E[C^+ C] z as given in Lemma 3:

    z^T E[C^+ C] z = (r_min/p_c) ‖z‖_2^2,                                      (43)

where r_min = min{n, p_c}.

F. Proof of Lemma 4

We first focus on G = A^T Ā^T Ā A. Using the definition of Ā in (20), we express the product Ā^T Ā as follows:

    Ā^T Ā = Σ_{k=1}^K (A_k A_k^T)^+,                                           (44)

where we have used the following identities for the pseudoinverse: (M^+)^T = (M^T)^+ and (M^T)^+ M^+ = (M M^T)^+ for any matrix M. Hence, we have

    G = A^T Ā^T Ā A = Σ_{k=1}^K A^T (A_k A_k^T)^+ A.                           (45)

The matrix G, and hence E[G], can be seen as a matrix consisting of K × K blocks of varying sizes. The (k, j)-th block of E[G] (k-th horizontal, j-th vertical block) is given by

    Σ_{i=1}^K E[A_k^T (A_i A_i^T)^+ A_j].                                      (46)

We now consider the cases with k ≠ j and k = j separately. For k ≠ j, (46) can be written as ..., where

    γ'_k = 1/(p_k − n − 1)   for p_k > n + 1,                                  (55)
    γ'_k = p_k/…             for p_k …