Iterative Pre-Conditioning to Expedite the Gradient-Descent Method

Kushal Chakrabarti⋆, Nirupam Gupta†, and Nikhil Chopra⋆

arXiv:2003.07180v2 [math.OC] 29 Mar 2020

Abstract—This paper considers the problem of multi-agent distributed optimization. In this problem, there are multiple agents in the system, and each agent only knows its local cost function. The objective of the agents is to collectively compute a common minimum of the aggregate of all their local cost functions. In principle, this problem is solvable using a distributed variant of the traditional gradient-descent method. However, the speed of convergence of the traditional gradient-descent method is highly influenced by the conditioning of the optimization problem being solved. Specifically, the method requires a large number of iterations to converge to a solution if the optimization problem is ill-conditioned.

In this paper, we propose an iterative pre-conditioning approach that can significantly attenuate the influence of the problem's conditioning on the convergence speed of the gradient-descent method. The proposed pre-conditioning approach can be easily implemented in distributed systems and has minimal computation and communication overhead. For now, we only consider a specific distributed optimization problem wherein the individual local cost functions of the agents are quadratic. Besides the theoretical guarantees, the improved convergence speed of our approach is demonstrated through experiments on a real data-set.

⋆ University of Maryland, College Park, Maryland 20742, U.S.A.
† Georgetown University, Washington, DC 20057, U.S.A.
Emails: [email protected], [email protected], and [email protected]

I. INTRODUCTION

We consider a synchronous distributed system that comprises a server and m agents in a server-based architecture, as shown in Fig. 1. The server-based architecture can be emulated easily on a rooted peer-to-peer network using the well-known routing and message authentication primitives [1], [2]. Each agent i ∈ {1, ..., m} holds a pair (A^i, b^i), where A^i is a matrix of size n_i × n and b^i is a column-vector of size n_i. Let R^n denote the set of real-valued vectors of size n. For a vector v ∈ R^n, let ‖v‖ denote its 2-norm. If (·)^T denotes the transpose, then ‖v‖² = v^T v. The objective of the agents is to solve the least-squares problem:

    minimize_{x ∈ R^n}  Σ_{i=1}^m (1/2) ‖A^i x − b^i‖².    (1)

Applications of such problems include linear regression, state estimation, and hypothesis testing.

[Fig. 1: The system architecture — a server connected to m agents, where agent i holds the local data (A^i, b^i).]

The server collaborates with the agents to solve the problem. The server-based architecture is also commonly known as the federated model [3], where the agents collaboratively solve an optimization problem, such as (1), using a central server without ever sharing their local data (A^i, b^i) with anyone in the system. In this architecture, the server maintains an estimate of the global solution of (1), and each agent downloads the estimated global solution from the server. Each agent updates the estimated global solution locally using the data it possesses and uploads its local updated estimate to the server. The server accumulates the locally updated solutions from all the agents to improve the global solution.

The most common and most straightforward algorithm for the above distributed minimization problem is the distributed variant of the vanilla gradient-descent method, called the distributed gradient-descent (DGD) method [4]. Recently, Azizan-Ruhi et al., 2019 [5] have proposed an accelerated projection method, which is built upon the seminal work on accelerated methods by Nesterov [6]. However, Azizan-Ruhi et al. do not provide any theoretical guarantees on the improved convergence of their method over the traditional DGD method [5].

In this paper, we propose a different approach than the accelerated methods to improve the convergence speed of the DGD method. Instead of using momentum, we use a pre-conditioning scheme wherein, in each iteration, the update vector used by the server to update its estimate of the global solution of (1) is multiplied by a pre-conditioner matrix that we provide. The pre-conditioner matrix is itself updated in each iteration; hence we call our approach the iterative pre-conditioning method.

The DGD method is an iterative algorithm, in which the server maintains an estimate for a point of minimum of (1), denoted by x*, and updates it iteratively by collaborating with the agents as follows. For each iteration t = 0, 1, ..., let x(t) denote the estimate of x* at the beginning of iteration t. The initial value x(0) is chosen arbitrarily. In each iteration t, the server broadcasts x(t) to all the agents. Each agent i computes the gradient of the function (1/2)‖A^i x − b^i‖² at x = x(t), denoted by g^i(t), and sends it to the server. Note that

    g^i(t) = (A^i)^T (A^i x(t) − b^i), ∀i, t.    (2)

Upon receiving the gradients {g^i(t) | i = 1, ..., m} from all the agents, the server updates x(t) to x(t+1) using a step-size of constant value δ as follows:

    x(t+1) = x(t) − δ Σ_{i=1}^m g^i(t), ∀t.    (3)

To be able to present the contribution of this paper, we first briefly review the convergence of the DGD method described above. We define the following notation. Let A = [(A^1)^T, ..., (A^m)^T]^T denote the matrix obtained by stacking the matrices A^i's vertically. So, matrix A is of size N × n, where N = Σ_{i=1}^m n_i. We assume that A is a tall matrix, i.e., N ≥ n. Similarly, we concatenate the b^i's to get b = [(b^1)^T, ..., (b^m)^T]^T ∈ R^N.

A. Convergence of DGD

If matrix A is full-column rank, then we know that there exists a step-size δ for which there is a positive value ρ < 1 such that [7]

    ‖x(t) − x*‖ ≤ ρ^t ‖x(0) − x*‖, t = 0, 1, ...

The value ρ is commonly referred to as the convergence rate. Smaller ρ implies higher convergence speed, and vice-versa. If we let λ and γ denote the largest and smallest eigenvalues of A^T A, then it is known that

    ρ ≥ ρ_GD = ((λ/γ) − 1) / ((λ/γ) + 1).    (4)

The ratio λ/γ is also commonly referred to as the condition number of matrix A^T A, which we denote by κ(A^T A).

B. Pre-Conditioning of DGD

Our objective is to improve the convergence rate ρ of DGD beyond ρ_GD by the suitable pre-conditioning proposed in this paper. Let K, referred to as the pre-conditioner, be a square matrix of size n × n. The server now updates its estimate as follows:

    x(t+1) = x(t) − δ K Σ_{i=1}^m g^i(t), ∀t.    (5)

If the matrix product K A^T A is positive definite, then convergence of (5) can be made linear by choosing δ appropriately, and the smallest possible convergence rate for (5) is given by (ref. Chapter 11.3.3 of [7])

    ρ*_K = (κ(K A^T A) − 1) / (κ(K A^T A) + 1).    (6)

However, most of the existing pre-conditioning techniques are not applicable to the distributed framework considered in this paper. The incomplete LU factorization algorithms [8], accelerated iterative methods [9], and the symmetric successive over-relaxation method [9], for computing a matrix K such that κ(K A^T A) is provably smaller than κ(A^T A), require the server to have direct access to the matrices A^i's. Some other pre-conditioning methods [10] require A to be a symmetric positive definite matrix. The recently proposed distributed pre-conditioning scheme for the D-Heavy Ball method in [5] has the same convergence rate as APC. The experimental results in Section III suggest that our proposed scheme converges faster than APC, and hence faster than the said pre-conditioning scheme.

C. Summary of Our Contributions

We propose an iterative pre-conditioner matrix K(t), instead of a constant pre-conditioner matrix K. That is, the server updates its estimate as follows:

    x(t+1) = x(t) − δ K(t) Σ_{i=1}^m g^i(t), ∀t.    (7)

The pre-conditioner K(t) can be computed in a distributed manner in the federated architecture in each iteration t, as is presented in Section II. We show that the iterative process (7) converges provably faster to the optimum point x* than the original DGD algorithm (3). In the experiments, we have also observed that the convergence speed of our proposed algorithm is faster than the accelerated projection-based consensus (APC) method proposed in [5].

Moreover, the computational complexity of the proposed method is the same as the DGD algorithm, which is the smallest amongst the existing distributed algorithms for solving (1). We have formally shown the proposed algorithm to converge faster than the distributed gradient method.
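The DGD iteration (2)-(3) reviewed above can be simulated in a single process. The sketch below is illustrative: the synthetic data, the dimensions, and the step-size δ = 1/λ are assumptions for demonstration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic federated least-squares problem: m agents, agent i holds (A^i, b^i).
m, n_i, n = 4, 5, 3
A_blocks = [rng.standard_normal((n_i, n)) for _ in range(m)]
x_star = rng.standard_normal(n)
b_blocks = [Ai @ x_star for Ai in A_blocks]   # consistent system, so x* solves (1)

A = np.vstack(A_blocks)                        # stacked (N x n) matrix A
lam = np.linalg.eigvalsh(A.T @ A).max()        # largest eigenvalue of A^T A
delta = 1.0 / lam                              # constant step-size, small enough to converge

x = np.zeros(n)                                # server's initial estimate x(0)
for t in range(3000):
    # Each agent i computes g^i(t) = (A^i)^T (A^i x(t) - b^i)   ... eq. (2)
    grads = [Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A_blocks, b_blocks)]
    # Server: x(t+1) = x(t) - delta * sum_i g^i(t)              ... eq. (3)
    x = x - delta * sum(grads)

err = np.linalg.norm(x - x_star)
```

The number of iterations DGD needs grows with κ(A^T A); this dependence on the conditioning is exactly the bottleneck the paper targets.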
In contrast, the conditions under which APC is guaranteed to converge faster than DGD have not been provided in [5]. APC, as well as the pre-conditioning scheme for D-HBM in [5], has an additional computational overhead before the iterations begin.

II. PROPOSED ALGORITHM

In this section, we present our proposed algorithm and theoretically guarantee that the proposed method converges faster than the distributed gradient-descent method (3) for solving the distributed least-squares problem (1). To be able to present our algorithm, we introduce the following notation.

• For a positive integer m, let [m] := {1, ..., m}.
• Let e_j denote the j-th column of the (n × n)-dimensional identity matrix I.
• Let K(t) ∈ R^{n×n} be the pre-conditioner matrix for iteration t, and let k_j(t) ∈ R^n denote the j-th column of K(t).
• For β > 0, define

    R^i_j(t) := ((A^i)^T A^i + (β/m) I) k_j(t) − (1/m) e_j,  j = 1, ..., n, i ∈ [m].    (8)

For each iteration t = 0, 1, ..., the server maintains a pre-conditioner matrix K(t−1) and an estimate x(t) of the point of optimum x*. The initial pre-conditioner matrix K(−1) and estimate x(0) are chosen arbitrarily. The server sends x(t) and K(t−1) to the agents. Each agent i ∈ [m] computes R^i_j(t−1) for j = 1, ..., n as given in (8) and the local gradient g^i(t) as given in (2), and sends these to the server. Then, the server computes the updated pre-conditioner K(t) as given in (9), and uses this updated pre-conditioner K(t) to compute the updated estimate x(t+1) as given by (10).

The proposed method is summarized in Algorithm 1. Note that α, δ, β are positive valued parameters in the algorithm.

Algorithm 1
1:  Initialize x(0) ∈ R^n, K(−1) ∈ R^{n×n}
2:  for t = 0, 1, ... do
3:    The server transmits x(t) and K(t−1) to all the agents i ∈ [m]
4:    for each agent i ∈ [m] do
5:      Compute g^i(t) using (2)
6:      for each j ∈ {1, ..., n} do
7:        Compute R^i_j(t−1) given by (8)
8:      end for
9:      Transmit g^i(t) and {R^i_j(t−1)}_{j=1}^n to the server
10:   end for
11:   for each j ∈ {1, ..., n} do
12:     The server updates each column of K(t):

          k_j(t) = k_j(t−1) − α Σ_{i=1}^m R^i_j(t−1)    (9)

13:   end for
14:   The server updates the estimate:

          x(t+1) = x(t) − δ K(t) Σ_{i=1}^m g^i(t)    (10)

15: end for

A. Computational Complexity

We count the number of floating-point multiplications needed per iteration of Algorithm 1. For each agent i, the computations of {R^i_j(t) | j = 1, ..., n} in (8) are independent of each other. So, each agent i computes them in parallel. Now, (8) can be rewritten as

    R^i_j(t) = (A^i)^T (A^i k_j(t)) + (β/m) k_j(t) − (1/m) e_j.

Two matrix-vector multiplications are required for computing R^i_j(t): A^i k_j(t) and (A^i)^T (A^i k_j(t)), in that order. Here A^i is an (n_i × n)-dimensional matrix, and k_j(t) and A^i k_j(t) are column vectors of dimension n and n_i, respectively. So, a total of (2 n_i n + n) flops are required for computing R^i_j(t). Two matrix-vector multiplications are required for computing g^i(t): A^i x(t) and (A^i)^T (A^i x(t) − b^i), in that order. By a similar argument as for R^i_j(t), computing g^i(t) needs O(n_i n) flops. One matrix-vector multiplication is required for computing x(t+1): K(t) Σ_{i=1}^m g^i(t), which needs O(n_i n) flops. So, the per-iteration computational cost of Algorithm 1 is O(n_i n).

B. Convergence Analysis

To be able to present the convergence of Algorithm 1, we introduce the following notation.

• Let K* := (A^T A + βI)^{-1}. The matrix A^T A is positive semi-definite. Thus, (A^T A + βI) is positive definite for β > 0. So, K* is well-defined.
• Let k*_j be the j-th column of K*, j = 1, ..., n.
• Let λ and γ denote the largest and smallest eigenvalues of A^T A, respectively.
• Let

    ρ*_K := (λ − γ) / (λ + γ + 2β),   ρ*_β := (λ − γ)β / ((λ + γ)β + 2λγ),
    σ_0 := δλ ‖K(−1) − K*‖_F.

We make the following assumption.

Assumption 1: Assume that the matrix A^T A is full rank.

Lemma 1: Consider the iterative process (9), and let β > 0. Under Assumption 1, there exists α > 0 for which there is a positive value ρ_K < 1 such that, for each j = 1, ..., n,

    ‖k_j(t) − k*_j‖ ≤ ρ_K ‖k_j(t−1) − k*_j‖, t = 0, 1, 2, ...,

where ρ_K ≥ ρ*_K.

The above lemma shows that each column of the pre-conditioner matrix K(t) asymptotically converges to the corresponding column of K*. In other words, the matrix K(t) asymptotically converges to K*.
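Algorithm 1 admits a compact single-process simulation: stacking the columns j = 1, ..., n of (8) gives the per-agent matrix ((A^i)^T A^i + (β/m) I) K − (1/m) I, so the server's updates (9)-(10) become matrix updates. The sketch below uses synthetic data; the parameter choices for β, α, δ are illustrative assumptions (α = 1/(λ+β) is a simple safe choice for (9), not the paper's tuned value).

```python
import numpy as np

rng = np.random.default_rng(1)

m, n_i, n = 4, 5, 3
A_blocks = [rng.standard_normal((n_i, n)) for _ in range(m)]
x_star = rng.standard_normal(n)
b_blocks = [Ai @ x_star for Ai in A_blocks]

A = np.vstack(A_blocks)
eigs = np.linalg.eigvalsh(A.T @ A)
gamma, lam = eigs.min(), eigs.max()

beta = 0.1 * gamma                    # any beta > 0 works; this value is illustrative
alpha = 1.0 / (lam + beta)            # step-size for the K-iteration (9), no overshoot
delta = 2.0 / (lam / (lam + beta) + gamma / (gamma + beta))  # near-optimal via Lemma 2

I = np.eye(n)
x = np.zeros(n)                       # x(0)
K = np.zeros((n, n))                  # K(-1), arbitrary initialization

for t in range(500):
    # Agent i: the n columns of (8) stacked into one matrix, plus the gradient (2).
    R_blocks = [(Ai.T @ Ai + (beta / m) * I) @ K - I / m for Ai in A_blocks]
    g_blocks = [Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A_blocks, b_blocks)]
    # Server: pre-conditioner update (9), then estimate update (10).
    K = K - alpha * sum(R_blocks)
    x = x - delta * K @ sum(g_blocks)

err_x = np.linalg.norm(x - x_star)
err_K = np.linalg.norm(K - np.linalg.inv(A.T @ A + beta * I))
```

Consistent with Lemma 1, K(t) approaches K* = (A^T A + βI)^{-1}, and the estimate x(t) approaches x*.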

The smallest and the largest eigenvalues of K* A^T A have a direct influence on the convergence speed of Algorithm 1. The following lemma finds these eigenvalues.

Lemma 2: If Assumption 1 holds and β > 0, then K* A^T A is positive definite, and the largest and the smallest eigenvalues of K* A^T A are λ/(λ + β) and γ/(γ + β), respectively.

The following proposition tells us that Algorithm 1 converges to the point of minimum at a linear rate.

Proposition 1: Consider Algorithm 1 with β > 0. Under Assumption 1, there exists δ > 0 for which there is a positive value ρ_β < 1 such that

    ‖x(t+1) − x*‖ ≤ (ρ_β + σ_0 (ρ*_K)^{t+1}) ‖x(t) − x*‖, ∀t,

where ρ_β ≥ ρ*_β.

From Proposition 1, the rate of convergence of Algorithm 1 depends on the choice of initial value K(−1) through σ_0. The closer K(−1) is to K*, the faster is the rate.

C. Comparison with DGD

Now we are ready to present our key result, which is a formal comparison between the convergence speed of Algorithm 1 and the original DGD method in (3). We define a few necessary notations.

• Let the estimates of the optimum point computed by Algorithm 1 and DGD after t iterations be denoted by x_1(t) and x_2(t), respectively.
• Let z_i(t) = x_i(t) − x*, for i = 1, 2.
• Let E_i(t) denote the known upper bound on ‖z_i(t)‖.

Theorem 1: Consider Algorithm 1 and the DGD algorithm (3) with identical initial estimate x(0) ∈ R^n. If Assumption 1 holds, then there exists t_sw < ∞ such that E_1(t) < E_2(t) for all t > t_sw.

Theorem 1 implies that if the server executes both Algorithm 1 and the original DGD algorithm using the same initial estimate x(0) of x*, then after a certain number of iterations the upper bound on the error-norm generated by Algorithm 1 is smaller than that of the original DGD.

III. EXPERIMENTS

In this section, we present experimental results to verify the obtained theoretical convergence guarantees of Algorithm 1. The matrix A is loaded from one of the real data-sets in the SuiteSparse Matrix Collection (https://sparse.tamu.edu). Our primary focus here is obtaining faster convergence on a problem (1) where the condition number of A^T A is large enough, specifically 5.8 × 10^7, while keeping the computational load minimum.

Example 1. The matrix A is a real symmetric positive definite matrix from the benchmark dataset "bcsstm07", which is part of a structural problem. The dimension of A is 420 × 420. We generate the vector b by setting b = Ax*, where x* is the n = 420 dimensional vector of ones. Since A is positive definite, (1) has a unique solution x*. The data (A, b) is split among m = 10 machines, so that n_i = 42, i = 1, ..., 10.

We apply Algorithm 1 to solve the aforementioned optimization problem. It can be seen that the speed of the algorithm depends on the parametric choices of β, α, δ (ref. Fig. 2a). Also, the algorithm converges to x* irrespective of the initial choice of the entries in x(0) and K(−1) (ref. Fig. 2b).

The experimental results have been compared with the conventional distributed gradient descent (DGD) method and the accelerated projection-based consensus (APC) method [5] (ref. Fig. 3). The pre-conditioning technique for the distributed heavy-ball method in [5] has the same theoretical rate as APC. Besides, it has an additional computational overhead of O(p²n + p³), with p ≤ n_i depending on the given matrix A^i, which is larger than the cost of Algorithm 1 in general.

We compare the number of iterations needed by DGD, APC and Algorithm 1 to reach a relative estimation error (defined as ‖x(t) − x*‖ / ‖x*‖) of 10^{-4} (ref. Table I). The tuning parameter β has been set at 5 for Algorithm 1. The rest of the algorithm parameters have been set such that the respective algorithms have their smallest possible convergence rates. Specifically, (α* = 3.17 × 10^{-7}, δ* = 1.95) for Algorithm 1, (γ* = 1.08, η* = 12.03) for APC, and δ* = 3.17 × 10^{-7} for DGD. We found that Algorithm 1 performs the fastest among these algorithms. Note that evaluating the optimal tuning parameters for any of these algorithms requires knowledge of the smallest and largest eigenvalues of A^T A.

We notice that the error norm ‖x(t) − x*‖ for Algorithm 1 is less than that of DGD from an approximate iteration index of 300 onward (ref. Fig. 3b). Since x(0) is the same for both of the algorithms, this observation is in agreement with the claim in Theorem 1. Similarly, Algorithm 1 is faster than APC after 1114 iterations (ref. Fig. 3b).

[Fig. 2: Temporal evolution of the error norm ‖x(t) − x*‖ for Example 1, under Algorithm 1 with different parameter choices and initializations. (a) x(0) = [0, ..., 0]^T; (b) β = 100, α = 2 × 10^{-7}, δ = 1.]

IV. SUMMARY

In this paper, we have proposed an algorithm for solving distributed linear least-squares minimization problems over a federated architecture with minimum computational load. However, the algorithm can also be emulated in a rooted peer-to-peer network. The vital contribution lies in mitigating the detrimental impact of ill-conditioning on the convergence of the traditional gradient descent method. The computation of the pre-conditioner is done at the server level without requiring any access to the data.

[Fig. 3: Temporal evolution of the error norm ‖x(t) − x*‖ against the number of iterations for Example 1, under Algorithm 1, DGD and APC [5], with (a) arbitrary parameter choices and (b) optimal parameter choices. Initialization for both (a) and (b): (Algorithm 1) x(0) = [0, ..., 0]^T, K(−1) = O_{n×n}; (DGD) x(0) = [0, ..., 0]^T; (APC) according to the algorithm. In (a): (Algorithm 1) β = 100, α = 2 × 10^{-7}, δ = 1; (DGD) δ = 10^{-7}; (APC) γ = η = 1. In (b): (Algorithm 1) β = 5, α = 3.17 × 10^{-7}, δ = 1.95; (DGD) δ = 3.17 × 10^{-7}; (APC) γ* = 1.08, η* = 12.03.]

TABLE I

Comparison between DGD, APC and Algorithm 1 with optimal parameters on "bcsstm07", regarding (a) the number of iterations to attain a relative estimation error of 10^{-4}, and (b) the decay rate of the instantaneous error-norm. Here, cond(A^T A) = 5.8 × 10^7.

    Algorithm              DGD       APC           Algo. 1 (β = 5, K(−1) = O_{n×n})
    (a) Iterations needed  > 10^5    4.85 × 10^4   2.11 × 10^4
    (b) Rate of decrease   0.9999    0.9672        0.9583 + 4.3 × 10^8 (0.9999)^{t+1}
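The qualitative behavior in Fig. 3 and Table I can be reproduced on a small synthetic stand-in for the real data-set. Everything below is illustrative: the 4 × 4 matrix, the split into m = 2 agents, and the parameter values are assumptions, not the paper's bcsstm07 setup.

```python
import numpy as np

# Small ill-conditioned stand-in (cond(A^T A) = 1e4 here, not the 420 x 420 data).
A = np.diag([10.0, 5.0, 2.0, 0.1])
n = 4
x_star = np.ones(n)
b = A @ x_star
blocks = [(A[:2], b[:2]), (A[2:], b[2:])]      # m = 2 agents

eigs = np.linalg.eigvalsh(A.T @ A)
gamma, lam = eigs.min(), eigs.max()

# DGD (3) with its optimal constant step-size 2/(lam + gamma).
delta_gd = 2.0 / (lam + gamma)
# Algorithm 1 with illustrative parameters; delta uses the eigenvalues from Lemma 2.
beta = 1.0
alpha = 1.0 / (lam + beta)
delta1 = 2.0 / (lam / (lam + beta) + gamma / (gamma + beta))

I = np.eye(n)
x_dgd, x_alg1, K = np.zeros(n), np.zeros(n), np.zeros((n, n))

for t in range(500):
    # DGD update (3).
    x_dgd = x_dgd - delta_gd * sum(Ai.T @ (Ai @ x_dgd - bi) for Ai, bi in blocks)
    # Algorithm 1: agents' quantities (8) and (2), then server updates (9) and (10).
    R = sum((Ai.T @ Ai + (beta / 2) * I) @ K - I / 2 for Ai, _ in blocks)
    g = sum(Ai.T @ (Ai @ x_alg1 - bi) for Ai, bi in blocks)
    K = K - alpha * R
    x_alg1 = x_alg1 - delta1 * K @ g

err_dgd = np.linalg.norm(x_dgd - x_star)
err_alg1 = np.linalg.norm(x_alg1 - x_star)
```

After the same number of iterations from the same x(0), the iterative pre-conditioning error is orders of magnitude below the DGD error, in line with Theorem 1.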

In practice, we test the algorithm on a real data-set with a significant condition number (cond(A^T A) = 5.8 × 10^7) and obtain much better performance compared to the classical distributed gradient algorithm as well as the recently proposed accelerated projection-based consensus algorithm (APC), regarding the number of iterations needed for convergence to the true solution. We have formally shown the proposed algorithm to converge faster than the distributed gradient method, while the APC method only speculates it to be faster.

ACKNOWLEDGEMENTS

This work is being carried out as a part of the Pipeline System Integrity Management Project, which is supported by the Petroleum Institute, Khalifa University of Science and Technology, Abu Dhabi, UAE. Nirupam Gupta was sponsored by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196.

REFERENCES

[1] Nancy A. Lynch. Distributed Algorithms. Elsevier, 1996.
[2] Andrew S. Tanenbaum. Network protocols. ACM Computing Surveys (CSUR), 13(4):453–489, 1981.
[3] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
[4] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[5] Navid Azizan-Ruhi, Farshad Lahouti, Amir Salman Avestimehr, and Babak Hassibi. Distributed solution of large-scale linear systems via accelerated projection-based consensus. IEEE Transactions on Signal Processing, 67(14):3806–3817, 2019.
[6] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Sov. Math. Doklady, 27(2):372–376, 1983.
[7] Jeffrey A. Fessler. Image reconstruction: Algorithms and analysis. http://web.eecs.umich.edu/~fessler/book/c-opt.pdf. [Online book draft; accessed 17-February-2020].
[8] J. A. Meijerink and Henk A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Mathematics of Computation, 31(137):148–162, 1977.
[9] Owe Axelsson. A survey of preconditioned iterative methods for linear systems of algebraic equations. BIT Numerical Mathematics, 25(1):165–187, 1985.
[10] Michele Benzi. Preconditioning techniques for large linear systems: a survey. Journal of Computational Physics, 182(2):418–477, 2002.

APPENDIX

A. Proof of Lemma 1

From (8) and noticing that Σ_{i=1}^m R^i_j(t−1) = (A^T A + βI) k_j(t−1) − e_j, dynamics (9) can be rewritten as

    k_j(t) = k_j(t−1) − α [(A^T A + βI) k_j(t−1) − e_j].    (11)

Define k̃_j(t) := k_j(t) − k*_j. From the definition of K* we have (A^T A + βI) K* = I, which implies (A^T A + βI) k*_j = e_j, j = 1, ..., n. From (11),

    k̃_j(t) = [I − α (A^T A + βI)] k̃_j(t−1).    (12)

Since (A^T A + βI) is positive definite for β > 0, there exists α for which there is a positive ρ_K < 1 such that ‖k̃_j(t)‖ ≤ ρ_K ‖k̃_j(t−1)‖, t = 0, 1, 2, ..., where the smallest value of ρ_K is (κ(A^T A + βI) − 1)/(κ(A^T A + βI) + 1) (ref. Corollary 11.3.3 and Chapter 11.3.3 of [7]). As κ(A^T A + βI) = (λ + β)/(γ + β), the claim follows.

B. Proof of Lemma 2

Consider the eigendecomposition A^T A = U Σ U^T, where U is the eigenvector matrix of A^T A and Σ is a diagonal matrix with the eigenvalues of A^T A on the diagonal. Since K* := (A^T A + βI)^{-1} and the identity matrix can be written as I = U U^T, we have

    K* = U diag{1/(λ_k + β)}_{k=1}^n U^T
    ⟹ K* A^T A = U diag{1/(λ_k + β)}_{k=1}^n U^T · U diag{λ_k}_{k=1}^n U^T
    ⟹ K* A^T A = U diag{λ_k/(λ_k + β)}_{k=1}^n U^T,

where λ_k > 0 are the eigenvalues of the positive definite matrix A^T A. Since (d/dλ_k)(λ_k/(λ_k + β)) > 0, the claim follows.

C. Proof of Proposition 1

Using (2) and observing that A^T A = Σ_{i=1}^m (A^i)^T A^i, dynamics (10) can be rewritten as

    x(t+1) = x(t) − δ K(t) A^T (A x(t) − b).    (13)

Define z(t) := x(t) − x*. The objective cost in (1) can be rewritten as (1/2)‖Ax − b‖², the gradient of which is given by A^T A x − A^T b. Thus, x* satisfies A^T A x* = A^T b. From (13) and A^T A x* = A^T b, we get

    z(t+1) = z(t) − δ K(t) A^T A (x(t) − x*)
           = (I − δ K(t) A^T A) z(t)
           = (I − δ K* A^T A) z(t) − δ K̃(t) A^T A z(t)
           = (I − δ K* A^T A) z(t) − δ u(t),    (14)

where K̃(t) comprises the columns k̃_j(t) := k_j(t) − k*_j, j = 1, ..., n, and u(t) := K̃(t) A^T A z(t). From Lemma 1, it follows that

    ‖k̃_j(t)‖² ≤ (ρ*_K)^{2t+2} ‖k̃_j(−1)‖²
    ⟹ ‖K̃(t)‖²_F ≤ (ρ*_K)^{2t+2} ‖K̃(−1)‖²_F.    (15)

Now,

    ‖u(t)‖ ≤ ‖K̃(t)‖ ‖A^T A‖ ‖z(t)‖
           ≤ ‖K̃(t)‖_F ‖A^T A‖ ‖z(t)‖
           ≤ λ ‖K̃(−1)‖_F (ρ*_K)^{t+1} ‖z(t)‖,    (16)

where the last inequality follows from (15). From (14) and (16),

    ‖z(t+1)‖ ≤ (‖I − δ K* A^T A‖ + σ_0 (ρ*_K)^{t+1}) ‖z(t)‖, ∀t.    (17)

Since K* A^T A is positive definite for β > 0 (from Lemma 2), there exists δ for which ρ_β := ‖I − δ K* A^T A‖ < 1 and (17) holds. The smallest value of ρ_β is (κ(K* A^T A) − 1)/(κ(K* A^T A) + 1). Then, with the largest and the smallest eigenvalues of K* A^T A obtained in Lemma 2, the claim follows.

D. Proof of Theorem 1

Define s_t := ρ*_β + σ_0 (ρ*_K)^{t+1}. It can be easily checked that ρ*_β < ρ_GD for β > 0. Since ρ*_K < 1 and ρ*_β < ρ_GD, there exists τ < ∞ such that s_t < ρ_GD, ∀t > τ. Define r_t := s_t / ρ_GD. Then r_t < 1, ∀t > τ. From the recursion in Proposition 1, we have

    ‖z_1(t+1)‖ ≤ (Π_{k=τ+1}^t s_k) ‖z_1(τ+1)‖, ∀t > τ.    (18)

Since {s_t > 0}_{t≥0} is a strictly decreasing sequence, from the same recursion we also have

    ‖z_1(τ+1)‖ ≤ s_τ ‖z_1(τ)‖ ≤ s_0 ‖z_1(τ)‖ ⟹ ‖z_1(τ+1)‖ ≤ s_0^{τ+1} ‖z(0)‖.    (19)

Combining (18) and (19),

    ‖z_1(t+1)‖ ≤ (Π_{k=τ+1}^t s_k) s_0^{τ+1} ‖z(0)‖ ≤ (s_{τ+1})^{t−τ} s_0^{τ+1} ‖z(0)‖
               = (r_{τ+1} ρ_GD)^{t−τ} s_0^{τ+1} ‖z(0)‖
               = c_1 (r_{τ+1} ρ_GD)^{t+1} ‖z(0)‖, ∀t > τ,

where c_1 = (s_0 / (r_{τ+1} ρ_GD))^{τ+1} is a constant. Thus, we have

    E_1(t+1) = c_1 (r_{τ+1} ρ_GD)^{t+1} ‖z(0)‖ = c_1 (r_{τ+1})^{t+1} E_2(t+1), ∀t > τ.

Since r_{τ+1} < 1 and c_1 is a constant, there exists t_sw < ∞ such that c_1 (r_{τ+1})^{t+1} < 1, ∀t > t_sw. This completes the proof.