Iterative Pre-Conditioning to Expedite the Gradient-Descent Method

Kushal Chakrabarti⋆, Nirupam Gupta†, and Nikhil Chopra⋆

arXiv:2003.07180v2 [math.OC] 29 Mar 2020

Abstract—This paper considers the problem of multi-agent distributed optimization. In this problem, there are multiple agents in the system, and each agent only knows its local cost function. The objective of the agents is to collectively compute a common minimum of the aggregate of all their local cost functions. In principle, this problem is solvable using a distributed variant of the traditional gradient-descent method. However, the speed of convergence of the traditional gradient-descent method is highly influenced by the conditioning of the optimization problem being solved. Specifically, the method requires a large number of iterations to converge to a solution if the optimization problem is ill-conditioned.

In this paper, we propose an iterative pre-conditioning approach that can significantly attenuate the influence of the problem's conditioning on the convergence speed of the gradient-descent method. The proposed pre-conditioning approach can be easily implemented in distributed systems and has minimal computation and communication overhead. For now, we only consider a specific distributed optimization problem wherein the individual local cost functions of the agents are quadratic. Besides the theoretical guarantees, the improved convergence speed of our approach is demonstrated through experiments on a real data-set.

⋆ University of Maryland, College Park, Maryland 20742, U.S.A.
† Georgetown University, Washington, DC 20057, U.S.A.
Emails: [email protected], [email protected], and [email protected]

I. INTRODUCTION

We consider a synchronous distributed system that comprises a server and m agents in a server-based architecture, as shown in Fig. 1. The server-based architecture can be emulated easily on a rooted peer-to-peer network using the well-known routing and message authentication primitives [1], [2]. Each agent i ∈ {1, ..., m} holds a pair (A^i, b^i), where A^i is a matrix of size n_i × n and b^i is a column-vector of size n_i. Let R^n denote the set of real-valued vectors of size n. For a vector v ∈ R^n, let ‖v‖ denote its 2-norm. If (·)^T denotes the transpose, then ‖v‖² = v^T v. The objective of the agents is to solve the least-squares problem:

    minimize_{x ∈ R^n}  Σ_{i=1}^m (1/2) ‖A^i x − b^i‖².    (1)

Applications of such problems include linear regression, state estimation, and hypothesis testing.

[Fig. 1: The system architecture — a server connected to m agents, where agent i holds the local data (A^i, b^i).]

The server collaborates with the agents to solve the problem. The server-based architecture is also commonly known as the federated model [3], where the agents collaboratively solve an optimization problem, such as (1), using a central server without ever sharing their local data (A^i, b^i) with anyone in the system. In this architecture, the server maintains an estimate of the global solution of (1), and each agent downloads the estimated global solution from the server. Each agent updates the estimated global solution locally using the data it possesses and uploads its local updated estimate to the server. The server accumulates the locally updated solutions from all the agents to improve the global solution.

The most common and most straightforward algorithm for the above distributed minimization problem is the distributed variant of the vanilla gradient-descent method, called the distributed gradient-descent (DGD) method [4]. Recently, Azizan-Ruhi et al., 2019 [5] have proposed an accelerated projection method, which is built upon the seminal work on accelerated methods by Nesterov [6]. However, Azizan-Ruhi et al. do not provide any theoretical guarantees on the improved convergence of their method over the traditional DGD method [5].

In this paper, we propose a different approach than the accelerated methods to improve the convergence speed of the DGD method. Instead of using momentum, we use a pre-conditioning scheme wherein, in each iteration, the update vector used by the server to update its estimate of the global solution of (1) is multiplied by a pre-conditioner matrix that we provide. The pre-conditioner matrix is itself updated in each iteration; hence we call our approach the iterative pre-conditioning method.

The DGD method is an iterative algorithm, in which the server maintains an estimate for a point of minimum of (1), denoted by x*, and updates it iteratively by collaborating with the agents as follows. For each iteration t = 0, 1, ..., let x(t) denote the estimate of x* at the beginning of iteration t. The initial value x(0) is chosen arbitrarily. In each iteration t, the server broadcasts x(t) to all the agents. Each agent i computes the gradient of the function (1/2)‖A^i x − b^i‖² at x = x(t), denoted by g^i(t), and sends it to the server. Note that

    g^i(t) = (A^i)^T (A^i x(t) − b^i), ∀i, t.    (2)

Upon receiving the gradients {g^i(t) | i = 1, ..., m} from all the agents, the server updates x(t) to x(t+1) using a step-size of constant value δ as follows:

    x(t+1) = x(t) − δ Σ_{i=1}^m g^i(t), ∀t.    (3)

To be able to present the contribution of this paper, we first briefly review the convergence of the DGD method described above. We define the following notation. Let A = [(A^1)^T, ..., (A^m)^T]^T denote the matrix obtained by stacking the matrices A^i's vertically. So, matrix A is of size N × n, where N = Σ_{i=1}^m n_i. We assume that A is a tall matrix, i.e., N ≥ n. Similarly, we concatenate the b^i's to get b = [(b^1)^T, ..., (b^m)^T]^T ∈ R^N.

A. Convergence of DGD

If matrix A is full-column rank, then we know that there exists a step-size δ for which there is a positive value ρ < 1 such that [7]

    ‖x(t) − x*‖ ≤ ρ^t ‖x(0) − x*‖, t = 0, 1, ...

The value ρ is commonly referred to as the convergence rate. Smaller ρ implies higher convergence speed, and vice-versa. If we let λ and γ denote the largest and smallest eigenvalues of A^T A, then it is known that

    ρ ≥ ρ_GD = ((λ/γ) − 1) / ((λ/γ) + 1).    (4)

The ratio λ/γ is also commonly referred to as the condition number of matrix A^T A, which we denote by κ(A^T A).

B. Pre-Conditioning of DGD

Our objective is to improve the convergence rate ρ of DGD beyond ρ_GD by the suitable pre-conditioning proposed in this paper. Let K, referred to as the pre-conditioner, be a square matrix of size n × n. The server now updates its estimate as follows:

    x(t+1) = x(t) − δ K Σ_{i=1}^m g^i(t), ∀t.    (5)

If the matrix product K A^T A is positive definite, then convergence of (5) can be made linear by choosing δ appropriately, and the smallest possible convergence rate for (5) is given by (ref. Chapter 11.3.3 of [7])

    ρ*_K = (κ(K A^T A) − 1) / (κ(K A^T A) + 1).    (6)

However, most of the existing pre-conditioning techniques are not applicable to the distributed framework considered in this paper. The incomplete LU factorization algorithms [8], accelerated iterative methods [9], and the symmetric successive over-relaxation method [9], for computing a matrix K such that κ(K A^T A) is provably smaller than κ(A^T A), require the server to have direct access to the matrices A^i's. Some other pre-conditioning methods [10] require A to be a symmetric positive definite matrix. The recently proposed distributed pre-conditioning scheme for the D-Heavy Ball method in [5] has the same convergence rate as APC. The experimental results in Section III suggest that our proposed scheme converges faster than APC, and hence faster than the said pre-conditioning scheme.

C. Summary of Our Contributions

We propose an iterative pre-conditioner matrix K(t), instead of a constant pre-conditioner matrix K. That is, the server updates its estimate as follows:

    x(t+1) = x(t) − δ K(t) Σ_{i=1}^m g^i(t), ∀t.    (7)

The pre-conditioner K(t) can be computed in a distributed manner in the federated architecture in each iteration t, as is presented in Section II. We show that the iterative process (7) converges provably faster to the optimum point x* than the original DGD algorithm (3). In the experiments, we have also observed that the convergence speed of our proposed algorithm is faster than the accelerated projection-based consensus (APC) method proposed in [5].

Moreover, the computational complexity of the proposed method is the same as the DGD algorithm, which is the smallest amongst the existing distributed algorithms for solving (1). We have formally shown the proposed algorithm to converge faster than the distributed gradient method.
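The DGD iteration (2)-(3) reviewed above can be simulated in a single process. The sketch below is illustrative: the synthetic data, the dimensions, and the step-size δ = 1/λ are assumptions for demonstration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic federated least-squares problem: m agents, agent i holds (A^i, b^i).
m, n_i, n = 4, 5, 3
A_blocks = [rng.standard_normal((n_i, n)) for _ in range(m)]
x_star = rng.standard_normal(n)
b_blocks = [Ai @ x_star for Ai in A_blocks]   # consistent system, so x* solves (1)

A = np.vstack(A_blocks)                        # stacked (N x n) matrix A
lam = np.linalg.eigvalsh(A.T @ A).max()        # largest eigenvalue of A^T A
delta = 1.0 / lam                              # constant step-size, small enough to converge

x = np.zeros(n)                                # server's initial estimate x(0)
for t in range(3000):
    # Each agent i computes g^i(t) = (A^i)^T (A^i x(t) - b^i)   ... eq. (2)
    grads = [Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A_blocks, b_blocks)]
    # Server: x(t+1) = x(t) - delta * sum_i g^i(t)              ... eq. (3)
    x = x - delta * sum(grads)

err = np.linalg.norm(x - x_star)
```

The number of iterations DGD needs grows with κ(A^T A); this dependence on the conditioning is exactly the bottleneck the paper targets.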
In contrast, the conditions under which APC is guaranteed to converge faster than DGD have not been provided in [5]. APC, as well as the pre-conditioning scheme for D-HBM in [5], has an additional computational overhead before the iterations begin.

II. PROPOSED ALGORITHM

In this section, we present our proposed algorithm and theoretically guarantee that the proposed method converges faster than the distributed gradient-descent method (3) for solving the distributed least-squares problem (1). To be able to present our algorithm, we introduce the following notation.

• For a positive integer m, let [m] := {1, ..., m}.
• Let e_j denote the j-th column of the (n × n)-dimensional identity matrix I.
• Let K(t) ∈ R^{n×n} be the pre-conditioner matrix for iteration t, and let k_j(t) ∈ R^n denote the j-th column of K(t).
• For β > 0, define

    R^i_j(t) := ((A^i)^T A^i + (β/m) I) k_j(t) − (1/m) e_j,  j = 1, ..., n, i ∈ [m].    (8)

For each iteration t = 0, 1, ..., the server maintains a pre-conditioner matrix K(t−1) and an estimate x(t) of the point of optimum x*. The initial pre-conditioner matrix K(−1) and estimate x(0) are chosen arbitrarily. The server sends x(t) and K(t−1) to the agents. Each agent i ∈ [m] computes R^i_j(t−1) for j = 1, ..., n as given in (8) and the local gradient g^i(t) as given in (2), and sends these to the server. Then, the server computes the updated pre-conditioner K(t) as given in (9), and uses this updated pre-conditioner K(t) to compute the updated estimate x(t+1) as given by (10).

The proposed method is summarized in Algorithm 1. Note that α, δ, β are positive valued parameters in the algorithm.

Algorithm 1
1:  Initialize x(0) ∈ R^n, K(−1) ∈ R^{n×n}
2:  for t = 0, 1, ... do
3:    The server transmits x(t) and K(t−1) to all the agents i ∈ [m]
4:    for each agent i ∈ [m] do
5:      Compute g^i(t) using (2)
6:      for each j ∈ {1, ..., n} do
7:        Compute R^i_j(t−1) given by (8)
8:      end for
9:      Transmit g^i(t) and {R^i_j(t−1)}_{j=1}^n to the server
10:   end for
11:   for each j ∈ {1, ..., n} do
12:     The server updates each column of K(t):

          k_j(t) = k_j(t−1) − α Σ_{i=1}^m R^i_j(t−1)    (9)

13:   end for
14:   The server updates the estimate:

          x(t+1) = x(t) − δ K(t) Σ_{i=1}^m g^i(t)    (10)

15: end for

A. Computational Complexity

We count the number of floating-point multiplications needed per iteration of Algorithm 1. For each agent i, the computations of {R^i_j(t) | j = 1, ..., n} in (8) are independent of each other. So, each agent i computes them in parallel. Now, (8) can be rewritten as

    R^i_j(t) = (A^i)^T (A^i k_j(t)) + (β/m) k_j(t) − (1/m) e_j.

Two matrix-vector multiplications are required for computing R^i_j(t): A^i k_j(t) and (A^i)^T (A^i k_j(t)), in that order. Here A^i is an (n_i × n)-dimensional matrix, and k_j(t) and A^i k_j(t) are column vectors of dimension n and n_i, respectively. So, a total of (2 n_i n + n) flops are required for computing R^i_j(t). Two matrix-vector multiplications are required for computing g^i(t): A^i x(t) and (A^i)^T (A^i x(t) − b^i), in that order. By a similar argument as for R^i_j(t), computing g^i(t) needs O(n_i n) flops. One matrix-vector multiplication is required for computing x(t+1): K(t) Σ_{i=1}^m g^i(t), which needs O(n_i n) flops. So, the per-iteration computational cost of Algorithm 1 is O(n_i n).

B. Convergence Analysis

To be able to present the convergence of Algorithm 1, we introduce the following notation.

• Let K* := (A^T A + βI)^{-1}. The matrix A^T A is positive semi-definite. Thus, (A^T A + βI) is positive definite for β > 0. So, K* is well-defined.
• Let k*_j be the j-th column of K*, j = 1, ..., n.
• Let λ and γ denote the largest and smallest eigenvalues of A^T A, respectively.
• Let

    ρ*_K := (λ − γ) / (λ + γ + 2β),   ρ*_β := (λ − γ)β / ((λ + γ)β + 2λγ),
    σ_0 := δλ ‖K(−1) − K*‖_F.

We make the following assumption.

Assumption 1: Assume that the matrix A^T A is full rank.

Lemma 1: Consider the iterative process (9), and let β > 0. Under Assumption 1, there exists α > 0 for which there is a positive value ρ_K < 1 such that, for each j = 1, ..., n,

    ‖k_j(t) − k*_j‖ ≤ ρ_K ‖k_j(t−1) − k*_j‖, t = 0, 1, 2, ...,

where ρ_K ≥ ρ*_K.

The above lemma shows that each column of the pre-conditioner matrix K(t) asymptotically converges to the corresponding column of K*. In other words, the matrix K(t) asymptotically converges to K*.
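Algorithm 1 admits a compact single-process simulation: stacking the columns j = 1, ..., n of (8) gives the per-agent matrix ((A^i)^T A^i + (β/m) I) K − (1/m) I, so the server's updates (9)-(10) become matrix updates. The sketch below uses synthetic data; the parameter choices for β, α, δ are illustrative assumptions (α = 1/(λ+β) is a simple safe choice for (9), not the paper's tuned value).

```python
import numpy as np

rng = np.random.default_rng(1)

m, n_i, n = 4, 5, 3
A_blocks = [rng.standard_normal((n_i, n)) for _ in range(m)]
x_star = rng.standard_normal(n)
b_blocks = [Ai @ x_star for Ai in A_blocks]

A = np.vstack(A_blocks)
eigs = np.linalg.eigvalsh(A.T @ A)
gamma, lam = eigs.min(), eigs.max()

beta = 0.1 * gamma                    # any beta > 0 works; this value is illustrative
alpha = 1.0 / (lam + beta)            # step-size for the K-iteration (9), no overshoot
delta = 2.0 / (lam / (lam + beta) + gamma / (gamma + beta))  # near-optimal via Lemma 2

I = np.eye(n)
x = np.zeros(n)                       # x(0)
K = np.zeros((n, n))                  # K(-1), arbitrary initialization

for t in range(500):
    # Agent i: the n columns of (8) stacked into one matrix, plus the gradient (2).
    R_blocks = [(Ai.T @ Ai + (beta / m) * I) @ K - I / m for Ai in A_blocks]
    g_blocks = [Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A_blocks, b_blocks)]
    # Server: pre-conditioner update (9), then estimate update (10).
    K = K - alpha * sum(R_blocks)
    x = x - delta * K @ sum(g_blocks)

err_x = np.linalg.norm(x - x_star)
err_K = np.linalg.norm(K - np.linalg.inv(A.T @ A + beta * I))
```

Consistent with Lemma 1, K(t) approaches K* = (A^T A + βI)^{-1}, and the estimate x(t) approaches x*.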

The smallest and the largest eigenvalues of K* A^T A have a direct influence on the convergence speed of Algorithm 1. The following lemma finds these eigenvalues.

Lemma 2: If Assumption 1 holds and β > 0, then K* A^T A is positive definite, and the largest and the smallest eigenvalues of K* A^T A are λ/(λ + β) and γ/(γ + β), respectively.

The following proposition tells us that Algorithm 1 converges to the point of minimum at a linear rate.

Proposition 1: Consider Algorithm 1 with β > 0. Under Assumption 1, there exists δ > 0 for which there is a positive value ρ_β < 1 such that

    ‖x(t+1) − x*‖ ≤ (ρ_β + σ_0 (ρ*_K)^{t+1}) ‖x(t) − x*‖, ∀t,

where ρ_β ≥ ρ*_β.

From Proposition 1, the rate of convergence of Algorithm 1 depends on the choice of initial value K(−1) through σ_0. The closer K(−1) is to K*, the faster is the rate.

C. Comparison with DGD

Now we are ready to present our key result, which is a formal comparison between the convergence speed of Algorithm 1 and the original DGD method in (3). We define a few necessary notations.

• Let the estimates of the optimum point computed by Algorithm 1 and DGD after t iterations be denoted by x_1(t) and x_2(t), respectively.
• Let z_i(t) = x_i(t) − x*, for i = 1, 2.
• Let E_i(t) denote the known upper bound on ‖z_i(t)‖.

Theorem 1: Consider Algorithm 1 and the DGD algorithm (3) with identical initial estimate x(0) ∈ R^n. If Assumption 1 holds, then there exists t_sw < ∞ such that E_1(t) < E_2(t) for all t > t_sw.

Theorem 1 implies that if the server executes both Algorithm 1 and the original DGD algorithm using the same initial estimate x(0) of x*, then after a certain number of iterations the upper bound on the error-norm generated by Algorithm 1 is smaller than that of the original DGD.

III. EXPERIMENTS

In this section, we present experimental results to verify the obtained theoretical convergence guarantees of Algorithm 1. The matrix A is loaded from one of the real data-sets in the SuiteSparse Matrix Collection (https://sparse.tamu.edu). Our primary focus here is obtaining faster convergence on a problem (1) where the condition number of A^T A is large enough, specifically 5.8 × 10^7, while keeping the computational load minimum.

Example 1. The matrix A is a real symmetric positive definite matrix from the benchmark dataset "bcsstm07", which is part of a structural problem. The dimension of A is 420 × 420. We generate the vector b by setting b = Ax*, where x* is the n = 420 dimensional vector of ones. Since A is positive definite, (1) has a unique solution x*. The data (A, b) is split among m = 10 machines, so that n_i = 42, i = 1, ..., 10.

We apply Algorithm 1 to solve the aforementioned optimization problem. It can be seen that the speed of the algorithm depends on the parametric choices of β, α, δ (ref. Fig. 2a). Also, the algorithm converges to x* irrespective of the initial choice of the entries in x(0) and K(−1) (ref. Fig. 2b).

The experimental results have been compared with the conventional distributed gradient descent (DGD) method and the accelerated projection-based consensus (APC) method [5] (ref. Fig. 3). The pre-conditioning technique for the distributed heavy-ball method in [5] has the same theoretical rate as APC. Besides, it has an additional computational overhead of O(p²n + p³), with p ≤ n_i depending on the given matrix A^i, which is larger than the cost of Algorithm 1 in general.

We compare the number of iterations needed by DGD, APC and Algorithm 1 to reach a relative estimation error (defined as ‖x(t) − x*‖ / ‖x*‖) of 10^{-4} (ref. Table I). The tuning parameter β has been set at 5 for Algorithm 1. The rest of the algorithm parameters have been set such that the respective algorithms have their smallest possible convergence rates. Specifically, (α* = 3.17 × 10^{-7}, δ* = 1.95) for Algorithm 1, (γ* = 1.08, η* = 12.03) for APC, and δ* = 3.17 × 10^{-7} for DGD. We found that Algorithm 1 performs the fastest among these algorithms. Note that evaluating the optimal tuning parameters for any of these algorithms requires knowledge of the smallest and largest eigenvalues of A^T A.

We notice that the error norm ‖x(t) − x*‖ for Algorithm 1 is less than that of DGD from an approximate iteration index of 300 onward (ref. Fig. 3b). Since x(0) is the same for both of the algorithms, this observation is in agreement with the claim in Theorem 1. Similarly, Algorithm 1 is faster than APC after 1114 iterations (ref. Fig. 3b).

[Fig. 2: Temporal evolution of the error norm ‖x(t) − x*‖ for Example 1, under Algorithm 1 with different parameter choices and initializations. (a) x(0) = [0, ..., 0]^T; (b) β = 100, α = 2 × 10^{-7}, δ = 1.]

IV. SUMMARY

In this paper, we have proposed an algorithm for solving distributed linear least-squares minimization problems over a federated architecture with minimum computational load. However, the algorithm can also be emulated in a rooted peer-to-peer network. The vital contribution lies in mitigating the detrimental impact of ill-conditioning on the convergence of the traditional gradient descent method. The computation of the pre-conditioner is done at the server level without requiring any access to the data.

[Fig. 3: Temporal evolution of the error norm ‖x(t) − x*‖ against the number of iterations for Example 1, under Algorithm 1, DGD and APC [5], with (a) arbitrary parameter choices and (b) optimal parameter choices. Initialization for both (a) and (b): (Algorithm 1) x(0) = [0, ..., 0]^T, K(−1) = O_{n×n}; (DGD) x(0) = [0, ..., 0]^T; (APC) according to the algorithm. In (a): (Algorithm 1) β = 100, α = 2 × 10^{-7}, δ = 1; (DGD) δ = 10^{-7}; (APC) γ = η = 1. In (b): (Algorithm 1) β = 5, α = 3.17 × 10^{-7}, δ = 1.95; (DGD) δ = 3.17 × 10^{-7}; (APC) γ* = 1.08, η* = 12.03.]

TABLE I

Comparison between DGD, APC and Algorithm 1 with optimal parameters on "bcsstm07", regarding (a) the number of iterations to attain a relative estimation error of 10^{-4}, and (b) the decay rate of the instantaneous error-norm. Here, cond(A^T A) = 5.8 × 10^7.

    Algorithm              DGD       APC           Algo. 1 (β = 5, K(−1) = O_{n×n})
    (a) Iterations needed  > 10^5    4.85 × 10^4   2.11 × 10^4
    (b) Rate of decrease   0.9999    0.9672        0.9583 + 4.3 × 10^8 (0.9999)^{t+1}
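The qualitative behavior in Fig. 3 and Table I can be reproduced on a small synthetic stand-in for the real data-set. Everything below is illustrative: the 4 × 4 matrix, the split into m = 2 agents, and the parameter values are assumptions, not the paper's bcsstm07 setup.

```python
import numpy as np

# Small ill-conditioned stand-in (cond(A^T A) = 1e4 here, not the 420 x 420 data).
A = np.diag([10.0, 5.0, 2.0, 0.1])
n = 4
x_star = np.ones(n)
b = A @ x_star
blocks = [(A[:2], b[:2]), (A[2:], b[2:])]      # m = 2 agents

eigs = np.linalg.eigvalsh(A.T @ A)
gamma, lam = eigs.min(), eigs.max()

# DGD (3) with its optimal constant step-size 2/(lam + gamma).
delta_gd = 2.0 / (lam + gamma)
# Algorithm 1 with illustrative parameters; delta uses the eigenvalues from Lemma 2.
beta = 1.0
alpha = 1.0 / (lam + beta)
delta1 = 2.0 / (lam / (lam + beta) + gamma / (gamma + beta))

I = np.eye(n)
x_dgd, x_alg1, K = np.zeros(n), np.zeros(n), np.zeros((n, n))

for t in range(500):
    # DGD update (3).
    x_dgd = x_dgd - delta_gd * sum(Ai.T @ (Ai @ x_dgd - bi) for Ai, bi in blocks)
    # Algorithm 1: agents' quantities (8) and (2), then server updates (9) and (10).
    R = sum((Ai.T @ Ai + (beta / 2) * I) @ K - I / 2 for Ai, _ in blocks)
    g = sum(Ai.T @ (Ai @ x_alg1 - bi) for Ai, bi in blocks)
    K = K - alpha * R
    x_alg1 = x_alg1 - delta1 * K @ g

err_dgd = np.linalg.norm(x_dgd - x_star)
err_alg1 = np.linalg.norm(x_alg1 - x_star)
```

After the same number of iterations from the same x(0), the iterative pre-conditioning error is orders of magnitude below the DGD error, in line with Theorem 1.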

In practice, we test the algorithm on a real data-set with a significant condition number (cond(A^T A) = 5.8 × 10^7) and obtain much better performance compared to the classical distributed gradient algorithm as well as the recently proposed accelerated projection-based consensus algorithm (APC), regarding the number of iterations needed for convergence to the true solution. We have formally shown the proposed algorithm to converge faster than the distributed gradient method, while the APC method only speculates it to be faster.

ACKNOWLEDGEMENTS

This work is being carried out as a part of the Pipeline System Integrity Management Project, which is supported by the Petroleum Institute, Khalifa University of Science and Technology, Abu Dhabi, UAE. Nirupam Gupta was sponsored by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196.

REFERENCES

[1] Nancy A. Lynch. Distributed Algorithms. Elsevier, 1996.
[2] Andrew S. Tanenbaum. Network protocols. ACM Computing Surveys (CSUR), 13(4):453–489, 1981.
[3] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
[4] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[5] Navid Azizan-Ruhi, Farshad Lahouti, Amir Salman Avestimehr, and Babak Hassibi. Distributed solution of large-scale linear systems via accelerated projection-based consensus. IEEE Transactions on Signal Processing, 67(14):3806–3817, 2019.
[6] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Sov. Math. Doklady, 27(2):372–376, 1983.
[7] Jeffrey A. Fessler. Image reconstruction: Algorithms and analysis. http://web.eecs.umich.edu/~fessler/book/c-opt.pdf. [Online book draft; accessed 17-February-2020].
[8] J. A. Meijerink and Henk A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Mathematics of Computation, 31(137):148–162, 1977.
[9] Owe Axelsson. A survey of preconditioned iterative methods for linear systems of algebraic equations. BIT Numerical Mathematics, 25(1):165–187, 1985.
[10] Michele Benzi. Preconditioning techniques for large linear systems: a survey. Journal of Computational Physics, 182(2):418–477, 2002.

APPENDIX

A. Proof of Lemma 1

From (8) and noticing that Σ_{i=1}^m R^i_j(t−1) = (A^T A + βI) k_j(t−1) − e_j, dynamics (9) can be rewritten as

    k_j(t) = k_j(t−1) − α [(A^T A + βI) k_j(t−1) − e_j].    (11)

Define k̃_j(t) := k_j(t) − k*_j. From the definition of K* we have (A^T A + βI) K* = I, which implies (A^T A + βI) k*_j = e_j, j = 1, ..., n. From (11),

    k̃_j(t) = [I − α (A^T A + βI)] k̃_j(t−1).    (12)

Since (A^T A + βI) is positive definite for β > 0, there exists α for which there is a positive ρ_K < 1 such that ‖k̃_j(t)‖ ≤ ρ_K ‖k̃_j(t−1)‖, t = 0, 1, 2, ..., where the smallest value of ρ_K is (κ(A^T A + βI) − 1)/(κ(A^T A + βI) + 1) (ref. Corollary 11.3.3 and Chapter 11.3.3 of [7]). As κ(A^T A + βI) = (λ + β)/(γ + β), the claim follows.

B. Proof of Lemma 2

Consider the eigendecomposition A^T A = U Σ U^T, where U is the eigenvector matrix of A^T A and Σ is a diagonal matrix with the eigenvalues of A^T A on the diagonal. Since K* := (A^T A + βI)^{-1} and the identity matrix can be written as I = U U^T, we have

    K* = U diag{1/(λ_k + β)}_{k=1}^n U^T
    ⟹ K* A^T A = U diag{1/(λ_k + β)}_{k=1}^n U^T · U diag{λ_k}_{k=1}^n U^T
    ⟹ K* A^T A = U diag{λ_k/(λ_k + β)}_{k=1}^n U^T,

where λ_k > 0 are the eigenvalues of the positive definite matrix A^T A. Since (d/dλ_k)(λ_k/(λ_k + β)) > 0, the claim follows.

C. Proof of Proposition 1

Using (2) and observing that A^T A = Σ_{i=1}^m (A^i)^T A^i, dynamics (10) can be rewritten as

    x(t+1) = x(t) − δ K(t) A^T (A x(t) − b).    (13)

Define z(t) := x(t) − x*. The objective cost in (1) can be rewritten as (1/2)‖Ax − b‖², the gradient of which is given by A^T A x − A^T b. Thus, x* satisfies A^T A x* = A^T b. From (13) and A^T A x* = A^T b, we get

    z(t+1) = z(t) − δ K(t) A^T A (x(t) − x*)
           = (I − δ K(t) A^T A) z(t)
           = (I − δ K* A^T A) z(t) − δ K̃(t) A^T A z(t)
           = (I − δ K* A^T A) z(t) − δ u(t),    (14)

where K̃(t) comprises the columns k̃_j(t) := k_j(t) − k*_j, j = 1, ..., n, and u(t) := K̃(t) A^T A z(t). From Lemma 1, it follows that

    ‖k̃_j(t)‖² ≤ (ρ*_K)^{2t+2} ‖k̃_j(−1)‖²
    ⟹ ‖K̃(t)‖²_F ≤ (ρ*_K)^{2t+2} ‖K̃(−1)‖²_F.    (15)

Now,

    ‖u(t)‖ ≤ ‖K̃(t)‖ ‖A^T A‖ ‖z(t)‖
           ≤ ‖K̃(t)‖_F ‖A^T A‖ ‖z(t)‖
           ≤ λ ‖K̃(−1)‖_F (ρ*_K)^{t+1} ‖z(t)‖,    (16)

where the last inequality follows from (15). From (14) and (16),

    ‖z(t+1)‖ ≤ (‖I − δ K* A^T A‖ + σ_0 (ρ*_K)^{t+1}) ‖z(t)‖, ∀t.    (17)

Since K* A^T A is positive definite for β > 0 (from Lemma 2), there exists δ for which ρ_β := ‖I − δ K* A^T A‖ < 1 and (17) holds. The smallest value of ρ_β is (κ(K* A^T A) − 1)/(κ(K* A^T A) + 1). Then, with the largest and the smallest eigenvalues of K* A^T A obtained in Lemma 2, the claim follows.

D. Proof of Theorem 1

Define s_t := ρ*_β + σ_0 (ρ*_K)^{t+1}. It can be easily checked that ρ*_β < ρ_GD for β > 0. Since ρ*_K < 1 and ρ*_β < ρ_GD, there exists τ < ∞ such that s_t < ρ_GD, ∀t > τ. Define r_t := s_t / ρ_GD. Then r_t < 1, ∀t > τ. From the recursion in Proposition 1, we have

    ‖z_1(t+1)‖ ≤ (Π_{k=τ+1}^t s_k) ‖z_1(τ+1)‖, ∀t > τ.    (18)

Since {s_t > 0}_{t≥0} is a strictly decreasing sequence, from the same recursion we also have

    ‖z_1(τ+1)‖ ≤ s_τ ‖z_1(τ)‖ ≤ s_0 ‖z_1(τ)‖ ⟹ ‖z_1(τ+1)‖ ≤ s_0^{τ+1} ‖z(0)‖.    (19)

Combining (18) and (19),

    ‖z_1(t+1)‖ ≤ (Π_{k=τ+1}^t s_k) s_0^{τ+1} ‖z(0)‖ ≤ (s_{τ+1})^{t−τ} s_0^{τ+1} ‖z(0)‖
               = (r_{τ+1} ρ_GD)^{t−τ} s_0^{τ+1} ‖z(0)‖
               = c_1 (r_{τ+1} ρ_GD)^{t+1} ‖z(0)‖, ∀t > τ,

where c_1 = (s_0 / (r_{τ+1} ρ_GD))^{τ+1} is a constant. Thus, we have

    E_1(t+1) = c_1 (r_{τ+1} ρ_GD)^{t+1} ‖z(0)‖ = c_1 (r_{τ+1})^{t+1} E_2(t+1), ∀t > τ.

Since r_{τ+1} < 1 and c_1 is a constant, there exists t_sw < ∞ such that c_1 (r_{τ+1})^{t+1} < 1, ∀t > t_sw. This completes the proof.