LAPACK Working Note 56

Conjugate Gradient Algorithms with Reduced Synchronization Overhead on Distributed Memory Multiprocessors

E. F. D'Azevedo†, V. L. Eijkhout‡, and C. H. Romine†

December 3, 1999

Abstract

The standard formulation of the conjugate gradient algorithm involves two inner product computations. The results of these two inner products are needed to update the search direction and the computed solution. Since these inner products are mutually interdependent, in a distributed memory parallel environment their computation and subsequent distribution requires two separate communication and synchronization phases. In this paper, we present three related, mathematically equivalent rearrangements of the standard algorithm that reduce the number of communication phases. We present empirical evidence that two of these rearrangements are numerically stable. This claim is further substantiated by a proof that one of the empirically stable rearrangements arises naturally in the symmetric Lanczos method for linear systems, which is equivalent to the conjugate gradient method.

1 Introduction

The conjugate gradient (CG) method is an effective iterative method for solving large sparse symmetric positive definite systems of linear equations. It is robust and, coupled with an effective preconditioner [18], is generally able to achieve rapid convergence to an accurate solution.



*This work was supported in part by DARPA under contract number DAAL03-91-C-0047 and in part by DOE under contract number DE-AC05-84OR21400.

†Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367.

‡Department of Computer Science, University of Tennessee, 104 Ayres Hall, Knoxville, TN 37996.

One drawback of the standard formulation of the conjugate gradient algorithm on distributed memory parallel machines is that it involves the computation of two separate inner products of distributed vectors. Moreover, the first inner product must be completed before the data are available for computing the second inner product. Hence, a distributed memory implementation of the standard conjugate gradient method has two separate communication phases for these two inner products. Since communication is quite expensive on the current generation of distributed memory multiprocessors, it is desirable to reduce the communication overhead by combining these two communication phases into one.

Saad [15, 16] has shown one rearrangement of the computation that eliminates a communication phase by computing $\|r_{k+1}\|_2$ from the relationship
$$\|r_{k+1}\|_2^2 = \alpha_k^2\,\|Ap_k\|_2^2 - \|r_k\|_2^2, \qquad (1)$$
but this rearrangement was found to be numerically unstable. Meurant [10] proposed using (1) as a predictor for $\|r_{k+1}\|_2$ and reevaluating the actual norm on the next iteration with an extra inner product. Van Rosendale [19] has proposed (without numerical results) an $m$-step conjugate gradient algorithm to increase parallelism.

The conjugate gradient algorithm is known to be closely related to the Lanczos algorithm for tridiagonalizing a symmetric matrix [4]. Paige [11, 12, 13] has done detailed analysis to show that some variants of the Lanczos method are unstable. Strakos [17] and Greenbaum [5, 6] have considered the close connection between the Lanczos and CG algorithms in the analysis of the stability of CG computations under perturbations in finite arithmetic.

In §2, we present a rearrangement of the conjugate gradient computation that eliminates one communication phase by computing both inner products at once. We show a natural association between this rearrangement and the Lanczos algorithm in §3. A discussion of how this rearrangement of the computation affects the stability properties of the conjugate gradient algorithm, and some MATLAB numerical experiments on the effectiveness of the rearrangement, are included in §4.

2 The conjugate gradient algorithm

In this section we present several variants of the conjugate gradient algorithm, based on elimination of the $A$-inner product of the search directions.

2.1 The standard formulation

We begin by reviewing the standard conjugate gradient procedure [2, 8] for solving the linear system
$$Ax = b. \qquad (2)$$
For simplicity, we assume a zero initial guess and residual vector $r_1 = b$, with $\langle x, y\rangle = x^t y$ as the usual inner product.

For $k = 1, 2, \ldots$:
$$\begin{aligned}
\rho_k &= \langle r_k, r_k\rangle \\
\beta_k &= \rho_k/\rho_{k-1} \quad (\beta_1 = 0) \\
p_k &= r_k + \beta_k p_{k-1} \quad (p_1 = r_1) & (3)\\
v_k &= A p_k \\
\sigma_k &= \langle p_k, v_k\rangle \\
\alpha_k &= \rho_k/\sigma_k \\
x_{k+1} &= x_k + \alpha_k p_k \\
r_{k+1} &= r_k - \alpha_k v_k. & (4)
\end{aligned}$$
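For concreteness, the following MATLAB sketch of this loop (ours; dense storage, no preconditioning, with an added stopping test) makes the data dependence explicit: $\sigma_k$ cannot be formed until $p_k$ is known, which in turn requires $\rho_k$, so on a distributed memory machine the two inner products become two separate global reductions.

    % Standard CG for symmetric positive definite A (illustrative sketch).
    % The two inner products are mutually dependent, so a distributed
    % implementation needs two separate reduction/synchronization phases.
    function x = cg_standard(A, b, maxit, tol)
      x = zeros(size(b));
      r = b;                          % zero initial guess: r_1 = b
      p = zeros(size(b));
      rho_old = 1;                    % unused on the first pass (beta_1 = 0)
      for k = 1:maxit
        rho = r' * r;                 % first inner product  (reduction 1)
        if k == 1, beta = 0; else, beta = rho / rho_old; end
        p = r + beta * p;             % update search direction, (3)
        v = A * p;
        sigma = p' * v;               % second inner product (reduction 2)
        alpha = rho / sigma;
        x = x + alpha * p;
        r = r - alpha * v;            % update residual, (4)
        rho_old = rho;
        if norm(r) <= tol * norm(b), break; end
      end
    end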

Saad [15, 16] and Meurant [10] have considered eliminating the first inner product, $\rho_k = \langle r_k, r_k\rangle$. We propose eliminating the second communication phase by finding alternative expressions for $\sigma_k$.

We will rely on the following intrinsic properties of the CG procedure: the orthogonality of the residual vectors and (equivalently) the conjugacy of the search directions [8, page 420],
$$\frac{\langle p_k, Ap_{k+1}\rangle}{\langle p_k, Ap_k\rangle} = \frac{\langle r_k, r_{k+1}\rangle}{\langle r_k, r_k\rangle} = 0, \qquad (5)$$
and the fact that
$$r_i^t A r_j = 0 \quad \text{for } |i - j| > 1. \qquad (6)$$

2.2 Rearranged method 1

We derive the first rearrangement by expanding $p_k^t A p_k$, substituting $p_k = r_k + \beta_k p_{k-1}$ for both occurrences of $p_k$:

k k k 1 k

 = h p ;v i = h p ;Ap i

k k k k k

= h r + p ;Ar + v i

k k k 1 k k k 1

= h r ;Ar i + hr ;v i +

k k k k k 1

2

hp ;v i hp ;Ar i +

k 1 k 1 k k 1 k

k

2

 = h r ;Ar i +2 hr ;v i +  : 7

k k k k k k 1 k 1

k

From (4) and (5):
$$\begin{aligned}
r_k &= r_{k-1} - \alpha_{k-1} v_{k-1} \\
\langle r_k, r_k\rangle &= \langle r_k, r_{k-1}\rangle - \alpha_{k-1}\langle r_k, v_{k-1}\rangle \\
&= 0 - \alpha_{k-1}\langle r_k, v_{k-1}\rangle. & (8)
\end{aligned}$$

Therefore, by (7), (8), and $\beta_k = \rho_k/\rho_{k-1}$,
$$\begin{aligned}
\sigma_k &= \langle r_k, Ar_k\rangle + 2\beta_k\left(-\rho_k/\alpha_{k-1}\right) + \beta_k^2\,\sigma_{k-1} \\
\sigma_k &= \mu_k - \beta_k^2\,\sigma_{k-1}, \quad\text{where } \mu_k = \langle r_k, Ar_k\rangle, & (9)
\end{aligned}$$
since $\rho_k/\alpha_{k-1} = \beta_k\,\rho_{k-1}/\alpha_{k-1} = \beta_k\,\sigma_{k-1}$.
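Recurrence (9) is easy to check numerically. The following MATLAB fragment (ours; the random test matrix is an arbitrary choice) runs the standard iteration while computing $\sigma_k$ recursively and compares it with the explicit inner product $\langle p_k, Ap_k\rangle$.

    % Check recurrence (9): sigma_k = mu_k - beta_k^2 * sigma_{k-1}.
    n = 200; rng(0);
    Q = orth(randn(n));                      % random orthogonal eigenbasis
    A = Q * diag(linspace(1, 1e4, n)) * Q';  % SPD test matrix (assumed)
    b = randn(n, 1);
    r = b; p = r; v = A * p;
    rho = r' * r; sigma = p' * v; alpha = rho / sigma;
    r = r - alpha * v;
    for k = 2:30
      rho_old = rho; sigma_old = sigma;
      rho = r' * r;  mu = r' * (A * r);
      beta = rho / rho_old;
      sigma = mu - beta^2 * sigma_old;       % recurrence (9)
      p = r + beta * p;  v = A * p;
      fprintf('k = %2d   rel. discrepancy = %.2e\n', ...
              k, abs(sigma - p' * v) / abs(sigma));
      alpha = rho / sigma;
      r = r - alpha * v;
    end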

2.3 Rearranged methods 2 and 3

If we expand the occurrences of $p_k$ in $p_k^t A p_k$ one at a time, we find different arrangements of the algorithm. Using the conjugacy in (5),
$$\begin{aligned}
\sigma_k &= \langle p_k, Ap_k\rangle \\
&= \langle r_k + \beta_k p_{k-1},\; Ap_k\rangle = \langle r_k, Ap_k\rangle \\
&= \langle r_k,\; A(r_k + \beta_k p_{k-1})\rangle \\
&= \langle r_k, Ar_k\rangle + \beta_k\langle r_k, Ap_{k-1}\rangle \\
\sigma_k &= \mu_k + \beta_k\,\nu_k^{(1)}, \quad\text{where } \nu_k^{(1)} = \langle r_k, Ap_{k-1}\rangle. & (10)
\end{aligned}$$

Thus we find a rearrangement of the conjugate gradient method where we compute the inner products $r_k^t r_k$, $r_k^t A r_k$, and $r_k^t A p_{k-1}$ simultaneously, and compute $\sigma_k = p_k^t A p_k$ by recurrence formula (10).

Successive expansion of $p_k = r_k + \beta_k p_{k-1}$ gives
$$p_k = r_k + \beta_k r_{k-1} + \beta_k\beta_{k-1} r_{k-2} + \cdots.$$
Using equation (6) we then find
$$\begin{aligned}
\sigma_k &= \cdots \\
&= \langle r_k,\; A(r_k + \beta_k r_{k-1} + \beta_k\beta_{k-1} r_{k-2} + \cdots)\rangle \\
&= \langle r_k, Ar_k\rangle + \beta_k\langle r_k, Ar_{k-1}\rangle. & (11)
\end{aligned}$$

This gives us a rearrangement of the conjugate gradient method where we compute the inner products $r_k^t r_k$, $r_k^t A r_k$, and $r_k^t A r_{k-1}$ simultaneously, and compute $\sigma_k = p_k^t A p_k$ by recurrence formula (11).

2.4 The modified conjugate gradient method

We propose the following rearrangement of the conjugate gradient procedure. First initialize $\sigma_1$ and $v_1$ by performing one step of the standard algorithm:
$$r_1 = b;\quad \rho_1 = \langle r_1, r_1\rangle;\quad p_1 = r_1;\quad v_1 = Ap_1;$$
$$\sigma_1 = \langle p_1, v_1\rangle;\quad x_2 = (\rho_1/\sigma_1)\,p_1;\quad r_2 = r_1 - (\rho_1/\sigma_1)\,v_1.$$

For $k = 2, 3, \ldots$:
$$\begin{aligned}
& s_k = A r_k \\
& \text{compute } \rho_k = \langle r_k, r_k\rangle \text{ and } \mu_k = \langle r_k, s_k\rangle \\
& \quad\text{(and } \nu_k^{(1)} = \langle r_k, v_{k-1}\rangle \text{ for method 2 only;} \\
& \quad\ \text{and } \nu_k^{(2)} = \langle r_k, s_{k-1}\rangle \text{ for method 3 only)} \\
& \beta_k = \rho_k/\rho_{k-1} \\
& p_k = r_k + \beta_k p_{k-1} \\
& v_k = s_k + \beta_k v_{k-1} \quad (v_k \equiv Ap_k) & (12)\\
& \sigma_k = \begin{cases}
  \mu_k - \beta_k^2\,\sigma_{k-1} & \text{for method 1} \\
  \mu_k + \beta_k\,\nu_k^{(1)} & \text{for method 2} \\
  \mu_k + \beta_k\,\nu_k^{(2)} & \text{for method 3}
\end{cases} \\
& \alpha_k = \rho_k/\sigma_k \\
& x_{k+1} = x_k + \alpha_k p_k \\
& r_{k+1} = r_k - \alpha_k v_k.
\end{aligned}$$

Note that the above procedure requires extra storage for the vector $s_k$ and extra work in updating the vector $v_k$ for all three methods. Methods 2 and 3 require an additional inner product and storage for $v_{k-1}$ and $s_{k-1}$, respectively.
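A MATLAB sketch of the method 1 variant follows (ours). In a distributed implementation, the two inner products marked below would be combined into a single reduction, e.g. one all-reduce on the pair $(\rho_k, \mu_k)$.

    % Modified CG, method 1: one reduction phase per iteration.
    function x = cg_modified1(A, b, maxit)
      r = b;  p = r;  v = A * p;            % one standard step to start
      rho = r' * r;  sigma = p' * v;
      alpha = rho / sigma;
      x = alpha * p;
      r = r - alpha * v;
      for k = 2:maxit
        s = A * r;
        rho_old = rho;  sigma_old = sigma;
        rho = r' * r;                       % these two inner products can
        mu  = r' * s;                       % be combined in one reduction
        beta = rho / rho_old;
        p = r + beta * p;
        v = s + beta * v;                   % v = A*p by linearity, (12)
        sigma = mu - beta^2 * sigma_old;    % method 1 recurrence, (9)
        alpha = rho / sigma;
        x = x + alpha * p;
        r = r - alpha * v;
      end
    end

Methods 2 and 3 differ only in the $\sigma_k$ update and in retaining $v_{k-1}$ or $s_{k-1}$ for the extra inner product, which can join the same reduction.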

3 Stability proof of method 1

In this section we present a proof of the stability of the first method, based on the equivalence of (7) to a stable recurrence arising from the symmetric Lanczos process. The outline of the proof is as follows.

First we recall the fact that the conjugate gradient method is a tridiagonalization procedure. Recurrence relation (7) in method 1 is then shown to be equivalent to the pivot recurrence for the factorization of this tridiagonal matrix. The cornerstone of the proof is that the symmetric tridiagonal matrix produced by the symmetric Lanczos process gives the same pivot recurrence. If the original coefficient matrix is symmetric positive definite, this recurrence is stable; hence the scalar recurrence in rearranged method 1 is stable.

The two basic vector updates of the conjugate gradient method,
$$r_{k+1} = r_k - \alpha_k A p_k, \qquad p_{k+1} = r_{k+1} + \beta_{k+1} p_k,$$
can be summarized as
$$R(I - J) = A P D_\alpha, \qquad P U = R,$$
where $R$ and $P$ are matrices containing the vector sequences $\{r_k\}$ and $\{p_k\}$ as columns, and
$$J_{ij} = \begin{cases} 1 & \text{if } i = j + 1 \\ 0 & \text{otherwise,} \end{cases} \qquad
D_\alpha = \operatorname{diag}(\alpha_k), \qquad
u_{ij} = \begin{cases} 1 & \text{if } i = j \\ -\beta_j & \text{if } i + 1 = j \\ 0 & \text{otherwise.} \end{cases}$$
With $H = (I - J)\,D_\alpha^{-1}\,U$, a tridiagonal matrix, we can write this as $AR = RH$.

For the elements of $H$ we find
$$h_{nn} = \frac{r_n^t A r_n}{r_n^t r_n}, \qquad h_{n+1,n} = h_{n,n+1}\,\frac{r_n^t r_n}{r_{n+1}^t r_{n+1}}, \qquad (13)$$
where we have used only orthogonality properties and the symmetry of $A$.

The pivots $d_{nn}$ in the factorization of $H$ satisfy the recurrence
$$d_{n+1,n+1} = h_{n+1,n+1} - h_{n+1,n}\,d_{nn}^{-1}\,h_{n,n+1},$$
or, using $d_{nn} = p_n^t A p_n / r_n^t r_n$ and equation (13),
$$\frac{p_{n+1}^t A p_{n+1}}{r_{n+1}^t r_{n+1}} = \frac{r_{n+1}^t A r_{n+1}}{r_{n+1}^t r_{n+1}} - h_{n,n+1}^2\,\frac{r_n^t r_n}{r_{n+1}^t r_{n+1}}\cdot\frac{r_n^t r_n}{p_n^t A p_n}.$$
Multiplying through by $r_{n+1}^t r_{n+1}$ and using the fact that $h_{n,n+1} = -\beta_{n+1}\,\alpha_n^{-1}$:
$$\begin{aligned}
p_{n+1}^t A p_{n+1} &= r_{n+1}^t A r_{n+1} - h_{n,n+1}^2\,(r_n^t r_n)^2\,(p_n^t A p_n)^{-1} \\
&= r_{n+1}^t A r_{n+1} - \beta_{n+1}^2\,\alpha_n^{-2}\,(r_n^t r_n)^2\,(p_n^t A p_n)^{-1} \\
&= r_{n+1}^t A r_{n+1} - \beta_{n+1}^2\,p_n^t A p_n.
\end{aligned}$$

This shows that the recursive computation of $p_n^t A p_n$ in rearranged method 1 is equivalent, by mere formula substitution, to the pivot recursion for $H$. We gloss over the fact that formulas (13) are based on properties of the conjugate gradient method formulated as a three-term recurrence, instead of as the usual coupled pair of two-term recurrences. We assume comparable numerical stability properties for these two variants of the basic method.
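The equivalence is also easy to observe numerically: the pivots of the factorization of the Lanczos tridiagonal matrix equal $\sigma_n/\rho_n = 1/\alpha_n$. The fragment below (ours) builds $T$ from the CG coefficients via the standard correspondence $t_{nn} = 1/\alpha_n + \beta_n/\alpha_{n-1}$ (with $\beta_1/\alpha_0 = 0$) and $t_{n,n+1} = \sqrt{\beta_{n+1}}/\alpha_n$, and compares its pivots with $1/\alpha_n$.

    % Pivots of the Lanczos tridiagonal T equal 1/alpha_n from CG.
    n = 100; m = 20; rng(1);
    A = diag(linspace(1, 1e3, n));          % diagonal SPD test matrix (assumed)
    b = randn(n, 1);
    r = b; p = r; rho = r' * r;
    alf = zeros(m, 1); bet = zeros(m, 1);   % bet(k) stores beta_{k+1}
    for k = 1:m
      v = A * p;
      alf(k) = rho / (p' * v);
      r = r - alf(k) * v;
      rho_new = r' * r;
      bet(k) = rho_new / rho;  rho = rho_new;
      p = r + bet(k) * p;
    end
    d = 1 ./ alf;                                   % diagonal of T
    d(2:m) = d(2:m) + bet(1:m-1) ./ alf(1:m-1);
    e = sqrt(bet(1:m-1)) ./ alf(1:m-1);             % off-diagonal of T
    piv = zeros(m, 1); piv(1) = d(1);               % pivot recurrence
    for k = 2:m
      piv(k) = d(k) - e(k-1)^2 / piv(k-1);
    end
    disp(max(abs(piv - 1 ./ alf)));     % agrees up to rounding error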

The Lanczos method for constructing an orthonormal matrix $Q$ that reduces $A$ to symmetric tridiagonal form $T$ by $Q^t A Q = T$ is easily constructed from the conjugate gradient method by letting
$$q_n = c_{nn}^{-1}\,r_n, \quad\text{where } c_{nn} = \|r_n\|.$$
Hence the orthonormal tridiagonalization can be written as
$$AQ = QT \quad\text{with } Q = RC^{-1},\; T = CHC^{-1}.$$
Using the fact that $H$ is tridiagonal and that $H = C^{-1}TC$ with $C$ diagonal, we find that the factorization of $T$ generates the same pivot sequence. If $A$, and therefore $T$, is symmetric positive definite, this pivot recursion, and therefore recurrence (7), is stable.

4 Numerical experiments on stability

The aim of the following experiments is to determine the stability and convergence properties of the modified conjugate gradient procedures.

We performed a number of MATLAB experiments in solving $Ax = b$ by the conjugate gradient procedure, to study the convergence behavior on different distributions of eigenvalues of the preconditioned matrix. In Eijkhout's rearrangement, $\langle r_k, v_{k-1}\rangle$ is computed by an extra inner product. Meurant's rearrangement is taken from [10], and the Lanczos rearrangement is adapted from [4, page 342] by evaluating the two inner products for $\tilde\alpha_j$ together, as $\tilde\alpha_j = \langle\tilde r_j, A\tilde r_j\rangle / \langle\tilde r_j, \tilde r_j\rangle$.

Test 1

The matrices considered have the eigenspectrum used by Strakos [17] and Greenbaum and Strakos [6]:
$$\lambda_i = \lambda_1 + \frac{i-1}{n-1}\,(\lambda_n - \lambda_1)\,\gamma^{\,n-i}, \quad i = 2, \ldots, n, \quad \gamma \in (0, 1]. \qquad (14)$$
We have used $n = 100$, $\lambda_1 = 10^{-3}$, $\kappa = \lambda_n/\lambda_1 = 10^5$, and $\gamma = 0.6, 0.8, 0.9, 1.0$ in the experiments. For $\gamma = 1$ we have a uniformly distributed spectrum, and $\gamma < 1$ describes quantitatively the clustering at $\lambda_1$.
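In MATLAB, the Test 1 operator can be generated directly as a diagonal matrix (a sketch; the variable names are ours):

    % Strakos spectrum (14): clustered near lambda_1 for gam < 1.
    n = 100; lam1 = 1e-3; kappa = 1e5; gam = 0.8;
    lamn = kappa * lam1;
    i = (2:n)';
    lam = [lam1; lam1 + ((i-1)/(n-1)) .* (lamn - lam1) .* gam.^(n-i)];
    A = diag(lam);                  % the experiments operate on diagonal A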

Test 2

The eigenspectrum has a gap: $\{1, \ldots, 50, 10051, \ldots, 10100\}$.

Test 3

The eigenspectrum has double eigenvalues: $\{1, 1, 2, 2, \ldots, 50, 50\}$.

Test 4

The eigenspectrum consists of the roots of the Chebyshev polynomial $T_n(x)$, shifted from $[-1, 1]$ to the interval $[a, b]$:
$$\lambda_i = \frac{b-a}{2}\,\cos\!\left(\frac{(2i-1)\pi}{2n}\right) + \frac{b+a}{2}, \quad i = 1, \ldots, n. \qquad (15)$$
We have used $n = 100$, $a = 1$, $b = 10^5$.

As done in Hageman and Young [7], Greenbaum [5], and Strakos [17], we operate on diagonal matrices. This procedure is equivalent to representing all vectors over the basis of eigenvectors of the matrix $A$. In all cases, a random right-hand side (uniform over $[-1, 1]$) and zero initial guess are used.

We display the decrease of the $A$-norm of the error at each iteration, divided by the $A$-norm of the initial error:
$$\frac{\langle\tilde x - x_k,\; A(\tilde x - x_k)\rangle^{1/2}}{\langle\tilde x - x_0,\; A(\tilde x - x_0)\rangle^{1/2}}, \qquad \tilde x = A^{-1}b. \qquad (16)$$
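The quantity (16) can be tracked alongside any of the iterations; a small sketch (ours), assuming $A$ and $b$ are already defined:

    % Relative A-norm of the error, equation (16).
    xs = A \ b;                                    % exact solution
    enorm = @(x) sqrt((xs - x)' * (A * (xs - x))); % A-norm of the error
    e0 = enorm(zeros(size(b)));                    % zero initial guess
    % after each CG update of x:  err(k) = enorm(x) / e0;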

[Figures 1-10 plot the relative $A$-norm of the error, (16), from $10^3$ down to $10^{-12}$ on a log scale, against the iteration number, 0 to 200, for $n = 100$.]

Figure 1: Classical CG on Test 1 ($\kappa = 10^5$). Dashed curve: $\gamma = 0.6$; dotted curve: $\gamma = 0.8$; dash-dot curve: $\gamma = 0.9$; solid curve: $\gamma = 1$.

Figure 2: Modified CG on Test 1. Curves as in Figure 1.

Figure 3: Eijkhout Rearrangement on Test 1. Curves as in Figure 1.

Figure 4: Meurant Rearrangement on Test 1. Curves as in Figure 1.

Figure 5: Lanczos Rearrangement on Test 1. Curves as in Figure 1.

Figure 6: Classical CG on Tests 2-4. Solid curve: Test 2; dashed curve: Test 3; dotted curve: Test 4.

Figure 7: Modified CG on Tests 2-4. Curves as in Figure 6.

Figure 8: Eijkhout Rearrangement on Tests 2-4. Curves as in Figure 6.

Figure 9: Meurant Rearrangement on Tests 2-4. Curves as in Figure 6.

Figure 10: Lanczos Rearrangement on Tests 2-4. Curves as in Figure 6.

Table 1: Description of test problems.

    Problem    Order  Nonzeros  Description
    BCSSTK13    2003     11973  Fluid Flow Generalized Eigenvalues
    BCSSTK14    1806     32630  Roof of Omni Coliseum, Atlanta
    BCSSTK15    3948     60882  Module of an Offshore Platform
    BCSSTK18   11948     80519  R.E. Ginna Nuclear Power Station

Figures 1-5 display the convergence results from Test 1. Note that for $\gamma = 0.8, 0.9$, both the standard and modified CG procedures exhibit similarly slow convergence behavior. For Test 1 with $\gamma = 0.6$, the standard CG algorithm shows the best convergence properties, and Eijkhout's rearrangement has slightly better stability properties than modified CG. The other results are essentially the same.

Figures 6-10 display the convergence results on Tests 2-4. All the results on Tests 2-4 again show similar convergence behavior among the standard CG and the different rearrangements of CG.

5 Parallel performance

To gauge the effectiveness of the modified CG procedure, we performed a number of experiments comparing the run time of standard CG and modified CG. The test matrices are chosen from the Harwell-Boeing Test Collection [3]. The experiments were performed on 16 nodes of the iPSC/860 hypercube. Each matrix is first reordered by the bandwidth-reducing Reverse Cuthill-McKee ordering [9]. The matrix is then equally block-partitioned by rows and distributed across the processors in ELLPACK format [14]. In all cases, a random right-hand side and zero initial guess are used, and convergence is assumed when
$$\|r_k\|_2 \le 10^{-8}\,\|r_0\|_2. \qquad (17)$$

The conjugate gradient procedure is rarely used without some form of preconditioning to accelerate convergence. In the tests described below, we use a block preconditioner derived as follows. Let $A_i$ be the diagonal block of the matrix $A$ contained in processor $i$, and write $A_i = L_i + D_i + L_i^t$, where $L_i$ is strictly lower triangular and $D_i$ is diagonal. Then the preconditioning matrix is $M = \operatorname{diag}(M_1, M_2, \ldots, M_p)$, where $M_i = (L_i + D_i)\,D_i^{-1}\,(L_i + D_i)^t$. As shown in Axelsson and Barker [1], this corresponds to each processor doing a single SSOR step with $\omega = 1$ on its diagonal block $A_i$. This preconditioner requires no added communication among the processors when implemented in parallel.
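Applying $M_i^{-1}$ locally amounts to a forward solve, a diagonal scaling, and a backward solve on each processor; a sketch (ours), where Ai and ri denote the local diagonal block and residual segment:

    % Apply the local SSOR (omega = 1) preconditioner:
    % z = inv(M_i) * r_i with M_i = (L_i + D_i) * inv(D_i) * (L_i + D_i)'.
    Di = diag(diag(Ai));
    Li = tril(Ai, -1);              % strictly lower triangular part
    z = (Li + Di) \ ri;             % forward solve
    z = Di * z;                     % undo the inv(D_i) factor
    z = (Li + Di)' \ z;             % backward solve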

Table 1 is a brief description of the problems selected from the Harwell-Boeing Test Collection. Table 2 shows the number of iterations and the time in seconds required to solve the corresponding problems. In all cases, the modified CG shows an improvement in the time required for solution, ranging from 5% to 13%. Moreover, the modified CG rearrangement shows no unstable behavior, since it takes almost exactly the same number of iterations as standard CG.

Table 2: Timing results.

                 standard CG        modified CG
    Problem    Iterations   Time  Iterations   Time
    BCSSTK13         1007  19.56        1007  16.99
    BCSSTK14          232   2.72         232   2.35
    BCSSTK15          376   7.60         376   6.72
    BCSSTK18          697  34.55         697  32.80

6 Conclusion

We have presented a rearrangement of the standard conjugate gradient procedure that eliminates one synchronization point by performing the two inner products at once. The rearrangement has a natural connection with the Lanczos process for solving linear equations. Although not a proof, MATLAB simulations indicate that the rearrangement is stable. Moreover, computational experiments using parallel versions of both the modified and standard conjugate gradient algorithms show that the modified version reduces the execution time by as much as 13% on an Intel iPSC/860 with 16 processors.

References

[1] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Value Problems. Academic Press, 1984.

[2] P. Concus, G. Golub, and D. O'Leary. A generalized conjugate gradient method for the numerical solution of elliptic partial differential equations. In J. Bunch and D. Rose, editors, Sparse Matrix Computations, pages 309-322. Academic Press, New York, 1976.

[3] I. S. Duff, R. G. Grimes, and J. G. Lewis. Sparse matrix test problems. ACM Transactions on Mathematical Software, 15(1):1-14, 1989.

[4] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, 1983.

[5] A. Greenbaum. Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences. Linear Algebra Appl., 113:7-63, 1989.

[6] A. Greenbaum and Z. Strakos. Predicting the behavior of finite precision Lanczos and conjugate gradient computations. SIAM J. Matrix Anal. Appl., 13(1):121-137, 1992.

[7] L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, New York, 1981.

[8] Magnus R. Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Standards, 49(6):409-436, 1952.

[9] Joseph W-H Liu and Andrew H. Sherman. Comparative analysis of the Cuthill-McKee and the reverse Cuthill-McKee ordering algorithms for sparse matrices. SIAM Journal on Numerical Analysis, 13:198-213, 1976.

[10] G. Meurant. Multitasking the conjugate gradient method on the CRAY X-MP/48. Parallel Comput., 5:267-280, 1987.

[11] C. C. Paige. Computational variants of the Lanczos method for the eigenproblem. J. Inst. Maths. Applics., 10:373-381, 1972.

[12] C. C. Paige. Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric matrix. J. Inst. Maths. Applics., 18:341-349, 1976.

[13] C. C. Paige. Accuracy and effectiveness of the Lanczos algorithm for the symmetric eigenproblem. Linear Algebra Appl., 34:235-258, 1980.

[14] J. R. Rice and R. F. Boisvert. Solving Elliptic Problems Using ELLPACK. Springer-Verlag, New York, 1985.

[15] Youcef Saad. Practical use of polynomial preconditionings for the conjugate gradient method. SIAM J. Sci. Statist. Comput., 6(4):865-881, 1985.

[16] Youcef Saad. Krylov subspace methods on supercomputers. SIAM J. Sci. Statist. Comput., 10(6):1200-1232, 1989.

[17] Z. Strakos. On the real convergence rate of the conjugate gradient method. Linear Algebra Appl., 154-156:535-549, 1991.

[18] Henk A. van der Vorst. High performance preconditioning. SIAM J. Sci. Statist. Comput., 10(6):1174-1185, 1989.

[19] John van Rosendale. Minimizing inner product data dependencies in conjugate gradient iteration. Technical Report NASA Contractor Report 172178, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, Virginia 23665, 1983.