LAPACK Working Note 56

Conjugate Gradient Algorithms with Reduced Synchronization Overhead on Distributed Memory Multiprocessors

E. F. D'Azevedo†, V. L. Eijkhout‡, and C. H. Romine†

December 3, 1999

Abstract

The standard formulation of the conjugate gradient algorithm involves two inner product computations. The results of these two inner products are needed to update the search direction and the computed solution. Since these inner products are mutually interdependent, in a distributed memory parallel environment their computation and subsequent distribution requires two separate communication and synchronization phases. In this paper, we present three related, mathematically equivalent rearrangements of the standard algorithm that reduce the number of communication phases. We present empirical evidence that two of these rearrangements are numerically stable. This claim is further substantiated by a proof that one of the empirically stable rearrangements arises naturally in the symmetric Lanczos method for linear systems, which is equivalent to the conjugate gradient method.

1 Introduction

The conjugate gradient (CG) method is an effective iterative method for solving large sparse symmetric positive definite systems of linear equations. It is robust and, coupled with an effective preconditioner [18], is generally able to achieve rapid convergence to an accurate solution.



*This work was supported in part by DARPA under contract number DAAL03-91-C-0047 and in part by DOE under contract number DE-AC05-84OR21400.

†Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367.

‡Department of Computer Science, University of Tennessee, 104 Ayres Hall, Knoxville, TN 37996.

One drawback of the standard formulation of the conjugate gradient algorithm on distributed memory parallel machines is that it involves the computation of two separate inner products of distributed vectors. Moreover, the first inner product must be completed before the data are available for computing the second inner product. Hence, a distributed memory implementation of the standard conjugate gradient method has two separate communication phases for these two inner products. Since communication is quite expensive on the current generation of distributed memory multiprocessors, it is desirable to reduce the communication overhead by combining these two communication phases into one.

Saad [15, 16] has shown one rearrangement of the computation that eliminates a communication phase by computing $\|r_{k+1}\|_2$ from the relationship
$$\|r_{k+1}\|_2^2 = \alpha_k^2\,\|Ap_k\|_2^2 - \|r_k\|_2^2, \qquad (1)$$
but this rearrangement was found to be numerically unstable. Meurant [10] proposed using (1) as a predictor for $\|r_{k+1}\|_2$ and reevaluating the actual norm on the next iteration with an extra inner product. Van Rosendale [19] has proposed (without numerical results) an $m$-step conjugate gradient algorithm to increase parallelism.

The conjugate gradient algorithm is known to be closely related to the Lanczos algorithm for tridiagonalizing a symmetric matrix [4]. Paige [11, 12, 13] has done detailed analysis to show that some variants of the Lanczos method are unstable. Strakos [17] and Greenbaum [5, 6] have considered the close connection between the Lanczos and CG algorithms in the analysis of the stability of CG computations under perturbations in finite arithmetic.

In §2, we present a rearrangement of the conjugate gradient computation that eliminates one communication phase by computing both inner products at once. We show a natural association between this rearrangement and the Lanczos algorithm in §3. A discussion of how this rearrangement of the computation affects the stability properties of the conjugate gradient algorithm, and some MATLAB numerical experiments on the effectiveness of the rearrangement, are included in §4.

2 The conjugate gradient algorithm

In this section we present several variants of the conjugate gradient algorithm, based on elimination of the $A$-inner product of the search directions.

2.1 The standard formulation

We begin by reviewing the standard conjugate gradient procedure [2, 8] for solving the linear system
$$Ax = b. \qquad (2)$$
For simplicity, we assume a zero initial guess and residual vector $r_1 = b$, with $\langle x, y\rangle = x^t y$ as the usual inner product.

For $k = 1, 2, \ldots$:
$$\begin{aligned}
\rho_k &= \langle r_k, r_k\rangle \\
\beta_k &= \rho_k/\rho_{k-1} \quad (\beta_1 = 0) \\
p_k &= r_k + \beta_k p_{k-1} \quad (p_1 = r_1) & (3)\\
v_k &= A p_k \\
\sigma_k &= \langle p_k, v_k\rangle \\
\alpha_k &= \rho_k/\sigma_k \\
x_{k+1} &= x_k + \alpha_k p_k \\
r_{k+1} &= r_k - \alpha_k v_k. & (4)
\end{aligned}$$
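For concreteness, the following MATLAB sketch of this loop (ours; dense storage, no preconditioning, with an added stopping test) makes the data dependence explicit: $\sigma_k$ cannot be formed until $p_k$ is known, which in turn requires $\rho_k$, so on a distributed memory machine the two inner products become two separate global reductions.

    % Standard CG for symmetric positive definite A (illustrative sketch).
    % The two inner products are mutually dependent, so a distributed
    % implementation needs two separate reduction/synchronization phases.
    function x = cg_standard(A, b, maxit, tol)
      x = zeros(size(b));
      r = b;                          % zero initial guess: r_1 = b
      p = zeros(size(b));
      rho_old = 1;                    % unused on the first pass (beta_1 = 0)
      for k = 1:maxit
        rho = r' * r;                 % first inner product  (reduction 1)
        if k == 1, beta = 0; else, beta = rho / rho_old; end
        p = r + beta * p;             % update search direction, (3)
        v = A * p;
        sigma = p' * v;               % second inner product (reduction 2)
        alpha = rho / sigma;
        x = x + alpha * p;
        r = r - alpha * v;            % update residual, (4)
        rho_old = rho;
        if norm(r) <= tol * norm(b), break; end
      end
    end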

Saad [15, 16] and Meurant [10] have considered eliminating the first inner product, $\rho_k = \langle r_k, r_k\rangle$. We propose eliminating the second communication phase by finding alternative expressions for $\sigma_k$.

We will rely on the following intrinsic properties of the CG procedure: the orthogonality of the residual vectors and (equivalently) the conjugacy of the search directions [8, page 420],
$$\frac{\langle p_k, Ap_{k+1}\rangle}{\langle p_k, Ap_k\rangle} = \frac{\langle r_k, r_{k+1}\rangle}{\langle r_k, r_k\rangle} = 0, \qquad (5)$$
and the fact that
$$r_i^t A r_j = 0 \quad \text{for } |i - j| > 1. \qquad (6)$$

2.2 Rearranged method 1

We derive the first rearrangement by expanding $p_k^t A p_k$, substituting $p_k = r_k + \beta_k p_{k-1}$ for both occurrences of $p_k$:

k k k 1 k

 = h p ;v i = h p ;Ap i

k k k k k

= h r + p ;Ar + v i

k k k 1 k k k 1

= h r ;Ar i + hr ;v i +

k k k k k 1

2

hp ;v i hp ;Ar i +

k 1 k 1 k k 1 k

k

2

 = h r ;Ar i +2 hr ;v i +  : 7

k k k k k k 1 k 1

k

From (4) and (5):
$$\begin{aligned}
r_k &= r_{k-1} - \alpha_{k-1} v_{k-1} \\
\langle r_k, r_k\rangle &= \langle r_k, r_{k-1}\rangle - \alpha_{k-1}\langle r_k, v_{k-1}\rangle \\
&= 0 - \alpha_{k-1}\langle r_k, v_{k-1}\rangle. & (8)
\end{aligned}$$

Therefore, by (7), (8), and $\beta_k = \rho_k/\rho_{k-1}$,
$$\begin{aligned}
\sigma_k &= \langle r_k, Ar_k\rangle + 2\beta_k\left(-\rho_k/\alpha_{k-1}\right) + \beta_k^2\,\sigma_{k-1} \\
\sigma_k &= \mu_k - \beta_k^2\,\sigma_{k-1}, \quad\text{where } \mu_k = \langle r_k, Ar_k\rangle, & (9)
\end{aligned}$$
since $\rho_k/\alpha_{k-1} = \beta_k\,\rho_{k-1}/\alpha_{k-1} = \beta_k\,\sigma_{k-1}$.
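Recurrence (9) is easy to check numerically. The following MATLAB fragment (ours; the random test matrix is an arbitrary choice) runs the standard iteration while computing $\sigma_k$ recursively and compares it with the explicit inner product $\langle p_k, Ap_k\rangle$.

    % Check recurrence (9): sigma_k = mu_k - beta_k^2 * sigma_{k-1}.
    n = 200; rng(0);
    Q = orth(randn(n));                      % random orthogonal eigenbasis
    A = Q * diag(linspace(1, 1e4, n)) * Q';  % SPD test matrix (assumed)
    b = randn(n, 1);
    r = b; p = r; v = A * p;
    rho = r' * r; sigma = p' * v; alpha = rho / sigma;
    r = r - alpha * v;
    for k = 2:30
      rho_old = rho; sigma_old = sigma;
      rho = r' * r;  mu = r' * (A * r);
      beta = rho / rho_old;
      sigma = mu - beta^2 * sigma_old;       % recurrence (9)
      p = r + beta * p;  v = A * p;
      fprintf('k = %2d   rel. discrepancy = %.2e\n', ...
              k, abs(sigma - p' * v) / abs(sigma));
      alpha = rho / sigma;
      r = r - alpha * v;
    end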

2.3 Rearranged methods 2 and 3

If we expand the occurrences of $p_k$ in $p_k^t A p_k$ one at a time, we find different arrangements of the algorithm. Using the conjugacy in (5),
$$\begin{aligned}
\sigma_k &= \langle p_k, Ap_k\rangle \\
&= \langle r_k + \beta_k p_{k-1},\; Ap_k\rangle = \langle r_k, Ap_k\rangle \\
&= \langle r_k,\; A(r_k + \beta_k p_{k-1})\rangle \\
&= \langle r_k, Ar_k\rangle + \beta_k\langle r_k, Ap_{k-1}\rangle \\
\sigma_k &= \mu_k + \beta_k\,\nu_k^{(1)}, \quad\text{where } \nu_k^{(1)} = \langle r_k, Ap_{k-1}\rangle. & (10)
\end{aligned}$$

Thus we find a rearrangement of the conjugate gradient method where we compute the inner products $r_k^t r_k$, $r_k^t A r_k$, and $r_k^t A p_{k-1}$ simultaneously, and compute $\sigma_k = p_k^t A p_k$ by recurrence formula (10).

Successive expansion of $p_k = r_k + \beta_k p_{k-1}$ gives
$$p_k = r_k + \beta_k r_{k-1} + \beta_k\beta_{k-1} r_{k-2} + \cdots.$$
Using equation (6) we then find
$$\begin{aligned}
\sigma_k &= \cdots \\
&= \langle r_k,\; A(r_k + \beta_k r_{k-1} + \beta_k\beta_{k-1} r_{k-2} + \cdots)\rangle \\
&= \langle r_k, Ar_k\rangle + \beta_k\langle r_k, Ar_{k-1}\rangle. & (11)
\end{aligned}$$

This gives us a rearrangement of the conjugate gradient method where we compute the inner products $r_k^t r_k$, $r_k^t A r_k$, and $r_k^t A r_{k-1}$ simultaneously, and compute $\sigma_k = p_k^t A p_k$ by recurrence formula (11).

2.4 The modified conjugate gradient method

We propose the following rearrangement of the conjugate gradient procedure. First initialize $\sigma_1$ and $v_1$ by performing one step of the standard algorithm:
$$r_1 = b;\quad \rho_1 = \langle r_1, r_1\rangle;\quad p_1 = r_1;\quad v_1 = Ap_1;$$
$$\sigma_1 = \langle p_1, v_1\rangle;\quad x_2 = (\rho_1/\sigma_1)\,p_1;\quad r_2 = r_1 - (\rho_1/\sigma_1)\,v_1.$$

For $k = 2, 3, \ldots$:
$$\begin{aligned}
& s_k = A r_k \\
& \text{compute } \rho_k = \langle r_k, r_k\rangle \text{ and } \mu_k = \langle r_k, s_k\rangle \\
& \quad\text{(and } \nu_k^{(1)} = \langle r_k, v_{k-1}\rangle \text{ for method 2 only;} \\
& \quad\ \text{and } \nu_k^{(2)} = \langle r_k, s_{k-1}\rangle \text{ for method 3 only)} \\
& \beta_k = \rho_k/\rho_{k-1} \\
& p_k = r_k + \beta_k p_{k-1} \\
& v_k = s_k + \beta_k v_{k-1} \quad (v_k \equiv Ap_k) & (12)\\
& \sigma_k = \begin{cases}
  \mu_k - \beta_k^2\,\sigma_{k-1} & \text{for method 1} \\
  \mu_k + \beta_k\,\nu_k^{(1)} & \text{for method 2} \\
  \mu_k + \beta_k\,\nu_k^{(2)} & \text{for method 3}
\end{cases} \\
& \alpha_k = \rho_k/\sigma_k \\
& x_{k+1} = x_k + \alpha_k p_k \\
& r_{k+1} = r_k - \alpha_k v_k.
\end{aligned}$$

Note that the above procedure requires extra storage for the vector $s_k$ and extra work in updating the vector $v_k$ for all three methods. Methods 2 and 3 require an additional inner product and storage for $v_{k-1}$ and $s_{k-1}$, respectively.
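A MATLAB sketch of the method 1 variant follows (ours). In a distributed implementation, the two inner products marked below would be combined into a single reduction, e.g. one all-reduce on the pair $(\rho_k, \mu_k)$.

    % Modified CG, method 1: one reduction phase per iteration.
    function x = cg_modified1(A, b, maxit)
      r = b;  p = r;  v = A * p;            % one standard step to start
      rho = r' * r;  sigma = p' * v;
      alpha = rho / sigma;
      x = alpha * p;
      r = r - alpha * v;
      for k = 2:maxit
        s = A * r;
        rho_old = rho;  sigma_old = sigma;
        rho = r' * r;                       % these two inner products can
        mu  = r' * s;                       % be combined in one reduction
        beta = rho / rho_old;
        p = r + beta * p;
        v = s + beta * v;                   % v = A*p by linearity, (12)
        sigma = mu - beta^2 * sigma_old;    % method 1 recurrence, (9)
        alpha = rho / sigma;
        x = x + alpha * p;
        r = r - alpha * v;
      end
    end

Methods 2 and 3 differ only in the $\sigma_k$ update and in retaining $v_{k-1}$ or $s_{k-1}$ for the extra inner product, which can join the same reduction.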

3 Stability proof of method 1

In this section we present a proof of the stability of the first method, based on the equivalence of (7) to a stable recurrence arising from the symmetric Lanczos process. The outline of the proof is as follows.

First we recall the fact that the conjugate gradient method is a tridiagonalization procedure. Recurrence relation (7) in method 1 is then shown to be equivalent to the pivot recurrence for the factorization of this tridiagonal matrix. The cornerstone of the proof is that the symmetric tridiagonal matrix produced by the symmetric Lanczos process gives the same pivot recurrence. If the original coefficient matrix is symmetric positive definite, this recurrence is stable; hence the scalar recurrence in rearranged method 1 is stable.

The two basic vector updates of the conjugate gradient method,
$$r_{k+1} = r_k - \alpha_k A p_k, \qquad p_{k+1} = r_{k+1} + \beta_{k+1} p_k,$$
can be summarized as
$$R(I - J) = A P D_\alpha, \qquad P U = R,$$
where $R$ and $P$ are matrices containing the vector sequences $\{r_k\}$ and $\{p_k\}$ as columns, and
$$J_{ij} = \begin{cases} 1 & \text{if } i = j + 1 \\ 0 & \text{otherwise,} \end{cases} \qquad
D_\alpha = \operatorname{diag}(\alpha_k), \qquad
u_{ij} = \begin{cases} 1 & \text{if } i = j \\ -\beta_j & \text{if } i + 1 = j \\ 0 & \text{otherwise.} \end{cases}$$
With $H = (I - J)\,D_\alpha^{-1}\,U$, a tridiagonal matrix, we can write this as $AR = RH$.

For the elements of $H$ we find
$$h_{nn} = \frac{r_n^t A r_n}{r_n^t r_n}, \qquad h_{n+1,n} = h_{n,n+1}\,\frac{r_n^t r_n}{r_{n+1}^t r_{n+1}}, \qquad (13)$$
where we have used only orthogonality properties and the symmetry of $A$.

The pivots $d_{nn}$ in the factorization of $H$ satisfy the recurrence
$$d_{n+1,n+1} = h_{n+1,n+1} - h_{n+1,n}\,d_{nn}^{-1}\,h_{n,n+1},$$
or, using $d_{nn} = p_n^t A p_n / r_n^t r_n$ and equation (13),
$$\frac{p_{n+1}^t A p_{n+1}}{r_{n+1}^t r_{n+1}} = \frac{r_{n+1}^t A r_{n+1}}{r_{n+1}^t r_{n+1}} - h_{n,n+1}^2\,\frac{r_n^t r_n}{r_{n+1}^t r_{n+1}}\cdot\frac{r_n^t r_n}{p_n^t A p_n}.$$
Multiplying through by $r_{n+1}^t r_{n+1}$ and using the fact that $h_{n,n+1} = -\beta_{n+1}\,\alpha_n^{-1}$:
$$\begin{aligned}
p_{n+1}^t A p_{n+1} &= r_{n+1}^t A r_{n+1} - h_{n,n+1}^2\,(r_n^t r_n)^2\,(p_n^t A p_n)^{-1} \\
&= r_{n+1}^t A r_{n+1} - \beta_{n+1}^2\,\alpha_n^{-2}\,(r_n^t r_n)^2\,(p_n^t A p_n)^{-1} \\
&= r_{n+1}^t A r_{n+1} - \beta_{n+1}^2\,p_n^t A p_n.
\end{aligned}$$

This shows that the recursive computation of $p_n^t A p_n$ in rearranged method 1 is equivalent, by mere formula substitution, to the pivot recursion for $H$. We gloss over the fact that formulas (13) are based on properties of the conjugate gradient method formulated as a three-term recurrence, instead of as the usual coupled pair of two-term recurrences. We assume comparable numerical stability properties for these two variants of the basic method.
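The equivalence is also easy to observe numerically: the pivots of the factorization of the Lanczos tridiagonal matrix equal $\sigma_n/\rho_n = 1/\alpha_n$. The fragment below (ours) builds $T$ from the CG coefficients via the standard correspondence $t_{nn} = 1/\alpha_n + \beta_n/\alpha_{n-1}$ (with $\beta_1/\alpha_0 = 0$) and $t_{n,n+1} = \sqrt{\beta_{n+1}}/\alpha_n$, and compares its pivots with $1/\alpha_n$.

    % Pivots of the Lanczos tridiagonal T equal 1/alpha_n from CG.
    n = 100; m = 20; rng(1);
    A = diag(linspace(1, 1e3, n));          % diagonal SPD test matrix (assumed)
    b = randn(n, 1);
    r = b; p = r; rho = r' * r;
    alf = zeros(m, 1); bet = zeros(m, 1);   % bet(k) stores beta_{k+1}
    for k = 1:m
      v = A * p;
      alf(k) = rho / (p' * v);
      r = r - alf(k) * v;
      rho_new = r' * r;
      bet(k) = rho_new / rho;  rho = rho_new;
      p = r + bet(k) * p;
    end
    d = 1 ./ alf;                                   % diagonal of T
    d(2:m) = d(2:m) + bet(1:m-1) ./ alf(1:m-1);
    e = sqrt(bet(1:m-1)) ./ alf(1:m-1);             % off-diagonal of T
    piv = zeros(m, 1); piv(1) = d(1);               % pivot recurrence
    for k = 2:m
      piv(k) = d(k) - e(k-1)^2 / piv(k-1);
    end
    disp(max(abs(piv - 1 ./ alf)));     % agrees up to rounding error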

The Lanczos method for constructing an orthonormal matrix $Q$ that reduces $A$ to symmetric tridiagonal form $T$ by $Q^t A Q = T$ is easily constructed from the conjugate gradient method by letting
$$q_n = c_{nn}^{-1}\,r_n, \quad\text{where } c_{nn} = \|r_n\|.$$
Hence the orthonormal tridiagonalization can be written as
$$AQ = QT \quad\text{with } Q = RC^{-1},\; T = CHC^{-1}.$$
Using the fact that $H$ is tridiagonal and that $H = C^{-1}TC$ with $C$ diagonal, we find that the factorization of $T$ generates the same pivot sequence. If $A$, and therefore $T$, is symmetric positive definite, this pivot recursion, and therefore recurrence (7), is stable.

4 Numerical experiments on stability

The aim of the following experiments is to determine the stability and convergence properties of the modified conjugate gradient procedures.

We performed a number of MATLAB experiments in solving $Ax = b$ by the conjugate gradient procedure, to study the convergence behavior on different distributions of eigenvalues of the preconditioned matrix. In Eijkhout's rearrangement, $\langle r_k, v_{k-1}\rangle$ is computed by an extra inner product. Meurant's rearrangement is taken from [10], and the Lanczos rearrangement is adapted from [4, page 342] by evaluating the two inner products for $\tilde\alpha_j$ together, as $\tilde\alpha_j = \langle\tilde r_j, A\tilde r_j\rangle / \langle\tilde r_j, \tilde r_j\rangle$.

Test 1

The matrices considered have the eigenspectrum used by Strakos [17] and Greenbaum and Strakos [6]:
$$\lambda_i = \lambda_1 + \frac{i-1}{n-1}\,(\lambda_n - \lambda_1)\,\gamma^{\,n-i}, \quad i = 2, \ldots, n, \quad \gamma \in (0, 1]. \qquad (14)$$
We have used $n = 100$, $\lambda_1 = 10^{-3}$, $\kappa = \lambda_n/\lambda_1 = 10^5$, and $\gamma = 0.6, 0.8, 0.9, 1.0$ in the experiments. For $\gamma = 1$ we have a uniformly distributed spectrum, and $\gamma < 1$ describes quantitatively the clustering at $\lambda_1$.
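In MATLAB, the Test 1 operator can be generated directly as a diagonal matrix (a sketch; the variable names are ours):

    % Strakos spectrum (14): clustered near lambda_1 for gam < 1.
    n = 100; lam1 = 1e-3; kappa = 1e5; gam = 0.8;
    lamn = kappa * lam1;
    i = (2:n)';
    lam = [lam1; lam1 + ((i-1)/(n-1)) .* (lamn - lam1) .* gam.^(n-i)];
    A = diag(lam);                  % the experiments operate on diagonal A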

Test 2

The eigenspectrum has a gap: $\{1, \ldots, 50, 10051, \ldots, 10100\}$.

Test 3

The eigenspectrum has double eigenvalues: $\{1, 1, 2, 2, \ldots, 50, 50\}$.

Test 4

The eigenspectrum consists of the roots of the Chebyshev polynomial $T_n(x)$, shifted from $[-1, 1]$ to the interval $[a, b]$:
$$\lambda_i = \frac{b-a}{2}\,\cos\!\left(\frac{(2i-1)\pi}{2n}\right) + \frac{b+a}{2}, \quad i = 1, \ldots, n. \qquad (15)$$
We have used $n = 100$, $a = 1$, $b = 10^5$.

As done in Hageman and Young [7], Greenbaum [5], and Strakos [17], we operate on diagonal matrices. This procedure is equivalent to representing all vectors over the basis of eigenvectors of the matrix $A$. In all cases, a random right-hand side (uniform over $[-1, 1]$) and zero initial guess are used.

We display the decrease of the $A$-norm of the error at each iteration, divided by the $A$-norm of the initial error:
$$\frac{\langle\tilde x - x_k,\; A(\tilde x - x_k)\rangle^{1/2}}{\langle\tilde x - x_0,\; A(\tilde x - x_0)\rangle^{1/2}}, \qquad \tilde x = A^{-1}b. \qquad (16)$$
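The quantity (16) can be tracked alongside any of the iterations; a small sketch (ours), assuming $A$ and $b$ are already defined:

    % Relative A-norm of the error, equation (16).
    xs = A \ b;                                    % exact solution
    enorm = @(x) sqrt((xs - x)' * (A * (xs - x))); % A-norm of the error
    e0 = enorm(zeros(size(b)));                    % zero initial guess
    % after each CG update of x:  err(k) = enorm(x) / e0;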

[Figures 1-10 plot the relative $A$-norm of the error, (16), from $10^3$ down to $10^{-12}$ on a log scale, against the iteration number, 0 to 200, for $n = 100$.]

Figure 1: Classical CG on Test 1 ($\kappa = 10^5$). Dashed curve: $\gamma = 0.6$; dotted curve: $\gamma = 0.8$; dash-dot curve: $\gamma = 0.9$; solid curve: $\gamma = 1$.

Figure 2: Modified CG on Test 1. Curves as in Figure 1.

Figure 3: Eijkhout Rearrangement on Test 1. Curves as in Figure 1.

Figure 4: Meurant Rearrangement on Test 1. Curves as in Figure 1.

Figure 5: Lanczos Rearrangement on Test 1. Curves as in Figure 1.

Figure 6: Classical CG on Tests 2-4. Solid curve: Test 2; dashed curve: Test 3; dotted curve: Test 4.

Figure 7: Modified CG on Tests 2-4. Curves as in Figure 6.

Figure 8: Eijkhout Rearrangement on Tests 2-4. Curves as in Figure 6.

Figure 9: Meurant Rearrangement on Tests 2-4. Curves as in Figure 6.

Figure 10: Lanczos Rearrangement on Tests 2-4. Curves as in Figure 6.

Table 1: Description of test problems.

    Problem    Order  Nonzeros  Description
    BCSSTK13    2003     11973  Fluid Flow Generalized Eigenvalues
    BCSSTK14    1806     32630  Roof of Omni Coliseum, Atlanta
    BCSSTK15    3948     60882  Module of an Offshore Platform
    BCSSTK18   11948     80519  R.E. Ginna Nuclear Power Station

Figures 1-5 display the convergence results from Test 1. Note that for $\gamma = 0.8, 0.9$, both the standard and modified CG procedures exhibit similarly slow convergence behavior. For Test 1 with $\gamma = 0.6$, the standard CG algorithm shows the best convergence properties, and Eijkhout's rearrangement has slightly better stability properties than modified CG. The other results are essentially the same.

Figures 6-10 display the convergence results on Tests 2-4. All the results on Tests 2-4 again show similar convergence behavior among the standard CG and the different rearrangements of CG.

5 Parallel performance

To gauge the effectiveness of the modified CG procedure, we performed a number of experiments comparing the run time of standard CG and modified CG. The test matrices are chosen from the Harwell-Boeing Test Collection [3]. The experiments were performed on 16 nodes of the iPSC/860 hypercube. Each matrix is first reordered by the bandwidth-reducing Reverse Cuthill-McKee ordering [9]. The matrix is then equally block-partitioned by rows and distributed across the processors in ELLPACK format [14]. In all cases, a random right-hand side and zero initial guess are used, and convergence is assumed when
$$\|r_k\|_2 \le 10^{-8}\,\|r_0\|_2. \qquad (17)$$

The conjugate gradient procedure is rarely used without some form of preconditioning to accelerate convergence. In the tests described below, we use a block preconditioner derived as follows. Let $A_i$ be the diagonal block of the matrix $A$ contained in processor $i$, and write $A_i = L_i + D_i + L_i^t$, where $L_i$ is strictly lower triangular and $D_i$ is diagonal. Then the preconditioning matrix is $M = \operatorname{diag}(M_1, M_2, \ldots, M_p)$, where $M_i = (L_i + D_i)\,D_i^{-1}\,(L_i + D_i)^t$. As shown in Axelsson and Barker [1], this corresponds to each processor doing a single SSOR step with $\omega = 1$ on its diagonal block $A_i$. This preconditioner requires no added communication among the processors when implemented in parallel.
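Applying $M_i^{-1}$ locally amounts to a forward solve, a diagonal scaling, and a backward solve on each processor; a sketch (ours), where Ai and ri denote the local diagonal block and residual segment:

    % Apply the local SSOR (omega = 1) preconditioner:
    % z = inv(M_i) * r_i with M_i = (L_i + D_i) * inv(D_i) * (L_i + D_i)'.
    Di = diag(diag(Ai));
    Li = tril(Ai, -1);              % strictly lower triangular part
    z = (Li + Di) \ ri;             % forward solve
    z = Di * z;                     % undo the inv(D_i) factor
    z = (Li + Di)' \ z;             % backward solve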

Table 1 is a brief description of the problems selected from the Harwell-Boeing Test Collection. Table 2 shows the number of iterations and the time in seconds required to solve the corresponding problems. In all cases, the modified CG shows an improvement in the time required for solution, ranging from 5% to 13%. Moreover, the modified CG rearrangement shows no unstable behavior, since it takes almost exactly the same number of iterations as standard CG.

Table 2: Timing results.

                 standard CG        modified CG
    Problem    Iterations   Time  Iterations   Time
    BCSSTK13         1007  19.56        1007  16.99
    BCSSTK14          232   2.72         232   2.35
    BCSSTK15          376   7.60         376   6.72
    BCSSTK18          697  34.55         697  32.80

6 Conclusion

We have presented a rearrangement of the standard conjugate gradient procedure that eliminates one synchronization point by performing the two inner products at once. The rearrangement has a natural connection with the Lanczos process for solving linear equations. Although not a proof, MATLAB simulations indicate that the rearrangement is stable. Moreover, computational experiments using parallel versions of both the modified and standard conjugate gradient algorithms show that the modified version reduces the execution time by as much as 13% on an Intel iPSC/860 with 16 processors.

References

[1] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Value Problems. Academic Press, 1984.

[2] P. Concus, G. Golub, and D. O'Leary. A generalized conjugate gradient method for the numerical solution of elliptic partial differential equations. In J. Bunch and D. Rose, editors, Sparse Matrix Computations, pages 309-322. Academic Press, New York, 1976.

[3] I. S. Duff, R. G. Grimes, and J. G. Lewis. Sparse matrix test problems. ACM Transactions on Mathematical Software, 15(1):1-14, 1989.

[4] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, Maryland, 1983.

[5] A. Greenbaum. Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences. Linear Algebra Appl., 113:7-63, 1989.

[6] A. Greenbaum and Z. Strakos. Predicting the behavior of finite precision Lanczos and conjugate gradient computations. SIAM J. Matrix Anal. Appl., 13(1):121-137, 1992.

[7] L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, New York, 1981.

[8] Magnus R. Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Standards, 49(6):409-436, 1952.

[9] Joseph W-H Liu and Andrew H. Sherman. Comparative analysis of the Cuthill-McKee and the reverse Cuthill-McKee ordering algorithms for sparse matrices. SIAM Journal on Numerical Analysis, 13:198-213, 1976.

[10] G. Meurant. Multitasking the conjugate gradient method on the CRAY X-MP/48. Parallel Comput., 5:267-280, 1987.

[11] C. C. Paige. Computational variants of the Lanczos method for the eigenproblem. J. Inst. Maths. Applics., 10:373-381, 1972.

[12] C. C. Paige. Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric matrix. J. Inst. Maths. Applics., 18:341-349, 1976.

[13] C. C. Paige. Accuracy and effectiveness of the Lanczos algorithm for the symmetric eigenproblem. Linear Algebra Appl., 34:235-258, 1980.

[14] J. R. Rice and R. F. Boisvert. Solving Elliptic Problems Using ELLPACK. Springer-Verlag, New York, 1985.

[15] Youcef Saad. Practical use of polynomial preconditionings for the conjugate gradient method. SIAM J. Sci. Statist. Comput., 6(4):865-881, 1985.

[16] Youcef Saad. Krylov subspace methods on supercomputers. SIAM J. Sci. Statist. Comput., 10(6):1200-1232, 1989.

[17] Z. Strakos. On the real convergence rate of the conjugate gradient method. Linear Algebra Appl., 154-156:535-549, 1991.

[18] Henk A. van der Vorst. High performance preconditioning. SIAM J. Sci. Statist. Comput., 10(6):1174-1185, 1989.

[19] John van Rosendale. Minimizing inner product data dependencies in conjugate gradient iteration. Technical Report NASA Contractor Report 172178, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, Virginia 23665, 1983.