LAPACK Working Note 56

Conjugate Gradient Algorithms with Reduced Synchronization Overhead on Distributed Memory Multiprocessors

E. F. D'Azevedo†, V. L. Eijkhout‡, C. H. Romine†

December 3, 1999
Abstract

The standard formulation of the conjugate gradient algorithm involves two inner product computations. The results of these two inner products are needed to update the search direction and the computed solution. Since these inner products are mutually interdependent, in a distributed memory parallel environment their computation and subsequent distribution requires two separate communication and synchronization phases. In this paper, we present three related, mathematically equivalent rearrangements of the standard algorithm that reduce the number of communication phases. We present empirical evidence that two of these rearrangements are numerically stable. This claim is further substantiated by a proof that one of the empirically stable rearrangements arises naturally in the symmetric Lanczos method for linear systems, which is equivalent to the conjugate gradient method.
1 Introduction

The conjugate gradient (CG) method is an effective iterative method for solving large sparse symmetric positive definite systems of linear equations. It is robust and, coupled with an effective preconditioner [18], is generally able to achieve rapid convergence to an accurate solution.
* This work was supported in part by DARPA under contract number DAAL03-91-C-0047 and in part by DOE under contract number DE-AC05-84OR21400.
† Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367
‡ Department of Computer Science, University of Tennessee, 104 Ayres Hall, Knoxville, TN 37996
One drawback of the standard formulation of the conjugate gradient algorithm on distributed memory parallel machines is that it involves the computation of two separate inner products of distributed vectors. Moreover, the first inner product must be completed before the data are available for computing the second inner product. Hence, a distributed memory implementation of the standard conjugate gradient method has two separate communication phases for these two inner products. Since communication is quite expensive on the current generation of distributed memory multiprocessors, it is desirable to reduce the communication overhead by combining these two communication phases into one.
Saad [15, 16] has shown a rearrangement of the computation that eliminates a communication phase by computing $\|r_{k+1}\|_2$ from the relationship

    $\|r_{k+1}\|_2^2 = \alpha_k^2 \|A p_k\|_2^2 - \|r_k\|_2^2$   (1)

to be numerically unstable. Meurant [10] proposed using (1) as a predictor for $\|r_{k+1}\|$ and reevaluating the actual norm on the next iteration with an extra inner product. Van Rosendale [19] has proposed, without numerical results, an $m$-step conjugate gradient algorithm to increase parallelism.
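Relation (1) follows from $r_{k+1} = r_k - \alpha_k A p_k$ together with the orthogonality $\langle r_k, r_{k+1} \rangle = 0$, which gives $\alpha_k \langle r_k, A p_k \rangle = \|r_k\|_2^2$. The following NumPy sketch (the test matrix is an arbitrary well-conditioned SPD example, not from the paper) confirms that (1) holds to machine precision along a standard CG run; the instability concerns its recursive use for $\|r_k\|$ in place of the computed residual norm.

```python
import numpy as np

# Illustrative check of relation (1) along a standard CG run.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)     # symmetric positive definite test matrix
b = rng.standard_normal(8)

r = b.copy()
p = r.copy()
rho = r @ r                     # ||r_k||^2
max_gap = 0.0
for _ in range(5):
    v = A @ p
    alpha = rho / (p @ v)
    r_new = r - alpha * v
    # relation (1): ||r_{k+1}||^2 = alpha_k^2 ||A p_k||^2 - ||r_k||^2
    gap = abs(r_new @ r_new - (alpha**2 * (v @ v) - rho))
    max_gap = max(max_gap, gap)
    rho_new = r_new @ r_new
    p = r_new + (rho_new / rho) * p
    r, rho = r_new, rho_new
```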
The conjugate gradient algorithm is known to be closely related to the Lanczos algorithm for tridiagonalizing a matrix [4]. Paige [11, 12, 13] has done detailed analysis to show that some variants of the Lanczos algorithm are unstable. Strakos [17] and Greenbaum [5, 6] have considered the close connection between the Lanczos and CG algorithms in the analysis of stability of CG computations under perturbations in finite arithmetic.
In §2, we present a rearrangement of the conjugate gradient computation that eliminates one communication phase by computing both inner products at once. We show a natural association between this rearrangement and the Lanczos algorithm in §3. A discussion of how this rearrangement of the computation affects the stability properties of the conjugate gradient algorithm, and some MATLAB numerical experiments on the effectiveness of the rearrangement, are included in §4.
2 The conjugate gradient algorithm

In this section we will present several variants of the conjugate gradient algorithm, based on elimination of the $A$-inner product of the search directions.
2.1 The standard formulation

We begin by reviewing the standard conjugate gradient procedure [2, 8] for solving the linear system

    $Ax = b.$   (2)

For simplicity, we assume a zero initial guess and residual vector $r_1 = b$, with $\langle x, y \rangle = x^t y$ as the usual inner product.
For $k = 1, 2, \ldots$

    $\rho_k = \langle r_k, r_k \rangle$
    $\beta_k = \rho_k / \rho_{k-1}$ $\quad (\beta_1 = 0)$
    $p_k = r_k + \beta_k p_{k-1}$ $\quad (p_1 = r_1)$   (3)
    $v_k = A p_k$
    $\sigma_k = \langle p_k, v_k \rangle$
    $\alpha_k = \rho_k / \sigma_k$
    $x_{k+1} = x_k + \alpha_k p_k$
    $r_{k+1} = r_k - \alpha_k v_k.$   (4)
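In code, the loop above can be sketched as follows. This is a minimal serial NumPy sketch (the function name and test problem are illustrative, not from the paper); the two commented inner products are the ones that each become a separate global reduction, and hence a separate synchronization phase, on a distributed memory machine.

```python
import numpy as np

def cg_standard(A, b, iters):
    """Standard CG with zero initial guess, equations (2)-(4)."""
    x = np.zeros_like(b)
    r = b.copy()
    p = np.zeros_like(b)
    rho_old = 1.0
    for k in range(iters):
        rho = r @ r                  # first inner product: rho_k = <r_k, r_k>
        beta = 0.0 if k == 0 else rho / rho_old
        p = r + beta * p             # p_1 = r_1 since beta_1 = 0
        v = A @ p
        sigma = p @ v                # second inner product: sigma_k = <p_k, v_k>
        alpha = rho / sigma
        x = x + alpha * p
        r = r - alpha * v
        rho_old = rho
    return x

# illustrative SPD test problem
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10 * np.eye(10)
b = rng.standard_normal(10)
x = cg_standard(A, b, 10)
err = np.linalg.norm(A @ x - b)
```

Note that $\sigma_k$ cannot be formed until $p_k$ is available, which in turn requires $\rho_k$; this is the dependency that forces two synchronization phases per iteration.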
Saad [15, 16] and Meurant [10] have considered eliminating the first inner product for $\rho_k = \langle r_k, r_k \rangle$. We propose eliminating the second communication phase by finding alternative expressions for $\sigma_k$.

We will rely on the following intrinsic properties of the CG procedure: the orthogonality of residual vectors and, equivalently, the conjugacy of search directions [8, page 420],

    $\dfrac{\langle r_k, r_{k+1} \rangle}{\langle r_k, r_k \rangle} = \dfrac{\langle p_k, A p_{k+1} \rangle}{\langle p_k, A p_k \rangle} = 0,$   (5)

and the fact that

    $r_i^t A r_j = 0 \quad \text{for } |i - j| > 1.$   (6)
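Properties (5) and (6) are easy to observe numerically. The following NumPy sketch (test matrix and sizes are arbitrary illustrations) runs a few standard CG steps and records the largest violation of residual orthogonality, search-direction conjugacy, and the three-term property (6):

```python
import numpy as np

# Illustrative check of properties (5) and (6) over a short CG run.
rng = np.random.default_rng(2)
M = rng.standard_normal((12, 12))
A = M @ M.T + 12 * np.eye(12)   # symmetric positive definite
b = rng.standard_normal(12)

r = b.copy()
p = r.copy()
rho = r @ r
R, P = [r.copy()], [p.copy()]   # residuals r_k and directions p_k
for _ in range(5):
    v = A @ p
    alpha = rho / (p @ v)
    r = r - alpha * v
    rho_new = r @ r
    p = r + (rho_new / rho) * p
    rho = rho_new
    R.append(r.copy())
    P.append(p.copy())

ortho = max(abs(R[k] @ R[k + 1]) for k in range(5))        # (5): <r_k, r_{k+1}>
conj = max(abs(P[k] @ (A @ P[k + 1])) for k in range(5))   # (5): <p_k, A p_{k+1}>
three = max(abs(R[i] @ (A @ R[j]))                         # (6): |i - j| > 1
            for i in range(6) for j in range(6) if abs(i - j) > 1)
```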
2.2 Rearranged method 1

We derive the first rearrangement by expanding $p_k^t A p_k$, substituting $p_k = r_k + \beta_k p_{k-1}$ for both occurrences of $p_k$:

    $\sigma_k = \langle p_k, v_k \rangle = \langle p_k, A p_k \rangle$
        $= \langle r_k + \beta_k p_{k-1}, \; A r_k + \beta_k v_{k-1} \rangle$
        $= \langle r_k, A r_k \rangle + \beta_k \langle r_k, v_{k-1} \rangle + \beta_k \langle p_{k-1}, A r_k \rangle + \beta_k^2 \langle p_{k-1}, v_{k-1} \rangle$
        $= \langle r_k, A r_k \rangle + 2 \beta_k \langle r_k, v_{k-1} \rangle + \beta_k^2 \sigma_{k-1},$   (7)

where the last step uses the symmetry of $A$: $\langle p_{k-1}, A r_k \rangle = \langle A p_{k-1}, r_k \rangle = \langle r_k, v_{k-1} \rangle$. From (4) and (5):

    $r_k = r_{k-1} - \alpha_{k-1} v_{k-1}$
    $\langle r_k, r_k \rangle = \langle r_k, r_{k-1} \rangle - \alpha_{k-1} \langle r_k, v_{k-1} \rangle$
        $= 0 - \alpha_{k-1} \langle r_k, v_{k-1} \rangle.$   (8)

Therefore, by (7), (8) and $\beta_k = \rho_k / \rho_{k-1}$ (so that $\langle r_k, v_{k-1} \rangle = -\rho_k / \alpha_{k-1}$ and $\sigma_{k-1} = \rho_{k-1} / \alpha_{k-1}$),

    $\sigma_k = \langle r_k, A r_k \rangle - 2 (\beta_k^2 / \alpha_{k-1}) \rho_{k-1} + (\beta_k^2 / \alpha_{k-1}) \rho_{k-1}$
        $= \mu_k - (\beta_k^2 / \alpha_{k-1}) \rho_{k-1}, \quad \text{where } \mu_k = \langle r_k, A r_k \rangle.$   (9)
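As a sanity check of recurrence (9), the following NumPy sketch (illustrative test problem) runs standard CG, computes $\sigma_k$ both directly as $\langle p_k, A p_k \rangle$ and via (9), and records the largest relative discrepancy:

```python
import numpy as np

# Illustrative check: sigma_k from recurrence (9) vs the direct <p_k, A p_k>.
rng = np.random.default_rng(3)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10 * np.eye(10)   # symmetric positive definite
b = rng.standard_normal(10)

# step k = 1 of the standard algorithm
r = b.copy()
p = r.copy()
rho = r @ r
v = A @ p
alpha = rho / (p @ v)
r = r - alpha * v

max_rel = 0.0
for _ in range(5):              # steps k = 2, 3, ...
    rho_prev, alpha_prev = rho, alpha
    rho = r @ r
    beta = rho / rho_prev
    mu = r @ (A @ r)            # mu_k = <r_k, A r_k>
    p = r + beta * p
    v = A @ p
    sigma_true = p @ v                                   # direct value
    sigma_rec = mu - (beta**2 / alpha_prev) * rho_prev   # recurrence (9)
    max_rel = max(max_rel, abs(sigma_true - sigma_rec) / sigma_true)
    alpha = rho / sigma_true
    r = r - alpha * v
```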
2.3 Rearranged methods 2 and 3

If we expand the occurrences of $p_k$ in $p_k^t A p_k$ one at a time, we find different arrangements of the algorithm. Expanding one occurrence gives

    $\sigma_k = \langle p_k, A p_k \rangle$
        $= \langle r_k + \beta_k p_{k-1}, A p_k \rangle = \langle r_k, A p_k \rangle$
        $= \langle r_k, A (r_k + \beta_k p_{k-1}) \rangle$
        $= \langle r_k, A r_k \rangle + \beta_k \langle r_k, A p_{k-1} \rangle$
        $= \mu_k + \beta_k \gamma_k, \quad \text{where } \gamma_k = \langle r_k, A p_{k-1} \rangle,$   (10)

using the conjugacy $\langle p_{k-1}, A p_k \rangle = 0$ in the second step. Thus we find a rearrangement of the conjugate gradient method where we compute the inner products $r_k^t r_k$, $r_k^t A r_k$, and $r_k^t A p_{k-1}$ simultaneously and compute $\sigma_k = p_k^t A p_k$ by recurrence formula (10).

Successive expansion of $p_k = r_k + \beta_k p_{k-1}$ gives

    $p_k = r_k + \beta_k r_{k-1} + \beta_k \beta_{k-1} r_{k-2} + \cdots.$

Using equation (6) we then find

    $\sigma_k = \cdots$
        $= \langle r_k, A (r_k + \beta_k r_{k-1} + \beta_k \beta_{k-1} r_{k-2} + \cdots) \rangle$
        $= \langle r_k, A r_k \rangle + \beta_k \langle r_k, A r_{k-1} \rangle$
        $= \mu_k + \beta_k \delta_k, \quad \text{where } \delta_k = \langle r_k, A r_{k-1} \rangle.$   (11)

This gives us a rearrangement of the conjugate gradient method where we compute the inner products $r_k^t A r_k$, $r_k^t r_k$, and $r_k^t A r_{k-1}$ simultaneously and compute $\sigma_k = p_k^t A p_k$ by recurrence formula (11).
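Both recurrences can be checked the same way. The NumPy sketch below (illustrative test problem) carries $v_{k-1} = A p_{k-1}$ and $s_{k-1} = A r_{k-1}$ through a standard CG run and compares (10) and (11) against the directly computed $\sigma_k$:

```python
import numpy as np

# Illustrative check of recurrences (10) and (11) vs the direct <p_k, A p_k>.
rng = np.random.default_rng(4)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10 * np.eye(10)   # symmetric positive definite
b = rng.standard_normal(10)

# step k = 1; keep v_{k-1} = A p_{k-1} and s_{k-1} = A r_{k-1}
r = b.copy()
p = r.copy()
rho = r @ r
v = A @ p
s = A @ r
alpha = rho / (p @ v)
r = r - alpha * v

max_rel = 0.0
for _ in range(5):              # steps k = 2, 3, ...
    rho_prev = rho
    rho = r @ r
    beta = rho / rho_prev
    s_new = A @ r               # s_k = A r_k
    mu = r @ s_new              # mu_k    = <r_k, A r_k>
    gamma = r @ v               # gamma_k = <r_k, A p_{k-1}>, method 2
    delta = r @ s               # delta_k = <r_k, A r_{k-1}>, method 3
    p = r + beta * p
    v = A @ p
    sigma_true = p @ v
    for sigma_rec in (mu + beta * gamma, mu + beta * delta):   # (10), (11)
        max_rel = max(max_rel, abs(sigma_true - sigma_rec) / sigma_true)
    s = s_new
    alpha = rho / sigma_true
    r = r - alpha * v
```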
2.4 The modified conjugate gradient method

We propose the following rearrangement of the conjugate gradient procedure. First initialize $\sigma_1$ and $v_1$ by performing one step of the standard algorithm:

    $r_1 = b$;  $\rho_1 = \langle r_1, r_1 \rangle$;  $p_1 = r_1$;  $v_1 = A p_1$
    $\sigma_1 = \langle p_1, v_1 \rangle$;  $\alpha_1 = \rho_1 / \sigma_1$;  $x_2 = \alpha_1 p_1$;  $r_2 = r_1 - \alpha_1 v_1$

For $k = 2, 3, \ldots$

    $s_k = A r_k$
    Compute $\rho_k = \langle r_k, r_k \rangle$ and $\mu_k = \langle r_k, s_k \rangle$
        (and $\gamma_k = \langle r_k, v_{k-1} \rangle$ for method 2 only,
        and $\delta_k = \langle r_k, s_{k-1} \rangle$ for method 3 only)
    $\beta_k = \rho_k / \rho_{k-1}$
    $p_k = r_k + \beta_k p_{k-1}$
    $v_k = s_k + \beta_k v_{k-1}$ $\quad (v_k \equiv A p_k)$   (12)
    $\sigma_k = \mu_k - (\beta_k^2 / \alpha_{k-1}) \rho_{k-1}$ for method 1,
    $\sigma_k = \mu_k + \beta_k \gamma_k$ for method 2,
    $\sigma_k = \mu_k + \beta_k \delta_k$ for method 3
    $\alpha_k = \rho_k / \sigma_k$
    $x_{k+1} = x_k + \alpha_k p_k$
    $r_{k+1} = r_k - \alpha_k v_k.$

Note that the above procedure requires extra storage for the vector $s_k$ and extra work in updating the vector $v_k$ for all three methods. Methods 2 and 3 require an additional inner product and storage for $v_{k-1}$ and $s_{k-1}$, respectively.
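A serial NumPy sketch of the modified procedure with the method 1 choice of $\sigma_k$ follows (the function name and test problem are illustrative, not from the paper). Here $\rho_k$ and $\mu_k$ are formed from the same pair of vectors and could be fused into a single global reduction, $\sigma_k$ comes from recurrence (9), and $v_k = A p_k$ is obtained from update (12), so one matrix-vector product per iteration (with $r_k$) suffices.

```python
import numpy as np

def cg_method1(A, b, iters):
    """Sketch of the modified CG with the method 1 choice of sigma_k."""
    # one step of the standard algorithm to initialize
    r = b.copy()
    rho = r @ r
    p = r.copy()
    v = A @ p
    alpha = rho / (p @ v)
    x = alpha * p
    r = r - alpha * v
    for _ in range(1, iters):
        s = A @ r                    # s_k = A r_k, the only matvec per step
        rho_prev, alpha_prev = rho, alpha
        rho = r @ r                  # these two inner products can be
        mu = r @ s                   # combined into a single reduction
        beta = rho / rho_prev
        p = r + beta * p
        v = s + beta * v             # (12): v_k = A p_k by linearity
        sigma = mu - (beta**2 / alpha_prev) * rho_prev   # (9), method 1
        alpha = rho / sigma
        x = x + alpha * p
        r = r - alpha * v
    return x

# illustrative SPD test problem
rng = np.random.default_rng(5)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10 * np.eye(10)
b = rng.standard_normal(10)
err = np.linalg.norm(A @ cg_method1(A, b, 10) - b)
```

The extra storage for $s_k$ and the extra vector update for $v_k$ visible here are exactly the overheads noted above.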
3 Stability proof of method 1

In this section we present a proof of the stability of the first method, based on the equivalence of (7) to a stable recurrence arising from the symmetric Lanczos process. The outline of the proof is as follows.

First we recall the fact that the conjugate gradient method is a tridiagonalization procedure. Recurrence relation (7) in method 1 is then shown to be equivalent to the pivot recurrence for the factorization of this tridiagonal matrix. The cornerstone of the proof is that the symmetric tridiagonal matrix produced by the symmetric Lanczos process gives the same pivot recurrence. If the original coefficient matrix is symmetric positive definite, this recurrence is stable; hence the scalar recurrence in rearranged method 1 is stable.

The two basic vector updates of the conjugate gradient method,

    $r_{k+1} = r_k - \alpha_k A p_k, \qquad p_{k+1} = r_{k+1} + \beta_{k+1} p_k,$

can be summarized as

    $R(I - J) = A P D, \qquad P U = R,$

where $R$ and $P$ are matrices containing the vector sequences $\{r_k\}$ and $\{p_k\}$ as columns, and

    $J_{ij} = \begin{cases} 1 & \text{if } i = j + 1 \\ 0 & \text{otherwise}, \end{cases} \qquad D = \operatorname{diag}(\alpha_1, \alpha_2, \ldots), \qquad u_{ij} = \begin{cases} 1 & \text{if } i = j \\ -\beta_j & \text{if } i = j - 1 \\ 0 & \text{otherwise}. \end{cases}$