Parallelizing the Divide and Conquer Algorithm for the Symmetric Tridiagonal Eigenvalue Problem on Distributed Memory Architectures*

Françoise Tisseur†    Jack Dongarra‡

February 24, 1998

Abstract

We present a new parallel implementation of a divide and conquer algorithm for computing the spectral decomposition of a symmetric tridiagonal matrix on distributed memory architectures. The implementation we develop differs from other implementations in that we use a two-dimensional block cyclic distribution of the data, we use the Löwner theorem approach to compute orthogonal eigenvectors, and we introduce permutations before the back transformation of each rank-one update in order to make good use of deflation. This algorithm yields the first scalable, portable and numerically stable parallel divide and conquer eigensolver. Numerical results confirm the effectiveness of our algorithm. We compare performance of the algorithm with that of the QR algorithm and of bisection followed by inverse iteration on an IBM SP2 and a cluster of Pentium II's.

* This work was supported in part by Oak Ridge National Laboratory, managed by Lockheed Martin Energy Research Corp. for the U.S. Department of Energy under contract number DE-AC05-96OR2246, and by the Defense Advanced Research Projects Agency under contract DAAL03-91-C-0047, administered by the Army Research Office.

† Department of Mathematics, University of Manchester, Manchester, M13 9PL, England ([email protected], http://www.ma.man.ac.uk/~ftisseur/).

‡ Department of Computer Science, University of Tennessee, Knoxville, TN 37996-1301, USA, and Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN 37831 ([email protected], http://www.cs.utk.edu/~dongarra).

Key words. Divide and conquer, symmetric eigenvalue problem, tridiagonal matrix, rank-one modification, parallel algorithm, ScaLAPACK, LAPACK, distributed memory architecture

AMS subject classifications. 65F15, 68C25

1 Introduction

The divide and conquer algorithm is an important recent development for solving the tridiagonal symmetric eigenvalue problem. The algorithm was first developed by Cuppen [8], based on previous ideas of Golub [17] and of Bunch, Nielsen and Sorensen [5] for the solution of the secular equation, and was made popular as a practical parallel method by Dongarra and Sorensen [14]. This simple and attractive algorithm was considered unstable for a while because of a lack of orthogonality in the computed eigenvectors. It was thought that extended precision arithmetic was needed in the solution of the secular equation to guarantee that sufficiently orthogonal eigenvectors are produced when there are close eigenvalues. Recently, however, Gu and Eisenstat [20] have found a new approach that does not require extended precision, and we have used it in our implementation.

The divide and conquer algorithm has natural parallelism, as the initial problem is partitioned into several subproblems that can be solved independently. Early parallel implementations had mixed success. Dongarra and Sorensen [14] and later Darbyshire [9] wrote implementations for the shared memory machines Alliant FX/8 and KSR1. They concluded that divide and conquer algorithms, when properly implemented, can be many times faster than traditional ones such as bisection followed by inverse iteration or the QR algorithm, even on serial computers. Hence, a version has been incorporated in LAPACK [1]. Ipsen and Jessup [23] compared their parallel implementations of the divide and conquer algorithm and the bisection algorithm on the Intel iPSC-1 hypercube. They found that their bisection implementation was more efficient than their divide and conquer implementation because of the excessive amount of data transferred between processors and also because of the unbalanced work load after the deflation process. More recently, Gates and Arbenz [16], with an implementation for the Intel Paragon, and Fachin [15], with an implementation on a network of T800 transputers, showed that good speed-up can be achieved by distributed memory parallel implementations. However, their implementations are not as efficient as they could have been. They did not use the techniques described in [20] that guarantee the orthogonality of the eigenvectors and that make good use of deflation in order to speed up the computation.

In this paper, we describe an efficient, scalable, and portable parallel implementation for distributed memory machines of a divide and conquer algorithm for the symmetric tridiagonal eigenvalue problem.

Divide and conquer methods consist of an initial partition of the problem into subproblems; then, after some appropriate computations on these individual subproblems, the results are joined together using rank-r updates (r ≥ 1). We chose to implement the rank-one update of Cuppen [8] rather than the rank-two update used in [16], [20]. A priori, we see no reason why one update should be more accurate than the other or faster in general, but Cuppen's method, as reviewed in Section 2, appears to be easier to implement.

In Section 3 we discuss several important issues to consider for parallel implementations of a divide and conquer algorithm. Then, in Section 4, we derive our algorithm. We have implemented our algorithm in Fortran 77 as production quality software in the ScaLAPACK model [4]. Our algorithm is well suited to computing all the eigenvalues and eigenvectors of large matrices with clusters of eigenvalues. For these problems, bisection followed by inverse iteration as implemented in ScaLAPACK [4], [11] is limited by the size of the largest cluster that fits on one processor. The QR algorithm is less sensitive to the eigenvalue distribution but is more expensive in computation and communication, and thus does not perform as well as the divide and conquer method. Examples that demonstrate the efficiency and numerical performance are presented in Section 6.

2 Cuppen's Method

Solving the symmetric eigenvalue problem consists, in general, of three steps: the symmetric matrix A is reduced to tridiagonal form T, then one computes the eigenvalues and eigenvectors of T, and finally one computes the eigenvectors of A from the eigenvectors of T.

In this section we consider the problem of determining the spectral decomposition

\[
T = W \Lambda W^T
\]

of a symmetric tridiagonal matrix T ∈ R^{n×n}, where Λ is diagonal and W is orthogonal.

Cuppen [8] divides the original problem into subproblems of smaller size by introducing the decomposition

\[
T = \begin{pmatrix} T_1 & \beta_k e_k e_1^T \\ \beta_k e_1 e_k^T & T_2 \end{pmatrix}
  = \begin{pmatrix} \hat T_1 & 0 \\ 0 & \hat T_2 \end{pmatrix}
  + \beta_k \theta \begin{pmatrix} e_k \\ \theta^{-1} e_1 \end{pmatrix}
    \begin{pmatrix} e_k^T & \theta^{-1} e_1^T \end{pmatrix},
\]

where 1 ≤ k ≤ n, e_j represents the jth canonical vector of appropriate dimension, β_k is the kth off-diagonal element of T, and the k × k matrix T̂_1 and the (n − k) × (n − k) matrix T̂_2 differ from the corresponding submatrices of T only by their last and first diagonal coefficients, respectively. This is the divide phase. Dongarra and Sorensen [14] introduced the factor θ to avoid cancellation when forming the new diagonal elements of diag(T̂_1, T̂_2). We now have two independent symmetric tridiagonal eigenvalue problems of order k and n − k. Let

\[
\hat T_1 = Q_1 D_1 Q_1^T, \qquad \hat T_2 = Q_2 D_2 Q_2^T \tag{2.1}
\]

be their spectral decompositions and let

\[
z = \operatorname{diag}(Q_1, Q_2)^T \begin{pmatrix} e_k \\ \theta^{-1} e_1 \end{pmatrix}. \tag{2.2}
\]

Then we get

\[
T = \begin{pmatrix} Q_1 & 0 \\ 0 & Q_2 \end{pmatrix}
    \left( \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix} + \rho z z^T \right)
    \begin{pmatrix} Q_1 & 0 \\ 0 & Q_2 \end{pmatrix}^T, \qquad \rho = \beta_k \theta, \tag{2.3}
\]

\[
  = Q \,(D + \rho z z^T)\, Q^T,
\]

and the eigenvalues of T are therefore those of D + ρzz^T. Finding the spectral decomposition

\[
D + \rho z z^T = U \Lambda U^T
\]

of a rank-one update is the heart of the divide and conquer algorithm. This is the conquer phase. It then follows that

\[
T = W \Lambda W^T \quad \text{with} \quad W = Q U. \tag{2.4}
\]

A recursive application of the strategy described above to the two tridiagonal matrices in (2.1) leads to the divide and conquer method for the symmetric tridiagonal eigenvalue problem.

2.1 Computing the Spectral Decomposition of a Rank-One Perturbed Matrix

An updating technique as described in [5], [8], [17] can be used to compute the spectral decomposition of a rank-one perturbed matrix

\[
D + \rho z z^T = U \Lambda U^T, \tag{2.5}
\]

where D = diag(d_1, d_2, ..., d_n), z = (z_1, z_2, ..., z_n)^T and ρ is a nonzero scalar. By setting the characteristic polynomial of D + ρzz^T equal to zero, we find that the eigenvalues {λ_i}_{i=1}^n of D + ρzz^T are the roots of

\[
f(\lambda) = 1 + \rho \sum_{i=1}^{n} \frac{z_i^2}{d_i - \lambda}, \tag{2.6}
\]

which is called the secular equation. Many methods have been suggested for solving the secular equation (see Melman [27] for a survey). Each eigenvalue is computed in O(n) flops. A corresponding normalized eigenvector u can be computed from the formula

\[
u = \frac{(D - \lambda I)^{-1} z}{\|(D - \lambda I)^{-1} z\|_2}
  = \left( \frac{z_1}{d_1 - \lambda}, \dots, \frac{z_n}{d_n - \lambda} \right)^T
    \bigg/ \sqrt{\sum_{j=1}^{n} \frac{z_j^2}{(d_j - \lambda)^2}} \tag{2.7}
\]

in only O(n) flops. Thus, the spectral decomposition of a rank-one perturbed matrix can be computed in O(n^2) flops. Unfortunately, calculation of the eigenvectors using (2.7) can lead to a loss of orthogonality for close eigenvalues. We discuss solutions to this problem in Section 4.2.
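To make (2.5)-(2.7) concrete, the short NumPy sketch below computes the spectral decomposition of D + ρzz^T by locating each root of the secular equation with plain bisection and then evaluating (2.7). It is only an illustration under simplifying assumptions (ρ > 0, strictly increasing d_i, no deflation, no careful stopping criterion); the function name secular_eigen and the bisection root finder are ours, not the LAPACK routine xLAED4 discussed later.

    import numpy as np

    def secular_eigen(d, z, rho):
        """Spectral decomposition of diag(d) + rho*z*z^T for rho > 0 and strictly
        increasing d.  Returns (lam, U) with the matrix ~ U @ diag(lam) @ U.T.

        Minimal sketch: each root of the secular function (2.6) is located by
        plain bisection in the interval known to contain it, and the eigenvectors
        are formed with formula (2.7).  No deflation, no careful stopping test.
        """
        n = d.size
        f = lambda x: 1.0 + rho * np.sum(z**2 / (d - x))   # secular function
        lam = np.empty(n)
        for i in range(n):
            lo = d[i]                                      # i-th root lies in (d_i, d_{i+1});
            hi = d[i + 1] if i + 1 < n else d[-1] + rho * (z @ z)  # last root below d_n + rho*z'z
            for _ in range(200):                           # plain bisection
                mid = 0.5 * (lo + hi)
                if f(mid) > 0.0:
                    hi = mid
                else:
                    lo = mid
            lam[i] = 0.5 * (lo + hi)
        U = z[:, None] / (d[:, None] - lam[None, :])       # formula (2.7), unnormalized
        U /= np.linalg.norm(U, axis=0)
        return lam, U

    # Small check against numpy's symmetric eigensolver.
    rng = np.random.default_rng(0)
    d = np.sort(rng.standard_normal(6))
    z = rng.standard_normal(6)
    z /= np.linalg.norm(z)
    lam, U = secular_eigen(d, z, 1.0)
    print(np.allclose(np.linalg.eigvalsh(np.diag(d) + np.outer(z, z)), lam))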

2.2 Deflation

Dongarra and Sorensen showed [14] that the problem (2.5) can potentially be reduced in size. If z_i = 0 for some i, we see from (2.3) that d_i is an eigenvalue of D + ρzz^T with eigenvector e_i. Moreover, if D has an eigenvalue d_i of multiplicity m > 1, we can rotate the eigenvector basis in order to zero out the components of z corresponding to the repeated eigenvalues. Then, we can remove the rows and columns of D + ρzz^T corresponding to zero components of z.

When working in finite precision arithmetic, we must deal with components z_i that are nearly equal to zero and with nearly equal d_i's. Then, in order to describe precisely when we can deflate, we need to define a tolerance τ. Let τ = ε‖D + ρzz^T‖_2, where ε is the machine precision.

We say that the first type of deflation arises when |ρ z_i| ≤ τ. Now assume that ‖z‖_2 = 1. We have

\[
\| (D + \rho z z^T) e_i - d_i e_i \|_2 = |\rho z_i| \, \|z\|_2 \le \tau.
\]

Then (d_i, e_i) may be considered as an approximate eigenpair for D + ρzz^T and z_i is set to zero.

i

The second typ e of de ation comes from a applied to

T

D + z z in order to set a comp onentofz equal to zero. Supp ose that

q

2 2

jz z jjd d j= z + z  . Let G b e the Givens rotation de ned by

i j i j ij

i j

 !

c s

T

[e ;e ] G [e ;e ]=

i j ij i j

s c

q

2 2

2 2

z + z with c = z =r;s= z =r;r= and c + s = 1. Then

i j

i j

T T T

e

= D + z~z~ + E ; kE k  ; G D + z z G

ij ij 2 i

i

2 2 2 2 2

~ ~

wherez ~ = r ; z~ =0; d = d c + d s ; d = d s + d c and E =d d cs.

i j i i j j i j ij i j

The result of recognizing all these deflations is to replace the rank-one update problem D + ρzz^T with one of smaller size. Hence, if G is the product of all the rotations used to zero out certain components of z and if P is the accumulation of permutations used to translate the zero components of z to the bottom of z, the result is

\[
P G (D + \rho z z^T) G^T P^T =
\begin{pmatrix} \tilde D + \rho \tilde z \tilde z^T & 0 \\ 0 & \Omega \end{pmatrix} + E,
\qquad \|E\|_2 \le c\,\tau,
\]

with c a constant of order unity and Ω a diagonal matrix containing the deflated eigenvalues. The matrix D̃ + ρz̃z̃^T has only simple eigenvalues and all the elements of z̃ are nonzero.

The deflation process is essential for the success of the divide and conquer algorithm. In practice, the dimension of D̃ + ρz̃z̃^T is usually considerably smaller than the dimension of D + ρzz^T, reducing the number of flops needed to compute the eigenvector matrix of T in (2.4). Cuppen [8] showed that deflations are more likely to take place when the matrix is diagonally dominant.
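The two deflation tests just described can be summarized in a small sketch. The following NumPy function (an illustration with our own names, not the library routine) zeroes the components z_i with |ρ z_i| ≤ τ and then applies Givens rotations to pairs whose criterion falls below τ; the permutation of the deflated entries to the bottom and the accumulation of G and P are omitted.

    import numpy as np

    def deflate(d, z, rho, tol):
        """One sweep of the two deflation tests for diag(d) + rho*z*z^T (illustration).

        Returns (d, z, deflated) where deflated[i] is True when (d[i], e_i) can be
        accepted as an approximate eigenpair.  A simplified sketch of Section 2.2:
        the permutation of the deflated entries to the bottom and the accumulation
        of the rotations G and the permutation P are omitted.
        """
        d, z = d.copy(), z.copy()
        n = d.size
        deflated = np.zeros(n, dtype=bool)

        # First test: a negligible component of z.
        for i in range(n):
            if abs(rho * z[i]) <= tol:
                z[i] = 0.0
                deflated[i] = True

        # Second test: nearly equal diagonal entries; rotate so one component of z vanishes.
        for i in range(n):
            if deflated[i]:
                continue
            for j in range(i + 1, n):
                if deflated[j]:
                    continue
                r2 = z[i]**2 + z[j]**2
                if r2 > 0 and abs(z[i] * z[j]) * abs(d[i] - d[j]) / r2 <= tol:
                    r = np.sqrt(r2)
                    c, s = z[i] / r, z[j] / r
                    di, dj = d[i], d[j]
                    d[i] = di * c**2 + dj * s**2         # rotated diagonal entries
                    d[j] = di * s**2 + dj * c**2
                    z[i], z[j] = r, 0.0
                    deflated[j] = True
        return d, z, deflated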

2.3 Algorithm and Complexity

The divide and conquer algorithm is naturally expressed in recursive form as follows.

Procedure dc_eigendecomposition(T, W, Λ)
{ From input T compute output (W, Λ) such that T = W Λ W^T. }
if T is 1-by-1
    return W = 1, Λ = T
else
    Express T = ( T_1  0 ; 0  T_2 ) + ρ v v^T
    call dc_eigendecomposition(T_1, W_1, Λ_1)
    call dc_eigendecomposition(T_2, W_2, Λ_2)
    Form D + ρ z z^T from Λ_1, W_1, Λ_2, W_2
    Find the eigenvalues Λ and eigenvectors U of D + ρ z z^T
    Form W = ( W_1  0 ; 0  W_2 ) U
    return W, Λ
end
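For concreteness, the recursion can be written in a few lines of NumPy. The sketch below mirrors dc_eigendecomposition: it tears the tridiagonal matrix, recurses on the two halves, and solves the rank-one update, here simply with numpy.linalg.eigh to keep the example self-contained (θ = 1 and deflation are omitted).

    import numpy as np

    def dc_eig(alpha, beta):
        """Eigendecomposition of the symmetric tridiagonal matrix with diagonal
        alpha (length n) and off-diagonal beta (length n-1).

        Returns (lam, W) with T = W @ diag(lam) @ W.T.  A serial sketch of the
        recursion above (theta = 1, no deflation); the rank-one update is solved
        with numpy.linalg.eigh instead of the secular equation for brevity.
        """
        n = alpha.size
        if n == 1:
            return alpha.copy(), np.ones((1, 1))
        k = n // 2
        b = beta[k - 1]                                  # torn off-diagonal element
        a1, a2 = alpha[:k].copy(), alpha[k:].copy()
        a1[-1] -= b                                      # last diagonal entry of T1_hat
        a2[0] -= b                                       # first diagonal entry of T2_hat
        lam1, Q1 = dc_eig(a1, beta[:k - 1])
        lam2, Q2 = dc_eig(a2, beta[k:])
        d = np.concatenate([lam1, lam2])
        z = np.concatenate([Q1[-1, :], Q2[0, :]])        # z = diag(Q1, Q2)^T (e_k; e_1)
        lam, U = np.linalg.eigh(np.diag(d) + b * np.outer(z, z))
        W = np.zeros((n, n))
        W[:k, :k], W[k:, k:] = Q1, Q2
        return lam, W @ U                                # back transformation W = Q U

    # Quick check on a random tridiagonal matrix.
    rng = np.random.default_rng(1)
    a, b = rng.standard_normal(8), rng.standard_normal(7)
    lam, W = dc_eig(a, b)
    T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
    print(np.allclose(W @ np.diag(lam) @ W.T, T))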

The recursion can be carried on until we reach a 2 × 2 or 1 × 1 eigenvalue problem, or it can be terminated with an n_0 × n_0 problem, for which we can use the QR algorithm or some other method to solve the tridiagonal problem. Note that we have not specified the dimensions of T_1 and T_2. We refer to Section 4.1 for details of how we choose the size of the subproblems in our parallel implementation.

Figure 3.1: A divide and conquer tree (levels 0 to 3, with subproblems Pb0 to Pb7 at the leaves).

We count as one flop an elementary floating point operation (+, −, ×, /) or a square root. Assuming no deflation and ignoring terms in O(n^2), the number of flops t(n) needed to run dc_eigendecomposition for an n × n matrix T satisfies the recursion

\[
t(n) = n^3 + 2\,t(n/2),
\]

which has the solution

\[
t(n) \approx \frac{4}{3} n^3 + O(n^2).
\]

In practice, because of deflation, it appears that the algorithm takes only O(n^{2.3}) flops on average, and the cost can even be as low as O(n^2) for some special cases (see [10]).
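The constant 4/3 follows from unrolling the recursion, each halving dividing the cubic term by four:

\[
t(n) = n^3 + 2\,t(n/2)
     = n^3 + 2\Bigl(\tfrac{n}{2}\Bigr)^3 + 4\Bigl(\tfrac{n}{4}\Bigr)^3 + \cdots
     = n^3 \sum_{j \ge 0} 4^{-j} + O(n^2)
     = \tfrac{4}{3}\,n^3 + O(n^2).
\]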

3 Parallelization Issues

Divide and conquer algorithms have been successfully implemented on shared memory multiprocessors for solving the symmetric tridiagonal eigenvalue problem and for computing the singular value decomposition of bidiagonal matrices [14], [24]. By contrast, the implementation of these algorithms on distributed memory machines poses difficulties. Several issues need to be addressed and several implementations are possible.

The first issue is how to split the work among the processors. As shown in Figure 3.1, the recursive matrix splitting leads to a hierarchy of subproblems with a data dependency graph in the form of a binary tree. This structure suggests a natural way to split the work among the processes. If P is the number of processes, the smallest subproblems are chosen to be of dimension n/P, lying at the leaves of the tree. At the top of the tree, all processes cooperate. At each branch of the tree, the task is naturally split between two sets of processes and, within each set, the processes cooperate. At the leaves of the tree, each process solves its subproblem independently. This is the way previous implementations have been done [15], [16], [23]. This approach offers a natural parallelism for the update of the subproblems.

Ipsen and Jessup [23] report an unbalanced work load among the processes when the deflations are not evenly distributed across the sets of processes involved at the branches of the tree. In this case, the faster set of processes (those that experience deflation) will have to wait for the other set of processes before beginning the next merge. This reduces the speedup gained through the use of the tree. Gates and Arbenz [16] showed that if we assume that the work corresponding to any node in the tree is well balanced among the processors, then the implementation should still have 85% efficiency even in the worst case of badly distributed deflations. However, it is worth considering this problem for our implementation. A possible approach is dynamic splitting versus static splitting [6], where a task list is used to keep track of the various parts of the matrix during the decomposition process and to make use of both data and task parallelism. This approach has been investigated for the parallel implementation of the spectral divide and conquer algorithm for the unsymmetric eigenvalue problem using the matrix sign function [2].¹ We did not choose this approach because in the symmetric case the partitioning of the matrix can be done arbitrarily, and we prefer to take advantage of this opportunity. By contrast with previous implementations, we use a splitting different from the natural splitting associated with the binary tree. We explain our approach in Section 4.1.

¹ A ScaLAPACK prototype code is available at http://www.netlib.org/scalapack/prototype/.

The second issue is how to maintain orthogonality between eigenvectors in the presence of close eigenvalues. There are two approaches: the extra precision approach of Sorensen and Tang [30], used by Gates and Arbenz [16] in their implementation, and the Löwner theorem approach proposed by Gu and Eisenstat [19] and adopted for LAPACK [1], [29]. There are trade-offs between these two approaches that we shall discuss in Section 4.2.

The third issue is the back transformation process, which is of great importance for the success of divide and conquer algorithms because it reduces the cost of forming the eigenvector matrix of the tridiagonal form. We explain in Section 4.3 the idea of Gu and Eisenstat for reorganizing the data structure of the orthogonal matrices before the back transformation, and we propose a parallel implementation of this approach. While used in the serial LAPACK divide and conquer code, this idea has never been considered in any current parallel implementation of the divide and conquer algorithm.

The last issue, and perhaps the most critical step when writing a parallel program, is how to distribute the data. Previous implementations used a one-dimensional distribution [16], [23]. Gates and Arbenz [16] used a one-dimensional row block distribution for Q, the matrix of eigenvectors, and a one-dimensional column block distribution for U, the eigenvector matrix of the rank-one updates. This distribution simplifies the parallel matrix-matrix multiplication used for the back transformation QU. However, their matrix multiplication routine grows in communication with the number of processes, making it not scalable. By contrast, our implementation uses a two-dimensional block cyclic distribution of the data in the style of ScaLAPACK. We use the PBLAS (Parallel BLAS) routine PxGEMM to perform our parallel matrix multiplications. This routine has a communication cost that grows with the square root of the number of processes, leading to good efficiency and scalability.

4 Implementation Details

We now describe a parallel implementation of the divide and conquer algorithm that addresses the issues discussed in the previous section. Our goal was to write an efficient, reliable, portable and scalable code that follows the conventions of the ScaLAPACK software [4].

On shared-memory concurrent computers, LAPACK seeks to make efficient use of the memory hierarchy by maximizing data reuse. Specifically, LAPACK casts linear algebra computations in terms of block-oriented matrix-matrix operations whenever possible, enabling maximal use of Level 3 BLAS matrix-matrix operations. An analogous approach has been taken by ScaLAPACK for distributed-memory machines. ScaLAPACK uses block partitioned algorithms in order to reduce the frequency with which data must be transferred between processes, and thereby to reduce the fixed startup cost incurred each time a message is sent.

4.1 Data Distribution

In contrast to all previous parallel implementations for distributed memory machines [15], [16], [23], we use a two-dimensional block cyclic distribution for U, the eigenvector matrix of the rank-one update, and for Q, the matrix of the back transformation. For linear algebra routines and matrix multiplications, the two-dimensional block cyclic distribution has been shown to be efficient and scalable [7], [21], [28]. The ScaLAPACK software has adopted this data layout.

The block cyclic distribution is a generalization of the block and the cyclic distributions. The processes of the parallel computer are first mapped onto a two-dimensional rectangular grid of size P × Q. Any general m × n dense matrix is decomposed into m_b × n_b blocks starting at its upper left corner. These blocks are then uniformly distributed in each dimension of the P × Q process grid, as illustrated in Figure 4.1. In our implementation, we impose the block size to be equal in each direction: m_b = n_b.
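To make the layout concrete, the small Python helper below (our own, not part of ScaLAPACK) returns the process coordinates that own a given matrix entry under a two-dimensional block cyclic distribution with square blocks of order n_b on a P × Q grid; with P = 2 and Q = 3 it reproduces the kind of assignment shown in Figure 4.1.

    def owner(i, j, nb, P, Q):
        """Process coordinates (p, q) owning global entry (i, j) under a 2-D block
        cyclic distribution with nb-by-nb blocks on a P-by-Q process grid."""
        return (i // nb) % P, (j // nb) % Q

    # With P = 2, Q = 3: block row k is stored on process row k mod 2 and block
    # column l on process column l mod 3, as illustrated in Figure 4.1.
    print([[owner(i, j, 1, 2, 3) for j in range(6)] for i in range(4)])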

Figure 4.1: Global (left) and distributed (right) views of the matrix (P = 2, Q = 3).

Previous parallel implementations gave a subproblem of size n/(P × Q) to each processor. As we use the two-dimensional block cyclic distribution, it is now natural to partition the original problem into subproblems of size n_b. At the leaves of the tree, processes that hold a diagonal block solve their own subproblems of size n_b × n_b using the QR algorithm or the serial divide and conquer algorithm. Some processes may hold several subproblems and some of them none. As the computational cost of this first step is negligible compared with the computational cost of the whole algorithm, it does not matter if the work is not well distributed there. However, good load balancing of the work is assured when the P × Q grid of processes is such that lcm(P, Q) = PQ, that is, when P and Q have no common factor. In this case, at the leaves of the tree, all the processes hold a subproblem [28]. The worst case happens when lcm(P, Q) = P or lcm(P, Q) = Q.

For a given rank-one update Q(D + ρzz^T)Q^T, the processes that collaborate are those that hold a part of the global matrix Q. By contrast with previous implementations, with the two-dimensional block cyclic distribution all the processes collaborate before reaching the top of the tree. We illustrate this in Figures 4.2 and 4.3, where the eigenvector matrix is distributed over 4 processes, using first a one-dimensional block distribution and second a two-dimensional block cyclic distribution. Suppose that, at level 1, all the deflations occur in the first submatrix, which is held by processes P0 and P1 for the one-dimensional block distribution and by processes P0, P1, P2, P3 for the two-dimensional block cyclic distribution. With the one-dimensional block distribution, processes P0 and P1 will have little computation to perform and will then have to wait for processes P2 and P3 before beginning the last rank-one update (level 0). With the two-dimensional block cyclic distribution, the computation is distributed among the 4 processes. This solves some of the load balancing problems that may appear when deflations are not evenly distributed among the processes.

Figure 4.2: Active part of the matrix Q held by each process at levels 2, 1 and 0 (one-dimensional block distribution).

Figure 4.3: Active part of the matrix Q held by each process at levels 2, 1 and 0 (two-dimensional block cyclic distribution).

Moreover, the two-dimensional block cyclic distribution is particularly well adapted to efficient and scalable parallel matrix-matrix multiplications. These operations are the main computational cost of this algorithm. In our parallel implementation we use PxGEMM, as included in ScaLAPACK.

4.2 Orthogonality

Let λ̂ be an approximate root of the secular equation (2.6). When we approximate the eigenvector u by replacing λ in (2.7) by its approximation λ̂, then when d_j ≈ λ, even if λ̂ is close to λ, the ratio z_j/(d_j − λ̂) can be very far from the exact one z_j/(d_j − λ). As a consequence, the computed eigenvector is very far from the true one and the resulting eigenvector matrix is far from being orthogonal.

Sorensen and Tang [30] proposed using extended precision to compute the differences d_j − λ̂_i. However, this approach is hard to implement portably across all the usual architectures. There are many machine-dependent tricks to make the implementation of extended precision go faster, but on some machines, such as Crays, these tricks do not help and performance suffers.

The Gu and Eisenstat approach, based on the Löwner theorem, can easily be implemented portably on IEEE machines and Crays using only working precision arithmetic throughout, with a trivial bit of extra work in one place to compensate for the lack of a guard digit in Cray add/subtract. By contrast, the Löwner approach may require more communication than the extra precision approach, depending on how the parallelization is done. The reason is that the Löwner approach uses a formula that requires information about all the eigenvalues, requiring a broadcast, whereas the extra precision approach is "embarrassingly" parallel, with each eigenvalue and eigenvector computed without communication. However, the extra communication the Löwner approach uses is trivial compared with the communication of eigenvectors elsewhere in the computation.

The Löwner approach [19], [26] considers that the computed eigenvalues are the exact eigenvalues of a new rank-one modification D + ρz̃z̃^T. By definition we have that

\[
\det(D + \rho \tilde z \tilde z^T - \lambda I) = \prod_{j=1}^{n} (\hat\lambda_j - \lambda) \tag{4.1}
\]

and also

\[
\det(D + \rho \tilde z \tilde z^T - \lambda I)
  = \prod_{j=1}^{n} (d_j - \lambda)
    \left( 1 + \rho \sum_{j=1}^{n} \frac{\tilde z_j^2}{d_j - \lambda} \right). \tag{4.2}
\]

Then, combining (4.1) and (4.2) and setting λ = d_i leads to

\[
\tilde z_i = \sqrt{ \frac{1}{\rho} \, (\hat\lambda_i - d_i)
  \prod_{\substack{j=1 \\ j \ne i}}^{n} \frac{\hat\lambda_j - d_i}{d_j - d_i} },
  \qquad i = 1, \dots, n.
\]

If all the quantities λ̂_j − d_i are computed to high relative accuracy, then z̃_i can be computed to high relative accuracy. Substituting the exact eigenvalues {λ̂_i}_{i=1}^n and the computed z̃ into (2.7) gives

\[
\hat u_i = \left( \frac{\tilde z_1}{d_1 - \hat\lambda_i}, \dots,
                 \frac{\tilde z_n}{d_n - \hat\lambda_i} \right)^T
           \bigg/ \sqrt{ \sum_{j=1}^{n} \frac{\tilde z_j^2}{(d_j - \hat\lambda_i)^2} }.
\]

Evaluating this formula, we obtain approximations of high componentwise accuracy to the eigenvectors û_i of D + ρz̃z̃^T. Provided that the eigenvalues λ̂_i are sufficiently accurate, which is assured by the use of a suitable stopping criterion when solving the secular equation, it can be shown (see [19]) that we obtain a numerical eigendecomposition T ≈ Û Λ̂ Û^T, that is, a decomposition with a relative residual of order ε and with Û orthogonal to working precision.
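The construction can be illustrated in a few lines of NumPy. The sketch below (our own helper, for illustration only) recovers z̃ from the product formula above, given d, the computed roots and ρ, and then forms the eigenvectors with (2.7); the library code instead carries the differences λ̂_j − d_i themselves to high relative accuracy.

    import numpy as np

    def loewner_vectors(d, lam_hat, rho):
        """Recover z_tilde from the Loewner formula, given d (strictly increasing),
        the computed eigenvalues lam_hat of diag(d) + rho*z*z^T and rho, and return
        (z_tilde, U_hat) with the eigenvectors formed as in (2.7).

        Illustration only: it works from lam_hat and d directly, whereas the
        library code carries the differences lam_hat[j] - d[i] to high relative
        accuracy.
        """
        n = d.size
        z2 = np.empty(n)
        for i in range(n):
            num = np.prod(lam_hat - d[i])                # prod_j (lam_hat_j - d_i)
            den = np.prod(np.delete(d, i) - d[i])        # prod_{j != i} (d_j - d_i)
            z2[i] = num / (rho * den)
        z_tilde = np.sqrt(np.maximum(z2, 0.0))
        U = z_tilde[:, None] / (d[:, None] - lam_hat[None, :])   # formula (2.7)
        U /= np.linalg.norm(U, axis=0)
        return z_tilde, U

    # The eigenvector matrix built this way is orthogonal to working precision.
    rng = np.random.default_rng(2)
    d = np.sort(rng.standard_normal(6))
    z = rng.standard_normal(6)
    lam = np.linalg.eigvalsh(np.diag(d) + 0.5 * np.outer(z, z))
    zt, U = loewner_vectors(d, lam, 0.5)
    print(np.linalg.norm(U.T @ U - np.eye(6)))           # small, near machine precision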

There are several ways to parallelize this approach. Either we consider it as an O(n^2) operation, which means we can justify redundant computation in order to avoid communication, or we consider that the size of the data to communicate is negligible compared with what is sent for the back transformation, in which case we can justify communication to distribute the work among the processes. We chose the latter approach.

Let S denote the set of processes that cooperate for a given rank-one update and let k be the number of roots to approximate. Each process of S computes k/|S| roots, labeled {λ̂_i}_{i=i_0}^{k_s}, the corresponding quantities λ̂_i − d_j for i_0 ≤ i ≤ k_s and j = 1, ..., k, and a part of each component of z̃:

\[
z_j = \prod_{i=i_0}^{k_s} (\hat\lambda_i - d_j)
      \prod_{\substack{i=i_0 \\ i \ne j}}^{k_s} (d_i - d_j)^{-1}.
\]

The results are broadcast over S and the processes update their z̃:

\[
\tilde z_j = \prod_{S} z_j.
\]

Each process of S then holds the necessary information to compute its local part of Û and no more communication is needed.
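The following sketch simulates this distribution of work on a single process: the members of S are represented by a loop, each one forms its partial products for every component of z̃, and the contributions are multiplied together as they would be after the broadcast. The way the roots are blocked and all names are our own simplification.

    import numpy as np

    def ztilde_distributed(d, lam_hat, rho, nprocs):
        """Serial simulation of the distributed computation of z_tilde: the roots
        are split among nprocs processes, each process forms its partial products
        for every component j, and the partial results are multiplied together
        (the step that follows the broadcast over S).  Illustrative names only."""
        n = d.size
        chunks = np.array_split(np.arange(n), nprocs)       # roots owned by each process
        partial = np.ones((nprocs, n))
        for p, idx in enumerate(chunks):
            for j in range(n):
                num = np.prod(lam_hat[idx] - d[j])          # prod over owned roots
                den = np.prod(d[idx[idx != j]] - d[j])      # prod over owned d_i, i != j
                partial[p, j] = num / den
        z2 = np.prod(partial, axis=0) / rho                 # combine all contributions
        return np.sqrt(np.maximum(z2, 0.0))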

To compute the approximate eigenvalues and the quantities λ̂_j − d_i stably and efficiently, we use the hybrid scheme for the rational interpolation of f(x) developed by Li [25]. The hybrid scheme keeps the peak number of iterations for solving the secular equation relatively small. For our parallel implementation this is helpful, because the execution time for this part is determined by whichever eigenvalue takes the largest number of iterations.

4.3 Back Transformation

The main cost of the divide and conquer algorithm is in computing the product QU (see (2.4)). The efficiency of the whole implementation relies on a proper implementation of this back transformation. The goal is to reduce the size of the matrix-matrix multiplication when transforming the eigenvectors of the rank-one perturbed matrix into the eigenvectors of the tridiagonal matrix.

In this section, we explain a permutation strategy originally suggested by Gu [18] and used in the serial LAPACK divide and conquer code. Then we derive a permutation strategy more suitable for our parallel implementation. This new strategy is one of the major contributions of our work.

After the deflation process, we denote by G the product of all the Givens rotations used to set to zero the components of z corresponding to nearly equal diagonal elements of D, and by P the accumulation of permutations used to translate the zero components of z to the bottom of z:

\[
P G (D + \rho z z^T) G^T P^T =
\begin{pmatrix} \tilde D + \rho \tilde z \tilde z^T & 0 \\ 0 & \Omega \end{pmatrix}. \tag{4.3}
\]

Let (Ũ, Λ̃) be the spectral decomposition of D̃ + ρz̃z̃^T. Then

\[
\begin{pmatrix} \tilde D + \rho \tilde z \tilde z^T & 0 \\ 0 & \Omega \end{pmatrix}
= \begin{pmatrix} \tilde U & 0 \\ 0 & I \end{pmatrix}
  \begin{pmatrix} \tilde \Lambda & 0 \\ 0 & \Omega \end{pmatrix}
  \begin{pmatrix} \tilde U & 0 \\ 0 & I \end{pmatrix}^T
= U \Lambda U^T,
\]

and the spectral decomposition of the tridiagonal matrix T = Q(D + ρzz^T)Q^T is obtained from

\[
T = Q (PG)^T \, PG (D + \rho z z^T) G^T P^T \, PG \, Q^T
  = Q (PG)^T U \Lambda U^T \, PG \, Q^T
  = W \Lambda W^T
\]

with W = Q(PG)^T U.

When not properly implemented, the computation of W can be very expensive. To simplify the explanation, we illustrate it with a 4 × 4 example: we suppose that d_1 = d_3 and that G is the Givens rotation used to set to zero the third component of z. The matrix P is a permutation that moves z_3 to the bottom of z. We indicate by a "•" that a value has changed. There are two ways of applying the transformation (PG)^T: either on the left, that is, to Q, or on the right, that is, to U. Note that Q = diag(Q_1, Q_2) is block diagonal (see (2.3)), and so we would like to take advantage of this structure; it would halve the cost of the matrix multiplication if Q_1 and Q_2 are of the same size. To preserve the block diagonal form of Q, we need to apply (PG)^T on the right:

\[
Q \cdot (PG)^T U =
\begin{pmatrix} \times & \times & 0 & 0 \\ \times & \times & 0 & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
G^T
\begin{pmatrix} \times & \times & \times & 0 \\ \times & \times & \times & 0 \\ 0 & 0 & 0 & 1 \\ \times & \times & \times & 0 \end{pmatrix}
=
\begin{pmatrix} \times & \times & 0 & 0 \\ \times & \times & 0 & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
\begin{pmatrix} \bullet & \bullet & \bullet & \bullet \\ \times & \times & \times & 0 \\ \bullet & \bullet & \bullet & \bullet \\ \times & \times & \times & 0 \end{pmatrix}.
\]

The product of the two last matrices is performed with 64 flops instead of the 2n^3 = 128 flops of a full matrix product. However, if we apply (PG)^T on the left, we can reduce the number of flops further. Consider again the 4 × 4 example:

\[
Q (PG)^T \cdot U = (Q G^T) P^T \cdot U =
\begin{pmatrix} \bullet & \times & 0 & \bullet \\ \bullet & \times & 0 & \bullet \\ \bullet & 0 & \times & \bullet \\ \bullet & 0 & \times & \bullet \end{pmatrix}
\begin{pmatrix} \times & \times & \times & 0 \\ \times & \times & \times & 0 \\ \times & \times & \times & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
= \tilde Q U.
\]

At this step, a permutation P^* is used to group the columns of Q̃ according to their sparsity structure:

\[
\tilde Q (P^*)^T \cdot P^* U =
\begin{pmatrix} \times & 0 & \bullet & \bullet \\ \times & 0 & \bullet & \bullet \\ 0 & \times & \bullet & \bullet \\ 0 & \times & \bullet & \bullet \end{pmatrix}
\begin{pmatrix} \times & \times & \times & 0 \\ \times & \times & \times & 0 \\ \times & \times & \times & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
= Q^* U^*.
\]

Then three matrix multiplications are performed, with 48 flops in total, involving the matrices Q^*(1:2, 1), Q^*(3:4, 2), Q^*(1:4, 3) and U^*(1:3, 1:3). This organization allows the BLAS to perform three matrix multiplies of minimal size.
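In serial NumPy terms, the grouped product amounts to multiplying only the nonzero blocks. The sketch below (our own notation) assumes the columns of Q(PG)^T have already been grouped into columns that are zero below row n_1, dense columns, columns that are zero above row n_1, and deflated columns; the three small products correspond to the three BLAS calls mentioned above.

    import numpy as np

    def back_transform(q_up, q_full, q_lo, q_defl, u_tilde, n1):
        """Form W = Q_star @ U_star using three small matrix products.

        q_up:    n x k1 columns, nonzero only in rows :n1
        q_full:  n x k2 dense columns
        q_lo:    n x k3 columns, nonzero only in rows n1:
        q_defl:  n x k4 deflated columns (already eigenvectors of T)
        u_tilde: (k1+k2+k3) x (k1+k2+k3) eigenvector matrix of the deflated update
        """
        k1, k2, k3 = q_up.shape[1], q_full.shape[1], q_lo.shape[1]
        n, m = q_up.shape[0], k1 + k2 + k3
        W = np.zeros((n, m + q_defl.shape[1]))
        rows = np.split(u_tilde, [k1, k1 + k2], axis=0)   # conformal row blocks of U~
        W[:n1, :m] += q_up[:n1] @ rows[0]                 # product 1 (upper block)
        W[:,  :m] += q_full @ rows[1]                     # product 2 (dense block)
        W[n1:, :m] += q_lo[n1:] @ rows[2]                 # product 3 (lower block)
        W[:, m:] = q_defl                                 # deflated eigenvectors copied
        return W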

In parallel, this strategy is hard to implement efficiently. One needs to redefine the permutation P^* in order to avoid communication between process columns. In our parallel implementation, P^* groups the columns of Q according to their local sparsity structure, that is, P^* permutes only columns of Q belonging to the same process column. More precisely, locally, P^* puts together first the columns of Q with zero components in the lower part, then the columns of Q without zero components, then the columns of Q with zero components in the upper part, and finally the columns of Q that are already eigenvectors. The resulting matrix Q^* has the following global structure:

\[
Q^* = \begin{pmatrix} Q_{11}^* & Q_{12}^* & 0 & Q_{13}^* \\ 0 & Q_{21}^* & Q_{22}^* & Q_{23}^* \end{pmatrix}.
\]

Here Q_{11}^* contains the n_1 columns of Q_1 that have not been affected by deflation; Q_{22}^* contains the n_2 columns of Q_2 that have not been affected by deflation; the block column containing Q_{13}^* and Q_{23}^* holds k' columns of Q that correspond to deflated eigenvalues (they are already eigenvectors of T); and the block column containing Q_{12}^* and Q_{21}^* holds the n − (n_1 + n_2 + k') remaining columns of Q. The matrix U^* has the structure

\[
U^* = \begin{pmatrix} U_1^* & 0 \\ 0 & I \end{pmatrix},
\]

where U_1^* is of order n − k' and the identity block is of order k'.

Then, for the computation of the product Q^*U^*, we use two calls to the parallel BLAS routine PxGEMM involving parts of U_1^* and the matrices Q_{11}^*, Q_{12}^*, Q_{21}^*, Q_{22}^*. Unlike in the serial implementation, we cannot assume that k = k', that is, that the block column containing Q_{13}^* and Q_{23}^* contains all the columns corresponding to deflated eigenvalues. This is due to the fact that P^* acts only on columns of Q that belong to the same process column. Let k(q) be the number of deflated eigenvalues held by the process column q, 0 ≤ q ≤ Q − 1. Then k' = min_{0 ≤ q ≤ Q−1} k(q). So, even if we cannot perform matrix multiplies of minimal size as in the serial case, we still get good speed-up on many matrices.

5 The Divide and Conquer Code

The code is composed of 7 parallel routines (PxSTEDC, PxLAED0, PxLAED1, PxLAEDZ, PxLAED2, PxLAED3, PxLASRT) and it uses LAPACK's serial routines whenever possible.

PxSTEDC scales the tridiagonal matrix, calls PxLAED0 to solve the tridiagonal eigenvalue problem, scales back when finished, and sorts the eigenvalues and corresponding eigenvectors in ascending order by calling PxLASRT.

PxLAED0 is the driver of the divide and conquer algorithm. It splits the tridiagonal matrix T into TSUBPBS = (N-1)/NB + 1 submatrices using rank-one modifications, where NB is the size of the block used for the two-dimensional block cyclic distribution. It calls the serial divide and conquer code xSTEDC to solve each eigenvalue problem at the leaves of the tree. Then, each rank-one modification is merged by a call to PxLAED1:

    TSUBPBS = (N-1)/NB + 1
    while ( TSUBPBS > 1 )
        for i = 1:TSUBPBS/2
            call PxLAED1( i, TSUBPBS, D, Q, ... )
        end
        TSUBPBS = TSUBPBS/2
    end

PxLAED1 is the routine that combines the eigensystems of adjacent submatrices into an eigensystem for the corresponding larger matrix. It calls PxLAEDZ to form z as in (2.2), then calls PxLAED2 to deflate eigenvalues and to group the columns of Q according to their sparsity structure as described in Section 4.3. Then it calls PxLAED3, which distributes the work among the processes in order to compute the roots of the secular equation, solve the Löwner inverse eigenvalue problem, and compute the eigenvectors of the rank-one update D̃ + ρz̃z̃^T (see (4.3)). Each root of the secular equation is computed by the serial LAPACK routine xLAED4. Finally, the eigenvector matrix of the rank-one update is multiplied into the larger matrix that holds the collective results of all the previous eigenvector calculations, via two calls to the parallel matrix multiplication routine PxGEMM.

6 Numerical Experiments

This section concerns accuracy tests, execution times and performance results. We compare our parallel implementation of the divide and conquer algorithm with the two parallel algorithms for solving the symmetric tridiagonal eigenvalue problem available in ScaLAPACK [4]:

• B/II: bisection followed by inverse iteration (subroutines PxSTEBZ and PxSTEIN). The inverse iteration algorithm can be used with two options:
  II-1: inverse iteration without a reorthogonalization process;
  II-2: inverse iteration with a reorthogonalization process when the eigenvalues are separated by less than 10^{-3} in absolute value.

• QR: the QR algorithm (subroutine PxSTEQR2).

PxSYEVX is the name of the expert driver associated with B/II and PxSYEV is the simple driver associated with QR.² We have written a driver called PxSYEVD that computes all the eigenvalues and eigenvectors of a symmetric matrix using our parallel divide and conquer routine PxSTEDC.

² Here "driver" refers to a routine that solves the eigenproblem for a full symmetric matrix by reducing the matrix to tridiagonal form, solving the tridiagonal eigenvalue problem, and transforming the eigenvectors back to those of the original matrix.

We use two types of test matrices. The first are symmetric matrices with random entries from a uniform distribution on [−1, 1]. The second type are generated by the ScaLAPACK subroutine PxLATMS as A = U^T D U, where U is orthogonal and D = diag(s_i t_i) with t_i ≥ 0 and s_i = ±1 chosen randomly with equal probability. Matrix 1 has equally spaced entries from ε to 1, matrix 2 has geometrically spaced entries from ε to 1,

\[
d_i = \varepsilon^{(i-1)/(n-1)}, \qquad i = 1{:}n,
\]

and matrix 3 has clustered entries

\[
d_i = \varepsilon, \quad i = 1{:}n-1, \qquad d_n = 1.
\]

The type 3 matrices are designed to illustrate how B/II can fail to compute orthogonal eigenvectors.
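The three eigenvalue distributions can be generated directly from their definitions; the helper below (ours, used only for illustration) builds the entries d_i passed to PxLATMS for each matrix type.

    import numpy as np

    def test_eigenvalues(n, kind, eps=np.finfo(float).eps):
        """Eigenvalue distributions for the type 1-3 test matrices (illustrative)."""
        if kind == 1:                           # equally spaced entries from eps to 1
            return np.linspace(eps, 1.0, n)
        if kind == 2:                           # geometrically spaced: eps**((i-1)/(n-1))
            return eps ** (np.arange(n) / (n - 1))
        if kind == 3:                           # clustered at eps, with one eigenvalue at 1
            d = np.full(n, eps)
            d[-1] = 1.0
            return d
        raise ValueError("kind must be 1, 2 or 3")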

Let Q̂Λ̂Q̂^T be the computed spectral decomposition of A. To determine the accuracy of our results, we measure the scaled residual error and the scaled departure from orthogonality, defined by

\[
R = \frac{\| A\hat Q - \hat Q \hat\Lambda \|_1}{n \varepsilon \|A\|_1}
\qquad\text{and}\qquad
O = \frac{\| I - \hat Q^T \hat Q \|_1}{n \varepsilon}.
\]

When both quantities are small, the computed spectral decomposition is the exact spectral decomposition of a slight perturbation of the original problem.
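Both accuracy measures are straightforward to evaluate; the helper below (a plain NumPy check, independent of the parallel code) computes R and O as defined above from a computed decomposition.

    import numpy as np

    def accuracy(A, Q, lam):
        """Scaled residual R and scaled departure from orthogonality O for a
        computed spectral decomposition A ~ Q @ diag(lam) @ Q.T."""
        n = A.shape[0]
        eps = np.finfo(A.dtype).eps
        R = np.linalg.norm(A @ Q - Q @ np.diag(lam), 1) / (n * eps * np.linalg.norm(A, 1))
        O = np.linalg.norm(np.eye(n) - Q.T @ Q, 1) / (n * eps)
        return R, O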

The tests were run on an IBM SP2 in double precision arithmetic. On this machine, ε = 2^{-53} ≈ 1.1 × 10^{-16}.

Table 6.1 shows the greatest residual and departure from orthogonality measured for matrices of type 1, 2 and 3 solved by B/II-1, B/II-2, QR and divide and conquer. The matrices are of order n = 1500 with block size nb = 60 on a 2 × 4 processor grid.

Table 6.1: Normalized residual R, normalized eigenvector orthogonality O, and timing (Time) for matrices of size n = 1500 on an IBM SP2 (2 × 4 processor grid), for bisection/inverse iteration without (B/II-1) and with (B/II-2) reorthogonalization, the QR algorithm, and the divide and conquer algorithm (D&C).

    Matrix type                        B/II-1       B/II-2       QR           D&C
    Uniform distribution      R        3 × 10^-4    3 × 10^-4    3 × 10^-4    2 × 10^-4
    [-1, 1]                   O        0.20         0.17         0.55         0.27
                              Time     52           52           120          58
    Geometrical distribution  R        3 × 10^-4    3 × 10^-4    5 × 10^-4    4 × 10^-4
    [ε, 1]                    O        ≥ 1,000      88.03        0.23         0.20
                              Time     53           137          95           51
    Clustered at ε            R        4 × 10^-4    4 × 10^-4    4 × 10^-4    4 × 10^-4
                              O        ≥ 1,000      ≥ 1,000      0.50         0.16
                              Time     52           139          120          47

For eigenvalues with equally spaced modulus, bisection followed by inverse iteration gives good numerical results and is slightly faster than the divide and conquer algorithm. This is due to the absence of communication when computing the eigenvectors, both for B/II-1 and B/II-2. However, as illustrated by the matrices of type 2, if no reorthogonalization is used, numerical orthogonality can be lost with inverse iteration when the eigenvalues are poorly separated. It is clear that the reorthogonalization process greatly increases the execution time of the inverse iteration algorithm. For large clusters, the reorthogonalization process in PxSTEIN is limited by the size of the largest cluster that fits on one processor. Unfortunately, in this case, orthogonality is not guaranteed. This phenomenon is illustrated by the matrices of type 3. In the remaining experiments, we always use B/II with reorthogonalization.

We compared the relative performance of B/II, QR and divide and conquer. In our figures, the horizontal axis is the matrix dimension and the vertical axis is the time divided by the time for divide and conquer, so that the divide and conquer curve is constant at 1. It is clear from Figures 6.1 and 6.2, which correspond to the spectral decomposition of the tridiagonal matrix T and of the symmetric matrix A, respectively, that divide and conquer competes with bisection followed by inverse iteration when the eigenvalues of the matrices in question are well separated. For inverse iteration this situation is good, since no reorthogonalization of eigenvectors is required. For divide and conquer it is bad, since it means there is little deflation within the intermediate problems. Note that the execution times of QR are much larger. This distinction in speed between QR or B/II and divide and conquer is more noticeable in Figure 6.1 (speed-up of up to 6.5) than in Figure 6.2 (speed-up of up to 2), because Figure 6.2 includes the overhead due to the tridiagonalization and back transformation processes.

As illustrated in Figure 6.3, divide and conquer runs faster than B/II as soon as the eigenvalues are poorly separated or clustered.

Figure 6.1: Execution times of PDSTEBZ+PDSTEIN (B/II) and PDSTEQR2 (QR) relative to PDSTEDC (D&C), on an IBM SP2 using 8 nodes. Tridiagonal matrices, eigenvalues of equally spaced modulus.

Figure 6.2: Execution times of PDSYEVX (B/II), PDSYEV (QR) and PDSYTRD (tridiagonalization) relative to PDSYEVD (D&C), on an IBM SP2 using 8 nodes. Full matrices, eigenvalues of equally spaced modulus.

Figure 6.3: Execution times of PDSYEVX (B/II), PDSYEV (QR) and PDSYTRD (tridiagonalization) relative to PDSYEVD (D&C), on an IBM SP2 using 8 nodes. Full matrices, eigenvalues of geometrically spaced modulus.

Figure 6.4: Execution times of PDSTEQR2 (QR), PDSYTRD (tridiagonalization) and PDORMTR (back transformation) relative to PDSTEDC (D&C), measured on an IBM SP2 using 8 nodes. Tridiagonal matrices, eigenvalues of geometrically spaced modulus.

Figure 6.5: Speedups of PDSYEVD (D&C) over PDSYEV (QR) for several types of full matrices (evenly spaced, geometrically spaced, and random entries). Tests done on an IBM SP2 using 12 nodes.

We also compare the execution times of the tridiagonalization, QR, B/II-2 and the back transformation relative to the execution time of divide and conquer. From Figure 6.4, it appears that, when using the QR algorithm

for computing all the eigenvalues and eigenvectors of a symmetric matrix, the bottleneck is the spectral decomposition of the tridiagonal matrix. This is no longer true when using our parallel divide and conquer algorithm: the spectral decomposition of the tridiagonal matrix is now faster than the tridiagonalization and the back transformation of the eigenvectors. Effort, as in [22], should be made to improve the tridiagonalization and back transformation.

Figure 6.6: Performance of PDSTEDC on an IBM SP2, with constant memory per process (8 MBytes and 18 MBytes).

Figure 6.7: Performance of PDSTEDC on a cluster of 300 MHz Intel PII processors using a 100 Mbit switched Ethernet connection, with constant memory per process (8, 18 and 32 MBytes).

We measured the performance of PDSTEDC on an IBM SP2 (Figure 6.6) and on a cluster of 300 MHz Intel PII processors using a 100 Mbit switched Ethernet connection (Figure 6.7). In these figures, the horizontal axis is the number of processors and the vertical axis is the number of flops per second obtained when the size of the problem is kept constant on each process. The performance increases with the number of processors, which illustrates the scalability of our parallel implementation. These measurements were made using the Level 3 BLAS of ATLAS (Automatically Tuned Linear Algebra Software) [32], which runs at a peak of 440 Mflop/s on the SP2 and 190 Mflop/s on the PII. On the SP2 our code runs at 50% of the peak performance of matrix multiplication, and at 40% on the cluster of PII's. Note that these percentages take into account the time spent at the end of the computation to sort the eigenvalues and corresponding eigenvectors into increasing order.

7 Conclusions

For serial and shared memory machines, divide and conquer is one of the fastest available algorithms for finding all the eigenvalues and eigenvectors of a large dense symmetric matrix. By contrast, implementations of this algorithm on distributed memory machines have in the past posed difficulties.

In this paper, we showed that divide and conquer can be efficiently parallelized on distributed memory machines. By using the Löwner theorem approach, good numerical eigendecompositions are obtained in all situations. From the point of view of execution time, our results seem to be better in most cases when compared with the parallel execution times of QR and of bisection followed by inverse iteration available in the ScaLAPACK library.

Performance results on the IBM SP2 and on a cluster of PII PCs demonstrate the scalability and portability of our algorithm. Good efficiency is mainly obtained by exploiting the data parallelism inherent in this algorithm rather than its task parallelism. For this, we concentrated our efforts on a good implementation of the back transformation process in order to reach maximum speed-up for the matrix multiplications. Unlike in previous implementations, the number of processes is not required to be a power of two. This implementation will be incorporated in the ScaLAPACK library.

Recent work [12] has been done on an algorithm based on inverse iteration which may provide a faster and more accurate algorithm and should also yield an embarrassingly parallel algorithm. Unfortunately, no parallel implementation is available at this time, so we could not compare this new method with divide and conquer.

We showed that, in contrast to the ScaLAPACK QR algorithm implementation, the spectral decomposition of the tridiagonal matrix is no longer the bottleneck. Efforts should be made to improve the tridiagonalization and the back transformation of the eigenvector matrix of the tridiagonal form to that of the original matrix.

The main limitation of the proposed parallel algorithm is the amount of storage needed. Compared with the ScaLAPACK QR implementation, 2n^2 extra storage locations are required to perform the back transformation in the last step of the divide and conquer algorithm. This is the price we pay for using Level 3 BLAS operations. It is worth noting that in most cases not all of this storage is used, because of deflation. Unfortunately, ideas such as those developed in [31] for the sequential divide and conquer algorithm seem hard to implement efficiently in parallel, as they require a lot of communication. As in many algorithms, there is a trade-off between good efficiency and workspace [3], [13]. Such a trade-off also appears in parallel implementations of inverse iteration when reorthogonalization of the eigenvectors is performed.

In future work, the authors plan to use the ideas developed in this paper to develop a parallel implementation of the divide and conquer algorithm for the singular value decomposition.

References

[1] E. Anderson, Z. Bai, C. H. Bischof, J. W. Demmel, J. J. Dongarra, J. J. Du Croz, A. Greenbaum, S. J. Hammarling, A. McKenney, S. Ostrouchov, and D. C. Sorensen. LAPACK Users' Guide, Release 2.0. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 1995.

[2] Zhaojun Bai, James W. Demmel, Jack J. Dongarra, Antoine Petitet, Howard Robinson, and K. Stanley. The spectral decomposition of nonsymmetric matrices on distributed memory parallel computers. SIAM J. Sci. Comput., 18(5):1446-1461, 1997.

[3] Christian Bischof, Steven Huss-Lederman, Xiaobai Sun, Anna Tsao, and Thomas Turnbull. Parallel performance of a symmetric eigensolver based on the invariant subspace decomposition approach. In Proceedings of the Scalable High Performance Computing Conference '94, Knoxville, Tennessee, pages 32-39, May 1994. Also PRISM Working Note 15.

[4] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.

[5] James R. Bunch, Christopher P. Nielsen, and Danny C. Sorensen. Rank-one modification of the symmetric eigenproblem. Numer. Math., 31:31-48, 1978.

[6] Soumen Chakrabarti, James Demmel, and Katherine Yelick. Modeling the benefits of mixed data and task parallelism. Technical Report CS-95-289, Department of Computer Science, University of Tennessee, Knoxville, TN, USA, May 1995. LAPACK Working Note 97.

[7] Jaeyoung Choi, Jack J. Dongarra, Roldan Pozo, and David W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. Technical Report CS-92-181, Department of Computer Science, University of Tennessee, Knoxville, TN, USA, November 1992. LAPACK Working Note 55.

[8] J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numer. Math., 36:177-195, 1981.

[9] Kirsty F. F. Darbyshire. A parallel algorithm for the symmetric tridiagonal eigenproblem. Master's thesis, University of Manchester, 1994.

[10] James W. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1997.

[11] James W. Demmel and K. Stanley. The performance of finding eigenvalues and eigenvectors of dense symmetric matrices on distributed memory computers. Technical Report CS-94-254, Department of Computer Science, University of Tennessee, Knoxville, TN, USA, September 1994. LAPACK Working Note 86.

[12] Inderjit Dhillon. A New O(n^2) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector Problem. PhD thesis, University of California, Berkeley, USA, 1997.

[13] Stéphane Domas and Françoise Tisseur. Parallel implementation of a symmetric eigensolver based on the Yau and Lu method. In Vector and Parallel Processing - VECPAR'96, Lecture Notes in Computer Science 1215, pages 140-153. Springer-Verlag, Berlin, 1997.

[14] J. Dongarra and D. Sorensen. A fully parallel algorithm for the symmetric eigenvalue problem. SIAM J. Sci. Stat. Comput., 8:139-154, 1987.

[15] Maria Paula Goncalves Fachin. The Divide-and-Conquer Method for the Solution of the Symmetric Tridiagonal Eigenproblem and Transputer Implementations. PhD thesis, University of Kent at Canterbury, 1994.

[16] K. Gates and P. Arbenz. Parallel divide and conquer algorithms for the symmetric tridiagonal eigenproblem. Technical report, Institute for Scientific Computing, ETH Zurich, 1994.

[17] Gene H. Golub. Some modified matrix eigenvalue problems. SIAM Review, 15(2):318-334, 1973.

[18] Ming Gu. Studies in Numerical Linear Algebra. PhD thesis, Yale University, 1993.

[19] Ming Gu and Stanley C. Eisenstat. A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl., 15(4):1266-1276, 1994.

[20] Ming Gu and Stanley C. Eisenstat. A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem. SIAM J. Matrix Anal. Appl., 16(1):172-191, 1995.

[21] B. Hendrickson and D. Womble. The torus-wrap mapping for dense matrix calculations on massively parallel computers. SIAM J. Sci. Comput., 15:1201-1226, 1994.

[22] Bruce Hendrickson, Elizabeth Jessup, and Christopher Smith. A parallel eigensolver for dense symmetric matrices. Submitted to SIAM J. Sci. Comput., 1996.

[23] I. C. F. Ipsen and E. R. Jessup. Solving the symmetric tridiagonal eigenvalue problem on the hypercube. SIAM J. Sci. Stat. Comput., 11(2):203-229, 1990.

[24] E. R. Jessup and D. C. Sorensen. A parallel algorithm for computing the singular value decomposition of a matrix. SIAM J. Matrix Anal. Appl., 15(2):530-548, 1994.

[25] Ren-Cang Li. Solving the secular equation stably and efficiently. Technical report, Department of Mathematics, University of California, Berkeley, CA, USA, April 1993. LAPACK Working Note 89.

[26] K. Löwner. Über monotone Matrixfunktionen. Math. Z., 38:177-216, 1934.

[27] A. Melman. A numerical comparison of methods for solving secular equations. J. Comp. Appl. Math., 86(1):237-249, 1997.

[28] A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. PhD thesis, University of Tennessee, Knoxville, TN, 1996.

[29] J. Rutter. A serial implementation of Cuppen's divide and conquer algorithm for the symmetric eigenvalue problem. Technical Report CS-94-225, Department of Computer Science, University of Tennessee, Knoxville, TN, USA, March 1994. LAPACK Working Note 69.

[30] D. C. Sorensen and P. T. P. Tang. On the orthogonality of eigenvectors computed by divide-and-conquer techniques. SIAM J. Numer. Anal., 28:1752-1775, 1991.

[31] Françoise Tisseur. Workspace reduction for the divide and conquer algorithm. Manuscript, November 1997.

[32] R. Clint Whaley and Jack Dongarra. Automatically tuned linear algebra software. Technical Report CS-97-366, Department of Computer Science, University of Tennessee, Knoxville, TN, USA, 1997. LAPACK Working Note 131.