A BLOCK ORTHOGONALIZATION PROCEDURE WITH CONSTANT SYNCHRONIZATION REQUIREMENTS

ANDREAS STATHOPOULOS AND KESHENG WU

Department of Computer Science, College of William and Mary, Williamsburg, Virginia 23187-8795 ([email protected]).
NERSC, Lawrence Berkeley National Laboratory, Berkeley, California 94720 ([email protected]).

Abstract. We propose an alternative orthonormalization method that computes the orthonormal basis from the right singular vectors of a matrix. Its advantages are: (a) all operations are matrix-matrix multiplications and thus cache-efficient, (b) only one synchronization point is required in parallel implementations, (c) it could be more stable than Gram-Schmidt. In addition, we consider the problem of incremental orthonormalization, where a block of vectors is orthonormalized against a previously orthonormal set of vectors and among itself. We solve this problem by alternating iteratively between a phase of Gram-Schmidt and a phase of the new method. We provide error analysis and use it to derive bounds on how accurately the two successive orthonormalization phases should be performed to minimize the total work performed. Our experiments confirm the favorable numerical behavior of the new method and its effectiveness on modern parallel computers.

Key words. Gram-Schmidt, orthogonalization, Householder, QR factorization, singular value decomposition, Poincaré

AMS Subject Classification. 65F15

1. Introduction. Computing an orthonormal basis from a given set of vectors is a basic computation, common in most scientific applications. Often, it is also one of the most computationally demanding procedures, because the vectors are of large dimension, and because the computation scales as the square of the number of vectors involved. Further, among several orthonormalization techniques, the ones that ensure high accuracy are the more expensive ones.

In many applications, orthonormalization occurs in an incremental fashion, where a new set of vectors (we call this internal set) is orthogonalized against a previously orthonormal set of vectors (we call this external), and then among themselves. This computation is typical in block Krylov methods, where the Krylov basis is expanded by a block of vectors [12, 11]. It is also typical when certain external orthogonalization constraints have to be applied to the vectors of an iterative method. Locking of converged eigenvectors in eigenvalue iterative methods is such an example [19, 22].

This problem differs from the classical QR factorization in that the external set of vectors should not be modified. Therefore, a two-phase process is required: first orthogonalizing the internal vectors against the external, and second the internal among themselves. Usually, the number of the internal vectors is much smaller than the external ones, and significantly smaller than their dimension. Another important difference is that the accuracy of the R matrix of the QR factorization is not of primary interest, but rather the orthonormality of the produced vectors Q. A variety of orthogonalization techniques exist for both phases. For the external phase, Gram-Schmidt (GS) and its modified version (MGS) are the most competitive choices. For the internal phase, QR factorization using Householder transformations is the most stable, albeit more expensive, method [11]. When the number of vectors is significantly smaller than their dimension, MGS or GS with reorthogonalization are usually preferred.

Computationally, MGS, GS and Householder transformations are based on level 1 or level 2 BLAS operations [15, 9, 8]. These basic kernel computations, dot products, vector updates and sometimes matrix-vector operations, cannot fully utilize the processor cache, with significant performance shortcomings on most of today's computers. In addition, these methods have high synchronization requirements when implemented on parallel computers. The number of synchronization points of the GS is linear in the number of vectors, while for the other methods it is quadratic [14]. When the parallel platform involves a high latency network but fast processors, as in the case of the increasingly popular clusters of workstations [21], the number of synchronization points can have an adverse effect on the scalability of the method.

Most of the previous efforts to address these problems considered blocks of vectors, and used hybrids of the more scalable GS for orthogonalization between blocks, and the more accurate MGS for orthogonalization within blocks [14, 2]. By adjusting the block size appropriately, these techniques can produce accurate orthonormal sets while reducing the synchronization requirements of MGS. However, the number of synchronization points is still linear in the number of vectors, and BLAS level 2 kernels are still dominant despite blocking. Finally, most such efforts have focused on a full QR factorization of a set of vectors, rather than the two-phase problem.

In this paper, we introduce a method based on the singular value decomposition (SVD), that uses the right singular vectors to produce an orthonormal basis for a given subspace. The idea itself is not new, dating back at least to Poincaré, and it is sometimes encountered in the chemistry and wavelet literature [5, 20, 16, 17, 6]. However, it has not received any attention as a computationally viable orthogonalization method, and to our knowledge there is no analysis of its numerical properties. Beyond theoretical considerations, the method, which we call SVQB, has some very attractive characteristics. It uses primarily level 3 BLAS kernels, and it has a constant number of synchronization points. As we show in this paper, it is not as numerically accurate as MGS, but it is often better than GS. More interestingly, we show that more stable alternatives, such as MGS or Householder, are overkill for our two-phase problem. Coupling the SVQB method for the internal phase with a block GS with reorthogonalization for the external phase results in a method also with constant synchronization requirements.

The paper is organized as follows. We first describe the basic idea of the SVQB method, and we couple it with a block GS for the two-phase problem. The numerical stability of the method is analyzed in the following section, and some numerical experiments and comparisons with other methods verify its effectiveness. We then analyze the numerical interaction between the methods used in each of the two phases, and based on this theory we tune the two phases so that they do not perform unnecessary reorthogonalizations. Following, we present timings from a series of experiments on the Cray T3E and on the IBM SP-2. These verify that both the block computations and the small number of synchronizations help the new method achieve accurate orthonormality, faster than other competitive methods.

2. Orthogonalization methods. Let V ∈ R^{n×k} be a set of orthonormal vectors, and W ∈ R^{n×m} be a set of vectors, where k + m ≤ n. In practice, we usually expect m < k and m ≪ n. The problem is to obtain an orthonormal set of vectors Q, such that span(Q) = span(W) and Q ⊥ V. This problem can be viewed as consisting of two phases: obtaining a Q which is orthogonal to V (external phase), and orthonormalizing the Q vectors among themselves (internal phase). The internal phase problem is most commonly solved through various methods for QR factorization [11, 23].

Gram-Schmidt (GS) is the most well known and also the least computationally expensive method for QR factorization. In addition, GS is based primarily on level 2 BLAS kernels, and it is possible to implement it in parallel with a modest number of m + 1 synchronization points. Although computationally attractive, GS does not always produce numerically orthogonal vectors [2]. However, when used with reorthogonalization, a second GS iteration is typically enough for producing orthogonality to machine precision [7]. Since a second iteration is not always necessary, GS with reorthogonalization is often preferred over other methods.

Householder reflections, on the other hand, yield a QR factorization that is numerically accurate. By construction, the resulting matrix Q is orthogonal to machine precision (ε), and the computed upper triangular matrix R differs from the exact one by less than ε‖A‖ [11]. Despite its excellent numerical characteristics, this technique involves twice the arithmetic of the GS. For a slight increase in arithmetic, a block version is also possible that facilitates the use of level 3 BLAS kernels. However, when m ≪ n, this blocking is harder to obtain and it is less effective. Moreover, the number of synchronization points it requires grows quadratically with m.

Modified Gram-Schmidt (MGS) improves the numerical stability of the GS by orthogonalizing individual pairs of vectors rather than a vector against a block. For MGS, the error in the orthogonality of Q can be bounded by ε κ(W), where κ(W) is the condition number of W [1, 3]. The operation count for MGS is the same as for GS, but working on individual vectors allows only for level 1 BLAS kernels, which impairs significantly the actual performance on cache-based computers. Finally, the number of synchronization points grows quadratically with m.

In the context of the two-phase problem, producing an internal set of vectors Q with orthogonality close to machine precision is unnecessary, because of the interdependence of the phases. For example, if the internal vectors W are orthogonal to machine precision to all external vectors V, the internal orthogonalization could spoil significantly this external orthogonality. Therefore, a second iteration through the algorithm would still be necessary. If W and V are far from orthogonal, obtaining a fully orthonormal Q is also unnecessary because the external orthogonalization will spoil the orthogonality of Q. Therefore, this two-phase problem obviates the use of expensive but stable methods such as Householder.

3. The SVQB method. As with the two-phase problem, it is often the case that an orthonormal basis of span(W) is needed, rather than a QR factorization of W. There are various ways of obtaining such a basis, but an especially interesting one is the one derived from the right singular vectors of W. Assume that the vectors in W are normalized. The singular values of W are the square roots of the eigenvalues of S = W^T W, and the right singular vectors are the corresponding eigenvectors of S. Let Su = uΛ be the eigendecomposition of S, and define Q = W u Λ^{−1/2}. Obviously, span(Q) = span(W) and Q^T Q = I. If the W vectors are not normalized, the diagonal of S, D = diag(S), contains the squares of their norms (S_ii = w_i^T w_i). Therefore, we can perform the normalization implicitly by scaling the columns and rows of S, and work with the implicitly normalized W D^{−1/2}. This implicit normalization is not only inexpensive, but also as numerically stable as explicit normalization. The resulting factorization is not a QR but rather a `QB' factorization, where Q is orthonormal and B is a full matrix. In exact arithmetic, the algorithm for this singular vector QB factorization (which we call SVQB) follows:

Algorithm 3.1. Q = SVQB(W)
   1. S' = W^T W
   2. Scale S = D^{−1/2} S' D^{−1/2}, with D = diag(S')
   3. Solve Su = uΛ, for all eigenpairs
   4. Compute Q = W D^{−1/2} u Λ^{−1/2}

When some of the vectors in W are linearly dependent, one or more of the eigenvalues are zero, and the scaling in the computation of Q cannot be carried out. In finite precision, the same effect is caused by almost linearly dependent vectors and eigenvalues close to zero. To prevent normalization overflows, we set a minimum threshold for the eigenvalues. If ε is the machine precision, we insert the following two steps:
   3.1  τ = ε max_i(Λ_ii)
   3.2  If Λ_ii < τ, set Λ_ii = τ, for all i.
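For concreteness, the following is a minimal NumPy sketch of Algorithm 3.1, including the eigenvalue threshold of steps 3.1-3.2. The function name svqb and the variable tau are our own; a production implementation would form S' with a blocked, distributed matrix-matrix product rather than a dense serial one.

    import numpy as np

    def svqb(W):
        eps = np.finfo(W.dtype).eps
        S = W.T @ W                          # step 1 (in a parallel code, the single synchronization point)
        d = np.sqrt(np.diag(S))
        S = S / np.outer(d, d)               # step 2: implicit normalization D^{-1/2} S' D^{-1/2}
        lam, u = np.linalg.eigh(S)           # step 3: all eigenpairs of the small m x m matrix
        tau = eps * lam.max()                # step 3.1: threshold for (nearly) zero eigenvalues
        lam = np.maximum(lam, tau)           # step 3.2: guard against normalization overflow
        return (W / d) @ (u / np.sqrt(lam))  # step 4: Q = W D^{-1/2} u Lambda^{-1/2}

As discussed below, for severely ill-conditioned W this routine may have to be applied more than once before Q is orthonormal to machine precision.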

Other strategies for dealing with linear dependencies are also possible. For example, we could consider only those eigenvectors with eigenvalues greater than some "safe" threshold. The resulting basis is then smaller, but it is guaranteed to be orthonormal and to numerically span a subspace of the original vectors. Finally, because of finite precision arithmetic, the algorithm may have to be applied iteratively (Q^(i+1) = SVQB(Q^(i))), until an orthonormal set Q is obtained.

Both the solution of the eigenvalue problem and the implicit normalization involve only m × m matrices (S and u), and thus can be computed inexpensively. On the other hand, the matrix-matrix multiplication for computing S and the multiplication of W with u each contribute 2nm² floating point operations, which makes the algorithm twice as expensive as GS. However, these operations are both level 3 BLAS kernels and can be performed efficiently on cache-based computers. Alternatively, the matrix multiplication for computing the symmetric S can be performed with half the operations, but level 2 BLAS will have to be used. Moreover, if implemented in parallel, the SVQB method requires only one synchronization point, when computing the matrix S.

3.1. The Choleski QR method. Similarly to the SVQB method, we can derive a block QR factorization based on the Choleski factorization. Note that if S = W^T W = R^T R, where R is the Choleski factor, then Q = W R^{−1} defines the QR factorization of W [11]. Although this method is rarely or never used computationally, it has some attractive characteristics: it is a QR factorization, it is based on a level 3 BLAS kernel and a triangular system solution, it involves only 50% more arithmetic than GS, and it also requires one synchronization point in parallel implementations. Researchers have noticed that it is not as stable as MGS, but frequently it can be more stable than GS [2, 10]. One of the problematic issues with this method (denoted as CholQR) is that the more ill-conditioned S is, the less stable the Choleski factorization becomes. Regularizing it in some efficient and effective way is not as straightforward as in the case of the SVQB, where the smallest singular pairs can be simply left out of the computation. Finally, the cache performance of the CholQR is usually inferior to that of the SVQB, because of the triangular solve.
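As a point of comparison, here is a minimal NumPy/SciPy sketch of the CholQR variant. The helper name cholqr is ours, and the shift of the Gram matrix follows the remark made later (Section 4.1) about making it numerically positive definite before the Choleski factorization; neither choice is prescribed by the paper.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def cholqr(W):
        S = W.T @ W                                   # one synchronization point, level 3 BLAS
        lam_min = np.linalg.eigvalsh(S)[0]            # smallest eigenvalue of the Gram matrix
        if lam_min <= 0.0:                            # shift so the factorization does not break down
            S = S + (np.finfo(S.dtype).eps * np.linalg.norm(S) - lam_min) * np.eye(S.shape[0])
        R = cholesky(S, lower=False)                  # S = R^T R
        Q = solve_triangular(R, W.T, trans='T', lower=False).T   # Q = W R^{-1}
        return Q, R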

3.2. The two-phase GS-SVQB. In the two-phase case, the external vectors V should not be modified, and thus external orthogonalization is performed by either GS or MGS. Better stability is certainly an argument in favor of MGS. However, in many problems k, the number of vectors in V, is large, and performing this operation as efficiently as possible is crucial. In addition, accuracy close to machine precision is less critical because the orthogonality achieved in one phase may not be preserved in the other. For this reason, the block GS with BLAS 3 kernels and some form of reorthogonalization mechanism is usually preferred. If we denote as Q = GS(V, W) the block GS process, (I − V V^T) W followed by a normalization, the two-phase algorithm can be described as:

Algorithm 3.2. Q = GS-SVQB(V, W)
   1. W' = GS(V, W)
   2. Q = SVQB(W')

In practice, the above algorithm does not always compute an accurately orthonormal set Q, and thus it has to be applied in an iterative fashion. To develop an efficient iterative version of GS-SVQB, we must first examine the numerical characteristics of the SVQB algorithm. This is the goal of the following section.
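A minimal sketch of Algorithm 3.2, reusing the svqb sketch from Section 3, could look as follows; the block GS step (I − V V^T)W followed by column normalization is written here with dense NumPy operations, and the helper names are ours.

    import numpy as np

    def block_gs(V, W):
        if V.shape[1] > 0:
            W = W - V @ (V.T @ W)                 # (I - V V^T) W, one pair of GEMMs
        return W / np.linalg.norm(W, axis=0)      # normalize each column

    def gs_svqb(V, W):
        return svqb(block_gs(V, W))               # step 1: GS against V; step 2: SVQB among W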

4. Stability analysis of the SVQB. For many applications, such as block Krylov iterative methods, the orthonormality of the resulting vectors Q is more important than the accuracy of the upper triangular R of the QR factorization. For example, Krylov methods will still make progress, albeit a slower one, if provided with an orthonormal set Q. Moreover, the actual span(W) of an almost linearly dependent W is not well defined numerically. Therefore, in this paper we measure stability as the departure of the resulting Q from orthonormality rather than the backward error. Because of its block, non-sequential nature, we expect the SVQB procedure to be less stable than the MGS. But as we show below, it can often be more stable than GS.

Theorem 4.1. Let W be a set of m linearly independent vectors of R^n. Let Q̂ be the floating point representation of the matrix computed by applying the SVQB procedure on W. If κ(W) is the condition number of W, then

    ‖I − Q̂^T Q̂‖ ≤ ct ε min(κ(W)², 1/ε),

where ‖·‖ = ‖·‖₂, ε is the machine round off, and ct is a constant depending on n and m.

Proof. Let S = W^T W, and denote by S̃ the floating point representation of S. Then

(4.1)    S̃ = S + ΔS,   with ‖ΔS‖ ≤ c₁ ε ‖S‖.

We can further write this as ‖ΔS‖ ≤ c₁ ε ‖W‖². The effect of performing the scaling on the matrix S corresponds to each vector in W having norm 1 implicitly, and in that case ‖ΔS‖ ≤ c₁ ε m.
From standard backward error analysis, the solution of the small m × m symmetric eigenvalue problem S̃u = uΛ can be considered an exact eigendecomposition of a nearby matrix Ŝ = û Λ̂ û^T. Using relation (4.1) we can express the error in Ŝ as:

(4.2)    Ŝ = S̃ + ΔS̃,   with ‖ΔS̃‖ ≤ c₂ ε ‖S̃‖ + O(ε²).

From the above, and by letting c₃ = c₁ + c₂, the matrices Ŝ and S are related by:

(4.3)    ‖Ŝ − S‖ = ‖ΔS̃ + ΔS‖ ≤ c₃ ε ‖S‖.

Let Q̂ = Q + ΔQ = W û Λ̂^{−1/2} + ΔQ be the floating point representation of the matrix returned by the SVQB procedure. Then, ‖ΔQ‖ ≤ c₄ ε ‖W‖ ‖û‖ ‖Λ̂^{−1/2}‖. Note that ‖Λ̂^{−1/2}‖ = 1/√(λ̂_min), where λ̂_min = min_i Λ̂_ii, and since the û are orthonormal eigenvectors, we have:

(4.4)    ‖ΔQ‖ ≤ c₄ ε ‖W‖ / √(λ̂_min).

It is well known that for symmetric eigenproblems the error in the eigenvalues is bounded by the error in the matrix. Thus, the minimum eigenvalue of S could have a large error:

(4.5)    |λ̂_min − λ_min(S)| ≤ ‖Ŝ − S‖ ≤ c₃ ε ‖S‖ = c₃ ε λ_max(S),

while the largest eigenvalue is always relatively accurate. Because the eigenvalues of S are the squares of the singular values of W, the eigenvalue bounds reflect also that singular values of W smaller than √ε cannot be represented as eigenvalues of S. For this reason we have to distinguish between the following two cases for the condition number κ(W) = √κ(S) = √(λ_max(S)/λ_min(S)):

a. If κ(W) ≥ 1/√ε, then λ_min(S) is not computed accurately in floating point, and λ̂_min could be O(ε λ_max(S)). In fact, our algorithm specifically sets any eigenvalues that are smaller than ε λ_max(S) equal to this threshold. In this case the ratio is bounded by:

(4.6)    ‖W‖ / √(λ̂_min) ≤ √( λ_max(S) / (ε λ_max(S)) ) = O(1/√ε).

b. If κ(W) ≤ 1/√ε, then from bound (4.5), and since for this case it holds 1 + c₃ ε κ(S) = O(1), the ratio is bounded by:

(4.7)    ‖W‖ / √(λ̂_min) ≤ √( λ_max(S) / (λ_min(S) + c₃ ε λ_max(S)) ) = √( κ(S) / (1 + c₃ ε κ(S)) ) = O(√(κ(S))) = O(κ(W)).

Combining the two cases, the bound for the ratio in (4.4) can be written as:

(4.8)    ‖ΔQ‖ ≤ c₄ ε min(κ(W), 1/√ε).

We emphasize that the above is not the backward error for the exact result of the SVQB, but for the matrix product of the computed eigenpairs of S. The exact backward error is not relevant because the orthogonality of the resulting Q is of interest.
Let us consider the departure of the computed Q̂ = Q + ΔQ from orthonormality:

         ‖I − Q̂^T Q̂‖ = ‖I − Λ̂^{−1/2} û^T W^T W û Λ̂^{−1/2} + ΔQ^T Q + Q^T ΔQ‖ + O(ΔQ²)
(4.9)               ≤ ‖I − Λ̂^{−1/2} û^T S û Λ̂^{−1/2}‖ + 2 ‖ΔQ‖ ‖Q‖.

From relations (4.1)-(4.3) and (4.9), the orthonormality of û, and the fact that ‖Q‖ = O(1), we have:

         ‖I − Q̂^T Q̂‖ ≤ ‖I − Λ̂^{−1/2} û^T Ŝ û Λ̂^{−1/2} + Λ̂^{−1/2} û^T (ΔS + ΔS̃) û Λ̂^{−1/2}‖ + 2 ‖ΔQ‖ ‖Q‖
                     ≤ ‖Λ̂^{−1}‖ ‖ΔS + ΔS̃‖ + 2 ‖ΔQ‖ ‖Q‖
(4.10)               ≤ c₃ ε ‖S‖ / λ̂_min + 2 ‖ΔQ‖.

From bounds (4.6) and (4.7), and because ‖S‖ ≤ ‖W‖², we have:

         ‖I − Q̂^T Q̂‖ ≤ c₃ ε min(κ(W)², 1/ε) + 2 c₄ ε min(κ(W), 1/√ε)
(4.11)               ≤ ct ε min(κ(W)², 1/ε).
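The bound of theorem 4.1 can be checked numerically. The following small experiment (our own construction, not from the paper) builds a W whose normalized columns are nearly parallel, so that κ(W) ≈ 10^5 < 1/√ε, applies the svqb sketch of Section 3 once, and compares ‖I − Q^T Q‖ with ε κ(W)²:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 2000, 20
    U, _ = np.linalg.qr(rng.standard_normal((n, m + 1)))
    t = 4.5e-5
    W = (U[:, :1] + t * U[:, 1:]) / np.sqrt(1.0 + t * t)  # unit-norm, nearly parallel columns
    kappa = np.linalg.cond(W)                              # roughly sqrt(m)/t, about 1e5
    Q = svqb(W)                                            # svqb() as sketched in Section 3
    err = np.linalg.norm(np.eye(m) - Q.T @ Q, 2)
    eps = np.finfo(float).eps
    print(err, eps * kappa**2)                             # err should not exceed the order of the bound

In practice the measured departure from orthonormality is usually much smaller than the bound, which is not tight.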

Next, we provide bounds on |κ(Q) − 1|, thus showing that when applying the SVQB iteratively, Q converges fast to an orthonormal basis. We first state the following lemma (see [13]).
Lemma 4.2.
   1. If ‖I − Q^T Q‖ ≤ a, then ‖Q^T Q‖ ≤ ‖Q‖² ≤ 1 + a.
   2. If ‖I − Q^T Q‖ ≤ a < 1, then ‖I − (Q^T Q)^{−1}‖ ≤ a/(1 − a).

Proposition 4.3. Assume that the set of vectors Q, resulting after one application of the SVQB algorithm on W, satisfies ‖I − Q^T Q‖ ≤ a < 1. Then the resulting Q matrix has condition number:

    κ(Q) ≤ √( (1 + a) / (1 − a) ).

Proof. By definition, κ(Q) = √( ‖Q^T Q‖ ‖(Q^T Q)^{−1}‖ ). The proof follows from lemma 4.2, since ‖Q^T Q‖ ≤ ‖Q‖² ≤ 1 + a, and ‖(Q^T Q)^{−1}‖ = ‖I − I + (Q^T Q)^{−1}‖ ≤ 1 + ‖I − (Q^T Q)^{−1}‖ ≤ 1/(1 − a).

To avoid extensive use of complexity analysis symbols, we introduce the following intuitive notations: f ≫ O(g) ⇔ f = ω(g), and f ≥ O(g) ⇔ f = Ω(g).

Theorem 4.4. Let W be a set of m linearly independent vectors of R^n, with condition number κ(W). Let Q be the floating point representation of the matrix computed by applying the SVQB procedure on W. If ε is the machine round off, then:

    if κ(W) < O(1/√ε),   κ(Q) ≤ 1 + O(ε κ(W)²).

Proof. From theorem 4.1, a = ‖I − Q̂^T Q̂‖ ≤ ct ε κ(W)² < 1, and by using proposition 4.3, we have: κ(Q) ≤ √((1 + a)/(1 − a)) = √(1 + 2a/(1 − a)) ≤ 1 + a/(1 − a). This proves the bound, because, in this case, a is bounded away from 1.

The case where κ(W) ≥ O(1/√ε) is much harder to bound, because the eigenvalues of S = W^T W are much smaller than ε and numerical error dominates. However, a bound can be given with some relatively weak assumptions. We first need the following lemma.

Lemma 4.5. Let A = [ G, √ε C ; √ε C^T, B ], where ‖C‖ = O(1), G is a symmetric matrix with eigenvalues λ_g = O(1), and B is a symmetric matrix with eigenvalues 0 < λ_b ≪ λ_g. Then, for every eigenvalue λ_b of B there is an eigenvalue λ of A such that |λ_b − λ| = O(ε).

Proof. Let (u_b, λ_b) be an eigenpair of B, and define the vector u = [u_g^T u_b^T]^T, where u_g is chosen to satisfy the first block row of the eigenvalue equations for A: u_g = √ε (λ_b I − G)^{−1} C u_b. Applying A on u we have:

    A u = λ_b [ u_g ; u_b ] + [ 0 ; ε C^T (λ_b I − G)^{−1} C u_b ].

From our assumption, λ_b ≪ λ_g = O(1), which implies ‖(λ_b I − G)^{−1}‖ = O(1), and thus ‖ε C^T (λ_b I − G)^{−1} C u_b‖ = O(ε). This means that the matrix A is only O(ε) away from a matrix that has (u, λ_b) as an eigenpair.

Theorem 4.6. Consider the same assumptions of theorem 4.4. If we assume, in addition, that (a) singular values of W that are smaller than O(√ε) are well separated from the rest, which are O(1), and (b) the smallest singular value of Q is not appreciably smaller than the one of W, then:

    if κ(W) ≥ O(1/√ε),   κ(Q) ≤ O(√ε κ(W)).

Proof. Throughout this proof we assume the notation of theorem 4.1. We introduce Q_u = W û, the unscaled version of Q. The error in the floating point representation of Q_u is ΔQ_u with ‖ΔQ_u‖ ≤ ε ‖W‖ = O(ε √(λ_max)), where λ_max ≡ λ_max(S). We now rewrite relation (4.9) using this unscaled Q_u:

(4.12)    Q̂^T Q̂ = Λ̂^{−1/2} ( û^T S û + ΔQ_u^T Q_u + Q_u^T ΔQ_u ) Λ̂^{−1/2}.

Let δQ = ΔQ_u^T Q_u + Q_u^T ΔQ_u, and observe that ‖δQ‖ ≤ 2 ‖Q_u‖ ‖ΔQ_u‖ = O(ε λ_max). Because the û are orthonormal vectors, Q_u^T Q_u = û^T S û has the same eigenvalues as S. We will show that scaling from left and right with Λ̂^{−1/2} reduces κ(Q_u^T Q_u + δQ) by a factor of O(ε). Thus, we focus on the exact eigenvalues of the exact product Q̂^T Q̂ of the computed vectors Q̂.

Since κ(W) ≥ O(1/√ε), there is a set of eigenvalues of S that are smaller than ε. Standard eigensolvers view these eigenvalues as numerical multiplicities, and thus they produce a basis for their subspace, rather than their exact eigenvectors. On the other hand, the separation assumption (a) states that the approximations for the rest of the eigenvectors have no mixing with these almost degenerate ones. If we order the eigenvalues of S in decreasing size, we can partition the corresponding eigenvectors into two groups, a "good" group u_g and a "bad" group u_b:

    u_g = û_i, for i = 1,...,l;   u_b = û_i, for i = l+1,...,m;   and λ_l > O(ε λ_max).

The fact that there is no mixing between the good and the bad groups translates to u_g^T u_b = O(ε). Because of the separation and the fact that ‖Ŝ − S‖ = O(ε λ_max), the eigenvectors of Ŝ produced by eigensolvers are accurate approximations of the eigenspaces of S within each of the two groups. Therefore, the eigenvalues of û_g^T S û_g are λ_i + O(ε λ_max), i = 1,...,l, and the eigenvalues of û_b^T S û_b are λ_i + O(ε λ_max), i = l+1,...,m. Thus, we can rewrite the unscaled part of equation (4.12) as:

(4.13)    û^T S û + δQ = [ G, 0 ; 0, B ] + O(ε λ_max),

where G = û_g^T S û_g, B = û_b^T S û_b, and the perturbation O(ε λ_max) applies to all the elements.

We now focus on Q̂^T Q̂ by scaling the matrix in equation (4.13) from left and right by Λ̂^{−1/2}. In our algorithm, any λ̂_i < ε λ_max is set equal to ε λ_max, which gives Λ̂ the following form:

    Λ̂ = diag( λ̂_1, ..., λ̂_l, ε λ_max, ..., ε λ_max ).

Using these scaling factors we obtain:

(4.14)    Q̂^T Q̂ = [ G', C ; C^T, (B + O(ε λ_max)) / (ε λ_max) ],

where G' is G + O(ε λ_max) scaled with the good eigenvalues of Λ̂, and the i-th row of C contains the scaled perturbations O(ε λ_max):

    C_{i,j} = O(ε λ_max) / √(ε λ_max λ̂_i) = O( √(ε λ_max / λ̂_i) ),   j = l+1, ..., m.

From the theorem hypothesis, λ_i = O(1), i = 1,...,l, and thus the eigenvalues of G + O(ε λ_max) are O(1). As a result, the eigenvalues of G' are also O(1). On the other hand, the minimum eigenvalue λ_b of B + O(ε λ_max) could be much larger than λ_min. We consider the worst case scenario, where λ_b = λ_min + O(ε λ_max). The order notation covers also the case of no perturbation. From this, and from assumption (b), the minimum eigenvalue of (B + O(ε λ_max)) / (ε λ_max) is:

    λ'_b = ( λ_min + O(ε λ_max) ) / ( ε λ_max ).

We distinguish two cases. First, if the perturbation in the order notation above is close to ε λ_max, the eigenvalues of Q̂^T Q̂ are all O(1 + √ε). Therefore, its condition number is close to one, and the theorem holds. Second, we assume that the perturbation is much smaller than ε λ_max, and therefore λ'_b ≪ 1. In this case, the assumptions of lemma 4.5 are satisfied for the matrix in equation (4.14), and thus the smallest eigenvalue of Q̂^T Q̂ is equal to λ'_b + O(ε λ_max). Substituting these eigenvalues in the condition number κ(Q̂^T Q̂) = κ(Q̂)² completes the proof:

    κ(Q)² = 1 / ( (λ_min + O(ε λ_max)) / (ε λ_max) + O(ε) )
          = ε κ(S) / ( 1 + O(ε κ(S)) + O(ε² κ(S)) )
          = O(ε κ(S)),

that is, κ(Q) = O(√(ε κ(S))) = O(√ε κ(W)).

In general, one could circumvent the separation assumption of the theorem by choosing a value of σ > 1 and taking as "good" eigenvalues those with λ_l > O(σ ε λ_max), to ensure O(1) separation between the good and bad groups. The existence of such a separator is guaranteed by the small number (m) of eigenvalues that span the interval between 0 and λ_max = O(1). Although increasing σ weakens the theoretical bound of the theorem, our numerical experiments have shown a consistent reduction of √ε in the condition number of Q.

The second assumption of the theorem is necessary since, in general, we cannot guarantee that the smallest singular value of Q is not smaller than σ_min(W), or even zero. However, such a situation is rather improbable for two reasons. First, one expects the SVQB process to produce better conditioned, if not orthogonal, Q vectors. Second, since computing Q involves a matrix multiplication and a scaling operation, one expects the numerical error to be in linearly independent directions, or to be a small relative error. If the elements of the W and û vectors are O(1), numerical error is expected to mount close to ε, and thus cancellations should not reduce the lowest eigenvalue of S, which is much smaller than ε. If there is a special structure of either W or û with elements smaller than ε, numerical error will affect the computations with those elements in a relative sense, and thus the smallest eigenvalue may not be reduced significantly.

4.1. Convergence comparisons. The above theorems suggest that in most situations, applying SVQB once or twice should produce orthogonal vectors. In the case of extremely ill-conditioned vectors, a third application of the procedure might be necessary. This is akin to the behavior of the iterative GS method [13, 18, 7]. However, unlike block GS without internal re-orthogonalization, SVQB guarantees an improvement of √ε at every iteration, even for κ(W) > 1/√ε.
If an accurate Choleski decomposition can be computed, the CholQR procedure should be identical to SVQB. In fact, it might be possible to prove bounds for the CholQR similar to the ones in the previous section. However, even with an accurate decomposition, for very large condition numbers we expect the CholQR to be less stable than the eigenvalue-based SVQB method.

To demonstrate the relative effectiveness of these methods, we apply them on three sets of vectors, and report the improvements on their condition numbers. The first set is the 30 Krylov vectors generated from a vector of all ones and the 2-D Laplacean on a regular, finite difference, square mesh with Neumann conditions. The dimension of the matrix is 1089, and the initial vector is not considered among the set of 30. The second set consists of the columns of the Hilbert matrix of size 100. The third set is rather artificial, and it has been used to show the benefits of MGS over GS [14, 13, 2]. We use the following variation, shown in MATLAB notation: W = [ ones(1,30); diag( rand(30,1)*eps*eps*eps ) ]. All tests are run in MATLAB, on a SUN Ultra-2 workstation with ε = 2.2e-16. The condition numbers are computed by the Matlab function cond and therefore could be inaccurate whenever they exceed 10^16. The results for these three cases are shown in tables 4.1, 4.2, and 4.3 respectively.

             SVQB                           CholQR     GS         MGS
Iteration    κ(Q_u)    κ(Q'_u)   κ(Q)       κ(Q)       κ(Q)       κ(Q)
    1        6e+16     1e+09     4e+08      1e+09      2e+16      2e+01
    2        4e+08     1e+01     4e+00      6e+01      1e+14      1+3e-14
    3        4e+00     1+        1+         1+7e-12    8e+10      1+
    4        --        --        --         1+         3e+06      --
    5        --        --        --         --         1+1e-03    --
    6        --        --        --         --         1+         --

Table 4.1
W = Krylov(A,30), where A is a 2-D Laplacean of size 1089×1089, and an initial vector of all ones. Note that κ(W) = 3e+20. However, after scaling, because of numerical error, κ(W) becomes 6e+16.

             SVQB                           CholQR     GS         MGS
Iteration    κ(Q_u)    κ(Q'_u)   κ(Q)       κ(Q)       κ(Q)       κ(Q)
    1        2e+19     3e+11     9e+10      2e+12      2e+19      7e+02
    2        9e+10     2e+03     7e+02      6e+04      4e+16      1+2e-13
    3        8e+02     1+5e-11   1+2e-11    1+2e-07    2e+14      1+
    4        1+2e-11   1+        1+         1+         4e+12      --
    5        --        --        --         --         4e+10      --
    6        --        --        --         --         4e+08      --
    7        --        --        --         --         2e+00      --
    8        --        --        --         --         1+         --

Table 4.2
W = Hilbert matrix of size (100). κ(W) = 2e+19.

             SVQB                           CholQR     GS         MGS
Iteration    κ(Q_u)    κ(Q'_u)   κ(Q)       κ(Q)       κ(Q)       κ(Q)
    1        2e+49     3e+41     4e+34      4e+34      2e+01      1+
    2        4e+34     7e+25     4e+19      4e+19      1+         --
    3        4e+19     3e+11     2e+07      4e+06      --         --
    4        2e+07     1+1e-03   1+1e-04    1+4e-03    --         --
    5        1+1e-04   1+        1+         1+         --         --

Table 4.3
W = [ ones(1,30); diag(rand(30,1)*eps*eps*eps) ]

We compare SVQB against CholQR, GS, and MGS, by printing the condition number κ(Q) of the vectors that these methods produce after each iteration. Since the implicit normalization in SVQB does not guarantee normality for ill-conditioned problems, we print also the condition number of the unscaled vectors Q_u, κ(Q_u), and the condition number of the same vectors after explicitly scaling them by their norms, κ(Q'_u). Note that after each iteration κ(Q_u) is equal to the condition number of the vectors before the application of SVQB. For example, after the first iteration κ(Q_u) = κ(W).

The results in all tables confirm theorems 4.4 and 4.6. The application of one step of SVQB reduces the condition number of a set of vectors at least by √ε. This reduction is sharp for the examples in tables 4.1 and 4.2, but if the vectors are explicitly normalized the reduction could be larger (see table 4.3). When the condition number is smaller than 1/√ε, the reduction obeys closely the bound in theorem 4.4.

As expected, the CholQR method behaves similarly to SVQB (table 4.3). In some cases, the orthogonality of the CholQR is inferior to that of the SVQB (see table 4.2), and thus, it is possible that it takes more iterations to produce a fully orthonormal set (see table 4.1). This is in spite of the fact that in our implementation, we first compute the lowest eigenvalue of W^T W and shift it so that the Choleski decomposition is applied on a numerically positive definite matrix.

As discussed earlier, block GS without re-orthogonalization for each vector is not effective for large condition numbers. In such cases, GS may offer no improvement between successive iterations (see the first GS iteration in table 4.2), or it may require many iterations to produce a set with relatively small condition number (see tables 4.1 and 4.2). Once this is achieved, however, one or two further iterations provide a fully orthonormal set. An exception is the example in table 4.3, for which GS requires only one re-orthogonalization. The reason is that GS takes advantage of the sparse structure of the matrix, performing computations only among very small elements, thus achieving low relative error.

MGS is clearly more stable than the rest of the methods. However, because the departure from orthogonality for the MGS is bounded by ε κ(W) [1], even for relatively small κ(W) a second MGS is often needed (see tables 4.1 and 4.2). Note that the ε κ(W)² bound for SVQB is virtually identical to that of MGS if κ(W) is close to 1, thus diminishing any advantages of MGS over SVQB.

5. Orthogonalizing against V. The above theory and examples establish that SVQB is a competitive technique for orthogonalizing a set of vectors W among themselves. In practice, however, full orthogonality among the W vectors may not be necessary in all iterations of algorithm GS-SVQB. The reason is the interplay between the block GS and the SVQB at the first and second step of the algorithm respectively. The GS destroys the orthogonality of W, and the SVQB procedure may destroy the orthogonality against V.

Lemma 5.1. Let W' = GS(V, W) be the set of vectors resulting after the first step of the GS-SVQB(V, W) algorithm, and assume that ‖V^T W'‖ ≤ δ ‖W'‖. Let Q = SVQB(W') be the result of the second step of the GS-SVQB algorithm. Then,

    ‖V^T Q‖ = O( (δ + ε) min( κ(W'), 1/√ε ) ).

Proof. Following the notation of theorem 4.1, let Q̂ = W' û Λ̂^{−1/2} + ΔQ. Using the error bound on ΔQ in (4.4), and the bounds for the two different cases in (4.6) and (4.7), we have: ‖V^T Q̂‖ = ‖V^T W' û Λ̂^{−1/2} + V^T ΔQ‖ ≤ δ ‖W'‖ ‖Λ̂^{−1/2}‖ + O(‖ΔQ‖) = O( (δ + ε) min( κ(W'), 1/√ε ) ).

The above lemma states that even when GS produces a W' that is exactly orthogonal to V, i.e., δ = 0, the SVQB procedure may destroy that orthogonality up to a maximum of √ε. Although lemma 5.1 is given only for the SVQB, similar results apply if any other method for QR decomposition is used. In our numerical experiments, we have observed consistently the bound O(δ κ(W)) for the loss of orthogonality against V, regardless of the method used to orthogonalize the W vectors among themselves. This is an additional reason diminishing the importance of using the more accurate MGS, instead of SVQB, in our procedure.
The GS procedure in the first step of the GS-SVQB algorithm has an even worse effect on the orthogonality of the W vectors.

Lemma 5.2. Let V^T V = I, and W be a set of normal vectors with ‖W^T W − I‖ ≤ η. Let Q = (W − V V^T W) D^{−1/2} be the normalized result of the block GS, where D is a diagonal matrix with the squares of the normalizing norms of the vectors. Assume that there is no floating point error in computing Q. If ‖V^T W‖ = γ < 1, then:

    ‖Q^T Q − I‖ = O( (η + γ²) / (1 − γ²) ).

Proof. If we let S = V^T W, we have D_ii = w_i^T w_i − w_i^T V V^T w_i = 1 − e_i^T S^T S e_i. Note that for all diagonal elements of S^T S, it holds e_i^T S^T S e_i ≤ ‖S^T S‖ = γ² < 1. As a result, for all diagonal elements of D, we have: D_ii > 1 − γ². This holds for min(D_ii) too, and therefore ‖D^{−1}‖ < 1/(1 − γ²). In addition, we see that for all i: 1/D_ii − 1 < γ²/(1 − γ²). From the above we can compute:

    ‖Q^T Q − I‖ = ‖D^{−1/2}(W^T W − I)D^{−1/2} + D^{−1} − I − D^{−1/2}(W^T V)(V^T W)D^{−1/2}‖
                ≤ η ‖D^{−1}‖ + ‖D^{−1} − I‖ + ‖D^{−1}‖ ‖S^T S‖
                ≤ η/(1 − γ²) + γ²/(1 − γ²) + γ²/(1 − γ²) = (η + 2γ²)/(1 − γ²).

The lemma states that if the vectors W are not orthogonal enough to V (i.e., if ‖V^T W‖ > √ε), they will also lose their orthogonality among each other after the GS step. This lemma applies to any procedure that may be used instead of GS.
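Lemma 5.2 can also be illustrated numerically. In the following sketch (our own construction), W is normalized, nearly orthonormal, and has overlap of order γ with V; after one exact block GS step the loss of orthogonality among the columns of Q stays within the (η + 2γ²)/(1 − γ²) bound derived in the proof:

    import numpy as np

    rng = np.random.default_rng(1)
    n, k, m = 5000, 40, 10
    Z, _ = np.linalg.qr(rng.standard_normal((n, k + m)))
    V, B = Z[:, :k], Z[:, k:]                      # V orthonormal; B orthonormal and orthogonal to V
    g0 = 1e-3
    W = B + g0 * V[:, :m] + g0 * B @ np.triu(np.ones((m, m)), 1)   # inject overlap with V and among columns
    W /= np.linalg.norm(W, axis=0)
    gamma = np.linalg.norm(V.T @ W, 2)
    eta = np.linalg.norm(np.eye(m) - W.T @ W, 2)
    Q = W - V @ (V.T @ W)                          # block GS step (I - V V^T) W
    Q /= np.linalg.norm(Q, axis=0)
    loss = np.linalg.norm(np.eye(m) - Q.T @ Q, 2)
    print(loss, (eta + 2 * gamma**2) / (1 - gamma**2))   # loss should not exceed the bound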

6. The iterative GS-SVQB algorithm. Theorems 4.1, 4.4 and 4.6 and the well studied behavior of the GS algorithm suggest that the GS-SVQB should be applied iteratively. Figure 6.1 shows four possible implementations, based on which of the two steps (GS/SVQB) is carried out iteratively.

Algorithm 1:
   repeat
      W = GS(V, W)
      W = SVQB(W)
   until (‖W^T W − I‖, ‖V^T W‖ = O(ε))

Algorithm 2:
   repeat
      repeat W = GS(V, W)
      repeat W = SVQB(W)
   until (‖W^T W − I‖, ‖V^T W‖ = O(ε))

Algorithm 3:
   repeat
      repeat W = GS(V, W)
      W = SVQB(W)
   until (‖W^T W − I‖, ‖V^T W‖ = O(ε))

Algorithm 4:
   repeat
      W = GS(V, W)
      repeat W = SVQB(W)
   until (‖W^T W − I‖, ‖V^T W‖ = O(ε))

Fig. 6.1. Four possible iterative implementations of the GS-SVQB algorithm. The outer loop is repeated until W becomes orthonormal and orthogonal to V. The inner loops could be repeated until full orthogonalization is achieved or for a specified number of steps. Our theory suggests that Algorithm 4 is the most preferable.

Algorithm 1 is a straightforward iterative implementation of GS-SVQB, but it does not take into account the fact that SVQB and GS may require a different number of steps to orthogonalize W among themselves and against V respectively. Forcing both methods to take the same number of steps may cause a lot of work to be wasted. This applies especially to the GS, for which performing it twice is usually enough [18]. Moreover, because usually the size of V is much larger than that of W, we try to minimize the number of times that GS has to be repeated.
Algorithms 2 and 3 apply GS repeatedly on W and thus they also are inappropriate for similar reasons. In addition, although the resulting W might be orthogonal to V after the GS, applying the SVQB on W destroys that orthogonality by as much as ε κ(W) (from lemma 5.1). This completely obviates more than one GS application.
Algorithm 4 seems the most appropriate choice for an iterative implementation of GS-SVQB. An additional argument in favor of this choice is that GS destroys the orthogonality among the W vectors only by O(‖V^T W‖²) (see lemma 5.2). The presence of the square often alleviates this loss, especially when the GS has already been applied once.

6.1. Tuning the algorithm. The next step is to identify efficient and practical conditions for terminating the outer and inner repeat loops of algorithm 4. We observe that the iterative application of SVQB in the inner loop should not necessarily produce a fully orthonormal set W. Lemmas 5.1 and 5.2 suggest that the two steps, GS and SVQB, must be balanced so that work is not wasted in obtaining full orthogonality when this is not needed. They also suggest that both κ(W) and ‖V^T W‖ must be kept comparably small.
We start by determining the number of times the inner repeat loop needs to be executed. First, we seek the conditions under which the final outer iteration does not require any SVQB applications. In this case, one or more outer iterations have been performed already, because W must be orthonormal. In addition, orthonormality of W must not be destroyed by the GS at this last outer iteration. Thus, lemma 5.2 implies that before this GS it holds:

(6.1)    ‖V^T W‖ < √ε.

Although this condition can be checked inexpensively as a by-product of GS, κ(W) cannot be known exactly without additional synchronization and computation. To decide whether to skip the final SVQB, we can instead use theorem 4.1 and the singular values of W obtained in the last SVQB of the previous outer iteration. According to the theorem, the SVQB that produced the latest orthonormal set W must have been applied to a previous set with condition number O(1). Therefore, besides (6.1), we should also test for:

(6.2)    κ(W)_{before last SVQB} = O(1).

Next, we examine the number of inner iterations required at any step of algorithm 4. Because applying GS twice is usually enough [18], we claim that the inner iteration should produce a set W with at least κ(W) = O(1). If the inner iteration produced a W with κ(W) > O(1), then at the next outer iteration the orthogonality that the next GS achieves against V would be spoiled to the order of ε κ(W) (lemma 5.1). Thus, a third outer iteration would be necessary, which could have been avoided. As a result, we only need to distinguish between those cases for which the inner iteration should produce a completely orthonormal set W or one with κ(W) = O(1). For this, we need to take into account the reorthogonalization requirement of GS.
To test whether the GS procedure requires reorthogonalization, we have used a popular test due to Daniel et al. [7]. Reorthogonalization is needed whenever the norm of W after GS becomes less than 0.7 times its original norm before the GS. In our algorithm, the SVQB also can cause loss of orthogonality against V. Let Q be the set resulting from GS on W, and consider the following cases.
Assume that Daniel's test does not require reorthogonalization. If κ(Q) = O(1), then one application of the SVQB will produce a fully orthonormal Q, without destroying its orthogonality against V. If κ(Q) > O(1), then SVQB will destroy the orthogonality against V and orthogonalization must be repeated. The SVQB should be carried out until κ(Q) becomes O(1), so that the SVQB at the following outer iteration does not affect V^T Q.

Next, assume that Daniel's test requires reorthogonalization. If κ(Q) = O(1), we still need to apply SVQB once. This will probably orthonormalize Q, and if the orthogonality against V is relatively good, a final SVQB may not be needed at the next outer step. If κ(Q) > O(1), as in the previous case, we iterate the SVQB until κ(Q) becomes O(1).

At this point we can summarize our above observations and analysis into the following algorithm. The algorithm simplifies our previous discussion by noting that repeat-until loops always execute at least once. Thus, it is sufficient that the `until' condition for SVQB checks whether the condition number of the resulting Q is O(1). Note that the condition number of the set W is computed from the eigenvalues of the matrix S = W^T W in the SVQB algorithm. Therefore, it corresponds to the set W before the application of the SVQB. Because a bound on the resulting κ(Q) can be inferred through theorems 4.4 and 4.6, the until condition checks whether the condition number before the last SVQB was less than 1/√ε. Finally, note that ‖V^T W‖ can be computed during the GS, and it corresponds to the overlap of the two matrices before the GS.

Algorithm 6.1. Q = iGS-SVQB(V, W)   (iterative GS-SVQB method)
   i = 1
   W^(0) = W
   κ_last = large number
   repeat
      γ = ‖V^T W^(i−1)‖
      W^(i) = GS(V, W^(i−1))
      Reorthogonalization = Daniel's test
      Q^(0) = W^(i)
      j = 1
      if (γ < √ε) and (κ_last = O(1))
         break   (skip final SVQB)
      if (Reorthogonalization = false) and (κ(Q^(0)) > O(1))
         Reorthogonalization = true
      repeat
         κ_last = κ(Q^(j−1))
         Q^(j) = SVQB(Q^(j−1))
         j = j + 1
      until (κ_last < 1/√ε)
      W^(i) = Q^(j−1)
      i = i + 1
   until (Reorthogonalization = false)
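A rough NumPy transcription of Algorithm 6.1 is given below, reusing the svqb sketch of Section 3. The O(1) tests, Daniel's 0.7 test (applied here per column), and the condition-number estimates are coded with concrete choices of our own; in the actual implementation the condition numbers come from the eigenvalues already available inside SVQB rather than from a separate SVD, as the text above explains.

    import numpy as np

    def igs_svqb(V, W, ok_cond=10.0):
        eps = np.finfo(W.dtype).eps
        kappa_last = np.inf
        while True:
            gamma = np.linalg.norm(V.T @ W, 2) if V.shape[1] else 0.0
            norms_before = np.linalg.norm(W, axis=0)
            W = (W - V @ (V.T @ W)) if V.shape[1] else W
            norms_after = np.linalg.norm(W, axis=0)
            W = W / norms_after
            reorth = np.any(norms_after < 0.7 * norms_before)   # Daniel's test, per column
            if gamma < np.sqrt(eps) and kappa_last <= ok_cond:
                break                                           # skip the final SVQB
            Q = W
            if not reorth and np.linalg.cond(Q) > ok_cond:      # stands in for the O(1) test
                reorth = True
            while True:                                         # inner SVQB loop
                kappa_last = np.linalg.cond(Q)                  # condition before the last SVQB
                Q = svqb(Q)
                if kappa_last < 1.0 / np.sqrt(eps):
                    break
            W = Q
            if not reorth:
                break
        return W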

7. Further optimizations. Because of its block character, the iGS-SVQB algorithm iterates on the whole set of vectors W. Often, this may be wasteful, because individual vectors in W may be almost orthogonal to V and also to any other vector in W. Clearly, further reorthogonalizations should exempt these vectors, performing computations on a smaller block.
Specifically, after the GS phase, we can easily separate those vectors that do not need reorthogonalization (good vectors) from those that need it (bad vectors), W = [W_g W_b]. Unfortunately, not all of the good vectors will turn out to have large singular values in the SVQB phase. Moreover, the SVQB phase will mix all these good and bad vectors and the identity of the good ones will disappear.

To solve this problem, first we perform an eigenvalue decomposition of S_g = W_g^T W_g, for the W_g vectors alone. This, in turn, will subdivide the good group into two groups W_g = [W_gg W_gb]. The first group, which we call the good-good group, consists of all the singular vectors W_gg corresponding to large singular values of S_g. Therefore, these W_gg vectors are orthogonal both to V and to each other and can be appended to V in future iterations. Note also that W_gg has at least one vector. The second group W_gb, which we call good-bad, contains all the vectors that did not need reorthogonalization originally, but because they were too close to other vectors in W_g they may have lost their orthogonality against V as well. Therefore, the W_gb vectors should be combined with the W_b vectors, and an SVQB should be applied on [W_gb W_b].

Still, a few more subtleties need to be resolved. First note that one synchronization is still enough, since the computation of S_b = W_b^T W_b can be performed at the same time as S_g. Second, neither of the W_gb and W_b vector sets is orthogonal to W_gg, and moreover the W_b set has not been orthogonalized against W_gg yet. If the angles between W_b and W_gg are very small, extra iterations of the expensive outer loop of the iGS-SVQB algorithm may be needed. Therefore, we have to perform at least one orthogonalization of the [W_gb W_b] against W_gg. Interestingly, these orthogonalizations can be performed on m × m matrices, by orthogonalizing the short singular vectors instead of the corresponding W vectors. In this way, the computation of S'_b = [W_gb W_b]^T [W_gb W_b] can also be performed inexpensively using m × m matrices, and without additional synchronization points.

The theory supporting the above optimizations, as well as the details of the implementation, fall beyond the scope of this paper, and will be the focus of an upcoming technical report. In summary, we can say that at each iteration, our iGS-SVQB implementation identifies orthonormal vectors in W that have been orthonormalized effectively, so that no further computations are wasted on them.
When the number of vectors in W is very large, applying the SVQB method on the full block may not always be efficient or numerically stable. For example, the better cache performance of the SVQB may not be enough to counterbalance the increase in the arithmetic over the GS, or even the solution of the m × m SVD problem. Moreover, in most situations increasing the block beyond an optimal size may even decrease cache performance. Finally, the conditioning of W is bound to deteriorate with block size. For these reasons, we want a variable block size that can be tuned according to the machine and the problem.

Let b be the desirable block size. We partition W into p = m/b sets of vectors, W = [W_1, ..., W_p], and apply the SVQB on each one individually. After a sub-block W_i is made orthonormal and orthogonal to V and to the previous W_j, j < i, it is appended to V and the next W_{i+1} is targeted. The algorithm follows:

Algorithm 7.1. Q = bGS-SVQB(V, W, b)   (variable block iGS-SVQB method)
   p = m/b
   Partition W = [W_1, ..., W_p]
   Q = [ ];  Z = V
   for i = 1, p
      Q_t = iGS-SVQB(Z, W_i)
      Z = [Z, Q_t]
      Q = [Q, Q_t]
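In NumPy form, and reusing the igs_svqb sketch above, Algorithm 7.1 reads as follows (names are ours; the last sub-block may be smaller when m is not a multiple of b):

    import numpy as np

    def bgs_svqb(V, W, b):
        m = W.shape[1]
        Q = np.empty((W.shape[0], 0))
        Z = V
        for i in range(0, m, b):               # p = m/b sub-blocks W_1, ..., W_p
            Qt = igs_svqb(Z, W[:, i:i + b])
            Z = np.hstack([Z, Qt])             # Q_t is appended to the external set
            Q = np.hstack([Q, Qt])
        return Q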

The number of synchronization points in the bGS-SVQB algorithm increases linearly with p, and a larger percentage of the computation is spent on the GS procedure. In the extreme case of b = 1, the algorithm reduces to the classical GS method, while for b = m the algorithm is simply the iGS-SVQB method. With bGS-SVQB, we expect to identify a range of block sizes for which the cache performance is optimal, while the synchronization requirements are not excessive. Finally, note that since bGS-SVQB is based on the iGS-SVQB, it will iterate to guarantee orthogonality to any user specified level.

8. Timing experiments. We have extensively tested our implementation of the bGS-SVQB method against a variety of orthogonalization alternatives. To provide a common comparison framework for all methods, we use a "bGS-Method" algorithm that is identical to our bGS-SVQB, except that a different method is used to orthogonalize the block.
The first method is the classical GS algorithm with reorthogonalization. This is the only method whose structure differs slightly from the "bGS-Method". For GS, it is more efficient to orthogonalize each vector in the block at once against all V vectors and all previously orthogonalized vectors in W. Thus, block size does not affect the behavior of GS, and in the figures we simply refer to it as GS.
We also compare against the QR factorization with Householder reflections. The method is denoted as bGS-QR, and uses the QR implementation from the ScaLAPACK library [4] for both single and multiprocessor platforms.

Fig. 8.1. Single node MFLOPs as a function of block size for the four methods. The methods orthonormalize thirty vectors of dimension 500000. Left graph depicts Cray T3E results. Right graph depicts IBM SP2 results.

Finally, we compare against the computationally similar method CholQR. The method is denoted as bGS-CholQR, and uses the Choleski decomposition provided in the ScaLAPACK library. Note that all "bGS-Method" variants reduce to GS when the block size is equal to one.
All algorithms have been implemented in Fortran 90 using the MPI interface, and run on the Cray T3E 900 and the IBM SP2 parallel computers at NERSC. 256 MB of memory are available on each node of both machines; on the SP2 the nodes are two-processor SMP nodes, but they are assigned individual MPI processes. The T3E network is considerably faster than the SP2 one, while the SP2 processors are slightly faster than the T3E ones. On both parallel platforms we link with the MPICH libraries, and we make use of the machine optimized libraries for ScaLAPACK and BLAS.

Our first numerical example was chosen to be easily reproducible as a set of thirty Krylov vectors of a diagonal matrix. We use the matrix A = diag([1:n]) (in Matlab notation), and the initial vector x = [ 1 log([2:n]) ]', with n = 500000. We let V = ∅, and build W as the set of normalized vectors: W = [ x, Ax, A²x, ..., A^{29}x ]. The condition number of the resulting set W, as computed by the Matlab cond function, is 1.3892E+20. The goal is to orthonormalize the set W as accurately as possible, by applying the different orthogonalization variants and for a range of block sizes: bGS-Method(V, W, b). Note that, initially, most of the computation is spent on the "Method", while as more blocks get orthonormalized GS takes over.

Figure 8.1 illustrates the single-node floating point performance (MFLOP rate) achieved for each of the four algorithms as a function of block size, on the T3E (left graph), and on the SP2 (right graph). As expected, the GS rate is constant regardless of block size. The single-node performance of the bGS-QR does not scale with block size on either machine, which points both to the ScaLAPACK implementation and to the inherent block limitations of the QR. On the other hand, the block structure of the SVQB and CholQR allows them to outperform GS significantly, even for small blocks of 8-10 vectors. The Choleski back-solve implementation seems to exploit better the architecture of the SP2 than of the T3E. However, on both platforms, bGS-SVQB improves GS performance by at least 70-80% for these small block sizes. We should mention that the block MGS method in [14] is expected to have worse single-node performance than the GS because only one of the two phases involves BLAS 3 kernels.

Fig. 8.2. Execution time as a function of block size for GS, bGS-CholQR, and bGS-SVQB, on 2 to 32 nodes. Left graph depicts Cray T3E results. Right graph depicts IBM SP2 results.

Good no de p erformance is imp ortant only if it leads to accurate and faster or-

thogonalization. All of the algorithms tested pro duced a nal orthonormal set Q,

T 13

with kQ Q I k =10 . Figure 8.2 shows that execution times of blo ck metho ds

are sup erior to the GS metho d. The graphs plot execution time as a function of blo ck

size, for three metho ds, and for various numb ers of pro cessors. The left graph corre-

sp onds to the Cray T3E and the right one to the IBM SP2. The bGS-CholQR and

bGS-SVQB are consistently faster than the GS, for any blo ck size on the SP2, and for

blo ck sizes of 4 or ab ove on the T3E. It is also clear, b ecause of the logarithmic time

scale, that the relative improvementover the GS timings p ersists on large number of

pro cessors, despite smaller lo cal problem sizes. On the T3E, bGS-SVQB is 20क faster

than the GS metho d, and on the SP2, it is more than 25क faster. Note that although

bGS-CholQR is 35क faster than GS on the SP2, this impressive p erformance do es not

carry over to the T3E.

Finally, we test a different kind of scalability. We measure the effects of synchronization as the number of nodes increases, by fixing the problem size on each processor and using a constant block size of 6. For this test, the set W has 30 Krylov vectors generated by the matrix of the discretized Laplacean on a 3-dimensional cube. Every processor holds a 32 × 32 × 32 grid locally (so the matrix size is proportional to the number of nodes), and uses it to create the Krylov space. The W vectors are generated in chunks of 6 successive Krylov vectors, and each chunk is orthonormalized by a call to bGS-Method(V, W, 6). Figure 8.3 plots the execution times of the four methods over a wide range of processor numbers. The T3E has been used for this experiment because of the large number of available nodes. In the absence of communication/synchronization costs, the times should be equal for all processors.

Fig. 8.3. Execution time on the Cray T3E for GS, bGS-QR (ScaLAPACK), bGS-CholQR, and bGS-SVQB, as the number of processors increases from 1 to 512, with fixed local problem size and block size 6.

The increase in time observed in the figure is relatively small for all methods, because of the extremely fast T3E network. However, the effects are more apparent on the GS and the bGS-QR, as their curves increase faster than the respective ones for bGS-CholQR and bGS-SVQB. These differences are expected to be significant if the experiment is run on a high-latency network cluster. Finally, we observe that on this problem the bGS-SVQB is more than 30% faster than the GS (an improvement over the previous numerical problem). This is especially important because the GS is one of the most computationally efficient methods.

9. Conclusions. We have introduced and analyzed a new orthonormalization method, called SVQB, that computes the orthonormal basis from the right singular vectors of a matrix. The method is attractive computationally because it involves only BLAS 3 kernels and it requires only one synchronization point in parallel implementations. We have proved that if the condition number of the vector set is originally κ(W) < O(1/√ε), the departure from orthonormality of the resulting vector set is ‖I − Q^T Q‖ = O(ε κ(W)²), and, under some weak assumptions, the algorithm guarantees that each iteration reduces the condition number of the set by √ε. We have also considered the problem of incremental orthonormalization, where a block of vectors is orthonormalized against a previously orthonormal set of vectors with GS, and among itself with SVQB. The independent application of each of the methods destroys the orthogonality produced by the other method. We have provided bounds that describe this numerical interdependence and have used them to balance the work performed by each of the two orthogonalization phases of an iterative scheme. Our Matlab examples have demonstrated that our theoretical bounds are in accordance with practice, and our parallel implementations on the Cray T3E and the IBM SP2 have shown that our method improves the performance of other orthonormalization alternatives.

10. Acknowledgements. This work was supported by the College of William and Mary and by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC03-76SF00098.
This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.

REFERENCES

[1] A. Björck. Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT, 7:1-21, 1967.
[2] A. Björck. Numerics of Gram-Schmidt orthogonalization. Linear Algebra and its Applications, 198:297-316, 1994.
[3] A. Björck and C. Paige. Loss and recapture of orthogonality in the modified Gram-Schmidt algorithm. SIAM J. Matrix Anal. Appl., 13(1):176-190, 1992.
[4] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK User's Guide. SIAM, 1997.
[5] S. Chaturvedi, A. K. Kapoor, and V. Srinivasan. A new orthogonalization procedure with an extremal property. Technical Report quant-ph/9803073, 1998.
[6] C. K. Chui. Wavelet Analysis and its Applications. Academic Press, San Diego, 1992.
[7] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart. Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization. Math. Comp., 30(136):772-795, October 1976.
[8] J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16:1-17, 1990.
[9] J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14:1-32, 1988.
[10] W. Gander. Algorithms for the QR decomposition. Technical Report TR 80-02, Angewandte Mathematik, ETH, 1980.
[11] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD 21211, 1989.
[12] G. H. Golub and R. Underwood. The block Lanczos method for computing eigenvalues. In J. R. Rice, editor, Mathematical Software III, pages 361-377, New York, 1977. Academic Press.
[13] W. Hoffmann. Iterative algorithms for Gram-Schmidt orthogonalization. Computing, 41:335-348, 1989.
[14] W. Jalby and B. Philippe. Stability analysis and improvement of the block Gram-Schmidt algorithm. SIAM J. Sci. Statist. Comput., 12(5):1058-1073, 1991.
[15] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Soft., 5:308-325, 1979.
[16] P. O. Löwdin. J. Chem. Phys., 18:365, 1950.
[17] P. O. Löwdin. Adv. Quant. Chem., 23:84, 1992.
[18] Beresford N. Parlett. The Symmetric Eigenvalue Problem. SIAM, Philadelphia, PA, 1998.
[19] Yousef Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, 1996.
[20] H. C. Schweinler and E. P. Wigner. J. Math. Phys., 11:1693, 1970.
[21] J. P. Singh, D. E. Culler, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., San Francisco, 1999.
[22] A. Stathopoulos and J. R. McCombs. A parallel, block, Jacobi-Davidson implementation for solving large eigenproblems on coarse grain environments. In 1999 International Conference on Parallel and Distributed Processing Techniques and Applications, pages 2920-2926. CSREA Press, 1999.
[23] J. H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, England, 1965.