A BLOCK ORTHOGONALIZATION PROCEDURE WITH
CONSTANT SYNCHRONIZATION REQUIREMENTS
ࣿ y
ANDREAS STATHOPOULOS AND KESHENG WU
Abstract. We prop ose an alternative orthonormalization metho d that computes the orthonor-
mal basis from the right singular vectors of a matrix. Its advantage are: aङ all op erations are
matrix-matrix multiplications and thus cache-ecient, bङ only one synchronization p oint is required
in parallel implementations, cङ could b e more stable than Gram-Schmidt. In addition, we consider
the problem of incremental orthonormalization where a blo ckofvectors is orthonormalized against a
previously orthonormal set of vectors and among itself. We solve this problem by alternating itera-
tively b etween a phase of Gram-Schmidt and a phase of the new metho d. We provide error analysis
and use it to derive b ounds on how accurately the two successive orthonormalization phases should
b e p erformed to minimize total work p erformed. Our exp eriments con rm the favorable numerical
b ehavior of the new metho d and its e ectiveness on mo dern parallel computers.
Key words. Gram-Schmidt, orthogonalization, Householder, QR factorization,
singular value decomp osition, Poincare
AMS Sub ject Classi cation. 65F15
1. Intro duction. Computing an orthonormal basis from a given set of vectors
is a basic computation, common in most scienti c applications. Often, it is also one
of the most computationally demanding pro cedures b ecause the vectors are of large
dimension, and b ecause the computation scales as the square of the number of vectors
involved. Further, among several orthonormalization techniques the ones that ensure
high accuracy are the more exp ensive ones.
In many applications, orthonormalization o ccurs in an incremental fashion, where
a new set of vectors घwe call this internal setङ is orthogonalized against a previously
orthonormal set of vectors घwe call this externalङ, and then among themselves. This
computation is typical in blo ck Krylov metho ds, where the Krylov basis is expanded by
a blo ckofvectors [12, 11]. It is also typical when certain external orthogonalization
constraints have to be applied to the vectors of an iterative metho d. Lo cking of
converged eigenvectors in eigenvalue iterative metho ds is such an example [19,22].
This problem di ers from the classical QR factorization in that the external set
of vectors should not be mo di ed. Therefore, atwo phase pro cess is required; rst
orthogonalizing the internal vectors against the external, and second the internal
among themselves. Usually, the numb er of the internal vectors is much smaller than
the external ones, and signi cantly smaller than their dimension. Another imp ortant
di erence is that the accuracy of the R matrix of the QR factorization is not of pri-
mary interest, but rather the orthonormalityof the pro duced vectors Q. A variety
of orthogonalization techniques exist for b oth phases. For the external phase, Gram-
Schmidt घGSङ and its mo di ed version घMGSङ are the most comp etitivechoices. For
the internal phase, QR factorization using Householder transformations is the most
stable, alb eit more exp ensive metho d [11]. When the number of vectors is signi -
cantly smaller than their dimension, MGS or GS with reorthogonalization are usually
preferred.
Computationally, MGS, GS and Housholder transformations are based on level
1 or level 2 BLAS op erations [15,9,8]. These basic kernel computations, dot pro d-
ucts, vector up dates and sometimes matrix-vector op erations, cannot fully utilize the
ࣿ
Department of Computer Science, College of William and Mary, Williamsburg, Virginia 23187-
8795, घ[email protected]ङ.
y
NERSC, Lawrence Berkeley National Lab oratory, Berkeley, California 94720, घ[email protected]ङ. 1
pro cessor cache, with signi cant p erformance shortcomings on most of to days com-
puters. In addition, these metho ds have high synchronization requirements when
implemented on parallel computers. The numb er of synchronization p oints of the GS
is linear to the number of vectors, while for the rest metho ds it is quadratic [14].
When the parallel platform involves a high latency network but fast pro cessors, as
in the case of the increasingly p opular clusters of workstations [21], the number of
synchronization p oints can haveanadverse e ect in the scalability of the metho d.
Most of the previous e orts to address these problems considered blo cks of vectors,
and used hybrids of the more scalable GS for orthogonalization b etween blo cks, and
the more accurate MGS for orthogonalization within blo cks [14, 2]. By adjusting
the blo ck size appropriately, these techniques can pro duce accurate orthonormal sets
while reducing the synchronization requirements of MGS. However, the number of
synchronization p oints is still linear to the number of vectors, and BLAS level 2
kernels are still dominant despite blo cking. Finally, most of such e orts have fo cused
on a full QR factorization of a set of vectors, rather than the two phase problem.
In this pap er, weintro duce a metho d based on the singular value decomp osition
घSVDङ, that uses the right singular vectors to pro duce an orthonormal basis for a
given subspace. The idea itself is not new, dating back at least to Poincare, and it
is sometimes encountered in chemistry and wavelet literature [5, 20,16,17, 6]. How-
ever, it has not received any attention as a computationally viable orthogonalization
metho d, and to our knowledge there is no analysis of its numerical prop erties. Beyond
theoretical considerations, the metho d, whichwe call SVQB, has some very attractive
characteristics. It uses primarily level 3 BLAS kernels, and it has a constantnumber
of synchronization p oints. As we show in this pap er, it is not as numerically accurate
as MGS, but it is often b etter than GS. More interestingly,we show that more stable
alternatives, such as MGS or Householder, are an overkill for our two phase problem.
Coupling the SVQB metho d for the internal phase with a blo ck GS with reorthogonal-
ization for the external phase results in a metho d also with constant synchronization
requirements.
The pap er is organized as follows. We rst describ e the basic idea of the SVQB
metho d, and we couple it with a blo ck GS for the two phase problem. The numerical
stability of the metho d is analyzed in the following section, and some numerical exp er-
iments and comparisons with other metho ds verify its e ectiveness. We then analyze
the numerical interaction b etween the metho ds used in each of the two phases, and
based on this theory we tune the two phases so that they do not p erform unneces-
sary reorthogonalizations. Following, we present timings from a series of exp eriments
on the Cray T3E, and on the IBM SP-2. These verify that b oth the blo ck computa-
tions and the small numb er of synchronizations help the new metho d achieve accurate
orthogonality, faster than other comp etitive metho ds.
nࣾk
2. Orthogonalization metho ds. Let V 2< b e a set of orthonormal vec-
nࣾm
tors, and W 2< b e a set of vectors, where k + m ऊ n. In practice, we usually
exp ect m < k and m n. The problem is to obtain an orthonormal set of vec-
tors Q, such that spanघQङ=spanघW ङ and Q ? V . This problem can be viewed as
consisting of two phases: obtaining a Q which is orthogonal to V घexternal phaseङ,
and orthonormalizing the Q vectors among themselves घinternal phaseङ. The internal
phase problem is most commonly solved through various metho ds for QR factoriza-
tion [11,23].
Gram-Schmidt घGSङ is the most well known and also the least computationally
exp ensive metho d for QR factorization. In addition, GS is based primarily on level 2
2 BLAS kernels, and it is p ossible to implement it in parallel with a mo dest num-
ber of m + 1 synchronization p oints. Although computationally attractive, GS do es
not always pro duce numerically orthogonal vectors [2]. However, when used with
reorthogonalization, a second GS iteration is typically enough for pro ducing orthog-
onality to machine precision [7]. Since a second iteration is not always necessary,GS
with reorthogonalization is often preferred over other metho ds.
Householder re ections, on the other hand, yield a QR factorization that is nu-
merically accurate. By construction, the resulting matrix Q is orthogonal to machine
precision घङ, and the computed upp er triangular matrix R di ers from the exact one
by less than kAk [11]. Despite its excellentnumerical characteristics, this technique
involves twice the arithmetic of the GS. For a slight increase in arithmetic, a blo ck
version is also p ossible that facilitates the use of level 3 BLAS kernels. However, when
m n, this blo cking is harder to obtain and it is less e ective. Moreover, the number
of synchronization p oints it requires grows quadratically with m.
Mo di ed Gram-Schmidt घMGSङ improves the numerical stability of the GS by
orthogonalizing individual pairs of vectors rather than a vector against a blo ck. For
MGS, the error in the orthogonalityof Q can b e b ounded by ऊघW ङ, where ऊघW ङis
the condition number of W [1, 3]. The op eration count for MGS is the same with GS,
but working on individual vectors allows only for level 1 BLAS kernels, which impairs
signi cantly the actual p erformance on cache based computers. Finally, the number
of synchronization p oints grows quadratically with m.
In the context of the two phase problem, pro ducing an internal set of vectors Q
with orthogonality close to machine precision is unnecessary, b ecause of the interde-
p endence of the phases. For example, if the internal vectors W are orthogonal to
machine precision to all external vectors V , the internal orthogonalization could sp oil
signi cantly this external orthogonality. Therefore, a second iteration through the
algorithm would still b e necessary. If W and V are far from orthogonal, obtaining a
fully orthonormal Q is also unnecessary b ecause the external orthogonalization will
sp oil the orthogonalityof Q. Therefore, this two phase problem obviates the use of
exp ensive but stable metho ds such as Householder.
3. The SVQB metho d. As with the two phase problem, it is often the case
that an orthonormal basis of the spanघW ङ is needed, rather than a QR factorization
of W . There are various ways of obtaining such a basis, but an esp ecially interesting
one is the one derived from the right singular vectors of W . Assume that the vectors
in W are normalized. The singular values of W are the square ro ots of the eigenvalues
T
of S = W W , and the right singular vectors are the corresp onding eigenvectors of S .