Givens and Householder Reductions for Linear Least Squares on a Cluster of Workstations

Omer Egecioglu and Ashok Srinivasan
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, CA
omer@cs.ucsb.edu, ashok@cs.ucsb.edu

* Supported in part by the NASA grant NAGW and a UCSB Committee on Research grant.

Abstract

We report on the properties of implementations of fast-Givens and Householder reflector based parallel algorithms for the solution of linear least squares problems on a cluster of workstations. Givens rotations enable communication hiding and take greater advantage of parallelism than Householder reflectors, provided the matrices are sufficiently large.

1  Introduction

The linear least squares (LLS) problem is the minimization of the residual

$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2 \qquad\qquad (1)$$

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are given and the subscript 2 denotes the ordinary Euclidean norm. Such problems typically arise in evaluating unknown parameters $x_1, x_2, \ldots, x_n$ of a linear model to fit measured data. Each row of $A$ represents the values of the independent variables in an experiment, and the corresponding row of $b$ represents the value of the dependent variable for that particular experiment. The measured values may be imprecise because of various sources of error, and consequently the number of experiments conducted exceeds the number of unknown parameters, i.e., $m \geq n$. The unknown $x$ is required to minimize the residual over a suitable norm, such as the Euclidean norm in (1). A further simplifying assumption is that the columns of $A$ are linearly independent, i.e., $A$ has full column rank.

There are two main methods to solve (1) that use orthogonal updates: Givens (or fast-Givens) rotations, and Householder reflectors. Both use the fact that $\|Ax - b\|_2 = \|QAx - Qb\|_2$ when $Q$ is orthogonal and, through a series of premultiplications whose product we denote by $Q^T$, reduce $A$ to

$$Q^T A = \begin{bmatrix} R \\ 0 \end{bmatrix},$$

where $R$ is $n \times n$ upper triangular and the zero block is $(m-n) \times n$. This is the QR factorization of $A$. Once the reduction to upper triangular form is accomplished, $x$ can be obtained by back substitution, i.e., by solving $Rx = c$, where

$$Q^T b = \begin{bmatrix} c \\ d \end{bmatrix}, \qquad c \in \mathbb{R}^n, \; d \in \mathbb{R}^{m-n}.$$

It is well known that both methods are numerically stable and are preferred over the solution of the normal equations for LLS problems. While the number of additions for the Givens and Householder methods is the same for the QR factorization, the number of multiplications and square roots in the Householder method is smaller than in the Givens method. Furthermore, if fast-Givens transformations are used to reduce the number of multiplications of Givens transformations, then periodic monitoring and scaling of the row multipliers are necessary. This makes Householder the sequential algorithm of choice for dense QR. In a parallel setting, however, this advantage is offset by the higher communication requirements of the Householder reduction algorithm.

In this paper we consider a parallel implementation of the fast-Givens reduction for LLS, formulated as in (1) with $m \geq n$, on a cluster of workstations, and compare our results with the Householder based ScaLAPACK QR factorization run in the same environment. In our tests the output consists of $Q^T A = R$ and $Q^T b$; the back substitution step is not performed, as it is identical for both algorithms.
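Although back substitution is excluded from the timed comparison, it is the step that completes the LLS solution once $R$ and $c$ are available. The following is a minimal C sketch of solving $Rx = c$; the row-major storage convention and all names are illustrative assumptions of this writeup, not routines taken from the implementation described in the paper.

#include <stddef.h>

/* Solve R x = c for x, where R is n x n upper triangular with a
 * nonzero diagonal (guaranteed when A has full column rank).
 * R is stored row-major with leading dimension ldr.              */
void back_substitute(size_t n, const double *R, size_t ldr,
                     const double *c, double *x)
{
    for (size_t i = n; i-- > 0; ) {        /* rows n-1, n-2, ..., 0 */
        double s = c[i];
        for (size_t j = i + 1; j < n; j++)
            s -= R[i * ldr + j] * x[j];    /* subtract known terms  */
        x[i] = s / R[i * ldr + i];
    }
}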

First we present an overview of Givens and fast-Givens transformations in Section 2 and of Householder transformations in Section 3. Detailed analyses of these methods can be found in [6, 7]. In Section 4 we present our distributed memory implementation of the fast-Givens reduction algorithm for LLS. In Section 5 we present our test results and compare the performance of the two methods, and in Section 6 we present our conclusions.

2  Givens transformations

Consider the matrices

$$E = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}, \qquad A = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix}, \qquad\qquad (2)$$

where $a_1$ and $a_2$ are two row vectors. Then

$$EA = \begin{bmatrix} c\,a_1 + s\,a_2 \\ -s\,a_1 + c\,a_2 \end{bmatrix}. \qquad\qquad (3)$$

If the first $k-1$ components of $a_1$ and $a_2$ are zero, then an additional zero can be introduced in the $k$-th component of $a_2$ by choosing $c$ and $s$ as

$$s = \frac{a_{2k}}{\sqrt{a_{1k}^2 + a_{2k}^2}}, \qquad c = \frac{a_{1k}}{\sqrt{a_{1k}^2 + a_{2k}^2}}.$$

With this choice $c^2 + s^2 = 1$, and $c = \cos\theta$ and $s = \sin\theta$ for some angle $\theta$. The matrix $E$ is a plane rotation through angle $\theta$, and consequently orthogonal. We can define a rotation in the $(i, j)$ plane in $\mathbb{R}^n$ similarly. Let rotate(i, j) denote the function that makes the $(i, j)$-th element of a matrix $A$ zero using rows $a_{i-1}$ and $a_i$. Each rotation uses $4(n - j)$ multiplications, and a sequence of these can be applied to reduce $A \in \mathbb{R}^{m \times n}$ to upper triangular form using a total of about $2n^2(m - n/3)$ multiplications. This can be halved by using fast-Givens (or fast scaled) transformations: the current matrix is kept in a factored form $D\tilde{A}$, where $D$ is an $m \times m$ diagonal matrix. In the actual implementation $D$ is stored as a vector. Consider rows $a_1$ and $a_2$ of $\tilde{A}$ as in (2). Instead of computing the entries of the new rows in (3), we first compute $s/c$. Then

$$c\,a_1 + s\,a_2 = c\,\big(a_1 + (s/c)\,a_2\big), \qquad -s\,a_1 + c\,a_2 = c\,\big(-(s/c)\,a_1 + a_2\big),$$

so that, apart from the common factor $c$ which is absorbed into $D$, each updated entry requires a single multiplication.

To see how $D$ should be updated when a new rotation on a row is performed, it suffices to consider only the two rows that are altered. Writing the two altered rows of the factored form as $\alpha a$ and $\beta b$, where $\alpha$ and $\beta$ are the corresponding diagonal entries of $D$, and solving for the updated quantities in

$$\begin{bmatrix} \alpha' a' \\ \beta' b' \end{bmatrix} = \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} \alpha a \\ \beta b \end{bmatrix},$$

where $a$, $a'$, $b$ and $b'$ are row vectors, results in the equations

$$\alpha' a' = c\alpha\, a + s\beta\, b, \qquad \beta' b' = -s\alpha\, a + c\beta\, b.$$

For the new values it is possible to choose either

$$(\alpha', \beta') = (s\beta,\; s\alpha) \qquad \text{or} \qquad (\alpha', \beta') = (c\alpha,\; c\beta).$$

We choose the former if $|s| \geq |c|$ and the latter otherwise. These two possibilities result in type 1 and type 2 fast-Givens rotations, and the appropriate choice serves to keep the diagonal elements of $D$ as far from zero as possible.
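For concreteness, here is a minimal C sketch of the plain (not fast) rotation described above: it computes $c$ and $s$ from the pivot entries in column $j$ of two adjacent rows and applies the rotation to the remainder of both rows. The names and the use of the library function hypot are our own choices, not taken from the paper's code.

#include <math.h>
#include <stddef.h>

/* Zero out A[i][j] using row i-1, overwriting both rows in place.
 * row_prev = A[i-1], row_cur = A[i]; entries 0..j-1 of both rows
 * are assumed to be zero already, and n is the number of columns. */
void givens_rotate(double *row_prev, double *row_cur, size_t j, size_t n)
{
    double r = hypot(row_prev[j], row_cur[j]);  /* sqrt(a1k^2 + a2k^2) */
    if (r == 0.0)
        return;                                 /* entry already zero  */
    double c = row_prev[j] / r;
    double s = row_cur[j]  / r;
    for (size_t k = j; k < n; k++) {
        double t1 = row_prev[k], t2 = row_cur[k];
        row_prev[k] =  c * t1 + s * t2;
        row_cur[k]  = -s * t1 + c * t2;
    }
    row_cur[j] = 0.0;                           /* force an exact zero */
}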

3  Householder reflectors

Householder reflectors are orthogonal matrices of the form

$$E = I - 2ww^T,$$

where $w \in \mathbb{R}^m$ is a unit vector. They are used to introduce zeros in more than one entry of a column vector of $A$ simultaneously by a single premultiplication, through a suitable choice of $w$. To make the entries of a vector $x \in \mathbb{R}^m$ below its $k$-th component zero while leaving the ones above it unchanged, form the vector

$$u = \big(0, \ldots, 0,\; x_k + \mathrm{sign}(x_k)\,\alpha,\; x_{k+1}, \ldots, x_m\big)^T,$$

where $\alpha = (x_k^2 + \cdots + x_m^2)^{1/2}$. The Householder transformation matrix corresponding to $u$ is formed by setting

$$E = I - \frac{2}{u^T u}\, uu^T.$$

$E$ is an $m \times m$ orthogonal matrix which is called an elementary reflection. It can be verified that the vector $Ex$ has the desired properties. In addition, computing $Ey$ for an arbitrary vector $y$ is simple, for

$$Ey = \Big(I - \frac{2}{u^T u}\, uu^T\Big) y = y - \frac{2\,u^T y}{u^T u}\, u,$$

and $u^T y$ is a scalar. If $E_j$ denotes the Householder reflector that fills the subdiagonal entries in column $j$ of the current matrix with zeros, then the Householder reduction algorithm successively introduces zeros in the subdiagonal entries of the columns of $A$ and reduces $A$ to upper triangular form using a total of about $mn^2 - n^3/3$ multiplications.
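As an illustration of the update rule $Ey = y - \frac{2\,u^T y}{u^T u} u$, the following C sketch forms $u$ for column $j$ (zeroing the entries below row $j$) and applies the reflector to the remaining columns. The storage convention and names are assumptions made for this sketch, not the LAPACK or ScaLAPACK routines discussed later.

#include <math.h>
#include <stddef.h>

/* Apply one Householder step to the m x n matrix A (row-major,
 * leading dimension lda): zero A[j+1..m-1][j] while updating the
 * trailing columns.  u is caller-provided workspace of length m.  */
void householder_step(size_t m, size_t n, double *A, size_t lda,
                      size_t j, double *u)
{
    double alpha = 0.0;
    for (size_t i = j; i < m; i++)
        alpha += A[i * lda + j] * A[i * lda + j];
    alpha = sqrt(alpha);                     /* (x_j^2+...+x_m^2)^(1/2) */
    if (alpha == 0.0)
        return;
    for (size_t i = j; i < m; i++)
        u[i] = A[i * lda + j];
    u[j] += (u[j] >= 0.0 ? alpha : -alpha);  /* x_j + sign(x_j)*alpha   */

    double utu = 0.0;
    for (size_t i = j; i < m; i++)
        utu += u[i] * u[i];

    for (size_t k = j; k < n; k++) {         /* E y = y - (2 u'y/u'u) u */
        double uty = 0.0;
        for (size_t i = j; i < m; i++)
            uty += u[i] * A[i * lda + k];
        double beta = 2.0 * uty / utu;
        for (size_t i = j; i < m; i++)
            A[i * lda + k] -= beta * u[i];
    }
}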

4  A distributed memory fast-Givens implementation

A Givens rotation affects only two rows of $A$ at a time. If successive rotations were to affect only disjoint pairs of rows, then they could be done simultaneously without affecting the final result. Otherwise, a suitable ordering (schedule) needs to be specified on the execution of the individual Givens rotations that avoids common rows. More formally, suppose $T(j, k)$ is the time step at which the $(j, k)$-th entry of the matrix is annihilated (becomes 0), and $S(j, k)$ is the index of the row that is used along with row $j$ to annihilate the $(j, k)$-th entry. Any such pair of functions defined for $k < j$ satisfying the following requirements is an acceptable schedule for the parallel execution of Givens rotations:

1. Concurrent operations act on different rows. In other words, if $T(j, k) = T(j', k')$ and $(j, k) \neq (j', k')$, then $\{j, S(j, k)\} \cap \{j', S(j', k')\} = \emptyset$.

2. If $T(j, k) = t$ and $S(j, k) = i$, then $T(j, l) < t$ and $T(i, l) < t$ for all $l < k$. This ensures that all the previous columns of the two rows involved at time step $t$ have already been annihilated.

A particular schedule satisfying these two properties was given by Sameh and Kuck [10], in which

$$T(j, k) = m - j + 2k - 1, \qquad S(j, k) = j - 1.$$

A single Givens rotation can be performed in small constant time with $n$ processors. Therefore, the computation time of the overall Givens reduction with $n$ processors per rotation is ideally of the order of $m + n$. If $nm$ processors are available, then up to $m/2$ rotations can be performed simultaneously. For the limited processor case, efficient scheduling for various architectures is discussed in [8, 12]. Figure 1 gives the annihilation pattern $T(j, k)$ used in our implementation of the Givens reduction algorithm when $P$ processors are available for an example matrix.

Figure 1: Block annihilation pattern for P processors.
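As a small illustration, the C fragment below tabulates the schedule written above for one example size, listing which entries are annihilated at each time step so that requirement 1 (no two concurrent rotations share a row) can be checked by inspection. The example dimensions and the function names are our own assumptions.

#include <stdio.h>

/* The schedule given above: entry (j,k), k < j, is annihilated at
 * time step T(j,k) using rows j and S(j,k) = j-1 (1-based indices). */
static int T(int m, int j, int k) { return m - j + 2 * k - 1; }
static int S(int j, int k)        { (void)k; return j - 1; }

int main(void)
{
    const int m = 8, n = 4;                       /* small example only */
    int last = 0;
    for (int j = 2; j <= m; j++)                  /* latest time step   */
        for (int k = 1; k < j && k <= n; k++)
            if (T(m, j, k) > last)
                last = T(m, j, k);
    for (int t = 1; t <= last; t++) {
        printf("step %2d:", t);
        for (int j = 2; j <= m; j++)
            for (int k = 1; k < j && k <= n; k++)
                if (T(m, j, k) == t)
                    printf("  (%d,%d) via row %d", j, k, S(j, k));
        printf("\n");
    }
    return 0;
}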

4.1  The algorithm

For the implementation of the algorithm we create $P$ processes, each represented by a distinct identifier in $\{1, \ldots, P\}$. A block of adjacent rows is assigned to a single process, with each process $p$ keeping approximately the same number of rows of $A$, as indicated in Figure 1. Each process annihilates entries beginning with the first column of the matrix. In each column, annihilation starts with the last row in the block of rows assigned to the process. Annihilation of the entries in row $a_i$ is performed using row $a_{i-1}$. Note that in order to annihilate an entry in the first row of its block, process $p$ has to access the last row of process $p-1$. In addition, process $p$ cannot begin to annihilate an entry in its last row until it has received the modified version of it from process $p+1$.

High level steps of the algorithm executed by each process $p \in \{1, \ldots, P\}$ are given in Figure 2. The function rotate operates in the same manner as the one mentioned earlier, except that it also modifies the corresponding elements of the vector $b$. In order not to clutter up the algorithm description, we assume that whenever a message is to be sent to or received from a process whose number is not in the range $\{1, \ldots, P\}$, no action is taken.

  Input: $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.
  Output: $Q^T A = R$ and $Q^T b$.
  Step 1. Process $p$ is assigned the block of rows $m(p-1)/P + 1$ (first) through $mp/P$ (last).
  Step 2. Send last row to process $p+1$.
  Step 3. for $j = 1$ to $\min(\mathrm{last}, n)$ do
    (a) Receive modified last row from $p+1$.
    (b) if $p$ has more than a single row then
        (i) rotate(last, $j$);
        (ii) Send last row to process $p+1$.
    (c) rotate($i$, $j$) for all of $p$'s interior rows with index $i > j$.
    (d) if index of first row $> j$ then
        (i) Receive last row of process $p-1$;
        (ii) rotate(first, $j$);
        (iii) Send $p-1$'s modified last row back to it;
        (iv) if $p$ has only one row then send it to process $p+1$.
    (e) Scale entries of $D$ corresponding to the block.
  Step 4. Multiply partial matrix by diagonal elements and output result.

Figure 2: Distributed memory fast-Givens.

The main calculation occurs in Step 3. For each value of $j$, the first row, the interior rows (whose indices are between first and last), and the last row are annihilated in column $j$. When (b) is executed, the process already has its last row from the next process, received in (a). When $p$ modifies its first row in (d)(ii), it has already received the last row of process $p-1$ in step (d)(i). Step (c) ensures that all subdiagonal elements, and only those, are annihilated. Since $j$ varies over all the required columns, the reduced matrix is upper triangular. In Step (e) the diagonal elements of $D$ corresponding to the rows assigned to $p$ are scaled.

In Step 2 each process sends its last row to the next process. In particular, process $P$ will have the last row of process $P-1$ and will be able to execute Step 3. Note that it does not need to execute Step 3(a). $P$ returns $P-1$'s last row in (d)(iii). Thus $P-1$ in turn can complete its execution for this value of $j$. In general, a process needs the last row of $p-1$ after annihilating column $j$, and its own last row back after process $p+1$ has used it in order to annihilate column $j$. Step 3(b)(ii) ensures the former and step 3(d)(iii) the latter. Finally, Step 4 is evident.

If $T_s$ is the fixed startup cost of sending and receiving a message relative to a multiplication, and $T_w$ is the relative cost of sending a floating point number across the network, then the time complexity of the algorithm is no worse than $mn^2/P + T_s P n + T_w P n^2$ for $m/P \geq n$.
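The overall structure of Figure 2 can be summarized in C roughly as follows. This is a structural sketch only: rotate_rows, send_row and recv_row are hypothetical placeholders (the actual implementation uses PVM for message passing and carries the vector $b$ and the diagonal $D$ through the rotations), and boundary details of Figure 2 are simplified.

#include <stddef.h>

/* Hypothetical helpers; updates of b and of D are assumed to happen
 * inside rotate_rows(), which annihilates target[j] using pivot.     */
void send_row(int dest, const double *row, size_t n);
void recv_row(int src, double *row, size_t n);
void rotate_rows(double *pivot, double *target, size_t j, size_t n);

/* Process p of P owns rows first..last (1-based); row[i] points to
 * local row first+i, and nbr is a buffer for the last row of p-1.    */
void fast_givens_process(int p, int P, size_t first, size_t last,
                         size_t n, double **row, double *nbr)
{
    size_t rows = last - first + 1;
    if (p < P)                                      /* Step 2          */
        send_row(p + 1, row[rows - 1], n);
    for (size_t j = 1; j <= n && j <= last; j++) {  /* Step 3          */
        if (p < P)                                  /* (a)             */
            recv_row(p + 1, row[rows - 1], n);
        if (rows > 1) {                             /* (b)             */
            if (last > j)                           /* skip diagonal   */
                rotate_rows(row[rows - 2], row[rows - 1], j, n);
            if (p < P)
                send_row(p + 1, row[rows - 1], n);
        }
        if (rows > 2)                               /* (c) interior    */
            for (size_t i = rows - 2; i >= 1; i--)
                if (first + i > j)
                    rotate_rows(row[i - 1], row[i], j, n);
        if (first > j) {                            /* (d) first row   */
            recv_row(p - 1, nbr, n);
            rotate_rows(nbr, row[0], j, n);
            send_row(p - 1, nbr, n);
            if (rows == 1 && p < P)
                send_row(p + 1, row[0], n);
        }
        /* (e) periodic rescaling of the local entries of D            */
    }
    /* Step 4: multiply the local rows by their D entries and output.  */
}

With blocking receives the processes effectively form a pipeline led by process $P$, so a process can work on its interior rows while its boundary rows are in transit.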

5  Comparison with Householder Reductions

A block-partitioned QR factorization algorithm using Householder reflectors is implemented in LAPACK [1, 2], which is a software library for performing dense and banded linear algebra computations on vector machines and shared memory computers. LAPACK makes use of block-partitioned algorithms to utilize fast matrix-vector and matrix-matrix operations on data that reside in the higher levels of hierarchical memories. An extension of LAPACK to distributed memory concurrent computers is the ScaLAPACK library [4, 5]. ScaLAPACK uses block-cyclic data distribution as its primary data decomposition method and uses Householder transformations for QR factorization. The characteristics and the performance of various ScaLAPACK factorization routines, including QR factorization, on the Intel family of parallel computers are reported in [5].

5.1  Test platform

The parallel fast-Givens algorithm was compared with the QR factorization routine of ScaLAPACK. PVM [3] was used for message passing in the fast-Givens algorithm; the ScaLAPACK algorithm was tested using the PVM version of the BLACS for message passing. The tests were conducted on a cluster of Sun SPARCstation LX workstations connected over an Ethernet LAN, for matrices of three different sizes, using matrices from [13] and the parallel double precision matrix generator of ScaLAPACK.

ScaLAPACK allows the user to choose the processor grid structure and the block size. Due to the high communication cost involved in the Ethernet, it was found that having one column of processors gave the best results for a fixed number of processors; therefore results are reported only for this configuration. It was found that bigger block sizes are better for a PVM implementation, and in accordance with this the block size was taken to be as high as the number of columns.

The measure of time was taken to be the wall clock time for the actual calculations, ignoring the time taken for generating the test matrix and for the output of the factors. At least four experiments were performed for each value of the parameter set, and the average of these is reported in Figures 3, 4 and 5.

In our initial implementation of the distributed fast-Givens transformations we performed the scaling of $D$ using floating point arithmetic. The results are reported as Givens-1 in Figures 3, 4 and 5. Since division and multiplication by 2 are integer operations on the exponent, in another version of the implementation we used the C library function ldexp for scaling. The resulting faster algorithm is reported as Givens-2.
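The exponent manipulation behind the Givens-2 variant can be sketched as follows. This is our own illustration of the idea, not the code used in the experiments; the rescaling threshold and the names are assumptions.

#include <math.h>
#include <stddef.h>

/* Rescale one row of the factored form D*A by a power of two so that
 * its multiplier d stays in a moderate range; ldexp(x, e) computes
 * x * 2^e by adjusting only the exponent field of x.                 */
void rescale_row(double *row, size_t n, double *d)
{
    int e;
    frexp(*d, &e);                 /* *d = f * 2^e with 0.5 <= |f| < 1 */
    if (e > 32 || e < -32) {       /* arbitrary "far from 1" threshold */
        for (size_t k = 0; k < n; k++)
            row[k] = ldexp(row[k], e);   /* fold 2^e into the row      */
        *d = ldexp(*d, -e);              /* ...and take it out of d    */
    }
}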

5.2  Test results

The sequential version of the QR factorization using Householder reflections of ScaLAPACK was much faster than the fast-Givens algorithm on a single processor. This could be partly due to the innate nature of the algorithm, and also possibly to a difference in the quality of compiler optimization, since the fast-Givens algorithm was written in C whereas the Householder algorithm was in FORTRAN. It was found that for the smallest matrix the sequential Householder algorithm was much faster than its parallel version, probably due to the high communication overhead. Similarly, for the parallel fast-Givens algorithm, while there was a slight speedup with increasing $P$, it was not significant for the smallest matrix (Figure 3). For the medium size matrix, moderate speedup was obtained using the Householder algorithm for a small number of processors; substantial speedup was obtained using the Givens algorithm across the whole range of processors tested. These results are shown in Figure 4. For the largest case, shown in Figure 5, the Householder algorithm outperformed the fast-Givens implementation only for a small number of processors; beyond that, fast-Givens performed better.

The performance of the algorithms can be explained if we observe that both algorithms have a comparable number of floating point operations, and the number of messages sent is also about the same. However, much of the communication cost can be hidden in the fast-Givens algorithm, since computations can be performed while communication is taking place. In fact, the observed communication cost is approximately $n + P$ messages, as opposed to $Pn$. It does not seem possible to hide the communication cost in the Householder algorithm in this way. This may account for its poor speedup when the communication network used is slow.

6  Conclusions and future work

Givens and Householder transformations were implemented on a cluster of workstations using PVM for the solution of LLS problems. Some of the experimental results are conclusive: the parallel algorithm for fast-Givens transformations can significantly reduce parallel time. It appears that the Givens transformation is better suited than the Householder transformation to distributed memory machines with significant communication costs, at least when dealing with large matrices.

We intend to generalize the implementation of the parallel fast-Givens algorithm so that a row of processors is used to perform the updates of the rows in parallel, by using a rectangular array of processors. We can also adapt the block-cyclic data distribution scheme of ScaLAPACK to the Givens case. In place of a cluster of workstations, we plan to conduct our next tests on a parallel computer (a Meiko CS-2), which has a fast interconnection network. In this setup the communication costs incurred should be much lower than the Ethernet delays that we have observed.

References

[1] Anderson, E., Z. Bai, C. Bischof, J. Demmel, J.J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen, LAPACK: A Portable Linear Algebra Library for High-Performance Computers, in Proc. Supercomputing, IEEE Press.

[2] Anderson, E., Z. Bai, J. Demmel, J.J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov and D. Sorensen, LAPACK Users' Guide, SIAM Press, Philadelphia, PA.

[3] Beguelin, A., J.J. Dongarra, G.A. Geist, R. Manchek and V.S. Sunderam, A Users' Guide to PVM Parallel Virtual Machine, Oak Ridge National Laboratory, ORNL/TM.

[4] Choi, J., J.J. Dongarra, R. Pozo and D.W. Walker, A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers, in Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, McLean, Virginia, IEEE Computer Society Press, Los Alamitos, CA.

[5] Choi, J., J.J. Dongarra, S. Ostrouchov, A.P. Petitet, D.W. Walker and R.C. Whaley, The Design and Implementation of the ScaLAPACK LU, QR and Cholesky Factorization Routines, Oak Ridge National Laboratory, TM.

[6] Golub, G.H. and J.M. Ortega, Scientific Computing: An Introduction with Parallel Computing, Academic Press, Inc., San Diego.

[7] Golub, G.H. and C.F. Van Loan, Matrix Computations, 2nd Edition, The Johns Hopkins University Press, Baltimore.

[8] Lord, R., J. Kowalik and S. Kumar, Solving Linear Algebraic Equations on a MIMD Computer, Proc. Int. Conf. Par. Proc.

[9] Parlett, B.N., The Symmetric Eigenvalue Problem, Prentice-Hall, Englewood Cliffs, NJ.

[10] Sameh, A. and D. Kuck, On Stable Parallel Linear System Solvers, J. ACM.

[11] Wilkinson, J.H., The Algebraic Eigenvalue Problem, Oxford University Press, Oxford.

[12] Ortega, J.M. and R.G. Voigt, Solution of Partial Differential Equations on Vector and Parallel Computers, SIAM, Philadelphia.

[13] Gregory, R.T. and D.L. Karney, A Collection of Matrices for Testing Computational Algorithms, Wiley-Interscience, New York.

Figure 3: Comparison of Givens and Householder reductions for LLS (smallest test matrix).

Figure 4: Comparison of Givens and Householder reductions for LLS (medium test matrix).

Figure 5: Comparison of Givens and Householder reductions for LLS (largest test matrix).