Sparse Givens QR Factorization on a Multiprocessor J. Tourino R. Doallo E.L. Zapata

May 1996 Technical Report No: UMA-DAC-96/08

Published in: 2nd Int’l. Conf. on Massively Parallel Computing Systems Ischia, Italy, May 1996, pp. 551-556

University of Malaga Department of Computer Architecture C. Tecnologico • PO Box 4114 • E-29080 Malaga • Spain

Sparse Givens QR Factorization on a Multiprocessor

Juan Tourino, Ramon Doallo (Dept. Electronica y Sistemas, University of La Coruna, Campus de Elvina s/n, La Coruna, Spain)
Emilio L. Zapata (Dept. Arquitectura de Computadores, University of Malaga, Campus de Teatinos, Malaga, Spain)
{juan,doallo}@udc.es, ezapata@ac.uma.es

Abstract

We present a parallel algorithm for the QR factorization with column pivoting of a sparse matrix by means of Givens rotations. Nonzero elements of the matrix M to be decomposed are stored in a one-dimensional doubly linked list data structure. We will discuss a strategy to reduce fill-in in order to gain memory savings and decrease the computation times. As an application of QR factorization we will describe the least squares problem. This algorithm has been designed for a message-passing multiprocessor, and we have evaluated it on the Cray T3D supercomputer using sparse matrices from the Harwell-Boeing collection.

* This work was supported by the Spanish CICYT under contract TIC..., and by the TRACS Programme under the Human Capital and Mobility Programme of the European Union, grant number ERBCHGECT...

1 Introduction

QR factorization is a direct method in matrix algebra which involves the decomposition of a matrix M of dimensions A x B (A >= B) into the product of an orthogonal matrix Q (Q^T = Q^-1) and an upper triangular matrix R. QR factorization has many applications: to solve linear systems of equations, least squares problems, linear programs, eigenvalue problems, coordinate transformations, projections and optimization problems. It is necessary to solve these problems in many scientific fields, such as fluid dynamics, molecular chemistry and aeronautic simulation.

This factorization can be computed in several possible ways [..., Chapter ...]: using plane rotations (the Givens method), the Modified Gram-Schmidt procedure or Householder reflections. Since these sequential algorithms have a high arithmetic complexity, the development of parallel algorithms is of considerable interest. Several parallel orthogonal factorization algorithms have been designed for various machines. We cite just a few: [...] for the Intel iPSC, [...] for the nCUBE, [...] for a network of transputers, [...] for the nCUBE and [...] for the CM, all of them for dense matrices; and [...] for the CM, the Fujitsu AP and the Cray T3D for sparse matrices.

We have implemented the Givens method with column pivoting for sparse matrices on the Cray T3D, a MIMD distributed-memory computer. Although a sparse problem could be treated with a parallel program for dense factorization, the storage and time cost of ignoring sparsity would cancel the benefit of parallel processing.

This paper is organized as follows. In §2 we describe the sequential and parallel Givens algorithm as well as a strategy to reduce fill-in. The least squares problem, an application of QR factorization, is shown in §3, and experimental results are discussed in §4.

2 QR through Givens rotations

We obtain matrices Q and R, of dimensions A x A and B x B respectively, using Givens rotations. We do not calculate matrix Q because it is not explicitly necessary in order to solve the least squares problem described in §3. Matrix M is overwritten by matrix R (in-place algorithm). We present the sequential algorithm with column pivoting in order to consider those cases in which the rank of matrix M is not maximum. The use of orthogonal transformations is numerically stable, and in practice the rank of the matrix can be determined accurately when column permutations are performed during factorization.
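The elementary operation used throughout this section, a single Givens rotation chosen to zero one element of a row pair, can be illustrated with a small self-contained Python sketch (for intuition only; the paper's implementation is C code over linked lists):

```python
import math

def givens(a, b):
    """Return (c, s) = (cos theta, sin theta) so that the rotation
    [[c, s], [-s, c]] applied to the pair (a, b) zeroes b."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0              # nothing to zero
    return a / r, b / r

def rotate_rows(row_i, row_k, c, s):
    """Apply the 2x2 rotation to two equal-length rows, in place."""
    for j in range(len(row_i)):
        t = c * row_i[j] + s * row_k[j]
        row_k[j] = -s * row_i[j] + c * row_k[j]
        row_i[j] = t

row_i = [3.0, 1.0]
row_k = [4.0, 2.0]
c, s = givens(row_i[0], row_k[0])    # c = 3/5, s = 4/5
rotate_rows(row_i, row_k, c, s)
print(round(row_i[0], 9), round(abs(row_k[0]), 9))  # 5.0 0.0
```

No inverse trigonometric functions are needed: c and s come directly from the two leading elements, which is the property the paper exploits.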

rank = B
for (j = 0; j < B; j++)
    norm2_j = sum_{i=0}^{A-1} (m_{ij})^2                              (1)
for (cx = 0; cx < B; cx++) {
    obtain px (cx <= px < B) such that
        norm2_px = max_{cx <= j < B} norm2_j                          (2)
    if (norm2_px < eps) {
        rank = cx
        cx = B                     /* factorization ends */
    } else {
        swap norm2_cx and column cx of M with
             norm2_px and column px of M
        for (i = A-1; i > cx; i--)
            apply Givens rotation to subrows i-1 and i,
                from column cx to column B-1                          (3)
        for (j = cx+1; j < B; j++)
            norm2_j = norm2_j - (m_{cx,j})^2                          (4)
    }
}

In (1) the squares of the euclidean norms of the columns of matrix M are calculated and stored in vector norm2. Then a procedure of B iterations (if the rank of M is maximum) is performed. It consists of the following actions: the pivot column px and the pivot element norm2_px are selected in (2). The pivot element is the maximum of the norms of the columns whose index is at least cx. If the pivot is close to 0 (eps being the required precision), the rank of the matrix is given by the value cx and the factorization ends. Otherwise, a swap of column cx with the pivot column of matrix M, as well as a swap of their norms, is performed.

Givens rotations are applied in (3) to zero the subdiagonal elements of column cx. A Givens rotation involving two rows i-1 and i of matrix M consists of calculating the following product:

    (  cos θ  sin θ ) ( m_{i-1,cx}  m_{i-1,cx+1} ... m_{i-1,B-1} )
    ( -sin θ  cos θ ) ( m_{i,cx}    m_{i,cx+1}   ... m_{i,B-1}   )

        = ( m'_{i-1,cx}  m'_{i-1,cx+1} ... m'_{i-1,B-1} )
          ( 0            m'_{i,cx+1}   ... m'_{i,B-1}   )                  (3)

where

    cos θ = m_{i-1,cx} / sqrt((m_{i-1,cx})^2 + (m_{i,cx})^2)
    sin θ = m_{i,cx}   / sqrt((m_{i-1,cx})^2 + (m_{i,cx})^2)

Givens rotations are clearly orthogonal, and it is not necessary to perform inverse trigonometric functions. Figure 1 shows the sequential QR factorization by means of Givens rotations: it can be observed how zeros are introduced in matrix M to achieve the upper triangular matrix R. Finally, the norms of the updated columns are calculated in (4).

Figure 1. Sequential Givens rotations. (The panels show the sparsity pattern of M after successive steps cx until the upper triangular matrix R is reached; the counts B[2A-B-1]/4 and B[2A-B-1]/2 refer to the rotations involved.)

Once the algorithm has ended, what we really get is a factorization M Π = Q R, where Π is a permutation matrix (B x B) made up by the product of rank elementary permutations:

    Π = Π_0 Π_1 ... Π_{rank-1}

each Π_j (0 <= j < rank) being the identity matrix or a matrix resulting from swapping two of its columns. This is due to the pivoting carried out in (2).

2.1 Fill-in control

When working with sparse matrices an additional problem arises, which is the fill-in. The factors that influence the fill-in of a sparse matrix are the following: the dimensions and rank of the matrix, the sparsity degree (number of null elements) and another factor of great importance but difficult to model, which is the matrix pattern, or the location of the nonzero elements. Thus fill-in may vary significantly for two matrices with the same dimensions, rank and sparsity degree, depending on how nonzero elements are placed.

A high fill-in is an undesirable situation due to the increase in the storage cost and computation time. It would be of interest to implement a simple method to reduce fill-in. The most common heuristic strategy employed in LU factorization to maintain the sparsity degree is the Markowitz criterion [..., Chapter ...]. Moreover, numerical stability must be ensured in the LU factorization by avoiding the selection of pivots with a low absolute value.
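The sequential Givens QR with column pivoting described above can be sketched in Python on a dense array (an illustration only: the paper stores M in one-dimensional linked lists and implements the algorithm in C; `eps` plays the role of the precision threshold):

```python
import math

def givens_qr_pivot(M, eps=1e-12):
    """In-place Givens QR with column pivoting on a dense A x B array
    (A >= B).  On return the upper triangle of M holds R; perm records
    the pivot column chosen at each step.  Returns (rank, perm)."""
    A, B = len(M), len(M[0])
    norm2 = [sum(M[i][j] ** 2 for i in range(A)) for j in range(B)]   # (1)
    perm, rank = [], B
    for cx in range(B):
        px = max(range(cx, B), key=lambda j: norm2[j])                # (2)
        if norm2[px] < eps:                 # remaining columns ~ 0
            rank = cx
            break
        perm.append(px)
        if px != cx:                        # column pivoting
            norm2[cx], norm2[px] = norm2[px], norm2[cx]
            for i in range(A):
                M[i][cx], M[i][px] = M[i][px], M[i][cx]
        for i in range(A - 1, cx, -1):      # (3): zero subdiagonal of col cx
            a, b = M[i - 1][cx], M[i][cx]
            if b == 0.0:
                continue                    # zero rows need no rotation
            r = math.hypot(a, b)
            c, s = a / r, b / r
            for j in range(cx, B):
                t = c * M[i - 1][j] + s * M[i][j]
                M[i][j] = -s * M[i - 1][j] + c * M[i][j]
                M[i - 1][j] = t
        for j in range(cx + 1, B):          # (4): downdate column norms
            norm2[j] -= M[cx][j] ** 2
    return rank, perm

M = [[2.0, 1.0], [1.0, 3.0], [2.0, 0.0]]
rank, perm = givens_qr_pivot(M)
print(rank)        # 2
```

The `b == 0.0` skip mirrors the paper's observation that rows whose leading element is zero need not be rotated.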

Matrix     Origin                      A x B    Elem(M)   Elem%(M)
JPWH991    Circuit physics modelling   ...      ...       ...
BCSSTK14   Structural engineering      ...      ...       ...
SHERMAN5   Oil reservoir modelling     ...      ...       ...

Table 1. Harwell-Boeing sparse matrices

Matrix     Elem(R)   Elem%(R)   Elem(R)   Elem%(R)   Reduc
JPWH991    ...       ...        ...       ...        ...
BCSSTK14   ...       ...        ...       ...        ...
SHERMAN5   ...       ...        ...       ...        ...

Table 2. Fill-in reduction
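The reductions reported in table 2 come from biasing the pivot choice toward sparser subcolumns. A minimal Python sketch of one such weighted selection rule follows (an illustration only: the exact weighting of expression (5) may differ, and the names `alpha`, `zero` and the normalization are assumptions):

```python
def select_pivot(M, cx, norm2, alpha=0.5, eps=1e-12):
    """Pick a pivot column px in [cx, B): weight the sparsity of the
    subcolumn (rows cx..A-1) against the squared column norm.
    alpha = 0 reduces to the pure max-norm rule; columns with norm
    below eps are discarded for numerical stability."""
    A, B = len(M), len(M[0])
    zero = [sum(1 for i in range(cx, A) if M[i][j] == 0.0) for j in range(B)]
    max_zero = max(zero[cx:]) or 1          # avoid division by zero
    max_norm = max(norm2[cx:]) or 1.0
    best, px = -1.0, -1
    for j in range(cx, B):
        if norm2[j] < eps:
            continue
        score = alpha * zero[j] / max_zero + (1 - alpha) * norm2[j] / max_norm
        if score > best:
            best, px = score, j
    return px                               # -1 if every norm is ~ 0

M = [[1.0, 0.0, 2.0],
     [0.0, 0.0, 1.0],
     [3.0, 1.0, 0.0]]
norm2 = [sum(M[i][j] ** 2 for i in range(3)) for j in range(3)]
print(select_pivot(M, 0, norm2, alpha=1.0))   # 1: the sparsest usable column
print(select_pivot(M, 0, norm2, alpha=0.0))   # 0: the largest-norm column
```

Note how the stability filter rejects the middle column when its norm is negligible, no matter how sparse it is.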

A row ordering strategy for Givens rotations, based on pairing rows to minimize fill-in, is presented in [...].

We have implemented a method to reduce fill-in in the QR factorization by taking advantage of column pivoting. Instead of expression (2), we use a new criterion to select the pivot column:

    obtain px (cx <= px < B) such that norm2_px >= eps and
        alpha * zero_px / max_{cx <= j < B} zero_j
          + (1 - alpha) * norm2_px / max_{cx <= j < B} norm2_j   is maximum   (5)

where zero_j is the number of zero elements in subcolumn j (from row cx to row A-1) and alpha (0 <= alpha <= 1) is a prefixed parameter. The first term refers to fill-in reduction, while the second one relates to numerical stability. Obviously, for alpha = 0 the pivot column selection criterion is equivalent to the one described in (2). The strategy is to choose a pivot column with many zeroes in the subcolumn from row cx to A-1, in order to perform fewer rotations. Although we try to reduce fill-in as much as possible, we shall always keep a minimum degree of numerical stability in the algorithm by discarding as pivot columns those with norm close to zero.

With the aim of testing this strategy we have chosen three matrices from the Harwell-Boeing sparse matrix collection [...]. A description of these matrices is presented in table 1, where A x B are the dimensions of the matrix, Elem(M) is the number of nonzero elements of M and Elem%(M) is the percentage of these elements. Table 2 shows the fill-in in matrix R after the factorization: Elem(R) and Elem%(R) are the number and percentage of nonzero elements for the two pivot selection criteria of expression (5), and Reduc is the percentage of reduction in the number of nonzero elements obtained with the fill-in control criterion. As we can see, fill-in has decreased on the average for this set of sparse matrices, which is a substantial reduction.

2.2 Parallel Givens Rotations

The parallel algorithm developed has been generalized for any number of processing elements (PEs) and any dimension of matrix M. We find parallel Givens algorithms in [...].

Matrix M is distributed onto a mesh with m x n PEs. Each PE is identified by coordinates (idx, idy), with 0 <= idx < n and 0 <= idy < m. Nonzero elements of M are mapped over the PEs using a Block Column Scatter (BCS) scheme [...], but these elements are stored in doubly linked lists instead of vectors. This distribution provides data and load balancing. The algorithm requires access both by rows and by columns; a data structure such as a two-dimensional doubly linked list, used in [...] for a LU factorization, would be suitable. But this structure is costly to manage and a great amount of memory is required. Therefore we use one-dimensional doubly linked lists: each one represents one column of the matrix, and each item of the list stores the row index, the matrix element and two pointers. These lists are arranged in growing order of the row index and provide efficient access by columns. Row access is achieved using an auxiliary pointer vector with as many components as columns in the matrix: we go with this pointer vector through the linked lists corresponding to the columns of the matrix, from bottom to top, to get row access.

Let us consider the sequential algorithm to see how it can be executed in parallel. First, each PE obtains the local norms corresponding to the column segments (local lists) it contains. By means of a reduction instruction (sum by columns), the vector norm2 of each column of PEs will contain the norms of the corresponding global columns. Then the local maximum norm of each PE is obtained; this value is the same for each column of PEs. The global maximum is obtained by means of a reduction instruction which finds the maximum norm by rows of PEs. As a result, the pivot element, as well as px, the index of the pivot column, will be contained in all the PEs. The parallelization of the strategy (5) to reduce fill-in requires more communications than (2), but it is not very costly from the computational point of view. In order to perform the pivoting described in (2), if columns cx and px are located in different PEs, we use a packed vector [..., Chapter ...] that acts as a buffer for exchanging data. As figure 2 shows, the information of the lists is placed in consecutive memory positions; the corresponding column of M and the square of its norm are sent in a single message.

Figure 2. Buffer for sending local lists. (A doubly linked column list, with its header and element count, is packed into consecutive (row index, value) pairs together with its squared norm norm2.)

If the rotations are parallelized according to the sequential algorithm, neither the outer loop (which goes through the columns of the matrix) nor the inner loop (which annihilates each element of the column) could be executed in parallel, due to data dependencies; thus the parallel algorithm would need B[2A-B-1]/2 iterations. Nevertheless, a Givens rotation can be applied to any two rows, not necessarily consecutive; therefore each row of PEs can independently and in parallel apply the rotations to the M rows which they store, and thus reduce the execution times. Figure 3(a) illustrates how this is done for a matrix distributed on a 2 x 2 mesh. In this example the four PEs apply Givens rotations in parallel to the first column of the matrix, but the first row of each PE is not rotated. This is solved as shown in figure 3(b): once all the rows are rotated, row cx is updated (vii), and the final result is shown in figure 3(c). We have chosen this approach because it is easy to generalize for any dimension of the mesh.

Rows are rotated to zero the corresponding elements. Clearly, it is not necessary to rotate those rows whose first element is zero. As our matrix is sparse and null elements are not stored in the lists, we go through the elements of list (column) cx and rotate the rows of those elements; fill-in may appear at this stage.

Figure 3. Parallel Givens Rotations. (Panels (a)-(c) show the sparsity patterns on the 2 x 2 mesh; steps (i)-(vii) trace row cx being passed down a column of PEs.)

In figure 3(b), the PE containing the row cx (in this example, row 0) sends it to the southern PE (i); then a Givens rotation is applied to row cx and to the non-rotated row of this PE, in order to zero the corresponding element (ii). The new row cx is sent again to the southern PE (iii), and we proceed in the same way as in (ii).

3 The least squares problem

The least squares problem calculates a vector x of length B that minimizes ||Mx - z||_2, where z is a vector of length A. If the rank of M is maximum (B), the least squares problem has one unique solution, x_LS. Otherwise it has an infinite number of solutions x_SOL, one of which has a minimum norm and which we will also denote as x_LS (x_LS is the x_SOL such that ||x_SOL||_2 is minimum). If A = B, the least squares problem is equivalent to solving a linear equation system Mx = z, since min ||Mx - z||_2 = 0.

This problem can be solved by adapting the parallel algorithm that carries out the Givens QR factorization of matrix M. In particular, the least squares problem is equivalent to solving the upper triangular system R x = Q^T z. This approach is adequate due to the good numerical stability of the QR factorization. If rank(M) = B, this algorithm calculates the one unique solution to the least squares problem. If rank(M) < B, only one of the infinite solutions is obtained, the one called basic solution, which has a maximum of rank nonzero elements and which in general will not coincide with the minimum norm solution x_LS.

3.1 Calculation of Q^T z

Product Q^T z is obtained at the same time that the QR factorization is performed (we assume vector z is dense). First, vector z is stored in a vector called qtz, which is distributed in each column of PEs so that the global component I of qtz is replicated in the row of PEs with idy = I mod m. Then Givens rotations are applied to the corresponding elements of qtz at the same time we apply the rotations to the rows of matrix M. For instance, if rows i-1 and i are being processed, this product is calculated (it is similar to (3)):

    (  cos θ  sin θ ) ( qtz_{i-1} )   ( qtz'_{i-1} )
    ( -sin θ  cos θ ) ( qtz_i     ) = ( qtz'_i     )

Once all the rotations of the factorization have ended, vector qtz stores the product Q^T z from index 0 to the global index B-1. As we can see, matrix Q is not necessary.

3.2 Backsubstitution and permutation

The upper triangular system R x = Q^T z is solved by means of a backsubstitution. The corresponding sequential algorithm is as follows:

    for (i = rank-1; i >= 0; i--)
        x_i = ( qtz_i - sum_{j=i+1}^{rank-1} r_{ij} x_j ) / r_{ii}

This loop has data dependencies, and thus it must be maintained in the parallel code without any possibility of being distributed among the PEs. In addition, it is necessary to access the elements of matrix R by rows (matrix R is stored by columns). This is solved by using an auxiliary pointer vector, as we saw in §2.2. Another option is to apply the column version of backsubstitution [..., Chapter ...]. Once the backsubstitution is carried out, we get the solution vector x of length B distributed in each row of PEs, so that the global component J of vector x is replicated in the column of PEs with idx = J mod n.

Due to the column pivoting carried out in the QR factorization, permutation Π must be applied to the components of vector x, so that x is overwritten with vector Πx. All the PEs contain a vector called permut, of length B. It is the only vector whose components are not distributed among the PEs. This vector stores the index of the column swapped in each iteration (px), and by applying these swaps starting from the end, the elements of vector x are obtained in the correct order.

4 Experimental results and conclusions

The algorithm has been implemented on a Cray T3D supercomputer, with a Cray Y-MP host and DEC Alpha processors connected by a three-dimensional torus topology, using C language and PVM routines for message passing. The parallel code is SPMD (Single Program Multiple Data). We have used low latency communication functions such as pvm_fastsend and pvm_fastrecv (non-standard PVM functions for short messages). We have also developed reduction instructions suitable for our algorithm, to reduce communications.

We have tested the performance of our parallel algorithm using the Harwell-Boeing matrices described in table 1. Table 3 shows the execution times in seconds for different numbers of PEs, both without taking into account our strategy to reduce fill-in and applying the fill-in control approach (5). These times include the QR factorization as well as the resolution of the least squares problem. The time required for data distribution and collection of results is not included, because we assume that this program is a possible subproblem within a wider program.

Execution times are substantially reduced with the fill-in control approach, because fewer nonzero elements appear and therefore computation savings are achieved; this holds for matrices JPWH991, BCSSTK14 and SHERMAN5 alike. These times decrease even further as the number of PEs increases.

Matrix     Execution times (s), 1 to 128 PEs, for the two pivot selection criteria
JPWH991    ...
BCSSTK14   ...
SHERMAN5   ...

Table 3. Execution times (in seconds)

This is due to the fact that, when data are distributed among more PEs, the number of computations that each PE carries out is lower, whereas the number of communications tends to increase. Since many communications are required to update the non-rotated row of each PE in each step of the algorithm (see figure 3(b)), for matrix JPWH991 the running time even increases at the largest machine sizes.

Figure 4 shows the efficiencies attained for each matrix. As we can see, the algorithm scales rather well. It is clear that efficiencies will be lower with the fill-in reduction criterion, because the execution of the algorithm with fill-in reduction has lower running times, and therefore the communication term is a relatively more significant fraction of the running time. Nevertheless, better efficiencies could be achieved with larger matrices.

Figure 4. Efficiencies for matrices JPWH991, BCSSTK14 and SHERMAN5 as a function of the number of PEs (1, 4, 16, 64, 128).

References

[1] B. Hendrickson. Parallel QR Factorization using the Torus-wrap Mapping. Parallel Computing.
[2] C. Bendtsen, P.C. Hansen, K. Madsen, H.B. Nielsen and M. Pinar. Implementation of QR Up- and Downdating on a Massively Parallel Computer. Parallel Computing.
[3] C.H. Bischof. Adaptive Blocking in the QR Factorization. The Journal of Supercomputing.
[4] Cray Research Inc. Cray T3D Technical Summary, September.
[5] E.L. Zapata, J.A. Lamas, F.F. Rivera and O.G. Plata. Modified Gram-Schmidt QR Factorization on Hypercube SIMD Computers. Journal of Parallel and Distributed Computing.
[6] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, second edition.
[7] I.S. Duff, A.M. Erisman and J.K. Reid. Direct Methods for Sparse Matrices. Clarendon Press.
[8] I.S. Duff, R.G. Grimes and J.G. Lewis. Users' Guide for the Harwell-Boeing Sparse Matrix Collection. Technical Report TR/PA/..., CERFACS, October.
[9] L.C. Waring and M. Clint. Parallel Gram-Schmidt Orthogonalisation on a Network of Transputers. Parallel Computing.
[10] L.F. Romero and E.L. Zapata. Data Distributions for Sparse Matrix Vector Multiplication. Parallel Computing.
[11] R. Asenjo and E.L. Zapata. Sparse LU Factorization on the Cray T3D. In Int'l Conference on High-Performance Computing and Networking, Milan, Italy. Springer-Verlag, LNCS, May.
[12] R. Doallo, B.B. Fraguela, J. Tourino and E.L. Zapata. Parallel Sparse Modified Gram-Schmidt QR Decomposition. In Int'l Conference on High-Performance Computing and Networking (accepted), Brussels, Belgium, April.
[13] R. Doallo, J. Tourino and E.L. Zapata. Sparse Householder QR Factorization on a Mesh. In Fourth Euromicro Workshop on Parallel and Distributed Processing, Braga, Portugal. IEEE Computer Society Press, January.
[14] S.G. Kratzer. Sparse QR Factorization on a Massively Parallel Computer. The Journal of Supercomputing.
[15] T.H. Robey and D.L. Sulsky. Row Ordering for a Sparse QR Decomposition. SIAM J. Matrix Anal. Appl., October.