Solution of general linear systems of equations using block Krylov based iterative methods on distributed computing environments

Leroy Anthony Drummond Lewis

December


Solution of sparse linear systems by block iterative methods in heterogeneous distributed environments

Résumé

We study the implementation of block iterative methods in distributed-memory multiprocessor environments for the solution of general linear systems. We first examine the potential of the classical conjugate gradient method in a parallel environment. We then study another iterative method, from the same family as the conjugate gradient, designed to solve linear systems with multiple right-hand sides simultaneously, commonly known as the block conjugate gradient.

The algorithmic complexity of the block conjugate gradient method is higher than that of the classical conjugate gradient method, since it requires more computation per iteration and also more memory. Despite this, the block conjugate gradient method appears better suited to vector and parallel environments.

We have studied three variants of the block conjugate gradient, based on different parallel programming models, and their efficiency has been compared on several distributed environments.

For the solution of unsymmetric linear systems, we consider the use of row projection iterative methods accelerated by the block conjugate gradient method. The block conjugate gradient versions studied earlier are used for this acceleration. In particular, we study the implementation in distributed environments of the block Cimmino method accelerated by the block conjugate gradient method; the combination of these two techniques indeed offers good potential for parallelism.

To obtain good performance from the implementation of this last method in heterogeneous distributed environments, we have studied different strategies for distributing the tasks among the processors, and we compare two static schedulers that perform this distribution. The first aims to maintain load balance; the second aims first to reduce the communication between the different processors, while trying to balance the processor loads as well as possible.

Finally, we study preprocessing strategies for the linear systems in order to improve the performance of the iterative method based on the Cimmino method.

Mots clefs

Cimmino method, conjugate gradient, distributed computing, heterogeneous computing, iterative methods, parallel computing, preprocessing of linear systems, scheduling, symmetric sparse linear systems, sparse matrices, unsymmetric sparse linear systems.


Solution of general linear systems of equations using block Krylov based iterative methods on distributed computing environments

Abstract

We study the implementation of block Krylov based iterative methods on distributed computing environments for the solution of general linear systems of equations. First, we study the potential implementation of the classical conjugate gradient (CG) method on parallel environments. From the family of conjugate direction methods, we study the Block Conjugate Gradient (Block-CG) method, which is based on the classical CG method. The Block-CG works on a block of s right-hand side vectors instead of a single one, as is the case for the classical CG, and we study the implementation of the Block-CG on distributed environments.

The complexity of the Block-CG method is higher than the complexity of the classical CG in terms of computations and memory requirements. We show that the fact that an iteration of the Block-CG requires more computation than the classical CG makes the Block-CG more suitable for vector and parallel environments. Additionally, the increase in memory requirements is only a multiple of s, which is generally much smaller than the size of the linear system being solved. We present three models of distributed implementations of the Block-CG and discuss the advantages and disadvantages of each of these model implementations.

The classical CG and Block-CG are suitable for the solution of symmetric positive definite systems of equations. Furthermore, both methods are guaranteed, in exact arithmetic, to find a solution to positive definite systems in a finite number of iterations.

For unsymmetric linear systems, we study block-row projection iterative methods for solving general linear systems of equations, and we are particularly interested in showing that the rate of convergence of some row projection methods can be accelerated with the Block-CG method. We review two block projection methods: the block Cimmino and the block Kaczmarz methods. Afterwards, we study the implementation of an iterative procedure based on the block Cimmino method, using the Block-CG method to accelerate its rate of convergence, on distributed computing environments. The complexity of the new iterative procedure is higher than that of the Block-CG method in terms of computations and memory requirements; in this case, the main advantage is the extension of the CG based methods to general linear systems of equations. We present a parallel implementation of the block Cimmino method with Block-CG acceleration that performs well on distributed computing environments.

This last parallel implementation opens a study of potential scheduling strategies for distributing tasks to a set of computing elements. We present a scheduler for heterogeneous computing environments which is currently implemented inside the block Cimmino based solver and can be reused inside parallel implementations of several other iterative methods.

Lastly, we combine all of the above efforts for parallelizing iterative methods with preprocessing strategies to improve the numerical behavior of the block Cimmino based iterative solver.

Keywords

Cimmino method, conjugate gradient, distributed computing, heterogeneous computing, iterative methods, parallel computing, preprocessing of linear systems, scheduling, symmetric sparse linear systems, sparse matrices, unsymmetric sparse linear systems.


Acknowledgements

The author gratefully acknowledges the following people and organizations for their immeasurable support.

Iain S. Duff
For his professional support and research guidance.

Daniel Ruiz
For the time and effort he has dedicated to my thesis, and the genuine pleasure it has been working with him.

Joseph Noailles
Who was responsible for my PhD studies.

Mario Arioli
For his remarkable contributions to my work.

Craig Douglas
For all his insights and valuable comments, for proofreading my thesis during the process of preparation, and for great advice: you are supposed to have fun while writing your thesis.

J. C. Diaz
For his constant support in my career.

Dominique Bennett
For all the help settling in Toulouse and CERFACS, which is an indirect but substantial support for four years of work in France.

CERFACS
For providing the scientific resources and a pleasant working environment. To all my colleagues, and especially to Osni Marques for helping me plot and find eigenvalues.

ENSEEIHT-IRIT
The APO team, for hosting me in their group and sharing with me their research experiences and great resources. Brigitte Sor and Frédéric Noailles for their outstanding computing support.

MCS, Argonne National Laboratory
For letting me use their parallel computing facilities.

IAN-CNR, Pavia, Italy
For hosting me in the summer.

CNUSC
For the use of the SP Thin node.

My family
For always being with me. To Sandra Drummond for her help and company. My friends, for all that I have found, learned and shared in our friendship.

Contents

Introduction

Block conjugate gradient
Conjugate gradient algorithm
A stable Block-CG algorithm
Performance of Block-CG vs. Classical CG
Stopping criteria

Sequential Block-CG experiments
Solving the LANPRO NOS2 problem
Solving again the LANPRO NOS2 problem
Solving the LANPRO NOS3 problem
Solving the SHERMAN problem
Remarks

Parallel Block-CG implementation
Partitioning and scheduling strategies
Partitioning strategy
Scheduling strategy
Implementations of Block-CG
All-to-All implementation
Master-Slave distributed Block-CG implementation
Master-Slave centralized Block-CG implementation
Parallel Block-CG experiments
Parallel solution of the SHERMAN problem
Parallel solution of the LANPRO NOS problem
Parallel solution of the LANPRO NOS problem
Fixing the number of equivalent iterations
Comparing different computing platforms
Parallel solution of the POISSON problem
Remarks

Solving general systems using Block-CG
The block Cimmino method
The block Kaczmarz method
Block Cimmino accelerated by Block-CG
Computer implementation issues
Parallel implementation

A scheduler for heterogeneous environments
Taxonomy of scheduling techniques
A static scheduler for block iterative methods
Heterogeneous environment specification
A scheduler for a parallel block Cimmino solver
Scheduler experiments
Scheduling subsystems with a nearly uniform workload
Scheduling subsystems with a non-uniform workload
Remarks

Block Cimmino experiments
Solving the GRENOBLE problem
Solving the FRCOL problem
Remarks

Partitioning strategies
Ill-conditioning in and across blocks
Two-block partitioning
Preprocessing Strategies
Partitioning Experiments
Solving the SHERMAN problem
Solving the GRENOBLE problem
Remarks

Conclusions

Chapter 1

Introduction

The class of iterative methods comprises a wide range of procedures for solving various types of linear systems. Iterative procedures use successive approximations to the solution of a linear system of equations. A few references to general overviews of iterative methods are Hageman and Young, Barrett, Berry, Chan et al., van der Vorst, and Stoer and Bulirsch.

The class of iterative methods is subdivided into stationary and nonstationary iterative methods. The former methods are easier to implement and understand, but they are less efficient in terms of convergence than the latter ones. An iterative method is stationary if it can be expressed in the form

    x^{(j+1)} = B x^{(j)} + c,

where B and c are independent of the iteration count j. Examples of these methods are the Jacobi method, the Gauss-Seidel method, the Successive Over-Relaxation method (SOR), the Symmetric SOR method (SSOR), the Cimmino method, and the Kaczmarz method.
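As an illustration of the stationary form above, the sketch below implements one such method (Jacobi) in Python with NumPy; the small test matrix and the tolerance are hypothetical placeholders and do not come from the thesis.

import numpy as np

def jacobi(A, b, x0=None, max_iter=200, tol=1e-10):
    """Stationary iteration x^(j+1) = B x^(j) + c with B = I - D^{-1} A, c = D^{-1} b."""
    n = A.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    D = np.diag(A)                      # diagonal of A
    B = np.eye(n) - A / D[:, None]      # iteration matrix, independent of j
    c = b / D                           # constant vector, independent of j
    for _ in range(max_iter):
        x_new = B @ x + c
        if np.linalg.norm(x_new - x) <= tol * np.linalg.norm(x_new):
            return x_new
        x = x_new
    return x

# small diagonally dominant example (hypothetical data)
A = np.array([[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = jacobi(A, b)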

In contrast, the computations in a nonstationary iterative method involve information that changes at each iteration. Examples of these methods are the Krylov projection methods: the classical CG, the Chebyshev iteration, the Minimal Residual method (MINRES), and the Generalized Minimal Residual method (GMRES).

In general, the Krylov projection methods find a sequence of approximations to the solution of a linear system by constructing an orthogonal basis for the Krylov subspaces. Because these Krylov subspaces are finite dimensional, an adequate approximation to the solution is found in a finite number of steps in the absence of roundoff errors. The finite termination property makes the Krylov projection methods very attractive, even when used with stationary methods to accelerate their rate of convergence.

The Block Conjugate Gradient algorithm (Block-CG) also belongs to the Krylov projection methods and is based on the classical Conjugate Gradient method (CG). The Block-CG can be used in the solution of linear systems with multiple right-hand sides: in this case, the Block-CG algorithm is used to concurrently solve linear systems of equations with s right-hand sides, instead of solving the linear systems using s applications of the classical CG algorithm.

We propose developing a parallel iterative solver for distributed environments based on a stationary method, the block Cimmino method, that uses a nonstationary method, the block conjugate gradient, to accelerate its rate of convergence. During the development of the parallel iterative solver we have built a set of modules that can be reused in other parallel iterative procedures, and we have focused on answering the following statements:

E1. Are there advantages in implementing the classical CG method and the Block-CG method on distributed computing environments, without perturbing the robustness of these methods?

E2. Can the effort of parallelizing the classical CG and the Block-CG be reused inside other iterative procedures, for instance in the parallel implementation of an iterative solver based on the block Cimmino method with Block-CG acceleration?

E3. What are the parameters that influence the parallel performance of the block Cimmino based solver?

E4. Is there a need to design a scheduler to evenly distribute the workload from iterations of the block Cimmino based solver? If the need is affirmed, can the scheduler be reused in the parallel implementation of other iterative procedures?

E5. Preprocessing a linear system can reduce the number of iterations, but is there also an influence on the parallel performance of the block Cimmino based solver?

In the next chapter we study the formulations of the classical CG method. We review the efforts of several authors who have studied the implementation of the classical CG on distributed environments. From the classical CG formulations, a stabilized Block-CG algorithm can be derived. The algorithm is described as stable because it corrects some instability problems during the Block-CG computations that are due to solution vectors converging at different rates and to ill-conditioning that may appear inside the residual R^{(j)}.

The order of complexity of the Block-CG algorithm is higher than that of the classical CG in terms of computations and memory requirements. However, in the absence of roundoff errors, Block-CG promises a faster rate of convergence than the classical CG, and the matrix-matrix computations inside a Block-CG iteration can benefit from faster numerical kernels that improve data locality and increase the Mflop rate. These are desirable features for an algorithm to be implemented on vector and parallel environments. An analysis of the potential parallelization of the classical CG and Block-CG algorithms, which answers statement E1, is presented later in the thesis.

The parallelization of the Block-CG algorithm opens further questions about which sections of the algorithm should be parallelized and which parallel programming model should be used. Three parallel implementations of the stabilized Block-CG algorithm are presented.

When parallelizing many sequential algorithms there is a tendency to identify the sections of the algorithm with heavy computations and to parallelize only these sections. In the classical CG and Block-CG, the heavy computational sections have been identified by other authors as the matrix-vector products A x^{(j)} in the classical CG, or the matrix-matrix products A P^{(j)} in the Block-CG. The results of our analysis of the Block-CG prove that parallelizing the A P^{(j)} product alone is not always efficient, nor sufficient; furthermore, we demonstrate this effect with one of the parallel Block-CG implementations.

In contrast, if the whole classical CG or Block-CG algorithm is parallelized, then the parallel efficiency is proportional to the ratio of computation speed to communication speed. Parallelizing the whole Block-CG algorithm leads to a decision about which programming model should be used. We propose a Master-Slave programming model for parallelizing the whole Block-CG algorithm, and also a different programming model in which the master role is removed and all the parallel processes work at the same level. Results from comparisons of the three parallel Block-CG implementations are presented in a later chapter.

The Block-CG algorithm only guarantees convergence when the system is symmetric positive definite, although with the use of a preconditioner the algorithm can be extended to solve general unsymmetric systems. Block-CG can also be used as an acceleration procedure inside another basic iterative method.

We then study the block Cimmino and block Kaczmarz methods, which are regarded as iterative row projection methods for solving general systems of equations, and we review an example of Block-CG acceleration inside the block Cimmino method.

To answer statement E2, an iterative procedure based on the block Cimmino method accelerated with the stabilized Block-CG algorithm is presented, and the parallel implementation of this procedure is discussed.

We identify the need for a task scheduler that distributes the workload among the different computing elements (CEs) in the system; we therefore study different scheduling strategies.

One of the advantages of working in heterogeneous computing environments is the ability to provide a level of computing performance proportional to the resources available in the system and their different computing capabilities. A scheduler is regarded as a strategy, or policy, to efficiently manage the use of these resources. In parallel distributed environments with homogeneous resources, the level of performance is commensurate with the number of resources present in the system.

A scheduler for parallel iterative methods in heterogeneous computing environments is presented in a later chapter; the scheduler responds to statement E4. Design issues for the scheduler are also discussed there, where we see that the scheduler can easily be modified to suit other parallel iterative procedures.

In heterogeneous computing environments, the scheduler not only considers information about the tasks to be executed in parallel, but must also consider information about the capabilities of the CEs. A syntax for specifying heterogeneous environments is defined for this purpose.

The effects of using Block-CG acceleration inside the parallel block Cimmino solver and the effects of different scheduling strategies are studied separately: first the effects of the scheduler on the performance of the parallel block Cimmino solver, and then the effects of the Block-CG acceleration on the parallel block Cimmino solver, together with an analysis of the parallel performance of the block Cimmino solver. From the results of these studies, we also identify a need for studying natural preprocessing and partitioning strategies for the linear system of equations.

Preprocessing a linear system of equations improves the performance of most linear solvers, and in some cases a more accurate approximation to the real solution is found. For instance, numerical instabilities are encountered in the solution of some linear systems, and it is necessary to scale the systems before they are solved. These instabilities occur even when the most robust computer implementations of direct or iterative solvers are used. Also, performing some permutations of the elements of the original system can substantially reduce the computing time required for the solution of large sparse linear systems, by improving the rate of convergence of some iterative methods (e.g., the SOR and Kaczmarz methods).

Preprocessing and partitioning strategies are studied in a later chapter, together with their effects on the parallel block Cimmino solver; that chapter closes with remarks that answer statement E5. Lastly, general conclusions and related future work are proposed in the final chapter.

Chapter 2

Block conjugate gradient

The Block Conjugate Gradient algorithm (Block-CG) belongs to a family of conjugate direction methods, and the study of these methods is appealing because they guarantee convergence in a finite number of steps in the absence of roundoff errors. The Block-CG is based on the classical Conjugate Gradient method (CG) for solving linear systems of equations. The Block-CG can be used in the solution of linear systems with multiple right-hand sides: in this case, the Block-CG algorithm concurrently solves the linear system of equations with s right-hand sides, instead of solving the linear systems using s applications of the classical CG algorithm. In addition, the Block-CG algorithm is expected to converge faster than the classical CG for linear systems of equations with well-separated clusters of eigenvalues. In this chapter we present some examples of this effect.

First, we study some relevant properties of the classical CG method and review some of the main issues involved in parallelizing different versions of the CG algorithm. Then we describe a stabilized Block-CG algorithm and compare it with the classical CG.

Conjugate gradient algorithm

Given the linear system of equations

    A x = b,

where A is an n x n symmetric matrix, we write the following Richardson iteration after splitting the matrix A as A = I - (I - A):

    (I - (I - A)) x = b,
    x^{(j+1)} = b + (I - A) x^{(j)},
    x^{(j+1)} = r^{(j)} + x^{(j)}.

If we multiply the last relation by -A and add b to both sides, we get

    b - A x^{(j+1)} = -A r^{(j)} + b - A x^{(j)},
    r^{(j+1)} = r^{(j)} - A r^{(j)}
              = (I - A) r^{(j)}
              = (I - A)^{j+1} r^{(0)}
              = P_{j+1}(A) r^{(0)},

where P_{j+1} is a polynomial of degree j+1 with P_{j+1}(0) = 1.

Using this polynomial, we can rewrite the iteration in the following way:

    x^{(j+1)} = r^{(j)} + r^{(j-1)} + ... + r^{(0)} + x^{(0)}
              = sum_{k=0}^{j} (I - A)^k r^{(0)} + x^{(0)}.

Without loss of generality we will assume x^{(0)} = 0; if x^{(0)} is nonzero, a linear transformation of the form z = x - x^{(0)} can be considered, and the system of equations will look like

    A z = b - A x^{(0)} = b',

where z = x - x^{(0)}. Removing x^{(0)} we get

    x^{(j+1)} = sum_{k=0}^{j} (I - A)^k r^{(0)}.

From this expression it is inferred that the new x^{(j+1)} lies in the subspace span{ r^{(0)}, A r^{(0)}, ..., A^{j} r^{(0)} }, which is the Krylov subspace K_{j+1}(A, r^{(0)}).

If x* is the exact solution of A x = b, then we build an iterative procedure that finds at each iteration an approximation x^{(j)} to x* that satisfies

    min || x* - x^{(j)} ||

for x^{(j)} in K_{j}(A, r^{(0)}) and a given norm. Furthermore, if the matrix A is symmetric positive definite (SPD), then we can use the following proper inner product:

    <x, y>_A = <y, x>_A = x^T A y,
    || x ||_A = <x, x>_A^{1/2}.

For instance, if x^{(1)} is in span{ r^{(0)} }, then x^{(1)} = α r^{(0)} and

    || x* - x^{(1)} ||_A = || x* - α r^{(0)} ||_A,

which has a minimum with respect to α at

    α = <r^{(0)}, x*>_A / <r^{(0)}, r^{(0)}>_A = (r^{(0)T} r^{(0)}) / (r^{(0)T} A r^{(0)}),

since A x* = b = r^{(0)} when x^{(0)} = 0.

In general, we want to find min || x* - x^{(j+1)} ||_A for x^{(j+1)} in K_{j+1}(A, r^{(0)}), with

    x^{(j+1)} - x^{(j)} in K_{j+1}(A, r^{(0)}),
    r^{(j+1)} orthogonal to K_{j+1}(A, r^{(0)}).

If we repeat the search for an approximation j times, we build a set of orthogonal residual vectors { r^{(0)}, r^{(1)}, ..., r^{(j)} } that is an orthogonal basis for the Krylov subspace K_{j+1}(A, r^{(0)}). Furthermore,

    K_{j+1}(A, r^{(0)}) = span{ r^{(0)}, A r^{(0)}, ..., A^{j} r^{(0)} }.

We continue to generate orthogonal residual vectors until a full basis for the largest Krylov subspace generated by A and r^{(0)} is found. Beyond this point we cannot generate more orthogonal vectors, and we have a solution of A x = b.

Inside the iterative procedure, a new x^{(j+1)} is computed after projecting the residual r^{(j)} onto the Krylov subspace K_{j+1}(A, r^{(0)}) with respect to the A inner product.

Theorem. The orthogonal residual basis satisfies a three-term recurrence of the form

    r^{(j+1)} = γ_{j+1} A r^{(j)} + c_j r^{(j)} + c_{j-1} r^{(j-1)}.

Proof. The proof is by induction. We start with

    r^{(1)} in span{ r^{(0)}, A r^{(0)} },
    r^{(1)} = γ_1 A r^{(0)} + c_0 r^{(0)},

and

    r^{(2)} in span{ r^{(0)}, A r^{(0)}, A^2 r^{(0)} } = span{ r^{(0)}, r^{(1)}, A r^{(1)} },
    r^{(2)} = γ_2 A r^{(1)} + c_1 r^{(1)} + c_0 r^{(0)}.

By induction,

    r^{(j)} in K_{j+1}(A, r^{(0)}),    r^{(j+1)} orthogonal to K_{j+1}(A, r^{(0)}),

where

    K_{j+1}(A, r^{(0)}) = span{ r^{(0)}, r^{(1)}, ..., r^{(j)} }.

Thus

    r^{(j+1)} = γ_{j+1} A r^{(j)} + sum_{i=0}^{j} c_i r^{(i)}.

Since r^{(j+1)} must be orthogonal to { r^{(0)}, ..., r^{(j)} }, we must find the values of the c_i that satisfy this condition; in other words,

    <r^{(j+1)}, r^{(i)}> = 0,    i = 0, ..., j.

Taking the inner product of the expansion with r^{(i)} yields

    γ_{j+1} <A r^{(j)}, r^{(i)}> + c_i <r^{(i)}, r^{(i)}> = 0.

Using the properties of the inner products gives us

    <A r^{(j)}, r^{(i)}> = <r^{(j)}, r^{(i)}>_A = <r^{(j)}, A r^{(i)}>,

so that

    γ_{j+1} <r^{(j)}, A r^{(i)}> = - c_i <r^{(i)}, r^{(i)}>.

Using the induction argument, A r^{(i)} lies in span{ r^{(0)}, ..., r^{(i+1)} }, so that <r^{(j)}, A r^{(i)}> = 0 whenever i + 1 < j; hence c_i = 0 for i < j - 1, and the expansion reduces to

    r^{(j+1)} = γ_{j+1} A r^{(j)} + c_j r^{(j)} + c_{j-1} r^{(j-1)},

with c_{j-1} and c_j determined by the two remaining orthogonality conditions. This completes the proof.

The three-term recurrence relation can be rearranged as

    A r^{(j)} = (1/γ_{j+1}) ( r^{(j+1)} - c_j r^{(j)} - c_{j-1} r^{(j-1)} ).

Let { r^{(0)}, r^{(1)}, ..., r^{(j)} } be the columns of a matrix R^{(j)}. From the rearranged recurrence we get

    A R^{(j)} = R^{(j)} T^{(j)} + (1/γ_{j+1}) r^{(j+1)} e_j^T,

where T^{(j)} is a tridiagonal matrix and e_j is a canonical vector. We can write the new approximation as a linear combination of the columns of R^{(j)}, namely

    x^{(j+1)} = R^{(j)} y.

The x^{(j+1)} that minimizes

    min || x* - x^{(j+1)} ||_A

is derived from

    R^{(j)T} ( A x^{(j+1)} - b ) = 0,
    R^{(j)T} A x^{(j+1)} = R^{(j)T} b,
    R^{(j)T} A R^{(j)} y = R^{(j)T} b.

Substituting the relation for A R^{(j)},

    R^{(j)T} ( R^{(j)} T^{(j)} + (1/γ_{j+1}) r^{(j+1)} e_j^T ) y = R^{(j)T} b.

Since r^{(j+1)} is orthogonal to the columns of R^{(j)}, we have

    ( R^{(j)T} R^{(j)} ) T^{(j)} y = || r^{(0)} ||^2 e_1,

where

    R^{(j)T} R^{(j)} = diag( || r^{(0)} ||^2, ..., || r^{(j)} ||^2 ).

Thus, after absorbing the diagonal scaling and the factor || r^{(0)} ||^2 into the tridiagonal matrix,

    T^{(j)} y = e_1,    i.e.    y = ( T^{(j)} )^{-1} e_1.

If A is an SPD matrix, then in the relation

    R^{(j)T} A R^{(j)} = ( R^{(j)T} R^{(j)} ) T^{(j)}

the matrix T^{(j)} can be transformed into an SPD matrix (e.g., using some row scaling) and LU-decomposed without any pivoting:

    T^{(j)} = L^{(j)} U^{(j)}.

Then, combining the relations above, we have


    x^{(j+1)} = R^{(j)} y
              = R^{(j)} ( T^{(j)} )^{-1} e_1
              = ( R^{(j)} ( U^{(j)} )^{-1} ) ( ( L^{(j)} )^{-1} e_1 ).

Labeling the terms in parentheses

    P^{(j)} = R^{(j)} ( U^{(j)} )^{-1}    and    Q^{(j)} = ( L^{(j)} )^{-1} e_1,

and after some algebraic manipulation, we obtain an equation for the new approximation of the form

    x^{(j+1)} = x^{(j)} + q_j p^{(j)},

where p^{(j)} is the last column of P^{(j)} and q_j the last entry of Q^{(j)}. The columns of the matrix P^{(j)} are A-conjugate, hence the name of the Conjugate Gradient method.

The algorithm below is a version of the classical CG algorithm from Hestenes and Stiefel. On the other hand, if we do not assume the positive definiteness of A, we can only hope that T^{(j)} is not singular, and the procedure is known as the Lanczos method for symmetric systems (see Lanczos).

Algorithm (Conjugate Gradient)
1.  x^{(0)} is arbitrary and r^{(0)} = b - A x^{(0)}
2.  p^{(0)} = r^{(0)}
3.  For j = 0, 1, ... until convergence do
4.      q^{(j)} = A p^{(j)}
5.      α_j = ( r^{(j)T} r^{(j)} ) / ( p^{(j)T} q^{(j)} )
6.      x^{(j+1)} = x^{(j)} + α_j p^{(j)}
7.      r^{(j+1)} = r^{(j)} - α_j q^{(j)}
8.      If || r^{(j+1)} || is small enough, then stop
9.      β_{j+1} = ( r^{(j+1)T} r^{(j+1)} ) / ( r^{(j)T} r^{(j)} )
10.     p^{(j+1)} = r^{(j+1)} + β_{j+1} p^{(j)}
11. End do
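For concreteness, a compact sequential rendering of the algorithm above in Python/NumPy is sketched below; it is a minimal illustration for an SPD matrix, and the simple residual-norm stopping test stands in for the backward-error criteria discussed later in this chapter.

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Classical CG (Hestenes-Stiefel) for an SPD matrix A."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rho = np.dot(r, r)
    max_iter = 10 * n if max_iter is None else max_iter
    for _ in range(max_iter):
        q = A @ p                      # heaviest step: the matrix-vector product
        alpha = rho / np.dot(p, q)
        x += alpha * p
        r -= alpha * q
        rho_new = np.dot(r, r)
        if np.sqrt(rho_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x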


From the algorithm above, we see that the CG algorithm requires special consideration for its parallelization. We notice that the computational weight of the algorithm is in the matrix-vector product q^{(j)} = A p^{(j)}. Apart from this computation, as shown in Demmel, Heath and van der Vorst, there are several arrays that need to be loaded into cache memory or memory registers at each iteration in order to perform only a few floating-point operations on them, which results in poor data locality. Independent of the computer architecture, poor data locality always implies a high number of data transfers, which happen between different memory hierarchies or, in the worst case, between processors.

Another bottleneck in the algorithm is the computation of the two inner products, p^{(j)T} q^{(j)} and r^{(j+1)T} r^{(j+1)}. In the parallel version of the algorithm, these two inner products require the different processors to exchange local data in order to complete them; clearly, these two steps are the synchronization points of the algorithm. Moreover, the updates of x^{(j+1)} and r^{(j+1)} depend on the result of the first inner product, and the update of p^{(j+1)} depends on the result of the second. These data dependencies prevent us from overlapping communication with useful computations.
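To make the synchronization points concrete, the following sketch shows one iteration of a distributed CG in which the vectors are partitioned row-wise across processes. It is a minimal illustration assuming mpi4py; the variable names and the user-supplied distributed matrix-vector product matvec are hypothetical. Each of the two inner products requires a global reduction, i.e. a synchronization point, before the iteration can proceed.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def parallel_cg_step(x_local, r_local, p_local, rr, matvec):
    """One CG iteration on row-distributed vectors; rr is the global (r, r)
    carried over from the previous iteration."""
    q_local = matvec(p_local)                      # distributed A @ p

    # synchronization point 1: the inner product (p, q)
    pq = comm.allreduce(np.dot(p_local, q_local), op=MPI.SUM)
    alpha = rr / pq

    x_local += alpha * p_local
    r_local -= alpha * q_local

    # synchronization point 2: the inner product (r_new, r_new)
    rr_new = comm.allreduce(np.dot(r_local, r_local), op=MPI.SUM)

    p_local[:] = r_local + (rr_new / rr) * p_local
    return x_local, r_local, p_local, rr_new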

In order to reduce the number of synchronization points, we review a few proposed algorithms. Saad suggested removing one synchronization point at the expense of numerical instabilities (see Saad). Meurant proposed a version of the CG algorithm that eliminates one synchronization point of the classical algorithm and increases its data locality; however, this improvement comes at a cost elsewhere in the algorithm and adds one inner product. Chronopoulos proposed a different version of the CG algorithm, named the s-step conjugate gradient algorithm (see Chronopoulos and Gear).

The s-step conjugate gradient is based on the idea of generating s orthogonal vectors in one iteration step, by first building r^{(j)}, A r^{(j)}, ..., A^{s-1} r^{(j)} and then orthogonalizing and updating the current approximation to the solution in the resulting subspace. The s-step method improves the data locality and the parallelism of the classical CG algorithm, since the arrays are loaded only once per iteration. Also, in the s-step method the number of synchronization points is reduced to one (see Chronopoulos). The s-step method performs a multiple of n more floating-point operations per iteration than the classical CG, which is not critical for large values of s, because a large value of s not only increases the granularity of the parallel subtasks but may also reduce the number of iteration steps before reaching convergence. However, the s-step method introduces numerical instabilities by explicitly computing powers of A. Also, depending on the spectrum of the matrix A, the set of vectors r^{(j)}, A r^{(j)}, ..., A^{s-1} r^{(j)} may converge towards the direction of the dominant eigenvectors and not lead to the solution of the linear system. The weaknesses of the s-step algorithm are further reviewed in Saad.

D'Azevedo, Eijkhout and Romine introduced two new stable variants of the CG algorithm. In these versions, the authors substitute the second inner product by equivalent algebraic expressions. They have shown that these versions of the algorithm are stable, and they claim an improvement in performance over the classical CG algorithm. A different approach was presented by Demmel, Heath and van der Vorst, in which one overlaps communication with useful computations. Their original approach was proposed for the preconditioned CG; for the non-preconditioned CG, one has to split the computation into two phases to overlap communication with useful computations. In general, the improvement from overlapping communication with useful computation is determined by the speed of the communication network, the computing processors, and the size of the system to be solved.


A stable Block-CG algorithm

In the previous section the matrix A is assumed to be SPD. The fact that A is positive definite guarantees the termination of the algorithm in the absence of roundoff errors; however, the classical CG algorithm may still work, without guarantee, for non positive definite matrices.

Some algorithms based on the CG method have been designed for solving symmetric indefinite linear systems. A few examples of these algorithms are CGI (see Modersitzki), MINRES and SYMMLQ (see Paige and Saunders). And in some cases one can use a preconditioner matrix H which is a good approximation to A and is SPD.

Here we consider a Block-CG algorithm for SPD matrices; later we will study the Block-CG method for accelerating the rate of convergence of another basic block-row projection method.

In the Block-CG algorithm, the term block refers to multiple solution vectors, and the number of these solutions defines the block size. Let s be the block size for the system of linear equations

    H X = K.

Here, the vector x and the right-hand side b are replaced by the matrices X and K respectively, and both matrices are of order n x s. Now we write the block form of the classical CG algorithm.

Algorithm (Block Conjugate Gradient)
1.  X^{(0)} is arbitrary; P^{(0)} = R^{(0)} = K - H X^{(0)}
2.  For j = 0, 1, ... until convergence do
3.      λ^{(j)} = ( P^{(j)T} H P^{(j)} )^{-1} ( R^{(j)T} R^{(j)} )
4.      X^{(j+1)} = X^{(j)} + P^{(j)} λ^{(j)}
5.      R^{(j+1)} = R^{(j)} - H P^{(j)} λ^{(j)}
6.      ψ^{(j)} = ( R^{(j)T} R^{(j)} )^{-1} ( R^{(j+1)T} R^{(j+1)} )
7.      P^{(j+1)} = R^{(j+1)} + P^{(j)} ψ^{(j)}
8.  End do
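A minimal dense NumPy sketch of this block iteration (without the stabilization introduced next) is given below; the s-by-s systems are solved directly, and the crude residual-norm stopping test is only a placeholder.

import numpy as np

def block_cg(H, K, X0=None, tol=1e-10, max_iter=500):
    """Plain Block-CG for an SPD matrix H and an n-by-s block of right-hand sides K."""
    n, s = K.shape
    X = np.zeros((n, s)) if X0 is None else X0.copy()
    R = K - H @ X
    P = R.copy()
    for _ in range(max_iter):
        HP = H @ P                                  # block matrix-matrix product
        lam = np.linalg.solve(P.T @ HP, R.T @ R)    # (P^T H P)^{-1} (R^T R), s x s
        X += P @ lam
        R_new = R - HP @ lam
        if np.linalg.norm(R_new) <= tol * np.linalg.norm(K):
            return X
        psi = np.linalg.solve(R.T @ R, R_new.T @ R_new)
        P = R_new + P @ psi
        R = R_new
    return X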

However, there is a change from the Krylov space considerations used while building the single-vector algorithm. The Krylov subspace is rewritten here as

    K_j(H, R^{(0)}) = span{ R^{(0)}, H R^{(0)}, ..., H^{j-1} R^{(0)} }.

In the single-vector case a space of dimension j is generated, whereas inside the Block-CG algorithm a space of dimension up to s times j is generated. Furthermore, the columns of the R^{(j)} matrix may become almost linearly dependent when convergence is about to be reached. A solution to this problem is the orthonormalization of the P and R matrices. As given by Ruiz, the stabilized Block-CG algorithm is depicted below.

Algorithm (Stabilized Block Conjugate Gradient)
1.  X^{(0)} is arbitrary; R^{(0)} = K - H X^{(0)}
2.  R^{(0)} := R^{(0)} γ^{(0)}, with γ^{(0)} such that R^{(0)T} R^{(0)} = I
3.  P^{(0)} := R^{(0)} β^{(0)}, with β^{(0)} such that P^{(0)T} H P^{(0)} = I
4.  For j = 0, 1, ... until convergence do
5.      X^{(j+1)} = X^{(j)} + P^{(j)} ( P^{(j)T} R^{(j)} ),  where R^{(j)} corresponds to K - H X^{(j)}
6.      R^{(j+1)} := ( R^{(j)} - H P^{(j)} ( P^{(j)T} R^{(j)} ) ) γ^{(j+1)},  with γ^{(j+1)} such that R^{(j+1)T} R^{(j+1)} = I
7.      P^{(j+1)} := ( R^{(j+1)} + P^{(j)} μ^{(j+1)} ) β^{(j+1)},  with μ^{(j+1)} chosen so that the new directions remain H-conjugate to the previous ones and β^{(j+1)} such that P^{(j+1)T} H P^{(j+1)} = I
8.  End do
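The orthonormalization steps above can be realized with a Cholesky factorization of a small s-by-s Gram matrix, in the spirit of the DPOTRF-based factorizations mentioned in the next section. The sketch below is only an illustration of that kernel, not the thesis implementation: with H omitted it produces the Euclidean orthonormalization used for the residual block, and with an SPD H it produces the H-orthonormalization used for the direction block.

import numpy as np

def h_orthonormalize(V, H=None):
    """Return W = V L^{-T} such that W^T H W = I, where L is the Cholesky
    factor of the s-by-s Gram matrix V^T H V (H is the identity if omitted)."""
    HV = V if H is None else H @ V
    G = V.T @ HV                      # s x s SPD Gram matrix (V assumed full rank)
    L = np.linalg.cholesky(G)         # G = L L^T (LAPACK potrf underneath)
    return np.linalg.solve(L, V.T).T  # V L^{-T}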

j j

In step of Algorithm we do not explicitly compute R K HX b ecause this

j

computation involves the HX matrixmatrix pro duct and this pro duct is very exp ensive when

computed in parallel eg pro cessors need to exchange data which implies more communication

and synchronization p oints From steps and in Algorithm it can b e shown by

recurrence that the residuals are given by

Y

j

j

A

R R

i

ij

j

Therefore the up date of the X solution in step can b e rewritten as

Y

T

j j j

j j

A

P P R X X

i

ij

In general, it can be proved that for the Block-CG algorithm the H-norm of the error at each iteration is bounded in terms of the reduced condition number λ_n / λ_s, where the λ_i are the eigenvalues of the H matrix sorted in increasing order and s is the block size (see O'Leary). This observation represents an advantage of the Block-CG over its non-block counterparts, where the error at each iteration is bounded in terms of the full condition number λ_n / λ_1.

Performance of Block-CG vs. Classical CG

We start by comparing the complexity of the classical CG and of the stabilized Block-CG. We focus on the number of floating-point operations (the FLOP count) per iteration, as summarized in the complexity table below. In the next chapter we present results from sequential runs of computer implementations of both algorithms to compare their convergence rates.

In the table, the matrix A is SPD of dimension n. The sparsity pattern of A is exploited throughout the computations; thus the FLOP count of a matrix-vector operation is a function of the sparsity pattern of the matrix A, denoted here by

    σ(A) = FLOP count( A b )

for any vector b. In the Block-CG algorithm, s is the block size, and usually s is much smaller than n.
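For example, under the usual convention of one multiplication and one addition per stored nonzero (an assumption here, since the table's exact counting rules are not reproduced), σ(A) can be estimated directly from the sparsity pattern:

import scipy.sparse as sp

def sigma(A):
    """Estimate the FLOP count of one sparse matrix-vector product A @ b,
    assuming one multiply and one add per stored nonzero."""
    return 2 * sp.csr_matrix(A).nnz

# a product with an n-by-s block of vectors then costs roughly s * sigma(A)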

The FLOP count of one iteration of the Block-CG is greater than s times the FLOP count of one iteration of the classical CG. Because of the stability issues discussed in the previous section, a few more FLOPs are performed in the orthonormalization of the R^{(j)} and P^{(j)} matrices of the stabilized algorithm. The FLOP count also increases with the use of Cholesky factorizations to compute the normalization factors: both γ^{(j)} and β^{(j)} have dimensions s x s, and we use the routine DPOTRF from the LAPACK library (Anderson, Bai, Bischof et al.) to perform the Cholesky factorizations. The increase in the FLOP count from the Cholesky factorizations becomes significant for large values of s.

From the complexity table, the FLOP count per iteration of the classical CG is σ(A) plus a small multiple of n, while the FLOP count per iteration of the Block-CG is s σ(A) plus terms proportional to n s and n s^2, plus lower-order terms in s from the factorizations of the s x s matrices.

As a consequence, the FLOP count of one iteration of the Block-CG is much greater than the FLOP count of one iteration of the classical CG for large values of s. Therefore, a large value of s deteriorates the performance of a sequential Block-CG implementation. On the other hand, a large value of s increases the granularity of the associated parallel subtasks and improves the performance of a parallel Block-CG implementation. Also, in the absence of roundoff errors, the number of synchronization points is implicitly reduced with a large value of s, because the number of iterations required to reach convergence is also reduced.

The value of s should therefore be a compromise between the FLOP count, the acceleration of the convergence rate of the Block-CG due to a greater value of s, and the resources available in the system. Nikishin and Yeremin have proposed a Variable Block Preconditioned Conjugate Gradient (VBPCG) algorithm that provides the possibility of adjusting the block size during the preconditioned Block-CG iterations. The goal of varying the block size is to find a compromise between the block size, the convergence rate, the FLOP count, and the degree of parallelism. While they have shown that the VBPCG algorithm reduces the total FLOP count of the block preconditioned CG, they have also concluded that the performance of the VBPCG algorithm depends on the spectral properties of the preconditioned matrix.


In a later chapter we study the compromise between the factors that influence the performance of different parallel implementations of the Block-CG algorithm in distributed memory environments.

Stopping criteria

The results of the experiments presented in the next chapter come from sequential runs of computer implementations of the classical CG and Block-CG algorithms. The stopping criterion used in these implementations is based on the theory of Oettli and Prager (see Oettli and Prager) for backward error analysis.

Let x^{(j)} be in R^n, A in R^{n x n} and b in R^n, and define r(x^{(j)}) = b - A x^{(j)}. For any matrix E >= 0 and vector f >= 0, define

    ω^{(j)} = max_i | r(x^{(j)})_i | / ( E |x^{(j)}| + f )_i,

with the conventions 0/0 = 0 and ξ/0 = infinity for ξ != 0. Then there exist a matrix δA and a vector δb with

    | δA | <= ω^{(j)} E    and    | δb | <= ω^{(j)} f,

such that

    ( A + δA ) x^{(j)} = b + δb,

and ω^{(j)} is the smallest number for which such δA and δb exist. Therefore, when we compute a small ω^{(j)} for a given E and f, the current iterate is the solution of a nearby problem.

At the end of each iteration we compute an ω^{(j)} that satisfies this definition, and we stop the iterations when either ω^{(j)} is reduced below a given threshold value or it cannot be reduced any more.

There are different ways of choosing E and f (see for instance Arioli, Demmel and Duff, and Arioli, Duff and Ruiz). Here we have used

    E = ||A|| e e^T    and    f = ||b|| e,

where e is a column vector of all ones. This choice of E and f corresponds to the normwise backward error defined by

    ω^{(j)} = || r(x^{(j)}) || / ( ||A|| ||x^{(j)}|| + ||b|| ),

and we stop the iterations when ω^{(j)} <= c, where c is the threshold value that determines the termination. A comparison between the normwise backward error and other stopping criteria is presented by Arioli, Duff and Ruiz; the authors have shown empirically that in some cases it is possible to drive ω to near the machine precision. Furthermore, the authors consider ω a useful stopping criterion for iterative methods because the spectral properties of the matrices are implicitly taken into account: ω uses the norms of the matrix and of the vectors to estimate the error.

Table: Complexity of the classical CG and of the Block-CG algorithms. For each algorithm, the FLOP count is given step by step for the startup phase and for one iteration cycle, together with the totals; the counts are expressed in terms of σ(A), n and the block size s. The FLOP count of the Block-CG solution-update step comes from the rewritten update formula with the accumulated normalization factors.

Consequently, ω, as with many iterative methods, is sensitive to scalings of the matrices.

In many applications, a different error estimate is used that requires less information than the normwise backward error, for instance an error estimate based on the residual norms. In this case one sets

    E = 0    and    f = || r(x^{(0)}) || e,

and thus the backward error at each iteration is estimated by

    ω~^{(j)} = || r(x^{(j)}) || / || r(x^{(0)}) ||,

and the iteration process stops when ω~^{(j)} <= c. Quite often one looks for only a few correct digits in the approximation to the solution of a linear system of equations; in these cases ω~ is commonly used.
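Both stopping tests are cheap to evaluate from quantities already available during the iteration. The sketch below computes them in Python/NumPy; the use of the infinity norm is an illustrative assumption, since the norm is not specified above.

import numpy as np

def omega_normwise(norm_A, norm_b, x, r):
    """Normwise backward error: ||r|| / (||A|| ||x|| + ||b||)."""
    return np.linalg.norm(r, np.inf) / (norm_A * np.linalg.norm(x, np.inf) + norm_b)

def omega_residual(r, norm_r0):
    """Residual-based estimate: ||r|| / ||r(x^(0))||."""
    return np.linalg.norm(r, np.inf) / norm_r0

# stop when omega_normwise(...) <= c, or when omega_residual(...) <= c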

Furthermore, the threshold value c can be related to the truncation error that comes from the discretization of the original problem. Thus, the maximum number of correct digits in an approximation depends on the truncation (or discretization) error incurred when discretizing the original continuous problem and writing out the system of linear equations.

We also present in the next chapter some results from runs of the classical CG and Block-CG using a stopping criterion based on ω~.


Chapter 3

Sequential Block-CG experiments

The purpose of the following experiments is to examine the behavior of the classical CG and Block-CG algorithms described in the previous chapter, and to conclude with some incentives for the parallelization of the Block-CG algorithm.

As illustrated in O'Leary, Block-CG is expected to converge faster than the classical CG for linear systems of equations with well-separated clusters of eigenvalues. In addition, Block-CG can simultaneously solve a linear system of equations with multiple right-hand sides. Some advantages of using Level 3 BLAS routines inside the Block-CG iteration are also shown in this chapter. Furthermore, the granularity of Block-CG is greater than the granularity of the classical CG, and we anticipate gaining more from implementing the Block-CG algorithm on vector and parallel computer systems.

In the first three sections we use two test problems that were previously used for testing a Lanczos algorithm with partial reorthogonalization (Simon). This version of the Lanczos algorithm is named LANPRO, from LANczos with Partial ReOrthogonalization. The matrices used in the tests of the LANPRO algorithm are available in the Harwell-Boeing collection (Duff, Grimes and Lewis).

In the Harwell-Boeing collection these seven matrices are labeled NOS1 to NOS7. Here we use the problems LANPRO NOS2 and LANPRO NOS3. These problems are particularly interesting because Simon compares the LANPRO algorithm with the classical CG and presents results from the solution of the LANPRO NOS2 problem, in which the classical CG converges poorly after more than n iterations.

The LANPRO test problems come from finite element approximations to problems in structural engineering. LANPRO NOS2 results from a discretization using a biharmonic operator on a beam; in this case, the beam has one end fixed and one end free (see part (a) of the figure below). The corresponding stiffness matrix was computed using a finite element approximation program named FEAP, for Finite Element Analysis Program (see Zienkiewicz and Taylor).

The biharmonic equation for the flexure of a plate is given by

    ∂^4 w/∂x^4 + 2 ∂^4 w/(∂x^2 ∂y^2) + ∂^4 w/∂y^4 = 12 (1 - ν^2) q / (E t^3),

where w is the displacement. In this problem, w is the result of applying a unit load at about the middle of the beam, and there is no displacement at the fixed end of the beam. Thus the boundary conditions at the fixed end are

    w = 0,    ∂w/∂x = ∂w/∂y = 0.


Figure: (a) A beam with one end fixed and the other end free; a unit load is applied at about the middle of the beam. The stiffness matrix corresponds to the LANPRO NOS2 problem. (b) A plate with one end fixed and the others free; a unit load is applied to one of the free corners of the plate. The stiffness matrix corresponds to the LANPRO NOS3 problem.

Figure: Sparsity pattern of the LANPRO NOS2 matrix from the Harwell-Boeing Sparse Matrix Collection (nz = 4137).

In the LANPRO NOS2 problem, a number of elements were used in the finite element approximation, each with several degrees of freedom. The figure above illustrates the sparsity pattern of this test problem; the stiffness matrix is SPD with 4137 nonzero elements. The next figures illustrate the eigenspectrum of this matrix.

Figure: Eigenspectrum of the LANPRO NOS2 matrix (condition number 5.100e+09).

Figure: Eigenvalues at the lower end of the eigenspectrum of LANPRO NOS2 (plotted interval from 3.084e+01 to 2.946e+06; 420 eigenvalues in the interval). Notice the cluster of eigenvalues near the lower end of the interval.

LANPRO NOS3 comes from applying a biharmonic operator to a rectangular plate. In this problem the plate has one side fixed and the others free (see part (b) of the figure above). The problem is discretized using the biharmonic operator, with the boundary conditions given above imposed at the fixed end.

Figure: Eigenvalues at the center of the eigenspectrum of LANPRO NOS2 (plotted interval from 1.888e+10 to 1.572e+11; 245 eigenvalues in the interval).

Figure: Eigenvalues at the upper end of the eigenspectrum of LANPRO NOS2 (plotted interval from 6.941e+09 to 1.573e+11; 276 eigenvalues in the interval).

Figure: Sparsity pattern of the LANPRO NOS3 matrix (nz = 15844).

The stiffness matrix of LANPRO NOS3 was also computed using FEAP. It is SPD of order 960, with 15844 nonzero elements. The sparsity pattern of this matrix is shown in the figure above, and its eigenspectrum is illustrated in the figures below.

As shown above, the test problem LANPRO NOS2 has two large clusters of eigenvalues at the extremes of the spectrum and a large condition number, about 5.1e+09. The LANPRO NOS3 problem, in contrast, has several clusters of eigenvalues, and these clusters are not as separated as in LANPRO NOS2; also, the condition number of LANPRO NOS3 is relatively small compared to that of LANPRO NOS2.

In the last experiment of this chapter we use a problem arising from oil reservoir modeling. This problem is included in the Harwell-Boeing sparse matrix collection under the SHERMAN name. The problem comes from a three-dimensional simulation model on a regular grid, and it has been discretized using a seven-point finite-difference approximation with one equation and one unknown at each grid point.

The results in this chapter are from sequential runs of implementations of the classical CG and of the Block-CG; results from parallel runs of both algorithms will be presented in the next chapter. The runs reported in this chapter were performed on an IBM RS/6000 system. The Mflops figure used below is the computational rate measured in millions of floating-point operations computed per second, and is different from the FLOP count, which represents the number of floating-point operations.

Figure: Eigenspectrum of the LANPRO NOS3 matrix (condition number 3.772e+04).

Figure: Distribution of the eigenvalues of LANPRO NOS3 (plotted interval from 1.829e-02 to 6.899e+02; 960 eigenvalues in the interval).


Solving the LANPRO NOS2 problem

Table: Sequential runs of the classical CG and Block-CG programs on the LANPRO NOS2 problem (block size, iteration count, equivalent iterations, and normwise backward error); the iterations are stopped once ω falls below the chosen threshold.

In this first experiment we use the LANPRO NOS2 problem, and we choose the block sizes 8, 16, 32 and 64 somewhat arbitrarily, to examine some effects of the block size on the performance of Block-CG. In this case we stop the iterations when ω falls below a fixed threshold, where ω is the normwise backward error defined in the previous chapter.

The first convergence figure below shows the convergence curves of the classical CG (labeled Block_CG(01)) and of four instances of Block-CG. In this figure, the number of iterations represents the iteration count of the classical CG and of the Block-CG algorithms.

In that figure, only a limited number of classical CG iterations are reported, although in this experiment the classical CG cannot reduce ω to the threshold even after a very large number of iterations. In the same figure, the iteration count needed to reduce ω to the threshold decreases as the block size is increased. On the other hand, Block-CG computes s orthogonal vectors at each iteration while the classical CG computes only one; therefore, the iteration count has to be multiplied by the block size to reflect the number of equivalent (or normalized) iterations. For example, 100 Block-CG iterations with block size 16 correspond to 1600 equivalent iterations.

The second convergence figure shows the convergence curves using equivalent iterations. The results from this experiment are summarized in the table above, and a second table below summarizes the results of similar runs after computing only n orthogonal vectors.

The solution of a linear system lies in the eigenspace of the matrix A, and any iterative method that successively multiplies the matrix A by a vector (or by another matrix, as in Block-CG) will include information from the eigenvalues at the upper end of the eigenspectrum in the solution during the first iterations. Information from the eigenvalues at the lower end of the eigenspectrum is likely to be included only during the last iterations of the method.

In finite arithmetic, information from the smallest eigenvalues is difficult to include in the solution when the matrix has a high condition number and there are clusters of eigenvalues at the lower end of the spectrum. For instance, two very small eigenvalues will converge to the same value after a few multiplications by the matrix A in finite arithmetic. Similar convergence behavior has been observed in our experiments solving LANPRO NOS2, and this explains the slow convergence rate of both the classical CG and the Block-CG.


Table: Sequential runs of the classical CG and Block-CG programs with a fixed number of equivalent iterations, LANPRO NOS2 problem (block size, iteration count, equivalent iterations, normwise backward error, execution time in seconds, and Mflops).

In the classical CG, the computational weight is in the A p^{(j)} product; thus σ(A) has a great influence on the total execution time. As shown in the two tables above, the execution time of Block-CG with the larger block sizes is significantly greater than the execution time of Block-CG with smaller block sizes. This effect means that the FLOP count in Block-CG can be greatly influenced by large values of s.

The size of the matrix A and its sparsity pattern are constant throughout the Block-CG iterations; thus the ratio of s σ(A) to the total FLOP count decreases as the block size is increased. Clearly, increasing the block size causes the FLOP count of operations that involve the s x s matrices to increase; for instance, in the stabilized Block-CG the factorization of the γ^{(j)} and β^{(j)} matrices becomes more expensive as s is increased, and this can easily be verified in the complexity estimates of the previous chapter. The next table shows the influence of s σ(A) on the total FLOP count for the runs shown above.

Table: Percentage of the total FLOP count generated by σ(A), as a function of the block size, LANPRO NOS2 problem.

The results shown in the two tables above are verified using a different stopping criterion, namely stopping the iterations when the residual-based estimate ω~ falls below a fixed threshold. The results are shown in the next table; as reported by Simon, the convergence rate of the classical CG when solving the LANPRO NOS2 problem is very slow. In that table, the behavior of Block-CG is similar to the one observed with the normwise stopping criterion.

Table: Sequential runs of the classical CG and Block-CG programs, using ω~ below a fixed threshold as the stopping criterion, LANPRO NOS2 problem (block size, iteration count, equivalent iterations, and execution time in seconds).

Empirically, it has been observed that Block-CG has a better chance of including information from the smallest eigenvalues in the solution when the block size is close to the size of the cluster of eigenvalues at the lower end of the spectrum. One reason is that more information from the spectrum is gathered at each iteration. In infinite precision, this implies that fewer matrix multiplications are performed to obtain the relevant information from the entire spectrum and include it in the solution. In finite arithmetic, the reduction in the number of matrix multiplications also reduces the perturbation of the solution due to roundoff errors from implicitly computing powers of A. Therefore, in some cases a large block size is recommended for Block-CG.

Solving again the LANPRO NOS2 problem

Table: Sequential runs of the CG and Block-CG programs, LANPRO NOS2 problem (block size, iteration count, equivalent iterations, normwise backward error, execution time in seconds, and Mflops).


Figure: (a) Convergence curves (normwise backward error versus number of iterations) of Block-CG with different block sizes: Block_CG(01), Block_CG(08), Block_CG(16), Block_CG(32) and Block_CG(64); (b) zoom of some of the convergence curves. Test problem: LANPRO NOS2.

Figure: Convergence curves of Block-CG with different block sizes, where in each curve the number of iterations is multiplied by its block size to show equivalent iterations; (b) zoom of some of the convergence curves. Test problem: LANPRO NOS2.


Again we use the LANPRO NOS2 problem. The three eigenvalue-distribution figures above show the distribution of eigenvalues at the lower end, the center, and the upper end of the eigenspectrum, respectively; in these figures the eigenvalues are sorted in ascending order. There is a cluster of eigenvalues at the lower end of the eigenspectrum of LANPRO NOS2, and the eigenvalues in this cluster lie close together.

After a few runs of Block-CG varying the block size around the size of this cluster, we have chosen to report the results for the smallest block size in that range, because Block-CG with all block sizes in the range exhibited the same behavior during the experiments.

The table above shows the results from sequential runs of the classical CG and of Block-CG with this block size. Again, the execution time of Block-CG has drastically increased with such a large block size when compared to the execution time of the classical CG. This increase is due to the Block-CG operations that involve the s x s matrices: when solving LANPRO NOS2 with this block size, the σ(A) term accounts for only a small percentage of the total FLOP count.

In the next table, the stopping criterion is based on ω~, and the iterations are stopped after reducing the relative residual norm below a fixed threshold. In this case the classical CG is stopped earlier than in the experiment reported above.

Table: Sequential runs of the classical CG and Block-CG with the chosen block size, using ω~ below a fixed threshold as the stopping criterion, LANPRO NOS2 problem (block size, iteration count, equivalent iterations, and execution time in seconds).

In the computational results of the two preceding sections, it is observed that the Mflop rate increases as the block size is increased. This effect is due to the use of Level 3 BLAS routines in Block-CG instead of the Level 2 BLAS routines that are used in the classical CG. Mainly, the use of Level 3 BLAS routines improves the ratio of FLOPs per memory reference.
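The effect can be reproduced with a toy comparison between applying A to an n-by-s block in a single call (a Level 3 BLAS style operation) and applying A to the s columns one at a time (Level 2 BLAS style); the matrix sizes below are arbitrary and the timings are only indicative.

import time
import numpy as np

n, s = 2000, 16
A = np.random.rand(n, n)
B = np.random.rand(n, s)

t0 = time.perf_counter()
C2 = np.column_stack([A @ B[:, i] for i in range(s)])   # s Level-2-style products
t1 = time.perf_counter()
C3 = A @ B                                              # one Level-3-style product
t2 = time.perf_counter()

assert np.allclose(C2, C3)
print(f"s mat-vecs: {t1 - t0:.3f}s, one mat-mat: {t2 - t1:.3f}s")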

Solving the LANPRO NOS3 problem

In this experiment we use the test problem LANPRO NOS3, and we choose the four block sizes 4, 8, 16 and 32 for Block-CG. The clusters of eigenvalues in this test problem are not very distant from one another, as shown in the eigenvalue-distribution figure above, and the condition number is smaller than that of LANPRO NOS2; therefore, all the eigenvalues are easily approximated even when using small block sizes.

The convergence curves for this experiment are shown in the two figures below. The first plots the convergence curves using the normwise backward error ω versus the iteration count, and the second plots ω versus the number of equivalent iterations.

Table: Sequential runs of the CG and Block-CG programs, LANPRO NOS3 problem (block size, iteration count, equivalent iterations, normwise backward error, execution time in seconds, and Mflops).

Table: Sequential runs of the classical CG and Block-CG programs with a fixed number of equivalent iterations, LANPRO NOS3 problem (block size, iteration count, equivalent iterations, normwise backward error, execution time in seconds, and Mflops).

Considering the iteration count, Block-CG converges faster as the block size is increased. However, in terms of equivalent iterations, the computational work necessary to reduce ω below the threshold increases as we increase the block size, and the classical CG converges faster than all four Block-CG instances.

The size of LANPRO NOS3 is 960, and Block-CG with the two largest block sizes computes more than 960 orthogonal vectors; thus Block-CG may have converged to an erroneous solution in both cases.

The first table above summarizes the convergence information from the two convergence figures. The classical CG takes a certain number of iterations to converge, and in the second table the number of equivalent iterations has been fixed at around that count in order to compare the computational work between the classical CG and the different instances of Block-CG. At that point in the computations, the classical CG has approximated the solution of LANPRO NOS3 better than the Block-CG instances.

Table: Sequential runs of the CG and Block-CG programs, using ω~ below a fixed threshold as the stopping criterion, solving the LANPRO NOS3 problem (block size, iteration count, equivalent iterations, execution time in seconds, and Mflops).

The table above summarizes the results from solving the LANPRO NOS3 problem with the same block sizes. Comparing these results with those obtained using the normwise criterion, the advantage of the stopping criterion based on ω~ is evident: even with the largest block size, relatively few orthogonal vectors are computed before reaching the threshold value, whereas with ω as the stopping criterion many more equivalent iterations are performed for the larger block sizes.

The next table shows the percentage of the total FLOP count generated by σ(A). Notice that the percentages have increased compared to those obtained for the LANPRO NOS2 problem; one of the reasons is that the LANPRO NOS3 stiffness matrix is denser than the one of the LANPRO NOS2 problem. As a result of the increased percentages, the differences between the execution times after varying the block sizes are smaller than those reported for the LANPRO NOS2 problem. Again, in this experiment the Level 3 BLAS effect improves the Mflop rate as the block size is increased.

[Figure : Convergence curves of Block-CG with different block sizes (normwise backward error versus number of iterations). Test problem LANPRO NOS3; curves (a)-(e) correspond to Block_CG(01), Block_CG(04), Block_CG(08), Block_CG(16) and Block_CG(32).]

[Figure : Convergence curves of Block-CG with different block sizes (normwise backward error versus number of equivalent iterations). In each curve the number of iterations is multiplied by the block size to show equivalent iterations. Test problem LANPRO NOS3; curves (a)-(e) correspond to Block_CG(01), Block_CG(04), Block_CG(08), Block_CG(16) and Block_CG(32).]

Table : Percentage of the total FLOP count generated by A, LANPRO NOS problem.
Rows: block size | percentage of the total FLOP count generated by A.

Solving the SHERMAN problem

As mentioned earlier, the SHERMAN matrix comes from an oil reservoir model. The SHERMAN matrix has only a symmetric pattern but is not SPD. In these experiments we solve instead A^T A, where A is the SHERMAN matrix, because A^T A is SPD. The clusters of eigenvalues in A^T A are more separated than in the original SHERMAN matrix. The spectrum of the original SHERMAN matrix is shown in Figure and the eigenspectrum of A^T A from the SHERMAN matrix is shown in Figure . Given the distribution of eigenvalues in A^T A, we anticipate that Block-CG will converge faster than Classical CG.

The original SHERMAN problem will be solved in Chapter using a block row projection method that is accelerated with Block-CG.

The sparsity pattern of A^T A from SHERMAN is shown in Figure . The matrix has nonzero entries and a condition number of .
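As a small illustration of this setup, the sketch below forms the "squared" SPD system and hands it to a CG-type solver; it is not the thesis code, and the matrix used is a random nonsymmetric sparse stand-in for SHERMAN4 (the thesis reads the real Harwell-Boeing matrix).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Random nonsymmetric sparse stand-in for the SHERMAN4 matrix; only a symmetric
# sparsity pattern, not positive definiteness, is assumed of A itself.
rng = np.random.default_rng(0)
n = 1000
A = sp.random(n, n, density=0.005, random_state=rng, format="csr") + 2.0 * sp.eye(n)

B = (A.T @ A).tocsr()                    # the explicitly squared SPD system solved here
assert (abs(B - B.T) > 1e-12).nnz == 0   # B is symmetric by construction

b = np.ones(n)
x, info = cg(B, A.T @ b)                 # Classical CG on A^T A x = A^T b; Block-CG would
print(info, np.linalg.norm(B @ x - A.T @ b))  # handle s right-hand sides at once
```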

The block sizes were chosen for this experiment. In the results presented in Table , the stopping criterion was , where is the residual norm from . In Table it can be seen that for this case Block-CG performs better than Classical CG. The table shows that Block-CG with the two smallest block sizes has computed the solution with almost the same accuracy in fewer iterations than Classical CG, and with the next block sizes a more accurate solution is computed in fewer equivalent iterations.

Although in this case the execution of Block-CG with the largest block sizes took more time than Classical CG, it is expected that Block-CG will perform better in parallel environments because of the Mflop rates shown in Table .

We now repeat the same experiment using the normwise backward error as the stopping criterion. The threshold value has been chosen using information obtained during the runs from the first experiments reported in Table .
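For reference, one standard form of the normwise backward error for a system Hx = k is omega = ||k - Hx|| / (||H|| ||x|| + ||k||); the exact scaling used in the thesis is given by the equation referenced earlier, so the sketch below is only an assumed, illustrative version.

```python
import numpy as np

def normwise_backward_error(H, x, k):
    """One common normwise backward error (assumed form, not necessarily the
    exact scaling used in the thesis): ||k - Hx|| / (||H||*||x|| + ||k||),
    using the Frobenius norm for H and 2-norms for the vectors."""
    r = k - H @ x
    return np.linalg.norm(r) / (np.linalg.norm(H) * np.linalg.norm(x) + np.linalg.norm(k))
```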

Figures and show convergence curves using the iteration count and the equivalent iterations, respectively. Tables and summarize the results from the convergence curves. Table supports the results from Table , and again Block-CG runs faster than Classical CG for the smaller block sizes. In addition, Table shows that the larger block sizes have computed a solution with the requested accuracy after equivalent iterations, while Classical CG has not.

Table : Sequential runs of CG and Block-CG programs, SHERMAN problem.
Columns: block size | iteration count | equivalent iterations | normwise backward error | execution time (secs).

Table : Sequential runs of CG and Block-CG programs, SHERMAN problem.
Columns: block size | iteration count | equivalent iterations | normwise backward error | execution time (secs).

Table : Sequential runs of CG and Block-CG programs with a fixed number of iterations, SHERMAN problem.
Columns: block size | iteration count | equivalent iterations | normwise backward error | execution time (secs) | Mflops.

[Figure : Sparsity pattern of A^T A from the SHERMAN4 matrix (nz = 10346).]

Table : Percentage of the total FLOP count generated by A, SHERMAN problem.
Rows: block size | percentage of the total FLOP count generated by A.

As shown in Table , the percentage of the total FLOP count generated by A decreases as the block size is increased. Furthermore, these percentages decrease significantly for the largest block sizes, which implies that in these cases the total FLOP count and the execution time are dominated by the block size. As in the previous experiments, the Level 3 BLAS effect increases the Mflop rate as the block size is increased.

[Figure : Eigenspectrum of the SHERMAN4 matrix (condition number 2.179e+03).]

[Figure : Eigenspectrum of A^T A from the SHERMAN4 matrix (condition number 4.744e+06).]

[Figure : Convergence curves of Block-CG with different block sizes for A^T A from SHERMAN4 (normwise backward error versus number of iterations); curves (a)-(f) correspond to Block_CG(01), Block_CG(04), Block_CG(08), Block_CG(16), Block_CG(32) and Block_CG(64).]

[Figure : Convergence curves of Block-CG with different block sizes for A^T A from SHERMAN4 (normwise backward error versus number of equivalent iterations). In each curve the number of iterations is multiplied by the block size to show equivalent iterations; curves (a)-(f) correspond to Block_CG(01), Block_CG(04), Block_CG(08), Block_CG(16), Block_CG(32) and Block_CG(64).]

Remarks

Table : Summary of results from the previous experiments: comparison between Block-CG and solving the linear system s times with Classical CG, for the LANPRO NOS , LANPRO NOS and SHERMAN A^T A problems. For each problem the columns give, per block size, the execution time of Classical CG, the execution time of Block-CG, and the performance ratio (execution time of Classical CG divided by that of Block-CG). Execution times are in seconds. For each problem only a selected number of block sizes were reported in the previous sections of this chapter; therefore there are some empty entries in the table.

In some cases it is not trivial to choose a suitable block size for the Block-CG algorithm without additional information about the problem to be solved. For instance, a linear system of equations with a high condition number may require a prior analysis of the eigenspectrum, and computing the eigenspectrum is more expensive than solving the linear system of equations.

From the Block-CG results presented in this chapter, the Mflop rate increases as the block size is increased. The Mflop rates reported in Tables , , and show that as the granularity is increased, the Mflop rate is increased. Furthermore, the increasing Mflop rate in Block-CG can produce computational gains on vector and parallel computers even for linear systems in which Classical CG converges a few iterations faster than Block-CG.

In a parallel distributed environment, the number of synchronization points of Classical CG is implicitly reduced in Block-CG. In Block-CG, the information that needs to be communicated in one iteration equals the information that needs to be communicated in s iterations of Classical CG. Thus the number of synchronization points while computing the orthogonal vectors is reduced as the block size of Block-CG is increased. In this case Block-CG is preferred over Classical CG even if Block-CG communicates longer messages than Classical CG. Indeed, the possible bottlenecks in the communication are associated with the volume of messages being exchanged and the expensive startup or latency time.

When solving a linear system of equations with multiple right-hand sides, Block-CG will perform better than Classical CG, since the s solutions are computed simultaneously.

Table compares the performance of Block-CG and Classical CG for solving a linear system of equations with s right-hand sides. The information in Table has been extracted from results presented in previous sections. The execution times of Classical CG presented in previous tables have been multiplied by the block size to estimate the execution time of Classical CG for solving the linear system of equations s times.

In all of the results presented in this chapter, using Block-CG has been faster than s applications of Classical CG. In other words, in none of the cases has Block-CG cost more than s times Classical CG, although at each Block-CG iteration the FLOP count is increased by a factor of s when compared with a Classical CG iteration.

Increasing the block size in Block-CG also increases the granularity of the problem to be solved, and this effect makes Block-CG more suitable for parallelization. In Chapter we study the parallelization of the Block-CG Algorithm for distributed computing environments, and in Chapter we present results from parallel experiments that support our expectation that Block-CG is suitable for parallel environments.

Chapter

Parallel Block-CG implementation

In this chapter we present three different parallel versions of the Block-CG Algorithm for distributed memory architectures (see also Drummond, Duff and Ruiz ). First we present the partitioning and scheduling strategies used in the three implementations. Then we discuss and compare the three parallel distributed implementations.

Partitioning and scheduling strategies

At this point we look for a problem partitioning strategy to manage the creation of subproblems that can be solved in parallel, and a strategy to distribute these tasks among the processing elements. Here we recall that the problem at hand is to find a solution X to the linear system of equations

    H X = K,

where H is an n x n symmetric positive definite (SPD) matrix, X and K are n x s matrices, and s is the number of solution vectors in the system, or the block size for Block-CG.

The stabilized Block-CG Algorithm introduced in Chapter will be used in the solution of .
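For orientation, the sketch below gives the plain textbook block conjugate gradient iteration for H X = K (a minimal numpy illustration; it omits the stabilisation and orthonormalisation steps of the thesis' algorithm and is not the thesis code).

```python
import numpy as np

def block_cg(H, K, X0=None, tol=1e-10, maxiter=500):
    """Textbook (unstabilised) block CG for an SPD matrix H and s right-hand sides."""
    n, s = K.shape
    X = np.zeros((n, s)) if X0 is None else X0.copy()
    R = K - H @ X                      # block residual
    P = R.copy()                       # block search directions
    gamma = R.T @ R                    # s x s Gram matrix of the residuals
    for _ in range(maxiter):
        Q = H @ P
        lam = np.linalg.solve(P.T @ Q, gamma)   # s x s step matrix
        X += P @ lam
        R -= Q @ lam
        gamma_new = R.T @ R
        if np.sqrt(gamma_new.trace()) <= tol * np.linalg.norm(K):
            break
        P = R + P @ np.linalg.solve(gamma, gamma_new)
        gamma = gamma_new
    return X
```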

Partitioning strategy

First, a column partition of the matrix H into l submatrices is performed (l = 4 in Figure ). A column partition has been preferred over a row partition for the following reasons:

- There is no need to send the full X and K matrices to every computing element (CE) during the task distribution. Rather, each CE receives one or more parts of these matrices (e.g., in Figure , X_1, ..., X_4 and K_1, ..., K_4).

- The number of messages that need to be sent is independent of a column or row partition. However, when computing the HX or the HP products in parallel, shorter messages are sent with a column partitioning strategy than with a row one; in the latter, it may be required to send the X and P matrices in full to perform the parallel products.

- In the FORTRAN language it is easier and faster to access consecutive elements of a matrix columnwise than rowwise.

[Figure : Column-partition strategy of the linear system of equations H X^(0) = K; initial definition of tasks. The example partitions H into four column blocks H_1, ..., H_4 with matching pieces X_1, ..., X_4 and K_1, ..., K_4.]

In the partitioning strategy, the parameter k is defined as the number of columns to be included in every submatrix. If k does not divide n exactly, the last submatrix will have fewer than k columns. These submatrices are labeled H_1, H_2, ..., H_l.

A second step is to identify row boundaries for every submatrix H_i. The row boundaries for a submatrix H_i are defined by the first and last rows in H_i with nonzero entries. As a result, every submatrix H_i can be considered as a rectangular matrix of dimension n_i x k_i, where n_i is the number of rows between the row boundaries of H_i and k_i is equal to k, except that k_l may be smaller. The partition of the X and K matrices is performed after completing the partition of the H matrix.

Partitioning the X and K matrices results in submatrices X_1, ..., X_l and K_1, ..., K_l, respectively. The X_i submatrices need to be product-compatible with the H_i submatrices. The matrix K is partitioned into submatrices of k rows each, again except for the last submatrix, which may have fewer than k rows. Figure shows a column partition for a system of linear equations with a block size for Block-CG equal to three. From now on, we define a task as a triplet of matrices (H_i, X_i, K_i). The resulting tasks from the partition will be performed in parallel under different programming models, to be introduced in Section ; a sketch of the partitioning step itself is given below.
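The following sketch (an illustrative numpy/scipy reconstruction under the assumptions above, not the thesis' FORTRAN code) builds the task triplets (H_i, X_i, K_i): it cuts H into column blocks of k columns, finds each block's row boundaries as the first and last rows containing nonzeros, and partitions X and K conformally.

```python
import scipy.sparse as sp

def make_tasks(H, X, K, k):
    """Column-partition H into blocks of k columns and build task triplets.

    Each H_i is restricted to its row boundaries (first and last rows holding
    nonzeros); X is split by the same column ranges so that H_i @ X_i is well
    defined, and K is split into blocks of k rows.
    """
    H = sp.csc_matrix(H)
    n = H.shape[0]
    tasks = []
    for start in range(0, n, k):
        cols = slice(start, min(start + k, n))
        Hi = H[:, cols].tocsr()
        rows = Hi.nonzero()[0]                 # rows with nonzero entries
        lo, hi = rows.min(), rows.max() + 1    # row boundaries of H_i
        tasks.append((Hi[lo:hi, :],            # n_i x k_i rectangular block
                      X[cols, :],              # product-compatible piece of X
                      K[cols, :]))             # k_i rows of K
    return tasks
```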

In parallel processing, a reduce operation gathers partial results from parallel computations and merges these partial results to obtain a global one. In Algorithm , the operations

    R^(0) = K - H X^(0)                  (in step )

and

    R^(j+1) = R^(j) - (H P^(j)) lambda^(j)   (in step )

involve, respectively, the products H_i X_i^(0) and H_i P_i^(j). When these products are computed in parallel, one of the CEs performs a reduce operation to obtain the global result for the matrix-matrix product.

[Figure : Example of a reduce operation. The (H X^(0))_i matrices are the partial results from the H_i X_i^(0) products; the global H X^(0) is obtained after merging the partial results and adding the rows of the partial products that overlap.]

j

have dimensions n s and if the H and HP matrix pro duct The partial results HX

i i

i i

l

X

submatrices corresp ond to the blo cks on the diagonal of the matrix H then n n and the

i

i

global result from the pro duct is obtained by simply merging the partial results

l

X

On the other hand if there are rowoverlaps b etween the H submatrices then n nand

i i

i

the global result from the pro duct is obtained by merging the partial results and adding the rows

from the partial results that overlap Figure shows an example of rowoverlaps b etween the

submatrices HX HX HX and HX for the column partitioning strategy illustrated

in Figure Whether these partial pro ducts are computed in a single pro cessor or in a cluster

of pro cessors the information overlaps require data manipulation to obtain a correct answer to

the matrixmatrix pro duct For instance in Figure the second piece of the HX matrix

HX is obtained after the addition of the rows that overlap the three partial matrices HX

and HX
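As an illustration of this merge-and-add step, the sketch below (an assumption-laden toy version in numpy, not the thesis code) accumulates partial products of size n_i x s into the global n x s result, given each block's first global row index.

```python
import numpy as np

def reduce_partials(partials, first_rows, n, s):
    """Merge partial block products into the global n x s result.

    partials   : list of n_i x s arrays (one per task)
    first_rows : global index of the first row covered by each partial result
    Overlapping rows are summed, which is exactly what the reduce operation
    has to do when the H_i row ranges overlap.
    """
    global_result = np.zeros((n, s))
    for part, lo in zip(partials, first_rows):
        global_result[lo:lo + part.shape[0], :] += part
    return global_result
```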

In a parallel distributed environment, this reduce operation can be centralized in one CE or distributed among all the CEs. In the former, each CE sends back its partial result from the product to the CE performing the reduce operation. In the distributed reduce, every CE has information on the neighboring CEs with which it needs to exchange data, and the reduce operation is also performed in parallel. Figure illustrates a centralized reduce operation and Figure illustrates a distributed reduce operation.

[Figure : Centralized reduce operation. CEX is the computing element responsible for gathering the partial results M_1, ..., M_4 from CE1-CE4, merging them into the global matrix GMat, and broadcasting back parts G_1, ..., G_4 of the global result to the other CEs.]

After the completion of the reduce operation in the H X^(0) product, each CE gets back an updated (H X^(0))_i, which is a copy of n_i rows of the global H X^(0) matrix. If the centralized reduce is used, the CE performing the reduce operation sends back these updated matrices to the other CEs; in the distributed reduce, the update of these matrices is done implicitly.

In the parallel computations of the i-th task triplet, the operations in and become

    R_i^(0) = K_i - (H X^(0))_i

and

    R_i^(j+1) = R_i^(j) - (H P^(j))_i lambda^(j).

Since the dimensions of the K_i matrix are k_i x s, a strategy must be defined to select k_i rows from the n_i rows available in (H X^(0))_i. Otherwise, if all of the n_i rows are used in the computations,

[Figure : Distributed reduce operation. Data exchanges are performed in parallel between the CEs; at the end, each CE holds a part G_i of the global result.]

there will be duplicated data in one or more CEs. The duplicated data will perturb the Block-CG computations when not handled properly. Moreover, the k_i rows selected for the i-th task triplet must not overlap with the k_j rows selected for the j-th task triplet (i != j). This strategy is depicted in Figure .

In Figure , GMat can be any global matrix that results from a parallel distributed operation. The Mat_i matrices are in the local memory of a CE. The shaded areas in the Mat_i matrices represent the k_i rows that are used in the computations associated with the i-th task triplet.

The strategy is based on the following assumptions:

- The matrix H is SPD, thus it has a full diagonal.

- k_i is the number of columns in H_i, and sum_{i=1}^{l} k_i = n.

- The matrix GMat has n rows.

The expression in finds the first of the k_i rows in Mat_i, where Fc_i is the column number in H of the first column in H_i, and Fr_i is the row number in H of the first row in H_i. Then

    FirstRow(Mat_i) = |Fc_i - Fr_i|,

where | | denotes the absolute value.
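A toy version of this row-selection rule, under the assumptions listed above (illustrative only; indices are 0-based here, whereas the thesis works with FORTRAN-style 1-based indexing):

```python
def first_row(fc_i, fr_i):
    """First of the k_i rows of Mat_i used locally: FirstRow(Mat_i) = |Fc_i - Fr_i|.

    fc_i : column number in H of the first column of H_i
    fr_i : row number in H of the first row of H_i
    """
    return abs(fc_i - fr_i)

def select_local_rows(mat_i, fc_i, fr_i, k_i):
    """Return the k_i non-duplicated rows of the local matrix Mat_i that this
    task contributes to the global Block-CG computation."""
    start = first_row(fc_i, fr_i)
    return mat_i[start:start + k_i, :]
```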

[Figure : Distribution of a global result matrix GMat. The Mat_i matrices represent either the R_i^(j+1), the P_i^(j+1), or the (H P^(j))_i matrices used in the parallel computations of the Block-CG algorithm. The shaded k_i rows of each Mat_i are the global information used in the local Block-CG computations.]

Scheduling strategy

Among the factors that determine the quality of a parallel implementation are:

- the design of the parallel algorithm,
- the size of the problems to be solved,
- the computing environment,
- the number of CEs used,
- the efficient use of each CE.

A parallel scheduler can be seen as a tool for tuning the last factor given the other four.

In this section we introduce a static scheduler that is used in Section inside the parallel implementations of the Block-CG Algorithm . This static scheduler tries to evenly distribute the total workload among the CEs. A balanced distribution of the workload is important to use all of the CEs efficiently in a parallel run.

This static scheduler considers two parameters for distributing the workload among the CEs: the first is the size of each subproblem that results from the problem partition (i.e., number of rows times number of columns), and the second is the number of overlapping rows between different subproblems. Therefore this scheduler is designed for homogeneous computing environments. In Chapter we present a more general scheduler for heterogeneous computing environments.

First of all, the static scheduler used in the parallel Block-CG implementations calls a function that computes a weight factor per task triplet T_i. For each task T_i, one quantity is the number of nonzeros in H_i, and another is the total number of row overlaps between the submatrix H_i and all of the other submatrices. The function W(T_i) that assigns a weight factor to each task is defined in terms of these two quantities. The nonzero count determines the FLOP count associated with T_i and can thus be used as an estimate of the computational work associated with the task; the overlap count, on the other hand, is used to estimate the communication associated with T_i.

Once the weight factors are computed, the scheduler sorts the tasks in descending order of their weight factors into a work list. The scheduler then dispatches these tasks to the available CEs in the order they appear in the work list.

To maintain workload balance among the CEs, the scheduler keeps track of the workload already assigned to each CE and assigns the next task in the work list to one of the CEs with the lowest workload factor. This scheduling strategy, sketched below, can be seen as a solution to a bin-packing problem; see for instance Dror , Mizuike, Ito, Kennedy and Nguyen , and Mayr .
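A compact illustration of this static scheduling strategy (an assumed reconstruction: the weight factor is taken here as the plain sum of the nonzero count and the overlap count, and ties are broken arbitrarily; the thesis' exact weighting may differ):

```python
import heapq

def static_schedule(tasks, num_ces):
    """Greedy assignment of tasks to CEs, heaviest tasks first.

    tasks : list of (task_id, nnz, overlaps); the weight is assumed to be
            nnz + overlaps (see the lead-in above).
    Returns, for each CE, the list of task ids it was assigned.
    """
    # Sort tasks by decreasing weight, as in the scheduler's work list.
    work_list = sorted(tasks, key=lambda t: t[1] + t[2], reverse=True)

    # Min-heap of (accumulated workload, CE index): always pick the least loaded CE.
    loads = [(0, ce) for ce in range(num_ces)]
    heapq.heapify(loads)
    assignment = [[] for _ in range(num_ces)]

    for task_id, nnz, overlaps in work_list:
        load, ce = heapq.heappop(loads)
        assignment[ce].append(task_id)
        heapq.heappush(loads, (load + nnz + overlaps, ce))
    return assignment
```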

Implementations of Block-CG

We present three different parallel implementations of the Block-CG Algorithm . First, we develop an All-to-All parallel version in which we parallelize the entire Block-CG algorithm. The other two implementations are based on a master-slave computing model, with different approaches to determine which sections of the algorithm are performed in parallel and which ones are performed sequentially.

We have used the p4 library of parallel routines (Butler and Lusk ) to manage the parallel environments on the different machines; we have also developed versions using the PVM library of routines (Beguelin, Dongarra, Geist et al. and Geist, Beguelin, Dongarra et al. ).

The computing elements (CEs) can all be physically part of the same computer or of a network of heterogeneous distributed processors. For the remainder of this chapter we focus our discussion on the processes and the tasks they perform in each implementation, rather than on the CEs, because processes are the active entities in the execution of a program, and on some computer platforms the user does not have direct control over the CEs.

All-to-All implementation

In this implementation we perform a full parallel implementation of the Block-CG algorithm, at the expense of redundant computations that are carried out in parallel by different processes, referred to here as the worker processes. At first there is an initiator process, which creates the other worker processes and generates the task triplets (H_i, X_i, K_i). Later, the initiator process becomes one of the workers and works on some of the generated tasks.

One or more tasks can be assigned to a worker process. After a worker process receives one or more tasks, it builds a set of matrices, one set per task, that corresponds to the set of matrices used in Algorithm . Figure illustrates an example of a task triplet and the set of locally generated matrices associated with the task. During an iteration, a worker process communicates to its neighbors the results of its computations of H_i P_i; each worker needs to communicate only with neighbors whose tasks have information that overlaps its own. Then each worker broadcasts the R_i^T R_i and P_i^T (H P)_i matrices to all the other workers.

[Figure : Block-CG matrices kept locally by each process for task i (H_i, X_i, K_i, R_i, P_i, HP_i and the small Gamma, Beta, Alpha and Lambda matrices). Overlapped rows need to be exchanged with neighboring tasks; there is one set of matrices per task assigned to a process.]

Every worker performs an iteration of the Block-CG and tests whether the stopping criterion has been met. If the stopping criterion has not been met, the worker processes start a new iteration. At the end, when the stopping criterion is satisfied, each worker process reports its part of the solution to the initiator process, which assembles the solution of the whole system. Figure captures the interactions between processes in the All-to-All implementation.

Clearly, the communication between processes has an impact on the performance of the three parallel Block-CG implementations presented in this chapter. Thus, we study the number of messages sent at each iteration for all three parallel Block-CG implementations.

Let m_k be the number of neighboring workers with which worker_k will exchange partial results, and let p be the total number of workers.

For every product H_i X_i^(0) and H_i P_i^(j):

    sum_{k=1}^{p} m_k  =  M messages,

and the length of these messages varies according to the number of rows worker_k needs to exchange with each of its neighboring workers.

For every product R^T R (all-to-all):

    p(p - 1) messages of length s x s.

For every product P^T (H P) (all-to-all):

    p(p - 1) messages of length s x s.

This leads to 2p(p - 1) + M messages, of lengths varying from s x s to max_i(n_i x s), where n_i is the number of rows in H_i.

[Figure : Scheme of the All-to-All parallel Block-CG implementation. The flowchart alternates sequential and parallel phases: global and local initialisation; computation of R_i^(0) from H_i X_i^(0); completion of the local R_i and HP_i through exchanges with specific processes; building of the local gamma_i = R_i^T R_i and beta_i = P_i^T HP_i with broadcasts; local Block-CG computations; and the termination test.]

Master-Slave distributed Block-CG implementation

In this implementation we use a master-slave programming model in which the master process monitors the program execution while the Block-CG algorithm is run at the slave process level. Initially, a master process is created, which in turn creates the slave processes and generates the tasks (H_i, X_i, K_i). The master calls the static scheduler to distribute the tasks. After scheduling the tasks and dispatching them to the slave processes, the master process performs three reduce operations at each iteration.

In the first reduce, the master collects from the slave processes their results from computing the products R_i^T R_i. The master then combines these partial results to build the full R^T R. The resulting matrix is referenced as gamma in the Block-CG Algorithm . This matrix is later factorized by the master and broadcast back to the slave processes. During this reduce operation only the master is active while the slave processes wait for the answer; this problem was previously identified as a synchronization point in Chapter for the CG algorithm.

In the second reduce, the master collects the partial products P_i^T (H P)_i from the slave processes. This time the resulting matrix corresponds to the matrix beta in the Block-CG Algorithm . The master also factorizes it and broadcasts back the results to the slave processes. In the third reduce operation, the master runs the test for termination. For the convergence test, the master gathers information that is local to the slave processes (local residuals and ||X_i^(j)||), and when the stopping criterion is met, the master process signals termination to the slave processes. Otherwise, the master process signals continuation to the slave processes, and the slaves start a new parallel iteration.

The slave processes initially receive from the master process at least one task to execute. Then the slave processes build locally their partial matrices, and these matrices are kept in memory until the master process signals termination. In Figure , the partial matrices gamma_i and beta_i are computed locally by each slave process; the first and second reduce operations transform these partial matrices into the global gamma and beta, respectively.

[Figure : Scheme of the Master-Slave distributed parallel Block-CG implementation. As in the All-to-All scheme, the slaves complete their local R_i and HP_i and compute gamma_i = R_i^T R_i and beta_i = P_i^T HP_i, but the global gamma and beta are built by the master (all slaves send to one process), which broadcasts the results back before the local Block-CG computations and the termination test.]

As mentioned before, there may exist row overlaps between two or more tasks while computing the H_i X_i and H_i P_i products. A process working on a task with row overlaps has to communicate with its neighbors to complete its part of the global product. In addition, one or more tasks with rows that overlap may be scheduled to the same slave process, and in this case the overlap is handled locally.

To study the average number of messages sent at every iteration step, we use the same set of assumptions made for the All-to-All implementation (Section ).

For every product H_i P_i:

    sum_{k=1}^{p} m_k  =  M messages,

and the length of these messages varies according to the number of rows slave_k will exchange with each of its neighboring slaves.

For every product R_i^T R_i, one message in every direction between the master and slave_k:

    2p messages of length s x s.

For every product P_i^T (H P)_i, one message in every direction between the master and slave_k:

    2p messages of length s x s.

This results in 4p + M messages, of lengths varying from s x s to max_i(n_i x s), where n_i is the number of rows in H_i.

Figure captures the interaction between the master process and the slave processes under this parallel implementation.

Master-Slave centralized Block-CG implementation

Lastly, we present an implementation based on a master-slave programming model that differs from the previous one in the roles of the master and the slave processes. In this parallel implementation, only the H P^(j) product is parallelized.

The master process generates the tasks by dividing the matrix H from into submatrices H_i. Here the initial tasks are no longer the task triplets (H_i, X_i, K_i) but (H_i, X_i^(0)) for the first iteration and (H_i, P_i^(j)) for the main iteration (see Figure ). The master process spawns a group of slave processes and then distributes the H_i submatrices among the slave processes. The master process executes all the steps of the Block-CG algorithm except for the products H P^(j), which are performed in parallel by the slave processes. Therefore, the master process partitions and distributes the matrix P^(j) in a way that is product-compatible with the already distributed H_i's. In the reduce operation, the slave processes send their partial (H P^(j))_i to the master process, and the master assembles the global H P^(j) matrix.

Figure shows the interaction between the master and the slave processes in this parallel implementation. At the end of an iteration, the master process performs the test for termination and signals termination or continuation to the slave processes.

We conduct the same study as before to calculate the number of messages sent at each iteration. For every product H_i P_i:

    p messages of length n_i x s

in every direction between the master process and the slave processes, resulting in 2p messages being sent at each iteration step.

Table summarizes the number of messages sent at each iteration of the Block-CG Algorithm under each of the parallel implementations presented here. The third implementation (Master-Slave centralized Block-CG) involves fewer messages than the other two implementations. However, the messages sent in the Master-Slave centralized Block-CG implementation are longer than the messages sent in either of the other two implementations: in this implementation the master process sends at each iteration the P_i^(j) submatrices to each slave process, while in the other two implementations this information is kept locally and processes only exchange parts of the (H P^(j))_i and small s x s matrices.

[Figure : Scheme of the Master-Slave centralized parallel Block-CG implementation. Only the HP products are computed in parallel: the slaves compute (H P)_i = H_i P_i and send them to the master (all send to one process), which completes R and HP, performs the remaining Block-CG computations, including gamma = R^T R and beta = P^T HP, and runs the termination test.]

Table : Messages sent per iteration by each parallel Block-CG implementation.
    All-to-All:                          2p(p - 1) + sum_{k=1}^{p} m_k
    Master-Slave distributed Block-CG:   4p + sum_{k=1}^{p} m_k
    Master-Slave centralized Block-CG:   2p

Chapter

Parallel Block-CG experiments

The purpose of this chapter is to compare the performance of the three parallel Block-CG implementations described in Chapter and to show the advantages of parallelizing the Block-CG algorithm.

In this chapter we assume throughout all the experiments that a parallel run of Block-CG with a block size of one is numerically equivalent to a run of Classical CG, as can be verified from Algorithms and . Thus the results from runs of Classical CG are presented under the columns labeled Block-CG with block size of one in all of the tables in this chapter.

In the three parallel Block-CG implementations, a run with only one CE involves extra overhead that is not present in a sequential run of the Block-CG implementation. In the All-to-All implementation the overhead is very small and is due to the data manipulation provided for handling the parallelism. In the other two implementations the overhead comes from the creation of one slave process, and this overhead is more significant than the one in the All-to-All implementation.

The matrix that results from squaring the SHERMAN matrix (see Section ) is used in the first parallel experiment. Next, the LANPRO NOS and LANPRO NOS problems are solved in parallel. For these two problems the block sizes have been chosen based on the results and remarks from Chapter .

In Section a different problem is used. It comes from a finite difference model of Laplace's equation (see for example Strang ):

    \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0.

When the right-hand side of is different from zero, the equation is Poisson's equation. Applying a central difference approximation to the second derivatives,

    \frac{\partial^2 u}{\partial x^2} \approx \frac{u(x+h) - 2u(x) + u(x-h)}{h^2}

and

    \frac{\partial^2 u}{\partial y^2} \approx \frac{u(y+h) - 2u(y) + u(y-h)}{h^2},

the right-hand side approaches the left-hand side as h goes to zero. The differential equations \partial^2 u / \partial x^2 = f and \partial^2 u / \partial y^2 = f are replaced by a finite difference approximation of the values of u at the mesh points x = h, 2h, ..., Nh.

Figure (a) shows the 5-point molecule resulting from the combination of and . The boundary conditions at the edge of the grid are u = 0. The points in the grid are labeled from 1 to n, n = N x N, and there is one linear equation per grid point. The resulting matrix A is depicted in Figure (b). The matrix is SPD with 4's along the main diagonal and -1's in the other diagonals; because of the boundary conditions, some -1's are removed from the other diagonals.

A problem based on Poisson's equation is solved in Section . The grid size is , thus the matrix A is of order with nonzero elements.

[Figure : 5-point discretization of Laplace's equation using finite differences. (a) The 5-point approximation on an N x N grid with n = N x N unknowns; (b) structure of the matrix A that results from the discretization, with 4's on the main diagonal and -1's on the off-diagonals.]
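For concreteness, the following sketch (an illustrative reconstruction, not the thesis code; the grid size below is arbitrary and not the one used in Section ) assembles the SPD matrix A of the 5-point discretization with homogeneous Dirichlet boundary conditions, using a Kronecker-product formulation.

```python
import numpy as np
import scipy.sparse as sp

def laplacian_5pt(N):
    """Matrix of the 5-point finite-difference Laplacian on an N x N grid with
    u = 0 on the boundary: 4 on the diagonal, -1 on the off-diagonals (some
    -1's disappear at the grid edges)."""
    e = np.ones(N)
    T = sp.spdiags([-e, 2 * e, -e], [-1, 0, 1], N, N)   # 1D second-difference matrix
    I = sp.identity(N)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()      # order n = N*N

A = laplacian_5pt(32)
print(A.shape, A.nnz)   # (1024, 1024) with 4992 nonzero elements for this grid size
```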

In this chapter the performance of the parallel implementations is measured in terms of two parameters: the speedup and the efficiency. The speedup is defined by

    speedup = (execution time of the best sequential run) / (execution time of the parallel run using p CEs),

and the efficiency by

    E = speedup / p.

The speedup is the factor of reduction in the execution time when moving from a sequential implementation to a parallel implementation, and ideally the speedup equals the number of CEs used during a parallel run. In some less favorable cases the parallel implementation requires more time than the sequential implementation, and thus speedup < 1. The efficiency gives an estimate of the contribution of the p processors to the speedup, and ideally it is equal to one.
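A small worked illustration of these two measures (the numbers are made up and are not taken from the tables in this chapter):

```python
def speedup(t_seq, t_par):
    """Speedup: best sequential execution time divided by the parallel execution time."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Efficiency: speedup divided by the number of CEs used."""
    return speedup(t_seq, t_par) / p

# Example: a 40 s sequential run reduced to 12.5 s on 4 CEs.
print(speedup(40.0, 12.5), efficiency(40.0, 12.5, 4))   # 3.2 and 0.8
```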

The following parallel experiments were run on three different parallel computer systems. The first platform is a Thin-node SP with RS processors. Each node has a theoretical peak performance of Mflops. The nodes are connected through a high performance switch with a latency time of microseconds and a bandwidth of to Mbytes per second.

The second platform is a BBN TC computer system installed at Argonne National Laboratory. The TC system is a virtual shared memory computer with a peak performance of Mflops per node. The nodes are interconnected through a multistage network processor called the Butterfly switch.

The last computing platform is a network of SUN Sparc workstations. Each workstation has a peak performance of Mflops, and they are connected through ETHERNET. The peak performance of the ETHERNET network is Mbytes per second.

Parallel solution of the SHERMAN problem

Some advantages of using Block-CG over Classical CG were presented in Section with the results from sequential runs of Block-CG on the A^T A from SHERMAN problem. In this section the results from parallel runs of Block-CG on the A^T A SHERMAN problem are presented. The block sizes used in these experiments are the same as in Section , and Tables to summarize the results from the three different parallel Block-CG implementations.

The results from parallel runs of the Block-CG All-to-All implementation are shown in Tables and . The runs were performed on an SP computer. The bottlenecks of parallelizing Block-CG show up in these tables as the low speedups obtained. Increasing the block size gives better speedups; however, the efficiency factors decrease as the number of CEs is increased.

The execution time of a sequential run of Block-CG solving the A^T A SHERMAN problem with a block size of is almost the same as the execution time of a sequential run of Classical CG on the same problem. As shown in Section , this block size converges in fewer equivalent iterations than Classical CG, and after parallelizing the Block-CG algorithm the execution time for the solution of the A^T A SHERMAN problem is reduced by almost with this block size when computing on CEs.

In Tables and , the results from parallel runs of the Block-CG Master-Slave distributed implementation are presented. The runs were performed on an SP computer system. The parallel Classical CG implementation using the Master-Slave distributed scheme also reports low speedups, but they are better than the speedups obtained with the All-to-All implementation, and this is due to the number of messages being exchanged in each implementation at both synchronization points.

In the All-to-All implementation, the execution of the synchronization points takes more time than in the Master-Slave distributed implementation because of the number of messages being exchanged. In both implementations only a scalar is exchanged in a message, which does not compensate for the expense of the communication relative to the computations.

Increasing the block size improves the performance of the Master-Slave distributed implementation. In Table , the execution time for the solution of the A^T A SHERMAN problem using CEs is reduced by almost from the sequential version. In Table the performance is improved even more when increasing the block size, because a reduction in the execution time of almost from the sequential implementation is obtained; in this case the execution time is reduced by almost from the execution of the sequential Classical CG.

Even though in some cases Block-CG takes a few more iterations to converge than Classical CG, with the Master-Slave distributed implementation it is possible to present an example in which the performance of Block-CG is improved by the parallel implementation. According to the results reported in Table , Block-CG with a block size of is computationally more expensive than Classical CG; however, the performance of Block-CG with that block size is also improved, as the results presented in Table show.

Table : Results from the All-to-All parallel implementation of Block-CG on the solution of the A^T A SHERMAN problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

In Tables and , the results from the Master-Slave centralized parallel Block-CG implementation are presented. This parallel implementation of Classical CG and Block-CG performs very poorly, and the reason is that the granularity of the work performed in parallel is very small compared to the granularity of the work performed sequentially.

Increasing the granularity of the H_i P_i products may improve the behavior of this implementation, since increasing the computational weight of the parallel products compensates for the expensive cost of distributing the P matrices at each iteration.

Table : Results from the All-to-All parallel implementation of Block-CG on the solution of the A^T A SHERMAN problem, for two further block sizes (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Table : Results from the Master-Slave distributed parallel implementation of Block-CG on the solution of the A^T A SHERMAN problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Table : Results from the Master-Slave distributed parallel implementation of Block-CG on the solution of the A^T A SHERMAN problem, for two further block sizes (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Table : Results from the Master-Slave centralized parallel implementation of Block-CG on the solution of the A^T A SHERMAN problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Table : Results from the Master-Slave centralized parallel implementation of Block-CG on the solution of the A^T A SHERMAN problem, for two further block sizes (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Also, in a shared memory parallel environment this implementation may perform better, because the communications are avoided and the access to shared memory can be synchronized in a less expensive way than exchanging messages. In the next sections we perform a more detailed analysis of this implementation when solving the LANPRO NOS and NOS problems.

Parallel solution of the LANPRO NOS problem

The purpose of this experiment is to compare the performance of the sequential and parallel implementations of the Classical CG Algorithm and the Block-CG Algorithm , using a large block size for Block-CG.

In this experiment the LANPRO NOS problem is solved in parallel on the SP computer. The block sizes are chosen for the experiment based on the sequential Block-CG experiments presented in Chapter .

To be able to compare the sequential and parallel implementations on the SP computer, a selection of the sequential experiments from Chapter is repeated in this section. Table summarizes the results from the sequential runs on the SP .

Table : Sequential runs of the Classical CG and Block-CG implementations for solving the LANPRO NOS problem on an SP computer system (times in seconds). Columns: LANPRO problem | block size | normwise backward error | execution time.

The results from the solution of the LANPRO NOS problem using the All-to-All parallel Block-CG implementation are presented in Table . Although the computations of Block-CG with the chosen block size continue to be more expensive than with Classical CG, a parallel run of Block-CG on CEs has reduced the execution time of the sequential Block-CG run by almost .

The parallel All-to-All implementation performs very poorly for Classical CG. An analysis of the runs of the All-to-All implementation has shown that the communication time between CEs is a dominant factor in the total execution time. This is due to the small granularity of the parallel Classical CG operations: basically, all of the Level 3 BLAS operations from Block-CG are replaced by Level 1 and Level 2 BLAS operations in Classical CG.

In these experiments, the highest speedup reported for the All-to-All parallel Block-CG implementation is attained with CEs. The efficiency suggests that of the total computational power of the nodes has been used.

Table : Results from the All-to-All parallel implementation of Block-CG on the solution of the LANPRO NOS problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Table shows the results from the solution of the LANPRO NOS problem using the Master-Slave distributed parallel Block-CG implementation. The execution time of the sequential Block-CG is reduced by almost with the parallel Block-CG using CEs. The highest speedup is attained with processors, but the efficiency is reduced as we increase the number of processors.

Table : Results from the Master-Slave distributed parallel implementation of Block-CG on the solution of the LANPRO NOS problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Lastly, Table summarizes the results from runs of the Master-Slave centralized parallel Block-CG implementation when solving the LANPRO NOS problem. Amdahl's law (Amdahl ) helps to explain the poor performance of this implementation. A parallel implementation may have sequential and parallel sections; the sequential sections are sometimes called synchronization points. Moreover, if alpha is the ratio of computations performed in sequence, then 1 - alpha is the ratio of computations performed in parallel. Amdahl's law defines an upper bound S_infinity for the speedup:

    S_\infty = \lim_{p \to \infty} \frac{1}{\alpha + (1 - \alpha)/p} = \frac{1}{\alpha}.
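A small numeric illustration of this bound (the sequential fractions below are placeholders, not the figures reported in Table ):

```python
def amdahl_speedup(alpha, p):
    """Speedup predicted by Amdahl's law for a sequential fraction alpha on p CEs."""
    return 1.0 / (alpha + (1.0 - alpha) / p)

def amdahl_bound(alpha):
    """Upper bound S_inf = 1/alpha, approached as the number of CEs grows."""
    return 1.0 / alpha

# Placeholder sequential fractions (not the thesis' measured values):
for alpha in (0.5, 0.2, 0.05):
    print(alpha, amdahl_speedup(alpha, 8), amdahl_bound(alpha))
```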

In the Master-Slave centralized Block-CG implementation, only the part of the total FLOP count that is related to A is performed in parallel. Table shows the percentages of the total Block-CG FLOP count that come from A for the LANPRO NOS problem: of the total FLOP count comes from A with the smaller block size, and this factor is reduced to with the larger block size.

In these cases alpha is equal to for the smaller block size and equal to for the larger one. Thus S_infinity is equal to and for the two block sizes, respectively. The lack of speedups in the results shown in Table is explained by these small values of S_infinity. This is not the case for the other two implementations, which have higher ratios between the parallel and sequential sections.

Table : Results from the Master-Slave centralized parallel implementation of Block-CG on the solution of the LANPRO NOS problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Parallel solution of the LANPRO NOS problem

The LANPRO NOS problem introduced in Chapter is solved in these experiments. Table summarizes the results from sequential runs of Block-CG on the SP computer. Based on the results presented in Section , three block sizes were chosen.

Table : Sequential runs of the Classical CG and Block-CG implementations for solving the LANPRO NOS problem on an SP computer system (times in seconds). Columns: LANPRO problem | block size | normwise backward error | execution time.

Table presents the results from runs of the All-to-All parallel implementation of Block-CG on the solution of the LANPRO NOS problem. For all of the block sizes used in this case, the granularity is smaller than the granularity obtained with the large block size on the LANPRO NOS problem; thus the speedups have decreased compared to the speedups shown in Table . In this case the maximum speedups are attained using CEs with the two larger block sizes. The parallel performance of Block-CG with a block size of one is very poor in this case, because of the required synchronizations discussed in Section and the low granularity of Classical CG.

Table : Results from the All-to-All parallel implementation of Block-CG on the solution of the LANPRO NOS problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Table shows the results of runs of the Master-Slave distributed Block-CG implementation. As in the case of the All-to-All Block-CG implementation, the speedups have also decreased when compared to the speedups reported in Table . The maximum speedup with the smaller block size is attained with CEs, and with the larger block size it is attained with CEs. The execution times for Block-CG are shorter than the execution times of the sequential and parallel Classical CG implementations. The efficiency of CEs in the case of the larger block size is higher than the efficiency attained with the same CEs in the case of the smaller block size, and this is due to an increase in the granularity after increasing the block size.

The results of the Master-Slave centralized implementation are presented in Table . Again, this implementation reports poor speedups. From the information in Table and Amdahl's law, it can be deduced that the upper bounds on the speedups in these experiments are for the three block sizes. However, the maximum speedup reported from these experiments for the block size of one is , and and for the other two block sizes. The poor performance of the parallel Classical CG is due to a straight parallel implementation of Classical CG, without the considerations discussed in Section .

Lastly, Table shows comparative results from different runs of the sequential Classical CG and the parallel versions of Block-CG. The purpose of this table is to illustrate the advantages of the parallel Block-CG over a sequential Classical CG, even in those cases presented in Chapter where Block-CG was a few equivalent iterations more expensive than Classical CG.

Table : Results from the Master-Slave distributed parallel implementation of Block-CG on the solution of the LANPRO NOS problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Table : Results from the Master-Slave centralized parallel implementation of Block-CG on the solution of the LANPRO NOS problem (times in seconds). For each block size: the execution time of the sequential Block-CG version, then the execution time, speedup and efficiency E per number of CEs.

Table : Time comparison between runs of the sequential Classical CG and the parallel Block-CG implementations. For each LANPRO problem the table gives the sequential Classical CG execution time (secs), the block size, and, for the All-to-All, Master-Slave distributed and Master-Slave centralized implementations, the number of CEs and the execution time (secs). A marker indicates the cases where a parallel Block-CG implementation has performed better than the sequential implementation of Classical CG.

For each of the LANPRO problems there are one or more entries in Table . Each entry in the table compares a run of the sequential Classical CG against the best time obtained in the solution of the same problem with each parallel Block-CG implementation. Along with the entry for a parallel implementation there is a column that identifies the number of CEs used in the solution of the problem, and another entry that identifies the block size.

The marked entries in Table are the favorable cases in which a parallel Block-CG implementation has performed better than the sequential Classical CG implementation; in those cases the execution time has been reduced by the effects of parallelism, despite the extra FLOP count generated by Block-CG. In the Master-Slave centralized Block-CG implementation none of the best times improves on the performance of the sequential Classical CG. In the other two implementations almost all of the cases are favorable, except the solution of the LANPRO NOS problem with the large block size.

In the sequential Block-CG implementation, the solution of LANPRO NOS with the large block size required far more FLOPs than Classical CG. In the Master-Slave distributed Block-CG implementation, the large block size enlarges the sequential sections where the master performs the centralized reduce operations and factorizes the gamma^(j) and beta^(j) matrices; in addition, the length of the messages exchanged between the master and the slave processes is drastically increased (see Table ).

In the All-to-All Block-CG implementation, the FLOP count from the redundant operations is drastically increased with the large block size, and, as in the Master-Slave distributed implementation, the length of the messages is also increased. Furthermore, in the All-to-All implementation the long messages are more expensive because they are broadcast to all the workers.

CHAPTER PARALLEL BLOCKCG EXPERIMENTS

Fixing the numberofequivalent iterations

In the parallel Classical CG and Blo ckCG implementations it has b een observed that there

are numerical and computational issues that inuence the p erformance of a parallel Blo ckCG

implementation These issues can b e summarized by

The structure of the iteration matrix eg size sparsity pattern etc

The blo ck size used in Blo ckCG

The numb er of equivalent iterations

The stopping criterion

The computing platform

The numb er of CEs used in a computer run

The computational scheme of the parallel implementation

The first four issues are related to the numerics of the problem being solved. The influence of the computing platform and of the number of CEs is studied in Sections and .

So far, when comparing the parallel Block-CG implementations, the results have been influenced mainly by some of the numerical issues. For instance, the Block-CG iterations are stopped after reducing the normwise backward error or the residual error below a threshold value. Therefore the iteration count has varied as the block size was varied, and the execution time of a parallel Block-CG run depended not only on the parallel implementation but also on the iteration count.

In this section the purpose of the experiments is to study the influence of each parallel implementation on the performance of Block-CG. The number of equivalent iterations is fixed, to isolate the issues related to the computational scheme from the ones related to the stopping criterion.

To preserve the numerical meaning in the following exp eriments Table shows the qualityof

the solution after k iterations of Classical CG and Blo ckCG when solving the LANPRO NOS

problems The blo ck sizes of and are used in the parallel Blo ckCG

Table: Work done after k equivalent iterations of Classical CG and Block-CG. For each LANPRO NOS problem and block size, the table reports the number of orthogonal vectors computed and the normwise backward error.


In Tables ... the Block-CG iterations are fixed. When the execution time of the parallel Block-CG is less than the execution time of the parallel Classical CG, a mark is placed next to the execution time to represent the reduction in the execution time; in these cases, the effect of increasing the block size speeds up the execution of a parallel Block-CG implementation. On the other hand, a different mark is placed next to the entries in the table when the execution time of the parallel Block-CG is more than the execution time of the parallel Classical CG. Entries without either mark are assumed to take almost the same time for both Block-CG and Classical CG.

Table ... shows the results of this experiment with the All-to-All Block-CG implementation. As observed earlier, the parallel Classical CG using an All-to-All scheme simply increases the execution time as the number of CEs is increased, because the cost of the operations performed in parallel is smaller than the cost of the communications. In this case, the communication time has a great influence on the overall execution time, and in distributed computing environments this communication becomes the bottleneck of parallelizing the Classical CG Algorithm.

In the parallel Block-CG with both block sizes, it is observed that the execution time is reduced as the number of CEs is increased. Increasing the block size also speeds up the execution, and in Table ... this effect is seen when moving from the smaller block size to the larger one for a given number of CEs.

Table: Execution times, in seconds, from runs of the All-to-All parallel Block-CG implementation; for each number of CEs, the table lists the time of the parallel Classical CG and of the parallel Block-CG for each block size.

Table ... shows the results from repeating the same experiment with the Master-Slave distributed Block-CG implementation. This implementation presents a better compromise because the communication is overlapped with useful computations, and this effect is also present in the parallel Classical CG, where the execution time is reduced as the number of CEs is increased.

In the same table, it is observed that the execution times of the parallel Block-CG do not change much when changing from one block size to the other. The iteration count with the larger block size is half the iteration count with the smaller one; thus, with the larger block size, the number of communications is reduced by half, and this compensates for the increase in the FLOP count as the block size is increased.

In almost all of the cases reported in Table ..., the execution time of the parallel Block-CG is less than the execution time of Classical CG.

Table: Execution times, in seconds, from runs of the Master-Slave distributed parallel Block-CG implementation; for each number of CEs, the table lists the time of the parallel Classical CG and of the parallel Block-CG for each block size.

Table ... shows the results from running this experiment with the Master-Slave centralized implementation. Contrary to the other two implementations, the granularity of the problem does not compensate for the cost of communication. Thus, increasing the block size also increases the size of the sequential section and the length of the messages being sent, while decreasing the percentage of the total FLOP count that is performed in parallel.

Table: Execution times, in seconds, from runs of the Master-Slave centralized parallel Block-CG implementation; for each number of CEs, the table lists the time of the parallel Classical CG and of the parallel Block-CG for each block size.


Comparing different computing platforms

The computing platform has an influence on the performance of a parallel implementation. The three parallel Block-CG implementations were designed for general parallel distributed computing systems and do not include specifically tuned algorithms to enhance the performance on any particular computer system. Nevertheless, the relation between the speed of the processors and the speed of the communication network can enhance the performance of a parallel implementation and, in the case of Block-CG, can even favor one implementation over another.

The purpose of this section is to perform parallel runs of the Block-CG implementations on three different computing platforms. In each computing platform a different library subroutine was used to compute the execution times, and for some platforms the execution times are more accurate than for others. Thus we do not attempt to compare the computational performance of one computer system against another.

In Table ..., the results of solving the LANPRO NOS problem on the SP computer are presented. In this case the block size has been fixed. We observe that the Master-Slave distributed Block-CG implementation reports the best speedups compared to the other two implementations. However, the maximum speedup in this case is obtained with an efficiency factor that remains low.
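For reference, the speedup and efficiency figures reported in these tables are read here with the usual definitions (this convention is assumed for clarity, not taken from the tables themselves): if $T_{\mathrm{seq}}$ is the sequential execution time used as the baseline and $T_p$ the time on $p$ CEs, then
$$ S_p = \frac{T_{\mathrm{seq}}}{T_p}, \qquad E_p = \frac{S_p}{p}, $$
so an efficiency factor close to one means that the $p$ CEs are almost fully used, while a speedup growing more slowly than $p$ translates into a decreasing $E_p$.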

In the same table it is observed that the All-to-All implementation provides the maximum efficiency factor with one or more CEs. When using one CE, this implementation generates less overhead than the other two, and the execution time may in some cases be only slightly longer than the execution time of the sequential Block-CG implementation.

The maximum efficiency attained with more than one CE is attained by the All-to-All Block-CG implementation using a small number of CEs. In the All-to-All implementation, distributed reduce operations are used. In a computing platform with a fast communication network, the use of distributed reduce operations is more favorable than the use of centralized ones. However, increasing the number of CEs in a parallel implementation that uses distributed reduce operations has a greater impact on the speedups and efficiency than increasing the number of CEs in a parallel implementation that uses centralized reduce operations. This effect is also observed in Table ... as the number of CEs is increased; with the larger numbers of CEs, the All-to-All implementation has the lowest efficiency of the three implementations.

The results in Table ... come from runs on the BBN TC computer. The problem LANPRO NOS is solved with the same block size. The processors in the BBN TC are slower than the processors in the SP. The BBN TC also has a fast network switch, and the combination of relatively slow processors and a relatively fast network reduces the overall expense of communication relative to useful computations.

This last effect is reflected in the speedups shown in Table ..., which have increased for the All-to-All and Master-Slave distributed Block-CG implementations with respect to the speedups reported for the SP computer. As remarked earlier, in the Master-Slave centralized implementation the speedup is less affected by the speed of the network and the speed of the processors; the upper bounds on the speedup of this implementation when solving the LANPRO NOS problem were shown in Section ..., and the upper bounds computed there are consistent with the results in Tables ... and ...

Lastly, the results from runs on the network of SUN SPARC workstations are shown in Table ... Here the network is very slow compared with the networks of the other two computing platforms. As a result, the efficiency rates reported for the three implementations are smaller than those reported in the two previous tables.


Execution time of the Block-CG sequential version (secs):
Table: Performance of the parallel Block-CG implementations while solving the LANPRO NOS problem. For each number of CEs, the table reports the time, speedup and efficiency E of the All-to-All, Master-Slave distributed and Master-Slave centralized implementations. Times in the table are in seconds. The parallel runs were performed on an SP computer.

Block-CG sequential time (millisecs):
Table: Performance of the parallel Block-CG implementations while solving the LANPRO NOS problem. For each number of CEs, the table reports the time, speedup and efficiency E of the All-to-All, Master-Slave distributed and Master-Slave centralized implementations. Times in the table are in seconds. The parallel runs were performed on a BBN TC computer.


In this case, the use of centralized reduce operations in the Master-Slave distributed implementation becomes very expensive as the number of CEs is increased. This effect comes from the serial delivery of messages through a slow network. As shown in Table ..., the problem is solved faster with the All-to-All implementation than with the other two implementations.

Execution time of the Block-CG sequential version (secs):
Table: Performance of the parallel Block-CG implementations while solving the LANPRO NOS problem. For each number of CEs, the table reports the time, speedup and efficiency E of the All-to-All, Master-Slave distributed and Master-Slave centralized implementations. Times in the table are in seconds. The parallel runs were performed on a network of SUN SPARC workstations.

Parallel solution of the POISSON problem

The purpose of this section is to compare runs of the three parallel Block-CG implementations, as in Section ..., after increasing the granularity of the problem to be solved. In these experiments, the problem to be solved comes from the discretization of Poisson's equation. The same block size is used in all the experiments of this section.

Table ... presents the results from parallel runs on the SP computer. In the All-to-All and Master-Slave distributed implementations, the increase in the granularity has improved their performance. This is not the case for the Master-Slave centralized implementation, in which the increase in the granularity has also increased the computational weight of the sequential sections.

In the All-to-All implementation, the efficiency obtained with a small number of CEs is higher than the efficiency obtained for the LANPRO NOS problem, and in this case the Master-Slave distributed implementation also exhibits the same efficiency with that number of CEs. In the All-to-All implementation, the efficiency decreases at a faster rate than in the Master-Slave distributed implementation. For these two implementations, the speedups and efficiency rates are again increased when repeating this experiment on the BBN TC computer.

Execution time of the Block-CG sequential version (secs):
Table: Performance of the parallel Block-CG implementations solving a Poisson's equation problem. For each number of CEs, the table reports the time, speedup and efficiency E of the All-to-All, Master-Slave distributed and Master-Slave centralized implementations. Times in the table are in seconds. The parallel runs were performed on an SP computer.

Table ... shows the results from runs of this experiment on a BBN TC computer. The Master-Slave distributed implementation shows a slow decrease in the efficiency rates, which results in a very high speedup on the largest number of CEs.

The results from the same runs on the network of SUN SPARC workstations are presented in Table ... The trends observed when solving the LANPRO NOS problem on the SUN SPARC workstations are confirmed by the results in Table ...: the All-to-All Block-CG implementation has performed better than the other two implementations on the network of workstations, and in the All-to-All and Master-Slave distributed implementations the speedups and efficiency have increased as the granularity has been increased.

Block-CG sequential time (millisecs):
Table: Performance of the parallel Block-CG implementations solving a Poisson's equation problem. For each number of CEs, the table reports the time, speedup and efficiency E of the All-to-All, Master-Slave distributed and Master-Slave centralized implementations. Times in the table are in seconds. The parallel runs were performed on a BBN TC computer.

Execution time of the Block-CG sequential version (secs):
Table: Performance of the parallel Block-CG implementations solving a Poisson's equation problem. For each number of CEs, the table reports the time, speedup and efficiency E of the All-to-All, Master-Slave distributed and Master-Slave centralized implementations. Times in the table are in seconds. The parallel runs were performed on a network of SUN SPARC workstations.

Remarks

In this chapter we have shown the advantages of parallelizing the Block-CG Algorithm. In some cases it has been shown that the parallel Block-CG will perform better than the sequential Classical CG, even when a few more FLOPs are computed with the Block-CG implementation.

In distributed computing environments, the poor performance of the Master-Slave centralized Block-CG implementation shows that parallelizing only the HP products does not compensate for the use of more than one CE, given the small ratio between the parallel and sequential computations. However, this implementation may perform better in the event that the matrix H comes from a preconditioner that needs to be recomputed at each iteration.

Additionally, the Master-Slave centralized Block-CG implementation is simpler to implement than the other two parallel Block-CG implementations presented in this chapter, and it has fewer critical sections (i.e., sections of a parallel code where only one CE is allowed to update the data at a time) than its counterparts in shared memory environments, because these critical sections appear every time a reduce operation is performed.

The granularity of the problem, the number of CEs used in a run and the characteristics of the network have a greater impact on the performance of the All-to-All implementation than on the other two implementations. Therefore, the All-to-All implementation will perform better than its counterparts in some cases, when a fair compromise is found between the number of CEs and the granularity of the subproblems being solved.

A fair compromise can only be found after a few runs of the All-to-All parallel Block-CG implementation, using a different number of CEs at each run. Clearly, tuning the parallel efficiency is only worthwhile when the linear system of equations is solved more than once, with different values for the coefficients of the equations at each solution; these types of systems usually appear in many application fields like structural engineering, climate modeling, operations research, etc. From the experiments reported in this chapter, it is known that when a fair compromise is found, the All-to-All implementation performs better than the other two implementations in terms of parallel efficiency (see for instance Table ...). On the other hand, this fair compromise also depends on the computing platform being used and, when running in time-sharing mode, it will also depend on the computational load of the computer system. Alternatively, the Master-Slave distributed implementation has exhibited a more consistent performance, even when moving from one computing platform to another and when solving linear systems of different sizes.

The Master-Slave distributed implementation has exhibited linear speedups because of its balance between communication and computations. Moreover, the speedups increase as the granularity of the problem is increased. In Chapter ..., we use the Master-Slave distributed Block-CG implementation inside a block Cimmino iteration to accelerate its convergence rate. The resulting implementation is extended to general unsymmetric sparse systems. Furthermore, inside a block Cimmino iteration, the synchronization points are overlapped with useful computations.

In general, the Master-Slave distributed implementation suits distributed computing environments better.

Chapter

Solving general systems using Block-CG

The Block-CG algorithm only guarantees convergence in the solution of symmetric positive definite systems. With the use of a preconditioner, the algorithm can be extended to the solution of general unsymmetric systems. This can be done by using Block-CG as an acceleration procedure inside another basic iterative method. In this chapter we study two iterative row projection methods for solving general systems of equations and use the Block-CG algorithm to accelerate the rate of convergence of a row projection method.

An overview of the block Cimmino method is presented in Section ..., and the block Kaczmarz method is briefly introduced in Section ... An implementation of the block Cimmino method accelerated with the stabilized Block-CG Algorithm is presented in Section ...

General issues arising from a computer implementation of the block Cimmino method accelerated with the Block-CG algorithm are discussed in Section ..., and the corresponding parallel implementation is presented in Section ...

The block Cimmino method

The block Cimmino method is a generalization of the Cimmino method (see Cimmino ...). Basically, the linear system of equations
$$ Ax = b, $$
where $A$ is an $m \times n$ matrix, is partitioned into $l$ subsystems, with $l \le m$, such that
$$
\begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_l \end{pmatrix} x \;=\;
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_l \end{pmatrix}.
$$


Notice that the term block in block Cimmino refers to the matrix blocks that result from this partition, and not to the number of right-hand sides as in Block-CG.

Figure ... illustrates a geometric interpretation of the block Cimmino method in a planar case. The block method computes a set of $l$ row projections, and a combination of these projections is used to build the next approximation to the solution.

Algorithm ... describes the block Cimmino iteration; the matrix $A_i^+$ is the Moore-Penrose pseudoinverse of the submatrix $A_i$, defined by
$$ A_i^+ = A_i^T \left( A_i A_i^T \right)^{-1}, $$
and $\mathcal{P}_{\mathcal{R}(A_i^T)}$ is the projector onto the range of $A_i^T$, given by
$$ \mathcal{P}_{\mathcal{R}(A_i^T)} = A_i^+ A_i . $$

Algorithm ... Block Cimmino

$x^{(0)}$ is arbitrary
For $j = 0, 1, \ldots$ until convergence do
    For $i = 1, \ldots, l$ do
        $\delta_i^{(j)} = A_i^+ b_i - \mathcal{P}_{\mathcal{R}(A_i^T)}\, x^{(j)} = A_i^+ \left( b_i - A_i x^{(j)} \right)$
    $x^{(j+1)} = x^{(j)} + \nu \sum_{i=1}^{l} \delta_i^{(j)}$
Stop

One of the advantages of the block Cimmino method is the natural parallelism in Step ..., and for this reason its parallel implementation for distributed computing environments is studied in Section ...
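The iteration above can be sketched in a few lines. The following is a minimal dense sketch (using NumPy's Moore-Penrose pseudoinverse; all names are chosen here for illustration), not the implementation described later in this chapter, which works on sparse augmented systems instead of explicit pseudoinverses.

    import numpy as np

    def block_cimmino(A_blocks, b_blocks, nu=1.0, n_iter=100, x0=None):
        # Dense sketch of the block Cimmino iteration: the l row projections
        # are independent (hence naturally parallel) and their sum, scaled
        # by the relaxation parameter nu, updates the iterate.
        n = A_blocks[0].shape[1]
        x = np.zeros(n) if x0 is None else x0.copy()
        for _ in range(n_iter):
            delta = np.zeros(n)
            for A_i, b_i in zip(A_blocks, b_blocks):
                r_i = b_i - A_i @ x                  # local residual of block i
                delta += np.linalg.pinv(A_i) @ r_i   # delta_i = A_i^+ (b_i - A_i x)
            x = x + nu * delta
        return x

The inner loop over the blocks is the part that the distributed implementation performs in parallel.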

In Algorithm ..., the Moore-Penrose pseudoinverse $A_i^+$ is used. However, the block Cimmino method will converge for any other pseudoinverse of $A_i$ (Campbell and Meyer ...). Thus we use the generalized pseudoinverse
$$ A_{i,G}^+ = G^{-1} A_i^T \left( A_i G^{-1} A_i^T \right)^{-1}, $$
where $G$ is an ellipsoidal norm matrix.

When solving for the $\delta_i$'s in Algorithm ..., the augmented systems approach (Bartels, Golub and Saunders ...; Hachtel ...) is used because it is computationally more reliable and less expensive than the normal equations.

Figure ...: Block Cimmino geometric interpretation. In this case a two block-row partition is depicted.

Here the augmented system equations for the $A_i$'s are written as
$$
\begin{pmatrix} G & A_i^T \\ A_i & 0 \end{pmatrix}
\begin{pmatrix} u_i \\ v_i \end{pmatrix}
=
\begin{pmatrix} 0 \\ b_i - A_i x \end{pmatrix},
$$
with solution
$$ v_i = -\left( A_i G^{-1} A_i^T \right)^{-1} r_i, \qquad u_i = \delta_i = A_{i,G}^+ \left( b_i - A_i x \right), $$
where $r_i = b_i - A_i x$.
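A minimal sketch of this augmented-system solve is given below. It is an illustration of the linear algebra only: SciPy's general sparse direct solver stands in for the MA27 Analyse/Factorize/Solve phases used in the actual implementation, and G is assumed here to be the sparse ellipsoidal norm matrix associated with block i.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve

    def cimmino_projection_augmented(A_i, G, b_i, x):
        # Compute delta_i = A_{i,G}^+ (b_i - A_i x) by solving the symmetric
        # indefinite augmented system [[G, A_i^T], [A_i, 0]] [u; v] = [0; r_i].
        m_i, n = A_i.shape
        r_i = b_i - A_i @ x
        K = sp.bmat([[G, A_i.T], [A_i, None]], format='csc')
        rhs = np.concatenate([np.zeros(n), r_i])
        z = spsolve(K, rhs)
        return z[:n]          # u_i = delta_i; the multipliers v_i are discarded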

In Step ... of Algorithm ..., the choice of the relaxation parameter $\nu$ has an impact on the convergence of the algorithm.

It can be verified that the block Jacobi iteration matrix applied to the system
$$ A A^T y = b, \qquad x = A^T y $$
(see Hageman and Young ...) is similar to the iteration matrix of the block Cimmino method (see for instance Ruiz ...). For this reason, in a study of row projection methods, Elfving refers to the block Cimmino method as the block-row Jacobi method (Elfving ...). In Elfving's paper he shows that the convergence of the block-row method depends on the relaxation parameter $\nu$ and on the spectral radius of the matrix defined by
$$ M = \sum_{i=1}^{l} \mathcal{P}_{\mathcal{R}(A_i^T)} = \sum_{i=1}^{l} A_i^T \left( A_i A_i^T \right)^{-1} A_i = \sum_{i=1}^{l} A_i^+ A_i . $$
If $b \in \mathcal{R}(A)$ and $x^{(0)} \in \mathcal{R}(A^T)$, then the block-row method converges towards the minimum norm solution if and only if
$$ 0 < \nu < \frac{2}{\lambda_{\max}(M)} . $$
The value of $\nu$ giving the optimal asymptotic rate of convergence is
$$ \nu_{\mathrm{opt}} = \frac{2}{\lambda_{\max} + \lambda_{\min}}, $$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest nonzero eigenvalues of the matrix $M$.
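As a purely illustrative numerical example (these eigenvalues are made up and are not taken from the experiments), if $\lambda_{\max}(M) = 1.8$ and $\lambda_{\min}(M) = 0.2$, then
$$ \nu_{\mathrm{opt}} = \frac{2}{1.8 + 0.2} = 1, $$
while any $\nu$ in the interval $(0,\, 2/1.8) \approx (0,\, 1.11)$ would still yield convergence, only at a slower asymptotic rate.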

The block Kaczmarz method

The block Kaczmarz method is another row projection method and is a generalization of the Kaczmarz method (Kaczmarz ...). As in the block Cimmino method, the block structure of the original matrix is obtained by partitioning $A$ as above, and a block Kaczmarz algorithm can be defined as in Algorithm ... Figure ... illustrates a geometric interpretation of the block Kaczmarz algorithm.

Algorithm ... Block Kaczmarz

$x^{(0)}$ is arbitrary
For $j = 0, 1, \ldots$ until convergence do
    $z^{(0)} = x^{(j)}$
    For $i = 1, \ldots, l$ do
        $z^{(i)} = \left( I - \nu\, \mathcal{P}_{\mathcal{R}(A_i^T)} \right) z^{(i-1)} + \nu A_i^+ b_i = z^{(i-1)} + \nu A_i^+ \left( b_i - A_i z^{(i-1)} \right)$
    $x^{(j+1)} = z^{(l)}$
Stop

Figure ...: Block Kaczmarz geometric interpretation. In this case a two block-row partition is depicted.

Contrary to the block Cimmino algorithm, block Kaczmarz is sequential in nature. The Kaczmarz projections are computed one at a time in Step ... This is also illustrated in Figure ... for the planar case. The current iterate is projected onto the different subspaces, using the updated value from one subspace to compute the next one. If $\nu$ is greater than one, then an over-projection is performed; otherwise an under-projection is performed. After computing the projections over all the subspaces, the new iterate is obtained (Step ...).
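For comparison with the block Cimmino sketch above, one outer block Kaczmarz iteration can be sketched as follows (again a dense, purely illustrative version using the Moore-Penrose pseudoinverse). Note that the inner loop cannot be parallelized across blocks, since each projection uses the result of the previous one.

    import numpy as np

    def block_kaczmarz_sweep(A_blocks, b_blocks, x, nu=1.0):
        # One outer iteration: apply the l projections one after the other,
        # each one starting from the value updated by the previous block.
        z = x.copy()
        for A_i, b_i in zip(A_blocks, b_blocks):
            z = z + nu * (np.linalg.pinv(A_i) @ (b_i - A_i @ z))
        return z              # x^(j+1) = z^(l)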

The Kaczmarz method has also been studied by Elfving, who shows that the method is equivalent to the block SOR method applied to the system above. Thus Elfving refers to the Kaczmarz method as the block-row SOR method.

The relaxation parameter $\nu$ in the block Kaczmarz method does not depend on the spectral radius of the matrix $M$, as it does in the block Cimmino method. Furthermore, the block Kaczmarz method converges towards the minimum norm solution when $b \in \mathcal{R}(A)$ and $x^{(0)} \in \mathcal{R}(A^T)$. The differences between the convergence of the block Cimmino and the block Kaczmarz methods are similar to those between the Jacobi and SOR methods for solving symmetric positive definite linear systems (see Hageman and Young ...).

The iteration matrix of the block Kaczmarz method,
$$ \prod_{i=1}^{l} \left( I - \nu\, \mathcal{P}_{\mathcal{R}(A_i^T)} \right), $$
is not symmetric and, in general, its eigenvalues are not real but complex. As for the symmetric SOR, or SSOR, method, a symmetric version of the block Kaczmarz method can be derived; thus the Kaczmarz method is also referred to as the block SSOR method. The symmetric block Kaczmarz method has a symmetric iteration matrix with real eigenvalues, and with these properties it can be accelerated with the Block-CG Algorithm. Kamath and Sameh present a study of the robustness of the conjugate gradient acceleration for block-row projection methods such as the block SSOR.

One of the advantages of the block Kaczmarz method over the block Cimmino method is that the relaxation parameter is independent of the spectral radius of the matrix $M$. On the other hand, one can improve the solution time of a system of equations by exploiting the natural parallelism of the block Cimmino method, whereas the symmetric block Kaczmarz method can only be parallelized for some matrices with a special structure, in which a block partitioning allows the computations over two or more subspaces to proceed independently and in parallel (see Arioli, Duff, Noailles and Ruiz ...). Comparisons between the block Cimmino method and the block SSOR method can be found in the works of Bramley (Bramley and Sameh ...) and of Arioli, Duff, Noailles and Ruiz (...).

The Block-CG acceleration of basic stationary iterative methods does not require that the basic iterative method be convergent. In the block Cimmino method, the rate of convergence can still be slow even with the optimal $\nu$, and this motivates the study, in the next section, of a procedure to accelerate its convergence rate. A similar procedure can be derived for the symmetric block Kaczmarz method (see Ruiz ...).

Block Cimmino accelerated by Block-CG

The block Cimmino method is a linear stationary iterative method with a symmetrizable iteration matrix (for a definition of symmetrizable iterative methods, see Hageman and Young ..., page ...). The use of ellipsoidal norms ensures the conditions that make the block Cimmino iteration matrix SPD.

Any basic iterative method for the solution of
$$ Ax = b $$
may be expressed as
$$ x^{(j+1)} = Q\, x^{(j)} + k, $$
where $Q$ is the iteration matrix. The block Cimmino iteration matrix is $Q = I - \nu M$. From the definition of the projections and Step ... in Algorithm ..., the expression above becomes
$$ x^{(j+1)} = (I - \nu M)\, x^{(j)} + \nu \sum_{i=1}^{l} A_i^+ b_i . $$

Figure ...: The block Cimmino method accelerated with the stabilized Block-CG algorithm. A preprocessing phase builds the augmented systems from the blocks $A_i$ and analyses and factorizes them with MA27; the solve phase combines the sum of projections (Cimmino) with a preconditioned Block-CG iteration, repeated until convergence.

Let $D$ be the block diagonal matrix with diagonal blocks $A_i A_i^T$; the matrix $M$ can then be rewritten as
$$ M = A^T D^{-1} A . $$
From this expression, the matrix $M$ is symmetric positive semidefinite in general, and SPD if $A$ is square and has full rank. Using the generalized pseudoinverse in the iteration gives
$$ x^{(j+1)} = (I - \nu M^G)\, x^{(j)} + \nu \sum_{i=1}^{l} A_{i,G}^+ b_i, $$
with $M^G$ defined as
$$ M^G = \sum_{i=1}^{l} A_{i,G}^+ A_i . $$

An SPD block Cimmino iteration matrix can be used as a kind of preconditioning matrix for the Block-CG method. We recall that Block-CG simultaneously searches for the next approximation to the system's solution in $s$ Krylov subspaces and, in the absence of roundoff errors, converges to the system's solution in a finite number of steps. Figure ... depicts the Block-CG acceleration described in Algorithm ..., and from Algorithm ... it can be seen that the convergence of the algorithm is independent of the relaxation parameter $\nu$.

Algorithm ... Block-CG acceleration for block Cimmino

$X^{(0)}$ is arbitrary; $P^{(0)} = R^{(0)}$
For $j = 0, 1, \ldots$ until convergence do
    $R^{(j)} = \sum_{i=1}^{l} A_{i,G}^+ B_i - M^G X^{(j)}$  ($R^{(j)}$ is the pseudo-residual block vector)
    $X^{(j+1)} = X^{(j)} + P^{(j)} \gamma_j$
    $P^{(j+1)} = R^{(j+1)} + P^{(j)} \beta_j$
    with
    $\gamma_j = \left( P^{(j)T} G M^G P^{(j)} \right)^{-1} \left( R^{(j)T} G R^{(j)} \right)$
    $\beta_j = \left( R^{(j)T} G R^{(j)} \right)^{-1} \left( R^{(j+1)T} G R^{(j+1)} \right)$
Stop
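The structure of this acceleration can be illustrated with a stripped-down sketch. For readability, the sketch below uses a single right-hand side (the $s = 1$ case, i.e., classical CG instead of Block-CG), takes $G = I$, and applies the sum-of-projections operator densely through NumPy's pseudoinverse; it is meant only to show how the Cimmino operator plays the role of the matrix in the CG recurrences, not to reproduce the actual parallel implementation.

    import numpy as np

    def cimmino_operator(A_blocks, x):
        # Apply M x = sum_i A_i^+ A_i x  (Moore-Penrose case, G = I).
        return sum(np.linalg.pinv(A_i) @ (A_i @ x) for A_i in A_blocks)

    def cg_accelerated_cimmino(A_blocks, b_blocks, tol=1e-10, maxit=200):
        # Solve M x = sum_i A_i^+ b_i with conjugate gradients; M is the
        # symmetric positive semidefinite sum-of-projections operator.
        n = A_blocks[0].shape[1]
        g = sum(np.linalg.pinv(A_i) @ b_i for A_i, b_i in zip(A_blocks, b_blocks))
        x = np.zeros(n)
        r = g - cimmino_operator(A_blocks, x)   # pseudo-residual
        p = r.copy()
        rs = r @ r
        for _ in range(maxit):
            Mp = cimmino_operator(A_blocks, p)
            gamma = rs / (p @ Mp)
            x += gamma * p
            r -= gamma * Mp
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

In the block version, the scalars gamma and the ratio of residual norms become the small matrices computed and factorized by the master process.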

Computer implementation issues

At first, the matrix $A$ is partitioned row-wise into $l$ blocks according to the partition above, and inside a block the columns with only zero elements are not explicitly stored.

In Figure ...(a), the linear system is partitioned into three blocks of rows. In Figure ...(b), the blocks are illustrated after discarding the columns with only zero elements. A section of columns is defined as a group of contiguous columns with nonzero elements such that all the columns in one section appear in exactly the same blocks. In Figure ...(c) there are seven sections of columns: one section contains the contiguous columns with nonzeros that appear in $A_1$ only, another the contiguous columns with nonzeros that appear in $A_1$ and $A_2$, another the contiguous columns with nonzeros that appear in $A_1$, $A_2$ and $A_3$, and so forth.

The number of sections of columns depends on the sparsity pattern of the columns of the matrix $A$, and this number can be very large. If the iterative scheme is designed to manipulate the blocks by sections of columns, as is the case in this implementation, then a large number of sections of columns may prevent us from solving the system of equations in a reasonable amount of time. Therefore, an amalgamation parameter is used to scan groups of columns instead of a single column at a time.

The purpose of removing the columns with only zero elements from the blocks and of clustering the columns into sections of columns is to avoid performing unnecessary operations on operands with zero value. In addition, storage space is saved because the zero elements are not explicitly stored.
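A small sketch of how sections of columns might be identified from the row-block partition is given below; it is an illustration only (the actual implementation works on sparse data structures and also applies the amalgamation parameter to groups of columns).

    from itertools import groupby

    def column_sections(block_col_sets, n_cols):
        # A section of columns is a run of contiguous columns whose nonzeros
        # appear in exactly the same set of row blocks.  block_col_sets[i] is
        # the set of column indices with at least one nonzero in block A_i.
        def signature(j):
            return frozenset(i for i, cols in enumerate(block_col_sets) if j in cols)

        sections = []
        for sig, run in groupby(range(n_cols), key=signature):
            cols = list(run)
            if sig:   # columns that are zero in every block are discarded
                sections.append((cols[0], cols[-1], sorted(sig)))
        return sections

    # Example: three row blocks over six columns.
    # A_1 touches columns 0-3, A_2 columns 2-4, A_3 columns 4-5.
    print(column_sections([{0, 1, 2, 3}, {2, 3, 4}, {4, 5}], 6))
    # -> [(0, 1, [0]), (2, 3, [0, 1]), (4, 4, [1, 2]), (5, 5, [2])]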

After analysing the structure of the blocks, the data structures for handling the augmented systems are generated. Figure ... illustrates the augmented systems obtained from the block-row partition in Figures ...(a)-(c).

After building the $l$ augmented systems, each one is solved using the sparse symmetric linear solver MA27 from the Harwell Subroutine Library (Duff and Reid ...; AEA Technology ...). The MA27 solver is a multifrontal method which computes the $LDL^T$ decomposition. The MA27 solver has three main phases: Analyse, Factorize and Solve. During the MA27 Analyse phase, the minimum degree criterion is used to choose the pivots from the diagonal; the order of the eliminations is represented in an elimination tree that is traversed using a depth-first search. In the MA27 Factorize phase, the matrix is decomposed using the assembly and elimination ordering from the MA27 Analyse phase. The MA27 Solve phase uses the factors generated in the previous phase to solve the system of equations.

With the generation of the augmented systems, the size of the subproblems is increased, and exploiting the sparsity pattern of the augmented systems minimizes the storage space required to store the augmented subsystems.

In general, the ellipsoidal norm matrices $G_i$ are diagonal. The nonzero elements in a $G_i$ are stored in a one-dimensional array of length less than or equal to $c_i^G m_i$, where $c_i^G$ is the maximum number of nonzeros in a row of $G_i$ and $c_i^G \le m_i$. For $A_i$ and $A_i^T$, only $A_i$ needs to be stored, since $A_i^T$ can be obtained implicitly from $A_i$; the $A_i$ matrices are stored in a general sparse matrix format.

The zero blocks in the lower right corner of each augmented system are not explicitly stored in memory. Thus, the only increase in the storage requirement comes from the nonzero elements of the ellipsoidal norm matrices.

The solve phase of the Block-CG acceleration is depicted in Figure ..., and it is an implementation of Algorithm ... At each iteration, the augmented systems are solved to obtain the $\delta_i$'s. The sum of the $\delta_i$'s is used as the residual vector in the Block-CG iteration, and this procedure is repeated until convergence is reached.

Parallel implementation

This section deals with the parallel distributed implementation of the block Cimmino method with the Block-CG acceleration studied in Section ... The parallel implementation of the Block-CG acceleration is designed for general distributed computing environments, including heterogeneous computing environments.

Figure ...: Row partitioning of the original linear system of equations and identification of column sections. In (a), the subsystems resulting from the partition; in (b), the same subsystems after identifying the blocks of nonzeros in each subsystem; in (c), the identification of the column sections S1 to S7.

Figure ...: Building the augmented systems from the subsystems resulting from the partition. Illustration of the augmented systems corresponding to the example of system partition presented in Figure ...

The Master-Slave distributed parallel Block-CG implementation presented in Section ... is used inside the parallel Block-CG acceleration. A few additions are made to the parallel Block-CG implementation to accommodate the use of the block Cimmino iteration matrix.

Figure ... illustrates the computational flow of the parallel implementation of the block Cimmino method accelerated with the Block-CG algorithm.

Figure ...: Scheme of the parallel Block-CG acceleration for the block Cimmino method. The master controls the flow through the Pre-Analyse, Pre-Factorize and Pre-Solve phases, while the slaves perform the Analyse and Factorize phases and the Block-CG solve in parallel; master-slave communication is overlapped with the computations.

In the Pre-Analyse phase, the master process is responsible for partitioning the system of equations and identifying the sections of columns. These partitions are specified by the user as an input parameter.

With the sections of columns, the master is able to identify data overlaps between the different blocks. This information is later transmitted to the slave processes, and with it each slave process builds its local information about its computational neighborhood (e.g., the identification of its neighboring slave processes and the data it needs to communicate to each of them).

Before the master process sends information to the slave processes, it calls a scheduler to evenly distribute the workload among the slave processes; this distribution is based on the partitioning of the system and on the data overlaps between the sections of columns.

After scheduling the solution of the different subsystems, the master process dispatches to each slave the information for one or more subsystems. During this first dispatch, the master process sends to the slave processes the sparsity pattern of each augmented system and the description of the corresponding sections of columns.

In the MA27 Analyse phase, only the structure of the matrix is used; the numerical values are used later, in the MA27 Factorize phase. For this reason, during the first dispatch the master process sends only the structure of the augmented systems to the slave processes, and each slave process calls the routine that implements the MA27 Analyse phase on one or more augmented systems. While the slave processes call the routine that performs the MA27 Analyse, the master prepares the data messages for the parallel numerical factorization of the subsystems.

Once the messages are prepared, the master process sends to each slave process the numerical data that corresponds to each augmented system. In this way, the second dispatch of information is expected to overlap with the computations of the MA27 Analyse phase that are performed in parallel by the slave processes.

With the numerical information, the slave processes call the routine that performs the MA27 Factorize phase, and at the same time the master process prepares the messages with the right-hand side values and sends them to the slave processes.

Lastly, the master and slave processes take their roles in the Block-CG computations, as described for the Master-Slave distributed parallel Block-CG implementation in Section ... However, some procedures need to be modified to accommodate the use of the block Cimmino iteration matrix in Block-CG.

Basically, there are two main differences between the data manipulation in the parallel Block-CG and that in the parallel Block-CG acceleration. First, as discussed in Section ..., the system of equations is partitioned column-wise in Block-CG, whereas it is partitioned row-wise in the block Cimmino method; to overcome this difference, the blocks of rows are manipulated in sections of columns (see Figure ...).

Secondly, in the Master-Slave distributed Block-CG implementation there is a distributed reduce operation in which the slave processes exchange the results of partial products and each slave builds locally a part of the global product. Furthermore, the use of the diagonal of the matrix helps to identify the contributions of each slave process to the global solution. In the Block-CG acceleration this cannot be used, because the matrix $A$ may not have a full diagonal.

Two solutions to this problem are considered. The first one distributes section ownership roles among the slave processes; a slave process then monitors a reduce operation on one or more sections of columns. In other words, the distributed reduce operation becomes a centralized reduce operation, and the slave process that owns a section of columns is responsible for collecting the local results from its neighbors and broadcasting back the global results.

The number of sections of columns varies with the sparsity pattern of the matrix $A$ and the amalgamation parameter discussed in Section ... A large number of sections of columns will congest the network and degrade the overall performance of the parallel implementation. Moreover, the assignment of the section ownership roles is not trivial, because the parameters that determine the computational weight involved in a centralized reduce operation are only known after a few iterations of the Solve phase, and calibrating the workload after a few iterations of the Solve phase is computationally expensive in the case of the Block-CG acceleration because of all the local data structures that have already been generated by a slave process.

Alternatively, a distributed reduce operation can be used at the expense of redundant computations that are performed locally by two or more slave processes. In this case, each slave process manipulates the data it receives from its neighbor processes and eliminates the redundant data while building a part of the global solution.

In comparison with the parallel Block-CG implementations (see Chapter ...), the granularity of the parallel Block-CG has been increased by the use of the block Cimmino iteration matrix. The increase in the granularity and the effects of the overlaps between sections require a better scheduling strategy than the one presented in Chapter ..., in order to overcome the communication bottlenecks and benefit from the parallel scheme that overlaps communications with useful computations.

Chapter

A scheduler for heterogeneous environments

One of the advantages of working in heterogeneous computing environments is the ability to provide a level of computing performance proportional to the number of resources available in the system and to their different computing capabilities. A scheduler is regarded as a strategy, or policy, for efficiently managing the use of these resources. In parallel distributed environments with homogeneous resources, the level of performance is commensurate with the number of resources present in the system.

A scheduler for parallel iterative methods in heterogeneous computing environments is presented in this chapter. An overview of some scheduling techniques is given in Section ... In Section ... we present a static scheduler for heterogeneous environments and justify its application to parallel Block-CG like iterative methods. The different modules of the scheduler can be reused for other parallel iterative, direct or semi-direct methods.

In heterogeneous computing environments, the scheduler not only considers information about the tasks to be executed in parallel, but must also consider information about the capabilities of the CEs. In Section ..., a syntax for specifying heterogeneous environments is defined. Finally, in Section ..., an example of an application of the scheduler is presented; in the example, the scheduler distributes a set of tasks arising from the parallel Block-CG acceleration of the block Cimmino method presented in Chapter ...

Taxonomy of scheduling techniques

There is a great variety of procedures for solving the problem of distributing a set of resources to a group of consumers in an optimal manner, and this problem is known to be NP-complete (see Garey and Johnson ...). In general, these procedures receive different names depending on the discipline in which they are used. In the context used here, the scheduler can be characterized by a mapping function that assigns work to a set of CEs; thus in some of the literature the problem has been referred to as the mapping problem (see for instance Heddaya and Park ...; Talbi and Muntean ...).

Several authors have classified the different scheduling strategies in a taxonomy. A taxonomy of scheduling strategies is presented in Figure ..., which is an extension of the taxonomies presented by Casavant and Kuhl ... and Talbi and Muntean ... In this chapter a static greedy scheduling strategy is used, and a short overview of the strategies in Figure ... will help to explain our choice of a static greedy scheduling strategy.

Figure ...: Taxonomy of some scheduling strategies: static versus dynamic; optimal versus sub-optimal; sub-optimal strategies divided into approximate ones (mathematical programming, queuing theory, graph theory, enumerative) and heuristic ones (greedy or iterative); dynamic strategies divided into physically distributed (cooperative or non-cooperative) and non-distributed ones. The iterative branch includes general-purpose strategies such as genetic algorithms, hill climbing and simulated annealing, and specific strategies such as process clustering, min-cut partitioning and routing limitation.

The first classification in the taxonomy illustrated in Figure ... is related to the time at which the scheduling strategy is applied. In a pure static strategy, the scheduling decision is made before the parallel processing starts. In a dynamic strategy, decisions about the distribution of tasks are made continually throughout the parallel processing. Therefore, a dynamic strategy is preferred when dynamic creation of tasks or allocation of processors is allowed.

In a dynamic scheduling strategy, very little information is known about the needs of the tasks and the characteristics of the computing resources. Dynamic strategies can be performed in a distributed fashion, when many processors can run the scheduling strategy (see Gao, Liu and Railey ...), or in a non-distributed fashion, in which the dynamic scheduling strategy is centralized in a single processor (see Stone ...).

The classification into cooperative or non-cooperative relates to whether or not the different processors consult one another while performing a dynamic scheduling strategy in a distributed fashion. In the cooperative case, a globally optimal scheduling solution can be sought. In some cases, the high cost of computing an optimal scheduling solution can be reduced by accepting a relatively good scheduling solution, and in this case a sub-optimal scheduling strategy is used. The same scheduling strategies found under the static sub-optimal classification can be used for the dynamic sub-optimal one.

Here, a static scheduling strategy is considered because, in iterative methods, it is generally possible to collect a priori information about the parallel tasks before the parallel processing phase begins. Also, data locality is an important issue for several iterative methods, and moving data around will only increase the communication cost, which already represents a bottleneck in the performance of some parallel iterative methods.

In a static optimal scheduling strategy, the assignment of tasks is based on an optimality criterion, which may involve the minimization of the total execution time or the maximization of the usage of the available resources in the system (see for instance Bokhari ...; Gabrielian and Tyler ...; Shen and Tsai ...). The size of the search space for this problem grows exponentially with the number of resources and tasks involved in a parallel run. Furthermore, the problem has been proven to be NP-complete when the number of processors goes to infinity (see Chretienne ...).

Our attention is therefore restricted to sub-optimal scheduling strategies, because of the exponential cost of optimal scheduling strategies. A sub-optimal strategy can be based on a search for a good approximation to the solution inside a given search space. Several search algorithms have been developed under this principle, and some of them are used in static optimal scheduling strategies. These strategies are not studied in more detail here, and the reader is referred to the following studies of static approximate strategies:

- mathematical programming: Bokhari ...; Gabrielian and Tyler ...; Ma, Lee and Tsuchiya ...;
- queuing theory: Kleinrock ...; Kleinrock and Nilsson ...;
- graph theory: Bokhari ...; Shen and Tsai ...;
- enumerative: Shen and Tsai ...

Static heuristic scheduling strategies make the most realistic assumptions about a priori knowledge concerning the tasks and the individual performance of each processor (see for instance Efe ...), and for this reason they are preferred in this work over static approximate strategies.

In a static greedy scheduling strategy, part of a scheduling solution is given and this solution is expanded until a complete schedule of tasks is obtained; only one task assignment is made at each expansion. On the other hand, iterative scheduling strategies are initialized with a complete schedule of tasks, and this schedule is improved through iterations. At the end of each iteration, the schedule is evaluated using a function, and the output of this function is tested against a stopping criterion (e.g., minimum execution time, maximum utilization of the allocated processors).

Depending on the function that evaluates the current schedule, an iterative strategy can be based on some well known general-purpose strategies, or on specific scheduling solutions that have been proposed for certain problems; in Figure ... the instances of specific scheduling strategies are process clustering (Lo ...) and routing limitation (Bokhari ...), which are not discussed further here. Some examples of general-purpose strategies are genetic algorithms, hill climbing and simulated annealing.

Genetic algorithms are stochastic search techniques introduced by Holland ..., and they are regarded as optimization algorithms based on a space search in which each point is represented by a string of symbols. Each string combination is assigned a value based on a fitness function. The algorithm starts with a basic set of initial points in the space, called the basic population, and a new set of points is generated at each iteration. The process is called reproduction, because the newly generated points are combinations of the current population. Some of these newly generated points are then discarded, using the fitness function as the elimination criterion. From the points that survive, the algorithm generates a new sequence, and this iteration is repeated for a given number of generations.

Hill climbing algorithms (Johnson, Papadimitriou and Yannakakis ...) find a global minimum only in convex search spaces. The algorithm starts with a complete schedule and tries to improve it by local transformations. At the end of each iteration, the new schedule is evaluated and, if the cost of the move from the old schedule to the new one is positive, the move is accepted. This process is repeated until there are no more possible changes to the scheduling solution that would further reduce the cost function; thus a local minimum is found rather than a global one.

Simulated annealing algorithms scan a search space using Markov chains to apply a sequence of random local transformations to a system that has been submitted to a high temperature; these transformations affect the temperature of the system until a state of equilibrium is reached. Simulated annealing algorithms (Kirkpatrick, Gelatt and Vecchi ...) are sequential in nature and difficult to parallelize (Greening ...), and scheduling strategies based on these algorithms are more costly than the two previous iterative strategies (Talbi and Muntean ...).

A static scheduler for block iterative methods

In this section we propose a scheduler for synchronous block iterative methods. The proposed scheduler has a number of parameters that are tuned to suit the particular scheduling needs of a block iterative method. In the taxonomy illustrated in Figure ..., the proposed scheduler is classified as a static greedy scheduling strategy.

A static greedy strategy is preferred for the following reasons:

- In many synchronous block iterative methods, a task is associated with a partition of the system of equations, and these partitions are preserved through the iterations. Thus no dynamic tasks are generated, and there is no advantage in using a dynamic scheduling strategy over a static one.
- Finding an optimal distribution of the workload is ideal, but it is not a necessary condition for parallel block iterative methods. Furthermore, optimal strategies are more expensive to implement and to compute than sub-optimal ones.
- As discussed earlier, heuristic strategies are preferred over approximate ones because heuristic strategies make the most realistic assumptions about a priori knowledge concerning the tasks and the CEs.
- A greedy strategy is preferred over the iterative ones because iterative scheduling strategies are generally more expensive to compute than greedy ones, and it is not always possible to find a fitness function or a good criterion for stopping the iterations.

In general, the scheduling problem for block iterative methods on heterogeneous computing environments is first approached through a symbolic description of the partition of the problem to be solved (the system of equations) and of the heterogeneous computing environment.

Without loss of generality, the linear system of equations
$$ Ax = b $$
can be divided into $l$ subsystems of equations, and the resulting subsystems are not necessarily mutually exclusive (e.g., some equations can be repeated in more than one subsystem). In general, this partition can be replaced by a different one when solving a different type of problem with a block iterative method.

The division of the system into the $l$ subsystems can be represented by an undirected graph of subsystems given by
$$ G_s = (V_s, E_s), $$
where $V_s$ is the set of vertices $\{ V_{s_i},\ i = 1, \ldots, l \}$, with one vertex per subsystem of equations, and $E_s$ is the set of edges that represent a row or column overlap between subsystems. If any two subsystems have rows or columns that overlap, then there is an edge between them and, depending on the number of overlapping rows or columns, an integer value is associated with the edge to represent the potential communication between the two subsystems. The value associated with an edge is greater than zero; a zero value is reserved for subproblems that are mutually exclusive, whose edges are not explicitly included in the graph.

On the other hand, the heterogeneous computing environment is described as a metacomputer with $p$ CEs. CEs are identified by computing characteristics that differentiate one from the other (e.g., computer name, computing speed, number of processors, architecture, etc.). Similarly to the graph of subsystems, a graph of CEs is defined by
$$ G_{ce} = (V_{ce}, E_{ce}), $$
where $V_{ce}$ is the set of vertices $\{ V_{ce_i},\ i = 1, \ldots, p \}$ and each vertex represents a CE in the metacomputer. Moreover, there are three different types of vertices. $V_{ce_R}$ is identified as the root vertex, used for monitoring the parallel processing in the metacomputer; usually, the parallel environment is generated from the CE represented by the root vertex. A vertex in $G_{ce}$ can also be a single processor or a cluster of processors; in the latter case, the cluster is represented by an undirected subgraph of CEs, $G_{ce_i} = (V_{ce_i}, E_{ce_i})$, that expresses the interconnection between the CEs inside the cluster. A cluster can be shared or distributed memory.

In this definition, $E_{ce}$ is the set of edges of the metacomputer's graph, and each edge represents the interconnection network between the different CEs. Figure ... depicts an example of a CE graph. In the figure, one of the clusters is a distributed memory cluster, whose vertices are the CEs inside the cluster; another is a shared memory cluster, represented by a subgraph with no edges, since its CEs are connected through a pool of memory rather than a physical communication network.

The scheduling strategy can be defined as a mapping function
$$ f : V_s \rightarrow V_{ce} $$
such that
$$ f(v_{s_i}) = v_{ce_j}, \qquad i = 1, \ldots, l, \quad 1 \le j \le p . $$

Figure ...: An example of a graph of CEs, showing the root vertex $V_{ce_R}$, cluster vertices $G_{ce_i}$ and single-processor vertices $V_{ce_i}$.

Some authors have studied $f$ as a continuous function (for instance Prasanna and Musicus ...), with an infinite number of CEs, where any fraction between 0 and 1 of a CE can be used to solve a subsystem. However, in practice $f$ is taken to be a discrete function, and a subsystem can only be assigned to one CE.

To balance the workload distribution, each $v_{s_i}$ is assigned a computational weight using a discrete function defined as
$$ W : V_s \rightarrow \mathbb{N}, $$
where $\mathbb{N}$ is the set of natural numbers. Then
$$ W(v_{s_i}) = \sum_{j=1}^{r} \alpha_j\, \sigma_{j,i}, $$
where the $\alpha_j \in \mathbb{R}$ are scalars and the $\sigma_{j,i}$ are parameters relevant to the subproblem (e.g., number of nonzeros, size of the subproblem, etc.). The scalars $\alpha_j$ control the significance of the parameters $\sigma_{j,i}$.

Similarly, a function is defined to measure different computational aspects of the CEs. This function outputs a number that is used as a priority for assigning work to the CEs, such that a CE with a higher priority will be assigned tasks before a CE with a lower priority:
$$ P : V_{ce} \rightarrow \mathbb{N}, \qquad P(v_{ce_i}) = \sum_{j=1}^{r} \beta_j\, p_{j,i}, $$
where the $\beta_j \in \mathbb{R}$ are scalars that control the significance of the parameters, or specifications, $p_{j,i}$ of each CE.

Additionally, a workload function is defined as
$$ \Omega : V_{ce} \rightarrow \mathbb{N} $$
such that
$$ \Omega(v_{ce_j}) = \sum_{i=1}^{l_j} W\!\left( v_{s_i}^{j} \right), $$
where $\{ v_{s_i}^{j} \}_{i=1}^{l_j}$ is the set of subsystems assigned to $v_{ce_j}$.

A static greedy scheduling strategy for distributing tasks to a group of heterogeneous CEs is described in Algorithm ...

Algorithm ... A Static Scheduling Algorithm

Sort the $v_{s_j}$ in descending order according to $W(v_{s_j})$
$l \leftarrow$ number of $v_{s_j}$'s
$p \leftarrow$ number of CEs in the metamachine
if $l < p$ then $p \leftarrow l$
Sort the $v_{ce_j}$ in descending order according to $P(v_{ce_j})$
for each $v_{ce_j} \in V_{ce}$ do
    assign $v_{s_j}$ to $v_{ce_j}$ and update $\Omega(v_{ce_j})$
for each $v_{s_i} \in V_s$ not yet assigned do
    find the best $v_{ce_j} \in V_{ce}$ to work on $v_{s_i}$, considering $E_s$ and $\Omega(v_{ce_j})$
Stop

The purpose of considering $E_s$ in Step ... of Algorithm ... is to minimize the communication between the CEs, and the purpose of considering $\Omega(v_{ce_j})$ in the same step is to keep the workload distribution balanced. However, the order in which these two parameters are considered leads to two different scheduling strategies.

If the set of edges is considered first, then the priority is to reduce the communication between the CEs. In this case, an attempt is made to assign to the same CE two or more subsystems that will potentially communicate a lot; the $\Omega(v_{ce_j})$ parameter is also used here to avoid overloading a CE.

If the $\Omega(v_{ce_j})$ are considered first, then the priority is to balance the workload among the CEs. For each task in the not-yet-assigned list of tasks, the processor with the lowest $\Omega(v_{ce_j})$ is sought, and the set of edges is used to resolve conflicts between two or more CEs with the same computational weights.
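The two variants can be sketched compactly as below. This is only an illustration of the greedy strategy in Algorithm ...; the data structures and names are chosen here for readability and are not those of the actual code.

    def greedy_schedule(task_weights, task_edges, ce_priorities, comm_first=True):
        # task_weights : dict task -> W(v_s)           (computational weight)
        # task_edges   : dict (task_a, task_b) -> int  (overlap, i.e. potential communication)
        # ce_priorities: dict ce -> P(v_ce)            (priority of each CE)
        tasks = sorted(task_weights, key=task_weights.get, reverse=True)
        ces = sorted(ce_priorities, key=ce_priorities.get, reverse=True)
        load = {ce: 0 for ce in ces}                   # Omega(v_ce)
        assign = {}

        # Seed the schedule: heaviest tasks go to the highest-priority CEs.
        for task, ce in zip(tasks, ces):
            assign[task] = ce
            load[ce] += task_weights[task]

        # Expand the schedule one task at a time.
        for task in tasks[len(ces):]:
            def comm_with(ce):
                return sum(w for (a, b), w in task_edges.items()
                           if (a == task and assign.get(b) == ce)
                           or (b == task and assign.get(a) == ce))
            if comm_first:
                # Prefer the CE already holding this task's neighbors,
                # breaking ties by the lightest current workload.
                ce = max(ces, key=lambda c: (comm_with(c), -load[c]))
            else:
                # Prefer the least-loaded CE, breaking ties by communication.
                ce = min(ces, key=lambda c: (load[c], -comm_with(c)))
            assign[task] = ce
            load[ce] += task_weights[task]
        return assign

Swapping the two criteria in the selection step switches between the communication-first and the workload-first strategies described above.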

Heterogeneous environment specification

The heterogeneous environment is described in an input file, or hesfile, for heterogeneous environment specification file. A hesfile contains one line per CE in the metamachine. Each line contains a list of keywords and values with the following syntax:

hostname [ct= | CT= {DISTRIB | SHARED | SINGLE}]
         [nm= | NM= hostname, hostname, ...]
         [np= | NP= integer number]
         [sp= | SP= integer number]

The keywords and corresponding values between square brackets ([ ]) are optional. The hostname is mandatory and, if none of the other keywords is specified, the default values are used.

CT specifies the cluster type, and the default value for this keyword is SINGLE, for a single processor; the SINGLE value is provided for the declaration of workstations. If the cluster type is set to DISTRIB (e.g., ct= DISTRIB), then the hostname denotes a distributed memory cluster. A distributed memory cluster is used for specifying a distributed memory machine, a virtual shared memory machine, or a subset of a network of workstations connected through a special communication network. In the last case, the CEs of the distributed cluster are addressed independently; thus the keyword NM is used for declaring the list of machine names in the distributed cluster.

The number of CEs in a cluster is specified with the keyword NP, and the default is one. The keyword SP is used for specifying the speed of a CE, or an output value from the priority function defined above.
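As an illustration only (the host names, processor counts and speed values below are invented, and blank-separated keyword=value pairs are assumed to be accepted as in a PVM hostfile), a hesfile for a small metamachine could look like this:

    sp-cluster    ct= DISTRIB  nm= node01, node02, node03, node04  np= 4  sp= 1000
    smp-server    ct= SHARED   np= 8   sp= 800
    ws1.lab.edu   sp= 400
    ws2.lab.edu

Here the first line declares a distributed memory cluster of four CEs, the second a shared memory cluster, and the last two lines single workstations relying on the default values.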

A scheduler for a parallel block Cimmino solver

The different phases of the parallel Block-CG acceleration are depicted in Figure ... The basic iteration matrix in this case is the block Cimmino iteration matrix; thus the subproblems are obtained from the $l$ row partitions of the system of equations, and the computational weight factors are computed as
$$ W(v_{s_i}) = \alpha_1 \sigma_{1,i} + \alpha_2 \sigma_{2,i} + \alpha_3 \sigma_{3,i}, $$
where $\sigma_{1,i}$ is the size of the subsystem of equations, $\sigma_{2,i}$ is the number of nonzeros, and $\sigma_{3,i}$ is the number of edges. The scalars $\alpha_1$, $\alpha_2$ and $\alpha_3$ determine the relevance of one parameter against the others. The values associated with the edges in $G_s$ are the numbers of columns that overlap between pairs of subsystems.
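In code, such a weight reduces to a one-line combination of the three factors; the coefficients below are purely illustrative defaults, not the values used in the experiments.

    def subsystem_weight(n_rows, n_nonzeros, n_edges, alpha=(1.0, 1.0, 1.0)):
        # W(v_s) = alpha1*size + alpha2*nnz + alpha3*edges, rounded to an
        # integer to match the discrete weight function W : V_s -> N.
        return round(alpha[0] * n_rows + alpha[1] * n_nonzeros + alpha[2] * n_edges)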

The parallel implementation of the Block-CG acceleration is based on the Master-Slave distributed Block-CG implementation; thus the CEs communicate through messages. The PVM system library has been used for handling the message passing, and the syntax of the hesfile therefore accepts all the PVM keywords defined for a hostfile specification (see Geist, Beguelin, Dongarra et al. ...).

PVM provides standard information about each CE in a data structure called pvmhostinfo, and this information can be modified through the PVM keywords. From the information in the hesfile and the information supplied by pvmhostinfo, the nodes of the $G_{ce}$ graph are generated. Figure ... depicts a node of the $G_{ce}$ graph.

Figure: Structure of a node in the G_ce graph. The fields of a node are:

    A: cluster type            G: time_ref1
    B: host_id                 H: time_ref2
    C: priority                I: number of descendents
    D: status                  J: pointer to first descendent
    E: C.P.U. speed            K: next sibling
    F: pvm_tid
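The node layout above can be read as a simple record. A minimal Python sketch of such a record is given below; the field names follow the figure, but the actual implementation in the thesis is in a different language and is not reproduced here (the list of descendents replaces the descendent count and first-descendent pointer of the figure).

    from dataclasses import dataclass, field
    from typing import Optional, List

    @dataclass
    class GceNode:
        # Fields A-K from the figure describing a node of the G_ce graph.
        cluster_type: str            # A: SINGLE, SHARED or DISTRIB
        host_id: int                 # B: identifier of the host
        priority: float              # C: set after computing the CE priority
        status: int                  # D: used for monitoring the execution of the CE
        cpu_speed: float             # E: taken from PVM or from the hesfile
        pvm_tid: int                 # F: PVM task identifier
        time_ref1: float = 0.0       # G: monitoring time reference
        time_ref2: float = 0.0       # H: monitoring time reference
        descendents: List["GceNode"] = field(default_factory=list)  # I/J
        next_sibling: Optional["GceNode"] = None                    # K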

The fields host_id and pvm_tid are included to interface the solver with the PVM system. The cluster type is provided by the user in the hesfile. The fields number of descendents, pointer to first descendent and next sibling are provided for traversing the G_ce graph.

The fields status, time_ref1 and time_ref2 are used for monitoring the execution of a CE. The priority field is set after computing the CE priority. The speed field is taken from the PVM system by default; however, the user can specify it directly with keywords from the hesfile or the PVM hostfile.

The scheduling algorithm described above is used to schedule the solution of the subsystems on one of the CEs. The priority is given to the communication because of the synchronization points of the Block-CG method.

The partitions of the system of equations can be uniform or non-uniform. If the system of equations is uniformly partitioned, then each subsystem will have the same number of rows and the computational weights will not vary much from one subsystem to the other. In this case,


an approximate average workload per processor is defined by

    w = \frac{\sum_{i=1}^{l} W(v_{s_i})}{\sum_{i=1}^{p} Nprocs(v_{ce_i})},

where Nprocs is a function that returns the number of processors specified in the declaration of each CE.

On the other hand, if the system is non-uniformly partitioned, then the sizes and computational weights from one subsystem to another may vary a lot. In this case, an upper bound for the computational workload of a CE is defined by

    w_l(v_{ce_i}) = \beta \, w \, Nprocs(v_{ce_i}),   with \beta \in R and \beta \geq 1.

Other authors have suggested tuning the workload distribution by minimizing the workload imbalance after each assignment of a subsystem (see for instance Talbi and Muntean). However, in the parallel block Cimmino accelerated with the Block-CG algorithm, the priority for assigning subsystems to the CEs is on the values associated with the edges in G_s; thus we suggest using the average workload or the upper bound above to keep the workload balanced.
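A minimal numeric sketch of these two quantities, with made-up weights and CE declarations, could be written as follows (all the names and values are illustrative only):

    # Average workload per processor and a per-CE upper bound (beta is a tuning scalar).
    weights = {1: 120.0, 2: 95.0, 3: 160.0, 4: 110.0}     # W(v_s_i), hypothetical
    nprocs  = {"ceA": 4, "ceB": 2, "ceC": 1}              # Nprocs(v_ce_i), hypothetical
    beta = 1.2

    w_avg = sum(weights.values()) / sum(nprocs.values())            # average workload per processor
    w_upper = {ce: beta * w_avg * n for ce, n in nprocs.items()}    # upper bound per CE
    print(w_avg, w_upper)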

Chapter

Scheduler experiments

The purpose of this chapter is to validate the use of the static greedy scheduler proposed in the previous chapter (see also Arioli, Drummond, Duff and Ruiz). The following experiments were run on a network of SUN SPARC and IBM RS/6000 workstations, as shown in the table below.

    ID      Computing Element                  Number of Processors
    A       Cluster of IBM RS/6000
            with FDDI network
    B       IBM RS/6000
    C       IBM RS/6000
    D-H     SUN SPARC

    Table: List of available computing elements.

An unsymmetric, non-diagonally dominant matrix that comes from a finite volume discretization of the Navier-Stokes equations coupled with chemistry is used in the numerical experiments. The discretization is performed using a curvilinear mesh with five variables per mesh point, and this leads to a large sparse matrix. The original system is partitioned into blocks. For the first experiment the system of equations is partitioned as described in Partition I below, and in the second experiment as in Partition II. The resulting G_s graphs for the first and second partitions are depicted in the figures below.

Each subsystem will become a task to be performed by one of the computing elements in the metacomputer. Three different scheduling strategies are compared in the following sections, to analyse the influence of the scheduling strategy on the overall execution time of the parallel block Cimmino solver accelerated with the Block-CG algorithm.

The first strategy is a sequential distribution of subsystems to CEs, in which subsystems are distributed one by one to the next CE in a circular queue of CEs. In this case neither communication nor workload balancing issues are considered.

The second strategy uses the scheduling algorithm with the emphasis on balancing the workload among the CEs. In this strategy, the CE with the lowest workload value is the candidate for solving the subsystem in turn. If there is more than one candidate, the subsystem is scheduled to the CE that has previously been assigned the set of subsystems with the highest potential of communication with the subsystem in turn.

Figure: Graph of subsystems according to the first partition.

Figure: Graph of subsystems according to the second partition.

The purpose of the third strategy is to reduce the communication cost per iteration by scheduling two or more subsystems with a high potential of communication to the same CE, using the upper bound defined earlier for the workload per CE. The third strategy also uses the scheduling algorithm, with the emphasis on minimizing the communication.

Table: Partition I of the system of equations into subsystems; for each subsystem, the table gives the number of rows, the size of the augmented system, and the number of nonzeros.

Scheduling subsystems with a nearly uniform workload

As shown in the table above, the subsystems have almost the same weight factor, and the purpose of this experiment is to test the performance of the parallel Cimmino implementation under different distribution strategies.


Table: Partition II of the system of equations into subsystems; for each subsystem, the table gives the number of rows, the size of the augmented system, and the number of nonzeros.

Table: Distributions of subsystems to the computing elements A-H under the three scheduling strategies (in sequence/random; first consider weight factor; first consider communication). Each subsystem is identified in the table by its number.

Table: Number of neighbors per computing element (A-H) and total execution time in seconds under the three strategies.


The results from the first experiments are shown in the two tables above. Each CE has to communicate with two neighbors. The first strategy has performed better than the second strategy, because the weight of the subsystems is already balanced by the first partition and the attempt to balance the workload only generates more communication.

In the third strategy the number of neighbors is reduced by one, and in this case the third strategy reduces the execution time by saving communication.

Scheduling subsystems with a non-uniform workload

Table: Distributions of subsystems to the computing elements A-H under the three scheduling strategies (in sequence/random; first consider weight factor; first consider communication). Each subsystem is identified in the table by its number.

Table: Number of neighbors per computing element (A-H) and total execution time in seconds under the three strategies.

In this experiment the subsystems have different sizes, and some require more communication than others. Thus the performance of the three scheduling strategies is studied in order to conclude with a strategy that best suits the parallel Cimmino implementation.

The results from the second experiments are summarized in the two tables above: the first depicts the distribution of subsystems, and the second presents the number of neighbors per CE and the total execution time. Clearly, the third strategy reduces the execution time because the number of communications is also reduced, and communication is a bottleneck in parallel implementations of conjugate gradient based methods.

The second strategy has performed better than the first one because it balanced the workload among the CEs, and as a result CEs had to wait less at the synchronization points. In the first experiment this effect did not appear because the subsystems were of almost the same size; however, in the second problem these subsystems vary in size and in number of neighbors.

Remarks

The parallel performance of the block Cimmino solver depends a great deal on the distribution of tasks among the CEs. The first scheduling strategy should be reserved for cases in which the linear system is evenly partitioned into blocks of rows, the number of neighbors per CE is the same, and the CEs have the same computing capabilities. In practice these trivial partitionings are not performed, and we dedicate the following chapters to studying the effects of the partitioning on the behavior of the block Cimmino solver and its parallel implementation.

The second scheduling strategy should be preferred when the sizes of the blocks vary more than the number of neighbors between blocks. The first and second scheduling strategies are simpler and less costly to implement than the third strategy.

The third scheduling strategy performs better than its counterparts, and in the previous two experiments it has improved the performance of the parallel Cimmino implementation. Thus this strategy is used in the experiments presented in the next chapter, and it is compared against the second scheduling strategy in the partitioning experiments chapter.


Chapter

Block Cimmino experiments

In this chapter we present runs of the parallel Cimmino with Block-CG acceleration. In these experiments, trivial block-row partitionings are performed on the linear system. Later, in the chapter on partitioning strategies, we study partitioning strategies that may improve the overall performance of the iterative solver in terms of its convergence rate and the accuracy with which it approximates the real solution of the system of equations.

The GRE 1107 problem is used in the first experiment. This problem comes from the Harwell-Boeing matrix collection (see Duff, Grimes and Lewis). The unsymmetric matrix arises in the simulation of computer systems. The sparsity pattern of GRE 1107 is shown in the figure below. The matrix is very ill-conditioned, and Arioli, Demmel and Duff have shown that its classical condition number in the infinity norm, ||A||_inf ||A^{-1}||_inf, is very large.

In the second experiment we use the FRCOL problem, which comes from a two-dimensional model developed by Perrel for studying the effects of a body entering the atmosphere at a high Mach number. The problem is discretized using a finite volume discretization of the Navier-Stokes equations coupled with chemistry. There are two velocities and one energy at each grid point. From the chemistry there are two species, which leads to two density variables per mesh point. Thus there is a total of five variables per mesh point.

The discretization is performed on a curvilinear mesh of points, leading to block tridiagonal matrices.

In the first section below we solve the GRENOBLE 1107 problem using the block Cimmino accelerated with the Block-CG, and study the impact of different block sizes on the performance of the parallel solver. In these experiments, the block size for the Block-CG acceleration and the number of CEs in the system are varied.

Afterwards, we solve the FRCOL problem and fix the block size, to isolate the effects of the block size on the rate of convergence, which may speed up the parallel execution of the iterative solver, from the speedups arising from the parallelism. Thus, in those experiments only the number of CEs in the system is varied.

Solving the GRENOBLE 1107 problem

The GRE 1107 matrix is partitioned into seven subsystems. The first six subsystems have the same number of rows each, made of three blocks of rows, and the last subsystem contains the remaining rows, also made of three blocks of rows. The experiments are run on the Thin node SP computer at CNUSC. The maximum number of CEs to be used in these experiments is seven because there

are seven subsystems.

Figure: Sparsity pattern of the GRE 1107 matrix.

Figure: Convergence curves with iteration count for the test problem GRE 1107. The curves correspond to Block_CG(01), Block_CG(04), Block_CG(08), Block_CG(16) and Block_CG(32); the vertical axis is the norm-wise backward error and the horizontal axis the number of iterations.

Figure: Convergence curves with equivalent iterations (the iteration count is multiplied by the block size) for the test problem GRE 1107. The curves correspond to the same block sizes as in the previous figure.

The two figures above show the convergence curves corresponding to these experiments; the larger block sizes converge in fewer iterations than the Classical CG acceleration.

In the tables below, the execution time of the Solve phase has been separated from the other two phases, because in some cases the output of the Analyse and Factorize phases can be reused to solve the same linear system of equations with a different set of right-hand sides. Thus only the Solve phase is rerun in these cases. For these experiments we first focus our attention on the total run time, and then on the run time of the Solve phase by itself.

A close look at the times shown in these tables reveals that most of the computational weight is in the Solve phase, and thus the Block-CG implementation has a great impact on the performance of this parallel block Cimmino implementation.

As shown in the convergence curves with equivalent iterations, the Block-CG acceleration with a block size of four is as fast, in terms of equivalent iterations, as the CG acceleration (block size of one) reported in the previous figure. However, the timing tables show that using the Block-CG with a block size of four is faster than using Classical CG.

In all cases, the speedups increase as we increase the number of CEs, and the efficiencies obtained in these experiments are higher than the ones obtained with the parallel Block-CG alone (see the earlier chapter). This is due to an increase in the granularity of the tasks performed in parallel; in this implementation of the Block-CG acceleration, the increase in granularity comes from the computation of the orthogonal projections in the block Cimmino method.


Table: Results from the parallel block Cimmino with Block-CG acceleration on the solution of the GRE 1107 problem, for two block sizes. For each number of CEs, the table reports the time, speedup and efficiency of the total run and of the Solve phase, together with the sequential times. Times in this table are in seconds.

Table: Results from the parallel block Cimmino with Block-CG acceleration on the solution of the GRE 1107 problem, for two further block sizes. For each number of CEs, the table reports the time, speedup and efficiency of the total run and of the Solve phase, together with the sequential times. Times in this table are in seconds.


Table: Results from the parallel block Cimmino with Block-CG acceleration on the solution of the GRE 1107 problem, for one block size. For each number of CEs, the table reports the time, speedup and efficiency of the total run and of the Solve phase, together with the sequential times. Times in this table are in seconds.

Solving the FRCOL problem

Here the FRCOL problem is solved using the parallel implementation of the block Cimmino method accelerated with the stabilized Block-CG algorithm. In this experiment, the size of the block for the Block-CG acceleration is fixed, in order to isolate the performance of the parallel implementation from that related to the block size. The linear system is partitioned into blocks of rows of equal size; the maximum number of CEs used in the parallel execution is therefore the number of blocks.

The FRCOL problem is larger than the GRENOBLE 1107 problem; thus the grain size of the parallel task has been increased. As shown in the table below, this increase has favored the performance of the parallel Cimmino implementation, and the efficiencies obtained are high both when considering the total run time and when only considering the execution time of the Solve phase. Thus most of the computational resources are being used during the parallel executions.

The efficiency in the Solve phase increases as the number of CEs is increased, while the efficiency of the total run starts to decrease after five CEs. This is due to the communication in the first two phases: as shown in the table of percentages below, the percentage of the total run time consumed in the Solve phase decreases as the number of CEs is increased, whereas the percentage of the total run time consumed by the master-to-slave communication in the Analyse and Factorize phases increases as the number of CEs is increased. Therefore, the efficiency of the total run decreases despite the reduction in the total run time obtained by increasing the number of CEs.


Table: Results from the parallel Cimmino with Block-CG acceleration on the solution of the FRCOL problem. For each number of CEs, the table reports the time, speedup and efficiency of the total run and of the Solve phase, together with the sequential times. Times in this table are in seconds.

Table: Percentages of the total run time consumed by the communication of the Analyse and Factorize phases and by the execution of the Solve phase, as a function of the number of CEs.


Remarks

The block Cimmino with Block-CG acceleration appears to be a reliable solver which suits well the solution of some practical problems (see for instance Benzi, Sgallari and Spaletta, and O'Leary). The solver can be easily tuned to reduce the number of iterations and improve the performance of a parallel run.

In some cases the efficiency of the parallel runs of block Cimmino increases as the number of CEs is increased, and the maximum number of CEs to be used in a run is limited by the number of blocks of rows in the partition of the linear system.

The partition of a linear system should be driven by the nature of the problem being solved. In the next chapter we study natural partitioning and preprocessing strategies to improve the performance of the block Cimmino method.


Chapter

Partitioning strategies

Preprocessing a linear system of equations improves the performance of most linear solvers, and in some cases a more accurate approximation to the real solution is found. For instance, numerical instabilities are encountered in the solution of some linear systems, and it is necessary to scale the systems before they are solved. These instabilities occur even when the most robust computer implementations of direct or iterative solvers are used.

Also, performing some permutations of the elements of the original system can substantially reduce the required computing time for the solution of large sparse linear systems, by improving the rate of convergence of some iterative methods (e.g. SOR and Kaczmarz methods) and by reducing fill-in in direct methods.

The aim of this chapter is to study preprocessing strategies to derive natural block-row partitionings from general linear systems of equations. We study two different preprocessing strategies to be applied to the original matrix A. These preprocessing strategies are based on permutations that transform the matrix AA^T into a matrix with a block tridiagonal structure.

Transforming the matrix AA^T into a block tridiagonal matrix provides a natural partitioning of the linear system for row projection methods, because these methods use the normal equations to compute their projections. Therefore, the resulting natural block partitioning should improve the rate of convergence of block-row projection methods such as block Cimmino and block Kaczmarz.

Ill-conditioning in and across blocks

An ill-conditioned matrix A has some linear combinations of rows that are almost equal to zero, and as mentioned by Bramley and Sameh, these linear combinations may occur inside blocks or across blocks after block-row partitionings. If the method used for solving the subproblems is sensitive to ill-conditioning within the blocks, then the method may converge to the wrong solution.

Assuming that the projections in the block Cimmino are computed on the subspaces exactly, the rate of convergence of the block Cimmino method then depends only on the conditioning across the blocks. Therefore, a combination of a robust method for computing the projections and a partitioning strategy that minimizes the ill-conditioning across the blocks is sought.

To study the ill-conditioning across the blocks, we consider the linear system of equations resulting from the block-row partitioning and the QR decomposition of the submatrices A_i^T. Then

    A_i^T = Q_i R_i,   i = 1, ..., l,

where A_i is an m_i x n matrix of full row rank,

    Q_i : n x m_i,   Q_i^T Q_i = I_{m_i x m_i},
    R_i : m_i x m_i, a nonsingular upper triangular matrix.

Writing again the sum of projections of the block Cimmino method in matrix form, we have

    M = \sum_{i=1}^{l} A_i^T (A_i A_i^T)^{-1} A_i,

and replacing the A_i's by their QR factors,

    M = \sum_{i=1}^{l} Q_i R_i (R_i^T Q_i^T Q_i R_i)^{-1} R_i^T Q_i^T
      = \sum_{i=1}^{l} Q_i R_i (R_i^T R_i)^{-1} R_i^T Q_i^T
      = \sum_{i=1}^{l} Q_i Q_i^T
      = Q_1 Q_1^T + Q_2 Q_2^T + \cdots + Q_l Q_l^T.

Using the theory of the singular value decomposition (see Golub and Kahan; Golub and Van Loan), it can be seen that the nonzero eigenvalues of (Q_1, ..., Q_l)(Q_1, ..., Q_l)^T are also the nonzero eigenvalues of (Q_1, ..., Q_l)^T (Q_1, ..., Q_l). Therefore, the spectrum of the matrix M above is the same as the spectrum of the matrix

    \begin{pmatrix}
      I_{m_1 \times m_1} & Q_1^T Q_2 & \cdots & Q_1^T Q_l \\
      Q_2^T Q_1 & I_{m_2 \times m_2} & \cdots & Q_2^T Q_l \\
      \vdots & & \ddots & \vdots \\
      Q_l^T Q_1 & Q_l^T Q_2 & \cdots & I_{m_l \times m_l}
    \end{pmatrix},

where the Q_i^T Q_j are the matrices whose singular values represent the cosines of the principal angles between the subspaces R(A_i^T) and R(A_j^T) (see Bjorck and Golub). These principal angles are recursively defined by

    \cos \theta_k^{ij} = \max_{u \in R(A_i^T)} \max_{v \in R(A_j^T)} \frac{u^T v}{\|u\| \, \|v\|}
                       = \frac{(u_k^{ij})^T v_k^{ij}}{\|u_k^{ij}\| \, \|v_k^{ij}\|},

subject to

    u^T u_p^{ij} = 0   for p = 1, ..., k-1,  and
    v^T v_p^{ij} = 0   for p = 1, ..., k-1.

The sets of vectors \{u_k^{ij}\} and \{v_k^{ij}\}, with k varying from 1 to m_{ij} = \min\{\dim R(A_i^T), \dim R(A_j^T)\}, are called the principal vectors between the subspaces R(A_i^T) and R(A_j^T).

Notice that the principal angles satisfy

    0 \le \theta_1^{ij} \le \cdots \le \theta_{m_{ij}}^{ij} \le \pi/2;

if all of the m_{ij} principal angles are equal to \pi/2, it means that R(A_i^T) is orthogonal to R(A_j^T), and the wider the principal angles are, the closer the block Cimmino iteration matrix is to the identity matrix. Furthermore, the closer the iteration matrix is to the identity, the faster the convergence of the Block-CG acceleration should be.

The principal angles by themselves do not provide information regarding the spectrum and ill-conditioning of the resulting iteration matrix; therefore, it is necessary to study partitioning strategies for which there exists a strong relation between these principal angles and the spectrum of the iteration matrix.

Without loss of generality, assume that the matrix A is partitioned in two blocks

    A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix},

where A_1 and A_2 have m_1 and m_2 rows respectively, and further assume that m_1 \geq m_2. With this partitioning, the block Cimmino iteration matrix is defined as C = I - \omega (P_1 + P_2), where P_1 = P_{R(A_1^T)} and P_2 = P_{R(A_2^T)}. From the previous derivation, M = P_1 + P_2 reduces to Q_1 Q_1^T + Q_2 Q_2^T. Therefore, the spectrum of the block Cimmino iteration matrix for this partition is determined by the spectrum of

    \begin{pmatrix} I_{m_1 \times m_1} & Q_1^T Q_2 \\ Q_2^T Q_1 & I_{m_2 \times m_2} \end{pmatrix}.

As mentioned earlier, the block Cimmino with the Block-CG acceleration is independent of the relaxation parameter \omega. Thus, if \omega = 1, the spectrum of the iteration matrix is that of

    \begin{pmatrix} 0_{m_1 \times m_1} & Q_1^T Q_2 \\ Q_2^T Q_1 & 0_{m_2 \times m_2} \end{pmatrix}.

The following observations follow from the matrix above.

Since m_2 \leq m_1, the rank of the rectangular matrices Q_1^T Q_2 and Q_2^T Q_1 is not greater than m_2. Thus there are at most 2 m_2 nonzero eigenvalues.

The use of the Block-CG acceleration guarantees the finite termination of the block Cimmino method in exact arithmetic. In this particular case, the number of steps needed to reach convergence is at most of the order of m_2. Therefore, block Cimmino with the Block-CG acceleration will work better for small values of m_2.

The singular values of the matrix Q_1^T Q_2 and the eigenvalues of the matrix above are strongly related. Indeed, a singular value decomposition of the matrix Q_1^T Q_2 (following Golub and Van Loan) is given by

    Q_1^T Q_2 = U \Sigma V^T,

with U and V orthogonal matrices of dimensions m_1 \times m_1 and m_2 \times m_2 respectively, and \Sigma a rectangular matrix of dimensions m_1 \times m_2 with \Sigma = diag(\sigma_1, ..., \sigma_{m_2}). The matrix above is then equal to

    \begin{pmatrix} U & 0 \\ 0 & V \end{pmatrix}
    \begin{pmatrix} 0 & \Sigma \\ \Sigma^T & 0 \end{pmatrix}
    \begin{pmatrix} U^T & 0 \\ 0 & V^T \end{pmatrix},

and has the same eigenvalues as the matrix

    \begin{pmatrix} 0 & \Sigma \\ \Sigma^T & 0 \end{pmatrix}.

The eigenvalues of this matrix are \{ \pm \sigma_i : i = 1, ..., m_2 \}, and the \sigma_i correspond to the cosines of the principal angles between the two subspaces R(A_1^T) and R(A_2^T).
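To make this relation concrete, the cosines of the principal angles between two row spaces can be computed numerically from the SVD of Q_1^T Q_2. The small NumPy sketch below uses random full-rank blocks and is purely illustrative; it is not part of the solver described in this thesis.

    import numpy as np

    # Illustrative only: two random row blocks A1 (m1 x n) and A2 (m2 x n), m1 >= m2.
    rng = np.random.default_rng(0)
    m1, m2, n = 6, 4, 10
    A1, A2 = rng.standard_normal((m1, n)), rng.standard_normal((m2, n))

    # Orthonormal bases of R(A1^T) and R(A2^T) via QR of the transposes.
    Q1, _ = np.linalg.qr(A1.T)   # n x m1
    Q2, _ = np.linalg.qr(A2.T)   # n x m2

    # Singular values of Q1^T Q2 are the cosines of the principal angles.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)

    # The nonzero eigenvalues of Q1 Q1^T + Q2 Q2^T are 1 +/- cos(theta_i), plus ones.
    eigs = np.linalg.eigvalsh(Q1 @ Q1.T + Q2 @ Q2.T)
    print(np.sort(cosines), np.sort(eigs[eigs > 1e-12]))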

Two-block partitioning

In the following sections we will study two preprocessing strategies to permute the rows of the matrix A, based on permutations that transform the AA^T matrix into a block tridiagonal matrix. Matrices with block tridiagonal structures are commonly found in discretizations of partial differential equations, and in practice these systems are solved using different iterative schemes. Coefficient matrices from other general systems of equations can also be permuted to block tridiagonal matrices using matrix reordering techniques (see for example Duff, Erisman and Reid).

Two-block partitioning refers to a partitioning strategy in which the columns inside a block of rows intersect the columns of at most two other blocks. A two-block partitioning can be applied to matrices with a block diagonal structure, and in some cases this will speed up the convergence of the linear solver.

For instance, let the matrix A, with blocks of size b x b, be partitioned into five blocks of rows as illustrated here:

    A = \begin{pmatrix} A_1 \\ A_2 \\ A_3 \\ A_4 \\ A_5 \end{pmatrix},

where each A_i has k_i b rows and each k_i \geq 1 for i = 1, ..., 5; the block diagonal structure of A implies that only consecutive blocks of rows share columns. Further, for i = 1, 2, 3 we observe that the pair of subspaces R(A_i^T) and R(A_{i+2}^T) are orthogonal, and the sum of the orthogonal projections onto these two subspaces corresponds to the orthogonal projection onto the direct sum of these orthogonal subspaces, viz.

    P_{R(A_i^T)} + P_{R(A_{i+2}^T)} = P_{R(A_i^T) \oplus R(A_{i+2}^T)}.
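A tiny NumPy check of this identity, with made-up orthogonal row blocks, is sketched below. It is illustrative only; the blocks are random and are not related to the matrices used in the thesis.

    import numpy as np

    def proj_rowspace(A):
        # Orthogonal projector onto R(A^T) built from an orthonormal basis of A^T.
        Q, _ = np.linalg.qr(A.T)
        return Q @ Q.T

    rng = np.random.default_rng(1)
    # Two row blocks acting on disjoint column sets, hence R(A1^T) and R(A3^T) are orthogonal.
    A1 = np.hstack([rng.standard_normal((3, 4)), np.zeros((3, 6))])
    A3 = np.hstack([np.zeros((3, 6)), rng.standard_normal((3, 4))])

    lhs = proj_rowspace(A1) + proj_rowspace(A3)          # sum of the two projectors
    rhs = proj_rowspace(np.vstack([A1, A3]))             # projector onto the direct sum
    print(np.allclose(lhs, rhs))                         # expected: True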

Therefore, the partitioning defined above is numerically equivalent to partitioning the matrix into the two blocks of rows

    B_1 = \begin{pmatrix} A_1 \\ A_3 \\ A_5 \end{pmatrix}   and   B_2 = \begin{pmatrix} A_2 \\ A_4 \end{pmatrix},

that is, B_1 groups the blocks A_i with i odd and B_2 groups the blocks A_i with i even.

From the observations above, the number of nonzero eigenvalues in the iteration matrix is at most 2 min(m_1, m_2) when using a two-block partition, where m_1 and m_2 are the numbers of rows in B_1 and B_2 respectively. Therefore, varying the number of rows in B_1 and B_2 will also vary the number of nonzero eigenvalues of the iteration matrix.

Without loss of generality, we go back to one of the assumptions made above, m_1 \geq m_2: the interface block B_2 must be at least of size b, which is the minimum required for making the A_i's in B_1 orthogonal.

In the parallel implementation of the block Cimmino method accelerated with Block-CG, the compromise is to find a matrix partitioning that reduces the size of the interface block B_2, which implicitly improves the rate of convergence of the Block-CG acceleration, and at the same time enables a fair distribution of the workload.

Preprocessing strategies

Given the general linear system of equations

    A x = b,

with the matrix A of dimensions m x n, the first preprocessing strategy finds a permutation of the normal equations

    B = P A A^T P^T

such that the matrix B has a block tridiagonal structure. An implementation of the Cuthill-McKee algorithm (see for instance Cuthill and McKee; George; Duff, Erisman and Reid) for ordering symmetric matrices is used. Afterwards, the system of equations

    \bar{A} x = \bar{b}

is solved, with \bar{A} = P A and \bar{b} = P b. From the block tridiagonal structure of the matrix B, the block-row partition is defined by partitioning the blocks of rows in \bar{A} with respect to the diagonal blocks in the block tridiagonal structure of B. Doing this, the row partitioning preserves the advantages of the two-block partitioning described in the previous section. The figure below illustrates the row partitioning of \bar{A} from the diagonal block structure of B.
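A minimal sketch of this first strategy in Python, using SciPy's sparse tools, is given below. It is only an illustration under stated assumptions: SciPy provides the Reverse Cuthill-McKee ordering (not the plain Cuthill-McKee variant also discussed in this chapter), and the function and matrix names are hypothetical, not the thesis implementation.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    def preprocess_strategy_one(A):
        """Sketch of the first strategy: permute the rows of A so that
        A A^T (the normal equations) becomes banded / block tridiagonal."""
        A = sp.csr_matrix(A)
        B = (A @ A.T).tocsr()                                  # normal equations A A^T
        perm = reverse_cuthill_mckee(B, symmetric_mode=True)   # bandwidth-reducing ordering
        P = sp.eye(A.shape[0], format="csr")[perm]             # row permutation matrix
        return P @ A, perm                                     # permuted system \bar{A} = P A

    # Usage sketch on a small random sparse matrix (illustrative only):
    A = sp.random(30, 40, density=0.1, random_state=0, format="csr")
    A_bar, perm = preprocess_strategy_one(A)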

Figure: Row partitioning of \bar{A} from the block tridiagonal structure of B.

In the second preprocessing strategy, the matrix A A^T is first normalized:

    B = A A^T,
    D = diag(B),
    \bar{B} = D^{-1/2} B D^{-1/2}.

Afterwards, all the nonzero elements in \bar{B} whose absolute value is under a tolerance value are removed, viz.

    \hat{B} = remove(\bar{B}),

and \hat{B} is permuted using the Cuthill-McKee algorithm,

    \tilde{B} = P \hat{B} P^T.

The permuted system is then solved as before; in this case, the row partitioning for \bar{A} is defined by the block tridiagonal structure of \tilde{B}.
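The normalization and thresholding step of this second strategy can be sketched as follows (illustrative only; the function name and the default tolerance are assumptions, and the reordering step of the previous sketch is reused afterwards):

    import numpy as np
    import scipy.sparse as sp

    def drop_small_normalized_entries(B, tol=0.2):
        """Scale A A^T to unit diagonal and drop entries below tol in absolute value."""
        B = sp.csr_matrix(B)
        d = 1.0 / np.sqrt(B.diagonal())             # D^{-1/2}
        Bn = sp.csr_matrix(sp.diags(d) @ B @ sp.diags(d))   # normalized normal equations
        Bn.data[np.abs(Bn.data) < tol] = 0.0        # remove small entries
        Bn.eliminate_zeros()
        return Bn                                   # then reorder with (reverse) Cuthill-McKee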

The first preprocessing strategy always delivers a matrix \bar{A} with a two-block partitioning. The partitioning obtained with the second strategy is close to a two-block partitioning, and this will become clearer with the partitioning experiments in the next chapter.

Additionally, since some nonzero elements under a tolerance value are removed from \bar{B} in the second preprocessing strategy, the thresholded matrix has a smaller bandwidth than the matrix B. Therefore it has more blocks on the diagonal than B, and consequently the matrix \bar{A} can be partitioned into more blocks of rows.

Clearly, the second preprocessing strategy is a more flexible partitioning strategy.

We perform a last step after using either preprocessing strategy. In this last step, the columns that belong to the same subset of blocks are grouped, to expedite the identification of sections of columns in the block Cimmino solver. The need for this last step will become more evident in the experiments in the next chapter, particularly when comparing the sparsity patterns before and after the column permutations.

In the last preprocessing step, the sparsity pattern of \bar{A} is stored in a sparse matrix C, such that the matrix C has a 1 in all places where the matrix \bar{A} has a nonzero element. Then, the block-row partitioning from the matrix \bar{A} is applied to the matrix C, such that C_i is the i-th block of the matrix C. If l is the number of blocks of rows in the matrices \bar{A} and C, then we use a matrix D of dimension l x n to store the sparsity pattern of each block in C. If there is at least one 1 in the j-th column of the block matrix C_i, then D(i, j) = 1.

After scanning the sparsity pattern of the blocks in \bar{A}, an integer label is computed for each column of the matrix \bar{A} in the following manner:

Figure: Scanning the sparsity pattern of a block of rows C_i to form the row D_i.

    label(j) = \sum_{p=1}^{l} 2^p D(p, j),   for j = 1, ..., n.

After computing the labels, they are sorted in ascending order, and the results from the sort define a column permutation to be applied to the matrix \bar{A}.

Notice that these column permutations group all the columns that belong to the same subset of blocks; the resulting matrix has sections of columns inside the blocks of rows, and these sections of columns are the ones used in the parallel Cimmino implementation described earlier. A small sketch of this labeling is given below.
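The sketch below computes the column labels and the resulting permutation from the block/column incidence matrix D. It is an illustration only; the function name and the tiny incidence matrix are hypothetical.

    import numpy as np

    def column_permutation_from_blocks(D):
        """D is the l x n 0/1 block/column incidence matrix; columns with the
        same label end up contiguous after the returned permutation."""
        l, n = D.shape
        weights = 2 ** np.arange(1, l + 1)          # 2^p for p = 1, ..., l
        labels = weights @ D                        # label(j) = sum_p 2^p D(p, j)
        return np.argsort(labels, kind="stable")    # sorting the labels defines the permutation

    # Usage sketch with a tiny hypothetical incidence matrix (3 blocks, 6 columns):
    D = np.array([[1, 1, 0, 0, 1, 0],
                  [0, 1, 1, 0, 1, 0],
                  [0, 0, 1, 1, 0, 1]])
    perm = column_permutation_from_blocks(D)        # apply as A[:, perm]
    print(perm)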

Chapter

Partitioning experiments

Now we perform some more runs of the parallel block Cimmino implementation using the Block-CG acceleration, and focus on the effects of the preprocessing strategy on the performance of the method. In the experiments reported in the previous chapters, we have used trivial partitionings based on the size of the linear systems and the number of available CEs in the computational platform. In this chapter, all of the previous efforts invested in the parallelism are combined with the preprocessing strategies for solving the linear system of equations, and the advantages of using the output from a preprocessing phase are emphasized.

The experiments in this chapter were run on an SP Thin node.

In the first section, the parallel block Cimmino implementation is used for solving the SHERMAN4 linear system from the Harwell-Boeing matrix collection (Duff, Grimes and Lewis). The symmetric matrix SHERMAN4 comes from the discretization of partial differential equations extracted from an oil reservoir modeling program. The matrix arises from a three-dimensional simulation model on a grid, using a seven-point finite difference approximation with one equation and one unknown per grid block (Simon).

The second section contains the results from runs of the parallel block Cimmino implementation solving the GRENOBLE 1107 problem introduced in the previous chapter.

Solving the SHERMAN4 problem

The matrix SHERMAN4 is a symmetric matrix of order 1104. The spectrum of the matrix was shown in an earlier chapter. The experiments reported in this section were selected from a large set of experiments varying the block size for the Block-CG acceleration. To focus on the effects of the preprocessing strategies, the block size in the Block-CG has been fixed. In these experiments we have used, as a stopping criterion for the block Cimmino iterations, the reduction of the error below a prescribed threshold.

The original pattern of the matrix SHERMAN4 is shown in the figure below, and the results from parallel runs of the block Cimmino implementation solving the original system (i.e. without any data preprocessing) are reported in the results table under the column labeled 'Original matrix A'.

The parallel block Cimmino program can be invoked from the command line with several parameters to specify the computing environment, the problem to be solved, and other values that are meaningful to the parallel solver, for instance the block size for the Block-CG acceleration, the threshold value for the stopping criterion, the block partitioning strategy, the size of the integer and double working arrays, etc.

Figure: Sparsity pattern of the matrix SHERMAN4 (nz = 3786).

The following is an example of a call to the block Cimmino solver from the UNIX command line (note that the only purpose of including this type of example in this chapter is to summarize the different parameters of the parallel solver that were used in a given set of runs; the examples should not be confused with a user guide to the parallel block Cimmino solver):

    command  blkcimmino  hesfile spenv  datafile shermanrua
             partitions ...
             blocksize ...  threshold ...  scheduling ...

As mentioned earlier, the computational environment is specified in a HES file, and in the above example the HES file is spenv. The linear system is stored in the Harwell-Boeing sparse matrix format in the sherman4 data file. In the first experiment, a trivial row block partition of the linear system is performed.

The first argument following the partitions keyword is the number of blocks of rows in the block-row partition of the linear system. In these experiments the blocks have almost the same number of rows; only the last block has a different number of rows. The blocksize keyword is used for the block size of the Block-CG acceleration.

The value of the threshold keyword is used for the stopping criterion. As mentioned in the scheduler chapter, we have implemented three scheduling strategies for the block Cimmino solver. In the results reported below, only the second and third scheduling strategies are used, because the first scheduling strategy is a random scheduling distribution, and from the experiments presented earlier there are no significant advantages in using the first scheduling strategy over the other two.

Recall that the goal of the second scheduling strategy is to balance the workload, while the goal of the third scheduling strategy is to minimize first the communication cost and preserve, as much as possible, a balanced workload.


For the second set of experiments, a trivial column permutation of the original matrix A is performed. The goal of the column permutation is to group the columns that belong to the same blocks. The column labeling described in the previous chapter is used to find these column permutations.

The figure below shows the matrix SHERMAN4 after permuting its columns. The results from these experiments are presented in the results table under the column labeled 'After column permutation'.

Figure: Sparsity pattern of the SHERMAN4 matrix before and after column permutation using the column labeling (nz = 11040). The matrix SHERMAN4 has been partitioned into blocks of rows.

Recall that block Cimmino is numerically insensitive to column permutations. Thus, the reduction in the number of overlaps between the blocks of rows may only improve the parallel execution of the block Cimmino method, because the communication is implicitly reduced.

In the third round of experiments, the first preprocessing strategy from the previous chapter is used. From the block tridiagonal structure of the normal equations of SHERMAN4, the matrix is partitioned into blocks of rows; later, the columns of the resulting matrix are permuted using the column labeling. The sparsity pattern of the matrix SHERMAN4 after the first preprocessing strategy and the column permutations is shown in panels (a) and (b) of the corresponding figure below: panel (a) is the result of applying the column permutations to the matrix SHERMAN4 after the Reverse Cuthill-McKee algorithm is used inside the first preprocessing strategy, and panel (b) is the result of using the Cuthill-McKee algorithm instead in this preprocessing phase.

The partitionings shown in both panels are two-block partitionings. Here we prefer panel (b), because there are five blocks that are independent from the rest, while in panel (a) we have obtained only four. In both cases a great improvement in the parallel execution is expected, because there are independent blocks of rows.

The following is an example of the command line and parameters for running the parallel block Cimmino experiments on the matrix SHERMAN4 after using the first preprocessing strategy with the column permutations:

    command  blkcimmino  hesfile spenv  datafile shermanstgrua
             partitions ...
             threshold ...  scheduling ...

Again, the blocks of rows have nearly the same size; however, the number of column overlaps between the blocks of rows varies, and this implies non-uniform communication between the CEs during a parallel block Cimmino run. Results from the parallel block Cimmino runs are presented in the results table under the column labeled 'Preprocessing strategy I'.

Figure: Matrix SHERMAN4 after applying preprocessing strategy I (nz = 3786).

In the last experiments of this section, the second preprocessing strategy is applied to the matrix SHERMAN4. A figure below shows the sparsity pattern of the normal equations of SHERMAN4 after symmetric permutation: panel (a) shows the result of using the Reverse Cuthill-McKee algorithm, and panel (b) the result of using the Cuthill-McKee algorithm. The normalized nonzero entries of AA^T are plotted in a further figure.

The tolerance value is chosen to be 0.2, and the nonzero entries less than the tolerance value are removed from the matrix AA^T. The rows and columns of the new matrix AA^T are permuted using the Reverse Cuthill-McKee algorithm or the Cuthill-McKee algorithm; a figure below shows the result of permuting the matrix using the Reverse Cuthill-McKee algorithm.

The sparsity pattern of the SHERMAN4 matrix after completion of the second preprocessing strategy is shown afterwards. This matrix is obtained after permuting the rows of the original SHERMAN4 matrix, using either the Reverse Cuthill-McKee algorithm (panel (a)) or the Cuthill-McKee algorithm (panel (b)) inside the second preprocessing strategy, together with the column permutations based on the column labeling.

The following is an example of the command line and parameters for running the parallel block Cimmino experiments on the matrix SHERMAN4 after using the second preprocessing strategy

Figure: Matrix SHERMAN4 after applying preprocessing strategy I and permuting the columns using the column labeling (nz = 3786). In (a) the Reverse Cuthill-McKee algorithm is used inside the preprocessing phase, and in (b) the Cuthill-McKee algorithm is used.

Figure: Block tridiagonal structures of the normal equations of SHERMAN4 (nz = 10346), using the Reverse Cuthill-McKee algorithm in (a) and the Cuthill-McKee algorithm in (b).


Figure: Normalized nonzero values in the normal equations matrix of SHERMAN4.

Figure: Sparsity pattern of the normal equations of SHERMAN4 (nz = 3356) after removing the nonzero entries less than the tolerance value 0.2 and using the Reverse Cuthill-McKee algorithm to permute the matrix to a block tridiagonal form.


Figure: Matrix SHERMAN4 after applying preprocessing strategy II and permuting the columns using the column labeling (nz = 3786). In (a) the Reverse Cuthill-McKee algorithm is used inside the preprocessing phase, and in (b) the Cuthill-McKee algorithm is used.

and performing the column permutations:

    command  blkcimmino  hesfile spenv  datafile shermanstgrua
             partitions ...
             threshold ...  scheduling ...

As shown in the figure above, this partitioning does not lead to a two-block partitioning, but numerically it is not that different, because we have kept the largest entries of AA^T.

After the second preprocessing strategy there is more freedom in the way the permuted matrix is partitioned into blocks of rows, since the Cuthill-McKee ordering or the Reverse Cuthill-McKee ordering is applied to a sparser matrix than the original AA^T. In this section we have chosen only as many blocks as in the other cases, to compare fairly with the results from the other three preprocessing strategies. However, in the next section we present another example in which more blocks are obtained using the second preprocessing strategy.

In the SHERMAN4 problem, the performance of the Cimmino solver is substantially improved by the preprocessing strategies I and II. First of all, the number of iterations has been reduced considerably with the first preprocessing strategy, and almost as much with the second. This shows that the second preprocessing strategy does not differ too far, numerically, from a two-block partitioning. Furthermore, when either preprocessing strategy has been used, the independent blocks of rows reduce the communication required by the original linear system.

Using a trivial partitioning of the SHERMAN4 matrix without preprocessing leads to nearly uniform tasks, and this is the reason for the comparable results of the two scheduling strategies with nearly uniform tasks. However, this is not the case when either of the preprocessing strategies is used.

Table: Results from the parallel block Cimmino implementation with Block-CG acceleration using four different preprocessing strategies ('Original matrix A', 'After column permutation', 'Preprocessing strategy I', 'Preprocessing strategy II') in the solution of the SHERMAN4 problem. For each computing environment (number of CEs) and for the two scheduling strategies of the scheduler chapter, the table reports the time in seconds, the speedup and the efficiency, together with the sequential run times.

When the preprocessing strategies have been used, the resulting blocks differ not only in size but also in the number of column intersections between blocks of rows. Therefore, the workload-balancing scheduling strategy distributes the tasks in a less favorable manner than the communication-minimizing scheduling strategy.

The minimum execution time reported in the table is obtained on several processors using the partitioning strategy I; it represents a substantial improvement over the execution time of a sequential block Cimmino run solving the original matrix SHERMAN4.

Solving the GRENOBLE 1107 problem

Now we perform experiments similar to those in the previous section with the GRENOBLE 1107 problem. We study the effects of varying the tolerance value in the second strategy, and also the number of block partitions.

In the results table below, the results from runs of the parallel block Cimmino implementation are presented. The results in the column labeled 'Original matrix A' come from experiments using the block-row partitioning shown in panel (a) of the first figure below, and the results in the second column, labeled 'After column permutation', are from experiments using the block partitioning shown in panel (b).

The following is an example of the call to the block Cimmino solver that was used in the set of experiments that generated the first two columns of the table:

    command  blkcimmino  hesfile spenv  datafile grerua
             partitions ...
             blocksize ...  threshold ...  scheduling ...

Applying the first preprocessing strategy to the GRENOBLE 1107 problem leads to a matrix with the sparsity pattern shown in panel (a) of the corresponding figure; later, the columns of this matrix are permuted using the column labeling. As expected, the output of the first preprocessing strategy leads to a two-block partitioning, illustrated in panel (b).

In this case, an example of a call to the block Cimmino solver is

    command  blkcimmino  hesfile spenv  datafile grestgrua
             partitions ...  blocksize ...
             threshold ...  scheduling ...

Results from these experiments are reported in the results table under the column labeled 'Preprocessing strategy I'.

In the first and second preprocessing strategies, the normal equations of GRENOBLE 1107 are computed, and the sparsity pattern of the matrix AA^T is shown in panels (a) and (b) of the corresponding figure below. The normalized nonzero entries are plotted in a further figure.

Two tolerance values are selected for testing the second preprocessing strategy. First, 0.2 is used, and about half of the nonzero entries are removed from AA^T; the resulting matrix is reordered with the Cuthill-McKee algorithm.

Figure: (a) GRENOBLE 1107 original sparsity pattern (nz = 5664), and (b) the same matrix after grouping the columns according to the block partitions.

Figure: (a) GRENOBLE 1107 sparsity pattern after the first preprocessing strategy, using Cuthill-McKee for reordering the normal equations, and (b) the same preprocessed matrix after column permutations (nz = 5664).

Figure: Sparsity pattern of the normal equations of GRENOBLE 1107 (nz = 22427). In (a) the Reverse Cuthill-McKee ordering is used, and in (b) the Cuthill-McKee ordering is used.

Figure: Normalized nonzero values in the normal equations matrix of GRENOBLE 1107.

Panel (a) of the next figure shows the sparsity pattern of AA^T after removing the nonzero entries less than 0.2 and permuting the matrix to a block tridiagonal form; panel (b) shows the result of using the second preprocessing strategy. The truncated matrix AA^T in panel (a) has a smaller bandwidth than the normal equations from the original linear system shown in the earlier figure, and the blocks of rows in panel (b) have fewer column overlaps than those in the earlier partitionings.

Figure: (a) GRENOBLE 1107 sparsity pattern after the second preprocessing strategy, using Cuthill-McKee for reordering the normal equations and a tolerance value of 0.2 (nz = 6663 in the truncated normal equations), and (b) the same preprocessed matrix after column permutations (nz = 5664).

Choosing a higher tolerance value, for instance 0.8, will lead to an almost diagonal matrix (panel (a) of the next figure), and the output matrix from the second preprocessing strategy (panel (b)) is almost the same matrix as that shown in the earlier figure. Therefore, removing a large number of nonzero entries from the AA^T matrix will reduce the effects of the second preprocessing strategy.

In the results tables, the experiments with the tolerance value of 0.2 are presented under the columns labeled 'Preprocessing strategy II'. The following is an example of the call to the block Cimmino solver for these experiments:

    command  blkcimmino  hesfile spenv  datafile grestgrua
             partitions ...  blocksize ...
             threshold ...  scheduling ...

One of the advantages of using the second preprocessing strategy over the first one is the freedom in partitioning the blocks of rows. Thus, we repeat the last experiments using the second preprocessing strategy with a tolerance value of 0.2, with the system partitioned into a larger number of blocks of rows. A figure below illustrates the sparsity pattern of the resulting matrix. An example of the calls to the block Cimmino solver used in these experiments is

    command  blkcimmino  hesfile spenv  datafile grestgbrua
             partitions ...
             blocksize ...  threshold ...  scheduling ...

Figure: (a) GRENOBLE 1107 sparsity pattern after the second preprocessing strategy, using Cuthill-McKee for reordering the normal equations and a tolerance value of 0.8 (nz = 1107 in the truncated normal equations), and (b) the same preprocessed matrix after column permutations (nz = 5664).

There is a reduction in the number of column overlaps between the blocks of rows in this last figure compared with those in the earlier figures. Furthermore, the number of rows in each block is no longer uniform, as it was in the trivial partitionings shown earlier.

The GRENOBLE 1107 problem is interesting because the preprocessing strategy does not change the behavior of the block Cimmino solver. That is to say, the distribution of eigenvalues inside the blocks of rows does not change after applying any of the preprocessing strategies. Thus the algorithm takes exactly the same number of steps to converge in all of the experiments.

The results in the two tables below show that using the second preprocessing strategy with a few more blocks will improve the parallel performance of the solver, because the number of column overlaps between the blocks is reduced, and with more blocks, more tasks can be generated and more CEs can be used.

The maximum speedups in these cases are obtained with preprocessing strategy I and with preprocessing strategy II with the larger number of blocks, using the third scheduling strategy. The runs with the first preprocessing strategy reported almost the same parallel behavior, in terms of efficiencies, as the second preprocessing strategy with the smaller number of blocks. Thus the use of the second preprocessing strategy is sensitive to the number of block partitions.

Figure: Sparsity pattern of GRENOBLE 1107 after preprocessing strategy II with a tolerance value of 0.2; in this case a larger number of blocks of rows is defined.


Table: Results from the parallel block Cimmino with Block-CG acceleration using three different preprocessing strategies ('Original matrix A', 'After column permutation', 'Preprocessing strategy I') on the solution of the GRENOBLE 1107 problem. For each computing environment (number of CEs) and for the two scheduling strategies of the scheduler chapter, the table reports the time in seconds, the speedup and the efficiency, together with the sequential run times.


Table: Results from the parallel block Cimmino with Block-CG acceleration, varying the number of blocks of rows in preprocessing strategy II, on the solution of the GRENOBLE 1107 problem. For each computing environment (number of CEs) and for the two scheduling strategies of the scheduler chapter, the table reports the time in seconds, the speedup and the efficiency, together with the sequential run times.


Remarks

From experiments in Ruiz, the use of the augmented system approach is not very efficient when the blocks have a small number of rows compared to their number of columns. Thus the first preprocessing strategy is preferred, because it delivers a two-block partitioning with less computational effort than the second preprocessing strategy.

Alternatively, if for some linear systems partitionings with small blocks must be preferred, then the normal equations approach of Bramley, and Bramley and Sameh, should be used inside the block Cimmino solver, because it performs better in terms of efficiency. In these cases the solver should be combined with the second preprocessing strategy, to have more flexibility in the number and sizes of the blocks of rows.

The above experiments suggest that preprocessing strategies can greatly improve the performance of the parallel Cimmino solver. Furthermore, the advantages of using either preprocessing strategy will depend on the problem being solved and the computing environment.

If a large number of CEs is available and they are connected through a fast interconnection network, then the second preprocessing strategy can be used to generate as many blocks of rows as the number of CEs in the system. Nevertheless, a compromise must be found between the size of the tasks that are generated from the block-row partitions, the efficiency of the CEs, and the method used for computing the normal equations inside the block Cimmino iterations.

On the other hand, if the computational system has only a few CEs, or if there are possible bottlenecks in the communication between the CEs (for instance the bus in an Ethernet network), then the first preprocessing strategy should be used, because it produces a two-block partitioning of the original system, which implicitly reduces the communication cost and delivers bigger blocks of rows.


Chapter

Conclusions

Currently, there is much active work in developing Krylov based iterative methods for the solution of linear systems and eigenvalue problems. At the same time, there is a great demand for accelerating the performance of these methods on the solution of large systems of equations while using the computational resources efficiently.

Various authors have studied different techniques to improve the performance of parallel implementations of conjugate gradient type methods. Parallel implementations of Classical CG have attempted to reduce the number of synchronization points embedded in the Classical CG algorithm and in this way minimize its execution time. However, these efforts have led either to a loss of robustness of Classical CG or to an increase in its complexity, and in less favorable implementations both drawbacks are found.

In general, increasing the complexity of any algorithm pays off when the more complex algorithm either extends the usefulness of the original or improves its computational rate. In both cases, the robustness and correctness of the original algorithm must be preserved. Furthermore, for computational purposes, if there is an increase in the storage requirements, then it should be linear in the size of the system of equations.

We have studied a stabilized version of the Block-CG algorithm, which is a generalization of the CG method. In exact arithmetic, Block-CG reduces the number of iterations of Classical CG by a factor of its block size. In practice, this block size is chosen to be much smaller than the size of the linear system. In the stabilized Block-CG algorithm, there is an increase in the storage requirements over the Classical CG algorithm. This increase is only a factor of the block size times the size of the linear system.

The computational rate of the stabilized Block-CG implementation increases as the block size is increased, and we show this effect in Chapter. Some results from experiments are shown in Figure. In the figure, results from runs of Classical CG are reported with a block size of one. The increase in the complexity of the algorithm is caused by the use of rectangular matrices of size n × s instead of vectors of size n, and by the extra orthonormalizations needed to preserve the stability of the algorithm.
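To make the structure of these block recurrences concrete, the following is a minimal sketch of a Block-CG iteration in Python/NumPy. It follows the classical Block-CG recurrences, with a QR re-orthonormalization of the direction block standing in for the stabilization discussed above; it is an illustrative sketch under these assumptions, not the exact stabilized variant studied in this work.

    import numpy as np

    def block_cg(A, B, tol=1e-8, maxit=500):
        # Minimal Block-CG sketch for a symmetric positive definite A and a
        # block of s right-hand sides B of size n x s.  Rank deficiencies in
        # the block residual are ignored here for simplicity.
        n, s = B.shape
        X = np.zeros((n, s))
        R = B - A @ X                      # block residual, n x s
        P, _ = np.linalg.qr(R)             # orthonormalized block of directions
        bnorm = np.linalg.norm(B)
        for _ in range(maxit):
            AP = A @ P                     # matrix times a block of s vectors
            alpha = np.linalg.solve(P.T @ AP, P.T @ R)   # small s x s system
            X += P @ alpha
            R -= AP @ alpha
            if np.linalg.norm(R) <= tol * bnorm:
                break
            beta = -np.linalg.solve(P.T @ AP, AP.T @ R)  # A-conjugation coefficients
            P, _ = np.linalg.qr(R + P @ beta)            # re-orthonormalize new directions
        return X

With s = 1, this sketch reproduces, in exact arithmetic, the iterates of Classical CG.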

Additionally, in the Block-CG algorithm the matrix-vector operations of Classical CG are replaced by matrix-matrix operations. Thus, Level 3 BLAS routines are used in Block-CG implementations instead of the Level 2 BLAS routines used by Classical CG. The use of the Level 3 BLAS routines increases the Mflop rates, as shown in Figure, due to better data locality.
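The data-locality argument can be illustrated with a small, rough comparison (not a benchmark, and using a dense matrix only for simplicity): grouping s vectors into one matrix-matrix product lets the matrix be streamed once per block instead of once per vector.

    import time
    import numpy as np

    n, s = 4000, 16
    A = np.random.rand(n, n)
    V = np.random.rand(n, s)

    t0 = time.perf_counter()
    W_gemm = A @ V                                             # one matrix-matrix product
    t1 = time.perf_counter()
    W_gemv = np.column_stack([A @ V[:, j] for j in range(s)])  # s matrix-vector products
    t2 = time.perf_counter()

    print("matrix-matrix: %.3fs, looped matrix-vector: %.3fs" % (t1 - t0, t2 - t1))
    assert np.allclose(W_gemm, W_gemv)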


[Figure: sequential Block-CG performance; Mflops plotted against block size for the SHERMAN4, LANPRO(NOS3), and LANPRO(NOS2) matrices. Caption: plot of Mflops against block size from runs of a stabilized Block-CG implementation on an IBM RS workstation.]

Several authors have studied the parallelization of the Classical CG, of the Block-CG, and of preconditioned versions of these algorithms. We have shown in Chapter that the Block-CG is more suitable for parallel computing environments than the Classical CG because of its granularity and the implicit reduction in the number of synchronizations. From the results of the three parallel implementations of the Block-CG, we have concluded that the trivial parallelization of the HP products is not as efficient as parallelizing the entire algorithm.

Basically, the computational weight of the HP products decreases as the block size is increased. This effect, summarized in Figure, motivates our development of the other two parallel implementations of the Block-CG, in which most of the Block-CG operations are performed in parallel.

The Block-CG works well as an acceleration technique inside basic iterative methods for the solution of general linear systems. We have studied a reliable implementation of the block Cimmino iteration with Block-CG acceleration. In this case, we have increased the complexity of the stabilized Block-CG algorithm to be able to solve a larger set of linear systems, and we have improved its overall solution time using a parallel implementation.
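As a reminder of the underlying idea, here is a minimal serial sketch of one block Cimmino sweep; blocks is an assumed list of (A_i, b_i) row-block pairs, and numpy.linalg.pinv stands in for the augmented-system or normal-equations solve of the actual implementation.

    import numpy as np

    def cimmino_residual(blocks, x):
        # Sum of the projections of the current block residuals: returns
        # sum_i A_i^+ (b_i - A_i x), i.e. the residual of the fixed-point
        # system H x = sum_i A_i^+ b_i with H = sum_i A_i^+ A_i.
        delta = np.zeros_like(x)
        for A_i, b_i in blocks:
            delta += np.linalg.pinv(A_i) @ (b_i - A_i @ x)
        return delta

Since H is a sum of orthogonal projectors, it is symmetric positive semidefinite (definite when A has full column rank), so this residual is exactly what a CG or Block-CG acceleration needs at each step; this is why the two methods combine so naturally.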

The performance of the parallel implementation of the block Cimmino method with Block-CG acceleration is more sensitive to the strategy used for partitioning the system into blocks of rows, and to the computational environment, than the Block-CG algorithm is to column partitioning. We have presented two examples of preprocessing strategies to find natural partitionings of the linear system of equations.

The first preprocessing strategy provides a two-block partitioning of the original system of equations. A two-block partitioning of the linear system improves the performance of the parallel implementation by reducing the communication between blocks of rows. Additionally, a two-block partitioning, as presented in Section, can accelerate the rate of convergence of block Cimmino.

[Figure: computational weight of the matrix product in Block-CG; percentage of sparse matrix-vector multiplications plotted against block size for (a) LANPRO(NOS2), (b) LANPRO(NOS3), and (c) SHERMAN4. Caption: ratio of floating point operations from the HP products over the total number of floating point operations in a Classical CG or Block-CG iteration.]

However, a two-block partitioning may deliver blocks of rows with extremely nonuniform workload, which may lead to unbalanced workload distributions and thus affects the efficiency of the parallel implementation.

The second preprocessing strategy can be used to derive a more flexible partitioning than the two-block one. Generally, this strategy delivers more blocks than the first preprocessing strategy; thus, more CEs can be used during a parallel run of the block Cimmino implementation. The second preprocessing strategy may yield very small blocks of rows and, as stated in Chapter, the augmented system approach on small blocks with fewer rows than columns is not very efficient. Therefore, it is recommended to regroup contiguous blocks of rows into bigger blocks.
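For illustration, a minimal sketch of such a regrouping step is given below; row_blocks and min_rows are assumed names, and contiguity of the input blocks is taken for granted.

    def regroup_blocks(row_blocks, min_rows):
        # Merge contiguous blocks of rows until each resulting block has at
        # least min_rows rows; leftover rows are attached to the last block.
        merged, current = [], []
        for block in row_blocks:
            current.extend(block)
            if len(current) >= min_rows:
                merged.append(current)
                current = []
        if current:
            if merged:
                merged[-1].extend(current)
            else:
                merged.append(current)
        return merged

    # Example with hypothetical row indices: blocks of sizes 2, 1, 3, 1 become
    # two blocks of at least 3 rows each.
    # regroup_blocks([[0, 1], [2], [3, 4, 5], [6]], 3)  ->  [[0, 1, 2], [3, 4, 5, 6]]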

Inside the block Cimmino implementation, we have used the Master-Slave distributed implementation of the Block-CG. In this case, the granularity of the tasks generated by the Block-CG iteration has increased, and the performance of the parallel block Cimmino algorithm is sensitive to the distribution of tasks among the available CEs.

We have studied different scheduling strategies to distribute tasks among heterogeneous CEs. Basically, in many iterative methods the number and sizes of the tasks remain almost constant from one iteration to the other. Furthermore, in several implementations these tasks create data structures that are local to one CE and expensive to move around. For these reasons, we have preferred a static scheduler over a dynamic one.

Although we have designed a scheduler for heterogeneous environments, the homogeneous cases can be regarded as instances of these. In heterogeneous environments, for each CE, its theoretical speed and the number of available processors are used as parameters to the scheduler. After partitioning the system of equations into blocks of rows, heuristics about the computational effort required on a block of rows can be derived. In our implementation, we use for each block of rows the number of column overlaps with other blocks, the size of the augmented system, and the number of nonzero elements as parameters to the scheduler.


Both sets of heuristics, from each CE and from each block of rows, have led us to implement a static scheduler of the greedy subclass. After isolating some numerical issues that accelerate the rate of convergence of the block Cimmino solver, we have shown an improvement in the performance of the parallel block Cimmino solver due to a proper selection of the parameters passed to the scheduler and of the scheduling strategy used.
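A minimal sketch of a greedy static assignment of this kind is shown below. It only captures the load-balancing flavour of such a scheduler: block_costs and ce_speeds are assumed inputs, where each block cost would be estimated from the heuristics above (column overlaps, augmented system size, number of nonzeros) and each speed is a relative theoretical rate per CE. It is not a reproduction of the STG strategies of Chapter.

    import heapq

    def greedy_static_schedule(block_costs, ce_speeds):
        # Assign blocks of rows, largest estimated cost first, to the compute
        # element (CE) whose predicted completion time is currently smallest.
        heap = [(0.0, ce) for ce in range(len(ce_speeds))]   # (predicted time, CE)
        heapq.heapify(heap)
        assignment = {ce: [] for ce in range(len(ce_speeds))}
        for blk in sorted(range(len(block_costs)), key=lambda i: -block_costs[i]):
            load, ce = heapq.heappop(heap)                   # least loaded CE so far
            assignment[ce].append(blk)
            heapq.heappush(heap, (load + block_costs[blk] / ce_speeds[ce], ce))
        return assignment

    # Example with hypothetical costs and speeds: six row blocks on three CEs,
    # the first CE being twice as fast as the others.
    # greedy_static_schedule([5.0, 3.0, 2.0, 2.0, 1.0, 1.0], [2.0, 1.0, 1.0])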

For instance, Figure illustrates an efficiency surface obtained with various runs of the block Cimmino with Block-CG acceleration, in which the block size and the number of processors are varied while the number of equivalent iterations is kept constant. In the experiments of Figure, the first preprocessing strategy and the third scheduling strategy were used. Using a different scheduling or preprocessing strategy leads to totally different efficiency results. In Figure, the experiments in which a block size of one was used in the Block-CG acceleration reported the lowest efficiencies. In these cases, the efficiency decreases as the number of CEs is increased. For larger block sizes, the efficiency surface is smoother, even when the number of CEs is increased.

[Figure: parallel efficiency of the block Cimmino; efficiency surface plotted against block size and number of CEs. Caption: efficiency surface from parallel runs of the block Cimmino on the solution of the SHERMAN problem.]

The parallel strategies developed in the implementation of the block Cimmino with Block-CG acceleration can easily be accommodated to implementations of other iterative schemes. The static greedy based scheduler can be tuned to accept other input parameters from the computational environment and the linear system being solved. The strategies used in the manipulation of distributed operations, like the reduce, are fully reusable in other parallel programs.

The modular structure of the parallel block Cimmino solver allows the reusability of some of the modules inside other iterative schemes. For instance, the module with Block-CG acceleration can be reused inside the block Kaczmarz iteration, with a new module for computing the Kaczmarz projections. Furthermore, an iterative scheme has been proposed recently by Arioli and Ruiz, which the authors call Block-CGSI. The Block-CGSI algorithm combines the stabilized Block-CG with the theory of subspace iteration to overcome some of the breakdowns of the Block-CG due to ill-conditioned systems and round-off errors. Block-CGSI increases the complexity of the original Block-CG algorithm in order to extend its applicability to general linear systems of equations.

Moreover, any of the parallel Block-CG implementations presented in Chapter can be reused in an implementation of the new iterative scheme. Based on the experiments reported in Chapter, the Master-Slave distributed or the All-to-All implementations should be just as efficient for the new iterative scheme.

The static scheduler from Chapter can also be reused for distributing the tasks among the CEs. Similarly, the software modules used in the block Cimmino and Block-CG implementations to perform the reduce operations, to gather information from the computational neighborhoods, and to handle scattered data structures in parallel distributed environments can also be reused in a parallel implementation of Block-CGSI.

Hence, we fully expect that Block-CGSI, as well as other block iterative schemes, can show superior performance when implemented using the parallel concepts introduced in this work.


Bibliography

AEA Technology, HARWELL SUBROUTINE LIBRARY, Didcot, England. A catalogue of subroutines, Release.

G. Amdahl, Validity of the single processor approach to achieving large scale computer capabilities, in AFIPS Conference Proceedings, vol.

E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, USA, second ed.

M. Arioli and D. Ruiz, Block Conjugate Gradient with Subspace Iteration for Solving Linear Systems, Tech. Rep. RT/APO, ENSEEIHT-IRIT, Toulouse, France. To appear in the proceedings of the Second IMACS International Symposium on Iterative Methods in Linear Algebra, Bulgaria, June.

M. Arioli, J. W. Demmel, and I. S. Duff, Solving Sparse Linear Systems with Sparse Backward Error, SIAM J. Matrix Anal. and Applics.

M. Arioli, A. Drummond, I. S. Duff, and D. Ruiz, A Parallel Scheduler for Block Iterative Solvers in Heterogeneous Computing Environments, in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, Philadelphia, USA, SIAM.

M. Arioli, I. S. Duff, and D. Ruiz, Stopping Criteria for Iterative Solvers, SIAM J. Matrix Anal. and Applics. Also Technical Report TR/PA, CERFACS, France.

M. Arioli, I. S. Duff, J. Noailles, and D. Ruiz, Block Cimmino and Block SSOR Algorithms for Solving Linear Systems in a Parallel Environment, Tech. Rep. TR, CERFACS, Toulouse, France.

M. Arioli, I. S. Duff, J. Noailles, and D. Ruiz, A Block Projection Method for Sparse Matrices, SIAM J. Sci. Stat. Comput.

R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM.

R. H. Bartels, G. H. Golub, and M. A. Saunders, Numerical techniques in mathematical programming, in Nonlinear Programming, J. B. Rosen, O. L. Mangasarian, and K. Ritter, eds., Academic Press, New York.


A. Beguelin, J. Dongarra, A. Geist, R. Manchek, and V. Sunderam, A Users' Guide to PVM Parallel Virtual Machine, Tech. Rep. ORNL/TM, Oak Ridge National Laboratory, Tennessee.

M. Benzi, F. Sgallari, and G. Spaletta, A parallel block projection method of the Cimmino type for finite Markov chains, in Computations with Markov Chains: Proceedings of the 2nd International Workshop on the Numerical Solution of Markov Chains, W. J. Stewart, ed., Boston, Kluwer Academic Publishers.

A. Björck and G. H. Golub, Numerical methods for computing angles between linear subspaces, Math. Comp.

S. H. Bokhari, Dual processor scheduling with dynamic reassignment, IEEE Trans. Software Eng., SE.

S. H. Bokhari (a), On the mapping problem, IEEE Trans. on Comp., C.

S. H. Bokhari (b), A shortest tree algorithm for optimal assignments across space and time in a distributed processor system, IEEE Trans. Software Eng., SE.

R. Bramley and A. Sameh, Row projection methods for large nonsymmetric linear systems, Tech. Rep., Center for Supercomputing Research and Development, University of Illinois, Urbana, IL. Also SISSC, Vol., January.

R. Bramley, Row projection methods for linear systems, PhD Thesis, Center for Supercomputing Research and Development, University of Illinois, Urbana, IL.

R. Butler and E. Lusk, User's Guide to the p4 Parallel Programming System, Mathematics and Computer Science Division, Argonne National Laboratory.

S. L. Campbell and C. D. Meyer, Generalized Inverses of Linear Transformations, Pitman, London.

T. L. Casavant and J. G. Kuhl, A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems, IEEE Transactions on Software Engineering.

P. Chrétienne, Task scheduling over distributed memory machines, in Parallel and Distributed Algorithms, M. Cosnard, Y. Robert, P. Quinton, and M. Raynal, eds., Amsterdam, Elsevier Science.

A. T. Chronopoulos and C. W. Gear, s-Step iterative methods for symmetric linear systems, in Supercomputing, Los Alamos, CA, IEEE Computer Society Press.

A. T. Chronopoulos, Towards efficient parallel implementation of the CG method applied to a class of block tridiagonal linear systems, Computer and Applied Mathematics.

G. Cimmino, Calcolo approssimato per le soluzioni dei sistemi di equazioni lineari, in Ricerca Sci. II, vol. I.

E. Cuthill and J. McKee, Reducing the bandwidth of sparse symmetric matrices, in Proceedings of the th National Conference of the Association for Computing Machinery, New Jersey, Brandon Press.


E. D'Azevedo, V. Eijkhout, and C. Romine, Reducing Communication Costs in the Conjugate Gradient Algorithm on Distributed Memory Multiprocessors, Tech. Rep. CS, University of Tennessee, Knoxville, Tennessee. Also LAPACK Working Note.

J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel Numerical Linear Algebra, Acta Numerica.

Dror, Cost Allocation: The Traveling Salesman, Bin Packing, and the Knapsack, Applied Mathematics and Computation.

A. Drummond, I. S. Duff, and D. Ruiz, A Parallel Distributed Implementation of the Block Conjugate Gradient Algorithm, Tech. Rep. TR/PA, CERFACS, Toulouse, France.

I. S. Duff and J. K. Reid, The multifrontal solution of indefinite sparse linear systems, ACM Trans. Math. Softw.

I. S. Duff, A. M. Erisman, and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, New York, USA.

I. S. Duff, R. G. Grimes, and J. G. Lewis, Users' Guide for the Harwell-Boeing Sparse Matrix Collection, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, RAL.

K. Efe, Heuristic models of task assignment scheduling in distributed systems, Computer.

T. Elfving, Block-iterative methods for consistent and inconsistent linear equations, Numer. Math.

A. Gabrielian and D. B. Tyler, Optimal object allocation in distributed computer systems, in Proceedings from the th International Conference on Distributed Computing Systems.

C. Gao, J. W. S. Liu, and M. Railey, Load balancing algorithms in homogeneous distributed systems, in Proceedings from the International Conference on Parallel Processing.

M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, San Francisco.

A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Network Parallel Computing, MIT Press, Cambridge, Massachusetts.

A. George, Computer implementation of the finite element method, PhD thesis, Department of Computer Science, Stanford University, Stanford, California. Report STAN-CS.

G. H. Golub and W. Kahan, Calculating the singular values and pseudo-inverse of a matrix, SIAM J. Numer. Anal.


G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, MD, second ed.

D. R. Greening, Parallel simulated annealing techniques, Physica D: Nonlinear Phenomena.

G. D. Hachtel, Extended applications of the sparse tableau approach: finite elements and least squares, in Basic Questions of Design Theory, W. R. Spillers, ed., North Holland, Amsterdam.

L. A. Hageman and D. M. Young, Applied Iterative Methods, Academic Press, London.

A. Heddaya and K. Park, Mapping parallel iterative algorithms onto workstation networks, in Proceedings from the 3rd International Symposium on High Performance Computing.

M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. Natl. Bur. Stand.

J. H. Holland, Adaptation in Natural and Artificial Systems, Ann Arbor, University of Michigan Press.

D. S. Johnson, C. H. Papadimitriou, and M. Yannakakis, How easy is local search?, in Proceedings from the Annual Symposium on Foundations of Computer Science.

S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bull. Intern. Acad. Polonaise Sci. Lettres, Cracovie, Classe Sci. Math. Natur., Série A, Sci. Math.

C. Kamath and A. Sameh, A projection method for solving nonsymmetric linear systems on multiprocessors, Parallel Computing.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Optimization by simulated annealing, Science.

L. Kleinrock and A. Nilsson, On optimal scheduling algorithms for time-shared systems, ACM.

L. Kleinrock, Queueing Systems, vol.: Computer Applications, Wiley, New York.

C. Lanczos, Solution of systems of linear equations by minimized iterations, J. Res. Natl. Bur. Stand.

V. M. Lo, Algorithms for static task assignment and symmetric contraction in distributed systems, in Proceedings of the th International Conference on Parallel Processing, The Pennsylvania State University Press.

P. R. Ma, E. Y. Lee, and M. Tsuchiya, A task allocation model for distributed computing systems, IEEE Trans. on Comp., C.

E. W. Mayr, Parallel Approximation Algorithms, Technical Report STAN-CS-TR, Stanford University, Department of Computer Science.

G. Meurant, Multitasking the Conjugate Gradient method on the CRAY X-MP, Parallel Computing.


T. Mizuike, Y. Ito, D. Kennedy, and L. Nguyen, Burst Scheduling Algorithms for SS/TDMA Systems, IEEE Trans. on Commun., COM.

J. Modersitzki, Conjugate Gradient Type Methods for Solving Symmetric Indefinite Linear Systems, Tech. Rep., University of Utrecht, Netherlands.

A. A. Nikishin and A. Y. Yeremin, Variable Block CG algorithms for solving large sparse symmetric positive definite linear systems on parallel computers, I: General iterative scheme, SIAM J. Matrix Anal. Appl.

W. Oettli and W. Prager, Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides, Numer. Math.

D. P. O'Leary, The Block Conjugate Gradient algorithm and related methods, Linear Algebra Appl.

D. P. O'Leary, Iterative methods for finding the stationary vector for Markov chains, IMA Volume on Linear Algebra, Markov Chains, and Queueing Models.

C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal.

G. N. S. Prasanna and B. R. Musicus, Generalised Multiprocessor Scheduling Using Optimal Control, ACM.

D. Ruiz, Solution of large sparse unsymmetric linear systems with a block iterative method in a multiprocessor environment, PhD thesis, CERFACS, Toulouse, France.

Y. Saad, Practical use of polynomial preconditionings for the Conjugate Gradient method, SIAM J. Scientific and Statistical Computing.

Y. Saad, Krylov subspace methods on supercomputers, Tech. Rep., RIACS, Moffett Field, CA.

Y. Saad, Krylov subspace methods on supercomputers, SIAM J. Scientific and Statistical Computing.

C. Shen and W. Tsai, A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion, IEEE Trans. Comput., C.

H. D. Simon, The Lanczos Algorithm with Partial Reorthogonalization, Mathematics of Computation.

H. D. Simon, Incomplete LU preconditioned conjugate-gradient-like methods in reservoir simulation, in Proceedings of the Eighth SPE Symposium on Reservoir Simulation.

J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, Springer-Verlag, Berlin.

H. S. Stone, Critical load factors in two-processor distributed systems, IEEE Trans. Software Eng., SE.


G. Strang, Introduction to Applied Mathematics, Wellesley-Cambridge Press, Cambridge, MA.

E. G. Talbi and T. Muntean, General heuristics for the mapping problem, in World Transputer Congress, Aachen, Germany, IOS Press.

H. A. van der Vorst, Lecture notes on iterative methods, Mathematical Institute, University of Utrecht, The Netherlands.

O. C. Zienkiewicz and R. L. Taylor, The Finite Element Method, vol.: Basic Formulation and Linear Problems; Solid and Fluid Mechanics, Dynamics and Non-Linearity, McGraw-Hill International Editions, fourth ed.