Multigrid Equation Solvers for Large Scale Nonlinear
Finite Element Simulations

Mark Francis Adams

Report No. UCB/CSD

January

Computer Science Division (EECS)
University of California, Berkeley, California

Multigrid Equation Solvers for Large Scale Nonlinear
Finite Element Simulations

by

Mark Francis Adams

B.A. (University of California, Berkeley)

A dissertation submitted in partial satisfaction of the
requirements for the degree of

Doctor of Philosophy

in

Engineering - Civil Engineering

in the

GRADUATE DIVISION
of the
UNIVERSITY of CALIFORNIA, BERKELEY

Committee in charge:

Professor Robert L. Taylor, Co-chair
Professor James W. Demmel, Co-chair
Professor Gregory L. Fenves
Professor Katherine Yelick

The dissertation of Mark Francis Adams is approved:

Co-chair                                Date

Co-chair                                Date

                                        Date

                                        Date

University of California, Berkeley

Multigrid Equation Solvers for Large Scale Nonlinear
Finite Element Simulations

Copyright

by

Mark Francis Adams

Abstract

Multigrid Equation Solvers for Large Scale Nonlinear Finite Element Simulations

by

Mark Francis Adams

Doctor of Philosophy in Engineering - Civil Engineering

University of California, Berkeley

Professor Robert L. Taylor, Co-chair
Professor James W. Demmel, Co-chair

The finite element method has grown over the years to be a popular method for the simulation of physical systems in science and engineering. The finite element method is used in a wide array of industries; in fact, just about any enterprise that makes a physical product can, and probably does, use finite element technology. The success of the finite element method is due in large part to its ability to allow the use of accurate formulations of partial differential equations (PDEs) on arbitrarily general physical domains with complex boundary conditions. Additionally, the rapid growth in the computational power available in today's computers, at an ever more affordable price, has made finite element technology accessible to a wider variety of industries and academic disciplines.

As computational resources allow people to produce ever more accurate simulations of their systems, allowing for the more efficient design and safety testing of everything from automobiles to nuclear weapons to artificial knee joints, all aspects of the finite element simulation process are stressed. The largest bottleneck to growth in the scale of finite element applications is the linear equation solver used in implicit time integration schemes. This is because the direct solution methods popular in the finite element community (as they are efficient, easy to use, and relatively unaffected by the underlying PDE and discretization) do not scale well with increasing problem size.

The scale of problems that are now becoming feasible demands that iterative methods be used. The performance issues of iterative methods are very different from those of direct methods, as their performance is highly sensitive to the underlying PDE and discretization; the construction of robust iterative methods for finite element matrices is a hard problem that is currently a very active area of research. We discuss the iterative methods commonly used today and show that they can all be characterized as methods that solve problems efficiently by projecting the solution onto a series of subspaces. The goal of iterative method design, and indeed of finite element method design, is to select a series of subspaces that solves problems optimally: solvers try to minimize solution costs, and finite element formulations try to optimize the accuracy of the solution. The subspaces used in multigrid methods are highly effective in minimizing solution costs, particularly on large problems. Multigrid is known to be the most effective solution method for some discretized PDEs; however, its effective use on unstructured finite element meshes is an open problem and constitutes the theme of this study.

The main contribution of this dissertation is the algorithmic development and experimental analysis of robust and scalable techniques for the solution of the sparse, ill-conditioned matrices that arise from finite element simulation in 3D continuum mechanics. We show that our multigrid formulations are effective in the linear solution of systems with large jumps in material coefficients, of problems with realistic mesh configurations and geometries (including poorly proportioned elements), and of problems with poor geometric conditioning, as is commonplace in structural engineering. We show that these methods can be used effectively within nonlinear simulations via Newton's method. We solve problems with more than sixteen million degrees of freedom, with good parallel solver efficiency, on a Cray T3E. We also show that our methods can be adapted and extended to the indefinite matrices that arise in the simulation of problems with constraints, namely contact problems formulated with Lagrange multipliers.

Professor Robert L. Taylor, Dissertation Committee Co-chair

Professor James W. Demmel, Dissertation Committee Co-chair

To my best friend Nighat,
for constant support for many years before my graduate work and throughout the
many years of this work.

Contents

List of Figures

List of Tables

Dissertation summary
    Introduction
    The finite element method
    Motivation
    Goals
    Dissertation outline
    Contributions
    Notation

Mathematical preliminaries
    Introduction
    The finite element method
    Finite element example: Linear isotropic heat equation
    Iterative equation solver basics
    Matrix splitting methods
    Krylov subspace methods
    Preconditioned Krylov subspace methods
    Krylov subspace methods as projections
    One level domain decomposition
    Alternating Schwarz method
    Multiplicative and additive Schwarz
    Multilevel domain decomposition
    A simple two level method
    Multigrid
    Convergence of multigrid
    Convergence analysis of domain decomposition
    Variational formulation
    Domain decomposition components
    A convergence theory

High performance linear equation solvers for finite element matrices
    Introduction
    Algebraic multigrid
    A promising algebraic method
    Geometric approach on unstructured meshes
    Promising geometric approaches
    Domain decomposition
    A domain decomposition method

Our method
    Introduction
    A parallel maximal independent set algorithm
    An asynchronous distributed memory algorithm
    Shared memory algorithm
    Distributed memory algorithm
    Complexity of the asynchronous maximal independent set algorithm
    Numerical results
    Maximal independent set heuristics
    Automatic coarse grid creation with unstructured meshes
    Topological classification of vertices in finite element meshes
    A simple face identification algorithm
    Modified maximal independent set algorithm
    Vertex ordering in the MIS algorithm on modified finite element graphs
    Coarse grid cover of the fine grid
    Numerical results
    Finite element shape functions
    Galerkin construction of coarse grid operators
    Smoothers

Multigrid characteristics on linear problems in solid mechanics
    Introduction
    Multigrid works
    Large jumps in material coefficients: soft section cantilever beam
    Large jumps in material coefficients: curved material interface
    Incompressible materials
    Poorly proportioned elements
    Conclusion

Parallel architecture and algorithmic issues
    Introduction
    Parallel finite element code structure
    Finite element code structure
    Finite element parallelism
    Parallelism and graph partitioning
    Parallel computer architecture
    Multiple levels of partitioning
    Solver complexity issues
    A parallel finite element architecture
    Athena
    Epimetheus
    Prometheus
    Athena/Epimetheus/Prometheus construction details
    Processor subdomain agglomeration
    Simple subdomain agglomeration method
    Subdomain agglomeration as an integer programming problem
    Potential use of a computational model
    Subdomain agglomeration method

Parallel performance and modeling
    Introduction
    Motivation, computational models, and notation
    Notation and computational models
    PRAM computational model and analysis
    Costs and benefits
    Efficiency measures
    Computational model of multigrid
    Multigrid component labeling
    Coarse grid size and density
    Floating point costs
    Communication costs
    Total cost of components
    Conclusion and future work in multigrid complexity modeling

Linear scalability studies
    Introduction
    Solver configuration and problem definitions
    Problem P
    Scalability studies on a Cray T3E
    Scalability studies on an IBM PowerPC cluster
    Agglomeration and level performance
    Agglomeration
    Performance on different multigrid levels
    End to end performance
    Cray T3E
    IBM PowerPC cluster
    Conclusion

Large scale nonlinear results and indefinite systems
    Introduction
    Nonlinear problem P
    Nonlinear solver
    Cray T3E large scale linear solves
    Cray T3E nonlinear solution
    Lagrange multiplier problems
    Numerical results
    Conclusion

Conclusion
    Future Work

Bibliography

A  Test problems

B  Machines

List of Figures

Conjugate Gradient Algorithm
Preconditioned Conjugate Gradient Algorithm
Schwarz's original figure
Two subdomains with matching grids
Matrix graph of the two subdomain problem
Matrix A for the two subdomain problem
Multiplicative Schwarz with two subdomains
Additive Schwarz with two subdomains
Matrix for the two level method
Graph for the restriction matrix for the two level method
Multigrid V-cycle algorithm
Multigrid V-cycle
Full multigrid algorithm
Multigrid F-cycle
Graph of the spectrum of R
Finite element quadrilateral mesh and its corresponding graph
Basic MIS algorithm (BMA) for the serial construction of an MIS
Shared memory MIS algorithm (SMMA) running on processor p
Asynchronous distributed memory MIS algorithm (ADMMA) on processor p
ADMMA "Action" procedures running on processor p
Weaving monotonic path (WMP) in an FE mesh
FE mesh used in the MIS experiments
Average iterations vs. number of processors and number of vertices
Multigrid coarse vertex set selection on structured meshes
Basic MIS algorithm for the serial construction of an MIS
Face identification algorithm
Poor MIS for multigrid of a shell
Original and fully modified graph
MIS and coarse mesh
Modified coarse grid
Fine (input) grid and coarse grids for a problem in 3D elasticity
Test problems: sphere, beam-column, and tube
Matrix triple product algorithm running on processor p
Cantilever with uniform mesh and end load
Residual convergence for multiple discretizations of a cantilever
Residual convergence for a cantilever with a soft section
Residual convergence for the included sphere with soft cover
Residual convergence for the included sphere with incompressible cover material
Spectrum of the included sphere with incompressible cover material
Cantilevered hollow cone: first principal stress and deformed shape
Residual convergence for the cone problem
Common computer architectures
FEAP command language example
Code architecture
Matrix vector product Mflop/sec on a Cray T3E
Cartoon of the cost function and its transpose
Cartoon of the fitted function
Search for the best feasible integer number of processors
Multigrid computational components: labels and parameters
Finite element cost structure
Efficiency plot structure
Full multigrid cycle
Floating point counts for multigrid components
Costs for multigrid components
Comparison of model with experimental data for the send phase of the matrix vector product on the fine grid
Cost inventory of CG with full multigrid
3D FE mesh and deformed shape
Included sphere, fixed dof per processor: solve times on a Cray T3E
Included sphere, fixed dof per processor: efficiency on a Cray T3E
Included sphere, fixed dof per processor: efficiency on a Cray T3E
Included sphere, fixed dof per processor and active processors per node: solve times on an IBM PowerPC cluster
Included sphere, fixed dof per processor and processors per node: efficiency on an IBM PowerPC cluster
Included sphere, fixed dof per processor and processors per node: efficiency on an IBM PowerPC cluster
Fixed dof per processor: end to end times on a Cray T3E
Fixed dof per processor: end to end efficiency on a Cray T3E
Fixed dof per node, fixed processors per node: end to end times on an IBM PowerPC cluster
Fixed dof per node, fixed processors per node: end to end efficiency on an IBM PowerPC cluster
Concentric spheres problem
Concentric spheres, fixed dof per processor: solve times on a Cray T3E
Concentric spheres, fixed dof per processor: efficiency on a Cray T3E
Concentric spheres, fixed dof per processor: flop efficiency on a Cray T3E
Concentric spheres, fixed dof per processor: Mflop/sec efficiency on a Cray T3E
Multigrid iterations per Newton iteration
Histogram of the number of iterations per Newton step in all time steps
End to end times of the nonlinear solve, fixed dof per processor
Uzawa algorithm
Concentric spheres with contact: undeformed and deformed shape
Concentric spheres without contact: undeformed and deformed shape
A  3D FE mesh and deformed shape
A  Concentric spheres problem
A  Cantilever with uniform mesh and end load
A  Truncated hollow cone
A  Cantilevered tube
A  Beam-column
A  Concentric spheres without contact
A  Concentric spheres with contact

List of Tables

Average number of iterations
Solve time in seconds (number of iterations)
Multiple discretizations of a cantilever
Cantilever with soft section
Iterations for the included sphere with soft cover
Iterations for the included sphere with CG smoother variants
Top-down vs. bottom-up performance modeling
Future complexity configuration
Mflop rates for Mflop types in multigrid
Matrix vector product phase times
Flat and graded processor groups, IBM PowerPC cluster
Time for each grid on the Cray T3E, million dof problem
Nonlinear materials
Multigrid preconditioned CG iteration counts for the contact problem
A  Materials for test problems
A  Problem statistics for each of the test problems

Acknowledgements

I would like to express my gratitude to Professors R. L. Taylor and James Demmel for sharing their enthusiasm for, and expertise in, their respective fields, as well as for helping to make my experience as a graduate student a rewarding and enriching one.

I would like to express my gratitude to Professor Taylor for his support and constant enthusiasm for my work, which has made my years in the graduate program of the Civil Engineering department a gratifying experience. I have learned much from Professor Taylor's extensive knowledge, gained in his many years as a researcher and practitioner in the field of Computational Mechanics.

I would like to thank Professor Demmel for his energetic support, which has been invaluable to my experience and education here at Berkeley. The author has benefited greatly from Professor Demmel's expertise, guidance, and introduction to the fields of Scientific Computing, Computer Science, and Applied Mathematics.

I would like to express my gratitude to the following:

- Professor Taylor, for providing and maintaining FEAP, without which this work would have been grossly inferior.

- The PETSc team, Satish Balay, William Gropp, Lois Curfman McInnes, and Barry Smith, for providing a great product, without which this work would have been grossly inferior. I especially want to thank Barry Smith for his tireless support of PETSc, for his concise theory manual "Domain Decomposition", and for informing me (with Tony Chan) of the basic algorithm with which I work and without which I would not have a life.

- George Karypis, for providing a fast, full featured, state-of-the-art mesh partitioner: METIS and ParMetis.

- Jonathan Shewchuk, for providing fast robust geometric predicates.

- Argonne National Laboratory, for the use of their IBM SP, which was invaluable in the early development work for this dissertation.

- Steve Ashby, for his support of this work and for providing me with the opportunity to work for a summer in the intellectually stimulating environment of CASC at LLNL.

- Juliana Hsu, for providing me with test problems from the ALE3D group at LLNL, and the entire ALE3D group and the linear solvers group in CASC, for many interesting discussions during my summer at the lab.

- Lawrence Livermore National Laboratory, for providing access to its computing systems, and the staff of Livermore Computing, for their support services.

- Lawrence Berkeley National Laboratory, for the use of their Cray T3E and their very helpful support staff. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Energy Research of the U.S. Department of Energy.

- Eric Kasper, for providing me with the original FEAP source for one of the test problems.

This research was supported in part by the Department of Energy, under a DOE contract (W-ENG) and DOE grants (DE-FG-ER and DE-FC-ER); the National Science Foundation, under an NSF Cooperative Agreement (ACI), an Infrastructure Grant (CDA), and a grant (ASC); the Defense Advanced Research Projects Agency, under a DARPA contract and grant; gifts from the IBM Shared University Research Program, Sun Microsystems, and Intel; and funds from the California State MICRO program. The information presented here does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Chapter 1

Dissertation summary

Introduction

The finite element method has proven to be a popular spatial discretization technique for the simulation of complex physical systems in many areas of science and engineering. The finite element method commonly provides the spatial discretization of the domain for a partial differential equation (PDE); time discretizations, or time integrators, are also required. Time integrators fall into two main categories: implicit and explicit. Explicit schemes are preferable for problems in which the interval in the time discretization is short relative to the spatial discretization. Implicit methods are more compute intensive than explicit methods; however, they have superior stability properties and are thus preferable when the time step required to capture the behavior of the system is large relative to the spatial discretization. Implicit methods are expensive because they require that a sparse operator be linearized and its inverse (a dense operator) be applied to a vector, rather than only the sparse operator itself; this is much more difficult to compute. The cost of applying the inverse of a finite element operator will dominate the total cost of the finite element method when an implicit time integrator is used on large scale problems; the effective solution of this problem is the subject of this dissertation.

The finite element method

The finite element method finds the optimal solution of a partial differential equation in a user provided function space; that is, it calculates a solution, in the span of a provided set of functions, whose error is orthogonal to this subspace in some inner product. This is often calculated with an orthogonal projection and is the basis of classical methods such as the Galerkin and Rayleigh-Ritz methods. The key aspect of the finite element method that has led to its success is the use of piecewise continuous, low order polynomials, first introduced by Courant in the 1940s and independently by the engineering community in the 1950s. Piecewise continuous polynomials are effective because they may be automatically constructed by meshing a complex domain into simple elements, or polyhedra, and because they have compact support. Finite element basis functions can then be constructed on the elements of this mesh.

The finite element method is in part successful because it allows a very hard problem to be effectively decomposed into separate and well defined disciplines. Broadly speaking, these disciplines are:

- Mesh generation
- Element formulation
- Nonlinear and transient solution
- Linear equation solvers
- Visualization and post processing

This dissertation focuses on the effective construction of linear equation solvers for finite element method matrices in the high performance computational environments of today.

Motivation

The dominant costs of conducting a finite element analysis, once the software has been developed, are the construction of the model (or mesh) and the solution of a sparse set of algebraic equations. The solution of this system of equations is generally the dominant cost in an analysis, as a single problem often requires the solution of hundreds or thousands of equation sets if dynamic or transient analysis is called for. Nonlinear formulations are generally solved with a series of linearized solutions, for instance by a Newton solution scheme at each time step. Additionally, finite element analyses are usually used in design processes or parametric analyses, thus requiring that many finite element simulations be run with a single mesh or model.

When implicit time discretization methods are in use, the cost of the sparse linear system solve will dominate the cost of the finite element simulation. Furthermore, the linear equation solver is one of the hardest parts of a finite element solution to implement efficiently as the problem size and the number of parallel processors increase; therefore scalable linear equation solver technology is of considerable importance to the continued growth in the use of the finite element method.

Direct solution methods, based on Gaussian elimination, are very popular, as they are robust and effective for moderately sized problems. Accurate finite element simulations often require that a fine discretization be used, thus necessitating the solution of large systems of equations. Also, in some cases it may simply be more economical to use large, fast, and ever cheaper computers to solve a large problem where, in the past, highly skilled engineers have had to painstakingly assemble smaller analyses and combine their results with intuition to assess the safety of a structure.

The difficulty with direct methods is that, for typical 3D finite element problems, optimal direct solvers have a time complexity of about O(n^2), where n is the number of degrees of freedom in the system. Direct methods are more applicable to 2D models, as the number of degrees of freedom only increases quadratically with the scale of discretization, as opposed to cubically for 3D models; additionally, the time complexity for 2D problems is lower, as 2D problems tend to have much less fill introduced during the factorization.

The rather small constants in the complexity of direct methods have allowed them to be superior to iterative methods for the size of models that were affordable in the past. But the poor asymptotics of direct methods now require that iterative methods be used, as iterative methods have the potential of O(n) time and space complexity.

Domain decomposition methods represent a framework for describing and analyzing optimal iterative methods for solving the sparse matrices from unstructured discretizations of PDEs. Multilevel domain decomposition methods are theoretically optimal preconditioners. Of the family of multilevel domain decomposition methods, multigrid is the most powerful on structured meshes. Multigrid is also ideally suited for finite element problems on unstructured grids, as it uses many of the same mathematical techniques (e.g., orthogonal projections) as well as many of the same tools in its implementation (e.g., mesh generators). In fact, as we show, the core of multigrid is the recursive application of a variationally induced approximation of the current problem (grid); this is also a description of the finite element method. Thus multigrid is, in a sense, the application of the finite element method to itself.

Many of the domains of interest to finite element method practitioners are inherently ill-conditioned problems, with large ranges in the scale of discretization, complex structures often requiring the use of less than ideally proportioned elements, sharp jumps in material properties, or changes in physics. These difficulties are exacerbated by the use of state-of-the-art finite element formulations, e.g., mixed methods for nearly incompressible problems, plasticity formulations to simulate yielding of structural materials, large (finite) deformation elements, and the use of Lagrange multipliers in applying constraints. Many of these formulations result in a loss of positive definiteness in the equations and/or result in equations that are very poorly conditioned. These finite element formulations often severely degrade the performance of iterative methods; thus the serial performance, as well as the scalability, of the solver is of utmost importance. This dissertation presents effective methods for solving the sparse, ill-conditioned systems of equations that arise from large scale, state-of-the-art finite element simulations.

Goals

The exponential growth in the processing power of computers in the last twenty years is enabling scientists and engineers to conduct ever more accurate simulations of their systems. Along with the pull of the opportunity to conduct large scale simulations, applications are pushing for fast, accurate simulations to design efficient products and to bring them to market quickly, because of increasing global competition in manufacturing. Also, testing products in the laboratory is at best a limited method of product testing and, in the case of the U.S. government, is no longer a politically viable means of insuring the safety of the nation's stockpile of nuclear weapons. Thus the time and expense of laboratory testing is proving to be increasingly less attractive than computational simulation. As the equation solver is the dominant cost in some large finite element simulations, scalable and robust equation solvers are of critical importance.

The primary requirements for the next generation of finite element method equation solvers are:

1. Complete scalability. Any method must be completely scalable and completely parallel. This means that the number of iterations required to reduce the error by a constant fraction must be independent of the mesh size, or that the number of iterations grows polylogarithmically (i.e., polynomially in log n), while the cost of each iteration is polylogarithmic in time. The algorithm must also be feasible to implement efficiently, i.e., it must be easily parallelized. All solvers in this class are multilevel solvers.

2. Robustness. In this context robustness means that the method is effective on the problems that will be common in industrial practice, i.e., unstructured meshes with a wide range of scales of discretization, with multiple material properties, with widely varying material coefficients, and with accurate material constitutions, for all classes of industrial applications. We are concerned with solvers for 3D solid mechanics problems only, although the method that we discuss is applicable to problems in fluid mechanics and other areas of physics as well.

3. Ease of use. We would like to have a black box solver, so as to simplify its use with existing finite element codes. This criterion is secondary, in that requiring more data from the user is certainly more palatable than not being able to solve the problem. Thus we desire a minimal interface with the finite element implementation, requiring only what is easily available in common finite element codes.

Multigrid is the best known method to date that satisfies the first criterion. Multigrid is an optimal solver of Poisson's equation discretized with finite element or finite difference methods, at least sequentially (the 3D FFT is competitive in parallel, although to our knowledge it is not applicable to problems as general as those that we consider). The parallel time complexity of full multigrid is O(log n); although realizing the theoretical complexity on real machines is a challenge, multigrid can scale very well on the large computers of today, as our scalability studies show, and on those of the foreseeable future, as our complexity modeling suggests.

The robustness of multigrid is necessary for our purposes, as large finite element simulations can require a wide variety of finite element formulations in a variety of applications, many of which produce operators that are very demanding of iterative methods.

The multigrid algorithm that we work with offers the well known advantages of multigrid while maintaining a minimal (though not minimum) solver interface. The goal of having a black box solver is theoretically achievable with algebraic multigrid methods, although the only effective algebraic methods that we are aware of require geometric information, as does ours. Our method builds on previous work: it is an automated method for constructing the coarse grids for standard multigrid algorithms, and it forms the coarse grid operators algebraically. This combined geometric/algebraic approach yields a method that has a similar interface with the finite element implementation as effective algebraic methods, but our approach allows for a more extensive use of geometric information, which is quite effective on many classes of problems.

Dissertation outline

We have organized this dissertation so that it can be read in a multilevel fashion; that is, one can read this chapter, the conclusion, and the preamble to each chapter to get a uniform introduction to our work. One can, in addition, read the introduction of each chapter to get a more detailed view of our work. The dissertation is organized as follows:

- This chapter continues with a list of notations and concepts that we use in our work.

- Chapter introduces pertinent background in iterative equation solvers for finite element matrices.
  - Finite element formulations are, in general, not necessary to understand the characteristics of the equation solver, although multigrid is in fact very much related to the finite element method (multigrid can be seen as the recursive application of the finite element method). Thus a rudimentary introduction to the finite element method proves invaluable in understanding the behavior and construction of multigrid methods.
  - Basic iterative methods are introduced, as we use some of them as accelerators, and they lead to domain decomposition methods, one variety of which is multigrid.
  - One level domain decomposition methods are introduced, as we use them within our solver and they serve to introduce many of the basic concepts of projections, which we use extensively.

- Chapter discusses multilevel domain decomposition methods in general and multigrid in particular.
  - We introduce the basic issues of multilevel domain decomposition methods in the vein of describing what has come before multigrid and the alternatives to multigrid.
  - We do not discuss the mathematical details of multigrid in great depth; however, we introduce the current state-of-the-art in analyzing multigrid on unstructured grids. This description is not comprehensive but is intended to introduce some basic concepts, and it proves useful in understanding the performance behavior of multilevel methods.

- Chapter describes the current competitive methods in the field of high performance linear equation solvers for finite element matrices.

- Chapter introduces our multigrid method; we build on earlier work in 2D formulations, extending it to 3D and to parallel computers. This includes our development of a new maximal independent set algorithm for finite element meshes that, under the PRAM computational model, has superior time complexity and is practical; this is an improvement over the previous best algorithm. This chapter also includes a set of heuristics and methods for applying this multigrid algorithm effectively to complex finite element meshes, and it represents the most original direct contribution of the dissertation to linear equation solver algorithms.

- Chapter presents a series of serial numerical studies aimed at elucidating some important characteristics of the behavior of iterative solvers in general, and multigrid in particular, on problems in solid mechanics. This chapter identifies and analyzes the behavior of iterative solvers on problems with particular features such as incompressibility, poorly proportioned elements, complex geometries, and large jumps in material coefficients.

- Chapter discusses parallel computational aspects of the finite element method and of unstructured multigrid solvers. We discuss the structure of our code and some of the algorithmic issues in optimizing the performance of multilevel solvers on today's parallel computers.

- Chapter develops a theoretical framework for modeling multigrid complexity, develops a simple complexity model for the computers of the near future and uses it to analyze multigrid solvers, and describes a more detailed computational model of unstructured multigrid solvers on distributed memory computers.

- Chapter shows numerical results from scalability experiments, for problems in 3D linear elasticity, on an IBM PowerPC cluster and a Cray T3E.

- Chapter extends our linear solver to material nonlinear and large deformation finite element analysis, and shows numerical results for large scale problems on a Cray T3E with good parallel efficiency. We also develop methods for extending our solver to constrained problems with Lagrange multipliers that arise in finite element simulation with contact.

- Chapter concludes with possible directions for future work.

- Appendix A lists our test problems and problem specifications.

- Appendix B lists the machines that we use for our numerical experiments.

Contributions

This dissertation develops a highly scalable, optimal linear equation solver for finite element matrices on unstructured grids. We extend an effective serial 2D multigrid algorithm to 3D, with heuristics that maintain the geometry of a problem on the automatically generated coarse grids and thereby dramatically improve the performance and robustness of our iterative solver. We develop a new parallel maximal independent set algorithm that has superior PRAM complexity for finite element graphs and is very practical as well. We develop algorithms to mitigate the parallel inefficiency of the coarse grids of multigrid solvers on typical computers of today. We have developed a fully parallel finite element implementation, built on an existing serial research finite element code, so as to fully test our code and algorithms. We show performance results for large problems, on a Cray T3E and an IBM PowerPC cluster, in linear elasticity, large deformation elasticity, and plasticity. We also apply our solver, in serial, to contact problems formulated with Lagrange multipliers.

Notation

Finite element meshes provide a geometric description of individual elements; different element types require different mesh types. A brief taxonomy of element types, by the dimension of their degrees of freedom, is as follows:

- 3D continuum elements: meshes are made of polytopes, and the elements are fully 3D.
- 2D manifold elements: these can be divided into two main classes of element types, plates and shells (with bending energy) and membranes (without bending energy).
- 1D elements: these can be divided into two main classes of element types, beams or rods (with bending energy) and trusses (without bending energy).

Finite element analysis can use any and all of these element types in a single analysis. This dissertation, however, only discusses continuum elements, although multigrid methods are applicable to all finite element method formulations with compactly supported basis functions. Additionally, 3D elements are usually either hexahedra or tetrahedra. Our examples use eight node hexahedral (trilinear "brick") elements, but our methods can adapt to other element types.

We work closely with finite element meshes in 3D continuum mechanics, so a few definitions prove useful.

- Regular meshes: the grid points are on a regular lattice.
- Structured meshes: the grid points are on a logically regular grid, though the coordinates may be transformed.
- Unstructured meshes: the grid points are placed arbitrarily.
- Block structured meshes: unstructured meshes of structured blocks.

The meshes of all finite element applications must satisfy certain conditions, namely that any vertex that is in the closure of an element's domain must be one of that element's vertices, and that no two elements may intersect each other. The meshes of primary interest are unstructured, as the primary strength of the finite element method is its ability to accommodate complex domains and boundary conditions.

dof: A degree of freedom in a finite element model is represented by one entry in a vector and produces one equation in a finite element matrix. The total number of degrees of freedom is denoted by n.

Vectors and functions: We use bold face u for functions and plain text u for vectors. A vector u is a list of weights by which to scale a set of basis functions \phi_i, i = 1, \ldots, m, that add together to form an arbitrary function in their span; thus

    u = \sum_{i=1}^{m} u_i \phi_i .

Residual: The residual of a system of equations Ax = b, with a given solution vector x, is defined as r = b - Ax. Note that we use the two norm of the residual, i.e., \sqrt{r^T r}, throughout this dissertation and will generally simply call it the residual.

Chapter 2

Mathematical preliminaries

In this chapter we present background material useful in understanding our work.

Introduction

We begin with a brief introduction to the finite element method, highlighting the components similar to those in multigrid. Understanding the mathematical structure of the finite element method is useful in describing the nature of the matrices that we solve, but more importantly it proves useful in understanding multigrid equation solvers. This is because the mathematical structures of the finite element method and of multigrid are quite similar, even though finite elements and multigrid are solutions to two entirely different problems (i.e., multigrid is a method of implementing the application of the inverse of a finite element operator). In fact, as we show, the core of multigrid is the recursive application of a variationally induced approximation of the current problem (grid); this is also a description of the finite element method.

Linear equation solver fundamentals are introduced, as the significance of our work can only be appreciated in the context of the alternatives; additionally, multigrid uses many standard linear equation solvers in its construction. Modern domain decomposition theory is introduced, because multigrid can be analyzed as a particular domain decomposition method and because modern domain decomposition analysis provides the strongest analytical methods for multigrid on unstructured meshes. More importantly, modern domain decomposition theory can provide invaluable insight into the nature of multigrid solvers, and we thus provide a brief overview of this theory with the intent of touching on its more salient features.

The finite element method

The problem of simulating a physical system can, in general, be reduced to that of finding the solution u, for a linearized PDE operator L and applied forcing function Q, of a problem with the general form

    L(u, t) = Q   in \Omega,
    u = u_d       on \Gamma_d,
    B(u) = u_n    on \Gamma_n .

Here \Gamma_d is the portion of the boundary of the domain \Omega on which the displacements, or primal variable, are specified (i.e., Dirichlet boundary conditions), and \Gamma_n denotes the portion of the boundary with natural (i.e., Neumann) boundary conditions, on which the dual variable (e.g., force) is specified.

Complex physical phenomena are modeled with complex PDEs, which may include domains of entirely different physical behavior, as with fluid-structure interaction problems, or may have multiple coupled physical fields, as with thermal-mechanical systems. Thus the actual simulation may be a composition of multiple domains, and these PDEs may be very complex. Additionally, accurate simulations are rarely linear, though often they are linearized for use in a nonlinear solution method (e.g., Newton's method). Thus a linear system may be but one component of a fully nonlinear solution, but the core solution procedure is the solution of a problem that can be represented symbolically by the equations above.

The finite element method is a means of formulating, or spatially discretizing, partial differential equations so that they may be solved numerically. The finite element method provides a sound method of finding the optimal solution within a given subspace of the space to which the true solution belongs. The finite element method commonly utilizes a Galerkin condition as its optimality criterion; a Galerkin condition can be stated as: find u \in S = span\{\phi_i\}, i = 1, \ldots, n, such that L(u) - Q \perp S, or equivalently

    \langle L(u) - Q, v \rangle = 0 \quad \forall v \in S, \qquad \text{where } \langle a, b \rangle = \int_\Omega a \, b .

The success of the finite element method is due in part to its ability to apply the Galerkin condition to arbitrarily complex domains and boundary conditions. The strength of finite elements comes from mappings between arbitrary polyhedra and regular polyhedra (e.g., the unit square), where the mathematics is tractable, as well as from the availability of computational methods for automatically constructing good polyhedra on complex physical domains. With mesh generation techniques, these physically intuitive meshes can be used to construct piecewise continuous polynomials for the set of basis functions S.

Thus the finite element method's success derives from its ability to construct accurate subspaces via unstructured meshes, and from the ability to compute projections onto finite element spaces relatively inexpensively, as the linearized finite element operator is sparse in commonly used finite element discretizations. These projections are computed with a linear equation solver; the effective construction of these solvers for large finite element problems is the subject of this dissertation.

The process of applying the finite element method to the simulation of a physical system can thus be summarized as follows:

1. Develop a theory to model the physics (i.e., the strong form of the PDE).
2. Formulate a weak form of the PDE that can be used in the finite element method.
3. Discretize the domain of interest with a finite element mesh.
4. Pick a set of basis functions, for each element, in which to find the approximate solution.
5. Formulate a time integrator for transient PDEs.
6. Develop a nonlinear solution strategy for the discrete time form.
7. For each time step, and for each iteration step in the nonlinear solution method, solve a system of linear equations for the parameters of the finite element basis functions.
8. For each time step, substitute the discrete answer into the basis functions to find the answer (and often derivatives of the answer as well) where required, and transform this data into a readable form.

Finite element example: Linear isotropic heat equation

As an example, the strong form of the heat equation can be written as

    \sum_{i=1}^{D} \frac{\partial}{\partial x_i} \left( k \frac{\partial T}{\partial x_i} \right) + Q = \rho c \frac{\partial T}{\partial t} \quad \text{in } \Omega,

with k the thermal conductivity, \rho the mass density, and c the specific heat of the material; or simply

    \nabla \cdot ( k \nabla T ) + Q = \rho c \frac{\partial T}{\partial t} \quad \text{in } \Omega,

    T = g_D \quad \text{on } \Gamma_D, \qquad k \frac{\partial T}{\partial n} = g_N \quad \text{on } \Gamma_N,

and

    \Gamma_D \cup \Gamma_N = \partial\Omega, \quad \Gamma_D \cap \Gamma_N = \emptyset, \qquad T = T_0 \ \text{at } t = 0 .

For simplicity we only discuss the spatial discretization and look for a steady state solution, \partial T / \partial t = 0. Also assume homogeneous Dirichlet boundary conditions (i.e., T = 0 on \partial\Omega) and that the boundary is smooth enough that the solution is in L^2(\Omega), i.e., \int_\Omega T^2 < \infty; then the homogeneous Dirichlet problem can be stated as Poisson's problem:

    L(T) = Q: \quad -\nabla \cdot ( k \nabla T ) = Q \ \text{in } \Omega, \qquad T = 0 \ \text{on } \partial\Omega .

If we multiply by an arbitrary function v which satisfies the boundary conditions, and integrate over the domain, then this can be equivalently stated as

    \int_\Omega v \left( \nabla \cdot ( k \nabla T ) + Q \right) = 0 \quad \forall v \in H_0^1(\Omega),

where H_0^1 is the set of functions that are zero on \partial\Omega and whose first derivatives are in L^2. As u and v are in Sobolev spaces (H^1), we can define an inner product

    ( u, v ) = \int_\Omega u \, v

and also a bilinear form

    a( v, u ) = \int_\Omega \nabla v \cdot ( k \nabla u ),

so that the problem can be transformed, with integration by parts, into the weak form of the differential equation:

    a( v, T ) = ( v, Q ) \quad \forall v \in H_0^1 .

To formulate a numerical solution for a problem with the finite element method we first need a test space \mathcal{T} for v,

    \mathcal{T} = span\{ \psi_1, \ldots, \psi_m \},

and a solution space S for u,

    S = span\{ \phi_1, \ldots, \phi_m \} .

We can express a function u \in S as u = \sum_{i=1}^{m} u_i \phi_i and v \in \mathcal{T} as v = \sum_{j=1}^{m} v_j \psi_j. In practice \mathcal{T} and S are often the same space, resulting in a Bubnov-Galerkin method; otherwise the method is known as a Petrov-Galerkin method.

A finite element approximation to T can be constructed by replacing u and v in the weak form with these expansions. If \mathcal{T} = S, then we have n equations, one for each test function, in n unknowns, of the form

    \sum_{j=1}^{n} a( \phi_i, \phi_j ) \, u_j \, v_i = ( \phi_i, Q ) \, v_i .

Remember that v is arbitrary (as long as it satisfies the boundary conditions), and so v_i is arbitrary; thus this equation may be written as

    \sum_{j=1}^{n} a( \phi_i, \phi_j ) \, u_j = ( \phi_i, Q ) .

This is a set of n linear equations in n unknowns, Ax = b, where

    A_{ij} = a( \phi_i, \phi_j ) = \int_\Omega \nabla\phi_i \cdot ( k \nabla\phi_j ), \qquad b_i = ( \phi_i, Q ) = \int_\Omega \phi_i \, Q,

and x = ( u_1, \ldots, u_n )^T is the discrete vector that we solve for. These inner products are implemented with numerical integration, in which the operator, in Cartesian coordinates, is evaluated at carefully selected Gauss integration points; the weighted sum of these values, taken over each element, provides an accurate answer for the polynomial shape functions used in finite element analysis.

Most problems of interest are nonlinear; thus A is a function of x. The weak form must then be linearized and a nonlinear solution strategy formulated (e.g., Newton's method).

What remains to be done is to solve this sparse set of algebraic equations for the vector x; the answer u = \sum_{i=1}^{n} x_i \phi_i can then be constructed, and the solution, or its derivative, can be generated at any point in the domain. Our next step is to introduce the foundations of the method that we employ to solve these equations.

Iterative equation solver basics

This section introduces iterative equation solvers so as to motivate our research into multigrid. Iterative equation solvers rely on the application of a sparse operator. Direct solvers, based on Gaussian elimination, first factor an n by n matrix A into an upper and a lower triangular matrix, i.e., find L and U such that A = LU.

The factorization has a time complexity of about O(n^2) for typical 3D sparse finite element matrices, although the exact cost depends on the precise structure of the matrix and the ordering of the equations. Finite element matrix factorizations have a space complexity (i.e., memory requirement) of about O(n^{4/3}) in 3D. Iterative methods have a space complexity of O(n), but their time complexity is method dependent; the minimization of the time complexity of iterative equation solvers is the primary goal in the design of an iterative method.

This section gives a brief description of the classical iterative methods, beginning with the simplest and least effective and culminating with what we consider the most effective: multigrid. This introduction is useful not only for providing a context for multigrid, but also because multigrid actually uses many of the classical iterative methods as components. In fact, multigrid is really nothing more than an intelligent marshaling of iterative and direct methods, allowing these methods, operating at differing scales of resolution, to do what they do best; this becomes clear in subsequent chapters.

Matrix splitting methods

Some of the oldest iterative methods can be described by considering a matrix splitting; that is, given a matrix A, define a splitting A = M - K. Substituting this expression into the system to be solved, Ax = b, we get the equivalent expression x = M^{-1}(Kx + b). This is not very useful in and of itself; however, it does suggest an iterative method: set k = 0 and, while \| b - A x_k \| > tol, do

    x_{k+1} = M^{-1} ( K x_k + b ), \qquad k = k + 1 .

The idea is now to pick M and K such that the cost of applying the inverse of M is inexpensive relative to the reduction in the error in each iteration.

Jacobi's method is a simple iterative method that visits each equation i and sets

    x_i^{k+1} = \frac{1}{A_{ii}} \left( b_i - \sum_{j \neq i} A_{ij} x_j^{k} \right),

where x_i^{k} is the i-th component of x at iteration k. Jacobi's method is a matrix splitting method in which M, call it M_J, is the diagonal D of A. Other commonly used matrix splitting methods are Gauss-Seidel and successive overrelaxation (SOR). Gauss-Seidel is a simple and natural improvement (usually) of Jacobi in which the most recent data for x is used, instead of only the values of x from the previous iteration; in the splitting A = D - L - U, with L and U the strictly lower and upper triangular parts, M_{GS} = D - L, i.e., the diagonal and the lower triangular part of A. SOR follows the intuition that, if a correction to an approximate solution is an improvement, then we can magnify this correction to get more improvement to the solution. Thus, with a user provided parameter \omega (typically 1 < \omega < 2),

    M_{SOR} = \frac{1}{\omega} D - L, \qquad K_{SOR} = \frac{1 - \omega}{\omega} D + U .
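The following minimal sketch (an illustration, not code from the dissertation) implements the generic splitting iteration x_{k+1} = x_k + M^{-1}(b - A x_k), which is algebraically the same as x_{k+1} = M^{-1}(K x_k + b), and instantiates it with the Jacobi and Gauss-Seidel choices of M. The diagonally dominant test matrix and the tolerance are assumptions for the example.

    import numpy as np

    def splitting_solve(A, b, M, tol=1e-8, max_iters=500):
        """Generic matrix splitting iteration for A = M - K:
        x_{k+1} = M^{-1} (K x_k + b) = x_k + M^{-1} (b - A x_k)."""
        x = np.zeros_like(b)
        for k in range(max_iters):
            r = b - A @ x                       # current residual
            if np.linalg.norm(r) < tol:
                break
            x = x + np.linalg.solve(M, r)       # apply M^{-1} (small dense example)
        return x, k

    # small, diagonally dominant SPD test matrix and right hand side
    n = 50
    A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)

    M_jacobi = np.diag(np.diag(A))              # M_J: the diagonal of A
    M_gs = np.tril(A)                           # M_GS = D - L: diagonal plus lower triangle
    for name, M in [("Jacobi", M_jacobi), ("Gauss-Seidel", M_gs)]:
        x, k = splitting_solve(A, b, M)
        print(name, "iterations:", k, "residual:", np.linalg.norm(b - A @ x))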

Krylov subspace methods

Krylov subspace methods are an elegant means of designing iterative solvers and have proven to be quite valuable in practice. A Krylov subspace is defined, for a given matrix A and vector b, by

    K_k(A, b) = span\{ b, Ab, A^2 b, \ldots, A^{k-1} b \} .

One advantage of Krylov subspace methods is that the representation of the operator can be entirely abstracted from the iterative method; that is, the iterative method need only be able to apply the operator and need not directly access the data used to represent it. This is a positive attribute for an iterative method to possess, but it is often not of paramount importance, as Krylov subspace methods invariably require preconditioning, which is often some type of incomplete factorization, which in turn requires that parts of the actual matrix be accessed. Regardless, Krylov subspace methods have proven to be very useful in the construction of iterative solvers; we introduce them here as we routinely use them in our solvers.

The first Krylov subspace methods were developed in the 1950s, though most of the research activity in these methods has taken place since the 1970s. Over the past decades, Krylov subspace methods have been designed for many types of matrices; this is necessary, as these methods can take advantage of a priori knowledge of the operator (i.e., symmetry, positive definiteness, semidefiniteness, etc.) to provide better methods for operators about whose spectra something is known. A common property of finite element matrices that Krylov subspace methods can take great advantage of is symmetric positive definiteness (SPD). Conjugate gradients (CG), the first Krylov subspace method to be developed, is an effective method that requires that the operator be SPD. We describe and derive CG following a standard presentation.

Consider the functional

    \phi(v) = \frac{1}{2} v^T A v - v^T b .

Notice that minimizing \phi(v) yields v = A^{-1} b = x; that is, the minimization of this functional is equivalent to solving Ax = b for x if A is SPD. This fact can be deduced by taking the Frechet derivative of \phi,

    \frac{d}{d\epsilon} \, \phi(x + \epsilon \sigma) \Big|_{\epsilon = 0} = \sigma^T ( A x - b ),

where \sigma is an arbitrary perturbation vector and \epsilon is a scalar. Setting this to zero, we find that, if A is symmetric, x = A^{-1} b is required to solve the resulting equation

    \sigma^T ( A x - b ) = 0,

as \sigma is arbitrary. Taking the second Frechet derivative simply gives us A; thus, if A is positive definite, we are indeed minimizing the functional.

This construction gives us a heuristic to iteratively solve for x in Ax = b (actually this is not a heuristic, but for the time being we will consider it as such). With this construction we can pick an optimal scalar \alpha to improve the current solution x_k with a vector p, by finding the \alpha that minimizes \phi(x_k + \alpha p). In so doing we find that we should pick \alpha as

    \alpha = \frac{p^T r_k}{p^T A p},

where r_k = b - A x_k, if A is symmetric. This is all well and good, but how can we pick p? A simple choice is to let p = r_k; this is the method of steepest descent, and it does not optimize the solution very well.

Since we are only applying the operator, and are only given the vector b, we also know that x_k \in K_k(A, b). For general search directions we would like to be able to pick p so that the new current solution is optimal in some well defined sense. That is, we would like our solution x_k \in K_k(A, b) to minimize the residual in some norm; we cannot minimize the error e_k = x - x_k, as we do not know the solution, but we can calculate the residual

    r_k = A e_k = b - A x_k .

An obvious choice is to minimize the two norm of the residual, \sqrt{r_k^T r_k}, which, when A is symmetric, is equivalent to requiring that r_k \perp K_k(A, b), that is, the Galerkin condition. With this choice we get the generalized minimum residual method (GMRES). Conjugate gradients chooses p so that the k-th iterate x_k minimizes the residual in the A^{-1} norm,

    \| x \|_{A^{-1}} = \sqrt{ x^T A^{-1} x } .

We now need to deduce the conjugate gradient choice of p in each iteration. To do this, express x_k as a linear combination of vectors p_1, p_2, \ldots, p_k, which span K_k(A, b). In matrix notation,

    x_k = P_k y,

where y is a vector of scalar weights of all of the search vectors in the columns of P_k. Split this expression into two parts,

    x_k = P_{k-1} y + \alpha p_k,

and minimize \phi to find x_k, that is,

    \min_{y, \alpha} \ \phi( P_{k-1} y + \alpha p_k );

after some manipulation we get

    \min_{y, \alpha} \left[ \phi( P_{k-1} y ) + \frac{\alpha^2}{2} p_k^T A p_k - \alpha \, p_k^T b + \alpha \, y^T P_{k-1}^T A p_k \right] .

The last term, the cross term, is problematic, as without it the minimization decouples into two simple minimizations, as we show. The solution to minimizing the cross term is to make it zero; this is accomplished by requiring that all of our search vectors be A conjugate, that is, p_i^T A p_j = 0 if i \neq j. Thus we have a specification for our search directions.

With the cross term eliminated, the problem reduces to two minimizations:

    \min_{y} \ \phi( P_{k-1} y ) \qquad \text{and} \qquad \min_{\alpha} \left( \frac{\alpha^2}{2} p_k^T A p_k - \alpha \, p_k^T b \right) .

The first part, \min_y \phi( P_{k-1} y ), can be assumed to have already been done, as we are applying the algorithm recursively (the base case of the recursion, P_0 y = 0, is clearly minimized). The second part of the minimization, after being given x_{k-1} from the first part, is solved by \alpha_k = p_k^T b / p_k^T A p_k. So we have an expression for \alpha in each iteration and a specification for p_k, namely

    p_k^T A p_j = 0, \qquad j = 1, \ldots, k-1 .

Amazingly, if we apply a standard Gram-Schmidt technique to calculate p_k, by A-orthogonalizing A x_{k-1} against p_1, \ldots, p_{k-1}, we find that we only need the first two terms of the recurrence (see the references for a complete discussion of this material), and we only need one matrix vector product per iteration.

    x_0 = 0,  r_0 = b,  p_1 = b,  k = 1                     {solve Ax = b for x}
    while \| r_{k-1} \|_2 > tol
        z = A p_k
        \alpha_k = ( r_{k-1}^T r_{k-1} ) / ( p_k^T z )
        x_k = x_{k-1} + \alpha_k p_k
        r_k = r_{k-1} - \alpha_k z
        \beta_{k+1} = ( r_k^T r_k ) / ( r_{k-1}^T r_{k-1} )
        p_{k+1} = r_k + \beta_{k+1} p_k
        k = k + 1

Figure: Conjugate Gradient Algorithm
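For concreteness, the following is a minimal, runnable rendering of the algorithm in the figure above in Python with NumPy. It is an illustrative sketch only (the dissertation's solvers are built on PETSc and FEAP); the SPD test matrix and the tolerance are assumptions for the example.

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-10, max_iters=1000):
        """Unpreconditioned CG for SPD A, following the figure above."""
        x = np.zeros_like(b)
        r = b.copy()                 # r_0 = b - A x_0 with x_0 = 0
        p = b.copy()                 # p_1 = b
        rho = r @ r
        for k in range(max_iters):
            if np.sqrt(rho) < tol:
                break
            z = A @ p                # one matrix-vector product per iteration
            alpha = rho / (p @ z)
            x += alpha * p
            r -= alpha * z
            rho_new = r @ r
            p = r + (rho_new / rho) * p
            rho = rho_new
        return x, k

    # SPD test problem: a 1D Laplacian
    n = 100
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x, iters = conjugate_gradient(A, b)
    print(iters, np.linalg.norm(b - A @ x))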

Thus the beauty of CG is that this orthogonality can be maintained, in exact arithmetic, with a short vector recurrence, unlike GMRES, which requires that all of the old search vectors be stored so that each new one may be orthogonalized against them.

We can state bounds for the convergence rate of CG as

    \frac{ \| r_k \|_{A^{-1}} }{ \| r_0 \|_{A^{-1}} } \le 2 \left( \frac{ \sqrt{\kappa(A)} - 1 }{ \sqrt{\kappa(A)} + 1 } \right)^k,

or, equivalently,

    \| x - x_k \|_A \le 2 \left( \frac{ \sqrt{\kappa(A)} - 1 }{ \sqrt{\kappa(A)} + 1 } \right)^k \| x - x_0 \|_A .

See the references for the details of these derivations. The important point to see from this is that the convergence rate of CG depends on the condition number \kappa(A) of A. CG thus provides a solver for our model Poisson problem that is better than Jacobi and Gauss-Seidel, and as good as SOR without the need to pick a parameter, but it is not good enough for use on very large problems.

Krylov subspace methods require preconditioning in order to be effective; the subject of this dissertation is such a preconditioner. Just about any solver can be used as a preconditioner for a Krylov subspace method. Many of the classical matrix splitting methods introduced above, and their generalizations, are used as preconditioners for the Krylov subspace smoothers in our numerical experiments throughout this dissertation. Thus preconditioned Krylov subspace methods are central to the subject of this dissertation; having looked at the convergence properties of CG, we now turn to preconditioning.

In addition to CG for SPD matrices, there are many other Krylov subspace methods for the standard matrix classes (i.e., indefinite, semidefinite, symmetric, and unsymmetric); see the references for a full discussion of these methods.

Preconditioned Krylov subspace methods

In the last section we found that the convergence rate of CG, and of all other Krylov subspace methods, depends on the condition number of the matrix. A natural way to try to improve the convergence rate is to transform the system to an easier one, solve it, and then go back to the original system; that is, instead of solving Ax = b we want to find an M and solve M^{-1} A x = M^{-1} b, or, in symmetric form,

    ( M^{-1/2} A M^{-1/2} ) ( M^{1/2} x ) = M^{-1/2} b .

One can substitute this transformed system into the algorithm of the previous figure; after some algebraic manipulation one can eliminate the M^{-1/2} and M^{1/2} terms and obtain the algorithm in the next figure.

The objective now resembles that of the matrix splitting methods: find an M whose inverse is cheap to apply and that is relatively effective at reducing the condition number of the operator M^{-1} A. We can look at the Krylov subspace method as an accelerator for an iterative solver. We have observed, in informal numerical experiments, that the use of a Krylov subspace method accelerator is almost always economically advantageous in terms of total execution time for the solve; we therefore only consider this architecture in our numerical experiments, as there are many other interesting parameters to investigate. So we are still left with the question of finding a good preconditioner for a Krylov subspace method.

    x_0 <- 0,  r_0 <- b,  k <- 0                      (solve Ax = b for x)
    while ||r_k|| >= tol
        k <- k + 1
        z <- M^{-1} r_{k-1}
        rho_k <- r_{k-1}^T z
        if k = 1
            p_1 <- z
        else
            p_k <- z + (rho_k / rho_{k-1}) p_{k-1}
        z <- A p_k
        nu_k <- rho_k / (p_k^T z)
        x_k <- x_{k-1} + nu_k p_k
        r_k <- r_{k-1} - nu_k z

    Figure: Preconditioned Conjugate Gradient Algorithm
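The following NumPy sketch mirrors the figure above with a simple diagonal (Jacobi) preconditioner standing in for M; the preconditioner choice and the test problem are illustrative assumptions of ours, not a prescription from the text.

    import numpy as np

    def pcg(A, b, apply_Minv, tol=1e-8, max_it=1000):
        # Preconditioned CG; apply_Minv(r) returns M^{-1} r.
        x = np.zeros_like(b)
        r = b.copy()
        p = None
        rho_old = 1.0
        for k in range(max_it):
            if np.linalg.norm(r) < tol:
                break
            z = apply_Minv(r)
            rho = r @ z
            p = z.copy() if p is None else z + (rho / rho_old) * p
            q = A @ p
            nu = rho / (p @ q)
            x += nu * p
            r -= nu * q
            rho_old = rho
        return x

    n = 50
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    d = np.diag(A)
    x = pcg(A, np.ones(n), lambda r: r / d)    # Jacobi preconditioner M = diag(A)
    print(np.linalg.norm(A @ x - np.ones(n)))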

Krylov subspace metho ds as pro jections

We can alternatively look at Krylov subspace methods as a projection of the problem onto a subspace. We often work with projections in the analysis of domain decomposition methods, but for now we can consider projections as an approximation u_K, in a low dimensional space K, to a function u in a high dimensional space V. If K is a closed subspace of V, V is complete, and our finite element operator is V-elliptic, then we can define a unique approximation whose error is orthogonal to a constraint space L, after we equip V with an inner product. Thus we can use the Galerkin condition from § and insist that the approximate solution satisfies (u - u_K)^T v = 0 for all v in L, or, in the case of CG, require that u_K minimizes the energy functional (1/2) v^T L v - v^T b over K.

In the case of symmetric systems it is natural to let L = K, and for the positive definite case this can be accomplished with little cost. We observe later that the cost of finding a projection is a linear solve of the order of the size of the subspace. The reason for the efficiency of CG is that this linear system is tridiagonal and hence very cheap to solve (see the references for details).

One level domain decomp osition

Domain decomp osition metho ds are a p opular approach to construct iterative

solvers esp ecially in multiprocessor environments Domain decomp osition of the physical

domain is a natural method to consider in a parallel computing environment, because the finite element mesh invariably needs to be partitioned. Thus it is reasonable to use this structure in the solver so as to exploit data locality.

Domain decomp osition metho ds have b een used by engineers for decades to build

direct solvers and iterative solvers alike Direct domain decomp osition solvers are known as

nested dissection no de ordering or substructuring iterative solvers are known as iterative

substructuring or Schur complement metho ds

Domain decomp osition is imp ortant in the history of solvers for nite element

matrices but is also p ertinent to this dissertation for two reasons First one of the simplest

domain decomposition methods is a generalization of the Jacobi method introduced in §: namely, use diagonal blocks of the matrix as the preconditioning matrix and use a direct solver on each of these blocks; we often use block Jacobi solvers as a component in our solver.

The second reason for interest in domain decomp osition metho ds is that during the past

ten years a p owerful metho d has b een developed in the domain decomp osition community

for the analysis of a wide range of iterative solvers including multigrid metho ds on

unstructured meshes See for an introduction to these metho ds and the references

therein for the literature in this area

This chapter introduces the background and notation useful in the analysis of

our and most iterative solvers for discretized PDEs additionally the central concepts

of restrictions and interpolation are introduced here as they are used extensively in the

description of our metho ds

Alternating Schwarz method

This section introduces the classical domain decomposition methods for solving PDEs, starting with Schwarz's method from the 19th century up to the modern numerical methods. This section is thus intended to provide a basis for the discussion of the analytical domain decomposition techniques by providing simple, concrete, and intuitive examples of domain decomposition methods. The purpose of this section is to provide historical background for many of the components of our solver, as well as to introduce some of the basic concepts that we use throughout this dissertation. This discussion follows the presentation in the references.

The earliest domain decomposition method was introduced by Schwarz in 1870 to solve for the continuous solution of a PDE. Schwarz's method was not intended as a numerical method, but as a way of solving elliptic PDEs on domains composed of the union of simple domains for which explicit solutions were available. Thus this section will work with continuous functions and linear operators, and not their discretized counterparts (vectors and matrices).

An example of the classical alternating Schwarz method proceeds as follows. Given a domain Ω, shown in the figure below, we wish to solve the elliptic PDE

    L u = f   in Ω
    u = g     on ∂Ω.

For instance, L could be the Laplacian operator ∇², and the boundary conditions could be of either Dirichlet or Neumann type, though here we consider only Dirichlet boundary conditions for simplicity.

Figure: Schwarz's original figure (overlapping subdomains Ω_1 and Ω_2 with artificial boundaries Γ_1 and Γ_2)

∂Ω_i is the boundary of Ω_i, and the closure of Ω_i is denoted by Ω̄_i. The artificial boundary Γ_i of subdomain Ω_i is defined as the part of ∂Ω_i within Ω. We often have use for the boundary ∂Ω_i \ Γ_i, that is, the true boundary (if any) of the subdomain Ω_i. We define u_i^k as the current solution at iteration k on domain Ω_i. Also, u_j^k |_{Γ_i} is defined as the restriction of u_j^k to Γ_i, that is, the values of u_j^k that are on Γ_i. Note that this type of restriction is simply a selection of certain values, or an embedding.

The classical alternating Schwarz method can be stated as: select an initial guess for u^0, then iteratively, for k = 1, 2, ..., solve the boundary value problem

    L u_1^k = f                  in Ω_1
    u_1^k = g                    on ∂Ω_1 \ Γ_1
    u_1^k = u_2^{k-1} |_{Γ_1}    on Γ_1,

then solve the following for u_2^k:

    L u_2^k = f                  in Ω_2
    u_2^k = g                    on ∂Ω_2 \ Γ_2
    u_2^k = u_1^k |_{Γ_2}        on Γ_2,

and continue until the solution has converged.

In effect this method solves a small subdomain problem with boundary conditions augmented by the restriction of the current solutions on the other subdomains. Notice that this construction is similar to the Gauss-Seidel iteration of §, though instead of solving for just one equation at each substep in the outer iteration we solve a subdomain boundary value problem. Also notice that if in the second system we substitute u_1^{k-1} for u_1^k, we get a Jacobi-like iteration. Though this is a continuous method and is not explicitly used in our work, it does provide intuition as to how domain decomposition methods work.

Multiplicative and additive Schwarz

This section describes a discrete version of the Schwarz method of the last section

and the additive variant The next chapter extends these metho ds to include a coarse grid

correction used in multilevel metho ds

The figure below shows a mesh with two subdomains, Ω_1 and Ω_2.

Figure: Two Subdomains with Matching Grids

For simplicity, assume one node has a Dirichlet boundary condition, with the nonhomogeneous term moved to the right hand side; also order the nodes with the interior nodes of the first subdomain first, then the first subdomain's boundary nodes, followed by the nodes common to both subdomains, and so on. The vector of unknowns then looks like

    u = ( u_1, u_2, ..., u_n ),

ordered as described above. The figure below shows the resulting graph of the matrix with this node ordering; from now on we only consider the closures of the domains and thus drop the bar notation.

Figure: Matrix Graph of the Two Subdomain Problem

At this point it is useful to introduce some notation that we use extensively: the restriction operator. The figure below shows two overlapping submatrices A_1 and A_2. We want to have an algebraic expression to define these submatrices, as in general they are not so simple (i.e., contiguous). We can construct A_1 by A_1 = R_1 A R_1^T; here R_1 is an |A_1| x n boolean matrix with the very simple form [ I  0 ].

Figure: Matrix for the Two Subdomain Problem (nz = 70)

Thus, if we had chosen a different ordering of the matrix A, then R_1 would be a permuted identity matrix with zero columns inserted for the nodes that are not in Ω_1. These boolean (or embedding) operators are sufficient to describe these simple one level domain decomposition methods; multilevel methods require that non-boolean restriction operators be used, but they retain this basic structure.
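A minimal sketch of these boolean restriction operators, using SciPy sparse matrices; the index sets and matrix sizes below are illustrative assumptions of ours.

    import numpy as np
    import scipy.sparse as sp

    def boolean_restriction(subdomain_nodes, n):
        # R_i: an |Omega_i| x n boolean matrix that selects the unknowns of subdomain i.
        cols = np.asarray(list(subdomain_nodes))
        m = len(cols)
        return sp.csr_matrix((np.ones(m), (np.arange(m), cols)), shape=(m, n))

    n = 10
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # 1D Poisson stand-in
    R1 = boolean_restriction(range(0, 6), n)                           # Omega_1: nodes 0..5
    R2 = boolean_restriction(range(4, 10), n)                          # Omega_2: nodes 4..9 (overlapping)
    A1 = R1 @ A @ R1.T                                                 # A_1 = R_1 A R_1^T
    A2 = R2 @ A @ R2.T
    print(A1.toarray())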

We can state some simple iterative solvers based on these decompositions. The figure below shows the multiplicative Schwarz algorithm to solve Ax = b for x.

    k <- 0,  x_0 <- 0
    while ||b - A x_k|| >= tol
        x_{k+1/2} <- x_k + R_1^T (R_1 A R_1^T)^{-1} R_1 (b - A x_k)
        x_{k+1}   <- x_{k+1/2} + R_2^T (R_2 A R_2^T)^{-1} R_2 (b - A x_{k+1/2})
        k <- k + 1

    Figure: Multiplicative Schwarz with Two Subdomains

The additive Schwarz algorithm is shown in the figure below.

    k <- 0,  x_0 <- 0
    while ||b - A x_k|| >= tol
        r_k <- b - A x_k
        x_{k+1} <- x_k + R_1^T (R_1 A R_1^T)^{-1} R_1 r_k + R_2^T (R_2 A R_2^T)^{-1} R_2 r_k
        k <- k + 1

    Figure: Additive Schwarz with Two Subdomains

The additive form of the alternating Schwarz metho d is a bit cheaper as a new

residual is not formed at each sub domain The additive form is also more parallelizable as

the solves on each sub domain are indep endent of each other only the up date of the solution

vector need b e synchronized However multiplicative forms converge faster than additive metho ds
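The two iterations in the figures above can be written in a few lines of NumPy. This is a sketch under our own illustrative choices: a 1D Poisson test matrix, two overlapping index sets, and a damping factor of 1/2 on the additive update, since the undamped additive form is normally used as a preconditioner for a Krylov method rather than as a stationary iteration.

    import numpy as np

    def subdomain_correction(A, r, idx):
        # R^T (R A R^T)^{-1} R r for the boolean restriction onto the index set idx.
        d = np.zeros_like(r)
        d[idx] = np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
        return d

    def schwarz(A, b, subdomains, additive=True, iters=30, theta=0.5):
        x = np.zeros_like(b)
        for _ in range(iters):
            if additive:                           # one residual, all corrections added (damped)
                r = b - A @ x
                x = x + theta * sum(subdomain_correction(A, r, idx) for idx in subdomains)
            else:                                  # multiplicative: fresh residual per subdomain
                for idx in subdomains:
                    x = x + subdomain_correction(A, b - A @ x, idx)
        return x

    n = 20
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    subs = [np.arange(0, 12), np.arange(8, 20)]    # two overlapping subdomains
    for additive in (True, False):
        x = schwarz(A, b, subs, additive=additive)
        print('additive' if additive else 'multiplicative', np.linalg.norm(b - A @ x))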

Chapter

Multilevel domain decomposition

This chapter extends the discussion of the mathematical and algorithmic under

pinnings of domain decomp osition and multigrid metho ds from the previous chapter The

one level metho ds discussed in the previous chapter are not eective enough in and of

themselves The shortcomings of one level metho ds can b e overcome by the use of multiple

levels or multiple scales of resolution of the problem

Introduction

All optimal methods have some multilevel component; the need for multiple levels can be understood in many ways. For one, take the linear discretization of the Poisson operator on a regular mesh and a typical row for vertices with a Neumann boundary condition; notice that such rows add up to zero. Now the update (or correction) d_i for domain i, with A_i = R_i A R_i^T, in the Schwarz method, with the error e^n = x - x^n, can be written as

    d_i = R_i^T A_i^{-1} R_i (b - A x^n) = R_i^T A_i^{-1} R_i A (x - x^n) = R_i^T A_i^{-1} R_i A e^n.

If the error e^n is constant in Ω_i, then the Schwarz updates are zero on interior (or floating) subdomains, those without any Dirichlet boundary condition, as R_i A e^n = 0. Therefore the subdomain corrections do not correct the constant part of the error. If the error projected onto a subdomain is dominated by the constant term, then the block Jacobi method will not be effective.

Intuitively we can see that if the lo cal part of the global error for a sub domain

is almost constant then the global error must b e smo oth These simple one level solver

metho ds are called smo others in multigrid terminology b ecause they are eective at re

ducing nonsmo oth or high frequency error thereby smo othing the error This intuition is

not valid for all op erators however more generally these simple metho ds damp the high en

ergy comp onents of the error and so it is the low energy error that we need to b e concerned

about; the energy of our solution error e is given by the bilinear form, i.e., e^T A e. Thus, for the Poisson operator, the low energy functions are these smooth functions, because the Poisson operator with constant coefficients is a smooth operator.

Another way to divine the need for a global comp onent in a solver is from a simple

information theoretic viewp oint This can most easily b e seen by allowing the sub domains

to degenerate to single no des thus transforming the Schwarz metho d into Jacobi iterations

If a point load is applied at a corner of a 2D regular mesh, then, as Jacobi only transfers information via a matrix vector product with M^{-1}K (which has the same graph structure as A) and b, the nonzero structure of x can only advance to the neighbors of a nonzero node in each iteration. As the shortest path to the furthest node from a corner node in a 2D mesh is of order sqrt(n) long, we require on the order of sqrt(n) iterations to get a nonzero in all n degrees of freedom. The inverse of the Poisson operator is dense, so all nodes have a nonzero value, in general, in their solution. Thus Jacobi cannot, under any circumstances, converge with high relative accuracy in fewer than order sqrt(n) iterations for this particular right hand side. To remedy this situation we need to have some form of global communication in each iteration.

The solution to this inherent limitation on the convergence rate of an iterative

solver is to add a global correction. The global correction is a (perhaps approximate) projection to a smaller subspace. As we see in §, these projections are implemented with a smaller linear solve; we generally apply these methods recursively, and only the top (coarsest) grid is solved exactly, thus all but the penultimate grid are corrected with an approximate projection. The subspace from which we compute a correction is a coarse grid

space the construction of this coarse grid space is the primary distinguishing characteristic

of all scalable linear equation solvers

A Simple two level metho d

To create a coarse grid space, the domain must first be rediscretized in some fashion, but with far fewer nodes. Using the discretized domain in the figure above, we can rediscretize; the figure below shows such a discretization (Ω_0) after it has been remeshed.

Figure: Matrix for the Two Level Method

With this coarse grid we can define a new linear operator A_C, treat this like a new overlapping subdomain, and apply either the additive or multiplicative method from §. Thus we restrict the current residual b - A x_k up to the coarse mesh, solve for the coarse grid correction, and interpolate the correction back to the fine grid. With the addition of a coarse grid we can write the multiplicative form of our two subdomain example in § as

    x_{k+1/3} = x_k       + R_1^T (R_1 A R_1^T)^{-1} R_1 (b - A x_k)
    x_{k+2/3} = x_{k+1/3} + R_2^T (R_2 A R_2^T)^{-1} R_2 (b - A x_{k+1/3})
    x_{k+1}   = x_{k+2/3} + R_C^T A_C^{-1} R_C (b - A x_{k+2/3}).

The restriction matrix for the coarse grid is no longer a simple boolean operator, as we need to interpolate the nodal values on the fine mesh to multiple nodal values of the coarse mesh. Thus we need discrete coarse grid restriction operators; the figure below shows the directed graph (in bold) of this restriction operator. Much is known about calculating interpolation values, as this is at the core of the finite element method (i.e., mapping between domains). Thus, given a shape function Φ_I(x), i.e., the finite element basis function for coarse node I, we can calculate R[I][j], which is the interpolate of Φ_I at the fine node j, by R[I][j] = Φ_I(j.coord), with j.coord being the coordinate of node j.
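The entry R[I][j] = Φ_I(j.coord) is easy to compute once the coarse shape functions are in hand. The sketch below does this in one dimension with piecewise linear (hat) shape functions; the grids and helper names are our own illustrative choices.

    import numpy as np

    def hat(xc, I, x):
        # Value at x of the piecewise linear basis function of coarse node I (coarse nodes at xc).
        left  = xc[I - 1] if I > 0 else xc[I] - 1.0            # phantom node outside the domain
        right = xc[I + 1] if I < len(xc) - 1 else xc[I] + 1.0
        return max(0.0, min((x - left) / (xc[I] - left), (right - x) / (right - xc[I])))

    xf = np.linspace(0.0, 1.0, 9)          # fine grid node coordinates
    xc = xf[::2]                           # coarse grid: every other fine node
    R = np.array([[hat(xc, I, xj) for xj in xf] for I in range(len(xc))])
    print(R)                               # each interior row reads (..., 1/2, 1, 1/2, ...)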

With this construction we again have the same options as with the one level Schwarz methods: we can use an additive form or a multiplicative form for the coarse grid correction.

Figure: Graph for the Restriction Matrix for the Two Level Method

We can therefore generate four basic forms of this two level method; the algorithm given above is the multiplicative-multiplicative form, i.e., multiplicative within a level and between levels. Additionally we can symmetrize that iteration, and if we use the additive method for the subdomain corrections we get the additive-multiplicative form

    x_{k+1/3} = x_k       + ( R_1^T (R_1 A R_1^T)^{-1} R_1 + R_2^T (R_2 A R_2^T)^{-1} R_2 ) (b - A x_k)
    x_{k+2/3} = x_{k+1/3} + R_C^T A_C^{-1} R_C (b - A x_{k+1/3})
    x_{k+1}   = x_{k+2/3} + ( R_1^T (R_1 A R_1^T)^{-1} R_1 + R_2^T (R_2 A R_2^T)^{-1} R_2 ) (b - A x_{k+2/3}).

This gives us the classic multigrid form with a block Jacobi smoother, and is the basic approach that we utilize in our numerical experiments. We can now introduce the classic multigrid algorithm.

Multigrid

Multigrid has long been accepted as the theoretically optimal solution method for some model problems, and a great deal of research has been focused on

applying multigrid to many types of discretized PDEs In any given year there are many

international conferences dedicated to multigrid additionally multigrid is well represented

in domain decomp osition conferences and iterative metho d conferences worldwide

This chapter is concerned with providing the basic multigrid background without

the distraction of unstructured meshes and parallel computing We discuss the classical

multigrid form which is a natural extension from x although unlike most classical

presentations we assume that multigrid is used as a preconditioner Using multigrid as a

preconditioner merely means that we are not improving a solution of Ax = b with x_k, but are finding an approximate solution to A x = r_k. Additionally, we enter the algorithm with a residual on the fine mesh; this changes the structure of full multigrid and is explained shortly.

We number the grids from the b ottom ne to the top coarse counter to the

practice in the classic multigrid literature this is b ecause we start with the ne mesh and

work our way up to the coarse mesh stopping when we can solve the problem directly

Historically multigrid has b een used primarily on structured meshes as one starts at the

top mesh and continues to rene the mesh until we have an accurate answer Note we will

use top to mean the coarse grid even though it is at the bottom of gure

The primary operators used by multigrid are:

Smoother: The smoother S(A, r) is an iterative solver that is applied for only a few iterations, or even just one iteration. The smoother must be effective at reducing the error up to the frequency that can be resolved on the mesh and down to the frequency that can be resolved on the next coarsest mesh.

Restriction: The restriction operator R(r) must be able to map residuals to the next coarsest grid.

Interpolation (or Prolongation): The interpolation operator P(x) must map values (solutions) from a grid to the next finer grid. The transpose of the restriction operator is commonly used, i.e., P = R^T.

Coarse Grid Operator: The coarse grid operator A_{i+1} must represent all of the frequencies lower than those that can be effectively reduced by the smoother on this level. More precisely, the error in the coarse grid representation of the low eigenfunctions of the fine grid must be in the space spanned by the high eigenfunctions of the fine grid. We use a Galerkin (or variational) coarse grid operator, as discussed above; that is, A_{i+1} = R_i A_i P_i.

With these components we can state the classic multigrid V-cycle in the figure below.

    function MGV(A_i, r_i)
        if there is a coarser grid (i+1)
            x_i <- S(A_i, r_i)                          (presmooth)
            r_i <- r_i - A_i x_i
            r_{i+1} <- R_i r_i                          (restrict the residual)
            x_{i+1} <- MGV(R_i A_i R_i^T, r_{i+1})      (coarse grid correction)
            x_i <- x_i + R_i^T x_{i+1}                  (interpolate the correction)
            r_i <- r_i - A_i (R_i^T x_{i+1})
            x_i <- x_i + S(A_i, r_i)                    (postsmooth)
        else
            x_i <- A_i^{-1} r_i                         (direct solve on the coarsest grid)
        return x_i

    Figure: Multigrid V-cycle Algorithm

We represent this algorithm schematically in the figure below.

Figure: Multigrid V-cycle (schematic); S denotes a smoother application, R a matrix application, D a direct solve
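A compact NumPy version of the V-cycle above, specialized to the 1D Poisson model problem with a weighted Jacobi smoother and a Galerkin coarse grid operator; the grid sizes, the smoother weight, and the helper functions are our own illustrative choices.

    import numpy as np

    def poisson1d(n):
        return 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def restriction(n_fine):
        # Linear-interpolation-based restriction from n_fine = 2m+1 unknowns to m unknowns.
        m = (n_fine - 1) // 2
        R = np.zeros((m, n_fine))
        for I in range(m):
            j = 2*I + 1                     # fine index of coarse node I
            R[I, j-1:j+2] = [0.5, 1.0, 0.5]
        return R

    def mgv(A, r, nu=2, omega=2.0/3.0):
        # One V-cycle for A x = r: weighted Jacobi smoothing, recursive coarse grid correction.
        n = A.shape[0]
        if n <= 3:
            return np.linalg.solve(A, r)    # direct solve on the coarsest grid
        d = np.diag(A)
        x = np.zeros(n)
        for _ in range(nu):                 # presmooth
            x += omega * (r - A @ x) / d
        R = restriction(n)
        x += R.T @ mgv(R @ A @ R.T, R @ (r - A @ x))   # Galerkin coarse grid correction
        for _ in range(nu):                 # postsmooth
            x += omega * (r - A @ x) / d
        return x

    n = 2**6 - 1
    A, b = poisson1d(n), np.ones(n)
    x = np.zeros(n)
    for it in range(8):                     # V-cycles used as a stationary iteration
        x += mgv(A, b - A @ x)
        print(it, np.linalg.norm(b - A @ x))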

The preconditioner B of § is MGV(A_1, r_1). In practice it has been universally observed that a variant of the V-cycle, full multigrid (or F-cycles), provides better solver performance. The figure below shows the full multigrid algorithm.

    function FMG(A_i, r_i)
        if there is a coarser grid (i+1)
            x_i <- FMG(A_{i+1}, R_i r_i)
            r_i <- r_i - A_i x_i
        else
            x_i <- 0
        x_i <- x_i + MGV(A_i, r_i)
        if there is a finer mesh (i-1)
            return R_{i-1}^T x_i
        else
            return x_i

    Figure: Full Multigrid Algorithm

And we can schematically represent full multigrid in the figure below.

Figure: Multigrid F-cycle (schematic); S denotes a smoother application, R a matrix application, D a direct solve

Convergence of multigrid

The intuition b ehind the convergence b ehavior of multigrid starts with the no

tion that iterative metho ds which smo oth values of nearest neighbors on the grid can

eectively reduce the high frequency or high energy comp onent of the residual Multi

grid organizes standard iterative metho ds to work at varying scales of resolution so as to

allow their work to b e most pro ductive To do this we need to b e able to map residuals

and corrections b etween grids in an eective manner and we need a coarse grid op era

tor that represents in tandem with these inter grid transfer op erators the low frequency

comp onents of the error

Informally we want the dierence b etween the solution in the coarse grid space

mapp ed to the ne grid space and the true answer on the ne grid to have little of the

low energy comp onents Thus the eect of the coarse grid correction is to demote the

low frequency error to high frequency error which can b e reduced cheaply For our mo del

problem Poissons equation in any dimension a particular multigrid construction satises

these requirements extremely well. The Poisson problem is one of the few problems for which hard bounds on the number of iterations required to achieve a specified tolerance are known. The next section sketches the proof of the convergence rate of multigrid on Poisson's equation on a unit square.

Convergence pro of outline

We sketch the proof here, as it is useful in understanding the frequency domain decomposition nature of multigrid. This presentation follows that in the references. As a model problem we use the regular Poisson equation with Dirichlet boundary conditions. The optimal multigrid algorithm for Poisson's equation is carefully constructed to reduce the error in each iteration by at least a fixed factor, independent of the mesh size.

Multigrid is optimal if we use N = (2^k - 1)^D nodes, with constant spacing between unknowns; the coarse grid has 2^{k-1} - 1 nodes (2^{k-1} + 1 grid points) in each dimension, i.e., the coarse grid picks every other node from the fine mesh, and by the special choice of grid dimension the end nodes remain through all of the grids. In 2D and 3D one can perform the same procedure recursively: coarsen the edges, then, starting from the selected edge nodes, select surface nodes, and so on.

The restriction and interpolation operators are derived from standard linear interpolation. Thus the first row in the restriction matrix R is R(1, :) = (1/2, 1, 1/2, 0, ..., 0). The Galerkin form for the coarse grid operator, R A R^T, is the same Poisson matrix, scaled in magnitude and approximately halved in size. The smoother uses a weighted Jacobi method, similar to that introduced in §, with R_ω = I - ω D^{-1} A and c = ω D^{-1} b, where D is the diagonal of A; ω = 1 gives the standard Jacobi operator. For the model problem the diagonal is constant, D = dI, so with the eigendecomposition A = Z Λ Z^T we have R_ω = Z (I - (ω/d) Λ) Z^T, where Λ is the diagonal matrix of the eigenvalues of A. Let e_k = x_k - x be the error in the k-th iteration. We have

    e_{k+1} = R_ω e_k = R_ω^{k+1} e_0 = Z (I - (ω/d) Λ)^{k+1} Z^T e_0,

so

    Z^T e_{k+1} = (I - (ω/d) Λ) Z^T e_k,   or   (Z^T e_{k+1})_j = (1 - (ω/d) λ_j) (Z^T e_k)_j.

(Z^T e_k)_j is called the j-th frequency component of the error e_k. The eigenvalues λ_j(R_ω) = 1 - (ω/d) λ_j determine how fast each component of the error decreases in each iteration. The figure below plots λ_j(R_ω).

Figure: Graph of the spectrum of R_ω for N unknowns (curves for ω = 1/2, 2/3, and 1; for ω = 2/3 the upper half of the spectrum lies between -1/3 and 1/3)

When ω = 2/3 and j ≥ N/2 (i.e., the upper half of the spectrum), |λ_j(R_ω)| ≤ 1/3. This means that the upper half of the error components are multiplied by 1/3 or less in each iteration.
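This claim is easy to check numerically for the one dimensional model operator; the few lines below (our own illustration) evaluate the eigenvalues of R_ω = I - (ω/2)A for the 1D Poisson matrix and report the largest damping factor over the upper half of the spectrum.

    import numpy as np

    N = 99
    j = np.arange(1, N + 1)
    lam_A = 2.0 - 2.0*np.cos(j*np.pi/(N + 1))        # eigenvalues of the 1D Poisson matrix
    for w in (0.5, 2.0/3.0, 1.0):
        lam_R = 1.0 - (w/2.0)*lam_A                  # eigenvalues of R_w = I - (w/2) A
        upper = np.abs(lam_R[j >= (N + 1)//2])       # upper (high frequency) half of the spectrum
        print(w, upper.max())                        # for w = 2/3 the maximum is 1/3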

With the figure above, and by using the identity x <- R_ω x + ω D^{-1} b in the two steps that apply the smoother and the identity b = A x in the line that takes the residual, we can construct an expression for the application of one V-cycle to the error:

    e_{k+1} = R_ω { I - R^T (R A R^T)^{-1} R A } R_ω e_k.

Note this expression assumes, by induction, that the coarsest grid is solved exactly. This expression can be nearly diagonalized with Z, the eigenvector matrix of A; the equation is block diagonal with a particular ordering of the eigenvector matrix. The eigenvalues of these blocks can be explicitly calculated; these blocks are 2 by 2 in the 1D case, 4 by 4 in the 2D case, and so on (see the references for details). The eigenvalues of the matrix in this equation are bounded by a constant well below one, and thus we get nearly one digit of accuracy per iteration; this rate is not only fast, but it is completely independent of the size of the problem. This construction is perfect in that, as we go down the V-cycle, we are reducing the error uniformly, and as the corrections percolate back up the V-cycle they do not soil the lower part of the spectrum in any significant way, and they project their correction at least well enough so that the smoothing step reduces the entire spectrum of the error by at least a fixed factor.

Multigrid is thus completely scalable in serial, provided the amount of work per iteration is O(n), i.e., O(1) work for each unknown. As the work per grid is proportional to the number of unknowns, and the number of unknowns per grid decreases in a geometric progression (i.e., 1 + 2^{-D} + 2^{-2D} + ... + 2^{-(L-1)D} for L levels in D dimensions), the amount of work is bounded by twice the work done on the fine grid in 1D; multigrid is therefore an O(n) method sequentially.

Convergence analysis of domain decomp osition

The convergence analysis for unstructured problems is less satisfying than that

of the previous section This is b ecause the analysis for unstructured problems can not

provide absolute b ounds on the rate of convergence but only b ounds on the condition of

the preconditioned system With b ounds on the condition number we can use the b ounds

from the Krylov subspace metho d x to give b ounds on the convergence rate

Additionally, the bounds that can be derived, through impressive feats of analysis, have loosely defined parameters h and H, and other parameters that are difficult to explicitly calculate. Despite the shortcomings of this analytical framework, it is quite valuable in providing necessary conditions for scalability, as well as for providing a theoretical framework to compare different algorithms in a rational way.

From Domain Decomposition:

In order to understand the convergence b ehavior of domain decomp osition

algorithms we need to introduce a mathematical framework This is most easily

done in the context of Sob olev spaces and Galerkin nite elements Indeed

it turns out that the basic correction steps calculated in virtually all domain

decomp osition and multigridmultilevel algorithms may b e viewed as approxi

mate orthogonal pro jections in some suitable inner pro duct onto a subspace

This observation makes p ossible the complete analysis of many domain decom

p osition and multigrid metho ds

The next sections introduce a variational framework and projections, with an example showing that the correction R_1^T (R_1 A R_1^T)^{-1} R_1 r_k in the multiplicative Schwarz algorithm is an example of a projection; we then review or introduce the components required to describe and analyze all domain decomposition methods, with the convergence theory described last.

This presentation follows that in the references.

Variational formulation

We recall the steady state Poisson equation from § and consider only homogeneous Dirichlet boundary conditions. The strong form is

    -∇·(k ∇u) = f   in Ω
    u = 0           on ∂Ω,

and the weak form is

    a(v, u) = (f, v)   ∀ v ∈ H_0^1(Ω),

with

    a(v, u) = ∫_Ω ∇v · k ∇u,    (f, v) = ∫_Ω v f.

Within this variational framework we can demonstrate that the corrections in the multiplicative Schwarz methods are projections of the error. Let u^n be the solution at the end of the n-th iteration, and let u^{n+1/2} be the solution at the end of the first substep of iteration n+1. Then the one level continuous Schwarz algorithm can be written as

    L (u^{n+1/2} - u^n) = -(L u^n - f)   in Ω_1
    u^{n+1/2} - u^n = 0                  on ∂Ω_1
    u^{n+1/2} = u^n                      on Ω \ Ω_1,

and

    L (u^{n+1} - u^{n+1/2}) = -(L u^{n+1/2} - f)   in Ω_2
    u^{n+1} - u^{n+1/2} = 0                        on ∂Ω_2
    u^{n+1} = u^{n+1/2}                            on Ω \ Ω_2.

This can be expressed in weak form as

    a(u^{n+1/2} - u^n, v) = (f, v) - a(u^n, v),   u^{n+1/2} - u^n ∈ H_0^1(Ω_1),  ∀ v ∈ H_0^1(Ω_1),

and

    a(u^{n+1} - u^{n+1/2}, v) = (f, v) - a(u^{n+1/2}, v),   u^{n+1} - u^{n+1/2} ∈ H_0^1(Ω_2),  ∀ v ∈ H_0^1(Ω_2).

Let e^n = u - u^n be the error in the n-th iterate u^n; then

    a(u^{n+1/2} - u^n, v) = a(e^n, v)   ∀ v ∈ H_0^1(Ω_1),

and

    a(u^{n+1} - u^{n+1/2}, v) = a(e^{n+1/2}, v)   ∀ v ∈ H_0^1(Ω_2).

Define the projection T_i e, in the inner product a(·, ·), by

    a(T_i e, v) = a(e, v),   T_i e ∈ H_0^1(Ω_i),  ∀ v ∈ H_0^1(Ω_i).

At each half step, the corrections u^{n+1/2} - u^n and u^{n+1} - u^{n+1/2} calculated in the equations above are projections of the error onto the subspaces H_0^1(Ω_1) or H_0^1(Ω_2).

Formally, a projection ẽ of e onto a subspace H_0^1(Ω_1), in the inner product a(·, ·), is defined as

    ẽ = P_1 e = arg inf_{v ∈ H_0^1(Ω_1)} ||e - v||_a.

We can alternatively define P_1 e by the following: find ẽ ∈ H_0^1(Ω_1) so that

    a(ẽ, v) = a(e, v)   ∀ v ∈ H_0^1(Ω_1),

or equivalently

    a(e - ẽ, v) = 0   ∀ v ∈ H_0^1(Ω_1).

It is now simple to show that this ẽ is the minimizer above, and thus our corrections in the Schwarz methods are in fact projections of the error onto a subdomain. These projections have the attractive property that they find a correction in a given subdomain that is closest to the error in the A (or energy) norm. The next section shows that this is indeed the case for the discrete forms of the Schwarz methods.

Matrix representation of pro jections

In x we constructed the nite dimensional subspace S span H k

k

n we now replace H ab ove by S After selecting domains i p

i

i

g ie those functions in S with supp ort in Express u S as dene the sets f

i

k

P

u u and let A b e the stiness matrix and A a It is p ossible to de

k k ij i j

k

P

i

then w rive an explicit representation for T u in equation Let w T u

k i i

k

k

equation can b e expressed as

X X

i i i i

w a u a

k k k

l k l l

k k

X X

i i i i

u a w a

k k k

l l l k

k k

Recall that the restriction matrix R maps the co ecients of u to co ecients of

i

i i

T

the lo cal sub domain i thus a R AR We can now express a discretized form

i

i

l k

of our pro jection op erator

T

A w R AR w R Au

i i

i

i

or

T

w R AR R Au

i i

i

Hence the matrix representation of the op erator T is given by

i

T T

T R R AR R A

i i i

i i

Note this is the coarse grid correction term in equation applied to the error
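A few lines of NumPy confirm that the matrix T_i above is indeed an a-orthogonal projection; the test matrix and index set are our own illustrative choices.

    import numpy as np

    n = 12
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)        # SPD test matrix
    idx = np.arange(3, 8)                                      # an illustrative subdomain index set
    R = np.zeros((len(idx), n)); R[np.arange(len(idx)), idx] = 1.0

    T = R.T @ np.linalg.solve(R @ A @ R.T, R @ A)              # T_i = R_i^T (R_i A R_i^T)^{-1} R_i A
    e = np.random.default_rng(0).standard_normal(n)
    print(np.allclose(T @ T, T))                               # a projection: T_i^2 = T_i
    print(abs((T @ e) @ (A @ (e - T @ e))) < 1e-10)            # correction is A-orthogonal to the remaining error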

Domain decomp osition comp onents

We have described some of the basic components of domain decomposition methods in functional form. This section lists the components necessary to specify all domain decomposition methods, some of which have been introduced previously and some of which are introduced here.

Domain decomp osition p olynomials

Recall that the correction step in the multiplicative Schwarz algorithm in the figure above is given by

    x_{k+1/2} = x_k + R_1^T (R_1 A R_1^T)^{-1} R_1 (b - A x_k)
    x_{k+1}   = x_{k+1/2} + R_2^T (R_2 A R_2^T)^{-1} R_2 (b - A x_{k+1/2}).

Define B_i by B_i = R_i^T (R_i A R_i^T)^{-1} R_i. The operator B_i restricts the residual to one subdomain, solves for a correction in this domain, and then interpolates the correction back to the global space. We can rewrite the two domain multiplicative Schwarz correction as

    x_{k+1} = x_k + (B_1 + B_2 - B_2 A B_1)(b - A x_k).

We can interpret multiplicative Schwarz as a Richardson iterative procedure with the preconditioner B given by B = B_1 + B_2 - B_2 A B_1; B can be thought of as a polynomial in the residual r. Note that the utility of using a Krylov subspace method is now evident, as CG requires very little additional cost and provides a solver with well defined optimization properties, precisely the optimization properties described in § for the Schwarz methods.

The form of the multiplicative Schwarz method for the error is

    e_{k+1} = e_k - (B_1 + B_2 - B_2 A B_1) A e_k.

We use the projection operators from § and find an expression for the error reduction operator of the method: with B = B_1 + B_2 - B_2 A B_1 we have T = BA = T_1 + T_2 - T_2 T_1, and I - T = (I - T_2)(I - T_1). We can express the effect of a method on the error in the form of polynomials in the T_i, with i = 0, ..., p; T_0 refers to the coarse grid in our convention. For the multiplicative multilevel method we have T = BA = P(T_0, T_1, ..., T_p) = I - (I - T_p) ... (I - T_1)(I - T_0). The art of designing preconditioners is to design these polynomials such that they are well conditioned operators with respect to the cost of their application.
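The error propagation identity I - BA = (I - T_2)(I - T_1) can be verified directly; the sketch below (with our own small test problem) builds B_1, B_2, and B and checks the factorization.

    import numpy as np

    n = 12
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    I = np.eye(n)

    def B_sub(idx):
        # B_i = R_i^T (R_i A R_i^T)^{-1} R_i for a boolean restriction onto idx.
        R = np.zeros((len(idx), n)); R[np.arange(len(idx)), idx] = 1.0
        return R.T @ np.linalg.solve(R @ A @ R.T, R)

    B1, B2 = B_sub(np.arange(0, 8)), B_sub(np.arange(5, 12))    # two overlapping subdomains
    B = B1 + B2 - B2 @ A @ B1                                    # multiplicative Schwarz preconditioner
    T1, T2 = B1 @ A, B2 @ A
    print(np.allclose(I - B @ A, (I - T2) @ (I - T1)))           # the error operator factorizes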

Auxiliary bilinear forms

When an approximate solver is used in the first weak-form equation above, it becomes

    ã_1(u^{n+1/2} - u^n, v) = (f, v) - a(u^n, v),   u^{n+1/2} - u^n ∈ S_1,  ∀ v ∈ S_1;

likewise the second becomes

    ã_2(u^{n+1} - u^{n+1/2}, v) = (f, v) - a(u^{n+1/2}, v),   u^{n+1} - u^{n+1/2} ∈ S_2,  ∀ v ∈ S_2.

That is, we do not require that we have the same bilinear form on the subdomains, be they coarse grids or subdomains; we can therefore use auxiliary bilinear forms ã_i(·, ·).

Interpolation op erators

The existence of subdomains requires interpolation operators to recover the subdomain corrections. To this end we define the set of operators I_i : S_i → S.

Domain decomp osition comp onents

We now have all of the components necessary to define any particular domain decomposition method: a set of subspaces S_i; a set of interpolation operators I_i; a set of auxiliary bilinear forms ã_i(·, ·); and the polynomial P(T_0, T_1, ..., T_p) to prescribe the order of the application of the subdomain corrections. The next section applies the formalism of this section to an abstract analysis applicable to all domain decomposition methods.

A convergence theory

The goal of this section is to sketch an abstract convergence theory for domain

decomp osition metho ds These metho ds are comp osed of somewhat intuitive parameters

that make the utility of multigrid evident as well as serve to provide an understanding of

the convergence characteristics of domain decomp osition metho ds on unstructured meshes

To b egin with we assume for simplicity that we have symmetric p ositive denite

op erators and work with operators and functions rather than matrices and vectors

First, the abstract convergence theory makes extensive use of the following lemma.

LEMMA. Define T = Σ_i T_i. Then

    a(T^{-1} u, u) = min_{u = Σ_i I_i u_i, u_i ∈ V_i} Σ_i ã_i(u_i, u_i).

See the references for a proof of this.

The abstract convergence theory for symmetric positive definite problems centers around three parameters which measure the interaction of the subspaces V_i and the bilinear forms ã_i. These parameters are presented in the form of assumptions.

Assumption 1: let C_0 be the minimum constant such that, for all u ∈ V, there exists a decomposition u = Σ_i I_i u_i, u_i ∈ V_i, with

    Σ_i ã_i(u_i, u_i) ≤ C_0^2 a(u, u).

If C_0 can be bounded independently of the grid parameters (size of elements and number of subdomains), then the V_i are said to provide a stable splitting of V. The quantity a(u, u) is referred to as the energy of u. A value of one is desirable for C_0; thus we want the subdomain spaces to have minimal energy. This assumption, together with the Lemma, provides a lower bound C_0^{-2} on the spectrum of the additive Schwarz operator T = Σ_i T_i. Note multigrid is a stable splitting, and the coarse grid functions have relatively low energy, because the linear finite element shape functions (hat functions) are somewhat smooth.

Assumption 2: define E_{ij} to be the minimal value that satisfies

    |a(I_i u_i, I_j u_j)| ≤ E_{ij} a(I_i u_i, I_i u_i)^{1/2} a(I_j u_j, I_j u_j)^{1/2},
        ∀ u_i ∈ V_i, u_j ∈ V_j,  i, j = 1, ..., p.

Define ρ(E) to be the spectral radius of E. Note we do not include the coarse grid space V_0 in this definition. This parameter is in some sense the measure of orthogonality of the subspaces: when E_{ij} = 0 the subspaces V_i and V_j are orthogonal, when E_{ij} = 1 we have the usual Cauchy-Schwarz inequality, and for 0 < E_{ij} < 1 we have a strengthened Cauchy-Schwarz inequality. A small value is desirable for ρ(E).

Assumption 3: let ω be the minimum constant such that, for all u_i ∈ V_i, i = 0, ..., p,

    a(I_i u_i, I_i u_i) ≤ ω ã_i(u_i, u_i).

This parameter refers to the quality of the subdomain solves; ω = 1 corresponds to exact subdomain solves. This assumption also restrains us from simply scaling ã_i to decrease C_0. Note our coarse grid subdomains are recursive applications of multigrid, and thus are not solved exactly except on the penultimate grid; we do, however, use a direct solver on the local subdomains.

For a linear operator L which is self adjoint with respect to a(·, ·), we use the Rayleigh quotient characterization of the extreme eigenvalues:

    λ_max^A(L) = max_u a(Lu, u) / a(u, u),    λ_min^A(L) = min_u a(Lu, u) / a(u, u).

The condition number of L is thus given by κ(L) = λ_max(L) / λ_min(L). As the bound on the number of iterations of the conjugate gradient method is proportional to the square root of the condition number of the preconditioned system BA, the abstract convergence bounds are derived by using the Lemma and the three assumptions to find expressions for this condition number λ_max^A(L) / λ_min^A(L).

With this machinery we can state the bound on the condition number of the abstract additive Schwarz method (see the lemmas in the references for the derivations),

    κ(BA) ≤ ω (ρ(E) + 1) C_0^2,

and for the abstract multiplicative Schwarz a similar bound holds in terms of ω, ρ(E), and C_0^2.

The references also show that for the two level overlapping Schwarz methods (additive or multiplicative), with exact subdomain solves and a subdomain overlap width of order H (the subdomain size), the condition number of the preconditioned system is independent of H and h (the scale of discretization). Thus this particular form of multigrid has optimal convergence characteristics within the context of the abstract convergence theory.

Chapter

High p erformance linear equation

solvers for nite element matrices

The previous chapter introduced the ingredients used in multilevel iterative equa

tion solvers This chapter completes the context in which our work resides by discussing

the particular metho ds that have b een developed for the large unstructured sparse matrices

that are of interest to the finite element community. Scalable solvers for unstructured finite element problems are an active area of research. This section provides a brief overview of

current promising metho ds

Introduction

We are interested in scalable technology, i.e., we are interested in methods that have the potential to run in time O(n) sequentially and polylogarithmically in parallel, O(log^k n) for a constant k. Thus we look only at multilevel methods.

We segregate the ma jor metho ds in three somewhat arbitrary categories geo

metric multigrid algebraic multigrid and domain decomp osition metho ds We distinguish

b etween algebraic x and geometric x multigrid metho ds by calling a metho d alge

braic if it uses the matrix values in the construction of its restriction op erators We also

discuss notable domain decomp osition metho ds that are not multigrid metho ds x

Algebraic multigrid

Again we dene algebraic metho ds as metho ds that use the values of the ma

trix entries in the construction of the multigrid restriction op erators Algebraic multigrid

metho ds are an active area of research within the multigrid community as they provide an

avenue toward black b ox scalable solvers for PDEs on unstructured meshes There are

three comp onents to an algebraic or indeed any unstructured multigrid metho d selection

criterion for the coarse grid p oints construction of the restriction and interpolation op

erators (or functions), and the method for construction of the coarse grid operators A_i. The coarse grid operators in algebraic methods are usually formed in the standard Galerkin (or variational) way, A_{i+1} = R_i A_i R_i^T. Thus the coarse grid point selection and restriction function formulation are the only distinguishing aspects of an algebraic method.

Algebraic metho ds were rst introduced by Ruge in and tested on nite

element matrices including thin b o dy elasticity and incompressible materials The results of

these early algebraic metho ds were not very promising but they did set the stage for eective

mo dern metho ds that we discuss in this section Before we describ e these algorithms we

describ e the general approach that many of these metho ds employ

Many of the algebraic methods that we are aware of use some type of heavy edge heuristic (edges are off-diagonal stiffness matrix entries) to agglomerate vertices in the selection of the coarse grid point set, much like the method discussed in §. These methods intend to keep tightly coupled vertices connected to each other, via maximal matching algorithms or strongly coupled neighborhoods of a node. Once these small algebraic regions have been defined, interpolation functions with compact support are then constructed in some fashion.

A promising algebraic metho d

Vanek et al constructed an algebraic algorithm that is particularly go o d at

handling thin b o dy elasticity and lightly supp orted structures eg a plate with Dirichlet

b oundary conditions on a small p ortion of the b oundary The algorithm pro ceeds as follows

Construct strongly coupled neighborhood of nodes this approach is meant to mimic

the b ehavior of semicoarsening for anisotropic and stretched grids

Use these groups to construct a constant interpolation function that is a function

with a constant value at all of the vertices in the group and zero everywhere else

These functions are the translational rigid b o dy mo des for the group of vertices

These functions are not ideal, as they have very high energy, i.e., a(u, u) is large; this results in a high bound on the convergence rate via the C_0 term of assumption 1 in §. This is because these functions are sharp; a natural approach is to smooth these functions with a simple iterative scheme (i.e., a smoother in multigrid terminology). Notice, though, that the support of these functions grows by one layer of vertices in each iteration; thus overuse of this smoothing results in high computational complexity in applying them and, more importantly, a higher complexity of the coarse grid operators, as they are constructed from the fine grid operator and these interpolation operators. Vanek et al. thus apply only one step of the smoother, using a slightly altered operator, to produce the simple version of their method (a small illustrative sketch of this aggregate-and-smooth construction follows this list).

For problems in elasticity and fourth order problems their results are dramatically

improved with the use of geometric information, in the form of what seems to amount to vertex coordinates. The vertex coordinates are used to orthogonalize their initial guess against user-provided polynomials, or against the rigid body modes that can be inferred from these vertex coordinates. This approach of removing the rigid body modes is similar to many successful methods.
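To make the aggregation-and-smoothing idea of the list above concrete, here is a deliberately simplified sketch: contiguous groups of nodes stand in for the strongly coupled neighborhoods, the tentative interpolation is constant on each aggregate, and a single damped Jacobi step lowers the energy of the coarse basis functions. This is only an illustration of the general construction under our own assumptions, not a reproduction of Vanek et al.'s algorithm.

    import numpy as np

    n = 16
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)        # scalar 1D stand-in for a stiffness matrix

    aggregates = [np.arange(i, min(i + 4, n)) for i in range(0, n, 4)]   # stand-in "neighborhoods"

    P0 = np.zeros((n, len(aggregates)))                        # tentative prolongator: constant per aggregate
    for c, agg in enumerate(aggregates):
        P0[agg, c] = 1.0

    omega = 2.0 / 3.0
    Dinv = np.diag(1.0 / np.diag(A))
    P = (np.eye(n) - omega * Dinv @ A) @ P0                    # one damped Jacobi smoothing step

    energy = lambda Q: np.trace(Q.T @ A @ Q)                   # sum of a(phi_c, phi_c) over coarse functions
    print(energy(P0), energy(P))                               # smoothing reduces the energy of the basis
    A_coarse = P.T @ A @ P                                     # Galerkin coarse grid operator
    print(A_coarse.shape)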

Geometric approach on unstructured meshes

To apply classic multigrid techniques to an unstructured grid one is faced with two

main design decisions Recall from x that for regular grids the coarse grids are known a

priori hence the restrictioninterpolation and coarse grid op erators are known implicitly

For unstructured grids however the coarse grids must b e explicitly constructed after which

standard nite element shap e functions can b e used to construct the restriction op erator

The rst design decision is whether the coarse grids are provided by the nite element co de

or are constructed within the solver from the ne mesh

The second decision that often follows from the rst is whether to construct

the coarse grid op erators algebraically Galerkin coarse grids or let the nite element

implementation generate the coarse grid op erators There are advantages and disadvantages

in either approach The Galerkin approach is harder to implement eciently and requires

more communication as an element can not b e redundantly calculated since there are

no explicit elements On the other hand the Galerkin approach is more robust b ecause

there is no need to take derivatives of the coarse grid shap e functions thus the coarse

grids need not b e as go o d eg zero volume elements are allowed since they do not have

any ne grid vertices within them by construction Also nonlinear elements can b e more

easily accommo dated as some element formulations eg large deformation plasticity are

not robust with p o orly prop ortioned low order tetrahedra Nonlinear materials can b e

accommo dated with Full Approximation Storage FAS metho ds where the current

solution and right hand side are restricted to the coarse mesh and not just the current

residual the coarse grid elements thus retain state variables in nonlinear materials

An additional advantage to the Galerkin approach with nonlinear problems is that

regions of lo calized softening may not app ear on the coarser grids if a new nite element

problem is formed However with the Galerkin approach a region of lo calized nonlinearity

will contribute to the coarse grid stiness matrix Thus Galerkin coarse grids provide in a

sense a higher order of approximation as they sample the function on the ne grid at more

points. Our method is novel in that it constructs a geometric coarse grid automatically, thus relieving the finite element user of this burden; we use Galerkin coarse grids because of their desirable properties and because they can be constructed automatically within the solver.

Promising geometric approaches

The work in applying geometric forms of multigrid to unstructured problems in

continuum mechanics has come primarily from the engineering community Many of these

metho ds require an explicit nite element mesh for the coarse grids supplied by the user

Most of this work uses metho ds that require that coarse meshes indeed the entire problem

b e provided by the user eg see for second order PDEs for fourth order PDEs

and for nonlinear problems note uses Galerkin coarse grids These meth

o ds have the same structure as classical multigrid metho ds x as do es our metho d and

go o d results can b e achieved for the problems that can b e addressed with these metho ds

ie problems where the coarse meshes can b e eectively constructed These metho ds

have practical difficulties on large problems with complex boundaries and material interfaces, as the coarse grids required for efficiency can be so small that the geometry of the original problem cannot be well represented. Thus some type of approximate mesh must be constructed for these coarse problems, but a mesh generator is not

in general equipp ed to pro duce such meshes These metho ds are eective however when

the geometryscale of the problem is such that all of the coarse meshes can b e eectively

constructed and the user is willing to do so

Another method that falls into our definition of geometric multigrid, but could also be called an algebraic method, is that of Bulgakov et al., who use rigid body modes in the construction of the restriction operators, which are in turn used for the standard Galerkin coarse grid operators. This method is notable in that it has an algebraic architecture, so that the user must only provide the fine mesh. This method proceeds by partitioning vertices of the finer mesh into small sets of connected vertices; these sets then produce a coarse grid space with constant interpolation. The displacement and rotational rigid body modes of these aggregates are then used to provide interpolation values from the fine (aggregated) vertices

to the coarse vertex This metho d is similar to that describ ed in with the addition of

rotational mo des discussed in x

A theoretical drawback of this approach is that the discrete interpolation functions have disjoint support: the actual interpolation functions have high energy, and the constant C_0 in assumption 1 of § is very large. The numerical results, however, are promising.

Domain decomp osition

As noted previously domain decomp osition metho ds have a rich history in the

structural engineering community as a way of organizing the direct solution of large

complex structures These metho ds also known as nested dissection vertex ordering factor

izations can b enet from the reuse of factorizations from duplicate substructures These

direct metho ds explicitly form a Schur complement its factorization b eing the primary cost

of these metho ds Multilevel domain decomp osition metho ds use a two level scheme such

as Keyes for uid problems and rely on a p owerful overlapping domain decomp osition

smoother to ameliorate the effects of using a single, very coarse grid. In recent years, iterative solutions of the Schur complement have been investigated extensively (see the references therein). Here we mention only one method, which has been well developed and tested on structures problems.

A domain decomp osition metho d

The nite element tearing and interconnect metho d FETI developed by Farhat

et al is an iterative substructuring metho d that tears the monolithic problem into

a series of sub domains These sub domains are interconnected via Lagrange multipliers

thus the Schur complement is only solving for the Lagrange multipliers these b ecome the

primary variables in the solve

As the subdomains are given Neumann boundary conditions on the artificial boundaries (i.e., the Γ_i introduced earlier), the subdomain problems are singular in general. Thus a pseudoinverse is used for the subdomain solves. The resulting Schur complement equations need to be augmented

by applying a constraint again applied with Lagrange multipliers that the applied load

minus the Lagrange multiplier on each oating sub domain is orthogonal to the rigid b o dy

mo des of the sub domains Thus the Schur complement equations must b e augmented with

Lagrange multipliers to enforce the rigid b o dy constraint This nal set of equations is

solved with a projected conjugate gradient algorithm on the primary Lagrange multipliers

where the rigid b o dy comp onent of the residual is pro jected out in each iteration

Preconditioning is required for the original Schur complement only and can b e

implemented using any of the standard iterative substructuring metho ds Thus the

FETI metho d is a nonoverlapping domain decomp osition metho d in the dual space of the

problem The drawback of FETI is that the coarse grid is not in the same form as the ne

grid and hence FETI can not b e applied recursively this fact prevents one from developing

an optimal complexity multilevel FETI algorithm by simply solving the coarser problems

by another application of FETI

Chapter

Our metho d

This chapter and the numerical results in chapter describ e and discuss the core

issues of our algorithms in terms of convergence rates serial p erformance and some simple

PRAM parallel complexity The second half of this dissertation chapter to the end

discusses parallel p erformance issues in more depth and veries our claim of scalability with

p erformance results with our largest test problems

This chapter discusses our main contributions to serial multigrid methods for 3D finite element problems on unstructured grids. We have developed heuristics to optimize the quality of the vertex sets that are promoted to the coarse grid at all levels in multigrid, and techniques to modify the Delaunay tessellation on these sets, in order to optimize the solution time for solid mechanics problems on unstructured finite element meshes with large jumps in material coefficients, complex geometries, thin body domains, and poorly shaped elements. This chapter also discusses our new, practical, and highly optimal (in the PRAM model) parallel maximal independent set algorithm.

Introduction

This chapter discusses the technical details of the multigrid method that we use, and is the core of this dissertation. As we have seen in §, multigrid is a very powerful solution method. While multigrid is used extensively for structured meshes, the use of multigrid

on unstructured meshes is less wide spread This is due to the fact that the construction of

unstructured meshes is a challenging endeavor Thus the eective construction of coarse

grids for unstructured problems is not a well developed sub ject though an active area of

research and is the primary goal of this dissertation

Our work is based on a metho d developed by Guillard and indep endently by

Chan and Smith This metho d relies on

Maximal independent sets to automatically select vertices to b e promoted to the

next coarse grid

Automatic mesh generation techniques to construct the coarse grids using these ver

tices

Coarse grid spaces constructed with nite element shap e functions

Galerkin coarse grid op erators

Our approach has the advantage that it is relatively indep endent of the nite element

implementation ie the solver interface with the nite element co de is relatively small

The nite element co de need only provide no dal co ordinates as well as the stiness matrix

itself We also use element connectivity information and material type indices to identify

material interfaces This metho d can thus b e interfaced with an existing nite element

co de with relatively little diculty as the nite element co de need only provide the nite

element mesh to the solver

Solvers should b e highly mo dular to b e useful to the nite element community

This is of primary imp ortance as the nite element co des are large and complex systems

that have in many cases b een developed over decades The metho d that we use satises

our need for a highly mo dular and optimal nite element solver

Our method was previously developed in serial for linear elasticity applications. We have extended the method to parallel computers for applications in 3D continuum mechanics. The method has four basic components:

Coarsening Multigrids coarse grids should capture the low frequency eigenvectors

eectively Maximal independent sets MISs are often an eective and p opular

heuristic for capturing the low energy mo des of unstructured nite element problems

We describe a new algorithm for the parallel construction of an MIS in §, as well as a set of heuristics to improve the quality of the MIS in §. Note that there is no fundamental reason to promote vertices of the fine mesh to be in the coarse mesh (i.e., multigrid does not require node-nested coarse grids; see the references for numerical experiments with dual coarse grids); we take advantage of this in one of our methods.

Mesh After the vertices have b een created we need a mesh the mesh is used to

construct the nite element function space Thus we need a mesh generator our

mesh generation is discussed in x

RestrictionInterpolation Op erators After the coarse meshes are created one

can use standard nite element shap e functions for the restriction and interpolation

op erators This is describ ed in x

Coarse Grid Op erators Coarse grid op erators can b e constructed in one of two

ways as discussed in x x describ es the construction of Galerkin coarse grid

op erators on parallel computers

Also general issues of smo others or preconditioners for our smo others are discussed in

x

A parallel maximal indep endent set algorithm

An independent set is a set of vertices I V in a graph G V E in which

no two members of I are adjacent ie v w I v w E a maximal independent set

MIS is an indep endent set for which no prop er sup erset is also an indep endent set The

parallel construction of an MIS is useful in many computing applications such as graph

coloring and coarse grid creation for multigrid algorithms on unstructured nite element

meshes In addition to requiring an MIS which is not unique many of these applications

want an MIS that maximizes a particular application dep endent quality metric Finding

the optimal solution in many of these applications is an NP-complete problem (i.e., no polynomial time algorithm is known, though they can be solved by a nondeterministic (N) machine in polynomial (P) time); for this reason greedy algorithms, in combination with heuristics, are commonly used for both the serial and parallel construction of MISs. Many of the graphs

of interest arise from physical mo dels such as nite element simulations These graphs are

sparse and their vertices are connected to only their nearest neighbors The vertices of such

graphs have a bound on their maximum degree. We discuss our method of attaining O(1) PRAM (see §) complexity bounds for computing an MIS on such graphs, namely finite element models in three dimensional solid mechanics. Our algorithm is notable in that it

do es not rely on global random vertex ordering see and the references therein

to achieve correctness in a distributed memory computing environment but explicitly uses

knowledge of the graph partitioning to provide for the correct construction of an MIS in an

ecient manner

The complexity mo del of our algorithm also has the attractive attribute that it

requires far fewer pro cessors than vertices in fact we restrict the number of pro cessors

used in order to attain optimal complexity. Our PRAM model uses P = O(n) processors (n = |V|) to compute an MIS, but we restrict P to be at most a fixed fraction of n to attain the optimal theoretical complexity. The upper bound on the number of processors

is however far more than the number of pro cessors that are generally used in practice on

common distributed memory computers of to day so given the common use of relatively fat

pro cessor no des in mo dern computers our theoretical mo del allows for the use of many more

pro cessors than one would typically use in practice Thus in addition to obtaining optimal

PRAM complexity b ounds our complexity mo del reects the way that mo dern machines

are actually used. Our numerical experiments confirm our O(1) complexity claim.

We do not include the complexity of the graph partitionings in our complexity

mo del though our metho d explicitly dep ends on these partitions We feel justied in this

as it is reasonable to assume that the MIS program is embedded in a larger application that

requires partitions that are usually much b etter than the partitions that we require

An asynchronous distributed memory algorithm

Consider a graph G = (V, E) with vertex set V and edge set E, an edge being an unordered pair of distinct vertices. Our application of interest is a graph which arises from a finite element analysis, where elements can be replaced by the edges required to make a clique of all vertices in each element (see the figure below). Finite element methods, and indeed most discretization methods for PDEs, produce graphs in which vertices only share an edge with their nearest physical neighbors; thus the degree of each vertex v ∈ V can be bounded by some modest constant. We restrict ourselves to such graphs in our complexity analysis. Furthermore, to attain our complexity bounds we must also assume that vertices are partitioned well (which is defined later) across the machine.

Figure: Finite element quadrilateral mesh and its corresponding graph.

We introduce our algorithm by first describing the basic random greedy MIS algorithms described in the literature. We utilize an object oriented notation from common programming languages, as well as set notation, in describing our algorithms; this is done to simplify the notation, and we hope it does not distract the uninitiated reader. We endow each vertex v with a mutable data member state, state ∈ {selected, deleted, undone}. All vertices begin in the undone state and end in either the selected or deleted state; the MIS is defined as the set of selected vertices. Each vertex v is also given a list of adjacencies, v.adjac.

Definition: The adjacency list for vertex v is defined by v.adjac = {v1 | (v, v1) ∈ E}.

We also assume that v.state has been initialized to the undone state for all v, and that v.adjac is as defined above, in all of our algorithm descriptions. With this notation in place we show the basic MIS algorithm (BMA) in the figure below.

forall v ∈ V
    if v.state = undone then
        v.state ← selected
        forall v1 ∈ v.adjac
            v1.state ← deleted
I ← {v ∈ V | v.state = selected}

Figure: Basic MIS algorithm (BMA) for the serial construction of an MIS.
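The greedy pass can be written directly from the pseudocode above. The following is a minimal serial sketch in C++ (the type and function names are illustrative, not the thesis code), assuming an adjacency-list graph.

#include <cstddef>
#include <vector>

enum class State { undone, selected, deleted };

struct Vertex {
  State state = State::undone;
  std::vector<std::size_t> adjac;   // indices of adjacent vertices
};

// Returns the indices of the selected vertices, i.e., the MIS.
std::vector<std::size_t> greedyMIS(std::vector<Vertex>& V) {
  std::vector<std::size_t> I;
  for (std::size_t v = 0; v < V.size(); ++v) {
    if (V[v].state != State::undone) continue;
    V[v].state = State::selected;            // select v
    for (std::size_t w : V[v].adjac)
      V[w].state = State::deleted;           // delete its neighbors
    I.push_back(v);
  }
  return I;
}

Iterating over the vertices in a different order (e.g., important vertices first) yields a different, but still maximal, independent set; this is the hook used by the heuristics discussed later.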

For parallel processing we partition the vertices onto processors and define the vertex set V_p owned by processor p of the P processors. Thus V = V_1 ∪ V_2 ∪ ... ∪ V_P is a disjoint union, and for notational convenience we give each vertex an immutable data member proc, set after the partitioning is calculated, to indicate which processor is responsible for it. Define the edge separator set E^S to be the set of edges (v, w) such that v.proc ≠ w.proc. Define the vertex separator set V^S = {v | (v, w) ∈ E^S} (G is undirected, thus (v, w) and (w, v) are equivalent). Define the processor vertex separator set for processor p by V_p^S = {v | (v, w) ∈ E^S and (v.proc = p or w.proc = p)}. Further define a processor's boundary vertex set by V_p^B = V_p^S ∩ V_p and a processor's local vertex set by V_p^L = V_p \ V_p^B. Our algorithm provides for correctness and efficiency in a distributed memory computing environment by first assuming a given ordering, or numbering, of processors, so that we can use inequality operators with these processor numbers. As will be evident later, if one vertex is placed on each processor (an activity of theoretical interest only) then our method will degenerate to one of the well known random types of algorithms.

We define a function mpivs(vertex_set), an acronym for maximum processor in vertex set, which operates on a list of vertices.

Definition:
    mpivs(vertex_set) = max{v.proc | v ∈ vertex_set, v.state ≠ deleted},
taken to be smaller than any processor number if this set is empty (e.g., if vertex_set = ∅ or all of its vertices have been deleted).

Given these definitions and operators, our algorithm works by implementing two rules within the BMA running on processor p, as shown below.

Rule 1: Processor p can select a vertex v only if v.proc = p.

Rule 2: Processor p can select a vertex v only if p ≥ mpivs(v.adjac).

Note that Rule 1 is a static rule, because v.proc is immutable, and can be enforced simply by iterating over V_p on each processor p when looking for vertices to select. In contrast, Rule 2 is dynamic, because the result of mpivs(v.adjac) will in general change (actually, monotonically decrease) as the algorithm progresses and vertices in v.adjac are deleted.
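A compact sketch of these two rules in C++ follows (assumed, not the thesis code); mpivs returns the maximum processor rank over the undeleted vertices of a list, with -1 used as a sentinel smaller than any valid rank.

#include <vector>

enum class State { undone, selected, deleted };

struct Vertex {
  State state = State::undone;
  int   proc  = 0;                       // owning processor rank
  std::vector<const Vertex*> adjac;      // adjacent vertices
};

// Maximum processor rank among the undeleted vertices of vertex_set,
// or -1 if none remain (sentinel chosen below any valid rank).
int mpivs(const std::vector<const Vertex*>& vertex_set) {
  int maxProc = -1;
  for (const Vertex* v : vertex_set)
    if (v->state != State::deleted && v->proc > maxProc) maxProc = v->proc;
  return maxProc;
}

// Processor p may select v only if it owns v (Rule 1) and no undeleted
// neighbor of v lives on a higher-ranked processor (Rule 2).
bool selectable(const Vertex& v, int p) {
  return v.proc == p && p >= mpivs(v.adjac);
}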

Shared memory algorithm

Our shared memory MIS algorithm (SMMA), shown in the figure below, can be written as a simple modification to BMA. We have modified the vertex set that the algorithm running on processor p uses to look for vertices to select, so as to implement Rule 1, and we have embedded the basic

while {v ∈ V_p | v.state = undone} ≠ ∅
    forall v ∈ V_p                                  (implementation of Rule 1)
        if v.state = undone then
            if p ≥ mpivs(v.adjac) then              (implementation of Rule 2)
                v.state ← selected
                forall v1 ∈ v.adjac
                    v1.state ← deleted
I ← {v ∈ V | v.state = selected}

Figure: Shared memory MIS algorithm (SMMA) for an MIS, running on processor p.

algorithm in an iterative loop and added a test to decide if processor p can select a vertex, for the implementation of Rule 2. Note that the last line of the figure may delete vertices that have already been deleted, but this is inconsequential.

There is a great deal of flexibility in the order in which vertices are chosen in each iteration of the algorithm. Herein lies a simple opportunity to apply a heuristic, as the first vertex chosen is always selectable and the probability is high that vertices which are chosen early are also selectable. Thus, if an application can identify vertices that are important, then those vertices can be ordered first, so that a less important vertex cannot delete a more important vertex. For example, in the automatic construction of coarse grids for multigrid equation solvers on unstructured meshes, one would like to give priority to the boundary vertices. This is an example of a static heuristic, that is, a ranking which can be calculated initially and does not change as the algorithm progresses. Dynamic heuristics are more difficult to implement efficiently in parallel. An example is the saturation degree ordering (SDO) used in graph coloring algorithms; SDO colors the vertex with the maximum number of differently colored adjacencies, and this degree of an uncolored vertex increases as the algorithm progresses and its neighbors are colored. We know of no MIS application that does not have a quality metric to maximize; thus it is of practical importance that an MIS algorithm can accommodate the use of heuristics effectively. Our method can still implement the forall loops with a serial heuristic, i.e., we can iterate over the vertices in V_p in any order that we like. To incorporate static heuristics globally (i.e., a ranking of vertices) one needs to augment our rules and modify SMMA (see below for details), but in doing so we lose our complexity bounds; in fact, if one assigns a random rank to all vertices, this algorithm degenerates to the random algorithms described in the literature.

To demonstrate the correctness of SMMA we proceed as follows: show termination, show that the computed set I is maximal, and show that independence of I = {v ∈ V | v.state = selected} is an invariant of the algorithm.

Termination is simple to prove, and we do so below for the distributed memory algorithm.

To show that I is maximal, we can simply note that if v.state = deleted for v ∈ V, then v must have a selected vertex v1 ∈ v.adjac, as the only mechanism to delete a vertex is to have a selected neighbor do so. All deleted vertices thus have a selected neighbor; they cannot be added to I and maintain independence, hence I is maximal.

To show that I is always independent, first note that I is initially independent, as I is initially the empty set. Thus it suffices to show that when v is added to I (the selection line of the figure), no v1 ∈ v.adjac is selected. Alternatively, we can show that v.state ≠ deleted at the selection line, since if v cannot be deleted then no v1 ∈ v.adjac can be selected. To show that v.state ≠ deleted at the selection line, we need to test three cases for the processor of a vertex v1 that could delete v.

Case v1.proc < p: v would have blocked v1.proc from selecting v1, because mpivs(v1.adjac) ≥ v.proc = p > v1.proc, so the Rule 2 test would not have been satisfied for v1 on processor v1.proc.

Case v1.proc = p: v1 would have been deleted, and would not have passed the undone test, as this processor selected v and by definition there is only one thread of control on each processor.

Case v1.proc > p: as mpivs(v.adjac) ≥ v1.proc > p, we have p < mpivs(v.adjac); the Rule 2 test would not have succeeded, and the selection line would not be executed on processor p.

Further, we should show that a vertex v with v.state = selected cannot be deleted, and that a vertex with v.state = deleted cannot be selected. For v to have been selected by p it must have been selectable by p, i.e., there is no v1 ∈ v.adjac with v1.proc > p and v1.state ≠ deleted. However, for another processor p1 to delete v, p1 must select some v1 ∈ v.adjac with p1 = v1.proc; this is not possible since, if neither v nor v1 is deleted, only one processor can satisfy the Rule 2 test in the figure. This consistency argument is developed further below. Thus we have shown that I = {v ∈ V | v.state = selected} is an independent set and, if SMMA terminates, I is maximal as well.

Distributed memory algorithm

For a distributed memory version of this algorithm we use a message passing paradigm and define some high level message passing operators: send(proc, X, Action) and receive(X, Action). send(proc, X, Action) sends the object X and procedure Action to processor proc; receive(X, Action) receives this message on processor proc. The figure below shows a distributed memory implementation of our MIS algorithm running on processor p. We have assumed that the graph has been partitioned onto processors 1 to P, thus defining V_p^S, V_p^L, V_p^B, and v.proc for all v ∈ V.

while {v ∈ V_p | v.state = undone} ≠ ∅
    forall v ∈ V_p^B                                (implementation of Rule 1)
        if v.state = undone then
            if p ≥ mpivs(v.adjac) then              (implementation of Rule 2)
                Select(v)
                proc_set ← {proc | v ∈ V_proc^S} − p
                forall proc ∈ proc_set: send(proc, v, Select)
    while receive(v1, Action)
        if v1.state = undone then Action(v1)
    forall v ∈ V_p^L                                (implementation of Rule 1)
        if v.state = undone then
            if p ≥ mpivs(v.adjac) then              (implementation of Rule 2)
                Select(v)
                forall v1 ∈ v.adjac: if v1 ∈ V_p^B then
                    proc_set ← {proc | v1 ∈ V_proc^S} − p
                    forall proc ∈ proc_set: send(proc, v1, Delete)
I ← {v ∈ V | v.state = selected}

Figure: Asynchronous distributed memory MIS algorithm (ADMMA) on processor p.

procedure Select(v):
    v.state ← selected
    forall v1 ∈ v.adjac
        Delete(v1)

procedure Delete(v):
    v.state ← deleted

Figure: ADMMA Action procedures, running on processor p.

A subtle distinction must now be made in our description of the distributed memory version of the algorithm: in the two figures above, vertices (e.g., v and v1) refer to local copies of the objects and not to a single shared object. So, for example, an assignment to v.state refers to assignment to the local copy of v on processor p. Each processor has a copy of an extended set of vertices, i.e., the local vertices V_p plus one layer of ghost vertices. Thus all expressions refer to the objects (vertices) in processor p's local memory.

Note that the value of mpivs(v.adjac) monotonically decreases as vertices v1 ∈ v.adjac are deleted; thus, as all tests to select a vertex are of the form p ≥ mpivs(v.adjac), some processors have to wait for other processors to do their work (i.e., select and delete vertices). In our distributed memory algorithm the communication time is added to the time that a processor may have to wait for work to be done by another processor; this does not affect the correctness of the algorithm, but it may affect the resulting MIS. Thus ADMMA is not a deterministic MIS algorithm, although the synchronous version that we use for our numerical results is deterministic for any given partitioning.

The correctness of ADMMA can be shown in a number of ways, but first we define a weaving monotonic path (WMP) as a path (v1, v2, ..., vt) of length t in which each consecutive pair of vertices (vi, vi+1) ∈ E satisfies vi.proc < vi+1.proc (see the figure below).

Figure: Weaving monotonic path (WMP) in an FE mesh partitioned among processors P1–P8, ordered by processor rank.

One can show that the semantics of ADMMA, run on a partitioning of a graph, are equivalent to a random algorithm with a particular set of random numbers. Alternatively, we can use the correctness argument from the shared memory algorithm and show consistency in a distributed memory environment. To do this, first make a small isomorphic transformation to ADMMA:

1. Remove v.state ← selected from the Select procedure and replace it with a memoization of v, so as to avoid selecting v again.

2. Remove the Select message from ADMMA and modify the Select procedure to send the appropriate Delete messages to processors that touch v1 ∈ v.adjac.

3. Redefine I to be I = {v ∈ V | v.state ≠ deleted} at the end of the algorithm.

4. Change the termination test to: while there exist v ∈ V_p and v1 ∈ v.adjac with v.state ≠ deleted and v1.state ≠ deleted — or simply, while I is not independent.

This does not change the semantics of the algorithm, but it removes the selected state from the algorithm and makes it a mathematically simpler, although less concrete, description.

Now only Delete messages need to be communicated, and v.state ← deleted is the only change of state in the algorithm. Define the directed graph G^WMP = (V^S, E^WMP), where E^WMP contains the separator edges of E oriented from the higher numbered processor to the lower numbered one; in general G^WMP is a forest of acyclic graphs. Further define the local view G_p^WMP = (V_p^S, E_p^WMP), where E_p^WMP contains the edges of E^WMP that touch processor p and whose source vertices have not been deleted; G_p^WMP is thus the current local view of G^WMP with the edges removed for which the source vertices have been deleted. Rule 2 can now be restated: processor p can only select a vertex that is not the end of an edge in E_p^WMP. Processors delete downstream edges in E^WMP and send messages so that other processors can delete their upstream copies of these edges; thus G^WMP is pruned as p deletes vertices and receives delete messages. Informally, consistency for ADMMA can be inferred because the only information flow (explicit delete messages) between processors moves down the acyclic graphs in G^WMP: the test for processor p to select a vertex v — that every v1 ∈ v.adjac satisfies v1.proc ≤ p or v1.state = deleted — requires that all edges in E^WMP into v be deleted. Thus the order of the reception of these delete messages is inconsequential, and there is no opportunity for race conditions or ambiguity in the results of the MIS. More formally, we can show that these semantics ensure that I = {v ∈ V | v.state ≠ deleted} is maximal and independent.

I is independent, as no two adjacent vertices can remain undeleted (and undone) forever. To show this, note that the only way for a processor p1 to be unable to select a vertex v1 is for v1 to have an undone neighbor v2 on a higher processor. If v2 is deleted, then p1 is free to select v1. Vertex v2 on processor p2 can in turn be selected unless it has an undone neighbor v3 on a higher processor, and so on. Eventually the end of this WMP is reached: processor p_t processes v_t, and thus releases p_{t-1} to select v_{t-1}, and so on down the line. Therefore no pair of adjacent undone vertices remains, and I is eventually independent.

I is maximal, as the only way for a vertex to be deleted is to have a selected neighbor.

To show that no vertex v that is selected can ever be deleted, as in our shared memory algorithm, we need to show that three types of processor p1, holding a vertex v1 ∈ v.adjac, cannot delete v:

For p1 = p, we have the correctness of the serial semantics of BMA to ensure correctness, i.e., v1 would already be deleted and p would not attempt to select it.

For p1 < p, p1 will not pass the mpivs(v1.adjac) test, as in the shared memory case.

For p1 > p, p does not pass the mpivs(v.adjac) test and will not select v in the first place.

Thus I is maximal and independent.

Complexity of the asynchronous maximal independent set algorithm

In this section we derive the complexity bound of our algorithm under the PRAM computational model. To understand the costs of our algorithm we need to bound the cost of each outer iteration, as well as bound the total number of outer iterations. To do this we first make some restrictions on the graphs that we work with and the partitions that we use. We assume that our graphs come from physical models, that is, vertices are only connected by an edge to their nearest neighbors, so the maximum degree of any vertex is bounded. We also assume that our partitions satisfy a certain criterion; for regular meshes we can illustrate this criterion with regular rectangular partitions and a minimum logical dimension that depends only on the mesh type. We can bound the cost of each outer iteration by requiring that the sizes of the partitions are independent of the total number of vertices n. Further, we assume that the asynchronous version of the algorithm is made synchronous by including a barrier at the end of the receive while loop in the ADMMA figure, at which point all messages are received and then processed in the next forall loop. This synchronization is required to prevent more than one leg of a WMP from being processed in each outer iteration. We need to show that the work done in each iteration on processor p is of order N_p, N_p = |V_p|. This is achieved if we use O(n) processors and can bound the load balance, i.e., max{N_p}/min{N_p}, of the partitioning.

LEMMA: With the synchronous version of ADMMA, the running time of the PRAM version of one outer iteration is O(1) = O(n/P) if max{N_p}/min{N_p} = O(1).

Proof: We need to bound the number of processors that touch a vertex v, i.e., |v.procs| with v.procs = {proc | v ∈ V_proc^S}. In all cases |v.procs| is clearly bounded by the maximum vertex degree; thus the sum of |v.procs| over the vertices of a partition is of order N_p, and this is an upper bound — a very pessimistic bound — on the number of messages sent by a processor in one iteration of our algorithm. Under the PRAM computational model we can assume that messages are sent between processors in constant time, and thus our communication cost in each iteration is O(1). The computation done in each iteration is again proportional to N_p, bounded by N_p times the maximum degree (the number of vertices times the maximum degree). This is also a very pessimistic bound, which can be gleaned by simply following all the execution paths in the algorithm and successively multiplying by the bounds on all of the loops (the maximum degree and N_p). The running time for each outer iteration is therefore O(1) = O(n/P).

Notice that for regular partitions |v.procs| is bounded by the number of partitions that can meet at a point (four in 2D and eight in 3D), and that for optimal partitions of large meshes the average |v.procs| is small in both 2D and 3D.

The number of outer iterations in the ADMMA figure is a bit trickier to bound. To do this we look at the mechanism by which a vertex fails to be selected.

LEMMA: The running time of ADMMA in the PRAM computational model is bounded by the maximum length weaving monotonic path in G.

Proof: To show that the number of outer iterations is proportional to the maximum length WMP in G, we need to look at the mechanism by which a vertex can fail to be selected in an iteration of our algorithm, and thus potentially require an additional iteration. For a processor p1 to fail to select a vertex v1, v1 must have an undone neighbor v2 on a higher processor p2. For vertex v2 to not be selectable, v2 in turn must have an undone neighbor v3 on a higher processor p3, and so on, up to v_t, the top vertex in the WMP. The vertex v_t at the end of a WMP is processed in the first iteration, as there is nothing to stop v_t.proc from selecting or deleting v_t. Thus in the next iteration the top vertex v_t of the WMP has been either selected or deleted: if v_t was selected then v_{t-1} has been deleted, and the undone WMP (UWMP) — a path of undone vertices in G^WMP — is at most of length t − 2 after one iteration; if v_t was deleted (the worst case) then the UWMP could be of at most length t − 1. After t outer iterations the maximum length UWMP is of length zero, thus all vertices are selected or deleted. Therefore the number of outer iterations is bounded by the longest WMP in the graph.

COROLLARY: ADMMA will terminate.

Proof: Clearly the maximum length of a WMP is bounded by the number of processors P. By the preceding lemma, ADMMA will terminate in a maximum of P outer iterations.

To attain our desired complexity bounds we want to show that a WMP cannot grow longer than a constant. To understand the behavior of this algorithm we begin with a few observations about regular meshes. Begin by looking at a regular partitioning of a 2D finite element quadrilateral mesh. The WMP figure above shows such a mesh and a partitioning with regular blocks of four vertices and a particular processor order. This is just small enough to allow a WMP to traverse the mesh indefinitely, but clearly nine vertex partitions would break this WMP and only allow it to walk around partition intersections. Note that the small-block case would require just the right sequence of events to happen on all processors for this WMP to actually govern the run time of the algorithm. On a regular 3D finite element mesh of hexahedra, the WMP can coil around a line between four processors, and the required partition size, using the same arguments as in the 2D case, would be five vertices on each side, or one more than the number of processors that share a processor interface line.

For irregular meshes one has to look at the mesh partitioning mechanism employed. Partitioners for irregular meshes in scientific and engineering applications generally attempt to reduce the number of edges cut (i.e., |E^S|) and to balance the number of vertices on each partition (i.e., |V_p| ≈ n/P). We assume that such a partitioner is in use and make a few general observations. First, for a large mesh with relatively large partitions the partitioner will tend to produce partitions in the shape of a hexagon in 2D; this is because the partitioner is trying to reduce the surface to volume ratio of each partition. These partitions are not likely to have skinny regions where a WMP could jump through the partition, and thus the WMP is relegated to following the lines of partition intersections. We do not present statistical or theoretical arguments as to the minimum partition size N that must be employed to bound the growth of a WMP for a given partitioning method, though clearly some constant N exists that, for a given finite element mesh type and a given reasonable partitioning method, will bound the maximum WMP length by a constant. This constant is roughly the number of partitions that come close to each other at some point; an optimal partitioning of a large D-dimensional mesh will produce a partitioning in which only a small number of partitions meet at any given point. Thus, when a high quality mesh partitioner is in use, we would expect to see the algorithm terminate in at most four iterations on adequately well partitioned and sized three dimensional finite element meshes.

Numerical results

We present numerical experiments on an IBM SP (Power2SC processors) at Argonne National Laboratory. An extended version of the Finite Element Analysis Program (FEAP) is used to generate our test problems and produce our graphics. We use ParMetis to calculate our partitions and PETSc for our parallel programming and development environment. Our code is implemented in C++; FEAP is implemented in FORTRAN; PETSc and ParMetis are implemented in C. We want to show that our complexity analysis is indicative of the actual behavior of the algorithm with real, imperfect mesh partitioners. Our experiments confirm that our PRAM complexity model is indicative of the performance one can expect with practical partitions on graphs of finite element problems. Due to a lack of processors, we are not able to investigate the asymptotics of our algorithm thoroughly.

Our experiments are used to demonstrate that we do indeed see the behavior that our theory predicts. Additionally, we use numerical experiments to quantify a lower bound on the number of vertices per processor that our algorithm requires before growth in the number of outer iterations is observed. We use a parameterized mesh from solid mechanics for our test problem. This mesh is made of eight vertex hexahedral (trilinear brick) elements and is almost regular; the maximum degree of any vertex in the associated graph is bounded by a modest constant. The figure below shows one of these meshes. The other meshes that we test are of the same physical model but with different scales of discretization; this problem is referred to again in later chapters.

Figure: A 3D finite element test mesh.

We add synchronization to ADMMA, on each processor, by receiving all messages from neighboring processors in each iteration, to more conveniently measure the maximum length WMP that actually governs the number of outer iterations. The table below shows the number of iterations required to calculate the MIS. Each case was run several times, as we do not believe that ParMetis is deterministic, but all iteration counts were identical, so this did not affect any of our results. A perfect partitioning of a large D-dimensional mesh with a large number of vertices per processor results in 2^D processors intersecting at a point and 2^(D−1) partitions sharing a line. If these partitions are optimal, we can expect that these lines of partition boundaries are of approximately uniform length. The length of these lines required to halt the growth of WMPs is five vertices on an edge, as discussed above. If the approximate average size of each partition is that of a cube with this required edge length, then we would need about 5^3 = 125 vertices per partition to keep the length of a WMP from growing. This assumes that we have a perfect mesh, which we do not, but nonetheless this analysis gives an approximate lower bound on the number of vertices that we need per processor to maintain our constant maximum WMP length.

Table: Average number of iterations to compute the MIS, by number of processors and number of vertices.

The figure below shows a graphic representation of this data for all partitions. The growth in iteration count for constant graph size is reminiscent of the polylogarithmic complexity of flat, or vertex based, random MIS algorithms. Although ParMetis does not specify the ordering of processors, it is not likely to be very random. These results show the largest average number of vertices per processor that broke the estimate of our algorithm's complexity bound, and the smallest average number of vertices per processor that stayed at our bound on the iteration count. To demonstrate our claim of O(1) PRAM complexity we only require that there exists a bound N on the number of vertices per processor that is required to keep a WMP from growing beyond the region around a point where processors intersect; these experiments do not show any indication that such an N does not exist. Additionally, these experiments show that our bounds are quite pessimistic for the number of processors that we were able to use. This data suggests that we are far away from the asymptotics of this algorithm; that is, we need many more processors to have enough of the longest WMPs so that one consistently governs the number of outer iterations.

Figure: Average number of iterations vs. number of processors and number of vertices.

Maximal independent set heuristics

This section discusses heuristics useful in optimizing the quality of multigrid restriction operators. These methods use coordinate data, available in all finite element simulations, and we also employ some element data: element type (e.g., tetrahedra, quadrilateral shell, etc.) and element connectivity. We use this data to categorize topological features of the finite element mesh, and we use this information to modify the graph used in the MIS algorithm so as to improve solver performance. We also show how these heuristics can be applied globally on parallel platforms, as well as a simple method to get the coarse grids to more effectively cover the fine grids.

Automatic coarse grid creation with unstructured meshes

This section introduces the components that we use for the automatic construction of coarse grids on unstructured meshes, but first we state what we want our coarse grids to be able to do. The goal of the coarse grids in multigrid is to approximate the low frequency error in the current grid. Each successive grid's finite element function space should, with a drastically reduced vertex set, approximate as best it can the low end of the spectrum (eigenfunctions) of the current grid. That is, with only a small fraction of the vertices, it is natural to expect that one could only represent a correspondingly low part of the fine grid spectrum well. The coarse grid functions should approximate the highest part of this lower part of the spectrum as well as possible. It is not possible to satisfy this criterion directly on unstructured grids, but a natural heuristic is to represent the geometry as well as possible with a much smaller set of vertices. One promising approach is to use computational geometry techniques to characterize features and maintain them on the coarser grids. One popular method is to use a maximal independent set as a heuristic to evenly coarsen the vertex set, as discussed in the last section. If vertices are added to the MIS randomly, then the MIS is expected to be a good representation of the fine grid, in the sense of evenly coarsening the grid points and maintaining the feature characteristics of the mesh. An MIS is not unique in general, and an arbitrary MIS is not likely to perform well, as we show below; thus we use heuristics to improve performance.

We motivate our approach by first looking at the structured multigrid algorithm. We can characterize the behavior of multigrid on structured meshes, as shown in the figure below, as: select every other vertex, starting from the boundary, in each dimension, for use in the coarse grid.

Figure: Multigrid coarse vertex set selection on structured meshes. A 9 by 9 grid of points (7 by 7 grid of unknowns) coarsens to a 5 by 5 grid of points (3 by 3 grid of unknowns), which coarsens to a 3 by 3 grid of points (1 by 1 grid of unknowns); the points labeled 2 and 1 are part of the next coarser grids.

To apply multigrid to unstructured meshes it is natural to try to imitate the behavior of the structured algorithm, in hopes of imitating its success. Consider that, in addition to evenly coarsening the vertex set, the coarse grids in the figure above also emphasize the boundaries. One description of multigrid meshes on regular grids is: place each vertex v in a topological category of dimension d — for instance corners (d = 0), edges (d = 1), surfaces (d = 2), and interiors (d = 3). Note that we overload "edges" here to mean a topological feature and not a graph edge; the type of edge should be obvious from the context in the following discussion. Given these categories, we have collections of features for each category; e.g., a set of connected surface vertices bounded by edge vertices would be one face in the set of faces in the problem. Notice that the regular mesh in the figure above produces an MIS within each feature. This section discusses algorithms to implement these observations and provides numerical experiments on model problems in linear elasticity. Note that Guillard, and Chan and Smith, also use simple heuristics to preserve boundaries and emphasize corners.

Maximal independent set algorithms

Recall the basic greedy MIS algorithm, shown again in the figure below, which we discussed earlier.

forall v ∈ V
    if v.state = undone then
        v.state ← selected
        forall v1 ∈ v.adjac
            v1.state ← deleted
I ← {v ∈ V | v.state = selected}

Figure: Basic MIS algorithm for the serial construction of an MIS.

There is a great deal of flexibility in the order in which vertices are chosen in each iteration of the algorithm. Herein lies a simple opportunity to apply a heuristic, as the first vertex chosen is always selectable and the probability is high that vertices which are chosen early are also selectable. Thus, if an application can identify vertices that are important, then those vertices can be ordered first, so that a less important vertex cannot delete a more important vertex. We can now decide that corners are more important than edges, edges are more important than surfaces, and so on, and order the vertices with all corners first, then edges, etc. With this heuristic in place, and the basic MIS algorithm of the figure above, we can guarantee that the number of edge vertices on the coarse grid in each edge segment, |V_edge^coarse|, is at least a fixed fraction of |V_edge^fine| for 3D meshes, whereas a valid MIS could remove all edge and corner vertices from the graph — which would be disastrous, as is shown below.

Parallel maximal independent set algorithms

The order in which each processor traverses the local vertex list can be governed by our heuristics, although the global application of a heuristic requires an alteration of the MIS algorithm described above. Our partition based MIS algorithm requires that each vertex v be given an immutable data member v.topo; in the MIS algorithm, processor p can select a vertex v only if, for every undeleted v1 ∈ v.adjac,

    v.topo > v1.topo, or (v.topo = v1.topo and v.proc ≥ v1.proc).

This replaces the Rule 2 test expression in the ADMMA figure.
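A sketch of this modified selectability test in C++ follows (assumed, not the thesis code); the encoding of v.topo as an integer rank in which a larger value means a more important topological category (corner > edge > surface > interior) is an assumption used for illustration.

#include <vector>

enum class State { undone, selected, deleted };

struct TopoVertex {
  State state = State::undone;
  int   proc  = 0;    // owning processor rank
  int   topo  = 0;    // topological rank: larger = more important (assumed encoding)
  std::vector<const TopoVertex*> adjac;
};

// Processor p may select v only if p owns v and no undeleted neighbor
// outranks v (by topological category, with processor rank breaking ties).
bool selectableWithTopo(const TopoVertex& v, int p) {
  if (v.proc != p) return false;                     // Rule 1: p must own v
  for (const TopoVertex* w : v.adjac) {
    if (w->state == State::deleted) continue;        // deleted neighbors never block
    bool ok = (v.topo > w->topo) ||
              (v.topo == w->topo && v.proc >= w->proc);
    if (!ok) return false;                           // blocked by this neighbor
  }
  return true;
}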

Topological classification of vertices in finite element meshes

Our methods are motivated by the intuition that the coarse grids of multigrid methods must represent the boundary of the domain well in order to approximate the function space of the fine mesh well, which is necessary for multigrid methods to be effective. Intuitively, this can be done by emphasizing the vertices that define the domain. Note that we define a domain in a slightly nonstandard way, to mean a contiguous region of the real finite element domain with a particular material property; thus, for our discussion, the boundaries of the PDE proper are augmented with boundaries between different material types. If the domain is convex, then a convex hull is useful in reasoning about how best to classify the vertices. Vertices that are on the convex hull should be given more emphasis than those that are not (i.e., interior vertices). Vertices that are required to define the convex hull are likewise more important than vertices that simply lie on the convex hull. Domains of interest are by no means necessarily convex, but the idea of emphasizing vertices by their contribution to defining the boundary of a domain is useful in the exposition of our methods.

The first type of classification of vertices is to find the exterior vertices; if continuum elements are used then this classification is trivial. For non-continuum elements, like plates, shells and beams, heuristics such as minimum degree could be used to find an approximation to the exterior vertices, or a combination of mesh partitioners and convex hull algorithms could be used. For the rest of this section we assume that continuum elements are used, so that a boundary of the domains, represented by a list of facets (2D polygons), can be defined. The exterior vertices give us our first vertex classification from the last section; interior vertices are vertices that are not exterior vertices. Exterior vertices require further classification, but first we need a method to automatically identify faces in our finite element problems.

A simple face identification algorithm

To describe our algorithm we assume that a list of facets, facet_list, has been created of the boundaries of the finite element mesh. Assume that each facet f ∈ facet_list has calculated its unit normal vector f.norm, and that each facet f has a list of facets f.adjac that are adjacent to it. With these data structures, and a list with AddTail and RemoveHead functions with the obvious meaning, we can calculate a face_ID for each facet with the algorithm shown in the figure below. This algorithm simply repeats a breadth first search of trees rooted at an arbitrary undone facet, pruned by the requirement that a minimum angle (arccos TOL) be maintained by all facets in the tree relative to the root.

forall f ∈ facet_list: f.face_ID ← 0
Current_ID ← 0
forall f ∈ facet_list
    if f.face_ID = 0
        Current_ID ← Current_ID + 1
        list ← {f}
        norm ← f.norm
        while list ≠ ∅
            f ← list.RemoveHead()
            f.face_ID ← Current_ID
            forall f1 ∈ f.adjac                     (TOL is a user selected tolerance)
                if norm^T · f1.norm > TOL and f1.face_ID = 0
                    list.AddTail(f1)

Figure: Face identification algorithm.

This heuristic is a simple way to identify faces — manifolds of the mesh boundary that are somewhat flat.
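For concreteness, the following is a compact C++ sketch of this face identification pass (assumed, not the thesis code), with each facet carrying a unit normal, an adjacency list, and a face id.

#include <array>
#include <deque>
#include <vector>

struct Facet {
  std::array<double, 3> norm{};    // unit outward normal
  std::vector<Facet*> adjac;       // facets sharing an edge with this facet
  int face_id = 0;                 // 0 means not yet assigned
};

static double dot(const std::array<double, 3>& a, const std::array<double, 3>& b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Breadth-first search rooted at each unassigned facet, pruned so that every
// facet in the tree stays within arccos(TOL) of the root facet's normal.
void identifyFaces(std::vector<Facet>& facets, double TOL) {
  int current_id = 0;
  for (Facet& root : facets) {
    if (root.face_id != 0) continue;
    ++current_id;
    const std::array<double, 3> norm = root.norm;   // the root normal
    root.face_id = current_id;
    std::deque<Facet*> list{&root};
    while (!list.empty()) {
      Facet* f = list.front();
      list.pop_front();
      for (Facet* g : f->adjac)
        if (g->face_id == 0 && dot(norm, g->norm) > TOL) {
          g->face_id = current_id;                  // claim g for this face
          list.push_back(g);
        }
    }
  }
}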

These faces are useful for two reasons:

1. Topological categories for vertices, used in the heuristics above, can be inferred from these faces:
   - a vertex attached to only one face is in the interior of a surface;
   - a vertex attached to two different faces is in the interior of an edge;
   - a vertex attached to more than two different faces is a corner.

2. Vertices not associated with the same faces should not interact with each other in the MIS algorithm.

This second criterion is discussed in the next section.

Modified maximal independent set algorithm

We now have all of the pieces that we need to describe the core of our method. First, we classify vertices and ensure that a vertex of lower rank does not suppress a vertex of higher rank; this was done with a slight modification to the standard MIS algorithms above. Second, we want to maintain the integrity of the faces in the original problem as best we can. The motivation for this second criterion can be seen in the figure below.

Figure: Poor MIS for multigrid of a shell (fine grid, coarse grid, deleted vertices, and selected vertices shown).

If the finite element mesh has a thin region, then the MIS as described above can easily fail to maintain a cover of the vertices in the fine mesh. This comes from the ability of the vertices on one face to decimate the vertices on an opposing face, as shown in the figure. This phenomenon could be mitigated by randomizing the order in which the vertices are added to the MIS, at least within a vertex type. But randomization is not good enough, as these skinny regions tend to lower the convergence rate of iterative solvers, so we need to do something better.

A simple fix for this problem is to modify the graph to which we apply the MIS algorithm: we want to maintain the same vertices in the graph but will reduce the edge set. To avoid the problems illustrated in the figure, we can look at our method of classifying vertices again:

- a vertex attached to only one face is in the interior of a surface;
- a vertex attached to two different faces is in the interior of an edge;
- a vertex attached to more than two different faces is a corner.

Now we claim that by removing all edges between vertices that are not attached to a common face, we force the MIS to be a more logical and economical (in terms of solver performance) representative of the fine mesh, as shown in the figure below. For instance, we do not want corners to delete an edge vertex with which they do not share an exterior facet. Another example is that we do not want edge vertices to delete edge vertices with which they do not share a face; this is the most effective heuristic in this example. And finally, we augment our heuristic and do not allow corners to be deleted at all; this could be problematic on some meshes that have many initial corners (as defined by our algorithm), and a reclassification of the remeshed vertices on coarse meshes is advisable.

Figure: Original and fully modified graph.

We are now free to run our MIS algorithm on this modified graph. The figure below shows an example of a possible MIS and remeshing.

Figure: MIS and coarse mesh (fine grid and coarse grid).

Vertex ordering in the MIS algorithm on modified finite element graphs

An additional degree of freedom in this algorithm, as described thus far, is the order of the vertices within each category. Thus far we have implicitly ordered the vertices by topological category; the ordering within each category can also be specified. Two simple heuristics can be used to order the vertices: a random order or a natural order. Meshes may be initially ordered in a block regular order (i.e., as an assemblage of logically regular blocks), though this depends on the mesh generation method used. Initial vertex orders can also be cache optimizing orders like Cuthill-McKee. Both of these ordering types are what we call natural orders, and we assume that the initial order of our mesh is of this type (if not, then we can make it so). The MISs produced from natural orderings tend to be rather dense; random orderings, on the other hand, tend to be more sparse. That is, the MISs with natural orderings tend to be larger than those produced with random orders. Note that for a uniform 3D hexahedral mesh the asymptotic size of the MIS is bounded from above by 1/8 of the vertices and from below by 1/27, as the largest MIS will pick every second vertex in each dimension and the smallest MIS will select every third vertex; natural and random orderings are simple heuristics to approach these bounds.

Small MISs are preferable, as this means that there is less work to be done on the coarser mesh; also, fewer levels are required before the coarsest grid is small enough to solve directly. But care has to be taken not to degrade the convergence rate of the solver by compromising the quality of the coarse grid representation. In particular, as the boundaries are important to the coarse grid representation, it may be advisable to use a natural ordering for the exterior vertices and a random ordering for the interior vertices; we use this approach in many of our numerical experiments.

Meshing of the vertex set on the coarse grid

The vertex set for the coarse grid remains to be meshed; this is necessary in order to apply finite element shape functions to calculate the restriction operator. We use a standard Delaunay meshing algorithm to give us these meshes. This is done by putting the mesh inside of a bounding box (thus adding dummy vertices to the coarse grid set) and then meshing this to produce a mesh that covers all fine grid vertices. The tetrahedra attached to the bounding box vertices are removed, and the fine grid vertices within these deleted tetrahedra are added to a list of lost vertices, lost_list.

With this body we continue to remove tetrahedra from the mesh that connect vertices that were not near each other on the fine mesh (recall the vertex sets are still nested) and that do not have any vertices that lie uniquely within the tetrahedron. Define a vertex v to lie uniquely in a tetrahedron if v lies completely within the tetrahedron (and not on its surface), or there is no adjacent tetrahedron to which v can be added. More precisely, if a vertex's shape function values are all larger than some small tolerance (we use only linear shape functions), or there is not an adjacent tetrahedron that can accept the vertex, then that tetrahedron is deemed necessary and is not removed. We also use a more aggressive phase in which we use a more negative (though still small) tolerance to try to remove more tetrahedra; the orphaned vertices are added to the lost_list. The resolution of the vertices in the lost_list is discussed below.

Coarse grid cover of the fine grid

The final optimization that we would like to employ is to improve the cover of the coarse mesh on the fine vertex set. With the coarse grids constructed, the interpolation operators are calculated by evaluating standard finite element shape functions of the element with which each fine grid vertex is associated. Each fine grid vertex is associated with an element on the coarse grid: the element that covers the fine grid vertex. In general, however, some fine grid vertices (the lost_list from the previous section) will fail to be covered by the coarse grid, as shown in the earlier figure. This problem can be solved in one of two ways: find a nearby element and use it (thus extrapolate), or move the vertices on the coarse grid so as to cover all fine grid vertices. We can use the extrapolation of an element that does not cover a fine mesh point; the extrapolation values will simply not all be between zero and one. Intuition tells us, however, that interpolation is better than extrapolation. Alternatively, one can move the coarse grid vertex positions to cover the fine vertices in lost_list.

The optimal coarse grid vertex positions, or an approximation to them, could perhaps be constructed with the use of interpolation theory to provide cost functions, and linear or nonlinear programming. We have instead opted for a simple greedy algorithm that iteratively traverses the exterior vertices of the coarse mesh and applies a simple procedure to try to cover the uncovered vertices that are near each of them. First we define a list ext_neigh_c for each coarse grid vertex c that contains all of the vertices attached to all of the exterior facets to which c is attached. We define a list lost_list_c for each coarse grid vertex c and put the vertices v in lost_list into the list of the coarse grid vertex to which v is closest; lost_list_c is then expanded to include the vertices in lost_list_i for each vertex i ∈ ext_neigh_c. Given a maximum number of outer iterations M, a maximum distance tol_max that a vertex can be moved, and a larger tolerance tol_delete to prune the list of potential coarse grid vertices that can help to cover a fine grid vertex, the algorithm is as follows.

do M times: forall c on the exterior of the coarse grid:

1. Calculate a vector ν: the weighted average of the outward normals of the facets connected to the coarse grid vertex c (c.facet_list), weighted by the facet areas. The normal of each facet that does not have a positive inner product with this vector is added to it, until {f ∈ c.facet_list | f.norm^T ν ≤ 0} = ∅; ν is then normalized to unit length.

2. forall facets (a, b, c) ∈ c.facet_list, forall v ∈ lost_list_c:
   Solve for the coefficients (ξ1, ξ2, ξ3) and the offset δ in
       ξ1 (ax, ay, az) + ξ2 (bx, by, bz) + ξ3 (cx + δνx, cy + δνy, cz + δνz) = (vx, vy, vz),  with ξ1 + ξ2 + ξ3 = 1,
   i.e., the distance δ that c would have to move along ν for this facet to reach v.
   - if δ exceeds tol_delete, remove v from lost_list_c (moving c cannot reasonably help this vertex);
   - else if δ exceeds tol_max, limit the move to tol_max;
   - else, if the facet coefficients indicate that v lies over this facet and the required move is positive, update c.coord ← c.coord + δ · ν.

3. Recalculate the shape functions for all fine grid vertices that are within an element connected to c (c.elems). For all e ∈ c.elems and all v ∈ lost_list_c, calculate the shape function values for v in element e; if all shape values are greater than −ε, for some small number ε, then add v to e and remove v from lost_list_i for all i ∈ ext_neigh_c.

The figure below shows an illustration of what our algorithm might do on our running example.

Figure: Modified coarse grid (fine grid and coarse grid).

The next figure shows an example of our methods applied to a problem in 3D linear elasticity; the fine input mesh is shown with the three coarse grids used in the solution.

Figure: Fine (input) grid and coarse grids for a problem in 3D elasticity.

Numerical results

To demonstrate the effectiveness of the methods that we have discussed, we use three test problems in linear elasticity, shown in the figure below. The problems are chosen to exercise the primary problem features that we have tried to accommodate: material coefficient discontinuities, thin shell-like features, and curved surfaces. The first problem is a hard sphere (high Young's modulus E) encased in a soft, rubber-like material (much smaller E). The second problem is a steel beam-column connection made of thin, poorly proportioned elements. The third problem is a tube, fixed at one end and loaded at the other end like a cantilever; a thin slit of the rubber-like material of the first problem runs down the length of one side of the tube. The numbers of equations in the three problems are listed in the table below. All problems use eight vertex trilinear brick elements; the hard material is a standard displacement element and the soft material is a mixed formulation.

Figure: Test problems from linear elasticity: sphere, beam-column connection, and tube.

Each problem is solved with a conjugate gradient (CG) solver to a fixed relative tolerance. Full multigrid is used for the preconditioner; the smoother is CG preconditioned by block Jacobi, and the number of blocks is reduced by a factor of eight for each successive level. All problems used three coarse grids, so that the top grid (solved directly) was a few hundred equations in size. We present numerical experiments on an IBM SP (Power2SC processors). The Finite Element Analysis Program (FEAP) is used to generate our test problems and produce our graphics. We use ParMetis to calculate our partitions and PETSc for our parallel programming and development environment. Our code is implemented in C++; FEAP is implemented in FORTRAN; PETSc and ParMetis are implemented in C.


To demonstrate the effectiveness of our methods we run these test problems with:

1. pure randomized greedy MIS (no optimizations);
2. pure MIS with the heuristic of exterior vertices ordered first and interior vertices ordered last, and randomly;
3. our modified MIS, without the vertex cover heuristics;
4. all of our optimizations.

The solution times and the iteration counts are shown in the table below.

Table: Solve time in seconds (number of iterations).

Problem                                                        Sphere   Beam-column   Tube
Number of equations
Condition number K(A) of the matrix
Number of pre and post smoother applications
Number of blocks in Jacobi smoother (fine grid)
Pure MIS (no optimizations)
Pure MIS w/ exterior ordered first, interior last and random
Modified MIS w/o vertex cover heuristics
Modified MIS w/ vertex cover heuristics (all optimizations)
Matrix vector product time on the fine grid (sec)

These experiments show that our methods provide significant improvement over a random MIS, especially on complex domains. Moving coarse vertices to cover fine ones did not help on meshes that do not have curved surfaces, as is expected, and provides some improvement on meshes with curved surfaces. The vertex orderings within each category, when applicable, affect these results a small amount; particularly on the pure MIS data we see some variance with different randomization schemes. Thus one should consider the standard deviation for this data to be rather high, but we have consistently observed the dramatic benefit of the modified MIS that is reflected in this data. Thus we feel that our modified MIS heuristics are effective, especially on domains with complex geometry.

Mesh generation

Mesh generators have been developed using three basic methods: advancing front, Delaunay, and octree. We use a simple Delaunay method with Watson's insertion algorithm. This method proceeds by trivially meshing a dummy envelope of vertices and then inserting each vertex into the existing mesh; the algorithm invariant is that the mesh remain a valid Delaunay mesh after each vertex insertion. The method is not highly optimal, nor parallelizable, but it is simple to implement; moreover, we do not need a parallel Delaunay mesher, the design of which is an open problem.

Additionally, the Delaunay tessellation is only well defined on vertex sets with no nontrivial coplanar or cospherical point sets; most algorithms work on these non-general-position point sets, although exact arithmetic is required for robustness. Delaunay methods in 3D require the exact evaluation of a 4 by 4 determinant, although only its sign (negative, zero, or positive) is required.
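For reference, one standard form of this in-sphere predicate (a standard formulation stated here as an aside, not quoted from the text) tests whether a point e lies inside the circumsphere of the tetrahedron (a, b, c, d) by the sign of

\[
\det\begin{pmatrix}
a_x-e_x & a_y-e_y & a_z-e_z & (a_x-e_x)^2+(a_y-e_y)^2+(a_z-e_z)^2\\
b_x-e_x & b_y-e_y & b_z-e_z & (b_x-e_x)^2+(b_y-e_y)^2+(b_z-e_z)^2\\
c_x-e_x & c_y-e_y & c_z-e_z & (c_x-e_x)^2+(c_y-e_y)^2+(c_z-e_z)^2\\
d_x-e_x & d_y-e_y & d_z-e_z & (d_x-e_x)^2+(d_y-e_y)^2+(d_z-e_z)^2
\end{pmatrix},
\]

where the sign is interpreted relative to the orientation of (a, b, c, d).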

We use a very efficient method of adaptive precision arithmetic that is specialized for the numerical predicates used in Delaunay tessellations. This method achieves its efficiency by first unrolling the loops in the determinant evaluation. After the result has been calculated, and intermediate results have been judiciously saved, a backward error analysis is performed; if the results of this analysis suggest that the sign of the answer cannot be trusted, then this process is repeated back through the intermediate levels until the values of the intermediate results can be trusted, and exact arithmetic is used for the rest of the calculation. Note that if a set of input points is in general position, and plain double precision arithmetic is robust, then the only additional cost in this method is the one backward error analysis; this is a small time penalty. If the points are cospherical then the determinant is zero and the calculation must be done completely in exact arithmetic; otherwise this method allows as many of the earlier terms as possible to be reused.

Finite element shape functions

After the coarse meshes are created, one can simply use standard finite element shape functions for the restriction and interpolation operators. Linear, bilinear, and trilinear finite element shape functions are affine mappings of a point in space with respect to the vertices of a polytope, or element. That is, given a point in space, we would like to express its coordinate as a weighted sum of the coordinates of the element vertices. This in effect requires that the element vertices be used to construct a shape function, which has the very intuitive meaning of the assumed shape of the element given the displacements of its vertices. These shape functions are in general a partition of unity, so that any constant function can be represented exactly. Note that displacements are the vertex variables for displacement based finite element methods; however, the vertex variable may in general be any physical quantity (e.g., coordinates, velocities, temperature, pressure, flux, or electromagnetic field).

One needs to find the coarse element that contains each fine grid vertex j. The finite element shape function for that element is evaluated at the location of the fine grid vertex, for each vertex I of the coarse grid element containing j, to compute the entry R_Ij in the restriction operator. Note that in general one needs to search the coarse grid elements to find the place for each fine grid vertex; this can be done in O(log n) time with an optimal searching algorithm, but since our coarse vertices are created from the fine ones it is possible for each vertex to find its parent in constant time.
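As a concrete illustration (a minimal sketch, assuming linear tetrahedral coarse elements; not the thesis code), the entries R_Ij for a fine vertex j are the barycentric shape function values of the containing coarse tetrahedron evaluated at j's coordinates:

#include <array>

using Vec3 = std::array<double, 3>;

static Vec3 sub(const Vec3& a, const Vec3& b) {
  return {a[0] - b[0], a[1] - b[1], a[2] - b[2]};
}

// Scalar triple product a . (b x c) = det[a b c].
static double det3(const Vec3& a, const Vec3& b, const Vec3& c) {
  return a[0] * (b[1] * c[2] - b[2] * c[1])
       - a[1] * (b[0] * c[2] - b[2] * c[0])
       + a[2] * (b[0] * c[1] - b[1] * c[0]);
}

// Linear shape function values N[0..3] of the tetrahedron (x0,x1,x2,x3) at
// point p, found by Cramer's rule; N[I] is the weight of coarse vertex I for
// the fine vertex at p, i.e., the restriction entry R_{I,j}. Values outside
// [0,1] indicate that p is not covered by this element (extrapolation).
std::array<double, 4> tetShape(const Vec3& x0, const Vec3& x1,
                               const Vec3& x2, const Vec3& x3, const Vec3& p) {
  const Vec3 e1 = sub(x1, x0), e2 = sub(x2, x0), e3 = sub(x3, x0), r = sub(p, x0);
  const double D  = det3(e1, e2, e3);         // 6 * signed volume of the tetrahedron
  const double n1 = det3(r, e2, e3) / D;
  const double n2 = det3(e1, r, e3) / D;
  const double n3 = det3(e1, e2, r) / D;
  return {1.0 - n1 - n2 - n3, n1, n2, n3};    // partition of unity
}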

Galerkin construction of coarse grid operators

The Galerkin construction of the coarse grids requires that three sparse matrices be multiplied together. In indicial notation (i.e., with the sum symbols omitted) we have

    A^coarse_IJ = R_Ii A^fine_ij R_Jj.

This construction takes a few times as many floating point operations (flops) for hexahedral meshes as a matrix vector product on the finer grid, although achieving optimal performance is difficult, as there are no sparse libraries that currently support a triple product. We have implemented a distributed memory sparse matrix triple product that operates on PETSc matrices. We are only able to get a fraction of the floating point performance that we get in the multigrid iterations on one processor, but this matrix triple product is a small (though not insignificant) part of the solve, and it scales reasonably well. The reasons for this lack of performance are threefold:

1. We have four sparse objects in the matrix triple product and only one in the matrix vector product; this requires much more indirect addressing.

2. Communication in the matrix triple product requires that a matrix be communicated, instead of a vector as in the matrix vector product; matrices are much heavier objects than vectors (i.e., they are 2D sparse arrays whereas vectors are 1D dense arrays).

3. We implemented this ourselves and, as the performance of the matrix triple product is not the object of this research and it is not a bottleneck in the code, we have devoted a limited (although not insignificant) amount of development effort to the task.

That said, we can briefly discuss our algorithm and our reasons for the decisions that we made. We first assume that the matrices are ordered in block rows with a compressed sparse block row storage format; that is, each processor owns a contiguous logical block of rows of the matrix. This is standard practice and is the format used by PETSc.

For example, for a matrix A from an almost regular block of hexahedral elements, the restriction matrix R has far fewer rows than columns and far fewer nonzeros than A, and the coarse matrix is smaller still. From this example it is clear that the fine grid stiffness matrix is much larger than all of the other matrices combined; thus we wish to begin with the fine grid matrix when optimizing for cache performance. This implies that the first two loops of our iteration in the figure below run sequentially through the fine grid matrix. This decision now dictates that we access matrix entries in R that are stored on another processor, and also that we accumulate off processor data into the product matrix A^coarse. Therefore some parts of R are replicated, and communication is required to accumulate the results, but this is more attractive than communicating and duplicating parts of the fine grid matrix.

The figure below does not show many optimizations that are done in practice, but is intended to show the details of the computation and the structure of these matrices. Note that pRowStart and pRowEnd are the first and last rows on processor p.

for i = pRowStart to pRowEnd
    forall j ∈ i.columns                       (i.columns is the set of adjacencies for vertex i)
        for II = 1, ..., number of vertices in i's coarse element
            I ← i.element.vertex(II).index
            shape_I ← i.element.vertex(II).shape(i.coord)     (shape(c) is the scalar value of the shape function at coordinate c)
            for JJ = 1, ..., number of vertices in j's coarse element
                J ← j.element.vertex(JJ).index
                shape_J ← j.element.vertex(JJ).shape(j.coord)
                A^coarse_IJ ← A^coarse_IJ + shape_I · A^fine_ij · shape_J     (note that A_IJ is an ndf × ndf block)

Figure: Matrix triple product algorithm running on processor p.

Note that this matrix triple product is in general only done once for each fine grid matrix, and so there is no opportunity to amortize the communication costs of the fine grid matrix in order to save costs in the communication of the resultant coarse grid matrix.

Smoothers

Smoothers are simple solvers in and of themselves. Matrix splitting methods such as Jacobi, Gauss-Seidel, and SOR are used frequently, although they are not effective enough for ill-conditioned systems. Jacobi, and its generalization block Jacobi, is however very useful as a smoother in our solver, as we shall see. Another popular preconditioner is a so-called incomplete factorization, which calculates a factorization of A that limits the amount of fill that occurs; this fill is responsible for the non-O(n) complexity in factoring and solving finite element matrices.

We use Krylov subspace smoothers preconditioned with one level domain decomposition methods. We have found that the best smoother preconditioners available in PETSc are diagonal, block Jacobi, and overlapping Schwarz, with block Jacobi being our primary method for harder problems.

As we are using, or at least assuming, unstructured grids, the construction of the subdomains defining the blocks of the block Jacobi smoother is by no means evident. Fortunately a good solution exists: it is natural to want well shaped subdomains in a domain decomposition smoother, so as to capture the lowest energy functions possible with a fixed number of vertices per block, thus improving the constant of the assumption discussed earlier. We can use our standard mesh partitioners to give us good partitions; thus we construct our subdomains with METIS.
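A sketch of this subdomain construction, written against the METIS 5 C interface, follows; the code used for this work predates that API, so the call shown is an assumption about how one would do it today. The nodal graph is given in compressed adjacency (CSR) form, and the routine returns the block index of each vertex.

#include <metis.h>

/* Partition the nodal graph into nblocks well shaped subdomains for the
 * block Jacobi smoother.  Sketch only: assumes the METIS 5 API and a
 * graph given in CSR form (xadj, adjncy).                               */
int make_block_jacobi_subdomains(idx_t nvtxs, idx_t *xadj, idx_t *adjncy,
                                 idx_t nblocks, idx_t *block_of_vertex)
{
  idx_t ncon = 1;                    /* one balance constraint (vertex count) */
  idx_t objval;                      /* edge cut returned by METIS            */
  idx_t options[METIS_NOPTIONS];
  METIS_SetDefaultOptions(options);

  return METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                             NULL, NULL, NULL,          /* no weights         */
                             &nblocks, NULL, NULL, options,
                             &objval, block_of_vertex);
}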

Chapter

Multigrid characteristics on linear problems in solid mechanics

This chapter presents numerical studies that are aimed at providing insight into some of the fundamental challenges in applying iterative methods to the solution of complex problems in solid mechanics, and into some characteristics of iterative solvers in general and multigrid in particular. Recall that the motivation for using iterative methods is to solve large scale problems; we address issues of scale in the following chapters, but first we study the convergence behavior of our solver on some basic classes of problems in linear elasticity.

Introduction

This section addresses the issues of incompressibility, poor aspect ratio elements, poor geometric conditioning, and large jumps in material coefficients via a suite of numerical experiments. But first we shall verify that multigrid works, that is, that for a simple problem in linear elasticity the convergence rate is invariant to the scale, or fineness, of the discretization. We conclude that multigrid is a very promising solver for the challenging finite element problems that are found in many areas of science and engineering.

Multigrid works

Our model problem for this section is a long thin cantilever beam, i.e., all displacements are fully restrained at one end of a long skinny rectangular prism. We use a regular mesh with perfect cubic elements for a cantilever discretized with N elements through the thickness, apply a load at the end, and use a linear elastic displacement based element with a fixed Poisson ratio, shown in the figure below. The load is off axis, that is, the load vectors (all parallel) have significant components in all three directions, with the coordinate axes at the support and the beam extending in the positive direction; this is significant because we are using Krylov subspace methods and the convergence rate can depend on the nature of the applied load. For example, if the applied load were an eigenvector of the stiffness matrix, then any unpreconditioned Krylov subspace method would converge in one iteration.

Figure: Cantilever with uniform mesh and end load.

This problem is meant to demonstrate that multigrid applied to a model problem does indeed converge at a rate independent of the scale of discretization. Actually, we show some superlinear convergence; this is a common phenomenon, i.e., better (finer) meshes lead to faster convergence with multigrid. The table below shows the number of iterations and the condition number of each matrix. The condition number is estimated from below by running CG without preconditioning and calculating the extreme eigenvalues of the projected tridiagonal matrix that CG uses to calculate its approximate answer. This unpreconditioned CG solve was run to a tight relative residual tolerance. The drastic increase in the number of iterations shows that these matrices are indeed getting harder for CG to solve.
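In terms of present-day PETSc calls (the names here are assumptions relative to the PETSc version used in this work), the same lower bound can be obtained by asking the Krylov solver to retain the extreme eigenvalue estimates of its projected system:

#include <petscksp.h>

/* Lower bound on cond(A) for SPD A from an unpreconditioned CG solve:
 * CG's projected (Lanczos) tridiagonal matrix yields extreme eigenvalue
 * estimates.  Error checking omitted for brevity.                       */
static PetscErrorCode estimate_condition(Mat A, Vec b, Vec x)
{
  KSP       ksp;
  PC        pc;
  PetscReal emax, emin;

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetType(ksp, KSPCG);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCNONE);                        /* no preconditioning  */
  KSPSetComputeSingularValues(ksp, PETSC_TRUE); /* keep Lanczos data   */
  KSPSolve(ksp, b, x);
  KSPComputeExtremeSingularValues(ksp, &emax, &emin);
  PetscPrintf(PETSC_COMM_WORLD, "condition estimate >= %g\n",
              (double)(emax / emin));
  return KSPDestroy(&ksp);
}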

Our solver is CG with a full multigrid preconditioner using a diagonally preconditioned CG smoother. We use two iterations in the pre and post smoothers. We declare convergence when the norm of the initial residual has been reduced by the prescribed factor.

Table: Multiple discretizations of a cantilever (columns: N, levels, iterations, dof, condition number, unpreconditioned iterations).

The figure below shows the convergence history of these problems.

Figure: Residual convergence for multiple discretizations of a cantilever (2x2x64, 4x4x128, and 8x8x256 beams); relative residual vs. iteration.

Note another phenomenon that is common to iterative solvers: the residual increases (jumps) during the first few iterations. This jump can sometimes be downward; a general rule of thumb is that the more poorly conditioned the problem, the larger this initial jump is. This section shows that multigrid has a hope of being an effective solver for finite element method matrices. We can see that on a trivial problem multigrid's rate of convergence is indeed nearly independent of the condition number of the matrix, as reflected in finer discretizations; indeed it improves with size. The next section also uses this model problem and shows that multigrid is invariant to changes in material coefficients if the material interfaces are captured on the coarse grids.

Large jumps in material coefficients: soft section cantilever beam

In this section we take the cantilever beam of the last section and add two rows of a soft material in the middle; thus the problem is singular if the soft material has no stiffness. The problem from the last section is used; the soft material has a fixed Poisson ratio, and we parameterize its elastic modulus to range from that of the rest of the mesh down to the point where the matrix is singular to working precision and CG breaks down. We have encouraged a very regular structure in the coarse meshes by using the natural vertex order in the maximal independent set algorithm; additionally, we have placed this soft layer judiciously so as to produce the best coarse grids without changing the mesh's geometric configuration. Thus this is very much an artificial example, and it is simply meant to show that multigrid can indeed work perfectly for 3D elasticity with large jumps in material coefficients if the problem is perfectly partitioned. The table below shows the results in terms of iteration count for these problems, and the figure below shows their convergence history.

Table: Cantilever with soft section (rows: log10(E_soft), iterations, condition number, unpreconditioned iterations).

The important point to notice here is that the slope of the convergence curve is the same for all of the problems. Also notice, in the figure below, the dot plots of the true residual (i.e., the explicit calculation of ||b - A x_k||): the last two problems do not converge to the specified tolerance because of the ill-conditioning caused by these extreme coefficients.
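The true residual plotted as dots is just the explicitly recomputed b - A x_k; with PETSc-style primitives (again, current call names are an assumption) it can be checked as follows.

#include <petscksp.h>

/* Explicitly recompute the true residual norm ||b - A*x||, independently
 * of the recurrence CG uses internally.                                  */
static PetscErrorCode true_residual_norm(Mat A, Vec b, Vec x, PetscReal *nrm)
{
  Vec r;
  VecDuplicate(b, &r);
  MatMult(A, x, r);        /* r = A*x   */
  VecAYPX(r, -1.0, b);     /* r = b - r */
  VecNorm(r, NORM_2, nrm);
  return VecDestroy(&r);
}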

Figure: Residual convergence for a cantilever with a soft section (E_soft = 1.0, 1.0e-2, 1.0e-4, 1.0e-6, 1.0e-8); relative residual (in CG) over the original residual vs. iteration, with dots showing the true residual ||b - A x_k||.

Large jumps in material coefficients: curved material interface

This section demonstrates the effect of not capturing the geometry perfectly in a mixed coefficient problem. The figure below shows a finite element model of a hard sphere included in a soft, somewhat incompressible, material. The difficulty with this problem is that the coarse grids cannot capture the geometry of the fine grids unless all vertices on the curved surface are retained in the coarse grid. This results in fine grid points on the curved surface being interpolated by points in the interior of the soft material; thus points of very different character (stiffness in this case) are polluting each other.

We use a block Jacobi preconditioner for the CG smoother in multigrid, as this is the most effective choice for this problem.

The table below shows the convergence, again with the same residual tolerance, and the figure below shows the convergence history of these problems.

Table: Iterations for the included sphere with soft cover (rows: log10(E_soft), iterations, condition number, unpreconditioned iterations; the condition number and unpreconditioned iteration count are not available (NA) for the two most extreme cases).

Thus, unlike the previous example, where we were able to capture the material interface perfectly on the coarse grids, the convergence rate does degrade as the elastic modulus of the soft material is reduced.

Figure: Residual convergence for the included sphere with soft cover (E_soft = 1.0 down to 1.0e-12); relative residual vs. iteration.

Incompressible materials

This section investigates the effect of incompressible materials on the convergence rate of our solver. We use the same model as in the last section. We parameterize the Poisson ratio ν_soft of the soft material; the ratio of the elastic modulus of the soft material to that of the hard material is held fixed. Three different smoothers are used: conjugate gradients preconditioned with Jacobi, block Jacobi, and overlapping Schwarz (one layer of overlap). The table below shows the number of iterations for each case, as well as the condition number of a smaller version of this problem (see below for more details); note that CG is restarted after a fixed number of iterations for the data in the table. The figure below shows the solve time and convergence history of these problems with each smoother.

Table: Iterations for the included sphere with common preconditioners in the CG smoother (rows: ν_soft; Jacobi; block Jacobi; overlapping additive Schwarz; condition number of the smaller dof version; unpreconditioned iterations).

We can observe many common characteristics of iterative solvers in this data. First, the condition number of the matrix is not effective at predicting the convergence behavior of these problems, as the convergence rate deteriorates drastically whereas the condition number is hardly affected by the higher Poisson ratio. Also, we can see that the more expensive preconditioners become economical on the harder problems only.

The condition number is calculated by computing the entire spectrum with LAPACK's dsyev. The figure below shows a plot of the spectrum for ν_soft ranging from 0.3 to 0.49999 on a smaller dof version of this problem. From this we can see that the Poisson ratio affects only the eigenvalues in the middle of the spectrum, up to the point where the volumetric stiffness of the soft material rivals and then surpasses that of the stiff material. Thus, for all but the most nearly incompressible cases, the condition numbers of these problems are virtually identical.
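A sketch of such a spectrum computation with LAPACK's dsyev, called from C through the Fortran interface (the trailing-underscore binding and the dense matrix passed in are assumptions for illustration); dsyev returns the eigenvalues in ascending order.

#include <stdio.h>
#include <stdlib.h>

/* Fortran LAPACK symmetric eigensolver (eigenvalues only, lower triangle). */
extern void dsyev_(const char *jobz, const char *uplo, const int *n,
                   double *a, const int *lda, double *w,
                   double *work, const int *lwork, int *info);

/* Compute the full spectrum of a small, dense assembled stiffness matrix
 * K (n x n, column major) and report the condition number.               */
int full_spectrum(int n, double *K, double *eigs)
{
  int lwork = 3 * n, info;
  double *work = malloc((size_t)lwork * sizeof(double));
  dsyev_("N", "L", &n, K, &n, eigs, work, &lwork, &info);
  free(work);
  if (info == 0)
    printf("lambda_min = %g  lambda_max = %g  cond = %g\n",
           eigs[0], eigs[n - 1], eigs[n - 1] / eigs[0]);
  return info;
}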

The second observation to be made from this data is that the more powerful preconditioners are only cost effective on the harder problems, with the overlapping Schwarz preconditioner being the most powerful. Note that we are not able to achieve the same relative performance of overlapping Schwarz in parallel, as it requires communication and as we do not have global coarse grids and hence cannot reproduce our serial semantics in parallel. This leads to the conclusion that as the physics of our problems gets more challenging, we need more powerful smoothers for optimal performance.

Figure: Residual convergence for the included sphere with incompressible cover material, for Jacobi, block Jacobi, and overlapping additive Schwarz preconditioned CG smoothers (Poisson ratios 0.3, 0.4, 0.45, 0.49, 0.499), together with solve times vs. Poisson ratio. Note the different iteration scale for overlapping additive Schwarz.

Figure: Spectrum of the smaller dof version of the included sphere with incompressible cover material, for Poisson ratios from 0.3 to 0.49999.

Poorly proportioned elements

This section investigates problems with poorly proportioned elements. It is well known that the use of elements with large aspect ratios severely degrades the performance of iterative solvers. The reasons for this are not well understood; though the condition numbers of the elements do increase, the overall condition number of the matrix does not rise dramatically. For these tests we use a model of a truncated hollow cone, shown in the figure below with the deformed shape and the first principal stress; these elements have large aspect ratios. The total solve was run on one node of a Cray T3E and, measuring cost in units of the time for a fine grid matrix vector product, the entire solve (including the coarse grid matrix vector products) takes the time required to do only a modest number of matrix vector products. The figure below shows the convergence history of this problem. These results show that this method has the potential to be effective on thin body elasticity problems with poor aspect ratio elements. Additionally, in our experience these types of problems benefit greatly from our heuristics described earlier.

Figure: Cantilevered hollow cone, first principal stress and deformed shape.

Figure: Residual convergence for the cone problem (relative residual vs. iteration).

Conclusion

This section has demonstrated many of the characteristics of multigrid solvers for computational mechanics on model problems designed to isolate some of the more common components of problems in solid mechanics. We have shown that multigrid has the potential to be an effective solver for challenging problems in industry and academia. We have identified some of the characteristics of solid mechanics problems that make them hard for iterative solvers, e.g., incompressibility, thin elements, large jumps in material coefficients, and curved surfaces. The next chapter discusses the performance and parallelism issues of multigrid on large scale finite element problems.

Chapter

Parallel architecture and algorithmic issues

This chapter discusses general parallelism issues of finite element codes, architectural details of our parallel finite element implementation, our parallel finite element programming model, and parallel algorithmic issues of large scale parallel finite element implementations.

Introduction

We proceed by discussing parallelism issues in large finite element codes and algebraic multigrid methods, and then describe our parallel finite element design and implementation. We discuss our programming model for the complete parallelization of an existing large finite element code with minimal modification to, and no parallel constructs in, the existing finite element code. We conclude with algorithmic issues associated with optimizing the performance of multilevel solvers on common parallel computers, by developing algorithms for grid agglomeration that are useful in mitigating the inherent parallel inefficiencies of the coarsest grids in multigrid solvers.

Parallel finite element code structure

This section discusses parallelism issues of finite element codes and algebraic multigrid methods. It proceeds as follows: we discuss the effect of the time integrator on the structure of finite element codes, parallelism in finite element codes, graph partitioning, the parallel computer architectures of today, and the need for multiple levels of partitioning on high performance computers, and we conclude by touching on overall solver structure and complexity.

Finite element code structure

Here we review, and the next section will extend, the basic steps of the finite element method as listed earlier. The first four of these steps do not have a significant impact on the code construction, but the fifth step, formulate a time integrator for the PDE, does have a major impact on the code construction and problem characteristics.

One of the distinctions among finite element implementations is whether a code is an implicit code or an explicit code. This is a bit surprising, as the time integrator may not seem to be the most distinguishing aspect of a finite element package. One would think that a fluid, solid, or multiphysics distinction would be more important; one might also think that linear versus nonlinear, or transient versus static, would be a more important way to categorize a code. Yet the implicit or explicit descriptor seems to play a far greater role in the construction of finite element implementations than one would initially expect.

An implicit method must apply the inverse of the linearization (the sparse stiffness matrix) of the residual to the residual; this inverse is a dense operator. An explicit method applies the inverse of the mass matrix to a vector; the mass matrix can be approximated effectively with a lumped or block diagonal mass matrix, and thus its inverse is not dense. Herein lies the source of the difference in the structure of implicit and explicit codes.

The source of the effect of the time integrator on finite element codes can, to some degree, be attributed to the fact that direct solvers have dominated much of the finite element community in the past. Direct solvers for the stiffness matrix have time complexity well above O(n) for 3D finite element problems with n degrees of freedom (dof), whereas the application of the elements to form the residual and the application of the inverse of the mass matrix have time complexity O(n). The result is that the performance of implicit codes is dominated by the solver, whereas the performance of explicit codes is dominated by the element application; optimizing performance for these two endeavors requires very different techniques. Note that although iterative methods are about O(n) in time and space complexity, and are often more easily parallelized than direct methods, the constants of iterative methods for most interesting finite element applications are still much larger than those of the element code; thus the solver will remain the bottleneck with iterative solvers also.

The optimization of an element formulation requires many different small numerical operations (e.g., tensor operations in the element material constitution, inversions of small Jacobians, sometimes small nonlinear solves, and many other operations), all of which require intimate knowledge of the finite element formulation. Element optimization has traditionally been done entirely by engineers, whereas direct solver performance is primarily in the realm of a numerical linear algebraist, as sparse direct solvers are used in many disciplines. Thus the different nature of optimizing implicit and explicit codes is one source of the influence of the time integrator on the structure of finite element codes. This dissertation is concerned with solvers (step seven of the finite element method as listed earlier), and thus we assume that an implicit time integrator is in use, or an explicit method that uses preconditioning to soften the operator.

Finite element parallelism

The parallel implementation of the finite element method requires that we augment the basic steps listed earlier. At some point the domain requires partitioning, as interprocessor communication costs are inherently a bottleneck if they are not minimized. Traditionally, sparse matrix implementations use a block row or block column partitioning of the matrix storage; this implies that a row (or nodal) partitioning is sufficient to parallelize the matrix storage, as finite element matrices are structurally symmetric. A high degree of scalability requires partitioning all of the work and storage for each vertex and each element. Any nonparallel constructs, or significant parts of the code that do not scale well, will inflict severe performance penalties as the scale of the problems increases.

The partitioning, or parallelization, needs to be brought into the system as early as possible to allow for the maximum amount of scalability, and must itself be parallelized for the scale of problems that are of interest to this dissertation. Additionally, the back end of the system (the last step of the finite element method) should remain parallel for as long as is possible. We have elected to draw the line for parallelism of the finite element system after the mesh generation but before the partitioning (although if a parallel mesh generator were available it would be simple to insert), and we go back to the serial code after the solve but before the visualization.

Parallelism and graph partitioning

Mesh partitioning is central to parallel computing with unstructured meshes, and we thus discuss some of the basic issues and algorithms involved. Mesh partitioning is the assignment of each vertex and element to one processor p of the P processors, thereby defining a set of local vertices and elements for each processor. Note that we partition both elements and vertices, as most of the costs of a finite element implementation can be effectively attributed to them. Because of the size of our finite element problems, and the increasing ratio of processor speeds to communication speeds, we need to have each processor be responsible for many more than one vertex. The physical nature of finite element graphs (i.e., vertices communicate only with physically close vertices) allows graph partitioning to be very effective at reducing the amount of data required from other processors, which is generally several orders of magnitude more expensive to use than local data. Thus, as we want good scalability at least at the algorithmic level, a nontrivial nodal partitioning, and the related element partitioning, must be computed.

Partitioners attempt to minimize communication cost by minimizing the number of ghost vertices on each processor (ghost vertices are nonlocal vertices that are connected to at least one local vertex) and the amount of load imbalance (i.e., the maximum amount of work for any one processor divided by the average amount of work per processor). In general one wants to weigh the cost of load imbalance and the cost of a ghost, along with the number of neighbor partitions, to define an optimization problem. Most partitioning methods try to minimize edge cuts as a heuristic for minimizing ghosts, with the constraint that the number of vertices per partition is equal within some tolerance. In attempting to optimize the solution time of an iterative solver it is natural to optimize the performance of the matrix vector product, as this is where a majority of the computation takes place. Also, essentially all of the communication takes place in the matrix vector product, so its optimization is essential for parallel efficiency. Thus we can, and do, simplify the partitioning problem to that of optimizing the matrix vector product. This optimization is relatively simple, as opposed to partitioning for sparse matrix factorizations. Not all vertices in a finite element problem are associated with the same amount of computation, especially in domains where different physics exist, so it is desirable not simply to place an equal number of vertices in each partition but to put equal amounts of work (i.e., floating point operations) and data in each partition.

Partitioners that are of interest today fall into two groups: geometric partitioners and multilevel partitioners. Geometric partitioners have low complexity (in the PRAM model discussed below) and use vertex coordinates only to separate the graph into two roughly equal pieces; applied recursively k times they can generate 2^k partitions. These partitioners can be shown to produce partitions with expected separator sizes within a constant factor of optimal (i.e., the size of the set of vertices that separates a D-dimensional graph with bounded aspect ratio elements into two unconnected pieces is O(n^{(D-1)/D})), and within a constant of optimal load balance. These methods do not, however, look at the edges in the graph, but instead rely on the physical nature of finite element graphs to minimize edge cuts by minimizing the number of vertices near a cut.

Multilevel partitioners rely on the notion of restricting the fine graph to a much smaller coarse graph by using maximal independent set or maximal matching algorithms. This process is applied recursively until the graph is small enough that a high quality partitioner, such as spectral bisection or a k-way partitioner, can be applied. This partitioning of the coarse problem is then interpolated back to the finer graph; a local smoothing procedure (e.g., Kernighan-Lin) is then used at each level to locally improve the partitioning. These methods have polylogarithmic (in n) complexity, though they have the advantage that they can produce more refined partitions and can more easily accommodate vertex and edge weights in the graph. We use a multilevel partitioner, as a good public domain implementation exists: ParMetis.
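A sketch of a call to ParMETIS for this purpose, using the ParMETIS_V3_PartKway entry point (the argument conventions of ParMETIS 3/4 are assumed here; the version used for this work was older). The array vtxdist describes how the distributed graph's vertices are split across processors.

#include <stdlib.h>
#include <mpi.h>
#include <parmetis.h>

/* Compute a k-way partition of a distributed nodal graph.
 * vtxdist[p]..vtxdist[p+1]-1 are the vertices owned by processor p;
 * xadj/adjncy give the local adjacency structure in CSR form;
 * part[i] returns the target partition of local vertex i.            */
int partition_mesh(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy,
                   idx_t nparts, idx_t *part, MPI_Comm comm)
{
  idx_t   wgtflag = 0, numflag = 0, ncon = 1, edgecut;
  idx_t   options[3] = {0, 0, 0};          /* default options            */
  real_t  ubvec = 1.05;                    /* 5% load imbalance allowed  */
  real_t *tpwgts = malloc((size_t)nparts * sizeof(real_t));

  for (idx_t i = 0; i < nparts; i++) tpwgts[i] = 1.0 / (real_t)nparts;
  int rc = ParMETIS_V3_PartKway(vtxdist, xadj, adjncy, NULL, NULL,
                                &wgtflag, &numflag, &ncon, &nparts,
                                tpwgts, &ubvec, options, &edgecut,
                                part, &comm);
  free(tpwgts);
  return rc;
}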

Coarse grid partitioning

An earlier chapter described our method of promoting vertices to coarse grids, and this chapter has discussed partitioning the fine grid; we have yet to define the partitioning of the coarse grids. We have a natural partitioning for the coarse grids, as our coarse grid vertices are derived from fine grid vertices. This natural partition is, however, not adequate for many classes of problems.

One must explicitly repartition the coarse grids: even if the fine grid is perfectly balanced, the coarse grids are in general unbalanced. One source of imbalance is intentional nonuniform coarsening in the domain. Another source of imbalance is nonuniform subdomain topologies; e.g., an area with line topology (like a cable) and areas with thin topology coarsen at different rates from each other and from thick body subdomains. On our uniform problems the load imbalance increases only mildly as we go up the grid hierarchy; nonuniform problems, however, develop severe load imbalance, and thus we repartition the coarse grids (note that ParMetis provides services to improve existing partitions with minimal vertex movement).

Parallel computer architecture

Common computer architectures of today have a hierarchy of memory storage components. At the top of the memory hierarchy are the registers; data must be in registers to be used. Below the registers are often two or three levels of cache; next is main local memory (and the main memories of other processing elements (PEs)), then disk, and so on. Moving data between levels of the memory hierarchy is generally far more expensive than performing floating point operations on data in registers; thus communication is often the bottleneck in numerical codes. The figure below shows a schematic of the common computer architectures of today.

Figure: Common computer architectures: a uni-processor (registers, two levels of cache, main memory), a symmetric multi-processor (SMP), a distributed memory computer, and a cluster of SMPs (CLUMPs), connected by a network.

Some sort of partitioning is required for all of these levels, as we see in the next section. We explicitly partition once for flat machines and twice for clusters of symmetric multiprocessors (SMPs), which we call CLUMPs.

Multiple levels of partitioning

A reason for multiple levels of partitioning is to minimize the movement of data between levels of the memory hierarchy on today's computers. Our problems (multiple degrees of freedom per vertex) are naturally partitioned for registers, though perhaps not optimally. Local matrix vector products require matrix ordering to optimize cache performance; utilizing a graph partitioner is one method.

In the case of CLUMPs we partition our domain to the SMPs first, as communication is much faster (at least in theory) within an SMP; we then partition to each processor within the SMP. We use a flat MPI programming model (one thread per processor); the message passing layer can be optimized with a multiprotocol MPI to communicate more quickly within an SMP than between SMPs, although the SMP cluster that we use (Blue TR at LLNL) does not yet have a multiprotocol MPI implementation, and thus multiple levels of partitioning is not currently of benefit. Alternatively, a multithreaded matrix vector product could be used to take advantage of the shared memory, and thus use a nonflat programming model for some parts of the code. The purpose of the second partitioning phase, on each SMP, is to minimize the communication within the SMP; even if a multithreaded matrix vector product were in use, this partitioning would serve to minimize cache misses on the left hand side vector.

Solver complexity issues

For our overall nonlinear solver we use a Newton-Krylov-Schwarz solution method; that is, our multilevel Schwarz algorithm discussed in the last chapter is used as a preconditioner for a Krylov subspace method, which in turn is used as an approximate linear solver in an outer Newton iteration (see any numerical analysis text for a description of Newton's method). The complexity of our system is governed primarily by the application of our multigrid preconditioner.
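The structure of a Newton-Krylov method is easy to state in code. The self-contained toy below is not our solver: a hypothetical cubic test problem stands in for the finite element residual, and a few unpreconditioned CG steps stand in for the multigrid-preconditioned Krylov solve. It shows only the outer Newton loop whose inner linear solves are approximate.

#include <math.h>
#include <stdio.h>
#define N 4

/* Toy residual F(u) = K*u + u.^3 - f with K = tridiag(-1,2,-1); the
 * Jacobian J(u) = K + 3*diag(u.^2) is SPD, so CG applies.            */
static void residual(const double *u, double *F) {
  for (int i = 0; i < N; i++) {
    double Ku = 2*u[i] - (i > 0 ? u[i-1] : 0) - (i < N-1 ? u[i+1] : 0);
    F[i] = Ku + u[i]*u[i]*u[i] - 1.0;                /* f = 1 */
  }
}
static void jacobian_apply(const double *u, const double *p, double *Jp) {
  for (int i = 0; i < N; i++) {
    double Kp = 2*p[i] - (i > 0 ? p[i-1] : 0) - (i < N-1 ? p[i+1] : 0);
    Jp[i] = Kp + 3.0*u[i]*u[i]*p[i];
  }
}
static double dot(const double *a, const double *b) {
  double s = 0; for (int i = 0; i < N; i++) s += a[i]*b[i]; return s;
}
/* A few unpreconditioned CG iterations: approximate solve of J(u)*d = -F. */
static void inexact_cg(const double *u, const double *F, double *d, int its) {
  double r[N], p[N], Jp[N];
  for (int i = 0; i < N; i++) { d[i] = 0; r[i] = -F[i]; p[i] = r[i]; }
  double rr = dot(r, r);
  for (int k = 0; k < its && rr > 1e-30; k++) {
    jacobian_apply(u, p, Jp);
    double alpha = rr / dot(p, Jp);
    for (int i = 0; i < N; i++) { d[i] += alpha*p[i]; r[i] -= alpha*Jp[i]; }
    double rr_new = dot(r, r);
    for (int i = 0; i < N; i++) p[i] = r[i] + (rr_new/rr)*p[i];
    rr = rr_new;
  }
}
int main(void) {
  double u[N] = {0, 0, 0, 0}, F[N], d[N];
  for (int newton = 0; newton < 10; newton++) {     /* outer Newton loop  */
    residual(u, F);
    double nrm = sqrt(dot(F, F));
    printf("Newton %d  ||F|| = %g\n", newton, nrm);
    if (nrm < 1e-10) break;
    inexact_cg(u, F, d, 5);                         /* approximate solve  */
    for (int i = 0; i < N; i++) u[i] += d[i];       /* full Newton step   */
  }
  return 0;
}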

Global communication, such as dot products or other small global reduction operations, is in general required in each of the control steps (e.g., time step selection, Newton convergence checking, and solver convergence checking) and within the solver itself if a Krylov subspace smoother is used; these reductions are the most nonscalable constructs in our algorithms, i.e., the log P (or log n) terms in the PRAM complexity. The constants for the log(n) terms are small but will become increasingly important as the scale of problems increases.

A parallel finite element architecture

Our code, ParFeap, is composed of three basic components:

Athena is a parallel finite element code, without a parallel solver, that uses ParMetis to partition the finite element graph, then constructs a fully valid serial finite element problem, and finally runs a serial research finite element implementation, the Finite Element Analysis Program (FEAP), to provide a well specified finite element problem on each processor.

Epimetheus is an algebraic multigrid solver infrastructure that provides a solver to Athena, a driver for Prometheus, an interface to PETSc, and numerical primitives not provided by PETSc (i.e., the matrix triple product and an Uzawa solver).

Prometheus is our restriction operator constructor, the core of the research work for this dissertation.

Athena, Epimetheus, and Prometheus are implemented in C++; PETSc and ParMetis are implemented in C, and FEAP is implemented in FORTRAN.

Athena

We have developed a highly parallel finite element code built on an existing serial research ("legacy") finite element implementation. Testing the convergence rates of iterative solvers requires that challenging test problems be used: test problems that exercise a wide range of finite element techniques. We needed a full featured finite element code, but full featured finite element codes are inherently large, complex, and not easily parallelized. Thus, by necessity, we have developed a domain decomposition based parallel finite element programming model in which a complete finite element problem is built on each processor. This abstraction allows for a very simple though expressive interface and required only very simple modifications to FEAP.

Athena uses ParMetis to calculate a mesh partitioning and then constructs a fully valid, though not necessarily well posed, finite element problem for each processor. In addition to the local vertices for processor p prescribed by the vertex partitioning V_p^L, duplicate vertices are required, i.e., ghost vertices that are needed for some local computation but are not in V_p^L. The partitioning also requires duplicate elements, as each partition must have a copy of all elements that touch any local vertex if one wishes to calculate the stiffness matrix without any communication (although with redundant computation). The displacements (Dirichlet boundary conditions) must be applied redundantly (i.e., on ghosts), whereas the loads (Neumann boundary conditions) must be applied uniquely to maintain the semantics of the problem: residuals (or forces) are added into a global vector if elements are not redundantly applied, and otherwise the residuals computed for the ghost vertices are ignored. Materials are specified by an index bound to the elements in the partition and specified at run time with a material file, described below.

A slightly modified serial finite element code runs on each processor. Though the serial code is modified, it does not have any parallel constructs or knowledge of the global problem; this is useful for debugging and for the continued independent development of FEAP. Parallelism is introduced by providing the finite element code with a matrix assembly routine, a solve routine, and a global dot product; additional support functions are provided for expressiveness and performance but are not strictly necessary. The global dot product is hardwired for a vector of the type of x or b in an equation like Ax = b, where A is the stiffness matrix for the entire local finite element problem, and thus allows for a very simple parallelization of the serial code that, with the addition of a solver, is adequate for all or most of the global operations in all finite element codes. This simple interface could also be adequate for a simple explicit method, where the solver simply needs to invert a diagonal mass matrix, do a component by component vector-vector product, and then communicate the solution on local vertices to neighbor processors. Any method that requires other global operations (e.g., a solver with solver specific initialization routines) can be added as needed; thus this interface provides the kernel for a full featured parallel finite element implementation. The advantage of this method is that the serial finite element code (e.g., FEAP) is completely parallel and has a very small interface with the parallel code (e.g., Athena). With the addition of a solver that allocates the parallel stiffness matrix and global vectors and solves a system of the form Ax = b, many finite element codes, among those used in industry and academia, are ready to run.
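A sketch of such a hardwired global dot product follows (a hypothetical interface, not FEAP's or Athena's actual routine): each processor sums only the entries of its locally owned vertices, skipping ghosts, and an MPI reduction produces the global value returned to the serial finite element code on every processor.

#include <mpi.h>

/* Global dot product for vectors laid out vertex by vertex with ndf
 * entries per vertex.  owned[v] is nonzero only for local (non-ghost)
 * vertices, so ghost copies are not double counted.                   */
double global_dot(const double *x, const double *y,
                  int nverts_local, int ndf, const int *owned, MPI_Comm comm)
{
  double local = 0.0, global = 0.0;
  for (int v = 0; v < nverts_local; v++) {
    if (!owned[v]) continue;                 /* skip ghost vertices */
    for (int c = 0; c < ndf; c++)
      local += x[v*ndf + c] * y[v*ndf + c];
  }
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return global;
}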

Epimetheus

Prometheus provides Epimetheus with restriction operators, and Athena uses Epimetheus to solve the equations. Epimetheus uses METIS to determine the block Jacobi subdomains, and PETSc as the parallel numerical library and programming development environment. Future work may include adding an interface to the parallel unstructured multilevel grids that we use to construct our coarse grids, as these are generally useful for anyone building a parallel algebraic multigrid code (or, in fact, any parallel algorithm on unstructured multilevel grids), similar to how KeLP and Titanium provide parallel multilevel structured grid primitives.

Prometheus

FEAP provides Prometheus with the local finite element problem that was originally constructed by Athena (i.e., coordinates, element connectivities, material identifiers, and the boundary conditions). Prometheus constructs the global restriction operators for each grid and is the core of the algorithmic contribution of this dissertation, discussed at length in an earlier chapter.

Athena/Epimetheus/Prometheus construction details

This section discusses some low level details of our code construction. It is not intended to be a manual or specification, but it should provide enough detail that the interested reader can get some idea of the interface and architecture. Note, for the interested reader, that the FEAP manual is available.

To run our system one must first input a problem into a serial version of FEAP using FEAP's simple mesh generation, then output a mesh file (a FEAP input file in a fixed format, so as to allow it to be read efficiently in parallel); this file is the input to Athena. If one had a scalable mesh generator to provide the mesh at run time, then one would only need disjoint vertex and element partitionings, providing each processor with its vertices and elements; there are no constraints on the properties of these partitions other than being disjoint. Alternatively, one could easily transform a pregenerated mesh into our flat FEAP format, as it is very simple.

One can now run ParFeap with three input files: the large flat FEAP input file and two small files, one with the material parameters and the other with the FEAP solution script. FEAP's command language is used to invoke Athena routines from this FEAP solution script, as well as to invoke FEAP commands (see the figure below). We will not discuss the material file further, but the FEAP solution script is of interest, as it uses FEAP's command language and is an important component of the user interface with ParFeap (again, documentation details may be found in the FEAP manual). Athena can now be run with any number of processors, including one processor, with no further preparation. This flexibility drives this design, as it is paramount for solver development (e.g., debugging, experimenting with new methods, and collecting performance data).

Within the FEAP solution script, displacements can be written to a file after a time step is complete. The serial version of FEAP is then invoked; FEAP's read command is used to open the file (for example, read with the argument disp reads in the displacements in a file named disp), and the results can then be visualized using the serial version of FEAP. If a more sophisticated visualization tool were available, it could be inserted here. The figure below shows an example FEAP solution script that initializes Prometheus (mgin), makes four coarser multigrid levels (mg), finalizes the data structures (mgfn), and then does a simple linear pseudo time stepping problem, writing out the displacements at each step. A second figure shows a graphic representation of the overall system architecture.

    batch            begin batch block
      mgin             initialize Prometheus
      loop               loop over the coarser levels
        mg                 make a new level
      next
      mgfn             finalize the multigrid setup (prepare for a solve)
      dt               set the time step
      prop             use a proportional load
      loop             time loop, over the time steps
        time             increment the time
        tang             form and solve the tangent stiffness matrix
        mgds             write a PETSc displacement file
      next
      mgdsend          collect displacements and write to a FEAP file
    end              end batch block
    inte             begin interactive mode

    Figure: FEAP command language example (the FEAP solution script)

Figure: Code architecture. The FEAP input file is read by Athena, which uses ParMetis (on the global data) to partition first to the SMPs and then within each SMP, writing a memory resident FEAP problem for each processor; each processor then runs FEAP with the solution script and materials file. In the solution phase (x = b/A), Epimetheus provides the FE/AMG solver interface (METIS subdomains, matrix products R A R', and the PETSc interface) and Prometheus constructs the restriction operators R (maximal independent sets, Delaunay, finite element shape functions).

Processor subdomain agglomeration

Subdomain agglomeration refers to reducing the number of active processors and coalescing (agglomerating) the subdomains on each processor onto a new, reduced set of active processors. Subdomain agglomeration is valuable for two entirely different reasons: first for performance, and second for algorithmic considerations. As mentioned earlier, we partition many vertices to each processor; this allows all of the processors to do useful work in much of the multigrid algorithm for the large problems of today. The difficulty is that as the number of vertices per processor dwindles, the ability to do work efficiently decreases. At some point in the grid hierarchy it is efficient to let some processors remain idle and agglomerate the work onto fewer processors; that is, the time spent on a grid will decrease if fewer processors are used. Note that agglomeration, and more significantly repartitioning, can slow the restriction and interpolation operators on the previous grid to some degree, though we are not able to quantify this well, and thus must rely on the total run time to justify and optimize agglomeration. Also, agglomeration requires data movement in the setup phase, though as agglomeration takes place when there are very few vertices per processor, the cost is not significant.

A second reason for the use of processor agglomeration comes from the multigrid algorithm that we use, both in terms of mathematics and practical implementation issues. Mathematically, when any multigrid algorithm uses a block Jacobi preconditioner in the smoother, one no longer has the same solver in parallel as on one processor, since one can no longer maintain the same block sizes on the coarser grids (assuming that a serial solver is used on the subdomains).

Agglomeration should take place when the global Mflop rate (e.g., the left plot of the figure below) of the operator that one wishes to optimize, on your machine and using the current number of active processors, falls below that achieved with a smaller (integer) number of processors. We need a method to pick the number of processors to use, given the structure of the stiffness matrix at each level of multigrid. This section discusses three approaches to determining, at run time, the number of processors to use for the coarsest grids. First we discuss a simple method and introduce some of the concepts and tools involved in subdomain agglomeration algorithms; we then define the problem more concretely as an integer programming problem; a sophisticated, though complex and intractable, approach is discussed next; and finally we describe the method that we use in our numerical results.

Simple subdomain agglomeration method

We can quantify the point at which this agglomeration should take place for a particular problem by measuring the performance of the operation, or mix of operations, that one is optimizing for on each level. A simple method is to first measure the performance of the operator on a range of matrix sizes and numbers of processors. For example, if we approximate our operator (multigrid on one given level) by a matrix-vector product, then we can use the data in the figure below.

A simple subdomain agglomeration method is to decide on the optimal number of equations to have on each processor, B, and, given the number of equations n_i on a grid, use P_i = ⌈n_i / B⌉ processors if P_i < P (and all P processors otherwise). From this data we could conclude that the optimal number of equations per processor on the Cray T3E is at the point where we have peak floating point operation (flop), or megaflop (Mflop), rates.

Figure: Matrix vector product Mflop/sec per processor vs. number of processors (left) and total Mflop/sec vs. equations per processor (right) on a Cray T3E, for the 4K, 15K, and 83K dof sphere problems.

Experience has shown that this very simple method is not adequate, at least on some of the platforms that we use, as the optimal number of equations per processor is a function of P. Also, as our coarse grids tend to be denser than the finest grid, the number of nonzeros in the matrix should also be considered, though we do not currently consider the number of nonzeros.

Subdomain agglomeration as an integer programming problem

The optimal number of processors is a constrained optimization problem. First we define our constraints: obviously we need an integer number of processors, but we are also not free to select any number of processors in our system, as our implementation is not completely flexible in the number of processors used. For simplicity of the implementation we require that the number of active processors be a factor of the number of processors on the previous, finer grid. For example, with six levels, a series of processor group sizes in which some entry is not a factor of its predecessor is not valid, while a series of successive factors is valid. Note that this constraint is not onerous, as the number of equations drops by a significant fraction from grid to grid, though it does demand that we avoid using a number of processors with large primes in its factorization.

We want to pick the number of processors to minimize the time to do the work on a particular level. We neglect the effect of agglomeration on the restriction and interpolation operations: these effects are not large on machines with good networks, they are difficult to measure accurately without affecting the performance, and their inclusion would significantly complicate the optimization problem, which, as we will see, is already too complex for us to solve directly. Most of the flops in a level of multigrid are in the matrix-vector products and the block Jacobi preconditioner. Block Jacobi preconditioners require no communication and hence motivate the use of as many processors as there are blocks. Matrix-vector products have some nearest neighbor communication and do not speed up perfectly forever, as seen in the figure above. Additionally, the dot products require global communication but perform very few flops per dof; hence the dot products push for fewer processors than the other operators, especially on machines with slow communication.

To select the number of processors P to use on a coarse grid G with an integer programming technique, one needs an explicit and accurate function T(G, P) for the execution time of a particular grid on a particular machine with any number of processors. Integer programming could then be used to find the minimum of this function T with respect to P, to decide on the number of processors onto which to partition the current grid. The next section discusses issues in constructing such a function T via a computational model.

Potential use of a computational model

We can pick the optimal number of processors by modeling the time complexity of a level of multigrid as a function of the number of processors P and the number of vertices n, as follows. First we define complexity as the maximum time T_x that any processor spends in operation x. Note that this is a different definition of T than we use in the next chapter (i.e., T = maximum time * number of processors), because here we have already paid for the processors and gain no benefit from using fewer of them; thus this is a somewhat unusual cost model. Define fl, the minimum flop rate on any processor for BLAS operations; α, the maximum message latency; β, the maximum inverse bandwidth between any two processors for a double precision floating point number; and ndf, the number of degrees of freedom per vertex. In other words, the time to send a message with n words is α + β*n.

A simple model of the dot product cost is

    T_dot(P, n) = (ndf * n) / (P * fl) + (α + β) * log P .

We will assume perfect load balance and ignore network contention; thus each processor has n/P vertices, and α and β are not functions of P. Note that, as all of the cost components are exactly the maximum costs on any one processor, and we are looking for the maximum total time on any one processor, we use T_dot(P, n) directly.

To model the cost of the matrix vector product we can use a simple model (a more detailed model appears in the next chapter): assume all vertices have the same number of neighbors (i.e., neglect the boundary), assume that the problem is uniform, and let fl now be the minimum flop rate of matrix vector products on any one processor. Assume there are about (n/P)^{2/3} ghost vertices per processor; then the matrix-vector product complexity could be modeled as

    T_matvec(P, n) = (c1 * ndf^2 * n) / (P * fl) + c2 * α + β * ndf * (n/P)^{2/3} ,

where c1 reflects the average number of stored blocks per block row and c2 the maximum number of neighbor processors. This assumes a fixed maximum number of neighbor processors, as we commonly observe such a maximum in our numerical experiments.

Basic linear algebra subroutines (BLAS) refer to a set of subroutine definitions that define an interface for highly tuned, machine specific implementations of standard linear algebra operators. BLAS operations fall into three categories, defined by the algebraic objects in the operation: a BLAS1 subroutine operates on vectors and scalars only, BLAS2 operations operate on at least one vector and one matrix, and BLAS3 operations operate on matrices only. These categories are useful, as they accurately characterize the performance of the operations within them, with BLAS3 operations being the fastest and BLAS1 the slowest.

We can model the cost of a level of multigrid with diagonal preconditioning by two dot products and one matrix-vector product, i.e.,

    T_MG(P, n) = 2 * T_dot(P, n) + T_matvec(P, n) ,

ignoring the lower order terms (e.g., the smoother preconditioner and the AXPYs). We can minimize the cost by treating our T_MG(P, n) expression as an equality and solving d T_MG(P, n) / dP = 0 for P: the flop terms, which decay like n / (P^2 * fl), then balance against the communication terms (the (α + β) log P latency term of the dot products and the β * ndf * (n/P)^{2/3} term of the matrix-vector product).

This simple model is, however, not accurate: the load imbalance has not been considered, and the α and β terms need to include the costs of packing and unpacking vector data in the matrix-vector product (note that we define α and β to include these costs in the next chapter, but this reduces the generality of these terms and requires that they be derived from numerical experiments on typical problems). Also, and more significantly, the matrix-vector product expression implies that using more processors always lowers the run time; this is not an accurate assumption, as β is effectively a function of P, as the figure above demonstrates for small problems (i.e., typical coarsest grids). We discuss modeling via operation counting in more detail in the next chapter, as well as reasons for the discrepancy between these models and more detailed matrix-vector product models.
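Even without a closed form minimizer, a model of this kind can be used directly: evaluate T_MG(P, n) for every feasible processor count (here, the divisors of the finer grid's count, per the constraint above) and take the argmin. The sketch below uses the simple model terms given above, with all constants (the machine parameters and the c1, c2 placeholders) supplied by the caller from measurements.

#include <math.h>

typedef struct {           /* machine and model parameters (measured)  */
  double fl;               /* flop rate                                */
  double alpha, beta;      /* latency and per-word inverse bandwidth   */
  double c1, c2;           /* placeholder constants of the matvec model*/
  int    ndf;              /* degrees of freedom per vertex            */
} ModelParams;

static double T_dot(const ModelParams *m, double P, double n) {
  return m->ndf * n / (P * m->fl) + (m->alpha + m->beta) * log2(P);
}
static double T_matvec(const ModelParams *m, double P, double n) {
  return m->c1 * m->ndf * m->ndf * n / (P * m->fl)
       + m->c2 * m->alpha + m->beta * m->ndf * pow(n / P, 2.0 / 3.0);
}
static double T_MG(const ModelParams *m, double P, double n) {
  return 2.0 * T_dot(m, P, n) + T_matvec(m, P, n);
}

/* Pick, among the divisors of the finer grid's processor count Pfine,
 * the count that minimizes the modeled time for a grid with n vertices. */
int choose_processors(const ModelParams *m, int Pfine, double n)
{
  int    best  = Pfine;
  double tbest = T_MG(m, (double)Pfine, n);
  for (int P = 1; P <= Pfine; P++) {
    if (Pfine % P) continue;                  /* feasibility constraint */
    double t = T_MG(m, (double)P, n);
    if (t < tbest) { tbest = t; best = P; }
  }
  return best;
}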

Subdomain agglomeration method

We currently work with empirical models, based on curve fitting, that use measurements from actual solves. This method has the disadvantage that model parameters have to be calculated for each machine that the code is to be run on, and may need to be recalculated for different problem types, but it has the advantage that it is reasonably accurate and simple to construct.

We can start by looking, in the figure below, at the form of the function f that we want. The concave shape of the curve for f is derived from the intuition that as more processors are in use, the log P term in the dot products, and any other source of parallel inefficiency, will push for the use of fewer processors. For convenience we transpose this function to get the convex function n = g(P). Now it is natural to begin to approximate this function with a quadratic polynomial

    n = A * P^2 + B * P ,    or alternatively    n / P = A * P + B ,

Figure: Cartoon of the cost function P = f(n) for machine X and its transpose n = g(P).

where n/P is the problem size per processor. The function that we actually want is

    P_float = ( -B + sqrt(B^2 + 4*A*n) ) / (2*A) .

A and B are machine and problem dependent parameters that are determined by experimental observation on actual multigrid solves of the problem at hand, or from experience with other problems on the target machine. We have selected A and B by starting with an initial guess and using a large problem to search for the optimal solve time: we perturb A, measure the total performance (the only quantity that we can effectively measure), and search for an optimal A (we assume that solve time is a convex function of A); we repeat this process for B, then go back to A, and so on, until we find the minimal solve time for this one problem. This will give us good results on problems similar to the one that we test on.

In a production setting one could automate this process of selecting the coefficients (e.g., A and B for a polynomial such as the equation above) by running parametric experiments (parametric in A and B) with a large representative problem; one could then simply select the A and B used in the experiment with the fastest solve time, or use curve fitting to construct a function that can be minimized. Note that this process would be repeated for each new machine, or machine configuration, and for each problem class.

The use of many more processors would likely require that a higher order polynomial, or a more complex function, be used. Note that, for more accuracy, one should also use the number of nonzeros in the matrix, in addition to the number of equations n, as this is a more direct measure of the matrix vector cost and does not remain constant on all grids. The equation above provides us with reasonable results, as we run homogeneous problems (i.e., trilinear elements with three dof per node) with an approximately even distribution of nonzeros.

Figure: Cartoon of the fitted function n/P = A*P + B (pick A and B).

After the number of vertices n on a grid has been determined at run time, A, B, and the equation above are used to compute P_float. After computing P_float we find the first feasible integer in each direction above and below P_float; for example, in the figure below the two feasible numbers of processors P1 and P2 bracket P_float. We pick P to be whichever of P1 and P2 is closer to P_float as the result of this method.

Figure: Search for the best feasible integer number of processors; P1 and P2 are the feasible values bracketing P_float.
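A sketch of this selection, under the assumptions above: P_float comes from the fitted quadratic n = A*P^2 + B*P, and the result is snapped to the nearer feasible processor count (a divisor of the finer grid's count).

#include <math.h>

/* Given the fitted coefficients A and B, the number of equations n on
 * this grid, and the processor count Pfine of the next finer grid,
 * return the feasible processor count closest to the real-valued
 * minimizer P_float = (-B + sqrt(B*B + 4*A*n)) / (2*A).                */
int agglomerate_processors(double A, double B, double n, int Pfine)
{
  double Pfloat = (-B + sqrt(B * B + 4.0 * A * n)) / (2.0 * A);
  int P1 = 1, P2 = Pfine;                 /* feasible brackets of Pfloat */

  for (int P = 1; P <= Pfine; P++) {
    if (Pfine % P) continue;              /* must divide the finer count */
    if (P <= Pfloat) P1 = P;              /* largest feasible below      */
    else { P2 = P; break; }               /* smallest feasible above     */
  }
  if (Pfloat <= 1.0)   return 1;
  if (Pfloat >= Pfine) return Pfine;
  return (Pfloat - P1 <= P2 - Pfloat) ? P1 : P2;
}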

Chapter

Parallel performance and modeling

This chapter discusses performance modeling for multigrid solvers in general, and our solver in particular.

Introduction

Accurate time complexity models based on operation counts are very challenging on today's parallel machines; even for regular problems (i.e., parallel dense linear algebra), accurate computational models can be intractable because of the many contributors to performance and their complex interactions. Previous multigrid modeling work, for regular grids on an NCUBE run with many times as many nonzeros per processor as is typical today, can be found in the literature. This chapter discusses complexity issues and models of 3D unstructured multigrid solvers.

The goal of this chapter is to develop the basis for a useful model to predict the run time and memory usage of multigrid solvers with unstructured finite element meshes on typical computers of today. An accurate complexity model of this type is beyond the scope of this dissertation; however, we will discuss qualitative models useful in understanding the overall complexity of multigrid solvers, begin to develop quantitative models of some of the major components of multigrid solvers, and present a framework for a multigrid complexity model. Additionally, this chapter discusses our performance measures, which will be used in our numerical experiments in later chapters. Thus the primary goal of this chapter is to introduce the overall complexity structure of multigrid on unstructured meshes and to begin to develop a complexity model via a decomposition of multigrid and analysis of some of its components.

Our complexity models will first decompose multigrid into components whose interactions with each other can be ignored. We assume that all of our models are static, in that we are able to account for all of the work to be done in a solve, at least to the level of detail that we articulate in our decomposition. At some point our decomposition takes the form of a functional decomposition (e.g., time to apply an operator = number of flops / flops per second). At the component level one may resort to a bottom-up model that is based on low level and general machine specifications (such as clock rate, cache size, network specifications, etc.), or one can use a top-down approach of measuring the code component on a particular machine with various inputs and curve fitting a function to approximate the measured data. See the table below for the pros and cons of bottom-up and top-down approaches.

Pros and cons

  top-down                                        | bottom-up
  ------------------------------------------------|--------------------------------------------------
  needs a running implementation on a given       | can predict performance on new machines or
  machine                                         | without an implementation
  can predict performance only for different      | may badly underestimate time, depending on
  parameter values on a given machine             | whether all operations are counted
  takes all interactions into account             | may be very hard to count all operations or
                                                  | model their interactions
  may or may not account for irregular            | irregular operations may require approximating
  operations                                      | work per processor

Table: Top-down vs. bottom-up performance modeling

This chapter proceeds as follows. We first discuss the motivation for complexity modeling, the notation, and the complexity models that we use. We then present a simple computational model, based on PRAM, that is used to speculate on the scale of future multigrid solves and is intended to introduce multigrid complexity by providing an overall view of the complexity of multigrid solvers. Next we discuss the high level structure of our cost model and introduce our primary metric, efficiency. We then discuss the scope of our model, the primary cost components of the multigrid solution phase, models for some of the primary solver components, and quantitative measures of the complexity of unstructured multigrid solvers. We conclude with a sketch of a path for continuing to develop our complexity model.

Motivation, computational models, and notation

There are several reasons for developing performance models and performing complexity analysis of a multigrid code. First, multigrid solvers have many parameters (e.g., number of processors, subdomains, smoothing steps, etc.) and many algorithmic choices (e.g., additive or multiplicative methods), so one would like to make these choices on a rational basis, since experience shows that performance can be a sensitive function of these parameters. A second use of performance modeling is to identify bottlenecks in the code that are an artifact of a code bug, a system problem (e.g., non-optimal system parameters or system bugs), platform specific behavior (e.g., a poor contention management implementation in the communication system might require more synchronous communication patterns), or some other fixable problem. A third reason for performance models is to project the behavior of an algorithm on new or future machines, so as to make informed purchasing decisions and so that system manufacturers can make informed decisions about their system design (assuming that the manufacturer is interested in optimizing the performance of their products for your application). Additionally, performance models can be used to choose the optimal number of processors to employ on any given level, as discussed in the previous chapter, and to estimate resource needs on shared machines.

Notation and computational models

We need to decide which operations to count in our performance model. We use two complexity models of the underlying machine: PRAM and LogP.

For some algorithms we use the RAM (random access memory) and PRAM (parallel RAM) complexity models as a basis for high level discussions, and occasionally enrich them. The RAM model makes the simplification that all memory can be accessed in constant time and that primitive operations (e.g., arithmetic operations) also take the same constant time. The PRAM model builds on the RAM model and includes parallel constructs: it assumes that sending and receiving data to and from any process can also be done in constant time per data word. PRAM models come in many types (all combinations of concurrent (C) or exclusive (E) reads (R) and writes (W), with concurrent reads and exclusive writes (CREW) being the most intuitively realistic for a general computer program, plus queued-write variants and others). We do not distinguish between these PRAM models in our work, as we work primarily with a distributed memory programming model; if it is not otherwise stated, however, we assume the CREW PRAM model. The advantage of the PRAM model is its simplicity; this simplicity is also its limitation, as communication costs can be very complex.

The second model that we use, LogP, models the communication system more realistically. We use the LogP model in the more detailed discussions of multigrid complexity. LogP models the overhead in communication (overhead o), the network inverse bandwidth and capacity (gap g), and the network latency (latency L); P stands for the number of processors.

The overhead o is the time that the processor uses to process an MPI message, plus the time that our solver and PETSc actually spend collecting, packing, unpacking, and placing data to and from a message. The gap g is the period during which messages of a given size can be processed by the network without stalling. The latency L is the time between the moment a message is sent by the application (via an MPI call) and the moment the application receives the message on another processor. We use a simplified version of LogP, combining o, L, and g into a few effective parameters. Often empirical data is available in the form of a measured time t for an operation, the number of processors p involved in the communication, and an average number of data words n in the communication with each processor; thus we look for functions of the form t = f(p, n) = alpha p + beta p n.
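As a concrete illustration of this functional form, the following minimal sketch evaluates the two-parameter model named above; the numeric parameter values are made up for illustration and are not measurements from this work.

    def comm_time(p, n, alpha, beta):
        """Modeled communication time for an operation that exchanges an
        average of n words with each of p neighbor processors:
            t = f(p, n) = alpha*p + beta*p*n
        alpha -- per-message cost (overhead plus latency), seconds
        beta  -- per-word cost (inverse bandwidth plus packing), seconds/word
        """
        return alpha * p + beta * p * n

    # Example with illustrative (not measured) parameters:
    print(comm_time(p=26, n=1200, alpha=50e-6, beta=0.05e-6))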

Notation

Multigrid requires many parameters. The figure below labels the components and notation of a multigrid solve that we will use throughout this chapter.

L : number of levels in multigrid
i : grid number; i = 0 for the finest grid and i = L for the coarsest grid
P_i : number of processors active on grid i
n_i : total number of degrees of freedom on grid i
D : dimension of the problem
ndf : degrees of freedom per vertex
s : number of pre/post smoothing steps
k : number of iterations in the global linear solution
b_i : number of blocks in the block Jacobi preconditioner on grid i
A_i : matrix on level i
R_i : restriction operator from grid i to grid i+1
P_i = R_i^T : interpolation operator from grid i+1 to grid i
MVec_i : matrix-vector multiply on grid i, b <- A_i x_i
VecDot_i : vector dot product on grid i, x_i^T y_i
VecNorm_i : vector norm on grid i, sqrt(x_i^T x_i)
Axpy_i : y_i <- y_i + alpha x_i
Restrict_i : r_{i+1} <- R_i r_i
Interpolate_i : x_i <- R_i^T x_{i+1}
TriProd_i : A_{i+1} <- R_i A_i R_i^T
F_i : full factorization of matrix A_i
S_i : forward elimination and back substitution with the factored A_i
F_i^j : factorization of the matrix for subdomain j of A_i
S_i^j : forward elimination and back substitution with the factored subdomain matrix j of A_i
N_neigh : maximum number of neighbors of any processor domain
F(x) : flops in one application of operator x
C(x) : communication time in one application of operator x
T(x) : time for one application of operator x

Figure: Multigrid computational components, labels, and parameters

PRAM computational model and analysis

Before we articulate a more detailed complexity model of multigrid solvers, we develop a simple model with an optimistic machine that we hope can be built in the future. We use PRAM-like complexity models (PRAM with some simple enhancements); PRAM is far too blunt a tool to model the performance of these codes, but it does provide a means of seeing the big picture. We also exercise this model by speculating on the scalability of the multigrid algorithm that we use: by defining a base case with no significant parallel inefficiency in our model, and stipulating that we are only willing to solve problems with at least 50% parallel efficiency (see the definition of parallel efficiency below), we show that problems on the order of trillions of unknowns could be solved with our current multigrid algorithm. This is meant to bring some perspective to the scale of problems that could be addressed with our multigrid solver.

We model only the multigrid preconditioner, ignore the accelerator, and assume that we are using diagonal preconditioning for our smoothers. We ignore the AXPYs and other BLAS operators that require no communication, and the factorization cost on the coarsest grid L, as this is a small constant shared by all solves. We also neglect the cost of the coarse grid construction and any subdomain setup. Thus we need only model the matrix-vector products, the dot products, and the forward elimination and back substitution on the coarsest grid.

Assume a number of equations per processor on the finest grid that is a bit larger than our common size but reasonable on a well configured machine with sufficient main memory per processor and a production quality implementation of our solver and PETSc (see the previous chapter). Assume a reduction by a factor of eight in the number of equations on each successive grid, and that the coarsest grid has n_L equations. Let us also assume that we are only willing to tolerate a fixed minimum number of equations on a processor before we agglomerate subdomains (as discussed in the previous chapter), and that this number is independent of the number of processors; all of the coarsest grids run at this minimum number of equations per processor in this discussion. In this model the number of levels is L = log_8(n_0 / n_L), n_0 being the global number of equations on the fine grid (assume the ratio n_0 / n_L is a power of eight).
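The grid hierarchy implied by these assumptions is easy to tabulate. The sketch below does so under stated assumptions: a reduction factor of eight per grid, a hypothetical per-processor fine grid size, and a hypothetical agglomeration threshold of 500 equations per processor; both specific values are placeholders, not the dissertation's figures.

    def grid_hierarchy(n0, eq_per_proc_fine, min_eq_per_proc=500, reduction=8):
        """Tabulate grid sizes and active processor counts for the PRAM model.

        n0               -- global equations on the finest grid
        eq_per_proc_fine -- equations per processor on the finest grid (assumed)
        min_eq_per_proc  -- agglomeration threshold (assumed placeholder value)
        reduction        -- equations-per-grid reduction factor (8 for halving in 3D)
        """
        p_fine = max(1, n0 // eq_per_proc_fine)
        levels = []
        n = n0
        while n > min_eq_per_proc:      # coarser grids use fewer active processors
            p = max(1, min(p_fine, n // min_eq_per_proc))
            levels.append((n, p, n / p))
            n //= reduction
        levels.append((n, 1, n))        # coarsest grid, on a small processor set
        return levels

    for n_i, p_i, per_proc in grid_hierarchy(n0=8**7 * 500, eq_per_proc_fine=32000):
        print(f"n_i = {n_i:>12}  P_i = {p_i:>8}  eq/proc = {per_proc:10.1f}")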

First define S(n_i / P_i) to be the cost in time of one level (not the coarsest level) in a multigrid V-cycle with n_i / P_i equations on each processor. Assume that this cost is linear in n_i / P_i except for the parallel inefficiency of the matrix-vector products and the log P_i terms in the global reductions (dot products), which have a log P in their PRAM complexity. A good communication network is essential in containing the costs of these reductions (see the discussion of dot product communication below); many machines in the past have had networks that could handle reductions effectively (e.g., the Connection Machine and the Intel ASCI Red at Sandia National Laboratory). We temporarily neglect the log P terms (two in each iteration of the Krylov subspace method smoother) because they have small constants, the range of performance on machines of today varies enormously, and these terms are difficult to capture in this model, though we accommodate them below.

The assumption that the cost of S is linear in n_i / P_i assumes that the matrix-vector products scale linearly; this is not realistic, as was discussed earlier, so we constrain our problems to have at least a minimum number of equations per processor, or penalize grids with fewer equations per processor. Thus, to account for the fact that the coarse grids run slower than the finer grids (i.e., speedup is not perfect forever), we add a penalty factor to the costs of the coarsest grids; this reflects our empirical measurements of communication efficiency (or lack thereof).

Define D(n_L) to be the cost of the coarse grid solve (without the factorization) with n_L equations and as many processors as we like on the coarsest grid (e.g., P_L equal to n_L divided by the per-processor minimum). We estimate the solve cost of the coarsest grid, D(n_L), in terms of S defined as S(n_{L-1} / P_{L-1}), since the coarse grid would use P_{L-1} processors if it were treated as a non-coarsest grid. First we estimate how many matrix-vector products with the stiffness matrix on the coarsest grid one application of S represents, using two pre and post smoother applications and neglecting the other terms in the S complexity (see the cost inventory figure at the end of this chapter for an inventory of multigrid component applications). Next we estimate that the flop count in D(n_L) is a modest multiple of that in a matrix-vector product on the coarsest grid; this is an approximation using the bandwidth of the stiffness matrix for a cube with n_L equations. Thus D(n_L) is a small constant multiple of S. Note that an alternative to one small set of processors working on the coarse grid solve is to have n_L processors factor the coarse grid and each solve for one row of the explicit inverse of the coarse grid stiffness matrix; a solve (i.e., D(n_L)) then consists of a broadcast of the right hand side vector, a dot product of size n_L on n_L processors, and a special interpolation operator.

Now we can state a cost estimate of full multigrid:

    T(n) ~ S(K) + 2 S(K) + 3 S(K) + ... + L S(K) + L D(n_L) = sum_{i=1}^{L} i S(K) + L D(n_L),

where S(K) denotes the per-level cost at the assumed per-processor load K. The factors i and L come from the number of times that full multigrid visits each level (see the full multigrid cycle and cost inventory figures in this chapter).

The table below shows some grid statistics for this model; from it we can see, for example, that with only two levels of processor agglomeration (i.e., three distinct processor groups) and six multigrid levels we can solve problems with many millions of degrees of freedom on thousands of processors.

Table: Future complexity configuration (a model configuration example listing, for each grid level, the total number of active processors, the equations per processor, and the total equations).

As a thought experiment, let us define a base case for which there is no significant parallel inefficiency and see how large a problem can be solved in this model with 50% parallel efficiency (described below). Let us define the base case to be the largest problem in which no processors are idle except on the coarsest grid and no grid has a matrix-vector product with fewer than the minimum number of equations per processor. The base case is therefore fixed: all processors must be active and fully loaded on the penultimate grid, and the penultimate grid size is tied to the coarsest grid size n_L, which determines P, n_0, and L; the resulting problem is the largest with no significant parallel inefficiency. To get a rough idea of the size of problems that can be solved economically with this performance model, we define an acceptable level of efficiency of 50% relative to our base case; i.e., we decide that we are willing to allow for twice the compute time of our largest problem, thus running at one half of the efficiency of the base case.

How many extra grids can we use with this model and efficiency pain tolerance? First let us make a gesture to LogP and assume that a portion of the extra time goes into the communication cost of the dot products, so we allow for only a correspondingly smaller increase in compute time in the model; note that this is a constant (though large) approximation to the log P terms in the dot products. Each extra level multiplies the problem size by eight, giving problems on the order of trillions of equations on about seventy million processors (a petaflop machine).

One should note that this model does not attempt to change our basic solver configuration (i.e., full multigrid and Krylov smoothers). Depending on the machine and problem, additive formulations and/or V-cycle multigrid may be more economical, as without Krylov subspace method smoothers V-cycle multigrid has a PRAM complexity of O(log n) per cycle whereas full multigrid has a PRAM complexity of O(log^2 n).

This model is meant to give a big picture view of multigrid complexity issues, to begin to augment the PRAM multigrid models so as to reflect the machines of today and the near future, and to put some approximate constants into the complexity of full multigrid with Krylov subspace smoothers and accelerators. Thus this section has introduced the broad outlines of multigrid complexity; we are now ready to build a multigrid complexity theory from the ground up, or from LogP up.

Costs and benefits

Before we set out to model multigrid solvers we will define what we are trying to accomplish and how we measure success. Our ultimate goal is to solve all sparse finite element problems on unstructured grids cheaply. The problems that we are concerned with are accurate simulations of complex physical phenomena in solid mechanics via finite element methods on unstructured meshes; in particular we are concerned with the linear solve, or preconditioning, of matrices from such methods. The first metric that we introduce is the benefit that we wish to enjoy: "all sparse finite element problems" is not a feasible goal, nor easily quantifiable, so we use the number of degrees of freedom (dof) as our measure of benefit, for which we have costs. We implicitly include all sparse finite element problems by applying our solver to challenging test problems.

Costs can be defined in many ways; we use the run time of sample problems and apply other costs as constraints (e.g., memory cost is included by assuming that memory is free but limited to the size of main memory on most of today's machines). We have attempted to build test problems that are indicative of the real world problems that people want to simulate, and thus provide an approximate prediction of the costs of our solver on some demanding finite element simulations. Robustness is thus implicitly included in our analysis by the nature and difficulty of our test problems, and by using methods and parameters that are indicative of the most efficient way for us to solve these test problems. Cost is defined as the product of the maximum time required by any one processor to solve the problem and the maximum number of processors used.

With costs and benefits defined we can describe the overall computational structure of solvers, which provides broad categories in which all of our costs reside. There are three basic cost phases of a linear solve:

1. Setup cost per distinct mesh: the setup cost of a configuration or graph, e.g., constructing coarse grids and restriction matrices, allocating memory for the data, and setting up communicators and buffers for matrix-vector products.

2. Setup cost per matrix (for a mesh that has been set up): preprocessing each matrix for a solve, e.g., the LU factorization for a direct method, or the matrix triple products and submatrix factorizations for multigrid.

3. Cost for each right hand side associated with a given matrix: e.g., backward substitution for a direct method, and the actual iterations in an iterative solver.

All things being equal we would in general like to move costs up this list, so that they may be done less frequently, and to optimize the implementations down the list, as the number of applications of these cost components increases as we move down the list. The setup cost of the per-mesh phase is of the same order as that of one solve on our linear test case in the next chapter, though this is very machine and problem dependent, and it scales about as well as the solver. Finite element applications require many solves on each configuration, so this cost can be amortized over the applications of the solver; we therefore do not focus on the per-configuration cost. The figure below shows the high level cost features that we measure; the leaf boxes of the cost tree are where actual work is done and are the high level code segments that are measured later to give an overview of the end-to-end costs of a finite element simulation.

Now, we want to solve interesting problems cheaply, but how do we know when, or by how much, we have succeeded? To judge success we measure efficiency, the percentage of optimal performance; this gives us a metric that tells us the fraction of perfection that we have achieved for the code or any subcomponent.

FE Costs
  Configuration costs: partitioning (Athena/FEAP), restriction construction (Prometheus)
  Solve costs:
    Solve setup: fine grid creation (FEAP), R * A * P (Epimetheus), subdomain factorizations (PETSc)
    Solve for "x" (PETSc)

Figure: Finite element cost structure (the code segment responsible for each cost is shown in parentheses)

Efficiency measures

We define parallel efficiency e, or scaled efficiency, as the time to solve a problem of size n_1 on one processor divided by the time to solve a refined discretization of the problem with n_P equations on P processors, such that P n_1 = n_P. Thus the parallel efficiency is e = T(1) / T(P), where T(P) is the time to solve the problem discretization for P processors.

We use time efficiency because it directly addresses the total cost of solving a problem when the machine is being used efficiently. Unscaled speedup plots (cost of solving a fixed sized problem vs. number of processors) are a useful diagnostic tool for judging the robustness of an implementation, but are not useful in directly estimating the cost of using a particular machine. Thus our model of our environment is that we have a large number of processors with finite memory per processor, many patient users with large implicit finite element problems, and perfect job scheduling, and that we select the number of processors to use for each job so as to maximize job throughput on the entire machine.

We separate the serial efficiency and the parallel efficiency, the total efficiency being the product of the two. We define the following efficiency measures, or sources of inefficiency:

serial efficiency (s): the fraction of the peak megaflop rate (Mflop/sec) achieved by the serial implementation. Peak is defined as the Mflop rate of the fastest flop rate on any floating point dominated application; the fastest dense matrix-matrix multiply is used for the peak flop rate if available, and the Linpack benchmark is used otherwise.

work efficiency (w): the fraction of flops in the parallel algorithm that are not redundant, i.e., the number of flops in the serial algorithm divided by the number of flops in the parallel algorithm on the same problem discretization.

scale efficiency (z): the scalability of the algorithm with respect to flops per unknown in the RAM complexity model, i.e., the flops done per unknown as the problem size increases. z is similar to w in that it relates to flop inefficiency, though it is distinct from w: work efficiency w is related to the number of processors used, while scale efficiency z is related to the size of the problem. Note that the scaled efficiency plots that we use extensively will in general measure all of the parallel efficiency measures present in the system.

load balance (l): the ratio of the average to the maximum amount of work (flops) that a processor does for an operation. This is easily measured and defined, as we do not use any nonuniform algorithmic constructs (i.e., we do not use task level parallelism).

communication efficiency (c): the highest percentage of time that a processor is not waiting, processing, packing data, or doing any other form of work associated with interprocessor communication.

All of the algorithm components that we use, with the exception of the fine grid matrix creation (FEAP's element state determination), have perfect work efficiency, i.e., w = 1. In our numerical experiments, during the fine matrix creation each processor calculates all of the elements that any vertex on the processor touches, and the stiffness matrix entries that belong on another processor are discarded; this leads to a work efficiency somewhat below one for our test problems and common solver configuration (i.e., trilinear hexahedra and our usual number of dof per processor). This redundant evaluation of elements runs faster than exclusive element evaluation on many machines, as no communication is required.

Full multigrid is perfectly scalable in the sense that the amount of work (flops) required for each degree of freedom approaches a constant as the number of degrees of freedom goes to infinity; thus z approaches a constant in theory. In fact our one processor problems are large enough that the number of flops per equation is a constant, or as close to a constant as we can effectively measure, as the problem is scaled up. Thus the z efficiencies are close to one and we do not discuss them further. Therefore we concentrate on the inefficiencies c, l, and s.

Efficiency plots

Efficiency has the property that we can measure or model the sources of inefficiency and simply multiply them to get a total efficiency. The figure below shows a cartoon of a typical efficiency plot, and it is instructive to point out some of its more salient features.

Figure: Efficiency plot structure. Efficiency, from 0.0 to 1.0 (with 1.0 as perfect efficiency and a possible region of superlinear speedup above it), vs. the number of processors p from 1 to P; the plot shows the measured time-based efficiency Time(1) n_p / (Time(p) n_1 p) together with a modeled subdivision into factors e_1 and e_2.

If we identify two sources of inefficiency, e_1 and e_2, and we can effectively model only one of them (e.g., e_1, load balance, could be estimated with the number of flops on each processor or some other measurable quantity), then we can plot e_1 and the total measured quantity (time) as in the figure, and their ratio is e_2. Other curves could also be modeled by various means, as described in this section, to provide approximate data on the particular sources of inefficiency. Note that this characterization implies that inefficiencies are decoupled, in the sense that they could be isolated and would have the same value regardless of the presence of the others; this is not in general true, but is often a close approximation in our application. Note also that these efficiency plots require that we prescribe the size of each problem n_p for each number of processors p; this is not in general possible, so we pick the number of processors for each problem discretization and add a factor n_p / (n_1 p) to the efficiency plots to adjust for this error in our experimental structure.
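The bookkeeping behind such a plot is compact. The following sketch is a minimal illustration, with made-up timings, assuming only the scaled-efficiency definition and the n_p / (n_1 p) correction factor just described.

    def scaled_efficiency(t1, n1, runs):
        """Scaled parallel efficiency with the n_p/(n_1*p) size-correction factor.

        t1   -- time on one processor for the base problem of size n1
        runs -- list of (p, n_p, t_p): processors, problem size, measured time
        Returns a list of (p, efficiency) pairs.
        """
        return [(p, (t1 / t_p) * (n_p / (n1 * p))) for p, n_p, t_p in runs]

    # Illustrative (not measured) data: problem size grows roughly with p.
    base_time, base_size = 100.0, 15_000
    samples = [(4, 61_000, 103.0), (16, 243_000, 108.0), (64, 980_000, 115.0)]
    for p, e in scaled_efficiency(base_time, base_size, samples):
        print(f"p = {p:3d}  efficiency = {e:.2f}")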

We also occasionally plot total Mflop/sec against the number of processors used, again with scaled problem discretizations. Total Mflop/sec plots are useful because the data is easily measured for the solve, or for some part of the code with good load balance, and they provide a useful diagnostic tool for seeing how efficiently the machine is being used in the floating point intensive parts of the code. Mflop/sec data is not useful for comparison with non-floating-point intensive parts of the code, like the partitioning and the coarse grid setup phases, nor is it useful in understanding costs, as it does not include the convergence rate, a core component of the cost of iterative methods.

Computational model of multigrid

This section introduces a more detailed performance model of multigrid. First, this model does not take into account the convergence rate of the solver in question, and thus we only discuss the time to solve a problem with a given number of iterations; configuration costs are not generally included, so the only costs that we will be concerned with are the matrix setup and the solve costs. Therefore we do not present a complete performance model that could be used, for example, to choose the number of subdomains for the block Jacobi preconditioner on a given number of processors on a given machine, and so on. We concentrate on analyzing the run time of components for a particular solver configuration on a particular type of problem, to provide concrete and more readable expressions. To avoid a glut of variables we fix the number of iterations at a value typical of our test problems. Note that we show in the next chapter that the number of iterations that our solver requires to achieve a fixed relative reduction in the residual is not affected by the scale of the problem; thus the assumption of a fixed number of iterations for any one problem is valid.

We further simplify the analysis by making assumptions about the class of problems and the solver configuration in our model. In particular we assume that we are working with uniformly "thick body" problems, i.e., problems in which coarsening can take place in all three dimensions at all levels throughout the domain. We assume that we are using the symmetrized multiplicative Schwarz method defined earlier. Our model could easily be modified to accommodate additive methods, but we do not conduct numerical experiments with additive methods, as we have not spent time optimizing them (i.e., we have not introduced the task level parallelism that makes additive methods attractive).

We assume that the top (coarsest) grid has a small, fixed number of degrees of freedom. This is a rather small problem for the top grid, as can be inferred from the fact that the time spent in the top grid is minute compared to the time of the penultimate grid. There are two reasons for this less than optimal choice of parameters (i.e., using too many grids):

1. Memory: we are often memory bound, and as we have a fully symmetric programming model, using a smaller top grid reduces the maximum memory used on any one processor.

2. Program development: we want to scale to much larger problems, so we want to harden the algorithms so as to ease the eventual transition to larger problems. A tiny coarse grid may introduce robustness problems in problems with more complex geometries, so we wish to reveal these issues in preparation for larger machines.

We currently use PETSc's serial sparse solver for the top grid; future work will include incorporating a parallel direct solver for the coarse grid problem.

For simplicity we restrict ourselves to one set of solver parameters typical of our numerical experiments. We assume s = 2 iterations of a CG smoother preconditioned with block Jacobi with n_b equations per block; thus the number of blocks on grid i is b_i = ceil(n_i / n_b). We let the number of levels be L = ceil(log_8(n_0 / n_L)), consistent with the factor-of-eight reduction per grid assumed below; this assumes that the coarsest grid has about n_L equations. The problem size is n_0 = ndf |V| equations, with ndf degrees of freedom per vertex, and n_i denotes the number of equations on grid i. These will eventually be used in the multigrid inventory (see the cost inventory figure at the end of this chapter) to give the total component count of a multigrid solve as a function of the problem size n and the number of iterations k. We denote the cost in time of an operator x by T(x), the floating point cost by F(x), and the communication cost by C(x). We do not include load balance in F(x), nor do we explicitly include nonuniform communication or network contention costs between individual processors in C(x).

This section proceeds as follows. First we rewrite the full multigrid algorithm in more detail than we have previously. We then discuss models for the sizes of the coarse grids and of the ghost vertices, used primarily in the communication cost models. The costs of the components are then modeled in terms of floating point operations and flop rates; communication costs are modeled next; finally we integrate the communication and floating point costs. Matrix-vector products are modeled in more detail, as they are one of the largest costs in most multigrid solves.

Multigrid component labeling

This section enumerates the complexity structure of full multigrid. First, the figure below shows the full multigrid cycle that we use as a preconditioner for a Krylov subspace method in all of our numerical experiments.

MultiGridFcycle(A_i, b) returns x:
    if IsCoarsest(A_i)
        x <- A_i^{-1} b                          (direct solve)
    else
        r <- R_i b                               (restrict the residual b to the coarser grid)
        d <- MultiGridFcycle(A_{i+1}, r)         (recursion: first correction from the coarser grids)
        x <- P_i d                               (interpolate correction back to this grid i)
        r <- b - A_i x                           (form new residual)
        x <- x + MultiGridVcycle(A_i, r)         (approximate solve; add correction)

MultiGridVcycle(A_i, b) returns x:
    if IsCoarsest(A_i)
        x <- A_i^{-1} b                          (direct solve)
    else
        x <- Smooth(A_i, b)                      (smooth error)
        r <- b - A_i x                           (form new residual)
        r <- R_i r                               (restrict the residual r to the coarser grid)
        d <- MultiGridVcycle(A_{i+1}, r)         (get coarse grid correction)
        x <- x + P_i d                           (interpolate correction back to this grid i; add)
        r <- b - A_i x                           (form new residual)
        x <- x + Smooth(A_i, r)                  (smooth error; add correction)

Figure: Full multigrid cycle
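For readers who prefer executable form, here is a minimal runnable rendering of the two cycles above. It is only a sketch, not the dissertation's solver: it uses dense matrices and a simple damped Jacobi smoother as a stand-in for the CG/block Jacobi smoother, and it assumes that the caller supplies the grid hierarchy A[0..L] and restriction operators R[0..L-1] (e.g., built with the Galerkin triple product R_i A_i R_i^T).

    import numpy as np

    def smooth(A, b, steps=2, omega=0.6):
        """Damped Jacobi smoother: returns an approximate solution of A x = b
        starting from x = 0 (a stand-in for the CG/block-Jacobi smoother)."""
        x = np.zeros_like(b)
        d_inv = 1.0 / np.diag(A)
        for _ in range(steps):
            x = x + omega * d_inv * (b - A @ x)
        return x

    def v_cycle(i, b, A, R):
        """MultiGridVcycle(A_i, b): one V-cycle starting on grid i."""
        if i == len(A) - 1:                     # coarsest grid: direct solve
            return np.linalg.solve(A[i], b)
        x = smooth(A[i], b)                     # pre-smooth
        r = b - A[i] @ x
        x = x + R[i].T @ v_cycle(i + 1, R[i] @ r, A, R)   # coarse grid correction
        r = b - A[i] @ x
        return x + smooth(A[i], r)              # post-smooth

    def f_cycle(i, b, A, R):
        """MultiGridFcycle(A_i, b): the full multigrid cycle used as preconditioner."""
        if i == len(A) - 1:
            return np.linalg.solve(A[i], b)
        d = f_cycle(i + 1, R[i] @ b, A, R)      # first correction from coarser grids
        x = R[i].T @ d                          # interpolate to this grid
        r = b - A[i] @ x
        return x + v_cycle(i, r, A, R)          # finish with a V-cycle

A two-grid usage would build A[1] = R[0] @ A[0] @ R[0].T from any symmetric positive definite A[0] and a chosen restriction R[0], and then call f_cycle(0, b, A, R) inside a Krylov iteration, as the solver described here does.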

Coarse grid size and density

To model the scale of our coarse matrices we assume that each grid has a factor of 2^D (= 8 for our three dimensional problems) fewer vertices than the previous grid. This comes from the assumption that every other vertex is being promoted to the next grid in each dimension; this is true for regular meshes. This assumption is not always accurate for irregular meshes; even on a regular grid a maximal independent set (MIS, discussed in an earlier chapter) could pick every third vertex, resulting in a factor of 27 reduction at each level.

We intentionally randomize the order of the interior vertices in our MIS implementation (subject to the constraints discussed earlier) to increase the reduction factor; the reduction factors that we actually observe differ between the finest-to-first-coarse transition and the subsequent grids. Modeling this reduction factor is further complicated by the fact that the graphs on the coarse grids are not in general well defined, in the sense that they are not graphs of valid finite element meshes with either tetrahedral or hexahedral elements. In fact our code has three graphs associated with each coarse grid i+1:

1. The nonzero structure of A_{i+1} = R_i A_i R_i^T, which we call G_exact.

2. The graph of the Delaunay tessellation described in an earlier chapter, G_Delaunay; note that this is a valid finite element graph.

3. The approximate adjacencies that we create immediately after the MIS to facilitate efficient implementation of the symbolic phase of our code.

These algorithmic complexities also present difficulties in making a priori estimates of the number of nonzeros in the coarse grid matrices. We have observed fairly consistent reduction rates in the number of nonzeros from grid i to grid i+1 on the test problem of the next chapter; thus we use a single nonzero (edge) reduction factor for all grids as an approximation.

Number of adjacent processor subdomains and size of ghost vertex lists

To model communication costs we first characterize the maximum number of adjacent (neighboring) processors, N_neigh, and the number of ghost vertices with which any processor has contact. To estimate these quantities we take advantage of the fact that our problems come from physical domains. The mesh partitioner attempts to minimize edge cuts in the graph; this has the physical analogy of minimizing the surface to volume ratio of the partitions. Optimal partitions are hexagons for a 2D continuum and rhombic dodecahedra for a 3D continuum, so the optimal N_neigh in a regular 3D mesh is modest; for 3D partitioning with cubes N_neigh is 26. We use a value of N_neigh typical of what we see in practice; that is, N_neigh is the maximum number of neighbors for any processor at the per-processor problem sizes and processor counts of our experiments. Note that this number will most likely go down with increased dof per processor and better mesh partitioners, and up with more processors, as statistical effects of nonuniform partitioning come into play.

Now we estimate the number of ghost vertices (off-processor vertices connected to a local vertex) for each processor, e.g., the number of entries of x that we need from other processors to compute the local entries of y <- A x in a matrix-vector product. Likewise we need the number of local boundary vertices whose values must be sent to other processors. We can again use the physical properties of our graphs by assuming that a processor's subdomain is a sphere with radius r_p and that each vertex fills a unit volume. With |V_p| processor vertices we have

    |V_p| = (4/3) pi r_p^3 ,   i.e.,   r_p = (3 |V_p| / (4 pi))^{1/3} .

By assuming that the thickness of the boundary layer of vertices is one, the interior vertices V_p^I occupy a sphere of radius r_p - 1; after truncation of higher order terms this gives

    V_p^I ~ |V_p| - 4 pi r_p^2 ~ |V_p| - (36 pi)^{1/3} |V_p|^{2/3} .

Substituting the partition sizes of our numerical experiments (several thousand vertices per processor on the fine grid) into this expression gives interior vertex counts in line with the values that we measure. To get an expression for the number of ghost vertices V_p^G we assume that the ghost layer is one unit thick, so that V_p^G = (4/3) pi ((r_p + 1)^3 - r_p^3), and after dropping the lowest order terms we get

    V_p^G ~ (36 pi)^{1/3} |V_p|^{2/3} + (48 pi^2)^{1/3} |V_p|^{1/3} ,

and similarly, since V_p^B = (4/3) pi (r_p^3 - (r_p - 1)^3),

    V_p^B ~ (36 pi)^{1/3} |V_p|^{2/3} - (48 pi^2)^{1/3} |V_p|^{1/3} .

An alternative model is to assume that the subdomains are cubes; we can then use an explicit discrete graph with a regular grid of hexahedra by assuming that each subdomain has N_p vertices on each side of its cube. The number of vertices per subdomain is then |V_p| = N_p^3. Further, we can model the number of interior vertices (vertices with no contact with other processors) as

    V_p^I = (N_p - 2)^3 ,

the number of boundary vertices as

    V_p^B = |V_p| - (N_p - 2)^3 ~ 6 |V_p|^{2/3} - 12 |V_p|^{1/3} ,

and the number of ghost vertices as

    V_p^G = (N_p + 2)^3 - N_p^3 ~ 6 |V_p|^{2/3} + 12 |V_p|^{1/3} ,

which is largely similar to the model based on spheres. Again, for the partition sizes of our test problems this gives interior vertex counts in line with what we measure.
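A few lines of code make the surface-to-volume character of these estimates concrete. This sketch evaluates the cube-model interior, boundary, and ghost vertex counts; the subdomain size used in the example is an arbitrary illustration, not a measured partition size.

    def cube_model(num_vertices):
        """Interior/boundary/ghost vertex estimates for a cubic subdomain
        with num_vertices = N_p**3 local vertices and unit-thick layers."""
        n_side = round(num_vertices ** (1.0 / 3.0))
        interior = max(n_side - 2, 0) ** 3
        boundary = n_side ** 3 - interior
        ghosts = (n_side + 2) ** 3 - n_side ** 3
        return interior, boundary, ghosts

    # Example: a subdomain of 17**3 = 4913 vertices (illustrative size only).
    inner, bnd, ghost = cube_model(4913)
    print(f"interior={inner}  boundary={bnd}  ghosts={ghost}")
    # The leading-order approximations 6*|V|^(2/3) -/+ 12*|V|^(1/3):
    v = 4913
    print(6 * v ** (2 / 3) - 12 * v ** (1 / 3), 6 * v ** (2 / 3) + 12 * v ** (1 / 3))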

Floating point costs

This section lists the floating point counts for the components in multigrid; these estimates assume that we are using eight node hexahedral (trilinear) elements. We assume that our subdomains for the block Jacobi preconditioner are small cubes of nodes, which fixes the number of equations per block. Further, we are oriented toward thick body problems, as this is the type of problem that we use for our linear scalability studies in the next chapter.

We need a flop rate to estimate the costs of floating point operations: Mflop/sec (10^6 flops/sec). The Mflop rate is a machine dependent parameter, and even on any one machine we subdivide it into several types. Below are five types of flop rates; their measured values on a Cray T3E and an IBM PowerPC, for a typical problem (about the usual number of unknowns on one processor), are summarized in the table that follows.

M_1 flop/sec: dot products, AXPYs, norms, etc.
M_2 flop/sec: sparse matrix-vector products and solves with an ndf x ndf matrix of double precision scalars per block matrix entry.
M_2a flop/sec: sparse matrix-vector products with one scalar per block matrix entry.
M_3 flop/sec: sparse direct matrix factorization.
M_3a flop/sec: sparse matrix-matrix-matrix products.

Table: Mflop rates for the Mflop types (M_1, M_2, M_2a, M_3, M_3a) in multigrid, measured on a Cray T3E and an IBM PowerPC (each listed with its peak Mflop rate).

Floating point counts are estimated in some cases by measuring the flop counts of representative problems; otherwise they are modeled for typical hexahedral meshes. The figure below summarizes the floating point counts that we use in our analysis, together with the flop rate type that applies to each component; some constants are measured from numerical experiments and the rest are modeled.

F(MVec_i): proportional to n_i (rate M_2; the fine grid count F(MVec_0) is measured)
F(Restrict_i): proportional to n_i (rate M_2a)
F(VecDot_i) = 2 n_i (rate M_1)
F(VecNorm_i) = 2 n_i (rate M_1)
F(Axpy_i) = 2 n_i (rate M_1)
F(TriProd_i): proportional to F(MVec_i), hence to n_i (rate M_3a)
F(F_i^j), F(S_i^j): fixed cost per subdomain block (rates M_3 and M_2 respectively)
F(F_L), F(S_L): fixed coarsest grid costs (rates M_3 and M_2 respectively)

Figure: Floating point counts for the multigrid operators (flop rate type in parentheses)

Communication costs

Communication costs are difficult to model effectively; their difficulty requires that we use the LogP parameters (overhead o and latency L) described at the beginning of this chapter. As with the flop rates above, there are different overhead types for different operations; we include the time to pack messages in the application layer in the overhead. Overhead is subdivided into an overhead alpha for each processor with which we communicate and an overhead beta for each word (double) being communicated, i.e., the cost to send n words to one processor is alpha + beta n. We define three types of (alpha, beta):

1. alpha_1: dot products; the true machine/system software overhead (below the application), essentially the (o + L) cost of the reduction that computes the answer and of the broadcast that disseminates it.

2. (alpha_2, beta_2): vector scatter/gather in sparse matrix-vector products; posting receives and other per-send or per-receive overhead.

3. (alpha_3, beta_3): matrix operations; overhead in the matrix construction, like setting up for each send and receive.

In general N_neigh is multiplied by alpha_2 to get the per-message time in matrix-vector products. Note that we do not include the initial setup phase for the data structures in alpha_2 and alpha_3 (i.e., allocating buffers, communicating maps between local and nonlocal vertices for the sends and receives, etc.).

For completeness we describe (alpha_3, beta_3) in the same way as (alpha_2, beta_2). Dot product communication has no bandwidth dependence on the numbers of vertices, and we assume that all active processors on a level i have at least one vertex by definition, so we need only define alpha_1. beta_2 could be a function of ndf and V_p^G, the number of ghost vertices, where ndf is the number of scalars associated with each vertex in a vector. beta_3 could be a function of x, ndf, and V_p^G, where an average of x matrix entries lie in a row associated with an off-processor vertex that must be communicated. We estimate these parameters with experimental data, and as we only test problems with ndf = 3 we assume that beta_2 is a linear function of ndf and that beta_3 is a quadratic function of ndf; we could also include ndf in the model of V_p^G, as we do elsewhere, to get a more accurate complexity.

Matrix-vector products use indirect indexing into a dense vector to copy (gather) values to a buffer for sending to a neighbor processor; this process is reversed on the receiving side as processors unpack a message and scatter it into a sparse vector. Matrix triple products calculate some off-processor values and accumulate them into a local sparse matrix; the communication costs occur after the floating point work is complete, when each processor copies its off-processor data into messages for sending and then receives the corresponding messages and accumulates this data into its local sparse matrix. We will not measure alpha_3 or beta_3, as these values are inherently very implementation and data structure dependent and they are not as large a cost in the solve as alpha_2 and beta_2. We use alpha_2 and beta_2 here to derive the order of the complexity for matrix-vector products, and we provide quantitative measures for alpha_2 and beta_2 below.

In constructing our complexity model we use the subdomain-as-cube model above, which gives, on grid i with ndf dof per vertex,

    C(MVec_i) ~ alpha_2 N_neigh + beta_2 ndf V_p^G ,   with V_p^G ~ 6 |V_p|^{2/3} + 12 |V_p|^{1/3} .

Note that we ignore the latency L, as we can overlap communication and computation effectively, at least on the finer grids (see the discussion of matrix-vector products below). Additionally we have

    C(VecDot_i) ~ ceil(log P_i) (o + L) ,

using the raw machine overhead o and latency L, and

    C(VecNorm_i) ~ C(VecDot_i) ,
    C(Axpy_i) = 0 .

The number of off-processor ghost vertices for the restriction operator (for which we require values of the left hand side vector) is about the same as that for the matrix-vector product. Again, as we have ndf dof per vertex and several coarse grid vertices in general connected to each fine grid vertex through the restriction operator,

    C(Restrict_i) ~ alpha_2 N_neigh + beta_2 ndf V_p^G .

For C(TriProd_i), assume that a processor adds values into all graph edges (off-diagonal matrix entries) to and from its ghost vertices and into the self edges (diagonal entries) on those vertices. Additionally a processor can potentially add values to the off-processor edges between two ghost vertices. With this we can estimate the number of off-processor matrix entries that a processor sends as the number of ghost vertices (diagonal entries), plus the edges between processors (i.e., edges (v, w) with v local and w a ghost), plus the edges between ghost vertices. To simplify C(TriProd_i) we assume that all ghost vertices of processor p on grid i have about half of their matrix row entries touched by processor p; thus

    C(TriProd_i) ~ alpha_3 N_neigh + beta_3 ndf V_p^G(i) ,

with V_p^G(i) given by the cube model above. This expression is in line with our observations on the test problem of the next chapter.
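To make the structure of these communication models explicit, the sketch below codes them directly. It is only a sketch under the assumptions above: the cube-model ghost count, the two-parameter (alpha, beta) message cost, and illustrative placeholder values for alpha_2, beta_2, o, and L; none of these numbers are the dissertation's measurements.

    import math

    def ghosts_cube(vertices_per_proc):
        """Cube-model ghost vertex estimate: 6|V|^(2/3) + 12|V|^(1/3)."""
        return 6 * vertices_per_proc ** (2 / 3) + 12 * vertices_per_proc ** (1 / 3)

    def c_mvec(vertices_per_proc, ndf, n_neigh, alpha2, beta2):
        """Communication time model for a matrix-vector product on one grid."""
        return alpha2 * n_neigh + beta2 * ndf * ghosts_cube(vertices_per_proc)

    def c_vecdot(p, o, lat):
        """Communication time model for a global dot product (reduction + broadcast)."""
        return math.ceil(math.log2(max(p, 2))) * (o + lat)

    # Illustrative parameters only (seconds); not measured values.
    alpha2, beta2, o, lat = 50e-6, 0.05e-6, 5e-6, 10e-6
    print("C(MVec)   ~", c_mvec(vertices_per_proc=5000, ndf=3, n_neigh=26,
                                alpha2=alpha2, beta2=beta2))
    print("C(VecDot) ~", c_vecdot(p=512, o=o, lat=lat))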

Total cost of components

For most of the components in our multigrid model we can simply add the communication time to the floating point time, as there is either no opportunity for overlapping communication and computation, or we (or PETSc) have not implemented it. The only computational component that requires that we articulate the model further is MVec, discussed below; the rest of the multigrid components are simply modeled as the sum of their computation and communication times, as shown in the figure below.

T(Restrict_i) ~ F(Restrict_i) / M_2a + alpha_2 N_neigh + beta_2 ndf V_p^G
T(VecDot_i)   ~ 2 (n_i / P_i) / M_1 + ceil(log P_i) (o + L)
T(VecNorm_i)  ~ 2 (n_i / P_i) / M_1 + ceil(log P_i) (o + L)
T(Axpy_i)     ~ 2 (n_i / P_i) / M_1
T(TriProd_i)  ~ F(TriProd_i) / M_3a + alpha_3 N_neigh + beta_3 ndf V_p^G
T(F_i^j)      ~ F(F_i^j) / M_3
T(S_i^j)      ~ F(S_i^j) / M_2
T(F_L)        ~ F(F_L) / M_3
T(S_L)        ~ F(S_L) / M_2

    with V_p^G ~ 6 (n_i / (ndf P_i))^{2/3} + 12 (n_i / (ndf P_i))^{1/3} from the cube model.

Figure: Costs for the multigrid operators (floating point term divided by the appropriate Mflop rate, plus the communication term)

Matrix-vector product cost

To form the total cost of matrix-vector products we look at the matrix data structure in more detail; this is because matrix-vector products are responsible for most of the time in a multigrid solve, and the communication and computation patterns of our matrix-vector products are not obvious and allow for some overlap of communication and computation. PETSc implements its parallel matrix class with two serial sparse matrices on each processor, plus a small amount of global data. The local "short fat" submatrix block of global rows partitioned to processor p is column partitioned into the diagonal block A_p of the global matrix and the rest, the off-diagonal local part B_p. The advantage of this scheme is that A_p only works on the local parts of the source and destination vectors x and b, respectively, and can thus be applied without any communication. The PETSc method of performing b <- A x on processor i is as follows (note that matrix subscripts here refer to processor submatrices, not grids):

1. Post receives for the necessary parts (ghost values) of x_j, j != i (the actual receives happen in step 4).

2. Send the necessary parts of x_i to all processors j that touch i (the corresponding receives happen in step 4).

3. b_i <- A_i x_i (work on local data).

4. Receive all off-processor entries of the x_j.

5. b_i <- b_i + B_i x_j.

With this we can state a cost model for matrix-vector products:

    T(MVec_i) ~ alpha_2 N_neigh + beta_2 ndf V_p^G + F_diag / M_2 + F_offdiag / M_2 ,

where the diagonal block flops F_diag scale as ndf^2 times the number of local vertices (V_p^I + V_p^B) and the off-diagonal block flops F_offdiag scale as ndf^2 times the number of boundary vertices V_p^B, with

    V_p^G ~ 6 (n_i / (ndf P_i))^{2/3} + 12 (n_i / (ndf P_i))^{1/3} ,
    V_p^B ~ 6 (n_i / (ndf P_i))^{2/3} - 12 (n_i / (ndf P_i))^{1/3} .

Here we assume that there is enough work in the b_i <- A_i x_i term of the matrix-vector product to hide the latency of the communication; this will not be the case in general, especially on the coarsest grids.
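The overlap assumption can be made explicit in a model: local diagonal-block work can hide the message transfer, so the modeled time is the off-diagonal work plus the larger of the local work and the communication. The sketch below expresses this; it is a minimal illustration using the cube-model ghost and boundary counts, with placeholder machine parameters and an arbitrary flops-per-vertex constant, none of which are measured values.

    def t_mvec(vertices, ndf, n_neigh, flops_per_pair, m2_rate, alpha2, beta2):
        """Modeled matrix-vector product time on one processor, with the
        send/receive phases overlapped by the local diagonal-block work."""
        boundary = 6 * vertices ** (2 / 3) - 12 * vertices ** (1 / 3)
        ghosts = 6 * vertices ** (2 / 3) + 12 * vertices ** (1 / 3)

        local_flops = flops_per_pair * ndf ** 2 * vertices        # A_p x_p
        offdiag_flops = flops_per_pair * ndf ** 2 * boundary      # B_p x_ghost
        comm = alpha2 * n_neigh + beta2 * ndf * ghosts            # pack/send/recv

        # Communication is overlapped with the local (diagonal block) work.
        return max(local_flops / m2_rate, comm) + offdiag_flops / m2_rate

    # Illustrative parameters only.
    print(t_mvec(vertices=5000, ndf=3, n_neigh=26, flops_per_pair=54,
                 m2_rate=60e6, alpha2=50e-6, beta2=0.05e-6))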

We can measure the maximum time that any processor spends in each of these phases, with varying numbers of processors, to estimate the values of alpha_2 and beta_2. The table below shows these maximum times for the fine grid of the test problem described in the next chapter, at a fixed number of equations per processor, on the Cray T3E at NERSC.

Table: Fine grid matrix-vector product phase times (fixed equations per processor). For each processor count the table reports the total dof, the maximum send time t_send^max (with the max/min ratio), the maximum diagonal block matvec time (max/min) and its per-processor Mflop/sec, the maximum receive time (max/min), the maximum off-diagonal block matvec time (max/min) and its per-processor Mflop/sec, the maximum number of neighbor processors N_neigh, and the average number of ghost vertices in each neighbor processor, n_g.

Note that this data is from the standard summary output in PETSc, although we added ten lines of code to PETSc's matrix-vector product routine to start and stop timers for each of the five phases.

To use this data to estimate alpha_2 and beta_2, we first notice that our model predicts that the send and receive phases should be equal; this is not the case. The sends are nonblocking and the receives can potentially block, but on the fine grid blocking is not likely, as there is much work (the local diagonal block matrix-vector product) to hide the latency of the sends. We use the send times for this demonstration.

For a model of the slowest processor's communication time we use the maximum time and the maximum number of neighbor processors, although we do not know whether the processor with the largest number of neighbors was in fact the processor with the largest time; it is natural to assume so. We also assume that all messages have the same length, as the average message length is the only data available, and we use this to calculate the average number of ghost vertices in each message, n_g. We can now use the data from the three samples to give us three equations in two unknowns, of the form

    alpha_2 N_neigh + beta_2 N_neigh n_g ndf = t_send^max .

The least squares fit for this data gives values for alpha_2 and beta_2; since n_g is large on the fine grid, we can neglect the alpha_2 term there and calculate beta_2 alone with a least squares fit. The figure below shows a comparison of this model with the data.

Figure: Comparison of the model with experimental data for the send phase of the matrix-vector product on the fine grid (measured send times, in seconds per send phase, vs. the approximate number of ghost vertices, together with the fitted line beta_2 times the number of ghosts).
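The fitting procedure just described is a one-liner with a linear least squares routine. The sketch below reproduces it with made-up sample data for three processor counts, as in the text; the numbers are placeholders, not the measured times. It first fits alpha_2 and beta_2 together and then beta_2 alone.

    import numpy as np

    # Placeholder samples: (N_neigh, n_g, ndf, max send time in seconds).
    samples = [(20, 900, 3, 1.1e-3),
               (24, 1900, 3, 2.1e-3),
               (26, 3600, 3, 3.6e-3)]

    # Model: alpha2*N_neigh + beta2*N_neigh*n_g*ndf = t_send_max
    A = np.array([[nn, nn * ng * ndf] for nn, ng, ndf, _ in samples], dtype=float)
    t = np.array([s[-1] for s in samples])

    (alpha2, beta2), *_ = np.linalg.lstsq(A, t, rcond=None)
    print("two-parameter fit:  alpha2 =", alpha2, " beta2 =", beta2)

    # On the fine grid n_g is large, so neglect the alpha2 term and fit beta2 alone.
    beta2_only = float(np.linalg.lstsq(A[:, 1:], t, rcond=None)[0][0])
    print("beta2-only fit:     beta2 =", beta2_only)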

Thus we have an approximate measure of alpha_2 and beta_2 (in seconds) on the Cray T3E, for three dof per vertex and a hexahedral mesh of a thick body problem. The next step would be to articulate beta_2 further, to include a term for the average number of vertices per processor, so as to allow for the modeling of the coarse grids; we conclude the development of this model here, however.

This exercise is an example of the analyses that would have to be carried out for alpha and beta, for a machine and a problem class, to construct an initial complexity model of multigrid. We do not pursue this model further in this dissertation, but the next section concludes with a sketch of the path for pursuing it.

Conclusion and future work in multigrid complexity modeling

This chapter has provided an introduction to the complexity issues of large scale unstructured multigrid solvers on the common parallel architectures and machines of today. We have developed some of the infrastructure for our complexity model and have articulated some of the components of this model; we have not, however, completed it.

To complete this model one would first need to complete the expressions for alpha_2 and beta_2 developed in the last section for the coarse grid matrices (i.e., extend the model of alpha_2 and beta_2 as a function of the number of equations per processor and of machine latency); one could then construct models for alpha_3 and beta_3. These expressions could then be substituted into the component models presented in this chapter; these component models could in turn be substituted into an inventory (shown in the figure below) of the applications of each component in the solver. The resulting cost estimate could then be compared to experimental data and refined as necessary to develop an accurate model.

After refining this model for thick body hexahedral meshes, one could extend the model to accommodate other types of problem classes (e.g., thin body problems like shells) by changing some of the appropriate parameters, such as the grid reduction rates. One would then continue to iteratively refine the model and perturb the problems to develop a more comprehensive, accurate model of multigrid solvers on unstructured finite element meshes. Some of the payoffs of the modeling introduced in this chapter would be, as stated at the beginning of the chapter, to aid algorithmic design, design better computers, inform purchasing decisions, and verify optimal or correct code installations.

Setup of the multigrid preconditioner:
    TriProd_i, for i = 1, ..., L
    F_L: factor the coarsest grid
    F_i^j, for i < L and j = 1, ..., b_i: factor the block diagonal matrices for block Jacobi

k conjugate gradient iterations with the multigrid preconditioner; each iteration applies:
    MVec_0, VecDot, Axpy, and Norm operations for the CG accelerator, and
    one application of MultiGridFcycle(A_0, b), which uses:
        L applications of S_L (solves on the coarsest grid)
        Restrict_i, MVec_i, and Axpy_i on each grid i < L for the F-cycle recursion
        one application of MultiGridVcycle(A_i, b) for each grid i < L, each of which uses:
            MVec_i, Restrict_i + Interpolate_i, and Axpy_i operations (counts depending on the level i), and
            s applications of Smooth(A_i, b), each of which uses:
                MVec_i, VecDot, and Axpy operations, and
                b_i applications of S_i^j (the block Jacobi preconditioner)

Figure: Cost inventory of CG with a full multigrid preconditioner

Chapter

Linear scalability studies

This chapter presents scalability studies of our solver on an IBM PowerPC cluster with multi-processor SMP nodes and on a Cray T3E with hundreds of processors, for problems of up to millions of degrees of freedom. We show comparative results on the two platforms and look into some general and detailed solver performance issues.

Introduction

This chapter proceeds as follows. Notation and solver configuration are introduced first; we then introduce our linear test problem and present scalability studies of our solver on a Cray T3E and on an IBM PowerPC cluster. We then show numerical experiments for our grid agglomeration strategies, along with an analysis of the time spent in each grid of a sample problem. We present end-to-end performance data at the end of the chapter to provide a view of the performance of our entire parallel finite element package.

Our overall results show that our algorithm is indeed scalable: on problems of up to millions of degrees of freedom we obtain good solver parallel efficiency, both on hundreds of processors of a Cray T3E and on the SMP nodes of an IBM PowerPC cluster.

Solver configuration and problem definitions

We denote a problem x by P(x)_k, k being the number of thousands of dof that we put on each processor. A problem is a geometry (or domain) with boundary conditions and a finite element discretization (element formulation, material properties, etc.) for each subdomain, but does not include a mesh of the domain. For instance our first problem, the included sphere described below, run with a mesh and number of processors that results in about 15,000 dof per processor, carries the subscript 15 in this notation.

Our solver uses a block Jacobi preconditioner for a CG smoother for our full multigrid preconditioner of a CG solver. Two pre and post smoothing steps are used throughout our numerical experiments, and we use a convergence tolerance of 10^{-6}; i.e., unless otherwise stated we declare convergence of the solution x when ||A x - b|| / ||b|| <= 10^{-6}.
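The stopping test is the standard relative residual criterion. As a minimal illustration, assuming only the criterion stated above (the surrounding preconditioned CG loop is omitted), the check is:

    import numpy as np

    def converged(A, x, b, rtol=1e-6):
        """Convergence test used throughout the experiments:
        declare convergence when ||A x - b|| / ||b|| <= rtol."""
        return np.linalg.norm(A @ x - b) / np.linalg.norm(b) <= rtol

    # Tiny example with an SPD matrix and its exact solution.
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x = np.linalg.solve(A, b)
    print(converged(A, x, b))   # True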

Problem P1

The figure below shows one mesh of a finite element model of a hard sphere included in a soft, somewhat incompressible material; the two materials have different Poisson ratios. The problem is made of eight-vertex hexahedral (trilinear brick) elements and is almost logically regular. All materials are linear, with mixed displacement and pressure elements. A uniform pressure load is applied on the top surface. One octant is modeled, with symmetric "roller" boundary conditions on the cut surfaces; all other surfaces have homogeneous Neumann boundary conditions. The other meshes that we test are of the same physical model but with different scales of discretization.

Figure: 3D finite element mesh of the included sphere (one octant) and its deformed shape.

Scalability studies on a Cray T3E (included sphere, about 15,000 dof per processor)

The figure below shows the times for the primary components of the finite element cost structure (defined in the previous chapter) for one linear finite element solution, after the per-configuration setup phase, for a series of discretizations of the included sphere run on a Cray T3E. The Cray T3E has 450 MHz single processor nodes with a 900 Mflop/sec theoretical peak and local memory of a few hundred megabytes per processor. For each instantiation of the problem we have chosen the number of processors so as to keep about 15,000 dof per processor, and to be a multiple of four, to be consistent with the IBM data in the next section.

Figure: Included sphere solve times on a Cray T3E (about 15,000 dof per processor, up to several hundred processors): fine grid creation (FEAP), R * A * P (Epimetheus), subdomain factorizations (PETSc), solve for "x" (PETSc), total time, and the number of iterations to a relative tolerance of 10^{-6}.

From this data we can make a few observations about the performance characteristics of our solver. First and foremost we see an overall trend of our solver requiring an essentially constant number of iterations: the first and the last solves required nearly the same number of iterations. This means that multigrid converges as quickly as expected. The enthusiasm for the success of this algorithm is tempered by seemingly random and indeterminate increases in the iteration count, particularly on the smaller problems; in this data the iteration counts of the different solves vary within a limited range. This is a shortcoming of the current algorithm, although we have found, since this data was collected, that this variability can be significantly damped by periodically restarting CG.

Figure: Included sphere solve speedup (scaled efficiency) on a Cray T3E at about 15,000 dof per processor, for fine grid creation (FEAP), R * A * P (Epimetheus), subdomain factorizations (PETSc), solve for "x" (PETSc), and the total, together with the problem-size correction factor N(p) / (N(1) p).

We see two possible sources of these fluctuations in iteration counts: the non-global coarse grid meshes (discussed earlier) lead to restriction operators that are not computed from one valid global finite element mesh, and the random selection of facets in our face identification algorithm is probably not optimal. Addressing these issues requires a parallel Delaunay mesh generator and a more rational selection of initial facets with some global view of the problem; the design of such an algorithm that is both effective and has acceptable parallel characteristics is a subject of future research.

Figure shows the efficiency of the data in Figure; we have also plotted the factor N(p)/(N(1)*p) by which we multiply the data, to account for the fact that these problems come in fixed sizes which are not, in general, an integer multiple of the smallest processor version of the problem. Figures and show that the formation of the fine grid stiffness matrix ("Fine grid creation") is scaling well; this is to be expected, as no communication is required and the only sources of inefficiency are load imbalance and the redundant work done on elements that straddle our vertex partitioning. Thus, for the fine grid construction, the communication efficiency c, scale efficiency z, and work efficiency w (defined in §) are all effectively unity, and the overall efficiency is the ratio of the number of elements in the problem divided by the number of processors to the maximum number of element evaluations on any one processor. Load balance l for the element evaluations is not explicitly enforced with our partitioner, though the elements are inherently well balanced, as all elements do the same amount of work in this linear example, and the element load balance is reasonably good, as the number of elements on a processor is closely related to the number of nonzeros on the processor, which is explicitly optimized by the partitioner. Also, on large problems ParMetis has a tendency to put multiple disconnected small subdomains on a few processors, resulting in good load balance in the matrix-vector products but large surface areas on these few processors; these processors evaluate more than the average number of elements per processor, leading to larger load imbalance in the fine grid matrix creation.
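For reference, a restatement of this efficiency model, under the assumption (consistent with the discussion above and the definitions in §) that the factors combine multiplicatively:

\[
  e(p) \;=\; c(p)\, z(p)\, w(p)\, l(p),
  \qquad
  l(p) \;=\; \frac{(\text{number of elements})/p}
                  {\displaystyle \max_{q \le p}\,(\text{element evaluations on processor } q)} ,
\]

so that for the fine grid creation, where c = z = w = 1, the efficiency reduces to the element evaluation load balance l(p).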

We can also see that the coarse grid creation (the matrix triple product RAP) is scaling reasonably well and is a small part of the overall solve time. The matrix triple product times in Figure are hindered by what seems to be a non-optimal matrix assembly implementation in PETSc. We have implemented our own assembly wrappers for the PETSc assemble routines that use hash tables to cache the accumulation of matrix entries until the floating point work is done in the matrix triple product; we then add our cached values to the PETSc matrix with the PETSc assemble routines, once for each matrix entry per processor. This optimization is necessary for the off-processor matrix entries, as PETSc does not implement this well, but even for the on-processor entries, using our assembly wrapper has more than doubled the performance of the matrix triple product on the IBM; the increase on the Cray T3E is not as dramatic.
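The wrapper itself is not listed here; the following is a minimal sketch in C of the caching idea, assuming a fixed-size open-addressing hash table keyed on the (row, column) pair and a caller-supplied flush callback standing in for the actual assembly call (cache_add, cache_flush, and flush_fn are illustrative names, not Prometheus or PETSc routines; table resizing and overflow handling are omitted):

#include <stdlib.h>

/* One cached matrix entry: global (row, col) index and accumulated value.  */
typedef struct { long row, col; double val; int used; } Entry;
typedef struct { Entry *tab; size_t size; } Cache;      /* size: power of 2 */

static size_t hash(long row, long col, size_t size) {
  unsigned long h = (unsigned long)row * 2654435761UL ^ (unsigned long)col * 40503UL;
  return (size_t)(h & (size - 1));
}

static Cache cache_new(size_t size) {                   /* size generously   */
  Cache c; c.size = size;                               /* larger than the   */
  c.tab = (Entry *)calloc(size, sizeof(Entry));         /* expected entries  */
  return c;
}

/* Accumulate a value into the cache; linear probing on collisions.         */
static void cache_add(Cache *c, long row, long col, double val) {
  size_t i = hash(row, col, c->size);
  while (c->tab[i].used && (c->tab[i].row != row || c->tab[i].col != col))
    i = (i + 1) & (c->size - 1);
  if (!c->tab[i].used) {
    c->tab[i].used = 1; c->tab[i].row = row; c->tab[i].col = col; c->tab[i].val = 0.0;
  }
  c->tab[i].val += val;          /* accumulation happens here, not in the library */
}

/* Flush: hand each cached entry to the assembly routine exactly once.      */
static void cache_flush(const Cache *c,
                        void (*flush_fn)(long, long, double, void *), void *ctx) {
  for (size_t i = 0; i < c->size; i++)
    if (c->tab[i].used) flush_fn(c->tab[i].row, c->tab[i].col, c->tab[i].val, ctx);
}

The point of the cache is that each (row, column) pair is handed to the assembly routine exactly once per processor, after all of the floating point accumulation has been done locally.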

To understand the parallel efficiency of the actual solve we remove the scale efficiency noise (the number of iterations) from this data, to ascertain trends in the solver performance (Mflop rate). Figure shows two different decompositions of the solve time (the time to solve for x): parallel efficiency using the flop rate and using the flop count, utilizing our efficiency models from § to visualize the contributing components. Note that the lower curve of each of these plots represents the efficiency of the solve with a fixed number of iterations. Also, the small difference between the average flops per iteration and the horizontal line at unit efficiency (the horizontal axis in the left plot of Figure) is the small amount of sublinear (below the axis) flop efficiency in this set of experiments (see § for a more detailed discussion of efficiency plots such as those in Figure).

[Figure: Inclusion.Sphere flop/iteration/processor efficiency (left) and flop/sec efficiency (right), ~15K dof per processor, Cray T3E. Each plot decomposes the efficiency loss into load imbalance and communication; curves show max and average flop/iteration, time per iteration, and max and average flop/sec vs. number of processors.]

Figure: included sphere solve efficiency decompositions (~15K dof per processor) on the Cray T3E.

From this data we can see that the parallel efficiency of the solve on the T3E holds up well at the largest processor counts.

The serial efficiency of this code can be calculated by dividing the Mflop rate for the entire solve and matrix setup (submatrix factorizations and matrix triple products) by the serial peak flop rate. The serial Mflop rate for the base problem, divided by the peak flop rate (appendix B), gives the serial efficiency s.
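Stated as a formula (the measured and peak rates themselves are machine dependent and are listed in appendix B):

\[
  s \;=\; \frac{\text{sustained Mflop/sec of the solve and matrix setup (submatrix factorizations and } RAP\text{)}}
               {\text{peak Mflop/sec (appendix B)}} .
\]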

Scalability studies on an IBM PowerPC cluster (P)

Performance is a machine dependent quantity, thus we look at this same study as in the last section on a different machine. Note, to be consistent with the Cray runs, we put the same number of dof on each 4-way SMP node of the IBM as on four Cray processors; we however use only two processors per node, as this gives us better performance on the largest problems (the ones of interest), resulting in about 30K dof per active processor. Figures show the same experiments as in the previous section run on an IBM PowerPC cluster at LLNL. Each node has four MHz PowerPC 604e processors, with Mb of memory per node and a peak Mflop rate of Mflop/sec (appendix B). We only use two processors per node and run in a flat MPI programming model; thus we are not explicitly taking advantage of the shared address space on each node.

This data shows that the efficiency of the solve is not scaling well, as we are only

[Figure: Inclusion.Sphere solve times (~30K dof per processor) on the IBM PowerPC cluster, two processors per node. Time and iteration count (tol = 10^-6) vs. number of processors for fine grid creation (FEAP), R*A*P (Epimetheus), subdomain factorizations (PETSc), solve for "x" (PETSc), and total.]

Figure: included sphere solve times (~30K dof per processor, two active processors per node) on the IBM PowerPC cluster.

running at a modest parallel efficiency (Figure) for the largest problem. The subdomain factorization phase runs very poorly in some cases; this is most likely due to paging caused by some non-scalable constructs in PETSc, which means we should have run these problems with fewer equations per node; however, we wanted to remain consistent with the Cray data on this problem.

To understand the parallel efficiency of the actual solve we again remove the scale efficiency noise (the number of iterations) from this data, to ascertain trends in the solver performance (Mflop rate). Figure shows two different decompositions of the solve time (the time to solve for x): parallel efficiency using the flop rate and using the flop count, utilizing the efficiency models from § to visualize the contributing components. See § for a more detailed discussion of efficiency plots such as those in Figure.

The serial efficiency of this code can again be calculated by dividing the Mflop rate for the entire solve and matrix setup (submatrix factorizations and matrix triple products) by the serial peak flop rate; the serial Mflop rate for the base problem, divided by the peak flop rate (appendix B), gives the serial efficiency s.

[Figure: Inclusion.Sphere solve speedup (~30K dof per processor) on the IBM PowerPC cluster, two processors per node. Efficiency vs. number of processors for fine grid creation (FEAP), R*A*P (Epimetheus), subdomain factorizations (PETSc), solve for "x" (PETSc), and total, together with the N(p)/(N(1)*p) scaling factor.]

Figure: included sphere efficiency (~30K dof per processor, two processors per node) on the IBM PowerPC cluster.

[Figure: Inclusion.Sphere flop/iteration/processor efficiency (left) and flop/sec efficiency (right), ~30K dof per processor, IBM PowerPC cluster with two processors per node; each plot decomposes the efficiency loss into load imbalance and communication.]

Figure: included sphere efficiency decompositions (~30K dof per processor, two processors per node) on the IBM PowerPC cluster.

Agglomeration and level performance

This section looks into two related areas: the time spent on each multigrid level, and the effect of the processor agglomeration described in §.

Agglomeration

We use the IBM PowerPC cluster for this experiment, as the utility of agglomeration techniques is most pronounced on this machine. We use all four processors per node, with and without processor agglomeration. Table shows approximate flop rates for dot products and the total flop rate for the actual solve (i.e., the iterations).

Total Mflop/sec (MF/s) for the solve, included sphere (~k dof per processor):

                         With agglomeration        Without agglomeration
  grid    equations      np    dot MF/s    MF/s    np    dot MF/s    MF/s
          k
          k
          k
          k
  (coarsest)                                       NA    NA

Table: Flat and graded processor groups, IBM PowerPC cluster

This data shows that, for this machine, dramatic savings can be achieved with processor group agglomeration.
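The agglomeration policy itself is described in §; purely as an illustration, the following is a minimal sketch in C of one simple heuristic of the kind discussed there (the equations-per-processor threshold and the rounding to whole SMP nodes are assumptions for this sketch, not the exact rule used by Prometheus):

/* Pick the number of active processors for a coarse grid with n equations.
 * p_max:    processors active on the next finer grid
 * n:        number of equations on this coarse grid
 * min_eq:   keep at least this many equations per active processor
 * per_node: processors per SMP node (agglomerate node by node)
 * Returns a processor count in [1, p_max].                                 */
int coarse_grid_active_procs(int p_max, long n, long min_eq, int per_node) {
  long p = n / min_eq;                  /* enforce a minimum local problem size */
  if (p < 1) p = 1;
  if (p > p_max) p = p_max;
  if (p > per_node)                     /* round down to whole SMP nodes        */
    p = (p / per_node) * per_node;
  return (int)p;
}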

Performance on different multigrid levels

We provide an approximate measurement of the time spent on each level for a particular problem on the Cray T3E. We use the dof version of problem P, which is similar to the problem P introduced in chapter. Table shows the time spent on each grid in the accelerator and the total time for the actual solve, after the preconditioners have been factored and the coarse grids created. Note, we are only able to measure the total solve time accurately, given the PETSc output. We thus approximate the time on each grid by adding the maximum time spent in the matrix-vector product and subdomain solves of the block Jacobi preconditioner to the minimum time spent in the dot products, as the dot products accumulate the load imbalance accounted for in the previous terms. We then throw out the time on the grid that we have the least confidence in, and assign it the time required to add up to our reliable total solve time.
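A minimal sketch in C of this bookkeeping (the array names and the argument that selects which grid absorbs the leftover time are illustrative):

/* Approximate per-grid times from per-grid component timings.
 * max_matvec[g], max_subsolve[g]: maximum (over processors) time in the
 *   matrix-vector product and block Jacobi subdomain solves on grid g.
 * min_dot[g]: minimum (over processors) time in the dot products on grid g,
 *   used because the dot products absorb load imbalance already counted.
 * total_solve: the accurately measured total solve time.
 * The grid `slack` (the one trusted least) is assigned whatever time is
 * needed for the per-grid estimates to sum to total_solve.                 */
void per_grid_times(int ngrids, const double *max_matvec,
                    const double *max_subsolve, const double *min_dot,
                    double total_solve, int slack, double *t) {
  double sum = 0.0;
  for (int g = 0; g < ngrids; g++) {
    t[g] = max_matvec[g] + max_subsolve[g] + min_dot[g];
    if (g != slack) sum += t[g];
  }
  t[slack] = total_solve - sum;   /* absorb the measurement error on one grid */
}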

  level    vertices    active processors    time (sec)
  (one row per multigrid level)
  Krylov accelerator
  Total solve time

Table: Time spent on each grid on the Cray T3E for the million dof problem

Note, this data may indicate that the fourth grid is not using an optimal number of processors, though we are not able to improve the overall solve time (the quantity that we can measure accurately) by varying the number of processors, so the source of the apparent inefficiency in this grid is not clear.

End to end performance

This section shows the total end to end performance of our parallel finite element implementation with our solver, to do just one solve from beginning to end. We measure the high level components of the total parallel finite element system as diagrammed in Figure. Note, we do not measure the time for FEAP to set up the local data structures for the fine grid, after the partitioning but before the construction of the restriction operators (Prometheus); this step is not optimal now (we do not use memory resident files), but it is small (about seconds on the T3E).

Cray T3E

Figures and show the times for the major components in the solution of one solve on the Cray T3E at NERSC.

[Figure: Inclusion.Sphere "end to end" times (~15K dof per processor) on the Cray T3E. Time vs. number of processors for partitioning (Athena), restriction construction (Prometheus), fine grid creation (FEAP), solve setup (Epimetheus, PETSc), solve for "x", and total.]

Figure: end to end times (~15K dof per processor) on the Cray T3E.

From this data we can see that the parallel finite element code's setup phase and partitioning (Athena) costs about ten times the cost of one solve of P, solved to a relative residual tolerance of 10^-6. Also, the restriction operator construction (Prometheus), i.e., coarse grid construction, shape function evaluation, and restriction matrix construction, costs at most twice as much as a solve. The time for the restriction operator construction is hindered by a few PETSc inefficiencies, namely that the assembly routines are probably not optimal (§), and so assembling the restriction operator could probably be improved. The restriction construction is also hindered by inefficient submatrix extraction and symbolic factorization in the additive Schwarz preconditioners. Thus the restriction construction phase could probably be further optimized by optimizing some of the general sparse matrix operations in PETSc, though these costs are per-configuration costs and thus are amortized in nonlinear and transient analyses.

[Figure: Inclusion.Sphere Prometheus and solve speedup (~15K dof per processor) on the Cray T3E. Efficiency vs. number of processors for restriction construction (Prometheus), fine grid creation (FEAP), solve setup (Epimetheus, PETSc), solve for "x" (PETSc), total, and the solve Mflop/sec rate.]

Figure: end to end speedup (~15K dof per processor) on the Cray T3E.

Also, the FEAP construction and Epimetheus assembly of the fine grid stiffness matrix is very small relative to the solve cost. The solve setup cost includes the block Jacobi factorizations but is predominately the matrix triple product for the coarse grid operator construction; this cost is about one fifth that of the actual solve (solve for x). Finally, "end to end" is the sum of all of the components. Additionally, Figure shows the efficiency of the Mflop rate in the solve.

IBM PowerPC cluster

This section shows the same data as the last section for the IBM PowerPC cluster at LLNL. It is interesting to note that the finite element setup (Athena) is faster on the IBM, though this is somewhat misleading. Each node on the IBM has a local disk; the Cray processors do not have a local disk, so writing the local FEAP input file and reading it in inside of FEAP on each processor is much more expensive on the Cray. Also, the Cray does not support I/O in C very well, requiring that Cray specific routines be called for reading and writing files; this data may not have had the optimal system I/O parameters set, so the Cray Athena data may not be optimal.

[Figure: Inclusion.Sphere "end to end" times (~60K dof per node) on the IBM PowerPC cluster, two processors per node. Time vs. number of nodes for partitioning (Athena), restriction construction (Prometheus), fine grid creation (FEAP), solve setup (Epimetheus, PETSc), solve for "x", and total.]

Figure: end to end times (~60K dof per node, two processors per node) on the IBM PowerPC cluster.

[Figure: Inclusion.Sphere Prometheus and solve speedup (~60K dof per node) on the IBM PowerPC cluster, two processors per node. Efficiency vs. number of nodes for restriction construction (Prometheus), fine grid creation (FEAP), solve setup (Epimetheus, PETSc), solve for "x" (PETSc), total, and the solve Mflop/sec rate.]

Figure: solve speedup (~60K dof per node, two processors per node) on the IBM PowerPC cluster.

Conclusion

From this data we can see that the initial parallel finite element setup phase is the biggest bottleneck for one solve, but as its cost can be amortized for typical problems, and as it is not the subject of this work, this is not of primary concern. The performance of the restriction construction (Prometheus) may be somewhat diminished by non-optimal assembly of the restriction operators in PETSc; this is partially due to the fact that we repartition the coarse grids, requiring more communication than if we did not repartition. Also, though ParMetis minimizes the data movement in repartitioning, it is not clear how effective it is (we are using the second or third alpha release of the version in question). The restriction construction phase is also hindered by inefficient sparse matrix operations (submatrix extraction and symbolic factorization) in the additive Schwarz preconditioners (§). These problems are again only per-configuration costs and have not been investigated thoroughly. Additionally, we can see the very different performance characteristics of the IBM and the Cray on unstructured multigrid codes; we attribute this to the poor support for flat MPI codes on the IBM.

Chapter

Large scale nonlinear results and indefinite systems

This chapter discusses the application of our solver to nonlinear problems that arise in computational plasticity and finite deformation materials, with a study that presents our largest solve to date: a million degree of freedom solve, with up to processors and high parallel communication efficiency. We also discuss the use of our linear solver for symmetric positive definite matrices on constrained problems with Lagrange multipliers, namely contact problems in linear elasticity.

Introduction

This chapter proceeds as follows: § describes the large scale test problem P, and the nonlinear solution procedure is discussed in §. A scalability study for one linear solve is presented in §, with problems of up to million degrees of freedom and good parallel efficiency on processors of a Cray T3E. The nonlinear numerical results are discussed in §, with problems of up to million degrees of freedom on processors of a Cray T3E. We discuss contact problems, with serial numerical results, in §, and conclude in §.

Nonlinear problem P

Problem P is similar to the problem P from chapter, with one important difference: the included sphere has been subdivided into seventeen layers of alternating materials. This change has two important effects on the character of the problem. First, it introduces thin body features, which our heuristics (§) are designed to accommodate; second, these heuristics alter the degree to which vertices are coarsened in the sphere but not in the rest of the domain, leading to severe load imbalance if one does not repartition.

We again scale this problem. Figure shows the smallest (base) version of P; P has been discretized using a similar scale of discretization as P's base problem.

Figure: dof concentric spheres problem

Each successive problem has one more layer of elements through each of the seventeen shell layers, with an appropriate (i.e., similar) refinement in the other two directions and in the outer soft domain, resulting in a sequence of seven successively larger problems (the largest of which has not yet been run).

We use similar materials as those in P, but with nonlinear constitutive models; the interior concentric spheres alternate hard and soft layers, with the hard material in the inner and outer shells. The loading and boundary conditions have been changed from P to an imposed uniform displacement down on the top surface.

Table shows a summary of the constitution of our two material types. The hard material is a simple J2 plasticity material with kinematic hardening.

  Material    Elastic mod.    Poisson ratio    Deformation type    Yield stress    Hardening mod.
  soft                                         large               NA
  hard                                         small

Table: Nonlinear materials

The soft material is a large deformation Neo-Hookean hypoelastic material.

Nonlinear solver

We use a full Newton nonlinear solution method. Convergence is declared when the energy norm of the correction in an iteration, x^T (b - A x), is a small specified fraction of that of the first correction; that is, in Newton iteration m we declare convergence when x_m^T (b - A x_m) has been reduced by that fraction relative to the first iteration. Our linear solver within each Newton iteration is conjugate gradient, preconditioned by our multigrid solver with a block Jacobi preconditioned conjugate gradient smoother; we use blocks of a fixed number of unknowns in the block Jacobi preconditioner.

FEAP calls our linear solver at each Newton iteration with the current residual r_m = b - A x_m; thus the linear solve is for the increment A^{-1} r_m. We use a dynamic convergence tolerance rtol_m for the linear solve in each Newton iteration: rtol_1 and rtol_2 are fixed values in the first and second iterations, and on all subsequent iterations (m > 2) we take rtol_m to be the smaller of a fixed cap and the ratio ||r_m|| / ||r_{m-1}||. This heuristic is intended to minimize the total number of iterations required in the Newton solve in each time step, by only solving each linear system to the degree that it deserves to be solved. That is, if the true nonlinear residual is not converging quickly, then solving the linear system to an accuracy far in excess of the reduction in the residual that is expected in the outer Newton iteration is not likely to be economical.

The reason for hardwiring the tolerance for the second Newton iteration is that the residual for the first iteration of this problem tends to drop by about three orders of magnitude. The second step of this problem tends to have the residual reduced by about one order of magnitude or less, and then continues with a superlinear, but not quadratic, convergence rate (as we use a non-exact solver). The dynamic tolerance based on ||r_m|| / ||r_{m-1}|| would therefore specify too small a tolerance on the second iteration of this problem, so we hardwired that tolerance for the sake of efficiency.
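A minimal sketch in C of this outer Newton loop with the dynamic linear solve tolerance (the fixed tolerances tol1 and tol2, the cap on the residual ratio, and the callback names are illustrative stand-ins for FEAP's residual evaluation and for the multigrid preconditioned CG solve):

#include <math.h>

/* Callbacks standing in for the finite element code and the linear solver. */
typedef void (*ResidualFn)(const double *x, double *r, int n, void *ctx);
typedef void (*LinSolveFn)(const double *r, double rtol, double *dx, int n, void *ctx);

static double norm2(const double *v, int n) {
  double s = 0.0; for (int i = 0; i < n; i++) s += v[i]*v[i]; return sqrt(s);
}

/* Full Newton with a dynamic relative tolerance for each inner linear solve:
 * fixed tolerances on the first two iterations, then
 * rtol_m = min(cap, ||r_m|| / ||r_{m-1}||).                                 */
void newton_solve(double *x, double *r, double *dx, int n,
                  ResidualFn residual, LinSolveFn linsolve, void *ctx,
                  double tol1, double tol2, double cap,
                  double newton_rtol, int max_its) {
  double rnorm_prev = 0.0, e0 = 0.0;
  for (int m = 1; m <= max_its; m++) {
    residual(x, r, n, ctx);                       /* r = b - A(x)            */
    double rnorm = norm2(r, n);
    double rtol = (m == 1) ? tol1
                : (m == 2) ? tol2
                : fmin(cap, rnorm / rnorm_prev);  /* dynamic tolerance       */
    linsolve(r, rtol, dx, n, ctx);                /* dx ~= A^{-1} r (MG/CG)  */
    double energy = 0.0;                          /* energy norm x^T(b - Ax) */
    for (int i = 0; i < n; i++) energy += dx[i] * r[i];
    if (m == 1) e0 = energy;
    for (int i = 0; i < n; i++) x[i] += dx[i];    /* Newton update           */
    if (energy <= newton_rtol * e0) break;        /* converged               */
    rnorm_prev = rnorm;
  }
}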

Cray T3E large scale linear solves

We run one linear solve with about 41K degrees of freedom per processor (i.e., P) and a convergence tolerance equal to the first linear solve tolerance in the nonlinear solver, so as to investigate the efficiency of our largest solves to date. We want to show our solver in its best light, by running with as many equations per processor as possible, as parallel efficiency will in general increase as the number of degrees of freedom per processor goes up. Thus this material is intended to illustrate the issues of scale in isolation from the issue of nonlinear performance discussed in §. Figure shows the times for the major subcomponents of the solver, and Figure shows the efficiency diagram of the same data.

[Figure: solve times (~41K dof per PE) on the Cray T3E. Time vs. number of PEs for fine grid creation (FEAP), R*A*P (Epimetheus), subdomain factorizations (Epimetheus), solve for "x" (PETSc), and total.]

Figure: included concentric sphere times (~41K dof per processor) on the Cray T3E.

We see superlinear efficiency in the solve times in Figures and, for two reasons. First, we have superlinear convergence rates, as shown in Figure; the decrease in the iteration count in Figure shows superlinear scale efficiency z in our efficiency measures from §. Second, the vertices added in each successive scale problem have a higher percentage of interior vertices than the base two processor problem, leading to higher rates

[Figure: solve speedup (~41K dof per PE) on the Cray T3E. Efficiency vs. number of PEs for fine grid creation (FEAP), R*A*P (Epimetheus), subdomain factorizations, solve for "x" (PETSc), and total, with the N(p)/(N(1)*p) scaling factor.]

Figure: included concentric sphere efficiency (~41K dof per processor) on the Cray T3E.

of vertex reduction in the coarse grids. This is because, as the number of unknowns n increases, the surface area increases as n^{2/3} whereas the interior increases as n; thus the ratio of interior vertices to surface vertices increases as the scale of discretization decreases (n increases), i.e., on the larger problems. Our heuristics (§) try to articulate the surfaces (boundary and material interfaces) well, resulting in a higher ratio of surface vertices being promoted to the coarse grid. Thus the rate of vertex reduction is higher on the larger problems, as they have proportionally more interior vertices and hence proportionally larger reduction rates, leading to less work per processor per fine grid vertex on the large problems. For instance, the total reduction factor of the first coarse grid is about seven on the base case problem and about thirteen on the larger problems. Therefore there are in fact fewer flops per fine grid vertex on the coarse grids for the larger problems, as can be seen in Figure. The work efficiency w from §, shown in the average flops per iteration per processor in Figure, is superlinear (w > 1).

Figures and also show that the subdomain factorizations perform poorly. On inspection of the PETSc output, we see that the copying of the subdomain data from the grid stiffness matrix and the symbolic factorizations of these submatrices are the cause of these large times, which increase with the number of processors. We are not able to explain this poor performance in this data, though older data shown in Figure, on problems of similar scale, shows much smaller times that are in line with our expectations. Additionally, there is no need for communication in the subdomain factorization phase, hence the poor parallel efficiency in Figure is clearly not endemic to our algorithm or implementation.

[Figure: flop/iteration/processor efficiency (~41K dof per PE) on the Cray T3E, decomposing the efficiency loss into load imbalance and communication; curves show max and average flop/iteration and time per iteration vs. number of PEs.]

Figure: concentric sphere flop efficiency (~41K dof per processor) on the Cray T3E.

Figure shows that we maintain a high parallel efficiency in the Mflop rate, up to the largest processor counts used, in the solve on the Cray T3E.

[Figure: flop/sec efficiency (~41K dof per PE) on the Cray T3E, decomposing the efficiency loss into load imbalance and communication; curves show max and average flop/sec vs. number of PEs.]

Figure: concentric sphere Mflop/sec efficiency (~41K dof per processor) on the Cray T3E.

Cray T3E nonlinear solution

For the nonlinear problem P we run five time steps that impose a downward displacement on the top surface. As the hard interior core does not displace significantly, the soft material at the interior corner above the sphere is strongly compressed by the final time step; this compression decreases away from the interior corner, as the depth of the soft material increases moving away from the top of the sphere. We keep approximately the same number of degrees of freedom per processor, running each problem in the sequence on a correspondingly larger number of processors. Figure shows the number of multigrid iterations in each Newton iteration, stacked on one another to show the total number of multigrid iterations to solve each problem.

[Figure: total linear solver (multigrid) iterations, stacked per Newton iteration, vs. log10(dof).]

Figure: multigrid iterations per Newton iteration.

From this data we can see that the total number of iterations stays about constant as the scale of the problem increases. Figure shows that the number of iterations to reduce the residual by a fixed amount in the first solve of the first time step decreases as the problem size increases. Thus the data in Figure suggests that the nonlinear problem is getting harder to solve as the discretization is refined; this is not a surprising result, as there is likely more nonlinear behavior in the finer discretizations, but more work remains to investigate this issue further.

Figure shows a histogram of the number of inner iterations for each Newton (outer) iteration of each time step.

[Figure: histogram of linear solver iterations per Newton iteration, by time step and log10(dof).]

Figure: histogram of the number of iterations per Newton step in all time steps.

Figure shows the total wall-clock times for these problems run on an IBM PowerPC cluster; note, this data was not in the dissertation, and it includes problems of up to million dof (the largest solve also slipped into the figures above).

[Figure: Inclusion layered sphere times (~80K dof per processor) on the IBM PowerPC cluster: total (wall clock) time, solve for "x" (PETSc), solve setup, and coarse grid setup vs. number of processors, scaled by relative problem size per processor.]

Figure: end to end times of the nonlinear solve with ~80K dof per processor.

This data shows that the overall solution times are staying about constant and that the initial partitioning cost is about equal to the cost of the five nonlinear solves on the Cray T3E, up to the largest processor counts. Additionally, in the actual iterations the base case ran at the highest Mflop rate per processor that we have recorded; the ratio of the largest processor case's per-processor rate to this base rate is therefore a direct measure of our communication efficiency c on the largest runs.

Lagrange multiplier problems

For problems with multiple bodies, or bodies that fold onto themselves, the system of differential equations in continuum mechanics must be augmented with algebraic constraints to prevent penetration between bodies; contact forces must be introduced to maintain a physically meaningful solution (i.e., satisfy conservation laws). Contact constraints can be implemented in one of two ways. First, stiff springs can be added to the system judiciously to approximate contact (i.e., some penetration is permitted, and the stiffer the springs the smaller this error); this is known as a penalty approach. Second, the system of differential equations may be augmented with an explicit set of algebraic equations that specify the lengths of prescribed gaps (i.e., the normal to a plane of a body through a point on another body, with a specified gap, e.g., zero for true contact, defines one equation). This specified relative displacement requires that a force be applied in the direction of the gap to maintain balance of linear momentum. Force terms are thus added to the existing displacement equations as an applied load, and the magnitude of this load is the value of the Lagrange multiplier.
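As an illustration, one common node-on-surface form of such a constraint can be written as follows (the symbols are not taken from the text; the formulation actually used here is the simple slave point on master surface form noted later in this section):

\[
  g_i(x) \;=\; n_i^{T}\,(x_s - x_m) + g_i^{0} \;=\; 0 ,
  \qquad
  f_i^{\mathrm{contact}} \;=\; \lambda_i\, n_i ,
\]

where n_i is the unit surface normal for constraint i, x_s and x_m are the slave point and master surface point displacements, g_i^0 is the prescribed gap (zero for true contact), and the constraint force lambda_i n_i is what appears in the displacement equations.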

For example, consider a 1D problem with a spring k_1 attached to a rigid wall and a mass; an applied constant load gives rise to a simple equation for the steady state response, k_1 x_1 = f_1. If another mass-spring were added to the system we would get the trivial system of equations k_1 x_1 = f_1 and k_2 x_2 = f_2. If we specify that the two masses must stay in contact with each other, i.e., x_1 = x_2 or x_1 - x_2 = 0, we need to augment each equation with the contact force λ between the masses, k_1 x_1 + λ = f_1 and k_2 x_2 - λ = f_2, to get

\[
\begin{pmatrix} k_1 & 0 & 1 \\ 0 & k_2 & -1 \\ 1 & -1 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \lambda \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ 0 \end{pmatrix}
\]

or

\[
\begin{pmatrix} A & C^T \\ C & 0 \end{pmatrix}
\begin{pmatrix} x \\ \lambda \end{pmatrix}
=
\begin{pmatrix} f \\ 0 \end{pmatrix} .
\]

Here A is a standard stiffness matrix. Note that we could also take k_2 = 0, in which case A would only be positive semidefinite, but in general the problem is still well posed if

\[
\frac{v^T A v}{v^T v} > 0 \qquad \text{for all } v \in \operatorname{Ker}(C), \; v \neq 0 .
\]
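As a concrete check of this small example, the following C program assembles the saddle point system above for the hypothetical values k1 = 2, k2 = 1, f1 = f2 = 1 (illustrative numbers, not from the text) and solves it by Gaussian elimination; the computed solution satisfies x1 = x2, as the constraint requires.

#include <stdio.h>
#include <math.h>

/* Solve the 3x3 saddle point system of the two-spring contact example by
 * Gaussian elimination with partial pivoting.                              */
int main(void) {
  double k1 = 2.0, k2 = 1.0, f1 = 1.0, f2 = 1.0;   /* illustrative values   */
  double M[3][4] = {                               /* augmented [K C^T | f] */
    { k1,  0.0,  1.0, f1 },
    { 0.0, k2,  -1.0, f2 },
    { 1.0, -1.0, 0.0, 0.0 }
  };
  for (int c = 0; c < 3; c++) {                    /* forward elimination   */
    int p = c;
    for (int r = c + 1; r < 3; r++)
      if (fabs(M[r][c]) > fabs(M[p][c])) p = r;
    for (int j = 0; j < 4; j++) { double t = M[c][j]; M[c][j] = M[p][j]; M[p][j] = t; }
    for (int r = c + 1; r < 3; r++) {
      double m = M[r][c] / M[c][c];
      for (int j = c; j < 4; j++) M[r][j] -= m * M[c][j];
    }
  }
  double u[3];
  for (int r = 2; r >= 0; r--) {                   /* back substitution     */
    u[r] = M[r][3];
    for (int j = r + 1; j < 3; j++) u[r] -= M[r][j] * u[j];
    u[r] /= M[r][r];
  }
  /* u[0] = x1, u[1] = x2, u[2] = lambda; the constraint forces x1 == x2.   */
  printf("x1 = %g  x2 = %g  lambda = %g\n", u[0], u[1], u[2]);
  return 0;
}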

We use Lagrange multiplier formulations to impose contact constraints, thus resulting in an indefinite system of equations. To solve this set of equations we use a simple Uzawa method. The Uzawa algorithm, or augmented Lagrange method, decouples the solution for the displacements x and the Lagrange multipliers λ: outer iterations on the Lagrange multipliers call our multigrid solver, which conducts inner iterations with a modified right hand side (see Figure).

Our stiffness matrix A can be positive semidefinite. We know A is positive definite over the kernel of C, thus we can regularize A to get a symmetric positive definite matrix $\hat{A} = A + \gamma\, C^T D C$, where γ is a scalar scaling parameter and D is a scaling matrix; we use a diagonal matrix for D. Note:
(1) the solution x satisfies C x = 0, so A x = $\hat{A}$ x, and we have the correct solution even though we are not using the true stiffness matrix to calculate the residual;
(2) our Lagrange multipliers are formed with a simple slave point on master surface formulation;
(3) in many texts, authors implicitly assume that D is the identity;
(4) penalty methods solve with $\hat{A}$ and pick γ to be very large (e.g., a large multiple of the largest diagonal entry).

The vector from the slave point to the surface, normal to the surface, is called the gap vector g; its length is called the gap. We define D with the gap vector $g_i$ of each constraint i and the diagonal entries $d_j$ of the stiffness matrix associated with the slave node j, as $D_{ii} = g_i^T d_j\, g_i$ (with $d_j$ interpreted as the diagonal of the stiffness matrix at the slave node); an alternative definition is $D_{ii} = g_i^T A\, g_i$. The difficulty with penalty methods (i.e., large γ) is that the system becomes very stiff and difficult to solve with iterative methods. The advantage of augmented Lagrange formulations is that they are more accurate, and the regularized systems $\hat{A}$ are much better conditioned, as a relatively small γ can be used.

Picking γ leads to a typical ad hoc minimization process, as there are no reliable methods that we are aware of for picking γ a priori for the problems that are of interest to this dissertation. If γ is too small then the outer iterations will converge slowly, and if γ is too large then the inner iteration (our solver) will converge slowly. We chose a value of γ that seems to be about optimal for our test problem; note that the solution times do not seem to be very sensitive to this parameter. We use a basic Uzawa algorithm, with a user provided relative tolerance (rtol) and absolute tolerance (atol), to solve for x and λ in the equation above, as shown in Figure.

  λ_0 ← λ from the previous time step;  x_0 ← 0;  i ← 0
  while ||Â x_i + C^T λ_i - f|| > rtol·||f||  or  ||C x_i|| > atol
      solve  Â x_i = f - C^T λ_i  for x_i   (multigrid preconditioned CG)
      λ_{i+1} ← λ_i + γ D C x_i
      i ← i + 1

Figure: Uzawa algorithm
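A minimal sketch of this outer loop in C, assuming callbacks for applying the regularized operator, the constraint matrix and its transpose, the γD scaling, and the inner multigrid preconditioned CG solve (all names and the workspace handling are illustrative):

#include <math.h>

typedef void (*ApplyFn)(const double *in, double *out, void *ctx);   /* out = Op * in    */
typedef void (*SolveFn)(const double *rhs, double *x, void *ctx);    /* x = Ahat^{-1} rhs */

static double nrm2(const double *v, int n) {
  double s = 0.0; for (int i = 0; i < n; i++) s += v[i]*v[i]; return sqrt(s);
}

/* Uzawa iteration: x has n dof, lam has m multipliers; scale_DC computes
 * gamma * D * (its input), applied to C x when updating the multipliers.   */
void uzawa(int n, int m, const double *f, double *x, double *lam,
           ApplyFn apply_Ahat, ApplyFn apply_C, ApplyFn apply_Ct, ApplyFn scale_DC,
           SolveFn inner_solve, void *ctx,
           double rtol, double atol, int max_outer,
           double *rhs, double *Cx, double *dlam, double *res) {
  double fnorm = nrm2(f, n);
  for (int it = 0; it < max_outer; it++) {
    apply_Ct(lam, rhs, ctx);                        /* rhs = C^T lam              */
    for (int i = 0; i < n; i++) rhs[i] = f[i] - rhs[i];
    apply_Ahat(x, res, ctx);                        /* residual test:             */
    for (int i = 0; i < n; i++) res[i] -= rhs[i];   /* Ahat x + C^T lam - f       */
    apply_C(x, Cx, ctx);
    if (nrm2(res, n) <= rtol * fnorm && nrm2(Cx, m) <= atol) break;
    inner_solve(rhs, x, ctx);                       /* x = Ahat^{-1}(f - C^T lam) */
    apply_C(x, Cx, ctx);
    scale_DC(Cx, dlam, ctx);                        /* dlam = gamma * D * C * x   */
    for (int j = 0; j < m; j++) lam[j] += dlam[j];  /* multiplier update          */
  }
}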

Numerical results

We test our solver on a problem in linear elasticity. Figure shows a problem with three concentric spheres; this is somewhat similar to P, but without the cube of cover material, and we call it P. The loading is a thermal load on the middle sphere (i.e., the middle sphere is hotter than the inner and outer spheres and hence expands). The two interface conditions, between the inner and the middle sphere and between the middle and the outer sphere, are enforced with contact constraints. The problem has dof and about equations in C for each of the two contact surfaces. Our standard solver configuration is used, with a block Jacobi preconditioned smoother, four multigrid levels, and two settings of the relative residual tolerance (Table). These solves have two separate influences


Figure: concentric spheres with contact, undeformed and deformed shape.

on their performance: the number of outer iterations, and the cost of each iteration. We provide a control case, which is a similar problem with no contact. Figure shows the concentric sphere problem without contact, in both the undeformed and deformed shapes.


Figure: concentric spheres without contact, undeformed and deformed shape.

Table shows the results of this experiment. The average convergence rate for each Uzawa iteration is about the same as that of the no-contact problem. As we use an adaptive convergence tolerance in the inner iterations (rtol_inner proportional to ||C x_i||), the regularized problem is in fact a bit more difficult to solve than our control problem.

  problem        rtol    inner iterations (total)    outer iterations
  no contact
  contact

Table: Multigrid preconditioned CG iteration counts for the contact problem

From this data we can see that the cost grows at about the logarithm of the convergence tolerance, with about twice as many outer iterations and about twice as many inner iterations per outer iteration. This is probably not optimal and may be the subject of future research; a potential avenue for improving this algorithm is to use preconditioned Uzawa methods.

Conclusion

We conclude that our linear solver can be effectively used in solving nonlinear problems and that it remains robust for problems with thin body features and up to million equations on up to processors of a Cray T3E, with good parallel efficiency. Our nonlinear problems also present some of the most challenging operators, namely ones with geometric stiffness (which lowers the lowest eigenvalues and hence increases the condition number of the matrix) and with softening materials from yielding plasticity, resulting in areas of strain localization with highly incompressible constitution. Thus we are able to show that our solver remains robust on large scale problems in the presence of very challenging materials. Also, we show that our solver has the potential to be useful in solving contact problems with Lagrange multipliers.

Chapter

Conclusion

This dissertation has developed a promising method for solving the linear sets of equations arising from implicit finite element applications in solid mechanics. Our approach, a 3D and parallel extension of an existing serial 2D algorithm, is to our knowledge unique in that it is a fully automatic (i.e., the user need only provide the fine grid, which is easily available in most finite element codes), standard geometric multigrid method for unstructured finite element problems.

We have developed heuristics for the automatic parallel construction of the coarse grids for 3D problems in solid mechanics; this work represents some of the most original contributions of this dissertation. Our method is the most scalable and robust (i.e., with respect to convergence rate and breadth of problems on which it is effective) multigrid algorithm on unstructured finite element problems that we are aware of. Additionally, we have developed and analyzed a new parallel maximal independent set algorithm that has superior PRAM complexity on finite element meshes to the commonly used random algorithms, and is very practical.

We have developed a fully parallel and portable prototype solver that shows promising results, both in terms of convergence rates and parallel efficiency, for some modestly complex geometries with challenging materials in large deformation elasticity and plasticity. Our prototype has solved problems of up to million degrees of freedom, with large deformation elasticity and plasticity, on a Cray T3E with up to processors and on an IBM PowerPC cluster of 4-way SMP nodes; we have also run on a network of workstations and a network of PC SMPs. We have developed a complexity theory, and have begun to develop a complexity model, of multigrid equation solvers for 3D finite element problems on parallel computers. Additionally, we have developed agglomeration strategies for the optimal selection of active processor sets on the coarse grids of multigrid, as this is essential for optimal scalability.

We have also developed a highly parallel finite element implementation, built on an existing state-of-the-art serial research (legacy) finite element implementation, that uses our parallel solver. By necessity we have developed a novel domain decomposition based parallel finite element programming model that builds a complete finite element problem on each processor as its primary abstraction.

The implementation of our prototype system (ParFeap) required a substantial amount of our own C code, plus several large packages: PETSc (C), FEAP (FORTRAN), METIS/ParMetis (C), and geometric predicates (C). This complexity was required because:
(1) efficient parallel multigrid solvers for unstructured meshes are inherently complex;
(2) algorithm development and verification of success in this area is highly experimentally based, and thus requires a flexible, full featured computational substrate to enable this type of research; and
(3) we require portable implementations, so we use explicit message passing (MPI) as our parallel programming paradigm, while accommodating both flat and hierarchical memory architectures (clusters of SMPs, CLUMPs).

Additionally, this work is rather broad, in the sense that the emphasis has been on producing a prototype of the best finite element linear equation solver possible; this has limited the amount of time that could be devoted to investigating more optimized approaches to each aspect of our algorithm. In fact, the very exploratory nature of this work demands a non-optimal implementation, in that there is no sense in optimizing any system (e.g., a parallel linear solver for unstructured finite element problems) before the overall practical quality of the algorithm and a particular application area have been determined. Thus there is much more work to be done.

Future Work

To be useful to a more general finite element community, we need to extend the algorithm and its features:

- Investigate more sophisticated face identification algorithms, to increase the robustness of the solver on arbitrarily complex domains.
- Incorporate a parallel Delaunay tessellation algorithm, so as to develop more robust and globally consistent implementations.
- Incorporate a parallel direct solver for the coarsest grid.
- Extend the implementation to more element types (shells, beams, trusses, etc.).
- Accommodate higher order elements, such as supporting more dof per vertex for p-adaptive methods and multi-physics problems.

Develop solution strategies and implementations to extend application domains:

- Investigate highly nonlinear problems, to evaluate solver characteristics on problems in plasticity and large deformation elasticity (as well as other areas) as one approaches the limit load, so as to develop strategies to effectively solve these highly nonlinear finite element problems.
- Investigate non-CG Krylov subspace methods for indefinite systems from large deformation elasticity and plasticity.
- Develop parallel and preconditioned Uzawa solvers for indefinite systems from constrained problems with Lagrange multipliers.
- Investigate multigrid algorithms for differential and algebraic systems from constrained problems with Lagrange multipliers.

Develop PETSc to more fully support large scale algebraic multigrid applications efficiently:

- Incorporate a general RAP sparse matrix triple product in PETSc (we have implemented a specialized RAP, requiring a copy of the R matrix, within Epimetheus).
- Implement faster off-processor and on-processor matrix assembly routines.
- Improve the efficiency of the additive Schwarz preconditioner setup phase.
- Add "left" and "right" communicators to restriction matrices to fully support coarse grid agglomeration.
- Optimize numerical kernels for shared memory architectures.

Rebuild and clean up the prototype's infrastructure for distributed multilevel unstructured grid classes:

- Redesign the parallel distributed multilevel classes.
- Redesign the data structures and clean the code up.
- Separate the distributed multilevel unstructured grid classes from Prometheus, so as to provide a class library to the public.
- Add and test other promising algebraic multigrid algorithms, e.g.:
  - agglomeration with rigid body modes (Bulgakov and Kuhn, §), and
  - smoothed aggregation (Vanek and Mandel, §).

Build, maintain, document, and support a library interface:

- Encapsulate Prometheus into a PETSc PC object (low level).
- Develop a higher level interface (e.g., FEI).

Bibliography

Mark Adams. Heuristics for the automatic construction of coarse grids in multigrid solvers for finite element matrices. Technical Report UCB/CSD, University of California, Berkeley.

Mark Adams. A parallel maximal independent set algorithm. In Proceedings of the Copper Mountain Conference on Iterative Methods.

Anderson, Bai, Bischof, Demmel, Dongarra, Du Croz, Greenbaum, Hammarling, McKenney, Ostrouchov, and Sorensen. LAPACK Users' Guide. SIAM.

J.H. Argyris and S. Kelsey. Energy theorems and structural analysis. Aircraft Engineering.

K. Arrow, L. Hurwicz, and H. Uzawa. Studies in nonlinear programming. Stanford University Press.

ASCI. http://www.llnl.gov/asci.

ASCI Red machine at Sandia National Laboratory. http://www.sandia.gov/ASCI/Red.

S.B. Baden. Software infrastructure for non-uniform scientific computations on parallel processors. Applied Computing Review, ACM, Spring.

T. Baker. Mesh generation for the computation of flow fields over complex aerodynamic shapes. Computers Math. Applic.

S. Balay, W.D. Gropp, L.C. McInnes, and B.F. Smith. PETSc users manual. Technical report, Argonne National Laboratory.

Marshall Bern and David Eppstein. Mesh generation and optimal triangulation. In Computing in Euclidean Geometry. World Scientific Publishing.

Guy Blelloch, Gary L. Miller, and Dafna Talmor. Design and implementation of a practical parallel Delaunay algorithm. To appear in Algorithmica.

A. Brandt. Multi-level adaptive solutions to boundary value problems. Math. Comput.

D. Brelaz. New method to color the vertices of a graph. Comm. ACM.

Eric A. Brewer. High-level optimization via automated statistical modeling. PPoPP.

Franco Brezzi and Klaus-Jurgen Bathe. A discourse on the stability conditions for mixed finite element formulations. Computer Methods in Applied Mechanics and Engineering.

Franco Brezzi and Michel Fortin. Mixed and Hybrid Finite Element Methods. Springer-Verlag.

W. Briggs. A Multigrid Tutorial. SIAM.

V.E. Bulgakov and G. Kuhn. High-performance multilevel iterative aggregation solver for large finite-element structural analysis problems. International Journal for Numerical Methods in Engineering.

Edward K. Buratynski. A fully automatic three-dimensional mesh generator for complex geometries. International Journal for Numerical Methods in Engineering.

James C. Cavendish. Automatic triangulation of arbitrary planar domains for the finite element method. International Journal for Numerical Methods in Engineering.

T.F. Chan and Y. Saad. Multigrid algorithms on the hypercube multiprocessor. IEEE Trans. Comput.

Tony F. Chan and Barry F. Smith. Multigrid and domain decomposition on unstructured grids. In David F. Keyes and Jinchao Xu, editors, Seventh Annual International Conference on Domain Decomposition Methods in Scientific and Engineering Computing. AMS. A revised version of this paper has appeared in ETNA, December.

Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. North-Holland Pub. Co.

R. Courant. Variational methods for the solution of problems of equilibrium and vibration. Bulletin of the American Mathematical Society.

David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In PPOPP, May.

James Demmel. Applied Numerical Linear Algebra. SIAM.

J.E. Dendy, M.P. Ida, and J.M. Rutledge. A semicoarsening multigrid algorithm for SIMD machines. SIAM J. Sci. Stat. Comput.

Jack J. Dongarra. Performance of various computers using standard linear equations software. Technical Report CS, University of Tennessee, Knoxville.

M.C. Dracopoulos and M.A. Crisfield. Coarse/fine mesh preconditioners for the iterative solution of finite element problems. International Journal for Numerical Methods in Engineering.

Maksymilian Dryja, Barry F. Smith, and Olof B. Widlund. Schwarz analysis of iterative substructuring algorithms for elliptic problems in three dimensions. SIAM J. Numer. Anal., December.

Laura C. Dutto, Wagdi G. Habashi, and Michel Fortin. An algebraic multilevel parallelizable preconditioner for large scale CFD problems. Comput. Methods Appl. Mech. Engrg.

Howard C. Elman and Gene H. Golub. Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal.

Charbel Farhat, Jan Mandel, and Francois-Xavier Roux. Optimal convergence properties of the FETI domain decomposition method. Computer Methods in Applied Mechanics and Engineering.

Charbel Farhat and Francois-Xavier Roux. A method of finite element tearing and interconnecting and its parallel solution algorithm. International Journal for Numerical Methods in Engineering.

FEAP. http://www.ce.berkeley.edu/~rlt.

FEI. http://z.ca.sandia.gov/fei.

D.A. Field. Implementing Watson's algorithm in three dimensions. In Proc. Second Ann. ACM Symp. Comp. Geom.

J. Fish, V. Belsky, and S. Gomma. Unstructured multigrid method for shells. International Journal for Numerical Methods in Engineering.

J. Fish, M. Pandheeradi, and V. Belsky. An efficient multilevel solution scheme for large scale nonlinear systems. International Journal for Numerical Methods in Engineering.

S. Fortune and J. Wyllie. Parallelism in random access machines. In ACM Symp. on Theory of Computing.

John R. Gilbert, Gary L. Miller, and Shang-Hua Teng. Geometric mesh partitioning: Implementation and experiments. Technical Report CSL, Xerox Palo Alto Research Center.

Gene H. Golub and Charles Van Loan. Matrix Computations. Johns Hopkins University Press.

Herve Guillard. Node-nested multigrid with Delaunay coarsening. Technical report, Institut National de Recherche en Informatique et en Automatique.

O. Hassan, K. Morgan, E.J. Probert, and J. Peraire. Unstructured tetrahedral mesh generation for three-dimensional viscous flows. International Journal for Numerical Methods in Engineering.

Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. In Proc. Supercomputing. Formerly Technical Report SAND.

Bruce Hendrickson and Robert Leland. An improved spectral graph partitioning algorithm for mapping parallel computations. SIAM J. Sci. Stat. Comput.

M.R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand.

Ellis Horowitz and Sartaj Sahni. Fundamentals of Computer Algorithms. Galgotia Publications.

Mark T. Jones and Paul E. Plassmann. A parallel graph coloring heuristic. SIAM J. Sci. Comput.

S. Kacau and I.D. Parsons. A parallel multigrid method for history-dependent elastoplasticity computations. Computer Methods in Applied Mechanics and Engineering.

George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. To appear.

George Karypis and Vipin Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. Supercomputing.

B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal.

D.E. Keyes. Aerodynamic applications of Newton-Krylov-Schwarz solvers. In M. Deshpande, S. Desai, and R. Narasimha, editors, Proceedings of the International Conference on Numerical Methods in Fluid Dynamics. Springer, New York.

David Kincaid and Ward Cheney. Numerical Analysis. Brooks/Cole Publishing Company.

N. Kikuchi and J.T. Oden. Contact Problems in Elasticity. SIAM.

Xiaoye S. Li, J. Demmel, and J. Gilbert. An asynchronous parallel supernodal algorithm for sparse Gaussian elimination. To appear in SIAM J. Matrix Anal. Appl.

R. Lohner. Parallel unstructured grid generation. Comp. Meth. App. Mech. Eng.

M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput.

David G. Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley.

S.S. Lumetta, A.M. Mainwaring, and D.E. Culler. Multi-protocol active messages on a cluster of SMPs. In Proceedings of SC: High Performance Networking and Computing, San Jose, California, November.

J. Mandel. Balancing domain decomposition. Comm. Numer. Meth. Eng.

Dimitri J. Mavriplis. Directional agglomeration multigrid techniques for high-Reynolds number viscous flows. Technical report, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA, January.

Dimitri J. Mavriplis. Multigrid strategies for viscous flow solvers on anisotropic unstructured meshes. Technical report, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA, January.

MGNet. http://www.mgnet.org.

Millennium project. http://www.millennium.berkeley.edu.

NOW project. http://now.cs.berkeley.edu.

D.R.J. Owen, Y.T. Feng, and D. Peric. A multigrid enhanced GMRES algorithm for elasto-plastic problems. International Journal for Numerical Methods in Engineering.

J.S. Przemieniecki. Matrix structural analysis of substructures. Am. Inst. Aero. Astro. J.

J. Ruge. AMG for problems of elasticity. Applied Mathematics and Computation.

Y. Saad and M.H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput.

Jonathan Richard Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Technical Report CMU-CS, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania. Submitted to Discrete and Computational Geometry.

Jonathan Richard Shewchuk. Delaunay Refinement Mesh Generation. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, May. Available as Technical Report CMU-CS.

J.C. Simo and T.J.R. Hughes. Computational Inelasticity. Springer-Verlag.

J.C. Simo, R.L. Taylor, and K.S. Pister. Variational and projection methods for the volume constraint in finite deformation elasto-plasticity. Computer Methods in Applied Mechanics and Engineering.

Barry Smith. A parallel implementation of an iterative substructuring algorithm in three dimensions. SIAM J. Sci. Comput.

Barry Smith, Petter Bjorstad, and William Gropp. Domain Decomposition. Cambridge University Press.

Ken Stanley. Execution Time of Symmetric Eigensolvers. PhD thesis, University of California at Berkeley.

Made Suarjana and Kincho Law. A robust incomplete factorization based on value and space constraints. International Journal for Numerical Methods in Engineering.

Dafna Talmor. Well-Spaced Points for Numerical Methods. PhD thesis, Carnegie Mellon University, Pittsburgh, PA.

Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing, March.

P. Vanek, J. Mandel, and M. Brezina. Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. In Copper Mountain Conference on Multigrid Methods.

David E. Womble and Brenton C. Young. A model and implementation of multigrid for massively parallel computers. Technical Report SAND-J, Sandia National Laboratory.

Yelick, Semenzato, Pike, Miyamoto, Liblit, Krishnamurthy, Hilfinger, Graham, Gay, Colella, and Aiken. Titanium: A high-performance Java dialect. In Workshop on Java for High Performance Network Computing, Stanford, California, February.

O.C. Zienkiewicz and R.L. Taylor. The Finite Element Method. McGraw-Hill, London, fourth edition.

Appendix A

Test problems

This section lists all of our test problems, with some statistics about them. It is meant to provide a reference with more detail about the problems and a single location in which to look them up.

Materials

All materials use mixed pressure-displacement, eight node, trilinear finite elements.

  Material name     Elastic modulus    Poisson ratio    Yield stress    Hardening modulus    Deformation type
  hard-linear                                           NA                                   small
  soft-linear                                           NA                                   small
  hard-nonlinear                                                                             large
  soft-nonlinear                                        NA                                   large

Table A: Materials for test problems

Problem: Included sphere

Figure A: 3D FE mesh and deformed shape

Materials: hard-linear sphere, soft-linear cover.

Boundary conditions: symmetry on the sides, uniform load on top.

  dof    condition number    nonzeros

Table A: Problem P statistics

Problem: Included layered sphere

Figure A: dof concentric spheres problem

Materials: nine hard-nonlinear layers and eight soft-nonlinear layers in the spheres; soft-nonlinear cover.

Boundary conditions: symmetry on the sides, uniform imposed displacement on top.

  dof    condition number    nonzeros

Table A: Problem P statistics

Problem: Cantilever beam

Figure A: cantilever with uniform mesh and end load

Materials: hard-linear.

Boundary conditions: one end fixed, uniform off-axis load on the other end.

  dof    condition number    nonzeros

Table A: Problem P statistics

Problem: Cone

Figure A: truncated hollow cone (first principal stress contours)

Materials: hard-linear.

Boundary conditions: one end fixed, off-axis load with a torque on the other end.

  dof    condition number    nonzeros

Table A: Problem P statistics

Problem: Tube

Figure A: cantilevered tube

Materials: hard-linear.

Boundary conditions: one end fixed, uniform off-axis load on the other end.

  dof    condition number    nonzeros

Table A: Problem P statistics

Problem: Beam-column

Figure A: beam-column

Materials: hard-linear.

Boundary conditions: top and bottom of the column fixed, uniform load down on the end of the beam.

  dof    condition number    nonzeros

Table A: Problem P statistics

Problem: Concentric spheres without contact

Figure A: concentric spheres without contact

Materials: ALE3D materials; steel on the inner and outer shells, aluminum in the middle section.

Boundary conditions: sphere symmetry boundary conditions, with a thermal load on the middle section.

  dof    condition number    nonzeros

Table A: Problem P statistics

Problem: Concentric spheres with contact

Figure A: concentric spheres with contact

Materials: ALE3D materials; steel on the inner and outer shells, aluminum in the middle section.

Boundary conditions: sphere symmetry boundary conditions, with a thermal load on the middle section; contact between the shells enforced with Lagrange multipliers.

  dof    condition number    nonzeros

Table A: Problem P statistics

Appendix B

Machines

Cray T3E at NERSC

The Cray T3E at NERSC has single-processor nodes; up to several hundred processors have been used for a single job. Each processor has a MHz clock, Mflop/sec theoretical peak, and MB of memory; the peak achieved Mflop rate is quoted relative to the per-processor Linpack R_max.

PowerPC cluster at LLNL

The IBM at LLNL makes a pool of 4-way SMP nodes available to users, each with MHz PowerPC 604e processors, Mflop/sec theoretical peak per processor, and MB of memory per node; the peak Mflop rate is quoted as a fraction of per-processor Linpack, assuming perfect parallelism, as reported at http://www.rs6000.ibm.com/hardware/largescale/index.html.