The Pennsylvania State University The Graduate School

THE AUXILIARY SPACE SOLVERS AND THEIR APPLICATIONS

A Dissertation in Mathematics by Lu Wang

© 2014 Lu Wang

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2014

The dissertation of Lu Wang was reviewed and approved∗ by the following:

Jinchao Xu Professor of Department of Mathematics Dissertation Advisor, Chair of Committee

James Brannick Associate Professor of the Department of Mathematics

Ludmil Zikatanov Professor of the Department of Mathematics

Chao-Yang Wang Professor of Materials Science and Engineering

Yuxi Zheng Professor of the Department of Mathematics Department Head

∗Signatures are on file in the Graduate School.

Abstract

Developing efficient iterative methods and parallel algorithms for solving the sparse linear systems discretized from partial differential equations (PDEs) is still a challenging task in scientific computing and in practical applications. Although many mathematically optimal solvers, such as multigrid methods, have been analyzed and developed, the unfortunate reality is that these solvers have not been used much in practice. In order to narrow the gap between theory and practice, we develop, formulate, and analyze mathematically optimal solvers that are robust and easy to use in practice, based on the methodology of Fast Auxiliary Space Preconditioning (FASP).

We develop a multigrid method on unstructured shape-regular grids by constructing an auxiliary coarse grid hierarchy on which the multigrid method can be applied using the FASP technique. The construction is realized by a cluster tree, which can be obtained in O(N log N) operations for a grid of N nodes. This tree structure is used to define the grid hierarchy from coarse to fine. For the constructed grid hierarchy, we prove that the condition number of the preconditioned system for an elliptic PDE is O(log N).

Then, we present a new colored block Gauss-Seidel method for general unstructured grids. By constructing the auxiliary grid, we can aggregate the degrees of freedom lying in the same cells of the auxiliary grid into one block. By developing a parallel coloring algorithm for the tree structure, a colored block Gauss-Seidel method can be applied with the aggregates serving as non-overlapping blocks. We also develop a new parallel unsmoothed aggregation algebraic multigrid method for PDEs defined on unstructured meshes, based on the auxiliary grid. It provides (nearly) optimal load balance and predictable communication patterns, factors that make our new algorithm suitable for parallel computing.

Furthermore, we extend the FASP techniques to saddle point and indefinite problems. Two auxiliary space preconditioners are presented. An abstract framework for the symmetric positive definite auxiliary preconditioner is given, so that the optimal multigrid method can be applied to indefinite problems on unstructured grids. We also numerically verify the optimality of the two preconditioners for the Stokes equations.

Table of Contents

List of Figures

List of Tables

Acknowledgments

Chapter 1 Introduction

Chapter 2 Iterative Methods
  2.1 Stationary Iterative Methods
    2.1.1 Jacobi Method
    2.1.2 Gauss-Seidel Method
    2.1.3 Successive Over-Relaxation Method
    2.1.4 Block Iterative Method
  2.2 Krylov Space Method and Preconditioners
    2.2.1 Conjugate Gradient Method
  2.3 Preconditioned Iterations
    2.3.1 Preconditioned Conjugate Gradient Method
    2.3.2 Preconditioning Techniques
  2.4 Numerical Example
    2.4.1 Comparison of the Iterative Method
    2.4.2 Comparison of the Preconditioners

Chapter 3 Multigrid Method and Fast Auxiliary Space Preconditioner
  3.1 Method of Subspace Correction

    3.1.1 Parallel Subspace Correction and Successive Subspace Correction
    3.1.2 Multigrid Viewed as Multilevel Subspace Corrections
    3.1.3 Convergence Analysis
  3.2 The Auxiliary Space Method
  3.3 Algebraic Multigrid Method
    3.3.1 Classical AMG
    3.3.2 UA-AMG

Chapter 4 FASP for Poisson-like Problem on Unstructured Grid
  4.1 Preliminaries and Assumptions
  4.2 Construction of the Auxiliary Grid Hierarchy
    4.2.1 Clustering and Auxiliary Box-trees
    4.2.2 Closure of the Auxiliary Box-tree
    4.2.3 Construction of a Conforming Auxiliary Grid Hierarchy
    4.2.4 Adaptation of the Auxiliary Grids to the Boundary
    4.2.5 Near Boundary Correction
  4.3 Estimate of the Condition Number
    4.3.1 Convergence of the MG on the Auxiliary Grids
      4.3.1.1 Stable Decomposition: Proof of (A1)
      4.3.1.2 Strengthened Cauchy-Schwarz Inequality: Proof of (A2)
      4.3.1.3 Condition Number Estimation

Chapter 5 Colored Gauss-Seidel Method by Auxiliary Grid
  5.1 Graph Coloring
  5.2 Quadtree Coloring
  5.3 Tree Representations
  5.4 Parallel Implementation of the Coloring Algorithm
  5.5 Block Colored Gauss-Seidel Methods

Chapter 6 Parallel FASP-AMG Solvers
  6.1 Parallel Auxiliary Grid Aggregation
  6.2 Parallel Prolongation and Restriction and Coarse-level Matrices
  6.3 Parallel Smoothers Based on the Auxiliary Grid
  6.4 GPU Implementation
    6.4.1 Sparse Matrix-Vector Multiplication on GPUs
    6.4.2 Parallel Auxiliary Grid Aggregation

Chapter 7 Numerical Applications for Poisson-like Problem on Unstructured Grid
  7.1 Auxiliary Space Multigrid Method
    7.1.1 Geometric Multigrid
    7.1.2 ASMG for the Dirichlet Problem
    7.1.3 ASMG for the Neumann Problem
  7.2 FASP-AMG
    7.2.1 Test Platform
    7.2.2 Performance

Chapter 8 FASP for Indefinite Problem
  8.1 Krylov Space Method for Indefinite Problems
    8.1.1 The Minimal Residual Method
    8.1.2 Generalized Minimal Residual Method
  8.2 Preconditioners for Indefinite Problems
  8.3 FASP Preconditioner

Chapter 9 Fast Preconditioners for Stokes Equation on Unstructured Grid
  9.1 Block Preconditioners
  9.2 Analysis of the FASP SPD Preconditioner
  9.3 Some Examples
    9.3.1 Use a Lower Order Velocity Space Pair as an Auxiliary Space
    9.3.2 Use a Lower Order Pressure Space as an Auxiliary Space

Chapter 10 Conclusions
  10.1 Conclusions
  10.2 Future Works

Bibliography

List of Figures

2.1 Matrix splitting of A
2.2 Comparison of the number of iterations
2.3 Comparison of the CPU time
2.4 Comparison of the number of iterations for preconditioners

4.1 Left: the 2D triangulation T of Ω with elements τ_i. Right: the barycenters ξ_i (dots) and the minimal distance h between barycenters.
4.2 Examples of the region quadtree on different domains.
4.3 Tree of regular boxes with root B1 in 2D. The black dots mark the corresponding barycenters ξ_i of the triangles τ_i. Boxes with fewer than three points ξ_i are leaves.
4.4 The subdivision of the marked (red) box on level ℓ would create two boxes (blue) with more than one hanging node at one edge.
4.5 The subdivision of the red box makes it necessary to subdivide nodes on all levels.
4.6 Hanging nodes can be treated by a local subdivision within the box B_ν. The top row shows a box with 1, 2, 2, 3, 4 hanging nodes, respectively, and the bottom row shows the corresponding triangulation of the box.
4.7 The final hierarchy of nested grids. Red edges were introduced in the last (local) closure step.
4.8 Case 1: σ_i is subdivided on the fine level
4.9 Case 2: σ_i is not subdivided on the fine level
4.10 A triangulation of the Baltic Sea with local refinement and small inclusions.
4.11 Hanging nodes can be treated by a local subdivision within the cube B_ν, first erasing the hanging nodes on the face and then connecting the center of the cube.
4.12 The boundary Γ of Ω is drawn as a red line, boxes not intersecting Ω are light green, boxes intersecting Γ are dark green, and all other boxes (inside Ω) are blue.

4.13 The boundary Γ of Ω is drawn as a red line, boxes not intersecting Ω are light green, and all other boxes (intersecting Ω) are blue.
4.14 The finest auxiliary grid σ(10) contains elements of different size. Left: Dirichlet b.c. (852 degrees of freedom); right: Neumann b.c. (2100 degrees of freedom)

5.1 A balanced quadtree requires at least five colors
5.2 Forced coloring rectangles
5.3 Adaptive quadtree and its binary graph
5.4 Six-coloring for an adaptive quadtree
5.5 The Morton code of an adaptive quadtree
5.6 Adaptive quadtree and its binary graph
5.7 Coloring of a 3D adaptive octree

6.1 Aggregation on level L.
6.2 Aggregation on the coarse levels.
6.3 Coloring on the finest level L
6.4 Sparse matrix representation using the ELL format and the memory access pattern of SpMV.

7.1 Convergence rates for auxiliary space multigrid with n_4 = 737,933, n_5 = 2,970,149, n_6 = 11,917,397, and n_7 = 47,743,157 degrees of freedom.
7.2 Convergence rates for auxiliary space multigrid with n_4 = 756,317, n_5 = 3,006,917, n_6 = 11,990,933, and n_7 = 47,890,229 degrees of freedom.
7.3 Quasi-uniform grid for a 2D unit square
7.4 Shape-regular grid for a 2D unit square
7.5 Shape-regular grid for a 2D circle domain
7.6 Shape-regular grid for the 3D heat transfer problem on a cubic domain (left) and a cavity domain (right)

9.1 $P_2$-$P_0$ elements and $P_1^+$-$P_0^{-1}$ elements
9.2 $P_2^+$-$P_1^{-1}$ elements and $P_2^0$-$P_0^{-1}$ elements
9.3 $P_2^+$-$P_1^{-1}$ elements and $P_2^0$-$P_0^{-1}$ elements

List of Tables

2.1 Comparison of the different iterative methods for the Poisson equation
2.2 Comparison of the different preconditioners for the Poisson equation

7.1 The time in seconds for the setup of the matrices and for ten steps of V-cycle (geometric) multigrid, Algorithm 16.
7.2 The storage complexity in bytes per degree of freedom (auxiliary grids, auxiliary matrices and H-solvers) and the solve time in seconds for an ASMG-preconditioned CG iteration.
7.3 The storage complexity in bytes per degree of freedom (auxiliary grids, auxiliary matrices and H-solvers) and the solve time in seconds for an ASMG-preconditioned CG iteration.
7.4 Test Platform
7.5 Wall time and number of iterations for the Poisson problem on a 2D uniform grid
7.6 Wall time and number of iterations for the Poisson problem on a 2D quasi-uniform grid
7.7 Wall time and number of iterations for the Poisson problem on a 2D shape-regular grid
7.8 Wall time and number of iterations for the Poisson problem on a disk domain
7.9 Time/number of iterations for the heat-transfer problem on a 3D unit cube
7.10 Wall time and the number of iterations for the heat-transfer problem on a cavity domain

9.1 Number of iterations using $P_1^+$-$P_0^{-1}$ elements as a preconditioner for the Stokes equation with $P_2^0$-$P_0^{-1}$ elements
9.2 Number of iterations using $P_1^+$-$P_0^{-1}$ elements as a preconditioner for the Stokes equation with $P_2$-$P_0$ elements
9.3 Number of iterations using $P_2^0$-$P_0^{-1}$ elements as a preconditioner for the Stokes equation with $P_2^+$-$P_1^{-1}$ elements

Acknowledgments

First and foremost, I want to express my sincere gratitude to my advisor, Prof. Jinchao Xu, for his enthusiasm, patience, encouragement, and inspiration. I have learned a great deal from him, academically and beyond. His insightful advice, patient guidance, and constant support and encouragement were essential to the completion of my education. His fine intuition and deep understanding of finite element methods and multigrid methods have been very important to my Ph.D. studies and research. His support for my work, life, and career development is invaluable, and my words simply cannot express my gratitude strongly enough. His work ethic has been an enormous influence on me. For me, he has redefined the word “advisor”. It has been my extreme privilege to know him and work with him. I also want to thank Prof. James Brannick and Prof. Ludmil Zikatanov for serving as my committee members and for providing a mathematical perspective on my research and thesis. Their comments and advice have been essential for my understanding of the new methods described in this work. I would like to thank my committee member, Prof. Chao-Yang Wang, who valued my research and generously took the time to evaluate my thesis. I want to thank all the excellent post-docs and colleagues in our group: Dr. Xiaozhe Hu, Dr. Maximilian Metti, Dr. Fei Wang, Kai Yang, Changhe Qiao, and Yicong Ma. Without their advice and helpful discussions, it would have been impossible for me to complete my Ph.D. thesis. Last but not least, I want to thank my family, especially my wife Ying Chen, for her unconditional trust, constant encouragement, and sweet love over the years.

Dedication

TO GOD BE THE GLORY.

Chapter 1

Introduction

Numerical simulation plays an important role in scientific research and engineering design, since experimental investigation is both expensive and time consuming. Numerical simulation helps us understand important features and reduces development time. Progress in computer science and engineering helps to meet the resulting need for computational power: the rapid development of the computer industry provides ever more powerful computing capability, which in turn makes numerical simulation applicable to wider fields and more complex physical phenomena. As the complexity and difficulty of numerical simulations increase, the linear solver becomes the most stringent bottleneck as measured by its share of the execution time. The need for fast and stable linear solvers, especially on massively parallel computers, is becoming increasingly urgent.

Assume V is a Hilbert space and V∗ is the dual space of V. Consider the following linear system

Au = f, (1.1)

where A : V → V∗ is a nonsingular linear operator and f ∈ V∗ is a given functional. Since we consider V to be a finite dimensional space, we take V∗ = V.

There are two different ways to solve the system (1.1): a direct solver or an iterative solver. Direct solvers theoretically give the exact solution in finitely many steps; examples include Gaussian elimination [1] and multifrontal solvers [2]. The review papers [3, 4, 5] serve as excellent references for various direct solvers. These methods would give the precise solution if they were performed in infinite precision arithmetic. In practice, however, this is rarely true because of rounding errors: the error made in one step propagates through all following steps. This makes it difficult to solve the equations arising from complex applications by direct solvers. Moreover, as scientific computing develops, many problems become extremely large and complex; “Grand Challenge” problems, for example, require PetaFLOPs and PetaBytes of computing resources. The large computational complexity of direct methods makes them infeasible for such problems, even with the best available computing power.

Consider Gaussian elimination as an example, since it is still the most commonly used direct method in practice. Gaussian elimination is a row reduction algorithm for solving linear equations. To perform row reduction on a matrix, one uses a sequence of elementary row operations to modify the matrix until the lower left-hand corner is filled with as many zeros as possible. There are three types of elementary row operations: 1) swapping two rows, 2) multiplying a row by a nonzero number, and 3) adding a multiple of one row to another row. By using these operations, a matrix can always be transformed into an upper triangular matrix. Once all of the leading coefficients (the left-most nonzero entry in each row) are 1 and every column containing a leading coefficient has zeros elsewhere, the matrix is said to be in reduced row echelon form. This final form is unique; in other words, it is independent of the sequence of row operations used. The advantage of Gaussian elimination is that it is the most user-friendly solver.
For any matrix and right-hand side, Gaussian elimination is guaranteed to solve the equations. However, its computational efficiency is low. The number of arithmetic operations is one way of measuring an algorithm's computational efficiency. Gaussian elimination requires N(N − 1)/2 divisions, (2N^3 + 3N^2 − 5N)/6 multiplications, and (2N^3 + 3N^2 − 5N)/6 subtractions, for a total of approximately 2N^3/3 operations, so the arithmetic complexity is O(N^3). When the problem is large, it costs a great deal of time and memory to solve; sometimes it is simply impossible. For example, for input size N = 10^9, even on the top supercomputer Tianhe-2, which was ranked the world's fastest with a record of 33.86 petaflops in 2013, it would take about 560 years to solve one problem with an O(N^3) algorithm.

In contrast to direct methods, iterative methods are not expected to terminate in a finite number of steps. Starting from an initial guess, iterative methods form successive approximations that converge to the exact solution only in the limit. These methods are relatively easy to implement and use less memory. Therefore, iterative methods are generally needed for large scale problems. Two main classes of iterative methods are the stationary iterative methods and the more general Krylov subspace methods.

Stationary iterative methods solve a linear system with an operator approximating the original one. Examples are the Jacobi method, the Gauss-Seidel method, and the successive over-relaxation (SOR) method [1]. While these methods are simple to derive, implement, and analyze, their convergence is only guaranteed for a limited class of matrices. Krylov subspace methods, on the other hand, work by forming a basis of the sequence of successive matrix powers times the initial residual (the Krylov sequence). The approximations to the solution are then formed by minimizing the residual over the subspace formed. Prototypical methods in this class are the conjugate gradient method (CG) [6], the minimal residual method (MINRES) [7], and the generalized minimal residual method (GMRES) [8]. Since these methods form a basis, they converge in at most N iterations, where N is the system size. In the presence of rounding errors this statement no longer holds; moreover, in practice N can be very large, and the iterative process often reaches sufficient accuracy far earlier. Krylov space methods can also be accelerated by preconditioners. For example, if N = 10^9, the CG method with a multigrid preconditioner can solve the problem in about 500 seconds.

It has been observed that classical iterative methods reduce the high-frequency components of the error rapidly but can hardly reduce the low-frequency components [9, 1, 10]. The multigrid principle was motivated by this observation. Another crucial observation is that low-frequency errors on a fine mesh become high-frequency errors on a coarser mesh. For the coarse grid problem, we can apply smoothing and the separation of scales again; recursive application of smoothing on each level results in the classical formulation of multigrid. A natural realization of this idea is the geometric multigrid (GMG) method [11, 9].
GMG provides substantial acceleration compared to basic iterative solvers like Jacobi or Gauss-Seidel, and even better performance has been observed when these methods are used as preconditioners for Krylov methods. A Poisson equation can be solved by GMG in O(N) operations. Roughly speaking, two types of convergence theories have been developed for GMG. For the first kind, which makes critical use of elliptic regularity of the underlying partial differential equations as well as approximation and inverse properties of the discrete hierarchy of grids, we refer to Bank and Dupont [12], Braess and Hackbusch [13], Hackbusch [11], and Bramble and Pasciak [14]. The second kind makes minor or no elliptic regularity assumptions; we refer to Yserentant [15], Bramble, Pasciak and Xu [16], Bramble, Pasciak, Wang and Xu [17], Xu [18, 19], Yserentant [20], and Chen, Nochetto and Xu [21, 22].

The GMG method, however, relies on a given hierarchy of geometric grids. Such a hierarchy is sometimes naturally available, for example from adaptive grid refinement, or can be obtained in special cases by a coarsening algorithm [23]. But in most practical cases only a single (fine) unstructured grid is given, which makes it difficult to generate a sequence of nested meshes. To circumvent this difficulty, two different approaches have been developed: algebraic multigrid (AMG) methods and non-nested geometric multigrid.

One practical way to generate a grid hierarchy for general unstructured grids is algebraic multigrid (AMG). Most AMG methods, although their derivations are purely algebraic in nature, can be interpreted as nested MG when applied to finite element systems based on a geometric grid. AMG methods are usually very robust and converge quickly for Poisson-like problems [24, 25]. There are many different types of AMG methods: classical AMG [26, 27, 28], smoothed aggregation AMG [29, 30, 31], AMGe [32, 33], unsmoothed aggregation AMG [34, 35], and many others. Highly efficient sequential and parallel implementations are also available for both CPU and GPU systems [36, 37, 38]. AMG methods have been demonstrated to be among the most efficient solvers for many practical problems [39]. Despite this great success in practical applications, AMG still lacks solid theoretical justification except for two-level theories [40, 30, 41, 42, 43, 44, 45]. For a truly multilevel theory, using the theoretical framework developed in [17, 19], Vaněk, Mandel, and Brezina [46] provide a theoretical bound for smoothed aggregation AMG under some assumptions on the aggregates. Such assumptions have recently been investigated in [45] for aggregates that are controlled by auxiliary grids similar to those used in [47].

Another way to build a multigrid method on an unstructured grid is non-nested geometric multigrid. One example of such a theory is by Bramble, Pasciak and Xu [48], where optimal convergence is established under the assumption that a non-nested sequence of quasi-uniform meshes can be obtained. Another example is the work of Bank and Xu [49], which gives a nearly optimal convergence estimate for a hierarchical-basis-type method on a general shape-regular grid in two dimensions. This theory is based on non-nested geometric grids whose sets of nodal points are nested between levels.
One feature of the aforementioned MG algorithms and their theories is that the underlying multilevel finite element subspaces are not nested, which is not always desirable from either a theoretical or a practical point of view. To avoid this non-nestedness, many different MG techniques and theories have been explored in the literature. One such theory was developed by Xu [47] for a semi-nested MG method on an unstructured but quasi-uniform grid, based on an auxiliary grid approach. Instead of generating a sequence of non-nested grids from the initial grid, this method is based on a single auxiliary structured grid whose size is comparable to the original quasi-uniform grid. While the auxiliary grid is not nested with the original grid, it contains a natural nested hierarchy of coarse grids. Under the assumption that the original grid is quasi-uniform, an optimal convergence theory was developed in [47] for second order elliptic boundary value problems with Dirichlet boundary conditions.

The first goal of my thesis is to extend the algorithm and theory of Xu [47] to shape-regular grids that are not necessarily quasi-uniform. The lack of quasi-uniformity of the original grid makes the extension nontrivial for both the algorithm and the theory. First, it is difficult to construct auxiliary hierarchical grids without increasing the grid complexity, especially for grids on complicated domains. We construct the hierarchical structure by generating a cluster tree based on the geometric information of the original grid [50, 51, 52, 53]. Secondly, it is not straightforward to establish optimal convergence for geometric multigrid applied to a hierarchy of auxiliary grids that can be highly locally refined.

As the need to solve extremely large systems grows more urgent, researchers study not only MG methods and theories but also their parallelization. Parallel multigrid approaches have been implemented in various frameworks. For example, waLBerla [54] targets finite difference discretizations on fully structured grids; BoomerAMG (included in the Hypre package [55]) parallelizes the classical AMG methods and their variants for unstructured grids; ML [56] focuses on parallel versions of smoothed aggregation AMG; Peano [57] is based on space-filling curves; and the Distributed and Unified Numerics Environment (DUNE) is a general software framework for solving PDEs.

Not only are researchers rapidly developing algorithms, the hardware is developing just as rapidly. GPUs, based on the single instruction multiple thread (SIMT) hardware architecture, have provided an efficient platform for large-scale scientific computing since November 2006, when NVIDIA released the Compute Unified Device Architecture (CUDA) toolkit, which made programming on GPUs considerably easier than it had been previously. MG methods have also been parallelized and implemented on GPUs in a number of studies. GMG methods, as the prototypical MG methods, were implemented on GPUs first [58, 59, 60, 61, 62, 63]. These studies demonstrate that GPUs can bring GMG methods to a high level of performance on CUDA-enabled hardware. However, to the best of our knowledge, parallelizing an AMG method on GPUs or CPUs remains very challenging, mainly due to the sequential nature of the coarsening processes (setup phase) used in AMG methods.
In most AMG algorithms, coarse-grid points are selected sequentially using graph theoretical tools (such as maximal independent sets and graph partitioning algorithms), and the coarse-grid matrices are constructed by a triple-matrix product. Although extensive research has been devoted to improving the performance of parallel coarsening algorithms, leading to marked improvements on CPUs [64, 65, 66, 36, 67, 68, 69] and on GPUs [65, 70, 71] over time, the setup phase is still considered the major bottleneck in parallel AMG methods. On the other hand, the task of designing an efficient and robust parallel smoother for the solve phase is no less challenging. Most of these difficulties, however, can be overcome by using auxiliary spaces: the special structure of the auxiliary grid makes parallelization more efficient. Therefore, the second goal of this thesis is to design new parallel smoothers and an MG method based on auxiliary space preconditioning techniques.

Many problems in physics and engineering, like the Navier-Stokes equations in fluid dynamics, Helmholtz equations, or multi-physics problems, lead to indefinite problems. Other problems, like the biharmonic plate problem, may be formulated as coupled problems in different variables, which leads to a saddle point problem. It is important to find efficient preconditioners for these indefinite problems.

Preconditioner design and analysis for saddle-point and indefinite problems have been the subject of active research in a variety of areas of applied mathematics, for example groundwater flow, Stokes and Navier-Stokes flow [72, 73, 74, 75], elasticity, and magnetostatics. While most results address the symmetric case, non-symmetric preconditioning has also been analyzed for some practical problems. Various iterative methods and preconditioners for saddle-point type problems have been studied. Some methods focus on clever renumbering schemes in combination with a classical iterative approach, like the SILU and ILU schemes proposed by Wille et al. Other methods are based on a splitting of the saddle point operator.

A number of block preconditioners have been devised, for example block diagonal preconditioners [74], block triangular preconditioners, the pressure convection-diffusion (PCD) commutator of Kay, Loghin and Wathen [76, 77], the least squares commutator (LSC) of Elman, Howle, Shadid, Shuttleworth and Tuminaro [75], the augmented Lagrangian approach (AL) of Benzi and Olshanskii [78], the artificial compressibility (AC) preconditioner [79], and the grad-div (GD) preconditioner [79]. For an overview of block preconditioners, we refer to [80, 81, 82]. Although we can apply MG to solve the sub-blocks, MG for indefinite problems on general unstructured grids has rarely been analyzed. So the last goal of this thesis is to carry out a general analysis of FASP preconditioners for finite element discretizations and to describe sufficient conditions for optimal preconditioning.

The rest of the thesis is organized as follows. In Chapter 2, we review iterative methods and preconditioning techniques. In Chapter 3, we introduce the basic concepts and theories of MG and auxiliary space preconditioning. Next, we present the algorithm and theory of our new auxiliary space MG on shape-regular grids in Chapter 4. In Chapter 5, we present a parallel colored Gauss-Seidel method. In Chapter 6, we discuss the parallelization of the UA-AMG algorithm based on auxiliary grids. After that, we give numerical examples in Chapter 7. In Chapter 8, we review iterative methods for indefinite problems and introduce the new FASP theory for indefinite problems. We then apply the FASP theories to the Stokes equations in Chapter 9. Finally, we draw conclusions and describe future work in Chapter 10.

Chapter 2

Iterative Methods

A single-step linear iterative method uses an old approximation, u^old, of the solution u^* of (1.1) to produce a new approximation, u^new, and usually consists of three steps:

1. Form r^old = f − Au^old;

2. Solve Ae = r^old approximately: ê = Br^old;

3. Update u^new = u^old + ê,

where B is a linear operator on V that can be thought of as an approximate inverse of A. As a result, we have the following algorithm.

Algorithm 1 Iterative Method
Given u^0 ∈ V;
for m = 0, 1, ..., until convergence do
    u^{m+1} = u^m + B(f − Au^m)
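As an illustration, the following Python sketch implements Algorithm 1 for a generic operator B. The helper name apply_B and the relative-residual stopping test are assumptions added for the example; any of the concrete choices of B introduced below (Jacobi, Gauss-Seidel, SOR) can be plugged in.

```python
import numpy as np

def stationary_iteration(A, f, apply_B, u0, tol=1e-8, max_iter=10_000):
    """Algorithm 1: u^{m+1} = u^m + B(f - A u^m), where the action of the
    approximate inverse B is supplied by the callable apply_B."""
    u = u0.copy()
    for m in range(max_iter):
        r = f - A @ u                          # residual of the current iterate
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            return u, m                        # converged in m steps
        u += apply_B(r)                        # correction step: u <- u + B r
    return u, max_iter
```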

We say that an iterative scheme like Algorithm 1 converges if lim_{m→∞} u^m = u^* for any u^0 ∈ V. The core element of the above iterative scheme is the operator B. Notice that if B = A^{-1}, then after one iteration u^1 is the exact solution. In general, B may be regarded as an approximate inverse of A. For general iterative methods, we have the following simple convergence result.

Lemma 2.0.1. The iterative Algorithm 1 converges if and only if

ρ(I − BA) < 1

where ρ(A) is the spectral radius of A.

If A is symmetric positive definite (SPD), we can define a new inner product: (u, v)_A = (Au, v). Sometimes it is desirable that the operator B be symmetric. If B is not symmetric, there is a natural way to symmetrize it: consider the following Algorithm 2.

Algorithm 2 Symmetrized Iterative Method
Given u^0 ∈ V;
for m = 0, 1, ..., until convergence do
    u^{m+1/2} = u^m + B(f − Au^m)
    u^{m+1} = u^{m+1/2} + B^T(f − Au^{m+1/2})

The symmetrization of the iterator B will be denoted by B̄. Since

I − B̄A = (I − B^T A)(I − BA) = I − B^T A − BA + B^T ABA,

we have

B̄ = B^T + B − B^T AB.
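This identity is easy to verify numerically. The following sketch checks I − B̄A = (I − B^T A)(I − BA) on a random SPD matrix, with B taken (as an arbitrary nonsymmetric example) to be the Gauss-Seidel iterator (D + L)^{-1} introduced later in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # a random SPD test matrix
B = np.linalg.inv(np.tril(A))        # B = (D + L)^{-1}, a nonsymmetric iterator
I = np.eye(n)

B_bar = B.T + B - B.T @ A @ B        # the symmetrized iterator
assert np.allclose(I - B_bar @ A, (I - B.T @ A) @ (I - B @ A))
```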

The convergence theory is as follows.

Theorem 2.0.2. A sufficient condition for the convergence of Algorithm 2 is that

B^{-T} + B^{-1} − A is SPD.

Proof. If B^{-T} + B^{-1} − A is SPD, then B̄ = B^T(B^{-T} + B^{-1} − A)B is SPD. Since

(B̄Au, v)_A = (B̄Au, Av) = (u, B̄Av)_A,

B̄A is SPD with respect to (·, ·)_A. Because

((I − B̄A)u, u)_A = ((I − B^T A)(I − BA)u, u)_A = ((I − BA)u, (I − BA)u)_A ≥ 0,

the operator I − B̄A is symmetric positive semidefinite with respect to (·, ·)_A. Assume λ ∈ σ(I − B̄A) ⊂ [0, ∞) and µ ∈ σ(B̄A); then µ > 0, so

0 ≤ λ = 1 − µ < 1,

which leads to the conclusion.

Theorem 2.0.3. If A is SPD, a sufficient condition for the convergence of Algorithm 1 is that B^{-T} + B^{-1} − A is SPD.

Proof. By Theorem 2.0.2,

$$\rho(I - BA) \le \|I - BA\|_A = \|(I - B^TA)(I - BA)\|_A^{1/2} = \|I - \bar BA\|_A^{1/2} < 1.$$

2.1 Stationary Iterative Methods

Most stationary iterative methods pass from one iterate to the next by modifying one or a few components of the approximate solution vector at a time. This is natural, since it is simple to modify a component. The convergence of these methods is rarely guaranteed for all matrices, but a large body of theory exists. Assume V = R^N and A = (a_{ij}) ∈ R^{N×N} is symmetric positive definite (SPD). We begin with the matrix splitting

A = D + L + U, (2.1)

where D is the diagonal of A, L is the strict lower triangular part, and U is the strict upper triangular part (as shown in Figure 2.1).

2.1.1 Jacobi Method

The Jacobi method determines the i-th component of the new approximation by eliminating the i-th component of the residual vector of the previous solution. Let u_i^m denote the i-th component of the m-th iterate u^m.

Figure 2.1. Matrix splitting of A

The new value u_i^{m+1} is determined by requiring the i-th residual to vanish when the off-diagonal terms are evaluated at the previous iterate:

$$f_i - \sum_{j=1,\,j\ne i}^{N} a_{ij}u_j^m - a_{ii}u_i^{m+1} = 0,$$

so that

$$a_{ii}u_i^{m+1} = f_i - \sum_{j=1,\,j\ne i}^{N} a_{ij}u_j^m \qquad (2.2)$$

or

$$u_i^{m+1} = \frac{1}{a_{ii}}\Big(f_i - \sum_{j=1,\,j\ne i}^{N} a_{ij}u_j^m\Big), \qquad i = 1, 2, \cdots, N. \qquad (2.3)$$

This is the component-wise form of the Jacobi iteration. It can be written in vector form as

u^{m+1} = D^{-1}(f − (L + U)u^m) (2.4)

or

u^{m+1} = u^m + D^{-1}(f − Au^m). (2.5)

This leads to the following algorithm:

Algorithm 3 Jacobi Method
for m ← 0, 1, ... until convergence do
    for i ← 1 to N do
        σ = 0
        for j ← 1 to N do
            if j ≠ i then
                σ ← σ + a_{ij} u_j^m
        u_i^{m+1} ← (f_i − σ) / a_{ii}
    check if convergence is reached
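In the vector form (2.5), a Jacobi sweep is a one-liner in numpy. The following sketch assumes a dense matrix and adds a relative-residual stopping test:

```python
import numpy as np

def jacobi(A, f, u0, tol=1e-8, max_iter=100_000):
    """Jacobi method in vector form: u^{m+1} = u^m + D^{-1}(f - A u^m)."""
    d = np.diag(A)                     # the diagonal D, stored as a vector
    u = u0.copy()
    for m in range(max_iter):
        r = f - A @ u
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            return u, m
        u += r / d                     # D^{-1} acts componentwise
    return u, max_iter
```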

The convergence theory of the Jacobi method is well understood.

Theorem 2.1.1 (Jacobi Method). Assume A is symmetric positive definite (SPD). Then the Jacobi method converges if and only if 2D − A is SPD.

Proof. Since B^{-T} + B^{-1} − A = D + D − A = 2D − A, the desired result follows from Theorem 2.0.3.

It is worth mentioning that the Jacobi method sometimes converges even if these conditions are not satisfied, for example for strictly diagonally dominant matrices that are not SPD.

2.1.2 Gauss-Seidel Method

The Gauss-Seidel method determines the i-th component by eliminating the i-th component of the residual vector of the current solution, in the order i = 1, 2, ···, N. This time the approximate solution is updated immediately after each new component is determined. The i-th component of the residual is

$$f_i - \sum_{j=1}^{i-1} a_{ij}u_j^{m+1} - a_{ii}u_i^{m+1} - \sum_{j=i+1}^{N} a_{ij}u_j^m = 0, \qquad (2.6)$$

which leads to the iteration

$$u_i^{m+1} = \frac{1}{a_{ii}}\Big(f_i - \sum_{j=1}^{i-1} a_{ij}u_j^{m+1} - \sum_{j=i+1}^{N} a_{ij}u_j^m\Big), \qquad i = 1, 2, \cdots, N. \qquad (2.7)$$

The vector form of equation (2.6) can be written as

f − Lu^{m+1} − Du^{m+1} − Uu^m = 0.

Therefore, the vector form of the Gauss-Seidel method is

u^{m+1} = (D + L)^{-1}(f − Uu^m), (2.8)

or

u^{m+1} = u^m + (D + L)^{-1}(f − Au^m). (2.9)

This leads to the following algorithm:

Algorithm 4 Gauss-Seidel Method
u = u^0
while convergence is not reached do
    for i ← 1 to N do
        σ = 0
        for j ← 1 to N do
            if j ≠ i then
                σ ← σ + a_{ij} u_j
        u_i ← (f_i − σ) / a_{ii}
    check if convergence is reached
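In the form (2.9), one Gauss-Seidel sweep amounts to a forward substitution with the lower triangular matrix D + L, as the following sketch shows:

```python
import numpy as np
from scipy.linalg import solve_triangular

def gauss_seidel(A, f, u0, tol=1e-8, max_iter=100_000):
    """Gauss-Seidel in vector form: u^{m+1} = u^m + (D + L)^{-1}(f - A u^m)."""
    DL = np.tril(A)                    # D + L, the lower triangular part of A
    u = u0.copy()
    for m in range(max_iter):
        r = f - A @ u
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            return u, m
        u += solve_triangular(DL, r, lower=True)   # forward substitution
    return u, max_iter
```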

The convergence properties of the Gauss-Seidel method depend on the matrix A. Namely, the procedure is known to converge if A is symmetric positive definite, or if A is strictly or irreducibly diagonally dominant.

Theorem 2.1.2 (Gauss-Seidel Method). Assume A is SPD. Then the Gauss-Seidel method always converges.

Proof. Since B^{-T} + B^{-1} − A = (D + L)^T + (D + L) − A = D is SPD, Theorem 2.0.3 shows that the Gauss-Seidel method converges.

The Gauss-Seidel method has several variants. The backward Gauss-Seidel method is defined as

u^{m+1} = u^m + (D + U)^{-1}(f − Au^m), (2.10)

which is equivalent to making the corrections in the order N, N − 1, ···, 1. The symmetric Gauss-Seidel method consists of a forward sweep followed by a backward sweep:

u^{m+1/2} = u^m + (D + L)^{-1}(f − Au^m),
u^{m+1} = u^{m+1/2} + (D + U)^{-1}(f − Au^{m+1/2}). (2.11)

These two steps can be combined into one equation:

u^{m+1} = u^m + [(D + U)^{-1} + (D + L)^{-1} − (D + U)^{-1}A(D + L)^{-1}](f − Au^m). (2.12)

It is simple to prove that when A is SPD, the backward Gauss-Seidel method and the symmetric Gauss-Seidel method converge.

2.1.3 Successive Over-Relaxation Method

The successive over-relaxation (SOR) method is derived by extrapolating the Gauss-Seidel method. The extrapolation takes the form of a weighted average between the previous iterate and the computed Gauss-Seidel iterate, successively for each component:

u_i^{m+1} = ω ū_i^{m+1} + (1 − ω) u_i^m, i = 1, 2, ···, N,

where ū_i^{m+1} denotes the Gauss-Seidel iterate and ω is the extrapolation factor. The idea is to choose a value of ω that accelerates the rate of convergence of the iterates to the solution. The vector form of the SOR method can be written as

u^{m+1} = (D + ωL)^{-1}[−ωU + (1 − ω)D]u^m + ω(D + ωL)^{-1}f, (2.13)

or, equivalently, in residual form,

u^{m+1} = u^m + ω(D + ωL)^{-1}(f − Au^m).

This leads to the following algorithm:

Algorithm 5 SOR Method
u = u^0
while convergence is not reached do
    for i ← 1 to N do
        σ = 0
        for j ← 1 to N do
            if j ≠ i then
                σ ← σ + a_{ij} u_j
        u_i ← (1 − ω) u_i + ω (f_i − σ) / a_{ii}
    check if convergence is reached
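Using the residual form u^{m+1} = u^m + ω(D + ωL)^{-1}(f − Au^m) noted above, SOR is again one triangular solve per sweep; ω = 1 recovers Gauss-Seidel. A sketch:

```python
import numpy as np
from scipy.linalg import solve_triangular

def sor(A, f, omega, u0, tol=1e-8, max_iter=100_000):
    """SOR in residual form: u^{m+1} = u^m + omega (D + omega L)^{-1}(f - A u^m)."""
    M = np.diag(np.diag(A)) + omega * np.tril(A, k=-1)   # D + omega L
    u = u0.copy()
    for m in range(max_iter):
        r = f - A @ u
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            return u, m
        u += omega * solve_triangular(M, r, lower=True)
    return u, max_iter
```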

If ω = 1, the SOR method reduces to the Gauss-Seidel method, while ω = 0 leaves the iterate unchanged. A theorem due to Kahan [83] shows that SOR fails to converge if ω is outside the interval (0, 2).

Theorem 2.1.3. Assume A is SPD. Then the SOR method converges if 0 < ω < 2.

Proof. Since

$$B^{-T} + B^{-1} - A = \omega^{-1}(D + \omega L)^T + \omega^{-1}(D + \omega L) - A = \frac{2-\omega}{\omega}\,D$$

is SPD for 0 < ω < 2, the result follows from Theorem 2.0.3.

In general, it is not possible to compute in advance the value of ω that maximizes the rate of convergence of SOR. If the coefficient matrix A is symmetric positive definite, the SOR iteration is guaranteed to converge for any value of ω between 0 and 2, though the choice of ω can significantly affect the rate at which the SOR iteration converges. Frequently, a heuristic estimate is used, such as ω = 2 − O(h), where h is the mesh spacing of the discretization of the underlying physical domain. A backward SOR sweep can be defined analogously to the backward Gauss-Seidel sweep (2.10). A symmetric SOR (SSOR) step consists of the SOR step (2.13) followed by a backward SOR step:

u^{m+1/2} = (D + ωL)^{-1}[−ωU + (1 − ω)D]u^m + ω(D + ωL)^{-1}f,
u^{m+1} = (D + ωU)^{-1}[−ωL + (1 − ω)D]u^{m+1/2} + ω(D + ωU)^{-1}f. (2.14)

2.1.4 Block Iterative Method

We now discuss the extension of the stationary iterative methods to block schemes. The block iterative methods are generalizations of the “pointwise” iterative methods described above: they update a whole set of components at a time, typically a subvector of the solution vector, instead of only one component. We assume that Ã is partitioned as

$$\tilde A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1J}\\ A_{21} & A_{22} & \cdots & A_{2J}\\ \vdots & \vdots & \ddots & \vdots\\ A_{J1} & A_{J2} & \cdots & A_{JJ}\end{pmatrix}, \qquad \tilde u = \begin{pmatrix}\xi_1\\ \xi_2\\ \vdots\\ \xi_J\end{pmatrix}, \qquad \tilde f = \begin{pmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_J\end{pmatrix},$$

with entries being subblocks:

$$A_{ij} \in \mathbb R^{N_i\times N_j}, \quad \xi_j \in \mathbb R^{N_j}, \quad \beta_i \in \mathbb R^{N_i}, \quad 1 \le i, j \le J, \quad N = \sum_i N_i.$$

Similarly, we can define the block matrix splitting

$$\tilde A = \tilde D + \tilde L + \tilde U,$$

where D̃ is the block diagonal of Ã, and L̃ and Ũ are the strictly block lower and upper triangular parts of Ã, respectively. Namely,

$$\tilde D = \begin{pmatrix} A_{11} & & & \\ & A_{22} & & \\ & & \ddots & \\ & & & A_{JJ}\end{pmatrix}, \quad \tilde L = \begin{pmatrix} 0 & & & \\ A_{21} & 0 & & \\ \vdots & \ddots & \ddots & \\ A_{J1} & \cdots & A_{J,J-1} & 0\end{pmatrix}, \quad \tilde U = \begin{pmatrix} 0 & A_{12} & \cdots & A_{1J}\\ & 0 & \ddots & \vdots\\ & & \ddots & A_{J-1,J}\\ & & & 0\end{pmatrix}.$$

With these definitions, it is easy to generalize the three iterative procedures defined earlier, namely Jacobi, Gauss-Seidel, and SOR: we simply take

$$\tilde B = \begin{cases} \tilde D^{-1} & \text{block Jacobi};\\ (\tilde D + \tilde L)^{-1} & \text{block Gauss-Seidel};\\ \omega(\tilde D + \omega\tilde L)^{-1} & \text{block SOR},\end{cases} \qquad (2.15)$$

and the iterative method can be written as

$$\tilde u^{m+1} = \tilde u^m + \tilde B(\tilde f - \tilde A\tilde u^m). \qquad (2.16)$$

In addition, a block can also correspond to the unknowns associated with a few consecutive lines in the plane. Unlike the “pointwise” iterative methods, it is not easy to apply B̃ exactly, because solving systems with the diagonal blocks A_{ii} may be very expensive. In order to make block iterative methods more practical, modified block iterative methods can be defined by replacing the block diagonal inverse D̃^{-1} by a modified (approximate) block diagonal inverse, denoted by R̃. Namely, we can choose

$$\tilde B = \begin{cases} \tilde R & \text{modified block Jacobi};\\ (\tilde R^{-1} + \tilde L)^{-1} & \text{modified block Gauss-Seidel};\\ \omega(\tilde R^{-1} + \omega\tilde L)^{-1} & \text{modified block SOR},\end{cases} \qquad (2.17)$$

where R̃ is an approximation of D̃^{-1}. This means that at each step we do not have to solve the diagonal subblocks exactly, which is more practical to implement. It can be proved that, under the same conditions as for the pointwise iterative methods, the block iterative methods and the modified block iterative methods also converge. In the following, we give a convergence analysis of the modified block Gauss-Seidel method:

$$\tilde u^{m+1} = \tilde u^m + (\tilde R^{-1} + \tilde L)^{-1}(\tilde f - \tilde A\tilde u^m), \qquad m = 1, 2, \ldots \qquad (2.18)$$

Lemma 2.1.4. Assume that Ã is symmetric positive semidefinite and that

$$\bar{\tilde D} = \tilde R^{-T} + \tilde R^{-1} - \tilde D$$

is nonsingular. Then

$$\bar{\tilde B}^{-1} = (\tilde R^{-T} + \tilde U)^T \bar{\tilde D}^{-1} (\tilde R^{-T} + \tilde U) = \tilde A + (\tilde D + \tilde U - \tilde R^{-1})^T \bar{\tilde D}^{-1} (\tilde D + \tilde U - \tilde R^{-1}), \qquad (2.19)$$

and, for any ṽ,

$$\tilde v^T \bar{\tilde B}^{-1} \tilde v = ((\tilde R^{-T} + \tilde U)\tilde v)^T \bar{\tilde D}^{-1} (\tilde R^{-T} + \tilde U)\tilde v, \qquad (2.20)$$

$$\tilde v^T \bar{\tilde B}^{-1} \tilde v = \tilde v^T \tilde A \tilde v + ((\tilde D + \tilde U - \tilde R^{-1})\tilde v)^T \bar{\tilde D}^{-1} (\tilde D + \tilde U - \tilde R^{-1})\tilde v. \qquad (2.21)$$

Proof. It follows that

$$\bar{\tilde B} = \tilde B + \tilde B^T - \tilde B^T \tilde A \tilde B = \tilde B^T(\tilde B^{-1} + \tilde B^{-T} - \tilde A)\tilde B = \tilde B^T \bar{\tilde D} \tilde B,$$

where

$$\bar{\tilde D} = \tilde B^{-1} + \tilde B^{-T} - \tilde A = \tilde R^{-T} + \tilde R^{-1} - \tilde D.$$

Hence

$$\bar{\tilde B}^{-1} = \tilde B^{-1}\bar{\tilde D}^{-1}\tilde B^{-T} = (\bar{\tilde D} + \tilde A - \tilde B^{-T})\bar{\tilde D}^{-1}(\bar{\tilde D} + \tilde A - \tilde B^{-1}) = \tilde A + (\tilde A - \tilde B^{-T})\bar{\tilde D}^{-1}(\tilde A - \tilde B^{-1}).$$

Since B̃^{-1} = R̃^{-1} + L̃, we have Ã − B̃^{-1} = D̃ + Ũ − R̃^{-1} and Ã − B̃^{-T} = (D̃ + Ũ − R̃^{-1})^T. The desired results then follow.

Theorem 2.1.5. Assume Ã is SPD. The modified block Gauss-Seidel method converges if

$$\bar{\tilde D} \equiv \tilde R^{-T} + \tilde R^{-1} - \tilde D > 0. \qquad (2.22)$$

Moreover, the following convergence estimate holds:

$$\|I - \tilde B\tilde A\|_{\tilde A}^2 = 1 - \frac{1}{c_1} = 1 - \frac{1}{1 + c_0}, \qquad (2.23)$$

where

$$c_1 = \sup_{\|v\|_{\tilde A}=1} ((\tilde R^{-T} + \tilde U)v)^T \bar{\tilde D}^{-1}(\tilde R^{-T} + \tilde U)v \qquad (2.24)$$

and

$$c_0 = \sup_{\|v\|_{\tilde A}=1} ((\tilde D + \tilde U - \tilde R^{-1})v)^T \bar{\tilde D}^{-1}(\tilde D + \tilde U - \tilde R^{-1})v. \qquad (2.25)$$

In particular, if R̃ = D̃^{-1}, then D̄̃ = D̃ and

$$c_1 = \sup_{\|v\|_{\tilde A}=1} ((\tilde D + \tilde U)v)^T \tilde D^{-1}(\tilde D + \tilde U)v \qquad (2.26)$$

and

$$c_0 = \sup_{\|v\|_{\tilde A}=1} (\tilde Uv)^T \tilde D^{-1}\tilde Uv. \qquad (2.27)$$
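To make the block framework concrete, here is a sketch of a block Jacobi iteration in Python. The contiguous partition given by block_sizes is an assumption for simplicity; each diagonal block A_ii is LU-factored once in a setup phase, and replacing these exact block solves by an approximation R̃ gives the modified block methods analyzed above.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def block_jacobi(A, f, block_sizes, u0, tol=1e-8, max_iter=10_000):
    """Block Jacobi: u <- u + D~^{-1}(f - A u), applied block by block."""
    offsets = np.concatenate(([0], np.cumsum(block_sizes)))
    # setup phase: factor each diagonal block A_ii once
    factors = [lu_factor(A[offsets[i]:offsets[i+1], offsets[i]:offsets[i+1]])
               for i in range(len(block_sizes))]
    u = u0.copy()
    for m in range(max_iter):
        r = f - A @ u
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            return u, m
        for i in range(len(block_sizes)):
            s = slice(offsets[i], offsets[i+1])
            u[s] += lu_solve(factors[i], r[s])   # exact solve with A_ii
    return u, max_iter
```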

2.2 Krylov Space Method and Preconditioners

With respect to their “influence on the development and practice of science and engineering in the 20th century,” Krylov space methods are considered one of the ten most important classes of numerical methods. Unlike the stationary iterative methods, Krylov methods do not have a fixed iteration matrix. A Krylov subspace method extracts an approximate solution u^m from an affine subspace u^0 + K_m, where u^0 is an arbitrary initial guess and K_m is the Krylov subspace

K_m(A, r^0) = span{r^0, Ar^0, A^2 r^0, ···, A^{m−1} r^0}. (2.28)

The residual is r = f − Au, and {r^m}_{m≥0} denotes the sequence of residuals

r^m = f − Au^m.

When there is no ambiguity, K_m(A, r^0) will be denoted by K_m. From approximation theory, it is clear that the approximations obtained from a Krylov subspace method are of the form

A^{-1}f ≈ u^m = u^0 + q_{m−1}(A)r^0, (2.29)

in which q_{m−1} is a certain polynomial of degree m − 1. In the simplest case, u^0 = 0, and A^{-1}f is approximated by q_{m−1}(A)f. In other words, the polynomial q_{m−1}(t) approximates 1/t on the spectrum of A. The relative residual norm of the Krylov subspace method can be bounded as

$$\frac{\|r^m\|}{\|r^0\|} \le \min_{q\in \mathcal P_{m-1}} \|I - Aq(A)\|, \qquad (2.30)$$

where P_{m−1} is the set of polynomials of degree at most m − 1.

Although all of these techniques provide the same type of polynomial approximation, different choices give rise to different versions of Krylov subspace methods. In this section we introduce the conjugate gradient method (CG); the minimal residual method (MINRES) and the generalized minimal residual method (GMRES) are introduced in Chapter 8.

2.2.1 Conjugate Gradient Method

The conjugate gradient (CG) method is due to Hestenes and Stiefel [6]. It is one of the best known iterative techniques for solving sparse symmetric positive definite (SPD) linear systems. The conjugate gradient method was invented in the 1950s as a direct method. It has come into wide use over the last decades as an iterative method and has generally superseded the stationary iterative methods. The CG method is the prototype of the Krylov subspace methods. It is an orthogonal projection method and satisfies a minimality condition: the error is minimal in the energy norm, or A-norm,

‖u‖_A = (u^T Au)^{1/2}.

Consider the quadratic function

φ(u) = ½ u^T Au − f^T u + γ. (2.31)

Since φ is convex, it has a unique minimizer u^*, which satisfies

∇φ(u^*) = Au^* − f = 0,

so u^* is the solution of (1.1). If we choose γ = ½ f^T A^{-1} f, it is easy to see that

‖f − Au‖²_{A^{-1}} = ‖u^* − u‖²_A = u^T Au − 2f^T u + f^T A^{-1} f = 2φ(u).

So solving the equation Au = f is equivalent to minimizing the quadratic function φ(u), and also equivalent to minimizing the energy norm of the error u^* − u. At the m-th step, we want to find u^m such that

$$u^m = \arg\min_{u \in u^0 + \mathcal K_m} \varphi(u). \qquad (2.32)$$

By doing this iteratively, we can find the solution within N steps. We choose search directions {v^m} that are conjugate (A-orthogonal) to each other, i.e. (v^m)^T A v^j = 0 for j = 0, 1, ···, m − 1, and define

u^m = u^{m−1} + ω_{m−1} v^{m−1}, r^m = r^{m−1} − ω_{m−1} A v^{m−1},

where ω_{m−1} is chosen to minimize the A-norm of the error along the line u^{m−1} + ωv^{m−1}. This means

$$\omega_{m-1} = \frac{(r^{m-1})^T v^{m-1}}{(v^{m-1})^T A v^{m-1}}. \qquad (2.33)$$

The minimality condition (2.32) is equivalent to the Galerkin condition

r^m ⊥ K_m,

which implies that {r^m}_{m=0}^{N} form an orthogonal basis of K_N. Taking the conjugate search directions obtained by A-orthogonalizing the residuals {r^m} yields the standard CG method. The detailed algorithm is as follows:

Algorithm 6 CG Method
r^0 = f − Au^0; p^0 = r^0;
for m = 0, 1, ..., until convergence do
    α_m ← (r^m)^T r^m / ((p^m)^T A p^m)
    u^{m+1} ← u^m + α_m p^m
    r^{m+1} ← r^m − α_m A p^m
    if r^{m+1} is sufficiently small then exit loop
    β_m ← (r^{m+1})^T r^{m+1} / ((r^m)^T r^m)
    p^{m+1} ← r^{m+1} + β_m p^m
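A direct transcription of Algorithm 6 into Python (dense numpy; the relative-residual stopping test is an added assumption):

```python
import numpy as np

def cg(A, f, u0, tol=1e-8, max_iter=None):
    """Conjugate gradient method, following Algorithm 6."""
    max_iter = max_iter if max_iter is not None else len(f)
    u = u0.copy()
    r = f - A @ u
    p = r.copy()
    rho = r @ r                        # (r^m)^T r^m
    for m in range(max_iter):
        Ap = A @ p
        alpha = rho / (p @ Ap)
        u += alpha * p
        r -= alpha * Ap
        rho_new = r @ r
        if np.sqrt(rho_new) <= tol * np.linalg.norm(f):
            return u, m + 1
        p = r + (rho_new / rho) * p    # beta_m = rho_new / rho
        rho = rho_new
    return u, max_iter
```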

The conjugate gradient method can theoretically be viewed as a direct method, as it produces the exact solution after a finite number of iterations (not larger than the size of the matrix) in the absence of round-off error. However, the conjugate gradient method is unstable with respect to even small perturbations: in practice most directions are not exactly conjugate, and the exact solution is never obtained. Fortunately, the conjugate gradient method works well as an iterative method, as it provides monotonically improving approximations u^m to the exact solution, which may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. The improvement is typically linear and its speed is determined by the condition number κ(A) of the system matrix A:

$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)},$$

where λ_max(A) and λ_min(A) are the largest and smallest eigenvalues of A. The convergence analysis of CG from [84] is as follows.

Theorem 2.2.1 (Convergence of CG). Assume that u^m is the m-th iterate of the CG method and u^* is the exact solution. Then we have the following estimate:

$$\|u^* - u^m\|_A \le 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^m \|u^* - u^0\|_A. \qquad (2.34)$$

Proof. For an arbitrary polynomial q_{m−1} of degree m − 1, set

$$\tilde u^m = q_{m-1}(A)f = q_{m-1}(A)Au^* = Aq_{m-1}(A)u^*.$$

By (2.32) we have (taking u^0 = 0 for simplicity)

$$\begin{aligned}
(u^* - u^m)^T A(u^* - u^m) &\le \min_{q_{m-1}} (u^* - \tilde u^m)^T A(u^* - \tilde u^m)\\
&= \min_{q_{m-1}} ((I - Aq_{m-1}(A))u^*)^T A (I - Aq_{m-1}(A))u^*\\
&= \min_{q_m(0)=1} (q_m(A)u^*)^T A\, q_m(A)u^*\\
&\le \min_{q_m(0)=1} \max_{\lambda\in\sigma(A)} |q_m(\lambda)|^2\, (u^*)^T A u^*\\
&\le \min_{q_m(0)=1} \max_{\lambda\in[a,b]} |q_m(\lambda)|^2\, (u^*)^T A u^*.
\end{aligned}$$

Here a = λ_min(A) and b = λ_max(A). We choose

$$\tilde q_m(\lambda) = \frac{T_m\!\left(\frac{b+a-2\lambda}{b-a}\right)}{T_m\!\left(\frac{b+a}{b-a}\right)},$$

where T_m(t) is the Chebyshev polynomial of degree m, given by

$$T_m(t) = \begin{cases} \cos(m\arccos t) & \text{if } |t| \le 1;\\ (\operatorname{sign}(t))^m \cosh(m\,\operatorname{arccosh}|t|) & \text{if } |t| \ge 1. \end{cases} \qquad (2.35)$$

Notice that |T_m((b + a − 2λ)/(b − a))| ≤ 1 for λ ∈ [a, b]. Thus

$$\max_{\lambda\in[a,b]} |\tilde q_m(\lambda)| \le \left[T_m\!\left(\frac{b+a}{b-a}\right)\right]^{-1}.$$

We set

$$\frac{b+a}{b-a} = \cosh\sigma = \frac{e^{\sigma}+e^{-\sigma}}{2}.$$

Solving this equation for e^σ, we have

$$e^{\sigma} = \frac{\sqrt{\kappa(A)}+1}{\sqrt{\kappa(A)}-1},$$

since κ(A) = b/a. We then obtain

$$T_m\!\left(\frac{b+a}{b-a}\right) = \cosh m\sigma = \frac{e^{m\sigma}+e^{-m\sigma}}{2} \ge \frac{1}{2}e^{m\sigma} = \frac{1}{2}\left(\frac{\sqrt{\kappa(A)}+1}{\sqrt{\kappa(A)}-1}\right)^m.$$

Consequently,

$$\min_{q_m(0)=1}\max_{\lambda\in[a,b]} |q_m(\lambda)| \le 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^m.$$

The desired result then follows.
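To get a feel for the estimate (2.34), the following sketch computes the smallest m for which the bound guarantees a given error reduction ε; the printed counts illustrate the O(√κ(A)) growth of the iteration number.

```python
import numpy as np

def cg_iteration_bound(kappa, eps=1e-8):
    """Smallest m with 2((sqrt(k)-1)/(sqrt(k)+1))^m <= eps, per (2.34)."""
    rate = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    return int(np.ceil(np.log(eps / 2.0) / np.log(rate)))

for kappa in (1e2, 1e4, 1e6):
    print(f"kappa = {kappa:.0e}: at most {cg_iteration_bound(kappa)} iterations")
```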

The estimate given above is sufficient for many applications, but in general it is not sharp. There are many ways to sharpen it. For example, the following improved estimate shows that the convergence of the CG method depends on the distribution of the spectrum of A: it is possible that the CG method converges fast even when the condition number of A is large.

Theorem 2.2.2 ([85]). Assume that σ(A) = σ_0(A) ∪ σ_1(A), and let l be the number of elements in σ_0(A). Then

$$\|u^* - u^m\|_A \le 2M\left(\frac{\sqrt{b/a}-1}{\sqrt{b/a}+1}\right)^{m-l}\|u^* - u^0\|_A,$$

where a = min_{λ∈σ_1(A)} λ, b = max_{λ∈σ_1(A)} λ, and

$$M = \max_{\lambda\in\sigma_1(A)} \prod_{\mu\in\sigma_0(A)} \left|1 - \frac{\lambda}{\mu}\right|.$$

In practice, κ(A) ≫ 1 if A is ill-conditioned, and in this case a fast convergence rate cannot be guaranteed.

2.3 Preconditioned Iterations

Although all of the iterative methods above are well founded theoretically, they are all likely to suffer from slow convergence for problems arising in typical applications. Preconditioning is a key ingredient in the success of Krylov subspace methods in applications: both the efficiency and the robustness of iterative techniques can be improved by using preconditioners. Preconditioning simply transforms the original linear system into one that has the same solution but is easier to solve with an iterative solver. In general, the reliability of iterative techniques in various applications depends much more on the quality of the preconditioner than on the particular Krylov subspace accelerator used. In this section, we first discuss preconditioned versions of the Krylov subspace algorithms with a generic preconditioner, and then cover some commonly used preconditioners.

To begin with, it is worthwhile to consider the options available for preconditioning a system. The first step is to find a preconditioning matrix B. The matrix B can be defined in many different ways, but it must satisfy a few minimal requirements: B should be an approximation of A^{-1} in some sense. From a practical point of view, the most important requirement for B is that it be inexpensive to construct and to apply, because the preconditioned algorithms require a multiplication by B at each step. Once a preconditioning matrix B is defined, there are three known ways of applying it. The preconditioner can be applied from the left, leading to the preconditioned system

BAu = Bf. (2.36)

Alternatively, it can be applied from the right:

ABv = f, u = Bv. (2.37)

Note that this formulation amounts to making the change of variables v = B^{-1}u and solving the system with respect to the unknown v. The third common option is to apply the preconditioner on both sides, which is called a split preconditioner:

B_1AB_2v = B_1f, u = B_2v, B = B_1B_2. (2.38)

2.3.1 Preconditioned Conjugate Gradient Method

Recall that CG is designed for symmetric positive definite matrices, so it is imperative to preserve symmetry. In general, the left and right preconditioned operators BA and AB are no longer symmetric. In order to design the preconditioned conjugate gradient method (PCG), we need strategies for preserving symmetry. When B is available in the form of an incomplete Cholesky factorization, i.e. B = LL^T, it is natural to use the split preconditioner as in (2.38), which leads to

LAL^T v = Lf, u = L^T v. (2.39)

Applying CG to this system, we obtain the corresponding PCG method.

Algorithm 7 PCG Method with split preconditioner
r^0 = f − Au^0; r̂^0 = Lr^0; p^0 = L^T r̂^0
for m = 0, 1, ..., until convergence do
    α_m ← (r̂^m)^T r̂^m / ((p^m)^T A p^m)
    u^{m+1} ← u^m + α_m p^m
    r̂^{m+1} ← r̂^m − α_m L A p^m
    β_m ← (r̂^{m+1})^T r̂^{m+1} / ((r̂^m)^T r̂^m)
    p^{m+1} ← L^T r̂^{m+1} + β_m p^m

However, it is not necessary to split the preconditioner in order to preserve symmetry. Since BA is self-adjoint with respect to the B^{-1} inner product,

(BAu, v)_{B^{-1}} = (Au, v) = (u, BAv)_{B^{-1}},

an alternative is to replace the usual Euclidean inner product in the conjugate gradient algorithm by the B^{-1} inner product. Note that the B^{-1} inner product does not have to be computed explicitly. With this observation, the following algorithm is obtained.

Algorithm 8 PCG Method with left preconditioner
r^0 = f − Au^0; z^0 = Br^0; p^0 = z^0
for m = 0, 1, ..., until convergence do
    α_m ← (r^m)^T z^m / ((p^m)^T A p^m)
    u^{m+1} ← u^m + α_m p^m
    r^{m+1} ← r^m − α_m A p^m
    z^{m+1} ← B r^{m+1}
    β_m ← (r^{m+1})^T z^{m+1} / ((r^m)^T z^m)
    p^{m+1} ← z^{m+1} + β_m p^m
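Algorithm 8 in Python. Here apply_B is a hypothetical callable applying the SPD preconditioner B, for instance the SSOR operator sketched later in this section or a multigrid cycle:

```python
import numpy as np

def pcg(A, f, apply_B, u0, tol=1e-8, max_iter=None):
    """Left-preconditioned CG, following Algorithm 8."""
    max_iter = max_iter if max_iter is not None else len(f)
    u = u0.copy()
    r = f - A @ u
    z = apply_B(r)                     # z^0 = B r^0
    p = z.copy()
    rho = r @ z                        # (r^m)^T z^m
    for m in range(max_iter):
        Ap = A @ p
        alpha = rho / (p @ Ap)
        u += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            return u, m + 1
        z = apply_B(r)
        rho_new = r @ z
        p = z + (rho_new / rho) * p
        rho = rho_new
    return u, max_iter
```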

Observing that BA is also self-adjoint with respect to the A inner product,

(BAu, v)_A = (BAu, Av) = (u, BAv)_A,

a similar PCG with left preconditioner can be written under this inner product. Consider now the right preconditioned system (2.37). AB is self-adjoint with respect to the B inner product, so the PCG algorithm can be written with respect to the variable u under this new inner product; one finds that the same sequence of computations is obtained as with Algorithm 8. The implication is that the left preconditioned CG algorithm with the B^{-1} inner product is mathematically equivalent to the right preconditioned CG algorithm with the B inner product. The following convergence estimate holds for PCG:

$$\|u^* - u^m\|_A \le 2\left(\frac{\sqrt{\kappa(BA)}-1}{\sqrt{\kappa(BA)}+1}\right)^m \|u^* - u^0\|_A. \qquad (2.40)$$

A good preconditioner for PCG should therefore make BA better conditioned, i.e. κ(BA) < κ(A). Knowing that any iterative method can be seen as an operator B approximating A^{-1}, we now prove that any convergent symmetric iterative method can accelerate CG.

Theorem 2.3.1. Assume that B is symmetric with respect to the inner product (·, ·). If the iterative scheme

u^{m+1} = u^m + B(f − Au^m)

is convergent, then the B from the scheme can be used as a preconditioner for A, and the PCG method converges at a faster rate.

Proof. If the iterative scheme is convergent, then ρ = ρ(I − BA) < 1. Since B is symmetric, I − BA is self-adjoint with respect to (·, ·)_A, and

$$\|I - BA\|_A = \sup_{v\ne 0} \frac{|((I - BA)v, v)_A|}{\|v\|_A^2} < 1,$$

so

$$\inf_{v\ne 0} \frac{(BAv, v)_A}{\|v\|_A^2} > 0,$$

which means that B is symmetric positive definite. So we can use B as a preconditioner for PCG. Moreover, every eigenvalue µ of BA satisfies |1 − µ| ≤ ρ, so µ ∈ [1 − ρ, 1 + ρ] and

$$\kappa(BA) \le \frac{1+\rho}{1-\rho}.$$

Therefore

$$\frac{\sqrt{\kappa(BA)}-1}{\sqrt{\kappa(BA)}+1} \le \frac{\sqrt{\frac{1+\rho}{1-\rho}}-1}{\sqrt{\frac{1+\rho}{1-\rho}}+1} = \frac{1-\sqrt{1-\rho^2}}{\rho} < \rho.$$

The desired result then follows.

We conclude that from any linear iterative scheme for which B is symmetric, a preconditioner for A can be obtained, and the convergence rate can be accelerated by using the PCG method. For example, the symmetric SOR method yields the preconditioner

$$B = \omega(2-\omega)(D+\omega U)^{-1}D(D+\omega L)^{-1},$$

which can be written as B = SS^T with S = √(ω(2 − ω)) (D + ωU)^{-1} D^{1/2}.
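A sketch of this SSOR preconditioner as a callable suitable for the PCG routine above; the factory name is illustrative, and ω = 1 gives the symmetric Gauss-Seidel preconditioner:

```python
import numpy as np
from scipy.linalg import solve_triangular

def make_ssor_preconditioner(A, omega=1.0):
    """apply_B for B = omega(2-omega)(D + omega U)^{-1} D (D + omega L)^{-1}."""
    d = np.diag(A)
    DL = np.diag(d) + omega * np.tril(A, k=-1)       # D + omega L
    DU = np.diag(d) + omega * np.triu(A, k=1)        # D + omega U
    def apply_B(r):
        z = solve_triangular(DL, r, lower=True)      # forward sweep
        z = omega * (2.0 - omega) * d * z            # scaling by D
        return solve_triangular(DU, z, lower=False)  # backward sweep
    return apply_B
```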

A more interesting case is when the scheme is not convergent at all, whereas B can still serve as a preconditioner. For example, the Jacobi method does not converge for all SPD systems, but B = D^{-1} can always be used as a preconditioner. This preconditioner is often known as the diagonal preconditioner.

2.3.2 Preconditioning Techniques

Finding a good preconditioner for a sparse linear system is often viewed as a combination of art and science. A preconditioner can be any solver that is combined with an outer iteration, such as a Krylov subspace iteration. Roughly speaking, a preconditioner is any form of implicit or explicit modification of the linear system. In general, a good preconditioner should be cheap to construct and to apply.

Generally speaking, there are two approaches to constructing preconditioners. One popular approach is purely algebraic methods that use only the information in the matrix A. Algebraic methods are often easy to develop and to use. For example, all of the stationary iterative methods can be applied as preconditioners: for a matrix splitting A = M − N, any B = M^{-1} can serve as a preconditioner, and ideally M should be close to A in some sense. Another common preconditioner is defined by an incomplete factorization of A. Assume we have a decomposition of the form A = LU − R, where L and U have the same nonzero structure as the lower and upper parts of A, respectively, and R is the residual or error of the factorization. This factorization, known as ILU(0), is easy and inexpensive to compute. On the other hand, it often yields a crude approximation, which may result in the Krylov subspace accelerator requiring many iterations to converge. To remedy this, several alternative incomplete factorizations have been developed that allow more fill-in in L and U. In general, the more accurate ILU factorizations require fewer iterations to converge, but the preprocessing cost to compute the factors is higher.

The algebraic methods may not always be efficient. The other approach is to design specialized algorithms that use more information about a narrow class of problems. By using knowledge of the problem, such as the continuous equations, the problem domain, and details of the discretization, very stable and efficient preconditioners can be developed. One typical example is the multigrid preconditioner.
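As a concrete example of the algebraic approach, the sketch below uses SciPy's spilu, a threshold-based incomplete LU (not exactly the ILU(0) described above; drop_tol and fill_factor trade accuracy against cost), as a preconditioner for SciPy's CG. The 1D Laplacian is a stand-in test matrix.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg, spilu

n = 1000
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")  # 1D Laplacian
f = np.ones(n)

ilu = spilu(A, drop_tol=1e-4, fill_factor=10)      # incomplete factorization
B = LinearOperator(A.shape, matvec=ilu.solve)      # B ~ A^{-1}
u, info = cg(A, f, M=B)                            # info == 0 means converged
```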

2.4 Numerical Example

At last, we use one example to show the difference of all the iterative method we used to make the audience be more clear about the advantage and drawbacks of the methods. All of the methods are tested on MATLAB with 2.3GHz processor. 29

2.4.1 Comparison of the Iterative Method

The first test problem we considered is the Laplacian on the unit square with homogeneous Dirichlet boundary conditions:

− ∆u = f in Ω := [0, 1]2, u = 0 on Γ := ∂Ω. (2.41)

Set f = 2x(1 − x) + 2y(1 − y), so the exact solution is u∗ = x(1 − x)y(1 − y). Define the P1 finite element spaces V on a (2` − 1) × (2` − 1) structured mesh with ` 2 N N := (2 − 1) vertices. Let (ϕi)i=1 denote a Lagrange basis in V . The discrete stiffness matrices A are defined by Z Ai,j := h∇ϕj(x) , ∇ϕi(x)idx. Ω

The right hand side vector b is defined as Z fi = fϕi(x)dx. Ω

The linear systems generated from this equation is symmetric positive definite. We compare the different stationary iterative method by using CG as a baseline (Table 2.4.1). As shown in Figure 2.2 and Figure 2.3, Jacobi method converges slowest in both number of iteration and CPU time. It is considered as an inefficient method in term of the run time. However, the primary advantage of the Jacobi method is its parallelism. The low complexity of each iterates makes Jacobi method seems to be very efficient in parallel computing. Gauss Seidel method is faster is because it uses more updated convergence values to find better guesses than the Jacobi method. SOR method with a optimal parameter converges fastest in the stationary method. However, it is difficult to find the best parameters in practice. On the other hand, from the numerical results, all of the stationary iterative method is much slower than CG method and the convergence rate highly depend on the condition of the matrix A. It might need thousands of iterations to get an accurate solution. Because of these disadvantages, stationary iterative methods are rarely used as a solver of the linear systems. 30

5 × 5 9 × 9 17 × 17 33 × 33 Jacobi 129(2.7) 368(6.45) 1204(18.4) 4309(60.2) Gauss-Seidel 65(1.4) 185(3.25) 603(9.25) 2156(30.15) Symemtric GS 38(0.9) 98(1.9) 308(4.9) 1084(15.4) SOR 22(0.7) 37(1.1) 67(1.85) 127(3.35) CG 4(0.1) 10(0.2) 24(0.5) 46(1.2) Table 2.1. Comparison of the different iterative method for Poisson equation

Figure 2.2. Comparison of the number of Iterations

Figure 2.3. Comparison of the CPU time 31

2.4.2 Comparison of the Preconditioners

In this subsection, we compare the effect of the different preconditioners regardless of the choosing the preconditioned iterative method. Table 2.4.2 shows the number of iterations for PCG with the different preconditioners. Clearly, the preconditioning helped significantly

17 × 17 33 × 33 65 × 65 129 × 129 257 × 257 No preconditioner 21 38 70 128 230 Jacobi 21 38 70 128 230 Symemtric GS 15 25 44 72 122 Symemtric SOR 14 19 27 39 57 Incomplete Cholesky Factorization 13 21 36 61 103

Table 2.2. Comparison of the different preconditioners for Poisson equation

(shown in Figure 2.4). For the 257 × 257 problem, the number of iterations for PCG with out preconditioner (the common CG method) took 230 iterations, symmetric Gauss-Seidel preconditioner, number of iterations drop to 122, and it has a little improvement. And for Cholesky IC(0) factorization preconditioning method, the number of iterations is 103, and with symmetric SOR preconditioner with the optimal parameter, it only took 57 iterations. Therefore, by apply appropriate preconditioner, the convergence rate increase significantly.

Figure 2.4. Comparison of the number of iterations for preconditioners

The running time is another issue for choosing a good preconditioner. Normally, the computation time is proportional to the number of iteration. The less the number of iteration the less computation time is required. However, since the computation cost per iteration is 32 different. For example, PCG with Jacobi method may converge faster than with symmetric SOR preconditioner on the high-performance parallel computing resources. For different situation and matrix properties choose a good preconditioner is very importatent. sIt could significantly reduce the amount of computation time and allow fast convergence. Chapter 3

Multigird Method and Fast Auxiliary Space Preconditioner

In recent decades, multigrid (MG) methods have been well established as one of the most efficient iterative solvers and preconditioners for the linear system (1.1). Moreover, intensive research has been done to analyze the convergence of MG. In particular, it can be proven that the geometric multigrid (GMG) method has linear complexity O(N) in terms of compu- tational and memory complexity for a large class of elliptic boundary value problems where N is the number of degree of freedom. In this chapter, we introduce some basic ideas of the multigrid method from the view of subspace corrections.

3.1 Method of Subspace Correction

In the spirit of dividing and conquering, we decompose the space V as the summation of subspaces and correspondingly decompose the problem (1.1) into sub-problems with smaller size and relatively easy to solve. This method is developed by Xu [19]. Let Vi ⊂ V (for 0 ≤ i ≤ L) be subspaces of V. It consists as a decomposition of V i.e.

L X V = Vi. (3.1) i=0 34

PL This means that, for each v ∈ V, there exists vi ∈ Vi (0 ≤ i ≤ L) such that v = i=1 vi. This representation of v may not be unique in general, namely (3.1) is not necessarily a direct sum. We define the following operators, for i = 1,...,L:

• Qi : V → Vi the projection in the L2 inner product (·, ·);

• Ii : Vi → V the natural inclusion to V;

• Pi : V → Vi the projection in the inner product (·, ·)A;

• Ai : Vi → Vi the restriction of A to the subspace Vi;

−1 • Ri : Vi → Vi an approximation of (Ai) which means the smoother;

• Ti : V → Vi Ti = RiQiA = RiAiPi.

For any u ∈ V and ui, vi ∈ Vi, these operators fulfil the trivial equalities

t (Qiu, vi) = (u, Iivi) = (Ii u, vi),

(AiPiu, vi) = a(u, vi) = (QiAu, vi),

(Aiui, vi) = a(ui, vi) = (Aui, vi) = (QiAIiui, vi).

Qi and Pi are both orthogonal projections and Ai is the restriction of A on Vi and is SPD. T Qi coincides with the natural inclusion Ii and thus sometimes are omitted. The matrix or operator A is understood as the bilinear function on V ×V. Then the restriction on subspaces T is Ai = Ii AIi. It follows from the definition that

AiPi = QiA. (3.2)

This identity is of fundamental importance and will be used frequently in this chapter. A consequence of it is that, if u is the solution of (1.1), then

Aiui = fi (3.3)

with ui = Piu and fi = Qif. This equation may be regarded as the restriction of (1.1) to Vi. 35

Since Vi ⊂ V, we may consider the natural inclusion Ii : Vi 7→ V defined by

Iivi = vi, ∀vi ∈ Vi.

T We notice that Qi = Ii because

T (Qiu, vi) = (u, vi) = (u, Iivi) = (Ii , u, vi).

T T Similarly, we have Pi = Ii , where Ii is the transpose of Ii w.r.t. (·, ·)A.

We note that the solution ui of (3.3) is the best approximation of the solution u of (1.1) in the subspace Vi in the sense that

1 J(ui) = min J(v), with J(v) = (Av, v) − (f, v) v∈Vi 2

and

ku − uikA = min ku − vkA. v∈Vi The subspace equation (3.3) will be in general solved approximately. To describe this, we introduce, for each i, another non-singular operator Ri : Vi 7→ Vi that represents an

approximate inverse of Ai in certain sense. Thus an approximate solution of (3.3) may be

given byu ˆi = Rifi. The consistent notation for the smoother Ri is Bi, the iterator for each local problem. But we reserve the notation B for the iterator of the original problem. −1 Last, let us look at Ti = RiQiA = RiAiPi. When Ri = Ai , from the definition, −1 Ti = Pi = Ai QiA. When Ti|Vi : Vi → Vi, the projection Pi is identity and thus Ti|Vi = RiAi. −1 −1 −1 With a slight abuse of notation, we use Ti = (Ti|Vi ) . The action of Ti and Ti is

−1 −1 (Tiui, ui)A = (RiAiui,Aiui), (Ti u, u)A = (Ri u, u).

3.1.1 Parallel Subspace Correction and Successive Subspace Cor- rection

From the viewpoint of subspace correction, most linear iterative methods can be classified into two major algorithms, namely the parallel subspace correction (PSC) method and the successive subspace correction method (SSC). 36

PSC: Parallel subspace correction. This type of algorithm is similar to Jacobi method. The idea is to correct the residue equation on each subspace in parallel. Let uold be a given approximation of the solution u of (1.1). The accuracy of this approximation can be measured by the residual: rold = f − Auold. If rold = 0 or very small, we are done. Otherwise, we consider the residual equation:

Ae = rold.

Obviously u = uold + e is the solution of (1.1). Instead we solve the restricted equation to each subspace Vi old Aiei = Qir .

old Since we can get QiAe = Qir and we know that QiA = AiPi, it is easy to see

old AiPie = Qir .

Then it is clear that ei = Pie. It should be helpful to notice that the solution ei is the best old possible correction u in the subspace Vi in the sense that

old old 1 J(u + ei) = min J(u + e), with J(v) = (Av, v) − (f, v) e∈Vi 2

and old old ku − (u + ei)kA = min ku − (u + e)kA. e∈Vi As we are only seeking for a correction, we only need to solve this equation approximately using the subspace solver Ri described earlier

old eˆi = RiQir .

An update of the approximation of u is obtained by

J new old X u = u + Iieˆi i=1 which can be written as unew = uold + B(f − Auold), 37

where J J X X t B = IiRiQi = IiRiIi . (3.4) i=1 i=1 We have therefore

Algorithm 9 Parallel subspace correction Given u0 ∈ V; apply the Algorithm 1 with B given in (3.4).

It is well-known that the Jacobi method is not convergent for all SPD problems, hence Algorithm 9 is not always convergent. However the preconditioner obtained from this algo- rithm is of great performance.

Lemma 3.1.1. The operator B given by (3.4) is SPD if each Ri : Vi → Vi is SPD.

Proof. The symmetry of B follows from the symmetry of Ri. Now, for any v ∈ V, we have PJ (Bv, v) = i=1(RiQiv, Qiv) ≥ 0. If (Bv, v) = 0, we then have Qiv = 0 for all i. Let vi ∈ Vi P P P be such that v = i vi, then (v, v) = i(v, vi) = i(Qiv, vi) = 0. Therefore v = 0 and B hence is positive and definite.

SSC: Successive subspace correction. This type of algorithm is similar to the Gauss- Seidel method. To improve the PSC method that makes simultaneous correction, we here make the correction in one subspace at a time by using the most updated approximation of u. More 0 old precisely, starting from v = u and correcting its residule in V1 gives

1 0 t −1 v = v + I1R1I1(f − Av ).

1 By correcting the new approximation v in the next space V2, we get

2 1 t 1 v = v + I2R2I2(f − Av ).

Proceeding this way successively for all Vi leads to

Let Ti = RiQiA. By (3.2), Ti = RiAiPi. Note that Ti : V 7→ Vi is symmetric with respect −1 to A(·, ·) and nonnegative and that Ti = Pi if Ri = Ai . 38

Algorithm 10 Successive subspace correction Given u0 ∈ V; for k = 0, 1, ··· , until convergence do v ← uk; for i = 1 : L do v ← v + RiQi(f − Av); uk+1 ← v.

If u is the exact solution of (1.1), then f = Au. Let vi be the ith iterate (with v0 = uk) from Algorithm 10, we have by definition

i+1 i u − v = (I − Ti)(u − v ), i = 1, ··· , L.

A successive application of this identity yields

k+1 k u − u = EL(u − u ), (3.5)

where

EL = (I − TL)(I − TL−1) ··· (I − T1). (3.6)

Algorithm 10 can also be symmetrized.

Algorithm 11 Symmetric successive subspace correction Given u0 ∈ V; v ← u0; for k = 0, 1,... until convergence do for i = 1 : J and i = J : −1 : 1 do v ← v + RiQi(f − Av)

The advantage of the symmetrized algorithm is that it can be used as a preconditioner. In fact, 11 can be formulated in the Algorithm 1 with operator B defined as follows: For f ∈ V, let Bf = u1 with u1 obtained by 11 applied to (1.1) with u0 = 0. Similar to Young’s SOR method, let us introduce a relaxation method. 39

Algorithm 12 SOR successive subspace correction Input : u0 ∈ V Output: v ∈ V v ← u0; for k = 0, 1,... until convergence do for i = 1 : J do v ← v + ωRiQi(f − Av)

Like the SOR method, that a proper choice of ω can result in an improvement of the convergence rate, but it is not easy to find an optimal ω in general. The above algorithm is essentially the same as Algorithm 10 since we can absorb the relaxation parameter ω into the definition of Ri. Like the colored Gauss-Seidel method, SSC iteration can also be colorized and paral- lelized. Associated with a given partition 3.1, a coloring of the set J = {0, 1, 2,...,L} is a disjoint decomposition:

Lc J = ∪t=1J (t)

such that

PiPj = 0 for any i, j ∈ N (t), i 6= j(1 ≤ t ≤ Lc).

We say that i, j have the same color if they both belong to some J (t). The important property of the coloring is that the SSC iteration can be carried out in parallel in each color.

Algorithm 13 Colored SSC Input : u0 ∈ V Output: v ∈ V v ← u0; for k = 0, 1,... until convergence do

for t = 1 : Jc do P t v ← v + i∈J (t) IiRiIi (f − Av)

We note that the terms under the sum in the above algorithm can be evaluated in parallel (for each t, namely within the same color). 40

3.1.2 Multigrid viewed as Multilevel Subspace Corrections

From the space decomposition point of view, a multigrid algorithm can be viewed as a subspace correction method based on the subspaces defined on a nested sequence of triangu- lations. In this section, we rederive a class of multigrid method by a successive application of the overlapping domain decomposition method. We will also provide a complete convergence analysis using the Xu-Zikatanov identity and the results in 3.1.3. For simplicity, we illustrate the technique by considering the linear finite element method for the Poisson equation.

− ∆u = f, in Ω, and u = 0, on ∂Ω, (3.7) where Ω ⊂ Rd is a polyhedral domain. The weak formulation of (3.7) is as follows: given −1 1 f ∈ H (Ω) find u ∈ H0 (Ω) so that

1 a(u, v) = hf, vi, ∀v ∈ H0 (Ω), (3.8) where Z a(u, v) = (∇u, ∇v) = ∇u · ∇vdx, Ω −1 1 and h·, ·i is the duality pair between H (Ω) and H0 (Ω). 1 By the Poincar´einequality, a(·, ·) is an inner product on H0 (Ω), Thus by the Riesz −1 1 representation theorem, for any f ∈ H (Ω), there exists a unique u ∈ H0 (Ω) such that (3.8) holds. Furthermore, we have the following regularity result. There exists α ∈ (0, 1] which depends on the smoothness of ∂Ω such that

kuk1+α . kfkα−1 (3.9)

This inequality is valid if Ω is convex or ∂Ω is C1,1.

We assume that the triangulation {Tk}, k = 1, ··· ,J is constructed by a successive refinement process. More precisely, Th = TJ for some J > 1, and Tk for k ≤ J are a nested i sequence of quasi-uniform triangulations. i.e. Tk consist of simplexes Tk = {τk} of size hk i such that Ω = ∪iτk for which the quasi-uniformity constants are independent of k (cf. [86]) l i and τk−1 is a union of simplexes of {τk}. We further assume that there is a constant γ < 1, 2k independent of k, such that hk is proportional to γ . We then have a nested sequence of 41 quasi-uniform triangulations

T0 ≤ T1 ≤ · · · ≤ TJ = Th

As an example, in the two dimensional case, a finer grid is obtained by connecting the midpoints of the edges of the triangles of the coarser grid, with T being the given coarsest 1 √ initial triangulation, which is quasi-uniform. In this example, γ = 1/ 2.

Corresponding to each triangulation Tk, a finite element space Vk can be defined by

1 Vk = {v ∈ H0 (Ω) : v|τ ∈ P1(τ), ∀τ ∈ Tk}.

Obviously, the following inclusion relation holds:

V0 ⊂ V1 ⊂ ... ⊂ VJ = V.

We assume that h = hJ is sufficiently small and h1 is of unit size. Note that J = O(| log h|). Naturally we have a macro space decomposition

J X V = Vk k=0

Let Nk be the dimension of Vk, i.e., the number of interior vertices of Tk. The standard nodal PNk basis in Vk will be denoted by ϕk,i, i = 1,,Nk. The micro decomposition is Vk = i=1 Vk,i with Vk,i = span{ϕk,i}. By choosing the right Rk,i, the PSC method on Vk is equivalent to Richardson method or Jacobi method. In summary, we choose the space decomposition:

J J N X X Xk V = Vk = Vk,i (3.10) k=0 k=0 i=1

2 T If we apply PSC to the decomposition (3.10) with Rk,i = hkIk,i, we obtain Ik,iRk,iIk,iv = 2−d 2 (v, ϕk,i)ϕk,i. The resulting operator B, according to (3.4), is the so-called BPX precon- ditioner [16]: J Nk X X 2−d Bv = 2 (v, ϕk,i)ϕk,i. (3.11) k=0 i=1

Let TJ be the finest triangulation in the multilevel structure described earlier with nodes 42

NJ {xi}i=1. With such a triangulation, a natural domain decomposition is

NJ ! ¯ ¯ h [ [ Ω = Ω0 supp φi , i=1

h where φi is the nodal basis function in VJ associated with the node xi and Ω0 , which may be empty, is the region where all functions in VJ vanish. It is easy to see that the corresponding decomposition method without a coarse space is exactly the Gauss-Seidel method which is known to be inefficient (its convergence rate is 2 known to be 1−O(hJ )). The more interesting case is when a coarse space is introduced. The choice of such a coarse space is clear here, namely VJ−1. There remains to choose a solver for

VJ−1. To do this, we may repeat the above process by using the space VJ−2 as a “coarser” space with the supports of the nodal basis function in VJ−1 as a domain decomposition. We

continue in this way until we reach a coarse space V0 where a direct solver can be used. As a result, a multilevel algorithm based on domain decomposition is obtained. This procedure can be illustrated by the following diagram:

VJ ⇒ (GS)J + &

VJ−1 ⇒ (GS)J−1 + &

VJ−2 ⇒ (GS)J−2 +

VJ−3 ... This resulting algorithm is a very basic multigrid method cycle, which may be called the backslash (\) cycle. Interpretation of the multigrid method as a special Gauss-Seidel method: A careful inspection on the multigrid algorithm derived above shows clearly that this algorithm is nothing but the SSC to the decomposition (3.10) with exact subspace problems solvers −1 Rk,i = Ak,i . Apparently the SSC method for this decomposition is nothing but the simple Gauss-Seidel iteration for the following matrix

k `  AMG = a(φi , φj) k,`=1:J,i,j=1:nk 43

This matrix is symmetric and semidefinite matrix and it called extended stiffness matrix by Michael Griebel, see Griebel [87] and Griebel and Oswald [88]. We shall first present this method from a more classic point of view. This more classic approach makes it easier to introduce many different kinds of classic multigrid methods and also make it possible to use more classic approach to analyze the convergence of multigrid methods.

A multigrid process can be viewed as defining a sequence of operators Bk : Vk 7→ Vk which are approximate inverse of Ak in the sense that kI − BkAkkA is bounded away from 1. A typical way of defining such a sequence of operators is the following backslash cycle multigrid procedure.

Algorithm 14 \-cycle MG −1 For k = 0, define B0 = A0 . Assume that Bk−1 : Vk−1 7→ Vk−1 is defined. We shall now define Bk : Vk 7→ Vk which is an iterator for the equation of the form

Akv = g.

Fine grid smoothing: for v0 = 0 and l = 1, 2, ··· , m do

l l−1 l−1 v = v + Rk(g − Akv )

Coarse grid correction: ek−1 ∈ Vk−1 is the approximate solution of the residual equation m Ak−1e = Qk−1(g − Av ) by the iterator Bk−1:

m ek−1 = Bk−1Qk−1(g − Av ).

Define m Bkg = v + ek−1.

After the first step, the residual v − vm is small on high frequencies. In another word, m v − v is smoother and hence it can be very well approximated by a coarse space Vk−1. The second step in the above algorithm plays role of correcting the low frequencies by the coarser space Vk−1 and the coarse grid solver Bk−1 given by induction. 44

With the above defined Bk, we may consider the following simple iteration

k+1 k k u = u + BJ (f − Au ) (3.12)

There are many different ways to make use of Bk, which will be discussed late. Before we now study its convergence, we now discuss briefly the algebraic version of the above algorithm. Let Φk = (φk, ··· , φk ) be the nodal basis vector for the space V , we define the so-called 1 nk k k+1 nk+1×nk prolongation matrix Ik ∈ R as follows

k k+1 k+1 Φ = Φ Ik , (3.13)

Algorithm 15 Matrix version of MG method

−1 nk−1×nk−1 nk nk×nk Let B0 = A0 . Assume that Bk−1 ∈ R is defined; then for η ∈ R , Bk ∈ R is defined as follows: Fine grid smoothing: for ν0 = 0 and l = 1, 2, ··· , m do

l l−1 l−1 ν = ν + Rk(η − Akν )

nk−1 Coarse grid correction: εk−1 ∈ R is the approximate solution of the residual equation t m Ak−1ε = (I) (η − Akν ) by using Bk−1

k t m εk−1 = Bk−1(Ik−1) (η − Akν ).

Define m k Bkη = ν + Ik−1εk−1.

The above algorithm is given in recurrence. But it can also be easily implemented in a non-recursive fashion. V-cycle and W-cycle Two important variants of the above backslash cycle are the so-called V-cycle and W-cycle. A V-cycle algorithm is obtained from the backslash cycle by performing more smoothings after the coarse grid corrections. Such an algorithm, roughly speaking, is like a backslash 45

(\) cycle plus a slash (/) (a reversed backslash) cycle. The detailed algorithm is given as follows.

Algorithm 16 V-cycle MG −1 For k = 0, define B0 = A0 . Assume that Bk−1 : Vk−1 7→ Vk−1 is defined. We shall now

define Bk : Vk 7→ Vk which is an iterator for the equation of the form

Akv = g.

1. pre-smoothing: For v0 = 0 and l = 1, 2, ··· , m

l l−1 l−1 v = v + Rk(g − Akv )

2. Coarse grid correction: ek−1 ∈ Vk−1 is the approximate solution of the residual equation m Ak−1e = Qk−1(g − Av ) by the iterator Bk−1:

m ek−1 = Bk−1Qk−1(g − Av ).

m+1 m 3. post-smoothing: For v = v + ek−1 and l = m + 2, 2, ··· , 2m

l l−1 l−1 v = v + Rk(g − Akv )

3.1.3 Convergence Analysis

The analysis of additive multilevel operator relies on the following identity which is well known in the literature [89, 19, 88, 90]. For completeness, we include a concise proof taken from [90].

Theorem 3.1.2 (Identity for PSC). If Ri is SPD on Vi for i = 0, ··· ,L, then B defined by (3.4) is also SPD on V. Furthermore

L −1 X −1 (B v, v) = inf (Ri vi, vi), (3.14) PL v =v i=0 i i=0 46

and −1 −1 λmin(BA) = sup inf (Ri vi, vi). (3.15) PL kvkA=1 i=0 vi=v Proof. Note that B is symmetric and

L L X T X (Bv, v) = ( IiRiIi v, v) = RiQiv, Qiv), i=0 i=0 hence B is invertible and thus SPD. We now prove (3.14) by constructing a decomposition −1 achieving the infimum. letv ˜i = RiQiB v, i = 0, ··· ,L. By the definition of B, we have a P decomposition v = i v˜i. And it satisfies

L L X −1 X −1 inf (Ri vi, vi) = inf (Ri (˜vi + wi), v˜i + wi) PL v =v PL w =0 i=0 i i=0 i=0 i i=0 L " L L # X −1 X −1 X −1 = (Ri v˜i, v˜i) + inf 2 (Ri v˜i, wi) + (Ri wi, wi) . PL w =0 i=0 i=0 i i=0 i=0

Since L L L L X −1 X −1 X −1 −1 X (Ri v˜i, ui) = (QiB v, ui) = (B v, ui) = (B v, ui), i=0 i=0 i=0 i=0 for any ui ∈ Vi, i = 0, ··· ,L, we deduce

L L " L L # X −1 −1 X −1 X X −1 inf (Ri vi, vi) = (B v, v˜i) + inf 2(B v, wi) + (Ri wi, wi) PL v =v PL w =0 i=0 i i=0 i=0 i=0 i i=0 i=0 L −1 X −1 = (B v, v) + inf (Ri wi, wi) PL w =0 i=0 i i=0 = (B−1v, v).

The proof of the equality (3.15)

−1 −1 −1 −1 −1 ((BA) v, v)A (B v, v) λmin(BA) = λmax((BA) ) = sup = sup 2 v∈V\{0} (v, v)A v∈V\{0} kvkA −1 −1 = sup (B v, v) = sup inf (Ri vi, vi). PL kvkA=1 kvkA=1 i=0 vi=v 47

As for additive methods, we now present an identity developed by Xu and Zikatanov [90] −1 for multiplicative methods. For simplicity, we focus on the case Ri = Ai , i = 0, ..., L, i.e., T the subspace solvers are exact. In this case I − IiRiIi A = I − Pi.

Theorem 3.1.3 (X-Z Identity for SSC). The following identity is valid

L Y 1 k (I − P )k2 = 1 − , (3.16) i A 1 + c i=0 0

where L L X X 2 c0 = sup inf kPi vjkA. (3.17) PL v =v kvkA=1 i=0 i i=0 j=i+1

For SSC method with general smoothers, we assume that each subspace smoother Ri

induces a convergent iteration, i.e. the error operator I − Ti is a contraction. (T) Contraction of Subspace Error Operator: There exists ρ < 1 such that

kI − TikAi ≤ ρ for all i = 1, . . . , L.

∗ We associate with Ti the adjoint operator Ti with respect to the inner product (·, ·)A. To

deal with general non-symmetric smoothers Ri, we introduce the symmetrization of Ti:

¯ ∗ ∗ Ti = Ti + Ti − Ti Ti, i = 0, ··· , L.

We use a simplified version of XZ identity given by Cho, Xu, and Zikatanov [91].

Theorem 3.1.4 (General X-Z Identity of SSC). if assumption (T) is valid, then the follow- ing identity holds L 2 Y 1 (I − T ) = 1 − , i K i=0 A where L X ¯−1 ∗ ∗ K = sup inf (Ti (vi + Ti wi), vi + Ti wi)A, PL v =v kvkA=1 i=0 i i=0 P with wi = j>i vj.

We now present a convergence analysis based on assumption (T) and two other assump- tions. 48

(A1) Stable Decomposition: For any v ∈ V, there exists a decomposition

L L X X v = v , v ∈ V , i = 1, . . . , L, such that kv k2 ≤ K kvk2 . (3.18) i i i i Ai 1 A i=1 i=1

(A2) Strengthened Cauchy-Schwarz (SCS) Inequality: For any ui, vi ∈ Vi, i = 1,...,L L L L !1/2 L !1/2

X X X 2 X 2 (ui, vj)A ≤ K2 kuikA kvjkA . i=1 j=i+1 i=1 j=1 The convergence theory of the method is as follows.

PL Theorem 3.1.5. Let V = i=1 Vi be a decomposition satisfying assumptions (A1) and

(A2), and let the subspace smoothers Ri satisfy (T). Then

L 2 Y 1 − ρ2 (I − I R Q A) ≤ 1 − . i i i 2 2 2K1(1 + (1 + ρ) K2 ) i=1 A

The proof can be found from [22, 21], which is simplified by using the XZ identity [90]. We shall give an upper bound of the constant K in Theorem 3.1.4 by choosing the stable decomposition satisfying (3.18). For the optimality of the BPX preconditioner, we are to prove that the condition number κ(BA) is uniformly bounded and thus PCG using BPX preconditioner converges in a fixed

number of steps for a given tolerance regardless of the mesh size. The estimate λmin(BA) & 1 follows from the stability of the subspace decomposition. The first result is on the macro PJ decomposition V = k=0 Vk.

Lemma 3.1.6 (Stability of macro decomposition). For any v ∈ V, there exists a decompo- PJ sition v = k=0 vk with vk ∈ Vk, k = 0, ··· ,J such that

J X −2 2 2 hk kvkk . |v|1. (3.19) k=0

Proof. Following the chronological development, we present two proofs. The first one uses full regularity and the second one minimal regularity. Full regularity H2: If we assume α = 1 in (3.9), which holds for convex polygons or 49

polyhedrons. Let P−1 = 0, we have the following decomposition

J X v = vk. (3.20) k=0

2 where vk = (Pk − Pk−1)v ∈ Vk. The full regularity assumption leads to the L error estimate of Pk via a standard duality argument:

1 k(I − Pk)vk . hk|(I − Pk)v|1, ∀v ∈ H0 (Ω).

Since Vk−1 ⊂ Vk, we have Pk−1Pk = Pk−1 and

Pk − Pk−1 = (I − Pk−1)(Pk − Pk−1).

So we have

J J X −2 2 X −2 2 hk kvkk = hk k(Pk − Pk−1)vk k=0 k=0 J X −2 2 = hk k(I − Pk−1)(Pk − Pk−1)vk k=0 J X 2 . |(Pk − Pk−1)v|1 k=0 2 = |v|1,Ω.

In the last step, we have used the fact (Pk − Pk−1)v is the orthogonal decomposition in the A-inner product. Minimal regularity H1: If we relax the H2-regularity, we can use decomposition by L2 projections J J X X v = vk = (Qk − Qk−1)v, k=0 k=0 2 where Qk : Vk → Vk−1 is the L projection onto Vk and vk = (Qk − Qk−1)v. A simple proof of nearly optimal stability of (3.23) proceeds as follows. Invoking approximability and 50

1 2 H -stability of the L -projection Qk on quasi-uniform grids, we infer that

k(Qk − Qk−1)vk = k(I − Qk−1)Qkvk . hk|Qkv|1 . hk|v|1.

Therefore, J J X −2 2 X −2 2 2 2 hk kvkk = hk k(Qk − Qk−1)vk . J|u|1 . | log h||u|1. k=0 k=0 The factor | log h| in the estimate can be removed by a more careful analysis based on the theory of Besov spaces and interpolation spaces. The following crucial inequality can be found, for example, in [19, 92, 93, 94, 95]:

J X −2 2 2 hk kvkk . |u|1. k=0

This completes the proof.

We next state the stability of the micro decomposition. For a finite element space V with N 2 nodal basis {ϕi}i=1, let Qϕi be the L -projection to the one dimensional subspace spanned by ϕi. We have the following norm equivalence which says the nodal decomposition is stable in L2. The proof is classical in the finite element analysis and thus omitted here.

Lemma 3.1.7 (Stability of micro decomposition). For any u ∈ V over a quasi-uniform mesh T , we have the norm equivalence

N 2 = X 2 kuk ∼ kQϕi uk . (3.21) i=1

Theorem 3.1.8. For any v ∈ V , there exists a decomposition of v of the form

J N X Xk v = vk,i, vk,i ∈ Vk,i = span{ϕk,i}, i = 1, ··· ,Nk, k = 0, ··· , J, (3.22) k=0 i=1

such that J Nk X X 2 2 2 hkkvk,ik . |v|1. (3.23) k=0 i=1

Consequently λmin(BA) & 1 for the BPX preconditioner B defined in (3.11). 51

Proof. Since

!−1 (ABAv, v) (Av, v) (B−1v, v) λmin(BA) = inf = inf −1 = sup . v∈V\{0} (Av, v) v∈V\{0} (B v, v) v∈V\{0} (Av, v)

Combining Lemma 3.1.6 and 3.1.7, it is easy to prove (3.23) for (3.11).

To estimate λmax(BA), we first present a strengthened Cauchy-Schwarz (SCS) inequality for the macro decomposition.

Lemma 3.1.9 (Strengthened Cauchy-Schwarz Inequality (SCS)). For any ui ∈ Vi, vj ∈ Vj, j > i, we have j−i −1 (ui, vj)A . γ |ui|1hj kvjk0,

2i where γ < 1 is a constant such that hi ∼= γ .

Proof. Let us first prove the inequality on one element τ ∈ Ti. Using integration by parts, Cauchy-Schwarz inequality, trace theorem, and inverse inequality, we have Z ∂ui (ui, vj)A,τ = (∇ui, ∇vj)τ = vjds ∂τ ∂n 1 1 − 2 − 2 . k∇uik0,∂τ kvjk0,∂τ . hi k∇uik0,τ hj kvjk0,τ 1 hj  2 −1 j−i −1 . |ui|1,τ hj kvjk0,τ ∼= γ |ui|1,τ hj kvjk0,τ . hi

Adding over τ ∈ Ti and using Cauchy-Schwarz inequality, we have

X X j−i −1 (ui, vj)A = (ui, vj)A,τ . γ |ui|1,τ hj kvjk0,τ τ∈Ti τ∈Ti 1 1 j−i −1 X 2  2  X 2  2 . γ hj |ui|1,τ kvjk0,τ τ∈Ti τ∈Ti j−i −1 = γ hj |ui|1kvjk0

which is the asserted estimate.

We also needs the following estimates for our main results 52

Lemma 3.1.10 (Auxiliary estimate). Given γ < 1, we have

N N N X |j−i| 2  X  X  N γ x y ≤ x 2 y 2 , ∀x , y ∈ , i = 1, ··· , N, j = 1, ··· ,N. i j 1 − γ i i i j R i,j=1 i=1 i=1

N×N |j−i| Proof. Let Γ ∈ R be the matrix such that Γi,j = γ . The spectral radius ρ(Γ) satisfies

N X |j−i| 2 ρ(Γ) ≤ kΓk1 = max γ ≤ . 1≤j≤N 1 − γ i=1

N N x and y denotes (xi)i=1 and (yi)i=1. By Cauchy-Schwarz inequality,

N X |j−i| γ xiyj = (Γx, y) ≤ ρ(Γ)kxk2kyk2. i,j=1 which is the desired estimate.

Now we can have our estimates for the largest eigenvalue of BA.

Theorem 3.1.11. For any v ∈ V, we have

J X −2 2 (Av, v) . inf hk kvkk . PJ v =v k=0 k k=0

So, λmax . 1 for the BPX preconditioner B defined in (3.11).

PJ Proof. For v ∈ V, let v = k=0 vk, vk ∈ Vk, k = 0, ··· ,J be an arbitrary decomposition. By the SCS inequality of Lemma 3.1.9,

J J J J J X X X X X j−k −1 (∇v, ∇v) = 2 (∇vk, ∇vj) + (∇vk, ∇vk) . γ |uk|1hj kvjk0. k=0 j=k+1 k=0 k=0 j=k

−1 Combining Lemma 3.1.10 with the inverse inequality |vk|1 . hk kvkk0, we obtain

J 1 J 1 J  X 2 2  X −2 2 2 X −2 2 (∇v, ∇v) . |vk|1 hk kvkk0 . hk kvkk0. k=0 k=0 k=0

which is the assertion. 53

Finally we can conclude the optimality of the BPX preconditioner defined in (3.11).

Theorem 3.1.12 (Optimality of BPX Preconditioner). For the preconditioner B defined in (3.11), we have κ(BA) . 1.

Next, we prove the uniform convergence of V-cycle multigrid, namely SSC applied to the decomposition (3.10) with exact subspace solvers.

Lemma 3.1.13 (Nodal Decomposition). Let T be a quasi-uniform triangulation with N nodal basis ϕi. For the nodal decomposition

N X v = vi, vi = v(xi)ϕi, i=1 we have N N X X 2 −2 2 |Pi vj|1 . h kvk . i=1 j>i

Proof. For every 1 < i < N, we define the index set Li = {j ∈ N : i < j ≤ N, supp ϕj ∩ supp ϕi 6= ∅} andω ˜i = ∪j∈Li supp ϕj. Since T is quasi-uniform, the numbers of integers in each Li is uniformly bounded. So we have

N N N N N N X X X X X X |P v |2 = |P v |2 |v |2 i j 1,Ω i j 1,Ω . j 1,ω˜i i=1 j>i i=1 j∈Li i=1 j∈Li N N X X |v |2 h−2kv k2 . i 1,ω˜i . i i 0,ω˜i i=1 i=1 where we have used an inverse inequality in the last step. Since the nodal basis decomposition is stable in the L2-inner product (Lemma 3.1.7), we deduce the desired estimate.

Lemma 3.1.14. The following inequality holds for all v ∈ V:

J J X 2 X −2 2 |(Pk − Qk)v|1,Ω . hk k(Qk − Qk−1)vk0,Ω. (3.24) k=0 k=0 54

PJ Proof. Since (I − Qk)v = j=k+1(Qj − Qj−1)v, we have

J J X 2 X |(Pk − Qk)v|1,Ω = ((Pk − Qk)v, (Pk − Qk)v)A k=0 k=0 J X = ((Pk − Qk)v, (I − Qk)v)A k=0 J J X X = ((Pk − Qk)v, (Qj − Qj−1)v)A. k=0 j=k+1

Applying the strengthened Cauchy-Schwarz inequality (Lemma 3.1.9) yields

J J 1 J 1 X 2  X 2  2  X 2  2 |(Pk − Qk)v|1,Ω . |(Pk − Qk)v|1,Ω hk2 k(Qj − Qj−1)vk0,Ω . k=0 k=1 k=0

The desired result then follows.

Theorem 3.1.15 (Optimality of \-cycle Multigrid). The \-cycle multigrid method, using −1 SSC applied to the decomposition (3.10) with exact subspace solvers Ri = Ai , converges uniformly.

Proof. We use the following macro decomposition

J X v = vk, vk = (Qk − Qk−1)v, k=0

with the nodal decomposition

N Xk vk = vk,i, vk,i = vk(xi)ϕk,i, i=1

For simplicity, let us use a lexicographical ordering: we say (`, j) (k, i) if ` = k and j ≥ i, or ` > k. It is easy to see that

N J N N X Xk X X` Xk v`,j = vk,j + v`,j = vk,j + (v − vk) (`,j) (k,i) j=i+1 `=k+1 j=1 j=i+1 55

In view of the expression (3.16), it follows that

J Nk X X X 2 |Pk,i v`,j|1,Ω k=0 i=1 (`,j)) (k,i)

J Nk Nk ! X X 2 X 2 ≤ |Pk,i(v − Pkv)|1,Ω + |Pk,i vk,j|1,Ω k=0 i=1 j=i+1 J N N X Xk Xk ≤ | v |2 k,j 1,Ωk,i k=0 i=1 j=i+1 J n X Xk |v |2 . k,i 1,Ωk,i k=0 i=1 J X −2 2 . hk kvkk0,Ω k=0 . |v|1,Ω.

This means that the constant c0 defined by (3.16) is bounded above by a constant that is independent of any mesh parameters. Consequently, by Theorem 3.1.3, the multigrid method converges uniformly with respect to any mesh parameters.

The convergence result discussed above has been for quasi-uniform grids, but we would like to remark that uniform convergence can also be obtained for locally refined grids, For details, we refer to [16, 96, 97, 98]. The proof of Theorem 3.1.15 hinges on Theorem 3.1.3 (X-Z identity), which in turn −1 −1 requires exact solvers Ri = Ai and makes Pi = Ai QiA the key operator to appear in −1 (3.16). If the smoothers Ri are not exact, namely Ri 6= Ai , then the key operator becomes

Ti = RiQiA and Theorem 3.1.3 must be replaced by Theorem 3.1.4. We refer to [91] for details.

3.2 The Auxiliary Space Method

In previous sections, we study multilevel methods formulated over a hierarchy of quasi- uniform or graded meshes. The geometric structure of these meshes is essential for both the design and analysis of such methods. Unfortunately, many grids in practice are not hierarchical. We use the term unstructured grids to refer those grids that do not possess 56 much geometric or topological structure. The design and analysis of efficient multilevel solvers for unstructured grids is a topic of great theoretical and practical interest. The method of subspace correction consists of solving a system of equations in a vector space by solving on appropriately chosen subspaces of the original space. Such subspaces are, however, not always available. The auxiliary space method, developed in [99, 47], is for designing preconditioners using auxiliary spaces which are not necessarily subspaces of the original subspace. Here, the original space is the finite element space V for the given grid T and the pre- J conditioner is the multigrid method on the sequence (V`)`=1 of FE spaces for the auxiliary J grids (T`)`=1. The idea of the method is to generate an auxiliary space V with inner producta ˜(·, ·) = ˜ hA·, ·i and energy norm k · kA˜. Between the spaces there is a suitable linear transfer operator Π: V 7→ V, which is continuous and surjective. Πt : V 7→ V is the dual operator of Π in the default inner products

hΠtu, v˜i = hu, Π˜vi, for all u ∈ V, v˜ ∈ V.

In order to solve the linear system Ax = b, we require a preconditioner B defined by

B := S + ΠB˜Πt, (3.25) where S is the smoother and B˜ is the preconditioner of A˜. The estimate of the condition number κ(BA) is given below.

Theorem 3.2.1. Assume that there are nonnegative constants c0, c1, and cs, such that

1. the smoother S is bounded in the sense that

kvkA ≤ cskvkS−1 ∀v ∈ V, (3.26)

2. the transfer operator Π is bounded,

kΠwkA ≤ c1kwkA˜ ∀w ∈ V, (3.27) 57

3. the transfer is stable, i.e. for all v ∈ V there exists v0 ∈ V and w ∈ V such that

2 2 2 2 v = v0 + Πw and kv0kS−1 + kwkA˜ ≤ c0kvkA, and (3.28)

4. the preconditioner B˜ on the auxiliary space is optimal, i.e. for any v˜ ∈ V , there exists

m1 > m0 > 0, such that

2 ˜ ˜ 2 m0kv˜kA˜ ≤ (BAv,˜ v˜)A˜ ≤ m1kv˜kA˜. (3.29)

Then, the condition number of the preconditioned system defined by (3.25) can be bounded by 2 1 2 2 m1 2 κ(BA) ≤ max{1, }c0(cs + c1). (3.30) m0 m0

Proof. Firstly, we estimate the upper bound of (BAv, v)A and consequently λmax(BA). By the definition of B from (3.25), we have

˜ t (BAv, v)A = (SAv, v)A + (ΠBΠ Av, v)A.

By Cauchy-Schwarz inequality, (3.26), and (3.29), we can control the first term as

1/2 1/2 1/2 (SAv, v)A ≤ (SAv, SAv)A (v, v)A ≤ cs(SAv, SAv)S−1 (v, v)A ≤ cs(SAv, Av) (v, v)A.

2 which leads to (SAv, v)A ≤ cs(v, v)A. Similarly by Cauchy-Schwarz inequality and (3.27), we have

˜ t ˜ ˜ ˜−1 t ˜−1 t (ΠBΠ Av, v)A = (BAA Π Av, A Π Av)A˜ ˜−1 t ˜−1 t ≤ m1(A Π Av, A Π Av)A˜ ˜−1 t = m1(ΠA Π Av, v)A ˜−1 t ˜−1 t 1/2 1/2 ≤ m1(ΠA Π Av, ΠA Π Av)A (v, v)A ≤ m c (A˜−1ΠtAv, A˜−1ΠtAv)1/2(v, v)1/2 1 1 A˜ A m ≤ 1 c (B˜A˜A˜−1ΠtAv, A˜−1ΠtAv)1/2(v, v)1/2 1/2 1 A˜ A m0 m1 ˜−1 t 1/2 1/2 = 1/2 c1(ΠB Π Av, v)A (v, v)A , m0 58

m2 ˜ t 1 2 which leads to (ΠBΠ Av, v)A ≤ 2 c1(v, v)A. Hence, the upper bound is m0

2 (BAv, v)A 2 m1 2 λmax ≤ sup ≤ (cs + c1). v∈V\{0} (v, v)A m0

Next, we estimate the lower bound of λmin. We choose v = v0 + Πw satisfying (3.28). Then,

˜−1 t (v, v)A = (v0 + Πw, v)A = (v0, SAv)S−1 + (w, A Π Av)A˜ 1 1 2 ˜−1 t t 2 ≤ ((v0, v0)S−1 + (w, w)A˜) ((SAv, Av) + (A Π Av)A˜, Π Av)) 1 ˜−1 t t  2 ≤ c0kvkA (SAv, Av) + (A Π Av, Π Av) 1 ˜−1 t ˜−1 t  2 = c0kvkA (SAv, Av) + (A Π Av, A Π Av)A˜ 1 1 ˜ ˜ ˜−1 t ˜−1 t  2 ≤ c0kvkA (SAv, Av) + (BAA Π Av, A Π Av)A˜ m0 1 1 ˜ t  2 ≤ c0kvkA (SAv, Av) + (ΠBΠ Av, v)A m0 1 1 2 ≤ c0 max{1, 1/2 }kvkA(BAv, v)A, m0

1 2 which leads to (BAv, v)A max{1, }c . Combining the estimate of λmin and λmax, we & m0 0 can have the desired conclusion.

We also can combine the smoother S and the auxiliary grid correction multiplicatively with a preconditioner B in the form [100, 101]

T I − BcoA = (I − S A)(I − BA)(I − SA) (3.31)

which leads to Algorithm 17. The combined preconditioner, under suitable scaling assump- tions performs no worse than its components.

Theorem 3.2.2. [100] Suppose there exists ρ ∈ [0, 1) such that for all v ∈ V , we have

2 2 k(I − SA)vkA ≤ ρkvkA, 59

Algorithm 17 Multiplicative Auxiliary Space Iteration Step Input : uk Output: uk+1 Given S, B; uk,0 := uk; (1) uk,1 := uk,0 − S(b − Auk,0); (2) uk,2 := uk,1 − B(b − Auk,1); (3) uk+1 := uk,2 − ST (b − Auk,2);

then the multiplicative preconditioner Bco yields the bound

(1 − m1)(1 − ρ) + m1 κ(BcoA) ≤ , (3.32) (1 − m0)(1 − ρ) + m0 for the condition number, and

κ(BcoA) ≤ κ(BA). (3.33)

According to Theorem 3.2.1 and Theorem 3.2.2, our goal is to construct an auxiliary space V in which we are able to define an efficient preconditioner. The preconditioner will be the geometric multigrid method on a suitably chosen hierarchy of auxiliary grids. Additionally, the space has to be close enough to V so that the transfer from V to V fulfils (3.27) and

(3.28). This goal is achieved by a density fitting of the finest auxiliary grid TJ to T . In J order to prove (3.29), we use the multigrid theory for the auxiliary grids {T`}`=1 from the viewpoint of the method of subspace corrections.

3.3 Algebraic Multigrid Method

The AMG method was popularized by Ruge and St¨uben [26, 27, 28]. It has grown even more popular in the past several years and is still under development. Basically, AMG involves three key steps including:

1. Setup phase:

(a) Selecting coarse grid nodes or generating aggregates, (b) building interpolation and restriction operators, (c) Generating coarse grid matrix. 60

2. Solve phase:

(a) Choosing multigrid cycles, (b) Applying pre-smoothing, (c) Coarse grid correction, (d) Applying post-smoothing.

The various forms of AMG that have been proposed involve various combinations of these key steps. In what follows we give a brief summary of several popular choices.

3.3.1 Classical AMG

In Ruge and St¨uben’s basic AMG, coarsening is based on the selection of strong correlated unknowns [26, 27, 28]. The strong or weak coupling between unknowns are decided based on the coefficients in the matrix. Each fine node is chosen such that it will be strongly connected to at least one coarse node. The interpolation operator is an approximation to the fine node value based on all the coarse nodes it is strongly connected. The choice of the AMG smoother varies. Usually classical smoothers like Gauss-Seidel are chosen, as in St¨uben’s classical AMG. There are also alternatives like: ILU as the smoother such as [102] proposed by Saad et al., sparse inverse proposed by Br¨oker et al. [103], and polynomial smoothing based on a matrix-vector product [104]. Algebraic multigrid methods are designed for the solution of the sparse linear systems (1.1) by using multigrid principles. Normally, there is no concept of how to solve the systems efficiently by AMG. However, we can gain some results under the assumption

N X A is symmetric and Aij ≤ 0(j 6= i), Aij = 0. (3.34) j=1

Definition 1. Node i and j are connected if Aij 6= 0. Neighbor of i(Ni) is all points which are connected to i. j is strongly connected to i, if

−Aij ≥ θ max −Aik. k6=i 61

Definition 2. Assume e is a smooth error, when e satisfies that for any i,

2 X −Aij (ei − ej) < . A e2 j ii i

Then, 1 X X (−A (e − e )2) <  A e2. 2 ij i j ii i i,j i Lemma 3.3.1. If e is a smooth error, and

j ∈ Si = {j : −Aij ≥ θ max(−Aik)}, k6=i then, 2 2 (ei − ej) < Cei , where C depends on θ

PN Proof. Since j=1 Aij = 0, we have

X Aii = −Aij j6=i

≤ |Ni| max(−Aik). k6=i

Therefore, for all j ∈ Si, we have

2 X −Aij (ei − ej)  > A e2 j ii i 2 X −Aij (ei − ej) = 2 Aii ei j∈Si 2 X −Aij maxk6=i −Aik (ei − ej) = 2 maxk6=i −Aik Aii ei j∈Si 2 X (ei − ej) ≥ θ|Ni| 2 ei j∈Si 2 (ei − ej) ≥ θ|Ni| 2 . ei

This completes the prove. 62

Define

d(i, j) := −aij/si, X d(i, P ) := d(i, j), j∈P

$i(α) := {j ∈ Ni : d(i, j) ≤ α}, T $i (α) := {j : i ∈ $j(α)}.

I We denote the set of points j to be used in interpolation to an F-point i by $i . This will be a subset of C ∩ $i. we can assume that after relaxation:

m m X m m X m m aii ei + aij ej + aij ej = 0. (i ∈ F ). (3.35) i/∈F mj6=i j∈Cm

In order to make an improvement in interpolation, we will suppose that C has been I construted so that it satisfies the following criterion, which also implicitly defines $i : I (CG1) : C should be chosen such that for each point i in F, there is a set $i ⊆ C ∩ $i I such that every point j ∈ $i is either in $i , or strongly depends on it. The error at the point j can be approximately expressed as a weighted average of the I errors at i and points in $i as following:

X X I ej ≈ (ajiei + ajkek)/(aji + ajk)(j ∈ Ni − $i ). (3.36) I I k∈$i k∈$i

Then we have the interpolation formula:

 m+1 m m vi , (i ∈ C ), vi := (3.37) P m m −( I (aij + cij)v )/(aii + cii)(i ∈ F ), j∈$i j

where X X I cij := aikakj/[aki + akl](j ∈ $i ∪ {i}). (3.38) I I k∈ /$i k6=i l∈$i Strength of connection of a point i to a point j is measured against the largest connection 63

of i, which is denoted by si in the following:

si := max{−aik : k 6= i}.

3.3.2 UA-AMG

The coarsening in the UA-AMG method is performed by simply aggregating the unknowns, and the prolongation and restriction are Boolean matrices that characterize the aggregates. It has been shown that under certain conditions the Galerkin-type coarse-level matrix has a controllable sparsity pattern [105]. These properties make the UA-AMG method more suitable for parallel computation than are traditional AMG methods, such as the classical AMG and the smoothed aggregation AMG, especially on GPUs. Recent works also show that UA-AMG method is both theoretically well founded and practically efficient, if we form the aggregates in an appropriate way and apply certain enhanced cycles, such as the AMLI-cycle or the Nonlinear AMLI-cycle (K-cycle) [106, 107, 108].

nk×nk Given the k-th-level matrix Ak ∈ R , in the UA-AMG method we define the prolon- k gation matrix Pk−1 from a non-overlapping partition of the nk unknowns at level k into the k nk−1 nonempty disjoint sets Gj , j = 1, . . . , nk−1, which are referred to as aggregates. There are many different approaches to generating the aggregates. However, in most of existing approaches, the aggregates are chosen one by one, i.e. sequentially. And, standard parallel aggregation algorithms are variants of the parallel maximal independent set algorithm, which choose the points of the independent sets based on the given random numbers for all points. Therefore, the quality of the aggregates cannot be guaranteed. If the aggregates have been formed and are ready to use. Once the aggregates are k constructed, the prolongation Pk−1 is an nk × nk−1 matrix given by  k k 1 if i ∈ Gj (Pk−1)ij = i = 1, . . . , nk, j = 1, . . . , nk−1. (3.39) 0 otherwise

k With such a piecewise constant prolongation Pk−1, the Galerkin-type coarse-level matrix

nk−1×nk−1 Ak−1 ∈ R is defined by

k t k Ak−1 = (Pk−1) Ak(Pk−1). (3.40) 64

Note that the entries in the coarse-grid matrix Ak−1 can be obtained from a simple summation process: X X (Ak−1)ij = akl, i, j = 1, 2, ··· , nk−1. (3.41)

k∈Gi l∈Gj The enhanced cycles are often used for the UA-AMG method in order to achieve uniform convergence for model problems such as the Poisson problem [107, 109, 108] and M-matrix [106]. In this paper, we use the nonlinear AMLI-cycle, also known as the variable AMLI-cycle or the K-cycle because it is parameter-free. The nonlinear AMLI-cycle uses the Krylov sub- space iterative method to accelerate the coarse-grid correction. Therefore, for the complete- ness of our presentations, let us first recall the nonlinear preconditioned conjugate gradient (PCG) method originated in [110]. Algorithm 18 is a simplified version designed to address symmetric positive definite (SPD) problems. The original version was meant for more gen-

ˆ nk nk eral cases including nonsymmetric and possibly indefinite matrices. Let Bk[·]: R → R be a given nonlinear preconditioner intended to approximate the inverse of Ak. We now formulate the nonlinear PCG method that can be used to provide an iterated approximate ˆ inverse to Ak based on the given nonlinear operator Bk[·]. This procedure gives another

˜ nk nk nonlinear operator Bk[·]: R → R , which can be viewed as an improved approximation of the inverse of Ak. In practice, we choose the number of cycles n of Nonlinear PCG as 2.

Algorithm 18 Nonlinear PCG Method ˜ ˆ Bk[f] =NPCG (A, f, Bk[·], n) ˆ u0 = 0, r0 = f and p0 = Bk[r0] for i = 1, 2, . . . , n do

(Bˆ[ri−1],ri−1) αi = ; (Api−1,pi−1) ui = ui−1 + αipi−1;

ri = ri−1 − αiApi−1; ˆ βj = − (B[ri],Apj ) , j = 0, 2, . . . , i − 1 i (Apj ,pj ) ˆ Pi−1 j pi = B[ri] + j=0 βi pj; ˜ Bk[f] := un;

We also need to introduce a smoothing operator in order to define the multigrid method. In general, all smoothers, such as the Jacobi smoother and the Gauss-Seidel (GS) smoother, can be used here. In this paper, we use parallel smoothers which are based on auxiliary grid. 65

If we already have a smoother. Using Algorithm 18, we can recursively define the nonlinear −1 AMLI-cycle MG method as an approximation of Ak (see Algorithm 19).

Algorithm 19 Nonlinear AMLI-cycle MG ˆ Bk[f] =NAMLIcycle [uk,Ak, fk, k] if k == 1 then −1 u1 = A1 f1; else uk = smoothing(uk,Ak, fk, k);// Pre-smoothing k T fk−1 = (Pk−1) (fk − Akuk);// Restriction

ek−1 = NPCG(Ak−1, fk−1, NAMLIcycle(0,Ak−1, ·, k − 1), nk−1),// Coarse grid correc- tion k uk = uk + Pk−1ek−1;// Prolongation

uk = smoothing(uk,Ak, fk, k);// Post-smoothing ˆ Bk[f] = uk;

k Our parallel AMG method is mainly based on Algorithm 19 with prolongation Pk−1 and

coarse-grid matrix Ak−1 defined by (3.39) and (3.40) respectively. The main idea of our new parallel AMG method is to use an auxiliary structured grid to (1) efficiently construct the aggregates in parallel; (2) simplify the construction of the prolongation and the coarse-grid matrices; (3) develop a robust and effective colored smoother; and (4) better control the sparsity pattern of the coarse-grid matrices and working load balance. Chapter 4

FASP for Poisson-like Problem on unstructured grid

Finite element methods are often used for solving the elliptic PDEs because of the flexibility especially on the unstructured grids. However, the general unstructured grids do not offer a suitable framework for the multigrid method due to the difficulties of constructing the grid hierarchy. One way to overcome this difficulty is applying a refinement of the given unstructured grids, such as PLTMG [111], KASKADE [112], and HHG [113]. The other way is focused on the development of the AMG method. Although there are lots of good progress in this field, AMG methods are lacking theoretical justifications. In this chapter, the multigrid algorithm on the unstructured grids is explained. The algorithm is based on the theory of auxiliary space preconditioning. The auxiliary structured grid hierarchy is generated. Then, a geometric multigrid method can be applied together with a smoothing on the original grid by using the auxiliary space preconditioning technique. In Section 4.1, we discuss some basic assumptions on the given triangulation. An abstract analysis is provided based on four assumptions. In Section 4.2 we introduce the detailed construction of the structured auxiliary space by an auxiliary cluster tree and an improved treatment of the boundary region for Neumann boundary conditions. In 4.3, we describe the auxiliary space multgrid preconditioner (ASMG) and estimate the condition number by verifying the assumptions of the abstract theory. 67

4.1 Preliminaries and Assumptions

The triangulation is assumed to be conforming and shape-regular in the sense that the ratio of the circumcircle and inscribed circle is bounded uniformly [114], and it is a K-mesh in the sense that the ratio of diameters between neighboring elements is bounded uniformly.

All elements τi are assumed to be shape-regular but not necessarily quasi-uniform, so the diameters can vary globally and allow a strong local refinement.

The vertices of the triangulation are denoted by (pj)j∈J . Some of the vertices are Dirichlet

nodes, JD ⊂ J , where we impose essential Dirichlet boundary conditions, and some are

Neumann vertices, JN ⊂ J , where we impose natural Neumann boundary conditions.

For each of the simplex τi ∈ T , define the barycenter

Nτ 1 Xi ξi := pk(τi), Nτ i k=1

d where pk(τi) ∈ R is the vertex of τi and Nτi is the number of vertices of τi, as shown in Figure 4.1.

Definition 3. We denote the minimal distance between the triangle barycenters of the grid by

h := min kξi − ξjk2. i,j∈I The diameter of Ω is denoted by

H := max kx − yk2. x,y∈Ω

In order to prove the desired nearly linear complexity estimate, we have to assume that the refinement level of the grid is algebraically bounded in the following sense.

q Assumption 1. We assume that H/h ∼= N for a small number q, e.g. q = 2.

The above assumption allows an algebraic grading towards a point but it forbids a ge- ometric grading. The assumption is sufficient but not necessary: the construction of the auxiliary grids might still be of complexity O(N log N) or less if Assumption 1 is not valid, but it would require more technical assumptions in order to prove this. 68

Ω h τi ξi

Figure 4.1. Left: The 2D triangulation T of Ω with elements τi. Right: The barycenters ξi (dots) and the minimal distance h between barycenters. 4.2 Construction of the Auxiliary Grid-hierarchy

In this section, we explain how to generate a hierarchy of auxiliary grids based on the given (unstructured) grid T . The idea is to analyse and split the element barycenters by their geometric position regardless of the initial grid structure. Our aim is to obtain a structured hierarchy of grids that preserves some properties of the initial grid, e.g. the local mesh size. A similar idea has already been applied in [53, 115, 38].

4.2.1 Clustering and Auxiliary Box-trees

In order to give the construction algorithm, we need to introduce the definition of cluster.

Definition 4 (Cluster). A cluster t is a subset of I. If t is a cluster, the corresponding S subdomain of Ω is Ωt := i∈t τi.

The clusters are collected in a hierarchical cluster tree TT .

Definition 5 (Cluster Tree). A tree TT is a cluster tree if the following conditions are satisfied.

1. The nodes in TT are clusters.

2. The root of TT is I.

3. The leaves of TT are denoted by L(TT ) and the tree hierarchy is given by a father/son

relation: For each interior node t ∈ TT \L(TT ), the set of sons of t, sons(t), is a subset 69

of TT \{t} such that [˙ t = s s∈sons(t) holds. Vice versa, the father of any s ∈ sons(t) is t.

The standard (geometrically regular) construction of the cluster tree TT is as follows. For the initial step, we choose a (minimal) hypercube of the domain Ω:

1 B := [a1, b1) × [a2, b2) × · · · [ad, bd) ⊃ Ω where

ai = min xi, bi = max xi, for i = 1, 2, ··· , d, (x1, x2, ··· , xd) ∈ Ω. (4.1)

Define the level of B1 to be g(B1) = 1. Then we regularly subdivide B1 to 2d childrens 2 2 2 2 2 {Bi }. When d = 2, the four children B1, B2, B3, B4 can be defined as

2 0 0 2 0 0 B2 = [a1, b1) × [a2, b2), B3 = [a1, b1) × [a2, b2), 2 0 0 2 0 0 B1 = [a1, b1) × [a2, b2), B4 = [a1, b1) × [a2, b2),

0 0 0 0 where a1 = b1 := (a1 + b1)/2 and a2 = b2 := (a2 + b2)/2. 2 2 1 d The level of Bi is g(Bi ) = g(B ) + 1 = 2, where i = 1, 2, ··· , 2 . Finally, we apply ` the same subdivision process recursively and define the level of the boxes Bi recursively 1 Pd (i−1)k (cf. Figure 4.3). This yields an infinite tree Tbox with root B . For any i = i=1 2 ti, k k 0 ≤ ti ≤ 2 − 1, we can denote the subregion Ωi on level k as

b − a b − a b − a b − a Ωk := (a +t 1 1 , a +(t +1) 1 1 )×· · ·×(a +t d d , a +(t +1) d d ). (4.2) i 1 1 2k 1 1 2k d d 2k d d 2k

Figure 4.2 gives two examples of the 2D region quadtrees on a unit square domain and a circle domain. ` Letting Bj denote a box in this tree, we can define the cluster t, which is a subset of I, by ` ` ` tj := t(Bj) := {i ∈ I | ξi ∈ Bj}.

1 This yields an infinite cluster tree with root t(B ). We construct a finite cluster tree TI by

not subdividing nodes which are below a minimal cardinality nmin, e.g. nmin := 3. Define ` ` the nodes which have no child nodes as the leaf nodes. The cardinality #tj = #t(Bj) is the 70

Figure 4.2. Examples of the region quadtree on different domains.

` number of the barycenters in Bj. Leaves of the cluster tree contain at most nmin indices. For any leaf node, its parent node contains at least 4 barycenters, then the total number of leaf nodes is bounded by the number of barycenters N.

B

Figure 4.3. Tree of regular boxes with root B1 in 2D. The black dots mark the corresponding barycenters ξi of the triangles τi. Boxes with less than three points ξi are leaves.

Remark 4.2.1. The size of a leaf box Bj can be much larger than the size of simplex τj that intersect with Bj since a large box Bj may only intersect with one very small element and will not be further subdivided.

Lemma 4.2.2. Suppose Assumption 1 holds, the complexity for the construction of TI is O(qN log N).

Proof. First, we estimate the depth of the cluster tree. Let t = t(Bν) ∈ TI be a node of the cluster tree and #t > nmin. By definition the distance between two nodes ξi, ξj ∈ t is at least

kξi − ξjk2 ≥ h. 71

Therefore, the box Bν has a diameter of at least h. After each subdivision step the diameter

of the boxes is exactly halved. Let ` denote the number of subdivisions after which Bν was created. Then −` 1 diam(Bν) = 2 diam(B ).

Consequently, we obtain √ −` 1 −` h ≤ diam(Bν) = 2 diam(B ) ≤ 2 2H

so that by Assumption 1, ` . log(H/h) ∼= q log N.

Therefore the depth of TI is in O(q log N).

Next, we estimate the complexity for the construction of TI . The subdivision of a single

node t ∈ TI and corresponding box Bν is of complexity #t. On each level of the tree TI , the nodes are disjoint, so that the subdivision of all nodes on one level is of complexity at most O(N). For all levels this sums up to at most O(qN log N).

Remark 4.2.3. The boxes used in the clustering can be replaced by arbitrary shaped elements, e.g. triangles/tetrahedra or anisotropic elements — depending on the application or operator at hand. For ease of presentation we restrict ourselves to the case of boxes.

Remark 4.2.4. The complexity of the construction can also be bounded from below by O(N log N), as is the case for a uniform (structured) grid. However, this complexity arises only in the construction step and this step will typically be of negligible complexity.

For each cluster $t_\nu \subset T$ we have an associated box $B_\nu$. The depth of the cluster tree is $p := \mathrm{depth}(T)$. Notice that the tree of boxes is not yet the regular grid that we need for the multigrid method; a further refinement, as well as deletion of elements, is necessary.

4.2.2 Closure of the Auxiliary Box-tree

The hierarchy of box-meshes from Figure 4.3 is exactly what we want to construct: each box has at most one hanging node per edge, i.e., the refinement levels of two neighbouring boxes differ by at most one. In general this is not fulfilled. We construct the grid hierarchy of nested uniform meshes starting from a coarse mesh $\sigma^{(0)}$ consisting of only a single box $B^1 = [a_1, b_1) \times [a_2, b_2) \times \cdots \times [a_d, b_d)$, the root of the

box tree. All boxes in the meshes $\sigma^{(1)}, \ldots, \sigma^{(J)}$ to be constructed will either correspond to a cluster $t$ in the cluster tree or will be created by refinement of a box that corresponds to a leaf of the cluster tree. Let $\ell \in \{1, \ldots, J\}$ be a level that is already constructed (the trivial start $\ell = 1$ of the induction is given above).

Part I: Mark elements for refinement

We mark all elements of the mesh which are then refined regularly. Let $B_\nu^\ell$ be an arbitrary box in $\sigma^{(\ell)}$. The box $B_\nu^\ell$ corresponds to a cluster $t_\nu = t(B_\nu^\ell) \in T_I$. The following two situations can occur:

1. (Mark) If $\#t_\nu > n_{\min}$, then $B_\nu^\ell$ is marked for refinement.

2. (Retain) If $\#t_\nu \le n_{\min}$, e.g. $t_\nu = \emptyset$, then $B_\nu^\ell$ is not marked in this step.

Figure 4.4. The subdivision of the marked (red) box on level $\ell$ would create two boxes (blue) with more than one hanging node at one edge.

After processing all boxes on level $\ell$, it may occur that there are boxes on level $\ell - 1$ that would have more than one hanging node on an edge after refinement of the marked boxes, cf. Figure 4.4. Since we want to avoid this, we have to perform a closure operation for all such elements and for all coarser levels $\ell - 1, \ldots, 1$.

Part II: Closure and Refinement

3. (Close) Let $L^{(\ell-1)}$ be the set of all boxes on level $\ell - 1$ having too many hanging nodes. All of these are marked for refinement. By construction a single refinement of each box is sufficient. However, a refinement on level $\ell - 1$ might then produce too many hanging nodes in a box on level $\ell - 2$. Therefore, we have to form the lists $L^{(j)}$, $j = \ell - 1, \ldots, 1$, of boxes with too many hanging nodes successively on all levels and mark those elements.

Figure 4.5. The subdivision of the red box makes it necessary to subdivide nodes on all levels.

4. (Refine) At last we refine all boxes (on all levels) that are marked for refinement.

The result of the closure operation is depicted in Figure 4.5. Each of the boxes in the closed grids lives on a unique level $\ell \in \{1, \ldots, J\}$. It is important that a box is either refined regularly (split into four successors on the next level) or not refined at all. For each box that is marked in step 1, at most $O(\log N)$ boxes are marked during the closure step 3. A sketch of the mark/close/refine sweep is given below.
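The following is a schematic sketch of one mark/close/refine sweep, assuming `levels[l]` lists the boxes of level `l` and that the hanging-node test and the regular box refinement are supplied as helpers; `too_many_hanging_nodes` and `refine_regularly` are hypothetical names, not from the text.

```python
# A schematic sketch of steps 1-4 (mark, close, refine) on one level.
def mark_close_refine(levels, ell, n_min, too_many_hanging_nodes):
    # Part I: mark boxes on level ell whose cluster is still too large
    marked = {b for b in levels[ell] if len(b.cluster) > n_min}
    # Part II (close): sweep levels ell-1, ..., 1 and additionally mark
    # every box that would keep more than one hanging node per edge
    for j in range(ell - 1, 0, -1):
        marked |= {b for b in levels[j]
                   if b not in marked and too_many_hanging_nodes(b, marked)}
    # Part II (refine): every marked box is split regularly into 2^d children
    for b in marked:
        b.refine_regularly()
```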

Lemma 4.2.5. The construction and storage of the (finite) box tree with boxes $B_\nu^\ell$ and the corresponding cluster tree $T_I$ with clusters $t_\nu = t(B_\nu^\ell)$ is of complexity $O(N \log N)$, where $N$ is the number of barycenters, i.e., the number of triangles in the triangulation $\tau$.

Proof. For level $\ell$ of the tree, let $n_\ell$ be the number of leaf boxes and $m_\ell$ the number of boxes which have child boxes. Accordingly, the total number of boxes on level $\ell$ is $n_\ell + m_\ell$. By definition,
$$n_\ell + m_\ell = 2^d m_{\ell-1},$$

where $\ell \ge 2$ and $n_1 + m_1 = m_1 = 1$. Since $\sum_{\ell=1}^{J} n_\ell \lesssim N$, we have

$$N \gtrsim \sum_{\ell=1}^{J} n_\ell = \sum_{\ell=2}^{J} (2^d m_{\ell-1} - m_\ell) = (2^d - 1) \sum_{\ell=1}^{J} m_\ell + 1.$$

As a result,
$$\sum_{\ell=1}^{J} m_\ell \lesssim N.$$
The total work for generating the tree is $\sum_{\ell=1}^{J} (n_\ell + m_\ell) \lesssim N$. Given $\ell$, let $\alpha_\ell$ denote the number of boxes in $L^{(\ell-1)}$ (the set of boxes that have more than one hanging node). Since every box in $L^{(\ell-1)}$ has to be a leaf box, we have $\alpha_\ell \le n_\ell$. As

the process of closing each hanging node passes through at most $d$ boxes on any given level, the total number of marked boxes in this closure process is bounded by

$$\sum_{\ell=1}^{J} dJ\alpha_\ell \lesssim JN \lesssim N \log N.$$

4.2.3 Construction of a Conforming Auxiliary Grid Hierarchy

Finally, we create a hierarchy of nested conforming triangulations by subdividing the boxes and by discarding those hypercubes that lie outside the domain $\Omega$.

If d = 2, given any box Bν there can be at most one hanging node per edge. The possible situations and corresponding local closure operations are presented in Figure 4.6. The closure

Figure 4.6. Hanging nodes can be treated by a local subdivision within the box Bν. The top row shows a box with 1, 2, 2, 3, 4 hanging nodes, respectively, and the bottom row shows the corre- sponding triangulation of the box.

operation introduces new elements on the next finer level. The final hierarchy of triangular grids σ(1), . . . , σ(J) is nested and conforming without hanging nodes. All triangles have a minimum angle of 45 degrees, i.e., they are shape-regular, cf. Figure 4.7. The triangles in the quasi-regular meshes σ(1), . . . , σ(J) have the following properties:

1. All triangles in $\sigma^{(1)}, \ldots, \sigma^{(J)}$ that have children which are themselves further subdivided are refined regularly (four congruent successors), cf. Figure 4.8.

2. Each triangle $\sigma_i \in \sigma^{(j)}$ that is subdivided but not regularly refined has successors $\sigma_i'$ that will not be further subdivided (cf. Figure 4.9).


Figure 4.7. The final hierarchy of nested grids. Red edges were introduced in the last (local) closure step.


Figure 4.8. Case 1: σi is subdivided in the fine level

The hierarchy of grids constructed so far covers on each level the whole box $B$. This hierarchy now has to be adapted to the boundary of the given domain $\Omega$. In order to explain the construction, we consider the domain $\Omega$ and triangulation $T$ (5837 triangles) from Figure 4.10. The triangulation consists of shape-regular elements, it is locally refined, it contains many small inclusions, and the boundary $\Gamma$ of the domain $\Omega$ is rather complicated. If $d = 3$, there is at most one hanging node per edge and one inner hanging node per face for any cube $B_\nu$. If there is no hanging node on any edge or face of $B_\nu$, we can divide the cube regularly into 6 tetrahedra. If there are hanging nodes on the edges or faces, the local closure operation has two steps: we first perform the closure operations for all faces, and then we connect the triangles on each face to the center of the cube to form the tetrahedra. This closure operation also generates the final hierarchy of tetrahedral grids $\sigma^{(1)}, \ldots, \sigma^{(J)}$, which are nested and conforming without hanging nodes. Figure 4.11 shows the different types of possible situations and the corresponding local closure operations.


Figure 4.9. Case 2: $\sigma_i$ is not subdivided on the finer level.

Figure 4.10. A triangulation of the Baltic sea with local refinement and small inclusions.

4.2.4 Adaptation of the Auxiliary Grids to the Boundary

The Dirichlet boundary: On the Dirichlet boundary we want to satisfy homogeneous boundary conditions (b.c.), i.e., $u|_\Gamma = 0$ (non-homogeneous b.c. can trivially be transformed to homogeneous ones). On the given fine triangulation $\tau$ this is achieved by using basis functions that fulfil the b.c. Since the auxiliary triangulations $\sigma^{(1)}, \ldots, \sigma^{(J)}$ do not necessarily resolve the boundary, we have to use a slight modification.

Definition 6 (Dirichlet auxiliary grids). We define the auxiliary triangulations $T_\ell^D$ by

$$T_\ell^D := \{\tau \in \sigma^{(\ell)} \mid \tau \subset \Omega\}, \qquad \ell = 1, \ldots, J.$$

Figure 4.11. Hanging nodes can be treated by a local subdivision within the cube Bν. Firstly erasing the hanging nodes on the face and then connecting the center of the cube.


Figure 4.12. The boundary Γ of Ω is drawn as a red line, boxes non-intersecting Ω are light green, boxes intersecting Γ are dark green, and all other boxes (inside of Ω) are blue.

In Figure 4.12 the Dirichlet auxiliary grids are formed by the blue boxes. All other elements (light green and dark green) are not used for the Dirichlet problem. On an auxiliary grid we impose homogeneous Dirichlet b.c. on the boundary

$$\Gamma_\ell^D := \partial \Omega_\ell^D, \qquad \Omega_\ell^D := \bigcup_{\tau \in T_\ell^D} \bar{\tau}.$$

The auxiliary grids are still nested, but the area covered by the triangles grows with increasing level number:
$$\Omega_1^D \subset \cdots \subset \Omega_J^D \subset \Omega, \qquad \Omega_\ell^D := \bigcup \{\tau \in T_\ell^D\}.$$

The Neumann boundary: On the Neumann boundary we want to satisfy natural (Neumann) b.c., i.e., $\partial_n u|_\Gamma = 0$. For the auxiliary triangulations $\sigma^{(1)}, \ldots, \sigma^{(J)}$, we will approximate the true b.c. by the natural b.c. on an auxiliary boundary.

Definition 7 (Neumann auxiliary grids). Define the auxiliary triangulations $T_1^N, \ldots, T_J^N$ by

$$T_\ell^N := \{\tau \in \sigma^{(\ell)} \mid \tau \cap \Omega \neq \emptyset\}, \qquad \ell = 1, \ldots, J.$$


Figure 4.13. The boundary Γ of Ω is drawn as a red line, boxes non-intersecting Ω are light green, and all other boxes (intersecting Ω) are blue.

In Figure 4.13 the Neumann auxiliary grids are formed by the blue boxes. All other elements (light green) are not used for the Neumann problem. On an auxiliary grid we impose natural Neumann b.c. These auxiliary grids are non-nested; the area covered by the triangles grows with decreasing level number:

$$\Omega \subset \Omega_J^N \subset \cdots \subset \Omega_1^N, \qquad \Omega_\ell^N := \bigcup \{\tau \in T_\ell^N\}.$$

Remark 4.2.6 (Mixed Dirichlet/Neumann b.c.). For the definition of the grids for mixed boundary conditions of Dirichlet (on $\Gamma_D$) and Neumann type, we use the grids

$$T_\ell^M := \{\tau \in \sigma^{(\ell)} \mid \tau \cap \Omega \neq \emptyset \text{ and } \tau \cap \Gamma_D = \emptyset\}, \qquad \ell = 1, \ldots, J.$$

The b.c. on the auxiliary grid are of Neumann type, except for neighbours of boxes with $\sigma \cap \Gamma_D \neq \emptyset$, where essential Dirichlet b.c. are imposed. A sketch of these element selections is given below.
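The following is a minimal sketch of the element selections in Definitions 6 and 7 and Remark 4.2.6, assuming each element exposes the geometric predicates used below; the helper names (`inside`, `intersects`) are hypothetical, not from the text.

```python
# Minimal sketch: select the elements of sigma^(l) for each boundary treatment.
def dirichlet_grid(sigma_l, omega):
    # Definition 6: keep only elements fully inside Omega
    return [tau for tau in sigma_l if tau.inside(omega)]

def neumann_grid(sigma_l, omega):
    # Definition 7: keep every element that intersects Omega
    return [tau for tau in sigma_l if tau.intersects(omega)]

def mixed_grid(sigma_l, omega, gamma_d):
    # Remark 4.2.6: Neumann-type selection minus elements touching Gamma_D
    return [tau for tau in sigma_l
            if tau.intersects(omega) and not tau.intersects(gamma_d)]
```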

Figure 4.14. The finest auxiliary grid σ(10) contains elements of different size. Left: Dirichlet b.c. (852 degrees of freedom), right: Neumann b.c. (2100 degrees of freedom)

4.2.5 Near Boundary Correction

Since the boundaries of different levels do not coincide, the near boundary error cannot be reduced very well by the standard multigrid method for the Neumann boundary condition.

Therefore we introduce a near-boundary region $\Omega^{(\ell,j)}$ where a correction for the boundary approximation will be performed. The near-boundary region is defined in layers around the boundary $\Gamma_\ell$:

Definition 8 (Near-boundary region). We define the $j$-th near-boundary region $T^{(\ell,j)}$ on level $\ell$ of the auxiliary grids by

$$T^{(\ell,0)} := \{\tau \in T_\ell \mid \mathrm{dist}(\Gamma_\ell, \tau) = 0\}, \qquad T^{(\ell,i)} := \{\tau \in T_\ell \mid \mathrm{dist}(T^{(\ell,i-1)}, \tau) = 0\}, \quad i = 1, \ldots, j.$$

The idea for solving the linear system on level $\ell$ is to perform a near-boundary correction after the coarse grid correction. The error introduced by the coarse grid correction is eliminated by solving the subsystem for the degrees of freedom in the near-boundary region $T^{(\ell,j)}$. The extra computational complexity is $O(N)$ because only the elements which are close to the boundary are considered.

Definition 9 (Partition of degrees of freedom). Let $J_\ell$ denote the index set for the degrees of freedom on the auxiliary grid $T_\ell$. We define the near-boundary degrees of freedom by

$$J_{\ell,j} := \{i \in J_\ell \mid i \text{ belongs to an element } \tau \in T^{(\ell,j)}\}.$$

Let $(u)_{i \in J_\ell}$ be a coefficient vector on level $\ell$ of the auxiliary grids. Then we extend the standard coarse grid correction by the solve step

$$r^\ell := f^\ell - A^\ell u^\ell, \qquad u^\ell|_{J_{\ell,j}} := (A^\ell|_{J_{\ell,j} \times J_{\ell,j}})^{-1} r^\ell|_{J_{\ell,j}}.$$

The small system $A^\ell|_{J_{\ell,j}}$ of near-boundary elements is solved by an $\mathcal{H}$-matrix solver, cf. [116].
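As a minimal sketch of this correction step, assume a SciPy sparse matrix `A_l`, vectors `f_l` and `u_l`, and an integer array `near` holding the near-boundary DoFs $J_{\ell,j}$; a sparse direct solve stands in here for the $\mathcal{H}$-matrix solver of [116], and the update is applied additively as in Algorithm 20 below.

```python
# Minimal sketch of the near-boundary correction (direct solve in place of
# the H-matrix solver; names are illustrative).
import numpy as np
from scipy.sparse.linalg import spsolve

def near_boundary_correction(A_l, f_l, u_l, near):
    r = f_l - A_l @ u_l                   # residual r^l = f^l - A^l u^l
    A_nb = A_l[near][:, near].tocsc()     # A^l restricted to J_{l,j}
    u_l[near] += spsolve(A_nb, r[near])   # local solve on the near-boundary block
    return u_l
```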

4.3 Estimate of the Condition Number

In this section, we investigate and analyze the new algorithm by verifying the assumptions of the theorem of the auxiliary grid method. Based on the auxiliary hierarchy constructed in Section 4.2, we can define the auxiliary space preconditioners (3.25) and (3.31) as follows. Let the auxiliary space be $\tilde{V} = V_J$ and let $\tilde{A}$ be generated from (3.8). Since we already have the hierarchy of grids $\{V_\ell\}_{\ell=1}^J$, we can apply MG on the auxiliary space $V_J$ as the preconditioner $\tilde{B}$. On the space $V$, we can apply a traditional smoother $S$, e.g. Richardson, Jacobi, or Gauß-Seidel. For the stiffness matrix $A = D - L - U$ (diagonal, lower and upper triangular parts), the matrix representation of the Jacobi iteration is $S = D^{-1}$, and for the Gauß-Seidel iteration it is $S = (D - L)^{-1}$. (More generally, one could use any smoother that features the spectral equivalence $\|v\|_{S^{-1}}^2 \simeq \|h^{-1} v\|_{L^2(\Omega)}^2$.)

The auxiliary grid may be over-refined: it can happen that an element $\tau_i \in T$ intersects much smaller auxiliary elements $\tau_j^J \in T_J$:

$$\tau_i \cap \tau_j^J \neq \emptyset, \quad h_{\tau_j^J} \lesssim h_{\tau_i} \ \text{ but } \ h_{\tau_j^J} \not\simeq h_{\tau_i}. \tag{4.3}$$

Algorithm 20 Auxiliary Space MultiGrid
For $\ell = 0$, define $B_0 = A_0^{-1}$. Assume that $B_{\ell-1} : V_{\ell-1} \to V_{\ell-1}$ is defined. We shall now define $B_\ell : V_\ell \to V_\ell$, which is an iterator for an equation of the form

$$A_\ell u = f.$$

Pre-smoothing: For $u^0 = 0$ and $k = 1, 2, \cdots, \nu$,

$$u^k = u^{k-1} + R_\ell (f - A_\ell u^{k-1}).$$

Coarse grid correction: $e_{\ell-1} \in V_{\ell-1}$ is the approximate solution of the residual equation $A_{\ell-1} e = Q_{\ell-1}(f - A_\ell u^\nu)$ obtained by the iterator $B_{\ell-1}$:

$$u^{\nu+1} = u^\nu + e_{\ell-1} = u^\nu + B_{\ell-1} Q_{\ell-1} (f - A_\ell u^\nu).$$

Near boundary correction:

$$u^{\nu+2} = u^{\nu+1} + u^\ell|_{J_{\ell,j}} = u^{\nu+1} + (A^\ell|_{J_{\ell,j} \times J_{\ell,j}})^{-1} (f - A_\ell u^{\nu+1})|_{J_{\ell,j}}.$$

Post-smoothing: For $k = \nu + 3, \cdots, 2\nu + 3$,

$$u^k = u^{k-1} + R_\ell (f - A_\ell u^{k-1}).$$
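The following is a schematic sketch of one cycle of Algorithm 20 in matrix form, assuming per-level sparse operators `A[l]`, smoothers `R[l]`, restrictions `Q[l]`, prolongations `P[l]` (needed in matrix form even though the spaces are nested), and near-boundary index sets `J[l]`; all container names are illustrative, and the smoothing counts are simplified.

```python
# Schematic sketch of one Auxiliary Space MultiGrid cycle on level l.
import numpy as np
from scipy.sparse.linalg import spsolve

def asmg_cycle(l, f, A, R, Q, P, J, nu):
    if l == 0:
        return spsolve(A[0].tocsc(), f)          # B_0 = A_0^{-1}
    u = np.zeros_like(f)
    for _ in range(nu):                          # pre-smoothing
        u += R[l] @ (f - A[l] @ u)
    r_c = Q[l - 1] @ (f - A[l] @ u)              # restricted residual
    u += P[l - 1] @ asmg_cycle(l - 1, r_c, A, R, Q, P, J, nu)
    near = J[l]                                  # near-boundary correction
    r = f - A[l] @ u
    u[near] += spsolve(A[l][near][:, near].tocsc(), r[near])
    for _ in range(nu):                          # post-smoothing
        u += R[l] @ (f - A[l] @ u)
    return u
```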

In this case, we do not have the local approximation and stability properties for the standard nodal interpolation operator. Therefore, we need a stronger interpolation between the original space and the auxiliary space. This is accomplished by the Scott-Zhang quasi-interpolation operator $\Pi : H^1(\Omega) \to V$ for a triangulation $T$ [117]. Let $\{\psi_i\}$ be an $L^2$-dual basis to the nodal basis $\{\varphi_i\}$. We define the interpolation operator as

$$\Pi v(x) := \sum_{i \in J} \varphi_i(x) \int_\Omega \psi_i(\xi) v(\xi) \, d\xi.$$

By definition, $\Pi$ preserves piecewise linear functions and satisfies (cf. [117]), for all $v \in H^1(\Omega)$,

$$|\Pi v|_{1,\Omega}^2 + \sum_{\tau \in T} h_\tau^{-2} \|v - \Pi v\|_{0,\tau}^2 \lesssim |v|_{1,\Omega}^2. \tag{4.4}$$

We define the new interpolation between the auxiliary space and $V$ by the Scott-Zhang interpolation $\Pi : V_J \to V$ and the reverse interpolation $\tilde{\Pi} : V \to V_J$.

Then we can apply Theorem 3.2.1 with $\tilde{V} = V_J$. In order to estimate the condition number, we need to verify that the multigrid preconditioner $\tilde{B}$ on the auxiliary space is bounded, and that the finest auxiliary grid and the corresponding FE space yield stable and bounded transfer and smoothing operators.

4.3.1 Convergence of the MG on the Auxiliary Grids

Firstly, we prove the convergence of the multigrid method on the auxiliary space. For the Dirichlet boundary, we have the nestedness
$$\Omega_1^D \subset \cdots \subset \Omega_J^D \subset \Omega,$$
which induces the nestedness of the finite element spaces defined on the auxiliary grids $T_\ell^D$, $\ell = 1, \cdots, J$:

$$V_1 \subset V_2 \subset \cdots \subset V_J.$$

In order to avoid overloading the notation, we will skip the superscript $D$ in the following. In order to prove the convergence of the local multilevel methods by Theorem 3.1.5, we only need to verify the assumptions for the decomposition

$$V_J = \sum_{\ell=1}^{J} \sum_{k \in \tilde{N}_\ell} V_{\ell,k}, \tag{4.5}$$
where
$$\tilde{N}_\ell = \{k \in J_\ell \mid k \in J_\ell \setminus J_{\ell-1} \text{ or } \varphi_{k,\ell} \neq \varphi_{k,\ell-1}\}.$$

Since $T_\ell \subset \sigma^{(\ell)}$ is a local refinement of $T_{\ell-1}$, the sizes of the triangles in $T_\ell$ may differ. We denote by $\bar{T}_\ell$ a refinement of the grid $T_\ell$ in which all elements are regularly refined such that all elements of $\bar{T}_\ell$ are congruent to the smallest element of $T_\ell$. The finite element spaces corresponding to $\bar{T}_\ell$ are denoted by $\bar{V}_\ell$. In the triangulations $\bar{T}_\ell$ we have

$$\tau \in \bar{T}_\ell \ \Rightarrow \ h_\tau \sim 2^{-\ell}.$$

For an element $\tau \in \bar{T}_\ell$ we denote by $g_\tau$ the level number of the triangulation $T_{g_\tau}$ to which $\tau$ belongs, i.e. $h_\tau \sim 2^{-g_\tau}$. For any vertex $p_i$, if $i \in J_\ell$ but $i \notin J_{\ell-1}$, we define $g_{p_i} = \ell$. The following properties about the generation of elements and vertices hold [22, 21]:

$$\tau \in \bar{T}_\ell \ \text{if and only if} \ g_\tau = \ell; \qquad i \in J_\ell \ \text{if and only if} \ g_{p_i} \le \ell; \qquad \text{for } \tau \in \bar{T}_\ell, \ \max_{i \in J(\tau)} g_{p_i} = \ell = g_\tau,$$

where $J(\tau)$ is the set of vertices of $\tau \in \bar{T}_\ell$. With the space decomposition (4.5), we can verify the assumptions of Theorem 3.1.5.

4.3.1.1 Stable decomposition: Proof of (A1)

The purpose of this subsection is to prove that the decomposition is stable.

Theorem 4.3.1. For any $v \in V$, there exist functions $v_i^\ell \in V_{\ell,i}$, $i \in \tilde{N}_\ell$, $\ell = 1, \ldots, J$, such that
$$v = \sum_{\ell=1}^{J} \sum_{i \in \tilde{N}_\ell} v_i^\ell \quad \text{and} \quad \sum_{\ell=1}^{J} \sum_{i \in \tilde{N}_\ell} \|v_i^\ell\|_A^2 \lesssim \log(N) \|v\|_A^2. \tag{4.6}$$
Proof. Following the argument of [22, 21], we define the Scott-Zhang interpolation between

different levels: $\Pi_\ell : V_{\ell+1} \to V_\ell$, $\Pi_J : V_J \to V_J$, and $\Pi_0 : V_1 \to 0$. By the definition, we can define the decomposition as

$$v = \sum_{\ell=1}^{J} v^\ell, \qquad v^\ell = (\Pi_\ell - \Pi_{\ell-1}) v \in V_\ell.$$

Assume $v^\ell = \sum_{i \in J_\ell} \xi_{\ell,i} \varphi_i^\ell$, where $v_i^\ell = \xi_{\ell,i} \varphi_i^\ell \in V_{\ell,i}$. Then,

$$\|v_i^\ell\|_0^2 = \|v_i^\ell\|_{0,\omega_i^\ell}^2 \lesssim \sum_{\tau \in \omega_i^\ell} h_\tau^d |v^\ell(p_i)|^2 \lesssim \|v^\ell\|_{0,\omega_i^\ell}^2 = \|(\Pi_\ell - \Pi_{\ell-1}) v\|_{0,\omega_i^\ell}^2,$$

where $\omega_i^\ell$ is the support of $\varphi_i^\ell$ with center vertex $p_i$.

By the inverse inequality, we can conclude

$$\sum_{i \in J_\ell} \|v_i^\ell\|_A^2 \lesssim \sum_{i \in J_\ell} \sum_{\tau \in \omega_i^\ell} h_\tau^{-2} \|v_i^\ell\|_{0,\tau}^2 \lesssim \sum_{\tau \in T_\ell} h_\tau^{-2} \|(\Pi_\ell - \Pi_{\ell-1}) v\|_{0,\tau}^2.$$

Invoking the approximability and stability properties and following the same argument as in Lemma 4.3.6, we have
$$\sum_{\tau \in T_\ell} h_\tau^{-2} \|v - \Pi_\ell v\|_{0,\tau}^2 \lesssim |v|_{1,\Omega_{\ell+1}}^2 \quad \text{and} \quad \|\Pi_\ell v\|_{0,\Omega_\ell}^2 \lesssim \|v\|_{0,\Omega_{\ell+1}}^2.$$
So,

$$\begin{aligned} \sum_{\tau \in T_\ell} h_\tau^{-2} \|(\Pi_\ell - \Pi_{\ell-1}) v\|_{0,\tau}^2 &= \sum_{\tau \in T_\ell} h_\tau^{-2} \|\Pi_\ell (I - \Pi_{\ell-1}) v\|_{0,\tau}^2 \\ &\lesssim \sum_{\tau \in T_\ell} h_\tau^{-2} \|(I - \Pi_{\ell-1}) v\|_{0,\omega_\tau^\ell}^2 \lesssim \sum_{\tau \in T_\ell} h_\tau^{-2} \|(I - \Pi_{\ell-1}) v\|_{0,\tilde{\omega}_\tau^\ell}^2 \\ &\lesssim \sum_{\tau \in T_{\ell-1}} h_\tau^{-2} \|(I - \Pi_{\ell-1}) v\|_{0,\tau}^2 \lesssim |v|_1^2, \end{aligned}$$

where $\omega_\tau^\ell$ is the union of the elements in $T_\ell$ that intersect $\tau \in T_\ell$, and $\tilde{\omega}_\tau^\ell$ is the union of the elements in $T_{\ell-1}$ that intersect $\omega_\tau^\ell$. Therefore,

$$\sum_{\ell=1}^{J} \sum_{i \in \tilde{N}_\ell} \|v_i^\ell\|_A^2 \lesssim \sum_{\ell=1}^{J} \sum_{\tau \in T_\ell} h_\tau^{-2} \|(\Pi_\ell - \Pi_{\ell-1}) v\|_{0,\tau}^2 \lesssim J |v|_1^2 \lesssim \log(N) |v|_1^2.$$

4.3.1.2 Strengthened Cauchy-Schwarz inequality: Proof of (A2)

In this subsection, we establish the strengthened Cauchy-Schwarz inequality for the space

decomposition (4.5). Assume there is an ordered index set $\Lambda = \{\alpha \mid \alpha = (\ell_\alpha, k_\alpha),\ k_\alpha \in \tilde{N}_\ell,\ \ell_\alpha = 1, \cdots, J\}$. Define the ordering as follows: for any $\alpha, \beta \in \Lambda$, if $\ell_\alpha > \ell_\beta$, or $\ell_\alpha = \ell_\beta$ and $k_\alpha > k_\beta$, then $\alpha > \beta$. The strengthened Cauchy-Schwarz inequality is given as follows.

Theorem 4.3.2. For any $u_\alpha = v_k^\ell \in V_\alpha = V_{\ell,k}$ and $v_\beta = v_j^m \in V_\beta = V_{m,j}$, with $\alpha = (\ell, k), \beta = (m, j) \in \Lambda$, we have
$$\sum_{\alpha \in \Lambda} \sum_{\beta \in \Lambda, \beta > \alpha} (u_\alpha, v_\beta)_A \lesssim \Big( \sum_{\alpha \in \Lambda} \|u_\alpha\|_A^2 \Big)^{1/2} \Big( \sum_{\beta \in \Lambda} \|v_\beta\|_A^2 \Big)^{1/2}.$$

Proof. For any α ∈ Λ, we denote by

$$n(\alpha) = \{\beta \in \Lambda \mid \beta > \alpha,\ \omega_\beta \cap \omega_\alpha \neq \emptyset\}, \qquad v_k^\alpha = \sum_{\beta \in n(\alpha),\, g_\beta = k} v_\beta,$$

where $\omega_\alpha$ is the support of $V_\alpha$ and $g_\alpha = \max_{\tau \in \omega_\alpha} g_\tau$.

Since the mesh is a K-mesh, for any $\tau \subset \omega_\alpha$ we have

$$(u_\alpha, v_k^\alpha)_{1,\tau} \lesssim \Big( \frac{h_k}{h_{g_\alpha}} \Big)^{1/2} |u_\alpha|_{1,\tau} \, h_k^{-1} \|v_k^\alpha\|_{0,\tau}.$$

So,

$$\begin{aligned} (u_\alpha, v_k^\alpha)_{1,\omega_\alpha} &= \sum_{\tau \subset \omega_\alpha} (u_\alpha, v_k^\alpha)_{1,\tau} \lesssim \sum_{\tau \subset \omega_\alpha} \Big( \frac{h_k}{h_{g_\alpha}} \Big)^{1/2} |u_\alpha|_{1,\tau} \, h_k^{-1} \|v_k^\alpha\|_{0,\tau} \\ &\le \Big( \frac{h_k}{h_{g_\alpha}} \Big)^{1/2} |u_\alpha|_{1,\omega_\alpha} \, h_k^{-1} \Big( \sum_{\beta \in n(\alpha),\, g_\beta = k} \|v_\beta\|_{0,\omega_\alpha}^2 \Big)^{1/2}. \end{aligned}$$

Then fix uα and consider

$$\begin{aligned} \sum_{\beta \in \Lambda, \beta > \alpha} (u_\alpha, v_\beta)_A &= \sum_{\beta \in n(\alpha)} (u_\alpha, v_\beta)_{1,\omega_\alpha} \sim \sum_{k=g_\alpha}^{J} (u_\alpha, v_k^\alpha)_{1,\omega_\alpha} \\ &\lesssim \sum_{k=g_\alpha}^{J} \Big( \frac{h_k}{h_{g_\alpha}} \Big)^{1/2} |u_\alpha|_{1,\omega_\alpha} \, h_k^{-1} \Big( \sum_{g_\beta = k} \|v_\beta\|_{0,\omega_\alpha}^2 \Big)^{1/2}. \end{aligned}$$

We sum up over $u_\alpha$ level by level. Using the inverse inequality $h_k^{-1} \|v_\beta\|_{0,\omega_\alpha} \lesssim |v_\beta|_{1,\omega_\alpha}$ and the Cauchy-Schwarz inequality over $\ell$ and $k$,
$$\begin{aligned} \sum_{\ell=1}^{J} \sum_{g_\alpha=\ell} \sum_{\beta \in \Lambda, \beta>\alpha} (u_\alpha, v_\beta)_A &\lesssim \sum_{\ell=1}^{J} \sum_{k=\ell}^{J} \Big(\frac{h_k}{h_\ell}\Big)^{1/2} \Big(\sum_{g_\alpha=\ell} |u_\alpha|_{1,\omega_\alpha}^2\Big)^{1/2} \Big(\sum_{g_\beta=k} |v_\beta|_{1}^2\Big)^{1/2} \\ &\lesssim \Big(\sum_{\ell=1}^{J} \sum_{g_\alpha=\ell} \|u_\alpha\|_A^2\Big)^{1/2} \Big(\sum_{k=1}^{J} \sum_{g_\beta=k} \|v_\beta\|_A^2\Big)^{1/2}, \end{aligned}$$
where the last step uses the geometric decay $\sum_{\ell \le k} (h_k/h_\ell)^{1/2} \lesssim 1$.

This gives us the desired estimate.

Using the Gauß-Seidel method as the smoother amounts to choosing the exact inverse on each of the subspaces $V_{\ell,k}$. Therefore, the assumption on the smoother is satisfied as well. Consequently, we have the uniform convergence of the multigrid method on the auxiliary grid.

Theorem 4.3.3. The multigrid method on the auxiliary grid based on the space decomposition (4.5) is nearly optimal; the convergence rate is bounded by $1 - \frac{1}{1 + C \log(N)}$.

4.3.1.3 Condition number estimation

Now we estimate the condition number of the auxiliary space preconditioner by verifying the assumptions in Theorem 3.2.1. Assumption (3.26) is the continuity of the smoother $S$. We prove the first assumption in Theorem 3.2.1 for the Jacobi and Gauß-Seidel iterations. For the Jacobi method, the square of the energy norm can be computed by summing local contributions from the cells $\tau_i$ of the mesh $T$:

$$\|v\|_A^2 = \Big\| \sum_{i \in J} \xi_i \varphi_i \Big\|_A^2 = a\Big( \sum_{i \in J} \xi_i \varphi_i, \sum_{j \in J} \xi_j \varphi_j \Big) = \sum_{i \in J} \sum_{\omega_i \cap \omega_j \neq \emptyset} a(\xi_i \varphi_i, \xi_j \varphi_j)$$
$$\le \sum_{i \in J} \sum_{\omega_i \cap \omega_j \neq \emptyset} \frac{1}{2} \big( \|\xi_i \varphi_i\|_A^2 + \|\xi_j \varphi_j\|_A^2 \big) \le K \sum_{i \in J} \|\xi_i \varphi_i\|_A^2 = K \langle Dv, v \rangle,$$

where $K$ is the maximal number of non-zeros in a row of $A$. Thus the choice $c_s = K$ fulfills the continuity assumption. The continuity of the Gauß-Seidel method can also be proved.

Lemma 4.3.4 (Continuity for Gauß-Seidel). The stiffness matrix A = D − L − U fulfills

$$\frac{1}{K} \langle (D - L)\xi, \xi \rangle \le \langle D\xi, \xi \rangle \le 2 \langle (D - L)\xi, \xi \rangle, \qquad \xi \in \mathbb{R}^N. \tag{4.7}$$

Proof. For any $\xi \in \mathbb{R}^N$ we have

$$\langle (D - L)\xi, \xi \rangle = \sum_{i \in I} \sum_{j \le i} a(\xi_i \varphi_i, \xi_j \varphi_j) \le \sum_{i \in I} \sum_{\Phi_i \cap \Phi_j \neq \emptyset} a(\xi_i \varphi_i, \xi_j \varphi_j) \le K \langle D\xi, \xi \rangle,$$
and vice versa

$$\langle D\xi, \xi \rangle \le \langle (A + D)\xi, \xi \rangle = 2 \langle (D - L)\xi, \xi \rangle.$$

So we get the desired estimate.

In order to prove assumptions (3.27) and (3.28), we need the following lemmas for the transfer operators between $V$ and the auxiliary space $V_J$.

Lemma 4.3.5 (Local stability property). For any auxiliary space function $v \in \tilde{V}$ and any element $\tau \in T$, the quasi-interpolation $\Pi$ satisfies

$$|\Pi v|_{k,\tau} \lesssim h_\tau^{j-k} |v|_{j,\omega_\tau}, \qquad j, k \in \{0, 1\},$$

where ωτ is the union of elements in the auxiliary grid T that intersect with τ.

Proof. Assume that $p_i$, $i = 1, 2, 3$, are the nodal points of $\tau$ and $\varphi_{i,\tau}$ ($\psi_{i,\tau}$) the corresponding nodal (dual) basis functions. Let $\Phi_i = \cup_{m \in M_i} \tau_m$ denote the union of elements adjacent to $p_i$ in the grid $T$, and let $\bar{\Phi}_i = \cup_{m \in M_i^{(p)}} \tau_m$ denote the union of elements in the auxiliary grid $T_p$ that intersect $\Phi_i$. Then we can estimate

$$\begin{aligned} |\Pi v|_{k,\tau} &\le \sum_{i=1}^{3} |\Pi v(p_i)| \, |\varphi_{i,\tau}|_{k,\tau} \lesssim h_\tau^{\frac{d}{2}-k} \sum_{i=1}^{3} |\Pi v(p_i)| = h_\tau^{\frac{d}{2}-k} \sum_{i=1}^{3} \Big| \int_{\Phi_i} \psi_{i,\tau}(\xi) v(\xi) \, d\xi \Big| \\ &\lesssim h_\tau^{\frac{d}{2}-k} \sum_{i=1}^{3} \|\psi_{i,\tau}\|_{\infty,\Phi_i} \|v\|_{0,\Phi_i} \lesssim h_\tau^{\frac{d}{2}-k} h_\tau^{-d} \sum_{i=1}^{3} \|v\|_{0,\Phi_i} \lesssim h_\tau^{-\frac{d}{2}-k} \sum_{i=1}^{3} \|v\|_{0,\Phi_i} \\ &\lesssim h_\tau^{-\frac{d}{2}-k} \sum_{i=1}^{3} \sum_{m \in M_i} h_{\tau_m}^{d/2+j} |v|_{j,\tau_m} \lesssim h_\tau^{-\frac{d}{2}-k} \sum_{i=1}^{3} \Big( \sum_{m \in M_i} h_{\tau_m}^{d+2j} \Big)^{1/2} \Big( \sum_{m \in M_i} |v|_{j,\tau_m}^2 \Big)^{1/2} \\ &\lesssim h_\tau^{-\frac{d}{2}-k} h_\tau^{\frac{d}{2}+j} \sum_{i=1}^{3} |v|_{j,\bar{\Phi}_i} \lesssim h_\tau^{j-k} |v|_{j,\omega_\tau}, \end{aligned}$$

which proves the desired inequality.

Lemma 4.3.6. For any function $v \in V$, the reverse interpolation operator $\tilde{\Pi}$ satisfies
$$\sum_{\tau \in T} h_\tau^{-2} \|v - \tilde{\Pi} v\|_{0,\tau}^2 \lesssim |v|_{1,\Omega}^2 \quad \text{and} \quad |\tilde{\Pi} v|_{1,\Omega_J}^2 \lesssim |v|_{1,\Omega}^2. \tag{4.8}$$

Proof. The proof follows an argument presented in Xu [47]. Let $\hat{T}$ be the set of the elements in $T$ which do not intersect $\partial \Omega_J$, i.e.
$$\hat{T} = \{\tau \in T \mid \tau \subset \Omega_J,\ \tau \cap \partial \Omega_J = \emptyset\}, \qquad \hat{\Omega} = \bigcup_{\tau \in \hat{T}} \bar{\tau}.$$

Then,

$$\sum_{\tau \in T} h_\tau^{-2} \|v - \tilde{\Pi} v\|_{0,\tau}^2 \le \sum_{\tau \in \hat{T}} h_\tau^{-2} \|v - \tilde{\Pi} v\|_{0,\tau}^2 + \sum_{\tau \in T \setminus \hat{T}} h_\tau^{-2} \|v\|_{0,\tau}^2 + \sum_{\tau \in T \setminus \hat{T}} h_\tau^{-2} \|\tilde{\Pi} v\|_{0,\tau}^2.$$

For any element $\tau \in \hat{T}$, let $\omega_\tau$ be the union of elements in the auxiliary grid $T_J$ that intersect $\tau$; then
$$h_\tau^{-2} \|v - \tilde{\Pi} v\|_{0,\tau}^2 \lesssim \sum_{\tilde{\tau} \subset \omega_\tau} h_{\tilde{\tau}}^{-2} \|v - \tilde{\Pi} v\|_{0,\tilde{\tau}}^2 \lesssim |v|_{1,\omega_\tau}^2. \tag{4.9}$$

So,
$$\sum_{\tau \in \hat{T}} h_\tau^{-2} \|v - \tilde{\Pi} v\|_{0,\tau}^2 \lesssim \sum_{\tau \in \hat{T}} |v|_{1,\omega_\tau}^2 \lesssim |v|_{1,\Omega_J}^2 \le |v|_{1,\Omega}^2.$$
By the Poincaré inequality and scaling, if $G_\eta$ is a reference square ($d = 2$) or cube ($d = 3$) of side length $\eta$, then

$$\eta^{-2} \|w\|_{0,G_\eta}^2 \lesssim \int_{G_\eta} |\nabla w|^2 \, dx$$
holds for all functions $w$ vanishing on one edge of $G_\eta$. For any $\tau \in T \setminus \hat{T}$, by covering $\tau$ with

subregions which can be mapped onto $G_{\eta_\tau}$, $\eta_\tau \simeq h_\tau$, we can conclude that

$$\sum_{\tau \in T \setminus \hat{T}} h_\tau^{-2} \|w\|_{0,\tau}^2 \lesssim \sum_{\tau \in T \setminus \hat{T}} |w|_{1,G_{\eta_\tau}}^2 \lesssim |w|_{1,\Omega}^2.$$

Applying the above estimate with $w = v$ and $w = \tilde{\Pi} v$, one has

$$\sum_{\tau \in T \setminus \hat{T}} h_\tau^{-2} \|v\|_{0,\tau}^2 + \sum_{\tau \in T \setminus \hat{T}} h_\tau^{-2} \|\tilde{\Pi} v\|_{0,\tau}^2 \lesssim |v|_{1,\Omega}^2 + |\tilde{\Pi} v|_{1,\Omega}^2 \lesssim |v|_{1,\Omega}^2.$$

For the second inequality,
$$|\tilde{\Pi} v|_{1,\Omega_J}^2 \lesssim |v|_{1,\Omega_J}^2 \lesssim |v|_{1,\Omega}^2.$$
So we have the desired estimate.

We can now verify the remaining assumptions of Theorem 3.2.1.

Lemma 4.3.7. For any $v \in V_J$, we have

$$|\Pi v|_{1,\Omega} \lesssim |v|_{1,\Omega_J}.$$

Proof. By the local stability of Π,

$$|\Pi v|_{1,\Omega}^2 = \sum_{\tau \in T} |\Pi v|_{1,\tau}^2 \lesssim \sum_{\tau \in T} |v|_{1,\omega_\tau}^2 \lesssim |v|_{1,\Omega}^2 = |v|_{1,\Omega_J}^2.$$

The desired estimate then follows.

Lemma 4.3.8. For any $v \in V$, there exist $v_0 \in V$ and $w \in V_J$ such that

$$\|v_0\|_{S^{-1}}^2 + |w|_{1,\Omega_J}^2 \lesssim |v|_{1,\Omega}^2.$$

Proof. For any $v \in V$, let $w := \tilde{\Pi} v$ and $v_0 = v - \Pi w$. Then

$$\begin{aligned} \|v_0\|_{S^{-1}}^2 + |w|_{1,\Omega_J}^2 &\lesssim \sum_{\tau \in T} h_\tau^{-2} \|v - \Pi w\|_{0,\tau}^2 + |w|_{1,\Omega_J}^2 \\ &\le \sum_{\tau \in T} h_\tau^{-2} \|v - \tilde{\Pi} v\|_{0,\tau}^2 + \sum_{\tau \in T} h_\tau^{-2} \|w - \Pi w\|_{0,\tau}^2 + |w|_{1,\Omega_J}^2 \\ &\lesssim |v|_{1,\Omega}^2. \end{aligned}$$

Theorem 4.3.9. If the multigrid method on the Dirichlet auxiliary grid is used as the preconditioner $\tilde{B}$ on the auxiliary space and $B$ is the ASMG preconditioner defined by (3.25) or (3.31), then

$$\kappa(BA) \lesssim \log(N).$$

Chapter 5

Colored Gauss-Seidel Method by auxiliary grid

The Gauss-Seidel method is very attractive in many applications, for solving problems such as sets of inter-dependent constraints or as a smoother in multigrid methods. The heart of the algorithm is a loop that sequentially processes each unknown quantity. Since the algorithm is inherently sequential, it is not easy to parallelize efficiently. If we relax the strict precedence relationships in the Gauss-Seidel method, parallelism can be extracted even for highly coupled problems [118]. Instead of a true Gauss-Seidel method, each processor then performs the Gauss-Seidel method as a subdomain solver within a block Jacobi method. For this method, as we add more processing units, the algorithm tends toward a Jacobi algorithm. The main drawback is that the convergence rate sometimes suffers and the iteration may even diverge [104]. Researchers have put a lot of effort into the parallelization of the Gauss-Seidel method, and many parallelization schemes have been developed for regular problems. A red-black coloring scheme is used to solve the Poisson equation by the finite difference method in [119, 120]; this scheme has been extended to multi-coloring [121, 104]. Other Gauss-Seidel methods for distributed systems can be found in [122, 123, 124, 125]. One of the key procedures of the parallelization is grouping the independent quantities based on a graph-coloring approach. However, for the multi-colored Gauss-Seidel method, the number of parallel communications required per iteration is proportional to the number of colors; hence it tends to be slow, especially on unstructured grids. Therefore, the purpose of this chapter is to develop a new coloring scheme for unstructured grids so that an efficient parallel Gauss-Seidel method can be

implemented for practical problems on general unstructured grids. Thanks to the auxiliary grids, an unstructured grid can be associated with a structured grid which is generated by an adaptive tree. Since trees have numerous parallel constructs, it is plausible that efficient parallel coloring procedures can be developed on shared-memory or distributed-memory parallel computers. We develop a coloring algorithm for quadtrees in 2D and octrees in 3D which runs in linear time. Using the coloring of the adaptive tree, we can color the DoFs by blocks which sit in the same hypercube of the auxiliary structured grid. Therefore, a colored block-wise Gauss-Seidel smoother can be applied with the aggregates serving as non-overlapping blocks.

5.1 Graph Coloring

In order to implement the colored Gauss-Seidel method, we need to assign a color to each DoF such that all the DoFs with the same color are independent. For a general unstructured grid, this is usually done by a graph coloring algorithm. Define a graph $G = (V, E)$.

Each point in $V$ is one DoF. If there exists $e = (v_i, v_j) \in E$, it means $a_{ij} \neq 0$. We introduce the following definition:

Definition 10 (Proper Coloring, k-coloring). A proper coloring is an assignment of colors to the vertices of a graph so that no two adjacent vertices have the same color. A k-coloring of a graph is a proper coloring involving a total of k colors.

In order to describe how many colors we need to color a graph, we need the following definition:

Definition 11 (Chromatic Number). The chromatic number of a graph is the minimum number of colors in a proper coloring of that graph, denoted by χ(G).

In general, it is difficult to compute or estimate the chromatic number of complicated graphs. Theoretical results on graph coloring do not offer much good news: even approximating the chromatic number of a graph is known to be NP-hard [126]. There is, however, a greedy algorithm for properly coloring vertices that gives an upper bound on the number of colors needed. The complexity of the greedy algorithm is $O(n\bar{d})$, where $\bar{d}$ is the average degree. The parallelization of the greedy algorithm is difficult since it is inherently sequential. Several

Algorithm 21 The Greedy Algorithm For Coloring Vertices of a Graph

Let $v_1, \cdots, v_n$ be an ordering of $V$
for $i = 1$ to $n$ do
    Determine the forbidden colors for $v_i$
    Assign $v_i$ the smallest permissible color

approaches based on maximal independent sets exist, but they generate more colors than a sequential implementation, and the speedup is poor, especially on unstructured grids [127]. Clearly, the greedy algorithm provides a proper coloring. However, the number of colors used by the algorithm depends on the graph and on the order in which we color the vertices. An upper bound on the number of colors is given by the following theorem.

Theorem 5.1.1 (Greedy Coloring Theorem). If d is the largest of the degrees of the vertices in a graph G, then G has a proper coloring with d + 1 or fewer colors, i.e., the chromatic number of G is at most d + 1.
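The following is a minimal sketch of Algorithm 21, assuming the graph is given as an adjacency list (`adj[v]` iterates over the neighbours of `v`); the names are illustrative.

```python
# Minimal sketch of the greedy coloring algorithm (Algorithm 21).
def greedy_coloring(adj, order):
    color = {}
    for v in order:
        forbidden = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in forbidden:   # smallest permissible color
            c += 1
        color[v] = c
    return color

# Example: a path on 4 vertices is 2-colored, consistent with
# Theorem 5.1.1 (at most max-degree + 1 colors).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_coloring(adj, [0, 1, 2, 3]))   # {0: 0, 1: 1, 2: 0, 3: 1}
```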

5.2 Quadtree Coloring

The quadtree coloring problem was introduced by Benantar et al. [128, 129], motivated by problems of parallel computation on quadtree-structured finite element meshes. There are several variants of the problem. A quadtree may be balanced or unbalanced; balanced quadtrees are typically used in the finite element method. Furthermore, two squares may be considered neighbors if they share an edge (edge adjacency) or if they share any vertex of an edge (vertex adjacency). We focus on vertex-adjacency balanced quadtree coloring. The coloring of the quadtree can be done by a general graph coloring algorithm applied to the dual graph $G^*$ of the quadtree; $G^*$ is obtained by replacing each rectangle by a node and adding an edge whenever two rectangles share an edge or a vertex. By Theorem 5.1.1, the maximum number of colors used by the greedy algorithm for the vertex-adjacency balanced quadtree is 12; in 3D, the number is 32. However, with the special structure of the quadtree we can have better algorithms. For a full quadtree, the coloring procedure is simple: 4 colors are sufficient for a 2D quadtree and 8 colors are sufficient for a 3D octree. If we have an adaptive quadtree, we seek coloring procedures that (1) use a small number of colors, (2) have a reasonable parallel complexity, and (3) are applicable to arbitrary quadtree/octree structures. Recall that one-level differences in tree levels are permitted at adjacent quadrant edges and two-level differences between quadrants sharing a common vertex. Benantar et al. [128] provide an algorithm showing that with corner adjacency, balanced quadtrees require at most six colors. Eppstein et al. [130] provide lower-bound examples showing that at least five colors are necessary for balanced corner adjacency.

Theorem 5.2.1 ([130]). There is a balanced quadtree requiring five colors for all colorings in which no two squares sharing an edge or a corner have the same color.

Proof. Consider the balanced quadtree shown in Figure 5.1, and assume it can be colored with only 4 colors. Four different colors must be used for the center rectangles, see Figure 5.1.

Figure 5.1. A balanced quadtree requires at least five colors

Rectangle A has three possible options: color 1, 2, or 4. Once we choose the color for A, we can fill in the colors of some rectangles by the following two rules:

1. If rectangle s has 3 colored neighbors, assign the remaining fourth color to s;

2. If rectangle s has a corner shared by three other rectangles, each of which is adjacent to some rectangles of color i, assign color i to s.

For the different choices of colors for A, Figure 5.2 shows the results of choosing color 1 and color 4 for A. Since the quadtree is symmetric with respect to $y = x$, coloring A with 2 is similar to choosing 1. No matter which color is chosen for A, it leads to an inconsistency at B: B would have to be both color 1 and color 2. Therefore, the quadtree cannot be colored with 4 colors.

Figure 5.2. Forced coloring rectangles

The six-color algorithm in [128] is suitable for parallel implementation. In order to present the algorithm, a 2D binary graph is obtained from a finite quadtree.

Definition 12. A 2D binary graph is a directed binary graph obtained from a finite quadtree by the following algorithm:

1. The root of the quadtree corresponds to the root of the binary graph.

2. Every terminal quadrant is associated with a node of the binary graph.

3. Nodes across a common horizontal edge in the quadtree are connected in the binary graph.

4. When a quadrant is divided, its parent in the binary graph becomes the root of a subgraph.

Assume we want to divide all quadrants among six colors 1, 2, 3, 4, 5, and 6. Let us divide the six colors into three sets, $a_0 = \{1, 2\}$, $a_1 = \{3, 4\}$, and $a_2 = \{5, 6\}$. Each set consists of two disjoint colors that alternate in a column-order traversal of the quadtree representation of the domain. A column-order traversal of the quadtree is equivalent to a

Figure 5.3. Adaptive quadtree and its binary graph

depth-first traversal of the binary graph. Whenever the left and right branches of the binary graph merge, the traversal continues using the color set associated with the left branch. Two of the three color sets are passed to a node of the binary graph; assume they are $a$ and $b$. At each branching, the color set $a$ and the third color set $c$ are passed to the left offspring, while the sets $a$ and $b$ are passed in reverse order to the right offspring. This process results in the recursive coloring procedure described in Algorithm 22, sketched in code below.

Algorithm 22 Six-coloring method for binary graph in 2D
coloring_binary_graph(root, a0, a1, a2):
if not (root = NULL || root colored) then
    color root using an alternating coloring from set a0;
    coloring_binary_graph(left offspring, a0, a2, a1);
    coloring_binary_graph(right offspring, a1, a0, a2);
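The following is a minimal sketch of Algorithm 22 as a Python recursion, assuming each node carries `left`, `right`, `colored`, and `color` attributes; the recursion structure (left gets $(a_0, a_2, a_1)$, right gets $(a_1, a_0, a_2)$) follows the pseudocode, while the alternation within a color set is simplified here to a parity flag rather than the column-order traversal of the text. All names are illustrative.

```python
# Schematic sketch of the six-coloring recursion on the binary graph.
def coloring_binary_graph(node, a0, a1, a2, parity=0):
    if node is None or node.colored:
        return
    node.color = a0[parity]       # alternating color from set a0 (simplified)
    node.colored = True
    # left offspring receives (a0, a2, a1), right offspring (a1, a0, a2)
    coloring_binary_graph(node.left, a0, a2, a1, 1 - parity)
    coloring_binary_graph(node.right, a1, a0, a2, parity)
```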

Theorem 5.2.2 ([131]). For any balanced quadtree, the coloring given by Algorithm 22 is a proper vertex-adjacent coloring, i.e., no two squares sharing an edge or a corner have the same color.

Proof. There are three possible relationships between the two rectangles: $S_1$ and $S_2$ may share an edge in the x or y direction, or $S_1$ and $S_2$ may share one vertex.

If $S_1$ and $S_2$ share an x-direction edge, then $S_1$ and $S_2$ are parent and child in the binary graph. Assume $S_1$ is the parent of $S_2$. The possible situations are:

1. The depths of $S_1$ and $S_2$ are the same;

2. The depth of $S_1$ is larger than that of $S_2$ and $S_2$ is the left offspring;

3. The depth of $S_1$ is larger than that of $S_2$ and $S_2$ is the right offspring;

4. The depth of $S_1$ is smaller than that of $S_2$ and $S_1$ is in the left branch;

5. The depth of $S_1$ is smaller than that of $S_2$ and $S_1$ is in the right branch.

In all cases there are two possible outcomes: either $S_1$ and $S_2$ are colored by the alternating coloring of the same color set, or $S_1$ and $S_2$ are colored from different color sets. In both situations, $S_1$ and $S_2$ receive different colors.

If $S_1$ and $S_2$ share a y-direction edge, they have a common ancestor $S_0$. Assume $S_1$ is the right-most child of the left offspring of $S_0$ and $S_2$ is the left-most child of the right offspring of $S_0$. Denote the left and right offspring of $S_0$ by $S_3$ and $S_4$. Assume the color sets assigned to $S_0$ are $\{a_0, a_1, a_2\}$. Then the color sets of $S_3$ are $\{a_0, a_2, a_1\}$ and the color sets of $S_4$ are $\{a_1, a_0, a_2\}$. So the color sets of $S_1$ are $\{a_0, a_2, a_1\}$ or $\{a_2, a_0, a_1\}$, and the color sets of $S_2$ are $\{a_1, a_0, a_2\}$ or $\{a_1, a_2, a_0\}$. Therefore, $S_1$ and $S_2$ are colored from different color sets, so $S_1$ and $S_2$ have different colors.

If $S_1$ and $S_2$ share a vertex, they again have a common ancestor $S_0$, and the possible situations are the same as in the case where $S_1$ and $S_2$ share a y-direction edge, so $S_1$ and $S_2$ have different colors. Therefore, any two neighboring rectangles have different colors, which completes the proof.

The actual implementation utilizes an additional stack structure instead of the binary graph in order to reduce the number of tree traversals. Figure 5.4 shows the coloring result for the binary tree of Figure 5.3.

Remark 5.2.3. A corner-adjacent quadtree may require either five or six colors. Whether there exists a coloring algorithm using only 5 colors is still an open problem.

5.3 Tree Representations

The classic representation of a quadtree/octree uses pointers to store the data. Besides the data, each node stores pointers to each of its child nodes; in leaf nodes, the pointers to the children are NULL. A pointer from a child node to its parent node can also be added in order to simplify upward traversal.

Figure 5.4. Six-Coloring for adaptive quadtree

In order to speed up the traversal, one option is to replace pointers by indices. In that case, the references to a child node must be replaced by a computation on the parent's index, and the nodes must be stored in an index table. This structure is more compact than the pointer representation, and it allows direct access in constant time to any node of the tree via its index, while the pointer representation only allows direct access to the root node. Since the tree structure is generated from the grid, the index can also be generated systematically from the node's geometrical position by encoding the node's position in the tree hierarchy. A common choice for such an index is the Morton code. This method efficiently generates a unique index for each node, while offering good spatial locality and easy computation. Another advantage of Morton codes is their hierarchical order, since it is possible to create a single index for each node while preserving the tree hierarchy. The index can be calculated from the tree hierarchy, recursively, when traversing the tree. The root has index 0, and the index of each child node is the concatenation of its parent's index with the direction, which is coded over $d$ bits. Bottom-up traversal is also possible: to find the parent index we only have to truncate the last $d$ bits of a child's index.

On the other hand, the index can equivalently be computed from the geometric position of the node's box and its size. Assume the root box is a unit box. If the position of a box is $(x_1, x_2, \cdots, x_d)$ and its size is $2^{-l}$, then the node is at depth $l$ and the Morton code can be generated as

$$a_1^1 a_2^1 \cdots a_d^1 \, a_1^2 a_2^2 \cdots a_d^2 \cdots a_1^l a_2^l \cdots a_d^l, \tag{5.1}$$

where $a_i^l a_i^{l-1} \cdots a_i^1$ is the binary decomposition of $\lfloor 2^l x_i \rfloor$, $i = 1, 2, \cdots, d$. Figure 5.5 shows an example of the Morton codes of the leaf nodes of an adaptive quadtree.
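The following is a minimal sketch of (5.1), assuming the depth-1 digit is the most significant bit of $\lfloor 2^l x_i \rfloor$; `morton_code` is an illustrative name.

```python
# Minimal sketch: interleave the l bits of floor(2^l * x_i) per dimension
# to obtain the Morton code of a box at depth l in the unit root box.
def morton_code(x, level):
    d = len(x)
    coords = [int((2 ** level) * xi) for xi in x]   # floor(2^l x_i)
    code = 0
    for k in range(level):                          # depth k+1, coarse to fine
        for i in range(d):
            bit = (coords[i] >> (level - 1 - k)) & 1
            code = (code << 1) | bit
    return code

# Example: the four depth-1 children of the unit square get codes 0..3.
print([morton_code((x, y), 1) for x in (0.0, 0.5) for y in (0.0, 0.5)])
```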

Figure 5.5. The Morton codes of an adaptive quadtree.

5.4 Parallel Implementation of the Coloring Algorithm

With the index representation of the tree structure, we can rewrite Algorithm 22 and generalize the algorithm to 3D octrees. If $d = 2$, assume that the index of one node $i$ is

$I^i = a_1^1 a_2^1 a_1^2 a_2^2 \cdots a_1^l a_2^l$. We can rewrite the index as a primary index $I_1^i = a_1^1 a_1^2 \cdots a_1^l$ and a secondary index $I_2^i = a_2^1 a_2^2 \cdots a_2^l$. The primary index determines the color set used for the node: if two nodes have the same primary index, they will have the same color set (cf. Figure 5.6). Since the lengths of the primary indices may differ, we only compare the values of the indices, i.e., we ignore leading 0s in the primary index; for example, $010 = 10$. The color set is determined by the remainder of the division of $I_1^i$ by 3: if $I_1^i \bmod 3 = k$, the color of node $i$ is in color set $a_k$. Then we can order the nodes with the same primary index by the secondary index and use an alternating coloring from the color set to color the nodes. The comparison of the secondary codes is by binary digits from right to left; for example, the ordered sequence of $\{00, 001, 10, 11, 101\}$ is $\{00, 10, 001, 101, 11\}$. In order to color all the nodes, we only need to reorder the nodes by the primary and secondary indices, then use the primary indices to determine the color set and alternately color the nodes with the chosen color set. A sketch is given below.
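The following is a minimal sketch of this procedure for $d = 2$, assuming nodes are given as (Morton code, depth) pairs as produced above; the right-to-left digit comparison of the secondary index is simplified here to a plain numeric sort, and all names are illustrative.

```python
# Schematic sketch: index-based six-coloring of quadtree nodes for d = 2.
def split_index(code, level):
    i1 = i2 = 0
    for k in range(level):                 # de-interleave the bit pairs
        i1 = (i1 << 1) | ((code >> (2 * (level - 1 - k) + 1)) & 1)
        i2 = (i2 << 1) | ((code >> (2 * (level - 1 - k))) & 1)
    return i1, i2

def color_nodes(nodes, sets=((1, 2), (3, 4), (5, 6))):
    buckets = {}                           # group nodes by primary index I1
    for code, level in nodes:
        i1, i2 = split_index(code, level)
        buckets.setdefault(i1, []).append((i2, code))
    colors = {}
    for i1, group in buckets.items():
        pair = sets[i1 % 3]                # color set chosen by I1 mod 3
        group.sort()                       # order by secondary index (simplified)
        for pos, (_, code) in enumerate(group):
            colors[code] = pair[pos % 2]   # alternate the two colors of the set
    return colors
```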

Figure 5.6. Adaptive quadtree and its binary graph

Theorem 5.4.1. Six colors are sufficient to color the nodes of a quadtree so that no two adjacent squares, i.e., squares sharing a vertex or an edge, have the same color.

Proof. Since the new algorithm is equivalent to Algorithm 22, the claim follows from Theorem 5.2.2.

If $d = 3$, we can design a similar algorithm for octrees with 18 colors. Firstly, we separate

the 18 colors into 9 sets: $a_1 = \{1, 2\}$, $a_2 = \{3, 4\}$, $\cdots$, $a_9 = \{17, 18\}$, and further group the 9 sets into 3 set groups: $b_1 = \{a_1, a_2, a_3\}$, $b_2 = \{a_4, a_5, a_6\}$, and $b_3 = \{a_7, a_8, a_9\}$. For the node $i$ with index $I^i = a_1^1 a_2^1 a_3^1 a_1^2 a_2^2 a_3^2 \cdots a_1^l a_2^l a_3^l$, we can separate the index into three indices: the primary-1 index $I_1^i = a_1^1 a_1^2 \cdots a_1^l$, the primary-2 index $I_2^i = a_2^1 a_2^2 \cdots a_2^l$, and the secondary index $I_3^i = a_3^1 a_3^2 \cdots a_3^l$. The primary-1 index determines the set group and the primary-2 index determines the color set. Then we order the nodes with the same primary indices by the secondary index. So we can color the nodes by ordering with respect to the primary-1, primary-2, and secondary indices (cf. Figure 5.7).

Figure 5.7. Coloring of 3D adaptive octree

Theorem 5.4.2. Eighteen colors are sufficient to color the nodes of an octree so that no two adjacent cubes, i.e., cubes sharing a vertex, an edge, or a face, have the same color.

Proof. Assume the two neighboring cubes are A and B. There are four cases. If A and B have the same primary-1 and primary-2 indices, they are colored from the same color set, denoted $a_i$. Since we color the two cubes by alternating colors in $a_i$, A and B have different colors. If A and B have the same primary-1 index but different primary-2 indices, assume the primary-2 index of A is $I_2^a$ with level $\ell_1$ and the primary-2 index of B is $I_2^b$ with level $\ell_2$. Since A and B are neighbors, $\ell_2 \in \{\ell_1 - 1, \ell_1, \ell_1 + 1\}$. If $\ell_2 = \ell_1$, then $I_2^b = I_2^a \pm 2^{\ell_1}$; if $\ell_2 = \ell_1 - 1$, then $I_2^b = I_2^a \pm 2^{\ell_2}$; and if $\ell_2 = \ell_1 + 1$, then $I_2^b = I_2^a \pm 2^{\ell_1}$. In all cases, $I_2^b \not\equiv I_2^a \pmod 3$, so A and B are in different color sets and therefore have different colors. If A and B have different primary-1 indices, assume the primary-1 indices of A and B are $I_1^a$ and $I_1^b$. By the same argument as before, $I_1^b \not\equiv I_1^a \pmod 3$. Therefore, they have different set groups and hence different colors. This completes the proof.

5.5 Block Colored Gauss-Seidel Methods

Given any unstructured grid $T$, we can associate an auxiliary structured grid $T_h$ which is generated from the auxiliary cluster quadtree/octree. Define the DoFs in the same hypercube as one block. Then, the colored quadtree/octree from Section 5.4 can be used to determine which blocks are independent of each other. The blocks of the same color are independent, from which it follows that they can be processed embarrassingly in parallel. Consider the blocks of one color as one set. Each set can be distributed among the available processors, and an updated solution of the DoFs in the set is computed. This has one important drawback: the need for data communication between the processors is high. When computing set $i$, the data of other sets on other processors need to be transferred, so some special techniques are needed in the implementation. The number of colors has a significant impact on the parallel scaling of the colored Gauss-Seidel method. A general graph coloring scheme cannot control the number of colors, so the performance is not guaranteed. For our coloring scheme, the number of colors is fixed, which is more robust in applications.

Chapter 6

Parallel FASP-AMG Solvers

Given the need to solve extremely large systems, researchers have developed and implemented a number of parallel AMG methods for CPU. BoomerAMG (included in the Hypre package [55]) is a parallelization of the classical AMG methods and their variants, whereas others, including ML [56], focus on parallel versions of smoothed aggregation AMG methods. In addition, some parallel implementations of AMG methods are used in commercial software such as NaSt3DGPF [132] and SAMG [39]. Not only are researchers rapidly developing algorithms, the hardware is developing just as quickly. GPUs based on the single instruction multiple thread (SIMT) hardware architecture have provided an efficient platform for large-scale scientific computing since November 2006, when NVIDIA released the Compute Unified Device Architecture (CUDA) toolkit. The CUDA toolkit made programming on GPU considerably easier than it had been previously. Engineers and scientists believe that programmable GPUs have large potential to outperform CPUs in scientific computing, and they are switching to GPU to achieve better performance for current solvers. Since CUDA is now the main choice for scientific computing on GPU, many sparse linear algebra and solver packages based on CUDA have been developed: MAGMA [133], CULA [134], the CUDPP library [135], the NVIDIA CUSPARSE library [136], the IBM SpMV library [137], Iterative CUDA [138], Concurrent Number Cruncher [139], and CUSP [140]. MG methods have also been parallelized and implemented on GPU in a number of studies. GMG methods, as typical cases of MG, were implemented on GPU first [58, 59, 60, 61, 62, 63]. These studies demonstrate that the speedup afforded by GPUs can result in GMG methods achieving a high level of performance on CUDA-enabled GPUs. However, to the best of our knowledge, parallelizing an AMG method on GPU or CPU remains very challenging, mainly due to the sequential nature of the coarsening processes (setup phase) used in AMG methods. In most AMG algorithms, coarse-grid points or basis functions are selected sequentially using graph-theoretical tools (such as maximal independent sets and graph partitioning algorithms), and the coarse-grid matrices are constructed by a triple-matrix multiplication. Although extensive research has been devoted to improving the performance of parallel coarsening algorithms, leading to marked improvements on CPU [64, 65, 66, 36, 67, 68, 69] and on GPU [65, 70, 71] over time, the setup phase is still considered the major bottleneck in parallel AMG methods. On the other hand, the task of designing an efficient and robust parallel smoother in the solver phase is no less challenging. In this chapter, based on the idea of auxiliary grid preconditioning, we design a new parallel AMG method based on the unsmoothed aggregation AMG (UA-AMG) method together with what is known as the nonlinear AMLI-cycle (K-cycle) [106, 107, 108] for (nearly) isotropic problems discretized on shape-regular grids. The UA-AMG method is attractive because of the simplicity of its setup procedure and its low computational cost. The aggregates are usually generated by finding a maximal independent set of the corresponding graph of the matrix. This procedure is mainly sequential and difficult to implement in parallel. Moreover, the shape of the coarse grid and the sparsity pattern of the coarse-grid matrix are usually difficult to control, which increases the computational complexity of the AMG methods.
The lack of control and the corresponding complexity make the traditional aggregation approaches less favorable for parallel computation than other AMG methods, especially on GPUs. However, our main focus here is on discretized partial differential equations, for which detailed information on the underlying geometric grid is generally available, although we usually only have access to the finest-level unstructured grid. We designed our new AMG method to overcome the difficulties of the setup phase, i.e., the algebraic coarsening process, by using the geometric grid on the finest level to build the coarse levels, instead of the purely algebraic coarsening characteristic of most AMG methods. The main idea is to construct an auxiliary structured grid based on information from the finest geometric grid, and to select a simple and fixed coarsening algorithm that allows explicit control of the overall grid and operator complexities of the multilevel solver. When an auxiliary structured grid is used, the coarsening process of the UA-AMG method, i.e., the construction of aggregates, can easily be done by choosing the elements that intersect certain local patches of the auxiliary grid. This auxiliary grid and its relationship with the original unstructured grid can be effectively managed by a quadtree in 2D (octree in 3D). The special structure of the auxiliary grid narrows the bandwidth of the coarse-grid matrix (9-point stencil in 2D and 27-point stencil in 3D for most shape-regular grids), gives us explicit control of the sparsity pattern of the coarse-grid matrices, and reduces the operator complexity of the AMG methods. These features control the work per thread, which is considered advantageous for parallel computing, especially on GPU. Moreover, due to the regular sparsity pattern, we can use the ELLPACK format to significantly speed up the sparse matrix-vector multiplication on GPU (see [141] for a discussion of different sparse matrix storage formats). As our new parallel AMG method is based on the UA-AMG framework, there is no need to form the prolongation and restriction matrices explicitly; both of their actions can be computed efficiently in parallel by using the quadtree of the auxiliary structured grid. In addition, the auxiliary grid allows us to use a colored Gauss-Seidel smoother without any coloring of the matrices. This improves not only the convergence rate but also the parallel performance of the UA-AMG method. The following construction of the auxiliary grid aggregation method is given for the 2D case; however, we would like to point out that it can easily be generalized to the 3D case. Many research groups have shown that GPUs can accelerate multigrid methods, making them perform faster than on CPU (see [58, 59, 60, 61, 62, 142]). Well-implemented GPU kernels can be more than 10 times faster than their CPU counterparts. The components of our auxiliary grid AMG method whose parallelization we discuss are as follows:

1. Sparse matrix-vector multiplication: just as in many other iterative methods, sparse matrix-vector multiplication (SpMV) plays an important role in our AMG method. Moreover, because we apply the nonlinear AMLI-cycle and use our AMG method as a preconditioner for a Krylov iterative method, we need an efficient parallel SpMV implementation on GPU. This means we should use a sparse matrix storage format that has a regular memory access pattern when we do the SpMV.

2. Parallel aggregation algorithm: in the coarsening procedure, aggregation is one of the most time-consuming steps. In the previous section, we introduced our auxiliary-grid-based aggregation method and showed its potential for efficient parallelization. The parallelization of the auxiliary-grid aggregation algorithm is discussed in this section.

3. Prolongation and restriction: an advantage of the UA-AMG method is that there is no need to explicitly form the prolongation $P$ or the restriction $R$, which is usually the transpose of $P$. This is because both $P$ and $R$ are just Boolean matrices that characterize the aggregates. However, we do need their actions in the AMG method to transfer the solution and the residual between levels, and we need to be able to do so in parallel.

4. Coarse-level matrix: the coarse-level matrix is usually computed by a triple-matrix multiplication (3.40). This computation is considered the most difficult part of the AMG method to parallelize on GPU, as the speedup can be as low as 1.2 compared with the CPU implementation (see [70]). However, thanks to the auxiliary structured grid, we have a fixed sparsity pattern on each coarse level, which makes the triple-matrix multiplication much easier to compute.

5. Parallel smoother: the smoother is the central component of the AMG method. It is usually a linear iteration. For a parallel AMG method, however, it is difficult to design an efficient parallel smoother. Generally speaking, smoothers that have a good smoothing property, such as the Gauss-Seidel method, are difficult to parallelize, whereas smoothers that are easy to parallelize, like the Jacobi method, usually cannot achieve a good smoothing property. In our parallel AMG method, the auxiliary structured grid makes it possible to use a colored Gauss-Seidel method, which maintains a good smoothing property and is easy to parallelize.

By combining these components with the nonlinear AMLI-cycle, we have established our parallel auxiliary grid AMG method, which is presented at the end of this section.

6.1 Parallel Auxiliary Grid Aggregation

The auxiliary quadtree is generated as in Section 4.2.1. Once we have the auxiliary structured grid and the region quadtree, we can form the aggregates on all levels. Here, we need to distinguish between two different cases: (1) level $L$ and (2) all the coarse levels $0 < k < L$.

• Level L. On this level, we need to aggregate the degrees of freedom (DoFs) of the unstructured grid. Basically, all the DoFs in the subregion $\Omega_i^L$ form one aggregate $i$. Therefore, there are $4^L$ aggregates on this level ($8^L$ in 3D). Let us denote the coordinates of DoF $j$ on the unstructured grid by $(x_j, y_j)$, so that the aggregate $G_i^L$ is defined by

$$G_i^L = \{j : (x_j, y_j) \in \Omega_i^L\}. \tag{6.1}$$

Using equation (4.2), we can see that the aggregation generating procedure can be done by checking the coordinates of the DoFs.

• Coarse level $0 < k < L$. Start from the second level $L - 1$ and proceed through the levels to the coarsest level 0. Note that each level $k$ is a structured grid formed by all the subregions $\{\Omega_i^k\}$, and that a DoF on the coarse level corresponds to a subregion on that level. Therefore, we can use a fixed and simple coarsening whereby all the children of a subregion form an aggregate, i.e.,

$$G_i^k = \{j : \Omega_j^{k+1} \text{ is a child of } \Omega_i^k\}. \tag{6.2}$$

Thanks to the quadtree structure, the children in the quadtree can be found by checking their indices.

Figures 6.1 and 6.2 show the aggregation on level $L$ and on the coarse levels, respectively. The detailed parallel algorithms and their implementation are discussed in the next section; a sketch of the two aggregation rules is given below.
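The following is a minimal sketch of (6.1) and (6.2) for $d = 2$, assuming the root box is the unit square and `coords` is an $N \times 2$ array of DoF coordinates; the function names are illustrative.

```python
# Minimal sketch of the two aggregation rules.
import numpy as np

def aggregate_finest(coords, L):
    # (6.1): DoF j with coordinates (x_j, y_j) joins the aggregate of the
    # subregion Omega_i^L containing it; i = t1 + 2^L * t2.
    n = 2 ** L
    t = np.minimum((coords * n).astype(int), n - 1)  # cell index per axis
    return t[:, 0] + n * t[:, 1]

def aggregate_coarse(k):
    # (6.2): on level k < L, the four children of Omega_i^k form aggregate i.
    n_fine = 2 ** (k + 1)
    j = np.arange(n_fine * n_fine)                   # DoFs = level-(k+1) cells
    t1, t2 = j % n_fine, j // n_fine
    return (t1 // 2) + (2 ** k) * (t2 // 2)
```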

Figure 6.1. Aggregation on level L.

Figure 6.2. Aggregation on the coarse levels.

6.2 Parallel Prolongation and Restriction and Coarse-level Matrices

As noted in Section 3.3.2, in the UA-AMG method the prolongation and restriction matrices are piecewise constant and characterize the aggregates. Therefore, we do not form them explicitly in the UA-AMG method.

• Prolongation: Let $v^{k-1} \in \mathbb{R}^{n_{k-1}}$, so that the action $v^k = P_{k-1} v^{k-1}$ can be written component-wise as follows:

$$(v^k)_j = (P_{k-1} v^{k-1})_j = (v^{k-1})_i, \qquad j \in G_i^{k-1}.$$

Assign each thread to one element of $v^k$, and use the array aggregation to obtain the information $j \in G_i^{k-1}$, i.e., $i = \mathrm{aggregation}[j]$, so that the prolongation can be efficiently implemented in parallel.

• Restriction: Let $v^k \in \mathbb{R}^{n_k}$, so that the action $(P_{k-1})^T v^k$ can be written component-wise as follows:
$$(v^{k-1})_i = ((P_{k-1})^T v^k)_i = \sum_{j \in G_i^{k-1}} (v^k)_j.$$

Moreover, when $0 < k < L$, assume that $i = t_1 + 2^{k-1} t_2$. The above formula can be expressed explicitly as

$$(v^{k-1})_i = (v^k)_{2t_1 + 2^k(2t_2)} + (v^k)_{2t_1 + 1 + 2^k(2t_2)} + (v^k)_{(2t_1+1) + 2^k(2t_2+1)} + (v^k)_{2t_1 + 2^k(2t_2+1)}.$$

Therefore, each thread is assigned to an element of $v^{k-1}$ and uses the array aggregation to obtain the information $j \in G_i^{k-1}$, i.e., it finds all $j$ such that $\mathrm{aggregation}[j] = i$ on level $L$, or uses the explicit expression on the coarse levels, so that the action of the restriction can also be implemented in parallel. A sketch of both matrix-free actions follows.
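The following is a minimal sketch of the two matrix-free actions, assuming `aggregation[j] = i` maps each fine DoF $j$ to its aggregate $i$; in a GPU kernel each thread would own one output entry, while the NumPy version below expresses the same data access pattern.

```python
# Minimal sketch of matrix-free prolongation and restriction.
import numpy as np

def prolongate(v_coarse, aggregation):
    # (P_{k-1} v^{k-1})_j = (v^{k-1})_{aggregation[j]}
    return v_coarse[aggregation]

def restrict(v_fine, aggregation, n_coarse):
    # ((P_{k-1})^T v^k)_i = sum of the fine entries over aggregate G_i^{k-1}
    v_coarse = np.zeros(n_coarse, dtype=v_fine.dtype)
    np.add.at(v_coarse, aggregation, v_fine)
    return v_coarse
```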

Usually, the coarse-level matrix is constructed by the triple-matrix multiplication (3.40). However, the triple-matrix multiplication is a major bottleneck of parallel AMG methods on GPU [70], not only in terms of parallel efficiency but also with regard to memory usage. In general, the multiplication of sparse matrices comprises two steps: the first step determines the sparsity pattern by performing the multiplication symbolically, and the second step actually computes the nonzero entries. In our AMG method, because of the presence of the auxiliary structured grid, the sparsity patterns are known and fixed, so there is no need to perform the symbolic multiplication of the first step. For the second step, according to (3.41), the construction of coarse-level matrices in the UA-AMG method only involves simple summations over aggregates, which can be performed simultaneously for each nonzero entry. From the unstructured grid to the auxiliary grid, i.e., on level $L$, the sparsity pattern (more precisely the stencil) is determined by the ratio between the longest edge of the unstructured grid and the size of the subregion $\Omega_i^L$. In order to keep the stencil small (9-point in 2D and 27-point in 3D), we choose $L$ such that the size of the subregion $\Omega_i^L$ is larger than the longest edge of the unstructured grid. This also ensures that the aggregates on the finest level are not too small. On the other hand, we also pick the largest feasible $L$ to make sure the aggregates on the finest level are not too big. Such a choice of $L$ is possible and practical for quasi-uniform grids, and also for shape-regular grids where the sizes of the biggest and smallest elements do not differ too much. However, for unstructured meshes with significant local refinement, such a choice of $L$ might not be feasible. If there are only a few places where the stencil is bigger than 9-point (or 27-point), we can use a hybrid sparse matrix storage format (the 9-point/27-point stencil is stored in ELL format and the extra entries in CSR format). Otherwise, we can use a larger stencil,

which might be less efficient compared with the 9-point/27-point stencil, but still ensures a fixed sparsity pattern. Once the stencil on the finest auxiliary grid is fixed, the stencils on the coarser auxiliary grids are no larger than the stencil on the finest level, and we can keep the 9-point/27-point stencil for simplicity. Because in most cases we will have a 9-point or 27-point stencil, here we focus on how to construct the ELL format to store sparse matrices with a 9-point stencil; the construction can easily be generalized to bigger stencils. In order to do this, we need to generate the index array Ajk and the nonzero array Axk for a coarse-level matrix Ak.

• Form Ajk: Due to the special structure of the region quadtree, all the coarse-level matrices have a 9-point stencil structure. Based on this fact, Ajk is pre-determined and can be generated in parallel without any symbolic matrix multiplication. For a DoF $i = t_1 + 2^k t_2$ on level $k$ ($k \le L$), its 8 neighbors are

$$i_1 = (t_1+1) + 2^k t_2, \qquad i_2 = (t_1+1) + 2^k(t_2+1),$$
$$i_3 = t_1 + 2^k(t_2+1), \qquad i_4 = (t_1-1) + 2^k(t_2+1),$$
$$i_5 = (t_1-1) + 2^k t_2, \qquad i_6 = (t_1-1) + 2^k(t_2-1),$$
$$i_7 = t_1 + 2^k(t_2-1), \qquad i_8 = (t_1+1) + 2^k(t_2-1).$$

The $i$-th row of the index array Ajk is, in Matlab notation, $\mathrm{Ajk}(i,:) = [i, i_1, i_2, \cdots, i_8]$, with some simple modifications for the DoFs on the boundaries. It is easy to see that each thread can handle one entry of the array Ajk, so Ajk can be completed in parallel (a kernel sketch follows after this list).

• Form Axk: The summation (3.41) in the ELL format reads, in Matlab notation,

$$\mathrm{Axk}(r, t) = \sum_{i \in G_r^k}\ \sum_{j \in G_q^k,\ q = \mathrm{Ajk}(r,t)} \mathrm{Axk\!+\!1}(i, j), \qquad r = 1, 2, \cdots, n_k,\ t = 0, 1, \cdots, 8. \qquad (6.3)$$

Similar to the action of restriction, the search for $i \in G_r^k$ and $j \in G_q^k$ with $q = \mathrm{Ajk}(r, t)$ can be done in parallel with the help of the array aggregation. Together with the parallel summation, the array Axk can be formed in parallel. Moreover, on the coarse levels $k < L$, the summation can be written out explicitly, just as for the action of restriction; one possible parallel realization is sketched below.

Remark 6.2.1. For highly locally refined grids, an auxiliary grid generated by an adaptive quadtree is another possible approach and might be more effective. Using an adaptive quadtree can capture the local density of the grid and reduce the imbalance between the aggregates on the finest level. However, the sparsity pattern of the coarse-level matrices becomes more complicated, and the development of an efficient parallel algorithm to construct the coarse-level matrices is still ongoing. In general, the parallel auxiliary grid AMG method based on the adaptive quadtree is beyond the scope of this work.

6.3 Parallel Smoothers Based on the Auxiliary Grid

An efficient parallel smoother is crucial for the parallel AMG method. For the sequential AMG method, Gauss-Seidel relaxation is widely used and has been shown to have a good smoothing property. However, standard Gauss-Seidel is a sequential procedure that does not allow efficient parallel implementation. To improve the arithmetic intensity of the AMG method and to render it more efficient for parallel computing, we introduce colored Gauss-Seidel relaxation. For example, on a structured grid with a 5-point stencil (2D) or a 7-point stencil (3D), the red-black Gauss-Seidel smoother is widely used for parallel computing because it can be efficiently parallelized (it reduces to the Jacobi method within each color) and still maintains a good smoothing property. In fact, the red-black Gauss-Seidel smoother works even better than the standard Gauss-Seidel smoother for certain model problems. For an unstructured grid, however, coloring the grid is challenging, and the number of colors can be dramatically large for certain types of grids. Therefore, how to apply a colored Gauss-Seidel smoother is often unclear, and sometimes it may even be impossible to do so. Thanks to the auxiliary structured grid, we have only one unstructured grid, on the finest level, and all the other coarse levels are structured. Therefore, the coloring for the coarse levels becomes trivial. Because we have a 9-point stencil in 2D in most cases, 4 colors are sufficient for the coarse-level structured grids, and a 4-color point-wise Gauss-Seidel smoother can be used as the parallel smoother on the coarse levels. For the unstructured grid on the fine level, we apply a modified coloring strategy: instead of coloring each DoF, we color each aggregate. Because the aggregates sit in the auxiliary structured grid formed by all the subregions $\Omega_i^L$ on level L, 4 colors are sufficient for the coloring. Therefore, on the finest level L, a 4-color block-wise Gauss-Seidel smoother can be applied with the aggregates serving as nonoverlapping blocks; see Figure 6.3.
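The following is a minimal sketch of one color sweep on a structured coarse level (the kernel name and argument list are hypothetical; powk is the coarse-grid width, the matrix is in ELL format with K = 9 slots stored column-major, and the diagonal is kept in slot 0 as described in Section 6.4.1). The host launches the kernel once for each of the four colors (cx, cy) in {0,1} x {0,1}; within one color, all updated DoFs are mutually independent because the 9-point stencil never couples two DoFs of the same color.

/* One sweep over the color (cx, cy): update all DoFs i = t1 + powk*t2
   with (t1 mod 2, t2 mod 2) == (cx, cy). */
__global__ void colored_gs_sweep(double *u, const double *f,
                                 const double *Ax, const int *Aj,
                                 int n, int powk, int K, int cx, int cy)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((int)(i % powk) % 2 != cx || (int)(i / powk) % 2 != cy) return;
    double r = f[i];
    for (int t = 1; t < K; t++) {             /* off-diagonal entries */
        const int j = Aj[t * n + i];
        if (j >= 0) r -= Ax[t * n + i] * u[j];
    }
    u[i] = r / Ax[i];                         /* slot 0 holds the diagonal */
}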

Figure 6.3. Coloring on the finest level L.

6.4 GPU Implementation

Many research groups have shown that GPUs can accelerate multigrid methods, making them perform faster than on CPUs (see [58, 59, 60, 61, 62, 142]). Well-implemented GPU kernels can be more than 10 times faster than their CPU counterparts. In this section, we discuss the parallelization of our auxiliary grid AMG method on GPU, as follows:

1. Sparse matrix-vector multiplication: as in many other iterative methods, sparse matrix-vector multiplication (SpMv) plays an important role in our AMG method. Moreover, because we apply the nonlinear AMLI-cycle and use our AMG method as a preconditioner for a Krylov iterative method, we need an efficient parallel SpMv implementation on GPU. This means we should use a sparse matrix storage format that has a regular memory access pattern for SpMv.

2. Parallel aggregation algorithm: in the coarsening procedure, aggregation is one of the most time-consuming steps. In the previous section, we introduced our auxiliary-grid-based aggregation method and showed its potential for efficient parallelization. The parallelization of the auxiliary-grid aggregation algorithm is discussed in this section.

3. Prolongation and restriction: an advantage of the UA-AMG method is that there is no need to explicitly form the prolongation P or the restriction R, which usually is the transpose of P. This is because both P and R are just Boolean matrices that characterize the aggregates. However, we do need their actions in the AMG method to transfer the solution and the residual between levels, and these actions must also be performed in parallel.

4. Coarse-level matrix: the coarse-level matrix is usually computed by a triple-matrix multiplication (3.40). This computation is considered to be the most difficult part of the AMG method to parallelize on GPU, as the speedup can be as low as 1.2 compared with the CPU implementation (see [70]). However, thanks to the auxiliary structured grid, we have a fixed sparsity pattern on each coarse level, which makes the triple-matrix multiplication much easier to compute.

5. Parallel smoother: the smoother is the central component of the AMG method; it is usually a linear iteration. For the parallel AMG method, however, it is difficult to design an efficient parallel smoother. Generally speaking, smoothers that have a good smoothing property, such as the Gauss-Seidel method, are difficult to parallelize, whereas smoothers that are easy to parallelize, like the Jacobi method, usually cannot achieve a good smoothing property. In our parallel AMG method, the auxiliary structured grid makes it possible to use a colored Gauss-Seidel method, which maintains a good smoothing property and is easy to parallelize.

By combining these components with the nonlinear AMLI-cycle, we have established our parallel auxiliary grid AMG method, which is presented at the end of this section.

6.4.1 Sparse Matrix-Vector Multiplication on GPUs

An efficient SpMv algorithm on GPU requires a suitable sparse matrix storage format. How different storage formats perform in SpMv has been extensively studied in [141]. This study shows that the need for coalesced memory access makes matrix formats such as the compressed row storage (CSR) format and the compressed column storage (CSC) format, which are widely used for iterative linear solvers on CPU, inefficient on GPU. According to [141], when each row has roughly the same number of nonzeros, the ELLPACK (ELL) storage format is one of the most efficient sparse matrix storage formats on GPU. Let us recall the ELL format: an M × N sparse matrix with at most K nonzeros per row is stored in two M × K arrays: (1) the array Ax stores the nonzero entries, and (2) the array Aj stores

the column indices of the corresponding entries. Both arrays are stored in column-major order to improve the efficiency when accessing the memory on GPU [136], see Figure 6.4.

 4 −2 0 0 0 1 −1  4 −2 0  A = −1 2 −1 0 , Aj = 0 1 2  Ax = −1 2 −1 0 0 −3 4 2 3 −1 −3 4 0

Aj: [0 0 2 1 1 3 -1 2 -1] Column-major ordering of Aj and Ax: Ax: [4 -1 -3 -2 2 4 0 -1 0]

Memory access of SpMV: one thread handles one row

Ax =  4 −1 −3 −2 2 4 0 −1 0  Step 1 =  0 1 2  Step 2 =  0 1 2  Step 3 =  2 

Figure 6.4. Sparse matrix representation using the ELL format and the memory access pattern of SpMv.

It is not always possible to use the ELL format for an AMG method, because the sparsity patterns of the coarse-level matrices can become denser than those of the fine-level matrices, and the number of nonzeros per row can vary dramatically. However, the ELL format is a good choice for our AMG method because of the auxiliary structured grid and the region quadtree: all the coarse-level matrices generated from the auxiliary grid have fixed sparsity patterns. Based on these facts, we choose the ELL format to maximize the efficiency of SpMv. Moreover, we store the diagonal entries in the first column of Aj and Ax to avoid a search during the smoothing process and, therefore, further improve the overall efficiency of our parallel AMG method.
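As an illustration, a minimal ELL SpMv kernel of this kind might look as follows (the kernel name and the use of -1 as the padding index are assumptions of this sketch):

/* y = A*x for an M-row ELL matrix with K stored entries per row;
   Aj and Ax are in column-major order so the reads below coalesce. */
__global__ void ell_spmv(int M, int K, const int *Aj, const double *Ax,
                         const double *x, double *y)
{
    const unsigned int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M) {
        double sum = 0.0;
        for (int t = 0; t < K; t++) {
            const int col = Aj[t * M + row];
            if (col >= 0)                 /* skip padding entries */
                sum += Ax[t * M + row] * x[col];
        }
        y[row] = sum;
    }
}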

6.4.2 Parallel Auxiliary Grid Aggregation

As discussed in Section 6.1, we use an auxiliary structured grid and its hierarchical structure to form the aggregates, where a region quadtree is used to handle the hierarchical structure. Following Section 6.1, we discuss here how to efficiently parallelize this aggregation method in 2D; the 3D generalization of this aggregation method is straightforward. Again, we distinguish two cases: level L and the coarse levels 0 < k < L.

• Level L: The aggregate $G_i^L$ is defined by (6.1), which means that we need to check the coordinates of each DoF and determine the aggregate it belongs to. Obviously, checking the coordinates of different DoFs are completely independent operations; therefore, this can be performed efficiently in parallel. Algorithmically, we assign each thread to a DoF $j$ and determine the subregion $\Omega_i^L$ such that the coordinates $(x_j, y_j) \in \Omega_i^L$; then $j \in G_i^L$. Assume that the $\Omega_i^L$ are labeled lexicographically; then, given $(x_j, y_j)$, the index $i$ is determined by

$$i = \left\lfloor \frac{y_j - a_2}{b_2 - a_2} \times 2^L \right\rfloor \times 2^L + \left\lfloor \frac{x_j - a_1}{b_1 - a_1} \times 2^L \right\rfloor,$$

where $a_1$, $a_2$, $b_1$, and $b_2$ are given by (4.1). The output of this parallel subroutine is an array aggregation that contains the information about the aggregates. We would like to point out that aggregation plays an important role in our AMG method, since the prolongation and restriction are both based on it. The following is the pseudo-code implemented using CUDA.

Listing 6.1. Parallel aggregation on level L (2D case)

__global__ void aggregation(int *aggregation, double *x, double *y,
                            double xmax, double ymax,
                            double xmin, double ymin, int L)
{
    /* get thread ID */
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int powL = (int)pow(2.0, L);
    /* check x coordinate */
    const unsigned int xIdx = (int)((x[idx] - xmin) / (xmax - xmin) * powL);
    /* check y coordinate */
    const unsigned int yIdx = (int)((y[idx] - ymin) / (ymax - ymin) * powL);
    /* label the aggregate */
    aggregation[idx] = yIdx * powL + xIdx;
}

• Coarse level 0 < k < L: The aggregate $G_i^k$ is defined by (6.2). Therefore, we need to determine the root of each subregion $\Omega_j^{k+1}$, a task that can be performed in a completely parallel way. Each thread is assigned to a DoF on the coarse level $k+1$, or, equivalently, to a subregion $\Omega_j^{k+1}$, and then the index $i$ is computed on each thread in parallel by

$$i = \left\lfloor \frac{\lfloor j/2^k \rfloor}{2} \right\rfloor \times \frac{2^k}{2} + \left\lfloor \frac{j \bmod 2^k}{2} \right\rfloor.$$

The following is the pseudo-code implemented using CUDA.

Listing 6.2. Parallel aggregation on coarse level k (2D case)

__global__ void aggregation(int *aggregation, int k)
{
    /* get thread ID */
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int powk = (int)pow(2.0, k);
    /* check x index */
    const unsigned int xIdx = (idx % powk) / 2;
    /* check y index */
    const unsigned int yIdx = (idx / powk) / 2;
    /* label the aggregate */
    aggregation[idx] = yIdx * (powk / 2) + xIdx;
}

Remark: For problems with significant local refinement, it is difficult to keep the 9-point stencil, so we need to approximate and truncate the matrix when coarsening on level L. This affects the efficiency of the algorithm; however, some speedup is still gained. Another way to control the load balance is to modify the way the domain is divided when generating the quadtree: instead of dividing the domain equally into four, we can try to divide the DoFs equally to gain better performance.

Chapter 7

Numerical Applications for Poisson-like Problem on Unstructured Grid

7.1 Auxiliary Space Multigrid Method

The numerical tests in this section are all performed on a SunFire with a 2.8 GHz Opteron processor and sufficient main memory. Although the tests are done using only a single processor with access to all the memory, there might be some undesirable scaling effects when using a large portion of the available memory due to the speed of memory access. Therefore, the timings for larger problems are slightly worse than in theory (memory used and flops counted). In order to assess our code we need a reference point, and this reference is a straightforward geometric multigrid method on a structured grid, where we do not exploit the structure except for it being a geometric multigrid hierarchy. This means we set up the stiffness matrices as well as the prolongation and restriction matrices as one would do on a general grid hierarchy.

7.1.1 Geometric Multigrid

As a test problem we consider the Poisson equation on the unit square with homogeneous Dirichlet boundary conditions:

$$-\Delta u = f \ \text{ in } \Omega := [0, 1]^2, \qquad u = 0 \ \text{ on } \Gamma := \partial\Omega. \qquad (7.1)$$

For this domain, it is straightforward to construct a nested hierarchy of regular grids $T_1, \dots, T_J$ and corresponding $P_1$ finite element spaces $V_1 \subset \cdots \subset V_J$ with $n_\ell := \dim(V_\ell) = (2^\ell - 1)^2$. Let $(\varphi_i^\ell)_{i=1}^{n_\ell}$ denote a Lagrange basis in $V_\ell$. The geometric multigrid algorithm is presented in Algorithm 16. On each level, a smoothing iteration is required, which we take to be symmetric Gauß-Seidel. The timings for the setup of the stiffness matrices $A_\ell$ and the prolongation matrices $P_\ell$ on level $\ell$, as well as for 10 V-cycles ($\nu := 2$ smoothing steps) of geometric multigrid, are given in Table 7.1.

Table 7.1. The time in seconds for the setup of the matrices and for ten steps of V-cycle (geometric) multigrid, Algorithm 16.

#dof                 Setup of A, P    10 V-cycles
n9  = 1,050,625      4.3              5.1
n10 = 4,198,401      26.7             39.7
n11 = 16,785,409     108.8            152.3

From the timings, we observe that the geometric multigrid method (with textbook convergence rates) requires roughly 1 second per step per million degrees of freedom, i.e., roughly 5-10 seconds per million degrees of freedom to solve the problem. For smaller problems, caching effects seem to speed up the calculations. These results for the geometric multigrid method on a uniform grid are now compared with the ASMG method for the Baltic Sea mesh with strong local refinement and several inclusions.

7.1.2 ASMG for the Dirichlet problem

Our solver for the unstructured grid from the Baltic Sea geometry, cf. Figure 4.10, is a preconditioned conjugate gradient method (CG) where the preconditioner is ASMG. We iterate until the discretization error on the finest level of the hierarchy is met. In particular, a nested iteration from the coarsest to the finest level is used to obtain good initial values. We consider the Baltic Sea model problem with homogeneous Dirichlet boundary conditions. The storage complexity and timings are shown in Table 7.2. The auxiliary grid hierarchy and matrices require roughly 3 times more storage than the given (unstructured) grid, and the ASMG-CG solve takes approximately 11 seconds per million degrees of freedom, which is at most two times slower than the geometric multigrid method.

Table 7.2. The storage complexity in bytes per degree of freedom (auxiliary grids, auxiliary matrices and H-solvers) and the solve time in seconds for an ASMG preconditioned cg-iteration.

#dof                aux. storage    storage A    aux. setup    ASMG-cg solve (steps)
n4 = 737,933        509             85           45.2          12.4 (5)
n5 = 2,970,149      351             87           124           40.2 (5)
n6 = 11,917,397     281             87           414           125.9 (5)
n7 = 47,743,157     247             88           1360          544.9 (5)

The convergence rates for the ASMG iteration are given in Figure 7.1, where we plot the residual reduction factors for the first 50 steps. We observe that the rates on all levels are uniformly bounded away from 1, roughly of size 0.4.

1 "level 4" "level 5" "level 6" "level 7"

0.8

0.6

0.4

0.2

0 0 5 10 15 20 25 30 35 40 45 50

Figure 7.1. Convergence rates for Auxiliary Space MultiGrid with n4 = 737,933, n5 = 2,970,149, n6 = 11,917,397, and n7 = 47,743,157 degrees of freedom.

7.1.3 ASMG for the Neumann problem

In the final test, we consider the Baltic Sea model problem with homogeneous Neumann boundary conditions. An extra near-boundary correction has been applied, cf. Algorithm 20. The storage complexity as well as the timing for the ASMG-CG solve are given in Table 7.3 for the model problem with natural Neumann boundary conditions. We observe that, in order to solve the model problem up to the size of the discretization

Table 7.3. The storage complexity in bytes per degree of freedom (auxiliary grids, auxiliary matrices and H-solvers) and the solve time in seconds for an ASMG preconditioned cg-iteration.

#dof                aux. storage    aux. setup    storage A    ASMG-cg solve (steps)
n4 = 756,317        355             34.6          88           11.5 (6)
n5 = 3,006,917      244             80            88           33.3 (6)
n6 = 11,990,933     195             266           88           143.6 (6)
n7 = 47,890,229     172             941           88           520.7 (6)

error, we need to spend twice as much storage for the auxiliary hierarchy as for the given (unstructured) fine grid, and the solve time is roughly 11 seconds per million degrees of freedom.

1 "level 4" "level 5" "level 6" "level 7"

0.8

0.6

0.4

0.2

0 0 5 10 15 20 25 30 35 40 45 50

Figure 7.2. Convergence rates for Auxiliary Space MultiGrid with n4 = 756,317, n5 = 3,006,917, n6 = 11,990,933, and n7 = 47,890,229 degrees of freedom.

In Figure 7.2 we plot the residual reduction factors for 50 steps of ASMG (without CG acceleration) for 4 consecutive levels. We observe that the rate is level-independent and roughly 0.5, i.e., bounded away from 1, but not as good as the corresponding rate of the geometric multigrid method. We conclude that the theoretically proven convergence rates are indeed small enough to be competitive with geometric multigrid, in the sense that the storage requirements increase by a factor of at most 3 and the solve times by a factor of at most 2.

7.2 FASP-AMG

In this section, we perform several numerical experiments to demonstrate the efficiency of our proposed AMG method for GPU implementation. We test the parallel algorithm on partial differential equations discretized on different grids in 2D or 3D, in order to show the potential of the proposed AMG method for practical applications.

7.2.1 Test Platform

Our test and comparison platform is an NVIDIA Tesla C2070 together with a Dell computing workstation. Details regarding the machine are given in Table 7.4.

Table 7.4. Test Platform

CPU Type            Intel
CPU Clock           2.4 GHz
Host Memory         16 GB
GPU Type            NVIDIA Tesla C2070
GPU Clock           575 MHz
Device Memory       6 GB
CUDA Capability     2.0
Operating System    RedHat
CUDA Driver         CUDA 4.1
Host Compiler       gcc 4.1
Device Compiler     nvcc 4.1
CUSP                v0.3.0

The algorithm is implemented as a part of the FASP Solver Package, which has been scheduled for release in 2013. The whole implementation of our algorithm runs purely on the GPU, so we avoid the cost of transferring data between GPU and CPU. The algorithm is also available in the NVAMG project, an internal project at NVIDIA to develop parallel iterative solvers that run on GPUs, with a public release expected sometime in 2013. In the numerical tests, we use the standard time measurement function to measure the running time of the algorithms. Because our aim is to demonstrate the improvement of our new algorithm on GPUs, we concentrate on comparing our proposed method with the parallel smoothed aggregation AMG method implemented in the CUSP package [140]. CUSP is an open-source C++ library of generic parallel algorithms for sparse linear algebra and graph computations on CUDA-enabled GPUs. It provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems, and all of CUSP's algorithms and implementations have been optimized for GPU by NVIDIA's research group. To the best of our knowledge, the parallel AMG method implemented in the CUSP package is the state-of-the-art AMG method on GPU.

7.2.2 Performance

Example 7.2.1 (2D Poisson problem). Consider the following Poisson equation in 2D,

$$-\Delta u = f \quad \text{in } \Omega \subset \mathbb{R}^2,$$

with the Dirichlet boundary condition

$$u = 0 \quad \text{on } \partial\Omega.$$

The standard linear finite element method is used to solve Example 7.2.1 for a given triangulation of Ω.

2D uniform grid on Ω = (0, 1) × (0, 1). On a 2D uniform grid, the standard linear finite element method for the Poisson equation gives a 5-point stencil, and the resulting stiffness matrix has a banded data structure. In this case, our proposed aggregation algorithm coincides with standard geometric coarsening, which suggests that our proposed AMG method has clear advantages over other AMG methods. Table 7.5 shows the numbers of iterations and the wall time required to reduce the relative residual to $10^{-6}$ for the different implementations.

Table 7.5. Wall time and number of iterations for the Poisson problem on a 2D uniform grid

             1024×1024                       2048×2048
        Setup   # Iter   Solve   Total   Setup   # Iter   Solve   Total
CUSP    0.63    36       0.35    0.98    2.38    41       1.60    3.98
New     0.03    10       0.13    0.16    0.11    11       0.43    0.54

As shown in Table 7.5, compared with the CUSP package, our proposed AMG method is about 21 to 22 times faster in the setup phase, 3 to 4 times faster in the solve phase, and 6 to 7 times faster in total.

2D quasi-uniform grid on Ω = (0, 1) × (0, 1). The quasi-uniform grid on the unit square

is shown in Figure 7.3. Table 7.6 shows the comparison between the CUSP package and our method. For the problem with 1.4 million unknowns, our new method runs 3.03 times faster than the AMG method in the CUSP package, and for the problem with 5.7 million unknowns, our method runs 3.82 times faster than the CUSP package. The speedup is even more significant in the setup phase: our method is about 7 times faster than the setup phase implemented in the CUSP package, which demonstrates the efficiency of our proposed auxiliary-grid-based aggregation algorithm.

Figure 7.3. Quasi-uniform grid for a 2D unit square

Table 7.6. Wall time and number of iterations for the Poisson problem on a 2D quasi-uniform grid

             # Unknowns = 1,439,745          # Unknowns = 5,763,073
        Setup   # Iter   Solve   Total   Setup   # Iter   Solve   Total
CUSP    1.09    40       0.82    1.91    3.65    50       4.11    7.76
New     0.16    15       0.49    0.65    0.53    13       1.52    2.05

2D shape-regular grid on Ω = (0, 1) × (0, 1). We also test our auxiliary-grid-based AMG method on a shape-regular grid (see Figure 7.4). Table 7.7 shows the number of iterations and the wall time. We can see that, compared with the CUSP package, our proposed auxiliary-grid-based AMG method achieves about 6 times speedup in the setup phase and 2 times speedup in total. The results are not as good as those for the uniform or quasi-uniform grids because the local density of the grid varies, which causes an unbalanced distribution of unknowns at the finest level. However, our method still achieves reasonable speedup in the setup phase, which is considered the most challenging aspect of a parallel AMG method. Moreover, for the large-scale problem with 13 million unknowns, the CUSP package terminates due to the memory limitation of the GPU, whereas our method requires much less memory than the CUSP package and still produces reasonable results in an optimal way.

Figure 7.4. Shape-regular grid for a 2D unit square

Table 7.7. Wall time and number of iterations for the Poisson problem on a 2D shape-regular grid

             # Unknowns = 3,404,545          # Unknowns = 13,624,833
        Setup   # Iter   Solve   Total   Setup   # Iter   Solve   Total
CUSP    2.23    52       2.51    4.74    --      --       --      --
New     0.35    25       2.12    2.47    1.62    27       8.67    10.29

2D quasi-uniform grid on a disk domain. Instead of the unit square domain, we also test the performance of our method on a disk domain, as shown in Figure 7.5. Since the domain does not match the auxiliary grid closely, there are some empty aggregates, and the matrices on the coarse grids might be singular. In our code, when constructing such singular coarse-grid matrices, we do nothing special, and the matrices will have rows that are all zeros. However, at each smoothing step, each component of the solution that corresponds to one of those empty aggregates is set to zero. This makes sure the solution is always orthogonal to the null spaces of the singular matrices. Therefore, the efficiency of the smoother and of the overall AMG method is not affected. Such special treatment of the empty aggregates may reduce the parallelism, because the threads that handle those empty aggregates have less computational load than the other threads. However, we can expect that there are only a few empty aggregates, so the overall AMG method can still achieve high performance. In fact, our algorithm is still robust, as shown in Table 7.8, and achieves 7 times speedup in the setup phase and 2 times speedup in total compared with the AMG method in CUSP.

Figure 7.5. Shape-regular grid for a 2D circle domain

Table 7.8. Wall time and number of iterations for the Poisson problem on a disk domain

             # Unknowns = 2,881,025          # Unknowns = 11,529,733
        Setup   # Iter   Solve   Total   Setup   # Iter   Solve   Total
CUSP    2.19    50       1.68    3.87    --      --       --      --
New     0.30    21       1.65    1.85    1.37    23       7.21    8.58

It is necessary to mention that one iteration of the new method is twice as expensive as one iteration of the AMG method implemented in CUSP. The reason is that the nonlinear AMLI-cycle (choosing n = 2 in Algorithm 18) is used in the solving phase, whereas CUSP uses the standard V-cycle in the solving phase. In general, the nonlinear AMLI-cycle is more expensive than the V-cycle in terms of the computational cost per cycle, especially on parallel machines. The main reason is the lack of parallelism on the coarse levels, where the size of

the problems is relatively small. The nonlinear AMLI-cycle visits the coarse levels more often than the V-cycle does; therefore, its parallel efficiency is not as good as the V-cycle's. Moreover, for the shape-regular grids and the disk domain, due to the less balanced computational loads between threads, its parallel efficiency becomes even worse. However, because we use the UA-AMG method to reduce the cost and increase the parallelism of the setup phase, which is considered to be the major bottleneck of the parallelization of AMG methods, we need to use the nonlinear AMLI-cycle (or other more expensive cycling strategies) to improve the performance of the UA-AMG method and to achieve uniform convergence. If we only use the V-cycle, the resulting AMG method is not optimal and, consequently, it is not scalable and not suitable for parallel computing. Furthermore, as shown in Tables 7.7 and 7.8, in addition to the speedup in the setup phase, the nonlinear AMLI-cycle significantly reduces the number of iterations compared with the V-cycle, and the overall UA-AMG method still achieves at least 2 times speedup.

Example 7.2.2 (3D heat-transfer problem). Consider

$$\frac{\partial T}{\partial t} - \nabla\cdot(D\nabla T) = 0 \ \text{ in } \Omega \subset \mathbb{R}^3, \qquad T = 356 \ \text{ on } \Gamma_{\mathrm{inlet}}, \qquad \frac{\partial T}{\partial n} = 0 \ \text{ on } \Gamma_D,$$

with a finite volume discretization, where $T_0 = 256$, $D = 1$, and the boundary $\Gamma = \Gamma_{\mathrm{inlet}} \cup \Gamma_D$.

We consider two 3D computational domains, one a unit cube and the other a cavity domain (Figure 7.6). Shape-regular grids are used on both domains. The numerical results for one time step are shown in Tables 7.9 and 7.10 for the cubic and cavity domains, respectively. Compared with the other aggregation-based AMG method on GPU, we see about 6 times speedup in the setup phase, 2 times speedup in the solve phase, and about 2 to 5 times speedup in total. The numerical results demonstrate the efficiency and robustness of our proposed AMG algorithm for isotropic problems in 3D. These results affirm the potential of the new method for solving practical problems.


Figure 7.6. Shape-regular grid for the 3D heat-transfer problem on a cubic domain (left) and a cavity domain (right).

Table 7.9. Time/Number of iterations for the heat-transfer problem on a 3D unit cube

             # Unknowns = 909,749        # Unknowns = 1,766,015
        Setup   Solve   Total        Setup   Solve   Total
CUSP    1.75    0.60    2.35         1.94    0.52    2.46
New     0.26    0.37    0.63         0.39    0.64    1.03

Table 7.10. Wall time and the number of iterations for the heat-transfer problem on a cavity domain

             # Unknowns = 410,482        # Unknowns = 2,094,240
        Setup   Solve   Total        Setup   Solve   Total
CUSP    0.67    0.12    0.79         1.14    0.38    1.52
New     0.11    0.05    0.16         0.22    0.32    0.54

Chapter 8

FASP for Indefinite Problem

We consider the general model problem on a Hilbert space X: find x ∈ X such that

$$L(x, y) = \langle f, y\rangle, \qquad \forall y \in X, \qquad (8.1)$$

where $L(\cdot,\cdot)$ is symmetric and indefinite. Assume $H$ is an SPD operator which defines the inner product $(\cdot,\cdot)_X$ on $X$ and induces a norm on $X$, denoted by $\|\cdot\|_X$. We assume that the problem satisfies the stability conditions required for a unique solution:

$$|L(x, y)| \le \alpha\|x\|_X\|y\|_X, \qquad \forall x, y \in X, \qquad (8.2)$$

and

$$\inf_{0 \ne x \in X}\ \sup_{0 \ne y \in X} \frac{L(x, y)}{\|x\|_X\|y\|_X} \ge \beta > 0. \qquad (8.3)$$

Introduce an operator $A : X \to X$ by

(Ax, y) = L(x, y), ∀x, y ∈ X.

Then the operator form of (8.1) is Ax = f. (8.4)

For large-scale problems, a competitive alternative to computing the solution of (8.4) directly is to employ an iterative method with a preconditioning technique. More precisely, we want to find a preconditioning operator $B : X \to X$ such that the modified system

BAx = Bf. (8.5)

can be solved with an iterative method in a number of iterations independent of mesh size.

8.1 Krylov Space Method for Indefinite Problems

8.1.1 The Minimal Residual Method

The minimal residual method (MINRES), first proposed by Paige and Saunders [7], is an algorithm for solving indefinite symmetric linear systems. Both CG and MINRES can be viewed as special variants of the Lanczos method: CG applies to symmetric positive definite systems, while MINRES can be applied to symmetric indefinite systems. The Lanczos process with starting vector $f$ may be used to generate the $N \times m$ matrix $V_m = (v^1, v^2, \cdots, v^m)$, according to $v^0 = 0$ and $\beta_1 v^1 = f$, and the tridiagonal Hessenberg matrix $\bar T_m$ such that $AV_m = V_{m+1}\bar T_m$ and $V_m^T A V_m = T_m$, where $T_m$ is $m \times m$ and tridiagonal. Denote

$$\bar T_m = \begin{pmatrix} \alpha_1 & \beta_2 & & \\ \beta_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_m \\ & & \beta_m & \alpha_m \\ & & & \beta_{m+1} \end{pmatrix} = \begin{pmatrix} T_m \\ \beta_{m+1} e_m^T \end{pmatrix}, \qquad (8.6)$$

where

$$p^j = Av^j, \qquad \alpha_j = (v^j)^T p^j, \qquad \beta_{j+1} v^{j+1} = p^j - \alpha_j v^j - \beta_j v^{j-1}.$$

Approximate solutions within $K_m$ may be formed as $u^m = V_m y^m$ for some $m$-vector $y^m$. CG and MINRES may be derived by choosing $y^m$ appropriately at each iteration. Based on (2.32), CG minimizes the quadratic form within each Krylov subspace; (2.32) can be rewritten as

$$u^m = V_m y^m, \quad \text{where } y^m = \arg\min_y \phi(V_m y).$$

For an indefinite matrix $A$, $(\cdot,\cdot)_A$ no longer defines an inner product. However, we can still minimize $\|r^m\|$ within $K_m$, so the characterization of MINRES is

$$u^m = V_m y^m, \quad \text{where } y^m = \arg\min_y \|f - AV_m y\|.$$

Since the residual is

$$r^m = f - AV_m y^m = \beta_1 v^1 - V_{m+1}\bar T_m y^m = V_{m+1}(\beta_1 e_1 - \bar T_m y^m),$$

the problem we need to solve is equivalent to

$$y^m = \arg\min_y \|\beta_1 e_1 - \bar T_m y\|. \qquad (8.7)$$

This subproblem is processed by an expanding QR factorization: $Q_0 = I$, and

$$Q_{m,m+1} = \begin{pmatrix} I_{m-1} & & \\ & c_m & s_m \\ & s_m & -c_m \end{pmatrix}, \qquad Q_m = Q_{m,m+1}\begin{pmatrix} Q_{m-1} & \\ & 1 \end{pmatrix}, \qquad Q_m(\bar T_m \ \ \beta_1 e_1) = \begin{pmatrix} R_m & t_m \\ 0 & q_m \end{pmatrix},$$

where $c_m$ and $s_m$ form the Householder reflector $Q_{m,m+1}$ that annihilates $\beta_{m+1}$ in $\bar T_m$ to give the upper-tridiagonal $R_m$, with $R_m$ and $t_m$ unaltered in later iterations. The solution of (8.7) thus satisfies $R_m y^m = t_m$. Instead of solving for $y^m$, MINRES solves $R_m^T D_m^T = V_m^T$ by forward substitution, obtaining the last column $d_m$ of $D_m$ at iteration $m$. At the same time, it updates $u^m$ by

$$u^m = V_m y^m = D_m R_m y^m = D_m t_m = u^{m-1} + \tau_m d_m, \qquad \tau_m = e_m^T t_m.$$

The details of MINRES are given in Algorithm 23. The convergence analysis can be done based on (2.30) with the $\ell^2$-norm $\|\cdot\|_2$. Note the similarity with the convergence estimate for CG: symmetry of the matrix ensures orthogonality of the eigenvectors in both cases, with the consequence that convergence depends only on the eigenvalues. It is also apparent that for MINRES it is the Euclidean norm $\|\cdot\|_2$ (we will use $\|\cdot\|$ for the sake of simplicity) of the residual that is reduced in an optimal

Algorithm 23 MINRES Method
  $v^1 = f - Au^0$; $v^0 = 0$
  $\beta_1 = \|v^1\|_2$; $\eta = \beta_1$
  $\gamma_1 = \gamma_0 = 1$; $\sigma_1 = \sigma_0 = 0$; $\omega^0 = \omega^{-1} = 0$
  for $m = 1, 2, \cdots$ until convergence do
    The Lanczos recurrence:
      $v^m = \frac{1}{\beta_m}v^m$; $\alpha_m = (v^m)^T A v^m$
      $v^{m+1} = Av^m - \alpha_m v^m - \beta_m v^{m-1}$
      $\beta_{m+1} = \|v^{m+1}\|_2$
    QR decomposition:
      $\delta = \gamma_m\alpha_m - \gamma_{m-1}\sigma_m\beta_m$; $\rho_1 = \sqrt{\delta^2 + \beta_{m+1}^2}$
      $\rho_2 = \sigma_m\alpha_m + \gamma_{m-1}\gamma_m\beta_m$; $\rho_3 = \sigma_{m-1}\beta_m$
      $\gamma_{m+1} = \delta/\rho_1$; $\sigma_{m+1} = \beta_{m+1}/\rho_1$
    Update the solution:
      $\omega^m = (v^m - \rho_3\omega^{m-2} - \rho_2\omega^{m-1})/\rho_1$
      $u^m = u^{m-1} + \gamma_{m+1}\eta\,\omega^m$
      $\|r^m\|_2 = |\sigma_{m+1}|\,\|r^{m-1}\|_2$
    check convergence; $\eta = -\sigma_{m+1}\eta$

manner rather than the A-norm. Therefore, we have the following estimate

$$\|r^m\| \le \inf_{q \in P_m,\ q(0)=1}\ \sup_{\lambda \in \sigma(A)} |q(\lambda)|\ \|r^0\|, \qquad (8.8)$$

where σ(A) is the spectrum of A. The key difference here, however, is that indefinite matrices have both positive and negative real eigenvalues. Nevertheless, a similar estimate still holds [143].

Theorem 8.1.1 ([143]). If $\sigma(A) \subset [a, b] \cup [c, d]$, where $a < b < 0 < c < d$, under the constraint that $|b - a| = |d - c|$, that is, the two intervals have equal length, then the residuals generated by MINRES satisfy

$$\frac{\|r^m\|}{\|r^0\|} \le 2\left(\frac{\sqrt{|ad|} - \sqrt{|bc|}}{\sqrt{|ad|} + \sqrt{|bc|}}\right)^{m/2}. \qquad (8.9)$$

Proof. The polynomial of degree $m$ with value 1 at 0 that has minimal maximum deviation from 0 on $[a, b] \cup [c, d]$ is given by

$$q_m(\lambda) = \frac{T_\ell(p(\lambda))}{T_\ell(p(0))}, \qquad p(\lambda) = 1 + \frac{2(\lambda - b)(\lambda - c)}{ad - bc}, \qquad (8.10)$$

where $\ell = [\frac{m}{2}]$, $[\cdot]$ denotes the integer part, and $T_\ell$ is the $\ell$th Chebyshev polynomial. Note that the function $p(\lambda)$ maps each of the intervals $[a, b]$ and $[c, d]$ to the interval $[-1, 1]$. It follows that for $\lambda \in [a, b] \cup [c, d]$, the absolute value of the numerator in (8.10) is bounded by 1. The size of the denominator is determined by the property that if $p(0) = \frac12(y + y^{-1})$, then $T_\ell(p(0)) = \frac12(y^\ell + y^{-\ell})$. To determine $y$, we need to solve the equation

$$p(0) = \frac{ad + bc}{ad - bc} = \frac12(y + y^{-1}),$$

or the quadratic equation

$$\frac12 y^2 - \frac{ad + bc}{ad - bc}\,y + \frac12 = 0.$$

This equation has the solutions

$$y = \frac{\sqrt{|ad|} - \sqrt{|bc|}}{\sqrt{|ad|} + \sqrt{|bc|}}, \qquad \text{or} \qquad y = \frac{\sqrt{|ad|} + \sqrt{|bc|}}{\sqrt{|ad|} - \sqrt{|bc|}}.$$

Taking the root $y > 1$, we have $|q_m(\lambda)| \le 2y^{-\ell}$ on $\sigma(A)$, and by (8.8) we obtain the estimate of the MINRES residual

$$\frac{\|r^m\|}{\|r^0\|} \le 2\left(\frac{\sqrt{|ad|} - \sqrt{|bc|}}{\sqrt{|ad|} + \sqrt{|bc|}}\right)^{m/2}. \qquad (8.11)$$
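For instance (a hypothetical spectrum chosen only to illustrate the bound), if $\sigma(A) \subset [-2, -1] \cup [1, 2]$, the two intervals have equal length, $|ad| = 4$ and $|bc| = 1$, so (8.9) gives

$$\frac{\|r^m\|}{\|r^0\|} \le 2\left(\frac{2 - 1}{2 + 1}\right)^{m/2} = 2\left(\frac{1}{3}\right)^{m/2},$$

i.e., the residual decreases by a factor of about $\sqrt{3} \approx 1.73$ per iteration.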

Theorem 8.1.2 (Convergence of MINRES [143]). Assume that $u^m$ is the $m$-th iterate of the MINRES method and $u^*$ is the exact solution. Then there exists a constant $\delta \in (0, 1)$, depending only on $\kappa(A)$, such that

$$\|A(u^* - u^m)\| \le \delta^m\|A(u^* - u^0)\|. \qquad (8.12)$$

Moreover, a crude upper bound for $\inf_{q \in P_m, q(0)=1} \sup_{\lambda \in \sigma(A)} |q(\lambda)|$ leads to the estimate

$$\delta = \frac{\kappa(A) - 1}{\kappa(A) + 1},$$

where $\kappa(A) = \|A^{-1}\|\|A\| = \frac{|\lambda_{\max}(A)|}{|\lambda_{\min}(A)|}$ is the condition number of $A$, and $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ are the maximal and minimal (by moduli) eigenvalues of $A$, respectively.

Proof. The proof is identical to that of Theorem 2.2.1, but with the $L^2$-norm. By (8.8), we have

$$\frac{\|A(u^* - u^m)\|}{\|A(u^* - u^0)\|} \le \inf_{q \in P_m,\ q(0)=1}\ \sup_{\lambda \in \sigma(A)} |q(\lambda)|.$$

Denote $a = |\lambda_{\max}(A)|$ and $b = |\lambda_{\min}(A)|$; then every eigenvalue $\lambda \in [-a, -b] \cup [b, a]$. By Theorem 8.1.1, we have the desired estimate.

For preconditioned MINRES, it is desirable that the preconditioner does not destroy the symmetry of the system, so we can use the same techniques as for PCG to design three preconditioned versions of MINRES. If $B = HH^T$, we can apply MINRES to the corresponding transformed system and obtain the following algorithm.

Algorithm 24 Preconditioned MINRES Method
  $v^1 = f - Au^0$; $z^1 = Bv^1$
  $\beta_1 = \sqrt{(z^1)^T v^1}$; $\eta = \beta_1$
  $\gamma_1 = \gamma_0 = 1$; $\sigma_1 = \sigma_0 = 0$; $\omega^0 = \omega^{-1} = 0$
  for $m = 1, 2, \cdots$ until convergence do
    The Lanczos recurrence:
      $z^m = \frac{1}{\beta_m}z^m$; $\alpha_m = (z^m)^T A z^m$
      $v^{m+1} = Az^m - \alpha_m v^m - \beta_m v^{m-1}$
      $z^{m+1} = Bv^{m+1}$; $\beta_{m+1} = \sqrt{(z^{m+1})^T v^{m+1}}$
    QR decomposition:
      $\delta = \gamma_m\alpha_m - \gamma_{m-1}\sigma_m\beta_m$; $\rho_1 = \sqrt{\delta^2 + \beta_{m+1}^2}$
      $\rho_2 = \sigma_m\alpha_m + \gamma_{m-1}\gamma_m\beta_m$; $\rho_3 = \sigma_{m-1}\beta_m$
      $\gamma_{m+1} = \delta/\rho_1$; $\sigma_{m+1} = \beta_{m+1}/\rho_1$
    Update the solution:
      $\omega^m = (z^m - \rho_3\omega^{m-2} - \rho_2\omega^{m-1})/\rho_1$
      $u^m = u^{m-1} + \gamma_{m+1}\eta\,\omega^m$
    check convergence; $\eta = -\sigma_{m+1}\eta$

Note that a positive-definite preconditioner is needed in order for $\|\cdot\|_H$ to define a norm, since the reduction of the residual in the preconditioned MINRES algorithm is in a norm that depends on the preconditioner. Thus one must be careful not to select a preconditioner that distorts this norm. The preconditioned MINRES convergence estimate becomes

$$\frac{\|A(u^* - u^m)\|_H}{\|A(u^* - u^0)\|_H} \le \min_{q_m}\ \max_{\lambda \in \sigma(HA)} |q_m(\lambda)|.$$

Following the same method as in Theorem 8.1.2, we have

Theorem 8.1.3 ([143]). Assume that $u^m$ is the $m$-th iterate of the preconditioned MINRES method and $u^*$ is the exact solution. Then there exists a constant $\delta \in (0, 1)$, depending only on $\kappa(BA)$, such that

$$\|A(u^* - u^m)\|^2_{B^{-1}} \le \delta^{2m}\|A(u^* - u^0)\|^2_{B^{-1}}. \qquad (8.13)$$

Moreover, a crude upper bound for $\inf_{q \in P_m, q(0)=1} \sup_{\lambda \in \sigma(BA)} |q(\lambda)|$ leads to the estimate

$$\delta = \frac{\kappa(BA) - 1}{\kappa(BA) + 1}.$$

The role of preconditioning in this case is to cluster both the positive and the negative eigenvalues so that the polynomial approximation error is acceptably small after a few iterations. Similarly, we can also write out the left and right preconditioned MINRES. For all of the preconditioned MINRES variants, however, it is important to make sure that the preconditioner is SPD.

8.1.2 Generalized Minimal Residual Method

The generalized minimal residual method (GMRES) was proposed by Saad and Schultz [8] as a Krylov subspace method for nonsymmetric systems; it is a generalization of the MINRES method. Similarly to MINRES, any vector $u \in u^0 + K_m$ can be written as $u = u^0 + V_m y$, where $y$ is a coefficient vector and $V_m = \{v^1, v^2, \cdots, v^m\}$ collects orthonormal basis vectors of $K_m$, i.e., $(v^i, v^j) = 1$ if $i = j$ and $0$ otherwise.

Vm is generated by Arnoldi’s process, see Algorithm 25.

Algorithm 25 Arnoldi’s Iteration 1 ˆ β = krk2; v = r/β; b = βe1; for j = 1, 2, ··· , m do ω = Avj for i = 1, ··· , j do i T hi,m = (v ) ω; i ω = ω − hi,mv

hm+1,m = kωk2

if hm+1,m == 0 then Stop

By Algorithm 25, $AV_m$ has the decomposition

$$AV_m = [Av^1, Av^2, \cdots, Av^m] = [v^1, v^2, \cdots, v^m]\begin{pmatrix} h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,m} \\ h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,m} \\ 0 & h_{3,2} & h_{3,3} & \cdots & h_{3,m} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & h_{m,m-1} & h_{m,m} \end{pmatrix} + h_{m+1,m} v^{m+1} e_m^T,$$

where $e_m = (0, 0, \cdots, 1)^T_{1\times m}$. In matrix form this reads

$$AV_m = V_m H_m + h_{m+1,m} v^{m+1} e_m^T = V_{m+1}\bar H_m,$$

where $H_m = (h_{i,j})_{m\times m} \in \mathbb{R}^{m\times m}$ is called the Hessenberg matrix, with the property that $h_{i,j} = 0$ if $i > j + 1$, and

$$\bar H_m = \begin{pmatrix} H_m \\ (\vec 0_{1\times m-1},\ h_{m+1,m}) \end{pmatrix} \in \mathbb{R}^{(m+1)\times m}.$$

Since $V_m^T V_m = I_m \in \mathbb{R}^{m\times m}$, we have the property that

$$V_m^T A V_m = H_m.$$

Remark 8.1.4. If $A^T = A$, then $H_m^T = H_m$, which implies that $H_m$ is a tridiagonal matrix, and the GMRES method reduces to MINRES.

Suppose $z \in K_m = \mathrm{span}\{v^1, v^2, \cdots, v^m\}$, i.e., $z = \sum_{i=1}^m y_i v^i = V_m y$ with $y \in \mathbb{R}^m$. Then

$$f - Az = f - AV_m y = V_{m+1}(\beta e_1 - \bar H_m y).$$

The minimization problem can therefore be equivalently written as

$$\min_z \|f - Az\| = \min_y \|V_{m+1}(\beta e_1 - \bar H_m y)\| = \min_y \|\beta e_1 - \bar H_m y\|.$$

Therefore, the solution of GMRES at step $m$ is $u^m = u^0 + V_m y^m$ with

$$y^m = \arg\min_y \|\beta e_1 - \bar H_m y\|, \qquad \beta = \|r^0\|.$$

In order to solve this minimization problem, we first transform the matrix $\bar H_m$ to upper triangular form, i.e.,

$$Q_m\bar H_m = \begin{pmatrix} R_m \\ \vec 0_{1\times m} \end{pmatrix},$$

where $R_m \in \mathbb{R}^{m\times m}$ is an upper triangular nonsingular matrix. The basis $V_m$ is generated by the modified Gram-Schmidt orthogonalization (Algorithm 25), and the triangularization is realized by the plane rotations $c_i$, $s_i$ in Algorithm 26. It is an important property of GMRES that the basis of the Krylov space must be stored as the iteration progresses. This means that in order to perform $m$ GMRES iterations one must store $m$ vectors of length $N$. For very large problems this becomes prohibitive, and the iteration is restarted when the available room for basis vectors is exhausted. One way to implement this is to set $m_{max}$ to the maximum number of vectors that one can store, call GMRES, and explicitly test the residual $f - Au^m$ upon termination when $m = m_{max}$. If the norm of the residual is larger than the required tolerance, GMRES is called again with $u^0 = u^m$, the result from the previous call. This restarted version of the algorithm is called GMRES($m_{max}$). There is no general convergence theorem for the restarted algorithm, and restarting will slow the convergence down. However, when it works, it can significantly reduce the storage costs of the iteration. The whole algorithm is written as Algorithm 26.

Algorithm 26 GMRES($m_{max}$) Method
  $r = f - Au^0$; $u = u^0$
  for $j = 1, 2, \cdots$ until convergence do
    $\beta = \|r\|_2$; $v^1 = r/\beta$; $\hat b = \beta e_1$
    for $m = 1, 2, \cdots, m_{max}$ do
      $\omega = Av^m$
      for $i = 1, \cdots, m$ do
        $h_{i,m} = (v^i)^T\omega$; $\omega = \omega - h_{i,m}v^i$
      $h_{m+1,m} = \|\omega\|_2$; $v^{m+1} = \omega/h_{m+1,m}$
      $r_{1,m} = h_{1,m}$
      for $i = 2, \cdots, m$ do
        $\gamma = c_{i-1}r_{i-1,m} + s_{i-1}h_{i,m}$; $r_{i,m} = -s_{i-1}r_{i-1,m} + c_{i-1}h_{i,m}$; $r_{i-1,m} = \gamma$
      $\delta = \sqrt{r_{m,m}^2 + h_{m+1,m}^2}$; $c_m = r_{m,m}/\delta$; $s_m = h_{m+1,m}/\delta$; $r_{m,m} = c_m r_{m,m} + s_m h_{m+1,m}$
      $\hat b_{m+1} = -s_m\hat b_m$; $\hat b_m = c_m\hat b_m$
      $\rho = |\hat b_{m+1}|\ (= \|f - Au^{(j-1)m_{max}+m}\|_2)$
      if $\rho$ is small enough then $n_r = m$, break
    if the inner loop completes, set $n_r = m_{max}$
    $y^{n_r} = \hat b_{n_r}/r_{n_r,n_r}$
    for $m = n_r - 1, \cdots, 1$ do
      $y^m = (\hat b_m - \sum_{i=m+1}^{n_r} r_{m,i}y^i)/r_{m,m}$
    $u = u + \sum_{i=1}^{n_r} y^i v^i$; $r = f - Au$

The analysis of the convergence of GMRES is more complex; it is difficult to prove a result as simple as that for CG. If $A$ is a positive definite matrix, GMRES($m_{max}$) converges for any $m_{max} \ge 1$, and we also have the following estimate [144]:

Theorem 8.1.5 ([145]). If

$$\inf_{x \ne 0} \frac{(x, Ax)}{(x, x)} = \gamma > 0, \qquad (8.14)$$

$$\sup_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} = \Gamma, \qquad (8.15)$$

then the GMRES method converges, and at the $m$-th iteration the residual is bounded as

$$\|r^m\|_2 \le \left(1 - \frac{\gamma^2}{\Gamma^2}\right)^{m/2}\|r^0\|_2. \qquad (8.16)$$

Proof. By (2.30), we only need to estimate the polynomial. Note that for $q_1(x) = 1 + \alpha x \in P_1$,

$$\min_{q \in P_m} \|q(A)\|_2 \le \|(q_1(A))^m\|_2 \le \|q_1(A)\|_2^m.$$

Since

$$\|q_1(A)\|_2^2 = \max_{x \ne 0} \frac{\big((I + \alpha A)x,\ (I + \alpha A)x\big)}{(x, x)} = \max_{x \ne 0}\left[1 + 2\alpha\frac{(Ax, x)}{(x, x)} + \alpha^2\frac{(Ax, Ax)}{(x, x)}\right],$$

by (8.14) and (8.15), if $\alpha < 0$,

$$\|q_1(A)\|_2^2 \le 1 + 2\gamma\alpha + \Gamma^2\alpha^2.$$

By choosing $\alpha = -\frac{\gamma}{\Gamma^2}$, we have

$$\|q_1(A)\|_2^2 \le 1 - \frac{\gamma^2}{\Gamma^2},$$

which concludes the proof.
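As a concrete (hypothetical) illustration of this bound: if $\gamma = 1$ and $\Gamma = 2$, then

$$\frac{\|r^m\|_2}{\|r^0\|_2} \le \left(1 - \frac{1}{4}\right)^{m/2} = \left(\frac{3}{4}\right)^{m/2} \approx (0.87)^m,$$

so roughly 16 iterations suffice to reduce the residual by a factor of 10.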

If $A$ is diagonalizable, there exists $X$ such that $A = X\Lambda X^{-1}$, where $\Lambda = \mathrm{diag}\{\lambda_1, \cdots, \lambda_N\}$ is the diagonal matrix of eigenvalues. The residual then has the following bound:

$$\frac{\|r^m\|}{\|r^0\|} \le \kappa(X)\ \min_{q \in P_m,\ q(0)=1}\ \max_{i=1,\cdots,N} |q(\lambda_i)|, \qquad (8.17)$$

where $\kappa(X) = \|X\|\|X^{-1}\|$. This estimate gives some intuition about the convergence of GMRES: constructing a polynomial that is small on the spectrum is sufficient to give fast convergence. In minimizing the polynomial, we are constrained by the fact that $q(0) = 1$. Therefore, a polynomial that is small on eigenvalues close to the origin must grow rapidly to reach $q(0) = 1$, requiring a polynomial of sufficiently high degree. As the number of distinct eigenvalues near the origin grows with $\kappa(A)$, we expect to see higher iteration counts for GMRES as $\kappa(A)$ increases. Clustering of eigenvalues also affects GMRES convergence: a polynomial with one root in a cluster of eigenvalues will be small throughout the cluster, but to be small on several well-separated eigenvalues, the polynomial requires a root at each of those eigenvalues. In the case of preconditioned GMRES, or other nonsymmetric iterative solvers, the same three options for applying the preconditioning operation as for the conjugate gradient method are available. However, there is one fundamental difference: the right preconditioned versions give rise to what is called a flexible variant, i.e., a variant in which the preconditioner can change at each step. The left preconditioned GMRES algorithm is equivalent to applying GMRES to the system

BAu = Bf.

The straightforward application of GMRES to the above linear system yields the following preconditioned version of GMRES.

Algorithm 27 GMRES($m_{max}$) Method with left preconditioner
  $r = B(f - Au^0)$; $u = u^0$
  for $j = 1, 2, \cdots$ until convergence do
    $\beta = \|r\|_2$; $v^1 = r/\beta$; $\hat b = \beta e_1$
    for $m = 1, 2, \cdots, m_{max}$ do
      $\omega = BAv^m$
      for $i = 1, \cdots, m$ do
        $h_{i,m} = (v^i)^T\omega$; $\omega = \omega - h_{i,m}v^i$
      $h_{m+1,m} = \|\omega\|_2$; $v^{m+1} = \omega/h_{m+1,m}$
      $r_{1,m} = h_{1,m}$
      for $i = 2, \cdots, m$ do
        $\gamma = c_{i-1}r_{i-1,m} + s_{i-1}h_{i,m}$; $r_{i,m} = -s_{i-1}r_{i-1,m} + c_{i-1}h_{i,m}$; $r_{i-1,m} = \gamma$
      $\delta = \sqrt{r_{m,m}^2 + h_{m+1,m}^2}$; $c_m = r_{m,m}/\delta$; $s_m = h_{m+1,m}/\delta$; $r_{m,m} = c_m r_{m,m} + s_m h_{m+1,m}$
      $\hat b_{m+1} = -s_m\hat b_m$; $\hat b_m = c_m\hat b_m$
      $\rho = |\hat b_{m+1}|\ (= \|B(f - Au^{(j-1)m_{max}+m})\|_2)$
      if $\rho$ is small enough then $n_r = m$, break
    if the inner loop completes, set $n_r = m_{max}$
    $y^{n_r} = \hat b_{n_r}/r_{n_r,n_r}$
    for $m = n_r - 1, \cdots, 1$ do
      $y^m = (\hat b_m - \sum_{i=m+1}^{n_r} r_{m,i}y^i)/r_{m,m}$
    $u = u + \sum_{i=1}^{n_r} y^i v^i$; $r = B(f - Au)$

Note that there is no easy access to the unpreconditioned residuals, unless they are computed explicitly, e.g., by multiplying the preconditioned residuals by $B^{-1}$. If a stopping criterion based on the actual residuals, instead of the preconditioned ones, is desired, we need to calculate the residuals separately. If $B$ is symmetric positive definite, we can take advantage of this: similarly to PCG, we can design a preconditioned GMRES with the $B^{-1}$ inner product. The right preconditioned GMRES algorithm is based on solving (2.37). However, we never need to compute and store the variable $v$: the residual $\hat r^m = f - ABv^m = f - Au^m$ can be computed from $u^m$ directly. When we update the approximate solution, we obtain the desired approximation $u^m$ from $u^m = Bv^m$,

$$u^m = u^0 + B\sum_{i=1}^{n_r} y^i v^i.$$

Thus, one preconditioning operation is needed at the end of the outer loop, instead of at the beginning as in the left preconditioned version.

Algorithm 28 GMRES($m_{max}$) Method with right preconditioner
  $r = f - Au^0$; $u = u^0$
  for $j = 1, 2, \cdots$ until convergence do
    $\beta = \|r\|_2$; $v^1 = r/\beta$; $\hat b = \beta e_1$
    for $m = 1, 2, \cdots, m_{max}$ do
      $\omega = ABv^m$
      for $i = 1, \cdots, m$ do
        $h_{i,m} = (v^i)^T\omega$; $\omega = \omega - h_{i,m}v^i$
      $h_{m+1,m} = \|\omega\|_2$; $v^{m+1} = \omega/h_{m+1,m}$
      $r_{1,m} = h_{1,m}$
      for $i = 2, \cdots, m$ do
        $\gamma = c_{i-1}r_{i-1,m} + s_{i-1}h_{i,m}$; $r_{i,m} = -s_{i-1}r_{i-1,m} + c_{i-1}h_{i,m}$; $r_{i-1,m} = \gamma$
      $\delta = \sqrt{r_{m,m}^2 + h_{m+1,m}^2}$; $c_m = r_{m,m}/\delta$; $s_m = h_{m+1,m}/\delta$; $r_{m,m} = c_m r_{m,m} + s_m h_{m+1,m}$
      $\hat b_{m+1} = -s_m\hat b_m$; $\hat b_m = c_m\hat b_m$
      $\rho = |\hat b_{m+1}|\ (= \|f - Au^{(j-1)m_{max}+m}\|_2)$
      if $\rho$ is small enough then $n_r = m$, break
    if the inner loop completes, set $n_r = m_{max}$
    $y^{n_r} = \hat b_{n_r}/r_{n_r,n_r}$
    for $m = n_r - 1, \cdots, 1$ do
      $y^m = (\hat b_m - \sum_{i=m+1}^{n_r} r_{m,i}y^i)/r_{m,m}$
    $u = u + B\sum_{i=1}^{n_r} y^i v^i$; $r = f - Au$

Note that the residual norm is now that of the original system. This is an essential difference from the left preconditioned GMRES algorithm. In most practical situations, the difference in the convergence behavior of the two approaches is not significant; the only exception is when $B$ is ill-conditioned, which could lead to substantial differences. If $B$ is the result of a factorization of the form $B = LU$, there is the option of using GMRES on the split-preconditioned system $LAUv = Lf$ with $u = Uv$. In this situation, we need to operate on the initial residual by $L$ at the start of the algorithm and by $U$ on the linear combination when forming the approximate solution. The residual norm available is that of $L(f - Au)$. For the right preconditioned GMRES, the approximate solution is obtained by applying the same preconditioning matrix to the linear combination of the $v^i$. In fact, the preconditioner need not be the same at every step. Denote $z^j = B_j v^j$. Then it would be natural to compute the approximate solution as

$$u^{n_r} = u^0 + Z_{n_r} y^{n_r},$$

where $Z_{n_r} = \{z^1, \cdots, z^{n_r}\}$. These are the only changes that lead from the right preconditioned algorithm to the flexible variant, described below.

Algorithm 29 FGMRES($m_{max}$) Method
  $r = f - Au^0$; $u = u^0$
  for $j = 1, 2, \cdots$ until convergence do
    $\beta = \|r\|_2$; $v^1 = r/\beta$; $\hat b = \beta e_1$
    for $m = 1, 2, \cdots, m_{max}$ do
      $z^m = B_m v^m$; $\omega = Az^m$
      for $i = 1, \cdots, m$ do
        $h_{i,m} = (v^i)^T\omega$; $\omega = \omega - h_{i,m}v^i$
      $h_{m+1,m} = \|\omega\|_2$; $v^{m+1} = \omega/h_{m+1,m}$
      $r_{1,m} = h_{1,m}$
      for $i = 2, \cdots, m$ do
        $\gamma = c_{i-1}r_{i-1,m} + s_{i-1}h_{i,m}$; $r_{i,m} = -s_{i-1}r_{i-1,m} + c_{i-1}h_{i,m}$; $r_{i-1,m} = \gamma$
      $\delta = \sqrt{r_{m,m}^2 + h_{m+1,m}^2}$; $c_m = r_{m,m}/\delta$; $s_m = h_{m+1,m}/\delta$; $r_{m,m} = c_m r_{m,m} + s_m h_{m+1,m}$
      $\hat b_{m+1} = -s_m\hat b_m$; $\hat b_m = c_m\hat b_m$
      $\rho = |\hat b_{m+1}|\ (= \|f - Au^{(j-1)m_{max}+m}\|_2)$
      if $\rho$ is small enough then $n_r = m$, break
    if the inner loop completes, set $n_r = m_{max}$
    $y^{n_r} = \hat b_{n_r}/r_{n_r,n_r}$
    for $m = n_r - 1, \cdots, 1$ do
      $y^m = (\hat b_m - \sum_{i=m+1}^{n_r} r_{m,i}y^i)/r_{m,m}$
    $u = u + \sum_{i=1}^{n_r} y^i z^i$; $r = f - Au$

8.2 Preconditioners for Indefinite Problems

We define $B : X \to X$ as the inverse of $H$, which means $B^{-1}$ defines the inner product $(\cdot,\cdot)_X$. Note that

$$(BAx, y)_X = (BAx, y)_{B^{-1}} = (Ax, y) = (x, Ay) = (x, BAy)_X,$$

which implies that $BA : X \to X$ is symmetric with respect to $(\cdot,\cdot)_X$. Therefore, $B$ can be used as a preconditioner for solving (8.4). If $A$ is symmetric, we can use preconditioned MINRES and have the following convergence estimate.

Theorem 8.2.1. Assume $B$ is SPD, $B^{-1}$ defines the inner product $(\cdot,\cdot)_X$, and $A$ satisfies (8.2) and (8.3). Then we have

$$\kappa(BA) \le \frac{\alpha}{\beta}, \qquad (8.18)$$

and, therefore, preconditioned MINRES converges with convergence rate $\delta \le \frac{\alpha-\beta}{\alpha+\beta}$:

$$\frac{\|r^m\|_B}{\|r^0\|_B} \le \left(\frac{\alpha - \beta}{\alpha + \beta}\right)^{m/2}.$$

Proof. Note that

$$\|BA\|_{L(X,X)} = \sup_x \frac{|(BAx, x)_X|}{\|x\|_X^2} = \sup_x \frac{|(Ax, x)|}{\|x\|_X^2} = \sup_x \frac{|L(x, x)|}{\|x\|_X^2} \le \alpha,$$

and

$$\|(BA)^{-1}\|_{L(X,X)}^{-1} = \inf_x \frac{\|BAx\|_X}{\|x\|_X} = \inf_x\sup_y \frac{(BAx, y)_X}{\|x\|_X\|y\|_X} = \inf_x\sup_y \frac{L(x, y)}{\|x\|_X\|y\|_X} \ge \beta > 0.$$

Therefore, we have $\kappa(BA) \le \alpha/\beta$. By Theorem 8.1.3, we have the desired estimate of the convergence of preconditioned MINRES.

The previous result can be used to characterize a general class of symmetric preconditioners $M : X \to X$. We assume $M$ is spectrally equivalent to $B$ in the following sense:

$$c_1(f, Bf) \le (f, Mf) \le c_2(f, Bf). \qquad (8.19)$$

Letting $x = Bf$, we have

$$c_1(x, x)_X \le (x, MB^{-1}x)_X \le c_2(x, x)_X, \qquad (8.20)$$

which implies that $MB^{-1} : X \to X$ is an isomorphism with $\|MB^{-1}\|_{L(X,X)} \le c_2$ and $\|(MB^{-1})^{-1}\|_{L(X,X)} \le c_1^{-1}$. If we use $M$ as the preconditioner, then we have the following convergence estimate for the corresponding preconditioned MINRES method.

Theorem 8.2.2. Assume that (8.2), (8.3), and (8.20) are satisfied. Then we have

$$\kappa(MA) \le \frac{c_2\alpha}{c_1\beta},$$

and, therefore, preconditioned MINRES converges with convergence rate $\delta \le \frac{c_2\alpha - c_1\beta}{c_2\alpha + c_1\beta}$:

$$\frac{\|r^m\|_B}{\|r^0\|_B} \le \left(\frac{c_2\alpha - c_1\beta}{c_2\alpha + c_1\beta}\right)^{m/2}.$$

Proof. Based on the assumptions and Theorem 8.2.1, we have

$$\|MA\|_{L(X,X)} = \|MB^{-1}BA\|_{L(X,X)} \le \|MB^{-1}\|_{L(X,X)}\|BA\|_{L(X,X)} \le c_2\alpha,$$

and

$$\|(MA)^{-1}\|_{L(X,X)}^{-1} = \|(MB^{-1}BA)^{-1}\|_{L(X,X)}^{-1} \ge \|(BA)^{-1}\|_{L(X,X)}^{-1}\,\|(MB^{-1})^{-1}\|_{L(X,X)}^{-1} \ge c_1\beta,$$

which gives the estimate of the condition number. By Theorem 8.1.3, we have the desired estimate of the convergence of preconditioned MINRES.

8.3 FASP Preconditioner

We consider the problem $\hat A\hat x = \hat f$ on an auxiliary space $\hat X$ on which it is easier to solve. The space $X$ is equipped with an inner product $d(\cdot,\cdot)$ which is different from $(\cdot,\cdot)_X$. The operator $D$ induced by $d(\cdot,\cdot)$ leads to the smoother $T = D^{-1}$; in general, we can use any smoother $S$ on $X$ that is spectrally equivalent to $T$. For the auxiliary space $\hat X$, we need a projection operator $\hat\Pi : \hat X \to X$, which gives

$$\Pi := I \times \hat\Pi : \bar X \to X. \qquad (8.21)$$

Then the fast auxiliary space preconditioner is given by

$$M_A = S + \hat\Pi\hat A^{-1}\hat\Pi^*. \qquad (8.22)$$

For a simpler version, we can define the fast auxiliary space symmetric positive definite preconditioner

$$M_B = T + \hat\Pi\hat B\hat\Pi^*. \qquad (8.23)$$

For the fast auxiliary space SPD preconditioner $M_B = T + \hat\Pi\hat B\hat\Pi^*$, we have the following estimate:

Theorem 8.3.1. If $\Pi := I \times \hat\Pi : \bar X \to X$ satisfies the following properties:

1. $\|x\|_X \le c_1\|x\|_{T^{-1}}$, $\forall x \in X$;

2. $\|\hat\Pi\hat x\|_X \le c_2\|\hat x\|_{\hat X}$, $\forall \hat x \in \hat X$;

3. for every $x \in X$, there exist $x_0 \in X$ and $\hat x_1 \in \hat X$ such that $x = x_0 + \hat\Pi\hat x_1$ and

$$\|x_0\|^2_{T^{-1}} + \|\hat x_1\|^2_{\hat X} \le c_3^2\|x\|^2_X;$$

then $M_B$ satisfies

$$c_3^{-2}(x, x)_X \le (x, M_B B^{-1}x)_X \le (c_1^2 + c_2^2)(x, x)_X. \qquad (8.24)$$

Proof. First we prove the right inequality. Since

$$(x, M_B B^{-1}x)_X = \big(x, (T + \hat\Pi\hat B\hat\Pi^*)B^{-1}x\big)_X = (x, TB^{-1}x)_X + (x, \hat\Pi\hat B\hat\Pi^* B^{-1}x)_X,$$

by properties (1) and (2) we have the following inequalities:

$$(x, M_B B^{-1}x)_X \le \|x\|_X\|TB^{-1}x\|_X + \|x\|_X\|\hat\Pi\hat B\hat\Pi^* B^{-1}x\|_X \le \big(c_1\|TB^{-1}x\|_{T^{-1}} + c_2\|\hat B\hat\Pi^* B^{-1}x\|_{\hat X}\big)\|x\|_X$$
$$\le (c_1^2 + c_2^2)^{1/2}\big(\|TB^{-1}x\|^2_{T^{-1}} + \|\hat B\hat\Pi^* B^{-1}x\|^2_{\hat X}\big)^{1/2}\|x\|_X = (c_1^2 + c_2^2)^{1/2}\big((TB^{-1}x, x)_X + (\hat\Pi\hat B\hat\Pi^* B^{-1}x, x)_X\big)^{1/2}\|x\|_X$$
$$= (c_1^2 + c_2^2)^{1/2}\big((M_B B^{-1}x, x)_X\big)^{1/2}\|x\|_X.$$

Second, we obtain the lower bound by the decomposition in property (3):

$$(x, x)_X = \langle x_0 + \hat\Pi\hat x_1, B^{-1}x\rangle = (T^{-1}x_0, B^{-1}x)_T + (\hat B^{-1}\hat x_1, \hat\Pi^* B^{-1}x)_{\hat B}$$
$$\le \|T^{-1}x_0\|_T\|B^{-1}x\|_T + \|\hat B^{-1}\hat x_1\|_{\hat B}\|\hat\Pi^* B^{-1}x\|_{\hat B}$$
$$\le \big(\|T^{-1}x_0\|^2_T + \|\hat B^{-1}\hat x_1\|^2_{\hat B}\big)^{1/2}\big(\|B^{-1}x\|^2_T + \|\hat\Pi^* B^{-1}x\|^2_{\hat B}\big)^{1/2}$$
$$\le \big(\|x_0\|^2_{T^{-1}} + \|\hat x_1\|^2_{\hat X}\big)^{1/2}\big((TB^{-1}x, x)_X + \langle\hat B\hat\Pi^* B^{-1}x, \hat\Pi^* B^{-1}x\rangle\big)^{1/2}$$
$$\le c_3\|x\|_X\big((TB^{-1}x, x)_X + (\hat\Pi\hat B\hat\Pi^* B^{-1}x, x)_X\big)^{1/2} = c_3\|x\|_X\big((M_B B^{-1}x, x)_X\big)^{1/2}.$$

By Theorem 8.2.2, we have

Theorem 8.3.2. If $\Pi := I \times \hat\Pi : \bar X \to X$ satisfies properties 1-3 of Theorem 8.3.1, then there exists a constant $c$, which does not depend on the mesh size, such that

$$\kappa(M_B A) \le c.$$

Therefore the convergence rate of MINRES with $M_B$ as preconditioner does not depend on the mesh size.

Chapter 9

Fast Preconditioners for Stokes Equation on Unstructured Grid

We consider the incompressible Stokes problem

$$\begin{cases} a(u, v) + b(v, p) = \langle f_1, v\rangle & \forall v \in H_0^1(\Omega), \\ b(u, q) = \langle f_2, q\rangle & \forall q \in L^2(\Omega), \end{cases} \qquad (9.1)$$

where $a(u, v) = (\nabla u, \nabla v)$ and $b(v, p) = (\mathrm{div}\,v, p)$. We will also make use of the following operators:

$$A : \langle Au, v\rangle = a(u, v),$$
$$B : \langle Bu, p\rangle = b(u, p),$$
$$B^* : \langle v, B^*p\rangle = b(v, p).$$

Then the original equations transform into the linear system

$$Ax = \begin{pmatrix} A & B^* \\ B & 0 \end{pmatrix}\begin{pmatrix} u \\ p \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = f. \qquad (9.2)$$

We assume that the norm $\|x\|_X = \|(u, p)\|_X$ is given by

$$\|x\|^2_X = |u|^2_1 + \|p\|^2_0.$$

Lemma 9.0.3. (8.2) and (8.3) are equivalent to the Babuška-Brezzi conditions:

$$\langle Au, v\rangle \le c_1\|u\|_A\|v\|_A, \qquad (9.3a)$$
$$\langle Bu, p\rangle \le c_2\|u\|_A\|p\|_0, \qquad (9.3b)$$
$$\inf_p\sup_u \frac{\langle Bu, p\rangle}{\|u\|_A\|p\|_0} \ge c_3, \qquad (9.3c)$$
$$\inf_u\sup_v \frac{\langle Au, v\rangle}{\|u\|_A\|v\|_A} \ge c_4. \qquad (9.3d)$$

9.1 Block Preconditioners

Since the Stokes equation can be written in two-by-two block form, block-type preconditioners based on segregation constitute one of the main preconditioning approaches, including block diagonal preconditioners, block triangular preconditioners, block approximate factorization preconditioners, and further enhancements of these based on the symmetric block Gauss-Seidel method. There is a large body of work developing and analyzing these types of preconditioners: see [146, 147, 148] for block diagonal preconditioners, [149, 73, 150, 151] for block triangular preconditioners, [152, 153] for the Pressure-Convection-Diffusion (PCD) commutator preconditioner, [154, 155] for the Least Squares Commutator (LSC) preconditioner, [156] for the Augmented Lagrangian (AL) preconditioner, and [157, 158] for SIMPLE-type methods. We can define the block diagonal preconditioner as

$$M = \begin{pmatrix} \tilde A^{-1} & 0 \\ 0 & I \end{pmatrix}, \qquad (9.4)$$

where $\tilde A^{-1}$ is an approximation of $A^{-1}$, for example, one or several steps of a multigrid method. By Theorem 8.2.2, we have the following convergence estimate for the block diagonal preconditioner (9.4).

Theorem 9.1.1. If $\tilde A$ is SPD and there exist constants $C_1 > C_2 > 0$ such that for any $x \in H_0^1(\Omega)$,

$$C_2(\tilde A x, x) \le (Ax, x) \le C_1(\tilde A x, x),$$

then the MINRES algorithm applied to $A$ with the preconditioner (9.4) converges in a number of iterations independent of the mesh size.
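For concreteness, expanding the product of (9.4) with the block operator in (9.2) (a direct computation, shown here only as an illustration) gives

$$MA = \begin{pmatrix} \tilde A^{-1} & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} A & B^* \\ B & 0 \end{pmatrix} = \begin{pmatrix} \tilde A^{-1}A & \tilde A^{-1}B^* \\ B & 0 \end{pmatrix},$$

so the closer $\tilde A^{-1}A$ is to the identity, the closer the preconditioned operator is to a fixed, well-conditioned block operator.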

If the preconditioner $M$ is not symmetric, we have to use the preconditioned GMRES method, and in order to ensure convergence we need to use Theorem 8.1.5 and verify the two conditions (8.14) and (8.15); namely, if

$$\inf_x \frac{(MAx, x)_X}{(x, x)_X} = \gamma > 0, \quad \text{and} \quad \sup_x \frac{\|MAx\|_X}{\|x\|_X} = \Gamma, \qquad (9.5)$$

then the preconditioned GMRES converges with convergence factor $(1 - \gamma^2/\Gamma^2)^{1/2}$. For the Stokes problem, we can also use upper and lower triangular block matrices as preconditioners, i.e.,

$$M_U = \begin{pmatrix} A & B^* \\ 0 & I \end{pmatrix}^{-1}, \quad \text{and} \quad M_L = \begin{pmatrix} A & 0 \\ B & I \end{pmatrix}^{-1}. \qquad (9.6)$$

By estimating (9.5), we have

Theorem 9.1.2 ([151]). For solving the Stokes equation (9.2), if (9.3) holds and $M_L$ is defined as in (9.6), then there exist $\gamma, \Gamma > 0$ such that
\[
\inf_x \frac{(M_LAx, x)_X}{(x, x)_X} = \gamma, \quad\text{and}\quad \sup_x \frac{\|M_LAx\|_X}{\|x\|_X} = \Gamma \quad (9.7)
\]
holds. Therefore the left preconditioned GMRES converges in a number of iterations independent of the mesh size. Moreover, the residuals satisfy

\[
\frac{\|r^m\|_X^2}{\|r^0\|_X^2} \le \left(1 - \frac{\gamma^2}{\Gamma^2}\right)^m.
\]

Proof. The preconditioned system is
\[
M_LA = \begin{pmatrix} I & A^{-1}B^* \\ 0 & BA^{-1}B^* \end{pmatrix},
\]

so for $x = (u, p)^T \in X$ we have

\[
(M_LAx, x)_X = |u|_1^2 + (BA^{-1}B^*p, p) + (B^*p, u).
\]

Since we have the following estimates,
\[
(BA^{-1}B^*p, p) = (A^{-1}B^*p, B^*p) = \sup_{u\in V\setminus\{0\}}\left(\frac{\langle Bu, p\rangle}{|u|_A}\right)^2 \ge c_3^2\|p\|_0^2,
\]
\[
(BA^{-1}B^*p, p) = (A^{-1}B^*p, B^*p) = \sup_{u\in V\setminus\{0\}}\left(\frac{(Bu, p)}{|u|_A}\right)^2 \le c_2^2\|p\|_0^2,
\]
\[
|(B^*p, u)| = |(A^{-1}B^*p, u)_A| \le \|A^{-1}B^*p\|_A\|u\|_A = (BA^{-1}B^*p, p)^{1/2}|u|_1 \le \frac12(BA^{-1}B^*p, p) + \frac12|u|_1^2,
\]
it follows that
\[
\begin{aligned}
(M_LAx, x)_X &\ge |u|_1^2 + (BA^{-1}B^*p, p) - |(B^*p, u)| \\
&\ge |u|_1^2 + (BA^{-1}B^*p, p) - \frac12\big((BA^{-1}B^*p, p) + |u|_1^2\big) \\
&\ge \frac12|u|_1^2 + \frac12(BA^{-1}B^*p, p) \\
&\ge \frac12|u|_1^2 + \frac12 c_3^2\|p\|_0^2 \\
&\ge \gamma\|x\|_X^2,
\end{aligned}
\]

where $\gamma = \min\{\tfrac12, \tfrac12 c_3^2\}$. On the other hand,

\[
\begin{aligned}
\|M_LAx\|_X^2 &\le 3\big(|u|_1^2 + c_2^2\|p\|_0^2 + \|A^{-1}B^*p\|_A^2\big) \\
&\le 3\big(|u|_1^2 + c_2^2\|p\|_0^2 + c_2^2\|p\|_0^2\big) \\
&\le \Gamma\|x\|_X^2,
\end{aligned}
\]
where $\Gamma = \max\{3, 6c_2^2\}$. By Theorem 8.1.5, this completes the proof.

Similarly, we have the estimate for the upper triangular preconditioner $M_U$.

Theorem 9.1.3 ([151]). For solving the Stokes equation (9.2), if (9.3) holds and $M_U$ is defined as in (9.6), then there exist $\gamma, \Gamma > 0$ such that

\[
\inf_x \frac{(AM_Ux, x)_{X^{-1}}}{(x, x)_{X^{-1}}} = \gamma, \quad\text{and}\quad \sup_x \frac{\|AM_Ux\|_{X^{-1}}}{\|x\|_{X^{-1}}} = \Gamma \quad (9.8)
\]

holds. Therefore the right preconditioned GMRES converges in a number of iterations independent of the mesh size. Moreover, the residuals satisfy

\[
\frac{\|r^m\|_{X^{-1}}^2}{\|r^0\|_{X^{-1}}^2} \le \left(1 - \frac{\gamma^2}{\Gamma^2}\right)^m.
\]

Proof. The preconditioned system is
\[
AM_U = \begin{pmatrix} I & 0 \\ BA^{-1} & BA^{-1}B^* \end{pmatrix},
\]

for any x = (u, p) ∈ X,

\[
(AM_Ux, x)_{X^{-1}} = |u|_{A^{-1}}^2 + (BA^{-1}B^*p, p) + (BA^{-1}u, p).
\]

Since

\[
(BA^{-1}u, p) = (A^{-1/2}u, A^{-1/2}B^*p) \le \|A^{-1/2}u\|\,\|A^{-1/2}B^*p\| \le \frac12\big(\|u\|_{A^{-1}}^2 + (BA^{-1}B^*p, p)\big).
\]

Therefore,

\[
\begin{aligned}
(AM_Ux, x)_{X^{-1}} &\ge |u|_{A^{-1}}^2 + (BA^{-1}B^*p, p) - |(BA^{-1}u, p)| \\
&\ge |u|_{A^{-1}}^2 + (BA^{-1}B^*p, p) - \frac12\big(\|u\|_{A^{-1}}^2 + (BA^{-1}B^*p, p)\big) \\
&= \frac12|u|_{A^{-1}}^2 + \frac12(BA^{-1}B^*p, p) \\
&\ge \frac12|u|_{A^{-1}}^2 + \frac12 c_2^2\|p\|_0^2 \\
&\ge \gamma\|x\|_{X^{-1}}^2,
\end{aligned}
\]

where $\gamma = \min\{\tfrac12, \tfrac12 c_2^2\}$. On the other hand,

\[
\begin{aligned}
\|AM_Ux\|_{X^{-1}}^2 &\le 3\big(|u|_{A^{-1}}^2 + c_2^2\|p\|_0^2 + \|BA^{-1}u\|^2\big) \\
&\le 3\big(|u|_{A^{-1}}^2 + c_2^2\|p\|_0^2 + c_2^2\|u\|_{A^{-1}}^2\big) \\
&\le \Gamma\|x\|_{X^{-1}}^2,
\end{aligned}
\]
where $\Gamma = \max\{3 + 3c_2^2, 3c_2^2\}$. By Theorem 8.1.5, this completes the proof.

For general block triangular preconditioners
\[
M_U = \begin{pmatrix} A_1 & B^* \\ 0 & \rho A_2 \end{pmatrix}^{-1}, \quad\text{and}\quad M_L = \begin{pmatrix} A_1 & 0 \\ B & \rho A_2 \end{pmatrix}^{-1},
\]
where $A_1$ is an approximation of $-\Delta$ and $A_2$ is an approximation of $I$, the preconditioned systems become
\[
M_LA = \begin{pmatrix} A_1^{-1}A & A_1^{-1}B^* \\ \rho^{-1}A_2^{-1}B(A_1^{-1}A - I) & \rho^{-1}A_2^{-1}BA_1^{-1}B^* \end{pmatrix}
\]
and
\[
AM_U = \begin{pmatrix} AA_1^{-1} & \rho^{-1}(AA_1^{-1} - I)B^*A_2^{-1} \\ BA_1^{-1} & \rho^{-1}BA_1^{-1}B^*A_2^{-1} \end{pmatrix}.
\]
We have the following estimates:

Theorem 9.1.4. If A1 and A2 satisfy

\[
\beta_1(y_1, y_1)_A \le (A_1^{-1}Ay_1, y_1)_A, \qquad \|A_1^{-1}Ay_1\|_A \le \alpha_1\|y_1\|_A \quad \forall y_1 \in V, \quad (9.9)
\]
\[
\beta_2(y_2, y_2) \le (A_2^{-1}\hat Sy_2, y_2), \qquad \|A_2^{-1}y_2\| \le \alpha_2\|y_2\| \quad \forall y_2 \in Q, \quad (9.10)
\]
where $\hat S = BA_1^{-1}B^*$, then there exists a $\rho_0$ such that $M_L$ satisfies (9.7) for all $\rho$ with $\rho\rho_0 < 1$, provided
\[
\|A_1^{-1}A - I\|_A \le \rho_0.
\]

Therefore the left preconditioned GMRES converges in a number of iterations independent of mesh size. Moreover, the residuals satisfy

\[
\frac{\|r^m\|_X^2}{\|r^0\|_X^2} \le \left(1 - \frac{\gamma^2}{\Gamma^2}\right)^m.
\]

Proof. Since

\[
(M_LAx, x)_X = (A_1^{-1}Au, u)_A + (A_1^{-1}B^*p, u)_A + \rho^{-1}(A_2^{-1}B(A_1^{-1}A - I)u, p) + \rho^{-1}(A_2^{-1}\hat Sp, p),
\]

\[
|(A_1^{-1}B^*p, u)_A| \le \|A^{-1}B^*p\|_A\,\|A_1^{-1}Au\|_A \le \alpha_1 c_2\|p\|_0\|u\|_A,
\]

\[
|(A_2^{-1}B(A_1^{-1}A - I)u, p)| \le \|(A_1^{-1}A - I)u\|_A\,\|A^{-1}B^*A_2^{-1}p\|_A \le \rho_0\alpha_2 c_2\|p\|_0\|u\|_A,
\]
and

\[
\begin{aligned}
\|\hat Sp\| &= \sup_{q\in Q}\frac{(\hat Sp, q)}{\|q\|_0} = \sup_{q\in Q}\frac{(A_1^{-1}B^*p, A^{-1}B^*q)_A}{\|A^{-1}B^*q\|_A}\cdot\frac{\|A^{-1}B^*q\|_A}{\|q\|_0} \\
&= \|A_1^{-1}B^*p\|_A\sup_{q\in Q}\frac{\|A^{-1}B^*q\|_A}{\|q\|_0} \le \alpha_1\|A^{-1}B^*p\|_A\sup_{q\in Q}\frac{\|A^{-1}B^*q\|_A}{\|q\|_0} \le \alpha_1 c_2^2\|p\|_0,
\end{aligned}
\]
we have

\[
\begin{aligned}
(M_LAx, x)_X &\ge \beta_1\|u\|_A^2 + \rho^{-1}\beta_2\|p\|_0^2 - (\alpha_1 c_2 + \alpha_2 c_2)\|p\|_0\|u\|_A \\
&\ge \beta_1\|u\|_A^2 + \rho^{-1}\beta_2\|p\|_0^2 - \frac{(\alpha_1 + \alpha_2)^2c_2^2}{2\beta_1}\|p\|_0^2 - \frac{\beta_1}{2}\|u\|_A^2 \\
&\ge \frac{\beta_1}{2}\|x\|_X^2,
\end{aligned}
\]

by choosing $\rho < \dfrac{2\beta_1\beta_2}{(\alpha_1+\alpha_2)^2c_2^2 + \beta_2}$. On the other hand,

\[
\|M_LAx\|_X^2 \le 4\big(\alpha_1^2\|u\|_A^2 + \rho^{-2}\alpha_1^2\alpha_2^2c_2^4\|p\|_0^2 + \alpha_1c_2\|p\|_0\|u\|_A + \rho^{-2}\rho_0\alpha_2c_2\|p\|_0\|u\|_A\big) \le \Gamma^2\|x\|_X^2,
\]
which completes the proof.

We also have a similar result for the right preconditioner $M_U$.

Theorem 9.1.5. If A1 and A2 satisfy

\[
\beta_1(y_1, y_1)_{A^{-1}} \le (AA_1^{-1}y_1, y_1)_{A^{-1}}, \qquad \|AA_1^{-1}y_1\|_{A^{-1}} \le \alpha_1\|y_1\|_{A^{-1}} \quad \forall y_1 \in V, \quad (9.11)
\]
\[
\beta_2(y_2, y_2) \le (A_2^{-1}\hat Sy_2, y_2), \qquad \|A_2^{-1}y_2\|_0 \le \alpha_2\|y_2\|_0 \quad \forall y_2 \in Q, \quad (9.12)
\]
where $\hat S = BA_1^{-1}B^*$, then there exists a $\rho_0$ such that $M_U$ satisfies (9.8) for all $\rho$ with $\rho\rho_0 < 1$, provided
\[
\|AA_1^{-1} - I\|_{A^{-1}} \le \rho_0.
\]

Therefore the right preconditioned GMRES converges in a number of iterations independent of mesh size. Moreover, the residuals satisfy

\[
\frac{\|r^m\|_{X^{-1}}^2}{\|r^0\|_{X^{-1}}^2} \le \left(1 - \frac{\gamma^2}{\Gamma^2}\right)^m.
\]
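In practice, the general block triangular preconditioners of Theorems 9.1.4 and 9.1.5 are applied by one block substitution with inexact block solves. A minimal sketch (the names solve_A1 and solve_A2 are placeholders for, e.g., one multigrid cycle for $A$ and an approximate solve with $A_2$):

```python
import numpy as np

def apply_general_ML(solve_A1, solve_A2, B, rho, r, n_u):
    """Apply M_L = [[A_1, 0], [B, rho*A_2]]^{-1} to r = (r_1, r_2),
    where solve_A1 approximates A^{-1} and solve_A2 approximates
    the inverse of A_2 (an approximation of the identity)."""
    z = np.empty_like(r)
    z[:n_u] = solve_A1(r[:n_u])                       # z_1 = A_1^{-1} r_1
    z[n_u:] = solve_A2(r[n_u:] - B @ z[:n_u]) / rho   # z_2 = (rho A_2)^{-1}(r_2 - B z_1)
    return z
```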

9.2 Analysis of the FASP SPD Preconditioner

We now construct the FASP preconditioner $M_B$. In order to define the preconditioner, we need to choose a smoother. For $X = V \times Q$, assume we have an overlapping domain decomposition

\[
X = X_1 + \cdots + X_m, \qquad X_i = V_i \times Q_i.
\]
Assume the maximum number of overlaps is $c_k \ll m$. We define the smoother $T$ as
\[
T = \sum_{i=1}^m \Pi_iB_i\Pi_i^*, \quad (9.13)
\]
where $\Pi_i: X_i \to X$ is the inclusion and

\[
B_i = \begin{pmatrix} A_i & 0 \\ 0 & I \end{pmatrix}^{-1}.
\]
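A minimal sketch of the action of (9.13), under the assumption that the overlapping subspaces are given as lists of global index sets and the assembled matrix is available as a dense NumPy array (for illustration):

```python
import numpy as np

def make_additive_schwarz(K, patches, n_u):
    """Smoother T = sum_i Pi_i B_i Pi_i^* of (9.13) with
    B_i = diag(A_i, I)^{-1}: each patch solves its local velocity
    block exactly and passes its pressure unknowns through."""
    def apply_T(r):
        z = np.zeros_like(r)
        for idx in patches:
            vel = idx[idx < n_u]                    # velocity dofs of V_i
            prs = idx[idx >= n_u]                   # pressure dofs of Q_i
            A_i = K[np.ix_(vel, vel)]               # local velocity block A_i
            z[vel] += np.linalg.solve(A_i, r[vel])  # A_i^{-1} on the patch
            z[prs] += r[prs]                        # identity on Q_i
        return z
    return apply_T
```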

$T$ is an additive Schwarz type iteration. The multiplicative Schwarz type iteration is also called the Vanka smoother. In [159], Vanka developed a new multigrid method for solving the steady-state incompressible Navier-Stokes equations based on a finite volume discretization on a staggered grid. The computational domain is divided into non-overlapping cells with pressure nodes at the cell centers and velocity nodes at the cell faces. The smoothing procedure is a symmetric coupled Gauss-Seidel method (SCGS). This technique can easily be extended to overlapping domains and other mixed variational problems. It has been widely used in practice and has shown good convergence results. However, there are only a few theoretical results on the convergence and smoothing properties of the underlying iterative method. Molenaar [160] analyzed the Vanka smoother by Fourier analysis for a mixed finite element method for the Poisson equation in one dimension. Schöberl and Zulehner [161] derived the convergence and smoothing properties of the additive Schwarz type method by analyzing symmetric inexact Uzawa methods. Schöberl [162] proved a smoothing property that is independent of the small parameter for the parameter-dependent problem of nearly incompressible materials with the $P_2$-$P_0$ element. The convergence and smoothing properties for the general case are still open problems.

With this type of smoother, we can define the SPD preconditioner $M_B$ as in (8.23) for the Stokes equations.

Lemma 9.2.1. The smoother $T$ defined in (9.13) satisfies the property in Theorem 8.3.1: there exists a constant $c_1 > 0$ such that
\[
\|x\|_X \le c_1\|x\|_{T^{-1}}, \quad \forall x \in X.
\]

Proof. By the Cauchy-Schwarz inequality, we have
\[
(TB^{-1}x, x)_X \le (TB^{-1}x, TB^{-1}x)_X^{1/2}(x, x)_X^{1/2}
\]
and
\[
\begin{aligned}
(TB^{-1}x, TB^{-1}x)_X &= \sum_{i,j=1}^m (\Pi_iB_i\Pi_i^*B^{-1}x, \Pi_jB_j\Pi_j^*B^{-1}x)_X \\
&\le \sum_{i,j=1}^m \|\Pi_iB_i\Pi_i^*B^{-1}x\|_X\,\|\Pi_jB_j\Pi_j^*B^{-1}x\|_X \\
&\le c_k\Big(\sum_{i=1}^m \|\Pi_iB_i\Pi_i^*B^{-1}x\|_X^2\Big)^{1/2}\Big(\sum_{j=1}^m \|\Pi_jB_j\Pi_j^*B^{-1}x\|_X^2\Big)^{1/2}.
\end{aligned}
\]

Since

\[
\begin{aligned}
\sum_{j=1}^m \|\Pi_jB_j\Pi_j^*B^{-1}x\|_X^2 &= \sum_{j=1}^m (B_j\Pi_j^*B^{-1}x, B_j\Pi_j^*B^{-1}x)_{X_j} \\
&= \sum_{j=1}^m \langle B_j\Pi_j^*B^{-1}x, \Pi_j^*B^{-1}x\rangle \\
&= \sum_{j=1}^m (\Pi_jB_j\Pi_j^*B^{-1}x, x)_X \\
&= \langle TB^{-1}x, x\rangle_X,
\end{aligned}
\]

we have
\[
(TB^{-1}x, x)_X \le c_k^{1/2}(TB^{-1}x, x)_X^{1/2}\|x\|_X, \qquad\text{hence}\qquad (TB^{-1}x, x)_X \le c_k\|x\|_X^2.
\]

So property (i) is satisfied:
\[
\|x\|_{T^{-1}} = \sup_{y\in X^*}\frac{\langle x, y\rangle}{\|y\|_T} = \sup_{y\in X^*}\frac{(x, By)_X}{(TB^{-1}By, By)_X^{1/2}} \ge \sup_{y\in X^*}\frac{(x, By)_X}{c_k^{1/2}\|By\|_X} \ge \frac{1}{c_k^{1/2}}\|x\|_X.
\]

For property (iii), we can reduce it to a stable decomposition in the $H^1$ norm and the $L^2$ norm: there should exist $x_0 = \sum_{i=1}^m x_i$, with $x_i \in V_i \times Q_i$, such that
\[
\|x_0\|_{T^{-1}}^2 \le \sum_{i=1}^m \|x_i\|_{X_i}^2.
\]

This inequality holds because
\[
\begin{aligned}
\langle x_0, y\rangle &= \sum_{i=1}^m \langle\Pi_ix_i, y\rangle = \sum_{i=1}^m (x_i, B_i\Pi_i^*y)_{X_i} \\
&\le \sum_{i=1}^m \|x_i\|_{X_i}\,\|B_i\Pi_i^*y\|_{X_i} \\
&\le \Big(\sum_{i=1}^m \|x_i\|_{X_i}^2\Big)^{1/2}\Big(\sum_{i=1}^m \|B_i\Pi_i^*y\|_{X_i}^2\Big)^{1/2} \\
&\le \Big(\sum_{i=1}^m \|x_i\|_{X_i}^2\Big)^{1/2}\Big\langle\sum_{i=1}^m \Pi_iB_i\Pi_i^*y, y\Big\rangle^{1/2} \\
&= \Big(\sum_{i=1}^m \|x_i\|_{X_i}^2\Big)^{1/2}\|y\|_T,
\end{aligned}
\]

we have

\[
\|x_0\|_{T^{-1}}^2 = \left(\sup_{y\in X^*}\frac{\langle x_0, y\rangle}{\|y\|_T}\right)^2 \le \left(\sup_{y\in X^*}\frac{\big(\sum_{i=1}^m\|x_i\|_{X_i}^2\big)^{1/2}\|y\|_T}{\|y\|_T}\right)^2 = \sum_{i=1}^m \|x_i\|_{X_i}^2.
\]

So if for any $x = (u, p)^T$ there exist $u = \sum_{i=1}^m u_i + \hat u$ and $p = \sum_{i=1}^m p_i + \hat p$ such that
\[
\sum_{i=1}^m \|u_i\|_{A_i}^2 + \|\hat u\|_{\hat A}^2 \lesssim \|u\|_A^2, \qquad \sum_{i=1}^m \|p_i\|_0^2 + \|\hat p\|_0^2 \lesssim \|p\|_0^2,
\]

and we set $x_i = (u_i, p_i)^T$, $x_0 = \sum_i \Pi_ix_i$, and $\hat x = (\hat u, \hat p)^T$, then

\[
\|x_0\|_{T^{-1}}^2 + \|\hat x_1\|_{\hat X}^2 \le \sum_{i=1}^m \|x_i\|_{X_i}^2 + \|\hat x_1\|_{\hat X}^2 \lesssim \|x\|_X^2.
\]

Therefore, by Theorem 8.3.2, we have the estimate of the SPD preconditioner for the Stokes equations.

Theorem 9.2.2. For the Stokes equations (9.2) on $X = V \times Q$ satisfying the Brezzi conditions (9.3), we consider an auxiliary space $\hat X = \hat V \times \hat Q$. If we can define an interpolation $\hat\Pi: \hat X \to X$ which satisfies the following properties:

1. $\|\hat\Pi\hat x\|_X \le c_2\|\hat x\|_{\hat X}$, $\forall\hat x \in \hat X$;

2. For every $x \in X$, there exist $x_0 \in X$ and $\hat x_1 \in \hat X$ such that $x = x_0 + \hat\Pi\hat x_1$ and

\[
\|x_0\|_{T^{-1}}^2 + \|\hat x_1\|_{\hat X}^2 \le c_3^2\|x\|_X^2.
\]

Then there exists a constant c which does not depend on the mesh size so that

\[
\kappa(M_BA) \le c.
\]

Therefore the convergence rate of MINRES with $M_B$ as preconditioner does not depend on the mesh size.

9.3 Some Examples

In this section, some examples are given to demonstrate the application of the abstract framework presented in the previous section. Consider a high order stable finite element pair $P_r \times P_k^{-1}$ with $r > 1$, $k > 0$. We take the auxiliary space to be a lower order element pair $P_2^0 \times P_0^{-1}$ or $P_1^+ \times P_0^{-1}$, which is a subspace of $P_r \times P_k^{-1}$. We have a two-level preconditioner as follows:

\[
M_B = T + \hat BQ \quad (9.14)
\]

where $\hat B$ is the block diagonal preconditioner on the auxiliary space and $Q$ is the $L^2$ projection. It is well known that the $L^2$ projection satisfies the desired properties.
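A minimal sketch of the two-level action (9.14), assuming the prolongation from the auxiliary space is given as a matrix P and that the $L^2$ projection $Q$ is realized, up to mass matrix scaling, by $P^T$ (an assumption of this sketch):

```python
import numpy as np

def make_fasp_MB(apply_T, P, solve_Bhat):
    """Two-level FASP action M_B = T + Bhat Q of (9.14)."""
    def apply_MB(r):
        r_c = P.T @ r                 # restrict the residual to the auxiliary space
        z_c = solve_Bhat(r_c)         # block diagonal solve on the auxiliary space
        return apply_T(r) + P @ z_c   # additive two-level combination
    return apply_MB
```

Here apply_T can be the additive Schwarz smoother sketched in Section 9.2.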

9.3.1 Use a Lower Order Velocity Space Pair as an Auxiliary Space

First, we take the two conforming elements $P_2^0$-$P_0^{-1}$ and $P_1^+$-$P_0^{-1}$ as an example; they are two of the simplest elements satisfying the inf-sup condition. The degrees of freedom of these two elements are shown in Figure 9.1. Denote by $V_H^0$ the divergence-free space of the $P_1^+$-$P_0^{-1}$ element and by $V_h^0$ the divergence-free space of the $P_2^0$-$P_0^{-1}$ element.

Figure 9.1. The $P_2^0$-$P_0^{-1}$ elements and $P_1^+$-$P_0^{-1}$ elements.

In one element $\tau$, assume that $u \in V_h^0$ can be written on $\tau$ as
\[
u|_\tau = \sum \alpha_i\lambda_i + \sum \alpha_{ij}\lambda_i\lambda_j.
\]
Then we define $\Pi_h$ as follows:
\[
\Pi_hu|_\tau = \sum \alpha_i\lambda_i + \sum \alpha_{ij}\lambda_i\lambda_j\,\vec n_{ij}. \quad (9.15)
\]

Then we can prove that this projection satisfies the other two properties of Theorem 9.2.2.

Lemma 9.3.1. If the interpolation $\Pi_h: X = P_2^0 \times P_0^{-1} \to \hat X = P_1^+ \times P_0^{-1}$ is defined as in (9.15), then for every $u \in X$ there exist $u_0 \in X$ and $\hat u_1 \in \hat X$ such that $u = u_0 + \hat u_1$ and
\[
\|u_0\|_{T^{-1}}^2 + \|\hat u_1\|_{\hat X}^2 \le c_3^2\|u\|_X^2.
\]

Proof. For each $\tau \in \mathcal T$,
\[
|u - \Pi_hu|_{1,\tau}^2 = \Big|\sum \alpha_{ij}\lambda_i\lambda_j\,\vec\tau_{ij}\Big|_{1,\tau}^2 \lesssim \Big|\int_{\partial\tau}(u\cdot\vec\tau)\Big|^2 \lesssim \Big|\int_{\partial\tau}u\Big|^2 \lesssim \|u\|_{1,\tau}^2.
\]

So,
\[
|u - \Pi_hu|_1^2 = \sum_{\tau\in\mathcal T}|u - \Pi_hu|_{1,\tau}^2 \lesssim \sum_{\tau\in\mathcal T}\|u\|_{1,\tau}^2 = \|u\|_1^2 \lesssim |u|_1^2.
\]
By the Bramble-Hilbert lemma,
\[
\|u - \Pi_hu\|_0^2 \lesssim h^2|u|_1^2,
\]
and
\[
|\Pi_hu|_1 \le |u|_1 + |u - \Pi_hu|_1 \lesssim |u|_1.
\]
For any $u \in X$, let $u_0 = u - \Pi_hu$ and $u_1 = \Pi_hu \in \hat X$; we have the desired estimate.

The numerical results are also positive: we obtain uniform convergence with respect to the mesh size. The outer iteration is GMRES. Even if the problem in the auxiliary space is not solved exactly but only approximated by one multigrid step, we still observe uniform convergence.

Mesh        MA    MA-multi    MB
4 × 4        8       11       14
8 × 8       12       13       23
16 × 16     12       13       31
32 × 32     11       13       30
64 × 64     10       13       28
128 × 128    8       12       26

Table 9.1. Number of iterations when using $P_1^+$-$P_0^{-1}$ elements as a preconditioner for the Stokes equation with $P_2^0$-$P_0^{-1}$ elements.

9.3.2 Use a Lower Order Pressure Space as an Auxiliary Space

If we use the $P_2^+$-$P_1^{-1}$ element as the auxiliary space for the $P_2^0$-$P_0^{-1}$ element, the divergence-free space $V_H^0$ of $P_2^0$-$P_0^{-1}$ is a subspace of the divergence-free space $V_h^0$ of $P_2^+$-$P_1^{-1}$. So we can choose $\Pi_H$ to be the inclusion.

Figure 9.2. The $P_2^+$-$P_1^{-1}$ elements and $P_2^0$-$P_0^{-1}$ elements.

The numerical results are shown in Table 9.2.

Mesh       MA    MA-multi    MB
4 × 4      13       6        23
8 × 8      14       6        27
16 × 16    14       6        25
32 × 32    13       5        23
64 × 64    13       5        23

Table 9.2. Number of iterations when using $P_2^+$-$P_1^{-1}$ elements as a preconditioner for the Stokes equation with $P_2^0$-$P_0^{-1}$ elements.

We can also use a discontinuous-pressure pair as the auxiliary space for continuous-pressure pairs, for example using $P_2^0$-$P_0^{-1}$ elements as the auxiliary space of the $P_2^0$-$P_1^0$ (Taylor-Hood) elements; cf. Figure 9.3.

Figure 9.3. The $P_2^0$-$P_1^0$ elements and $P_2^0$-$P_0^{-1}$ elements.

The numerical results are shown in Table 9.3.

Mesh       MA    MA-multi    MB
4 × 4      13       6        23
8 × 8      14       6        27
16 × 16    14       6        25
32 × 32    13       5        23
64 × 64    13       5        23

Table 9.3. Number of iterations when using $P_2^0$-$P_0^{-1}$ elements as a preconditioner for the Stokes equation with $P_2^0$-$P_1^0$ (Taylor-Hood) elements.

Chapter 10

Conclusions

10.1 Conclusions

In this thesis, we have developed new multigrid techniques and theories based on the auxiliary grid hierarchy. The aim is to provide efficient, stable, and parallel solvers for practical applications on general unstructured grids. More specifically, in this thesis,

1. We extend the algorithm and theory of the fast auxiliary space preconditioner of Xu [47] to shape-regular grids. For a general shape-regular unstructured grid, we present a construction of an auxiliary coarse grid hierarchy. By the auxiliary space preconditioning technique, we prove that the convergence rate of the multigrid-preconditioned CG for the elliptic problem is $1 - O(1/\log N)$.

2. We develop a new parallel auxiliary grid AMG method. In the construction of the hierarchical coarse grid, we use a simple and fixed coarsening procedure based on the quadtree generated from the auxiliary grid. This allows us to explicitly control the sparsity patterns and operator complexities of the AMG solver. This feature provides (nearly) optimal load balancing and predictable communication patterns on shape-regular grids, which makes our new algorithm suitable for parallel computing, especially on GPUs.

3. We design a new parallel colored block Gauss-Seidel method for general unstructured grids. A parallel coloring algorithm for the auxiliary tree has been designed. The new algorithm uses a fixed number of colors and allows us to control the data communication so that we can implement efficient Gauss-Seidel smoothers for any unstructured grid with MPI.

4. We extend the FASP algorithm and theory to indefinite problems. We prove that the SPD FASP preconditioner is optimal for indefinite problems, and we numerically verify the optimality of the general FASP preconditioner for the Stokes equations.

10.2 Future works

In this dissertation, the work focused on linear solvers for simple models. Possible future work includes both extensions and verification of the currently developed algorithms in practical applications, as well as studies on massively parallel computing systems.

• Improve the auxiliary space parallel AMG method for highly locally refined meshes and high order elements, and apply it to practical problems such as fluid-structure interaction and reservoir simulation.

• Implement the auxiliary space parallel AMG method on MPI and investigate the scalability of the implementation.

• Prove the FASP theory for general preconditioners for the indefinite problem and apply the algorithm to solve the Navier-Stokes equations.

Bibliography

[1] Golub, G. H. and C. F. Van Loan (1996) Matrix Computations, third ed., Johns Hopkins University Press, Baltimore, MD, USA.

[2] Duff, I. and J. Reid (1983) “The Multifrontal Solution of Indefinite Sparse Symmetric Linear Equations,” ACM Transactions on Mathematical Software (TOMS), 9(3), pp. 302–325.

[3] Heath, M., E. Ng, and B. Peyton (1991) “Parallel algorithms for sparse linear systems,” SIAM Reviews, 33, pp. 420 – 460.

[4] Duff, I. (1998) Direct methods, Technical Report RAL-98-056, Rutherford Appleton Laboratory, Oxfordshire, UK.

[5] Dongarra, J., I. Duff, D. Sorensen, and H. van der Vorst (1998) Numerical Linear Algebra for High-Performance Computers, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.

[6] Hestenes, M. R. and E. Stiefel (1952) “Methods of conjugate gradients for solving linear systems,” Journal of Research of the National Bureau of Standards, 49(6), pp. 409–436.

[7] Paige, C. C. and M. A. Saunders (1975) “Solution of sparse indefinite systems of linear equations,” SIAM Journal on Numerical Analysis, 12(4), pp. 617–629.

[8] Saad, Y. and M. H. Schultz (1986) “GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM Journal on scientific and statistical computing, 7(3), pp. 856–869.

[9] Wesseling, P. (1992) An introduction to multigrid methods, John Wiley & Sons, Ltd., New York.

[10] Briggs, W. L., V. E. Henson, and S. F. McCormick (2000) A Multigrid Tutorial, vol. 72, second ed., Society for Industrial and Applied Mathematics, Philadelphia, PA.

[11] Hackbusch, W. (1985) Multi-grid methods and applications, vol. 4, Springer-Verlag Berlin.

[12] Bank, R. E. and T. Dupont (1981) “An optimal order process for solving finite element equations,” Mathematics of Computation, 36(153), pp. 35–51.

[13] Braess, D. and W. Hackbusch (1983) “A new convergence proof for the multigrid method including the V-cycle,” SIAM Journal on Numerical Analysis, 20(5), pp. 967– 975.

[14] Bramble, J. H. and J. E. Pasciak (1987) “New convergence estimates for multigrid algorithms,” Mathematics of computation, 49(180), pp. 311–329.

[15] Yserentant, H. (1986) “On the multi-level splitting of finite element spaces,” Numerische Mathematik, 49(4), pp. 379–412.

[16] Bramble, J. H., J. E. Pasciak, and J. Xu (1990) “Parallel multilevel preconditioners,” Mathematics of Computation, 55(191), pp. 1–22.

[17] Bramble, J. H., J. E. Pasciak, J. P. Wang, and J. Xu (1991) “Convergence estimates for multigrid algorithms without regularity assumptions,” Mathematics of Computation, 57(195), pp. 23–45.

[18] Xu, J. (1989) Theory of multilevel methods, Ph.D. thesis, Cornell University.

[19] ——— (1992) “Iterative methods by space decomposition and subspace correction,” SIAM Review, 34(4), pp. 581–613.

[20] Yserentant, H. (1993) “Old and new convergence proofs for multigrid methods,” Acta numerica, 2, pp. 285–326.

[21] Chen, L., R. Nochetto, and J. Xu (2012) “Optimal multilevel methods for graded bisection grids,” Numerische Mathematik, 120(1), pp. 1–34.

[22] Xu, J., L. Chen, and R. H. Nochetto (2009) “Optimal multilevel methods for H(grad), H(curl), and H(div) systems on graded and unstructured grids,” in Multiscale, Nonlinear and Adaptive Approximation (R. DeVore and A. Kunoth, eds.), Springer Berlin Heidelberg, pp. 599–659.

[23] Chen, L. and C. Zhang (2010) “A coarsening algorithm on adaptive grids by newest vertex bisection and its applications,” J. Comput. Math, 28(6), pp. 767–789.

[24] Brandt, A., S. McCormick, and J. Ruge (1984) “Algebraic multigrid (AMG) for sparse matrix equations,” in Sparsity and its Applications, Cambridge University Press, pp. 257–284.

[25] Ruge, J. W. and K. Stüben (1985) “Efficient solution of finite difference and finite element equations by algebraic multigrid (AMG),” in Multigrid Methods for Integral and Differential Equations (D. J. Paddon and H. Holstein, eds.), Clarendon Press, Oxford, pp. 169–212.

[26] Stüben, K. (1983) “Algebraic multigrid (AMG): experiences and comparisons,” Applied Mathematics and Computation, 13(3-4), pp. 419–451.

[27] Brandt, A., S. McCormick, and J. Ruge (1984) “Algebraic multigrid (AMG) for sparse matrix equations,” in Sparsity and Its Applications, edited by Evans, D., pp. 257–284.

[28] Ruge, J. and K. Stüben (1987) “Algebraic multigrid (AMG),” in Multigrid Methods (S. McCormick, ed.), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 73–130.

[29] Vaněk, P. (1992) “Acceleration of convergence of a two-level algorithm by smoothing transfer operators,” Applications of Mathematics, 37(4), pp. 265–274.

[30] Vaněk, P., J. Mandel, and M. Brezina (1996) “Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems,” Computing, 56(3), pp. 179–196.

[31] Brezina, M., R. Falgout, S. MacLachlan, T. Manteuffel, S. McCormick, and J. Ruge (2005) “Adaptive smoothed aggregation (SA) multigrid,” SIAM Review, 47(2), pp. 317–346.

[32] Jones, J. and P. Vassilevski (2002) “AMGe based on element agglomeration,” SIAM Journal on Scientific Computing, 23(1), pp. 109–133.

[33] Lashuk, I. and P. Vassilevski (2008) “On some versions of the element agglomeration AMGe method,” Numerical Linear Algebra with Applications, 15(7), pp. 595–620.

[34] Bulgakov, V. (1993) “Multi-level iterative technique and aggregation concept with semi-analytical preconditioning for solving boundary-value problems,” Communications in Numerical Methods in Engineering, 9(8), pp. 649–657.

[35] Braess, D. (1995) “Towards algebraic multigrid for elliptic problems of second order,” Computing, 55(4), pp. 379–393.

[36] Henson, V. and U. Yang (2002) “BoomerAMG: A parallel algebraic multigrid solver and preconditioner,” Applied Numerical Mathematics, 41(1), pp. 155–177.

[37] Brannick, J., Y. Chen, X. Hu, and L. Zikatanov (2013) “Parallel unsmoothed aggregation algebraic multigrid algorithms on GPUs,” in Numerical Solution of Partial Differential Equations: Theory, Algorithms, and Their Applications, Springer, pp. 81–102.

[38] Wang, L., X. Hu, J. Cohen, and J. Xu (2013) “A Parallel Auxiliary Grid Algebraic Multigrid Method for Graphic Processing Units,” SIAM Journal on Scientific Computing, 35(3), pp. C263–C283.

[39] Thum, P., H. J. Diersch, and R. Gründler (2011) “SAMG – The linear solver for groundwater simulation,” in MODELCARE 2011, 8th International Conference on Calibration and Reliability in Groundwater Modeling, Leipzig, Germany, p. 183.

[40] Vanek,ˇ P. (1995) “Fast multigrid solver,” Applications of Mathematics, 40(1), pp. 1–20.

[41] Stüben, K. (1999) Algebraic multigrid (AMG): an introduction with applications, in: U. Trottenberg, C. W. Oosterlee, A. Schüller (eds.), Multigrid, Academic Press, New York, 2000. Also GMD Report 53, March 1999.

[42] Stuben,¨ K. (2001) “A review of algebraic multigrid,” Journal of Computational and Applied Mathematics, 128(1), pp. 281–309.

[43] Falgout, R. D., P. S. Vassilevski, and L. T. Zikatanov (2005) “On two-grid convergence estimates,” Numerical Linear Algebra with Applications, 12(5-6), pp. 471–494.

[44] Falgout, R. D. and P. S. Vassilevski (2004) “On generalizing the algebraic multigrid framework,” SIAM Journal on Numerical Analysis, 42(4), pp. 1669–1693.

[45] Brezina, M., P. Vaněk, and P. S. Vassilevski (2012) “An improved convergence analysis of smoothed aggregation algebraic multigrid,” Numerical Linear Algebra with Applications, 19(3), pp. 441–469.

[46] Vanek, P., J. Mandel, and M. Brezina (1994) Algebraic multigrid on unstructured meshes, vol. 34, UCD/CCM Report.

[47] Xu, J. (1996) “The auxiliary space method and optimal multigrid preconditioning techniques for unstructured grids,” Computing, 56, pp. 215–235.

[48] Bramble, J. H., J. E. Pasciak, and J. Xu (1991) “The analysis of multigrid algorithms with nonnested spaces or noninherited quadratic forms,” Mathematics of Computation, 56(193), pp. 1–34.

[49] Bank, R. E. and J. Xu (1995) “A hierarchical basis multigrid method for unstructured grids,” Notes on Numerical Fluid Mechanics, 49, pp. 1–1.

[50] Finkel, R. and J. Bentley (1974) “Quad trees: a data structure for retrieval on composite keys,” Acta Informatica, 4(1), pp. 1–9.

[51] Grasedyck, L., W. Hackbusch, and S. LeBorne (2001) Adaptive refinement and clustering of H-matrices, Tech. Rep. 106, Max Planck Institute of Mathematics in the Sciences.

[52] Grasedyck, L., W. Hackbusch, and S. Le Borne (2003) “Adaptive Geometrically Balanced Clustering of H-Matrices,” Computing, 73, pp. 1–23.

[53] Feuchter, D., I. Heppner, S. Sauter, and G. Wittum (2003) “Bridging the gap between geometric and algebraic multi-grid methods,” Computing and visualization in science, 6(1), pp. 1–13.

[54] Feichtinger, C., S. Donath, H. Kostler¨ , J. Gotz¨ , and U. Rude¨ (2011) “WaL- Berla: HPC software design for computational engineering simulations,” Journal of Computational Science, 2(2), pp. 105–112. URL http://www.sciencedirect.com/science/article/pii/S1877750311000111 [55] Falgout, R. and U. Yang (2002) “Hypre: a library of high performance precondi- tioners,” in Proceedings of the International Conference on Computational SciencePart III (J. J. D. P M A Sloot C J K Tan and A. G. Hoekstra, eds.), vol. 2331, Springer- Verlag London, UK, pp. 632–641. URL http://www.springerlink.com/index/G0FJ31WFNHCGEUBL.pdf [56] Gee, M., C. Siefert, J. Hu, R. Tuminaro, and M. Sala (2006) ML 5.0 smoothed aggregation user’s guide, Sandia National Laboratories.

[57] Bungartz, H.-J., M. Mehl, T. Neckel, and T. Weinzierl (2010) “The PDE framework Peano applied to fluid dynamics: an efficient implementation of a parallel multiscale fluid dynamics solver on octree-like adaptive Cartesian grids,” Computational Mechanics, 46(1), pp. 103–114.

[58] Goodnight, N., C. Woolley, G. Lewin, D. Luebke, and G. Humphreys (2003) “A multigrid solver for boundary value problems using programmable graphics hardware,” in HWWS ’03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 102–111. URL http://portal.acm.org/citation.cfm?doid=1198555.1198784

[59] Bolz, J., I. Farmer, E. Grinspun, and P. Schröder (2003) “Sparse matrix solvers on the GPU: conjugate gradients and multigrid,” ACM Transactions on Graphics, 22(3), pp. 917–924. URL http://portal.acm.org/citation.cfm?id=882364

[60] Göddeke, D., R. Strzodka, J. Mohd-Yusof, P. McCormick, H. Wobker, C. Becker, and S. Turek (2008) “Using GPUs to improve multigrid solver performance on a cluster,” International Journal of Computational Science and Engineering, 4(1), pp. 36–55. URL http://inderscience.metapress.com/index/92mm0gg171405064.pdf

[61] Grossauer, H. and P. Thoman (2008) “GPU-based multigrid: Real-time performance in high resolution nonlinear image processing,” International Journal of Computer Vision, pp. 141–150. URL http://www.springerlink.com/index/X3971L27X3004143.pdf

[62] Feng, Z. and Z. Zeng (2010) “Parallel Multigrid Preconditioning on Graphics Processing Units (GPUs) for Robust Power Grid Analysis,” Power, pp. 661–666. URL http://portal.acm.org/citation.cfm?id=1837274.1837443

[63] Feng, C., S. Shu, J. Xu, and C.-S. Zhang (2013) “Numerical Study of Geometric Multigrid Methods on CPU–GPU Heterogeneous Computers,” Advances in Applied Mathematics and Mechanics, arXiv:1208.4247v1. URL http://arxiv.org/abs/1208.4247

[64] Cleary, A., V. Falgout, R.and Henson, and J. Jones (1998) “Coarse-grid selection for parallel algebraic multigrid,” IRREGULAR ’98 Proceedings of the 5th International Symposium on Solving Irregularly Structured Problems in Parallel, pp. 104–115. URL http://www.springerlink.com/index/M145530406718324.pdf

[65] Tuminaro, R. (2000) “Parallel smoothed aggregation multigrid: aggregation strategies on massively parallel machines,” in Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM), pp. 0–20. URL http://portal.acm.org/citation.cfm?id=370049.370060

[66] Krechel, A. and K. Stüben (2001) “Parallel algebraic multigrid based on subdomain blocking,” Parallel Computing, 27(8), pp. 1009–1031. URL http://linkinghub.elsevier.com/retrieve/pii/S0167819101000801

[67] De Sterck, H., U. Yang, and J. Heys (2006) “Reducing complexity in parallel algebraic multigrid preconditioners,” SIAM Journal on Matrix Analysis and Applica- tions, 27(4), p. 1019. URL http://link.aip.org/link/SJMAEL/v27/i4/p1019/s1&Agg=doi

[68] Chow, E., R. Falgout, J. Hu, R. Tuminaro, and U. Yang (2006) A survey of parallelization techniques for multigrid solvers, Tech. rep., Frontiers of Scientific Computing, M. Heroux, P. Raghavan, H. Simon, Eds., SIAM, submitted. Also available as Lawrence Livermore National Laboratory technical report UCRL-BOOK-205864.

[69] Joubert, W. and J. Cullum (2006) “Scalable algebraic multigrid on 3500 processors,” Group, 23, pp. 105–128.

[70] Bell, N., S. Dalton, and L. Olson (2012) “Exposing fine-grained parallelism in algebraic multigrid methods,” SIAM Journal on Scientific Computing, 34(4), pp. 123–152.

[71] Kraus, J. and M. Förster (2012) “Efficient AMG on heterogeneous systems,” in Facing the Multicore-Challenge II (R. Keller, D. Kramer, and J.-P. Weiss, eds.), vol. 7174 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp. 133–146. URL http://dx.doi.org/10.1007/978-3-642-30397-5_12

[72] Strikwerda, J. C. (1984) “An Iterative Method for Solving Finite Difference Approximations to the Stokes Equations,” SIAM Journal on Numerical Analysis, 21(3), pp. 447–458.

[73] Elman, H. C. (2002) “Preconditioners for saddle point problems arising in computational fluid dynamics,” Applied Numerical Mathematics, 43(1-2), pp. 75–89. URL http://linkinghub.elsevier.com/retrieve/pii/S0168927402001186

[74] Mardal, K. and R. Winther (2004) “Uniform preconditioners for the time dependent Stokes problem,” Numerische Mathematik, pp. 305–327. URL http://link.springer.com/article/10.1007/s00211-004-0529-6

[75] Elman, H. C., V. E. Howle, J. Shadid, D. Silvester, and R. S. Tuminaro (2007) “Least squares preconditioners for stabilized discretizations of the Navier-Stokes Equations,” SIAM Journal of Scientific Computing, 30(1), pp. 290–311.

[76] Kay, D., D. Loghin, and A. Wathen (2002) “A Preconditioner for the Steady-State Navier-Stokes Equations,” SIAM J Scientific Computing, 24(1), pp. 237 – 256.

[77] Silvester, D., H. C. Elman, D. Kay, and A. Wathen (2001) “Efficient preconditioning of the linearized Navier-Stokes equations for incompressible flow,” J. Comp. Appl. Math., 128(1-2), pp. 261–279.

[78] Benzi, M. and M. Olshanskii (2006) “An Augmented Lagrangian-Based Approach to the Oseen Problem,” SIAM J Scientific Computing, 28(6), pp. 2095 – 2113.

[79] de Niet, A. and F. Wubs (2007) “Two preconditioners for saddle point problems in fluid flows,” International Journal for Numerical Methods in Fluids, 54(4), pp. 355 – 377.

[80] Rehman, M., C. Vuik, and G. Segal (2008) “A comparison of preconditioners for incompressible Navier–Stokes solvers,” International Journal for Numerical Methods in Fluids, 57(12), pp. 1731–1751.

[81] Segal, A., M. ur Rehman, and C. Vuik (2010) “Preconditioners for Incompressible Navier-Stokes Solvers,” Numerical Mathematics: Theory, Methods and Applications, 3(3), pp. 245–275. URL http://www.global-sci.org/nmtma/readabs.php?vol=3&no=3&doc=245&year=2010&ppage=275

[82] Rehman, M., T. Geenen, C. Vuik, G. Segal, and S. P. Maclachlan (2011) “On iterative methods for the incompressible Stokes problem,” International Journal for Numerical Methods in Fluids, 65(10), pp. 1180–1200.

[83] Kahan, W. (1958) Gauss-Seidel Methods of Solving Large Systems of Linear Equations, Ph.D. thesis, University of Toronto.

[84] Daniel, J. W. (1967) “The conjugate gradient method for linear and nonlinear operator equations,” SIAM Journal on Numerical Analysis, 4(1), pp. 10–26.

[85] Liesen, J. and Z. Strakoš (2012) Krylov subspace methods: principles and analysis, Numerical Mathematics and Scientific Computation, Oxford University Press, Oxford.

[86] Ciarlet, P. (1991) Basic error estimates for elliptic problems, In: P. Ciarlet and J.-L. Lions, eds., Handbook of Numerical Analysis, Volume II, pp. 17–352. North Holland.

[87] Griebel, M. (1994) “Multilevel algorithms considered as iterative methods on semidefinite systems,” SIAM J. Sci. Comput., 15(3), pp. 547 – 565.

[88] Griebel, M. and P. Oswald (1995) “On additive Schwarz preconditioners for sparse grid discretizations,” Numerische Mathematik, 66(4), pp. 449 – 463.

[89] Widlund, O. B. (1992) “Some Schwarz methods for symmetric and nonsymmetric elliptic problems,” in Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations (D. Keyes, T. Chan, G. Meurant, J. Scroggs, and R. Voigt, eds.), SIAM, Philadelphia, PA, USA, pp. 19–36.

[90] Xu, J. and L. Zikatanov (2002) “The method of alternating projections and the method of subspace corrections in Hilbert space,” Journal of the American Mathematical Society, 15(3), pp. 573–597.

[91] Cho, D., J. Xu, and L. T. Zikatanov (2008) “New estimates for the rate of convergence of the method of subspace corrections,” Numerical Mathematics: Theory, Methods and Applications, 1(1), pp. 44–56.

[92] Dahmen, W. and A. Kunoth (1992) “Multilevel preconditioning,” Numerische Mathematik, 63(1), pp. 315–344. URL http://dx.doi.org/10.1007/BF01385864

[93] Oswald, P. (1992) “On discrete norm estimates related to multilevel preconditioners in the finite element method,” in Proc. Intern. Conf. Constructive Theory of Functions, Varna 1991 (K. G. Ivanov, P. Petrushev, and B. Sendov, eds.), Publ. House Bulgarian Academy of Sciences, pp. 203–214.

[94] Bornemann, F. and H. Yserentant (1993) “A basic norm equivalence for the theory of multilevel methods,” Numerische Mathematik, 64(1), pp. 455–476.

[95] Oswald, P. (1994) Multilevel Finite Element Approximation, Theory and Applications, Teubner Skripten zur Numerik.

[96] Bramble, J. H. and J. E. Pasciak (1993) “New estimates for multilevel algorithms including the V-cycle,” Mathematics of Computation, 60(202), pp. 447–471.

[97] Bramble, J. H., J. E. Pasciak, J. P. Wang, and J. Xu (1991) “Convergence estimates for multigrid algorithms without regularity assumptions,” Mathematics of Computation, 57(195), pp. 23–45.

[98] Xu, J. (1997) “An introduction to multigrid convergence theory,” in Iterative Methods in Scientific Computing (R. Chan, T. Chan, and G. Golub, eds.), Springer-Verlag, pp. 169 – 242.

[99] Nepomnyaschikh, S. (1992) “Decomposition and fictitious domains methods for elliptic boundary value problems,” in Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations, pp. 62–72.

[100] Hu, X., C. S. Zhang, S. Wu, S. Zhang, X. Wu, J. Xu, and L. Zikatanov (2013) “Combined Preconditioning with Applications in Reservoir Simulation,” Multiscale Modeling and Simulation, 11(2), pp. 507–521.

[101] Hu, X., J. Xu, and C.-S. Zhang (2013) “Application of auxiliary space preconditioning in field-scale reservoir simulations,” Science in China Series A: Mathematics.

[102] Saad, Y. and J. Zhang (2001) “Enhanced multi-level block ILU preconditioning strategies for general sparse linear systems,” Journal of Computational and Applied Mathematics, 130(1-2), pp. 99 – 118.

[103] Broker,¨ O. and M. J. Grote (2002) “Sparse approximate inverse smoothers for geometric and algebraic multigrid,” Applied Numerical Mathematics, 41(1), pp. 61 – 80.

[104] Adams, M., M. Brezina, J. Hu, and R. Tuminaro (2003) “Parallel multigrid smoothing: polynomial versus Gauss-Seidel,” Journal of Computational Physics, 188, pp. 593–610.

[105] Kim, H., J. Xu, and L. Zikatanov (2003) “A multigrid method based on graph matching for convection-diffusion equations,” Numerical Linear Algebra with Applications, 10(1-2), pp. 181–195. URL http://doi.wiley.com/10.1002/nla.317

[106] Kraus, J. (2002) “An algebraic preconditioning method for M-matrices: linear versus non-linear multilevel iteration,” Numer. Linear Algebra Appl., 9(8), pp. 599–618. URL http://dx.doi.org/10.1002/nla.281

[107] Notay, Y. and P. Vassilevski (2008) “Recursive Krylov-based multigrid cycles,” Numerical Linear Algebra with Applications, 15, pp. 473–487. URL http://www3.interscience.wiley.com/cgi-bin/abstract/114286660/ABSTRACT

[108] Hu, X., P. S. Vasilevski, and J. Xu (2013) “Comparative convergence analysis of nonlinear AMLI-cycle multigrid,” SIAM Journal on Numerical Analysis, 51(2), pp. 1349–1369.

[109] Notay, Y. (2010) “An aggregation-based algebraic multigrid method,” Electronic Transactions on Numerical Analysis, 37, pp. 123–146. URL http://www.emis.ams.org/journals/ETNA/vol.37.2010/pp123-146.dir/pp123-146.pdf

[110] Fletcher, R. and C. Reeves (1964) “Function minimization by conjugate gradients,” The Computer Journal, 7(2), pp. 149–154. URL http://comjnl.oupjournals.org/cgi/doi/10.1093/comjnl/7.2.149

[111] Bank, R. E. (2007) PLTMG: A software package for solving elliptic partial differential equations, Department of Mathematics, University of California.

[112] Deuflhard, P., P. Leinen, and H. Yserentant (1989) “Concepts of an adaptive hierarchical finite element code,” Impact Comput. Sci. Engrg., 1, pp. 3–35.

[113] Bergen, B. (2005) “Hierarchical Hybrid Grids: Data Structures and Core Algorithms for Efficient Finite Element Simulations on Supercomputers,” Erlangen, FAU, Advances in Simulation 14/2006, SCS Europe.

[114] Ciarlet, P. (1991) Basic error estimates for elliptic problems, In: P. Ciarlet and J.-L. Lions, eds., Handbook of Numerical Analysis, Volume II, pp. 17–352. North Holland.

[115] Liehr, F., T. Preusser, M. Rumpf, S. Sauter, and L. Schwen (2009) “Composite finite elements for 3D image based computing,” Computing and Visualization in Science, 12, pp. 171–188.

[116] Le Borne, S., L. Grasedyck, and R. Kriemann (2005) “Domain-decomposition based H-LU preconditioners,” in Proceedings of the 16th International Conference on Domain Decomposition Methods (O. B. Widlund, ed.), Springer-Verlag Berlin, pp. 67–674.

[117] Scott, L. R. and S. Zhang (1990) “Finite element interpolation of nonsmooth functions satisfying boundary conditions,” Mathematics of Computation, 54(190), pp. 483–493.

[118] Renouf, M., F. Dubois, and P. Alart (2004) “A parallel version of the non smooth contact dynamics algorithm applied to the simulation of granular media,” Journal of Computational and Applied Mathematics, 168(1–2), pp. 375–382.

[119] Fox, G., M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker (1988) Solving Problems on Concurrent Processors, Prentice Hall.

[120] Adams, L. M. and H. F. Jordan (1986) “Is SOR Color-Blind?” SIAM J. Sci. and Stat. Comput., 7(2), pp. 490 – 506.

[121] Golub, G. and J. Ortega (1993) Scientific Computing with an Introduction to Parallel Computing, Academic Press, Boston, MA.

[122] Kim, T. and C.-O. Lee (1999) “A parallel Gauss-Seidel method using NR data flow ordering,” Applied Mathematics and Computation, 99(2), pp. 209–220.

[123] Adams, M. F. (2001) “A distributed memory unstructured Gauss-Seidel algorithm for multigrid smoothers,” in Supercomputing, ACM/IEEE 2001 Conference, IEEE, pp. 14–14.

[124] Shang, Y. (2009) “A distributed memory parallel Gauss–Seidel algorithm for linear algebraic systems,” Computers & Mathematics with Applications, 57(8), pp. 1369– 1376.

[125] Courtecuisse, H. and J. Allard (2009) “Parallel dense Gauss-Seidel algorithm on many-core processors,” in High Performance Computing and Communications, 2009. HPCC’09. 11th IEEE International Conference on, IEEE, pp. 139–147.

[126] Ausiello, G., P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi (1999) Complexity and Approximation, Springer, Berlin.

[127] Bozdağ, D., A. H. Gebremedhin, F. Manne, E. G. Boman, and U. V. Catalyurek (2008) “A framework for scalable greedy coloring on distributed-memory parallel computers,” Journal of Parallel and Distributed Computing, 68(4), pp. 515–535.

[128] Benantar, M., R. Biswas, J. E. Flaherty, and M. S. Shephard (1990) “Parallel computation with adaptive methods for elliptic and hyperbolic systems,” Computer Methods in Applied Mechanics and Engineering, 82(1–3), pp. 73–93.

[129] Benantar, M., U. Doğrusöz, J. E. Flaherty, and M. Krishnamoorthy (1996) “Coloring quadtrees,” Manuscript.

[130] Eppstein, D., M. W. Bern, and B. L. Hutchings (1999) “Algorithms for Coloring Quadtrees,” CoRR, cs.CG/9907030. URL http://arxiv.org/abs/cs.CG/9907030

[131] Benantar, M. (1992) Parallel and Adaptive Algorithms for Elliptic Partial Differential Systems, Ph.D. thesis, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY.

[132] Griebel, M. NaSt3DGPF, A parallel 3D free surface flow solver, Insti- tute for Numerical Simulation, university of Bonn, http://wissrech.ins.uni- bonn.de/research/projects/NaSt3DGPF/index.htm. URL http://wissrech.ins.uni-bonn.de/research/projects/NaSt3DGPF/team. htm

[133] Dongarra, J., T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki (2012) MAGMA: a New Generation of Linear Algebra Library for GPU and Multicore Architectures, http://icl.cs.utk.edu/magma/index.html. URL http://icl.cs.utk.edu/projectsfiles/magma/pubs/ORNL_ Seminar-06-16-10.pdf

[134] Humphrey, J., D. Price, K. Spagnoli, A. Paolini, and E. Kelmelis (2010) “CULA: hybrid GPU accelerated linear algebra routines,” in Defense, vol. 7705, SPIE, pp. 770502–770502–7. URL http://dx.doi.org/doi/10.1117/12.850538

[135] Sengupta, S., M. Harris, Y. Zhang, and J. Owens (2007) “Scan primitives for GPU computing,” Computing, p. 106. URL http://portal.acm.org/citation.cfm?id=1280110

[136] Bell, N. and M. Garland (2009) “Implementing sparse matrix-vector multiplication on throughput-oriented processors,” in SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ACM, New York, NY, USA, pp. 1–11. URL http://portal.acm.org/citation.cfm?doid=1654059.1654078

[137] Baskaran, M. and R. Bordawekar (2008) Optimizing sparse matrix-vector multiplication on GPUs, Tech. Rep. RC24704, IBM.

[138] Klöckner, A. Iterative CUDA, Courant Institute of Mathematical Sciences, New York University, http://mathema.tician.de/software/iterative-cuda. URL http://mathema.tician.de/software/iterative-cuda

[139] Buatois, L., G. Caumon, and B. Lévy (2007) “Concurrent number cruncher: an efficient sparse linear solver on the GPU,” High Performance Computing and Communications, 4782, pp. 358–371. URL http://alice.loria.fr/OLD/publications/papers/2007/NumberCruncher/HPCC_number_cruncher.pdf

[140] Garland, M. and N. Bell (2010) CUSP: Generic parallel algorithms for sparse matrix and graph computations, http://cusplibrary.github.com/. URL http://cusp-library.googlecode.com

[141] Bell, N. and M. Garland (2008) Efficient sparse matrix-vector multiplication on CUDA, Tech. Rep. NVR-2008-004, NVIDIA Corporation. URL http://mgarland.org/files/papers/nvr-2008-004.pdf

[142] Haase, G., M. Liebmann, C. Douglas, and G. Plank (2010) “A parallel algebraic multigrid solver on graphics processing units,” High Performance Computing and Applications, pp. 38–47.

[143] Greenbaum, A. (1997) Iterative methods for solving linear systems, vol. 17 of Frontiers in Applied Mathematics, Society for Industrial and Applied Mathematics, Philadelphia, PA.

[144] Xu, J. and X.-C. Cai (1992) “A Preconditioned GMRES Method for Nonsymmetric or Indefinite Problems,” Mathematics of Computation, 59(200), pp. 311–319. URL http://www.jstor.org/stable/2153059

[145] Eisenstat, S. C., H. C. Elman, and M. H. Schultz (1983) “Variational iterative methods for nonsymmetric systems of linear equations,” SIAM J. Numer. Anal., 20, pp. 345–357.

[146] Murphy, M. F., G. H. Golub, and A. J. Wathen (2006) “A Note on Preconditioning for Indefinite Linear Systems,” SIAM J Scientific Computing, 21(6), pp. 1969–1972.

[147] Benzi, M. and V. Simoncini (2006) “On the eigenvalues of a class of saddle point matrices,” Numerische Mathematik, 103(2), pp. 173 – 196.

[148] Mardal, K.-A. and R. Winther (2011) “Preconditioning discretizations of systems of partial differential equations,” Numerical Linear Algebra with Applications, 18(1), pp. 1–40.

[149] Silvester, D. and A. Wathen (1994) “Fast iterative solution of stabilised Stokes systems part II: using general block preconditioners,” SIAM Journal on Numerical Analysis, 31(5), pp. 1352–1367.

[150] Simoncini, V. (2004) “Block triangular preconditioners for symmetric saddle-point problems,” Numerical Algorithms, Parallelism and Applications, 49(1), pp. 63 – 80.

[151] Loghin, D. and A. J. Wathen (2004) “Analysis of preconditioners for saddle-point problems,” SIAM Journal on Scientific Computing, 25(6), pp. 2029–2049.

[152] Silvester, D., H. Elman, D. Kay, and A. Wathen (2001) “Efficient preconditioning of the linearized Navier–Stokes equations for incompressible flow,” Journal of Computational and Applied Mathematics, 128(1), pp. 261–279.

[153] Kay, D., D. Loghin, and A. Wathen (2002) “A preconditioner for the steady-state Navier–Stokes equations,” SIAM Journal on Scientific Computing, 24(1), pp. 237–256.

[154] Elman, H., V. E. Howle, J. Shadid, R. Shuttleworth, and R. Tuminaro (2006) “Block preconditioners based on approximate commutators,” SIAM Journal on Scientific Computing, 27(5), pp. 1651–1668.

[155] Rehman, M. u., T. Geenen, C. Vuik, G. Segal, and S. MacLachlan (2011) “On iterative methods for the incompressible Stokes problem,” International Journal for Numerical methods in fluids, 65(10), pp. 1180–1200.

[156] Benzi, M. and M. A. Olshanskii (2006) “An augmented Lagrangian-based approach to the Oseen problem,” SIAM Journal on Scientific Computing, 28(6), pp. 2095–2113.

[157] Elman, H., D. Silvester, and A. Wathen (2005) Finite elements and fast iterative solvers: with applications in incompressible fluid dynamics, Oxford University Press.

[158] Segal, A., M. ur Rehman, and C. Vuik (2010) “Preconditioners for incompressible Navier-Stokes solvers,” Numerical Mathematics: Theory, Methods and Applications, 3(3), pp. 245–275.

[159] Vanka, S. (1986) “Block implicit multigrid solution of Navier-Stokes equations in primitive variables,” J. Comput. Phys., 65, pp. 138 – 158.

[160] Molenaar, J. (1991) “A two-grid analysis of the combination of mixed finite element and Vanka-type relaxation,” in Multigrid Methods III, Proc. 3rd Eur. Conf. (W. Hackbusch and U. Trottenberg, eds.), Birkhäuser, Bonn, Germany, pp. 313–323.

[161] Schöberl, J. and W. Zulehner (2003) “On Schwarz-type smoothers for saddle point problems,” Numerische Mathematik, 95(2), pp. 377–399.

[162] Schöberl, J. (1999) “Multigrid methods for a parameter dependent problem in primal variables,” Numerische Mathematik, 84(1), pp. 97–119.

Vita
Lu Wang

Lu Wang has been a PhD candidate in the Department of Mathematics at Pennsylvania State University since 2008. He received the B.S. degree in Mathematics and the B.E. degree in Computer Science from Peking University, Beijing, China, in 2008.