Robust and scalable hierarchical matrix-based fast direct solver and preconditioner for the numerical solution of elliptic partial differential equations

Dissertation by

Gustavo Iván Chávez Chávez

In Partial Fulfillment of the Requirements

For the Degree of

Doctor of Philosophy

King Abdullah University of Science and Technology

Thuwal, Kingdom of Saudi Arabia

June, 2017

EXAMINATION COMMITTEE PAGE

The dissertation of Gustavo Iván Chávez Chávez is approved by the examination committee.

Committee Chairperson: Professor David Keyes
Committee Members: Professor Mikhail Moshkov, Professor David Ketcheson, Professor George Turkiyyah, Professor Jingfang Huang

© June, 2017

Gustavo Iván Chávez Chávez

All Rights Reserved

ABSTRACT

Robust and scalable hierarchical matrix-based fast direct solver and preconditioner for the numerical solution of elliptic partial differential equations

Gustavo Iván Chávez Chávez

This dissertation introduces a novel fast direct solver and preconditioner for the solution of block tridiagonal linear systems that arise from the discretization of elliptic partial differential equations on a Cartesian product mesh, such as the variable-coefficient Poisson equation, the convection-diffusion equation, and the wave Helmholtz equation in heterogeneous media. The algorithm extends the traditional cyclic reduction method with hierarchical matrix techniques. The resulting method exposes substantial concurrency, and its arithmetic operations and memory consumption grow only log-linearly with problem size, assuming bounded rank of off-diagonal matrix blocks, even for problems with arbitrary coefficient structure. The method can be used as a standalone direct solver with tunable accuracy, or as a black-box preconditioner in conjunction with Krylov methods.

The challenges that distinguish this work from other thrusts in this active field are the hybrid distributed-shared parallelism that demonstrates the algorithm at large scale, full three-dimensionality, and the three stressors of current state-of-the-art multigrid technology: high wavenumber Helmholtz (indefiniteness), high Reynolds convection (nonsymmetry), and high contrast diffusion (inhomogeneity).

Numerical experiments corroborate the robustness, accuracy, and complexity claims and provide a baseline of the performance and memory footprint through comparisons with competing approaches such as the multigrid solver hypre and the STRUMPACK implementation of the multifrontal factorization with hierarchically semi-separable matrices. The companion implementation can utilize many thousands of cores of Shaheen, KAUST's Haswell-based Cray XC-40 supercomputer, and compares favorably with other implementations of hierarchical solvers in terms of time-to-solution and memory consumption.

ACKNOWLEDGMENTS

First and foremost I want to thank my advisor, Prof. David Keyes, for all his support and encouragement: you are a gentleman and a scholar, and you have been an inspiration to me in the fullest sense of the word. I also want to thank Prof. George Turkiyyah for his time and follow-up. I particularly thank him for introducing me to the field of hierarchical matrices.

I would like to express my sincere gratitude to my dissertation committee, Prof. Mikhail Moshkov, Prof. David Ketcheson, and Prof. Jingfang Huang, for their thorough comments and detailed suggestions on the draft of this dissertation, and especially for making the time to be physically present at my dissertation talk. In particular, I thank Prof. Huang for literally traveling halfway across the world, Prof. Moshkov for attending right after a medical procedure, and Prof. Ketcheson, who kindly rescheduled his summer agenda.

This work is the result of a team effort, for which I gratefully acknowledge the inputs of Prof. R. Yokota, Dr. H. Ltaief, Dr. A. Litvinenko, Dr. L. Dalcin, Dr. S. Zampini, Dr. G. Markomanolis, Dr. S. Kortas, Dr. B. Hadri, and Dr. S. Feki. I will fondly remember my colleagues at the Extreme Computing Research Center: M. Farhan, W. Boukaram, M. Abdul-Jabbar, A. Alonazi, A. Chara, D. Sukkari, Dr. L. Liu, Dr. H. Ibeid, and Dr. T. Malas. I thank you for the innumerable conversations and for making my stay so enjoyable. Also, I thank the staff of the ECRC for their impeccable work: E. Gonzalez, V. Hansford, O. Camilleri, N. Ezzeddine, and G. Martinez.

Although I never met him, I would like to thank the founder of KAUST, King Abdullah of Saudi Arabia. His generosity and vision are truly breathtaking.

Finally, I thank my friends and family. This work is dedicated to you. Especially to the love of my life, Karen Ramírez, and to my loving parents, Alejandro Chávez and Laura Chávez.

TABLE OF CONTENTS

Examination Committee Page 2

Copyright 3

Abstract 4

Acknowledgments 6

List of Abbreviations 11

List of Figures 13

List of Tables 18

1 Introduction 20
1.1 Overview 20
1.2 Motivation 21
1.3 Fundamental problem 22
1.4 Contributions 25
1.5 Outline 26

2 Hierarchical matrices 28
2.1 Intuition for hierarchically low-rank approximations 28
2.2 Overview of the H-matrix format 29
2.2.1 Index set 30
2.2.2 Cluster tree 30
2.2.3 Block cluster tree 30
2.2.4 Admissibility condition 31
2.2.5 Compression of low-rank blocks 32
2.3 Benefits of H-matrix approximations 32
2.4 Other types of data-sparse formats 33
2.4.1 H² 33
2.4.2 Hierarchically semi-separable (HSS) 34
2.4.3 Hierarchically off-diagonal low-rank (HODLR) 35
2.4.4 Block low-rank (BLR) 37
2.5 Summary of data-sparse formats 38

3 Related work 39
3.1 Approximate triangular factorizations 39
3.1.1 Data-sparse multifrontal 39
3.1.2 Data-sparse supernodal 42
3.1.3 Compression of the entire triangular factors 43
3.2 Approximate inverse 44
3.2.1 Iterative procedure 44
3.2.2 Recursive formulation 45
3.2.3 Matrix-free 46

4 Accelerating cyclic reduction 48
4.1 Introduction to cyclic reduction 48
4.2 Cyclic reduction on a model problem 49
4.2.1 Elimination 50
4.2.2 Solve 53
4.3 Accelerated cyclic reduction (ACR) 53
4.3.1 Modularity 53
4.3.2 Block-wise H-matrix approximation 54
4.3.3 Slicing decomposition 55
4.3.4 Reduced-dimension H-matrices 56
4.3.5 Tuning parameters 57
4.3.6 Fixed-rank versus adaptive-rank arithmetic 59
4.3.7 General algorithm 60
4.3.8 Sequential complexity estimates 61

5 Distributed memory accelerated cyclic reduction 63
5.1 Overview of the concurrency features of ACR 63
5.2 Hybrid parallel programming model 64
5.2.1 Inter-node parallelism 65
5.2.2 Intra-node parallelism 66
5.3 Parallel elimination and solve 67
5.4 Parallel complexity estimates 69
5.5 Parallel scalability 69
5.5.1 Weak scaling 70
5.5.2 Strong scaling 71
5.5.3 Effectiveness of the choice of H-matrix admissibility 72
5.5.4 Memory footprint 73

6 ACR as a fast direct solver 75
6.1 Numerical results in 2D and benchmark with other solvers 75
6.1.1 Environment settings 75
6.1.2 Constant-coefficient Poisson equation 76
6.1.3 Variable-coefficient Poisson equation 78
6.1.3.1 Smooth coefficient 78
6.1.3.2 High contrast discontinuous coefficient 79
6.1.3.3 Anisotropic coefficient 80
6.1.4 Helmholtz equation 81
6.1.4.1 Positive definite formulation 81
6.1.4.2 Indefinite formulation 82
6.1.5 Convection-diffusion equation 86
6.1.5.1 Proportional convection and diffusion 86
6.1.5.2 Convection dominance 87
6.2 Numerical results in 3D and benchmarking with other solvers 89
6.2.1 Environment settings 89
6.2.2 Poisson equation 90
6.2.3 Convection-diffusion equation 96
6.2.4 Wave Helmholtz equation 98

7 ACR as a preconditioner for sparse iterative solvers 101
7.1 Environment settings 101
7.2 Variable-coefficient Poisson equation 102
7.2.1 Generation of random permeability fields 102
7.2.2 Tuning parameters 103
7.2.3 Sensitivity with respect to high contrast coefficient 106
7.2.4 Operation count and memory footprint 107
7.3 Convection-diffusion equation with recirculating flow 108
7.3.1 Tuning parameters 109
7.3.2 Sensitivity with respect to vortex wavenumber 110
7.3.3 Sensitivity with respect to Reynolds number 111
7.3.4 Operation count and memory footprint 112
7.4 Indefinite Helmholtz equation in heterogeneous media 113
7.4.1 Tuning parameters 114
7.4.2 Low to high frequency Helmholtz regimes 117
7.4.3 Operation count and memory footprint 120

8 Summary and future work 122
8.1 Concluding remarks 122
8.2 Future work 125

References 128

Appendix A Memory consumption optimization 140

Appendix B Benchmarks configuration 144
B.1 Configuration for the direct solution of 2D problems 145
B.2 Configuration for the direct solution of 3D problems 146
B.3 Configuration for the iterative solution of 3D problems 147

Appendix C Papers accepted and submitted 149

Appendix D Sparse linear solvers that leverage data-sparsity 150

LIST OF ABBREVIATIONS

ABC Absorbing boundary condition
ACA Adaptive cross approximation
ACR Accelerated cyclic reduction
AHMED Another software library on hierarchical matrices for elliptic differential equations
AMG Algebraic multigrid
BDLR Boundary distance low-rank approximation method
BEM Boundary element method
BLAS Basic linear algebra subprograms
BLR Block low-rank data-sparse format
CE Compress and eliminate solver
CG Method of conjugate gradients
CHOLMOD Sparse Cholesky modification package
CR Cyclic reduction
DMHIF Distributed memory hierarchical interpolative factorization
ExaFMM Library for fast multipole algorithms, in parallel, and with GPU capability
FD Finite difference method
FEM Finite element method
FFT Fast Fourier transform
FMM Fast multipole method
FV Finite volume method
GMRES Generalized minimal residual method
GPU Graphics processing unit
H2Lib Software library for hierarchical matrices and H²-matrices
HBS Hierarchically block separable data-sparse format
HiCMA Hierarchical computations on manycore architectures
HLib Software library for hierarchical matrices
HLibPro Commercial software library for hierarchical matrices
HODLR Hierarchically off-diagonal low-rank data-sparse format
HSS Hierarchically semiseparable data-sparse format
HSSMF Multifrontal factorization accelerated with hierarchically semiseparable matrices
hypre High performance preconditioners library
ID Interpolative decomposition data-sparse format
LoRaSp Low-rank sparse solver
MF Multifrontal factorization
MPI Message passing interface
MUMPS Multifrontal massively parallel sparse direct solver
NERSC National Energy Research Scientific Computing Center
pARMS Parallel version of the algebraic recursive multilevel solver
PaStiX Parallel sparse matrix package
PDE Partial differential equation
PETSc Portable, extensible toolkit for scientific computation
PML Perfectly matched layer boundary condition
RAM Random-access memory
RAS Restricted additive Schwarz preconditioner
RRQR Rank-revealing QR decomposition
STRUMPACK Structured matrices package
SVD Singular value decomposition

LIST OF FIGURES

2.1 Index set in the natural ordering of a 1D grid. 30

2.2 Binary cluster tree T_I of cardinality 8. 30
2.3 Flat block partitioning at different levels l, without admissibility condition. 31
2.4 Hierarchical block partitioning with admissibility condition. Green blocks are represented by low-rank approximations and red blocks with dense matrices. 31
2.5 In addition to the block hierarchy (left) of the H-matrix format, the H²-matrix format features a hierarchy of nested bases (right). 34
2.6 The HSS format features a weak admissibility condition and two hierarchies: a block hierarchy and a nested basis hierarchy. 35
2.7 The HODLR format features a single hierarchy of blocks and a weak admissibility condition. 36
2.8 The BLR format does not feature a hierarchy of blocks or bases, and it uses a straightforward admissibility condition: only the diagonal blocks are stored as dense matrices. 37

4.1 Cyclic reduction preserves a block tridiagonal structure through elimination. Red blocks depict diagonal blocks, green blocks depict the innermost of the bidiagonal blocks, and blue blocks depict the outermost of the bidiagonal blocks. Gray blocks denote blocks in which elimination is completed. 49
4.2 Left: depiction of a rank 1 H-matrix; green blocks are low-rank blocks and red blocks are dense blocks. Right: block-wise approximation of A, as opposed to a global factorization of A. Nonempty blocks depict an H-matrix. 55
4.3 Left: grid; center: slice decomposition of the domain; right: resulting matrix structure. In 2D, N = n² and each block row represents a line of size n × n, whereas in 3D, N = n³ and each block row represents a plane of size n² × n². 56
4.4 H-inverse of the 2D Poisson operator, discretized with N = 64 × 64 grid points, using two different admissibility conditions. The number in each block is the numerical rank necessary to achieve an overall compression accuracy of ε_H = 10⁻⁴. The color map indicates the relative size of the required numerical rank k and the block size n; thus deep blue indicates a good compression k ≪ n. 58
5.1 Concurrency in ACR elimination for a 16-planes example. Level 0 can eliminate eight planes concurrently, thus reducing the problem size by two; this process continues recursively until one plane is left. 64
5.2 Distribution of multiple planes per physical compute node for an example with n = 16 and p = 4. 65
5.3 Communication pattern for the 8-planes case. P depicts the planes being eliminated, and u the solution per plane as back-substitution is executed. 66
5.4 Parallel ACR elimination tree depicting two levels of concurrency, using distributed memory parallelism to distribute concurrent work across compute nodes, and shared memory parallelism to perform H-matrix operations within the nodes. 67
5.5 ACR weak scalability for the solution of the constant-coefficient Poisson equation. 71
5.6 ACR strong scalability for the solution of the constant-coefficient Poisson equation. 72
5.7 Choice of H-matrix structure to represent planes. Blue indicates low-rank blocks, whereas red indicates dense blocks. 73
5.8 Memory footprint of ACR. 74

6.1 Execution times and memory consumption as a function of matrix dimension for the constant-coefficient Poisson problem. 77
6.2 Execution times and memory consumption as a function of matrix dimension for the variable-coefficient Poisson problem with smooth coefficients. 79
6.3 Execution times and memory consumption as a function of matrix dimension for the variable-coefficient Poisson problem with high contrast discontinuous coefficients. 80
6.4 Execution times and memory consumption as a function of matrix dimension for the anisotropic-coefficient problem. 81
6.5 Execution times and memory consumption as a function of matrix dimension for the positive definite Helmholtz problem. 82
6.6 Execution times and memory consumption as a function of matrix dimension for the Helmholtz equation, test 1: fixing k while increasing the resolution. 83
6.7 Execution times and memory consumption as a function of matrix dimension for the Helmholtz equation, test 2: fixing resolution h while decreasing the number of grid points per wavelength. 84
6.8 Execution times and memory consumption as a function of matrix dimension for the Helmholtz equation, test 3: keeping a constant ratio between h and k while increasing the resolution. 85
6.9 Solver performance while decreasing the number of points per wavelength. AMG fails to converge for large k. 85
6.10 Execution times and memory consumption as a function of matrix dimension for the convection-diffusion problem. 86
6.11 Robustness of factorization methods as the convection dominance increases. AMG fails to converge for large α. 88
6.12 2D recirculating flow b(x). 88
6.13 Performance of the factorization and back-substitution phases of ACR for the Poisson problem. 92
6.14 Controllable accuracy solution of ACR for a N = 256³ Poisson problem. 94
6.15 Robustness of ACR and HSSMF for the convection-diffusion problem. In convection dominated problems, AMG fails to converge while direct solvers maintain a steady performance. 97
6.16 Solution of increasingly larger indefinite Helmholtz problems consistently discretized with 12 points per wavelength. 100

7.1 Different realizations of random permeability fields κ(x) at different resolutions and contrast of the coefficient. Images depict the middle slice of each 3D permeability field. 103
7.2 Number of iterations and preconditioning accuracy for the variable-coefficient Poisson equation with N = 128³ degrees of freedom and coefficient contrast of four orders of magnitude. 105

7.3 Effect of the preconditioner accuracy ε_H for the variable-coefficient Poisson equation with N = 128³ degrees of freedom and coefficient contrast of four orders of magnitude. 105
7.4 Required number of iterations for an ACR preconditioner accuracy of ε_H = 1e-2 as the contrast of the coefficient increases. A larger number of iterations is necessary as the contrast of the coefficient increases. 106
7.5 Measured performance and memory footprint for the solution of an increasingly larger variable-coefficient Poisson equation with a random field of four orders of magnitude of contrast in the coefficient. The preconditioner accuracy for these experiments is set to ε_H = 1e-1. 108
7.6 This experiment depicts a convection-diffusion problem with recirculating flow with eight vortices, α = 8, discretized with N = 128³ degrees of freedom. 109
7.7 Effect of the preconditioner accuracy ε_H for a convection-diffusion problem with recirculating flow with eight vortices, α = 8, discretized with N = 128³ degrees of freedom. 110
7.8 Increasing number of vortices per dimension in the flow b(x). 110
7.9 Time distribution of the preconditioner as the number of vortices per dimension in b(x) increases. Increasing the number of vortices had a minor effect on the effectiveness of the preconditioner. 111
7.10 Effect of the preconditioner accuracy ε_H for the convection-diffusion equation with recirculating flow discretized with N = 128³ degrees of freedom as the convective term becomes more significant than the diffusion term. 112
7.11 Measured performance and memory footprint for the solution of the convection-diffusion equation with recirculating flow. 113
7.12 Wave velocity field c(x). The image depicts the middle slice of the 3D wave velocity field. 114
7.13 Number of iterations as a function of the preconditioner accuracy ε_H. As ε_H decreases, the preconditioner requires fewer iterations. 115
7.14 Time requirements while refining the preconditioner accuracy ε_H. The largest ε_H delivers the fastest time to solution. 116
7.15 Effect of the preconditioner accuracy ε_H for the high-frequency Helmholtz equation in a heterogeneous medium discretized with N = 128³ degrees of freedom and 12 points per wavelength. 117
7.16 Preconditioner performance for the Helmholtz equation in a heterogeneous medium discretized with N = 128³ degrees of freedom at increasing frequencies. The problem with f = 0 Hz represents a constant-coefficient Poisson problem, while f = 8 Hz represents a high-frequency Helmholtz problem. 119
7.17 Measured performance and memory footprint for the solution of a sequence of high-frequency Helmholtz problems in heterogeneous media, discretized at 12 points per wavelength. On average, the rank of the low-rank blocks of the ACR preconditioner grows slower than O(n). 121

8.1 Partitioning of an unstructured mesh that produces a block tridiagonal matrix structure, for the application of ACR. 127

A.1 H-matrix structure for different parameter η with fixed ε_H and leaf size n_min = 32. The matrix depicts a 2D variable-coefficient Poisson problem with four orders of magnitude of contrast in the coefficient, discretized with N = 128² degrees of freedom. The numbers inside the green low-rank blocks denote the required numerical rank for the specified accuracy. 141
A.2 Effect of the tunable parameter η on the total memory consumption of ACR for a 3D variable-coefficient problem. The memory complexity estimate of O(N log N) is achieved for η = 2, which corresponds to strong admissibility; η = 2n, which corresponds to weak admissibility, uses the most memory asymptotically, and an intermediate value of η = n/2 requires the least memory within the range of problem sizes considered. 143

LIST OF TABLES

2.1 Complexity estimates of the H-format. The constant k is determined by the numerical rank of the approximation. 33
2.2 Summary of the defining characteristics of data-sparse formats and their main proponents. 38

4.1 Comparing the complexity estimates of storing and computing the inverse of an N × N matrix block in dense format, versus approximating the matrix block with a hierarchical matrix with numerical rank k. 55
4.2 H-inverse of the 2D Poisson operator for N = 128² grid points, using two different admissibility conditions. We document the memory and floating-point operations to build the H-matrix inverse with weak and standard admissibility. The weak admissibility condition tends to require large ranks, which lead to increased memory requirements and more arithmetic operations than the standard admissibility condition. (Equivalent dense storage and arithmetic would have required 2,147 MB and 1.0E13 operations.) 59
4.3 Summary of the sequential complexity estimates of the classic cyclic reduction method and the proposed variant, accelerated cyclic reduction; k represents the numerical rank of the approximation. 61

6.1 Execution parameters, obtained relative residual, and ranks of the ACR factorization for the Poisson experiments. 93
6.2 Execution parameters, obtained relative residual, and ranks of the HSSMF factorization for the Poisson experiments. 93
6.3 Iterative solution of an N = 128³ Poisson problem with the conjugate gradients method and ACR preconditioner. The relative residual of the solution is 1E-6 in all cases. 95
6.4 Execution parameters, obtained relative residual, and ranks of the ACR factorization for the Helmholtz experiments. 99
6.5 Execution parameters, obtained relative residual, and ranks of the HSSMF factorization for the Helmholtz experiments. 99
7.1 Number of iterations required by CG for the variable-coefficient Poisson equation with coefficient contrast of six orders of magnitude. The most economical preconditioner for the hardest problem did not reach convergence within 100 iterations, thus requiring a more accurate version of the preconditioner to reach convergence. 106
7.2 Hardware configuration for distributed-memory experiments. Each compute node has 32 cores, and hosts four block rows of the original matrix. 107
7.3 Tuning of the preconditioner to require at most 20 GMRES iterations for a sequence of Helmholtz problems at increasing frequencies. The problem with f = 0 represents a constant-coefficient Poisson problem, while f = 8 represents a high-frequency Helmholtz problem. 118
7.4 Rank growth statistics for a sequence of high-frequency Helmholtz problems in heterogeneous media, discretized at 12 points per wavelength. 120

A.1 Memory consumption as a function of the tuning parameter η for the computation of the approximate inverse in the H-matrix format of a 2D variable-coefficient Poisson problem with four orders of magnitude of contrast in the coefficient, discretized with N = 128² degrees of freedom. Parameter η = 2 corresponds to strong admissibility, while η = 256 corresponds to weak admissibility; regarding memory requirements, η = 64 is optimal. 142

D.1 Alphabetical list of software libraries that leverage data-sparsity for the solution of sparse linear systems of equations. 150

Chapter 1

Introduction

1.1 Overview

The importance of the development of scalable solvers is well acknowledged in the field of scientific computing. Up until the petascale era of computing, two factors have been predominant in determining a solver's scalability: optimal complexity of arithmetic operations and memory footprint, and available concurrency. With the advent of exascale, the ability of a method to adapt to the architectural features of the computing hardware has become equally important. Acknowledgment of hardware constraints in the development of new algorithms requires the consideration of factors such as memory hierarchies, shared memory cores, bandwidth, and power consumption. This is a challenging task for two reasons: first, because it requires a broad skill set in numerical linear algebra, numerical analysis, and parallel computing, and second, because solving large-scale problems is intrinsically complicated, as domain expertise in both the physics of the problem and the method is required to optimally adjust the algorithm's tuning parameters.

In the current ecosystem of parallel computing there are two main flavors of programming models: shared memory computing and distributed memory computing, with a spectrum blending both elements in between. In a shared memory environment, all processors within a compute node have direct access to a single bank of local memory, whereas in a distributed memory environment each node, beyond having access to its local memory, accesses remote memories from other nodes only through a network interconnect. The combination of a distributed memory environment and the shared memory environment is the present, and projected, architectural trend in high-performance architectures. Consequently, it is compulsory that algorithms adapt to this model to utilize the available hardware efficiently.

There is vast documentation demonstrating that optimal algorithms such as multigrid methods [1] or fast multipole methods [2] are able to perform efficiently in massively parallel environments. The exploitation of structure, usually in the form of a hierarchy that maps well to hardware, and assumptions about the inherent properties of the problem result in scalable algorithms. It is, therefore, desirable to develop algorithms that have optimal complexity in both arithmetic operations and memory requirements, map well to hardware, and exploit the underlying properties of a different, or an extended, class of problems.

Recent work in the scalable solvers community exhibits an increasing interest in the exploration of hierarchical low-rank approximations to accelerate the solution of linear systems of the form Ax = b. The appeal comes from their near-optimal complexity estimates regarding the number of arithmetic operations and memory footprint, provided that A has a data-sparse property, in the sense that only linear or log-linear data is needed for its representation. Evidence in the form of publications and open-source software demonstrates that these algorithms can be efficiently ported to highly parallel computing environments. There are several manifestations of algorithms that rely on data sparsity, which have led to the conception of new solvers or to improvements to classical methods in the form of reduced complexity estimates, as we discuss in the next chapter.

1.2 Motivation

There is a large body of work that aims to expand the class of problems that established optimal methods can tackle. For example, iterative methods have tractable complexity and scalability, but their convergence is problem-dependent. The exploration of novel approaches comes at a high risk, but it can be even more challenging to develop specific customizations around individual classes of problems. Furthermore, an alternative approach might adapt more naturally to the issue in question. For instance, if one were to seek robustness in the solution of sparse linear systems, exact factorizations would be the immediate method of choice. However, factorizations or conventional direct solvers come at quadratic or cubic computational complexity, which is prohibitive for large-scale problems.

The goal of this work is therefore to preserve robustness in the solution of sparse linear systems at near-optimal complexity and scalability, for a class of problems extended beyond those that the established optimal methods can tackle.

The intermediate deliverable of this work, in the form of a dissertation proposal, presented evidence that a new robust and scalable solver titled Accelerated Cyclic Reduction (ACR) could be extended to tackle 3D problems with more general features. This dissertation documents the development and results of these generalizations and provides insight into further expansions that were exposed during the process.

1.3 Fundamental problem

The solution of linear systems of equations often comprises the most expensive part of large-scale scientific calculations. Depending on the discretization, the resulting linear systems can be either sparse (e.g., partial differential equations) or dense (e.g., integral equations). This work concentrates on the class of elliptic partial differential equations; in particular, on the solution of the sparse linear systems that arise from discretization with local operators from the finite-difference method or the finite element method.

Consider the block tridiagonal matrix A = tridiagonal(E_i, D_i, F_i), for 0 ≤ i ≤ n − 1 (with the exception of the blocks E_0 and F_{n−1}, which do not exist), that arises from the 2D discrete Poisson problem Au = b with a 5-point stencil on a Cartesian product mesh with n grid points per linear dimension:

    −∇ · (κ(x) ∇u) = f,                                         (1.1)

        ⎡ D_0     F_0                                   ⎤
        ⎢ E_1     D_1     F_1                           ⎥
    A = ⎢          ⋱       ⋱        ⋱                   ⎥ .      (1.2)
        ⎢                 E_{n−2}  D_{n−2}  F_{n−2}     ⎥
        ⎣                          E_{n−1}  D_{n−1}     ⎦

In the case in which κ(x) is constant, one can leverage the fact that the eigenvectors of the discrete Poisson operator are known, and, in conjunction with the Fourier transform, the resulting linear systems can be solved optimally. The class of methods that exploit constant κ is referred to as fast Poisson solvers. In the case in which κ(x) is not constant, known as the variable-coefficient Poisson equation, these techniques no longer generalize with optimal complexity. For the constant-coefficient case, the blocks E_i, D_i, F_i in (1.2) are identical along their respective diagonals. For the variable-coefficient case, these blocks are no longer identical along their respective diagonals, so a different method has to be employed to preserve optimality. One of the goals of this work is to achieve optimality when the coefficient κ(x) is variable.

There are other generalizations of (1.1) that thwart classical optimal methods, for instance, the regular Poisson equation plus a first-order derivative term, typically modeling a flow velocity b(x). The resulting equation is known as the convection-diffusion equation:

    −∇²u + b(x) · ∇u = f.                                       (1.3)

The numerical solution of the convection-diffusion equation models the behavior of a passive scalar, u(x), within a system with diffusion and advection, such as incompressible flow. From a numerical analysis point of view, the difficulty of solving the linear system arising from the discretization of (1.3) comes from the fact that its corresponding matrix is nonsymmetric. As a result, one can no longer use the extensive class of solvers that leverage symmetry.

Another generalization of (1.1) arises when the equation contains an oscillatory term κ²u, in which u typically models wave amplitude and κ² its corresponding wavenumber; the resulting equation is known as the Helmholtz equation:

    −∇²u − κ²u = f.                                             (1.4)

A critical nuance of the Helmholtz equation is the sign of the zeroth-order term in relation to the diffusion term. If the signs are opposite, the resulting linear system becomes diagonally dominant, and the broad class of solvers that handle diffusion problems optimally can be used. The case in which the sign of the oscillatory term and the diffusion term are the same, and the wavenumber is high enough to shift part of the spectrum to the other half plane, leads to the wave Helmholtz equation, which has an oscillatory, rather than diffusive, solution. The resulting matrix after discretization is indefinite, and as in the previous case, one can either develop extensions to optimal methods to handle this type of problem, or use a suboptimal, but more robust, method.

The numerical solution of the wave Helmholtz equation in heterogeneous media, namely, when κ is variable, is crucial for applications such as acoustics, wave-based inverse problems, optical devices, and communications. However, it remains a challenging problem for modern elliptic or boundary solvers of optimal complexity relying on multigrid and hierarchically low-rank approximations.

These are the three classes of equations, featuring symmetry and nonsymmetry, definiteness and indefiniteness, and constant and variable coefficients, on which the methods and numerical experiments of this work are put to the test at large scale.
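As an added illustration of the block tridiagonal structure in (1.2), the following sketch (Python with NumPy/SciPy; not the dissertation's companion implementation) assembles the 5-point finite-difference matrix for the variable-coefficient Poisson equation on an n × n Cartesian grid. The unit-square domain, Dirichlet boundary treatment, and arithmetic averaging of κ at cell interfaces are illustrative assumptions. Each block row corresponds to one grid line, so D_i is tridiagonal while E_i and F_i are diagonal.

# A minimal sketch of how the block tridiagonal matrix (1.2) arises from a
# 5-point finite-difference discretization of -div(kappa grad u) = f.
import numpy as np
import scipy.sparse as sp

def poisson_2d_blocks(n, kappa):
    """Return A (block tridiagonal, n blocks of size n x n) for -div(kappa grad u)."""
    h = 1.0 / (n + 1)
    x = (np.arange(n) + 1) * h
    X, Y = np.meshgrid(x, x, indexing="ij")          # grid point coordinates
    k = kappa(X, Y)                                   # coefficient at grid points

    # Interface coefficients by arithmetic averaging (an illustrative choice).
    kx = np.zeros((n + 1, n)); kx[1:n, :] = 0.5 * (k[:-1, :] + k[1:, :])
    kx[0, :], kx[n, :] = k[0, :], k[-1, :]
    ky = np.zeros((n, n + 1)); ky[:, 1:n] = 0.5 * (k[:, :-1] + k[:, 1:])
    ky[:, 0], ky[:, n] = k[:, 0], k[:, -1]

    rows, cols, vals = [], [], []
    idx = lambda i, j: i * n + j                      # block row i = grid line i
    for i in range(n):
        for j in range(n):
            diag = (kx[i, j] + kx[i + 1, j] + ky[i, j] + ky[i, j + 1]) / h**2
            rows.append(idx(i, j)); cols.append(idx(i, j)); vals.append(diag)
            if i > 0:      # E_i block: coupling to the previous grid line
                rows.append(idx(i, j)); cols.append(idx(i - 1, j)); vals.append(-kx[i, j] / h**2)
            if i < n - 1:  # F_i block: coupling to the next grid line
                rows.append(idx(i, j)); cols.append(idx(i + 1, j)); vals.append(-kx[i + 1, j] / h**2)
            if j > 0:      # sub-diagonal of D_i
                rows.append(idx(i, j)); cols.append(idx(i, j - 1)); vals.append(-ky[i, j] / h**2)
            if j < n - 1:  # super-diagonal of D_i
                rows.append(idx(i, j)); cols.append(idx(i, j + 1)); vals.append(-ky[i, j + 1] / h**2)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n * n, n * n))

# For constant kappa the blocks repeat along their diagonals; for a variable,
# high-contrast kappa (as below) they differ, which defeats fast Poisson solvers.
A = poisson_2d_blocks(64, kappa=lambda x, y: 1.0 + 1e3 * (x > 0.5))
print(A.shape, A.nnz)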

1.4 Contributions

This dissertation introduces accelerated cyclic reduction (ACR), a novel algorithm in the category of fast direct solvers and robust preconditioners, that leverages cyclic reduction as the outer algorithm to solve block tridiagonal linear systems, up to a controllable accuracy, with hierarchical matrix arithmetic operations as the inner computational kernels (Chapter 4).

The motivation for concurrency exploitation is based on the fact that hierarchical matrix arithmetic operations have proven to expose substantial concurrency at the node level, and that the amount of distributed memory concurrency in cyclic reduction is proportional to the square root of the problem size in two dimensions and to the cube root of the problem size in three dimensions (Chapter 5). In terms of complexity estimates, by using hierarchical matrix arithmetic operations the resulting method has near-optimal complexity regarding the number of operations and memory requirements. When compared to state-of-the-art hierarchical solvers, numerical experiments show that ACR has a competitive time to solution and memory requirements (Chapter 6). Furthermore, it features improved or similar arithmetic and memory complexity estimates relative to the latest algorithms within the same class, up to complexity constants.

The fact that the method is entirely algebraic expands its range of applicability to problems with arbitrary coefficient structure, up to the applicability of rank compression. The method is, therefore, robust in comparison to other solvers that are limited in the range of problems they can tackle, or that require geometry information. We demonstrate via a companion software implementation that the method is well-suited for modern parallel multi-core systems, and that it is scalable across a modern massively parallel distributed memory system. The two main stages of ACR, elimination and solve, achieve log-linear complexity in memory and number of operations consistently from small to large-scale problems, and they exhibit good weak and strong scalability at large processor counts.

Numerical evidence demonstrates improved overall performance and effectiveness of the use of ACR as a preconditioner for the preconditioned conjugate gradients method and the generalized minimal residual method (Chapter 7). Recommendations are made on the choice of the different tunable parameters of ACR as a preconditioner: the admissibility condition parameter, the size of the leaf nodes, and the approximation threshold for conversion into the H-matrix format and its arithmetic.

Applications that model high contrast diffusion (such as reservoir modeling), recirculating flow in dominant convection (such as fluid dynamics), or high-frequency waves in heterogeneous media (such as seismic imaging) benefit directly from this work, as demonstrated by numerical experiments.

Finally, throughout this work, discussion of the rationale behind the multiple algorithmic trade-offs is provided. Furthermore, numerical results are discussed in the context of the nature and performance of other solvers. The last chapter of this document describes potential avenues for future work and provides commentary on the challenges and opportunities of such endeavors.

1.5 Outline

The remainder of this dissertation is organized as follows:

• Chapter 2, Hierarchical matrices, describes the hierarchical matrix format used in this work and other hierarchical matrix formats available in the literature.

• Chapter 3, Related work, discusses the endeavors in the field of hierarchical matrix-based and data-sparse methods for the solution of sparse linear systems.

• Chapter 4, Accelerating cyclic reduction, details the insights that led to the development of the main algorithm, starting from a historical perspective, and describes why the synergism with hierarchical matrices makes it an optimal complexity algorithm.

• Chapter 5, Distributed memory accelerated cyclic reduction, shows the parallel features of accelerated cyclic reduction on a hybrid parallel programming model and demonstrates its performance at large processor counts.

• Chapter 6, ACR as a fast direct solver, provides the first set of numerical experiments, using accelerated cyclic reduction as an exact solver of tunable accuracy in comparison with other well-known solvers in the field.

• Chapter 7, ACR as a preconditioner for sparse iterative solvers, provides the second set of numerical experiments, on the use of accelerated cyclic reduction as a preconditioner to speed up the convergence of Krylov methods on a challenging set of test problems.

• Chapter 8, Summary and future work, recapitulates the lessons learned from conception up until today and discusses avenues of future research and the challenges and opportunities ahead.

Chapter 2

Hierarchical matrices

A hierarchical matrix is a data-sparse representation that enables fast linear algebraic operations by using a hierarchy of off-diagonal blocks, each represented by a low-rank approximation or a small dense matrix, that can be tuned to guarantee an arbitrary precision. The approximation, sometimes referred to as compression, is performed via singular value decomposition, or with a related method that delivers a low-rank approximation with fewer arithmetic operations than the traditional SVD method. For the representation to be effective in terms of arithmetic operations and memory requirements, the numerical rank must be significantly smaller than the sizes of the various matrix blocks that they replace.

2.1 Intuition for hierarchically low-rank approximations

To motivate hierarchically low-rank approximations, and before formally defining what an H-matrix is, consider the finite-difference discretization of the 1D Poisson equation. The resulting matrix T is a tridiagonal matrix. Then, consider a matrix partitioning as depicted in (2.1); the upper-right block can be factorized in the form AB^T with A = [0 0 0 1]^T and B^T = [1 0 0 0]. Also, notice that the rest of the off-diagonal blocks can be factorized in the same manner. In this case, this rank 1 approximation is actually exact.

        ⎡ -2   1                                ⎤
        ⎢  1  -2   1                            ⎥
        ⎢      1  -2   1                        ⎥
    T = ⎢          1  -2   1                    ⎥ .              (2.1)
        ⎢              1  -2   1                ⎥
        ⎢                  1  -2   1            ⎥
        ⎢                      1  -2   1        ⎥
        ⎣                          1  -2        ⎦

Perhaps more interestingly, consider T⁻¹ in (2.2), and notice that the inverse of this operator can also be approximated (exactly in this case) with a hierarchical block structure of rank 1 approximations. For example, consider the factorization of the upper-right block of T⁻¹ in the form AB^T, where A = −(1/9) [1 2 3 4]^T and B^T = [4 3 2 1]. Similarly, the rest of the hierarchical blocks of T⁻¹ can be approximated in the same fashion with rank 1 factorizations:

    T⁻¹ = −(1/9) ×
        ⎡ 8   7   6   5   4   3   2   1 ⎤
        ⎢ 7  14  12  10   8   6   4   2 ⎥
        ⎢ 6  12  18  15  12   9   6   3 ⎥
        ⎢ 5  10  15  20  16  12   8   4 ⎥ .                      (2.2)
        ⎢ 4   8  12  16  20  15  10   5 ⎥
        ⎢ 3   6   9  12  15  18  12   6 ⎥
        ⎢ 2   4   6   8  10  12  14   7 ⎥
        ⎣ 1   2   3   4   5   6   7   8 ⎦

Remarkably, this hierarchical compression technique, here illustrated with a simple 1D problem, generalizes well to higher dimensions for far less regular operators [3].
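A short numerical check of this rank-1 structure, written here in Python with NumPy as an added illustration (it is not part of the dissertation's companion code):

# Verify that the off-diagonal blocks of T and of its inverse are exactly rank 1,
# for the 8 x 8 tridiagonal matrix of (2.1).
import numpy as np

n = 8
T = np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
Tinv = np.linalg.inv(T)

# Upper-right 4 x 4 block of T: a single nonzero entry, so rank 1.
print(np.linalg.matrix_rank(T[:4, 4:]))      # 1

# Upper-right 4 x 4 block of T^{-1}: also exactly rank 1.
block = Tinv[:4, 4:]
print(np.linalg.matrix_rank(block))          # 1

# Recover a rank-1 factorization A B^T from the leading singular triplet.
U, s, Vt = np.linalg.svd(block)
A = U[:, :1] * s[0]
Bt = Vt[:1, :]
print(np.allclose(block, A @ Bt))            # True: the approximation is exact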

2.2 Overview of the H-matrix format

Formally, a hierarchical matrix in the H-format [3, 4, 5] can be constructed from four components: an index set, a cluster tree, a block cluster tree, and the specification of an admissibility condition.

2.2.1 Index set

The index set I = {0, 1, . . . , n − 1} represents the nodal points of the grid under a certain ordering, such as the natural ordering, as depicted in Figure 2.1.

Figure 2.1: Index set in the natural ordering of a 1D grid.

2.2.2 Cluster tree

The cluster tree, denoted by T_I, recursively subdivides the index set I until exhaustion. For simplicity, consider a binary cluster tree of cardinality 8, as shown in Figure 2.2.

I_{1:8}
I_{1:4}   I_{5:8}
I_{1:2}   I_{3:4}   I_{5:6}   I_{7:8}
I_1   I_2   I_3   I_4   I_5   I_6   I_7   I_8

Figure 2.2: Binary cluster tree T_I of cardinality 8.
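A minimal sketch of this recursive subdivision, in Python (an added illustration assuming contiguous index sets and binary splits; not code from the dissertation):

# Build a binary cluster tree over a contiguous index set by recursive bisection.
def cluster_tree(index_set, leaf_size=1):
    """Recursively bisect a contiguous index set until clusters reach leaf_size."""
    node = {"indices": list(index_set), "children": []}
    if len(node["indices"]) > leaf_size:
        mid = len(node["indices"]) // 2
        node["children"] = [cluster_tree(node["indices"][:mid], leaf_size),
                            cluster_tree(node["indices"][mid:], leaf_size)]
    return node

def print_tree(node, depth=0):
    lo, hi = node["indices"][0] + 1, node["indices"][-1] + 1
    label = f"I_{lo}" if lo == hi else f"I_{lo}:{hi}"
    print("  " * depth + label)
    for child in node["children"]:
        print_tree(child, depth + 1)

# Cardinality 8, as in Figure 2.2: the root I_1:8 splits into I_1:4 and I_5:8,
# and so on down to the leaves I_1, ..., I_8.
print_tree(cluster_tree(range(8)))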

2.2.3 Block cluster tree

Once the cluster tree is defined, the block cluster tree maps matrix sub-blocks over the partitioning of the index set I × I. An example of a clustering, frequently used by other data-sparse formats as we discuss in the next section, is a flat block subdivision of the matrix into l levels, as depicted in Figure 2.3. The H-format, however, uses a discriminant, the so-called admissibility condition, to determine which blocks are further subdivided.

Figure 2.3: Flat block partitioning at different levels l, without admissibility condition.

2.2.4 Admissibility condition

Besides determining which blocks are further partitioned, the admissibility condition also determines which blocks are represented as a low-rank block (green) or a dense block (red); see Figure 2.4.

Figure 2.4: Hierarchical block partitioning with admissibility condition. Green blocks are represented by low-rank approximations and red blocks with dense matrices.

A weak admissibility criterion results in a coarse partitioning of the matrix (i.e., fewer, large blocks), and a standard admissibility criterion allows a more refined structure (i.e., more, small blocks). The condition that determines if a block is preserved “as is” during construction, or if it will be further decomposed is formally defined as:

    min(diameter(τ), diameter(σ)) ≤ η · distance(τ, σ).         (2.3)

In (2.3), τ and σ represent two geometric regions in the PDE domain, defined as the convex hulls of two separate point sets t and s (i.e., nodes in the cluster tree). A matrix block A_{ts} satisfying the previous inequality is represented in low-rank form. The tuning parameter η controls the weight of the distance function, and can therefore control the degree of admissibility of the matrix, from weak (large blocks, large η) to standard (small blocks, small η).
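The following sketch (Python with NumPy, an added illustration rather than the dissertation's code) evaluates the admissibility test (2.3) for two point clusters; the 1D point sets and the brute-force diameter and distance computations are simplifying assumptions made only for clarity:

# Evaluate the admissibility condition (2.3) for two clusters of points.
import numpy as np

def diameter(points):
    pts = np.atleast_2d(points)
    return max(np.linalg.norm(p - q) for p in pts for q in pts)

def distance(points_t, points_s):
    return min(np.linalg.norm(p - q) for p in np.atleast_2d(points_t)
                                     for q in np.atleast_2d(points_s))

def is_admissible(points_t, points_s, eta):
    """Block (t, s) is stored in low-rank form if inequality (2.3) holds."""
    return min(diameter(points_t), diameter(points_s)) <= eta * distance(points_t, points_s)

# Well-separated clusters on a unit interval are admissible for a modest eta;
# a cluster paired with itself has zero distance, so its block stays dense
# or is subdivided further.
t = np.linspace(0.0, 0.25, 8).reshape(-1, 1)
s = np.linspace(0.75, 1.0, 8).reshape(-1, 1)
print(is_admissible(t, s, eta=2.0))   # True  (clusters are far apart)
print(is_admissible(t, t, eta=2.0))   # False (distance is zero)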

2.2.5 Compression of low-rank blocks

The last step in the construction of an H-matrix is the choice of an algorithm to compute low-rank approximations for each of the blocks tagged as low-rank blocks, in this case as the product of two matrices of the form AB^T. Given a block of size n × n, an effective compression leads to a tall and skinny matrix A of size n × k, and a short and fat matrix B^T of size k × n, where k is the numerical rank of the block at some accuracy ε_H. An effective compression means that the numerical rank satisfies k ≪ n. An efficient use of a hierarchical matrix to compress a given matrix strikes a balance between a low numerical rank k and a moderate number of low-rank blocks.
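As an added illustration, the sketch below compresses a matrix block into the product AB^T with a truncated SVD in Python/NumPy; the relative 2-norm truncation rule and the specific kernel used in the example are assumptions for demonstration, since the text leaves the choice of compression algorithm open (e.g., SVD, RRQR, or ACA):

# Compress a block into A B^T with a truncated SVD at a relative accuracy eps.
import numpy as np

def compress_block(M, eps=1e-4):
    """Return A (n x k) and Bt (k x n) with ||M - A @ Bt||_2 <= eps * ||M||_2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = max(1, int(np.sum(s > eps * s[0])))   # numerical rank at accuracy eps
    A = U[:, :k] * s[:k]
    Bt = Vt[:k, :]
    return A, Bt

# Example: a smooth, well-separated interaction block is strongly compressible,
# i.e., the numerical rank k is much smaller than the block size n.
n = 256
x = np.linspace(0.0, 1.0, n)
M = 1.0 / (1.0 + np.abs(x[:, None] - (x[None, :] + 2.0)))
A, Bt = compress_block(M, eps=1e-6)
print(A.shape[1], np.linalg.norm(M - A @ Bt, 2) / np.linalg.norm(M, 2))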

2.3 Benefits of H-matrix approximations

H-matrix approximations are especially useful for a particular class of matrices, such as dense and sparse matrices that arise from the discretization of elliptic operators with methods such as the boundary element method (BEM), finite differences (FD), finite volumes (FV), or the finite element method (FEM). Also, it has recently been shown that certain covariance matrices, under proper ordering, can benefit from this technique; a larger set of applications can be found in [6]. There are immediate algorithmic gains from using H-matrix storage and the set of arithmetic operations that are available. In terms of storage, a dense matrix requires O(N²) memory, while its H-matrix approximation counterpart can be stored in O(N log N) units of memory.

A set of algebraic operations is defined within the H-format. For a comprehensive discussion of the construction of H-matrices and their arithmetic operations, we refer the reader to [5]. For inversion, factorization, matrix addition, matrix-vector and matrix-matrix multiplication, it is possible to achieve log-linear time and memory complexity. Table 2.1 summarizes the time and memory complexities in the H-format [5] for several algebraic operations.

Operation        H-format          Dense format
Storage          O(kN log N)       O(N²)
Matrix-vector    O(kN log N)       O(N²)
Addition         O(k²N log N)      O(N²)
Multiplication   O(k²N log² N)     O(N³)
LU               O(k²N log² N)     O(N³)
Inversion        O(k²N log² N)     O(N³)

Table 2.1: Complexity estimates of the H-format. The constant k is determined by the numerical rank of the approximation.

2.4 Other types of data-sparse formats

The discussion so far has focused on the H-matrix format. However, there are other data-sparse formats available in the literature; in this section we briefly describe each of them.

2.4.1 H²

Introduced in [7], and conveniently discussed in [8], the H² format introduces a second hierarchy as compared to the H-format, hence the exponent two. This additional hierarchy is referred to as nested bases, as it further compresses the block hierarchy into a common column and row basis; see Figure 2.5. This second hierarchy further improves the arithmetic and memory complexity estimates of all the algebraic operations defined for H-matrices by up to one logarithm. For instance, storage requires O(kN) units of memory and a matrix-vector multiplication can be done in O(kN) operations.

To summarize, the low-rank blocks are further factorized into a hierarchy of common bases across rows and columns, and by so-called transfer matrices s (depicted in green in Figure 2.5) that uniquely define each low-rank block within the basis hierarchy in the form UsV^T. For a low-rank block of size n × n, U has size n × k, s has size k × k, and V has size k × n, where k is the numerical rank of the block at some truncated accuracy ε_H. The second hierarchy in this format increases the implementation complexity of each of its methods; for instance, the matrix-vector operation relies on recursive forward and backward transformations across the different levels of cluster bases and on several computations on the k × k transfer matrices, as it is not possible to access each low-rank block directly.

Figure 2.5: In addition to the block hierarchy (left) of the H-matrix format, the H²-matrix format features a hierarchy of nested bases (right).

2.4.2 Hierarchically semi-separable (HSS)

Similar to the H² format, the hierarchically semi-separable (HSS) format, introduced in [9] and generalized in [10], also features a block hierarchy and a basis hierarchy. The main difference with respect to the H² format is its choice of admissibility condition. HSS features a weak admissibility condition, consistently approximating each large off-diagonal block with a single low-rank approximation. In this sense, HSS can be viewed as a one-dimensional H² matrix. Figure 2.6 depicts a generic schematic of an HSS matrix.

This data-sparse format introduced a linear complexity ULV factorization for dense and sparse symmetric positive definite linear systems with a data-sparse property, in which U and V are orthogonal and L is lower-triangular. When this format is applied to sparse matrices, recent work bases the compression algorithm on randomized sampling [11], which has been generalized to general nonsymmetric matrices [12, 13]. The randomized compression algorithm improved on the original O(N²) algorithm, reaching near-optimal complexity and also leading to significant savings in memory footprint. The hierarchically block separable (HBS) format [14] is virtually identical to HSS.

Figure 2.6: The HSS format features a weak admissibility condition and two hierarchies: a block hierarchy and a nested basis hierarchy.

2.4.3 Hierarchically off-diagonal low-rank (HODLR)

The HODLR format, introduced in [15], also features a weak admissibility condition like HSS, although it does not have a nested basis hierarchy, which results in a format of asymptotically higher arithmetic complexity. HODLR does, however, feature a block hierarchy of low-rank blocks in a tree structure. The algorithm that computes low-rank approximations is based on a variant of the pseudo-skeleton algorithm [16], titled the boundary distance low-rank approximation scheme (BDLR). The compression algorithm has an asymptotic complexity of O(kn), where k is the numerical rank of the approximation and n is the size of the compressed block.

The heuristic is based on the assumption that large entries correspond to points close in space, and it was proven to be effective for problems where the matrix has an underlying kernel that is smooth away from, and singular at, the origin, such as the Green's function of the 3D Laplace equation, G(x, x′) = −1/(4π|x − x′|). HODLR significantly reduced the memory consumption of the frontal matrices that arise from the multifrontal method in a set of test problems with both structured and unstructured meshes [17]. However, earlier versions of this work did so with relatively high ranks. Recent work [18] improved on the rank magnitudes by compressing well-separated clusters only. The improvement to the BDLR approximation is based on choosing rows and columns of the matrix according to the location of their corresponding vertices in the sparse matrix graph, and it was reported competitive in comparison to the adaptive cross approximation algorithm (ACA) [19]. A complete analysis of the benefits of this format for the entire stiffness matrix, as opposed to only the dense frontal blocks within the multifrontal method as in previous work, was documented in [20]. The resulting solver was put to the test as a preconditioner to GMRES with relatively low-accuracy HODLR matrices.
Figure 2.7: The HODLR format features a single hierarchy of blocks and a weak admissibility condition.

2.4.4 Block low-rank (BLR)

The BLR format, introduced in [21, 22], does not feature a hierarchy of blocks or bases, thus allowing more freedom in the block partitioning process. It partitions the matrix into equal-size blocks, usually referred to as tiles, with a flat data structure as opposed to a tree. This data structure is simpler to implement and can leverage existing high-performance kernels designed for hardware accelerators such as GPUs or Intel many-core processors. Since the size of the tiles is uniform, the commonly arising compression operations, such as the singular value decomposition or the rank-revealing QR decomposition, can be batched in a straightforward way. The existence of at least one effective low-rank block ensures the efficiency of the format as compared to a purely dense representation, which is noticeable in practice for elliptic PDEs and certain covariance matrices. Compared with other formats that feature almost optimal complexity, BLR has higher complexity estimates but has nonetheless proven to be competitive for a range of test problems and problem sizes [21]. A depiction of this format is shown in Figure 2.8.

Figure 2.8: The BLR format does not feature a hierarchy of blocks or bases, and it uses a straightforward admissibility condition: only the diagonal blocks are stored as dense matrices.

2.5 Summary of data-sparse formats

Table 2.2 summarizes the features of the data-sparse formats described in this chapter.

Format   Block hierarchy   Basis hierarchy   Admissibility condition   Authors
H        Yes               No                Weak or Strong            W. Hackbusch, L. Grasedyck, et al.
H²       Yes               Yes               Weak or Strong            W. Hackbusch, S. Börm, et al.
HSS      Yes               Yes               Weak                      S. Chandrasekaran, J. Xia, et al.
HODLR    Yes               No                Weak                      S. Ambikasaran, E. Darve, et al.
BLR      No                No                Fixed                     P. Amestoy, C. Ashcraft, et al.

Table 2.2: Summary of the defining characteristics of data-sparse formats and their main proponents.

Chapter 3

Related work

The last two decades have witnessed an increasing interest in the use of data-sparse or hierarchical low-rank approximations for the efficient solution of linear systems. Early work in the field focused on dense linear systems; while recent work in the machine learning community is also taking advantage of these techniques, this chapter focuses on techniques developed for the solution of sparse linear systems. The intention of the following sections is to be comprehensive; however, given the rapidly expanding nature of this field, the following narrative will quickly become outdated.

3.1 Approximate triangular factorizations

Leveraging an underlying hierarchically low-rank structure has been a successful strategy for improving the arithmetic complexity of direct solvers. As a result, direct solvers are becoming feasible candidates for tackling large-scale problems. In this section, two major directions towards fast direct solvers for sparse matrices are described.

3.1.1 Data-sparse multifrontal

Arguably the most explored approach to accelerate matrix factorizations via data- sparse methods is the use of low-rank approximations to compress the dense frontal blocks that arise in the multifrontal variant of Gaussian elimination. Frontal blocks in the sense of matrices that represent the interaction of lower dimensional partitions, 40 as opposed to the entire 3D problem. The enabling property is that under proper ordering, many of the off-diagonal blocks of the Schur complement of discretized elliptic PDEs have an effective low- rank approximation [23]. Under this property, it is possible to improve the memory and arithmetic estimate of the conventional multifrontal solver [24]. Furthermore, within each format, there are efficient methods to perform the necessary arithmetic operations between fronts in dense and low-rank format; this is to preserve the low- rank approximations during the factorization and solution stages of the solver and stay within the improved complexity estimates. Variations of this strategy differ in the type of data-sparse format used, and in the compression algorithm that builds the low-rank approximations of the dense frontal matrices. The work of J. Xia et al. [25] relies on nested dissection as the permutation strat- egy and uses the multifrontal method as a solver. Frontal matrices are approximated with the HSS format, while the solver relies on the corresponding HSS algorithms for elimination [10, 26]. A similar line of work is the generalization of this method to 3D problems and general meshes by G. Schmitz et al. [27, 28]. More recently, P. Ghysels et al. [12] introduced a method based on a fast ULV decomposition [29] and randomized sampling of HSS matrices in a many-core environment, where HSS approximations are used only to approximate fronts of large enough size, as the com- plexity constant of building an HSS approximation only pays off for enough large matrices. This approach is not limited to a particular hierarchical format. The work of A. Aminfar et al. [30] proposed the use of the HODLR matrix format [31] in combination with the boundary distance low-rank approximation method BDLR [32] also in the context of the multifrontal method. In Wang et al. [33] the authors investigate the use of the HSS format [34] to ac- celerate the parallel multifrontal method, which results in an algorithm known as the 41 HSS-structured multifrontal solver (HSSMF). The general approach uses intra-node parallel HSS operations within a distributed memory implementation of the multi- frontal sparse factorization. This method lowers the complexity of both arithmetic operations and memory consumption of the resulting HSS-structured multifrontal solver by leveraging the underlying numerically low-rank structure of the interme- diate dense matrices appearing within the factorization process, driven by a nested dissection ordering. In a similar line of work and in a distributed memory environment, Ghysels et al. [12] investigate a combination of the multifrontal method and the HSS-structured hierarchical format, extending the range of applicability of the solver to general nonsymmetric matrices. Using the task-based parallelism paradigm, they introduce randomized sampling compression [13] and fast ULV HSS factorization [35]. 
Under the assumption of the existence of an underlying low-rank structure of the frontal matrices, randomized methods deliver almost linear complexity; this reduces the asymptotic complexity of the solver, which is mainly attributed to the frontal matrices near the root of the elimination tree. The effectiveness of these task-based algorithms in combination with a distributed memory implementation of the multifrontal method is demonstrated in an early-stage software release of the package STRUMPACK [36], which we consider in the numerical experiments of this dissertation. The HSS format assumes a weak admissibility condition, which in practice requires the use of large numerical ranks even for approximations with modest relative accuracy. Consequently, this stresses the memory requirements and increases overall execution time. Solovyev et al. present a further exploration of the combination of the multifrontal method and HSS matrices in [37, 38]. The BLR format [39] has also been used to compress blocks into low-rank approximations to accelerate the factorization process of the multifrontal method. This format is compatible with numerical pivoting and is well-suited for the reuse of existing high-performance implementations of dense linear algebra kernels. Even though this format is not hierarchical, it has proven to be useful for a broad range of problems [21] within the distributed memory implementation of the multifrontal method provided by the MUMPS library [40]. The interpolative decomposition [41, 42] is another method for finding low-rank approximations that has proved to yield fast solvers for symmetric elliptic PDEs and integral equations within the framework of the multifrontal method. This decomposition relies on a "skeletonization" procedure that eliminates a redundant set of points from a symmetric matrix to further compress dense fronts. The key step in skeletonization uses the interpolative decomposition of low-rank matrices to achieve a quasi-linear overall complexity in factorization. The performance of hierarchical interpolative decomposition in a distributed memory environment is reported in [43].

3.1.2 Data-sparse supernodal

In [44], two variants of the supernodal Cholesky factorization together with the block low-rank (BLR) data-sparse format [21] are discussed: one optimizes for memory footprint and the other for time-to-solution. The multi-threaded PaStiX library [45] is the direct solver used in this work. The BLR format is not hierarchical, in the sense that it divides dense blocks into smaller tiles organized in a flat data structure. Tiles along the block diagonal are stored as dense matrices, whereas off-diagonal tiles are stored as low-rank approximations computed with either the singular value decomposition (SVD) or the rank-revealing QR (RRQR) method. The variants are characterized by the compression method used: the memory-optimal variant uses the SVD, while the time-optimal variant uses RRQR. In a similar fashion, [46] proposes a method that combines the supernodal left-looking variant of the Cholesky factorization [47] with data-sparse approximations for the solution of symmetric positive definite linear systems that arise from the discretization of elliptic PDEs. Traditional supernodal methods rely on optimized BLAS-3 kernels to eliminate the supernodes that group nonzero entries, whereas this method performs elimination with structured algebra. The resulting solver preserves the data locality features of the factorization method, plus the memory efficiency of data-sparse approximations. The construction of the structured factors relies on randomized low-rank approximations [48, 49] to form low-rank blocks of the form UV^T. Numerical examples demonstrate memory savings against the direct Cholesky factorization and robustness against some of the preconditioners provided in Trilinos [50], such as Jacobi, incomplete Cholesky, and multigrid, on a set of test problems that include nonlinear elasticity.

3.1.3 Compression of the entire triangular factors

Rather than compressing individual blocks within the decomposition process, in this section we comment on other hierarchy-exploiting techniques that focus on approximating the entire triangular factors. The use of a nested dissection ordering of the unknowns as a clustering strategy for the construction of an H-matrix is known as H-Cholesky, by I. Ibragimov et al. [51], and H-LU, by L. Grasedyck et al. [52, 53]. The main idea is to create an H-matrix approximation of the sparse system with a clustering based on domain decomposition. The advantage of this nested dissection ordering of the unknowns is that large blocks of zeros are preserved after factorization. The nonzero blocks are approximated with a low-rank approximation, and an LU factorization is performed under the appropriate arithmetic operations. Recently, R. Kriemann et al. [54] demonstrated that the H-LU method can be efficiently implemented with task-based scheduling based on a directed acyclic graph on modern manycore systems. A similar line of work from J. Xia et al. [55] also proposes the construction of a rank-structured Cholesky factorization via the HSS hierarchical format. H. Pouransari et al. approximate fill-in via low-rank approximations with the H^2 format; see [56]. This format guarantees linear complexity provided that blocks correspond to well-separated clusters and have a data-sparse property. The algorithm starts by recursively bisecting the computational domain, implicitly forming a binary tree. The leaf nodes correspond to independent subdomains, and the internal nodes correspond to Schur complements to be computed with low-rank arithmetic operations. The bottom-up elimination process is performed with a procedure referred to as "extended sparsification," in which the original matrix dimension grows by introducing auxiliary variables but nonetheless remains sparse. Alternatively, elimination can be performed with an in-place algorithm that keeps the matrix size constant. A related method with similar strategies to this work is the so-called "compress and eliminate" solver [57]. A recent extension of this algorithm to a distributed memory environment, documented in [58], demonstrates that concurrent processors can work on independent subdomains defined by their corresponding subgraphs, where interior vertices are eliminated concurrently. Communication is needed at the boundary vertices, but additional concurrency at the boundary is exploited through graph coloring.

3.2 Approximate inverse

3.2.1 Iterative procedure

The computation of an approximate matrix inverse has also been performed in combination with low-rank approximations and the Newton-Schulz iteration [59, 60, 61]. This method is advantageous for high-performance computing because it mainly relies on matrix-matrix multiplications. Moreover, the Newton-Schulz iteration has quadratic convergence. However, fast convergence relies on a high-quality initial guess, which is a challenging task on its own, as it typically requires domain expertise of the underlying problem under consideration. See for instance [62], where the sign function of Hermitian matrices is computed via this method. An alternative approach based on recursion was also proposed in early work by Hackbusch et al. [3], as we review in the next section.

3.2.2 Recursive formulation

The recursive matrix inversion algorithm considers a 2 × 2 block partitioning of the matrix, and then performs Gaussian elimination on one of the diagonal blocks. This procedure is repeated until a small enough block size, typically stored in dense format, is reached. Next, a regular BLAS-3 kernel is called to compute the inverse of this block in the traditional sense, taking into account the properties of the matrix, such as whether it is symmetric positive definite, to ensure regularity. The recursive process is then reversed until the inverse of the full matrix is computed. A disadvantage of this method is its sequential nature along diagonal blocks. A study of the trade-off between load imbalance and sequential work is presented in [63]. Also, this recursive algorithm was used at low accuracy to provide an initial guess for the Newton-Schulz approximate inverse iteration discussed in the previous section, boosting the parallel performance of the overall inversion process. R. Li et al. propose a multilevel preconditioner based on low-rank corrections designed to exploit multiple levels of parallelism on modern high-performance computing architectures; see [64]. The aim of this framework is to exploit data-sparsity rather than nonzero sparsity in the approximate inverse matrix. The preconditioner constructs the approximate inverse by recursively computing the inverse of its corresponding 2 × 2 block diagonal approximation, plus a low-rank correction given by the Sherman–Morrison formula [65]. Low-rank approximations are performed with the Lanczos algorithm [66] with reorthogonalization. In the case in which just a few eigenpairs are needed, the Lanczos algorithm can efficiently approximate them via matrix-vector products without forming the matrix explicitly, and in the same fashion it can update the preconditioner to control its accuracy. A generalization of this method is presented in [67], featuring a Schur complement based low-rank correction that is shown via numerical experiments to be less sensitive to indefiniteness than incomplete LU preconditioners and sparse approximate inverse preconditioners. The parallel features of this method are illustrated in a subsequent publication that targets a distributed memory environment [68]. Comparisons with other distributed memory preconditioners, such as the pARMS method [69] and the RAS preconditioner [70], are provided. Within the same domain decomposition framework with low-rank corrections, another recent generalization was presented in [71]. The so-called hierarchical interface decomposition imposes an additional ordering on the sparse matrix that exploits a hierarchy among interface points, so that blocks of the reordered matrix can be factored simultaneously. A fast direct method for high-order discretizations of elliptic PDEs has been proposed by Martinsson et al. [72, 73, 74]. The method is based on a multi-domain spectral collocation discretization scheme and a hierarchy of nested grids, similar to nested dissection. It exploits analytical properties of elliptic PDEs to build Dirichlet-to-Neumann operators, by hierarchically merging these operators originating from smaller grids. When computations are done using the HSS data-sparse format, an asymptotic complexity of O(N^{4/3}) can be reached. Although this arithmetic complexity is not log-linear, by virtue of the high-order discretization of the PDE used in this method, the accuracy produced per degree of freedom is substantial.

3.2.3 Matrix-free

Fast multipole methods [75] feature high arithmetic intensity, a high degree of parallelism, controllable accuracy, and a potentially less synchronous communication pattern as compared to factorization-based and multilevel methods. One apparent disadvantage of the FMM approach to preconditioning is that it does not naturally incorporate boundary conditions, and that it relies on the existence of a Green's function for the particular PDE under consideration. In [76], the authors propose the use of the boundary element method as a way to incorporate boundary conditions over arbitrary geometries. Furthermore, via numerical examples, it is demonstrated that for a certain class of problems, fast multipole methods have advantages regarding robustness and parallel scalability as compared to the algebraic multigrid implementation of the hypre library [77].

Chapter 4

Accelerating cyclic reduction

This chapter starts by reviewing the cyclic reduction algorithm in preparation for the accelerated cyclic reduction variant, which improves its arithmetic and memory complexity to near-optimal estimates, even in the variable-coefficient case.

4.1 Introduction to cyclic reduction

Cyclic reduction (CR) was introduced by R. W. Hockney in 1965 at Stanford University [78], and then formalized by Buzbee and Golub in 1970 [79]. Cyclic reduction is a recursive algorithm for (block) tridiagonal linear systems that exploits an odd-even (red-black) numbering of the unknowns, which in 2D represent a sequence of lines and in 3D a sequence of planes. The reordering strategy solves for half of the system at a time, which reduces the problem size by half at each step. The formulation is based on the computation of the Schur complement, with the property that the block tridiagonal structure is preserved throughout the recursion of the algorithm. The red-black ordering of the unknowns for block tridiagonal systems is the only coloring strategy that preserves the structure after the computation of the Schur complement. A graphical representation of the progression of the algorithm is shown in Figure 4.1. The algorithm consists of two phases: elimination and back-substitution. Elimination is equivalent to block Gaussian elimination without pivoting on a permuted system (PAP^T)(Pu) = Pf. The permutation matrix P corresponds to an even/odd reordering of the unknowns.

Figure 4.1: Cyclic reduction preserves a block tridiagonal structure through elimination. Red blocks depict diagonal blocks, green blocks depict the innermost of the bidiagonal blocks, and blue blocks depict the outermost of the bidiagonal blocks. Gray blocks denote blocks in which elimination is completed.

Permutation decouples the system, and the computation of the Schur complement successively reduces the problem size by half. This process is recursive, and it finishes when a single block is reached, although the recursion can be stopped early if the system is small enough to be solved directly. The second phase performs a conventional forward and backward substitution to find the solution for the corresponding right-hand side(s). Cyclic reduction was originally conceived as a direct solver; however, extensions of the use of CR as a preconditioner have appeared in the literature [80, 81]. Regarding parallel implementations, owing to its vast concurrency features, there is documentation of GPU implementations [82, 83, 84] and distributed memory implementations [85, 86, 87, 88, 89], although to the best of our knowledge none of them uses hierarchical matrices.

4.2 Cyclic reduction on a model problem

As introduced in Section 1.3, consider the linear variable-coefficient Poisson equation

-\nabla \cdot (\kappa(\vec{x})\, \nabla u) = f(\vec{x}) \qquad (4.1)

discretized with a second-order finite difference approximation on n × n grid points and Dirichlet boundary conditions. The corresponding matrix A that arises from the discretization has a block tridiagonal structure:

A = \mathrm{tridiagonal}(E_i, D_i, F_i) =
\begin{bmatrix}
D_1 & F_1 &         &         &         \\
E_2 & D_2 & F_2     &         &         \\
    & \ddots & \ddots & \ddots &        \\
    &        & E_{n-1} & D_{n-1} & F_{n-1} \\
    &        &         & E_n     & D_n
\end{bmatrix}. \qquad (4.2)

For instance, if κ(~x) = 1, the blocks become:

    4 1 1  −     1 4 1   1   −     ......   ..  Di =  . . .  ,Ei = Fi =  .  . (4.3)      1 4 1   1   −    1 4 1 −

One could save memory and operations in this case, since the blocks within each diagonal are identical to each other and some of the subsequent operations are therefore the same. To aid exposition, these optimizations are not considered in the following sections, as we are primarily interested in the general case where κ(~x) is not constant.
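As a point of reference, the block structure of (4.2)-(4.3) can be assembled in a few lines; the following is a minimal NumPy sketch (illustrative only, not part of the dissertation's implementation; the names are hypothetical) that builds the constant-coefficient blocks under the sign convention of (4.3). In the variable-coefficient case, each D_i, E_i, and F_i would instead carry the local values of κ.

    import numpy as np

    def poisson_blocks_2d(n):
        # Blocks of the block tridiagonal matrix (4.2) for the constant-coefficient
        # 2D Poisson operator on an n x n grid: D_i = tridiag(-1, 4, -1), E_i = F_i = -I.
        D = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        E = -np.eye(n)
        F = -np.eye(n)
        return ([D.copy() for _ in range(n)],
                [E.copy() for _ in range(n)],
                [F.copy() for _ in range(n)])

    D, E, F = poisson_blocks_2d(8)   # n = 8 block rows, each block of size 8 x 8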

4.2.1 Elimination

The block cyclic reduction method can, for instance, be used to solve the above block tridiagonal linear system. As an illustration, consider a discretization with n = 8 points per dimension in 2D, which results in an N × N sparse matrix, with N = n^2:

\begin{bmatrix}
D_0 & F_0 &     &     &     &     &     &     \\
E_1 & D_1 & F_1 &     &     &     &     &     \\
    & E_2 & D_2 & F_2 &     &     &     &     \\
    &     & E_3 & D_3 & F_3 &     &     &     \\
    &     &     & E_4 & D_4 & F_4 &     &     \\
    &     &     &     & E_5 & D_5 & F_5 &     \\
    &     &     &     &     & E_6 & D_6 & F_6 \\
    &     &     &     &     &     & E_7 & D_7
\end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \\ u_6 \\ u_7 \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ f_3 \\ f_4 \\ f_5 \\ f_6 \\ f_7 \end{bmatrix}. \qquad (4.4)

The first step of cyclic reduction is to block-permute the system into evens/odds:

\begin{bmatrix}
D_0 &     &     &     & F_0 &     &     &     \\
    & D_2 &     &     & E_2 & F_2 &     &     \\
    &     & D_4 &     &     & E_4 & F_4 &     \\
    &     &     & D_6 &     &     & E_6 & F_6 \\
E_1 & F_1 &     &     & D_1 &     &     &     \\
    & E_3 & F_3 &     &     & D_3 &     &     \\
    &     & E_5 & F_5 &     &     & D_5 &     \\
    &     &     & E_7 &     &     &     & D_7
\end{bmatrix}
\begin{bmatrix} u_0 \\ u_2 \\ u_4 \\ u_6 \\ u_1 \\ u_3 \\ u_5 \\ u_7 \end{bmatrix}
=
\begin{bmatrix} f_0 \\ f_2 \\ f_4 \\ f_6 \\ f_1 \\ f_3 \\ f_5 \\ f_7 \end{bmatrix}. \qquad (4.5)

Consider a 2 × 2 partition of the system, which can be written more succinctly as the following partitioned matrix:

\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\begin{bmatrix} u_{\mathrm{even}} \\ u_{\mathrm{odd}} \end{bmatrix}
=
\begin{bmatrix} f_{\mathrm{even}} \\ f_{\mathrm{odd}} \end{bmatrix}. \qquad (4.6)

Notice that the upper-left block (A_{11}) is block diagonal, which means that its inverse can be computed in parallel as the inverse of each block (D_0, D_2, D_4, and D_6). The Schur complement of the upper-left partition may then be computed:

(A_{22} - A_{21} A_{11}^{-1} A_{12})\, u_{\mathrm{odd}} = f^{(1)}, \qquad f^{(1)} = f_{\mathrm{odd}} - A_{21} A_{11}^{-1} f_{\mathrm{even}}. \qquad (4.7)

As mentioned in the previous section, an essential property of block tridiagonal matrices under even/odd ordering is that the computation of the Schur complement also yields a block tridiagonal matrix, hence the word cyclic in the name of the algorithm. Superscripts on the blocks denote the step number of the algorithm; see (4.8). This notation is consistent with [90, 91]. Notice the block tridiagonal structure after the first step of elimination:

 (1) (1)   (1)   (1)  D0 F0 u0 f0        (1) (1) (1)   (1)   (1)   E D F   u   f   1 1 1   1   1      =   . (4.8)  E(1) D(1) F (1)   u(1)   f (1)   2 2 2   2   2   (1) (1) (1)   (1)   (1)  E3 D3 F3 u3 f3

The algorithm continues with another even/odd permutation of the blocks:

 (1) (1)   (1)   (1)  D0 F0 u0 f0        (1) (1) (1)   (1)   (1)   D E F   u   f   2 2 2   2   2      =   . (4.9)  E(1) F (1) D(1)   u(1)   f (1)   1 1 1   1   1   (1) (1)   (1)   (1)  E3 D3 u3 f3

After the elimination, the computation of the Schur complement yields a 2 × 2 block system:

\begin{bmatrix} D_0^{(2)} & F_0^{(2)} \\ E_1^{(2)} & D_1^{(2)} \end{bmatrix}
\begin{bmatrix} u_0^{(2)} \\ u_1^{(2)} \end{bmatrix}
=
\begin{bmatrix} f_0^{(2)} \\ f_1^{(2)} \end{bmatrix}. \qquad (4.10)

The last "cycle" of permutation and Schur complementation leads to the D_0^{(3)} block, which is the last step of the elimination phase of the cyclic reduction algorithm for this problem size. The reader is encouraged to take another look at Figure 4.1 now that the algebra has been laid out, and notice how the original matrix starts block tridiagonal and, in a sequence of steps, is reduced to a block diagonal matrix.

4.2.2 Solve

After elimination is completed, the solve stage starts from the last block of unknowns:

D_0^{(3)} u_0^{(3)} = f_0^{(3)}. \qquad (4.11)

Once the solution at the last step, u_0^{(3)}, is computed, it is propagated backward in the hierarchy of the elimination tree. The formula to compute the solution at step q (with n = 2^q) is given by:

u^{(q)} = (D^{(q)})^{-1}\left(f^{(q)} - E^{(q)} u^{(q+1)} - F^{(q)} u^{(q+1)}\right). \qquad (4.12)

This procedure continues until the solution of the entire linear system is computed. The solve phase requires fewer operations than elimination, as it only involves matrix-vector multiplications. For large-scale problems, this makes the solve phase orders of magnitude faster than elimination. The ability to efficiently solve for a given right-hand side once the elimination is completed motivates the use of ACR for multiple right-hand sides at a minimal cost per new right-hand side; this is true for other factorizations as well.
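To make the elimination and back-substitution phases concrete, the following is a minimal dense-arithmetic sketch of block cyclic reduction in NumPy (illustrative only; it is not the ACR implementation, which replaces every dense block and every dense operation below with its H-matrix counterpart). It operates on lists of blocks at block granularity rather than on the permuted global matrix, and the final check verifies one block equation of the original system.

    import numpy as np

    def cr_solve(D, E, F, f):
        # Block cyclic reduction for tridiagonal(E_i, D_i, F_i) u = f in dense
        # arithmetic. D, E, F, f are lists of blocks/vectors; E[0] and F[-1] are unused.
        m = len(D)
        if m == 1:
            return [np.linalg.solve(D[0], f[0])]
        # Elimination: Schur complement onto the odd-indexed block rows.
        Dinv = [np.linalg.inv(D[j]) for j in range(0, m, 2)]
        Ds, Es, Fs, fs = [], [], [], []
        for i, j in enumerate(range(1, m, 2)):
            Dn = D[j] - E[j] @ Dinv[i] @ F[j - 1]
            fn = f[j] - E[j] @ Dinv[i] @ f[j - 1]
            En = -E[j] @ Dinv[i] @ E[j - 1] if j - 2 >= 0 else np.zeros_like(D[j])
            Fn = np.zeros_like(D[j])
            if j + 1 < m:
                Dn = Dn - F[j] @ Dinv[i + 1] @ E[j + 1]
                fn = fn - F[j] @ Dinv[i + 1] @ f[j + 1]
                Fn = -F[j] @ Dinv[i + 1] @ F[j + 1]
            Ds.append(Dn); Es.append(En); Fs.append(Fn); fs.append(fn)
        u_odd = cr_solve(Ds, Es, Fs, fs)          # recurse on the half-size system
        # Back-substitution: recover the even-indexed block rows, as in (4.12).
        u = [None] * m
        for i, j in enumerate(range(1, m, 2)):
            u[j] = u_odd[i]
        for i, j in enumerate(range(0, m, 2)):
            rhs = f[j].copy()
            if j - 1 >= 0:
                rhs = rhs - E[j] @ u[j - 1]
            if j + 1 < m:
                rhs = rhs - F[j] @ u[j + 1]
            u[j] = Dinv[i] @ rhs
        return u

    # Small self-check on a diagonally dominant block tridiagonal system.
    rng = np.random.default_rng(0)
    nb, m = 6, 8                                   # block size, number of block rows
    D = [4.0 * np.eye(nb) + 0.1 * rng.standard_normal((nb, nb)) for _ in range(m)]
    E = [-np.eye(nb) for _ in range(m)]
    F = [-np.eye(nb) for _ in range(m)]
    f = [rng.standard_normal(nb) for _ in range(m)]
    u = cr_solve(D, E, F, f)
    print(np.allclose(E[1] @ u[0] + D[1] @ u[1] + F[1] @ u[2], f[1]))   # -> True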

4.3 Accelerated cyclic reduction (ACR)

This section describes how cyclic reduction can be used in combination with hierarchical matrices, resulting in a variant that improves the computational complexity and memory requirements of the classical cyclic reduction method.

4.3.1 Modularity

ACR is modular by design; it is not limited to the use of the H-matrix format. In fact, the use of the H^2-format would immediately translate to an additional reduction of one logarithmic factor in the arithmetic and memory complexity estimates, precisely from O(k N log N (log N + k^2)) to O(k N log N) in terms of operations, and from O(k^2 N log N) to O(k N) in terms of memory requirements. Similar reductions would hold true if another data-sparse format with nested bases were used, such as HSS or HBS, although those two formats in particular feature weak admissibility, which in practice requires larger numerical ranks as compared to formats that allow standard admissibility, such as the H-format used here. The companion implementation of this work leverages the implementation of the H-matrix format and its arithmetic operations provided by the HLibPro library [92]. However, since HLibPro does not support distributed memory computing, an extension of HLibPro's capabilities that distributes the workload across computing nodes was implemented, as we discuss in the next chapter.

4.3.2 Block-wise H-matrix approximation

It is at the block level that improvements to the complexity estimates of the classical

cyclic reduction algorithm take place. ACR approximates each D_i, E_i, and F_i block of the original block tridiagonal matrix A (defined in (4.2)) with a hierarchical matrix, as opposed to a global approximation of A; see Figure 4.2. Subsequently, the elimination and solve phases proceed by using hierarchical matrix algebra, as opposed to conventional dense linear algebra.

Figure 4.2: Left: depiction of a rank-1 H-matrix; green blocks are low-rank blocks and red blocks are dense blocks. Right: block-wise approximation of A, as opposed to a global factorization of A. Nonempty blocks depict an H-matrix.

Table 4.1 summarizes the advantage of a block-wise approximation of the original matrix with H-matrices in the computation of the inverse of a block and its storage, compared to the equivalent dense counterparts.

                          Inverse                            Storage
Dense matrix              O(N^3)                             O(N^2)
H-matrix                  O(k N log N (log N + k^2))         O(k N log N)

Table 4.1: Comparison of the complexity estimates of computing the inverse of an N × N matrix block and storing it in dense format, versus approximating the matrix block with a hierarchical matrix with numerical rank k.

4.3.3 Slicing decomposition

From a grid decomposition point of view, the red/black ordering of the unknowns slices the domain into lines or planes, depending on whether the underlying problem is 2D or 3D, respectively, as depicted in Figure 4.3.


Figure 4.3: Left: grid; center: slice decomposition of the domain; right: resulting matrix structure. In 2D, N = n^2 and each block row represents a line of size n × n, whereas in 3D, N = n^3 and each block row represents a plane of size n^2 × n^2.

This decomposition bears a similarity to the slice decomposition reported in [93]. It is also used in a similar line of work in the sweeping preconditioner literature [94, 95], with the crucial distinction that rather than sweeping through the domain in one direction, ACR recursively eliminates half of the remaining planes at once, concurrently.

4.3.4 Reduced-dimension H-matrices

In generating the structure of the hierarchical matrix representations of the blocks, we exploit the fact that, for a 3D problem, the domain is subdivided into n planes, each consisting of n^2 grid points. As a result, block rows of the matrix are identified with the planes of the discretization grid. We consider this geometry and use a two-dimensional planar bisection clustering when constructing each H-matrix. In other words, ACR deals with H-matrices of one dimension less than the original problem dimension. For this work, the standard admissibility condition was chosen, as opposed to the simpler weak admissibility condition that the H-matrix format also allows, because it provides the flexibility of selecting a range of coarser to finer blocks. The next subsection motivates the usefulness of this trade-off regarding memory usage, and in subsequent chapters this trade-off also plays a role in performance.

4.3.5 Tuning parameters

There are three tuning parameters in the construction of an H-matrix that can be leveraged to optimize for memory requirements and performance: ε_H, η, and n_min (Section 2.2).

The first parameter is ε_H, which controls the specified block-wise relative accuracy of the H-matrix blocks tagged as low-rank, depicted in green in Figure 4.2. This parameter resembles the cut-off tolerance ε of the truncated SVD, which for instance disregards singular values to achieve an approximation accuracy of ε = 10^{-8}. The second parameter is η, from the admissibility condition. The case for choosing a standard admissibility condition is that, by further refining off-diagonal blocks, it is possible to achieve the same relative accuracy but with smaller numerical ranks, albeit with more off-diagonal blocks. Small numerical ranks are crucial to ensure economical memory consumption and overall high performance. The impact on the block structure given by the choice of admissibility condition at both ends of the spectrum is illustrated in Figure 4.4. The example depicts the H-inverse of the variable-coefficient two-dimensional Poisson operator discretized on an N = 64 × 64 grid using the five-point finite difference scheme. In the figure on the right, the use of a few small dense blocks in the off-diagonal regions allows smaller ranks, to the same relative accuracy.

(a) Weak admissibility. (b) Standard admissibility.
Figure 4.4: H-inverse of the 2D Poisson operator, discretized with N = 64 × 64 grid points, using two different admissibility conditions. The number in each block is the numerical rank necessary to achieve an overall compression accuracy of ε_H = 10^{-4}. The color map indicates the relative size of the required numerical rank k and the block size n; thus deep blue indicates a good compression, k ≪ n.

Table 4.2 shows the storage gains from representing the inverse of a 2D Poisson problem with an H-matrix with weak admissibility versus standard admissibility, and also the difference in the number of operations between these two structures. The H-matrix inversion requires 56 C_sp^3 k n (log n + 1)^2 + 184 C_sp k^3 n (log n + 1) operations, where k represents the average rank of the low-rank blocks and C_sp represents the sparsity of the structure of the hierarchical matrix inverse; see [5]. Since the weak admissibility condition requires larger ranks than the standard admissibility condition, at scale, this tends to increase the memory requirements and the number of floating-point operations.

Admissibility    Storage    Operations
Weak             723 MB     8.0E11
Standard         434 MB     5.0E11

Table 4.2: H-inverse of the 2D Poisson operator for N = 128^2 grid points, using two different admissibility conditions. We document the memory and floating-point operations required to build the H-matrix inverse with weak and standard admissibility. The weak admissibility condition tends to require larger ranks, which leads to increased memory requirements and more arithmetic operations than the standard admissibility condition. (The equivalent dense storage and arithmetic would have required 2,147 MB and 1.0E13 operations.)

The third parameter, n_min, determines that blocks with at most n_min rows or columns are stored as dense matrices, as it is more efficient to operate on them in dense than in low-rank form. It also avoids unnecessarily deep binary trees in the structure of the H-matrices. Figure 4.4 depicts these blocks in red.
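To make the roles of η and n_min concrete, the following is a small Python sketch (illustrative only; HLibPro's actual cluster-tree interfaces are not shown, and all names are hypothetical) of how a block of the H-matrix structure is classified. The parameter ε_H does not enter the classification itself; it only sets the accuracy to which admissible blocks are later compressed.

    import numpy as np

    def diameter(box):
        lo, hi = box                       # box = (lower corner, upper corner)
        return np.linalg.norm(hi - lo)

    def distance(box_s, box_t):
        (lo_s, hi_s), (lo_t, hi_t) = box_s, box_t
        gap = np.maximum(0.0, np.maximum(lo_s - hi_t, lo_t - hi_s))
        return np.linalg.norm(gap)

    def block_type(box_s, box_t, size_s, size_t, eta=2.0, n_min=32):
        # Standard admissibility: min(diam(s), diam(t)) <= eta * dist(s, t).
        if size_s <= n_min or size_t <= n_min:
            return "dense leaf"            # small blocks are cheaper to keep dense
        if min(diameter(box_s), diameter(box_t)) <= eta * distance(box_s, box_t):
            return "low-rank leaf"         # admissible: compress to accuracy eps_H
        return "subdivide"                 # inadmissible: recurse on the children

    # Two well-separated clusters of a plane and their bounding boxes (illustrative).
    box_s = (np.array([0.0, 0.0]), np.array([0.25, 0.25]))
    box_t = (np.array([0.75, 0.0]), np.array([1.0, 0.25]))
    print(block_type(box_s, box_t, size_s=4096, size_t=4096))   # -> low-rank leaf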

4.3.6 Fixed-rank versus adaptive-rank arithmetic

Arithmetic operations in the H-format can be performed in two ways: with fixed-rank approximations or with adaptive-rank approximations; see [96]. Fixed-rank arithmetic and compression operate under the assumption of a maximum rank in each low-rank block of the resulting H-matrix. This accuracy specification is useful when a memory budget is prescribed. In the case of adaptive ranks, only the singular values (ranks) strictly necessary to achieve a low-rank block accuracy of ε_H are preserved. For certain operations, such as H-matrix addition, the ranks in the resulting H-matrix might increase in order to maintain a given ε_H accuracy, but one can be confident that the operation was performed to the specified precision and that re-compression routines can reduce the rank requirements to a minimum. Adaptive-rank arithmetic is the method of choice in this work, as it ensures a prescribed precision while discarding unnecessarily large ranks in the H-matrix blocks. ACR requires hierarchical matrix addition, subtraction, matrix-matrix multiplication, matrix-vector multiplication, and matrix inversion. The relative accuracy of the approximation is therefore specified during the construction of each block and while performing hierarchical matrix arithmetic operations. Committing to a given tolerance ensures that the numerical ranks are adjusted to preserve the specified accuracy during the elimination and solve phases.
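The adaptive-rank truncation can be summarized by the following sketch (a generic truncated SVD in NumPy; it is not HLibPro's re-compression routine, and the truncation rule, keeping singular values above ε relative to the largest one, is one common choice): given a dense block and a tolerance, only the singular values needed to reach that accuracy are kept, so the rank adapts to the block, whereas the fixed-rank variant always keeps the same number.

    import numpy as np

    def truncate_adaptive(B, eps=1e-8):
        # Low-rank factors U, V with B ~ U @ V.T; the rank adapts to the decay of
        # the singular values, keeping those above eps relative to the largest one.
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        k = max(1, int(np.sum(s > eps * s[0])))
        return U[:, :k] * s[:k], Vt[:k, :].T

    def truncate_fixed(B, k):
        # Fixed-rank alternative: always keep exactly k singular values.
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        return U[:, :k] * s[:k], Vt[:k, :].T

    # Example: an interaction between two well-separated point sets decays rapidly.
    x = np.linspace(0.0, 1.0, 200)
    B = 1.0 / (1.0 + np.abs(np.subtract.outer(x, x + 3.0)))
    U, V = truncate_adaptive(B, eps=1e-8)
    print("adaptive rank:", U.shape[1],
          "relative error:", np.linalg.norm(B - U @ V.T) / np.linalg.norm(B))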

4.3.7 General algorithm

Let the operators ⊕, ⊖, ⊗, and H-inverse denote the arithmetic operations of addition, subtraction, matrix-matrix and matrix-vector multiplication, and inversion under the H-matrix format. To simplify the exposition, we assume the size of the linear system is a power of two; the number of steps required by ACR is thus q = log_2 N. As mentioned in the previous section, two procedures define cyclic reduction: elimination and back-substitution. The elimination algorithm is shown in Algorithm 1, whereas the back-substitution algorithm is shown in Algorithm 2. Even though Algorithm 1 and Algorithm 2 show permutations and matrix operations at the level of the global system, the reference implementation operates at a per-block granularity, which means that permutations are part of the implementation's logic and that linear algebraic operations are performed block by block. This is possible since cyclic reduction preserves the block tridiagonal structure during elimination.

Algorithm 1 Elimination is the most compute-intensive algorithm, as it uses hierarchical matrix-matrix operations. It is in charge of recursively eliminating half of the unknowns via Schur complementation.
1: Set A^{(0)} = A
2: for i = 0 to q−1 do
3:   A^{(i+1)} = (A_{22}^{(i)} − A_{21}^{(i)} ⊗ H-inverse(A_{11}^{(i)}) ⊗ A_{12}^{(i)})
4:   f^{(i+1)} = f_2^{(i)} − A_{21}^{(i)} ⊗ H-inverse(A_{11}^{(i)}) ⊗ f_1^{(i)}
5: end for

Algorithm 2 Solve relies on the elimination phase being completed and proceeds to evaluate a given right-hand side(s) with the use of hierarchical matrix-vector multiplications only.
1: Solve A^{(q)} u^{(q)} = f^{(q)}
2: for i = q−1 to 0 do
3:   u^{(i)} = H-inverse(A_{11}^{(i)}) ⊗ (f^{(i)} − A_{12}^{(i)} ⊗ u^{(i+1)})
4: end for

4.3.8 Sequential complexity estimates

Every cyclic reduction step requires two matrix-matrix multiplications, one matrix inversion, and one matrix addition per eliminated block. These kernels have arithmetic complexity of O(k n log n (log n + k^2)) operations [5]. For a problem size of N = n^2 with n = 2^q, ACR requires n/2 + n/4 + n/8 + · · · ≈ n block eliminations to complete the elimination phase. The most expensive computation in each step is the computation of the inverse of a block of size n × n, which in the H-format has a complexity of O(k n log n (log n + k^2)); therefore, ACR results in an O(k N log N (log N + k^2)) overall algorithm, with O(k N log N) memory requirements. Table 4.3 summarizes the complexity estimates of the classical cyclic reduction algorithm versus the proposed variant for two-dimensional problems with N = n^2 unknowns. The complexity estimate for 3D problems follows the same derivation, with the difference of a higher complexity constant, since the ranks of 2D matrices are higher than the ranks of 1D matrices.

Method                                  Operations                       Memory
Cyclic Reduction (CR)                   O(N^2)                           O(N^{1.5} log N)
Accelerated Cyclic Reduction (ACR)      O(k N log N (log N + k^2))       O(k N log N)

Table 4.3: Summary of the sequential complexity estimates of the classical cyclic reduction method and the proposed variant, accelerated cyclic reduction; k represents the numerical rank of the approximation.
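The ACR row of Table 4.3 can be assembled with a short calculation; the following is a sketch under the assumptions above (bounded rank k, blocks of size n × n, and the per-block costs of Table 4.1):

\underbrace{\frac{n}{2} + \frac{n}{4} + \cdots + 1}_{\approx\, n \ \text{block eliminations}}
\times\; O\!\big(k\,n\log n\,(\log n + k^{2})\big)
\;=\; O\!\big(k\,n^{2}\log n\,(\log n + k^{2})\big)
\;=\; O\!\big(k\,N\log N\,(\log N + k^{2})\big),

since N = n^2; similarly, storing O(n) blocks at O(k n log n) each gives the O(k N log N) memory estimate.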

The asymptotic complexity of ACR compares favorably to two other solvers that exploit hierarchical matrix representations, H-LU [52] and multifrontal HSS [33]. ACR and H-LU have similar complexity estimates, although they have different prefactors, which are determined by the numerical rank of the approximations and the algorithm itself. Because ACR effectively uses hierarchical representations only for a set of regular two-dimensional problems, the resulting constants appearing in the asymptotic complexity estimates tend to be smaller and make it feasible to perform large-scale computations. We finally note that ACR has smaller complexity estimates than the HSS multifrontal method, which has an estimated asymptotic complexity of O(N^{4/3} log N) factorization flops for three-dimensional elliptic problems. Regarding practical usage, ACR has different concurrency properties than H-LU and the multifrontal HSS variant, enabling different amounts of independent work to be performed. The regularity of the operations of ACR is valuable for the ability of ACR to adapt to current and future hardware architectures with a high degree of parallelism, as we describe in the next chapter.

Chapter 5

Distributed memory accelerated cyclic reduction

This chapter describes how to leverage the concurrency features of the accelerated cyclic reduction method in a distributed memory environment.

5.1 Overview of the concurrency features of ACR

A number of concurrency features of the algorithms are evident. Each block row, identified by a plane in the discretization, is assigned to a logical processor. This decomposition allows the initial approximation of each block into an H-matrix in an embarrassingly parallel manner. The q = log n levels of Schur complement computation exploit concurrent execution in two ways:

- The inverse of the decoupled block A_{11} in (4.6) can be computed concurrently in a block-wise fashion since it has a block diagonal structure. This computation, along with the construction of each block, is also pleasingly parallel.

- Computing the Schur complement requires two matrix-matrix multiplications and one matrix addition. Since the linear system partition is formed out of matrix blocks, these block matrix-matrix multiplications and block matrix additions can also be computed concurrently at block granularity.

Figure 5.1 depicts the concurrent elimination of planes across all the levels of ACR for a 16-plane illustrative example. This type of recursive computation that handles a reduced number of unknowns at each step is common to other hierarchical methods such as multigrid.

Figure 5.1: Concurrency in ACR elimination for a 16-plane example. Level 0 can eliminate eight planes concurrently, thus reducing the problem size by half; this process continues recursively until one plane is left.

5.2 Hybrid parallel programming model

The orchestration of distributed parallel work was designed to accommodate the architecture of a modern supercomputer, such as the second-generation Shaheen Cray XC40 supercomputer at the King Abdullah University of Science & Technology. Shaheen is composed of 6,144 nodes, with each node holding 128 GB of RAM and two Intel Haswell processors with 16 cores clocked at 2.3 GHz. The nodes are connected to each other, forming a logical unit, with a Dragonfly network interconnect. This model of multiple fast individual nodes, all physically connected through a high-speed network interconnect that is nonetheless slower than computation within the node, suggests a parallel model in which one seeks to maximize local computation within the node and minimize communication across nodes. As a result, all the calculations in charge of computing the Schur complement are performed within a node using a shared memory paradigm, and the distributed memory paradigm, via the Message Passing Interface (MPI), provides only those matrix blocks required for such a Schur complement that do not already reside on the particular node.

5.2.1 Inter-node parallelism

In a distributed environment, each plane of the computational domain is assigned to a single logical processor, and multiple logical processors are assigned to a single physical node. Let p be the number of physical compute nodes, each storing n/p planes at the beginning of the factorization. After r steps of ACR, each compute node holds n/(2^r p) planes. At level r = log(n/p), a coarse level called the C-level, every node holds a single plane only. The remaining log p steps of ACR beyond the C-level leave some compute nodes idle, as illustrated in Figure 5.2.
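As a small worked example, using the values n = 16 and p = 4 of Figure 5.2:

\frac{n}{p} = 4 \ \text{planes per node initially}, \qquad
r_{\mathrm{C}} = \log_2\frac{n}{p} = 2, \qquad
\log_2 p = 2 \ \text{steps remain beyond the C-level, each idling half of the active nodes.}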


Figure 5.2: Distribution of multiple planes per physical compute node for an example with n=16 and p=4.

Communication occurs just at inter-node boundaries at every step of the factorization. Thus, up to the C-level, there are O(p) messages per step, each transmitting planes of size O(k n^2 log n). Beyond the C-level, there are O(p/2 + · · · + 1) ≈ O(p) communication messages, adding up to a total communication volume of O(k p n^2 log n (log p + 1)) for ACR. Figure 5.3 depicts the inverted binary tree communication pattern of the elimination phase and the top-down binary tree communication pattern of the solve phase.

Figure 5.3: Communication pattern for the 8-plane case: conversion of the blocks into H-matrices, elimination, and back-substitution. P depicts the planes being eliminated, and u the solution per plane as back-substitution is executed.

5.2.2 Intra-node parallelism

Beyond parallelism across distributed computing nodes, there is additional concurrency available at the node level. Supplementary parallelism is possible, not only because elimination and back-substitution for multiple block rows can proceed concurrently, but also because hierarchical matrix arithmetic can be performed in parallel. The two levels of parallelism are shown schematically in Figure 5.4. In practice, programming models based on tasks and directed acyclic graphs have proven to be effective in parallelizing hierarchical matrix arithmetic [54, 12]. The optimal allocation of the cores of a node to either block-row processing or parallel arithmetic requires tuning for either maximum performance or maximum available memory.


Figure 5.4: Parallel ACR elimination tree depicting two levels of concurrency: distributed memory parallelism to distribute concurrent work across compute nodes, and shared memory parallelism to perform H-matrix operations within the nodes.

5.3 Parallel elimination and solve

The parallel ACR elimination and solve algorithms are listed in Algorithms 3 and 4, respectively. Consider three consecutive equations from the original linear system shown in (4.4):

E.g., for j = 1:

Eq.        u_{j-2}    u_{j-1}    u_j        u_{j+1}    u_{j+2}    rhs
0: j-1     E_{j-1}    D_{j-1}    F_{j-1}                          f_{j-1}
1: j                  E_j        D_j        F_j                   f_j            (5.1)
2: j+1                           E_{j+1}    D_{j+1}    F_{j+1}    f_{j+1}

Take as an example j = 1, and consider equations (j − 1) and (j + 1). We multiply equation (j − 1) by −E_j (D_{j−1})^{−1} and equation (j + 1) by −F_j (D_{j+1})^{−1}, and then add equations 0–2. The resulting linear system has only odd unknowns:

Eq.        u_{j-2}    u_{j-1}    u_j        u_{j+1}    u_{j+2}    rhs
j          E'_j       0          D'_j       0          F'_j       b'_j           (5.2)

This is equivalent to the Schur complement operation shown in Equation (4.7), which at block granularity is equivalent to:

E_j^{(q+1)} = -\,E_j^{(q)} (D_{j-1}^{(q)})^{-1} E_{j-1}^{(q)}

D_j^{(q+1)} = D_j^{(q)} - E_j^{(q)} (D_{j-1}^{(q)})^{-1} F_{j-1}^{(q)} - F_j^{(q)} (D_{j+1}^{(q)})^{-1} E_{j+1}^{(q)}

F_j^{(q+1)} = -\,F_j^{(q)} (D_{j+1}^{(q)})^{-1} F_{j+1}^{(q)} \qquad (5.3)

f_j^{(q+1)} = f_j^{(q)} - E_j^{(q)} (D_{j-1}^{(q)})^{-1} f_{j-1}^{(q)} - F_j^{(q)} (D_{j+1}^{(q)})^{-1} f_{j+1}^{(q)}

Algorithm 3 Parallel elimination concurrently eliminates half of the unknowns at each step via a hybrid parallel paradigm in which MPI distributes work to logical processors that locally compute hierarchical matrix operations.
1: j = processor number
2: parallel for at all processors j, j ∈ 0 : 2^q − 1
3:   Block-wise conversion to H-matrix of A^{(1)} = tridiagonal(E_j^{(1)}, D_j^{(1)}, F_j^{(1)})
4: end parallel for
5: for i = 1 to q do
6:   parallel for at j even, j ∈ 0 : 2^{q−i} − 1
7:     H-inverse(D_j^{(i)})
8:     Communicate E_j^{(i)}, (D_j^{(i)})^{−1}, F_j^{(i)}, f_j^{(i)} to processor j − 1
9:     Communicate E_j^{(i)}, (D_j^{(i)})^{−1}, F_j^{(i)}, f_j^{(i)} to processor j + 1
10:  end parallel for
11:  parallel for at j odd, j ∈ 0 : 2^{q−i−1} − 1
12:    Compute E_j^{(i+1)}, D_j^{(i+1)}, F_j^{(i+1)}, f_j^{(i+1)} from Equation 4.7
13:  end parallel for
14: end for

Algorithm 4 Parallel solve evaluates a given right-hand side(s) in a hybrid parallel paradigm, communicating only vector blocks and locally computing with hierarchical matrix-vector operations only.
1: n = 2^q
2: j = processor number
3: for i = q to 1 do
4:   parallel for at j, j ∈ 0 : 2^{q−i} − 1
5:     Compute u_j^{(i)} = (D_j^{(i)})^{−1} ⊗ (f_j^{(i)} − E_j^{(i)} ⊗ u_j^{(i+1)} − F_j^{(i)} ⊗ u_j^{(i+1)})
6:     Communicate u_j^{(i)} to processor j − 1
7:     Communicate u_j^{(i)} to processor j + 1
8:   end parallel for
9: end for
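The following mpi4py sketch illustrates the neighbor exchange of a single elimination level of Algorithm 3, with one plane per MPI rank and dense blocks standing in for H-matrices (illustrative only; it is not the dissertation's implementation, which handles multiple planes per node, all levels, and H-arithmetic). Even ranks invert their diagonal block and ship it to their odd neighbors; odd ranks receive and apply (5.3).

    # Run with, e.g.:  mpiexec -n 8 python acr_level.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n = 64                                     # plane size (illustrative)
    rng = np.random.default_rng(rank)
    D = 4.0 * np.eye(n) + 0.01 * rng.standard_normal((n, n))
    E = -np.eye(n) if rank > 0 else None       # E of the first plane does not exist
    F = -np.eye(n) if rank < size - 1 else None
    f = np.ones(n)

    if rank % 2 == 0:                          # even planes: invert and send to odd neighbors
        Dinv = np.linalg.inv(D)                # H-inverse in the actual solver
        payload = (E, Dinv, F, f)
        if rank - 1 >= 0:
            comm.send(payload, dest=rank - 1, tag=0)
        if rank + 1 < size:
            comm.send(payload, dest=rank + 1, tag=0)
    else:                                      # odd planes: receive neighbors, apply (5.3)
        E_l, Dinv_l, F_l, f_l = comm.recv(source=rank - 1, tag=0)
        D_new = D - E @ Dinv_l @ F_l
        f_new = f - E @ Dinv_l @ f_l
        E_new = -E @ Dinv_l @ E_l if E_l is not None else np.zeros_like(D)
        F_new = np.zeros_like(D)
        if rank + 1 < size:
            E_r, Dinv_r, F_r, f_r = comm.recv(source=rank + 1, tag=0)
            D_new -= F @ Dinv_r @ E_r
            f_new -= F @ Dinv_r @ f_r
            if F_r is not None:
                F_new = -F @ Dinv_r @ F_r
        print(f"rank {rank}: blocks for the next level updated", flush=True)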

5.4 Parallel complexity estimates

The regularity of ACR facilitates the estimation of its parallel complexity and the assessment of its large-scale scalability. Consider the computing node with the longest task dependency, executing log n steps. In the log(n/p) steps preceding the C-level, this node processes n/(2p) + n/(4p) + · · · + 1 block rows in sequence. Beyond the C-level, it processes a single block row in every one of the sequential log p steps. This results in an asymptotic parallel time complexity for ACR of O(k n^2 log n (log n + k^2)(n/p + log p)). The sequential computational time is reduced by the number of parallel compute nodes p, but at the expense of an additional log p factor that inhibits ideal scaling. Fortunately, the remaining work beyond the C-level is small and grows only as n^2 = N^{2/3}; the bulk of the computation happens at the early stages, where most of the concurrency is readily available.
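The estimate above can be assembled from the block-row counts as follows; a brief sketch, writing c(n) for the cost of processing one plane, an O(k n^2 log n (log n + k^2)) quantity under the rank assumptions already stated:

T_p \;\approx\; c(n)\left(\underbrace{\frac{n}{2p} + \frac{n}{4p} + \cdots + 1}_{\text{before the C-level},\ \approx\, n/p} \;+\; \underbrace{\log p}_{\text{beyond the C-level}}\right)
\;=\; O\!\left(k\,n^{2}\log n\,(\log n + k^{2})\left(\frac{n}{p} + \log p\right)\right).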

5.5 Parallel scalability

To provide a baseline of scalability, consider the constant-coefficient Poisson equation with homogeneous Dirichlet boundary conditions in the unit cube

-\nabla^{2} u = 1, \quad x \in \Omega = [0, 1]^{3}, \qquad u(x) = 0, \quad x \in \Gamma, \qquad (5.4)

discretized with the 7-point finite-difference star stencil. The resulting matrix after discretization is symmetric positive definite. Other methods such as multigrid, FMM, or FFTs are ordinarily used to solve this problem. However, this experiment is still worth considering to facilitate the exposition of ACR and to report on a standard problem that can easily be compared to the performance of other solvers. As stated in the introductory chapters of this work, ACR is best suited for the variable-coefficient case, as we show in the next chapters. Furthermore, the discretization of the Poisson equation has all positive eigenvalues with rapid decay in the off-diagonals, making this experiment an ideal case for the analysis of hierarchically low-rank approximations, as the theoretical rank structure for this problem has been reported in the literature.

5.5.1 Weak scaling

Figures 5.5a and 5.5b depict the results of weak scaling experiments for ACR, fixing different numbers of degrees of freedom per processor, along with ideal weak scaling reference lines depicted as dashed curves, considering that the estimates of elimination and solve are of O(k^2 N log^2 N) operations.

(a) Weak scaling of factorization. (b) Weak scaling of solve.
Figure 5.5: ACR weak scalability for the solution of the constant-coefficient Poisson equation.

The elimination stage follows the ideal trend line as we increase the number of degrees of freedom per processor, whereas the solve phase deviates from ideal scaling due to the inherent load imbalance of the recursive bisection of cyclic reduction and to communication latency, which is more noticeable given the lower arithmetic intensity of this stage. However, the absolute time of the solve stage is at worst 2.5 seconds to directly solve a problem of one-eighth of a billion unknowns, at N = 512^3.

5.5.2 Strong scaling

Figures 5.6a and 5.6b show the total time in seconds for the factorization and solve phases of ACR in a strong scaling experiment; dashed lines indicate ideal scaling, in which the time shrinks by a factor of two as we double the number of processors.

(a) Strong scaling of factorization. (b) Strong scaling of solve.
Figure 5.6: ACR strong scalability for the solution of the constant-coefficient Poisson equation.

The most time-consuming phase of ACR benefits as the number of processors increases for a variety of problem sizes. Nonetheless, as in the weak scaling study, the ideal scaling of the solve stage deteriorates at large processor counts, as factors such as hardware latency play a significant role in this computationally lightweight kernel based on H-matrix-vector multiplications.

5.5.3 Effectiveness of the choice of H-matrix admissibility

The maximum rank found across the entire ACR elimination varies from 5 to 10 across all problem sizes considered in this scaling study. Considering that the largest problem size of N = 512^3 involves H-matrices of size 262144 × 262144, the maximum rank is remarkably low. The rank behavior matches the theoretical rank structure of this problem, according to the estimates of k ≈ O(1) reported in [23]. Figure 5.7 depicts the structure of the H-matrices used to represent each plane, with the choice of the standard admissibility condition. Dark blue blocks denote a low ratio between the numerical rank of the approximation and the full rank of the block, whereas red blocks indicate non-admissible blocks stored in dense format. For visualization purposes, the figure was taken from the N = 32^3 problem, and represents the last diagonal block during the elimination phase of ACR. The prevalence of dark blue blocks indicates good relative compression of each block, since the ratio of the numerical rank of the approximation to the actual block size is very small. Most of the red blocks are clustered near the diagonal, where the smallest blocks reside.

Figure 5.7: Choice of H-matrix structure to represent planes. Blue indicates low-rank blocks, whereas red indicates dense blocks.

5.5.4 Memory footprint

One of the highlights of this chapter is the achieved memory footprint of ACR, as it is well acknowledged that the hard limit of direct solvers is their high memory consumption. Figure 5.8 depicts the memory requirements to store the ACR factorization across different problem sizes ranging from 64^3 to 512^3, together with a dashed line that denotes the expected asymptotic memory usage of O(N log N).

Figure 5.8: Memory footprint of ACR.

Chapter 6

ACR as a fast direct solver

This chapter documents the performance and memory footprint of ACR as a direct solver for 2D and 3D elliptic PDEs with a variety of coefficient structures. The aim is to experiment broadly on the class of problems to which ACR is applicable, to compare with other solvers in their respective shared or distributed memory environments, and to highlight the niche for which ACR is best suited.

6.1 Numerical results in 2D and benchmark with other solvers

This section studies the performance of ACR as a 2D direct solver for a variety of elliptic partial differential equations. It first describes the hardware utilized in these experiments, the solvers used for comparison, the error tolerance definition, and the properties of the discretization used for each problem.

6.1.1 Environment settings

To provide a baseline of performance and memory requirements, the following numerical results are compared with the performance and memory requirements of the following solvers. The first is the H-LU factorization, through the commercial implementation in the HLibPro v2.2 library [92, 97]; HLibPro is one of the leading hierarchical matrix libraries, has active industry users, and is highly regarded due to its high performance in shared memory environments. The second is algebraic multigrid (AMG), via the implementation in the high performance preconditioners (hypre) library [77], accessed through PETSc [98, 99]. The error tolerance is measured in the 2-norm of the algebraic error. All solvers are tuned so that the overall solution asymptotically follows the truncation error of the discretization, as expected of a direct solver; the specific tuning parameters for each solver can be found in Appendix B. In all cases, the reported timings refer to the solution per right-hand side, in seconds. The memory requirement for ACR is computed as the sum over all blocks employed during elimination, and for H-LU and AMG the reported memory usage is taken directly from the standard output of each experiment.

6.1.2 Constant-coefficient Poisson equation

The first problem is the constant-coefficient Poisson equation discretized with the five-point stencil finite difference scheme:

-\nabla^{2} u = f(x), \quad x \in \Omega = [0, 1]^{2}, \qquad u(x) = 0, \quad x \in \Gamma,
\qquad f(x) = 100\, e^{-100((x-0.5)^{2} + (y-0.5)^{2})}. \qquad (6.1)

Figure 6.1 shows that when cyclic reduction (CR) does not exploit the repeated block structure, it is not feasible for large problems due to its O(N^2) complexity, although for small problems it is convenient to use. It can also be seen that both H-LU and ACR are methods of O(N log^2 N) time complexity, with a complexity constant that benefits ACR in terms of performance. However, and as noted before, AMG is the method of choice for this smooth problem and one right-hand side. ACR and H-LU require O(N log N) units of memory. We show the performance of the sequential algorithm in all cases, and for the parallel version we use 12 cores of an Intel Westmere processor.

(a) One core performance. (b) Twelve cores performance. (c) Memory consumption.
Figure 6.1: Execution times and memory consumption as a function of matrix dimension for the constant-coefficient Poisson problem.

6.1.3 Variable-coefficient Poisson equation

We now consider the problem where material properties κ(x) are heterogeneous:

-\nabla \cdot (\kappa(x)\, \nabla u) = f(x), \quad x \in \Omega = [0, 1]^{2}, \qquad u(x) = 0, \quad x \in \Gamma. \qquad (6.2)

A typical application of this equation is the study of heat flow in composite materials or Darcy flow in variable-permeability media. The material properties may change rapidly, such as the permeability in an oil field, or smoothly, as in alloy metals.

6.1.3.1 Smooth coefficient

We model the change of material properties with the smooth function κ(x); the difference between β^- and β^+ determines the coefficient contrast, while ε determines the degree of smoothness of the change of material properties:

f(x) = 100\, e^{-100((x-0.5)^{2} + (y-0.5)^{2})}, \qquad
\kappa(x, y) = \beta^{-} + (\beta^{+} + \beta^{-})\, \frac{1 + \tanh\!\left(\frac{x-0.5}{\epsilon}\right)}{2}. \qquad (6.3)

Further discussion of the relevance of the problem and its applicability can be found in [100]. The particular choices of coefficients for this experiment are β^- = 0.1, β^+ = 10, and ε = 0.5. In the largest experiment, N = 2048^2, H-LU failed to reach a direct solution within the specified tolerance. ACR demonstrates a lower complexity constant as compared to H-LU in terms of performance. For a single right-hand side, AMG remains the method of choice for this problem type. Figure 6.2 shows a summary of the performance and memory requirements for this experiment.
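For reference, a minimal NumPy sketch of the coefficient (6.3) and of a variable-coefficient 5-point stencil row is given below (illustrative only; the arithmetic averaging of κ at the cell faces is an assumption made here for the smooth case, since the text specifies the discretization details, harmonic averaging, only for the discontinuous-coefficient problem that follows):

    import numpy as np

    def kappa(x, y, beta_minus=0.1, beta_plus=10.0, eps=0.5):
        # Smooth coefficient of equation (6.3).
        return beta_minus + (beta_plus + beta_minus) * (1.0 + np.tanh((x - 0.5) / eps)) / 2.0

    def stencil_row(i, j, n):
        # Weights of the 5-point discretization of -div(kappa grad u) at interior
        # grid point (i, j) of an n x n grid, with kappa averaged at the cell faces.
        h = 1.0 / (n + 1)
        x, y = (i + 1) * h, (j + 1) * h
        kW = 0.5 * (kappa(x - h, y) + kappa(x, y))   # west face
        kE = 0.5 * (kappa(x + h, y) + kappa(x, y))   # east face
        kS = 0.5 * (kappa(x, y - h) + kappa(x, y))   # south face
        kN = 0.5 * (kappa(x, y + h) + kappa(x, y))   # north face
        center = (kW + kE + kS + kN) / h**2
        return center, -kW / h**2, -kE / h**2, -kS / h**2, -kN / h**2

    print(stencil_row(10, 10, n=64))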

(a) Twelve cores performance. (b) Memory consumption.
Figure 6.2: Execution times and memory consumption as a function of matrix dimension for the variable-coefficient Poisson problem with smooth coefficients.

6.1.3.2 High contrast discontinuous coefficient

Consider the case where the material properties change dramatically within the domain. For this test problem, we use a finite volume discretization with harmonic averaging of the coefficients at jumps [101]. The problem under consideration is discussed in [102], and it models the variable-coefficient Poisson equation with κ(x) defined as a piecewise function:

  −2 1 10 x 0.5, y 0.5  × ≤ ≤   1 x > 0.5, y 0.5 κ(x) = ≤  −4 1 10 x < 0.5, y > 0.5  ×   6 1 10 x 0.5, y > 0.5 × ≥ The performance of -LU versus ACR crosses halfway the experiments, favoring H ACR. AMG consistently achieved convergence within the specified tolerance, but with a slight performance penalty in the large-scale problems as compared to the smooth coefficient case. ACR remained consistent regarding performance and memory con- sumption as in the previous numerical experiments. Figure 6.3 shows the performance 80 and memory requirements for these tests.

(a) Twelve cores performance. (b) Memory consumption.
Figure 6.3: Execution times and memory consumption as a function of matrix dimension for the variable-coefficient Poisson problem with high contrast discontinuous coefficients.

6.1.3.3 Anisotropic coefficient

The anisotropic-coefficient problem models dramatic, directionally dependent changes in the conductivity. Here, ε models how significant the heat transfer is in the x direction versus the y direction.

\frac{1}{\epsilon^{2}}\, u_{xx} + u_{yy} = f(x), \quad x \in \Omega = [0, 1]^{2}, \qquad u(x) = 0, \quad x \in \Gamma,
\qquad f(x) = 100\, e^{-100((x-0.5)^{2} + (y-0.5)^{2})}. \qquad (6.4)

The selection of ε is modeled after the example in [103]. As in the other variations of the variable-coefficient Poisson equation, AMG remains the method of choice, as it can consistently solve these problems in optimal time and memory consumption. Nonetheless, factorization-based methods such as H-LU and ACR can also solve these problems with near-optimal complexity, and they do have an advantage when the factorization needs to be reused for multiple right-hand sides, because the cost of factorization gets amortized. The anisotropy parameter is set to ε = 10, which represents a difference of two orders of magnitude in the conductivity of x versus y. Figure 6.4 shows the performance and memory footprint of each solver.

(a) Twelve cores performance. (b) Memory consumption.
Figure 6.4: Execution times and memory consumption as a function of matrix dimension for the anisotropic-coefficient problem.

6.1.4 Helmholtz equation

There are two variants of the Helmholtz equation, differentiated by the eigenvalue structure of the resulting linear system after discretization. One variant is positive definite, and the other is indefinite. The nature of the solution of the positive-definite Helmholtz equation is similar to diffusion, while the indefinite variant, commonly denoted as the wave Helmholtz equation, is oscillatory in nature. This distinction translates into the significant practical difficulties that solving the indefinite Helmholtz problem poses to linear solvers.

6.1.4.1 Positive definite formulation

This Helmholtz variant reinforces the positive definiteness of the resulting linear system by featuring a wavenumber k term of opposite sign to the diffusive term, as exemplified in the following equation:

−∇²u + k²u = f(x),  x ∈ Ω = [0,1]²,  u(x) = 0,  x ∈ Γ,  f(x) = 100 e^(−100((x−0.5)² + (y−0.5)²)).   (6.5)

As depicted in Figure 6.5, the performance and memory requirements of each solver are comparable to those observed for a purely diffusive linear system, such as the ones featured in the previous section.

Figure 6.5: Execution times and memory consumption as a function of matrix dimension for the positive definite Helmholtz problem. (a) Twelve-core performance; (b) memory consumption.

6.1.4.2 Indefinite formulation

In the wave Helmholtz equation, the wavenumber term k²u enters with the same sign as the diffusive term:

∇²u + k²u = f(x),  x ∈ Ω = [0,1]²,  u(x) = 0,  x ∈ Γ,  f(x) = 100 e^(−100((x−0.5)² + (y−0.5)²)).   (6.6)

When the wavenumber k grows in proportion to the number of grid points n, this equation is regarded as the high-frequency Helmholtz equation; as noted before, it poses significant challenges for classical iterative methods [104]. To explore the parameter space, four cases are considered with respect to the number of grid points per wavelength and the resolution h:

- Helmholtz test 1. Fixing k while increasing the resolution. The wavenumber in this experiment is k = 20. In terms of problem difficulty, fixing k and increasing the resolution means that the problem gets easier, i.e., less indefinite, as the resolution increases. Figure 6.6 shows the performance and memory requirements for this experiment, validating that both factorizations remain robust for a fixed wavenumber as the resolution increases.

Figure 6.6: Execution times and memory consumption as a function of matrix dimension for the Helmholtz equation test 1: fixing k, while increasing the resolution. (a) Twelve-core performance; (b) memory consumption.

- Helmholtz test 2. Fixing the resolution h while decreasing the number of grid points per wavelength. As we increase k at a fixed resolution, the number of grid points per wavelength decreases; we ramp up the wavenumber until reaching the engineering-practice limit of 10 points per wavelength. Figure 6.7 shows that both solvers can efficiently solve the problem to the specified accuracy while preserving their performance and memory complexity estimates. ACR features a lower complexity constant in performance for this experiment.

Figure 6.7: Execution times and memory consumption as a function of the wavenumber k for the Helmholtz equation test 2: fixing the resolution h, while decreasing the number of grid points per wavelength. (a) Twelve-core performance; (b) memory consumption.

- Helmholtz test 3. Keeping a constant ratio between h and k while increasing the resolution. In this experiment, we fix the number of grid points per wavelength by changing h and k proportionally. Similar to the previous test, Figure 6.8 shows that both solvers hold their complexity estimates while solving a sequence of high-frequency Helmholtz problems.

- Helmholtz test 4. Low- to high-frequency Helmholtz regimes. The previous tests did not include results for AMG, as it is known to be at a disadvantage for the high-frequency Helmholtz equation. This behavior was also confirmed empirically: AMG did not reach convergence within a maximum of 1,000 iterations in these experiments. Nonetheless, Figure 6.9 depicts the behavior of the three solvers as the indefiniteness increases for h = 2⁻¹⁰, demonstrating the robustness of hierarchical-matrix-based factorizations in the high-frequency case.

Figure 6.8: Execution times and memory consumption as a function of matrix dimension for the Helmholtz equation test 3: keeping a constant ratio between h and k, while increasing the resolution. (a) Twelve-core performance; (b) memory consumption.

Figure 6.9: Solver performance while decreasing the number of points per wavelength. AMG fails to converge for large k.

6.1.5 Convection-diffusion equation

The convection-diffusion equation describes phenomena in which particles or physical quantities interact through the combination of two processes: diffusion and convection. Mathematically, it is modeled by the sum of a diffusive term and a first-order derivative term:

−∇²u + α∇u = f(x),  x ∈ Ω = [0,1]²,  u(x) = 0,  x ∈ Γ.   (6.7)

6.1.5.1 Proportional convection and diffusion

The case in which the above equation has a proportional, or mostly diffusive, character is typically well handled by traditional iterative solvers, as illustrated by Figure 6.10, which depicts performance and memory requirements very similar to those of purely diffusive problems. A well-known issue for the convection-diffusion equation arises when the convective term dominates the balance between convection and diffusion, as we show in the next section.

Figure 6.10: Execution times and memory consumption as a function of matrix dimension for the convection-diffusion problem. (a) Twelve-core performance; (b) memory consumption.

6.1.5.2 Convection dominance

A convection-dominated problem requires a discretization that takes the nature of the problem into account, such as the upwind discretization employed here:

−∇²u + α b(x)·∇u = f(x),  x ∈ Ω = [0,1]²,  u(x) = 0,  x ∈ Γ,  b(x) = ( sin(4πx) sin(4πy + π/2),  cos(4πx) cos(4πy + π/2) )ᵀ.   (6.8)
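For reference, a minimal sketch of the general first-order upwind treatment of the convective term α b(x)·∇u on a uniform grid is given below. It illustrates the technique only and is not the code used in the experiments; the grid layout and function name are assumptions.

/* First-order upwind approximation of alpha * b . grad(u) at interior point (i,j).
 * u is stored row-major on an n-by-n grid of spacing h; b1, b2 are the velocity
 * components of the flow (6.8) evaluated at that point. */
static double convective_upwind(const double *u, int n, int i, int j,
                                double h, double alpha, double b1, double b2)
{
    double ux = (b1 > 0.0) ? (u[i*n + j] - u[(i-1)*n + j]) / h   /* backward difference */
                           : (u[(i+1)*n + j] - u[i*n + j]) / h;  /* forward difference  */
    double uy = (b2 > 0.0) ? (u[i*n + j] - u[i*n + (j-1)]) / h
                           : (u[i*n + (j+1)] - u[i*n + j]) / h;
    return alpha * (b1 * ux + b2 * uy);
}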

Beyond the choice of discretization, the difficulty with this imbalance can also be identified by the fact that the problem leads to a nonsymmetric linear system. The lack of symmetry rules out a vast amount of work on solvers that assume symmetry. The following experiments progressively accentuate the convection dominance of the problem by increasing the magnitude of α, which, as a result, increases the skew-symmetry. Figure 6.11 demonstrates that for small α, AMG outperforms the factorizations, but as the convection dominance increases, H-LU and ACR maintain their robustness by consistently solving the problem with sustained performance, whereas AMG does not converge to a solution. The flow for this problem (6.8) is shown in Figure 6.12; this flow was proposed in [105].

Figure 6.11: Robustness of the factorization methods as the convection dominance increases. AMG fails to converge for large α.

Figure 6.12: 2D recirculating flow b(x).

6.2 Numerical results in 3D and benchmarking with other solvers

This section documents the distributed-memory performance of ACR for 3D problems and provides a baseline of memory footprint, ranks, and performance by comparing with other solvers, namely the multifrontal factorization with hierarchically semi-separable matrices (HSSMF) and algebraic multigrid.

6.2.1 Environment settings

This section documents the parallel performance and scalability of ACR in a distributed-memory environment. The source code is written in the C programming language and compiled with the Intel compiler v15. External libraries utilized in the reference implementation include HLIBpro v2.2 with Intel TBB [92, 97], and the sequential version of the Intel Math Kernel Library [106]. Experiments are conducted on the Cray XC40 Shaheen supercomputer at the King Abdullah University of Science & Technology. Each node has 128 GB of RAM and two Intel Haswell processors, each with 16 cores clocked at 2.3 GHz. To provide a baseline of performance, we consider the solution of the same linear systems with STRUMPACK [12] v1.0.3, the open-source implementation of the HSS-structured multifrontal solver (HSSMF) developed at the Lawrence Berkeley National Laboratory. The HSSMF method can solve a broader class of linear systems than ACR, but the comparison is still of interest, as STRUMPACK is among the few available implementations of distributed-memory fast direct solvers that exploit hierarchically low-rank approximations.

The tuning parameters of ACR include the choice of the leaf node size n_min for the H-matrices, the threshold parameter η used to decide whether a block is approximated with a low-rank factorization or kept as a dense, full-rank block, and the accuracy ε_H of the H-matrix construction and algebraic operations. The tuning parameters for STRUMPACK include how many matrices from the nested-dissection elimination tree are approximated as HSS, which is controlled by specifying the threshold size at which frontal matrices are represented as HSS matrices, the compression accuracy for the HSS matrices, and the minimum leaf size of the HSS frontal matrices. We recall here that the HSS matrix format uses the so-called weak admissibility condition, whereas ACR uses a standard admissibility condition, which does not limit the use of dense blocks exclusively to the matrix diagonal. Additionally, we also consider the algebraic multigrid (AMG) implementation of hypre [107, 77]. Comparison experiments are set to deliver a solution with a relative error tolerance of ||Ax − b||₂ / ||b||₂ ≈ 10⁻². It is standard practice that a low-accuracy factorization is then passed to an iterative refinement procedure or used as a preconditioner, as shown in Chapter 7. For further comparison, we also consider the multifrontal (MF) implementation of STRUMPACK, and our cyclic reduction (CR) implementation with dense matrix blocks.
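To make the two parameter sets concrete, the following sketch collects them in plain C structs. The struct and field names are illustrative only; they are not the actual interfaces of the companion code, HLIBpro, or STRUMPACK.

/* Illustrative grouping of the tuning parameters discussed above. */
typedef struct {
    int    leaf_size;     /* n_min: smallest dense block in the H-matrix tree   */
    double eta;           /* admissibility threshold: low-rank vs. dense blocks */
    double eps_h;         /* accuracy of H-matrix construction and arithmetic   */
} acr_params;

typedef struct {
    int    hss_front_threshold; /* smallest frontal matrix compressed as HSS */
    double compression_tol;     /* accuracy of the HSS compression           */
    int    hss_leaf_size;       /* minimum leaf size of the HSS fronts       */
} hssmf_params;

/* Example: the settings reported in Tables 6.1 and 6.2 for the N = 128^3 Poisson run. */
static const acr_params   acr_poisson_128   = { 32, 2.0, 1.0e-3 };
static const hssmf_params hssmf_poisson_128 = { 4096, 1.0e-3, 128 };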

6.2.2 Poisson equation

As a baseline, we consider the constant-coefficient Poisson equation with homogeneous Dirichlet boundary conditions in the unit cube, i.e.,

−∇²u = 1,  x ∈ Ω = [0,1]³,  u(x) = 0,  x ∈ Γ,   (6.9)

discretized with the 7-point finite-difference star stencil, which leads to a symmetric positive definite linear system. Figure 6.13 compares all solvers under consideration for a sequence of Poisson problems that increase from N = 32³ to N = 512³, with processor counts increased from 256 to 4,096. We document the execution parameters, obtained relative residual, and ranks of the ACR and HSSMF factorizations in Tables 6.1 and 6.2. We report factorization times in Figure 6.13a, showing that ACR can competitively tackle these problems. Similarly, the solve timings in Figure 6.13b show that ACR solves for a given right-hand side in times comparable to the other methods under consideration. Figure 6.13c documents the size of the factors required by the factorizations, and it shows that the cyclic reduction method (CR) cannot solve problems even as small as N = 128³ due to memory limitations. Also, the experiments confirm that the HSSMF method requires less memory to store its factors than the multifrontal method (MF). However, as Figure 6.13d shows, the HSSMF method requires higher ranks than ACR, which translates into a larger size of the factors and prohibited the execution of HSSMF for problems of N = 256³ and above. The experiments show that ACR requires only O(1) ranks, as opposed to the O(n) rank requirements of the HSSMF factorization. We measured the memory high-water mark with the library PAPI v5.5 [108], and our results show that the implementations of both ACR and HSSMF require only a small additional factor of scratch memory relative to the size of their factors (e.g., ∼2.4x for ACR and ∼3x for HSSMF on the N = 128³ problem). As expected for this particular problem, multigrid is the method of choice concerning performance and memory footprint for a single right-hand side. However, for multiple right-hand sides, the ability to reuse the factorization could give the advantage to factorization-based solvers. The factorization times for ACR and HSSMF are comparable, with the setup stage of HSSMF being faster for smaller problems; the smaller ranks required by ACR instead lead to a faster factorization step for large problem sizes and a faster time to solution. While ACR and HSSMF can deliver a more accurate solution as direct solvers (i.e., without iterative procedures), this comes at the expense of more time and memory; it is common practice that such a factorization is then used as a preconditioner or passed to an iterative refinement procedure.

Figure 6.13: Performance of the factorization and back-substitution phases of ACR for the Poisson problem. (a) Factorization time; (b) solve time; (c) memory usage; (d) largest rank in the factorization.

N      ε_H     η   Leaf size   Relative residual   Average rank   Largest rank
32³    8E-03   2   32          1.39E-02            3              4
64³    1E-03   2   32          3.20E-02            3              5
128³   1E-03   2   32          2.22E-02            4              7
256³   1E-03   2   32          8.75E-02            4              7
512³   1E-04   2   32          3.26E-02            5              11

Table 6.1: Execution parameters, obtained relative residual, and ranks of the ACR factorization for the Poisson experiments.

N      Compression tolerance   Relative residual   Leaf size   Minimum front size   Largest rank
32³    1E-02                   4.41E-02            128         256                  82
64³    1E-03                   2.65E-02            128         1,024                243
128³   1E-03                   8.40E-02            128         4,096                532

Table 6.2: Execution parameters, obtained relative residual, and ranks of the HSSMF factorization for the Poisson experiments.

Figure 6.14: Controllable-accuracy solution of ACR for an N = 256³ Poisson problem. (a) ACR factorization time; (b) ACR solve time; (c) ACR size of the factors; (d) rank requirements of the factorization.

Numerical experiments confirm that ACR can be used as a direct solver if we tune its parameters to a higher accuracy for the H-matrix representations and operations, as depicted in Figure 6.14, at the expense of modest rank increases, albeit with higher memory requirements and time to solution. However, as Table 6.3 shows, a low-accuracy factorization in combination with an iterative procedure is best to minimize the total time to solution.

ε_H     Factors (MB)   Max. rank   Factorization (s)   Apply (s)   Total (s)   Iterations
1E-01   22,328         31          26.56               0.058       28.01       25
1E-02   26,687         37          51.24               0.064       51.94       11
1E-03   32,212         53          89.32               0.104       89.73       4
1E-04   39,181         71          149.06              0.127       149.44      3

Table 6.3: Iterative solution of an N = 128³ Poisson problem with the conjugate gradient method and the ACR preconditioner. The relative residual of the solution is 1E-6 in all cases.

6.2.3 Convection-diffusion equation

We next consider a standard convection-diffusion problem:

−∇²u + α b(x)·∇u = f(x),  x ∈ Ω = [0,1]³,  with b(x) = (b₁, b₂, b₃)ᵀ,
b₁ = sin(a2πx) sin(a2π(1/8 + y)) + sin(a2π(1/8 + z)) sin(a2πx),
b₂ = cos(a2πx) cos(a2π(1/8 + y)) + cos(a2π(1/8 + y)) cos(a2πz),   (6.10)
b₃ = cos(a2πx) cos(a2π(1/8 + z)) + sin(a2π(1/8 + y)) sin(a2πz),

discretized with a 7-point upwind finite difference scheme, which leads to a nonsymmetric linear system. The flow b(x) is a three-dimensional generalization of the two-dimensional vortex flow proposed by Wessel et al. [105]. We adjust the forcing term and boundary conditions to meet the exact solution

u(x) = sin(πx) + sin(πy) + sin(πz) + sin(3πx) + sin(3πy) + sin(3πz),

as proposed by Gupta and Zhang [109], since it is an archetypal challenging problem for multigrid methods. To demonstrate the robustness of ACR and HSSMF for this problem, we fix the number of degrees of freedom at N = 128³ and increase the dominance of the convective term; results are reported in Figure 6.15. Consistent with the Poisson equation results, multigrid remains the method of choice for diffusion-dominated problems; however, as the convection increases, i.e., as α increases, the performance of AMG deteriorates. On the other hand, both ACR and HSSMF prove able to solve convection-dominated problems, with ACR being consistently faster than HSSMF, particularly in the solve phase; see Figures 6.15a and 6.15b. The memory footprints of the factors generated by ACR and HSSMF are comparable, see Figure 6.15c; however, ACR achieves a solution of the same accuracy with substantially smaller numerical ranks, as depicted in Figure 6.15d.
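As a hedged illustration of how the forcing term can be adjusted to the manufactured solution above, the right-hand side f = −∇²u + α b·∇u follows by differentiating u analytically. The helper names below are hypothetical; this is not the dissertation's code.

#include <math.h>

/* Manufactured solution; Dirichlet boundary values are taken from it on the boundary. */
static double u_exact(double x, double y, double z)
{
    return sin(M_PI*x) + sin(M_PI*y) + sin(M_PI*z)
         + sin(3*M_PI*x) + sin(3*M_PI*y) + sin(3*M_PI*z);
}

/* Forcing f = -lap(u) + alpha * (b . grad(u)) for a given velocity b = (b1,b2,b3). */
static double forcing(double x, double y, double z,
                      double alpha, double b1, double b2, double b3)
{
    double p2  = M_PI*M_PI;
    double lap = -p2*(sin(M_PI*x) + sin(M_PI*y) + sin(M_PI*z))
               - 9.0*p2*(sin(3*M_PI*x) + sin(3*M_PI*y) + sin(3*M_PI*z));
    double ux = M_PI*cos(M_PI*x) + 3*M_PI*cos(3*M_PI*x);
    double uy = M_PI*cos(M_PI*y) + 3*M_PI*cos(3*M_PI*y);
    double uz = M_PI*cos(M_PI*z) + 3*M_PI*cos(3*M_PI*z);
    return -lap + alpha*(b1*ux + b2*uy + b3*uz);
}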

Figure 6.15: Robustness of ACR and HSSMF for the convection-diffusion problem. In convection-dominated problems, AMG fails to converge while the direct solvers maintain a steady performance. (a) Factorization time for ACR and HSSMF, and setup time for AMG; (b) solve time for a single right-hand side with each solver; (c) memory footprint of the factorizations of ACR and HSSMF and of the AMG setup; (d) largest rank in the factorization of ACR and HSSMF.

6.2.4 Wave Helmholtz equation

Finally, consider the indefinite Helmholtz equation with Dirichlet boundary conditions on the unit cube:

−(∇²u + κ²u) = 1,  Ω = [0,1]³,   (6.11)

discretized with the 27-point trilinear finite element scheme on hexahedra. Results for ACR and HSSMF are reported in Figure 6.16. The parameter κ is chosen to obtain a sampling rate of approximately 12 points per wavelength, specifically κ = {16, 32, 64} for the three problem sizes respectively, corresponding to approximately 10 × 10 × 10 wavelengths for the N = 128³ problem. As opposed to the positive definite Helmholtz equation, which models phenomena similar to diffusion, the indefinite variant, commonly denoted the wave Helmholtz equation, has a solution that is oscillatory in nature. Multigrid methods are known to diverge without specific customizations for high-frequency Helmholtz problems, which we also confirmed via experimentation. For a detailed examination of the difficulties of solving the Helmholtz equation with classical iterative methods we refer the reader to [104]. We document the execution parameters, obtained relative residual, and ranks of the ACR and HSSMF factorizations in Tables 6.4 and 6.5. Numerical experiments show that ACR features consistently lower factorization and solve times than HSSMF, as can be seen in Figures 6.16a and 6.16b. The sizes of the factors of ACR and HSSMF are comparable, with slightly higher memory requirements for ACR due to performance-oriented tuning, see Figure 6.16c. Furthermore, as also shown in Section 6.2.2, HSSMF required less memory than MF, and CR quickly runs out of memory for problems larger than N = 64³. Finally, the largest rank of ACR is consistently lower than that of HSSMF, even though both solvers require O(n) ranks, as shown in Figure 6.16d. Nevertheless, the lower ranks lead to a faster time to solution in favor of ACR.
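For reference (simple arithmetic, not quoted from the dissertation): the wavelength is 2π/κ, so the unit cube contains κ/(2π) wavelengths per dimension, and a mesh with n points per dimension provides roughly 2πn/κ points per wavelength. For κ = 64 and n = 128 this gives approximately 10.2 wavelengths per dimension and 12.6 points per wavelength, consistent with the sampling rates quoted above.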

N      ε_H     η    Leaf size   Relative residual   Average rank   Largest rank
32³    5E-03   4    32          1.67E-02            5              8
64³    5E-08   8    32          2.63E-02            30             56
128³   5E-13   16   32          1.07E-02            113            260

Table 6.4: Execution parameters, obtained relative residual, and ranks of the ACR factorization for the Helmholtz experiments.

N      Compression tolerance   Relative residual   Leaf size   Minimum front size   Largest rank
32³    5E-03                   5.32E-02            128         256                  105
64³    1E-04                   6.08E-02            128         1,024                641
128³   1E-06                   1.13E-02            128         4,096                1,659

Table 6.5: Execution parameters, obtained relative residual, and ranks of the HSSMF factorization for the Helmholtz experiments.

Figure 6.16: Solution of increasingly larger indefinite Helmholtz problems consistently discretized with 12 points per wavelength. (a) Factorization time; (b) solve time; (c) memory usage; (d) largest rank in the factorization.

Chapter 7

ACR as a preconditioner for sparse iterative solvers

At the other end of the spectrum from direct solvers are iterative solvers, whose convergence is problem dependent. However, when combined with a preconditioner, it is possible to increase their robustness and even to accelerate their convergence; the combination of a Krylov solver with an efficient preconditioner is a proven way to solve large-scale problems. The development of such a preconditioner is the central idea of this chapter: a low-accuracy ACR factorization that accelerates the convergence of a Krylov method for large-scale 3D problems with challenging coefficient structure.
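As a minimal sketch of how such a combination can be organized (illustrative only: the matrix-free matvec and the preconditioner application are hypothetical placeholders, not the companion implementation's interfaces), a preconditioned conjugate gradient loop applies the low-accuracy ACR factorization once per iteration; nonsymmetric problems would use GMRES instead.

#include <stdlib.h>
#include <math.h>

/* Hypothetical interfaces: y = A*x for the sparse operator, and z = M^{-1} r,
 * where M^{-1} is the application of the low-accuracy ACR factorization. */
typedef void (*matvec_fn)(const double *x, double *y, int n);
typedef void (*precond_fn)(const double *r, double *z, int n);

static double dot(const double *a, const double *b, int n)
{ double s = 0.0; for (int i = 0; i < n; ++i) s += a[i]*b[i]; return s; }

/* Preconditioned CG: solve A x = b to a relative residual of rtol. Returns iterations used. */
static int pcg(matvec_fn A, precond_fn M, const double *b, double *x,
               int n, double rtol, int maxit)
{
    double *r = malloc(n*sizeof *r), *z = malloc(n*sizeof *z);
    double *p = malloc(n*sizeof *p), *q = malloc(n*sizeof *q);
    double nb = sqrt(dot(b, b, n)), rz;
    int k;
    A(x, q, n);                                   /* initial residual r = b - A x  */
    for (int i = 0; i < n; ++i) r[i] = b[i] - q[i];
    M(r, z, n);                                   /* preconditioned residual       */
    for (int i = 0; i < n; ++i) p[i] = z[i];
    rz = dot(r, z, n);
    for (k = 0; k < maxit && sqrt(dot(r, r, n)) > rtol*nb; ++k) {
        A(p, q, n);
        double alpha = rz / dot(p, q, n);
        for (int i = 0; i < n; ++i) { x[i] += alpha*p[i]; r[i] -= alpha*q[i]; }
        M(r, z, n);                               /* one preconditioner apply per iteration */
        double rz_new = dot(r, z, n);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta*p[i];
    }
    free(r); free(z); free(p); free(q);
    return k;
}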

7.1 Environment settings

Experiments are conducted on the Cray XC40 Shaheen supercomputer at the King Abdullah University of Science & Technology. Each node has 128 GB of RAM and two Intel Haswell processors, each with 16 cores clocked at 2.3 GHz. The convergence criterion requires the relative 2-norm of the residual to be less than or equal to 10⁻⁸. For problems with positive definite matrices we used the conjugate gradient method [110]; otherwise we used the generalized minimal residual method [111]. In either case, a low-accuracy ACR factorization was used as the preconditioner.

7.2 Variable-coefficient Poisson equation

The solution of variable-coefficient PDEs is an essential engineering problem, as the coefficient structure typically corresponds to material properties of the problem under consideration. Heterogeneity in the media may come from variations in position, time, or another dependent variable. From a numerical analysis standpoint, it is well acknowledged that, for a method to remain robust under high-contrast coefficients, specific customizations to the discretization or to the solver itself are necessary. This section documents the behavior of the ACR preconditioner for an increasingly challenging coefficient structure with up to six orders of magnitude of contrast. The problem under consideration in this section is the symmetric positive definite discretization of the 3D variable-coefficient Poisson equation with Dirichlet boundary conditions; in particular, the second-order accurate 7-point finite-difference star stencil with harmonic averaging of the coefficient κ(x) [101]:

−∇·(κ(x)∇u) = 1,  x ∈ Ω = [0,1]³,  u(x) = 0,  x ∈ Γ.   (7.1)
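To make this discretization concrete, the following sketch assembles one interior row of the 7-point stencil with harmonic averaging of κ at the cell faces. It is illustrative only; the indexing convention and the coefficient-sampling callback are assumptions, not the companion code.

/* Harmonic mean of the coefficient at the face between two neighboring cells. */
static double harm(double ka, double kb) { return 2.0*ka*kb/(ka + kb); }

/* Fill the 7-point stencil row for interior cell (i,j,k) of an n^3 grid of
 * spacing h, for -div(kappa grad u) = 1.  kap(i,j,k) samples the coefficient.
 * Output: off-diagonal weights w[0..5] (W,E,S,N,B,T), diagonal d, right-hand side f. */
static void stencil_row(double (*kap)(int, int, int), int i, int j, int k,
                        double h, double w[6], double *d, double *f)
{
    double kc = kap(i, j, k);
    w[0] = -harm(kc, kap(i-1, j, k)) / (h*h);   /* west   */
    w[1] = -harm(kc, kap(i+1, j, k)) / (h*h);   /* east   */
    w[2] = -harm(kc, kap(i, j-1, k)) / (h*h);   /* south  */
    w[3] = -harm(kc, kap(i, j+1, k)) / (h*h);   /* north  */
    w[4] = -harm(kc, kap(i, j, k-1)) / (h*h);   /* bottom */
    w[5] = -harm(kc, kap(i, j, k+1)) / (h*h);   /* top    */
    *d = -(w[0] + w[1] + w[2] + w[3] + w[4] + w[5]);
    *f = 1.0;                                   /* right-hand side of (7.1) */
}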

7.2.1 Generation of random permeability fields

The generation of a random permeability field κ(x) that closely represents a porous medium for the modeling of water or oil flow is a well-defined task in its own right. The experiments in this section are based on the parallel framework for the multilevel Monte Carlo approach (MLMC) described in [112], via the Distributed and Unified Numerics Environment DUNE [113]. The random permeability fields are defined with a covariance function of the form:

C(h) = σ² exp(−||h||₂ / λ),  h ∈ [0,1]³.   (7.2)

Gaussian random fields are set to a correlation length λ = 3h, where h = 1/(n−1) and N = n³. The variance σ is set to deliver a particular contrast in the coefficient, measured in orders of magnitude. Figure 7.1 depicts four random-field realizations at different numbers of degrees of freedom and contrasts of the coefficient.
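As a direct reading of (7.2), with the correlation length tied to the mesh as described, the covariance kernel can be sketched as follows. This is an illustrative helper only, not part of the DUNE/MLMC framework used to generate the fields.

#include <math.h>

/* Exponential covariance C(h) = sigma^2 * exp(-||h||_2 / lambda) of (7.2),
 * with correlation length lambda = 3h for mesh spacing h = 1/(n-1). */
static double cov_exponential(const double d[3], double sigma, int n)
{
    double lambda = 3.0 / (double)(n - 1);
    double r = sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
    return sigma*sigma * exp(-r / lambda);
}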

Figure 7.1: Different realizations of random permeability fields κ(x) at different resolutions and contrasts of the coefficient. Images depict the middle slice of each 3D permeability field. (a) N = 32³, one order of magnitude of contrast; (b) N = 64³, two orders; (c) N = 128³, four orders; (d) N = 256³, six orders.

7.2.2 Tuning parameters

The main parameter that controls the accuracy of the ACR preconditioner is ε_H. As discussed in Chapter 2, ε_H controls the accuracy of the H-matrix approximations and their arithmetic operations. This global threshold, in turn, controls the relative accuracy of the solution for a given right-hand side.

It is expected that as we adjust ε_H, we can control the number of iterations required to reach convergence with a Krylov method. One could set ε_H to the sought-after accuracy of the solution and not require any iterations at all. However, the performance and memory requirements, although asymptotically optimal, become impractical at high accuracy for 3D problems. For the ACR preconditioner, the sweet spot for achieving the fastest time to solution is not the one corresponding to the fewest iterations; it is generally the case that the inexpensive ACR preconditioners provide the fastest time to solution. There is, of course, a trade-off between the accuracy of the preconditioner and the number of Krylov iterations, as our numerical experiments document below.

The effect of the preconditioner accuracy ε_H on the required number of iterations for an N = 128³ problem with a coefficient contrast of four orders of magnitude can be seen in Figure 7.2a. The largest ε_H requires the most iterations, while the smallest ε_H requires the fewest. As can be seen from Figure 7.2b, even though setting a large ε_H required the most CG iterations, that is the recommended value of ε_H for absolute performance. Although there are more iterations than with a smaller ε_H, the application of the preconditioner is fastest at large ε_H since the ranks are the smallest, see Figure 7.3a. Figure 7.3b depicts how ε_H directly determines the memory footprint of the preconditioner, and shows why it is desirable to set ε_H as large as possible to also optimize for memory requirements.

Figure 7.2: Number of iterations and preconditioner accuracy for the variable-coefficient Poisson equation with N = 128³ degrees of freedom and coefficient contrast of four orders of magnitude. (a) Number of CG iterations as a function of the preconditioner accuracy ε_H; the preconditioner with the smallest ε_H requires the fewest iterations. (b) Time requirements while refining the preconditioner accuracy ε_H; the preconditioner with the largest ε_H delivers the best time to solution.

Figure 7.3: Effect of the preconditioner accuracy ε_H for the variable-coefficient Poisson equation with N = 128³ degrees of freedom and coefficient contrast of four orders of magnitude. (a) Largest rank in the factorization at different preconditioner accuracies ε_H; the preconditioner with the largest ε_H requires the smallest numerical rank. (b) Memory requirements at different preconditioner accuracies ε_H; the preconditioner with the largest ε_H requires the least memory.

7.2.3 Sensitivity with respect to high contrast coefficient

As the problem difficulty increases, i.e., as the contrast of the coefficient sharpens, there are cases for which the most economical preconditioner (e.g., ε_H = 1e-1) might not reach convergence within a small number of iterations; see, for instance, Table 7.1.

N      ε_H    CG iterations
32³    1e-1   27
64³    1e-1   51
128³   1e-1   95
256³   1e-1   100+
256³   1e-2   73

Table 7.1: Number of iterations required by CG for the variable-coefficient Poisson equation with coefficient contrast of six orders of magnitude. The most economical preconditioner for the hardest problem did not reach convergence within 100 iterations, thus requiring a more accurate version of the preconditioner to reach convergence.

Therefore, to reach convergence, a more accurate preconditioner is necessary. Figure 7.4 shows the number of iterations required to achieve convergence for a preconditioner with accuracy ε_H = 1e-2 at increasing problem size and contrast of the coefficient. For comparison, the baseline case (zero contrast) depicts a constant-coefficient Poisson equation with κ(x) = 1.

Figure 7.4: Required number of iterations for an ACR preconditioner accuracy of ε_H = 1e-2 as the contrast of the coefficient increases, for N = 32³ up to N = 256³. A larger number of iterations is necessary as the contrast of the coefficient increases.

7.2.4 Operation count and memory footprint

The complexity estimates for the number of operations in the setup and apply phases of the preconditioner are bounded by O(k²N log²N) and O(kN log N), respectively, while its memory footprint is bounded by O(kN log N), where N is the number of degrees of freedom and k is the numerical rank of the approximation. To demonstrate that these estimates hold for a variable-coefficient problem, Figure 7.5 shows the behavior of the preconditioner in terms of operation count and memory footprint as we increase the number of degrees of freedom N for the variable-coefficient Poisson equation with a coefficient contrast of four orders of magnitude. The vertical axis of Figure 7.5a, normalized by the number of compute nodes used in each case as documented in Table 7.2, reports the measured performance of the setup and apply phases of the preconditioner and compares it with the theoretical complexity. Figure 7.5b reports the total memory requirements as the problem size increases and also compares them with the theoretical complexity, demonstrating fair agreement.

N      Nodes   Planes per node
32³    8       4
64³    16      4
128³   32      4
256³   64      4

Table 7.2: Hardware configuration for distributed-memory experiments. Each compute node has 32 cores and hosts four block rows of the original matrix.

Figure 7.5: Measured performance and memory footprint for the solution of an increasingly larger variable-coefficient Poisson equation with a random field of four orders of magnitude of contrast in the coefficient. The preconditioner accuracy for these experiments is set to ε_H = 1e-1. (a) Comparison of the preconditioner setup and apply with their respective theoretical estimates (N log²N and N log N); (b) comparison of the preconditioner memory footprint with its theoretical estimate (N log N).

7.3 Convection-diffusion equation with recirculating flow

In this section we show the effectiveness of the ACR preconditioner on the convection-diffusion equation discretized with a 7-point upwind finite difference scheme, which leads to a nonsymmetric linear system:

−∇·(κ(x)∇u) + α b(x)·∇u = f(x),  x ∈ Ω = [0,1]³,  with b(x) = (b₁, b₂, b₃)ᵀ,
b₁ = sin(a2πx) sin(a2π(1/8 + y)) + sin(a2π(1/8 + z)) sin(a2πx),
b₂ = cos(a2πx) cos(a2π(1/8 + y)) + cos(a2π(1/8 + y)) cos(a2πz),   (7.3)
b₃ = cos(a2πx) cos(a2π(1/8 + z)) + sin(a2π(1/8 + y)) sin(a2πz),

∂b₁/∂x + ∂b₂/∂y + ∂b₃/∂z = 0,  i.e., the flow is divergence-free.

In (7.3), b(x) models a variable, recirculating flow with vanishing normal velocities at the boundary. Furthermore, when the convective term dominates (α > 1), this equation is known to be challenging for classical iterative solvers.
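The divergence-free property stated with (7.3) can be checked numerically. The sketch below is illustrative only (function names are assumptions); it evaluates the components of b(x) and approximates ∇·b by central differences at a point, which should return a value near zero.

#include <math.h>

/* Components of the 3D recirculating flow b(x) of (7.3); a is the number of
 * vortices per dimension. */
static double b1(double a, double x, double y, double z)
{ return sin(a*2*M_PI*x)*sin(a*2*M_PI*(0.125+y)) + sin(a*2*M_PI*(0.125+z))*sin(a*2*M_PI*x); }
static double b2(double a, double x, double y, double z)
{ return cos(a*2*M_PI*x)*cos(a*2*M_PI*(0.125+y)) + cos(a*2*M_PI*(0.125+y))*cos(a*2*M_PI*z); }
static double b3(double a, double x, double y, double z)
{ return cos(a*2*M_PI*x)*cos(a*2*M_PI*(0.125+z)) + sin(a*2*M_PI*(0.125+y))*sin(a*2*M_PI*z); }

/* Central-difference approximation of div b at (x,y,z); should be close to zero. */
static double div_b(double a, double x, double y, double z, double eps)
{
    return (b1(a, x+eps, y, z) - b1(a, x-eps, y, z)) / (2*eps)
         + (b2(a, x, y+eps, z) - b2(a, x, y-eps, z)) / (2*eps)
         + (b3(a, x, y, z+eps) - b3(a, x, y, z-eps)) / (2*eps);
}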

7.3.1 Tuning parameters

In a regime of convection dominance, Figure 7.6a shows how the ACR preconditioner can control the number of GMRES iterations by tuning the preconditioner accuracy ε_H. Regarding absolute time, the preconditioner with the largest ε_H led to the best time to solution, albeit with the most iterations, as Figure 7.6b shows. This preconditioner configuration featured the lowest numerical rank, as shown in Figure 7.7a, enabling a fast apply at each iteration. Furthermore, the fastest preconditioner also had the smallest memory requirements, as shown in Figure 7.7b.

Figure 7.6: Convection-diffusion problem with recirculating flow with eight vortices, α = 8, discretized with N = 128³ degrees of freedom. (a) Number of GMRES iterations as a function of the preconditioner accuracy ε_H; as ε_H decreases, the preconditioner requires fewer iterations. (b) Time requirements while refining the preconditioner accuracy ε_H; the largest ε_H delivers the best time to solution.

Figure 7.7: Effect of the preconditioner accuracy ε_H for a convection-diffusion problem with recirculating flow with eight vortices, α = 8, discretized with N = 128³ degrees of freedom. (a) Largest rank in the factorization at different ε_H; the preconditioner with the largest ε_H features the lowest numerical ranks. (b) Memory requirements while refining ε_H; the largest ε_H delivers the preconditioner with the smallest memory footprint.

7.3.2 Sensitivity with respect to vortex wavenumber

Consider an increasing number of vortices in the flow b(x), as Figure 7.8 shows. At the corners, and at the center of each vortex, there are saddle points which are known to be challenging for multigrid methods to resolve [109]. Figure 7.9 demonstrates that the ACR preconditioner remains robust as the number of vortices increases.

Figure 7.8: Increasing number of vortices per dimension in the flow b(x): (a) two vortices; (b) four vortices; (c) six vortices; (d) eight vortices.

Figure 7.9: Time distribution of the preconditioner (setup and apply) as the number of vortices per dimension in b(x) increases. Increasing the number of vortices had a minor effect on the effectiveness of the preconditioner.

7.3.3 Sensitivity with respect to Reynolds number

Consider a fixed-accuracy ACR preconditioner with ε_H = 1e-1 and an increasingly convection-dominated problem, obtained by gradually increasing α in (7.3). As expected, Figure 7.10a shows that a low-accuracy preconditioner requires more iterations as the convective term dominates. Furthermore, given that the accuracy of the preconditioner is fixed, there is a noticeable effect on the apply phase of the preconditioner, which is proportional to the number of iterations; this behavior can be seen in Figure 7.10b. Evidently, as shown in Section 7.3.1, it is possible to control, and decrease, the number of iterations by building a more accurate preconditioner. But the key point here is that the ACR preconditioner in combination with GMRES is demonstrated to be robust for convection-dominated problems.

Figure 7.10: Effect of a fixed preconditioner accuracy ε_H for the convection-diffusion equation with recirculating flow, discretized with N = 128³ degrees of freedom, as the convective term becomes more significant than the diffusive term. (a) Number of iterations as the convection term gains dominance; an increase in the dominance of the convection term requires more iterations. (b) Time requirements as the convection term gains dominance; the overall time to solution increases moderately.

7.3.4 Operation count and memory footprint

Figure 7.11 presents a comparison between the measured performance and memory requirements of the preconditioner and their corresponding theoretical complexity estimates for the convection-diffusion problem described in (7.3), as the problem size increases. The vertical axis of Figure 7.11a, normalized by the number of compute nodes used in each case as documented in Table 7.2, reports the measured performance of the setup and apply phases of the preconditioner, demonstrating fair agreement with the asymptotic complexity estimates for the large-scale experiments. Figure 7.11b reports the total memory requirements as the problem size increases and also compares them with the corresponding theoretical complexity, demonstrating fair agreement across all experiments.

Figure 7.11: Measured performance and memory footprint for the solution of the convection-diffusion equation with recirculating flow. (a) Comparison of the preconditioner setup and apply with their respective theoretical estimates (N log²N and N log N); (b) comparison of the preconditioner memory footprint with its theoretical estimate (N log N).

7.4 Indefinite Helmholtz equation in heterogeneous media

The numerical solution of the indefinite Helmholtz equation offers one of the greatest challenges for iterative and fast numerical solvers at large scale [104]. There is significant interest in the development of optimal methods, as several engineering applications use the Helmholtz equation to model time-harmonic propagation of acoustic waves. Inversion techniques based on full-waveform inversion (FWI), for instance, involve heterogeneous velocity models and the solution of multiple right-hand sides over a wide range of frequencies. Therefore, the introduction of an efficient forward solver directly contributes to expanding the limits of what can be modeled computationally. Consider the indefinite Helmholtz equation with a variable velocity field c(x), at frequency f, and Dirichlet boundary conditions in the unit cube:

−∇²u − ((2πf)² / c(x)²) u = f(x),  Ω = [0,1]³,
c(x) = 1.25 (1 − 0.4 e^(−32(|x−0.5|² + |y−0.5|²))),   (7.4)
u(x) = sin(πx) sin(πy) sin(πz),  x ∈ Γ.

The velocity field models a waveguide through the unit cube, as proposed in [95] and depicted in Figure 7.12. The forcing term f(x) is adjusted to satisfy the proposed exact solution u(x). The equation is discretized with the 27-point trilinear finite element scheme on hexahedra using the software library PetIGA [114]. Since the linear system arising from the discretization is indefinite in the high-frequency Helmholtz case, we use ACR to accelerate the convergence of GMRES.
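As a concrete reading of (7.4), the velocity field and a rough points-per-wavelength estimate can be sketched as follows. This is an illustrative helper only; the estimate assumes the minimum of the velocity field (shortest wavelength) and is not quoted from the dissertation.

#include <math.h>

/* Wave velocity field of (7.4): a waveguide with a slower channel through the
 * middle of the unit cube. */
static double wave_velocity(double x, double y)
{
    double r2 = (x-0.5)*(x-0.5) + (y-0.5)*(y-0.5);
    return 1.25 * (1.0 - 0.4*exp(-32.0*r2));
}

/* Rough points-per-wavelength estimate for frequency f on an n^3 grid, using
 * the slowest velocity c_min = 1.25*(1 - 0.4) = 0.75. */
static double points_per_wavelength(double f, int n)
{
    double c_min = 0.75;
    return (c_min / f) * (double)(n - 1);   /* wavelength c/f over spacing 1/(n-1) */
}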

Figure 7.12: Wave velocity field c(x). The image depicts the middle slice of the 3D wave velocity field.

7.4.1 Tuning parameters

We illustrate the effectiveness of the ACR preconditioner on a moderately high-frequency Helmholtz problem as described in (7.4), discretized with N = 128³ degrees of freedom and 12 points per wavelength. As Figure 7.13 shows, we can control the number of iterations that GMRES requires to reach convergence by adjusting the accuracy of the preconditioner ε_H. Notice that the preconditioner accuracy ε_H is smaller than what was chosen for diffusive problems. The need for higher relative accuracy is due to the fact that the Helmholtz equation, in the high-frequency regime, has off-diagonal block ranks that asymptotically grow with problem size (k ∼ O(n)). This theoretical estimate is reported in the literature [23]. Evidently, rank growth impacts hierarchical-matrix-based solvers. Nonetheless, the complexity estimates of the ACR preconditioner are still lower than those of traditional exact sparse factorizations, as demonstrated in Section 7.4.3.

Figure 7.13: Number of iterations as a function of the preconditioner accuracy ε_H (from 1e-6 to 1e-9). As ε_H decreases, the preconditioner requires fewer iterations.

Even though there is an increase in the setup time of the preconditioner as compared to diffusive problems, the preconditioner still features an economical solve stage, as depicted by the apply portions in Figure 7.14. As mentioned in the introduction of this section, inverse problems typically require the solution of a few thousand right-hand sides; thus, a fast solve stage is of paramount importance for this application.

Figure 7.14: Time requirements (setup and apply) while refining the preconditioner accuracy ε_H. The largest ε_H delivers the fastest time to solution.

The growth of the setup phase as the accuracy of the preconditioner is tightened is due to increased numerical ranks, as shown in Figure 7.15a. Rank growth has a direct impact on the memory footprint of the preconditioner, as shown in Figure 7.15b. Once more, the preconditioner with the largest ε_H, i.e., the lowest numerical rank, is the preconditioner of choice to optimize for both memory and performance.

Figure 7.15: Effect of the preconditioner accuracy ε_H for the high-frequency Helmholtz equation in a heterogeneous medium discretized with N = 128³ degrees of freedom and 12 points per wavelength. (a) Largest rank in the factorization; factorizations with smaller ranks lead to more iterations, but less time to solution and a smaller memory footprint. (b) Memory requirements while refining ε_H; the most economical preconditioner regarding memory footprint is delivered by the largest ε_H.

7.4.2 Low to high frequency Helmholtz regimes

Consider a sequence of Helmholtz problems, as described in (7.4), at increasing frequency. Evidently, if the frequency is set to f = 0 Hz, the zeroth-order term vanishes and we are left with a constant-coefficient Poisson problem. At the other end of the spectrum, a frequency of f = 8 Hz corresponds to a moderately high-frequency Helmholtz problem at 12 points per wavelength, as featured in the previous section. Table 7.3 shows the preconditioner accuracy chosen to require a maximum of 20 GMRES iterations to reach convergence.

f (Hz)   Points per wavelength   ε_H
0        -                       1e-1
2        48                      1e-3
4        24                      1e-4
8        12                      1e-6

Table 7.3: Tuning of the preconditioner to require at most 20 GMRES iterations for a sequence of Helmholtz problems at increasing frequencies. The problem with f = 0 represents a constant-coefficient Poisson problem, while f = 8 represents a high-frequency Helmholtz problem.

Figure 7.16a shows a clear increase in both the setup and apply phases of the preconditioner as a function of the frequency f. The growth in the setup is mainly due to the higher numerical ranks required to meet the upper limit of 20 iterations, as shown in Figure 7.16b. The increase in the apply phase is due both to an increase in ranks and to an increase in the indefiniteness of the problem at higher frequency, as is evident from the growth in the number of required iterations depicted in Figure 7.16c. Finally, as illustrated in Figure 7.16d, the memory footprint also increases with the frequency as a consequence of the higher ranks.

Figure 7.16: Preconditioner performance for the Helmholtz equation in a heterogeneous medium discretized with N = 128³ degrees of freedom at increasing frequencies. The problem with f = 0 Hz represents a constant-coefficient Poisson problem, while f = 8 Hz represents a high-frequency Helmholtz problem. (a) Time requirements as a function of frequency; the high-frequency regime (f = 8) requires the most time in both setup and apply phases. (b) Largest rank in the factorization, with the accuracy adjusted to require fewer than 20 GMRES iterations; the high-frequency case requires the largest numerical rank. (c) Number of iterations as a function of frequency; the high-frequency regime requires the most iterations. (d) Memory requirements as a function of frequency; the high-frequency regime exhibits the largest memory footprint.

7.4.3 Operation count and memory footprint

As the previous experiments show, the high-frequency Helmholtz regime is where the highest numerical ranks are required. Therefore, it is of interest to show how the computations behave asymptotically as the problem size increases, considering the estimate k ∼ O(n) [23]. Figure 7.17a compares the preconditioner setup with the O(n²N log²N) estimate and the preconditioner apply with the O(nN log N) estimate. Figure 7.17b shows the memory footprint of the preconditioner with respect to the O(nN log N) estimate. Table 7.4 shows fair agreement with the estimate of Chandrasekaran et al. on the growth of the largest rank of the factorization; however, the average rank of the low-rank blocks of the ACR preconditioner grows more slowly than k ∼ O(n), which is reflected in slightly lower-than-predicted memory consumption and setup time. The ACR preconditioner does not use the HSS format or a weak admissibility condition, which results in off-diagonal blocks with large rank, but rather a standard admissibility condition that allows a more refined structure of the H-matrix blocks, as discussed in Section 4.3.5 and shown in Figure A.1.

N      Largest rank   Average rank
32³    25             16
64³    59             32
128³   118            36

Table 7.4: Rank growth statistics for a sequence of high-frequency Helmholtz problems in a heterogeneous medium, discretized at 12 points per wavelength.

Figure 7.17: Measured performance and memory footprint for the solution of a sequence of high-frequency Helmholtz problems in a heterogeneous medium, discretized at 12 points per wavelength. On average, the rank of the low-rank blocks of the ACR preconditioner grows more slowly than O(n). (a) Comparison of the preconditioner setup and apply with their respective theoretical estimates (n²N log²N and nN log N); (b) comparison of the preconditioner memory footprint with its theoretical estimate (nN log N).

Chapter 8

Summary and future work

8.1 Concluding remarks

This dissertation presents an extension of the cyclic reduction method based on hierarchical matrices in a distributed-memory environment. The new method, called accelerated cyclic reduction (ACR), features improved asymptotic complexity, on the order of a log-linear number of operations and memory footprint. ACR is best suited for the solution of block tridiagonal linear systems that arise from the discretization of elliptic operators with variable coefficients. Moreover, it can be used as a fast direct solver or as a preconditioner for Krylov methods. The ACR elimination strategy is based on a red/black ordering of the unknowns. If a 3D grid is considered, the ordering divides the grid into planes, and these planes represent block rows of the original linear system. ACR approximates each matrix block with a hierarchical matrix in the H format, and its structure is defined using a binary spatial partitioning of the planar grid sections, employing a standard admissibility criterion that limits the rank of individual low-rank blocks. The Schur complement elimination is done via hierarchical matrix operations, resulting in an overall algorithm of O(k²N log²N) arithmetic complexity and O(kN log N) memory footprint.

The concurrency features of ACR are among its strengths. The regularity of the decomposition allows a predictable load balance. The parallel features are demonstrated via the companion implementation in a distributed-memory environment, with numerical experiments that study the strong and weak scalability of the method. Concurrency at the node level involves task-based parallelism of the hierarchical matrix arithmetic operations involved in the computation of the Schur complement and its evaluation.

Since ACR is a direct solver in the limit, it can tackle a broad class of problems that lack definiteness, such as the indefinite high-frequency Helmholtz equation, or lack symmetry, such as the convection-diffusion equation. For these problems, stock versions of algebraic multigrid fail to produce convergent schemes. The robustness of ACR in dealing with such problems was demonstrated over a range of problem sizes and parameters. While multigrid methods are superior for scalar problems possessing smoothness and definiteness, fast factorization methods such as ACR benefit when multiple right-hand sides are involved, as the time to solve per additional forcing term is orders of magnitude smaller than the factorization, which can be reused.

Since the accuracy of the H-matrix approximations and their arithmetic operations can be tuned, it was demonstrated that ACR can be used as a preconditioner for Krylov methods. Numerical experiments in challenging heterogeneous media identified the parameter settings that optimize performance and memory consumption. Furthermore, it was demonstrated that the theoretical rank requirements are met. A baseline of achieved performance, in both shared- and distributed-memory environments, is provided via a set of benchmarks against state-of-the-art software libraries such as hypre, HLIBpro, and STRUMPACK, for problems in two and three dimensions. The companion implementation of ACR demonstrated competitiveness in performance and memory usage.

As expected of all hierarchical low-rank approximation methods, the key to performance and memory economy is largely the ability to achieve an approximation with low rank, i.e., an efficient compression into a data-sparse format where k (the rank) is much less than n (the size of the block to be approximated). Numerical examples demonstrate that ACR requires consistently lower ranks than competing approaches, such as the HSS multifrontal approach. Support for tuning the multiple parameters is provided. Together, these advancements allowed the solution of large-scale 3D elliptic partial differential equations that feature symmetry and nonsymmetry, definiteness and indefiniteness, constant and heterogeneous coefficients, including the direct solution of a linear system with one-eighth of a billion unknowns using eight thousand cores of the XC40 Shaheen supercomputer in approximately three seconds.

8.2 Future work

The results of this work suggest a number of lines of future work, some of which are listed here.

Non-reflecting boundary conditions. Non-reflecting boundary conditions for the solution of the indefinite Helmholtz equation in the high-frequency regime, such as PML or ABC boundary conditions, are particularly beneficial for achieving data-sparse approximations with lower numerical ranks, as compared to Dirichlet boundary conditions [94]. The use of non-reflecting boundary conditions requires complex arithmetic and the inclusion of an outer set of grid points, which further stresses the problem size and memory requirements of sparse solvers.

Duct acoustics. Consider the modeling of duct acoustics, a problem that aircraft manufacturers face while trying to reduce engine noise in residential areas. The difference from the previously discussed acoustics problems, largely used in seismic inversion, is that the modeling of duct acoustics typically requires the regular complex Helmholtz equation plus a lower-order derivative term arising from convection.

Multilevel parallelism. H-matrix operations involve a significant number of small, dense matrix operations, and accelerators are being adopted extensively and rapidly in high-performance computing systems. ACR, being modular by design, could naturally incorporate an additional level of parallelism through batched BLAS routines optimized for accelerators. Some of this work is currently being pursued at KAUST [115].

Unstructured grids. An ordering of the grid based on a breadth-first traversal can be generated and partitioned to produce the matrix blocks; the key requirement for ACR is that every partition has no more than two neighbors, to keep the block tridiagonal structure, as illustrated in Figure 8.1. The main difference from the current ACR for regular grids is that the diagonal and off-diagonal H-matrix blocks will have different structures, since the blocks now represent different geometric arrangements of the nodes. That the ranks do not increase adversarially remains to be documented experimentally, and perhaps theoretically, in the future. But our intuition is that when the partitioning produces a sequence of "pancake" regions, as a properly initialized breadth-first traversal would do, the resulting interactions between partitions should behave similarly to how the planar regions of ACR interact, and the rank patterns should be similar.

Nest on dimension. In the current implementation of ACR, the hierarchical structures of the matrix blocks are generated based on a two-dimensional clustering. It should be practical to nest on dimension and represent these blocks as hierarchical matrices whose entries represent 1D lines, with ACR applied in the nested dimension. This generalization would enable MPI distributed memory to be used down to the line level, alleviating the large shared-memory aggregates that planes impose in 3D problems, as memory is clearly a limiting resource.

Machine learning based auto-tuning. ACR has many user-set parameters for which limited guidance was provided, such as: geometric division and coloring of the distributed-memory regions, ordering, admissibility condition structure, leaf size, arithmetic based on fixed rank or predetermined accuracy, accuracy of compression and arithmetic, choice of low-rank compression method, size of the Krylov space, and the termination level of ACR (for preconditioned problems), among others. These can be used to match architectural conditions (relative costs of flops and memory transfers, the relative cost of synchronization, memory capacity per node, concurrency desired per node, etc.), as well as application characteristics (domain shape, indefiniteness, nonsymmetry, anisotropy, inhomogeneity, number of right-hand sides, etc.). A machine learning framework that collects statistics on the execution of a subset of the tuning parameter space could drastically improve the usability of ACR in a production environment with minimum user involvement.

Figure 8.1: Partitioning of an unstructured mesh that produces a block tridiagonal matrix structure, for the application of ACR.


Appendix A

Memory consumption optimization

Consider the computation of the approximate inverse in the $\mathcal{H}$-matrix format of a 2D variable-coefficient Poisson problem, as defined in (6.2), with an error tolerance of three digits of accuracy in the Frobenius norm, $\|AA^{-1} - I\|_F$, and a fixed accuracy parameter. The variable of interest in this experiment is η, which controls the $\mathcal{H}$-matrix block refinement as depicted in Figure A.1. Table A.1 documents the memory requirements of each approximate inverse as a function of η. As can be seen, the optimal η parameter lies between weak admissibility (η=2) and strong admissibility (η=256). This tunability is a major advantage of data-sparse formats that are not limited to the choice of weak admissibility, such as the $\mathcal{H}$-format. As the table also shows, the most economical inverse in terms of memory is not necessarily the representation with the smallest rank, although it is very close to it. This is because an aggressive refinement leads to a larger number of blocks and deeper cluster trees.
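To make concrete how η enters the construction, the minimal sketch below recursively partitions a 1D index set under an assumed standard geometric admissibility condition, min(diam(t), diam(s)) ≤ η · dist(t, s), with leaf size 32, and counts the low-rank and dense blocks produced for several values of η. The clustering, the admissibility convention, and the counts are illustrative only; they do not reproduce the 2D experiment or the $\mathcal{H}$-matrix library behind Table A.1.

def block_partition(s, t, eta, leaf=32):
    """Recursively partition the block s x t of a 1D index set.

    A block is kept whole as a low-rank block if it satisfies the geometric
    admissibility condition min(diam(s), diam(t)) <= eta * dist(s, t);
    otherwise it is subdivided until the leaf size is reached (dense block).
    Returns (number of low-rank blocks, number of dense blocks).
    """
    (s0, s1), (t0, t1) = s, t
    diam_s, diam_t = s1 - s0, t1 - t0
    dist = max(0, max(s0, t0) - min(s1, t1))       # gap between the two intervals
    if min(diam_s, diam_t) <= eta * dist:          # admissible: one low-rank block
        return 1, 0
    if diam_s <= leaf or diam_t <= leaf:           # too small to split: dense block
        return 0, 1
    sm, tm = (s0 + s1) // 2, (t0 + t1) // 2        # bisect both clusters
    low_rank = dense = 0
    for cs in ((s0, sm), (sm, s1)):
        for ct in ((t0, tm), (tm, t1)):
            lr, dn = block_partition(cs, ct, eta, leaf)
            low_rank += lr
            dense += dn
    return low_rank, dense

n = 1024
for eta in (2, 64, 256):
    low_rank, dense = block_partition((0, n), (0, n), eta)
    print(f"eta = {eta:3d}: {low_rank} low-rank blocks, {dense} dense blocks")

Running the sketch for η ∈ {2, 64, 256} shows how the number of blocks, and hence the cluster-tree bookkeeping, varies with η; the memory optimum reported in Table A.1 balances this block count against the ranks of the admissible blocks.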

32 9 6 8 4 3 1 1 32 8

9 32 9 10 10 4 10 1 3 2 8 32 8 6 9 32 12 1 1 5 3 8 3 1 1 3 6 4 1 1 32 10 16 8 10 12 32 10 1 14 4 11 3 12 1 8 10 32

4 10 1 11 32 11 10 11 4 3 2 1 32 8

3 4 1 1 10 32 11 10 2 3 10 4 11 1 2 1 8 32 10 1 11 5 14 10 11 32 14 10 1 13 3 4 1 1 5 3 5 2 1 2 3 1 1 1 4 1 1 1 16 32 9 16 1 1 3 4 11 10 14 32 1 1 13 3 10 1 14 4 10 1 1 10 9 32

8 10 10 1 32 11 10 10 4 3 1 1 32 8

3 3 3 2 1 1 10 32 11 10 10 4 11 1 1 1 3 2 8 32 9 1 12 12 13 11 11 32 14 1 2 5 3 11 1 1 12 3 2 1 4 6 4 2 1 32 9 16 2 1 1 3 3 3 10 10 14 32 10 1 14 4 1 1 1 13 3 12 2 9 9 32

4 10 1 11 32 11 10 11 4 3 2 1 32 8

3 4 2 1 10 32 11 10 1 3 10 4 10 1 2 1 8 32 10

1 10 5 14 10 11 32 14 10 1 13 3 1 2 5 3 2 1 1 1 32 10 31 3 4 1 2 3 5 4 1 1 1 4 1 1 16 16 1 1 3 4 11 10 14 32 1 1 13 3 11 1 14 5 11 2 1 10 10 32

4 10 1 11 32 11 10 11 4 3 2 2 32 8

3 4 1 1 1 1 10 32 11 10 10 4 10 2 4 3 2 1 8 32 10 5 2 11 5 14 11 1 1 10 11 32 14 1 2 5 3 11 10 2 2 5 3 2 1 32 9 16 2 1 3 5 1 1 1 10 10 14 32 10 1 14 5 11 10 3 11 2 2 10 9 32

4 10 1 11 32 11 10 10 4 3 1 1 32 8

3 4 2 1 10 32 11 10 1 1 3 2 10 4 11 1 2 1 8 32 10 3 5 2 1 1 11 5 14 10 11 32 15 1 11 10 12 1 1 5 1 1 2 5 3 5 2 2 2 4 1 1 1 1 1 16 32 10 17 1 1 3 5 10 11 14 32 2 1 1 11 10 3 12 1 14 4 11 1 2 10 10 32

1 10 12 12 10 1 11 10 1 2 32 15 10 11 5 3 2 2 32 10

2 1 1 4 3 3 1 1 1 3 10 11 1 10 1 14 32 11 10 15 5 11 1 1 2 4 2 2 1 10 32 10 2 2 12 12 14 11 11 32 14 1 2 5 3 10 1 1 10 10 1 2 2 1 5 3 2 1 32 10 16 1 1 2 1 2 3 3 3 2 3 1 2 11 10 14 32 10 1 14 4 1 1 1 11 10 4 11 2 1 10 10 32

2 10 11 11 4 14 1 11 32 14 10 11 1 2 5 3 2 1 32 10

2 2 2 3 11 10 3 5 2 1 14 32 11 10 2 1 3 2 2 11 1 14 4 11 1 1 1 10 32 10 2 10 4 14 11 11 32 14 1 10 10 12 1 2 5 3 2 2 1 1 1 1 31 32 10 32 31 5 1 1 1 3 4 2 2 1 5 1 4 1 4 1 1 1 1 17 16 2 2 2 3 2 1 3 5 11 11 14 32 2 1 1 10 10 3 1 1 10 1 15 4 10 1 2 10 10 32

4 10 1 12 32 11 10 11 5 3 2 2 32 8

3 4 2 1 1 1 10 32 11 10 11 4 11 2 3 2 8 32 10 6 1 10 5 14 11 1 2 11 11 32 14 1 2 5 3 5 4 12 3 1 1 4 1 2 32 10 16 2 1 3 5 2 1 1 11 10 14 32 11 1 14 5 13 3 11 1 10 10 32

5 10 1 11 32 11 10 11 4 3 2 2 32 8

3 4 2 1 10 32 11 10 11 4 10 1 2 3 2 1 8 32 8 3 3 5 1 1 1 2 10 4 14 11 11 32 14 1 2 5 3 5 11 1 12 3 3 2 1 2 1 1 4 1 3 1 1 1 1 1 16 32 10 16 2 1 3 5 11 10 14 32 11 1 14 5 1 1 13 3 11 1 2 8 10 32

4 10 1 11 32 11 10 11 5 3 2 2 32 8

3 4 2 1 10 32 11 10 11 4 11 2 1 1 4 2 8 32 10 5 2 10 5 14 10 11 32 14 1 1 5 3 11 1 2 12 3 2 1 4 6 3 2 1 32 10 16 2 1 3 5 10 10 14 32 10 1 14 5 1 1 1 13 3 12 1 10 10 32

4 10 1 11 32 11 10 10 4 3 1 1 32 8

3 4 1 1 10 32 11 10 2 3 11 4 10 1 2 1 8 32 9

1 11 5 14 10 11 32 14 11 1 13 3 1 1 1 1 5 4 2 1 1 32 10 34 1 4 1 1 4 5 2 1 4 1 1 1 4 5 3 2 1 1 1 16 16 2 1 3 5 10 10 14 32 1 1 13 3 10 1 14 5 11 2 2 9 10 32

1 12 10 10 1 2 12 12 11 1 32 15 10 11 5 3 2 1 32 10

2 1 2 4 10 11 1 10 1 3 3 3 1 1 1 1 1 15 32 11 10 14 5 10 1 3 3 1 1 10 32 10 2 1 11 12 12 11 1 2 10 11 32 15 1 1 5 3 10 10 2 1 1 1 6 3 2 1 32 9 18 1 2 2 4 1 1 2 1 2 4 3 3 2 1 1 11 11 14 32 10 1 15 5 11 10 3 11 1 1 10 9 32

1 10 11 11 5 14 1 11 32 15 10 10 1 2 5 3 1 1 32 10

2 2 2 3 11 10 3 5 1 1 15 32 11 10 1 1 3 3 1 11 2 14 4 10 1 1 1 10 32 9 1 2 2 4 4 3 2 1 2 10 5 14 10 11 32 14 1 10 10 11 1 1 1 1 1 1 5 3 6 2 1 2 3 1 1 1 4 1 1 1 1 1 18 32 10 18 1 1 3 3 2 2 3 5 10 11 14 32 1 1 1 11 10 3 1 2 10 1 15 5 10 1 1 9 10 32

2 12 12 12 11 1 10 10 1 1 32 15 10 10 5 3 1 2 32 10

2 1 1 3 3 3 1 1 1 3 10 10 1 11 1 14 32 11 10 14 4 11 2 1 2 4 2 1 1 10 32 10 1 2 12 12 13 11 11 32 14 1 2 5 3 1 1 5 3 11 1 1 10 10 2 2 2 1 32 10 17 2 2 2 1 2 3 3 3 3 3 1 1 10 11 14 32 11 1 14 5 2 1 1 11 10 3 10 1 2 10 10 32

1 10 10 11 4 14 1 10 32 15 10 11 1 2 5 4 2 2 32 10

1 2 2 3 11 11 3 5 2 1 14 32 11 10 1 10 1 14 5 11 2 1 1 4 2 2 1 10 32 9 1 10 5 14 10 11 32 14 1 1 5 3 1 10 11 10 1 2 2 1 1 1 1 32 34 32 9 5 33 31 1 1 1 1 4 1 1 3 3 1 1 1 1 2 6 1 4 2 2 1 4 1 1 1 1 18 17 1 1 3 3 2 2 3 5 10 10 14 32 1 2 10 1 14 5 2 1 2 11 10 3 10 2 2 9 10 32 4 10 1 10 32 11 10 11 4 3 2 1 32 8 5 1 3 4 2 1 2 1 10 32 11 10 11 4 10 1 3 2 2 1 8 32 9 5 1 12 5 14 2 1 10 11 11 32 15 1 1 5 3 10 10 2 2 5 4 1 1 32 10 15 1 2 3 5 2 1 1 10 11 15 32 10 1 14 4 11 10 3 11 2 2 9 10 32

4 10 1 11 32 10 10 10 4 3 1 1 32 8

3 4 1 1 10 32 11 10 2 1 3 2 10 4 10 1 2 1 8 32 10 3 1 3 5 1 1 1 1 1 1 1 10 4 14 10 11 32 14 1 10 10 12 1 1 5 3 5 2 2 2 4 1 1 1 3 1 1 1 15 32 10 17 1 1 3 5 10 10 14 32 2 1 1 11 10 3 11 1 15 5 11 2 2 10 10 32

1 10 5 14 1 10 10 11 1 2 32 15 10 11 5 3 2 2 32 10

2 2 1 3 5 2 1 1 1 3 10 10 2 10 1 14 32 11 10 15 5 11 2 3 2 1 1 10 32 9 2 5 2 11 4 14 2 1 11 11 11 32 14 1 2 4 3 1 1 5 3 10 10 2 2 1 1 32 8 18 1 2 2 1 3 4 2 1 2 2 3 1 1 10 11 14 32 10 1 14 5 11 10 3 10 2 2 9 8 32

2 11 11 11 5 14 1 10 32 14 10 10 1 1 5 4 1 1 32 10

2 2 2 3 10 10 3 5 2 1 14 32 11 10 1 10 1 15 5 11 1 1 1 3 2 1 1 10 32 10

1 11 4 14 10 11 32 15 2 2 4 3 1 11 10 11 2 2 1 1 32 9 32 1 3 1 1 3 6 1 1 1 1 1 2 5 4 2 1 1 4 1 1 17 18 1 1 2 3 1 2 3 5 10 10 14 32 1 2 11 1 14 4 1 1 1 11 11 3 10 1 1 10 9 32

4 10 1 11 32 11 10 11 4 3 2 2 32 8

3 4 1 1 1 1 10 32 11 10 11 4 10 1 3 2 2 1 8 32 10 5 1 11 5 14 1 1 10 11 11 32 15 1 2 5 3 10 10 1 1 5 3 1 1 32 10 16 1 2 3 5 2 1 2 10 11 15 32 11 1 15 5 11 10 3 11 1 1 10 10 32

4 10 1 10 32 11 10 10 4 3 2 2 32 8

3 4 2 1 10 32 11 10 2 1 3 2 10 4 10 2 2 1 8 32 10 4 1 1 1 4 5 1 2 1 10 5 14 10 11 32 15 1 10 10 11 1 1 5 1 1 2 5 3 5 2 2 2 3 1 1 1 1 1 16 32 10 18 1 1 3 5 10 10 14 32 1 1 1 11 10 3 10 1 15 5 10 1 2 10 10 32

1 10 5 14 1 11 10 11 1 1 32 15 10 11 5 3 2 1 32 9

1 1 2 4 5 2 1 3 10 11 1 11 1 15 32 11 10 14 4 10 1 1 1 3 2 2 1 9 32 9 1 5 1 10 4 14 11 11 32 14 1 2 5 3 2 1 10 10 10 2 2 1 1 6 4 2 1 32 9 16 1 2 1 2 3 4 2 3 1 1 11 10 14 32 10 1 15 4 1 1 1 10 10 3 11 2 2 9 9 32

1 12 11 11 4 14 1 11 32 15 10 11 1 2 5 3 2 2 32 10

2 2 2 3 11 10 3 5 2 1 14 32 11 10 1 1 3 2 1 10 1 15 4 11 2 2 1 10 32 10 1 1 2 11 5 14 11 11 32 15 1 10 10 11 1 1 1 2 5 3 2 1 1 1 1 31 5 32 32 10 32 1 32 5 1 4 1 1 1 2 3 6 1 1 1 3 1 5 1 3 1 1 1 1 18 16 1 2 2 3 2 1 3 4 10 11 14 32 1 1 2 11 10 3 1 1 10 1 14 5 10 1 2 10 10 32

2 10 5 14 1 11 11 10 1 2 32 15 10 11 5 3 2 1 32 9

1 1 1 3 4 1 1 1 1 3 10 11 1 10 1 15 32 11 10 16 4 11 1 4 2 1 1 9 32 10 2 6 1 11 5 14 2 1 10 2 1 10 11 32 14 1 1 5 3 10 10 2 2 5 3 2 1 32 10 18 1 2 1 1 3 4 2 1 2 2 3 1 1 10 10 14 32 10 1 15 5 10 10 3 10 1 2 10 10 32

1 10 11 10 1 1 5 14 1 11 32 15 10 11 5 4 2 2 32 10

1 1 2 3 10 11 1 10 1 3 4 2 1 14 32 11 10 1 1 3 2 15 4 12 1 1 1 10 32 8 1 1 1 1 1 3 1 3 6 1 2 3 1 2 1 10 5 14 10 11 32 14 1 10 10 10 1 2 5 3 5 1 2 2 1 1 1 1 4 1 1 1 1 1 4 1 1 1 18 32 10 18 1 1 2 3 1 1 1 1 3 5 10 10 14 32 1 1 2 11 10 3 11 1 15 4 11 2 2 8 10 32

2 10 5 14 1 10 10 10 1 1 32 14 10 10 5 3 1 1 32 8

1 2 1 3 5 1 1 4 10 10 1 10 1 14 32 11 10 14 5 10 1 1 1 4 2 2 1 8 32 10 2 5 1 10 5 14 10 12 32 15 1 2 5 3 1 1 6 4 1 1 11 10 11 1 1 1 1 32 10 18 1 2 1 1 3 5 2 3 1 2 10 11 14 32 10 1 15 5 2 1 2 11 10 3 10 1 2 10 10 32

2 10 11 11 5 14 1 11 32 15 10 10 1 2 5 4 1 1 32 9

2 2 1 3 10 10 3 5 2 1 15 32 11 10 2 11 1 14 5 11 1 1 1 3 3 2 1 9 32 9

1 11 5 14 11 11 32 15 1 1 5 3 1 10 10 12 2 2 1 1 1 32 9 33 1 1 1 1 1 3 1 2 3 6 1 3 1 1 1 6 1 1 1 2 1 3 1 1 1 1 1 4 1 1 18 18 2 2 2 3 1 2 3 5 10 11 14 32 1 1 11 1 14 5 1 1 2 11 11 3 12 2 1 9 9 32

2 10 11 10 1 2 5 14 1 11 32 15 10 10 5 3 1 2 32 10

1 2 1 4 10 10 1 10 1 4 5 2 1 1 1 14 32 11 10 14 5 10 1 3 2 2 1 10 32 9 2 6 1 10 5 14 2 1 10 11 11 32 15 1 2 5 3 10 10 2 2 1 1 6 4 1 1 32 9 16 1 1 2 3 1 2 1 2 3 5 2 1 1 11 11 14 32 10 1 15 5 11 10 3 12 2 2 9 9 32

1 10 11 11 5 14 1 10 32 15 10 11 1 2 5 4 2 1 32 9

2 1 1 3 10 10 3 4 2 1 14 32 11 10 2 1 3 2 1 10 1 14 4 10 1 1 1 9 32 8 1 1 3 1 1 1 1 4 3 6 1 1 2 10 4 14 10 11 32 15 1 11 10 11 1 1 1 1 4 1 1 1 1 1 1 1 5 3 6 1 1 1 3 1 1 1 1 1 16 32 9 17 1 2 2 3 2 2 3 5 10 10 14 32 2 1 2 11 10 3 1 2 10 1 14 5 10 1 1 8 10 32

1 10 5 14 1 11 11 11 2 2 32 15 10 10 4 3 1 1 32 9

2 1 2 4 4 1 1 3 10 11 2 10 1 15 32 11 10 14 5 10 1 1 1 3 2 2 1 9 32 10 2 6 1 11 5 14 10 11 32 15 1 2 5 3 1 1 10 10 10 1 2 1 1 6 4 2 1 32 10 17 1 2 1 2 3 5 2 3 1 2 10 10 14 32 10 1 14 5 1 1 1 11 10 3 10 1 2 10 10 32

2 10 10 11 4 14 1 11 32 15 10 11 1 1 5 4 2 1 32 10

2 2 2 3 10 10 3 5 2 1 14 32 11 10 2 1 3 2 1 10 1 15 5 11 1 1 1 10 32 10 1 1 1 11 5 14 10 11 32 15 1 1 1 11 10 10 1 1 1 5 4 1 1 1 1 1 1 5 33 32 33 32 9 33 1 5 34 1 1 1 4 1 1 1 3 1 1 4 6 1 1 1 1 1 4 1 1 1 1 1 6 1 3 1 1 1 1 17 17 1 2 2 3 1 2 3 5 10 10 14 32 2 1 1 11 10 3 1 1 11 1 14 5 10 1 2 10 9 32 4 10 1 11 32 10 10 11 4 3 2 2 32 8 5 1 3 4 1 1 1 2 10 32 12 10 11 4 10 1 3 2 8 32 10 5 1 10 5 14 11 1 2 10 11 32 14 1 2 5 3 5 4 12 3 2 1 4 2 1 32 10 16 1 1 4 5 1 1 1 10 11 14 32 11 1 14 5 13 3 11 1 10 10 32

4 10 2 11 32 11 10 11 4 3 2 1 32 8

3 4 2 1 11 32 11 10 11 4 10 1 2 3 2 1 8 32 9 5 3 4 5 1 2 1 1 1 1 11 4 14 10 11 32 14 1 1 5 4 6 11 1 12 3 4 1 1 1 4 1 1 1 4 1 1 1 16 32 9 16 1 2 3 4 11 11 14 32 11 1 14 4 1 1 13 3 11 1 2 9 9 32

4 10 1 11 32 11 10 11 4 3 2 2 32 8

3 4 1 1 10 32 12 10 10 4 10 1 1 2 3 2 8 32 10 5 1 10 5 14 10 11 32 14 2 2 5 3 11 1 1 12 3 2 1 4 5 4 1 1 32 10 16 1 1 4 5 11 10 14 32 11 1 14 5 2 1 1 13 3 12 1 10 10 32

4 10 2 11 32 11 10 11 5 3 2 2 32 7

3 4 2 1 10 32 11 10 2 3 11 4 10 2 2 2 7 32 10

1 2 10 5 14 11 11 32 15 11 1 13 3 1 2 5 4 1 1 1 1 32 10 32 1 3 1 1 1 1 4 6 1 1 4 5 3 1 1 1 3 1 1 16 16 2 1 3 5 10 11 14 32 1 1 14 3 10 1 14 4 11 2 2 10 10 32

1 12 10 10 1 2 12 12 11 1 32 16 10 11 5 4 2 1 32 10

2 1 2 4 10 10 2 10 1 3 3 3 1 1 1 1 2 16 32 11 10 14 4 11 1 3 2 2 1 10 32 8 2 1 1 2 12 12 12 11 1 1 11 11 32 15 1 1 5 3 10 10 2 2 6 4 2 1 32 10 17 1 2 2 3 1 2 1 1 2 3 3 3 1 1 1 10 10 15 32 10 1 14 4 11 10 3 11 2 2 8 10 32

2 10 11 10 1 2 5 14 1 11 32 16 10 11 4 3 2 2 32 9

2 2 1 4 10 11 1 11 2 3 5 1 1 15 32 11 10 2 1 3 1 14 4 10 1 2 1 9 32 9 1 2 2 4 1 2 1 1 3 3 1 1 1 11 4 14 11 11 32 15 1 11 10 11 1 1 5 1 2 2 5 3 6 2 1 1 5 1 1 1 1 1 17 32 9 16 1 2 2 3 1 2 1 1 3 5 11 10 15 32 2 1 1 11 10 3 10 1 14 4 10 1 2 9 9 32

1 13 12 12 11 1 11 10 1 2 32 15 10 10 5 4 1 1 32 10

2 1 2 4 3 3 1 1 1 3 10 10 1 11 1 15 32 11 10 14 5 11 1 1 2 4 2 2 1 10 32 10 2 2 11 12 13 11 11 32 14 1 2 5 3 10 1 2 10 10 1 2 1 1 6 4 2 1 32 10 17 2 2 2 1 2 3 3 3 2 3 1 1 10 10 14 32 11 1 14 5 1 1 1 11 10 3 11 2 2 10 10 32

2 11 10 11 5 14 1 11 32 15 10 11 1 1 5 4 2 1 32 10

2 2 2 3 10 11 3 5 2 1 14 32 11 10 2 1 3 2 2 10 1 14 4 10 1 2 1 10 32 8 1 1 1 11 5 14 10 11 32 14 1 11 10 11 1 1 5 3 2 1 1 1 1 1 31 32 32 10 31 32 5 1 1 1 3 1 1 1 1 1 2 4 3 2 1 1 4 1 6 1 4 1 1 1 1 16 17 1 2 2 3 1 1 3 5 10 11 14 32 2 1 1 11 10 3 1 2 11 1 14 5 10 1 2 8 10 32

4 10 1 11 32 11 10 10 4 3 1 1 32 8

3 4 2 1 1 1 10 32 11 10 11 4 10 1 4 1 8 32 10 5 1 11 5 14 11 1 2 10 11 32 14 1 1 4 3 5 3 12 3 2 1 4 1 1 32 10 15 1 2 3 5 1 1 1 10 11 14 32 10 1 14 4 13 3 11 1 10 10 32

4 10 1 11 32 11 10 10 4 3 1 1 32 8

3 4 2 1 11 32 11 10 11 4 11 1 1 4 2 1 8 32 10 4 4 5 1 2 1 1 11 4 14 10 11 32 14 1 1 5 3 5 11 1 12 3 4 1 1 2 1 1 4 1 15 32 10 16 1 2 3 4 10 11 14 32 11 1 14 4 1 1 13 3 11 2 2 10 10 32

4 10 1 11 32 11 10 11 4 3 2 2 32 8

3 4 1 1 10 32 11 10 11 4 11 2 1 2 3 2 8 32 10 5 1 12 4 14 10 11 32 14 1 2 5 3 11 1 2 13 3 2 1 4 32 9 16 1 2 4 4 10 11 14 32 11 1 14 4 1 1 1 13 3 11 1 10 9 32

4 10 1 11 32 9 10 8 32 7

3 3 2 1 9 32 8 6 2 3 7 32 8

1 10 5 14 10 8 32 12 11 1 11 3 1 1 32 10 31 1 3 1 1 3 5 1 1 2 1 1 1 4 16 16 1 2 3 4 8 6 12 32 1 1 8 3 8 10 32

1 12 11 10 1 2 12 12 11 1 32 16 10 10 5 3 1 1 32 10

1 1 2 4 10 10 2 10 1 4 3 3 1 1 1 1 1 15 32 11 10 14 5 12 1 3 2 2 1 10 32 10 2 2 12 12 12 11 1 2 10 11 32 14 1 1 5 3 10 10 2 2 1 1 6 4 2 1 32 10 16 1 2 2 3 1 1 1 1 2 3 3 3 1 1 1 10 11 14 32 10 1 14 4 11 10 3 10 2 2 10 10 32

2 10 11 11 5 14 1 10 32 15 10 10 1 1 5 3 1 2 32 10

2 2 1 3 11 10 3 5 1 1 14 32 11 10 2 1 3 2 1 11 1 14 5 10 2 1 1 10 32 10 1 1 2 4 4 3 2 1 1 11 5 14 11 11 32 15 1 10 10 12 1 1 1 1 1 2 5 3 6 2 2 1 4 1 16 32 10 17 1 2 2 3 1 2 3 5 11 10 14 32 2 1 2 12 10 3 1 2 11 2 15 5 12 2 2 10 10 32

2 12 12 12 11 1 10 11 1 2 32 15 10 11 5 3 2 1 32 10

1 1 1 3 3 3 1 1 1 3 10 10 2 10 2 15 32 10 10 15 4 10 1 10 32 10 2 2 10 10 8 11 11 32 10 1 1 4 3 1 1 6 3 32 8 15 1 2 2 1 2 3 3 3 2 3 1 2 10 11 10 32 10 1 10 4 10 8 32

2 10 11 11 5 14 1 10 32 15 10 11 1 2 5 3 2 2 32 9

2 2 2 3 11 11 3 4 2 1 14 32 10 10 1 11 1 15 5 11 2 9 32 10 1 1 1 10 4 10 10 10 32 10 1 2 4 3 1 1 1 31 31 32 7 5 31 1 1 1 4 1 1 4 3 1 1 1 1 1 6 1 4 17 15 1 2 2 3 1 1 3 4 11 10 10 32 1 1 10 1 10 4 10 7 32 2 10 10 10 1 1 4 14 2 11 32 14 10 11 5 3 2 1 5 32 10 1 2 2 2 3 10 10 1 10 1 3 5 2 1 2 1 14 32 11 10 14 4 11 1 3 2 2 1 10 32 8 2 1 1 5 2 11 4 14 1 1 11 10 11 32 15 1 1 5 3 10 10 2 2 6 4 1 1 32 10 18 1 2 2 3 1 2 2 1 3 4 1 1 2 10 10 15 32 10 1 15 5 11 10 3 11 2 1 8 10 32

1 10 10 11 1 1 5 14 1 11 32 14 10 11 5 3 2 2 32 9

2 1 1 3 10 11 1 10 1 3 5 1 1 14 32 11 10 1 1 3 2 16 4 11 2 2 1 9 32 10 1 1 1 1 1 4 1 2 4 1 3 6 1 1 1 1 1 1 1 11 5 14 11 11 32 15 1 11 10 11 1 2 5 3 6 1 1 2 4 1 1 1 1 1 4 1 1 1 18 32 10 18 1 1 3 3 1 1 1 2 3 5 11 10 14 32 2 1 1 11 10 3 11 1 14 5 10 1 2 10 10 32

1 10 5 14 1 11 11 10 1 2 32 15 10 11 5 3 2 1 32 10

2 1 2 3 4 1 1 1 1 3 10 10 2 11 2 14 32 12 10 15 4 11 1 3 2 2 1 10 32 9 2 6 1 10 4 14 2 1 10 11 11 32 14 1 2 5 3 1 1 6 4 10 10 1 2 2 1 32 10 16 2 1 1 2 3 4 1 1 1 2 3 1 1 11 10 14 32 10 1 14 5 10 10 3 10 1 2 9 10 32

2 11 11 11 4 14 1 12 32 15 10 10 1 1 5 3 1 2 32 10

2 2 2 3 10 12 3 4 2 1 14 32 11 10 1 11 1 15 5 11 1 2 1 3 2 2 1 10 32 10

2 10 5 14 10 11 32 14 1 2 5 3 1 10 10 10 2 2 1 1 1 32 9 33 1 1 1 1 1 3 1 1 1 3 1 2 4 6 1 1 1 1 1 2 5 4 1 1 1 1 1 4 1 1 18 16 1 1 2 3 2 1 3 5 11 11 14 32 1 2 11 1 15 5 2 1 1 11 10 3 10 1 2 10 9 32

1 10 11 10 1 2 5 15 1 11 32 16 10 11 5 3 2 1 32 8

1 1 2 3 10 11 1 11 1 3 4 2 1 1 1 15 32 11 10 15 4 11 1 3 2 2 1 8 32 10 1 6 2 11 5 14 1 1 11 10 11 32 14 1 2 5 3 10 10 2 2 1 1 6 3 1 1 32 10 16 1 2 2 4 1 1 2 2 3 5 1 1 2 10 10 14 32 10 1 14 5 11 10 3 10 2 1 10 10 32

2 10 10 11 5 15 1 11 32 15 10 11 1 1 5 3 2 1 32 8

2 2 1 3 10 10 3 5 2 1 14 32 11 10 1 1 3 2 2 10 1 15 5 10 1 2 1 8 32 10 1 1 1 1 2 4 4 1 1 1 4 6 1 1 1 11 4 14 11 11 32 15 1 10 10 11 1 1 5 1 1 1 1 1 1 2 5 4 6 1 2 2 4 1 1 1 1 1 16 32 10 17 1 1 2 3 1 2 3 5 10 10 14 32 2 1 1 10 10 3 1 1 11 1 14 5 10 2 2 10 10 32

1 11 5 14 1 10 10 10 1 2 32 15 10 10 5 3 1 1 32 9

2 1 2 3 4 2 1 3 10 11 1 11 1 14 32 11 10 14 4 10 1 1 1 4 2 2 1 9 32 9 1 5 1 12 5 14 11 11 32 15 1 2 5 3 1 1 11 10 10 2 2 1 1 6 3 2 1 32 9 18 1 2 2 2 3 5 2 3 1 1 12 10 14 32 11 1 15 5 1 1 2 11 10 3 11 2 2 9 10 32

1 10 10 11 4 14 1 11 32 15 10 11 1 1 5 3 2 2 32 10

1 2 2 3 11 11 3 5 2 1 14 32 11 10 2 1 3 2 1 10 1 15 4 12 1 2 1 10 32 10 1 1 1 1 1 2 10 5 14 11 11 32 14 1 10 10 11 1 1 1 2 5 3 2 2 1 1 1 33 32 5 33 32 10 33 34 5 1 1 1 1 1 4 1 4 1 1 1 2 3 5 1 1 1 4 1 1 1 1 1 5 1 4 2 1 1 1 17 18 1 1 2 3 2 1 3 5 11 11 14 32 1 1 1 11 10 3 1 1 11 1 15 5 12 2 2 10 10 32

2 10 5 14 1 11 10 11 1 2 32 14 10 11 5 4 2 1 32 9

1 1 1 3 4 2 1 1 1 3 10 10 2 10 1 14 32 11 10 14 4 10 1 3 2 2 1 9 32 8 2 6 1 10 5 14 2 1 11 1 1 10 11 32 14 1 2 4 3 10 10 1 2 6 4 2 1 32 10 18 1 2 2 2 3 5 2 1 1 2 3 1 1 10 11 14 32 10 1 14 4 11 10 3 11 2 2 8 10 32

1 10 10 10 1 1 5 14 1 10 32 15 10 10 5 3 1 2 32 10

2 1 1 3 10 10 1 10 1 4 4 2 1 14 32 11 10 1 1 3 2 15 4 10 2 2 1 10 32 10 1 1 1 1 1 4 2 4 5 1 1 4 1 2 1 10 4 14 10 11 32 14 1 10 10 12 1 2 5 3 6 2 1 2 1 1 4 1 18 32 10 16 1 2 2 3 1 2 1 2 3 4 11 10 14 32 2 1 1 11 10 3 11 1 14 5 10 1 2 10 10 32

2 10 5 14 1 11 11 10 1 2 32 15 10 11 5 3 2 1 32 10

1 2 2 3 5 2 1 4 10 10 2 10 1 14 32 10 10 15 5 11 1 10 32 9 2 6 2 11 4 10 11 11 32 10 1 1 4 3 1 1 5 4 32 8 15 1 2 2 2 3 4 2 3 1 1 10 10 10 32 10 1 10 4 9 8 32

1 10 10 11 5 14 1 10 32 15 10 11 1 1 5 4 2 2 32 10

2 2 1 3 10 10 3 5 1 1 14 32 10 10 2 10 1 15 5 10 1 10 32 10

1 10 4 10 10 11 32 10 1 2 4 3 1 1 32 8 32 1 1 1 1 1 4 1 2 3 5 1 3 1 1 1 6 1 1 1 3 16 15 1 2 2 3 1 1 3 4 11 10 10 32 1 1 10 1 10 4 10 8 32

1 10 11 11 1 1 5 15 1 10 32 15 10 11 5 3 2 1 32 9

1 1 1 3 10 11 1 10 1 3 5 2 1 2 1 15 32 11 10 15 4 11 1 3 2 2 1 9 32 10 1 6 1 10 5 14 1 1 10 10 11 32 15 1 1 5 3 10 10 2 2 1 1 6 4 1 1 32 9 18 1 2 2 3 1 1 2 2 3 5 1 1 2 10 11 15 32 11 1 14 4 11 10 3 11 2 2 10 10 32

2 10 10 11 5 14 1 11 32 15 10 11 1 2 5 4 2 1 32 10

2 2 2 3 10 11 3 4 1 1 15 32 11 10 2 1 3 2 1 11 1 14 5 11 1 2 1 10 32 10 1 1 3 1 1 1 1 4 4 6 1 1 1 11 5 14 10 11 32 15 1 11 10 11 1 2 1 1 1 1 5 3 6 1 1 2 4 1 18 32 10 16 1 2 2 3 1 2 3 4 10 11 15 32 2 1 2 11 10 3 1 1 11 1 15 5 10 2 2 10 10 32

1 10 5 14 1 10 11 11 1 2 32 15 10 11 5 3 2 1 32 10

1 1 1 4 5 2 1 3 10 12 2 11 1 14 32 10 10 15 5 11 1 10 32 10 1 5 2 11 4 10 11 11 32 10 1 1 4 3 1 1 5 4 32 8 16 1 1 2 1 3 4 2 3 1 2 10 11 10 32 10 1 10 4 10 8 32

2 11 10 11 5 14 1 10 32 15 10 11 1 2 5 3 2 2 32 9

2 2 2 3 10 11 3 4 1 1 14 32 10 10 1 10 1 15 4 10 2 10 32 10 1 1 1 1 1 11 4 10 11 11 32 10 1 1 2 4 3 1 1 1 5 31 33 32 32 8 5 32 1 1 1 4 1 1 1 4 1 2 4 6 1 1 1 1 1 5 1 3 16 16 1 2 2 3 1 1 3 4 11 11 10 32 1 1 10 1 10 4 10 8 32 4 10 1 11 32 11 10 10 4 3 1 1 5 32 8 1 3 4 2 1 1 1 10 32 11 10 10 4 10 1 4 2 2 1 8 32 10 6 2 10 5 14 1 1 11 11 11 32 14 1 2 5 3 10 10 2 2 5 3 2 1 32 10 15 2 1 3 5 2 1 1 10 10 14 32 10 1 15 4 11 10 3 10 1 2 10 10 32

4 10 1 10 32 11 10 11 4 3 2 2 32 8

3 4 2 1 10 32 11 10 2 1 3 2 10 4 11 2 2 1 58 32 10 3 1 3 5 1 1 1 1 1 1 1 10 4 14 10 11 32 14 1 10 10 12 1 2 5 3 5 2 2 1 4 1 1 1 4 1 1 1 15 32 10 17 1 1 3 4 10 10 14 32 1 1 2 12 10 3 11 1 15 5 11 1 1 10 10 32

2 10 5 14 1 11 11 11 1 2 32 15 10 10 4 3 1 1 32 10

2 2 1 3 4 2 1 2 1 4 11 11 2 10 1 14 32 11 10 15 4 10 1 3 2 2 1 10 32 8 2 6 2 12 5 14 2 1 10 10 11 32 14 1 1 5 3 2 1 5 4 10 10 1 2 2 1 32 9 17 1 1 2 2 3 5 2 1 2 2 3 1 2 10 11 14 32 11 1 15 5 10 10 3 10 2 2 8 10 32

2 10 11 11 4 14 1 11 32 15 10 11 2 2 4 3 2 1 32 9

2 2 1 3 11 11 3 4 1 1 15 32 11 10 1 11 1 15 5 11 1 1 1 4 2 2 1 9 32 10

1 11 5 14 10 11 32 15 1 1 5 4 1 10 10 10 1 2 1 1 32 10 32 1 4 1 1 4 5 1 1 1 1 1 1 6 4 2 1 1 4 1 1 17 17 1 2 2 3 1 2 3 5 10 11 14 32 1 1 11 1 15 5 2 1 1 10 10 3 10 2 2 10 10 32

4 10 1 10 32 11 10 11 4 3 2 1 32 7

3 4 2 1 1 1 10 32 11 10 11 4 10 1 3 2 2 1 7 32 9 5 1 10 5 14 2 2 10 11 11 32 15 1 2 5 3 10 10 1 1 5 4 1 1 32 9 16 1 1 3 4 2 1 1 10 11 14 32 10 1 15 4 11 10 4 10 1 1 9 9 32

4 10 1 12 32 11 10 11 4 3 2 1 32 8

3 4 2 1 10 32 11 10 1 1 3 2 11 4 10 1 2 1 8 32 10 3 1 1 1 3 6 1 1 1 10 5 15 10 11 32 15 1 11 10 11 1 1 5 1 1 2 4 3 5 2 1 1 3 1 1 1 1 1 16 32 10 18 1 1 3 4 11 11 15 32 1 1 1 11 10 3 10 1 14 4 10 1 1 10 10 32

2 10 4 14 1 10 10 11 1 1 32 15 10 11 5 3 2 2 32 9

2 2 2 3 4 2 1 4 10 10 1 10 1 14 32 11 10 14 4 11 2 1 1 3 3 2 1 9 32 10 2 6 1 11 5 14 11 11 32 15 1 2 5 3 2 1 11 10 11 2 2 1 1 6 5 1 1 32 9 18 1 1 1 2 4 5 2 4 1 1 11 10 14 32 11 1 15 5 1 1 2 11 10 3 10 2 1 10 9 32

1 10 10 11 5 14 1 11 32 15 10 11 1 1 5 3 2 1 32 10

1 2 2 3 10 10 3 5 2 1 14 32 10 10 2 1 3 2 2 10 1 14 5 10 1 2 1 10 32 10 1 1 1 1 1 2 10 5 14 11 11 32 15 2 10 10 11 1 2 5 4 1 1 1 1 32 5 32 32 10 32 31 5 1 1 3 1 1 1 1 4 6 1 2 1 4 1 6 1 3 2 1 1 1 18 18 1 2 2 3 2 2 3 5 10 10 15 32 2 1 1 11 10 3 1 1 10 1 14 5 10 1 2 10 10 32

1 10 4 14 1 11 11 10 1 2 32 15 10 11 4 3 2 2 32 8

1 1 1 3 4 1 1 1 1 4 10 10 2 10 1 14 32 11 10 15 4 11 2 3 2 1 1 8 32 9 1 6 1 11 5 14 1 1 10 2 1 11 11 32 14 2 2 5 3 10 10 1 2 5 4 2 1 32 10 17 1 1 1 2 3 5 1 1 1 2 3 1 1 11 10 15 32 11 1 14 5 11 10 3 10 1 1 9 10 32

2 10 10 10 1 1 4 15 2 10 32 15 10 10 4 3 1 1 32 8

2 2 1 4 10 11 1 11 1 3 4 2 1 14 32 11 10 1 1 3 2 16 5 10 1 2 1 8 32 10 1 1 1 1 1 4 1 4 5 1 1 1 1 1 1 4 1 2 2 10 4 14 11 11 32 15 1 10 10 11 1 2 5 4 6 2 2 2 1 1 4 1 4 1 1 1 1 1 17 32 9 16 1 2 2 3 1 2 2 1 3 4 10 11 15 32 2 1 1 11 10 3 10 1 15 5 11 1 2 10 9 32

1 10 5 14 1 11 11 10 1 2 32 15 10 11 5 3 2 2 32 9

1 1 1 4 4 1 1 2 1 3 10 11 1 10 1 14 32 11 10 15 5 11 2 3 2 2 1 9 32 9 1 6 1 10 5 14 1 1 10 11 11 32 15 1 2 5 3 1 1 5 3 10 10 1 1 2 1 32 10 16 1 1 1 1 4 5 1 1 1 2 3 1 1 11 11 14 32 11 1 14 5 10 10 4 10 1 1 9 10 32

2 10 10 11 5 14 1 11 32 14 10 10 1 2 5 3 1 1 32 9

2 2 1 3 10 10 3 5 2 1 14 32 11 10 1 11 1 15 5 11 1 2 1 3 2 1 1 9 32 10

2 11 5 14 11 11 32 14 1 1 5 3 1 1 1 10 10 12 2 2 1 32 10 34 1 1 1 1 1 3 1 1 4 6 1 1 1 1 1 1 4 1 1 1 5 1 1 1 4 4 2 1 1 1 16 16 1 1 2 3 2 2 3 5 10 11 15 32 1 2 11 1 15 5 1 1 1 11 10 3 11 1 2 10 10 32

1 10 11 10 2 2 4 14 1 11 32 15 10 11 5 3 2 2 32 10

2 1 2 3 10 10 1 10 1 3 5 2 1 1 1 14 32 11 10 15 5 11 2 4 2 1 1 10 32 10 1 7 1 11 5 14 1 1 10 10 11 32 15 1 2 5 3 10 10 2 1 1 1 6 4 1 1 32 10 18 1 2 2 3 1 1 1 2 3 5 2 1 2 10 11 14 32 10 1 15 4 11 10 3 11 1 2 10 10 32

2 10 10 11 5 14 1 10 32 15 10 11 1 1 5 3 2 2 32 10

2 2 1 3 10 11 3 5 2 1 14 32 11 10 1 1 3 2 2 10 1 15 5 11 2 2 1 10 32 10 1 1 3 1 1 1 1 1 1 4 4 6 1 1 1 10 5 14 10 11 32 15 1 10 10 11 1 2 1 1 1 1 5 3 6 2 1 1 3 1 1 1 4 1 1 1 1 1 18 32 10 18 1 1 2 3 1 2 3 4 10 11 14 32 1 1 2 11 10 3 1 1 10 1 14 5 10 1 2 10 10 32

1 10 4 14 1 11 10 10 1 1 32 15 10 11 5 4 2 2 32 9

1 1 2 3 5 1 1 4 10 10 1 10 1 15 32 11 10 15 5 11 1 1 1 3 2 1 1 9 32 10 1 5 1 11 5 14 10 11 32 14 1 1 5 3 1 1 5 4 2 2 11 10 10 2 2 2 1 32 10 17 1 2 1 2 3 5 2 3 1 2 11 11 14 32 11 1 15 5 2 1 2 11 10 3 10 1 2 10 10 32

1 10 10 11 5 14 1 10 32 15 10 10 1 1 5 3 1 1 32 10

2 1 1 3 10 10 4 5 2 1 14 32 11 10 1 10 1 14 5 10 1 1 1 4 2 2 1 10 32 10 1 1 1 1 1 1 1 2 11 5 14 11 11 32 14 1 2 5 3 1 11 10 11 1 2 2 1 5 34 32 34 32 10 5 34 33 1 1 1 1 4 1 1 1 1 1 4 1 1 3 5 1 1 1 1 1 6 1 4 1 1 1 4 1 1 1 1 18 17 1 2 2 3 2 2 3 5 10 10 14 32 1 2 10 1 15 5 1 1 2 11 10 3 12 2 2 10 10 32 4 10 1 11 32 11 10 10 4 3 1 1 1 32 8 5 3 4 2 1 2 1 11 32 11 10 10 4 11 1 3 3 2 1 8 32 10 4 2 11 4 14 1 1 10 10 11 32 14 1 1 5 3 10 10 2 2 5 3 2 1 32 8 16 2 2 3 4 1 1 1 10 10 14 32 10 1 14 5 11 10 3 12 2 2 10 8 32

4 10 1 12 32 11 10 10 4 3 1 1 32 8

3 4 1 1 10 32 11 10 1 1 3 2 10 4 10 1 2 1 8 32 10 4 1 3 5 1 1 1 1 1 1 1 11 5 14 10 11 32 15 1 10 10 11 1 2 5 3 5 1 2 2 4 1 1 1 16 32 10 17 1 2 3 5 11 10 14 32 2 1 2 11 10 3 10 1 14 5 10 1 2 10 10 32

1 10 5 14 1 10 10 10 1 2 32 15 10 11 5 3 2 1 32 10

1 1 2 3 5 2 1 2 1 3 10 11 1 10 1 14 32 11 10 15 5 11 1 3 2 2 1 10 32 10 2 6 1 10 5 14 1 1 10 10 11 32 15 1 1 5 4 1 1 6 4 10 10 2 1 1 1 32 10 18 1 1 1 1 4 5 1 1 1 3 3 1 2 10 11 14 32 10 1 14 5 11 10 3 10 1 2 10 10 32

2 10 11 11 5 15 1 10 32 15 10 10 1 2 5 4 1 1 32 10

2 2 2 3 10 11 3 5 1 1 14 32 11 10 2 10 1 15 5 12 1 1 1 4 2 1 1 10 32 10

1 10 5 14 10 11 32 14 1 1 5 4 1 11 10 11 1 2 32 9 32 1 4 1 2 5 6 1 1 1 2 1 2 6 4 2 1 17 18 2 1 2 3 1 2 4 5 10 11 14 32 1 2 11 1 15 5 1 1 2 11 10 3 12 2 2 10 9 32

4 10 1 11 32 11 10 11 4 3 2 2 32 7

3 4 2 1 2 1 11 32 11 10 10 4 11 1 3 3 2 1 7 32 10 5 1 10 5 14 2 1 10 10 11 32 14 1 1 5 3 10 10 2 2 32 10 15 2 1 3 5 2 1 2 11 11 14 32 10 1 15 5 11 10 3 10 1 2 10 10 32

4 10 1 11 32 9 10 8 32 8

3 4 1 1 9 32 8 6 1 1 3 2 8 32 7 3 1 1 1 3 5 1 1 2 10 5 14 10 8 32 12 1 10 10 8 1 1 4 1 15 32 9 15 2 2 3 4 8 6 12 32 1 1 1 8 6 3 7 9 32

1 10 5 14 1 11 10 10 1 1 32 15 10 10 5 4 1 1 32 10

2 1 1 3 5 1 1 4 10 10 1 11 1 14 32 11 10 15 5 11 1 1 1 3 2 2 1 10 32 10 2 6 1 11 5 15 10 11 32 15 1 1 5 3 1 1 11 10 11 2 2 32 10 17 1 2 1 1 4 5 2 3 1 1 10 10 14 32 10 1 14 5 2 1 2 11 10 4 12 2 1 10 10 32

1 10 10 8 5 14 1 11 32 12 10 8 32 10

2 1 1 3 8 6 4 5 1 1 12 32 8 6 1 1 3 2 10 32 8 1 1 1 1 1 10 5 14 10 8 32 12 1 10 10 9 1 1 31 5 32 32 10 31 1 1 3 1 1 1 1 4 6 1 2 1 4 15 17 1 2 2 3 1 1 3 5 8 6 12 32 2 1 1 8 6 3 8 10 32

1 10 5 14 1 10 10 10 1 1 32 15 10 10 5 3 1 2 32 10

2 1 2 3 5 2 1 1 1 3 10 10 2 10 1 14 32 11 10 14 5 11 2 4 2 2 1 10 32 10 1 6 1 10 5 14 1 1 10 1 1 11 11 32 14 1 2 5 3 10 10 2 2 6 4 2 1 32 10 18 1 1 1 2 3 5 1 1 2 2 3 1 2 10 10 14 32 11 1 14 5 11 10 4 10 1 2 10 10 32

1 10 11 11 1 2 5 14 1 11 32 16 10 10 5 3 1 1 32 10

2 1 1 4 10 11 1 11 1 3 5 2 1 15 32 11 10 1 1 3 2 15 5 11 1 2 1 10 32 10 1 1 1 1 2 4 1 4 6 1 1 4 1 2 1 11 5 14 10 11 32 15 1 11 10 12 1 1 5 3 5 2 2 2 1 1 1 1 4 1 1 1 18 32 10 16 1 2 2 3 1 2 2 2 3 5 10 10 14 32 1 1 1 10 10 3 10 1 15 4 11 2 2 10 10 32

1 10 5 14 1 11 10 10 1 2 32 15 10 10 5 3 1 2 32 10

1 1 2 3 5 2 1 4 10 11 1 10 1 14 32 11 10 15 5 10 2 2 1 3 2 2 1 10 32 10 1 5 1 10 5 14 10 11 32 14 1 2 5 3 1 1 5 4 1 1 10 10 11 2 2 2 1 32 10 16 1 2 1 1 3 5 2 3 1 1 10 10 14 32 10 1 15 4 2 1 2 11 10 3 11 2 2 10 10 32

2 10 11 11 5 14 1 10 32 15 10 11 1 2 5 3 2 1 32 10

1 2 1 3 10 11 3 5 2 1 14 32 11 10 1 11 1 14 5 10 1 2 1 3 2 2 1 10 32 9

1 11 5 15 10 11 32 15 1 2 5 3 1 10 10 10 2 2 32 10 33 1 1 1 1 1 4 1 2 4 6 1 4 1 1 2 6 1 1 1 1 1 4 1 1 16 16 1 2 2 3 2 2 3 5 11 10 14 32 1 2 11 1 15 5 2 1 2 11 10 3 12 2 2 9 10 32

1 10 11 10 1 2 5 15 1 10 32 16 10 11 5 4 2 1 32 10

1 2 2 3 10 11 1 11 1 3 5 1 1 1 1 15 32 11 10 15 5 10 1 3 2 2 1 10 32 9 1 6 1 10 5 14 1 1 11 10 11 32 14 1 2 5 3 10 10 2 1 32 10 17 2 2 2 3 1 1 1 2 3 4 2 1 2 10 10 14 32 11 1 15 4 11 10 4 10 1 1 9 10 32

2 10 10 8 5 15 1 11 32 12 10 8 32 8

1 2 2 3 8 6 4 5 2 1 12 32 8 6 2 1 3 2 8 32 7 1 1 3 1 1 1 2 4 4 5 1 2 1 10 5 14 10 8 32 12 1 11 10 9 1 1 1 1 4 1 17 32 10 15 1 2 2 3 1 1 3 5 8 6 12 32 1 1 1 8 6 3 7 10 32

1 11 5 14 1 11 10 11 1 1 32 15 10 11 5 3 2 1 32 9

1 2 2 3 4 2 1 4 10 11 1 10 1 14 32 11 10 15 5 11 1 1 1 3 2 2 1 9 32 10 1 5 2 10 5 14 11 11 32 15 1 2 5 4 1 1 11 10 10 1 1 32 10 18 1 2 2 1 3 5 2 3 1 1 11 10 14 32 10 1 15 5 2 1 2 11 10 3 10 1 2 10 10 32

1 10 10 8 5 14 1 11 32 12 10 8 32 10

2 1 2 3 8 6 3 5 2 1 12 32 8 6 1 1 3 2 10 32 8 1 1 1 1 1 2 10 5 14 10 8 32 12 1 1 1 10 10 9 1 5 34 31 33 32 10 32 1 1 1 1 4 1 1 1 4 1 2 4 6 1 1 1 1 1 3 15 18 1 2 2 3 2 1 4 5 8 6 12 32 1 1 1 8 6 3 8 10 32 1 10 5 14 1 11 10 10 1 1 32 15 10 11 5 3 2 2 1 32 10 5 2 1 1 3 5 2 1 1 1 3 10 10 1 10 1 14 32 11 10 16 5 11 2 3 3 2 1 10 32 10 2 6 1 10 5 14 1 1 10 2 1 11 11 32 15 1 2 5 3 10 10 2 2 7 3 2 1 32 10 18 1 2 1 1 3 5 2 1 1 2 4 1 1 11 10 14 32 10 1 14 5 11 10 3 10 1 2 10 10 32

1 10 11 10 2 2 5 15 1 12 32 15 10 11 5 3 2 2 32 10

2 1 1 4 10 11 1 11 1 3 5 2 1 14 32 11 10 2 1 3 2 14 5 12 1 2 1 5 10 32 10 1 1 1 1 1 4 1 3 6 1 1 1 1 1 1 4 1 1 1 1 2 10 5 14 10 11 32 15 1 11 10 11 1 2 5 4 6 2 1 2 4 1 1 1 3 1 1 1 18 32 10 18 1 1 2 3 1 2 2 2 3 5 11 11 14 32 2 1 2 11 10 3 10 1 14 5 10 1 2 10 10 32

1 10 5 14 1 10 11 10 1 2 32 15 10 10 5 4 1 1 32 10

2 2 2 3 5 2 1 1 1 4 10 10 2 10 1 14 32 11 10 14 5 11 1 4 3 2 1 10 32 10 2 5 2 11 5 14 1 1 11 10 12 32 15 1 1 5 3 1 1 6 4 10 10 2 1 1 1 32 10 18 1 2 2 1 3 5 2 1 1 2 3 1 2 10 11 14 32 10 1 14 4 11 10 3 10 1 2 10 10 32

2 10 10 11 5 14 1 10 32 15 10 10 1 2 5 3 1 2 32 10

2 2 1 3 11 10 4 5 1 1 14 32 11 10 1 11 1 14 5 10 2 2 1 4 2 2 1 10 32 10

1 1 11 5 14 10 11 32 14 1 2 5 3 1 11 10 11 2 2 1 1 32 10 32 1 1 1 1 1 3 1 1 4 6 1 1 1 2 1 1 4 1 1 1 1 2 6 4 1 1 1 4 1 1 18 18 1 2 2 3 1 2 3 5 11 11 14 32 1 1 11 1 15 5 1 1 2 11 10 4 10 2 2 10 10 32

2 10 10 10 1 1 5 14 1 10 32 15 10 10 5 3 1 1 32 10

1 2 1 4 10 10 1 10 1 3 5 2 1 1 1 14 32 11 10 14 5 11 1 4 3 1 1 10 32 10 1 1 1 6 2 10 5 14 1 1 10 10 11 32 15 1 1 5 3 10 10 2 1 5 3 2 1 32 10 17 1 2 2 3 1 2 2 2 3 5 2 1 1 10 10 14 32 10 1 15 5 11 10 3 11 1 2 10 10 32

1 10 10 10 1 2 5 14 1 10 32 14 10 11 5 3 2 2 32 10

1 1 1 3 10 10 1 10 1 3 5 1 1 14 32 11 10 1 1 4 2 15 5 11 2 1 1 10 32 9 1 1 4 1 1 1 1 1 1 3 1 1 1 1 3 6 1 2 1 10 5 14 10 11 32 14 1 10 10 11 1 1 4 1 2 2 5 3 6 2 2 2 4 1 1 1 1 1 17 32 9 17 1 2 2 3 1 2 1 2 3 5 11 10 14 32 2 1 1 11 10 3 10 1 15 5 11 1 1 9 9 32

2 10 5 14 1 11 11 10 1 2 32 15 10 10 5 4 1 1 32 10

2 2 1 3 5 2 1 4 10 10 1 10 1 14 32 11 10 14 5 11 1 1 1 4 2 2 1 10 32 10 2 6 1 11 5 15 11 11 32 15 1 2 5 3 1 1 10 10 11 2 2 1 1 6 4 2 1 32 10 17 1 2 2 2 3 5 2 3 1 2 10 11 15 32 11 1 15 4 2 1 1 11 10 3 12 2 2 10 10 32

2 10 11 11 4 14 1 11 32 15 10 11 1 2 5 3 2 2 32 9

1 2 2 3 11 10 3 5 2 1 15 32 11 10 1 1 3 2 1 11 1 15 5 11 2 2 1 9 32 10 1 1 1 1 1 1 1 1 1 1 10 5 14 11 11 32 15 1 11 10 11 1 2 5 3 2 1 34 5 1 33 32 32 10 32 33 5 1 1 1 4 1 1 1 1 1 4 1 1 1 1 1 1 4 6 1 1 1 4 1 6 1 4 1 1 1 1 17 17 1 2 2 3 1 2 3 4 10 10 15 32 2 1 2 11 10 3 1 1 10 1 15 4 11 1 2 10 10 32

1 10 5 14 1 11 10 11 1 1 32 15 10 11 5 4 2 1 32 10

1 2 2 4 5 1 1 2 1 3 10 10 1 10 1 14 32 11 10 14 5 11 1 4 2 1 1 10 32 10 2 6 1 11 5 14 1 1 10 1 1 10 11 32 15 1 1 5 3 10 10 2 2 6 4 2 1 32 10 18 1 1 1 2 3 5 2 1 1 3 4 1 2 10 11 15 32 11 1 15 4 11 10 3 12 2 2 10 10 32

1 10 10 10 1 2 5 14 1 10 32 15 10 10 5 3 1 1 32 10

1 1 1 4 10 10 2 10 1 4 5 1 1 14 32 11 10 1 1 3 2 14 4 11 1 2 1 10 32 10 1 1 1 1 1 4 1 4 6 1 2 4 1 1 1 11 5 14 11 11 32 15 1 11 10 11 1 2 5 3 5 2 2 1 1 1 4 1 18 32 10 18 1 2 2 3 1 1 1 1 3 4 11 10 14 32 2 1 2 11 10 3 10 1 15 4 10 2 2 10 10 32

1 10 4 14 1 10 11 11 1 2 32 15 10 10 5 3 1 1 32 9

2 1 2 3 5 2 1 4 10 11 1 11 1 15 32 10 10 14 5 10 1 9 32 10 2 5 2 10 4 10 10 10 32 10 1 1 4 3 1 1 6 4 32 8 16 1 1 2 2 3 4 2 3 1 2 10 10 10 32 11 1 10 4 10 8 32

2 11 11 11 5 14 1 11 32 15 10 11 1 2 5 3 2 1 32 10

1 2 2 3 10 10 3 5 1 1 15 32 10 10 2 10 1 14 5 11 1 10 32 10

1 11 4 10 10 11 32 10 1 1 4 3 1 1 32 8 31 1 1 1 1 1 3 1 2 4 5 1 4 1 1 1 5 1 1 1 3 18 16 1 2 2 3 1 1 3 4 11 11 10 32 1 2 10 1 10 4 10 8 32

2 10 11 11 1 2 4 14 1 11 32 15 10 10 5 3 1 1 32 9

2 2 2 3 10 11 2 11 1 3 4 2 1 1 1 15 32 11 10 15 5 12 1 3 3 2 2 9 32 10 2 6 1 11 5 14 2 1 10 10 11 32 14 1 1 5 3 10 10 1 2 1 1 6 4 2 1 32 10 17 1 2 2 4 1 1 2 1 3 4 2 1 2 10 10 14 32 11 1 15 5 11 10 3 11 2 2 10 10 32

2 11 10 11 5 14 1 11 32 15 10 11 1 2 5 3 2 1 32 10

2 2 2 3 10 11 3 5 1 1 14 32 11 10 1 1 3 2 1 10 1 14 4 10 1 2 1 10 32 10 1 1 4 1 1 1 2 4 3 5 1 1 1 11 5 15 11 11 32 15 1 10 10 11 1 1 1 1 1 2 5 3 6 1 1 1 4 1 17 32 10 18 1 2 2 3 1 2 3 5 10 10 15 32 1 1 1 11 10 3 1 2 11 1 15 5 10 1 2 10 10 32

1 10 5 14 1 11 11 11 1 1 32 15 10 11 5 3 2 2 32 10

2 1 2 3 4 2 1 3 10 11 1 10 1 14 32 10 10 15 5 10 2 10 32 10 2 5 2 10 4 10 11 11 32 10 1 2 4 3 1 1 5 3 32 8 16 1 1 2 2 3 4 2 3 1 1 10 10 10 32 11 1 10 5 10 8 32

1 10 10 11 5 14 1 10 32 15 10 10 1 1 5 3 1 2 32 10

2 2 2 3 10 11 3 5 2 1 14 32 10 10 2 11 1 15 5 10 2 10 32 10 1 1 1 1 1 1 1 1 11 4 10 11 11 32 10 1 2 4 3 5 32 1 32 31 32 8 5 32 1 1 1 3 1 1 1 4 1 1 4 6 1 1 1 1 1 6 1 3 18 16 2 1 2 3 2 2 3 5 10 10 10 32 1 2 10 1 10 4 10 8 32 1 10 10 10 1 2 5 14 2 11 32 16 11 11 5 3 2 2 1 5 32 10 2 1 2 3 10 11 2 10 1 3 5 2 1 1 1 15 32 12 10 14 4 10 2 3 2 2 1 10 32 9 1 1 1 5 2 10 4 14 2 1 10 11 11 32 14 1 2 5 3 10 10 2 2 5 4 2 1 32 9 17 1 2 2 3 1 2 2 2 3 5 2 1 1 11 10 14 32 10 1 15 5 11 10 3 10 1 2 9 9 32

1 10 10 10 1 2 5 14 2 10 32 15 10 10 5 3 1 1 32 10

2 1 2 4 10 10 1 11 1 3 5 2 1 15 32 10 10 1 1 3 2 15 5 10 1 2 1 10 32 10 1 1 1 1 1 4 1 1 3 1 3 6 1 1 1 1 1 1 2 10 5 14 10 11 32 14 1 11 10 11 1 2 5 3 5 2 2 2 4 1 1 1 17 32 10 18 1 2 2 3 1 2 2 2 3 5 10 10 14 32 1 1 2 11 10 3 11 1 14 5 11 2 2 10 10 32

2 10 5 14 1 11 11 10 1 1 32 15 10 11 5 4 2 1 32 10

1 2 1 3 5 2 1 1 1 4 10 10 1 10 1 14 32 11 10 14 5 11 1 3 2 2 1 10 32 10 2 6 1 11 5 14 1 1 10 10 11 32 15 1 1 5 3 1 1 6 4 10 10 2 2 2 1 32 10 17 1 2 1 1 3 5 2 1 1 2 3 1 2 10 11 14 32 11 1 15 5 11 10 3 11 1 2 10 10 32

[Figure A.1 appears here: four panels showing the H-matrix block structure, with the numerical rank of each low-rank block printed inside the blocks, for (a) η = 2 (strong admissibility), (b) η = 64, (c) η = 128, and (d) η = 256 (weak admissibility).]

Figure A.1: H-matrix structure for different values of the parameter η, with fixed accuracy and leaf size nmin = 32. The matrix corresponds to a 2D variable-coefficient Poisson problem with four orders of magnitude of contrast in the coefficient, discretized with N = 128² degrees of freedom. The numbers inside the green low-rank blocks denote the numerical rank required for the specified accuracy.

N      η     ||A A^{-1} - I||_F    Max. Rank    Memory (Bytes)
128²   2     5.0e-3                16           6.76e+7
128²   64    7.2e-3                34           6.64e+7
128²   128   9.1e-3                64           8.64e+7
128²   256   9.1e-3                126          1.01e+8

Table A.1: Memory consumption as a function of the tuning parameter η for the computation of the approximate inverse in the H-matrix format of a 2D variable-coefficient Poisson problem with four orders of magnitude of contrast in the coefficient, discretized with N = 128² degrees of freedom. The setting η = 2 corresponds to strong admissibility, while η = 256 corresponds to weak admissibility; in terms of memory requirements, η = 64 is optimal.
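As a reminder of the role of η, in one common H-matrix convention (the precise form used by ACR is the one defined in the earlier chapters of this dissertation; this restatement is only for orientation) a block formed by clusters τ and σ is admissible, and hence stored in low-rank form, when

    min{ diam(τ), diam(σ) } ≤ η · dist(τ, σ).

Larger values of η therefore admit more, and larger, low-rank blocks, which is why η = 256 behaves as weak admissibility in Table A.1, while η = 2 behaves as strong admissibility.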

Since the memory consumption of ACR is the sum of the memory consumed by each H-matrix involved in the elimination, economical storage of each H-matrix translates directly into savings in the overall ACR memory footprint. As shown in Figure A.2, tuning η across a range of problem sizes has nuanced benefits. For linear systems with a few million degrees of freedom, a coarse block partitioning (close to weak admissibility) minimizes the overall memory consumption; for problems beyond roughly a dozen million unknowns, however, a block partitioning closer to strong admissibility is the better choice for reducing memory requirements.

[Figure A.2 appears here: log-log plot of total memory consumption in MB versus problem size N (10^5 to 10^7) for η = 2, η = n/2, and η = 2n.]

Figure A.2: Effect of the tunable parameter η on the total memory consumption of ACR for a 3D variable-coefficient problem. The memory complexity estimate of O(N log N) is achieved for η = 2, which corresponds to strong admissibility. The setting η = 2n, which corresponds to weak admissibility, uses the most memory asymptotically, while an intermediate value of η = n/2 requires the least memory within the range of problem sizes considered.
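The trend in Figure A.2 can be summarized by a simple selection rule. The sketch below is only an illustration of that observation; the helper name, the crossover of roughly 10^7 unknowns, and the treatment of n (the size of the H-matrices appearing in the elimination) are assumptions read from the plot and are not part of the ACR implementation.

/* Illustrative heuristic only (not part of ACR): choose the admissibility
 * parameter eta following the trend in Figure A.2, where n denotes the size
 * of the H-matrices that arise during the elimination. */
double choose_eta(long long num_unknowns, long long n)
{
  const long long crossover = 10000000LL;   /* ~10^7 degrees of freedom, read from the plot */
  if (num_unknowns < crossover)
    return 0.5 * (double)n;                 /* coarse partitioning, closer to weak admissibility */
  return 2.0;                               /* strong admissibility pays off asymptotically */
}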

Appendix B

Benchmarks configuration

This appendix documents the configuration of the software libraries involved in the various benchmarking experiments throughout this dissertation. The goal is to provide the information necessary to aid reproducibility and replication of the numerical results.

The same hardware configuration and core affinity were used for each solver, on the same computational system. All experiments within the same category used the same linear system and the same error tolerance, measured in the relative norm of the residual of the solution of the linear system. The parameters of each software library were tuned to maximize performance unless stated otherwise. Numerical results represent the average of at least two independent runs.

Within the same category of experiments, all solvers satisfied their external dependencies with the same libraries and versions, such as Intel MKL or Cray LibSci. Furthermore, the same compiler and compiler optimization flags were used to generate the binaries of each solver. The following subsections detail the tuning parameters for each solver and the template source code used as a reference.

B.1 Configuration for the direct solution of 2D problems

AMG

The configuration options for the PETSc installation are:

--with-cc=icc --with-cxx=icpc --with-fc=ifort \
  --with-blas-lapack-dir=$INTEL_PATH/mkl/lib/intel64 \
  --download-hypre --download-mpich --with-debugging=no

The execution flags for the use of BoomerAMG as a 2D direct solver are:

-pc_type hypre -ksp_type richardson -ksp_max_it 500

This configuration is proposed at the official PETSc FAQ website at http://www.mcs.anl.gov/petsc/documentation/faq.html

The template source code for the testing is based on PETSc KSP example ex10.c, found at https://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex10.c.html

H-LU

The external dependencies used for the HLibPro installation include Intel MKL v15 and the provided pre-compiled auxiliary libraries: the FFTW library, the SCOTCH and METIS partitioners, and the Boost C++ Libraries. These auxiliary libraries and further installation details can be found at http://www.hlibpro.com/doc/2.5/install.html

The execution parameters for the use of H-LU as a direct solver include a leaf size nmin = 20 and an admissibility condition parameter η = 2. This configuration is proposed in [53].

The template source code for the testing is based on the HLibPro example sparsealg.c, found within the HLibPro distribution at http://www.hlibpro.com/download.html

ACR

The tuning parameters of ACR include a leaf size nmin=32 and an admissibility condition parameter η = 2.

B.2 Configuration for the direct solution of 3D problems

AMG

The configuration options for the PETSc installation are:

--with-cc=icc --with-cxx=icpc --with-fc=ifort \
  --with-blas-lapack-dir=$INTEL_PATH/mkl/lib/intel64 \
  --download-hypre --download-mpich --with-debugging=no

The execution flags for the use of BoomerAMG as a 3D direct solver are:

-pc_type hypre -ksp_type richardson -ksp_max_it 1 -ksp_norm_type unpreconditioned -pc_hypre_boomeramg_coarsen_type hmis -pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_max_iter 250 -pc_hypre_boomeramg_agg_nl 1 -pc_hypre_boomeramg_p_max 4

Numerical experiments confirm that the HMIS coarsening [116] produces the best performance; this configuration has also been used in [117]. The template source code used for testing is based on PETSc KSP example ex10.c, found at https://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex10.c.html

HSSMF

The configuration options for the STRUMPACK-Sparse installation include Intel MKL v15, METIS, ParMETIS, and the SCOTCH partitioner. Further installation dependencies can be found inside the solver repository at http://portal.nersc.gov/project/sparse/strumpack/

The execution flags for the use of HSSMF are:

- sp_hss: enables the use of HSS matrices.

- sp_rctol: approximation accuracy of the HSS matrices.

- sp_hss_front_size: given the block size, determines whether a frontal matrix is approximated as an HSS matrix or left as a dense matrix.

To optimize for performance, the parameter sp_hss_front_size was set so as to use as few HSS matrices as possible, typically fewer than a dozen HSS matrices for large-scale problems. This choice is motivated by the fact that compression is an expensive operation and that large matrices benefit more from data-sparse representations than small matrices do. Further details on the rationale for compressing only sufficiently large blocks into the HSS format are discussed in [12].

The template source code for the testing is based on the STRUMPACK example testMMdoubleMPIDist.c, found within the STRUMPACK distribution in the folder STRUMPACK-sparse-1.0.3\strumpack-sparse-1.0.3\test.

ACR

The tuning parameters of ACR include a leaf size nmin = 32 and an admissibility condition parameter η = n²/2, where n is the side length, in grid points, of the planes arising from the discretization of the unit cube with N = n³ degrees of freedom; for instance, n = 128 gives η = 8192.

B.3 Configuration for the iterative solution of 3D problems

ACR

The configuration options for the PETSc installation are:

--with-cc=icc --with-cxx=icpc --with-fc=ifort \
  --with-blas-lapack-dir=$INTEL_PATH/mkl/lib/intel64 \
  --download-hypre --download-mpich --with-debugging=no

The template source code for the testing is based on PETSc KSP example ex15.c, found at https://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex15.c.html, which allows for a user-defined shell preconditioner.
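For context, the sketch below illustrates how a user-defined preconditioner can be attached to a Krylov solver through PETSc's PCSHELL interface, in the spirit of ex15.c. The PETSc calls are standard, but the context structure, the function names, and the placeholder apply routine are illustrative assumptions; they are not the companion ACR implementation.

#include <petscksp.h>

/* Placeholder context; a real shell preconditioner would keep the ACR
 * factorization (or whatever data it needs) in here.                    */
typedef struct {
  void *factorization;
} ShellCtx;

/* Apply y = M^{-1} x.  The VecCopy below is only a stand-in so that the
 * sketch is self-contained; ACR would back-substitute with its factors.  */
static PetscErrorCode ShellApply(PC pc, Vec x, Vec y)
{
  ShellCtx      *ctx;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PCShellGetContext(pc, (void **)&ctx); CHKERRQ(ierr);
  (void)ctx; /* unused in this sketch */
  ierr = VecCopy(x, y); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

/* Attach the shell preconditioner to an existing KSP, as ex15.c does. */
PetscErrorCode AttachShellPreconditioner(KSP ksp, ShellCtx *ctx)
{
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPGetPC(ksp, &pc);              CHKERRQ(ierr);
  ierr = PCSetType(pc, PCSHELL);          CHKERRQ(ierr);
  ierr = PCShellSetContext(pc, ctx);      CHKERRQ(ierr);
  ierr = PCShellSetApply(pc, ShellApply); CHKERRQ(ierr);
  ierr = PCShellSetName(pc, "ACR");       CHKERRQ(ierr);
  PetscFunctionReturn(0);
}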

The tuning parameters of ACR include a leaf size nmin = 32 and an admissibility condition parameter η = n²/2, where n is the side length, in grid points, of the planes arising from the discretization of the unit cube with N = n³ degrees of freedom. More on the choice of these tuning parameters for different elliptic PDEs can be found in Sections 7.2.2, 7.3.1, and 7.4.1.

Appendix C

Papers accepted and submitted

G. Chávez, G. Turkiyyah, D. Keyes (2017), A direct elliptic solver based on hierarchically low-rank Schur complements, in Proceedings of the 23rd International Conference on Domain Decomposition Methods for Partial Differential Equations (C.-O. Lee et al., eds.), Springer, Lecture Notes in Computational Science and Engineering Vol. 116, pp. 135–143.

G. Chávez, G. Turkiyyah, S. Zampini, H. Ltaief, D. Keyes, Accelerated cyclic reduction: A distributed memory direct solver for structured linear systems. Provisionally accepted by the Elsevier Journal of Parallel Computing (2016).

G. Chávez, G. Turkiyyah, S. Zampini, D. Keyes, Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients. Submitted to the Elsevier Journal of Computational and Applied Mathematics (2017).

Appendix D

Sparse linear solvers that leverage data-sparsity

Package              Technique          Format   Contact
ACR                  Cyclic reduction   H        G. Chávez, G. Turkiyyah, D. Keyes [118]
AHMED                H⁻¹ & H-LU         H        M. Bebendorf [120]
BLR PaStiX           Supernodal         BLR      G. Pichon, M. Faverge [44]
CE                   H²-LU              H²       D. Sushnikova, I. Oseledets [57]
DMHIF                Multifrontal       ID       Y. Li, L. Ying [43]
DMHM                 Newton-Schulz      H        Y. Li, L. Ying [121]
ExaFMM               Fast multipole     H        H. Ibeid, R. Yokota, D. Keyes [122]
H2Lib                H⁻¹ & H²-LU        H²       S. Christophersen, S. Börm [123]
HLib                 H⁻¹ & H-LU         H        L. Grasedyck, W. Hackbusch [124]
HLibPro              H⁻¹ & H-LU         H        R. Kriemann, W. Hackbusch [54]
LoRaSp               H²-LU              H²       H. Pouransari, E. Darve [56]
MF-HODLR             Multifrontal       HODLR    A. Aminfar, E. Darve [30]
MUMPS-BLR            Multifrontal       BLR      T. Mary, P. R. Amestoy [22]
Structured CHOLMOD   Supernodal         BLR      J. Chadwick, D. Bindel [46]
STRUMPACK-Sparse     Multifrontal       HSS      F.-H. Rouet, P. Ghysels, X.S. Li [36]

Table D.1: Alphabetical list of software libraries that leverage data-sparsity for the solution of sparse linear systems of equations.