Robust and Scalable Hierarchical Matrix-Based Fast Direct Solver and Preconditioner for the Numerical Solution of Elliptic Partial Differential Equations
Robust and scalable hierarchical matrix-based fast direct solver and preconditioner for the numerical solution of elliptic partial differential equations

Dissertation by
Gustavo Iván Chávez Chávez

In Partial Fulfillment of the Requirements
For the Degree of
Doctor of Philosophy

King Abdullah University of Science and Technology
Thuwal, Kingdom of Saudi Arabia
June, 2017

EXAMINATION COMMITTEE PAGE

The dissertation of Gustavo Iván Chávez Chávez is approved by the examination committee.

Committee Chairperson: Professor David Keyes
Committee Members: Professor Mikhail Moshkov, Professor David Ketcheson, Professor George Turkiyyah, Professor Jingfang Huang

© June, 2017
Gustavo Iván Chávez Chávez
All Rights Reserved

ABSTRACT

Robust and scalable hierarchical matrix-based fast direct solver and preconditioner for the numerical solution of elliptic partial differential equations
Gustavo Iván Chávez Chávez

This dissertation introduces a novel fast direct solver and preconditioner for the solution of block tridiagonal linear systems that arise from the discretization of elliptic partial differential equations on a Cartesian product mesh, such as the variable-coefficient Poisson equation, the convection-diffusion equation, and the wave Helmholtz equation in heterogeneous media. The algorithm extends the traditional cyclic reduction method with hierarchical matrix techniques. The resulting method exposes substantial concurrency, and its arithmetic operations and memory consumption grow only log-linearly with problem size, assuming bounded rank of off-diagonal matrix blocks, even for problems with arbitrary coefficient structure. The method can be used as a standalone direct solver with tunable accuracy, or as a black-box preconditioner in conjunction with Krylov methods.
The challenges that distinguish this work from other thrusts in this active field are the hybrid distributed-shared parallelism that demonstrates the algorithm at large scale, full three-dimensionality, and the three stressors of current state-of-the-art multigrid technology: high-wavenumber Helmholtz (indefiniteness), high-Reynolds convection (nonsymmetry), and high-contrast diffusion (inhomogeneity). Numerical experiments corroborate the robustness, accuracy, and complexity claims, and provide a baseline of the performance and memory footprint by comparison with competing approaches such as the multigrid solver hypre and the STRUMPACK implementation of the multifrontal factorization with hierarchically semi-separable matrices. The companion implementation can utilize many thousands of cores of Shaheen, KAUST's Haswell-based Cray XC-40 supercomputer, and compares favorably with other implementations of hierarchical solvers in terms of time-to-solution and memory consumption.

ACKNOWLEDGMENTS

First and foremost I want to thank my advisor, Prof. David Keyes, for all his support and encouragement: you are a gentleman and a scholar, and you have been an inspiration to me in the fullest sense of the word.

I also want to thank Prof. George Turkiyyah for his time and follow-up. I particularly thank him for introducing me to the field of hierarchical matrices.

I would like to express my sincere gratitude to my dissertation committee: Prof. Mikhail Moshkov, Prof. David Ketcheson, and Prof. Jingfang Huang, for their thorough comments and detailed suggestions on the draft of this dissertation, and especially for making the time to be physically present at my dissertation talk. In particular, to Prof. Huang for literally traveling halfway across the world, to Prof. Moshkov for attending right after a medical procedure, and to Prof. Ketcheson, who kindly rescheduled his summer agenda.
This work is the result of a team effort, for which I gratefully acknowledge the input of Prof. R. Yokota, Dr. H. Ltaief, Dr. A. Litvinenko, Dr. L. Dalcin, Dr. S. Zampini, Dr. G. Markomanolis, Dr. S. Kortas, Dr. B. Hadri, and Dr. S. Feki.

I will fondly remember my colleagues at the Extreme Computing Research Center: M. Farhan, W. Boukaram, M. Abdul-Jabbar, A. Alonazi, A. Chara, D. Sukkari, Dr. L. Liu, Dr. H. Ibeid, and Dr. T. Malas. I thank you for the innumerable conversations and for making my stay so enjoyable. I also thank the staff of the ECRC for their impeccable work: E. Gonzalez, V. Hansford, O. Camilleri, N. Ezzeddine, and G. Martinez.

Although I never met him, I'd like to thank the founder of KAUST, King Abdullah of Saudi Arabia. His generosity and vision are truly breathtaking.

Finally, I thank my friends and family. This work is dedicated to you: especially to the love of my life, Karen Ramírez, and to my loving parents, Alejandro Chávez and Laura Chávez.

TABLE OF CONTENTS

Examination Committee Page
Copyright
Abstract
Acknowledgments
List of Abbreviations
List of Figures
List of Tables

1 Introduction
1.1 Overview
1.2 Motivation
1.3 Fundamental problem
1.4 Contributions
1.5 Outline

2 Hierarchical matrices
2.1 Intuition for hierarchically low-rank approximations
2.2 Overview of the H-matrix format
2.2.1 Index set
2.2.2 Cluster tree
2.2.3 Block cluster tree
2.2.4 Admissibility condition
2.2.5 Compression of low-rank blocks
2.3 Benefits of H-matrix approximations
2.4 Other types of data-sparse formats
2.4.1 H²
2.4.2 Hierarchically semi-separable (HSS)
2.4.3 Hierarchically off-diagonal low-rank (HODLR)
2.4.4 Block low-rank (BLR)
2.5 Summary of data-sparse formats

3 Related work
3.1 Approximate triangular factorizations
3.1.1 Data-sparse multifrontal
3.1.2 Data-sparse supernodal
3.1.3 Compression of the entire triangular factors
3.2 Approximate inverse
3.2.1 Iterative procedure
3.2.2 Recursive formulation
3.2.3 Matrix-free

4 Accelerating cyclic reduction
4.1 Introduction to cyclic reduction
4.2 Cyclic reduction on a model problem
4.2.1 Elimination
4.2.2 Solve
4.3 Accelerated cyclic reduction (ACR)
4.3.1 Modularity
4.3.2 Block-wise H-matrix approximation
4.3.3 Slicing decomposition
4.3.4 Reduced-dimension H-matrices
4.3.5 Tuning parameters
4.3.6 Fixed-rank versus adaptive-rank arithmetic
4.3.7 General algorithm
4.3.8 Sequential complexity estimates

5 Distributed memory accelerated cyclic reduction
5.1 Overview of the concurrency features of ACR
5.2 Hybrid parallel programming model
5.2.1 Inter-node parallelism
5.2.2 Intra-node parallelism
5.3 Parallel elimination and solve
5.4 Parallel complexity estimates
5.5 Parallel scalability
5.5.1 Weak scaling
5.5.2 Strong scaling
5.5.3 Effectiveness of the choice of H-matrix admissibility
5.5.4 Memory footprint

6 ACR as a fast direct solver
6.1 Numerical results in 2D and benchmark with other solvers
6.1.1 Environment settings
6.1.2 Constant-coefficient Poisson equation
6.1.3 Variable-coefficient Poisson equation
6.1.3.1 Smooth coefficient
6.1.3.2 High contrast discontinuous coefficient
6.1.3.3 Anisotropic coefficient
6.1.4 Helmholtz equation
6.1.4.1 Positive definite formulation
6.1.4.2 Indefinite formulation
6.1.5 Convection-diffusion equation
6.1.5.1 Proportional convection and diffusion
6.1.5.2 Convection dominance
6.2 Numerical results in 3D and benchmarking with other solvers
6.2.1 Environment settings
6.2.2 Poisson equation
6.2.3 Convection-diffusion equation
6.2.4 Wave Helmholtz equation

7 ACR as a preconditioner for sparse iterative solvers
7.1 Environment settings
7.2 Variable-coefficient Poisson equation
7.2.1 Generation of random permeability fields
7.2.2 Tuning parameters
7.2.3 Sensitivity with respect to high contrast coefficient
7.2.4 Operation count and memory footprint
7.3 Convection-diffusion equation with recirculating flow
7.3.1 Tuning parameters
7.3.2 Sensitivity with respect to vortex wavenumber
7.3.3 Sensitivity with respect to Reynolds number
7.3.4 Operation count and memory footprint
7.4 Indefinite Helmholtz equation in heterogeneous media
7.4.1 Tuning parameters
7.4.2 Low to high frequency Helmholtz regimes
7.4.3 Operation count and memory footprint

8 Summary and future work
8.1 Concluding remarks
8.2 Future work

References

Appendix A Memory consumption optimization
Appendix B Benchmarks configuration
B.1 Configuration for the direct solution of 2D problems
B.2 Configuration for the direct solution of 3D problems
B.3 Configuration for the iterative solution of 3D problems
Appendix C Papers accepted and submitted
Appendix D Sparse linear solvers that leverage data-sparsity

LIST OF ABBREVIATIONS

ABC      Absorbing boundary condition
ACA      Adaptive cross approximation
ACR      Accelerated cyclic reduction
AHMED    Another software library on hierarchical matrices for elliptic differential equations
AMG      Algebraic multigrid
BDLR     Boundary distance low-rank approximation method
BEM      Boundary element method
BLAS     Basic linear algebra subprograms
BLR      Block low-rank data-sparse format
CE       Compress and eliminate solver
CG       Method of conjugate gradients
CHOLMOD  Sparse Cholesky modification package
CR       Cyclic reduction
DMHIF    Distributed memory hierarchical interpolative factorization
ExaFMM   Library for fast multipole algorithms, in parallel, and with GPU capability
FD       Finite difference method
FEM