A SPLITTING APPROACH FOR THE PARALLEL SOLUTION OF LINEAR SYSTEMS ON GPU CARDS

by

Ang Li

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

(Electrical and Computer Engineering)

at the

UNIVERSITY OF WISCONSIN–MADISON

2016

Date of final oral examination: 05/09/16

The dissertation is approved by the following members of the Final Oral Committee:

Dan Negrut, Associate Professor, Mechanical Engineering
Mikko Lipasti, Professor, Electrical and Computer Engineering
Yu-Hen Hu, Professor, Electrical and Computer Engineering
Parameswaran Ramanathan, Professor, Electrical and Computer Engineering
Radu Serban, Associate Scientist, Mechanical Engineering

© Copyright by Ang Li 2016
All Rights Reserved

I would like to dedicate this dissertation to Prof. Dan Negrut, my advisor, without whom none of my accomplishments during my PhD would have been possible. While I realize it is unusual to include here, I want to share with the world how Dan has also helped change my attitude towards others and towards life.

Acknowledgments

I would like to give my thanks to my advisor, Professor Dan Negrut, for his guidance and support throughout my time working with him. I would also like to thank the committee members for their time and suggestions. I enjoyed the time I spent with all my colleagues and friends in the Simulation-Based Engineering Lab. I benefited a lot from their support, especially Dr. Radu Serban's guidance. Finally, to my beloved family, I miss YOU, and – I am YOU.

Contents

List of Tables
List of Figures
Abstract

1 Introduction
  1.1 The Science and Engineering algorithmic backdrop
  1.2 Solving linear systems
  1.3 Contributions

2 Motivation: the rise of parallel computing and emergence of GPU computing
  2.1 Three efficiency walls in sequential computing
  2.2 GPU computing and lack of GPU linear system solvers
  2.3 Brief overview of relevant hardware
    2.3.1 The Tesla K40 Graphics Processing Unit
    2.3.2 Intel Xeon processor
  2.4 Brief overview of Compute Unified Device Architecture

3 The “Split and Parallelize” Method
  3.1 SaP for dense banded linear systems
    3.1.1 SPIKE algorithm
    3.1.2 A generalization of the Cyclic Reduction: the Block Cyclic Reduction algorithm
    3.1.3 Nomenclature, solution strategies
    3.1.4 A comparison of SPIKE algorithm and Block Cyclic Reduction algorithm
  3.2 SaP for general sparse linear systems
  3.3 SaP components and computational flow

4 Reordering Methodologies
  4.1 Diagonal Boosting
  4.2 Bandwidth Reduction
  4.3 Third-stage reordering

5 Krylov Subspace Method
  5.1 A brief introduction to preconditioning
  5.2 Conjugate Gradient method
    5.2.1 Backgrounds
    5.2.2 Further discussion
  5.3 Biconjugate Gradient Stabilized method and its generalization
  5.4 A profiling of Krylov subspace method

6 Implementation Details
  6.1 Dense banded matrix factorization details
    6.1.1 Number of partitions and partition size
    6.1.2 Matrix storage
    6.1.3 LU/UL/Cholesky factorization
    6.1.4 Optimization techniques
  6.2 Implementation details for reordering methodologies
    6.2.1 DB reordering implementation details
    6.2.2 CM reordering implementation details
  6.3 CPU-GPU data transfer in SaP::GPU
  6.4 Multi-GPU support

7 Numerical Experiments
  7.1 Numerical experiments related to dense banded linear systems
    7.1.1 Sensitivity with respect to P
    7.1.2 Sensitivity with respect to d
    7.1.3 Comparison with LAPACK and Intel’s MKL over a spectrum of N and K
    7.1.4 Profiling results for dense banded linear system solver
  7.2 Numerical experiments related to reorderings
    7.2.1 Assessment of the diagonal boosting reordering solution
      7.2.1.1 Efficiency evaluation of diagonal boosting algorithm
      7.2.1.2 Performance comparison against Harwell Sparse Library’s solution
    7.2.2 Assessment of the bandwidth reduction reordering solution
  7.3 Numerical experiments related to sparse linear systems
    7.3.1 Profiling results
    7.3.2 The impact of the third stage reordering
    7.3.3 Comparison against state-of-the-art direct linear solvers on the CPU
    7.3.4 Comparison against a state-of-the-art GPU linear solver
  7.4 Numerical experiments highlighting multi-GPU support

8 Use of SaP in computational multibody dynamics problems

9 Conclusions and future work

A Namespace Index
  A.1 Namespace List

B Class Index
  B.1 Class Hierarchy

C Class Index
  C.1 Class List

D File Index
  D.1 File List

E Namespace Documentation
  E.1 sap Namespace Reference
    E.1.1 Detailed Description
    E.1.2 Enumeration Type Documentation: KrylovSolverType, PreconditionerType
    E.1.3 Function Documentation: bicgstab, bicgstabl, minres

F Class Documentation
  F.1 sap::Graph< T >::AbsoluteValue< VType > Struct Template Reference
  F.2 sap::Graph< T >::AccumulateEdgeWeights Struct Reference
  F.3 sap::BandedMatrix< Array > Class Template Reference
    F.3.1 Detailed Description
    F.3.2 Constructor & Destructor Documentation: BandedMatrix
  F.4 sap::BiCGStabLMonitor< SolverVector > Class Template Reference
    F.4.1 Detailed Description
  F.5 sap::Graph< T >::ClearValue Struct Reference
  F.6 sap::Graph< T >::CompareValue< VType > Struct Template Reference
  F.7 sap::CPUTimer Class Reference
    F.7.1 Detailed Description
  F.8 sap::Graph< T >::Difference Struct Reference
  F.9 sap::Graph< T >::EdgeLength Struct Reference
  F.10 sap::Graph< T >::EqualTo< Type > Struct Template Reference
  F.11 sap::Graph< T >::Exponential Struct Reference
  F.12 sap::Graph< T >::GetCount Struct Reference
  F.13 sap::GPUTimer Class Reference
    F.13.1 Detailed Description
  F.14 sap::Graph< T > Class Template Reference
    F.14.1 Detailed Description
  F.15 sap::Graph< T >::is_not Struct Reference
  F.16 sap::Graph< T >::Map< Type > Struct Template Reference
  F.17 sap::Monitor< SolverVector > Class Template Reference
    F.17.1 Detailed Description
  F.18 sap::Multiply< T > Struct Template Reference
  F.19 sap::MVBanded< Matrix > Class Template Reference
    F.19.1 Detailed Description
  F.20 sap::Options Struct Reference
    F.20.1 Detailed Description
    F.20.2 Constructor & Destructor Documentation: Options
    F.20.3 Member Data Documentation: absTol, applyScaling, dbFirstStageOnly, dropOffFraction, factMethod, gpuCount, ilu_level, isSPD, maxBandwidth, maxNumIterations, performDB, performReorder, precondType, relTol, safeFactorization, saveMem, solverType, testDB, trackReordering, variableBandwidth
  F.21 sap::Graph< T >::PermutedEdgeLength Struct Reference
  F.22 sap::Graph< T >::PermuteEdge Struct Reference
  F.23 sap::Precond< PrecVector > Class Template Reference
    F.23.1 Detailed Description
    F.23.2 Constructor & Destructor Documentation: Precond (×3)
    F.23.3 Member Function Documentation: operator(), setup, solve, update
  F.24 sap::SegmentedMatrix< Array, MemorySpace > Class Template Reference
  F.25 sap::SmallerThan< T > Struct Template Reference
  F.26 sap::Solver< Array, PrecValueType > Class Template Reference
    F.26.1 Detailed Description
    F.26.2 Constructor & Destructor Documentation: Solver
    F.26.3 Member Function Documentation: setup, solve, update
  F.27 sap::SpmvCusp< Matrix > Class Template Reference
    F.27.1 Detailed Description
  F.28 sap::SpmvCuSparse< Matrix > Class Template Reference
    F.28.1 Detailed Description
  F.29 sap::Graph< T >::Square Struct Reference
  F.30 sap::Stats Struct Reference
    F.30.1 Detailed Description
    F.30.2 Constructor & Destructor Documentation: Stats
    F.30.3 Member Data Documentation: actualDropOff, bandwidth, bandwidthDB, bandwidthReorder, d_p1, d_p1_ori, diag_dom, diag_dom_ori, flops_LU, nuKf, numIterations, numPartitions, relResidualNorm, residualNorm, rhsNorm, time_assembly, time_bandLU, time_bandUL, time_cpu_assemble, time_DB, time_DB_first, time_DB_post, time_DB_pre, time_DB_second, time_dropOff, time_fullLU, time_offDiags, time_reorder, time_shuffle, time_toBanded, time_transfer, timeSetup, timeSolve, timeUpdate
  F.31 sap::strided_range< Iterator >::stride_functor Struct Reference
  F.32 sap::strided_range< Iterator > Class Template Reference
  F.33 sap::system_error Class Reference
  F.34 sap::Timer Class Reference
    F.34.1 Detailed Description

G File Documentation
  G.1 sap/banded_matrix.h File Reference
    G.1.1 Detailed Description
  G.2 sap/bicgstab.h File Reference
    G.2.1 Detailed Description
  G.3 sap/bicgstab2.h File Reference
    G.3.1 Detailed Description
  G.4 sap/common.h File Reference
    G.4.1 Detailed Description
    G.4.2 Macro Definition Documentation: BURST_VALUE
  G.5 sap/exception.h File Reference
    G.5.1 Detailed Description
  G.6 sap/graph.h File Reference
    G.6.1 Detailed Description
  G.7 sap/minres.h File Reference
    G.7.1 Detailed Description
  G.8 sap/monitor.h File Reference
    G.8.1 Detailed Description
  G.9 sap/precond.h File Reference
    G.9.1 Detailed Description
  G.10 sap/segmented_matrix.h File Reference
    G.10.1 Detailed Description
  G.11 sap/solver.h File Reference
    G.11.1 Detailed Description
  G.12 sap/spmv.h File Reference
    G.12.1 Detailed Description
  G.13 sap/strided_range.h File Reference
    G.13.1 Detailed Description
  G.14 sap/timer.h File Reference
    G.14.1 Detailed Description

References


List of Tables

2.1 Comparison of multiprocessor and the number of cores between two Tesla cards (commercial models). One is Tesla M2090 with Fermi GF100, and the other is Tesla K40 with Kepler GK110. In this table, “SP” and “DP” are for single precision and double precision, respectively.

7.1 Performance comparison over a spectrum of number of partitions P for coupled (C) vs. decoupled (D) strategies in SaP::GPU. All timings are in milliseconds. Problem parameters: N = 200,000, d = 0.6, K = 50 and K = 200. The symbols used are as follows: D_pre – amount of time spent in preprocessing by the decoupled strategy; D_it – number of Krylov iterations for convergence; D_Tot – amount of time to converge. Similar values are reported for the coupled approach.

7.2 Influence of d for coupled (C) vs. decoupled (D) vs. BCR (B) strategies in SaP::GPU (N = 200,000, P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D, K = 50 for Table 7.2a and K = 200 for Table 7.2b). All timings are in milliseconds. Symbols used are as specified for Table 7.1. For tests whose number of iterations is reported as 101 and whose Krylov solve and total times are reported as −1, the test failed to converge within 100 Krylov iterations.

7.3 Influence of d for the decoupled (D) strategy in SaP::GPU (N = 200,000 and P = 80). All timings are in milliseconds. Symbols used are as specified for Table 7.1. For tests whose number of iterations is reported as 501 and whose Krylov solve and total times are reported as −1, the test failed to converge within 500 Krylov iterations.

7.4 Performance comparison, two-dimensional sweep over N and K for d = 0.6. For each value N, the five rows correspond to the SaP::GPU-D, SaP::GPU-C, SaP::GPU-B, MKL and LAPACK solvers, respectively.

7.5 Medians of R_{Γ-Tot} and R_{Γ-PreP} for all stages in SaP::GPU-D. Table 7.5a shows the single-GPU results, and Table 7.5b shows the dual-GPU results.

7.6 Medians of R_{Γ-Tot} and R_{Γ-PreP} for all stages in SaP::GPU-C. It is worth noting that the summation of each row is not expected to be 100, because these values are statistical medians.

7.7 Medians of R_{Γ-Tot}, R_{Γ-PreP} and R_{Γ-Kry} for all stages in SaP::GPU-B. It is worth noting that the summation of each row is not expected to be 100, in that these values are statistical medians.

7.8 Median information for the SaP solution components as % of the time for solution. Two scenarios are considered: the first data row provides values when the total time, i.e., 100%, included the time spent by SaP in the Krylov-subspace component. The second row of data is obtained by considering 100% to be the time required to compute a factorization of the preconditioner.

7.9 Median information for the Krylov subspace components for each type. The first row provides the median number of iterations to solution, and the second row provides the % of the time for solution for each type of linear system.

7.10 Examples of matrices where the third stage reordering (3rd SR) reduced more significantly the block bandwidth K_i for A_i, i = 1, ..., P.

7.11 Speed-up “SpdUp” values obtained upon embedding a third stage reordering step in the solution process, a decision that also changed the number of partitions P for best performance. When correlating the results reported to values provided in Table 7.10, this table lists for each matrix A the largest of its K_i values, i = 1, ..., P.

List of Figures

2.1 Tesla K40 GPU Active Block Diagram

3.1 Factorization of the matrix A with P = 3.

3.2 Computational flow for SaP::GPU. SaP::GPU-Dd is the decoupled dense solver; SaP::GPU-Cd is the coupled dense solver; SaP::GPU-Bd is the dense solver applying the BCR algorithm. The analogue sparse solvers have an s subscript. The 3rd stage reordering in square brackets is optional but typically leads to substantial reductions in time to solution.

4.1 Sparsity pattern for the original (left) and reordered (right) ANCF matrix of dimension N = 88,950. The sparsity pattern for a 2000 × 2000 diagonal block of the reordered matrix is displayed in the inset.

5.1 Profiling results of R_{Γ-Kry} in SaP::GPU-C and SaP::GPU-D obtained from a set of 60 banded linear systems.

5.2 Profiling results of R_{Γ-Kry} in SaP::GPU-C and SaP::GPU-D obtained from a subset of 60 banded linear systems which are considered “large” in terms of dimension N.

5.3 Profiling results of R_{Γ-Kry} in SaP::GPU-C and SaP::GPU-D obtained from a subset of 60 banded linear systems which are considered “large” in terms of half-bandwidth K.

6.1 Speedup of SaP::GPU-C with mixed precision strategy over SaP::GPU-C with pure double strategy on a battery of banded tests. The set of test matrices and solver configuration are the same as the ones described in §7.1.3, with the exception that the relative tolerance was set to 10^−6, instead of 10^−9.

6.2 Data transfer in SaP::GPU.

6.3 Allocating a seven-partitioned matrix to three GPUs.

6.4 Data transfer in SaP::GPU with dual GPUs.

7.1 Time to solution as a function of the number of partitions P. Study carried out for a dense banded linear system with N = 200,000 and d = 0.6, for two different values of the half-bandwidth K.

7.2 Influence of the diagonal dominance d, with 0.01 ≤ d ≤ 1.2, for fixed values N = 200,000, P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D. The plots are for two different values of the bandwidth K.

7.3 Influence of the diagonal dominance d, with 0.1 ≤ d ≤ 0.2, for fixed values N = 200,000 and P = 80 for SaP::GPU-D.

7.4 SaP speedup over Intel’s MKL and LAPACK – statistical analysis based on values in Table 7.4.

7.5 Profiling results of R_{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems. Fig. 7.5a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.5b illustrates SaP::GPU-D run on dual GPUs.

7.6 Profiling results of R_{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems which are considered “large” in terms of dimension N. Fig. 7.6a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.6b illustrates SaP::GPU-D run on dual GPUs.

7.7 Profiling results of R_{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems which are considered “large” in terms of half-bandwidth K. Fig. 7.7a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.7b illustrates SaP::GPU-D run on dual GPUs.

7.8 Profiling results of R_{Γ-Tot} in SaP::GPU-C obtained from a set of 58 banded linear systems which SaP::GPU-C successfully solved.

7.9 Profiling results of R_{Γ-PreP} in SaP::GPU-C obtained from a set of 58 banded linear systems which SaP::GPU-C successfully solved.

7.10 Profiling results of R_{Γ-Tot} in SaP::GPU-C obtained from a subset of 58 banded linear systems which are considered “large”. Fig. 7.10a illustrates all cases with large N, and Fig. 7.10b illustrates all cases with large K.

7.11 Profiling results of R_{Γ-Tot} in SaP::GPU-B obtained from a set of 57 banded linear systems which SaP::GPU-B successfully solved.

7.12 Profiling results of R_{Γ-PreP} and R_{Γ-Kry} in SaP::GPU-B obtained from a set of 57 banded linear systems which SaP::GPU-B successfully solved.

7.13 Profiling results of R_{Γ-Tot} in SaP::GPU-B obtained from a subset of 57 banded linear systems which are considered “large”. Fig. 7.13a illustrates all cases with large N, and Fig. 7.13b illustrates all cases with large K.

7.14 Results of a statistical analysis that uses a median-quartile method to measure the spread of d (Fig. 7.14a) and d_1-pct (Fig. 7.14b) before and after the application of diagonal boosting reordering. d_1-pct is defined in Eq. 7.2.

7.15 Results of a statistical analysis that uses a median-quartile method to measure the spread of the MC64 and DB times to solution. The speedup factor, or performance metric, is computed as in Eq. (7.3).

7.16 Comparison of the Harwell MC60 and SaP’s CM implementations in terms of resulting half-bandwidth K and time to solution.

7.17 Comparison of the Harwell MC60 and SaP’s CM implementations in terms of resulting half-bandwidth K and time to solution. Statistical analysis of large matrices only.

7.18 Profiling results obtained for a set of 90 linear systems that, out of a collection of 114, could be solved by SaP::GPU.

7.19 Profiling results obtained for a set of 90 linear systems that, out of a collection of 114, could be solved by SaP::GPU.

7.20 Profiling results obtained for a set of 30 linear systems of type I that, out of a collection of 114, could be solved by SaP::GPU.

7.21 Profiling results obtained for a set of 25 linear systems of type II that, out of a collection of 114, could be solved by SaP::GPU.

7.22 Profiling results obtained for a set of 19 linear systems of type III that, out of a collection of 114, could be solved by SaP::GPU.

7.23 Profiling results obtained for a set of 6 linear systems of type IV that, out of a collection of 114, could be solved by SaP::GPU.

7.24 Profiling results obtained for a set of 10 linear systems of type V that, out of a collection of 114, could be solved by SaP::GPU.

7.25 Statistical information regarding the dimension N and number of nonzeros nnz for the 114 coefficient matrices used to compare SaP::GPU, PARDISO, SuperLU, and MUMPS.

7.26 Statistical spread for SaP::GPU’s performance relative to that of PARDISO, SuperLU, and MUMPS. Referring to Eq. 7.4, the results were obtained using the data sets S_{SaP::GPU−PARDISO} (with 60 values), S_{SaP::GPU−SuperLU} (75 values), and S_{SaP::GPU−MUMPS} (64 values).

7.27 Statistical spread for SaP::GPU’s performance relative to that of cuSOLVER. Referring to Eq. 7.4, the results were obtained using the data set S_{SaP::GPU−cuSOLVER} (with 41 values).

7.28 Speedup of dual-GPU solution over single-GPU solution on banded systems with a spectrum of N and K.

7.29 Speedup of dual-GPU solution over single-GPU solution on the 19 systems preconditioned with SaP::GPU-D.

8.1 Dual-GPU method of staggered solve/preconditioner update

8.2 Timing comparison for various Krylov solvers, with or without preconditioning. The sharp increases in the curves for the two preconditioned solvers correspond to preconditioner updates which, for this study, were enforced every 0.5 s. [40 × 40 net; E = 2·10^7 Pa; h = 10^−3 s]

8.3 Comparison of different preconditioner update strategies for the preconditioned BiCGStab(2) solver. The solid lines in the foreground are filtered values of the raw measured timing data displayed in the background. [40 × 40 net; E = 2·10^8 Pa; h = 10^−3 s]

Abstract

Dense and sparse linear algebra are identified by Colella (2004) as two of the seven most important classes of numerical methods (the “seven dwarfs”). Although more than a decade has passed since the original seven dwarfs’ specification, fast linear algebra continues to be as relevant as ever. Solving linear systems is ubiquitous in a variety of science and engineering applications, from computational mechanics (fluids and structure) to computational nanoelectronics (quantum mechanical simulations) to financial mathematics (random walks and geometric Brownian motion). Many real-life applications give rise to very large sparse linear systems that can be reordered to produce either narrow banded systems or low-rank perturbations of narrow banded systems, with the systems being either dense or sparse within the band. GPU-accelerated computing is the use of a Graphics Processing Unit (GPU) in tandem with a CPU to accelerate scientific, engineering, consumer, and enterprise applications. In many science and engineering fields, GPU-accelerated computing has become ubiquitous and is playing an increasingly important role in Scientific Computing, owing to its capacity to deliver speed-ups that in some cases can be as high as one order of magnitude compared to execution on traditional multi-core CPUs. Despite the importance and ubiquity of solving linear systems and the fact that GPU-accelerated computing is presently widely used, there exist few solutions that leverage GPU or hybrid GPU/CPU computing for solving linear systems of equations. In this dissertation, we discuss an approach for solving dense or sparse banded linear systems Ax = b on a GPU card. The matrix A ∈ R^{N×N} is possibly asymmetric and moderately large; i.e., 10,000 ≤ N ≤ 500,000. The split and parallelize (SaP) approach seeks to partition the matrix A into diagonal sub-blocks A_i, i = 1, ..., P, which are independently factored in parallel. The solution may choose to consider or to ignore the matrices that couple the diagonal sub-blocks A_i. This approach, along with the Krylov subspace-based iterative method that it preconditions, is implemented in a solver called SaP::GPU, which is compared in terms of efficiency with three commonly used multi-core CPU sparse direct solvers (PARDISO, SuperLU, and MUMPS) and one GPU sparse solver (NVIDIA’s cuSOLVER). SaP::GPU, which runs entirely on the GPU except for several stages involved in preliminary row-column permutations, is more robust than the other four solvers and compares well in terms of efficiency with the aforementioned direct solvers. In a comparison against LAPACK and Intel’s MKL, SaP::GPU also fares well when used to solve dense banded systems that are close to being diagonally dominant. SaP::GPU is currently used in the Simulation-Based Engineering Lab to compute the time evolution (dynamics) of many-body systems, e.g., granular material flow. SaP::GPU has also been used in determining the dynamics of mechanical systems that contain compliant (flexible) bodies, an exercise that calls for implicit integration methods. The latter methods draw on a Newton step, which solves nonlinear systems by repeatedly solving linear systems Ax = b.

SaP::GPU, which is open source and freely available, is released under a permissive BSD3 license.

1 Introduction

1.1 The Science and Engineering algorithmic backdrop

Colella (2004) identified seven classes of numerical methods, nicknamed the “seven dwarfs”, believed to be important in science and engineering for at least the next decade. This list was reviewed and extended by Asanovic et al. (2006). These seven classes of methods are: dense linear algebra, sparse linear algebra, spectral methods, N-body methods, structured grids, unstructured grids, and Monte Carlo methods. Dense linear algebra refers to those applications that use unit-stride memory accesses to read data from rows and strided accesses to read data from columns. Sparse linear algebra refers to those applications in which data sets include many zero values; data is thus usually stored in compressed matrices to reduce the memory storage and bandwidth requirements to access all of the nonzero values. Although more than a decade has passed since the seven dwarfs’ specification was put out, and more classes of methods, e.g., machine learning, have been added to the dwarf list, the importance of linear algebra has not diminished at all. A testimony to this is the increasing number of efficient linear algebra libraries that are still active, for instance BLAS (Blackford et al., 2002), LAPACK (Anderson et al., 1990) and MKL (Wang et al., 2014) for dense linear algebra, and OSKI (Vuduc et al., 2005), SuperLU (Li, 2005), MUMPS (Amestoy et al., 2000) and PARDISO (Schenk et al., 2001) for sparse linear algebra. Indeed, with the appearance of modern performance-critical applications in many fields, an increasing demand is placed on efficient libraries for linear algebra. Moreover, the appearance of new hardware, for instance vector computers with gather and scatter operations and GPU cards, further drives the need for new libraries capable of leveraging the latest chip designs. In addition, dense and sparse linear algebra are often observed to play an important role in other numerical methods. For example, unstructured grids, which are one of the methods in Colella’s specification, could be interpreted as a sparse matrix problem; in the field of neural networks and deep learning, matrix multiplication is a crucial type of operation (Nielsen, 2016, chap. 2).

1.2 Solving linear systems

Linear systems arise in a wide variety of science and engineering applications from computational mechanics (fluids and structure) to computational nano-electronics (quantum mechanical simulations) to financial mathematics (random walks and geometric Brownian motion). In the Simulation-Based Engineering Lab (SBEL) (SBEL, 2006), various multi-body dynamics problems call for the solution of linear systems: the Lagrangian-Lagrangian framework for simulating fluid-solid interaction problems, Selective Laser Sintering (SLS) simulation, Primal-Dual Interior Point (PD-IP) approaches for solving the Cone Complementarity Problem (CCP) associated with friction and contact in many-body dynamics, the Absolute Nodal Coordinate Formulation (ANCF) approach for the simulation of large-scale flexible multibody dynamics, etc. In §8, we describe in detail how SaP::GPU has been leveraged in the ANCF approach for the simulation of large-scale flexible multibody dynamics. Many real-life applications give rise to very large sparse linear systems that can be reordered to produce either narrow banded systems or low-rank perturbations of narrow banded systems, with the systems being either dense or sparse within the band. For example, in finite-element analysis, the underlying sparse linear systems can be reordered to result in a banded system in which the width of the band is but a small fraction of the size of the overall problem. Specific techniques can be used to further reduce the bandwidth and create a robust preconditioner for an iterative solver. As discussed in §3.2, this forms the basic idea of how SaP::GPU solves sparse linear systems.

1.3 Contributions

a) By applying three split and parallelize (SaP) approaches, I implemented an efficient GPU solver for dense banded linear systems.

• Two of the three approaches, the decoupled and coupled approaches, are envisioned by the SPIKE algorithm detailed in Polizzi and Sameh (2006). Results in §7.1 demonstrate the relatively high efficiency of these approaches. The third approach, the Block Cyclic Reduction approach, was generalized from the Cyclic Reduction method for the fast solution of tridiagonal systems (Swarztrauber, 1977). Results in §7.1 demonstrate the high robustness of this approach.

• A set of sensitivity studies with respect to various attributes of a dense banded matrix was carried out to characterize the robustness and performance of these three approaches.

b) I studied, implemented and assessed two classes of CPU-GPU sparse matrix reordering strategies, namely diagonal boosting and bandwidth reduction, which reduce the solution of a general sparse system to that of a dense banded system.

• Both reordering strategies were compared against the corresponding routines of the Harwell Sparse Library (Hopper, 1973). In terms of reordering quality, the diagonal boosting and bandwidth reduction reordering methods implemented in SaP::GPU are close to HSL's MC64 and MC60, respectively, when the product of diagonal entries and the half-bandwidth, respectively, are used as comparison metrics (§7.2). In §7.2.1.1, the diagonal boosting in SaP::GPU is shown to enhance the degree of diagonal dominance of a matrix (defined in Eq. 3.11), which is critical for the quality of many preconditioning approaches and Krylov subspace methods. In terms of performance, the diagonal boosting reordering in SaP::GPU demonstrates higher performance than HSL's MC64, and the bandwidth reduction reordering in SaP::GPU demonstrates performance comparable to HSL's MC60.

• An extra stage of bandwidth reduction reordering, called "third-stage reordering" and applied after the system is split, is one of the major contributions of this work, as it produces a sizable reduction in solution run times.

c) Both the dense banded and general sparse linear system solvers were assessed and compared against commonly used CPU and GPU solvers.

• The efficiency of the dense banded linear system solver was assessed using synthetic banded matrices. Its performance was compared against that of Intel's Math Kernel Library (Wang et al., 2014) dense banded solver. A speedup by a factor of up to 3 was observed for more than half of the test cases.

• The general sparse linear system solver was assessed with 116 sparse matrices from either the University of Florida Sparse Matrix Collection (Davis and Hu, 2011) or SBEL (SBEL, 2006). SaP::GPU proved to be the most robust solver. Its efficiency is comparable to that of the CPU solvers and superior to that of NVIDIA's GPU solver.

d) An open source library of dense banded and general sparse linear system solvers on the GPU, SaP::GPU, was developed, documented and released (see SBEL (b) for details).

• This library includes sample usage examples.

• This library was used in a real multibody dynamics application at SBEL. In §8, methods preconditioned with SaP::GPU are reported to result in a significant decrease in simulation time compared to unpreconditioned methods.

2 Motivation: the Rise of Parallel Computing and Emergence of GPU Computing

This chapter motivates the proposed direction of research, namely the parallel solution of sparse linear systems using GPU acceleration. Having established in §1.2 the importance of solving linear systems, in this chapter we begin with a discussion of three walls to sequential computing (§2.1), justify the importance of a GPU-parallel solution of linear systems, and conclude with an argument pertaining to the lack of GPU or GPU-CPU hybrid linear system solvers (§2.2). In §2.3 and §2.4, we give a brief overview of the hardware on which SaP::GPU runs and of the Compute Unified Device Architecture (CUDA) created by NVIDIA.

2.1 Three efficiency walls in sequential computing

As pointed out in Manferdelli (2007), the programs we develop nowadays are up against three walls: the memory, instruction level parallelism, and power walls.

• Memory wall: memory access speed has not kept pace with advances in chip processing power. Memory accesses require times on the order of 100 ns, which translates to roughly 200 cycles for a clock rate of 2 GHz. Programs should therefore be designed to reduce data movement. For instance, a fused multiply–add operation on the GPU is carried out in one cycle, while a GDDR5 GPU global memory access can require between 400 and 600 cycles.

• Frequency / power wall: the chip has become the victim of its own success. The feature length has become so small that current leaks from the increasingly many transistors per unit area lead to a chain reaction that can trigger thermal runaway, a phenomenon that is only exacerbated by high clock frequencies. It comes as no surprise that new chip designs increasingly host what is called dark silicon – transistors that cannot be powered lest the chip be compromised (Esmaeilzadeh et al., 2011).

• Instruction Level Parallelism wall: the techniques that benefited sequential computing – instruction pipelining, superscalar execution, out-of-order execution, register renaming, branch prediction, and speculative execution – have been heavily mined and have to a very large extent exhausted their potential. Increasing the ability to predict for ILP calls for substantially more complex designs and more power to operate the units involved in deep-horizon prediction.

Against this backdrop, we posit that parallel computing is the only alternative capable of delivering in a reliable way steady performance improvements in Scientific Computing. Moreover, by leveraging multi/many–core processors and accelerators, parallel computing can both effectively hide memory latency and trade single-core ILP for simple single-core designs that lead to better power efficiency.

2.2 GPU computing and lack of GPU linear system solvers

GPU-accelerated computing is the use of a GPU in tandem with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications. Previously used in niche applications and by a small group of enthusiasts, GPU-accelerated computing has become ubiquitous and is playing an increasingly important role in Scientific Computing in data analysis, simulation, and visualization tasks. In fields as diverse as genomics, medical physics, astronomy, nuclear engineering, quantum chemistry, finance, and oil and gas exploration, GPU computing is attractive owing to its capacity to deliver speed-ups that in some cases can be as high as one order of magnitude compared to execution on traditional multi-core CPUs. This performance boost is typically associated with computational tasks where the application data is amenable to single instruction multiple data (SIMD) processing. Modern GPUs have a deep memory hierarchy that at the low end displays bandwidths in excess of 300 GB/s, while at the high end, for scratch pad memory and registers, it displays low latency and bandwidths that approach 1 TB/s. From a hardware architectural perspective, one of the salient features of GPUs is their ability to pack a very large number of rather unsophisticated (from a logical point of view) scalar processors. These are organized in streaming multiprocessors (SMs), each GPU being a collection of several such SMs that work independently while typically sharing less than 8 GB of main memory and 1 GB of L2 cache. Although employed on a small scale as early as the 1990s, in relative historical terms, leveraging General Purpose Graphics Processing Unit (GPGPU) computing in scientific and engineering applications has only recently gained widespread acceptance. As such, this is a time marked by a perceived lack of productivity tools to support domain-specific applications. Despite the importance and ubiquity of solving linear systems and the fact that GPU-accelerated computing is presently widely used, owing to how recently GPGPU computing rose to popularity there exist few solutions that leverage GPU computing, or hybrid GPU/CPU computing, for solving linear systems of equations. PARDISO (Schenk and Gärtner, 2004), a direct sparse linear solver for shared-memory and distributed-memory multiprocessors, is an efficient sparse linear solver on multi-core CPU systems but has no support for GPU computing. Thus one either cannot use GPU computing if the task at hand calls for the solution of linear systems, or has to move data back and forth between the CPU and GPU. This problem is exacerbated when we attempt to use, say, a Newton method that calls for the solution of a linear system at each stage of the iterative process. Similarly, if one requires the solution of a linear system at each time step of a dynamics simulation, this would call for a host-device transfer at each time step (in GPU computing, CPU memory is called host memory, and GPU memory is called device memory). The overhead becomes prohibitively taxing when the number of Newton iterations becomes large.

NVIDIA’s new CUDA 7.0 release includes a sparse linear solver – cuSOLVER, which provides a device sparse linear solver based on the QR factorization. It naïvely QR-factorizes the coefficient matrix and provides the solution for a single right-hand side; i.e., it cannot apply the same factorization to solve for multiple right-hand sides. Moreover, it saddles the programmer with the burden of pre-processing (for instance, reordering the matrix to reduce the number of fill-ins) and post-processing (the counterpart of the pre-reordering, this step maps the solution returned by the cuSOLVER routine back to obtain the true solution). Paralution (Lukarski and Trost, 2014) is a set of iterative solvers with various preconditioners that provides GPU support. It is still under development, and a non-trivial subset of its features (like the AMG preconditioner, useful in preconditioning certain symmetric positive definite systems) is implemented only on the CPU. SuperLU is a general purpose library for the direct solution of large, sparse, asymmetric systems of linear equations on high performance machines (Li, 2005). It supports three independent flavors: sequential, shared memory, and distributed memory. In October 2014, a recent version of SuperLU for distributed memory began to support GPUs (Sao et al., 2014). Finally, the MUltifrontal Massively Parallel sparse direct Solver, or MUMPS (Amestoy et al., 2000), is a package for solving systems of linear equations Ax = b, where the matrix can be asymmetric, symmetric positive definite, or general symmetric, on distributed memory computers. MUMPS implements a direct method based on a multifrontal approach which performs an LU factorization of the matrix A, and it has attractive out-of-core capabilities.

In this work, we compare SaP::GPU with four direct solvers, the results of which are reported in §7.3.3 and §7.3.4. It should be pointed out that our goal was to develop an open source GPU solution. In doing so, we have been constrained by memory size limitations and the inability of the GPU to handle logic-heavy operations that are not supported well on SIMD hardware tuned for fine grain parallelism.

2.3 Brief overview of relevant hardware

This section briefly introduces the major hardware SaP::GPU runs on. NVIDIA's Tesla K40 GPU (NVIDIA, 2013), launched in 2013, is introduced in §2.3.1, and Intel's Xeon processor is introduced in §2.3.2.

2.3.1 The Tesla K40 Graphics Processing Unit

The NVIDIA® Tesla® K40 Graphics Processing Unit (GPU) board is a PCI Express, dual-slot, full-height (4.376 inches by 10.5 inches) form factor computing module comprised of a single GK110 GPU (NVIDIA, 2013). Fig. 2.1 illustrates the architecture of the Tesla K40 GPU Active. A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Key features of its architecture that are leveraged in SaP::GPU and benefit SaP::GPU significantly include:

• The SMX processor architecture. Compared with the Stream Multiprocessor (SM) of the Fermi GF100 or GF104, each of Kepler GK110's Streaming Multiprocessors (SMX) features a significant increase in both the number of single precision and the number of double precision cores, as detailed in Table 2.1.

• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation. From Fermi to Kepler, the maximum number of 32-bit registers available to each thread increases from 63 to 255, which enables a more aggressive strategy of leveraging registers when tuning the kernels in SaP::GPU.

• Hardware support throughout the design to enable new programming model capabilities. For instance, Unified Memory Access is a feature enabled from CUDA 6 onward with minimum compute capability 3.2. Kepler GK110 enables CUDA Compute Capability (CCC, briefly introduced in §2.4) 3.5, which implies that Unified Memory Access is supported on Kepler GK110; this in turn simplifies the code of the reordering methods in SaP::GPU.

Figure 2.1: Tesla K40 GPU Active Block Diagram

                               Type of multipr.   # of multipr.   # of SP cores   # of DP cores
Tesla M2090 (w/ Fermi GF100)   SM                 16              512             256
Tesla K40 (w/ Kepler GK110)    SMX                15              2880            960

Table 2.1: Comparison of multiprocessor and the number of cores between two Tesla cards (commercial models). One is Tesla M2090 with Fermi GF100, and the other is Tesla K40 with Kepler GK110. In this table, “SP” and “DP” are for single precision and double precision, respectively.

2.3.2 Intel Xeon processor

The Xeon is a brand of x86 microprocessors designed and manufactured by Intel Corporation, targeted at the non-consumer workstation, server, and embedded system markets. Primary advantages of the Xeon CPUs, when compared to the majority of Intel's desktop-grade consumer CPUs, are their multi-socket capabilities, higher core counts, and support for ECC memory.

The brand name of the two Xeon CPUs used in this work is Xeon E5-2690 v2, which was released on Sept. 10, 2013. Each of the CPUs, with codename Ivy Bridge-EP, has ten cores running at 3.0 GHz. Each of the ten cores has a separate 256 KB L2 cache, and the ten cores share a 25 MB L3 cache (Intel, 2013). The RAM included is 64 GB DDR3 ECC Registered, which CPU solvers like PARDISO, SuperLU and MUMPS can easily tap into.

2.4 Brief overview of Compute Unified Device Architecture

CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the GPU. With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are finding broad-ranging uses for GPU computing with CUDA (NVIDIA). The CUDA platform is designed to work with programming languages such as C, C++ and Fortran. This accessibility makes it easier for specialists in parallel programming to utilize GPU resources, as opposed to previous API solutions, which required advanced skills in graphics programming. The CUDA Compute Capability (CCC) describes the features supported by CUDA hardware. In general, a larger CCC indicates more features in the architecture. The first CUDA-capable hardware, like the GeForce 8800 GTX, has a compute capability of 1.0, and the Tesla K40 GPU, on which SaP::GPU runs, enables compute capability 3.5. In addition, all GPU cards are backwards compatible – they can run code targeting their own compute capability, as well as all previous compute capabilities.

3 The “Split and Parallelize” Method

3.1 SaP for dense banded linear systems

Assume that the banded dense matrix $A \in \mathbb{R}^{N \times N}$ has half-bandwidth $K \ll N$. There are two approaches we have followed to build up SaP::GPU. The first approach is the SPIKE algorithm discussed in Sameh and Kuck (1978); Polizzi and Sameh (2006, 2007), and the second approach is a generalization of the cyclic reduction algorithm discussed in Swarztrauber (1977); Gander and Golub (1997). In both cases, we partition the banded matrix A into a block tridiagonal form with P diagonal blocks $A_i \in \mathbb{R}^{N_i \times N_i}$, where $\sum_{i=1}^{P} N_i = N$. For each partition i, let $B_i$, i = 1, ..., P−1, and $C_i$, i = 2, ..., P, be the super- and sub-diagonal coupling blocks, respectively – see Fig. 3.1. Each coupling block has dimension K × K for banded matrices with half-bandwidth $K = \max_{a_{ij} \neq 0} |i - j|$. The two approaches lead to two different flavors of SaP::GPU: the generalized cyclic reduction approach results in an exact solution, while the SPIKE algorithm leads to an approximate solution. These approaches are detailed in §3.1.2 and §3.1.1, respectively.
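To make the partitioning concrete, the following minimal NumPy sketch extracts the diagonal blocks A_i and the K × K coupling blocks B_i and C_i from a dense banded matrix. The function name split_banded and the even block-size choice are illustrative assumptions; this is not part of SaP::GPU.

import numpy as np

def split_banded(A, K, P):
    # Partition a banded matrix with half-bandwidth K into P diagonal blocks A_i,
    # plus the K-by-K coupling blocks B_i (super-diagonal) and C_i (sub-diagonal).
    # Hypothetical helper; block sizes are chosen as evenly as possible and each
    # block is assumed to be at least K rows tall.
    N = A.shape[0]
    sizes = [N // P + (1 if i < N % P else 0) for i in range(P)]
    assert min(sizes) >= K
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    A_blk, B_blk, C_blk = [], [], []
    for i in range(P):
        s, e = offsets[i], offsets[i + 1]
        A_blk.append(A[s:e, s:e].copy())            # diagonal block A_i
        if i < P - 1:                               # B_i couples partition i to i+1
            B_blk.append(A[e - K:e, e:e + K].copy())
        if i > 0:                                   # C_i couples partition i to i-1
            C_blk.append(A[s:s + K, s - K:s].copy())
    return A_blk, B_blk, C_blk

With this convention, C_blk[i-1] is the sub-diagonal block sitting in the block row of partition i, matching the indexing used in the equations below.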

3.1.1 SPIKE algorithm

As illustrated in Fig. 3.1, the banded matrix A is expressed as the product of a block diagonal matrix D and a so-called spike matrix S (Sameh and Kuck, 1978). The latter is made up of identity diagonal blocks of dimension Ni and off-diagonal spike blocks, each having K columns. Specifically,

A = DS , (3.1)

where $D = \mathrm{diag}(A_1, \dots, A_P)$. Assuming that the $A_i$ are non-singular, the so-called left and right spikes $W_i$ and $V_i$ associated with partition i, each of dimension $N_i \times K$, are given by

  0     A1V1 =  0  (3.2a)   B1   Ci 0     Ai [Wi | Vi] =  0 0  , i = 2,...,P − 1 (3.2b)   0 Bi   CP     AP WP =  0  . (3.2c)   0

Figure 3.1: Factorization of the matrix A with P = 3.

Solving the linear system Ax = b is thus reduced to solving

\begin{align}
D g &= b \tag{3.3} \\
S x &= g \,. \tag{3.4}
\end{align}

Since D is block-diagonal, solving for the modified right-hand side g from (3.3) is trivially parallelizable, as the work is split across P processes, each charged with solving $A_i g_i = b_i$, i = 1, ..., P. Note that the same decoupling is manifest in Eq. (3.2), and the work is also spread over P processes.

The remaining question is how to efficiently solve the linear system in (3.4). This problem can be reduced to one of smaller size, $\hat{S}\hat{x} = \hat{g}$. To that end, the spikes $V_i$ and $W_i$, as well as the modified right-hand side $g_i$ and the unknown vectors $x_i$ in (3.4), are partitioned into their top K rows, the middle $N_i - 2K$ rows, and the bottom K rows as:

\begin{align}
V_i = \begin{bmatrix} V_i^{(t)} \\ V_i' \\ V_i^{(b)} \end{bmatrix}, \qquad
W_i = \begin{bmatrix} W_i^{(t)} \\ W_i' \\ W_i^{(b)} \end{bmatrix}, \tag{3.5a} \\
g_i = \begin{bmatrix} g_i^{(t)} \\ g_i' \\ g_i^{(b)} \end{bmatrix}, \qquad
x_i = \begin{bmatrix} x_i^{(t)} \\ x_i' \\ x_i^{(b)} \end{bmatrix}. \tag{3.5b}
\end{align}

A block-tridiagonal reduced system is obtained by eliminating the middle partitions of the spike matrices, resulting in the linear system

\begin{equation}
\begin{bmatrix}
R_1 & M_1 & & & \\
 & \ddots & \ddots & & \\
 & N_i & R_i & M_i & \\
 & & \ddots & \ddots & \\
 & & & N_{P-1} & R_{P-1}
\end{bmatrix}
\begin{bmatrix} \hat{x}_1 \\ \vdots \\ \hat{x}_i \\ \vdots \\ \hat{x}_{P-1} \end{bmatrix}
=
\begin{bmatrix} \hat{g}_1 \\ \vdots \\ \hat{g}_i \\ \vdots \\ \hat{g}_{P-1} \end{bmatrix}, \tag{3.6}
\end{equation}

denoted $\hat{S}\hat{x} = \hat{g}$, of dimension $2K(P-1) \ll N$, where

\begin{align}
N_i &= \begin{bmatrix} W_i^{(b)} & 0 \\ 0 & 0 \end{bmatrix}, & i &= 2, \dots, P-1 \tag{3.7a} \\
R_i &= \begin{bmatrix} I_M & V_i^{(b)} \\ W_{i+1}^{(t)} & I_M \end{bmatrix}, & i &= 1, \dots, P-1 \tag{3.7b} \\
M_i &= \begin{bmatrix} 0 & 0 \\ 0 & V_{i+1}^{(t)} \end{bmatrix}, & i &= 1, \dots, P-2 \tag{3.7c}
\end{align}

and

\begin{equation}
\hat{x}_i = \begin{bmatrix} x_i^{(b)} \\ x_{i+1}^{(t)} \end{bmatrix}, \qquad
\hat{g}_i = \begin{bmatrix} g_i^{(b)} \\ g_{i+1}^{(t)} \end{bmatrix}, \qquad i = 1, \dots, P-1 \,. \tag{3.8}
\end{equation}

Two strategies are proposed in Polizzi and Sameh (2006) to solve (3.6): (i) an exact reduction; and (ii) an approximate reduction, which sets $N_i \equiv 0$ and $M_i \equiv 0$ and results in a block diagonal matrix $\hat{S}$. The solution approach adopted herein is based on (ii) and therefore each sub-system $R_i \hat{x}_i = \hat{g}_i$ is solved independently using the following steps:

\begin{align}
\text{Form } & \bar{R}_i = I_M - W_{i+1}^{(t)} V_i^{(b)} \,, \tag{3.9a} \\
\text{Solve } & \bar{R}_i \, \tilde{x}_{i+1}^{(t)} = g_{i+1}^{(t)} - W_{i+1}^{(t)} g_i^{(b)} \,, \tag{3.9b} \\
\text{Calculate } & \tilde{x}_i^{(b)} = g_i^{(b)} - V_i^{(b)} \tilde{x}_{i+1}^{(t)} \,. \tag{3.9c}
\end{align}

Note that a tilde was used to denote the approximate values $\tilde{x}_i^{(t)}$ and $\tilde{x}_i^{(b)}$ obtained upon dropping the $N_i$ and $M_i$ terms. An approximation of the solution of the original problem is finally obtained by solving independently and in parallel P systems using the available LU factorizations of the $A_i$ matrices:

\begin{align}
A_1 x_1 &= b_1 - \begin{bmatrix} 0 \\ 0 \\ B_1 \tilde{x}_2^{(t)} \end{bmatrix} \tag{3.10a} \\
A_i x_i &= b_i - \begin{bmatrix} C_i \tilde{x}_{i-1}^{(b)} \\ 0 \\ 0 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ B_i \tilde{x}_{i+1}^{(t)} \end{bmatrix}, \quad i = 2, \dots, P-1 \tag{3.10b} \\
A_P x_P &= b_P - \begin{bmatrix} C_P \tilde{x}_{P-1}^{(b)} \\ 0 \\ 0 \end{bmatrix} . \tag{3.10c}
\end{align}

Computational savings can be made by noting that if an LU factorization of the diagonal blocks $A_i$ is available, the bottom block of the right spike, i.e., $V_i^{(b)}$, can be obtained from (3.2a) using only the bottom $K \times K$ blocks of L and U. However, obtaining the top block of the left spike requires calculating the entire spike $W_i$. An effective alternative is to perform an additional UL factorization of $A_i$, in which case $W_i^{(t)}$ can be obtained using only the top $K \times K$ blocks of the new U and L.
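The following NumPy sketch strings Eqs. (3.2), (3.9) and (3.10) together into one application of the truncated-SPIKE (coupled) preconditioner. It uses dense solves instead of the banded LU/UL factorizations described above, assumes each block dimension N_i > 2K, and takes the block lists produced by a partitioning step such as the hypothetical split_banded sketch earlier in this section; the function name sap_c_apply is also hypothetical.

import numpy as np

def sap_c_apply(A_blk, B_blk, C_blk, b_blk, K):
    # One application of the truncated-SPIKE (SaP-C) preconditioner, written with
    # dense solves for clarity only; b_blk is b split conformally with the A_i.
    P = len(A_blk)
    # Step 1: g_i = A_i^{-1} b_i and the spike corner blocks (Eq. 3.2).
    g = [np.linalg.solve(A_blk[i], b_blk[i]) for i in range(P)]
    Vb = []  # bottom K x K block of the right spike V_i, i = 1..P-1
    Wt = []  # top    K x K block of the left  spike W_i, i = 2..P
    for i in range(P - 1):
        rhs = np.zeros((A_blk[i].shape[0], K)); rhs[-K:, :] = B_blk[i]
        Vb.append(np.linalg.solve(A_blk[i], rhs)[-K:, :])
    for i in range(1, P):
        rhs = np.zeros((A_blk[i].shape[0], K)); rhs[:K, :] = C_blk[i - 1]
        Wt.append(np.linalg.solve(A_blk[i], rhs)[:K, :])
    # Step 2: solve the decoupled reduced systems (Eq. 3.9).
    xt = [None] * P   # approximations of the top    K entries of each x_i
    xb = [None] * P   # approximations of the bottom K entries of each x_i
    for i in range(P - 1):
        Rbar = np.eye(K) - Wt[i] @ Vb[i]
        xt[i + 1] = np.linalg.solve(Rbar, g[i + 1][:K] - Wt[i] @ g[i][-K:])
        xb[i] = g[i][-K:] - Vb[i] @ xt[i + 1]
    # Step 3: recover the full solution partition by partition (Eq. 3.10).
    x = []
    for i in range(P):
        r = b_blk[i].copy()
        if i > 0:
            r[:K] -= C_blk[i - 1] @ xb[i - 1]
        if i < P - 1:
            r[-K:] -= B_blk[i] @ xt[i + 1]
        x.append(np.linalg.solve(A_blk[i], r))
    return np.concatenate(x)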

There are two remarks on these strategies. First, note that the decision to set Ni ≡ 0 and Mi ≡ 0 relegates the algorithm to preconditioner status. This decision is justified as follows: although the dimension of the reduced linear system in (3.6) is smaller than that of the original problem, its half-bandwidth is at least three times larger. The memory footprint of exactly solving (3.6) is large, thus limiting the size of problems that can be tackled on the GPU. Specifically, at each recursive step, additional memory that is required to store the new reduced matrix cannot be deallocated until the global solution is fully recovered. Next, it is manifest that the quality of the preconditioner is dictated by the entries in

Ni and Mi. For the sake of this discussion, assume that the matrix A is diagonally dominant with a degree of diagonal dominance d ≥ 1; i.e.,

\begin{equation}
|a_{ii}| \geq d \cdot \sum_{j \neq i} |a_{ij}| \,, \qquad \forall i = 1, \dots, N. \tag{3.11}
\end{equation}

When d ≥ 1, the elements of the left spikes $W_i$ decay in magnitude from top to bottom; those of the right spikes $V_i$ decay from bottom to top (Mikkelsen and Manguoglu, 2008). This decay, which is more pronounced the larger the degree of diagonal dominance, justifies the approximation $N_i \equiv 0$ and $M_i \equiv 0$. However, having A be diagonally dominant, although desirable, is not mandatory, as demonstrated by numerical experiments reported herein. Truncating when d < 1 will lead to a preconditioner of lesser quality. The sensitivity of the SaP preconditioner's quality with respect to d is studied in §7.1.2.
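For reference, the degree of diagonal dominance d of Eq. (3.11) is the largest d for which the inequality holds in every row; a small illustrative NumPy helper (not part of SaP::GPU) is shown below.

import numpy as np

def diag_dominance_degree(A):
    # Largest d such that |a_ii| >= d * sum_{j != i} |a_ij| holds for every row i,
    # i.e., the degree of diagonal dominance of Eq. (3.11).  Illustrative only.
    absA = np.abs(np.asarray(A, dtype=float))
    diag = np.diag(absA)
    off = absA.sum(axis=1) - diag
    ratios = np.where(off > 0.0, diag / np.maximum(off, np.finfo(float).tiny), np.inf)
    return float(ratios.min())

For the synthetic matrices used in the experiments of §7.1, d is a controlled input; for general sparse matrices, this is the quantity that the diagonal boosting reordering of §4.1 aims to improve.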

3.1.2 A generalization of the Cyclic Reduction: the Block Cyclic Reduction algorithm

Cyclic Reduction (CR) was proposed in Swarztrauber (1977) for the fast solution of tridiagonal systems, and it was later discussed in Gander and Golub (1997) as well. Considering the fact that the partitioned banded matrix A is in block tridiagonal form, and inspired by the high degree of parallelism introduced by CR, SaP::GPU implements the generalization of CR for block tridiagonal systems – Block Cyclic Reduction (BCR). BCR is a recursive method consisting of two phases, called deflation and inflation, respectively. As derived in Eq. 3.12, in the j-th step of the deflation phase, the system is successively reduced to one of smaller size containing only the even-indexed blocks, until a system with two blocks of unknowns is reached, which is solved directly:

\begin{align}
A_i^{(j+1)} &= A_{2i}^{(j)} - C_{2i}^{(j)} \bigl(A_{2i-1}^{(j)}\bigr)^{-1} B_{2i-1}^{(j)} - B_{2i}^{(j)} \bigl(A_{2i+1}^{(j)}\bigr)^{-1} C_{2i+1}^{(j)} \,, & i &= 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \tag{3.12a} \\
B_i^{(j+1)} &= -B_{2i}^{(j)} \bigl(A_{2i+1}^{(j)}\bigr)^{-1} B_{2i+1}^{(j)} \,, & i &= 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \tag{3.12b} \\
C_i^{(j+1)} &= -C_{2i}^{(j)} \bigl(A_{2i-1}^{(j)}\bigr)^{-1} C_{2i-1}^{(j)} \,, & i &= 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \tag{3.12c} \\
b_i^{(j+1)} &= b_{2i}^{(j)} - C_{2i}^{(j)} \bigl(A_{2i-1}^{(j)}\bigr)^{-1} b_{2i-1}^{(j)} - B_{2i}^{(j)} \bigl(A_{2i+1}^{(j)}\bigr)^{-1} b_{2i+1}^{(j)} \,, & i &= 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \tag{3.12d}
\end{align}

In each step of the inflation stage, all odd-indexed unknown blocks $x_i^{(j)}$ are recovered in parallel by substituting the already calculated $x_{i-1}^{(j)}$ and $x_{i+1}^{(j)}$:

\begin{equation}
x_i^{(j)} = \bigl(A_i^{(j)}\bigr)^{-1} b_i^{(j)} - \bigl(A_i^{(j)}\bigr)^{-1} B_i^{(j)} x_{i+1}^{(j)} - \bigl(A_i^{(j)}\bigr)^{-1} C_i^{(j)} x_{i-1}^{(j)} \,, \quad i = 1, \dots, \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \text{ and } i \text{ odd.} \tag{3.13}
\end{equation}

There are two remarks in order. First, Eqs. 3.12 and 3.13 are simplified for succinctness. For the sake of rigorousness, they are patched with

\begin{align}
A_i^{(j)} &= 0 \,, & i &\leq 0 \ \text{or}\ i > \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14a} \\
B_i^{(j)} &= 0 \,, & i &\leq 0 \ \text{or}\ i \geq \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14b} \\
C_i^{(j)} &= 0 \,, & i &\leq 1 \ \text{or}\ i > \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14c} \\
b_i^{(j)} &= 0 \,, & i &\leq 0 \ \text{or}\ i > \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14d} \\
x_i^{(j)} &= 0 \,, & i &\leq 0 \ \text{or}\ i > \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14e}
\end{align}

and

\begin{equation}
x_i^{(j+1)} = x_{2i}^{(j)} \,, \quad i = 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \,. \tag{3.15}
\end{equation}

Eq. 3.14 indicates that all "invalid" blocks with index i out of bounds are treated as zero matrices. Eq. 3.15 describes the relation between the solutions at two adjacent steps.

 P  A(j) = L(j)U(j) , i = 1,..., and i is odd (3.16a) i i i 2j  P  G(j) = (U(j))−1(L(j))−1B(j) , i = 1,..., and i is odd (3.16b) i i i i 2j  P  H(j) = (U(j))−1(L(j))−1C(j) , i = 1,..., and i is odd (3.16c) i i i i 2j  P  A(j+1) = A(j) − C(j)G(j) − B(j)H(j) , i = 1,..., (3.16d) i 2i 2i 2i−1 2i 2i+1 2j+1  P  B(j+1) = −B(j)G(j) , i = 1,..., (3.16e) i 2i 2i+1 2j+1  P  C(j+1) = −C(j)H(j) , i = 1,..., (3.16f) i 2i 2i−1 2j+1  P  r(j) = (U(j))−1(L(j))−1b(j) , i = 1,..., and i is odd (3.16g) i i i i 2j  P  b(j+1) = b(j) − C(j)r(j) − B(j)r(j) , i = 1,..., (3.16h) i 2i 2i 2i−1 2i 2i+1 2j+1  P  x(j) = r(j) − G(j)x(j) − H(j)x(j) , i = 1,..., and i is odd (3.16i) i i i i−1 i i+1 2j

Eq. 3.16a is the LU factorization of the diagonal blocks with odd indices; Eqs. 3.16b and 3.16c are solutions with multiple right-hand sides (RHS); Eqs. 3.16d, 3.16e and 3.16f are matrix updates, which consist of matrix multiplication and matrix addition; Eq. 3.16g is the solution with a single RHS; Eqs. 3.16h and 3.16i are the RHS update and the solution recovery, respectively, both of which consist of matrix-vector multiplication and vector addition. Secondly, in order to increase the degree of parallelism and reduce the number of floating-point operations in Eqs. 3.16d, 3.16e and 3.16f, when the BCR algorithm is applied in SaP::GPU, the number of partitions P is fixed to the largest possible partition number $\lfloor N/K \rfloor$.
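The deflation/inflation recursion of Eqs. (3.12)–(3.13) can be prototyped in a few lines of NumPy. The sketch below operates on lists of dense blocks with the same B/C convention as the earlier split_banded sketch, recurses until a single block remains (whereas the text above stops at two blocks and the GPU implementation batches the work), and uses explicit inverses purely for clarity; the function name bcr_solve is hypothetical.

import numpy as np

def bcr_solve(A, B, C, b):
    # Block cyclic reduction for a block-tridiagonal system.  A[i] are the P
    # diagonal blocks, B[i] the super-diagonal block of block row i (i = 0..P-2),
    # C[i] the sub-diagonal block of block row i+1 (i = 0..P-2), b[i] the RHS blocks.
    P = len(A)
    if P == 1:
        return [np.linalg.solve(A[0], b[0])]
    Ainv = {i: np.linalg.inv(A[i]) for i in range(0, P, 2)}
    # Deflation: keep only the even-indexed (1-based) blocks, Eq. (3.12).
    A2, B2, C2, b2 = [], [], [], []
    for i in range(1, P, 2):
        Ai, bi = A[i].copy(), b[i].copy()
        bi -= C[i - 1] @ (Ainv[i - 1] @ b[i - 1])
        Ai -= C[i - 1] @ (Ainv[i - 1] @ B[i - 1])
        if i + 1 < P:
            bi -= B[i] @ (Ainv[i + 1] @ b[i + 1])
            Ai -= B[i] @ (Ainv[i + 1] @ C[i])
        A2.append(Ai); b2.append(bi)
        if i + 2 < P:
            B2.append(-B[i] @ (Ainv[i + 1] @ B[i + 1]))
            C2.append(-C[i + 1] @ (Ainv[i + 1] @ C[i]))
    x_even = bcr_solve(A2, B2, C2, b2)       # recurse on the deflated system
    # Inflation: recover the eliminated blocks by back-substitution, Eq. (3.13).
    x = [None] * P
    for k, i in enumerate(range(1, P, 2)):
        x[i] = x_even[k]
    for i in range(0, P, 2):
        r = b[i].copy()
        if i - 1 >= 0:
            r -= C[i - 1] @ x[i - 1]
        if i + 1 < P:
            r -= B[i] @ x[i + 1]
        x[i] = Ainv[i] @ r
    return x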

3.1.3 Nomenclature, solution strategies

Targeted for execution on the GPU, the methodology outlined in §3.1.1 and §3.1.2 becomes the foundation of a parallel implementation called herein “split and parallelize”

(SaP). The matrix A is split into the diagonal blocks $A_i$, which are processed in parallel. The code implementing this strategy is called SaP::GPU. Several flavors of SaP::GPU can be envisioned. At one end of the spectrum, the solution path implements the exact reduction, SaP-B, a strategy that is envisioned by the BCR algorithm. At the other end of the spectrum, SaP::GPU solves the block-diagonal linear system in (3.3) and, for preconditioning purposes, uses the approximation x ≈ g. In what follows, this will be called the decoupled approach, SaP-D. The middle ground is the approximate reduction, which sets $N_i \equiv 0$ and $M_i \equiv 0$. This will be called the coupled approach, SaP-C, owing to the coupling that occurs through the truncated spikes, i.e., $V_i^{(b)}$ and $W_{i+1}^{(t)}$. Neither the coupled nor the decoupled approach described in §3.1.1 qualifies as a direct solver. Moreover, since SaP::GPU applies pivot boosting when a zero pivot is encountered, even with BCR, which is claimed to produce exact solutions, the solutions run the risk of being only approximate. Consequently, SaP::GPU employs an outer Krylov subspace scheme to solve Ax = b. The solver uses BiCGStab(ℓ) (Sleijpen and Fokkema, 1993) and left-preconditioning, unless the matrix A is symmetric and positive definite, in which case the outer loop implements a conjugate gradient method (Saad, 2003b). Both schemes are detailed in §5. SaP::GPU is open source and available at SBEL (a,b).
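As a usage illustration, the decoupled (SaP-D) flavor amounts to a block-Jacobi-style preconditioner built from the independent LU factorizations of the A_i. The sketch below wires such a preconditioner into SciPy's plain BiCGStab, which stands in here for the BiCGStab(ℓ) outer iteration actually used by SaP::GPU; split_banded refers to the earlier hypothetical partitioning sketch, and sap_d_preconditioner is likewise an illustrative name.

import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, bicgstab

def sap_d_preconditioner(A_blk):
    # Decoupled (SaP-D) preconditioner: independent LU factorizations of the
    # diagonal blocks A_i, applied block by block as M^{-1} v.
    factors = [lu_factor(Ai) for Ai in A_blk]
    offsets = np.cumsum([0] + [Ai.shape[0] for Ai in A_blk])

    def apply(v):
        out = np.empty_like(v)
        for i, fac in enumerate(factors):
            s, e = offsets[i], offsets[i + 1]
            out[s:e] = lu_solve(fac, v[s:e])
        return out

    n = offsets[-1]
    return LinearOperator((n, n), matvec=apply)

# Example use (with the hypothetical split_banded helper from Section 3.1):
#   A_blk, _, _ = split_banded(A, K, P)
#   M = sap_d_preconditioner(A_blk)
#   x, info = bicgstab(A, b, M=M)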

3.1.4 A comparison of SPIKE algorithm and Block Cyclic Reduction algorithm

In this subsection, we estimate the number of synchronization steps (called steps for short) and the total number of floating point operations, in terms of the matrix dimension N, half-bandwidth K and number of partitions P, for the three preconditioning approaches, i.e. SaP-D, SaP-C and SaP-B, introduced in §3.1.1 and §3.1.2. This serves as a measure of how expensive each preconditioning approach is. Before the calculation, we summarize the number of steps and floating point operations of some basic computation categories; this keeps the statements in the rest of this subsection succinct. In all cases, the backdrop is that a banded matrix A with dimension N and half-bandwidth K is P-partitioned.

(i) K · P ≪ N for the decoupled/coupled approach; K · P = N for the BCR approach.

(ii) The dimension of each diagonal block is n ≡ N_i ≈ N/P, i = 1, ..., P.

(iii) LU-factorizing each diagonal block requires 2n steps, nK · (K + 1) floating point multiplications and nK^2 floating point additions. This is because it takes n iterations to factorize each diagonal block, and two steps with K · (K + 1) multiplications and K^2 additions are mandatory in each iteration.

(iv) Factorizing a general dense matrix of dimension ñ costs ñ steps with Σ_{i=1}^{ñ} i(i − 1) = ñ(ñ^2 − 1)/3 multiplications and Σ_{i=1}^{ñ} (i − 1)^2 = ñ(ñ − 1)(2ñ − 1)/6 additions.

(v) Assuming a block-wise implementation of matrix multiplication (with block size 16) is applied, the multiplication of two K-by-K matrices requires K/16 steps with K^3 multiplications and K^3 additions.

(vi) Assume T is a general triangular matrix with dimension n and all-one diagonal elements, u and z are vectors with length n, then solving z from equation Tz = u requires n steps with n(n − 1)/2 multiplications and n(n − 1)/2 additions.

(vii) Assume T is a banded triangular matrix with dimension n and half-bandwidth K and all-one diagonal elements, u and z are vectors with length n, then solving z from equation Tz = u requires n steps with nK multiplications and nK additions.

The decoupled approach. As the simplest preconditioning approach, which approximates the solution x by g in Eq. 3.3, the decoupled approach only needs to factorize the block diagonal matrix D in Eq. 3.3. Factorizing the matrix D requires independently and concurrently factorizing P banded matrices of dimension n, which is of category (iii) above. Thus it requires in total 2N/P steps, NK · (K + 1) multiplications and NK^2 additions.
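As a small illustration of this estimate, the helper below evaluates the decoupled-approach costs for given N, K and P; the struct and function names are made up for this sketch and are not part of SaP::GPU.

// Step and flop estimates of the decoupled approach (category (iii) applied
// concurrently to P blocks), as derived above.
struct Cost { long long steps, mults, adds; };

Cost decoupled_cost(long long N, long long K, long long P) {
    Cost c;
    c.steps = 2 * N / P;          // 2n steps per block, blocks processed concurrently
    c.mults = N * K * (K + 1);    // NK(K+1) multiplications in total
    c.adds  = N * K * K;          // NK^2 additions in total
    return c;
}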

The coupled approach. Besides factorizing the matrix D, in the coupled approach extra computations are required to calculate the spikes V_i and W_i in Eq. 3.5a and to factorize the reduced subblocks R_i in Eq. 3.7b. There are (P − 1) R_i's, all of which are of size 2K; thus factorizing the R_i's, which is of category (iv), costs 2K steps, together with (P − 1) · 2K · (4K^2 − 1)/3 multiplications and (P − 1) · K · (2K − 1) · (4K − 1)/3 additions. For calculating the spikes V_i and W_i, depending on whether the entire spikes V_i and W_i have to be calculated, the computational cost is:

(a) Computing entire spikes. In this case, V and W are solved from Eq. 3.2. There are two sets of banded triangular systems, and for each set, there are in total 2(P − 1)K right-hand-sides of size n, which are of category vii. Hence, computing entire spikes requires 2N/P steps with 4(P − 1)nK2 multiplications and 4(P − 1)nK2 additions.

(b) Partially computing spikes. In the scenario where only V_i^(b) and W_i^(t) in Eq. 3.5a are computed, a UL-factorization of D is also required, and we also need to solve two sets of general triangular systems; for each set, there are in total 2(P − 1)K right-hand-sides of size K. For the UL factorization, the costs are 2N/P steps, NK · (K + 1) multiplications and NK^2 additions; for the solution of V_i^(b) and W_i^(t), the costs are 2K steps with 2(P − 1)K^2(K − 1) multiplications and 2(P − 1)K^2(K − 1) additions.

To summarize, for the coupled approach, the overall computation costs are

2N/P + 2K + 2N/P = 4N/P + 2K

steps with

NK · (K + 1) + (P − 1) · 2K · (4K^2 − 1)/3 + 4(P − 1)nK^2 ≈ (5N − 4n)K^2

multiplications and

NK^2 + (P − 1) · K · (2K − 1) · (4K − 1)/3 + 4(P − 1)nK^2 ≈ (5N − 4n)K^2

additions if computing the entire spikes is necessary, or

2N/P + 2K + 2N/P + 2K = 4N/P + 4K

steps with

NK · (K + 1) + (P − 1) · 2K · (4K^2 − 1)/3 + NK · (K + 1) + 2(P − 1)K^2(K − 1) ≈ 2NK^2

multiplications and

NK^2 + (P − 1) · K · (2K − 1) · (4K − 1)/3 + NK^2 + 2(P − 1)K^2(K − 1) ≈ 2NK^2

additions if partial spikes are computed.

The BCR approach. Since BCR is a recursive approach, we first calculate the computation costs of one iteration, and then sum the computation costs over all iterations to obtain the overall computation cost. In iteration j, the computation cost of Eqs. 3.16 is summarized below.

• Diagonal subblock factorization (Eq. 3.16a). Similar to the decoupled approach, the costs are 2N/P = 2K steps, NK · (K + 1)/2^(j+1) ≈ NK^2/2^(j+1) multiplications and NK^2/2^(j+1) additions.

• Calculating coupled blocks (Eqs. 3.16b and 3.16c). Similar to partially computing the spikes in the coupled approach, the costs are 2K steps with (P − 1)K^2(K − 1)/2^j ≈ NK^2/2^j multiplications and NK^2/2^j additions.

• Assembling the next iteration (Eqs. 3.16d, 3.16e and 3.16f). These include P/2^(j−1) concurrent K-by-K matrix multiplications (of category (v)), followed by P/2^j concurrent K-by-K matrix additions. Hence, the total costs are K/16 steps with PK^3/2^(j−1) = NK^2/2^(j−1) multiplications and PK^3/2^(j−1) + PK^2/2^j ≈ NK^2/2^(j−1) additions.

In the j-th iteration, the costs are 65K/16 steps with 7NK^2/2^(j+1) multiplications and 7NK^2/2^(j+1) additions. Summing over all iterations, the overall costs are 65K · log_2 P/16 steps with 7NK^2 multiplications and 7NK^2 additions.

Remarks on the preconditioning strategies.

(1) In terms of the number of floating point operations, the cost of the coupled and BCR approaches is two to seven times that of the decoupled approach. For a well-conditioned system, the decoupled approach is therefore promising in terms of performance. The drawback of the decoupled approach is its relatively poor preconditioning quality, which results in a higher count of Krylov iterations than for the other approaches.

(2) For the coupled approach, in terms of the number of floating point operations, the cost of calculating the entire spikes is roughly 2.5 times that of partially calculating the spikes, given that the two methods share the same K. As computing the full spikes is typically combined with the application of the third-stage reordering, the costs of the coupled approach in this scenario can be rewritten as (5N − 4n)K·K_new multiplications and (5N − 4n)K·K_new additions. This sheds light upon when the third-stage reordering is helpful in reducing the cost of the coupled approach: when it can bring down the average half-bandwidth over all partitions by at least 60%.

(3) The BCR approach is attractive for two reasons: first, it produces the exact solution if no zero pivot is encountered during the factorization of the main diagonal blocks; second, the number of synchronization steps in the BCR approach is smaller than that of the decoupled and coupled approaches (O(K log P) vs. O(n)).

3.2 SaP for general sparse linear systems

The discussion focuses next on solving A_s x = b, where A_s ∈ R^(N×N) is assumed to be a sparse matrix. The salient attribute of the solution strategy is its fallback on the dense banded approach described in §3.1. Specifically, an aggressive row and column permutation process is employed to transform A_s into a banded matrix A that has a large d and small K. Although the reordered matrix will remain sparse within the band, it will be regarded as dense banded and LU- and/or UL- and/or Cholesky-factored accordingly. For matrices A_s that are either asymmetric or have low d, a first set of row permutations is applied as QA_s x = Qb, to either maximize the number of nonzeros on the diagonal (maximum traversal search) (Duff, 1981), or maximize the product of the absolute values of the diagonal entries (Duff and Koster, 1999, 2001). Both reordering algorithms are implemented using a depth first search with a look-ahead technique similar to the one in the Harwell Software Library (HSL) (Hopper, 1973).

While the purpose of the first reordering QA_s is to render the permuted matrix diagonally "heavy", a second reordering seeks to reduce K by using the traditional Cuthill-McKee (CM) algorithm (Cuthill and McKee, 1969). Since the diagonal entries should not be relocated, the second permutation is applied to the symmetric matrix QA_s + A_s^T Q^T. Following these two reorderings, the resulting matrix A is split to obtain A_1 through A_P. A third CM reordering is then applied independently to each A_i for further reduction of the bandwidth. While straightforward to implement in SaP::GPU-D, this third-stage reordering in SaP::GPU-C mandates computation of the entire spikes, an operation that can significantly increase the memory footprint and flop count of the numerical solution. Note that the third-stage reordering in SaP::GPU-C renders the UL factorization superfluous, since computing only the top of a spike is insufficient. These two reordering methodologies will be detailed in Chapter 4.

If A_i is diagonally dominant, the LU and/or UL factorization can be safely carried out without pivoting (Golub and Loan, 1980). Adopting the strategy used in PARDISO (Schenk et al., 2001), we always perform factorizations of the diagonal blocks A_i without pivoting but with pivot boosting. Specifically, if a pivot becomes smaller than a threshold value, it is boosted to a small, user-controlled value. This yields a factorization of a slightly perturbed diagonal block, L_i U_i = A_i + δA_i, where ‖δA_i‖ = O(u‖A‖) and u is the unit roundoff (Manguoglu et al., 2009).
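The pivot-boosting rule itself amounts to a one-line test during the factorization; the following C++ fragment is a hedged sketch of that rule (the function name and the way the threshold is passed are illustrative, not the SaP::GPU interface).

#include <cmath>

// If a pivot falls below the threshold, boost it to a small user-controlled
// value of the same sign; the factors then correspond to A_i + delta(A_i).
inline double boost_pivot(double pivot, double boost_value) {
    if (std::fabs(pivot) < boost_value)
        return (pivot < 0.0) ? -boost_value : boost_value;
    return pivot;
}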

3.3 SaP components and computational flow

The SaP::GPU dense banded linear system solver is relatively straightforward to implement. SaP::GPU branches based on the two approaches:

• For SaP::GPU-C, upon partitioning A into diagonal blocks A_i, each A_i is subject to an LU factorization that requires an amount of time T_LU. Next, in T_BC time, the coupling block matrices B_i and C_i are extracted on the GPU. The V_i and W_i spikes are subsequently computed in an operation that requires T_SPK time. Afterwards, in T_LUrdcd time, the spikes are truncated and the steps outlined in Eq. (3.9) are taken to produce the intermediary values x̃_i^(t) and x̃_i^(b).

• For SaP::GPU-B, in the j-th step, the LU factorization of the odd-indexed A_i^(j) requires an amount of time T_LU^(j). Then, in T_GH^(j) time, the intermediate odd-indexed block matrices G_i^(j) and H_i^(j) are computed. Afterwards, SaP::GPU consumes T_cmpNxt^(j) time to calculate A_i^(j+1), B_i^(j+1) and C_i^(j+1) for the (j + 1)-th step. Overall, the LU factorization requires T_LU = Σ_j T_LU^(j) time, computing G_i and H_i consumes T_GH = Σ_j T_GH^(j) time, and acquiring the reduced system of the next step takes T_cmpNxt = Σ_j T_cmpNxt^(j) time.

At this point, the pre-processing step is over and the factorizations of the A_i and, for SaP-C, the factorizations of the R̄_i are available for preconditioning during the iterative phase of the solution. The amount of time spent iterating is T_Kry, the iterative methods considered being BiCGStab(2) and conjugate gradient. The SaP::GPU sparse linear system solution is slightly more convoluted at the front end. A sequence of two permutations, DB requiring T_DB and CM requiring T_CM time, is carried out to increase the size of the diagonal elements and reduce the bandwidth.

An additional amount of time T_Drop might be spent to drop off-diagonal elements in order to decrease the bandwidth of the reordered A matrix. T_Dtransf is then used to transfer all matrix entries and reordering information to the GPU. An amount of time T_Asmbl is spent on the GPU in book-keeping required to turn the reordered sparse matrix into a dense banded matrix. It is worth noting that T_Dtransf does not include the overhead associated with moving data back and forth between the CPU and GPU during the reordering process; that overhead is already counted in T_DB and T_CM.

Figure 3.2: Computational flow for SaP::GPU: (a) the dense path SaP::GPU_d and (b) the sparse path SaP::GPU_s. SaP::GPU-D_d is the decoupled dense solver; SaP::GPU-C_d is the coupled dense solver; SaP::GPU-B_d is the dense solver applying the BCR algorithm. The analogue sparse solvers have an s subscript. The 3rd-stage reordering in square brackets is optional but typically leads to substantial reductions in time to solution.

The definition of the times to solution is tied to the computational flow in Fig. 3.2, where subscripts d and s are used to differentiate between the dense and sparse paths, respectively. For a sparse linear system solve that uses the coupled approach; i.e., SaP::GPU-C_s, the total time is T_TotSparse = T_PrepSp + T_TotDense, where T_PrepSp = T_DB + T_CM + T_Dtransf + T_Drop + T_Asmbl and T_TotDense = T_LU + T_BC + T_SPK + T_LUrdcd + T_Kry. For SaP::GPU-D_d, owing to the decoupled nature of the solution, T_TotDense = T_LU + T_Kry, where T_LU includes a CM process that reduces the bandwidth of each A_i. For SaP::GPU-B_d, T_TotDense = T_LU + T_GH + T_cmpNxt, where T_LU, T_GH and T_cmpNxt are all summations over the steps.

4 reordering methodologies

In this chapter, we present the details of two reordering algorithms: one to bring elements large in absolute value to the diagonal, and one to bring off-diagonal elements closer to the diagonal (i.e., bandwidth reduction). The former reordering algorithm does not have a dedicated name; herein, we call it the Diagonal Boosting (DB) algorithm. The latter is named after its inventors, E. Cuthill and J. McKee (Cuthill and McKee, 1969), as the Cuthill-McKee (CM) algorithm. An optional third-stage CM reordering, aimed at further bandwidth reduction in each partition separately, is described in §4.3.

4.1 Diagonal Boosting

The reordering algorithm for bringing large elements to the diagonal was first proposed by Duff (Duff, 1981). That work aimed at maximizing the number of nonzeros on the diagonal. Several variants of this algorithm were later proposed in Duff and Koster (1999, 2001) for different target goals (e.g., maximizing the absolute value of the bottleneck, the sum, the product, or other metrics of the diagonal entries). Regardless of the variant, these algorithms essentially solve a minimum bipartite perfect matching problem. Among these variants, we chose the one which attempts to maximize the product of the absolute values of the diagonal entries. This serves as a heuristic to boost the diagonal dominance, which reduces the likelihood of encountering a zero pivot during factorization. The following sketches out this variant of the reordering algorithm. Given a matrix {a_ij}_(n×n), the problem is to find a column permutation σ such that Π_{i=1}^{n} |a_{iσ_i}| is maximized. Denoting a_i = max_j |a_ij|, and noting that a_i is invariant under σ, we are to minimize

log Π_{i=1}^{n} (a_i / |a_{iσ_i}|) = Σ_{i=1}^{n} log (a_i / |a_{iσ_i}|) = Σ_{i=1}^{n} (log a_i − log |a_{iσ_i}|)      (4.1)

The reordering problem is reduced to minimum bipartite perfect matching in the following way: given a bipartite graph G_C = (V_R, V_C, E), the weight c_ij of the edge between node i ∈ V_R and node j ∈ V_C is defined as

c_ij = log a_i − log |a_ij|   (a_ij ≠ 0)
c_ij = ∞                      (a_ij = 0)      (4.2)

If we are able to find a minimum bipartite perfect matching σ such that Σ_i c_{iσ_i} is minimized, then, by the reduction above, Π_{i=1}^{n} |a_{iσ_i}| is maximized, which is exactly the goal of the reordering algorithm. Numerous studies have been done on minimum bipartite perfect matching (Carpaneto and Toth, 1980; Kuhn, 1955; Burkhard and Derigs, 1980; Carraresi and Sodini, 1986; Derigs and Metz, 1986; Jonker and Volgenant, 1987). A key concept in these works is the shortest augmenting path P. An augmenting path P for a matching M starting at a certain row node i (or column node j) is said to be shortest if, for any augmenting path P' starting at row node i (or column node j) for M, the weight of M ⊕ P is no greater than that of M ⊕ P'. The alternating length of path P is defined as

l(P) := c(M ⊕ P) − c(M) = c(P\M) − c(M ∩ P) ,      (4.3)

where c(M) is the weight of matching M, defined as

c(M) = Σ_{(i,j)∈M} c_ij      (4.4)

Furthermore, works by Kuhn (1955) and Edmonds and Karp (1972) state that M has minimum weight if and only if there exist dual variables u_i for i ∈ V_R and v_j for j ∈ V_C such that

u_i + v_j ≤ c_ij   ((i, j) ∈ E)
u_i + v_j = c_ij   ((i, j) ∈ M)      (4.5)

Define the reduced weight of edge (i, j) as

c̄_ij = c_ij − u_i − v_j      (4.6)

With the observation that any "valid" matching M satisfies c̄(M) = Σ_{(i,j)∈M} c̄_ij = 0, we conclude that, for any augmenting path P in the reduced graph Ḡ_C, l̄(P) = c̄(P\M) ≥ 0. Therefore, finding the shortest augmenting path in G_C is equivalent to finding the shortest path in Ḡ_C, whose edge weights are all non-negative.
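As a concrete illustration of Eq. 4.2, the sketch below builds the bipartite edge weights from a CSR matrix; the CSR container and function names are assumptions made for this example and are not SaP::GPU data structures. Stored zeros receive weight ∞, matching Eq. 4.2.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Csr {
    int n;
    std::vector<int>    row_ptr, col_idx;   // CSR sparsity pattern
    std::vector<double> val;                // nonzero values a_ij
};

// Returns one weight c_ij = log(a_i) - log|a_ij| per stored entry, where
// a_i = max_j |a_ij| is the row-wise maximum absolute value.
std::vector<double> db_edge_weights(const Csr& A) {
    std::vector<double> c(A.val.size());
    for (int i = 0; i < A.n; ++i) {
        double ai = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            ai = std::max(ai, std::fabs(A.val[k]));
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            c[k] = (A.val[k] != 0.0) ? std::log(ai) - std::log(std::fabs(A.val[k]))
                                     : std::numeric_limits<double>::infinity();
    }
    return c;
}

The matching is then computed on this weighted bipartite graph via shortest augmenting paths over the reduced weights c̄_ij, as detailed in §6.2.1.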

4.2 Bandwidth Reduction

Symmetric reordering operations that reduce the bandwidth of a sparse matrix are a crucial element for a linear solver whose performance depends on both the dimension and the bandwidth of a matrix. Whether QA_s is sparse or not, there are P − 1 pairs of always dense spikes, each of dimension N_i × K. They need to be stored unless one employs an LU and UL factorization of A_i to retain only the appropriate bottom and top components. Large K values pose memory challenges; i.e., storage and data movement, that limit the size of the problems that can be tackled. Moreover, the spikes need to be computed by solving multiple right-hand side linear systems with A_i coefficient matrices. There are 2K such systems for each of the P − 1 pairs of spikes. Evidently, a low K is highly desirable. However, finding the lowest half-bandwidth K by symmetrically reordering a sparse matrix is NP-hard. The CM reordering provides simple and oftentimes effective heuristics to tackle this problem. Moreover, as the CM reordering yields symmetric permutations, it will not displace the "heavy" diagonal terms obtained during the DB step. However, to obtain a symmetric permutation, one has to start with a symmetric matrix. To this end, unless A is already symmetric and does not call for a DB step (which is the case, for instance, when A is symmetric positive definite), the matrix pattern passed over for CM reordering is the pattern of A + A^T. Given a symmetric n × n matrix with nnz non-zero entries, we visualize it as the adjacency matrix of an undirected graph G. In the following discussion, G is assumed to be connected, i.e. G contains only one connected component. For a graph consisting of two or more connected components, those components are handled by CM separately. Given an undirected graph G, the basic CM first picks a random node and adds it to the work list. The algorithm then repeatedly extracts a node, sorts its not-yet-added neighbors by non-descending vertex degree, and appends them to the work list, until all vertices have been added to and removed from the work list once. In other words, CM is essentially a BFS where neighboring vertices are visited in order from lowest to highest vertex degree. Algorithm 1 sketches out the process of the basic CM.

Algorithm 1 The process of the basic CM
Initialize an empty worklist L
Pick a random node n_s and add it to L
while L is not empty do
  Extract the first node n_0 from L
  Find all neighboring nodes of n_0 which have not been added to L yet; suppose these nodes form a list L'
  Sort L' by ascending vertex degree, and extend L with L'
end while

Note that Alg. 1 is called the basic CM in the sense that it differs from the actual CM in two regards:

1) The selection of the starting node n_s. With different selections of the starting node, the height of the BFS tree will differ, which affects the quality of CM (i.e., the half-bandwidth of the resulting matrix). Heuristically speaking, the taller the BFS tree is, the fewer nodes two adjacent levels contain, and the lower the half-bandwidth will be (note that nodes at two levels which are not adjacent are not connected, and thus do not influence the half-bandwidth). Thus the selection of the starting node is crucial. In the actual CM, several starting nodes are attempted, each resulting in a tentative reordering. The reordering which leads to the minimum half-bandwidth is selected as the result of CM.

2) For the sake of efficiency, the actual CM never sorts a node list L'. Instead, in a pre-processing stage, it performs a global sort so that, for any node, its adjacent nodes are stored in order of ascending vertex degree.

In §6.2.2, the actual CM reordering method is detailed; a minimal sketch of the basic CM of Alg. 1 is given below.
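For concreteness, the following C++ sketch implements the basic CM of Alg. 1 on an adjacency-list graph and returns the visit order (i.e., the permutation). The graph layout and the single, user-supplied starting node are assumptions of this sketch; the actual CM tries several starting nodes and pre-sorts neighbors globally, as discussed above and in §6.2.2.

#include <algorithm>
#include <queue>
#include <vector>

std::vector<int> basic_cm(const std::vector<std::vector<int>>& adj, int start) {
    const int n = static_cast<int>(adj.size());
    std::vector<int>  order;                 // CM visit order (new labels)
    std::vector<bool> added(n, false);
    std::queue<int>   worklist;

    worklist.push(start);
    added[start] = true;
    while (!worklist.empty()) {
        int u = worklist.front();
        worklist.pop();
        order.push_back(u);

        // Collect not-yet-added neighbors and sort them by ascending vertex degree.
        std::vector<int> next;
        for (int v : adj[u])
            if (!added[v]) { next.push_back(v); added[v] = true; }
        std::sort(next.begin(), next.end(),
                  [&](int a, int b) { return adj[a].size() < adj[b].size(); });
        for (int v : next) worklist.push(v);
    }
    return order;   // assumes a connected graph, as in §4.2
}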

4.3 Third-stage reordering

The DB–CM reordering sequence yields diagonally-heavy matrices of smaller bandwidth. The band itself, however, can be very sparse. The purpose of the third-stage CM reordering is to further reduce the bandwidth within each A_i and reduce the sparsity within the band. Consider, for instance, the matrix ANCF88950 that comes from structural dynamics (Serban et al., 2015). It has 513,900 nonzeros, N = 88,950, and an average of 5.78 non-zero elements per row. After DB–CM reordering with no drop-off, the resulting banded matrix has a half-bandwidth K = 205. The band itself is very sparse, with a fill-in of only 0.7% within the band, as illustrated in Fig. 4.1. In its default solution, SaP::GPU constructs a block banded matrix where each diagonal block A_i, obtained after the initial DB–CM reorderings, is allowed to have a different bandwidth. This is achieved using another CM pass, independently and in parallel for each A_i. Applying this strategy to ANCF88950, using P = 16 partitions, the half-bandwidth is reduced for all partitions to values no higher than K = 141, while the fill-in within the band becomes approximately 3%. Note that this third-stage reordering does nothing to reduce the column-width of the spikes. However, it helps in two respects: a smaller memory footprint for the LU/UL factors, and less factorization effort. These are important side effects, since the LU/UL GPU factorization is currently done in-core, considering A_i to be dense within the band.

Figure 4.1: Sparsity pattern for the original (left) and reordered (right) ANCF matrix of dimension N = 88,950. The sparsity pattern for a 2000 × 2000 diagonal block of the reordered matrix is displayed in the inset.

5 krylov subspace method

With respect to the "influence on the development and practice of science and engineering in the 20th century", Krylov space methods are considered one of the ten most important classes of numerical methods (Dongarra and Sullivan, 2000). Sparse linear systems of equations can be solved either by so-called sparse direct solvers, which are clever variations of Gauss elimination, or by iterative methods. In the last thirty years, sparse direct solvers have been tuned to perfection: on the one hand by finding strategies for permuting equations and unknowns to guarantee a stable LU decomposition and small fill-in in the triangular factors, and on the other hand by organizing the computation so that optimal use is made of the hardware, which nowadays often consists of parallel computers whose architecture favors block operations with data that are locally stored or cached (Gutknecht, 2007). The iterative methods that are today applied for solving large-scale linear systems are mostly preconditioned Krylov (sub)space solvers. Krylov subspace methods have become a very useful and popular tool for solving large sets of linear and nonlinear equations, and large eigenvalue problems. One of the reasons for their popularity is their simplicity and generality. Moreover, one of their properties, namely flexible preconditioning, allows using not only incomplete decompositions like ILU(p) or IC(p), but also any Krylov space solver as a preconditioner. These combinations, called inner-outer iteration methods, may be very effective (Simoncini and Szyld, 2002). These methods have been increasingly accepted as efficient and reliable alternatives to the more expensive methods (for instance direct methods) that are usually employed for solving dense problems. This trend is likely to accelerate as models are becoming more complex and give rise to larger and larger matrix problems (Saad, 1989). The main idea of Krylov subspace methods is to generate a basis of the Krylov subspace K_m (m is defined as the dimension of the Krylov subspace):

K_m(A, r_0) = span{r_0, A r_0, A^2 r_0, ..., A^(m−1) r_0}      (5.1)

and to generate a sequence of approximations, x_m, of the following form:

x_m = x_0 + α_0 b + α_1 Ab + α_2 A^2 b + ... + α_(m−1) A^(m−1) b (≈ A^(−1) b)      (5.2)

The dimension of the Krylov subspace is increased at each iteration. Therefore, the approximate solution x_m becomes increasingly more accurate and is expected to be exact after at most n iterations, when the Krylov subspace spans all of R^n. In practice, a solution of acceptable accuracy is often reached in far fewer than n iterations. For more details regarding Krylov subspace methods, interested readers may refer to Saad (1989); Gutknecht (2007). The remainder of §5 is organized as follows: in §5.1, we introduce the concept of preconditioning; in §5.2 and §5.3, we describe the two specific Krylov subspace methods implemented in SaP::GPU: CG for Symmetric Positive Definite (SPD) systems, and BiCGStab with its generalization for general systems.

5.1 A brief introduction to preconditioning

Consider the linear system Ax = b to be solved by some iterative solver. In general, when using an iterative Krylov method to solve the system, it is necessary to precondition the system to get acceptable efficiency (reduce the number of Krylov iterations). Preconditioning can be done on:

• the left: (M^(−1) A) x = M^(−1) b

• the right: (A M^(−1)) (M x) = b

• both sides: (M_L^(−1) A M_R^(−1)) (M_R x) = M_L^(−1) b

Any of the preconditioning methods will give the same solution as the original system so long as the preconditioner matrix M (for left- or right-preconditioning) or the product M_L M_R (for preconditioning on both sides) is nonsingular. Typically there is a trade-off in the choice of preconditioner. In order to actually improve the efficiency of the Krylov method, the matrix M (for left- or right-preconditioning) or the product M_L M_R (for preconditioning on both sides) should in some sense approximate the matrix A. Yet at the same time, in order to be overall cost-effective, the matrix M or the matrices M_L and M_R should be reasonably efficient to evaluate (factorize) and solve; i.e., one should be able to cheaply solve systems of the form Mz = w in order to obtain products of the form z = M^(−1) w. Therefore, a preconditioner is chosen that is a reasonable approximation of the matrix A, in an attempt to achieve a minimal number of linear iterations while keeping M^(−1) (for left- or right-preconditioning) or M_R^(−1) M_L^(−1) (for preconditioning on both sides) as simple as possible. It is worth noting that the preconditioner matrix M (for left- and right-preconditioning) or M_R and M_L (for preconditioning on both sides) need not be formed and/or stored explicitly. Only the action of applying the preconditioner solve operation M^(−1) or M_R^(−1) M_L^(−1) to a given vector needs to be computed in iterative methods.
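The practical consequence of the discussion above is that an iterative method only ever needs the action z = M^(−1) w. The following C++ sketch shows one way such a preconditioner interface could look, together with a trivial diagonal (Jacobi) example; the interface and the example are illustrative assumptions and not the SaP::GPU API.

#include <cstddef>
#include <utility>
#include <vector>

// A preconditioner exposed purely through its solve action M z = w.
struct Preconditioner {
    virtual ~Preconditioner() = default;
    virtual void solve(const std::vector<double>& w, std::vector<double>& z) const = 0;
};

// Example: a diagonal (Jacobi) preconditioner, for which applying M^{-1} is a scaling.
struct JacobiPreconditioner : Preconditioner {
    std::vector<double> diag;   // diagonal of A
    explicit JacobiPreconditioner(std::vector<double> d) : diag(std::move(d)) {}
    void solve(const std::vector<double>& w, std::vector<double>& z) const override {
        z.resize(w.size());
        for (std::size_t i = 0; i < w.size(); ++i)
            z[i] = w[i] / diag[i];
    }
};

In SaP::GPU, the role of M is played by the split banded factorization of §3.1, and the solve action consists of groups of forward eliminations and backward substitutions.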

5.2 Conjugate Gradient method

The Conjugate Gradient (CG) method was proposed by M. Hestenes and E. Stiefel in 1952 (Hestenes and Stiefel, 1952). It requires the matrix A to be square, symmetric, and positive definite. As a mature, well-studied method, its description can also be found in Avriel (2003); Saad (2003a); Golub and Van Loan (2012). J. Shewchuk also gave an interesting introduction to CG (Shewchuk, 1994). SaP::GPU applies the CG solver when a linear system is specified to be symmetric positive-definite. In the remainder of this section, some background on the conjugate gradient method is provided (§5.2.1), and then the CG algorithm is detailed, highlighting its computationally intensive components (§5.2.2).

5.2.1 Backgrounds

Two vectors u and v are orthogonal if they satisfy

(u, v) ≡ u^T v = 0

Two vectors u and v are conjugate with respect to an N × N matrix A if they satisfy

(u, v)_A ≡ u^T A v = 0

The 2-norm of a vector u is defined as

||u||_2 = √(u^T u)

Suppose that

P = {p_0, ..., p_(N−1)}

is a set of N mutually conjugate vectors; then P forms a basis for R^N, and the solution x_true of the system Ax = b can be expressed in this basis:

x_true = Σ_{i=0}^{N−1} α_i p_i      (5.3)

Based on Eq. 5.3, some calculations can be done so that each α_k can be expressed as:

α_k = (p_k, b) / (p_k, p_k)_A ,   k = 0, ..., N − 1      (5.4)

If the vectors p_k are selected carefully, not all p_k ∈ P are required to obtain a good approximation of x_true. Suppose that we make a series of guesses of the solution: x_0, x_1, ..., x_k, ... Without loss of generality, suppose the initial guess is x_0 = 0 (otherwise, the system Ax = b − Ax_0 is considered). Let r_k be the residual at the k-th step:

r_k = b − A x_k      (5.5)

The conjugate gradient method attempts to construct p_k from r_k in resemblance to the Gram-Schmidt process (Gagliano and Qufmica):

p_k = r_k                                              (k = 0)
p_k = r_k − Σ_{i<k} [(p_i, r_k)_A / (p_i, p_i)_A] p_i   (k > 0)      (5.6)

Then the (k + 1)-th guess of the solution is constructed based on x_k:

x_(k+1) = x_k + α_k p_k      (5.7)

and the (k + 1)-th residual can be expressed as:

r_(k+1) = r_k − α_k A p_k      (5.8)

As stated, the algorithm requires storage of all previous search directions and residual vectors, as well as many matrix-vector multiplications, and thus can be computationally expensive. However, a closer analysis of the algorithm (Hestenes and Stiefel, 1952) shows that only r_k, p_k, and x_k are needed to construct r_(k+1), p_(k+1), and x_(k+1) (see Chapter 5 in Hestenes and Stiefel (1952) for details):

p_(k+1) = r_(k+1) + β_k p_k ,      (5.9)

where β_k ≡ (r_(k+1)^T r_(k+1)) / (r_k^T r_k). In addition, α_k can be expressed as

α_k = (r_k, r_k) / (p_k, p_k)_A      (5.10)

5.2.2 Further discussion

In SaP::GPU, the preconditioned version of the conjugate gradient method is applied with SaP-D or SaP-C (in the current version of SaP::GPU, SaP-B cannot be combined with CG). Denote the preconditioner as M. The preconditioned conjugate gradient method is detailed in Alg. 2.

Algorithm 2 Conjugate gradient method

1: r_0 ← b
2: z_0 ← M^(−1) r_0
3: p_0 ← z_0
4: k ← 0
5: loop
6:   α_k ← (z_k^T r_k) / (p_k^T A p_k)
7:   x_(k+1) ← x_k + α_k p_k
8:   r_(k+1) ← r_k − α_k A p_k
9:   if ||r_(k+1)||_2 ≤ c · ||r_0||_2 (c is called the relative tolerance, which is user-specified) then
10:    Exit loop
11:  end if
12:  z_(k+1) ← M^(−1) r_(k+1)
13:  β_k ← (z_(k+1)^T r_(k+1)) / (z_k^T r_k)
14:  p_(k+1) ← z_(k+1) + β_k p_k
15:  k ← k + 1
16: end loop
17: The result is x_(k+1)

The only computationally intensive operation is the preconditioning solve, which refers to Lines 2 and 12 in Alg. 2. In each preconditioning solve stage, one or two groups of forward elimination/backward substitution (one for SaP-D and two for SaP-C) are required, which correspond to two or four operations of type (vii) stated in §3.1.4. Other operations in Alg. 2 include the Sparse Matrix-Vector multiplication (SpMV, Line 8 in Alg. 2) and vector operations (for instance, the vector "Alpha X Plus Y" (AXPY) in Line 7 and the vector copy in Line 3).
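To tie Alg. 2 to these building blocks, here is a compact C++ sketch of the preconditioned CG loop in which the SpMV and the preconditioning solve are supplied as callbacks; the function signatures are assumptions made for illustration (SaP::GPU executes these operations as GPU kernels).

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

Vec pcg(const std::function<void(const Vec&, Vec&)>& spmv,     // y = A x      (SpMV)
        const std::function<void(const Vec&, Vec&)>& precond,  // z = M^{-1} r (preconditioning solve)
        const Vec& b, double rel_tol, int max_iter) {
    const std::size_t n = b.size();
    auto dot = [](const Vec& u, const Vec& v) {
        double s = 0.0;
        for (std::size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
        return s;
    };
    Vec x(n, 0.0), r = b, z(n), p(n), Ap(n);
    precond(r, z);                          // Line 2 of Alg. 2
    p = z;
    const double stop = rel_tol * std::sqrt(dot(r, r));
    double zr = dot(z, r);
    for (int k = 0; k < max_iter; ++k) {
        spmv(p, Ap);                        // Line 8 uses A p_k
        const double alpha = zr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }  // AXPY
        if (std::sqrt(dot(r, r)) <= stop) break;                                              // Line 9
        precond(r, z);                      // Line 12
        const double zr_new = dot(z, r);
        const double beta = zr_new / zr;    // Line 13
        zr = zr_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];                        // Line 14
    }
    return x;
}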

5.3 Biconjugate Gradient Stabilized method and its generalization

The conjugate gradient method is not suitable for asymmetric systems because the residual vectors cannot be made orthogonal with short recurrences (Voevodin, 1983; Faber and Manteuffel, 1984). Unlike the conjugate gradient method, the biconjugate gradient method (BiCG) does not require the matrix A to be symmetric (or, strictly speaking, self-adjoint, meaning that A is Hermitian), but instead one needs to perform multiplications by the conjugate transpose of A (Fletcher, 1976). BiCGStab, short for BiConjugate Gradient Stabilized, was proposed by Van der Vorst (Van der Vorst, 1992). It is a variant of BiCG and has faster and smoother convergence than the original BiCG as well as other variants such as the conjugate gradient squared method (CGS) (Sonneveld, 1989). However, as numerical experiments confirm, if A has nonreal eigenvalues with an imaginary part that is large relative to the real part, the method might stagnate or even break down (Sleijpen and Fokkema, 1993). In 1993, G. Sleijpen and D. Fokkema proposed a family of methods generalized from BiCGStab: BiCGStab(ℓ). Instead of forming a 1-step Minimal Residual polynomial, or MR-polynomial (i.e., a polynomial of degree one), as BiCGStab does, BiCGStab(ℓ) forms an ℓ-degree MR-polynomial p_k after each ℓ-th step, where k is the iteration number; in the intermediate steps j = kℓ + i, i = 1, ..., ℓ − 1, simple factors t^i are used, and the p_k are reconstructed from these powers. In this way, certain near-breakdowns in the process can be avoided (Sleijpen and Fokkema, 1993).

In SaP::GPU, we use BiCGStab(2); i.e., ℓ is set to two. BiCGStab(2), preconditioned with SaP-C, SaP-D or SaP-B, is applied to the solution of general systems, which are either asymmetric or indefinite. The preconditioned BiCGStab(ℓ) method is listed in Alg. 3. Similar to the conjugate gradient method, the preconditioning solve (Lines 20, 25, 28, 53 and 63) is the most time-consuming operation. The profiling results in §5.4 demonstrate this fact.

Algorithm 3 BiCGStab(ℓ) method

1: ρ_0 ← 1
2: α ← 0
3: ω ← 1
4: r_0 ← b
5: r̂_0 ← r_0
6: u_0 ← 0
7: k ← 0
8: loop
9:   ũ_(k,0) ← u_k
10:  r̃_(k,0) ← r_k
11:  x̃_k ← x_k
12:  ρ_0 ← −ω · ρ_0
13:  for all j ← 0, ..., ℓ − 1 do
14:    ρ_1 ← r̃_(k,j)^T r̂_0
15:    β ← α · ρ_1 / ρ_0
16:    ρ_0 ← ρ_1
17:    for all i ← 0, ..., j do
18:      ũ_(k,i) ← r̃_(k,i) − β · ũ_(k,i)
19:    end for
20:    ũ_(k,j+1) ← A M^(−1) ũ_(k,j)
21:    α ← ρ_0 / (ũ_(k,j+1)^T r̂_0)
22:    for all i ← 0, ..., j do
23:      r̃_(k,i) ← r̃_(k,i) − α · ũ_(k,i+1)
24:    end for
25:    r̃_(k,j+1) ← A M^(−1) r̃_(k,j)
26:    x̃ ← x̃ + α ũ_(k,0)
27:    if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
28:      r̃_(k,0) ← b − A M^(−1) x̃
29:      if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
30:        Exit loop
31:      end if
32:    end if
33:  end for
34:  for all j ← 1, ..., ℓ do
35:    for all i ← 1, ..., j − 1 do
36:      τ_(i,j) ← (r̃_(k,j)^T r̃_(k,i)) / (r̃_(k,j)^T r̃_(k,j))
37:      r̃_(k,j) ← r̃_(k,j) − τ_(i,j) r̃_(k,i)
38:    end for
39:    γ'_j ← (r̃_(k,j)^T r̃_(k,0)) / (r̃_(k,j)^T r̃_(k,j))
40:  end for
41:  γ_ℓ ← γ'_ℓ
42:  ω ← γ_ℓ
43:  for all j ← ℓ − 1, ..., 2 do
44:    γ_j ← γ'_j
45:    for all i ← j + 1, ..., ℓ − 1 do
46:      γ''_j ← γ''_j + τ_(i,j) · γ_(i+1)
47:    end for
48:  end for
49:  x̃ ← x̃ + γ_1 r̃_(k,0)
50:  r̃_(k,0) ← r̃_(k,0) − γ'_ℓ r̃_(k,ℓ)
51:  ũ_(k,0) ← ũ_(k,0) − γ_ℓ ũ_(k,ℓ)
52:  if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
53:    r̃_(k,0) ← b − A M^(−1) x̃
54:    if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
55:      Exit loop
56:    end if
57:  end if
58:  for all j ← 1, ..., ℓ do
59:    ũ_(k,0) ← ũ_(k,0) − γ_j ũ_(k,j)
60:    x̃ ← x̃ + γ''_j r̃_(k,j)
61:    r̃_(k,0) ← r̃_(k,0) − γ'_j r̃_(k,j)
62:    if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
63:      r̃_(k,0) ← b − A M^(−1) x̃
64:      if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
65:        Exit loop
66:      end if
67:    end if
68:  end for
69:  u_(k+1) ← ũ_(k,0)
70:  x_(k+1) ← x̃
71:  r_(k+1) ← r̃_(k,0)
72:  k ← k + 1
73: end loop

5.4 A profiling of Krylov subspace method

This subsection reports some profiling results of the BiCGStab(ℓ) method. For the purpose of profiling, the same set of 60 tests introduced in §7.1.3 was used, and the approaches selected in this study are SaP::GPU-C and the single-GPU version of SaP::GPU-D. We denote by R_α^(Γ−Kry) the percentage of time spent in stage Γ in test case α, considering 100% to be the time spent in the Krylov subspace component. For succinctness, when discussing a statistical observation over all tests, the subscript α is oftentimes omitted.

The results illustrated in Figs. 5.1, 5.2 and 5.3 demonstrate that, in SaP::GPU, the preconditioning solve dominates, or is at least a major component of, the total time spent in the Krylov subspace method, especially for systems that are large in size (either large in dimension N or large in half-bandwidth K). For SaP::GPU-D, the time spent in components other than the preconditioning solve and SpMV is somewhat higher than anticipated. This fact, however, gives the user of SaP::GPU a direction in selecting among the three preconditioners: SaP::GPU-D is relatively cheap compared with SaP::GPU-C and SaP::GPU-B (in that the percentage of time spent in the preconditioning solve of SaP::GPU-D is comparable to that spent in vector operations). If a system is "likely" to converge fast with SaP::GPU-D (see §7.1.2 for details on when a system is "likely" to converge fast with SaP::GPU-D), then this is the approach that should be attempted first.

Figure 5.1: Profiling results of R^(Γ−Kry) in (a) SaP::GPU-C and (b) SaP::GPU-D, obtained from a set of 60 banded linear systems.

Figure 5.2: Profiling results of R^(Γ−Kry) in (a) SaP::GPU-C and (b) SaP::GPU-D, obtained from the subset of the 60 banded linear systems which are considered "large" in terms of dimension N.

Figure 5.3: Profiling results of R^(Γ−Kry) in (a) SaP::GPU-C and (b) SaP::GPU-D, obtained from the subset of the 60 banded linear systems which are considered "large" in terms of half-bandwidth K.

6 implementation details

6.1 Dense banded matrix factorization details

This section provides several implementation details of the SaP::GPU dense banded solver, regarding how the P partitions A_i are determined (§6.1.1), how the banded matrix A is stored (§6.1.2), how the LU/UL/Cholesky steps are implemented on the GPU (§6.1.3), and several optimization techniques (§6.1.4).

6.1.1 Number of partitions and partition size

As discussed in §3.1.2, in SaP::GPU-B the number of partitions P is fixed to ⌊N/K⌋; thus the discussion in this subsection applies only to SaP::GPU-D and SaP::GPU-C. The selection of P must strike a balance between two conflicting requirements. On the one hand, a large P is attractive given that the LU/UL/Cholesky factorizations of A_i for i = 1, ..., P can be done independently and simultaneously. On the other hand, for SaP::GPU-C, approximations are made to evaluate the spikes corresponding to the coupling of the diagonal blocks A_i and A_(i+1); for SaP::GPU-D, all coupling blocks are completely ignored. In either case, this negatively impacts the quality of the resulting preconditioner, and a high P could therefore lead to poor preconditioning and an increase in the number of iterations to convergence. In the current implementation, no attempt is made to automate this selection and some experimentation is required. The experimentation is detailed in §7.1.1.

Given a P value, the sizes of the diagonal blocks A_i are selected to achieve load balancing: the first P_r partitions are of size ⌊N/P⌋ + 1, while the remaining ones are of size ⌊N/P⌋, where N = P⌊N/P⌋ + P_r.
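The partition sizing just described is a one-liner in code; the following sketch (the function name is illustrative) returns the block sizes for given N and P.

#include <vector>

// First P_r partitions get size floor(N/P) + 1, the rest floor(N/P),
// where N = P * floor(N/P) + P_r.
std::vector<int> partition_sizes(int N, int P) {
    std::vector<int> sizes(P, N / P);
    const int Pr = N - P * (N / P);
    for (int i = 0; i < Pr; ++i)
        sizes[i] += 1;
    return sizes;
}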

6.1.2 Matrix storage

For the dense banded matrices A_i, we adopt a "tall and thin" storage in column-major order. Eq. 6.1 illustrates the method we are using for a symmetric positive definite matrix (Eq. 6.1a) and a general matrix (Eq. 6.1b), both with N = 8 and K = 2. For the symmetric positive definite scenario, where the matrices A_i are Cholesky-factorized, all diagonal elements are stored in the first column; otherwise, the A_i's are LU/UL-factorized and all diagonal elements are stored in the (K + 1)-st column. In either scenario, the rest of the elements are correspondingly distributed columnwise. This strategy groups the operands of the LU/UL/Cholesky factorizations and allows coalesced memory accesses that can fully leverage the GPU's bandwidth.

    a11 a21 a31
    a22 a32 a42
    a33 a43 a53
    a44 a54 a64
    a55 a65 a75      (6.1a)
    a66 a76 a86
    a77 a87  *
    a88  *   *

     *   *  a11 a21 a31
     *  a12 a22 a32 a42
    a13 a23 a33 a43 a53
    a24 a34 a44 a54 a64
    a35 a45 a55 a65 a75      (6.1b)
    a46 a56 a66 a76 a86
    a57 a67 a77 a87  *
    a68 a78 a88  *   *
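One consistent reading of the layouts in Eq. 6.1 gives the following 0-based index mapping, shown here as a C++ sketch for illustration (these helpers are not the SaP::GPU accessors): entry a_rc of a general matrix lands in storage column r − c + K, and in the SPD case in column r − c, with the N-by-(2K+1) (respectively N-by-(K+1)) storage traversed in column-major order.

// General (LU/UL) layout of Eq. 6.1b: N x (2K+1) storage, diagonal in column K.
inline long banded_index_general(long r, long c, long N, long K) {
    return (r - c + K) * N + c;        // valid only for |r - c| <= K
}

// SPD (Cholesky) layout of Eq. 6.1a: N x (K+1) storage, diagonal in column 0;
// only the lower triangle (r >= c) is stored.
inline long banded_index_spd(long r, long c, long N) {
    return (r - c) * N + c;            // valid only for 0 <= r - c <= K
}

For N = 8 and K = 2, the diagonal entry a_44 (r = c = 3) maps to index 2·8 + 3 = 19, i.e. the fourth row of the middle column in Eq. 6.1b; entries with the same offset from the diagonal are contiguous, which is what enables the coalesced accesses mentioned above.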

6.1.3 LU/UL/Cholesky factorization

The solution strategy calls for an LU and an optional UL factorization (for the general matrix case) or a Cholesky factorization (for the symmetric positive definite case) of each dense banded diagonal block A_i. The implementation requires a certain level of synchronization since, for each A_i, the factorization, forward elimination, and backward substitution phases each consist of N_i − 1 dependent steps that need to be choreographed. One aggravating factor is the GPU's lack of native, low-overhead support for synchronization between threads running in different blocks. The established GPU strategy for inter-block synchronization is "exit and launch a new kernel". This guarantees synchronization at the GPU-grid level at the cost of non-negligible overhead. In a trade-off between minimizing the overhead of kernel launches and maximizing the occupancy of the GPU, we established two execution paths: one for K < 64, the second one for larger bandwidths. As a side note, the threshold value of 64 was selected through numerical experimentation over a variety of problems and is controlled by the number of threads that can be organized in a block in CUDA (NVIDIA, 2015). For K < 64, the code was designed to reduce the kernel launch count. Instead of having N_i − 1 kernel launches, each completing a step of the factorization A_i = L_i U_i by updating entries in a (K + 1) × (K + 1) window of elements, a single kernel is launched to factor A_i. It uses min(K^2, 32^2) threads per block and relies on low-overhead stream-multiprocessor synchronization support within the block, without any need for global synchronization. In a so-called window-sliding method, at each step of the factorization; i.e., during the process of computing column entries in L and row entries of U, each thread updates a fixed number of A_i entries. On current GPU hardware, this fixed number is between 1 and 4. Once all threads in the block complete their work, they are synchronized and the (K + 1) × (K + 1) window slides down by one row and to the right by one column. The value 4 is explained as follows. Assume that K = 63. Then, the sliding window has size 64 × 64. Since the two-dimensional GPU thread block size is 1024 = 32 × 32, each thread will handle four entries of the window of focus.
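One plausible mapping consistent with the numbers quoted above, sketched purely for illustration (the helper name and the exact formula are assumptions, not the SaP::GPU kernel code), is that each thread covers ⌈(K+1)^2 / min(K^2, 32^2)⌉ window entries:

#include <algorithm>

inline int entries_per_thread(int K) {
    const int window  = (K + 1) * (K + 1);          // sliding window size
    const int threads = std::min(K * K, 32 * 32);   // threads per block
    return (window + threads - 1) / threads;        // e.g. K = 63: 4096/1024 = 4
}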

For K ≥ 64, SaP uses multiple blocks of threads to update L and U entries. On the upside, there are more threads working on the window of focus. On the downside, there is overhead associated with leaving and reentering the kernel, a process that has the side effect of flushing the shared memory and registers. The window is larger than K × K, and it slides at a stride of eight; i.e., moves down by eight rows and to the right by eight columns upon exiting and reentering the LU factorization kernel.

6.1.4 Optimization techniques

Use of registers and shared memory. If the user decides to employ a third-stage reordering, the coupling sub-blocks B_i and C_i are used to compute the entire spikes in a scheme that renders a UL factorization superfluous. Then, B_i and C_i are each first partitioned into sub-blocks of dimension L × K, where L is at most 20. Each forward/backward sweep to get the spikes is unrolled, and in each iteration of the new loop one entire sub-block, rather than a vector of length K, is calculated. To this end, the corresponding elements in the matrix A_i are pre-fetched into shared memory and the entries of the sub-block are preloaded into registers. This strategy, in which all operations to calculate the spikes draw on registers and shared memory, leads to a 50% to 70% improvement in performance when compared with an alternative that calculates the spike elements in a loop without leveraging the low latency/high bandwidth of the GPU register file and shared memory.
Mixed Precision Strategy. In the hope of leveraging both the high efficiency of single-precision floating-point calculation and the high accuracy of double-precision floating-point calculation, the solution uses a mixed-precision implementation by falling back on single precision for the preconditioner and switching to double precision arithmetic in the outer BiCGStab(2) calculations. Fig. 6.1 illustrates the speedup of SaP::GPU-C with the mixed-precision strategy over SaP::GPU-C applying a pure double-precision strategy on a battery of banded tests. The results indicate that the mixed-precision strategy yields a median 15.9% speedup over an approach where all calculations are performed in double precision. There is a caveat to the mixed-precision strategy: the accuracy of the solution is limited by the use of a single-precision preconditioner. Hence, the benefit of the mixed-precision strategy vanishes when the accuracy of the solution is critical. In the current version of SaP::GPU, the mixed-precision strategy has been deprecated in consideration of solution accuracy.

Figure 6.1: Speedup of SaP::GPU-C with the mixed-precision strategy over SaP::GPU-C with the pure double-precision strategy on a battery of banded tests. The set of test matrices and solver configuration are the same as the ones described in §7.1.3, with the exception that the relative tolerance was set to 10^(−6), instead of 10^(−9).

6.2 Implementation details for reordering methodologies

6.2.1 DB reordering implementation details

SaP::GPU organizes the DB algorithm into four stages, DB-S1 through DB-S4. Due to differences in the nature and degree of parallelism of these stages, DB implements a hybrid strategy; namely, it relies on GPU computing for DB-S1 and DB-S4 and on CPU computing for DB-S2 and DB-S3. A thorough discussion of the implementation is provided in Li et al. (2014). Therein, a solution that kept the entire DB implementation on the GPU was discussed and deemed decisively slower than the hybrid strategy adopted here.

DB-S1: form bipartite graph. This stage assembles a matrix that mirrors the structure of the original sparse matrix. The sparsity pattern of the input matrix is maintained and the values of its nonzero entries are modified according to Eq. (4.2). The stage is highly parallel and involves: (1) calculating for each row of the original matrix the max absolute value, and (2) updating each value to form the weighted bipartite graph.

DB-S2: find initial partial match. This stage is not mandatory, but the availability of an initial partial match as a starting point for the next stage was found to considerably reduce the running time of the overall algorithm (Li et al., 2014). Like in Carpaneto and Toth (1980), after setting u_i = min_j c_ij and v_j = min_i (c_ij − u_i), we try to match as many pairs of nodes as possible. The matched nodes (i, j) should satisfy u_i + v_j = c_ij. This yields augmenting paths of length one. This stage, which was implemented to execute in parallel, was compute intensive as it had to resolve scenarios where multiple column nodes would match the same row node. A CPU parallel implementation was found to be more suitable owing to intense integer arithmetic and control flow overhead.

DB-S3: find perfect match. Finding matches in a bipartite graph G_C is equivalent to finding the shortest paths in an associated reduced graph. Omitting some of the details, the shortest path problem is tackled using Dijkstra's algorithm (Dijkstra, 1959), which is applied to all nodes i that are unmatched in the initial partial match obtained in DB-S2. This ensures that all row nodes, and therefore all column nodes, are eventually matched. The theoretical complexity of this stage is O(n · (m + n) · log n), where n and m are the dimension and number of nonzeros in the input matrix, respectively. However, thanks to the preprocessing in DB-S2, actual run times for finding a perfect match are acceptable in all situations and this stage is the DB bottleneck only for about half of the matrices tested (Li et al., 2014).

DB-S4: extract permutation and scaling factors. The matrix permutation can be obtained directly from the resulting perfect match: if the row node i was matched to the column node j then rows (or columns) i and j must be permuted. Optionally, scaling factors can be calculated and applied to rows and columns in order to bring the matrix to a so-called I-matrix form; i.e., a matrix with 1 or −1 on the diagonal and off-diagonal elements of absolute value less than 1, see Olschowka and Neumaier (1996). This stage is highly parallelizable and amenable to GPU computing.

6.2.2 CM reordering implementation details

The unordered CM algorithm, which draws on an approach described in Karantasis et al. (2014), is separated into three stages, CM-S1 through CM-S3. A high quality reordering calls for several BFS iterations, which are called herein “CM iterations”. Just like the DB implementation, the CM solution (i) is hybrid – the overall algorithm leverages both CPU and GPU computing; and, (ii) it uses CPU–GPU unified memory, a recent CUDA feature (Negrut et al., 2014), to provide for a simple and transparent memory management process. The latter feature allows the CUDA runtime to transparently manage the CPU–GPU data migration as the computation switches back and forth between the CPU and GPU. Since no explicit, programmer initiated, data transfer is required, the code is cleaner and more concise. CM-S1: pre-processing. The first stage is implemented on the GPU to accomplish two objectives. First, it produces the data structure that is worked upon. As the 55 input matrix A is not guaranteed to be symmetric, the sparse matrix structure for (A + AT )/2 is produced in anticipation of the subsequent two stages of the algorithm. Second, in order to avoid repetitively sorting the neighbors of a given node, the nodes with the same row indices are pre-sorted by ascending vertex degree of column index.

CM-S2: perform standard BFS. After experimenting with the implementation, the strategy adopted starts from several nodes and performs in parallel what would be a traditional CM-S2 & CM-S3 combo. The alternative of considering one node only, namely the node with the smallest vertex degree, yields a second-level BFS tree with fewer nodes; eventually, the resulting BFS tree will likely be "tall and thin". Starting from several nodes and completing the reordering process for each of them increases the likelihood of avoiding a "bad" initial node. In practical terms, owing to the use of parallel computing, this strategy yields smaller bandwidths at a modest increase in computational overhead.

For each starting node, a standard BFS pass yields the levels of all nodes in the BFS tree. Since the order of nodes at the same level is not critical in this stage, parallel computing can help by concurrently visiting the neighbors of all nodes at the previous level. We use an outer loop to iterate over the levels, and in each iteration, depending on the number of nodes n_p added in the previous iteration, we decide whether the iteration is executed on the GPU or the CPU. In particular, a kernel handles the iteration on the GPU only if n_p ≥ 10. There are two notable implementation details. First, the CM iterations are executed sequentially. After each iteration, we select the node at the last level of the previous BFS tree with the lowest vertex degree which has not yet been selected. If no such node exists; i.e., all nodes at the last level have been selected as starting nodes in previous iterations, a random node which has not been considered is selected. Second, the CM iterations terminate either when the height of the BFS tree does not increase, or when the maximum number of nodes over all levels does not decrease compared with the candidate optimum found so far. This strategy is proposed in Padua Haiek (1980), with the caveat that we only consider the leaf with the minimum degree. From practical experience, these heuristics lead to an algorithm that for most matrices terminates within three CM iterations.

CM-S3: reorder nodes. The previous stage determines the level of each node. Roughly speaking, nodes are ordered in ascending order of level, from level 0 up to the maximum level m_l, and memory space can be pre-allocated for the nodes at each level. Parallel computing is leveraged by observing that the order of nodes at level l depends only on the order of nodes at level l − 1. To that end, a pair of read/write pointers is set for each level and, except for level 0, the read/write pointers of each level initially point to the starting position of that level's pre-allocated space. We say a thread "works on" level l if it reads nodes at level l and writes their neighbors that are at level l + 1. Thus the execution thread working on level l will read and modify the read pointer of level l and the write pointer of level l + 1, and it will only read the write pointer of level l. Once the thread finishes reading all nodes at level l, it moves on to another level; otherwise, it repeatedly checks whether the thread working on level l − 1 has written nodes which it has not yet processed, by checking if the read pointer at level l lags the write pointer at level l. If so, the thread working on level l processes these nodes, i.e., writes their neighbors at level l + 1, and goes back to checking whether it has finished processing; otherwise, it spins and waits for the thread working on the previous level. Note that the parallelism in CM-S3 is rather coarse-grained and proved to be better suited for execution on the CPU.

6.3 CPU-GPU data transfer in SaP::GPU

Data transfer between the CPU and GPU in SaP::GPU is illustrated in Fig. 6.2. For the dense banded solver SaP::GPU_d, if the banded matrix A is already on the GPU, no data transfer happens until the solution of the system. Data transfer happens only in the sparse solver SaP::GPU_s, which includes the hybrid reordering approaches DB and CM, and the matrix assembly process on the CPU. In Fig. 6.2, three types of data, in order of increasing transfer cost, are illustrated: a circle represents reordering information; an uncolored square represents the sparsity pattern of a sparse matrix without values; and a blue-colored square represents a sparse matrix.

Figure 6.2: Data transfer in SaP::GPU.

Next, we detail the data transfer and processing in SaP::GPU_s's solution of a sparse system. The original matrix A_s lies on the GPU.

First, the matrix A_s is reordered with the DB process: information on the matrix is extracted within DB-S1 on the GPU; then the whole matrix A_s is transferred to the CPU for the DB-S2 and DB-S3 processes; next, the DB reordering information is transferred back to the GPU for the DB-S4 process; meanwhile, the reordered matrix Q_DB A_s stays on the CPU for the CM process.

Next, the matrix Q_DB A_s is reordered with the CM process: the sparsity pattern of Q_DB A_s is transferred to the GPU for symmetrizing and pre-sorting in CM-S1. This is followed by multiple iterations of BFS starting from different starting nodes (CM-S2). In this CM stage, the decision of where to search the neighboring nodes of the current-level nodes is made based on the number of current-level nodes (denoted as N_CL): if N_CL is no less than a threshold (which is tuned to be 10 in SaP::GPU), the neighboring node search is done on the GPU; otherwise the CPU is responsible for the neighboring node search for the current-level nodes. Data transfer happens when the current-level nodes do not reside on the hardware responsible for the neighboring node search. CM-S2 continues until the final BFS tree structure is obtained, which is followed by reordering the nodes based on the BFS tree structure (CM-S3), which is processed on the CPU. At the end of CM-S3, the reordering information Q_CM is transferred to the GPU. Meanwhile, the CM reordering is applied to the matrix Q_DB A_s and we obtain the reordered matrix Q_CM Q_DB A_s Q_CM^T. We denote this matrix A_DB−CM.

After the DB and CM reordering processes, elements of the matrix A_DB−CM are optionally dropped off. The user either specifies a target half-bandwidth K_tar (the value of K_tar should be less than the half-bandwidth of the matrix A_DB−CM), in which case SaP::GPU directly drops off all elements outside the band determined by K_tar (we denote this matrix A_DB−CM^(K_tar)); or the user specifies a drop-off rate d_frac, in which case SaP::GPU continues dropping off elements farthest from the diagonal until the half-bandwidth is K, which satisfies the following conditions: ||A_DB−CM^K||_1 ≥ (1 − d_frac) · ||A_DB−CM||_1 and ||A_DB−CM^(K−1)||_1 < (1 − d_frac) · ||A_DB−CM||_1, where || · ||_1 is the 1-norm operator.
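The drop-off criterion above can be made explicit with a small sketch that scans candidate half-bandwidths until the in-band 1-norm reaches the (1 − d_frac) threshold; the CSR container and helper names are illustrative, and a quadratic scan is used only for clarity (SaP::GPU drops elements in place on the reordered matrix).

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

struct CsrMatrix {
    int n;
    std::vector<int>    row_ptr, col_idx;
    std::vector<double> val;
};

// 1-norm (maximum column sum of absolute values) of the part of A inside band K.
static double banded_one_norm(const CsrMatrix& A, int K) {
    std::vector<double> colsum(A.n, 0.0);
    for (int i = 0; i < A.n; ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            if (std::abs(i - A.col_idx[k]) <= K)
                colsum[A.col_idx[k]] += std::fabs(A.val[k]);
    return *std::max_element(colsum.begin(), colsum.end());
}

// Smallest K whose in-band part retains at least (1 - dfrac) of ||A||_1.
int dropoff_bandwidth(const CsrMatrix& A, double dfrac) {
    const double target = (1.0 - dfrac) * banded_one_norm(A, A.n);
    for (int K = 0; K < A.n; ++K)
        if (banded_one_norm(A, K) >= target)
            return K;
    return A.n - 1;
}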

After the optional drop-off, the matrix is partitioned, and the main diagonal blocks A_i along with the off-diagonal blocks B_i, C_i are obtained. A third-stage reordering P_i is applied to each main diagonal block A_i, resulting in P_i A_i P_i^T. After the third-stage reordering, the blocks P_i A_i P_i^T, B_i and C_i are sent back to the GPU, where they are assembled to obtain the preconditioning matrix M. Then the solution of the general sparse system falls back on the solution of a dense banded system, and this process takes place on the GPU. It is worth noting that the data transfer in the third-stage reordering is as involved as that in CM, in that the third-stage reordering essentially applies bandwidth reduction to each main diagonal block A_i; Fig. 6.2 omits this data transfer for the sake of simplicity.

6.4 Multi-GPU support

Multi-GPU support of SaP::GPU is motivated by the following three points:

• Inherent nature of “Split and Parallelize” methods. For any of the three SaP approaches, the splitting creates several intensive but independent calculations so that the work is divided across different processes: the solution of the P independent systems Dg = b shown in Eq. 3.3 for SaP-D and SaP-C; the calculation of the spikes in Eq. 3.2 and the solution of the P−1 independent systems in Eq. 3.6 for SaP-C; the assembly of the diagonal and off-diagonal blocks A^(j+1), B^(j+1) and C^(j+1) in Eqs. 3.16d, 3.16e and 3.16f for SaP-B; etc. The essential “independence” of these calculations implies that no communication is required during these computations. This enables the possibility of leveraging multiple GPUs to accomplish the work.

• A single GPU might not effectively leverage the parallelism of a particular algorithm. The best example for this point is the LU factorization of each diagonal block A_i when solving the system Dg = b shown in Eq. 3.3. Recall from §6.1.3 that when the half-bandwidth K is no smaller than 64, a CUDA kernel with multiple blocks of threads is launched to update the L and U entries of each diagonal block A_i. Assume K has the value 200 and the system is split into 20 partitions. Further assume that a CUDA kernel with blocks of 512 threads is launched to update the L and U entries. Then the number of blocks in this kernel is as large as 1563 (= 200^2 × 20/512), exceeding the total number of double-precision cores in Kepler GK110, which is 960; the arithmetic is reproduced in the sketch following this list. As a consequence, peak performance cannot be attained even if all cores are fully occupied. Hence, splitting this work over multiple GPUs and leveraging more double-precision cores helps achieve better performance.

• Limited GPU memory can be addressed with multi-GPU support. The GDDR5 memory size of a Tesla K40 is 12 GB. Consider a dense banded matrix A of double-precision floating point values with N = 1,000,000 and K = 500, which requires 7.46 GB of memory. For solving a system of the form Ax = b, an iterative solver like SaP::GPU is obliged to consume close to 14.92 GB (twice 7.46 GB) to store both A and the preconditioning matrix M (see the same sketch below). By splitting the matrices over multiple GPUs, the shortage of GPU memory when storing large matrices can be somewhat alleviated.
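The figures quoted in the last two bullets can be re-derived with a few lines of arithmetic. The sketch below simply recomputes those numbers; it is not SaP::GPU code and performs no GPU work.

// Back-of-the-envelope check of the block count for the K = 200, P = 20 LU
// update kernel and of the memory footprint of the N = 1,000,000, K = 500
// banded system. These are the numbers from the text, not new measurements.
#include <cstdio>

int main() {
    // Second bullet: number of thread blocks for the L/U update sweep.
    const long long K1 = 200, P = 20, threads_per_block = 512, dp_cores_gk110 = 960;
    long long work_items = K1 * K1 * P;                                       // ~K^2 entries per partition
    long long blocks = (work_items + threads_per_block - 1) / threads_per_block;  // ceil division
    std::printf("LU update kernel: %lld blocks vs. %lld DP cores\n", blocks, dp_cores_gk110);

    // Third bullet: storage for a dense banded matrix with 2K+1 diagonals of length N.
    const double N = 1.0e6, K2 = 500.0, GiB = 1024.0 * 1024.0 * 1024.0;
    double bytes = (2.0 * K2 + 1.0) * N * sizeof(double);
    std::printf("banded A: %.2f GB, A plus preconditioner M: %.2f GB (K40: 12 GB)\n",
                bytes / GiB, 2.0 * bytes / GiB);
    return 0;
}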

Among the three preconditioning approaches, i.e. SaP-D, SaP-C and SaP-B, only SaP-D supports multiple GPUs. For SaP-C and SaP-B, data transfers among GPUs would have to be introduced to assemble and calculate the coupling blocks and to obtain the matrix-vector product in the Krylov solve stage; this increases the complexity of data management with multiple GPUs, increases the memory footprint, and offsets the potential performance gain of leveraging multiple GPUs. The overhead of these additional data transfers is significant, especially for SaP-B. Indeed, SaP-B is essentially a recursive approach which takes exactly log_2 P iterations both in the preconditioning stage and in each preconditioning solve. Introducing data transfers among GPUs in each iteration significantly hurts performance.

Figure 6.3 illustrates the data allocation to multiple GPUs in SaP::GPU-D: each of the first n_r GPUs is allocated ⌊P/n_GPU⌋ + 1 partitions, while the remaining GPUs are allocated ⌊P/n_GPU⌋ partitions, where n_GPU is the number of active GPUs and n_r = P − n_GPU ⌊P/n_GPU⌋.
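As a quick illustration of this allocation rule, the following sketch (not SaP::GPU source) computes the partition counts per GPU; the seven-partition, three-GPU case reproduces the allocation of Fig. 6.3.

// Sketch of the partition-to-GPU allocation: the first n_r GPUs receive
// floor(P/n_GPU) + 1 partitions each and the rest receive floor(P/n_GPU),
// where n_r = P - n_GPU * floor(P/n_GPU).
#include <cstdio>
#include <vector>

std::vector<int> partitions_per_gpu(int P, int n_gpu) {
    std::vector<int> counts(n_gpu, P / n_gpu);
    int n_r = P - n_gpu * (P / n_gpu);
    for (int g = 0; g < n_r; ++g) counts[g] += 1;  // first n_r GPUs get one extra partition
    return counts;
}

int main() {
    // The seven-partition, three-GPU case of Fig. 6.3: GPU 0 gets 3 partitions, GPUs 1-2 get 2.
    std::vector<int> counts = partitions_per_gpu(7, 3);
    for (size_t g = 0; g < counts.size(); ++g)
        std::printf("GPU %zu: %d partition(s)\n", g, counts[g]);
    return 0;
}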

Figure 6.3: Allocating a seven-partitioned matrix to three GPUs.

There are two remarks on this strategy of data allocation. First, the partitioning strategy discussed in §6.1.1 is not modified for the multi-GPU implementation of SaP::GPU-D. In other words, given the P and n_GPU values, the (preconditioning) banded matrix is first P-partitioned by applying the partitioning strategy described in §6.1.1, and then the P partitions are allocated to the n_GPU GPUs with the allocation strategy described above. This partition-then-allocate strategy does not strike a perfect load balance among GPUs. The scenario illustrated in Fig. 6.3 is a good instance: GPU 0 is allocated one more partition of entries than GPU 1 and GPU 2. An ideal strategy is an allocate-then-partition strategy: the matrix is first allocated to the GPUs in a balanced manner, and then the allocated matrix on each GPU is P_i-partitioned, satisfying P = Σ_{i=0}^{n_GPU−1} P_i. The only reason why partition-then-allocate, instead of allocate-then-partition, is used in SaP::GPU-D is the simplicity of testing: with the assumption that the single-GPU version of SaP::GPU-D is implemented correctly, the multi-GPU version of SaP::GPU-D is tested by comparing all L and U entries of each diagonal block A_i obtained by both versions. As one of the directions of optimization, achieving a better load balance with an allocate-then-partition strategy is a straightforward starting point.

Next, without loss of generality, the linear system Ax = b provided by the user is assumed to be allocated originally on GPU 0. In the Krylov solve stage, whenever a preconditioning solve (defined in §5.2.2) is to be carried out, SaP::GPU needs to partition and allocate the right-hand side, which is on GPU 0, to multiple GPUs in correspondence with how the rows of the preconditioning matrix are allocated. After the preconditioning solve, SaP::GPU must combine the various parts of the preconditioning solve results into a single vector on GPU 0. This back-and-forth data transfer among GPUs increases the overhead of the multi-GPU SaP::GPU-D. The overhead, nonetheless, is inevitable if the matrix A provided by the user resides on a single GPU, which implies that the SpMV function has to be carried out on a single GPU.

With multi-GPU support, the data transfer between the CPU and all GPUs is illustrated in Fig. 6.4. It is worth noting that (1) since only SaP-D supports multiple GPUs, there are no B_i and C_i blocks when transferring data from the CPU to the GPUs; and (2) since the preconditioner assembly process and the subsequent LU factorization are concurrent on each GPU, it is hard to distinguish T_ASSEMBLE and T_LU when multiple GPUs are enrolled in solving the system.
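The scatter step before a multi-GPU preconditioning solve can be pictured with the short sketch below. This is not SaP::GPU source: it assumes two GPUs and an even row split, and relies only on cudaMemcpyPeer, which stages through the host when direct peer access is unavailable; the gather after the solve would be the mirror-image copy.

// Sketch of scattering the right-hand side b (resident on GPU 0) so that each
// GPU owns the rows matching its share of the preconditioning matrix.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;               // length of the right-hand side
    const int n_gpu = 2;                 // assumed number of GPUs
    const int rows_per_gpu = N / n_gpu;  // even split for simplicity

    double* b_dev[2] = {nullptr, nullptr};
    for (int g = 0; g < n_gpu; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&b_dev[g], (g == 0 ? N : rows_per_gpu) * sizeof(double));
    }

    // GPU 0 holds the full vector; ship the second block of rows to GPU 1.
    cudaMemcpyPeer(b_dev[1], 1,                 // destination buffer, destination device
                   b_dev[0] + rows_per_gpu, 0,  // source offset on GPU 0, source device
                   rows_per_gpu * sizeof(double));

    // ... per-GPU preconditioning solve would run here; the reverse copy then
    // gathers the partial results back onto GPU 0 (omitted).

    for (int g = 0; g < n_gpu; ++g) { cudaSetDevice(g); cudaFree(b_dev[g]); }
    std::printf("scattered %d rows to GPU 1\n", rows_per_gpu);
    return 0;
}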

Figure 6.4: Data transfer in SaP::GPU with dual GPUs.

7 numerical experiments

This chapter reports the results of numerical experiments evaluating the robustness and performance of SaP::GPU and the reordering methodologies applied. Simulation results of using SaP::GPU in a computational multibody dynamics problem are detailed in §8. All experiments were carried out on a Tesla K40 GPU (introduced in §2.3.1) and two Intel Xeon E5-2690 v2 CPUs (introduced in §2.3.2). All software was built with optimization level two, namely with the ‘-O2’ flag on. For all CPU linear system solvers against which SaP::GPU is compared (including the MKL and LAPACK dense banded linear solvers, and the PARDISO, SuperLU and MUMPS general sparse linear solvers), each system was solved with several thread counts, and the smallest execution time over all thread counts was recorded as the solution time for that linear system.

7.1 Numerical experiments related to dense banded linear systems

7.1.1 Sensitivity with respect to P

As P is fixed to the value ⌊N/K⌋ in SaP-B, the sensitivity with respect to P only applies to SaP-C and SaP-D. The results in Fig. 7.1 summarize the sensitivity of the time to solution with respect to the number of partitions P. This behavior, namely relatively small gains after a threshold value of P, is typical. It is instructive to see how the solution time is spent by SaP-C and SaP-D and to understand how changing P influences this distribution of the time to solution between the major implementation components. The results in Table 7.1 provide this information as they compare the coupled and decoupled strategies in regards to the factorization times, D_pre vs. C_pre; the number of iterations in the Krylov solver, D_it vs. C_it; the amount of time spent iterating to find the solution at a level of accuracy of at least 10^{-9}, D_Kry vs. C_Kry; and the total times, D_Tot vs. C_Tot. These times are defined as D_pre = T_LU, C_pre = T_LU + T_BC + T_SPK + T_LUrdcd, D_Tot = D_pre + D_Kry, and C_Tot = C_pre + C_Kry. Note that for SaP::GPU, the number of iterations is reported in quarters. This is due to the fact that BiCGStab(2) contains three exit points during each iteration; moving from one to the next requires roughly the same amount of effort, which justifies the notation.

[Plots: execution time (s) versus the number of partitions P for SaP::GPU-C and SaP::GPU-D; panels (a) K = 50 and (b) K = 200.]

Figure 7.1: Time to solution as a function of the number of partitions P . Study carried out for a dense banded linear system with N = 200,000 and d = 0.6, for two different values of the half-bandwidth K.

The number of iterations to convergence suggests that the quality of the coupled version of the preconditioner is superior. Yet the cost of getting this better preconditioner depends highly on the half-bandwidth K of the system. For smaller K, the time saving in the Krylov solve of SaP::GPU-C is higher than the extra cost of preconditioning, so it is worthwhile to spend time in achieving this better preconditioner. For larger K, the time saving in the Krylov solve of SaP::GPU-C is offset by the extra cost of preconditioning, and SaP::GPU-D ends up winning by taking as little as half the time required by SaP::GPU-C. When the same factorization is used multiple times, this conclusion could change, since the metric that controls the performance would be D_Kry and C_Kry, or their proxy, the number of iterations for convergence. Also note that the return on increasing the number of partitions gradually fades away: for the coupled strategy there is no reason to go beyond P = 50, while for the decoupled strategy the threshold value of P can reach 80. Hence, P = 50 for the coupled strategy and P = 80 for the decoupled strategy will be used in the remaining part of this section for the studies with respect to other parameters of a banded system.

P    C_pre    D_pre    C_it  D_it  C_Kry    D_Kry    C_Tot    D_Tot
1    1,205    1,203.4  0.25  0.25  1,152.2  1,146.1  2,357.2  2,349.5
10   269.8    138      0.25  2.25  252.2    601.6    522      739.6
20   174.9    91.4     0.25  2.25  134.9    381.3    309.7    472.6
30   128.6    70.1     0.25  2.25  97.2     258.1    225.8    328.2
40   122.1    65.7     0.25  2.25  81.3     197.8    203.3    263.5
50   140.9    92.6     0.25  2.25  67.9     170.8    208.8    263.5
60   136.3    81.5     0.25  2.25  59.7     151.6    196      233.1
70   150.1    72.1     0.25  2.25  67.4     151.6    206.9    233
80   140.2    66.1     0.25  2.25  53.7     154.2    194      220.3
90   132.2    60.8     0.25  2.25  49.9     112.8    182.1    173.6
100  139.3    72.2     0.25  2.25  50.4     111.6    189.7    183.8
(a) K = 50

P    C_pre    D_pre  C_it  D_it  C_Kry    D_Kry    C_Tot    D_Tot
1    1,487.3  1,489.7  0.25  0.25  1,245.9  1,244.1  2,733.2  2,733.8
10   1,286.6  539      0.25  1.75  290.6    579.8    1,577.2  1,118.8
20   1,200.6  477      0.25  1.75  181.5    354.3    1,382.1  831.3
30   1,202.4  466.2    0.25  1.75  141.8    278.1    1,344.2  744.3
40   1,186.7  436      0.25  1.75  123.1    239.9    1,309.8  675.9
50   1,202.9  429.2    0.25  1.75  117.9    217.5    1,320.8  646.7
60   1,234.5  441      0.25  1.75  110.1    239.4    1,344.6  680.5
70   1,248.9  438.8    0.25  1.75  111.2    198.5    1,360.1  637.3
80   1,249.4  421.3    0.25  1.75  111.3    226.8    1,360.8  648.1
90   1,271.3  428.3    0.25  1.75  109.7    187.1    1,381    615.4
100  1,277.8  416.8    0.25  1.75  109.2    180.9    1,387    597.6
(b) K = 200

Table 7.1: Performance comparison over a spectrum of the number of partitions P for the coupled (C) vs. decoupled (D) strategies in SaP::GPU. All timings are in milliseconds. Problem parameters: N = 200,000, d = 0.6, K = 50 and K = 200. The symbols used are as follows: D_pre, the amount of time spent in preprocessing by the decoupled strategy; D_it, the number of Krylov iterations for convergence; D_Tot, the amount of time to converge. Similar values are reported for the coupled approach.

7.1.2 Sensitivity with respect to d

Next, we report on the performance of SaP::GPU for a dense banded linear system with N = 200,000 and K = 200, for degrees of diagonal dominance in the range 0.01 ≤ d ≤ 1.2, see Eq. (3.11). The entries in the matrix are randomly generated.

The number of partitions P is configured as follows: for SaP::GPU-C, P is fixed to 50; for SaP::GPU-D, P is fixed to 80; for SaP-B, as mentioned in §3.1.2, P is fixed to the value of ⌊N/K⌋, which is 1,000. The findings are summarized in Fig. 7.2, where SaP::GPU-C, SaP::GPU-D and SaP::GPU-B are compared against the banded linear solver in MKL. When d > 1, the impact of neglecting the coupling blocks (for SaP::GPU-D) and of truncating the coupling blocks (for SaP::GPU-C) becomes increasingly irrelevant, a situation that places SaP::GPU at an advantage. As such, there is no reason to go beyond d = 1.2 since, if anything, the results will only get better. The more interesting range is d < 1, when the diagonal dominance requirement is violated. The SaP::GPU solver demonstrates uniform performance over a wide range of degrees of diagonal dominance. For instance, SaP::GPU-C typically required less than one Krylov iteration for all d > 0.08. As the degree of diagonal dominance decreases further, the number of iterations and hence the time to solution increase significantly as a consequence of truncating the spikes that now contain non-negligible values.

[Plots: speedup of SaP::GPU-C, SaP::GPU-D and SaP::GPU-B over MKL (top) and execution time (s) (bottom) versus the degree of diagonal dominance d; panels (a) K = 50 and (b) K = 200.]

Figure 7.2: Influence of the diagonal dominance d, with 0.01 ≤ d ≤ 1.2, for fixed values N = 200,000, P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D. The plots are for two different values of the half-bandwidth K.

It is instructive to see how the solution time is spent by SaP::GPU-C, SaP::GPU-D and SaP::GPU-B and to understand how changing d influences the split of the time to solution between the major components of the implementation. The results reported in Table 7.2 provide this information as they help answer the following questions: (1) can one still use a decoupled approach for matrices that are far from being diagonally dominant? The answer is yes, if d is reasonably large; i.e., when d ≥ 0.1. Note that the number of iterations to convergence for the decoupled approach quickly recovers away from small values of d. (2) What are the scenarios in which the BCR approach should be considered? The answer is when d is rather small; i.e., when d < 0.1, where neither SaP::GPU-D nor SaP::GPU-C is likely to approach the solution.

d        C_pre   D_pre  B_pre   C_it  D_it  B_it  C_Kry    D_Kry  B_Kry  C_Tot    D_Tot  B_Tot
1.2      143.9   64.7   235.2   0.25  1.75  0.25  65.2     103.3  55.1   209.1    168    290.2
1        144.5   64.1   233.1   0.25  1.75  0.25  65.1     102.5  54.9   209.6    166.6  288
0.8      144.9   64.1   236.2   0.25  1.75  0.25  65.1     102.5  55.2   210.1    166.6  291.3
0.6      147.4   64.4   235.2   0.25  2.25  0.25  65.8     139.5  55.2   213.1    203.8  290.4
0.4      143     64.3   235.9   0.25  2.75  0.25  65.6     143.3  55.3   208.6    207.5  291.2
0.2      145.2   64.4   235.5   0.25  4.75  0.25  65.1     224.2  55.1   210.3    288.7  290.6
0.1      144.3   63.8   232.7   16.5  101   0.25  1,270    −1     54.9   1,414.3  −1     287.7
8·10^-2  147.2   64     235.2   101   101   0.75  −1       −1     81.6   −1       −1     316.7
6·10^-2  144.6   64.2   235.4   101   101   2.5   −1       −1     293.8  −1       −1     529.2
4·10^-2  143.6   64.4   234.5   101   101   2.5   −1       −1     293.8  −1       −1     528.2
2·10^-2  144.8   64.3   235.4   101   101   2.5   −1       −1     294.1  −1       −1     529.5
1·10^-2  144.2   65     233.2   101   101   2.5   −1       −1     293.8  −1       −1     527
(a) K = 50

d        C_pre    D_pre  B_pre    C_it  D_it  B_it  C_Kry  D_Kry  B_Kry  C_Tot    D_Tot  B_Tot
1.2      1,200.8  421.4  1,573.6  0.25  1.75  0.25  111.8  174.8  109.3  1,312.7  596.2  1,682.9
1        1,200    417.6  1,572.7  0.25  1.75  0.25  112.1  174.5  109.2  1,312.1  592.1  1,681.9
0.8      1,189.8  417.4  1,572.6  0.25  1.75  0.25  111.4  174.1  109.1  1,301.2  591.5  1,681.8
0.6      1,189.6  417.6  1,572.5  0.25  2.25  0.25  111.4  207    109.2  1,301    624.7  1,681.7
0.4      1,189.6  417.6  1,573.3  0.25  2.25  0.25  111.6  206.8  109.3  1,301.2  624.4  1,682.6
0.2      1,189.6  417.5  1,571.8  0.25  2.75  0.25  111.4  243.5  109.2  1,301    661    1,681
0.1      1,189.5  417.4  1,573.2  0.25  4.75  0.25  111.6  382    109.2  1,301.2  799.4  1,682.4
8·10^-2  1,189.7  417.4  1,572.9  0.25  6.75  0.25  111.6  520.6  109.2  1,301.2  938    1,682.1
6·10^-2  1,189.9  417.6  1,571.6  1.75  101   0.25  281    −1     109    1,470.9  −1     1,680.6
4·10^-2  1,189.8  417.7  1,573.6  101   101   1.25  −1     −1     243.3  −1       −1     1,816.9
2·10^-2  1,189.7  417.7  1,573    101   101   2.5   −1     −1     597.3  −1       −1     2,170.4
1·10^-2  1,190.2  417.5  1,574.3  101   101   2.5   −1     −1     597.7  −1       −1     2,172
(b) K = 200

Table 7.2: Influence of d for the coupled (C) vs. decoupled (D) vs. BCR (B) strategies in SaP::GPU (N = 200,000, P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D, K = 50 for Table 7.2a and K = 200 for Table 7.2b). All timings are in milliseconds. Symbols used are as specified for Table 7.1. For those tests whose number of iterations is reported as 101, with the Krylov solve and total times reported as −1, the test failed to converge within 100 Krylov iterations.

The results of the efficiency evaluation of the diagonal boosting algorithm in §7.2.1.1 suggest that for half of the 114 real application systems studied in this work, d can be boosted to a value larger than 0.13. It is thus helpful to zoom into the range between 0.1 and 0.2 to observe the quality of convergence of SaP::GPU-D. The results of this “zoom-in” study are reported in Fig. 7.3 and Table 7.3. Note that whether the half-bandwidth is small (i.e., K = 50) or large (i.e., K = 200), a diagonal dominance no less than 0.13 guarantees convergence when a system is preconditioned with SaP::GPU-D. This observation, combined with the results reported in §7.2.1.1, provides the users a likely positive answer to the following question: is SaP::GPU-D a feasible preconditioning strategy to solve a general sparse system?

[Plots: execution time (s) of SaP::GPU-D versus the degree of diagonal dominance d, 0.1 ≤ d ≤ 0.2; panels (a) K = 50 and (b) K = 200.]

Figure 7.3: Influence of the diagonal dominance d, with 0.1 ≤ d ≤ 0.2, for fixed values N = 200,000 and P = 80 for SaP::GPU-D.

d     D_pre  D_it    D_Kry    D_Tot
0.2   58.2   4.75    162.1    220.4
0.19  58.1   5.75    191.8    249.9
0.18  59.1   5.75    192.8    252
0.17  57.1   6.75    221.3    278.4
0.16  58.5   7.75    250.5    308.9
0.15  57.9   9.25    292.8    350.6
0.14  57.6   11.75   368.4    426
0.13  57.8   26.25   792.2    850.1
0.12  58     183.75  5,429.7  5,487.7
0.11  58.5   501     −1       −1
0.1   58.7   501     −1       −1
(a) K = 50

d     D_pre  D_it  D_Kry  D_Tot
0.2   398.3  2.75  213.7  612
0.19  394    2.75  212.9  606.9
0.18  394.1  3.25  241.5  635.6
0.17  393.2  3.25  241.7  634.9
0.16  393.2  3.25  241.7  634.9
0.15  397.3  3.25  271    668.3
0.14  393.1  3.75  273.6  666.7
0.13  393.2  3.75  273.3  666.5
0.12  394.6  3.75  276.7  671.4
0.11  393.3  4.75  333.8  727.1
0.1   393.1  4.75  334.1  727.2
(b) K = 200

Table 7.3: Influence of d for the decoupled (D) strategy in SaP::GPU (N = 200,000 and P = 80). All timings are in milliseconds. Symbols used are as specified for Table 7.1. For those tests whose number of iterations is reported as 501, with the Krylov solve and total times reported as −1, the test failed to converge within 500 Krylov iterations.

7.1.3 Comparison with LAPACK and Intel’s MKL over a spectrum of N and K

This section summarizes the results of a two-dimensional sweep over N and K. Besides comparing with MKL, in this two-dimensional test we also include LAPACK to better understand where SaP::GPU stands when compared to these open-source or commercial dense banded solvers. In this exercise, prompted by the results reported in Figs. 7.1 and 7.2, we fixed P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D and chose matrices for which d = 0.6. Each row in Table 7.4 lists the value of N, which runs from 1000 to 1,000,000. Each column lists the dimension of the half-bandwidth K, which runs from 10 to 500. Each table row is split into five sub-rows: SaP::GPU-D results are reported in the first sub-row; SaP::GPU-C in the second sub-row; SaP::GPU-B in the third sub-row; MKL in the fourth sub-row; LAPACK in the fifth sub-row. All timings are in milliseconds. “OoM” stands for “out-of-memory”, a situation that arises when SaP::GPU exhausts the GPU's 12 GB of global memory during the solution of the linear system.

Table 7.4: Performance comparison, two-dimensional sweep over N and K for d = 0.6. For each value N, the five rows correspond to the SaP::GPU-D, SaP::GPU-C, SaP::GPU-B, MKL and LAPACK solvers, respectively.

N            Solver       K=10      K=20      K=50     K=100    K=200    K=500
1000         SaP::GPU-D   1.141E1   9.435     9.941    1.324E1  2.241E1  2.412E1
             SaP::GPU-C   6.723     5.687     9.289    1.610E1  2.478E1  2.439E1
             SaP::GPU-B   1.359E1   1.560E1   1.845E1  2.521E1  3.433E1  6.328E1
             MKL          4.803     5.313     5.232    8.718    9.974    1.510E1
             LAPACK       2.010E−1  5.400E−1  2.524    7.742    2.718E1  1.186E2
2000         SaP::GPU-D   1.120E1   9.278     1.032E1  1.425E1  2.109E1  4.181E1
             SaP::GPU-C   5.352     7.117     1.062E1  1.886E1  3.433E1  9.498E1
             SaP::GPU-B   1.382E1   1.704E1   2.325E1  3.524E1  5.533E1  1.518E2
             MKL          4.733     5.020     6.853    1.258E1  1.512E1  2.934E1
             LAPACK       3.810E−1  1.077     5.120    1.596E1  5.899E1  2.960E2
5000         SaP::GPU-D   1.174E1   9.321     1.195E1  1.695E1  2.571E1  6.754E1
             SaP::GPU-C   6.785     7.350     1.417E1  2.816E1  6.672E1  2.634E2
             SaP::GPU-B   2.135E1   2.302E1   2.957E1  5.081E1  9.499E1  2.870E2
             MKL          5.585     6.124     9.205    2.374E1  3.090E1  7.426E1
             LAPACK       9.470E−1  2.696     1.291E1  4.152E1  1.558E2  8.278E2
10,000       SaP::GPU-D   1.456E1   1.280E1   1.450E1  2.036E1  3.244E1  9.056E1
             SaP::GPU-C   6.862     8.421     1.861E1  4.562E1  1.201E2  5.396E2
             SaP::GPU-B   2.828E1   3.292E1   4.066E1  7.010E1  1.437E2  4.929E2
             MKL          5.958     7.021     1.463E1  4.282E1  5.877E1  1.448E2
             LAPACK       1.891     5.376     2.633E1  8.486E1  3.173E2  1.707E3
20,000       SaP::GPU-D   2.072E1   1.788E1   2.303E1  2.853E1  4.541E1  1.386E2
             SaP::GPU-C   9.945     1.280E1   2.540E1  7.105E1  2.259E2  1.111E3
             SaP::GPU-B   3.602E1   4.285E1   5.813E1  1.019E2  2.267E2  8.884E2
             MKL          7.520     1.066E1   3.051E1  8.448E1  1.107E2  2.658E2
             LAPACK       3.753     1.074E1   5.354E1  1.694E2  6.315E2  3.472E3
50,000       SaP::GPU-D   4.333E1   3.874E1   5.129E1  6.046E1  1.025E2  2.973E2
             SaP::GPU-C   2.119E1   2.749E1   5.641E1  1.381E2  3.934E2  2.784E3
             SaP::GPU-B   5.748E1   6.950E1   9.774E1  1.971E2  4.653E2  2.044E3
             MKL          1.224E1   2.512E1   7.873E1  2.027E2  2.549E2  5.835E2
             LAPACK       9.538     2.972E1   1.374E2  4.227E2  1.584E3  8.764E3
100,000      SaP::GPU-D   7.372E1   6.641E1   8.168E1  1.140E2  2.020E2  5.902E2
             SaP::GPU-C   3.767E1   5.029E1   1.035E2  2.596E2  6.828E2  4.153E3
             SaP::GPU-B   8.640E1   1.076E2   1.590E2  3.477E2  8.634E2  3.877E3
             MKL          2.412E1   4.942E1   1.508E2  3.435E2  4.611E2  1.101E3
             LAPACK       2.139E1   6.016E1   2.734E2  8.445E2  3.173E3  1.759E4
200,000      SaP::GPU-D   1.481E2   1.283E2   1.783E2  2.229E2  4.292E2  1.245E3
             SaP::GPU-C   6.809E1   9.234E1   1.905E2  4.884E2  1.259E3  6.751E3
             SaP::GPU-B   1.334E2   1.750E2   2.776E2  6.271E2  1.638E3  7.494E3
             MKL          5.000E1   9.727E1   3.019E2  7.079E2  9.360E2  2.109E3
             LAPACK       4.363E1   1.198E2   5.461E2  1.690E3  6.354E3  3.538E4
500,000      SaP::GPU-D   3.146E2   2.771E2   3.594E2  5.230E2  9.898E2  3.270E3
             SaP::GPU-C   1.551E2   2.149E2   4.699E2  1.191E3  2.983E3  OoM
             SaP::GPU-B   2.766E2   3.732E2   6.196E2  1.460E3  3.957E3  OoM
             MKL          1.100E2   2.308E2   7.547E2  1.590E3  2.158E3  5.934E3
             LAPACK       1.083E2   2.976E2   1.366E3  4.225E3  1.603E4  8.865E4
1,000,000    SaP::GPU-D   5.873E2   5.291E2   6.942E2  1.022E3  1.940E3  OoM
             SaP::GPU-C   3.004E2   4.140E2   8.976E2  2.382E3  5.860E3  OoM
             SaP::GPU-B   5.247E2   7.080E2   1.189E3  2.871E3  OoM      OoM
             MKL          1.942E2   4.292E2   1.462E3  3.396E3  4.895E3  1.131E4
             LAPACK       2.160E2   5.941E2   2.732E3  8.611E3  3.199E4  1.786E5

The results reported in Table 7.4 are statistically summarized in Fig. 7.4, which provides SaP::GPU over MKL and/or LAPACK speedup information. Assume that a test α successfully ran to completion in SaP::GPU-D, requiring T_α^{SaP::GPU-D}, and/or in SaP::GPU-C, requiring T_α^{SaP::GPU-C}, and/or in SaP::GPU-B, requiring T_α^{SaP::GPU-B}. By convention, in case of failure to solve, the value ∞ is assigned to T_α^{SaP::GPU-D}, T_α^{SaP::GPU-C} or T_α^{SaP::GPU-B}. If a test runs to completion both in SaP::GPU and MKL (or LAPACK), the speedup value for α used to generate the plot in Fig. 7.4a (or Fig. 7.4b) is computed as s_BD^{SaP::GPU--X} ≡ T_α^X / T_α^{SaP::GPU}, where T_α^X is solver X's time to solution and T_α^{SaP::GPU} ≡ min(T_α^{SaP::GPU-D}, T_α^{SaP::GPU-C}, T_α^{SaP::GPU-B}). Given that N assumes 10 values and K takes 6 values, α can be one of 60 tests. Since one (N, K) test, namely (1,000,000, 500), failed to solve in SaP::GPU, the sample population for the statistical study in Fig. 7.4a is 59. Out of 59 tests, s_BD^{SaP::GPU--MKL} > 1 in 33 cases. The highest speedup was s_BD^{SaP::GPU--MKL} = 3.3527. The median is 1.0738 and the 75th percentile is 2.0656, which implies (1) that the overall performance of SaP::GPU is slightly better than MKL's; and (2) that for 25% of the tests, SaP::GPU demonstrates a speedup over MKL from two to three, while for 25% of the tests, SaP::GPU demonstrates a speedup over MKL from one to two.

Similarly, the sample population for the statistical study in Fig. 7.4b is also 59. Out of 59 tests, s_BD^{SaP::GPU--LAPACK} > 1 in 41 cases. The highest speedup was s_BD^{SaP::GPU--LAPACK} = 29.8066. The median is 2.6781 and the 75th percentile is 8.3361. These results imply that, given a banded system, SaP::GPU is likely to solve it in at most 40% of the time LAPACK would take. However, it is worth noting that LAPACK demonstrates high performance for small half-bandwidths; i.e., for K = 10 and K = 20. A small half-bandwidth is likely to force LAPACK to use a simple, unblocked factorization, which exhibits small overheads when the matrix size is small.
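The way the per-test speedup is formed can be sketched in a few lines. The snippet below is not part of SaP::GPU and uses made-up timings rather than entries of Table 7.4: SaP::GPU's time is taken as the best of its three approaches (infinity when an approach fails), and the speedup over a reference solver X is T_X divided by that best time.

// Sketch of the s_BD statistic: best-of-three SaP time, per-test speedup over
// solver X, then a simple nearest-rank median over the collected speedups.
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

struct Test { double t_D, t_C, t_B, t_X; };  // times in ms; t_X is MKL or LAPACK

int main() {
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<Test> tests = {{148.0, 68.0, 133.0, 50.0},     // made-up sample points
                               {3.1e3, 2.9e3, inf,   2.1e3},
                               {694.0, 898.0, 1.18e3, 1.46e3}};

    std::vector<double> speedups;
    for (const Test& t : tests) {
        double t_sap = std::min({t.t_D, t.t_C, t.t_B});
        if (t_sap < inf) speedups.push_back(t.t_X / t_sap);   // tests SaP failed are excluded
    }

    std::sort(speedups.begin(), speedups.end());
    std::printf("median speedup over X: %.3f\n", speedups[speedups.size() / 2]);
    return 0;
}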

(a) SaP – MKL

(b) SaP – LAPACK

Figure 7.4: SaP speedup over Intel’s MKL and LAPACK – statistical analysis based on values in Table 7.4.

7.1.4 Profiling results for dense banded linear system solver

This subsection reports the profiling results of all three approaches to solving dense banded systems: SaP::GPU-D, SaP::GPU-C and SaP::GPU-B. The set of 60 tests used for profiling was introduced in §7.1.3 and, for each of the approaches, the profiling results were generated from the subset of these tests that the approach solved successfully. Three notations are introduced for succinctness of description. Consider an arbitrary stage Γ in solving a system and a specific test α; then we define

    R_α^{Γ-Tot}  ≡ 100 × T_α^Γ / T_α^{Tot},                      (7.1a)
    R_α^{Γ-PreP} ≡ 100 × T_α^Γ / T_α^{PreP},  if Γ ∈ PreP,        (7.1b)
    R_α^{Γ-Kry}  ≡ 100 × T_α^Γ / T_α^{Kry},   if Γ ∈ Kry.         (7.1c)

These notations are interpreted, for a specific test α, as the percentage of time spent in the stage Γ, considering 100% to be the total time T_α^{Tot}, the pre-processing time T_α^{PreP}, and the Krylov solve time T_α^{Kry}, respectively. It is worth noting that if Γ ∉ PreP, R_α^{Γ-PreP} is not defined for any test α, and similarly for R_α^{Γ-Kry}. For succinctness, when discussing a statistical observation over all tests, the subscript α is oftentimes omitted.

Profiling results of SaP::GPU-D. The two (and only) major stages in SaP::GPU-D are the LU factorization of the main diagonal blocks (LU) and the Krylov solve (Kry). The rest of the time spent in pre-processing, other than LU, is denoted as PrePRst. The median percentage of time consumed in each stage is shown in Table 7.5, and Fig. 7.5, Fig. 7.6 and Fig. 7.7 illustrate the distribution of R^{Γ-Tot} for each stage when considering the whole set of matrices, the subset of matrices with dimension N no smaller than 50,000, and the subset of matrices with half-bandwidth K no smaller than 50, respectively. Since SaP::GPU-D supports multiple GPUs, Table 7.5 and Figs. 7.5, 7.6 and 7.7 display the profiling results for both single-GPU and dual-GPU runs. As anticipated, (1) for both the single-GPU and dual-GPU implementations, whichever subset of matrices is considered, the Kry component dominates the overall running time. This is because SaP-D is worse than SaP-C and SaP-B in terms of preconditioning quality, and several Krylov iterations are required for convergence. (2) The time spent in the other pre-processing stages, for instance partitioning and data transfers among GPUs when dual GPUs are leveraged, is trivial. (3) The LU component benefits more from an additional GPU than Kry does. As discussed in §6.1.3, SaP might use multiple blocks of threads to update the L and U entries, which results in more blocks than the multiprocessors of a single GPU can handle in a parallel manner. This is alleviated by leveraging an additional GPU.

(a) Single GPU
                         LU     PrePRst  Kry
R^{Γ-Tot}                20.09  2.485    78.18
R^{Γ-Tot}, N ≥ 50,000    40.54  3.990    55.10
R^{Γ-Tot}, K ≥ 50        40.54  2.854    55.10

(b) Dual GPU
                         LU     PrePRst  Kry
R^{Γ-Tot}                19.01  8.994    70.83
R^{Γ-Tot}, N ≥ 50,000    22.50  12.91    64.27
R^{Γ-Tot}, K ≥ 50        34.31  11.71    55.57

Table 7.5: Medians of R^{Γ-Tot} and R^{Γ-PreP} for all stages in SaP::GPU-D. Table 7.5a shows the single-GPU results, and Table 7.5b shows the dual-GPU results.

(a) Single GPU

(b) Dual GPU

Figure 7.5: Profiling results of R^{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems. Fig. 7.5a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.5b illustrates SaP::GPU-D run on dual GPUs.

(a) Single GPU

(b) Dual GPU

Figure 7.6: Profiling results of R^{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems which are considered “large” in terms of dimension N. Fig. 7.6a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.6b illustrates SaP::GPU-D run on dual GPUs.

(a) Single GPU

(b) Dual GPU

Figure 7.7: Profiling results of R^{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems which are considered “large” in terms of half-bandwidth K. Fig. 7.7a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.7b illustrates SaP::GPU-D run on dual GPUs.

Profiling results of SaP::GPU-C. In SaP::GPU-C, there are three major stages in pre-processing: the LU factorization of each diagonal block (LU), computing the spikes (SPK), and the LU factorization of the reduced system (LUrdcd); the rest of the time spent in pre-processing is denoted as PrePRst. The median percentage of time consumed in each stage is shown in Table 7.6, and Fig. 7.8 and Fig. 7.9 illustrate the distribution of R^{Γ-Tot} and R^{Γ-PreP} for each stage. It is somewhat surprising to observe that the amount of time spent in the Krylov-subspace component turned out higher than anticipated: it consumed a similar amount of time to the LU component. For a further study of the amount of time spent in LU and Kry, two subsets of “large” matrices were extracted and two additional corresponding plots were generated. One subset consists of all matrices with large dimension N, the cut-off value of which is 50,000; the other subset consists of all matrices with large half-bandwidth K, the cut-off value of which is 50. The profiling results for solving these two subsets of systems are given in Table 7.6 and Fig. 7.10. From these results, it can be concluded that: (1) the LU component is the largest time consumer, especially when the banded matrix size is large; (2) the Kry component consumes a non-trivial amount of time, yet this stage scales better in terms of both N and K than LU does.

                         LU     SPK    LUrdcd  PrePRst  Kry
R^{Γ-Tot}                35.47  8.399  3.947   5.363    33.91
R^{Γ-PreP}               58.39  15.19  6.123   10.23    -
R^{Γ-Tot}, N ≥ 50,000    57.50  3.634  0.7316  4.338    30.098
R^{Γ-Tot}, K ≥ 50        45.14  14.68  5.123   4.060    22.513

Table 7.6: Medians of R^{Γ-Tot} and R^{Γ-PreP} for all stages in SaP::GPU-C. It is worth noting that the sum of each row is not expected to be 100, because these values are statistical medians.

Figure 7.8: Profiling results of R^{Γ-Tot} in SaP::GPU-C obtained from a set of 58 banded linear systems which SaP::GPU-C successfully solved.

Figure 7.9: Profiling results of R^{Γ-PreP} in SaP::GPU-C obtained from a set of 58 banded linear systems which SaP::GPU-C successfully solved.

(a) Matrices with N ≥ 50,000

(b) Matrices with K ≥ 50

Figure 7.10: Profiling results of R^{Γ-Tot} in SaP::GPU-C obtained from a subset of 58 banded linear systems which are considered “large”. Fig. 7.10a illustrates all cases with large N, and Fig. 7.10b illustrates all cases with large K.

Profiling results of SaP::GPU-B. In SaP::GPU-B, there are three major stages in pre-processing: LU (Eq. 3.16a); MtSw (Eqs. 3.16b and 3.16c), which consists of multiple right-hand-side solves; and MM (Eqs. 3.16d, 3.16e and 3.16f), which consists of matrix multiplications and additions; the rest of the time spent in pre-processing other than in these three stages is denoted as PrePRst. There are two major stages in the Krylov solve: SgSw (Eq. 3.16g), which consists of single right-hand-side solves, and MV (Eqs. 3.16h and 3.16i), which mainly consists of matrix-vector multiplications; the rest of the time spent in the Krylov solve other than in these two stages is denoted as KryRst. The median percentage of time consumed in each stage is shown in Table 7.7, and Fig. 7.11 and Fig. 7.12 employ a median-quartile method to illustrate the distribution of time consumption in each stage of SaP::GPU-B. The results in Table 7.7 and Fig. 7.12 reveal the bottleneck of the BCR approach and suggest the direction of future optimization efforts. On a coarse level, the pre-processing components are significantly more time-consuming than the Krylov-subspace components. This matches our expectation, in that the BCR algorithm was designed to be an exact method, and is therefore expected to solve a system with only one call to the preconditioning solve if no pivot boosting happens during factorization. On a fine level, the matrix multiplication operations in Eqs. 3.16d, 3.16e and 3.16f and the multiple right-hand-side sweep operations in Eqs. 3.16b and 3.16c are more costly than the LU factorization of Eq. 3.16a. This agrees with the count of floating point operations in §3.1.4: the number of operations in stages MtSw and MM is roughly twice and four times that in stage LU, respectively. Similar studies on “large” matrices, in terms of both dimension N and half-bandwidth K, were carried out for SaP::GPU-B (see Table 7.7 and Fig. 7.13). There are two significant differences in R^{Γ-Tot}_{N≥50,000} and R^{Γ-Tot}_{K≥50} compared against R^{Γ-Tot}: (1) for larger N, the MV stage in the Krylov solve consumes a significantly larger share of the elapsed time; (2) for larger K, the MtSw component consumes a significantly larger share of the elapsed time. These facts shed light upon the directions of optimization for BCR methods: more effort is required in stages MtSw, MM (which overall consumes more time than any other stage over all test matrices) and MV.

                         LU     MtSw   MM     PrePRst  SgSw   MV     KryRst
R^{Γ-Tot}                18.44  26.30  29.36  4.492    5.964  5.719  3.890
R^{Γ-PreP}               23.23  31.78  33.84  5.464    -      -      -
R^{Γ-Kry}                -      -      -      -        29.11  38.11  25.77
R^{Γ-Tot}, N ≥ 50,000    19.36  26.30  29.36  4.608    4.026  9.350  3.331
R^{Γ-Tot}, K ≥ 50        19.47  36.71  28.75  4.012    4.708  3.297  2.861

Table 7.7: Medians of R^{Γ-Tot}, R^{Γ-PreP} and R^{Γ-Kry} for all stages in SaP::GPU-B. It is worth noting that the sum of each row is not expected to be 100, in that these values are statistical medians.

Figure 7.11: Profiling results of R^{Γ-Tot} in SaP::GPU-B obtained from a set of 57 banded linear systems which SaP::GPU-B successfully solved.

(a) R^{Γ-PreP}

(b) R^{Γ-Kry}

Figure 7.12: Profiling results of R^{Γ-PreP} and R^{Γ-Kry} in SaP::GPU-B obtained from a set of 57 banded linear systems which SaP::GPU-B successfully solved.

(a) Matrices with N ≥ 50,000

(b) Matrices with K ≥ 50

Figure 7.13: Profiling results of R^{Γ-Tot} in SaP::GPU-B obtained from a subset of 57 banded linear systems which are considered “large”. Fig. 7.13a illustrates all cases with large N, and Fig. 7.13b illustrates all cases with large K.

7.2 Numerical experiments related to sparse matrix reorderings

When solving sparse linear systems, SaP reformulates the sparse problem as a dense banded linear system that is subsequently solved using SaP::GPU-C or SaP::GPU-D. Ideally, the “sparse–to–dense” transition yields a coefficient matrix that is diagonal heavy; i.e., has a large d, and has a small bandwidth K. Two matrix reorderings are applied in an attempt to meet these two objectives. The first one, namely the diagonal boosting reordering, is assessed in section §7.2.1. The second one, namely the bandwidth reduction reordering, is evaluated in §7.2.2.

7.2.1 Assessment of the diagonal boosting reordering solution

Two sets of experiments were carried out to assess the DB reordering. The first set of experiments, summarized in §7.2.1.1, studies the effectiveness of the DB reordering; i.e., what is the resulting d of the matrix QA_s, where Q represents the DB reordering. The second set of experiments, summarized in §7.2.1.2, studies how the DB reordering implemented in SaP performs compared against the Harwell Sparse Library (HSL) MC64 algorithm (Hopper, 1973). Unless otherwise stated, the experiments in §7.2, §7.3 and §7.4 used 113 matrices selected from the University of Florida Sparse Matrix Collection (Davis and Hu, 2011), and three matrices from the Simulation-Based Engineering Lab (SBEL) at the University of Wisconsin-Madison (SBEL, 2006).

7.2.1.1 Efficiency evaluation of diagonal boosting algorithm

The first set of experiments aims at assessing the efficiency of DB in boosting matrix diagonals. It is instructive to study, for each sparse matrix A_s, how the diagonal dominance d changes by applying the DB reordering Q. Analyzing the distribution of the resulting d over all 116 matrices is also of interest. Together with the sensitivity study with respect to d detailed in §7.1.2, these analyses serve as an intuition for SaP::GPU in selecting an efficient preconditioning method.

Different from the synthetic banded matrices used for the set of experiments described in §7.1, where d_i ≡ |a_ii| / Σ_{j≠i} |a_ij|, 1 ≤ i ≤ N, is uniform over the rows i, the values d_i are subject to large variance for both A_s and QA_s. As d is the minimum of all d_i, it might fail to accurately reflect the heaviness of all diagonal elements of a matrix. For this purpose, we define

    d_{1-pct} = pct1_{i=1,...,N} ( |a_ii| / Σ_{j≠i} |a_ij| ),          (7.2)

where pct1 is the first percentile of an array of numbers. In the experiments that evaluate DB's efficiency in boosting diagonals, both d and d_{1-pct} are selected as the metrics for assessing the DB reordering in SaP::GPU. In particular, d_{1-pct} is meant to rule out those rare rows with extremely small diagonal elements.
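The two metrics can be computed directly from a CSR matrix, as in the sketch below. This is an illustration over a 4x4 toy matrix, not SaP::GPU code; rows without off-diagonal entries are handled with a simplification noted in the comments.

// Sketch of the diagonal-dominance metrics: d is the minimum of the row ratios
// d_i = |a_ii| / sum_{j != i} |a_ij|, and d_{1-pct} replaces the minimum with
// the first percentile so that a few pathological rows do not dominate.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // CSR storage of a small sparse matrix.
    const int N = 4;
    const int row_ptr[] = {0, 2, 5, 7, 9};
    const int col_idx[] = {0, 1,  0, 1, 2,  1, 2,  2, 3};
    const double val[]  = {4, -1, -2, 5, 1, 1, 3,  -1, 2};

    std::vector<double> di(N);
    for (int i = 0; i < N; ++i) {
        double diag = 0.0, off = 0.0;
        for (int e = row_ptr[i]; e < row_ptr[i + 1]; ++e)
            (col_idx[e] == i ? diag : off) += std::fabs(val[e]);
        di[i] = off > 0.0 ? diag / off : diag;  // simplification for rows with no off-diagonal entries
    }

    std::sort(di.begin(), di.end());
    double d = di.front();
    // First percentile by a simple nearest-rank rule, clamped to a valid index.
    int idx = std::max(0, static_cast<int>(std::ceil(0.01 * N)) - 1);
    double d_1pct = di[idx];

    std::printf("d = %.3f, d_1pct = %.3f\n", d, d_1pct);
    return 0;
}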

Fig. 7.14 illustrates the distribution of the d and d_{1-pct} values over all 116 matrices before and after applying the DB reordering. There are several observations from the statistical results:

• d does not decrease for any of the 116 matrices and in fact increases for 80 of them; for all but one matrix (sparsine), d_{1-pct} does not decrease after the DB reordering algorithm is applied, and for 79 out of the 116 matrices, d_{1-pct} increases by applying the DB reordering.

• The median of d increases from 0.0012 to 0.1383, and the median of d_{1-pct} increases from 0.0245 to 0.2004. According to the results of the sensitivity study with respect to d in §7.1.2, all three approaches are likely to demonstrate reasonable performance. The most interesting preconditioning approach is SaP-D, which can be expected to converge quickly for at least half of the 116 selected application systems, in that d_{1-pct} > 0.2 holds for more than half of those systems.

• There are 37 matrices with zero d and 30 matrices with zero d_{1-pct} before the DB reordering, while no zero d or zero d_{1-pct} is observed after the DB reordering. This implies that DB helps avoid zero pivoting in the LU factorization for at least 30% of the systems.

(a) d

(b) d_{1-pct}

Figure 7.14: Results of a statistical analysis that uses a median-quartile method to measure the spread of d (Fig. 7.14a) and d_{1-pct} (Fig. 7.14b) before and after the application of the diagonal boosting reordering. d_{1-pct} is defined in Eq. 7.2.

7.2.1.2 Performance comparison against Harwell Sparse Library’s solution

The second set of results, summarized in Fig. 7.15, corresponds to the performance comparison between the hybrid CPU–GPU implementation of §4.1 and the HSL MC64 algorithm (Hopper, 1973). The hybrid implementation outperformed MC64 for 96 out of the 116 matrices. The left pane in Fig. 7.15 presents results of a statistical analysis that used a median-quartile method to measure the spread of the MC64 and DB times to solution. Assume that T_α^{DB} and T_α^{MC64} represent the times required by DB and MC64, respectively, to complete the diagonal boosting reordering in test α. A relative speedup is computed as

    S_α^{DB-MC64} = log_2 ( T_α^{MC64} / T_α^{DB} ).          (7.3)

These S_α^{DB-MC64} values, which can be either positive or negative, are collected in a set S^{DB-MC64} which is used to generate the left box plot in Fig. 7.15. The number of tests used to produce these statistical results was 116. Note that a positive value means that DB is faster than MC64, with the opposite outcome being the case for negative values of S_α^{DB-MC64}. The median value of S^{DB-MC64} was 1.2423, which indicates that half of the 116 tests ran more than 2.3 times faster using the DB implementation. On average, it turns out that the larger the matrix, the faster the DB solution becomes. Indeed, as a case study, we analyzed a subset of larger matrices. The “large” attribute was defined in two ways: first, by considering the matrix size, and second, by considering the number of nonzero elements. For the 116 matrices considered, we picked the largest 24 of them; i.e., approximately the largest 20%. To this end, in the first case, we selected all matrices whose dimension was higher than N = 150,000. In the second case, we selected all matrices whose number of nonzero elements was larger than 4,350,000. For large N, the median was 1.6255, while for matrices with many nonzero elements, the median was 1.7276. In other words, half of the large tests ran more than three times faster in DB. Finally, the statistical results in Fig. 7.15 indicate that for the large tests, with the exception of two outliers, there were no tests for which S_α^{DB-MC64} was negative; i.e., with these exceptions, DB was faster. When all 116 tests were considered, MC64 was faster in several cases, with an outlier for which MC64 was four times faster than DB. Two facts emerged at the end of this analysis. First, as discussed in Li et al. (2014), the bottleneck in the diagonal boosting reordering was either the DB-S2 stage; i.e., finding the initial match, or the DB-S3 stage; i.e., finding a perfect match, with an approximately equal split among them. Second, the quality of the reordering turned out to be identical: almost all matrices displayed the same grand product of the diagonal entries regardless of whether the reordering was carried out using MC64 or DB.

Figure 7.15: Results of a statistical analysis that uses a median-quartile method to measure the spread of the MC64 and DB times to solution. The speedup factor, or performance metric, is computed as in Eq. (7.3).

7.2.2 Assessment of the bandwidth reduction reordering solution

The performance of the CM solution implemented in SaP was evaluated on a set of 125 sparse matrices from various applications. These matrices were the 116 used in the previous section plus several other matrices, such as ANCF31770, ANCF88950, NetANCF_40by40, etc., that arise in granular dynamics and the implicit integration of flexible multi-body dynamics (Fang, 2015; Fang and Negrut, 2014; Serban et al., 2015). Figure 7.16 presents results of a statistical analysis that used a median-quartile method to compare (i) the half-bandwidths of the matrices obtained by Harwell's MC60 and SaP's CM; and (ii) the time to solution; i.e., the time to complete a band-reducing reordering. For (i), the quantity reported is the relative difference between the resulting bandwidths,

    r_K ≡ 100 × (K_{MC60} − K_{CM}) / K_{CM} ,

where K_{MC60} and K_{CM} are, respectively, the half-bandwidths K of the matrices produced by MC60 and CM. For (ii), the metric used was identical to the one introduced in

Eq. (7.3). Note that CM is superior when r_K assumes large positive values, which are also desirable for the time-to-solution plot. As far as r_K is concerned, the median value is 0%; i.e., out of 125 matrices, about half are better off being reordered by Harwell's MC60, with the other half being better off reordered by SaP's CM. On the positive side, the number of outliers for CM is higher, indicating that there is a propensity for CM to “win big”. In terms of times to solution, MC60 is marginally faster than CM's hybrid CPU/GPU solution. Indeed, the median value of the performance metric is −0.1057; i.e., half of the tests run with CM take at least 1.076 times longer to complete the bandwidth reduction task.

Figure 7.16: Comparison of the Harwell MC60 and SaP’s CM implementations in terms of resulting half bandwidth K and time to solution.

It is insightful to discuss what happens when this statistical analysis is restricted to larger matrices. The results of this analysis are captured in Fig. 7.17. Just like in §7.2.1.2, the focus is on the largest 20% of the matrices, where “large” is understood to mean large matrix dimension N and then, separately, a large number of nonzeros nnz. Incidentally, the cut-off value for the dimension was N = 215,000, while for the number of nonzeros it was nnz = 7,800,000. When the statistical analysis included the 25 largest matrices based on size N, the median value for the half-bandwidth metric r_K was yet again 0.0%. The median value for the time to solution changed, however, from −0.1057 to 0.6964, to indicate that for half of these large tests SaP ran more than 1.6 times faster than the Harwell solution. Qualitatively, the same conclusions were reached when the 25 large matrices were selected on the grounds of nnz count. The median for r_K was 0.4182%, which again suggested that the relative difference in the resulting bandwidth K yielded by CM and MC60 was practically negligible. The median time-to-solution metric was the same 0.6964. Note though that, according to the results shown in Fig. 7.17, there is no large-nnz test for which the Harwell implementation is faster than CM. In fact, 25% of the large tests; i.e., about five tests, run at least three times faster in CM.

Figure 7.17: Comparison of the Harwell MC60 and SaP’s CM implementations in terms of resulting half bandwidth K and time to solution. Statistical analysis of large matrices only.

Finally, it is worth pointing out the correlations between the times to solution and the K values, on the one hand, and N and nnz, on the other hand. Herein, the correlation used is the Pearson product-moment correlation coefficient (Box et al., 1978). As a rule of thumb, a Pearson correlation coefficient of 0.01 to 0.19 suggests a negligible relationship, while a coefficient between 0.7 and 1.0 indicates a strong positive relationship. The correlation coefficient between the bandwidth and the dimension N of the matrix turns out to be small; i.e., 0.15 for MC60 and 0.16 for CM. Indeed, the fact that a matrix is large does not say much about what K value one can expect upon reordering this matrix. However, the correlation between the matrix dimension and the amount of time needed to produce the reordering is very high. In other words, the larger the matrix size N, the longer the time to produce the reordering: the correlation coefficient was 0.91 for MC60 and 0.81 for CM. The same observation holds for the number of nonzero entries: when there are a lot of them, the time to produce a reordering is large. The Pearson correlation coefficient is 0.71 for MC60 and 0.83 for CM. These correlation coefficients were obtained on a sample size of 125 matrices. Yet the same trends are manifest for the reduced set of 25 large matrices that we worked with. For instance, the correlation between the dimension N and the resulting K is very small at large N values: 0.04 for MC60 and 0.05 for CM. For the time to solution, the correlation coefficients with respect to N are 0.89 for MC60 and 0.76 for CM.
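For completeness, the Pearson product-moment correlation coefficient used above can be computed as in the sketch below; the two arrays are placeholder data, not the actual (N, time) or (nnz, time) samples from the 125-matrix study.

// Sketch of the Pearson correlation coefficient r = cov(x, y) / (sigma_x * sigma_y).
#include <cmath>
#include <cstdio>
#include <vector>

double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;                     // un-normalized covariance
    double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return cov / std::sqrt(vx * vy);
}

int main() {
    std::vector<double> dim  = {1e4, 5e4, 1e5, 2e5, 5e5};  // matrix dimensions (placeholder)
    std::vector<double> time = {12, 40, 95, 160, 520};     // reordering times in ms (placeholder)
    std::printf("Pearson correlation: %.2f\n", pearson(dim, time));
    return 0;
}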

7.3 Numerical experiments related to sparse linear systems

7.3.1 Profiling results

Figure 7.18 plots statistical results that summarize how the time to solution; i.e., the time for finding x in Ax = b, is spent in SaP::GPU. The raw data used in this analysis is available on-line (SBEL, b). A discussion of exactly what it means to find the solution of the linear system is postponed to §7.3.4. The labels used in Fig. 7.18 are inspired by the notation used in §3.3 and Fig. 3.2. Consider, for instance, the diagonal boosting reordering DB employed by SaP. In a statistical sense, the percentage of the time to solution spent in DB is represented using a median-quartile method to measure statistical spread. The raw data used to generate the DB box was obtained as follows. If a test α that runs to completion requires T_α^{DB} > 0 for DB completion, then this test generates one data entry in an array of data subsequently used to produce the statistical result. The actual entry that is used is R_α^{DB-Tot}, defined in Eq. 7.1a. In other words, the entry is the percentage of time spent, when solving this particular linear system, on performing the diagonal boosting reordering. The bars for the K-reducing reordering (CM), for the multiple data transfers between CPU and GPU (Dtrsf), etc., are similarly obtained. Not all bars in Fig. 7.18 were generated using the same number of data entries; i.e., some tests contributed to some but not all bars. For instance, a symmetric positive definite linear system requires no DB step and, as such, this test will not contribute an entry to the array of data used to determine the DB box in the figure. Of a batch of 90 tests that ran to completion with SaP, the sample population used to generate the bars is as follows: 90 data points for CM, Dtrsf, and Kry; 63 data points for DB; 65 for LU; 32 data points for Drop; and 9 data points for BC, SPK, and LUrdcd. These counts provide insights into the solution path adopted by SaP in solving the 90 linear systems. For instance, the coupled approach; i.e., the SPIKE method of Polizzi and Sameh (2006), has been employed in the solution of nine of the 90 linear systems. The rest of them were solved via SaP::GPU-D. Of the 90 linear systems, 25 were most effectively solved by SaP resorting to diagonal preconditioning; i.e., after DB, all the entries were dropped off except the heavy diagonal ones. Also, note that several of the linear systems considered were symmetric positive definite, which explains the reduced data point count for DB. A statistical analysis of the time spent in the Krylov-subspace component of the solution reveals that its median was 52.74%. The median times for the other components of the solution are listed in the first row of data in Table 7.8. (Note that the values in each row of data do not add up to 100% for several reasons. First, these are statistical median values. Second, there are many tests that do not include all the components of the solution. For instance, SPK is computed based on a set of 16 points while Drop is computed using 32 data points, some of them not even obtained in conjunction with the same test.) The second row of data provides the median values when the Krylov-subspace component, which dwarfs most of the solution components, is eliminated. In this case, the entry for DB, for instance, was obtained based on data points 100 × T_α^{DB} / T_α^{Tot}, where this time around T_α^{Tot} included everything except the time spent in the Krylov-subspace component of the solution.

In other words, T_α^{Tot} is here the time required to compute the preconditioner from scratch. The median values should be used in conjunction with the median-quartile boxplot of Fig. 7.18 for the first row of data, and with Fig. 7.19 for the second row of data. Consider, for instance, the results associated with the drop-off operation. In the Krylov-inclusive measurement, Drop has a median of 5.3%; i.e., half of the 32 tests which employed drop-off spent more than this amount performing the drop-off, while half were quicker. The spread is rather large and there are several outliers that suggest that a handful of tests require a very large amount of time to be spent in the drop-off part of the solution.
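The median-quartile summary used for these box plots amounts to sorting the collected R values and reading off three percentiles. The sketch below illustrates one simple nearest-rank convention over placeholder values; it is not the plotting code used to produce the figures.

// Sketch of a median-quartile summary (25th, 50th, 75th percentiles) of a set
// of R^{Gamma-Tot} percentages.
#include <algorithm>
#include <cstdio>
#include <vector>

double percentile(std::vector<double> v, double p) {
    std::sort(v.begin(), v.end());
    size_t idx = static_cast<size_t>(p * (v.size() - 1) + 0.5);  // nearest-rank convention
    return v[idx];
}

int main() {
    std::vector<double> r = {2.1, 3.5, 4.8, 5.3, 7.9, 12.4, 18.0, 26.5};  // placeholder percentages
    std::printf("quartiles: %.1f / %.1f / %.1f\n",
                percentile(r, 0.25), percentile(r, 0.50), percentile(r, 0.75));
    return 0;
}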

        DB   CM   Dtransf  Drop  Asmbl  BC   LU    SPK   LUrdcd  Kry
Row 1   3.5  1.3  1.5      5.3   0.6    1.6  18.4  35.7  7.6     52.7
Row 2   9.7  3.4  3.8      27.3  1.8    2.4  63.4  45.6  11.7    0

Table 7.8: Median information for the SaP solution components as % of the time for solution. Two scenarios are considered: the first data row provides values when the total time; i.e., 100%, included the time spent by SaP in the Krylov-subspace component. The second row of data is obtained by considering 100% to be the time required to compute a factorization of the preconditioner.

Figure 7.18: Profiling results obtained for a set of 90 linear systems that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.19: Profiling results obtained for a set of 90 linear systems that, out of a collection of 114, could be solved by SaP::GPU.

The results in Fig. 7.18 and Table 7.8 suggest where the optimization efforts should concentrate in the future. For instance, the time required for the CPU↔GPU data transfer is, in the overall picture, rather insignificant and as such a matter of less concern. Somewhat surprisingly, the amount of time spent in drop-off came out higher than anticipated, at least in relative terms. One caveat is that no effort was made to optimize this component of the solution. Instead, the effort went into optimizing the DB and CM solution components. This paid off, as matrix reordering in SaP, particularly for large matrices, is fast when compared to Harwell's, and it reached the point where the drop-off became a more significant bottleneck. Another seemingly surprising observation was the relatively small number of cases in which SaP::GPU-C was preferred over SaP::GPU-D; i.e., in which the SPIKE strategy (Polizzi and Sameh, 2006) was employed. The combination of the two points below serves as an explanation of these results:

• In the current implementation, a large number of iterations associated with a less sophisticated preconditioner is preferred to a smaller count of expensive iterations associated with SaP::GPU-C, if the diagonal dominance d is reasonably large (the sensitivity study with respect to d in §7.1.2 demonstrates this point). Out of a sample population of 90 tests, when invoked, the median number of iterations to solution in SaP::GPU-C was 6.75. Conversely, when SaP::GPU-D was preferred, the median count was 29.375 (SBEL, b).

• The study of the efficiency of the DB reordering algorithm in §7.2.1.1 suggests that for a large fraction of the 114 linear systems discussed in §7.3, the resulting diagonal dominance d and d_{1-pct} are reasonably large: the median of the resulting d is 0.1383, and the median of the resulting d_{1-pct} is 0.2004. These statistical results suggest a large population of linear systems for which SaP::GPU-D is preferable to SaP::GPU-C and SaP::GPU-B.

To determine where the optimization efforts should concentrate in the future for each of the approaches introduced in §3.1, the 90 tests which SaP::GPU successfully solved were categorized by preconditioning method into five types, and profiling results were generated for the tests of each type. These five types are:

(I) Direct solve without applying SaP. SaP::GPU directly solves systems of this type with one partition.

(II) Diagonal preconditioning. SaP::GPU applies a diagonal preconditioner to systems of this type.

(III) Preconditioning with SaP-D. SaP::GPU applies the decoupled approach to systems of this type.

(IV) Preconditioning with SaP-C. SaP::GPU applies the coupled approach to systems of this type.

(V) Preconditioning with SaP-B. SaP::GPU applies the approach based on the BCR algorithm to systems of this type.

Figs. 7.20, 7.21, 7.22, 7.23 and 7.24 illustrate the profiling results obtained for the sets of linear systems of Type I, II, III, IV and V, respectively, successfully solved by SaP::GPU. Table 7.9 lists the median number of iterations and the median percentage of time spent in the Krylov subspace component for each of the five types. Among these 90 tests, 30 tests are of Type I, 25 are of Type II, 19 are of Type III, six are of Type IV, and ten are of Type V. From the categorized information, several observations can be made:

1. Both the number of Krylov iterations and the percentage of time to solution for systems of Type II and Type III are significantly larger than for those of other types. For optimizing the solution of these two types, efforts should concentrate only on optimizing the operations in the Krylov subspace component: the preconditioning solve, SpMV and vector operations.

2. For the direct solve (Type I), LU is the most time-consuming component, thus calling for a more advanced, more parallelizable LU factorization method. Meanwhile, the time spent in the Krylov subspace component is also significant.

3. For the coupled approach (Type IV), the spike calculation SPK, instead of LU, is the largest time consumer when the Krylov subspace component is not taken into account. Although the optimization efforts discussed in §6.1 have been made, SPK continues to be a limiting factor in SaP::GPU-C. This observation is also applicable to the approach based on the BCR algorithm (Type V), with the difference that for Type V, the SPK component is even more expensive than the Krylov subspace component.

4. Both the number of Krylov iterations and the percentage of time to solution for systems of Type V are significantly lower than those of other types. This suggests that, in the current implementation, SaP::GPU-B is the most sophisticated preconditioning method, which attempts to solve a system directly and in a highly parallelized manner. This, in consequence, leads to the high time cost of the SaP::GPU-B method itself.

Figure 7.20: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 30 linear systems of Type I that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.21: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 25 linear systems of Type II that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.22: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 19 linear systems of Type III that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.23: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 6 linear systems of Type IV that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.24: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 10 linear systems of Type V that, out of a collection of 114, could be solved by SaP::GPU.

                               Type-I   Type-II   Type-III   Type-IV   Type-V   Overall
    Median iterations             1      158.3      204.3       5        0.5      13.3
    Median % of time in Krylov   34.1     79.9       85.1      34        14       52.7

Table 7.9: Median information on the Krylov subspace component for each type of linear system. The first row provides the median number of iterations to solution; the second row provides the median percentage of the time to solution spent in the Krylov subspace component.

7.3.2 The impact of the third stage reordering

It is almost always the case that, upon carrying out a CM reordering of a sparse matrix, the resulting matrix A has a small number of entries in its first and last rows. As the row index j increases, the number of nonzeros in row j grows up to approximately j ≈ N/2; thereafter, the nonzero count decreases again, reaching small values towards j ≈ N. Overall, A has its half-bandwidth K dictated by the worst offender. Consequently, a partitioning of A into A_i, i = 1, ..., P, would conservatively require that, for instance, A_1 and A_P work with a large K most likely dictated by a sub-matrix such as A_{P/2}. Allowing each A_i to have its own K_i proved to lead to efficiency gains for two main reasons. First, in SaP-C it led to a reduction in the dimension of the spikes, since for each pair of coupling blocks B_i and C_i the number of columns in the ensuing spikes is the larger of the values K_i and K_{i+1}. Second, SaP::GPU capitalizes on the observation that, since the A_i are independent and governed by their local K_i, nothing prevents a further reordering that attempts to reduce the bandwidth of each A_i. For SaP::GPU-D this reordering can always be applied; for SaP::GPU-C, since applying an additional reordering to the diagonal blocks requires a complete spike calculation, a performance gain is promising (based on the discussion in §3.1.4) only if the decrease in the average K is significant (at least 60%). Because it comes on the heels of the DB and CM reorderings, this step is called a "third stage reordering"; it is applied independently, and preferably concurrently, to the P sub-matrices A_i. As illustrated in Table 7.10, the decrease in the local K_i can be significant and it can lead to non-negligible speedups, see Table 7.11.
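The bookkeeping behind these per-partition bandwidths is straightforward. The sketch below is not SaP::GPU code; it assumes equally sized contiguous partitions and 0-based CSR indexing, computes each K_i from the entries that fall inside the diagonal block A_i, and evaluates the spike widths max(K_i, K_{i+1}) used by SaP-C.

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // Local half-bandwidth K_i of each diagonal block A_i for a CSR matrix
    // split into P contiguous, (nearly) equal partitions.
    std::vector<int> localHalfBandwidths(int N, int P,
                                         const std::vector<int>& rowPtr,
                                         const std::vector<int>& colIdx) {
        const int size = (N + P - 1) / P;                 // nominal partition size
        std::vector<int> K(P, 0);
        for (int row = 0; row < N; ++row) {
            const int part  = std::min(row / size, P - 1);
            const int first = part * size;
            const int last  = std::min(first + size, N) - 1;
            for (int jj = rowPtr[row]; jj < rowPtr[row + 1]; ++jj) {
                const int col = colIdx[jj];
                if (col >= first && col <= last)          // count entries inside A_i only
                    K[part] = std::max(K[part], std::abs(col - row));
            }
        }
        return K;
    }

    // In SaP-C, the spikes coupling partitions i and i+1 have max(K_i, K_{i+1}) columns.
    std::vector<int> spikeWidths(const std::vector<int>& K) {
        std::vector<int> w;
        for (std::size_t i = 0; i + 1 < K.size(); ++i)
            w.push_back(std::max(K[i], K[i + 1]));
        return w;
    }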

ANCF31770 (P = 20, K reduction 64.75%)
    K_i before 3rd SR: 123, 170, 204, 229, 247, 247, 247, 247, 248, 242, 213, 181, 134, 68, 106, 129, 124, 124, 113, 82
    K_i after 3rd SR:   89,  92,  79,  46,  45,  48,  48,  59,  50,  58,  72,  98,  64, 56,  42,  36,  54,  49,  59, 82

ANCF88950 (P = 20, K reduction 67.63%)
    K_i before 3rd SR: 194, 274, 337, 387, 410, 410, 410, 410, 410, 405, 352, 296, 227, 116, 176, 208, 204, 204, 191, 137
    K_i after 3rd SR:  116,  74,  65, 109, 112,  97, 100,  93,  97, 114, 116,  56,  88,  75, 116,  50,  96,  97, 118,  75

af23560 (P = 10, K reduction 65.39%)
    K_i before 3rd SR: 274, 317, 317, 317, 320, 339, 334, 317, 314, 283
    K_i after 3rd SR:  140,  71,  71, 102,  74, 123, 127, 119, 114, 143

NetANCF40by40 (P = 16, K reduction 75.76%)
    K_i before 3rd SR: 256, 378, 458, 533, 599, 634, 578, 517, 436, 343, 215, 210, 275, 295, 257, 178
    K_i after 3rd SR:  125,  68, 122, 118,  85,  93,  97,  91,  57,  69, 112,  85,  85,  73, 113, 101

bayer01 (P = 8, K reduction 80.20%)
    K_i before 3rd SR: 684, 1325, 1308, 1288, 879, 501, 493, 508
    K_i after 3rd SR:  532,  170,  122,  110, 109, 110, 110, 121

finan512 (P = 16, K reduction 76.98%)
    K_i before 3rd SR: 1124, 1287, 1316, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1015
    K_i after 3rd SR:   587,  288,  288,  288,  288,  288,  288,  288,  288,  288,  288,  288,  288,  288,  227,  211

gridgena (P = 6, K reduction 71.97%)
    K_i before 3rd SR: 247, 405, 405, 405, 405, 247
    K_i after 3rd SR:  132,  81,  80, 122,  72, 105

lhr10c (P = 6, K reduction 20.37%)
    K_i before 3rd SR: 315, 348, 288, 166, 156, 259
    K_i after 3rd SR:  163, 156, 259, 217, 247, 178

rma10 (P = 10, K reduction 27.16%)
    K_i before 3rd SR: 180, 281, 702, 678, 495, 637, 560, 495, 478, 545
    K_i after 3rd SR:  155, 241, 647, 540, 254, 496, 422, 217, 349, 358

Table 7.10: Examples of matrices for which the third stage reordering (3rd SR) significantly reduced the block half-bandwidths K_i of the diagonal blocks A_i, i = 1, ..., P.

                     w/o 3rd SR        w/ 3rd SR
    Mat. Name         P      K_i        P      K_i      SpdUp
    ANCF31770        16      248       20       98      1.203
    ANCF88950        32      410       20      118      1.537
    af23560          10      339       10      143      1.238
    NetANCF40by40    16      634       16      125      1.900
    bayer01           8     1325        8      532      2.234
    finan512         10     1331       16      587      1.804
    gridgena          6      405        6      132      1.636
    lhr10c            4      427        6      259      1.228
    rma10            10      702       10      647      1.113

Table 7.11: Speedup ("SpdUp") values obtained upon embedding a third stage reordering step in the solution process, a decision that also changed the number of partitions P yielding the best performance. To allow correlating these results with the values provided in Table 7.10, this table lists for each matrix A the largest of its K_i values, i = 1, ..., P.

7.3.3 Comparison against state-of-the-art direct linear solvers on the CPU

A set of 114 matrices, of which 105 come from the University of Florida sparse matrix collection, is used herein to compare the robustness and time to solution of SaP::GPU, PARDISO, SuperLU, and MUMPS. This set was selected on the following basis: at least one of the four solvers can retrieve the solution x within 1% relative accuracy. For a sparse linear system A_s x = b, the method of manufactured solutions was used to measure the relative accuracy: an exact solution x⋆ was first chosen and the right-hand side was then set to b = A_s x⋆. Each sparse linear solver attempted to produce an approximation x of the solution x⋆. If this approximation satisfied ‖x − x⋆‖₂ / ‖x⋆‖₂ ≤ 0.01, the solve was considered successful. Given that SaP::GPU is an iterative solver, its initial guess is always x^(0) = 0_N. Although in many instances the initial guess can be selected to be relatively close to the actual solution, this situation is avoided here by choosing x⋆ far from the aforementioned initial guess. Specifically, x⋆ had its entries roughly distributed on a parabola starting from 1.0 as the first entry, reaching approximately 400 at N/2, and decreasing back to 1.0 for the N-th and last entry of x⋆.
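As an illustration, the sketch below (not the actual test harness; the parabola coefficients are one possible choice matching the description above) builds such an x⋆, forms b = A_s x⋆ for a matrix stored in CSR format, and applies the 1% acceptance test.

    #include <cmath>
    #include <vector>

    // One possible manufactured solution: a parabola with value 1 at both ends
    // and a peak of roughly 400 near entry N/2.
    std::vector<double> manufacturedSolution(int N) {
        std::vector<double> x(N);
        const double mid = 0.5 * (N - 1);
        for (int j = 0; j < N; ++j) {
            const double t = (j - mid) / mid;      // t in [-1, 1]
            x[j] = 1.0 + 399.0 * (1.0 - t * t);    // 1 at the ends, ~400 at N/2
        }
        return x;
    }

    // b = A * x for A stored in CSR format.
    std::vector<double> spmvCSR(const std::vector<int>& rowPtr,
                                const std::vector<int>& colIdx,
                                const std::vector<double>& val,
                                const std::vector<double>& x) {
        std::vector<double> b(rowPtr.size() - 1, 0.0);
        for (std::size_t i = 0; i + 1 < rowPtr.size(); ++i)
            for (int jj = rowPtr[i]; jj < rowPtr[i + 1]; ++jj)
                b[i] += val[jj] * x[colIdx[jj]];
        return b;
    }

    // Acceptance test: ||x - xStar||_2 / ||xStar||_2 <= 0.01.
    bool solveAccepted(const std::vector<double>& x, const std::vector<double>& xStar) {
        double errSq = 0.0, refSq = 0.0;
        for (std::size_t j = 0; j < xStar.size(); ++j) {
            const double d = x[j] - xStar[j];
            errSq += d * d;
            refSq += xStar[j] * xStar[j];
        }
        return std::sqrt(errSq) <= 0.01 * std::sqrt(refSq);
    }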

Figure 7.25 employs a median-quartile method to measure the statistical spread, in dimension and number of nonzeros, of the 114 matrices used in this sparse solver comparison. In terms of size, N ranges between 8,192 and 4,690,002; in terms of nonzeros, nnz ranges between 41,746 and 46,522,475. The median of N is 71,328 and the median of nnz is 1,167,967.

Figure 7.25: Statistical information regarding the dimension N and number of nonzeros nnz for the 114 coefficient matrices used to compare SaP::GPU, PARDISO, SuperLU, and MUMPS.

On the robustness side, SaP::GPU failed to solve 24 of the linear systems. In 22 cases, SaP::GPU ran out of GPU global memory; in the remaining two cases it failed to converge. The other solvers failed as follows: PARDISO 40 times, SuperLU 27 times, and MUMPS 35 times. For the 114 linear systems considered, SaP::GPU was therefore the most robust of the four solvers. The systems SaP::GPU failed to solve call for two qualifications. First, the GPU card had 12 GB of GDDR5 memory and, in its current implementation, SaP::GPU is an in-core solver: it does not swap data in and out of the GPU. Consequently, it ran 22 times against this hard memory-size constraint. This issue can be partially alleviated by a better GPU card; indeed, there are cards with as much as 24 GB of global memory, which still falls short of the 64 GB of RAM that PARDISO, SuperLU, and MUMPS could tap into. Second, PARDISO, SuperLU, and MUMPS were used with their default settings; adjusting the parameters that control these solvers' solution process would likely increase their success rates. Interestingly, for the 114 linear systems considered and the three state-of-the-art solvers against which SaP::GPU is compared, there was a perfect negative correlation between speed and robustness. PARDISO was the fastest, followed by MUMPS, then SaP::GPU, and finally SuperLU. Of the 60 linear systems solved by both SaP::GPU and PARDISO, SaP::GPU was faster 21 times. Of the 75 linear systems solved by both SaP::GPU and SuperLU, SaP::GPU was faster 42 times. Of the 64 linear systems solved by both SaP::GPU and MUMPS, SaP::GPU was faster 27 times. Next, the four solvers are compared using a median-quartile method to measure the statistical spread. Let T_α^{SaP::GPU} and T_α^{PARDISO} denote the times required by SaP::GPU and PARDISO, respectively, to finish test α. A relative speedup is computed as

S_α^{SaP::GPU−PARDISO} = log_2 ( T_α^{PARDISO} / T_α^{SaP::GPU} ) ,    (7.4)

with S_α^{SaP::GPU−MUMPS} and S_α^{SaP::GPU−SuperLU} computed similarly. These S_α^{SaP::GPU−PARDISO} values, which can be either positive or negative, are collected in a set S^{SaP::GPU−PARDISO} that is used to generate a box plot in Fig. 7.26. The figure also reports results on S^{SaP::GPU−SuperLU} and S^{SaP::GPU−MUMPS}. Note that the number of tests used to produce these statistical measures is different for each comparison: 60 linear systems for S^{SaP::GPU−PARDISO}, 75 for S^{SaP::GPU−SuperLU}, and 64 for S^{SaP::GPU−MUMPS}. The median values for S^{SaP::GPU−PARDISO}, S^{SaP::GPU−SuperLU}, and S^{SaP::GPU−MUMPS} are −1.6459, 0.1558, and −0.3505, respectively. These results suggest that, when it finishes, PARDISO can be expected to be roughly three times faster than SaP::GPU (2^{1.6459} ≈ 3.1). MUMPS is marginally faster than SaP::GPU, which in turn can be expected to be only slightly faster than SuperLU.
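For concreteness, the sketch below (illustrative only; the variable names are not taken from the SaP::GPU test harness) computes the S_α values of Eq. (7.4) for one solver pair and extracts the median and quartiles reported in the box plots.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // S_alpha = log2(T_other / T_sap) for every test solved by both solvers.
    std::vector<double> relativeSpeedups(const std::vector<double>& T_other,
                                         const std::vector<double>& T_sap) {
        std::vector<double> S;
        for (std::size_t a = 0; a < T_sap.size(); ++a)
            S.push_back(std::log2(T_other[a] / T_sap[a]));
        return S;
    }

    // Linear-interpolation percentile (p in [0, 1]) of an unsorted sample.
    double percentile(std::vector<double> v, double p) {
        std::sort(v.begin(), v.end());
        const double idx = p * (v.size() - 1);
        const std::size_t lo = static_cast<std::size_t>(idx);
        const double frac = idx - lo;
        return (lo + 1 < v.size()) ? v[lo] * (1.0 - frac) + v[lo + 1] * frac : v[lo];
    }
    // Example: percentile(S, 0.25), percentile(S, 0.50), percentile(S, 0.75).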

Figure 7.26: Statistical spread for SaP::GPU’s performance relative to that of PARDISO, SuperLU, and MUMPS. Referring to Eq. 7.4, the results were obtained using the data sets SSaP::GPU−PARDISO (with 60 values), SSaP::GPU−SuperLU (75 values), and SSaP::GPU−MUMPS (64 values).

Red crosses are used in Fig. 7.26 to show statistical outliers. Favorably, most of SaP::GPU's outliers are large and positive. For instance, there are three linear systems for which SaP::GPU finishes significantly faster than PARDISO, four for which it is significantly faster than SuperLU, and four for which it is significantly faster than MUMPS. On the flip side, there are two tests where SaP::GPU runs significantly slower than MUMPS and one test where it runs significantly slower than SuperLU. The results also suggest that about 50% of the linear systems run in SaP::GPU in the range between "as fast as PARDISO" and "two to three times slower than PARDISO", while 50% run in the range between "four times faster" and "four times slower" than SuperLU. Relative to MUMPS, the situation is similar to that for SuperLU, only slightly shifted towards negative territory: the second and third quartiles suggest that 50% of the linear systems run in SaP::GPU between three times faster and three times slower than MUMPS. Again, favorably for SaP::GPU, the last quartile is long and reaches well into high positive values. In other words, when SaP::GPU beats the competition, it does so by a large margin.

7.3.4 Comparison against a state-of-the-art GPU linear solver

The same set of 114 matrices used in the comparison against PARDISO, SuperLU, and MUMPS was considered to compare SaP::GPU with the sparse direct QR solver in the cuSOLVER library (NVIDIA, 2015). The cuSOLVER QR solver was run in two configurations: with and without the application of a reverse Cuthill–McKee (RCM) reordering before solving the system. RCM was optionally applied because it can potentially reduce the QR factorization fill-in.

On the robustness side, cuSOLVER successfully solved 42 out of the 114 systems when using either configuration. Only one linear system, ex11, was solved successfully by cuSOLVER but not by SaP::GPU. In all 69 systems that cuSOLVER failed to solve, the implementation ran out of memory.

On the performance side, the relative speedup S_α^{SaP::GPU−cuSOLVER} is computed as in Eq. (7.4). Fig. 7.27 employs the median-quartile method to compare the performance of SaP::GPU and cuSOLVER. The number of tests used to produce the statistical measures is 41, the number of linear systems successfully solved by both SaP::GPU and cuSOLVER. Of these 41 systems, cuSOLVER was faster than SaP::GPU in two cases: ckt11752_tr_0 and ex19. The 25th percentile, median, and 75th percentile of S_α^{SaP::GPU−cuSOLVER} are 2.0094, 2.7471, and 4.3327, respectively. These results indicate that SaP::GPU can be expected to be about 6.7 (≈ 2^{2.7471}) times as fast as cuSOLVER. It is worth noting that a symmetric reordering such as CM or RCM is not the most effective method for reducing QR factorization fill-in. A column reordering introduced in Heggernes and Matstoms (1996) could potentially improve both the robustness and the performance of cuSOLVER. This column reordering, however, is available in neither the cuSOLVER library nor the SaP::GPU library, so the RCM reordering was optionally applied in this set of tests instead.

Figure 7.27: Statistical spread for SaP::GPU's performance relative to that of cuSOLVER. Referring to Eq. (7.4), the results were obtained using the data set S^{SaP::GPU−cuSOLVER} (with 41 values).

7.4 Numerical experiments highlighting multi-GPU support

This section reports the numerical experiments that highlight multi-GPU support. Two sets of experiments were designed: one to highlight the impact of multi-GPU support on the dense banded solver, and one to highlight its impact on the general sparse solver. As stated in §6.4, multi-GPU support is only available for the decoupled approach SaP::GPU-D in the current version of SaP::GPU.

Impact on the dense banded solver. The matrix set and solver configurations are as described in §7.1.3. Similar to the speedup s_{BD} in §7.1.3, we define s_{DvsS} ≡ T_α^{single} / T_α^{dual}, where α is a specific test out of the 60 dense banded tests and the superscripts "single" and "dual" indicate whether one or two GPUs were used for the solution of the test. Fig. 7.28 is the median-quartile box plot of s_{DvsS} over 59 tests (both implementations failed in one case, with (N, K) = (1,000,000, 500)).

Out of the 59 tests, s_{DvsS} > 1 was observed in 53. The 25th percentile, median, and 75th percentile of s_{DvsS} are 1.2185, 1.4153, and 1.7275, respectively. The results indicate that, for the decoupled approach, leveraging an additional GPU is generally beneficial for the solution of banded systems.

Figure 7.28: Speedup of dual-GPU solution over single-GPU solution on banded systems with a spectrum of N and K.

There are six tests in which s_{DvsS} < 1 was observed; they correspond to combinations of N = 200,000, N = 500,000, or N = 1,000,000 with K = 10 or K = 20, i.e., tests with relatively large N and small K. According to the discussion of the multi-GPU overheads in §6.4, a small K limits the potential computational savings in the LU factorization of each diagonal block and in the preconditioning sweeps, while a large N leads to a scenario in which the savings obtained by leveraging multiple GPUs are offset by the overhead of transferring the diagonal blocks during the preconditioner setup stage and of moving vectors back and forth among the GPUs during the Krylov solve stage. It is worth noting that the tests illustrated in Fig. 7.28 do not highlight the benefit of the decreased per-GPU memory footprint enabled by multi-GPU support; in other words, with an additional GPU, the set of linear systems SaP::GPU successfully solved did not grow within this test set. To show that multi-GPU support helps in memory-critical cases, note that for a test with N = 800,000 and K = 500, where the single-GPU SaP::GPU-D runs out of memory, the dual-GPU approach finishes successfully in about 5.2 seconds.

Impact on the general sparse solver. The matrix set used to highlight the impact of multi-GPU support on the general sparse solver is the set of 19 systems preconditioned with SaP-D described in §7.3.1. s_{DvsS} is defined in the same way as in the dense banded scenario. Fig. 7.29 is the median-quartile box plot of s_{DvsS} over the 19 systems; the 25th percentile, median, and 75th percentile are 0.9389, 1.0219, and 1.1272, respectively. Somewhat surprisingly, multi-GPU support is not as impactful for general sparse systems as it is for dense banded systems. A comparison of the time spent in each SaP-D component by the single-GPU and dual-GPU approaches suggests that multi-GPU support introduces non-negligible overhead when assembling the banded matrices on the GPUs from a sparse matrix on the CPU; this overhead is not significant when solving dense banded systems.

Figure 7.29: Speedup of dual-GPU solution over single-GPU solution on the 19 systems preconditioned with SaP::GPU-D.

8 use of sap in computational multibody dynamics problems

In this chapter, we showcase how SaP::GPU is used in a computational engineering application. In multibody dynamics analysis, the simulation of compliant yet stiff structures results in numerically stiff systems of differential equations of motion (EOM). Explicit numerical integrators are known to be inefficient on such problems since, for stability reasons, they must use small step sizes, much smaller than dictated by approximation requirements alone (Hairer and Wanner, 1996). Alternatively, implicit integrators can stably solve stiff problems using much larger step sizes, yet they require at each time step the solution of a system of nonlinear algebraic equations. The nonlinear system is typically solved using a Newton-type method (Atkinson, 1989), which in turn calls for the solution of several linear systems at each integration time step. The lack of an open-source parallel sparse linear solver capable of solving these linear systems robustly and efficiently on graphics processing unit (GPU) cards is what motivated the use of SaP::GPU here. The multibody system dynamics solution calls for performing two tasks at each integration time step: (i) formulating the system EOM, followed by (ii) numerically solving the resulting differential-algebraic equations (DAE). The formulation of the system EOM was introduced in detail in Serban et al. (2015), where the Absolute Nodal Coordinate Formulation (ANCF) approach (Berzeri et al., 2001; von Dombrowski, 2002) was used to formulate the EOM for compliant systems with beam elements. Here we focus on how the resulting DAEs are solved. Serban et al. (2015) uses the Newmark integration scheme (Newmark, 1959), originally proposed in the structural dynamics community for the numerical integration of a linear set of second-order ODEs, and obtains the nonlinear set of equations (8.1) and (8.2):

(M q̈)_{n+1} + (Φ_q^T λ)_{n+1} + (Q^s − Q^a)_{n+1} = 0 ,    (8.1)

Φ(q_{n+1}, t_{n+1}) = 0 .    (8.2)

Using the Newmark integration formulas, the position and velocity at the new time step t_{n+1} are obtained in terms of the acceleration q̈_{n+1} as

q_{n+1} = q_n + h q̇_n + (h²/2) [(1 − 2β) q̈_n + 2β q̈_{n+1}] ,    (8.3a)

q̇_{n+1} = q̇_n + h [(1 − γ) q̈_n + γ q̈_{n+1}] ,    (8.3b)

where h is the integration step size, while γ ≥ 1/2 and β ≥ (γ + 1/2)²/4 are two user-selected parameters that control the amount of numerical damping associated with the integration formula. At each time step t_{n+1}, the numerical solution begins by solving the nonlinear set of equations (8.1) and (8.2) for the unknowns q̈_{n+1} and λ_{n+1}, using Eqs. (8.3) to eliminate the generalized positions and velocities. This is made clear by rewriting the nonlinear system as Ψ(q̈_{n+1}, λ_{n+1}) = 0_{p+m}, where

Ψ(q̈_{n+1}, λ_{n+1}) ≡ [ (M q̈)_{n+1} + Φ_q^T λ_{n+1} + (Q^s − Q^a)_{n+1} ,  Φ(q_{n+1}, t_{n+1}) / (β h²) ]^T .    (8.4)

The kinematic constraint equations above were scaled to improve the condition number of the Jacobian matrix (Negrut et al., 2007; Bottasso et al., 2007; Negrut et al., 2009). The nonlinear algebraic system in Eq. (8.4) is solved numerically using a Newton-type iterative algorithm. The required Jacobian J_{n+1} = [∂Ψ/∂q̈_{n+1} , ∂Ψ/∂λ_{n+1}] is relatively straightforward to obtain; the most computationally intensive term is the evaluation of the partial derivative of the internal forces with respect to the generalized coordinates, i.e., the stiffness matrix K_s ≡ ∂Q^s/∂e.
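As an aside, the Newmark update in Eqs. (8.3) maps directly to code. The sketch below is illustrative only; it is not Chrono or SaP::GPU code and treats the generalized coordinates as plain host-side vectors.

    #include <vector>

    // Newmark update, Eqs. (8.3): given the state at t_n and the new acceleration
    // qdd_new = q̈_{n+1}, compute q_{n+1} and q̇_{n+1} in place.
    void newmarkUpdate(std::vector<double>& q, std::vector<double>& qd,
                       const std::vector<double>& qdd_old,
                       const std::vector<double>& qdd_new,
                       double h, double beta, double gamma) {
        for (std::size_t i = 0; i < q.size(); ++i) {
            // Eq. (8.3a): uses the old velocity qd[i] = q̇_n, so update q first.
            q[i]  += h * qd[i] + 0.5 * h * h * ((1.0 - 2.0 * beta) * qdd_old[i]
                                                + 2.0 * beta * qdd_new[i]);
            // Eq. (8.3b)
            qd[i] += h * ((1.0 - gamma) * qdd_old[i] + gamma * qdd_new[i]);
        }
    }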

Using a Newton-type method, the solution {q̈_{n+1}, λ_{n+1}} of the nonlinear system in Eq. (8.4) is obtained iteratively, requiring successive solutions of linear systems of the form

J^{(k)}_{n+1} δ^{(k)}_{n+1} = −Ψ(q̈^{(k)}_{n+1}, λ^{(k)}_{n+1}) ,    (8.5)

to obtain the correction vector δ^{(k)}_{n+1}, where (k) represents the Newton iteration counter. While this method is known to converge quadratically (Atkinson, 1989), it does so at a significant computational cost related to the Jacobian matrix evaluation and factorization at each iteration. Modified Newton methods alleviate this issue by evaluating and factoring the Jacobian only once per time step and reusing it throughout the nonlinear iterations. Equation (8.5) defines a modified Newton method as long as a direct linear solver is used to determine the correction δ^{(k)}_{n+1}. Solutions that leverage parallel computing often rely on Krylov-subspace methods (Saad, 2003a), in which δ^{(k)}_{n+1} is computed iteratively. The resulting inexact Newton iteration has been shown to be effective in the context of implicit integration of stiff differential equations (Brown et al., 1994). In this case, products of the form J^{(k)}_{n+1} v, for some given vector v, are computed in a matrix-free fashion. The term "inexact" indicates that, unlike when a direct solver is used, δ^{(k)}_{n+1} is computed through an iterative process in an inner loop that can be stopped before δ^{(k)}_{n+1} satisfies Eq. (8.5) exactly. This was the approach embraced in Melanz et al. (2013) using an unpreconditioned BiCGStab linear solver. As discussed in §5.1, Krylov-subspace methods sometimes fail to converge and are almost always inefficient unless an adequate preconditioner is used. A generic linear system Ax = b can be preconditioned on the left as (M^{-1}A)x = M^{-1}b, on the right as (AM^{-1})Mx = b, or on both sides; here we only consider preconditioning on the left. The Krylov-subspace method is then applied to a system with the modified matrix M^{-1}A instead of A. §5 details two preconditioned Krylov-subspace methods: preconditioned CG and preconditioned BiCGStab(ℓ). In this document, we report on a GPU-based general-purpose preconditioner that dovetails with the rest of the equation formulation and solution components of the implementation to produce a GPU-based preconditioned Newton-Krylov solver for flexible multibody dynamics. Preconditioner update strategies. The Jacobian matrix needs to be updated as often as possible to maintain a good convergence rate in the Newton iteration. For the example problems considered in this document, the preconditioner matrix might be updated (a minimal sketch of the corresponding trigger logic follows the list below):

(a) upon starting a new simulation;

(b) when the average number of Krylov iterations per Newton iteration reaches a user-specified threshold;

(c) upon taking more than a user-specified number of integration steps since the last evaluation.
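The sketch below illustrates triggers (b) and (c); it is a minimal illustration with placeholder names and thresholds, not the Chrono or SaP::GPU implementation.

    // Decide whether to re-evaluate and re-factorize the preconditioner before
    // the Newton solve at the current time step (triggers (b) and (c) above).
    struct PrecondUpdatePolicy {
        double maxAvgKrylovIters;    // trigger (b): threshold on Krylov iterations
        int    maxStepsSinceUpdate;  // trigger (c): threshold on elapsed steps
    };

    bool needPrecondUpdate(const PrecondUpdatePolicy& policy,
                           double avgKrylovItersLastStep,
                           int    stepsSinceLastUpdate) {
        return avgKrylovItersLastStep >= policy.maxAvgKrylovIters ||
               stepsSinceLastUpdate   >  policy.maxStepsSinceUpdate;
    }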

These update strategies raise two issues: first, the Jacobian evaluation is costly; second, once a new Jacobian is available, it needs to be factorized, which further increases the computational cost. The Jacobian evaluation/factorization cost shows up as the spikes in Fig. 8.3. Although for the flexible multibody dynamics application considered here the preconditioner updates need not be frequent, i.e., the cost spikes in the plot are not numerous, there might be cases where more frequent updates, for instance once every several time steps, are required. If these updates are not carried out, the advantage of a Newton-Krylov method is offset by poor convergence rates.

Figure 8.1: Dual-GPU method of staggered solve/preconditioner update

To address this potential problem and to reduce overall simulation times, a dual-GPU method of staggered solve/preconditioner update is proposed. Therein, at least two GPUs (denoted D0 and D1) are assumed to be available. In this method, both D0 and D1 maintain a copy of the up-to-date linear operator, for instance the Jacobian matrix in Eq. (8.5). Fig. 8.1 illustrates how the dual-GPU method works. After D0 gets the initial preconditioner ready, it sends a message to the CPU, which prompts the CPU to launch the kernel on D1 to prepare a new preconditioner; D0 then proceeds to solve linear systems. In turn, when D1 completes the factorization of the up-to-date Jacobian, it sends the CPU a swap request, which implies that as soon as D0 finishes its work in the current time step, D0 and D1 should swap their roles; i.e., D1 becomes responsible for solving the Newton system while D0 updates the preconditioner. D1 then solves the linear systems until it receives a request to swap its role with D0's yet again. With this method, at any point in time either D0 or D1 is solving the system (for succinctness, the device solving the system is sometimes called the working device, while the other device is called the preconditioning device) and the time spent updating the preconditioner is completely overlapped with the solution process. In other words, there are no spikes like the ones in Fig. 8.3. Additionally, because the preconditioner is more up-to-date, fewer iterations are needed for convergence, which further reduces the simulation time. This methodology can in principle be extended to more than two GPUs and essentially amounts to obtaining "fresh" factorized Jacobians more often. There is another type of swap request that deals with the case in which the Jacobian matrix experiences significant changes (also displayed in Fig. 8.1), a scenario that renders an already factorized Jacobian stale. In this case, a GPU, say D1, solves the linear system for a while and finds it necessary to update the preconditioner owing to a sudden increase in the number of Krylov iterations. D1 then sends a swap request to the CPU, urging D0 to start solving the system directly once its preconditioner is ready, stops solving linear systems, and immediately starts to update its own preconditioner. Although this should be an infrequent (and undesirable) case, a certain part of the evaluation and factorization costs is still hidden. This implies that even in this undesirable case the method is not worse off, provided the overhead of sending the swap request, which is at most several bytes, is discarded. Two remarks are in order regarding the dual-GPU staggered method and the multi-GPU support of SaP::GPU. First, note that the dual-GPU method discussed in this section operates at the level of the Newton iterations, which is different from the multi-GPU support of SaP::GPU discussed in §6.4. In other words, supporting the dual-GPU staggered method requires no modifications to the algorithms inside SaP::GPU, whereas the multi-GPU support of §6.4, which is independent of any particular application or Newton method, required adding or changing several interfaces inside SaP::GPU. Second, with the current implementation of SaP::GPU it is risky, or at least not beneficial, to combine the multi-GPU version of SaP::GPU with the dual-GPU staggered method. Assume the dual-GPU version of SaP::GPU is used together with the dual-GPU staggered method in a certain application, and assume that at a certain time Solver 0 is approaching the solution of the linear system A0 x0 = b0 allocated on D0 (its preconditioner M0 is ready and allocated on D0 and D2), while the preconditioner M1 of the linear system A1 x1 = b1 allocated on D1 is under preparation by Solver 1. Although Solver 0 keeps track of both the GPU index of the matrix A0 (i.e., 0 in this example) and all GPU indices of the preconditioner M0 (i.e., 0 and 2 in the example), this information is not exposed to Solver 1 or to the user of SaP::GPU for security reasons. Hence, there is no guarantee that M1 is allocated on two GPUs other than D0 and D2, which precludes the performance gain the user might expect.

Simulation environment and method. The open-source software infrastructure that SaP::GPU is integrated with is called Chrono (Mazhar et al., 2013; Project Chrono). Its Chrono::Flex module provides support for various flexible elements, bilateral constraints (joints), and contact with friction. It offers several solution methods for the resulting EOMs and uses GPU acceleration in both the equation formulation and equation solution stages. We compare the SaP::GPU preconditioned Newton-Krylov solution against several non-preconditioned GPU-based solution approaches (BiCGStab, BiCGStab(2), and MINRES).

Comparison of Krylov solvers. We begin with a discussion of the relative performance of various solvers on the 40 × 40 flexible net test problem with a modulus of elasticity E = 2 · 10^7 Pa. We performed 5-second-long simulations with a step size h = 10^{-3} s using the following solvers: unpreconditioned and preconditioned BiCGStab, unpreconditioned and preconditioned BiCGStab(2), and unpreconditioned MINRES.

The BiCGStab solver is included in this study to allow a comparison with the results presented in Melanz et al. (2013), which were obtained using an unpreconditioned BiCGStab solver. In many situations, in particular for the problems considered here, the linear system of Eq. (8.5) is a saddle-point problem¹. We therefore include in this comparison the MINRES method (Paige and Saunders, 1975), which was specifically designed to solve linear systems with symmetric, possibly indefinite matrices. In its standard form, MINRES can only be preconditioned with a positive definite preconditioner matrix. Unfortunately, for a saddle-point problem the resulting SaP::GPU preconditioner is indefinite and thus not appropriate for preconditioning MINRES. While we plan on extending SaP::GPU to support preconditioning of saddle-point problems, here we only consider unpreconditioned MINRES. Results are shown in Fig. 8.2. Taking the preconditioned BiCGStab(2) as the base case, the resulting speedups over the unpreconditioned solvers, which held effectively constant across the entire simulation time range, were as follows: 10.8 over unpreconditioned BiCGStab, 8.3 over unpreconditioned BiCGStab(2), and 3.6 over unpreconditioned MINRES. We note that, although the results in Fig. 8.2 show the preconditioned BiCGStab as the most effective solver for this particular problem, we found that it lacks overall robustness and often displays convergence stagnation (Gutknecht, 1993; Sleijpen and Fokkema, 1993). Therefore, in the subsequent update-strategy study, we restricted ourselves to using only preconditioned BiCGStab(2).

¹This may not be the case in general if, for example, the system includes controls.

[Figure 8.2 plots the execution time per step (s) against simulation time (s) for five solver configurations: BiCGStab − none, BiCGStab2 − none, MINRES − none, BiCGStab2 − SPIKE, and BiCGStab − SPIKE.]

Figure 8.2: Timing comparison for various Krylov solvers, with or without preconditioning. The sharp increases in the curves for the two preconditioned solvers correspond to preconditioner updates which, for this study, were enforced every 0.5 s. [40 × 40 net; E = 2 · 10^7 Pa; h = 10^{-3} s]

Comparison of preconditioner update strategies. We investigate next the effect of different preconditioner update strategies on overall performance. As mentioned earlier in this section, the goal is to strike a balance between the high relative cost of a preconditioner evaluation and the decreased convergence rate caused by using out-of-date Jacobian information. For this study, we use again the 40 × 40 flexible net, simulated over a 5 s interval using a step size h = 10^{-3} s. Since the impact of an out-of-date preconditioner on the convergence rate of the BiCGStab(2) solver increases with problem stiffness, we use here a modulus of elasticity E = 2 · 10^8 Pa. We note that the same trends, albeit less pronounced, are obtained at lower stiffness. Figure 8.3 illustrates the execution time per time step for four different preconditioner update strategies:

1. No update, where the preconditioner was evaluated only at time t = 0 and kept fixed for the entire duration;

2. Uniform update, where a preconditioner update was enforced at uniformly spaced time steps, in this case every 500 steps (i.e., every 0.5 s);

3. Adaptive update, where a preconditioner update was triggered after any integration step that required more than 10 Krylov iterations, cumulative over all Newton iterations at that step;

4. Dual-GPU staggered update, where a preconditioner update was triggered as frequently as possible by the preconditioning device.

Compared to the no-update scenario, employing a uniform update strategy results in an overall speedup of 1.19, the adaptive update results in a speedup of 1.47, and the dual-GPU staggered update results in a speedup of 1.75. We note, however, that the selection of the various algorithmic parameters in the preconditioner update strategy is highly problem dependent and that achieving optimal performance requires some experimentation.

Figure 8.3: Comparison of different preconditioner update strategies for the preconditioned BiCGStab(2) solver. The solid lines in the foreground are filtered values of the raw measured timing data displayed in the background. [40 × 40 net; E = 2 · 10^8 Pa; h = 10^{-3} s]

9 conclusions and future work

This work discusses parallel strategies to (i) solve dense banded linear systems; (ii) solve sparse linear systems; and (iii) perform matrix reorderings for diagonal boosting and bandwidth reduction. The salient feature shared by these strategies is that they are designed to run in parallel on GPU cards. BSD-3 open-source implementations of all these strategies are available at SBEL (a,b) as part of a software package called SaP::GPU. As far as the parallel solution of linear systems is concerned, the strategies discussed are in-core; i.e., there is no host-device (CPU-GPU) memory swapping, which somewhat limits the size of the problems that can presently be solved by SaP::GPU.

Over a broad range of dense matrix sizes and bandwidths, SaP::GPU is likely to run two times faster than Intel's MKL, with even higher speedups relative to LAPACK. This conclusion should be modulated by hardware considerations and also by the observation that the diagonal dominance of the dense banded matrix is a performance factor.

On the sparse linear system side, the most surprising result was the robustness of SaP::GPU. Out of a set of 114 tests, most of them using matrices from the University of Florida sparse matrix collection, SaP::GPU failed only 24 times, of which 22 were "out-of-memory" failures owing to a 12 GB limit on the size of the GPU memory. SaP::GPU was compared against PARDISO, MUMPS, SuperLU, and cuSOLVER; among all these solvers, SaP::GPU proved the most robust for the 114 linear systems considered. In terms of performance, PARDISO was the fastest, followed by MUMPS, SaP::GPU, and SuperLU. Because the diagonal boosting algorithm is effective at increasing the diagonal dominance of a matrix, the straight split-and-parallelize strategy, without the coupling involved in the SPIKE-type strategy or the recursive cyclic strategy, emerged as the solution approach most often adopted by SaP::GPU.

The implementation of SaP::GPU is somewhat peculiar in that the sparse solver builds on top of the dense banded one. The sparse-to-dense transition occurs via two reorderings: one that boosts the diagonal entries and one that reduces the matrix bandwidth. Herein, they were implemented as CPU/GPU hybrid solutions, which were compared against Harwell's implementations and found to be twice as fast for the diagonal boosting reordering and of comparable speed for the bandwidth reduction.

Many issues remain to be investigated at this point. First, given that more than 50% of the time to solution is spent in the iterative solver, it is worth considering the techniques analyzed in Li et al. (2015), which sometimes double the flop rate of sparse matrix-vector multiplication operations upon changing the matrix storage scheme; i.e., moving from CSR to ELL or hybrid. Second, an out-of-core and/or multi-GPU implementation would enable SaP::GPU to handle larger problems while possibly reducing time to solution. Third, the CM bandwidth reduction strategy implemented is dated; spectral and/or hyper-graph partitioning for load balancing should lead to superior splitting of the coefficient matrix. Finally, as it stands, with the exception of parts of the matrix reordering, SaP::GPU is entirely a GPU solution. It would be worth investigating how the CPU could be involved in other phases of the implementation. Such an investigation would be well justified given the imminent tight integration of the CPU and GPU memories.

a namespace index

A.1 Namespace List

Here is a list of all documented namespaces with brief descriptions:

sap    sap is the top-level namespace which contains all SaP functions and types

b class index

B.1 Class Hierarchy

This inheritance list is sorted roughly, but not completely, alphabetically:

sap::Graph< T >::AbsoluteValue< VType >
sap::Graph< T >::AccumulateEdgeWeights
sap::BandedMatrix< Array >
sap::BiCGStabLMonitor< SolverVector >
sap::Graph< T >::ClearValue
sap::Graph< T >::CompareValue< VType >
sap::Graph< T >::Difference
sap::Graph< T >::EdgeLength
sap::Graph< T >::EqualTo< Type >
sap::Graph< T >::Exponential
sap::Graph< T >::GetCount
sap::Graph< T >
sap::Graph< T >::is_not
sap::Graph< T >::Map< Type >
sap::Monitor< SolverVector >
sap::Multiply< T >
sap::MVBanded< Matrix >
sap::Options
sap::Graph< T >::PermutedEdgeLength
sap::Graph< T >::PermuteEdge
sap::Precond< PrecVector >
sap::SegmentedMatrix< Array, MemorySpace >
sap::SmallerThan< T >
sap::Solver< Array, PrecValueType >
sap::SpmvCusp< Matrix >
sap::SpmvCuSparse< Matrix >
sap::Graph< T >::Square
sap::Stats
sap::strided_range< Iterator >::stride_functor
sap::strided_range< Iterator >
sap::system_error
sap::Timer
    sap::CPUTimer
    sap::GPUTimer

c class index

C.1 Class List

Here are the classes, structs, unions and interfaces with brief descriptions:

sap::Graph< T >::AbsoluteValue< VType >
sap::Graph< T >::AccumulateEdgeWeights
sap::BandedMatrix< Array >
sap::BiCGStabLMonitor< SolverVector >
sap::Graph< T >::ClearValue
sap::Graph< T >::CompareValue< VType >
sap::CPUTimer    CPU timer
sap::Graph< T >::Difference
sap::Graph< T >::EdgeLength
sap::Graph< T >::EqualTo< Type >
sap::Graph< T >::Exponential
sap::Graph< T >::GetCount
sap::GPUTimer    GPU timer
sap::Graph< T >
sap::Graph< T >::is_not
sap::Graph< T >::Map< Type >
sap::Monitor< SolverVector >
sap::Multiply< T >
sap::MVBanded< Matrix >
sap::Options    Input solver options
sap::Graph< T >::PermutedEdgeLength
sap::Graph< T >::PermuteEdge
sap::Precond< PrecVector >    SaP preconditioner
sap::SegmentedMatrix< Array, MemorySpace >
sap::SmallerThan< T >
sap::Solver< Array, PrecValueType >    Main SaP::GPU solver
sap::SpmvCusp< Matrix >    Default SPMV functor class
sap::SpmvCuSparse< Matrix >
sap::Graph< T >::Square
sap::Stats    Output solver statistics
sap::strided_range< Iterator >::stride_functor
sap::strided_range< Iterator >
sap::system_error
sap::Timer    Base timer class

d file index

D.1 File List

Here is a list of all documented files with brief descriptions:

sap/banded_matrix.h    Definition of the banded matrix class
sap/bicgstab.h    BiCGStab preconditioned iterative Krylov solver
sap/bicgstab2.h    BiCGStab(L) preconditioned iterative Krylov solver
sap/common.h    Definition of commonly used macros and solver configuration constants
sap/exception.h    Definition of the SaP system_error exception class
sap/graph.h    Definition of an auxiliary class for the Preconditioner class, used for reordering, dropping off elements, and assembling the banded matrix
sap/minres.h    MINRES preconditioned iterative solver
sap/monitor.h    Two classes for monitoring the progress of iterative linear solvers, checking for convergence, and stopping on various error conditions
sap/precond.h    Definition of the SaP preconditioner class
sap/segmented_matrix.h    Definition of a special data structure used in the Block Cyclic Reduction algorithm implementation
sap/solver.h    Definition of the main SaP solver class
sap/spmv.h    Definition of the default Spike SPMV functor class
sap/strided_range.h    Implementation of the strided_range class, which allows non-contiguous access to data in a thrust vector
sap/timer.h    CPU and GPU timer classes

e namespace documentation

E.1 sap Namespace Reference

sap is the top-level namespace which contains all SaP functions and types.

Classes

• class BandedMatrix
• class system_error
• class Graph
• class Monitor
• class BiCGStabLMonitor
• class Precond
  SaP preconditioner.
• struct Multiply
• struct SmallerThan
• class SegmentedMatrix
• struct Options
  Input solver options.
• struct Stats
  Output solver statistics.
• class Solver
  Main SaP::GPU solver.
• class MVBanded
• class SpmvCusp
  Default SPMV functor class.
• class SpmvCuSparse
• class strided_range
• class Timer
  Base timer class.
• class GPUTimer
  GPU timer.
• class CPUTimer
  CPU timer.

Enumerations

• enum KrylovSolverType { BiCGStab_C, GMRES_C, CG_C, CR_C, BiCGStab1, BiCGStab2, BiCGStab, MINRES } • enum FactorizationMethod { LU_UL, LU_only } • enum PreconditionerType { Spike, Block, None }

Functions

• template<...> void bicgstab (LinearOperator &A, Vector &x, Vector &b, Monitor &monitor, Preconditioner &M)
  Preconditioned BiCGStab Krylov method.
• template<...> void bicgstabl (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Preconditioned BiCGStab(L) Krylov method.
• template<...> void bicgstab1 (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Specialization of the generic sap::bicgstabl function for L=1.
• template<...> void bicgstab2 (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Specialization of the generic sap::bicgstabl function for L=2.
• void kernelConfigAdjust (int &numThreads, int &numBlockX, const int numThreadsMax)
• void kernelConfigAdjust (int &numThreads, int &numBlockX, int &numBlockY, const int numThreadsMax, const int numBlockXMax)
• template<...> void minres (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Preconditioned MINRES method.

Variables

• const unsigned int BLOCK_SIZE = 512 • const unsigned int MAX_GRID_DIMENSION = 32768 • const unsigned int CRITICAL_THRESHOLD = 70

E.1.1 Detailed Description

sap is the top-level namespace which contains all SaP functions and types.

E.1.2 Enumeration Type Documentation

E.1.2.1 enum sap::KrylovSolverType

This defines the types of Krylov subspace methods used in SaP.

E.1.2.2 enum sap::PreconditionerType

This defines the types of SaP preconditioners.

E.1.3 Function Documentation

E.1.3.1 template<...> void sap::bicgstab ( LinearOperator &A, Vector &x, Vector &b, Monitor &monitor, Preconditioner &M )

Preconditioned BiCGStab Krylov method.

Template Parameters

LinearOperator is a functor class for sparse matrix-vector product. Vector is the vector type for the linear system solution. Monitor is the convergence test object. Preconditioner is the preconditioner.

E.1.3.2 template<...> void sap::bicgstabl ( LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P )

Preconditioned BiCGStab(L) Krylov method.

Template Parameters

LinearOperator is a functor class for sparse matrix-vector product. Vector is the vector type for the linear system solution. Monitor is the convergence test object. Preconditioner is the preconditioner. L is the degree of the BiCGStab(L) method.

E.1.3.3 template<...> void sap::minres ( LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P )

Preconditioned MINRES method.

Template Parameters

LinearOperator is a functor class for sparse matrix-vector product. Vector is the vector type for the linear system solution. Monitor is the convergence test object. Preconditioner is the preconditioner (must be positive definite).

f class documentation

F.1 sap::Graph< T >::AbsoluteValue< VType > Struct Template Reference

Public Member Functions

• __host__ __device__ VType operator() (VType a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.2 sap::Graph< T >::AccumulateEdgeWeights Struct Reference

Public Member Functions

• __host__ __device__ T operator() (WeightedEdgeType a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.3 sap::BandedMatrix< Array > Class Template Reference

#include <sap/banded_matrix.h>

Public Types

• typedef Array::value_type value_type • typedef Array::memory_space memory_space • typedef int index_type

Public Member Functions

• BandedMatrix (int n, int k, double d) • const Vector & getBandedMatrix () const

Public Attributes

• int m_n • int m_k • int num_rows • int num_cols • size_t num_entries

F.3.1 Detailed Description

template<typename Array> class sap::BandedMatrix< Array >

This class defines the banded matrix.

Template Parameters

Array is the array type used to store the matrix (in most cases the same as the solver's array type).

F.3.2 Constructor & Destructor Documentation

F.3.2.1 template<typename Array> sap::BandedMatrix< Array >::BandedMatrix ( int n, int k, double d )

This is the constructor for the BandedMatrix class. It specifies the dimension N, the half-bandwidth K and the diagonal dominance d of the banded matrix. Then it randomly generates a banded matrix with parameters (N, K, d). The documentation for this class was generated from the following file:

• sap/banded_matrix.h
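A minimal usage sketch follows; it is not part of the generated documentation, and the cusp array type is an assumption consistent with the typedefs used by sap::Graph.

    #include <cusp/array1d.h>
    #include <sap/banded_matrix.h>

    void bandedMatrixExample() {
        typedef cusp::array1d<double, cusp::device_memory> Array;
        // Randomly generated banded matrix with N = 10000, half-bandwidth K = 50,
        // and diagonal dominance d = 1.2.
        sap::BandedMatrix<Array> A(10000, 50, 1.2);
        (void)A;
    }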

F.4 sap::BiCGStabLMonitor< SolverVector > Class Template Reference

#include <sap/monitor.h>

Public Types

• typedef SolverVector::value_type SolverValueType

Public Member Functions

• BiCGStabLMonitor (const int maxIterations, const int maxmSteps, const SolverValueType relTol, const SolverValueType absTol=SolverValueType(0)) • virtual void init (const SolverVector &rhs) • virtual bool finished (const SolverVector &r) • virtual bool finished (SolverValueType rNorm) • virtual bool finished () • virtual bool needCheckConvergence (const SolverVector &r) • virtual bool needCheckConvergence (SolverValueType rNorm) 150

• virtual void stop (int code, const char ∗message) • virtual void increment (float incr) • virtual void incrementStag () • virtual void resetStag () • virtual void updateResidual (SolverValueType r) • virtual BiCGStabLMonitor < SolverVector > & operator++ () • virtual int getMaxIterations () const • virtual size_t iteration_limit () const • virtual SolverValueType getRelTolerance () const • virtual SolverValueType getAbsTolerance () const • virtual SolverValueType getTolerance () const • virtual SolverValueType getRHSNorm () const • virtual bool converged () const • virtual size_t iteration_count () const • virtual float getNumIterations () const • virtual int getCode () const • virtual const std::string & getMessage () const • virtual SolverValueType getResidualNorm () const • virtual SolverValueType getRelResidualNorm () const

F.4.1 Detailed Description

template<typename SolverVector> class sap::BiCGStabLMonitor< SolverVector >

This class provides support for monitoring the progress of BiCGStab(L) solvers, checking for convergence, and stopping on various error conditions. The documentation for this class was generated from the following file:

• sap/monitor.h 151

F.5 sap::Graph< T >::ClearValue Struct Reference

Public Member Functions

• __host__ __device__ double operator() (double a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.6 sap::Graph< T >::CompareValue< VType > Struct Template Reference

Public Member Functions

• bool operator() (const thrust::tuple< int, VType > &a, const thrust::tuple< int, VType > &b) const

The documentation for this struct was generated from the following file:

• sap/graph.h

F.7 sap::CPUTimer Class Reference

CPU timer.

#include <sap/timer.h>

Inheritance diagram for sap::CPUTimer: sap::CPUTimer derives from sap::Timer.

Public Member Functions

• virtual void Start () • virtual void Stop () • virtual double getElapsed ()

Protected Attributes

• timeval timeStart • timeval timeEnd

F.7.1 Detailed Description

CPU timer. CPU timer using the performance counter for WIN32 and gettimeofday() for Linux. The documentation for this class was generated from the following file:

• sap/timer.h
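A minimal usage sketch follows; it is not part of the generated documentation, and the units returned by getElapsed() are an assumption (milliseconds) since they are not specified here.

    #include <sap/timer.h>

    void timedSection() {
        sap::CPUTimer timer;
        timer.Start();
        // ... work to be timed ...
        timer.Stop();
        double elapsed = timer.getElapsed();  // elapsed wall-clock time
        (void)elapsed;
    }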

F.8 sap::Graph< T >::Difference Struct Reference

Public Member Functions

• __host__ __device__ int operator() (const int &a, const int &b) const 153

The documentation for this struct was generated from the following file:

• sap/graph.h

F.9 sap::Graph< T >::EdgeLength Struct Reference

Public Member Functions

• __host__ __device__ int operator() (NodeType a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.10 sap::Graph< T >::EqualTo< Type > Struct Template Reference

Public Member Functions

• EqualTo (Type l) • __host__ __device__ bool operator() (const Type &a)

Public Attributes

• Type m_local

The documentation for this struct was generated from the following file:

• sap/graph.h 154

F.11 sap::Graph< T >::Exponential Struct Reference

Public Member Functions

• __host__ __device__ double operator() (double a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.12 sap::Graph< T >::GetCount Struct Reference

Public Member Functions

• __host__ __device__ int operator() (int a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.13 sap::GPUTimer Class Reference

GPU timer.

#include <sap/timer.h>

Inheritance diagram for sap::GPUTimer: sap::GPUTimer derives from sap::Timer.

Public Member Functions

• GPUTimer (int g_idx=0) • virtual void Start () • virtual void Stop () • virtual double getElapsed ()

Protected Attributes

• int gpu_idx • cudaEvent_t timeStart • cudaEvent_t timeEnd

F.13.1 Detailed Description

GPU timer. CUDA-based GPU timer. The documentation for this class was generated from the following file:

• sap/timer.h

F.14 sap::Graph< T > Class Template Reference

#include <sap/graph.h>

Classes

• struct AbsoluteValue • struct AccumulateEdgeWeights • struct ClearValue • struct CompareValue • struct Difference • struct EdgeLength • struct EqualTo • struct Exponential • struct GetCount • struct is_not • struct Map • struct PermutedEdgeLength • struct PermuteEdge • struct Square

Public Types

• typedef cusp::coo_matrix< int, T, cusp::host_memory > MatrixCoo • typedef cusp::csr_matrix< int, T, cusp::host_memory > MatrixCsr • typedef cusp::csr_matrix< int, T, cusp::device_memory > MatrixCsrD • typedef cusp::array1d< T, cusp::host_memory > Vector • typedef cusp::array1d< T, cusp::device_memory > VectorD 157

• typedef cusp::array1d< double, cusp::host_memory > DoubleVector • typedef cusp::array1d< int, cusp::host_memory > IntVector • typedef cusp::array1d< int, cusp::device_memory > IntVectorD • typedef cusp::array1d< double, cusp::device_memory > DoubleVectorD • typedef cusp::array1d< bool, cusp::host_memory > BoolVector • typedef cusp::array1d< bool, cusp::device_memory > BoolVectorD • typedef Vector MatrixMapF • typedef IntVector MatrixMap • typedef IntVector::iterator IntIterator • typedef Vector::iterator TIterator • typedef IntVector::reverse_iterator IntRevIterator • typedef Vector::reverse_iterator TRevIterator • typedef thrust::tuple < IntIterator, IntIterator > IntIteratorTuple • typedef thrust::tuple < IntIterator, IntIterator, TIterator > IteratorTuple • typedef thrust::tuple < IntRevIterator, IntRevIterator, TRevIterator > RevIteratorTuple • typedef thrust::zip_iterator < IntIteratorTuple > EdgeIterator • typedef thrust::zip_iterator < IteratorTuple > WeightedEdgeIterator 158

• typedef thrust::zip_iterator < RevIteratorTuple > WeightedEdgeRevIterator • typedef thrust::tuple< int, int > NodeType • typedef thrust::tuple< int, int, T > WeightedEdgeType • typedef thrust::tuple< int, double > Dijkstra

Public Member Functions

• Graph (bool trackReordering=false) • double getTimeDB () const • double getTimeDBPre () const • double getTimeDBFirst () const • double getTimeDBSecond () const • double getTimeDBPost () const • double getTimeRCM () const • double getTimeDropoff () const • double getDiagDominance (bool before_db=false) const • double getDP1 (bool before_db=false) const • int reorder (const MatrixCsr &Acsr, bool testDB, bool doDB, bool dbFirst- StageOnly, bool scale, bool doRCM, bool doSloan, IntVector &optReordering, IntVector &optPerm, IntVectorD &d_dbRowPerm, VectorD &d_dbRowScale, VectorD &d_dbColScale, MatrixMapF &scaleMap, int &k_db) • int dropOff (T frac, int maxBandwidth, T &frac_actual) • void assembleOffDiagMatrices (int bandwidth, int numPartitions, Vector &WV_host, Vector &offDiags_host, IntVector &offDiagWidths_left, IntVector &offDiagWidths_right, IntVector &offDiagPerms_left, IntVector &offDiag- Perms_right, MatrixMap &typeMap, MatrixMap &offDiagMap, MatrixMap &WVMap) 159

• void secondLevelReordering (int bandwidth, int numPartitions, IntVector &secondReorder, IntVector &secondPerm, IntVector &first_rows) • void assembleBandedMatrix (int bandwidth, IntVector &ks_col, IntVector &ks_row, MatrixCoo &Acoo, MatrixMap &typeMap, MatrixMap &banded- MatMap) • void assembleBandedMatrix (int bandwidth, bool saveMem, int numPartitions, IntVector &ks_col, IntVector &ks_row, MatrixCoo &Acoo, IntVector &ks, Int- Vector &BOffsets, MatrixMap &typeMap, MatrixMap &bandedMatMap) • void get_csr_matrix (MatrixCsr &Acsr, int numPartitions) • void unorderedBFS (bool doRCM, bool doSloan, IntVector &tmp_reordering, IntVector &row_offsets, IntVector &column_indices, IntVector &visited, Int- Vector &levels, IntVector &ori_degrees, BoolVector &tried) • void unorderedBFSIteration (int width, int start_idx, int end_idx, Int- Vector &tmp_reordering, IntVector &levels, IntVector &visited, IntVector &row_offsets, IntVector &column_indices, IntVector &ori_degrees, BoolVector &tried, IntVector &costs, IntVector &ori_costs, StatusVector &status, int &next_level)

F.14.1 Detailed Description

template<typename T> class sap::Graph< T >

This class is an auxiliary class of the SaP preconditioner class. It performs the reordering, the drop-off, and the assembly of the banded matrix.

Template Parameters

T is the preconditioner’s floating point type.

The documentation for this class was generated from the following file:

• sap/graph.h 160

F.15 sap::Graph< T >::is_not Struct Reference

Public Member Functions

• __host__ __device__ bool operator() (bool x)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.16 sap::Graph< T >::Map< Type > Struct Template Reference

Public Member Functions

• Map (Type ∗arg_map) • __host__ __device__ Type operator() (int a)

Public Attributes

• Type ∗ m_map

The documentation for this struct was generated from the following file:

• sap/graph.h

F.17 sap::Monitor< SolverVector > Class Template Reference

#include <sap/monitor.h>

Public Types

• typedef SolverVector::value_type SolverValueType

Public Member Functions

• Monitor (const int maxIterations, const SolverValueType relTol, const Solver- ValueType absTol=SolverValueType(0)) • virtual void init (const SolverVector &rhs) • virtual bool finished (const SolverVector &r) • virtual bool finished (SolverValueType rNorm) • virtual void stop (int code, const char ∗message) • virtual void increment (float incr) • virtual Monitor< SolverVector > & operator++ () • virtual int getMaxIterations () const • virtual size_t iteration_limit () const • virtual SolverValueType getRelTolerance () const • virtual SolverValueType getAbsTolerance () const • virtual SolverValueType getTolerance () const • virtual SolverValueType getRHSNorm () const • virtual bool converged () const • virtual size_t iteration_count () const • virtual float getNumIterations () const • virtual int getCode () const • virtual const std::string & getMessage () const • virtual SolverValueType getResidualNorm () const • virtual SolverValueType getRelResidualNorm () const 162

F.17.1 Detailed Description

template<typename SolverVector> class sap::Monitor< SolverVector >

This class provides support for monitoring the progress of iterative linear solvers, checking for convergence, and stopping on various error conditions.

The documentation for this class was generated from the following file:

• sap/monitor.h
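
The following is a minimal sketch, based only on the member functions listed above, of driving a sap::Monitor by hand. In practice the Krylov routines in sap/bicgstab2.h and related headers perform these calls internally; the residual update marked by a comment is a placeholder for an actual iteration.

    #include <sap/monitor.h>
    #include <cusp/array1d.h>

    typedef cusp::array1d<double, cusp::device_memory> SolverVector;

    // Returns true if the (placeholder) iteration converged within the limits.
    bool monitor_sketch(const SolverVector& b, SolverVector& r)
    {
        sap::Monitor<SolverVector> monitor(100 /*maxIterations*/, 1e-6 /*relTol*/);
        monitor.init(b);   // record the RHS norm used by the relative test

        for (int it = 0; it < monitor.getMaxIterations() && !monitor.finished(r); it++) {
            // ... one sweep of a Krylov method would update the residual r here ...
            ++monitor;     // count the iteration
        }
        return monitor.converged();
    }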

F.18 sap::Multiply< T > Struct Template Reference

Public Member Functions

• __host__ __device__ T operator() (thrust::tuple< T, T > tu)

The documentation for this struct was generated from the following file:

• sap/precond.h

F.19 sap::MVBanded< Matrix > Class Template Reference

#include <sap/spmv.h>

Public Types

• typedef cusp::linear_operator< typename Matrix::value_type, typename Matrix::memory_space, typename Matrix::index_type > Parent
• typedef Matrix::value_type ValueType

Public Member Functions

• MVBanded (Matrix &A)
• template<typename Array> void operator() (const Array &v, Array &Av)
• double getTime () const
  Cumulative time for all SPMV calls (ms).
• double getCount () const
  Total number of calls to the SPMV functor.
• double getGFlops () const
  Average GFLOP/s over all SPMV calls.

F.19.1 Detailed Description

template<typename Matrix> class sap::MVBanded< Matrix >

This class implements the MV functor for banded matrix-vector product.

Template Parameters

Matrix is the type of the banded matrix.

The documentation for this class was generated from the following file:

• sap/spmv.h

F.20 sap::Options Struct Reference

Input solver options.

#include <sap/solver.h>

Public Member Functions

• Options ()

Public Attributes

• KrylovSolverType solverType
• int maxNumIterations
• double relTol
• double absTol
• bool testDB
• bool isSPD
• bool saveMem
• bool performReorder
• bool performDB
• bool dbFirstStageOnly
• bool applyScaling
• int maxBandwidth
• int gpuCount
• double dropOffFraction
• FactorizationMethod factMethod
• PreconditionerType precondType
• bool safeFactorization
• bool variableBandwidth

• bool trackReordering
• bool useBCR
• int ilu_level

F.20.1 Detailed Description

Input solver options. This structure encapsulates all solver options and specifies the methods and parameters used in the iterative solver and the preconditioner.

F.20.2 Constructor & Destructor Documentation

F.20.2.1 sap::Options::Options ( ) [inline]

This is the constructor for the Options class. It sets default values for all options.

F.20.3 Member Data Documentation

F.20.3.1 double sap::Options::absTol

Absolute tolerance; default: 0

F.20.3.2 bool sap::Options::applyScaling

Apply DB scaling? default: true

F.20.3.3 bool sap::Options::dbFirstStageOnly

Perform only the first stage of DB? default: false

F.20.3.4 double sap::Options::dropOffFraction

Maximum fraction of the element-wise matrix 1-norm that can be dropped off; default: 0

F.20.3.5 FactorizationMethod sap::Options::factMethod

Diagonal block factorization method; default: LU_only

F.20.3.6 int sap::Options::gpuCount

Number of GPUs to be used; default: 1

F.20.3.7 int sap::Options::ilu_level

Level of ILU factorization; a negative value means complete LU is applied; default: -1

F.20.3.8 bool sap::Options::isSPD

Indicate whether the matrix is symmetric positive definite; default: false

F.20.3.9 int sap::Options::maxBandwidth

Maximum half-bandwidth; default: INT_MAX

F.20.3.10 int sap::Options::maxNumIterations

Maximum number of iterations; default: 100

F.20.3.11 bool sap::Options::performDB

Perform DB reordering? default: true

F.20.3.12 bool sap::Options::performReorder

Perform matrix reorderings? default: true

F.20.3.13 PreconditionerType sap::Options::precondType

Preconditioner type; default: Spike

F.20.3.14 double sap::Options::relTol

Relative tolerance; default: 1e-6

F.20.3.15 bool sap::Options::safeFactorization

Use safe factorization (diagonal boosting)? default: false

F.20.3.16 bool sap::Options::saveMem

(For SPD matrices only) Indicate whether to use the memory-saving but slower mode; default: false

F.20.3.17 KrylovSolverType sap::Options::solverType

Krylov method to use; default: BiCGStab2

F.20.3.18 bool sap::Options::testDB

Indicate that we are running the test for DB

F.20.3.19 bool sap::Options::trackReordering

Keep track of the reordering information? default: false

F.20.3.20 bool sap::Options::variableBandwidth

Allow variable partition bandwidths? default: true

The documentation for this struct was generated from the following file:

• sap/solver.h
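
As a small configuration sketch, assuming only the fields and defaults documented above, an Options object can be default-constructed and then selectively overridden before being passed to the solver:

    #include <sap/solver.h>

    sap::Options make_options()
    {
        sap::Options opts;                      // constructor fills in all documented defaults
        opts.solverType       = sap::BiCGStab2; // Krylov method (also the default)
        opts.precondType      = sap::Spike;     // SaP/Spike preconditioner (also the default)
        opts.relTol           = 1e-10;          // tighten the relative tolerance
        opts.maxNumIterations = 500;            // allow more Krylov iterations
        opts.dropOffFraction  = 0.01;           // drop up to 1% of the element-wise 1-norm
        return opts;
    }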

F.21 sap::Graph< T >::PermutedEdgeLength Struct Reference

Public Member Functions

• PermutedEdgeLength (int ∗perm_array)
• __host__ __device__ int operator() (NodeType a)

Public Attributes

• int ∗ m_perm_array

The documentation for this struct was generated from the following file:

• sap/graph.h

F.22 sap::Graph< T >::PermuteEdge Struct Reference

Public Member Functions

• PermuteEdge (int ∗perm_array)
• __host__ __device__ NodeType operator() (NodeType a)

Public Attributes

• int ∗ m_perm_array

The documentation for this struct was generated from the following file:

• sap/graph.h

F.23 sap::Precond< PrecVector > Class Template Reference

SaP preconditioner.

#include <sap/precond.h>

Public Types

• typedef PrecVector::memory_space MemorySpace
• typedef PrecVector::memory_space memory_space
• typedef PrecVector::value_type PrecValueType
• typedef PrecVector::value_type value_type
• typedef PrecVector::iterator PrecVectorIterator
• typedef cusp::unknown_format format
• typedef cusp::array1d< int, MemorySpace > IntVector
• typedef IntVector MatrixMap
• typedef cusp::array1d< PrecValueType, MemorySpace > MatrixMapF
• typedef cusp::array1d< PrecValueType, cusp::host_memory > PrecVectorH
• typedef cusp::array1d< int, cusp::host_memory > IntVectorH
• typedef cusp::array1d< bool, cusp::host_memory > BoolVectorH
• typedef IntVectorH MatrixMapH
• typedef cusp::array1d< PrecValueType, cusp::host_memory > MatrixMapFH

• typedef IntVectorH::iterator IntHIterator
• typedef PrecVectorH::iterator PrecHIterator
• typedef cusp::coo_matrix< int, PrecValueType, MemorySpace > PrecMatrixCoo
• typedef cusp::coo_matrix< int, PrecValueType, cusp::host_memory > PrecMatrixCooH
• typedef cusp::csr_matrix< int, PrecValueType, MemorySpace > PrecMatrixCsr
• typedef cusp::csr_matrix< int, double, MemorySpace > DoubleMatrixCsr
• typedef cusp::csr_matrix< int, PrecValueType, cusp::host_memory > PrecMatrixCsrH

Public Member Functions

• Precond ()
• Precond (int numPart, bool isSPD, bool saveMem, bool reorder, bool testDB, bool doDB, bool dbFirstStageOnly, bool scale, double dropOff_frac, int maxBandwidth, int gpuCount, FactorizationMethod factMethod, PreconditionerType precondType, bool safeFactorization, bool variableBandwidth, bool trackReordering, bool use_bcr, int ilu_level, PrecValueType tolerance)
• Precond (const Precond &prec)
• Precond & operator= (const Precond &prec)
• double getTimeDB () const
• double getTimeDBPre () const
• double getTimeDBFirst () const
• double getTimeDBSecond () const
• double getTimeDBPost () const

• double getDiagDominance (bool before_db=false) const
• double getDP1 (bool before_db=false) const
• double getTimeReorder () const
• double getTimeDropOff () const
• double getTimeCPUAssemble () const
• double getTimeTransfer () const
• double getTimeToBanded () const
• double getTimeCopyOffDiags () const
• double getTimeBandLU () const
• double getTimeBandUL () const
• double gettimeAssembly () const
• double getTimeFullLU () const
• double getTimeShuffle () const
• double getTimeBCRLU () const
• double getTimeBCRSweepDeflation () const
• double getTimeBCRMatMulDeflation () const
• double getTimeBCRSweepInflation () const
• double getTimeBCRMVInflation () const
• int getBandwidthReordering () const
• int getBandwidthDB () const
• int getBandwidth () const
• int getNumPartitions () const
• double getActualDropOff () const
• int getActualNumNonZeros () const
• template<typename Matrix> void setup (const Matrix &A)
• void update (const PrecVector &entries)
  This function updates the banded matrix and off-diagonal matrices based on the given entries.

• void solve (PrecVector &v, PrecVector &z)
• template<typename SolverVector> void operator() (const SolverVector &v, SolverVector &z)

Public Attributes

• IntVectorH m_ks_row_host
• IntVectorH m_ks_col_host

F.23.1 Detailed Description

template<typename PrecVector> class sap::Precond< PrecVector >

SaP preconditioner. This class implements the SaP preconditioner.

Template Parameters

PrecVector is the vector type used in the preconditioner (its underlying value type defines the precision of the preconditioner).

F.23.2 Constructor & Destructor Documentation

F.23.2.1 template<typename PrecVector> sap::Precond< PrecVector >::Precond ( )

This is the default constructor for the Precond class.

F.23.2.2 template<typename PrecVector> sap::Precond< PrecVector >::Precond ( int numPart, bool isSPD, bool saveMem, bool reorder, bool testDB, bool doDB, bool dbFirstStageOnly, bool scale, double dropOff_frac, int maxBandwidth, int gpuCount, FactorizationMethod factMethod, PreconditionerType precondType, bool safeFactorization, bool variableBandwidth, bool trackReordering, bool use_bcr, int ilu_level, PrecValueType tolerance )

This is the constructor for the Precond class.

F.23.2.3 template<typename PrecVector> sap::Precond< PrecVector >::Precond ( const Precond< PrecVector > & prec )

This is the copy constructor for the Precond class.

F.23.3 Member Function Documentation

F.23.3.1 template<typename PrecVector> template<typename SolverVector> void sap::Precond< PrecVector >::operator() ( const SolverVector & v, SolverVector & z )

This is the wrapper around the preconditioner solve function. It is invoked by the Krylov solvers through the cusp::multiply() function. Note that this operator() is templatized by the SolverVector type to implement mixed-precision.

F.23.3.2 template<typename PrecVector> template<typename Matrix> void sap::Precond< PrecVector >::setup ( const Matrix & A )

This function performs the initial preconditioner setup, based on the specified matrix: (1) reorder the matrix (DB and/or RCM); (2) element drop-off (optional); (3) LU factorization of the diagonal blocks; (4) assembly of the reduced matrix.

F.23.3.3 template<typename PrecVector> void sap::Precond< PrecVector >::solve ( PrecVector & v, PrecVector & z )

This function solves the system Mz=v, for a specified vector v, where M is the implicitly defined preconditioner matrix.

F.23.3.4 template<typename PrecVector> void sap::Precond< PrecVector >::update ( const PrecVector & entries )

This function updates the banded matrix and off-diagonal matrices based on the given entries. It assumes that many systems with exactly the same sparsity pattern are to be solved: once the first system has been solved, there is no need to repeat the permutation and scaling steps. Instead, the mapping from the sparse matrix to the banded matrices is kept and the latter are updated directly. This function can only be called after the solver has solved at least one system and if the mapping was tracked during setup; otherwise an error is reported.

The documentation for this class was generated from the following file:

• sap/precond.h
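
The snippet below is a sketch of the documented setup/solve/update call sequence. The constructor arguments mirror the signature in F.23.2.2, the chosen values are only illustrative, and in typical use sap::Solver constructs the preconditioner internally from a sap::Options object; the qualification of the enumerators follows their listing for sap/common.h in Section G.4.

    #include <sap/precond.h>
    #include <cusp/array1d.h>
    #include <cusp/csr_matrix.h>
    #include <climits>

    typedef cusp::array1d<double, cusp::device_memory>         PrecVector;
    typedef cusp::csr_matrix<int, double, cusp::device_memory> MatrixCsr;

    void precond_sketch(const MatrixCsr& A, const PrecVector& v, PrecVector& z,
                        const PrecVector& newEntries)
    {
        sap::Precond<PrecVector> M(
            8,             // numPart: number of partitions
            false,         // isSPD
            false,         // saveMem
            true,          // reorder
            false,         // testDB
            true,          // doDB
            false,         // dbFirstStageOnly
            true,          // scale
            0.0,           // dropOff_frac
            INT_MAX,       // maxBandwidth
            1,             // gpuCount
            LU_only,       // factMethod
            sap::Spike,    // precondType
            false,         // safeFactorization
            true,          // variableBandwidth
            true,          // trackReordering (required for update())
            false,         // use_bcr
            -1,            // ilu_level: complete LU
            1e-6);         // tolerance

        M.setup(A);        // reorder, drop off, factorize, form the reduced system
        M.solve(v, z);     // apply the preconditioner, i.e. solve M z = v

        // Same sparsity pattern, new values: refresh the factorization data directly.
        M.update(newEntries);
    }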

F.24 sap::SegmentedMatrix< Array, MemorySpace > Class Template Reference

Public Types

• typedef Array::value_type T
• typedef Array::value_type value_type
• typedef cusp::array1d< T, MemorySpace > Vector
• typedef cusp::array1d< T, cusp::host_memory > VectorH

• typedef cusp::array1d< int, MemorySpace > IntVector
• typedef cusp::array1d< int, cusp::host_memory > IntVectorH

Public Member Functions

• template<typename IntArray> void init (int n, int num_partitions, const IntArray &ks_per_parttion, bool upper)
• template<typename IntArray> void copy (const Array &A, const IntArray &A_offsets)
• void sweepStride (const Array &A, int k)
• bool multiply_stride (SegmentedMatrix< Array, MemorySpace > &result_matrix) const
• bool multiply_stride (const SegmentedMatrix< Array, MemorySpace > &mat_a, const SegmentedMatrix< Array, MemorySpace > &mat_b)
• void update_banded (Array &A, int k) const

Public Attributes

• bool m_upper
• int m_global_n
• int m_global_num_partitions
• IntVector m_ns_scan
• IntVector m_ks_per_partition
• IntVector m_A_offsets
• IntVector m_A_offsets_of_offsets
• IntVector m_src_offsets
• IntVector m_src_offsets_of_offsets

• IntVectorH m_ns_scan_host
• IntVectorH m_ks_per_partition_host
• IntVectorH m_A_offsets_host
• IntVectorH m_A_offsets_of_offsets_host
• IntVectorH m_src_offsets_of_offsets_host
• IntVectorH m_src_offsets_host
• Vector m_A

The documentation for this class was generated from the following file:

• sap/segmented_matrix.h

F.25 sap::SmallerThan< T > Struct Template Reference

Public Member Functions

• SmallerThan (T threshold)
• __host__ __device__ bool operator() (T val)

Public Attributes

• T m_threshold

The documentation for this struct was generated from the following file:

• sap/precond.h

F.26 sap::Solver< Array, PrecValueType > Class Template Reference

Main SaP::GPU solver.

#include <sap/solver.h>

Public Member Functions

• Solver (int numPartitions, const Options &opts)
  SaP solver constructor.
• template<typename Matrix> bool setup (const Matrix &A)
  Preconditioner setup.
• template<typename Array1> bool update (const Array1 &entries)
  Preconditioner update.
• template<typename SpmvOperator> bool solve (SpmvOperator &spmv, const Array &b, Array &x)
  Linear system solve.
• const Stats & getStats () const
  Extract solver statistics.
• int getMonitorCode () const
• const std::string & getMonitorMessage () const
• const Precond< PrecVector > & getPreconditioner () const

F.26.1 Detailed Description

template<typename Array, typename PrecValueType> class sap::Solver< Array, PrecValueType >

Main SaP::GPU solver. This class is the public interface to the Spike-preconditioned Krylov iterative solver.

Template Parameters

Array is the array type for the linear system solution (both cusp::array1d and cusp::array1d_view are valid).
PrecValueType is the floating point type used in the preconditioner (to support mixed-precision calculations).

F.26.2 Constructor & Destructor Documentation

F.26.2.1 template<typename Array, typename PrecValueType> sap::Solver< Array, PrecValueType >::Solver ( int numPartitions, const Options & opts )

SaP solver constructor. This is the constructor for the Solver class. It specifies the requested number of partitions and the set of solver options.

F.26.3 Member Function Documentation

F.26.3.1 template<typename Array, typename PrecValueType> template<typename Matrix> bool sap::Solver< Array, PrecValueType >::setup ( const Matrix & A )

Preconditioner setup. This function performs the initial setup for the SaP solver. It prepares the preconditioner based on the specified matrix A (which may be the system matrix, or some approximation to it).

Template Parameters

Matrix is the sparse matrix type used in the preconditioner.

F.26.3.2 template<typename Array, typename PrecValueType> template<typename SpmvOperator> bool sap::Solver< Array, PrecValueType >::solve ( SpmvOperator & spmv, const Array & b, Array & x )

Linear system solve. This function solves the system Ax=b, for a given matrix A and right-hand side vector b. An exception is thrown if this call was not preceded by a call to Solver::setup().

Template Parameters

SpmvOperator is a functor class which implements operator() to compute the sparse matrix-vector product. See sap::SpmvCusp for an example.

F.26.3.3 template<typename Array, typename PrecValueType> template<typename Array1> bool sap::Solver< Array, PrecValueType >::update ( const Array1 & entries )

Preconditioner update. This function updates the Spike preconditioner assuming that the reordering information generated when the preconditioner was initially set up is still valid. The diagonal blocks and off-diagonal spike blocks are updated based on the provided matrix non-zero entries.

An exception is thrown if this call was not preceded by a call to Solver::setup() or if reordering tracking was not enabled through the solver options.

Template Parameters

Array1 is the vector type for the non-zero entries of the updated matrix (both cusp::array1d and cusp::array1d_view are allowed).

The documentation for this class was generated from the following file:

• sap/solver.h
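
Putting the pieces together, the following sketch shows the documented setup/solve workflow; the matrix file name and partition count are illustrative, and CUSP, Thrust, and a CUDA-capable device are assumed to be available.

    #include <sap/solver.h>
    #include <sap/spmv.h>
    #include <cusp/csr_matrix.h>
    #include <cusp/array1d.h>
    #include <cusp/io/matrix_market.h>
    #include <iostream>

    typedef double                                            Real;
    typedef cusp::csr_matrix<int, Real, cusp::device_memory>  Matrix;
    typedef cusp::array1d<Real, cusp::device_memory>          Vector;

    int main()
    {
        Matrix A;
        cusp::io::read_matrix_market_file(A, "matrix.mtx");  // illustrative input file

        Vector b(A.num_rows, Real(1));                        // right-hand side
        Vector x(A.num_rows, Real(0));                        // solution vector

        sap::Options opts;                                    // documented defaults
        sap::Solver<Vector, Real> solver(8 /*partitions*/, opts);
        sap::SpmvCusp<Matrix> spmv(A);                        // SPMV functor used by the Krylov loop

        solver.setup(A);                                      // build the SaP preconditioner
        bool converged = solver.solve(spmv, b, x);            // solve A x = b

        const sap::Stats& stats = solver.getStats();
        std::cout << (converged ? "converged" : "failed")
                  << " in " << stats.numIterations << " iterations; relative residual "
                  << stats.relResidualNorm << std::endl;
        return converged ? 0 : 1;
    }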

F.27 sap::SpmvCusp< Matrix > Class Template Reference

Default SPMV functor class.

#include <sap/spmv.h>

Public Types

• typedef cusp::linear_operator Parent

Public Member Functions

• SpmvCusp (Matrix &A)
• double getTime () const
  Cumulative time for all SPMV calls (ms).
• double getCount () const
  Total number of calls to the SPMV functor.
• double getGFlops () const
  Average GFLOP/s over all SPMV calls.
• template<typename Array> void operator() (const Array &v, Array &Av)
  Implementation of the SPMV functor using cusp::multiply().

F.27.1 Detailed Description

template<typename Matrix> class sap::SpmvCusp< Matrix >

Default SPMV functor class. This class implements the default SPMV functor for sparse matrix-vector product, using the cusp::multiply algorithm.

Template Parameters

Matrix is the type of the sparse matrix.

The documentation for this class was generated from the following file:

• sap/spmv.h
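
A short sketch of using the functor on its own follows; the same object is what sap::Solver::solve() expects as its SpmvOperator argument, and the types and sizes are illustrative.

    #include <sap/spmv.h>
    #include <cusp/csr_matrix.h>
    #include <cusp/array1d.h>

    typedef cusp::csr_matrix<int, double, cusp::device_memory> Matrix;
    typedef cusp::array1d<double, cusp::device_memory>         Vector;

    void spmv_sketch(Matrix& A, const Vector& v, Vector& Av)
    {
        sap::SpmvCusp<Matrix> spmv(A);
        spmv(v, Av);                        // Av = A * v via cusp::multiply

        double ms     = spmv.getTime();     // cumulative time over all calls (ms)
        double gflops = spmv.getGFlops();   // average GFLOP/s over all calls
        (void)ms; (void)gflops;
    }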

F.28 sap::SpmvCuSparse< Matrix > Class Template Reference

#include <sap/spmv.h>

Public Types

• typedef cusp::linear_operator Parent
• typedef Matrix::value_type fp_type
• typedef Matrix::index_type idx_type

Public Member Functions

• SpmvCuSparse (Matrix &A, cusparseHandle_t &handle)
• double getTime () const
  Cumulative time for all SPMV calls (ms).
• double getCount () const
  Total number of calls to the SPMV functor.
• double getGFlops () const
  Average GFLOP/s over all SPMV calls.
• template<typename Array> void operator() (const Array &v, Array &Av)
  Implementation of the SPMV functor using cusp::multiply().

F.28.1 Detailed Description

template<typename Matrix> class sap::SpmvCuSparse< Matrix >

This class implements an alternative SPMV functor for the sparse matrix-vector product, using the cuSPARSE library.

Template Parameters

Matrix is the type of the sparse matrix.

The documentation for this class was generated from the following file:

• sap/spmv.h

F.29 sap::Graph< T >::Square Struct Reference

Public Member Functions

• __host__ __device__ T operator() (T a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.30 sap::Stats Struct Reference

Output solver statistics.

#include <sap/solver.h>

Public Member Functions

• Stats ()

Public Attributes

• double timeSetup
• double timeUpdate
• double timeSolve
• double time_DB
• double time_DB_pre
• double time_DB_first
• double time_DB_second
• double time_DB_post
• double d_p1
• double d_p1_ori
• double diag_dom
• double diag_dom_ori
• double time_reorder
• double time_dropOff
• double time_cpu_assemble
• double time_transfer
• double time_toBanded
• double time_offDiags
• double time_bandLU
• double time_bandUL
• double time_fullLU
• double time_assembly
• double time_shuffle
• double time_bcr_lu
• double time_bcr_sweep_deflation
• double time_bcr_mat_mul_deflation
• double time_bcr_sweep_inflation
• double time_bcr_mv_inflation
• int bandwidthReorder
• int bandwidthDB
• int bandwidth
• double nuKf
• double flops_LU
• int numPartitions
• double actualDropOff
• float numIterations
• double rhsNorm
• double residualNorm
• double relResidualNorm
• int actual_nnz

F.30.1 Detailed Description

Output solver statistics. This structure encapsulates all solver statistics, both from the iterative solver and the preconditioner.

F.30.2 Constructor & Destructor Documentation

F.30.2.1 sap::Stats::Stats ( ) [inline]

This is the constructor for the Stats class. It initializes all timing and performance measures.

F.30.3 Member Data Documentation

F.30.3.1 double sap::Stats::actualDropOff

Actual fraction of the element-wise matrix 1-norm dropped off.

F.30.3.2 int sap::Stats::bandwidth

Half-bandwidth after reordering and drop-off.

F.30.3.3 int sap::Stats::bandwidthDB

Half-bandwidth after DB.

F.30.3.4 int sap::Stats::bandwidthReorder

Half-bandwidth after reordering.

F.30.3.5 double sap::Stats::d_p1

The 1st percentile of d’s of the matrix after DB

F.30.3.6 double sap::Stats::d_p1_ori

The 1st percentile of d’s of the matrix before DB

F.30.3.7 double sap::Stats::diag_dom

The diagonal dominance of the matrix after DB

F.30.3.8 double sap::Stats::diag_dom_ori

The diagonal dominance of the matrix before DB

F.30.3.9 double sap::Stats::flops_LU

FLOP count of the LU factorization.

F.30.3.10 double sap::Stats::nuKf

Non-uniform K factor. Indicates whether the half-bandwidth K changes significantly from row to row.

F.30.3.11 float sap::Stats::numIterations

Number of iterations required for iterative solver to converge.

F.30.3.12 int sap::Stats::numPartitions

Actual number of partitions used in the SaP factorization

F.30.3.13 double sap::Stats::relResidualNorm

Final relative residual norm (i.e. ||b-Ax||_2 / ||b||_2)

F.30.3.14 double sap::Stats::residualNorm

Final residual norm (i.e. ||b-Ax||_2).

F.30.3.15 double sap::Stats::rhsNorm

RHS norm (i.e. ||b||_2).

F.30.3.16 double sap::Stats::time_assembly

Time for assembling off-diagonal matrices (including solving multiple RHS)

F.30.3.17 double sap::Stats::time_bandLU

Time for LU factorization of diagonal blocks.

F.30.3.18 double sap::Stats::time_bandUL

Time for UL factorization of diagonal blocks (in the LU_UL method only).

F.30.3.19 double sap::Stats::time_cpu_assemble

Time on CPU to assemble the banded matrix and off-diagonal spikes.

F.30.3.20 double sap::Stats::time_DB

Time to do DB reordering.

F.30.3.21 double sap::Stats::time_DB_first

Time to do DB reordering (first stage).

F.30.3.22 double sap::Stats::time_DB_post

Time to do DB reordering (post-processing).

F.30.3.23 double sap::Stats::time_DB_pre

Time to do DB reordering (pre-processing).

F.30.3.24 double sap::Stats::time_DB_second

Time to do DB reordering (second stage).

F.30.3.25 double sap::Stats::time_dropOff

Time for drop-off

F.30.3.26 double sap::Stats::time_fullLU

Time for LU factorization of the reduced matrix R.

F.30.3.27 double sap::Stats::time_offDiags

Time to compute off-diagonal spike matrices on GPU.

F.30.3.28 double sap::Stats::time_reorder

Total time for matrix reordering.

F.30.3.29 double sap::Stats::time_shuffle

Total time to do vector reordering and scaling.

F.30.3.30 double sap::Stats::time_toBanded

Time to form banded matrix when reordering is disabled.

F.30.3.31 double sap::Stats::time_transfer

Time to transfer data from CPU to GPU.

F.30.3.32 double sap::Stats::timeSetup

Time to set up the preconditioner.

F.30.3.33 double sap::Stats::timeSolve

Time for Krylov solve.

F.30.3.34 double sap::Stats::timeUpdate

Time to update the preconditioner.

The documentation for this struct was generated from the following file:

• sap/solver.h
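
A small reporting sketch using a few of the fields documented above; the Stats object is obtained from sap::Solver::getStats() after a solve.

    #include <sap/solver.h>
    #include <iostream>

    void report_stats(const sap::Stats& s)
    {
        std::cout << "preconditioner setup time: " << s.timeSetup       << "\n"
                  << "Krylov solve time:         " << s.timeSolve       << "\n"
                  << "iterations:                " << s.numIterations   << "\n"
                  << "half-bandwidth:            " << s.bandwidth       << "\n"
                  << "partitions:                " << s.numPartitions   << "\n"
                  << "relative residual norm:    " << s.relResidualNorm << std::endl;
    }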

F.31 sap::strided_range< Iterator >::stride_functor Struct Reference

Public Member Functions

• stride_functor (difference_type stride)
• __host__ __device__ difference_type operator() (const difference_type &i) const

Public Attributes

• difference_type m_stride

The documentation for this struct was generated from the following file:

• sap/strided_range.h

F.32 sap::strided_range< Iterator > Class Template Reference

Classes

• struct stride_functor

Public Types

• typedef thrust::iterator_difference< Iterator >::type difference_type
• typedef thrust::counting_iterator< difference_type > CountingIterator
• typedef thrust::transform_iterator< stride_functor, CountingIterator > TransformIterator
• typedef thrust::permutation_iterator< Iterator, TransformIterator > PermutationIterator
• typedef PermutationIterator iterator

Public Member Functions

• strided_range (Iterator first, Iterator last, difference_type stride)
• iterator begin (void) const
• iterator end (void) const

Protected Attributes

• Iterator m_first
• Iterator m_last
• difference_type m_stride

The documentation for this class was generated from the following file:

• sap/strided_range.h
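
A usage sketch, mirroring the original thrust example: create a strided view of every other element of a device vector without copying it.

    #include <sap/strided_range.h>
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/reduce.h>

    int sum_of_even_positions()
    {
        thrust::device_vector<int> v(8);
        thrust::sequence(v.begin(), v.end());                    // v = 0, 1, 2, ..., 7

        typedef thrust::device_vector<int>::iterator Iter;
        sap::strided_range<Iter> evens(v.begin(), v.end(), 2);   // elements 0, 2, 4, 6

        return thrust::reduce(evens.begin(), evens.end());       // 0 + 2 + 4 + 6 = 12
    }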

F.33 sap::system_error Class Reference

Public Types

• enum Reason { Zero_pivoting = -1, Negative_DB_weight = -2, Illegal_update = -3, Illegal_solve = -4, Matrix_singular = -5 }

Public Member Functions

• system_error (Reason reason, const std::string &what_arg)
• system_error (Reason reason, const char ∗what_arg)
• Reason reason () const

The documentation for this class was generated from the following file:

• sap/exception.h
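
A handling sketch: any SaP operation that may throw (for example Precond::setup() or Solver::solve()) can be wrapped and the documented Reason codes inspected; the wrapper function and the messages below are illustrative only.

    #include <sap/exception.h>
    #include <iostream>

    template <typename Callable>
    bool run_guarded(Callable op)
    {
        try {
            op();
            return true;
        } catch (const sap::system_error& e) {
            switch (e.reason()) {
                case sap::system_error::Zero_pivoting:   std::cerr << "zero pivot encountered\n";        break;
                case sap::system_error::Matrix_singular: std::cerr << "matrix is singular\n";            break;
                case sap::system_error::Illegal_update:  std::cerr << "illegal preconditioner update\n"; break;
                default:                                 std::cerr << "SaP error\n";                     break;
            }
            return false;
        }
    }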

F.34 sap::Timer Class Reference

Base timer class.

#include <sap/timer.h>

Inherited by sap::CPUTimer and sap::GPUTimer.

Public Member Functions

• virtual void Start ()=0
• virtual void Stop ()=0
• virtual double getElapsed ()=0

F.34.1 Detailed Description

Base timer class.

The documentation for this class was generated from the following file:

• sap/timer.h

g file documentation

G.1 sap/banded_matrix.h File Reference

Definition of banded matrix class.


Classes

• class sap::BandedMatrix< Array >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.1.1 Detailed Description

Definition of banded matrix class.

G.2 sap/bicgstab.h File Reference

BiCGStab preconditioned iterative Krylov solver.


Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

Functions

• template void sap::bicgstab (LinearOperator &A, Vector &x, Vector &b, Monitor &monitor, Preconditioner &M)
  Preconditioned BiCGStab Krylov method.

G.2.1 Detailed Description

BiCGStab preconditioned iterative Krylov solver.

G.3 sap/bicgstab2.h File Reference

BiCGStab(L) preconditioned iterative Krylov solver.


Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

Functions

• template void sap::bicgstabl (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Preconditioned BiCGStab(L) Krylov method.
• template void sap::bicgstab1 (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Specialization of the generic sap::bicgstabl function for L=1.
• template void sap::bicgstab2 (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Specialization of the generic sap::bicgstabl function for L=2.

G.3.1 Detailed Description

BiCGStab(L) preconditioned iterative Krylov solver.
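
For completeness, here is a sketch of calling the free function directly (normally sap::Solver drives these routines internally); the operator A, preconditioner M, and vectors are assumed to satisfy the documented LinearOperator, Preconditioner, and Vector requirements.

    #include <sap/bicgstab2.h>
    #include <sap/monitor.h>

    template <typename Operator, typename Preconditioner, typename Vector>
    bool run_bicgstab2(Operator& A, Preconditioner& M, const Vector& b, Vector& x)
    {
        sap::Monitor<Vector> monitor(200 /*maxIterations*/, 1e-8 /*relTol*/);
        sap::bicgstab2(A, x, b, monitor, M);   // BiCGStab(L) with L = 2
        return monitor.converged();
    }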

G.4 sap/common.h File Reference

Definition of commonly used macros and solver configuration constants.


Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

Macros

• #define ALWAYS_ASSERT
• #define BURST_VALUE (1e-7)
• #define BURST_NEW_VALUE (1e-4)
• #define MATRIX_MUL_BLOCK_SIZE (16)
• #define MAT_VEC_MUL_BLOCK_SIZE (16)
• #define USE_OLD_CUSP

Enumerations

• enum sap::KrylovSolverType { BiCGStab_C, GMRES_C, CG_C, CR_C, BiCGStab1, BiCGStab2, BiCGStab, MINRES }
• enum FactorizationMethod { LU_UL, LU_only }
• enum sap::PreconditionerType { Spike, Block, None }

Functions

• void sap::kernelConfigAdjust (int &numThreads, int &numBlockX, const int numThreadsMax)
• void sap::kernelConfigAdjust (int &numThreads, int &numBlockX, int &numBlockY, const int numThreadsMax, const int numBlockXMax)

Variables

• const unsigned int sap::BLOCK_SIZE = 512
• const unsigned int sap::MAX_GRID_DIMENSION = 32768
• const unsigned int sap::CRITICAL_THRESHOLD = 70

G.4.1 Detailed Description

Definition of commonly used macros and solver configuration constants.

G.4.2 Macro Definition Documentation

G.4.2.1 #define ALWAYS_ASSERT

If ALWAYS_ASSERT is defined, we make sure that assertions are triggered even if NDEBUG is defined.
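
The launch-configuration helpers listed under Functions above are used throughout the GPU kernels; a hypothetical usage sketch follows, where the clamping behavior described in the comments is assumed from the function's name and argument list rather than stated in this documentation.

    #include <sap/common.h>

    void launch_config_sketch(int workItems)
    {
        int numThreads = workItems;   // start with one thread per work item
        int numBlockX  = 1;

        // Assumed behavior: clamp numThreads to sap::BLOCK_SIZE and grow numBlockX
        // so that numThreads * numBlockX still covers all work items.
        sap::kernelConfigAdjust(numThreads, numBlockX, sap::BLOCK_SIZE);
    }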

G.5 sap/exception.h File Reference

Definition of the SaP system_error exception class.


Classes

• class sap::system_error

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.5.1 Detailed Description

Definition of the SaP system_error exception class.

G.6 sap/graph.h File Reference

Definition of an auxiliary class for the Preconditioner class. This class performs reordering, element drop-off, and assembly of the banded matrix.


Classes

• class sap::Graph< T >
• struct sap::Graph< T >::AbsoluteValue< VType >
• struct sap::Graph< T >::Square
• struct sap::Graph< T >::AccumulateEdgeWeights
• struct sap::Graph< T >::is_not
• struct sap::Graph< T >::ClearValue
• struct sap::Graph< T >::Exponential
• struct sap::Graph< T >::EdgeLength
• struct sap::Graph< T >::PermutedEdgeLength
• struct sap::Graph< T >::PermuteEdge
• struct sap::Graph< T >::Map< Type >
• struct sap::Graph< T >::GetCount
• struct sap::Graph< T >::CompareValue< VType >
• struct sap::Graph< T >::Difference
• struct sap::Graph< T >::EqualTo< Type >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.6.1 Detailed Description

Definition of an auxiliary class for the Preconditioner class. This class performs reordering, element drop-off, and assembly of the banded matrix.

G.7 sap/minres.h File Reference

MINRES preconditioned iterative solver.


Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

Functions

• template void sap::minres (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Preconditioned MINRES method.

G.7.1 Detailed Description

MINRES preconditioned iterative solver.

G.8 sap/monitor.h File Reference

This file provides two classes for monitoring the progress of iterative linear solvers, checking for convergence, and stopping on various error conditions.


Classes

• class sap::Monitor< SolverVector >
• class sap::BiCGStabLMonitor< SolverVector >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.8.1 Detailed Description

This file provides two classes for monitoring the progress of iterative linear solvers, checking for convergence, and stopping on various error conditions.

G.9 sap/precond.h File Reference

Definition of the SaP preconditioner class.


Classes

• class sap::Precond< PrecVector >
  SaP preconditioner.
• struct sap::Multiply< T >
• struct sap::SmallerThan< T >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.9.1 Detailed Description

Definition of the SaP preconditioner class.

G.10 sap/segmented_matrix.h File Reference

Definition of a special data structure which is used in the Block Cyclic Reduction algorithm implementation.


Classes

• class sap::SegmentedMatrix< Array, MemorySpace >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.10.1 Detailed Description

Definition of a special data structure which is used in the Block Cyclic Reduction algorithm implementation.

G.11 sap/solver.h File Reference

Definition of the main SaP solver class.


Classes

• struct sap::Options
  Input solver options.
• struct sap::Stats
  Output solver statistics.
• class sap::Solver< Array, PrecValueType >
  Main SaP::GPU solver.

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.11.1 Detailed Description

Definition of the main SaP solver class.

G.12 sap/spmv.h File Reference

Definition of the default Spike SPMV functor class.

#include "cusparse.h"

Classes

• class sap::MVBanded< Matrix >
• class sap::SpmvCusp< Matrix >
  Default SPMV functor class.
• class sap::SpmvCuSparse< Matrix >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.12.1 Detailed Description

Definition of the default Spike SPMV functor class.

G.13 sap/strided_range.h File Reference

This file implements the strided_range class to allow non-contiguous access to data in a thrust vector.


Classes

• class sap::strided_range< Iterator >
• struct sap::strided_range< Iterator >::stride_functor

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.13.1 Detailed Description

This file implements the strided_range class to allow non-contiguous access to data in a thrust vector. This class is the same as the strided_range thrust example provided by Nathan Bell at https://code.google.com/p/thrust/source/browse/examples/strided_range.cu.

G.14 sap/timer.h File Reference

CPU and GPU timer classes.


Classes

• class sap::Timer
  Base timer class.
• class sap::GPUTimer
  GPU timer.
• class sap::CPUTimer
  CPU timer.

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.14.1 Detailed Description

CPU and GPU timer classes.