A Splitting Approach for the Parallel Solution of Linear Systems on Gpu Cards
Total Page:16
File Type:pdf, Size:1020Kb
A SPLITTING APPROACH FOR THE PARALLEL SOLUTION OF LINEAR SYSTEMS ON GPU CARDS by Ang Li A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical and Computer Engineering) at the UNIVERSITY OF WISCONSIN–MADISON 2016 Date of final oral examination: 05/09/16 The dissertation is approved by the following members of the Final Oral Committee: Dan Negrut, Associate Professor, Mechanical Engineering Mikko Lipasti, Professor, Electrical and Computer Engineering Yu-Hen Hu, Professor, Electrical and Computer Engineering Parameswaran Ramanathan, Professor, Electrical and Computer Engineering Radu Serban, Associate Scientist, Mechanical Engineering © Copyright by Ang Li 2016 All Rights Reserved i I would like to dedicate this dissertation to Prof. Dan Negrut, my advisor, without whom none of my accomplishment during my PhD life would have been possible. While I realize it’s unusual to include here, I think I need to share with the world how Dan has also been helpful in changing my attitude towards others and life. ii acknowledgments I would like to give my thanks to my advisor, Professor Dan Negrut, for his guidance and support while working with him. I would also like to thank the committee members for their time and suggestions. I enjoyed the time I spent with all my colleagues and friends in the Simulation-Based Engineering Lab. I benefited a lot from their support, especially Dr. Radu Serban’s instructions. Finally, to my beloved family, I miss YOU, and – I am YOU. iii contents Contents . iii List of Tables . xii List of Figures . xiv Abstract . xviii 1 Introduction . 1 1.1 The Science and Engineering algorithmic backdrop ........... 1 1.2 Solving linear systems .......................... 2 1.3 Contributions ............................... 3 2 Motivation: the rise of parallel computing and emergence of GPU computing 5 2.1 Three efficiency walls in sequential computing ............. 5 2.2 GPU computing and lack of GPU linear system solvers . 6 2.3 Brief overview of relevant hardware ................... 8 2.3.1 The Tesla K40 Graphics Processing Unit . 9 2.3.2 Intel Xeon processor . 10 2.4 Brief overview of Compute Unified Device Architecture . 11 3 The “Split and Parallelize” Method . 12 3.1 SaP for dense banded linear systems ................... 12 3.1.1 SPIKE algorithm . 12 3.1.2 A generalization of the Cyclic Reduction: the Block Cyclic Reduction algorithm . 17 3.1.3 Nomenclature, solution strategies . 20 3.1.4 A comparison of SPIKE algorithm and Block Cyclic Reduction algorithm . 20 3.2 SaP for general sparse linear systems . 25 iv 3.3 SaP components and computational flow . 26 4 Reordering Methodologies . 29 4.1 Diagonal Boosting ............................. 29 4.2 Bandwidth Reduction ........................... 31 4.3 Third-stage reordering .......................... 33 5 Krylov Subspace Method . 35 5.1 A brief introduction to preconditioning . 36 5.2 Conjugate Gradient method ....................... 37 5.2.1 Backgrounds . 37 5.2.2 Further discussion . 39 5.3 Biconjugate Gradient Stabilized method and its generalization . 40 5.4 A profiling of Krylov subspace method . 44 6 Implementation Details . 48 6.1 Dense banded matrix factorization details . 48 6.1.1 Number of partitions and partition size . 48 6.1.2 Matrix storage . 49 6.1.3 LU/UL/Cholesky factorization . 50 6.1.4 Optimization techniques . 51 6.2 Implementation details for reordering methodologies . 53 6.2.1 DB reordering implementation details . 53 6.2.2 CM reordering implementation details . 54 6.3 CPU-GPU data transfer in SaP::GPU . 56 6.4 Multi-GPU support ............................ 59 7 Numerical Experiments . 64 7.1 Numerical experiments related to dense banded linear systems . 64 7.1.1 Sensitivity with respect to P ................... 64 7.1.2 Sensitivity with respect to d ................... 67 v 7.1.3 Comparison with LAPACK and Intel’s MKL over a spectrum of N and K ............................ 73 7.1.4 Profiling results for dense banded linear system solver . 78 7.2 Numerical experiments related to sparse matrix reorderings . 91 7.2.1 Assessment of the diagonal boosting reordering solution . 91 7.2.1.1 Efficiency evaluation of diagonal boosting algorithm . 91 7.2.1.2 Performance comparison against Harwell Sparse Li- brary’s solution . 95 7.2.2 Assessment of the bandwidth reduction reordering solution . 97 7.3 Numerical experiments related to sparse linear systems . 100 7.3.1 Profiling results . 100 7.3.2 The impact of the third stage reordering . 112 7.3.3 Comparison against state-of-the-art direct linear solvers on the CPU . 114 7.3.4 Comparison against a state-of-the-art GPU linear solver . 118 7.4 Numerical experiments highlighting multi-GPU support . 120 8 Use of SaP in computational multibody dynamics problems . 124 9 Conclusions and future work . 133 A Namespace Index . 135 A.1 Namespace List ..............................135 B Class Index . 136 B.1 Class Hierarchy ..............................136 C Class Index . 138 C.1 Class List .................................138 D File Index . 140 D.1 File List ..................................140 vi E Namespace Documentation . 142 E.1 sap Namespace Reference . 142 E.1.1 Detailed Description . 144 E.1.2 Enumeration Type Documentation . 145 E.1.2.1 KrylovSolverType . 145 E.1.2.2 PreconditionerType . 145 E.1.3 Function Documentation . 145 E.1.3.1 bicgstab . 145 E.1.3.2 bicgstabl . 145 E.1.3.3 minres . 146 F Class Documentation . 147 F.1 sap::Graph< T >::AbsoluteValue< VType > Struct Template Reference147 F.2 sap::Graph< T >::AccumulateEdgeWeights Struct Reference . 147 F.3 sap::BandedMatrix< Array > Class Template Reference . 147 F.3.1 Detailed Description . 148 F.3.2 Constructor & Destructor Documentation . 149 F.3.2.1 BandedMatrix . 149 F.4 sap::BiCGStabLMonitor< SolverVector > Class Template Reference . 149 F.4.1 Detailed Description . 150 F.5 sap::Graph< T >::ClearValue Struct Reference . 151 F.6 sap::Graph< T >::CompareValue< VType > Struct Template Reference151 F.7 sap::CPUTimer Class Reference . 151 F.7.1 Detailed Description . 152 F.8 sap::Graph< T >::Difference Struct Reference . 152 F.9 sap::Graph< T >::EdgeLength Struct Reference . 153 F.10 sap::Graph< T >::EqualTo< Type > Struct Template Reference . 153 F.11 sap::Graph< T >::Exponential Struct Reference . 154 F.12 sap::Graph< T >::GetCount Struct Reference . 154 F.13 sap::GPUTimer Class Reference . 154 F.13.1 Detailed Description . 155 vii F.14 sap::Graph< T > Class Template Reference . 155 F.14.1 Detailed Description . 159 F.15 sap::Graph< T >::is_not Struct Reference . 160 F.16 sap::Graph< T >::Map< Type > Struct Template Reference . 160 F.17 sap::Monitor< SolverVector > Class Template Reference . 160 F.17.1 Detailed Description . 162 F.18 sap::Multiply< T > Struct Template Reference . 162 F.19 sap::MVBanded< Matrix > Class Template Reference . 162 F.19.1 Detailed Description . 163 F.20 sap::Options Struct Reference . 164 F.20.1 Detailed Description . 165 F.20.2 Constructor & Destructor Documentation . 165 F.20.2.1 Options . 165 F.20.3 Member Data Documentation . 165 F.20.3.1 absTol . 165 F.20.3.2 applyScaling . 165 F.20.3.3 dbFirstStageOnly . 165 F.20.3.4 dropOffFraction . 165 F.20.3.5 factMethod . 166 F.20.3.6 gpuCount . 166 F.20.3.7 ilu_level . 166 F.20.3.8 isSPD . 166 F.20.3.9 maxBandwidth . 166 F.20.3.10maxNumIterations . 166 F.20.3.11performDB . 166 F.20.3.12performReorder . 166 F.20.3.13precondType . 166 F.20.3.14relTol . 167 F.20.3.15safeFactorization . 167 F.20.3.16saveMem . 167 F.20.3.17solverType . 167 viii F.20.3.18testDB . 167 F.20.3.19trackReordering . 167 F.20.3.20variableBandwidth . 167 F.21 sap::Graph< T >::PermutedEdgeLength Struct Reference . 168 F.22 sap::Graph< T >::PermuteEdge Struct Reference . 168 F.23 sap::Precond< PrecVector > Class Template Reference . 169 F.23.1 Detailed Description . 172 F.23.2 Constructor & Destructor Documentation . 172 F.23.2.1 Precond . 172 F.23.2.2 Precond . 173 F.23.2.3 Precond . 173 F.23.3 Member Function Documentation . 173 F.23.3.1 operator() . 173 F.23.3.2 setup . 173 F.23.3.3 solve . 174 F.23.3.4 update . 174 F.24 sap::SegmentedMatrix< Array, MemorySpace > Class Template Reference174 F.25 sap::SmallerThan< T > Struct Template Reference . 176 F.26 sap::Solver< Array, PrecValueType > Class Template Reference ..