A SPLITTING APPROACH FOR THE PARALLEL SOLUTION OF LINEAR SYSTEMS ON GPU CARDS

by

Ang Li

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

(Electrical and Computer Engineering)

at the

UNIVERSITY OF WISCONSIN–MADISON

2016

Date of final oral examination: 05/09/16

The dissertation is approved by the following members of the Final Oral Committee:

Dan Negrut, Associate Professor, Mechanical Engineering
Mikko Lipasti, Professor, Electrical and Computer Engineering
Yu-Hen Hu, Professor, Electrical and Computer Engineering
Parameswaran Ramanathan, Professor, Electrical and Computer Engineering
Radu Serban, Associate Scientist, Mechanical Engineering

© Copyright by Ang Li 2016
All Rights Reserved

I would like to dedicate this dissertation to Prof. Dan Negrut, my advisor, without whom none of my accomplishments during my PhD would have been possible. While I realize it is unusual to include here, I want to share with the world how Dan has also helped change my attitude towards others and towards life.

Acknowledgments

I would like to give my thanks to my advisor, Professor Dan Negrut, for his guidance and support throughout my time working with him. I would also like to thank the committee members for their time and suggestions. I enjoyed the time I spent with all my colleagues and friends in the Simulation-Based Engineering Lab. I benefited a lot from their support, especially Dr. Radu Serban's guidance. Finally, to my beloved family, I miss YOU, and – I am YOU.

Contents

List of Tables
List of Figures
Abstract

1 Introduction
  1.1 The Science and Engineering algorithmic backdrop
  1.2 Solving linear systems
  1.3 Contributions

2 Motivation: the rise of parallel computing and emergence of GPU computing
  2.1 Three efficiency walls in sequential computing
  2.2 GPU computing and lack of GPU linear system solvers
  2.3 Brief overview of relevant hardware
    2.3.1 The Tesla K40 Graphics Processing Unit
    2.3.2 Intel Xeon processor
  2.4 Brief overview of Compute Unified Device Architecture

3 The “Split and Parallelize” Method
  3.1 SaP for dense banded linear systems
    3.1.1 SPIKE algorithm
    3.1.2 A generalization of the Cyclic Reduction: the Block Cyclic Reduction algorithm
    3.1.3 Nomenclature, solution strategies
    3.1.4 A comparison of SPIKE algorithm and Block Cyclic Reduction algorithm
  3.2 SaP for general sparse linear systems
  3.3 SaP components and computational flow

4 Reordering Methodologies
  4.1 Diagonal Boosting
  4.2 Bandwidth Reduction
  4.3 Third-stage reordering

5 Krylov Subspace Method
  5.1 A brief introduction to preconditioning
  5.2 Conjugate Gradient method
    5.2.1 Backgrounds
    5.2.2 Further discussion
  5.3 Biconjugate Gradient Stabilized method and its generalization
  5.4 A profiling of Krylov subspace method

6 Implementation Details
  6.1 Dense banded matrix factorization details
    6.1.1 Number of partitions and partition size
    6.1.2 Matrix storage
    6.1.3 LU/UL/Cholesky factorization
    6.1.4 Optimization techniques
  6.2 Implementation details for reordering methodologies
    6.2.1 DB reordering implementation details
    6.2.2 CM reordering implementation details
  6.3 CPU-GPU data transfer in SaP::GPU
  6.4 Multi-GPU support

7 Numerical Experiments
  7.1 Numerical experiments related to dense banded linear systems
    7.1.1 Sensitivity with respect to P
    7.1.2 Sensitivity with respect to d
    7.1.3 Comparison with LAPACK and Intel’s MKL over a spectrum of N and K
    7.1.4 Profiling results for dense banded linear system solver
  7.2 Numerical experiments related to reorderings
    7.2.1 Assessment of the diagonal boosting reordering solution
      7.2.1.1 Efficiency evaluation of diagonal boosting algorithm
      7.2.1.2 Performance comparison against Harwell Sparse Library’s solution
    7.2.2 Assessment of the bandwidth reduction reordering solution
  7.3 Numerical experiments related to sparse linear systems
    7.3.1 Profiling results
    7.3.2 The impact of the third stage reordering
    7.3.3 Comparison against state-of-the-art direct linear solvers on the CPU
    7.3.4 Comparison against a state-of-the-art GPU linear solver
  7.4 Numerical experiments highlighting multi-GPU support

8 Use of SaP in computational multibody dynamics problems

9 Conclusions and future work

A Namespace Index
  A.1 Namespace List

B Class Index
  B.1 Class Hierarchy

C Class Index
  C.1 Class List

D File Index
  D.1 File List

E Namespace Documentation
  E.1 sap Namespace Reference
    E.1.1 Detailed Description
    E.1.2 Enumeration Type Documentation: KrylovSolverType, PreconditionerType
    E.1.3 Function Documentation: bicgstab, bicgstabl, minres

F Class Documentation
  F.1 sap::Graph< T >::AbsoluteValue< VType > Struct Template Reference
  F.2 sap::Graph< T >::AccumulateEdgeWeights Struct Reference
  F.3 sap::BandedMatrix< Array > Class Template Reference
    F.3.1 Detailed Description
    F.3.2 Constructor & Destructor Documentation: BandedMatrix
  F.4 sap::BiCGStabLMonitor< SolverVector > Class Template Reference
    F.4.1 Detailed Description
  F.5 sap::Graph< T >::ClearValue Struct Reference
  F.6 sap::Graph< T >::CompareValue< VType > Struct Template Reference
  F.7 sap::CPUTimer Class Reference
    F.7.1 Detailed Description
  F.8 sap::Graph< T >::Difference Struct Reference
  F.9 sap::Graph< T >::EdgeLength Struct Reference
  F.10 sap::Graph< T >::EqualTo< Type > Struct Template Reference
  F.11 sap::Graph< T >::Exponential Struct Reference
  F.12 sap::Graph< T >::GetCount Struct Reference
  F.13 sap::GPUTimer Class Reference
    F.13.1 Detailed Description
  F.14 sap::Graph< T > Class Template Reference
    F.14.1 Detailed Description
  F.15 sap::Graph< T >::is_not Struct Reference
  F.16 sap::Graph< T >::Map< Type > Struct Template Reference
  F.17 sap::Monitor< SolverVector > Class Template Reference
    F.17.1 Detailed Description
  F.18 sap::Multiply< T > Struct Template Reference
  F.19 sap::MVBanded< Matrix > Class Template Reference
    F.19.1 Detailed Description
  F.20 sap::Options Struct Reference
    F.20.1 Detailed Description
    F.20.2 Constructor & Destructor Documentation: Options
    F.20.3 Member Data Documentation: absTol, applyScaling, dbFirstStageOnly, dropOffFraction, factMethod, gpuCount, ilu_level, isSPD, maxBandwidth, maxNumIterations, performDB, performReorder, precondType, relTol, safeFactorization, saveMem, solverType, testDB, trackReordering, variableBandwidth
  F.21 sap::Graph< T >::PermutedEdgeLength Struct Reference
  F.22 sap::Graph< T >::PermuteEdge Struct Reference
  F.23 sap::Precond< PrecVector > Class Template Reference
    F.23.1 Detailed Description
    F.23.2 Constructor & Destructor Documentation: Precond (×3)
    F.23.3 Member Function Documentation: operator(), setup, solve, update
  F.24 sap::SegmentedMatrix< Array, MemorySpace > Class Template Reference
  F.25 sap::SmallerThan< T > Struct Template Reference
  F.26 sap::Solver< Array, PrecValueType > Class Template Reference
    F.26.1 Detailed Description
    F.26.2 Constructor & Destructor Documentation: Solver
    F.26.3 Member Function Documentation: setup, solve, update
  F.27 sap::SpmvCusp< Matrix > Class Template Reference
    F.27.1 Detailed Description
  F.28 sap::SpmvCuSparse< Matrix > Class Template Reference
    F.28.1 Detailed Description
  F.29 sap::Graph< T >::Square Struct Reference
  F.30 sap::Stats Struct Reference
    F.30.1 Detailed Description
    F.30.2 Constructor & Destructor Documentation: Stats
    F.30.3 Member Data Documentation: actualDropOff, bandwidth, bandwidthDB, bandwidthReorder, d_p1, d_p1_ori, diag_dom, diag_dom_ori, flops_LU, nuKf, numIterations, numPartitions, relResidualNorm, residualNorm, rhsNorm, time_assembly, time_bandLU, time_bandUL, time_cpu_assemble, time_DB, time_DB_first, time_DB_post, time_DB_pre, time_DB_second, time_dropOff, time_fullLU, time_offDiags, time_reorder, time_shuffle, time_toBanded, time_transfer, timeSetup, timeSolve, timeUpdate
  F.31 sap::strided_range< Iterator >::stride_functor Struct Reference
  F.32 sap::strided_range< Iterator > Class Template Reference
  F.33 sap::system_error Class Reference
  F.34 sap::Timer Class Reference
    F.34.1 Detailed Description

G File Documentation
  G.1 sap/banded_matrix.h File Reference
    G.1.1 Detailed Description
  G.2 sap/bicgstab.h File Reference
    G.2.1 Detailed Description
  G.3 sap/bicgstab2.h File Reference
    G.3.1 Detailed Description
  G.4 sap/common.h File Reference
    G.4.1 Detailed Description
    G.4.2 Macro Definition Documentation: BURST_VALUE
  G.5 sap/exception.h File Reference
    G.5.1 Detailed Description
  G.6 sap/graph.h File Reference
    G.6.1 Detailed Description
  G.7 sap/minres.h File Reference
    G.7.1 Detailed Description
  G.8 sap/monitor.h File Reference
    G.8.1 Detailed Description
  G.9 sap/precond.h File Reference
    G.9.1 Detailed Description
  G.10 sap/segmented_matrix.h File Reference
    G.10.1 Detailed Description
  G.11 sap/solver.h File Reference
    G.11.1 Detailed Description
  G.12 sap/spmv.h File Reference
    G.12.1 Detailed Description
  G.13 sap/strided_range.h File Reference
    G.13.1 Detailed Description
  G.14 sap/timer.h File Reference
    G.14.1 Detailed Description

References


List of Tables

2.1 Comparison of multiprocessor and the number of cores between two Tesla cards (commercial models). One is Tesla M2090 with Fermi GF100, and the other is Tesla K40 with Kepler GK110. In this table, “SP” and “DP” are for single precision and double precision, respectively.

7.1 Performance comparison over a spectrum of number of partitions P for coupled (C) vs. decoupled (D) strategies in SaP::GPU. All timings are in milliseconds. Problem parameters: N = 200,000, d = 0.6, K = 50 and K = 200. The symbols used are as follows: D_pre – amount of time spent in preprocessing by the decoupled strategy; D_it – number of Krylov iterations for convergence; D_Tot – amount of time to converge. Similar values are reported for the coupled approach.

7.2 Influence of d for coupled (C) vs. decoupled (D) vs. BCR (B) strategies in SaP::GPU (N = 200,000, P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D, K = 50 for Table 7.2a and K = 200 for Table 7.2b). All timings are in milliseconds. Symbols used are as specified for Table 7.1. For tests whose number of iterations is reported as 101 and whose Krylov solve and total times are reported as −1, the test failed to converge within 100 Krylov iterations.

7.3 Influence of d for the decoupled (D) strategy in SaP::GPU (N = 200,000 and P = 80). All timings are in milliseconds. Symbols used are as specified for Table 7.1. For tests whose number of iterations is reported as 501 and whose Krylov solve and total times are reported as −1, the test failed to converge within 500 Krylov iterations.

7.4 Performance comparison, two-dimensional sweep over N and K for d = 0.6. For each value N, the five rows correspond to the SaP::GPU-D, SaP::GPU-C, SaP::GPU-B, MKL and LAPACK solvers, respectively.

7.5 Medians of R_{Γ-Tot} and R_{Γ-PreP} for all stages in SaP::GPU-D. Table 7.5a shows the single-GPU results, and Table 7.5b shows the dual-GPU results.

7.6 Medians of R_{Γ-Tot} and R_{Γ-PreP} for all stages in SaP::GPU-C. It is worth noting that the summation of each row is not expected to be 100, because these values are statistical medians.

7.7 Medians of R_{Γ-Tot}, R_{Γ-PreP} and R_{Γ-Kry} for all stages in SaP::GPU-B. It is worth noting that the summation of each row is not expected to be 100, in that these values are statistical medians.

7.8 Median information for the SaP solution components as % of the time for solution. Two scenarios are considered: the first data row provides values when the total time, i.e., 100%, included the time spent by SaP in the Krylov-subspace component. The second row of data is obtained by considering 100% to be the time required to compute a factorization of the preconditioner.

7.9 Median information for the Krylov subspace components for each type. The first row provides the median number of iterations to solution, and the second row provides the % of the time for solution for each type of linear system.

7.10 Examples of matrices where the third stage reordering (3rd SR) reduced more significantly the block bandwidth K_i for A_i, i = 1, ..., P.

7.11 Speed-up “SpdUp” values obtained upon embedding a third stage reordering step in the solution process, a decision that also changed the number of partitions P for best performance. When correlating the results reported to values provided in Table 7.10, this table lists for each matrix A the largest of its K_i values, i = 1, ..., P.

List of Figures

2.1 Tesla K40 GPU Active Block Diagram

3.1 Factorization of the matrix A with P = 3.

3.2 Computational flow for SaP::GPU. SaP::GPU-Dd is the decoupled dense solver; SaP::GPU-Cd is the coupled dense solver; SaP::GPU-Bd is the dense solver applying the BCR algorithm. The analogue sparse solvers have an s subscript. The 3rd stage reordering in square brackets is optional but typically leads to substantial reductions in time to solution.

4.1 Sparsity pattern for the original (left) and reordered (right) ANCF matrix of dimension N = 88,950. The sparsity pattern for a 2000 × 2000 diagonal block of the reordered matrix is displayed in the inset.

5.1 Profiling results of R_{Γ-Kry} in SaP::GPU-C and SaP::GPU-D obtained from a set of 60 banded linear systems.

5.2 Profiling results of R_{Γ-Kry} in SaP::GPU-C and SaP::GPU-D obtained from a subset of 60 banded linear systems which are considered “large” in terms of dimension N.

5.3 Profiling results of R_{Γ-Kry} in SaP::GPU-C and SaP::GPU-D obtained from a subset of 60 banded linear systems which are considered “large” in terms of half-bandwidth K.

6.1 Speedup of SaP::GPU-C with mixed precision strategy over SaP::GPU-C with pure double strategy on a battery of banded tests. The set of test matrices and solver configuration are the same as the ones described in §7.1.3, with the exception that the relative tolerance was set to 10^−6, instead of 10^−9.

6.2 Data transfer in SaP::GPU.

6.3 Allocating a seven-partitioned matrix to three GPUs.

6.4 Data transfer in SaP::GPU with dual GPUs.

7.1 Time to solution as a function of the number of partitions P. Study carried out for a dense banded linear system with N = 200,000 and d = 0.6, for two different values of the half-bandwidth K.

7.2 Influence of the diagonal dominance d, with 0.01 ≤ d ≤ 1.2, for fixed values N = 200,000, P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D. The plots are for two different values of the bandwidth K.

7.3 Influence of the diagonal dominance d, with 0.1 ≤ d ≤ 0.2, for fixed values N = 200,000 and P = 80 for SaP::GPU-D.

7.4 SaP speedup over Intel’s MKL and LAPACK – statistical analysis based on values in Table 7.4.

7.5 Profiling results of R_{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems. Fig. 7.5a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.5b illustrates SaP::GPU-D run on dual GPUs.

7.6 Profiling results of R_{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems which are considered “large” in terms of dimension N. Fig. 7.6a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.6b illustrates SaP::GPU-D run on dual GPUs.

7.7 Profiling results of R_{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems which are considered “large” in terms of half-bandwidth K. Fig. 7.7a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.7b illustrates SaP::GPU-D run on dual GPUs.

7.8 Profiling results of R_{Γ-Tot} in SaP::GPU-C obtained from a set of 58 banded linear systems which SaP::GPU-C successfully solved.

7.9 Profiling results of R_{Γ-PreP} in SaP::GPU-C obtained from a set of 58 banded linear systems which SaP::GPU-C successfully solved.

7.10 Profiling results of R_{Γ-Tot} in SaP::GPU-C obtained from a subset of 58 banded linear systems which are considered “large”. Fig. 7.10a illustrates all cases with large N, and Fig. 7.10b illustrates all cases with large K.

7.11 Profiling results of R_{Γ-Tot} in SaP::GPU-B obtained from a set of 57 banded linear systems which SaP::GPU-B successfully solved.

7.12 Profiling results of R_{Γ-PreP} and R_{Γ-Kry} in SaP::GPU-B obtained from a set of 57 banded linear systems which SaP::GPU-B successfully solved.

7.13 Profiling results of R_{Γ-Tot} in SaP::GPU-B obtained from a subset of 57 banded linear systems which are considered “large”. Fig. 7.13a illustrates all cases with large N, and Fig. 7.13b illustrates all cases with large K.

7.14 Results of a statistical analysis that uses a median-quartile method to measure the spread of d (Fig. 7.14a) and d_1-pct (Fig. 7.14b) before and after the application of diagonal boosting reordering. d_1-pct is defined in Eq. 7.2.

7.15 Results of a statistical analysis that uses a median-quartile method to measure the spread of the MC64 and DB times to solution. The speedup factor, or performance metric, is computed as in Eq. (7.3).

7.16 Comparison of the Harwell MC60 and SaP’s CM implementations in terms of resulting half-bandwidth K and time to solution.

7.17 Comparison of the Harwell MC60 and SaP’s CM implementations in terms of resulting half-bandwidth K and time to solution. Statistical analysis of large matrices only.

7.18 Profiling results obtained for a set of 90 linear systems that, out of a collection of 114, could be solved by SaP::GPU.

7.19 Profiling results obtained for a set of 90 linear systems that, out of a collection of 114, could be solved by SaP::GPU.

7.20 Profiling results obtained for a set of 30 linear systems of type I that, out of a collection of 114, could be solved by SaP::GPU.

7.21 Profiling results obtained for a set of 25 linear systems of type II that, out of a collection of 114, could be solved by SaP::GPU.

7.22 Profiling results obtained for a set of 19 linear systems of type III that, out of a collection of 114, could be solved by SaP::GPU.

7.23 Profiling results obtained for a set of 6 linear systems of type IV that, out of a collection of 114, could be solved by SaP::GPU.

7.24 Profiling results obtained for a set of 10 linear systems of type V that, out of a collection of 114, could be solved by SaP::GPU.

7.25 Statistical information regarding the dimension N and number of nonzeros nnz for the 114 coefficient matrices used to compare SaP::GPU, PARDISO, SuperLU, and MUMPS.

7.26 Statistical spread for SaP::GPU’s performance relative to that of PARDISO, SuperLU, and MUMPS. Referring to Eq. 7.4, the results were obtained using the data sets S_{SaP::GPU−PARDISO} (with 60 values), S_{SaP::GPU−SuperLU} (75 values), and S_{SaP::GPU−MUMPS} (64 values).

7.27 Statistical spread for SaP::GPU’s performance relative to that of cuSOLVER. Referring to Eq. 7.4, the results were obtained using the data set S_{SaP::GPU−cuSOLVER} (with 41 values).

7.28 Speedup of dual-GPU solution over single-GPU solution on banded systems with a spectrum of N and K.

7.29 Speedup of dual-GPU solution over single-GPU solution on the 19 systems preconditioned with SaP::GPU-D.

8.1 Dual-GPU method of staggered solve/preconditioner update

8.2 Timing comparison for various Krylov solvers, with or without preconditioning. The sharp increases in the curves for the two preconditioned solvers correspond to preconditioner updates which, for this study, were enforced every 0.5 s. [40 × 40 net; E = 2·10^7 Pa; h = 10^−3 s]

8.3 Comparison of different preconditioner update strategies for the preconditioned BiCGStab(2) solver. The solid lines in the foreground are filtered values of the raw measured timing data displayed in the background. [40 × 40 net; E = 2·10^8 Pa; h = 10^−3 s]

Abstract

Dense and sparse linear algebra are identified by Colella (2004) as two of the seven most important classes of numerical methods (the “seven dwarfs”). Although more than a decade has passed since the original seven dwarfs’ specification, fast linear algebra continues to be as relevant as ever. Solving linear systems is ubiquitous in a variety of science and engineering applications, from computational mechanics (fluids and structure) to computational nanoelectronics (quantum mechanical simulations) to financial mathematics (random walks and geometric Brownian motion). Many real-life applications give rise to very large sparse linear systems that can be reordered to produce either narrow banded systems or low-rank perturbations of narrow banded systems, with the systems being either dense or sparse within the band. GPU-accelerated computing is the use of a Graphics Processing Unit (GPU) in tandem with a CPU to accelerate scientific, engineering, consumer, and enterprise applications. In many science and engineering fields, GPU-accelerated computing has become ubiquitous and is playing an increasingly important role in Scientific Computing, owing to its capacity to deliver speed-ups that in some cases can be as high as one order of magnitude compared to execution on traditional multi-core CPUs. Despite the importance and ubiquity of solving linear systems and the fact that GPU-accelerated computing is presently widely used, there exist few solutions that leverage GPU or hybrid GPU/CPU computing for solving linear systems of equations. In this dissertation, we discuss an approach for solving dense or sparse banded linear systems Ax = b on a GPU card. The matrix A ∈ R^{N×N} is possibly asymmetric and moderately large; i.e., 10,000 ≤ N ≤ 500,000. The split and parallelize (SaP) approach seeks to partition the matrix A into diagonal sub-blocks A_i, i = 1, ..., P, which are independently factored in parallel. The solution may choose to consider or to ignore the matrices that couple the diagonal sub-blocks A_i. This approach, along with the Krylov subspace-based iterative method that it preconditions, is implemented in a solver called SaP::GPU, which is compared in terms of efficiency with three commonly used multi-core CPU sparse direct solvers (PARDISO, SuperLU, and MUMPS) and one GPU sparse solver (NVIDIA’s cuSOLVER). SaP::GPU, which runs entirely on the GPU except for several stages involved in preliminary row-column permutations, is more robust than the other four solvers and compares well in terms of efficiency with the aforementioned direct solvers. In a comparison against LAPACK and Intel’s MKL, SaP::GPU also fares well when used to solve dense banded systems that are close to being diagonally dominant. SaP::GPU is currently used in the Simulation-Based Engineering Lab to compute the time evolution (dynamics) of many-body systems, e.g., granular material flow. SaP::GPU has also been used in determining the dynamics of mechanical systems that contain compliant (flexible) bodies, an exercise that calls for implicit integration methods. The latter methods draw on a Newton step, which solves nonlinear systems by repeatedly solving linear systems Ax = b.

SaP::GPU, which is open source and freely available, is released under a permissive BSD3 license.

1 Introduction

1.1 The Science and Engineering algorithmic backdrop

Colella (2004) identified seven classes of numerical methods, nicknamed the “seven dwarfs”, believed to be important in science and engineering for at least the next decade. This list was reviewed and extended by Asanovic et al. (2006). These seven classes of methods are: dense linear algebra, sparse linear algebra, spectral methods, N-body methods, structured grids, unstructured grids, and Monte Carlo methods. Dense linear algebra refers to those applications that use unit-stride memory accesses to read data from rows and strided accesses to read data from columns. Sparse linear algebra refers to those applications in which data sets include many zero values; data is thus usually stored in compressed matrices to reduce the memory storage and bandwidth requirements to access all of the nonzero values. Although more than a decade has passed since the seven dwarfs’ specification was put out, and more classes of methods, e.g., machine learning, have been added to the dwarf list, the importance of linear algebra has not diminished at all. A testimony to this is the increasing number of efficient linear algebra libraries that are still active, for instance BLAS (Blackford et al., 2002), LAPACK (Anderson et al., 1990) and MKL (Wang et al., 2014) for dense linear algebra, and OSKI (Vuduc et al., 2005), SuperLU (Li, 2005), MUMPS (Amestoy et al., 2000) and PARDISO (Schenk et al., 2001) for sparse linear algebra. Indeed, with the appearance of modern performance-critical applications in many fields, an increasing demand is placed on efficient libraries for linear algebra. Moreover, the appearance of new hardware, for instance vector computers with gather and scatter operations and GPU cards, further drives the need for new libraries capable of leveraging the latest chip designs. In addition, dense and sparse linear algebra are often observed to play an important role in other numerical methods. For example, unstructured grids, which are one of the methods in Colella’s specification, could be interpreted as a sparse matrix problem; in the field of neural networks and deep learning, matrix multiplication is a crucial type of operation (Nielsen, 2016, chap. 2).

1.2 Solving linear systems

Linear systems arise in a wide variety of science and engineering applications from computational mechanics (fluids and structure) to computational nano-electronics (quantum mechanical simulations) to financial mathematics (random walks and geometric Brownian motion). In the Simulation-Based Engineering Lab (SBEL) (SBEL, 2006), various multi-body dynamics problems call for the solution of linear systems: the Lagrangian-Lagrangian framework for simulating fluid-solid interaction problems, Selective Laser Sintering (SLS) simulation, Primal-Dual Interior Point (PD-IP) approaches for solving the Cone Complementarity Problem (CCP) associated with friction and contact in many-body dynamics, the Absolute Nodal Coordinate Formulation (ANCF) approach for the simulation of large-scale flexible multibody dynamics, etc. In §8, we describe in detail how SaP::GPU has been leveraged in the ANCF approach for the simulation of large-scale flexible multibody dynamics. Many real-life applications give rise to very large sparse linear systems that can be reordered to produce either narrow banded systems or low-rank perturbations of narrow banded systems, with the systems being either dense or sparse within the band. For example, in finite-element analysis, the underlying sparse linear systems can be reordered to result in a banded system in which the width of the band is but a small fraction of the size of the overall problem. Specific techniques can be used to further reduce the bandwidth and create a robust preconditioner for an iterative solver. As discussed in §3.2, this forms the basic idea of how SaP::GPU solves sparse linear systems.

1.3 Contributions

a) By applying three split and parallelize (SaP) approaches, I implemented an efficient GPU solver for dense banded linear systems.

• Two of the three approaches, the decoupled and coupled approaches, are envisioned by the SPIKE algorithm detailed in Polizzi and Sameh (2006). Results in §7.1 demonstrate the relatively high efficiency of these approaches. The third approach, the Block Cyclic Reduction approach, was generalized from the Cyclic Reduction method for the fast solution of tridiagonal systems (Swarztrauber, 1977). Results in §7.1 demonstrate the high robustness of this approach.

• A set of sensitivity studies with respect to various attributes of a dense banded matrix was carried out to characterize the robustness and performance of these three approaches.

b) I studied, implemented and assessed two classes of CPU-GPU sparse matrix reordering strategies, namely diagonal boosting and bandwidth reduction, which reduce the solution of a general sparse system to that of a dense banded system.

• Both reordering strategies were compared against the corresponding routines of the Harwell Sparse Library (Hopper, 1973). In terms of reordering quality, the diagonal boosting and bandwidth reduction reordering methods implemented in SaP::GPU are close to HSL's MC64 and MC60, respectively, when the product of diagonal entries and the half-bandwidth, respectively, are used as comparison metrics (§7.2). In §7.2.1.1, the diagonal boosting in SaP::GPU is shown to enhance the degree of diagonal dominance of a matrix (defined in Eq. 3.11), which is critical for the quality of many preconditioning approaches and Krylov subspace methods. In terms of performance, the diagonal boosting reordering in SaP::GPU demonstrates higher performance than HSL's MC64, and the bandwidth reduction reordering in SaP::GPU demonstrates performance comparable to HSL's MC60.

• An extra stage of bandwidth reduction reordering, called "third-stage reordering" and applied after the system is split, is one of the major contributions of this work, as it produces a sizable reduction in solution run times.

c) Both the dense banded and general sparse linear system solvers were assessed and compared against commonly used CPU and GPU solvers.

• The efficiency of the dense banded linear system solver was assessed using synthetic banded matrices. Its performance was compared against that of Intel's Math Kernel Library (Wang et al., 2014) dense banded solver. A speedup by a factor of up to 3 was observed for more than half of the test cases.

• The general sparse linear system solver was assessed with 116 sparse matrices from either the University of Florida Sparse Matrix Collection (Davis and Hu, 2011) or SBEL (SBEL, 2006). SaP::GPU proved to be the most robust solver. Its efficiency is comparable to that of the CPU solvers and superior to that of NVIDIA's GPU solver.

d) An open source library of dense banded and general sparse linear system solvers on the GPU, SaP::GPU, was developed, documented and released (see SBEL (b) for details).

• This library includes sample usage examples.

• This library was used in a real multibody dynamics application at SBEL. In §8, methods preconditioned with SaP::GPU are reported to result in a significant decrease in simulation time compared to unpreconditioned methods.

2 Motivation: the Rise of Parallel Computing and Emergence of GPU Computing

This chapter motivates the proposed direction of research, namely the parallel solution of sparse linear systems using GPU acceleration. Having established in §1.2 the importance of solving linear systems, in this chapter we begin with a discussion of three walls to sequential computing (§2.1), justify the importance of a GPU-parallel solution of linear systems, and conclude with an argument pertaining to the lack of GPU or GPU-CPU hybrid linear system solvers (§2.2). In §2.3 and §2.4, we give a brief overview of the hardware on which SaP::GPU runs and of the Compute Unified Device Architecture (CUDA) created by NVIDIA.

2.1 Three efficiency walls in sequential computing

As pointed out in Manferdelli (2007), the programs we develop nowadays are up against three walls: the memory, instruction level parallelism, and power walls.

• Memory wall: memory access speed has not kept pace with advances in chip processing power. Memory accesses require times on the order of 100 ns, which translates to roughly 200 cycles for a clock rate of 2 GHz. Programs should therefore be designed to reduce data movement. For instance, a fused multiply–add operation on the GPU is carried out in one cycle, while a GDDR5 GPU global memory access can require between 400 and 600 cycles.

• Frequency / power wall: the chip has become the victim of its own success. The feature length has become so small that current leaks from the increasingly many transistors per unit area lead to a chain reaction that can trigger thermal runaway, a phenomenon that is only exacerbated by high clock frequencies. It comes as no surprise that new chip designs increasingly host what is called dark silicon – transistors that cannot be powered lest the chip be compromised (Esmaeilzadeh et al., 2011).

• Instruction Level Parallelism wall: the techniques that benefited sequential computing – instruction pipelining, superscalar execution, out-of-order execution, register renaming, branch prediction, and speculative execution – have been heavily mined and have to a very large extent exhausted their potential. Increasing the ability to predict for ILP calls for substantially more complex designs and more power to operate the units involved in deep-horizon prediction.

Against this backdrop, we posit that parallel computing is the only alternative capable of delivering in a reliable way steady performance improvements in Scientific Computing. Moreover, by leveraging multi/many–core processors and accelerators, parallel computing can both effectively hide memory latency and trade single-core ILP for simple single-core designs that lead to better power efficiency.

2.2 GPU computing and lack of GPU linear system solvers

GPU-accelerated computing is the use of a GPU in tandem with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications. Previously used in niche applications and by a small group of enthusiasts, GPU-accelerated computing has become ubiquitous and is playing an increasingly important role in Scientific Computing in data analysis, simulation, and visualization tasks. In fields as diverse as genomics, medical physics, astronomy, nuclear engineering, quantum chemistry, finance, and oil and gas exploration, GPU computing is attractive owing to its capacity to deliver speed-ups that in some cases can be as high as one order of magnitude compared to execution on traditional multi-core CPUs. This performance boost is typically associated with computational tasks where the application data is amenable to single instruction multiple data (SIMD) processing. Modern GPUs have a deep memory hierarchy that at the low end displays bandwidths in excess of 300 GB/s, while at the high end, for scratch pad memory and registers, it displays low latency and bandwidths that approach 1 TB/s. From a hardware architectural perspective, one of the salient features of GPUs is their ability to pack a very large number of rather unsophisticated (from a logical point of view) scalar processors. These are organized in streaming multiprocessors (SMs), each GPU being a collection of several such SMs that work independently while typically sharing less than 8 GB of main memory and 1 GB of L2 cache. Although employed on a small scale as early as the 1990s, in relative historical terms, leveraging General Purpose Graphics Processing Unit (GPGPU) computing in scientific and engineering applications has only recently gained widespread acceptance. As such, this is a time marked by a perceived lack of productivity tools to support domain-specific applications. Despite the importance and ubiquity of solving linear systems and the fact that GPU-accelerated computing is presently widely used, owing to how recently GPGPU computing rose to popularity there exist few solutions that leverage GPU computing, or hybrid GPU/CPU computing, for solving linear systems of equations. PARDISO (Schenk and Gärtner, 2004), a direct sparse linear solver for shared-memory and distributed-memory multiprocessors, is an efficient sparse linear solver on multi-core CPU systems but has no support for GPU computing. Thus one either cannot use GPU computing if the task at hand calls for the solution of linear systems, or has to move data back and forth between the CPU and GPU. This problem is exacerbated when we attempt to use, say, a Newton method that calls for the solution of a linear system at each stage of the iterative process. Similarly, if one requires the solution of a linear system at each time step of a dynamics simulation, this would call for a host-device transfer at each time step (in GPU computing, CPU memory is called host memory, and GPU memory is called device memory). The overhead becomes prohibitively taxing when the number of Newton iterations becomes large.

NVIDIA’s new CUDA 7.0 release includes a sparse linear solver – cuSOLVER, which provides a device sparse linear solver based on the QR factorization. It naïvely QR-factorizes the coefficient matrix and provides the solution for a single right-hand side; i.e., it cannot apply the same factorization to solve for multiple right-hand sides. Moreover, it saddles the programmer with the burden of pre-processing (for instance, reordering the matrix to reduce the number of fill-ins) and post-processing (the counterpart of the pre-reordering, this step maps the solution returned by the cuSOLVER routine back to obtain the true solution). Paralution (Lukarski and Trost, 2014) is a set of iterative solvers with various preconditioners that provides GPU support. It is still under development, and a non-trivial subset of its features (like the AMG preconditioner, useful in preconditioning certain symmetric positive definite systems) is implemented only on the CPU. SuperLU is a general purpose library for the direct solution of large, sparse, asymmetric systems of linear equations on high performance machines (Li, 2005). It supports three independent flavors: sequential, shared memory, and distributed memory. In October 2014, a recent version of SuperLU for distributed memory began to support GPUs (Sao et al., 2014). Finally, the MUltifrontal Massively Parallel sparse direct Solver, or MUMPS (Amestoy et al., 2000), is a package for solving systems of linear equations Ax = b, where the matrix can be asymmetric, symmetric positive definite, or general symmetric, on distributed memory computers. MUMPS implements a direct method based on a multifrontal approach which performs an LU factorization of the matrix A, and it has attractive out-of-core capabilities.

In this work, we compare SaP::GPU with four direct solvers, the results of which are reported in §7.3.3 and §7.3.4. It should be pointed out that our goal was to develop an open source GPU solution. In doing so, we have been constrained by memory size limitations and the inability of the GPU to handle logic-heavy operations that are not supported well on SIMD hardware tuned for fine grain parallelism.

2.3 Brief overview of relevant hardware

This section briefly introduces the major hardware SaP::GPU runs on. NVIDIA's Tesla K40 GPU (NVIDIA, 2013), launched in 2013, is introduced in §2.3.1, and Intel's Xeon processor is introduced in §2.3.2.

2.3.1 The Tesla K40 Graphics Processing Unit

The NVIDIA® Tesla® K40 Graphics Processing Unit (GPU) board is a PCI Express, dual-slot, full-height (4.376 inches by 10.5 inches) form factor computing module comprised of a single GK110 GPU (NVIDIA, 2013). Fig. 2.1 illustrates the architecture of the Tesla K40 GPU Active. A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Key features of its architecture that are leveraged in SaP::GPU and benefit SaP::GPU significantly include:

• The SMX processor architecture. Compared with the Stream Multiprocessor (SM) of the Fermi GF100 or GF104, each of Kepler GK110's Streaming Multiprocessors (SMX) features a significant increase in both the number of single precision and the number of double precision cores, as detailed in Table 2.1.

• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation. From Fermi to Kepler, the maximum number of 32-bit registers available to each thread increases from 63 to 255, which enables a more aggressive strategy of leveraging registers when tuning the kernels in SaP::GPU.

• Hardware support throughout the design to enable new programming model capabilities. For instance, Unified Memory Access is a feature enabled from CUDA 6 onward with minimum compute capability 3.2. Kepler GK110 enables CUDA Compute Capability (CCC, briefly introduced in §2.4) 3.5, which implies that Unified Memory Access is supported on Kepler GK110; this in turn simplifies the code of the reordering methods in SaP::GPU.

Figure 2.1: Tesla K40 GPU Active Block Diagram

                               Type of multipr.   # of multipr.   # of SP cores   # of DP cores
Tesla M2090 (w/ Fermi GF100)   SM                 16              512             256
Tesla K40 (w/ Kepler GK110)    SMX                15              2880            960

Table 2.1: Comparison of multiprocessor and the number of cores between two Tesla cards (commercial models). One is Tesla M2090 with Fermi GF100, and the other is Tesla K40 with Kepler GK110. In this table, “SP” and “DP” are for single precision and double precision, respectively.

2.3.2 Intel Xeon processor

The Xeon is a brand of x86 microprocessors designed and manufactured by Intel Corporation, targeted at the non-consumer workstation, server, and embedded system markets. Primary advantages of the Xeon CPUs, when compared to the majority of Intel's desktop-grade consumer CPUs, are their multi-socket capabilities, higher core counts, and support for ECC memory.

The brand name of the two Xeon CPUs used in this work is Xeon E5-2690 v2, which was released on Sept. 10, 2013. Each of the CPUs, with codename Ivy Bridge-EP, has ten cores running at 3.0 GHz. Each of the ten cores has a separate 256 KB L2 cache, and the ten cores share a 25 MB L3 cache (Intel, 2013). The RAM included is 64 GB DDR3 ECC Registered, which CPU solvers like PARDISO, SuperLU and MUMPS can easily tap into.

2.4 Brief overview of Compute Unified Device Architecture

CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the GPU. With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are finding broad-ranging uses for GPU computing with CUDA (NVIDIA). The CUDA platform is designed to work with programming languages such as C, C++ and Fortran. This accessibility makes it easier for specialists in parallel programming to utilize GPU resources, as opposed to previous API solutions, which required advanced skills in graphics programming. The CUDA Compute Capability (CCC) describes the features supported by CUDA hardware. In general, a larger CCC indicates more features in the architecture. The first CUDA-capable hardware, like the GeForce 8800 GTX, has a compute capability of 1.0, and the Tesla K40 GPU, on which SaP::GPU runs, enables compute capability 3.5. In addition, all GPU cards are backwards compatible – they can run code targeting their own compute capability, as well as all previous compute capabilities.

3 The “Split and Parallelize” Method

3.1 SaP for dense banded linear systems

Assume that the banded dense matrix $A \in \mathbb{R}^{N \times N}$ has half-bandwidth $K \ll N$. There are two approaches we have followed to build up SaP::GPU. The first approach is the SPIKE algorithm discussed in Sameh and Kuck (1978); Polizzi and Sameh (2006, 2007), and the second approach is a generalization of the cyclic reduction algorithm discussed in Swarztrauber (1977); Gander and Golub (1997). In both cases, we partition the banded matrix A into a block tridiagonal form with P diagonal blocks $A_i \in \mathbb{R}^{N_i \times N_i}$, where $\sum_{i=1}^{P} N_i = N$. For each partition i, let $B_i$, i = 1, ..., P−1, and $C_i$, i = 2, ..., P, be the super- and sub-diagonal coupling blocks, respectively – see Fig. 3.1. Each coupling block has dimension K × K for banded matrices with half-bandwidth $K = \max_{a_{ij} \neq 0} |i - j|$. The two approaches lead to two different flavors of SaP::GPU: the generalized cyclic reduction approach results in an exact solution, while the SPIKE algorithm leads to an approximate solution. These approaches are detailed in §3.1.2 and §3.1.1, respectively.
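To make the partitioning concrete, the following minimal NumPy sketch extracts the diagonal blocks A_i and the K × K coupling blocks B_i and C_i from a dense banded matrix. The function name split_banded and the even block-size choice are illustrative assumptions; this is not part of SaP::GPU.

import numpy as np

def split_banded(A, K, P):
    # Partition a banded matrix with half-bandwidth K into P diagonal blocks A_i,
    # plus the K-by-K coupling blocks B_i (super-diagonal) and C_i (sub-diagonal).
    # Hypothetical helper; block sizes are chosen as evenly as possible and each
    # block is assumed to be at least K rows tall.
    N = A.shape[0]
    sizes = [N // P + (1 if i < N % P else 0) for i in range(P)]
    assert min(sizes) >= K
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    A_blk, B_blk, C_blk = [], [], []
    for i in range(P):
        s, e = offsets[i], offsets[i + 1]
        A_blk.append(A[s:e, s:e].copy())            # diagonal block A_i
        if i < P - 1:                               # B_i couples partition i to i+1
            B_blk.append(A[e - K:e, e:e + K].copy())
        if i > 0:                                   # C_i couples partition i to i-1
            C_blk.append(A[s:s + K, s - K:s].copy())
    return A_blk, B_blk, C_blk

With this convention, C_blk[i-1] is the sub-diagonal block sitting in the block row of partition i, matching the indexing used in the equations below.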

3.1.1 SPIKE algorithm

As illustrated in Fig. 3.1, the banded matrix A is expressed as the product of a block diagonal matrix D and a so-called spike matrix S (Sameh and Kuck, 1978). The latter is made up of identity diagonal blocks of dimension Ni and off-diagonal spike blocks, each having K columns. Specifically,

A = DS , (3.1)

where $D = \mathrm{diag}(A_1, \dots, A_P)$. Assuming that the $A_i$ are non-singular, the so-called left and right spikes $W_i$ and $V_i$ associated with partition i, each of dimension $N_i \times K$, are given by

  0     A1V1 =  0  (3.2a)   B1   Ci 0     Ai [Wi | Vi] =  0 0  , i = 2,...,P − 1 (3.2b)   0 Bi   CP     AP WP =  0  . (3.2c)   0

Figure 3.1: Factorization of the matrix A with P = 3.

Solving the linear system Ax = b is thus reduced to solving

\begin{align}
D g &= b \tag{3.3} \\
S x &= g \,. \tag{3.4}
\end{align}

Since D is block-diagonal, solving for the modified right-hand side g from (3.3) is trivially parallelizable, as the work is split across P processes, each charged with solving $A_i g_i = b_i$, i = 1, ..., P. Note that the same decoupling is manifest in Eq. (3.2), and the work is also spread over P processes.

The remaining question is how to efficiently solve the linear system in (3.4). This problem can be reduced to one of smaller size, $\hat{S}\hat{x} = \hat{g}$. To that end, the spikes $V_i$ and $W_i$, as well as the modified right-hand side $g_i$ and the unknown vectors $x_i$ in (3.4), are partitioned into their top K rows, the middle $N_i - 2K$ rows, and the bottom K rows as:

\begin{align}
V_i = \begin{bmatrix} V_i^{(t)} \\ V_i' \\ V_i^{(b)} \end{bmatrix}, \qquad
W_i = \begin{bmatrix} W_i^{(t)} \\ W_i' \\ W_i^{(b)} \end{bmatrix}, \tag{3.5a} \\
g_i = \begin{bmatrix} g_i^{(t)} \\ g_i' \\ g_i^{(b)} \end{bmatrix}, \qquad
x_i = \begin{bmatrix} x_i^{(t)} \\ x_i' \\ x_i^{(b)} \end{bmatrix}. \tag{3.5b}
\end{align}

A block-tridiagonal reduced system is obtained by eliminating the middle partitions of the spike matrices, resulting in the linear system

\begin{equation}
\begin{bmatrix}
R_1 & M_1 & & & \\
 & \ddots & \ddots & & \\
 & N_i & R_i & M_i & \\
 & & \ddots & \ddots & \\
 & & & N_{P-1} & R_{P-1}
\end{bmatrix}
\begin{bmatrix} \hat{x}_1 \\ \vdots \\ \hat{x}_i \\ \vdots \\ \hat{x}_{P-1} \end{bmatrix}
=
\begin{bmatrix} \hat{g}_1 \\ \vdots \\ \hat{g}_i \\ \vdots \\ \hat{g}_{P-1} \end{bmatrix}, \tag{3.6}
\end{equation}

denoted $\hat{S}\hat{x} = \hat{g}$, of dimension $2K(P-1) \ll N$, where

\begin{align}
N_i &= \begin{bmatrix} W_i^{(b)} & 0 \\ 0 & 0 \end{bmatrix}, & i &= 2, \dots, P-1 \tag{3.7a} \\
R_i &= \begin{bmatrix} I_M & V_i^{(b)} \\ W_{i+1}^{(t)} & I_M \end{bmatrix}, & i &= 1, \dots, P-1 \tag{3.7b} \\
M_i &= \begin{bmatrix} 0 & 0 \\ 0 & V_{i+1}^{(t)} \end{bmatrix}, & i &= 1, \dots, P-2 \tag{3.7c}
\end{align}

and

\begin{equation}
\hat{x}_i = \begin{bmatrix} x_i^{(b)} \\ x_{i+1}^{(t)} \end{bmatrix}, \qquad
\hat{g}_i = \begin{bmatrix} g_i^{(b)} \\ g_{i+1}^{(t)} \end{bmatrix}, \qquad i = 1, \dots, P-1 \,. \tag{3.8}
\end{equation}

Two strategies are proposed in Polizzi and Sameh (2006) to solve (3.6): (i) an exact reduction; and (ii) an approximate reduction, which sets $N_i \equiv 0$ and $M_i \equiv 0$ and results in a block diagonal matrix $\hat{S}$. The solution approach adopted herein is based on (ii) and therefore each sub-system $R_i \hat{x}_i = \hat{g}_i$ is solved independently using the following steps:

\begin{align}
\text{Form } & \bar{R}_i = I_M - W_{i+1}^{(t)} V_i^{(b)} \,, \tag{3.9a} \\
\text{Solve } & \bar{R}_i \, \tilde{x}_{i+1}^{(t)} = g_{i+1}^{(t)} - W_{i+1}^{(t)} g_i^{(b)} \,, \tag{3.9b} \\
\text{Calculate } & \tilde{x}_i^{(b)} = g_i^{(b)} - V_i^{(b)} \tilde{x}_{i+1}^{(t)} \,. \tag{3.9c}
\end{align}

Note that a tilde was used to denote the approximate values $\tilde{x}_i^{(t)}$ and $\tilde{x}_i^{(b)}$ obtained upon dropping the $N_i$ and $M_i$ terms. An approximation of the solution of the original problem is finally obtained by solving independently and in parallel P systems using the available LU factorizations of the $A_i$ matrices:

\begin{align}
A_1 x_1 &= b_1 - \begin{bmatrix} 0 \\ 0 \\ B_1 \tilde{x}_2^{(t)} \end{bmatrix} \tag{3.10a} \\
A_i x_i &= b_i - \begin{bmatrix} C_i \tilde{x}_{i-1}^{(b)} \\ 0 \\ 0 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ B_i \tilde{x}_{i+1}^{(t)} \end{bmatrix}, \quad i = 2, \dots, P-1 \tag{3.10b} \\
A_P x_P &= b_P - \begin{bmatrix} C_P \tilde{x}_{P-1}^{(b)} \\ 0 \\ 0 \end{bmatrix} . \tag{3.10c}
\end{align}

Computational savings can be made by noting that if an LU factorization of the diagonal blocks $A_i$ is available, the bottom block of the right spike, i.e., $V_i^{(b)}$, can be obtained from (3.2a) using only the bottom $K \times K$ blocks of L and U. However, obtaining the top block of the left spike requires calculating the entire spike $W_i$. An effective alternative is to perform an additional UL factorization of $A_i$, in which case $W_i^{(t)}$ can be obtained using only the top $K \times K$ blocks of the new U and L.
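The following NumPy sketch strings Eqs. (3.2), (3.9) and (3.10) together into one application of the truncated-SPIKE (coupled) preconditioner. It uses dense solves instead of the banded LU/UL factorizations described above, assumes each block dimension N_i > 2K, and takes the block lists produced by a partitioning step such as the hypothetical split_banded sketch earlier in this section; the function name sap_c_apply is also hypothetical.

import numpy as np

def sap_c_apply(A_blk, B_blk, C_blk, b_blk, K):
    # One application of the truncated-SPIKE (SaP-C) preconditioner, written with
    # dense solves for clarity only; b_blk is b split conformally with the A_i.
    P = len(A_blk)
    # Step 1: g_i = A_i^{-1} b_i and the spike corner blocks (Eq. 3.2).
    g = [np.linalg.solve(A_blk[i], b_blk[i]) for i in range(P)]
    Vb = []  # bottom K x K block of the right spike V_i, i = 1..P-1
    Wt = []  # top    K x K block of the left  spike W_i, i = 2..P
    for i in range(P - 1):
        rhs = np.zeros((A_blk[i].shape[0], K)); rhs[-K:, :] = B_blk[i]
        Vb.append(np.linalg.solve(A_blk[i], rhs)[-K:, :])
    for i in range(1, P):
        rhs = np.zeros((A_blk[i].shape[0], K)); rhs[:K, :] = C_blk[i - 1]
        Wt.append(np.linalg.solve(A_blk[i], rhs)[:K, :])
    # Step 2: solve the decoupled reduced systems (Eq. 3.9).
    xt = [None] * P   # approximations of the top    K entries of each x_i
    xb = [None] * P   # approximations of the bottom K entries of each x_i
    for i in range(P - 1):
        Rbar = np.eye(K) - Wt[i] @ Vb[i]
        xt[i + 1] = np.linalg.solve(Rbar, g[i + 1][:K] - Wt[i] @ g[i][-K:])
        xb[i] = g[i][-K:] - Vb[i] @ xt[i + 1]
    # Step 3: recover the full solution partition by partition (Eq. 3.10).
    x = []
    for i in range(P):
        r = b_blk[i].copy()
        if i > 0:
            r[:K] -= C_blk[i - 1] @ xb[i - 1]
        if i < P - 1:
            r[-K:] -= B_blk[i] @ xt[i + 1]
        x.append(np.linalg.solve(A_blk[i], r))
    return np.concatenate(x)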

There are two remarks on these strategies. First, note that the decision to set Ni ≡ 0 and Mi ≡ 0 relegates the algorithm to preconditioner status. This decision is justified as follows: although the dimension of the reduced linear system in (3.6) is smaller than that of the original problem, its half-bandwidth is at least three times larger. The memory footprint of exactly solving (3.6) is large, thus limiting the size of problems that can be tackled on the GPU. Specifically, at each recursive step, additional memory that is required to store the new reduced matrix cannot be deallocated until the global solution is fully recovered. Next, it is manifest that the quality of the preconditioner is dictated by the entries in

Ni and Mi. For the sake of this discussion, assume that the matrix A is diagonally dominant with a degree of diagonal dominance d ≥ 1; i.e.,

\begin{equation}
|a_{ii}| \geq d \cdot \sum_{j \neq i} |a_{ij}| \,, \qquad \forall i = 1, \dots, N. \tag{3.11}
\end{equation}

When d ≥ 1, the elements of the left spikes $W_i$ decay in magnitude from top to bottom; those of the right spikes $V_i$ decay from bottom to top (Mikkelsen and Manguoglu, 2008). This decay, which is more pronounced the larger the degree of diagonal dominance, justifies the approximation $N_i \equiv 0$ and $M_i \equiv 0$. However, having A be diagonally dominant, although desirable, is not mandatory, as demonstrated by numerical experiments reported herein. Truncating when d < 1 will lead to a preconditioner of lesser quality. The sensitivity of the SaP preconditioner's quality with respect to d is studied in §7.1.2.
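For reference, the degree of diagonal dominance d of Eq. (3.11) is the largest d for which the inequality holds in every row; a small illustrative NumPy helper (not part of SaP::GPU) is shown below.

import numpy as np

def diag_dominance_degree(A):
    # Largest d such that |a_ii| >= d * sum_{j != i} |a_ij| holds for every row i,
    # i.e., the degree of diagonal dominance of Eq. (3.11).  Illustrative only.
    absA = np.abs(np.asarray(A, dtype=float))
    diag = np.diag(absA)
    off = absA.sum(axis=1) - diag
    ratios = np.where(off > 0.0, diag / np.maximum(off, np.finfo(float).tiny), np.inf)
    return float(ratios.min())

For the synthetic matrices used in the experiments of §7.1, d is a controlled input; for general sparse matrices, this is the quantity that the diagonal boosting reordering of §4.1 aims to improve.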

3.1.2 A generalization of the Cyclic Reduction: the Block Cyclic Reduction algorithm

Cyclic Reduction (CR) was proposed in Swarztrauber (1977) for the fast solution of tridiagonal systems, and it was later discussed in Gander and Golub (1997) as well. Considering the fact that the partitioned banded matrix A is in block tridiagonal form, and inspired by the high degree of parallelism introduced by CR, SaP::GPU implements the generalization of CR for block tridiagonal systems – Block Cyclic Reduction (BCR). BCR is a recursive method consisting of two phases, called deflation and inflation, respectively. As derived in Eq. 3.12, in the j-th step of the deflation phase, the system is successively reduced to one of smaller size containing only the even-indexed blocks, until a system with two blocks of unknowns is reached, which is solved directly:

\begin{align}
A_i^{(j+1)} &= A_{2i}^{(j)} - C_{2i}^{(j)} \bigl(A_{2i-1}^{(j)}\bigr)^{-1} B_{2i-1}^{(j)} - B_{2i}^{(j)} \bigl(A_{2i+1}^{(j)}\bigr)^{-1} C_{2i+1}^{(j)} \,, & i &= 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \tag{3.12a} \\
B_i^{(j+1)} &= -B_{2i}^{(j)} \bigl(A_{2i+1}^{(j)}\bigr)^{-1} B_{2i+1}^{(j)} \,, & i &= 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \tag{3.12b} \\
C_i^{(j+1)} &= -C_{2i}^{(j)} \bigl(A_{2i-1}^{(j)}\bigr)^{-1} C_{2i-1}^{(j)} \,, & i &= 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \tag{3.12c} \\
b_i^{(j+1)} &= b_{2i}^{(j)} - C_{2i}^{(j)} \bigl(A_{2i-1}^{(j)}\bigr)^{-1} b_{2i-1}^{(j)} - B_{2i}^{(j)} \bigl(A_{2i+1}^{(j)}\bigr)^{-1} b_{2i+1}^{(j)} \,, & i &= 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \tag{3.12d}
\end{align}

In each step of the inflation stage, all odd-indexed unknown blocks $x_i^{(j)}$ are recovered in parallel by substituting the already calculated $x_{i-1}^{(j)}$ and $x_{i+1}^{(j)}$:

\begin{equation}
x_i^{(j)} = \bigl(A_i^{(j)}\bigr)^{-1} b_i^{(j)} - \bigl(A_i^{(j)}\bigr)^{-1} B_i^{(j)} x_{i+1}^{(j)} - \bigl(A_i^{(j)}\bigr)^{-1} C_i^{(j)} x_{i-1}^{(j)} \,, \quad i = 1, \dots, \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \text{ and } i \text{ odd.} \tag{3.13}
\end{equation}

There are two remarks in order. First, Eqs. 3.12 and 3.13 are simplified for succinctness. For the sake of rigorousness, they are patched with

\begin{align}
A_i^{(j)} &= 0 \,, & i &\leq 0 \ \text{or}\ i > \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14a} \\
B_i^{(j)} &= 0 \,, & i &\leq 0 \ \text{or}\ i \geq \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14b} \\
C_i^{(j)} &= 0 \,, & i &\leq 1 \ \text{or}\ i > \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14c} \\
b_i^{(j)} &= 0 \,, & i &\leq 0 \ \text{or}\ i > \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14d} \\
x_i^{(j)} &= 0 \,, & i &\leq 0 \ \text{or}\ i > \left\lfloor \tfrac{P}{2^{j}} \right\rfloor \tag{3.14e}
\end{align}

and

\begin{equation}
x_i^{(j+1)} = x_{2i}^{(j)} \,, \quad i = 1, \dots, \left\lfloor \tfrac{P}{2^{j+1}} \right\rfloor \,. \tag{3.15}
\end{equation}

Eq. 3.14 indicates that all "invalid" blocks with index i out of bounds are treated as zero matrices. Eq. 3.15 describes the relation between the solutions at two adjacent steps.

 P  A(j) = L(j)U(j) , i = 1,..., and i is odd (3.16a) i i i 2j  P  G(j) = (U(j))−1(L(j))−1B(j) , i = 1,..., and i is odd (3.16b) i i i i 2j  P  H(j) = (U(j))−1(L(j))−1C(j) , i = 1,..., and i is odd (3.16c) i i i i 2j  P  A(j+1) = A(j) − C(j)G(j) − B(j)H(j) , i = 1,..., (3.16d) i 2i 2i 2i−1 2i 2i+1 2j+1  P  B(j+1) = −B(j)G(j) , i = 1,..., (3.16e) i 2i 2i+1 2j+1  P  C(j+1) = −C(j)H(j) , i = 1,..., (3.16f) i 2i 2i−1 2j+1  P  r(j) = (U(j))−1(L(j))−1b(j) , i = 1,..., and i is odd (3.16g) i i i i 2j  P  b(j+1) = b(j) − C(j)r(j) − B(j)r(j) , i = 1,..., (3.16h) i 2i 2i 2i−1 2i 2i+1 2j+1  P  x(j) = r(j) − G(j)x(j) − H(j)x(j) , i = 1,..., and i is odd (3.16i) i i i i−1 i i+1 2j

Eq. 3.16a is the LU factorization of the diagonal blocks with odd indices; Eqs. 3.16b and 3.16c are solutions with multiple right-hand sides (RHS); Eqs. 3.16d, 3.16e and 3.16f are matrix updates, which consist of matrix multiplication and matrix addition; Eq. 3.16g is the solution with a single RHS; Eqs. 3.16h and 3.16i are the RHS update and the solution recovery, respectively, both of which consist of matrix-vector multiplication and vector addition. Secondly, in order to increase the degree of parallelism and reduce the number of floating-point operations in Eqs. 3.16d, 3.16e and 3.16f, when the BCR algorithm is applied in SaP::GPU, the number of partitions P is fixed to the largest possible partition number $\lfloor N/K \rfloor$.
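The deflation/inflation recursion of Eqs. (3.12)–(3.13) can be prototyped in a few lines of NumPy. The sketch below operates on lists of dense blocks with the same B/C convention as the earlier split_banded sketch, recurses until a single block remains (whereas the text above stops at two blocks and the GPU implementation batches the work), and uses explicit inverses purely for clarity; the function name bcr_solve is hypothetical.

import numpy as np

def bcr_solve(A, B, C, b):
    # Block cyclic reduction for a block-tridiagonal system.  A[i] are the P
    # diagonal blocks, B[i] the super-diagonal block of block row i (i = 0..P-2),
    # C[i] the sub-diagonal block of block row i+1 (i = 0..P-2), b[i] the RHS blocks.
    P = len(A)
    if P == 1:
        return [np.linalg.solve(A[0], b[0])]
    Ainv = {i: np.linalg.inv(A[i]) for i in range(0, P, 2)}
    # Deflation: keep only the even-indexed (1-based) blocks, Eq. (3.12).
    A2, B2, C2, b2 = [], [], [], []
    for i in range(1, P, 2):
        Ai, bi = A[i].copy(), b[i].copy()
        bi -= C[i - 1] @ (Ainv[i - 1] @ b[i - 1])
        Ai -= C[i - 1] @ (Ainv[i - 1] @ B[i - 1])
        if i + 1 < P:
            bi -= B[i] @ (Ainv[i + 1] @ b[i + 1])
            Ai -= B[i] @ (Ainv[i + 1] @ C[i])
        A2.append(Ai); b2.append(bi)
        if i + 2 < P:
            B2.append(-B[i] @ (Ainv[i + 1] @ B[i + 1]))
            C2.append(-C[i + 1] @ (Ainv[i + 1] @ C[i]))
    x_even = bcr_solve(A2, B2, C2, b2)       # recurse on the deflated system
    # Inflation: recover the eliminated blocks by back-substitution, Eq. (3.13).
    x = [None] * P
    for k, i in enumerate(range(1, P, 2)):
        x[i] = x_even[k]
    for i in range(0, P, 2):
        r = b[i].copy()
        if i - 1 >= 0:
            r -= C[i - 1] @ x[i - 1]
        if i + 1 < P:
            r -= B[i] @ x[i + 1]
        x[i] = Ainv[i] @ r
    return x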

3.1.3 Nomenclature, solution strategies

Targeted for execution on the GPU, the methodology outlined in §3.1.1 and §3.1.2 becomes the foundation of a parallel implementation called herein “split and parallelize”

(SaP). The matrix A is split into the diagonal blocks $A_i$, which are processed in parallel. The code implementing this strategy is called SaP::GPU. Several flavors of SaP::GPU can be envisioned. At one end of the spectrum, the solution path implements the exact reduction, SaP-B, a strategy that is envisioned by the BCR algorithm. At the other end of the spectrum, SaP::GPU solves the block-diagonal linear system in (3.3) and, for preconditioning purposes, uses the approximation x ≈ g. In what follows, this will be called the decoupled approach, SaP-D. The middle ground is the approximate reduction, which sets $N_i \equiv 0$ and $M_i \equiv 0$. This will be called the coupled approach, SaP-C, owing to the coupling that occurs through the truncated spikes, i.e., $V_i^{(b)}$ and $W_{i+1}^{(t)}$. Neither the coupled nor the decoupled approach described in §3.1.1 qualifies as a direct solver. Moreover, since SaP::GPU applies pivot boosting when a zero pivot is encountered, even with BCR, which is claimed to produce exact solutions, the solutions run the risk of being only approximate. Consequently, SaP::GPU employs an outer Krylov subspace scheme to solve Ax = b. The solver uses BiCGStab(ℓ) (Sleijpen and Fokkema, 1993) and left-preconditioning, unless the matrix A is symmetric and positive definite, in which case the outer loop implements a conjugate gradient method (Saad, 2003b). Both schemes are detailed in §5. SaP::GPU is open source and available at SBEL (a,b).
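As a usage illustration, the decoupled (SaP-D) flavor amounts to a block-Jacobi-style preconditioner built from the independent LU factorizations of the A_i. The sketch below wires such a preconditioner into SciPy's plain BiCGStab, which stands in here for the BiCGStab(ℓ) outer iteration actually used by SaP::GPU; split_banded refers to the earlier hypothetical partitioning sketch, and sap_d_preconditioner is likewise an illustrative name.

import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, bicgstab

def sap_d_preconditioner(A_blk):
    # Decoupled (SaP-D) preconditioner: independent LU factorizations of the
    # diagonal blocks A_i, applied block by block as M^{-1} v.
    factors = [lu_factor(Ai) for Ai in A_blk]
    offsets = np.cumsum([0] + [Ai.shape[0] for Ai in A_blk])

    def apply(v):
        out = np.empty_like(v)
        for i, fac in enumerate(factors):
            s, e = offsets[i], offsets[i + 1]
            out[s:e] = lu_solve(fac, v[s:e])
        return out

    n = offsets[-1]
    return LinearOperator((n, n), matvec=apply)

# Example use (with the hypothetical split_banded helper from Section 3.1):
#   A_blk, _, _ = split_banded(A, K, P)
#   M = sap_d_preconditioner(A_blk)
#   x, info = bicgstab(A, b, M=M)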

3.1.4 A comparison of SPIKE algorithm and Block Cyclic Reduction algorithm

In this subsection, we estimate the number of synchronization steps (called steps for short) and the total number of floating point operations, in terms of the matrix dimension N, half-bandwidth K and number of partitions P, for the three preconditioning approaches, i.e. SaP-D, SaP-C and SaP-B, introduced in §3.1.1 and §3.1.2. This serves as a measure of how expensive each preconditioning approach is. Before the calculation, we summarize the number of steps and floating point operations of some basic computation categories; this keeps the statements in the rest of this subsection succinct. In all cases, the backdrop is that a banded matrix A with dimension N and half-bandwidth K is P-partitioned.

(i) K · P ≪ N for the decoupled/coupled approach; K · P = N for the BCR approach.

(ii) The dimension of each diagonal block is n ≡ N_i ≈ N/P, i = 1, ..., P.

(iii) LU-factorizing each diagonal block requires 2n steps, nK · (K + 1) floating point multiplications and nK^2 floating point additions. This is because it takes n iterations to factorize each diagonal block, and two steps with K · (K + 1) multiplications and K^2 additions are mandatory in each iteration.

(iv) Factorizing a general dense matrix of dimension ñ costs ñ steps with Σ_{i=1}^{ñ} i(i − 1) = ñ(ñ^2 − 1)/3 multiplications and Σ_{i=1}^{ñ} (i − 1)^2 = ñ(ñ − 1)(2ñ − 1)/6 additions.

(v) Assuming a block-wise implementation of matrix multiplication (with block size 16) is applied, the multiplication of two K-by-K matrices requires K/16 steps with K^3 multiplications and K^3 additions.

(vi) Assume T is a general triangular matrix with dimension n and all-one diagonal elements, u and z are vectors with length n, then solving z from equation Tz = u requires n steps with n(n − 1)/2 multiplications and n(n − 1)/2 additions.

(vii) Assume T is a banded triangular matrix with dimension n and half-bandwidth K and all-one diagonal elements, u and z are vectors with length n, then solving z from equation Tz = u requires n steps with nK multiplications and nK additions.

The decoupled approach. As the simplest preconditioning approach, which approximates the solution x by g in Eq. 3.3, the decoupled approach only needs to factorize the block diagonal matrix D in Eq. 3.3. Factorizing the matrix D requires independently and concurrently factorizing P banded matrices of dimension n, which is of category (iii) above. Thus it requires in total 2N/P steps, NK · (K + 1) multiplications and NK^2 additions.
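As a small illustration of this estimate, the helper below evaluates the decoupled-approach costs for given N, K and P; the struct and function names are made up for this sketch and are not part of SaP::GPU.

// Step and flop estimates of the decoupled approach (category (iii) applied
// concurrently to P blocks), as derived above.
struct Cost { long long steps, mults, adds; };

Cost decoupled_cost(long long N, long long K, long long P) {
    Cost c;
    c.steps = 2 * N / P;          // 2n steps per block, blocks processed concurrently
    c.mults = N * K * (K + 1);    // NK(K+1) multiplications in total
    c.adds  = N * K * K;          // NK^2 additions in total
    return c;
}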

The coupled approach. Besides factorizing the matrix D, in the coupled approach extra computations are required to calculate the spikes V_i and W_i in Eq. 3.5a and to factorize the reduced subblocks R_i in Eq. 3.7b. There are (P − 1) R_i's, all of which are of size 2K; thus factorizing the R_i's, which is of category (iv), costs 2K steps, together with (P − 1) · 2K · (4K^2 − 1)/3 multiplications and (P − 1) · K · (2K − 1) · (4K − 1)/3 additions. For calculating the spikes V_i and W_i, depending on whether the entire spikes V_i and W_i have to be calculated, the computational cost is:

(a) Computing entire spikes. In this case, V and W are solved from Eq. 3.2. There are two sets of banded triangular systems, and for each set, there are in total 2(P − 1)K right-hand-sides of size n, which are of category vii. Hence, computing entire spikes requires 2N/P steps with 4(P − 1)nK2 multiplications and 4(P − 1)nK2 additions.

(b) Partially computing spikes. In the scenario where only V_i^(b) and W_i^(t) in Eq. 3.5a are computed, a UL-factorization of D is also required, and we also need to solve two sets of general triangular systems; for each set, there are in total 2(P − 1)K right-hand-sides of size K. For the UL factorization, the costs are 2N/P steps, NK · (K + 1) multiplications and NK^2 additions; for the solution of V_i^(b) and W_i^(t), the costs are 2K steps with 2(P − 1)K^2(K − 1) multiplications and 2(P − 1)K^2(K − 1) additions.

To summarize, for the coupled approach, the overall computation costs are

2N/P + 2K + 2N/P = 4N/P + 2K

steps with

NK · (K + 1) + (P − 1) · 2K · (4K^2 − 1)/3 + 4(P − 1)nK^2 ≈ (5N − 4n)K^2

multiplications and

NK^2 + (P − 1) · K · (2K − 1) · (4K − 1)/3 + 4(P − 1)nK^2 ≈ (5N − 4n)K^2

additions if computing the entire spikes is necessary, or

2N/P + 2K + 2N/P + 2K = 4N/P + 4K

steps with

NK · (K + 1) + (P − 1) · 2K · (4K^2 − 1)/3 + NK · (K + 1) + 2(P − 1)K^2(K − 1) ≈ 2NK^2

multiplications and

NK^2 + (P − 1) · K · (2K − 1) · (4K − 1)/3 + NK^2 + 2(P − 1)K^2(K − 1) ≈ 2NK^2

additions if partial spikes are computed.

The BCR approach. Since BCR is a recursive approach, we first calculate the computation costs of one iteration, and then sum the computation costs over all iterations to obtain the overall computation cost. In iteration j, the computation cost of Eqs. 3.16 is summarized below.

• Diagonal subblock factorization (Eq. 3.16a). Similar to the decoupled approach, the costs are 2N/P = 2K steps, NK · (K + 1)/2^(j+1) ≈ NK^2/2^(j+1) multiplications and NK^2/2^(j+1) additions.

• Calculating coupled blocks (Eqs. 3.16b and 3.16c). Similar to partially computing the spikes in the coupled approach, the costs are 2K steps with (P − 1)K^2(K − 1)/2^j ≈ NK^2/2^j multiplications and NK^2/2^j additions.

• Assembling the next iteration (Eqs. 3.16d, 3.16e and 3.16f). These include P/2^(j−1) concurrent K-by-K matrix multiplications (of category (v)), followed by P/2^j concurrent K-by-K matrix additions. Hence, the total costs are K/16 steps with PK^3/2^(j−1) = NK^2/2^(j−1) multiplications and PK^3/2^(j−1) + PK^2/2^j ≈ NK^2/2^(j−1) additions.

In the j-th iteration, the costs are 65K/16 steps with 7NK^2/2^(j+1) multiplications and 7NK^2/2^(j+1) additions. Summing over all iterations, the overall costs are 65K · log_2 P/16 steps with 7NK^2 multiplications and 7NK^2 additions.

Remarks on the preconditioning strategies.

(1) In terms of the number of floating point operations, the cost of the coupled and BCR approaches is two to seven times that of the decoupled approach. For a well-conditioned system, the decoupled approach is therefore promising in terms of performance. The drawback of the decoupled approach is its relatively poor preconditioning quality, which results in a higher count of Krylov iterations than for the other approaches.

(2) For the coupled approach, in terms of the number of floating point operations, the cost of calculating the entire spikes is roughly 2.5 times that of partially calculating the spikes, given that the two methods share the same K. As computing the full spikes is typically combined with the application of the third-stage reordering, the costs of the coupled approach in this scenario can be rewritten as (5N − 4n)K·K_new multiplications and (5N − 4n)K·K_new additions. This sheds light upon when the third-stage reordering is helpful in reducing the cost of the coupled approach: when it can bring down the average half-bandwidth over all partitions by at least 60%.

(3) The BCR approach is attractive for two reasons: first, it produces the exact solution if no zero pivot is encountered during the factorization of the main diagonal blocks; second, the number of synchronization steps in the BCR approach is smaller than that of the decoupled and coupled approaches (O(K log P) vs. O(n)).

3.2 SaP for general sparse linear systems

The discussion focuses next on solving A_s x = b, where A_s ∈ R^(N×N) is assumed to be a sparse matrix. The salient attribute of the solution strategy is its fallback on the dense banded approach described in §3.1. Specifically, an aggressive row and column permutation process is employed to transform A_s into a banded matrix A that has a large d and small K. Although the reordered matrix will remain sparse within the band, it will be regarded as dense banded and LU- and/or UL- and/or Cholesky-factored accordingly. For matrices A_s that are either asymmetric or have low d, a first set of row permutations is applied as QA_s x = Qb, to either maximize the number of nonzeros on the diagonal (maximum traversal search) (Duff, 1981), or maximize the product of the absolute values of the diagonal entries (Duff and Koster, 1999, 2001). Both reordering algorithms are implemented using a depth first search with a look-ahead technique similar to the one in the Harwell Software Library (HSL) (Hopper, 1973).

While the purpose of the first reordering QA_s is to render the permuted matrix diagonally "heavy", a second reordering seeks to reduce K by using the traditional Cuthill-McKee (CM) algorithm (Cuthill and McKee, 1969). Since the diagonal entries should not be relocated, the second permutation is applied to the symmetric matrix QA_s + A_s^T Q^T. Following these two reorderings, the resulting matrix A is split to obtain A_1 through A_P. A third CM reordering is then applied independently to each A_i for further reduction of the bandwidth. While straightforward to implement in SaP::GPU-D, this third-stage reordering in SaP::GPU-C mandates computation of the entire spikes, an operation that can significantly increase the memory footprint and flop count of the numerical solution. Note that the third-stage reordering in SaP::GPU-C renders the UL factorization superfluous, since computing only the top of a spike is insufficient. These two reordering methodologies will be detailed in Chapter 4.

If A_i is diagonally dominant, the LU and/or UL factorization can be safely carried out without pivoting (Golub and Loan, 1980). Adopting the strategy used in PARDISO (Schenk et al., 2001), we always perform factorizations of the diagonal blocks A_i without pivoting but with pivot boosting. Specifically, if a pivot becomes smaller than a threshold value, it is boosted to a small, user-controlled value. This yields a factorization of a slightly perturbed diagonal block, L_i U_i = A_i + δA_i, where ‖δA_i‖ = O(u‖A‖) and u is the unit roundoff (Manguoglu et al., 2009).
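The pivot-boosting rule itself amounts to a one-line test during the factorization; the following C++ fragment is a hedged sketch of that rule (the function name and the way the threshold is passed are illustrative, not the SaP::GPU interface).

#include <cmath>

// If a pivot falls below the threshold, boost it to a small user-controlled
// value of the same sign; the factors then correspond to A_i + delta(A_i).
inline double boost_pivot(double pivot, double boost_value) {
    if (std::fabs(pivot) < boost_value)
        return (pivot < 0.0) ? -boost_value : boost_value;
    return pivot;
}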

3.3 SaP components and computational flow

The SaP::GPU dense banded linear system solver is relatively straightforward to implement. SaP::GPU branches based on the two approaches:

• For SaP::GPU-C, upon partitioning A into diagonal blocks A_i, each A_i is subject to an LU factorization that requires an amount of time T_LU. Next, in T_BC time, the coupling block matrices B_i and C_i are extracted on the GPU. The V_i and W_i spikes are subsequently computed in an operation that requires T_SPK time. Afterwards, in T_LUrdcd time, the spikes are truncated and the steps outlined in Eq. (3.9) are taken to produce the intermediary values x̃_i^(t) and x̃_i^(b).

• For SaP::GPU-B, in the j-th step, the LU factorization of the odd-indexed A_i^(j) requires an amount of time T_LU^(j). Then, in T_GH^(j) time, the intermediate odd-indexed block matrices G_i^(j) and H_i^(j) are computed. Afterwards, SaP::GPU consumes T_cmpNxt^(j) time to calculate A_i^(j+1), B_i^(j+1) and C_i^(j+1) for the (j + 1)-th step. Overall, the LU factorization requires T_LU = Σ_j T_LU^(j) time, computing G_i and H_i consumes T_GH = Σ_j T_GH^(j) time, and acquiring the reduced system of the next step takes T_cmpNxt = Σ_j T_cmpNxt^(j) time.

At this point, the pre-processing step is over and the factorizations of the A_i and, for SaP-C, the factorizations of the R̄_i are available for preconditioning during the iterative phase of the solution. The amount of time spent iterating is T_Kry, the iterative methods considered being BiCGStab(2) and conjugate gradient. The SaP::GPU sparse linear system solution is slightly more convoluted at the front end. A sequence of two permutations, DB requiring T_DB and CM requiring T_CM time, is carried out to increase the size of the diagonal elements and reduce the bandwidth.

An additional amount of time T_Drop might be spent to drop off-diagonal elements in order to decrease the bandwidth of the reordered A matrix. T_Dtransf is then used to transfer all matrix entries and reordering information to the GPU. An amount of time T_Asmbl is spent on the GPU in book-keeping required to turn the reordered sparse matrix into a dense banded matrix. It is worth noting that T_Dtransf does not include the overhead associated with moving data back and forth between the CPU and GPU during the reordering process; that overhead is already counted in T_DB and T_CM.

Figure 3.2: Computational flow for SaP::GPU: (a) the dense path SaP::GPU_d and (b) the sparse path SaP::GPU_s. SaP::GPU-D_d is the decoupled dense solver; SaP::GPU-C_d is the coupled dense solver; SaP::GPU-B_d is the dense solver applying the BCR algorithm. The analogue sparse solvers have an s subscript. The 3rd-stage reordering in square brackets is optional but typically leads to substantial reductions in time to solution.

The definition of the times to solution is tied to the computational flow in Fig. 3.2, where subscripts d and s are used to differentiate between the dense and sparse paths, respectively. For a sparse linear system solve that uses the coupled approach; i.e., SaP::GPU-C_s, the total time is T_TotSparse = T_PrepSp + T_TotDense, where T_PrepSp = T_DB + T_CM + T_Dtransf + T_Drop + T_Asmbl and T_TotDense = T_LU + T_BC + T_SPK + T_LUrdcd + T_Kry. For SaP::GPU-D_d, owing to the decoupled nature of the solution, T_TotDense = T_LU + T_Kry, where T_LU includes a CM process that reduces the bandwidth of each A_i. For SaP::GPU-B_d, T_TotDense = T_LU + T_GH + T_cmpNxt, where T_LU, T_GH and T_cmpNxt are all summations over the steps.

4 reordering methodologies

In this chapter, we present the details of two reordering algorithms: one to bring elements large in absolute value to the diagonal, and one to bring off-diagonal elements closer to the diagonal (i.e., bandwidth reduction). The former reordering algorithm does not have a dedicated name; herein, we call it the Diagonal Boosting (DB) algorithm. The latter is named after its inventors, E. Cuthill and J. McKee (Cuthill and McKee, 1969), as the Cuthill-McKee (CM) algorithm. An optional third-stage CM reordering, aimed at further bandwidth reduction in each partition separately, is described in §4.3.

4.1 Diagonal Boosting

The reordering algorithm for bringing large elements to the diagonal was first proposed by Duff (Duff, 1981). That work aimed at maximizing the number of nonzeros on the diagonal. Several variants of this algorithm were later proposed in Duff and Koster (1999, 2001) for different target goals (e.g., maximizing the absolute value of the bottleneck, the sum, the product, or other metrics of the diagonal entries). Regardless of the variant, these algorithms essentially solve a minimum bipartite perfect matching problem. Among these variants, we chose the one which attempts to maximize the product of the absolute values of the diagonal entries. This serves as a heuristic to boost the diagonal dominance, which reduces the likelihood of encountering a zero pivot during factorization. The following sketches out this variant of the reordering algorithm. Given a matrix {a_ij}_(n×n), the problem is to find a column permutation σ such that Π_{i=1}^{n} |a_{iσ_i}| is maximized. Denoting a_i = max_j |a_ij|, and noting that a_i is invariant under σ, we are to minimize

log Π_{i=1}^{n} (a_i / |a_{iσ_i}|) = Σ_{i=1}^{n} log (a_i / |a_{iσ_i}|) = Σ_{i=1}^{n} (log a_i − log |a_{iσ_i}|)      (4.1)

The reordering problem is reduced to minimum bipartite perfect matching in the following way: given a bipartite graph G_C = (V_R, V_C, E), the weight c_ij of the edge between node i ∈ V_R and node j ∈ V_C is defined as

c_ij = log a_i − log |a_ij|   (a_ij ≠ 0)
c_ij = ∞                      (a_ij = 0)      (4.2)

If we are able to find a minimum bipartite perfect matching σ such that Σ_i c_{iσ_i} is minimized, then, by the reduction above, Π_{i=1}^{n} |a_{iσ_i}| is maximized, which is exactly the goal of the reordering algorithm. Numerous studies have been done on minimum bipartite perfect matching (Carpaneto and Toth, 1980; Kuhn, 1955; Burkhard and Derigs, 1980; Carraresi and Sodini, 1986; Derigs and Metz, 1986; Jonker and Volgenant, 1987). A key concept in these works is the shortest augmenting path P. An augmenting path P for a matching M starting at a certain row node i (or column node j) is said to be shortest if, for any augmenting path P' starting at row node i (or column node j) for M, the weight of M ⊕ P is no greater than that of M ⊕ P'. The alternating length of path P is defined as

l(P) := c(M ⊕ P) − c(M) = c(P\M) − c(M ∩ P) ,      (4.3)

where c(M) is the weight of matching M, defined as

c(M) = Σ_{(i,j)∈M} c_ij      (4.4)

Furthermore, works by Kuhn (1955) and Edmonds and Karp (1972) state that M has minimum weight if and only if there exist dual variables u_i for i ∈ V_R and v_j for j ∈ V_C such that

u_i + v_j ≤ c_ij   ((i, j) ∈ E)
u_i + v_j = c_ij   ((i, j) ∈ M)      (4.5)

Define the reduced weight of edge (i, j) as

c̄_ij = c_ij − u_i − v_j      (4.6)

With the observation that any "valid" matching M satisfies c̄(M) = Σ_{(i,j)∈M} c̄_ij = 0, we conclude that, for any augmenting path P in the reduced graph Ḡ_C, l̄(P) = c̄(P\M) ≥ 0. Therefore, finding the shortest augmenting path in G_C is equivalent to finding the shortest path in Ḡ_C, whose edge weights are all non-negative.
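As a concrete illustration of Eq. 4.2, the sketch below builds the bipartite edge weights from a CSR matrix; the CSR container and function names are assumptions made for this example and are not SaP::GPU data structures. Stored zeros receive weight ∞, matching Eq. 4.2.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Csr {
    int n;
    std::vector<int>    row_ptr, col_idx;   // CSR sparsity pattern
    std::vector<double> val;                // nonzero values a_ij
};

// Returns one weight c_ij = log(a_i) - log|a_ij| per stored entry, where
// a_i = max_j |a_ij| is the row-wise maximum absolute value.
std::vector<double> db_edge_weights(const Csr& A) {
    std::vector<double> c(A.val.size());
    for (int i = 0; i < A.n; ++i) {
        double ai = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            ai = std::max(ai, std::fabs(A.val[k]));
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            c[k] = (A.val[k] != 0.0) ? std::log(ai) - std::log(std::fabs(A.val[k]))
                                     : std::numeric_limits<double>::infinity();
    }
    return c;
}

The matching is then computed on this weighted bipartite graph via shortest augmenting paths over the reduced weights c̄_ij, as detailed in §6.2.1.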

4.2 Bandwidth Reduction

Symmetric reordering operations that reduce the bandwidth of a sparse matrix are a crucial element for a linear solver whose performance depends on both the dimension and the bandwidth of a matrix. Whether QA_s is sparse or not, there are P − 1 pairs of always dense spikes, each of dimension N_i × K. They need to be stored unless one employs an LU and UL factorization of A_i to retain only the appropriate bottom and top components. Large K values pose memory challenges; i.e., storage and data movement, that limit the size of the problems that can be tackled. Moreover, the spikes need to be computed by solving multiple right-hand side linear systems with A_i coefficient matrices. There are 2K such systems for each of the P − 1 pairs of spikes. Evidently, a low K is highly desirable. However, finding the lowest half-bandwidth K by symmetrically reordering a sparse matrix is NP-hard. The CM reordering provides simple and oftentimes effective heuristics to tackle this problem. Moreover, as the CM reordering yields symmetric permutations, it will not displace the "heavy" diagonal terms obtained during the DB step. However, to obtain a symmetric permutation, one has to start with a symmetric matrix. To this end, unless A is already symmetric and does not call for a DB step (which is the case, for instance, when A is symmetric positive definite), the matrix pattern passed over for CM reordering is the pattern of A + A^T. Given a symmetric n × n matrix with nnz non-zero entries, we visualize it as the adjacency matrix of an undirected graph G. In the following discussion, G is assumed to be connected, i.e. G contains only one connected component. For a graph consisting of two or more connected components, those components are handled by CM separately. Given an undirected graph G, the basic CM first picks a random node and adds it to the work list. The algorithm then repeatedly extracts a node, sorts its not-yet-added neighbors by non-descending vertex degree, and appends them to the work list, until all vertices have been added to and removed from the work list once. In other words, CM is essentially a BFS where neighboring vertices are visited in order from lowest to highest vertex degree. Algorithm 1 sketches out the process of the basic CM.

Algorithm 1 The process of the basic CM
Initialize an empty worklist L
Pick a random node n_s and add it to L
while L is not empty do
  Extract the first node n_0 from L
  Find all neighboring nodes of n_0 which have not been added to L yet; suppose these nodes form a list L'
  Sort L' by ascending vertex degree, and extend L with L'
end while

Note that Alg. 1 is called the basic CM in the sense that it differs from the actual CM in two regards:

1) The selection of the starting node n_s. With different selections of the starting node, the height of the BFS tree will differ, which affects the quality of CM (i.e., the half-bandwidth of the resulting matrix). Heuristically speaking, the taller the BFS tree is, the fewer nodes two adjacent levels contain, and the lower the half-bandwidth will be (note that nodes at two levels which are not adjacent are not connected, and thus do not influence the half-bandwidth). Thus the selection of the starting node is crucial. In the actual CM, several starting nodes are attempted, each resulting in a tentative reordering. The reordering which leads to the minimum half-bandwidth is selected as the result of CM.

2) For the sake of efficiency, the actual CM never sorts a node list L'. Instead, in a pre-processing stage, it performs a global sort so that, for any node, its adjacent nodes are stored in order of ascending vertex degree.

In §6.2.2, the actual CM reordering method is detailed; a minimal sketch of the basic CM of Alg. 1 is given below.
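For concreteness, the following C++ sketch implements the basic CM of Alg. 1 on an adjacency-list graph and returns the visit order (i.e., the permutation). The graph layout and the single, user-supplied starting node are assumptions of this sketch; the actual CM tries several starting nodes and pre-sorts neighbors globally, as discussed above and in §6.2.2.

#include <algorithm>
#include <queue>
#include <vector>

std::vector<int> basic_cm(const std::vector<std::vector<int>>& adj, int start) {
    const int n = static_cast<int>(adj.size());
    std::vector<int>  order;                 // CM visit order (new labels)
    std::vector<bool> added(n, false);
    std::queue<int>   worklist;

    worklist.push(start);
    added[start] = true;
    while (!worklist.empty()) {
        int u = worklist.front();
        worklist.pop();
        order.push_back(u);

        // Collect not-yet-added neighbors and sort them by ascending vertex degree.
        std::vector<int> next;
        for (int v : adj[u])
            if (!added[v]) { next.push_back(v); added[v] = true; }
        std::sort(next.begin(), next.end(),
                  [&](int a, int b) { return adj[a].size() < adj[b].size(); });
        for (int v : next) worklist.push(v);
    }
    return order;   // assumes a connected graph, as in §4.2
}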

4.3 Third-stage reordering

The DB–CM reordering sequence yields diagonally-heavy matrices of smaller bandwidth. The band itself, however, can be very sparse. The purpose of the third-stage CM reordering is to further reduce the bandwidth within each A_i and reduce the sparsity within the band. Consider, for instance, the matrix ANCF88950 that comes from structural dynamics (Serban et al., 2015). It has 513,900 nonzeros, N = 88,950, and an average of 5.78 non-zero elements per row. After DB–CM reordering with no drop-off, the resulting banded matrix has a half-bandwidth K = 205. The band itself is very sparse, with a fill-in of only 0.7% within the band, as illustrated in Fig. 4.1. In its default solution, SaP::GPU constructs a block banded matrix where each diagonal block A_i, obtained after the initial DB–CM reorderings, is allowed to have a different bandwidth. This is achieved using another CM pass, independently and in parallel for each A_i. Applying this strategy to ANCF88950, using P = 16 partitions, the half-bandwidth is reduced for all partitions to values no higher than K = 141, while the fill-in within the band becomes approximately 3%. Note that this third-stage reordering does nothing to reduce the column-width of the spikes. However, it helps in two respects: a smaller memory footprint for the LU/UL factors, and less factorization effort. These are important side effects, since the LU/UL GPU factorization is currently done in-core, considering A_i to be dense within the band.

Figure 4.1: Sparsity pattern for the original (left) and reordered (right) ANCF matrix of dimension N = 88,950. The sparsity pattern for a 2000 × 2000 diagonal block of the reordered matrix is displayed in the inset.

5 krylov subspace method

With respect to the "influence on the development and practice of science and engineering in the 20th century", Krylov space methods are considered one of the ten most important classes of numerical methods (Dongarra and Sullivan, 2000). Sparse linear systems of equations can be solved either by so-called sparse direct solvers, which are clever variations of Gauss elimination, or by iterative methods. In the last thirty years, sparse direct solvers have been tuned to perfection: on the one hand by finding strategies for permuting equations and unknowns to guarantee a stable LU decomposition and small fill-in in the triangular factors, and on the other hand by organizing the computation so that optimal use is made of the hardware, which nowadays often consists of parallel computers whose architecture favors block operations with data that are locally stored or cached (Gutknecht, 2007). The iterative methods that are today applied for solving large-scale linear systems are mostly preconditioned Krylov (sub)space solvers. Krylov subspace methods have become a very useful and popular tool for solving large sets of linear and nonlinear equations, and large eigenvalue problems. One of the reasons for their popularity is their simplicity and generality. Moreover, one of their properties, namely flexible preconditioning, allows using not only incomplete decompositions like ILU(p) or IC(p), but also any Krylov space solver as a preconditioner. These combinations, called inner-outer iteration methods, may be very effective (Simoncini and Szyld, 2002). These methods have been increasingly accepted as efficient and reliable alternatives to the more expensive methods (for instance direct methods) that are usually employed for solving dense problems. This trend is likely to accelerate as models are becoming more complex and give rise to larger and larger matrix problems (Saad, 1989). The main idea of Krylov subspace methods is to generate a basis of the Krylov subspace K_m (m is defined as the dimension of the Krylov subspace):

K_m(A, r_0) = span{r_0, A r_0, A^2 r_0, ..., A^(m−1) r_0}      (5.1)

and to generate a sequence of approximations, x_m, of the following form:

x_m = x_0 + α_0 b + α_1 Ab + α_2 A^2 b + ... + α_(m−1) A^(m−1) b (≈ A^(−1) b)      (5.2)

The dimension of the Krylov subspace is increased at each iteration. Therefore, the approximate solution x_m becomes increasingly more accurate and is expected to be exact after at most n iterations, when the Krylov subspace spans all of R^n. In practice, a solution of acceptable accuracy is often reached in far fewer than n iterations. For more details regarding Krylov subspace methods, interested readers may refer to Saad (1989); Gutknecht (2007). The remainder of §5 is organized as follows: in §5.1, we introduce the concept of preconditioning; in §5.2 and §5.3, we describe the two specific Krylov subspace methods implemented in SaP::GPU: CG for Symmetric Positive Definite (SPD) systems, and BiCGStab with its generalization for general systems.

5.1 A brief introduction to preconditioning

Consider the linear system Ax = b to be solved by some iterative solver. In general, when using an iterative Krylov method to solve the system, it is necessary to precondition the system to get acceptable efficiency (reduce the number of Krylov iterations). Preconditioning can be done on:

• the left: (M^(−1) A) x = M^(−1) b

• the right: (A M^(−1)) (M x) = b

• both sides: (M_L^(−1) A M_R^(−1)) (M_R x) = M_L^(−1) b

Any of the preconditioning methods will give the same solution as the original system so long as the preconditioner matrix M (for left- or right-preconditioning) or the product M_L M_R (for preconditioning on both sides) is nonsingular. Typically there is a trade-off in the choice of preconditioner. In order to actually improve the efficiency of the Krylov method, the matrix M (for left- or right-preconditioning) or the product M_L M_R (for preconditioning on both sides) should in some sense approximate the matrix A. Yet at the same time, in order to be overall cost-effective, the matrix M or the matrices M_L and M_R should be reasonably efficient to evaluate (factorize) and solve; i.e., one should be able to cheaply solve systems of the form Mz = w in order to obtain products of the form z = M^(−1) w. Therefore, a preconditioner is chosen that is a reasonable approximation of the matrix A, in an attempt to achieve a minimal number of linear iterations while keeping M^(−1) (for left- or right-preconditioning) or M_R^(−1) M_L^(−1) (for preconditioning on both sides) as simple as possible. It is worth noting that the preconditioner matrix M (for left- and right-preconditioning) or M_R and M_L (for preconditioning on both sides) need not be formed and/or stored explicitly. Only the action of applying the preconditioner solve operation M^(−1) or M_R^(−1) M_L^(−1) to a given vector needs to be computed in iterative methods.
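The practical consequence of the discussion above is that an iterative method only ever needs the action z = M^(−1) w. The following C++ sketch shows one way such a preconditioner interface could look, together with a trivial diagonal (Jacobi) example; the interface and the example are illustrative assumptions and not the SaP::GPU API.

#include <cstddef>
#include <utility>
#include <vector>

// A preconditioner exposed purely through its solve action M z = w.
struct Preconditioner {
    virtual ~Preconditioner() = default;
    virtual void solve(const std::vector<double>& w, std::vector<double>& z) const = 0;
};

// Example: a diagonal (Jacobi) preconditioner, for which applying M^{-1} is a scaling.
struct JacobiPreconditioner : Preconditioner {
    std::vector<double> diag;   // diagonal of A
    explicit JacobiPreconditioner(std::vector<double> d) : diag(std::move(d)) {}
    void solve(const std::vector<double>& w, std::vector<double>& z) const override {
        z.resize(w.size());
        for (std::size_t i = 0; i < w.size(); ++i)
            z[i] = w[i] / diag[i];
    }
};

In SaP::GPU, the role of M is played by the split banded factorization of §3.1, and the solve action consists of groups of forward eliminations and backward substitutions.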

5.2 Conjugate Gradient method

The Conjugate Gradient (CG) method was proposed by M. Hestenes and E. Stiefel in 1952 (Hestenes and Stiefel, 1952). It requires the matrix A to be square, symmetric, and positive definite. As a mature, well-studied method, its description can also be found in Avriel (2003); Saad (2003a); Golub and Van Loan (2012). J. Shewchuk also gave an interesting introduction to CG (Shewchuk, 1994). SaP::GPU applies the CG solver when a linear system is specified to be symmetric positive-definite. In the remainder of this section, some background on the conjugate gradient method is provided (§5.2.1), and then the CG algorithm is detailed, highlighting its computationally intensive components (§5.2.2).

5.2.1 Backgrounds

Two vectors u and v are orthogonal if they satisfy

(u, v) ≡ u^T v = 0

Two vectors u and v are conjugate with respect to an N × N matrix A if they satisfy

(u, v)_A ≡ u^T A v = 0

The 2-norm of a vector u is defined as

||u||_2 = √(u^T u)

Suppose that

P = {p_0, ..., p_(N−1)}

is a set of N mutually conjugate vectors; then P forms a basis for R^N, and the solution x_true of the system Ax = b can be expressed in this basis:

x_true = Σ_{i=0}^{N−1} α_i p_i      (5.3)

Based on Eq. 5.3, some calculations can be done so that each α_k can be expressed as:

α_k = (p_k, b) / (p_k, p_k)_A ,   k = 0, ..., N − 1      (5.4)

If the vectors p_k are selected carefully, not all p_k ∈ P are required to obtain a good approximation of x_true. Suppose that we make a series of guesses of the solution: x_0, x_1, ..., x_k, ... Without loss of generality, suppose the initial guess is x_0 = 0 (otherwise, the system Ax = b − Ax_0 is considered). Let r_k be the residual at the k-th step:

r_k = b − A x_k      (5.5)

The conjugate gradient method attempts to construct p_k from r_k in resemblance to the Gram-Schmidt process (Gagliano and Qufmica):

p_k = r_k                                              (k = 0)
p_k = r_k − Σ_{i<k} [(p_i, r_k)_A / (p_i, p_i)_A] p_i   (k > 0)      (5.6)

Then the (k + 1)-th guess of the solution is constructed based on x_k:

x_(k+1) = x_k + α_k p_k      (5.7)

and the (k + 1)-th residual can be expressed as:

r_(k+1) = r_k − α_k A p_k      (5.8)

As stated, the algorithm requires storage of all previous search directions and residual vectors, as well as many matrix-vector multiplications, and thus can be computationally expensive. However, a closer analysis of the algorithm (Hestenes and Stiefel, 1952) shows that only r_k, p_k, and x_k are needed to construct r_(k+1), p_(k+1), and x_(k+1) (see Chapter 5 in Hestenes and Stiefel (1952) for details):

p_(k+1) = r_(k+1) + β_k p_k ,      (5.9)

where β_k ≡ (r_(k+1)^T r_(k+1)) / (r_k^T r_k). In addition, α_k can be expressed as

α_k = (r_k, r_k) / (p_k, p_k)_A      (5.10)

5.2.2 Further discussion

In SaP::GPU, the preconditioned version of the conjugate gradient method is applied with SaP-D or SaP-C (in the current version of SaP::GPU, SaP-B cannot be combined with CG). Denote the preconditioner as M. The preconditioned conjugate gradient method is detailed in Alg. 2.

Algorithm 2 Conjugate gradient method

1: r_0 ← b
2: z_0 ← M^(−1) r_0
3: p_0 ← z_0
4: k ← 0
5: loop
6:   α_k ← (z_k^T r_k) / (p_k^T A p_k)
7:   x_(k+1) ← x_k + α_k p_k
8:   r_(k+1) ← r_k − α_k A p_k
9:   if ||r_(k+1)||_2 ≤ c · ||r_0||_2 (c is called the relative tolerance, which is user-specified) then
10:    Exit loop
11:  end if
12:  z_(k+1) ← M^(−1) r_(k+1)
13:  β_k ← (z_(k+1)^T r_(k+1)) / (z_k^T r_k)
14:  p_(k+1) ← z_(k+1) + β_k p_k
15:  k ← k + 1
16: end loop
17: The result is x_(k+1)

The only computationally intensive operation is the preconditioning solve, which refers to Lines 2 and 12 in Alg. 2. In each preconditioning solve stage, one or two groups of forward elimination/backward substitution (one for SaP-D and two for SaP-C) are required, which correspond to two or four operations of type (vii) stated in §3.1.4. Other operations in Alg. 2 include the Sparse Matrix-Vector multiplication (SpMV, Line 8 in Alg. 2) and vector operations (for instance, the vector "Alpha X Plus Y" (AXPY) in Line 7 and the vector copy in Line 3).
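To tie Alg. 2 to these building blocks, here is a compact C++ sketch of the preconditioned CG loop in which the SpMV and the preconditioning solve are supplied as callbacks; the function signatures are assumptions made for illustration (SaP::GPU executes these operations as GPU kernels).

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

Vec pcg(const std::function<void(const Vec&, Vec&)>& spmv,     // y = A x      (SpMV)
        const std::function<void(const Vec&, Vec&)>& precond,  // z = M^{-1} r (preconditioning solve)
        const Vec& b, double rel_tol, int max_iter) {
    const std::size_t n = b.size();
    auto dot = [](const Vec& u, const Vec& v) {
        double s = 0.0;
        for (std::size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
        return s;
    };
    Vec x(n, 0.0), r = b, z(n), p(n), Ap(n);
    precond(r, z);                          // Line 2 of Alg. 2
    p = z;
    const double stop = rel_tol * std::sqrt(dot(r, r));
    double zr = dot(z, r);
    for (int k = 0; k < max_iter; ++k) {
        spmv(p, Ap);                        // Line 8 uses A p_k
        const double alpha = zr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }  // AXPY
        if (std::sqrt(dot(r, r)) <= stop) break;                                              // Line 9
        precond(r, z);                      // Line 12
        const double zr_new = dot(z, r);
        const double beta = zr_new / zr;    // Line 13
        zr = zr_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];                        // Line 14
    }
    return x;
}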

5.3 Biconjugate Gradient Stabilized method and its generalization

The conjugate gradient method is not suitable for asymmetric systems because the residual vectors cannot be made orthogonal with short recurrences (Voevodin, 1983; Faber and Manteuffel, 1984). Unlike the conjugate gradient method, the biconjugate gradient method (BiCG) does not require the matrix A to be symmetric (or, strictly speaking, self-adjoint, meaning that A is Hermitian), but instead one needs to perform multiplications by the conjugate transpose of A (Fletcher, 1976). BiCGStab, short for BiConjugate Gradient Stabilized, was proposed by Van der Vorst (Van der Vorst, 1992). It is a variant of BiCG and has faster and smoother convergence than the original BiCG as well as other variants such as the conjugate gradient squared method (CGS) (Sonneveld, 1989). However, as numerical experiments confirm, if A has nonreal eigenvalues with an imaginary part that is large relative to the real part, the method might stagnate or even break down (Sleijpen and Fokkema, 1993). In 1993, G. Sleijpen and D. Fokkema proposed a family of methods generalized from BiCGStab: BiCGStab(ℓ). Instead of forming a 1-step Minimal Residual polynomial, or MR-polynomial (i.e., a polynomial of degree one), as BiCGStab does, BiCGStab(ℓ) forms an ℓ-degree MR-polynomial p_k after each ℓ-th step, where k is the iteration number; in the intermediate steps j = kℓ + i, i = 1, ..., ℓ − 1, simple factors t^i are used, and the p_k are reconstructed from these powers. In this way, certain near-breakdowns in the process can be avoided (Sleijpen and Fokkema, 1993).

In SaP::GPU, we use BiCGStab(2); i.e., ℓ is set to two. BiCGStab(2), preconditioned with SaP-C, SaP-D or SaP-B, is applied to the solution of general systems, which are either asymmetric or indefinite. The preconditioned BiCGStab(ℓ) method is listed in Alg. 3. Similar to the conjugate gradient method, the preconditioning solve (Lines 20, 25, 28, 53 and 63) is the most time-consuming operation. The profiling results in §5.4 demonstrate this fact.

Algorithm 3 BiCGStab(ℓ) method

1: ρ_0 ← 1
2: α ← 0
3: ω ← 1
4: r_0 ← b
5: r̂_0 ← r_0
6: u_0 ← 0
7: k ← 0
8: loop
9:   ũ_(k,0) ← u_k
10:  r̃_(k,0) ← r_k
11:  x̃_k ← x_k
12:  ρ_0 ← −ω · ρ_0
13:  for all j ← 0, ..., ℓ − 1 do
14:    ρ_1 ← r̃_(k,j)^T r̂_0
15:    β ← α · ρ_1 / ρ_0
16:    ρ_0 ← ρ_1
17:    for all i ← 0, ..., j do
18:      ũ_(k,i) ← r̃_(k,i) − β · ũ_(k,i)
19:    end for
20:    ũ_(k,j+1) ← A M^(−1) ũ_(k,j)
21:    α ← ρ_0 / (ũ_(k,j+1)^T r̂_0)
22:    for all i ← 0, ..., j do
23:      r̃_(k,i) ← r̃_(k,i) − α · ũ_(k,i+1)
24:    end for
25:    r̃_(k,j+1) ← A M^(−1) r̃_(k,j)
26:    x̃ ← x̃ + α ũ_(k,0)
27:    if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
28:      r̃_(k,0) ← b − A M^(−1) x̃
29:      if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
30:        Exit loop
31:      end if
32:    end if
33:  end for
34:  for all j ← 1, ..., ℓ do
35:    for all i ← 1, ..., j − 1 do
36:      τ_(i,j) ← (r̃_(k,j)^T r̃_(k,i)) / (r̃_(k,j)^T r̃_(k,j))
37:      r̃_(k,j) ← r̃_(k,j) − τ_(i,j) r̃_(k,i)
38:    end for
39:    γ'_j ← (r̃_(k,j)^T r̃_(k,0)) / (r̃_(k,j)^T r̃_(k,j))
40:  end for
41:  γ_ℓ ← γ'_ℓ
42:  ω ← γ_ℓ
43:  for all j ← ℓ − 1, ..., 2 do
44:    γ_j ← γ'_j
45:    for all i ← j + 1, ..., ℓ − 1 do
46:      γ''_j ← γ''_j + τ_(i,j) · γ_(i+1)
47:    end for
48:  end for
49:  x̃ ← x̃ + γ_1 r̃_(k,0)
50:  r̃_(k,0) ← r̃_(k,0) − γ'_ℓ r̃_(k,ℓ)
51:  ũ_(k,0) ← ũ_(k,0) − γ_ℓ ũ_(k,ℓ)
52:  if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
53:    r̃_(k,0) ← b − A M^(−1) x̃
54:    if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
55:      Exit loop
56:    end if
57:  end if
58:  for all j ← 1, ..., ℓ do
59:    ũ_(k,0) ← ũ_(k,0) − γ_j ũ_(k,j)
60:    x̃ ← x̃ + γ''_j r̃_(k,j)
61:    r̃_(k,0) ← r̃_(k,0) − γ'_j r̃_(k,j)
62:    if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
63:      r̃_(k,0) ← b − A M^(−1) x̃
64:      if ||r̃_(k,0)||_2 ≤ c · ||r_0||_2 then
65:        Exit loop
66:      end if
67:    end if
68:  end for
69:  u_(k+1) ← ũ_(k,0)
70:  x_(k+1) ← x̃
71:  r_(k+1) ← r̃_(k,0)
72:  k ← k + 1
73: end loop

5.4 A profiling of Krylov subspace method

This subsection reports some profiling results of the BiCGStab(ℓ) method. For the purpose of profiling, the same set of 60 tests introduced in §7.1.3 was used, and the approaches selected in this study are SaP::GPU-C and the single-GPU version of SaP::GPU-D. We denote by R_α^(Γ−Kry) the percentage of time spent in stage Γ in test case α, considering 100% to be the time spent in the Krylov subspace component. For succinctness, when discussing a statistical observation over all tests, the subscript α is oftentimes omitted.

The results illustrated in Figs. 5.1, 5.2 and 5.3 demonstrate that, in SaP::GPU, the preconditioning solve dominates, or is at least a major component of, the total time spent in the Krylov subspace method, especially for systems that are large in size (either large in dimension N or large in half-bandwidth K). For SaP::GPU-D, the time spent in components other than the preconditioning solve and SpMV is somewhat higher than anticipated. This fact, however, gives the user of SaP::GPU a direction in selecting among the three preconditioners: SaP::GPU-D is relatively cheap compared with SaP::GPU-C and SaP::GPU-B (in that the percentage of time spent in the preconditioning solve of SaP::GPU-D is comparable to that spent in vector operations). If a system is "likely" to converge fast with SaP::GPU-D (see §7.1.2 for details on when a system is "likely" to converge fast with SaP::GPU-D), then this is the approach that should be attempted first.

Figure 5.1: Profiling results of R^(Γ−Kry) in (a) SaP::GPU-C and (b) SaP::GPU-D, obtained from a set of 60 banded linear systems.

Figure 5.2: Profiling results of R^(Γ−Kry) in (a) SaP::GPU-C and (b) SaP::GPU-D, obtained from the subset of the 60 banded linear systems which are considered "large" in terms of dimension N.

Figure 5.3: Profiling results of R^(Γ−Kry) in (a) SaP::GPU-C and (b) SaP::GPU-D, obtained from the subset of the 60 banded linear systems which are considered "large" in terms of half-bandwidth K.

6 implementation details

6.1 Dense banded matrix factorization details

This section provides several implementation details of the SaP::GPU dense banded solver, regarding how the P partitions A_i are determined (§6.1.1), how the banded matrix A is stored (§6.1.2), how the LU/UL/Cholesky steps are implemented on the GPU (§6.1.3), and several optimization techniques (§6.1.4).

6.1.1 Number of partitions and partition size

As discussed in §3.1.2, in SaP::GPU-B the number of partitions P is fixed to ⌊N/K⌋; thus the discussion in this subsection applies only to SaP::GPU-D and SaP::GPU-C. The selection of P must strike a balance between two conflicting requirements. On the one hand, a large P is attractive given that the LU/UL/Cholesky factorizations of A_i for i = 1, ..., P can be done independently and simultaneously. On the other hand, for SaP::GPU-C, approximations are made to evaluate the spikes corresponding to the coupling of the diagonal blocks A_i and A_(i+1); for SaP::GPU-D, all coupling blocks are completely ignored. In either case, this negatively impacts the quality of the resulting preconditioner, and a high P could therefore lead to poor preconditioning and an increase in the number of iterations to convergence. In the current implementation, no attempt is made to automate this selection and some experimentation is required. The experimentation is detailed in §7.1.1.

Given a P value, the sizes of the diagonal blocks A_i are selected to achieve load balancing: the first P_r partitions are of size ⌊N/P⌋ + 1, while the remaining ones are of size ⌊N/P⌋, where N = P⌊N/P⌋ + P_r.
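The partition sizing just described is a one-liner in code; the following sketch (the function name is illustrative) returns the block sizes for given N and P.

#include <vector>

// First P_r partitions get size floor(N/P) + 1, the rest floor(N/P),
// where N = P * floor(N/P) + P_r.
std::vector<int> partition_sizes(int N, int P) {
    std::vector<int> sizes(P, N / P);
    const int Pr = N - P * (N / P);
    for (int i = 0; i < Pr; ++i)
        sizes[i] += 1;
    return sizes;
}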

6.1.2 Matrix storage

For the dense banded matrices A_i, we adopt a "tall and thin" storage in column-major order. Eq. 6.1 illustrates the method we are using for a symmetric positive definite matrix (Eq. 6.1a) and a general matrix (Eq. 6.1b), both with N = 8 and K = 2. For the symmetric positive definite scenario, where the matrices A_i are Cholesky-factorized, all diagonal elements are stored in the first column; otherwise, the A_i's are LU/UL-factorized and all diagonal elements are stored in the (K + 1)-st column. In either scenario, the rest of the elements are correspondingly distributed columnwise. This strategy groups the operands of the LU/UL/Cholesky factorizations and allows coalesced memory accesses that can fully leverage the GPU's bandwidth.

    a11 a21 a31
    a22 a32 a42
    a33 a43 a53
    a44 a54 a64
    a55 a65 a75      (6.1a)
    a66 a76 a86
    a77 a87  *
    a88  *   *

     *   *  a11 a21 a31
     *  a12 a22 a32 a42
    a13 a23 a33 a43 a53
    a24 a34 a44 a54 a64
    a35 a45 a55 a65 a75      (6.1b)
    a46 a56 a66 a76 a86
    a57 a67 a77 a87  *
    a68 a78 a88  *   *
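One consistent reading of the layouts in Eq. 6.1 gives the following 0-based index mapping, shown here as a C++ sketch for illustration (these helpers are not the SaP::GPU accessors): entry a_rc of a general matrix lands in storage column r − c + K, and in the SPD case in column r − c, with the N-by-(2K+1) (respectively N-by-(K+1)) storage traversed in column-major order.

// General (LU/UL) layout of Eq. 6.1b: N x (2K+1) storage, diagonal in column K.
inline long banded_index_general(long r, long c, long N, long K) {
    return (r - c + K) * N + c;        // valid only for |r - c| <= K
}

// SPD (Cholesky) layout of Eq. 6.1a: N x (K+1) storage, diagonal in column 0;
// only the lower triangle (r >= c) is stored.
inline long banded_index_spd(long r, long c, long N) {
    return (r - c) * N + c;            // valid only for 0 <= r - c <= K
}

For N = 8 and K = 2, the diagonal entry a_44 (r = c = 3) maps to index 2·8 + 3 = 19, i.e. the fourth row of the middle column in Eq. 6.1b; entries with the same offset from the diagonal are contiguous, which is what enables the coalesced accesses mentioned above.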

6.1.3 LU/UL/Cholesky factorization

The solution strategy calls for an LU and an optional UL factorization (for the general matrix case) or a Cholesky factorization (for the symmetric positive definite case) of each dense banded diagonal block A_i. The implementation requires a certain level of synchronization since, for each A_i, the factorization, forward elimination, and backward substitution phases each consist of N_i − 1 dependent steps that need to be choreographed. One aggravating factor is the GPU's lack of native, low-overhead support for synchronization between threads running in different blocks. The established GPU strategy for inter-block synchronization is "exit and launch a new kernel". This guarantees synchronization at the GPU-grid level at the cost of non-negligible overhead. In a trade-off between minimizing the overhead of kernel launches and maximizing the occupancy of the GPU, we established two execution paths: one for K < 64, the second one for larger bandwidths. As a side note, the threshold value of 64 was selected through numerical experimentation over a variety of problems and is controlled by the number of threads that can be organized in a block in CUDA (NVIDIA, 2015). For K < 64, the code was designed to reduce the kernel launch count. Instead of having N_i − 1 kernel launches, each completing a step of the factorization A_i = L_i U_i by updating entries in a (K + 1) × (K + 1) window of elements, a single kernel is launched to factor A_i. It uses min(K^2, 32^2) threads per block and relies on low-overhead stream-multiprocessor synchronization support within the block, without any need for global synchronization. In a so-called window-sliding method, at each step of the factorization; i.e., during the process of computing column entries in L and row entries of U, each thread updates a fixed number of A_i entries. On current GPU hardware, this fixed number is between 1 and 4. Once all threads in the block complete their work, they are synchronized and the (K + 1) × (K + 1) window slides down by one row and to the right by one column. The value 4 is explained as follows. Assume that K = 63. Then, the sliding window has size 64 × 64. Since the two-dimensional GPU thread block size is 1024 = 32 × 32, each thread will handle four entries of the window of focus.
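One plausible mapping consistent with the numbers quoted above, sketched purely for illustration (the helper name and the exact formula are assumptions, not the SaP::GPU kernel code), is that each thread covers ⌈(K+1)^2 / min(K^2, 32^2)⌉ window entries:

#include <algorithm>

inline int entries_per_thread(int K) {
    const int window  = (K + 1) * (K + 1);          // sliding window size
    const int threads = std::min(K * K, 32 * 32);   // threads per block
    return (window + threads - 1) / threads;        // e.g. K = 63: 4096/1024 = 4
}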

For K ≥ 64, SaP uses multiple blocks of threads to update L and U entries. On the upside, there are more threads working on the window of focus. On the downside, there is overhead associated with leaving and reentering the kernel, a process that has the side effect of flushing the shared memory and registers. The window is larger than K × K, and it slides at a stride of eight; i.e., moves down by eight rows and to the right by eight columns upon exiting and reentering the LU factorization kernel.

6.1.4 Optimization techniques

Use of registers and shared memory. If the user decides to employ a third-stage reordering, the coupling sub-blocks B_i and C_i are used to compute the entire spikes in a scheme that renders a UL factorization superfluous. Then, B_i and C_i are each first partitioned into sub-blocks of dimension L × K, where L is at most 20. Each forward/backward sweep to get the spikes is unrolled, and in each iteration of the new loop one entire sub-block, rather than a vector of length K, is calculated. To this end, the corresponding elements in the matrix A_i are pre-fetched into shared memory and the entries of the sub-block are preloaded into registers. This strategy, in which all operations to calculate the spikes draw on registers and shared memory, leads to a 50% to 70% improvement in performance when compared with an alternative that calculates the spike elements in a loop without leveraging the low latency/high bandwidth of the GPU register file and shared memory.
Mixed Precision Strategy. In the hope of leveraging both the high efficiency of single-precision floating-point calculation and the high accuracy of double-precision floating-point calculation, the solution uses a mixed-precision implementation by falling back on single precision for the preconditioner and switching to double precision arithmetic in the outer BiCGStab(2) calculations. Fig. 6.1 illustrates the speedup of SaP::GPU-C with the mixed-precision strategy over SaP::GPU-C applying a pure double-precision strategy on a battery of banded tests. The results indicate that the mixed-precision strategy yields a median 15.9% speedup over an approach where all calculations are performed in double precision. There is a caveat to the mixed-precision strategy: the accuracy of the solution is limited by the use of a single-precision preconditioner. Hence, the benefit of the mixed-precision strategy vanishes when the accuracy of the solution is critical. In the current version of SaP::GPU, the mixed-precision strategy has been deprecated in consideration of solution accuracy.

Figure 6.1: Speedup of SaP::GPU-C with the mixed-precision strategy over SaP::GPU-C with the pure double-precision strategy on a battery of banded tests. The set of test matrices and solver configuration are the same as the ones described in §7.1.3, with the exception that the relative tolerance was set to 10^(−6), instead of 10^(−9).

6.2 Implementation details for reordering methodologies

6.2.1 DB reordering implementation details

SaP::GPU organizes the DB algorithm into four stages, DB-S1 through DB-S4. Due to differences in the nature and degree of parallelism of these stages, DB implements a hybrid strategy; namely, it relies on GPU computing for DB-S1 and DB-S4 and on CPU computing for DB-S2 and DB-S3. A thorough discussion of the implementation is provided in Li et al. (2014). Therein, a solution that kept the entire DB implementation on the GPU was discussed and deemed decisively slower than the hybrid strategy adopted here.

DB-S1: form bipartite graph. This stage assembles a matrix that mirrors the structure of the original sparse matrix. The sparsity pattern of the input matrix is maintained and the values of its nonzero entries are modified according to Eq. (4.2). The stage is highly parallel and involves: (1) calculating for each row of the original matrix the max absolute value, and (2) updating each value to form the weighted bipartite graph.

DB-S2: find initial partial match. This stage is not mandatory, but the availability of an initial partial match as a starting point for the next stage was found to considerably reduce the running time of the overall algorithm (Li et al., 2014). Like in Carpaneto and Toth (1980), after setting u_i = min_j c_ij and v_j = min_i (c_ij − u_i), we try to match as many pairs of nodes as possible. The matched nodes (i, j) should satisfy u_i + v_j = c_ij. This yields augmenting paths of length one. This stage, which was implemented to execute in parallel, was compute intensive as it had to resolve scenarios where multiple column nodes would match the same row node. A CPU parallel implementation was found to be more suitable owing to intense integer arithmetic and control flow overhead.

DB-S3: find perfect match. Finding matches in a bipartite graph G_C is equivalent to finding the shortest paths in an associated reduced graph. Omitting some of the details, the shortest path problem is tackled using Dijkstra's algorithm (Dijkstra, 1959), which is applied to all nodes i that are unmatched in the initial partial match obtained in DB-S2. This ensures that all row nodes, and therefore all column nodes, are eventually matched. The theoretical complexity of this stage is O(n · (m + n) · log n), where n and m are the dimension and number of nonzeros in the input matrix, respectively. However, thanks to the preprocessing in DB-S2, actual run times for finding a perfect match are acceptable in all situations and this stage is the DB bottleneck only for about half of the matrices tested (Li et al., 2014).

DB-S4: extract permutation and scaling factors. The matrix permutation can be obtained directly from the resulting perfect match: if the row node i was matched to the column node j then rows (or columns) i and j must be permuted. Optionally, scaling factors can be calculated and applied to rows and columns in order to bring the matrix to a so-called I-matrix form; i.e., a matrix with 1 or −1 on the diagonal and off-diagonal elements of absolute value less than 1, see Olschowka and Neumaier (1996). This stage is highly parallelizable and amenable to GPU computing.

6.2.2 CM reordering implementation details

The unordered CM algorithm, which draws on an approach described in Karantasis et al. (2014), is separated into three stages, CM-S1 through CM-S3. A high quality reordering calls for several BFS iterations, which are called herein “CM iterations”. Just like the DB implementation, the CM solution (i) is hybrid – the overall algorithm leverages both CPU and GPU computing; and, (ii) it uses CPU–GPU unified memory, a recent CUDA feature (Negrut et al., 2014), to provide for a simple and transparent memory management process. The latter feature allows the CUDA runtime to transparently manage the CPU–GPU data migration as the computation switches back and forth between the CPU and GPU. Since no explicit, programmer initiated, data transfer is required, the code is cleaner and more concise. CM-S1: pre-processing. The first stage is implemented on the GPU to accomplish two objectives. First, it produces the data structure that is worked upon. As the 55 input matrix A is not guaranteed to be symmetric, the sparse matrix structure for (A + AT )/2 is produced in anticipation of the subsequent two stages of the algorithm. Second, in order to avoid repetitively sorting the neighbors of a given node, the nodes with the same row indices are pre-sorted by ascending vertex degree of column index.

CM-S2: perform standard BFS. After experimenting with the implementation, the strategy adopted starts from several nodes and performs in parallel what would be a traditional CM-S2 & CM-S3 combo. The alternative of considering one node only, namely the node with the smallest vertex degree, yields a second-level BFS tree with fewer nodes; eventually, the resulting BFS tree will likely be "tall and thin". Starting from several nodes and completing the reordering process for each of them increases the likelihood of avoiding a "bad" initial node. In practical terms, owing to the use of parallel computing, this strategy yields smaller bandwidths at a modest increase in computational overhead.

For each starting node, a standard BFS pass yields the levels of all nodes in the BFS tree. Since the order of nodes at the same level is not critical in this stage, parallel computing can help by concurrently visiting the neighbors of all nodes at the previous level. We use an outer loop to iterate over the levels, and in each iteration, depending on the number of nodes n_p added in the previous iteration, we decide whether the iteration is executed on the GPU or the CPU. In particular, a kernel handles the iteration on the GPU only if n_p ≥ 10. There are two notable implementation details. First, the CM iterations are executed sequentially. After each iteration, we select the node at the last level of the previous BFS tree with the lowest vertex degree which has not yet been selected. If no such node exists; i.e., all nodes at the last level have been selected as starting nodes in previous iterations, a random node which has not been considered is selected. Second, the CM iterations terminate either when the height of the BFS tree does not increase, or when the maximum number of nodes over all levels does not decrease compared with the candidate optimum found so far. This strategy is proposed in Padua Haiek (1980), with the caveat that we only consider the leaf with the minimum degree. From practical experience, these heuristics lead to an algorithm that for most matrices terminates within three CM iterations.

CM-S3: reorder nodes. The previous stage determines the level of each node. Roughly speaking, nodes are ordered in ascending order of level, from level 0 up to the maximum level m_l, and memory space can be pre-allocated for the nodes at each level. Parallel computing is leveraged by observing that the order of nodes at level l depends only on the order of nodes at level l − 1. To that end, a pair of read/write pointers is set for each level and, except for level 0, the read/write pointers of each level initially point to the starting position of that level's pre-allocated space. We say a thread "works on" level l if it reads nodes at level l and writes their neighbors that are at level l + 1. Thus the execution thread working on level l will read and modify the read pointer of level l and the write pointer of level l + 1, and it will only read the write pointer of level l. Once the thread finishes reading all nodes at level l, it moves on to another level; otherwise, it repeatedly checks whether the thread working on level l − 1 has written nodes which it has not yet processed, by checking if the read pointer at level l lags the write pointer at level l. If so, the thread working on level l processes these nodes, i.e., writes their neighbors at level l + 1, and goes back to checking whether it has finished processing; otherwise, it spins and waits for the thread working on the previous level. Note that the parallelism in CM-S3 is rather coarse-grained and proved to be better suited for execution on the CPU.

6.3 CPU-GPU data transfer in SaP::GPU

Data transfer between the CPU and GPU in SaP::GPU is illustrated in Fig. 6.2. For the dense banded solver SaP::GPU_d, if the banded matrix A is already on the GPU, no data transfer happens until the solution of the system. Data transfer happens only in the sparse solver SaP::GPU_s, which includes the hybrid reordering approaches DB and CM, and the matrix assembly process on the CPU. In Fig. 6.2, three types of data, in order of increasing transfer cost, are illustrated: a circle represents reordering information; an uncolored square represents the sparsity pattern of a sparse matrix without values; and a blue-colored square represents a sparse matrix.

Figure 6.2: Data transfer in SaP::GPU.

Next, we detail the data transfer and processing in SaP::GPU_s's solution of a sparse system. The original matrix A_s lies on the GPU.

First, the matrix A_s is reordered with the DB process: information on the matrix is extracted within DB-S1 on the GPU; then the whole matrix A_s is transferred to the CPU for the DB-S2 and DB-S3 processes; next, the DB reordering information is transferred back to the GPU for the DB-S4 process; meanwhile, the reordered matrix Q_DB A_s stays on the CPU for the CM process.

Next, the matrix Q_DB A_s is reordered with the CM process: the sparsity pattern of Q_DB A_s is transferred to the GPU for symmetrizing and pre-sorting in CM-S1. This is followed by multiple iterations of BFS starting from different starting nodes (CM-S2). In this CM stage, the decision of where to search the neighboring nodes of the current-level nodes is made based on the number of current-level nodes (denoted as N_CL): if N_CL is no less than a threshold (which is tuned to be 10 in SaP::GPU), the neighboring node search is done on the GPU; otherwise the CPU is responsible for the neighboring node search for the current-level nodes. Data transfer happens when the current-level nodes do not reside on the hardware responsible for the neighboring node search. CM-S2 continues until the final BFS tree structure is obtained, which is followed by reordering the nodes based on the BFS tree structure (CM-S3), which is processed on the CPU. At the end of CM-S3, the reordering information Q_CM is transferred to the GPU. Meanwhile, the CM reordering is applied to the matrix Q_DB A_s and we obtain the reordered matrix Q_CM Q_DB A_s Q_CM^T. We denote this matrix A_DB−CM.

After the DB and CM reordering processes, elements of the matrix A_DB−CM are optionally dropped off. The user either specifies a target half-bandwidth K_tar (the value of K_tar should be less than the half-bandwidth of the matrix A_DB−CM), in which case SaP::GPU directly drops off all elements outside the band determined by K_tar (we denote this matrix A_DB−CM^(K_tar)); or the user specifies a drop-off rate d_frac, in which case SaP::GPU continues dropping off elements farthest from the diagonal until the half-bandwidth is K, which satisfies the following conditions: ||A_DB−CM^K||_1 ≥ (1 − d_frac) · ||A_DB−CM||_1 and ||A_DB−CM^(K−1)||_1 < (1 − d_frac) · ||A_DB−CM||_1, where || · ||_1 is the 1-norm operator.
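The drop-off criterion above can be made explicit with a small sketch that scans candidate half-bandwidths until the in-band 1-norm reaches the (1 − d_frac) threshold; the CSR container and helper names are illustrative, and a quadratic scan is used only for clarity (SaP::GPU drops elements in place on the reordered matrix).

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

struct CsrMatrix {
    int n;
    std::vector<int>    row_ptr, col_idx;
    std::vector<double> val;
};

// 1-norm (maximum column sum of absolute values) of the part of A inside band K.
static double banded_one_norm(const CsrMatrix& A, int K) {
    std::vector<double> colsum(A.n, 0.0);
    for (int i = 0; i < A.n; ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            if (std::abs(i - A.col_idx[k]) <= K)
                colsum[A.col_idx[k]] += std::fabs(A.val[k]);
    return *std::max_element(colsum.begin(), colsum.end());
}

// Smallest K whose in-band part retains at least (1 - dfrac) of ||A||_1.
int dropoff_bandwidth(const CsrMatrix& A, double dfrac) {
    const double target = (1.0 - dfrac) * banded_one_norm(A, A.n);
    for (int K = 0; K < A.n; ++K)
        if (banded_one_norm(A, K) >= target)
            return K;
    return A.n - 1;
}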

After the optional drop-off, the matrix is partitioned, and the main diagonal blocks A_i along with the off-diagonal blocks B_i, C_i are obtained. A third-stage reordering P_i is applied to each main diagonal block A_i, resulting in P_i A_i P_i^T. After the third-stage reordering, the blocks P_i A_i P_i^T, B_i and C_i are sent back to the GPU, where they are assembled to obtain the preconditioning matrix M. Then the solution of the general sparse system falls back on the solution of a dense banded system, and this process takes place on the GPU. It is worth noting that the data transfer in the third-stage reordering is as involved as that in CM, in that the third-stage reordering essentially applies bandwidth reduction to each main diagonal block A_i; Fig. 6.2 omits this data transfer for the sake of simplicity.

6.4 Multi-GPU support

Multi-GPU support of SaP::GPU is motivated by the following three points:

• Inherent nature of “Split and Parallelize” methods. For any of the three SaP approaches, the splitting creates several intensive but independent calculations so that the work is divided across different processes: the solution of the P independent systems Dg = b shown in Eq. 3.3 for SaP-D and SaP-C; the calculation of the spikes in Eq. 3.2 and the solution of the P−1 independent systems in Eq. 3.6 for SaP-C; the assembly of the diagonal and off-diagonal blocks A^(j+1), B^(j+1) and C^(j+1) in Eqs. 3.16d, 3.16e and 3.16f for SaP-B; etc. The essential “independence” of these calculations implies that no communication is required during these computations. This enables the possibility of leveraging multiple GPUs to accomplish the work.

• A single GPU might not effectively leverage the parallelism of a particular algorithm. The best example for this point is the LU factorization of each diagonal block A_i when solving the system Dg = b shown in Eq. 3.3. Recall from §6.1.3 that when the half-bandwidth K is no smaller than 64, a CUDA kernel with multiple blocks of threads is launched to update the L and U entries of each diagonal block A_i. Assume K has the value 200 and the system is split into 20 partitions. Further assume that a CUDA kernel with blocks of 512 threads is launched to update the L and U entries. Then the number of blocks in this kernel is as large as 1563 (= 200^2 × 20/512), exceeding the total number of double-precision cores in Kepler GK110, which is 960; the arithmetic is reproduced in the sketch following this list. As a consequence, peak performance cannot be attained even if all cores are fully occupied. Hence, splitting this work over multiple GPUs and leveraging more double-precision cores helps achieve better performance.

• Limited GPU memory can be addressed with multi-GPU support. The GDDR5 memory size of a Tesla K40 is 12 GB. Consider a dense banded matrix A of double-precision floating point values with N = 1,000,000 and K = 500, which requires 7.46 GB of memory. For solving a system of the form Ax = b, an iterative solver like SaP::GPU is obliged to consume close to 14.92 GB (twice 7.46 GB) to store both A and the preconditioning matrix M (see the same sketch below). By splitting the matrices over multiple GPUs, the shortage of GPU memory when storing large matrices can be somewhat alleviated.
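The figures quoted in the last two bullets can be re-derived with a few lines of arithmetic. The sketch below simply recomputes those numbers; it is not SaP::GPU code and performs no GPU work.

// Back-of-the-envelope check of the block count for the K = 200, P = 20 LU
// update kernel and of the memory footprint of the N = 1,000,000, K = 500
// banded system. These are the numbers from the text, not new measurements.
#include <cstdio>

int main() {
    // Second bullet: number of thread blocks for the L/U update sweep.
    const long long K1 = 200, P = 20, threads_per_block = 512, dp_cores_gk110 = 960;
    long long work_items = K1 * K1 * P;                                       // ~K^2 entries per partition
    long long blocks = (work_items + threads_per_block - 1) / threads_per_block;  // ceil division
    std::printf("LU update kernel: %lld blocks vs. %lld DP cores\n", blocks, dp_cores_gk110);

    // Third bullet: storage for a dense banded matrix with 2K+1 diagonals of length N.
    const double N = 1.0e6, K2 = 500.0, GiB = 1024.0 * 1024.0 * 1024.0;
    double bytes = (2.0 * K2 + 1.0) * N * sizeof(double);
    std::printf("banded A: %.2f GB, A plus preconditioner M: %.2f GB (K40: 12 GB)\n",
                bytes / GiB, 2.0 * bytes / GiB);
    return 0;
}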

Among the three preconditioning approaches, i.e. SaP-D, SaP-C and SaP-B, only SaP-D supports multiple GPUs. For SaP-C and SaP-B, data transfers among GPUs would have to be introduced to assemble and calculate the coupling blocks and to obtain the matrix-vector product in the Krylov solve stage; this increases the complexity of data management with multiple GPUs, increases the memory footprint, and offsets the potential performance gain of leveraging multiple GPUs. The overhead of these additional data transfers is significant, especially for SaP-B. Indeed, SaP-B is essentially a recursive approach which takes exactly log_2 P iterations both in the preconditioning stage and in each preconditioning solve. Introducing data transfers among GPUs in each iteration significantly hurts performance.

Figure 6.3 illustrates the data allocation to multiple GPUs in SaP::GPU-D: each of the first n_r GPUs is allocated ⌊P/n_GPU⌋ + 1 partitions, while the remaining GPUs are allocated ⌊P/n_GPU⌋ partitions, where n_GPU is the number of active GPUs and n_r = P − n_GPU ⌊P/n_GPU⌋.
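As a quick illustration of this allocation rule, the following sketch (not SaP::GPU source) computes the partition counts per GPU; the seven-partition, three-GPU case reproduces the allocation of Fig. 6.3.

// Sketch of the partition-to-GPU allocation: the first n_r GPUs receive
// floor(P/n_GPU) + 1 partitions each and the rest receive floor(P/n_GPU),
// where n_r = P - n_GPU * floor(P/n_GPU).
#include <cstdio>
#include <vector>

std::vector<int> partitions_per_gpu(int P, int n_gpu) {
    std::vector<int> counts(n_gpu, P / n_gpu);
    int n_r = P - n_gpu * (P / n_gpu);
    for (int g = 0; g < n_r; ++g) counts[g] += 1;  // first n_r GPUs get one extra partition
    return counts;
}

int main() {
    // The seven-partition, three-GPU case of Fig. 6.3: GPU 0 gets 3 partitions, GPUs 1-2 get 2.
    std::vector<int> counts = partitions_per_gpu(7, 3);
    for (size_t g = 0; g < counts.size(); ++g)
        std::printf("GPU %zu: %d partition(s)\n", g, counts[g]);
    return 0;
}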

Figure 6.3: Allocating a seven-partitioned matrix to three GPUs.

There are two remarks on this strategy of data allocation. First, the partitioning strategy discussed in §6.1.1 is not modified for the multi-GPU implementation of SaP::GPU-D. In other words, given the P and n_GPU values, the (preconditioning) banded matrix is first P-partitioned by applying the partitioning strategy described in §6.1.1, and then the P partitions are allocated to the n_GPU GPUs with the allocation strategy described above. This partition-then-allocate strategy does not strike a perfect load balance among GPUs. The scenario illustrated in Fig. 6.3 is a good instance: GPU 0 is allocated one more partition of entries than GPU 1 and GPU 2. An ideal strategy is an allocate-then-partition strategy: the matrix is first allocated to the GPUs in a balanced manner, and then the allocated matrix on each GPU is P_i-partitioned, satisfying P = Σ_{i=0}^{n_GPU−1} P_i. The only reason why partition-then-allocate, instead of allocate-then-partition, is used in SaP::GPU-D is the simplicity of testing: with the assumption that the single-GPU version of SaP::GPU-D is implemented correctly, the multi-GPU version of SaP::GPU-D is tested by comparing all L and U entries of each diagonal block A_i obtained by both versions. As one of the directions of optimization, achieving a better load balance with an allocate-then-partition strategy is a straightforward starting point.

Next, without loss of generality, the linear system Ax = b provided by the user is assumed to be allocated originally on GPU 0. In the Krylov solve stage, whenever a preconditioning solve (defined in §5.2.2) is to be carried out, SaP::GPU needs to partition and allocate the right-hand side, which is on GPU 0, to multiple GPUs in correspondence with how the rows of the preconditioning matrix are allocated. After the preconditioning solve, SaP::GPU must combine the various parts of the preconditioning solve results into a single vector on GPU 0. This back-and-forth data transfer among GPUs increases the overhead of the multi-GPU SaP::GPU-D. The overhead, nonetheless, is inevitable if the matrix A provided by the user resides on a single GPU, which implies that the SpMV function has to be carried out on a single GPU.

With multi-GPU support, the data transfer between the CPU and all GPUs is illustrated in Fig. 6.4. It is worth noting that (1) since only SaP-D supports multiple GPUs, there are no B_i and C_i blocks when transferring data from the CPU to the GPUs; and (2) since the preconditioner assembly process and the subsequent LU factorization are concurrent on each GPU, it is hard to distinguish T_ASSEMBLE and T_LU when multiple GPUs are enrolled in solving the system.
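The scatter step before a multi-GPU preconditioning solve can be pictured with the short sketch below. This is not SaP::GPU source: it assumes two GPUs and an even row split, and relies only on cudaMemcpyPeer, which stages through the host when direct peer access is unavailable; the gather after the solve would be the mirror-image copy.

// Sketch of scattering the right-hand side b (resident on GPU 0) so that each
// GPU owns the rows matching its share of the preconditioning matrix.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;               // length of the right-hand side
    const int n_gpu = 2;                 // assumed number of GPUs
    const int rows_per_gpu = N / n_gpu;  // even split for simplicity

    double* b_dev[2] = {nullptr, nullptr};
    for (int g = 0; g < n_gpu; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&b_dev[g], (g == 0 ? N : rows_per_gpu) * sizeof(double));
    }

    // GPU 0 holds the full vector; ship the second block of rows to GPU 1.
    cudaMemcpyPeer(b_dev[1], 1,                 // destination buffer, destination device
                   b_dev[0] + rows_per_gpu, 0,  // source offset on GPU 0, source device
                   rows_per_gpu * sizeof(double));

    // ... per-GPU preconditioning solve would run here; the reverse copy then
    // gathers the partial results back onto GPU 0 (omitted).

    for (int g = 0; g < n_gpu; ++g) { cudaSetDevice(g); cudaFree(b_dev[g]); }
    std::printf("scattered %d rows to GPU 1\n", rows_per_gpu);
    return 0;
}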

Figure 6.4: Data transfer in SaP::GPU with dual GPUs.

7 numerical experiments

This chapter reports the results of numerical experiments evaluating the robustness and performance of SaP::GPU and the reordering methodologies applied. Simulation results of using SaP::GPU in a computational multibody dynamics problem are detailed in §8. All experiments were carried out on a Tesla K40 GPU (introduced in §2.3.1) and two Intel Xeon E5-2690 v2 CPUs (introduced in §2.3.2). All software was built with optimization level two, namely with the ‘-O2’ flag on. For all CPU linear system solvers against which SaP::GPU is compared (including the MKL and LAPACK dense banded linear solvers, and the PARDISO, SuperLU and MUMPS general sparse linear solvers), each system was solved with several thread counts, and the smallest execution time over all thread counts was recorded as the solution time for that linear system.

7.1 Numerical experiments related to dense banded linear systems

7.1.1 Sensitivity with respect to P

As P is fixed to the value ⌊N/K⌋ in SaP-B, the sensitivity with respect to P only applies to SaP-C and SaP-D. The results in Fig. 7.1 summarize the sensitivity of the time to solution with respect to the number of partitions P. This behavior, namely relatively small gains after a threshold value of P, is typical. It is instructive to see how the solution time is spent by SaP-C and SaP-D and to understand how changing P influences this distribution of the time to solution between the major implementation components. The results in Table 7.1 provide this information as they compare the coupled and decoupled strategies in regards to the factorization times, D_pre vs. C_pre; the number of iterations in the Krylov solver, D_it vs. C_it; the amount of time spent iterating to find the solution at a level of accuracy of at least 10^{-9}, D_Kry vs. C_Kry; and the total times, D_Tot vs. C_Tot. These times are defined as D_pre = T_LU, C_pre = T_LU + T_BC + T_SPK + T_LUrdcd, D_Tot = D_pre + D_Kry, and C_Tot = C_pre + C_Kry. Note that for SaP::GPU, the number of iterations is reported in quarters. This is due to the fact that BiCGStab(2) contains three exit points during each iteration; moving from one to the next requires roughly the same amount of effort, which justifies the notation.

[Plots: execution time (s) versus the number of partitions P for SaP::GPU-C and SaP::GPU-D; panels (a) K = 50 and (b) K = 200.]

Figure 7.1: Time to solution as a function of the number of partitions P . Study carried out for a dense banded linear system with N = 200,000 and d = 0.6, for two different values of the half-bandwidth K.

The number of iterations to convergence suggests that the quality of the coupled version of the preconditioner is superior. Yet the cost of getting this better preconditioner depends highly on the half-bandwidth K of the system. For smaller K, the time saving in the Krylov solve of SaP::GPU-C is higher than the extra cost of preconditioning, so it is worthwhile to spend time in achieving this better preconditioner. For larger K, the time saving in the Krylov solve of SaP::GPU-C is offset by the extra cost of preconditioning, and SaP::GPU-D ends up winning by taking as little as half the time required by SaP::GPU-C. When the same factorization is used multiple times, this conclusion could change, since the metric that controls the performance would be D_Kry and C_Kry, or their proxy, the number of iterations for convergence. Also note that the return on increasing the number of partitions gradually fades away: for the coupled strategy there is no reason to go beyond P = 50, while for the decoupled strategy the threshold value of P can reach 80. Hence, P = 50 for the coupled strategy and P = 80 for the decoupled strategy will be used in the remaining part of this section for the studies with respect to other parameters of a banded system.

P    C_pre    D_pre    C_it  D_it  C_Kry    D_Kry    C_Tot    D_Tot
1    1,205    1,203.4  0.25  0.25  1,152.2  1,146.1  2,357.2  2,349.5
10   269.8    138      0.25  2.25  252.2    601.6    522      739.6
20   174.9    91.4     0.25  2.25  134.9    381.3    309.7    472.6
30   128.6    70.1     0.25  2.25  97.2     258.1    225.8    328.2
40   122.1    65.7     0.25  2.25  81.3     197.8    203.3    263.5
50   140.9    92.6     0.25  2.25  67.9     170.8    208.8    263.5
60   136.3    81.5     0.25  2.25  59.7     151.6    196      233.1
70   150.1    72.1     0.25  2.25  67.4     151.6    206.9    233
80   140.2    66.1     0.25  2.25  53.7     154.2    194      220.3
90   132.2    60.8     0.25  2.25  49.9     112.8    182.1    173.6
100  139.3    72.2     0.25  2.25  50.4     111.6    189.7    183.8
(a) K = 50

P    C_pre    D_pre  C_it  D_it  C_Kry    D_Kry    C_Tot    D_Tot
1    1,487.3  1,489.7  0.25  0.25  1,245.9  1,244.1  2,733.2  2,733.8
10   1,286.6  539      0.25  1.75  290.6    579.8    1,577.2  1,118.8
20   1,200.6  477      0.25  1.75  181.5    354.3    1,382.1  831.3
30   1,202.4  466.2    0.25  1.75  141.8    278.1    1,344.2  744.3
40   1,186.7  436      0.25  1.75  123.1    239.9    1,309.8  675.9
50   1,202.9  429.2    0.25  1.75  117.9    217.5    1,320.8  646.7
60   1,234.5  441      0.25  1.75  110.1    239.4    1,344.6  680.5
70   1,248.9  438.8    0.25  1.75  111.2    198.5    1,360.1  637.3
80   1,249.4  421.3    0.25  1.75  111.3    226.8    1,360.8  648.1
90   1,271.3  428.3    0.25  1.75  109.7    187.1    1,381    615.4
100  1,277.8  416.8    0.25  1.75  109.2    180.9    1,387    597.6
(b) K = 200

Table 7.1: Performance comparison over a spectrum of the number of partitions P for the coupled (C) vs. decoupled (D) strategies in SaP::GPU. All timings are in milliseconds. Problem parameters: N = 200,000, d = 0.6, K = 50 and K = 200. The symbols used are as follows: D_pre, the amount of time spent in preprocessing by the decoupled strategy; D_it, the number of Krylov iterations for convergence; D_Tot, the amount of time to converge. Similar values are reported for the coupled approach.

7.1.2 Sensitivity with respect to d

Next, we report on the performance of SaP::GPU for a dense banded linear system with N = 200,000 and K = 200, for degrees of diagonal dominance in the range 0.01 ≤ d ≤ 1.2, see Eq. (3.11). The entries in the matrix are randomly generated.

The number of partitions P is configured as follows: for SaP::GPU-C, P is fixed to 50; for SaP::GPU-D, P is fixed to 80; for SaP-B, as mentioned in §3.1.2, P is fixed to the value of ⌊N/K⌋, which is 1,000. The findings are summarized in Fig. 7.2, where SaP::GPU-C, SaP::GPU-D and SaP::GPU-B are compared against the banded linear solver in MKL. When d > 1, the impact of neglecting the coupling blocks (for SaP::GPU-D) and of truncating the coupling blocks (for SaP::GPU-C) becomes increasingly irrelevant, a situation that places SaP::GPU at an advantage. As such, there is no reason to go beyond d = 1.2 since, if anything, the results will only get better. The more interesting range is d < 1, when the diagonal dominance requirement is violated. The SaP::GPU solver demonstrates uniform performance over a wide range of degrees of diagonal dominance. For instance, SaP::GPU-C typically required less than one Krylov iteration for all d > 0.08. As the degree of diagonal dominance decreases further, the number of iterations and hence the time to solution increase significantly as a consequence of truncating the spikes that now contain non-negligible values.

[Plots: speedup of SaP::GPU-C, SaP::GPU-D and SaP::GPU-B over MKL (top) and execution time (s) (bottom) versus the degree of diagonal dominance d; panels (a) K = 50 and (b) K = 200.]

Figure 7.2: Influence of the diagonal dominance d, with 0.01 ≤ d ≤ 1.2, for fixed values N = 200,000, P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D. The plots are for two different values of the half-bandwidth K.

It is instructive to see how the solution time is spent by SaP::GPU-C, SaP::GPU-D and SaP::GPU-B and to understand how changing d influences the split of the time to solution between the major components of the implementation. The results reported in Table 7.2 provide this information as they help answer the following questions: (1) can one still use a decoupled approach for matrices that are far from being diagonally dominant? The answer is yes, if d is reasonably large; i.e., when d ≥ 0.1. Note that the number of iterations to convergence for the decoupled approach quickly recovers away from small values of d. (2) What are the scenarios in which the BCR approach should be considered? The answer is when d is rather small; i.e., when d < 0.1, where neither SaP::GPU-D nor SaP::GPU-C is likely to approach the solution.

d        C_pre   D_pre  B_pre   C_it  D_it  B_it  C_Kry    D_Kry  B_Kry  C_Tot    D_Tot  B_Tot
1.2      143.9   64.7   235.2   0.25  1.75  0.25  65.2     103.3  55.1   209.1    168    290.2
1        144.5   64.1   233.1   0.25  1.75  0.25  65.1     102.5  54.9   209.6    166.6  288
0.8      144.9   64.1   236.2   0.25  1.75  0.25  65.1     102.5  55.2   210.1    166.6  291.3
0.6      147.4   64.4   235.2   0.25  2.25  0.25  65.8     139.5  55.2   213.1    203.8  290.4
0.4      143     64.3   235.9   0.25  2.75  0.25  65.6     143.3  55.3   208.6    207.5  291.2
0.2      145.2   64.4   235.5   0.25  4.75  0.25  65.1     224.2  55.1   210.3    288.7  290.6
0.1      144.3   63.8   232.7   16.5  101   0.25  1,270    −1     54.9   1,414.3  −1     287.7
8·10^-2  147.2   64     235.2   101   101   0.75  −1       −1     81.6   −1       −1     316.7
6·10^-2  144.6   64.2   235.4   101   101   2.5   −1       −1     293.8  −1       −1     529.2
4·10^-2  143.6   64.4   234.5   101   101   2.5   −1       −1     293.8  −1       −1     528.2
2·10^-2  144.8   64.3   235.4   101   101   2.5   −1       −1     294.1  −1       −1     529.5
1·10^-2  144.2   65     233.2   101   101   2.5   −1       −1     293.8  −1       −1     527
(a) K = 50

d        C_pre    D_pre  B_pre    C_it  D_it  B_it  C_Kry  D_Kry  B_Kry  C_Tot    D_Tot  B_Tot
1.2      1,200.8  421.4  1,573.6  0.25  1.75  0.25  111.8  174.8  109.3  1,312.7  596.2  1,682.9
1        1,200    417.6  1,572.7  0.25  1.75  0.25  112.1  174.5  109.2  1,312.1  592.1  1,681.9
0.8      1,189.8  417.4  1,572.6  0.25  1.75  0.25  111.4  174.1  109.1  1,301.2  591.5  1,681.8
0.6      1,189.6  417.6  1,572.5  0.25  2.25  0.25  111.4  207    109.2  1,301    624.7  1,681.7
0.4      1,189.6  417.6  1,573.3  0.25  2.25  0.25  111.6  206.8  109.3  1,301.2  624.4  1,682.6
0.2      1,189.6  417.5  1,571.8  0.25  2.75  0.25  111.4  243.5  109.2  1,301    661    1,681
0.1      1,189.5  417.4  1,573.2  0.25  4.75  0.25  111.6  382    109.2  1,301.2  799.4  1,682.4
8·10^-2  1,189.7  417.4  1,572.9  0.25  6.75  0.25  111.6  520.6  109.2  1,301.2  938    1,682.1
6·10^-2  1,189.9  417.6  1,571.6  1.75  101   0.25  281    −1     109    1,470.9  −1     1,680.6
4·10^-2  1,189.8  417.7  1,573.6  101   101   1.25  −1     −1     243.3  −1       −1     1,816.9
2·10^-2  1,189.7  417.7  1,573    101   101   2.5   −1     −1     597.3  −1       −1     2,170.4
1·10^-2  1,190.2  417.5  1,574.3  101   101   2.5   −1     −1     597.7  −1       −1     2,172
(b) K = 200

Table 7.2: Influence of d for the coupled (C) vs. decoupled (D) vs. BCR (B) strategies in SaP::GPU (N = 200,000, P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D, K = 50 for Table 7.2a and K = 200 for Table 7.2b). All timings are in milliseconds. Symbols used are as specified for Table 7.1. For those tests whose number of iterations is reported as 101, with the Krylov solve and total times reported as −1, the test failed to converge within 100 Krylov iterations.

The results of the efficiency evaluation of the diagonal boosting algorithm in §7.2.1.1 suggest that for half of the 114 real application systems studied in this work, d can be boosted to a value larger than 0.13. It is thus helpful to zoom into the range between 0.1 and 0.2 to observe the quality of convergence of SaP::GPU-D. The results of this “zoom-in” study are reported in Fig. 7.3 and Table 7.3. Note that whether the half-bandwidth is small (i.e., K = 50) or large (i.e., K = 200), a diagonal dominance no less than 0.13 guarantees convergence when a system is preconditioned with SaP::GPU-D. This observation, combined with the results reported in §7.2.1.1, provides the users a likely positive answer to the following question: is SaP::GPU-D a feasible preconditioning strategy to solve a general sparse system?

[Plots: execution time (s) of SaP::GPU-D versus the degree of diagonal dominance d, 0.1 ≤ d ≤ 0.2; panels (a) K = 50 and (b) K = 200.]

Figure 7.3: Influence of the diagonal dominance d, with 0.1 ≤ d ≤ 0.2, for fixed values N = 200,000 and P = 80 for SaP::GPU-D.

d     D_pre  D_it    D_Kry    D_Tot
0.2   58.2   4.75    162.1    220.4
0.19  58.1   5.75    191.8    249.9
0.18  59.1   5.75    192.8    252
0.17  57.1   6.75    221.3    278.4
0.16  58.5   7.75    250.5    308.9
0.15  57.9   9.25    292.8    350.6
0.14  57.6   11.75   368.4    426
0.13  57.8   26.25   792.2    850.1
0.12  58     183.75  5,429.7  5,487.7
0.11  58.5   501     −1       −1
0.1   58.7   501     −1       −1
(a) K = 50

d     D_pre  D_it  D_Kry  D_Tot
0.2   398.3  2.75  213.7  612
0.19  394    2.75  212.9  606.9
0.18  394.1  3.25  241.5  635.6
0.17  393.2  3.25  241.7  634.9
0.16  393.2  3.25  241.7  634.9
0.15  397.3  3.25  271    668.3
0.14  393.1  3.75  273.6  666.7
0.13  393.2  3.75  273.3  666.5
0.12  394.6  3.75  276.7  671.4
0.11  393.3  4.75  333.8  727.1
0.1   393.1  4.75  334.1  727.2
(b) K = 200

Table 7.3: Influence of d for the decoupled (D) strategy in SaP::GPU (N = 200,000 and P = 80). All timings are in milliseconds. Symbols used are as specified for Table 7.1. For those tests whose number of iterations is reported as 501, with the Krylov solve and total times reported as −1, the test failed to converge within 500 Krylov iterations.

7.1.3 Comparison with LAPACK and Intel’s MKL over a spectrum of N and K

This section summarizes the results of a two-dimensional sweep over N and K. Besides comparing with MKL, in this two-dimensional test we also include LAPACK to better understand where SaP::GPU stands when compared to these open-source or commercial dense banded solvers. In this exercise, prompted by the results reported in Figs. 7.1 and 7.2, we fixed P = 50 for SaP::GPU-C and P = 80 for SaP::GPU-D and chose matrices for which d = 0.6. Each row in Table 7.4 lists the value of N, which runs from 1000 to 1,000,000. Each column lists the dimension of the half-bandwidth K, which runs from 10 to 500. Each table row is split into five sub-rows: SaP::GPU-D results are reported in the first sub-row; SaP::GPU-C in the second sub-row; SaP::GPU-B in the third sub-row; MKL in the fourth sub-row; LAPACK in the fifth sub-row. All timings are in milliseconds. “OoM” stands for “out-of-memory”, a situation that arises when SaP::GPU exhausts the GPU's 12 GB of global memory during the solution of the linear system.

Table 7.4: Performance comparison, two-dimensional sweep over N and K for d = 0.6. For each value N, the five rows correspond to the SaP::GPU-D, SaP::GPU-C, SaP::GPU-B, MKL and LAPACK solvers, respectively.

N            Solver       K=10      K=20      K=50     K=100    K=200    K=500
1000         SaP::GPU-D   1.141E1   9.435     9.941    1.324E1  2.241E1  2.412E1
             SaP::GPU-C   6.723     5.687     9.289    1.610E1  2.478E1  2.439E1
             SaP::GPU-B   1.359E1   1.560E1   1.845E1  2.521E1  3.433E1  6.328E1
             MKL          4.803     5.313     5.232    8.718    9.974    1.510E1
             LAPACK       2.010E−1  5.400E−1  2.524    7.742    2.718E1  1.186E2
2000         SaP::GPU-D   1.120E1   9.278     1.032E1  1.425E1  2.109E1  4.181E1
             SaP::GPU-C   5.352     7.117     1.062E1  1.886E1  3.433E1  9.498E1
             SaP::GPU-B   1.382E1   1.704E1   2.325E1  3.524E1  5.533E1  1.518E2
             MKL          4.733     5.020     6.853    1.258E1  1.512E1  2.934E1
             LAPACK       3.810E−1  1.077     5.120    1.596E1  5.899E1  2.960E2
5000         SaP::GPU-D   1.174E1   9.321     1.195E1  1.695E1  2.571E1  6.754E1
             SaP::GPU-C   6.785     7.350     1.417E1  2.816E1  6.672E1  2.634E2
             SaP::GPU-B   2.135E1   2.302E1   2.957E1  5.081E1  9.499E1  2.870E2
             MKL          5.585     6.124     9.205    2.374E1  3.090E1  7.426E1
             LAPACK       9.470E−1  2.696     1.291E1  4.152E1  1.558E2  8.278E2
10,000       SaP::GPU-D   1.456E1   1.280E1   1.450E1  2.036E1  3.244E1  9.056E1
             SaP::GPU-C   6.862     8.421     1.861E1  4.562E1  1.201E2  5.396E2
             SaP::GPU-B   2.828E1   3.292E1   4.066E1  7.010E1  1.437E2  4.929E2
             MKL          5.958     7.021     1.463E1  4.282E1  5.877E1  1.448E2
             LAPACK       1.891     5.376     2.633E1  8.486E1  3.173E2  1.707E3
20,000       SaP::GPU-D   2.072E1   1.788E1   2.303E1  2.853E1  4.541E1  1.386E2
             SaP::GPU-C   9.945     1.280E1   2.540E1  7.105E1  2.259E2  1.111E3
             SaP::GPU-B   3.602E1   4.285E1   5.813E1  1.019E2  2.267E2  8.884E2
             MKL          7.520     1.066E1   3.051E1  8.448E1  1.107E2  2.658E2
             LAPACK       3.753     1.074E1   5.354E1  1.694E2  6.315E2  3.472E3
50,000       SaP::GPU-D   4.333E1   3.874E1   5.129E1  6.046E1  1.025E2  2.973E2
             SaP::GPU-C   2.119E1   2.749E1   5.641E1  1.381E2  3.934E2  2.784E3
             SaP::GPU-B   5.748E1   6.950E1   9.774E1  1.971E2  4.653E2  2.044E3
             MKL          1.224E1   2.512E1   7.873E1  2.027E2  2.549E2  5.835E2
             LAPACK       9.538     2.972E1   1.374E2  4.227E2  1.584E3  8.764E3
100,000      SaP::GPU-D   7.372E1   6.641E1   8.168E1  1.140E2  2.020E2  5.902E2
             SaP::GPU-C   3.767E1   5.029E1   1.035E2  2.596E2  6.828E2  4.153E3
             SaP::GPU-B   8.640E1   1.076E2   1.590E2  3.477E2  8.634E2  3.877E3
             MKL          2.412E1   4.942E1   1.508E2  3.435E2  4.611E2  1.101E3
             LAPACK       2.139E1   6.016E1   2.734E2  8.445E2  3.173E3  1.759E4
200,000      SaP::GPU-D   1.481E2   1.283E2   1.783E2  2.229E2  4.292E2  1.245E3
             SaP::GPU-C   6.809E1   9.234E1   1.905E2  4.884E2  1.259E3  6.751E3
             SaP::GPU-B   1.334E2   1.750E2   2.776E2  6.271E2  1.638E3  7.494E3
             MKL          5.000E1   9.727E1   3.019E2  7.079E2  9.360E2  2.109E3
             LAPACK       4.363E1   1.198E2   5.461E2  1.690E3  6.354E3  3.538E4
500,000      SaP::GPU-D   3.146E2   2.771E2   3.594E2  5.230E2  9.898E2  3.270E3
             SaP::GPU-C   1.551E2   2.149E2   4.699E2  1.191E3  2.983E3  OoM
             SaP::GPU-B   2.766E2   3.732E2   6.196E2  1.460E3  3.957E3  OoM
             MKL          1.100E2   2.308E2   7.547E2  1.590E3  2.158E3  5.934E3
             LAPACK       1.083E2   2.976E2   1.366E3  4.225E3  1.603E4  8.865E4
1,000,000    SaP::GPU-D   5.873E2   5.291E2   6.942E2  1.022E3  1.940E3  OoM
             SaP::GPU-C   3.004E2   4.140E2   8.976E2  2.382E3  5.860E3  OoM
             SaP::GPU-B   5.247E2   7.080E2   1.189E3  2.871E3  OoM      OoM
             MKL          1.942E2   4.292E2   1.462E3  3.396E3  4.895E3  1.131E4
             LAPACK       2.160E2   5.941E2   2.732E3  8.611E3  3.199E4  1.786E5

The results reported in Table 7.4 are statistically summarized in Fig. 7.4, which provides SaP::GPU over MKL and/or LAPACK speedup information. Assume that a test α successfully ran to completion in SaP::GPU-D, requiring T_α^{SaP::GPU-D}, and/or in SaP::GPU-C, requiring T_α^{SaP::GPU-C}, and/or in SaP::GPU-B, requiring T_α^{SaP::GPU-B}. By convention, in case of failure to solve, the value ∞ is assigned to T_α^{SaP::GPU-D}, T_α^{SaP::GPU-C} or T_α^{SaP::GPU-B}. If a test runs to completion both in SaP::GPU and MKL (or LAPACK), the speedup value for α used to generate the plot in Fig. 7.4a (or Fig. 7.4b) is computed as s_BD^{SaP::GPU--X} ≡ T_α^X / T_α^{SaP::GPU}, where T_α^X is solver X's time to solution and T_α^{SaP::GPU} ≡ min(T_α^{SaP::GPU-D}, T_α^{SaP::GPU-C}, T_α^{SaP::GPU-B}). Given that N assumes 10 values and K takes 6 values, α can be one of 60 tests. Since one (N, K) test, namely (1,000,000, 500), failed to solve in SaP::GPU, the sample population for the statistical study in Fig. 7.4a is 59. Out of 59 tests, s_BD^{SaP::GPU--MKL} > 1 in 33 cases. The highest speedup was s_BD^{SaP::GPU--MKL} = 3.3527. The median is 1.0738 and the 75th percentile is 2.0656, which implies (1) that the overall performance of SaP::GPU is slightly better than MKL's; and (2) that for 25% of the tests, SaP::GPU demonstrates a speedup over MKL from two to three, while for 25% of the tests, SaP::GPU demonstrates a speedup over MKL from one to two.

Similarly, the sample population for the statistical study in Fig. 7.4b is also 59. Out of 59 tests, s_BD^{SaP::GPU--LAPACK} > 1 in 41 cases. The highest speedup was s_BD^{SaP::GPU--LAPACK} = 29.8066. The median is 2.6781 and the 75th percentile is 8.3361. These results imply that, given a banded system, SaP::GPU is likely to solve it in at most 40% of the time LAPACK would take. However, it is worth noting that LAPACK demonstrates high performance for small half-bandwidths; i.e., for K = 10 and K = 20. A small half-bandwidth is likely to force LAPACK to use a simple, unblocked factorization, which exhibits small overheads when the matrix size is small.
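The way the per-test speedup is formed can be sketched in a few lines. The snippet below is not part of SaP::GPU and uses made-up timings rather than entries of Table 7.4: SaP::GPU's time is taken as the best of its three approaches (infinity when an approach fails), and the speedup over a reference solver X is T_X divided by that best time.

// Sketch of the s_BD statistic: best-of-three SaP time, per-test speedup over
// solver X, then a simple nearest-rank median over the collected speedups.
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

struct Test { double t_D, t_C, t_B, t_X; };  // times in ms; t_X is MKL or LAPACK

int main() {
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<Test> tests = {{148.0, 68.0, 133.0, 50.0},     // made-up sample points
                               {3.1e3, 2.9e3, inf,   2.1e3},
                               {694.0, 898.0, 1.18e3, 1.46e3}};

    std::vector<double> speedups;
    for (const Test& t : tests) {
        double t_sap = std::min({t.t_D, t.t_C, t.t_B});
        if (t_sap < inf) speedups.push_back(t.t_X / t_sap);   // tests SaP failed are excluded
    }

    std::sort(speedups.begin(), speedups.end());
    std::printf("median speedup over X: %.3f\n", speedups[speedups.size() / 2]);
    return 0;
}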

(a) SaP – MKL

(b) SaP – LAPACK

Figure 7.4: SaP speedup over Intel’s MKL and LAPACK – statistical analysis based on values in Table 7.4.

7.1.4 Profiling results for dense banded linear system solver

This subsection reports the profiling results of all three approaches to solving dense banded systems: SaP::GPU-D, SaP::GPU-C and SaP::GPU-B. The set of 60 tests used for profiling was introduced in §7.1.3 and, for each of the approaches, the profiling results were generated from the subset of these tests that the approach solved successfully. Three notations are introduced for succinctness of description. Consider an arbitrary stage Γ in solving a system and a specific test α; then we define

    R_α^{Γ-Tot}  ≡ 100 × T_α^Γ / T_α^{Tot},                      (7.1a)
    R_α^{Γ-PreP} ≡ 100 × T_α^Γ / T_α^{PreP},  if Γ ∈ PreP,        (7.1b)
    R_α^{Γ-Kry}  ≡ 100 × T_α^Γ / T_α^{Kry},   if Γ ∈ Kry.         (7.1c)

These notations are interpreted, for a specific test α, as the percentage of time spent in the stage Γ, considering 100% to be the total time T_α^{Tot}, the pre-processing time T_α^{PreP}, and the Krylov solve time T_α^{Kry}, respectively. It is worth noting that if Γ ∉ PreP, R_α^{Γ-PreP} is not defined for any test α, and similarly for R_α^{Γ-Kry}. For succinctness, when discussing a statistical observation over all tests, the subscript α is oftentimes omitted.

Profiling results of SaP::GPU-D. The two (and only) major stages in SaP::GPU-D are the LU factorization of the main diagonal blocks (LU) and the Krylov solve (Kry). The rest of the time spent in pre-processing, other than LU, is denoted as PrePRst. The median percentage of time consumed in each stage is shown in Table 7.5, and Fig. 7.5, Fig. 7.6 and Fig. 7.7 illustrate the distribution of R^{Γ-Tot} for each stage when considering the whole set of matrices, the subset of matrices with dimension N no smaller than 50,000, and the subset of matrices with half-bandwidth K no smaller than 50, respectively. Since SaP::GPU-D supports multiple GPUs, Table 7.5 and Figs. 7.5, 7.6 and 7.7 display the profiling results for both single-GPU and dual-GPU runs. As anticipated, (1) for both the single-GPU and dual-GPU implementations, whichever subset of matrices is considered, the Kry component dominates the overall running time. This is because SaP-D is worse than SaP-C and SaP-B in terms of preconditioning quality, and several Krylov iterations are required for convergence. (2) The time spent in the other pre-processing stages, for instance partitioning and data transfers among GPUs when dual GPUs are leveraged, is trivial. (3) The LU component benefits more from an additional GPU than Kry does. As discussed in §6.1.3, SaP might use multiple blocks of threads to update the L and U entries, which results in more blocks than the multiprocessors of a single GPU can handle in a parallel manner. This is alleviated by leveraging an additional GPU.

(a) Single GPU
                         LU     PrePRst  Kry
R^{Γ-Tot}                20.09  2.485    78.18
R^{Γ-Tot}, N ≥ 50,000    40.54  3.990    55.10
R^{Γ-Tot}, K ≥ 50        40.54  2.854    55.10

(b) Dual GPU
                         LU     PrePRst  Kry
R^{Γ-Tot}                19.01  8.994    70.83
R^{Γ-Tot}, N ≥ 50,000    22.50  12.91    64.27
R^{Γ-Tot}, K ≥ 50        34.31  11.71    55.57

Table 7.5: Medians of R^{Γ-Tot} and R^{Γ-PreP} for all stages in SaP::GPU-D. Table 7.5a shows the single-GPU results, and Table 7.5b shows the dual-GPU results.

(a) Single GPU

(b) Dual GPU

Figure 7.5: Profiling results of R^{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems. Fig. 7.5a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.5b illustrates SaP::GPU-D run on dual GPUs.

(a) Single GPU

(b) Dual GPU

Figure 7.6: Profiling results of R^{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems which are considered “large” in terms of dimension N. Fig. 7.6a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.6b illustrates SaP::GPU-D run on dual GPUs.

(a) Single GPU

(b) Dual GPU

Figure 7.7: Profiling results of R^{Γ-Tot} in SaP::GPU-D obtained from a subset of 59 banded linear systems which are considered “large” in terms of half-bandwidth K. Fig. 7.7a illustrates SaP::GPU-D run on a single GPU, and Fig. 7.7b illustrates SaP::GPU-D run on dual GPUs.

Profiling results of SaP::GPU-C. In SaP::GPU-C, there are three major stages in pre-processing: the LU factorization of each diagonal block (LU), computing the spikes (SPK), and the LU factorization of the reduced system (LUrdcd); the rest of the time spent in pre-processing is denoted as PrePRst. The median percentage of time consumed in each stage is shown in Table 7.6, and Fig. 7.8 and Fig. 7.9 illustrate the distribution of R^{Γ-Tot} and R^{Γ-PreP} for each stage. It is somewhat surprising to observe that the amount of time spent in the Krylov-subspace component turned out higher than anticipated: it consumed a similar amount of time to the LU component. For a further study of the amount of time spent in LU and Kry, two subsets of “large” matrices were extracted and two additional corresponding plots were generated. One subset consists of all matrices with large dimension N, the cut-off value of which is 50,000; the other subset consists of all matrices with large half-bandwidth K, the cut-off value of which is 50. The profiling results for solving these two subsets of systems are given in Table 7.6 and Fig. 7.10. From these results, it can be concluded that: (1) the LU component is the largest time consumer, especially when the banded matrix size is large; (2) the Kry component consumes a non-trivial amount of time, yet this stage scales better in terms of both N and K than LU does.

                         LU     SPK    LUrdcd  PrePRst  Kry
R^{Γ-Tot}                35.47  8.399  3.947   5.363    33.91
R^{Γ-PreP}               58.39  15.19  6.123   10.23    -
R^{Γ-Tot}, N ≥ 50,000    57.50  3.634  0.7316  4.338    30.098
R^{Γ-Tot}, K ≥ 50        45.14  14.68  5.123   4.060    22.513

Table 7.6: Medians of R^{Γ-Tot} and R^{Γ-PreP} for all stages in SaP::GPU-C. It is worth noting that the sum of each row is not expected to be 100, because these values are statistical medians.

Figure 7.8: Profiling results of R^{Γ-Tot} in SaP::GPU-C obtained from a set of 58 banded linear systems which SaP::GPU-C successfully solved.

Figure 7.9: Profiling results of R^{Γ-PreP} in SaP::GPU-C obtained from a set of 58 banded linear systems which SaP::GPU-C successfully solved.

(a) Matrices with N ≥ 50,000

(b) Matrices with K ≥ 50

Figure 7.10: Profiling results of R^{Γ-Tot} in SaP::GPU-C obtained from a subset of 58 banded linear systems which are considered “large”. Fig. 7.10a illustrates all cases with large N, and Fig. 7.10b illustrates all cases with large K.

Profiling results of SaP::GPU-B. In SaP::GPU-B, there are three major stages in pre-processing: LU (Eq. 3.16a); MtSw (Eqs. 3.16b and 3.16c), which consists of multiple right-hand-side solves; and MM (Eqs. 3.16d, 3.16e and 3.16f), which consists of matrix multiplications and additions; the rest of the time spent in pre-processing other than in these three stages is denoted as PrePRst. There are two major stages in the Krylov solve: SgSw (Eq. 3.16g), which consists of single right-hand-side solves, and MV (Eqs. 3.16h and 3.16i), which mainly consists of matrix-vector multiplications; the rest of the time spent in the Krylov solve other than in these two stages is denoted as KryRst. The median percentage of time consumed in each stage is shown in Table 7.7, and Fig. 7.11 and Fig. 7.12 employ a median-quartile method to illustrate the distribution of time consumption in each stage of SaP::GPU-B. The results in Table 7.7 and Fig. 7.12 reveal the bottleneck of the BCR approach and suggest the direction of future optimization efforts. On a coarse level, the pre-processing components are significantly more time-consuming than the Krylov-subspace components. This matches our expectation, in that the BCR algorithm was designed to be an exact method, and is therefore expected to solve a system with only one call to the preconditioning solve if no pivot boosting happens during factorization. On a fine level, the matrix multiplication operations in Eqs. 3.16d, 3.16e and 3.16f and the multiple right-hand-side sweep operations in Eqs. 3.16b and 3.16c are more costly than the LU factorization of Eq. 3.16a. This agrees with the count of floating point operations in §3.1.4: the number of operations in stages MtSw and MM is roughly twice and four times that in stage LU, respectively. Similar studies on “large” matrices, in terms of both dimension N and half-bandwidth K, were carried out for SaP::GPU-B (see Table 7.7 and Fig. 7.13). There are two significant differences in R^{Γ-Tot}_{N≥50,000} and R^{Γ-Tot}_{K≥50} compared against R^{Γ-Tot}: (1) for larger N, the MV stage in the Krylov solve consumes a significantly larger share of the elapsed time; (2) for larger K, the MtSw component consumes a significantly larger share of the elapsed time. These facts shed light upon the directions of optimization for BCR methods: more effort is required in stages MtSw, MM (which overall consumes more time than any other stage over all test matrices) and MV.

                         LU     MtSw   MM     PrePRst  SgSw   MV     KryRst
R^{Γ-Tot}                18.44  26.30  29.36  4.492    5.964  5.719  3.890
R^{Γ-PreP}               23.23  31.78  33.84  5.464    -      -      -
R^{Γ-Kry}                -      -      -      -        29.11  38.11  25.77
R^{Γ-Tot}, N ≥ 50,000    19.36  26.30  29.36  4.608    4.026  9.350  3.331
R^{Γ-Tot}, K ≥ 50        19.47  36.71  28.75  4.012    4.708  3.297  2.861

Table 7.7: Medians of R^{Γ-Tot}, R^{Γ-PreP} and R^{Γ-Kry} for all stages in SaP::GPU-B. It is worth noting that the sum of each row is not expected to be 100, in that these values are statistical medians.

Figure 7.11: Profiling results of R^{Γ-Tot} in SaP::GPU-B obtained from a set of 57 banded linear systems which SaP::GPU-B successfully solved.

(a) R^{Γ-PreP}

(b) R^{Γ-Kry}

Figure 7.12: Profiling results of R^{Γ-PreP} and R^{Γ-Kry} in SaP::GPU-B obtained from a set of 57 banded linear systems which SaP::GPU-B successfully solved.

(a) Matrices with N ≥ 50,000

(b) Matrices with K ≥ 50

Figure 7.13: Profiling results of R^{Γ-Tot} in SaP::GPU-B obtained from a subset of 57 banded linear systems which are considered “large”. Fig. 7.13a illustrates all cases with large N, and Fig. 7.13b illustrates all cases with large K.

7.2 Numerical experiments related to sparse matrix reorderings

When solving sparse linear systems, SaP reformulates the sparse problem as a dense banded linear system that is subsequently solved using SaP::GPU-C or SaP::GPU-D. Ideally, the “sparse–to–dense” transition yields a coefficient matrix that is diagonal heavy; i.e., has a large d, and has a small bandwidth K. Two matrix reorderings are applied in an attempt to meet these two objectives. The first one, namely the diagonal boosting reordering, is assessed in section §7.2.1. The second one, namely the bandwidth reduction reordering, is evaluated in §7.2.2.

7.2.1 Assessment of the diagonal boosting reordering solution

Two sets of experiments were carried out to assess the DB reordering. The first set of experiments, summarized in §7.2.1.1, studies the effectiveness of the DB reordering; i.e., what is the resulting d of the matrix QA_s, where Q represents the DB reordering. The second set of experiments, summarized in §7.2.1.2, studies how the DB reordering implemented in SaP performs compared against the Harwell Sparse Library (HSL) MC64 algorithm (Hopper, 1973). Unless otherwise stated, the experiments in §7.2, §7.3 and §7.4 used 113 matrices selected from the University of Florida Sparse Matrix Collection (Davis and Hu, 2011), and three matrices from the Simulation-Based Engineering Lab (SBEL) at the University of Wisconsin-Madison (SBEL, 2006).

7.2.1.1 Efficiency evaluation of diagonal boosting algorithm

The first set of experiments aims at assessing the efficiency of DB in boosting matrix diagonals. It is instructive to study, for each sparse matrix A_s, how the diagonal dominance d changes by applying the DB reordering Q. Analyzing the distribution of the resulting d over all 116 matrices is also of interest. Together with the sensitivity study with respect to d detailed in §7.1.2, these analyses serve as an intuition for SaP::GPU in selecting an efficient preconditioning method.

Different from the synthetic banded matrices used for the set of experiments described in §7.1, where d_i ≡ |a_ii| / Σ_{j≠i} |a_ij|, 1 ≤ i ≤ N, is uniform over the rows i, the values d_i are subject to large variance for both A_s and QA_s. As d is the minimum of all d_i, it might fail to accurately reflect the heaviness of all diagonal elements of a matrix. For this purpose, we define

    d_{1-pct} = pct1_{i=1,...,N} ( |a_ii| / Σ_{j≠i} |a_ij| ),          (7.2)

where pct1 is the first percentile of an array of numbers. In the experiments that evaluate DB's efficiency in boosting diagonals, both d and d_{1-pct} are selected as the metrics for assessing the DB reordering in SaP::GPU. In particular, d_{1-pct} is meant to rule out those rare rows with extremely small diagonal elements.
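The two metrics can be computed directly from a CSR matrix, as in the sketch below. This is an illustration over a 4x4 toy matrix, not SaP::GPU code; rows without off-diagonal entries are handled with a simplification noted in the comments.

// Sketch of the diagonal-dominance metrics: d is the minimum of the row ratios
// d_i = |a_ii| / sum_{j != i} |a_ij|, and d_{1-pct} replaces the minimum with
// the first percentile so that a few pathological rows do not dominate.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // CSR storage of a small sparse matrix.
    const int N = 4;
    const int row_ptr[] = {0, 2, 5, 7, 9};
    const int col_idx[] = {0, 1,  0, 1, 2,  1, 2,  2, 3};
    const double val[]  = {4, -1, -2, 5, 1, 1, 3,  -1, 2};

    std::vector<double> di(N);
    for (int i = 0; i < N; ++i) {
        double diag = 0.0, off = 0.0;
        for (int e = row_ptr[i]; e < row_ptr[i + 1]; ++e)
            (col_idx[e] == i ? diag : off) += std::fabs(val[e]);
        di[i] = off > 0.0 ? diag / off : diag;  // simplification for rows with no off-diagonal entries
    }

    std::sort(di.begin(), di.end());
    double d = di.front();
    // First percentile by a simple nearest-rank rule, clamped to a valid index.
    int idx = std::max(0, static_cast<int>(std::ceil(0.01 * N)) - 1);
    double d_1pct = di[idx];

    std::printf("d = %.3f, d_1pct = %.3f\n", d, d_1pct);
    return 0;
}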

Fig. 7.14 illustrates the distribution of the d and d_{1-pct} values over all 116 matrices before and after applying the DB reordering. There are several observations from the statistical results:

• d does not decrease for any of the 116 matrices and in fact increases for 80 of them; for all but one matrix (sparsine), d_{1-pct} does not decrease after the DB reordering algorithm is applied, and for 79 out of the 116 matrices, d_{1-pct} increases by applying the DB reordering.

• The median of d increases from 0.0012 to 0.1383, and the median of d_{1-pct} increases from 0.0245 to 0.2004. According to the results of the sensitivity study with respect to d in §7.1.2, all three approaches are likely to demonstrate reasonable performance. The most interesting preconditioning approach is SaP-D, which can be expected to converge quickly for at least half of the 116 selected application systems, in that d_{1-pct} > 0.2 holds for more than half of those systems.

• There are 37 matrices with zero d and 30 matrices with zero d_{1-pct} before the DB reordering, while no zero d or zero d_{1-pct} is observed after the DB reordering. This implies that DB helps avoid zero pivoting in the LU factorization for at least 30% of the systems.

(a) d

(b) d_{1-pct}

Figure 7.14: Results of a statistical analysis that uses a median-quartile method to measure the spread of d (Fig. 7.14a) and d_{1-pct} (Fig. 7.14b) before and after the application of the diagonal boosting reordering. d_{1-pct} is defined in Eq. 7.2.

7.2.1.2 Performance comparison against Harwell Sparse Library’s solution

The second set of results, summarized in Fig. 7.15, corresponds to the performance comparison between the hybrid CPU–GPU implementation of §4.1 and the HSL MC64 algorithm (Hopper, 1973). The hybrid implementation outperformed MC64 for 96 out of the 116 matrices. The left pane in Fig. 7.15 presents results of a statistical analysis that used a median-quartile method to measure the spread of the MC64 and DB times to solution. Assume that T_α^{DB} and T_α^{MC64} represent the times required by DB and MC64, respectively, to complete the diagonal boosting reordering in test α. A relative speedup is computed as

    S_α^{DB-MC64} = log_2 ( T_α^{MC64} / T_α^{DB} ).          (7.3)

These S_α^{DB-MC64} values, which can be either positive or negative, are collected in a set S^{DB-MC64} which is used to generate the left box plot in Fig. 7.15. The number of tests used to produce these statistical results was 116. Note that a positive value means that DB is faster than MC64, with the opposite outcome being the case for negative values of S_α^{DB-MC64}. The median value of S^{DB-MC64} was 1.2423, which indicates that half of the 116 tests ran more than 2.3 times faster using the DB implementation. On average, it turns out that the larger the matrix, the faster the DB solution becomes. Indeed, as a case study, we analyzed a subset of larger matrices. The “large” attribute was defined in two ways: first, by considering the matrix size, and second, by considering the number of nonzero elements. For the 116 matrices considered, we picked the largest 24 of them; i.e., approximately the largest 20%. To this end, in the first case, we selected all matrices whose dimension was higher than N = 150,000. In the second case, we selected all matrices whose number of nonzero elements was larger than 4,350,000. For large N, the median was 1.6255, while for matrices with many nonzero elements, the median was 1.7276. In other words, half of the large tests ran more than three times faster in DB. Finally, the statistical results in Fig. 7.15 indicate that for the large tests, with the exception of two outliers, there were no tests for which S_α^{DB-MC64} was negative; i.e., with these exceptions, DB was faster. When all 116 tests were considered, MC64 was faster in several cases, with an outlier for which MC64 was four times faster than DB. Two facts emerged at the end of this analysis. First, as discussed in Li et al. (2014), the bottleneck in the diagonal boosting reordering was either the DB-S2 stage; i.e., finding the initial match, or the DB-S3 stage; i.e., finding a perfect match, with an approximately equal split among them. Second, the quality of the reordering turned out to be identical: almost all matrices displayed the same grand product of the diagonal entries regardless of whether the reordering was carried out using MC64 or DB.

Figure 7.15: Results of a statistical analysis that uses a median-quartile method to measure the spread of the MC64 and DB times to solution. The speedup factor, or performance metric, is computed as in Eq. (7.3).

7.2.2 Assessment of the bandwidth reduction reordering solution

The performance of the CM solution implemented in SaP was evaluated on a set of 125 sparse matrices from various applications. These matrices were the 116 used in the previous section plus several other matrices, such as ANCF31770, ANCF88950, NetANCF_40by40, etc., that arise in granular dynamics and the implicit integration of flexible multi-body dynamics (Fang, 2015; Fang and Negrut, 2014; Serban et al., 2015). Figure 7.16 presents results of a statistical analysis that used a median-quartile method to compare (i) the half-bandwidths of the matrices obtained by Harwell's MC60 and SaP's CM; and (ii) the time to solution; i.e., the time to complete a band-reducing reordering. For (i), the quantity reported is the relative difference between the resulting bandwidths,

    r_K ≡ 100 × (K_{MC60} − K_{CM}) / K_{CM} ,

where K_{MC60} and K_{CM} are, respectively, the half-bandwidths K of the matrices produced by MC60 and CM. For (ii), the metric used was identical to the one introduced in

Eq. (7.3). Note that CM is superior when r_K assumes large positive values, which are also desirable for the time-to-solution plot. As far as r_K is concerned, the median value is 0%; i.e., out of 125 matrices, about half are better off being reordered by Harwell's MC60, with the other half being better off reordered by SaP's CM. On the positive side, the number of outliers for CM is higher, indicating that there is a propensity for CM to “win big”. In terms of times to solution, MC60 is marginally faster than CM's hybrid CPU/GPU solution. Indeed, the median value of the performance metric is −0.1057; i.e., half of the tests run with CM take at least 1.076 times longer to complete the bandwidth reduction task.

Figure 7.16: Comparison of the Harwell MC60 and SaP’s CM implementations in terms of resulting half bandwidth K and time to solution.

It is insightful to discuss what happens when this statistical analysis is restricted to larger matrices. The results of this analysis are captured in Fig. 7.17. Just like in §7.2.1.2, the focus is on the largest 20% of the matrices, where “large” is understood to mean large matrix dimension N and then, separately, a large number of nonzeros nnz. Incidentally, the cut-off value for the dimension was N = 215,000, while for the number of nonzeros it was nnz = 7,800,000. When the statistical analysis included the 25 largest matrices based on size N, the median value for the half-bandwidth metric r_K was yet again 0.0%. The median value for the time to solution changed, however, from −0.1057 to 0.6964, to indicate that for half of these large tests SaP ran more than 1.6 times faster than the Harwell solution. Qualitatively, the same conclusions were reached when the 25 large matrices were selected on the grounds of nnz count. The median for r_K was 0.4182%, which again suggested that the relative difference in the resulting bandwidth K yielded by CM and MC60 was practically negligible. The median time-to-solution metric was the same 0.6964. Note though that, according to the results shown in Fig. 7.17, there is no large-nnz test for which the Harwell implementation is faster than CM. In fact, 25% of the large tests; i.e., about five tests, run at least three times faster in CM.

Figure 7.17: Comparison of the Harwell MC60 and SaP’s CM implementations in terms of resulting half bandwidth K and time to solution. Statistical analysis of large matrices only.

Finally, it is worth pointing out the correlations between the times to solution and the K values, on the one hand, and N and nnz, on the other hand. Herein, the correlation used is the Pearson product-moment correlation coefficient (Box et al., 1978). As a rule of thumb, a Pearson correlation coefficient of 0.01 to 0.19 suggests a negligible relationship, while a coefficient between 0.7 and 1.0 indicates a strong positive relationship. The correlation coefficient between the bandwidth and the dimension N of the matrix turns out to be small; i.e., 0.15 for MC60 and 0.16 for CM. Indeed, the fact that a matrix is large does not say much about what K value one can expect upon reordering this matrix. However, the correlation between the matrix dimension and the amount of time needed to produce the reordering is very high. In other words, the larger the matrix size N, the longer the time to produce the reordering: the correlation coefficient was 0.91 for MC60 and 0.81 for CM. The same observation holds for the number of nonzero entries: when there are a lot of them, the time to produce a reordering is large. The Pearson correlation coefficient is 0.71 for MC60 and 0.83 for CM. These correlation coefficients were obtained on a sample size of 125 matrices. Yet the same trends are manifest for the reduced set of 25 large matrices that we worked with. For instance, the correlation between the dimension N and the resulting K is very small at large N values: 0.04 for MC60 and 0.05 for CM. For the time to solution, the correlation coefficients with respect to N are 0.89 for MC60 and 0.76 for CM.
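For completeness, the Pearson product-moment correlation coefficient used above can be computed as in the sketch below; the two arrays are placeholder data, not the actual (N, time) or (nnz, time) samples from the 125-matrix study.

// Sketch of the Pearson correlation coefficient r = cov(x, y) / (sigma_x * sigma_y).
#include <cmath>
#include <cstdio>
#include <vector>

double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;                     // un-normalized covariance
    double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return cov / std::sqrt(vx * vy);
}

int main() {
    std::vector<double> dim  = {1e4, 5e4, 1e5, 2e5, 5e5};  // matrix dimensions (placeholder)
    std::vector<double> time = {12, 40, 95, 160, 520};     // reordering times in ms (placeholder)
    std::printf("Pearson correlation: %.2f\n", pearson(dim, time));
    return 0;
}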

7.3 Numerical experiments related to sparse linear systems

7.3.1 Profiling results

Figure 7.18 plots statistical results that summarize how the time to solution; i.e., the time for finding x in Ax = b, is spent in SaP::GPU. The raw data used in this analysis is available on-line (SBEL, b). A discussion of exactly what it means to find the solution of the linear system is postponed to §7.3.4. The labels used in Fig. 7.18 are inspired by the notation used in §3.3 and Fig. 3.2. Consider, for instance, the diagonal boosting reordering DB employed by SaP. In a statistical sense, the percentage of the time to solution spent in DB is represented using a median-quartile method to measure statistical spread. The raw data used to generate the DB box was obtained as follows. If a test α that runs to completion requires T_α^{DB} > 0 for DB completion, then this test generates one data entry in an array of data subsequently used to produce the statistical result. The actual entry that is used is R_α^{DB-Tot}, defined in Eq. 7.1a. In other words, the entry is the percentage of time spent, when solving this particular linear system, on performing the diagonal boosting reordering. The bars for the K-reducing reordering (CM), for the multiple data transfers between CPU and GPU (Dtrsf), etc., are similarly obtained. Not all bars in Fig. 7.18 were generated using the same number of data entries; i.e., some tests contributed to some but not all bars. For instance, a symmetric positive definite linear system requires no DB step and, as such, this test will not contribute an entry to the array of data used to determine the DB box in the figure. Of a batch of 90 tests that ran to completion with SaP, the sample population used to generate the bars is as follows: 90 data points for CM, Dtrsf, and Kry; 63 data points for DB; 65 for LU; 32 data points for Drop; and 9 data points for BC, SPK, and LUrdcd. These counts provide insights into the solution path adopted by SaP in solving the 90 linear systems. For instance, the coupled approach; i.e., the SPIKE method of Polizzi and Sameh (2006), has been employed in the solution of nine of the 90 linear systems. The rest of them were solved via SaP::GPU-D. Of the 90 linear systems, 25 were most effectively solved by SaP resorting to diagonal preconditioning; i.e., after DB, all the entries were dropped off except the heavy diagonal ones. Also, note that several of the linear systems considered were symmetric positive definite, which explains the reduced data point count for DB. A statistical analysis of the time spent in the Krylov-subspace component of the solution reveals that its median was 52.74%. The median times for the other components of the solution are listed in the first row of data in Table 7.8. (Note that the values in each row of data do not add up to 100% for several reasons. First, these are statistical median values. Second, there are many tests that do not include all the components of the solution. For instance, SPK is computed based on a set of 16 points while Drop is computed using 32 data points, some of them not even obtained in conjunction with the same test.) The second row of data provides the median values when the Krylov-subspace component, which dwarfs most of the solution components, is eliminated. In this case, the entry for DB, for instance, was obtained based on data points 100 × T_α^{DB} / T_α^{Tot}, where this time around T_α^{Tot} included everything except the time spent in the Krylov-subspace component of the solution.

In other words, T_α^{Tot} is here the time required to compute the preconditioner from scratch. The median values should be used in conjunction with the median-quartile boxplot of Fig. 7.18 for the first row of data, and with Fig. 7.19 for the second row of data. Consider, for instance, the results associated with the drop-off operation. In the Krylov-inclusive measurement, Drop has a median of 5.3%; i.e., half of the 32 tests which employed drop-off spent more than this amount performing the drop-off, while half were quicker. The spread is rather large and there are several outliers that suggest that a handful of tests require a very large amount of time to be spent in the drop-off part of the solution.
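The median-quartile summary used for these box plots amounts to sorting the collected R values and reading off three percentiles. The sketch below illustrates one simple nearest-rank convention over placeholder values; it is not the plotting code used to produce the figures.

// Sketch of a median-quartile summary (25th, 50th, 75th percentiles) of a set
// of R^{Gamma-Tot} percentages.
#include <algorithm>
#include <cstdio>
#include <vector>

double percentile(std::vector<double> v, double p) {
    std::sort(v.begin(), v.end());
    size_t idx = static_cast<size_t>(p * (v.size() - 1) + 0.5);  // nearest-rank convention
    return v[idx];
}

int main() {
    std::vector<double> r = {2.1, 3.5, 4.8, 5.3, 7.9, 12.4, 18.0, 26.5};  // placeholder percentages
    std::printf("quartiles: %.1f / %.1f / %.1f\n",
                percentile(r, 0.25), percentile(r, 0.50), percentile(r, 0.75));
    return 0;
}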

        DB   CM   Dtransf  Drop  Asmbl  BC   LU    SPK   LUrdcd  Kry
Row 1   3.5  1.3  1.5      5.3   0.6    1.6  18.4  35.7  7.6     52.7
Row 2   9.7  3.4  3.8      27.3  1.8    2.4  63.4  45.6  11.7    0

Table 7.8: Median information for the SaP solution components as % of the time for solution. Two scenarios are considered: the first data row provides values when the total time; i.e., 100%, included the time spent by SaP in the Krylov-subspace component. The second row of data is obtained by considering 100% to be the time required to compute a factorization of the preconditioner.

Figure 7.18: Profiling results obtained for a set of 90 linear systems that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.19: Profiling results obtained for a set of 90 linear systems that, out of a collection of 114, could be solved by SaP::GPU.

The results in Fig. 7.18 and Table 7.8 suggest where the optimization efforts should concentrate in the future. For instance, the time required for the CPU↔GPU data transfer is, in the overall picture, rather insignificant and as such a matter of less concern. Somewhat surprisingly, the amount of time spent in drop-off came out higher than anticipated, at least in relative terms. One caveat is that no effort was made to optimize this component of the solution. Instead, the effort went into optimizing the DB and CM solution components. This paid off, as matrix reordering in SaP, particularly for large matrices, is fast when compared to Harwell's, and it reached the point where the drop-off became a more significant bottleneck. Another seemingly surprising observation was the relatively small number of cases in which SaP::GPU-C was preferred over SaP::GPU-D; i.e., in which the SPIKE strategy (Polizzi and Sameh, 2006) was employed. The combination of the two points below serves as an explanation of these results:

• In the current implementation, a large number of iterations associated with a less sophisticated preconditioner is preferred to a smaller count of expensive iterations associated with SaP::GPU-C, if the diagonal dominance d is reasonably large (the sensitivity study with respect to d in §7.1.2 demonstrates this point). Out of a sample population of 90 tests, when invoked, the median number of iterations to solution in SaP::GPU-C was 6.75. Conversely, when SaP::GPU-D was preferred, the median count was 29.375 (SBEL, b).

• The study of the efficiency of the DB reordering algorithm in §7.2.1.1 suggests that for a large fraction of the 114 linear systems discussed in §7.3, the resulting diagonal dominance d and d_{1-pct} are reasonably large: the median of the resulting d is 0.1383, and the median of the resulting d_{1-pct} is 0.2004. These statistical results suggest a large population of linear systems for which SaP::GPU-D is preferable to SaP::GPU-C and SaP::GPU-B.

To determine where the optimization efforts should concentrate in the future for each of the approaches introduced in §3.1, the 90 tests which SaP::GPU successfully solved were categorized by preconditioning method into five types, and profiling results were generated for the tests of each type. These five types are:

(I) Direct solve without applying SaP. SaP::GPU directly solves systems of this type with one partition.

(II) Diagonal preconditioning. SaP::GPU applies a diagonal preconditioner to systems of this type.

(III) Preconditioning with SaP-D. SaP::GPU applies the decoupled approach to systems of this type.

(IV) Preconditioning with SaP-C. SaP::GPU applies the coupled approach to systems of this type.

(V) Preconditioning with SaP-B. SaP::GPU applies the approach based on the BCR algorithm to systems of this type.

Figs. 7.20, 7.21, 7.22, 7.23 and 7.24 illustrate the profiling results obtained for the sets of linear systems of Type I, II, III, IV and V, respectively, successfully solved by SaP::GPU. Table 7.9 lists the median number of iterations and the median percentage of time spent in the Krylov subspace component for each of the five types. Among these 90 tests, 30 tests are of Type I, 25 are of Type II, 19 are of Type III, six are of Type IV, and ten are of Type V. From the categorized information, several observations can be made:

1. Both the number of Krylov iterations and the percentage of time to solution for systems of Type II and Type III are significantly larger than for those of other types. For optimizing the solution of these two types, efforts should concentrate only on optimizing the operations in the Krylov subspace component: the preconditioning solve, SpMV and vector operations.

2. For the direct solve (Type I), LU is the most time-consuming component, thus calling for a more advanced, more parallelizable LU factorization method. Meanwhile, the time spent in the Krylov subspace component is also significant.

3. For the coupled approach (Type IV), the spike calculation SPK, instead of LU, is the largest time consumer when the Krylov subspace component is not taken into account. Although the optimization efforts discussed in §6.1 have been made, SPK continues to be a limiting factor in SaP::GPU-C. This observation is also applicable to the approach based on the BCR algorithm (Type V), with the difference that for Type V, the SPK component is even more expensive than the Krylov subspace component.

4. Both the number of Krylov iterations and the percentage of time to solution for systems of Type V are significantly lower than those of other types. This suggests that, in the current implementation, SaP::GPU-B is the most sophisticated preconditioning method, which attempts to solve a system directly and in a highly parallelized manner. This, in consequence, leads to the high time cost of the SaP::GPU-B method itself.

Figure 7.20: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 30 linear systems of Type I that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.21: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 25 linear systems of Type II that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.22: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 19 linear systems of Type III that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.23: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 6 linear systems of Type IV that, out of a collection of 114, could be solved by SaP::GPU.

Figure 7.24: Profiling results (panels (a) R_{Γ-Tot} and (b) R_{Γ-PreP}) obtained for the set of 10 linear systems of Type V that, out of a collection of 114, could be solved by SaP::GPU.

                               Type-I   Type-II   Type-III   Type-IV   Type-V   Overall
    Median iterations             1      158.3      204.3       5        0.5      13.3
    Median % of time in Krylov   34.1     79.9       85.1      34        14       52.7

Table 7.9: Median information on the Krylov subspace component for each type of linear system. The first row provides the median number of iterations to solution; the second row provides the median percentage of the time to solution spent in the Krylov subspace component.

7.3.2 The impact of the third stage reordering

It is almost always the case that, upon carrying out a CM reordering of a sparse matrix, the resulting matrix A has a small number of entries in its first and last rows. As the row index j increases, the number of nonzeros in row j grows up to approximately j ≈ N/2; thereafter, the nonzero count decreases again, reaching small values towards j ≈ N. Overall, A has its half-bandwidth K dictated by the worst offender. Consequently, a partitioning of A into A_i, i = 1, ..., P, would conservatively require that, for instance, A_1 and A_P work with a large K most likely dictated by a sub-matrix such as A_{P/2}. Allowing each A_i to have its own K_i proved to lead to efficiency gains for two main reasons. First, in SaP-C it led to a reduction in the dimension of the spikes, since for each pair of coupling blocks B_i and C_i the number of columns in the ensuing spikes is the larger of the values K_i and K_{i+1}. Second, SaP::GPU capitalizes on the observation that, since the A_i are independent and governed by their local K_i, nothing prevents a further reordering that attempts to reduce the bandwidth of each A_i. For SaP::GPU-D this reordering can always be applied; for SaP::GPU-C, since applying an additional reordering to the diagonal blocks requires a complete spike calculation, a performance gain is promising (based on the discussion in §3.1.4) only if the decrease in the average K is significant (at least 60%). Because it comes on the heels of the DB and CM reorderings, this step is called a "third stage reordering"; it is applied independently, and preferably concurrently, to the P sub-matrices A_i. As illustrated in Table 7.10, the decrease in the local K_i can be significant and it can lead to non-negligible speedups, see Table 7.11.
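The bookkeeping behind these per-partition bandwidths is straightforward. The sketch below is not SaP::GPU code; it assumes equally sized contiguous partitions and 0-based CSR indexing, computes each K_i from the entries that fall inside the diagonal block A_i, and evaluates the spike widths max(K_i, K_{i+1}) used by SaP-C.

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // Local half-bandwidth K_i of each diagonal block A_i for a CSR matrix
    // split into P contiguous, (nearly) equal partitions.
    std::vector<int> localHalfBandwidths(int N, int P,
                                         const std::vector<int>& rowPtr,
                                         const std::vector<int>& colIdx) {
        const int size = (N + P - 1) / P;                 // nominal partition size
        std::vector<int> K(P, 0);
        for (int row = 0; row < N; ++row) {
            const int part  = std::min(row / size, P - 1);
            const int first = part * size;
            const int last  = std::min(first + size, N) - 1;
            for (int jj = rowPtr[row]; jj < rowPtr[row + 1]; ++jj) {
                const int col = colIdx[jj];
                if (col >= first && col <= last)          // count entries inside A_i only
                    K[part] = std::max(K[part], std::abs(col - row));
            }
        }
        return K;
    }

    // In SaP-C, the spikes coupling partitions i and i+1 have max(K_i, K_{i+1}) columns.
    std::vector<int> spikeWidths(const std::vector<int>& K) {
        std::vector<int> w;
        for (std::size_t i = 0; i + 1 < K.size(); ++i)
            w.push_back(std::max(K[i], K[i + 1]));
        return w;
    }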

ANCF31770 (P = 20, K reduction 64.75%)
    K_i before 3rd SR: 123, 170, 204, 229, 247, 247, 247, 247, 248, 242, 213, 181, 134, 68, 106, 129, 124, 124, 113, 82
    K_i after 3rd SR:   89,  92,  79,  46,  45,  48,  48,  59,  50,  58,  72,  98,  64, 56,  42,  36,  54,  49,  59, 82

ANCF88950 (P = 20, K reduction 67.63%)
    K_i before 3rd SR: 194, 274, 337, 387, 410, 410, 410, 410, 410, 405, 352, 296, 227, 116, 176, 208, 204, 204, 191, 137
    K_i after 3rd SR:  116,  74,  65, 109, 112,  97, 100,  93,  97, 114, 116,  56,  88,  75, 116,  50,  96,  97, 118,  75

af23560 (P = 10, K reduction 65.39%)
    K_i before 3rd SR: 274, 317, 317, 317, 320, 339, 334, 317, 314, 283
    K_i after 3rd SR:  140,  71,  71, 102,  74, 123, 127, 119, 114, 143

NetANCF40by40 (P = 16, K reduction 75.76%)
    K_i before 3rd SR: 256, 378, 458, 533, 599, 634, 578, 517, 436, 343, 215, 210, 275, 295, 257, 178
    K_i after 3rd SR:  125,  68, 122, 118,  85,  93,  97,  91,  57,  69, 112,  85,  85,  73, 113, 101

bayer01 (P = 8, K reduction 80.20%)
    K_i before 3rd SR: 684, 1325, 1308, 1288, 879, 501, 493, 508
    K_i after 3rd SR:  532,  170,  122,  110, 109, 110, 110, 121

finan512 (P = 16, K reduction 76.98%)
    K_i before 3rd SR: 1124, 1287, 1316, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1331, 1015
    K_i after 3rd SR:   587,  288,  288,  288,  288,  288,  288,  288,  288,  288,  288,  288,  288,  288,  227,  211

gridgena (P = 6, K reduction 71.97%)
    K_i before 3rd SR: 247, 405, 405, 405, 405, 247
    K_i after 3rd SR:  132,  81,  80, 122,  72, 105

lhr10c (P = 6, K reduction 20.37%)
    K_i before 3rd SR: 315, 348, 288, 166, 156, 259
    K_i after 3rd SR:  163, 156, 259, 217, 247, 178

rma10 (P = 10, K reduction 27.16%)
    K_i before 3rd SR: 180, 281, 702, 678, 495, 637, 560, 495, 478, 545
    K_i after 3rd SR:  155, 241, 647, 540, 254, 496, 422, 217, 349, 358

Table 7.10: Examples of matrices for which the third stage reordering (3rd SR) significantly reduced the block half-bandwidths K_i of the diagonal blocks A_i, i = 1, ..., P.

                     w/o 3rd SR        w/ 3rd SR
    Mat. Name         P      K_i        P      K_i      SpdUp
    ANCF31770        16      248       20       98      1.203
    ANCF88950        32      410       20      118      1.537
    af23560          10      339       10      143      1.238
    NetANCF40by40    16      634       16      125      1.900
    bayer01           8     1325        8      532      2.234
    finan512         10     1331       16      587      1.804
    gridgena          6      405        6      132      1.636
    lhr10c            4      427        6      259      1.228
    rma10            10      702       10      647      1.113

Table 7.11: Speedup ("SpdUp") values obtained upon embedding a third stage reordering step in the solution process, a decision that also changed the number of partitions P yielding the best performance. To allow correlating these results with the values provided in Table 7.10, this table lists for each matrix A the largest of its K_i values, i = 1, ..., P.

7.3.3 Comparison against state-of-the-art direct linear solvers on the CPU

A set of 114 matrices, of which 105 come from the University of Florida sparse matrix collection, is used herein to compare the robustness and time to solution of SaP::GPU, PARDISO, SuperLU, and MUMPS. This set was selected on the following basis: at least one of the four solvers can retrieve the solution x within 1% relative accuracy. For a sparse linear system A_s x = b, the method of manufactured solutions was used to measure the relative accuracy: an exact solution x⋆ was first chosen and the right-hand side was then set to b = A_s x⋆. Each sparse linear solver attempted to produce an approximation x of the solution x⋆. If this approximation satisfied ‖x − x⋆‖₂ / ‖x⋆‖₂ ≤ 0.01, the solve was considered successful. Given that SaP::GPU is an iterative solver, its initial guess is always x^(0) = 0_N. Although in many instances the initial guess can be selected to be relatively close to the actual solution, this situation is avoided here by choosing x⋆ far from the aforementioned initial guess. Specifically, x⋆ had its entries roughly distributed on a parabola starting from 1.0 as the first entry, reaching approximately 400 at N/2, and decreasing back to 1.0 for the N-th and last entry of x⋆.
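As an illustration, the sketch below (not the actual test harness; the parabola coefficients are one possible choice matching the description above) builds such an x⋆, forms b = A_s x⋆ for a matrix stored in CSR format, and applies the 1% acceptance test.

    #include <cmath>
    #include <vector>

    // One possible manufactured solution: a parabola with value 1 at both ends
    // and a peak of roughly 400 near entry N/2.
    std::vector<double> manufacturedSolution(int N) {
        std::vector<double> x(N);
        const double mid = 0.5 * (N - 1);
        for (int j = 0; j < N; ++j) {
            const double t = (j - mid) / mid;      // t in [-1, 1]
            x[j] = 1.0 + 399.0 * (1.0 - t * t);    // 1 at the ends, ~400 at N/2
        }
        return x;
    }

    // b = A * x for A stored in CSR format.
    std::vector<double> spmvCSR(const std::vector<int>& rowPtr,
                                const std::vector<int>& colIdx,
                                const std::vector<double>& val,
                                const std::vector<double>& x) {
        std::vector<double> b(rowPtr.size() - 1, 0.0);
        for (std::size_t i = 0; i + 1 < rowPtr.size(); ++i)
            for (int jj = rowPtr[i]; jj < rowPtr[i + 1]; ++jj)
                b[i] += val[jj] * x[colIdx[jj]];
        return b;
    }

    // Acceptance test: ||x - xStar||_2 / ||xStar||_2 <= 0.01.
    bool solveAccepted(const std::vector<double>& x, const std::vector<double>& xStar) {
        double errSq = 0.0, refSq = 0.0;
        for (std::size_t j = 0; j < xStar.size(); ++j) {
            const double d = x[j] - xStar[j];
            errSq += d * d;
            refSq += xStar[j] * xStar[j];
        }
        return std::sqrt(errSq) <= 0.01 * std::sqrt(refSq);
    }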

Figure 7.25 employs a median-quartile method to measure the statistical spread, in dimension and number of nonzeros, of the 114 matrices used in this sparse solver comparison. In terms of size, N ranges between 8,192 and 4,690,002; in terms of nonzeros, nnz ranges between 41,746 and 46,522,475. The median of N is 71,328 and the median of nnz is 1,167,967.

Figure 7.25: Statistical information regarding the dimension N and number of nonzeros nnz for the 114 coefficient matrices used to compare SaP::GPU, PARDISO, SuperLU, and MUMPS.

On the robustness side, SaP::GPU failed to solve 24 of the linear systems. In 22 cases, SaP::GPU ran out of GPU global memory; in the remaining two cases it failed to converge. The other solvers failed as follows: PARDISO 40 times, SuperLU 27 times, and MUMPS 35 times. For the 114 linear systems considered, SaP::GPU was therefore the most robust of the four solvers. The systems SaP::GPU failed to solve call for two qualifications. First, the GPU card had 12 GB of GDDR5 memory and, in its current implementation, SaP::GPU is an in-core solver: it does not swap data in and out of the GPU. Consequently, it ran 22 times against this hard memory-size constraint. This issue can be partially alleviated by a better GPU card; indeed, there are cards with as much as 24 GB of global memory, which still falls short of the 64 GB of RAM that PARDISO, SuperLU, and MUMPS could tap into. Second, PARDISO, SuperLU, and MUMPS were used with their default settings; adjusting the parameters that control these solvers' solution process would likely increase their success rates. Interestingly, for the 114 linear systems considered and the three state-of-the-art solvers against which SaP::GPU is compared, there was a perfect negative correlation between speed and robustness. PARDISO was the fastest, followed by MUMPS, then SaP::GPU, and finally SuperLU. Of the 60 linear systems solved by both SaP::GPU and PARDISO, SaP::GPU was faster 21 times. Of the 75 linear systems solved by both SaP::GPU and SuperLU, SaP::GPU was faster 42 times. Of the 64 linear systems solved by both SaP::GPU and MUMPS, SaP::GPU was faster 27 times. Next, the four solvers are compared using a median-quartile method to measure the statistical spread. Let T_α^{SaP::GPU} and T_α^{PARDISO} denote the times required by SaP::GPU and PARDISO, respectively, to finish test α. A relative speedup is computed as

S_α^{SaP::GPU−PARDISO} = log_2 ( T_α^{PARDISO} / T_α^{SaP::GPU} ) ,    (7.4)

with S_α^{SaP::GPU−MUMPS} and S_α^{SaP::GPU−SuperLU} computed similarly. These S_α^{SaP::GPU−PARDISO} values, which can be either positive or negative, are collected in a set S^{SaP::GPU−PARDISO} that is used to generate a box plot in Fig. 7.26. The figure also reports results on S^{SaP::GPU−SuperLU} and S^{SaP::GPU−MUMPS}. Note that the number of tests used to produce these statistical measures is different for each comparison: 60 linear systems for S^{SaP::GPU−PARDISO}, 75 for S^{SaP::GPU−SuperLU}, and 64 for S^{SaP::GPU−MUMPS}. The median values for S^{SaP::GPU−PARDISO}, S^{SaP::GPU−SuperLU}, and S^{SaP::GPU−MUMPS} are −1.6459, 0.1558, and −0.3505, respectively. These results suggest that, when it finishes, PARDISO can be expected to be roughly three times faster than SaP::GPU (2^{1.6459} ≈ 3.1). MUMPS is marginally faster than SaP::GPU, which in turn can be expected to be only slightly faster than SuperLU.
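For concreteness, the sketch below (illustrative only; the variable names are not taken from the SaP::GPU test harness) computes the S_α values of Eq. (7.4) for one solver pair and extracts the median and quartiles reported in the box plots.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // S_alpha = log2(T_other / T_sap) for every test solved by both solvers.
    std::vector<double> relativeSpeedups(const std::vector<double>& T_other,
                                         const std::vector<double>& T_sap) {
        std::vector<double> S;
        for (std::size_t a = 0; a < T_sap.size(); ++a)
            S.push_back(std::log2(T_other[a] / T_sap[a]));
        return S;
    }

    // Linear-interpolation percentile (p in [0, 1]) of an unsorted sample.
    double percentile(std::vector<double> v, double p) {
        std::sort(v.begin(), v.end());
        const double idx = p * (v.size() - 1);
        const std::size_t lo = static_cast<std::size_t>(idx);
        const double frac = idx - lo;
        return (lo + 1 < v.size()) ? v[lo] * (1.0 - frac) + v[lo + 1] * frac : v[lo];
    }
    // Example: percentile(S, 0.25), percentile(S, 0.50), percentile(S, 0.75).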

Figure 7.26: Statistical spread for SaP::GPU’s performance relative to that of PARDISO, SuperLU, and MUMPS. Referring to Eq. 7.4, the results were obtained using the data sets SSaP::GPU−PARDISO (with 60 values), SSaP::GPU−SuperLU (75 values), and SSaP::GPU−MUMPS (64 values).

Red crosses are used in Fig. 7.26 to show statistical outliers. Favorably, most of SaP::GPU's outliers are large and positive. For instance, there are three linear systems for which SaP::GPU finishes significantly faster than PARDISO, four for which it is significantly faster than SuperLU, and four for which it is significantly faster than MUMPS. On the flip side, there are two tests where SaP::GPU runs significantly slower than MUMPS and one test where it runs significantly slower than SuperLU. The results also suggest that about 50% of the linear systems run in SaP::GPU in the range between "as fast as PARDISO" and "two to three times slower than PARDISO", while 50% run in the range between "four times faster" and "four times slower" than SuperLU. Relative to MUMPS, the situation is similar to that for SuperLU, only slightly shifted towards negative territory: the second and third quartiles suggest that 50% of the linear systems run in SaP::GPU between three times faster and three times slower than MUMPS. Again, favorably for SaP::GPU, the last quartile is long and reaches well into high positive values. In other words, when SaP::GPU beats the competition, it does so by a large margin.

7.3.4 Comparison against a state-of-the-art GPU linear solver

The same set of 114 matrices used in the comparison against PARDISO, SuperLU, and MUMPS was considered to compare SaP::GPU with the sparse direct QR solver in the cuSOLVER library (NVIDIA, 2015). The cuSOLVER QR solver was run in two configurations: with and without the application of a reverse Cuthill–McKee (RCM) reordering before solving the system. RCM was optionally applied because it can potentially reduce the QR factorization fill-in.

On the robustness side, cuSOLVER successfully solved 42 out of the 114 systems when using either configuration. Only one linear system, ex11, was solved successfully by cuSOLVER but not by SaP::GPU. In all 69 systems that cuSOLVER failed to solve, the implementation ran out of memory.

On the performance side, the relative speedup S_α^{SaP::GPU−cuSOLVER} is computed as in Eq. (7.4). Fig. 7.27 employs the median-quartile method to compare the performance of SaP::GPU and cuSOLVER. The number of tests used to produce the statistical measures is 41, the number of linear systems successfully solved by both SaP::GPU and cuSOLVER. Of these 41 systems, cuSOLVER was faster than SaP::GPU in two cases: ckt11752_tr_0 and ex19. The 25th percentile, median, and 75th percentile of S_α^{SaP::GPU−cuSOLVER} are 2.0094, 2.7471, and 4.3327, respectively. These results indicate that SaP::GPU can be expected to be about 6.7 (≈ 2^{2.7471}) times as fast as cuSOLVER. It is worth noting that a symmetric reordering such as CM or RCM is not the most effective method for reducing QR factorization fill-in. A column reordering introduced in Heggernes and Matstoms (1996) could potentially improve both the robustness and the performance of cuSOLVER. This column reordering, however, is available in neither the cuSOLVER library nor the SaP::GPU library, so the RCM reordering was optionally applied in this set of tests instead.

Figure 7.27: Statistical spread for SaP::GPU's performance relative to that of cuSOLVER. Referring to Eq. (7.4), the results were obtained using the data set S^{SaP::GPU−cuSOLVER} (with 41 values).

7.4 Numerical experiments highlighting multi-GPU support

This section reports the numerical experiments that highlight multi-GPU support. Two sets of experiments were designed: one to highlight the impact of multi-GPU support on the dense banded solver, and one to highlight its impact on the general sparse solver. As stated in §6.4, multi-GPU support is only available for the decoupled approach SaP::GPU-D in the current version of SaP::GPU.

Impact on the dense banded solver. The matrix set and solver configurations are as described in §7.1.3. Similar to the speedup s_{BD} in §7.1.3, we define s_{DvsS} ≡ T_α^{single} / T_α^{dual}, where α is a specific test out of the 60 dense banded tests and the superscripts "single" and "dual" indicate whether one or two GPUs were used for the solution of the test. Fig. 7.28 is the median-quartile box plot of s_{DvsS} over 59 tests (both implementations failed in one case, with (N, K) = (1,000,000, 500)).

Out of the 59 tests, s_{DvsS} > 1 was observed in 53. The 25th percentile, median, and 75th percentile of s_{DvsS} are 1.2185, 1.4153, and 1.7275, respectively. The results indicate that, for the decoupled approach, leveraging an additional GPU is generally beneficial for the solution of banded systems.

Figure 7.28: Speedup of dual-GPU solution over single-GPU solution on banded systems with a spectrum of N and K.

There are six tests in which s_{DvsS} < 1 was observed; they correspond to combinations of N = 200,000, N = 500,000, or N = 1,000,000 with K = 10 or K = 20, i.e., tests with relatively large N and small K. According to the discussion of the multi-GPU overheads in §6.4, a small K limits the potential computational savings in the LU factorization of each diagonal block and in the preconditioning sweeps, while a large N leads to a scenario in which the savings obtained by leveraging multiple GPUs are offset by the overhead of transferring the diagonal blocks during the preconditioner setup stage and of moving vectors back and forth among the GPUs during the Krylov solve stage. It is worth noting that the tests illustrated in Fig. 7.28 do not highlight the benefit of the decreased per-GPU memory footprint enabled by multi-GPU support; in other words, with an additional GPU, the set of linear systems SaP::GPU successfully solved did not grow within this test set. To show that multi-GPU support helps in memory-critical cases, note that for a test with N = 800,000 and K = 500, where the single-GPU SaP::GPU-D runs out of memory, the dual-GPU approach finishes successfully in about 5.2 seconds.

Impact on the general sparse solver. The matrix set used to highlight the impact of multi-GPU support on the general sparse solver is the set of 19 systems preconditioned with SaP-D described in §7.3.1. s_{DvsS} is defined in the same way as in the dense banded scenario. Fig. 7.29 is the median-quartile box plot of s_{DvsS} over the 19 systems; the 25th percentile, median, and 75th percentile are 0.9389, 1.0219, and 1.1272, respectively. Somewhat surprisingly, multi-GPU support is not as impactful for general sparse systems as it is for dense banded systems. A comparison of the time spent in each SaP-D component by the single-GPU and dual-GPU approaches suggests that multi-GPU support introduces non-negligible overhead when assembling the banded matrices on the GPUs from a sparse matrix on the CPU; this overhead is not significant when solving dense banded systems.

Figure 7.29: Speedup of dual-GPU solution over single-GPU solution on the 19 systems preconditioned with SaP::GPU-D.

8 use of sap in computational multibody dynamics problems

In this chapter, we showcase how SaP::GPU is used in a computational engineering application. In multibody dynamics analysis, the simulation of compliant yet stiff structures results in numerically stiff systems of differential equations of motion (EOM). Explicit numerical integrators are known to be inefficient on such problems since, for stability reasons, they must use small step sizes, much smaller than dictated by approximation requirements alone (Hairer and Wanner, 1996). Alternatively, implicit integrators can stably solve stiff problems using much larger step sizes, yet they require at each time step the solution of a system of nonlinear algebraic equations. The nonlinear system is typically solved using a Newton-type method (Atkinson, 1989), which in turn calls for the solution of several linear systems at each integration time step. The lack of an open-source parallel sparse linear solver capable of solving these linear systems robustly and efficiently on graphics processing unit (GPU) cards is what motivated the use of SaP::GPU here. The multibody system dynamics solution calls for performing two tasks at each integration time step: (i) formulating the system EOM, followed by (ii) numerically solving the resulting differential-algebraic equations (DAE). The formulation of the system EOM was introduced in detail in Serban et al. (2015), where the Absolute Nodal Coordinate Formulation (ANCF) approach (Berzeri et al., 2001; von Dombrowski, 2002) was used to formulate the EOM for compliant systems with beam elements. Here we focus on how the resulting DAEs are solved. Serban et al. (2015) uses the Newmark integration scheme (Newmark, 1959), originally proposed in the structural dynamics community for the numerical integration of a linear set of second-order ODEs, and obtains the nonlinear set of equations (8.1) and (8.2):

(M q̈)_{n+1} + (Φ_q^T λ)_{n+1} + (Q^s − Q^a)_{n+1} = 0 ,    (8.1)

Φ(q_{n+1}, t_{n+1}) = 0 .    (8.2)

Using the Newmark integration formulas, the position and velocity at the new time step t_{n+1} are obtained in terms of the acceleration q̈_{n+1} as

q_{n+1} = q_n + h q̇_n + (h²/2) [(1 − 2β) q̈_n + 2β q̈_{n+1}] ,    (8.3a)

q̇_{n+1} = q̇_n + h [(1 − γ) q̈_n + γ q̈_{n+1}] ,    (8.3b)

where h is the integration step size, while γ ≥ 1/2 and β ≥ (γ + 1/2)²/4 are two user-selected parameters that control the amount of numerical damping associated with the integration formula. At each time step t_{n+1}, the numerical solution begins by solving the nonlinear set of equations (8.1) and (8.2) for the unknowns q̈_{n+1} and λ_{n+1}, using Eqs. (8.3) to eliminate the generalized positions and velocities. This is made clear by rewriting the nonlinear system as Ψ(q̈_{n+1}, λ_{n+1}) = 0_{p+m}, where

Ψ(q̈_{n+1}, λ_{n+1}) ≡ [ (M q̈)_{n+1} + Φ_q^T λ_{n+1} + (Q^s − Q^a)_{n+1} ,  Φ(q_{n+1}, t_{n+1}) / (β h²) ]^T .    (8.4)

The kinematic constraint equations above were scaled to improve the condition number of the Jacobian matrix (Negrut et al., 2007; Bottasso et al., 2007; Negrut et al., 2009). The nonlinear algebraic system in Eq. (8.4) is solved numerically using a Newton-type iterative algorithm. The required Jacobian J_{n+1} = [∂Ψ/∂q̈_{n+1} , ∂Ψ/∂λ_{n+1}] is relatively straightforward to obtain; the most computationally intensive term is the evaluation of the partial derivative of the internal forces with respect to the generalized coordinates, i.e., the stiffness matrix K_s ≡ ∂Q^s/∂e.
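As an aside, the Newmark update in Eqs. (8.3) maps directly to code. The sketch below is illustrative only; it is not Chrono or SaP::GPU code and treats the generalized coordinates as plain host-side vectors.

    #include <vector>

    // Newmark update, Eqs. (8.3): given the state at t_n and the new acceleration
    // qdd_new = q̈_{n+1}, compute q_{n+1} and q̇_{n+1} in place.
    void newmarkUpdate(std::vector<double>& q, std::vector<double>& qd,
                       const std::vector<double>& qdd_old,
                       const std::vector<double>& qdd_new,
                       double h, double beta, double gamma) {
        for (std::size_t i = 0; i < q.size(); ++i) {
            // Eq. (8.3a): uses the old velocity qd[i] = q̇_n, so update q first.
            q[i]  += h * qd[i] + 0.5 * h * h * ((1.0 - 2.0 * beta) * qdd_old[i]
                                                + 2.0 * beta * qdd_new[i]);
            // Eq. (8.3b)
            qd[i] += h * ((1.0 - gamma) * qdd_old[i] + gamma * qdd_new[i]);
        }
    }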

Using a Newton-type method, the solution {q̈_{n+1}, λ_{n+1}} of the nonlinear system in Eq. (8.4) is obtained iteratively, requiring successive solutions of linear systems of the form

J^{(k)}_{n+1} δ^{(k)}_{n+1} = −Ψ(q̈^{(k)}_{n+1}, λ^{(k)}_{n+1}) ,    (8.5)

to obtain the correction vector δ^{(k)}_{n+1}, where (k) represents the Newton iteration counter. While this method is known to converge quadratically (Atkinson, 1989), it does so at a significant computational cost related to the Jacobian matrix evaluation and factorization at each iteration. Modified Newton methods alleviate this issue by evaluating and factoring the Jacobian only once per time step and reusing it throughout the nonlinear iterations. Equation (8.5) defines a modified Newton method as long as a direct linear solver is used to determine the correction δ^{(k)}_{n+1}. Solutions that leverage parallel computing often rely on Krylov-subspace methods (Saad, 2003a), in which δ^{(k)}_{n+1} is computed iteratively. The resulting inexact Newton iteration has been shown to be effective in the context of implicit integration of stiff differential equations (Brown et al., 1994). In this case, products of the form J^{(k)}_{n+1} v, for some given vector v, are computed in a matrix-free fashion. The term "inexact" indicates that, unlike when a direct solver is used, δ^{(k)}_{n+1} is computed through an iterative process in an inner loop that can be stopped before δ^{(k)}_{n+1} satisfies Eq. (8.5) exactly. This was the approach embraced in Melanz et al. (2013) using an unpreconditioned BiCGStab linear solver. As discussed in §5.1, Krylov-subspace methods sometimes fail to converge and are almost always inefficient unless an adequate preconditioner is used. A generic linear system Ax = b can be preconditioned on the left as (M^{-1}A)x = M^{-1}b, on the right as (AM^{-1})Mx = b, or on both sides; here we only consider preconditioning on the left. The Krylov-subspace method is then applied to a system with the modified matrix M^{-1}A instead of A. §5 details two preconditioned Krylov-subspace methods: preconditioned CG and preconditioned BiCGStab(ℓ). In this document, we report on a GPU-based general-purpose preconditioner that dovetails with the rest of the equation formulation and solution components of the implementation to produce a GPU-based preconditioned Newton-Krylov solver for flexible multibody dynamics. Preconditioner update strategies. The Jacobian matrix needs to be updated as often as possible to maintain a good convergence rate in the Newton iteration. For the example problems considered in this document, the preconditioner matrix might be updated (a minimal sketch of the corresponding trigger logic follows the list below):

(a) upon starting a new simulation;

(b) when the average number of Krylov iterations per Newton iteration reaches a user-specified threshold;

(c) upon taking more than a user-specified number of integration steps since the last evaluation.
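The sketch below illustrates triggers (b) and (c); it is a minimal illustration with placeholder names and thresholds, not the Chrono or SaP::GPU implementation.

    // Decide whether to re-evaluate and re-factorize the preconditioner before
    // the Newton solve at the current time step (triggers (b) and (c) above).
    struct PrecondUpdatePolicy {
        double maxAvgKrylovIters;    // trigger (b): threshold on Krylov iterations
        int    maxStepsSinceUpdate;  // trigger (c): threshold on elapsed steps
    };

    bool needPrecondUpdate(const PrecondUpdatePolicy& policy,
                           double avgKrylovItersLastStep,
                           int    stepsSinceLastUpdate) {
        return avgKrylovItersLastStep >= policy.maxAvgKrylovIters ||
               stepsSinceLastUpdate   >  policy.maxStepsSinceUpdate;
    }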

These update strategies raise two issues: first, the Jacobian evaluation is costly; second, once a new Jacobian is available, it needs to be factorized, which further increases the computational cost. The Jacobian evaluation/factorization cost shows up as the spikes in Fig. 8.3. Although for the flexible multibody dynamics application considered here the preconditioner updates need not be frequent, i.e., the cost spikes in the plot are not numerous, there might be cases where more frequent updates, for instance once every several time steps, are required. If these updates are not carried out, the advantage of a Newton-Krylov method is offset by poor convergence rates.

Figure 8.1: Dual-GPU method of staggered solve/preconditioner update

To address this potential problem and to reduce overall simulation times, a dual-GPU method of staggered solve/preconditioner update is proposed. Therein, at least two GPUs (denoted D0 and D1) are assumed to be available. In this method, both D0 and D1 maintain a copy of the up-to-date linear operator, for instance the Jacobian matrix in Eq. (8.5). Fig. 8.1 illustrates how the dual-GPU method works. After D0 gets the initial preconditioner ready, it sends a message to the CPU, which prompts the CPU to launch the kernel on D1 to prepare a new preconditioner; D0 then proceeds to solve linear systems. In turn, when D1 completes the factorization of the up-to-date Jacobian, it sends the CPU a swap request, which implies that as soon as D0 finishes its work in the current time step, D0 and D1 should swap their roles; i.e., D1 becomes responsible for solving the Newton system while D0 updates the preconditioner. D1 then solves the linear systems until it receives a request to swap its role with D0's yet again. With this method, at any point in time either D0 or D1 is solving the system (for succinctness, the device solving the system is sometimes called the working device, while the other device is called the preconditioning device) and the time spent updating the preconditioner is completely overlapped with the solution process. In other words, there are no spikes like the ones in Fig. 8.3. Additionally, because the preconditioner is more up-to-date, fewer iterations are needed for convergence, which further reduces the simulation time. This methodology can in principle be extended to more than two GPUs and essentially amounts to obtaining "fresh" factorized Jacobians more often. There is another type of swap request that deals with the case in which the Jacobian matrix experiences significant changes (also displayed in Fig. 8.1), a scenario that renders an already factorized Jacobian stale. In this case, a GPU, say D1, solves the linear system for a while and finds it necessary to update the preconditioner owing to a sudden increase in the number of Krylov iterations. D1 then sends a swap request to the CPU, urging D0 to start solving the system directly once its preconditioner is ready, stops solving linear systems, and immediately starts to update its own preconditioner. Although this should be an infrequent (and undesirable) case, a certain part of the evaluation and factorization costs is still hidden. This implies that even in this undesirable case the method is not worse off, provided the overhead of sending the swap request, which is at most several bytes, is discarded. Two remarks are in order regarding the dual-GPU staggered method and the multi-GPU support of SaP::GPU. First, note that the dual-GPU method discussed in this section operates at the level of the Newton iterations, which is different from the multi-GPU support of SaP::GPU discussed in §6.4. In other words, supporting the dual-GPU staggered method requires no modifications to the algorithms inside SaP::GPU, whereas the multi-GPU support of §6.4, which is independent of any particular application or Newton method, required adding or changing several interfaces inside SaP::GPU. Second, with the current implementation of SaP::GPU it is risky, or at least not beneficial, to combine the multi-GPU version of SaP::GPU with the dual-GPU staggered method. Assume the dual-GPU version of SaP::GPU is used together with the dual-GPU staggered method in a certain application, and assume that at a certain time Solver 0 is approaching the solution of the linear system A0 x0 = b0 allocated on D0 (its preconditioner M0 is ready and allocated on D0 and D2), while the preconditioner M1 of the linear system A1 x1 = b1 allocated on D1 is under preparation by Solver 1. Although Solver 0 keeps track of both the GPU index of the matrix A0 (i.e., 0 in this example) and all GPU indices of the preconditioner M0 (i.e., 0 and 2 in the example), this information is not exposed to Solver 1 or to the user of SaP::GPU for security reasons. Hence, there is no guarantee that M1 is allocated on two GPUs other than D0 and D2, which precludes the performance gain the user might expect.

Simulation environment and method. The open-source software infrastructure that SaP::GPU is integrated with is called Chrono (Mazhar et al., 2013; Project Chrono). Its Chrono::Flex module provides support for various flexible elements, bilateral constraints (joints), and contact with friction. It offers several solution methods for the resulting EOMs and uses GPU acceleration in both the equation formulation and equation solution stages. We compare the SaP::GPU preconditioned Newton-Krylov solution against several non-preconditioned GPU-based solution approaches (BiCGStab, BiCGStab(2), and MINRES).

Comparison of Krylov solvers. We begin with a discussion of the relative performance of various solvers on the 40 × 40 flexible net test problem with a modulus of elasticity E = 2 · 10^7 Pa. We performed 5-second-long simulations with a step size h = 10^{-3} s using the following solvers: unpreconditioned and preconditioned BiCGStab, unpreconditioned and preconditioned BiCGStab(2), and unpreconditioned MINRES.

The BiCGStab solver is included in this study to allow a comparison with the results presented in Melanz et al. (2013), which were obtained using an unpreconditioned BiCGStab solver. In many situations, in particular for the problems considered here, the linear system of Eq. (8.5) is a saddle-point problem¹. We therefore include in this comparison the MINRES method (Paige and Saunders, 1975), which was specifically designed to solve linear systems with symmetric, possibly indefinite matrices. In its standard form, MINRES can only be preconditioned with a positive definite preconditioner matrix. Unfortunately, for a saddle-point problem the resulting SaP::GPU preconditioner is indefinite and thus not appropriate for preconditioning MINRES. While we plan on extending SaP::GPU to support preconditioning of saddle-point problems, here we only consider unpreconditioned MINRES. Results are shown in Fig. 8.2. Taking the preconditioned BiCGStab(2) as the base case, the resulting speedups over the unpreconditioned solvers, which held effectively constant across the entire simulation time range, were as follows: 10.8 over unpreconditioned BiCGStab, 8.3 over unpreconditioned BiCGStab(2), and 3.6 over unpreconditioned MINRES. We note that, although the results in Fig. 8.2 show the preconditioned BiCGStab as the most effective solver for this particular problem, we found that it lacks overall robustness and often displays convergence stagnation (Gutknecht, 1993; Sleijpen and Fokkema, 1993). Therefore, in the subsequent update-strategy study, we restricted ourselves to using only preconditioned BiCGStab(2).

¹This may not be the case in general if, for example, the system includes controls.

[Figure 8.2 plots the execution time per step (s) against simulation time (s) for five solver configurations: BiCGStab − none, BiCGStab2 − none, MINRES − none, BiCGStab2 − SPIKE, and BiCGStab − SPIKE.]

Figure 8.2: Timing comparison for various Krylov solvers, with or without preconditioning. The sharp increases in the curves for the two preconditioned solvers correspond to preconditioner updates which, for this study, were enforced every 0.5 s. [40 × 40 net; E = 2 · 10^7 Pa; h = 10^{-3} s]

Comparison of preconditioner update strategies. We investigate next the effect of different preconditioner update strategies on overall performance. As mentioned earlier in this section, the goal is to strike a balance between the high relative cost of a preconditioner evaluation and the decreased convergence rate caused by using out-of-date Jacobian information. For this study, we use again the 40 × 40 flexible net, simulated over a 5 s interval using a step size h = 10^{-3} s. Since the impact of an out-of-date preconditioner on the convergence rate of the BiCGStab(2) solver increases with problem stiffness, we use here a modulus of elasticity E = 2 · 10^8 Pa. We note that the same trends, albeit less pronounced, are obtained at lower stiffness. Figure 8.3 illustrates the execution time per time step for four different preconditioner update strategies:

1. No update, where the preconditioner was evaluated only at time t = 0 and kept fixed for the entire duration;

2. Uniform update, where a preconditioner update was enforced at uniformly spaced time steps, in this case every 500 steps (i.e., every 0.5 s);

3. Adaptive update, where a preconditioner update was triggered after any integration step that required more than 10 Krylov iterations, cumulative over all Newton iterations at that step;

4. Dual-GPU staggered update, where a preconditioner update was triggered as frequently as possible by the preconditioning device.

Compared to the no-update scenario, employing a uniform update strategy results in an overall speedup of 1.19, the adaptive update results in a speedup of 1.47, and the dual-GPU staggered update results in a speedup of 1.75. We note, however, that the selection of the various algorithmic parameters in the preconditioner update strategy is highly problem dependent and that achieving optimal performance requires some experimentation.

Figure 8.3: Comparison of different preconditioner update strategies for the preconditioned BiCGStab(2) solver. The solid lines in the foreground are filtered values of the raw measured timing data displayed in the background. [40 × 40 net; E = 2 · 10^8 Pa; h = 10^{-3} s]

9 conclusions and future work

This work discusses parallel strategies to (i) solve dense banded linear systems; (ii) solve sparse linear systems; and (iii) perform matrix reorderings for diagonal boosting and bandwidth reduction. The salient feature shared by these strategies is that they are designed to run in parallel on GPU cards. BSD-3 open-source implementations of all these strategies are available at SBEL (a,b) as part of a software package called SaP::GPU. As far as the parallel solution of linear systems is concerned, the strategies discussed are in-core; i.e., there is no host-device (CPU-GPU) memory swapping, which somewhat limits the size of the problems that can presently be solved by SaP::GPU.

Over a broad range of dense matrix sizes and bandwidths, SaP::GPU is likely to run two times faster than Intel's MKL, with even higher speedups relative to LAPACK. This conclusion should be modulated by hardware considerations and also by the observation that the diagonal dominance of the dense banded matrix is a performance factor.

On the sparse linear system side, the most surprising result was the robustness of SaP::GPU. Out of a set of 114 tests, most of them using matrices from the University of Florida sparse matrix collection, SaP::GPU failed only 24 times, of which 22 were "out-of-memory" failures owing to a 12 GB limit on the size of the GPU memory. SaP::GPU was compared against PARDISO, MUMPS, SuperLU, and cuSOLVER; among all these solvers, SaP::GPU proved the most robust for the 114 linear systems considered. In terms of performance, PARDISO was the fastest, followed by MUMPS, SaP::GPU, and SuperLU. Because the diagonal boosting algorithm is effective at increasing the diagonal dominance of a matrix, the straight split-and-parallelize strategy, without the coupling involved in the SPIKE-type strategy or the recursive cyclic strategy, emerged as the solution approach most often adopted by SaP::GPU.

The implementation of SaP::GPU is somewhat peculiar in that the sparse solver builds on top of the dense banded one. The sparse-to-dense transition occurs via two reorderings: one that boosts the diagonal entries and one that reduces the matrix bandwidth. Herein, they were implemented as CPU/GPU hybrid solutions, which were compared against Harwell's implementations and found to be twice as fast for the diagonal boosting reordering and of comparable speed for the bandwidth reduction.

Many issues remain to be investigated at this point. First, given that more than 50% of the time to solution is spent in the iterative solver, it is worth considering the techniques analyzed in Li et al. (2015), which sometimes double the flop rate of sparse matrix-vector multiplication operations upon changing the matrix storage scheme; i.e., moving from CSR to ELL or hybrid. Second, an out-of-core and/or multi-GPU implementation would enable SaP::GPU to handle larger problems while possibly reducing time to solution. Third, the CM bandwidth reduction strategy implemented is dated; spectral and/or hyper-graph partitioning for load balancing should lead to superior splitting of the coefficient matrix. Finally, as it stands, with the exception of parts of the matrix reordering, SaP::GPU is entirely a GPU solution. It would be worth investigating how the CPU could be involved in other phases of the implementation. Such an investigation would be well justified given the imminent tight integration of the CPU and GPU memories.

a namespace index

A.1 Namespace List

Here is a list of all documented namespaces with brief descriptions:

sap    sap is the top-level namespace which contains all SaP functions and types

b class index

B.1 Class Hierarchy

This inheritance list is sorted roughly, but not completely, alphabetically:

sap::Graph< T >::AbsoluteValue< VType >
sap::Graph< T >::AccumulateEdgeWeights
sap::BandedMatrix< Array >
sap::BiCGStabLMonitor< SolverVector >
sap::Graph< T >::ClearValue
sap::Graph< T >::CompareValue< VType >
sap::Graph< T >::Difference
sap::Graph< T >::EdgeLength
sap::Graph< T >::EqualTo< Type >
sap::Graph< T >::Exponential
sap::Graph< T >::GetCount
sap::Graph< T >
sap::Graph< T >::is_not
sap::Graph< T >::Map< Type >
sap::Monitor< SolverVector >
sap::Multiply< T >
sap::MVBanded< Matrix >
sap::Options
sap::Graph< T >::PermutedEdgeLength
sap::Graph< T >::PermuteEdge
sap::Precond< PrecVector >
sap::SegmentedMatrix< Array, MemorySpace >
sap::SmallerThan< T >
sap::Solver< Array, PrecValueType >
sap::SpmvCusp< Matrix >
sap::SpmvCuSparse< Matrix >
sap::Graph< T >::Square
sap::Stats
sap::strided_range< Iterator >::stride_functor
sap::strided_range< Iterator >
sap::system_error
sap::Timer
    sap::CPUTimer
    sap::GPUTimer

c class index

C.1 Class List

Here are the classes, structs, unions and interfaces with brief descriptions:

sap::Graph< T >::AbsoluteValue< VType >
sap::Graph< T >::AccumulateEdgeWeights
sap::BandedMatrix< Array >
sap::BiCGStabLMonitor< SolverVector >
sap::Graph< T >::ClearValue
sap::Graph< T >::CompareValue< VType >
sap::CPUTimer    CPU timer
sap::Graph< T >::Difference
sap::Graph< T >::EdgeLength
sap::Graph< T >::EqualTo< Type >
sap::Graph< T >::Exponential
sap::Graph< T >::GetCount
sap::GPUTimer    GPU timer
sap::Graph< T >
sap::Graph< T >::is_not
sap::Graph< T >::Map< Type >
sap::Monitor< SolverVector >
sap::Multiply< T >
sap::MVBanded< Matrix >
sap::Options    Input solver options
sap::Graph< T >::PermutedEdgeLength
sap::Graph< T >::PermuteEdge
sap::Precond< PrecVector >    SaP preconditioner
sap::SegmentedMatrix< Array, MemorySpace >
sap::SmallerThan< T >
sap::Solver< Array, PrecValueType >    Main SaP::GPU solver
sap::SpmvCusp< Matrix >    Default SPMV functor class
sap::SpmvCuSparse< Matrix >
sap::Graph< T >::Square
sap::Stats    Output solver statistics
sap::strided_range< Iterator >::stride_functor
sap::strided_range< Iterator >
sap::system_error
sap::Timer    Base timer class

d file index

D.1 File List

Here is a list of all documented files with brief descriptions:

sap/banded_matrix.h    Definition of the banded matrix class
sap/bicgstab.h    BiCGStab preconditioned iterative Krylov solver
sap/bicgstab2.h    BiCGStab(L) preconditioned iterative Krylov solver
sap/common.h    Definition of commonly used macros and solver configuration constants
sap/exception.h    Definition of the SaP system_error exception class
sap/graph.h    Definition of an auxiliary class for the Preconditioner class, used for reordering, dropping off elements, and assembling the banded matrix
sap/minres.h    MINRES preconditioned iterative solver
sap/monitor.h    Two classes for monitoring the progress of iterative linear solvers, checking for convergence, and stopping on various error conditions
sap/precond.h    Definition of the SaP preconditioner class
sap/segmented_matrix.h    Definition of a special data structure used in the Block Cyclic Reduction algorithm implementation
sap/solver.h    Definition of the main SaP solver class
sap/spmv.h    Definition of the default Spike SPMV functor class
sap/strided_range.h    Implementation of the strided_range class, which allows non-contiguous access to data in a thrust vector
sap/timer.h    CPU and GPU timer classes

e namespace documentation

E.1 sap Namespace Reference

sap is the top-level namespace which contains all SaP functions and types.

Classes

• class BandedMatrix
• class system_error
• class Graph
• class Monitor
• class BiCGStabLMonitor
• class Precond
  SaP preconditioner.
• struct Multiply
• struct SmallerThan
• class SegmentedMatrix
• struct Options
  Input solver options.
• struct Stats
  Output solver statistics.
• class Solver
  Main SaP::GPU solver.
• class MVBanded
• class SpmvCusp
  Default SPMV functor class.
• class SpmvCuSparse
• class strided_range
• class Timer
  Base timer class.
• class GPUTimer
  GPU timer.
• class CPUTimer
  CPU timer.

Enumerations

• enum KrylovSolverType { BiCGStab_C, GMRES_C, CG_C, CR_C, BiCGStab1, BiCGStab2, BiCGStab, MINRES } • enum FactorizationMethod { LU_UL, LU_only } • enum PreconditionerType { Spike, Block, None }

Functions

• template<...> void bicgstab (LinearOperator &A, Vector &x, Vector &b, Monitor &monitor, Preconditioner &M)
  Preconditioned BiCGStab Krylov method.
• template<...> void bicgstabl (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Preconditioned BiCGStab(L) Krylov method.
• template<...> void bicgstab1 (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Specialization of the generic sap::bicgstabl function for L=1.
• template<...> void bicgstab2 (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Specialization of the generic sap::bicgstabl function for L=2.
• void kernelConfigAdjust (int &numThreads, int &numBlockX, const int numThreadsMax)
• void kernelConfigAdjust (int &numThreads, int &numBlockX, int &numBlockY, const int numThreadsMax, const int numBlockXMax)
• template<...> void minres (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Preconditioned MINRES method.

Variables

• const unsigned int BLOCK_SIZE = 512 • const unsigned int MAX_GRID_DIMENSION = 32768 • const unsigned int CRITICAL_THRESHOLD = 70

E.1.1 Detailed Description

sap is the top-level namespace which contains all SaP functions and types.

E.1.2 Enumeration Type Documentation

E.1.2.1 enum sap::KrylovSolverType

This defines the types of Krylov subspace methods used in SaP.

E.1.2.2 enum sap::PreconditionerType

This defines the types of SaP preconditioners.

E.1.3 Function Documentation

E.1.3.1 template<...> void sap::bicgstab ( LinearOperator &A, Vector &x, Vector &b, Monitor &monitor, Preconditioner &M )

Preconditioned BiCGStab Krylov method.

Template Parameters

LinearOperator is a functor class for sparse matrix-vector product. Vector is the vector type for the linear system solution. Monitor is the convergence test object. Preconditioner is the preconditioner.

E.1.3.2 template<...> void sap::bicgstabl ( LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P )

Preconditioned BiCGStab(L) Krylov method.

Template Parameters

LinearOperator is a functor class for sparse matrix-vector product. Vector is the vector type for the linear system solution. Monitor is the convergence test object. Preconditioner is the preconditioner. L is the degree of the BiCGStab(L) method.

E.1.3.3 template<...> void sap::minres ( LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P )

Preconditioned MINRES method.

Template Parameters

LinearOperator is a functor class for sparse matrix-vector product. Vector is the vector type for the linear system solution. Monitor is the convergence test object. Preconditioner is the preconditioner (must be positive definite).

f class documentation

F.1 sap::Graph< T >::AbsoluteValue< VType > Struct Template Reference

Public Member Functions

• __host__ __device__ VType operator() (VType a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.2 sap::Graph< T >::AccumulateEdgeWeights Struct Reference

Public Member Functions

• __host__ __device__ T operator() (WeightedEdgeType a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.3 sap::BandedMatrix< Array > Class Template Reference

#include <sap/banded_matrix.h>

Public Types

• typedef Array::value_type value_type • typedef Array::memory_space memory_space • typedef int index_type

Public Member Functions

• BandedMatrix (int n, int k, double d) • const Vector & getBandedMatrix () const

Public Attributes

• int m_n • int m_k • int num_rows • int num_cols • size_t num_entries

F.3.1 Detailed Description

template<typename Array> class sap::BandedMatrix< Array >

This class defines the banded matrix.

Template Parameters

Array is the array type used to store the matrix (in most cases the same as the solver's array type).

F.3.2 Constructor & Destructor Documentation

F.3.2.1 template<typename Array> sap::BandedMatrix< Array >::BandedMatrix ( int n, int k, double d )

This is the constructor for the BandedMatrix class. It specifies the dimension N, the half-bandwidth K and the diagonal dominance d of the banded matrix. Then it randomly generates a banded matrix with parameters (N, K, d). The documentation for this class was generated from the following file:

• sap/banded_matrix.h
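A minimal usage sketch follows; it is not part of the generated documentation, and the cusp array type is an assumption consistent with the typedefs used by sap::Graph.

    #include <cusp/array1d.h>
    #include <sap/banded_matrix.h>

    void bandedMatrixExample() {
        typedef cusp::array1d<double, cusp::device_memory> Array;
        // Randomly generated banded matrix with N = 10000, half-bandwidth K = 50,
        // and diagonal dominance d = 1.2.
        sap::BandedMatrix<Array> A(10000, 50, 1.2);
        (void)A;
    }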

F.4 sap::BiCGStabLMonitor< SolverVector > Class Template Reference

#include <sap/monitor.h>

Public Types

• typedef SolverVector::value_type SolverValueType

Public Member Functions

• BiCGStabLMonitor (const int maxIterations, const int maxmSteps, const SolverValueType relTol, const SolverValueType absTol=SolverValueType(0)) • virtual void init (const SolverVector &rhs) • virtual bool finished (const SolverVector &r) • virtual bool finished (SolverValueType rNorm) • virtual bool finished () • virtual bool needCheckConvergence (const SolverVector &r) • virtual bool needCheckConvergence (SolverValueType rNorm) 150

• virtual void stop (int code, const char ∗message) • virtual void increment (float incr) • virtual void incrementStag () • virtual void resetStag () • virtual void updateResidual (SolverValueType r) • virtual BiCGStabLMonitor < SolverVector > & operator++ () • virtual int getMaxIterations () const • virtual size_t iteration_limit () const • virtual SolverValueType getRelTolerance () const • virtual SolverValueType getAbsTolerance () const • virtual SolverValueType getTolerance () const • virtual SolverValueType getRHSNorm () const • virtual bool converged () const • virtual size_t iteration_count () const • virtual float getNumIterations () const • virtual int getCode () const • virtual const std::string & getMessage () const • virtual SolverValueType getResidualNorm () const • virtual SolverValueType getRelResidualNorm () const

F.4.1 Detailed Description

template<typename SolverVector> class sap::BiCGStabLMonitor< SolverVector >

This class provides support for monitoring the progress of BiCGStab(L) solvers, checking for convergence, and stopping on various error conditions. The documentation for this class was generated from the following file:

• sap/monitor.h 151

F.5 sap::Graph< T >::ClearValue Struct Reference

Public Member Functions

• __host__ __device__ double operator() (double a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.6 sap::Graph< T >::CompareValue< VType > Struct Template Reference

Public Member Functions

• bool operator() (const thrust::tuple< int, VType > &a, const thrust::tuple< int, VType > &b) const

The documentation for this struct was generated from the following file:

• sap/graph.h

F.7 sap::CPUTimer Class Reference

CPU timer.

#include <sap/timer.h>

Inheritance diagram for sap::CPUTimer: sap::CPUTimer derives from sap::Timer.

Public Member Functions

• virtual void Start () • virtual void Stop () • virtual double getElapsed ()

Protected Attributes

• timeval timeStart • timeval timeEnd

F.7.1 Detailed Description

CPU timer. CPU timer using the performance counter for WIN32 and gettimeofday() for Linux. The documentation for this class was generated from the following file:

• sap/timer.h
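A minimal usage sketch follows; it is not part of the generated documentation, and the units returned by getElapsed() are an assumption (milliseconds) since they are not specified here.

    #include <sap/timer.h>

    void timedSection() {
        sap::CPUTimer timer;
        timer.Start();
        // ... work to be timed ...
        timer.Stop();
        double elapsed = timer.getElapsed();  // elapsed wall-clock time
        (void)elapsed;
    }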

F.8 sap::Graph< T >::Difference Struct Reference

Public Member Functions

• __host__ __device__ int operator() (const int &a, const int &b) const 153

The documentation for this struct was generated from the following file:

• sap/graph.h

F.9 sap::Graph< T >::EdgeLength Struct Reference

Public Member Functions

• __host__ __device__ int operator() (NodeType a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.10 sap::Graph< T >::EqualTo< Type > Struct Template Reference

Public Member Functions

• EqualTo (Type l) • __host__ __device__ bool operator() (const Type &a)

Public Attributes

• Type m_local

The documentation for this struct was generated from the following file:

• sap/graph.h 154

F.11 sap::Graph< T >::Exponential Struct Reference

Public Member Functions

• __host__ __device__ double operator() (double a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.12 sap::Graph< T >::GetCount Struct Reference

Public Member Functions

• __host__ __device__ int operator() (int a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.13 sap::GPUTimer Class Reference

GPU timer.

#include <sap/timer.h>

Inheritance diagram for sap::GPUTimer: sap::GPUTimer derives from sap::Timer.

Public Member Functions

• GPUTimer (int g_idx=0) • virtual void Start () • virtual void Stop () • virtual double getElapsed ()

Protected Attributes

• int gpu_idx • cudaEvent_t timeStart • cudaEvent_t timeEnd

F.13.1 Detailed Description

GPU timer. CUDA-based GPU timer. The documentation for this class was generated from the following file:

• sap/timer.h

F.14 sap::Graph< T > Class Template Reference

#include <sap/graph.h>

Classes

• struct AbsoluteValue • struct AccumulateEdgeWeights • struct ClearValue • struct CompareValue • struct Difference • struct EdgeLength • struct EqualTo • struct Exponential • struct GetCount • struct is_not • struct Map • struct PermutedEdgeLength • struct PermuteEdge • struct Square

Public Types

• typedef cusp::coo_matrix< int, T, cusp::host_memory > MatrixCoo • typedef cusp::csr_matrix< int, T, cusp::host_memory > MatrixCsr • typedef cusp::csr_matrix< int, T, cusp::device_memory > MatrixCsrD • typedef cusp::array1d< T, cusp::host_memory > Vector • typedef cusp::array1d< T, cusp::device_memory > VectorD 157

• typedef cusp::array1d< double, cusp::host_memory > DoubleVector • typedef cusp::array1d< int, cusp::host_memory > IntVector • typedef cusp::array1d< int, cusp::device_memory > IntVectorD • typedef cusp::array1d< double, cusp::device_memory > DoubleVectorD • typedef cusp::array1d< bool, cusp::host_memory > BoolVector • typedef cusp::array1d< bool, cusp::device_memory > BoolVectorD • typedef Vector MatrixMapF • typedef IntVector MatrixMap • typedef IntVector::iterator IntIterator • typedef Vector::iterator TIterator • typedef IntVector::reverse_iterator IntRevIterator • typedef Vector::reverse_iterator TRevIterator • typedef thrust::tuple < IntIterator, IntIterator > IntIteratorTuple • typedef thrust::tuple < IntIterator, IntIterator, TIterator > IteratorTuple • typedef thrust::tuple < IntRevIterator, IntRevIterator, TRevIterator > RevIteratorTuple • typedef thrust::zip_iterator < IntIteratorTuple > EdgeIterator • typedef thrust::zip_iterator < IteratorTuple > WeightedEdgeIterator 158

• typedef thrust::zip_iterator < RevIteratorTuple > WeightedEdgeRevIterator • typedef thrust::tuple< int, int > NodeType • typedef thrust::tuple< int, int, T > WeightedEdgeType • typedef thrust::tuple< int, double > Dijkstra

Public Member Functions

• Graph (bool trackReordering=false) • double getTimeDB () const • double getTimeDBPre () const • double getTimeDBFirst () const • double getTimeDBSecond () const • double getTimeDBPost () const • double getTimeRCM () const • double getTimeDropoff () const • double getDiagDominance (bool before_db=false) const • double getDP1 (bool before_db=false) const • int reorder (const MatrixCsr &Acsr, bool testDB, bool doDB, bool dbFirst- StageOnly, bool scale, bool doRCM, bool doSloan, IntVector &optReordering, IntVector &optPerm, IntVectorD &d_dbRowPerm, VectorD &d_dbRowScale, VectorD &d_dbColScale, MatrixMapF &scaleMap, int &k_db) • int dropOff (T frac, int maxBandwidth, T &frac_actual) • void assembleOffDiagMatrices (int bandwidth, int numPartitions, Vector &WV_host, Vector &offDiags_host, IntVector &offDiagWidths_left, IntVector &offDiagWidths_right, IntVector &offDiagPerms_left, IntVector &offDiag- Perms_right, MatrixMap &typeMap, MatrixMap &offDiagMap, MatrixMap &WVMap) 159

• void secondLevelReordering (int bandwidth, int numPartitions, IntVector &secondReorder, IntVector &secondPerm, IntVector &first_rows) • void assembleBandedMatrix (int bandwidth, IntVector &ks_col, IntVector &ks_row, MatrixCoo &Acoo, MatrixMap &typeMap, MatrixMap &banded- MatMap) • void assembleBandedMatrix (int bandwidth, bool saveMem, int numPartitions, IntVector &ks_col, IntVector &ks_row, MatrixCoo &Acoo, IntVector &ks, Int- Vector &BOffsets, MatrixMap &typeMap, MatrixMap &bandedMatMap) • void get_csr_matrix (MatrixCsr &Acsr, int numPartitions) • void unorderedBFS (bool doRCM, bool doSloan, IntVector &tmp_reordering, IntVector &row_offsets, IntVector &column_indices, IntVector &visited, Int- Vector &levels, IntVector &ori_degrees, BoolVector &tried) • void unorderedBFSIteration (int width, int start_idx, int end_idx, Int- Vector &tmp_reordering, IntVector &levels, IntVector &visited, IntVector &row_offsets, IntVector &column_indices, IntVector &ori_degrees, BoolVector &tried, IntVector &costs, IntVector &ori_costs, StatusVector &status, int &next_level)

F.14.1 Detailed Description

template<typename T> class sap::Graph< T >

This class is an auxiliary class of the SaP preconditioner class. It performs the reordering, the drop-off, and the assembly of the banded matrix.

Template Parameters

T is the preconditioner’s floating point type.

The documentation for this class was generated from the following file:

• sap/graph.h 160

F.15 sap::Graph< T >::is_not Struct Reference

Public Member Functions

• __host__ __device__ bool operator() (bool x)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.16 sap::Graph< T >::Map< Type > Struct Template Reference

Public Member Functions

• Map (Type ∗arg_map) • __host__ __device__ Type operator() (int a)

Public Attributes

• Type ∗ m_map

The documentation for this struct was generated from the following file:

• sap/graph.h

F.17 sap::Monitor< SolverVector > Class Template Reference

#include <sap/monitor.h>

Public Types

• typedef SolverVector::value_type SolverValueType

Public Member Functions

• Monitor (const int maxIterations, const SolverValueType relTol, const Solver- ValueType absTol=SolverValueType(0)) • virtual void init (const SolverVector &rhs) • virtual bool finished (const SolverVector &r) • virtual bool finished (SolverValueType rNorm) • virtual void stop (int code, const char ∗message) • virtual void increment (float incr) • virtual Monitor< SolverVector > & operator++ () • virtual int getMaxIterations () const • virtual size_t iteration_limit () const • virtual SolverValueType getRelTolerance () const • virtual SolverValueType getAbsTolerance () const • virtual SolverValueType getTolerance () const • virtual SolverValueType getRHSNorm () const • virtual bool converged () const • virtual size_t iteration_count () const • virtual float getNumIterations () const • virtual int getCode () const • virtual const std::string & getMessage () const • virtual SolverValueType getResidualNorm () const • virtual SolverValueType getRelResidualNorm () const 162

F.17.1 Detailed Description

template<typename SolverVector> class sap::Monitor< SolverVector >

This class provides support for monitoring the progress of iterative linear solvers, checking for convergence, and stopping on various error conditions.

The documentation for this class was generated from the following file:

• sap/monitor.h
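
The following is a minimal sketch, based only on the member functions listed above, of driving a sap::Monitor by hand. In practice the Krylov routines in sap/bicgstab2.h and related headers perform these calls internally; the residual update marked by a comment is a placeholder for an actual iteration.

    #include <sap/monitor.h>
    #include <cusp/array1d.h>

    typedef cusp::array1d<double, cusp::device_memory> SolverVector;

    // Returns true if the (placeholder) iteration converged within the limits.
    bool monitor_sketch(const SolverVector& b, SolverVector& r)
    {
        sap::Monitor<SolverVector> monitor(100 /*maxIterations*/, 1e-6 /*relTol*/);
        monitor.init(b);   // record the RHS norm used by the relative test

        for (int it = 0; it < monitor.getMaxIterations() && !monitor.finished(r); it++) {
            // ... one sweep of a Krylov method would update the residual r here ...
            ++monitor;     // count the iteration
        }
        return monitor.converged();
    }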

F.18 sap::Multiply< T > Struct Template Reference

Public Member Functions

• __host__ __device__ T operator() (thrust::tuple< T, T > tu)

The documentation for this struct was generated from the following file:

• sap/precond.h

F.19 sap::MVBanded< Matrix > Class Template Reference

#include <sap/spmv.h>

Public Types

• typedef cusp::linear_operator< typename Matrix::value_type, typename Matrix::memory_space, typename Matrix::index_type > Parent
• typedef Matrix::value_type ValueType

Public Member Functions

• MVBanded (Matrix &A)
• template<typename Array> void operator() (const Array &v, Array &Av)
• double getTime () const
  Cumulative time for all SPMV calls (ms).
• double getCount () const
  Total number of calls to the SPMV functor.
• double getGFlops () const
  Average GFLOP/s over all SPMV calls.

F.19.1 Detailed Description

template<typename Matrix> class sap::MVBanded< Matrix >

This class implements the MV functor for banded matrix-vector product.

Template Parameters

Matrix is the type of the banded matrix.

The documentation for this class was generated from the following file:

• sap/spmv.h

F.20 sap::Options Struct Reference

Input solver options.

#include <sap/solver.h>

Public Member Functions

• Options ()

Public Attributes

• KrylovSolverType solverType
• int maxNumIterations
• double relTol
• double absTol
• bool testDB
• bool isSPD
• bool saveMem
• bool performReorder
• bool performDB
• bool dbFirstStageOnly
• bool applyScaling
• int maxBandwidth
• int gpuCount
• double dropOffFraction
• FactorizationMethod factMethod
• PreconditionerType precondType
• bool safeFactorization
• bool variableBandwidth

• bool trackReordering
• bool useBCR
• int ilu_level

F.20.1 Detailed Description

Input solver options. This structure encapsulates all solver options and specifies the methods and parameters used in the iterative solver and the preconditioner.

F.20.2 Constructor & Destructor Documentation

F.20.2.1 sap::Options::Options ( ) [inline]

This is the constructor for the Options class. It sets default values for all options.

F.20.3 Member Data Documentation

F.20.3.1 double sap::Options::absTol

Absolute tolerance; default: 0

F.20.3.2 bool sap::Options::applyScaling

Apply DB scaling? default: true

F.20.3.3 bool sap::Options::dbFirstStageOnly

Perform only the first stage of DB? default: false

F.20.3.4 double sap::Options::dropOffFraction

Maximum fraction of the element-wise matrix 1-norm that can be dropped off; default: 0

F.20.3.5 FactorizationMethod sap::Options::factMethod

Diagonal block factorization method; default: LU_only

F.20.3.6 int sap::Options::gpuCount

Number of GPUs to be used; default: 1

F.20.3.7 int sap::Options::ilu_level

Level of ILU factorization; a negative value means complete LU is applied; default: -1

F.20.3.8 bool sap::Options::isSPD

Indicate whether the matrix is symmetric positive definite; default: false

F.20.3.9 int sap::Options::maxBandwidth

Maximum half-bandwidth; default: INT_MAX

F.20.3.10 int sap::Options::maxNumIterations

Maximum number of iterations; default: 100

F.20.3.11 bool sap::Options::performDB

Perform DB reordering? default: true

F.20.3.12 bool sap::Options::performReorder

Perform matrix reorderings? default: true

F.20.3.13 PreconditionerType sap::Options::precondType

Preconditioner type; default: Spike

F.20.3.14 double sap::Options::relTol

Relative tolerance; default: 1e-6

F.20.3.15 bool sap::Options::safeFactorization

Use safe factorization (diagonal boosting)? default: false

F.20.3.16 bool sap::Options::saveMem

(For SPD matrices only) Indicate whether to use the memory-saving but slower mode; default: false

F.20.3.17 KrylovSolverType sap::Options::solverType

Krylov method to use; default: BiCGStab2

F.20.3.18 bool sap::Options::testDB

Indicate that we are running the test for DB

F.20.3.19 bool sap::Options::trackReordering

Keep track of the reordering information? default: false

F.20.3.20 bool sap::Options::variableBandwidth

Allow variable partition bandwidths? default: true

The documentation for this struct was generated from the following file:

• sap/solver.h
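
As a small configuration sketch, assuming only the fields and defaults documented above, an Options object can be default-constructed and then selectively overridden before being passed to the solver:

    #include <sap/solver.h>

    sap::Options make_options()
    {
        sap::Options opts;                      // constructor fills in all documented defaults
        opts.solverType       = sap::BiCGStab2; // Krylov method (also the default)
        opts.precondType      = sap::Spike;     // SaP/Spike preconditioner (also the default)
        opts.relTol           = 1e-10;          // tighten the relative tolerance
        opts.maxNumIterations = 500;            // allow more Krylov iterations
        opts.dropOffFraction  = 0.01;           // drop up to 1% of the element-wise 1-norm
        return opts;
    }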

F.21 sap::Graph< T >::PermutedEdgeLength Struct Reference

Public Member Functions

• PermutedEdgeLength (int ∗perm_array)
• __host__ __device__ int operator() (NodeType a)

Public Attributes

• int ∗ m_perm_array

The documentation for this struct was generated from the following file:

• sap/graph.h

F.22 sap::Graph< T >::PermuteEdge Struct Reference

Public Member Functions

• PermuteEdge (int ∗perm_array)
• __host__ __device__ NodeType operator() (NodeType a)

Public Attributes

• int ∗ m_perm_array

The documentation for this struct was generated from the following file:

• sap/graph.h

F.23 sap::Precond< PrecVector > Class Template Reference

SaP preconditioner.

#include <sap/precond.h>

Public Types

• typedef PrecVector::memory_space MemorySpace
• typedef PrecVector::memory_space memory_space
• typedef PrecVector::value_type PrecValueType
• typedef PrecVector::value_type value_type
• typedef PrecVector::iterator PrecVectorIterator
• typedef cusp::unknown_format format
• typedef cusp::array1d< int, MemorySpace > IntVector
• typedef IntVector MatrixMap
• typedef cusp::array1d< PrecValueType, MemorySpace > MatrixMapF
• typedef cusp::array1d< PrecValueType, cusp::host_memory > PrecVectorH
• typedef cusp::array1d< int, cusp::host_memory > IntVectorH
• typedef cusp::array1d< bool, cusp::host_memory > BoolVectorH
• typedef IntVectorH MatrixMapH
• typedef cusp::array1d< PrecValueType, cusp::host_memory > MatrixMapFH

• typedef IntVectorH::iterator IntHIterator
• typedef PrecVectorH::iterator PrecHIterator
• typedef cusp::coo_matrix< int, PrecValueType, MemorySpace > PrecMatrixCoo
• typedef cusp::coo_matrix< int, PrecValueType, cusp::host_memory > PrecMatrixCooH
• typedef cusp::csr_matrix< int, PrecValueType, MemorySpace > PrecMatrixCsr
• typedef cusp::csr_matrix< int, double, MemorySpace > DoubleMatrixCsr
• typedef cusp::csr_matrix< int, PrecValueType, cusp::host_memory > PrecMatrixCsrH

Public Member Functions

• Precond ()
• Precond (int numPart, bool isSPD, bool saveMem, bool reorder, bool testDB, bool doDB, bool dbFirstStageOnly, bool scale, double dropOff_frac, int maxBandwidth, int gpuCount, FactorizationMethod factMethod, PreconditionerType precondType, bool safeFactorization, bool variableBandwidth, bool trackReordering, bool use_bcr, int ilu_level, PrecValueType tolerance)
• Precond (const Precond &prec)
• Precond & operator= (const Precond &prec)
• double getTimeDB () const
• double getTimeDBPre () const
• double getTimeDBFirst () const
• double getTimeDBSecond () const
• double getTimeDBPost () const

• double getDiagDominance (bool before_db=false) const
• double getDP1 (bool before_db=false) const
• double getTimeReorder () const
• double getTimeDropOff () const
• double getTimeCPUAssemble () const
• double getTimeTransfer () const
• double getTimeToBanded () const
• double getTimeCopyOffDiags () const
• double getTimeBandLU () const
• double getTimeBandUL () const
• double gettimeAssembly () const
• double getTimeFullLU () const
• double getTimeShuffle () const
• double getTimeBCRLU () const
• double getTimeBCRSweepDeflation () const
• double getTimeBCRMatMulDeflation () const
• double getTimeBCRSweepInflation () const
• double getTimeBCRMVInflation () const
• int getBandwidthReordering () const
• int getBandwidthDB () const
• int getBandwidth () const
• int getNumPartitions () const
• double getActualDropOff () const
• int getActualNumNonZeros () const
• template<typename Matrix> void setup (const Matrix &A)
• void update (const PrecVector &entries)
  This function updates the banded matrix and off-diagonal matrices based on the given entries.

• void solve (PrecVector &v, PrecVector &z)
• template<typename SolverVector> void operator() (const SolverVector &v, SolverVector &z)

Public Attributes

• IntVectorH m_ks_row_host
• IntVectorH m_ks_col_host

F.23.1 Detailed Description

template<typename PrecVector> class sap::Precond< PrecVector >

SaP preconditioner. This class implements the SaP preconditioner.

Template Parameters

PrecVector is the vector type used in the preconditioner (its underlying value type defines the precision of the preconditioner).

F.23.2 Constructor & Destructor Documentation

F.23.2.1 template<typename PrecVector> sap::Precond< PrecVector >::Precond ( )

This is the default constructor for the Precond class.

F.23.2.2 template<typename PrecVector> sap::Precond< PrecVector >::Precond ( int numPart, bool isSPD, bool saveMem, bool reorder, bool testDB, bool doDB, bool dbFirstStageOnly, bool scale, double dropOff_frac, int maxBandwidth, int gpuCount, FactorizationMethod factMethod, PreconditionerType precondType, bool safeFactorization, bool variableBandwidth, bool trackReordering, bool use_bcr, int ilu_level, PrecValueType tolerance )

This is the constructor for the Precond class.

F.23.2.3 template<typename PrecVector> sap::Precond< PrecVector >::Precond ( const Precond< PrecVector > & prec )

This is the copy constructor for the Precond class.

F.23.3 Member Function Documentation

F.23.3.1 template<typename PrecVector> template<typename SolverVector> void sap::Precond< PrecVector >::operator() ( const SolverVector & v, SolverVector & z )

This is the wrapper around the preconditioner solve function. It is invoked by the Krylov solvers through the cusp::multiply() function. Note that this operator() is templatized by the SolverVector type to implement mixed-precision.

F.23.3.2 template<typename PrecVector> template<typename Matrix> void sap::Precond< PrecVector >::setup ( const Matrix & A )

This function performs the initial preconditioner setup, based on the specified matrix: (1) reorder the matrix (DB and/or RCM); (2) element drop-off (optional); (3) LU factorization of the diagonal blocks; (4) assembly of the reduced matrix.

F.23.3.3 template<typename PrecVector> void sap::Precond< PrecVector >::solve ( PrecVector & v, PrecVector & z )

This function solves the system Mz=v, for a specified vector v, where M is the implicitly defined preconditioner matrix.

F.23.3.4 template<typename PrecVector> void sap::Precond< PrecVector >::update ( const PrecVector & entries )

This function updates the banded matrix and off-diagonal matrices based on the given entries. It assumes that many systems with exactly the same sparsity pattern are to be solved: once the first system has been solved, there is no need to repeat the permutation and scaling steps. Instead, the mapping from the sparse matrix to the banded matrices is kept and the latter are updated directly. This function can only be called after the solver has solved at least one system and if the mapping was tracked during setup; otherwise an error is reported.

The documentation for this class was generated from the following file:

• sap/precond.h
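
The snippet below is a sketch of the documented setup/solve/update call sequence. The constructor arguments mirror the signature in F.23.2.2, the chosen values are only illustrative, and in typical use sap::Solver constructs the preconditioner internally from a sap::Options object; the qualification of the enumerators follows their listing for sap/common.h in Section G.4.

    #include <sap/precond.h>
    #include <cusp/array1d.h>
    #include <cusp/csr_matrix.h>
    #include <climits>

    typedef cusp::array1d<double, cusp::device_memory>         PrecVector;
    typedef cusp::csr_matrix<int, double, cusp::device_memory> MatrixCsr;

    void precond_sketch(const MatrixCsr& A, const PrecVector& v, PrecVector& z,
                        const PrecVector& newEntries)
    {
        sap::Precond<PrecVector> M(
            8,             // numPart: number of partitions
            false,         // isSPD
            false,         // saveMem
            true,          // reorder
            false,         // testDB
            true,          // doDB
            false,         // dbFirstStageOnly
            true,          // scale
            0.0,           // dropOff_frac
            INT_MAX,       // maxBandwidth
            1,             // gpuCount
            LU_only,       // factMethod
            sap::Spike,    // precondType
            false,         // safeFactorization
            true,          // variableBandwidth
            true,          // trackReordering (required for update())
            false,         // use_bcr
            -1,            // ilu_level: complete LU
            1e-6);         // tolerance

        M.setup(A);        // reorder, drop off, factorize, form the reduced system
        M.solve(v, z);     // apply the preconditioner, i.e. solve M z = v

        // Same sparsity pattern, new values: refresh the factorization data directly.
        M.update(newEntries);
    }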

F.24 sap::SegmentedMatrix< Array, MemorySpace > Class Template Reference

Public Types

• typedef Array::value_type T
• typedef Array::value_type value_type
• typedef cusp::array1d< T, MemorySpace > Vector
• typedef cusp::array1d< T, cusp::host_memory > VectorH

• typedef cusp::array1d< int, MemorySpace > IntVector
• typedef cusp::array1d< int, cusp::host_memory > IntVectorH

Public Member Functions

• template<typename IntArray> void init (int n, int num_partitions, const IntArray &ks_per_parttion, bool upper)
• template<typename IntArray> void copy (const Array &A, const IntArray &A_offsets)
• void sweepStride (const Array &A, int k)
• bool multiply_stride (SegmentedMatrix< Array, MemorySpace > &result_matrix) const
• bool multiply_stride (const SegmentedMatrix< Array, MemorySpace > &mat_a, const SegmentedMatrix< Array, MemorySpace > &mat_b)
• void update_banded (Array &A, int k) const

Public Attributes

• bool m_upper
• int m_global_n
• int m_global_num_partitions
• IntVector m_ns_scan
• IntVector m_ks_per_partition
• IntVector m_A_offsets
• IntVector m_A_offsets_of_offsets
• IntVector m_src_offsets
• IntVector m_src_offsets_of_offsets

• IntVectorH m_ns_scan_host
• IntVectorH m_ks_per_partition_host
• IntVectorH m_A_offsets_host
• IntVectorH m_A_offsets_of_offsets_host
• IntVectorH m_src_offsets_of_offsets_host
• IntVectorH m_src_offsets_host
• Vector m_A

The documentation for this class was generated from the following file:

• sap/segmented_matrix.h

F.25 sap::SmallerThan< T > Struct Template Reference

Public Member Functions

• SmallerThan (T threshold)
• __host__ __device__ bool operator() (T val)

Public Attributes

• T m_threshold

The documentation for this struct was generated from the following file:

• sap/precond.h

F.26 sap::Solver< Array, PrecValueType > Class Template Reference

Main SaP::GPU solver.

#include <sap/solver.h>

Public Member Functions

• Solver (int numPartitions, const Options &opts)
  SaP solver constructor.
• template<typename Matrix> bool setup (const Matrix &A)
  Preconditioner setup.
• template<typename Array1> bool update (const Array1 &entries)
  Preconditioner update.
• template<typename SpmvOperator> bool solve (SpmvOperator &spmv, const Array &b, Array &x)
  Linear system solve.
• const Stats & getStats () const
  Extract solver statistics.
• int getMonitorCode () const
• const std::string & getMonitorMessage () const
• const Precond< PrecVector > & getPreconditioner () const

F.26.1 Detailed Description

template<typename Array, typename PrecValueType> class sap::Solver< Array, PrecValueType >

Main SaP::GPU solver. This class is the public interface to the Spike-preconditioned Krylov iterative solver.

Template Parameters

Array is the array type for the linear system solution (both cusp::array1d and cusp::array1d_view are valid).
PrecValueType is the floating point type used in the preconditioner (to support mixed-precision calculations).

F.26.2 Constructor & Destructor Documentation

F.26.2.1 template<typename Array, typename PrecValueType> sap::Solver< Array, PrecValueType >::Solver ( int numPartitions, const Options & opts )

SaP solver constructor. This is the constructor for the Solver class. It specifies the requested number of partitions and the set of solver options.

F.26.3 Member Function Documentation

F.26.3.1 template<typename Array, typename PrecValueType> template<typename Matrix> bool sap::Solver< Array, PrecValueType >::setup ( const Matrix & A )

Preconditioner setup. This function performs the initial setup for the SaP solver. It prepares the preconditioner based on the specified matrix A (which may be the system matrix, or some approximation to it).

Template Parameters

Matrix is the sparse matrix type used in the preconditioner.

F.26.3.2 template<typename Array, typename PrecValueType> template<typename SpmvOperator> bool sap::Solver< Array, PrecValueType >::solve ( SpmvOperator & spmv, const Array & b, Array & x )

Linear system solve. This function solves the system Ax=b, for a given matrix A and right-hand side vector b. An exception is thrown if this call was not preceded by a call to Solver::setup().

Template Parameters

SpmvOperator is a functor class which implements operator() to compute the sparse matrix-vector product. See sap::SpmvCusp for an example.

F.26.3.3 template<typename Array, typename PrecValueType> template<typename Array1> bool sap::Solver< Array, PrecValueType >::update ( const Array1 & entries )

Preconditioner update. This function updates the Spike preconditioner assuming that the reordering information generated when the preconditioner was initially set up is still valid. The diagonal blocks and off-diagonal spike blocks are updated based on the provided matrix non-zero entries.

An exception is thrown if this call was not preceded by a call to Solver::setup() or if reordering tracking was not enabled through the solver options.

Template Parameters

Array1 is the vector type for the non-zero entries of the updated matrix (both cusp::array1d and cusp::array1d_view are allowed).

The documentation for this class was generated from the following file:

• sap/solver.h
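
Putting the pieces together, the following sketch shows the documented setup/solve workflow; the matrix file name and partition count are illustrative, and CUSP, Thrust, and a CUDA-capable device are assumed to be available.

    #include <sap/solver.h>
    #include <sap/spmv.h>
    #include <cusp/csr_matrix.h>
    #include <cusp/array1d.h>
    #include <cusp/io/matrix_market.h>
    #include <iostream>

    typedef double                                            Real;
    typedef cusp::csr_matrix<int, Real, cusp::device_memory>  Matrix;
    typedef cusp::array1d<Real, cusp::device_memory>          Vector;

    int main()
    {
        Matrix A;
        cusp::io::read_matrix_market_file(A, "matrix.mtx");  // illustrative input file

        Vector b(A.num_rows, Real(1));                        // right-hand side
        Vector x(A.num_rows, Real(0));                        // solution vector

        sap::Options opts;                                    // documented defaults
        sap::Solver<Vector, Real> solver(8 /*partitions*/, opts);
        sap::SpmvCusp<Matrix> spmv(A);                        // SPMV functor used by the Krylov loop

        solver.setup(A);                                      // build the SaP preconditioner
        bool converged = solver.solve(spmv, b, x);            // solve A x = b

        const sap::Stats& stats = solver.getStats();
        std::cout << (converged ? "converged" : "failed")
                  << " in " << stats.numIterations << " iterations; relative residual "
                  << stats.relResidualNorm << std::endl;
        return converged ? 0 : 1;
    }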

F.27 sap::SpmvCusp< Matrix > Class Template Reference

Default SPMV functor class.

#include <sap/spmv.h>

Public Types

• typedef cusp::linear_operator Parent

Public Member Functions

• SpmvCusp (Matrix &A)
• double getTime () const
  Cumulative time for all SPMV calls (ms).
• double getCount () const
  Total number of calls to the SPMV functor.
• double getGFlops () const
  Average GFLOP/s over all SPMV calls.
• template<typename Array> void operator() (const Array &v, Array &Av)
  Implementation of the SPMV functor using cusp::multiply().

F.27.1 Detailed Description

template<typename Matrix> class sap::SpmvCusp< Matrix >

Default SPMV functor class. This class implements the default SPMV functor for sparse matrix-vector product, using the cusp::multiply algorithm.

Template Parameters

Matrix is the type of the sparse matrix.

The documentation for this class was generated from the following file:

• sap/spmv.h
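
A short sketch of using the functor on its own follows; the same object is what sap::Solver::solve() expects as its SpmvOperator argument, and the types and sizes are illustrative.

    #include <sap/spmv.h>
    #include <cusp/csr_matrix.h>
    #include <cusp/array1d.h>

    typedef cusp::csr_matrix<int, double, cusp::device_memory> Matrix;
    typedef cusp::array1d<double, cusp::device_memory>         Vector;

    void spmv_sketch(Matrix& A, const Vector& v, Vector& Av)
    {
        sap::SpmvCusp<Matrix> spmv(A);
        spmv(v, Av);                        // Av = A * v via cusp::multiply

        double ms     = spmv.getTime();     // cumulative time over all calls (ms)
        double gflops = spmv.getGFlops();   // average GFLOP/s over all calls
        (void)ms; (void)gflops;
    }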

F.28 sap::SpmvCuSparse< Matrix > Class Template Reference

#include <sap/spmv.h>

Public Types

• typedef cusp::linear_operator Parent
• typedef Matrix::value_type fp_type
• typedef Matrix::index_type idx_type

Public Member Functions

• SpmvCuSparse (Matrix &A, cusparseHandle_t &handle)
• double getTime () const
  Cumulative time for all SPMV calls (ms).
• double getCount () const
  Total number of calls to the SPMV functor.
• double getGFlops () const
  Average GFLOP/s over all SPMV calls.
• template<typename Array> void operator() (const Array &v, Array &Av)
  Implementation of the SPMV functor using cusp::multiply().

F.28.1 Detailed Description

template<typename Matrix> class sap::SpmvCuSparse< Matrix >

This class implements an alternative SPMV functor for the sparse matrix-vector product, using the cuSPARSE library.

Template Parameters

Matrix is the type of the sparse matrix.

The documentation for this class was generated from the following file:

• sap/spmv.h

F.29 sap::Graph< T >::Square Struct Reference

Public Member Functions

• __host__ __device__ T operator() (T a)

The documentation for this struct was generated from the following file:

• sap/graph.h

F.30 sap::Stats Struct Reference

Output solver statistics.

#include <sap/solver.h>

Public Member Functions

• Stats ()

Public Attributes

• double timeSetup
• double timeUpdate
• double timeSolve
• double time_DB
• double time_DB_pre
• double time_DB_first
• double time_DB_second
• double time_DB_post
• double d_p1
• double d_p1_ori
• double diag_dom
• double diag_dom_ori
• double time_reorder
• double time_dropOff
• double time_cpu_assemble
• double time_transfer
• double time_toBanded
• double time_offDiags
• double time_bandLU
• double time_bandUL
• double time_fullLU
• double time_assembly
• double time_shuffle
• double time_bcr_lu
• double time_bcr_sweep_deflation
• double time_bcr_mat_mul_deflation
• double time_bcr_sweep_inflation
• double time_bcr_mv_inflation
• int bandwidthReorder
• int bandwidthDB
• int bandwidth
• double nuKf
• double flops_LU
• int numPartitions
• double actualDropOff
• float numIterations
• double rhsNorm
• double residualNorm
• double relResidualNorm
• int actual_nnz

F.30.1 Detailed Description

Output solver statistics. This structure encapsulates all solver statistics, both from the iterative solver and the preconditioner.

F.30.2 Constructor & Destructor Documentation

F.30.2.1 sap::Stats::Stats ( ) [inline]

This is the constructor for the Stats class. It initializes all timing and performance measures.

F.30.3 Member Data Documentation

F.30.3.1 double sap::Stats::actualDropOff

Actual fraction of the element-wise matrix 1-norm dropped off.

F.30.3.2 int sap::Stats::bandwidth

Half-bandwidth after reordering and drop-off.

F.30.3.3 int sap::Stats::bandwidthDB

Half-bandwidth after DB.

F.30.3.4 int sap::Stats::bandwidthReorder

Half-bandwidth after reordering.

F.30.3.5 double sap::Stats::d_p1

The 1st percentile of d’s of the matrix after DB

F.30.3.6 double sap::Stats::d_p1_ori

The 1st percentile of d’s of the matrix before DB

F.30.3.7 double sap::Stats::diag_dom

The diagonal dominance of the matrix after DB

F.30.3.8 double sap::Stats::diag_dom_ori

The diagonal dominance of the matrix before DB

F.30.3.9 double sap::Stats::flops_LU

FLOP count of the LU factorization.

F.30.3.10 double sap::Stats::nuKf

Non-uniform K factor. Indicates whether the half-bandwidth K changes significantly from row to row.

F.30.3.11 float sap::Stats::numIterations

Number of iterations required for iterative solver to converge.

F.30.3.12 int sap::Stats::numPartitions

Actual number of partitions used in the SaP factorization

F.30.3.13 double sap::Stats::relResidualNorm

Final relative residual norm (i.e. ||b-Ax||_2 / ||b||_2)

F.30.3.14 double sap::Stats::residualNorm

Final residual norm (i.e. ||b-Ax||_2).

F.30.3.15 double sap::Stats::rhsNorm

RHS norm (i.e. ||b||_2).

F.30.3.16 double sap::Stats::time_assembly

Time for assembling off-diagonal matrices (including solving multiple RHS)

F.30.3.17 double sap::Stats::time_bandLU

Time for LU factorization of diagonal blocks.

F.30.3.18 double sap::Stats::time_bandUL

Time for UL factorization of diagonal blocks (in the LU_UL method only).

F.30.3.19 double sap::Stats::time_cpu_assemble

Time on CPU to assemble the banded matrix and off-diagonal spikes.

F.30.3.20 double sap::Stats::time_DB

Time to do DB reordering.

F.30.3.21 double sap::Stats::time_DB_first

Time to do DB reordering (first stage).

F.30.3.22 double sap::Stats::time_DB_post

Time to do DB reordering (post-processing).

F.30.3.23 double sap::Stats::time_DB_pre

Time to do DB reordering (pre-processing).

F.30.3.24 double sap::Stats::time_DB_second

Time to do DB reordering (second stage).

F.30.3.25 double sap::Stats::time_dropOff

Time for drop-off

F.30.3.26 double sap::Stats::time_fullLU

Time for LU factorization of the reduced matrix R.

F.30.3.27 double sap::Stats::time_offDiags

Time to compute off-diagonal spike matrices on GPU.

F.30.3.28 double sap::Stats::time_reorder

Total time for matrix reordering.

F.30.3.29 double sap::Stats::time_shuffle

Total time to do vector reordering and scaling.

F.30.3.30 double sap::Stats::time_toBanded

Time to form banded matrix when reordering is disabled.

F.30.3.31 double sap::Stats::time_transfer

Time to transfer data from CPU to GPU.

F.30.3.32 double sap::Stats::timeSetup

Time to set up the preconditioner.

F.30.3.33 double sap::Stats::timeSolve

Time for Krylov solve.

F.30.3.34 double sap::Stats::timeUpdate

Time to update the preconditioner.

The documentation for this struct was generated from the following file:

• sap/solver.h
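
A small reporting sketch using a few of the fields documented above; the Stats object is obtained from sap::Solver::getStats() after a solve.

    #include <sap/solver.h>
    #include <iostream>

    void report_stats(const sap::Stats& s)
    {
        std::cout << "preconditioner setup time: " << s.timeSetup       << "\n"
                  << "Krylov solve time:         " << s.timeSolve       << "\n"
                  << "iterations:                " << s.numIterations   << "\n"
                  << "half-bandwidth:            " << s.bandwidth       << "\n"
                  << "partitions:                " << s.numPartitions   << "\n"
                  << "relative residual norm:    " << s.relResidualNorm << std::endl;
    }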

F.31 sap::strided_range< Iterator >::stride_functor Struct Reference

Public Member Functions

• stride_functor (difference_type stride)
• __host__ __device__ difference_type operator() (const difference_type &i) const

Public Attributes

• difference_type m_stride

The documentation for this struct was generated from the following file:

• sap/strided_range.h

F.32 sap::strided_range< Iterator > Class Template Reference

Classes

• struct stride_functor

Public Types

• typedef thrust::iterator_difference< Iterator >::type difference_type
• typedef thrust::counting_iterator< difference_type > CountingIterator
• typedef thrust::transform_iterator< stride_functor, CountingIterator > TransformIterator
• typedef thrust::permutation_iterator< Iterator, TransformIterator > PermutationIterator
• typedef PermutationIterator iterator

Public Member Functions

• strided_range (Iterator first, Iterator last, difference_type stride)
• iterator begin (void) const
• iterator end (void) const

Protected Attributes

• Iterator m_first
• Iterator m_last
• difference_type m_stride

The documentation for this class was generated from the following file:

• sap/strided_range.h
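
A usage sketch, mirroring the original thrust example: create a strided view of every other element of a device vector without copying it.

    #include <sap/strided_range.h>
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/reduce.h>

    int sum_of_even_positions()
    {
        thrust::device_vector<int> v(8);
        thrust::sequence(v.begin(), v.end());                    // v = 0, 1, 2, ..., 7

        typedef thrust::device_vector<int>::iterator Iter;
        sap::strided_range<Iter> evens(v.begin(), v.end(), 2);   // elements 0, 2, 4, 6

        return thrust::reduce(evens.begin(), evens.end());       // 0 + 2 + 4 + 6 = 12
    }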

F.33 sap::system_error Class Reference

Public Types

• enum Reason { Zero_pivoting = -1, Negative_DB_weight = -2, Illegal_update = -3, Illegal_solve = -4, Matrix_singular = -5 }

Public Member Functions

• system_error (Reason reason, const std::string &what_arg)
• system_error (Reason reason, const char ∗what_arg)
• Reason reason () const

The documentation for this class was generated from the following file:

• sap/exception.h
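
A handling sketch: any SaP operation that may throw (for example Precond::setup() or Solver::solve()) can be wrapped and the documented Reason codes inspected; the wrapper function and the messages below are illustrative only.

    #include <sap/exception.h>
    #include <iostream>

    template <typename Callable>
    bool run_guarded(Callable op)
    {
        try {
            op();
            return true;
        } catch (const sap::system_error& e) {
            switch (e.reason()) {
                case sap::system_error::Zero_pivoting:   std::cerr << "zero pivot encountered\n";        break;
                case sap::system_error::Matrix_singular: std::cerr << "matrix is singular\n";            break;
                case sap::system_error::Illegal_update:  std::cerr << "illegal preconditioner update\n"; break;
                default:                                 std::cerr << "SaP error\n";                     break;
            }
            return false;
        }
    }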

F.34 sap::Timer Class Reference

Base timer class.

#include <sap/timer.h>

Inherited by sap::CPUTimer and sap::GPUTimer.

Public Member Functions

• virtual void Start ()=0
• virtual void Stop ()=0
• virtual double getElapsed ()=0

F.34.1 Detailed Description

Base timer class.

The documentation for this class was generated from the following file:

• sap/timer.h

g file documentation

G.1 sap/banded_matrix.h File Reference

Definition of banded matrix class.


Classes

• class sap::BandedMatrix< Array >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.1.1 Detailed Description

Definition of banded matrix class.

G.2 sap/bicgstab.h File Reference

BiCGStab preconditioned iterative Krylov solver.


Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

Functions

• template void sap::bicgstab (LinearOperator &A, Vector &x, Vector &b, Monitor &monitor, Preconditioner &M)
  Preconditioned BiCGStab Krylov method.

G.2.1 Detailed Description

BiCGStab preconditioned iterative Krylov solver.

G.3 sap/bicgstab2.h File Reference

BiCGStab(L) preconditioned iterative Krylov solver.


Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

Functions

• template void sap::bicgstabl (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Preconditioned BiCGStab(L) Krylov method.
• template void sap::bicgstab1 (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Specialization of the generic sap::bicgstabl function for L=1.
• template void sap::bicgstab2 (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Specialization of the generic sap::bicgstabl function for L=2.

G.3.1 Detailed Description

BiCGStab(L) preconditioned iterative Krylov solver.
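
For completeness, here is a sketch of calling the free function directly (normally sap::Solver drives these routines internally); the operator A, preconditioner M, and vectors are assumed to satisfy the documented LinearOperator, Preconditioner, and Vector requirements.

    #include <sap/bicgstab2.h>
    #include <sap/monitor.h>

    template <typename Operator, typename Preconditioner, typename Vector>
    bool run_bicgstab2(Operator& A, Preconditioner& M, const Vector& b, Vector& x)
    {
        sap::Monitor<Vector> monitor(200 /*maxIterations*/, 1e-8 /*relTol*/);
        sap::bicgstab2(A, x, b, monitor, M);   // BiCGStab(L) with L = 2
        return monitor.converged();
    }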

G.4 sap/common.h File Reference

Definition of commonly used macros and solver configuration constants.


Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

Macros

• #define ALWAYS_ASSERT
• #define BURST_VALUE (1e-7)
• #define BURST_NEW_VALUE (1e-4)
• #define MATRIX_MUL_BLOCK_SIZE (16)
• #define MAT_VEC_MUL_BLOCK_SIZE (16)
• #define USE_OLD_CUSP

Enumerations

• enum sap::KrylovSolverType { BiCGStab_C, GMRES_C, CG_C, CR_C, BiCGStab1, BiCGStab2, BiCGStab, MINRES }
• enum FactorizationMethod { LU_UL, LU_only }
• enum sap::PreconditionerType { Spike, Block, None }

Functions

• void sap::kernelConfigAdjust (int &numThreads, int &numBlockX, const int numThreadsMax)
• void sap::kernelConfigAdjust (int &numThreads, int &numBlockX, int &numBlockY, const int numThreadsMax, const int numBlockXMax)

Variables

• const unsigned int sap::BLOCK_SIZE = 512
• const unsigned int sap::MAX_GRID_DIMENSION = 32768
• const unsigned int sap::CRITICAL_THRESHOLD = 70

G.4.1 Detailed Description

Definition of commonly used macros and solver configuration constants.

G.4.2 Macro Definition Documentation

G.4.2.1 #define ALWAYS_ASSERT

If ALWAYS_ASSERT is defined, we make sure that assertions are triggered even if NDEBUG is defined.
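
The launch-configuration helpers listed under Functions above are used throughout the GPU kernels; a hypothetical usage sketch follows, where the clamping behavior described in the comments is assumed from the function's name and argument list rather than stated in this documentation.

    #include <sap/common.h>

    void launch_config_sketch(int workItems)
    {
        int numThreads = workItems;   // start with one thread per work item
        int numBlockX  = 1;

        // Assumed behavior: clamp numThreads to sap::BLOCK_SIZE and grow numBlockX
        // so that numThreads * numBlockX still covers all work items.
        sap::kernelConfigAdjust(numThreads, numBlockX, sap::BLOCK_SIZE);
    }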

G.5 sap/exception.h File Reference

Definition of the SaP system_error exception class.


Classes

• class sap::system_error

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.5.1 Detailed Description

Definition of the SaP system_error exception class.

G.6 sap/graph.h File Reference

Definition of an auxiliary class for the Preconditioner class. This class performs reordering, element drop-off, and assembly of the banded matrix.


Classes

• class sap::Graph< T >
• struct sap::Graph< T >::AbsoluteValue< VType >
• struct sap::Graph< T >::Square
• struct sap::Graph< T >::AccumulateEdgeWeights
• struct sap::Graph< T >::is_not
• struct sap::Graph< T >::ClearValue
• struct sap::Graph< T >::Exponential
• struct sap::Graph< T >::EdgeLength
• struct sap::Graph< T >::PermutedEdgeLength
• struct sap::Graph< T >::PermuteEdge
• struct sap::Graph< T >::Map< Type >
• struct sap::Graph< T >::GetCount
• struct sap::Graph< T >::CompareValue< VType >
• struct sap::Graph< T >::Difference
• struct sap::Graph< T >::EqualTo< Type >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.6.1 Detailed Description

Definition of an auxiliary class for the Preconditioner class. This class performs reordering, element drop-off, and assembly of the banded matrix.

G.7 sap/minres.h File Reference

MINRES preconditioned iterative solver.


Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

Functions

• template void sap::minres (LinearOperator &A, Vector &x, const Vector &b, Monitor &monitor, Preconditioner &P)
  Preconditioned MINRES method.

G.7.1 Detailed Description

MINRES preconditioned iterative solver.

G.8 sap/monitor.h File Reference

This file provides two classes for monitoring the progress of iterative linear solvers, checking for convergence, and stopping on various error conditions.


Classes

• class sap::Monitor< SolverVector >
• class sap::BiCGStabLMonitor< SolverVector >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.8.1 Detailed Description

This file provides two classes for monitoring the progress of iterative linear solvers, checking for convergence, and stopping on various error conditions.

G.9 sap/precond.h File Reference

Definition of the SaP preconditioner class.


Classes

• class sap::Precond< PrecVector >
  SaP preconditioner.
• struct sap::Multiply< T >
• struct sap::SmallerThan< T >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.9.1 Detailed Description

Definition of the SaP preconditioner class.

G.10 sap/segmented_matrix.h File Reference

Definition of a special data structure which is used in the Block Cyclic Reduction algorithm implementation.


Classes

• class sap::SegmentedMatrix< Array, MemorySpace >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.10.1 Detailed Description

Definition of a special data structure which is used in the Block Cyclic Reduction algorithm implementation.

G.11 sap/solver.h File Reference

Definition of the main SaP solver class.


Classes

• struct sap::Options
  Input solver options.
• struct sap::Stats
  Output solver statistics.
• class sap::Solver< Array, PrecValueType >
  Main SaP::GPU solver.

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.11.1 Detailed Description

Definition of the main SaP solver class.

G.12 sap/spmv.h File Reference

Definition of the default Spike SPMV functor class.

#include "cusparse.h"

Classes

• class sap::MVBanded< Matrix >
• class sap::SpmvCusp< Matrix >
  Default SPMV functor class.
• class sap::SpmvCuSparse< Matrix >

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.12.1 Detailed Description

Definition of the default Spike SPMV functor class.

G.13 sap/strided_range.h File Reference

This file implements the strided_range class to allow non-contiguous access to data in a thrust vector.


Classes

• class sap::strided_range< Iterator >
• struct sap::strided_range< Iterator >::stride_functor

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.13.1 Detailed Description

This file implements the strided_range class to allow non-contiguous access to data in a thrust vector. This class is the same as the strided_range thrust example provided by Nathan Bell at https://code.google.com/p/thrust/source/browse/examples/strided_range.cu.

G.14 sap/timer.h File Reference

CPU and GPU timer classes.


Classes

• class sap::Timer
  Base timer class.
• class sap::GPUTimer
  GPU timer.
• class sap::CPUTimer
  CPU timer.

Namespaces

• namespace sap sap is the top-level namespace which contains all SaP functions and types.

G.14.1 Detailed Description

CPU and GPU timer classes.