Hybrid Algorithms for Efficient Cholesky Decomposition and Matrix Inverse using Multicore CPUs with GPU Accelerators

Gary Macindoe

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of UCL.

2013

I, Gary Macindoe, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Signature:

Abstract

The use of linear algebra routines is fundamental to many areas of computational science, yet their implementation in software still forms the main computational bottleneck in many widely used algorithms. In machine learning and computational statistics, for example, the use of Gaussian distributions is ubiquitous, and routines for calculating the Cholesky decomposition, matrix inverse and matrix determinant must often be called many thousands of times by common algorithms such as Markov chain Monte Carlo. These linear algebra routines consume most of the total computational time of a wide range of statistical methods, and any improvements in this area will therefore greatly increase the overall efficiency of algorithms used in many scientific application areas.

The importance of linear algebra algorithms is clear from the substantial effort that has been invested over the last 25 years in producing low-level software libraries such as LAPACK, which generally optimise these routines by breaking a large problem into smaller problems that may be computed independently. The performance of such libraries is, however, strongly dependent on the specific hardware available. LAPACK was originally developed for single-core processors with a memory hierarchy, whereas modern computers often consist of mixed architectures, with large numbers of parallel cores and graphics processing units (GPUs) used alongside traditional CPUs. The challenge lies in making optimal use of these different types of computing unit, which generally have very different processor speeds and types of memory.

In this thesis we develop novel low-level algorithms that may be employed in general blocked linear algebra routines and that automatically tune themselves to take full advantage of whatever heterogeneous architecture is available. We compare our methods with MAGMA, the state-of-the-art open-source implementation of LAPACK designed specifically for hybrid architectures, and demonstrate speed increases of up to 400% when running the commonly used Cholesky matrix decomposition, matrix inverse and matrix determinant routines.

Original Contributions

The main contributions of this thesis are a collection of optimised algorithms that may be applied to general blocked linear algebra routines on hybrid and heterogeneous architectures, such as systems with a multicore CPU and a GPU accelerator, and systems consisting of multiple GPU accelerators. The first contribution is a new automated approach for blocked linear algebra routines that introduces a further level of dynamic blocking to balance the workload more efficiently between heterogeneous CPU and GPU computing devices, which have varying clock speeds and very different memory capacities. The second contribution considers the problem of transferring diagonal submatrices between CPU memory and GPU memory, an essential operation in many blocked linear algebra routines; we develop a novel algorithm that performs this transfer efficiently and demonstrate the resulting improvement in speed.
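For context, a diagonal block of a column-major matrix is not contiguous in memory: each of its nb columns is a short run of nb elements separated by the leading dimension of the full matrix. One conventional baseline is therefore a single strided copy with cudaMemcpy2D. The following is a minimal CUDA sketch of that baseline, assuming double precision and column-major storage; the helper name and signature are illustrative only, not the thesis's optimised transfer scheme or its API.

    #include <cuda_runtime.h>

    /* Copy the nb-by-nb diagonal block starting at A(i,i) from a
     * column-major host matrix (leading dimension lda) to the matching
     * position in a column-major device matrix (leading dimension ldda),
     * using one strided transfer. Illustrative helper only. */
    static cudaError_t copy_diag_block_h2d(double *dA, size_t ldda,
                                           const double *hA, size_t lda,
                                           size_t i, size_t nb)
    {
        /* cudaMemcpy2D views each column of the block as one "row" of
         * the transfer: nb rows of nb*sizeof(double) bytes each, strided
         * by the leading dimension (in bytes) on either side. */
        return cudaMemcpy2D(dA + i + i * ldda, ldda * sizeof(double),
                            hA + i + i * lda,  lda  * sizeof(double),
                            nb * sizeof(double), nb,
                            cudaMemcpyHostToDevice);
    }

Because such a copy touches nb short runs separated by a large stride, its effective bandwidth is typically far below that of a contiguous transfer of the same volume, which is what makes this operation worth optimising.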
The third contribution is an original method for running multiple GPU kernel functions simultaneously on GPUs that have no inherent hardware support for this capability. On such devices, a large number of GPU processors may be left idle while waiting for a single kernel to complete. We demonstrate our algorithm with an example in which a Cholesky decomposition kernel runs concurrently with a matrix multiply kernel, achieving much higher performance and efficiency than previously possible on the same hardware with existing state-of-the-art linear algebra libraries. We employ the Cholesky decomposition, matrix inverse and determinant operations as motivating examples throughout, and demonstrate up to a 400% increase in speed using combinations of the novel approaches presented.
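The front matter does not spell out the mechanism, but a common way to emulate concurrent kernels on a GPU that lacks hardware support for them (compare Figure 3.1, on exploiting the SIMT architecture to execute multiple kernels simultaneously) is to fuse the kernels into a single launch whose thread blocks dispatch on their block index, so one block runs the small task while the remaining blocks run the bulk task. A minimal CUDA sketch under that assumption follows; small_task, bulk_task and fused are placeholder names, and the workloads stand in for the thesis's Cholesky and matrix multiply kernels rather than reproducing them.

    #include <cuda_runtime.h>

    /* Placeholder for a small, latency-bound task (e.g. an unblocked
     * factorisation) executed by a single thread block. */
    __device__ void small_task(float *x, int n)
    {
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            x[i] = sqrtf(x[i]);
    }

    /* Placeholder for a bulk data-parallel task (e.g. a matrix multiply)
     * spread across the remaining thread blocks. */
    __device__ void bulk_task(float *y, int n, int blk, int nblk)
    {
        for (int i = blk * blockDim.x + threadIdx.x; i < n;
             i += nblk * blockDim.x)
            y[i] *= 2.0f;
    }

    /* One fused launch: block 0 runs the small task while every other
     * block runs the bulk task, so neither task leaves the rest of the
     * device idle. Launch with at least two blocks, e.g.
     * fused<<<blocks, threads>>>(d_x, nx, d_y, ny); */
    __global__ void fused(float *x, int nx, float *y, int ny)
    {
        if (blockIdx.x == 0)
            small_task(x, nx);
        else
            bulk_task(y, ny, blockIdx.x - 1, gridDim.x - 1);
    }

On later hardware with concurrent-kernel support, the same overlap is obtained more simply by launching the two kernels on separate cudaStream_t streams.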
Acknowledgements

This thesis was funded by EPSRC grant EP/E052029/1.

Contents

1 Introduction
 1.1 Computer Simulations
  1.1.1 Generating Random Numbers on a Computer
 1.2 Approaches to Parallel Simulation
  1.2.1 Communication and Synchronisation
  1.2.2 Parallel Random Number Generators
 1.3 Hardware Accelerators
  1.3.1 Hybrid Multicore Parallel Programming
  1.3.2 GPGPU
 1.4 Summary
2 Related Work
 2.1 Technologies to Parallelise Existing Code
  2.1.1 MPI
  2.1.2 OpenMP
  2.1.3 SSE
  2.1.4 Compiler Autovectorisation
  2.1.5 CUDA
  2.1.6 OpenCL
  2.1.7 HMPP
 2.2 Parallel MCMC Implementations
  2.2.1 Parallel Pseudo-Random Number Generation
  2.2.2 General Solutions for Parallelising Monte Carlo Algorithms
  2.2.3 Specific Parallel Monte Carlo Algorithms
 2.3 Parallel Numerical Libraries
  2.3.1 LAPACK
  2.3.2 Optimised BLAS
  2.3.3 ATLAS
  2.3.4 Linear Algebra on GPUs
  2.3.5 CULA
  2.3.6 MAGMA
 2.4 Summary
3 General Methodology
 3.1 Representing Matrices and Vectors in Memory
  3.1.1 Host Memory
  3.1.2 GPU Memory
  3.1.3 Copying Matrices and Vectors
 3.2 Theoretical Instruction Throughput
 3.3 Design of Linear Algebra Functions
  3.3.1 Automatic Vectorisation of C Code for the CPU
  3.3.2 Use of C++ Templates for GPU Kernels
  3.3.3 Generating Extra Precisions
  3.3.4 Exploiting the Differences between SIMT and SIMD
 3.4 Using Multiple GPUs
 3.5 Benchmarks and Error Analysis
  3.5.1 GPU Occupancy
  3.5.2 Timing Methods
  3.5.3 Tuning the Block Size
  3.5.4 Floating Point Error Analysis
 3.6 Summary
4 Hybrid Cholesky Decomposition
 4.1 Introduction
  4.1.1 LAPACK Unblocked Algorithm
  4.1.2 LAPACK Blocked Algorithm
  4.1.3 Hybrid Blocked Algorithm
 4.2 Current State of the Art Methods
  4.2.1 GPU Matrix Multiply
  4.2.2 GPU Symmetric Rank-K Update
  4.2.3 GPU Triangular Solve
 4.3 Improvements on the State of the Art
  4.3.1 Unblocked Cholesky on the CPU
  4.3.2 Optimising Diagonal Block Transfer
  4.3.3 Dynamic Block Sizing
  4.3.4 Unblocked Cholesky on the GPU
  4.3.5 Combining Unblocked Cholesky and Inverse with Matrix Multiplication on the GPU
  4.3.6 Alternatives to GPU Triangular Solve
 4.4 Results
 4.5 Using Multiple GPUs
 4.6 Discussion
5 Hybrid Cholesky Inverse
 5.1 Introduction
  5.1.1 LAPACK Unblocked Algorithm
  5.1.2 LAPACK Blocked Algorithm
  5.1.3 Hybrid Blocked Algorithm
 5.2 Improvements on the State of the Art
  5.2.1 GPU Triangular Matrix Multiply
  5.2.2 Unblocked Triangular Inverse on the CPU
  5.2.3 Unblocked Triangular Inverse on the GPU
  5.2.4 Unblocked Triangular Product on the CPU
  5.2.5 Unblocked Triangular Product on the GPU
  5.2.6 Alternatives to GPU Triangular Solve
  5.2.7 Improving Diagonal Block Transfer
  5.2.8 Dynamic Block Sizing
  5.2.9 Combining Unblocked Kernels with Matrix Multiplication on the GPU
 5.3 Results
 5.4 Discussion
6 Hybrid Cholesky Determinant
 6.1 Introduction
 6.2 Methods
  6.2.1 Parallel Reduction on the GPU
  6.2.2 Improving Memory Bandwidth
 6.3 Results
 6.4 Discussion
7 Conclusions and Discussion

List of Figures

1.1 Vertex and fragment processors in the nVidia GeForce 6800 GPU
3.1 Exploiting the SIMT architecture to execute multiple kernels simultaneously
4.1 Submatrices used in the blocked upper triangular Cholesky decomposition
4.2 Submatrices used in the blocked lower triangular Cholesky decomposition
4.3 Blocked matrix multiply
4.4 Blocked symmetric rank-K update
4.5 Blocked triangular matrix solve
4.6 Extending the diagonal block to a column in the upper triangular Cholesky decomposition
4.7 Extending the diagonal block to a column in the lower triangular Cholesky decomposition