Solving Diagonally Dominant Tridiagonal Linear Systems with FPGAs in an Heterogeneous Computing Environment

by Hamish J. Macintosh, BEng (Electrical)
School of Electrical Engineering and Computer Science
Science and Engineering Faculty
Queensland University of Technology

A dissertation submitted in fulfilment of the requirements for the degree of Master of Philosophy (Engineering)

2019

Keywords: tridiagonal, system of linear equations, diagonally dominant, heterogeneous computing, FPGA, GPU, CPU, high performance computing, cyclic reduction, parallel cyclic reduction, SPIKE, truncated SPIKE, optimisation, power efficiency.

Statement of Original Authorship

In accordance with the requirements of the degree of Master of Philosophy (Engineering) in the School of Electrical Engineering and Computer Science, I present the following thesis entitled,

Solving Diagonally Dominant Tridiagonal Linear Systems with FPGAs in an Heterogeneous Computing Environment

This work was performed under the supervision of Dr Jasmine Banks, Dr Neil Kelson and Prof. Troy Farrell. I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at Queensland University of Technology or any other institution.

Hamish J. Macintosh

QUT Verified Signature

June 2019

Acknowledgements

I’d like to acknowledge the support and guidance provided by my supervisory team, Dr. Jasmine Banks, Dr. Neil Kelson and Professor Troy Farrell. Mr David Warne deserves a special mention: thank you for your encouragement and continual prodding to write this damned document. All of the specialised computing resources used in this thesis were generously provided by the Queensland University of Technology’s eResearch Office. Thank you! Finally, I’d like to thank my wife Laura, for always believing I’d finish this thesis . . . eventually.

Abstract

The primary motivation for this research is to determine the feasibility of targeting Field Programmable Gate Arrays (FPGAs) for use in accelerating general purpose scientific computing on High Performance Computing (HPC) platforms. FPGAs are typically hard to program as they require electronic engineers to write HDL circuit designs. However, the Open Computing Language (OpenCL) is a programming framework that is able to target a wide variety of computational accelerators and provides a way for software engineers to use high level languages (C and C++) to program FPGAs. As a use case for a common scientific task, this investigation focuses on the implementation and performance characteristics of diagonally dominant tridiagonal linear systems solvers. These implementations are targeted for FPGAs as well as the more traditional compute devices, Graphical Processing Units (GPUs) and Central Processing Units (CPUs), using OpenCL to provide a common programming paradigm. The results of this investigation are presented in two research papers which make up the body of this thesis.

The first paper, Implementation of parallel tridiagonal solvers for a heterogeneous computing environment, explores the design of truly portable implementations of the Parallel Cyclic Reduction (PCR) and SPIKE algorithms with OpenCL to solve diagonally dominant tridiagonal linear systems. The implementations target both GPU and FPGA hardware and are evaluated in terms of solver performance, resource efficiency and numerical accuracy for fixed sized small matrices. The proposed GPU implementations of PCR and SPIKE developed in this work outperform similar routines provided by NVIDIA and presented in other research papers. The FPGA PCR and SPIKE compute performance was estimated using simulation and timing results. The simulations for the FPGA are promising and estimate similar performance to the GPU PCR and SPIKE implementations.

While the work presented in the first paper shows it is feasible to create fully portable OpenCL kernels to target the GPU and FPGA, in doing so the code is optimised for neither device. The second paper, Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to target FPGAs, GPUs and CPUs, addresses this and furthers the work described in the first paper by presenting the 'oclspkt' routine. This routine is an heterogeneous OpenCL implementation of the truncated-SPIKE algorithm that can use CPUs, GPUs and FPGAs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. It is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator depending on the accelerator's compute performance. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150% and 170% faster than comparable device-optimised third-party solvers. With regards to heterogeneous combinations of compute devices, the GPU-FPGA combination has the best compute performance and the FPGA-only configuration has the best overall energy efficiency.

The results for compute performance in the second paper are shown to be memory bandwidth constrained. While the GPU kernel compute performance is several times faster than the FPGA kernel, the FPGA test hardware has several times less available memory bandwidth. With new generations of FPGA accelerator boards being equipped with high-bandwidth data transfer technology, the performance gap between devices is expected to close. This, coupled with the fractional power requirements of the FPGA, makes it an attractive concept for future HPC requirements.

List of Publications

The following papers are included in this thesis as chapters.

Chapter 3: Macintosh, Hamish, Warne, David, Kelson, Neil A., Banks, Jasmine, & Farrell, Troy W. (2016) Implementation of parallel tridiagonal solvers for a heterogeneous computing environment. The ANZIAM Journal, 56, C446-C462

Chapter 4: Macintosh, Hamish, Banks, Jasmine, & Kelson, Neil A. Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to target FPGAs, GPUs, and CPUs, submitted to the International Journal of Reconfigurable Computing.

vii Contents

1 Introduction ...... 1
1.1 Research Problem ...... 2
1.2 Methodology ...... 2
1.3 Thesis Outline ...... 3

2 Literature Review ...... 5
2.1 Field Programmable Gate Arrays ...... 6
2.2 FPGAs and Accelerating Linear Algebra Routines ...... 7
2.2.1 HDL Implementations ...... 8
2.2.2 HLL Implementations ...... 10
2.3 OpenCL: The Open Computing Language ...... 11
2.4 Heterogeneous Computing with OpenCL ...... 13
2.5 Diagonally Dominant Tridiagonal Linear Systems ...... 15
2.6 Algorithms to Solve Tridiagonal Linear Systems ...... 17
2.6.1 Thomas Algorithm (TDMA) ...... 17
2.6.2 Recursive Doubling ...... 17
2.6.3 Cyclic Reduction and Parallel Cyclic Reduction ...... 18
2.6.4 SPIKE Algorithm ...... 19
2.6.5 Conjugate Gradient Algorithm ...... 21
2.7 Heterogeneous Linear System Solvers ...... 21
2.8 Summary Analysis and Relevance to this Work ...... 22

3 Investigation 1: Portable Solvers of Small Tridiagonal Systems ...... 25
3.1 Introduction ...... 28
3.2 Background ...... 29
3.2.1 Tridiagonal linear systems ...... 29
3.2.2 Heterogeneous computing with OpenCL ...... 30
3.3 Parallel tridiagonal linear systems solvers ...... 30


3.3.1 Parallel cyclic reduction ...... 31
3.3.2 SPIKE ...... 31
3.3.3 Implementation using OpenCL ...... 32
3.4 Evaluation ...... 32
3.4.1 FPGA resource utilisation ...... 33
3.4.2 Compute performance ...... 33
3.4.3 Numerical accuracy ...... 35
3.4.4 Power utilisation ...... 35
3.5 Conclusion ...... 36

4 Investigation 2: An Heterogeneous Solver for Tridiagonal Systems ...... 37
4.1 Introduction ...... 41
4.2 Background ...... 43
4.2.1 Tridiagonal linear systems ...... 43
4.2.2 The SPIKE Algorithm ...... 44
4.3 Implementation ...... 46
4.3.1 FPGA Implementation ...... 48
4.3.2 Porting the truncated-Spike Kernel to CPU and GPU ...... 53
4.3.3 The Heterogeneous Solver ...... 56
4.4 Evaluation ...... 57
4.4.1 Compute Performance ...... 58
4.4.2 Numerical Accuracy ...... 60
4.4.3 Energy Consumption and Efficiency ...... 60
4.5 Conclusion ...... 62
4.6 Data Availability ...... 64
4.7 Conflicts of Interest ...... 64
4.8 Acknowledgements ...... 64

5 Conclusion ...... 65
5.1 Summary of Thesis Objectives ...... 65
5.2 Summary of Contributions ...... 68
5.3 Recommendations for Further work ...... 69

Bibliography ...... 70

List of Figures

1.1 High level overview of research methodology ...... 3
1.2 Iterative investigation process methodology ...... 4
2.1 Basic FPGA architecture ...... 6
2.2 OpenCL Memory and Programming model. Adapted from [1] ...... 12
2.3 OpenCL Platform model. Adapted from [1] ...... 13
3.1 Compute throughput of parallel cyclic reduction implementations versus a single threaded Thomas algorithm ...... 34
3.2 Compute throughput of spike implementations versus a single threaded Thomas algorithm ...... 35
4.1 An overview of the anatomy and execution flow of the oclspkt solver ...... 47
4.2 The FPGA TDMA OpenCL kernel tdma, with the execution path and data dependencies shown. The tdma executes as a single Work Item kernel ...... 49
4.3 The FPGA truncated–SPIKE OpenCL kernel spktrunc, with the execution path and data dependencies shown. The spktrunc executes as a single Work Item kernel ...... 50
4.4 The CPU truncated–SPIKE OpenCL kernels spkfact and spkrec, with the execution path and data dependencies shown. Both kernels are executed as an NDRange of Work Items ...... 55
4.5 The GPU truncated–SPIKE OpenCL kernels spkfact, spkrecul and spkreclu, with the execution path and data dependencies shown. All kernels are executed as an NDRange of Work Items ...... 56
4.6 Time (ms) to solve system of size N = 256 × 10^6 using oclspkt, targeting CPU, GPU and FPGA devices ...... 58
4.7 Comparing time (ms) to solve system of size N = 256 × 10^6 using oclspkt, dgtsv, sdtsvb and TDMA ...... 58


4.8 Performance comparison in rows solved per second for N = (256 ... 1280) × 10^6 when targeting CPU, GPU, FPGA and heterogeneous combinations of devices ...... 61
4.9 Numerical accuracy of oclspkt for varying diagonal dominance compared to CPU TDMA solver ...... 62
4.10 Joules required to solve tridiagonal system of N = 256 × 10^6 per device using the oclspkt routine ...... 62
4.11 Energy efficiency comparison in rows solved per second for N = (256 ... 1280) × 10^6 for oclspkt when targeting devices CPU, GPU, FPGA and heterogeneous combinations ...... 63

List of Terms

ASIC Application Specific Integrated Circuit. 6, 7
ATLAS Auto Tuned Linear Algebra Subprograms. 9, 11, 22
AVX-256 256 bit Advanced Vector Extension. 14

bitstream A binary file that contains programming information for an FPGA. 6
BRAM Block RAM. 15, 22

CG Conjugate Gradient. 8, 11, 21, 22, 24
Compute Unit An OpenCL term for a group of Processing Elements. 11, 12, 14, 25
CPU Central Processing Unit. v, 1–5, 8–12, 14–17, 21–24, 66, 68, 69
CR Cyclic Reduction. 18, 24
CUBLAS CUDA Basic Linear Algebra Subprograms. 10
CUDA A proprietary parallel programming paradigm invented by NVIDIA, intended to be used with NVIDIA GPUs. 11, 68

DMA Direct Memory Access. 15

FP16 16 bit Floating Point. 1, 9, 10
FP32 32 bit Floating Point. 9, 10, 14, 23, 38
FP64 64 bit Floating Point. 8, 9, 11
FPGA Field Programmable Gate Array. v, 1–12, 14–17, 21–24, 65–69

GEMM GEneral Matrix Multiply. 9


global memory An OpenCL memory object, high capacity off-chip memory with read/write access from all Compute Units and the host. 14, 15
GPU Graphical Processing Unit. v, 1–5, 10–12, 14–19, 21–24, 66, 68

HDL Hardware Description Language. 2, 5, 7, 8, 66–69
HLL High Level Language. 2, 5, 7, 10, 65–69
HPC High Performance Computing. v, 1, 2, 65

IDE Integrated Development Environment. 68, 69
Intel MKL Intel Math Kernel Library. 8–10, 22

LAPACK Linear Algebra PACKage. 9
local memory An OpenCL memory object, medium capacity fast on-chip memory with read/write access from all Processing Elements in a Compute Unit. 14, 15, 18
LSU Load Store Units. 15, 67

OpenCL Open Computing Language. A software framework designed to execute a program in an heterogeneous computing environment. 1–5, 7, 10–15, 23, 66–69

PCR Parallel Cyclic Reduction. 17, 18, 24, 25, 68
private memory An OpenCL memory object, small capacity fast on-chip memory registers with read/write access only from a single Processing Element. 12, 15
Processing Element An OpenCL term for a computing resource that can execute a Work Item. 11, 12, 14, 26

RD Recursive Doubling. 17, 18
RHS Right Hand Side. 22, 38

SDK Software Development Kit. 10, 11
SIMD Single Instruction Multiple Data. 14, 26
SM Streaming Multiprocessor. 11, 14

TDMA Tridiagonal Matrix Algorithm. 11, 16, 17, 19, 24, 38

VHDL Very high-speed integrated circuit Hardware Description Language. 7, 10

Work Group An OpenCL term for a group of Work Items. 11, 13, 14
Work Item An OpenCL term for a thread of code. 11, 14, 15

Chapter 1

Introduction

The primary motivation for this research is to determine the feasibility of targeting Field Programmable Gate Arrays (FPGAs) for use in accelerating general purpose scientific computing on HPC platforms. HPC platforms are becoming more reliant on specialist co-processors, resulting in a heterogeneous computing environment. Traditionally, multi-core Central Processing Units (CPUs) have provided the majority of the computing power in an HPC cluster, and they have only recently been surpassed, in terms of FLOPS, by Graphical Processing Units (GPUs) [2]. Prior to this, FPGAs were primarily used in embedded devices or for hardware prototyping of application specific integrated circuits. Recently the floating point performance of FPGAs has markedly improved, and with the rise of compute devices that feature reduced precision data types like 16 bit Floating Point (FP16) [3, 4], FPGAs have become much more relevant to the HPC arena, particularly as we approach exascale machines [5].

The varied programming models needed to target specific devices in a heterogeneous computing environment make developing a general portable solution difficult. OpenCL is an open programming framework for heterogeneous computing [1]. Implementations of this framework have been developed by vendors of a wide range of computing devices including CPUs, GPUs and, in more recent years, FPGAs. In a perfect world this would enable a developer to build a fully portable computational routine which will execute on any of the available resources.

This investigation focuses on the implementation and performance characteristics of diagonally dominant tridiagonal linear systems solvers as a use-case for a common scientific computing task. We target these implementations for FPGAs as well as the more traditional compute devices, the GPUs and CPUs. The remainder of this chapter will explicitly state our research questions, discuss the aims and objectives of this work and present the structure of the remaining chapters.

1.1 Research Problem

The aim of this research is to determine the feasibility of targeting FPGAs for use in accelerating general purpose scientific computing on HPC platforms, and how they compare to more traditional CPU and GPU accelerators in terms of compute performance and utilisation. To achieve this, portable linear systems solvers will be developed that are suitable for deployment on the aforementioned accelerators. These solvers will be tested for numerical accuracy, computational throughput, power efficiency and resource utilisation to allow comparisons to be made between the different hardware. The key questions this program of research aims to answer are:

1. Is it feasible to integrate non-traditional compute devices like FPGAs with GPUs and/or CPUs for general purpose scientific computing tasks?

2. Can portable diagonally dominant tridiagonal linear systems solvers be designed that could be deployed to any number of CPUs, GPUs or FPGAs present on a host system?

3. Can optimised portable code be created to target multiple devices with OpenCL?

4. Is the same compute performance achievable from High Level Languages (HLLs) as from Hardware Description Languages (HDLs) when targeting FPGAs?

5. Is developing optimised FPGA kernels easier with HLLs than with HDLs?

1.2 Methodology

The methodology used to investigate these research problems was iterative, as illustrated in Figure 1.1. Initially a literature review was conducted (chapter 2) to determine the methods commonly used in the past to accelerate the solving of diagonally dominant tridiagonal linear systems. This included the algorithms and programming paradigms used, the relative performance of solvers, the computational hardware targeted, and strategies for deploying heterogeneous solvers. Next, we conducted two investigations: first, an investigation into solving many small systems with GPUs and FPGAs; and second, an investigation into solving a single large system with CPUs, GPUs and FPGAs.

Figure 1.2 illustrates the iterative cycle followed in pursuing these investigations. For each specific use-case application we designed the algorithm using a high level scripting language before porting the code to OpenCL C and testing it on each target device for functional correctness. Following this, the OpenCL kernels were profiled and benchmarking unit tests were run to determine the compute and energy performance as well as the numerical accuracy. The cycle was repeated as necessary to achieve the desired results, and from each investigation a research output was produced; these are presented in chapter 3 and chapter 4.

Figure 1.1: High level overview of research methodology.

1.3 Thesis Outline

The remainder of this thesis is organised as follows. A review of relevant literature is presented in chapter 2: Literature Review, which provides pertinent background information on tridiagonal systems, different numerical methods used to solve tridiagonal linear systems, and how these have been implemented on different computation devices.

Figure 1.2: Iterative investigation process methodology.

In chapter 3: Solving Small TD Systems, an investigation into implementing truly portable OpenCL kernels of the Parallel Cyclic Reduction and Recursive SPIKE solver algorithms for small diagonally dominant tridiagonal systems is presented in the form of a published paper. The algorithms are implemented on GPU and FPGA hardware and we evaluate these designs in the context of solver performance, resource efficiency and numerical accuracy.

Chapter 4: Solving Large TD Systems then builds on the work of chapter 3. In this chapter a submitted paper on the implementation of an optimised truncated-SPIKE algorithm for tridiagonal systems of any size is presented. This solver, called oclspkt, targets CPU, GPU and FPGA hardware and also provides a unified heterogeneous solver utilising two or more of the available CPU, GPU and FPGA devices.

Finally, chapter 5: Conclusion concludes the thesis with a discussion of the research problem and research papers presented in the context of the research questions that were proposed in section 1.1.

Chapter 2

Literature Review

This chapter provides the background information required to give context to this thesis and a review of literature relevant to solving diagonally dominant tridiagonal linear systems using FPGAs in a heterogeneous computing environment. This chapter is structured as follows.

Firstly, an overview of the FPGA device, its history and architecture is presented. This section is followed by a discussion of how FPGAs have been used to accelerate linear algebra routines. As there are relatively few tridiagonal linear system solver implementations targeting FPGAs, literature concerning how FPGAs have been used in general linear algebra routines is presented, giving an insight into how they have previously been utilised for fundamental scientific computations. The section is divided into literature concerning HDL implementations and HLL implementations.

Following this, the OpenCL framework is introduced, including how OpenCL can be used to target different accelerator devices in a heterogeneous computing environment. The section briefly covers the complexities encountered when writing optimised code targeting CPUs, GPUs and FPGAs respectively.

A definition of a diagonally dominant tridiagonal linear system is developed along with a discussion of its significance in scientific computation. This is followed by a sub-section that examines common serial and parallel algorithms used to solve tridiagonal linear systems and how these algorithms have been implemented previously. Of particular interest are the compute device targeted, the programming language and the technology used to implement the solver. Finally, a review of past implementations of heterogeneous diagonally dominant tridiagonal linear systems solvers is presented and the key research gaps identified by this literature review are highlighted.

2.1 Field Programmable Gate Arrays

FPGAs are monolithic integrated circuits that can be reconfigured in real time as application specific digital circuits. As illustrated in Figure 2.1, at the simplest level the FPGA architecture comprises an array of different logic blocks including general logic, memory and multiply-accumulators. A matrix of programmable interconnects connects these logic blocks, which in turn are surrounded by programmable I/O cells [6, 7]. When the FPGA is powered on, a pre-compiled bitstream is used to route programmable logic blocks to I/O pins via the matrix of programmable interconnects.

Figure 2.1: Basic FPGA architecture

An FPGA may cost anywhere from less than one hundred to many thousands of dollars to purchase and can be continually reconfigured to suit the application. This is markedly different to Application Specific Integrated Circuits (ASICs), where the research and development costs can run into the millions before a single chip is produced, requiring large scale production to reduce the unit cost of this single-application component. Another advantage of the FPGA is that if there is an error in the FPGA program functionality, an update can be designed and deployed after initial shipping of the product; this is not possible with an ASIC. However, despite this flexibility, the FPGA loses out in performance and is 20 to 30 times slower than the ASIC [6].

Traditionally, to program an FPGA, specialised engineers design the circuit using HDLs like Very high-speed integrated circuit Hardware Description Language (VHDL) or Verilog. HDLs require hardware and electronic engineering expertise to work with the complexities of register level logic, gate timing and configurable clocks. This has generally precluded software engineers from targeting FPGA hardware, until the recent advent of HLL workflows and FPGA software development kits for OpenCL.

2.2 FPGAs and Accelerating Linear Algebra Routines

Historically, FPGAs have been used only sporadically as computational accelerators by the wider scientific community, despite the application of FPGAs to solve computational problems being an active area of research [8]. The FPGA's adoption as an accelerator has been limited by the poor portability of HDL code written for specific FPGA hardware, which prevents application specific computational routines, like linear algebra functions, from being repurposed to suit whatever FPGA hardware is available. This has been compounded by the limited availability of middleware tools such as drivers.

Linear algebra routines make up the fundamental computational components of a wide variety of scientific computing applications. These routines are characterised by intensive floating point computational loads and a high communication-to-computation ratio [9]. The re-configurable nature of FPGAs allows for algorithm dependent hardware optimisations, including multiple discrete caches and data buffers between computational stages and asynchronous execution of tasks, which should make the FPGA ideal for managing these factors. However, FPGA implementations of linear algebra routines are, as noted, generally device specific, which makes universally accessible FPGA hardware libraries difficult to design. The remainder of this section presents a review of the literature to date involving HDL and HLL implementations of linear algebra routines on FPGA hardware.

2.2.1 HDL Implementations

Iterative solvers for systems of linear equations have been a popular choice for implementation on FPGA hardware [10, 9, 11], since the algorithms mostly consist of vector-matrix multiplications that can be carried out concurrently. Cole et al. [10] explore rapid prototyping of a conjugate gradient algorithm implementation on an FPGA, and compare the results with a CPU implementation. In this instance the FPGA implementation performs poorly when compared to the implementation on the CPU. However, the authors make the point that the FPGA hardware targeted had significant limitations in terms of clock speeds and memory bandwidth when compared like-for-like with the CPU used in the comparison, and that targeting a 'non-development' FPGA board would therefore likely show improved results. This conclusion is upheld by Guiming et al. [11], who furthered this work by implementing a Conjugate Gradient (CG) solver on an FPGA for a matrix of arbitrary size and demonstrated a 4.62–9.24× speed increase over the same MATLAB CG solver running on a CPU. The authors do note, however, that the recorded speedup was achieved while allowing the error tolerance of the FPGA to be in the range of $10^{-4}$, whereas the CPU MATLAB routine, which by default uses 64 bit Floating Point (FP64), will generally have an error tolerance in the order of $10^{-16}$. Increasing the precision of the FPGA implementation would likely reduce the observed speed gains.

Direct solvers for systems of linear equations using FPGAs have also been an active area of research [12, 13, 14, 15, 16, 17, 18]. These algorithms differ from iterative solvers since they generally require more division operations and have inter-loop dependences that often serialise the execution. They do, however, benefit from requiring fewer memory transactions than iterative methods, making direct solver algorithms better suited to solving larger linear systems. In a notable case, Zhang et al. [8] developed a software-based HDL generator to automate the design of a direct linear system solver based on the FPGA hardware resources available. The authors recognised a need for a portable and scalable design. They implemented an LU factorisation solver and showed how the generator implements the design across multiple FPGAs. The authors tested their designs with matrices of up to 40,000 elements, the upper limit of the input matrix being bound only by the off-chip RAM available on the accelerator card. Results were compared on a like-for-like basis with a CPU using the same data type (FP64) and an optimised Intel Math Kernel Library (Intel MKL) LU solver. A 10% increase in computing performance and 3.6× better energy efficiency were demonstrated using the FPGA.

In contrast, in a consideration of FPGA power efficiency, Giefers et al. [19] implemented a general matrix multiply circuit on an FPGA using FP64 and found that the FPGA implementation was only more energy efficient when considering the compute devices in isolation. When considering the full computer system running the FPGA, the FPGA GEneral Matrix Multiply (GEMM) was not as energy efficient as a multithreaded CPU software implementation.

Combining the direct and iterative methods, Junqing et al. [9] designed a mixed precision dense linear system solver. The authors used a heterogeneous reduced precision direct solver on an FPGA and CPU as a pre-conditioner to iterative refinement on a CPU. The direct solver uses low precision data types to perform LU decomposition, with pivoting done on an FPGA. The resulting LU triangular matrices are then solved on the CPU using FP64. This FPGA LU decomposer and CPU solver acts as a pre-conditioner for a subsequent iterative refinement algorithm implemented on the CPU. The iterative refinement algorithm improves the numerical accuracy of the solution. The authors tested FP16, 32 bit Floating Point (FP32) and FP64 implementations of the FPGA LU decomposer against an Auto Tuned Linear Algebra Subprograms (ATLAS) CPU only solver. Their experiments showed that the FP16 FPGA LU decomposer / FP64 CPU hybrid provides a 2× to 3× performance boost over the pure CPU solver. It should be noted that in this testing, matrices of 128 to 4096 elements were used. This work appears to be the first truly heterogeneous implementation of a mixed precision linear algebra routine using FPGAs and CPUs.

Efforts have also been made to provide a hardware library for linear algebra routines. Gonzalez et al. presented LAPACKrc (LAPACK-reconfigurable) [20], a linear algebra library of FPGA kernels consisting of dense, sparse and least squares solver routines. The performance across all implemented routines is reported to be 40–150× that of similar Intel MKL optimised Linear Algebra PACKage (LAPACK) routines. As in [11] and [10], the performance characteristics presented by Gonzalez et al. [20] are compared against double precision FP64 Intel MKL optimised routines. For example, the authors test 16–64 bit (number representation not specified) implementations of QR factorisation against the FP64 Intel MKL QR factorisation routine. The 64 bit FPGA implementation performance is on par with Intel MKL QR; however, the significant speed increases come with a decrease in precision in the FPGA implementation.

In contrast, Rafique et al. [21] provide close to a like-for-like performance comparison of the Lanczos method for symmetric eigenvalue computation for small matrices. They compare single precision FP16 implementations of Intel MKL (CPU) and CUDA Basic Linear Algebra Subprograms (CUBLAS) (GPU) to their VHDL implementation on an FPGA. They report a mean speed up of 13.4× and 52.8× over the CPU and GPU respectively for solving a single eigenvalue problem, and a mean speed up of 103× and 408× when solving multiple eigenvalue problems with the same input matrix.
The authors discussed how the substantial performance increases could be attributed to the small problem sizes (matrices of up to 335×335) not efficiently fitting the CPU and GPU architectures, whereas the FPGA implementation was custom designed for the problem sets considered. Further, in the tests solving multiple eigenvalue problems the authors did not use multithreading on the CPU and GPU benchmarks. Instead, they iterated through the independent systems sequentially. This experimental condition ensures that the inefficiencies present in solving a single eigenvalue problem of this size on the CPUs and GPUs are not mitigated by the increased workload when solving multiple eigenvalue problems, and therefore skews the comparison in favour of the FPGA implementation.

2.2.2 HLL Implementations

In recent years both major FPGA vendors, Intel FPGA (formerly Altera) and Xilinx, have released OpenCL (section 2.3) software development kits and runtime software for specific PCIe attached FPGA expansion cards [22, 23]. The OpenCL Software Development Kits (SDKs) allow software engineers to write C-like code to be compiled into a bitstream that can configure an FPGA. FPGA vendors take advantage of the reconfigurable nature of the hardware by implementing a static top module to handle FPGA to host communication, data transfer, and OpenCL kernel execution scheduling. The remaining logic fabric is free for runtime reconfiguration with precompiled OpenCL-kernel bitstreams.

Prior work has been done implementing linear algebra routines using HLLs on FPGA hardware. Czajkowski et al. [24] presented a general FP32 implementation and demonstrated good performance, but the input matrix sizes are limited by the amount of onboard RAM. Warne et al. [25] compared implementations of the Tridiagonal Matrix Algorithm (TDMA) using the Altera (now Intel) OpenCL SDK and Xilinx HLS. This approach provides a modicum of portability when compared to the VHDL implementation in [15].

Much like Junqing et al. [9], Angerer et al. [26] implemented an FPGA+GPU heterogeneous CG solver. The authors implemented the backward phase using 8 and 16 bit fixed point on the FPGA using OpenCL. The accuracy of the solution is then improved by computing the forward phase using FP64. Testing with very large dense matrices of up to 24064×24064 elements, the authors reported a 3.7× computation speed up and a 3.5× reduction in 'energy-to-solution' when compared to a multithreaded ATLAS CPU implementation. Interestingly, the authors comment that when tested in isolation the FPGA and GPU accelerators do not perform as well as the CPU.

2.3 OpenCL: The Open Computing Language

OpenCL is a programming framework that allows developers to target a wide range of compute devices using C/C++ APIs. The specification for OpenCL defines the framework in terms of four models: platform, execution, memory and programming [1, 27].

The platform model defines the hardware of the host environment, allowing a programmer to list and allocate available computational devices. The execution model then uses the hardware described in the platform model to configure the OpenCL environment and define how kernels are deployed to the devices. The memory model defines the abstracted memory hierarchy of the device, whilst the programming model describes how parallel computation is mapped to the device hardware.

For the programming model, a device's compute resources are divided up at the smallest level into Processing Elements, and depending on the device architecture one or more Processing Elements are grouped into one or many Compute Units [27]. For example, an NVIDIA GPU built with the Maxwell architecture has a Streaming Multiprocessor (SM) with 128 CUDA cores [28]; the SM and the CUDA core map directly to the OpenCL Compute Unit and Processing Element respectively. Similarly, the threads of device kernel code are called Work Items and are grouped in Work Groups. Work Items and Work Groups are mapped to the Processing Element and Compute Unit hardware respectively.

Figure 2.2: OpenCL Memory and Programming model. Adapted from [1].

The memory model abstracts the types of memory a device has available. These are defined by OpenCL as global, local and private memory. Global memory is generally high-capacity off-chip memory that can be accessed by all Processing Elements across the device; the GPU's GDDR5 RAM and the CPU's and FPGA's DDR4 RAM banks are examples of global memory. Local memory is on-chip, has higher bandwidth and lower capacity than global memory, and is only accessible to Processing Elements of the same Compute Unit. GPU local memory is the L2 cache, the L1/2/3 cache for a CPU and BRAM banks on an FPGA. Finally, private memory refers to on-chip register memory space and is only accessible within a particular Processing Element. The programming and memory models are shown in Figure 2.2.

Unlike the programming and memory models, the platform (Figure 2.3) and execution models are set up on the host side of the OpenCL application. The platform model queries the available devices on the host machine and their properties. With this information the execution model defines the following (a minimal host-side sketch is given after the list):

• the OpenCL context object, created with one or more devices from the platform model;

Figure 2.3: OpenCL Platform model. Adapted from [1].

• the program object, read from a pre-compiled OpenCL binary or read and compiled in real time from source code;

• one or more kernel objects defined in the OpenCL program;

• one or more command queues for orchestrating read and write data operations and kernel executions with queues executing in order or out of order;

• event objects for synchronisation of OpenCL commands and profiling operations;

• memory buffers on the host and compute devices and schedules for read and write operations; and

• mapping kernel arguments and executing OpenCL kernels on devices by way of ‘enqueuing’ one to three dimensional arrays of Work Groups.
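To make these host-side responsibilities concrete, the sketch below is a minimal single-device OpenCL host program in C. It is illustrative only: the kernel vec_scale and its source string are hypothetical placeholders, only the first platform and device found are used, and error checking is omitted for brevity.

#include <CL/cl.h>
#include <stdio.h>

/* Hypothetical kernel: scales a vector by a constant. */
static const char *src =
    "__kernel void vec_scale(__global float *x, const float s) {"
    "    x[get_global_id(0)] *= s;"
    "}";

int main(void)
{
    float data[1024];
    for (int i = 0; i < 1024; ++i) data[i] = (float)i;

    /* Platform model: discover a platform and a device. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    /* Execution model: context, program, kernel and command queue. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel krn = clCreateKernel(prog, "vec_scale", NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Memory model: a device-side global memory buffer. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(data), NULL, NULL);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    /* Map kernel arguments and enqueue a 1D NDRange of Work Items. */
    float s = 2.0f;
    clSetKernelArg(krn, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(krn, 1, sizeof(float), &s);
    size_t global = 1024, local = 64;
    clEnqueueNDRangeKernel(queue, krn, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("x[10] = %f\n", data[10]);

    clReleaseMemObject(buf); clReleaseKernel(krn); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}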

2.4 Heterogeneous Computing with OpenCL

A wide variety of vendors support the OpenCL specification for their products [29]. This work focuses on heterogeneous computing utilising a subset of these products: OpenCL compliant CPUs, GPUs and FPGAs from Intel, NVIDIA, and Intel FPGA respectively. Since OpenCL was originally designed as a cross vendor GPU programming paradigm [27], the programming and memory models map well to the conventional GPU Single Instruction Multiple Data (SIMD) vector processing architecture. However, as different devices with wildly different hardware architectures are added to the conformance list, vendors have implemented the OpenCL execution, programming and memory models slightly differently depending on the device being targeted. There are numerous requirements and factors to consider when targeting CPUs, GPUs, and FPGAs with OpenCL.

NVIDIA GPU SMs schedule CUDA core executions in groups of 'warps', generally groups of 32 CUDA cores executing FP32 SIMD operations [30]. It is therefore no surprise that the optimum number of Work Items per Work Group is a multiple of 32. For maximum memory bandwidth, global memory transactions must be single strided between adjacent Work Items; this ensures fully coalesced memory reads and writes with no wasted bandwidth. Additionally, caching using the GPU's L2 shared memory or local memory must be explicitly coded for by the programmer if random memory access is required for Work Items in a Work Group.

For Intel CPUs, if the kernel code does not include Work Item dependent branch divergence and has single strided memory access for Work Items in a Work Group, the Intel OpenCL compiler will try to extract the maximum amount of vectorisation possible. For example, Intel CPUs with the 256 bit Advanced Vector Extension (AVX-256) instruction set can potentially run 8 SIMD FP32 operations per core; hence, highly vectorisable code in this instance will have 8 Work Items per Work Group. Unlike the NVIDIA GPUs, Intel CPUs employ sophisticated hardware caching, meaning explicitly moving global memory to local memory can actually hinder performance [31].

Unlike the CPUs and GPUs, it is possible to configure the number of Processing Elements and Compute Units in any number of combinations for the FPGA. However, where possible it is optimal to code FPGA OpenCL kernels to run as a single Work Item [32]. That is, the kernel code will reconfigure the FPGA as a single monolithic Processing Element inside a single Compute Unit to execute one Work Item for one Work Group. This essentially serialises most parallel algorithms by iterating through threads instead of executing them concurrently on different Processing Elements. Concurrency and performance are achieved by exploiting the FPGA's fine grained pipelined parallelism inside the thread. Further, implementing FPGA kernels as single Work Item kernels reduces the amount of resources needed for thread synchronisation and global memory transactions. The Intel FPGA OpenCL compiler implements Load Store Units (LSUs) as required to interface kernels with global memory, and the programmer can explicitly create Block RAM (BRAM) memory caches with local memory as required; alternatively, private memory arrays of a large enough size will automatically be implemented on BRAM banks.

On the host side the execution model has several key differences depending on the device being targeted. For example, both the GPU and FPGA require device side memory buffers, whereas the CPU can use the in-place host memory for computation. Further, for fast global memory transactions Intel CPUs require memory spaces aligned to 4096 byte boundaries. To enable Direct Memory Access (DMA) across PCIe for data transfer to the FPGA, host side memory spaces need 64 byte aligned boundaries. For the GPU to use DMA, host side memory has to be 'pinned' and mapped to the device before transfer.

The different OpenCL implementation considerations described in this section give only a brief overview of the complexities to consider when trying to develop portable heterogeneous code. The OpenCL framework provides a unified language to target multiple devices in a heterogeneous computing environment, but the nature of the underlying architectures dictates that a truly portable kernel code is difficult to design.
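The contrast between these styles can be illustrated with two hypothetical OpenCL C kernels performing the same element-wise update. The first is written as an NDRange kernel with one Work Item per element, matching the CPU and GPU mapping described above; the second is written as a single Work Item loop, which the FPGA compiler pipelines. The kernel names are illustrative, and the max_global_work_dim(0) attribute is an Intel FPGA SDK extension.

/* NDRange style: one Work Item per element. Adjacent Work Items access
 * adjacent elements of x and y, giving coalesced global memory traffic. */
__kernel void axpy_ndrange(__global const float *x,
                           __global float *y,
                           const float alpha)
{
    size_t i = get_global_id(0);
    y[i] = alpha * x[i] + y[i];
}

/* Single Work Item style: the whole loop executes in one Work Item and
 * the FPGA compiler pipelines successive iterations. */
__attribute__((max_global_work_dim(0)))
__kernel void axpy_single_wi(__global const float * restrict x,
                             __global float * restrict y,
                             const float alpha,
                             const unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}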

2.5 Diagonally Dominant Tridiagonal Linear Systems

Computing the solution to a tridiagonal linear system is a common scientific computing task. As such, it is an ideal use case to investigate the feasibility of heterogeneous FPGA, GPU and CPU computing. A tridiagonal linear system can be defined as one whose coefficient matrix has a bandwidth of β = 1 in the linear system A · x = y, as illustrated in

Equation 2.1.

\[
A =
\begin{bmatrix}
a_{1,1} & a_{1,2} & & & \\
a_{2,1} & a_{2,2} & a_{2,3} & & \\
& \ddots & \ddots & \ddots & \\
& & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n} \\
& & & a_{n,n-1} & a_{n,n}
\end{bmatrix}
\qquad (2.1)
\]

\[
d = \min_i \frac{|A_{i,i}|}{\displaystyle\sum_{j \neq i} |A_{i,j}|}
\qquad (2.2)
\]

Tridiagonal linear systems are considered diagonally dominant when d > 1 in Equation 2.2. Solving the system refers to finding the solution vector x from the matrix A and a set of known right-hand-side vectors y, as seen in Equation 2.3.

A · x = y (2.3)
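Assuming the common storage scheme of three coefficient arrays (sub-diagonal a, main diagonal b and super-diagonal c), Equation 2.2 reduces for a tridiagonal matrix to a simple scan over the rows. The following C helper is a minimal sketch; the function name is illustrative.

#include <math.h>
#include <float.h>

/* Dominance measure d of Equation 2.2 for a tridiagonal matrix stored as
 * sub-diagonal a[1..n-1], main diagonal b[0..n-1] and super-diagonal
 * c[0..n-2]. The system is diagonally dominant when the result is > 1. */
float dominance_factor(const float *a, const float *b, const float *c, int n)
{
    float d = FLT_MAX;
    for (int i = 0; i < n; ++i) {
        float off = 0.0f;
        if (i > 0)     off += fabsf(a[i]);
        if (i < n - 1) off += fabsf(c[i]);
        if (off == 0.0f)          /* a 1x1 system is trivially dominant */
            continue;
        float r = fabsf(b[i]) / off;
        if (r < d) d = r;
    }
    return d;
}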

For non-singular diagonally dominant tridiagonal systems, a special form of non-pivoting Gaussian elimination called the Thomas algorithm [33], more commonly known as the TDMA, can perform LU decomposition and back substitution in Θ(n) operations. The TDMA provides good performance when solving many small tridiagonal linear systems concurrently. However, as the TDMA is inherently serial in execution, it fails to scale well in highly parallel computing environments. More advanced parallel methods must be applied if the problem requires solving fewer, larger systems. Many parallel algorithms exist for solving tridiagonal and block-tridiagonal linear systems and are implemented in well established numerical libraries [34, 35, 36, 37].

Diagonally dominant tridiagonal linear systems arise in many scientific, visualisation, engineering and economic fields, for example in cubic spline interpolation [38], depth-of-field calculations in 3D graphics [39], implicit finite difference schemes [40] and computational fluid dynamics [41]. The process of solving such systems is a common enough occurrence that it makes a worthwhile use-case for investigating FPGAs as a device for accelerating scientific computations compared to CPUs and GPUs, as well as the potential to exploit heterogeneous combinations thereof.

2.6 Algorithms to Solve Tridiagonal Linear Systems

Extensive research has been conducted in optimising tridiagonal linear system solver algorithms. This section will give an overview of common serial and parallel algorithms that have been developed and the relative computational cost required to implement the algorithm in hardware.

2.6.1 Thomas Algorithm (TDMA)

The Thomas algorithm or TDMA is a modified version of Gaussian Elimination without partial pivoting for solving diagonally dominant tridiagonal linear systems [33]. It consists of a forward sweep where, without pivoting, the matrix A of A · x = y is decomposed using LU factorisation to A = LU and L · d = y is solved. For the back-substitution phase, the solution vector is recovered by solving U · x = d [42]. The TDMA is inherently serial, but requires the fewest operations of all the algorithms presented here to solve a tridiagonal linear system. Assuming a tridiagonal system as per Equation 2.1, the forward sweep can be calculated with Equation 2.4 and back-substitution with Equation 2.5.

\[
\left.
\begin{aligned}
a'_{i,i-1} &= \frac{a_{i,i-1}}{a'_{i-1,i-1}} \\
a'_{i,i}   &= a_{i,i} - a'_{i,i-1} \cdot a_{i-1,i} \\
y'_i       &= y_i - a'_{i,i-1} \cdot y'_{i-1}
\end{aligned}
\right\}
\quad i = 2, \ldots, n
\qquad (2.4)
\]

 0 yi xi = 0 i = n ai,i 0 (2.5) yi−ai,i+1·xi+1 xi = 0 i = n − 1,..., 1 ai,i The TDMA has been implemented widely on CPU, GPU and FPGA hardware[42, 40, 15, 25]

2.6.2 Recursive Doubling

Similar to Parallel Cyclic Reduction (PCR), Recursive Doubling (RD), proposed by Stone [43] and later reformulated by Egecioglu et al. [44], expresses the unknowns as a multiplication chain of matrices that can be evaluated in parallel using the scan primitive, a parallel primitive originally developed for vector machines. A downside to this algorithm is that numerical accuracy may be lost even for diagonally dominant matrices, since the algorithm requires a divide operation with the upper diagonal as the divisor [42].

2.6.3 Cyclic Reduction and Parallel Cyclic Reduction

A widely implemented parallel algorithm for solving tridiagonal linear systems, Cyclic Reduction (CR), first proposed by Hockney [45], works by recursively eliminating the even (or odd) indexed unknowns of the linear system in a forward phase until two unknowns remain. These unknowns are then solved directly and the full solution is obtained via back substitution. Assuming a tridiagonal linear system as per Equation 2.1, the forward phase of CR is given by Equation 2.6, and the back substitution phase by Equation 2.7.

\[
\begin{aligned}
\alpha &= \frac{a_{i,i-1}}{a_{i-\mathrm{stride},\,i-\mathrm{stride}}},
&\beta &= \frac{a_{i,i+1}}{a_{i+\mathrm{stride},\,i+\mathrm{stride}}}, \\
a'_{i,i-1} &= -\alpha\, a_{i-\mathrm{stride},\,i-\mathrm{stride}-1},
&a'_{i,i+1} &= -\beta\, a_{i+\mathrm{stride},\,i+\mathrm{stride}+1}, \\
a'_{i,i} &= a_{i,i} - \alpha\, a_{i-\mathrm{stride},\,i-\mathrm{stride}+1} - \beta\, a_{i+\mathrm{stride},\,i+\mathrm{stride}-1}, \\
y'_i &= y_i - \alpha\, y_{i-\mathrm{stride}} - \beta\, y_{i+\mathrm{stride}}
\end{aligned}
\qquad (2.6)
\]

\[
x_i = \frac{y'_i - a'_{i,i-1}\, x_{i-\mathrm{stride}} - a'_{i,i+1}\, x_{i+\mathrm{stride}}}{a'_{i,i}}
\qquad (2.7)
\]

where stride grows with each iteration of the forward phase as stride = 2^iter for iter = 0, 1, . . . , log2(n), and n is the size of the system.

CR was further parallelised into the PCR algorithm [46]. It is an adaptation of CR where only the forward phase (Equation 2.6) is performed, eliminating both odd and even indices simultaneously. This reduces the number of solving steps required but increases the amount of work per step. The CR and PCR algorithms have been widely implemented on GPU hardware as the parallel operations map well to the SIMD architecture [47, 48, 49]. Performance is good when the entire input matrix can fit in the local memory cache and the application requires solving multiple such small systems [50, 51]. However, the algorithm loses numerical accuracy for large linear systems due to multiple divide operations leaving progressively smaller residual terms. In some cases the CR algorithm is used as a pre-conditioner of an iterative method, or as a first stage of a multistage or hybrid algorithm [47, 48, 51]. Zhang et al. [50] profiled CR, PCR, RD and combinations of the three when implemented on GPU hardware. Zhang et al. [8] later suggested decomposing the matrix into multiple discrete sub-matrices. The sub-matrices are then solved with a direct method like the TDMA.
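As a sketch of how one PCR reduction step (Equation 2.6) maps to a data-parallel kernel, the hypothetical OpenCL C kernel below assigns one Work Item per row and double-buffers the coefficients between steps; rows outside the system are treated as identity rows. The host would enqueue this kernel log2(n) times, doubling stride after each launch and swapping the input and output buffers. This is an illustration of the algorithm only, not the kernel developed in chapter 3.

/* One PCR forward step. a = lower, b = diagonal, c = upper, y = RHS;
 * an, bn, cn, yn receive the updated coefficients for the next step. */
__kernel void pcr_step(__global const float *a, __global const float *b,
                       __global const float *c, __global const float *y,
                       __global float *an, __global float *bn,
                       __global float *cn, __global float *yn,
                       const int n, const int stride)
{
    int i  = get_global_id(0);
    int im = i - stride;
    int ip = i + stride;

    /* Rows outside the system behave as identity rows (b = 1, a = c = y = 0). */
    float am = (im >= 0) ? a[im] : 0.0f, bm = (im >= 0) ? b[im] : 1.0f;
    float cm = (im >= 0) ? c[im] : 0.0f, ym = (im >= 0) ? y[im] : 0.0f;
    float ap = (ip <  n) ? a[ip] : 0.0f, bp = (ip <  n) ? b[ip] : 1.0f;
    float cp = (ip <  n) ? c[ip] : 0.0f, yp = (ip <  n) ? y[ip] : 0.0f;

    float alpha = a[i] / bm;   /* eliminates the coupling to row i - stride */
    float beta  = c[i] / bp;   /* eliminates the coupling to row i + stride */

    an[i] = -alpha * am;
    cn[i] = -beta  * cp;
    bn[i] = b[i] - alpha * cm - beta * ap;
    yn[i] = y[i] - alpha * ym - beta * yp;
}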

2.6.4 SPIKE Algorithm

The SPIKE algorithm [52] is a poly-algorithm that uses domain decomposition techniques to partition a tridiagonal, or any banded, matrix into mutually independent subsystems. These subsystems can then be solved in parallel. For the linear system in Equation 2.1 we can partition the system into p partitions of m elements, where k = (1, 2, . . . , p), to give a main diagonal partition A_k (Equation 2.8), off-diagonal partitions B_k and C_k, and the right hand side partition Y_k (Equation 2.8).

\[
A_k =
\begin{bmatrix}
a_{i,j} & a_{i,j+1} & & & \\
a_{i+1,j} & a_{i+1,j+1} & a_{i+1,j+2} & & \\
& \ddots & \ddots & \ddots & \\
& & a_{i+m-2,j+m-3} & a_{i+m-2,j+m-2} & a_{i+m-2,j+m-1} \\
& & & a_{i+m-1,j+m-2} & a_{i+m-1,j+m-1}
\end{bmatrix}
\]

\[
[\,B_k,\; C_k,\; Y_k\,] =
\begin{bmatrix}
0 & a_{mk+1,\,m(k-1)} & y_{mk} \\
\vdots & \vdots & \vdots \\
a_{m(k+1)-1,\,m(k+1)} & 0 & y_{m(k+1)}
\end{bmatrix}
\qquad (2.8)
\]

The coefficient matrix partitions are factorised so that A = DS, where D is the block diagonal matrix of the partitions and S is the spike matrix, as seen in Equation 2.9.

\[
DS =
\begin{bmatrix}
A_0 & & & & \\
& A_1 & & & \\
& & \ddots & & \\
& & & A_{p-2} & \\
& & & & A_{p-1}
\end{bmatrix}
\cdot
\begin{bmatrix}
I & V_0 & & & \\
W_1 & I & V_1 & & \\
& \ddots & \ddots & \ddots & \\
& & W_{p-2} & I & V_{p-2} \\
& & & W_{p-1} & I
\end{bmatrix}
\qquad (2.9)
\]

where V_j = (A_j)^{-1} B_j for j = 0, . . . , p − 2 and W_j = (A_j)^{-1} C_j for j = 1, . . . , p − 1. By first solving DF = Y, the solution can be retrieved by solving SX = F. As SX = F is the same size as the original system, solving for X can be simplified by first extracting a reduced system of the boundary elements between partitions to form \hat{S}\hat{X} = \hat{F}, as seen in Equation 2.10, where t and b denote the top-most and bottom-most elements of the partition.

\[
\hat{S}\hat{X} =
\begin{bmatrix}
1 & 0 & V_0^t & & & & & & \\
0 & 1 & V_0^b & & & & & & \\
0 & W_1^t & 1 & 0 & & & & & \\
0 & W_1^b & 0 & 1 & & & & & \\
& & & & \ddots & & & & \\
& & & & & 1 & 0 & V_{p-2}^t & \\
& & & & & 0 & 1 & V_{p-2}^b & \\
& & & & & 0 & W_{p-1}^t & 1 & 0 \\
& & & & & 0 & W_{p-1}^b & 0 & 1
\end{bmatrix}
\begin{bmatrix}
X_0^t \\ X_0^b \\ X_1^t \\ X_1^b \\ \vdots \\ X_{p-2}^t \\ X_{p-2}^b \\ X_{p-1}^t \\ X_{p-1}^b
\end{bmatrix}
=
\begin{bmatrix}
F_0^t \\ F_0^b \\ F_1^t \\ F_1^b \\ \vdots \\ F_{p-2}^t \\ F_{p-2}^b \\ F_{p-1}^t \\ F_{p-1}^b
\end{bmatrix}
= \hat{F}
\qquad (2.10)
\]

The reduced system \hat{S} is a sparse banded matrix of size 2p × 2p and has a bandwidth of 2. Polizzi et al. [52] proposed strategies to handle solving the reduced system. The truncated-SPIKE algorithm states that for a diagonally dominant system where d > 1 (Equation 2.2), the reduced SPIKE partitions V_j^t and W_j^b can be set to zero [52]. This truncated reduced system takes the form of the p − 1 independent systems seen in Equation 2.11, which can be solved easily using direct methods. With \hat{X} computed, the remaining values of X can be found with perfect parallelism using Equation 2.12.

\[
\hat{S}_j \hat{X}_j =
\begin{bmatrix}
1 & V_j^b \\
W_{j+1}^t & 1
\end{bmatrix}
\begin{bmatrix}
X_j^b \\ X_{j+1}^t
\end{bmatrix}
=
\begin{bmatrix}
F_j^b \\ F_{j+1}^t
\end{bmatrix}
= \hat{F}_j,
\qquad j = 0, \ldots, p-2
\qquad (2.11)
\]

    0  t A1X1 = F1 −   b1x2  I  m       0 I  t m b AjXj = Fj −   bjxj+1 −   cjxj−1 j = 2, . . . , p − 1 (2.12)  I 0  m     I  m b ApXp = Fp −   cpxp−1  0 Literature Review 21

The SPIKE algorithm has been shown to be an effective method for decomposing massive matrices whilst remaining numerically stable and demanding little memory overhead [42]. The SPIKE algorithm has been implemented with good results to solve banded linear systems using CPUs, GPUs and heterogeneous combinations of both, often using vendor specific programming paradigms [53]. Wang et al. [54] presented a scalable SPIKE implementation targeting CPUs and GPUs in a clustered HPC environment and were able to demonstrate good computation and communication efficiency solving massive diagonally dominant linear systems.
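To make the truncated method concrete, the following serial C sketch walks through the three phases of Equations 2.9–2.12 for a system pre-partitioned into p blocks of m rows, reusing a TDMA solve for each partition. Function and variable names are illustrative, the per-partition solves that would be distributed across devices in practice are simply executed sequentially here, and this is not the oclspkt routine described in chapter 4.

#include <stdlib.h>
#include <string.h>

/* TDMA as sketched in subsection 2.6.1: b and y are overwritten,
 * y returns the solution of the m x m block system. */
static void tdma(const float *a, float *b, const float *c, float *y, int m)
{
    for (int i = 1; i < m; ++i) {
        float f = a[i] / b[i - 1];
        b[i] -= f * c[i - 1];
        y[i] -= f * y[i - 1];
    }
    y[m - 1] /= b[m - 1];
    for (int i = m - 2; i >= 0; --i)
        y[i] = (y[i] - c[i] * y[i + 1]) / b[i];
}

/* Solve one partition A_k x = rhs using fresh copies of b and rhs. */
static void block_solve(const float *a, const float *b, const float *c,
                        const float *rhs, float *x, int m)
{
    float *bt = malloc(m * sizeof(float));
    memcpy(bt, b, m * sizeof(float));
    memcpy(x, rhs, m * sizeof(float));
    tdma(a, bt, c, x, m);
    free(bt);
}

/* Truncated SPIKE for a diagonally dominant tridiagonal system of n = p*m
 * rows stored as a (sub), b (diag), c (super), y (RHS); solution -> x. */
void spike_truncated(const float *a, const float *b, const float *c,
                     const float *y, float *x, int p, int m)
{
    float *f   = malloc(p * m * sizeof(float)); /* F = D^{-1} Y            */
    float *vb  = malloc(p * sizeof(float));     /* right spike tips V_k^b  */
    float *wt  = malloc(p * sizeof(float));     /* left spike tips  W_k^t  */
    float *xb  = malloc(p * sizeof(float));     /* interface values X_k^b  */
    float *xt  = malloc(p * sizeof(float));     /* interface values X_k^t  */
    float *rhs = malloc(m * sizeof(float));
    float *spk = malloc(m * sizeof(float));

    /* Phase 1: solve each partition and its spike tips independently. */
    for (int k = 0; k < p; ++k) {
        const float *ak = a + k * m, *bk = b + k * m, *ck = c + k * m;
        block_solve(ak, bk, ck, y + k * m, f + k * m, m);
        if (k < p - 1) {                       /* V_k^b from Equation 2.9 */
            memset(rhs, 0, m * sizeof(float));
            rhs[m - 1] = c[k * m + m - 1];
            block_solve(ak, bk, ck, rhs, spk, m);
            vb[k] = spk[m - 1];
        }
        if (k > 0) {                           /* W_k^t from Equation 2.9 */
            memset(rhs, 0, m * sizeof(float));
            rhs[0] = a[k * m];
            block_solve(ak, bk, ck, rhs, spk, m);
            wt[k] = spk[0];
        }
        xb[k] = f[k * m + m - 1];
        xt[k] = f[k * m];
    }

    /* Phase 2: p-1 independent 2x2 interface systems (Equation 2.11). */
    for (int j = 0; j < p - 1; ++j) {
        float det = 1.0f - vb[j] * wt[j + 1];
        float fb = f[j * m + m - 1], ft = f[(j + 1) * m];
        xb[j]     = (fb - vb[j] * ft) / det;
        xt[j + 1] = (ft - wt[j + 1] * fb) / det;
    }

    /* Phase 3: recover the interior of each partition (Equation 2.12). */
    for (int k = 0; k < p; ++k) {
        memcpy(rhs, y + k * m, m * sizeof(float));
        if (k < p - 1) rhs[m - 1] -= c[k * m + m - 1] * xt[k + 1];
        if (k > 0)     rhs[0]     -= a[k * m]         * xb[k - 1];
        block_solve(a + k * m, b + k * m, c + k * m, rhs, x + k * m, m);
    }

    free(f); free(vb); free(wt); free(xb); free(xt); free(rhs); free(spk);
}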

2.6.5 Conjugate Gradient Algorithm

The CG algorithm is a popular iterative method for solving sparse linear systems that are symmetric and positive definite. The algorithm generates a series of conjugate vectors, giving the numerical method its name. These vectors are the gradients of a quadratic function, and minimising this function is equivalent to solving the linear system [55].

The CG algorithm has been implemented in FPGA, CPU and GPU hardware on numerous occasions. As with the previously mentioned research from Junqing et al. [9] and Angerer et al. [26], Lui et al. [56] present a highly scalable GPU implementation of the Bi-Conjugate Gradient Stabilised algorithm, a variant of the CG algorithm, for arbitrarily banded systems. The authors use row-based domain decomposition in order to partition the matrix for distribution across GPUs and compute nodes and test their implementation and its scalability across 6–192 GPUs.

2.7 Heterogeneous Linear System Solvers

In this thesis, heterogeneous algorithmic implementations are implementations that use more than one type of computational device by either:

1. computing the same algorithm on different devices for separate par- titions of the input data, or

2. computing different parts of the algorithm on different devices with non-partitioned input data.

For item 1 heterogeneous implementations, Chang et al. [53] presented a scalable GPU optimised SPIKE algorithm implementation that can incorporate CPUs in a heterogeneous computing environment. Wang et al. [54] built on this work with their SPIKE2 algorithm that used multiple levels of the SPIKE domain decomposition to allocate dynamic partitions of work to CPUs and GPUs. Both implementations focus on the novelty of GPU algorithm optimisation and use Intel MKL library solvers for the work allocated to the CPU. To date, there have been no heterogeneous tridiagonal linear systems solvers described in the literature that incorporate FPGA technology alongside CPUs or GPUs.

Regarding heterogeneous linear system solvers as described in item 2 that use FPGAs to solve non-tridiagonal linear systems, Junqing et al. [9] implement a limited precision direct solver on FPGA hardware as a pre-conditioner to a high-precision iterative CG solver on a CPU. The solver is limited to on-chip FPGA BRAM, and the devices are not utilised in parallel, but sequentially. In contrast, Angerer et al. [26] implement a low precision FPGA iterative CG solver as a pre-conditioner to a high precision GPU CG solver. In this implementation the CPU is utilised as a work-flow coordinator for the GPU and FPGA. Full utilisation of both devices is achieved by pipelining the solving of multiple Right Hand Side (RHS) vectors with a common coefficient matrix. In this case the input matrix size is limited by the on board RAM of both the FPGA and GPU devices. It is interesting to note that the incorporation of the FPGA reduces the energy required to solve an RHS by approximately a third when compared to the ATLAS CPU solver alone [26].

2.8 Summary Analysis and Relevance to this Work

This chapter provided background information required to give context to this thesis and a review of literature relevant to solving diagonally dominant tridiagonal linear systems using FPGAs in a heterogeneous computing environment.

FPGAs and FPGA Linear Algebra implementations In section 2.1 and section 2.2, a background on FPGA technology and how it has been applied to implementing linear algebra routines was explored. A comparison of the different algorithms is presented in Table 2.1. In past research, comparative papers often do not test FPGA implementations against like-for-like algorithm implementations, i.e. the algorithms running on the comparison device often use high precision data types or are un-optimised. This leads to remarkable speedups for the FPGA but is not an accurate representation of the true performance of the FPGA hardware. Also, past

Algorithm   Time Complexity   Stability   Execution
TDMA        Θ(n)              Stable      Serial
RD          Θ(log n)          Unstable    Thread parallel
CR/PCR      Θ(n log n)        Stable      Thread parallel
SPIKE       Θ(n)^1            Stable      Task and thread parallel
CG          Θ(n√k)^2          Stable      Thread parallel

Table 2.1: Comparison of common algorithms to solve diagonally dominant tridiagonal linear systems.
1 Using the TDMA to factorise the spike matrices and the truncated method to solve the reduced system.
2 Assuming all elements of n are non-zero and k is the condition number of the matrix.

FPGA implementations have suffered from the maximum input problem size often being limited by on-chip or on-board memory. This, along with the implementation designs often being specific to a given FPGA, limits their usability for general scientific computing. The investigations presented in chapter 3 and chapter 4 seek to provide clear and fair comparisons between each device targeted by optimising code bases and comparing to leading industry routines. In both investigations all implementation and comparison testing is done using FP32, so as to exploit the maximum parallelism of each device architecture whilst maintaining suitable numerical accuracy. Further, in chapter 4 the implementation is designed to work for an input system of any size, and will dynamically load balance the input for all available computation devices.

OpenCL and Heterogeneous Computing OpenCL, and how it can be used to implement heterogeneous computational routines, was covered in section 2.3 and section 2.4. The ability to target different accelerator devices in a heterogeneous computing environment was explored, along with the complexities encountered when writing optimised code targeting CPUs, GPUs and FPGAs. In chapter 3, a single portable OpenCL implementation of two algorithms executed on target GPUs and FPGAs is presented to investigate the portability of OpenCL. We further investigate heterogeneous computing with OpenCL in chapter 4 by implementing algorithmic blocks that are then arranged in an optimised way to suit the underlying computational architecture.

Tridiagonal Linear Systems and Algorithms to solve them. In section 2.5 and section 2.6 the tridiagonal matrix and several algorithms to solve systems of tridiagonal linear equations were defined. The quickest method of solving such systems in terms of operation count, the TDMA, is inherently serial and well understood, but can perform poorly when compared to parallel algorithms implemented on multi-core devices. In the literature, only iterative methods like the CG algorithm have been explored thoroughly as parallel linear system solvers implemented on FPGA hardware. As such, the investigations presented in this document focus on direct parallel methods, specifically the PCR and SPIKE algorithms. The PCR algorithm is chosen primarily because the algorithm is well understood, has been implemented numerous times on GPU hardware, and does not suffer from the thread-dependent branching logic that is inherent in the CR algorithm. The SPIKE algorithm is chosen for its ability to decompose the problem domain for computation on different processors, devices or computers with little shared memory overhead. This property makes it ideal for a heterogeneous tridiagonal linear system solver implementation.

Heterogeneous Tridiagonal Linear System Solvers. Finally, in section 2.7 heterogeneous tridiagonal linear system solver implementations targeting CPUs, GPUs, FPGAs and combinations thereof are reviewed. We found that no heterogeneous tridiagonal linear system solver utilising FPGA hardware exists. However, FPGAs have been used in heterogeneous linear system solvers for systems that are not tridiagonal in nature. For these solvers the FPGA is part of an algorithmic pipeline rather than a standalone accelerator. The investigation in chapter 4 looks to fill this research gap by using the available devices (CPUs, GPUs and FPGAs) on a computer to accelerate the SPIKE algorithm solving a diagonally dominant tridiagonal linear system of any size. Chapter 3

Investigation 1: Portable Solvers of Small Tridiagonal Systems

Preface

In this chapter, the paper titled "Implementation of parallel tridiagonal solvers for a heterogeneous computing environment" is presented. The impetus behind this paper was to design naive, portable OpenCL implementations of parallel tridiagonal linear system solvers. The paper targets only GPU and FPGA hardware as a discovery phase of the overall research. For ease of design, the implementations target small tridiagonal systems that can be stored completely in each device's local memory. The parallel tridiagonal solvers PCR and SPIKE were implemented on an FPGA and a GPU and compared against other OpenCL and CUDA implementations. Both PCR and SPIKE OpenCL implementations were designed initially for GPU hardware for functional correctness. The functional GPU kernel code was then compiled and built for the FPGA, changing only pre-processor definitions that control the number of Compute Units instantiated and the matrix size to be solved. The PCR implementations ported well between the GPU and the FPGA as they were comprised of one OpenCL kernel with a single fixed-iteration for-loop. This design worked well for both the GPU and FPGA. The fixed-iteration for-loop in the PCR kernel allowed the FPGA compiler to aggressively optimise the hardware utilisation, allowing multiple Compute Units and Processing Elements to be instantiated.


Further, the lack of branching logic suited the SIMD architecture of the GPU well. The SPIKE algorithm, however, was logically split into three separate kernels for the factorisation, reduced system solve, and back substitution steps. While this worked well for the GPU, only one kernel could be instantiated at a time on the FPGA due to hardware constraints. This introduces a reprogramming delay in the FPGA-SPIKE solver where the FPGA needs to be reconfigured between algorithm steps and consequently reduces compute performance. One further thing to note is that the FPGA results presented in this paper are based on simulation timings due to FPGA hardware issues near the time of publication. Note also that at the time of publication the FPGA manufacturer Intel FPGA was still known as Altera.


Implementation of parallel tridiagonal solvers for a heterogeneous computing environment

Abstract

Tridiagonal diagonally dominant linear systems arise in many scientific and engineering applications. The standard Thomas algorithm for solving such systems is inherently serial forming a bottleneck in computation. Algo- rithms such as cyclic reduction and spike reduce a single large tridiagonal system into multiple small independent systems which can be solved in parallel. We have developed portable cyclic reduction and spike algorithm Open Computing Language (Opencl) implementations with the intent to target a range of co-processors in a heterogeneous computing environment including Field Programmable Gate Arrays (fpgas), Graphics Processing Units (gpus) and other multi-core processors. We evaluate these designs in the context of solver performance, resource efficiency and numerical ac- curacy.

3.1 Introduction

Tridiagonal and block tridiagonal linear systems arise in many computational applications. Some applications involve the solving of many small independent systems whilst others require solving fewer large systems [52, 15]. The focus of this article is the case when the linear system is strictly diagonally dominant. As High Performance Computing (hpc) platforms become more parallel, specialised and heterogeneous, the requirement for parallel algorithms to be portable as well as efficient is increasing. We have developed an Open Computing Language (Opencl) implementation of two well known parallel tridiagonal solvers. Our implementation is portable and capable of executing on a range of co-processor environments. Of particular interest in this work are field programmable gate arrays (fpgas). These are reconfigurable computing devices that consist of an interconnected array of configurable logic blocks (clbs) and memory modules in the form of distributed block rams (brams). The clbs themselves are simple units built from look-up tables (luts), logic gates and multiplexers (muxs). clbs can be configured and routed at run time to define custom processor architectures.

Traditionally, fpgas were mainly used in embedded devices or for hardware prototyping. Recently the floating point performance of fpgas has reached a point where they have become much more relevant to the hpc arena, particularly with the approach of exascale machines [57, 58, 59]. As a result, fpga-based linear algebra architecture design is an area of active research [40, 60, 15, 11]. The biggest barrier to effective use of fpgas has always been the difficulty in programming them. Until recently, this has entailed a hardware design process rather than software development. However, advances have been made with both major fpga vendors releasing Opencl compliant boards which produce comparable results to manual hardware design [22, 23]. Using the Altera1 based Opencl, Warne et al. demonstrated the ease with which a custom tridiagonal linear system solver can be deployed [15, 25]. In this work, we aim to extend our previous efforts towards a more general highly parallel solution targeting fpgas in particular, but also other Opencl compliant co-processors that may be present within a heterogeneous computing environment.

3.2 Background

3.2.1 Tridiagonal linear systems

A linear system Ax = b is tridiagonal if the coefficient matrix A ∈ R^{n×n} is banded with bandwidth β = 1. That is,

A = \begin{pmatrix}
a_{1,1} & a_{1,2} & & & \\
a_{2,1} & a_{2,2} & a_{2,3} & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n} \\
 & & & a_{n,n-1} & a_{n,n}
\end{pmatrix}

Considering strictly diagonally dominant systems only, it is well known that the lu-decomposition can be performed in Θ(n) operations using the Thomas algorithm [33]. Although Θ(n) is a vast improvement on a fully dense lu-decomposition, the Thomas algorithm is inherently serial. As a result the high performance of modern highly parallel computing architectures cannot be directly exploited.

1 Since publication, Altera is now known as Intel FPGA.

For applications involving many small independent systems it is easy to achieve parallel computation using a standard Thomas algorithm. However, more advanced, inherently parallel methods must be applied if the problem requires solving fewer large systems. Specialised architectures for specifically solving tridiagonal linear systems have been designed using field programmable gate arrays [40, 15]. However, these systems were built with very specific uses in mind.

3.2.2 Heterogeneous computing with OpenCL

The ideal parallel approach depends greatly on the computing devices available to the program. hpc platforms are becoming more reliant on specialist co-processors, each with their own programming models. This naturally makes designing a general, portable solution difficult. Opencl is an open standard for heterogeneous computing [1]. Implementations of the Opencl standard have been developed by vendors of a wide range of computing devices including many-core cpus, graphics processing units (gpus), digital signal processors (dsps) and recently fpgas. These implementations enable a developer to build a portable computational routine which will execute on any of the available resources. The Opencl programming model divides program code into two partitions: 1) serial code to be executed on a host processor and 2) parallel code to be executed on any number of co-processor devices. Each device is treated as an array of compute units which can operate asynchronously, each of which is further divided into synchronised processing elements.
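As a minimal illustration of this two-part model (device-side code only; the kernel name and arguments are our own and not taken from the solvers discussed later), each processing element executes one instance of the kernel body for one index of the enqueued range, while serial host code creates the buffers and enqueues the kernel.

/* Minimal illustrative OpenCL C device kernel: one work item per element. */
__kernel void scale_add(__global const float *x,
                        __global const float *y,
                        __global float *out,
                        const float alpha)
{
    size_t i = get_global_id(0);    /* this work item's position in the range */
    out[i] = alpha * x[i] + y[i];
}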

3.3 Parallel tridiagonal linear systems solvers

Many parallel algorithms exist for solving tridiagonal and block-tridiagonal linear systems and are implemented in well established numerical libraries such as Scalapack [34, 35, 36, 37]. In this section, we briefly present two parallel tridiagonal solver schemes which were the focus of gpu implementations, namely, cyclic reduction [61] and spike [52].

3.3.1 Parallel cyclic reduction

The method of cyclic reduction splits a single tridiagonal system into two smaller independent tridiagonal systems [61]. This is done by transforming the ith equation,

a_{i,i-1} x_{i-1} + a_{i,i} x_i + a_{i,i+1} x_{i+1} = b_i

into,

\alpha_i a_{i-1,i-2} x_{i-2} + (a_{i,i} + \alpha_i a_{i-1,i} + \beta_i a_{i+1,i}) x_i + \beta_i a_{i+1,i+2} x_{i+2} = b_i + \alpha_i b_{i-1} + \beta_i b_{i+1}

where \alpha_i = -a_{i,i-1}/a_{i-1,i-1} and \beta_i = -a_{i,i+1}/a_{i+1,i+1}. The result of cyclic reduction is two independent tridiagonal linear systems, one containing equations with even indexed unknowns and another containing equations with odd indexed unknowns. This transformation can be applied recursively to any number of independent systems.

3.3.2 SPIKE

Another parallel method is the so-called spike algorithm [52]. In this method, the coefficient matrix is factorised as A = DS where D = diag(A_1, A_2, ..., A_p), with A_j the diagonal blocks of A, and S has the form,

S = \begin{pmatrix}
I & V_1 & & & \\
W_2 & I & V_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & W_{p-1} & I & V_{p-1} \\
 & & & W_p & I
\end{pmatrix}   (3.1)

where p is the number of partitions, m is the number of linear equations per partition, and n is the total number of linear equations in the system, so n = pm. Here W_i, V_i ∈ R^{m×1} are given by solving,

A_i [V_i, W_i] = \begin{pmatrix}
0 & a_{im+1,im} \\
\vdots & \vdots \\
a_{im,im+1} & 0
\end{pmatrix}   (3.2)

Factorising A in this way allows the original Ax = b to be solved by first solving Dy = b as p independent tridiagonal systems and then solving Sx = y. The second step can also be transformed into p independent systems in Θ(p) operations.


3.3.3 Implementation using OpenCL

For parallel cyclic reduction (pcr), the kernel code developed in this work defines the process of solving for a single unknown. The ith processing element will solve for x_i. A single linear system is loaded from "off-chip" global memory into fast local memory which is shared by all processing elements within a compute unit. Each x_i is then solved simultaneously as shown in the pseudo-code in Algorithm 1.

Algorithm 1 : pcr compute kernel logic

[iG, iL] ← get_ids  {work group and work item index}
Uc, Dc, Lc, bc ∈ R^n  {shared local cache of system iG}

s ← 1
{Solve for x_iL in r recursive steps}
for all i ∈ [1, · · · , r] do
    α ← −Lc[iL] / Dc[iL − s],   β ← −Uc[iL] / Dc[iL + s]
    U′ ← β Uc[iL + s],   L′ ← α Lc[iL − s]
    D′ ← Dc[iL] + α Uc[iL − s] + β Lc[iL + s]
    b′ ← bc[iL] + α bc[iL − s] + β bc[iL + s]
    {Synchronise all work items}
    [Uc[iL], Dc[iL], Lc[iL], bc[iL]] ← [U′, D′, L′, b′]
    s ← 2s
end for
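To make Algorithm 1 concrete, a minimal Opencl C sketch of such a kernel is given below. This is an illustration only, not the exact kernel evaluated here: it assumes one work group per system, one work item per unknown, a compile-time system size N set via a pre-processor definition, r = ceil(log2(N)) reduction steps supplied by the host, and out-of-range neighbour coefficients treated as zero.

#ifndef N
#define N 512   /* system size; one work group of N work items per system */
#endif

__kernel void pcr_solve(__global const float *L, __global const float *D,
                        __global const float *U, __global const float *b,
                        __global float *x, const int r)
{
    const int iG = get_group_id(0);   /* which system (work group index) */
    const int iL = get_local_id(0);   /* which unknown (work item index) */

    __local float Lc[N], Dc[N], Uc[N], bc[N];

    /* Load this work group's system from global into shared local memory. */
    Lc[iL] = L[iG * N + iL];
    Dc[iL] = D[iG * N + iL];
    Uc[iL] = U[iG * N + iL];
    bc[iL] = b[iG * N + iL];
    barrier(CLK_LOCAL_MEM_FENCE);

    int s = 1;
    for (int i = 0; i < r; ++i) {
        /* Out-of-range neighbours contribute nothing (alpha or beta = 0). */
        float alpha = (iL - s >= 0) ? -Lc[iL] / Dc[iL - s] : 0.0f;
        float beta  = (iL + s <  N) ? -Uc[iL] / Dc[iL + s] : 0.0f;

        float Un = (iL + s <  N) ? beta  * Uc[iL + s] : 0.0f;
        float Ln = (iL - s >= 0) ? alpha * Lc[iL - s] : 0.0f;
        float Dn = Dc[iL]
                 + ((iL - s >= 0) ? alpha * Uc[iL - s] : 0.0f)
                 + ((iL + s <  N) ? beta  * Lc[iL + s] : 0.0f);
        float bn = bc[iL]
                 + ((iL - s >= 0) ? alpha * bc[iL - s] : 0.0f)
                 + ((iL + s <  N) ? beta  * bc[iL + s] : 0.0f);

        barrier(CLK_LOCAL_MEM_FENCE);   /* all reads complete before overwrite */
        Uc[iL] = Un; Dc[iL] = Dn; Lc[iL] = Ln; bc[iL] = bn;
        barrier(CLK_LOCAL_MEM_FENCE);   /* all writes visible for the next step */

        s *= 2;
    }

    /* After r steps each equation is decoupled and can be solved directly. */
    x[iG * N + iL] = bc[iL] / Dc[iL];
}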

At the present stage of development, our spike implementation uses a direct mapping of a processing element to the work required to factorise the pth partition of A. With this allocation of workload to processors, a number of processors will operate in parallel on each compute unit. There is a sequential bottleneck of Θ(p) in reducing the system Sx = y into a set of parallel back-substitutions. The final solution is then recovered through parallel back-substitution in the same manner as the factorisation step.

3.4 Evaluation

One of the main aims of this work is the exploration and development of efficient parallel tridiagonal solvers which take advantage of a range of possible compute platforms. To assess the feasibility of such a design, we specifically focused on resource utilisation, compute performance and numerical accuracy. In this section, our implementations of pcr and spike are evaluated against pre-existing implementations targeting nvidia gpus. Not only do the results show that our designs are comparable when running on nvidia hardware, but they also indicate that the fpga platform is a promising alternative.

Table 3.1: Logic utilisation percentage of optimum configuration for the Altera Stratix v fpga.

Design                   Compute Units   Processing Elements   Block-ram   Lookup Tables
pcr                      8               2                     92%         99%
spike factorise          1               4                     92%         99%
spike partition Sx = y   1               16                    43%         46%
spike back-solve         1               8                     22%         30%

3.4.1 FPGA resource utilisation

Unlike fixed architectures such as gpus and cpus, Opencl implementations targeting fpgas allow the number of compute units and processing elements to be varied. However, the specific choices made for these will affect both algorithm performance and utilisation of resources on the fpga configurable logic fabric. We tuned the combination of both compute units and processing elements for maximum throughput for both the pcr and spike implementations. The optimal configurations are given in Table 3.1. The estimated fpga resource utilisation percentages are also included in the table, and these were generated for the target Altera Stratix v architecture using the Altera Opencl software development kit (sdk). The percentages of consumed fpga logic resources (i.e., block-ram and lookup tables) are within the total amounts available and show that each design implementation on the target architecture is feasible. However, for maximum throughput the fpga cannot be configured to run all spike stages at the same time, so instead some reconfiguration between stages is necessary.
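For reference, with the Altera (now Intel fpga) Opencl sdk the number of compute units and processing elements (simd lanes) replicated for a kernel is typically requested through kernel attributes such as those sketched below; the counts and the kernel signature here are placeholders, not the tuned configuration of Table 3.1.

/* Illustrative only: attributes understood by the Altera/Intel FPGA OpenCL
 * compiler for replicating compute units and vectorising work items. */
__attribute__((num_compute_units(4)))
__attribute__((num_simd_work_items(2)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void pcr_kernel(__global const float *L, __global const float *D,
                         __global const float *U, __global const float *b,
                         __global float *x)
{
    /* ... PCR body as in Algorithm 1 ... */
}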

3.4.2 Compute performance

We are not aware of any other Opencl implementation of pcr or spike designed with an fpga as the target architecture. This makes it difficult to benchmark. As a result we have compiled our Opencl code for an nvidia gpu to aid comparison with other pcr and spike implementations, along with the performance of a sequential Thomas cpu based solution.

Figure 3.1: Compute throughput of parallel cyclic reduction implementations versus a single threaded Thomas algorithm

In addition, our estimated fpga performance is placed alongside these results. The fpga estimates are produced by the Altera offline compiler; an analysis of the hardware schematic generated through the compilation process provides the number of Opencl work items that can be executed per second. We converted the estimated throughput to system solves per second (rather than the more usual flop based measures) to ease comparison with the most relevant studies. Data transfer overheads can impede this throughput in reality [25], however, there are a number of coding practices which can assist in minimising this impact [62]. nvidia provides an Opencl implementation of pcr which is distributed with their cuda sdk. Figure 3.1 provides the throughput (Systems/sec) for our pcr implementation against the nvidia implementation for systems of size n ∈ [64, 128, 256, 512, 1024]. Our implementation out-performs nvidia's when running on an nvidia Quadro 4000 gpu. Our estimates also show that the fpga implementation will run on-par with the nvidia gpu code; considering the fpga is running at only a fraction of the clock speed (≈ 150 MHz) this is quite impressive. An analogous performance comparison was performed with our spike implementation against a cuda implementation. For the given system size range our gpu based implementation has comparable performance to Chang et al. [53]. Also, the estimated performance of our fpga design out-performs both gpu implementations by a factor of around four times. The fpga estimates are likely to be optimistic in this case as the re-configuration time required when switching between spike stages is not taken into consideration.

Figure 3.2: Compute throughput of spike implementations versus a single threaded Thomas algorithm

Although not explored here, this re-configuration time (≤ 100 ms) could be hidden in practice by, for example, processing spike stages in batches.

3.4.3 Numerical accuracy

We found both our pcr and spike implementations to be of similar accu- racy to a standard Thomas algorithm (see Table 3.2). Currently, all our analysis and test cases are based on strictly diagonally dominant matrices which do not require pivot operations. In future work, we intend to perform a more general numerical stability analysis as we extend the work to other matrix types.

Table 3.2: Numerical accuracy of pcr and spike (n = 1024).

             Thomas     pcr       spike
||x − b||∞   3.576e-7   5.36e-7   4.172e-7

3.4.4 Power utilisation

As the purpose of this study is to assess the feasibility of building efficient linear algebra implementations for a heterogeneous computing environment, we have not performed any detailed analysis of power consumption. However, the attraction of such a heterogeneous environment is strongly motivated by the potential increase in power efficiency due to the significantly (order of magnitude) lower power consumption of fpgas [57].

To put Figures 3.1 and 3.2 into context, the Altera fpga pci-e card consumes a maximum of 35 W under load. By comparison, an nvidia Quadro 4000 has a maximum power consumption of 142 W. While a detailed examination of the various tridiagonal solvers remains for future work, we have previously investigated power utilisation in the context of image processing algorithms with promising results [16].

3.5 Conclusion

In this article, we explored the feasibility of designing truly portable multi-platform high-performance linear algebra routines using Opencl. The parallel tridiagonal solvers of cyclic reduction and spike were implemented on an fpga and a gpu and compared against other Opencl and cuda implementations. Results to date indicate that in terms of accuracy and computational efficiency our designs are comparable for both fpga and gpu cases. This suggests that further investigation is warranted, and in future work we intend to continue to develop and improve upon our designs for tridiagonal solvers in the broader context of providing enhanced platforms for high performance, low power numerical routines.

Acknowledgments This project utilised the high performance computing (hpc) facility at the Queensland University of Technology (qut). The facility is administered by qut's hpc and research support group. A special thanks to the hpc group for their support, particularly in providing access to specialist fpga and gpu resources. Chapter 4

Investigation 2: An Heterogeneous Solver for Tridiagonal Systems

Preface

The paper presented in chapter 3 demonstrated that truly portable OpenCL kernels targeting GPUs and FPGAs are feasible. In particular, there was significant opportunity to further develop the SPIKE algorithm implemen- tation. The current investigation builds on the prior chapter’s work in the following ways:

• Extending beyond simulation-only appraisals of computational throughput and energy efficiency of FPGAs;

• Optimising the previously developed code base for each target device architecture whilst maintaining some code portability;

• Improving the functionality of the implementation to allow matrices of any size to be solved; and

• Addressing heterogeneous computing environments by extending the capabilities of the implementation to use any combination of available computational devices, or one device in isolation, as needed.

The paper presented in this chapter, "Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to target FPGAs, GPUs, and CPUs", will focus on one algorithm, the SPIKE.

The SPIKE algorithm has low memory-transaction overhead, it is relatively easy to decompose the problem domain, and the algorithm remains numerically stable when solving large systems. In this instance, instead of solving the reduced system directly with a pentadiagonal solver as in chapter 3, the truncated method is used for resolving the reduced system. Use of the truncated method simplifies the solving of the reduced system as the operations can be done in-place either directly after the "spike" matrices have been calculated or directly before the back-substitution phase. The operation complexity for solving the reduced system using the truncated method is Θ(1) compared to Θ(n) for the pentadiagonal solver. Using the truncated method is feasible since the systems to be addressed are sufficiently diagonally dominant to allow the solver to maintain numerical stability and negate any loss in accuracy that the method might otherwise introduce. In order to ensure the solver will be able to fit in the available FPGA hardware, we develop the OpenCL kernel for the FPGA first, and instead of focusing on portability of the kernel, several portable algorithmic blocks are devised. This allows for different kernel configurations to be developed for each device targeted. With this method, optimised truncated–SPIKE implementations were developed for the FPGA, GPU and CPU while maintaining some portability. Furthermore, the truncated–SPIKE OpenCL kernel is designed in such a way that matrices of any size can be solved with limited padding of the coefficient matrix and RHS vector. This ensures good performance metrics for problem sets of varying sizes, including those comprised of multiple small systems or a single monolithic tridiagonal linear system. In this way the solver becomes applicable to a wider range of real world applications. A 'top' level of the truncated–SPIKE algorithm is used to dynamically distribute the problem size to a single or heterogeneous combination of available compute devices based on previously recorded performance metrics. The single device solvers are tested against third party diagonally dominant tridiagonal linear system solvers. For the CPU and GPU, the implementations were tested in comparison to Intel MKL and CUDA based routines. For the FPGA, a TDMA OpenCL implementation was designed to act as a suitable benchmark. All comparison and benchmark experiments use FP32 data types across all devices to ensure like-for-like comparison.


Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to target FPGAs, GPUs, and CPUs

Abstract

Solving diagonally dominant tridiagonal linear systems is a common problem in scientific high performance computing (HPC). Furthermore it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of compute devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are increasing in priority. This work presents the oclspkt routine. The oclspkt routine is a heterogeneous OpenCL implementation of the truncated–SPIKE algorithm that can use FPGAs, GPUs and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. The routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator's compute performance. The truncated–SPIKE FPGA solver is developed first, optimising for kernel performance, device global memory bandwidth and interleaved host to device memory transactions. The kernel is then remapped to best exploit the underlying architecture of the CPU and GPU to optimise kernel performance. An optimised TDMA OpenCL kernel is also developed to act as a serial baseline performance comparison for the parallel truncated–SPIKE kernel, since no FPGA tridiagonal solver capable of solving large tridiagonal systems was available at the time of development. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150% and 170% faster respectively than comparable device-optimised third-party solvers and applicable baselines. Assessing heterogeneous combinations of compute devices, the GPU+FPGA combination is found to have the best compute performance and the FPGA-only configuration is found to have the best overall energy efficiency.

4.1 Introduction

Given the ubiquity of tridiagonal linear system problems in engineering, economic and scientific fields, it is no surprise that significant research has been undertaken to address the need for larger models and higher resolution simulations. Demand for solvers for massive linear systems that are faster and more memory efficient is ever increasing. First proposed in 1978 by Sameh et al. [63] and later refined in 2006 [52], the SPIKE algorithm is becoming an increasingly popular method for solving banded linear system problems [64, 65, 54, 53, 66]. The SPIKE algorithm has been shown to be an effective method for decomposing massive matrices whilst remaining numerically stable and demanding little memory overhead [42]. The SPIKE algorithm has been implemented with good results to solve banded linear systems using CPUs, GPUs and in CPU+GPU heterogeneous environments, often using vendor specific programming paradigms [53]. A scalable SPIKE implementation targeting CPUs and GPUs in a clustered HPC environment to solve massive diagonally dominant linear systems has previously been demonstrated with good computation and communication efficiency [54]. Whilst it is desirable to design faster implementations of parallel linear system solvers, it is necessary also to have regard for power consumption, since this is a primary barrier to exascale computing when using traditional general purpose CPU and GPU hardware [67, 68]. FPGA accelerator cards require an order of magnitude less power compared to HPC grade CPUs and GPUs. Previous efforts in developing FPGA based routines to solve tridiagonal systems have been limited to solving small systems with the serial Thomas Algorithm [15, 25, 40]. We have previously investigated the feasibility of FPGA implementations of parallel algorithms, including Parallel Cyclic Reduction and SPIKE [69], for solving small tridiagonal linear systems. This previous work utilised OpenCL to produce portable implementations to target FPGAs and GPUs. The current work again utilises OpenCL since this programming framework allows developers to target a wide range of compute devices including FPGAs, CPUs and GPUs with a unified language. OpenCL consists of C based kernel code to execute on the device and C or C++ host code to set up the environment and orchestrate memory transfers and kernel execution. The motivation for this paper is to evaluate the feasibility of utilising FPGAs, along with GPUs and CPUs concurrently, in a heterogeneous computing environment in order to accelerate solving a diagonally dominant tridiagonal linear system. In addition we aimed to develop a solution that maintained portability whilst providing an optimised code base for each target device architecture and was capable of solving large systems. As such, we present the oclspkt routine, an heterogeneous OpenCL implementation of the truncated–SPIKE algorithm that can dynamically load balance work allocated to FPGAs, GPUs, and CPUs concurrently or in isolation, in order to solve tridiagonal linear systems of any size. We evaluate the oclspkt routine in terms of computational characteristics, numerical accuracy and energy consumption. This paper is structured as follows: Section 2 provides an introduction to diagonally dominant tridiagonal linear systems and the truncated–SPIKE algorithm.
Section 3 describes the implementation of the oclspkt–FPGA OpenCL host and kernel code and the optimisation process. This is followed by the porting and optimisation of the oclspkt–FPGA kernel and host code to the GPU and CPU devices as oclspkt–GPU and oclspkt–CPU. Section 3 concludes with a discussion of the integration of the three solvers to produce the heterogeneous oclspkt solver. In Section 4, the individual solvers are compared to optimised third party tridiagonal linear system solvers. The three solvers are further compared in terms of energy efficiency, performance and numerical accuracy, in addition to an evaluation of different heterogeneous combinations of the oclspkt. Finally in Section 5 we draw conclusions from the results and discuss the implications for future work.

4.2 Background

4.2.1 Tridiagonal linear systems

A coefficient band matrix with a bandwidth of β = 1 in the linear system Ax = y is considered tridiagonal, see Equation 4.1.

A = \begin{pmatrix}
a_{1,1} & a_{1,2} & & & \\
a_{2,1} & a_{2,2} & a_{2,3} & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n} \\
 & & & a_{n,n-1} & a_{n,n}
\end{pmatrix}   (4.1)

d = \min_i \frac{|A_{i,i}|}{\sum_{j \neq i} |A_{i,j}|}   (4.2)

For non-singular diagonally dominant systems where d > 1 in Equation 4.2, a special form of non-pivoting Gaussian elimination called the Thomas algorithm [33] can solve the system in Θ(n) operations. The Thomas algorithm provides good performance when solving small tridiagonal linear systems, however since this algorithm is intrinsically serial, it fails to scale well in highly parallel computing environments. More advanced, inherently parallel methods must be applied if the problem requires solving large systems. Many parallel algorithms exist for solving tridiagonal and block-tridiagonal linear systems and are implemented in well established numerical libraries [35, 36, 37].
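For illustration, a plain C sketch of the diagonal dominance factor of Equation 4.2 and the Thomas algorithm it licenses is given below; the three-diagonal storage (L, D, U plus the right hand side y) mirrors the solver inputs described later, but the function names and scratch-array interface are our own assumptions.

#include <math.h>

/* Diagonal dominance factor d (Equation 4.2) for a tridiagonal matrix stored
 * as three diagonals; L[0] and U[n-1] are assumed to be zero padding. */
double dominance_factor(const float *L, const float *D, const float *U, int n)
{
    double d = INFINITY;
    for (int i = 0; i < n; ++i) {
        double off = (i > 0 ? fabsf(L[i]) : 0.0f) + (i < n - 1 ? fabsf(U[i]) : 0.0f);
        if (off == 0.0) continue;            /* row with no off-diagonal entries */
        double r = fabsf(D[i]) / off;
        if (r < d) d = r;
    }
    return d;
}

/* Thomas algorithm (TDMA): forward sweep then back substitution in Θ(n).
 * c and z are caller-provided scratch arrays; valid without pivoting for d > 1. */
void thomas_solve(const float *L, const float *D, const float *U,
                  const float *y, float *x, float *c, float *z, int n)
{
    c[0] = U[0] / D[0];
    z[0] = y[0] / D[0];
    for (int i = 1; i < n; ++i) {            /* forward elimination */
        float denom = D[i] - L[i] * c[i - 1];
        c[i] = U[i] / denom;
        z[i] = (y[i] - L[i] * z[i - 1]) / denom;
    }
    x[n - 1] = z[n - 1];
    for (int i = n - 2; i >= 0; --i)         /* back substitution */
        x[i] = z[i] - c[i] * x[i + 1];
}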

4.2.2 The SPIKE Algorithm

The SPIKE algorithm [52] is a poly-algorithm that uses domain decomposition to partition a banded matrix into mutually independent subsystems which can be solved concurrently. Consider the tridiagonal linear system AX = Y where A is n × n in size with only a single right hand side vector Y. We can partition the system into p partitions of m elements, where k = (1, 2, . . . , p), to give a main diagonal partition A_k, off-diagonal partitions B_k and C_k, and the right hand side partition Y_k.

A_k = \begin{pmatrix}
a_{i,j} & a_{i,j+1} & & & \\
a_{i+1,j} & a_{i+1,j+1} & a_{i+1,j+2} & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{i+m-2,j+m-3} & a_{i+m-2,j+m-2} & a_{i+m-2,j+m-1} \\
 & & & a_{i+m-1,j+m-2} & a_{i+m-1,j+m-1}
\end{pmatrix}

[B_k, C_k, Y_k] = \begin{pmatrix}
0 & a_{mk+1,m(k-1)} & y_{mk} \\
\vdots & \vdots & \vdots \\
a_{m(k+1)-1,m(k+1)} & 0 & y_{m(k+1)}
\end{pmatrix}   (4.3)

The coefficient matrix partitions are factorised so A = DS, where D is the main diagonal block matrix and S is the SPIKE matrix, as seen in Equation 4.4,

DS = \begin{pmatrix}
A_1 & & & & \\
 & A_2 & & & \\
 & & \ddots & & \\
 & & & A_{p-1} & \\
 & & & & A_p
\end{pmatrix}
\begin{pmatrix}
I & V_1 & & & \\
W_2 & I & V_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & W_{p-1} & I & V_{p-1} \\
 & & & W_p & I
\end{pmatrix}   (4.4)

where V_k = (A_k)^{-1} B_k for k = 1, ..., p − 1 and W_k = (A_k)^{-1} C_k for k = 2, ..., p. By first solving DF = Y, the solution can be retrieved by solving SX = F. As SX = F is the same size as the original system, solving for X can be simplified by first extracting a reduced system of the boundary elements between partitions to form ŜX̂ = F̂, as seen in Equation 4.5, where superscripts t and b denote the top-most and bottom-most elements of a partition.

\begin{pmatrix}
1 & 0 & V_1^t & 0 & & & & & \\
0 & 1 & V_1^b & 0 & & & & & \\
0 & W_2^t & 1 & 0 & & & & & \\
0 & W_2^b & 0 & 1 & & & & & \\
 & & & & \ddots & & & & \\
 & & & & & 1 & 0 & V_{p-1}^t & 0 \\
 & & & & & 0 & 1 & V_{p-1}^b & 0 \\
 & & & & & 0 & W_p^t & 1 & 0 \\
 & & & & & 0 & W_p^b & 0 & 1
\end{pmatrix}
\begin{pmatrix}
X_1^t \\ X_1^b \\ X_2^t \\ X_2^b \\ \vdots \\ X_{p-1}^t \\ X_{p-1}^b \\ X_p^t \\ X_p^b
\end{pmatrix}
=
\begin{pmatrix}
F_1^t \\ F_1^b \\ F_2^t \\ F_2^b \\ \vdots \\ F_{p-1}^t \\ F_{p-1}^b \\ F_p^t \\ F_p^b
\end{pmatrix}   (4.5)

The reduced system Ŝ is a sparse banded matrix of size 2p × 2p and has a bandwidth of 2. Polizzi et al. [52] proposed strategies to handle solving the reduced system. The truncated–SPIKE algorithm states that for a diagonally dominant system where d > 1 (Equation 4.2), the reduced SPIKE elements V_k^t and W_k^b can be set to zero [52]. This truncated reduced system takes the form of p − 1 independent systems, seen in Equation 4.6, which can be solved easily using direct methods.

\begin{pmatrix} 1 & V_k^b \\ W_{k+1}^t & 1 \end{pmatrix}
\begin{pmatrix} X_k^b \\ X_{k+1}^t \end{pmatrix} =
\begin{pmatrix} F_k^b \\ F_{k+1}^t \end{pmatrix}, \quad k = 1, \dots, p - 1   (4.6)
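Each 2 × 2 system in Equation 4.6 can be solved in closed form; the small C sketch below (our own illustration, with hypothetical names) shows the arithmetic involved, where diagonal dominance keeps the determinant safely away from zero.

/* Solve one truncated reduced system (Equation 4.6) coupling partitions k and
 * k+1: the unknowns are X_k^b and X_{k+1}^t. */
static void solve_reduced_pair(float v_b, float w_t,     /* V_k^b, W_{k+1}^t */
                               float f_kb, float f_k1t,  /* F_k^b, F_{k+1}^t */
                               float *x_kb, float *x_k1t)
{
    float det = 1.0f - v_b * w_t;            /* determinant of the 2x2 matrix */
    *x_kb  = (f_kb  - v_b * f_k1t) / det;
    *x_k1t = (f_k1t - w_t * f_kb)  / det;
}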

With Xˆ computed, the remaining values of X can be found with perfect parallelism using Equation 4.7.

 b t A1X1 = F1 − V1 X2  b t t b AkXk = Fk − Vk Xk+1 − WkXk−1 k = 2, . . . , p − 1 (4.7)   t b ApXp = Fp − WpXp−1 Mikkelsen et al. [70] conducted a detailed error analysis of the Trun- cated SPIKE algorithm and showed that a reasonable approximation of the upper bound of the infinity norm is dependent on the degree of diagonal Investigation 2: An Heterogeneous Solver for Tridiagonal Systems 46 dominance, the partition size and bandwidth of the matrix given by:

\|\hat{x} - x\|_\infty \approx d^{-m}\,\beta   (4.8)
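As a rough point of reference (our own arithmetic, not a result from [70]): for the degree of diagonal dominance d > 3 and partition size m = 32 used later in this work, with bandwidth β = 1 for a tridiagonal matrix, Equation 4.8 gives

\|\hat{x} - x\|_\infty \approx 3^{-32} \times 1 \approx 5.4 \times 10^{-16},

which is several orders of magnitude below single precision machine epsilon (≈ 1.2 × 10^{-7}), so the truncation itself contributes no observable loss of accuracy at FP32.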

4.3 Implementation

The general SPIKE algorithm consists of four steps: 1) partitioning the system, 2) factorising the partitions, 3) extracting and solving the reduced system, and 4) recovering the overall solution. For diagonally dominant tridiagonal linear systems the truncated-SPIKE algorithm may be employed. This requires only the bottom SPIKE element v_{km}^b and the top SPIKE element w_{km+1}^t in order to resolve the boundary unknown elements x_{km}^b and x_{km+1}^t [52]. This decouples the partition from the rest of the matrix and can be achieved by performing only the forward-sweep of LU factorisation, referred to as LUFS, and the forward-sweep of UL factorisation, referred to as ULFS.

LUFS and ULFS will be computed for partitions k and k + 1 for k = 1, 2, . . . , p − 1. The factorised right hand side elements f_{km}^b and f_{km+1}^t and the SPIKE elements w_{km+1}^t and v_{km}^b are used to form and solve the reduced system Ŝ_k X̂_k = F̂_k using Equation 4.6, producing x_{km}^b and x_{km+1}^t. This algorithmic step is referred to as RS.

The remaining elements of the solution X_k can then be recovered with Equation 4.7 via the back-sweep step of LU factorisation, referred to as LUBS, and the back-sweep step of UL factorisation, referred to as ULBS, on the top and bottom halves of partitions k and k + 1 respectively.

We use the Thomas Algorithm to compute the forward and back-sweep factorisation steps, giving the overall complexity of our truncated-SPIKE algorithm as O(n). A high level overview of the anatomy and execution flow of our oclspkt routine can be seen in Figure 4.1. The oclspkt solver expects the size of the system n, the RHS vector Y and the tridiagonal matrix split into vectors of its lower–diagonal L, main–diagonal D and upper–diagonal U as inputs. The solution vector X is returned. In the following subsections we describe the truncated–SPIKE algorithm implementation for the FPGA (oclspkt-FPGA) using OpenCL and the development considerations to obtain optimised performance. As a part of this process we design and implement an optimised TDMA OpenCL kernel to act as a serial baseline performance comparison for the parallel truncated–SPIKE kernel. Since the optimised TDMA implementation is constrained by the available global memory bandwidth, we are able to make genuine comparisons of FPGA hardware utilisation and computational complexity for these two kernels.

Figure 4.1: An overview of the anatomy and execution flow of the oclspkt solver.

We then discuss the process of porting and optimising the oclspkt code for CPU and GPU. Finally, we describe integrating the three oclspkt–[FPGA|GPU|CPU] implementations as a heterogeneous solver. The specific hardware we target for these implementations are Bittware's A10PL4 FPGA, NVIDIA's M4000 GPU and the Intel Xeon E5-1650 CPU.

4.3.1 FPGA Implementation

In order to take advantage of the FPGA’s innate pipelined parallelism we implement both the TDMA and truncated-SPIKE algorithm as single Work Item kernels. A single Work Item kernel has no calls to the OpenCL API for the local or global relative position in a range of Work Items. This allows the Intel FPGA compiler to pipeline as much of the kernel code as possible, whilst not having to address Work Item execution synchronisation and access to shared memory resources. This reduces the OpenCL FPGA resource consumption overhead, allowing more of the logic fabric to be used for computation.

TDMA Kernel Code

We found no suitable FPGA implementation of a tridiagonal linear system solver able to solve large systems. In order to provide a suitable performance baseline for the more complex SPIKE algorithm, we implemented the Thomas Algorithm, or TDMA, with OpenCL. The TDMA implementation calculates the forward-sweep and back-substitution loops one block of the input system at a time, effectively treating the FPGA's on-chip BRAM as cache for the current working data. The block size m is set as high as possible, and is only limited by the available resources on the FPGA. An OpenCL representation of the kernel implementation can be seen in Figure 4.2. The forward sweep section loads m elements of the input vectors L, D, U, Y from off-chip DDR4 RAM to on-chip BRAM. With this input the upper–triangular factors and modified RHS are calculated, overwriting the initial values of D and Y. D and Y are then written back to DDR4 RAM due to the BRAM limitation on the FPGA. The forward sweep section iterates over blocks 1 . . . p. The back-substitution section loads m elements of the D, U, and Y vectors and writes m elements of X after recovering the solution via back substitution. The back-substitution section iterates over blocks p . . . 1.

Figure 4.2: The FPGA TDMA OpenCL kernel tdma, with the execution path and data dependencies shown. The tdma executes as a single Work Item kernel

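A minimal single work-item sketch of this block-wise TDMA is given below. It illustrates the data flow of Figure 4.2 under our own assumptions (a compile-time block size M, the off-diagonals zero-padded at the system boundaries, and illustrative names) and is not the exact kernel used in the evaluation.

#define M 1024   /* block size; assumed to divide the padded system size n = p*M */

__kernel void tdma(const int p,
                   __global const float *restrict L,
                   __global float *restrict D,
                   __global const float *restrict U,
                   __global float *restrict Y,
                   __global float *restrict X)
{
    float d[M], y[M];               /* on-chip (BRAM) cache of the current block */

    /* Forward sweep over blocks 1..p: factorise and spill D', Y' back to DDR4. */
    float d_prev = 1.0f, u_prev = 0.0f, y_prev = 0.0f;
    for (int blk = 0; blk < p; ++blk) {
        for (int i = 0; i < M; ++i) {
            int g = blk * M + i;
            float w = L[g] / d_prev;          /* L[0] assumed padded to zero */
            d[i] = D[g] - w * u_prev;
            y[i] = Y[g] - w * y_prev;
            d_prev = d[i]; u_prev = U[g]; y_prev = y[i];
        }
        for (int i = 0; i < M; ++i) {         /* spill the factored block */
            D[blk * M + i] = d[i];
            Y[blk * M + i] = y[i];
        }
    }

    /* Back substitution over blocks p..1: reload factored blocks and recover X. */
    float x_next = 0.0f;                      /* U[n-1] assumed padded to zero */
    for (int blk = p - 1; blk >= 0; --blk) {
        for (int i = 0; i < M; ++i) {
            d[i] = D[blk * M + i];
            y[i] = Y[blk * M + i];
        }
        for (int i = M - 1; i >= 0; --i) {
            int g = blk * M + i;
            float xi = (y[i] - U[g] * x_next) / d[i];
            X[g] = xi;
            x_next = xi;
        }
    }
}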

Truncated–SPIKE Kernel Code

An OpenCL algorithm representation of the truncated–SPIKE kernel, spktrunc, can be seen in Figure 4.3. The FPGA oclspkt implementation executes p iterations of its main loop, loading one block of the given linear system, where the block size is m, and solving one partition per iteration of the main loop. A partition of size m is loaded from global memory and partitioned as per Equation 4.3. LUFS(k) and ULFS(k) are executed concurrently to compute and store half of the upper and lower triangular systems [D′U′Y′]_k(m/2; m) and [L′D′Y′]_k(1; m/2), and the SPIKE elements v_k^b and w_k^t respectively. Next, using y_{k−1}^b and v_{k−1}^b from the previous iteration, and y_k^t and w_k^t, as inputs for RS(k), the boundary elements x_{k−1}^b and x_k^t are computed. Finally, X_k(1; m/2) and X_{k−1}(m/2; m) are then recovered with ULBS(k) and LUBS(k−1) of [L′D′Y′]_k(1; m/2) and [D′U′Y′]_{k−1}(m/2; m). [D′U′Y′]_k(m/2; m) and v_k^b are stored for the next iteration of the main loop. The FPGA solver is initialised with an upper triangular identity matrix in [D′U′Y′]_{k−1}(m/2; m) for k = 0. This results in a streaming linear system solver where loading in a block of partitions at the start of the pipeline will compute a block of the solution

Figure 4.3: The FPGA truncated–SPIKE OpenCL kernel spktrunc, with the execution path and data dependencies shown. The spktrunc executes as a single Work Item kernel

vector with a −m/2 element offset.

Host Code

On the host side, in order to interleave the PCIe memory transfers to the device with the execution of the solver kernel, we create in-order command queues for writing to, executing on, and reading from the FPGA. We create two copies of the read-only memory objects L, D, U, Y and a write-only memory object X. The spktrunc kernel and the FPGA's DMA controller for the PCIe bus share the total bandwidth of the DDR4 RAM bank. To maximise FPGA global memory bandwidth the device memory objects are explicitly designated specific RAM bank locations on the FPGA card in such a way that the PCIe to device RAM, and device RAM to FPGA, bandwidth is optimised. The execution kernel is enqueued as a 1-by-1-by-1 dimension task with arguments p, L, D, U, Y, and X, where p, the number of partitions to solve, is given by ceil(n/m) + 1. The execution kernel is scheduled and synchronised with the write and read operations of the device memory objects using OpenCL event objects. The kernel code is dependent on the partition size m, so memory buffers for the input and output vectors are created as 1-by-size vectors, where size is given by p × m. The input matrix, consisting of the lower, main and upper diagonal vectors of A and a single right hand side vector Y, is stored in row-major order. The memory objects L, D, U, Y are padded with an identity matrix and zeros in order to accommodate linear systems where m is not a factor of n, giving the overall memory requirement as 5 × size. As the kernel is implemented as a single Work Item, this allows for single-strided memory access patterns when the FPGA loads partitions from global memory for processing. This means that it is not necessary to implement a pre-execution data marshalling stage as is often required for SIMD or SIMD-like processors.
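A compact host-side sketch of this scheduling pattern is shown below. The queue and buffer names, and the use of clEnqueueTask for the single Work Item kernel, are our own illustrative assumptions rather than the thesis host code; error checking and the usual platform, context, program and buffer creation boilerplate are omitted.

#include <CL/cl.h>

/* Enqueue one chunk: stage the inputs, run the single work-item spktrunc
 * kernel once the writes have landed, and read the solution back. Using three
 * in-order queues lets the next chunk's writes overlap this chunk's compute. */
static void enqueue_chunk(cl_command_queue q_write, cl_command_queue q_exec,
                          cl_command_queue q_read, cl_kernel spktrunc,
                          cl_mem bL, cl_mem bD, cl_mem bU, cl_mem bY, cl_mem bX,
                          const float *hL, const float *hD, const float *hU,
                          const float *hY, float *hX, cl_int p, size_t m)
{
    size_t bytes = (size_t)p * m * sizeof(float);
    cl_event ev_write[4], ev_solve;

    clEnqueueWriteBuffer(q_write, bL, CL_FALSE, 0, bytes, hL, 0, NULL, &ev_write[0]);
    clEnqueueWriteBuffer(q_write, bD, CL_FALSE, 0, bytes, hD, 0, NULL, &ev_write[1]);
    clEnqueueWriteBuffer(q_write, bU, CL_FALSE, 0, bytes, hU, 0, NULL, &ev_write[2]);
    clEnqueueWriteBuffer(q_write, bY, CL_FALSE, 0, bytes, hY, 0, NULL, &ev_write[3]);

    clSetKernelArg(spktrunc, 0, sizeof(cl_int), &p);
    clSetKernelArg(spktrunc, 1, sizeof(cl_mem), &bL);
    clSetKernelArg(spktrunc, 2, sizeof(cl_mem), &bD);
    clSetKernelArg(spktrunc, 3, sizeof(cl_mem), &bU);
    clSetKernelArg(spktrunc, 4, sizeof(cl_mem), &bY);
    clSetKernelArg(spktrunc, 5, sizeof(cl_mem), &bX);

    /* 1-by-1-by-1 task, gated on the four input transfers. */
    clEnqueueTask(q_exec, spktrunc, 4, ev_write, &ev_solve);

    /* Read back as soon as the kernel completes. */
    clEnqueueReadBuffer(q_read, bX, CL_FALSE, 0, bytes, hX, 1, &ev_solve, NULL);
}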

Kernel Complexity and Hardware Utilisation

The FLOP requirements for our TDMA and truncated–SPIKE OpenCL kernels are presented in Table 4.1. The TDMA kernel has significantly fewer FLOPs compared to the truncated–SPIKE kernel. This is expected as the TDMA kernel only computes the LU factorisation and back–substitution, compared to the more computationally complex truncated-SPIKE poly–algorithm described previously. However, since the TDMA kernel requires the upper–triangular matrix of the entire system to be stored to global memory as an intermediate step and then subsequently re-read, the TDMA kernel requires double the number of FPGA to off–chip memory transactions in comparison to the truncated–SPIKE kernel.

Operation   TDMA   truncated-SPIKE
ADD/SUB     3mp    (5m + 3)p
MUL         3mp    (6m + 5)p
DIV         3mp    (3m + 3)p
MEM         10mp   5mp

Table 4.1: FLOP and global memory transactions required for the TDMA and truncated–SPIKE FPGA kernels.

The FLOP and memory transaction requirements seen in Table 4.1 are reflected in the FPGA kernel hardware utilisation presented in Table 4.2. OpenCL requires a static partition of the available FPGA hardware resources in order to facilitate host to FPGA memory transfers and OpenCL kernel execution. Per Table 4.2 this static partition is significant, consuming at least 10% of each measured resource type. The total resource utilisation for each kernel is given by the addition of the OpenCL static resource utilisation plus the kernel specific resource utilisation. The more computationally complex truncated–SPIKE kernel requires more look up tables (ALUT), flip flops (FF) and digital signal processor (DSP) tiles than the TDMA kernel. The TDMA kernel however requires more block RAM (BRAM) tiles due to implementing a greater number of load store units to cater for the extra global memory transactions. Furthermore, both kernels are constrained by the available amount of BRAM on the FPGA, with the BRAM utilisation by far the highest resource utilisation for both kernels.

Resource   OpenCL Static   TDMA   truncated-SPIKE
ALUTs      13%             14%    18%
FFs        13%             10%    13%
BRAMs      16%             52%    49%
DSPs       10%             16%    27%

Table 4.2: FPGA hardware utilisation for the TDMA and truncated– SPIKE kernels.

FPGA OpenCL Optimisation considerations

Our implementation of the truncated–SPIKE algorithm is global memory bandwidth constrained. It requires large blocks of floating point data to be accessible at each stage of the algorithm. By far the largest bottleneck to computational throughput is ensuring coalesced, aligned global memory transactions. When loading matrix partitions from global to local or private memory, a major optimisation consideration is the available bandwidth on the global memory bus. The available bandwidth per memory transaction is 512 bits, and the Load–Store–Units that are implemented by the Intel FPGA compiler are of size 2^b bits, where b_min = 9 and b_max is constrained by the available resources on the FPGA. Therefore, to ensure maximum global memory bandwidth with the aforementioned constraints, we set the partition size m to 32 for single precision floating point data for both kernels, resulting in 1024 bit memory transactions. The value of m is hard–coded and known at compile time, allowing for unrolling of the nested loops at the expense of increased hardware utilisation. Unrolling a loop effectively tells the compiler to generate a hardware instance for each iteration of the loop, meaning that if there are no loop carried dependencies the entire loop is executed in parallel. However, for loops with carried dependencies such as LU/UL factorisation each iteration cannot execute in parallel. Nonetheless this is still many times faster than sequential loop execution despite the increase in latency that is dependent on the loop size. Loop unrolling is our primary computational optimisation step, thereby allowing enough compute bandwidth for our kernel to act as a streaming linear system solver. Note that in our optimisation process we either fully unroll loops or not at all. It is possible to partially unroll loops for a performance boost when hardware utilisation limitations do not permit a full unroll. Partially unrolling a loop can however be inefficient since the hardware utilisation does not scale proportionally with the unroll factor due to the hardware overhead required to control the loop execution.
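As a small illustration of this optimisation (a placeholder kernel, not the thesis code), fully unrolling a fixed-trip-count loop with the Intel fpga Opencl compiler looks like the following; because the copy loop carries no dependency, its M iterations become parallel hardware, whereas a factorisation loop with a carried dependency would still unroll but retain its feedback latency.

#define M 32   /* partition size, fixed at compile time */

__kernel void copy_block(const int k,
                         __global const float *restrict in,
                         __global float *restrict out)
{
    float buf[M];

    #pragma unroll
    for (int i = 0; i < M; ++i)       /* no carried dependency: fully parallel */
        buf[i] = in[k * M + i];

    #pragma unroll
    for (int i = 0; i < M; ++i)
        out[k * M + i] = 2.0f * buf[i];
}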

4.3.2 Porting the truncated-SPIKE Kernel to CPU and GPU

In order to investigate the full potential for a heterogeneous computing implementation of the truncated SPIKE algorithm for solving tridiagonal linear systems of any size, we exploited the portability of OpenCL. We modified the host and kernel code used for the FPGA implementation to target CPU and GPU hardware. To achieve this it was necessary to make modification to the host and kernel side memory objects and data access patterns, remap the truncated SPIKE algorithm to different kernel objects, and modify the Work Group sizes and their mapping to Compute Units with respect to CPU and GPU hardware architecture.

Partitioning and Memory Mapping

The memory requirements for the CPU and GPU implementations are dependent on the partitioning scheme for each device. The memory required to solve for a partition of size m, multiplied by a multiple of the GPU's preferred Work Group size, must be less than the available local memory. This constrains the size and number of partitions m, since it is preferable to maximise the occupancy of the SIMD lane whilst ensuring that sufficient local memory is available. Unlike the GPU, all OpenCL memory objects on the CPU are automatically cached into local memory by hardware [71]. However, considering that the CPU has a lower Compute Unit count, we maximise the partition size m to minimise the number of partitions, thereby minimising the operation count required to recover the reduced system. The relative values for p and size in terms of m and Work Group size can be seen in Table 4.3.

Device   Partitions (p)       size
FPGA     ceil(n/m) + 1        p × m
GPU      n/(WGsize × m)       p × m × WGsize
CPU      n/(CUs × WGsize)     p × CU × WGsize

Table 4.3: oclspkt kernel partitioning schemes

For our implementation of the truncated–SPIKE algorithm for the CPU and GPU, the host and kernel memory requirements are five 1-by-size vectors, L, D, U, Y, X, of the partitioned system and four 1-by-(p+2) vectors, V, W, Y^t, Y^b, of the reduced system. By storing the values for V, W, Y^t, Y^b in a separate global memory space we remove the potential for bank conflicts in memory transactions that may occur if the reduced system vectors are stored in-place in the partitioned system. The reduced system memory objects are padded with zeros to accommodate the top-most and bottom-most partitions j = 0 and j = p, removing the need for excess control code in the kernel to manage the top-most and bottom-most partitions. Further, to ensure data locality for coalesced memory transactions on both the CPU and GPU, the input matrix is transformed in a pre-execution data marshalling step. The data marshalling transforms the input vectors so that data for adjacent Work Items are sequential instead of strided. This allows the data to be automatically cached and vector-processed across Work Items on the CPU, and allows full bandwidth global memory transactions on the GPU.
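As an illustration of such a marshalling step (our own sketch, not the thesis code), the transform below interleaves per-partition data so that element i of adjacent partitions is contiguous in memory, which is the pattern read by adjacent Work Items.

/* Interleave one diagonal vector: element i of each of the p partitions
 * (each of length m) becomes contiguous, i.e. out[i*p + k] = in[k*m + i].
 * Adjacent Work Items (one per partition k) then access consecutive addresses. */
static void marshal_interleave(const float *in, float *out, int p, int m)
{
    for (int k = 0; k < p; ++k)
        for (int i = 0; i < m; ++i)
            out[(size_t)i * p + k] = in[(size_t)k * m + i];
}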

Remapping the kernel

In contrast to the FPGA implementation of the truncated–SPIKE algorithm, for the CPU and GPU implementations we split the code into two separate kernels for the CPU and three separate kernels for the GPU. This allows for better Work Group scheduling and dispatching for multiple Compute Unit architectures, as we enqueue the kernels for execution as arrays of Work Items, known as an NDRange in OpenCL's parlance. In remapping the kernel to the CPU, the underlying architecture provides relatively few Processing Elements per Compute Unit and a fast clock speed. As such, in order to make best use of this architecture, the number of partitions of the truncated–SPIKE algorithm should be minimised, thereby allocating Work Groups of the maximum possible size to each Compute Unit and ensuring maximum occupancy. Figure 4.4 shows an OpenCL representation of the CPU implementation, its execution order and data path.

Figure 4.4: The CPU truncated–SPIKE OpenCL kernels spkfact and spkrec, with the execution path and data dependencies shown. Both kernels are executed as an NDRange of Work Items.

For the CPU implementation we use the partitioning scheme proposed by Mendiratta [72]. We compute the LU and UL forward sweep factorisation in the spkfaccpu kernel, where we apply UL factorisation to elements 0 to 2m_min and apply LU factorisation to elements m_min to m, where m_min is the smallest partition size required to purify the resulting factorisations of error as per Equation 4.8. This reduces the overall operation count and is only possible when m ≫ m_min. The spktfaccpu kernel is enqueued as a p–by–1–by–1 NDRange, in a single in-order command queue. The reduced system and the recovery of the overall solution are handled by a second kernel, spktrec. The spktrec kernel is enqueued as per spktfaccpu in a single in-order command queue. The reduced system is solved and the boundary unknown elements are recovered and used to compute the UL backsweep of elements 0 to m_min − 1 and the LU backsweep of elements m to m_min. In contrast to the CPU, the GPU has many Processing Elements per Compute Unit and a relatively low clock speed. In order to optimise performance, it was important to maximise the number of partitions of the SPIKE algorithm by reducing the partition size, thereby ensuring maximum occupancy of the Processing Elements.

Figure 4.5: The GPU truncated–SPIKE OpenCL kernels spkfact, spkrecul and spkreclu, with the execution path and data dependencies shown. All kernels are executed as an NDRange of Work Items.

Figure 4.5 shows an OpenCL representation of the GPU implementation, its execution order and data path. For the GPU, partitioning the system and the LU and UL factorisation of the code are handled by the first kernel, spkfactgpu. Unlike the CPU, the GPU computes the entire block size m of the UL and the LU factorisations. Only the top half of the UL and the bottom half of the LU results are then stored in global memory in order to reduce global memory transactions and overall global memory space requirements. The reduced system and the recovery of the overall solution are handled by two kernels, spktreclu and spktrecul. spktreclu and spktrecul only load the bottom half and the top half of partition m respectively to compute the backsweep portions of the LU and UL factorisations. The three kernels are again enqueued as p-by-1-by-1 NDRanges in in-order command queues for writing to, executing on and reading from the GPU. As with the FPGA, this effectively interleaves the PCIe data transfer with kernel execution.

4.3.3 The Heterogeneous Solver

We further extend our truncated–SPIKE implementation to utilise all computational resources available on a platform, as seen in Figure 4.1.

Component   Specification
CPU         Intel Xeon E5-1620 v4 @ 3.50GHz
GPU         Nvidia M4000 8GB GDDR5 PCIe G3 x16
FPGA        Bittware A10PL4 w/ Intel Arria 10 GX 8GB DDR4 PCIe G3 x8
RAM         64GB DDR4 @ 2400MHz
OS          CentOS 7.4
Software    ICC 18.0.3, CUDA 9.0, Intel Quartus Pro 17.0, Intel OpenCL SDK 7.0.0.2568

Table 4.4: Specifications for Dell T5000 Desktop PC

This heterogeneous solver first checks for available devices on the host using the OpenCL APIs, and then queries whether device profiling data exists for the found devices. If profiling data is not available for all devices, each device will be allocated an even portion of the input system and profiling data will be collected on the next execution of the solver. Otherwise, each device will be allocated a portion of the input system determined by the percentage of the total recorded throughput contributed by that device. Throughput in this case includes data transit time across the PCIe bus, data marshalling, and the compute time of the kernel. The heterogeneous solver then asynchronously dispatches chunks of the input data to the devices, executes the device solvers and recovers the solution. The inter-chunk boundary solutions recovered from the devices are cleansed of error by executing a 'top' level of the truncated-SPIKE algorithm on the chunk partitions.
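A sketch of this proportional allocation is given below (our own illustration; the solver's load balancer may differ in detail): each device receives a share of the n rows of the input system proportional to its previously recorded throughput.

/* Split n rows across num_dev devices in proportion to each device's recorded
 * throughput (rows solved per second, inclusive of transfer and marshalling).
 * The rounding remainder is absorbed by the last device. */
static void allocate_rows(const double *throughput, int num_dev,
                          long n, long *rows)
{
    double total = 0.0;
    for (int d = 0; d < num_dev; ++d)
        total += throughput[d];

    long assigned = 0;
    for (int d = 0; d < num_dev; ++d) {
        rows[d] = (long)((double)n * throughput[d] / total);
        assigned += rows[d];
    }
    rows[num_dev - 1] += n - assigned;
}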

4.4 Evaluation

In the following sub-sections we evaluate the oclspkt routine in terms of compute performance, numerical stability and power efficiency. The results presented use single precision floating point, and all matrices are random, non-singular, and the main diagonal has a diagonal dominance factor d > 3. All results presented in this paper have been executed on a Dell T5000 Desktop PC with an Intel Xeon CPU E5-1620 v4, 64GB of RAM, a Bittware A10PL4 FPGA and an NVIDIA M4000 GPU; full specifications are listed in Table 4.4.

4.4.1 Compute Performance

Figure 4.6: Time (ms) to solve a system of size N = 256 × 10^6 using oclspkt, targeting CPU, GPU and FPGA devices.

To evaluate the compute performance of oclspkt, we first consider only the kernel execution time for our target devices in isolation. In Figure 4.6 we show the time to solve a system where N = 256 × 10^6, highlighting the solve and data-marshalling kernel components of the overall execution time. In this experiment the GPU solve kernel takes on average 78.4 ms to solve the tridiagonal system, while the FPGA and CPU are 2.6× and 4.8× slower at 200 ms and 376 ms respectively. When the data-marshalling overheads required by the GPU and CPU kernels are also considered, the GPU is still the quickest at 152 ms, with the FPGA and CPU now 1.3× and 6.1× slower.

Figure 4.7: Comparing time (ms) to solve a system of size N = 256 × 10^6 using oclspkt, dgtsv, sdtsvb and the TDMA.

In Figure 4.7 we compare these results to other diagonally dominant tridiagonal solver algorithms: our TDMA FPGA kernel, the CUDA GPU implementation dgtsv [53], the Intel MKL sdtsvb routine [73] and a sequential CPU implementation of the TDMA. For each of the three target devices our oclspkt implementation outperforms the comparison routines for solving a tridiagonal system of N = 256 × 10^6. The oclspkt (FPGA) is 1.7× faster than the TDMA (FPGA) kernel, the oclspkt (GPU) implementation is 1.1× faster than dgtsv, and our oclspkt (CPU) is 1.5× and 3.5× faster than the sdtsvb and TDMA CPU solvers respectively. Note that for each of these results we include any data-marshalling overhead, but exclude host to PCIe device transfer time.

A comparison of the compute performance when targeting single devices and heterogeneous combinations of devices executing the oclspkt routine can be seen in Figure 4.8. We normalise the performance metric to rows solved per second (RS s^-1) to provide a fair algorithmic comparison across different device hardware architectures. Furthermore, when evaluating the heterogeneous solver performance of oclspkt we take a holistic system approach, which includes the host to device PCIe data transfer times for the FPGA and GPU devices, and all data-marshalling overheads. As seen in Table 4.5, the GPU+FPGA device combination has the best average maximum performance. The GPU+FPGA device combination performs 1.38× better than the next best device, the GPU-only solver, and 2.48× better than the worst performing device, the CPU-only implementation.

Intuitively, we would expect the performance of the heterogeneous device combinations to be close to the sum of the individual device performance metrics. In fact, our results show that only the GPU+FPGA performance comes close to the sum of the GPU-only and FPGA-only performance, at 88% of the theoretical total (691 × 10^6 RS s^-1 against the 501 + 280 = 781 × 10^6 RS s^-1 sum in Table 4.5). The CPU+FPGA, CPU+GPU and CPU+GPU+FPGA average maximum performance reach only 65%, 51% and 55% respectively.

For the PCIe-attached devices, the GPU and FPGA performance metrics are determined by the available PCIe bus bandwidth. The kernel execution time (Figure 4.6) is completely interleaved with the host to device memory transfers. As such, the M4000 GPU card with 16 PCIe Gen 3.0 lanes available will outperform the A10PL4 FPGA card with 8 PCIe Gen 3.0 lanes regardless of the kernel compute performance. Similarly, the CPU performance is determined by the available host RAM bandwidth.

Using the de-facto industry standard benchmark for measuring sustained memory bandwidth, STREAM [74, 75], our desktop machine, specified in Table 4.4, has a maximum measured memory bandwidth of 42 GB s^-1. Profiling our CPU implementation of oclspkt using Intel VTune Amplifier shows very efficient use of the available memory bandwidth, with a sustained average of 36 GB s^-1 and a peak of 39 GB s^-1. This saturation of host memory bandwidth by the CPU solver creates a processing bottleneck and negatively affects the PCIe data transfer to the FPGA and GPU.

Device          Compute Performance [×10^6 RS s^-1]   Energy Efficiency [×10^6 RS J^-1]
CPU             279                                    1.39
GPU             501                                    12.9
FPGA            280                                    28.6
CPU+GPU         396                                    1.67
CPU+FPGA        365                                    1.81
GPU+FPGA        691                                    15.6
CPU+GPU+FPGA    431                                    2.06

Table 4.5: Maximum observed average (n = 16) compute performance and energy efficiency of oclspkt for single devices and heterogeneous combinations.

This, coupled with the heterogeneous partitioning scheme described in Subsection 4.3.3, will favour increasing the chunk size of the input system allocated to CPU computation on each successive invocation of the oclspkt routine, and will subsequently decrease the performance of the GPU and FPGA devices.

4.4.2 Numerical Accuracy

In Figure 4.9 we show the numerical accuracy of oclspkt, in terms of the infinity norm of the error between the known and computed solutions, as the diagonal dominance of the input matrix is varied, compared against the TDMA CPU implementation. The TDMA error approaches the machine epsilon value for single-precision floating-point numbers when the diagonal dominance of the input system is 2, whereas oclspkt on the CPU, GPU and FPGA requires a diagonal dominance of 2.8 to achieve similar accuracy. Equation 4.8 shows that an approximation of the upper bound of the infinity-norm error is dependent on the SPIKE partition size, the bandwidth, and the degree of diagonal dominance. As the GPU and FPGA partition sizes and the smallest CPU partition size are equal, that is, m_GPU = m_FPGA = m_CPU_min, the numerical accuracy of all implementations is expected to be very similar.
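For reference, the accuracy metric used here is simply the infinity norm of the difference between the known and computed solutions; a minimal sketch (with assumed variable names) is given below.

```cpp
// Sketch of the accuracy metric: infinity norm of the error between the known
// solution and the computed solution.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

float inf_norm_error(const std::vector<float>& x_known,
                     const std::vector<float>& x_computed)
{
    float max_err = 0.0f;
    for (std::size_t i = 0; i < x_known.size(); ++i)
        max_err = std::max(max_err, std::fabs(x_known[i] - x_computed[i]));
    return max_err;
}
```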

4.4.3 Energy Consumption and Efficiency

To determine the estimated energy consumption in Joules for each device, we take the manufacturer's rated thermal design power (TDP) and multiply it by the kernel execution time for the data-marshalling and solve steps of oclspkt. TDP represents the average power in Watts used by a processor when the device is fully utilised. Whilst this is not a precise measurement of the power used to solve the workload, it nevertheless provides a relative inter-device benchmark.
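A minimal sketch of this estimate is given below; the function names are illustrative, and the example figures are taken from the TDP and timing values reported in this section.

```cpp
// Sketch of the energy estimate: energy (J) = TDP (W) x execution time (s),
// then rows solved per Joule as the efficiency metric.
#include <cstddef>

double estimated_energy_joules(double tdp_watts, double exec_time_ms)
{
    return tdp_watts * (exec_time_ms / 1000.0);
}

double rows_per_joule(std::size_t rows, double energy_joules)
{
    return static_cast<double>(rows) / energy_joules;
}

// Example using figures reported in the text: the FPGA (33 W TDP) spending
// roughly 200 ms in its solve kernel uses about 33 * 0.2 = 6.6 J for that step.
```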

Figure 4.8: Performance comparison in rows solved per second for N = (256 ... 1280) × 10^6 when targeting CPU, GPU, FPGA and heterogeneous combinations of devices.

The TDP of the M4000 GPU is 120 W, that of the Xeon E5-1620 v4 is 140 W and that of the A10PL4 FPGA is 33 W. When solving a system of N = 256 × 10^6, as seen in Figure 4.10, the FPGA implementation uses 2.8× less energy than the GPU implementation and 20× less than the CPU implementation. Further, in Figure 4.11 we can see the energy efficiency of each hardware configuration of oclspkt in rows solved per Joule. Across the range of the experiment each solver shows consistent results, with the FPGA-only solver the most energy efficient, peaking at 28 × 10^6 rows solved per Joule. The FPGA-only solver is on average 1.8× more energy efficient than the next best performing solver, the GPU+FPGA combination, and is 20.0× more energy efficient than the poorest performing CPU-only solver. This is not surprising, since the TDP of the FPGA is an order of magnitude smaller than that of the other devices. Similarly to the heterogeneous results in Subsection 4.4.1, the addition of the CPU solver significantly constrains the available bandwidth to host memory, slowing down the PCIe data transfer rates. In turn, this pushes more work to the CPU solver, slows down the overall compute and, since the CPU has the highest TDP, exacerbates the poor energy efficiency.

Figure 4.9: Numerical accuracy of oclspkt for varying diagonal dominance compared to CPU TDMA solver.

Figure 4.10: Joules required to solve a tridiagonal system of N = 256 × 10^6 per device using the oclspkt routine.

4.5 Conclusion

In this paper we presented a numerically stable heterogeneous OpenCL implementation of the truncated-SPIKE algorithm targeting FPGAs, GPUs, CPUs, and combinations of these devices. Our experimental case has demonstrated the feasibility of utilising FPGAs, along with GPUs and CPUs, concurrently in a heterogeneous computing environment in order to accelerate solving a diagonally dominant tridiagonal linear system. When comparing our CPU, GPU and FPGA implementations of oclspkt to a suitable baseline implementation and to third-party solvers specifically designed and optimised for these devices, our implementations achieved 150%, 110% and 170% of the comparison routines' compute performance respectively.

Figure 4.11: Energy efficiency comparison in rows solved per Joule for N = (256 ... 1280) × 10^6 for oclspkt when targeting CPU, GPU, FPGA and heterogeneous combinations of devices.

Profiling the heterogeneous combinations of oclspkt showed that targeting the GPU+FPGA devices gives the best compute performance and that targeting the FPGA only gives the best energy efficiency. Adding our highly optimised CPU implementation to a heterogeneous device combination with PCIe-attached devices significantly reduced the expected performance of the overall system. In our experimental case, with a compute environment that has CPUs, GPUs and FPGAs, it is advantageous to relegate the CPU to a purely task-orchestration role instead of computation.

Under our experimental conditions all device compute performance results are memory-bandwidth constrained. While the GPU kernel compute performance is several times faster than the FPGA kernel, the FPGA test hardware has several times less available memory bandwidth. As high-bandwidth data transfer technology is introduced to new generations of FPGA accelerator boards, this performance gap between devices is expected to close. Given their significantly lower power requirements, the incorporation of FPGAs has the potential to reduce some of the power consumption barriers currently faced by HPC environments as we move towards exascale computing.

A natural progression of this work would be to extend the oclspkt routine to be able to solve non-diagonally dominant and block-tridiagonal linear systems. Further, it would be advantageous to extend the heterogeneous partitioning routine to be able to tune the solver to maximise energy efficiency where desired. An extension of this work may also seek to account for memory bottlenecks detected on successive invocations of the solver, further enhancing performance in heterogeneous applications.

4.6 Data Availability

The source code and data used to support the findings of this study are available from the corresponding author upon request.

4.7 Conflicts of Interest

The authors declare that there is no conflict of interest regarding the pub- lication of this paper.

4.8 Acknowledgements

This project utilised the high performance computing (HPC) facility at the Queensland University of Technology (QUT). The facility is administered by QUT's eResearch Department. A special thanks to the eResearch Department for their support, particularly in providing access to specialist FPGA and GPU resources.

Chapter 5

Conclusion

The primary motivation for this thesis is to determine the feasibility of targeting FPGAs for use in accelerating general purpose scientific computing on heterogeneous HPC platforms. This has been explored through the lens of a common scientific computing problem, solving a diagonally dominant tridiagonal linear system. With this focus, a comparative analysis of solver implementations for FPGA, GPU, CPU and heterogeneous combinations thereof has been completed. At a high level, the work presented in this thesis has shown:

• Portable code for GPU and FPGA devices can be developed using HLLs, but in doing so the code is optimised for neither device.

• Solver kernels developed for FPGAs can provide FLOP performance for scientific computing tasks of the same order of magnitude as CPU and GPU solver kernels.

• Using FPGAs can provide significant power savings when compared to GPUs and CPUs for similar tasks.

The remainder of this chapter discusses how Papers 1 and 2 have specifically addressed the research questions posed in Chapter 1, in addition to summarising the major contributions of these works. Finally, consideration is given to opportunities for further work in this area.

5.1 Summary of Thesis Objectives

This sub-section describes how the research questions posed in Chapter 1 have been investigated and answered by this work. The initially posed research questions are restated below:


1. Is it feasible to integrate non-traditional compute devices like FPGAs with GPUs/CPUs for general purpose scientific computing tasks?

2. Can portable diagonally dominant tridiagonal linear systems solvers be designed that could be deployed to any number of CPUs, GPUs or FPGAs present on a host system?

3. Can optimised portable code be created to target multiple devices with OpenCL?

4. Is the same compute performance achievable from HLLs as from HDLs when targeting FPGAs?

5. Is developing optimised FPGA kernels easier with HLLs than with HDLs?

RQ1 The papers presented in Chapters 3 and 4 both show that it is indeed feasible to integrate FPGA hardware as a computational accelerator for general purpose scientific computing. Paper 1 describes a comparative analysis of portable implementations of FPGA and GPU tridiagonal linear system solvers, each solving systems in isolation. Further, Paper 2 presents an FPGA implementation of the truncated-SPIKE algorithm working in concert with GPU and CPU implementations of the same algorithm to accelerate the computation. It was found that within the OpenCL framework it is relatively easy to select, add and dispatch work to FPGA hardware alongside other accelerators using the OpenCL execution model.
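As a rough illustration of how OpenCL exposes this, the sketch below enumerates every platform on the host, collects all CPU, GPU and accelerator (FPGA) devices found, and opens a context and in-order command queue per device. It is a simplified example, not the thesis host code; error handling is minimal, and since a context cannot span devices from different platforms, one context is created per device.

```cpp
// Sketch: discover every OpenCL device on the host and open an in-order queue for each.
#include <CL/cl.h>
#include <vector>

struct OpenedDevice {
    cl_device_id     device;
    cl_context       context;
    cl_command_queue queue;   // in-order by default (no out-of-order property set)
};

std::vector<OpenedDevice> open_all_devices()
{
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    std::vector<OpenedDevice> opened;
    for (cl_platform_id p : platforms) {
        cl_uint n = 0;
        // FPGA boards typically report as CL_DEVICE_TYPE_ACCELERATOR.
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &n) != CL_SUCCESS || n == 0)
            continue;
        std::vector<cl_device_id> ids(n);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, n, ids.data(), nullptr);

        for (cl_device_id d : ids) {
            cl_int err = CL_SUCCESS;
            cl_context ctx = clCreateContext(nullptr, 1, &d, nullptr, nullptr, &err);
            if (err != CL_SUCCESS) continue;
            cl_command_queue q = clCreateCommandQueue(ctx, d, 0, &err);
            if (err != CL_SUCCESS) { clReleaseContext(ctx); continue; }
            opened.push_back({ d, ctx, q });
        }
    }
    return opened;
}
```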

RQ2 and RQ3 Paper 1 presented portable implementations of the PCR and SPIKE algorithms to solve tridiagonal systems up to 1024 elements in size. These algorithms were able to run on the GPU and were simulated on FPGA hardware. The resource utilisation was too high for the SPIKE algorithm to be able to run all kernels on the FPGA concurrently.

The OpenCL implementations of the PCR and SPIKE algorithms were not optimised for the specific underlying compute architectures of each target device. This was by design, to ensure portability of the implementations, but in doing so did not maximise performance. This shows that while it is possible to implement fully portable OpenCL code to target multiple devices, it is not always advantageous to do so.

In Paper 2, rather than implementing fully portable OpenCL kernel code, it was decided to focus on a single algorithm with the goal of maximising performance whilst maintaining some portability. To do this the kernel code was divided into algorithm code blocks. These blocks remained portable to different devices, but were arranged in several device-specific OpenCL kernels in order to maximise performance on the given device. In addition, key parameters including loop size, unroll factors and Work Group sizes were defined and optimised per device to maximise performance.

Further, in Paper 2 the execution model of the host program allowed the user to define, or the system to dynamically allocate, the maximum number and types of devices to use in solving the tridiagonal system. The truncated-SPIKE algorithm allows the problem domain to be decomposed and sent to separate memory-isolated compute devices to be solved. For diagonally dominant tridiagonal matrices there is little inter-device communication required during the execution, allowing the implementation to scale to all devices available on a host.
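The per-device tuning described above can be expressed with ordinary preprocessor parameters passed to clBuildProgram. The fragment below is an illustrative OpenCL C kernel (not one of the thesis kernels) showing a Work Group size macro supplied at build time, for example "-DWG_SIZE=64" for the GPU or "-DWG_SIZE=1" for the FPGA, together with a loop unroll pragma whose factor would be chosen per device.

```c
// Illustrative OpenCL C fragment: the algorithm block stays the same on every
// device; only the tuning parameters change per device at build time.
#ifndef WG_SIZE
#define WG_SIZE 64            // overridden via clBuildProgram options, e.g. -DWG_SIZE=1
#endif

__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void scale_rows(__global const float* restrict in,
                __global float* restrict out,
                const int m)
{
    const int base = get_global_id(0) * m;   // one partition of size m per Work Item

    #pragma unroll 8                          // unroll factor tuned per device
    for (int i = 0; i < m; ++i)
        out[base + i] = 0.5f * in[base + i];
}
```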

RQ4 A comparative test between the OpenCL implementations of tridiagonal solvers described in this work and an HDL implementation could not be conducted. This is because, at the time of writing, no SPIKE or PCR HDL implementations are commercially or freely available, and implementing these routines is outside the scope of this work.

However, using the oclspkt implementation of the truncated-SPIKE solver presented in Paper 2, the Intel FPGA OpenCL kernel profiling tool [32] consistently showed memory bandwidth utilisation of 82%, that is, 28 GB s^-1 out of a possible 34 GB s^-1 maximum for the A10PL4 FPGA. Using an HDL implementation it is likely possible to achieve better memory bandwidth utilisation by having direct control over the design of the load-store units (LSUs). If the assumption is made that an HDL implementation has close to perfect memory bandwidth utilisation, and that the truncated-SPIKE algorithm is memory-bandwidth constrained, the ideal HDL implementation will only give a 21% performance boost¹ when compared to the HLL implementation. As such, it is likely that the HLL implementation will give close to the same compute performance as an HDL implementation.

¹ Calculated from (b_t − b_m)/b_m, where b_t is the total memory bandwidth available and b_m is the maximum utilised memory bandwidth; here (34 − 28)/28 ≈ 21%.

RQ5 In the author's experience, it is subjectively easier to implement computational routines on the FPGA and have them execute in-the-loop using the HLL OpenCL rather than an HDL. A naive OpenCL kernel implementing an algorithm can be written and tested in hours, rather than the days or weeks likely required with an HDL. However, as with most development work, coding high-performance optimised routines with OpenCL for FPGAs can still take weeks of iterative designing, coding and testing. The real benefits of using an HLL like OpenCL come from: 1) removing the steep learning curve associated with developing code with HDLs and using vendor-specific Integrated Development Environments (IDEs); and 2) having an integrated framework able to deploy work to the FPGA 'in-the-loop' of a host program.

5.2 Summary of Contributions

This section summarises the major contributions of papers 1 and 2 included in Chapters 3 and 4.

Chapter 3 – Paper 1: Implementation of parallel tridiagonal solvers for a heterogeneous computing environment This paper explored the feasibility of designing truly portable multi-device high-performance linear algebra routines using OpenCL. The parallel tridiagonal solvers PCR and SPIKE were implemented on an FPGA and a GPU and compared against other OpenCL and CUDA implementations. Results indicated that in terms of accuracy and computational efficiency our designs are comparable for both the FPGA and GPU cases, and suggested further investigation was warranted.

Chapter 4 – Paper 2: Implementing and Evaluating an Heterogeneous Scalable Tridiagonal Linear System Solver with OpenCL to target FPGAs, CPUs and GPUs This paper presented the oclspkt routine, a numerically stable heterogeneous OpenCL implementation of the truncated-SPIKE algorithm targeting CPUs, GPUs, FPGAs and combinations of these devices. The oclspkt routine, when run targeting a single device, showed 50% (CPU), 10% (GPU) and 70% (FPGA) improved compute performance compared to industry-leading device-specific routines.

Profiling the heterogeneous combinations of oclspkt showed that targeting the GPU+FPGA devices gives the best compute performance and targeting the FPGA only gives the best energy efficiency. Adding our highly optimised CPU implementation to a heterogeneous device combination with PCIe-attached devices significantly reduced the expected performance of the overall system.

The compute performance results in the second paper are shown to be memory-bandwidth constrained. While the GPU kernel compute performance is several times faster than the FPGA kernel, the FPGA test hardware has several times less available memory bandwidth. As newer generations of FPGA accelerator boards are equipped with high-bandwidth data transfer technology, the performance gap between devices is expected to close. This, coupled with the fractional power requirements of the FPGA, makes it an attractive concept for future HPC requirements.

5.3 Recommendations for Further work

A natural progression for the linear system solver aspect of this work would be to implement an HDL version of the oclspkt routine in order to compare its performance to the OpenCL oclspkt routine. Any improvement in compute performance could then be accurately quantified, allowing a more complete evaluation of the benefits and disadvantages of working with the HLL OpenCL rather than HDLs and vendor-specific IDEs.

A further development opportunity would see generalisation of the oclspkt routine to handle non-diagonally dominant matrices. This would require implementing partial pivoting and then solving the reduced system fully during the SPIKE factorisation stage, instead of approximating it with the 'truncated' method.

Considering FPGAs and OpenCL more broadly, a future investigation could focus on the implementation of an algorithm that is compute-constrained instead of memory-constrained, as is the case with tridiagonal linear system solver algorithms. This would further test the limits of the FPGA, and give new insight into how FPGAs compare to traditional CPU and GPU devices when utilising optimised routines.

Bibliography

[1] Khronos OpenCL Working Group, "The OpenCL specification," tech. rep., 2009.

[2] M. Feldman, "New GPU-accelerated supercomputers change the balance of power on the TOP500." https://www.top500.org/news/new-gpu-accelerated-supercomputers-change-the-balance-of-power-on-the-top500/, June 2018. Accessed: 2018-11-10.

[3] M. Feldman, "NVIDIA breaks new ground with Turing GPU architecture." https://www.top500.org/news/nvidia-breaks-new-ground-with-turing-gpu-architecture/, Aug. 2018. Accessed: 2018-11-10.

[4] NVIDIA Corporation, "NVIDIA Turing GPU architecture," tech. rep., Sept. 2018.

[5] M. Feldman, "Summit up and running at Oak Ridge, claims first exascale application." https://www.top500.org/news/summit-up-and-running-at-oak-ridge-claims-first-exascale-application/, June 2018. Accessed: 2018-11-10.

[6] I. Kuon, R. Tessier, and J. Rose, FPGA Architecture. Boston, United States: Now Publishers, 2008.

[7] I. Grout, "Chapter 1 - Introduction to programmable logic," in Digital Systems Design with FPGAs and CPLDs (I. Grout, ed.), pp. 1–41, Burlington: Newnes, Jan. 2008.

[8] W. Zhang, V. Betz, and J. Rose, "Portable and scalable FPGA-based acceleration of a direct linear system solver," ACM Trans. Reconfigurable Technol. Syst., vol. 5, no. 1, pp. 1–26, 2012.


[9] S. Junqing, G. D. Peterson, and O. O. Storaasli, "High-performance mixed-precision linear solver for FPGAs," IEEE Trans. Comput., vol. 57, no. 12, pp. 1614–1623, 2008.

[10] Rapid Prototyping Projection Algorithms with FPGA Technology, Rapid System Prototyping, 2009. RSP '09. IEEE/IFIP International Symposium on, 2009.

[11] W. Guiming, X. Xianghui, D. Yong, and W. Miao, "High-performance architecture for the conjugate gradient solver on FPGAs," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 60, no. 11, pp. 791–795, 2013.

[12] Dedicated hardware implementation of a linear congruence solver in FPGA, Electronics, Circuits and Systems (ICECS), 2012 19th IEEE International Conference on, 2012.

[13] W. Guiming, D. Yong, S. Junqing, and G. D. Peterson, "A high performance and memory efficient LU decomposer on FPGAs," IEEE Trans. Comput., vol. 61, no. 3, pp. 366–378, 2012.

[14] Y. Shao, L. Jiang, Q. Zhao, and Y. Wang, "High performance and parallel model for LU decomposition on FPGAs," in 2009 Fourth International Conference on Frontier of Computer Science and Technology, pp. 75–79, Dec. 2009.

[15] Solving tri-diagonal linear systems using field programmable gate arrays, 2012.

[16] D. J. Warne, R. F. Hayward, N. A. Kelson, J. E. Banks, and L. Mejias, "Pulse-coupled neural network performance for real-time identification of vegetation during forced landing," in Engineering Mathematics and Applications Conference (EMAC2013), vol. 55 of ANZIAM J., pp. C1–C16, 2014.

[17] L. Zhuo and V. K. Prasanna, "High-performance and parameterized matrix factorization on FPGAs," in 2006 International Conference on Field Programmable Logic and Applications, pp. 1–6, Aug. 2006.

[18] G. Wu, Y. Dou, Y. Lei, J. Zhou, M. Wang, and J. Jiang, "A fine-grained pipelined implementation of the LINPACK benchmark on FPGAs," in 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines, pp. 183–190, Apr. 2009.

[19] H. Giefers, R. Polig, and C. Hagleitner, "Analyzing the energy-efficiency of dense linear algebra kernels by power-profiling a hybrid CPU/FPGA system," in 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, pp. 92–99, June 2014.

[20] J. Gonzalez and R. C. Núñez, "LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators," J. Phys. Conf. Ser., vol. 180, no. 1, p. 012042, 2009.

[21] A. Rafique, N. Kapre, and G. A. Constantinides, "A high throughput FPGA-based implementation of the Lanczos method for the symmetric extremal eigenvalue problem," in Reconfigurable Computing: Architectures, Tools and Applications, pp. 239–250, Springer Berlin Heidelberg, 2012.

[22] Altera, "Implementing FPGA design with the OpenCL standard," 2013.

[23] L. Wirbel, "Xilinx SDAccel, a unified development environment for tomorrow's data center," tech. rep., The Linley Group, Nov. 2014.

[24] From OpenCL to high-performance hardware on FPGAs, Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, 2012.

[25] D. J. Warne, N. A. Kelson, and R. F. Hayward, "Comparison of high level FPGA hardware design for solving tri-diagonal linear systems," Procedia Comput. Sci., vol. 29, pp. 95–101, 2014.

[26] C. M. Angerer, R. Polig, D. Zegarac, H. Giefers, C. Hagleitner, C. Bekas, and A. Curioni, "A fast, hybrid, power-efficient high-precision solver for large linear systems based on low-precision hardware," Sustainable Computing: Informatics and Systems, vol. 12, pp. 72–82, Dec. 2016.

[27] B. R. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, "Chapter 2 - Introduction to OpenCL," in Heterogeneous Computing with OpenCL (Second Edition) (B. R. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, eds.), pp. 15–38, Boston: Morgan Kaufmann, Jan. 2013.

[28] "CUDA | GeForce." https://www.geforce.com/hardware/technology/cuda. Accessed: 2018-11-4.

[29] The Khronos Group, "Conformant products." https://www.khronos.org/conformance/adopters/conformant-products/opencl, Nov. 2018. Accessed: 2018-11-9.

[30] NVIDIA Corporation, "OpenCL programming guide for the CUDA architecture," tech. rep., Aug. 2009.

[31] Intel Corporation, "Developer guide for Intel® SDK for OpenCL™ applications." https://software.intel.com/en-us/openclsdk-devguide-2017, Oct. 2018. Accessed: 2018-11-9.

[32] Intel Corporation, "Intel® FPGA SDK for OpenCL Pro Edition - Best Practices Guide," tech. rep., Dec. 2017.

[33] W. H. Press, Numerical recipes in FORTRAN: the art of scientific computing. Cambridge [England]; New York, NY: Cambridge University Press, 1992.

[34] P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland, "A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems," Parallel and Distributed Computing Practices, vol. 2, pp. 385–400, 1999.

[35] P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland, "A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems II," in Euro-Par'99 Parallel Processing, pp. 1078–1087, Springer Berlin Heidelberg, 1999.

[36] C. R. Dun, M. Hegland, and M. R. Osborne, "Parallel stable solution methods for tridiagonal linear systems of equations," in Computational Techniques and Applications Conference (CTAC95), (River Edge, NJ), pp. 267–274, World Sci. Publishing, 1996.

[37] M. Hegland, "On the parallel solution of tridiagonal systems by wrap-around partitioning and incomplete LU factorization," Numer. Math., vol. 59, pp. 453–472, Dec. 1991.

[38] FPGA implementation of cubic spline interpolation method for empirical mode decomposition, Signal Processing and Communications Applications Conference (SIU), 2012 20th, 2012.

[39] M. Kass, A. Lefohn, and J. D. Owens, "Interactive depth of field using simulated diffusion," tech. rep., 2006.

[40] S. Palmer and D. Thomas, "Accelerating implicit finite difference schemes using a hardware optimised implementation of the Thomas algorithm for FPGAs," 2014.

[41] M. Sayeed, V. Magi, and J. Abraham, "Enhancing the performance of a parallel solver for turbulent reacting flow simulations," Numerical Heat Transfer, Part B: Fundamentals, vol. 59, pp. 169–189, Mar. 2011.

[42] L. W. Chang and W. M. Hwu, "A guide for implementing tridiagonal solvers on GPUs," pp. 29–44, June 2014.

[43] H. S. Stone, "An efficient parallel algorithm for the solution of a tridiagonal linear system of equations," J. ACM, vol. 20, no. 1, pp. 27–38, 1973.

[44] Ö. Eğecioğlu, C. K. Koc, and A. J. Laub, "A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors," J. Comput. Appl. Math., vol. 27, pp. 95–108, Sept. 1989.

[45] R. W. Hockney, "A fast direct solution of Poisson's equation using Fourier analysis," J. ACM, vol. 12, no. 1, pp. 95–113, 1965.

[46] R. W. Hockney and C. R. Jesshope, Parallel Computers. 1981.

[47] An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU, Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011.

[48] Memory Hierarchy Optimization for Large Tridiagonal System Solvers on GPU, Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on, 2012.

[49] Solving tridiagonal systems on a GPU, High Performance Computing (HiPC), 2013 20th International Conference on, 2013.

[50] Y. Zhang, J. Cohen, and J. D. Owens, "Fast tridiagonal solvers on the GPU," SIGPLAN Not., vol. 45, no. 5, pp. 127–136, 2010.

[51] Y. Zhang, J. Cohen, A. A. Davidson, and J. D. Owens, A Hybrid Method for Solving Tridiagonal Systems on the GPU. Elsevier Inc., 2012.

[52] E. Polizzi and A. H. Sameh, "A parallel hybrid banded system solver: the SPIKE algorithm," Parallel Comput., vol. 32, no. 2, pp. 177–194, 2006.

[53] L.-W. Chang, J. A. Stratton, H.-S. Kim, and W.-M. W. Hwu, "A scalable, numerically stable, high-performance tridiagonal solver using GPUs," pp. 1–11, IEEE Computer Society Press, 2012.

[54] A Hierarchical Tridiagonal System Solver for Heterogeneous Supercomputers, 2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2014.

[55] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, "2. Iterative methods," in Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Other Titles in Applied Mathematics, pp. 5–37, Society for Industrial and Applied Mathematics, Jan. 1994.

[56] GPU-accelerated scalable solver for banded linear systems, Cluster Computing (CLUSTER), 2013 IEEE International Conference on, 2013.

[57] Y. Dou, Y. Lei, G. Wu, S. Guo, J. Zhou, and L. Shen, "FPGA accelerating double/quad-double high precision floating-point applications for ExaScale computing," pp. 325–336, ACM, 2010.

[58] D. Jensen and A. F. Rodrigues, "Embedded systems and exascale computing," Comput. Sci. Eng., vol. 12, no. 6, pp. 20–29, 2010.

[59] J. Shalf, S. Dosanjh, and J. Morrison, "Exascale computing technology challenges," in High Performance Computing for Computational Science – VECPAR 2010 (J. L. Palma, M. Daydé, O. Marques, and J. Lopes, eds.), vol. 6449 of Lecture Notes in Computer Science, ch. 1, pp. 1–25, Springer Berlin Heidelberg, Jan. 2011.

[60] S. Skalicky, S. Lopez, and M. Lukowiak, "Performance modeling of pipelined linear algebra architectures on FPGAs," Comput. Electr. Eng., vol. 40, no. 4, pp. 1015–1027, 2014.

[61] G. H. Golub, ed., Cyclic reduction - history and applications, Workshop on Scientific Computing, (New York), Springer Verlag, 1997.

[62] Altera, "Altera SDK for OpenCL: Best practices guide," tech. rep., Altera Inc., May 2015.

[63] A. H. Sameh and D. J. Kuck, "On stable parallel linear system solvers," J. ACM, vol. 25, pp. 81–91, Jan. 1978.

[64] E. Polizzi and A. Sameh, "SPIKE: A parallel environment for solving banded linear systems," Comput. Fluids, vol. 36, no. 1, pp. 113–120, 2007.

[65] Performance Models for the Spike Banded Linear System Solver, Parallel and Distributed Computing (ISPDC), 2010 Ninth International Symposium on, 2010.

[66] H. Gabb, "Intel® adaptive Spike-based solver," tech. rep., Oct. 2010.

[67] J. Mair, Z. Huang, D. Eyers, and Y. Chen, "Quantifying the energy efficiency challenges of achieving exascale computing," in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 943–950, May 2015.

[68] M. U. Ashraf, F. A. Eassa, A. A. Albeshri, and A. Algarni, "Performance and power efficient massive parallel computational model for HPC heterogeneous exascale systems," IEEE Access, vol. 6, pp. 23095–23107, 2018.

[69] H. Macintosh, D. Warne, N. A. Kelson, J. Banks, and T. W. Farrell, "Implementation of parallel tridiagonal solvers for a heterogeneous computing environment," The ANZIAM Journal, vol. 56, pp. C446–C462, 2016.

[70] C. C. K. Mikkelsen and M. Manguoglu, "Analysis of the truncated SPIKE algorithm," SIAM Journal on Matrix Analysis and Applications; Philadelphia, vol. 30, no. 4, p. 20, 2008.

[71] Intel Corporation, "Developer guide for Intel® SDK for OpenCL™ applications." https://software.intel.com/en-us/openclsdk-devguide-2017, Oct. 2018. Accessed: 2018-11-9.

[72] K. Mendiratta, "A banded SPIKE algorithm and solver for shared memory architectures," 2011.

[73] Intel Corporation, "?dtsvb." https://software.intel.com/en-us/mkl-developer-reference-c-dtsvb, Nov. 2018. Accessed: 2018-11-12.

[74] J. D. McCalpin, "STREAM: Sustainable memory bandwidth in high performance computers," tech. rep., University of Virginia, Charlottesville, Virginia.

[75] J. McCalpin, "Memory bandwidth and machine balance in high performance computers," IEEE Technical Committee on Computer Architecture Newsletter, pp. 19–25, Dec. 1995.