Solving Diagonally Dominant Tridiagonal Linear Systems with FPGAs in an Heterogeneous Computing Environment

by Hamish J. Macintosh, BEng (Electrical)
School of Electrical Engineering and Computer Science
Science and Engineering Faculty
Queensland University of Technology

A dissertation submitted in fulfilment of the requirements for the degree of Master of Philosophy (Engineering)

2019

Keywords: tridiagonal, system of linear equations, diagonally dominant, heterogeneous computing, FPGA, GPU, CPU, high performance computing, cyclic reduction, parallel cyclic reduction, SPIKE, truncated SPIKE, optimisation, power efficiency.

Statement of Original Authorship

In accordance with the requirements of the degree of Master of Philosophy (Engineering) in the School of Electrical Engineering and Computer Science, I present the following thesis entitled,

Solving Diagonally Dominant Tridiagonal Linear Systems with FPGAs in an Heterogeneous Computing Environment

This work was performed under the supervision of Dr Jasmine Banks, Dr Neil Kelson and Prof. Troy Farrell. I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at Queensland University of Technology or any other institution.

Hamish J. Macintosh

QUT Verified Signature

June 2019

Acknowledgements

I’d like to acknowledge the support and guidance provided by my supervisory team, Dr. Jasmine Banks, Dr. Neil Kelson and Professor Troy Farrell. Mr David Warne deserves a special mention: thank you for your encouragement and continual prodding to write this damned document. All of the specialised computing resources used in this thesis were generously provided by the Queensland University of Technology’s eResearch Office. Thank you! Finally, I’d like to thank my wife Laura, for always believing I’d finish this thesis . . . eventually.

Abstract

The primary motivation for this research is to determine the feasibility of targeting Field Programmable Gate Arrays (FPGAs) for use in accelerating general purpose scientific computing on High Performance Computing (HPC) platforms. FPGAs are typically hard to program as they require electronic engineers to write HDL circuit designs. However, the Open Computing Language (OpenCL) is a programming framework that is able to target a wide variety of computational accelerators and provides a way for software engineers to use high level languages (C and C++) to program FPGAs. As a use case for a common scientific task, this investigation focuses on the implementation and performance characteristics of diagonally dominant tridiagonal linear systems solvers. These implementations are targeted for FPGAs as well as the more traditional compute devices, Graphical Processing Units (GPUs) and Central Processing Units (CPUs), using OpenCL to provide a common programming paradigm. The results of this investigation are presented in two research papers which make up the body of this thesis.

The first paper, Implementation of parallel tridiagonal solvers for a heterogeneous computing environment, explores the design of truly portable implementations of the Parallel Cyclic Reduction (PCR) and SPIKE algorithms with OpenCL to solve diagonally dominant tridiagonal linear systems. The implementations target both GPU and FPGA hardware and are evaluated in terms of solver performance, resource efficiency and numerical accuracy for fixed sized small matrices. The proposed GPU implementations of PCR and SPIKE developed in this work outperform similar routines provided by NVIDIA and presented in other research papers. The FPGA PCR and SPIKE compute performance was estimated using simulation and timing results. The simulations for the FPGA are promising and estimate similar performance to the GPU PCR and SPIKE implementations.

While the work presented in the first paper shows it is feasible to create fully portable OpenCL kernels to target the GPU and FPGA, in doing so the code is optimised for neither device. The second paper, Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to target FPGAs, GPUs and CPUs, addresses this and furthers the work described in the first paper by presenting the 'oclspkt' routine. This routine is an heterogeneous OpenCL implementation of the truncated-SPIKE algorithm that can use CPUs, GPUs and FPGAs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. It is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator depending on the accelerator's compute performance. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150% and 170% faster than comparable device-optimised third-party solvers. With regards to heterogeneous combinations of compute devices, the GPU-FPGA combination has the best compute performance and the FPGA-only configuration has the best overall energy efficiency.

The results for compute performance in the second paper are shown to be memory bandwidth constrained. While the GPU kernel compute performance is several times faster than the FPGA kernel, the FPGA test hardware has several times less available memory bandwidth. With new generations of FPGA accelerator boards being equipped with high-bandwidth data transfer technology, the performance gap between devices is expected to close. This, coupled with the fractional power requirements of the FPGA, makes it an attractive concept for future HPC requirements.

List of Publications

The following papers are included in this thesis as chapters.

Chapter 3: Macintosh, Hamish, Warne, David, Kelson, Neil A., Banks, Jasmine, & Farrell, Troy W. (2016) Implementation of parallel tridiagonal solvers for a heterogeneous computing environment. The ANZIAM Journal, 56, C446-C462

Chapter 4: Macintosh, Hamish, Banks, Jasmine, & Kelson, Neil A. Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to target FPGAs, GPUs, and CPUs, submitted to the International Journal of Reconfigurable Computing.

vii Contents

1 Introduction ...... 1
1.1 Research Problem ...... 2
1.2 Methodology ...... 2
1.3 Thesis Outline ...... 3

2 Literature Review ...... 5
2.1 Field Programmable Gate Arrays ...... 6
2.2 FPGAs and Accelerating Linear Algebra Routines ...... 7
2.2.1 HDL Implementations ...... 8
2.2.2 HLL Implementations ...... 10
2.3 OpenCL: The Open Computing Language ...... 11
2.4 Heterogeneous Computing with OpenCL ...... 13
2.5 Diagonally Dominant Tridiagonal Linear Systems ...... 15
2.6 Algorithms to Solve Tridiagonal Linear Systems ...... 17
2.6.1 Thomas Algorithm (TDMA) ...... 17
2.6.2 Recursive Doubling ...... 17
2.6.3 Cyclic Reduction and Parallel Cyclic Reduction ...... 18
2.6.4 SPIKE Algorithm ...... 19
2.6.5 Conjugate Gradient Algorithm ...... 21
2.7 Heterogeneous Linear System Solvers ...... 21
2.8 Summary Analysis and Relevance to this Work ...... 22

3 Investigation 1: Portable Solvers of Small Tridiagonal Systems ...... 25
3.1 Introduction ...... 28
3.2 Background ...... 29
3.2.1 Tridiagonal linear systems ...... 29
3.2.2 Heterogeneous computing with OpenCL ...... 30
3.3 Parallel tridiagonal linear systems solvers ...... 30


3.3.1 Parallel cyclic reduction ...... 31
3.3.2 SPIKE ...... 31
3.3.3 Implementation using OpenCL ...... 32
3.4 Evaluation ...... 32
3.4.1 FPGA resource utilisation ...... 33
3.4.2 Compute performance ...... 33
3.4.3 Numerical accuracy ...... 35
3.4.4 Power utilisation ...... 35
3.5 Conclusion ...... 36

4 Investigation 2: An Heterogeneous Solver for Tridiagonal Systems ...... 37
4.1 Introduction ...... 41
4.2 Background ...... 43
4.2.1 Tridiagonal linear systems ...... 43
4.2.2 The SPIKE Algorithm ...... 44
4.3 Implementation ...... 46
4.3.1 FPGA Implementation ...... 48
4.3.2 Porting the truncated-Spike Kernel to CPU and GPU ...... 53
4.3.3 The Heterogeneous Solver ...... 56
4.4 Evaluation ...... 57
4.4.1 Compute Performance ...... 58
4.4.2 Numerical Accuracy ...... 60
4.4.3 Energy Consumption and Efficiency ...... 60
4.5 Conclusion ...... 62
4.6 Data Availability ...... 64
4.7 Conflicts of Interest ...... 64
4.8 Acknowledgements ...... 64

5 Conclusion ...... 65
5.1 Summary of Thesis Objectives ...... 65
5.2 Summary of Contributions ...... 68
5.3 Recommendations for Further work ...... 69

Bibliography ...... 70

List of Figures

1.1 High level overview of research methodology ...... 3
1.2 Iterative investigation process methodology ...... 4
2.1 Basic FPGA architecture ...... 6
2.2 OpenCL Memory and Programming model. Adapted from [1] ...... 12
2.3 OpenCL Platform model. Adapted from [1] ...... 13
3.1 Compute throughput of parallel cyclic reduction implementations versus a single threaded Thomas algorithm ...... 34
3.2 Compute throughput of spike implementations versus a single threaded Thomas algorithm ...... 35
4.1 An overview of the anatomy and execution flow of the oclspkt solver ...... 47
4.2 The FPGA TDMA OpenCL kernel tdma, with the execution path and data dependencies shown. The tdma executes as a single Work Item kernel ...... 49
4.3 The FPGA truncated–SPIKE OpenCL kernel spktrunc, with the execution path and data dependencies shown. The spktrunc executes as a single Work Item kernel ...... 50
4.4 The CPU truncated–SPIKE OpenCL kernels spkfact and spkrec, with the execution path and data dependencies shown. Both kernels are executed as an NDRange of Work Items ...... 55
4.5 The GPU truncated–SPIKE OpenCL kernels spkfact, spkrecul and spkreclu, with the execution path and data dependencies shown. All kernels are executed as an NDRange of Work Items ...... 56
4.6 Time (ms) to solve system of size N = 256 × 10^6 using oclspkt, targeting CPU, GPU and FPGA devices ...... 58
4.7 Comparing time (ms) to solve system of size N = 256 × 10^6 using oclspkt, dgtsv, sdtsvb and TDMA ...... 58


4.8 Performance comparison in rows solved per second for N = (256 ... 1280) × 10^6 when targeting CPU, GPU, FPGA and heterogeneous combinations of devices ...... 61
4.9 Numerical accuracy of oclspkt for varying diagonal dominance compared to CPU TDMA solver ...... 62
4.10 Joules required to solve tridiagonal system of N = 256 × 10^6 per device using the oclspkt routine ...... 62
4.11 Energy efficiency comparison in rows solved per second for N = (256 ... 1280) × 10^6 for oclspkt when targeting devices CPU, GPU, FPGA and heterogeneous combinations ...... 63

List of Terms

ASIC Application Specific Integrated Circuit. 6, 7
ATLAS Auto Tuned Linear Algebra Subprograms. 9, 11, 22
AVX-256 256 bit Advanced Vector Extension. 14

bitstream A binary file that contains programming information for an FPGA. 6
BRAM Block RAM. 15, 22

CG Conjugate Gradient. 8, 11, 21, 22, 24
Compute Unit An OpenCL term for a group of Processing Elements. 11, 12, 14, 25
CPU Central Processing Unit. v, 1–5, 8–12, 14–17, 21–24, 66, 68, 69
CR Cyclic Reduction. 18, 24
CUBLAS CUDA Basic Linear Algebra Subprograms. 10
CUDA A proprietary parallel programming paradigm invented by NVIDIA, intended to be used with NVIDIA GPUs. 11, 68

DMA Direct Memory Access. 15

FP16 16 bit Floating Point. 1, 9, 10
FP32 32 bit Floating Point. 9, 10, 14, 23, 38
FP64 64 bit Floating Point. 8, 9, 11
FPGA Field Programmable Gate Array. v, 1–12, 14–17, 21–24, 65–69

GEMM GEneral Matrix Multiply. 9


global memory An OpenCL memory object, high capacity off-chip memory with read/write access from all Compute Units and the host. 14, 15
GPU Graphical Processing Unit. v, 1–5, 10–12, 14–19, 21–24, 66, 68

HDL Hardware Description Language. 2, 5, 7, 8, 66–69
HLL High Level Language. 2, 5, 7, 10, 65–69
HPC High Performance Computing. v, 1, 2, 65

IDE Integrated Development Environment. 68, 69
Intel MKL Intel Math Kernel Library. 8–10, 22

LAPACK Linear Algebra PACKage. 9
local memory An OpenCL memory object, medium capacity fast on-chip memory with read/write access from all Processing Elements in a Compute Unit. 14, 15, 18
LSU Load Store Units. 15, 67

OpenCL Open Computing Language. A software framework designed to execute a program in an heterogeneous computing environment. 1–5, 7, 10–15, 23, 66–69

PCR Parallel Cyclic Reduction. 17, 18, 24, 25, 68
private memory An OpenCL memory object, small capacity fast on-chip memory registers with read/write access only from a single Processing Element. 12, 15
Processing Element An OpenCL term for a computing resource that can execute a Work Item. 11, 12, 14, 26

RD Recursive Doubling. 17, 18
RHS Right Hand Side. 22, 38

SDK Software Development Kit. 10, 11
SIMD Single Instruction Multiple Data. 14, 26
SM Streaming Multiprocessor. 11, 14

TDMA Tridiagonal Matrix Algorithm. 11, 16, 17, 19, 24, 38

VHDL Very high-speed integrated circuit Hardware Description Language. 7, 10

Work Group An OpenCL term for a group of Work Items. 11, 13, 14
Work Item An OpenCL term for a thread of code. 11, 14, 15

Chapter 1

Introduction

The primary motivation for this research is to determine the feasibility of targeting Field Programmable Gate Arrays (FPGAs) for use in accelerating general purpose scientific computing on HPC platforms. HPC platforms are becoming more reliant on specialist co-processors, resulting in a heterogeneous computing environment. Traditionally, multi-core Central Processing Units (CPUs) have provided the majority of the computing power in an HPC cluster, and they have only recently been surpassed, in terms of FLOPS, by Graphical Processing Units (GPUs) [2]. Prior to this, FPGAs were primarily used in embedded devices or for hardware prototyping of application specific integrated circuits. Recently the floating point performance of FPGAs has markedly improved, and with the rise of compute devices that feature reduced precision data types like 16 bit Floating Point (FP16) [3, 4], FPGAs have become much more relevant to the HPC arena, particularly as we approach exascale machines [5].

The varied programming models needed to target specific devices in a heterogeneous computing environment make developing a general portable solution difficult. OpenCL is an open programming framework for heterogeneous computing [1]. Implementations of this framework have been developed by vendors of a wide range of computing devices including CPUs, GPUs and, in more recent years, FPGAs. In a perfect world this would enable a developer to build a fully portable computational routine which will execute on any of the available resources.

This investigation focuses on the implementation and performance characteristics of diagonally dominant tridiagonal linear systems solvers as a use-case for a common scientific computing task. We target these implementations for FPGAs as well as the more traditional compute devices, the GPUs and CPUs. The remainder of this chapter will explicitly state our research questions, discuss the aims and objectives of this work and present the structure of the remaining chapters.

1.1 Research Problem

The aim of this research is to determine the feasibility of targeting FPGAs for use in accelerating general purpose scientific computing on HPC platforms, and how they compare to more traditional CPU and GPU accelerators in terms of compute performance and utilisation. To achieve this, portable linear systems solvers will be developed that are suitable for deployment on the aforementioned accelerators. These solvers will be tested for numerical accuracy, computational throughput, power efficiency and resource utilisation to allow comparisons to be made between the different hardware. The key questions this program of research aims to answer are:

1. Is it feasible to integrate non-traditional compute devices like FPGAs with GPUs and/or CPUs for general purpose scientific computing tasks?

2. Can portable diagonally dominant tridiagonal linear systems solvers be designed that could be deployed to any number of CPUs, GPUs or FPGAs present on a host system?

3. Can optimised portable code be created to target multiple devices with OpenCL?

4. Is the same compute performance achievable from High Level Languages (HLLs) as from Hardware Description Languages (HDLs) when targeting FPGAs?

5. Is developing optimised FPGA kernels easier with HLLs than with HDLs?

1.2 Methodology

The methodology used to investigate these research problems was iterative, as illustrated in Figure 1.1. Initially a literature review was conducted (chapter 2) to determine the methods commonly used in the past to accelerate the solving of diagonally dominant tridiagonal linear systems. This included the algorithms and programming paradigms used, the relative performance of solvers, the computational hardware targeted, and strategies for deploying heterogeneous solvers. Next, we conducted two investigations: first, an investigation into solving many small systems with GPUs and FPGAs; and second, an investigation into solving a single large system with CPUs, GPUs and FPGAs.

Figure 1.2 illustrates the iterative cycle followed in pursuing these investigations. For each specific use-case application we designed the algorithm using a high level scripting language before porting the code to OpenCL C and testing it on each target device for functional correctness. Following this, the OpenCL kernels were profiled and benchmarking unit tests were run to determine the compute and energy performance as well as the numerical accuracy. The cycle was repeated as necessary to achieve the desired results, and from each investigation a research output was produced; these are presented in chapter 3 and chapter 4.

Figure 1.1: High level overview of research methodology.

1.3 Thesis Outline

The remainder of this thesis is organised as follows. A review of relevant literature is presented in chapter 2: Literature Review, which provides pertinent background information on tridiagonal systems, different numerical methods used to solve tridiagonal linear systems, and how these have been implemented on different computation devices.

Figure 1.2: Iterative investigation process methodology.

In chapter 3: Solving Small TD Systems, an investigation into implementing truly portable OpenCL kernels of the Parallel Cyclic Reduction and Recursive SPIKE solver algorithms for small diagonally dominant tridiagonal systems is presented in the form of a published paper. The algorithms are implemented on GPU and FPGA hardware and we evaluate these designs in the context of solver performance, resource efficiency and numerical accuracy.

Chapter 4: Solving Large TD Systems then builds on the work of chapter 3. In this chapter a submitted paper on the implementation of an optimised truncated-SPIKE algorithm for tridiagonal systems of any size is presented. This solver, called oclspkt, targets CPU, GPU and FPGA hardware and also provides a unified heterogeneous solver utilising two or more of the available CPU, GPU and FPGA devices.

Finally, chapter 5: Conclusion concludes the thesis with a discussion of the research problem and research papers presented in the context of the research questions that were proposed in section 1.1.

Chapter 2

Literature Review

This chapter provides the background information required to give context to this thesis and a review of literature relevant to solving diagonally dominant tridiagonal linear systems using FPGAs in a heterogeneous computing environment. This chapter is structured as follows.

Firstly, an overview of the FPGA device, its history and architecture is presented. This section is followed by a discussion of how FPGAs have been used to accelerate linear algebra routines. As there are relatively few tridiagonal linear system solver implementations targeting FPGAs, literature concerning how FPGAs have been used in general linear algebra routines is presented, giving an insight into how they have previously been utilised for fundamental scientific computations. The section is divided into literature concerning HDL implementations and HLL implementations.

Following this, the OpenCL framework is introduced, including how OpenCL can be used to target different accelerator devices in a heterogeneous computing environment. The section briefly covers the complexities encountered when writing optimised code targeting CPUs, GPUs and FPGAs respectively.

A definition of a diagonally dominant tridiagonal linear system is developed along with a discussion of its significance in scientific computation. This is followed by a sub-section that examines common serial and parallel algorithms used to solve tridiagonal linear systems and how these algorithms have been implemented previously. Of particular interest are the compute device targeted, the programming language and the technology used to implement the solver. Finally, a review of past implementations of heterogeneous diagonally dominant tridiagonal linear systems solvers is presented and the key research gaps identified by this literature review are highlighted.

2.1 Field Programmable Gate Arrays

FPGAs are monolithic integrated circuits that can be reconfigured in real time as application specific digital circuits. As illustrated in Figure 2.1, at the simplest level the FPGA architecture comprises an array of different logic blocks including general logic, memory and multiply-accumulators. A matrix of programmable interconnects connects these logic blocks, which in turn are surrounded by programmable I/O cells [6, 7]. When the FPGA is powered on, a pre-compiled bitstream is used to route programmable logic blocks to I/O pins via the matrix of programmable interconnects.

Figure 2.1: Basic FPGA architecture

An FPGA may cost anywhere from less than one hundred to many thousands of dollars to purchase and can be continually reconfigured to suit the application. This is markedly different to Application Specific Integrated Circuits (ASICs), where the research and development costs can run into the millions before a single chip is produced, requiring large scale production to reduce the unit cost of this single-application component. Another advantage of the FPGA is that if there is an error in the FPGA program functionality, an update can be designed and deployed after initial shipping of the product; this is not possible with an ASIC. However, despite this flexibility, the FPGA loses out in performance and is 20 to 30 times slower than the ASIC [6].

Traditionally, to program an FPGA, specialised engineers design the circuit using HDLs like Very high-speed integrated circuit Hardware Description Language (VHDL) or Verilog. HDLs require hardware and electronic engineering expertise to work with the complexities of register level logic, gate timing and configurable clocks. This has generally precluded software engineers from targeting FPGA hardware, until the recent advent of HLL workflows and FPGA software development kits for OpenCL.

2.2 FPGAs and Accelerating Linear Algebra Routines

Historically, FPGAs have been used only sporadically as computational accelerators by the wider scientific community, despite the application of FPGAs to solve computational problems being an active area of research [8]. The FPGA's adoption as an accelerator has been limited by the poor portability of HDL code written for specific FPGA hardware, which prevents application specific computational routines, like linear algebra functions, from being repurposed to suit whatever FPGA hardware is available. This has been compounded by the limited availability of middleware tools such as drivers.

Linear algebra routines make up the fundamental computational components of a wide variety of scientific computing applications. These routines are characterised by intensive floating point computational loads and a high communication-to-computation ratio [9]. The re-configurable nature of FPGAs allows for algorithm dependent hardware optimisations, including multiple discrete caches and data buffers between computational stages and asynchronous execution of tasks, which should make the FPGA ideal for managing these factors. However, FPGA implementations of linear algebra routines are, as noted, generally device specific, which makes universally accessible FPGA hardware libraries difficult to design. The remainder of this section presents a review of the literature to date involving HDL and HLL implementations of linear algebra routines on FPGA hardware.

2.2.1 HDL Implementations

Iterative solvers for systems of linear equations have been a popular choice for implementation on FPGA hardware [10, 9, 11], since the algorithms mostly consist of vector-matrix multiplications that can be carried out concurrently. Cole et al. [10] explore rapid prototyping of a conjugate gradient algorithm implementation on an FPGA, and compare the results with a CPU implementation. In this instance the FPGA implementation performs poorly when compared to the implementation on the CPU. However, the authors make the point that the FPGA hardware targeted had significant limitations in terms of clock speeds and memory bandwidth when compared like-for-like with the CPU used in the comparison, and that targeting a 'non-development' FPGA board would therefore likely show improved results. This conclusion is upheld by Guiming et al. [11], who furthered this work by implementing a Conjugate Gradient (CG) solver on an FPGA for a matrix of arbitrary size and demonstrated a 4.62–9.24× speed increase over the same MATLAB CG solver running on a CPU. The authors do note, however, that the recorded speedup was achieved while allowing the error tolerance of the FPGA to be in the range of $10^{-4}$, whereas the CPU MATLAB routine, which by default uses 64 bit Floating Point (FP64), will generally have an error tolerance in the order of $10^{-16}$. Increasing the precision of the FPGA implementation would likely reduce the observed speed gains.

Direct solvers for systems of linear equations using FPGAs have also been an active area of research [12, 13, 14, 15, 16, 17, 18]. These algorithms differ from iterative solvers since they generally require more division operations and have inter-loop dependences that often serialise the execution. They do, however, benefit from requiring fewer memory transactions than iterative methods, making direct solver algorithms better suited to solving larger linear systems. In a notable case, Zhang et al. [8] developed a software-based HDL generator to automate the design of a direct linear system solver based on the FPGA hardware resources available. The authors recognised a need for a portable and scalable design. They implemented an LU factorisation solver and showed how the generator implements the design across multiple FPGAs. The authors tested their designs with matrices of up to 40,000 elements, the upper limit of the input matrix being bound only by the off-chip RAM available on the accelerator card. Results were compared on a like-for-like basis with a CPU using the same data type (FP64) and an optimised Intel Math Kernel Library (Intel MKL) LU solver. A 10% increase in computing performance and 3.6× better energy efficiency were demonstrated using the FPGA.

In contrast, in a consideration of FPGA power efficiency, Giefers et al. [19] implemented a general matrix multiply circuit on an FPGA using FP64 and found that the FPGA implementation was only more energy efficient when considering the compute devices in isolation. When considering the full computer system running the FPGA, the FPGA GEneral Matrix Multiply (GEMM) was not as energy efficient as a multithreaded CPU software implementation.

Combining the direct and iterative methods, Junqing et al. [9] designed a mixed precision dense linear system solver. The authors used a heterogeneous reduced precision direct solver on an FPGA and CPU as a pre-conditioner to iterative refinement on a CPU. The direct solver uses low precision data types to perform LU decomposition, with pivoting done on an FPGA. The resulting LU triangular matrices are then solved on the CPU using FP64. This FPGA LU decomposer and CPU solver acts as a pre-conditioner for a subsequent iterative refinement algorithm implemented on the CPU. The iterative refinement algorithm improves the numerical accuracy of the solution. The authors tested FP16, 32 bit Floating Point (FP32) and FP64 implementations of the FPGA LU decomposer against an Auto Tuned Linear Algebra Subprograms (ATLAS) CPU only solver. Their experiments showed that the FP16 FPGA LU decomposer / FP64 CPU hybrid provides a 2× to 3× performance boost over the pure CPU solver. It should be noted that in this testing, matrices of 128 to 4096 elements were used. This work appears to be the first truly heterogeneous implementation of a mixed precision linear algebra routine using FPGAs and CPUs.

Efforts have also been made to provide a hardware library for linear algebra routines. Gonzalez et al. presented LAPACKrc (LAPACK-reconfigurable) [20], a linear algebra library of FPGA kernels consisting of dense, sparse and least squares solver routines. The performance across all implemented routines is reported to be 40–150× that of similar Intel MKL optimised Linear Algebra PACKage (LAPACK) routines. As in [11] and [10], the performance characteristics presented by Gonzalez et al. [20] are compared against double precision FP64 Intel MKL optimised routines. For example, the authors test 16–64 bit (number representation not specified) implementations of QR factorisation against the FP64 Intel MKL QR factorisation routine. The 64 bit FPGA implementation performance is on par with Intel MKL QR; however, the significant speed increases come with a decrease in precision in the FPGA implementation.

In contrast, Rafique et al. [21] provide close to a like-for-like performance comparison of the Lanczos method for symmetric eigenvalue computation for small matrices. They compare single precision FP16 implementations of Intel MKL (CPU) and CUDA Basic Linear Algebra Subprograms (CUBLAS) (GPU) to their VHDL implementation on an FPGA. They report a mean speed up of 13.4× and 52.8× over the CPU and GPU respectively for solving a single eigenvalue problem, and a mean speed up of 103× and 408× when solving multiple eigenvalue problems with the same input matrix.
The authors discussed how the substantial performance increases could be attributed to the small problem sizes (matrices of up to 335×335) not efficiently fitting the CPU and GPU architectures, whereas the FPGA implementation was custom designed for the problem sets considered. Further, in the tests solving multiple eigenvalue problems the authors did not use multithreading on the CPU and GPU benchmarks. Instead, they iterated through the independent systems sequentially. This experimental condition ensures that the inefficiencies present in solving a single eigenvalue problem of this size on the CPUs and GPUs are not mitigated by the increased workload when solving multiple eigenvalue problems, and therefore skews the comparison in favour of the FPGA implementation.

2.2.2 HLL Implementations

In recent years both major FPGA vendors, Intel FPGA (formerly Altera) and Xilinx, have released OpenCL (section 2.3) software development kits and runtime software for specific PCIe attached FPGA expansion cards [22, 23]. The OpenCL Software Development Kits (SDKs) allow software engineers to write C-like code to be compiled into a bitstream that can configure an FPGA. FPGA vendors take advantage of the reconfigurable nature of the hardware by implementing a static top module to handle FPGA to host communication, data transfer, and OpenCL kernel execution scheduling. The remaining logic fabric is free for runtime reconfiguration with precompiled OpenCL-kernel bitstreams.

Prior work has been done implementing linear algebra routines using HLLs on FPGA hardware. Czajkowski et al. [24] presented a general FP32 implementation and demonstrated good performance, but the input matrix sizes are limited by the amount of onboard RAM. Warne et al. [25] compared implementations of the Tridiagonal Matrix Algorithm (TDMA) using the Altera (now Intel) OpenCL SDK and Xilinx HLS. This approach provides a modicum of portability when compared to the VHDL implementation in [15].

Much like Junqing et al. [9], Angerer et al. [26] implemented an FPGA+GPU heterogeneous CG solver. The authors implemented the backward phase using 8 and 16 bit fixed point on the FPGA using OpenCL. The accuracy of the solution is then improved by computing the forward phase using FP64. Testing with very large dense matrices of up to 24064×24064 elements, the authors reported a 3.7× computation speed up and a 3.5× reduction in 'energy-to-solution' when compared to a multithreaded ATLAS CPU implementation. Interestingly, the authors comment that when tested in isolation the FPGA and GPU accelerators do not perform as well as the CPU.

2.3 OpenCL: The Open Computing Language

OpenCL is a programming framework that allows developers to target a wide range of compute devices using C/C++ APIs. The specification for OpenCL defines the framework in terms of four models: platform, execution, memory and programming [1, 27].

The platform model defines the hardware of the host environment, allowing a programmer to list and allocate available computational devices. The execution model then uses the hardware described in the platform model to configure the OpenCL environment and define how kernels are deployed to the devices. The memory model defines the abstracted memory hierarchy of the device, whilst the programming model describes how parallel computation is mapped to the device hardware.

For the programming model, a device's compute resources are divided up at the smallest level into Processing Elements, and depending on the device architecture one or more Processing Elements are grouped into one or many Compute Units [27]. For example, an NVIDIA GPU built with the Maxwell architecture has a Streaming Multiprocessor (SM) with 128 CUDA cores [28]; the SM and the CUDA core map directly to the OpenCL Compute Unit and Processing Element respectively. Similarly, the threads of device kernel code are called Work Items and are grouped in Work Groups. Work Items and Work Groups are mapped to the Processing Element and Compute Unit hardware respectively.

Figure 2.2: OpenCL Memory and Programming model. Adapted from [1].

The memory model abstracts the types of memory a device has available. These are defined by OpenCL as global, local and private memory. Global memory is generally high-capacity off-chip memory that can be accessed by all Processing Elements across the device; the GPU's GDDR5 RAM and the CPU's and FPGA's DDR4 RAM banks are examples of global memory. Local memory is on-chip, has higher bandwidth and lower capacity than global memory, and is only accessible to Processing Elements of the same Compute Unit. GPU local memory is the L2 cache, the L1/2/3 cache for a CPU and BRAM banks on an FPGA. Finally, private memory refers to on-chip register memory space and is only accessible within a particular Processing Element. The programming and memory models are shown in Figure 2.2.

Unlike the programming and memory models, the platform (Figure 2.3) and execution models are set up on the host side of the OpenCL application. The platform model queries the available devices on the host machine and their properties. With this information the execution model defines the following (a minimal host-side sketch is given after the list):

• the OpenCL context object, created with one or more devices from the platform model;

Figure 2.3: OpenCL Platform model. Adapted from [1].

• the program object, read from a pre-compiled OpenCL binary or read and compiled in real time from source code;

• one or more kernel objects defined in the OpenCL program;

• one or more command queues for orchestrating read and write data operations and kernel executions with queues executing in order or out of order;

• event objects for synchronisation of OpenCL commands and profiling operations;

• memory buffers on the host and compute devices and schedules for read and write operations; and

• mapping kernel arguments and executing OpenCL kernels on devices by way of ‘enqueuing’ one to three dimensional arrays of Work Groups.
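To make these host-side responsibilities concrete, the sketch below is a minimal single-device OpenCL host program in C. It is illustrative only: the kernel vec_scale and its source string are hypothetical placeholders, only the first platform and device found are used, and error checking is omitted for brevity.

#include <CL/cl.h>
#include <stdio.h>

/* Hypothetical kernel: scales a vector by a constant. */
static const char *src =
    "__kernel void vec_scale(__global float *x, const float s) {"
    "    x[get_global_id(0)] *= s;"
    "}";

int main(void)
{
    float data[1024];
    for (int i = 0; i < 1024; ++i) data[i] = (float)i;

    /* Platform model: discover a platform and a device. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    /* Execution model: context, program, kernel and command queue. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel krn = clCreateKernel(prog, "vec_scale", NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Memory model: a device-side global memory buffer. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(data), NULL, NULL);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    /* Map kernel arguments and enqueue a 1D NDRange of Work Items. */
    float s = 2.0f;
    clSetKernelArg(krn, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(krn, 1, sizeof(float), &s);
    size_t global = 1024, local = 64;
    clEnqueueNDRangeKernel(queue, krn, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("x[10] = %f\n", data[10]);

    clReleaseMemObject(buf); clReleaseKernel(krn); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}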

2.4 Heterogeneous Computing with OpenCL

A wide variety of vendors support the OpenCL specification for their products [29]. This work focuses on heterogeneous computing utilising a subset of these products: OpenCL compliant CPUs, GPUs and FPGAs from Intel, NVIDIA, and Intel FPGA respectively. Since OpenCL was originally designed as a cross vendor GPU programming paradigm [27], the programming and memory models map well to the conventional GPU Single Instruction Multiple Data (SIMD) vector processing architecture. However, as different devices with wildly different hardware architectures are added to the conformance list, vendors have implemented the OpenCL execution, programming and memory models slightly differently depending on the device being targeted. There are numerous requirements and factors to consider when targeting CPUs, GPUs, and FPGAs with OpenCL.

NVIDIA GPU SMs schedule CUDA core executions in groups of 'warps', generally groups of 32 CUDA cores executing FP32 SIMD operations [30]. It is therefore no surprise that the optimum number of Work Items per Work Group is a multiple of 32. For maximum memory bandwidth, global memory transactions must be single strided between adjacent Work Items; this ensures fully coalesced memory reads and writes with no wasted bandwidth. Additionally, caching using the GPU's L2 shared memory or local memory must be explicitly coded for by the programmer if random memory access is required for Work Items in a Work Group.

For Intel CPUs, if the kernel code does not include Work Item dependent branch divergence and has single strided memory access for Work Items in a Work Group, the Intel OpenCL compiler will try to extract the maximum amount of vectorisation possible. For example, Intel CPUs with the 256 bit Advanced Vector Extension (AVX-256) instruction set can potentially run 8 SIMD FP32 operations per core; hence, highly vectorisable code in this instance will have 8 Work Items per Work Group. Unlike the NVIDIA GPUs, Intel CPUs employ sophisticated hardware caching, meaning explicitly moving global memory to local memory can actually hinder performance [31].

Unlike the CPUs and GPUs, it is possible to configure the number of Processing Elements and Compute Units in any number of combinations for the FPGA. However, where possible it is optimal to code FPGA OpenCL kernels to run as a single Work Item [32]. That is, the kernel code will reconfigure the FPGA as a single monolithic Processing Element inside a single Compute Unit to execute one Work Item for one Work Group. This essentially serialises most parallel algorithms by iterating through threads instead of executing them concurrently on different Processing Elements. Concurrency and performance are achieved by exploiting the FPGA's fine grained pipelined parallelism inside the thread. Further, implementing FPGA kernels as single Work Item kernels reduces the amount of resources needed for thread synchronisation and global memory transactions. The Intel FPGA OpenCL compiler implements Load Store Units (LSUs) as required to interface kernels with global memory, and the programmer can explicitly create Block RAM (BRAM) memory caches with local memory as required; alternatively, private memory arrays of a large enough size will automatically be implemented on BRAM banks.

On the host side the execution model has several key differences depending on the device being targeted. For example, both the GPU and FPGA require device side memory buffers, whereas the CPU can use the in-place host memory for computation. Further, for fast global memory transactions Intel CPUs require memory spaces aligned to 4096 byte boundaries. To enable Direct Memory Access (DMA) across PCIe for data transfer to the FPGA, host side memory spaces need 64 byte aligned boundaries. For the GPU to use DMA, host side memory has to be 'pinned' and mapped to the device before transfer.

The different OpenCL implementation considerations described in this section give only a brief overview of the complexities to consider when trying to develop portable heterogeneous code. The OpenCL framework provides a unified language to target multiple devices in a heterogeneous computing environment, but the nature of the underlying architectures dictates that a truly portable kernel code is difficult to design.
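The contrast between these styles can be illustrated with two hypothetical OpenCL C kernels performing the same element-wise update. The first is written as an NDRange kernel with one Work Item per element, matching the CPU and GPU mapping described above; the second is written as a single Work Item loop, which the FPGA compiler pipelines. The kernel names are illustrative, and the max_global_work_dim(0) attribute is an Intel FPGA SDK extension.

/* NDRange style: one Work Item per element. Adjacent Work Items access
 * adjacent elements of x and y, giving coalesced global memory traffic. */
__kernel void axpy_ndrange(__global const float *x,
                           __global float *y,
                           const float alpha)
{
    size_t i = get_global_id(0);
    y[i] = alpha * x[i] + y[i];
}

/* Single Work Item style: the whole loop executes in one Work Item and
 * the FPGA compiler pipelines successive iterations. */
__attribute__((max_global_work_dim(0)))
__kernel void axpy_single_wi(__global const float * restrict x,
                             __global float * restrict y,
                             const float alpha,
                             const unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}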

2.5 Diagonally Dominant Tridiagonal Linear Systems

Computing the solution to a tridiagonal linear system is a common scientific computing task. As such, it is an ideal use case to investigate the feasibility of heterogeneous FPGA, GPU and CPU computing. A tridiagonal linear system can be defined as one whose coefficient matrix has a bandwidth of β = 1 in the linear system A · x = y, as illustrated in

Equation 2.1.

\[
A =
\begin{bmatrix}
a_{1,1} & a_{1,2} & & & \\
a_{2,1} & a_{2,2} & a_{2,3} & & \\
& \ddots & \ddots & \ddots & \\
& & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n} \\
& & & a_{n,n-1} & a_{n,n}
\end{bmatrix}
\qquad (2.1)
\]

\[
d = \min_i \frac{|A_{i,i}|}{\displaystyle\sum_{j \neq i} |A_{i,j}|}
\qquad (2.2)
\]

Tridiagonal linear systems are considered diagonally dominant when d > 1 in Equation 2.2. Solving the system refers to finding the solution vector x from the matrix A and a set of known right-hand-side vectors y, as seen in Equation 2.3.

A · x = y (2.3)
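Assuming the common storage scheme of three coefficient arrays (sub-diagonal a, main diagonal b and super-diagonal c), Equation 2.2 reduces for a tridiagonal matrix to a simple scan over the rows. The following C helper is a minimal sketch; the function name is illustrative.

#include <math.h>
#include <float.h>

/* Dominance measure d of Equation 2.2 for a tridiagonal matrix stored as
 * sub-diagonal a[1..n-1], main diagonal b[0..n-1] and super-diagonal
 * c[0..n-2]. The system is diagonally dominant when the result is > 1. */
float dominance_factor(const float *a, const float *b, const float *c, int n)
{
    float d = FLT_MAX;
    for (int i = 0; i < n; ++i) {
        float off = 0.0f;
        if (i > 0)     off += fabsf(a[i]);
        if (i < n - 1) off += fabsf(c[i]);
        if (off == 0.0f)          /* a 1x1 system is trivially dominant */
            continue;
        float r = fabsf(b[i]) / off;
        if (r < d) d = r;
    }
    return d;
}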

For non-singular diagonally dominant tridiagonal systems, a special form of non-pivoting Gaussian elimination called the Thomas algorithm [33], more commonly known as the TDMA, can perform LU decomposition and back substitution in Θ(n) operations. The TDMA provides good performance when solving many small tridiagonal linear systems concurrently. However, as the TDMA is inherently serial in execution, it fails to scale well in highly parallel computing environments. More advanced parallel methods must be applied if the problem requires solving fewer, larger systems. Many parallel algorithms exist for solving tridiagonal and block-tridiagonal linear systems and are implemented in well established numerical libraries [34, 35, 36, 37].

Diagonally dominant tridiagonal linear systems arise in many scientific, visualisation, engineering and economic fields, for example in cubic spline interpolation [38], depth-of-field calculations in 3D graphics [39], implicit finite difference schemes [40] and computational fluid dynamics [41]. The process of solving such systems is a common enough occurrence that it makes a worthwhile use-case for investigating FPGAs as a device for accelerating scientific computations compared to CPUs and GPUs, as well as the potential to exploit heterogeneous combinations thereof.

2.6 Algorithms to Solve Tridiagonal Linear Systems

Extensive research has been conducted in optimising tridiagonal linear system solver algorithms. This section will give an overview of common serial and parallel algorithms that have been developed and the relative computational cost required to implement the algorithm in hardware.

2.6.1 Thomas Algorithm (TDMA)

The Thomas algorithm or TDMA is a modified version of Gaussian Elimination without partial pivoting for solving diagonally dominant tridiagonal linear systems [33]. It consists of a forward sweep where, without pivoting, the matrix A of A · x = y is decomposed using LU factorisation to A = LU and L · d = y is solved. For the back-substitution phase, the solution vector is recovered by solving U · x = d [42]. The TDMA is inherently serial, but requires the fewest operations of all the algorithms presented here to solve a tridiagonal linear system. Assuming a tridiagonal system as per Equation 2.1, the forward sweep can be calculated with Equation 2.4 and back-substitution with Equation 2.5.

\[
\left.
\begin{aligned}
a'_{i,i-1} &= \frac{a_{i,i-1}}{a'_{i-1,i-1}} \\
a'_{i,i}   &= a_{i,i} - a'_{i,i-1} \cdot a_{i-1,i} \\
y'_i       &= y_i - a'_{i,i-1} \cdot y'_{i-1}
\end{aligned}
\right\}
\quad i = 2, \ldots, n
\qquad (2.4)
\]

 0 yi xi = 0 i = n ai,i 0 (2.5) yi−ai,i+1·xi+1 xi = 0 i = n − 1,..., 1 ai,i The TDMA has been implemented widely on CPU, GPU and FPGA hardware[42, 40, 15, 25]

2.6.2 Recursive Doubling

Similar to Parallel Cyclic Reduction (PCR), Recursive Doubling (RD), proposed by Stone [43] and later reformulated by Egecioglu et al. [44], expresses the unknowns as a multiplication chain of matrices that can be evaluated in parallel using the scan primitive, a parallel primitive originally developed for vector machines. A downside to this algorithm is that numerical accuracy may be lost even for diagonally dominant matrices, since the algorithm requires a divide operation with the upper diagonal as the divisor [42].

2.6.3 Cyclic Reduction and Parallel Cyclic Reduction

A widely implemented parallel algorithm for solving tridiagonal linear systems, Cyclic Reduction (CR), first proposed by Hockney [45], works by recursively eliminating the even (or odd) indexed unknowns of the linear system in a forward phase until two unknowns remain. These unknowns are then solved directly and the full solution is obtained via back substitution. Assuming a tridiagonal linear system as per Equation 2.1, the forward phase of CR is given by Equation 2.6, and the back substitution phase by Equation 2.7.

\[
\begin{aligned}
\alpha &= \frac{a_{i,i-1}}{a_{i-\mathrm{stride},\,i-\mathrm{stride}}},
&\beta &= \frac{a_{i,i+1}}{a_{i+\mathrm{stride},\,i+\mathrm{stride}}}, \\
a'_{i,i-1} &= -\alpha\, a_{i-\mathrm{stride},\,i-\mathrm{stride}-1},
&a'_{i,i+1} &= -\beta\, a_{i+\mathrm{stride},\,i+\mathrm{stride}+1}, \\
a'_{i,i} &= a_{i,i} - \alpha\, a_{i-\mathrm{stride},\,i-\mathrm{stride}+1} - \beta\, a_{i+\mathrm{stride},\,i+\mathrm{stride}-1}, \\
y'_i &= y_i - \alpha\, y_{i-\mathrm{stride}} - \beta\, y_{i+\mathrm{stride}}
\end{aligned}
\qquad (2.6)
\]

\[
x_i = \frac{y'_i - a'_{i,i-1}\, x_{i-\mathrm{stride}} - a'_{i,i+1}\, x_{i+\mathrm{stride}}}{a'_{i,i}}
\qquad (2.7)
\]

where stride grows with each iteration of the forward phase as stride = 2^iter for iter = 0, 1, . . . , log2(n), and n is the size of the system.

CR was further parallelised into the PCR algorithm [46]. It is an adaptation of CR where only the forward phase (Equation 2.6) is performed, eliminating both odd and even indices simultaneously. This reduces the number of solving steps required but increases the amount of work per step. The CR and PCR algorithms have been widely implemented on GPU hardware as the parallel operations map well to the SIMD architecture [47, 48, 49]. Performance is good when the entire input matrix can fit in the local memory cache and the application requires solving multiple such small systems [50, 51]. However, the algorithm loses numerical accuracy for large linear systems due to multiple divide operations leaving progressively smaller residual terms. In some cases the CR algorithm is used as a pre-conditioner of an iterative method, or as a first stage of a multistage or hybrid algorithm [47, 48, 51]. Zhang et al. [50] profiled CR, PCR, RD and combinations of the three when implemented on GPU hardware. Zhang et al. [8] later suggested decomposing the matrix into multiple discrete sub-matrices. The sub-matrices are then solved with a direct method like the TDMA.
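As a sketch of how one PCR reduction step (Equation 2.6) maps to a data-parallel kernel, the hypothetical OpenCL C kernel below assigns one Work Item per row and double-buffers the coefficients between steps; rows outside the system are treated as identity rows. The host would enqueue this kernel log2(n) times, doubling stride after each launch and swapping the input and output buffers. This is an illustration of the algorithm only, not the kernel developed in chapter 3.

/* One PCR forward step. a = lower, b = diagonal, c = upper, y = RHS;
 * an, bn, cn, yn receive the updated coefficients for the next step. */
__kernel void pcr_step(__global const float *a, __global const float *b,
                       __global const float *c, __global const float *y,
                       __global float *an, __global float *bn,
                       __global float *cn, __global float *yn,
                       const int n, const int stride)
{
    int i  = get_global_id(0);
    int im = i - stride;
    int ip = i + stride;

    /* Rows outside the system behave as identity rows (b = 1, a = c = y = 0). */
    float am = (im >= 0) ? a[im] : 0.0f, bm = (im >= 0) ? b[im] : 1.0f;
    float cm = (im >= 0) ? c[im] : 0.0f, ym = (im >= 0) ? y[im] : 0.0f;
    float ap = (ip <  n) ? a[ip] : 0.0f, bp = (ip <  n) ? b[ip] : 1.0f;
    float cp = (ip <  n) ? c[ip] : 0.0f, yp = (ip <  n) ? y[ip] : 0.0f;

    float alpha = a[i] / bm;   /* eliminates the coupling to row i - stride */
    float beta  = c[i] / bp;   /* eliminates the coupling to row i + stride */

    an[i] = -alpha * am;
    cn[i] = -beta  * cp;
    bn[i] = b[i] - alpha * cm - beta * ap;
    yn[i] = y[i] - alpha * ym - beta * yp;
}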

2.6.4 SPIKE Algorithm

The SPIKE algorithm [52] is a poly-algorithm that uses domain decomposition techniques to partition a tridiagonal, or any banded, matrix into mutually independent subsystems. These subsystems can then be solved in parallel. For the linear system in Equation 2.1 we can partition the system into p partitions of m elements, where k = (1, 2, . . . , p), to give a main diagonal partition A_k (Equation 2.8), off-diagonal partitions B_k and C_k, and the right hand side partition Y_k (Equation 2.8).

\[
A_k =
\begin{bmatrix}
a_{i,j} & a_{i,j+1} & & & \\
a_{i+1,j} & a_{i+1,j+1} & a_{i+1,j+2} & & \\
& \ddots & \ddots & \ddots & \\
& & a_{i+m-2,j+m-3} & a_{i+m-2,j+m-2} & a_{i+m-2,j+m-1} \\
& & & a_{i+m-1,j+m-2} & a_{i+m-1,j+m-1}
\end{bmatrix}
\]

\[
[\,B_k,\; C_k,\; Y_k\,] =
\begin{bmatrix}
0 & a_{mk+1,\,m(k-1)} & y_{mk} \\
\vdots & \vdots & \vdots \\
a_{m(k+1)-1,\,m(k+1)} & 0 & y_{m(k+1)}
\end{bmatrix}
\qquad (2.8)
\]

The coefficient matrix partitions are factorised so that A = DS, where D is the block diagonal matrix of the partitions and S is the spike matrix, as seen in Equation 2.9.

\[
DS =
\begin{bmatrix}
A_0 & & & & \\
& A_1 & & & \\
& & \ddots & & \\
& & & A_{p-2} & \\
& & & & A_{p-1}
\end{bmatrix}
\cdot
\begin{bmatrix}
I & V_0 & & & \\
W_1 & I & V_1 & & \\
& \ddots & \ddots & \ddots & \\
& & W_{p-2} & I & V_{p-2} \\
& & & W_{p-1} & I
\end{bmatrix}
\qquad (2.9)
\]

where V_j = (A_j)^{-1} B_j for j = 0, . . . , p − 2 and W_j = (A_j)^{-1} C_j for j = 1, . . . , p − 1. By first solving DF = Y, the solution can be retrieved by solving SX = F. As SX = F is the same size as the original system, solving for X can be simplified by first extracting a reduced system of the boundary elements between partitions to form \hat{S}\hat{X} = \hat{F}, as seen in Equation 2.10, where t and b denote the top-most and bottom-most elements of the partition.

\[
\hat{S}\hat{X} =
\begin{bmatrix}
1 & 0 & V_0^t & & & & & & \\
0 & 1 & V_0^b & & & & & & \\
0 & W_1^t & 1 & 0 & & & & & \\
0 & W_1^b & 0 & 1 & & & & & \\
& & & & \ddots & & & & \\
& & & & & 1 & 0 & V_{p-2}^t & \\
& & & & & 0 & 1 & V_{p-2}^b & \\
& & & & & 0 & W_{p-1}^t & 1 & 0 \\
& & & & & 0 & W_{p-1}^b & 0 & 1
\end{bmatrix}
\begin{bmatrix}
X_0^t \\ X_0^b \\ X_1^t \\ X_1^b \\ \vdots \\ X_{p-2}^t \\ X_{p-2}^b \\ X_{p-1}^t \\ X_{p-1}^b
\end{bmatrix}
=
\begin{bmatrix}
F_0^t \\ F_0^b \\ F_1^t \\ F_1^b \\ \vdots \\ F_{p-2}^t \\ F_{p-2}^b \\ F_{p-1}^t \\ F_{p-1}^b
\end{bmatrix}
= \hat{F}
\qquad (2.10)
\]

The reduced system \hat{S} is a sparse banded matrix of size 2p × 2p and has a bandwidth of 2. Polizzi et al. [52] proposed strategies to handle solving the reduced system. The truncated-SPIKE algorithm states that for a diagonally dominant system where d > 1 (Equation 2.2), the reduced SPIKE partitions V_j^t and W_j^b can be set to zero [52]. This truncated reduced system takes the form of the p − 1 independent systems seen in Equation 2.11, which can be solved easily using direct methods. With \hat{X} computed, the remaining values of X can be found with perfect parallelism using Equation 2.12.

\[
\hat{S}_j \hat{X}_j =
\begin{bmatrix}
1 & V_j^b \\
W_{j+1}^t & 1
\end{bmatrix}
\begin{bmatrix}
X_j^b \\ X_{j+1}^t
\end{bmatrix}
=
\begin{bmatrix}
F_j^b \\ F_{j+1}^t
\end{bmatrix}
= \hat{F}_j,
\qquad j = 0, \ldots, p-2
\qquad (2.11)
\]

    0  t A1X1 = F1 −   b1x2  I  m       0 I  t m b AjXj = Fj −   bjxj+1 −   cjxj−1 j = 2, . . . , p − 1 (2.12)  I 0  m     I  m b ApXp = Fp −   cpxp−1  0 Literature Review 21

The SPIKE algorithm has been shown to be an effective method for decomposing massive matrices whilst remaining numerically stable and demanding little memory overhead [42]. The SPIKE algorithm has been implemented with good results to solve banded linear systems using CPUs, GPUs and heterogeneous combinations of both, often using vendor specific programming paradigms [53]. Wang et al. [54] presented a scalable SPIKE implementation targeting CPUs and GPUs in a clustered HPC environment and were able to demonstrate good computation and communication efficiency solving massive diagonally dominant linear systems.
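To make the truncated method concrete, the following serial C sketch walks through the three phases of Equations 2.9–2.12 for a system pre-partitioned into p blocks of m rows, reusing a TDMA solve for each partition. Function and variable names are illustrative, the per-partition solves that would be distributed across devices in practice are simply executed sequentially here, and this is not the oclspkt routine described in chapter 4.

#include <stdlib.h>
#include <string.h>

/* TDMA as sketched in subsection 2.6.1: b and y are overwritten,
 * y returns the solution of the m x m block system. */
static void tdma(const float *a, float *b, const float *c, float *y, int m)
{
    for (int i = 1; i < m; ++i) {
        float f = a[i] / b[i - 1];
        b[i] -= f * c[i - 1];
        y[i] -= f * y[i - 1];
    }
    y[m - 1] /= b[m - 1];
    for (int i = m - 2; i >= 0; --i)
        y[i] = (y[i] - c[i] * y[i + 1]) / b[i];
}

/* Solve one partition A_k x = rhs using fresh copies of b and rhs. */
static void block_solve(const float *a, const float *b, const float *c,
                        const float *rhs, float *x, int m)
{
    float *bt = malloc(m * sizeof(float));
    memcpy(bt, b, m * sizeof(float));
    memcpy(x, rhs, m * sizeof(float));
    tdma(a, bt, c, x, m);
    free(bt);
}

/* Truncated SPIKE for a diagonally dominant tridiagonal system of n = p*m
 * rows stored as a (sub), b (diag), c (super), y (RHS); solution -> x. */
void spike_truncated(const float *a, const float *b, const float *c,
                     const float *y, float *x, int p, int m)
{
    float *f   = malloc(p * m * sizeof(float)); /* F = D^{-1} Y            */
    float *vb  = malloc(p * sizeof(float));     /* right spike tips V_k^b  */
    float *wt  = malloc(p * sizeof(float));     /* left spike tips  W_k^t  */
    float *xb  = malloc(p * sizeof(float));     /* interface values X_k^b  */
    float *xt  = malloc(p * sizeof(float));     /* interface values X_k^t  */
    float *rhs = malloc(m * sizeof(float));
    float *spk = malloc(m * sizeof(float));

    /* Phase 1: solve each partition and its spike tips independently. */
    for (int k = 0; k < p; ++k) {
        const float *ak = a + k * m, *bk = b + k * m, *ck = c + k * m;
        block_solve(ak, bk, ck, y + k * m, f + k * m, m);
        if (k < p - 1) {                       /* V_k^b from Equation 2.9 */
            memset(rhs, 0, m * sizeof(float));
            rhs[m - 1] = c[k * m + m - 1];
            block_solve(ak, bk, ck, rhs, spk, m);
            vb[k] = spk[m - 1];
        }
        if (k > 0) {                           /* W_k^t from Equation 2.9 */
            memset(rhs, 0, m * sizeof(float));
            rhs[0] = a[k * m];
            block_solve(ak, bk, ck, rhs, spk, m);
            wt[k] = spk[0];
        }
        xb[k] = f[k * m + m - 1];
        xt[k] = f[k * m];
    }

    /* Phase 2: p-1 independent 2x2 interface systems (Equation 2.11). */
    for (int j = 0; j < p - 1; ++j) {
        float det = 1.0f - vb[j] * wt[j + 1];
        float fb = f[j * m + m - 1], ft = f[(j + 1) * m];
        xb[j]     = (fb - vb[j] * ft) / det;
        xt[j + 1] = (ft - wt[j + 1] * fb) / det;
    }

    /* Phase 3: recover the interior of each partition (Equation 2.12). */
    for (int k = 0; k < p; ++k) {
        memcpy(rhs, y + k * m, m * sizeof(float));
        if (k < p - 1) rhs[m - 1] -= c[k * m + m - 1] * xt[k + 1];
        if (k > 0)     rhs[0]     -= a[k * m]         * xb[k - 1];
        block_solve(a + k * m, b + k * m, c + k * m, rhs, x + k * m, m);
    }

    free(f); free(vb); free(wt); free(xb); free(xt); free(rhs); free(spk);
}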

2.6.5 Conjugate Gradient Algorithm

The CG algorithm is a popular iterative method for solving sparse linear systems that are symmetric and positive definite. The algorithm generates a series of conjugate vectors, giving the numerical method its name. These vectors are the gradients of a quadratic function, and minimising this function is equivalent to solving the linear system [55].

The CG algorithm has been implemented in FPGA, CPU and GPU hardware on numerous occasions. As with the previously mentioned research from Junqing et al. [9] and Angerer et al. [26], Lui et al. [56] present a highly scalable GPU implementation of the Bi-Conjugate Gradient Stabilised algorithm, a variant of the CG algorithm, for arbitrarily banded systems. The authors use row-based domain decomposition in order to partition the matrix for distribution across GPUs and compute nodes and test their implementation and its scalability across 6–192 GPUs.

2.7 Heterogeneous Linear System Solvers

In this thesis, heterogeneous algorithmic implementations are implementations that use more than one type of computational device by either:

1. computing the same algorithm on different devices for separate par- titions of the input data, or

2. computing different parts of the algorithm on different devices with non-partitioned input data.

For item 1 heterogeneous implementations, Chang et al. [53] presented a scalable GPU optimised SPIKE algorithm implementation that can incorporate CPUs in a heterogeneous computing environment. Wang et al. [54] built on this work with their SPIKE2 algorithm that used multiple levels of the SPIKE domain decomposition to allocate dynamic partitions of work to CPUs and GPUs. Both implementations focus on the novelty of GPU algorithm optimisation and use Intel MKL library solvers for the work allocated to the CPU. To date, there have been no heterogeneous tridiagonal linear systems solvers described in the literature that incorporate FPGA technology alongside CPUs or GPUs.

Regarding heterogeneous linear system solvers as described in item 2 that use FPGAs to solve non-tridiagonal linear systems, Junqing et al. [9] implement a limited precision direct solver on FPGA hardware as a pre-conditioner to a high-precision iterative CG solver on a CPU. The solver is limited to on-chip FPGA BRAM, and the devices are not utilised in parallel, but sequentially. In contrast, Angerer et al. [26] implement a low precision FPGA iterative CG solver as a pre-conditioner to a high precision GPU CG solver. In this implementation the CPU is utilised as a work-flow coordinator for the GPU and FPGA. Full utilisation of both devices is achieved by pipelining the solving of multiple Right Hand Side (RHS) vectors with a common coefficient matrix. In this case the input matrix size is limited by the on board RAM of both the FPGA and GPU devices. It is interesting to note that the incorporation of the FPGA reduces the energy required to solve an RHS by approximately a third when compared to the ATLAS CPU solver alone [26].

2.8 Summary Analysis and Relevance to this Work

This chapter provided background information required to give context to this thesis and a review of literature relevant to solving diagonally dominant tridiagonal linear systems using FPGAs in a heterogeneous computing environment.

FPGAs and FPGA Linear Algebra implementations In section 2.1 and section 2.2, a background on FPGA technology and how it has been applied to implementing linear algebra routines was explored. A comparison of the different algorithms is presented in Table 2.1. In past research, comparative papers often do not test FPGA implementations against like-for-like algorithm implementations, i.e. the algorithms running on the comparison device often use high precision data types or are un-optimised. This leads to remarkable speedups for the FPGA but is not an accurate representation of the true performance of the FPGA hardware. Also, past

Algorithm   Time Complexity   Stability   Execution
TDMA        Θ(n)              Stable      Serial
RD          Θ(log n)          Unstable    Thread parallel
CR/PCR      Θ(n log n)        Stable      Thread parallel
SPIKE       Θ(n)^1            Stable      Task and thread parallel
CG          Θ(n√k)^2          Stable      Thread parallel

Table 2.1: Comparison of common algorithms to solve diagonally dominant tridiagonal linear systems.
1 Using the TDMA to factorise the spike matrices and the truncated method to solve the reduced system.
2 Assuming all elements of n are non-zero and k is the condition number of the matrix.

FPGA implementations have suffered from the maximum input problem size often being limited by on-chip or on-board memory. This, along with the implementation designs often being specific to a given FPGA, limits their usability for general scientific computing. The investigations presented in chapter 3 and chapter 4 seek to provide clear and fair comparisons between each device targeted by optimising code bases and comparing to leading industry routines. In both investigations all implementation and comparison testing is done using FP32, so as to exploit the maximum parallelism of each device architecture whilst maintaining suitable numerical accuracy. Further, in chapter 4 the implementation is designed to work for an input system of any size, and will dynamically load balance the input for all available computation devices.

OpenCL and Heterogeneous Computing OpenCL, and how it can be used to implement heterogeneous computational routines, was covered in section 2.3 and section 2.4. The ability to target different accelerator devices in a heterogeneous computing environment was explored, along with the complexities encountered when writing optimised code targeting CPUs, GPUs and FPGAs. In chapter 3, a single portable OpenCL implementation of two algorithms executed on target GPUs and FPGAs is presented to investigate the portability of OpenCL. We further investigate heterogeneous computing with OpenCL in chapter 4 by implementing algorithmic blocks that are then arranged in an optimised way to suit the underlying computational architecture.

Tridiagonal Linear Systems and Algorithms to solve them. In section 2.5 and section 2.6 the tridiagonal matrix and several algorithms to solve systems of tridiagonal linear equations were defined. The quickest method of solving such systems in terms of operation count, the TDMA, is inherently serial and well understood, but can perform poorly when compared to parallel algorithms implemented on multi-core devices. In the literature, only iterative methods like the CG algorithm have been explored thoroughly as parallel linear system solvers implemented on FPGA hardware. As such, the investigations presented in this document focus on direct parallel methods, specifically the PCR and SPIKE algorithms. The PCR algorithm is chosen primarily because the algorithm is well understood, has been implemented numerous times on GPU hardware, and does not suffer from the thread-dependent branching logic that is inherent in the CR algorithm. The SPIKE algorithm is chosen for its ability to decompose the problem domain for computation on different processors, devices or computers with little shared memory overhead. This property makes it ideal for a heterogeneous tridiagonal linear system solver implementation.

Heterogeneous Tridiagonal Linear System Solvers. Finally, in section 2.7 heterogeneous tridiagonal linear system solver implementations targeting CPUs, GPUs, FPGAs and combinations thereof are reviewed. We found that no heterogeneous tridiagonal linear system solver utilising FPGA hardware exists. However, FPGAs have been used in heterogeneous linear system solvers for systems that are not tridiagonal in nature. For these solvers the FPGA is part of an algorithmic pipeline rather than a standalone accelerator. The investigation in chapter 4 looks to fill this research gap by using the available devices (CPUs, GPUs and FPGAs) on a computer to accelerate the SPIKE algorithm solving a diagonally dominant tridiagonal linear system of any size. Chapter 3

Investigation 1: Portable Solvers of Small Tridiagonal Systems

Preface

In this chapter, the paper titled "Implementation of parallel tridiagonal solvers for a heterogeneous computing environment" is presented. The impetus behind this paper was to design naive, portable OpenCL implementations of parallel tridiagonal linear system solvers. The paper targets only GPU and FPGA hardware as a discovery phase of the overall research. For ease of design, the implementations target small tridiagonal systems that can be stored completely in each device's local memory. The parallel tridiagonal solvers PCR and SPIKE were implemented on an FPGA and a GPU and compared against other OpenCL and CUDA implementations. Both PCR and SPIKE OpenCL implementations were designed initially for GPU hardware for functional correctness. The functional GPU kernel code was then compiled and built for the FPGA, changing only pre-processor definitions that control the number of Compute Units instantiated and the matrix size to be solved. The PCR implementations ported well between the GPU and the FPGA as they were comprised of one OpenCL kernel with a single fixed-iteration for-loop. This design worked well for both the GPU and FPGA. The fixed-iteration for-loop in the PCR kernel allowed the FPGA compiler to aggressively optimise the hardware utilisation, allowing multiple Compute Units and Processing Elements to be instantiated.


Further, the lack of branching logic suited the SIMD architecture of the GPU well. The SPIKE algorithm, however, was logically split into three separate kernels for the factorisation, reduced system solve, and back substitution steps. While this worked well for the GPU, only one kernel could be instantiated at a time on the FPGA due to hardware constraints. This introduces a reprogramming delay in the FPGA-SPIKE solver where the FPGA needs to be reconfigured between algorithm steps and consequently reduces compute performance. One further thing to note is that the FPGA results presented in this paper are based on simulation timings due to FPGA hardware issues near the time of publication. Note also that at the time of publication the FPGA manufacturer Intel FPGA was still known as Altera.


Implementation of parallel tridiagonal solvers for a heterogeneous computing environment

Abstract

Tridiagonal diagonally dominant linear systems arise in many scientific and engineering applications. The standard Thomas algorithm for solving such systems is inherently serial forming a bottleneck in computation. Algo- rithms such as cyclic reduction and spike reduce a single large tridiagonal system into multiple small independent systems which can be solved in parallel. We have developed portable cyclic reduction and spike algorithm Open Computing Language (Opencl) implementations with the intent to target a range of co-processors in a heterogeneous computing environment including Field Programmable Gate Arrays (fpgas), Graphics Processing Units (gpus) and other multi-core processors. We evaluate these designs in the context of solver performance, resource efficiency and numerical ac- curacy.

3.1 Introduction

Tridiagonal and block tridiagonal linear systems arise in many computational applications. Some applications involve the solving of many small independent systems whilst others require solving fewer large systems [52, 15]. The focus of this article is the case when the linear system is strictly diagonally dominant. As High Performance Computing (hpc) platforms become more parallel, specialised and heterogeneous, the requirement for parallel algorithms to be portable as well as efficient is increasing. We have developed an Open Computing Language (Opencl) implementation of two well known parallel tridiagonal solvers. Our implementation is portable and capable of executing on a range of co-processor environments. Of particular interest in this work are field programmable gate arrays (fpgas). These are reconfigurable computing devices that consist of an interconnected array of configurable logic blocks (clbs) and memory modules in the form of distributed block rams (brams). The clbs themselves are simple units built from look-up tables (luts), logic gates and multiplexers (muxs). clbs can be configured and routed at run time to define custom processor architectures.

Traditionally, fpgas were mainly used in embedded devices or for hardware prototyping. Recently the floating point performance of fpgas has reached a point where they have become much more relevant to the hpc arena, particularly with the approach of exascale machines [57, 58, 59]. As a result, fpga-based linear algebra architecture design is an area of active research [40, 60, 15, 11]. The biggest barrier to effective use of fpgas has always been the difficulty in programming them. Until recently, this has entailed a hardware design process rather than software development. However, advances have been made with both major fpga vendors releasing Opencl compliant boards which produce comparable results to manual hardware design [22, 23]. Using the Altera1 based Opencl, Warne et al. demonstrated the ease with which a custom tridiagonal linear system solver can be deployed [15, 25]. In this work, we aim to extend our previous efforts towards a more general highly parallel solution targeting fpgas in particular, but also other Opencl compliant co-processors that may be present within a heterogeneous computing environment.

3.2 Background

3.2.1 Tridiagonal linear systems

A linear system Ax = b is tridiagonal if the coefficient matrix A ∈ R^{n×n} is banded with bandwidth β = 1. That is,

A = \begin{pmatrix}
a_{1,1} & a_{1,2} & & & \\
a_{2,1} & a_{2,2} & a_{2,3} & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n} \\
 & & & a_{n,n-1} & a_{n,n}
\end{pmatrix}

Considering strictly diagonally dominant systems only, it is well known that the lu-decomposition can be performed in Θ(n) operations using the Thomas algorithm [33]. Although Θ(n) is a vast improvement on a fully dense lu-decomposition, the Thomas algorithm is inherently serial. As a result the high performance of modern highly parallel computing architectures cannot be directly exploited.

1 Since publication, Altera is now known as Intel FPGA.

For applications involving many small independent systems it is easy to achieve parallel computation using a standard Thomas algorithm. However, more advanced, inherently parallel methods must be applied if the problem requires solving fewer large systems. Specialised architectures for specifically solving tridiagonal linear systems have been designed using field programmable gate arrays [40, 15]. However, these systems were built with very specific uses in mind.

3.2.2 Heterogeneous computing with OpenCL

The ideal parallel approach depends greatly on the computing devices available to the program. hpc platforms are becoming more reliant on specialist co-processors, each with their own programming models. This naturally makes designing a general, portable solution difficult. Opencl is an open standard for heterogeneous computing [1]. Implementations of the Opencl standard have been developed by vendors of a wide range of computing devices including many-core cpus, graphics processing units (gpus), digital signal processors (dsps) and recently fpgas. These implementations enable a developer to build a portable computational routine which will execute on any of the available resources. The Opencl programming model divides program code into two partitions: 1) serial code to be executed on a host processor and 2) parallel code to be executed on any number of co-processor devices. Each device is treated as an array of compute units which can operate asynchronously, each of which is further divided into synchronised processing elements.
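As a minimal illustration of this two-part model (device-side code only; the kernel name and arguments are our own and not taken from the solvers discussed later), each processing element executes one instance of the kernel body for one index of the enqueued range, while serial host code creates the buffers and enqueues the kernel.

/* Minimal illustrative OpenCL C device kernel: one work item per element. */
__kernel void scale_add(__global const float *x,
                        __global const float *y,
                        __global float *out,
                        const float alpha)
{
    size_t i = get_global_id(0);    /* this work item's position in the range */
    out[i] = alpha * x[i] + y[i];
}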

3.3 Parallel tridiagonal linear systems solvers

Many parallel algorithms exist for solving tridiagonal and block-tridiagonal linear systems and are implemented in well established numerical libraries such as Scalapack [34, 35, 36, 37]. In this section, we briefly present two parallel tridiagonal solver schemes which were the focus of gpu implementations, namely, cyclic reduction [61] and spike [52].

3.3.1 Parallel cyclic reduction

The method of cyclic reduction splits a single tridiagonal system into two smaller independent tridiagonal systems [61]. This is done by transforming the ith equation,

a_{i,i-1} x_{i-1} + a_{i,i} x_i + a_{i,i+1} x_{i+1} = b_i

into,

\alpha_i a_{i-1,i-2} x_{i-2} + (a_{i,i} + \alpha_i a_{i-1,i} + \beta_i a_{i+1,i}) x_i + \beta_i a_{i+1,i+2} x_{i+2} = b_i + \alpha_i b_{i-1} + \beta_i b_{i+1}

where \alpha_i = -a_{i,i-1}/a_{i-1,i-1} and \beta_i = -a_{i,i+1}/a_{i+1,i+1}. The result of cyclic reduction is two independent tridiagonal linear systems, one containing equations with even indexed unknowns and another containing equations with odd indexed unknowns. This transformation can be applied recursively to any number of independent systems.

3.3.2 SPIKE

Another parallel method is the so-called spike algorithm [52]. In this method, the coefficient matrix is factorised as A = DS where D = diag(A_1, A_2, ..., A_p), with A_j the diagonal blocks of A, and S has the form,

S = \begin{pmatrix}
I & V_1 & & & \\
W_2 & I & V_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & W_{p-1} & I & V_{p-1} \\
 & & & W_p & I
\end{pmatrix}   (3.1)

where p is the number of partitions, m is the number of linear equations per partition, and n is the total number of linear equations in the system, so n = pm. Here W_i, V_i ∈ R^{m×1} are given by solving,

A_i [V_i, W_i] = \begin{pmatrix}
0 & a_{im+1,im} \\
\vdots & \vdots \\
a_{im,im+1} & 0
\end{pmatrix}   (3.2)

Factorising A in this way allows the original Ax = b to be solved by first solving Dy = b as p independent tridiagonal systems and then solving Sx = y. The second step can also be transformed into p independent systems in Θ(p) operations.


3.3.3 Implementation using OpenCL

For parallel cyclic reduction (pcr), the kernel code developed in this work defines the process of solving for a single unknown. The ith processing element will solve for x_i. A single linear system is loaded from "off-chip" global memory into fast local memory which is shared by all processing elements within a compute unit. Each x_i is then solved simultaneously as shown in the pseudo-code in Algorithm 1.

Algorithm 1 : pcr compute kernel logic

[iG, iL] ← get_ids  {work group and work item index}
Uc, Dc, Lc, bc ∈ R^n  {shared local cache of system iG}

s ← 1
{Solve for x_iL in r recursive steps}
for all i ∈ [1, · · · , r] do
    α ← −Lc[iL] / Dc[iL − s],   β ← −Uc[iL] / Dc[iL + s]
    U′ ← β Uc[iL + s],   L′ ← α Lc[iL − s]
    D′ ← Dc[iL] + α Uc[iL − s] + β Lc[iL + s]
    b′ ← bc[iL] + α bc[iL − s] + β bc[iL + s]
    {Synchronise all work items}
    [Uc[iL], Dc[iL], Lc[iL], bc[iL]] ← [U′, D′, L′, b′]
    s ← 2s
end for
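To make Algorithm 1 concrete, a minimal Opencl C sketch of such a kernel is given below. This is an illustration only, not the exact kernel evaluated here: it assumes one work group per system, one work item per unknown, a compile-time system size N set via a pre-processor definition, r = ceil(log2(N)) reduction steps supplied by the host, and out-of-range neighbour coefficients treated as zero.

#ifndef N
#define N 512   /* system size; one work group of N work items per system */
#endif

__kernel void pcr_solve(__global const float *L, __global const float *D,
                        __global const float *U, __global const float *b,
                        __global float *x, const int r)
{
    const int iG = get_group_id(0);   /* which system (work group index) */
    const int iL = get_local_id(0);   /* which unknown (work item index) */

    __local float Lc[N], Dc[N], Uc[N], bc[N];

    /* Load this work group's system from global into shared local memory. */
    Lc[iL] = L[iG * N + iL];
    Dc[iL] = D[iG * N + iL];
    Uc[iL] = U[iG * N + iL];
    bc[iL] = b[iG * N + iL];
    barrier(CLK_LOCAL_MEM_FENCE);

    int s = 1;
    for (int i = 0; i < r; ++i) {
        /* Out-of-range neighbours contribute nothing (alpha or beta = 0). */
        float alpha = (iL - s >= 0) ? -Lc[iL] / Dc[iL - s] : 0.0f;
        float beta  = (iL + s <  N) ? -Uc[iL] / Dc[iL + s] : 0.0f;

        float Un = (iL + s <  N) ? beta  * Uc[iL + s] : 0.0f;
        float Ln = (iL - s >= 0) ? alpha * Lc[iL - s] : 0.0f;
        float Dn = Dc[iL]
                 + ((iL - s >= 0) ? alpha * Uc[iL - s] : 0.0f)
                 + ((iL + s <  N) ? beta  * Lc[iL + s] : 0.0f);
        float bn = bc[iL]
                 + ((iL - s >= 0) ? alpha * bc[iL - s] : 0.0f)
                 + ((iL + s <  N) ? beta  * bc[iL + s] : 0.0f);

        barrier(CLK_LOCAL_MEM_FENCE);   /* all reads complete before overwrite */
        Uc[iL] = Un; Dc[iL] = Dn; Lc[iL] = Ln; bc[iL] = bn;
        barrier(CLK_LOCAL_MEM_FENCE);   /* all writes visible for the next step */

        s *= 2;
    }

    /* After r steps each equation is decoupled and can be solved directly. */
    x[iG * N + iL] = bc[iL] / Dc[iL];
}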

At the present stage of development, our spike implementation uses a direct mapping of a processing element to the work required to factorise the pth partition of A. With this allocation of workload to processors, a number of processors will operate in parallel on each compute unit. There is a sequential bottleneck of Θ(p) in reducing the system Sx = y into a set of parallel back-substitutions. The final solution is then recovered through parallel back-substitution in the same manner as the factorisation step.

3.4 Evaluation

One of the main aims of this work is the exploration and development of efficient parallel tridiagonal solvers which take advantage of a range of possible compute platforms. To assess the feasibility of such a design, we specifically focused on resource utilisation, compute performance and numerical accuracy. In this section, our implementations of pcr and spike are evaluated against pre-existing implementations targeting nvidia gpus. Not only do the results show that our designs are comparable when running on nvidia hardware, but they also indicate that the fpga platform is a promising alternative.

Table 3.1: Logic utilisation percentage of optimum configuration for the Altera Stratix v fpga.

Design                   Compute Units   Processing Elements   Block-ram   Lookup Tables
pcr                      8               2                     92%         99%
spike factorise          1               4                     92%         99%
spike partition Sx = y   1               16                    43%         46%
spike back-solve         1               8                     22%         30%

3.4.1 FPGA resource utilisation

Unlike fixed architectures such as gpus and cpus, Opencl implementations targeting fpgas allow the number of compute units and processing elements to be varied. However, the specific choices made for these will affect both algorithm performance and utilisation of resources on the fpga configurable logic fabric. We tuned the combination of both compute units and processing elements for maximum throughput for both the pcr and spike implementations. The optimal configurations are given in Table 3.1. The estimated fpga resource utilisation percentages are also included in the table, and these were generated for the target Altera Stratix v architecture using the Altera Opencl software development kit (sdk). The percentages of consumed fpga logic resources (i.e., block-ram and lookup tables) are within the total amounts available and show that each design implementation on the target architecture is feasible. However, for maximum throughput the fpga cannot be configured to run all spike stages at the same time, so instead some reconfiguration between stages is necessary.
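For reference, with the Altera (now Intel fpga) Opencl sdk the number of compute units and processing elements (simd lanes) replicated for a kernel is typically requested through kernel attributes such as those sketched below; the counts and the kernel signature here are placeholders, not the tuned configuration of Table 3.1.

/* Illustrative only: attributes understood by the Altera/Intel FPGA OpenCL
 * compiler for replicating compute units and vectorising work items. */
__attribute__((num_compute_units(4)))
__attribute__((num_simd_work_items(2)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void pcr_kernel(__global const float *L, __global const float *D,
                         __global const float *U, __global const float *b,
                         __global float *x)
{
    /* ... PCR body as in Algorithm 1 ... */
}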

3.4.2 Compute performance

We are not aware of any other Opencl implementation of pcr or spike designed with an fpga as the target architecture. This makes it difficult to benchmark. As a result we have compiled our Opencl code for an nvidia gpu to aid comparison with other pcr and spike implementations, along with the performance of a sequential Thomas cpu based solution.

Figure 3.1: Compute throughput of parallel cyclic reduction implementations versus a single threaded Thomas algorithm

In addition, our estimated fpga performance is placed alongside these results. The fpga estimates are produced by the Altera offline compiler; an analysis of the hardware schematic generated through the compilation process provides the number of Opencl work items that can be executed per second. We converted the estimated throughput to system solves per second (rather than the more usual flop based measures) to ease comparison with the most relevant studies. Data transfer overheads can impede this throughput in reality [25], however, there are a number of coding practices which can assist in minimising this impact [62]. nvidia provides an Opencl implementation of pcr which is distributed with their cuda sdk. Figure 3.1 provides the throughput (Systems/sec) for our pcr implementation against the nvidia implementation for systems of size n ∈ [64, 128, 256, 512, 1024]. Our implementation out-performs nvidia's when running on an nvidia Quadro 4000 gpu. Our estimates also show that the fpga implementation will run on-par with the nvidia gpu code; considering the fpga is running at only a fraction of the clock speed (≈ 150 MHz) this is quite impressive. An analogous performance comparison was performed with our spike implementation against a cuda implementation. For the given system size range our gpu based implementation has comparable performance to Chang et al. [53]. Also, the estimated performance of our fpga design out-performs both gpu implementations by a factor of around four times. The fpga estimates are likely to be optimistic in this case as the re-configuration time required when switching between spike stages is not taken into consideration.

Figure 3.2: Compute throughput of spike implementations versus a single threaded Thomas algorithm

Although not explored here, this re-configuration time (≤ 100 ms) could be hidden in practice by, for example, processing spike stages in batches.

3.4.3 Numerical accuracy

We found both our pcr and spike implementations to be of similar accu- racy to a standard Thomas algorithm (see Table 3.2). Currently, all our analysis and test cases are based on strictly diagonally dominant matrices which do not require pivot operations. In future work, we intend to perform a more general numerical stability analysis as we extend the work to other matrix types.

Table 3.2: Numerical accuracy of pcr and spike (n = 1024).

             Thomas     pcr       spike
||x − b||∞   3.576e-7   5.36e-7   4.172e-7

3.4.4 Power utilisation

As the purpose of this study is to assess the feasibility of building efficient linear algebra implementations for a heterogeneous computing environment, we have not performed any detailed analysis of power consumption. However, the attraction of such a heterogeneous environment is strongly motivated by the potential increase in power efficiency due to the significantly (order of magnitude) lower power consumption of fpgas [57].

To put Figures 3.1 and 3.2 into context, the Altera fpga pci-e card consumes a maximum of 35 W under load. By comparison, an nvidia Quadro 4000 has a maximum power consumption of 142 W. While a detailed examination of the various tridiagonal solvers remains for future work, we have previously investigated power utilisation in the context of image processing algorithms with promising results [16].

3.5 Conclusion

In this article, we explored the feasibility of designing truly portable multi-platform high-performance linear algebra routines using Opencl. The parallel tridiagonal solvers of cyclic reduction and spike were implemented on an fpga and a gpu and compared against other Opencl and cuda implementations. Results to date indicate that in terms of accuracy and computational efficiency our designs are comparable for both fpga and gpu cases. This suggests that further investigation is warranted, and in future work we intend to continue to develop and improve upon our designs for tridiagonal solvers in the broader context of providing enhanced platforms for high performance, low power numerical routines.

Acknowledgments This project utilised the high performance computing (hpc) facility at the Queensland University of Technology (qut). The facility is administered by qut's hpc and research support group. A special thanks to the hpc group for their support, particularly in providing access to specialist fpga and gpu resources. Chapter 4

Investigation 2: An Heterogeneous Solver for Tridiagonal Systems

Preface

The paper presented in chapter 3 demonstrated that truly portable OpenCL kernels targeting GPUs and FPGAs are feasible. In particular, there was significant opportunity to further develop the SPIKE algorithm implemen- tation. The current investigation builds on the prior chapter’s work in the following ways:

• Extending beyond simulation-only appraisals of computational throughput and energy efficiency of FPGAs;

• Optimising the previously developed code base for each target device architecture whilst maintaining some code portability;

• Improving the functionality of the implementation to allow matrices of any size to be solved; and

• Addressing heterogeneous computing environments by extending the capabilities of the implementation to use any combination of available computational devices, or one device in isolation, as needed.

The paper presented in this chapter, "Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to target FPGAs, GPUs, and CPUs", will focus on one algorithm, the SPIKE.

The SPIKE algorithm has low memory-transaction overhead, it is relatively easy to decompose the problem domain, and the algorithm remains numerically stable when solving large systems. In this instance, instead of solving the reduced system directly with a pentadiagonal solver as in chapter 3, the truncated method is used for resolving the reduced system. Use of the truncated method simplifies the solving of the reduced system as the operations can be done in-place either directly after the "spike" matrices have been calculated or directly before the back-substitution phase. The operation complexity for solving the reduced system using the truncated method is Θ(1) compared to Θ(n) for the pentadiagonal solver. Using the truncated method is feasible since the systems to be addressed are sufficiently diagonally dominant to allow the solver to maintain numerical stability and negate any loss in accuracy that the method might otherwise introduce. In order to ensure the solver will be able to fit in the available FPGA hardware, we develop the OpenCL kernel for the FPGA first, and instead of focusing on portability of the kernel, several portable algorithmic blocks are devised. This allows for different kernel configurations to be developed for each device targeted. With this method, optimised truncated–SPIKE implementations were developed for the FPGA, GPU and CPU while maintaining some portability. Furthermore, the truncated–SPIKE OpenCL kernel is designed in such a way that matrices of any size can be solved with limited padding of the coefficient matrix and RHS vector. This ensures good performance metrics for problem sets of varying sizes, including those comprised of multiple small systems or a single monolithic tridiagonal linear system. In this way the solver becomes applicable to a wider range of real world applications. A 'top' level of the truncated–SPIKE algorithm is used to dynamically distribute the problem size to a single or heterogeneous combination of available compute devices based on previously recorded performance metrics. The single device solvers are tested against third party diagonally dominant tridiagonal linear system solvers. For the CPU and GPU, the implementations were tested in comparison to Intel MKL and CUDA based routines. For the FPGA, a TDMA OpenCL implementation was designed to act as a suitable benchmark. All comparison and benchmark experiments use FP32 data types across all devices to ensure like-for-like comparison.


Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to target FPGAs, GPUs, and CPUs

Abstract

Solving diagonally dominant tridiagonal linear systems is a common problem in scientific high performance computing (HPC). Furthermore it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of compute devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are increasing in priority. This work presents the oclspkt routine. The oclspkt routine is a heterogeneous OpenCL implementation of the truncated–SPIKE algorithm that can use FPGAs, GPUs and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. The routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator's compute performance. The truncated–SPIKE FPGA solver is developed first, optimising for kernel performance, device global memory bandwidth and interleaved host to device memory transactions. The kernel is then remapped to best exploit the underlying architecture of the CPU and GPU to optimise kernel performance. An optimised TDMA OpenCL kernel is also developed to act as a serial baseline performance comparison for the parallel truncated–SPIKE kernel, since no FPGA tridiagonal solver capable of solving large tridiagonal systems was available at the time of development. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150% and 170% faster respectively than comparable device-optimised third-party solvers and applicable baselines. Assessing heterogeneous combinations of compute devices, the GPU+FPGA combination is found to have the best compute performance and the FPGA-only configuration is found to have the best overall energy efficiency.

4.1 Introduction

Given the ubiquity of tridiagonal linear system problems in engineering, economic and scientific fields, it is no surprise that significant research has been undertaken to address the need for larger models and higher resolution simulations. Demand for solvers for massive linear systems that are faster and more memory efficient is ever increasing. First proposed in 1978 by Sameh et al. [63] and later refined in 2006 [52], the SPIKE algorithm is becoming an increasingly popular method for solving banded linear system problems [64, 65, 54, 53, 66]. The SPIKE algorithm has been shown to be an effective method for decomposing massive matrices whilst remaining numerically stable and demanding little memory overhead [42]. The SPIKE algorithm has been implemented with good results to solve banded linear systems using CPUs, GPUs and in CPU+GPU heterogeneous environments, often using vendor specific programming paradigms [53]. A scalable SPIKE implementation targeting CPUs and GPUs in a clustered HPC environment to solve massive diagonally dominant linear systems has previously been demonstrated with good computation and communication efficiency [54]. Whilst it is desirable to design faster implementations of parallel linear system solvers, it is necessary also to have regard for power consumption, since this is a primary barrier to exascale computing when using traditional general purpose CPU and GPU hardware [67, 68]. FPGA accelerator cards require an order of magnitude less power compared to HPC grade CPUs and GPUs. Previous efforts in developing FPGA based routines to solve tridiagonal systems have been limited to solving small systems with the serial Thomas Algorithm [15, 25, 40]. We have previously investigated the feasibility of FPGA implementations of parallel algorithms, including Parallel Cyclic Reduction and SPIKE [69], for solving small tridiagonal linear systems. This previous work utilised OpenCL to produce portable implementations to target FPGAs and GPUs. The current work again utilises OpenCL since this programming framework allows developers to target a wide range of compute devices including FPGAs, CPUs and GPUs with a unified language. OpenCL consists of C based kernel code to execute on the device and C or C++ host code to set up the environment and orchestrate memory transfers and kernel execution. The motivation for this paper is to evaluate the feasibility of utilising FPGAs, along with GPUs and CPUs concurrently, in a heterogeneous computing environment in order to accelerate solving a diagonally dominant tridiagonal linear system. In addition we aimed to develop a solution that maintained portability whilst providing an optimised code base for each target device architecture and was capable of solving large systems. As such, we present the oclspkt routine, an heterogeneous OpenCL implementation of the truncated–SPIKE algorithm that can dynamically load balance work allocated to FPGAs, GPUs, and CPUs concurrently or in isolation, in order to solve tridiagonal linear systems of any size. We evaluate the oclspkt routine in terms of computational characteristics, numerical accuracy and energy consumption. This paper is structured as follows: Section 2 provides an introduction to diagonally dominant tridiagonal linear systems and the truncated–SPIKE algorithm.
Section 3 describes the implementation of the oclspkt–FPGA OpenCL host and kernel code and the optimisation process. This is followed by the porting and optimisation of the oclspkt–FPGA kernel and host code to the GPU and CPU devices as oclspkt–GPU and oclspkt–CPU. Section 3 concludes with a discussion of the integration of the three solvers to produce the heterogeneous oclspkt solver. In Section 4, the individual solvers are compared to optimised third party tridiagonal linear system solvers. The three solvers are further compared in terms of energy efficiency, performance and numerical accuracy, in addition to an evaluation of different heterogeneous combinations of the oclspkt. Finally in Section 5 we draw conclusions from the results and discuss the implications for future work.

4.2 Background

4.2.1 Tridiagonal linear systems

A coefficient band matrix with a bandwidth of β = 1 in the linear system Ax = y is considered tridiagonal, see Equation 4.1.

A = \begin{pmatrix}
a_{1,1} & a_{1,2} & & & \\
a_{2,1} & a_{2,2} & a_{2,3} & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n} \\
 & & & a_{n,n-1} & a_{n,n}
\end{pmatrix}   (4.1)

d = \min_i \frac{|A_{i,i}|}{\sum_{j \neq i} |A_{i,j}|}   (4.2)

For non-singular diagonally dominant systems where d > 1 in Equation 4.2, a special form of non-pivoting Gaussian elimination called the Thomas algorithm [33] can solve the system in Θ(n) operations. The Thomas algorithm provides good performance when solving small tridiagonal linear systems, however since this algorithm is intrinsically serial, it fails to scale well in highly parallel computing environments. More advanced, inherently parallel methods must be applied if the problem requires solving large systems. Many parallel algorithms exist for solving tridiagonal and block-tridiagonal linear systems and are implemented in well established numerical libraries [35, 36, 37].
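For illustration, a plain C sketch of the diagonal dominance factor of Equation 4.2 and the Thomas algorithm it licenses is given below; the three-diagonal storage (L, D, U plus the right hand side y) mirrors the solver inputs described later, but the function names and scratch-array interface are our own assumptions.

#include <math.h>

/* Diagonal dominance factor d (Equation 4.2) for a tridiagonal matrix stored
 * as three diagonals; L[0] and U[n-1] are assumed to be zero padding. */
double dominance_factor(const float *L, const float *D, const float *U, int n)
{
    double d = INFINITY;
    for (int i = 0; i < n; ++i) {
        double off = (i > 0 ? fabsf(L[i]) : 0.0f) + (i < n - 1 ? fabsf(U[i]) : 0.0f);
        if (off == 0.0) continue;            /* row with no off-diagonal entries */
        double r = fabsf(D[i]) / off;
        if (r < d) d = r;
    }
    return d;
}

/* Thomas algorithm (TDMA): forward sweep then back substitution in Θ(n).
 * c and z are caller-provided scratch arrays; valid without pivoting for d > 1. */
void thomas_solve(const float *L, const float *D, const float *U,
                  const float *y, float *x, float *c, float *z, int n)
{
    c[0] = U[0] / D[0];
    z[0] = y[0] / D[0];
    for (int i = 1; i < n; ++i) {            /* forward elimination */
        float denom = D[i] - L[i] * c[i - 1];
        c[i] = U[i] / denom;
        z[i] = (y[i] - L[i] * z[i - 1]) / denom;
    }
    x[n - 1] = z[n - 1];
    for (int i = n - 2; i >= 0; --i)         /* back substitution */
        x[i] = z[i] - c[i] * x[i + 1];
}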

4.2.2 The SPIKE Algorithm

The SPIKE algorithm [52] is a poly-algorithm that uses domain decomposition to partition a banded matrix into mutually independent subsystems which can be solved concurrently. Consider the tridiagonal linear system AX = Y where A is n × n in size with only a single right hand side vector Y. We can partition the system into p partitions of m elements, where k = (1, 2, . . . , p), to give a main diagonal partition A_k, off-diagonal partitions B_k and C_k, and the right hand side partition Y_k.

A_k = \begin{pmatrix}
a_{i,j} & a_{i,j+1} & & & \\
a_{i+1,j} & a_{i+1,j+1} & a_{i+1,j+2} & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{i+m-2,j+m-3} & a_{i+m-2,j+m-2} & a_{i+m-2,j+m-1} \\
 & & & a_{i+m-1,j+m-2} & a_{i+m-1,j+m-1}
\end{pmatrix}

[B_k, C_k, Y_k] = \begin{pmatrix}
0 & a_{mk+1,m(k-1)} & y_{mk} \\
\vdots & \vdots & \vdots \\
a_{m(k+1)-1,m(k+1)} & 0 & y_{m(k+1)}
\end{pmatrix}   (4.3)

The coefficient matrix partitions are factorised so A = DS, where D is the main diagonal block matrix and S is the SPIKE matrix, as seen in Equation 4.4,

DS = \begin{pmatrix}
A_1 & & & & \\
 & A_2 & & & \\
 & & \ddots & & \\
 & & & A_{p-1} & \\
 & & & & A_p
\end{pmatrix}
\begin{pmatrix}
I & V_1 & & & \\
W_2 & I & V_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & W_{p-1} & I & V_{p-1} \\
 & & & W_p & I
\end{pmatrix}   (4.4)

where V_k = (A_k)^{-1} B_k for k = 1, ..., p − 1 and W_k = (A_k)^{-1} C_k for k = 2, ..., p. By first solving DF = Y, the solution can be retrieved by solving SX = F. As SX = F is the same size as the original system, solving for X can be simplified by first extracting a reduced system of the boundary elements between partitions to form ŜX̂ = F̂, as seen in Equation 4.5, where superscripts t and b denote the top-most and bottom-most elements of a partition.

\begin{pmatrix}
1 & 0 & V_1^t & 0 & & & & & \\
0 & 1 & V_1^b & 0 & & & & & \\
0 & W_2^t & 1 & 0 & & & & & \\
0 & W_2^b & 0 & 1 & & & & & \\
 & & & & \ddots & & & & \\
 & & & & & 1 & 0 & V_{p-1}^t & 0 \\
 & & & & & 0 & 1 & V_{p-1}^b & 0 \\
 & & & & & 0 & W_p^t & 1 & 0 \\
 & & & & & 0 & W_p^b & 0 & 1
\end{pmatrix}
\begin{pmatrix}
X_1^t \\ X_1^b \\ X_2^t \\ X_2^b \\ \vdots \\ X_{p-1}^t \\ X_{p-1}^b \\ X_p^t \\ X_p^b
\end{pmatrix}
=
\begin{pmatrix}
F_1^t \\ F_1^b \\ F_2^t \\ F_2^b \\ \vdots \\ F_{p-1}^t \\ F_{p-1}^b \\ F_p^t \\ F_p^b
\end{pmatrix}   (4.5)

The reduced system Ŝ is a sparse banded matrix of size 2p × 2p and has a bandwidth of 2. Polizzi et al. [52] proposed strategies to handle solving the reduced system. The truncated–SPIKE algorithm states that for a diagonally dominant system where d > 1 (Equation 4.2), the reduced SPIKE elements V_k^t and W_k^b can be set to zero [52]. This truncated reduced system takes the form of p − 1 independent systems, seen in Equation 4.6, which can be solved easily using direct methods.

\begin{pmatrix} 1 & V_k^b \\ W_{k+1}^t & 1 \end{pmatrix}
\begin{pmatrix} X_k^b \\ X_{k+1}^t \end{pmatrix} =
\begin{pmatrix} F_k^b \\ F_{k+1}^t \end{pmatrix}, \quad k = 1, \dots, p - 1   (4.6)
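Each 2 × 2 system in Equation 4.6 can be solved in closed form; the small C sketch below (our own illustration, with hypothetical names) shows the arithmetic involved, where diagonal dominance keeps the determinant safely away from zero.

/* Solve one truncated reduced system (Equation 4.6) coupling partitions k and
 * k+1: the unknowns are X_k^b and X_{k+1}^t. */
static void solve_reduced_pair(float v_b, float w_t,     /* V_k^b, W_{k+1}^t */
                               float f_kb, float f_k1t,  /* F_k^b, F_{k+1}^t */
                               float *x_kb, float *x_k1t)
{
    float det = 1.0f - v_b * w_t;            /* determinant of the 2x2 matrix */
    *x_kb  = (f_kb  - v_b * f_k1t) / det;
    *x_k1t = (f_k1t - w_t * f_kb)  / det;
}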

With Xˆ computed, the remaining values of X can be found with perfect parallelism using Equation 4.7.

 b t A1X1 = F1 − V1 X2  b t t b AkXk = Fk − Vk Xk+1 − WkXk−1 k = 2, . . . , p − 1 (4.7)   t b ApXp = Fp − WpXp−1 Mikkelsen et al. [70] conducted a detailed error analysis of the Trun- cated SPIKE algorithm and showed that a reasonable approximation of the upper bound of the infinity norm is dependent on the degree of diagonal Investigation 2: An Heterogeneous Solver for Tridiagonal Systems 46 dominance, the partition size and bandwidth of the matrix given by:

\|\hat{x} - x\|_\infty \approx d^{-m}\,\beta   (4.8)
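As a rough point of reference (our own arithmetic, not a result from [70]): for the degree of diagonal dominance d > 3 and partition size m = 32 used later in this work, with bandwidth β = 1 for a tridiagonal matrix, Equation 4.8 gives

\|\hat{x} - x\|_\infty \approx 3^{-32} \times 1 \approx 5.4 \times 10^{-16},

which is several orders of magnitude below single precision machine epsilon (≈ 1.2 × 10^{-7}), so the truncation itself contributes no observable loss of accuracy at FP32.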

4.3 Implementation

The general SPIKE algorithm consists of four steps: 1) partitioning the system, 2) factorising the partitions, 3) extracting and solving the reduced system, and 4) recovering the overall solution. For diagonally dominant tridiagonal linear systems the truncated-SPIKE algorithm may be employed. This requires only the bottom SPIKE element v_{km}^b and the top SPIKE element w_{km+1}^t in order to resolve the boundary unknown elements x_{km}^b and x_{km+1}^t [52]. This decouples the partition from the rest of the matrix and can be achieved by performing only the forward-sweep of LU factorisation, referred to as LUFS, and the forward-sweep of UL factorisation, referred to as ULFS.

LUFS and ULFS will be computed for partitions k and k + 1 for k = 1, 2, . . . , p − 1. The factorised right hand side elements f_{km}^b and f_{km+1}^t and the SPIKE elements w_{km+1}^t and v_{km}^b are used to form and solve the reduced system Ŝ_k X̂_k = F̂_k using Equation 4.6, producing x_{km}^b and x_{km+1}^t. This algorithmic step is referred to as RS.

The remaining elements of the solution X_k can then be recovered with Equation 4.7 via the back-sweep step of LU factorisation, referred to as LUBS, and the back-sweep step of UL factorisation, referred to as ULBS, on the top and bottom halves of partitions k and k + 1 respectively.

We use the Thomas Algorithm to compute the forward and back-sweep factorisation steps, giving the overall complexity of our truncated-SPIKE algorithm as O(n). A high level overview of the anatomy and execution flow of our oclspkt routine can be seen in Figure 4.1. The oclspkt solver expects the size of the system n, the RHS vector Y and the tridiagonal matrix split into vectors of its lower–diagonal L, main–diagonal D and upper–diagonal U as inputs. The solution vector X is returned. In the following subsections we describe the truncated–SPIKE algorithm implementation for the FPGA (oclspkt-FPGA) using OpenCL and the development considerations to obtain optimised performance. As a part of this process we design and implement an optimised TDMA OpenCL kernel to act as a serial baseline performance comparison for the parallel truncated–SPIKE kernel. Since the optimised TDMA implementation is constrained by the available global memory bandwidth, we are able to make genuine comparisons of FPGA hardware utilisation and computational complexity for these two kernels.

Figure 4.1: An overview of the anatomy and execution flow of the oclspkt solver.

We then discuss the process of porting and optimising the oclspkt code for CPU and GPU. Finally, we describe integrating the three oclspkt–[FPGA|GPU|CPU] implementations as a heterogeneous solver. The specific hardware we target for these implementations are Bittware's A10PL4 FPGA, NVIDIA's M4000 GPU and the Intel Xeon E5-1650 CPU.

4.3.1 FPGA Implementation

In order to take advantage of the FPGA’s innate pipelined parallelism we implement both the TDMA and truncated-SPIKE algorithm as single Work Item kernels. A single Work Item kernel has no calls to the OpenCL API for the local or global relative position in a range of Work Items. This allows the Intel FPGA compiler to pipeline as much of the kernel code as possible, whilst not having to address Work Item execution synchronisation and access to shared memory resources. This reduces the OpenCL FPGA resource consumption overhead, allowing more of the logic fabric to be used for computation.

TDMA Kernel Code

We found no suitable FPGA implementation of a tridiagonal linear system solver able to solve large systems. In order to provide a suitable performance baseline for the more complex SPIKE algorithm, we implemented the Thomas Algorithm, or TDMA, with OpenCL. The TDMA implementation calculates the forward-sweep and back-substitution loops one block of the input system at a time, effectively treating the FPGA's on-chip BRAM as cache for the current working data. The block size m is set as high as possible, and is only limited by the available resources on the FPGA. An OpenCL representation of the kernel implementation can be seen in Figure 4.2. The forward sweep section loads m elements of the input vectors L, D, U, Y from off-chip DDR4 RAM to on-chip BRAM. With this input the upper–triangular factors and modified RHS are calculated, overwriting the initial values of D and Y. D and Y are then written back to DDR4 RAM due to the BRAM limitation on the FPGA. The forward sweep section iterates over blocks 1 . . . p. The back-substitution section loads m elements of the D, U, and Y vectors and writes m elements of X after recovering the solution via back substitution. The back-substitution section iterates over blocks p . . . 1.

Figure 4.2: The FPGA TDMA OpenCL kernel tdma, with the execution path and data dependencies shown. The tdma executes as a single Work Item kernel

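A minimal single work-item sketch of this block-wise TDMA is given below. It illustrates the data flow of Figure 4.2 under our own assumptions (a compile-time block size M, the off-diagonals zero-padded at the system boundaries, and illustrative names) and is not the exact kernel used in the evaluation.

#define M 1024   /* block size; assumed to divide the padded system size n = p*M */

__kernel void tdma(const int p,
                   __global const float *restrict L,
                   __global float *restrict D,
                   __global const float *restrict U,
                   __global float *restrict Y,
                   __global float *restrict X)
{
    float d[M], y[M];               /* on-chip (BRAM) cache of the current block */

    /* Forward sweep over blocks 1..p: factorise and spill D', Y' back to DDR4. */
    float d_prev = 1.0f, u_prev = 0.0f, y_prev = 0.0f;
    for (int blk = 0; blk < p; ++blk) {
        for (int i = 0; i < M; ++i) {
            int g = blk * M + i;
            float w = L[g] / d_prev;          /* L[0] assumed padded to zero */
            d[i] = D[g] - w * u_prev;
            y[i] = Y[g] - w * y_prev;
            d_prev = d[i]; u_prev = U[g]; y_prev = y[i];
        }
        for (int i = 0; i < M; ++i) {         /* spill the factored block */
            D[blk * M + i] = d[i];
            Y[blk * M + i] = y[i];
        }
    }

    /* Back substitution over blocks p..1: reload factored blocks and recover X. */
    float x_next = 0.0f;                      /* U[n-1] assumed padded to zero */
    for (int blk = p - 1; blk >= 0; --blk) {
        for (int i = 0; i < M; ++i) {
            d[i] = D[blk * M + i];
            y[i] = Y[blk * M + i];
        }
        for (int i = M - 1; i >= 0; --i) {
            int g = blk * M + i;
            float xi = (y[i] - U[g] * x_next) / d[i];
            X[g] = xi;
            x_next = xi;
        }
    }
}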

Truncated–SPIKE Kernel Code

An OpenCL algorithm representation of the truncated–SPIKE kernel, spktrunc, can be seen in Figure 4.3. The FPGA oclspkt implementation executes p iterations of its main loop, loading one block of the given linear system, where the block size is m, and solving one partition per iteration of the main loop. A partition of size m is loaded from global memory and partitioned as per Equation 4.3. LUFS(k) and ULFS(k) are executed concurrently to compute and store half of the upper and lower triangular systems [D′U′Y′]_k(m/2; m) and [L′D′Y′]_k(1; m/2), and the SPIKE elements v_k^b and w_k^t respectively. Next, using y_{k−1}^b and v_{k−1}^b from the previous iteration, and y_k^t and w_k^t, as inputs for RS(k), the boundary elements x_{k−1}^b and x_k^t are computed. Finally, X_k(1; m/2) and X_{k−1}(m/2; m) are then recovered with ULBS(k) and LUBS(k−1) of [L′D′Y′]_k(1; m/2) and [D′U′Y′]_{k−1}(m/2; m). [D′U′Y′]_k(m/2; m) and v_k^b are stored for the next iteration of the main loop. The FPGA solver is initialised with an upper triangular identity matrix in [D′U′Y′]_{k−1}(m/2; m) for k = 0. This results in a streaming linear system solver where loading in a block of partitions at the start of the pipeline will compute a block of the solution

Figure 4.3: The FPGA truncated–SPIKE OpenCL kernel spktrunc, with the execution path and data dependencies shown. The spktrunc executes as a single Work Item kernel

vector with a −m/2 element offset.

Host Code

On the host side, in order to interleave the PCIe memory transfers to the device with the execution of the solver kernel, we create in-order command queues for writing to, executing on, and reading from the FPGA. We create two copies of the read-only memory objects L, D, U, Y and a write-only memory object X. The spktrunc kernel and the FPGA's DMA controller for the PCIe bus share the total bandwidth of the DDR4 RAM bank. To maximise FPGA global memory bandwidth the device memory objects are explicitly designated specific RAM bank locations on the FPGA card in such a way that the PCIe to device RAM, and device RAM to FPGA, bandwidth is optimised. The execution kernel is enqueued as a 1-by-1-by-1 dimension task with arguments p, L, D, U, Y, and X, where p, the number of partitions to solve, is given by ceil(n/m) + 1. The execution kernel is scheduled and synchronised with the write and read operations of the device memory objects using OpenCL event objects. The kernel code is dependent on the partition size m, so memory buffers for the input and output vectors are created as 1-by-size vectors, where size is given by p × m. The input matrix, consisting of the lower, main and upper diagonal vectors of A and a single right hand side vector Y, is stored in row-major order. The memory objects L, D, U, Y are padded with an identity matrix and zeros in order to accommodate linear systems where m is not a factor of n, giving the overall memory requirement as 5 × size. As the kernel is implemented as a single Work Item, this allows for single-strided memory access patterns when the FPGA loads partitions from global memory for processing. This means that it is not necessary to implement a pre-execution data marshalling stage as is often required for SIMD or SIMD-like processors.
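A compact host-side sketch of this scheduling pattern is shown below. The queue and buffer names, and the use of clEnqueueTask for the single Work Item kernel, are our own illustrative assumptions rather than the thesis host code; error checking and the usual platform, context, program and buffer creation boilerplate are omitted.

#include <CL/cl.h>

/* Enqueue one chunk: stage the inputs, run the single work-item spktrunc
 * kernel once the writes have landed, and read the solution back. Using three
 * in-order queues lets the next chunk's writes overlap this chunk's compute. */
static void enqueue_chunk(cl_command_queue q_write, cl_command_queue q_exec,
                          cl_command_queue q_read, cl_kernel spktrunc,
                          cl_mem bL, cl_mem bD, cl_mem bU, cl_mem bY, cl_mem bX,
                          const float *hL, const float *hD, const float *hU,
                          const float *hY, float *hX, cl_int p, size_t m)
{
    size_t bytes = (size_t)p * m * sizeof(float);
    cl_event ev_write[4], ev_solve;

    clEnqueueWriteBuffer(q_write, bL, CL_FALSE, 0, bytes, hL, 0, NULL, &ev_write[0]);
    clEnqueueWriteBuffer(q_write, bD, CL_FALSE, 0, bytes, hD, 0, NULL, &ev_write[1]);
    clEnqueueWriteBuffer(q_write, bU, CL_FALSE, 0, bytes, hU, 0, NULL, &ev_write[2]);
    clEnqueueWriteBuffer(q_write, bY, CL_FALSE, 0, bytes, hY, 0, NULL, &ev_write[3]);

    clSetKernelArg(spktrunc, 0, sizeof(cl_int), &p);
    clSetKernelArg(spktrunc, 1, sizeof(cl_mem), &bL);
    clSetKernelArg(spktrunc, 2, sizeof(cl_mem), &bD);
    clSetKernelArg(spktrunc, 3, sizeof(cl_mem), &bU);
    clSetKernelArg(spktrunc, 4, sizeof(cl_mem), &bY);
    clSetKernelArg(spktrunc, 5, sizeof(cl_mem), &bX);

    /* 1-by-1-by-1 task, gated on the four input transfers. */
    clEnqueueTask(q_exec, spktrunc, 4, ev_write, &ev_solve);

    /* Read back as soon as the kernel completes. */
    clEnqueueReadBuffer(q_read, bX, CL_FALSE, 0, bytes, hX, 1, &ev_solve, NULL);
}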

Kernel Complexity and Hardware Utilisation

The FLOP requirements for our TDMA and truncated–SPIKE OpenCL kernels are presented in Table 4.1. The TDMA kernel has significantly fewer FLOPs compared to the truncated–SPIKE kernel. This is expected as the TDMA kernel only computes the LU factorisation and back–substitution, compared to the more computationally complex truncated-SPIKE poly–algorithm described previously. However, since the TDMA kernel requires the upper–triangular matrix of the entire system to be stored to global memory as an intermediate step and then subsequently re-read, the TDMA kernel requires double the number of FPGA to off–chip memory transactions in comparison to the truncated–SPIKE kernel.

Operation   TDMA   truncated-SPIKE
ADD/SUB     3mp    (5m + 3)p
MUL         3mp    (6m + 5)p
DIV         3mp    (3m + 3)p
MEM         10mp   5mp

Table 4.1: FLOP and global memory transactions required for the TDMA and truncated–SPIKE FPGA kernels.

The FLOP and memory transaction requirements seen in Table 4.1 are reflected in the FPGA kernel hardware utilisation presented in Table 4.2. OpenCL requires a static partition of the available FPGA hardware resources in order to facilitate host to FPGA memory transfers and OpenCL kernel execution. Per Table 4.2 this static partition is significant, consuming at least 10% of each measured resource type. The total resource utilisation for each kernel is given by the addition of the OpenCL static resource utilisation plus the kernel specific resource utilisation. The more computationally complex truncated–SPIKE kernel requires more look up tables (ALUT), flip flops (FF) and digital signal processor (DSP) tiles than the TDMA kernel. The TDMA kernel however requires more block RAM (BRAM) tiles due to implementing a greater number of load store units to cater for the extra global memory transactions. Furthermore, both kernels are constrained by the available amount of BRAM on the FPGA, with the BRAM utilisation by far the highest resource utilisation for both kernels.

Resource   OpenCL Static   TDMA   truncated-SPIKE
ALUTs      13%             14%    18%
FFs        13%             10%    13%
BRAMs      16%             52%    49%
DSPs       10%             16%    27%

Table 4.2: FPGA hardware utilisation for the TDMA and truncated– SPIKE kernels.

FPGA OpenCL Optimisation considerations

Our implementation of the truncated–SPIKE algorithm is global memory bandwidth constrained. It requires large blocks of floating point data to be accessible at each stage of the algorithm. By far the largest bottleneck to computational throughput is ensuring coalesced, aligned global memory transactions. When loading matrix partitions from global to local or private memory, a major optimisation consideration is the available bandwidth on the global memory bus. The available bandwidth per memory transaction is 512 bits, and the Load–Store–Units that are implemented by the Intel FPGA compiler are of size 2^b bits, where b_min = 9 and b_max is constrained by the available resources on the FPGA. Therefore, to ensure maximum global memory bandwidth with the aforementioned constraints, we set the partition size m to 32 for single precision floating point data for both kernels, resulting in 1024 bit memory transactions. The value of m is hard–coded and known at compile time, allowing for unrolling of the nested loops at the expense of increased hardware utilisation. Unrolling a loop effectively tells the compiler to generate a hardware instance for each iteration of the loop, meaning that if there are no loop carried dependencies the entire loop is executed in parallel. However, for loops with carried dependencies such as LU/UL factorisation each iteration cannot execute in parallel. Nonetheless this is still many times faster than sequential loop execution despite the increase in latency that is dependent on the loop size. Loop unrolling is our primary computational optimisation step, thereby allowing enough compute bandwidth for our kernel to act as a streaming linear system solver. Note that in our optimisation process we either fully unroll loops or not at all. It is possible to partially unroll loops for a performance boost when hardware utilisation limitations do not permit a full unroll. Partially unrolling a loop can however be inefficient since the hardware utilisation does not scale proportionally with the unroll factor due to the hardware overhead required to control the loop execution.
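As a small illustration of this optimisation (a placeholder kernel, not the thesis code), fully unrolling a fixed-trip-count loop with the Intel fpga Opencl compiler looks like the following; because the copy loop carries no dependency, its M iterations become parallel hardware, whereas a factorisation loop with a carried dependency would still unroll but retain its feedback latency.

#define M 32   /* partition size, fixed at compile time */

__kernel void copy_block(const int k,
                         __global const float *restrict in,
                         __global float *restrict out)
{
    float buf[M];

    #pragma unroll
    for (int i = 0; i < M; ++i)       /* no carried dependency: fully parallel */
        buf[i] = in[k * M + i];

    #pragma unroll
    for (int i = 0; i < M; ++i)
        out[k * M + i] = 2.0f * buf[i];
}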

4.3.2 Porting the truncated-SPIKE Kernel to CPU and GPU

In order to investigate the full potential for a heterogeneous computing implementation of the truncated SPIKE algorithm for solving tridiagonal linear systems of any size, we exploited the portability of OpenCL. We modified the host and kernel code used for the FPGA implementation to target CPU and GPU hardware. To achieve this it was necessary to make modification to the host and kernel side memory objects and data access patterns, remap the truncated SPIKE algorithm to different kernel objects, and modify the Work Group sizes and their mapping to Compute Units with respect to CPU and GPU hardware architecture.

Partitioning and Memory Mapping

The memory requirements for the CPU and GPU implementations are dependent on the partitioning scheme for each device. The memory required to solve for a partition of size m, multiplied by a multiple of the GPU's preferred Work Group size, must be less than the available local memory. This constrains the size and number of partitions m, since it is preferable to maximise the occupancy of the SIMD lane whilst ensuring that sufficient local memory is available. Unlike the GPU, all OpenCL memory objects on the CPU are automatically cached into local memory by hardware [71]. However, considering that the CPU has a lower Compute Unit count, we maximise the partition size m to minimise the number of partitions, thereby minimising the operation count required to recover the reduced system. The relative values for p and size in terms of m and Work Group size can be seen in Table 4.3.

Device   Partitions (p)       size
FPGA     ceil(n/m) + 1        p × m
GPU      n/(WGsize × m)       p × m × WGsize
CPU      n/(CUs × WGsize)     p × CU × WGsize

Table 4.3: oclspkt kernel partitioning schemes

For our implementation of the truncated–SPIKE algorithm for the CPU and GPU, the host and kernel memory requirements are five 1-by-size vectors, L, D, U, Y, X, of the partitioned system and four 1-by-(p+2) vectors, V, W, Y^t, Y^b, of the reduced system. By storing the values for V, W, Y^t, Y^b in a separate global memory space we remove the potential for bank conflicts in memory transactions that may occur if the reduced system vectors are stored in-place in the partitioned system. The reduced system memory objects are padded with zeros to accommodate the top-most and bottom-most partitions j = 0 and j = p, removing the need for excess control code in the kernel to manage the top-most and bottom-most partitions. Further, to ensure data locality for coalesced memory transactions on both the CPU and GPU, the input matrix is transformed in a pre-execution data marshalling step. The data marshalling transforms the input vectors so that data for adjacent Work Items are sequential instead of strided. This allows the data to be automatically cached and vector-processed across Work Items on the CPU, and allows full bandwidth global memory transactions on the GPU.
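As an illustration of such a marshalling step (our own sketch, not the thesis code), the transform below interleaves per-partition data so that element i of adjacent partitions is contiguous in memory, which is the pattern read by adjacent Work Items.

/* Interleave one diagonal vector: element i of each of the p partitions
 * (each of length m) becomes contiguous, i.e. out[i*p + k] = in[k*m + i].
 * Adjacent Work Items (one per partition k) then access consecutive addresses. */
static void marshal_interleave(const float *in, float *out, int p, int m)
{
    for (int k = 0; k < p; ++k)
        for (int i = 0; i < m; ++i)
            out[(size_t)i * p + k] = in[(size_t)k * m + i];
}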

Remapping the kernel

In contrast to the FPGA implementation of the truncated–SPIKE algorithm, for the CPU and GPU implementations we split the code into two separate kernels for the CPU and three separate kernels for the GPU. This allows for better Work Group scheduling and dispatching for multiple Compute Unit architectures, as we enqueue the kernels for execution as arrays of Work Items, known as an NDRange in OpenCL's parlance. In remapping the kernel to the CPU, the underlying architecture provides relatively few Processing Elements per Compute Unit and a fast clock speed. As such, in order to make best use of this architecture, the number of partitions of the truncated–SPIKE algorithm should be minimised, thereby allocating Work Groups of the maximum possible size to each Compute Unit and ensuring maximum occupancy. Figure 4.4 shows an OpenCL representation of the CPU implementation, its execution order and data path.

Figure 4.4: The CPU truncated–SPIKE OpenCL kernels spkfact and spkrec, with the execution path and data dependencies shown. Both kernels are executed as an NDRange of Work Items.

For the CPU implementation we use the partitioning scheme proposed by Mendiratta [72]. We compute the LU and UL forward sweep factorisation in the spkfaccpu kernel, where we apply UL factorisation to elements 0 to 2m_min and apply LU factorisation to elements m_min to m, where m_min is the smallest partition size required to purify the resulting factorisations of error as per Equation 4.8. This reduces the overall operation count and is only possible when m ≫ m_min. The spktfaccpu kernel is enqueued as a p–by–1–by–1 NDRange, in a single in-order command queue. The reduced system and the recovery of the overall solution are handled by a second kernel, spktrec. The spktrec kernel is enqueued as per spktfaccpu in a single in-order command queue. The reduced system is solved and the boundary unknown elements are recovered and used to compute the UL backsweep of elements 0 to m_min − 1 and the LU backsweep of elements m to m_min. In contrast to the CPU, the GPU has many Processing Elements per Compute Unit and a relatively low clock speed. In order to optimise performance, it was important to maximise the number of partitions of the SPIKE algorithm by reducing the partition size, thereby ensuring maximum occupancy of the Processing Elements.

Figure 4.5: The GPU truncated–SPIKE OpenCL kernels spkfact, spkrecul and spkreclu, with the execution path and data dependencies shown. All kernels are executed as an NDRange of Work Items.

Figure 4.5 shows an OpenCL representation of the GPU implementation, its execution order and data path. For the GPU, partitioning the system and the LU and UL factorisation of the code are handled by the first kernel, spkfactgpu. Unlike the CPU, the GPU computes the entire block size m of the UL and the LU factorisations. Only the top half of the UL and the bottom half of the LU results are then stored in global memory in order to reduce global memory transactions and overall global memory space requirements. The reduced system and the recovery of the overall solution are handled by two kernels, spktreclu and spktrecul. spktreclu and spktrecul only load the bottom half and the top half of partition m respectively to compute the backsweep portions of the LU and UL factorisations. The three kernels are again enqueued as p-by-1-by-1 NDRanges in in-order command queues for writing to, executing on and reading from the GPU. As with the FPGA, this effectively interleaves the PCIe data transfer with kernel execution.

4.3.3 The Heterogeneous Solver

We further extend our truncated–SPIKE implementation to utilise all computational resources available on a platform, as seen in Figure 4.1.

Component   Specification
CPU         Intel Xeon E5-1620 v4 @ 3.50GHz
GPU         Nvidia M4000 8GB GDDR5 PCIe G3 x16
FPGA        Bittware A10PL4 w/ Intel Arria 10 GX 8GB DDR4 PCIe G3 x8
RAM         64GB DDR4 @ 2400MHz
OS          CentOS 7.4
Software    ICC 18.0.3, CUDA 9.0, Intel Quartus Pro 17.0, Intel OpenCL SDK 7.0.0.2568

Table 4.4: Specifications for Dell T5000 Desktop PC

This heterogeneous solver first checks for available devices on the host using the OpenCL APIs, and then queries whether device profiling data exists for the found devices. If profiling data is not available for all devices, each device will be allocated an even portion of the input system and profiling data will be collected on the next execution of the solver. Otherwise, each device will be allocated a portion of the input system determined by the percentage of the total recorded throughput contributed by that device. Throughput in this case includes data transit time across the PCIe bus, data marshalling, and the compute time of the kernel. The heterogeneous solver then asynchronously dispatches chunks of the input data to the devices, executes the device solvers and recovers the solution. The inter-chunk boundary solutions recovered from the devices are cleansed of error by executing a 'top' level of the truncated-SPIKE algorithm on the chunk partitions.
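A sketch of this proportional allocation is given below (our own illustration; the solver's load balancer may differ in detail): each device receives a share of the n rows of the input system proportional to its previously recorded throughput.

/* Split n rows across num_dev devices in proportion to each device's recorded
 * throughput (rows solved per second, inclusive of transfer and marshalling).
 * The rounding remainder is absorbed by the last device. */
static void allocate_rows(const double *throughput, int num_dev,
                          long n, long *rows)
{
    double total = 0.0;
    for (int d = 0; d < num_dev; ++d)
        total += throughput[d];

    long assigned = 0;
    for (int d = 0; d < num_dev; ++d) {
        rows[d] = (long)((double)n * throughput[d] / total);
        assigned += rows[d];
    }
    rows[num_dev - 1] += n - assigned;
}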

4.4 Evaluation

In the following sub-sections we evaluate the oclspkt routine in terms of compute performance, numerical stability and power efficiency. The results presented use single precision floating point, and all matrices are random, non-singular, and the main diagonal has a diagonal dominance factor d > 3. All results presented in this paper have been executed on a Dell T5000 Desktop PC with an Intel Xeon CPU E5-1620 v4, 64GB of RAM, a Bittware A10PL4 FPGA and an NVIDIA M4000 GPU; full specifications are listed in Table 4.4.

4.4.1 Compute Performance

Figure 4.6: Time (ms) to solve a system of size N = 256 × 10^6 using oclspkt, targeting CPU, GPU and FPGA devices.

To evaluate the compute performance of oclspkt, we first consider only the kernel execution time for our target devices in isolation. In Figure 4.6 we show the time to solve a system where N = 256 × 10^6, highlighting the solve and data-marshalling kernel components of the overall execution time. In this experiment the GPU solve kernel takes on average 78.4 ms to solve the tridiagonal system, while the FPGA and CPU are 2.6× and 4.8× slower at 200 ms and 376 ms respectively. When the data-marshalling overheads required by the GPU and CPU kernels are also considered, the GPU is still the quickest at 152 ms, with the FPGA and CPU now 1.3× and 6.1× slower.

Figure 4.7: Comparing time (ms) to solve a system of size N = 256 × 10^6 using oclspkt, dgtsv, sdtsvb and the TDMA.

In Figure 4.7 we compare these results to other diagonally dominant tridiagonal solver algorithms: our TDMA FPGA kernel, the CUDA GPU implementation dgtsv [53], the Intel MKL sdtsvb routine [73] and a sequential CPU implementation of the TDMA. For each of the three target devices our oclspkt implementation outperforms the comparison routines for solving a tridiagonal system of N = 256 × 10^6. The oclspkt (FPGA) is 1.7× faster than the TDMA (FPGA) kernel, the oclspkt (GPU) implementation is 1.1× faster than dgtsv, and our oclspkt (CPU) is 1.5× and 3.5× faster than the sdtsvb and TDMA CPU solvers respectively. Note that for each of these results we include any data-marshalling overhead, but exclude host to PCIe device transfer time.

A comparison of the compute performance when targeting single devices and heterogeneous combinations of devices executing the oclspkt routine can be seen in Figure 4.8. We normalise the performance metric to rows solved per second (RS s^-1) to provide a fair algorithmic comparison across different device hardware architectures. Furthermore, when evaluating the heterogeneous solver performance of oclspkt we take a holistic system approach, which includes the host to device PCIe data transfer times for the FPGA and GPU devices, and all data-marshalling overheads. As seen in Table 4.5, the GPU+FPGA device combination has the best average maximum performance. The GPU+FPGA device combination performs 1.38× better than the next best device, the GPU-only solver, and 2.48× better than the worst performing device, the CPU-only implementation.

Intuitively, we would expect the performance of the heterogeneous device combinations to be close to the sum of the individual device performance metrics. In fact, our results show that only the GPU+FPGA performance comes close to the sum of the GPU-only and FPGA-only performance, at 88% of the theoretical total (691 × 10^6 RS s^-1 against the 501 + 280 = 781 × 10^6 RS s^-1 sum in Table 4.5). The CPU+FPGA, CPU+GPU and CPU+GPU+FPGA average maximum performance reach only 65%, 51% and 55% respectively.

For the PCIe-attached devices, the GPU and FPGA performance metrics are determined by the available PCIe bus bandwidth. The kernel execution time (Figure 4.6) is completely interleaved with the host to device memory transfers. As such, the M4000 GPU card with 16 PCIe Gen 3.0 lanes available will outperform the A10PL4 FPGA card with 8 PCIe Gen 3.0 lanes regardless of the kernel compute performance. Similarly, the CPU performance is determined by the available host RAM bandwidth.

Using the de-facto industry standard benchmark for measuring sustained memory bandwidth, STREAM [74, 75], our desktop machine, specified in Table 4.4, has a maximum measured memory bandwidth of 42 GB s^-1. Profiling our CPU implementation of oclspkt using Intel VTune Amplifier shows very efficient use of the available memory bandwidth, with a sustained average of 36 GB s^-1 and a peak of 39 GB s^-1. This saturation of host memory bandwidth by the CPU solver creates a processing bottleneck and negatively affects the PCIe data transfer to the FPGA and GPU.

Device          Compute Performance [×10^6 RS s^-1]   Energy Efficiency [×10^6 RS J^-1]
CPU             279                                    1.39
GPU             501                                    12.9
FPGA            280                                    28.6
CPU+GPU         396                                    1.67
CPU+FPGA        365                                    1.81
GPU+FPGA        691                                    15.6
CPU+GPU+FPGA    431                                    2.06

Table 4.5: Maximum observed average (n = 16) compute performance and energy efficiency of oclspkt for single devices and heterogeneous combinations.

This, coupled with the heterogeneous partitioning scheme described in Subsection 4.3.3, will favour increasing the chunk size of the input system allocated to CPU computation on each successive invocation of the oclspkt routine, and will subsequently decrease the performance of the GPU and FPGA devices.

4.4.2 Numerical Accuracy

In Figure 4.9 we show the numerical accuracy of oclspkt, in terms of the infinity norm of the error between the known and computed solutions, as the diagonal dominance of the input matrix is varied, compared against the TDMA CPU implementation. The TDMA error approaches the machine epsilon value for single-precision floating-point numbers when the diagonal dominance of the input system is 2, whereas oclspkt on the CPU, GPU and FPGA requires a diagonal dominance of 2.8 to achieve similar accuracy. Equation 4.8 shows that an approximation of the upper bound of the infinity-norm error is dependent on the SPIKE partition size, the bandwidth, and the degree of diagonal dominance. As the GPU and FPGA partition sizes and the smallest CPU partition size are equal, that is, m_GPU = m_FPGA = m_CPU_min, the numerical accuracy of all implementations is expected to be very similar.
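For reference, the accuracy metric used here is simply the infinity norm of the difference between the known and computed solutions; a minimal sketch (with assumed variable names) is given below.

```cpp
// Sketch of the accuracy metric: infinity norm of the error between the known
// solution and the computed solution.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

float inf_norm_error(const std::vector<float>& x_known,
                     const std::vector<float>& x_computed)
{
    float max_err = 0.0f;
    for (std::size_t i = 0; i < x_known.size(); ++i)
        max_err = std::max(max_err, std::fabs(x_known[i] - x_computed[i]));
    return max_err;
}
```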

4.4.3 Energy Consumption and Efficiency

To determine the estimated energy consumption in Joules for each device, we take the manufacturer's rated thermal design power (TDP) and multiply it by the kernel execution time for the data-marshalling and solve steps of oclspkt. TDP represents the average power in Watts used by a processor when the device is fully utilised. Whilst this is not a precise measurement of the power used to solve the workload, it nevertheless provides a relative inter-device benchmark.
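A minimal sketch of this estimate is given below; the function names are illustrative, and the example figures are taken from the TDP and timing values reported in this section.

```cpp
// Sketch of the energy estimate: energy (J) = TDP (W) x execution time (s),
// then rows solved per Joule as the efficiency metric.
#include <cstddef>

double estimated_energy_joules(double tdp_watts, double exec_time_ms)
{
    return tdp_watts * (exec_time_ms / 1000.0);
}

double rows_per_joule(std::size_t rows, double energy_joules)
{
    return static_cast<double>(rows) / energy_joules;
}

// Example using figures reported in the text: the FPGA (33 W TDP) spending
// roughly 200 ms in its solve kernel uses about 33 * 0.2 = 6.6 J for that step.
```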

Figure 4.8: Performance comparison in rows solved per second for N = (256 ... 1280) × 10^6 when targeting CPU, GPU, FPGA and heterogeneous combinations of devices.

The TDP of the M4000 GPU is 120 W, that of the Xeon E5-1620 v4 is 140 W and that of the A10PL4 FPGA is 33 W. When solving a system of N = 256 × 10^6, as seen in Figure 4.10, the FPGA implementation uses 2.8× less energy than the GPU implementation and 20× less than the CPU implementation. Further, in Figure 4.11 we can see the energy efficiency of each hardware configuration of oclspkt in rows solved per Joule. Across the range of the experiment each solver shows consistent results, with the FPGA-only solver the most energy efficient, peaking at 28 × 10^6 rows solved per Joule. The FPGA-only solver is on average 1.8× more energy efficient than the next best performing solver, the GPU+FPGA combination, and is 20.0× more energy efficient than the poorest performing CPU-only solver. This is not surprising, since the TDP of the FPGA is an order of magnitude smaller than that of the other devices. Similarly to the heterogeneous results in Subsection 4.4.1, the addition of the CPU solver significantly constrains the available bandwidth to host memory, slowing down the PCIe data transfer rates. In turn, this pushes more work to the CPU solver, slows down the overall compute and, since the CPU has the highest TDP, exacerbates the poor energy efficiency.

Figure 4.9: Numerical accuracy of oclspkt for varying diagonal dominance compared to CPU TDMA solver.

Figure 4.10: Joules required to solve a tridiagonal system of N = 256 × 10^6 per device using the oclspkt routine.

4.5 Conclusion

In this paper we presented a numerically stable heterogeneous OpenCL implementation of the truncated-SPIKE algorithm targeting FPGAs, GPUs, CPUs, and combinations of these devices. Our experimental case has demonstrated the feasibility of utilising FPGAs, along with GPUs and CPUs, concurrently in a heterogeneous computing environment in order to accelerate solving a diagonally dominant tridiagonal linear system. When comparing our CPU, GPU and FPGA implementations of oclspkt to a suitable baseline implementation and to third-party solvers specifically designed and optimised for these devices, our implementations achieved 150%, 110% and 170% of the comparison routines' compute performance respectively.

Figure 4.11: Energy efficiency comparison in rows solved per Joule for N = (256 ... 1280) × 10^6 for oclspkt when targeting CPU, GPU, FPGA and heterogeneous combinations of devices.

Profiling the heterogeneous combinations of oclspkt showed that targeting the GPU+FPGA devices gives the best compute performance and that targeting the FPGA only gives the best energy efficiency. Adding our highly optimised CPU implementation to a heterogeneous device combination with PCIe-attached devices significantly reduced the expected performance of the overall system. In our experimental case, with a compute environment that has CPUs, GPUs and FPGAs, it is advantageous to relegate the CPU to a purely task-orchestration role instead of computation.

Under our experimental conditions all device compute performance results are memory-bandwidth constrained. While the GPU kernel compute performance is several times faster than the FPGA kernel, the FPGA test hardware has several times less available memory bandwidth. As high-bandwidth data transfer technology is introduced to new generations of FPGA accelerator boards, this performance gap between devices is expected to close. Given their significantly lower power requirements, the incorporation of FPGAs has the potential to reduce some of the power consumption barriers currently faced by HPC environments as we move towards exascale computing.

A natural progression of this work would be to extend the oclspkt routine to be able to solve non-diagonally dominant and block-tridiagonal linear systems. Further, it would be advantageous to extend the heterogeneous partitioning routine to be able to tune the solver to maximise energy efficiency where desired. An extension of this work may also seek to account for memory bottlenecks detected on successive invocations of the solver, further enhancing performance in heterogeneous applications.

4.6 Data Availability

The source code and data used to support the findings of this study are available from the corresponding author upon request.

4.7 Conflicts of Interest

The authors declare that there is no conflict of interest regarding the pub- lication of this paper.

4.8 Acknowledgements

This project utilised the high performance computing (HPC) facility at the Queensland University of Technology (QUT). The facility is administered by QUT's eResearch Department. A special thanks to the eResearch Department for their support, particularly in providing access to specialist FPGA and GPU resources.

Chapter 5

Conclusion

The primary motivation for this thesis is to determine the feasibility of targeting FPGAs for use in accelerating general purpose scientific computing on heterogeneous HPC platforms. This has been explored through the lens of a common scientific computing problem, solving a diagonally dominant tridiagonal linear system. With this focus, a comparative analysis of solver implementations for FPGA, GPU, CPU and heterogeneous combinations thereof has been completed. At a high level, the work presented in this thesis has shown:

• Portable code for GPU and FPGA devices can be developed using HLLs, but in doing so the code is optimised for neither device.

• Solver kernels developed for FPGAs can provide FLOP performance for scientific computing tasks of the same order of magnitude as CPU and GPU solver kernels.

• Using FPGAs can provide significant power savings when compared to GPUs and CPUs for similar tasks.

The remainder of this chapter discusses how Papers 1 and 2 have specifically addressed the research questions posed in Chapter 1, in addition to summarising the major contributions of these works. Finally, consideration is given to opportunities for further work in this area.

5.1 Summary of Thesis Objectives

This sub-section describes how the research questions posed in Chapter 1 have been investigated and answered by this work. The initially posed research questions are restated below:


1. Is it feasible to integrate non-traditional compute devices like FPGAs with GPUs/CPUs for general purpose scientific computing tasks?

2. Can portable diagonally dominant tridiagonal linear systems solvers be designed that could be deployed to any number of CPUs, GPUs or FPGAs present on a host system?

3. Can optimised portable code be created to target multiple devices with OpenCL?

4. Is the same compute performance achievable from HLLs as from HDLs when targeting FPGAs?

5. Is developing optimised FPGA kernels easier with HLLs than with HDLs?

RQ1 The papers presented in Chapters 3 and 4 both show that it is indeed feasible to integrate FPGA hardware as a computational accelerator for general purpose scientific computing. Paper 1 describes a comparative analysis of portable implementations of FPGA and GPU tridiagonal linear system solvers, each solving systems in isolation. Further, Paper 2 presents an FPGA implementation of the truncated-SPIKE algorithm working in concert with GPU and CPU implementations of the same algorithm to accelerate the computation. It was found that within the OpenCL framework it is relatively easy to select, add and dispatch work to FPGA hardware alongside other accelerators using the OpenCL execution model.
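As a rough illustration of how OpenCL exposes this, the sketch below enumerates every platform on the host, collects all CPU, GPU and accelerator (FPGA) devices found, and opens a context and in-order command queue per device. It is a simplified example, not the thesis host code; error handling is minimal, and since a context cannot span devices from different platforms, one context is created per device.

```cpp
// Sketch: discover every OpenCL device on the host and open an in-order queue for each.
#include <CL/cl.h>
#include <vector>

struct OpenedDevice {
    cl_device_id     device;
    cl_context       context;
    cl_command_queue queue;   // in-order by default (no out-of-order property set)
};

std::vector<OpenedDevice> open_all_devices()
{
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    std::vector<OpenedDevice> opened;
    for (cl_platform_id p : platforms) {
        cl_uint n = 0;
        // FPGA boards typically report as CL_DEVICE_TYPE_ACCELERATOR.
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &n) != CL_SUCCESS || n == 0)
            continue;
        std::vector<cl_device_id> ids(n);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, n, ids.data(), nullptr);

        for (cl_device_id d : ids) {
            cl_int err = CL_SUCCESS;
            cl_context ctx = clCreateContext(nullptr, 1, &d, nullptr, nullptr, &err);
            if (err != CL_SUCCESS) continue;
            cl_command_queue q = clCreateCommandQueue(ctx, d, 0, &err);
            if (err != CL_SUCCESS) { clReleaseContext(ctx); continue; }
            opened.push_back({ d, ctx, q });
        }
    }
    return opened;
}
```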

RQ2 and RQ3 Paper 1 presented portable implementations of the PCR and SPIKE algorithms to solve tridiagonal systems up to 1024 elements in size. These algorithms were able to run on the GPU and were simulated on FPGA hardware. The resource utilisation was too high for the SPIKE algorithm to be able to run all kernels on the FPGA concurrently.

The OpenCL implementations of the PCR and SPIKE algorithms were not optimised for the specific underlying compute architectures of each target device. This was by design, to ensure portability of the implementations, but in doing so did not maximise performance. This shows that while it is possible to implement fully portable OpenCL code to target multiple devices, it is not always advantageous to do so.

In Paper 2, rather than implementing fully portable OpenCL kernel code, it was decided to focus on a single algorithm with the goal of maximising performance whilst maintaining some portability. To do this the kernel code was divided into algorithm code blocks. These blocks remained portable to different devices, but were arranged in several device-specific OpenCL kernels in order to maximise performance on the given device. In addition, key parameters including loop size, unroll factors and Work Group sizes were defined and optimised per device to maximise performance.

Further, in Paper 2 the execution model of the host program allowed the user to define, or the system to dynamically allocate, the maximum number and types of devices to use in solving the tridiagonal system. The truncated-SPIKE algorithm allows the problem domain to be decomposed and sent to separate memory-isolated compute devices to be solved. For diagonally dominant tridiagonal matrices there is little inter-device communication required during the execution, allowing the implementation to scale to all devices available on a host.
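The per-device tuning described above can be expressed with ordinary preprocessor parameters passed to clBuildProgram. The fragment below is an illustrative OpenCL C kernel (not one of the thesis kernels) showing a Work Group size macro supplied at build time, for example "-DWG_SIZE=64" for the GPU or "-DWG_SIZE=1" for the FPGA, together with a loop unroll pragma whose factor would be chosen per device.

```c
// Illustrative OpenCL C fragment: the algorithm block stays the same on every
// device; only the tuning parameters change per device at build time.
#ifndef WG_SIZE
#define WG_SIZE 64            // overridden via clBuildProgram options, e.g. -DWG_SIZE=1
#endif

__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void scale_rows(__global const float* restrict in,
                __global float* restrict out,
                const int m)
{
    const int base = get_global_id(0) * m;   // one partition of size m per Work Item

    #pragma unroll 8                          // unroll factor tuned per device
    for (int i = 0; i < m; ++i)
        out[base + i] = 0.5f * in[base + i];
}
```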

RQ4 A comparative test between the OpenCL implementations of tridiagonal solvers described in this work and an HDL implementation could not be conducted. This is because, at the time of writing, no SPIKE or PCR HDL implementations are commercially or freely available, and implementing these routines is outside the scope of this work.

However, using the oclspkt implementation of the truncated-SPIKE solver presented in Paper 2, the Intel FPGA OpenCL kernel profiling tool [32] consistently showed memory bandwidth utilisation of 82%, that is, 28 GB s^-1 out of a possible 34 GB s^-1 maximum for the A10PL4 FPGA. Using an HDL implementation it is likely possible to achieve better memory bandwidth utilisation by having direct control over the design of the load-store units (LSUs). If the assumption is made that an HDL implementation has close to perfect memory bandwidth utilisation, and that the truncated-SPIKE algorithm is memory-bandwidth constrained, the ideal HDL implementation will only give a 21% performance boost¹ when compared to the HLL implementation. As such, it is likely that the HLL implementation will give close to the same compute performance as an HDL implementation.

¹ Calculated from (b_t − b_m)/b_m, where b_t is the total memory bandwidth available and b_m is the maximum utilised memory bandwidth; here (34 − 28)/28 ≈ 21%.

RQ5 In the author's experience, it is subjectively easier to implement computational routines on the FPGA and have them execute in-the-loop using the HLL OpenCL rather than an HDL. A naive OpenCL kernel implementing an algorithm can be written and tested in hours, rather than the days or weeks likely required with an HDL. However, as with most development work, coding high-performance optimised routines with OpenCL for FPGAs can still take weeks of iterative designing, coding and testing. The real benefits of using an HLL like OpenCL come from: 1) removing the steep learning curve associated with developing code with HDLs and using vendor-specific Integrated Development Environments (IDEs); and 2) having an integrated framework able to deploy work to the FPGA 'in-the-loop' of a host program.

5.2 Summary of Contributions

This section summarises the major contributions of papers 1 and 2 included in Chapters 3 and 4.

Chapter 3 – Paper 1: Implementation of parallel tridiagonal solvers for a heterogeneous computing environment This paper explored the feasibility of designing truly portable multi-device high-performance linear algebra routines using OpenCL. The parallel tridiagonal solvers PCR and SPIKE were implemented on an FPGA and a GPU and compared against other OpenCL and CUDA implementations. Results indicated that in terms of accuracy and computational efficiency our designs are comparable for both the FPGA and GPU cases, and suggested further investigation was warranted.

Chapter 4 – Paper 2: Implementing and Evaluating an Heterogeneous Scalable Tridiagonal Linear System Solver with OpenCL to target FPGAs, CPUs and GPUs This paper presented the oclspkt routine, a numerically stable heterogeneous OpenCL implementation of the truncated-SPIKE algorithm targeting CPUs, GPUs, FPGAs and combinations of these devices. The oclspkt routine, when run targeting a single device, showed 50% (CPU), 10% (GPU) and 70% (FPGA) improved compute performance compared to industry-leading device-specific routines.

Profiling the heterogeneous combinations of oclspkt showed that targeting the GPU+FPGA devices gives the best compute performance and targeting the FPGA only gives the best energy efficiency. Adding our highly optimised CPU implementation to a heterogeneous device combination with PCIe-attached devices significantly reduced the expected performance of the overall system.

The compute performance results in the second paper are shown to be memory-bandwidth constrained. While the GPU kernel compute performance is several times faster than the FPGA kernel, the FPGA test hardware has several times less available memory bandwidth. As newer generations of FPGA accelerator boards are equipped with high-bandwidth data transfer technology, the performance gap between devices is expected to close. This, coupled with the fractional power requirements of the FPGA, makes it an attractive concept for future HPC requirements.

5.3 Recommendations for Further work

A natural progression for the linear system solver aspect of this work would be to implement an HDL version of the oclspkt routine in order to compare its performance to the OpenCL oclspkt routine. Any improvement in compute performance could then be accurately quantified, allowing a more complete evaluation of the benefits and disadvantages of working with the HLL OpenCL rather than HDLs and vendor-specific IDEs.

A further development opportunity would see generalisation of the oclspkt routine to handle non-diagonally dominant matrices. This would require implementing partial pivoting and then solving the reduced system fully during the SPIKE factorisation stage, instead of approximating it with the 'truncated' method.

Considering FPGAs and OpenCL more broadly, a future investigation could focus on the implementation of an algorithm that is compute-constrained instead of memory-constrained, as is the case with tridiagonal linear system solver algorithms. This would further test the limits of the FPGA, and give new insight into how FPGAs compare to traditional CPU and GPU devices when utilising optimised routines.

Bibliography

[1] Khronos OpenCL Working Group, "The OpenCL specification," tech. rep., 2009.

[2] M. Feldman, "New GPU-accelerated supercomputers change the balance of power on the TOP500." https://www.top500.org/news/new-gpu-accelerated-supercomputers-change-the-balance-of-power-on-the-top500/, June 2018. Accessed: 2018-11-10.

[3] M. Feldman, "NVIDIA breaks new ground with Turing GPU architecture." https://www.top500.org/news/nvidia-breaks-new-ground-with-turing-gpu-architecture/, Aug. 2018. Accessed: 2018-11-10.

[4] NVIDIA Corporation, "NVIDIA Turing GPU architecture," tech. rep., Sept. 2018.

[5] M. Feldman, "Summit up and running at Oak Ridge, claims first exascale application." https://www.top500.org/news/summit-up-and-running-at-oak-ridge-claims-first-exascale-application/, June 2018. Accessed: 2018-11-10.

[6] I. Kuon, R. Tessier, and J. Rose, FPGA Architecture. Boston, United States: Now Publishers, 2008.

[7] I. Grout, "Chapter 1 - Introduction to programmable logic," in Digital Systems Design with FPGAs and CPLDs (I. Grout, ed.), pp. 1–41, Burlington: Newnes, Jan. 2008.

[8] W. Zhang, V. Betz, and J. Rose, "Portable and scalable FPGA-based acceleration of a direct linear system solver," ACM Trans. Reconfigurable Technol. Syst., vol. 5, no. 1, pp. 1–26, 2012.


[9] S. Junqing, G. D. Peterson, and O. O. Storaasli, "High-performance mixed-precision linear solver for FPGAs," IEEE Trans. Comput., vol. 57, no. 12, pp. 1614–1623, 2008.

[10] Rapid Prototyping Projection Algorithms with FPGA Technology, Rapid System Prototyping, 2009. RSP '09. IEEE/IFIP International Symposium on, 2009.

[11] W. Guiming, X. Xianghui, D. Yong, and W. Miao, "High-performance architecture for the conjugate gradient solver on FPGAs," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 60, no. 11, pp. 791–795, 2013.

[12] Dedicated hardware implementation of a linear congruence solver in FPGA, Electronics, Circuits and Systems (ICECS), 2012 19th IEEE International Conference on, 2012.

[13] W. Guiming, D. Yong, S. Junqing, and G. D. Peterson, "A high performance and memory efficient LU decomposer on FPGAs," IEEE Trans. Comput., vol. 61, no. 3, pp. 366–378, 2012.

[14] Y. Shao, L. Jiang, Q. Zhao, and Y. Wang, "High performance and parallel model for LU decomposition on FPGAs," in 2009 Fourth International Conference on Frontier of Computer Science and Technology, pp. 75–79, Dec. 2009.

[15] Solving tri-diagonal linear systems using field programmable gate arrays, 2012.

[16] D. J. Warne, R. F. Hayward, N. A. Kelson, J. E. Banks, and L. Mejias, "Pulse-coupled neural network performance for real-time identification of vegetation during forced landing," in Engineering Mathematics and Applications Conference (EMAC2013), vol. 55 of ANZIAM J., pp. C1–C16, 2014.

[17] L. Zhuo and V. K. Prasanna, "High-performance and parameterized matrix factorization on FPGAs," in 2006 International Conference on Field Programmable Logic and Applications, pp. 1–6, Aug. 2006.

[18] G. Wu, Y. Dou, Y. Lei, J. Zhou, M. Wang, and J. Jiang, "A fine-grained pipelined implementation of the LINPACK benchmark on FPGAs," in 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines, pp. 183–190, Apr. 2009.

[19] H. Giefers, R. Polig, and C. Hagleitner, "Analyzing the energy-efficiency of dense linear algebra kernels by power-profiling a hybrid CPU/FPGA system," in 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, pp. 92–99, June 2014.

[20] J. Gonzalez and R. C. Núñez, "LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators," J. Phys. Conf. Ser., vol. 180, no. 1, p. 012042, 2009.

[21] A. Rafique, N. Kapre, and G. A. Constantinides, "A high throughput FPGA-based implementation of the Lanczos method for the symmetric extremal eigenvalue problem," in Reconfigurable Computing: Architectures, Tools and Applications, pp. 239–250, Springer Berlin Heidelberg, 2012.

[22] Altera, "Implementing FPGA design with the OpenCL standard," 2013.

[23] L. Wirbel, "Xilinx SDAccel, a unified development environment for tomorrow's data center," tech. rep., The Linley Group, Nov. 2014.

[24] From OpenCL to high-performance hardware on FPGAs, Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, 2012.

[25] D. J. Warne, N. A. Kelson, and R. F. Hayward, "Comparison of high level FPGA hardware design for solving tri-diagonal linear systems," Procedia Comput. Sci., vol. 29, pp. 95–101, 2014.

[26] C. M. Angerer, R. Polig, D. Zegarac, H. Giefers, C. Hagleitner, C. Bekas, and A. Curioni, "A fast, hybrid, power-efficient high-precision solver for large linear systems based on low-precision hardware," Sustainable Computing: Informatics and Systems, vol. 12, pp. 72–82, Dec. 2016.

[27] B. R. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, "Chapter 2 - Introduction to OpenCL," in Heterogeneous Computing with OpenCL (Second Edition) (B. R. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, eds.), pp. 15–38, Boston: Morgan Kaufmann, Jan. 2013.

[28] "CUDA | GeForce." https://www.geforce.com/hardware/technology/cuda. Accessed: 2018-11-4.

[29] The Khronos Group, "Conformant products." https://www.khronos.org/conformance/adopters/conformant-products/opencl, Nov. 2018. Accessed: 2018-11-9.

[30] NVIDIA Corporation, "OpenCL programming guide for the CUDA architecture," tech. rep., Aug. 2009.

[31] Intel Corporation, "Developer guide for Intel® SDK for OpenCL™ applications." https://software.intel.com/en-us/openclsdk-devguide-2017, Oct. 2018. Accessed: 2018-11-9.

[32] Intel Corporation, "Intel® FPGA SDK for OpenCL Pro Edition - Best Practices Guide," tech. rep., Dec. 2017.

[33] W. H. Press, Numerical recipes in FORTRAN: the art of scientific computing. Cambridge [England]; New York, NY: Cambridge University Press, 1992.

[34] P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland, "A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems," Parallel and Distributed Computing Practices, vol. 2, pp. 385–400, 1999.

[35] P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland, "A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems II," in Euro-Par'99 Parallel Processing, pp. 1078–1087, Springer Berlin Heidelberg, 1999.

[36] C. R. Dun, M. Hegland, and M. R. Osborne, "Parallel stable solution methods for tridiagonal linear systems of equations," in Computational Techniques and Applications Conference (CTAC95), (River Edge, NJ), pp. 267–274, World Sci. Publishing, 1996.

[37] M. Hegland, "On the parallel solution of tridiagonal systems by wrap-around partitioning and incomplete LU factorization," Numer. Math., vol. 59, pp. 453–472, Dec. 1991.

[38] FPGA implementation of cubic spline interpolation method for empirical mode decomposition, Signal Processing and Communications Applications Conference (SIU), 2012 20th, 2012.

[39] M. Kass, A. Lefohn, and J. D. Owens, "Interactive depth of field using simulated diffusion," tech. rep., 2006.

[40] S. Palmer and D. Thomas, "Accelerating implicit finite difference schemes using a hardware optimised implementation of the Thomas algorithm for FPGAs," 2014.

[41] M. Sayeed, V. Magi, and J. Abraham, "Enhancing the performance of a parallel solver for turbulent reacting flow simulations," Numerical Heat Transfer, Part B: Fundamentals, vol. 59, pp. 169–189, Mar. 2011.

[42] L. W. Chang and W. M. Hwu, "A guide for implementing tridiagonal solvers on GPUs," pp. 29–44, June 2014.

[43] H. S. Stone, "An efficient parallel algorithm for the solution of a tridiagonal linear system of equations," J. ACM, vol. 20, no. 1, pp. 27–38, 1973.

[44] Ö. Eğecioğlu, C. K. Koc, and A. J. Laub, "A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors," J. Comput. Appl. Math., vol. 27, pp. 95–108, Sept. 1989.

[45] R. W. Hockney, "A fast direct solution of Poisson's equation using Fourier analysis," J. ACM, vol. 12, no. 1, pp. 95–113, 1965.

[46] R. W. Hockney and C. R. Jesshope, Parallel Computers. 1981.

[47] An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU, Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011.

[48] Memory Hierarchy Optimization for Large Tridiagonal System Solvers on GPU, Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on, 2012.

[49] Solving tridiagonal systems on a GPU, High Performance Computing (HiPC), 2013 20th International Conference on, 2013.

[50] Y. Zhang, J. Cohen, and J. D. Owens, "Fast tridiagonal solvers on the GPU," SIGPLAN Not., vol. 45, no. 5, pp. 127–136, 2010.

[51] Y. Zhang, J. Cohen, A. A. Davidson, and J. D. Owens, A Hybrid Method for Solving Tridiagonal Systems on the GPU. Elsevier Inc., 2012.

[52] E. Polizzi and A. H. Sameh, "A parallel hybrid banded system solver: the SPIKE algorithm," Parallel Comput., vol. 32, no. 2, pp. 177–194, 2006.

[53] L.-W. Chang, J. A. Stratton, H.-S. Kim, and W.-M. W. Hwu, "A scalable, numerically stable, high-performance tridiagonal solver using GPUs," pp. 1–11, IEEE Computer Society Press, 2012.

[54] A Hierarchical Tridiagonal System Solver for Heterogeneous Supercomputers, 2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2014.

[55] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, "2. Iterative methods," in Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Other Titles in Applied Mathematics, pp. 5–37, Society for Industrial and Applied Mathematics, Jan. 1994.

[56] GPU-accelerated scalable solver for banded linear systems, Cluster Computing (CLUSTER), 2013 IEEE International Conference on, 2013.

[57] Y. Dou, Y. Lei, G. Wu, S. Guo, J. Zhou, and L. Shen, "FPGA accelerating double/quad-double high precision floating-point applications for ExaScale computing," pp. 325–336, ACM, 2010.

[58] D. Jensen and A. F. Rodrigues, "Embedded systems and exascale computing," Comput. Sci. Eng., vol. 12, no. 6, pp. 20–29, 2010.

[59] J. Shalf, S. Dosanjh, and J. Morrison, "Exascale computing technology challenges," in High Performance Computing for Computational Science – VECPAR 2010 (J. L. Palma, M. Daydé, O. Marques, and J. Lopes, eds.), vol. 6449 of Lecture Notes in Computer Science, ch. 1, pp. 1–25, Springer Berlin Heidelberg, Jan. 2011.

[60] S. Skalicky, S. Lopez, and M. Lukowiak, "Performance modeling of pipelined linear algebra architectures on FPGAs," Comput. Electr. Eng., vol. 40, no. 4, pp. 1015–1027, 2014.

[61] G. H. Golub, ed., Cyclic reduction - history and applications, Workshop on Scientific Computing, (New York), Springer Verlag, 1997.

[62] Altera, "Altera SDK for OpenCL: Best practices guide," tech. rep., Altera Inc., May 2015.

[63] A. H. Sameh and D. J. Kuck, "On stable parallel linear system solvers," J. ACM, vol. 25, pp. 81–91, Jan. 1978.

[64] E. Polizzi and A. Sameh, "SPIKE: A parallel environment for solving banded linear systems," Comput. Fluids, vol. 36, no. 1, pp. 113–120, 2007.

[65] Performance Models for the Spike Banded Linear System Solver, Parallel and Distributed Computing (ISPDC), 2010 Ninth International Symposium on, 2010.

[66] H. Gabb, "Intel® adaptive Spike-based solver," tech. rep., Oct. 2010.

[67] J. Mair, Z. Huang, D. Eyers, and Y. Chen, "Quantifying the energy efficiency challenges of achieving exascale computing," in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 943–950, May 2015.

[68] M. U. Ashraf, F. A. Eassa, A. A. Albeshri, and A. Algarni, "Performance and power efficient massive parallel computational model for HPC heterogeneous exascale systems," IEEE Access, vol. 6, pp. 23095–23107, 2018.

[69] H. Macintosh, D. Warne, N. A. Kelson, J. Banks, and T. W. Farrell, "Implementation of parallel tridiagonal solvers for a heterogeneous computing environment," The ANZIAM Journal, vol. 56, pp. C446–C462, 2016.

[70] C. C. K. Mikkelsen and M. Manguoglu, "Analysis of the truncated SPIKE algorithm," SIAM Journal on Matrix Analysis and Applications; Philadelphia, vol. 30, no. 4, p. 20, 2008.

[71] Intel Corporation, "Developer guide for Intel® SDK for OpenCL™ applications." https://software.intel.com/en-us/openclsdk-devguide-2017, Oct. 2018. Accessed: 2018-11-9.

[72] K. Mendiratta, "A banded SPIKE algorithm and solver for shared memory architectures," 2011.

[73] Intel Corporation, "?dtsvb." https://software.intel.com/en-us/mkl-developer-reference-c-dtsvb, Nov. 2018. Accessed: 2018-11-12.

[74] J. D. McCalpin, "STREAM: Sustainable memory bandwidth in high performance computers," tech. rep., University of Virginia, Charlottesville, Virginia.

[75] J. McCalpin, "Memory bandwidth and machine balance in high performance computers," IEEE Technical Committee on Computer Architecture Newsletter, pp. 19–25, Dec. 1995.