

A study on linear algebra operations using extended precision floating-point arithmetic on GPUs


Graduate School of Systems and Information Engineering University of Tsukuba

November 2013

Daichi Mukunoki

Contents

1 Introduction
  1.1 Precision of Floating-Point Arithmetic
    1.1.1 IEEE Standard for Floating-Point Arithmetic
    1.1.2 Extended Precision Floating-Point Arithmetic
  1.2 GPU Computing
    1.2.1 Performance Characteristics
    1.2.2 CUDA
  1.3 Overview of the Thesis
    1.3.1 Scope of the Study
    1.3.2 Contributions
    1.3.3 Organization of the Thesis

2 Triple- and Quadruple-Precision BLAS on GPUs
  2.1 Introduction
  2.2 Quadruple-Precision Floating-Point Operations
    2.2.1 DD-type Quadruple-Precision Floating-Point Format
    2.2.2 DD-type Quadruple-Precision Floating-Point Arithmetic
  2.3 Triple-Precision Floating-Point Operations
    2.3.1 D+S-type Triple-Precision Floating-Point Format
    2.3.2 D+S-type Triple-Precision Floating-Point Arithmetic
    2.3.3 Computation of D+S-type Values using DD Arithmetic
    2.3.4 D+I-type Triple-Precision Floating-Point Format
  2.4 Implementation of Triple- and Quadruple-Precision BLAS Subroutines on GPUs
    2.4.1 DD Arithmetic Functions
    2.4.2 Implementation of BLAS Kernels
    2.4.3 Data Structures of Triple- and Quadruple-Precision Value Arrays in Global Memory


  2.5 Performance Prediction
    2.5.1 Theoretical Peak Performance on DD Arithmetic
    2.5.2 Performance Prediction using the Bytes/Flop and Bytes/DDFlop Ratios
  2.6 Performance Evaluation
    2.6.1 Evaluation Methods
    2.6.2 Performance Comparison of Double-, Triple- and Quadruple-Precision BLAS Subroutines
    2.6.3 AoS Layout vs. SoA Layout
    2.6.4 D+S Arithmetic vs. DD Arithmetic
    2.6.5 Accuracy Evaluation
  2.7 Related Work
  2.8 Conclusion

3 Optimization of Sparse Matrix-vector Multiplication on NVIDIA Kepler Architecture GPUs
  3.1 Introduction
  3.2 Related Work
  3.3 Kepler architecture GPUs
  3.4 Implementation
    3.4.1 48KB Read-only Data Cache
    3.4.2 Avoid Outermost Loop
    3.4.3 Shuffle Instruction
  3.5 Performance Evaluation
    3.5.1 Evaluation Methods
    3.5.2 Result
  3.6 Conclusion

4 Krylov Subspace Methods using Quadruple-Precision Arithmetic on GPUs
  4.1 Introduction
  4.2 Related Work
  4.3 Implementation
    4.3.1 Overview of Quadruple-Precision Versions
    4.3.2 Implementation of CG and BiCGStab Methods on GPUs
  4.4 Experimental Results
    4.4.1 Unpreconditioned Methods
    4.4.2 Cases with Preconditioning


  4.5 Conclusion

5 Conclusion
  5.1 Summary
  5.2 Future Work

Bibliography

A List of Publications

Listings

  2.1 Conversion from DD-type to D+I-type
  2.2 Conversion from D+I-type to DD-type
  2.3 Implementation of QuadMul
  2.4 Triple-precision AXPY using D+I-type format
  3.1 Host code of SpMV
  3.2 Kernel code of SpMV for the Fermi architecture
  3.3 Kernel code of SpMV for the Kepler architecture

List of Figures

  1.1 GPU architecture (NVIDIA Fermi architecture)
  1.2 CUDA device and thread hierarchy
  2.1 Quadruple-precision floating-point format on DD arithmetic
  2.2 Concept of DD arithmetic
  2.3 D+S-type triple-precision floating-point format
  2.4 Triple-precision operations using DD arithmetic on GPUs
  2.5 D+I-type triple-precision floating-point format
  2.6 GEMV kernel
  2.7 GEMM kernel
  2.8 Array of Structures (AoS) layout and Structure of Arrays (SoA) layout on D+S-type values
  2.9 Performance of triple- and quadruple-precision AXPY (DDFlops: DD-type floating-point operations per second)
  2.10 Relative execution time of AXPY (value is the multiple of the execution time of the double-precision subroutine)
  2.11 Performance of triple- and quadruple-precision GEMV (DDFlops: DD-type floating-point operations per second)
  2.12 Relative execution time of GEMV (value is the multiple of the execution time of the double-precision subroutine)
  2.13 Performance of triple- and quadruple-precision GEMM (DDFlops: DD-type floating-point operations per second)
  2.14 Relative execution time of GEMM (value is the multiple of the execution time of the double-precision subroutine)
  2.15 Performance comparison of AoS (Array-of-Structures) and SoA (Structure-of-Arrays) layouts on D+S-type triple-precision AXPY
  2.16 Performance comparison of D+S and DD arithmetic on GEMM

  3.1 CRS format


  3.2 Thread mapping for the cases of NT=1, 2 and 4 on the CRS-vector method
  3.3 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. A (only using the 48KB read-only data cache) to Ver. Fermi
  3.4 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. B (only avoiding the outermost loop) to Ver. Fermi
  3.5 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. C (only using the shuffle instruction) to Ver. Fermi
  3.6 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler to Ver. Fermi
  3.7 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler to cuSPARSE's subroutine

  4.1 Preconditioned CG method
  4.2 Preconditioned BiCGStab method
  4.3 Accelerating iterative methods using high precision arithmetic

List of Tables

  1.1 Binary floating-point basic formats defined in IEEE 754-2008

  2.1 Number of double-precision floating-point instructions and Flops for quadruple-precision multiply-add
  2.2 Floating-point instruction and cycle counts for D+S and DD arithmetic on Tesla M2050
  2.3 Bytes/Flop and Bytes/DDFlop for BLAS subroutines
  2.4 Evaluation environment
  2.5 Relative error to octuple-precision result (input: uniform random numbers in the range of 0 to 1 in double-precision)

  3.1 Evaluation environment
  3.2 Properties of the matrices shown in Figure 3.3
  3.3 Properties of the matrices shown in Figure 3.4
  3.4 Properties of the matrices shown in Figure 3.5
  3.5 Properties of the matrices shown in Figure 3.6
  3.6 Properties of the matrices shown in Figure 3.7

  4.1 Evaluation environment
  4.2 The number of problems which could be solved using quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ ε and satisfying the true relative residual ||b − Ax||_2/||b||_2 ≤ ε
  4.3 Cases which can be solved using both quadruple-precision (QP) and double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^-8 and satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^-8, but QP versions are faster
  4.4 Cases which can be solved using quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^-8 and satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^-8, but which do not converge when using double-precision (DP) versions with the stopping criterion within the maximum number of iterations of 30,000
  4.5 Cases which can be solved using quadruple-precision (QP) versions and converged by double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^-8, but not satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^-8 when using DP versions
  4.6 The 2 largest and 2 smallest relative execution times per iteration (QP/DP ratio)
  4.7 The number of problems which can be solved using preconditioned quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ ε and satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ ε
  4.8 Cases with preconditioning which can be solved using both quadruple-precision (QP) and double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^-8 and satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^-8, but QP versions are faster
  4.9 Cases with preconditioning which can be solved using quadruple-precision (QP) versions and converged by double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^-8, but not satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^-8 when using DP versions

Abstract

Most computational tasks in scientific and engineering calculations are linear algebraic operations, which operate on vectors and matrices, such as the solving of systems of linear equations. On computers, real numbers are represented using floating-point numbers and are computed with floating-point arithmetic operations. Floating-point operations have rounding errors, and these errors may become a critical issue for some applications. With the advances realized in computational science, there is a need for more accurate computation, especially in large-scale and long-term simulations; in such applications the accumulation of numerical errors may lead to even more serious problems in the future. Therefore, we need to improve the accuracy and precision of floating-point operations. In this thesis, I will describe the implementation, performance, and effectiveness of linear algebraic operations using extended precision floating-point arithmetic on Graphics Processing Units (GPUs). Although GPUs are specialized hardware accelerators that are designed to perform graphics processing, in recent years GPUs have become capable of performing general purpose computations that were traditionally handled by CPUs. As a result, General Purpose computing on GPUs (GPGPU) has been a major topic of research in the HPC area. I will first describe the implementation and performance of triple- and quadruple-precision Basic Linear Algebra Subprograms (BLAS) subroutines, AXPY, GEMV, and GEMM, on GPUs. Quadruple-precision operations are performed using Double-Double (DD) arithmetic, which is a method for performing quadruple-precision floating-point arithmetic in software. In addition, I propose two new triple-precision floating-point formats, the Double+Single (D+S)-type and Double+Int (D+I)-type formats, and methods of computing these values on GPUs. I will show a performance comparison of double-, triple-, and quadruple-precision subroutines on an NVIDIA Tesla M2050 Fermi architecture GPU. Since the GPU has relatively high floating-point performance compared to its memory bandwidth, the performance of triple- and quadruple-precision AXPY and GEMV is memory bound. Therefore, the execution times of the triple- and quadruple-precision subroutines are close to 1.5

ix and 2 times more than that of double-precision subroutines, respectively. Next I will describe the application of using extended precision arithmetic for sparse linear algebra on GPUs. To implement fast sparse matrix operations on GPUs, I will present techniques for optimizing Sparse Matrix-Vector multiplication (SpMV) for the CRS format on NVIDIA Kepler architecture GPUs. The proposed implementation is based on an existing method proposed for Fermi architecture, an earlier generation of the Kepler architecture. My proposed implementation takes advantage of three new features introduced in Kepler: a 48KB read-only data cache, shuffle instructions and expansion of the MaxGridDimX. On the Tesla K20 Kepler architecture GPU, my proposed implementation achieved better double-precision performance than implementations optimized for the previous generation of GPU. Finally, I will describe the implementation and performance evaluation of Krylov subspace methods using quadruple-precision floating-point arithmetic on GPUs. The convergence of the Krylov subspace methods, which are iterative methods for solving linear systems, is significantly affected by rounding errors and there are cases where reducing rounding errors with extended precision arithmetic causes the algorithm to converge more quickly. I implemented the CG and BiCGStab meth- ods, which are Krylov subspace methods, using quadruple-precision floating-point arithmetic, and compares the performance to the standard double-precision imple- mentations. On unpreconditioned methods, the use of quadruple-precision arith- metic required approximately 1.11–2.20 times more execution time than that of the double-precision versions for one iteration. On the other hand, the quadruple- precision iteration time for methods with double-precision incomplete LU precondi- tioning is only slightly more than that of double-precision. I will show cases where the quadruple-precision version can reach a solution faster than the double-precision version. Chapter 1

Introduction

Today, computer simulation techniques are widely used in scientific and engineering fields such as astronomy, weather forecasting, molecular modeling, and industrial design. Computational tasks in these fields typically require a large number of computations and a large amount of memory, and supercomputers are used to perform such computations. Therefore, computer science, and especially the high performance computing (HPC) area, plays a key role in developing hardware and software for supercomputing. Most computational tasks in scientific and engineering calculations are linear algebraic operations, which operate on vectors and matrices, such as the solving of systems of linear equations. On computers, real numbers are represented using floating-point numbers and are computed with floating-point arithmetic operations. In fact, the LINPACK benchmark [1], which is used to rank the world's top supercomputers in the TOP500 list [2], performs dense linear algebraic operations using floating-point operations, and its performance is evaluated using the "Flops" value, which represents the number of floating-point operations performed per second. Floating-point operations are required not only for scientific and engineering computations but also for multimedia processing; most processors currently provide floating-point support in hardware. Floating-point numbers are stored with a fixed number of significand and exponent digits, and the numbers are computed with finite-precision floating-point arithmetic operations. Thus, floating-point numbers cannot always precisely represent real numbers, and the result of floating-point operations may include numerical errors, such as loss of significance and rounding errors. These errors may become a critical issue for some applications, and in some cases the usual precision of hardware floating-point operations is insufficient; therefore, improved accuracy and precision in floating-point operations are required.

For example, it is known that the convergence of Krylov subspace methods, which are iterative methods for solving linear systems, is significantly affected by rounding errors; there have also been cases where the reduction of rounding errors with higher precision floating-point arithmetic caused the algorithm to converge more quickly [3]. Moreover, with the advances realized in computational science, there is a need for more accurate computation, especially in large-scale and long-term simulations, and the accumulation of numerical errors may lead to even more serious problems in the future [4]. Currently, most modern processors support 64-bit floating-point operations with a 52-bit significand. I will focus on the demand for extending significand precision in linear algebraic operations by examining the implementation, performance, and effectiveness of linear algebraic operations using such extended precision floating-point arithmetic on Graphics Processing Units (GPUs). Although GPUs are specialized hardware accelerators for CPUs, designed to perform graphics processing, they have recently become capable of performing general purpose computations that were traditionally handled by CPUs. As a result, General Purpose computing on GPUs (GPGPU) has been a major topic of research in the HPC area. It is believed that the study of linear algebraic operations using extended precision floating-point arithmetic on GPUs is important for future use. The remainder of this chapter gives the background knowledge required to understand the following chapters and an overview of this thesis.

1.1 Precision of Floating-Point Arithmetic

1.1.1 IEEE Standard for Floating-Point Arithmetic

Floating-point formats and rounding are standardized by the Institute of Electrical and Electronics Engineers (IEEE) as the IEEE Standard for Floating-Point Arithmetic (IEEE 754). The first version, adopted in 1985 (IEEE 754-1985), defined the well-known "single-" and "double-" precision binary floating-point formats. The latest version, IEEE 754-2008 [5], published in 2008, defines four binary floating-point formats, as shown in Table 1.1. Currently, most modern processors, such as x86 CPUs and GPUs, have hardware support for the IEEE binary floating-point formats of single-precision, or of both single- and double-precision, and their arithmetic operations. The "binary16" half-precision and "binary128" quadruple-precision formats were officially added in IEEE 754-2008. With the exception of binary16, these formats are basic formats and can be used for arithmetic operations, whereas binary16 is not a basic format and is intended only for storage.


Table 1.1: Binary floating-point basic formats defined in IEEE 754-2008

Name        Common name           Exponent   Significand   Total
binary16    Half-precision        5 bits     10+1 bits     16 bits
binary32    Single-precision      8 bits     23+1 bits     32 bits
binary64    Double-precision      11 bits    52+1 bits     64 bits
binary128   Quadruple-precision   15 bits    112+1 bits    128 bits

The formats have one sign bit, the exponent bits, and the significand bits; the extra bit shown as "+1" in Table 1.1 is in addition to the explicitly stored width, because a floating-point number is normalized so that its most significant bit is one. In addition to the binary floating-point formats, IEEE 754-2008 also defines decimal floating-point formats intended for applications that are required to emulate decimal rounding. Moreover, IEEE 754-2008 specifies extended- and extendable-precision formats. The IEEE extended precision format is recommended for extending the precision used for arithmetic beyond the binary floating-point basic formats, with both wider significand precision and a wider exponent range; in the extendable-precision format, both the precision and the range are defined under user control. The x86 processor supports the 80-bit extended precision floating-point format, which meets the requirements of the IEEE extended precision format, with a 15-bit exponent and a 64-bit significand. Note that the IEEE extended precision format requires the extension of both the significand precision and the exponent range. In this thesis, however, "extended precision" also denotes the extension of the significand precision alone.
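As a brief aside (this example is not from the thesis), the field layout of Table 1.1 can be inspected directly. The following C snippet pulls the sign, biased exponent, and explicitly stored significand bits out of a binary64 value, using the same kind of bit-level access that Listing 2.1 later performs through a union; the variable names are illustrative.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        double x = -1.75;
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);                       /* view the 64 bits of the binary64 value */
        unsigned sign     = (unsigned)(bits >> 63);                   /* 1 bit */
        unsigned exponent = (unsigned)((bits >> 52) & 0x7FF);         /* 11 bits, biased by 1023 */
        uint64_t frac     = bits & 0x000FFFFFFFFFFFFFULL;             /* 52 explicitly stored bits */
        printf("sign=%u  exponent=%u (unbiased %d)  significand=0x%013llx\n",
               sign, exponent, (int)exponent - 1023, (unsigned long long)frac);
        return 0;
    }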

1.1.2 Extended Precision Floating-Point Arithmetic

Besides the 80-bit extended precision format on the x86, most common processors, including GPUs, do not support any precision higher than double-precision; therefore, operations with higher precision than that supported by hardware must be executed by software implementation. For example, IEEE binary128 quadruple-precision floating-point operations are available in the Intel Compiler [6], the GNU Fortran compiler, and the GNU C Compiler [7] on the x86, but they are computed in software. There are two principal ways of computing extended precision floating-point numbers in software. The first approach is the most straightforward: computing the extended precision floating-point numbers using integer arithmetic operations. This method is mainly used in arbitrary precision libraries such as the GNU Multiple Precision Arithmetic library (GMP) [8].


The binary128 emulation on the x86 is also implemented using this approach. The second approach computes extended precision floating-point numbers using hardware-implemented floating-point arithmetic and stores each number as a combination of a few floating-point values. The representative example is quadruple-precision floating-point operations using double-double (DD) arithmetic [9][10][11]. This method is used in the well-known quadruple- and octuple-precision floating-point arithmetic library, the QD library [12]. In DD arithmetic, a quadruple-precision floating-point number is represented using a pair of double-precision floating-point values, which hold the higher and lower parts of the significand of the quadruple-precision number. DD arithmetic computes quadruple-precision floating-point arithmetic using only double-precision floating-point arithmetic operations. This approach can utilize the sign, the exponent, and the significand of existing floating-point formats; therefore, compared with the former approach, it has a speed advantage when only quadruple- or octuple-precision is required [4]. However, this approach cannot extend the exponent range of the existing floating-point format. Software processing of extended precision floating-point arithmetic is highly compute-intensive and generally requires a much greater computation time than single- and double-precision floating-point operations performed in hardware on most modern processors.
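As a small, self-contained illustration of this pair-of-values idea (not taken from the thesis), the following C program holds 1 + 2^-70 as the unevaluated sum of a higher and a lower double; a single binary64 value cannot represent it, because 2^-70 is smaller than 0.5 ulp of 1.0.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double hi = 1.0;                  /* higher part */
        double lo = ldexp(1.0, -70);      /* lower part: 2^-70, far below 1 ulp of hi */
        printf("(hi + lo) == hi ? %d\n", (hi + lo) == hi);   /* prints 1: lost in one double */
        printf("lo kept in the pair: %g\n", lo);             /* the pair still carries 2^-70 */
        return 0;
    }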

1.2 GPU Computing

GPUs are specialized hardware accelerators for CPUs, designed to perform graphics processing tasks such as 3D shading, and were originally designed to perform only graphics processing. In recent years, however, they have made it possible to perform general purpose computations that were traditionally handled by CPUs, an approach called General Purpose computing on GPUs (GPGPU), which has been a major topic of research in the HPC area. In fact, since November 2009, GPU-equipped supercomputers have appeared in the top 10 of the TOP500 list [2]. The computational performance of GPUs has increased more rapidly than that of CPUs due to the rapid growth of the video game market. As a result, GPUs have provided better cost performance than traditional CPUs. Early generations of GPUs supported only single-precision floating-point operations; GPU vendors, however, have released products targeting the HPC market that support double-precision floating-point operations and error correcting code (ECC) memory. Furthermore, GPU programming has become easier with the development

of GPU programming tools such as AMD Accelerated Parallel Processing (APP) [13], NVIDIA Compute Unified Device Architecture (CUDA) [14], and OpenCL [15]. In addition, GPUs have recently received attention as energy-efficient processors.

1.2.1 Performance Characteristics

GPUs have high floating-point performance and high memory bandwidth. For instance, the Cray XK7 supercomputer [16], which is used as the base system of the Titan supercomputer at Oak Ridge National Laboratory, has one AMD Opteron 6274 CPU (16 cores) and one NVIDIA Tesla K20X GPU per node. The CPU has a theoretical peak double-precision floating-point performance of 281.6 GFlops and a main DRAM memory bandwidth of 51.2 GB/s, whereas the GPU has a theoretical peak performance of 1.31 TFlops and a DRAM memory bandwidth of 250 GB/s. However, the architecture of GPUs is quite different from that of traditional CPUs. The most noticeable feature of a GPU is its "many core" and "multithreaded" architecture. GPUs have a large number of simple cores, generally ranging from a few hundred to a few thousand, and a program is executed with a massive number of threads on these cores. Because of this architecture, GPUs achieve high performance only for highly parallel computations, such as operations on vectors or matrices. On the other hand, GPUs must be controlled from CPUs, and operating systems do not run on GPUs. GPUs can act only as accelerators for CPUs, to which they are connected via a peripheral bus such as the PCI Express (PCIe) bus. In addition, a typical GPU connection, PCIe 2.0 ×16, has a theoretical maximum bandwidth of 8 GB/s in each direction, which is much narrower than that of the DRAM memory. Thus, only computationally intensive applications are capable of effectively utilizing GPUs.
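As a rough illustration using the peak values quoted above (these ratios are derived here and are not figures from the thesis itself), the ratio of memory bandwidth to floating-point performance, the Bytes/Flop ratio used later in Section 2.5, is similarly low on both processors, even though the GPU's absolute numbers are far higher:

\[
\frac{51.2\ \text{GB/s}}{281.6\ \text{GFlops}} \approx 0.18\ \text{Bytes/Flop (CPU)},
\qquad
\frac{250\ \text{GB/s}}{1310\ \text{GFlops}} \approx 0.19\ \text{Bytes/Flop (GPU)}.
\]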

1.2.2 CUDA

Compute Unified Device Architecture (CUDA) [14] is a parallel computing platform and programming model developed by NVIDIA. CUDA provides programming languages, extensions of C/C++ and Fortran, for developing programs for GPUs. Although various languages and frameworks for GPGPU programming have been proposed, CUDA is one of the most widely used GPGPU platforms. I use CUDA, and this section describes the corresponding technical terms and background knowledge needed to understand this thesis, with reference to the NVIDIA CUDA C Programming Guide [17]. In CUDA, a GPU is called a "CUDA device" or simply a "device," whereas the CPU is called a "host." CUDA devices are classified into several generations according to the Compute Capability (CC).



Figure 1.1: GPU architecture (NVIDIA Fermi architecture)

As an example of a CUDA device, Figure 1.1 shows the NVIDIA Fermi architecture GPU [18], which corresponds to CC 2.0 and was released in 2009. A device has a large number of simple cores called "CUDA cores." Each CUDA core has an Arithmetic Logic Unit (ALU) and a Floating-Point Unit (FPU). The FPU supports IEEE 754-2008 compliant single- and double-precision floating-point operations and the Fused Multiply-Add (FMA) instruction, which performs a × b + c with one rounding step. Several CUDA cores are contained in a Streaming Multiprocessor (SM), and a device has several SMs. Each SM has register files, Special Function Units (SFUs), load/store units, and 64 KB of fast on-chip memory. The on-chip memory can be configured as 48 KB of shared memory, which can be used as scratch-pad memory, with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Each device also has global memory, which is off-chip DRAM, and an L2 cache.



Figure 1.2: CUDA device and thread hierarchy

In CUDA, programs that are executed on the GPU are called "kernel grids" or simply "grids," and they are invoked by the host. A grid is executed by a large number of threads on many CUDA cores. The threads are distributed to SMs as groups of threads called "thread blocks." Each thread runs on a CUDA core, and each thread block runs on an SM. The number of threads in a thread block and the number of thread blocks are configurable, but the threads in a thread block are executed on an SM in groups of 32. This unit of 32 threads, called a "warp," executes the same instruction at the same time; NVIDIA calls this the Single Instruction Multiple Threads (SIMT) architecture. GPUs achieve high performance by hiding memory access latency with computation, switching among the large number of threads. The hierarchy of CUDA cores and threads is related to the memory hierarchy. Figure 1.2 shows the hierarchy and the relation among CUDA cores, memories, and threads.

Each level of the hierarchy corresponds to the data accessibility of the threads. For example, a thread cannot access data in the register files of other threads; data in shared memory, however, is accessible to all threads in the same thread block. This means that shared memory can be used not only as scratch-pad memory but also for data communication between threads in the same thread block.
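To make the hierarchy concrete, the following is a minimal CUDA C example (an illustration, not code from this thesis) of a double-precision AXPY kernel, y = αx + y, the simplest of the BLAS subroutines treated in Chapter 2. Each thread computes one element of y; threads are grouped into thread blocks, and the grid of thread blocks forms one kernel launch. The launch parameters and pointer names are illustrative.

    __global__ void daxpy(int n, double alpha, const double *x, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            y[i] = alpha * x[i] + y[i];
    }

    // Host side: launch enough thread blocks of 256 threads to cover n elements.
    //   int threads = 256;
    //   int blocks  = (n + threads - 1) / threads;
    //   daxpy<<<blocks, threads>>>(n, alpha, d_x, d_y);   // d_x, d_y: arrays in global memory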

1.3 Overview of the Thesis

This thesis intends to show the implementation and performance of linear algebraic operations using extended precision floating-point arithmetic on GPUs. This section describes the scope of the study, the main contributions, and the organization of the thesis.

1.3.1 Scope of the Study

As extended precision operations, this thesis focuses on triple- and quadruple-precision operations, defined here as operations on floating-point data whose significand is approximately three and four times, respectively, the length of the 23-bit significand of IEEE 754 single-precision. As described previously, GPUs do not support higher than double-precision operations in hardware, and therefore such higher precision operations are executed by software implementation. For the implementation and evaluation, I target NVIDIA GPUs and the CUDA environment; specifically, I use GPUs with the Fermi architecture [18] released in 2009 and the Kepler architecture [19] released in 2012. I focus on linear algebra operations on a single GPU; the operations compute data that resides in the global memory of a single GPU. Thus, this thesis does not discuss performance that includes the time required to transfer data between the CPU and GPU via the PCIe bus. As linear algebra operations, the first half of the thesis focuses on basic linear algebra operations for vectors and dense matrices, and the latter half on sparse matrix operations. Specifically, in the first half of the thesis, I will implement three Basic Linear Algebra Subprograms (BLAS) [20] subroutines, AXPY (y = αx + y), GEMV (y = αAx + βy), and GEMM (C = αAB + βC), as examples of vector and dense matrix operations. In the latter half, I focus on sparse matrix operations: the thesis first shows the implementation and optimization of double-precision Sparse Matrix-Vector multiplication (SpMV) on GPUs, and I then implement the Conjugate Gradient (CG) and Bi-Conjugate Gradient Stabilized (BiCGStab) methods, which are Krylov subspace methods, i.e., iterative methods for solving linear systems Ax = b, using quadruple-precision arithmetic.

The CG and BiCGStab methods mainly consist of SpMV subroutines and vector-vector operations such as AXPY.

1.3.2 Contributions

The main contributions of this thesis are as follows. First, this thesis reveals that the performance of some linear algebra operations using quadruple-precision floating-point arithmetic becomes memory bound on GPUs, and that the execution time is then approximately two times that of the double-precision operations, due to the low Bytes/Flop ratio of GPUs. Second, triple-precision operations for linear algebra operations on GPUs are proposed as a new extended precision choice between double- and quadruple-precision. This thesis realizes triple-precision linear algebra operations that are faster than the corresponding quadruple-precision operations when the operation is memory bound in both triple- and quadruple-precision. Third, this thesis presents optimization techniques for an SpMV subroutine for the CRS format on NVIDIA Kepler architecture GPUs. By utilizing new features of the Kepler architecture GPUs, speedups are achieved over an implementation optimized for the Fermi architecture, the generation preceding Kepler. Finally, the performance of the CG and BiCGStab methods, which are Krylov subspace methods, using quadruple-precision floating-point arithmetic on GPUs is shown. This thesis suggests potential speedups from using quadruple-precision arithmetic instead of double-precision on GPUs.

1.3.3 Organization of the Thesis

The remainder of this thesis is organized as follows. Chapter 2 will show the implementation and performance of triple- and quadruple-precision BLAS subroutines, AXPY, GEMV, and GEMM, on a Fermi architecture GPU. The chapter will first introduce double-double (DD) arithmetic, an existing method used to perform quadruple-precision floating-point arithmetic in software. I will then propose two new triple-precision floating-point formats, the Double+Single (D+S)-type and Double+Int (D+I)-type formats, and methods of computing these values on GPUs. The evaluation will show a performance comparison of double-, triple-, and quadruple-precision subroutines on an NVIDIA Tesla M2050 GPU and discuss the performance using the Bytes/Flop ratios of the GPU and of the operations.

As a result, this chapter will show that although the performance of the triple- and quadruple-precision GEMM is computationally bound, triple- and quadruple-precision AXPY and GEMV are memory bound, and their execution times are therefore close to 1.5 and 2 times that of the double-precision subroutines, respectively. Chapter 3 will show optimization techniques for double-precision SpMV for the CRS format on NVIDIA Kepler architecture GPUs using CUDA. This research is a step toward implementing and evaluating iterative methods for solving sparse linear systems using quadruple-precision operations in Chapter 4. The chapter will show optimizations utilizing three new features of the Kepler architecture: a 48KB read-only data cache, shuffle instructions, and the expanded MaxGridDimX. The performance evaluation will show the effects of the three optimizations on 200 matrices randomly selected from the University of Florida Sparse Matrix Collection [21], on a Tesla K20 Kepler architecture GPU. As a result, the implementation optimized for the Kepler architecture is approximately 1.04 to 1.78 times faster than the original implementation for the Fermi architecture. Moreover, the implementation optimized for the Kepler architecture outperforms the NVIDIA cuSPARSE library's [22] implementation of the SpMV routine for the CRS format for 174 of the 200 matrices. Chapter 4 will show the application of quadruple-precision operations on GPUs to iterative methods for solving sparse linear systems. The convergence of Krylov subspace methods, which are iterative methods for solving linear systems Ax = b, is significantly affected by rounding errors. Therefore, there are cases where the reduction in rounding errors with quadruple-precision arithmetic causes the algorithm to converge more quickly. Even if the use of quadruple-precision arithmetic increases the execution time of one iteration, the total time until convergence may be reduced if the increased precision can sufficiently compensate for this by reducing the number of iterations. I implemented the CG and BiCGStab methods, which are Krylov subspace methods, using quadruple-precision floating-point arithmetic on GPUs, and I will compare their performance to that of standard double-precision implementations on the Tesla K20 GPU. Note that this research aims to accelerate the methods by using quadruple-precision arithmetic instead of double-precision arithmetic; therefore, the input data, that is, the coefficient matrix and the right-hand side vector, are given in double-precision. The performance evaluation will show that the execution time of one iteration of the unpreconditioned methods using quadruple-precision arithmetic is only approximately 1.11 to 2.20 times that of the double-precision versions. Moreover, I will show that there are cases where the time until convergence can be reduced using quadruple-precision arithmetic instead of double-precision,

even when quadruple-precision arithmetic is not necessary. Finally, Chapter 5 will conclude the thesis and discuss future work.

Chapter 2

Triple- and Quadruple-Precision BLAS Subroutines on GPUs

2.1 Introduction

Most computational tasks in scientific and engineering computations are linear algebraic operations that use floating-point operations; however, floating-point operations have rounding errors, which may become a critical issue for some applications. With progress in computational science, there is a need for more accurate computation, especially in large-scale computing, where the accumulation of rounding errors may lead to even more serious problems. The usual precision of hardware floating-point operations is insufficient in some cases, and greater accuracy and precision are required. Although GPUs are specialized hardware accelerators for CPUs and are designed to perform graphics processing, they have recently become capable of general purpose computations that were traditionally handled by CPUs. As a result, GPU computing has become a major focus of research in the HPC area. However, GPUs do not support higher than double-precision floating-point arithmetic in hardware; therefore, such arithmetic must be executed via software implementation. Quadruple-precision operations have often been used as extended precision operations. For example, double-double (DD) arithmetic [9][10][11] has been proposed and is often used to compute quadruple-precision floating-point arithmetic in software. This chapter aims to discuss the performance of triple- and quadruple-precision linear algebra operations on GPUs. In order to evaluate the performance of triple- and quadruple-precision operations on basic linear algebra operations, I will implement

representative Basic Linear Algebra Subprograms (BLAS) [20] subroutines, AXPY (y = αx + y), GEMV (y = αAx + βy), and GEMM (C = αAB + βC), on an NVIDIA Tesla M2050, which is a Fermi architecture GPU. The remainder of this chapter is organized as follows. Section 2.2 introduces DD arithmetic for quadruple-precision operations on GPUs. Section 2.3 proposes triple-precision floating-point formats and their operations on GPUs. The implementation of triple- and quadruple-precision BLAS subroutines on GPUs is shown in Section 2.4. Section 2.5 provides the performance prediction of triple-precision BLAS subroutines on GPUs. Section 2.6 shows the performance evaluation results, which include performance comparisons of the triple- and quadruple-precision subroutines with the double-precision subroutines. Finally, Section 2.8 concludes this chapter.

2.2 Quadruple-Precision Floating-Point Operations

The quadruple-precision floating-point format was standardized in IEEE 754-2008 [5] as binary128. Some compilers support the format and its operations in software on the x86 [6][7]; however, such support is not available on GPUs. On the other hand, double-double (DD) arithmetic [9][10][11] has been proposed and is often used to compute quadruple-precision floating-point arithmetic in software. It represents one quadruple-precision number by combining two double-precision numbers, and quadruple-precision floating-point arithmetic operations are performed using only double-precision floating-point arithmetic operations. Although this approach cannot extend the exponent range of the existing floating-point format, it can utilize the sign, the exponent, and the significand of existing floating-point arithmetic operations; therefore, it has a speed advantage compared with methods implemented using integer arithmetic operations [4]. Binary128 emulation requires integer arithmetic operations, and therefore I used DD arithmetic. The concept of DD arithmetic was proposed in 1971 by Dekker [9], who called this approach "double-length" arithmetic and presented ALGOL 60 procedures in his paper. In the late 1990s, Bailey [10] and Briggs [11] developed software libraries that perform quadruple-precision arithmetic operations using pairs of IEEE 754 double-precision values, calling it "double-double" arithmetic. In addition, Hida et al. developed the QD library [12], written in C++/Fortran-90, which provides quadruple-precision arithmetic using DD arithmetic and octuple-precision arithmetic using quad-double (QD) arithmetic; the latter represents an octuple-precision number by combining four double-precision numbers.



DD and QD arithmetic have also been implemented for NVIDIA GPUs by Lu et al. [23] and for Field-Programmable Gate Array (FPGA) devices by Dou et al. [24]. Graça [25] implemented double-precision floating-point arithmetic using double-float arithmetic, an approach similar to DD arithmetic, for GPUs that do not support double-precision in hardware. Thall [26] also implemented double-float and quad-float arithmetic on GPUs.

Figure 2.1: Quadruple-precision floating-point format on DD arithmetic

This section introduces the algorithms of DD arithmetic for addition and multiplication that are required to implement the three BLAS subroutines. I used the same algorithms as the DD arithmetic in the QD library. Throughout this thesis, exact arithmetic operations are represented using {+, −, ×, ÷}, and IEEE 754-2008 floating-point operations with rounding to nearest-even are represented using the circled operators {⊕, ⊖, ⊗, ⊘}.

2.2.1 DD-type Quadruple-Precision Floating-Point Format

DD arithmetic represents a quadruple-precision floating-point number a using a pair of double-precision floating-point numbers, a_hi and a_lo: a = a_hi + a_lo, as shown in Figure 2.1. a_hi and a_lo represent the higher and lower parts of the significand of the quadruple-precision number, respectively, and it is assumed that |a_lo| ≤ 0.5 ulp(a_hi). "Ulp" denotes "unit in the last place," i.e., the gap between a floating-point number and the next larger floating-point number. In IEEE 754-2008, the double-precision floating-point format is defined as "binary64," with a significand of 52 bits (53 bits of total precision including the implicit bit), as shown in Table 1.1. Therefore, the total significand precision of a DD number is 52 + 52 = 104 bits (106 bits including the implicit bits), or approximately 32 decimal digits. Note that both the exponent range and the significand precision of the DD number are less than those of the "binary128" IEEE 754-2008 quadruple-precision floating-point format, which has a 112-bit significand and a 15-bit exponent.



Figure 2.2: Concept of DD arithmetic

2.2.2 DD-type Quadruple-Precision Floating-Point Arithmetic

This section introduces the algorithms of quadruple-precision addition and multiplication in DD arithmetic. Figure 2.2 shows the concept of quadruple-precision addition and multiplication in DD arithmetic; the methods are similar to those used by humans to compute two-digit numbers on paper. DD arithmetic is based on the following error-free floating-point operations.

Algorithm 1 Error-free addition: TwoSum (Knuth [27])
  function [s, e] ← TwoSum(a, b)
    s ← a + b
    v ← s − a
    e ← (a − (s − v)) + (b − v)
  end function

TwoSum by Knuth [27] is an error-free floating-point addition algorithm. It produces an expansion of a + b such that s + e = a + b, where s is an approximation of a + b and e represents the rounding error in the calculation of s. QuickTwoSum by Dekker [9] is another error-free floating-point addition algorithm. It also produces an expansion of a + b such that s + e = a + b, provided that |a| ≥ |b|. The TwoSum algorithm requires six double-precision floating-point operations (Flops), whereas the QuickTwoSum algorithm requires only three Flops. QuickTwoSum is used to normalize the DD number so that it satisfies |a_lo| ≤ 0.5 ulp(a_hi).


Algorithm 2 Error-free addition: QuickTwoSum (Dekker [9])
  Require: |a| ≥ |b|
  function [s, e] ← QuickTwoSum(a, b)
    s ← a + b
    e ← b − (s − a)
  end function

Algorithm 3 Error-free multiplication: TwoProdFMA (Karp et al. [28])
  function [p, e] ← TwoProdFMA(a, b)
    p ← a × b
    e ← fma(a × b − p)
  end function

TwoProdFMA is an error-free floating-point multiplication algorithm. This algorithm produces p + e = a × b, where p is an approximation of a × b and e represents the rounding error in the calculation of p. The first error-free multiplication algorithm was proposed by Veltkamp and Dekker [9]. Karp and Markstein [28] later modified the algorithm to the form described above by using a double-precision Fused Multiply-Add (FMA) instruction, which calculates a × b + c with an intermediate result of 106 bits and one rounding step (see also Nievergelt [29] and Ogita et al. [30]). In this algorithm, fma(a × b − p) represents the calculation of a × b − p using the FMA instruction. The FMA instruction is available on some processors, such as Power architecture processors, Intel Itanium, and recent NVIDIA and AMD GPUs. Most processors that support the FMA instruction perform a × b + c (2 Flops) with one instruction, in the same number of cycles as a multiply (a × b) or an add (a + b) instruction that performs one Flop. Therefore, the TwoProdFMA algorithm performs three Flops with two instructions. The quadruple-precision addition algorithms use the aforementioned error-free floating-point operation algorithms. There are two quadruple-precision addition algorithms in the QD library: QuadAdd and QuadAddSloppy. The QuadAdd algorithm calculates a quadruple-precision addition, c = a + b

(c = c_hi + c_lo, and similarly for a and b). The QuadAddSloppy algorithm permits a few bits of error in the lower part of the DD number.


Algorithm 4 Quadruple-precision addition: QuadAdd (Bailey [12])
  function [c_hi, c_lo] ← QuadAdd(a_hi, a_lo, b_hi, b_lo)
    [s_1, s_2] ← TwoSum(a_hi, b_hi)
    [t_1, t_2] ← TwoSum(a_lo, b_lo)
    s_2 ← s_2 + t_1
    [s_1, s_2] ← QuickTwoSum(s_1, s_2)
    s_2 ← s_2 + t_2
    [c_hi, c_lo] ← QuickTwoSum(s_1, s_2)
  end function

Algorithm 5 Quadruple-precision addition: QuadAddSloppy (Bailey [12])
  function [c_hi, c_lo] ← QuadAddSloppy(a_hi, a_lo, b_hi, b_lo)
    [s, e] ← TwoSum(a_hi, b_hi)
    e ← e + (a_lo + b_lo)
    [c_hi, c_lo] ← QuickTwoSum(s, e)
  end function

Although the QuadAddSloppy algorithm does not prevent the loss of digits when a carry occurs in the calculation of the lower part, it is simpler than the QuadAdd algorithm: QuadAddSloppy requires 11 Flops, whereas QuadAdd requires 20 Flops. QuadAddSloppy is considered effective when users do not want to sacrifice speed in exchange for a small increase in accuracy; in fact, it is the default addition algorithm in the QD library. Therefore, I used the QuadAddSloppy algorithm. The quadruple-precision multiplication algorithm is also based on the error-free floating-point algorithms.

Algorithm 6 Quadruple-precision multiplication: QuadMul (Bailey [12])
  function [c_hi, c_lo] ← QuadMul(a_hi, a_lo, b_hi, b_lo)
    [p_1, p_2] ← TwoProdFMA(a_hi, b_hi)
    p_2 ← p_2 + (a_hi × b_lo) + (a_lo × b_hi)
    [c_hi, c_lo] ← QuickTwoSum(p_1, p_2)
  end function

The QuadMul algorithm calculates a quadruple-precision multiplication, c = a × b (c = c_hi + c_lo, and similarly for a and b). This algorithm guarantees 106-bit accuracy and IEEE-style rounding, the same as the QuadAdd algorithm. As shown above, the quadruple-precision addition and multiplication algorithms are performed using only double-precision floating-point operations, without any branch statement.

Table 2.1 shows the number of double-precision floating-point instructions and Flops for a quadruple-precision multiply-add operation in DD arithmetic. The three BLAS subroutines AXPY, GEMV, and GEMM mainly consist of multiply-add operations. Note that for the QuadMul algorithm the total number of instructions differs from the total Flop count, because the FMA instruction performs a × b + c (two Flops) with one instruction.

Table 2.1: Number of double-precision floating-point instructions and Flops for quadruple-precision multiply-add

                   # of double-precision floating-point instructions    # of
                   Add/Sub    Mul    FMA    Total                       Flops
QuadAddSloppy         11       0      0      11                          11
QuadMul                5       3      1       9                          10
QuadMul+QuadAdd       16       3      1      20                          21
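For illustration, the DD operations above can be written almost directly as CUDA device functions. The following sketch assumes a DD value is held in a double2 (x = higher part, y = lower part) and inlines TwoSum, QuickTwoSum, and TwoProdFMA into QuadAddSloppy and QuadMul. It is a minimal sketch of Algorithms 5 and 6, not the implementation evaluated in this thesis.

    __device__ __forceinline__ double2 quad_add_sloppy(double2 a, double2 b)
    {
        double s = a.x + b.x;                   // TwoSum(a_hi, b_hi)
        double v = s - a.x;
        double e = (a.x - (s - v)) + (b.x - v);
        e = e + (a.y + b.y);                    // add the lower parts
        double hi = s + e;                      // QuickTwoSum(s, e): normalization
        double lo = e - (hi - s);
        return make_double2(hi, lo);
    }

    __device__ __forceinline__ double2 quad_mul(double2 a, double2 b)
    {
        double p = a.x * b.x;                   // TwoProdFMA(a_hi, b_hi)
        double e = fma(a.x, b.x, -p);           // rounding error of a_hi * b_hi
        e = e + a.x * b.y + a.y * b.x;          // cross terms
        double hi = p + e;                      // QuickTwoSum(p, e): normalization
        double lo = e - (hi - p);
        return make_double2(hi, lo);
    }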

2.3 Triple-Precision Floating-Point Operations

I will propose new methods for triple-precision operations for linear algebra operations on GPUs. The performance of triple-precision linear algebra operations has received little attention so far. Triple-precision operations may be effective in cases where double-precision is insufficient but quadruple-precision is not necessary; in such cases, triple-precision operations that are faster than quadruple-precision operations are desired. Triple-precision floating-point operations were implemented by Ikebe [31] in the 1960s. His method represents a triple-length floating-point value by combining three floating-point values and performs triple-precision arithmetic using fixed-point arithmetic operations. In 2012, Ozawa [32] also implemented triple-precision arithmetic on x86 CPUs; he represents a triple-precision floating-point value using three double-precision values and performs triple-precision arithmetic using double-precision floating-point arithmetic. Such methods using three floating-point values require more computation than DD arithmetic; therefore, I propose new methods, based on DD arithmetic, to store triple-precision floating-point values and compute them on GPUs.



Figure 2.3: D+S-type triple-precision floating-point format

2.3.1 D+S-type Triple-Precision Floating-Point Format

I propose the Double+Single (D+S)-type triple-precision floating-point format shown in Figure 2.3. This format uses a single-precision value to represent the lower part, instead of the double-precision value used in the DD-type format: a triple-precision floating-point value a^(t) is represented as a^(t) = a_hi^(d) + a_lo^(s), using a double-precision floating-point value a_hi^(d) and a single-precision floating-point value a_lo^(s) (|a_lo^(s)| ≤ 0.5 ulp(a_hi^(d))). Here, the superscripts '(s)', '(d)', '(t)', and '(q)' denote single-, double-, triple-, and quadruple-precision floating-point values, respectively. The total significand precision of the D+S-type format is 52 + 23 = 75 bits (77 bits including the implicit bits), or approximately 24 decimal digits. The exponent is 8 bits, which is less than the 11 bits of IEEE double-precision and the DD-type format, because the lower part is stored in the "binary32" single-precision format, which has an 8-bit exponent.

2.3.2 D+S-type Triple-Precision Floating-Point Arithmetic

This thesis attempts to realize D+S arithmetic using the same algorithms as DD arithmetic. DD arithmetic consists entirely of double-precision floating-point operations, whereas D+S arithmetic can use single-precision floating-point operations when calculating the single-precision lower part. The following algorithms add some symbols to the original DD arithmetic algorithms to show the precision of each value and operation (single or double). The superscripts '(s)' and '(d)' denote single- and double-precision floating-point values, respectively, and (x^(d))^(s) indicates typecasting a value from double-precision x^(d) to single-precision x^(s). Operations whose operands have the same precision are performed in that precision.


TwoSumTriple is an error-free floating-point addition algorithm for D+S arithmetic, based on the TwoSum algorithm. The TwoSumTriple algorithm produces an expansion a^(d) + b^(d) such that s^(d) + e^(s) = a^(d) + b^(d), where s^(d) is an approximation of a^(d) + b^(d) and e^(s) is the rounding error in the calculation of s^(d).

Algorithm 7 Error-free addition: TwoSumTriple
  function [s^(d), e^(s)] ← TwoSumTriple(a^(d), b^(d))
    s^(d) ← a^(d) + b^(d)
    v^(d) ← s^(d) − a^(d)
    e^(s) ← (a^(d) − (s^(d) − v^(d)))^(s) + (b^(d) − v^(d))^(s)
  end function

QuickTwoSumTriple is also an error-free floating-point addition algorithm, used to normalize the D+S number, and is based on the QuickTwoSum algorithm. The QuickTwoSumTriple algorithm produces an expansion a^(d) + b^(s) such that s^(d) + e^(s) = a^(d) + b^(s), provided that |a^(d)| ≥ |b^(s)|.

Algorithm 8 Error-free addition: QuickTwoSumTriple
  Require: |a^(d)| ≥ |b^(s)|
  function [s^(d), e^(s)] ← QuickTwoSumTriple(a^(d), b^(s))
    s^(d) ← a^(d) + (b^(s))^(d)
    e^(s) ← b^(s) − (s^(d) − a^(d))^(s)
  end function

TwoProdFMATriple is an error-free floating-point multiplication algorithm for D+S arithmetic, based on the TwoProdFMA algorithm. It produces an expansion a^(d) × b^(d) such that p^(d) + e^(s) = a^(d) × b^(d), where p^(d) is an approximation of a^(d) × b^(d) and e^(s) is the rounding error in the calculation of p^(d).

Algorithm 9 Error-free multiplication: TwoProdFMATriple
  function [p^(d), e^(s)] ← TwoProdFMATriple(a^(d), b^(d))
    p^(d) ← a^(d) × b^(d)
    e^(s) ← (fma(a^(d) × b^(d) − p^(d)))^(s)
  end function

The triple-precision addition and multiplication in D+S arithmetic use the aforementioned algorithms. TripleAddSloppy is the triple-precision addition algorithm based on the QuadAddSloppy algorithm: c^(t) = a^(t) + b^(t) (c^(t) = c_hi^(d) + c_lo^(s), and similarly for a and b).


Algorithm 10 Triple-precision addition: TripleAddSloppy
  function [c_hi^(d), c_lo^(s)] ← TripleAddSloppy(a_hi^(d), a_lo^(s), b_hi^(d), b_lo^(s))
    [c_hi^(d), c_lo^(s)] ← TwoSumTriple(a_hi^(d), b_hi^(d))
    c_lo^(s) ← c_lo^(s) + (a_lo^(s) + b_lo^(s))
    [c_hi^(d), c_lo^(s)] ← QuickTwoSumTriple(c_hi^(d), c_lo^(s))
  end function

The TripleMul algorithm, based on the QuadMul algorithm, calculates the triple-precision multiplication c^(t) = a^(t) × b^(t).

Algorithm 11 Triple-precision multiplication: TripleMul
  function [c_hi^(d), c_lo^(s)] ← TripleMul(a_hi^(d), a_lo^(s), b_hi^(d), b_lo^(s))
    [c_hi^(d), c_lo^(s)] ← TwoProdFMATriple(a_hi^(d), b_hi^(d))
    c_lo^(s) ← c_lo^(s) + ((a_hi^(d))^(s) × b_lo^(s)) + (a_lo^(s) × (b_hi^(d))^(s))
    [c_hi^(d), c_lo^(s)] ← QuickTwoSumTriple(c_hi^(d), c_lo^(s))
  end function

Table 2.2 shows the number of floating-point instructions and cycles for the D+S and DD addition and multiplication functions on the Tesla M2050 GPU. For the "add", "mul", and "fma" instructions, "rn" denotes rounding to nearest-even, and "f32" and "f64" indicate 32-bit (single-precision) and 64-bit (double-precision) floating-point operations, respectively; "cvt" indicates a typecasting instruction. On this GPU, single- and double-precision floating-point instructions are executed in one and two cycles, respectively. Therefore, D+S arithmetic might initially be expected to attain better performance than DD arithmetic. However, it requires many typecasts between single- and double-precision, each of which also takes two cycles. For this reason, D+S arithmetic requires more cycles than DD arithmetic.
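Written out in CUDA C, the mixture of single- and double-precision operations, and the explicit casts that correspond to the cvt instructions counted in Table 2.2, look as follows. This is a minimal sketch of Algorithm 10, assuming a D+S value held as a double plus a float; it is not the thesis implementation, and the function name is illustrative.

    __device__ __forceinline__ void triple_add_sloppy(double a_hi, float a_lo,
                                                       double b_hi, float b_lo,
                                                       double &c_hi, float &c_lo)
    {
        double s = a_hi + b_hi;                                  // TwoSumTriple(a_hi, b_hi)
        double v = s - a_hi;
        float  e = (float)(a_hi - (s - v)) + (float)(b_hi - v);  // rounding error, narrowed to float
        e = e + (a_lo + b_lo);                                   // add the single-precision lower parts
        c_hi = s + (double)e;                                    // QuickTwoSumTriple(s, e): normalization
        c_lo = e - (float)(c_hi - s);
    }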

2.3.3 Computation of D+S-type Values using DD Arithmetic

As mentioned above, D+S arithmetic theoretically requires more instructions than DD arithmetic on the Tesla M2050 GPU because of the many typecasts. Hence, it is better to compute triple-precision values using DD arithmetic than using D+S arithmetic. In other words, the input and output data of a linear algebra operation use the D+S-type triple-precision floating-point format, but the arithmetic operations are performed in quadruple-precision using DD arithmetic, and the temporary values in the register files for storing intermediate results are also stored using the DD-type format.


Table 2.2: Floating-point instruction and cycle counts for D+S and DD arithmetic on Tesla M2050

Instruction      D+S arithmetic                     DD arithmetic
                 TripleAddSloppy   TripleMul        QuadAddSloppy    QuadMul
add.rn.f32       1 cycle × 4       1 cycle × 3      1 cycle × 0      1 cycle × 0
mul.rn.f32       1 cycle × 0       1 cycle × 2      1 cycle × 0      1 cycle × 0
add.rn.f64       2 cycles × 7      2 cycles × 2     2 cycles × 11    2 cycles × 5
mul.rn.f64       2 cycles × 0      2 cycles × 1     2 cycles × 0     2 cycles × 3
fma.rn.f64       2 cycles × 0      2 cycles × 1     2 cycles × 0     2 cycles × 1
cvt.f64.f32      2 cycles × 1      2 cycles × 1     2 cycles × 0     2 cycles × 0
cvt.f32.f64      2 cycles × 3      2 cycles × 4     2 cycles × 0     2 cycles × 0
Total cycles     26 cycles         23 cycles        22 cycles        18 cycles

22 CHAPTER 2

Figure 2.4: Triple-precision operations using DD arithmetic on GPUs

However, this method does not offer an advantage for computationally bound operations, except that it saves global memory space. In such a case, the peak performance is equal to that of quadruple-precision operations using DD arithmetic. In addition, this method uses more register space than triple-precision arithmetic implemented with D+S arithmetic.

2.3.4 D+I-type Triple-Precision Floating-Point Format

The D+S-type triple-precision floating-point format has an 8-bit exponent, which is less than the 11-bit exponent of the IEEE double-precision format and the DD-type format. This can cause overflow or underflow when data is converted to the D+S-type format from the double- or quadruple-precision formats. The Double+Int (D+I)-type triple-precision floating-point format shown in Figure 2.5 is proposed to address this problem. The arithmetic operations are assumed to be computed using DD arithmetic, as in the case of D+S-type values.


Figure 2.5: D+I-type triple-precision floating-point format (higher part: binary64 with a 1-bit sign, 11-bit exponent, and 52-bit significand; lower part: a 32-bit integer holding the sign, 11-bit exponent, and top 20 significand bits of the lower double, giving an overall 11-bit exponent and 72-bit significand)

The D+I-type format uses a 32-bit integer value to store the lower part of the DD-type format, instead of the single-precision value of the D+S-type format. The top 32 bits of the lower part of the DD-type format, which is stored in a double-precision value, are stored in the 32-bit integer value; i.e., these top 32 bits comprise the sign bit (1 bit), the exponent bits (11 bits), and the top 20 significand bits of the double-precision value. The total significand precision of the D+I-type format is 52 + 20 = 72 bits (73 bits including the implicit bit), and the exponent is 11 bits: it is increased to the same width as that of the double-precision and DD-type formats. However, the significand is 3 bits shorter than that of the D+S-type.

Typecasting between D+I- and DD-type values can be implemented using a C union and bitwise shift operations. Listing 2.1 shows the typecasting function from the DD- to the D+I-type format. The function reinterprets the lower double-precision floating-point value of the DD-type value as a 64-bit integer and discards the lower 32 bits by a right-shift operation, while the top 32 bits are stored in a 32-bit integer value. Lines 14–23 in Listing 2.1 perform rounding to nearest even, which is the most accurate mode and the default rounding mode in the IEEE standard. If the rounding step is skipped, the conversion rounds toward zero. Rounding to nearest even allows an error within 0.5 ulp, whereas rounding toward zero allows an error within 1 ulp. The typecasting function from the D+I- to the DD-type format, shown in Listing 2.2, reinterprets the lower 32-bit integer value of the D+I-type value as a 64-bit integer, left-shifts the bit pattern, and stores it in a double-precision floating-point value through the union.


Listing 2.1: Conversion from DD-type to D+I-type
 1 union double_int64 {
 2     int64_t i64;
 3     double  d64;
 4 };
 5
 6 __host__ __device__ __forceinline__ void dd_to_di
 7 (double2 dd, double &d, int32_t &i) {
 8     union double_int64 u;
 9     int64_t odd, border;
10
11     u.d64 = dd.y;
12     i = (int32_t)(u.i64 >> 32);
13
14     odd = u.i64 & 0xFFFFFFFF;
15     border = 0x80000000;
16
17     if (odd < border) {          // discarded bits below 0.5 ulp: truncate
18     }
19     else if (odd > border) {     // discarded bits above 0.5 ulp: round up
20         i++;
21     } else {                     // tie: round to even
22         if ((i & 1) == 1) i++;
23     }
24     d = dd.x;
25 }

2.4 Implementation of Triple- and Quadruple-Precision BLAS Subroutines on GPUs

Several frequently used low-level kernels for basic linear algebra operations, such as matrix multiplication, are defined as the Basic Linear Algebra Subprograms (BLAS) [20], an application programming interface (API) for basic linear algebra operations. The BLAS is a de facto standard with various implementations on different architectures, provided not only by computer vendors [34] but also by open source projects [35]. The BLAS subroutines are classified


Listing 2.2: Conversion from D+I-type to DD-type
__host__ __device__ __forceinline__ void di_to_dd
(double d, int32_t i, double2 &dd) {
    union double_int64 u;
    int64_t l;

    l = (int64_t)i;
    u.i64 = l << 32;    // place the 32-bit integer in the upper half

    dd.x = d;
    dd.y = u.d64;       // reinterpret the bit pattern as the lower double
}

into three levels: Level 1, 2, and 3. Level-1 contains vector-vector operations such as AXPY (y = αx + y). Level-2 contains matrix-vector operations such as GEMV (y = αAx + βy). Level-3 contains matrix-matrix operations such as GEMM (C = αAB + βC). To evaluate the performance of linear algebra operations in triple- and quadruple-precision, I implement a representative BLAS subroutine from each of the three levels, AXPY, GEMV, and GEMM, and evaluate their performance on GPUs.

2.4.1 DD Arithmetic Functions

The DD arithmetic functions that perform the quadruple-precision addition and multiplication are implemented as CUDA device functions. These functions are defined as inline functions using the “__forceinline__” keyword to avoid function call overhead. A DD value is stored in a double2-type value, a vector type defined in CUDA. The double2-type value consists of two double-type values and is handled as one 16-byte value by the CUDA compiler [36]. Listing 2.3 shows an example implementation of the DD multiplication. The CUDA compiler automatically replaces multiply-add sequences with the FMA instruction. To prevent the FMA instruction from altering the result of the quadruple-precision multiplication algorithm, the built-in functions __dmul_rn and __dadd_rn are used to force separate multiplication and addition, respectively.


Listing 2.3: Implementation of QuadMul
__device__ __forceinline__ void QuadMul
(double2 a, double2 b, double2 &c) {
    double2 t;
    TwoProdFMA(a.x, b.x, t.x, t.y);              // exact product of the higher parts
    t.y = __dadd_rn(t.y, __dadd_rn(
          __dmul_rn(a.x, b.y), __dmul_rn(a.y, b.x)));  // add the cross terms
    QuickTwoSum(t.x, t.y, c.x, c.y);             // renormalize into a DD value
}
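For reference, the building blocks assumed by Listing 2.3, together with the “sloppy” DD addition, can be sketched as follows. The bodies below follow the standard error-free transformations used in double-double libraries such as QD; they are a reconstruction under that assumption, not code taken from the thesis, but the instruction counts match Table 2.2 (11 double-precision additions for QuadAddSloppy; 3 multiplications, 1 FMA, and 5 additions for QuadMul). Listing 2.4 additionally assumes value-returning overloads of QuadAdd and QuadMul; only the reference-style variants are shown here.

// Error-free transformation: s + e == a + b exactly (Knuth's TwoSum, 6 additions)
__device__ __forceinline__ void TwoSum
(double a, double b, double &s, double &e) {
    s = __dadd_rn(a, b);
    double v = __dadd_rn(s, -a);
    e = __dadd_rn(__dadd_rn(a, -__dadd_rn(s, -v)),
                  __dadd_rn(b, -v));
}

// Faster variant assuming |a| >= |b| (3 additions); used for renormalization
__device__ __forceinline__ void QuickTwoSum
(double a, double b, double &s, double &e) {
    s = __dadd_rn(a, b);
    e = __dadd_rn(b, -__dadd_rn(s, -a));
}

// Exact product via FMA: p + e == a * b exactly (1 multiplication, 1 FMA)
__device__ __forceinline__ void TwoProdFMA
(double a, double b, double &p, double &e) {
    p = __dmul_rn(a, b);
    e = __fma_rn(a, b, -p);
}

// "Sloppy" DD addition (QuadAddSloppy): 6 + 2 + 3 = 11 additions in total
__device__ __forceinline__ void QuadAdd
(double2 a, double2 b, double2 &c) {
    double s, e;
    TwoSum(a.x, b.x, s, e);
    e = __dadd_rn(e, __dadd_rn(a.y, b.y));
    QuickTwoSum(s, e, c.x, c.y);
}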

2.4.2 Implementation of BLAS Kernels

The kernels for the BLAS subroutines are implemented with the same general techniques as single- or double-precision subroutines. The triple- and quadruple-precision subroutines perform quadruple-precision floating-point arithmetic with DD arithmetic, instead of double-precision floating-point arithmetic, by calling the aforementioned DD arithmetic device functions. In AXPY and GEMV, each thread computes one element of the vector y in parallel. In GEMM, the threads are arranged in a two-dimensional structure. I implemented only the general case where the input matrices are not transposed and the input vectors have incx = incy = 1 (the increment interval for the elements of vectors x and y). For the triple-precision subroutines, the interface uses the D+S- or D+I-type triple-precision formats. However, the operations are performed using DD arithmetic, and the temporary values storing intermediate results are of DD-type. In other words, the BLAS kernel is quadruple-precision, but the interface is triple-precision. Listing 2.4 shows an example implementation of a triple-precision AXPY subroutine using the D+I-type format. Since the implementation internally uses DD arithmetic, the triple-precision subroutines require typecasting between the triple- and quadruple-precision formats when loading and storing data. The “di_to_dd” and “dd_to_di” functions perform typecasting between the D+I- and DD-type formats, as shown in Listings 2.2 and 2.1, respectively. GEMV and GEMM use a blocking algorithm that exploits shared memory for data reuse, following Nath et al.’s GEMM implementation for the Fermi architecture [37]. Shared memory is a fast on-chip memory that can be shared among the threads in a multiprocessor and used as a user-managed scratch-pad memory. Figures 2.6 and 2.7 show the implementation of the GEMV


Listing 2.4: Triple-precision AXPY using D+I-type format
__global__ void TAXPY
(int n, double a1, int a2, double *x1, int *x2,
 double *y1, int *y2) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    double2 ar, xr, yr;

    di_to_dd(a1, a2, ar);
    while (tid < n) {
        di_to_dd(x1[tid], x2[tid], xr);
        di_to_dd(y1[tid], y2[tid], yr);
        yr = QuadAdd(QuadMul(ar, xr), yr);
        dd_to_di(yr, y1[tid], y2[tid]);
        tid += gridDim.x * blockDim.x;
    }
}

and the GEMM kernels, respectively. In these figures, the areas shown in black are stored in shared memory. NT denotes the number of threads in a thread block, and BLK denotes the GEMM blocking size. Each thread block performs the inner-product calculations in the direction of the black arrow. The optimal values of NT and BLK were determined experimentally, taking into account how the thread groups access the GPU’s memory. For GEMV, the optimal NT is 128; for GEMM, the optimal BLK is 16 and NT is 8, for both the triple- and quadruple-precision subroutines.

2.4.3 Data Structures of Triple- and Quadruple-Precision Value Arrays in Global Memory

This section discusses the optimal way to lay out D+I-, D+S-, and DD-type arrays in global memory, which in turn defines the BLAS interface. In dense linear algebra operations, such as the BLAS subroutines, the memory access pattern is sequential. In CUDA, several memory access transactions are coalesced into a single transaction when consecutive threads access consecutive memory addresses. In particular, one memory transaction can transfer a maximum of 128 bytes at a time, provided the data is aligned to an address that is a multiple of the memory transaction size.


Figure 2.6: GEMV kernel

One triple-precision value is represented by a pair of an 8-byte value (a double-precision floating-point value) and a 4-byte value (a single-precision floating-point value for the D+S-type or an integer value for the D+I-type); the structure defining one triple-precision value is thus 12 bytes. Therefore, when using an array of such structures, called the array-of-structures (AoS) layout, satisfying this alignment condition is difficult. Instead, a triple-precision array can be stored in two independent arrays, an 8-byte value array and a 4-byte value array. This method is called the structure-of-arrays (SoA) layout and can satisfy the 128-byte alignment condition. Figure 2.8 shows the layout of D+S-type value arrays in global memory for both the AoS and SoA layouts. I used the SoA layout for storing triple-precision data in global memory. This chapter presents a comparison of both layouts in the performance evaluation and shows that the SoA layout performs better than the AoS layout.
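As a concrete illustration, the SoA layout for a D+S-type vector can be allocated as two independent device arrays; the function and variable names below are illustrative, not taken from the thesis. cudaMalloc returns pointers aligned to at least 256 bytes, so each of the two streams satisfies the 128-byte alignment condition on its own.

// Minimal sketch: a D+S-type vector of length n in the SoA layout.
// The 8-byte higher parts and 4-byte lower parts live in separate arrays,
// so a warp's accesses to either array are contiguous and aligned.
void alloc_ds_soa(int n, double **d_x_hi, float **d_x_lo) {
    cudaMalloc((void **)d_x_hi, (size_t)n * sizeof(double)); // higher parts
    cudaMalloc((void **)d_x_lo, (size_t)n * sizeof(float));  // lower parts
}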

For quadruple-precision, one DD-type value is represented by two 8-byte values, so such problems do not occur. I therefore used the AoS layout, which stores a quadruple-precision array as an array of DD-type structures.


Figure 2.7: GEMM kernel

2.5 Performance Prediction

This section estimates the performance of the three BLAS subroutines: AXPY (y = αx + y), GEMV (y = αAx + βy), and GEMM (C = αAB + βC) on the Tesla M2050 GPU. This research assumes that test problems are square matrices of size N × N and vectors of length N.

2.5.1 Theoretical Peak Performance on DD Arithmetic

The three BLAS subroutines consist mainly of multiply-add operations that compute a × b + c. All of the floating-point operations in AXPY are multiply-add operations. For GEMV, the largest, O(N²) term, the matrix-vector multiplication Ax, consists entirely of multiply-add operations, as does the largest, O(N³) term of GEMM, the matrix-matrix multiplication AB. The performance of multiply-add operations on the GPU is therefore discussed here. First, the theoretical peak floating-point performance of the Tesla M2050 GPU is shown. A floating-point unit (FPU) of the GPU performs a double-precision multiply-add operation (a × b + c) with one FMA instruction in two cycles.


Figure 2.8: Array of Structures (AoS) layout and Structure of Arrays (SoA) layout for D+S-type values

Therefore, the theoretical peak performance for double-precision multiply-add operations is 1.15[GHz] × 14[SM] × 32[CUDACore] × (2[Flop]/2[cycle]) = 515.2[GFlops]. A quadruple-precision multiply-add operation is equivalent to 11 + 10 = 21 [Flop] of double-precision floating-point operations using DDAddSloppy and DDMul, as shown in Table 2.1, and requires 40 cycles. Therefore, DD arithmetic requires 20 times the cycles (i.e., execution time) of double-precision arithmetic. The theoretical peak performance for quadruple-precision multiply-add operations is 1.15[GHz] × 14[SM] × 32[CUDACore] × (21[Flop]/40[cycle]) = 270.48[GFlops] of double-precision floating-point operations. This is approximately half of the theoretical peak performance of the GPU, because in a quadruple-precision multiply-add operation only one of the 21 instructions is an FMA instruction performing two Flops, while the other instructions are additions or multiplications performing only one Flop each.

Here, I introduce DDFlops, the number of DD-type quadruple-precision floating-point operations per second, instead of the usual Flops for double-precision. A quadruple-precision multiply-add operation thus counts as 2 DDFlop. Therefore, the theoretical peak performance for quadruple-precision multiply-add operations is 1.15[GHz] × 14[SM] × 32[CUDACore] × (2[DDFlop]/40[cycle]) = 25.76[GDDFlops]. This is exactly 1/20 of the double-precision theoretical peak Flops.


2.5.2 Performance Prediction using the Bytes/Flop and Bytes/DDFlop ratios

The Bytes/Flop ratio of an operation indicates the amount of data (bytes) required per floating-point operation (one Flop). For a processor, the ratio indicates the amount of data that the system can supply per Flop. When the Bytes/Flop ratio of a BLAS subroutine is larger than that of a GPU, the performance of the subroutine on the GPU is memory bound: it is limited by the global memory bandwidth of the GPU. This section predicts the performance of the double-, triple-, and quadruple-precision BLAS subroutines on the GPU using this ratio. The thesis uses Bytes/Flop for double-precision operations. For the triple- and quadruple-precision subroutines with DD arithmetic on GPUs, Bytes/DDFlop, i.e., the amount of data (bytes) required or processable per DD-type floating-point operation (one DDFlop), is used.

Bytes/Flop and Bytes/DDFlop for GPU

On the Tesla M2050 GPU, the theoretical peak floating-point performances of double- and DD arithmetic are 515.2 GFlops and 25.76 GDDFlops, respectively. The theoretical peak memory bandwidth of global memory is 148 GB/s. Therefore, for double-precision arithmetic operations, the Bytes/Flop ratio is 148[GB/s]/515.2[GFlops] ≈ 0.29[Bytes/Flop], and for DD arithmetic operations, the Bytes/DDFlop ratio is 148[GB/s]/25.76[GDDFlops] ≈ 5.75[Bytes/DDFlop].

Bytes/Flop and Bytes/DDFlop for AXPY

In AXPY, 2N Flops are performed, with 2N floating-point values loaded and N values stored. For the double-precision subroutine, a floating-point value is 8 bytes, and therefore the Bytes/Flop ratio is 12 Bytes/Flop. This ratio is larger than the GPU’s Bytes/Flop ratio for double-precision, approximately 0.29 Bytes/Flop. Therefore, it can be predicted that the performance of double-precision AXPY will be memory bound on the GPU. For the triple- and quadruple-precision subroutines, the floating-point values are 12 and 16 bytes, respectively, and their Bytes/DDFlop ratios are 18 and 24 Bytes/DDFlop, respectively. These ratios are larger than the GPU’s ratio for DD arithmetic, approximately 5.75 Bytes/DDFlop, and therefore it can be presumed that triple- and quadruple-precision AXPY will also be memory bound on the GPU.


Bytes/Flop and Bytes/DDFlop for GEMV

In GEMV, for a matrix of size N × N, 2N² + 3N Flops are performed, with N² + 2N floating-point values loaded and N stored. When the O(N) terms are omitted for simplicity, the Bytes/Flop ratio for the double-precision subroutine is 4 Bytes/Flop. For the triple- and quadruple-precision subroutines, the Bytes/DDFlop ratios are 6 and 8 Bytes/DDFlop, respectively. For all precisions, the ratios are larger than the corresponding GPU ratios. Therefore, it can be predicted that, like AXPY, GEMV will be memory bound on the GPU for all precisions.

Bytes/Flop and Bytes/DDFlop for GEMM

In GEMM, for matrices of size N × N, 2N³ + 3N² Flops are performed on 4N² floating-point value loads and stores. Note that this estimation assumes an unlimited cache, i.e., each matrix element is transferred between global memory and the chip only once. When the O(N²) term is omitted for simplicity, the Bytes/Flop ratio for the double-precision subroutine is 16/N Bytes/Flop, and the Bytes/DDFlop ratios for the triple- and quadruple-precision subroutines are 24/N and 32/N Bytes/DDFlop, respectively. For all precisions, the ratios are smaller than the corresponding GPU ratios when N > 55, and therefore it can be predicted that GEMM will be computationally bound on the GPU for all precisions.
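For reference, the double-precision GEMM estimate can be written out explicitly as follows (the triple- and quadruple-precision cases follow the same pattern with 12- and 16-byte values and the 5.75 Bytes/DDFlop machine balance):

\[
\left.\frac{\mathrm{Bytes}}{\mathrm{Flop}}\right|_{\mathrm{GEMM}}
  = \frac{8 \times 4N^{2}}{2N^{3} + 3N^{2}}
  \approx \frac{16}{N},
\qquad
\frac{16}{N} < 0.29 \;\Longleftrightarrow\; N \gtrsim 55 .
\]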

In conclusion, it can be predicted that, in theory, AXPY and GEMV are memory bound and GEMM is computationally bound for all precisions. Table 2.3 shows the Bytes/Flop and Bytes/DDFlop ratios for the three BLAS subroutines. When the performance is memory bound, the execution times of the triple- and quadruple-precision subroutines are expected to be approximately 1.5 and 2.0 times that of the double-precision subroutine, respectively, in theory (the ratios of the data sizes: 12, 18, and 24 bytes per value). When the performance is computationally bound, the performance of the triple-precision subroutines is approximately equal to that of the quadruple-precision subroutines because both perform the same DD arithmetic, and the execution times of the triple- and quadruple-precision subroutines are approximately 20 times that of the double-precision subroutine.

2.6 Performance Evaluation

This section evaluates the performance of the triple- and quadruple-precision BLAS subroutines and compares them with the double-precision subroutines on the Tesla M2050.


Table 2.3: Bytes/Flop and Bytes/DDFlop for the BLAS subroutines

                                        AXPY   GEMV   GEMM
Double-precision    [Bytes/Flop]          12      4   16/N
Triple-precision    [Bytes/DDFlop]        18      6   24/N
Quadruple-precision [Bytes/DDFlop]        24      8   32/N

Table 2.4: Evaluation environment

CPU        Intel Xeon E5630 (2.53 GHz, Quad-Core) × 2 sockets
RAM        24 GB (DDR3)
GPU        NVIDIA Tesla M2050
Video RAM  3 GB (GDDR5, ECC-enabled)
GPU Bus    PCI-Express 2.0 x16
OS         CentOS 6.3 (x86-64), kernel 2.6.32-279.11.1.el6.x86_64
CUDA       CUDA 5.0
Compiler   gcc 4.4.6 (-O3), nvcc 5.0 (-O3)

2.6.1 Evaluation Methods

Table 2.4 shows the evaluation environment. I compare the execution times of the double-, triple-, and quadruple-precision BLAS subroutines. For the double-precision subroutines, NVIDIA CUBLAS 5.0 [34] is used. Test problems are square matrices of size N × N in column-major order and vectors of length N. All input data and the α and β parameters are random numbers generated using the DD-type random number generator function (ddrand) in the QD library. For double- and triple-precision, the DD-type random numbers are converted to each precision. To evaluate the performance accurately, a subroutine is executed repeatedly for at least one second and at least three times, and the average execution time is then calculated. The performance of the triple- and quadruple-precision subroutines is reported in DDFlops.

2.6.2 Performance Comparison of Double-, Triple- and Quadruple-Precision BLAS Subroutines

This section shows the DDFlops performance of the triple- and quadruple-precision subroutines. In addition, the relative execution times for the triple- and quadruple-precision

subroutines are presented as multiples of the execution time of the double-precision subroutines of CUBLAS.

AXPY

Figures 2.9 and 2.10 show the DDFlops performance and relative execution times, respectively. Although the triple- and quadruple-precision operations require 20 times as many double-precision floating-point instructions as the double-precision operation, the actual execution times of the triple- and quadruple-precision subroutines are close to 1.5 and 2.0 times that of the double-precision subroutine, respectively, as expected in Section 2.5. However, when N < 10,000, no performance gap exists due to the kernel launch overhead, because the execution time of an empty kernel is approximately equal to that of AXPY for such N. The performance of the triple-precision subroutines using the D+S- and D+I-type formats is also approximately the same.

GEMV

Figures 2.11 and 2.12 show the DDFlops performance and relative execution times, respectively. As with AXPY, Section 2.5 predicted that GEMV is memory bound on the GPU for all precisions. As a result, the execution times of the triple- and quadruple-precision subroutines are close to 1.5 and 2.0 times that of the double-precision subroutine, respectively. The relative execution times of the triple- and quadruple-precision subroutines decrease as N increases, because the throughput of the triple- and quadruple-precision GEMV kernels is lower than that of the double-precision CUBLAS subroutine for small vectors and matrices.

GEMM

Figures 2.13 and 2.14 show the DDFlops performance and relative execution times, respectively. As predicted in Section 2.5, GEMM is computationally bound on the GPU for all precisions. Therefore, the performance of the triple-precision subroutines is approximately equal to that of the quadruple-precision subroutines, because both perform the same DD arithmetic. When N = 2,048, quadruple-precision GEMM attains approximately 22.4 GDDFlops, which is approximately 87% of the theoretical peak of 25.76 GDDFlops. In contrast, CUBLAS’s double-precision GEMM attains approximately 313.4 GFlops, only about 61% of the theoretical peak. Tan et al. [33] reported the difficulty of optimizing double-precision GEMM on Fermi architecture GPUs.


Figure 2.9: Performance of triple- and quadruple-precision AXPY (DDFlops: DD-type floating-point operations per second)

Figure 2.10: Relative execution time of AXPY (value is the multiple of the execution time of the double-precision subroutine)


Figure 2.11: Performance of triple- and quadruple-precision GEMV (DDFlops: DD-type floating-point operations per second)

Figure 2.12: Relative execution time of GEMV (value is the multiple of the execution time of the double-precision subroutine)


Figure 2.13: Performance of triple- and quadruple-precision GEMM (DDFlops: DD-type floating-point operations per second)

Figure 2.14: Relative execution time of GEMM (value is the multiple of the execution time of the double-precision subroutine)


Figure 2.15: Performance comparison of AoS (Array-of-Structures) and SoA (Structure-of-Arrays) layouts for D+S-type triple-precision AXPY

However, the use of DD arithmetic increases the density of arithmetic instructions per memory access, and as a result, higher execution efficiency was achieved. Hence, the execution times of the triple- and quadruple-precision subroutines are approximately 14 times that of the double-precision subroutine, even though DD arithmetic requires 20 times the execution time of double-precision arithmetic in theory.

2.6.3 AoS Layout vs. SoA Layout

Figure 2.15 shows the performance comparison of D+S-type triple-precision AXPY using the AoS and SoA layouts. The performance of AXPY is memory bound and therefore greatly affected by memory access performance. The result indicates that the SoA layout is approximately 1.3 times faster than the AoS layout.

2.6.4 D+S Arithmetic vs. DD Arithmetic

Here the performances of triple-precision GEMM using D+S and DD arithmetic are compared.


Figure 2.16: Performance comparison of D+S and DD arithmetic on GEMM (triple-precision GEMM, N = 2,048; execution time in seconds)

One D+S-type floating-point operation is defined as “1 DSFlop”, and one DSFlop per second as “1 DSFlops”. With D+S arithmetic, a triple-precision multiply-add operation (2 DSFlop) takes 26 + 23 = 49 cycles on the Tesla M2050 GPU, as shown in Table 2.2. Therefore, the theoretical peak performance of D+S arithmetic on the GPU is 1.15[GHz] × 14[SM] × 32[CUDACore] × (2[DSFlop]/49[cycle]) ≈ 21.03[GDSFlops]. This value is approximately 1/1.22 of the theoretical peak performance of DD arithmetic. Figure 2.16 shows the execution times of triple-precision GEMM using D+S and DD arithmetic on the GPU when N = 2,048. The implementation using D+S arithmetic is approximately 1.28 times slower than that using DD arithmetic. This performance gap roughly corresponds to the gap of approximately 1.22 times between the theoretical peaks of D+S and DD arithmetic. The performance efficiencies of triple-precision GEMM with D+S and DD arithmetic are approximately 90% and 86% of their theoretical peaks, respectively. For AXPY and GEMV, the performance with D+S and DD arithmetic is approximately equal because both subroutines are memory bound.

2.6.5 Accuracy Evaluation

The results of the accuracy comparison among the double-, triple-, and quadruple-precision subroutines for GEMV and GEMM are shown in Table 2.5. The table shows the error relative to the octuple-precision result computed by the equivalent MBLAS subroutine on the CPU. In the evaluation, the input consists of uniform random double-precision numbers in the range 0 to 1. Note that the triple-precision subroutines internally use DD arithmetic.


Table 2.5: Relative error to the octuple-precision result (input: uniform random numbers in the range of 0 to 1 in double-precision)

                               GEMV                      GEMM
                               N = 100     N = 1,000     N = 100     N = 1,000
Double-precision               2.77E-16    4.60E-16      2.70E-16    7.83E-16
D+S-type triple-precision      8.20E-25    1.36E-24      8.75E-25    1.34E-24
D+I-type triple-precision      6.36E-24    1.16E-23      6.93E-24    1.07E-23
DD-type quadruple-precision    1.92E-32    6.57E-32      2.14E-32    6.45E-32

The error of the D+I-type triple-precision subroutines is approximately eight to nine times that of the D+S-type, because the significand of the D+I-type format is 3 bits shorter than that of the D+S-type format (2³ = 8).

2.7 Related Work

Quadruple-precision GEMM subroutines have been implemented on GPUs before. Nakasato [38] implemented one using DD arithmetic on ATI GPUs, and Nakata et al. implemented one on an NVIDIA Tesla C2050 GPU [39]. Both papers focused on optimizing GEMM and achieved higher execution efficiency than the quadruple-precision GEMM implemented in this chapter. However, they neither implemented the other subroutines nor discussed their performance. I have presented the performance of all levels of BLAS subroutines on GPUs and shown that the performance of quadruple-precision AXPY and GEMV using DD arithmetic is memory bound. There are also BLAS implementations using DD arithmetic for CPUs. MPACK [40] is a multiple-precision LAPACK that includes a multiple-precision BLAS, MBLAS. MPACK uses two existing high-precision arithmetic libraries, the GNU multiple precision library (GMP) [8] and MPFR [41], for arbitrary-precision operations, and the QD library for octuple- and quadruple-precision operations. XBLAS [42] is an extended precision BLAS implementation that uses DD arithmetic internally while both its input and output are double-precision.


2.8 Conclusion

This chapter presented the implementation and performance of triple- and quadruple-precision BLAS subroutines, AXPY, GEMV and GEMM, on the NVIDIA Tesla M2050 Fermi architecture GPU. For the quadruple-precision subroutines, DD arithmetic was used. For the triple-precision subroutines, I have proposed new formats to store triple-precision floating-point values: the D+S-type and D+I-type triple-precision formats. The D+S-type has a 75-bit significand and an 8-bit exponent, and the D+I-type has a 72-bit significand and an 11-bit exponent. I have also proposed D+S arithmetic, but it is slower than DD arithmetic; therefore, DD arithmetic was also used for the computation of triple-precision values.

This chapter revealed the relative time cost of the triple- and quadruple-precision BLAS subroutines on the GPU. For the BLAS subroutines, DD arithmetic requires 20 times the execution time of double-precision arithmetic in theory. However, since the GPU has relatively high floating-point performance compared to its memory bandwidth, the performance of triple- and quadruple-precision AXPY and GEMV is limited by the bandwidth of global memory, and the computation time of DD arithmetic is hidden by the memory access time. As a result, the execution times of the triple- and quadruple-precision subroutines are close to 1.5 and 2 times that of the double-precision subroutines, respectively. In such cases, triple- and quadruple-precision operations implemented in software achieve sufficient performance without hardware support for triple- and quadruple-precision arithmetic. For GEMM, the performance is computationally bound, and the execution times of the triple- and quadruple-precision subroutines are approximately 14 times that of the double-precision subroutine because of the low execution efficiency of CUBLAS’s double-precision subroutine; the use of DD arithmetic increases the density of arithmetic instructions per memory access, and as a result, higher execution efficiency was achieved.

I have proposed new methods of triple-precision operations for linear algebra on GPUs as a new choice of extended precision between double- and quadruple-precision. Triple-precision operations may be effective in cases where double-precision is insufficient but quadruple-precision is not necessary. The triple-precision floating-point formats proposed in this chapter realize triple-precision linear algebra operations that are faster than the quadruple-precision operations when the operation is memory bound in both triple- and quadruple-precision. In such cases, the use of the triple-precision formats reduces the data size, and the execution time, to 3/4 of that of the quadruple-precision format. It is predicted that the memory bandwidth bottleneck

will become even more severe in exascale computing [43]. In other words, the Bytes/Flop ratio of processors and systems is becoming smaller. In such environments, many operations become memory bound rather than computationally bound. On GPU clusters, the bandwidth of the PCI Express bus (PCIe) is 8 GB/s; therefore, the performance of inter-node communication between GPUs may be limited by the PCIe bandwidth. Moreover, in large-scale computing, the accumulation of rounding errors can become a more serious problem. Therefore, triple-precision floating-point operations are expected to become necessary in the emerging exascale computing era. Future work includes evaluating the performance of triple- and quadruple-precision operations in actual applications and making use of them there.

Chapter 3

Optimization of Sparse Matrix-vector Multiplication on NVIDIA Kepler Architecture GPUs

3.1 Introduction

Sparse matrix-vector multiplication (SpMV), which performs y = Ax (where x and y are vectors and A is a sparse matrix), is one of the most important operations in scientific and engineering computing. In general, to save memory, a sparse matrix is stored in two kinds of arrays: a data array, which stores only the non-zero elements of the matrix, and index arrays, which store the positions of the non-zero elements. Thus, sparse matrix operations require indirect memory accesses, which are more complex than those of dense operations. Moreover, many different distributions of the non-zero elements must be handled. Therefore, an efficient implementation of SpMV requires a large number of optimization techniques.

This chapter presents optimization techniques for SpMV in the Compressed Row Storage (CRS) format on NVIDIA Kepler architecture GPUs using CUDA. The CRS format is one of the most widely used storage formats for sparse matrices. In the CRS format, a sparse matrix is stored as a data array, obtained by scanning the matrix in the row direction, and two index arrays: an index array, which holds the column number of each non-zero element in the data array, and a pointer array, which points to the first non-zero element of each row (see Figure 3.1; a minimal serial reference implementation is sketched below).
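The following sketch shows a plain serial CRS SpMV on the CPU, using 0-based ptr and ind arrays (the example in Figure 3.1 uses 1-based indices); it is an illustration of the storage scheme, not code from the thesis.

// y = A * x for an m-row sparse matrix A stored in the CRS format:
// val holds the non-zero values row by row, ind the column index of each
// value, and ptr[row] the position of the first non-zero of each row.
void spmv_crs(int m, const double *val, const int *ptr,
              const int *ind, const double *x, double *y) {
    for (int row = 0; row < m; row++) {
        double sum = 0.0;
        for (int i = ptr[row]; i < ptr[row + 1]; i++)
            sum += val[i] * x[ind[i]];   // gather x through the column index
        y[row] = sum;
    }
}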


Figure 3.1: CRS format. Example 6 × 6 sparse matrix (scanned in the row direction):
    [ 8 9 0 0 4 5 ]
    [ 0 7 5 6 2 0 ]
    [ 0 6 0 7 0 0 ]
    [ 9 0 0 6 0 2 ]
    [ 0 0 2 0 0 0 ]
    [ 0 0 7 2 8 0 ]
val = {8, 9, 4, 5, 7, 5, 6, 2, 6, 7, 9, 6, 2, 2, 7, 2, 8}
ptr = {1, 5, 9, 11, 14, 15, 18}  (“ptr” points to the first non-zero element of each row)
ind = {1, 2, 5, 6, 2, 3, 4, 5, 2, 4, 1, 4, 6, 3, 3, 4, 5}  (“ind” gives the column number of each non-zero element)

Various storage schemes have been proposed for GPUs. For example, Bell and Garland [44] proposed a new storage format, HYB (Hybrid), which combines the existing ELL (Ellpack) and COO (Coordinate) formats, and implemented it on GPUs. The HYB format outperforms the CRS format. Many other storage schemes have also been implemented on GPUs [45] [46] [47]. In addition, auto-tuning is effective for SpMV, since the optimal storage scheme depends on the distribution of the non-zero elements of the matrix. For instance, Kubota and Takahashi [48] presented an auto-tuning method that selects the optimal storage format using the percentage and variability of the non-zero elements. However, an SpMV routine for the CRS format that performs well for a wide variety of matrices is still necessary, especially in numerical libraries: converting the storage format from CRS to another format may be required in some cases, and auto-tuning methods may need to scan the matrix in advance to determine the optimal storage format. In fact, an SpMV routine for the CRS format is still provided in various numerical libraries, such as the NVIDIA cuSPARSE library [22] for sparse matrix operations on GPUs in CUDA environments.

In this chapter, I implement a fast SpMV routine for the CRS format on Kepler architecture GPUs. The implementation is based on an existing method proposed for the Fermi architecture, an earlier generation of GPUs, and takes advantage of some of the new features of the Kepler architecture. The rest of this chapter is organized as follows: Section 3.2 describes related work. Section 3.3 briefly introduces the Kepler architecture. Section 3.4 describes the implementation. Section 3.5 shows the effects of the optimization techniques and compares the performance of the SpMV routine with the NVIDIA cuSPARSE library. Finally, Section 3.6 concludes this chapter.


3.2 Related Work

Two methods to implement SpMV for the CRS format on GPUs have been presented by Bell and Garland [44]. The first, the CRS-scalar method, performs the calculation of each row of the matrix (the calculation of one element of the vector y in y = Ax) using one thread per row. The second, the CRS-vector method, assigns multiple threads to the calculation of a single row. The CRS-scalar method can be implemented easily with minimal changes from CPU code, but it may not be suitable for efficient memory access on GPUs. On GPUs, several memory access transactions are coalesced into a single transaction when consecutive threads access consecutive memory addresses. The CRS-vector method can offer more efficient memory access patterns than the CRS-scalar method. Bell and Garland allocated 32 threads to the calculation of one row.

In the CRS-vector method, if the number of non-zero elements per row is less than 32, reducing the number of calculation threads per row may improve the performance. Baskaran and Bordawekar [49] used 16, instead of 32, threads to compute a row with the CRS-vector method. Guo and Wang [50] proposed a method that switches the number of threads between 16 and 32 based on the characteristics of the input matrix. El Zein and Rendell [51] switched between the CRS-scalar and CRS-vector methods based on the number of non-zero elements per row. Reguly and Giles [52] improved the performance of the CRS-vector method by selecting the optimal number of threads from among 1, 2, 4, 8, 16 and 32 in proportion to the average number of non-zero elements per row. Furthermore, Yoshizawa and Takahashi [53] selected the optimal number of threads from among 1, 2, 4, 8, 16 and 32 based on the maximum number of non-zero elements per row. The strategy of varying the number of threads from 1 to 32 based on the number of non-zero elements per row is effective. The average-based approach is preferable because the average number of non-zero elements per row can be calculated without pre-scanning the input matrix. Therefore, the implementation is based on Reguly and Giles’s method.

The Kepler architecture was launched by NVIDIA in 2012. Davis and Chung [54] compared the performance of the Kepler and the Fermi, an earlier generation of GPU, using GeForce series GPUs. Their report shows that the Kepler is slower than the Fermi, but they used the same program on both GPUs. Most existing reports have focused on earlier generations of GPUs; there is no work on the implementation and evaluation of optimization techniques for Kepler architecture GPUs. I target a Tesla K20 GPU, which is based on the Kepler architecture.


Figure 3.2: Thread mapping for the cases of NT=1, 2 and 4 in the CRS-vector method

3.3 Kepler architecture GPUs

An overview of the Kepler architecture can be found in the white paper [19] by NVIDIA. The biggest change from the previous-generation Fermi architecture is that the streaming multiprocessor, called SM on the Fermi architecture, has been replaced with an updated version called SMX. The SM has 32 CUDA cores, whereas the SMX has 192. As a result, the maximum numbers of warps, threads, and thread blocks per multiprocessor have also increased. MaxGridDimX (the number of thread blocks in the x-direction that can be defined in a grid) has increased from 65,535 to 2,147,483,647. In addition, the total number of registers per multiprocessor has doubled to 65,536, and the number of registers available to a thread has increased from 63 to 255. Moreover, the execution efficiency of double-precision operations has been improved over the Fermi architecture through an improved warp scheduler.

The Kepler architecture also supports several new features. Among them, a new 48KB read-only data cache and new shuffle instructions are expected to improve the performance of SpMV. The 48KB read-only data cache can only be used to load data whose values do not change during kernel execution. This cache was accessible through the texture unit on earlier generations of GPUs, but has seen major improvements on the Kepler architecture. The shuffle instructions are new instructions used to access a value held by a different thread in the same warp. I utilize the 48KB read-only data cache, the shuffle instructions, and the expanded MaxGridDimX to optimize SpMV on the Kepler architecture. The next section explains the details of these three optimization techniques.


3.4 Implementation

This section describes the implementation. The SpMV routine, implemented in double-precision, computes y = αAx + βy, which is compatible with the SpMV routine of the cuSPARSE library. The implementation is based on Reguly and Giles’s method, which builds on the CRS-vector method and selects the number of threads used for the calculation of a single row (NT) from among NT = 1, 2, 4, 8, 16 and 32 in proportion to the average number of non-zero elements per row. The average number of non-zero elements per row is available in advance without pre-scanning the input matrix; for example, a matrix with about 7.7 non-zero elements per row is assigned NT = 2^⌈log2 7.7⌉ = 8. Figure 3.2 shows a conceptual diagram of the thread mapping for the cases of NT=1, 2 and 4 in the CRS-vector method. When NT=1, the method is equivalent to the CRS-scalar method. The CRS-vector method computes an inner product in the row direction using multiple threads. In Figure 3.2, “iteration” means a loop in the row direction. NT can be up to 32 because thread synchronization is not required within a warp (32 threads).

Listing 3.1 shows the host code. In the host code, NT is determined and the kernel code for that NT is called. Listing 3.2 shows the kernel code for the Fermi architecture. I further optimized the implementation for the Kepler architecture by (1) using the 48KB read-only data cache, (2) avoiding the outermost loop, and (3) using shuffle instructions. Listing 3.3 shows the kernel code optimized for the Kepler architecture. Note that the second for-loop in the kernel codes (Listings 3.2 and 3.3) is unrolled, since the number of iterations, which equals NT, is known before the kernel launch. The following subsections explain the details of the three optimization techniques for the Kepler architecture.

3.4.1 48KB Read-only Data Cache

The 48KB read-only data cache can be applied only to data whose values do not change during the execution of a kernel. The cache can be accessed via the texture unit by mapping data in global memory to texture memory, which was also possible on the earlier Fermi generation, as shown in Listing 3.2, but before Kepler using this cache required more complex code and had many limitations. Starting with the Kepler architecture, the cache can be accessed directly from the SM with general load operations. Reads through the read-only data cache take a path that is independent of the L1 cache. The read-only data cache is managed automatically by the CUDA compiler: qualifying a kernel’s pointer arguments with “const” and “__restrict__” directs the compiler to load the corresponding data through the read-only cache.


Listing 3.1: Host code of SpMV
int SpMV(char trans, int m, int n, int nnz, double alpha,
         double *a_val, int *a_ptr, int *a_idx, double *x,
         double beta, double *y) {
    int NT, ntx, nbx;
    float nnzrow = (float)nnz / (float)m;
    NT = max(1, min(32, (int)pow(2., ceil(log2(nnzrow)))));
    ntx = 128;
    nbx = m / (ntx / NT) + ((m % (ntx / NT)) != 0);
    dim3 threads(ntx);
    dim3 grid(nbx);
    if (trans == 'N') {
        if (NT == 32) {
            cudaFuncSetCacheConfig(SpMV_kernel32,
                                   cudaFuncCachePreferL1);
            SpMV_kernel32<<<grid, threads>>>
                (m, alpha, a_val, a_ptr, a_idx, x, beta, y);
        } else if (NT == 16) {
            ....
        } else if (NT == 2) {
            ....
        } else {
            ....
        }
    }
    return 0;
}

The implementation uses the 48KB read-only data cache to read the vector x from global memory.

3.4.2 Avoid Outermost Loop

On the Kepler architecture, MaxGridDimX (the number of thread blocks that can be defined in the x dimension of a grid) was extended from 65,535 to 2,147,483,647. As a result, the implementation can avoid the outermost loop of the CRS-vector method, because the row index can be computed directly from the thread ID and the thread block ID.


Listing 3.2: Kernel code of SpMV for the Fermi architecture
texture<int2, 1, cudaReadModeElementType> tex_x;   // vector x bound as int2 pairs

static __inline__ __device__ double fetch_x(const int &i) {
    register int2 v = tex1Dfetch(tex_x, i);
    return __hiloint2double(v.y, v.x);   // reassemble the double from two 32-bit halves
}

__global__ void SpMV_kernelNT_Fermi(int m, double alpha,
        double *a_val, int *a_ptr, int *a_idx,
        double *x, double beta, double *y) {
    int i;
    int tx = threadIdx.x;
    int tid = blockDim.x * blockIdx.x + tx;
    int rowid = tid / NT;
    int lane = tid % NT;
    __shared__ double vals[128];
    while (rowid < m) {
        vals[tx] = 0.0;
        for (i = a_ptr[rowid] + lane;
             i < a_ptr[rowid + 1]; i += NT)
            vals[tx] += a_val[i] * fetch_x(a_idx[i]);
        for (i = NT / 2; i > 0; i >>= 1)
            vals[tx] += vals[tx + i];
        if (lane == 0)
            y[rowid] = alpha * vals[tx] + beta * y[rowid];
        __syncthreads();
        rowid += gridDim.x * blockDim.x / NT;
    }
}

In the CRS-vector method, RowMax, the maximum number of matrix rows that can be processed, is given by RowMax = MaxGridDimX × BlockDim.x / NT, where “BlockDim.x” is the number of threads in the x dimension of a thread block. In the implementation, the optimal BlockDim.x is 128. RowMax is smallest when NT = 32. Thus, on the Fermi architecture, RowMax


Listing 3.3: Kernel code of SpMV for the Kepler architecture
__global__ void SpMV_kernelNT_Kepler(int m, double alpha,
        double *a_val, int *a_ptr, int *a_idx,
        const double * __restrict__ x,     // read through the read-only data cache
        double beta, double *y) {
    int i, val_hi, val_lo;
    int tx = threadIdx.x;
    long long tid = blockDim.x * blockIdx.x + tx;
    long long rowid = tid / NT;
    int lane = tid % NT;
    double val;
    if (rowid < m) {
        val = 0.0;
        for (i = a_ptr[rowid] + lane;
             i < a_ptr[rowid + 1]; i += NT)
            val += a_val[i] * x[a_idx[i]];
        for (i = NT / 2; i > 0; i >>= 1) {   // butterfly reduction within the warp
            val_hi = __double2hiint(val);
            val_lo = __double2loint(val);
            val += __hiloint2double(
                __shfl_xor(val_hi, i, 32),
                __shfl_xor(val_lo, i, 32));
        }
        if (lane == 0)
            y[rowid] = alpha * val + beta * y[rowid];
    }
}

= 65,535 × 128 / 32 = 262,140. To process a vector longer than 262,140 elements, the row index must be recalculated in a loop: the recalculation of rowid and the outermost while-loop are required, as shown in Listing 3.2, instead of the outermost if-statement of Listing 3.3. In addition, a thread synchronization instruction is required when shared memory is used for the reduction (this synchronization can be avoided by declaring the shared memory with the volatile qualifier, but that approach performs worse than using thread synchronization). On the Kepler architecture, on the other hand, RowMax has increased: RowMax


= 2,147,483,647 × 128 / 32 = 8,589,934,588. The capacity of global memory on current GPUs is less than 10 GB, whereas this RowMax corresponds to roughly a 32 GB single-precision vector. Therefore, MaxGridDimX on the Kepler architecture is sufficient to index any vector that fits on current GPUs, and the outermost loop is not required.

3.4.3 Shuffle Instruction

The CRS-vector method computes a reduction in the second for-loop shown in Listing 3.3. When NT ≥ 2, the NT threads perform the reduction within a single warp. On earlier architectures, this operation had to go through shared memory to exchange values among the threads of a warp. The Kepler architecture can access a value held by any other thread within the warp without shared memory by using shuffle instructions. Four types of shuffle instructions are supported starting with Kepler: indexed any-to-any (__shfl), shift right to the N-th neighbour (__shfl_up), shift left to the N-th neighbour (__shfl_down), and butterfly (XOR) exchange (__shfl_xor). Whereas shared memory requires separate load and store steps, shuffle instructions reduce the exchange to a single step, so the shuffle instructions can be expected to outperform the equivalent shared memory operations.

The implementation uses the butterfly exchange shuffle instruction for the reduction. Because the shuffle instructions support only 32-bit values, moving 64-bit data requires two 32-bit movements: a 64-bit double value is converted into two 32-bit integer values, the two 32-bit values are exchanged with the shuffle instruction, and the result, consisting of the two exchanged 32-bit integer values, is converted back into a single 64-bit double value.
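For reference, the 64-bit exchange described above can be wrapped in a small helper; the function name is illustrative, and the body simply mirrors the inline sequence in Listing 3.3.

// Butterfly exchange of a double within a warp on Kepler: split into two
// 32-bit halves, shuffle each half, and reassemble the result.
__device__ __forceinline__ double shfl_xor_double(double val, int lane_mask) {
    int hi = __double2hiint(val);
    int lo = __double2loint(val);
    return __hiloint2double(__shfl_xor(hi, lane_mask, 32),
                            __shfl_xor(lo, lane_mask, 32));
}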

3.5 Performance Evaluation

3.5.1 Evaluation Methods

I used an NVIDIA Tesla K20 Kepler architecture GPU. The evaluation environment is shown in Table 3.1. The “-arch sm_35” compiler flag for nvcc is required to use the features of the Kepler architecture. This section evaluates the GPU kernel execution time. To measure the performance accurately, a routine is executed repeatedly for at least one second and at least three times, and the average execution time is then computed.


Table 3.1: Evaluation Environment

CPU       Intel Xeon E3-1230 3.20GHz
RAM       16 GB (DDR3)
OS        CentOS 6.3 (kernel: 2.6.32-279.14.1.el6.x86_64)
GPU       Tesla K20 (5GB, GDDR5, ECC-enabled)
CUDA      CUDA 5.0 (Driver version: 304.54)
Compiler  gcc 4.4.6 (-O3), nvcc 5.0 (-O3 -arch sm_35)

All input values other than the matrix A are composed of uniform random numbers. For the input sparse matrices, 200 matrices were randomly selected from the University of Florida Sparse Matrix Collection [21]. The selected matrices are all real square matrices, each with a different number of non-zero elements or a different number of rows. The number of rows varies between 1,813 and 5,558,326, and the number of non-zero elements varies between 4,257 and 117,406,044. To investigate the effect of each optimization technique for the Kepler architecture used in this research, I implemented and evaluated the following five versions:

• Ver. Fermi: Optimized for the Fermi architecture (does not use Vers. A–C optimizations)

• Ver. A: Only using 48KB read-only data cache

• Ver. B: Only avoiding outermost loop

• Ver. C: Only using shuffle instruction

• Ver. Kepler: Optimized for the Kepler architecture (uses Vers. A–C optimizations)

Ver. Fermi is optimized for the Fermi architecture and is the implementation shown in Listing 3.2. Vers. A–C each apply one of the three optimizations individually. Note that Vers. Fermi, B, and C load the vector x through the texture cache by mapping the data in global memory to texture memory, as on the Fermi architecture, whereas Ver. A and Ver. Kepler use the read-only data cache instead of the texture cache. Ver. Kepler is the final version optimized for the Kepler architecture and is the implementation shown in Listing 3.3.


Figure 3.3: Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. A (only using the 48KB read-only data cache) over Ver. Fermi

Table 3.2: Properties of the matrices shown in Figure 3.3

Matrix           Rows     Nonzeros   % of Nonzeros  Nonzeros/Row  NT
mult_dcop_02     25187    193276     0.03050        7.67          8
shermanACb       18510    145149     0.04240        7.84          8
dc2              116835   766396     0.00561        6.56          8
ckt11752_dc_1    49702    333029     0.01350        6.70          8
...
gsm_106857       589446   21758924   0.00626        36.91         32
thermomech_dM    204316   1423116    0.00341        6.97          8
shallow_water1   81920    327680     0.00488        4.00          4
parabolic_fem    525825   3674625    0.00133        6.99          8

3.5.2 Result

Ver. A: only using 48KB read-only data cache

Figure 3.3 shows the Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. A over Ver. Fermi among the 200 matrices. Table 3.2 shows the properties of these 8 matrices.


Figure 3.4: Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. B (only avoiding the outermost loop) over Ver. Fermi

Table 3.3: Properties of the matrices shown in Figure 3.4

Matrix         Rows      Nonzeros   % of Nonzeros  Nonzeros/Row  NT
cage15         5154859   99199551   0.00037        19.24         32
cage13         445315    7479343    0.00377        16.80         32
olesnik0       88263     744216     0.00955        8.43          16
turon_m        189924    1690876    0.00469        8.90          16
...
hcircuit       105676    513072     0.00459        4.86          8
Pd             8081      13036      0.01996        1.61          2
poli_large     15575     33074      0.01363        2.12          4
adder_dcop_40  1813      11246      0.34214        6.20          8

The maximum speedup of Ver. A over Ver. Fermi is approximately 1.78 times, on “mult_dcop_02”, and the other top-4 cases also attain speedups of approximately 1.77–1.78 times. Except for the worst case, a 0.98-times slowdown on “parabolic_fem”, all other cases among the 200 matrices attain a speedup. On the Kepler architecture, using the read-only cache improves performance compared to accessing the data through the texture cache via the texture unit (by mapping data in global memory to texture memory), as was done on the Fermi architecture.


Figure 3.5: Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. C (only using the shuffle instruction) over Ver. Fermi

Table 3.4: Properties of the matrices shown in Figure 3.5

Matrix           Rows    Nonzeros  % of Nonzeros  Nonzeros/Row  NT
2cubes_sphere    101492  1647264   0.01599        16.23         32
torso3           259156  4429042   0.00659        17.09         32
FEM_3D_thermal2  147900  3489300   0.01595        23.59         32
cfd2             123440  3087898   0.02027        25.02         32
...
dw2048           2048    10114     0.24114        4.94          8
bwm2000          2000    7996      0.19990        4.00          4
m3plates         11107   6639      0.00538        0.60          1
Pd               8081    13036     0.01996        1.61          2

Ver. B: only avoiding outermost loop

Figure 3.4 shows the Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. B over Ver. Fermi among the 200 matrices.


Figure 3.6: Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler over Ver. Fermi

Table 3.5: Properties of the matrices shown in Figure 3.6

Matrix          Rows     Nonzeros  % of Nonzeros  Nonzeros/Row  NT
circuit5M       5558326  59524291  0.00019        10.71         16
mult_dcop_02    25187    193276    0.03047        7.67          8
circuit_3       12127    48137     0.03273        3.97          4
shermanACb      18510    145149    0.04236        7.84          8
...
shallow_water1  81920    327680    0.00488        4.00          4
appu            14000    1853104   0.94546        132.36        32
bcsstm35        30237    20619     0.00226        0.68          1
nd24k           72000    28715634  0.55393        398.83        32




Figure 3.7: Flops performance for the cases with the 4 largest and 4 smallest speedup of Ver. Kepler to cuSPARSE's subroutine

Table 3.6: Properties of the matrices shown in Figure 3.7

Matrix        Rows     Nonzeros  % of Nonzeros  Nonzeros/Row  NT
c-33          6317     56123     0.14064        8.88          16
memplus       17758    126150    0.04000        7.10          8
add20         2395     17319     0.30193        7.23          8
fp            7548     848553    1.48941        112.42        32
...
mult_dcop_02  25187    193276    0.03047        7.67          8
bloweya       30004    150009    0.01666        5.00          8
dc2           116835   766396    0.00561        6.56          8
boyd2         466316   1500397   0.00069        3.22          4

Ver. C: only using shuffle instruction

Figure 3.5 shows the Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. C relative to Ver. Fermi on the 200 matrices. Table 3.4 shows the properties of the 8 matrices. The maximum and minimum speedups of Ver. C to Ver. Fermi are approximately 1.27 times on "2cubes_sphere" and approximately 0.99 times on "Pd", respectively.

On the top 4 matrices, the NT, which is the number of threads assigned to the calculation of a single row, is 32, which is larger than that of the 4 worst cases. The implementation selects the NT from among 1–32 in proportion to the Nonzeros/Row, and the use of the shuffle instruction increases in proportion to the NT. Therefore, the effect of the shuffle instruction is relatively high on matrices which have a relatively large Nonzeros/Row.
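The warp-level reduction can be illustrated with a minimal sketch (assumptions, not Listing 3.3): the NT threads assigned to one row hold partial sums, and the shuffle instruction exchanges them within the warp without shared memory. Since __shfl_down() of CUDA 5.0 operates on 32-bit words, the double value is split into two halves here.

__device__ __forceinline__ double shfl_down_double(double v, int delta)
{
    // Split the 64-bit double into two 32-bit halves for __shfl_down().
    int hi = __double2hiint(v);
    int lo = __double2loint(v);
    hi = __shfl_down(hi, delta);
    lo = __shfl_down(lo, delta);
    return __hiloint2double(hi, lo);
}

// Sum the partial products of one row held by nt consecutive threads
// (nt is a power of two, 1 <= nt <= 32); lane 0 of the group ends up
// holding the row sum.
__device__ __forceinline__ double warp_reduce_sum(double sum, int nt)
{
    for (int offset = nt >> 1; offset > 0; offset >>= 1)
        sum += shfl_down_double(sum, offset);
    return sum;
}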

Ver. Kepler: final version for the Kepler architecture (uses three optimizations)

Figure 3.6 shows the Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler relative to Ver. Fermi on the 200 matrices. Table 3.5 shows the properties of the 8 matrices. The maximum and minimum speedups of Ver. Kepler to Ver. Fermi are approximately 1.78 times on "circuit5M" and 1.04 times on "nd24k", respectively. Furthermore, Ver. Kepler outperforms cuSPARSE's subroutine for 174 of the 200 matrices. Figure 3.7 shows the Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler relative to cuSPARSE's subroutine on the 200 matrices. Table 3.6 shows the properties of the 8 matrices. The maximum and minimum speedups of Ver. Kepler to cuSPARSE's subroutine are approximately 7.60 times on "c-33" and 0.07 times on "boyd2", respectively.

3.6 Conclusion

This chapter presented optimization techniques for SpMV for the CRS format on NVIDIA Kepler architecture GPUs using CUDA. The implementation is based on the existing method proposed for the Fermi architecture, an earlier generation of GPUs, and takes advantage of three new features of the Kepler architecture: a 48KB read-only data cache, shuffle instructions, and the expanded MaxGridDimX. For double-precision operations on the Tesla K20 Kepler architecture GPU, the implementation optimized for the Kepler architecture is approximately 1.04–1.78 times faster than the implementation optimized for the Fermi architecture for the 200 matrices randomly selected from the University of Florida Sparse Matrix Collection. In particular, the use of the read-only cache obtained the biggest speedup among the three optimizations with only a minor code change. In programs that use the texture cache as implemented for the Fermi architecture, the read-only cache should be used instead of the texture cache. Furthermore, I have shown that the implementation outperforms the SpMV routine for the CRS format of cuSPARSE 5.0 for 174 of the 200 matrices, and it is up to approximately 1.45 times faster than that routine.

It can be concluded that the techniques shown in this chapter are effective for implementing a fast SpMV routine for the CRS format on Kepler architecture GPUs.

Chapter 4

Krylov Subspace Methods using Quadruple-Precision Arithmetic on GPUs

4.1 Introduction

The convergence of the Krylov subspace methods, which are iterative methods for solving linear systems, is significantly affected by rounding errors. Thus, there are cases where reducing rounding errors with quadruple-precision floating-point arithmetic causes the algorithm to converge more quickly when compared to double-precision arithmetic [3]. The Krylov subspace methods are therefore an example of an application that requires extended precision arithmetic. This chapter will describe the implementation and performance evaluation of two Krylov subspace methods, the Conjugate Gradient (CG) and Bi-Conjugate Gradient Stabilized (BiCGStab) methods, using quadruple-precision arithmetic on an NVIDIA Tesla K20 GPU.

Krylov subspace methods are often used to solve large sparse linear systems Ax = b. Figures 4.1 and 4.2 [55] show the algorithms for the CG and BiCGStab methods, respectively. The CG method is applied when the coefficient matrix A is a symmetric positive definite matrix, and the BiCGStab method can be used when the coefficient matrix A is asymmetric. The convergence of the Krylov subspace methods depends on the spectral properties of the coefficient matrix. To improve the convergence, preconditioners which approximate the coefficient matrix are often used. In the algorithms shown in Figures 4.1 and 4.2, the preconditioning matrix M is used; by setting M = I, the algorithms become the same as their unpreconditioned counterparts. This chapter will first describe the implementation of the unpreconditioned methods and then describe the implementation of the preconditioned methods.

Using quadruple-precision arithmetic is also effective for improving convergence. Even if the use of quadruple-precision arithmetic increases the execution time of one iteration, the time until convergence may be reduced if the number of iterations is reduced to the point that it compensates for the increased execution time, as shown in Figure 4.3. While the mixed precision approach [56] generally utilizes lower precision arithmetic to accelerate computation, I decided to use higher precision arithmetic to reduce the computation time by reducing the number of required iterations. Chapter 2 showed that the performance of quadruple-precision dense matrix-vector multiplication is memory bound, and the execution time is only about twice that of the double-precision operation on GPUs. Since these methods mainly consist of Sparse Matrix-Vector multiplication (SpMV) and some vector-vector operations, they are generally regarded as being memory-intensive. Thus, the use of quadruple-precision arithmetic at most doubles the execution time of one iteration of the double-precision version on GPUs.

In this chapter, I will evaluate the performance of these methods using quadruple-precision arithmetic. Next, I will discuss how to use quadruple-precision arithmetic to potentially speed up these methods.


r_0 = b − A x_0
for k = 1, 2, ... do
    solve M z_{k−1} = r_{k−1}
    ρ_{k−1} = ⟨r_{k−1}, z_{k−1}⟩
    if k = 1 then
        p_1 = z_0
    else
        β_{k−1} = ρ_{k−1} / ρ_{k−2}
        p_k = z_{k−1} + β_{k−1} p_{k−1}
    end if
    q_k = A p_k
    α_k = ρ_{k−1} / ⟨p_k, q_k⟩
    x_k = x_{k−1} + α_k p_k
    r_k = r_{k−1} − α_k q_k
    if ||r_k||_2 / ||r_0||_2 ≤ ε then break
end for

Figure 4.1: Preconditioned CG method


r_0 = b − A x_0
r̃ = r_0
for k = 1, 2, ... do
    ρ_{k−1} = ⟨r̃, r_{k−1}⟩
    if ρ_{k−1} = 0 then the method fails
    if k = 1 then
        p_k = r_{k−1}
    else
        β_{k−1} = (ρ_{k−1} / ρ_{k−2})(α_{k−1} / ω_{k−1})
        p_k = r_{k−1} + β_{k−1}(p_{k−1} − ω_{k−1} v_{k−1})
    end if
    solve M p̂ = p_k
    v_k = A p̂
    α_k = ρ_{k−1} / ⟨r̃, v_k⟩
    s = r_{k−1} − α_k v_k
    if ||s||_2 / ||r_0||_2 ≤ ε then
        x_k = x_{k−1} + α_k p̂
        break
    end if
    solve M ŝ = s
    t = A ŝ
    ω_k = ⟨t, s⟩ / ⟨t, t⟩
    x_k = x_{k−1} + α_k p̂ + ω_k ŝ
    r_k = s − ω_k t
    if ||r_k||_2 / ||r_0||_2 ≤ ε then break
    if ω_k = 0 then break
end for

Figure 4.2: Preconditioned BiCGStab method

This chapter is organized as follows: Section 4.2 will introduce related work. Section 4.3 will show the implementation of the CG and BiCGStab methods using quadruple-precision floating-point arithmetic on GPUs. Section 4.4 will compare the performance of the quadruple-precision implementations with the double-precision versions and discuss the results. Finally, Section 4.5 will conclude the chapter.


[Figure 4.3 illustration: low precision: time per iteration 1, number of iterations 10, total time 1 × 10 = 10; high precision: time per iteration 2, number of iterations 4, total time 2 × 4 = 8, i.e. faster.]

Figure 4.3: Accelerating iterative methods using high precision arithmetic

4.2 Related Work

Hasegawa [3] compared the performance of an unpreconditioned BiCG method using quadruple-precision arithmetic to the preconditioned method using only double-precision arithmetic on various CPU architectures. He did not show cases where implementations using quadruple-precision arithmetic outperformed those using only double-precision arithmetic, but he expected that the use of quadruple-precision arithmetic may be an effective alternative to preconditioning, which has low parallelism, on parallel architectures.

Some linear algebra libraries for CPUs support quadruple-precision arithmetic for sparse iterative methods [57] [58], and there is some research into utilizing quadruple-precision arithmetic for Krylov subspace methods. Kotakemori et al. [59] described the implementation and performance of the BiCG methods using quadruple-precision arithmetic for lis [58], a sparse iterative solver library for CPUs. Their paper shows that on CPUs the use of quadruple-precision arithmetic requires approximately 2.99–4.56 times the execution time of the double-precision version per iteration. They also showed the DQ-SWITCH method, which is a mixed precision method using both double- and quadruple-precision arithmetic on CPUs. Furuichi et al. [60] implemented the Generalized Conjugate Residual (GCR) methods using quadruple-precision arithmetic on the NEC SX-9 supercomputer. They used double-precision operations only for the preconditioning operations; all other operations were implemented using quadruple-precision. As a result, they improved the convergence without significantly increasing the execution time. Saito et al. [61] also showed convergence improvement of the GCR methods in the Scilab toolbox they developed for CPUs by using quadruple-precision arithmetic for certain parts of the algorithm.

Such research shows that the use of quadruple-precision arithmetic improves the convergence and is useful for solving problems which cannot be solved using double-precision solvers. However, it can be hypothesized that quadruple-precision arithmetic can also be used to accelerate double-precision solvers even when quadruple-precision arithmetic is not necessary. In addition, although Krylov subspace methods have been implemented on GPUs [62] [63], there is no research on the implementation and performance of methods using quadruple-precision arithmetic on GPUs.

4.3 Implementation

This section shows the implementation of the CG and BiCGStab methods using quadruple-precision arithmetic on GPUs. I implemented the unpreconditioned and preconditioned versions using both double- and quadruple-precision arithmetic to compare the performance. The target environment is NVIDIA Kepler architecture GPUs [19] of compute capability 3.5.

4.3.1 Overview of Quadruple-Precision Versions

I aimed to improve the convergence of the methods by using quadruple-precision arithmetic instead of double-precision arithmetic. Thus, for the implementations using quadruple-precision arithmetic on sparse linear systems Ax = b, the input, i.e. the coefficient matrix A and the vector b, is given in the double-precision format. The vector x and all other floating-point data are stored in the quadruple-precision format. Quadruple-precision floating-point arithmetic is used instead of double-precision arithmetic everywhere except for the norm computation for checking convergence. For the preconditioned methods, quadruple-precision arithmetic is used everywhere except for the norm computation and the preconditioning process. Quadruple-precision floating-point arithmetic is performed using DD arithmetic. The details are shown in Chapter 2, and the implementation of the DD arithmetic and the subroutines using DD arithmetic are the same as in Chapter 2. One DD value is stored using a "double2" type value, which is a vector type consisting of two double-precision values defined in CUDA. DD scalar value computations on the CPU side are performed using the QD library [12].
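A minimal sketch (a hypothetical helper, not part of the thesis implementation) of promoting the double-precision input to DD values stored as double2: the high word holds the original double exactly and the low word is set to zero, so the conversion introduces no rounding error.

// Hypothetical helper kernel: promote a double-precision array to DD (double2) values.
__global__ void promote_double_to_dd(const double *src, double2 *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = make_double2(src[i], 0.0);  // hi = exact double value, lo = 0
}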


4.3.2 Implementation of CG and BiCGStab Methods on GPUs

Vector operations are performed on GPUs, and scalar operations are performed on CPUs. I implemented SpMV (y = Ax), DOT (r = ⟨x, y⟩), AXPY (y = αx + y) and XPAY (y = x + αy). The BiCGStab method additionally requires AXPYZ (z = αx + y). I used the Compressed Row Storage (CRS) format for storing sparse matrices. Among the kernels implemented, SpMV is generally the most time-consuming operation. I used the SpMV algorithm and optimization techniques described in Chapter 3. This implementation, however, does not exploit the symmetry of symmetric matrices to reduce storage.
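As a minimal sketch (assumptions, not the thesis listings), a quadruple-precision AXPY kernel on DD data stored as double2 might look as follows; dd_add() and dd_mul() stand for the DD arithmetic device functions of Chapter 2 and are assumed to be available here.

// Hypothetical DD AXPY kernel: y = alpha*x + y, with DD values stored as double2.
// dd_add()/dd_mul() are assumed device functions implementing DD arithmetic.
__global__ void dd_axpy(int n, double2 alpha,
                        const double2 *x, double2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = dd_add(dd_mul(alpha, x[i]), y[i]);
}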

For some double-precision subroutines, vendor-provided libraries such as CUBLAS [34] and cuSPARSE [22] are available. However, in order to measure the performance impact of the different precisions accurately, I implemented from scratch all vector operation subroutines that require both double- and quadruple-precision versions. Doing so ensures that the algorithms for both the quadruple- and double-precision versions are completely the same, including the number of threads used for GPU kernel functions. The norm computation for checking convergence is performed using double-precision for both versions, and thus was implemented using the DNRM2 subroutine of CUBLAS.

The preconditioned methods use an incomplete-LU preconditioner known as ILU(0), one of the most popular preconditioners for Krylov subspace methods. The ILU(0) preconditioning performs an incomplete LU factorization which approximates A ≈ M = LU, where L and U are lower and upper triangular matrices, respectively. In the incomplete LU factorization, the preconditioning matrix M keeps the non-zero pattern of the original coefficient matrix A. Therefore, Ax = b can be solved as (M^−1 A)x = M^−1 b using sparse triangular solvers. The double-precision subroutines for the ILU(0) preconditioning are provided by the cuSPARSE library. My implementation uses them for both the double- and quadruple-precision versions, and the preconditioning matrix is stored in the double-precision format. In other words, in the preconditioned methods, quadruple-precision arithmetic is used everywhere except for the norm computation and the preconditioning process. In the iterative portion, the cuSPARSE subroutine cusparseDcsrsv_solve() is executed two and four times per iteration for the CG and BiCGStab methods, respectively. The subroutine solves a sparse lower or upper triangular system with either forward or backward substitution using double-precision.
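For reference, applying the ILU(0) preconditioner M = LU to a residual vector r amounts to two sparse triangular solves, which is why the triangular solve subroutine is called twice per preconditioner application:

\[
L w = r \quad (\text{forward substitution}), \qquad
U z = w \quad (\text{backward substitution}), \qquad
z = U^{-1} L^{-1} r = M^{-1} r .
\]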


Table 4.1: Evaluation environment

CPU       Intel Xeon E5-2609 (2.40 GHz, 4 cores)
RAM       16 GB (DDR3)
GPU       NVIDIA Tesla K20
VRAM      5 GB (GDDR5, ECC-enabled)
OS        CentOS 6.3
CUDA      CUDA 5.0
Compiler  gcc 4.4.6 (-O3), nvcc 5.0 (-O3 -arch sm_35)

4.4 Experimental Results

This section compares the performance of the quadruple-precision versions with the double-precision versions. The input matrices are from the University of Florida Sparse Matrix Collection [21]. In order to make the matrices suitable for execution on GPUs, I used relatively large real square matrices where the number of non-zero elements (Nonzeros) is greater than 1,000,000. There are a total of 154 symmetric matrices and 102 asymmetric matrices in the collection that meet these criteria. Table 4.1 shows the evaluation environment. I evaluated the execution time of only the iterative portion of the CG and BiCGStab methods. The CG and BiCGStab methods are used for symmetric and asymmetric matrices, respectively. The following conditions are the same for both the double- and quadruple-precision versions: b = (1, 1, ..., 1)^T and the initial vector is x_0 = 0. The stopping criterion is ε = 10^−8 or ε = 10^−12 on ||r||_2/||r_0||_2 ≤ ε, and the maximum number of iterations is 30,000.

4.4.1 Unpreconditioned Methods

Table 4.2 shows the number of problems which were solved with the quadruple-precision versions with the stopping criterion (||r||_2/||r_0||_2 ≤ ε) and satisfying the true relative residual (TRR): ||b − Ax||_2/||b||_2 ≤ ε. Among these, the cases where QP is preferable to DP are "solved by both QP & DP and QP is faster" and "solved by QP but not solved by DP". With a smaller ε, more cases satisfy the stopping criterion but do not satisfy the accuracy criterion, TRR. Here I will focus on the cases where ε = 10^−8.


Table 4.2: The number of problems which could be solved using quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ ε and satisfying the true relative residual ||b − Ax||_2/||b||_2 ≤ ε

(a) CG for symmetric matrices (154 matrices)
                                           ε = 10^−12   ε = 10^−8
solved by both QP & DP                          8            47
solved by both QP & DP and QP is faster         0             2
solved by QP but not solved by DP              15            12

(b) BiCGStab for asymmetric matrices (102 matrices)
                                           ε = 10^−12   ε = 10^−8
solved by both QP & DP                          9            18
solved by both QP & DP and QP is faster         0             2
solved by QP but not solved by DP               7             7

Solved by both QP & DP and QP is faster

Using quadruple-precision arithmetic reduces the execution time for 4 (2 symmetric and 2 asymmetric) of the 65 matrices (47 symmetric and 18 asymmetric) which were solvable by both the double- and quadruple-precision versions where ε = 10^−8. Table 4.3 shows these 4 cases: (a) shows the matrix properties, (b) and (c) show the results for the double- and quadruple-precision versions respectively, and (d) shows the ratios of the number of iterations, the time per iteration, and the total execution time of the quadruple-precision versions compared to the double-precision versions. "# iter" means the number of iterations until convergence, "1 iter time" represents the execution time of one iteration, and "Total time" means the total execution time until convergence. For example, for "rajat31", the use of quadruple-precision arithmetic requires approximately 1.84 times the execution time of the double-precision version for one iteration. However, the number of iterations decreases to approximately 38 % of the double-precision version by using quadruple-precision arithmetic. As a result, the use of quadruple-precision arithmetic reduces the total time until convergence to approximately 69 % of that of the double-precision version.
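The "rajat31" example amounts to multiplying the two ratios reported in Table 4.3 (d):

\[
\frac{T_{\mathrm{QP}}}{T_{\mathrm{DP}}}
  = \frac{\#\,\mathrm{iter}_{\mathrm{QP}}}{\#\,\mathrm{iter}_{\mathrm{DP}}}
    \times \frac{t^{\,\mathrm{1\,iter}}_{\mathrm{QP}}}{t^{\,\mathrm{1\,iter}}_{\mathrm{DP}}}
  \approx 0.38 \times 1.84 \approx 0.69 .
\]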

Solved by QP but not solved by DP

19 matrices can be solved using quadruple-precision, but not with double-precision, when ε = 10^−8. 3 of these 19 matrices do not converge within the maximum number of iterations, 30,000, when using double-precision. Table 4.4 shows these 3 matrices. These cases could potentially be solved using double-precision by increasing the maximum number of iterations; however, using quadruple-precision arithmetic may still be faster. For instance, for "gyro" the quadruple-precision version is already faster after 30,000 iterations.


Table 4.3: Cases which can be solved using both quadruple-precision (QP) and double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^−8 and satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^−8, but QP versions are faster

(a) Matrix properties
Matrix      Rows     Nonzeros  Structure  Application
rajat31     4690002  20316253  asym       circuit simulation problem
venkat01    62424    1717792   asym       computational fluid dynamics problem sequence
crankseg_2  63838    14148858  sym        structural problem
c-73b       169422   1279274   sym        subsequent optimization problem

(b) Double-precision (DP)
Matrix      # iter  1 iter time [sec]  Total time [sec]  TRR
rajat31     11618   0.0182             211               9.94E-09
venkat01    20761   1.33E-03           27.7              9.61E-09
crankseg_2  4915    1.59E-03           7.83              9.44E-09
c-73b       19857   3.04E-03           60.4              9.95E-09

(c) Quadruple-precision (QP)
Matrix      # iter  1 iter time [sec]  Total time [sec]  TRR
rajat31     4380    0.0334             146               8.68E-09
venkat01    15858   1.72E-03           27.2              9.85E-09
crankseg_2  3961    1.95E-03           7.71              9.81E-09
c-73b       8916    6.70E-03           59.7              6.61E-09

(d) Ratio of QP/DP
Matrix      # iter  1 iter time  Total time
rajat31     0.38    1.84         0.69
venkat01    0.76    1.29         0.98
crankseg_2  0.81    1.22         0.98
c-73b       0.45    2.20         0.99


Table 4.4: Cases which can be solved using quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^−8 and satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^−8, but which do not converge within the maximum number of iterations of 30,000 when using double-precision (DP) versions with the stopping criterion

(a) Matrix properties
Matrix       Rows    Nonzeros  Structure  Application
gyro         17361   1021159   sym        model reduction problem
human_gene2  14340   18068388  sym        undirected weighted graph
CoupCons3D   416800  17277420  asym       structural problem

(b) Double-precision (DP)
Matrix       # iter  1 iter time [sec]  Total time [sec]  TRR
gyro         30000   3.25E-04           9.75              2.54E-08
human_gene2  30000   1.95E-03           58.6              4.96E-01
CoupCons3D   30000   6.82E-03           205               3.32E-06

(c) Quadruple-precision (QP)
Matrix       # iter  1 iter time [sec]  Total time [sec]  TRR
gyro         17381   4.44E-04           7.72              9.96E-09
human_gene2  29804   2.66E-03           79.1              9.88E-09
CoupCons3D   26931   0.0125             337               9.44E-09

(d) Ratio of QP/DP
Matrix       # iter  1 iter time  Total time
gyro         0.58    1.37         0.79
human_gene2  0.99    1.36         1.35
CoupCons3D   0.90    1.83         1.65

On the other hand, 12 of the 19 matrices also converged when using double-precision with the stopping criterion (||r||_2/||r_0||_2 ≤ 10^−8) but did not satisfy the TRR: ||b − Ax||_2/||b||_2 ≤ 10^−8. Table 4.5 shows 4 of these 12 cases, which include the 2 best and the 2 worst QP/DP total execution time ratios. In these cases, using quadruple-precision arithmetic improves the accuracy of the TRR with a small increase in the execution time: the quadruple-precision versions require approximately 1.02 to 1.70 times the execution time of the double-precision versions for one iteration.


Table 4.5: Cases which can be solved using quadruple-precision (QP) versions and converged by double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^−8 but not satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^−8 when using DP versions

(a) Matrix properties
Matrix    Rows     Nonzeros  Structure  Application
bone010   986703   47851783  sym        model reduction problem
nd6k      18000    6897316   sym        2D/3D problem
...
thermal2  1228045  8580313   sym        thermal problem
apache2   715176   4817870   sym        structural problem

(b) Double-precision (DP)
Matrix    # iter  1 iter time [sec]  Total time [sec]  TRR
bone010   16684   9.03E-03           151               7.07E-08
nd6k      8053    8.54E-04           6.87              2.19E-08
...
thermal2  5509    2.79E-03           15.4              1.27E-08
apache2   4787    1.61E-03           7.71              1.07E-08

(c) Quadruple-precision (QP)
Matrix    # iter  1 iter time [sec]  Total time [sec]  TRR
bone010   10805   0.0143             154               9.79E-09
nd6k      7775    9.47E-04           7.37              9.16E-09
...
thermal2  5494    4.71E-03           25.9              9.96E-09
apache2   4787    2.73E-03           13.1              9.94E-09

(d) Ratio of QP/DP
Matrix    # iter  1 iter time  Total time
bone010   0.65    1.58         1.02
nd6k      0.97    1.11         1.07
...
thermal2  1.00    1.69         1.68
apache2   1.00    1.70         1.70


Table 4.6: The 2 largest and 2 smallest relative execution times per iteration (QP/DP ratio)

(a) Matrix properties and execution time of one iteration (QP/DP ratio)
Matrix      Rows     Nonzeros  Structure  1 iter time [QP/DP]
nd6k        18000    6897316   sym        1.11
crankseg_2  63838    14148858  sym        1.22
rajat31     4690002  20316253  asym       1.84
c-73b       169422   1279274   sym        2.20

(b) Performance of SpMV, percentage of a single iteration spent calculating SpMV, and relative execution time of SpMV (QP/DP ratio)
Matrix      DP GFlops  DP [%]  QP GDDFlops  QP [%]  Exec time [QP/DP]
nd6k        20.99      77.0    18.90        77.0    1.11
crankseg_2  20.73      85.7    17.24        84.3    1.20
rajat31     8.04       55.7    4.87         50.0    1.65
c-73b       0.94       89.5    0.42         92.0    2.26

Cost of quadruple-precision arithmetic

For the unpreconditioned methods, in the 11 cases shown in Tables 4.3–4.5, the execution time of one iteration of the methods using quadruple-precision arithmetic is approximately 1.11–2.20 times that of the double-precision versions. Table 4.6 shows the performance of SpMV, the percentage of a single iteration spent calculating SpMV, and the relative execution time of SpMV (QP/DP ratio) for the cases with the 2 largest and 2 smallest relative execution times per iteration (QP/DP). For "nd6k" and "crankseg_2", the relative execution time (QP/DP ratio) of one iteration is close to 1.0, and SpMV occupies more than 77 % of the execution time per iteration of both the double- and quadruple-precision versions. Thus, the relative execution time (QP/DP ratio) of one iteration strongly depends on the QP/DP ratio of SpMV. I theorize that the performance of the SpMV is memory bound for the quadruple-precision versions on the GPU, as evidenced by the Bytes/Flop and Bytes/DDFlop ratios shown in Chapter 2.

When the number of nonzero elements is NNZ and the number of rows is N, the SpMV requires approximately (8 + 4)×NNZ [Bytes] / (2×NNZ) [DDFlop] = 6.0 [Bytes/DDFlop], where NNZ ≫ N. Note that for the quadruple-precision versions, the input matrix of the SpMV using DD arithmetic is given in the double-precision format. On the other hand, the Tesla K20 GPU has 64 double-precision FPUs per SMX, which can perform double-precision multiply-add operations in one cycle, and a theoretical peak memory bandwidth of 208 GB/s. DD arithmetic requires 20 times as many cycles (i.e. execution time) as double-precision arithmetic, as shown in Chapter 2. Therefore, the theoretical peak performance of DD arithmetic is 706 [MHz] × 13 [SMX] × 64 [CUDA cores (DP)] × (2 [DDFlop]/20 [cycles]) ≈ 58.7 [GDDFlops], and the GPU has a Bytes/DDFlop ratio of 208/58.7 ≈ 3.5. Thus, it can be predicted that the performance of SpMV using DD arithmetic is memory bound on this GPU. In addition, SpMV has precision-independent costs such as indirect memory accesses using index arrays. As a result, we can conjecture that the execution time of SpMV using double-precision and that using DD arithmetic are close to the same because the input matrix is given in the double-precision format.

On the other hand, for "rajat31" and "c-73b", the relative execution time (QP/DP ratio) of one iteration and of SpMV is close to 2.0, and the Flops and DDFlops values are relatively small compared to those of the top 2 cases. One of the reasons for this might be that the memory access is not well coalesced, and therefore the performance may not be limited by memory bandwidth but instead by memory access latency. Since quadruple-precision operations may access memory almost twice as often as double-precision, this limitation can significantly impact the performance of the quadruple-precision version. In addition, for "rajat31", the percentage of one iteration spent inside SpMV is relatively small compared to the other cases, and all the vector operations other than SpMV perform DD arithmetic on data in the quadruple-precision format. Therefore, we can conjecture that in such cases the execution time of one iteration of the quadruple-precision versions is close to twice that of the double-precision versions.
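Restating the memory-boundedness estimate above in displayed form:

\[
\left.\frac{\mathrm{Bytes}}{\mathrm{DDFlop}}\right|_{\mathrm{SpMV}}
  \approx \frac{(8+4)\,\mathrm{NNZ}}{2\,\mathrm{NNZ}} = 6.0,
\qquad
P_{\mathrm{DD}} \approx 0.706\ \mathrm{GHz} \times 13\ \mathrm{SMX} \times 64 \times \frac{2\ \mathrm{DDFlop}}{20\ \mathrm{cycles}} \approx 58.7\ \mathrm{GDDFlops},
\]
\[
\frac{208\ \mathrm{GB/s}}{58.7\ \mathrm{GDDFlops}} \approx 3.5\ \mathrm{Bytes/DDFlop} < 6.0\ \mathrm{Bytes/DDFlop},
\]

so the SpMV using DD arithmetic is expected to be limited by the memory bandwidth of the Tesla K20.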

4.4.2 Cases with Preconditioning

Table 4.7 shows the number of problems which can be solved using the preconditioned quadruple-precision versions with stopping criterion (||r||_2/||r_0||_2 ≤ ε) and satisfying the TRR: ||b − Ax||_2/||b||_2 ≤ ε. When ε = 10^−8, using quadruple-precision arithmetic reduces the execution time for 13 (10 symmetric and 3 asymmetric) of the 71 matrices (48 symmetric and 23 asymmetric) which were solvable by both the double- and quadruple-precision versions.


Table 4.7: The number of problems which can be solved using preconditioned quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ ε and satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ ε

(a) Preconditioned CG for symmetric matrices (154 matrices)
                                           ε = 10^−12   ε = 10^−8
solved by both QP & DP                         10            48
solved by both QP & DP and QP is faster         2            10
solved by QP but not solved by DP              11            14

(b) Preconditioned BiCGStab for asymmetric matrices (102 matrices)
                                           ε = 10^−12   ε = 10^−8
solved by both QP & DP                         15            23
solved by both QP & DP and QP is faster         2             3
solved by QP but not solved by DP               6             1

Solved by both QP & DP and QP is faster

Table 4.8 shows 5 of the 13 cases which can be solved using both the double- and quadruple-precision versions with the stopping criterion (||r||_2/||r_0||_2 ≤ 10^−8) and satisfying the TRR: ||b − Ax||_2/||b||_2 ≤ 10^−8, but for which the quadruple-precision versions are faster. In the table, "Precond [%]" means the percentage of the execution time of one iteration spent on preconditioning. These 5 cases have the highest speedup when using quadruple-precision arithmetic. On the other hand, 15 matrices satisfy the stopping criterion but do not satisfy the accuracy criterion, TRR, when using the double-precision versions. Table 4.9 shows 4 of these 15 cases, which include the 2 highest and the 2 lowest QP/DP ratios of the total time. The cases shown in the table are for the CG method, as the cases with the best and worst speedup were all symmetric; all the asymmetric matrices, which use the BiCGStab method, fell between these two extremes.

Cost of quadruple-precision arithmetic

For the preconditioned methods, the preconditioning process accounts for an extremely large fraction of the execution time of one iteration on the GPU, as shown by "Precond [%]" in Tables 4.8 and 4.9. The preconditioning process is performed using double-precision for both the double- and quadruple-precision versions. Therefore, the execution time of one iteration is almost the same for both versions. In such cases, using quadruple-precision arithmetic to decrease the number of iterations by even a small amount leads to a speedup.


Table 4.8: Cases with preconditioning which can be solved using both quadruple-precision (QP) and double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^−8 and satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^−8, but QP versions are faster

(a) Matrix properties
Matrix     Rows    Nonzeros  Structure  Application
CO         221119  7666057   sym        theoretical/quantum chemistry problem
GaAsH6     61349   3381809   sym        theoretical/quantum chemistry problem
Ga3As3H12  61349   5970947   sym        theoretical/quantum chemistry problem
raefsky3   21200   1488768   asym       computational fluid dynamics problem
Lin        256000  1766400   sym        structural problem

(b) Double-precision (DP)
Matrix     # iter  1 iter time [sec]  Precond [%]  Total time [sec]  TRR
CO         461     1.65               99.9         762               8.11E-09
GaAsH6     364     0.494              99.9         180               8.78E-09
Ga3As3H12  730     0.578              99.8         422               8.63E-09
raefsky3   146     0.0384             98.0         5.61              1.33E-09
Lin        488     6.51E-03           89.3         3.18              9.81E-09

(c) Quadruple-precision (QP)
Matrix     # iter  1 iter time [sec]  Precond [%]  Total time [sec]  TRR
CO         388     1.66               99.8         643               7.97E-09
GaAsH6     318     0.495              99.8         157               7.44E-09
Ga3As3H12  646     0.578              99.8         373               8.81E-09
raefsky3   135     0.0386             97.3         5.22              3.31E-09
Lin        425     6.97E-03           83.7         2.96              9.59E-09

(d) Ratio of QP/DP
Matrix     # iter  1 iter time  Total time
CO         0.84    1.00         0.84
GaAsH6     0.87    1.00         0.88
Ga3As3H12  0.88    1.00         0.89
raefsky3   0.92    1.01         0.93
Lin        0.87    1.07         0.93


Table 4.9: Cases with preconditioning which can be solved using quadruple-precision (QP) versions and converged by double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 ≤ 10^−8 but not satisfying the true relative residual (TRR) ||b − Ax||_2/||b||_2 ≤ 10^−8 when using DP versions

(a) Matrix properties
Matrix    Rows     Nonzeros  Structure  Application
msc10848  10848    1229776   sym        structural problem
bmwcra_1  148770   10641602  sym        structural problem
...
thermal2  1228045  8580313   sym        thermal problem
c-73      169422   1279274   sym        optimization problem sequence

(b) Double-precision (DP)
Matrix    # iter  1 iter time [sec]  Precond [%]  Total time [sec]  TRR
msc10848  3994    0.0121             97.4         48.3              1.18E-08
bmwcra_1  1379    0.0199             92.5         27.4              2.09E-08
...
thermal2  2287    0.0348             91.5         74.7              1.12E-08
c-73      9452    6.23E-03           51.2         58.8              1.02E-07

(c) Quadruple-precision (QP)
Matrix    # iter  1 iter time [sec]  Precond [%]  Total time [sec]  TRR
msc10848  3508    0.0122             96.9         42.8              9.54E-09
bmwcra_1  1162    0.0209             88.6         24.3              9.64E-09
...
thermal2  2289    0.0348             86.5         79.6              9.90E-09
c-73      9143    9.87E-03           32.1         90.3              9.11E-09

(d) Ratio of QP/DP
Matrix    # iter  1 iter time  Total time
msc10848  0.88    1.01         0.89
bmwcra_1  0.84    1.05         0.89
...
thermal2  1.00    1.06         1.07
c-73      0.97    1.59         1.53


4.5 Conclusion

In this chapter, I have shown implementations of the CG and BiCGStab methods using quadruple-precision floating-point arithmetic on GPUs and compared their performance to standard double-precision versions on a Tesla K20 Kepler architecture GPU. My goal was to improve the convergence of the methods for solving sparse linear systems Ax = b. Quadruple-precision arithmetic was used in place of double-precision arithmetic everywhere except for the norm computations and the preconditioning process. For the implementations using quadruple-precision arithmetic, the input data, i.e. the coefficient matrix A and the vector b, are given in the double-precision format.

I have shown the relative time cost of the use of quadruple-precision arithmetic in the CG and BiCGStab methods. For the unpreconditioned methods, the use of quadruple-precision arithmetic required approximately 1.11–2.20 times the execution time of the double-precision versions for one iteration. On the other hand, the quadruple-precision iteration time for methods with double-precision incomplete LU preconditioning is only slightly more than that of double-precision. The quadruple-precision versions can solve the problem faster than the double-precision versions in cases where the use of quadruple-precision arithmetic reduces the number of required iterations enough to compensate for the increased time required for one iteration. I showed cases where the quadruple-precision version can reach a solution faster than the double-precision version even when the double-precision version converged within the maximum number of iterations, although such cases were rare.

I have shown that the use of quadruple-precision arithmetic is not costly in the CG and BiCGStab methods on the GPU. Therefore, the use of quadruple-precision arithmetic, along with the use of preconditioning, may be an effective method for accelerating the computation of Krylov subspace methods on GPUs. In general, the preconditioning process is difficult to parallelize, and it is not effective on massively parallel architectures such as GPU clusters. Thus, the use of quadruple-precision arithmetic may be an effective alternative to preconditioning in such environments.

Obviously, the effectiveness of the use of quadruple-precision arithmetic needs to be further investigated. The performance of Krylov subspace methods is affected by many factors such as the solver algorithm, the preconditioning methods, and the sparse matrix storage format.

The optimal choice of those factors depends on the problem characteristics and the computation environment, including the architecture and parallelism. In addition, the matrices of the University of Florida Sparse Matrix Collection are not necessarily distributed evenly over such properties as size, nonzero pattern, application, and mathematical properties. Therefore, the results in this chapter do not necessarily establish the statistical significance of the benefit of quadruple-precision arithmetic. Further research is necessary to determine under which conditions the number of iterations decreases when using quadruple-precision arithmetic.

Chapter 5

Conclusion

I have shown the implementation and the performance of linear algebraic operations using extended precision floating-point arithmetic on GPUs. This chapter summarizes this thesis and describes areas for future research.

5.1 Summary

Floating-point operations have rounding errors, and these errors may become a critical issue for some applications. Therefore, the usual precision of hardware floating-point operations is insufficient in some cases, and an extension of accuracy and precision is required. I mainly focused on triple- and quadruple-precision operations implemented in software and have shown the implementation and evaluation of linear algebraic operations using them on GPUs.

In Chapter 2, I have shown the implementation and performance of triple- and quadruple-precision BLAS subroutines on the NVIDIA Tesla M2050 Fermi architecture GPU. For the quadruple-precision subroutines, DD arithmetic was used. I have implemented three BLAS subroutines, AXPY, GEMV and GEMM, as representatives of vector-vector, matrix-vector and matrix-matrix operations, respectively. For the BLAS subroutines on the GPU, DD arithmetic requires 20 times the execution time of double-precision arithmetic in theory. However, the performances of double- and quadruple-precision AXPY and GEMV are memory-bound on the GPU; therefore, the execution time of the quadruple-precision subroutines is approximately twice that of the double-precision subroutines. For GEMM, the performance is compute-bound, and the execution time of the quadruple-precision subroutine is close to 14 times that of the double-precision subroutine because of the low execution efficiency of the double-precision subroutine.

The use of DD arithmetic increases the density of arithmetic instructions per memory access, and as a result, higher execution efficiency was achieved. On the other hand, I have proposed two methods to store triple-precision floating-point values and a method to compute triple-precision floating-point arithmetic operations that is based on DD arithmetic. Although I was not able to realize triple-precision arithmetic operations that are faster than DD arithmetic, the proposed triple-precision floating-point formats are effective for linear algebra operations whose performance is memory-bound. In comparison with a quadruple-precision format, the triple-precision formats can save the limited memory space on GPUs and reduce the data transfer time to 3/4. On AXPY and GEMV, the performance of the triple-precision subroutines is memory-bound, and therefore, the execution times are close to 3/4 of those of the quadruple-precision subroutines. Therefore, triple-precision operations are effective for memory-bound operations in cases where double-precision is insufficient, but quadruple-precision is not required and triple-precision is sufficient.

Furthermore, I have shown an application of quadruple-precision arithmetic for sparse linear algebra on GPUs. To implement fast sparse matrix operations on GPUs, in Chapter 3, I presented optimization techniques for sparse matrix-vector multiplication (SpMV) for the CRS format on NVIDIA Kepler architecture GPUs. The proposed implementation is based on the existing method proposed for the Fermi architecture, the generation preceding Kepler, and takes advantage of three new features of the latter: a 48KB read-only data cache, shuffle instructions and the expansion of MaxGridDimX. For double-precision operations on the Tesla K20 Kepler architecture GPU, the proposed implementation achieved speedups over the implementation optimized for the Fermi architecture. Furthermore, the thesis has shown that the implementation outperforms the SpMV routine for the CRS format of cuSPARSE 5.0 for 174 of the 200 matrices.

In Chapter 4, I showed an application of quadruple-precision floating-point arithmetic on GPUs for sparse iterative methods. The convergence of the Krylov subspace methods, which are iterative methods for solving linear systems, is significantly affected by rounding errors, and there are cases wherein the reduction in rounding errors achieved with quadruple-precision arithmetic causes the algorithm to converge more quickly. I implemented the CG and BiCGStab methods, which are Krylov subspace methods, using quadruple-precision floating-point arithmetic and compared their performance to the standard double-precision implementations on the Tesla K20 GPU. For the unpreconditioned methods, the use of quadruple-precision arithmetic required approximately 1.11–2.20 times the execution time of the double-precision versions for one iteration.

On the other hand, the quadruple-precision iteration time for methods with double-precision incomplete LU preconditioning is only slightly more than that of double-precision. I have shown cases in which the quadruple-precision versions can solve the problem faster than the double-precision versions, namely cases where the use of quadruple-precision arithmetic reduces the number of required iterations enough to compensate for the increased time required for one iteration.

Throughout this research, it has been shown that the use of quadruple-precision arithmetic implemented in software using DD arithmetic is not costly for some memory-intensive operations on GPUs. This is because recent GPUs have tremendous floating-point performance compared to memory performance: the Bytes/Flop ratio of the GPU is relatively low compared to that of the quadruple-precision subroutines. Nowadays, the floating-point performance of processors is increasing rapidly compared to the performance of data access through, for example, memory, buses, and networks. In other words, the Bytes/Flop ratio of processors and systems is becoming smaller, and it is predicted that memory bandwidth bottlenecks will be tight in exascale computing [43]. In fact, particularly on GPU-equipped supercomputer systems, the PCIe and network bandwidth are insufficient compared to the memory bandwidth and floating-point performance of the GPU. In such environments, many operations are becoming memory-bound rather than compute-bound. As a result, extended precision floating-point operations implemented in software may become cost-effective. In some cases, triple-precision operations may become a cost-effective alternative to quadruple-precision operations. Triple- and quadruple-precision can also be utilized in a mixed precision approach.

Furthermore, I showed the application of quadruple-precision floating-point arithmetic to Krylov subspace methods on GPUs in Chapter 4 as a case study. To improve the convergence of Krylov subspace methods, preconditioning is often used, and using quadruple-precision arithmetic is also effective. I showed cases where the implementation using quadruple-precision arithmetic can solve the problem faster than the double-precision versions. In general, the preconditioning process is difficult to parallelize, and it is not effective on massively parallel architectures such as GPU clusters. Therefore, the use of quadruple-precision arithmetic, along with the use of preconditioning, may be an effective method for accelerating the computation of Krylov subspace methods on GPUs. As in the case of the CG and BiCGStab methods on the GPU, existing algorithms are not always effective on massively parallel architectures, and the use of extended precision arithmetic may become an effective alternative to existing methods. Extended precision arithmetic can be expected to be utilized in other actual applications as well.


5.2 Future Work

This section describes areas for future research. I showed cases on a single GPU, but recent HPC applications are performed on multiple GPUs or multi-node systems. Therefore, performance evaluation in such practical environments is required. There is a difference in the performance model between a single GPU and systems equipped with multiple GPUs, such as GPU clusters. One of the biggest differences is the existence of data transfer via the PCIe bus; therefore, application performance tends to be limited by the bandwidth of the bus. In addition, although I have shown only the performance of basic linear algebra operations, the performance of actual applications that require extended precision operations should be shown in the future. Furthermore, the performance, implementation and effectiveness of linear algebra operations using extended precision arithmetic should also be discussed on other processors. Quadruple-precision arithmetic is also implemented on Field-Programmable Gate Array (FPGA) devices [24] [64]. For example, performance evaluation on the Intel Many Integrated Core Architecture (MIC) is also desired.

Although I showed the application of quadruple-precision arithmetic on GPUs to iterative methods for solving sparse linear systems, the effectiveness of the use of quadruple-precision arithmetic obviously needs to be further investigated. The performance of sparse iterative solvers is affected by many factors such as the solver algorithm, the preconditioning methods, and the sparse matrix storage format, and the optimal choice of those factors depends on the problem characteristics and the computation environment, including the architecture and parallelism. Further research is necessary to determine under which conditions the number of iterations decreases when using quadruple-precision arithmetic. The application of triple-precision operations is also desired.

In this research, I used the same DD arithmetic algorithms as the QD library [12], but other algorithms for DD arithmetic have also been proposed. For example, Nagai et al. noted two types of DD multiplication algorithms in their papers [65] [66]. They investigated the HITACHI Optimizing C Compiler's quadruple-precision multiplication software processing on an SR11000/J2 and then slightly modified HITACHI's algorithm to decrease the number of instructions for multiply-add operations by utilizing FMA operations [67]. Using this algorithm, the relative cost of quadruple-precision operations in compute-bound operations will decrease. I proposed the triple-precision formats for memory-bound operations by focusing on the effect of reducing the data size compared to the quadruple-precision format. Similarly, half-precision may be effective compared to single-precision in some cases.


In addition, for example, a double-precision floating-point value can be stored in a 32-bit integer value with the same exponent bits as the double-precision value by using the same method as the D+I-type format. The performance of memory-bound operations may be improved by focusing on the precision of the data and the storage format.

For widely practical use of extended precision in linear algebra operations on GPUs, well-optimized implementations of linear algebra libraries such as BLAS, for example GotoBLAS [68] for single- and double-precision operations, are desired. However, most current linear algebra libraries support only single- and double-precision floating-point arithmetic, and there are few linear algebra libraries that support extended precision floating-point arithmetic. MPACK [40] is the only LAPACK implementation that supports multiple-precision arithmetic including DD arithmetic, but MBLAS is implemented as a reference implementation and is not well optimized. Autotuning technology is one of the most important technologies for developing linear algebra libraries that support multiple precisions. Automatically Tuned Linear Algebra Software (ATLAS) [69] is one of the most well-known implementations of BLAS using autotuning technology. It is unrealistic to implement and optimize various subroutines by hand, especially for various precisions: for example, single-, double-, triple- and quadruple-precision for real and complex operations. For instance, Kurzak et al. showed a performance improvement in the GEMM subroutines on GPUs by utilizing heuristic autotuning [70]. For future work, all BLAS subroutines should be implemented by using such autotuning techniques.

Acknowledgements

First of all, I would like to express my deepest appreciation to my principal supervisor, Professor Daisuke Takahashi, who has provided me with generous support, advice and comments. Without his guidance and persistent support this work would not have been possible. I am also deeply indebted to him for providing many opportunities to work in an excellent research environment.

I would also like to express my gratitude to Professor Hidehiko Hasegawa, who was my previous supervisor during my undergraduate studies in the School of Library and Information Science, University of Tsukuba, for directing me toward attractive research areas such as high performance computing. I also received helpful advice and comments from him during the course of this study.

I would like to express my special thanks to my thesis committee members, Professor Daisuke Takahashi, Professor Hidehiko Hasegawa, Professor Koichi Wada, Professor Taisuke Boku and Professor Tetsuya Sakurai, for reading the thesis and giving me fruitful advice and comments.

I sincerely thank all members of the High Performance Computing Systems Laboratory (HPCS Lab.) at the University of Tsukuba for many discussions and much encouragement. In particular, I thank the Algorithm team and the GPU team for discussing a variety of interesting topics, such as GPU computing, and for helping me. I am indebted to Joel Tucci for English proofreading of my English works, including this thesis.

A part of this work was supported by JST, CREST "An evolutionary approach to construction of a software development environment for massively-parallel heterogeneous systems" and JSPS Grant-in-Aid for JSPS Fellows No. 251290.

Finally, I would also like to express my gratitude to my family for their moral support and warm encouragement.

Bibliography

[1] Dongarra, J., Bunch, J., Moler, C. and Stewart, G.: LINPACK Users’ Guide, SIAM (1979).

[2] Meuer, H., Strohmaier, E., Dongarra, J. and Simon, H.: TOP500 Supercomputing Sites, http://www.top500.org/.

[3] Hasegawa, H.: Utilizing the quadruple-precision floating-point arithmetic operation for the Krylov Subspace Methods, Proc. 8th SIAM Conference on Applied Linear Algebra (LA03) (2003).

[4] Bailey, D., Barrio, R. and Borwein, J.: High-precision computation: Mathematical physics and dynamics, Applied Mathematics and Computation, Vol. 218, No. 20, pp. 10106–10121 (2012).

[5] IEEE Computer Society: IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008, pp. 1–58 (2008).

[6] Intel Corporation: Intel Fortran Compilers, http://software.intel.com/en-us/fortran-compilers.

[7] GNU Project: GCC, the GNU Compiler Collection, http://gcc.gnu.org/fortran/.

[8] Granlund, T. and the GMP development team: GMP: GNU Multiple Precision Arithmetic Library, http://gmplib.org/.

[9] Dekker, T.: A Floating-Point Technique for Extending the Available Precision, Numerische Mathematik, Vol. 18, pp. 224–242 (1971).

[10] Bailey, D.: DDFUN90 (Fortran-90 double-double package), http://crd.lbl.gov/˜dhbailey/mpdist/.

[11] Briggs, K.: doubledouble, http://keithbriggs.info/doubledouble.html (1998).


[12] Bailey, D.: QD (C++/Fortran-90 double-double and quad-double package), http://crd.lbl.gov/˜dhbailey/mpdist/.

[13] Advanced Micro Devices, Inc.: Accelerated Parallel Processing (APP) SDK, http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/.

[14] NVIDIA Corporation: CUDA Zone, https://developer.nvidia.com/category/zone/cuda-zone.

[15] Khronos Group.: OpenCL - The open standard for parallel programming of heterogeneous systems, http://www.khronos.org/opencl/.

[16] Cray Inc.: Cray XK7 online product brochure, http://www.cray.com/Assets/PDF/products/xk/CrayXK7Brochure.pdf.

[17] NVIDIA Corporation: CUDA C PROGRAMMING GUIDE, http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (2013).

[18] NVIDIA Corporation: Whitepaper NVIDIA's Next Generation CUDA Compute Architecture: Fermi, http://www.nvidia.com/object/IO_86775.html (2009).

[19] NVIDIA Corporation: Whitepaper NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf (2012).

[20] Lawson, C., Hanson, R., Kincaid, D. and Krogh, F.: Basic Linear Algebra Subprograms for Fortran Usage, ACM Trans. Math. Softw., Vol. 5, No. 3, pp. 308–323 (1979).

[21] Davis, T. and Hu, Y.: The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/.

[22] NVIDIA Corporation: cuSPARSE Library (included in CUDA Toolkit), https://developer.nvidia.com/cusparse.

[23] Lu, M., He, B. and Luo, Q.: Supporting Extended Precision on Graphics Processors, Proc. 6th International Workshop on Data Management on New Hardware (DaMoN 2010), pp. 19–26 (2010).


[24] Dou, Y., Lei, Y., Wu, G., Guo, S., Zhou, J. and Shen, L.: FPGA Accelerating Double/Quad-Double High Precision Floating-Point Applications for ExaScale Computing, Proc. 24th ACM International Conference on Supercomputing (ICS ’10), pp. 325–336.

[25] Graça, G. D. and Defour, D.: Implementation of float-float operators on graphics hardware, Proc. 7th Conference on Real Numbers and Computers (RNC7), pp. 23–32.

[26] Thall, A.: Extended-Precision Floating-Point Numbers for GPU Computation, ACM SIGGRAPH 2006 Research Posters (2006).

[27] Knuth, D.: The Art of Computer Programming Vol. 2: Seminumerical Algorithms, Addison-Wesley, 3rd edition (1998).

[28] Karp, A. and Markstein, P.: High-Precision Division and Square Root, ACM Trans. Math. Softw., Vol. 23, pp. 561–589 (1997).

[29] Nievergelt, Y.: Scalar Fused Multiply-Add Instructions Produce Floating-Point Matrix Arithmetic Provably Accurate to the Penultimate Digit, ACM Trans. Math. Softw., Vol. 29, pp. 27–48 (2003).

[30] Ogita, T., Rump, S. and Oishi, S.: Accurate Sum and Dot Product, SIAM Journal on Scientific Computing, Vol. 26, pp. 1955–1988 (2005).

[31] Ikebe, Y.: Note on Triple-Precision Floating-Point Arithmetic with 132-bit Numbers, Communications of the ACM, Vol. 8, pp. 175–177 (1965).

[32] Ozawa, K.: A Control Method of Errors in Long-Term Integration, IPSJ SIG Technical Report, Vol. 2012-HPC-134, No. 4, pp. 1–8 (2012). (in Japanese).

[33] Tan, G., Li, L., Triechle, S., Phillips, E., Bao, Y. and Sun, N.: Fast Implementation of DGEMM on Fermi GPU, Proc. 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, No. 35, pp. 1–11 (2011).

[34] NVIDIA Corporation: CUBLAS Library (included in CUDA Toolkit), https://developer.nvidia.com/cublas.

[35] Xianyi, Z., Qian, W. and Yunquan, Z.: Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor, Proc. IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS 2012), pp. 684–691 (2012).


[36] NVIDIA Corporation: CUDA COMPILER DRIVER NVCC, http://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf (2013).

[37] Nath, R., Tomov, S. and Dongarra, J.: An Improved MAGMA GEMM for Fermi GPUs, University of Tennessee Computer Science Technical Report, No. UT-CS-10-655 (2010).

[38] Nakasato, N.: A Fast GEMM Implementation On the Cypress GPU, ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10), Vol. 38, pp. 50–55 (2011).

[39] Nakata, M., Takao, Y., Noda, S. and Himeno, R.: A Fast Implementation of Matrix-matrix Product in Double-double Precision on NVIDIA C2050 and Application to Semidefinite Programming, Proc. 3rd International Conference on Networking and Computing (ICNC 2012), pp. 68–75 (2012).

[40] Nakata, M.: The MPACK; Multiple precision arithmetic BLAS (MBLAS) and LAPACK (MLAPACK), http://mplapack.sourceforge.net/ (2010).

[41] Hanrot, G., Lefèvre, V., Pélissier, P., Théveny, P. and Zimmermann, P.: MPFR: GNU MPFR Library, http://www.mpfr.org/.

[42] Li, X., Demmel, J., Bailey, D., Hida, Y., Iskandar, J., Kapur, A., Martin, M., Thompson, B., Tung, T. and Yoo, D.: XBLAS – Extra Precise Basic Linear Algebra Subroutines, http://www.netlib.org/xblas/.

[43] Dongarra, J., Beckman, P., Aerts, P., Cappello, F., Lippert, T., Matsuoka, S., Messina, P., Moore, T., Stevens, R., Trefethen, A. and Valero, M.: The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, Vol. 23, pp. 309–322 (2009).

[44] Bell, N. and Garland, M.: Efficient Sparse Matrix-Vector Multiplication on CUDA, NVIDIA Technical Report, No. NVR-2008-004 (2008).

[45] Xu, W., Zhang, H., Jiao, S., Wang, D., Song, F. and Liu, Z.: Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU, Proc. 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2012), pp. 231–235 (2012).


[46] Feng, X., Jin, H., Zheng, R., Hu, K., Zeng, J. and Shao, Z.: Optimization of Sparse Matrix-Vector Multiplication with Variant CSR on GPUs, Proc. IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS 2011), pp. 165–172 (2011).

[47] Matam, K. and Kothapalli, K.: Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU, Proc. International Conference on Parallel Processing (ICPP 2011), pp. 612–621 (2011).

[48] Kubota, Y. and Takahashi, D.: Optimization of Sparse Matrix-Vector Multiplication by Auto Selecting Storage Schemes on GPU, Proc. 11th International Conference on Computational Science and Its Applications (ICCSA 2011), Part II, Lecture Notes in Computer Science, No. 6783, pp. 547–561 (2011).

[49] Baskaran, M. and Bordawekar, R.: Optimizing Sparse Matrix-Vector Multiplication on GPUs, IBM Research Report, No. RC24704 (2009).

[50] Guo, P. and Wang, L.: Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs, Proc. International Conference on Computational and Information Sciences (ICCIS 2010), pp. 1154–1157 (2010).

[51] Zein, A. E. and Rendell, A.: Generating Optimal CUDA Sparse Matrix Vector Product Implementations for Evolving GPU Hardware, Concurrency and Computation: Practice and Experience, Vol. 24, pp. 3–13 (2012).

[52] Reguly, I. and Giles, M.: Efficient sparse matrix-vector multiplication on cache-based GPUs, Proc. Innovative Parallel Computing: Foundations and Applications of GPU, Manycore, and Heterogeneous Systems (InPar 2012), pp. 1–12 (2012).

[53] Yoshizawa, H. and Takahashi, D.: Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS Format on GPUs, Proc. 15th IEEE International Conference on Computational Science and Engineering (CSE 2012), pp. 130–136 (2012).

[54] Davis, J. and Chung, E.: SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place, Microsoft Technical Report, No. MSR-TR-2012-95 (2012).

[55] Barrett, R., Berry, M., Chan, T., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C. and van der Vorst, H.: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition, SIAM (1994).

[56] Baboulin, M., Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Langou, J., Luszczek, P. and Tomov, S.: Accelerating scientific computations with mixed precision algorithms, Computer Physics Communications, Vol. 180, No. 12, pp. 2526–2533 (2009).

[57] Renard, Y.: Gmm++ user documentation Release 4.2, http://download.gna.org/getfem/doc/gmm_userdoc.pdf (2013).

[58] The Scalable Software Infrastructure Project: Lis User Guide Version 1.4.15, http://www.ssisc.org/lis/lis-ug-en.pdf (2013).

[59] Kotakemori, H., Fujii, A., Hasegawa, H. and Nishida, A.: Implementation of Fast Quad Precision Operation and Acceleration with SSE2 for Iterative Solver Library, IPSJ Transactions on Advanced Computing Systems, Vol. 1, No. 1, pp. 73–84 (2008). (in Japanese).

[60] Furuichi, M., May, D. and Tackley, P.: Development of a Stokes flow solver robust to large viscosity jumps using a Schur complement approach with mixed precision arithmetic, Journal of Computational Physics, Vol. 230, No. 24, pp. 8835–8851 (2011).

[61] Saito, T., Ishiwata, E. and Hasegawa, H.: Analysis of the GCR method with mixed precision arithmetic using QuPAT, Journal of Computational Science, Vol. 3, No. 3, pp. 87–91 (2012).

[62] Gravvanis, G., Filelis-Papadopoulos, C. and Giannoutakis, K.: Solving finite difference linear systems on GPUs: CUDA based Parallel Explicit Preconditioned Biconjugate Conjugate Gradient type Methods, The Journal of Supercomputing, Vol. 61, No. 3, pp. 590–604 (2012).

[63] Naumov, M.: Incomplete-LU and Cholesky Preconditioned Iterative Methods Using CUSPARSE and CUBLAS, https://developer.nvidia.com/content/incomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublas (2011).

[64] Lei, Y., Dou, Y., Shen, L., Zhou, J. and Guo, S.: Special-purposed VLIW architecture for IEEE-754 quadruple precision elementary functions on FPGA, Proc. IEEE 29th International Conference on Computer Design (ICCD 2011), pp. 219–225 (2011).


[65] Nagai, T., Yoshida, H., Kuroda, H. and Kanada, Y.: Fast Quadruple Precision Arithmetic for Multiply/Add Operations on SR11000/J2, Proc. International Conference on Scientific Computing (CSC 2007), pp. 151–157 (2007).

[66] Nagai, T., Yoshida, H., Kuroda, H. and Kanada, Y.: Fast Quadruple Precision Arithmetic Library on Parallel Computer SR11000/J2, Proc. 8th International Conference on Computational Science, Part I, ICCS ’08, pp. 446–455 (2008).

[67] Tsutsui, N., Yoshida, H., Kuroda, H. and Kanada, Y.: Implementation and Evaluation of FFT Using Improved Quadruple Precision Arithmetic on SR11000/J2, IPSJ SIG Technical Report, Vol. 2008-HPC-116, No. 11, pp. 61–66 (2008). (in Japanese).

[68] Goto, K.: GotoBLAS2, http://www.tacc.utexas.edu/tacc-projects/gotoblas2/.

[69] Whaley, R. C. and Petitet, A.: Minimizing development and maintenance costs in supporting persistently optimized BLAS, Software: Practice and Experience, Vol. 35, No. 2, pp. 101–121 (2005).

[70] Kurzak, J., Tomov, S. and Dongarra, J.: Autotuning GEMM Kernels for the Fermi GPU, IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 11, pp. 2045–2057 (2012).

Appendix A

List of Publications

Journal Papers (in Japanese)

1. Daichi Mukunoki and Daisuke Takahashi: Implementation and Performance Evaluation of Triple- and Quadruple-Precision Floating-Point Arithmetic on GPUs, IPSJ Transactions on Advanced Computing Systems (ACS41), Vol. 6, No. 1, pp. 66-77, January 2013. (in Japanese)

Conference Papers

1. Daichi Mukunoki and Daisuke Takahashi: Implementation and Evaluation of Quadruple Precision BLAS Functions on GPUs, Proc. 10th International Conference on Applied Parallel and Scientific Computing (PARA 2010), Part I, Lecture Notes in Computer Science, No. 7133, pp. 249-259, March 2012.

2. Daichi Mukunoki and Daisuke Takahashi: Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs, Proc. 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW 2012), The 13th Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-12), pp. 1378-1386, May 2012.

3. Daichi Mukunoki and Daisuke Takahashi: Optimization of Sparse Matrix-vector Multiplication for CRS Format on NVIDIA Kepler Architecture GPUs, Proc. The 13th International Conference on Computational Science and Its Applications (ICCSA 2013), Lecture Notes in Computer Science, No. 7975, pp. 211-223, June 2013.


Other Publications (in Japanese)

1. Daichi Mukunoki and Daisuke Takahashi: Acceleration of Krylov Subspace Methods Using Quadruple-Precision Floating-Point Arithmetic on GPUs, IPSJ SIG Technical Report, 2013-HPC-140, No. 25, August 2013. (in Japanese)

2. Daichi Mukunoki and Daisuke Takahashi: Implementation of Fast Sparse Matrix-Vector Multiplication in CRS Format on GPUs, IPSJ SIG Technical Report, 2013-HPC-138, No. 5, February 2013. (in Japanese)

3. Daichi Mukunoki and Daisuke Takahashi: Implementation and Evaluation of Sparse Matrix Iterative Solvers Using Quadruple-Precision Arithmetic on GPUs, IPSJ SIG Technical Report, 2012-ARC-202, 2012-HPC-137, No. 37, December 2012. (in Japanese)

4. Daichi Mukunoki and Daisuke Takahashi: A Study of Triple-Precision Floating-Point Arithmetic on GPUs, IPSJ SIG Technical Report, 2011-ARC-197, 2011-HPC-132, No. 23, November 2011. (in Japanese)

5. Daichi Mukunoki and Daisuke Takahashi: Implementation and Evaluation of Quadruple- and Octuple-Precision BLAS on GPUs, Proc. 2011 Symposium on High Performance Computing and Computational Science (HPCS2011), pp. 148-156, January 2011. (in Japanese)

6. Daichi Mukunoki and Daisuke Takahashi: Implementation and Evaluation of Quadruple-Precision BLAS on GPUs, Proceedings of the Conference on Computational Engineering and Science, Vol. 15, No. 2, pp. 891-894, May 2010. (in Japanese)

7. Daichi Mukunoki and Daisuke Takahashi: Implementation and Evaluation of Quadruple-Precision BLAS on GPUs, IPSJ SIG Technical Report, 2009-ARC-186, 2009-HPC-123, No. 13, November 2009. (in Japanese)
