
A study on linear algebra operations using extended precision floating-point arithmetic on GPUs

Author: Daichi Mukunoki (椋木 大地)
Title (Japanese): GPUにおける拡張精度浮動小数点演算を用いた線形計算の研究
Degree-granting university: University of Tsukuba (筑波大学)
Academic year of degree conferral: 2013
Report number: 12102甲第6722号
URL: http://hdl.handle.net/2241/00122774

A study on linear algebra operations using extended precision floating-point arithmetic on GPUs
Graduate School of Systems and Information Engineering
University of Tsukuba
November 2013
Daichi Mukunoki

Contents

1 Introduction
  1.1 Precision of Floating-Point Arithmetic
    1.1.1 IEEE Standard for Floating-Point Arithmetic
    1.1.2 Extended Precision Floating-Point Arithmetic
  1.2 GPU Computing
    1.2.1 Performance Characteristics
    1.2.2 CUDA
  1.3 Overview of the Thesis
    1.3.1 Scope of the Study
    1.3.2 Contributions
    1.3.3 Organization of the Thesis
2 Triple- and Quadruple-Precision BLAS Subroutines on GPUs
  2.1 Introduction
  2.2 Quadruple-Precision Floating-Point Operations
    2.2.1 DD-type Quadruple-Precision Floating-Point Format
    2.2.2 DD-type Quadruple-Precision Floating-Point Arithmetic
  2.3 Triple-Precision Floating-Point Operations
    2.3.1 D+S-type Triple-Precision Floating-Point Format
    2.3.2 D+S-type Triple-Precision Floating-Point Arithmetic
    2.3.3 Computation of D+S-type Values using DD Arithmetic
    2.3.4 D+I-type Triple-Precision Floating-Point Format
  2.4 Implementation of Triple- and Quadruple-Precision BLAS Subroutines on GPUs
    2.4.1 DD Arithmetic Functions
    2.4.2 Implementation of BLAS Kernels
    2.4.3 Data Structures of Triple- and Quadruple-Precision Value Arrays in Global Memory
  2.5 Performance Prediction
    2.5.1 Theoretical Peak Performance on DD Arithmetic
    2.5.2 Performance Prediction using the Bytes/Flop and Bytes/DDFlop Ratios
  2.6 Performance Evaluation
    2.6.1 Evaluation Methods
    2.6.2 Performance Comparison of Double-, Triple- and Quadruple-Precision BLAS Subroutines
    2.6.3 AoS Layout vs. SoA Layout
    2.6.4 D+S Arithmetic vs. DD Arithmetic
    2.6.5 Accuracy Evaluation
  2.7 Related Work
  2.8 Conclusion
3 Optimization of Sparse Matrix-Vector Multiplication on NVIDIA Kepler Architecture GPUs
  3.1 Introduction
  3.2 Related Work
  3.3 Kepler Architecture GPUs
  3.4 Implementation
    3.4.1 48KB Read-only Data Cache
    3.4.2 Avoiding the Outermost Loop
    3.4.3 Shuffle Instruction
  3.5 Performance Evaluation
    3.5.1 Evaluation Methods
    3.5.2 Results
  3.6 Conclusion
4 Krylov Subspace Methods using Quadruple-Precision Arithmetic on GPUs
  4.1 Introduction
  4.2 Related Work
  4.3 Implementation
    4.3.1 Overview of Quadruple-Precision Versions
    4.3.2 Implementation of CG and BiCGStab Methods on GPUs
  4.4 Experimental Results
    4.4.1 Unpreconditioned Methods
    4.4.2 Cases with Preconditioning
  4.5 Conclusion
5 Conclusion
  5.1 Summary
  5.2 Future Work
Bibliography
A List of Publications

Listings

2.1 Conversion from DD-type to D+I-type
2.2 Conversion from D+I-type to DD-type
2.3 Implementation of QuadMul
2.4 Triple-precision AXPY using D+I-type format
3.1 Host code of SpMV
3.2 Kernel code of SpMV for the Fermi architecture
3.3 Kernel code of SpMV for the Kepler architecture

List of Figures

1.1 GPU architecture (NVIDIA Fermi architecture)
1.2 CUDA device and thread hierarchy
2.1 Quadruple-precision floating-point format on DD arithmetic
2.2 Concept of DD arithmetic
2.3 D+S-type triple-precision floating-point format
2.4 Triple-precision operations using DD arithmetic on GPUs
2.5 D+I-type triple-precision floating-point format
2.6 GEMV kernel
2.7 GEMM kernel
2.8 Array-of-Structures (AoS) and Structure-of-Arrays (SoA) layouts for D+S-type values
2.9 Performance of triple- and quadruple-precision AXPY (DDFlops: DD-type floating-point operations per second)
2.10 Relative execution time of AXPY (value is the multiple of the execution time of the double-precision subroutine)
2.11 Performance of triple- and quadruple-precision GEMV (DDFlops: DD-type floating-point operations per second)
2.12 Relative execution time of GEMV (value is the multiple of the execution time of the double-precision subroutine)
2.13 Performance of triple- and quadruple-precision GEMM (DDFlops: DD-type floating-point operations per second)
2.14 Relative execution time of GEMM (value is the multiple of the execution time of the double-precision subroutine)
2.15 Performance comparison of AoS (Array-of-Structures) and SoA (Structure-of-Arrays) layouts on D+S-type triple-precision AXPY
2.16 Performance comparison of D+S and DD arithmetic on GEMM
3.1 CRS format
3.2 Thread mapping for the cases of NT = 1, 2 and 4 on the CRS-vector method
3.3 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. A (only using the 48KB read-only data cache) over Ver. Fermi
3.4 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. B (only avoiding the outermost loop) over Ver. Fermi
3.5 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. C (only using the shuffle instruction) over Ver. Fermi
3.6 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler over Ver. Fermi
3.7 Flops performance for the cases with the 4 largest and 4 smallest speedups of Ver. Kepler over cuSPARSE's subroutine
4.1 Preconditioned CG method
4.2 Preconditioned BiCGStab method
4.3 Accelerating iterative methods using high-precision arithmetic

List of Tables

1.1 Binary floating-point basic formats defined in IEEE 754-2008
2.1 Number of double-precision floating-point instructions and Flops for quadruple-precision multiply-add
2.2 Floating-point instruction and cycle counts for D+S and DD arithmetic on Tesla M2050
2.3 Bytes/Flop and Bytes/DDFlop for BLAS subroutines
2.4 Evaluation environment
2.5 Relative error to octuple-precision result (input: uniform random numbers in the range of 0 to 1 in double precision)
3.1 Evaluation environment
3.2 Properties of the matrices shown in Figure 3.3
3.3 Properties of the matrices shown in Figure 3.4
3.4 Properties of the matrices shown in Figure 3.5
3.5 Properties of the matrices shown in Figure 3.6
3.6 Properties of the matrices shown in Figure 3.7
4.1 Evaluation environment
4.2 The number of problems which could be solved using quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 <= eps and satisfying the true relative residual ||b - Ax||_2/||b||_2 <= eps
4.3 Cases which can be solved using both quadruple-precision (QP) and double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 <= 10^-8 and satisfying the true relative residual (TRR) ||b - Ax||_2/||b||_2 <= 10^-8, but where the QP versions are faster
4.4 Cases which can be solved using quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 <= 10^-8 and satisfying the true relative residual (TRR) ||b - Ax||_2/||b||_2 <= 10^-8, but which do not converge when using double-precision (DP) versions within the maximum number of iterations of 30,000
4.5 Cases which can be solved using quadruple-precision (QP) versions and which converge with double-precision (DP) versions under stopping criterion ||r||_2/||r_0||_2 <= 10^-8, but which do not satisfy the true relative residual (TRR) ||b - Ax||_2/||b||_2 <= 10^-8 when using the DP versions
4.6 The 2 largest and 2 smallest relative execution times per iteration (QP/DP ratio)
4.7 The number of problems which can be solved using preconditioned quadruple-precision (QP) versions with stopping criterion ||r||_2/||r_0||_2 <= eps and satisfying the true relative residual (TRR) ||b - Ax||_2/||b||_2 <= eps
4.8 Cases with preconditioning which can be solved using both quadruple-precision (QP) and double-precision (DP) versions with stopping criterion ||r||_2/||r_0||_2 <= 10^-8 and satisfying the true relative residual (TRR) ||b - Ax||_2/||b||_2 <= 10^-8, but where the QP versions are faster
4.9 Cases with preconditioning which can be solved using quadruple-precision (QP) versions and which converge with double-precision (DP) versions under stopping criterion ||r||_2/||r_0||_2 <= 10^-8, but which do not satisfy the true relative residual (TRR) ||b - Ax||_2/||b||_2 <= 10^-8 when using the DP versions

Abstract

Most computational tasks in scientific and engineering calculations are linear algebraic operations on vectors and matrices, such as solving systems of linear equations. On computers, real numbers are represented as floating-point numbers and computed with floating-point arithmetic. Floating-point operations incur rounding errors, and these errors may become a critical issue for some applications.
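The rounding error the abstract refers to can be observed, and recovered, with a few lines of code. The sketch below is illustrative only (it is not code from the thesis): it shows a rounding error in standard double precision, and Knuth's TwoSum error-free transformation, which captures the error term exactly. Transformations of this kind are the building blocks of double-double (DD) extended-precision arithmetic, where a value is carried as an unevaluated sum of a high and a low double.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) where s = fl(a + b)
    and a + b = s + e holds exactly in binary64."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

# Rounding error: fl(0.1) + fl(0.2) cannot be represented
# exactly in binary64, so the sum is rounded.
print(0.1 + 0.2 == 0.3)   # False

# TwoSum recovers the part that rounding discarded.
s, e = two_sum(0.1, 0.2)
print(s == 0.1 + 0.2)      # True: s is the rounded sum
print(e != 0.0)            # True: e is the low-order error term
```

A DD-type quadruple-precision value keeps such an (s, e) pair and propagates the error terms through every operation instead of discarding them, roughly doubling the significand width at the cost of several double-precision instructions per operation.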