Using Mixed Precision in Numerical Computations to Speedup Linear Algebra Solvers
Jack Dongarra, UTK/ORNL/U Manchester; Azzam Haidar, Nvidia; Nick Higham, U of Manchester; Stan Tomov, UTK
Slides can be found at: http://bit.ly/icerm-05-2020-dongarra (5/7/20)

Background
• My interest in mixed precision began with my dissertation…
  § Improving the Accuracy of Computed Matrix Eigenvalues
• Compute the eigenvalues and eigenvectors in low precision, then improve selected values/vectors to higher precision in O(n²) ops using the matrix decomposition
  § Extended to singular values, 1983
  § Algorithm in TOMS 710, 1992

IBM's Cell Processor - 2004
• 9 cores
  § Power PC at 3.2 GHz
  § 8 SPEs
• 204.8 Gflop/s peak! $600
  § The catch is that this is for 32-bit fl pt (single precision, SP)
  § 64-bit fl pt peak is 14.6 Gflop/s
• 14 times slower than SP: a factor of 2 because of DP and 7 because of latency issues
• The SPEs were fully IEEE-754 compliant in double precision. In single precision, they only implement round-towards-zero, denormalized numbers are flushed to zero, and NaNs are treated like normal numbers.

Mixed Precision Idea Goes Something Like This…
• Exploit 32-bit floating point as much as possible.
  § Especially for the bulk of the computation
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result
• Intuitively:
  § Compute a 32-bit result,
  § Calculate a correction to the 32-bit result using selected higher precision, and
  § Perform the update of the 32-bit result with the correction using high precision.

Leveraging Mixed Precision on the Cell Processor
Idea: use low precision to compute the expensive flops (LU, O(n³)) and then iteratively refine (O(n²)) the solution in order to achieve FP64 accuracy.
Iterative refinement for dense systems, Ax = b, can work this way:

  L U = lu(A)                          FP32 precision, O(n³)
  x = U\(L\b)                          FP32 precision, O(n²)
  r = b – Ax   (with original A)       FP64 precision, O(n²)
  WHILE || r || not small enough
    1. find a correction z to adjust x that satisfies Az = r    FP32 precision, O(n²)
    2. x = x + z                                                FP64 precision, O(n)
    3. r = b – Ax   (with original A)                           FP64 precision, O(n²)
  END

Ø Wilkinson, Moler, Stewart, & Higham provide error bounds for SP fl pt results when using DP fl pt.
Ø It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
Ø Need a copy of the original matrix to compute the residual (r), and the matrix cannot be too badly conditioned.
  – Requires extra storage, total is 1.5 times normal
  – O(n³) work is done in lower precision; O(n²) work is done in high precision
  – Problems if the matrix is ill-conditioned
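As a concrete illustration of the loop above, here is a minimal NumPy/SciPy sketch of mixed-precision iterative refinement, assuming SciPy's LU routines are available. The function name, tolerance, and iteration cap are illustrative choices, and this is only a CPU-level sketch of the idea, not the Cell-tuned DSGESV solver shown in the plots that follow.

```python
# Minimal sketch: factor in FP32 (O(n^3) work), refine with FP64 residuals (O(n^2) work).
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    A64 = np.asarray(A, dtype=np.float64)          # keep a copy of A for the residual
    b64 = np.asarray(b, dtype=np.float64)

    lu, piv = lu_factor(A64.astype(np.float32))    # LU factorization in single precision
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)

    for _ in range(max_iter):
        r = b64 - A64 @ x                          # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        # Correction z solving Az = r, using the single-precision factors.
        z = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x = x + z                                  # update in double precision
    return x

# Example on a well-conditioned (diagonally dominant) random system.
rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # residual at FP64 level
```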
IBM Cell 3.2 GHz, Ax = b
(Figure: performance in Gflop/s versus matrix size 0–4500. SGEMM on the 8 SPEs is embarrassingly parallel and approaches the SP peak of 204 Gflop/s; the DP peak is 15 Gflop/s. The SP solve of Ax = b runs near the SP peak (0.30 s), while the DP solve runs near the DP peak (3.9 s). The mixed precision DSGESV solver reaches nearly the SP Ax = b performance while delivering a DP-accurate answer in 0.47 s, an 8.3X speedup over the DP solve.)

Intriguing Potential
• Exploit lower precision as much as possible
  § Payoff in performance
    • Faster floating point
    • Less data to move
• Automatically switch between SP and DP to match the desired accuracy
  § Compute the solution in SP and then a correction to the solution in DP
• Potential for GPUs, FPGAs, special purpose processors
  § Use as little precision as you can get away with and improve the accuracy
• Linear systems and eigenvalue and optimization problems where Newton's method is used
(Diagram: step from x_i to x_{i+1}, where the correction z = x_{i+1} – x_i is obtained by solving Az = b – Ax_i.)

Machine Learning in Computational Science
Many fields are beginning to adopt machine learning to augment modeling and simulation methods:
• Climate
• Biology
• Drug Design
• Epidemiology
• Materials
• Cosmology
• High-Energy Physics

Deep Learning Needs Small Matrix Operations
Matrix multiply is the time-consuming part. Convolution layers and fully connected layers require matrix multiply. There are many GEMMs of small matrices; they are perfectly parallel and can get by with 16-bit floating point.
(Diagram: a small network with inputs X1–X3, a convolution step, and a fully connected layer producing outputs y1, y2 for classification; in this case a 3x3 GEMM.)

Nvidia Volta Peak Rates
Four performance levels for the different precisions:
• 64-bit floating point (FMA): 7.5 Tflop/s peak
• 32-bit floating point (FMA): 15 Tflop/s peak
• 16-bit floating point (FMA): 30 Tflop/s peak
• 16-bit floating point w/Tensor Core: 120 Tflop/s peak
Tensor Cores are special hardware for mixed precision matrix multiply of 4x4 matrices.

Mixed Precision
• Today there are many precisions to deal with (IEEE Standard)
• Note the number range with half precision (16-bit fl. pt.): the largest representable number is 65,504, versus O(10^38) for IEEE SP.
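To make the half-precision range limit and the Tensor Core accumulation pattern concrete, here is a small NumPy sketch. It runs on the CPU as an emulation only (it is not the CUDA Tensor Core API), and the matrix size is an illustrative choice; the hardware operates on 4x4 tiles.

```python
# Half-precision range, and FP16-input / FP32-accumulate products (Tensor Core pattern).
import numpy as np

# The largest finite FP16 value is 65,504; the next step overflows to infinity.
print(np.finfo(np.float16).max)            # 65504.0
print(np.float16(65504) * np.float16(2))   # inf (with an overflow warning)

rng = np.random.default_rng(0)
n = 256
A16 = rng.standard_normal((n, n)).astype(np.float16)
B16 = rng.standard_normal((n, n)).astype(np.float16)
C32 = np.zeros((n, n), dtype=np.float32)

# Tensor Core style: inputs stored in FP16, products accumulated in FP32.
D_tc = A16.astype(np.float32) @ B16.astype(np.float32) + C32
# For comparison: product computed and stored entirely in FP16.
D_fp16 = (A16 @ B16).astype(np.float32)

ref = A16.astype(np.float64) @ B16.astype(np.float64)
# The FP32-accumulated product usually shows the smaller error.
print(np.max(np.abs(D_tc - ref)), np.max(np.abs(D_fp16 - ref)))
```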
Leveraging Half Precision in HPC on V100
Study of the matrix-matrix multiplication (GEMM) kernel on the Nvidia V100:
• dgemm achieves about 6.4 Tflop/s (FP64)
• sgemm achieves about 14 Tflop/s (FP32, ~2X)
• hgemm achieves about 27 Tflop/s (FP16, ~4X)
• Tensor Core gemm reaches about 85 Tflop/s (FP16 TC, ~12X)
(Figure: GEMM performance in Tflop/s versus matrix size m = n from 2k to 30k for square matrices in FP64, FP32, FP16, and FP16 Tensor Core.)

Study of the rank-k update used by the LU factorization algorithm on the Nvidia V100:
• In LU factorization we need matrix multiply, but the operation is a rank-k update computing the Schur complement.
• The rank-k GEMM needed by LU does not perform as well as square GEMM, but is still OK.
(Figure: GEMM performance for square matrices versus rank-k updates with k = 256, in FP64, FP32, FP16, and FP16 Tensor Core.)

Study of the LU factorization algorithm on the Nvidia V100, with different precisions:
• LU factorization is used to solve a linear system Ax = b: factor A = LU, solve Ly = b, then solve Ux = y.
• The FP16-TC->64 (hgetrf) and FP32->64 (sgetrf) factorizations run 3~4X faster than FP64 (dgetrf).
• For the LU, half precision is used only in the GEMM; the panel and TRSM are in SP. (A sketch of this splitting appears after the summary below.)
(Figure: LU factorization performance in Tflop/s versus matrix size from 2k to 40k for the three variants.)

Use Mixed Precision algorithms:
Ø Achieve higher performance and faster time to solution (benefit from both the operations and the data movement)
Ø Reduce power consumption by decreasing the execution time: energy savings!
– Reformulate to find a correction to the solution, rather than the solution itself; Δx rather than x.
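The sketch below illustrates how half precision can be confined to the Schur-complement GEMM while the panel and triangular solve stay in higher precision. It is a minimal NumPy/SciPy emulation of a blocked right-looking LU, with no pivoting, an FP32 panel, and an FP16-input/FP32-accumulate trailing update; it is an illustration of the splitting only, not the MAGMA/cuSOLVER hgetrf used for the V100 results. In practice such a factorization is paired with a refinement loop like the one sketched earlier to recover FP64 accuracy.

```python
# Blocked right-looking LU (no pivoting): panel and TRSM in FP32,
# rank-nb Schur-complement GEMM with FP16 inputs and FP32 accumulation.
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu_mixed(A, nb=64, low=np.float16):
    A = np.array(A, dtype=np.float32)     # work on an FP32 copy, factored in place
    n = A.shape[0]
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Panel factorization (unblocked LU of the block column) in FP32.
        for j in range(k, k + kb):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + kb] -= np.outer(A[j + 1:, j], A[j, j + 1:k + kb])
        if k + kb < n:
            # Triangular solve for the block row U12, still in FP32.
            L11 = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb, dtype=np.float32)
            A[k:k + kb, k + kb:] = solve_triangular(L11, A[k:k + kb, k + kb:], lower=True)
            # Trailing update: round the GEMM inputs to low precision,
            # accumulate the product in FP32 (the Tensor Core pattern).
            L21 = A[k + kb:, k:k + kb].astype(low).astype(np.float32)
            U12 = A[k:k + kb, k + kb:].astype(low).astype(np.float32)
            A[k + kb:, k + kb:] -= L21 @ U12
    return A   # unit-lower L and U packed in place

# Quick check on a diagonally dominant matrix (no pivoting, so keep it well conditioned).
rng = np.random.default_rng(0)
n = 512
A = rng.standard_normal((n, n)).astype(np.float32) + n * np.eye(n, dtype=np.float32)
F = blocked_lu_mixed(A)
L = np.tril(F, -1) + np.eye(n, dtype=np.float32)
U = np.triu(F)
print(np.linalg.norm(L @ U - A) / np.linalg.norm(A))   # roughly FP16-level backward error
```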