Using Mixed Precision in Numerical Computations to Speed Up Linear Algebra Solvers

Jack Dongarra, UTK/ORNL/U Manchester; Azzam Haidar, Nvidia; Nick Higham, U of Manchester; Stan Tomov, UTK

Slides can be found: http://bit.ly/icerm-05-2020-dongarra (5/7/20)

Background

• My interest in mixed precision began with my dissertation …
  § Improving the Accuracy of Computed Matrix Eigenvalues
  • Compute the eigenvalues and eigenvectors in low precision, then improve selected values/vectors to higher precision for O(n²) ops using the …

  § Extended to singular values, 1983
  § Algorithm in TOMS 710, 1992

IBM’s Cell Processor - 2004
• 9 cores
  § PowerPC at 3.2 GHz
  § 8 SPEs

• 204.8 Gflop/s peak! ($600)
  § The catch is that this is for 32-bit fl pt (single precision, SP)
  § 64-bit fl pt peak is 14.6 Gflop/s
  • 14 times slower than SP: a factor of 2 because of DP and 7 because of latency issues

The SPEs were fully IEEE-754 compliant in double precision. In single precision, they implement only round-towards-zero, denormalized numbers are flushed to zero, and NaNs are treated like normal numbers.

Mixed Precision Idea Goes Something Like This…

• Exploit 32-bit floating point as much as possible
  § Especially for the bulk of the computation
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result
• Intuitively:
  § Compute a 32-bit result,
  § Calculate a correction to the 32-bit result using selected higher precision, and
  § Perform the update of the 32-bit result with the correction using high precision.

Leveraging Mixed Precision on the Cell Processor

Idea: use low precision to compute the expensive flops (LU, O(n³)) and then iteratively refine (O(n²)) the solution in order to achieve FP64 accuracy.

Iterative refinement for dense systems, Ax = b, can work this way:

  LU = lu(A)                        FP32 precision   O(n³)
  x = U\(L\b)                       FP32 precision   O(n²)
  r = b − Ax  (with original A)     FP64 precision   O(n²)

  WHILE ||r|| not small enough
    1. find a correction z to adjust x that satisfies Az = r   FP32 precision   O(n²)
    2. x = x + z                                               FP64 precision   O(n)
    3. r = b − Ax  (with original A)                           FP64 precision   O(n²)
  END
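As a concrete illustration, here is a minimal NumPy/SciPy sketch of the scheme above (factorization in FP32, residual and update in FP64). The function name and tolerance are illustrative, not from a particular library.

```python
# Minimal sketch: FP32 LU + FP64 iterative refinement for Ax = b.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=50):
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)

    # Factor once in single precision (the expensive O(n^3) step).
    lu32, piv = lu_factor(A64.astype(np.float32))

    # Initial solve in single precision, then promote to double.
    x = lu_solve((lu32, piv), b64.astype(np.float32)).astype(np.float64)

    for _ in range(max_iter):
        r = b64 - A64 @ x                       # residual in FP64, uses the original A
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        # Correction z with Az ~= r, computed with the FP32 factors.
        z = lu_solve((lu32, piv), r.astype(np.float32)).astype(np.float64)
        x = x + z                               # update in FP64
    return x

# Example: a random, well-conditioned (diagonally dominant) system.
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # relative residual near FP64 level
```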

Ø Wilkinson, Moler, Stewart, & Higham provide error bounds for SP fl pt results when using DP fl pt.
Ø It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
Ø Need a copy of the original matrix to compute the residual (r), and the matrix cannot be too badly conditioned.

• Requires extra storage: total is 1.5 times normal
• O(n³) work is done in lower precision; O(n²) work is done in high precision
• Problems if the matrix is ill-conditioned

IBM Cell 3.2 GHz, Ax = b

[Figure: performance (Gflop/s) vs. matrix size (500–4500) — SGEMM (embarrassingly parallel) approaches the SP theoretical peak of 204 Gflop/s, far above the DP theoretical peak of 15 Gflop/s.]

IBM Cell 3.2 GHz, Ax = b

[Figure: same axes, adding the SP Ax=b solve (0.30 s) near the SP peak of 204 Gflop/s and the DP Ax=b solve (3.9 s) near the DP peak of 15 Gflop/s.]

IBM Cell 3.2 GHz, Ax = b

[Figure: same axes, adding the DSGESV mixed precision solver at 0.47 s — an 8.3X speedup over the 3.9 s DP Ax=b solve.]

Intriguing Potential
• Exploit lower precision as much as possible
  § Payoff in performance: faster floating point, less data to move
• Automatically switch between SP and DP to match the desired accuracy
  § Compute the solution in SP and then a correction to the solution in DP
• Potential for GPUs, FPGAs, special purpose processors
  § Use as little precision as you can get away with and improve the accuracy
• Applies to linear systems, eigenvalue problems, and optimization problems where Newton’s method is used.

[Diagram: Newton-style iteration — the correction z = x_{i+1} − x_i is obtained by solving a system with A and the residual b − Ax_i.]

Machine Learning in Computational Science

Many fields are beginning to adopt machine learning to augment modeling and simulation methods:
• Climate • Biology • Drug Design • Epidemiology • Materials • Cosmology • High-Energy Physics

Deep Learning Needs Small Matrix Operations

Matrix multiply is the time-consuming part.

Convolution Layers and Fully Connected Layers require matrix multiply

There are many GEMMs of small matrices; they are perfectly parallel and can get by with 16-bit floating point (a toy example is sketched below).
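As a toy illustration (not from the slides), the fully connected layer in the diagram below is just a small GEMM, here done in 16-bit floating point with NumPy; the sizes and names are made up.

```python
# Toy example: a fully connected layer as a 16-bit GEMM (sizes are illustrative).
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 32, 3, 2                                  # tiny layer, as in the diagram

W16 = rng.standard_normal((n_out, n_in)).astype(np.float16)    # weights in FP16
X16 = rng.standard_normal((n_in, batch)).astype(np.float16)    # activations in FP16

Y16 = W16 @ X16                # the GEMM; deep learning performs many small, independent ones
print(Y16.dtype, Y16.shape)    # float16 (2, 32)
```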

[Diagram: a convolution step followed by a fully connected layer mapping inputs x1, x2, x3 through weights w11…w23 to outputs y1, y2 — in this case a 3×3 GEMM — feeding the classification.]

Nvidia Volta Peak Rates

• Four performance levels for the different precisions:
  • 64-bit floating point (FMA): 7.5 Tflop/s peak
  • 32-bit floating point (FMA): 15 Tflop/s peak
  • 16-bit floating point (FMA): 30 Tflop/s peak
  • 16-bit floating point w/Tensor Core: 120 Tflop/s peak

Tensor Core: special hardware for mixed precision matrix multiply of 4×4 matrices.

Mixed Precision
• Today there are many precisions to deal with (IEEE standard)

• Note the limited number range with half precision (16-bit fl. pt.): the largest FP16 number is 65,504, while the largest IEEE SP (FP32) number is O(10^38).
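A quick check of those limits in NumPy (illustrative only):

```python
import numpy as np

print(np.finfo(np.float16).max)   # 65504.0 -- largest FP16 number
print(np.finfo(np.float32).max)   # ~3.4e38 -- largest FP32 (IEEE SP) number
print(np.float16(70000.0))        # inf     -- 70,000 already overflows in FP16
```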

Leveraging Half Precision in HPC on V100

Study of the matrix-matrix multiplication (GEMM) kernel on the Nvidia V100:
• dgemm (FP64) achieves about 6.4 Tflop/s
• sgemm (FP32) achieves about 14 Tflop/s (~2X)
• hgemm (FP16) achieves about 27 Tflop/s (~4X)
• Tensor Core GEMM (FP16-TC) reaches about 85 Tflop/s (~12X)

[Figure: GEMM performance in Tflop/s vs. matrix size m = n (2k–30k) for square FP64, FP32, FP16, and FP16 Tensor Core multiplies.]

Leveraging Half Precision in HPC on V100

Study of the rank k update used by the LU factorization algorithm on Nvidia V100

• In the LU factorization the operation is not a square matrix multiply but a rank-k update computing the Schur complement.
• The rank-k GEMM needed by LU does not perform as well as the square GEMM, but is still OK.

[Figure: rank-k update (k = 256) performance vs. square GEMM in Tflop/s, for FP64, FP32, FP16, and FP16-TC, matrix sizes 2k–30k.]

Leveraging Half Precision in HPC on V100

Study of the LU factorization algorithm on the Nvidia V100.
• LU factorization is used to solve a linear system Ax = b: factor A = LU (so LUx = b), then solve Ly = b followed by Ux = y.
• For the LU, half precision is used only in the GEMM; the panel and TRSM are in SP.

[Figure: performance of the LU factorization with different precisions — FP64 dgetrf, FP32→64 sgetrf, FP16-TC→64 hgetrf — showing a ~3–4X speedup, matrix sizes 2k–40k.]

Leveraging Half Precision in HPC on V100

Use Mixed Precision algorithms:
Ø Achieve higher performance → faster time to solution (benefit from both the operations and the data movement)
Ø Reduce power consumption by decreasing the execution time → energy savings
– Reformulate to find a correction to the solution rather than the solution itself: Δx rather than x.

A. Haidar, P. Wu, S. Tomov, J. Dongarra, Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers, SC-17, ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ACM, Denver, Colorado, November 12-17, 2017.

A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers, SC-18, Dallas, IEEE.

Leveraging Half Precision in HPC on V100

Idea: use low precision to compute the expensive flops (LU, O(n³)) and then iteratively refine (O(n²)) the solution in order to achieve FP64 accuracy.

Iterative refinement for dense systems, Ax = b, can work this way:

  LU = lu(A)                        lower precision   O(n³)
  x = U\(L\b)                       lower precision   O(n²)
  r = b − Ax  (with original A)     FP64 precision    O(n²)

  WHILE ||r|| not small enough
    1. find a correction z to adjust x that satisfies Az = r; solving Az = r can be done either by:
       Ø z = U\(L\r)                               (classical iterative refinement)       lower precision   O(n²)
       Ø GMRES preconditioned by the LU factors    (iterative refinement using GMRES)     lower precision   O(n²)
    2. x = x + z                                                                          FP64 precision    O(n)
    3. r = b − Ax  (with original A)                                                      FP64 precision    O(n²)
  END

Carson and Higham showed that the inner problem can be solved with an iterative method without adversely affecting the solution.
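A hedged sketch of the GMRES-based option, using SciPy's GMRES with the low-precision LU factors as the preconditioner for the correction equation Az = r. Names and tolerances are illustrative; this is not the MAGMA implementation.

```python
# Sketch of iterative refinement where the correction Az = r is solved by GMRES
# preconditioned with the low-precision LU factors (illustrative, not MAGMA).
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir_solve(A, b, tol=1e-12, max_refine=20, low=np.float32):
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    n = A64.shape[0]

    lu_lo, piv = lu_factor(A64.astype(low))          # O(n^3) factorization in low precision

    # Preconditioner M^{-1} v = U \ (L \ v), applied with the low-precision factors.
    M = LinearOperator(
        (n, n),
        matvec=lambda v: lu_solve((lu_lo, piv), v.astype(low)).astype(np.float64),
    )

    x = M.matvec(b64)                                # initial low-precision solve
    for _ in range(max_refine):
        r = b64 - A64 @ x                            # FP64 residual with the original A
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        z, _ = gmres(A64, r, M=M, maxiter=20)        # inner GMRES solve of A z = r
        x = x + z                                    # FP64 update
    return x
```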

Ø Wilkinson, Moler, Stewart, & Higham provide error bounds for SP fl pt results when using DP fl pt.
Ø It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
Ø Need the original matrix to compute the residual (r), and the matrix cannot be too badly conditioned.

E. Carson & N. Higham, “Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions,” SIAM J. Sci. Comput., 40(2), A817–A847.

• Requires extra storage: total is 1.5 times normal
• O(n³) work is done in lower precision; O(n²) work is done in high precision
• Problems if the matrix is ill-conditioned

Leveraging Half Precision in HPC on V100

Idea: use low precision to compute the expensive flops (LU, O(n³)) and then iteratively refine (O(n²)) the solution in order to achieve FP64 accuracy.

Iterative refinement for dense systems, Ax = b, can work this way:

  LU = lu(A)                                        lower precision   O(n³)
  x = U\(L\b)                                       lower precision   O(n²)
  GMRES preconditioned by the LU to solve Ax = b    FP64 precision    O(n²)

Ø Wilkinson, Moler, Stewart, & Higham provide error bounds for SP fl pt results when using DP fl pt.
Ø It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
Ø Need the original matrix to compute the residual (r), and the matrix cannot be too badly conditioned.

• Requires extra storage: total is 1.5 times normal
• O(n³) work is done in lower precision; O(n²) work is done in high precision
• Problems if the matrix is ill-conditioned

Leveraging Half Precision in HPC on V100 — solving the linear system Ax = b

The blocked LU decomposition loops over panels (for s = 0, nb, …, N): 1. factorize the panel, 2. update the trailing matrix.
• Panel factorization performed with 32-bit fl pt — done using MAGMA on the front-end system
• TRSM (triangular solve) performed with 32-bit fl pt — done using the V100 (no Tensor Cores)
• GEMM (matrix multiply) performed with 16-bit fl pt — done on the V100 with Tensor Cores
Most of the performance comes from the GEMM using 16-bit fl pt. A toy sketch of this blocked structure follows.
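A toy sketch of that blocked structure in NumPy/SciPy: the panel and TRSM are kept in FP32, and the trailing-matrix (Schur complement) GEMM is cast to FP16 to stand in for the Tensor Core update. Pivoting is omitted for brevity, so this illustrates the data flow only and is not the MAGMA code.

```python
# Toy blocked LU (no pivoting): panel + TRSM in FP32, trailing GEMM in FP16.
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu_mixed(A, nb=64):
    A = np.asarray(A, dtype=np.float32).copy()
    n = A.shape[0]
    for s in range(0, n, nb):                     # for s = 0, nb, ..., N
        e = min(s + nb, n)
        # 1. Panel factorization in FP32 (unpivoted, for illustration only).
        P = A[s:n, s:e]                           # view into A
        for j in range(e - s):
            P[j + 1:, j] /= P[j, j]
            P[j + 1:, j + 1:] -= np.outer(P[j + 1:, j], P[j, j + 1:])
        if e < n:
            # TRSM in FP32: apply L11^{-1} to the panel's row block of U.
            A[s:e, e:n] = solve_triangular(A[s:e, s:e], A[s:e, e:n],
                                           lower=True, unit_diagonal=True)
            # 2. Trailing-matrix update (Schur complement) as an FP16 GEMM.
            L16 = A[e:n, s:e].astype(np.float16)
            U16 = A[s:e, e:n].astype(np.float16)
            A[e:n, e:n] -= (L16 @ U16).astype(np.float32)
    return A       # L (unit lower) and U packed in one array
```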

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers — ScalA17, November 12–17, 2017, Denver, CO, USA

[Figure 4: Performance in Tflop/s of the three linear solvers (dgesv, dsgesv, dhgesv) and the number of iterations required to reach the FP64 solution using either the dsgesv (FP32→64) or the dhgesv (FP16→64) solver, for different matrix sizes and matrix types; cond = 10² for these experiments. Panels: (a) diagonally dominant matrix; (b) positive eigenvalues (λ) with singular values random in [1/cond, 1] such that their logarithms are uniformly distributed; (c) positive eigenvalues with clustered singular values σ = (1, …, 1, 1/cond); (d) clustered singular values σ = (1, …, 1, 1/cond); (e) positive eigenvalues with an arithmetic distribution of the singular values; (f) arithmetic distribution of the singular values, σ_i = 1 − ((i−1)/(n−1))(1 − 1/cond).]

Tensor Core Accelerated IRS — solving linear system Ax = b: Performance Behavior
• Performance of solving Ax = b to FP64 accuracy; Flops = 2n³/(3·time), so twice the Tflop/s means twice as fast.
• Compared: FP64 dgesv (Ax = b using FP64 LU), FP32→64 dsgesv (FP32 LU plus iterative refinement to FP64 accuracy), and FP16-TC→64 dhgesv (FP16 Tensor Core LU plus iterative refinement to FP64 accuracy).
• [Figure: up to ~4X speedup over FP64 on well-behaved matrices, and ~3.5X on a harder case.]
• Results obtained using CUDA 10.2 and a GV100 GPU.

[Fig. 2 of the paper: (a) performance of the rank-k update (Xgemm) used in Xgetrf; (b) performance of the Xgetrf routine, for the different arithmetic precisions on an Nvidia V100 GPU.]

Table II — description of the test matrices (cond = 100):
  Type 1: random numbers with the diagonal modified to be dominant
  Type 2: positive eigenvalues (λ); random singular values in [1/cond, 1] such that their logarithms are uniformly distributed
  Type 3: positive eigenvalues (λ); clustered singular values σ = [1, …, 1, 1/cond]
  Type 4: clustered singular values σ = [1, …, 1, 1/cond]
  Type 5: positive eigenvalues (λ); arithmetically distributed singular values σ_i = 1 − ((i−1)/(n−1))(1 − 1/cond), i = 1..n
  Type 6: arithmetically distributed singular values σ_i = 1 − ((i−1)/(n−1))(1 − 1/cond), i = 1..n

… 63 iterations. This observation suggests the surprising effectiveness of the FP16 arithmetic, which might be robust enough to be used in HPC dense linear system solvers.

VII. EXPERIMENTAL RESULTS DISCUSSION
This section presents the performance results of our three iterative refinement methods — dhgesv-TC, dhgesv, and dsgesv — using either the IR or IRGM, and compared to the reference dgesv solver. We also depict the number of iterations required by each method to reach FP64 accuracy. The Tflop/s are computed based on the same formula (P = 2n³/(3·time)), which means performance reflects the time to solution, e.g., if a method has 2X higher performance, it is twice faster. The performance results are presented in Figure 5 and Figure 6 for the six representative types of matrices studied in Section VI. In each figure there are four performance curves that refer to the reference dgesv and the three iterative refinement algorithms dhgesv-TC, dhgesv, and dsgesv. In Figure 5a, the matrix is diagonally dominant and, as shown in Section VI, all variants require three to five iterations to converge. Thus, one can expect that the low precision iterative refinement algorithms will bring a large speedup compared to dgesv. Since the number of iterations is small, we imagine that the speedup ratio will be similar to the one observed in Figure 2b for the LU factorization. We confirm …

Figure 4a shows results on matrices where all methods behave well. Convergence was achieved in 3 iterations for FP32, 4 iterations for FP16-TC, and 7 iterations for FP16. Figure 4b has the same singular value distribution as Figure 4a but not necessarily positive eigenvalues. This type is the most difficult, and the FP16 variants using either IR or IRGM do not converge. Note that the FP16 IRGM can converge when allowing more than 2,000 iterations, but for our experiment we limited the max iterations to 400, since we have seen a large performance drop when iterations are around 200, where the iterative refinement becomes a drawback. The FP32 variants are not influenced by the matrix type and always converge in about 3–4 iterations. Surprisingly, the FP16-TC behaves very well and converges in about 18 iterations, while the FP16 did not converge.

Lesson: For all the matrices considered, the FP16-TC variant is the most robust and fastest in convergence among the FP16 arithmetic-based methods. The FP32 refinement variants emphasize a stable behavior regardless of the matrix types.

NJH: I suggest changing “stable” to “consistent”, since “stable” implies numerical stability, which is not the issue here.

TENSOR CORES ACCELERATED IRS: NUMERICAL BEHAVIOR
CFD Problem: polyflow mixing tank

[Figure: residual vs. number of iterations (0–39) for the iterative refinement solvers, with residuals dropping from ~10⁻⁵ to below 10⁻²⁰; one variant does not converge (no_cvg).]

Ø Convergence history of the iterative refinement solvers to achieve FP64 solution accuracy.

Ø Interestingly, the FP16→64 (Tensor Cores Accelerated Iterative Refinement Solver) converges to FP64 accuracy with only slightly more iterations than FP32→64, and it also outperforms both the FP32→64 and the basic FP64 solvers in terms of time to solution.

Ø Scaling helps the FP16→64 (Tensor Cores) convergence; a sketch of a simple scaling is shown below.
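The slide does not say which scaling is used; as an assumption, here is a minimal sketch of one common choice, a two-sided diagonal equilibration that keeps the entries of A well inside the FP16 range before the low-precision factorization.

```python
# Assumed example: simple two-sided diagonal scaling before rounding A to FP16.
import numpy as np

def scale_for_fp16(A, target=1.0):
    """Return row/column scalings r, c and the scaled matrix diag(r) @ A @ diag(c)."""
    A = np.asarray(A, dtype=np.float64)
    r = 1.0 / np.maximum(np.abs(A).max(axis=1), np.finfo(np.float64).tiny)
    As = A * r[:, None]                               # each row now has max magnitude 1
    c = target / np.maximum(np.abs(As).max(axis=0), np.finfo(np.float64).tiny)
    As = As * c[None, :]                              # columns scaled toward `target`
    return r, c, As

# To solve A x = b: factor the scaled matrix (diag(r) A diag(c)) in FP16,
# solve for y with right-hand side diag(r) b, and recover x = diag(c) y.
```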

Results obtained using CUDA 10.2 and a GV100 GPU.

Leveraging Half Precision in HPC on V100

Use Mixed Precision algorithms. Idea: use lower precision to compute the expensive flops (LU, O(n³)) and then iteratively refine the solution in order to achieve FP64 accuracy.

Ø Achieve higher performance → faster time to solution
Ø Reduce power consumption by decreasing the execution time → energy savings

Tensor Core Accelerated IRS — solving linear system Ax = b: Energy Efficiency

Mixed precision techniques can provide a large gain in energy efficiency. Power usage (CPU + GPU, GV100) for a matrix of size 40K:
• The FP64 algorithm achieves 5.3 Tflop/s, providing about 21 Gflops/Watt (≈1999 Joules).
• The FP32→64 algorithm achieves 10 Tflop/s, providing about 40 Gflops/Watt (≈1010 Joules).
• The FP16→64 TC algorithm using Tensor Cores achieves 22 Tflop/s, providing about 94 Gflops/Watt (≈461 Joules).

[Figure: average power (Watts) vs. time (sec) for the three solvers — FP16-TC→64 dhgesv, FP32→64 dsgesv, FP64 dgesv.]

Results obtained using CUDA 10.2 and a GV100 GPU.

Tensor Core Accelerated IRS — solving linear system Ax = b: performance on a wider range of real-life problems

TABLE IV — Performance for real-life matrices from the SuiteSparse collection and a dense matrix arising from radar design

name     Description           size    k(A)       FP64 dgesv   FP32→64 dsgesv                FP16-TC→64 dhgesv
                                                  time (s)     # iter  time (s)  speedup     # iter  time (s)  speedup
em192    radar design          26896   10^6       5.70         3       3.11      1.8328      10      2.05      2.7805
appu     NASA app benchmark    14000   10^4       0.43         2       0.27      1.5926      4       0.19      2.2632
ns3Da    3D Navier-Stokes      20414   7.6·10^3   1.12         2       0.69      1.6232      4       0.43      2.6047
nd6k     ND problem set        18000   3.5·10^2   0.81         2       0.45      1.8000      3       0.30      2.7000
nd12k    ND problem set        36000   4.3·10^2   5.36         2       2.75      1.9491      3       1.31      4.0916
Poisson  2D Poisson problem    32000   2.1·10^6   3.81         2       2.15      1.7721      10      1.13      3.3717
Vlasov   2D Vlasov problem     22000   8.3·10^3   1.65         2       0.95      1.7368      3       0.48      3.4375

… compared to working precision implementations and other architectures. One can expect that a 4× speedup will at least bring a 4× energy improvement. Indeed, in our experiments [10] we measured both the power of the CPU (package+DRAM), using the Performance Application Programming Interface (PAPI) [16], and the power of the GPU (using the NVIDIA Management Library (NVML) [20]), and we observed about a 5× energy efficiency improvement.

ACKNOWLEDGMENTS
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The work was also partially supported by Nvidia and NSF grant No. OAC-1740250. N. J. Higham was supported by Engineering and Physical Sciences Research Council grant EP/P020720/1, The MathWorks, and the Royal Society.

REFERENCES
[1] M. Arioli, J. W. Demmel, and I. S. Duff. Solving sparse linear systems with sparse backward error. SIAM J. Matrix Anal. Appl., 10(2):165–190, 1989.
[2] M. Baboulin, A. Buttari, J. Dongarra, J. Kurzak, J. Langou, J. Langou, P. Luszczek, and S. Tomov. Accelerating scientific computations with mixed precision algorithms. Computer Physics Communications, 180(12):2526–2533, 2009.
[3] A. Buttari, J. Dongarra, J. Langou, J. Langou, P. Luszczek, and J. Kurzak. Mixed precision iterative refinement techniques for the solution of dense linear systems. Int. J. High Perform. Comput. Appl., 21(4):457–466, Nov. 2007.
[4] E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.
[5] E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.
[6] T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, 2011.
[7] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[8] J. Dongarra, V. Eijkhout, and P. Łuszczek. Recursive approach in sparse matrix LU factorization. Scientific Programming, 9(1):51–60, 2001.
[9] G. H. Golub and Q. Ye. Inexact preconditioned conjugate gradient method with inner-outer iteration. SIAM Journal on Scientific Computing, 21(4):1305–1320, 2000.
[10] A. Haidar, A. Abdelfattah, M. Zounon, P. Wu, S. Pranesh, S. Tomov, and J. Dongarra. The design of fast and energy-efficient linear solvers: On the potential of half-precision arithmetic and iterative refinement techniques. In Y. Shi, H. Fu, Y. Tian, V. V. Krzhizhanovskaya, M. H. Lees, J. Dongarra, and P. M. A. Sloot, editors, Computational Science – ICCS 2018, pages 586–600. Springer International Publishing, Cham, 2018.
[11] A. Haidar, P. Wu, S. Tomov, and J. Dongarra. Investigating half precision arithmetic to accelerate dense linear system solvers. In ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, Nov. 2017. ACM.
[12] N. J. Higham. Iterative refinement enhances the stability of QR factorization methods for solving linear equations. BIT, 31:447–468, 1991.
[13] N. J. Higham. Iterative refinement for linear systems and LAPACK. IMA J. Numer. Anal., 17(4):495–509, 1997.
[14] N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002.
[15] Matrix algebra on GPU and multicore architectures (MAGMA), 2018. Available at http://icl.cs.utk.edu/magma/.
[16] H. Jagode, A. YarKhan, A. Danalis, and J. Dongarra. Power management and event verification in PAPI. In A. Knüpfer, T. Hilbrich, C. Niethammer, J. Gracia, W. E. Nagel, and M. M. Resch, editors, Tools for High Performance Computing 2015, pages 41–51, Cham, 2016. Springer International Publishing.
[17] J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.
[18] X. S. Li and J. W. Demmel. Making sparse Gaussian elimination scalable by static pivoting. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, pages 1–17, 1998.
[19] C. B. Moler. Iterative refinement in floating point. J. ACM, 14(2):316–321, 1967.
[20] NVIDIA Management Library (NVML), NVIDIA, 2018. https://developer.nvidia.com/nvidia-management-library-nvml.
[21] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. Technical Report 91-279, Department of CSE, University of Minnesota, Minneapolis, Minnesota, 1991.
[22] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. and Stat. Comput., 7(3):856–869, 1986.
[23] V. Simoncini and D. B. Szyld. Flexible inner-outer Krylov subspace methods. SIAM J. Numer. Anal., 40(6):2219–2239, 2002.
[24] R. D. Skeel. Iterative refinement implies numerical stability for Gaussian elimination. Mathematics of Computation, 35(151):817–832, 1980.
[25] G. W. Stewart. Introduction to Matrix Computations. Academic Press, 1973.
[26] S. Tomov, J. Dongarra, and M. Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing, 36(5):232–240, 2010.
[27] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra. Dense linear algebra solvers for multicore with GPU accelerators. In Proc. of the IEEE IPDPS'10, pages 1–8, Atlanta, GA, April 2010.
[28] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice-Hall, 1963.

Current #1 System Overview

System Performance
• Peak performance of 200 Pflop/s for modeling & simulation
• Peak performance of 3.3 Eflop/s for 16-bit floating point, used for data analytics, ML, and artificial intelligence

Each node has
• 2 IBM POWER9 processors, each w/22 cores (2.3% of system performance)
• 6 NVIDIA Tesla V100 GPUs, each w/80 SMs (97.7% of system performance)
• 608 GB of fast memory
• 1.6 TB of NVMe memory

The system includes
• 4608 nodes
• 27,648 GPUs (street value $8K each)
• Dual-rail Mellanox EDR InfiniBand network
• 250 PB IBM Spectrum Scale file system transferring data at 2.5 TB/s

HPL-AI (A MIXED PRECISION BENCHMARK)

The HPL-AI benchmark seeks to highlight the emerging convergence of high-performance computing (HPC) and artificial intelligence (AI) workloads and the advantages of mixed precision. The benchmark is a combination of an LU factorization (at lower precision) and an iterative refinement method (like GMRES) to bring the solution back to 64-bit accuracy.

Iterative refinement for dense systems, Ax = b, can work this way:

  LU = lu(A)                                        lower precision   O(n³)
  x = U\(L\b)                                       lower precision   O(n²)
  GMRES preconditioned by the LU to solve Ax = b    FP64 precision    O(n²)

http://bit.ly/hpl-ai

Recent Results Run at Scale…
• The mixed precision iterative refinement approach solved a matrix of order 10,091,520 on ORNL’s Summit system.
  – Composed of nodes made up of 2 IBM POWER9 processors (22 cores each) plus 6 Nvidia V100 GPUs (84 SMs each)
  – The run used 4500 nodes of Summit: 2,466,000 cores = 4500 × (22×2 + 84×6)
  – Used a random matrix with large diagonal elements to ensure convergence of the method (a sketch of such a generator follows).
• Mixed precision HPL achieved 550 PFLOPS, or 3.7X over the DP-precision HPL result on the TOP500 (148 PFLOPS).
  – 53 Gflops/Watt
• Same accuracy compared to full 64-bit precision.
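The exact HPL-AI generator is not shown here; this is a minimal sketch of one way to build a random matrix with a large diagonal so that the low-precision LU plus iterative refinement is guaranteed to converge.

```python
# Assumed example: random matrix with a dominant diagonal (illustrative only).
import numpy as np

def random_diag_dominant(n, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(-0.5, 0.5, size=(n, n))
    # Make each diagonal entry exceed the sum of the off-diagonal magnitudes in its row.
    A[np.diag_indices(n)] = np.abs(A).sum(axis=1) + 1.0
    return A
```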

Conclusion:
Ø We accelerated the solution of a linear system Ax = b using hardware-accelerated FP16 arithmetic on GPUs.
Ø We introduced a framework for exploiting mixed-precision FP16-FP32/FP64 iterative refinement solvers and described the path to high-performance, energy-aware GPU implementations.
Ø The ideas can be applied to other one-sided factorizations (LU, LL^T, LDL^T, QR) and also to two-sided reductions for eigenvalue/singular value problems (where only a few values/vectors are required).

Ø Our technique shows that a number of problems can be accelerated up to 4X using the FP16-TC, or 2X using FP32 arithmetic.

Ø We studied the energy efficiency of our approach, which showed significant savings: 5X energy savings using the FP16-TC compared to the FP64 implementation.

Ø We illustrated a technique using the V100 Tensor Cores (FP16-TC) that achieves FP64 accuracy at a highly efficient, accelerated rate of 74 Gflops/Watt and 24 Tflop/s.

Ø There is a rigorous error analysis to support everything.

Questions?

Recent manuscript: http://bit.ly/it-ref-roy-soc-2020

Slides can be found: http://bit.ly/icerm-05-2020-dongarra