Accelerating Restarted GMRES with Mixed Precision Arithmetic
Neil Lindquist, Piotr Luszczek, and Jack Dongarra, Fellow, IEEE

Abstract—The generalized minimum residual method (GMRES) is a commonly used iterative Krylov solver for sparse, non-symmetric systems of linear equations. Like other iterative solvers, data movement dominates its run time. To improve this performance, we propose running GMRES in reduced precision with key operations remaining in full precision. Additionally, we provide theoretical results linking the convergence of finite precision GMRES with classical Gram-Schmidt with reorthogonalization and its infinite precision counterpart, which helps justify the convergence of this method to double-precision accuracy. We tested the mixed-precision approach with a variety of matrices and preconditioners on a GPU-accelerated node. Excluding the ILU(0) preconditioner, we achieved average speedups ranging from 8% to 61% relative to comparable double-precision implementations, with the simpler preconditioners achieving the higher speedups.

Index Terms—Linear Systems, Multiple precision arithmetic

• Neil Lindquist, Piotr Luszczek, and Jack Dongarra are with the Innovative Computing Laboratory, University of Tennessee, Knoxville, TN 37996. E-mail: {nlindqu1,luszczek,[email protected]}
• Jack Dongarra is also with Oak Ridge National Laboratory, Oak Ridge, TN 37831 and the University of Manchester, Manchester M13 9PL, United Kingdom.

1 INTRODUCTION

The generalized minimum residual method (GMRES) is an iterative solver for sparse, non-symmetric systems of linear equations [1]. Like most iterative solvers, GMRES consists mostly of matrix-vector and vector-vector operations, resulting in a low computational intensity. Hence, reducing data movement is necessary to improve performance. Towards this end, we present an implementation of GMRES using a mix of single and double precision that is designed to achieve the same final accuracy as double-precision GMRES while reducing data movement.

In short, GMRES constructs a basis for the Krylov subspace using Arnoldi's procedure [2], then finds the solution vector from that subspace that minimizes the 2-norm of the resulting residual. One particularly useful modification to GMRES is restarting, which is used to limit the memory usage and computational requirements of the growing Krylov basis [3]. We focus on two schemes for orthogonalizing the Krylov basis in the Arnoldi procedure: modified Gram-Schmidt (MGS) and classical Gram-Schmidt with reorthogonalization (CGSR). The former is often used due to its lower computational and data-access costs [4], while the latter better retains orthogonality and can be implemented using matrix-vector products [5]. For CGSR, we always reorthogonalize once, which is sufficient to provide numerical stability [6]. Additionally, we focus on a left-preconditioned version of GMRES but expect the results to hold for right-preconditioned versions too. Algorithm 1 shows the formulation of GMRES we used.

Based on our previous work, we focus on a specific mixed-precision strategy that has been successful regarding both accuracy and CPU performance [7]. This approach uses single precision everywhere except to compute the residual and update the solution. Specifically, it uses double precision for lines 3 and 25 in Alg. 1 and single precision for everything else (i.e., computation done in double precision is labeled u_high and computation done in single precision is labeled u_low). This requires a copy of A in single precision and two vector type conversions per restart (z_k and u_k), but has the advantage of otherwise using existing uniform-precision kernels. Our previous work included experiments on how convergence is affected by reducing the precision of various parts of GMRES and by different restart strategies; those experiments guided the design of our mixed-precision scheme. Herein, we extend the support for this approach by providing stronger theoretical analysis of the reduced-precision inner solver and new experimental results on a GPU-accelerated node with a variety of preconditioners.

Algorithm 1 Restarted GMRES in mixed precision with left preconditioning [3]

     1: A ∈ R^{n×n}; x_0, b ∈ R^n; M^{-1} ≈ A^{-1}
     2: for k = 1, 2, ... do
     3:     z_k ← b − A x_k                                      (u_high)
     4:     if ‖z_k‖_2 is small enough, stop                     (u_high)
     5:     r_k ← M^{-1} z_k                                     (u_low)
     6:     β ← ‖r_k‖_2;  s_0 ← β                                (u_low)
     7:     v_1 ← r_k / β;  V_1 ← [v_1]                          (u_low)
     8:     j ← 0
     9:     loop until the restart condition is met
    10:         j ← j + 1
    11:         w ← M^{-1} A v_j                                 (u_low)
    12:         w, h_{1,j}, ..., h_{j,j} ← orth(w, V_j)          ▷ MGS or CGSR
    13:         h_{j+1,j} ← ‖w‖_2                                (u_low)
    14:         v_{j+1} ← w / h_{j+1,j}                          (u_low)
    15:         V_{j+1} ← [V_j, v_{j+1}]                         (u_low)
    16:         for i = 1, ..., j−1 do
    17:             [h_{i,j}; h_{i+1,j}] ← [α_i, β_i; −β_i, α_i] [h_{i,j}; h_{i+1,j}]   (u_low)
    18:         end for
    19:         [α_j; β_j] ← rotation matrix([h_{j,j}; h_{j+1,j}])                      (u_low)
    20:         [s_j; s_{j+1}] ← [α_j, β_j; −β_j, α_j] [s_j; 0]                         (u_low)
    21:         [h_{j,j}; h_{j+1,j}] ← [α_j, β_j; −β_j, α_j] [h_{j,j}; h_{j+1,j}]       (u_low)
    22:     end loop
    23:     H ← {h_{i,ℓ}}, 1 ≤ i, ℓ ≤ j;  s ← [s_1, ..., s_j]^T
    24:     u_k ← V_j H^{-1} s                                   (u_low)
    25:     x_{k+1} ← x_k + u_k                                  (u_high)
    26: end for

    27: procedure MGS(w, V_j)
    28:     [v_1, ..., v_j] ← V_j
    29:     for i = 1, 2, ..., j do
    30:         h_{i,j} ← w · v_i                                (u_low)
    31:         w ← w − h_{i,j} v_i                              (u_low)
    32:     end for
    33:     return w, h_{1,j}, ..., h_{j,j}
    34: end procedure

    35: procedure CGSR(w, V_j)
    36:     h ← V_j^T w                                          (u_low)
    37:     w ← w − V_j h                                        (u_low)
    38:     g ← V_j^T w                                          (u_low)
    39:     w ← w − V_j g                                        (u_low)
    40:     [h_{1,j}, ..., h_{j,j}]^T ← h + g                    (u_low)
    41:     return w, h_{1,j}, ..., h_{j,j}
    42: end procedure
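To make the precision assignment concrete, the following is a minimal Python sketch of the restart loop of Alg. 1, using SciPy's gmres as a stand-in for the single-precision inner solver (lines 5-24) and omitting the preconditioner for brevity. It is an illustration of the scheme, not the implementation evaluated in this paper; the function name mixed_precision_gmres, the tolerances, and the test problem are all ours. (In SciPy older than 1.12, the rtol keyword of gmres is named tol.)

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def mixed_precision_gmres(A, b, tol=1e-12, max_restarts=50, inner_iters=100):
        A_low = A.astype(np.float32)        # copy of A in single precision (u_low)
        x = np.zeros_like(b)                # x_0, kept in double precision
        for k in range(max_restarts):
            z = b - A @ x                   # line 3: residual (u_high)
            if np.linalg.norm(z) <= tol * np.linalg.norm(b):
                break                       # line 4: convergence test
            # Lines 5-24: one non-restarted inner GMRES cycle, entirely in u_low.
            u, _ = spla.gmres(A_low, z.astype(np.float32),
                              restart=inner_iters, maxiter=1, rtol=1e-6)
            x = x + u.astype(np.float64)    # line 25: solution update (u_high)
        return x

    # Illustrative use on a diagonally dominant random sparse system.
    rng = np.random.default_rng(0)
    A = (sp.random(200, 200, density=0.05, random_state=rng) + 10 * sp.eye(200)).tocsr()
    b = rng.standard_normal(200)
    x = mixed_precision_gmres(A, b)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))

Note that, as described above, only the residual and the update run in double precision; the two casts of z and u are the per-restart type conversions mentioned in the text.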
2 PREVIOUS WORK

The idea of using multiple precisions to improve performance has been applied successfully to a variety of problems, particularly in recent years [8]. Furthermore, it is a well-established method for improving the performance of solving systems of linear equations, particularly for dense linear systems [9], [10].

One approach to using multiple precisions in GMRES is to store the preconditioner in reduced precision and to do the rest of the computation in full precision [11]. Because the preconditioner is approximate by nature, reducing its floating-point precision causes only a limited reduction in quality. One interesting variation of this is to precondition a full-precision GMRES with a reduced-precision GMRES (possibly with a preconditioner of its own) [12]. This is similar to our approach, but we use iterative refinement as the outer solver instead of GMRES.

There has recently been useful theoretical work on varying the working precision as the iteration progresses in a non-restarted GMRES [13]. Notably, part of this work shows that computing the Arnoldi process in finite precision allows GMRES to converge at approximately the same rate as GMRES computed exactly, until an accuracy related to round-off error is reached. Our theoretical results in Sec. 3 are based on some of these ideas. There exists further theoretical work on reducing the accuracy of just the matrix-vector products in GMRES and other Krylov solvers as the number of inner iterations progresses [14], [15], [16]. These approaches may avoid the need to restart, unlike our work; however, they increase the implementation complexity, and achieving high accuracy may require estimating the smallest singular value.
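As an illustration of the general idea in [11] (and not of that work's actual implementation), the sketch below stores incomplete-LU factors in single precision and applies them through a LinearOperator, while GMRES itself runs in double precision. The helper name low_precision_ilu and all parameter choices are ours, and it assumes SciPy's spilu accepts single-precision input.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def low_precision_ilu(A):
        # Factor a single-precision copy of A; the factors stay in float32.
        ilu = spla.spilu(sp.csc_matrix(A, dtype=np.float32))
        def apply(z):
            # Apply M^{-1} in single precision, casting only the vectors.
            return ilu.solve(z.astype(np.float32)).astype(np.float64)
        return spla.LinearOperator(A.shape, matvec=apply, dtype=np.float64)

    rng = np.random.default_rng(1)
    A = (sp.random(300, 300, density=0.02, random_state=rng) + 20 * sp.eye(300)).tocsc()
    b = rng.standard_normal(300)
    M = low_precision_ilu(A)
    # Full-precision (double) GMRES preconditioned by single-precision factors.
    x, info = spla.gmres(A, b, M=M, restart=30, rtol=1e-10)
    print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))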
For restarted GMRES, there have been a few works using multiple precisions within the GMRES algorithm. The first is mixed-precision iterative refinement where the inner solver is a single-precision GMRES [17], [18]. This approach is similar to what we tested; however, that work tests only limited configurations of GMRES and matrices. The second approach is to store just the Krylov basis in reduced precision, which was tested with both floating- and fixed-point formats and 32 and 16 bits per value [19]. It succeeded in providing a speedup, with the 32-bit floating-point version achieving the best median speedup of 1.4 times. However, the scheme, as described, requires custom, high-performance, mixed-precision kernels, which increases the cost of implementation due to the limited availability of existing mixed-precision routines.

One final work with GMRES is the use of integer arithmetic [20]. While integer GMRES did not involve a reduction in data movement, it did show GMRES achieving a full-precision solution with limited iteration overhead when the solver uses an alternative data format. Relatedly, there has been work to use data-compression techniques in flexible GMRES, although only for the non-orthogonalized Krylov vectors [21].

Mixed-precision approaches have also been used for other iterative solvers. As with GMRES, these approaches include using a reduced-precision preconditioner [22], [23] and using a reduced-precision solver inside iterative refinement [17]. However, with Krylov methods, iterative refinement discards the subspace at each restart; hence, the strategy of "reliable updates" has been proposed, which retains information about the Krylov subspace across restarts [24], [25]. Finally, there has been some exploration of the use of alternate data representations, such as data compression, in iterative solvers [26], [27], [28].

3 NUMERICAL PROPERTIES OF MIXED PRECISION GMRES

It is important to understand the effects of reductions in precision on the accuracy of the final solutions. Note that restarted GMRES is equivalent to iterative refinement using a non-restarted GMRES as the inner solver, which provides a powerful tool for recovering accuracy. This equivalence can be seen by noting that lines 5 through 24 in Alg. 1 form a non-restarted GMRES solve of the preconditioned residual system, while lines 3 and 25 compute the residual and apply the resulting correction, i.e., one step of iterative refinement.
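To make the orthogonalization kernel of the inner solver concrete, here is a direct NumPy rendering of the CGSR procedure from Alg. 1 (lines 35-42), with one reorthogonalization pass as used throughout this paper. The single-precision check at the end is our own illustrative addition.

    import numpy as np

    def cgsr(w, V):
        # Classical Gram-Schmidt with reorthogonalization (Alg. 1, CGSR).
        h = V.T @ w        # line 36: project w onto the current basis
        w = w - V @ h      # line 37: first orthogonalization
        g = V.T @ w        # line 38: project the remaining components
        w = w - V @ g      # line 39: reorthogonalization
        return w, h + g    # line 40: combined coefficients h_{1,j}, ..., h_{j,j}

    # Quick check: the output is orthogonal to the columns of V to roughly
    # float32 round-off (~1e-7), matching the u_low labels in the algorithm.
    rng = np.random.default_rng(2)
    V, _ = np.linalg.qr(rng.standard_normal((1000, 20)).astype(np.float32))
    w = rng.standard_normal(1000).astype(np.float32)
    w_orth, h = cgsr(w, V)
    print(np.abs(V.T @ w_orth).max())

Expressing both passes as the matrix-vector products V^T w and V h, rather than a loop over columns as in MGS, is what makes CGSR attractive on hardware where such products are fast.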