
Accelerating Scientific Computations with Mixed Precision Algorithms


Computer Physics Communications 180 (2009) 2526–2533




40th Anniversary Issue

Marc Baboulin (a), Alfredo Buttari (b), Jack Dongarra (c,d,e), Jakub Kurzak (c,*), Julie Langou (c), Julien Langou (f), Piotr Luszczek (g), Stanimire Tomov (c)

(a) Department of Mathematics, University of Coimbra, Coimbra, Portugal
(b) French National Institute for Research in Computer Science and Control, Lyon, France
(c) Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA
(d) Oak Ridge National Laboratory, Oak Ridge, TN, USA
(e) University of Manchester, Manchester, United Kingdom
(f) Department of Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA
(g) MathWorks, Inc., Natick, MA, USA

Article history: Received 2 September 2008; Received in revised form 9 November 2008; Accepted 10 November 2008; Available online 13 November 2008.

PACS: 02.60.Dc

Abstract: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.

Keywords: Mixed precision; Iterative refinement

Program summary

Program title: ITER-REF
Catalogue identifier: AECO_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AECO_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 7211
No. of bytes in distributed program, including test data, etc.: 41 862
Distribution format: tar.gz
Programming language: FORTRAN 77
Computer: desktop, server
Operating system: Unix/Linux
RAM: 512 Mbytes
Classification: 4.8
External routines: BLAS (optional)
Nature of problem: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution.
Solution method: Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. A common approach to the solution of linear systems, either dense or sparse, is to perform the LU factorization of the coefficient matrix using Gaussian elimination. First, the coefficient matrix A is factored into the product of a lower triangular matrix L and an upper triangular matrix U. Partial row pivoting is in general used to improve numerical stability, resulting in a factorization PA = LU, where P is a permutation matrix. The solution for the system is achieved by first solving Ly = Pb (forward substitution) and then solving Ux = y (backward substitution). Due to round-off errors, the computed solution, x, carries a numerical error magnified by the condition number of the coefficient matrix A. In order to improve the computed solution, an iterative process can be applied, which produces a correction to the computed solution at each iteration, which then yields the method that is commonly known as the iterative refinement algorithm. Provided that the system is not too ill-conditioned, the algorithm produces a solution correct to the working precision.
Running time: seconds/minutes

Published by Elsevier B.V. doi:10.1016/j.cpc.2008.11.005

✩ This paper and its associated computer program are available via the Computer Physics Communications homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).
* Corresponding author. E-mail address: [email protected] (J. Kurzak).

1. Introduction

On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. There are two reasons for this. Firstly, 32-bit floating point arithmetic is usually twice as fast as 64-bit floating point arithmetic on most modern processors. Secondly, the amount of bytes moved through the memory system is halved. In Table 1, we provide some hardware numbers that support these claims. On the AMD Opteron 246, IBM PowerPC 970, and Intel Xeon 5100, the single precision peak is twice the double precision peak. On the STI Cell BE, the single precision peak is fourteen times the double precision peak. Not only is single precision faster than double precision on conventional processors, but this is also the case on less mainstream technologies such as Field Programmable Gate Arrays (FPGA) and Graphical Processing Units (GPU). These speedup numbers are tempting, and we would like to be able to benefit from them.

For several physics applications, results with 32-bit accuracy are not an option and one really needs 64-bit accuracy maintained throughout the computations. The obvious reason is for the application to give an accurate answer. Also, 64-bit accuracy enables most of the modern computational methods to be more stable; therefore, in critical conditions, one must use 64-bit accuracy to obtain an answer. In this manuscript, we present a methodology for performing the bulk of the operations in 32-bit arithmetic and then postprocessing the 32-bit solution by refining it into a solution that is 64-bit accurate. We present this methodology in the context of solving a system of linear equations, be it sparse or dense, symmetric positive definite or nonsymmetric, using either direct or iterative methods. We believe that the approach outlined below is quite general and should be considered by application developers for their practical problems.

Table 1
Hardware and software details of the systems used for performance experiments.

  Architecture     Clock [GHz]  Peak SP / Peak DP  Memory [MB]  BLAS       Compiler
  AMD Opteron 246  2.0          2                  2048         Goto-1.13  Intel-9.1
  IBM PowerPC 970  2.5          2                  2048         Goto-1.13  IBM-8.1
  Intel Xeon 5100  3.0          2                  4096         Goto-1.13  Intel-9.1
  STI Cell BE      3.2          14                 512          –          CellSDK-1.1

2. The idea behind mixed precision algorithms

Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. The refinement can be accomplished, for instance, by means of Newton's algorithm [1], which computes the zero of a function f(x) according to the iterative formula

  x_{n+1} = x_n - f(x_n) / f'(x_n).    (1)

In general, we would compute a starting point and f'(x) in single precision arithmetic, and the refinement process will be computed in double precision arithmetic.

If the refinement process is cheaper than the initial computation of the solution, then double precision accuracy can be achieved nearly at the same speed as the single precision accuracy. Sections 2.1 and 2.2 describe how this concept can be applied to solvers of linear systems based on direct and iterative methods, respectively.

2.1. Direct methods

A common approach to the solution of linear systems, either dense or sparse, is to perform the LU factorization of the coefficient matrix using Gaussian elimination. First, the coefficient matrix A is factored into the product of a lower triangular matrix L and an upper triangular matrix U. Partial row pivoting is in general used to improve numerical stability, resulting in a factorization PA = LU, where P is a permutation matrix. The solution for the system is achieved by first solving Ly = Pb (forward substitution) and then solving Ux = y (backward substitution). Due to round-off errors, the computed solution x carries a numerical error magnified by the condition number of the coefficient matrix A.

In order to improve the computed solution, we can apply an iterative process which produces a correction to the computed solution at each iteration, which then yields the method that is commonly known as the iterative refinement algorithm. As Demmel points out [2], the nonlinearity of the round-off errors makes the iterative refinement process equivalent to Newton's method applied to the function f(x) = b - Ax. Provided that the system is not too ill-conditioned, the algorithm produces a solution correct to the working precision. Iterative refinement in double/double precision is a fairly well understood concept and was analyzed by Wilkinson [3], Moler [4] and Stewart [5].

The algorithm can be modified to use a mixed precision approach. The factorization PA = LU and the solution of the triangular systems Ly = Pb and Ux = y are computed using single precision arithmetic. The residual calculation and the update of the solution are computed using double precision arithmetic and the original double precision coefficients (see Algorithm 1). The most computationally expensive operation, the factorization of the coefficient matrix A, is performed using single precision arithmetic and takes advantage of its higher speed. The only operations that must be executed in double precision are the residual calculation and the update of the solution (they are denoted with an εd in Algorithm 1). We observe that the only operation with computational complexity of O(n^3) is handled in single precision, while all operations performed in double precision are of at most O(n^2) complexity. The coefficient matrix A is converted to single precision for the LU factorization and the resulting factors are stored in single precision, while the initial coefficient matrix A needs to be kept in memory. Therefore, one drawback of this approach is that it uses 50% more memory than the standard double precision algorithm.

Algorithm 1. Mixed precision, iterative refinement for direct solvers.

  1: LU ← PA                    (εs)
  2: solve Ly = Pb              (εs)
  3: solve U x_0 = y            (εs)
     do k = 1, 2, ...
  4:   r_k ← b - A x_{k-1}      (εd)
  5:   solve Ly = P r_k         (εs)
  6:   solve U z_k = y          (εs)
  7:   x_k ← x_{k-1} + z_k      (εd)
       check convergence
     done

The method in Algorithm 1 can offer significant improvements for the solution of a sparse linear system in many cases if:

1. single precision computation is significantly faster than double precision computation;
2. the iterative refinement procedure converges in a small number of steps;
3. the cost of each iteration is small compared to the cost of the system factorization. If the cost of each iteration is too high, then a low number of iterations will result in a performance loss with respect to the full double precision solver. In the sparse case, for a fixed matrix size, both the cost of the system factorization and the cost of the iterative refinement step may substantially vary depending on the number of nonzeroes and the matrix sparsity structure. In the dense case, results are more predictable.

Note that the choice of the stopping criterion in the iterative refinement process is critical. Formulas for the error computed at each step of Algorithm 1 can be obtained for instance in [6,7].
2.2. Iterative methods

Direct methods are usually a very robust tool for the solution of sparse linear systems. However, they suffer from fill-in, which results in high memory requirements, long execution time, and nonoptimal scalability in parallel environments. To overcome these limitations, various pivot reordering techniques are commonly applied to minimize the amount of generated fill-in and to enable better exploitation of parallelism. Still, there are cases where direct methods pose too high of a memory requirement or deliver poor performance. A valid alternative are iterative methods, even though they are less robust and have a less predictable behavior. Iterative methods do not require more memory than what is needed for the original coefficient matrix. Furthermore, time to solution can be better than that of direct methods if convergence is achieved in relatively few iterations [8,9].

In the context of iterative methods, the refinement method outlined in Algorithm 1 can be represented as

  x_{i+1} = x_i + M(b - A x_i),    (2)

where M is (LU)^{-1} P. Iterative methods of this form (i.e. where M does not depend on the iteration number i) are also known as stationary. Matrix M can be as simple as a scalar value (the method then becomes a modified Richardson iteration) or as complex as (LU)^{-1} P. In either case, M is called a preconditioner. The preconditioner should approximate A^{-1}, and the quality of the approximation determines the convergence properties of (2). In general, a preconditioner is intended to improve the robustness and the efficiency of iterative methods. Note that (2) can also be interpreted as a Richardson method's iteration in solving MAx = Mb, which is called left preconditioning. An alternative is to use right preconditioning, whereby the original problem Ax = b is transformed into a problem of solving

  AMu = b,  x = Mu

iteratively. Later on, we will use the right preconditioning for mixed precision iterative methods.

M needs to be easy to compute, apply, and store to guarantee the overall efficiency. Note that these requirements were addressed in the mixed precision direct methods above by replacing M (coming from the LU factorization of A followed by matrix inversion) with its single precision representation, so that arithmetic operations can be performed more efficiently on it. Here, however, we go two steps further. We not only replace M by an inner loop, which is an incomplete iterative solver working in single precision arithmetic [10], but we also replace the outer loop by a more sophisticated iterative method, e.g., one based on Krylov subspaces.

Note that replacing M by an iterative method leads to nesting of two iterative methods. Variations of this type of nesting, also known in the literature as an inner–outer iteration, have been studied, both theoretically and computationally [11–17]. The general appeal of these methods is that the computational speedup hinges on the inner solver's ability to use an approximation of the original matrix A that is fast to apply. In our case, we use a single precision arithmetic matrix–vector product as a fast approximation of the double precision operator in the inner iterative solver. Moreover, even if no faster matrix–vector product is available, speedup can often be observed due to improved convergence (e.g., see [15], where Simoncini and Szyld explain the possible benefits of FGMRES-GMRES over restarted GMRES).

To illustrate the above concepts, we demonstrate an inner–outer nonsymmetric iterative solver in mixed precision. The solver is based on the restarted Generalized Minimal RESidual (GMRES) method. In particular, consider Algorithm 2, where the outer loop uses the flexible GMRES (FGMRES [9,14]) and the inner loop uses GMRES in single precision arithmetic (denoted by GMRES_SP). FGMRES, being a minor modification of the standard GMRES, is meant to accommodate non-constant preconditioners. Note that in our case, this non-constant preconditioner is GMRES_SP. The resulting method is denoted by FGMRES(m_out)-GMRES_SP(m_in), where m_in is the restart for the inner loop and m_out for the outer FGMRES. Algorithm 2 checks for convergence every m_out outer iterations. Our actual implementation checks for convergence at every inner iteration; this can be done with simple tricks at almost no computational cost.

Algorithm 2. Mixed precision, inner–outer FGMRES(m_out)-GMRES_SP(m_in).

   1: for i = 0, 1, ... do
   2:   r = b - A x_i                                              (εd)
   3:   β = h_{1,0} = ||r||_2                                      (εd)
   4:   check convergence and exit if done
   5:   for k = 1, ..., m_out do
   6:     v_k = r / h_{k,k-1}                                      (εd)
   7:     perform one cycle of GMRES_SP(m_in) in order to
          (approximately) solve A z_k = v_k (initial guess z_k = 0) (εs)
   8:     r = A z_k                                                (εd)
   9:     for j = 1, ..., k do
  10:       h_{j,k} = r^T v_j                                      (εd)
  11:       r = r - h_{j,k} v_j                                    (εd)
  12:     end for
  13:     h_{k+1,k} = ||r||_2                                      (εd)
  14:   end for
  15:   define Z_k = [z_1, ..., z_k], H_k = {h_{i,j}}, 1 ≤ i ≤ k+1, 1 ≤ j ≤ k  (εd)
  16:   find y_k, the vector of size k, that minimizes ||β e_1 - H_k y_k||_2  (εd)
  17:   x_{i+1} = x_i + Z_k y_k                                    (εd)
  18: end for

The potential benefits of FGMRES compared to GMRES are becoming better understood [15]. Numerical experiments confirm improvements in speed, robustness, and sometimes memory requirements for these methods. For example, we show a maximum speedup of close to 15 on the selected test problems. The memory requirements for the method are the matrix A in CRS format, the nonzero matrix coefficients in single precision, 2·m_out vectors in double precision, and m_in vectors in single precision.

The Generalized Conjugate Residuals (GCR) method [17,18] is a possible replacement for FGMRES as the outer iterative solver. Whether to choose GCR or FGMRES is not yet well understood.

As in the dense case, the choice of the stopping criterion in the iterative refinement process is critical. In the sparse case, formulas for the errors can be computed following the work of Arioli et al. [19].
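The key ingredient of the inner solver is an operator that is cheaper to apply than the double precision one. The sketch below shows what this can look like for a matrix stored in CRS format: the same sparsity pattern carries both double and single precision values, the double precision product serves the outer residual computations (εd), and the single precision product is the fast approximation used inside the GMRES_SP cycles (step 7 of Algorithm 2). The type and function names are illustrative and not taken from the distributed code.

/* Sketch: one CRS sparsity pattern stored with both double and single
 * precision values, so the inner solver can use a cheaper operator. */
typedef struct {
    int     n;        /* matrix dimension                              */
    int    *rowptr;   /* size n+1                                      */
    int    *colind;   /* size nnz                                      */
    double *val;      /* nnz values in eps_d, for the outer loop       */
    float  *sval;     /* the same values in eps_s, for the inner loop  */
} crs_matrix;

/* y = A*x in double precision: used for the residual r = b - A*x (eps_d). */
void crs_matvec_double(const crs_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i+1]; j++)
            s += A->val[j] * x[A->colind[j]];
        y[i] = s;
    }
}

/* y = A*x in single precision: the fast approximation of the operator
 * applied inside the GMRES_SP cycles. */
void crs_matvec_single(const crs_matrix *A, const float *x, float *y)
{
    for (int i = 0; i < A->n; i++) {
        float s = 0.0f;
        for (int j = A->rowptr[i]; j < A->rowptr[i+1]; j++)
            s += A->sval[j] * x[A->colind[j]];
        y[i] = s;
    }
}

Keeping both val and sval is what leads to the memory footprint quoted above: the CRS structure plus an extra copy of the nonzero coefficients in single precision.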

3. Performance results

The experimental results reported in this section were measured on the systems described in Table 1. At this moment no software libraries are available to perform sparse computations on the STI Cell BE architecture. For this reason, only mixed precision iterative refinement solvers for dense linear systems are presented for this architecture.

To measure the performance of sparse mixed precision solvers based on both direct and iterative methods, the matrices described in Table 2 were used.

Based on backward stability analysis, the solution x can be considered as accurate as the double precision one when

  ||b - Ax||_2 ≤ ||x||_2 · ||A||_2 · ε · sqrt(n),

where ||·||_2 is the spectral norm. However, for the following experiments, a full double precision solution is computed first and then the mixed precision iterative refinement is stopped when the computed solution is as accurate as the full double precision one.

Table 2
Test matrices for sparse mixed precision, iterative refinement solution methods.

  n  Matrix      Size    Nonzeroes  Symm.  Pos. def.  C. num.
  1  SiO         33401   1317655    yes    no         O(10^3)
  2  Lin         25600   1766400    yes    no         O(10^5)
  3  c-71        76638   859554     yes    no         O(10)
  4  cage-11     39082   559722     no     no         O(1)
  5  raefsky3    21200   1488768    no     no         O(10)
  6  poisson3Db  85623   2374949    no     no         O(10^3)

Table 3
Performance improvements for direct dense methods when going from a full double precision solve (reference time) to a mixed precision solve.

  Architecture     Nonsymmetric  Symmetric
  AMD Opteron 246  1.82          1.54
  IBM PowerPC 970  1.56          1.35
  Intel Xeon 5100  1.56          1.43
  STI Cell BE      8.62          10.64

3.1. Direct methods

3.1.1. Dense matrices

Mixed precision iterative refinement solvers were developed for both symmetric and nonsymmetric dense linear systems by means of the methods and subroutines provided by the BLAS [20–24] and LAPACK [25] software packages. For the nonsymmetric case, step 1 in Algorithm 1 is implemented by means of the SGETRF subroutine, steps 2, 3 and 5, 6 with the SGETRS subroutine, step 4 with the DGEMM subroutine and step 7 with the DAXPY subroutine. For the symmetric case, the SGETRF, SGETRS and DGEMM subroutines were replaced by the SPOTRF, SPOTRS and DSYMM subroutines, respectively. Further details on these implementations can be found in [26,27].

As already mentioned, iterative refinement solvers require 1.5 times as much memory as a regular double precision solver. This is because the mixed precision iterative refinement solvers need to store at the same time both the single precision and the double precision versions of the coefficient matrix. This is true for dense as well as sparse matrices.

Table 3 shows the speedup of the mixed precision, iterative refinement solvers for dense matrices with respect to full, double precision solvers. These results show that the mixed precision iterative refinement method can run very close to the speed of the full single precision solver while delivering the same accuracy as the full double precision one. On the AMD Opteron, Intel Woodcrest and IBM PowerPC architectures, the mixed precision, iterative solver can provide a speedup of up to 1.8 for the nonsymmetric solver and 1.5 for the symmetric one for large enough problem sizes. For small problem sizes, the cost of even a few iterative refinement iterations is high compared to the cost of the factorization, and thus the mixed precision iterative solver is less efficient than the double precision one.

Parallel implementations of Algorithm 1 for the symmetric and nonsymmetric cases have been produced in order to exploit the full computational power of the Cell processor (see also Fig. 1). Due to the large difference between the speed of the single precision and double precision floating point units (as indicated in Table 1, the peak for single precision operations is 14 times higher than the peak for double precision operations on the STI Cell BE), the mixed precision solver performs up to 7 times faster than the double precision peak in the nonsymmetric case and 11 times faster for the symmetric positive definite case. Implementation details for this case can be found in [28,29].

3.1.2. Sparse matrices

Most sparse direct methods for solving linear systems of equations are variants of either multifrontal [30] or supernodal [31] factorization approaches. Here, we focus only on multifrontal methods. For results on supernodal solvers see [32]. There are a number of freely available packages that implement multifrontal methods. We have chosen for our tests a software package called MUMPS [33–35]. The main reason for selecting this software is that it is implemented in both single and double precision, which is not the case for other freely available multifrontal solvers such as UMFPACK [36–38].

Using the MUMPS package for solving systems of linear equations comprises three separate steps:

1. System Analysis: in this phase the system sparsity structure is analyzed in order to estimate the element fill-in, which provides an estimate of the memory that will be allocated in the following steps. Also, pivoting is performed based on the structure of A + A^T, ignoring numerical values. Only integer operations are performed at this step.
2. Matrix Factorization: in this phase the PA = LU factorization is performed. This is the computationally most expensive step of the system solution.
3. System Solution: the system is solved in two steps: Ly = Pb and Ux = y.

The Analysis and Factorization phases correspond to step 1 in Algorithm 1, while the solution phase corresponds to steps 2, 3 and 5, 6.

The speedup of the mixed precision, iterative refinement approach over the double precision one for sparse direct methods is shown in Table 4 and Fig. 2. The figure reports the performance ratio between the full single precision and full double precision solvers (light colored bars) and the mixed precision and full double precision solvers (dark colored bars) for six matrices from real world applications. The number on top of each bar shows how many iterations are performed by the mixed precision, iterative method to achieve double precision accuracy.
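As a concrete form of the stopping test quoted at the beginning of this section, a scaled residual check could be written as below. This is a minimal sketch in which DBL_EPSILON plays the role of ε and the norms are assumed to be computed elsewhere (e.g., with BLAS/LAPACK norm routines); the function name is illustrative.

#include <math.h>
#include <float.h>

/* Returns nonzero when ||b - Ax||_2 <= ||x||_2 * ||A||_2 * eps * sqrt(n),
 * i.e. when the mixed precision solution can be considered as accurate as
 * a full double precision one (backward stability criterion of Section 3). */
int refinement_converged(int n, double norm_r, double norm_x, double norm_A)
{
    return norm_r <= norm_x * norm_A * DBL_EPSILON * sqrt((double)n);
}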

Fig. 1. Mixed precision, iterative refinement method for the solution of dense linear systems on the STI Cell BE processor.

Table 4
Performance improvements for direct sparse methods when going from a full double precision solve (reference time) to a mixed precision solve.

                   Matrix number
                   1      2      3      4      5      6
  AMD Opteron 246  1.827  1.783  1.580  1.858  1.846  1.611
  IBM PowerPC 970  1.393  1.321  1.217  1.859  1.801  1.463
  Intel Xeon 5100  1.799  1.630  1.554  1.768  1.728  1.524

Fig. 2. Mixed precision, iterative refinement with the MUMPS direct solver on an Intel Woodcrest 3.0 GHz system.

3.2. Iterative methods

Similar to the case of sparse direct solvers, we demonstrate the numerical performance of Algorithm 2 on the architectures from Table 1 and on the matrices from Table 2.

Fig. 3 (left) shows the performance ratio of the mixed precision inner–outer FGMRES-GMRES_SP vs. the full double precision inner–outer FGMRES-GMRES_DP. In other words, we compare two inner–outer algorithms that are virtually the same. The only difference is that their inner loop's incomplete solvers are performed in correspondingly single and double precision arithmetic.

Fig. 3 (right) shows the performance ratio of the mixed precision inner–outer FGMRES-GMRES_SP vs. double precision GMRES. This is an experiment that shows that inner–outer type iterative methods may be very competitive compared to their original counterparts. For example, we observe a speedup for matrix #2 of up to 6, which is mostly due to an improved convergence of the inner–outer GMRES vs. standard GMRES. In particular, about 3.5 of the 5.5-fold speedup for matrix #2 on the IBM PowerPC architecture is due to improved convergence, and the remaining factor of about 1.57 is due to single vs. double precision arithmetic. The restart values used for this computation are given in Table 5. The restart values m_in and m_out were manually tuned; m was taken as 2·m_out + m_in in order to use the same amount of memory space for the two different methods, or additionally increased when needed to improve the reference GMRES solution times.

Fig. 3. Mixed precision iterative refinement with FGMRES-GMRES_SP from Algorithm 2 vs. FGMRES-GMRES_DP (left) and vs. full double precision GMRES (right).

4. Numerical remarks

Following the work of Skeel [39], Higham [40] gives error bounds for the single and double precision, iterative refinement algorithm when the entire algorithm is implemented with the same precision (single or double, respectively). Higham also gives error bounds in single precision arithmetic, with refinement performed in double precision arithmetic [40]. The error analysis in double precision, for our mixed precision algorithm (Algorithm 1), is given by Langou et al. [27]. Arioli and Duff [41] give the error analysis for a mixed precision algorithm based on a double precision FGMRES preconditioned by a single precision LU factorization. These error bounds show that mixed precision iterative refinement will work as long as the condition number of the coefficient matrix is smaller than the inverse of the lower precision used. For practical reasons, we need to resort to the standard double precision solver in the cases when the condition number of the coefficient matrix is larger than the inverse of the lower precision used.

In Fig. 4, we show the number of iterations needed for our mixed precision method to converge to better accuracy than the one of the associated double precision solve. The number of iterations is shown as a function of the condition number of the coefficient matrix (κ) in the context of direct dense nonsymmetric solves. For each condition number, we have taken 200 random matrices of size 200-by-200 with a prescribed condition number and we report the mean number of iterations until convergence. The maximum number of iterations allowed was set to 30, so that 30 means failure to converge (as opposed to convergence in 30 iterations). Datta [42] has conjectured that the number of iterations necessary for convergence is given by

  ln(εd) / (ln(εd) + ln(κ)).

We can generalize this formula in the context of our mixed precision approach as

  ln(εd) / (ln(εs) + ln(κ)).

When κ · εs is above 1, the formula is not valid anymore. This is characterized in practice by an infinite number of iterations, i.e. lack of convergence of the method.

Fig. 4. Number of iterations needed for our mixed precision method to converge to an accuracy better than the one of the associated double precision solve, as a function of the condition number of the coefficient matrix, in the context of direct dense nonsymmetric solves.
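For a quick numerical illustration of the generalized bound, the sketch below evaluates the predicted iteration count for a few condition numbers, using the IEEE single and double precision machine epsilons for εs and εd; the chosen κ values are arbitrary examples, not data from the experiments above.

/* Sketch: evaluate the predicted number of refinement iterations
 *   iters ~ ln(eps_d) / (ln(eps_s) + ln(kappa))
 * for a few condition numbers, with eps_s and eps_d taken as the IEEE
 * single and double precision machine epsilons. */
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    const double eps_s = FLT_EPSILON, eps_d = DBL_EPSILON;
    const double kappa[] = { 1e0, 1e2, 1e4, 1e6, 1e8 };

    for (int i = 0; i < (int)(sizeof kappa / sizeof kappa[0]); i++) {
        if (kappa[i] * eps_s >= 1.0) {
            /* kappa * eps_s >= 1: the bound no longer applies */
            printf("kappa = %8.1e : no convergence expected\n", kappa[i]);
        } else {
            double iters = log(eps_d) / (log(eps_s) + log(kappa[i]));
            printf("kappa = %8.1e : about %2.0f iterations\n",
                   kappa[i], ceil(iters));
        }
    }
    return 0;
}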

Table 5
Restart values for the GMRES-based iterative solvers.

  Matrix  m_in  m_out  m
  1       30    20     150
  2       20    10     40
  3       100   9      300
  4       10    4      18
  5       20    20     300
  6       20    10     50

Table 6
Iterative refinement in quadruple precision on an Intel Xeon 3.2 GHz.

  n     QGESV time (s)  QDGESV time (s)  Speedup
  100   0.29            0.03             9.5
  200   2.27            0.10             20.9
  300   7.61            0.24             30.5
  400   17.81           0.44             40.4
  500   34.71           0.69             49.7
  600   60.11           1.01             59.0
  700   94.95           1.38             68.7
  800   141.75          1.83             77.3
  900   201.81          2.33             86.3
  1000  276.94          2.92             94.8

Table 7
Time for the various kernels in the quadruple accuracy versions for n = 900.

  Driver name  Time (s)  Kernel name  Time (s)
  QGESV        201.81    QGETRF       201.1293
                         QGETRS       0.6845
  QDGESV       2.33      DGETRF       0.3200
                         DGETRS       0.0127
                         DLANGE       0.0042
                         DGECON       0.0363
                         QGEMV        1.5526
                         ITERREF      1.9258

5. Extension to quadruple precision

As an extension to this study, we present in this section results for iterative refinement in quadruple precision on an Intel Xeon 3.2 GHz. The iterative refinement code computes a condition number estimate for input matrices having random entries drawn from a uniform distribution. For quadruple precision arithmetic, we use the reference BLAS compiled with the Intel Fortran compiler ifort (with the -O3 optimization flag on), since we do not have an optimized BLAS in quadruple precision. The version of the compiler is 8.1. Results are presented in Table 6. The obtained accuracy is between 10 and 32 for QGETRF and QDGETRF, as expected. No more than 3 steps of iterative refinement are needed. The speedup ranges from 10 for a matrix of size 100 to close to 100 for a matrix of size 1000. In Table 7, we give the time for the different kernels used in QGESV and QDGESV. Interestingly enough, the time for QDGESV is dominated by QGEMV and not DGETRF! Recent research using a related idea can be found in [43].
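The quadruple/double combination follows exactly the same pattern as Algorithm 1, one precision level up: the factorization and triangular solves stay in double precision and only the residual and the solution update are carried out in quadruple precision. Below is a minimal C sketch of that residual step, using GCC's __float128 as a stand-in for the REAL*16 arithmetic of the Fortran code; the function name is illustrative and not part of the distribution.

/* Sketch of step 4 of Algorithm 1 at the double/quadruple level:
 * r = b - A*x accumulated in quadruple precision (__float128), with the
 * matrix A and the right-hand side b kept in double precision. */
#include <stddef.h>

typedef __float128 quad;

void residual_quad(int n, const double *A,   /* n-by-n, column-major */
                   const double *b, const quad *x, quad *r)
{
    for (int i = 0; i < n; i++)
        r[i] = (quad)b[i];
    for (int j = 0; j < n; j++)              /* column-major traversal */
        for (int i = 0; i < n; i++)
            r[i] -= (quad)A[i + (size_t)j * (size_t)n] * x[j];
}

The correction would then be obtained by rounding r to double precision, reusing the existing double precision LU factors (DGETRS), and adding the correction to x in quadruple precision.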

Mixed precision algorithms can easily provide substantial speed- > make -f Makefile.reference_BLAS up for very little code effort by mainly taking into account existing However, only mediocre performance will be observed. It is hardware properties. strongly suggested that the code is linked with an optimized im- We have shown how to derive mixed precision version of va- plementation of the BLAS library, such as MKL or ACML. In order to riety of algorithms for solving general linear systems of equations. do so the user needs to modify Makefile.optimized_BLAS to reflect Mixed precision iterative refinement technique has also be used the configuration of the system and invoke it as follows: in the context of symmetric positive definite systems [28] using a Cholesky factorization. In the context of overdetermined least > make -f Makefile.optimized_BLAS squares problems, the iterative refinement technique can be ap- plied to the augmented system (where both the solution and the 10. Test run description residual are refined, as described in [44]), to the QR factorization, to the semi-normal equations or to the normal equations [45]. The compilation process produces the executable “mixed_preci- Iterative refinement can also be applied for eigenvalue computa- sion”, which can be invoked to test the code. The program takes tion [46] and for singular value computation [47]. problem size as its sole argument. To test the accuracy and perfor- We hope this manuscript will encourage scientists to extend mance of the code for a problem defined by a 1000x1000 matrix, this approach to their own applications that do not necessarily one can type at command prompt: originate from linear algebra. > mixed_precision 1000

7. Conclusions

The main conclusion of the research presented in this paper is that mixed precision algorithms can provide performance benefits in solving linear systems of equations using dense direct methods, sparse direct methods and sparse iterative methods. If the problem being solved is well conditioned, most of the computational work can be performed in single (32-bit) precision, and full (64-bit) precision can be recovered by a small amount of extra work.

8. Description of the individual software components

mixed-precision.c: Provides a simple example of invoking the DSGESV routine to solve a dense linear system of equations using the mixed precision approach (single/double). The code performs memory allocation and data initialization, invokes the solver and, after the calculations complete, reports the accuracy of the solution, the number of refinement steps and the performance in Gflop/s. The code demonstrates the use of the legacy FORTRAN BLAS interface as well as the CBLAS interface.

dsgesv.f: Implements the main routine solving a dense system of linear equations using the technique of iterative refinement to achieve the speed of single precision arithmetic while providing double precision results. Relies on BLAS and a number of FORTRAN routines to provide the required building blocks for the computation.

*.f: The remaining FORTRAN routines in the top-level directory implement all the necessary components of the DSGESV routine and rely on a set of BLAS for the basic linear algebra operations.

BLAS/: Contains the reference FORTRAN implementation of all the BLAS routines required by the algorithm, which can be used if an optimized BLAS library is not available. In such a case, however, only a small fraction of the achievable performance will be delivered.

Makefile.reference_BLAS: Contains a makefile to compile a stand-alone version of the code, which does not require the BLAS library. The routines from the ../BLAS subdirectory are compiled and linked in instead. Invokes GFORTRAN and GCC to compile the source code.

Makefile.optimized_BLAS: Contains an example makefile to compile the code on an Intel system using the BLAS provided by the Math Kernel Library (MKL). Invokes Intel ICC and IFORT to compile the source code.
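A minimal C driver in the spirit of mixed-precision.c is sketched below. It assumes the DSGESV calling sequence as standardized in LAPACK 3.x, accessed through the legacy Fortran convention with a trailing underscore; the distributed dsgesv.f is expected to match this interface, but the routine header in the distribution should be consulted for the exact argument list.

/* Sketch of a driver calling DSGESV (mixed precision solve with iterative
 * refinement). Assumes the LAPACK 3.x interface for DSGESV; error handling
 * and accuracy/performance reporting are omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>

void dsgesv_(const int *n, const int *nrhs, double *a, const int *lda,
             int *ipiv, double *b, const int *ldb, double *x, const int *ldx,
             double *work, float *swork, int *iter, int *info);

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000, nrhs = 1, iter, info;

    double *a = malloc((size_t)n * n * sizeof *a);          /* column-major A */
    double *b = malloc((size_t)n * sizeof *b);
    double *x = malloc((size_t)n * sizeof *x);
    double *work = malloc((size_t)n * nrhs * sizeof *work);
    float  *swork = malloc((size_t)n * (n + nrhs) * sizeof *swork);
    int    *ipiv = malloc((size_t)n * sizeof *ipiv);

    /* Random data, in the spirit of the distributed test driver. */
    for (int i = 0; i < n * n; i++) a[i] = (double)rand() / RAND_MAX - 0.5;
    for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX - 0.5;

    dsgesv_(&n, &nrhs, a, &n, ipiv, b, &n, x, &n, work, swork, &iter, &info);

    /* iter >= 0: number of refinement steps of the single/double scheme;
     * iter < 0: the routine fell back to a full double precision solve. */
    printf("info = %d, refinement steps = %d\n", info, iter);

    free(a); free(b); free(x); free(work); free(swork); free(ipiv);
    return info;
}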
9. Installation instructions

The code can be compiled on any system where the GCC and GFORTRAN compilers are available by invoking the command:

> make -f Makefile.reference_BLAS

However, only mediocre performance will be observed. It is strongly suggested that the code is linked with an optimized implementation of the BLAS library, such as MKL or ACML. In order to do so, the user needs to modify Makefile.optimized_BLAS to reflect the configuration of the system and invoke it as follows:

> make -f Makefile.optimized_BLAS

10. Test run description

The compilation process produces the executable "mixed_precision", which can be invoked to test the code. The program takes the problem size as its sole argument. To test the accuracy and performance of the code for a problem defined by a 1000x1000 matrix, one can type at the command prompt:

> mixed_precision 1000

Different results will be obtained depending on the BLAS implementation used, the compiler optimizations used and the values of the random data initializations. As long as the number of refinement steps is less than 5 and the residual norm is O(1.0e-12), the installation can be deemed to be working correctly.

References

[1] T.J. Ypma, SIAM Review 37 (1995) 531.
[2] J.W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
[3] J.H. Wilkinson, Rounding Errors in Algebraic Processes, Prentice-Hall, 1963.
[4] C.B. Moler, J. ACM 14 (1967) 316.
[5] G.W. Stewart, Introduction to Matrix Computations, Academic Press, 1973.
[6] J.W. Demmel, et al., ACM Trans. Math. Software 32 (2006) 325.
[7] W. Oettli, W. Prager, Numer. Math. 6 (1964) 405.
[8] R. Barrett, et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Society for Industrial and Applied Mathematics, Philadelphia, 1994. Also available as postscript file at http://www.netlib.org/templates/Templates.html.
[9] Y. Saad, Iterative Methods for Sparse Linear Systems, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2003.
[10] K. Turner, H.F. Walker, SIAM J. Sci. Stat. Comput. 13 (1992) 815.
[11] O. Axelsson, P.S. Vassilevski, SIAM J. Matrix Anal. Appl. 12 (1991) 625.
[12] G.H. Golub, Q. Ye, SIAM J. Scientific Comput. 21 (2000) 1305.
[13] Y. Notay, SIAM J. Scientific Comput. 22 (2000) 1444.
[14] Y. Saad, A flexible inner–outer preconditioned GMRES algorithm, Technical Report 91-279, Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, 1991.
[15] V. Simoncini, D.B. Szyld, SIAM J. Numer. Anal. 40 (2003) 2219.
[16] J. van den Eshof, G.L.G. Sleijpen, M.B. van Gijzen, J. Comput. Appl. Math. 177 (2005) 347.
[17] C. Vuik, J. Comput. Appl. Math. 61 (1995) 189.
[18] H.A. van der Vorst, C. Vuik, Numer. Linear Algebra Appl. 1 (1994) 369.
[19] M. Arioli, J.W. Demmel, I.S. Duff, SIAM J. Matrix Anal. Appl. 10 (1989) 165.
[20] J.J. Dongarra, J.D. Croz, I.S. Duff, S. Hammarling, ACM Trans. Math. Software 16 (1990) 18.
[21] J.J. Dongarra, J.D. Croz, I.S. Duff, S. Hammarling, ACM Trans. Math. Software 16 (1990) 1.
[22] J.J. Dongarra, J.D. Croz, S. Hammarling, R.J. Hanson, ACM Trans. Math. Software 14 (1988) 18.
[23] J.J. Dongarra, J.D. Croz, S. Hammarling, R.J. Hanson, ACM Trans. Math. Software 14 (1988) 1.
[24] C.L. Lawson, R.J. Hanson, D. Kincaid, F.T. Krogh, ACM Trans. Math. Software 5 (1979) 308.
[25] E. Anderson, et al., LAPACK Users' Guide, 3rd edition, SIAM, Philadelphia, 1999.
[26] A. Buttari, et al., Int. J. High Performance Comput. Appl. 21 (2007) 457.
[27] J. Langou, et al., Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy, in: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.
[28] J. Kurzak, A. Buttari, J.J. Dongarra, IEEE Trans. Parallel Distrib. Systems 19 (2008) 1.
[29] J. Kurzak, J.J. Dongarra, Concurrency Computat.: Pract. Exper. 19 (2007) 1371.
[30] I.S. Duff, J.K. Reid, ACM Trans. Math. Software 9 (1983) 302.
[31] C. Ashcraft, R. Grimes, J. Lewis, B.W. Peyton, H. Simon, Intern. J. Supercomput. Appl. 1 (1987) 10.
[32] A. Buttari, J. Dongarra, J. Kurzak, P. Luszczek, S. Tomov, ACM Trans. Math. Software 34 (2008), Article 17, 22 pages.
[33] P.R. Amestoy, I.S. Duff, J.-Y. L'Excellent, Comput. Methods Appl. Mech. Eng. 184 (2000) 501.
[34] P.R. Amestoy, I.S. Duff, J.-Y. L'Excellent, J. Koster, SIAM J. Matrix Anal. Appl. 23 (2001) 15.
[35] P.R. Amestoy, A. Guermouche, J.-Y. L'Excellent, S. Pralet, Parallel Comput. 32 (2006) 136.
[36] T.A. Davis, ACM Trans. Math. Software 25 (1999) 1.
[37] T.A. Davis, ACM Trans. Math. Software 30 (2004) 196.
[38] T.A. Davis, I.S. Duff, SIAM J. Matrix Anal. Appl. 18 (1997) 140.
[39] R.D. Skeel, Math. Comput. 35 (1980) 817.
[40] N.J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd edition, SIAM, 2002.
[41] M. Arioli, I.S. Duff, Using FGMRES to obtain backward stability in mixed precision, Technical Report RAL-TR-2008-006, Rutherford Appleton Laboratory, 2008.
[42] B.D. Datta, Numerical Linear Algebra and Applications, Brooks Cole Publishing Company, 1995.
[43] K.O. Geddes, W.W. Zheng, Exploiting fast hardware floating point in high precision computation, in: Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, Philadelphia, PA, USA, 2003, pp. 111–118.
[44] J.W. Demmel, Y. Hida, X.S. Li, E.J. Riedy, Extra-precise iterative refinement for overdetermined least squares problems, Technical Report EECS-2007-77, UC Berkeley, 2007. Also LAPACK Working Note 188.
[45] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, 1996.
[46] J.J. Dongarra, C.B. Moler, J.H. Wilkinson, SIAM J. Numer. Anal. 20 (1983) 23.
[47] J.J. Dongarra, SIAM J. Scientific Statist. Comput. 4 (1983) 712.