Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184

Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184 Jakub Kurzak1, Alfredo Buttari1, Jack Dongarra1,2 1Department of Computer Science, University Tennessee, Knoxville, Tennessee 37996 2Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 37831 May 10, 2007 ABSTRACT: The STI CELL processor introduces 1 Motivation pioneering solutions in processor architecture. At the same time it presents new challenges for the devel- In numerical computing, there is a fundamental per- opment of numerical algorithms. One is effective ex- formance advantage of using single precision float- ploitation of the differential between the speed of sin- ing point data format over double precision data for- gle and double precision arithmetic; the other is effi- mat, due to more compact representation, thanks to cient parallelization between the short vector SIMD which, twice the number of single precision data ele- cores. In this work, the first challenge is addressed ments can be stored at each stage of the memory hi- by utilizing a mixed-precision algorithm for the solu- erarchy. Short vector SIMD processing provides yet tion of a dense symmetric positive definite system of more potential for performance gains from using sin- linear equations, which delivers double precision ac- gle precision arithmetic over double precision. Since curacy, while performing the bulk of the work in sin- the goal is to process the entire vector in a single gle precision. The second challenge is approached by operation, the computation throughput can be dou- introducing much finer granularity of parallelization bled when the data representation is halved. Unfor- than has been used for other architectures and us- tunately, the accuracy of the solution is also halved. ing a lightweight decentralized synchronization. The Most of the processor architectures available to- implementation of the computationally intensive sec- day have been at some point augmented with short tions gets within 90 percent of peak floating point vector SIMD extensions. Examples include Stream- performance, while the implementation of the mem- ing SIMD Extensions (SSE) for the AMD and Intel ory intensive sections reaches within 90 percent of lines of processors, PowerPC Velocity Engine / Al- peak memory bandwidth. On a single CELL pro- tiVec / VMX, Sparc Visual Instruction Set (VIS), cessor, the algorithm achieves over 170 Gflop/s when Alpha Motion Video Instructions (MVI), PA-RISC solving a symmetric positive definite system of lin- Multimedia Acceleration eXtensions (MAX), MIPS- ear equation in single precision and over 150 Gflop/s 3D Application Specific Extensions (ASP), and Dig- when delivering the result in double precision accu- ital Media Extensions (MDMX), ARM NEON. The racy. different architectures exhibit big differences in their KEYWORDS: CELL BE, iterative refinement, capabilities. The vector size is either 64 bits or, more mixed-precision algorithms, Cholesky factorization commonly, 128 bits. The register file size ranges from 1 just a few to as many as 128 registers. Some exten- order to improve the computed solution, an iterative sions only support integer types, others also operate refinement process can be applied, which produces a on single precision floating point numbers, and yet correction to the computed solution at each iteration. others also process double precision values. In principle, the algorithm can produce a solution Today the Synergistic Processing Element (SPE) correct to the working precision. of the STI CELL processor [1–3] can probably be Iterative refinement is a fairly well understood con- considered the state of the art in short vector SIMD cept and was analyzed by Wilkinson [4], Moler [5] processing. Possessing 128-byte long registers and a and Stewart [6]. Higham gives error bounds for both fully pipelined, fused add-multiply instruction, it is single and double precision iterative refinement algo- capable of completing eight single precision floating rithms, where the entire algorithm is implemented point operations each clock cycle, which combined with the same precision (single or double respec- with the size of the register file of 128 registers de- tively) [7]. He also gives error bounds in single preci- livers close to peak performance on many common sion arithmetic, with refinement performed in double workloads. At the same time, built with multime- precision arithmetic. Error analysis for the case de- dia and embedded applications in mind, the current scribed in this work, where the factorization is per- incarnation of the CELL architecture does not imple- formed in single precision and the refinement in dou- ment double precision arithmetic on par with single ble precision, is given by Langou et al. [8]. precision performance, which makes the processor a The authors of this work have previously presented very attractive target for exploring mixed-precision an initial implementation of the mixed-precision algo- approaches. rithm for the general, non-symmetric, case using LU Another important phenomenon in recent years factorization on the CELL processors. Although re- has been the gradual shift of focus in processor ar- spectable performance numbers were presented, both chitecture from aggressive exploitation of instruction the factorization and the refinement steps relied on level parallelism towards thread-level parallelism, re- rather classic parallelization approaches. Also, a sulting in the introduction of chips with multiple pro- somewhat general discussion of algorithmic and im- cessing units commonly referred to as multi-core pro- plementation details was presented. This work ex- cessors. The new architectures deliver the much de- tends the previous presentation by introducing a sired improvement in performance, and at the same novel scheme for parallelization of the computational time challenge the scalability of existing algorithms, components of the algorithm, and also describes and force the programmers to seek more parallelism in much more detail the implementation of both by going to much finer levels of problem granular- computation-intensive, as well as memory-intensive ity. In linear algebra, it enforces the departure from operations. the model relying on parallelism encapsulated at the level of BLAS and shifts to more flexible methods of scheduling work. 3 Algorithm The standard approach to solving symmetric posi- 2 Related Work tive definite systems of linear equations is to use the Cholesky factorization. The Cholesky factorization Iterative refinement is a well known method for im- of a real symmetric positive definite matrix A has proving the solution of a linear system of equations of the form A = LLT , where L is a real lower triangular the form Ax = b. Typically, a dense system of linear matrix with positive diagonal elements. The system equations is solved by applying a factorization to the is solved by first solving Ly = b (forward substitution) coefficient matrix, followed by a back solve. Due to and then solving LT x = y (backward substitution). In roundoff errors, the solution carries an error related order to improve the accuracy of the computed solu- to the condition number of the coefficient matrix. In tion, an iterative refinement process is applied, which 2 Algorithm 1 Solution of a symmetric positive defi- The algorithm described above, and shown on Al- nite system of linear equations using mixed-precision gorithm 1 is available in the LAPACK 3.1 library and iterative refinement based on Cholesky factorization. implemented by the routine DSGESV. 1: A(32), b(32) ← A, b T a 2: L(32),L(32) ←SPOTRF (A(32)) (1) b T 4 Implementation 3: x(32) ←SPOTRS (L(32),L(32), b(32)) (1) 4: x(1) ← x (32) 4.1 Essential Hardware Features 5: repeat 6: r(i) ← b − Ax(i) An extensive hardware overview would be beyond (i) (i) 7: r(32) ← r the scope of this publication. Vast amounts of (i) (i) information are publicly available for both experi- 8: z ←SPOTRSb(L ,LT , r ) (32) (32) (32) (32) enced programmers [9], as well as newcomers to the (i) (i) 9: z ← z(32) field [10, 11]. It is assumed that the reader has some 10: x(i+1) ← x(i) + z(i) familiarity with the architecture. Here, the features 11: until x(i) is accurate enough are mentioned that have the most influence on the implementation presented. aLAPACK name for Cholesky factorization bLAPACK name for symmetric back solve The CELL is a multi-core chip that includes nine 64-bit representation is used in all cases where 32-bit repre- different processing elements. One core, the POWER sentation is not indicated by a subscript. Processing Element (PPE), represents a standard processor design implementing the PowerPC instruction set. The remaining eight cores, the Syner- produces a correction to the computed solution, x. gistic Processing Elements (SPEs), are short vec- The mixed-precision iterative refinement algorithm tor Single Instruction Multiple Data (SIMD) engines using Cholesky factorization is outlined by Algo- with big register files of 128 128-bit vector registers rithm 1. The factorization A = LLT (line 2) and and 256 KB of local memory, referred to as local the solution of the triangular systems Ly = b and store (LS). LT x = y (lines 3 and 8) are computed using single Although standard C code can be compiled for precision arithmetic. The residual calculation (line 6) the execution on the SPEs, the SPEs do not exe- and the update of the solution (line 10) are com- cute scalar code efficiently. For efficient execution, puted using double precision arithmetic and the orig- the code has to be vectorized in the SIMD sense, by inal double precision coefficients. The most compu- using C language vector extensions (intrinsics), or by tationally expensive operations, including the factor- using assembly code. The system’s main memory is ization of the coefficient matrix A and the forward accessible to the PPE through L1 and L2 caches and and backward substitution, are performed using sin- to the SPEs through DMA engines associated with gle precision arithmetic and they take advantage of them.

Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184

Superlu Users' Guide

A New Analysis of Iterative Refinement and Its Application to Accurate Solution of Ill-Conditioned Sparse Linear Systems Carson

Error Bounds from Extra-Precise Iterative Refinement

Three-Precision GMRES-Based Iterative Refinement for Least Squares Problems Carson, Erin and Higham, Nicholas J. and Pranesh, Sr

Using Mixed Precision in Numerical Computations to Speedup Linear Algebra Solvers

A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination

Multiprecision Algorithms 2 / 95 Lecture 1

Mixed Precision Iterative Refinement

Design, Implementation and Testing of Extended and Mixed Precision BLAS ∗

Adaptive Dynamic Precision Iterative Refinement Jun Kyu Lee [email protected]

PARDISO User Guide Version

Mixed-Precision Numerical Linear Algebra Algorithms: Integer Arithmetic Based LU Factorization and Iterative Refinement for Hermitian Eigenvalue Problem