Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184

Jakub Kurzak (1), Alfredo Buttari (1), Jack Dongarra (1,2)

(1) Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996
(2) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831

April 30, 2007

ABSTRACT: The STI CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges to the development of numerical algorithms. One is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization among the short vector SIMD cores. In this work the first challenge is addressed by utilizing a mixed-precision algorithm for the solution of a dense symmetric positive definite system of linear equations, which delivers double precision accuracy while performing the bulk of the work in single precision. The second challenge is approached by introducing much finer granularity of parallelization than has been used for other architectures, and by using lightweight decentralized synchronization. The implementation of the computationally intensive sections gets within 90% of peak floating point performance, while the implementation of the memory intensive sections reaches within 90% of peak memory bandwidth. On a single CELL processor the algorithm achieves over 170 Gflop/s when solving a symmetric positive definite system of linear equations in single precision and over 150 Gflop/s when delivering the result in double precision accuracy.

KEYWORDS: CELL BE, iterative refinement, mixed-precision algorithms, Cholesky factorization

1 Motivation

In numerical computing, there is a fundamental performance advantage in using the single precision floating point data format over the double precision data format, due to the more compact representation, thanks to which twice the number of single precision data elements can be stored at each stage of the memory hierarchy. Short vector SIMD processing provides yet more potential for performance gains from using single precision arithmetic over double precision. Since the goal is to process the entire vector in a single operation, the computation throughput can be doubled when the data representation is halved.

Most processor architectures available today have been at some point augmented with short vector SIMD extensions. Examples include Streaming SIMD Extensions (SSE) for the AMD and Intel lines of processors, PowerPC Velocity Engine / AltiVec / VMX, Sparc Visual Instruction Set (VIS), Alpha Motion Video Instructions (MVI), PA-RISC Multimedia Acceleration eXtensions (MAX), MIPS-3D Application Specific Extensions (ASP), Digital Media Extensions (MDMX), and ARM NEON. The different architectures exhibit big differences in their capabilities. The vector size is either 64 bits or, more commonly, 128 bits. The register file size ranges from just a few to as many as 128 registers. Some extensions only support integer types, others also operate on single precision floating point numbers, and yet others also process double precision values.

Today the Synergistic Processing Element (SPE) of the STI CELL processor [1-3] can probably be considered the state of the art in short vector SIMD processing. Possessing 128-bit wide registers and a fully pipelined fused multiply-add instruction, it is capable of completing eight single precision floating point operations each clock cycle, which, combined with a register file of 128 registers, delivers close to peak performance on many common workloads. At the same time, built with multimedia and embedded applications in mind, the current incarnation of the CELL architecture does not implement double precision arithmetic on a par with single precision performance-wise, which makes the processor a very attractive target for exploring mixed-precision approaches.

Another important phenomenon in recent years has been the gradual shift of focus in processor architecture from aggressive exploitation of instruction level parallelism towards thread-level parallelism, resulting in the introduction of chips with multiple processing units, commonly referred to as multi-core processors. The new architectures deliver the much desired improvement in performance, and at the same time challenge the scalability of existing algorithms and force programmers to seek more parallelism by going to much finer levels of problem granularity. In linear algebra this enforces a departure from the model relying on parallelism encapsulated at the level of BLAS and a shift to more flexible methods of scheduling work.

2 Related Work

Iterative refinement is a well known method for improving the solution of a linear system of equations of the form Ax = b. Typically, a dense system of linear equations is solved by applying a factorization to the coefficient matrix, followed by a back solve. Due to roundoff errors, the solution carries an error related to the condition number of the coefficient matrix. In order to improve the computed solution, an iterative refinement process can be applied, which produces a correction to the computed solution at each iteration. In principle the algorithm can produce a solution correct to the working precision.

Iterative refinement is a fairly well understood concept and was analyzed by Wilkinson [4], Moler [5] and Stewart [6]. Higham gives error bounds for both single and double precision iterative refinement algorithms, where the entire algorithm is implemented with the same precision (single or double, respectively) [7]. He also gives error bounds in single precision arithmetic, with refinement performed in double precision arithmetic. Error analysis for the case described in this work, where the factorization is performed in single precision and the refinement in double precision, is given by Langou et al. [8].

The authors of this work have previously presented an initial implementation of the mixed-precision algorithm for the general, non-symmetric, case using LU factorization on the CELL processor. Although respectable performance numbers were presented, both the factorization and the refinement steps relied on rather classic parallelization approaches. Also, only a somewhat general discussion of algorithmic and implementation details was presented. This work extends the previous presentation by introducing a novel scheme for parallelization of the computational components of the algorithm, and also by describing in much more detail the implementation of both the computation-intensive and the memory-intensive operations.

3 Algorithm

The standard approach to solving symmetric positive definite systems of linear equations is to use the Cholesky factorization. The Cholesky factorization of a real symmetric positive definite matrix A has the form A = LL^T, where L is a real lower triangular matrix with positive diagonal elements. The system is solved by first solving Ly = b (forward substitution), and then solving L^T x = y (backward substitution). In order to improve the accuracy of the computed solution, an iterative refinement process is applied, which produces a correction to the computed solution, x, at each iteration.

Algorithm 1: Solution of a symmetric positive definite system of linear equations using mixed-precision iterative refinement based on Cholesky factorization.

 1: A(32), b(32) ← A, b
 2: L(32), L(32)^T ← SPOTRF(A(32))  [a]
 3: x(32)^(1) ← SPOTRS(L(32), L(32)^T, b(32))  [b]
 4: x^(1) ← x(32)^(1)
 5: repeat
 6:     r^(i) ← b − A x^(i)
 7:     r(32)^(i) ← r^(i)
 8:     z(32)^(i) ← SPOTRS(L(32), L(32)^T, r(32)^(i))  [b]
 9:     z^(i) ← z(32)^(i)
10:     x^(i+1) ← x^(i) + z^(i)
11: until x^(i) is accurate enough

[a] LAPACK name for the Cholesky factorization.
[b] LAPACK name for the symmetric back solve.
The 64-bit representation is used in all cases where the 32-bit representation is not indicated by the (32) subscript.

The mixed-precision iterative refinement algorithm using Cholesky factorization is outlined in Algorithm 1. The factorization A = LL^T (line 2) and the solution of the triangular systems Ly = b and L^T x = y (lines 3 and 8) are computed using single precision arithmetic. The residual calculation (line 6) and the update of the solution (line 10) are computed using double precision arithmetic and the original double precision coefficients. The most computationally expensive operations, including the factorization of the coefficient matrix A and the forward and backward substitution, are performed using single precision arithmetic and take advantage of the single precision speed. The only operations executed in double precision are the residual calculation and the update of the solution.

It can be observed that all operations of O(n^3) computational complexity are handled in single precision, and all operations performed in double precision are of at most O(n^2) complexity. The coefficient matrix A is converted to single precision for the Cholesky factorization. At the same time, the original matrix in double precision is preserved for the residual calculation.

The algorithm described above, and shown as Algorithm 1, is available in the LAPACK 3.1 library and implemented by the routine DSGESV.

4 Implementation

4.1 Essential Hardware Features

An extensive hardware overview would be beyond the scope of this publication. A vast amount of information is publicly available for both experienced programmers [9] and newcomers to the field [10, 11]. It is assumed that the reader has some familiarity with the architecture. Here only those features are mentioned which most influence the implementation presented here.

The CELL is a multi-core chip including nine different processing elements. One core, the POWER Processing Element (PPE), represents a standard processor design implementing the PowerPC instruction set. The remaining eight cores, the Synergistic Processing Elements (SPEs), are short vector Single Instruction Multiple Data (SIMD) engines with big register files of 128 128-bit vector registers and 256 KB of local memory, referred to as the local store (LS). Although standard C code can be compiled for execution on the SPEs, the SPEs do not execute scalar code efficiently. For efficient execution the code has to be vectorized in the SIMD sense, by using C language vector extensions (intrinsics) or by using assembly code. The system's main memory is accessible to the PPE through the L1 and L2 caches and to the SPEs through the DMA engines associated with them. The SPEs can only execute code residing in the local store and can only operate on data in the local store. All data has to be transferred in and out of the local store via DMA transfers. The theoretical computing power of a single SPE is 25.6 Gflop/s in single precision and roughly 1.8 Gflop/s in double precision. Floating point arithmetic follows the IEEE format, with double precision operations complying numerically with the standard and single precision not complying. The theoretical communication speed for a single SPE is 25.6 GB/s. The theoretical peak bandwidth of the main memory is 25.6 GB/s as well.
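For illustration, the following is a minimal, non-optimized sketch of the driver loop of Algorithm 1 written against a generic CBLAS/LAPACK installation; it does not reflect the CELL-specific implementation described in Section 4. The prototypes for spotrf_ and spotrs_ follow the usual Fortran LAPACK convention, but the exact C binding (underscoring, integer widths) depends on the library in use, and the stopping test is simplified to a fixed residual threshold - both are assumptions made only for this sketch.

    #include <stdlib.h>
    #include <math.h>
    #include <cblas.h>

    /* Standard LAPACK Fortran entry points; exact symbol names and integer
       types are installation dependent (assumption made for this sketch). */
    extern void spotrf_(const char *uplo, const int *n, float *a,
                        const int *lda, int *info);
    extern void spotrs_(const char *uplo, const int *n, const int *nrhs,
                        const float *a, const int *lda, float *b,
                        const int *ldb, int *info);

    /* Solve A x = b (A dense SPD, lower triangle referenced, column major)
       with single precision Cholesky and double precision refinement. */
    int dsposv_sketch(int n, const double *A, const double *b,
                      double *x, double tol, int max_iter)
    {
        float  *A32 = malloc((size_t)n * n * sizeof *A32);
        float  *w32 = malloc((size_t)n * sizeof *w32);
        double *r   = malloc((size_t)n * sizeof *r);
        int info, one = 1;

        for (int i = 0; i < n * n; i++) A32[i] = (float)A[i];   /* A(32) <- A  */
        for (int i = 0; i < n; i++)     w32[i] = (float)b[i];   /* b(32) <- b  */

        spotrf_("L", &n, A32, &n, &info);                  /* L L^T = A(32)    */
        spotrs_("L", &n, &one, A32, &n, w32, &n, &info);   /* x(32)            */
        for (int i = 0; i < n; i++) x[i] = (double)w32[i]; /* x(1) <- x(32)    */

        for (int iter = 0; iter < max_iter; iter++) {
            /* r = b - A x in double precision; A is symmetric, lower stored. */
            for (int i = 0; i < n; i++) r[i] = b[i];
            cblas_dsymv(CblasColMajor, CblasLower, n, -1.0, A, n, x, 1, 1.0, r, 1);

            double rnorm = 0.0;
            for (int i = 0; i < n; i++) rnorm = fmax(rnorm, fabs(r[i]));
            if (rnorm < tol) break;                        /* accurate enough  */

            for (int i = 0; i < n; i++) w32[i] = (float)r[i];   /* r(32) <- r  */
            spotrs_("L", &n, &one, A32, &n, w32, &n, &info);    /* z(32)       */
            for (int i = 0; i < n; i++) x[i] += (double)w32[i]; /* x += z      */
        }

        free(A32); free(w32); free(r);
        return info;
    }

On the CELL the single precision factorization and back solves are, of course, not performed by plain LAPACK calls but by the tiled SPE implementation described next; the loop structure and the precision handling, however, are exactly those of Algorithm 1.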

Figure 1: Top: steps of the left-looking Cholesky factorization: T ← T − A × A^T (SSYRK), T = L × L^T (SPOTRF), C ← C − B × A^T (SGEMM), C ← C \ T (STRSM). Bottom: tiling of the operations.
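To make the tiling of Figure 1 concrete, the sketch below spells out one step of the left-looking tile factorization in terms of the four tile kernels, using the CBLAS calls listed later in Table 1. It is a sequential reference formulation only; tile(j,i) is a hypothetical accessor returning a pointer to the 64x64 row-major tile in block row j and block column i, and spotrf_tile() stands for the single-tile Cholesky factorization.

    #include <cblas.h>

    enum { NB = 64 };   /* tile size used throughout this implementation */

    /* Hypothetical helpers, for illustration only. */
    float *tile(int j, int i);       /* pointer to tile (j,i)                 */
    void   spotrf_tile(float *T);    /* factorize one tile, e.g. via spotrf   */

    /* Step j of the left-looking tile Cholesky factorization (see Figure 1),
       expressed with the four tile kernels of Table 1. */
    void cholesky_step(int j, int tiles)
    {
        /* SSYRK: T <- T - A x A^T, accumulating tiles left of the diagonal. */
        for (int i = 0; i < j; i++)
            cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans, NB, NB,
                        -1.0f, tile(j, i), NB, 1.0f, tile(j, j), NB);

        /* SPOTRF: factorize the diagonal tile. */
        spotrf_tile(tile(j, j));

        /* Tiles below the diagonal: SGEMM updates, then the STRSM solve. */
        for (int m = j + 1; m < tiles; m++) {
            for (int i = 0; i < j; i++)              /* C <- C - B x A^T */
                cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, NB, NB, NB,
                            -1.0f, tile(m, i), NB, tile(j, i), NB,
                            1.0f, tile(m, j), NB);
            /* STRSM: C <- C x T^-T */
            cblas_strsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, NB, NB, 1.0f, tile(j, j), NB,
                        tile(m, j), NB);
        }
    }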

The size of the register file and the size of the local store dictate the size of the elementary operation subject to scheduling to the SPEs. The ratio of computing power to memory bandwidth dictates the overall problem decomposition for parallel execution.

4.2 Factorization

A few flavors of Cholesky factorization are known, in particular the right-looking variant, the left-looking variant and the Crout variant [12]. It has also been pointed out that those variants are borders of a continuous spectrum of possible execution paths [13]. Generally, the left-looking factorization is preferred for the following reasons: during the factorization, most of the time is spent in calculating a matrix-matrix product. In the case of the right-looking factorization this product involves a triangular matrix. In the case of the left-looking factorization this product only involves rectangular matrices. It is generally more efficient to implement a matrix-matrix product for rectangular matrices, and it is easier to balance the load in parallel execution. The implementation presented here is derived from the left-looking formulation of the Cholesky factorization, which follows the implementation of the LAPACK routine SPOTRF.

Due to the limited size of the local store, all numerical operations have to be decomposed into elementary operations which fit in the local store. The simplicity of implementing Cholesky factorization lies in the fact that it can be easily decomposed into tile operations - operations on fixed-size submatrices which take from one to three tiles as input and produce one tile as output. These elementary operations will further be referred to as tile kernels. Figure 1 presents the steps of the left-looking Cholesky factorization and how each step is broken down into tile operations.

4.2.1 Computational Kernels

Implementation of the tile kernels assumes a fixed size of the tiles. Smaller tiles (finer granularity) have a positive effect on scheduling for parallel execution and facilitate better load balance and higher parallel efficiency. Bigger tiles provide better performance in sequential execution on a single SPE. In the case of the CELL chip the crossover point is rather simple to find for problems in dense linear algebra. From the standpoint of this work, the most important operation is matrix multiplication in single precision. It turns out that this operation can achieve within a few percent of peak performance on a single SPE for matrices of size 64x64. The fact that the peak performance can be achieved for a tile of such a small size has to be attributed to the big size of the register file and fast access to the local store, undisturbed by any intermediate levels of memory. Also, such a matrix occupies a 16 KB block of memory, which is the maximum size of a single DMA transfer. Eight such matrices fit in half of the local store, providing enough flexibility for multi-buffering and, at the same time, leaving enough room for the code. A discussion of tile size considerations was also presented before by the authors of this publication [14]. Table 1 represents the Cholesky tile kernels for a tile size of 64x64 as BLAS and LAPACK calls.

    T <- T - A x A^T     cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans,
                                     64, 64, -1.0, A, 64, 1.0, T, 64);
    C <- C - B x A^T     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                                     64, 64, 64, -1.0, B, 64, A, 64, 1.0, C, 64);
    B <- B x T^-T        cblas_strsm(CblasRowMajor, CblasRight, CblasLower,
    (B = B / T^T) [a]                CblasTrans, CblasNonUnit, 64, 64,
                                     1.0, T, 64, B, 64);
    T <- L x L^T         spotrf(lapack_lower, 64, trans(T), 64, &info); [b]

    [a] using MATLAB notation
    [b] using the LAPACK C interface by R. Delmas and J. Langou,
        http://icl.cs.utk.edu/~delmas/lapwrapc.html

Table 1: Single precision Cholesky tile kernels.

It has already been pointed out that a few derivations of the Cholesky factorization are known, in particular the right-looking variant, the left-looking variant and the Crout variant [12]. Dynamic scheduling of operations is another possibility. No matter, however, which static variant is used, or whether dynamic execution is used, the same set of tile kernels is needed. The change from one to another only alters the order in which the tile operations execute.

All the kernels were developed using SIMD C language extensions. Extra effort was invested into optimization of the matrix multiplication (SGEMM*) kernel performing the calculation C = C - B x A^T, since this operation dominates the execution time. All the tile kernels were written by consistently applying the idea of blocking with a block size of four. A short discussion of each case follows.

(* Tile kernels are referred to using the names of the BLAS and LAPACK routines implementing the same functionality.)

The SSYRK kernel applies a symmetric rank-k update to a tile. In other words, it performs the operation T <- T - A x A^T, where only the lower triangular part of T is modified. The SSYRK kernel consists of two nested loops, where in each inner iteration a 4x4 block of the output tile is produced and the body of the inner loop is completely unrolled. The #define and nested #define directives are used to create a single basic block - a block of straight-line code. Static offsets are used within the basic block, and pointer arithmetic is used to advance pointers between iterations.

Algorithm 2: SSYRK tile kernel T <- T - A x A^T.
 1: for j = 0 to 15 do
 2:     for i = 0 to j - 1 do
 3:         Compute block [j,i]
 4:         Permute/reduce block [j,i]
 5:     end for
 6:     Compute block [j,j]
 7:     Permute/reduce block [j,j]
 8: end for
(block is a 4x4 submatrix of tile T.)

The construction of the SSYRK tile kernel is presented by Algorithm 2. The unrolled code consists of four distinct segments. A computation segment (line 3) includes only multiply and add operations to calculate a product of two 4x64 blocks of tile A and produces 16 4-element vectors as output. A permutation/reduction segment (line 4) performs transpositions and reductions on these 16 vectors and delivers the 4 4-element vectors constituting the 4x4 block of the output tile. The two segments mentioned above handle the off-diagonal, square blocks (lines 3 and 4), and two more segments handle the triangular, diagonal blocks (lines 6 and 7). It is an elegant and compact, but suboptimal, design.

The SGEMM kernel performs the operation C <- C - B x A^T, which is very similar to the operation performed by the SSYRK kernel. One difference is that it updates the tile C with a product of two tiles, B and A^T, and not the tile A and its transpose.

The second difference is that the output is the entire tile and not just its lower triangular portion. Also, since the SGEMM kernel is performance-critical, it is subject to more optimizations compared to the SSYRK kernel.

The main idea behind the further optimization of the SGEMM kernel comes from the fact that the SPE is a dual-issue architecture, where arithmetic operations can execute in parallel with permutations of vector elements. Thanks to this fact a pipelined execution can be implemented, where the operations of the permute/reduce segment from iteration k can be mixed with the operations of the compute segment from iteration k + 1. The two nested loops used for SSYRK are replaced with a single loop, where the 256 4x4 blocks of the output tile are produced in linear row-major order, which results in Algorithm 3.

Algorithm 3: SGEMM tile kernel C <- C - B x A^T.
 1: Compute block 0
 2: for i = 1 to 127 do
 3:     Permute/reduce block 2i - 2 and compute block 2i - 1
 4:     Permute/reduce block 2i - 1 and compute block 2i
 5: end for
 6: Permute/reduce block 254 and compute block 255
 7: Permute/reduce block 255
(block is a 4x4 submatrix of tile C.)

The STRSM kernel computes a triangular solve with multiple right-hand sides, B <- B x T^-T. The computation is conceptually easiest to SIMD'ize if the same step is applied at the same time to different right-hand sides. This can be easily achieved if the memory layout of tile B is such that each 4-element vector contains elements of the same index of different right-hand sides. Since this is not the case here, each 4x4 block of the tile is transposed, in place, before and after the main loop implementing the solve. The operation introduces a minor overhead, but allows for a very efficient implementation of the solve - one which achieves a good ratio of the peak with small and simple code.

Algorithm 4 presents the structure of the code implementing the triangular solve, which is an application of the lazy variant of the algorithm. The choice of the lazy algorithm versus the aggressive algorithm is arbitrary. The code includes two segments of fully unrolled code, which both operate on 64x4 blocks of data. The outer loop segment (line 5) produces a 64x4 block j of the solution. The inner loop segment (line 3) applies the update of step i to block j.

Algorithm 4: STRSM tile kernel B <- B x T^-T.
 1: for j = 0 to 15 do
 2:     for i = 0 to j - 1 do
 3:         Apply block i towards block j
 4:     end for
 5:     Solve block j
 6: end for
(block is a 64x4 submatrix of tile B.)

The SPOTRF kernel computes the Cholesky factorization, T = L x L^T, of a 64x64 tile. It is the lazy variant of the algorithm, more commonly referred to as the left-looking factorization, where updates to the trailing submatrix do not immediately follow the panel factorization. Instead, updates are applied to a panel right before the factorization of that panel.

Algorithm 5: SPOTRF tile kernel T <- L x L^T.
 1: for k = 0 to 15 do
 2:     for i = 0 to k - 1 do
 3:         SSYRK (apply block [k,i] to block [k,k])
 4:     end for
 5:     SPOTF2 (factorize block [k,k])
 6:     for j = k to 15 do
 7:         for i = 0 to k - 1 do
 8:             SGEMM (apply block [j,i] to block [j,k])
 9:         end for
10:     end for
11:     for j = k to 15 do
12:         STRSM (apply block [k,k] to block [j,k])
13:     end for
14: end for
(block is a 4x4 submatrix of tile T.)
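The compute segment that the SSYRK and SGEMM kernels build on (Algorithms 2 and 3) can be sketched as below, assuming an SPU compiler and the spu_intrinsics.h header from the CELL SDK. The loops are shown rolled for brevity; in the actual kernels they are fully unrolled with #define macros into a single basic block, and the permute/reduce segment is interleaved with the next compute segment as described above.

    #include <spu_intrinsics.h>

    /* Sketch of the "compute" segment: two 4x64 blocks stored as 16 vector
       floats per row produce the 16 accumulator vectors for one 4x4 block
       of the output tile (SIMD lanes run along the 64-long dimension). */
    void compute_block(const vector float b[4][16],   /* 4x64 block of tile B */
                       const vector float a[4][16],   /* 4x64 block of tile A */
                       vector float acc[4][4])        /* 16 partial sums      */
    {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                acc[i][j] = spu_splats(0.0f);

        for (int k = 0; k < 16; k++)           /* walk the 64-long dimension  */
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)    /* fused multiply-add          */
                    acc[i][j] = spu_madd(b[i][k], a[j][k], acc[i][j]);

        /* The permute/reduce segment (not shown) transposes these 16 vectors,
           sums each across its four lanes and subtracts the 4x4 result from
           the output tile. */
    }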

It could be expected that the SPOTRF kernel would be implemented using Level 2 BLAS operations, as this is the usual way of implementing panel factorizations. Such a solution would, however, lead to code which is difficult to SIMD'ize and yields very poor performance. Instead, the implementation of this routine is simply an application of the idea of blocking with a block size equal to the SIMD vector size of 4. Algorithm 5 presents the structure of the SPOTRF tile kernel.

    Kernel   Source code   Compiler and flags   Object code   Execution time   Execution rate   Fraction of peak
             [LOC]                              [KB]          [us]             [Gflop/s]        [%]
    SSYRK    160           spuxlc [a] -O3       4.7           13.23            20.12            78
    SGEMM    330           spu-gcc [b] -Os      9.0           22.78            23.01            90
    STRSM    310           spuxlc [a] -O3       8.2           16.47            15.91            62
    SPOTRF   340           spu-gcc [b] -O3      4.1           15.16             5.84            23

    [a] version 1.0 (SDK 1.1)
    [b] version 4.0.2 (toolchain 3.2 - SDK 1.1)

Table 2: Performance of single precision Cholesky factorization tile kernels.

Table 2 compares the tile kernels in terms of source and object code size and performance. Although performance is the most important metric, code size is not without meaning, due to the limited size of the local store. Despite the fact that code can be replaced in the local store at runtime, it is desirable that the entire code implementing the single precision factorization fits into the local store at the same time. Code motion during the factorization would both complicate the code and adversely affect performance.

Although the matrix multiplication achieves quite good performance - 23 Gflop/s, which is 90% of the peak - there is no doubt that yet better performance could be achieved by using assembly code instead of C language SIMD extensions. Performance in excess of 25 Gflop/s has been reported for similar, although not identical, SGEMM kernels [15]. It is remarkable that this performance can be achieved for operations of such small granularity, which has to be attributed to the unique features of the CELL architecture, especially the register file and the memory organization.

It is worth noting that, although significant effort was required to optimize the SGEMM kernel (and yet more would be required to further optimize it), the other kernels required a rather small programming effort in a higher level language to deliver satisfactory performance (execution time shorter than the execution time of the SGEMM kernel). This means that the Pareto principle (also known as the 80-20 rule) applies very well in this case. Only a small portion of the code requires strenuous optimizations for the application to deliver close to peak performance.

4.2.2 Parallelization

The presentation of the parallelization of the Cholesky factorization needs to be preceded by a discussion of memory bandwidth considerations.

The SGEMM kernel can potentially execute at a rate very close to 25.6 Gflop/s on a single SPE. In such a case it performs its 2 x 64^3 = 524288 operations in 20.48 us. The operation requires the transmission of three tiles from main memory to the local store (tiles A, B and C) and the transmission of one tile from the local store to main memory (the updated tile C), consuming bandwidth equal to:

    4 tiles x 64^2 x 4 B (sizeof(float)) / 20.48 us = 3.2 GB/s.

This means that 8 SPEs performing the SGEMM operation at the same time will require bandwidth of 8 x 3.2 GB/s = 25.6 GB/s, which is the theoretical peak main memory bandwidth.

It has been shown that arithmetic can execute almost at the theoretical peak rate on the SPE. At the same time it would not be realistic to expect theoretical peak bandwidth from the memory system. By the same token, data reuse has to be introduced into the algorithm to decrease the load on the main memory.
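The balance just derived can be checked, and varied, with a few lines of arithmetic. The sketch below recomputes the 3.2 GB/s per-SPE requirement and shows how it changes when the kernel runs at its measured 23 Gflop/s instead of the 25.6 Gflop/s peak; any other tile sizes or rates plugged in are purely illustrative.

    #include <stdio.h>

    /* Main-memory bandwidth needed to keep one SGEMM tile kernel busy:
       4 tile transfers (read A, B, C; write C) per 2*nb^3 flops. */
    static double sgemm_bw_gbs(int nb, double gflops)
    {
        double bytes = 4.0 * nb * nb * sizeof(float);        /* per kernel call */
        double secs  = 2.0 * nb * nb * nb / (gflops * 1e9);  /* time per call   */
        return bytes / secs / 1e9;                           /* GB/s            */
    }

    int main(void)
    {
        printf("%.2f GB/s per SPE at peak\n",  sgemm_bw_gbs(64, 25.6)); /* 3.20  */
        printf("%.2f GB/s for 8 SPEs\n", 8.0 * sgemm_bw_gbs(64, 25.6)); /* 25.60 */
        printf("%.2f GB/s per SPE at 23 Gflop/s\n", sgemm_bw_gbs(64, 23.0));
        return 0;
    }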

A straightforward solution is to introduce 1D processing, where each SPE processes one row of tiles of the coefficient matrix at a time. Please follow Figure 1 for the following discussion. In the SSYRK part, one SPE applies a row of tiles A to the diagonal tile T, followed by the SPOTRF operation (factorization) of the tile T. Tile T only needs to be read in at the beginning of this step and written back at the end. The only transfers taking place in between are reads of tiles A. Similarly, in the SGEMM part, one SPE applies a row of tiles A and a row of tiles B to tile C, followed by the STRSM operation (triangular solve) on tile C using the diagonal, triangular tile T. Tile C only needs to be read in at the beginning of this step and written back at the end. Tile T only needs to be read in right before the triangular solve. The only transfers taking place in between are reads of tiles A and B. Such work partitioning approximately halves the load on the memory system.

It may also be noted that 1D processing is a natural match for the left-looking factorization. In the right-looking factorization the update to the trailing submatrix can easily be partitioned in two dimensions. However, in the case of the left-looking factorization, 2D partitioning would not be feasible due to the write dependency on the panel blocks (tiles T and C).

1D partitioning introduces a load balancing problem. With work being distributed by block rows, in each step of the factorization a number of processors is going to be idle, equal to the number of block rows factorized in a particular step modulo the number of processors. Figure 2 shows three consecutive steps of a factorization and the processors being occupied and idle in these steps. Such behavior is going to put a harsh upper bound on achievable performance.

Figure 2: Load imbalance caused by 1D processing.

It can be observed, however, that at each step of the factorization a substantial amount of work can be scheduled, to the otherwise idle processors, from the upcoming steps of the factorization. The only operations which cannot be scheduled at a given point in time are those that involve panels that have not been factorized yet. This situation is illustrated in Figure 3. Of course, this kind of processing requires dependency tracking in two dimensions, but since all operations proceed at the granularity of tiles, this does not pose a problem.

Figure 3: Pipelining of factorization steps.

In the implementation presented here all SPEs follow a static schedule, presented in Figure 3, with cyclic distribution of work from the steps of the factorization. In this case a static schedule works well, due to the fact that the performance of the SPEs is very deterministic (unaffected by any non-deterministic phenomena, like cache misses). This way the potential bottleneck of a centralized scheduling mechanism is avoided.

Figure 4 presents the execution chart (Gantt chart) of the factorization of a 1024x1024 matrix. Colors correspond to the ones in Figure 1. The two shades of gray distinguish the SGEMM operation in odd and even steps of the factorization. The yellow color marks the barrier operation, which corresponds to the load imbalance of the algorithm.

Figure 4: Execution chart of Cholesky factorization of a matrix of size 1024x1024. The color scheme follows the one from Figure 1. Different shades of green correspond to odd and even steps of the factorization.

It can be observed that the load imbalance is minimal (the yellow region), dependency stalls are minimal (the white gaps), and the overlapping of communication and computation is very good (the colored blocks represent purely computational blocks).

4.2.3 Synchronization

With the SPEs following a static schedule, synchronization is required such that an SPE does not proceed if the data dependencies for an operation are not satisfied.

The following dependencies have to be tracked: the SSYRK and SGEMM operations cannot use as input tiles A and B that have not been factorized yet. The off-diagonal tiles A and B are factorized if the STRSM operation has completed on these tiles. In turn, the STRSM operation cannot use as input a diagonal tile T that has not been factorized yet. A diagonal tile T is factorized if the SPOTRF operation has completed on this tile.

Dependency tracking is implemented by means of a replicated progress table. The progress table is a 2D triangular array with each entry representing a tile of the coefficient matrix and specifying if the tile has been factorized or not, which means completion of the SPOTRF operation for diagonal tiles and the STRSM operation for off-diagonal tiles.

By replication of the progress table, the potential bottleneck of a centralized progress tracking mechanism is avoided. Each SPE can check dependencies by testing an entry in its local copy of the progress table. The SPE completing factorization of a tile updates the progress tables of all other SPEs, which is done by a DMA transfer and introduces no overhead due to the non-blocking nature of these transfers. Since the smallest amount of data subject to a DMA transfer is one byte, the progress table entries are one byte in size. These transfers consume an insignificant amount of bandwidth of the EIB and their latency is irrelevant to the algorithm.

4.2.4 Communication

The most important feature of communication is double-buffering of data. With eight tile buffers available, each operation involved in the factorization can be double-buffered independently.

Thanks to this fact, double-buffering is implemented not only between operations of the same type, but also between different operations. In other words, data is always prefetched for the upcoming operation, no matter what operation it is. In the absence of dependency stalls the SPEs never wait for data, which results in big portions of the execution chart without any gaps between computational blocks (Figure 5).

Figure 5: Magnification of a portion of the execution chart from Figure 4.
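The replicated progress table of Section 4.2.3 can be sketched roughly as follows, assuming the spu_mfcio.h interface from the CELL SDK. The effective addresses of the other SPEs' tables, the rank variables and the full square (rather than triangular) table are hypothetical simplifications introduced only for this illustration; a real implementation would also handle alignment and tag management explicitly.

    #include <spu_mfcio.h>
    #include <stdint.h>

    enum { MAX_TILES = 64, PROGRESS_TAG = 5 };  /* 64 tiles per dimension for a
                                                   4096x4096 matrix            */

    /* Local copy of the replicated progress table: one byte per tile, set
       once SPOTRF (diagonal) or STRSM (off-diagonal) has completed.  Remote
       DMA puts from other SPEs land directly in this array. */
    static uint8_t progress[MAX_TILES][MAX_TILES] __attribute__((aligned(16)));

    /* Hypothetical: local-store addresses of the other SPEs' tables and the
       SPE configuration, exchanged through the PPE at start-up. */
    extern uint64_t progress_ea[8];
    extern int num_spes, my_rank;

    /* Dependency check: a plain read of the local copy, no communication. */
    static inline int tile_factorized(int j, int i)
    {
        return progress[j][i] != 0;
    }

    /* Completion broadcast: write our one-byte entry into every other SPE's
       copy with a non-blocking DMA put. */
    static inline void mark_factorized(int j, int i)
    {
        progress[j][i] = 1;
        for (int r = 0; r < num_spes; r++) {
            if (r == my_rank) continue;
            uint64_t ea = progress_ea[r] + (uint64_t)(j * MAX_TILES + i);
            mfc_put(&progress[j][i], ea, 1, PROGRESS_TAG, 0, 0);
        }
    }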

Tiles are never exchanged internally between local stores, but always read from main memory and written to main memory. An attempt to do otherwise could tie up buffers in the local store and prevent asynchronous operation of the SPEs. At the same time, with the work partitioning implemented here, the memory system provides enough bandwidth to fully overlap communication and computation.

Reads of tiles involve dependency checks. When it comes to prefetching of a tile, the dependency is tested and a DMA transfer is initiated if the dependency is satisfied. The DMA completion is tested right before processing of the tile. If the dependency is not satisfied in time for the prefetch, the prefetch is abandoned in order not to stall the execution. Instead, right before processing of the tile, the SPE busy-waits for the dependency and then transfers the tile in a blocking way (initiates the transfer and immediately waits for its completion).
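The prefetch-or-fall-back logic described above, combined with double-buffering, can be sketched as below, again assuming the SDK's spu_mfcio.h and the progress-table check from the previous sketch. Buffer ownership and tag allocation are simplified; the names are illustrative only.

    #include <spu_mfcio.h>
    #include <stdint.h>

    enum { NB = 64, TILE_BYTES = NB * NB * sizeof(float) };  /* 16 KB per tile */

    /* Two local-store buffers per stream: buffer[b] is filled for the next
       operation while the current one is being computed on. */
    static float buffer[2][NB * NB] __attribute__((aligned(128)));

    extern int tile_factorized(int j, int i);   /* progress-table check */

    /* Try to prefetch tile (j,i) from main memory into buffer b.  Returns 1
       if the non-blocking DMA was started, 0 if the dependency was not yet
       satisfied and the prefetch was abandoned. */
    int prefetch_tile(uint64_t tile_ea, int j, int i, int b, int tag)
    {
        if (!tile_factorized(j, i))
            return 0;
        mfc_get(buffer[b], tile_ea, TILE_BYTES, tag, 0, 0);
        return 1;
    }

    /* Right before the tile is processed: either wait for the prefetch to
       complete, or fall back to a blocking transfer (busy-wait on the
       dependency, then issue the DMA and wait for it immediately). */
    float *acquire_tile(uint64_t tile_ea, int j, int i, int b, int tag,
                        int prefetched)
    {
        if (!prefetched) {
            while (!tile_factorized(j, i))
                ;                               /* busy-wait on dependency */
            mfc_get(buffer[b], tile_ea, TILE_BYTES, tag, 0, 0);
        }
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();              /* wait for DMA completion */
        return buffer[b];
    }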

4.2.5 Performance

Figure 6 shows the performance of the single precision Cholesky factorization, calculated as the ratio of the number of floating point operations, taken as N^3/3, where N is the size of the input matrix, to the execution time.

Figure 6: Performance of single precision Cholesky factorization.

Table 3 gives numerical performance values for selected matrix sizes in Gflop/s, and also as ratios relative to the peak of the processor of 204.8 Gflop/s and the peak of the SGEMM kernel of 8 x 23.01 = 184.1 Gflop/s.

    Size   Gflop/s   % of CELL peak   % of SGEMM peak
     512      92           45               50
     640     113           55               62
    1024     151           74               82
    1280     160           78               87
    1536     165           80               89
    1664     166           81               90
    2176     170           83               92
    4096     175           85               95

Table 3: Selected performance points of single precision Cholesky factorization.

The factorization achieves 95% of the peak of the SGEMM kernel, which means that the overheads of data communication, synchronization and load imbalance are minimal, and at this point the only inefficiency comes from the suboptimal performance of the SGEMM kernel itself. Hopefully in the future the kernel will be fully optimized, perhaps using hand-coded assembly.

4.3 Refinement

The two most expensive operations of the refinement are the back solve (Algorithm 1, steps 3 and 8) and the residual calculation (Algorithm 1, step 6).

The back solve consists of two triangular solves involving the entire coefficient matrix and a single right-hand side (BLAS STRSV operation). The residual calculation is a double precision matrix-vector product using a symmetric matrix (BLAS DSYMV operation).

Both operations are BLAS Level 2 operations and on most processors would be strictly memory-bound. The STRSV operation actually is a perfect example of a strictly memory-bound operation on the CELL processor.

However, the DSYMV operation is on the border line of being memory bound versus compute bound, due to the very high speed of the memory system versus the relatively low performance of double precision arithmetic.

4.3.1 Triangular Solve

The triangular solve is a perfect example of a memory-bound workload, where the memory access rate sets the upper limit on achievable performance. The STRSV performs approximately two floating point operations per data element of four bytes, which means that the peak memory bandwidth of 25.6 GB/s allows for at most:

    25.6 GB/s x 2 ops/float / 4 bytes/float = 12.8 Gflop/s,

which is only 1/16, or 6.25%, of the single precision floating point peak of the processor. Owing to this fact, the goal of implementing memory-bound operations is to get close to the peak memory bandwidth, unlike for compute-bound operations, where the goal is to get close to the floating point peak. This task should be readily achievable, given that a single SPE possesses as much bandwidth as the main memory. Practice shows, however, that a single SPE is not capable of generating enough traffic to fully exploit the bandwidth, and a few SPEs solving the problem in parallel should be used. An efficient parallel implementation of the STRSV operation has to pursue two goals: continuous generation of traffic in order to saturate the memory system, and aggressive pursuit of the algorithmic critical path in order to avoid dependency stalls. A related question is that of the desired level of parallelism - the optimal number of processing elements used. Since the triangular solve is rich in dependencies, increasing the number of SPEs increases the number of execution stalls caused by interprocessor dependencies. Obviously, there is a crossover point, a sweet spot, for the number of SPEs used for the operation.

Same as the other routines in the code, the STRSV operation processes the coefficient matrix by 64x64 tiles. A triangular solve is performed on the diagonal tiles and a matrix-vector product (SGEMV equivalent) is performed on the off-diagonal tiles. Processing of the diagonal tiles constitutes the critical path in the algorithm. One SPE is solely devoted to processing of the diagonal tiles, while the goal of the others is to saturate the memory system with processing of the off-diagonal tiles. The number of SPEs used to process the off-diagonal tiles is a function of a few parameters, the efficiency of the computational kernels used being one of the factors. In this case the number four turned out to deliver the best results, with one SPE pursuing the critical path and three others fulfilling the task of memory saturation. Figure 7 presents the distribution of work in the triangular solve routines.

Figure 7: Distribution of work in the triangular solve routine.

The solve is done in place. The unknown/solution vector is read in its entirety by each SPE to its local store at the beginning and returned to the main memory at the end. As the computation proceeds, updated pieces of the vector are exchanged internally by means of direct, local store to local store, communication. At the end SPE 0 possesses the full solution vector and writes it back to the main memory. Synchronization is implemented analogously to the synchronization in the factorization and is based on the replicated triangular progress table (the same data structure is reused).

Figure 8 shows the performance, in terms of GB/s, of the two triangular solve routines required in the solution/refinement step of the algorithm. The two routines perform slightly differently due to the different performance of their computational kernels. The figure shows clearly that the goal of saturating the memory system is achieved quite well. Performance as high as 23.77 GB/s is obtained, which is 93% of the peak.
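The 1 + 3 split between the critical path and memory saturation can be expressed as a simple ownership rule over tiles. The mapping below is only one plausible arrangement consistent with Figure 7; the exact assignment used in the actual code is not spelled out in the text and the function is therefore purely illustrative.

    /* Which SPE handles tile (row, col) of the lower triangular matrix during
       the parallel STRSV: SPE 0 owns the diagonal tiles (the critical path),
       SPEs 1..3 share the off-diagonal tiles row-cyclically.  Illustrative
       mapping only, not necessarily the one used by the authors. */
    static inline int strsv_owner(int row, int col)
    {
        if (row == col)
            return 0;                 /* diagonal tile: triangular solve      */
        return 1 + (row % 3);         /* off-diagonal tile: SGEMV-like update */
    }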

11 dependencies on the output vector and in order to let Memory peak 24 each SPE proceed without stalls, the output vector is replicated on each SPE, and followed by a reduction 20 step. The reduction is also performed in parallel by all SPEs and introduces a very small overhead, com- 16 pensated by the benefits of very good load balance. s / B

G 12 x y

8 0 1 0

4 2 1 A 3 2 reduce 0 1 2 3 0 3 0 0 1000 2000 3000 4000 1 Size 2 3 Figure 8: Performance of the triangular solve rou- tines. Figure 9: Distribution of work in the matrix-vector product routine. 4.3.2 Matrix-Vector Product Figure 10 shows the performance, in terms of GB/s For most hardware platforms the matrix-vector prod- of the double precision symmetric matrix-vector uct would be a memory-bound operation, the same product routine. The performance of 20.93 GB/s as the triangular solve. On the CELL processor, how- is achieved, which is 82 % of the peak. Although ever, due to the relative slowness of the double pre- DSYMV routine represents a Level 2 BLAS operation cision arithmetic, the operation is on the border of and is parallelized among all SPEs, it is still compute being memory-bound and compute-bound, and even bound. Perhaps its computational components could with use of all eight SPEs, saturation of the memory be further optimized. Nevertheless, at this point the is harder than in the case of the STRSV routine. delivered level of performance is considered satisfac- The DSYMV routine also operates on tiles. Here, tory however, the double precision representation of the coefficient matrix is used with tile size of 32×32, such that an entire tile can also be transferred with a single 5 Limitations DMA request. The input vector is only read in once, at the beginning, in its entirety, and the output vector The implementation presented here should be con- is written back after the computation is completed. sidered a proof of concept prototype with the pur- Since the input matrix is symmetric, only the lower pose of establishing the upper bound on performance portion is accessed and implicit transposition is used achievable for mixed-precision dense linear algebra to reflect the effect of the upper portion - each tile is algorithms on the CELL processor. As such, it has read in only once, but applied to the output vector a number of limitations. Only systems of sizes be- twice (with and without transposition). ing multiplicities of 64 are accepted and only sys- Since, load balance is a very important aspect of tems with a single right hand side are supported. the implementation, work is split among SPEs very There are no tests for overflow during conversions evenly by applying the partitioning reflected on Fig- from double to single precision. There is no test for ure 9. Such work distribution causes multiple write non-definiteness during the single precision factoriza-

12 Memory peak 200 24 SP peak

175 SGEMM peak 20 SPOSV SPOTRF 150 DSPOSV

16 125 /s s / op B

fl 100

G 12 G 75 8 50

4 25 DP peak

0 0 0 1000 2000 3000 4000 0 1000 2000 3000 4000 Size Size

Figure 10: Performance of the matrix-vector product Figure 11: Performance of the mixed-precision algo- routine. rithm versus the single precision algorithm on IBM CELL blade system. tion step. The current implementation is wasteful in its use of the main memory. The entire coefficient In all cases well conditioned input matrices were used, matrix is stored explicitly without taking advantage resulting in two steps of refinement delivering accu- of its symmetry, both in single precision representa- racy equal or higher than the one delivered by the tion and double precision representation. The maxi- purely double precision algorithm. mum size of the coefficient matrix is set to 4096, what At the maximum size of 4096 the factorization makes it possible to fit the progress table in each lo- achieves 175 Gflop/s and the system solution runs cal store and also makes it possible to fit the entire at the relative speed of 171 Gflop/s. At the same unknown/solution vector in each local store, which time, the solution of the system in double precision facilitates internal, local store to local store, commu- using the refinement technique delivers the relative nication and is very beneficial for performance. performance of 156 Gflop/s, which is an overhead of less than 9 % compared to the solution of the system in single precision. It can also be pointed out that by 6 Results and Discussion using the mixed-precision algorithm double precision results are delivered at a speed over 10 times greater Figure 11 compares the performance of single preci- than the double precision peak of the processor. sion factorization (SPOTRF), solution of the system Figure 12 shows results obtained on the Sony in single precision (SPOSV) and solution of the sys- PlayStation 3, using the six available SPEs and tem in double precision by using factorization in sin- 256 KB1 available memory, also allocated using huge gle precision and iterative refinement to double pre- pages (16 MB). For the maximum problem size of cision (DSPOSV). These results were obtained on an 2048 the performance of 127 Gflop/s was achieved IBM CELL blade using one of the two available CELL for the factorization, 122 Gflop/s for the single preci- processors. Huge memory pages (16 MB) were used sion solution and 104 Gflop/s for the double precision for improved performance. The performance is calcu- solution. In this case the double precision solution lated as the ratio of the execution time to the number of floating point operations, set in all cases to N 3/3. 1only approximately 200 KB available to the application

13 160 torizations where the panel does not easily split into SP peak independent operations, like the factorizations where 140 SGEMM peak pivoting is used. 120 Another important question is the one of replac- SPOTRF SPOSV ing the static scheduling of operations with dynamic 100 DSPOSV scheduling by an automated system and also the im- /s pact of such mechanisms on programming ease and

op 80 fl productivity. G 60

40 9 Acknowledgements

20 DP peak The authors thank Gary Rancourt and Kirk Jor- 0 dan at IBM for taking care of our hardware needs 0 500 1000 1500 2000 and arranging for partial financial support for this Size work. The authors are thankful to numerous IBM researchers for generously sharing their CELL exper- Figure 12: Performance of the mixed-precision algo- teese, in particular Sidney Manning, Daniel Broken- rithm versus the single precision algorithm on Sony shire, Mike Kistler, Gordon Fossum, Thomas Chen PlayStation 3. and Michael Perrone. comes at the cost of roughly 15 % overhead compared References to the single precision solution. [1] H. P. Hofstee. Power efficient processor archi- tecture and the Cell processor. In Proceedings of 7 Conclusions the 11th Int’l Symposium on High-Performance Computer Architecture, 2005. The CELL Broadband Engine has a very high poten- tial for dense linear algebra workloads offering a very [2] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. high peak floating point performance and a capabil- Johns, T. R. Maeurer, and D. Shippy. Introduc- ity to deliver performance close to the peak for quite tion to the Cell multiprocessor. IBM J. Res. & small problems. The same applies to the memory sys- Dev., 49(4/5):589–604, 2005. tem of the CELL processor, which allows to achieve data transfer rates very close to the peak bandwidth [3] IBM. Cell Broadband Engine Architecture, Ver- for memory-intensive workloads. sion 1.0, August 2005. Although the double precision performance of the [4] J. H. Wilkinson. Rounding Errors in Algebraic CELL processor is much lower than the single pre- Processes. Prentice-Hall, 1963. cision performance, mixed-precision algorithms allow to exploit the single precision speed while achieving [5] C. B. Moler. Iterative refinement in floating full double precision accuracy. point. J. ACM, 14(2):316–321, 1967.

[6] G. W. Stewart. Introduction to Matrix Computations. Academic Press, 1973.

One of the main considerations for the future is appli- [7] N. J. Higham. Accuracy and Stability of Numer- cation of the pipelined processing techniques to fac- ical Algorithms. SIAM, 1996.

[8] J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.

[9] IBM. Cell Broadband Engine Programming Handbook, Version 1.0, April 2006.

[10] IBM. Cell Broadband Engine Programming Tutorial, Version 2.0, December 2006.

[11] A. Buttari, P. Luszczek, J. Kurzak, J. J. Dongarra, and G. Bosilca. A rough guide to scientific computing on the PlayStation 3, version 1.0. Technical Report UT-CS-07-595, Computer Science Department, University of Tennessee, 2007. http://www.cs.utk.edu/library/TechReports/2007/ut-cs-07-595.pdf.

[12] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Numerical Linear Algebra for High-Performance Computers. SIAM, 1998.

[13] J. Kurzak and J. J. Dongarra. Implementing linear algebra routines on multi-core processors with pipelining and a look-ahead. In Proceedings of the 2006 Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA'06), Umeå, Sweden, 2006. Lecture Notes in Computer Science series, Springer, 2007 (to appear).

[14] J. Kurzak and J. J. Dongarra. Implementation of mixed precision in solving systems of linear equations on the CELL processor. Concurrency Computat.: Pract. Exper., 2007. In press, DOI: 10.1002/cpe.1164.

[15] T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broadband Engine architecture and its first implementation, a performance view. http://www-128.ibm.com/developerworks/power/library/pa-cellperf/, November 2005.
