SIMD Optimization of Linear Expressions for Programmable Graphics Hardware

Volume 23 (2004), Number 4 pp. 1–18 SIMD Optimization of Linear Expressions for Programmable Graphics Hardware Chandrajit Bajajy and Insung Ihmz and Jungki Minz and Jinsang Ohz y Department of Computer Science, University of Texas at Austin, Texas, U.S.A. zDepartment of Computer Science, Sogang University, Seoul, Korea Abstract The increased programmability of graphics hardware allows efficient GPU implementations of a wide range of general computations on commodity PCs. An important factor in such implementations is how to fully exploit the SIMD computing capacities offered by modern graphics processors. Linear expressions in the form of y¯ = Ax¯+ b¯, where A is a matrix, and x¯, y¯, and b¯ are vectors, constitute one of the most basic operations in many scientific computations. In this paper, we propose a SIMD code optimization technique that enables efficient shader codes to be generated for evaluating linear expressions. It is shown that performance can be improved considerably by efficiently packing arithmetic operations into four-wide SIMD instructions through reordering of the operations in linear expressions. We demonstrate that the presented technique can be used effectively for programming both vertex and pixel shaders for a variety of mathematical applications, including integrating differential equations and solving a sparse linear system of equations using iterative methods. Categories and Subject Descriptors (according to ACM CCS): I.3.1 [Computer Graphics]: Graphics processors, Par- allel processing, Programmable shader; G.1.3 [Numerical Analysis]: Numerical Linear Algebra, Sparse systems; G.1.6 [Numerical Analysis]: Optimization 1. Introduction ing hardware has been exploited extensively for more flexible classification and shading at higher frame rates (re- In recent years, commodity graphics hardware has evolved fer to 6 for the various volume rendering techniques us- beyond the traditional fixed-function pipeline, to allow flex- ing user-programmable graphics hardware). In 25; 26, real- ible and programmable graphical processing units (GPUs). time procedural shading systems were proposed for pro- User-programmable vertex and pixel shader technologies grammable GPUs. Two papers demonstrated that ray cast- have been applied extensively, and have added a large num- ing can be performed efficiently with current graphics hard- ber of interesting new visual effects that were difficult or ware. Carr et al. described how ray-triangle intersection can impossible with traditional fixed-function pipelines. Sig- be mapped to a pixel shader3. Their experimental results nificant effort has focused mainly on the efficient utiliza- showed that a GPU-enhanced implementation is faster on tion of pixel shader hardware by effectively mapping ad- existing hardware than the best CPU implementation. Pur- vanced rendering algorithms to the available programmable cell et al. presented a streaming ray tracing model suitable fragment pipelines. Flexible multi-texturing and texture- for GPU-enhanced implementation on programmable graph- blending units, as provided by recent graphics cards, allow ics hardware28. They evaluated the efficiency of their ray a variety of per-pixel shading and lighting effects, such as tracing model on two different architectures, one with and Phong shading, bump mapping, and environmental mapping. one without branching. Recently, global illumination algo- In addition to these traditional per-pixel rendering ef- rithms such as photon mapping, and matrix radiosity and fects, the list of graphics applications accelerated by pro- subsurface scattering, were implemented on GPUs by Pur- grammable shader hardware is growing rapidly. Volume ren- cell et al.27 and Carr et al.2, respectively. dering is an actively studied topic in which pixel shad- °c The Eurographics Association and Blackwell Publishers 2004. Published by Blackwell Publishers, 108 Cowley Road, Oxford OX4 1JF, UK and 350 Main Street, Malden, MA 02148, USA. Bajaj, Ihm, Min, and Oh / SIMD Optimization Programmable graphics hardware has also been used for Operation Usage Description more general mathematical computations. Hart showed how the Perlin noise function can be implemented as a multi- ADD ADD D, S0, S1 D Ã S0 + S1 pass pixel shader11. In 18, Larsen et al. described the use MUL MUL D, S0, S1 D Ã S0 * S1 of texture mapping and color blending hardware to perform MAD MAD D, S0, S1, S2 D Ã S0 * S1 + S2 large matrix multiplications. Thompson et al. also imple- DP4 DP4 D, S0, S1 D S0 S1 mented matrix multiplication, and non-graphics problems Ã ¢ such as 3-satisfiability, on GPUs33. Hardware-accelerated MOV MOV D, S D Ã S methods were proposed for computing line integral convo- lutions and Lagrangian-Eulerian advections by Heidrich et Table 1: Supported instructions. S, S0, S1, and D are al.12 and Weiskopf et al.35. In 30, Rumpf et al. attempted four-wide registers. +, *, and ¢ represent component-wise ad- to solve the linear heat equation using pixel-level compu- dition, component-wise multiplication, and four-component tations. Harris et al. also implemented a dynamic simula- dot product, respectively. tion technique, based on the coupled map lattice, on programmable graphics hardware10. Moreland et al. computed the fast Fourier transform on the GPU, and used graphics structions on the SIMD graphics architecture supported by 23 hardware to synthesize images . current GPUs. Arithmetic operations are packed into four- As the fragment processing units of the recent graph- wide SIMD instructions by reordering the operations in the ics accelerators support full single-precision floating-point linear expression. Our technique is different from other re- arithmetic, it becomes possible to efficiently run various lated GPU programming techniques because it searches for numerical algorithms on the GPU with high precision. the most efficient linear expression to make the best use Two fundamental numerical algorithms for sparse matrices, of the four-wide SIMD processing power of current GPUs. a conjugate gradient solver and a multigrid solver, were We demonstrate that the proposed optimization technique mapped to the GPU by Bolz et al.1. Goodnight et al. also is quite effective for programming both vertex and pixel implemented a general multigrid solver on the GPU, and ap- shaders in a variety of applications, including the solution plied it to solving a variety of boundary value problems8. of differential equations, and sparse linear systems using it- Krüger et al. described a framework for the implementa- erative methods. tion of linear algebra operators on a GPU17. The GPU was also used for solving large nonlinear optimization problems 2. Efficient SIMD Computation of Linear Expressions by Hillesland et al.13. Harris et al. solved partial differential 2.1. Abstract Model of SIMD Machine equations on the GPU for simulating cloud dynamics9. We assume an abstract model of the shader for which our op- Many of the numerical simulation techniques mentioned timization technique is developed. Vertex and pixel shaders are strongly reliant on arithmetic operations on vectors and of different manufactures are slightly different from each matrices. Therefore, considerable efforts have been made other. Furthermore, they are still evolving rapidly. In this to develop efficient representations and operations for vec- respect, we make only a few basic assumptions about the tors and matrices on GPUs. This paper continues those ef- shader model so that the resulting technique is vendor- forts, and is specifically concerned with the problem of com- independent, and can be easily adapted to future shaders. In puting linear expressions in the form of affine transforms this paper, we view the shaders as general purpose vector ¯ ¯ y¯ = Ax¯ + b, where A is a matrix, and x¯, y¯, and b are vectors. processors with the following capabilities: Such linear transforms constitute basic operations in many scientific computations. Because the matrix A in the expres- 1. The shader supports four-wide SIMD parallelism. sion is usually large and sparse in practice, many CPU-based 2. Its instruction set includes the instructions shown in Ta- parallel techniques have been proposed in parallel process- ble 1. ing fields for their efficient computations. In particular, sev- 3. Any component of the source registers may swizzle eral methods for large sparse matrix-vector multiplication and/or replicate into any other component. Furthermore, have been designed for SIMD machines24; 37; 15; 29. Although destination registers may be masked. These register mod- the proposed techniques work well on specific machines, it ifiers do not harm shader performance. is difficult to apply them directly to the simple SIMD models 4. Every instruction executes in a single clock cycle. Hence, supported by current programmable graphics hardware. the number of instructions in a shader program is the ma- jor factor affecting shader performance. This paper presents a SIMD code optimization technique that enables efficient assembly-level shader codes to be gen- 2.2. Definition of the Problem erated for evaluating linear expressions. The technique transforms a given linear expression into an equivalent and op- As described earlier, this paper describes the generation of timized expression that can be evaluated using fewer in- efficient shader code that evaluates a linear expression of the °c The Eurographics Association and Blackwell Publishers 2004. Bajaj, Ihm, Min, and Oh / SIMD Optimization form E :y ¯ = Ax¯+b¯, where the matrix A = (ai j)i; j=0;1;¢¢¢;m¡1 2.3.1. Code Generation Stage and the vector b¯ = (b b ¢¢¢ b )t are constants, and the 0 1 m¡1 To compute the four-vector T = A x¯ in Eq. (2), 12 addi- two vectors x¯ = (x x ¢¢¢ x )t and y¯ = (y y ¢¢¢ y )t i j i j j 0 1 m¡1 0 1 m¡1 tions and 16 multiplications must be carried out in the worst are variables. Because the vertex and pixel processing of cur- case. However, when the matrix A contains zero elements, rently available GPUs is based on four-wide SIMD floating- i j it can be evaluated in fewer arithmetic operations.

Load more