Optimizing Data Permutations for SIMD Devices

Gang Ren, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, [email protected]
Peng Wu, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, [email protected]
David Padua, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, [email protected]

Abstract

The widespread presence of SIMD devices in today's microprocessors has made compiler techniques for these devices tremendously important. One of the most important and difficult issues that must be addressed by these techniques is the generation of the data permutation instructions needed for non-contiguous and misaligned memory references. These instructions are expensive and, therefore, it is of crucial importance to minimize their number to improve performance and, in many cases, enable speedups over scalar code.

Although it is often difficult to optimize an isolated data reorganization operation, a collection of related data permutations can often be manipulated to reduce the number of operations. This paper presents a strategy to optimize all forms of data permutations. The strategy is organized into three steps. First, all data permutations in the source program are converted into a generic representation. These permutations can originate from vector accesses to non-contiguous and misaligned memory locations or result from compiler transformations. Second, an optimization algorithm is applied to reduce the number of data permutations in a basic block. By propagating permutations across statements and merging consecutive permutations whenever possible, the algorithm can significantly reduce the number of data permutations. Finally, a code generation algorithm translates generic permutation operations into native permutation instructions for the target platform. Experiments were conducted on various kinds of applications. The results show that up to 77% of the permutation instructions are eliminated and, as a result, the average performance improvement is 48% on VMX and 68% on SSE2. For several applications, near perfect speedups have been achieved on both platforms.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—code generation, compilers, optimization

General Terms: Performance, Experimentation, Languages

Keywords: SIMD Compilation, Data Permutation, Optimization

1. Introduction

Single Instruction Multiple Data (SIMD) devices are supported by practically all of today's microprocessors. IBM's VMX and Intel's SSE family are among the most popular SIMD devices. Support for SIMD devices is likely to remain the norm for the foreseeable future, and their importance could increase as these devices become more powerful and easier to program.

Today's SIMD devices can be programmed directly using machine language instructions (possibly in the form of C intrinsics) or indirectly through automatic vectorization. Neither approach is satisfactory. Machine instructions are not easy to use, and today's implementations of automatic vectorization fail in a number of important cases [23]. The difficulty in programming with SIMD instructions is due to the register-to-register nature of the instructions and the limitations of the SIMD load-store units. Thus, vector operations must be partitioned (or strip-mined) into blocks that fit into vector registers. Furthermore, assuming that the vector registers are Lr bytes long (Lr is typically 16), SIMD devices only support accesses to chunks of memory that are Lr bytes long and aligned at Lr-byte boundaries. (Some SIMD devices, such as the SSE family, do support misaligned loads/stores, but with a costly performance penalty.) When this is not the case, vectors must be packed, unpacked, and realigned to enable the application of SIMD operations.

Most SIMD devices provide permutation instructions to support such pack, unpack, and align operations. For example, in the case of VMX, the permutation instruction [18] has the following form:

    R3 <- vperm(R1, R2, P)

[Figure 1: R3 <- vperm(R1, R2, P). The 16-byte pattern P = {00 02 08 12 11 07 13 17 04 18 14 1C 0F 1A 0C 1F} selects bytes from the concatenation of R1 and R2 into R3.]

As shown in Figure 1, the vperm instruction selects an arbitrary set of 16 bytes from the two input registers, R1 and R2 (each 16 bytes long), as indicated by the permutation pattern P, which is also stored in a register. Such instructions enable the implementation of any byte movement operation needed for SIMD programming. For example, Figures 2.a and 2.b show, using the triplet notation of Fortran 90, code snippets to pack a strided access and load a misaligned access, respectively.

a) Implementation of b[0:99] = a[0:198:2] + c[0:99]
     p = <0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27>;
     for(i=0; i<100; i+=4)
       b[i:i+3] = vperm(a[2i:2i+3],a[2i+4:2i+7],p) + c[i:i+3];

b) Implementation of b[0:99] = a[1:100] + c[0:99]
     p = <4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19>;
     for(i=0; i<100; i+=4)
       b[i:i+3] = vperm(a[i:i+3],a[i+4:i+7],p) + c[i:i+3];

Figure 2. SIMD implementation using vperm, where p is a constant vector of bytes. The elements are assumed to be 4-byte single-precision floating point, and the bases of arrays a, b, and c are assumed to be 16-byte aligned.
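The byte-selection semantics of vperm can be modeled in a few lines of scalar C. The sketch below is ours, not part of VMX or any of its headers (the helper name vperm_model is hypothetical); it mirrors the hardware's use of the low 5 bits of each pattern byte to index the 32-byte concatenation of the two inputs:

    #include <stdint.h>

    /* Scalar model of VMX vperm: byte i of R3 is the byte of the
       32-byte concatenation R1||R2 selected by the low 5 bits of
       pattern byte P[i]. */
    static void vperm_model(const uint8_t R1[16], const uint8_t R2[16],
                            const uint8_t P[16], uint8_t R3[16])
    {
        uint8_t src[32];
        for (int i = 0; i < 16; i++) {
            src[i]      = R1[i];
            src[16 + i] = R2[i];
        }
        for (int i = 0; i < 16; i++)
            R3[i] = src[P[i] & 0x1f];  /* low 5 bits pick one of 32 bytes */
    }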

Data permutations, introduced mainly to overcome the limitations of SIMD memory units, are the most common source of overhead in the programming of SIMD devices. Such overhead is especially pronounced in programs with many non-stride-one or misaligned accesses. Figure 3 gives such an example, an 8-point FFT computation. (For simplicity of presentation, complex operations are not expanded into real operations in the code; otherwise, more data permutations would be needed.) In this paper, we use Fortran 90 vector syntax to represent vectors. The general form of a Fortran 90 vector is a[begin:end:stride]; the stride component of the triplet notation can be omitted if the stride is 1.

    1.  t0[0:6:2] = x[0:3:1] + x[4:7:1];
    2.  t0[1:7:2] = x[0:3:1] - x[4:7:1];
    3.  t1[0:7:1] = T8[0:7:1] * t0[0:7:1];
    4.  for (i = 0; i < 2; i++) {
    5.    t2[0:2:2] = t1[i:i+2:2] + t1[i+4:i+6:2];
    6.    t2[1:3:2] = t1[i:i+2:2] - t1[i+4:i+6:2];
    7.    t3[0:3:1] = T4[0:3:1] * t2[0:3:1];
    8.    y[i+0:i+2:2] = t3[0:1:1] + t3[2:3:1];
    9.    y[i+4:i+6:2] = t3[0:1:1] - t3[2:3:1];
    10. }

Figure 3. 8-point FFT code with stride-2 accesses, where t0, t1, t2, t3 are temporary arrays and T4 and T8 are constant arrays.

A naïve implementation of Figure 3 on SIMD devices, shown in Figure 4, requires 8 data permutations: to pack stride-2 vectors (at statements 3, 5, 10, 16) and to coalesce vectors after unrolling (at statements 6, 9, 12, 15). In Figure 4, permutation is represented by a generic operation, Permute(v, P), which reorders vector v according to the index vector P (i.e., v(P(:)) in Fortran 90 notation). Such a generic permutation operation can be translated into a sequence of permutation instructions.

    1.  v1[0:3:1] = x[0:3:1] + x[4:7:1];
    2.  v1[4:7:1] = x[0:3:1] - x[4:7:1];
    3.  t0[0:7:1] = Permute(v1[0:7:1], P1);
    4.  t1[0:7:1] = T8[0:7:1] * t0[0:7:1];
    5.  v2[0:7:1] = Permute(t1[0:7:1], P2);
    6.  u1[0:7:1] = Permute(v2[0:7:1], P3);
    7.  u2[0:3:1] = u1[0:3:1] + u1[4:7:1];
    8.  u2[4:7:1] = u1[0:3:1] - u1[4:7:1];
    9.  v3[0:7:1] = Permute(u2[0:7:1], P4);
    10. t2[0:7:1] = Permute(v3[0:7:1], P5);
    11. t3[0:7:1] = T4_2[0:7:1] * t2[0:7:1];
    12. u3[0:7:1] = Permute(t3[0:7:1], P6);
    13. u4[0:3:1] = u3[0:3:1] + u3[4:7:1];
    14. u4[4:7:1] = u3[0:3:1] - u3[4:7:1];
    15. v4[0:7:1] = Permute(u4[0:7:1], P7);
    16. y[0:7:1]  = Permute(v4[0:7:1], P8);

Figure 4. A naïve SIMD implementation of the FFT code in Figure 3 after converting the stride-2 vectors to unit strides and unrolling the loop. Temporary arrays v1, v2, v3, v4 are introduced to convert stride-2 vectors, and u1, u2, u3, u4 come from loop unrolling. P1 to P8 specify the permutation patterns in those operations. T4_2 results from concatenating two T4 arrays.

Given that most SIMD units operate on short vectors (2 or 4 elements), minimizing permutation overhead is of crucial importance for performance. A careful analysis shows that the code of Figure 4 can be rewritten with far fewer permutations, as in Figure 5.

    1.  v1[0:3:1] = x[0:3:1] + x[4:7:1];
    2.  v1[4:7:1] = x[0:3:1] - x[4:7:1];
    3.  t1[0:7:1] = T8[0:7:1] * v1[0:7:1];
    4.  u1[0:7:1] = Permute(t1[0:7:1], Q1);
    5.  u2[0:3:1] = u1[0:3:1] + u1[4:7:1];
    6.  u2[4:7:1] = u1[0:3:1] - u1[4:7:1];
    7.  t3[0:7:1] = T4_2[0:7:1] * u2[0:7:1];
    8.  u3[0:7:1] = Permute(t3[0:7:1], Q2);
    9.  y[0:3:1]  = u3[0:3:1] + u3[4:7:1];
    10. y[4:7:1]  = u3[0:3:1] - u3[4:7:1];

Figure 5. An optimized SIMD implementation of the FFT code in Figure 3.
Perhaps the most natural alternative to low-level SIMD programming is to use a general vector notation similar to that of Fortran 90 and rely on compilers to map vector operations onto SIMD instructions. This approach allows us to separate two orthogonal issues in SIMD compilation: the extraction of SIMD parallelism and efficient SIMD code generation. In this paper, we take a first step in addressing the second issue. We focus on minimizing permutation overhead when translating straight-line code segments (possibly the result of compiler unrolling) containing scalar and constant-length vector operations into efficient code for microprocessor SIMD devices.

Our optimization algorithm accepts vector programs as input, assuming that SIMD parallelism is either extracted by a prior pass of the compiler or explicitly specified in the program. The input vector expressions can have arbitrary lengths, memory alignments, and strides. Since the optimization algorithm relies on the permutation matrices being constant, we restrict input vector expressions to have compile-time constant strides, alignment, and length. (The requirement for a compile-time constant vector length can be relaxed by strip-mining variable-length vectors into sequences of fixed-length vector operations.)

Our compiler strategy has three main phases. First, a translator converts all data permutation operations into an internal representation based on the Permute operation. Most data permutations come from packing, unpacking, and realigning non-contiguous and/or misaligned memory references in the source code. The others result from other compiler transformations and/or user specifications.

Second, the number of data permutations is reduced by an optimizer. The algorithm propagates data permutations across statements and composes consecutive permutations whenever possible. The code in Figure 5 was obtained by applying our optimization pass.

Finally, a code generator translates the internally represented permutations into native in-register permutation instructions, such as vperm, for the targeted SIMD device.

Experiments were conducted on a set of applications from different domains. The results indicate that our permutation optimization framework significantly reduces the number of data movement instructions (by up to 77%), which improves performance on average by 48% on VMX and 68% on SSE2. Near perfect speedups have been achieved on both platforms.

The major contributions of this paper are:

• An algorithm to translate all forms of data permutations into an internal representation based on the Permute operation. (Section 3)
• An optimization algorithm to reduce the number of data permutations in a basic block by propagating them across statements and merging them together whenever possible. (Section 4)
• An algorithm to translate Permute operations onto hardware permutation instructions through a two-phase mapping process. (Section 5)
• An experimental demonstration of the effectiveness of the new techniques on codes requiring non-trivial permutation patterns, such as FFT and sorting network codes. (Section 6)

The rest of the paper is organized as follows. Section 2 describes the Permute operation used in our internal representation. Sections 3 through 6 cover the three phases and the experiments just mentioned. Section 7 presents related work, and we conclude in Section 8.

2. Permutation, Reduction and Replication

In this section, we define the three data movement operations, Permute, Reduct, and Spread, used in our internal representation.
2.1 Permutation Operation

The operation Yn <- Permute(Xn, P) performs a permutation P on a vector Xn of n elements to produce a vector Yn of the same length. P can be specified as a permutation matrix, Pnxn, which is an identity matrix with its rows reordered. A permutation operation can then be viewed as a matrix-vector multiplication, i.e.,

    Permute(Xn, Pnxn) ≡ Pnxn · Xn.
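To make the equivalence concrete, the following self-contained C example (ours, purely illustrative) applies the same permutation once through the index-vector form introduced below and once as a permutation-matrix product; the two results agree:

    #include <stdio.h>

    #define N 4

    int main(void)
    {
        double x[N] = {10, 20, 30, 40};
        int    p[N] = {0, 2, 1, 3};          /* index-vector form of P */
        double y1[N], y2[N];

        /* Index-vector form: y[i] = x[p[i]]. */
        for (int i = 0; i < N; i++)
            y1[i] = x[p[i]];

        /* Matrix form: row i of P has a single 1 in column p[i];
           then y = P * x. */
        double P[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            P[i][p[i]] = 1.0;
        for (int i = 0; i < N; i++) {
            y2[i] = 0.0;
            for (int j = 0; j < N; j++)
                y2[i] += P[i][j] * x[j];
        }

        for (int i = 0; i < N; i++)
            printf("%g %g\n", y1[i], y2[i]);  /* the two columns agree */
        return 0;
    }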

A more compact representation specifies the permutation pattern as an index vector whose i-th element is the index of the input vector element that is to be moved to the i-th element of the output vector. For example,

    Permute(<x0, x1, x2, x3>, <0, 2, 1, 3>) = <x0, x2, x1, x3>.

Maintaining a square permutation matrix greatly simplifies our optimization algorithm. However, it also requires the input and output vectors to be of the same length. To express data permutations with mismatched input and output lengths, we use two special values:

• The first special value is a star, denoted ★. If the i-th element of an index vector is ★, the i-th element of the output vector is undefined (i.e., we do not care about its value). A star can be used to specify data permutations where the output vector is shorter than the input. For example, the following permutation gathers the odd elements of the input vector:

    Permute(<x0, x1, x2, x3>, <1, 3, ★, ★>) = <x1, x3, ★, ★>.

Using ★ elements, we can explicitly track "unused" elements during a permutation. This information can be used by the optimization algorithm: ★ indicates that the permutation bandwidth is not fully utilized and that the permutation may be combined with other permutations applied to the same input vector.

• The second special value is a diamond, denoted ◇. If the i-th element of an index vector is ◇, the i-th element of the output vector remains unchanged. It is used to specify data permutations where the output vector also implicitly serves as an input. For example, suppose Y4 = <y0, y1, y2, y3>. The outcome of the permutation

    Y4 <- Permute(<x0, x1, x2, x3>, <◇, 1, ◇, 2>)

is Y4 = <y0, x1, y2, x2>.

As discussed in Section 3.1, ◇ elements are generated when converting stores to strided vectors into SIMD form. Such a conversion involves a special read-modify-write sequence that is conveniently represented by ◇. Using ◇ elements, we can explicitly track "unchanged" elements during a permutation. This information can be used by the optimization algorithm: for example, one may combine two permutations over the same output vector into a single permutation without ◇ elements.

2.2 Properties of the Permutation Operation

Two rules form the foundation of our optimization algorithm on data permutations.

Composition Rule. This rule states that two consecutive permutations can be composed into one:

    Permute(Permute(Xn, Pnxn), Qnxn) ≡ Permute(Xn, Qnxn · Pnxn)

This composition rule is the basis of our optimization algorithm: each time the rule is applied, we reduce the number of permutations by one.

It is safe to apply the composition rule in the presence of ★, since these values will be discarded eventually. On the other hand, since a permutation containing ◇ elements combines data from both input and output, the composition rule cannot be applied when there is a ◇ in P. It is, however, still applicable when Q contains ◇ elements.

Assume that the i-th elements of the index vectors of P and Q are pi and qi, respectively. Then the i-th element of the index vector of R = Q · P can be computed as follows:

    ri = p_qi   if qi ≠ ★ and qi ≠ ◇
    ri = ◇      if qi = ◇
    ri = ★      if qi = ★

From a different viewpoint, the composition rule also says that a permutation can be decomposed into two permutations. Permutation decomposition is very important during permutation propagation, especially when different patterns have different costs. However, there may be a large number of ways to decompose a permutation; in practice, we only attempt to decompose a permutation into two permutations of special formats (see Section 4.2.2).

Distributive Rule. This rule says that a permutation operation can be distributed over element-by-element vector operations. Let op be an element-by-element operation; then

    Permute(Xn, Pnxn) op Permute(Yn, Pnxn) ≡ Permute(Xn op Yn, Pnxn)

The distributive rule allows us to move permutations over operations. By moving common permutations toward the root of an expression, we can reduce the number of permutations. It may also create more consecutive permutations and thus enable the application of the composition rule.

It is safe to apply the distributive rule in the presence of ★. While doing so can save permutation operations, it may also result in additional arithmetic operations; depending on the relative cost of permutation and arithmetic operations, one may choose to distribute or un-distribute in the presence of ★. On the other hand, it is generally unsafe to distribute in the presence of ◇.
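The composition of index vectors, including the ★ and ◇ cases above, is a one-pass table lookup. In the C sketch below (ours; the STAR and DIAMOND sentinels are an assumed encoding, not the paper's), compose computes R = Q · P, assuming P contains no ◇:

    #include <stddef.h>

    enum { STAR = -1, DIAMOND = -2 };  /* sentinels for the special values */

    /* R = Q . P on index vectors of length n: ri = p[qi] for an
       ordinary qi, while star and diamond entries of Q pass through
       unchanged. P must contain no diamond (the rule does not apply
       in that case). */
    static void compose(const int *p, const int *q, int *r, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (q[i] == STAR)          r[i] = STAR;
            else if (q[i] == DIAMOND)  r[i] = DIAMOND;
            else                       r[i] = p[q[i]];  /* may be STAR */
        }
    }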
2.3 Reduction and Replication

Reduction operations involve a type of data movement that cannot be expressed with general permutations. Therefore, we introduce the notation Yn <- Reduct(Xn, opr) to specify a reduction with operation opr over all n elements of Xn, storing the result in the first element of Yn. For example, Yn <- Reduct(Xn, +) represents

    y0 = x0 + x1 + ... + x(n-1).

From the viewpoint of data permutation, replication is the reverse operation of reduction. A replication (Spread) operation expands a scalar into a vector: Yn <- Spread(Xn, i) specifies a replication of the i-th element of Xn to fill Yn, i.e.,

    Spread(<★, ★, x2, ★>, 2) = <x2, x2, x2, x2>.

Notice that both the input of the replication and the output of the reduction are scalars. However, vectors are used in both representations because most SIMD devices do not support reduction (or spread) instructions that directly target scalar registers as output (or as input).

Composition Rule. This rule says that consecutive Reduct-Permute and Permute-Spread pairs can be composed into one, assuming that opr is associative:

    Reduct(Permute(Xn, Pnxn), opr) ≡ Reduct(Xn, opr)
    Permute(Spread(Xn, i), Pnxn) ≡ Spread(Xn, i)

Distributive Rule. This rule says that Spread can be distributed over element-by-element vector operations:

    Spread(Xn op Yn, i) ≡ Spread(Xn, i) op Spread(Yn, i)
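Scalar models of Reduct and Spread make the intended semantics precise (again our own sketch; a real SIMD target would implement these with in-register shuffles and arithmetic):

    #include <stddef.h>

    /* Reduct(X, +): sum all n elements and store the result in y[0];
       the remaining elements of y are left unspecified. */
    static void reduct_add(const float *x, float *y, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += x[i];
        y[0] = acc;
    }

    /* Spread(X, i): replicate element i of x into every element of y. */
    static void spread(const float *x, float *y, size_t n, size_t i)
    {
        for (size_t j = 0; j < n; j++)
            y[j] = x[i];
    }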

3. Normalization of Vector References

The first step of our compilation strategy is to normalize all array references (vectors): that is, to transform the input program so that all vector references are stride-one, aligned, and of a length that is a multiple of Lr, the length of the vector register. This is accomplished by inserting explicit Permute operations to pack or align data. By normalizing vectors, all permutation operations that are implicit in non-contiguous or misaligned references are made explicit.

During this transformation, the algorithm may introduce temporary arrays (vectors) to hold intermediate results. The live range of these temporary arrays is the basic block being translated; they will be allocated to (virtual) vector registers and therefore will not occupy any memory space.

3.1 Stride Conversion

Normalization transforms a strided vector load into a load of a stride-one vector followed by a pack operation, and a strided vector store into a store to a stride-one vector preceded by an unpack-merge sequence. Both pack and unpack-merge can be expressed using Permute operations.

Strided load. Consider a vector load,

    S: ... = ... v[b:e:s] ... ;

The first step is to allocate a temporary vector t[0:L*s-1], where L=(e-b)/s+1 (also the length of v[b:e:s]). Vector t[0:L*s-1] is defined as a permutation of v[b:b+L*s-1], inserted immediately before S. A portion of the vector, t[0:L-1], is then used to replace v[b:e:s] in statement S. The final version of the code is:

    t[0:L*s-1] = Permute(v[b:b+L*s-1], P);
    S: ... = ... t[0:L-1] ... ;

where P is <0, s, 2s, ..., L*s-s, ★, ..., ★>. Figure 6 (Scheme I) shows an example where stride-2 vector loads are normalized using this scheme.

    Original statement:
      a[0:49] = b[0:98:2] + b[1:99:2];
    After stride conversion (Scheme I):
      t1[0:99] = Permute(b[0:99], <0, 2, 4, ..., 98, ★, ..., ★>);
      t2[0:99] = Permute(b[1:100], <0, 2, 4, ..., 98, ★, ..., ★>);
      a[0:49] = t1[0:49] + t2[0:49];
    After stride conversion (Scheme II):
      t1[0:99] = Permute(b[0:99], <0, 2, 4, ..., 98, ★, ..., ★>);
      t2[0:99] = Permute(b[0:99], <★, ..., ★, 1, 3, 5, ..., 99>);
      a[0:49] = t1[0:49] + t2[50:99];

Figure 6. An example of converting strided loads.

However, this conversion scheme has a major drawback with respect to merging permutations. Consider the example in Figure 6: Scheme I generates two permutations on two largely overlapping vectors, b[0:99] and b[1:100]. As discussed in Section 3.2, our merging algorithm tries to combine permutations that operate on the same input vectors.

To facilitate the merging of permutations, we modify the original conversion formula by truncating the generated vector addresses to the closest Lr boundaries. This modification also helps avoid unnecessary realignment. We use the notation floor_y(x) for the truncation of x to the closest multiple of y. Assuming that array bases are Lr-byte aligned and elements are Le bytes long, the enhanced stride conversion produces

    t[0:L*s-1] = Permute(v[b':b'+L*s-1], P);
    S: ... = ... t[o*L:o*L+L-1] ... ;

where b' = floor_(Lr/Le)(b), o = b - b' = b mod (Lr/Le), and

    P = <★, ..., ★, o, o+s, o+2s, ..., o+(L-1)*s, ★, ..., ★>

with o*L leading stars. Figure 6 (Scheme II) shows an example of converting stride-2 vector loads using this scheme.

Strided store. Consider a vector store,

    S: v[b:e:s] = ...;

To normalize the store, a temporary vector t[0:L*s-1] is allocated and v[b:e:s] is replaced by t[o*L:o*L+L-1]. A permutation statement from t to v is inserted immediately after S. Using the same definitions of o and b' as above, the final version of the code is:

    S: t[o*L:o*L+L-1] = ...;
    v[b':b'+L*s-1] = Permute(t[0:L*s-1], P);

where

    P = <◇, ..., ◇, o*L, ◇, ..., ◇, o*L+1, ◇, ..., ◇, ..., o*L+L-1, ◇, ..., ◇>

with o diamonds before the first index, s-1 diamonds between consecutive indices, and s-o-1 trailing diamonds.

3.2 Merging Permutations

Stride conversion often generates permutations with many ★ and ◇ elements. The presence of these elements indicates opportunities to merge permutations.

Two permutations with ★ (or ◇) elements that operate on the same input (and output) vectors can be merged if their index vectors can be merged. We define a commutative operator, ∧, to merge two permutation indices, as follows:

    a ∧ a = a,    a ∧ ★ = a,    a ∧ ◇ = a

Otherwise, we say the two permutation indices cannot be merged. It is straightforward to extend ∧ to an element-wise vector operation merging two index vectors. We say two index vectors, A and B, can be merged if and only if all of their corresponding indices can be merged; in that case, the index vector of the merged permutation is A ∧ B.

Merging opportunities are common when several strided references in a region jointly access a contiguous chunk of memory. This situation arises frequently in applications. Consider our previous example in Figure 6. The permutations produced by Scheme I cannot be directly merged because they operate on two slightly different vectors (b[0:99] and b[1:100]), whereas merging the permutations generated by Scheme II is straightforward:

    t[0:99] = Permute(b[0:99], <0, 2, 4, ..., 98, 1, 3, ..., 99>);
    a[0:49] = t[0:49] + t[50:99];
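The ∧ operator is a pointwise combine that fails on two distinct ordinary indices. A C sketch under the same sentinel encoding as before (MERGE_FAIL is our own convention, not the paper's):

    enum { STAR = -1, DIAMOND = -2, MERGE_FAIL = -3 };

    /* a ^ a = a, a ^ STAR = a, a ^ DIAMOND = a; two distinct
       ordinary indices cannot be merged. */
    static int merge_index(int a, int b)
    {
        if (a == b)                    return a;
        if (a == STAR || a == DIAMOND) return b;
        if (b == STAR || b == DIAMOND) return a;
        return MERGE_FAIL;
    }

    /* Merge two index vectors elementwise; returns 0 on failure. */
    static int merge_vectors(const int *a, const int *b, int *out, int n)
    {
        for (int i = 0; i < n; i++) {
            out[i] = merge_index(a[i], b[i]);
            if (out[i] == MERGE_FAIL)
                return 0;
        }
        return 1;
    }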
3.3 Misalignment

This normalization transforms accesses to vectors starting at addresses that are not a multiple of Lr into accesses to aligned vectors. If we assume that the base address of v is aligned, vector v[b:e] is misaligned if b mod (Lr/Le) is not zero. Consider a misaligned load,

    S: ... = ... v[b:e] ...;

As in the case of stride conversion, a temporary vector, t[0:L'-1], is allocated for the replacement, where L' is the length of the aligned section v[b':e'], with b' = floor_(Lr/Le)(b) and e' = ceil_(Lr/Le)(e+1) - 1. Thus L' = L + Lr/Le, where L is the length of v[b:e]. A permutation must also be inserted before the original statement:

    t[0:L'-1] = Permute(v[b':e'], P);
    S: ... = ... t[0:L-1] ...;

where P = <b-b', b-b'+1, ..., b-b'+L-1, ★, ..., ★>, with Lr/Le trailing stars, so that the first L elements of t hold the realigned data.

Similarly, consider a misaligned store,

    S: v[b:e] = ...;

It is converted to

    S: t[0:L-1] = ...;
    v[b':e'] = Permute(t[0:L'-1], P);

where P = <◇, ..., ◇, 0, 1, 2, ..., L-1, ◇, ..., ◇>, with b-b' leading diamonds.
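The alignment bookkeeping is simple integer arithmetic. The following C sketch (ours; it assumes nonnegative indices and an Lr-byte-aligned array base, as stated above) computes b', e', L', and the index vector P for a misaligned load:

    enum { STAR = -1 };

    /* Compute the aligned section [b', e'] and index vector P for a
       misaligned load v[b:e], with Lr-byte registers and Le-byte
       elements. P receives Lp = L + Lr/Le entries. */
    static void align_load(int b, int e, int Lr, int Le,
                           int *b_out, int *e_out, int *P, int *Lp)
    {
        int w  = Lr / Le;                   /* elements per register  */
        int L  = e - b + 1;                 /* length of v[b:e]       */
        int b2 = (b / w) * w;               /* b' rounds down         */
        int e2 = ((e + 1 + w - 1) / w) * w - 1;   /* e' rounds up     */
        int off = b - b2;                   /* misalignment offset    */

        *b_out = b2;
        *e_out = e2;
        *Lp = e2 - b2 + 1;
        for (int i = 0; i < *Lp; i++)
            P[i] = (i < L) ? off + i : STAR;  /* realign, pad stars   */
    }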

3.4 Other Sources of Data Permutations

Besides the data permutations introduced by converting strided and misaligned memory references, store-load forwarding between partial definitions and uses of vectors may also introduce data permutations. For example, to propagate the right-hand side of t[b:e] = x[b:e] to t[b+1:e+1], it is necessary to insert an explicit permutation (and an element-insert operation).

In addition, data permutations inherent to programs, such as matrix transposition and bit-reversed reordering, are also very common. It is sometimes difficult for compilers to recognize those data permutations in a standard C implementation; it would be easier to provide intrinsics, as in Fortran 90, to specify such data permutations directly.

4. Optimizing Data Permutations

In this section, we introduce an algorithm to reduce the number of permutation operations within a basic block. The input to the algorithm is straight-line vector code normalized as discussed above.

4.1 An Overview of the Optimization Algorithm

The optimization algorithm propagates permutations along the def-use graph built from the input program. It then applies the composition and distributive rules to reduce the number of permutations. The def-use graph is an extension of the conventional def-use graph that accommodates vectors.

The algorithm is shown in Figure 7. The worklist W contains all statements of the form

    v = Permute(x, P);

where P contains no ◇ element. (Permutations containing ◇ cannot be propagated because they use data from both the input and output vectors.) The algorithm propagates along the def-use chains of each statement in W. It addresses cases such as partial definitions and partial uses (see Section 4.2). Once a definition of the form v = Permute(x, P) is propagated to a permutation of the form Permute(v, Q), the algorithm merges P and Q by applying the composition rule. The algorithm also tries to reorganize statements into the form v = Permute(..., P) (see Section 4.3). It repeats this process until all permutations in W have been visited.

When the use is not an input to a permute (e.g., ... = v + Permute(...)), the algorithm may still tentatively propagate v. If, at the end of the algorithm, these tentative replacements do not lead to a consolidated right-hand side of the form Permute(...), the replacement is undone.

If a permutation is propagated to all its uses, the permutation itself will be deleted by dead-code elimination.

    W <- {};
    optimize_permutation(basicblock bb)
      W <- {S | S: "v = Permute(..., P)" at top of DU graph}
      WHILE W ≠ {} DO
        remove a statement, S: "v = Permute(..., P)", from W
        U <- set of all use statements of v, the lhs of S
        propagate_and_merge(U, S)
      END

    propagate_and_merge(set U, stmt "v = Permute(..., P)")
      1.  FOR each statement T in U DO
      2.    Let v' be the use of v in T
      3.    IF possible to propagate Permute(x, P) to v' THEN
      4.      IF T is of the form "w = Permute(v', P')" THEN
      5.        T <- "w = Permute(x, P'·P)"
      6.        add T to W
      7.      ELSE
      8.        tentatively replace v' with "Permute(x, P)"
      9.        IF possible to transform the rhs of T into "Permute(..., P)" THEN
      10.         carry out the transformation
      11.         add T to W
      12.       ENDIF
      13.     ENDIF
      14.   ENDIF
      15. END

Figure 7. The algorithm for optimizing data permutations in a basic block.
4.2 Propagation Along Def-Use Chains

The algorithm in Figure 7 assumes that the def-use information of the program is already available before optimization. The def-use analysis of vector programs is an important issue in itself. Unlike two scalars, which are either the same or different, two vectors can be identical, completely different, or overlapping; the def-use analysis therefore has to deal with partial relations between vectors. For example, in

    t[0:7] = Permute(a[0:7], P);          (1)
    b[0:3] = t[0:3] + t[4:7];

the use of t[0:3] (or t[4:7]) is a partial use of t[0:7].

Our algorithm must handle propagation of definitions to multiple partial uses. In code segment (1), unless P can be partitioned into permutations on 4-element vectors, it cannot be propagated to the partial uses of t[0:7]. For example, if P is <0, 4, 1, 5, 2, 6, 3, 7>, the permutation matrix is not block-diagonal and cannot be divided into two sub-matrices, so the permutation cannot be propagated to the partial uses. On the other hand, if P is <0, 2, 1, 3, 4, 6, 5, 7>, it can be bisected into two independent permutations and propagated to the uses as follows:

    t[0:3] = Permute(a[0:3], <0, 2, 1, 3>);
    t[4:7] = Permute(a[4:7], <0, 2, 1, 3>);

Let y[by:ey] = Permute(x[bx:ex], P) be the permutation to be propagated and y[b'y:e'y] be a partial use; that is, b'y >= by and e'y <= ey. Assume P is represented as the index vector <p0, p1, ..., p(n-1)>. The following condition must be satisfied for the permutation to be partitioned and propagated to the use:

    for all i in [b'y, e'y]: pi in [b'x, e'x]

where [b'x, e'x] ⊆ [bx, ex], e'x - b'x = e'y - b'y, and b'x is a multiple of Lr/Le, the number of elements in a physical vector register. The condition specifies that all elements in the partial use must come from a contiguous (and aligned) section of the source vector. If the condition is not satisfied, we say the permutation crosses the partial use boundary. Partial use boundaries are one of the major obstacles to propagation in many applications.

To enable propagation across partial use boundaries, we developed several techniques, including permutation reshaping and permutation decomposition, discussed in the next two subsections.

It is possible for a vector definition and the corresponding use to overlap partially with neither containing the other. Our algorithm conservatively does not propagate the permutation in such situations.

If the definition is a subset of the use, we put the propagation on hold (line 3 in Figure 7) until we can determine whether there are other permutations that jointly define the whole vector use and whose source vectors can be joined. If so, we continue the propagation; otherwise, the propagation ends.
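The partitioning condition can be tested mechanically. A C sketch (ours) that checks whether an index vector can be split at register-width boundaries into independent sub-permutations, i.e., whether each w-wide output chunk draws all its (non-star) indices from a single aligned w-wide window of the input:

    enum { STAR = -1 };

    /* Return 1 if index vector p of length n can be partitioned into
       independent w-element permutations. */
    static int partitionable(const int *p, int n, int w)
    {
        for (int c = 0; c < n / w; c++) {
            int base = -1;
            for (int i = c * w; i < (c + 1) * w; i++) {
                if (p[i] == STAR)
                    continue;
                if (base < 0)
                    base = (p[i] / w) * w;     /* aligned source window */
                if (p[i] < base || p[i] >= base + w)
                    return 0;                  /* crosses the boundary  */
            }
        }
        return 1;
    }

With w = 4, this rejects <0, 4, 1, 5, 2, 6, 3, 7> and accepts <0, 2, 1, 3, 4, 6, 5, 7>, matching the two cases above.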
4.2.1 Reshaping Permutations

When a pair of partial uses of a permutation are operands of a commutative operation, it may be possible to reshape the permutation. Consider code sequence (1) again. When P = <0, 5, 2, 7, 4, 1, 6, 3>, the permutation cannot be propagated to its partial uses. However, because the two partial uses are operands of a commutative operation (add), we can reshape the permutation pattern in the definition to <0, 1, 2, 3, 4, 5, 6, 7> by exchanging the corresponding elements (5 <-> 1 and 7 <-> 3) of the two operands. The reshaped permutation does not affect the value of b[0:3] and does not cross the partial use boundaries. In fact, in this example, P is reshaped to an identity permutation and can thus be eliminated completely.

Figure 8 gives the general algorithm for reshaping permutations. Vectors v1 and v2 are operands of a commutative operator as well as partial uses of a vector v defined by permutation P. In the algorithm, the symbol "||" represents vector concatenation. The first loop checks whether the elements in v1 and v2 come from two contiguous aligned sections of x of the size of v1 (v2); this is a necessary condition for reshaping, so that by switching elements one may move all the elements from one contiguous aligned section of x to v1 and the others to v2. The second loop performs the reshaping. If a pair of corresponding elements from v1 and v2 come from the same section of x, e.g., b < Q[v1.begin+i] AND b < Q[v2.begin+i], reshaping fails.

    reshape_permutation(use v1, use v2, stmt "v=Permute(x,P)")
      IF v1.length ≠ v2.length THEN RETURN P;
      a[0:2*v1.length-1] <- P[v1.begin:v1.end] || P[v2.begin:v2.end]
      sort elements in a;
      e <- a[0] + v1.length;
      b <- a[2*v1.length-1] - v1.length;
      FOR i = 0 TO 2*v1.length-1 DO
        IF e <= a[i] <= b THEN RETURN P;
      END
      Q <- P;
      FOR i = 0 TO v1.length-1 DO
        IF (b < Q[v1.begin+i] AND b < Q[v2.begin+i]) OR
           (Q[v1.begin+i] < e AND Q[v2.begin+i] < e) THEN
          RETURN P;
        ENDIF
        IF b < Q[v1.begin+i] AND Q[v2.begin+i] < e THEN
          Q[v1.begin+i] <-> Q[v2.begin+i];
        ENDIF
      END
      RETURN Q;

Figure 8. The algorithm for reshaping permutations.

4.2.2 Permutation Decomposition

This section introduces two techniques to decompose a costly non-propagatable permutation into two permutations, one of which is fast while the other can be propagated.

Register-wise permutations. Certain permutations can be translated into register assignments, which in turn may be folded by copy propagation and thus incur no permutation overhead. We refer to such permutations as register-wise permutations. For example, <0, 1, 4, 5, 2, 3, 6, 7> is a register-wise permutation for two-way vector registers.

If a permutation cannot be propagated through partial use boundaries, we can attempt to decompose it into two permutations: a register-wise permutation and one that can be partitioned and further propagated to the partial uses.

Consider code sequence (1) again. If P is <1, 0, 5, 4, 3, 2, 7, 6>, the propagation would cross partial use boundaries and cannot take place. Assuming two-way vector registers, we can decompose P into

    P2 · P1 = <1, 0, 3, 2, 5, 4, 7, 6> · <0, 1, 4, 5, 2, 3, 6, 7>

where P1 is a register-wise permutation (for 2-way vectors) and P2 is bisectable and can therefore be propagated into the partial uses. The final code after decomposition is:

    t1[0:7] = Permute(a[0:7], <0, 1, 4, 5, 2, 3, 6, 7>);
    t2[0:3] = t1[0:3] + t1[4:7];
    b[0:3] = Permute(t2[0:3], <1, 0, 3, 2>);
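Whether a pattern is register-wise, i.e., moves only whole register-sized blocks, is equally easy to test; a C sketch (ours):

    /* A permutation is register-wise if, for each w-element chunk of
       the output, the indices form a contiguous run base, base+1, ...,
       base+w-1 with base a multiple of w: the pattern then amounts to
       register moves. */
    static int register_wise(const int *p, int n, int w)
    {
        for (int c = 0; c < n / w; c++) {
            int base = p[c * w];
            if (base % w != 0)
                return 0;
            for (int i = 1; i < w; i++)
                if (p[c * w + i] != base + i)
                    return 0;
        }
        return 1;
    }

For example, register_wise applied to <0, 1, 4, 5, 2, 3, 6, 7> with w = 2 returns 1, matching the example above.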
Function decompose_permutation_reg in Figure 9 gives the permutation decomposition algorithm. Vector v1 is a partial use of vector v defined by permutation P. The first loop checks whether the permutation is decomposable such that one of the resulting permutations is register-wise and the other is propagatable; the second loop performs the decomposition.

Platform-dependent decomposition. On some SIMD devices, a general permutation instruction must be translated into multiple native instructions. Consider the permutation <0, 4, 2, 6>, assuming that the physical vector register is 4-way. On SSE, this permutation must be decomposed into two shuffle instructions (see Section 5). One moves elements from the first input register to the low half of the output, and elements from the second to the high half; in other words, the first SSE shuffle instruction performs the permutation <0, 2, 4, 6>. To complete the translation, a second SSE shuffle instruction implementing the permutation <0, 2, 1, 3> must be generated.

Notice that the second instruction is an intra-register shuffle, which is always propagatable. Therefore, it is helpful to decompose permutations that would result in shuffles not supported natively into two permutations such that the second includes only intra-register data movements.

For example, P = <0, 4, 2, 6, 1, 5, 3, 7> translates into 4 SSE shuffle instructions, two of which are intra-register shuffles (Section 5). Thus, we can decompose P into P2 · P1, with P1 = <0, 2, 4, 6, 1, 3, 5, 7> and P2 = <0, 2, 1, 3, 4, 6, 5, 7>. P1 can be translated into 2 shuffle instructions, and P2 can be propagated to the uses and further merged with other permutations.

The algorithm for this transformation is shown in Figure 9 as function decompose_permutation_shuf. It makes use of the code generation algorithm discussed in Section 5 to compute the shuffle pattern.

    decompose_permutation_reg(use v1, stmt "v=Permute(x,P)")
      a[0:v.length-1] <- P[v.begin:v.end]
      sort elements in a
      Let r be the element size of each vector register
      FOR i = 0 TO v.length/r-1 DO
        IF NOT exist j, j*r <= a[i*r:i*r+r-1] < j*r+r THEN
          RETURN (P, I);   // Not decomposable
        ENDIF
      END
      P' <- I, an identity permutation;
      FOR i = 0 TO v.length/r-1 DO
        IF a[i*r] ≠ v.begin+i*r THEN
          P'[i*r:i*r+r-1] <-> P'[v.begin+i*r:v.begin+i*r+r-1]
        ENDIF
      END
      RETURN (P*INV(P'), P')   // INV computes the inverse

    decompose_permutation_shuf(use v1, stmt "v=Permute(x,P)")
      P' <- I, an identity permutation
      FOR i = 0 TO v.length/r-1 DO
        insts <- code_gen(P[v.begin+i*r:v.begin+i*r+r-1])
        IF last shuffle in insts is an intra-register one THEN
          P'[v.begin+i*r:v.begin+i*r+r-1] <- shuffle pattern
        ENDIF
      END
      RETURN (P', INV(P')*P)   // INV computes the inverse

Figure 9. Two permutation decomposition algorithms.

4.3 Permutation Placement within a Statement

After a permutation is propagated, the algorithm consolidates consecutive permutations in the statement (composition rule) and, if needed, transforms the statement into the form v = Permute(...) (distributive rule) to enable further propagation.

Without loss of generality, we limit our focus to three-address statements with vector expressions. Let v = Permute(x, P) be the permutation to be propagated. The expression containing the use of v is transformed bottom-up. Let n be an expression node in the statement being propagated into, and let op be any regular element-wise operation. The following transfer functions specify how to move Permute within an expression tree, according to the form of node n:

• n ≡ Permute(v, Q). Apply the composition rule to obtain n ≡ Permute(x, Q · P).

• n ≡ op(v). Move the permutation outside op to obtain n ≡ Permute(op(x), P).

• n ≡ op(v, e). Consider the following cases:

  1. If e ≡ Permute(r, P), move the permutation outside op to obtain n ≡ Permute(op(x, r), P).

  2. If e ≡ C, where C is a constant vector, move the permutation outside op to obtain n ≡ Permute(op(x, C'), P), where C' is a constant vector obtained by applying the reverse permutation to C, i.e., C = Permute(C', P). Constant vectors can be permuted at compile time.

  3. Otherwise, we wait for the propagation to e to finish. If this changes e to Permute(r, P), we will be able to continue the propagation.

If the algorithm finds a node that does not match any of these cases, it stops and returns.

4.4 Discussion

To simplify the presentation, we have so far ignored some important issues of the basic algorithm; we discuss them here.

Optimality. As shown in Figure 7, there are three key operations. The composition (line 5 in Figure 7) always reduces the number of permutations. The propagation (line 8) sometimes reduces it. However, the statement at line 3 may introduce more permutations when there are multiple uses. Does the algorithm always produce the minimum number of data permutations? Unfortunately, no deterministic algorithm can find the global optimum for an arbitrary basic block in polynomial time: the problem of finding the minimum number of permutations for the statements in a basic block can be mapped to the multi-terminal cut problem introduced in [5], even if we assume there are no partial use boundaries and all permutations cost the same. Thus, the problem is NP-hard unless the data flow graph is a tree or there are only two different permutations [5].

Therefore, we apply a simple heuristic: a permutation is propagated to a use only if it can be merged there with another permutation; otherwise, no propagation is performed. By proceeding conservatively in this way, we guarantee that the number of permutations decreases monotonically.

In addition, the above algorithm uses a top-down propagation strategy. An alternative is to propagate the permutations bottom-up by following use-def chains. The reverse direction of propagation can sometimes create more opportunities for composition; in fact, propagating in different directions may produce entirely different permutations for the same program. In our algorithm, we propagate in both directions to obtain the final permutation results, which is never worse than one-direction propagation.
Special Permutations. In the optimization algorithm, special permutations such as reduction and replication need to be handled differently. For example, a reduction can be merged with a permutation propagated to its input, but a reduction cannot be used as the starting point of a propagation. Similarly, a replication can absorb permutations propagated along use-def links but will not itself be propagated.

Although it is feasible to propagate permutations with ★ elements like other permutations, doing so may introduce more computation; for example, if half or more of the elements in a permutation are ★, propagating it might not be beneficial.

Permutations with ◇ elements, on the other hand, essentially combine two vectors, which makes propagating them extremely complicated. Thus, in our algorithm, permutations with ◇ elements are never propagated. Like reductions and replications, however, they can still be merged with other permutations propagated to them.

5. Code Generation

This section describes the algorithm that translates Permute operations into target machine instructions. The algorithm has two steps:

1. Translate each Permute operation into a sequence of vperm operations (Section 5.2).
2. Map vperm operations into native data-movement instructions (Section 5.3).

5.1 Hardware Permutation Instruction Set

This section gives an overview of the permutation instructions supported by the SSE family and VMX [11, 18]. These instructions implement either generic or restricted permutations.

VMX supports vperm, the generic permutation instruction described in Figure 1.
SSE supports shufps,

    R3 <- shufps(R1, R2, P)

which is considered a general permutation instruction, although it is more constrained than vperm. For example, as shown in Figure 10, elements from R1 (R2) can only go to the low half (high half) of the output register R3. In addition, shufps manipulates 4-byte elements instead of bytes as vperm does. The SSE family also provides similar instructions, pshufd, pshufhw, and pshuflw, for other data types [11].

[Figure 10: R3 <- shufps(R1, R2, P) with the example pattern P = {1, 3, 2, 0}: the two low elements of R3 are selected from R1 and the two high elements from R2.]

Both VMX and SSE support restricted permutation instructions, where the permutation pattern is built into the instruction. These include:

• interleave, which interleaves data elements from the low halves or high halves of two input registers.
• shift, which rotates elements within a register. In addition, VMX provides an instruction, vsldoi, to shift elements across two registers.
• select, which selects data elements from one of the input registers and places them in the corresponding positions of the output register.

In general, restricted permutation instructions can be implemented by generic permutation instructions. However, since restricted permutation instructions have their permutation pattern built into the instruction, they tend to be more efficient: they use fewer registers and/or cycles.
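Like vperm, shufps has a compact scalar model. The sketch below is ours (on real hardware the pattern is an 8-bit immediate, two bits per output element):

    /* Scalar model of SSE shufps: the two low output elements are
       selected from a, the two high output elements from b, each by
       a 2-bit field of the control pattern. */
    static void shufps_model(const float a[4], const float b[4],
                             unsigned imm8, float dst[4])
    {
        dst[0] = a[(imm8 >> 0) & 3];
        dst[1] = a[(imm8 >> 2) & 3];
        dst[2] = b[(imm8 >> 4) & 3];
        dst[3] = b[(imm8 >> 6) & 3];
    }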

5.2 Translating Permute into vperm Operations

We first translate generic permutation operations into an internal operation similar to the vperm instruction (Figure 1) of VMX. We use vperm because it closely resembles the format of actual data permutation instructions. The format of our vperm operation is:

    R3 = vperm(R1, R2, P)

where R1, R2, and R3 are three virtual vector registers of length Lr, the size of a physical register of the target SIMD device, and P is a vector literal of the same size.

When translating normalized vector programs, each vector is mapped to a set of virtual vector registers. Since all vectors are stride-one, aligned, and of a length that is a multiple of Lr, the mapping is straightforward: a long vector is strip-mined into chunks, and a virtual vector register is allocated for each chunk. For example, assuming that a vector register is 4 elements long, u[0:15] will be strip-mined into 4 chunks and mapped to 4 virtual vector registers. Since the mapping between vector chunks and virtual registers is one-to-one, we use vector expressions to represent the mapped virtual vector registers; for example, we use u[0:3] to denote the first virtual register allocated for u[0:15].

For a generic permutation, y = Permute(x, P), the goal of our algorithm is to generate as few vperm operations as possible to implement the permutation specified by P. No fast optimal algorithm is known for this problem.

Consider a single output register whose elements come from N different input registers. Since each vperm operation can only collect elements from two registers into one, at least N - 1 instructions are needed to generate the output. However, the bandwidth of those vperm operations may not be fully utilized (when N > 2), so there may be unused slots in the virtual registers that hold intermediate results. In the hope that those slots can be used in the construction of other output registers, our algorithm always maximizes the number of unused slots while building one output register.

To illustrate these ideas, consider the following statement:

    v[0:15] = Permute(u[0:15], P)          (2)

where P = <0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15>. Assume that a vector register is 4 data elements long. During code generation, v[0:15] will be mapped to 4 virtual vector registers. To compute the first output vector, i.e., v[0:3], three vperm operations are needed to collect the elements u[0], u[4], u[8], u[12], as shown in Figure 11.

[Figure 11: Two feasible ways, (a) and (b), of collecting <u0, u4, u8, u12> from four input registers with three vperm operations.]

Both (a) and (b) in Figure 11 generate the result, v[0:3], using three vperm operations. However, in (a), u[0] and u[4] are moved three times, u[8] twice, and u[12] once, for a total of 9 element movements; in (b), each element is moved twice, for a total of 8. Thus (b) requires fewer data movements than (a) and, with the same number of vperm operations, leaves more unused slots.

In order to minimize the number of element movements for one output register, ŷ, of a permutation y = Permute(x, P), the algorithm proceeds as shown in Figure 12:

    generate_vperm(stmt "y = Permute(x, P)")
      Map x to the virtual registers X0, X1, ..., Xn
      Map y to the virtual registers Y0, Y1, ...
      Let VR[i] be the register containing element y[i] (= x[P[i]], or Ym if P[i] is ◇)
      Let Loc[i] be y[i]'s location in VR[i]
      Let r be the element size of the vector register
      FOR i = 0 TO y.length/r-1 DO
        W <- set of different values in VR[i*r:i*r+r-1]
        WHILE |W| > 1 DO
          n[R] <- |{j | VR[j] = R, i*r <= j < i*r+r}|
          Find R1, R2 from W that minimize n[R1] + n[R2]
          IF n[R1] + n[R2] ...

Figure 12. The algorithm for generating vperm operations.
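The greedy core of generate_vperm, picking the pair of source registers that currently contribute the fewest elements to the output chunk so that a single vperm leaves as many unused slots as possible, can be sketched in C as follows (our reconstruction of the visible steps; register bookkeeping is reduced to integer ids):

    /* For one w-element output chunk starting at index lo, VR[j]
       holds the id of the register currently containing output
       element j. Pick the two distinct ids contributing the fewest
       elements, so that combining them with one vperm maximizes the
       unused slots left for later reuse. Sets *r1 = *r2 = -1 when
       fewer than two ids remain. */
    static void pick_pair(const int *VR, int lo, int w, int *r1, int *r2)
    {
        int best = 2 * w + 1;
        *r1 = *r2 = -1;
        for (int i = lo; i < lo + w; i++) {
            for (int j = lo; j < lo + w; j++) {
                if (VR[i] == VR[j]) continue;      /* need two ids   */
                int n1 = 0, n2 = 0;
                for (int k = lo; k < lo + w; k++) {
                    if (VR[k] == VR[i]) n1++;
                    if (VR[k] == VR[j]) n2++;
                }
                if (n1 + n2 < best) {
                    best = n1 + n2;
                    *r1 = VR[i];
                    *r2 = VR[j];
                }
            }
        }
    }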

[Figure 13: six panels, (a)-(f), plotting operations per second (in units of 100M) for 32-point FFT and 32-point WHT, and sorting stages per second for bitonic sorting, against problem size; the bottom row, (b), (d), (f), shows results on SSE2.]

Figure 13. Performance of Group II programs. "Scalar Opt", "Scalar Fast", and "Scalar O3" are optimized scalar codes generated by SPIRAL, compiled with -O3 -qaltivec, -fast, and -O3, respectively; "SIMD Base" are SIMD codes with "Base" permutation generation as in Table 2; "SIMD Opt" are SIMD codes with the permutation optimization.

[Figure 14: two plots of the performance improvement (in %) of matrix transpose (against matrix size N x N) and bit-reversed reordering (against array size N) on VMX and SSE2.]

Figure 14. Performance of matrix transpose and bit-reversal reordering.

For these kernels, the overhead of data permutations nullifies the performance benefit of SIMD loads/stores, but the aggregated register space helps the performance for large sizes.

Finally, the speedups over scalar code for all applications are shown in Figure 15. The speedups range from 1.14 to 2.58 on VMX and from 1.51 to 3.77 on SSE2. Since VMX has two scalar FP units, the speedups on VMX are lower than those on SSE2 for most applications. The above-2 speedup on VMX is obtained on the bitonic sorting program, where expensive comparison-and-swap operations are replaced by native SIMD max and min operations.

[Figure 15: two bar charts of speedups for all applications (fft.4, fft.5, fft.6, wht.4, wht.5, wht.6, bitonic.5, transpose, bit-reverse, r-fir, r-color, c-dot, c-saxpy), with aligned and misaligned input data.]

Figure 15. Performance of all applications on VMX and SSE2. "VMX-Base" and "SSE2-Base" are speedups of SIMD codes without the permutation optimization on VMX and SSE2, respectively; "VMX-Opt" and "SSE2-Opt" are speedups with the permutation optimization.

Figure 15 also shows the speedups of all applications when the input data is misaligned. On average, performance drops by 3.2% on VMX and 8.0% on SSE2 because of misalignment. For Group II applications, the permutation optimization algorithm improves performance by 60% on VMX and 140% on SSE2, since the data permutations introduced by the misalignment are almost completely eliminated by the optimization algorithm (see Table 3). For the Group III applications, however, the improvement obtained by the optimization algorithm is only 6% on VMX and 10% on SSE2.

7. Related Work

Automatic generation of SIMD instructions, mainly for multimedia extensions, has been studied in both academia and industry. Most of the techniques considered in these studies are based on traditional loop-based vectorization [4, 12, 26]. Others make use of instruction packing techniques to exploit data parallelism within a basic block [14, 16, 26]. Several compilers support automatic vectorization for multimedia extensions, such as the Intel compilers [1], the IBM XL compilers [6], and the GNU compiler [8]. Most of them employ both vectorization and instruction packing techniques.

Memory alignment is a common source of data permutations and has been studied extensively. Early work on alignment handling [1, 4, 15] applies loop peeling and versioning to translate computations where misaligned references are relatively aligned to each other; those schemes therefore do not generate any permutations. Recent work in [6, 27] handles arbitrary memory alignment while minimizing permutation overhead. When dealing with misalignment within a statement, our conversion algorithm is equivalent to the zero-shift policy in [6], and the lazy-shift and dominant-shift policies in [6] can be derived by applying the distributive and composition rules on Permute. Compared to [6, 27], our algorithm is more powerful in terms of minimizing alignment overhead across statements, thanks to our propagation algorithm. On the other hand, [6, 27] handle arbitrary alignment, whereas ours handles only compile-time alignment. Another major difference is that [6, 27] target loops while our algorithm is for straight-line code, so the code generation for permutations is quite different.

There are a few recent studies on generating efficient permutation instructions for SIMD devices [13, 20, 21]. In [13], an algorithm was introduced to generate permutation instructions for SIMD devices; despite having a similar workflow, their algorithm and ours work at different levels of intermediate representation. In [20], an algorithm was proposed to automatically generate permutation instructions for a new language, StreamIt, and output platform, VIRAM. In [21], an extension of the GCC vectorizer was introduced to represent data references with non-unit strides and generate efficient permutation instructions for them. Both [20] and [21] focus on permutation representation and code generation, rather than on optimizing data permutations.

In the domain of signal processing, an optimization algorithm on permutation matrices at the formula level was introduced as a key step of vectorization for SIMD devices [7]. In [7, 9], domain-specific techniques are proposed to generate efficient SIMD code for complex computation in DSP programs. With the support of indirect register accesses in a DSP processor, a different scheme of handling data permutations was discussed in [19].

There were several other interesting studies on more general definitions of data permutations. In [25], permutation matrices are used to optimize bit-wise operations in StreamIt programs; our strategy to optimize element-wise permutations is similar to theirs. Despite targeting distributed memory systems, the algorithm presented in [3] for array alignment can also be extended as an alternative to our optimization algorithm. Our composition rule for data permutations is similar to the idea of synthesizing array operations in [10].
8. Conclusion

Due to the constraints on memory units, the overhead of data permutations makes it extremely difficult to achieve peak performance on SIMD devices. The strategy presented in this paper optimizes all forms of data permutations within a basic block in a unified manner. With the help of permutation propagation, the optimization algorithm reduces the number of permutations by merging consecutive ones. Our strategy also introduces an efficient code generation algorithm for data permutations. The experimental results show that the performance of SIMD computation can be significantly improved by our optimization strategy.

As mentioned in the paper, the current representation of data permutations requires that strides, alignment, and vector length (essentially, the permutation pattern) be known at compile time; relaxing this restriction is an important next step. The optimization algorithm is conservative on data permutations with the special elements ★ and ◇, and it would be interesting to explore more aggressive optimization strategies for this type of data permutation. As we have seen in many applications, data permutation optimization often interacts with other compiler transformations; it would also be valuable to explore this interaction.

Acknowledgments

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) through the Department of Interior grant NBCH1050009 and in part by the National Science Foundation under Awards 0234293, ITR/NGS-0325687, and CSR/AES-0509432.
References

[1] Aart J. C. Bik. The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance. Intel Press, 2004.
[2] CCIR Recommendation 601-2. Encoding Parameters of Digital Television for Studios, 1990.
[3] Siddhartha Chatterjee, John R. Gilbert, Robert Schreiber, and Shang-Hua Teng. Automatic array alignment in data-parallel programs. In POPL '93: Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 16–28. ACM Press, 1993.
[4] Gerald Cheong and Monica Lam. An optimizer for multimedia instruction sets. In Proceedings of the Second SUIF Compiler Workshop, 1997.
[5] E. Dahlhaus, D. S. Johnson, C. H. Papadimitriou, P. D. Seymour, and M. Yannakakis. The complexity of multiterminal cuts. SIAM Journal on Computing, 23:864–894, 1994.
[6] Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. Vectorization for SIMD architectures with alignment constraints. In PLDI '04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 82–93. ACM Press, 2004.
[7] Franz Franchetti, Stefan Kral, Juergen Lorenz, and Christoph W. Ueberhuber. Efficient utilization of SIMD extensions. Proceedings of the IEEE, 93(2):409–425, 2005.
[8] Free Software Foundation. Auto-vectorization in GCC, 2004.
[9] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005.
[10] Gwan-Hwan Hwang, Jenq Kuen Lee, and Dz-Ching Ju. An array operation synthesis scheme to optimize FORTRAN 90 programs. In PPOPP '95: Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 112–122. ACM Press, 1995.
[11] Intel Corporation. IA-32 Intel Architecture Optimization, 2004.
[12] Andreas Krall and Sylvain Lelait. Compilation techniques for multimedia processors. International Journal of Parallel Programming, 28(4):347–361, 2000.
[13] Alexei Kudriavtsev and Peter Kogge. Generation of permutations for SIMD processors. In LCTES '05: Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pages 147–156. ACM Press, 2005.
[14] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In PLDI '00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, pages 145–156. ACM Press, 2000.
[15] Samuel Larsen, Emmett Witchel, and Saman P. Amarasinghe. Increasing and detecting memory address congruence. In PACT '02: Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 18–29. IEEE Computer Society, 2002.
[16] Rainer Leupers. Code Optimization Techniques for Embedded Processors: Methods, Algorithms, and Tools. Kluwer Academic Publishers, 2000.
[17] Xiaoming Li, Maria Jesus Garzaran, and David Padua. Optimizing sorting with genetic algorithms. In CGO '05: Proceedings of the International Symposium on Code Generation and Optimization, pages 99–110. IEEE Computer Society, 2005.
[18] Motorola Inc. AltiVec Technology Programming Environments Manual, 1998.
[19] Dorit Naishlos, Marina Biberstein, Shay Ben-David, and Ayal Zaks. Vectorizing for a SIMdD DSP architecture. In CASES '03: Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 2–11. ACM Press, 2003.
[20] Manikandan Narayanan and Katherine A. Yelick. Generating permutation instructions from a high-level description. In MSP '04: Proceedings of the 6th Workshop on Media and Streaming Processors, 2004.
[21] Dorit Nuzman, Ira Rosen, and Ayal Zaks. Auto-vectorization of interleaved data for SIMD. In PLDI '06: Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, 2006.
[22] Markus Puschel, Jose M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan W. Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232–275, 2005.
[23] Gang Ren, Peng Wu, and David Padua. An empirical study on the vectorization of multimedia applications for multimedia extensions. In IPDPS '05: Proceedings of the 19th International Parallel & Distributed Processing Symposium, 2005.
[24] Nicholas Rizzolo and David Padua. HiLO: High level optimization of FFTs. In LCPC '04: Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing, 2004.
[25] Armando Solar-Lezama, Rodric Rabbah, Rastislav Bodik, and Kemal Ebcioglu. Programming by sketching for bit-streaming programs. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 281–294. ACM Press, 2005.
[26] N. Sreraman and R. Govindarajan. A vectorizing compiler for multimedia extensions. International Journal of Parallel Programming, 28(4):363–400, 2000.
[27] Peng Wu, Alexandre E. Eichenberger, and Amy Wang. Efficient SIMD code generation for runtime alignment and length conversion. In CGO '05: Proceedings of the International Symposium on Code Generation and Optimization, pages 153–164. IEEE Computer Society, 2005.
[28] Jianxin Xiong, Jeremy Johnson, Robert Johnson, and David Padua. SPL: A language and compiler for DSP algorithms. In PLDI '01: Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, pages 298–308. ACM Press, 2001.