Optimizing Data Permutations for SIMD Devices

Gang Ren, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, [email protected]
Peng Wu, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, [email protected]
David Padua, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, [email protected]

Abstract

The widespread presence of SIMD devices in today's microprocessors has made compiler techniques for these devices tremendously important. One of the most important and difficult issues that must be addressed by these techniques is the generation of the data permutation instructions needed for non-contiguous and misaligned memory references. These instructions are expensive and, therefore, it is of crucial importance to minimize their number to improve performance and, in many cases, enable speedups over scalar code.

Although it is often difficult to optimize an isolated data reorganization operation, a collection of related data permutations can often be manipulated to reduce the number of operations. This paper presents a strategy to optimize all forms of data permutations. The strategy is organized into three steps. First, all data permutations in the source program are converted into a generic representation. These permutations can originate from vector accesses to non-contiguous and misaligned memory locations or result from compiler transformations. Second, an optimization algorithm is applied to reduce the number of data permutations in a basic block. By propagating permutations across statements and merging consecutive permutations whenever possible, the algorithm can significantly reduce the number of data permutations. Finally, a code generation algorithm translates generic permutation operations into native permutation instructions for the target platform. Experiments were conducted on various kinds of applications. The results show that up to 77% of the permutation instructions are eliminated and, as a result, the average performance improvement is 48% on VMX and 68% on SSE2. For several applications, near perfect speedups have been achieved on both platforms.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—code generation, compilers, optimization

General Terms: Performance, Experimentation, Languages

Keywords: SIMD Compilation, Data Permutation, Optimization

1. Introduction

Single Instruction Multiple Data (SIMD) devices are supported by practically all of today's microprocessors. IBM's VMX and Intel's SSE family are among the most popular SIMD devices. Support for SIMD devices is likely to remain the norm for the foreseeable future, and their importance could increase as these devices become more powerful and easier to program.

Today's SIMD devices can be programmed directly using machine language instructions (possibly in the form of C intrinsics) or indirectly through automatic vectorization. Neither approach is satisfactory. Machine instructions are not easy to use, and today's implementations of automatic vectorization fail in a number of important cases [23]. The difficulty in programming with SIMD instructions is due to the register-to-register nature of the instructions and the limitations of the SIMD load-store units. Thus, vector operations must be partitioned (or strip-mined) into blocks that fit into vector registers. Furthermore, assuming that the vector registers are Lr bytes long (Lr is typically 16), SIMD devices only support accesses to chunks of memory that are Lr bytes long and aligned at Lr-byte boundaries. (Some SIMD devices, such as the SSE family, do support misaligned loads/stores, but with a costly performance penalty.) When this is not the case, vectors must be packed, unpacked, and realigned to enable the application of SIMD operations.

Most SIMD devices provide permutation instructions to support such pack, unpack, and align operations. For example, in the case of VMX, the permutation instruction [18] has the following form:

    R3 <- vperm(R1, R2, P)

[Figure 1: R3 <- vperm(R1, R2, P). The 16-byte pattern P = {00 02 08 12 11 07 13 17 04 18 14 1C 0F 1A 0C 1F} selects bytes from the concatenation of R1 and R2 into R3.]

As shown in Figure 1, the vperm instruction selects an arbitrary set of 16 bytes from the two input registers, R1 and R2 (each 16 bytes long), as indicated by the permutation pattern P, which is also stored in a register. Such instructions enable the implementation of any byte movement operation needed for SIMD programming. For example, Figures 2.a and 2.b show, using the triplet notation of Fortran 90, code snippets to pack a strided access and load a misaligned access, respectively.

a) Implementation of b[0:99] = a[0:198:2] + c[0:99]
     p = <0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27>;
     for(i=0; i<100; i+=4)
       b[i:i+3] = vperm(a[2i:2i+3],a[2i+4:2i+7],p) + c[i:i+3];

b) Implementation of b[0:99] = a[1:100] + c[0:99]
     p = <4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19>;
     for(i=0; i<100; i+=4)
       b[i:i+3] = vperm(a[i:i+3],a[i+4:i+7],p) + c[i:i+3];

Figure 2. SIMD implementation using vperm, where p is a constant vector of bytes. The elements are assumed to be 4-byte single-precision floating point, and the bases of arrays a, b, and c are assumed to be 16-byte aligned.
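The byte-selection semantics of vperm can be modeled in a few lines of scalar C. The sketch below is ours, not part of VMX or any of its headers (the helper name vperm_model is hypothetical); it mirrors the hardware's use of the low 5 bits of each pattern byte to index the 32-byte concatenation of the two inputs:

    #include <stdint.h>

    /* Scalar model of VMX vperm: byte i of R3 is the byte of the
       32-byte concatenation R1||R2 selected by the low 5 bits of
       pattern byte P[i]. */
    static void vperm_model(const uint8_t R1[16], const uint8_t R2[16],
                            const uint8_t P[16], uint8_t R3[16])
    {
        uint8_t src[32];
        for (int i = 0; i < 16; i++) {
            src[i]      = R1[i];
            src[16 + i] = R2[i];
        }
        for (int i = 0; i < 16; i++)
            R3[i] = src[P[i] & 0x1f];  /* low 5 bits pick one of 32 bytes */
    }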

Data permutations, introduced mainly to overcome the limitations of SIMD memory units, are the most common source of overhead in the programming of SIMD devices. Such overhead is especially pronounced in programs with many non-stride-one or misaligned accesses. Figure 3 gives such an example, an 8-point FFT computation. (For simplicity of presentation, complex operations are not expanded into real operations in the code; otherwise, more data permutations would be needed.) In this paper, we use Fortran 90 vector syntax to represent vectors. The general form of a Fortran 90 vector is a[begin:end:stride]; the stride component of the triplet notation can be omitted if the stride is 1.

    1.  t0[0:6:2] = x[0:3:1] + x[4:7:1];
    2.  t0[1:7:2] = x[0:3:1] - x[4:7:1];
    3.  t1[0:7:1] = T8[0:7:1] * t0[0:7:1];
    4.  for (i = 0; i < 2; i++) {
    5.    t2[0:2:2] = t1[i:i+2:2] + t1[i+4:i+6:2];
    6.    t2[1:3:2] = t1[i:i+2:2] - t1[i+4:i+6:2];
    7.    t3[0:3:1] = T4[0:3:1] * t2[0:3:1];
    8.    y[i+0:i+2:2] = t3[0:1:1] + t3[2:3:1];
    9.    y[i+4:i+6:2] = t3[0:1:1] - t3[2:3:1];
    10. }

Figure 3. 8-point FFT code with stride-2 accesses, where t0, t1, t2, t3 are temporary arrays and T4 and T8 are constant arrays.

A naïve implementation of Figure 3 on SIMD devices, shown in Figure 4, requires 8 data permutations: to pack stride-2 vectors (at statements 3, 5, 10, 16) and to coalesce vectors after unrolling (at statements 6, 9, 12, 15). In Figure 4, permutation is represented by a generic operation, Permute(v, P), which reorders vector v according to the index vector P (i.e., v(P(:)) in Fortran 90 notation). Such a generic permutation operation can be translated into a sequence of permutation instructions.

    1.  v1[0:3:1] = x[0:3:1] + x[4:7:1];
    2.  v1[4:7:1] = x[0:3:1] - x[4:7:1];
    3.  t0[0:7:1] = Permute(v1[0:7:1], P1);
    4.  t1[0:7:1] = T8[0:7:1] * t0[0:7:1];
    5.  v2[0:7:1] = Permute(t1[0:7:1], P2);
    6.  u1[0:7:1] = Permute(v2[0:7:1], P3);
    7.  u2[0:3:1] = u1[0:3:1] + u1[4:7:1];
    8.  u2[4:7:1] = u1[0:3:1] - u1[4:7:1];
    9.  v3[0:7:1] = Permute(u2[0:7:1], P4);
    10. t2[0:7:1] = Permute(v3[0:7:1], P5);
    11. t3[0:7:1] = T4_2[0:7:1] * t2[0:7:1];
    12. u3[0:7:1] = Permute(t3[0:7:1], P6);
    13. u4[0:3:1] = u3[0:3:1] + u3[4:7:1];
    14. u4[4:7:1] = u3[0:3:1] - u3[4:7:1];
    15. v4[0:7:1] = Permute(u4[0:7:1], P7);
    16. y[0:7:1]  = Permute(v4[0:7:1], P8);

Figure 4. A naïve SIMD implementation of the FFT code in Figure 3 after converting the stride-2 vectors to unit strides and unrolling the loop. Temporary arrays v1, v2, v3, v4 are introduced to convert stride-2 vectors, and u1, u2, u3, u4 come from loop unrolling. P1 to P8 specify the permutation patterns in those operations. T4_2 results from concatenating two T4 arrays.

Given that most SIMD units operate on short vectors (2 or 4 elements), minimizing permutation overhead is of crucial importance for performance. A careful analysis shows that the code of Figure 4 can be rewritten with far fewer permutations, as in Figure 5.

    1.  v1[0:3:1] = x[0:3:1] + x[4:7:1];
    2.  v1[4:7:1] = x[0:3:1] - x[4:7:1];
    3.  t1[0:7:1] = T8[0:7:1] * v1[0:7:1];
    4.  u1[0:7:1] = Permute(t1[0:7:1], Q1);
    5.  u2[0:3:1] = u1[0:3:1] + u1[4:7:1];
    6.  u2[4:7:1] = u1[0:3:1] - u1[4:7:1];
    7.  t3[0:7:1] = T4_2[0:7:1] * u2[0:7:1];
    8.  u3[0:7:1] = Permute(t3[0:7:1], Q2);
    9.  y[0:3:1]  = u3[0:3:1] + u3[4:7:1];
    10. y[4:7:1]  = u3[0:3:1] - u3[4:7:1];

Figure 5. An optimized SIMD implementation of the FFT code in Figure 3.
Perhaps the most natural alternative to low-level SIMD programming is to use a general vector notation similar to that of Fortran 90 and rely on compilers to map vector operations onto SIMD instructions. This approach allows us to separate two orthogonal issues in SIMD compilation: the extraction of SIMD parallelism and efficient SIMD code generation. In this paper, we take a first step in addressing the second issue. We focus on minimizing permutation overhead when translating straight-line code segments (possibly the result of compiler unrolling) containing scalar and constant-length vector operations into efficient code for microprocessor SIMD devices.

Our optimization algorithm accepts vector programs as input, assuming that SIMD parallelism is either extracted by a prior pass of the compiler or explicitly specified in the program. The input vector expressions can have arbitrary lengths, memory alignments, and strides. Since the optimization algorithm relies on the permutation matrices being constant, we restrict input vector expressions to have compile-time constant strides, alignment, and length. (The requirement for a compile-time constant vector length can be relaxed by strip-mining variable-length vectors into sequences of fixed-length vector operations.)

Our compiler strategy has three main phases. First, a translator converts all data permutation operations into an internal representation based on the Permute operation. Most data permutations come from packing, unpacking, and realigning non-contiguous and/or misaligned memory references in the source code. The others result from other compiler transformations and/or user specifications.

Second, the number of data permutations is reduced by an optimizer. The algorithm propagates data permutations across statements and composes consecutive permutations whenever possible. The code in Figure 5 was obtained by applying our optimization pass.

Finally, a code generator translates the internally represented permutations into native in-register permutation instructions, such as vperm, for the targeted SIMD device.

Experiments were conducted on a set of applications from different domains. The results indicate that our permutation optimization framework significantly reduces the number of data movement instructions (by up to 77%), which improves performance on average by 48% on VMX and 68% on SSE2. Near perfect speedups have been achieved on both platforms.

The major contributions of this paper are:

• An algorithm to translate all forms of data permutations into an internal representation based on the Permute operation. (Section 3)
• An optimization algorithm to reduce the number of data permutations in a basic block by propagating them across statements and merging them together whenever possible. (Section 4)
• An algorithm to translate Permute operations onto hardware permutation instructions through a two-phase mapping process. (Section 5)
• An experimental demonstration of the effectiveness of the new techniques on codes requiring non-trivial permutation patterns, such as FFT and sorting network codes. (Section 6)

The rest of the paper is organized as follows. Section 2 describes the Permute operation used in our internal representation. Sections 3 through 6 cover the three phases and the experiments just mentioned. Section 7 presents related work, and we conclude in Section 8.

2. Permutation, Reduction and Replication

In this section, we define the three data movement operations, Permute, Reduct, and Spread, used in our internal representation.
2.1 Permutation Operation

The operation Yn <- Permute(Xn, P) performs a permutation P on a vector Xn of n elements to produce a vector Yn of the same length. P can be specified as a permutation matrix, Pnxn, which is an identity matrix with its rows reordered. A permutation operation can then be viewed as a matrix-vector multiplication, i.e.,

    Permute(Xn, Pnxn) ≡ Pnxn · Xn.
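To make the equivalence concrete, the following self-contained C example (ours, purely illustrative) applies the same permutation once through the index-vector form introduced below and once as a permutation-matrix product; the two results agree:

    #include <stdio.h>

    #define N 4

    int main(void)
    {
        double x[N] = {10, 20, 30, 40};
        int    p[N] = {0, 2, 1, 3};          /* index-vector form of P */
        double y1[N], y2[N];

        /* Index-vector form: y[i] = x[p[i]]. */
        for (int i = 0; i < N; i++)
            y1[i] = x[p[i]];

        /* Matrix form: row i of P has a single 1 in column p[i];
           then y = P * x. */
        double P[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            P[i][p[i]] = 1.0;
        for (int i = 0; i < N; i++) {
            y2[i] = 0.0;
            for (int j = 0; j < N; j++)
                y2[i] += P[i][j] * x[j];
        }

        for (int i = 0; i < N; i++)
            printf("%g %g\n", y1[i], y2[i]);  /* the two columns agree */
        return 0;
    }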

A more compact representation specifies the permutation pattern as an index vector whose i-th element is the index of the input vector element that is to be moved to the i-th element of the output vector. For example,

    Permute(<x0, x1, x2, x3>, <0, 2, 1, 3>) = <x0, x2, x1, x3>.

Maintaining a square permutation matrix greatly simplifies our optimization algorithm. However, it also requires the input and output vectors to be of the same length. To express data permutations with mismatched input and output lengths, we use two special values:

• The first special value is a star, denoted ★. If the i-th element of an index vector is ★, the i-th element of the output vector is undefined (i.e., we do not care about its value). A star can be used to specify data permutations where the output vector is shorter than the input. For example, the following permutation gathers the odd elements of the input vector:

    Permute(<x0, x1, x2, x3>, <1, 3, ★, ★>) = <x1, x3, ★, ★>.

Using ★ elements, we can explicitly track "unused" elements during a permutation. This information can be used by the optimization algorithm: ★ indicates that the permutation bandwidth is not fully utilized and that the permutation may be combined with other permutations applied to the same input vector.

• The second special value is a diamond, denoted ◇. If the i-th element of an index vector is ◇, the i-th element of the output vector remains unchanged. It is used to specify data permutations where the output vector also implicitly serves as an input. For example, suppose Y4 = <y0, y1, y2, y3>. The outcome of the permutation

    Y4 <- Permute(<x0, x1, x2, x3>, <◇, 1, ◇, 2>)

is Y4 = <y0, x1, y2, x2>.

As discussed in Section 3.1, ◇ elements are generated when converting stores to strided vectors into SIMD form. Such a conversion involves a special read-modify-write sequence that is conveniently represented by ◇. Using ◇ elements, we can explicitly track "unchanged" elements during a permutation. This information can be used by the optimization algorithm: for example, one may combine two permutations over the same output vector into a single permutation without ◇ elements.

2.2 Properties of the Permutation Operation

Two rules form the foundation of our optimization algorithm on data permutations.

Composition Rule. This rule states that two consecutive permutations can be composed into one:

    Permute(Permute(Xn, Pnxn), Qnxn) ≡ Permute(Xn, Qnxn · Pnxn)

This composition rule is the basis of our optimization algorithm: each time the rule is applied, we reduce the number of permutations by one.

It is safe to apply the composition rule in the presence of ★, since these values will be discarded eventually. On the other hand, since a permutation containing ◇ elements combines data from both input and output, the composition rule cannot be applied when there is a ◇ in P. It is, however, still applicable when Q contains ◇ elements.

Assume that the i-th elements of the index vectors of P and Q are pi and qi, respectively. Then the i-th element of the index vector of R = Q · P can be computed as follows:

    ri = p_qi   if qi ≠ ★ and qi ≠ ◇
    ri = ◇      if qi = ◇
    ri = ★      if qi = ★

From a different viewpoint, the composition rule also says that a permutation can be decomposed into two permutations. Permutation decomposition is very important during permutation propagation, especially when different patterns have different costs. However, there may be a large number of ways to decompose a permutation; in practice, we only attempt to decompose a permutation into two permutations of special formats (see Section 4.2.2).

Distributive Rule. This rule says that a permutation operation can be distributed over element-by-element vector operations. Let op be an element-by-element operation; then

    Permute(Xn, Pnxn) op Permute(Yn, Pnxn) ≡ Permute(Xn op Yn, Pnxn)

The distributive rule allows us to move permutations over operations. By moving common permutations toward the root of an expression, we can reduce the number of permutations. It may also create more consecutive permutations and thus enable the application of the composition rule.

It is safe to apply the distributive rule in the presence of ★. While doing so can save permutation operations, it may also result in additional arithmetic operations; depending on the relative cost of permutation and arithmetic operations, one may choose to distribute or un-distribute in the presence of ★. On the other hand, it is generally unsafe to distribute in the presence of ◇.
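The composition of index vectors, including the ★ and ◇ cases above, is a one-pass table lookup. In the C sketch below (ours; the STAR and DIAMOND sentinels are an assumed encoding, not the paper's), compose computes R = Q · P, assuming P contains no ◇:

    #include <stddef.h>

    enum { STAR = -1, DIAMOND = -2 };  /* sentinels for the special values */

    /* R = Q . P on index vectors of length n: ri = p[qi] for an
       ordinary qi, while star and diamond entries of Q pass through
       unchanged. P must contain no diamond (the rule does not apply
       in that case). */
    static void compose(const int *p, const int *q, int *r, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (q[i] == STAR)          r[i] = STAR;
            else if (q[i] == DIAMOND)  r[i] = DIAMOND;
            else                       r[i] = p[q[i]];  /* may be STAR */
        }
    }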
2.3 Reduction and Replication

Reduction operations involve a type of data movement that cannot be expressed with general permutations. Therefore, we introduce the notation Yn <- Reduct(Xn, opr) to specify a reduction with operation opr over all n elements of Xn, storing the result in the first element of Yn. For example, Yn <- Reduct(Xn, +) represents

    y0 = x0 + x1 + ... + x(n-1).

From the viewpoint of data permutation, replication is the reverse operation of reduction. A replication (Spread) operation expands a scalar into a vector: Yn <- Spread(Xn, i) specifies a replication of the i-th element of Xn to fill Yn, i.e.,

    Spread(<★, ★, x2, ★>, 2) = <x2, x2, x2, x2>.

Notice that both the input of the replication and the output of the reduction are scalars. However, vectors are used in both representations because most SIMD devices do not support reduction (or spread) instructions that directly target scalar registers as output (or as input).

Composition Rule. This rule says that consecutive Reduct-Permute and Permute-Spread pairs can be composed into one, assuming that opr is associative:

    Reduct(Permute(Xn, Pnxn), opr) ≡ Reduct(Xn, opr)
    Permute(Spread(Xn, i), Pnxn) ≡ Spread(Xn, i)

Distributive Rule. This rule says that Spread can be distributed over element-by-element vector operations:

    Spread(Xn op Yn, i) ≡ Spread(Xn, i) op Spread(Yn, i)
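Scalar models of Reduct and Spread make the intended semantics precise (again our own sketch; a real SIMD target would implement these with in-register shuffles and arithmetic):

    #include <stddef.h>

    /* Reduct(X, +): sum all n elements and store the result in y[0];
       the remaining elements of y are left unspecified. */
    static void reduct_add(const float *x, float *y, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += x[i];
        y[0] = acc;
    }

    /* Spread(X, i): replicate element i of x into every element of y. */
    static void spread(const float *x, float *y, size_t n, size_t i)
    {
        for (size_t j = 0; j < n; j++)
            y[j] = x[i];
    }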

3. Normalization of Vector References

The first step of our compilation strategy is to normalize all array references (vectors): that is, to transform the input program so that all vector references are stride-one, aligned, and of a length that is a multiple of Lr, the length of the vector register. This is accomplished by inserting explicit Permute operations to pack or align data. By normalizing vectors, all permutation operations that are implicit in non-contiguous or misaligned references are made explicit.

During this transformation, the algorithm may introduce temporary arrays (vectors) to hold intermediate results. The live range of these temporary arrays is the basic block being translated; they will be allocated to (virtual) vector registers and therefore will not occupy any memory space.

3.1 Stride Conversion

Normalization transforms a strided vector load into a load of a stride-one vector followed by a pack operation, and a strided vector store into a store to a stride-one vector preceded by an unpack-merge sequence. Both pack and unpack-merge can be expressed using Permute operations.

Strided load. Consider a vector load,

    S: ... = ... v[b:e:s] ... ;

The first step is to allocate a temporary vector t[0:L*s-1], where L=(e-b)/s+1 (also the length of v[b:e:s]). Vector t[0:L*s-1] is defined as a permutation of v[b:b+L*s-1], inserted immediately before S. A portion of the vector, t[0:L-1], is then used to replace v[b:e:s] in statement S. The final version of the code is:

    t[0:L*s-1] = Permute(v[b:b+L*s-1], P);
    S: ... = ... t[0:L-1] ... ;

where P is <0, s, 2s, ..., L*s-s, ★, ..., ★>. Figure 6 (Scheme I) shows an example where stride-2 vector loads are normalized using this scheme.

    Original statement:
      a[0:49] = b[0:98:2] + b[1:99:2];
    After stride conversion (Scheme I):
      t1[0:99] = Permute(b[0:99], <0, 2, 4, ..., 98, ★, ..., ★>);
      t2[0:99] = Permute(b[1:100], <0, 2, 4, ..., 98, ★, ..., ★>);
      a[0:49] = t1[0:49] + t2[0:49];
    After stride conversion (Scheme II):
      t1[0:99] = Permute(b[0:99], <0, 2, 4, ..., 98, ★, ..., ★>);
      t2[0:99] = Permute(b[0:99], <★, ..., ★, 1, 3, 5, ..., 99>);
      a[0:49] = t1[0:49] + t2[50:99];

Figure 6. An example of converting strided loads.

However, this conversion scheme has a major drawback with respect to merging permutations. Consider the example in Figure 6: Scheme I generates two permutations on two largely overlapping vectors, b[0:99] and b[1:100]. As discussed in Section 3.2, our merging algorithm tries to combine permutations that operate on the same input vectors.

To facilitate the merging of permutations, we modify the original conversion formula by truncating the generated vector addresses to the closest Lr boundaries. This modification also helps avoid unnecessary realignment. We use the notation floor_y(x) for the truncation of x to the closest multiple of y. Assuming that array bases are Lr-byte aligned and elements are Le bytes long, the enhanced stride conversion produces

    t[0:L*s-1] = Permute(v[b':b'+L*s-1], P);
    S: ... = ... t[o*L:o*L+L-1] ... ;

where b' = floor_(Lr/Le)(b), o = b - b' = b mod (Lr/Le), and

    P = <★, ..., ★, o, o+s, o+2s, ..., o+(L-1)*s, ★, ..., ★>

with o*L leading stars. Figure 6 (Scheme II) shows an example of converting stride-2 vector loads using this scheme.

Strided store. Consider a vector store,

    S: v[b:e:s] = ...;

To normalize the store, a temporary vector t[0:L*s-1] is allocated and v[b:e:s] is replaced by t[o*L:o*L+L-1]. A permutation statement from t to v is inserted immediately after S. Using the same definitions of o and b' as above, the final version of the code is:

    S: t[o*L:o*L+L-1] = ...;
    v[b':b'+L*s-1] = Permute(t[0:L*s-1], P);

where

    P = <◇, ..., ◇, o*L, ◇, ..., ◇, o*L+1, ◇, ..., ◇, ..., o*L+L-1, ◇, ..., ◇>

with o diamonds before the first index, s-1 diamonds between consecutive indices, and s-o-1 trailing diamonds.

3.2 Merging Permutations

Stride conversion often generates permutations with many ★ and ◇ elements. The presence of these elements indicates opportunities to merge permutations.

Two permutations with ★ (or ◇) elements that operate on the same input (and output) vectors can be merged if their index vectors can be merged. We define a commutative operator, ∧, to merge two permutation indices, as follows:

    a ∧ a = a,    a ∧ ★ = a,    a ∧ ◇ = a

Otherwise, we say the two permutation indices cannot be merged. It is straightforward to extend ∧ to an element-wise vector operation merging two index vectors. We say two index vectors, A and B, can be merged if and only if all of their corresponding indices can be merged; in that case, the index vector of the merged permutation is A ∧ B.

Merging opportunities are common when several strided references in a region jointly access a contiguous chunk of memory. This situation arises frequently in applications. Consider our previous example in Figure 6. The permutations produced by Scheme I cannot be directly merged because they operate on two slightly different vectors (b[0:99] and b[1:100]), whereas merging the permutations generated by Scheme II is straightforward:

    t[0:99] = Permute(b[0:99], <0, 2, 4, ..., 98, 1, 3, ..., 99>);
    a[0:49] = t[0:49] + t[50:99];
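The ∧ operator is a pointwise combine that fails on two distinct ordinary indices. A C sketch under the same sentinel encoding as before (MERGE_FAIL is our own convention, not the paper's):

    enum { STAR = -1, DIAMOND = -2, MERGE_FAIL = -3 };

    /* a ^ a = a, a ^ STAR = a, a ^ DIAMOND = a; two distinct
       ordinary indices cannot be merged. */
    static int merge_index(int a, int b)
    {
        if (a == b)                    return a;
        if (a == STAR || a == DIAMOND) return b;
        if (b == STAR || b == DIAMOND) return a;
        return MERGE_FAIL;
    }

    /* Merge two index vectors elementwise; returns 0 on failure. */
    static int merge_vectors(const int *a, const int *b, int *out, int n)
    {
        for (int i = 0; i < n; i++) {
            out[i] = merge_index(a[i], b[i]);
            if (out[i] == MERGE_FAIL)
                return 0;
        }
        return 1;
    }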
3.3 Misalignment

This normalization transforms accesses to vectors starting at addresses that are not a multiple of Lr into accesses to aligned vectors. If we assume that the base address of v is aligned, vector v[b:e] is misaligned if b mod (Lr/Le) is not zero. Consider a misaligned load,

    S: ... = ... v[b:e] ...;

As in the case of stride conversion, a temporary vector, t[0:L'-1], is allocated for the replacement, where L' is the length of the aligned section v[b':e'], with b' = floor_(Lr/Le)(b) and e' = ceil_(Lr/Le)(e+1) - 1. Thus L' = L + Lr/Le, where L is the length of v[b:e]. A permutation must also be inserted before the original statement:

    t[0:L'-1] = Permute(v[b':e'], P);
    S: ... = ... t[0:L-1] ...;

where P = <b-b', b-b'+1, ..., b-b'+L-1, ★, ..., ★>, with Lr/Le trailing stars, so that the first L elements of t hold the realigned data.

Similarly, consider a misaligned store,

    S: v[b:e] = ...;

It is converted to

    S: t[0:L-1] = ...;
    v[b':e'] = Permute(t[0:L'-1], P);

where P = <◇, ..., ◇, 0, 1, 2, ..., L-1, ◇, ..., ◇>, with b-b' leading diamonds.
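The alignment bookkeeping is simple integer arithmetic. The following C sketch (ours; it assumes nonnegative indices and an Lr-byte-aligned array base, as stated above) computes b', e', L', and the index vector P for a misaligned load:

    enum { STAR = -1 };

    /* Compute the aligned section [b', e'] and index vector P for a
       misaligned load v[b:e], with Lr-byte registers and Le-byte
       elements. P receives Lp = L + Lr/Le entries. */
    static void align_load(int b, int e, int Lr, int Le,
                           int *b_out, int *e_out, int *P, int *Lp)
    {
        int w  = Lr / Le;                   /* elements per register  */
        int L  = e - b + 1;                 /* length of v[b:e]       */
        int b2 = (b / w) * w;               /* b' rounds down         */
        int e2 = ((e + 1 + w - 1) / w) * w - 1;   /* e' rounds up     */
        int off = b - b2;                   /* misalignment offset    */

        *b_out = b2;
        *e_out = e2;
        *Lp = e2 - b2 + 1;
        for (int i = 0; i < *Lp; i++)
            P[i] = (i < L) ? off + i : STAR;  /* realign, pad stars   */
    }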

3.4 Other Sources of Data Permutations

Besides the data permutations introduced by converting strided and misaligned memory references, store-load forwarding between partial definitions and uses of vectors may also introduce data permutations. For example, to propagate the right-hand side of t[b:e] = x[b:e] to t[b+1:e+1], it is necessary to insert an explicit permutation (and an element-insert operation).

In addition, data permutations inherent to programs, such as matrix transposition and bit-reversed reordering, are also very common. It is sometimes difficult for compilers to recognize those data permutations in a standard C implementation; it would be easier to provide intrinsics, as in Fortran 90, to specify such data permutations directly.

4. Optimizing Data Permutations

In this section, we introduce an algorithm to reduce the number of permutation operations within a basic block. The input to the algorithm is straight-line vector code normalized as discussed above.

4.1 An Overview of the Optimization Algorithm

The optimization algorithm propagates permutations along the def-use graph built from the input program. It then applies the composition and distributive rules to reduce the number of permutations. The def-use graph is an extension of the conventional def-use graph that accommodates vectors.

The algorithm is shown in Figure 7. The worklist W contains all statements of the form

    v = Permute(x, P);

where P contains no ◇ element. (Permutations containing ◇ cannot be propagated because they use data from both the input and output vectors.) The algorithm propagates along the def-use chains of each statement in W. It addresses cases such as partial definitions and partial uses (see Section 4.2). Once a definition of the form v = Permute(x, P) is propagated to a permutation of the form Permute(v, Q), the algorithm merges P and Q by applying the composition rule. The algorithm also tries to reorganize statements into the form v = Permute(..., P) (see Section 4.3). It repeats this process until all permutations in W have been visited.

When the use is not an input to a permute (e.g., ... = v + Permute(...)), the algorithm may still tentatively propagate v. If, at the end of the algorithm, these tentative replacements do not lead to a consolidated right-hand side of the form Permute(...), the replacement is undone.

If a permutation is propagated to all its uses, the permutation itself will be deleted by dead-code elimination.

    W <- {};
    optimize_permutation(basicblock bb)
      W <- {S | S: "v = Permute(..., P)" at top of DU graph}
      WHILE W ≠ {} DO
        remove a statement, S: "v = Permute(..., P)", from W
        U <- set of all use statements of v, the lhs of S
        propagate_and_merge(U, S)
      END

    propagate_and_merge(set U, stmt "v = Permute(..., P)")
      1.  FOR each statement T in U DO
      2.    Let v' be the use of v in T
      3.    IF possible to propagate Permute(x, P) to v' THEN
      4.      IF T is of the form "w = Permute(v', P')" THEN
      5.        T <- "w = Permute(x, P'·P)"
      6.        add T to W
      7.      ELSE
      8.        tentatively replace v' with "Permute(x, P)"
      9.        IF possible to transform the rhs of T into "Permute(..., P)" THEN
      10.         carry out the transformation
      11.         add T to W
      12.       ENDIF
      13.     ENDIF
      14.   ENDIF
      15. END

Figure 7. The algorithm for optimizing data permutations in a basic block.
4.2 Propagation Along Def-Use Chains

The algorithm in Figure 7 assumes that the def-use information of the program is already available before optimization. The def-use analysis of vector programs is an important issue in itself. Unlike two scalars, which are either the same or different, two vectors can be identical, completely different, or overlapping; the def-use analysis therefore has to deal with partial relations between vectors. For example, in

    t[0:7] = Permute(a[0:7], P);          (1)
    b[0:3] = t[0:3] + t[4:7];

the use of t[0:3] (or t[4:7]) is a partial use of t[0:7].

Our algorithm must handle propagation of definitions to multiple partial uses. In code segment (1), unless P can be partitioned into permutations on 4-element vectors, it cannot be propagated to the partial uses of t[0:7]. For example, if P is <0, 4, 1, 5, 2, 6, 3, 7>, the permutation matrix is not block-diagonal and cannot be divided into two sub-matrices, so the permutation cannot be propagated to the partial uses. On the other hand, if P is <0, 2, 1, 3, 4, 6, 5, 7>, it can be bisected into two independent permutations and propagated to the uses as follows:

    t[0:3] = Permute(a[0:3], <0, 2, 1, 3>);
    t[4:7] = Permute(a[4:7], <0, 2, 1, 3>);

Let y[by:ey] = Permute(x[bx:ex], P) be the permutation to be propagated and y[b'y:e'y] be a partial use; that is, b'y >= by and e'y <= ey. Assume P is represented as the index vector <p0, p1, ..., p(n-1)>. The following condition must be satisfied for the permutation to be partitioned and propagated to the use:

    for all i in [b'y, e'y]: pi in [b'x, e'x]

where [b'x, e'x] ⊆ [bx, ex], e'x - b'x = e'y - b'y, and b'x is a multiple of Lr/Le, the number of elements in a physical vector register. The condition specifies that all elements in the partial use must come from a contiguous (and aligned) section of the source vector. If the condition is not satisfied, we say the permutation crosses the partial use boundary. Partial use boundaries are one of the major obstacles to propagation in many applications.

To enable propagation across partial use boundaries, we developed several techniques, including permutation reshaping and permutation decomposition, discussed in the next two subsections.

It is possible for a vector definition and the corresponding use to overlap partially with neither containing the other. Our algorithm conservatively does not propagate the permutation in such situations.

If the definition is a subset of the use, we put the propagation on hold (line 3 in Figure 7) until we can determine whether there are other permutations that jointly define the whole vector use and whose source vectors can be joined. If so, we continue the propagation; otherwise, the propagation ends.
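The partitioning condition can be tested mechanically. A C sketch (ours) that checks whether an index vector can be split at register-width boundaries into independent sub-permutations, i.e., whether each w-wide output chunk draws all its (non-star) indices from a single aligned w-wide window of the input:

    enum { STAR = -1 };

    /* Return 1 if index vector p of length n can be partitioned into
       independent w-element permutations. */
    static int partitionable(const int *p, int n, int w)
    {
        for (int c = 0; c < n / w; c++) {
            int base = -1;
            for (int i = c * w; i < (c + 1) * w; i++) {
                if (p[i] == STAR)
                    continue;
                if (base < 0)
                    base = (p[i] / w) * w;     /* aligned source window */
                if (p[i] < base || p[i] >= base + w)
                    return 0;                  /* crosses the boundary  */
            }
        }
        return 1;
    }

With w = 4, this rejects <0, 4, 1, 5, 2, 6, 3, 7> and accepts <0, 2, 1, 3, 4, 6, 5, 7>, matching the two cases above.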
4.2.1 Reshaping Permutations

When a pair of partial uses of a permutation are operands of a commutative operation, it may be possible to reshape the permutation. Consider code sequence (1) again. When P = <0, 5, 2, 7, 4, 1, 6, 3>, the permutation cannot be propagated to its partial uses. However, because the two partial uses are operands of a commutative operation (add), we can reshape the permutation pattern in the definition to <0, 1, 2, 3, 4, 5, 6, 7> by exchanging the corresponding elements (5 <-> 1 and 7 <-> 3) of the two operands. The reshaped permutation does not affect the value of b[0:3] and does not cross the partial use boundaries. In fact, in this example, P is reshaped to an identity permutation and can thus be eliminated completely.

Figure 8 gives the general algorithm for reshaping permutations. Vectors v1 and v2 are operands of a commutative operator as well as partial uses of a vector v defined by permutation P. In the algorithm, the symbol "||" represents vector concatenation. The first loop checks whether the elements in v1 and v2 come from two contiguous aligned sections of x of the size of v1 (v2); this is a necessary condition for reshaping, so that by switching elements one may move all the elements from one contiguous aligned section of x to v1 and the others to v2. The second loop performs the reshaping. If a pair of corresponding elements from v1 and v2 come from the same section of x, e.g., b < Q[v1.begin+i] AND b < Q[v2.begin+i], reshaping fails.

    reshape_permutation(use v1, use v2, stmt "v=Permute(x,P)")
      IF v1.length ≠ v2.length THEN RETURN P;
      a[0:2*v1.length-1] <- P[v1.begin:v1.end] || P[v2.begin:v2.end]
      sort elements in a;
      e <- a[0] + v1.length;
      b <- a[2*v1.length-1] - v1.length;
      FOR i = 0 TO 2*v1.length-1 DO
        IF e <= a[i] <= b THEN RETURN P;
      END
      Q <- P;
      FOR i = 0 TO v1.length-1 DO
        IF (b < Q[v1.begin+i] AND b < Q[v2.begin+i]) OR
           (Q[v1.begin+i] < e AND Q[v2.begin+i] < e) THEN
          RETURN P;
        ENDIF
        IF b < Q[v1.begin+i] AND Q[v2.begin+i] < e THEN
          Q[v1.begin+i] <-> Q[v2.begin+i];
        ENDIF
      END
      RETURN Q;

Figure 8. The algorithm for reshaping permutations.

4.2.2 Permutation Decomposition

This section introduces two techniques to decompose a costly non-propagatable permutation into two permutations, one of which is fast while the other can be propagated.

Register-wise permutations. Certain permutations can be translated into register assignments, which in turn may be folded by copy propagation and thus incur no permutation overhead. We refer to such permutations as register-wise permutations. For example, <0, 1, 4, 5, 2, 3, 6, 7> is a register-wise permutation for two-way vector registers.

If a permutation cannot be propagated through partial use boundaries, we can attempt to decompose it into two permutations: a register-wise permutation and one that can be partitioned and further propagated to the partial uses.

Consider code sequence (1) again. If P is <1, 0, 5, 4, 3, 2, 7, 6>, the propagation would cross partial use boundaries and cannot take place. Assuming two-way vector registers, we can decompose P into

    P2 · P1 = <1, 0, 3, 2, 5, 4, 7, 6> · <0, 1, 4, 5, 2, 3, 6, 7>

where P1 is a register-wise permutation (for 2-way vectors) and P2 is bisectable and can therefore be propagated into the partial uses. The final code after decomposition is:

    t1[0:7] = Permute(a[0:7], <0, 1, 4, 5, 2, 3, 6, 7>);
    t2[0:3] = t1[0:3] + t1[4:7];
    b[0:3] = Permute(t2[0:3], <1, 0, 3, 2>);
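Whether a pattern is register-wise, i.e., moves only whole register-sized blocks, is equally easy to test; a C sketch (ours):

    /* A permutation is register-wise if, for each w-element chunk of
       the output, the indices form a contiguous run base, base+1, ...,
       base+w-1 with base a multiple of w: the pattern then amounts to
       register moves. */
    static int register_wise(const int *p, int n, int w)
    {
        for (int c = 0; c < n / w; c++) {
            int base = p[c * w];
            if (base % w != 0)
                return 0;
            for (int i = 1; i < w; i++)
                if (p[c * w + i] != base + i)
                    return 0;
        }
        return 1;
    }

For example, register_wise applied to <0, 1, 4, 5, 2, 3, 6, 7> with w = 2 returns 1, matching the example above.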
Function decompose_permutation_reg in Figure 9 gives the permutation decomposition algorithm. Vector v1 is a partial use of vector v defined by permutation P. The first loop checks whether the permutation is decomposable such that one of the resulting permutations is register-wise and the other is propagatable; the second loop performs the decomposition.

Platform-dependent decomposition. On some SIMD devices, a general permutation instruction must be translated into multiple native instructions. Consider the permutation <0, 4, 2, 6>, assuming that the physical vector register is 4-way. On SSE, this permutation must be decomposed into two shuffle instructions (see Section 5). One moves elements from the first input register to the low half of the output, and elements from the second to the high half; in other words, the first SSE shuffle instruction performs the permutation <0, 2, 4, 6>. To complete the translation, a second SSE shuffle instruction implementing the permutation <0, 2, 1, 3> must be generated.

Notice that the second instruction is an intra-register shuffle, which is always propagatable. Therefore, it is helpful to decompose permutations that would result in shuffles not supported natively into two permutations such that the second includes only intra-register data movements.

For example, P = <0, 4, 2, 6, 1, 5, 3, 7> translates into 4 SSE shuffle instructions, two of which are intra-register shuffles (Section 5). Thus, we can decompose P into P2 · P1, with P1 = <0, 2, 4, 6, 1, 3, 5, 7> and P2 = <0, 2, 1, 3, 4, 6, 5, 7>. P1 can be translated into 2 shuffle instructions, and P2 can be propagated to the uses and further merged with other permutations.

The algorithm for this transformation is shown in Figure 9 as function decompose_permutation_shuf. It makes use of the code generation algorithm discussed in Section 5 to compute the shuffle pattern.

    decompose_permutation_reg(use v1, stmt "v=Permute(x,P)")
      a[0:v.length-1] <- P[v.begin:v.end]
      sort elements in a
      Let r be the element size of each vector register
      FOR i = 0 TO v.length/r-1 DO
        IF NOT exist j, j*r <= a[i*r:i*r+r-1] < j*r+r THEN
          RETURN (P, I);   // Not decomposable
        ENDIF
      END
      P' <- I, an identity permutation;
      FOR i = 0 TO v.length/r-1 DO
        IF a[i*r] ≠ v.begin+i*r THEN
          P'[i*r:i*r+r-1] <-> P'[v.begin+i*r:v.begin+i*r+r-1]
        ENDIF
      END
      RETURN (P*INV(P'), P')   // INV computes the inverse

    decompose_permutation_shuf(use v1, stmt "v=Permute(x,P)")
      P' <- I, an identity permutation
      FOR i = 0 TO v.length/r-1 DO
        insts <- code_gen(P[v.begin+i*r:v.begin+i*r+r-1])
        IF last shuffle in insts is an intra-register one THEN
          P'[v.begin+i*r:v.begin+i*r+r-1] <- shuffle pattern
        ENDIF
      END
      RETURN (P', INV(P')*P)   // INV computes the inverse

Figure 9. Two permutation decomposition algorithms.

4.3 Permutation Placement within a Statement

After a permutation is propagated, the algorithm consolidates consecutive permutations in the statement (composition rule) and, if needed, transforms the statement into the form v = Permute(...) (distributive rule) to enable further propagation.

Without loss of generality, we limit our focus to three-address statements with vector expressions. Let v = Permute(x, P) be the permutation to be propagated. The expression containing the use of v is transformed bottom-up. Let n be an expression node in the statement being propagated into, and let op be any regular element-wise operation. The following transfer functions specify how to move Permute within an expression tree, according to the form of node n:

• n ≡ Permute(v, Q). Apply the composition rule to obtain n ≡ Permute(x, Q · P).

• n ≡ op(v). Move the permutation outside op to obtain n ≡ Permute(op(x), P).

• n ≡ op(v, e). Consider the following cases:

  1. If e ≡ Permute(r, P), move the permutation outside op to obtain n ≡ Permute(op(x, r), P).

  2. If e ≡ C, where C is a constant vector, move the permutation outside op to obtain n ≡ Permute(op(x, C'), P), where C' is a constant vector obtained by applying the reverse permutation to C, i.e., C = Permute(C', P). Constant vectors can be permuted at compile time.

  3. Otherwise, we wait for the propagation to e to finish. If this changes e to Permute(r, P), we will be able to continue the propagation.

If the algorithm finds a node that does not match any of these cases, it stops and returns.

4.4 Discussion

To simplify the presentation, we have so far ignored some important issues of the basic algorithm; we discuss them here.

Optimality. As shown in Figure 7, there are three key operations. The composition (line 5 in Figure 7) always reduces the number of permutations. The propagation (line 8) sometimes reduces it. However, the statement at line 3 may introduce more permutations when there are multiple uses. Does the algorithm always produce the minimum number of data permutations? Unfortunately, no deterministic algorithm can find the global optimum for an arbitrary basic block in polynomial time: the problem of finding the minimum number of permutations for the statements in a basic block can be mapped to the multi-terminal cut problem introduced in [5], even if we assume there are no partial use boundaries and all permutations cost the same. Thus, the problem is NP-hard unless the data flow graph is a tree or there are only two different permutations [5].

Therefore, we apply a simple heuristic: a permutation is propagated to a use only if it can be merged there with another permutation; otherwise, no propagation is performed. By proceeding conservatively in this way, we guarantee that the number of permutations decreases monotonically.

In addition, the above algorithm uses a top-down propagation strategy. An alternative is to propagate the permutations bottom-up by following use-def chains. The reverse direction of propagation can sometimes create more opportunities for composition; in fact, propagating in different directions may produce entirely different permutations for the same program. In our algorithm, we propagate in both directions to obtain the final permutation results, which is never worse than one-direction propagation.
Special Permutations. In the optimization algorithm, special permutations such as reduction and replication need to be handled differently. For example, a reduction can be merged with a permutation propagated to its input, but a reduction cannot be used as the starting point of a propagation. Similarly, a replication can absorb permutations propagated along use-def links but will not itself be propagated.

Although it is feasible to propagate permutations with ★ elements like other permutations, doing so may introduce more computation; for example, if half or more of the elements in a permutation are ★, propagating it might not be beneficial.

Permutations with ◇ elements, on the other hand, essentially combine two vectors, which makes propagating them extremely complicated. Thus, in our algorithm, permutations with ◇ elements are never propagated. Like reductions and replications, however, they can still be merged with other permutations propagated to them.

5. Code Generation

This section describes the algorithm that translates Permute operations into target machine instructions. The algorithm has two steps:

1. Translate each Permute operation into a sequence of vperm operations (Section 5.2).
2. Map vperm operations into native data-movement instructions (Section 5.3).

5.1 Hardware Permutation Instruction Set

This section gives an overview of the permutation instructions supported by the SSE family and VMX [11, 18]. These instructions implement either generic or restricted permutations.

VMX supports vperm, the generic permutation instruction described in Figure 1.
SSE supports shufps,

    R3 <- shufps(R1, R2, P)

which is considered a general permutation instruction, although it is more constrained than vperm. For example, as shown in Figure 10, elements from R1 (R2) can only go to the low half (high half) of the output register R3. In addition, shufps manipulates 4-byte elements instead of bytes as vperm does. The SSE family also provides similar instructions, pshufd, pshufhw, and pshuflw, for other data types [11].

[Figure 10: R3 <- shufps(R1, R2, P) with the example pattern P = {1, 3, 2, 0}: the two low elements of R3 are selected from R1 and the two high elements from R2.]

Both VMX and SSE support restricted permutation instructions, where the permutation pattern is built into the instruction. These include:

• interleave, which interleaves data elements from the low halves or high halves of two input registers.
• shift, which rotates elements within a register. In addition, VMX provides an instruction, vsldoi, to shift elements across two registers.
• select, which selects data elements from one of the input registers and places them in the corresponding positions of the output register.

In general, restricted permutation instructions can be implemented by generic permutation instructions. However, since restricted permutation instructions have their permutation pattern built into the instruction, they tend to be more efficient: they use fewer registers and/or cycles.
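Like vperm, shufps has a compact scalar model. The sketch below is ours (on real hardware the pattern is an 8-bit immediate, two bits per output element):

    /* Scalar model of SSE shufps: the two low output elements are
       selected from a, the two high output elements from b, each by
       a 2-bit field of the control pattern. */
    static void shufps_model(const float a[4], const float b[4],
                             unsigned imm8, float dst[4])
    {
        dst[0] = a[(imm8 >> 0) & 3];
        dst[1] = a[(imm8 >> 2) & 3];
        dst[2] = b[(imm8 >> 4) & 3];
        dst[3] = b[(imm8 >> 6) & 3];
    }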

5.2 Translating Permute into vperm Operations

We first translate generic permutation operations into an internal operation similar to the vperm instruction (Figure 1) of VMX. We use vperm because it closely resembles the format of actual data permutation instructions. The format of our vperm operation is:

    R3 = vperm(R1, R2, P)

where R1, R2, and R3 are three virtual vector registers of length Lr, the size of a physical register of the target SIMD device, and P is a vector literal of the same size.

When translating normalized vector programs, each vector is mapped to a set of virtual vector registers. Since all vectors are stride-one, aligned, and of a length that is a multiple of Lr, the mapping is straightforward: a long vector is strip-mined into chunks, and a virtual vector register is allocated for each chunk. For example, assuming that a vector register is 4 elements long, u[0:15] will be strip-mined into 4 chunks and mapped to 4 virtual vector registers. Since the mapping between vector chunks and virtual registers is one-to-one, we use vector expressions to represent the mapped virtual vector registers; for example, we use u[0:3] to denote the first virtual register allocated for u[0:15].

For a generic permutation, y = Permute(x, P), the goal of our algorithm is to generate as few vperm operations as possible to implement the permutation specified by P. No fast optimal algorithm is known for this problem.

Consider a single output register whose elements come from N different input registers. Since each vperm operation can only collect elements from two registers into one, at least N - 1 instructions are needed to generate the output. However, the bandwidth of those vperm operations may not be fully utilized (when N > 2), so there may be unused slots in the virtual registers that hold intermediate results. In the hope that those slots can be used in the construction of other output registers, our algorithm always maximizes the number of unused slots while building one output register.

To illustrate these ideas, consider the following statement:

    v[0:15] = Permute(u[0:15], P)          (2)

where P = <0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15>. Assume that a vector register is 4 data elements long. During code generation, v[0:15] will be mapped to 4 virtual vector registers. To compute the first output vector, i.e., v[0:3], three vperm operations are needed to collect the elements u[0], u[4], u[8], u[12], as shown in Figure 11.

[Figure 11: Two feasible ways, (a) and (b), of collecting <u0, u4, u8, u12> from four input registers with three vperm operations.]

Both (a) and (b) in Figure 11 generate the result, v[0:3], using three vperm operations. However, in (a), u[0] and u[4] are moved three times, u[8] twice, and u[12] once, for a total of 9 element movements; in (b), each element is moved twice, for a total of 8. Thus (b) requires fewer data movements than (a) and, with the same number of vperm operations, leaves more unused slots.

In order to minimize the number of element movements for one output register, ŷ, of a permutation y = Permute(x, P), the algorithm proceeds as shown in Figure 12:

    generate_vperm(stmt "y = Permute(x, P)")
      Map x to the virtual registers X0, X1, ..., Xn
      Map y to the virtual registers Y0, Y1, ...
      Let VR[i] be the register containing element y[i] (= x[P[i]], or Ym if P[i] is ◇)
      Let Loc[i] be y[i]'s location in VR[i]
      Let r be the element size of the vector register
      FOR i = 0 TO y.length/r-1 DO
        W <- set of different values in VR[i*r:i*r+r-1]
        WHILE |W| > 1 DO
          n[R] <- |{j | VR[j] = R, i*r <= j < i*r+r}|
          Find R1, R2 from W that minimize n[R1] + n[R2]
          IF n[R1] + n[R2] ...

Figure 12. The algorithm for generating vperm operations.
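The greedy core of generate_vperm, picking the pair of source registers that currently contribute the fewest elements to the output chunk so that a single vperm leaves as many unused slots as possible, can be sketched in C as follows (our reconstruction of the visible steps; register bookkeeping is reduced to integer ids):

    /* For one w-element output chunk starting at index lo, VR[j]
       holds the id of the register currently containing output
       element j. Pick the two distinct ids contributing the fewest
       elements, so that combining them with one vperm maximizes the
       unused slots left for later reuse. Sets *r1 = *r2 = -1 when
       fewer than two ids remain. */
    static void pick_pair(const int *VR, int lo, int w, int *r1, int *r2)
    {
        int best = 2 * w + 1;
        *r1 = *r2 = -1;
        for (int i = lo; i < lo + w; i++) {
            for (int j = lo; j < lo + w; j++) {
                if (VR[i] == VR[j]) continue;      /* need two ids   */
                int n1 = 0, n2 = 0;
                for (int k = lo; k < lo + w; k++) {
                    if (VR[k] == VR[i]) n1++;
                    if (VR[k] == VR[j]) n2++;
                }
                if (n1 + n2 < best) {
                    best = n1 + n2;
                    *r1 = VR[i];
                    *r2 = VR[j];
                }
            }
        }
    }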

[Figure 13: six panels, (a)-(f), plotting operations per second (in units of 100M) for 32-point FFT and 32-point WHT, and sorting stages per second for bitonic sorting, against problem size; the bottom row, (b), (d), (f), shows results on SSE2.]

Figure 13. Performance of Group II programs. "Scalar Opt", "Scalar Fast", and "Scalar O3" are optimized scalar codes generated by SPIRAL, compiled with -O3 -qaltivec, -fast, and -O3, respectively; "SIMD Base" are SIMD codes with "Base" permutation generation as in Table 2; "SIMD Opt" are SIMD codes with the permutation optimization.

[Figure 14: two plots of the performance improvement (in %) of matrix transpose (against matrix size N x N) and bit-reversed reordering (against array size N) on VMX and SSE2.]

Figure 14. Performance of matrix transpose and bit-reversal reordering.

For these kernels, the overhead of data permutations nullifies the performance benefit of SIMD loads/stores, but the aggregated register space helps the performance for large sizes.

Finally, the speedups over scalar code for all applications are shown in Figure 15. The speedups range from 1.14 to 2.58 on VMX and from 1.51 to 3.77 on SSE2. Since VMX has two scalar FP units, the speedups on VMX are lower than those on SSE2 for most applications. The above-2 speedup on VMX is obtained on the bitonic sorting program, where expensive comparison-and-swap operations are replaced by native SIMD max and min operations.

[Figure 15: two bar charts of speedups for all applications (fft.4, fft.5, fft.6, wht.4, wht.5, wht.6, bitonic.5, transpose, bit-reverse, r-fir, r-color, c-dot, c-saxpy), with aligned and misaligned input data.]

Figure 15. Performance of all applications on VMX and SSE2. "VMX-Base" and "SSE2-Base" are speedups of SIMD codes without the permutation optimization on VMX and SSE2, respectively; "VMX-Opt" and "SSE2-Opt" are speedups with the permutation optimization.

Figure 15 also shows the speedups of all applications when the input data is misaligned. On average, performance drops by 3.2% on VMX and 8.0% on SSE2 because of misalignment. For Group II applications, the permutation optimization algorithm improves performance by 60% on VMX and 140% on SSE2, since the data permutations introduced by the misalignment are almost completely eliminated by the optimization algorithm (see Table 3). For the Group III applications, however, the improvement obtained by the optimization algorithm is only 6% on VMX and 10% on SSE2.

7. Related Work

Automatic generation of SIMD instructions, mainly for multimedia extensions, has been studied in both academia and industry. Most of the techniques considered in these studies are based on traditional loop-based vectorization [4, 12, 26]. Others make use of instruction packing techniques to exploit data parallelism within a basic block [14, 16, 26]. Several compilers support automatic vectorization for multimedia extensions, such as the Intel compilers [1], the IBM XL compilers [6], and the GNU compiler [8]. Most of them employ both vectorization and instruction packing techniques.

Memory alignment is a common source of data permutations and has been studied extensively. Early work on alignment handling [1, 4, 15] applies loop peeling and versioning to translate computations where misaligned references are relatively aligned to each other; those schemes therefore do not generate any permutations. Recent work in [6, 27] handles arbitrary memory alignment while minimizing permutation overhead. When dealing with misalignment within a statement, our conversion algorithm is equivalent to the zero-shift policy in [6], and the lazy-shift and dominant-shift policies in [6] can be derived by applying the distributive and composition rules on Permute. Compared to [6, 27], our algorithm is more powerful in terms of minimizing alignment overhead across statements, thanks to our propagation algorithm. On the other hand, [6, 27] handle arbitrary alignment, whereas ours handles only compile-time alignment. Another major difference is that [6, 27] target loops while our algorithm is for straight-line code, so the code generation for permutations is quite different.

There are a few recent studies on generating efficient permutation instructions for SIMD devices [13, 20, 21]. In [13], an algorithm was introduced to generate permutation instructions for SIMD devices; despite having a similar workflow, their algorithm and ours work at different levels of intermediate representation. In [20], an algorithm was proposed to automatically generate permutation instructions for a new language, StreamIt, and output platform, VIRAM. In [21], an extension of the GCC vectorizer was introduced to represent data references with non-unit strides and generate efficient permutation instructions for them. Both [20] and [21] focus on permutation representation and code generation, rather than on optimizing data permutations.

In the domain of signal processing, an optimization algorithm on permutation matrices at the formula level was introduced as a key step of vectorization for SIMD devices [7]. In [7, 9], domain-specific techniques are proposed to generate efficient SIMD code for complex computation in DSP programs. With the support of indirect register accesses in a DSP processor, a different scheme of handling data permutations was discussed in [19].

There were several other interesting studies on more general definitions of data permutations. In [25], permutation matrices are used to optimize bit-wise operations in StreamIt programs; our strategy to optimize element-wise permutations is similar to theirs. Despite targeting distributed memory systems, the algorithm presented in [3] for array alignment can also be extended as an alternative to our optimization algorithm. Our composition rule for data permutations is similar to the idea of synthesizing array operations in [10].
8. Conclusion

Due to the constraints on memory units, the overhead of data permutations makes it extremely difficult to achieve peak performance on SIMD devices. The strategy presented in this paper optimizes all forms of data permutations within a basic block in a unified manner. With the help of permutation propagation, the optimization algorithm reduces the number of permutations by merging consecutive ones. Our strategy also introduces an efficient code generation algorithm for data permutations. The experimental results show that the performance of SIMD computation can be significantly improved by our optimization strategy.

As mentioned in the paper, the current representation of data permutations requires that strides, alignment, and vector length (essentially, the permutation pattern) be known at compile time; relaxing this restriction is an important next step. The optimization algorithm is conservative on data permutations with the special elements ★ and ◇, and it would be interesting to explore more aggressive optimization strategies for this type of data permutation. As we have seen in many applications, data permutation optimization often interacts with other compiler transformations; it would also be valuable to explore this interaction.

Acknowledgments

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) through the Department of Interior grant NBCH1050009 and in part by the National Science Foundation under Awards 0234293, ITR/NGS-0325687, and CSR/AES-0509432.
References

[1] Aart J. C. Bik. The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance. Intel Press, 2004.
[2] CCIR Recommendation 601-2. Encoding Parameters of Digital Television for Studios, 1990.
[3] Siddhartha Chatterjee, John R. Gilbert, Robert Schreiber, and Shang-Hua Teng. Automatic array alignment in data-parallel programs. In POPL '93: Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 16–28. ACM Press, 1993.
[4] Gerald Cheong and Monica Lam. An optimizer for multimedia instruction sets. In Proceedings of the Second SUIF Compiler Workshop, 1997.
[5] E. Dahlhaus, D. S. Johnson, C. H. Papadimitriou, P. D. Seymour, and M. Yannakakis. The complexity of multiterminal cuts. SIAM Journal on Computing, 23:864–894, 1994.
[6] Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. Vectorization for SIMD architectures with alignment constraints. In PLDI '04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 82–93. ACM Press, 2004.
[7] Franz Franchetti, Stefan Kral, Juergen Lorenz, and Christoph W. Ueberhuber. Efficient utilization of SIMD extensions. Proceedings of the IEEE, 93(2):409–425, 2005.
[8] Free Software Foundation. Auto-vectorization in GCC, 2004.
[9] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005.
[10] Gwan-Hwan Hwang, Jenq Kuen Lee, and Dz-Ching Ju. An array operation synthesis scheme to optimize FORTRAN 90 programs. In PPOPP '95: Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 112–122. ACM Press, 1995.
[11] Intel Corporation. IA-32 Intel Architecture Optimization, 2004.
[12] Andreas Krall and Sylvain Lelait. Compilation techniques for multimedia processors. International Journal of Parallel Programming, 28(4):347–361, 2000.
[13] Alexei Kudriavtsev and Peter Kogge. Generation of permutations for SIMD processors. In LCTES '05: Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pages 147–156. ACM Press, 2005.
[14] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In PLDI '00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, pages 145–156. ACM Press, 2000.
[15] Samuel Larsen, Emmett Witchel, and Saman P. Amarasinghe. Increasing and detecting memory address congruence. In PACT '02: Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 18–29. IEEE Computer Society, 2002.
[16] Rainer Leupers. Code Optimization Techniques for Embedded Processors: Methods, Algorithms, and Tools. Kluwer Academic Publishers, 2000.
[17] Xiaoming Li, Maria Jesus Garzaran, and David Padua. Optimizing sorting with genetic algorithms. In CGO '05: Proceedings of the International Symposium on Code Generation and Optimization, pages 99–110. IEEE Computer Society, 2005.
[18] Motorola Inc. AltiVec Technology Programming Environments Manual, 1998.
[19] Dorit Naishlos, Marina Biberstein, Shay Ben-David, and Ayal Zaks. Vectorizing for a SIMdD DSP architecture. In CASES '03: Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 2–11. ACM Press, 2003.
[20] Manikandan Narayanan and Katherine A. Yelick. Generating permutation instructions from a high-level description. In MSP '04: Proceedings of the 6th Workshop on Media and Streaming Processors, 2004.
[21] Dorit Nuzman, Ira Rosen, and Ayal Zaks. Auto-vectorization of interleaved data for SIMD. In PLDI '06: Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, 2006.
[22] Markus Puschel, Jose M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan W. Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232–275, 2005.
[23] Gang Ren, Peng Wu, and David Padua. An empirical study on the vectorization of multimedia applications for multimedia extensions. In IPDPS '05: Proceedings of the 19th International Parallel & Distributed Processing Symposium, 2005.
[24] Nicholas Rizzolo and David Padua. HiLO: High level optimization of FFTs. In LCPC '04: Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing, 2004.
[25] Armando Solar-Lezama, Rodric Rabbah, Rastislav Bodik, and Kemal Ebcioglu. Programming by sketching for bit-streaming programs. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 281–294. ACM Press, 2005.
[26] N. Sreraman and R. Govindarajan. A vectorizing compiler for multimedia extensions. International Journal of Parallel Programming, 28(4):363–400, 2000.
[27] Peng Wu, Alexandre E. Eichenberger, and Amy Wang. Efficient SIMD code generation for runtime alignment and length conversion. In CGO '05: Proceedings of the International Symposium on Code Generation and Optimization, pages 153–164. IEEE Computer Society, 2005.
[28] Jianxin Xiong, Jeremy Johnson, Robert Johnson, and David Padua. SPL: A language and compiler for DSP algorithms. In PLDI '01: Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, pages 298–308. ACM Press, 2001.