PSLP: Padded SLP Automatic Vectorization

Vasileios Porpodas†, Alberto Magni‡, Timothy M. Jones†
†Computer Laboratory, University of Cambridge   ‡School of Informatics, University of Edinburgh
[email protected], [email protected], [email protected]

Abstract

The need to increase performance and power efficiency in modern processors has led to a wide adoption of SIMD vector units. All major vendors support vector instructions and the trend is pushing them to become wider and more powerful. However, writing code that makes efficient use of these units is hard and leads to platform-specific implementations. Compiler-based automatic vectorization is one solution to this problem. In particular the Superword-Level Parallelism (SLP) vectorization algorithm is the primary way to automatically generate vector code starting from straight-line scalar code. SLP is implemented in all major compilers, including GCC and LLVM.

SLP relies on finding sequences of isomorphic instructions to pack together into vectors. However, this hinders the applicability of the algorithm as isomorphic code sequences are not common in practice. In this work we propose a solution to overcome this limitation. We introduce Padded SLP (PSLP), a novel vectorization algorithm that can vectorize code containing non-isomorphic instruction sequences. It injects a near-minimal number of redundant instructions into the code to transform non-isomorphic sequences into isomorphic ones. The padded instruction sequence can then be successfully vectorized. Our experiments show that PSLP improves vectorization coverage across a number of kernels and full benchmarks, decreasing execution time by up to 63%.

1. Introduction

Single Instruction Multiple Data (SIMD) instruction sets as extensions to general purpose ISAs have gained increasing popularity in recent years. The fine-grained data parallelism offered by these vector instructions provides energy efficient and high performance execution for a range of applications from the signal-processing and scientific-computing domains. The effectiveness of vector processing has led all major processor vendors to support vector ISAs and to regularly improve them through the introduction of additional instructions and wider data paths (e.g., 512 bits in the forthcoming AVX-512 from Intel).

Making use of these vector ISAs is non-trivial as it requires the extraction of data-level parallelism from the application that can be mapped to the SIMD units. An automatic vectorization pass within the compiler can help by performing the necessary analysis on the instructions and turning the scalar code to vectors where profitable.

There are two main types of vectorization algorithm. Loop-based algorithms [20, 21] can combine multiple iterations of a loop into a single iteration of vector instructions. However, these require that the loop has well defined induction variables, usually affine, and that all inter- and intra-loop dependences are statically analyzable.

On the other hand, algorithms that target straight-line code [13] operate on repeated sequences of scalar instructions outside a loop. They do not require sophisticated dependence analysis and have more general applicability. However, vectorization is often thwarted when the original scalar code does not contain enough isomorphic instructions to make conversion to vectors profitable.

To address this limitation, we propose Padded SLP, a novel automatic vectorization algorithm that massages scalar code before attempting vectorization to increase the number of isomorphic instructions. The algorithm works by building up data dependence graphs of the instructions it wishes to vectorize. It then identifies nodes within the graphs where standard vectorization would fail and pads each graph with redundant instructions to make them isomorphic, and thus amenable to vectorizing. The end result of our pass is higher vectorization coverage which translates into greater performance.
The rest of this paper is structured as follows. Section 2 gives an overview of a straight-line automatic vectorization technique, showing where opportunities for vectorization are missed. Section 3 then describes our automatic vectorization technique, PSLP. In Section 4 we present our experimental setup before showing the results from running PSLP in Section 5. Section 6 describes prior work related to this paper, and puts our work in context, before Section 7 concludes.

2. Background and Motivation

Automatic vectorization is the process of taking scalar code and converting as much of it to vector format as is possible and profitable, according to some cost model. We first give an overview of the vectorization algorithm that PSLP is based on, then identify missed opportunities for vectorization that PSLP can overcome.

2.1 Straight-Line Code Vectorization

Straight-line code vectorizers, the most well-known of which is the Superword-Level Parallelism algorithm (SLP [13]), identify sequences of scalar instructions that are repeated multiple times, fusing them together into vector instructions. Some implementations are confined to code within a single basic block (BB) but others can follow a single path across multiple BBs, as long as each group of instructions to be vectorized belongs to the same BB. LLVM's SLP vectorizer, and PSLP, follow this latter scheme. The SLP algorithm contains the following steps:

Step 1. Search the code for instructions that could be seeds for vectorization. These are instructions of the same type and bit-width and are either instructions that access adjacent memory locations, instructions that form a reduction or simply instructions with no dependences between them. The most promising seeds are the adjacent memory instructions and therefore they are the first to be looked for in most compilers [28] (a simplified sketch of this seed search follows Figure 1 below).

Step 2. Follow the data dependence graph (DDG) from the seed instructions, forming groups of vectorizable instructions. It is common for compilers to generate the graph bottom-up, starting at store seed instructions instead of starting at loads. This is the case for both GCC's and LLVM's SLP vectorizers [28]. Traversal stops when encountering scalar instructions that cannot form a vectorizable group.

Step 3. Estimate the code's performance for both the original (scalar) and vectorized forms. For an accurate cost calculation the algorithm takes into account any additional instructions required for data movement between the scalar and vector units.

Step 4. Compare the calculated costs of the two forms of code.

Step 5. If vectorization is profitable, replace the groups of scalar instructions with the equivalent vector code.

[Figure 1. Overview of the PSLP algorithm. Flowchart, starting from the scalar code: (1) find vectorization seed instructions; (2) generate a graph for each seed; (3) perform optimal padding of the graphs; (4) calculate the scalar cost, the vector cost and the padded vector cost; (5) if the padded cost is best, (6) emit the padded scalars; (7) if the vector cost is less than the scalar cost, (8) generate the graph and form groups of scalars, and (9) vectorize the groups and emit vectors. The highlighted boxes refer to the structures introduced by PSLP.]
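As an illustration of the seed search in Step 1, the following C sketch groups two stores that write adjacent elements of the same array. The Store structure and the adjacent() helper are hypothetical simplifications, not the data structures used by the LLVM pass.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical, simplified view of a store instruction: the array it
     * writes and the constant element offset from the induction variable. */
    struct Store { const char *base; int offset; };

    /* Two stores can seed a vector group if they write the same base array
     * at adjacent element offsets (Step 1). */
    static int adjacent(struct Store a, struct Store b) {
        return strcmp(a.base, b.base) == 0 && b.offset == a.offset + 1;
    }

    int main(void) {
        struct Store s0 = { "B", 0 };   /* B[i]   */
        struct Store s1 = { "B", 1 };   /* B[i+1] */
        if (adjacent(s0, s1))
            printf("seed group: B[i], B[i+1]\n");
        return 0;
    }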
2.2 Missed Opportunities for Vectorization

Although SLP performs well on codes that contain multiple isomorphic sequences of instructions, there are often cases where it cannot actually perform vectorization because the graphs are only similar, not completely the same as each other. These are either directly written by the programmer or, more usually, the result of earlier optimization passes that have removed redundant subexpressions. Figure 2 shows an example and solution to this problem.

[Figure 2. Limitation of the state-of-the-art SLP algorithm (c-e) and the PSLP solution (f-i). Loads are represented by (L) nodes and stores by (S).
 (a) Input C code:
     double A[SIZE];
     double B[SIZE];
     ...
     B[i]   = A[i] * 7.0 + 1.0;
     B[i+1] = A[i+1] + 5.0;
 (b) Dependence graph (instruction/constant nodes, select nodes, data-flow edges).
 (c) SLP internal graph.  (d) SLP groups.
 (e) SLP: scalar x86/AVX2 assembly:
     vmulsd A(,%rcx,8), %xmm0, %xmm3
     vaddsd %xmm1, %xmm3, %xmm3
     vmovsd %xmm3, B(,%rcx,8)
     vaddsd A+8(,%rcx,8), %xmm2, %xmm3
     vmovsd %xmm3, B+8(,%rcx,8)
 (f) PSLP internal graphs.  (g) PSLP padded graphs.  (h) PSLP groups.
 (i) PSLP x86/AVX2 assembly:
     vmovaps A(,%rcx,8), %xmm2
     vmulsd %xmm0, %xmm2, %xmm2
     vaddpd %xmm1, %xmm2, %xmm2
     vmovapd %xmm2, B(,%rcx,8)]

In Figure 2(a) we show the original code. The value stored in B[i] is the result of a multiplication of A[i] and then an addition of a constant. The value stored in B[i+1] is only A[i+1] added to a constant.

We now consider how SLP optimizes this code, shown in Figure 2(c-d). As described in Section 2.1, we first locate the seed instructions, in this case the stores into B[i] and B[i+1] which are to adjacent memory locations. These form group 0 (the root of the SLP graph in Figure 2(c)). This group is marked as vectorized in Figure 2(d). Next the algorithm follows the data dependences upwards and tries to form more groups from instructions of the same type. The second group (group 1), consisting of addition instructions, is formed easily. However, a problem arises when the algorithm tries to form group 2. The available nodes in the graph are a multiplication (*) from the first expression and a load (L) from the second. Since these are not of the same type, vectorization is halted at this point and the algorithm terminates having formed just two groups. Applying the cost model to the two forms of code shows that the packing overheads (associated with inserting the scalar values into the vector registers for the first vector instruction, the addition) outweigh the costs. Therefore this code remains scalar and is compiled down to the x86/AVX2 assembly code¹ shown in Figure 2(e), the same code as is produced with SLP disabled.

¹ In modern x86 processors scalar floating point operations use the vector registers (e.g. %xmm*). In the example all v{mul,add,mov}sd are scalar instructions operating on the first element of the vector %xmm* registers.

The graphs in this example are not isomorphic, but they are similar enough to each other that the smaller graph (in lane 1) is actually a subgraph of the larger. If we were able to match the multiplication from lane 0 (left) with another multiply instruction in lane 1 (right), then vectorization would continue further and would vectorize all the way up to the loads at the leaves of the graph. We can achieve this by padding the smaller graph with redundant instructions—the ones missing from it.

Consider Figure 2(g) where we have performed this padding by copying the multiplication instruction (and its input constant) from lane 0 to lane 1. In addition, to ensure semantic consistency, we have added a new select instruction into each graph between the multiplication and addition (orange node marked with the symbol of an electrical switch). In lane 0 the select chooses its left input (the switch points left), which is the result from the multiplication. In lane 1, the select chooses its right input (the switch points right), which uses the result from the load, ignoring the result of the multiplication instruction. This, in effect, makes the multiplication in lane 1 dead code whose result is never used, and it is therefore redundant. However, the overall outcome of this padding is to make the two graphs isomorphic, so that vectorization can succeed.

Figure 2(h) shows how SLP would operate on these two graphs. In total, 5 groups of instructions are created and the whole sequence is fully vectorized, producing the x86/AVX2 assembly code shown in Figure 2(i).

The select instructions get compiled down to vector select operations (like vpblendv* in x86/AVX2). In some simple cases, especially in x86/AVX2, they may get compiled down to operations that are semantically equivalent to the selection performed. For example this happens in Figure 2(i), where the vector multiplication along with the vector select (which would filter out the high part of the vector register) have both been replaced by the scalar instruction vmulsd (this performs scalar multiplication only on the low part of the vector register). Another possibility is that they could be turned into predicates for the instructions that they select from, in the case of a target that supports predication and a predication-capable compiler. The select instructions can be further optimized away under certain conditions, as explained in Section 3.3.4.
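To make the padding of Figure 2(g) concrete, the following C fragment is one possible scalar rendering of the two padded lanes; the temporaries and the explicit per-lane selection are for illustration only and are not the code PSLP actually emits.

    /* Scalar view of the padded graphs of Figure 2(g). */
    void padded_lanes(const double *A, double *B, int i) {
        /* Lane 0: B[i] = A[i]*7.0 + 1.0 -- its select keeps the multiplication. */
        double mul0 = A[i] * 7.0;
        double sel0 = mul0;              /* switch points left                   */
        B[i] = sel0 + 1.0;

        /* Lane 1: B[i+1] = A[i+1] + 5.0 -- the padded multiply is dead code.    */
        double mul1 = A[i + 1] * 7.0;    /* redundant instruction added by PSLP  */
        double sel1 = A[i + 1];          /* switch points right, ignoring mul1   */
        B[i + 1] = sel1 + 5.0;
        (void)mul1;                      /* the dead result is simply unused     */
    }

Because both lanes now contain the same load, multiply, select, add and store shape, groups can be formed at every level of the graph.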
3. PSLP

PSLP is an automatic vectorization algorithm that achieves greater coverage compared to existing straight-line code algorithms by padding the data dependence graph with redundant instructions before attempting vectorization. We first give an overview of the algorithm and then describe how the padding is performed.

3.1 Overview

An overview of the PSLP algorithm is shown in Figure 1. The sections that belong to the original SLP algorithm are in white boxes while the PSLP-specific parts are highlighted in orange.

PSLP requires a graph representation of the code for each lane (Step 2), each one rooted at a seed instruction (the graphs are similar to those of Figure 2(f)). These graphs allow PSLP to find padding opportunities and to calculate the minimum number of redundant instructions required to make the graphs isomorphic (Step 3).

In comparison the vanilla SLP algorithm builds a single graph in which each node represents a group of vectorizable instructions, and actually performs vectorization while building the graph. There is no need to build a separate graph for each lane because they would all be identical. The fundamental assumption is that SLP will only vectorize isomorphic parts of the dependence graphs, giving up on the first instruction mismatch and not attempting to fix it.

In PSLP, once the padding has been added to the graph for each lane, we must run the cost model to determine how to proceed. We have to calculate the performance of the original scalar code alongside that of the vectorized code with and without padding (Step 4). If the performance of the padded code (after vectorization) is found to be the best (Step 5), then the algorithm emits the padded scalar code (Step 6). This includes all select instructions and additional redundant instructions that make the graphs isomorphic. If the padded code does not have the best performance, the instruction padding does not proceed and the algorithm decides whether to generate vectors or not, as in vanilla SLP (Step 7). Finally, if vectorization seems beneficial, and regardless of whether padded instructions have been emitted or not, the algorithm will generate the SLP internal graph representation as in Figure 2(c) (Step 8) and will perform vectorization on the groups of scalars (Step 9).
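The decision taken in Steps 4-7 can be summarised by the following sketch, assuming the three region costs have already been computed; the names are hypothetical and the real pass operates on LLVM IR rather than plain integers.

    enum Action { KEEP_SCALAR, VECTORIZE, PAD_AND_VECTORIZE };

    /* Lower cost is better (Figure 1, Steps 4-7). */
    enum Action decide(int scalar_cost, int vector_cost, int padded_vector_cost) {
        /* Step 5: padding is used only if it beats both other forms. */
        if (padded_vector_cost < vector_cost && padded_vector_cost < scalar_cost)
            return PAD_AND_VECTORIZE;   /* Step 6: emit padded scalars, then vectorize */
        /* Step 7: plain SLP vectorization must still beat the scalar code. */
        if (vector_cost < scalar_cost)
            return VECTORIZE;
        return KEEP_SCALAR;
    }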
3.2 PSLP Graphs Construction

The PSLP graphs (Figure 1, Step 2) are constructed bottom-up, each of them starting from the seed instructions. For each instruction or constant we create an instruction node and for each data dependence an edge from the definition to the use. An instruction can only be part of a single PSLP graph. The graphs get terminated (i.e., do not grow further upwards) once an instruction is reached that would be shared among graphs. Nodes representing constants are not shared; instead they get replicated such that each graph has its own constants.

3.3 Minimal Graph Padding

At the heart of the PSLP algorithm lies the graph padding function (Step 3 of Figure 1). This function returns a set of isomorphic graphs which have the same semantics as the original graphs but include new select instructions and additional redundant instruction nodes. This ensures that the graphs can be vectorized at a later stage (Steps 8 and 9 of Figure 1).

To obtain optimal performance from the padded code we must find the minimum set of redundant instructions required to get isomorphic graphs. In essence, this is equivalent to finding the minimum common supergraph (MinCS) of a set of graphs. Bunke et al. [5] state that the problem of finding the MinCS of two graphs can be solved by means of maximum common subgraph (MaxCS) computation.

3.3.1 Maximum Common Subgraph

Finding the maximum common subgraph of two graphs g1 and g2 is an isomorphism problem and is known to be NP-hard. An optimal solution can be achieved with an algorithm such as McGregor's [18], but for any practical usage the graphs should be very small (up to 15 nodes). To address this we propose a faster backtracking algorithm which is very close to the optimal (in this context) and is shown in Algorithm 1. In the example of Figure 3, the input to the MaxCS algorithm is the pair of graphs in Figure 3(b) and the output is in Figure 3(c).

Algorithm 1. Backtracking Maximum Common Subgraph
 1 /********* Max Common Subgraph Algorithm *********/
 2 /* Input:  Graphs G1 and G2 */
 3 /* Output: Subgraphs SG1 and SG2 */
 4 maxcs (g1, g2) {
 5   Initialize node maps: n_map12, n_map21
 6   Initialize the best subgraphs: best_sg1, best_sg2
 7   /* Sort nodes to do the matching Bottom-Up */
 8   Sort g1, g2 nodes by Reverse List Schedule
 9   maxcs_rec (g1,g2,g1.root,g2.root,n_map12,n_map21)
10   return (best_sg1, best_sg2)
11 }
12
13 maxcs_rec (g1,g2,n1_begin,n2_begin,n_map12,n_map21){
14   for (n1 in g1.nodes starting from n1_begin) {
15     for (n2 in g2.nodes starting from n2_begin) {
16       /* Skip incompatible nodes */
17       if (n2 already in n_map21) continue
18       if (n1 and n2 of different types) continue
19       if (n2 is successor of paired nodes) continue
20       if (n1.BB != n2.BB) continue
21       if (can't schedule n1, n2 in parallel) continue
22       /* Found compatible n1 and n2 */
23       n_map12 [n1] = n2
24       n_map21 [n2] = n1
25       /* Look ahead for good node pairs */
26       node_pairs  = get_next_node_pairs(n1,n2)
27       node_pairs += get_next_node_pairs(n2,n1)
28       for (pair in node_pairs) {
29         /* Recursion for backtracking */
30         maxcs_rec(g1,g2,pair.n1,pair.n2,n_map12,n_map21)
31       }
32     }
33   }
34   /* Get the edge maps */
35   for (n1 in n_map12) {
36     n2 = n_map12 [n1]
37     e1,e2 = corresponding out edges of n1 and n2
38     /* Set the corresponding edge maps */
39     e_map12 [e1] = e2
40     e_map21 [e2] = e1
41   }
42   /* Generate the subgraphs */
43   sg1 = subgraph(g1,n_map12,n_map21,e_map12,e_map21)
44   sg2 = subgraph(g2,n_map12,n_map21,e_map12,e_map21)
45   /* Keep the best subgraph so far */
46   if (sg1 > best_sg1 and sg2 > best_sg2) {
47     best_sg1 = sg1
48     best_sg2 = sg2
49   }
50
51   /* Return pairs {n1,x} that will likely lead to a
52      good solution. The more the pairs the slower */
53   get_next_node_pairs(n1, n2) {
54     for (n2x in n2.nodes starting at n2) {
55       if (|n2x.mobility - n2.mobility| < m_threshold
56           && pairs.size() < p_threshold)
57         pairs.append ({n1, n2x})
58     }
59     return pairs
60   }
61 }

Algorithm. The two graphs to operate on are given as input to the algorithm. The first thing the algorithm does is to sort the graph nodes so that the graph is traversed bottom-up (line 8), to ensure that children are visited before their parents. This matches the direction of traversal in the later vectorization pass. Next it calls the main recursive function of the algorithm (line 9). The main function starts with iterating over the nodes of g1 and attempts to find a node from g2 to match with (lines 14 to 33). While searching we skip nodes that:
i. Are already matched (line 17);
ii. Are of incompatible types (line 18);
iii. Will cause cycles in the graph if allowed (line 19);
iv. Are in different basic blocks (line 20); or
v. Cannot be scheduled in parallel (line 21).
Once we find the first two nodes that match we consider them as likely candidates and insert them into a map of nodes from g1 to g2 and its reverse (lines 23 and 24). Then we search for other likely matches within a given radius (line 26) and spawn a new search from that point (line 30). This ensures that we do not get trapped in a local optimal point.
These pairs of likely nodes are calculated by the function get_next_node_pairs() (line 53). A full-search approach would try out all possible pairs. In our algorithm, however, to keep the complexity low we only consider pairs of nodes that are most likely to lead to a good result. To that end we ignore nodes that have a mobility difference² higher than a threshold (line 55). We also apply a cap on the maximum number of pairs that we generate (line 56). The process continues until all node pairs are considered by all spawned searches.

When the algorithm exits the loop, the node maps are populated and the algorithm creates two edge maps (mapping common subgraph edges that are in g1 to those in g2 and vice-versa). This is performed by inspecting each pair of corresponding nodes (lines 35 to 41) and mapping their attached edges one to one. The corresponding edges are the ones that are connected to the corresponding instruction operand of the node. Next, with both node and edge maps we can create the maximum common subgraphs by filtering out nodes and edges that are not present in the maps (lines 43 and 44). Finally, we compare the subgraphs with the best found so far in order to find the maxima (lines 46 to 49).

² The mobility of a node in a data dependence graph is calculated as ALAP − ASAP, as in [12].

3.3.2 Minimum Common Supergraph

The MaxCS graph is used to calculate the minimum common supergraph. The MinCS is the graph obtained from adding the MaxCS and the differences between the original graphs and the MaxCS graphs [5]. To be more precise, if g1 and g2 are the original graphs and MaxCS is the maximum common subgraph of g1 and g2, then MinCS = MaxCS + diff1 + diff2, where diff1 = g1 − MaxCS and diff2 = g2 − MaxCS.

[Figure 3. The graph padding algorithm. Panels: (a) Input code: B[i] = A[i] * 7 + 1; B[i+1] = A[i+1] + 5. (b) Initial PSLP graphs g1 and g2. (c) MaxCS graphs MaxCS1 and MaxCS2. (d) Difference graphs diff1 and diff2. (e) MinCS graphs MinCS1 and MinCS2. (f) Final padded graphs.]

Example. Figures 3(a) to 3(e) show an example of calculating the MinCS. The input code and initial graphs are shown in Figures 3(a) and 3(b) respectively. The MaxCS algorithm (Section 3.3.1) gives us the maximum common subgraphs of Figure 3(c). Following the process of Section 3.3.2, we calculate the differences diff1 and diff2 (Figure 3(d)) and add them to MaxCS1 and MaxCS2, resulting in the final MinCS graphs of Figure 3(e).
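Viewing graphs as plain node sets, the relation above can be exercised with a toy example based on the two lanes of Figure 2 (lane 0: load, multiply, add, store; lane 1: load, add, store). Real PSLP graphs also carry edges and operand positions, so this is only a sketch.

    #include <stdio.h>
    #include <string.h>

    static int contains(const char *set[], int n, const char *x) {
        for (int i = 0; i < n; i++)
            if (strcmp(set[i], x) == 0) return 1;
        return 0;
    }

    int main(void) {
        const char *g1[]    = { "load", "mul", "add", "store" };  /* lane 0 */
        const char *g2[]    = { "load", "add", "store" };         /* lane 1 */
        const char *maxcs[] = { "load", "add", "store" };         /* MaxCS  */

        /* MinCS = MaxCS + diff1 + diff2, with diff1 = g1 - MaxCS ("mul")
         * and diff2 = g2 - MaxCS (empty here). */
        printf("MinCS =");
        for (int i = 0; i < 3; i++) printf(" %s", maxcs[i]);
        for (int i = 0; i < 4; i++)
            if (!contains(maxcs, 3, g1[i])) printf(" %s", g1[i]);  /* diff1 */
        for (int i = 0; i < 3; i++)
            if (!contains(maxcs, 3, g2[i])) printf(" %s", g2[i]);  /* diff2 */
        printf("\n");  /* prints: MinCS = load add store mul */
        return 0;
    }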

The instructions padded into MinCS1 are those in diff2 and the instructions padded into MinCS2 are those from diff1. Instruction padding involves creating copies of existing instructions and emitting them into the code stream before their consumers. Almost all padded instructions can be directly vectorized along with their corresponding original instructions from the neighboring graphs. The exceptions are load instructions, which cannot be vectorized along with identical copies of themselves, since they point to the same memory location—not consecutive addresses. Therefore padded loads also get their addresses modified to point to consecutive addresses.
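A small scalar illustration of this load exception, with hypothetical names: a literal copy of the lane-0 load would re-read the same address, so the padded copy is instead pointed at the next consecutive element.

    void pad_load_example(const double *A, double *out, int i) {
        double lane0_load = A[i];        /* original load in lane 0               */
        /* An identical padded copy would also read A[i] and could not be fused   */
        /* into a vector load; the padded copy's address is moved one element on. */
        double lane1_load = A[i + 1];    /* padded load, consecutive address      */
        out[0] = lane0_load;
        out[1] = lane1_load;
    }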

3.3.3 Select Node Insertion

The MinCS graphs do not represent valid code. Node "+" of MinCS1 in Figure 3(e) has 3 input edges, 2 of which flow into its left argument. This is clearly not correct. Therefore PSLP emits select instructions to choose between the edges flowing into the same argument of an instruction node. The selects should always choose the diff that belongs to the original code. In this way the semantics of the code do not change with the newly injected code. Given the placement of the diff subgraphs on the graph, the selects inserted into the left graph choose the left edge while those inserted into the right graph select the right edge. In Figure 3(f) the left select chooses the multiplication while the right select chooses the load (highlighted edges).

Even though the graphs have been padded with several instructions, it is important to note that the semantics of the computations performed by the graphs is unaffected. The left graph of Figure 3(f) is still B[i] = A[i] * 7 + 1 and the right graph is still B[i+1] = A[i+1] + 5.

As a subsequent phase in PSLP we could convert the select instructions into predication to achieve the same goal. Predication has two advantages over using select instructions:
1. There is no need to emit and to execute select instructions. This results in fewer instructions and could lead to better performance.
2. It is possible to safely enable/disable instructions with side-effects (e.g., stores) or instructions which could cause an exception (e.g., loads). This may further increase the coverage of PSLP.
Our target instruction set is Intel's AVX2 which does not support predicated execution, therefore we leave the evaluation of PSLP with predication for future work.

3.3.4 Removing Redundant Selects

The final stage of graph padding is to remove any select instructions that can be optimized away. This is counter-intuitive, since the MinCS algorithm finds a near-optimal solution. However, MinCS is designed for generic graphs, treating each node equally, whereas we are applying it to PSLP graphs where nodes are instructions or constants and edges represent data-flow. We can use this semantic information to optimize certain select nodes away. Removing a select across all lanes is always beneficial since it will lead to code with fewer instructions.

We have implemented four select-removal optimizations, shown in Figure 4. These remove:
1. Selects choosing between constants, through replacement with the selected constant (e.g., Figure 4(a));
2. Select nodes with the same predecessor for each edge (e.g., Figure 4(b));
3. Selects that choose between an instruction that has an identity operation (e.g., add 0 or mul 1) reading a constant and a node which is common to both the select and that instruction (see Figure 4(c)); and
4. Cascaded select nodes (Figure 4(d)).
The effect of these optimizations is to remove redundant select instructions from the vectorized code, improving its performance.

[Figure 4. Optimizing away select nodes. Panels: (a) Select between constants; (b) Select between the same nodes; (c) Instruction acting as select; (d) Cascaded selects.]
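Read at the source level, and remembering that the select condition in each padded lane is fixed at compile time, the four removals of Figure 4 correspond to rewrites of the following kind; the constants and the helper name are illustrative only.

    /* One lane of padded code: every select condition is a known constant,
     * so the four patterns of Figure 4 fold away. */
    double select_removal_examples(double a, double x) {
        double c = 1 ? 2.0 : 7.0;        /* (a) select between constants -> 2.0     */
        double s = 1 ? a : a;            /* (b) same node on both inputs -> a       */
        double t = 1 ? x + 0.0 : x;      /* (c) identity op (add 0) vs the node it  */
                                         /*     reads -> the identity op suffices   */
        double u = 1 ? c : (1 ? s : t);  /* (d) cascaded selects -> a single select */
        return c + s + t + u;
    }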
3.3.5 Multiple Graph MinCS

Until now we have only shown how to derive the MinCS graph from a pair of graphs. However, SIMD vector units have many more than just 2 lanes. In PSLP we compute the MinCS of a sequence of more than 2 graphs in a fast but sub-optimal way, by calculating the MinCS for pairs of graphs and using the computed MinCS as the left graph for the next pair.

For example if we have 3 graphs, g1, g2 and g3 (Figure 5(a)), we compute the MinCS12 of g1 and g2 (Figure 5(b)), and then the MinCS123 between MinCS12_R and g3 (Figure 5(c)). In this left-to-right pairwise process we get the rightmost MinCS123_R supergraph which is the largest of them all³. This process introduces redundant selects (for example, the two select nodes in MinCS123_R). These get optimized away by our fourth select removal optimization from Section 3.3.4.

Having obtained the final MinCS, we use this supergraph as a template to recreate the semantics of each of the original graphs⁴. This is performed in two steps. First we map the instructions from each graph to the nodes of this rightmost supergraph and, second, we set the conditions of the select nodes correctly. The end result from our example is shown in Figure 5(d). After applying the select removal optimizations we obtain the padded vectorizable graphs in Figure 5(e).

³ This is not always true. In the pair-wise process of generating MinCSs we may discard some nodes, leading to a smaller graph.
⁴ If the rightmost supergraph is not a superset of all the graphs, it is impossible to guarantee that all nodes of the input graphs can map to the nodes of the final supergraph. We therefore keep graphs acquired only from the left-to-right pairwise calculation of the supergraphs.

[Figure 5. Generation of the MinCS of a sequence of graphs. Panels: (a) Original graphs g1, g2, g3; (b) MinCS12; (c) MinCS123; (d) Final graphs (lanes 0-2); (e) Optimized selects.]

3.4 Cost Model

Having padded the input graphs, we must make a decision about whether to proceed with vectorization of the padded graphs or the original graphs, or whether to simply keep the code scalar. To do this without degrading performance PSLP requires an accurate cost model for estimating the latency of each case. We largely reuse the cost model of the LLVM SLP pass with minor modifications. The cost is computed as the total sum of all individual instruction costs. Each instruction cost is the execution cost of that instruction on the target processor (which is usually equal to the execution latency). If there is scalar data flowing in or out of vector code, or vector data flowing in or out of the scalar code, then there is an extra cost to be added: the cost of the additional instructions required to perform these data movements.

This cost model is simple to derive, but effective. A more precise model, especially for simpler in-order architectures, would involve instruction scheduling on the target pipeline. The schedule depth would then provide a better estimate of the execution cost. In this way the scalar cost on wide-issue architectures would be estimated more accurately. However, for our out-of-order target architecture the existing cost model already gives good accuracy; improving it further is beyond the scope of this work.
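The total cost of Section 3.4 can be sketched as a plain sum, assuming the per-instruction costs are already known; insert_extract_cost stands for the scalar-to-vector (and back) data-movement penalty and is a hypothetical name.

    /* Total region cost = sum of individual instruction costs plus the cost
     * of moving data between the scalar and vector units (Section 3.4). */
    int region_cost(const int *instr_cost, int n, int insert_extract_cost) {
        int total = insert_extract_cost;
        for (int i = 0; i < n; i++)
            total += instr_cost[i];
        return total;
    }

PSLP evaluates such a cost once for the scalar form, once for the SLP-vectorized form and once for the padded form, and the comparison of Figure 1 (Steps 4-7) keeps the cheapest.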
Kernel            Description
conjugates        Compute an array of conjugates
su3-adjoint       Compute the adjoint of an SU3 matrix (in 433.milc)
make-ahmat-slow   Take the traceless and anti-hermitian part of an SU3 matrix and compress it (in 433.milc)
jdct-ifast        Discrete Cosine Transform (in cjpeg)
floyd-warshall    Find shortest paths in a weighted graph (Polybench)
Table 1. Description of the kernels.

[Figure 6. Execution time of our kernels and benchmarks, normalized to O3. Bar charts of normalized time for O3, SLP and PSLP over the kernels (conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall) and the whole benchmarks (cjpeg, mpeg2dec, 433.milc, 473.astar), with geometric means.]

3.5 Summary

We have presented PSLP, a straight-line code vectorization algorithm that emits redundant instructions before vectorizing. These additional instructions help in transforming the non-isomorphic dependence graphs into isomorphic ones. The algorithm relies on calculation of the maximum common subgraph to derive the minimum common supergraph which contains the new instructions. Select operations are inserted to keep the program semantics intact. A cost model is then applied to determine how vectorization should proceed. This ensures that enough code is vectorized to offset the overheads from the insertion of additional instructions that PSLP requires (e.g., selects and redundant operations).

4. Experimental Setup

We implemented PSLP in the trunk version of the LLVM 3.6 compiler [14] as an extension to the existing SLP pass. We evaluate PSLP on several kernels extracted from various benchmarks of the SPEC CPU 2006 [31], MediaBench II [7] and Polybench [26] suites. A brief description of them is given in Table 1. We also evaluated PSLP on full benchmarks from the C/C++ benchmarks of the SPEC CPU 2006 and MediaBench II suites. We compiled all benchmarks with the following options: -O3 -allow-partial-unroll -march=core-avx2 -mtune-core-i7 -ffast-math. The kernels were compiled with more aggressive unrolling: -unroll-threshold=900. We only show the benchmarks that i) trigger PSLP at least once and ii) show a measurable performance difference due to PSLP; we skip the rest.

The target system was an Intel Core i5-4570 at 3.2GHz with 16GB of RAM and an SSD hard drive, running Linux 3.10.17 and glibc 2.17. We ran each kernel in a loop for as many iterations as required such that they executed for several hundred milliseconds. We executed the MediaBench II benchmarks 10 times each, skipping the first 2 executions. We executed the SPEC CPU 2006 benchmarks 10 times each, using the reference dataset.

5. Results

5.1 Performance

Execution time, normalized to O3, is shown in Figure 6. We measure O3 with all vectorizers disabled (O3), O3 with only SLP enabled (SLP) and O3 with PSLP enabled (PSLP).

The results show that when PSLP is triggered it always improves performance over SLP. This is to be expected as PSLP will generate code only if the cost-model guarantees that its performance is better than both the scalar code (O3) and that vectorized by SLP (see Figure 1, steps 4, 5 and 7).
The kernels conjugates and jdct-ifast contain two different types of code that benefit from PSLP (see the simplified kernel codes in Figure 7). On one hand, conjugates contains a sequence of non-isomorphic computations in its source code (Figure 7(a)), similar to the motivating example in Section 2.2. On the other hand, jdct-ifast contains isomorphic computations in the source code which become non-isomorphic after high-level optimizations (Figure 7(b)), in this case strength reduction. The computation in the jdct-ifast source code performs a sequence of multiplications against an array of constants. Some of the constant values in the array of constants happen to be a power of two. The multiplication instructions against these specific values get strength-reduced down to logical left shifts, thus breaking the isomorphism. Therefore SLP fails, while PSLP is able to restore the isomorphism and vectorize the code.

[Figure 7. Opportunities for PSLP.
 (a) Non-isomorphic source code (conjugates):
     b[0].real =  a[0].real
     b[0].imag = -a[0].imag
     b[1].real =  a[1].real
     b[1].imag = -a[1].imag
 (b) Non-isomorphism caused by high-level optimizations (jdct-ifast), before and after optimization:
     tmp1 = quantval[0]*16384      ->   tmp1 = quantval[0]<<14
     tmp2 = quantval[1]*22725      ->   tmp2 = quantval[1]*22725
     tmp3 = quantval[2]*21407      ->   tmp3 = quantval[2]*21407
     tmp4 = quantval[3]*19266      ->   tmp4 = quantval[3]*19266 ]

In some cases, vectorization can cause performance degradation (not shown in the results). It is not uncommon for vectorizing techniques to generate code that is slower than the scalar code. This problem can be caused by:
• Low accuracy of the cost model;
• Poor description of the target micro-architecture pipeline in the compiler; or
• The compiler back-end generating very different and slower code from the code at the high-level IR.
So there could be cases, particularly when the target is a powerful wide-issue superscalar processor, where a sequence of scalar instructions is faster than their vectorized counterparts and the cost model could do a bad estimation. This problem, though, does not affect the cost comparison between PSLP and SLP considerably because it is a relative comparison between vector code and vector code. PSLP can avoid the slowdowns compared to SLP, but not compared to scalar code.

5.2 Vectorization Coverage

The main goal of PSLP is to increase the vectorization coverage. More vectorized code usually translates to better performance. To evaluate the coverage, we count the number of times each of the techniques is successful (that is, how many times they generate code). SLP and PSLP complement each other. As shown in Figure 8, there are four possible cases: "No vectorization", "SLP", "PSLP extends SLP" and "PSLP only". The intersection "PSLP extends SLP" refers to the scenario where SLP would succeed even without PSLP, but it performs better with the help of PSLP (i.e., vectorization succeeds deeper in the SLP graph). We measure each of these cases (except for "No vectorization") across the benchmarks and we show the numbers in Figure 9.

[Figure 8. PSLP allows vectorization either when SLP would have failed (shown by "PSLP-only"), or when SLP would have succeeded but not to the depth that it does with PSLP (shown by "PSLP extends SLP").]

[Figure 9. The static coverage of each technique. Bar chart "Vectorization Coverage Breakdown"; y-axis: Number of Times Technique Succeeds; series SLP-only, PSLP-extends-SLP and PSLP-only over conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc and 473.astar.]

These coverage results cannot be directly correlated to the performance results as they represent static data, not dynamic. The overall performance improvement achieved is a function of the frequency that the vectorized code gets executed (i.e., how "hot" it is) and the savings achieved compared with the original scalar code (i.e., the improvement estimated by the cost model). For example, Figure 9 shows that the kernels conjugates and su3-adjoint trigger PSLP a similar number of times, but conjugates takes 0.37× the baseline time, whereas su3-adjoint takes 0.97×. This is because the average cost savings for each vectorization success in conjugates is 3.4× that of su3-adjoint (Figure 10).

In addition, the coverage results are affected by the amount of unrolling performed. As mentioned in Section 4, the kernels are compiled with more aggressive unrolling which causes the techniques to succeed more times compared to the whole benchmarks. Nevertheless, the coverage results clearly show that PSLP improves the coverage of vectorization across the majority of codes shown.

5.3 Static Instruction Savings

The cost model of the vectorization algorithm (Section 3.4) makes sure that vectorization is only applied if it leads to code with lower cost. As already mentioned, the cost calculated by the cost model is very close to the actual count of instructions (it is usually weighted by the instruction latency). Therefore the vectorizer is only triggered if it leads to faster code with fewer instructions. The total cost saved by each of the vectorizers is shown in Figure 10. These savings are measured at the intermediate representation (IR) level. By the time the final code gets generated at the back-end the target code might look significantly different and the savings might differ.

[Figure 10. The total static code cost saved by each vectorizer over scalar, according to the cost model. Bar chart "Instruction cost savings due to vectorization"; y-axis: Instruction Cost Saved; series SLP and PSLP over the kernels and benchmarks.]

5.4 Select Instructions

To get more insight on how PSLP performs, we measured how many select instructions are emitted in regions of code that get successfully vectorized by PSLP (Figure 11). To be more precise, the regions include all the instructions that form the PSLP graphs. We measured the number of selects originally emitted by PSLP just to maintain correctness ("Original-Selects"), the number of selects that remain after all the select-removal optimizations described in Section 3.3.4 ("Optimized-Selects") and the number of selects removed by each of the optimizations (OptA to OptD correspond to the descriptions in Figure 4).

According to Figure 11, the number of select instructions introduced by PSLP is significant (about 21% of the code in the PSLP graphs, on average). The select-removal optimizations manage to remove about 28% of the total selects originally emitted by PSLP, bringing them down to 15%. The most effective of these optimizations removes selects between constants (OptA).

[Figure 11. Number of select instructions emitted for the vectorized region, before and after the select removal optimizations, and the number of selects removed by each optimization. Bar chart "Percentage of Selects per region and optimizations breakdown"; y-axis: Percentage of Selects; series Original-Selects, Optimized-Selects, OptA:Constants, OptB:Same-Inputs, OptC:Bypass, OptD:Cascaded.]

5.5 Summary

We have evaluated PSLP on a number of kernels and full benchmarks, showing that it can reduce the execution time down to 0.37×. It achieves this by increasing the amount of code that would have been vectorized by SLP anyway, as well as enabling SLP to vectorize code that it would have otherwise left as scalar.

6. Related Work

6.1 Vector Processing

Various commercial (for example [24, 29]) and experimental (e.g., [11]) wide vector machines have been built in the past. These machines were used to accelerate scientific vector code, usually written in some dialect of Fortran.

More recently short SIMD vectors have become a standard feature of all commodity processors for most desktop and mobile systems. All major processor manufacturers (Intel, AMD, IBM and ARM) support some sort of short-vector ISA (e.g., MMX/SSE*/AVX/AVX2 [10], 3DNow! [23], VMX/AltiVec [9] and NEON [3] respectively). These ISAs are constantly under improvement and get updated every few years with more capable vector instructions and/or wider vectors.

Modern graphics processors (GPUs), like old vector machines, implement hardware vectorization [15]. They do so by executing groups of 32 (on Nvidia) or 64 (on AMD) adjacent threads in warps in lock-step. Such large vector widths are possible thanks to data-parallel input languages like CUDA or OpenCL, where the programmer explicitly exposes the available parallelism to the hardware. This effectively overcomes intrinsic limitations of compiler-based analysis, leading to substantial runtime and energy improvements over traditional CPU execution for suitable workloads.

6.2 Loop Vectorization

Loops are the main target of vectorization techniques [32]. The basic implementation strip-mines the loop by the vector factor and widens each scalar instruction in the body to work on multiple data elements. The works of Allen and Kennedy on the Parallel Fortran Converter [1, 2] solve many of the fundamental problems of automatic vectorization. Numerous improvements to the basic algorithm have been proposed in the literature and implemented in production compilers. Efficient run-time alignment has been proposed by Eichenberger et al. [6], while efficient static alignment techniques were proposed by Wu et al. [33]. Ren et al. [27] propose a technique that reduces the count of data permutations by optimizing them in groups. Nuzman et al. [21] describe a technique to overcome non-contiguous memory accesses and a method to vectorize outer loops without requiring loop rotation in advance [20].
An evaluation of loop vectorization performed by Maleki et al. [17] shows the limits of current implementations. State-of-the-art compilers, like GCC and ICC, can vectorize only a small fraction of loops in standard benchmarks like MediaBench. The authors explain these poor results as (1) lack of accurate compiler analysis, (2) failure to perform preliminary transformations on the scalar code and (3) lack of effective cost models.

6.3 SLP Vectorization

Super-word level parallelism (SLP) has been recently introduced to take advantage of SIMD ISAs for straight-line code. Larsen and Amarasinghe [13] were the first to present an automatic vectorization technique based on vectorizing parallel scalar instructions with no knowledge of any surrounding loop. Variants of this algorithm have been implemented in all major compilers including GCC and LLVM [28]. This is the state-of-the-art SLP algorithm and in this paper we use its LLVM implementation as a baseline for comparison and as a starting-point for our PSLP work.

Shin et al. [30] introduce an SLP algorithm with a control-flow extension that makes use of predicated execution to convert the control flow into data-flow, thus allowing it to become vectorized. They emit select instructions to perform the selection based on the control predicates.

Other straight-line code vectorization techniques which depart from the SLP algorithm have also been proposed in the literature. A back-end vectorizer in the instruction selection phase based on dynamic programming was introduced by Barik et al. [4]. This approach is different from most of the vectorizers as it is close to the code generation stage and can make more informed decisions on the costs involved with the instructions generated. An automatic vectorization approach that works on straight-line code is presented by Park et al. [25]. It succeeds in reducing the overheads associated with vectorization such as data shuffling and inserting/extracting elements from the vectors. Holewinski et al. [8] propose a technique to detect and exploit more parallelism by dynamically analyzing data dependences at runtime, and thus guiding vectorization. Liu et al. [16] present a vectorization framework that improves SLP by performing a more complete exploration of the instruction selection space while building the SLP tree.

None of these approaches identify the problem of mismatching instructions while attempting to build vectors, nor do they try to solve it in any way. The PSLP approach of introducing selectively-executed padded code to solve this mismatching problem is unique, to the best of our knowledge.

6.4 Vectorization Portability

Another relevant issue for vectorization is portability across platforms. The various types of SIMD instructions available on different architectures require the definition of suitable abstractions in the compiler's intermediate representation. These must be general enough to embrace various vectorization patterns without sacrificing the possibility of generating efficient code. Nuzman et al. targeted this problem by proposing improvements to the GIMPLE GCC intermediate representation [19] and through JIT compilation [22].

7. Conclusion

In this paper we presented PSLP, a novel automatic vectorization algorithm that improves upon the state-of-the-art. PSLP solves a major problem of existing SLP vectorization algorithms, that of purely relying on the existence of isomorphic instructions in the code. Our technique transforms the non-isomorphic input code graphs into equivalent isomorphic ones in a near-optimal way. This is performed by careful instruction padding while making sure that the program semantics remain unchanged. The end result is padded code which gets successfully vectorized when the state-of-the-art techniques would either fail or partially vectorize it. The evaluation of our technique on an industrial-strength compiler and on a real machine shows improved coverage and performance gains across a range of kernels and benchmarks.

Acknowledgments

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) through grant reference EP/K026399/1.

References

[1] J. R. Allen and K. Kennedy. PFC: A program to convert Fortran to parallel form. Rice University, Department of Mathematical Sciences, 1981.
[2] J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. Transactions on Programming Languages and Systems (TOPLAS), 9(4), 1987.
[3] ARM Ltd. ARM NEON. http://www.arm.com/products/processors/technologies/neon.php, 2014.
[4] R. Barik, J. Zhao, and V. Sarkar. Efficient selection of vector instructions using dynamic programming. In Proceedings of the International Symposium on Microarchitecture (MICRO), 2010.
[5] H. Bunke, X. Jiang, and A. Kandel. On the minimum common supergraph of two graphs. Computing, 65(1), 2000.
[6] A. E. Eichenberger, P. Wu, and K. O'Brien. Vectorization for SIMD architectures with alignment constraints. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2004.
[7] J. E. Fritts, F. W. Steiling, J. A. Tucek, and W. Wolf. MediaBench II video: Expediting the next generation of video systems research. Microprocessors and Microsystems, 2009.
[8] J. Holewinski, R. Ramamurthi, M. Ravishankar, N. Fauzia, L.-N. Pouchet, A. Rountev, and P. Sadayappan. Dynamic trace-based analysis of vectorization potential of applications. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2012.
[9] IBM PowerPC Microprocessor Family. Vector/SIMD Multimedia Extension Technology Programming Environments Manual, 2005.
[10] Intel Corporation. IA-32 Architectures Optimization Reference Manual, 2007.
[11] C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and K. Yelick. Scalable processors in the billion-transistor era: IRAM. Computer, 30(9), 1997.
[12] V. S. Lapinskii, M. F. Jacome, and G. A. De Veciana. Cluster assignment for high-performance embedded VLIW processors. Transactions on Design Automation of Electronic Systems (TODAES), 2002.
[13] S. Larsen and S. Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2000.
[14] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2004.
[15] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2), 2008.
[16] J. Liu, Y. Zhang, O. Jang, W. Ding, and M. Kandemir. A compiler framework for extracting superword level parallelism. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2012.
[17] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua. An evaluation of vectorizing compilers. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2011.
[18] J. J. McGregor. Backtrack search algorithms and the maximal common subgraph problem. Software: Practice and Experience, 12(1), 1982.
[19] D. Nuzman and R. Henderson. Multi-platform auto-vectorization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2006.
[20] D. Nuzman and A. Zaks. Outer-loop vectorization: revisited for short SIMD architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
[21] D. Nuzman, I. Rosen, and A. Zaks. Auto-vectorization of interleaved data for SIMD. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2006.
[22] D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2011.
[23] S. Oberman, G. Favor, and F. Weber. AMD 3DNow! technology: Architecture and implementations. IEEE Micro, 19(2), 1999.
[24] W. Oed. Cray Y-MP C90: System features and early benchmark results. Parallel Computing, 18(8), 1992.
[25] Y. Park, S. Seo, H. Park, H. Cho, and S. Mahlke. SIMD defragmenter: Efficient ILP realization on data-parallel architectures. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
[26] L.-N. Pouchet. PolyBench: The polyhedral benchmark suite. http://www.cs.ucla.edu/~pouchet/software/polybench/, 2012.
[27] G. Ren, P. Wu, and D. Padua. Optimizing data permutations for SIMD devices. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2006.
[28] I. Rosen, D. Nuzman, and A. Zaks. Loop-aware SLP in GCC. In GCC Developers Summit, 2007.
[29] R. M. Russell. The CRAY-1 computer system. Communications of the ACM, 21(1), 1978.
[30] J. Shin, M. Hall, and J. Chame. Superword-level parallelism in the presence of control flow. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2005.
[31] SPEC. Standard Performance Evaluation Corp Benchmarks. http://www.spec.org, 2014.
[32] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1995.
[33] P. Wu, A. Eichenberger, and A. Wang. Efficient SIMD code generation for runtime alignment and length conversion. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2005.