Automatic SIMD Vectorization for Haskell

Leaf Petersen, Intel Labs, [email protected]
Dominic Orchard, Computer Laboratory, University of Cambridge, [email protected]
Neal Glew, Intel Labs, [email protected]

Abstract

Expressing algorithms using immutable arrays greatly simplifies the challenges of automatic SIMD vectorization, since several important classes of dependency violations cannot occur. The Haskell programming language provides libraries for programming with immutable arrays, and compiler support for optimizing them to eliminate the overhead of intermediate temporary arrays. We describe an implementation of automatic SIMD vectorization in a Haskell compiler which gives substantial vector speedups for a range of programs written in a natural programming style. We compare performance with that of programs compiled by the Glasgow Haskell Compiler.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors—Compilers; D.3.4 [Programming Languages]: Processors—Optimization; D.1.1 [Programming Techniques]: Applicative (Functional) Programming

Keywords Vectorization, SIMD, Compiler Optimization, Haskell, Functional Languages

1. Introduction

In the past decade, power has increasingly been recognized as the key limiting resource in micro-processors, in part due to the limited battery on mobile devices, but more fundamentally due to the non-linear scaling of power usage with frequency. This has forced micro-processor designs to explore (or re-explore) different avenues at both the micro-architectural and the architectural levels. One successful architectural trend has been the increasing emphasis on data parallelism in the form of single-instruction multiple-data (SIMD) instruction sets.

The key idea of data parallelism is that significant portions of many sequential computations consist of uniform (or almost uniform) operations over large collections of data. SIMD instruction sets exploit this at the machine instruction level by providing data paths for and operations on fixed-width vectors of data. Programmers (or preferably compilers) are then responsible for finding data parallel computations and expressing them in terms of fixed-width vectors. In the context of languages with loops (or compiled to loops), this corresponds to choosing loops for which multiple iterations can be computed simultaneously.

The task of automatically finding SIMD vectorizable code has been the subject of extensive study over the course of decades. For common imperative programming languages, such as C and Fortran, in which programs are structured as loops over mutable arrays of data, the task of rewriting loops in SIMD vector form is straightforward. However, the task of discovering which loops permit a valid rewriting (via dependency analysis) is in general undecidable, and has been the subject of extensive research (see e.g. [13] and overview in [15]). Even quite sophisticated compilers are often forced to rely on programmer annotations to assert that a given loop is a valid target of SIMD vectorization (for example, the INDEPENDENT keyword in HPF [7]).
A great deal of the difficulty in finding vectorizable loops simply goes away in the context of a functional language. In functional languages, the absence of mutation eliminates large classes of the more difficult kinds of dependence violations that block vectorization in compilers for imperative languages. In this sense, SIMD vectorization would seem to be quite low-hanging fruit for functional language compiler writers. In this paper, we describe the implementation of an automatic SIMD vectorization optimization pass in the Intel Labs Haskell Research Compiler (HRC). We argue that in a functional language context, this optimization is straightforward to implement and gives excellent results.

HRC was originally developed to compile programs written in an experimental strict functional language, and the vectorization optimization that we present here was developed in that context. However, the main bulk of our compiler infrastructure was designed to be fairly language agnostic, and we have more recently used it to build an experimental Haskell compiler using the Glasgow Haskell Compiler (GHC) as a front end. While Haskell is quite a complex language on the surface, the GHC front end reduces it to a relatively simple intermediate representation based on System F, which is fairly straightforward to compile using HRC. By intercepting the GHC intermediate representation (IR) after the main high-level optimization passes have run, we gain the benefit of the substantial and sophisticated optimization infrastructure built up in GHC, and in particular we benefit from its ability to optimize away temporary intermediate arrays.

We make the following three contributions:

• We design a core language with arrays and SIMD vector primitives (Section 3).
• We present a detailed compositional vectorization process over this language (Section 4).
• We evaluate our performance compared to GHC with the native and LLVM backends, showing significant scalar speedups over GHC and additional vector speedups of up to 6.5× over our own scalar compiler, on a selection of array processing benchmarks (Section 5).

We begin with a brief overview of HRC and its relationship to GHC.

2. Intel Labs Haskell Research Compiler

HRC provides an optimization platform aimed at producing high-performance executables from functional-style code. The core intermediate language of the compiler is a static single-assignment style, control-flow graph based intermediate representation (IR). It is strict and has explicit thunks to represent laziness. While the intermediate representation is largely language agnostic, it is not paradigm agnostic. The core technologies of the compiler are designed specifically around compiling functional-language programs. Most importantly for the purposes of this paper, the compiler relies on maintaining a distinction between mutable and immutable data, and focuses almost all of its optimization effort on the latter. The compiler is structured as a whole-program compiler; however, none of the optimizations in the compiler rely on this property in any essential way. In addition to the usual inlining-style optimizations, contification [5] is performed to turn many uses of (mutual) recursion into loops. This optimization is crucial for our purposes since we structure our SIMD-vectorization optimization as an optimization on loops. Many standard compiler optimizations have been implemented in the compiler, including loop-invariant code motion and a very general simplifier in the style of Appel and Jim [2]. The compiler also implements a number of inter-procedural representation optimizations using a field-sensitive flow analysis [11].

The compiler generates output in a modified extension of the C language called Pillar [1]. Among other things, the Pillar language supports tail calls and second-class continuations. Most importantly, it also provides the necessary mechanisms to allow for accurate garbage collection of managed memory. While an early version of Pillar was implemented as a modification of a C compiler, Pillar is currently implemented as a source to source translation targeting standard C, which is then compiled using either the Intel C Compiler or GCC. Our garbage collector and a small runtime are linked in to produce the final executable.
2.1 Haskell and GHC

The Haskell language and the GHC extensions to it provide a large base of useful high-level libraries written in a very powerful programming language. These libraries are often written using a high-level functional-language style of programming, relying on excellent compiler technology to achieve good performance. While we wished to compile Haskell programs and take advantage of the many libraries, we had no interest in attempting the monumental task of duplicating all of the high-level compiler technology implemented in GHC. The approach we took therefore was to essentially use GHC as a Haskell frontend, thereby taking advantage of the existing type-checker and high-level optimizer.

GHC provides some support for writing its main System-F-style internal representation (called Core) out to files. While this code is not entirely maintained, we were able to easily modify it to support all the current Core features we require. We use this facility to build a Haskell compiler pipeline by first running GHC to perform type checking, desugaring, and all of its Core-level optimizations, and then writing out the Core IR immediately before it would otherwise be translated to the next-lower representation. HRC then reads in the serialized Core, and after several initial transformations and optimizations (including making thunks explicit), performs a closure-conversion pass and transforms it down to our main CFG based intermediate representation for subsequent optimization.

2.2 GHC Modifications

Using this approach, we are able to provide a fairly close to functionally complete implementation of the Haskell language as implemented by GHC, although some modification of GHC itself was required. A notable example of this for the purposes of this paper is the treatment of arrays in the Data.Vector libraries.

The GHC libraries Data.Vector and REPA [6] provide clean high-level interfaces for generating immutable arrays. Programmers can use typical functional-language operations like maps, folds, and zips to compute arrays from other arrays. These operations are naturally data parallel and amenable to SIMD vectorization by HRC.

For example, consider a map operation over an immutable array represented with the Data.Vector library. Such arrays are represented internally to the libraries using streams, and so map first turns the source array into a stream, applies the map operation to the stream, and then unstreams that back into the final array. Unstreaming is implemented by running a monadic computation that first creates a mutable array of the appropriate size, then initializes the elements of the array using array update, and finally does an unsafe freeze of the mutable array to an immutable array type [8]. GHC has optimizations which usually eliminate all of the intermediate streaming, mapping, and unstreaming operations. The result of this optimization is a piece of code that sequences the array creation, the calling of a tail-recursive function that initializes the array elements, and the unsafe freeze. After translation into our internal representation, HRC can then contify the tail-recursive function into a loop, resulting in a natural piece of code that can be effectively optimized using standard loop-based optimizations adapted to leverage the benefits of immutability.
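As a rough illustration of the kind of code in question (a minimal sketch for exposition, not one of our benchmarks), the following fragment writes a pointwise operation in the high-level Data.Vector style; the second definition approximates the post-fusion shape, a single pass that allocates the result and initializes each element exactly once, which is the form HRC subsequently turns into a loop.

```haskell
import qualified Data.Vector.Unboxed as V

-- High-level style: a pointwise operation over an immutable vector.
-- Under stream fusion the intermediate stream disappears, leaving a
-- single loop that allocates the result and fills in each element once.
scale :: V.Vector Int -> V.Vector Int
scale = V.map (* 2)

-- Roughly the post-fusion shape: allocate, fill with a loop, freeze.
-- V.generate plays the role of the create/initialize/freeze sequence
-- described in the text.
scaleLowered :: V.Vector Int -> V.Vector Int
scaleLowered v = V.generate (V.length v) (\i -> 2 * V.unsafeIndex v i)

main :: IO ()
main = print (scale (V.fromList [1 .. 8]) == scaleLowered (V.fromList [1 .. 8]))
```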
Unfortunately, observe that as described in the previous paragraph we do not actually get the benefits of immutability with an unmodified GHC. Instead, we end up with code that creates a mutable array and then initializes it with general array update operations, a style of code that we argue is drastically harder to optimize well in general, and to perform SIMD vectorization on in particular. As we discuss in Section 4.7, immutable arrays make determining when a loop can be SIMD vectorized much easier.

For our purposes then, what we would like GHC to generate instead of the code described above is code that creates a new uninitialized immutable array and then uses initializing writes to initialize the array. The two invariants of initializing writes are that reading an array element must always follow the initializing write of that element, and that any element can only be initialized once. This style of write preserves the benefits of immutability from the compiler standpoint, since any initializing write to an element is the unique definition of that element. Initializing writes are key to representing the initialization of an immutable array using the usual loop code, while preserving the ability to leverage immutability.

In order to get immutable code written in this style from GHC and the libraries discussed above, we modified GHC with new primitive immutable array types and primitive operations on them. The additional operations include creation of a new uninitialized array, initializing writes, and a subscript operation. We also modified the Data.Vector library and a small part of the REPA library to use these immutable arrays instead of mutable arrays and unsafe freeze (of course, as with the standard GHC primitives, correctness requires that the library writer use the primitives correctly). With these modifications, the code produced by GHC is suitable for optimization by HRC, including in many cases SIMD vectorization.

2.3 Vectorization in HRC

SIMD vectorization in HRC targets loops for which the compiler can show that executing several iterations in parallel is semantics preserving, rewriting them to use SIMD vector instructions. Usually these loops serve to initialize immutable arrays, or to perform reductions over immutable arrays.
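To make the initializing-write discipline of Section 2.2 concrete, the following sketch emulates the three operations described there (allocation of an uninitialized array, a write that touches each element exactly once, and a read after freezing) in terms of the standard mutable-vector interface. The names and the mutable-then-freeze encoding are illustrative assumptions for exposition; the actual GHC primitives we added operate directly on an immutable array type.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M

-- A stand-in for the initializing-write pattern: every element is
-- written exactly once before the array is frozen and read. The real
-- primitives keep the array immutable throughout; the mutable-then-
-- freeze sequence here only emulates that behaviour.
newInitialized :: Int -> (Int -> Int) -> V.Vector Int
newInitialized n f = runST $ do
  arr <- M.unsafeNew n             -- uninitialized array of length n
  forM_ [0 .. n - 1] $ \i ->
    M.unsafeWrite arr i (f i)      -- the unique (initializing) write of element i
  V.unsafeFreeze arr               -- no writes may follow this point

main :: IO ()
main = print (newInitialized 8 (\i -> i * i))
```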
However, some of the choices of primitives 1 GHC does already have some such types—however, we added new ones used in GHC do not match up well with the needs of HRC, and so to avoid compatibility issues mutable arrays subscripted by 32-bit integers. The elements con- Register kind k ::= s | v tained in the fields of heap objects are constrained to be 32-bit reg- Variables xk, yk, zk,... 31 31 ister values. Register values are small values that may be bound Constants c ::= −2 ,..., (2 − 1) to variables and are implicitly represented at the machine level via Operations op ::= +, −, ∗, /, . . . registers or spill slots. The register values consist of 32-bit integers, Instruction I ::= zs = c v s s 32-bit pointers to heap values, and 256-bit SIMD vectors each con- | z = hx0, . . . , x7i sisting of eight 32-bit register values. Adding additional primitive | zs = xv!i s s s register values such as floating-pointer numbers, booleans, etc. is | z = op(x0, . . . , xn) v v v straightforward and adds nothing essential to the presentation. | z = hopi(x0 , . . . , xn) | zs = new[xs] | zs = xs[ys] 3.2 Variables | zv = xs[hyvi] Variables are divided into two classes according to size: scalar | zv = hxvi[ys] variables xs that may only be bound to 32-bit register values and | zv = hxvi[hyvi] vector variables xv that may only be bound to 256-bit vector values. | xs[ys] ← zs In practice of course, a real intermediate language will also have | xs[hyvi] ← zv additional variable sizes: 64-bit variables for doubles and 64-bit | hxvi[ys] ← zv integers, and a range of vector widths possibly including 128- and | hxvi[hyvi] ← zv 512-bit vectors. Note that the variables xs and xv are considered Comparisons cmp ::= xs < ys | xs ≤ ys | xs = ys to be distinct variables. Informally, we will sometimes use xv to Labels L ::= L0, L1,... denote the variable containing a vectorized version of an original k k s Transfers t ::= goto L(x0 , . . . , xn) program variable x —however, this is intended to be suggestive k k | if (cmp) goto L0(x0 , . . . , xn) only and has no semantic significance. k k else goto L1(y0 , . . . , ym) | halt 3.3 Instructions Blocks B ::= L(xk, . . . , xk ): 0 n The instructions of the vector language serve to introduce and oper- I 0 ate on the primitive objects of the language. Most of the instructions ... bind a new variable of the appropriate register kind, with the excep- I m tion of the array-initialization instructions which write values into t uninitialized array elements. The move instruction zs = c binds Control Flow Graph ::= Entry L in {B ,...,B } 0 n a variable to the 32-bit integer constant c. The vector introduction v s s v instruction z = hx0, . . . , x7i binds the vector variable z to a vec- Figure 1. Syntax tor composed of the contents of the eight listed 32-bit variables. We v s s use the derived syntax z = hc0, . . . , c7i to introduce a vector of constants—this may be derived in the obvious way by first binding form reductions over immutable arrays. The compiler generates the each of the constants to a fresh 32-bit variable. A component of a SIMD loops and all of the related control logic simply as code in vector may be selected out using the select instruction zs = xv!i, our intermediate language. Rather than requiring the compiler to which binds zs to the ith component of the vector in xv (where i generate only the SIMD instructions provided by the target ma- must be in the range 0 ... 7). 
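Although the core language is an internal compiler IR rather than Haskell, a small executable model (an illustration under our own assumptions, not the compiler's representation) helps fix intuitions for the vector values just described: a 256-bit vector register holds exactly eight 32-bit values, vector introduction packages eight scalars, select projects one component, and primitive operations are lifted pointwise.

```haskell
import Data.Int (Int32)

-- Model a 256-bit vector register as exactly eight 32-bit components.
newtype Vec8 = Vec8 [Int32]
  deriving Show

-- Vector introduction: build a vector from eight scalar components.
intro :: [Int32] -> Vec8
intro xs
  | length xs == 8 = Vec8 xs
  | otherwise      = error "vector introduction needs exactly eight components"

-- Select: project the i-th component, with i in 0..7.
select :: Vec8 -> Int -> Int32
select (Vec8 xs) i = xs !! i

-- Pointwise lifting of a scalar primitive operation to vectors.
lift2 :: (Int32 -> Int32 -> Int32) -> Vec8 -> Vec8 -> Vec8
lift2 op (Vec8 xs) (Vec8 ys) = Vec8 (zipWith op xs ys)

main :: IO ()
main = do
  let x = intro [0 .. 7]
      y = intro (replicate 8 10)
  print (lift2 (+) x y)   -- Vec8 [10,11,12,13,14,15,16,17]
  print (select x 3)      -- 3
```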
s s s chine, we provide a more uniform SIMD instruction set that may The instruction z = op(x0, . . . , xn) corresponds to a family be a superset of the machine-supported instructions. Our small run- of instructions operating on primitive 32-bit integers, indexed by time provides a library of impedance-matching macros that imple- the operation op (e.g. addition, subtraction, negation etc.). Primitive ments the uniform SIMD instructions in terms of SIMD intrisics operations must be fully saturated—that is, the number of operands understood by the underlying C compiler. The C compiler is then provided must equal the arity of the given operation. We lift the responsible for low-level optimizations including register alloca- primitive operations on integers to pointwise primitive operations v v v tion and instruction selection. on vectors with the z = hopi(x0 , . . . , xn) instruction, which In the next section we define a vector core language with which applies op pointwise across the vectors. That is, semantically zv = v v to discuss the vectorization process. This core language is clean, hopi(x0 , . . . , xn) behaves identically to a sequence of instructions uniform, and is almost exactly faithful to (a small subset of) our performing the scalar version of op to each of the corresponding actual intermediate representation. elements of the argument vectors, and then packaging up the result as the final result vector. 3. Vector Core language Array objects in the vector language are immutable sequences of 32-bit data allocated in the heap. New uninitialized array objects We define here a small core language in which to present the vec- are allocated via the zs = new[xs] instruction which allocates a torization optimization. We restrict ourselves to an intra-procedural new array of length xs (where xs must contain a 32-bit integer) and fragment only, since for our purposes here we are not interested in binds zs to a pointer to the array. The initial contents of the array inter-procedural control flow. Programs in the language are given as are undefined, and every element must be initialized before being control-flow graphs written in a variant of static single-assignment read. It is an unchecked invariant of the language that every element (SSA) form. To keep things concrete, our language is based on a 32- of the array is initialized before it is read, and that every element bit machine with vector registers that are 256-bits wide; handling of the array is initialized at most once. It is in this sense that the other register and vector widths is a straightforward extension. arrays are immutable—the write operation on arrays serves only as an initializing write but does not provide for mutation. Ordinary 3.1 Objects scalar element initialization is performed via the xs[ys] ← zs The primitive objects of the vector core language consist of heap operation which initializes the element of array xs at offset ys to objects and register values. Heap objects are variable-length im- zs. Scalar reads from the array are performed using the zs = xs[ys] instruction, which reads a single element from the array xs at offset hxvi[ys] ← zv instruction sets the element of the ith array in the ys and binds zs to it. vector xv at offset ys to the value in the ith position of zv. 
In order to support SIMD vectorization, the vector language The final mode of addressing for array loads and stores covers also provides a general set of vector subscript and initialization the case in which a vector of arrays is indexed pointwise using operations permitting vectors of elements to be read from and a vector of offsets. The zv = hxvi[hyvi] instruction produces a written to arrays (or vectors of arrays) in a single instruction. vector zv such that the ith element is produced by indexing the The set of instructions we provide are more general than what is ith array from the vector of arrays xv using the ith offset from the supported by most existing machine architectures. This is done vector of offsets yv. Similarly, the hxvi[hyvi] ← zv instruction intentionally since it is often beneficial to vectorize a loop even sets the element of the ith array in the vector xv at the offset given if some instructions must be emulated via translation to scalar by the ith element of the vector of offsets yv to the value in the ith operations. Moreover, we believe it is useful to present the language position of zv. in its most general form—it is always possible to restrict the set of generated instructions as desired. We defer discussion of the situations in which these various styles of loads and stores arise 3.4 Basic Blocks to subsequent sections on the actual vectorization operation. Instructions are grouped into basic blocks consisting of a labeled The simplest vector array loads and stores are the operations entry, a sequence of instructions, and a transfer which terminates that read multiple elements directly from a single array into a the block and either transfers control to another block or terminates vector variable, or initialize multiple elements of a single array the program. We assume an infinite supply of distinct program v with elements contained in a vector variable. The instruction z = labels ranged over by metavariable L used to label basic blocks that s v s v x [hy i] takes a single array x and a vector of offsets y and binds we distinguish by numbering distinct labels with distinct integers v s z to a vector containing the elements of x from the given offsets. (as L0, L1, etc.). v s v That is, semantically we may treat the instruction z = x [hy i] as Basic block headers consist of the label of the block and a list equivalent to the following list of instructions: of entry variables which are defined on entry to the block. The entry variables are given their defining values by the transfer which ys = yv!0 0 transfers control to the block. Block entry variables may have scalar zs = xs[ys] 0 0 or vector kind. ... Transfers terminate basic blocks, and serve to transfer control ys = yv!7 7 within a program. The goto L(xk, . . . , xk ) transfer shifts control zs = xs[ys] 0 n 7 7 to the block labeled with L. Well-formed programs must have zv = hzs, . . . , zsi 0 7 matching arities on transfers and the block headers which they This instruction is commonly known as a “gather” instruction. The target, that is, the block labeled with L must have entry variables s v v k k k k initializing store x [hy i] ← z , commonly known as the “scatter” z0 , . . . , zn. The transfer of control effected by goto L(x0 , . . . , xn) v s k k instruction, writes the elements of the vector z to the fields of x also defines the entry variables z0 , . . . , zn to be the values of v k k at offsets given by the vector y . x0 , . . . , xn. 
There are several interesting special cases of scatters and gathers The conditional control transfer that are worth further discussion. A common idiom that arises in k k k k SIMD vectorization consists of array loads and stores such that the if (cmp) goto L0(x0 , . . . , xn) else goto L1(y0 , . . . , ym) index vector is constructed by adding multiples of a constant stride k k to a scalar index. We choose to introduce this idiom as derived behaves as goto L0(x0 , . . . , xn) if the comparison holds, and as k k syntax as follows, where i is a constant valued stride: goto L1(y0 , . . . , ym) if the comparison does not hold. Note that the targets of the two different arms of the conditional are not def zv = xs[hys:ii] = zv = xs[hyvi] required to be distinct. xs[hys:ii] ← zv def= xs[hyvi] ← zv The halt transfer terminates the program. We do not consider  v s s inter-procedural control-flow in this core language since it is not yl = hy , . . . , y i  v relevant to the vectorization optimization—however it is straight- where yb = h0,..., 7 ∗ ii v v v forward to add additional transfers to account for this if desired.  y = h+i(yl , yb ) Note that the stride multiplications (e.g. 7 ∗ i) are meta-level oper- 3.5 Programs ations. We also provide derived syntax for the further special case in which the stride is known to be one: Programs in the core language consist of a single control flow graph. A control flow graph consists of a designated entry label def zv = xs[hys:i] = zv = xs[hys:1i] L and set of associated basic blocks, each with a distinct label, xs[hys:i] ← zv def= xs[hys:1i] ← zv one of which (the entry block) is labeled with the entry label L. The entry block must expect no entry variables. Program execution For the purposes of simplicity of the vector language, we have left proceeds by starting at the designated entry block and executing these as derived forms. However, it is worth noting that many archi- block instructions and transfers until a halt instruction is reached tectures provide support for these idioms directly (and in fact may (if ever). Well-formed programs must not contain transfers whose provide support only for these idioms) and hence from a pragmatic target label is not in the control-flow graph, and every block in the standpoint it can be important to provide primitive support for these control-flow graph must be labeled with a distinct label. As usual addressing modes. with static single-assignment form, variables may only be defined A second mode of addressing for array loads and stores covers once, must be defined before their first use, and have scope given the case where a vector of arrays is indexed pointwise by a single by the dominator tree of the control-flow graph rooted at the entry fixed scalar offset. The zv = hxvi[ys] instruction produces a vector label. The style of static single-assignment form used in this core zv such that the ith element of zv is produced by indexing the ith language is similar to that used in compilers such as the MLton array from the vector of arrays xv, using offset ys. Similarly, the compiler [16]. 4. Vectorization • The characteristic function of an induction variable defined by xs = +(ys, zs) is p∗#+d+c where zs is a constant (defined The SIMD-vectorization optimization attempts to rewrite a loop to s s a new loop that executes multiple iterations of the original loop by z = c), and p ∗ # + d is the characteristic function of y . • on each iteration using SIMD instructions and registers. 
While the The characteristic function of an induction variable defined by xs = ∗(ys, zs) is c∗p∗#+c∗d where zs is a constant (defined core idea is quite simple, there are a number of issues that must s s be addressed in order to preserve the semantics of the original by z = c), and p ∗ # + d is the characteristic function of y . program. In the first part of this section we introduce the key issues A subtle but important point is that in these computations the com- using a series of examples. Before doing so, we first review some piler must ensure that the overflow semantics of the underlying nu- preliminary concepts. meric type are respected to avoid introducing or eliminating nu- meric overflow. It is possible to extend the definition of charac- teristic functions to allow loop invariants to take the place of the 4.1 Loops constants in the above definition at the expense of representing the We have informally described our vectorization efforts as focusing characteristic function symbolically, thereby somewhat complicat- on loops. Loops tend to account for a high proportion of executed ing code generation and overflow avoidance as discussed below. instructions in programs, and as such are natural targets for intro- 4.3 Vectorization by example ducing SIMD-vector code. We define loops using the standard no- tion of a natural loop, merging loops with shared headers in the Consider the following simple program which computes the point- s s s usual way to ensure that the nesting structure of loops form a tree. wise sum of two arrays b and c , each of length l . We do not assume a reducible control-flow graph (since in general L0(): L1(is): (1) the translation of mutually-recursive functions into loops may in- as = new[ls] xs = bs[is] troduce irreducible control flow). We focus our vectorization efforts s s s s i0 = 0 y = c [i ] more specifically on the innermost loops that make up the leaves of 1 s s s s goto L (i0) z = +(x , y ) the loop forest of the control-flow graph. For simplicity, we only as[is] ← zs target bottom-test loops. A separate optimization pass inverts top- s s i1 = +(i , 1) test loops to form bottom-test loops, both to enable vectorization s s 1 s if (i1 < l ) goto L (i1) and loop-invariant code motion. exit s else goto L (i1) Ignoring the crucial issue of whether or not it is in fact valid 4.2 Induction variables to vectorize this loop, there are a number of issues that need to be Our notion of induction variable is standard, but we review it in addressed in order to produce a vector version of this program. some detail here since induction variables play a critical role in our algorithm. We define the base induction variables of a loop 4.3.1 The vector loop to be the entry variables of the entry block of the loop such that The core of the vectorization optimization is to produce an inner the definition of that variable provided on the loop back edge is loop, each iteration of which executes multiple iterations of the produced by adding a constant to the variable itself, and where the original loop. For Example 1, this corresponds to generating code initial value passed into the loop is a compile-time constant. The to vector-load eight elements of the arrays b and c at appropriate step of a base induction variable is the constant added each time offsets, perform a pointwise vector addition of the loaded elements, around the loop. 
The full set of induction variables is the least set and to perform a vector write of the eight result elements into the of variables satisfying: new array a. The comparison which controls the exit test at the • A base induction variable is an induction variable. bottom of the loop must also be adjusted appropriately to account • A variable defined by xs = +(ys, zs), where ys is an induction for the fact that multiple iterations are being performed. Finally, on variable and zs is a constant (defined by zs = c), is an induction the exit edge of the loop, the use of the last value of the induction s variable. variable i1 must also be accounted for. • A variable defined by xs = ∗(ys, zs), where ys is an induction To motivate the general treatment of the vectorization optimiza- variable and zs is a constant, is an induction variable. tion, we first consider the individual instructions of Example 1, be- Our implementation also deals with the symmetric versions of the ginning with the load instruction xs = bs[is]. Executing this in- last two cases, and also considers operations such as subtraction struction during iteration j of the loop loads a single element from and negation. Arbitrary initial values can be allowed for base in- the array bs at an offset given by the value of is. In the vector duction variables and loop invariants can be allowed in place of loop, we wish to perform iterations j, j + 1, . . . , j + 7 simulta- constants at the cost of some additional complexity as discussed neously. The vector version of this instruction then must load eight below, but we do not currently support this. elements from bs at offsets given by the values of is at iteration Induction variables can be characterized as affine functions of j, j + 1, . . . , j + 7. A natural first attempt at vectorization might be the iteration count of the loop. The characteristic function for an to simply assume that vector versions of all variables are available, induction variable is is an equation of the form is = p∗#+d where and to then generate the corresponding vector instruction using the p is a constant (the step of the induction variable), and d is also a vector variables. In this case, if we assume the existence of vari- constant (the initial value of the induction variable). The symbol # ables bv and iv containing the values of bs and is for iterations stands for the iteration number: the value of the induction variable j, . . . , j + 7 then the vector instruction xv = hbvi[hivi] puts into is for iteration j can be computed directly from its characteristic xv the appropriate values of xs for iterations j, . . . , j + 7. function simply by replacing # with j. While this approach is correct (and in the most general case is The characteristic function for an induction variable is derived sometimes necessary) a key insight in SIMD vectorization is that as follows: it is often possible to do much better by using knowledge of loop • The characteristic function of a base induction variable is p ∗ invariants and induction variables. In this case, the variable bs is # + d where p is the step of the base induction variable as loop invariant, and consequently bv will consist simply of eight defined above, and d is the constant initial value of the induction copies of the same value. Semantically then, we can replace our variable. use of the general vector load instruction with the simpler gather instruction xv = bs[hivi], which does not require us to produce that cannot be computed with the vector loop. 
A preliminary test the vector version of bs and which may also have a more efficient is inserted before the loop choosing either to target the scalar loop implementation. More importantly, the variable is here is a base (if less than eight total iterations will be performed) or the vector induction variable with step 1. It is consequently possible to predict loop (otherwise). Similarly, after the vector loop exits, a check is the value of is at each of the next eight iterations: that is, the value done for the presence of extra iterations, which if required are again of iv will always be his, is+1, . . . , is+7i. The vector load can then performed using the scalar loop. be seen to be a load of eight contiguous elements from bs, for which Using these transformations, we arrive at the final desired vec- we have provided the derived instruction xv = bs[his:i], which torized program. performs a contiguous load of eight elements, as desired. Since this form of load is generally well supported in vector hardware, L0(): Lcheck(): generating this specialized form in preference to the general form as = new[ls] s s s 1 s is highly desirable. i0 = 0 if (i3 < l ) goto L (i3) s 2 s exit s The second load instruction can then be vectorized in an exactly if (7 < l ) goto L (i0) else goto L (i3) v v 1 s analogous fashion, resulting in two vector variables x and y . The else goto L (i0) addition instruction zs = +(xs, ys) can then be computed using vector addition on the vector variables, as zv = h+i(xv, yv). To 2 s 1 s L (i2): L (i ): produce a vector version of the instruction as[is] ← zs which v s s s s s x = b [hi2:i] x = b [i ] writes the computed value to the new array we follow a similar yv = cs[his:i] ys = cs[is] s s v 2 argument to that above for the loads to produce the a [hi :i] ← z zv = h+i(xv, yv) zs = +(xs, ys) v instruction which performs a contiguous write of the values in z as[his:i] ← zv as[is] ← zs s s 2 into the array a starting at offset i . is = +(is, 8) is = +(is, 1) s s 3 2 1 The last instruction of the loop, i1 = +(i , 1), is the induction s s s s 1 s i4 = +(i2, 15) if (i1 < l ) goto L (i1) variable computation, and as such requires special treatment. The s s 2 s exit s if (i4 < l ) goto L (i3) else goto L (i1) result of this instruction is used for three purposes: in the computa- else goto Lcheck() tion of the test to decide whether to proceed with another iteration, (2) to provide a new value for the induction variable in the next itera- tion, and to provide the value passed out on the exit edge. For the latter two uses, the required value is easy to see. Each 4.4 Automatic vectorization iteration of the vector loop corresponds to eight iterations of the scalar loop, and we require on entry to the loop that the induction The reasoning of the previous section leads us to the desired result, variable is contain the value appropriate for the first of the eight but seems completely ad hoc and unsuitable to implementation in iterations. Given the value of is for iteration j then, the back edge a compiler. Fortunately, it is possible to derive a simple composi- s tional transformation which accomplishes the same thing in a gen- of the loop requires the value of i1 for iteration j + 7, which is one plus the value of is on iteration j +7. Similarly, the exit edge of the eral manner. s The essential design principles that guide the design of the op- loop requires the value of i1 at iteration j+7. 
Since the computation of the value of is for each iteration depends on the value of itself timization are those of orthogonality and compositionality. As we in the previous iteration, we might imagine ourselves to be stuck. will see, while the local transformation of instructions produces However, the nature of induction variables as affine transformations specialized code depending on the particulars of the instruction, s each instruction is nonetheless translated in a compositional fash- means that we can compute the value of i1 at iteration j+7 directly: each iteration adds 1 to the original value, hence the value of is at ion in the sense that the translation does not depend on how the re- s s sult is used. A consequence of this choice is that the compositional iteration j + 7 is given by i + 7, and the value of i1 at iteration j + 7 is is + 7 + 1. Hence we can compute the appropriate value translation may produce numerous extraneous, redundant, or loop- s s invariant instructions, which could be avoided in a less composi- of i1 for both the back and the exit edge as i + 8. s tional approach. The principle that we follow is that the elimina- The remaining use of the induction variable i1 is to compute the loop exit test. The key insight is that it is no longer sufficient to tion of extraneous, redundant, and loop-invariant instructions is an check that there is at least one iteration remaining: in order to exe- orthogonal issue, which is already addressed by dead-code elimina- cute the vector loop again, we must have at least eight remaining it- tion, common sub-expression elimination, and loop-invariant code erations. Upon completion of a vector iteration computing scalar it- motion respectively. So long as the vectorization transformation is erations j, . . . , j +7, the question that must be answered is whether defined appropriately, we can safely rely on these optimizations to the scalar loop would always execute iterations j + 8, . . . , j + 15. sweep away the chaff, leaving behind only the essential bits. Because of the monotonic nature of the induction variable compu- The core of the vector transformation works by translating each tation this in turn can be reduced to the question of whether or not scalar instruction into a sequence of instructions which compute s s s three separate values: the scalar value, the vector value, and the the value of i1 at iteration j + 14 is less than l . Since i1 is an in- duction variable which is incremented by 1 on each iteration, this last value. For a vector loop computing iterations j, . . . , j + 7 of a corresponds to changing the exit test to ask whether is + 15 < ls. scalar loop, the scalar value of a variable x is the value computed at iteration j, the last value is the value computed at iteration j+7, and the vector value is the vector containing the computed values for all 4.3.2 Entry and exit code of the iterations. While it is clear that there is always a degenerate Using the ideas from previous section, it is straightforward to write implementation that simply computes the vector variable and then a vector version of the core loop. In addition to the issues touched computes the scalar and last values by projecting out the first and on above, vectorizing this example also requires accounting for the last iteration value, we often compute the scalar and last values possibility that there may be insufficient iterations to allow use of separately. 
This is crucial for a number of reasons: in the first place, the vector loop, and also for the possibility that there may be extra we may be able to compute the vector variable more efficiently as iterations left over if the exit test succeeds with fewer than eight a direct function of the scalar value; but more importantly, it will but more than zero iterations left. To account for these we keep frequently be the case that the vector variable (or similarly the last around the original scalar loop, and use it to perform any iterations value) will be unused. It is therefore important not to introduce a data-dependence of the scalar (or last) value on the vector value lest 4.4.1 Entry tests we keep an expensive-to-compute vector value live unnecessarily. The job of the entry test is to determine if there are sufficient For clarity in the translation, we define meta-operators for pro- iterations to execute the vector loop, and if not to go straight to ducing fresh variables from existing scalar variables. For a variable the original scalar loop. It also needs to set up some initial values xs, we take fvxs to be a fresh scalar variable, which by convention s for use in the vector loop. In particular, any variables used in the will contain the scalar value of x in the vector loop. Similarly, we loop but not defined in the loop must have scalar, last value, and take lvxs to be a fresh scalar variable, which by convention will s vv v vector versions for use by the instructions in the vector loop. contain the last value of x ; and we take x to be a fresh vector s s First, for each variable y used in the loop but not defined in it variable, which by convention will contain the vector value of x . (and not an entry variable), we add the following instructions to the Induction variables require a small amount of additional mech- end of the instructions of the preheader: anism. For every base induction variable is with step s, we say that the affine basis of is is the vector h0 ∗ s, 1 ∗ s, . . . , 7 ∗ si (the multi- fvys = ys plications here are at the meta-level). Note that for a base induction lvys = ys variable is, adding is to each of the elements of its affine basis vvyv = hys, . . . , ysi gives the values of is for each of the next eight iterations. Other Next, we must determine whether the vector loop should be (non-base) induction variables also have affine bases, which are entered at all. The desired property of the entry test is that the computed by code emitted as part of the transformation described vector loop should only be entered if there are at least 8 iterations to below, starting from the bases of the base induction variables. We be performed. The monotonic nature of induction variables means use the notation bviv to denote a fresh variable which by convention that this question is equivalent to asking whether at least one more holds the affine basis for the scalar induction variable is. iteration remains after performing the first 7 iterations: that is, we It is occasionally necessary to introduce a promoted version of wish to know the result of the exit test of the original loop on the 7th a variable: that is, for a variable xs, to produce a vector variable iteration. If the characteristic function for is is s ∗ # + d, then the containing the vector hxs, . . . , xsi. By convention, we use pvxv to value of is on the 7th iteration can be obtained by replacing # with denote a fresh variable containing the promoted version of xs. 6 and calculating the result statically. 
The appropriate entry test is In order to simplify the algorithm, it is convenient to apply it then of the form iis < ls where iis is defined as iis = s ∗ 6 + d. only to loops for which all variables defined in the loop and used The value s ∗ 6 + c is a compiler computed constant, and the outside of the loop (that is, live out from the loop) are explicitly compiler can statically determine that this does not overflow. If passed as parameters on the exit edge. This is in no way necessary, we generalize the notion of characteristic function to include loop- but avoids the need to rename variables throughout the rest of the invariants in addition to constants this test may require generating program to preserve the SSA property. This does not restrict the multiplication and addition instructions, and code must also be generality of the algorithm since it is always possible to transform emitted to fall back to the scalar code if there is a possibility that loops into this form. the additional arithmetic might overflow. The algorithm applies to single block loops of the form

s s 4.4.2 Vector loop L(i0, . . . , in): I0 To generate code for the main vector loop, we perform three oper- ... ations. We first generate the last value and vector versions of base Im induction variables, using their steps to produce these directly from s s s s if (i < l ) goto L(i01, . . . , in1) the scalar values. For each of the original instructions of the loop exit s s else goto L (x0, . . . , xp) we then generate instructions to compute the scalar, vector, and last value versions of the instruction. Finally, we adjust the exit test. s s s s where the variables i0, . . . , in are base induction variables, i is an For each entry variable ij for 0 ≤ j ≤ n, which is a base induc- induction variable with a positive step (the case where the step is tion variable of step sj , we produce the vector, basis, promoted, and s s zero or negative make no sense), and the variables x0, . . . , xp are last value versions of the induction variable as follows. The basis bv v pv v the live out variables of the loop. All the instructions I0,..., Im variable and promoted variables ij and ij are defined as must be scalar instructions (and thus define scalar variables), and bv v cannot create new arrays. It is straightforward to support exit tests ij = h0 ∗ sj ,..., 7 ∗ sj i s s s pv v fv s fv s of the form i ≤ l as well. In all cases, the variable l must be ij = h ij ,..., ij i defined outside of the loop (and hence be loop invariant). s 2 The vector version of ij can then be defined directly as follows and We assume the loop has a preheader, that is a block that ends lv s s s the last value ij can be defined directly using the step. with goto L(ii0, . . . , iin) and that only this block and the loop vv v bv v pv v transfer to L. ij = h+i( ij , ij ) lv s fv s As suggested by the ad hoc development from the previous ij = +( ij , 7 ∗ sj ) section, the vectorization transformation must produce three new pieces of code: an entry test that decides whether there are sufficient This initial step defines vector and last value variables for each iterations to enter the vector loop; the core vector loop itself that base induction variable of the loop, as well as basis and promoted performs the SIMD iterations; and a cleanup test that checks if variables. The translation of each instruction then proceeds com- there are any iterations remaining after the SIMD loop has finished, positionally, with the translated version of each instruction making which must be processed by the original scalar loop which remains use of the vector and last value variables produced by the trans- in the program unchanged. By convention we take Lvec and Lcheck to lation of all of the previous instructions. In addition, each induc- be fresh labels for the new vector loop and cleanup test respectively. tion variable makes use of the previously defined basis variables to produce its own vector, last value, basis, and promoted variables directly. The complete definition of this translation is given in Fig- 2 Assuming preheaders does not restrict our algorithm - transforming a pro- ure 2. gram into an equivalent one where all loops have preheaders is a standard Adjusting the exit test of the loop is again straightforward using technique. the step information. The desired new exit test must check whether Instruction Vector translation Side conditions fvzs = c zs = c vvzv = hfvzs,..., fvzsi lvzs = fvzs fvzs = +(fvxs, fvys) bvzv = bvxv zs = +(xs, ys) pvzv = hfvzs,..., fvzsi If zs and xs are induction variables. 
vvzv = h+i(bvzv, pvzv) lvzs = +(lvxs, fvys) fvzs = ∗(fvxs, fvys) bvzv = h∗i(bvxv, vvyv) zs = ∗(xs, ys) pvzv = hfvzs,..., fvzsi If zs and xs are induction variables. vvzv = h+i(bvzv, pvzv) lvzs = ∗(lvxs, fvys) fv s fv s fv s z = op( x0,..., xn) s s s vv v vv v vv v s z = op(x0, . . . , xn) z = hopi( x0 ,..., xn) If z is not an induction variable. lv s lv s lv s z = op( x0,..., xn) fvzs = fvxs[fvys] zs = xs[ys] vvzv = fvxs[hfvys:i] If xs is loop invariant, and ys is an induction variable with step 1. lvzs = fvxs[lvys] fvzs = fvxs[fvys] zs = xs[ys] vvzv = fvxs[hfvys:ii] If xs is loop invariant, and ys is affine with step i lvzs = fvxs[lvys] fvzs = fvxs[fvys] zs = xs[ys] vvzv = fvxs[hvvyvi] If xs is loop invariant, and ys is not an induction variable. lvzs = fvxs[lvys] fvzs = fvxs[fvys] zs = xs[ys] vvzv = hvvxvi[fvys] If xs is not loop invariant, and ys is loop invariant. lvzs = lvxs[fvys] fvzs = fvxs[fvys] zs = xs[ys] vvzv = hvvxvi[hvvyvi] If both xs and ys are not loop invariant. lvzs = lvxs[lvys] xs[ys] ← zs fvxs[hfvys:i] ← vvzv If xs is loop invariant, and ys is an induction variable with step 1. xs[ys] ← zs fvxs[hfvys:ii] ← vvzv If xs is loop invariant, and ys is an induction variable with step i. xs[ys] ← zs fvxs[hvvyvi] ← vvzv If xs is loop invariant, and ys is not an induction variable. xs[ys] ← zs hvvxvi[fvys] ← vvzv If xs is not loop invariant, and ys is loop invariant. xs[ys] ← zs hvvxvi[hvvyvi] ← vvzv If both xs and ys are not loop invariant.

Figure 2. Vector Block Instruction Translation is < ls would succeed for at least eight more iterations. Since is 4.4.3 On the subject of overflow is an induction variable, letting s be its step, it increases by s on For the most part, the vectorization transformation here is careful each iteration. Thus the value of is on the next eight iterations will lv s lv s s to only compute the same values computed in the original program. be i + 0 ∗ s,..., i + 7 ∗ s. These will all be less than l if the Generating entry and exit tests violate this property. As mentioned last of them is less than ls (recall that ls is loop invariant). Thus the lv s s in Section 4.4.1, the compiler can check statically that overflow desired new exit condition is i + 7 ∗ s < l . does not occur when computing constants for the initial entry test. Putting this all together the vector loop will be as follows, where 0s For the exit test, it is sufficient to check before entry to the vector i is a fresh variable: loop that ls < MI − 7 ∗ s where MI is the largest representable integer and s is the step of is. For signed types, this check must vec fv s fv s L ( i0,..., in): be adjusted in the obvious way when loop bounds or steps are Instructions for base induction variables negative. In all cases, if the checks fail then the original scalar loop Transformed I0,..., Im is used to perform the calculation using the original code. i0s = +(lvis, 7 ∗ s) 0s s vec lv s lv s if (i < l ) goto L ( i01,..., in1) 4.4.4 Cleanup check check else goto L () The cleanup check is responsible for determining if scalar iterations remain to be performed after the vector loop has exited. This is done Note that the back edge targets the vector loop, the exit edge simply by performing the original loop exit test. Scalar iterations targets the cleanup check, and the last values of the base induction remain to be performed exactly when lvis < ls, where lvis is the last variables are passed on the back edge. value for is computed in the vector loop (since this represents the Entry test Vector loop Cleanup check Optimized result L0(): L2(fvis): Lcheck(): L0(): s s bv v s fv s s s a = new[l ] i = h0 ∗ 1,..., 7 ∗ 1i i2 = +( i , 7 ∗ 1) a = new[l ] s pv v fv s fv s s s s i0 = 0 i = h i ,..., i i i3 = +(i2, 1) i0 = 0 fv s s vv v bv v pv v s s 1 s s a = a i = h+i( i , i ) if (i3 < l ) goto L (i3) ii = 7 s s 2 s vv v s s lv s fv s exit s (ii < l ) (i ) a = ha , . . . , a i i = +( i , 7 ∗ 1) else goto L (i3) if goto L 0 lv s s 1 s a = a fvxs = fvbs[fvis] else goto L (i0) fvbs = bs vv v fv s fv s vv v s s x = b [h i :i] 2 fv s b = hb , . . . , b i lv s fv s lv s L ( i ): lv s s x = b [ i ] lv s fv s b = b fv s fv s fv s i = +( i , 7) fv s s y = c [ i ] vv v s fv s c = c vv v fv s fv s x = b [h i :i] vv v s s y = c [h i :i] vv v s fv s c = hc , . . . , c i lv s fv s lv s y = c [h i :i] lv s s y = c [ i ] vv v vv v vv v c = c fv s fv s fv s z = h+i( x , y ) fv s s z = +( x , y ) s fv s vv v l = l vv v vv v vv v a [h i :i] ← z vv v s s z = h+i( x , y ) lv s lv s l = hl , . . . 
, l i lv s lv s lv s i1 = +( i , 1) lv s s z = +( x , y ) 0s lv s l = l fv s fv s vv v i1 = +( i1, 7) s a [h i :i] ← z 0s s 2 lv s ii = 6 ∗ 1 + 1 fv s fv s if (i1 < l ) goto L ( i1) s s 2 s i1 = +( i , 1) check if (ii < l ) goto L (i0) bv v bv v else goto L () 1 s i1 = i else goto L (i0) pv v fv s fv s i1 = h i1,..., i1i check vv v bv v pv v L (): i1 = h+i( i1 , i1 ) s fv s lv s lv s i2 = +( i , 7) i1 = +( i , 1) s s 0s lv s i3 = +(i2, 1) i1 = +( i1, 7 ∗ 1) s s 1 s 0s s 2 lv s if (i3 < l ) goto L (i3) if (i1 < l ) goto L ( i1) exit s else goto L (i3) else goto Lcheck()

Figure 3. Vectorized version of Example 1 value of is from the last completed iteration). Since the vector loop we generate the instructions for the base induction variable, trans- dominates the cleanup check, all its entry variables and variables it formed instructions of the loop, and adjusted exit condition to form defines are in scope, so the cleanup check could be formed as: the vector loop. The result of this appears under the header “Vec- tor loop”. Finally we form the cleanup check by regenerating the Lcheck(): last-value computations and using the original loop-exit condition. lv s s lv s lv s if ( i < l ) goto L( i01,..., in1) For simplicity we just show enough instructions to compute the last exit lv s lv s s else goto L ( x0,..., xp) value of i1, the only needed last value. This block appears under the Note that the last values of the base induction variables are passed header “Cleanup check”. to the scalar loop and the last values of the live-out variables This code, as expected, is terribly verbose. However, by apply- s s ing copy propagation followed by dead-code elimination, the ex- (x0, . . . , xp) are passed to the exit label. However, instead of using the last values computed in the vec- traneous parts disappear, yielding the code shown under the header tor loop, it is better to recompute them in order to avoid introducing “Optimized result”. By applying common sub-expression elimina- tion this code can be further improved by eliminating the calcula- data dependencies that may keep instructions live in the loop which s lv s s lv s could otherwise be eliminated. This turns out to be straightforward: tion of i2 in favor of i , and i3 in favor of i1. Finally, simplifying the portion of the translation in Figure 2 that computes last values the chained arithmetic expressions yields the same vectorized loop can simply be repeated. This results in a complete calculation of as derived by the initial ad hoc vectorization shown in Section 4.3. all of the last values, including lvis. While most of these instruc- tions will turn out to be unnecessary, a single pass of dead code 4.6 Reductions elimination is sufficient to eliminate them. The SIMD vectorization process defined so far primarily applies to loops which serve to create new arrays. An additional idiom that is 4.5 Transformed example important to handle in vectorization is that of loops which serve (in The actual process of producing SIMD vector versions of loops total or in part) to compute a reduction using a binary operator. That seems at first quite ad hoc and difficult to do cleanly, but turns out is, instead of, or in addition to, initializing a new array, each itera- to be amenable to a simple and straightforward translation which tion of the loop accumulates a value into an accumulator parameter systematically produces vector versions of scalar instructions, rely- as a function of the value from the last iteration. For example, the ing on orthogonal optimizations to clean up unnecessary generated innermost dot product of a matrix multiply routine generally takes code. To illustrate the results of this process, Figure 3 shows the the form of a reduction. results of applying the algorithm to Example 1. First note that the Adding support for reductions to the vectorization transforma- base induction variables are just is of step 1, and the only other tion is not difficult. In the same manner that we identify certain s s s s induction variable is i1 also of step 1. 
4.6 Reductions

The SIMD vectorization process defined so far primarily applies to loops which serve to create new arrays. An additional idiom that is important to handle in vectorization is that of loops which serve (in total or in part) to compute a reduction using a binary operator. That is, instead of, or in addition to, initializing a new array, each iteration of the loop accumulates a value into an accumulator parameter as a function of the value from the last iteration. For example, the innermost dot product of a matrix multiply routine generally takes the form of a reduction.

Adding support for reductions to the vectorization transformation is not difficult. In the same manner that we identify certain variables as induction variables and treat them specially, we identify variables that fit the pattern of reductions and add additional cases to the vectorization transformation to handle them differently. Specifically, we say that a variable x^s is a reduction variable if it is a parameter to the loop that is not an induction variable, and which is defined on the back edge of the loop by the application of an associative, commutative binary operator applied to x^s and some distinct variable. All uses of x^s and its associated back edge variable must be in reductions. For example, the loop:

    L0():
      x0^s = 1
      i0^s = 0
      goto L1(i0^s, x0^s)

    L1(i^s, x^s):                                  (3)
      y^s = c^s[i^s]
      x1^s = +(x^s, y^s)
      i1^s = +(i^s, 1)
      if (i1^s < l^s) goto L1(i1^s, x1^s)
      else goto Lexit(x1^s)

computes one more than the sum of the elements of the array c^s via a reduction (assuming that l^s is the length of c^s) and passes out the result on the exit edge as x1^s. We say that x^s and x1^s are reduction variables in this loop, with initial value 1.
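At the source level, a loop of this shape is just a left fold. A rough Haskell rendering of Example 3 (our illustration; the names example3 and c are ours, and the compiler sees the loop form above rather than this code) is:

    import qualified Data.Vector.Unboxed as U

    -- One more than the sum of the elements of c: the accumulator plays
    -- the role of the reduction variable x^s, with initial value 1.
    example3 :: U.Vector Float -> Float
    example3 c = U.foldl (+) 1 c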
The general strategy for SIMD vectorization of reductions is to pass a vector variable around the loop, each element of which contains a partial reduction. Implicitly, we re-associate the reduction so that element i of the vector variable contains the partial reduction of all of the values of the iterations which are congruent to i modulo the vector width (8 here). After exiting the vector loop, performing a horizontal reduction on the vector variable itself results in the full reduction of all the iterations (albeit performed in a different order, hence the requirement of associativity and commutativity).

More concretely, to vectorize a reduction variable x^s, we simply promote x^s to a vector variable x^v. In the preheader of the loop, we lift the definition of the initial value for the reduction variable to contain the original initial value in element 0, and the unit for the reduction operation in all other elements. In Example 3 this corresponds to defining the initial value as x0^v = ⟨1, 0, ..., 0⟩ (since 0 is unit for addition). The reduction operation in the loop is lifted to perform the same vector operation pointwise, hence x1^v = ⟨+⟩(x^v, y^v) for the example. Finally, in the cleanup check block, the lifted reduction variable x1^v is itself reduced to produce the final scalar result of the portion of the reduction performed by the vector loop (which is then passed in as the initial value to the scalar loop if subsequent iterations remain to be computed). This final reduction may be performed using scalar operations. It is possible to maintain last value and scalar values for reduction variables within the vector loop as well as the vector value, but doing so requires performing a reduction on the vector value on each iteration, which is potentially expensive enough to make SIMD vectorization a poor idea. Consequently, we simply choose to reject SIMD vectorization of loops which would require the calculation of scalar or last values for reduction variables.
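To make the implicit re-association concrete, the following Haskell sketch (ours; vectorWidth, partialSums, and sumByLanes are illustrative names only) computes a sum the way the vectorized reduction does: one partial sum per vector lane, grouping the iterations congruent to that lane index modulo the vector width, followed by a final horizontal reduction.

    import qualified Data.Vector.Unboxed as U

    vectorWidth :: Int
    vectorWidth = 8

    -- One partial reduction per lane: lane i sums the elements whose
    -- indices are congruent to i modulo the vector width, mirroring the
    -- vector variable carried around the vectorized loop.
    partialSums :: U.Vector Float -> U.Vector Float
    partialSums xs =
      U.generate vectorWidth $ \lane ->
        U.sum (U.ifilter (\ix _ -> ix `mod` vectorWidth == lane) xs)

    -- The horizontal reduction performed after the vector loop exits.
    sumByLanes :: U.Vector Float -> Float
    sumByLanes = U.sum . partialSums

For an exact type the result is identical to a plain left-to-right sum; for floating point it can differ in the low bits, which is precisely why associativity (and, because of the strided grouping, commutativity) of the operator is required.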
Unfortunately, the associativity requirement for binary operations used in reductions can be problematic since most floating point operations (e.g. addition and multiplication) are not fully associative. In practice, reductions over floating point operators are important idioms. By default, HRC does not re-associate floating point operations, and hence refuses to vectorize loops which perform reductions over floating point data. We provide a flag which permits the compiler to re-associate floating point operations only for SIMD vectorization (but not other optimizations), which allows such loops to be vectorized at the expense of occasionally producing slightly different results from the sequential version.
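The underlying issue is easy to observe; for example, in GHCi (our example, ordinary Double arithmetic):

    ghci> (1e20 + (-1e20)) + 1 :: Double
    1.0
    ghci> 1e20 + ((-1e20) + 1) :: Double
    0.0

Re-associating the sum changes the result, so a compiler that promises bit-identical floating point behaviour cannot apply the transformation silently.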

4.7 Dependence Analysis

At this point, one might be concerned that we have left aside a critical point in our development. While we have shown how to vectorize a loop, we have said nothing about when (or why) it might be valid to do so. The crux of any kind of automatic vectorization lies in determining when it is semantics preserving to execute multiple iterations of a scalar loop in parallel. In the most general case this can be a tremendously hard problem to solve, and there is a vast literature on techniques for showing that particular loops can validly be vectorized (see for example Bik [3] for an overview). For the particular case of immutable arrays, however, the problem becomes drastically easier. In this section we give a brief overview of why this is the case.

We say that an instruction I2 is dependent on an instruction I1 if I2 must causally follow I1. There are a number of different reasons that this ordering might be required, each inducing a different kind of dependence. Flow dependence (or data-dependence) is a standard dataflow dependence induced by a read of the value of a variable or location which was written at an earlier point in the program execution. This is sometimes known as a read-after-write dependence. An anti-dependence, on the other hand, is induced by a write after a read: that is, a location or variable is updated with a new value after an old value has been read, and therefore the write must happen after the read to avoid changing the value seen by the read. Finally, output dependence is dependence induced by a write after a write: that is, a location or variable is updated with a new value which overwrites an old value written by a previous statement (and hence the writes cannot be re-ordered without affecting the final value of the written-to location or variable).
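For contrast, a mutable-array loop exhibits these hazards directly. The sketch below (ours, using Data.Vector.Unboxed.Mutable; it is not taken from HRC or the benchmarks) shifts every element of a buffer down by one position:

    import qualified Data.Vector.Unboxed.Mutable as M

    -- Iteration i reads element i+1, which iteration i+1 then overwrites:
    -- a write-after-read (anti-) dependence that constrains any reordering
    -- of the iterations. If some element were written twice, the two writes
    -- would additionally be ordered by an output dependence. Neither hazard
    -- can arise when the only writes are initializing writes.
    shiftDown :: M.IOVector Float -> IO ()
    shiftDown m =
      mapM_ (\i -> do next <- M.read m (i + 1)
                      M.write m i next)
            [0 .. M.length m - 2]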
The key insight for automatic vectorization of immutable arrays is that neither anti-dependence nor output dependence can occur. This is obvious by construction, since the nature of immutability is such that the only writes are initializing writes. Therefore, there cannot be a write after a read (since this would imply a read of uninitialized data), nor can there be a write after a write (since this would imply mutation). The only remaining dependence then is flow-dependence. Flow-dependences which are carried by variable definitions and uses are straightforward to account for, and do not generally cause unusual trouble. In general, it is possible for loop-carried flow-dependences to block vectorization, however. For example, consider the following simple program:

    L0():
      a^s = new[l^s]
      a^s[0] ← 0
      goto L1(1)

    L1(i^s):
      a^s[i^s] ← i^s
      x^s = a^s[i^s − 1]
      if (i^s < l^s) goto L1(i^s + 1)
      else goto Lexit(x^s)

While this program uses only initializing writes, it nonetheless incurs a loop-carried dependence in which a read in one iteration depends on a write in a previous iteration. Clearly then, the problem of dependence analysis is not completely trivial for functional programs even after eliminating the problems of anti-dependences and output-dependences. Note that while this example reads and writes a^s through the same variable name, there is nothing preventing the binding of a^s to a different variable name, and hence allowing array accesses (reads or initializing writes) via an alias. In practice, however, all programs that we have considered (and indeed, based on the way that immutable arrays are produced, all programs that we are at all likely to consider) have the very useful property that they are allocated and initialized in the same lexical scope. The implication of this is that it is usually trivially easy to conservatively determine that an initializing write in a loop does not alias with any of the reads in the loop, simply by checking that no aliases for the newly allocated array are introduced before the loop. As a result, a very naive analysis is quite successful at breaking read-after-write dependences in this style of loops.
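At the source level, the loop-carried flow dependence illustrated above appears whenever element i of a freshly built array is defined in terms of element i − 1 of the same array. A small Haskell sketch of such a computation (ours, written with Data.Vector.Unboxed's constructN; runningSums is an illustrative name):

    import qualified Data.Vector.Unboxed as U

    -- Element i is defined from element i-1 of the array being built, so
    -- iteration i cannot start until iteration i-1 has completed: a genuine
    -- loop-carried read-after-write dependence, even though every write is
    -- an initializing write.
    runningSums :: U.Vector Float -> U.Vector Float
    runningSums xs =
      U.constructN (U.length xs) $ \built ->
        let i    = U.length built
            prev = if i == 0 then 0 else built U.! (i - 1)
        in  prev + xs U.! i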
5. Benchmarks

We show performance results for eight benchmarks, measured on an Intel Xeon E5-4650 (Sandybridge) processor implementing the AVX SIMD instruction set. For each benchmark, the best time out of five runs was recorded. Three of the benchmarks are small synthetic benchmarks. The first computes the pointwise sum of two arrays of 32-bit floating point numbers using the following code:

    addV v1 v2 = zipWith (+) v1 v2

The second computes the horizontal sum of an array of 32-bit floating point numbers using the following code:

    sumV v = foldl (+) 0 v

Both of the kernels use the operations from our modified version of the Data.Vector.Unboxed library. The third kernel computes a dot product, implemented precisely as the composition of the two previous kernels above. These code snippets are each run in a benchmark harness which generates test data and repeatedly runs the same kernel a number of times in order to increase the runtime to reasonable lengths for benchmarking.
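The dot product kernel itself is not shown in the paper; a plausible rendering in the same style (our reconstruction, with dotV an assumed name, using the same Data.Vector.Unboxed operations and reading "composition" as composing the zipWith and foldl patterns) is:

    dotV v1 v2 = foldl (+) 0 (zipWith (*) v1 v2)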
The fourth benchmark is an n-body simulation kernel which uses the naive quadratic algorithm to calculate accelerations for n gravitational bodies, represented as quadruples of 32-bit floating point numbers (three dimensions plus mass). For our measurements, one time step is computed for 100,000 bodies. The code is written in a straightforward manner using the REPA array library.

The fifth benchmark is the matrix-matrix multiply routine included in the REPA array library. The matrix elements are double precision (64-bit) floating point numbers. The program generates random arrays, performs a single multiplication, computes a checksum, and writes the result out to a file. The portion of the program that is timed and reported here is the matrix multiplication and the checksum computation. The measurements were taken using a 3000 by 3000 element array.

The sixth benchmark is a two dimensional five-by-five stencil convolution written using the REPA stencil libraries [9]. The convolution is repeatedly performed 4000 consecutive times on a 1000 by 1000 element array of 32-bit floating point numbers.

The seventh benchmark is a one dimensional convolution of an 8192 element floating-point stencil applied to an array of complex numbers represented as pairs of 32-bit floats. The REPA stencil libraries could not be applied to this problem, but an elegant and simple solution was arrived at which extracts a slice of the complex array, zips it with the stencil array, and then reduces with a fold.

The final benchmark is the "blur" image processing example included with the REPA distribution. The benchmark was modified slightly to match our benchmark harness requirements, with timing code inserted around the kernel of the computation to avoid including file input/output times in the measurements. No changes to the code implementing the algorithm itself were made. The timings presented measure the result of performing twenty successive blurrings on a 4592x3056 pixel image.

                 Vector  Vector  Dot      N     Matrix    1D     2D
                 Add     Sum     Product  Body  Multiply  Conv.  Conv.  Blur
    GHC          0.15    0.72    0.24     0.34  0.35      0.02   0.34   0.45
    GHC LLVM     0.15    0.73    0.35     0.97  0.98      0.02   0.58   0.86
    HRC          1.05    1.84    2.00     4.37  1.34      6.68   5.03   3.02

    Figure 4. Speedup over HRC Scalar (HRC Scalar = 1).

The performance of these eight benchmarks is given in Figure 4. Each bar shows speedup over our standard scalar compiler without vectorization, since the primary goal is to demonstrate the performance increase obtained from SIMD vectorization. Speedup for a given build is computed by dividing the HRC Scalar runtime by the runtime for that build: thus HRC Scalar is always at 1, and higher numbers indicate better performance. We also show the performance of unmodified GHC 7.6.1 using both the standard back end and the new LLVM backend [14]. This is intended primarily to demonstrate that the scalar code from which we begin is suitably optimized. Both the unmodified and our modified version of GHC were run with the optimization flags "-O2" and "-msse2". Various combinations of the flags "-fno-liberate-case", "-funfolding-use-threshold", and "-funfolding-keeness-factor" were also used following the documentation of the REPA libraries, and based on some limited experimentation as to which arguments gave better performance. HRC was run with flags requiring it to preserve floating point value semantics and requiring the backend C compiler to do the same. The only exception was that the flag discussed in Section 4 to permit the re-association of floating point operations during vectorization of reductions was used. For some benchmarks, this caused small perturbations in the results.

For the vector-vector addition benchmark, our baseline compiler gives large speedups over both the standard GHC backend and the LLVM backend. Unfortunately this benchmark is poorly structured for measuring floating point performance. Performance analysis shows that the run time is almost entirely dominated by memory costs associated with initializing the new arrays allocated on each iteration. The vector sum benchmark on the other hand performs a reduction over the input vector, producing only a scalar result. Consequently memory latency issues are less important since the input vectors are in cache, and SIMD vectorization gives a 1.8× speedup. For the dot product kernel, the GHC frontend successfully eliminates the intermediate array, and further optimization by HRC results in a vectorizable loop for a 2× speedup.

The N-Body simulation benchmark benefits significantly from SIMD vectorization, since the core of the inner loop does a substantial amount of floating point work on each iteration. While our baseline compiler gives only slightly better performance than the LLVM backend on this benchmark, the SIMD vectorized version gets a 4.4× speedup over the scalar version.

The matrix multiplication routine gets some limited benefit from SIMD vectorization (1.3×). This benchmark uses double precision, which reduces the expected speedup, and more work is done outside of the core inner (vectorizable) loop. Memory latency is also a factor in this benchmark.

The one dimensional convolution benchmark gets an excellent SIMD vector speedup of 6.7×. Good cache locality makes this benchmark largely compute bound. Unfortunately, while GHC and REPA do an excellent job of fusing away the implied intermediate arrays, GHC is unable to eliminate all of the allocation in the inner loop and so performs extremely poorly on this benchmark. HRC is able to eliminate all of the remaining allocation and so achieves significantly better baseline performance.

Both the two dimensional convolution and blur benchmarks suffered originally from a sequential optimization of the REPA libraries in which the inner loop was unrolled four times³ [9]. While this gave substantial speedup on the scalar builds, it limited the vector speedup because the generated code used non-unit-stride vector loads, which are not supported natively on the measurement hardware. To address this, we modified our version of the REPA library to eliminate the unrolling. This penalizes our scalar build substantially, but greatly benefits the vector build since contiguous vector loads can be emitted. The GHC builds were measured using the original REPA optimizations. The 2D convolution benchmark gets a 5× speedup, and the blur benchmark gets a 3× speedup.

Of particular importance to note with respect to all of the benchmarks, but most especially for the last two, is that the timings presented here are not for the vectorized loops alone, but for the entire kernel of the program (essentially everything except the input (or input generation) and output).

6. Related and Future Work, Conclusions

There is a vast and rich literature on automatic SIMD vectorization in compilers. Bik [3] provides an excellent overview of the essential ideas and techniques in a modern setting. Many standard compiler textbooks also provide at least some rudimentary introduction to SIMD vectorization. The NESL programming language [4] pioneered the idea of writing nested data parallel programs in a functional language, an idea which has more recently been picked up by the Data Parallel Haskell project [12]. A related project focusing on regular arrays [6] (REPA) has had good success in getting array computations to scale on multi-processor machines. Programs written using the REPA libraries are promising targets for SIMD vectorization, either automatic as in this work, or via explicit language constructs in the library implementation. Mainland et al. [10] have pursued this latter approach. They implement primitives in GHC that correspond to SIMD instructions and modify the stream fusion used in the Data.Vector library to generate these SIMD primitives. In this way, they achieve SIMD vectorization in the library, where we achieve SIMD vectorization in the compiler. These are two different, possibly complementary, ways to achieve the same ends.

Our initial implementation of SIMD vectorization targets only simple loops with no conditional tests. An obvious extension to this work would be to incorporate support for predicated instructions to allow more loops to be SIMD vectorized. However, hardware support for this is somewhat limited in current processors, and the performance effects of generating predicated code are harder to predict. A more limited effort to target the special case of conditionals performing array bounds checks seems like a promising area for investigation. As functional languages tend to allocate heavily, supporting allocation of heap objects inside vectorized loops could be important. A vector allocation instruction allocates 8 objects and places pointers to them in a vector variable. Implementing this operation for fixed-size objects is straightforward and should be quick: a single limit check, vector instructions to compute the pointers, and a scatter to initialize the object meta-data used by the garbage collector. Implementing vectorized allocation of variable-length arrays should also be straightforward. A special-case instruction for when the length is loop-invariant would likely have a faster implementation.

To summarize, we have utilized the naturally data-parallel typical functional-language operations on immutable arrays such as maps, folds, zips, etc., utilized the large body of work on eliminating intermediate arrays, and utilized our own internal operations on immutable arrays to produce nice loops amenable to SIMD vectorization. Implementing that SIMD vectorization automatically is a surprisingly straightforward process, and one that we have shown can produce excellent speedups on functional language workloads. A key part of this simplicity is the use of immutable objects and operations on them, and in particular our use of initializing writes. As utilizing SIMD instruction sets is important on modern architectures, these techniques are an important and low-hanging fruit for functional-language implementations.

References

[1] T. Anderson, N. Glew, P. Guo, B. T. Lewis, W. Liu, Z. Liu, L. Petersen, M. Rajagopalan, J. M. Stichnoth, G. Wu, and D. Zhang. Pillar: A parallel implementation language. In V. Adve, M. J. Garzarán, and P. Petersen, editors, Languages and Compilers for Parallel Computing, pages 141–155. Springer-Verlag, Berlin, Heidelberg, 2008.
[2] A. Appel and T. Jim. Shrinking lambda expressions in linear time. Journal of Functional Programming, 7(5), Sept. 1997.
[3] A. J. C. Bik. The Software Vectorization Handbook. Intel Press, 2004.
[4] G. E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85–97, Mar. 1996. ISSN 0001-0782.
[5] M. Fluet and S. Weeks. Contification using dominators. In International Conference on Functional Programming, pages 2–13, Florence, Italy, 2001. ACM.
[6] G. Keller, M. M. Chakravarty, R. Leshchinskiy, S. Peyton Jones, and B. Lippmeier. Regular, shape-polymorphic, parallel arrays in Haskell. In International Conference on Functional Programming, pages 261–272, Baltimore, Maryland, USA, 2010. ACM.
[7] K. Kennedy, C. Koelbel, and H. Zima. The rise and fall of High Performance Fortran: an historical object lesson. In History of Programming Languages, pages 1–22. ACM, 2007.
[8] R. Leshchinskiy. Recycle your arrays! In Practical Aspects of Declarative Languages, pages 209–223. Springer, 2009.
[9] B. Lippmeier and G. Keller. Efficient parallel stencil convolution in Haskell. In Haskell Symposium, pages 59–70. ACM, 2011.
[10] G. Mainland, R. Leshchinskiy, and S. Peyton Jones. Exploiting vector instructions with generalized stream fusion. In International Conference on Functional Programming. ACM, 2013.
[11] L. Petersen and N. Glew. GC-safe interprocedural unboxing. In Compiler Construction, pages 165–184, Tallinn, Estonia, 2012. Springer-Verlag.
[12] S. Peyton Jones. Harnessing the multicores: Nested data parallelism in Haskell. In Asian Symposium on Programming Languages and Systems, pages 138–138, Bangalore, India, 2008. Springer-Verlag.
[13] W. Pugh. A practical algorithm for exact array dependence analysis. Communications of the ACM, 35(8):102–114, 1992.
[14] D. A. Terei and M. M. Chakravarty. An LLVM backend for GHC. SIGPLAN Notices, 45(11):109–120, 2010.
[15] N. Vasilache, C. Bastoul, A. Cohen, and S. Girbal. Violated dependence analysis. In International Conference on Supercomputing, pages 335–344. ACM, 2006.
[16] S. Weeks. Whole-program compilation in MLton. In Workshop on ML, pages 1–1, Portland, Oregon, USA, 2006. ACM.

³ Thanks to an anonymous reviewer for pointing out the source of this unrolling, which we had embarrassingly missed.