Automatic SIMD Vectorization of Chains of Recurrences

Yixin Shou and Robert A. van Engelen∗
Department of Computer Science and School of Computational Science
Florida State University, Tallahassee, FL 32306
{shou,engelen}@cs.fsu.edu

∗ Supported in part by NSF grant CCF-0702435.

ICS'08, June 7–12, 2008, Island of Kos, Aegean Sea, Greece. Copyright 2008 ACM 978-1-60558-158-3/08/06.

ABSTRACT

Many computational tasks require repeated evaluation of functions over structured grids, such as plotting in a coordinate system, rendering of parametric objects in 2D and 3D, numerical grid generation, and signal processing. In this paper, we present a method and toolset to speed up closed-form function evaluations over grids by vectorizing Chains of Recurrences (CR). CR forms of closed-form functions require fewer operations to evaluate per grid point. However, the present CR formalism makes CR forms inherently non-vectorizable due to the dependences carried from one point to the next. To address this limitation, we developed a new decoupling method for the CR algebra to translate math functions into Vector Chains of Recurrences (VCR) forms. The VCR coefficients are packed in short vector registers for efficient execution. Performance results of benchmark functions evaluated in single and double precision VCR forms are compared to the Intel compiler's auto-vectorized code and the high-performance small vector math library (SVML). The results show a significant performance increase of our VCR method over SVML and scalar CRs, from doubling the execution speed to running an order of magnitude faster. An auto-tuning tool for VCR is developed for optimal performance and accuracy.

Categories and Subject Descriptors

G.4 [Mathematical Software]: Parallel and vector implementations—Optimization

General Terms

Algorithms, Performance

Keywords

Chains of recurrences, vectorization, short vector SIMD, ILP

1. INTRODUCTION

Many computational tasks in numerical, visualization, and engineering applications require the repeated evaluation of functions over structured grids, such as plotting in a coordinate system, rendering of parametric objects in 2D and 3D, numerical grid generation, and signal processing. Thread-level parallelism is typically exploited to speed up the evaluation of closed-form functions across points in a grid. Another effective optimization that yields good speedups when applicable to structured grids or meshes is short-vector SIMD execution of arithmetic operations, e.g. "array arithmetic". Virtually all modern general-purpose processor architectures feature instruction sets and instruction set extensions that support short-vector SIMD floating point and integer operations, such as MMX/SSE, 3DNow!, AltiVec, and Cell BE SPU SIMD instructions. Coding with these instruction sets is simplified by the use of software, such as state-of-the-art compilers that automatically SIMD-vectorize computationally intensive loops by restructuring these loops for SIMD vectorization [5], often in combination with highly-optimized vector math library kernels [8].

At the hardware level, advances in algorithms for floating point arithmetic have led to significantly faster execution of math library functions in hardware for both scalar and SIMD instructions, such as trigonometric, square root, logarithmic, and exponential functions. For example, the Intel numerics family (8087, 80287, and 80387) uses the fast CORDIC (COordinate Rotation DIgital Computer) methods [15, 16].

However, repeated execution of numerics instructions for each grid point over a structured grid is still costly, especially in loops over the grid points where the execution latency of the iterated floating point instructions cannot be hidden in the instruction pipeline. On an Intel Pentium 4 processor, for example, the execution latency of a numerics instruction typically approaches 200 cycles [9].
An effective technique to decrease operation counts and reduce the instruction latencies of repeated function evaluations over regular grids is provided by the Chains of Recurrences (CR) formalism [7, 10, 17], which uses an algebraic approach to translate closed-form functions and math kernels into recurrence forms, in effect implementing aggressive loop strength reduction [3]. Any floating point expression composed of math library functions and standard arithmetic can be transformed into a CR form and then optimized to achieve efficient execution. For example, if two functions f(x) and g(x) have CR forms, then f(x) ± g(x), f(x) ∗ g(x), and f(g(i)) have CR forms (with some obvious exceptions). CR forms of closed-form functions require fewer operations per grid point by reusing the value computed at a previous point to determine the function value at the next point. Unfortunately, however, this aspect of the CR formalism prohibits loop parallelization and vectorization, which require independent operations across the iteration space.
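For exposition, the following C fragment is added here (it is not the paper's generated code) to illustrate scalar CR evaluation for the simple quadratic f(i) = i·i with the pure-sum CR {0, +, 1, +, 2}_i: each grid point costs two additions instead of a multiplication, but every iteration depends on the previous one, which is exactly the dependence that prevents vectorization.

    /* Illustration only: scalar CR evaluation of f(i) = i*i via {0, +, 1, +, 2}_i. */
    void cr_square(double *y, int n)
    {
        double x0 = 0.0, x1 = 1.0;     /* CR coefficients; the last coefficient is the constant 2 */
        for (int i = 0; i < n; i++) {
            y[i] = x0;                 /* y[i] = i*i                        */
            x0 += x1;                  /* x0 advances to (i+1)^2            */
            x1 += 2.0;                 /* x1 advances to 2(i+1)+1           */
        }
    }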
In this paper, we present a vectorization method based on an algebraic transformation to speed up closed-form function evaluations over structured grids using vector representations of CR forms. To this end, we developed a decoupling method for the CR algebra to translate math functions into Vector Chains of Recurrences (VCR) forms. The VCR coefficients are packed into short vector registers for efficient execution. This approach effectively reduces the instruction counts for short-vector SIMD execution of functions over grids. As a result, our method outperforms separately optimized scalar CR forms and SIMD-vectorized closed-form function evaluation.

We validated the performance gains using several benchmark functions evaluated in single and double precision and compared the results to the Intel compiler's auto-vectorized code with the high-performance small vector math library (SVML). The results show a dramatic performance increase of our VCR method over SIMD/SVML auto-vectorization and scalar CRs, ranging from doubling the execution speed to running an order of magnitude faster. In addition, the results also show that the VCR code can significantly increase the level of ILP in modulo scheduling [4], resulting in faster kernels for non-SIMD superscalar processors.

The CR optimizations introduce potential roundoff errors that are propagated across the iteration space [19]. Error analysis is required to ensure roundoff errors introduced in CR-based evaluation methods are bounded. To address error propagation, we combined our VCR method with an auto-tuning approach. The VCR code is run to determine error properties and select an optimal vector length for performance. This approach ensures optimal performance with high floating point accuracy.

The remainder of this paper is organized as follows. Section 2 introduces the VCR notation, semantics, and algebraic construction. An auto-tuning approach for fast and safe floating point evaluation is described and implementation details are given. Performance results showing improved SIMD execution and ILP scheduling are presented in Section 3, followed by a discussion of alternative approaches and related work in Section 4. Finally, Section 5 summarizes our conclusions.

2. VECTOR CHAINS OF RECURRENCES

This section introduces the VCR formalism, which is a generalization of the scalar CR formalism [7, 10, 17]. The notation and semantics are presented and an algorithm to symbolically construct VCR forms of math functions is given.

2.1 VCR Notation and Semantics

The key idea of our VCR method is to translate a math function or floating point expression into a set of d decoupled CR forms, where each CR represents the value progression of the math function at an offset 0, ..., d−1 and stride d ≥ 1. The decoupling factor d increases the level of parallelism by a factor of d and also slows the propagation of roundoff errors by a factor d.

Definition 1. A d-dimensional VCR form Φ_i^d of a function f(i) evaluated over i = 0, ..., n−1 is denoted by

    Φ_i^d = {~ϕ_0, ⊙_1, ~ϕ_1, ⊙_2, ..., ⊙_k, ~ϕ_k}_i ,

where each ⊙_m is "+" or "∗" and the ~ϕ_m are vector coefficients (column vectors, written transposed)

    ~ϕ_m = (~ϕ_m[0], ~ϕ_m[1], ..., ~ϕ_m[d−1])^T   for m = 0, ..., k.

Hence, a VCR Φ_i^d has (k+1) × d scalar values and k operations (⊙ = + or ⊙ = ∗) on d-vectors.

Definition 2. The evaluation of a VCR form Φ_i^d of a function f(i) over i = 0, ..., n−1 is defined by the loop

    vector[d] ~v_0 = ~ϕ_0
    vector[d] ~v_1 = ~ϕ_1
    ...
    vector[d] ~v_{k−1} = ~ϕ_{k−1}
    for i = 0 to n−1 step d do
        y[i : i+d−1] = ~v_0
        ~v_0 = ~v_0 ⊙_1 ~v_1
        ~v_1 = ~v_1 ⊙_2 ~v_2
        ...
        ~v_{k−1} = ~v_{k−1} ⊙_k ~ϕ_k
    od

Note that the loop computes y[i] = f(i) for i = 0, ..., n−1 in vectors of d values per iteration (assuming d divides n). A method to construct the VCR form of a function is given in the next section. Here, we give a simple toy example to illustrate VCR evaluation. For d = 2, the VCR form of f(i) = i! is

    Φ_i^2 = {~ϕ_0, ∗, ~ϕ_1, +, ~ϕ_2, +, ~ϕ_3}_i
          = {(1, 1)^T, ∗, (2, 6)^T, +, (10, 14)^T, +, (8, 8)^T}_i .

Using Definition 2 we obtain the sequence

    i               =  0       2       4          6            ...
    (y[i], y[i+1])^T = (1, 1)  (2, 6)  (24, 120)  (720, 5040)  ...

where pairs of values are computed in each iteration. In general, for any d ≥ 2 a speedup is obtained with a vector-based VCR over scalar CR forms for functions evaluated over structured grids and meshes.
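To make the loop template of Definition 2 concrete, the following C sketch is added for exposition (it is not the paper's generated code); it evaluates the toy VCR Φ_i^2 of f(i) = i! above, with d = 2 and assuming n is even.

    /* Sketch of Definition 2 for f(i) = i! with d = 2; coefficients taken from Section 2.1. */
    void vcr_factorial(double *y, int n)
    {
        double v0[2] = { 1.0, 1.0 };          /* ~phi_0 */
        double v1[2] = { 2.0, 6.0 };          /* ~phi_1 */
        double v2[2] = { 10.0, 14.0 };        /* ~phi_2 */
        const double p3[2] = { 8.0, 8.0 };    /* ~phi_3 (constant last coefficient) */
        for (int i = 0; i < n; i += 2) {
            y[i] = v0[0]; y[i + 1] = v0[1];   /* y[i : i+1] = ~v0            */
            v0[0] *= v1[0]; v0[1] *= v1[1];   /* ~v0 = ~v0 * ~v1             */
            v1[0] += v2[0]; v1[1] += v2[1];   /* ~v1 = ~v1 + ~v2             */
            v2[0] += p3[0]; v2[1] += p3[1];   /* ~v2 = ~v2 + ~phi_3          */
        }
    }

Running this sketch reproduces the sequence shown above: y = 1, 1, 2, 6, 24, 120, 720, 5040, ..., with two factorial values produced per iteration from two independent recurrences.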
2.2 VCR Construction

We present an algorithm to construct a VCR form of a function for any d ≥ 1. The case d = 1 degenerates the VCR to a scalar CR form as published in [7, 10, 17].

Algorithm 1. Given a function f(i) represented as a closed-form expression in symbolic form, where f(i) is to be evaluated over n grid points i = 0, ..., n−1, the VCR form Φ_i^d for d ≥ 1 is constructed in two steps:

Step 1. Construct the symbolic parametric scalar CR form Φ_{i′}(j) of f(d i′ + j) by substituting index i in f(i) by {j, +, d}_{i′}, followed by the application of the CR algebra to simplify this term:

    f(i) = f({j, +, d}_{i′})  ⇒_CR  Φ_{i′}(j) ,

which gives the parametric scalar CR form

    Φ_{i′}(j) = {ϕ_0(j), ⊙_1, ϕ_1(j), ⊙_2, ..., ⊙_k, ϕ_k(j)}_{i′} .

Step 2. Construct the symbolic parametric VCR form

    Φ_i^d(j) = {~ϕ_0(j), ⊙_1, ~ϕ_1(j), ⊙_2, ..., ⊙_k, ~ϕ_k(j)}_i ,

with coefficients defined by

    ~ϕ_m(j) = (ϕ_m(j), ϕ_m(j+1), ..., ϕ_m(j+d−1))^T   for m = 0, ..., k,

where the coefficients ϕ_m(j) were constructed in Step 1.

Parameter j in the symbolic form Φ_i^d(j) is also used to generate blocked versions of the VCR loops (see Section 2.3).

We prove the correctness of the algorithm.

Theorem 1. Let Φ_i^d(j), d ≥ 1, be the VCR form constructed by Algorithm 1 for a given function f(i) defined over n points i = 0, ..., n−1. Then, the loop template defined in Definition 2 for Φ_i^d(j) with j = 0 computes y[i] = f(i) exactly for all i = 0, ..., n−1, assuming exact arithmetic, i.e. in the absence of floating point roundoff.

Proof. From Step 1 we have f(i) = f(d i′ + j) with i′ = ⌊i/d⌋ and j = i mod d for i = 0, ..., n−1. Let Φ_{i′}(j) denote the scalar CR form of f(d i′ + j). Then for each j = 0, ..., d−1 the CR forms Φ_{i′}(j) of f(d i′ + j) are evaluated for i′ = 0, ..., n/d − 1 to compute f(i) = f(d i′ + j) = y[d i′ + j] with the CR loop template [10] (the scalar form of Definition 2):

    for j = 0 to d−1 do
        x_0(j) = ϕ_0(j)
        ...
        x_{k−1}(j) = ϕ_{k−1}(j)
        for i′ = 0 to n/d − 1 do
            y[d∗i′ + j] = x_0(j)
            x_0(j) = x_0(j) ⊙_1 x_1(j)
            ...
            x_{k−1}(j) = x_{k−1}(j) ⊙_k ϕ_k(j)
        od
    od

Because the j-loop iterations are independent, we can reorder the loops and use array assignment notation to obtain

    x_0(0 : d−1) = ϕ_0(0 : d−1)
    ...
    x_{k−1}(0 : d−1) = ϕ_{k−1}(0 : d−1)
    for i′ = 0 to n/d − 1 do
        y[d∗i′ : d∗i′+d−1] = x_0(0 : d−1)
        x_0(0 : d−1) = x_0(0 : d−1) ⊙_1 x_1(0 : d−1)
        ...
        x_{k−1}(0 : d−1) = x_{k−1}(0 : d−1) ⊙_k ϕ_k(0 : d−1)
    od

Taking i = d ∗ i′, this loop is equivalent to the loop in Definition 2 for Φ_i^d of f(i).

In Step 1 of the algorithm, the CR algebra described in [7, 10, 17] is used to simplify the parametric scalar CR form of f. A comprehensive list of CR algebra rules can be found in [13]. In certain cases, the exhaustive application of CR rules does not simplify to a single CR form but results in CR-expressions [7] which contain multiple CR forms (see Example 3 at the end of this section). These CR forms can be separately evaluated in loops or in fused loops.

VCR construction is applicable to factorials, sums, products, polynomials, exponentials, transcendentals, and compositions. We conclude this section with some examples.

Example 1. (From the example shown in Section 2.1.) Consider f(i) = i! for i = 0, ..., n−1. Let d = 2. Then f(i) = f(2i′+j) = f({j, +, 2}_{i′}) = {j, +, 2}_{i′}!. Applying the CR algebra,

    {j, +, 2}_{i′}!  ⇒_CR  {j!, ∗, j²+3j+2, +, 4j+10, +, 8}_{i′} .

The VCR Φ_i^2 is constructed from the coefficients ϕ_0(j) = j!, ϕ_1(j) = j²+3j+2, ϕ_2(j) = 4j+10, and ϕ_3(j) = 8, giving

    ~ϕ_0 = (j!, (j+1)!)^T ;  ~ϕ_1 = (j²+3j+2, j²+5j+6)^T ;  ~ϕ_2 = (4j+10, 4j+14)^T ;  ~ϕ_3 = (8, 8)^T

and we set j = 0 (non-blocked loops), which simplifies to

    Φ_i^2 = {(1, 1)^T, ∗, (2, 6)^T, +, (10, 14)^T, +, (8, 8)^T}_i .

The value sequence of Φ_i^2 is shown in Section 2.1.

Example 2. Consider f(i) = r i² for i = 0, ..., n−1. Let d = 4. Then f(i) = f(4i′+j) = f({j, +, 4}_{i′}) = r{j, +, 4}_{i′}² and

    r{j, +, 4}_{i′}²  ⇒_CR  r{j, +, 4}_{i′}{j, +, 4}_{i′}
                      ⇒_CR  r{j², +, 8{j, +, 4}_{i′} + 16}_{i′}
                      ⇒_CR  {rj², +, 8rj+16r, +, 32r}_{i′} .

The VCR Φ_i^4 is constructed from the symbolic coefficients ϕ_0(j) = rj², ϕ_1(j) = 8rj+16r, and ϕ_2(j) = 32r, giving

    ~ϕ_0(j) = r (j², j²+2j+1, j²+4j+4, j²+6j+9)^T ;
    ~ϕ_1(j) = r (8j+16, 8j+24, 8j+32, 8j+40)^T ;
    ~ϕ_2(j) = r (32, 32, 32, 32)^T

and we set j = 0 to obtain the Φ_i^4 coefficients

    Φ_i^4 = {(0, r, 4r, 9r)^T, +, (16r, 24r, 32r, 40r)^T, +, (32r, 32r, 32r, 32r)^T}_i .

By Definition 2 we obtain the sequence

    i                                =  0               4                     8                       12                        ...
    (y[i], y[i+1], y[i+2], y[i+3])^T = (0, r, 4r, 9r)  (16r, 25r, 36r, 49r)  (64r, 81r, 100r, 121r)  (144r, 169r, 196r, 225r)  ...

Example 3. Consider f(i) = sin(h i) for i = 0, ..., n−1. For d ≥ 1, f(i) = f(d i′+j) = f({j, +, d}_{i′}) = sin(h{j, +, d}_{i′}) and

    sin(h{j, +, d}_{i′})  ⇒_CR  ℜ({½ sin(hj) − ½ cos(hj) I, ∗, cos(dh) + sin(dh) I}_{i′})
                                + ℜ({½ sin(hj) + ½ cos(hj) I, ∗, cos(dh) − sin(dh) I}_{i′}) ,

where I is the imaginary unit and ℜ(z) is the real part of a complex number z. Note that the CR forms represent exponential sequences in the complex domain. Let d = 2; then

    Φ_i^2 = ℜ({(−½ I, γ)^T, ∗, (α+β, α+β)^T}_i) + ℜ({(½ I, δ)^T, ∗, (α−β, α−β)^T}_i)

with α = cos(2h), β = sin(2h) I, γ = ½ sin h − ½ I cos h, and δ = ½ sin h + ½ I cos h. This VCR requires only one vector addition and two complex multiplications per two grid points.
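To make Example 3 concrete, the following C sketch is added for exposition (it is ours, not the paper's generated SSE code; it uses C99 complex arithmetic rather than packed SIMD, and assumes n is even). It evaluates sin(h·i) with d = 2 by updating the two complex geometric recurrences of Example 3 and summing their real parts.

    /* Sketch of Example 3 (d = 2): sin(h*i) from two complex geometric CR sequences. */
    #include <complex.h>
    #include <math.h>

    void vcr_sine(float *y, int n, float h)
    {
        const int d = 2;
        float complex u[2], w[2];
        for (int j = 0; j < d; j++) {                       /* initial values at offsets j = 0, 1 */
            u[j] = 0.5f * sinf(h * j) - 0.5f * cosf(h * j) * I;
            w[j] = 0.5f * sinf(h * j) + 0.5f * cosf(h * j) * I;
        }
        const float complex a  = cosf(d * h) + sinf(d * h) * I;   /* cos(dh) + sin(dh) I */
        const float complex ac = cosf(d * h) - sinf(d * h) * I;   /* cos(dh) - sin(dh) I */
        for (int i = 0; i + d <= n; i += d) {
            for (int j = 0; j < d; j++)
                y[i + j] = crealf(u[j]) + crealf(w[j]);     /* = sin(h*(i+j))           */
            for (int j = 0; j < d; j++) { u[j] *= a; w[j] *= ac; }
        }
    }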
2.3 VCR Loop Blocking

Zima et al. [20] analyzed the error characteristics of real- and complex-valued CR forms by considering two primary CR categories, "pure-sum CR" and "pure-product CR". The results indicate that the error of pure-sum CR forms (polynomials) increases with increasing recurrence iteration distance, but is constrained and independent of the actual function value at each point. Thus, evaluation of polynomials with CR forms is numerically stable. The error in exponential CR forms is constrained, but depends both on the iteration distance and on the function values. Evaluations of these and mixed CR forms using (complex) floating-point arithmetic may not be numerically stable.

Therefore, blocking of VCR loops is essential. It forces re-initialization of the VCR sequence to stop error propagation. Figure 1 shows our blocked VCR loop template. The block parameter b is used to bound the recurrence iteration length. After b recurrence steps in the inner i-loop, the recurrences are re-initialized in the j-loop. The remainder of the code handles the case when n is not a multiple of b ∗ d.

    for j = 0 to n − (n mod (b∗d)) − 1 step (b∗d) do
        ~v_0 = ~ϕ_0(j)
        ~v_1 = ~ϕ_1(j)
        ...
        ~v_{k−1} = ~ϕ_{k−1}(j)
        for i = j to j + (b∗d) − 1 step d do
            y[i : i+d−1] = ~v_0
            ~v_0 = ~v_0 ⊙_1 ~v_1
            ~v_1 = ~v_1 ⊙_2 ~v_2
            ...
            ~v_{k−1} = ~v_{k−1} ⊙_k ~ϕ_k(j)
        od
    od
    j = ⌊n/(b∗d)⌋ ∗ b ∗ d
    ~v_0 = ~ϕ_0(j)
    ~v_1 = ~ϕ_1(j)
    ...
    ~v_{k−1} = ~ϕ_{k−1}(j)
    for i = j to n − (n mod d) − 1 step d do
        y[i : i+d−1] = ~v_0
        ~v_0 = ~v_0 ⊙_1 ~v_1
        ~v_1 = ~v_1 ⊙_2 ~v_2
        ...
        ~v_{k−1} = ~v_{k−1} ⊙_k ~ϕ_k(j)
    od
    for i = n − (n mod d) to n − 1 do
        y[i] = ~v_0[i mod d]
    od

    Figure 1: Blocked VCR Loop Template
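The following C sketch is added for exposition (it is not the paper's generated code) to show the core of the blocked loop of Figure 1, specialized to the pure-sum VCR of Example 2 (f(i) = r·i², d = 4) and assuming n is a multiple of d. The recurrence is re-initialized from the parametric coefficients ϕ_m(j) at every block boundary to stop roundoff propagation; the epilogue for n not a multiple of b∗d is omitted.

    /* Sketch of the blocked VCR loop (Figure 1) for f(i) = r*i*i with d = 4. */
    void blocked_vcr_rsquare(float *y, int n, int b, float r)
    {
        const int d = 4;                                 /* decoupling factor            */
        for (int j = 0; j < n; j += b * d) {             /* re-initialize at block start */
            float v0[4], v1[4];
            const float phi2 = 32.0f * r;                /* constant last coefficient    */
            for (int l = 0; l < d; l++) {
                float m = (float)(j + l);
                v0[l] = r * m * m;                       /* phi_0(j+l) = r(j+l)^2        */
                v1[l] = 8.0f * r * m + 16.0f * r;        /* phi_1(j+l) = 8r(j+l) + 16r   */
            }
            for (int i = j; i < j + b * d && i < n; i += d) {
                for (int l = 0; l < d; l++) y[i + l] = v0[l];   /* y[i : i+3] = ~v0      */
                for (int l = 0; l < d; l++) v0[l] += v1[l];     /* ~v0 = ~v0 + ~v1       */
                for (int l = 0; l < d; l++) v1[l] += phi2;      /* ~v1 = ~v1 + ~phi_2    */
            }
        }
    }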
Theoretically, the block size b should be an inverse function of the growth of the relative error. That is, a small growth in relative errors generally yields a larger block size. Because of the unpredictability of the error propagation in non-polynomial CR forms, the block size b must be determined by empirical evaluation.

2.4 VCR Auto-Tuning

We use an auto-tuning approach for the generated code to optimize the decoupling factor d for speed and the block size b for floating-point accuracy. A set of parametric loops is generated for d = 1, 2, 4, 8 to find the best speedup and the largest b. Small values of b negatively impact performance, while b cannot be too large to stay within the error threshold. Increasing d yields better SIMD vector utilization. Increasing d beyond 8 typically results in performance degradation due to increased vector register pressure to hold VCR coefficients in vector registers, possibly leading to register spill. Increasing d also increases the level of ILP in modulo scheduling [4], which may result in faster kernels for non-SIMD superscalar processors. The decreased number of anti/flow dependences of the VCR form updates compared to scalar CR forms allows the scheduler to schedule instructions more optimally. Here we assume the kernel is small, i.e. a loop that executes a few math functions over a grid.

The basic approach to find optimal d and b for ILP is the same as for SIMD auto-tuning. More specifically, given a maximum relative error threshold ε_max, the goal is to find a block size b such that the VCR evaluation error ε_b ≤ ε_max is bounded. Because performance typically increases with larger b, results are computed faster when the error tolerance is higher. Thus, b should be maximized while ε_b ≤ ε_max. The auto-tuning phase determines b as follows. A blocked VCR profile loop is generated for each value of d = 1, 2, 4, 8 and a starting value of b (in this paper b = 10^6, which is sufficiently large) and a given maximum n to compute

    ε_b = max_{i=0,...,n−1} |(f(i) − Φ_i^d) / f(i)| .

To refine b, this process repeats for exponentially diminishing values of b until ε_b ≤ ε_max. The performance of the best combination of d and b is selected.

For pure-sum VCR forms, which are polynomials and combinations of polynomials such as spline calculations, error analysis is simpler. The error of the scalar CR form (with d = 1) is computed for increasing n and tabulated and interpolated to determine the error bound α(n). Then the error for a choice of b and d can be determined from ε_b = α(b). Because we require ε_b ≤ ε_max, the optimal block size b for a given d can be determined from b = α⁻¹(ε_max).

Consider for example the α(n) values of a third-order polynomial Poly3 and a bicubic Spline function shown in Table 1. Suppose that four digits of precision are required, i.e. ε_max ≤ 10⁻⁵. Then b = 2.6·10³ for Poly3 and b = 45 for Spline using single precision floating point, and b = 1.2·10¹² for Poly3 and b = 1.3·10⁶ for Spline using double precision.

    Function   α(n), Single Precision   α(n), Double Precision
    Poly3      3.8·10⁻⁹ n               8.2·10⁻¹⁸ n
    Spline     5.0·10⁻⁹ n²              6.0·10⁻¹⁸ n²

    Table 1: Interpolated Maximum Relative Errors

2.5 Shuffle-Based SIMD Vectorization

In this paper we compare the shuffle-based CR vectorization method suggested in [20] to our VCR approach. When all k operators in a CR form of length k are identical, i.e. (+) or (∗), this regularity can be exploited with a vector operation of length k+1.

Figure 2(b) shows the SSE code for a cubic polynomial (CR length k = 3). A copy of the vector crv = [cr0, cr1, cr2, cr3] is made to vector tmp using _mm_sub_ss such that tmp[0] = 0; then tmp is rotated to tmp = [cr1, cr2, cr3, 0] and added to crv.

    [Figure 2: scalar CR code (a) and shuffle-based sCR SSE code (b) for a cubic polynomial; the code listing is garbled in this copy.]
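Since the Figure 2(b) listing is garbled in this copy, the following sketch reconstructs the shuffle-based update from the description above (the function name and wrapper are illustrative, not the paper's code), assuming a pure-sum cubic CR whose four coefficients are packed in one SSE register with cr0 in the lowest lane.

    /* Sketch of the shuffle-based sCR update for a cubic polynomial (cf. Figure 2(b)). */
    #include <xmmintrin.h>

    void scr_poly3(float *y, int n, __m128 crv)   /* crv = [cr0, cr1, cr2, cr3] */
    {
        __m128 tmp;
        for (int i = 0; i < n; i++) {
            _mm_store_ss(&y[i], crv);             /* y[i] = cr0 = f(i)                      */
            tmp = _mm_sub_ss(crv, crv);           /* [0, cr1, cr2, cr3]                     */
            tmp = _mm_shuffle_ps(tmp, tmp, _MM_SHUFFLE(0, 3, 2, 1));  /* [cr1, cr2, cr3, 0] */
            crv = _mm_add_ps(crv, tmp);           /* cr0+=cr1; cr1+=cr2; cr2+=cr3           */
        }
    }

One vector addition thus performs all three scalar CR updates, at the cost of a copy and a shuffle per grid point.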

Figure 7: Performance Comparison of VCR for n = 1,000 and n = 100,000 (Intel Dual Core Xeon 5160 3GHz). [Two GFLOPS bar charts over the benchmarks Poly3, Spline, Bin15, Sine, Sinh, and Exp; the plotted values are not recoverable from this copy.]

Figure 8: Sine Single and Double Precision Performance Comparison (Intel Dual Core Xeon 5160 3GHz). [MFlop/s versus Dim n = 10, ..., 10^6 for Sine VCR8.4, VCR4.4, Scalar CR, and SVML in single precision and Sine VCR4.2, VCR2.2, Scalar CR, and SVML in double precision; the plotted values are not recoverable from this copy.]

The vectorization of CR forms using our VCR method gives the highest overall performance improvement, as can be seen in Figure 7. The VCR4.4 code executes at least twice as fast as the SVML and Scalar CR codes, peaking at an almost fourfold speedup for Bin15 and Sine. Further performance improvements are obtained with VCR8.4. For n = 1,000, the exponential VCR benchmarks Sine, Sinh, and Exp are faster. For n = 100,000 the performance of VCR8.4 is faster overall. The level of parallelism is increased by a factor p = d/w = 2 for VCR8.4. Therefore, the instruction scheduler has fewer dependences to take into account, which leads to a higher instructions per cycle (IPC) ratio.

To investigate the scalability of the performance over the interval [10, ..., 10^6] for SVML, Scalar CR, and VCR, we plotted the normalized MFLOPS of the Sine benchmark in Figure 8 for single and double-precision floating point. For n ≥ 50, the VCR8.4 and VCR4.2 codes are the fastest for the single and double-precision experiments, respectively. The lower performance for smaller n is caused by the initialization overhead to set up the VCR vectors, where the initialization cost increases with larger d. Note that for the single-precision case the VCR4.4 code is about a factor 3 to 4 faster than the Scalar CR code, showing close-to-perfect vectorization cost reduction and scalability for n ≥ 500. Similarly, for the double-precision case the VCR2.2 code is about a factor 2 faster than the Scalar CR code for n ≥ 500.

The performance of the VCR codes drops after n = 500,000 due to cache capacity effects, while the SVML code performance drops much earlier due to temporary vector memory storage and cache effects. The apparent jitter observed in the graphs for the VCR4.4 and VCR8.4 single-precision code is consistent and appears to be related to the use of more aggressive optimization options with the Intel compiler, indicating that certain points benefit from increased performance due to loop optimizations such as unrolling. The VCR8.4 code exhibits higher levels of independence between iterations than VCR4.4 (and similarly for VCR4.2 versus VCR2.2). This significantly increases the computational throughput for the Sine benchmark in the single precision case and more dramatically in the double precision case, as observed in Figure 8.

Note that the performance of the double-precision SVML code has decreased to around half of the performance of the single-precision SVML code, which is due to the fact that the effective SIMD vector length is reduced by half. Similarly, the performance of the VCR2.2 code and the VCR4.2 code also decreased to half of the performance of the VCR4.4 code and the VCR8.4 code, respectively. The performance of the serial version Scalar CR is about the same as its single precision version, which is not surprising given that the code is not vectorized.
(a) sCR1.4 code of the Bin15 benchmark (one grid point per iteration; the 16 CR coefficients of the 15th-order polynomial are packed in cr0..cr3 and chained through the lowest lanes):

    register __m128 cr0, cr1, cr2, cr3, tmp;
    int s = _MM_SHUFFLE(0, 3, 2, 1);
    cr0 = . . . ; cr1 = . . . ; cr2 = . . . ; cr3 = . . . ;
    for (i = 0; i < n; i++)
    {   _mm_store_ss(&x[i], cr0);                 /* x[i] = lowest CR coefficient = f(i)    */
        tmp = _mm_move_ss(cr0, cr1);              /* carry cr1's low lane into cr0's chain  */
        tmp = _mm_shuffle_ps(tmp, tmp, s);
        cr0 = _mm_add_ps(cr0, tmp);
        tmp = _mm_move_ss(cr1, cr2);
        tmp = _mm_shuffle_ps(tmp, tmp, s);
        cr1 = _mm_add_ps(cr1, tmp);
        tmp = _mm_move_ss(cr2, cr3);
        tmp = _mm_shuffle_ps(tmp, tmp, s);
        cr2 = _mm_add_ps(cr2, tmp);
        tmp = _mm_sub_ss(cr3, cr3);               /* zero the low lane of the last register */
        tmp = _mm_shuffle_ps(tmp, tmp, s);
        cr3 = _mm_add_ps(cr3, tmp);
    }

(b) sCR4.4 code of the Poly3 benchmark (four decoupled cubic CRs, one per register, producing four grid points per iteration):

    register __m128 crv0, crv1, crv2, crv3, tmp;
    int s = _MM_SHUFFLE(0, 3, 2, 1);
    crv0 = . . . ; crv1 = . . . ; crv2 = . . . ; crv3 = . . . ;
    for (i = 0; i < n; i += 4)
    {   _mm_store_ss(&y[i  ], crv0); _mm_store_ss(&y[i+1], crv1);
        _mm_store_ss(&y[i+2], crv2); _mm_store_ss(&y[i+3], crv3);
        tmp = _mm_sub_ss(crv0, crv0); tmp = _mm_shuffle_ps(tmp, tmp, s); crv0 = _mm_add_ps(crv0, tmp);
        tmp = _mm_sub_ss(crv1, crv1); tmp = _mm_shuffle_ps(tmp, tmp, s); crv1 = _mm_add_ps(crv1, tmp);
        tmp = _mm_sub_ss(crv2, crv2); tmp = _mm_shuffle_ps(tmp, tmp, s); crv2 = _mm_add_ps(crv2, tmp);
        tmp = _mm_sub_ss(crv3, crv3); tmp = _mm_shuffle_ps(tmp, tmp, s); crv3 = _mm_add_ps(crv3, tmp);
    }

Figure 9: sCR1.4 Code of the Bin15 Benchmark (a) and sCR4.4 Code of the Poly3 Benchmark (b)

Figure 10: Performance Comparison of Shuffle-Based CR Vectorization for the Polynomial Benchmarks with n = 1,000 and for Poly3 with n = 10, ..., 10^6 (Intel Dual Core Xeon 5160 3GHz). [MFlop/s bar chart over Poly3, Spline, and Bin15 for SVML, Scalar CR, sCR1.4, and sCR4.4, and a MFlop/s versus Dim n plot for Poly3; the plotted values are not recoverable from this copy.]

Overall, the performance of the VCR code is superior to any of the other optimization methods, sometimes even an order of magnitude faster than the original scalar code and the SVML-vectorized code. The VCR transformation effectively combines CR-based operation reduction with SIMD vectorization. The choice of decoupling factor d that gives the best speedup of VCR depends on the function to optimize and is difficult to predict, which leads us to the auto-tuning approach. Note that d should be a multiple of the SIMD vector register length w; reasonable values are therefore d = 2, d = 4, and d = 8.

3.4 Shuffle-Based CR Performance Results

In this section, the performance results of the shuffle-based CR vectorization, denoted sCR, are compared. Experiments were performed on an Intel Dual Core Xeon 5160 3GHz running Linux Fedora Core 7, using the Intel C++ compiler with compiler flags "icc -O3 -restrict -fno-inline -axW". Recall that the shuffle-based CR vectorization approach is only applicable to the polynomial benchmarks Poly3, Spline, and Bin15.

The optimized design of the SSE code for shuffle-based CR operations illustrates the use of advanced instructions and some tricks to reduce the operation count to the bare minimum. The transformation from scalar CR code to shuffle-based sCR code was shown in Figure 2. A more elaborate example, the sCR1.4 code for Bin15, is shown in Figure 9, which further draws on specialized SSE instructions to manipulate CR coefficients packed in vector registers.

To further speed up the sCR code, we used a decoupling factor to generate independent execution sequences, which increases the level of ILP. The SSE code for the sCR optimization with a decoupling factor d = 4, the sCR4.4 shuffle-based CR vectorization of benchmark Poly3, is also shown in Figure 9. The code essentially replicates the shuffle-based template of Figure 2 by applying our decoupling technique to the shuffle-based CR method.

Figure 10 summarizes the performance for all three polynomial benchmarks for n = 1,000 grid points. Note that the Intel compiler failed to vectorize Bin15. Except for Bin15, the Intel SVML-vectorized code is fastest. The Poly3 sCR1.4 code has the worst performance among these versions. The shuffle operation used in this version is the reason for the slow-down. Shuffle operations that involve a data reorganization in vector registers are usually expensive and appear to be more costly than the non-vector floating point operations on floating point registers in the Scalar CR code.


Increasing the ILP appears to help for long CR forms, e.g. the 15th-order polynomial Bin15 code. To investigate the scalability of the performance of sCR over the interval [10, ..., 10^6], the performance results for the Poly3 benchmark are also shown in Figure 10. As can be expected, the Poly3 sCR4.4 code shows an almost fourfold performance increase over the Poly3 sCR1.4 code due to higher ILP.

The overall performance of sCR is disappointing compared to SVML. Thus, sCR is also much slower than our proposed VCR method (since VCR is overall faster than SVML). Given the low performance of sCR and its limited applicability, the sCR optimization is not a viable alternative. However, new SSE instructions, such as a "shift-add" that would fit the CR update operations, could make the performance of sCR more competitive.

3.5 Superscalar ILP Enhancement Results

The performance results presented in this section were obtained using the Sun Studio 12 C compiler with compiler options "suncc -fast -fma=fused -xrestrict -xalias_level=layout -xinline=no -xarch=native". Note that the "-fma=fused" option is used to improve the performance of the code with fused multiply-add (FMA) instructions, when possible. Experiments were performed on an UltraSPARC IIIi 1.2GHz running Sun Solaris 10. The UltraSPARC IIIi processor is a high-performance, highly-integrated 4-issue superscalar processor implementation of the 64-bit SPARC-V9 RISC architecture, supporting a 64-bit virtual address space. We did not use the UltraSPARC VIS instruction set for SIMD short vector operations, which supports only integer and fixed-point operand types. Therefore, any improvement is a result of the improved utilization of ILP in modulo scheduling.
The 4-issue superscalar processor benefits from increased levels of ILP; thus modulo scheduling is an essential compiler optimization for small loop kernels. The Scalar CR code exhibits loop-independent anti dependences between the CR updates and cross-iteration dependences (see Figure 2(a) for example) that limit the effectiveness of the modulo scheduler to schedule instructions in parallel. The best performance is obtained with the lowest steady-state cycle count (also referred to as the initiation interval of the kernel), assuming no stalls.

Figure 11: Performance Comparison of VCR-Enhanced ILP (UltraSPARC IIIi 1.2GHz). [MFLOPS bar charts for the Scalar, Scalar CR, and VCR4.1 codes over Poly3, Spline, Bin15, Sine, Sinh, and Exp at n = 1,000 and n = 100,000; the plotted values are not recoverable from this copy.]

Figure 11 summarizes the performance of all benchmark codes, for n = 1,000 grid points and n = 100,000 grid points. The performance of the transcendental math library functions on this machine is very poor (the Scalar codes), making the data for Sinh and Exp barely visible in the graph. The difference is marginally better for the higher n = 100,000, because of the relatively lower overhead of the CR initialization cost and the relatively lower prologue and epilogue costs in the modulo schedule.

The hypothesized increased ILP of our VCR method is indeed verified by the results in Figure 11. The VCR4.1 code (decoupling factor d = 4 and vector width w = 1) achieves the best performance for all benchmark functions except Bin15. The modulo scheduler of the Sun Studio compiler failed to construct a schedule for the Bin15 VCR4.1 code, due to the high floating-point register pressure brought by four decoupled 15th-order polynomial CR functions requiring 64 registers just to hold the CR coefficients.

To study the impact of enhanced ILP on modulo scheduling, Table 5 lists the static statistics extracted from the Sun Studio compiler for optimizing the computational loop with modulo scheduling in the three optimized versions of the Poly3 code. The statistics are similar for the other benchmarks. The steady-state cycle count determines the performance, where lower cycle counts indicate that fewer cycles are executed per loop iteration and each cycle issues at most 4 instructions. The observed performance is indeed best for Poly3 VCR4.1.

    Poly3 Benchmark
    Measurement                               Scalar              Scalar CR   VCR4.1
    Unroll factor                             8                   8           2
    Steady-state cycle count                  5                   4           3 (12 per 4 pts)
    Floating point operations per iteration   3 FMA + 2 FPadds    3 FPadds    12 FPadds (per 4 points)

    Table 5: Static Code Statistics of Three Poly3 Codes

To investigate the scalability of the ILP enhancement of the VCR4.1 code over the interval [10, ..., 10^6], we plotted the results in Figure 12 for the Poly3 benchmark. The steady-state cycle count (Table 5) is clearly reflected in the performance differences, with VCR4.1 having the lowest steady-state cycle count (3) and the highest performance (1.0 GFLOPS peak).

Figure 12: Poly3 Benchmark Performance Results of VCR-Enhanced ILP (UltraSPARC IIIi 1.2GHz). [MFlop/s versus Dim n for Poly3 VCR4.1, Poly3 Scalar CR, and Poly3 Scalar; the plotted values are not recoverable from this copy.]

Overall, the VCR approach significantly increases the ILP by decoupling factors d ≥ 1. This benefits the performance of modulo scheduling for superscalar RISC architectures, as long as sufficient registers are available to hold the VCR coefficients. The number of registers needed for the VCR computation is d·k, where k is the length of the CR form constructed for a function.

4. RELATED WORK

The CR formalism was originally developed by Zima [18] and applied to computer algebra systems [17]. The formalism was improved by Bachmann, Zima, and Wang [6, 7] to expedite the evaluation of multivariate functions on regular grids. Multivariate CR forms (MCR) are nested CR forms that represent multivariate closed-form functions over sets of index variables. Van Engelen [12, 13] extended the CR algebra by incorporating new rules and techniques for induction variable detection and optimization in compilers.

Zima et al. [20] investigated parallel mappings to evaluate CR forms using coarse-grain thread-level parallelism. A data partitioning approach is proposed to divide the iteration domain into p sub-domains given p processors to speed up execution and reduce the error. Several thread-level parallel execution strategies (functional parallel, data parallel, and subdomain parallel) are compared for two CR forms. By contrast, our approach proposes a decoupling strategy to exploit fine-grain parallelism by using a CR translation technique to generate independent value sequences, where the sequences are stored and updated in vector registers. Our technique can be combined with thread-level parallelism of the outer block loop (Figure 1) to further increase the execution speed of math function evaluations over grids.
Zima et al. [20] concluded that the performance of "functional parallel" CR evaluation is poor using thread-level parallelism. In our work, despite reducing the overhead of "functional parallel" execution of CR forms, we also found that the resulting shuffle-based CR vectorization is not competitive due to the overhead of the tightly-coupled copy, shuffle, and add (or multiply) operations required for each grid point.

The CORDIC [15] family of algorithms provides a fast method to evaluate transcendental functions, roots, logs, and exponents for a single point. The CORDIC algorithm performs a rotation using a series of specific incremental rotation angles to approach the target angle, and each step only requires addition, subtraction, bit-shift, and table lookup. It is widely used in pocket calculators and real-time systems, since CORDIC is generally faster than other approaches when a hardware multiplier is not available or when the cost of the chip needs to be minimized. The Intel 80x87 coprocessor series up to the Intel 80486 all use CORDIC algorithms [16]. These hardware advances are instrumental to reduce latencies that would otherwise diminish the effectiveness of our vectorization method.

Vector math libraries, such as Intel's SVML and the Vector Math Library (VML), are highly optimized for SIMD short vector execution [9]. SVML is developed to provide efficient software implementations of transcendental functions on packed floating-point numbers with full accuracy [8] and is only intended for use by the Intel compiler vectorizer. VML is an application-level library designed to compute math functions on vector arguments [2].

By contrast to these highly optimized vector math libraries, VCR vectorization is not restricted to math functions alone. A floating point expression composed of multiple math functions can be transformed to a VCR and then optimized together, resulting in more efficient evaluation.

5. CONCLUSIONS AND FUTURE WORK

In this paper we proposed a SIMD vectorization method to speed up the evaluation of floating-point functions in loops. Many applications that require repeated evaluation of functions over regular and structured grids can benefit from the performance increase obtained by this method, such as plotting in a coordinate system, rendering of parametric objects in 2D/3D, numerical grid generation, and signal processing. A prototype automatic code generator is discussed that generates C code with SSE intrinsics to speed up the execution of functions represented in vector chains of recurrences forms. Experimental results show that a dramatic performance increase is obtained compared to the best vectorizers on an Intel Xeon processor, from doubling the execution speed to running an order of magnitude faster. Furthermore, the effectiveness of modulo scheduling of function evaluations in loops is improved by the proposed technique, which results in faster kernels on superscalar RISC machines, such as the UltraSPARC IIIi.

The main contributions of this paper can be summarized as follows:

• The present CR formalism to expedite function evaluation over regular grids carries dependences across the iteration space, which prevents loop vectorization. A new fine-grain decoupling strategy is proposed to vectorize CR evaluation for efficient SIMD vector execution and enhanced ILP. The CR-based method is applicable to any function defined over a commutative ring (Z, +, ∗).

• A systematic performance evaluation of the proposed vectorization method is conducted for polynomial and non-polynomial (exponential) classes of functions in the real and complex domains and compared to existing work.

• A systematic performance evaluation of the ILP-enhancing properties of the method is conducted and the impact on modulo scheduling is analyzed.

• An analysis overview of the roundoff error properties of CR forms is given, with a motivation for the decoupling strategy to slow the rate of error propagation.

• An auto-tuning approach is proposed to determine the optimal decoupling factor and block size for vector CR execution. This ensures optimality of performance and floating point accuracy.
In [11, 12, 13] we developed several compiler analysis and loop optimization techniques based on CR forms, such as nonlinear induction variable recognition, pointer-to-array access conversion, and nonlinear array data dependence analysis. These methods rely on the detection of induction variable recurrences from code. In our future work we will consider combining these loop analysis methods with the technique proposed in this paper to implement aggressive loop strength reduction with vectorization. However, the compiler optimization techniques will be limited to integer value sequences to ensure the safety of these optimizations.

Further development will be conducted to improve our prototype implementation and release a production-quality tool for vector CR generation and optimization.

6. ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their constructive comments and suggestions that helped improve the content of this paper.

7. REFERENCES

[1] SWI-Prolog's Home. Available from http://www.swi-prolog.org/.
[2] Intel Math Kernel Library Reference Manual, March 2007. Intel Document number: 630813-025US.
[3] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, MA, 1985.
[4] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software Pipelining. ACM Computing Surveys, 27(3):367–432, 1995.
[5] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.
[6] O. Bachmann. Chains of Recurrences. PhD thesis, Kent State University, College of Arts and Sciences, 1996.
[7] O. Bachmann, P. Wang, and E. Zima. Chains of Recurrences - a Method to Expedite the Evaluation of Closed-Form Functions. In Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC), pages 242-249, Oxford, 1994. ACM.
[8] A. J. C. Bik. The Software Vectorization Handbook: Applying Intel Multimedia Extensions for Maximum Performance. Intel Press, 2004.
[9] L. Kylex. How to Avoid Bottlenecks in Simple Math Functions. Available from http://softwarecommunity.intel.com/articles/eng/3524.htm, Intel, 2004.
[10] R. van Engelen. Symbolic Evaluation of Chains of Recurrences for Loop Optimization. Technical Report TR-000102, Computer Science Dept., Florida State University, 2000.
[11] R. van Engelen. Efficient Symbolic Analysis for Optimizing Compilers. In Proceedings of the ETAPS Conference on Compiler Construction 2001, LNCS 2027, pages 118-132, 2001.
[12] R. van Engelen, J. Birch, Y. Shou, B. Walsh, and K. Gallivan. A Unified Framework for Nonlinear Dependence Testing and Symbolic Analysis. In Proceedings of the ACM International Conference on Supercomputing (ICS), pages 106-115, 2004.
[13] R. van Engelen and K. Gallivan. An Efficient Algorithm for Pointer-to-Array Access Conversion for Compiling and Optimizing DSP Applications. In Proceedings of the International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA) 2001, pages 80-89, Maui, Hawaii, 2001.
[14] R. van Engelen, L. Wolters, and G. Cats. Ctadel: A Generator of Multi-Platform High Performance Codes for PDE-based Scientific Applications. In 10th ACM International Conference on Supercomputing, pages 86-93, New York, 1996. ACM Press.
[15] J. E. Volder. The CORDIC Trigonometric Computing Technique. IRE Transactions on Electronic Computers, EC-8:330-334, September 1959.
[16] A. K. Yuen. Intel's Floating-Point Processors. In Electro/88 Conference Record, pages 48/5/1-7, 1988.
[17] E. V. Zima. Recurrent Relations and Speed-up of Computations using Computer Algebra Systems. In Proceedings of DISCO'92, LNCS 721, pages 152-161, 1992.
[18] E. V. Zima. Automatic Construction of Systems of Recurrence Relations. USSR Computational Mathematics and Mathematical Physics, 24(11-12):193-197, 1986.
[19] E. V. Zima. On Computational Properties of Chains of Recurrences. In Proceedings of the 2001 International Symposium on Symbolic and Algebraic Computation, page 345. ACM Press, 2001.
[20] E. V. Zima, K. R. Vadivelu, and T. L. Casavant. Mapping Techniques for Parallel Evaluation of Chains of Recurrences. In IPPS '96: Proceedings of the 10th International Parallel Processing Symposium, pages 620–624, Washington, DC, USA, 1996. IEEE Computer Society.