
Program Optimization Through Loop Vectorization

Materials for this tutorial can be found at: http://polaris.cs.uiuc.edu/~garzaran/pldi-polv.zip

María Garzarán, Saeed Maleki, William Gropp, and David Padua
Department of Computer Science, University of Illinois at Urbana-Champaign
Questions? Send an email to [email protected]


Topics covered in this tutorial
• What the microprocessor vector extensions, or SIMD (Single Instruction Multiple Data) units, are
• How to use them
  – Through automatic compiler vectorization
    • Manual transformations that enable vectorization
    • Directives to guide the compiler
  – Through intrinsics
• Main focus is on vectorizing through the compiler:
  – Code is more readable
  – Code is portable

Outline
1. Intro
2. Data Dependences (Definition)
3. Overcoming limitations to SIMD-Vectorization
   – Data Dependences
   – Data Alignment
   – Aliasing
   – Non-unit strides
   – Conditional Statements
4. Vectorization with intrinsics


SIMD Vectorization
• The use of SIMD units can speed up the program.
• Intel SSE and IBM Altivec have 128-bit vector registers and functional units:
  – 4 32-bit single precision floating point numbers
  – 2 64-bit double precision floating point numbers
  – 8 16-bit integers or shorts
  – 16 8-bit bytes or chars
• Assuming a single ALU, these SIMD units can execute 4 single precision floating point operations or 2 double precision operations in the time it takes a scalar unit to do only one of these operations.

Simple Example
• Loop vectorization transforms a program so that the same operation is performed at the same time on several vector elements:

    for (i=0; i<LEN; i++)
      c[i] = a[i] + b[i];

  Scalar code loads, adds, and stores one element per iteration (ld r1, addr1); vector code loads, adds, and stores several elements per iteration (ldv vr1, addr1).
(Figure: the scalar unit adds one 32-bit pair X1 + Y1 = Z1 from the register file; the vector unit operates on several elements at once.)

Executing Our Simple Example

S000:
    for (i=0; i<LEN; i++)
      c[i] = a[i] + b[i];

Intel Nehalem: Exec. Time scalar code: 6.1; Exec. Time vector code: 3.2
IBM Power 7: Exec. Time scalar code: 2.1; Exec. Time vector code: 1.0

How do we access the SIMD units?
• Three choices:
1. C code and a vectorizing compiler
2. Macros or vector intrinsics:

    void example(){
      __m128 rA, rB, rC;
      for (int i = 0; i < LEN; i+=4){
        rA = _mm_load_ps(&a[i]);
        rB = _mm_load_ps(&b[i]);
        rC = _mm_add_ps(rA, rB);
        _mm_store_ps(&c[i], rC);
      }}

3. Assembly language:

    ..B8.5
      movaps a(,%rdx,4), %xmm0
      addps  b(,%rdx,4), %xmm0
      movaps %xmm0, c(,%rdx,4)
      addq   $4, %rdx
      cmpq   %rdi, %rdx
      jl     ..B8.5


Why should the compiler vectorize?
1. Easier
2. Portable across vendors and machines
   – Although compiler directives differ across compilers
3. Better performance of the compiler generated code
   – The compiler applies other transformations

Compilers make your codes (almost) machine independent.

But compilers fail:
- Programmers need to provide the necessary information
- Programmers need to transform the code

How well do compilers vectorize?

Compiler                XLC    ICC    GCC
Loops: Total            159    159    159
Loops: Vectorized        74     75     32
Loops: Not vectorized    85     84    127
Average Speed Up       1.73   1.85   1.30

Loops vectorized by XLC but not by ICC: 25
Loops vectorized by ICC but not by XLC: 26

By adding manual vectorization the average speedup was 3.78 (versus 1.73 obtained by the XLC compiler).

How much programmer intervention?
• Next, three examples illustrate what the programmer may need to do:
  – Add compiler directives
  – Transform the code
  – Program using vector intrinsics


Experimental results
• The tutorial shows results for two different platforms with their compilers:
  – Report generated by the compiler
  – Execution time for each platform
• The examples use single precision floating point numbers.

Compiler directives

S1111 (original):
    void test(float* A, float* B, float* C,
              float* D, float* E)
    {
      for (int i = 0; i < LEN; i++){
        A[i] = B[i] + C[i] + D[i] + E[i];
      }
    }

S1111 (with __restrict__):
    void test(float* __restrict__ A, float* __restrict__ B,
              float* __restrict__ C, float* __restrict__ D,
              float* __restrict__ E)
    {
      for (int i = 0; i < LEN; i++){
        A[i] = B[i] + C[i] + D[i] + E[i];
      }
    }

Intel Nehalem, original: Loop was not vectorized. Scalar: 5.6; Vector: --; Speedup: --
Intel Nehalem, restrict: Loop was vectorized. Scalar: 5.6; Vector: 2.2; Speedup: 2.5
IBM Power 7, original: Loop was not vectorized. Scalar: 2.3; Vector: --; Speedup: --
IBM Power 7, restrict: Loop was vectorized. Scalar: 1.6; Vector: 0.6; Speedup: 2.7


Loop Transformations

S136 sums the columns of a 2D array (a non-unit-stride access pattern); S136_1 and S136_2 interchange the loops so that the inner loop has unit stride (the code and a diagram appear in the Non-unit Stride section later):

S136 (Intel Nehalem): Loop was not vectorized. Vectorization was possible but seems inefficient. Scalar: 3.7; Vector: --; Speedup: --
S136_1 and S136_2 (Intel Nehalem): Permuted loop was vectorized. Scalar: 1.6; Vector: 0.6; Speedup: 2.6
S136 (IBM Power 7): Loop was not SIMD vectorized. Scalar: 2.0; Vector: --; Speedup: --
S136_1 (IBM Power 7): Loop interchanging applied; loop was SIMD vectorized. Scalar: 0.4; Vector: 0.2; Speedup: 2.0
S136_2 (IBM Power 7): Loop interchanging applied; loop was SIMD vectorized. Scalar: 0.4; Vector: 0.16; Speedup: 2.7

Intrinsics (SSE)
    #include <xmmintrin.h>
    #define n 1024
    __attribute__((aligned(16))) float a[n], b[n], c[n];

    int main() {
      __m128 rA, rB, rC;
      for (int i = 0; i < n; i += 4) {
        rA = _mm_load_ps(&a[i]);
        rB = _mm_load_ps(&b[i]);
        rC = _mm_mul_ps(rA, rB);
        _mm_store_ps(&c[i], rC);
      }}


Intrinsics (Altivec)
    #define n 1024
    __attribute__ ((aligned(16))) float a[n], b[n], c[n];
    ...
    for (int i = 0; i < n; i += 4) {
      /* Altivec loads, a multiply, and stores,
         analogous to the SSE version above */
    }

Outline (repeated between sections)
1. Intro
2. Data Dependences (Definition)
3. Overcoming limitations to SIMD-Vectorization
4. Vectorization with intrinsics

Data dependences
• The notion of dependence is the foundation of the process of vectorization.
• It is used to build a calculus of program transformations that can be applied manually by the programmer or automatically by a compiler.

Definition of Dependence
• A statement S is said to be data dependent on statement T if:
  – T executes before S in the original sequential/scalar program
  – S and T access the same data item
  – At least one of the accesses is a write.


Data Dependence

Flow dependence (true dependence):
    S1: X = A + B
    S2: C = X + A        (S1 → S2: read after write)

Anti dependence:
    S1: A = X + B
    S2: X = C + D        (S1 → S2: write after read)

Output dependence:
    S1: X = A + B
    S2: X = C + D        (S1 → S2: write after write)

• Dependences indicate an execution order that must be honored.
• Executing statements in the order of the dependences guarantees correct results.
• Statements not dependent on each other can be reordered, executed in parallel, or coalesced into a vector operation.
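To make the three kinds concrete, here is a minimal, self-contained C sketch (the variable names follow the slide; the surrounding program is an illustration, not part of the tutorial's benchmark suite):

    #include <stdio.h>

    int main(void) {
        float A = 1.0f, B = 2.0f, C, D = 4.0f, X = 0.0f;

        /* Flow (true) dependence: S2 reads the X that S1 wrote. */
        X = A + B;        /* S1 */
        C = X + A;        /* S2: must run after S1 */

        /* Anti dependence: S1 must read X before S2 overwrites it. */
        A = X + B;        /* S1 */
        X = C + D;        /* S2 */

        /* Output dependence: both write X; the last write must win. */
        X = A + B;        /* S1 */
        X = C + D;        /* S2: final value of X comes from S2 */

        printf("%f %f %f\n", A, C, X);  /* swapping any pair above changes this output */
        return 0;
    }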

Dependences in Loops (I)
• Dependences in loops are easy to understand if the loops are unrolled. Now the dependences are between statement "executions" (instances).

    for (i=0; i<LEN; i++){
      S1: a[i] = b[i] + 1;
      S2: c[i] = a[i] + 2;
    }

Unrolled:
    S1: a[0] = b[0] + 1    S1: a[1] = b[1] + 1    S1: a[2] = b[2] + 1
    S2: c[0] = a[0] + 2    S2: c[1] = a[1] + 2    S2: c[2] = a[2] + 2


• Every instance of S2 depends on the instance of S1 from the same iteration: a loop independent dependence. For the whole loop, the dependence graph has a single edge S1 → S2.

• For the dependences shown here, we assume that arrays do not overlap in memory (no aliasing). Compilers must know that there is no aliasing in order to vectorize.

Dependences in Loops (II)
• A slightly different example:

    for (i=1; i<LEN; i++){
      S1: a[i] = b[i] + 1;
      S2: c[i] = a[i-1] + 2;
    }


• Unrolled, the instances are:

    S1: a[1] = b[1] + 1    S1: a[2] = b[2] + 1    S1: a[3] = b[3] + 1
    S2: c[1] = a[0] + 2    S2: c[2] = a[1] + 2    S2: c[3] = a[2] + 2

• Each instance of S2 reads the element of a[] written by S1 in the previous iteration. For the whole loop this is a loop carried dependence from S1 to S2, with distance 1.

Dependences in Loops (III)
• The same pattern with a scalar:

    for (i=0; i<LEN; i++){
      S1: a = b[i] + 1;
      S2: c[i] = a + 2;
    }

• Unrolled:

    S1: a = b[0] + 1       S1: a = b[1] + 1       S1: a = b[2] + 1
    S2: c[0] = a + 2       S2: c[1] = a + 2       S2: c[2] = a + 2

• For the whole loop there is a loop independent flow dependence S1 → S2 inside each iteration, and the reuse of the scalar a also creates loop carried dependences across iterations (an output dependence between instances of S1, and an anti dependence from S2 to the next instance of S1).

Dependences in Loops (IV)
• Doubly nested loops: the unrolling view still applies, but statement instances now live in a two-dimensional (i, j) iteration space.


• Example:

    for (i=1; i<LEN; i++) {
      for (j=1; j<LEN; j++) {
        S1: ...
      }
    }

(Figure: the (i, j) iteration space drawn as a grid, one node per instance of S1, with arrows showing the dependences between instances; the array subscripts of S1 determine the direction and distance of the arrows.)


Data dependences and vectorization
• Loop dependences guide vectorization.
• Main idea: a statement inside a loop which is not in a cycle of the dependence graph can be vectorized.

Data dependences and transformations
• When cycles are present, vectorization can be achieved by:
  – Distributing the loop, to separate the statements that are not in the cycle (see the sketch below)
  – Removing dependences
  – Freezing loops
  – Changing the algorithm
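A minimal sketch of loop distribution (the arrays and the recurrence are illustrative, assuming float arrays a, b, c, d of length LEN; they are not the tutorial's exact benchmark): S1 below is in a dependence cycle with itself, S2 is not, so after distribution the second loop is vectorizable.

    /* Original loop: S1 carries a recurrence, S2 does not. */
    for (int i = 1; i < LEN; i++) {
        a[i] = a[i-1] + b[i];   /* S1: self true dependence -> cycle   */
        c[i] = a[i] + d[i];     /* S2: not part of any cycle           */
    }

    /* After loop distribution (legal: the S1->S2 dependence still
       flows from the first loop to the second): */
    for (int i = 1; i < LEN; i++)
        a[i] = a[i-1] + b[i];   /* still serial                        */
    for (int i = 1; i < LEN; i++)
        c[i] = a[i] + d[i];     /* no cycle: the compiler can vectorize */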


Removing dependences
• Some dependences are not intrinsic to the computation but are an artifact of how the code is written (for example, the reuse of a scalar); rewriting the code removes them, and the loop can then be vectorized. Several such transformations are shown later.

Freezing Loops
• In a loop nest where only one loop carries the dependences, that loop can be "frozen" (executed serially) while the other loop is vectorized. The example in "Cycles in the DG (V)" below is handled this way: the outer loop runs serially and the inner loop is vectorized.

Changing the algorithm
• When there is a recurrence, it is necessary to change the algorithm in order to vectorize.
• Example: the prefix-sum recurrence

    S1: a[0] = b[0];
        for (i=1; i<LEN; i++)
          a[i] = a[i-1] + b[i];

• Other recurrences, such as computing a maximum (if (A[i] > max) max = A[i];), are handled similarly.
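One classic algorithm change for the prefix-sum recurrence above is the data-parallel doubling scheme of Hillis and Steele (cited in the references at the end of this tutorial): O(log n) passes, each of which is free of loop-carried dependences and therefore vectorizable. A sketch, assuming float *a and float *t point at length-n buffers and that a[] initially holds b[]:

    /* After pass k, a[i] holds the sum of the 2^k elements ending at i;
       after ceil(log2(n)) passes, a[i] is the full prefix sum b[0]+...+b[i]. */
    for (int step = 1; step < n; step *= 2) {
        for (int i = 0; i < n; i++)                 /* no loop-carried dependence */
            t[i] = (i >= step) ? a[i] + a[i - step] : a[i];
        float *tmp = a; a = t; t = tmp;             /* swap buffers */
    }

This does O(n log n) additions instead of O(n), which is the usual price of breaking the recurrence.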


Stripmining
• Stripmining is a simple transformation: it turns one loop into a doubly nested loop that traverses the iteration space in strips of a fixed size.

    for (i=1; i<n; i++) { ... }

  becomes

    for (k=1; k<n; k+=strip_size)
      for (i=k; i<min(k+strip_size, n); i++) { ... }

  strip_size is usually a small value (4, 8, 16 or 32).
• Stripmining is often used when vectorizing: the strip size is matched to the vector length, and each strip becomes one vector operation.
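As a concrete illustration of how stripmining combines with loop distribution during vectorization (a sketch, assuming float arrays a, b, c, d of length LEN with LEN a multiple of 4, and a 4-wide SIMD unit):

    /* Original loop:
         for (i = 0; i < LEN; i++) {
           a[i] = b[i] + c[i];    -- S1
           d[i] = a[i] + 1.0f;    -- S2
         }
       Strip-mined (strip = 4) and distributed inside each strip: */
    for (int i = 0; i < LEN; i += 4) {
        for (int j = i; j < i + 4; j++)   /* S1 strip: one vector op a[i:i+3] = b[i:i+3] + c[i:i+3] */
            a[j] = b[j] + c[j];
        for (int j = i; j < i + 4; j++)   /* S2 strip: one vector op d[i:i+3] = a[i:i+3] + 1 */
            d[j] = a[j] + 1.0f;
    }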

Loop Vectorization
• Loop vectorization is not always a legal and profitable transformation. The compiler needs to:
  – Compute the dependences
    • The compiler figures out dependences by solving a system of (integer) equations (with constraints), or by demonstrating that there is no solution to the system of equations
  – Remove cycles in the dependence graph
  – Determine data alignment
  – Determine that vectorization is profitable

Outline
1. Intro
2. Data Dependences (Definition)
3. Overcoming limitations to SIMD-Vectorization
   – Data Dependences
   – Data Alignment
   – Aliasing
   – Non-unit strides
   – Conditional Statements
4. Vectorization with intrinsics


Loop Vectorization
• When vectorizing a loop with several statements, the compiler needs to strip-mine the loop and then apply loop distribution: within each strip, all the instances of S1 execute as one vector operation, followed by all the instances of S2.

Dependence Graphs and Compiler Vectorization
• No dependences: vectorized as in the previous slides.
• Acyclic graphs:
  – All dependences are forward: vectorized keeping the statement order.
  – Some backward dependences: handled by reordering statements.
• Cycles in the dependence graph: require loop distribution or the other transformations listed earlier.


Acyclic Dependence Graphs: Forward Dependences
• A dependence is forward when its source statement appears before its sink in the loop body. Forward dependences do not prevent vectorization.

S113:
    for (i=0; i<LEN; i++) {
      S1: a[i] = b[i] + c[i];
      S2: d[i] = a[i] + (float) 1.0;
    }

  For every iteration i, the instance of S1 feeds the instance of S2 of the same iteration: a forward dependence S1 → S2.

S113 (Intel Nehalem): Loop was vectorized. Scalar: 10.2; Vector: 6.3; Speedup: 1.6
S113 (IBM Power 7): Loop was SIMD vectorized. Scalar: 3.1; Vector: 1.5; Speedup: 2.0

Acyclic Dependence Graphs: Backward Dependences (I)

S114:
    for (i=0; i<LEN; i++) {
      S1: a[i] = b[i] + c[i];
      S2: d[i] = a[i+1] + (float) 1.0;
    }

  Unrolled:
    i=0:  S1: a[0] = b[0] + c[0]
          S2: d[0] = a[1] + 1
    i=1:  S1: a[1] = b[1] + c[1]
          S2: d[1] = a[2] + 1

  S2 reads a[i+1] before S1 writes it in the next iteration: a backward (anti) dependence. This loop cannot be vectorized as it is.

S114_1 (reorder of statements):
    for (i=0; i<LEN; i++) {
      S2: d[i] = a[i+1] + (float) 1.0;
      S1: a[i] = b[i] + c[i];
    }

  After reordering, the dependence is forward and the loop can be vectorized.


S114 (Intel Nehalem): Loop was not vectorized. Existence of vector dependence. Scalar: 12.6; Vector: --; Speedup: --
S114_1 (Intel Nehalem): Loop was vectorized. Scalar: 10.7; Vector: 6.2; Speedup: 1.72. Speedup vs. non-reordered code: 2.03
S114 (IBM Power 7): Loop was SIMD vectorized. Scalar: 1.2; Vector: 0.6; Speedup: 2.0
S114_1 (IBM Power 7): Loop was SIMD vectorized. Scalar: 1.2; Vector: 0.6; Speedup: 2.0

Acyclic Dependence Graphs: Backward Dependences (II)

S214:
    for (int i = 1; i < LEN; i++) {
      S1: a[i] = d[i-1] + sqrt(c[i]);
      S2: d[i] = b[i] + sqrt(e[i]);
    }

  Unrolled:
    i=1:  S1: a[1] = d[0] + sqrt(c[1])
          S2: d[1] = b[1] + sqrt(e[1])
    i=2:  S1: a[2] = d[1] + sqrt(c[2])
          S2: d[2] = b[2] + sqrt(e[2])

  This loop cannot be vectorized as it is: S1 reads the d[] element written by S2 in the previous iteration, a backward dependence that statement reordering removes.

S214_1 (reorder of statements): S2 is placed before S1, making the dependence forward.

S214 (Intel Nehalem): Loop was not vectorized. Existence of vector dependence. Scalar: 7.6; Vector: --; Speedup: --
S214_1 (Intel Nehalem): Loop was vectorized. Scalar: 7.6; Vector: 3.8; Speedup: 2.0


IBM Power 7: the XLC compiler generated the same code in both cases (it reordered the statements itself). Loop was SIMD vectorized. Scalar: 3.3; Vector: 1.8; Speedup: 1.8

Cycles in the DG (I)
• S115 contains two statements whose dependences form a cycle (a loop independent dependence in one direction and a loop carried dependence in the other). This loop cannot be vectorized as it is, and the statements cannot simply be reordered.

S115 (Intel Nehalem): Loop was not vectorized. Existence of vector dependence. Scalar: 12.1; Vector: --; Speedup: --
S115 (IBM Power 7): Loop was SIMD vectorized. Scalar: 3.1; Vector: 2.2; Speedup: 1.4


Cycles in the DG (I), continued
• S215 rewrites S115 so that the two statements are no longer in a dependence cycle; the transformed loop can then be vectorized. (The slides show S115 and S215 side by side, with their dependence graphs and execution times.)


Cycles in the DG (II)
• A loop can be partially vectorized.

S116:
    for (int i=1; i<LEN; i++) {
      S1: ...
      S2: ...
      S3: ...
    }

  S1 can be vectorized; S2 and S3 cannot be vectorized (as they are).

S116 (Intel Nehalem): Loop was partially vectorized. Scalar: 14.7; Vector: 18.1; Speedup: 0.8
S116 (IBM Power 7): Loop was not SIMD vectorized because a data dependence prevents SIMD vectorization. Scalar: 13.5; Vector: --; Speedup: --

Cycles in the DG (III)
• S117 and S118 illustrate self dependences: a statement whose instances depend on each other forms a single-node cycle. A self true (flow) dependence prevents vectorization; a self anti dependence does not, because the vector loads read the old values before the vector stores overwrite them.


Cycles in the DG (IV)
• With a self true dependence of distance 1, the loop cannot be vectorized:

    a[1] = a[0] + b[1]
    a[2] = a[1] + b[2]
    a[3] = a[2] + b[3]

S119:
    for (int i=4; i<LEN; i++) {
      S1: a[i] = a[i-4] + b[i];
    }

  Unrolled:
    i=4:  a[4]  = a[0] + b[4]
    i=5:  a[5]  = a[1] + b[5]
    i=6:  a[6]  = a[2] + b[6]
    i=7:  a[7]  = a[3] + b[7]
    i=8:  a[8]  = a[4] + b[8]
    i=9:  a[9]  = a[5] + b[9]
    i=10: a[10] = a[6] + b[10]
    i=11: a[11] = a[7] + b[11]

• This one can be vectorized, because the dependence distance is 4, which is the number of iterations that the SIMD unit can execute simultaneously.

S119 (Intel Nehalem): Loop was vectorized. Scalar: 8.4; Vector: 3.9; Speedup: 2.1
S119 (IBM Power 7): Loop was SIMD vectorized. Scalar: 6.6; Vector: 1.8; Speedup: 3.7
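A sketch of why distance 4 works with a 4-wide SIMD unit (assuming a 16-byte-aligned float array a, LEN a multiple of 4, and #include <xmmintrin.h>; this is an illustration, not the compiler's exact output): each vector step reads a chunk that the previous vector step has already finished writing.

    /* a[i] = a[i-4] + b[i], four elements at a time */
    for (int i = 4; i < LEN; i += 4) {
        __m128 rPrev = _mm_load_ps(&a[i-4]);   /* completed by the previous step */
        __m128 rB    = _mm_load_ps(&b[i]);
        _mm_store_ps(&a[i], _mm_add_ps(rPrev, rB));
    }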


Cycles in the DG (V)

    for (int i = 0; i < LEN-1; i++) {
      for (int j = 0; j < LEN; j++)
        S1: a[i+1][j] = a[i][j] + b;
    }

• Can this loop be vectorized?
    i=0, j=0: a[1][0] = a[0][0] + b
         j=1: a[1][1] = a[0][1] + b
         j=2: a[1][2] = a[0][2] + b
    i=1, j=0: a[2][0] = a[1][0] + b
• Dependences occur only in the outermost loop:
  – the outer loop runs serially
  – the inner loop can be vectorized

S121:
    for (int i = 0; i < LEN-1; i++) {
      for (int j = 0; j < LEN; j++)
        a[i+1][j] = a[i][j] + 1;
    }

Cycles in the DG (VI)
• Cycles can appear because the compiler does not know if there are dependences. With a subscripted subscript such as a[r[i]], the question becomes: is there a value of i such that two iterations access the same element of a[]?


• The compiler is conservative: it only vectorizes when it can prove that it is safe.

S122 (Intel Nehalem): Loop was not vectorized. Existence of vector dependence. Scalar: 5.0; Vector: --; Speedup: --
S123 (Intel Nehalem): Partial loop was vectorized. Scalar: 5.8; Vector: 5.7; Speedup: 1.01

• Does the compiler use the info that r[i] = i to compute data dependences? The results above suggest that it does not.

Dependence Graphs and Compiler Vectorization
• No dependences: vectorized by the compiler.
• Acyclic graphs:
  – All dependences are forward: vectorized by the compiler.
  – Some backward dependences: the compiler may need to reorder statements.
• Cycles in the dependence graph: require loop distribution, dependence-removing transformations, programmer directives, or manual vectorization.


Loop Transformations
• Loop Distribution or loop fission
• Reordering Statements
• Node Splitting
• Scalar Expansion
• Loop Peeling
• Loop Fusion
• Loop Interchanging

Compiler Directives (I)
• When the compiler does not vectorize automatically due to dependences, the programmer can inform the compiler that it is safe to vectorize:
    #pragma ivdep                  (ICC compiler)
    #pragma ibm independent_loop   (XLC compiler)

Compiler Directives (I)
• This loop can be vectorized when k < -3 or k >= 0:

    for (int i=val; i<LEN; i++)
      a[i] = a[i+k] + b[i];

  – k >= 0: no dependence, or a self anti dependence (e.g., k=1: a[0]=a[1]+b[0]; a[1]=a[2]+b[1]) → can be vectorized.
  – -3 <= k <= -1: a self true dependence with distance smaller than the vector length → cannot be vectorized.
  – k < -3: the dependence distance is at least the vector length (as in S119) → can be vectorized.
• The programmer knows that k >= 0. How can the programmer tell the compiler that k >= 0?


S124 (original):
    for (int i=0; i<LEN; i++)
      a[i] = a[i+k] + b[i];

S124_1 (with directive):
    #pragma ivdep
    for (int i=0; i<LEN; i++)
      a[i] = a[i+k] + b[i];

S124_2 (with explicit test):
    if (k >= 0)
      for (int i=0; i<LEN; i++)
        a[i] = a[i+k] + b[i];
    else
      /* original loop */

S124 and S124_1 (IBM Power 7): Loop was not vectorized because a data dependence prevents SIMD vectorization. Scalar: 2.2; Vector: --; Speedup: --. (#pragma ibm independent_loop needs AIX; we ran the experiments on Linux.)

Compiler Directives (II)
• The programmer can disable vectorization of a loop when the vector code runs slower than the scalar code:
    #pragma novector   (ICC compiler)
    #pragma nosimd     (XLC compiler)


Compiler Directives (II)
• Vector code can run slower than scalar code.

S116:
    for (int i=1; i<LEN; i++) {
      S1: ...; S2: ...; S3: ...;
    }

  S1 can be vectorized; S2 and S3 cannot be vectorized (as they are), and vectorizing S1 alone does not pay off here:

S116 (Intel Nehalem): Loop was partially vectorized. Scalar: 14.7; Vector: 18.1; Speedup: 0.8

  Placing #pragma novector before the loop avoids this slowdown.

Loop Distribution
• It is also called loop fission: a loop with several statements is split into several loops, separating the statements that are in a dependence cycle (S126 is the original loop; S126_1 is the distributed version).


S126 (IBM Power 7): Loop was not SIMD vectorized. Scalar: 1.3; Vector: --; Speedup: --
S126_1 (IBM Power 7): Loop 1 was SIMD vectorized; loop 2 was not. Scalar: 1.14; Vector: 1.0; Speedup: 1.14

Reordering Statements
• When a backward dependence exists between two statements, exchanging their textual order turns it into a forward dependence, which does not prevent vectorization.

S114 (Intel Nehalem): Loop was not vectorized. Existence of vector dependence. Scalar: 12.6; Vector: --; Speedup: --
S114_1 (Intel Nehalem): Loop was vectorized. Scalar: 10.7; Vector: 6.2; Speedup: 1.7
S114 and S114_1 (IBM Power 7): The IBM XLC compiler generated the same code in both cases (it reordered the statements itself). Loop was SIMD vectorized. Scalar: 3.3; Vector: 1.8; Speedup: 1.8


Node Splitting
• Node splitting breaks a dependence cycle by copying the read that participates in the cycle into a temporary.

S126:
    for (int i=0; i<LEN-1; i++){
      S1: a[i] = b[i] + c[i];
      S2: d[i] = (a[i] + a[i+1]) * (float) 0.5;
    }

S126_1 (node splitting):
    for (int i=0; i<LEN-1; i++){
      S0: temp[i] = a[i+1];
      S1: a[i] = b[i] + c[i];
      S2: d[i] = (a[i] + temp[i]) * (float) 0.5;
    }

S126 (Intel Nehalem): Loop was not vectorized. Existence of vector dependence. Scalar: 12.6; Vector: --; Speedup: --
S126_1 (Intel Nehalem): Loop was vectorized. Scalar: 13.2; Vector: 9.7; Speedup: 1.3

S126 (IBM Power 7): Loop was SIMD vectorized. Scalar: 3.8; Vector: 1.7; Speedup: 2.2
S126_1 (IBM Power 7): Loop was SIMD vectorized. Scalar: 5.1; Vector: 2.4; Speedup: 2.0

Scalar Expansion
• A scalar that is written and read in every iteration creates loop carried anti and output dependences (as in Dependences in Loops (III)). Expanding the scalar into an array removes them.


S139 uses a scalar temporary inside the loop; S139_1 expands it into an array. Both compilers vectorize both versions (they perform scalar expansion, or the equivalent renaming, automatically):

S139 and S139_1 (Intel Nehalem): Loop was vectorized. Scalar: 0.7; Vector: 0.4; Speedup: 1.5
S139 and S139_1 (IBM Power 7): Loop was SIMD vectorized. Scalar: 0.28; Vector: 0.14; Speedup: 2.0

Loop Peeling
• Remove the first (or last) iteration(s) of the loop into separate code outside the loop.
• It is always legal, provided that no additional iterations are introduced.
• When the trip count of the loop is not constant, the peeled iteration has to be protected with additional runtime tests.
• This transformation is useful to enforce a particular initial memory alignment on array references prior to loop vectorization.

    for (i=0; i<N; i++)
      A[i] = B[i] + C[i];

  becomes

    if (N >= 1)
      A[0] = B[0] + C[0];
    for (i=1; i<N; i++)
      A[i] = B[i] + C[i];


S127:
    for (int i=0; i<LEN; i++)
      a[i] = a[i] + a[0];

  Unrolled:
    a[0] = a[0] + a[0]
    a[1] = a[1] + a[0]
    a[2] = a[2] + a[0]

  Iteration 0 writes a[0], which every later iteration reads: a self true dependence that is not vectorized.

S127_1 (peeling the first iteration):
    a[0] = a[0] + a[0];
    for (int i=1; i<LEN; i++)
      a[i] = a[i] + a[0];

  After loop peeling there are no dependences, and the loop can be vectorized.

S127 (Intel Nehalem): Loop was not vectorized. Existence of vector dependence. Scalar: 6.7; Vector: --; Speedup: --
S127_1 (Intel Nehalem): Loop was vectorized. Scalar: 6.6; Vector: 1.2; Speedup: 5.2

On Power 7, peeling alone is not enough; copying a[0] into a scalar temporary is also needed:

S127_2:
    a[0] = a[0] + a[0];
    float t = a[0];
    for (int i=1; i<LEN; i++)
      a[i] = a[i] + t;

S127 (IBM Power 7): Loop was not SIMD vectorized. Scalar: 2.4; Vector: --; Speedup: --
S127_1 (IBM Power 7): Loop was not SIMD vectorized. Scalar: 2.4; Vector: --; Speedup: --
S127_2 (IBM Power 7): Loop was vectorized. Scalar: 1.58; Vector: 0.62; Speedup: 2.54

Loop Interchanging
• This transformation switches the positions of one loop that is tightly nested within another loop.


• Example (S228):

    for (j=1; j<LEN; j++)
      for (i=1; i<LEN; i++)
        A[i][j] = A[i-1][j] + (float) 1.0;

  The inner loop cannot be vectorized because of the self true dependence on A in the i direction.

S228_1 (after interchanging):
    for (i=1; i<LEN; i++)
      for (j=1; j<LEN; j++)
        A[i][j] = A[i-1][j] + (float) 1.0;

  Unrolling the inner loop for i=3:
    j=1: A[3][1] = A[2][1] + 1
    j=2: A[3][2] = A[2][2] + 1
    j=3: A[3][3] = A[2][3] + 1

  The interchange is legal, and there are no dependences in the inner loop, which can now be vectorized.

S228 (Intel Nehalem): Loop was not vectorized. Scalar: 2.3; Vector: --; Speedup: --
S228_1 (Intel Nehalem): Loop was vectorized. Scalar: 0.6; Vector: 0.2; Speedup: 3


(The slides also report results for S228 and S228_1 on IBM Power 7.)

(Outline: the tutorial now turns to two special dependence patterns, reductions and induction variables.)

Reductions
• A reduction is an operation, such as addition, which is applied to the elements of an array to produce a result of a lesser rank.

Sum reduction (S131):
    sum = 0;
    for (int i=0; i<LEN; i++)
      sum += a[i];

Max loc reduction (S132):
    x = a[0];
    for (int i=0; i<LEN; i++){
      if (a[i] > x) {
        x = a[i];
        index = i;
      }
    }

S131 (Intel Nehalem): Loop was vectorized. Scalar: 5.2; Vector: 1.2; Speedup: 4.1
S132 (Intel Nehalem): Loop was vectorized. Scalar: 9.6; Vector: 2.4; Speedup: 3.9
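The usual way a compiler (or a programmer with intrinsics) vectorizes the sum reduction is to keep four partial sums in a vector register and combine them after the loop. A sketch, assuming a 16-byte-aligned float array with length a multiple of 4 (illustrative, not the compiler's exact output):

    #include <xmmintrin.h>
    #define LEN 1024   /* assumed array length, multiple of 4 */

    float sum_reduction(const float *a /* 16-byte aligned */) {
        __m128 rSum = _mm_setzero_ps();          /* four partial sums */
        for (int i = 0; i < LEN; i += 4)
            rSum = _mm_add_ps(rSum, _mm_load_ps(&a[i]));

        /* horizontal combine: store the four partials, add them serially */
        float p[4] __attribute__((aligned(16)));
        _mm_store_ps(p, rSum);
        return p[0] + p[1] + p[2] + p[3];
    }

Note that this reassociates the floating-point additions, which is why compilers typically need a relaxed floating-point model to do it automatically.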


Reductions, continued
• S141_1 restructures the reduction into blocks: an outer loop advances through the array in strips of 64 (for (int i = 0; i < LEN; i += 64) { for (int j = 0, k = i; ...; k++, j++) ... }), updating 64 partial results that are combined at the end. In this form the inner loop does not accumulate into a single scalar, and the compiler vectorizes it.

Induction Variables
• An induction variable is a variable that can be expressed as a function of the loop iteration variable.

S133:
    float s = (float) 0.0;
    for (int i=0; i<LEN; i++){
      s += (float) 2.;
      a[i] = s * b[i];
    }

S133_1 (the induction variable replaced by its closed form):
    for (int i=0; i<LEN; i++)
      a[i] = (float) 2. * (i+1) * b[i];
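A quick way to convince yourself that the rewrite is equivalent (a standalone check, not from the tutorial):

    #include <stdio.h>
    #define LEN 8

    int main(void) {
        float a1[LEN], a2[LEN], b[LEN];
        for (int i = 0; i < LEN; i++) b[i] = (float)(i + 1);

        float s = 0.0f;                       /* S133: induction variable */
        for (int i = 0; i < LEN; i++) { s += 2.0f; a1[i] = s * b[i]; }

        for (int i = 0; i < LEN; i++)         /* S133_1: closed form */
            a2[i] = 2.0f * (float)(i + 1) * b[i];

        for (int i = 0; i < LEN; i++)
            printf("%.1f %.1f\n", a1[i], a2[i]);   /* columns match */
        return 0;
    }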


S133 (Intel Nehalem): Loop was vectorized. Scalar: 6.1; Vector: 1.9; Speedup: 3.1
S133_1 (Intel Nehalem): Loop was vectorized. Scalar: 8.4; Vector: 1.9; Speedup: 4.2
S133 (IBM Power 7): Loop was not SIMD vectorized. Scalar: 2.7; Vector: --; Speedup: --
S133_1 (IBM Power 7): Loop was SIMD vectorized. Scalar: 3.7; Vector: 1.4; Speedup: 2.6

Induction Variables
• Coding style matters: S134 writes the loop by dereferencing and incrementing pointers, while S134_1 writes the same computation with array subscripts. These codes are equivalent, but the Intel compiler vectorizes only the subscripted form:

S134 (Intel Nehalem): Loop was not vectorized. Scalar: 5.5; Vector: --; Speedup: --
S134_1 (Intel Nehalem): Loop was vectorized. Scalar: 6.1; Vector: 3.2; Speedup: 1.8


S134 and S134_1 (IBM Power 7): Loop was SIMD vectorized in both cases. Scalar: 2.2; Vector: 1.0; Speedup: 2.2

Outline
1. Intro
2. Data Dependences (Definition)
3. Overcoming limitations to SIMD-Vectorization
   – Data Dependences
   – Data Alignment
   – Aliasing
   – Non-unit strides
   – Conditional Statements
4. Vectorization with intrinsics

Data Alignment
• Vector loads/stores move 128 consecutive bits to/from a vector register.
• Data addresses need to be 16-byte (128-bit) aligned to be loaded/stored efficiently:
  – Intel platforms support aligned and unaligned loads/stores
  – IBM platforms do not support unaligned loads/stores

Why data alignment may improve efficiency
• A vector load/store from aligned data requires one memory access.
• A vector load/store from unaligned data requires multiple memory accesses and some shift operations.

    void test1(float *a, float *b, float *c) {
      for (int i=0; i<LEN; i++)
        a[i] = b[i] + c[i];
    }

  Is &b[0] 16-byte aligned? (Figure: reading 4 consecutive elements starting at a misaligned address straddles two aligned blocks.)


Data Alignment
• To know if a pointer is 16-byte aligned, the last digit of the pointer address in hex must be 0.
• Note that if &b[0] is 16-byte aligned, and b is a single precision array, then &b[4] is also 16-byte aligned:

    __attribute__ ((aligned(16))) float B[1024];
    int main(){
      printf("%p, %p\n", &B[0], &B[4]);
    }
    Output: 0x7fff1e9d8580, 0x7fff1e9d8590

• In many cases, the compiler cannot statically know the alignment of the address in a pointer.
• The compiler assumes that the base address of the pointer is 16-byte aligned and adds a run-time check for it; if the runtime check fails, it falls back to other code (which may be scalar).
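A portable way to test the "last hex digit is 0" condition in code (an illustration; not from the tutorial):

    #include <stdint.h>
    #include <stdio.h>

    /* returns nonzero if p is 16-byte aligned */
    static int is_16B_aligned(const void *p) {
        return ((uintptr_t)p & 0xF) == 0;   /* low 4 address bits must be zero */
    }

    __attribute__ ((aligned(16))) float B[1024];

    int main(void) {
        printf("%d %d\n", is_16B_aligned(&B[0]), is_16B_aligned(&B[1]));  /* prints: 1 0 */
        return 0;
    }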

Data Alignment
• Manual 16-byte alignment can be achieved by forcing the base address to be a multiple of 16:

    __attribute__ ((aligned(16))) float b[N];
    float* a = (float*) memalign(16, N*sizeof(float));

• When the pointer is passed to a function, the compiler should be told where the 16-byte aligned address of the array starts:

    void func1(float *a, float *b, float *c) {
      __assume_aligned(a, 16);
      __assume_aligned(b, 16);
      __assume_aligned(c, 16);
      for (int i=0; i<LEN; i++)
        a[i] = b[i] + c[i];
    }

Data Alignment – Example
    float A[N] __attribute__((aligned(16)));
    float B[N] __attribute__((aligned(16)));
    float C[N] __attribute__((aligned(16)));

    void test(){
      for (int i = 0; i < N; i++){
        C[i] = A[i] + B[i];
      }}


Data Alignment – Example
    void test1(){                         void test2(){
      __m128 rA, rB, rC;                    __m128 rA, rB, rC;
      for (int i = 0; i < N; i+=4){         for (int i = 0; i < N; i+=4){
        rA = _mm_load_ps(&A[i]);              rA = _mm_loadu_ps(&A[i]);
        rB = _mm_load_ps(&B[i]);              rB = _mm_loadu_ps(&B[i]);
        rC = _mm_add_ps(rA,rB);               rC = _mm_add_ps(rA,rB);
        _mm_store_ps(&C[i], rC);              _mm_storeu_ps(&C[i], rC);
      }}                                    }}

    void test3(){
      __m128 rA, rB, rC;
      for (int i = 1; i < N-3; i+=4){
        rA = _mm_loadu_ps(&A[i]);
        rB = _mm_loadu_ps(&B[i]);
        rC = _mm_add_ps(rA,rB);
        _mm_storeu_ps(&C[i], rC);
      }}

Nanoseconds per iteration:
                            Core 2 Duo   Intel i7   Power 7
    Aligned                    0.577       0.580     0.156
    Aligned (unaligned ld)     0.689       0.581     0.241
    Unaligned                  2.176       0.629     0.243

Alignment in a struct
    struct st{
      char A;
      int B[64];
      float C;
      int D[64];
    };
    int main(){
      st s1;
      printf("%p, %p, %p, %p\n", &s1.A, s1.B, &s1.C, s1.D);
    }
    Output: 0x7fffe6765f00, 0x7fffe6765f04, 0x7fffe6766004, 0x7fffe6766008

• Arrays B and D are not 16-byte aligned (see the addresses).

Alignment in a struct (with explicit alignment)
    struct st{
      char A;
      int B[64] __attribute__ ((aligned(16)));
      float C;
      int D[64] __attribute__ ((aligned(16)));
    };
    int main(){
      st s1;
      printf("%p, %p, %p, %p\n", &s1.A, s1.B, &s1.C, s1.D);
    }
    Output: 0x7fff1e9d8580, 0x7fff1e9d8590, 0x7fff1e9d8690, 0x7fff1e9d86a0

• Arrays B and D are now aligned to 16 bytes (notice the 0 in the 4 least significant bits of the address).
• The compiler automatically does the padding.

(Outline: the tutorial now turns to aliasing.)


Aliasing
• Can the compiler vectorize this loop?

    void func1(float *a, float *b, float *c){
      for (int i = 0; i < LEN; i++)
        a[i] = b[i] + c[i];
    }

• What if it is called like this?

    float* a = &b[1];
    ...
    func1(a, b, c);

  Then a and b are aliased, and a[i] = b[i] + c[i] really means:
    b[1] = b[0] + c[0]
    b[2] = b[1] + c[1]
  The loop now carries a dependence.

Aliasing
• To vectorize, the compiler needs to guarantee that the pointers are not aliased.
• When the compiler does not know if two pointers alias, it can still vectorize, but it needs to add run-time checks (the number of checks grows with the number n of pointers). When the number of pointers is large, the compiler may decide not to vectorize.

    void func1(float *a, float *b, float *c){
      for (int i=0; i<LEN; i++)
        a[i] = b[i] + c[i];
    }


Aliasing
• Two solutions can be used to avoid the run-time checks:
  1. Static and global arrays
  2. The __restrict__ attribute

1. Static and global arrays:
    __attribute__ ((aligned(16))) float a[LEN];
    __attribute__ ((aligned(16))) float b[LEN];
    __attribute__ ((aligned(16))) float c[LEN];

    void func1(){
      for (int i=0; i<LEN; i++)
        a[i] = b[i] + c[i];
    }
    int main() {
      ...
      func1();
    }

2. The __restrict__ keyword:
    void func1(float* __restrict__ a, float* __restrict__ b,
               float* __restrict__ c) {
      __assume_aligned(a, 16);
      __assume_aligned(b, 16);
      __assume_aligned(c, 16);
      for (int i=0; i<LEN; i++)
        a[i] = b[i] + c[i];
    }

Aliasing – Multidimensional arrays
• Example with 2D arrays: pointer-to-pointer declaration.

    void func1(float** __restrict__ a, float** __restrict__ b,
               float** __restrict__ c) {
      for (int i=0; i<LEN; i++)
        for (int j=0; j<LEN; j++)
          a[i][j] = b[i][j] + c[i][j];
    }


Aliasing – Multidimensional arrays
• Whether compilers honor __restrict__ on pointer-to-pointer declarations varies:
  – The Intel ICC compiler, version 11.1, will vectorize this code.
  – Previous versions of the Intel compiler, or compilers from other vendors such as IBM XLC, will not vectorize it.

Aliasing – Multidimensional arrays
• Three solutions when __restrict__ does not enable vectorization:
  1. Static and global arrays
  2. Linearize the arrays and use the __restrict__ keyword
  3. Use compiler directives

1. Static and global declaration:
    __attribute__ ((aligned(16))) float a[N][N];
    void t(){
      ... a[i][j] ...
    }
    int main() {
      ...
      t();
    }


2. Linearize the arrays:
    void t(float* __restrict__ A){
      // Access to element A[i][j] is now A[i*128+j]
      ...
    }

3. Use compiler directives:
    #pragma ivdep      (Intel ICC)
    #pragma disjoint   (IBM XLC)

    void func1(float **a, float **b, float **c) {
      for (int i=0; i<LEN; i++) {
        #pragma ivdep
        for (int j=0; j<LEN; j++)
          a[i][j] = b[i][j] + c[i][j];
      }}
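A fuller sketch of option 2 (illustrative; the function name is hypothetical, and N is assumed to be the row length):

    /* The 2D array is stored as one contiguous block, so a single
       __restrict__ pointer covers all of it and indexing is explicit. */
    void add2d(float* __restrict__ a, const float* __restrict__ b,
               const float* __restrict__ c, int N) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i*N + j] = b[i*N + j] + c[i*N + j];  /* element (i,j) */
    }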

(Outline: the tutorial now turns to non-unit strides.)

Non-unit Stride – Example I
• Array of a struct:

    typedef struct{int x, y, z;} point;
    point pt[LEN];

    for (int i=0; i<LEN; i++) {
      pt[i].y *= scale;
    }


Non-unit Stride – Example I
• The y components are not contiguous in memory:

    point pt[N]:  x0 y0 z0 | x1 y1 z1 | x2 y2 z2 | x3 y3 z3
                   pt[0]      pt[1]      pt[2]      pt[3]

  The vector register needed is (y0 y1 y2 y3): gathering it requires a non-unit stride (stride-3) access.

• Separate arrays (a struct of arrays) make the accesses unit stride:

S135 (array of structs):
    typedef struct{int x, y, z;} point;
    point pt[LEN];

    for (int i=0; i<LEN; i++) {
      pt[i].y *= scale;
    }

S135_1 (separate arrays):
    int ptx[LEN], pty[LEN], ptz[LEN];

    for (int i=0; i<LEN; i++) {
      pty[i] *= scale;
    }


S135 (IBM Power 7): Loop was not SIMD vectorized because it is not profitable to vectorize. Scalar: 2.0; Vector: --; Speedup: --
S135_1 (IBM Power 7): Loop was SIMD vectorized. Scalar: 1.8; Vector: 1.5; Speedup: 1.2

Non-unit Stride – Example II

S136:
    for (int i=0; i<LEN; i++){
      sum = (float) 0.0;
      for (int j=0; j<LEN; j++){
        sum += A[j][i];        // column traversal of A: non-unit stride
      }
      B[i] = sum;
    }

• S136_1 and S136_2 interchange the loops (accumulating directly into B[i]) so that the inner loop walks the rows of A with unit stride.

S136 (Intel Nehalem): Loop was not vectorized. Vectorization possible but seems inefficient. Scalar: 3.7; Vector: --; Speedup: --
S136_1 and S136_2 (Intel Nehalem): Permuted loop was vectorized. Scalar: 1.6; Vector: 0.6; Speedup: 2.6
S136 (IBM Power 7): Loop was not SIMD vectorized. Scalar: 2.0; Vector: --; Speedup: --
S136_1 (IBM Power 7): Loop interchanging applied; loop was SIMD vectorized. Scalar: 0.4; Vector: 0.2; Speedup: 2.0
S136_2 (IBM Power 7): Loop interchanging applied; loop was SIMD vectorized. Scalar: 0.4; Vector: 0.16; Speedup: 2.7


(Outline: the tutorial now turns to conditional statements.)

Conditional Statements – I
• Loops with conditions need #pragma vector always:
  – Since the compiler does not know if vectorization will be profitable
  – The condition may be protecting against an exception

    #pragma vector always
    for (int i = 0; i < LEN; i++){
      if (c[i] < (float) 0.0)
        a[i] = a[i] * b[i] + d[i];
    }

S137 (Intel Nehalem): Loop was not vectorized. Condition may protect exception. Scalar: 10.4; Vector: --; Speedup: --
S137_1 (Intel Nehalem, with #pragma vector always): Loop was vectorized. Scalar: 10.4; Vector: 5.0; Speedup: 2.0
S137 (IBM Power 7): Loop was SIMD vectorized. Scalar: 4.0; Vector: 1.5; Speedup: 2.5
S137_1 (IBM Power 7, compiled with flag -qdebug=alwaysspec): Loop was SIMD vectorized. Scalar: 4.0; Vector: 1.5; Speedup: 2.5


Conditional Statements
• The compiler removes the if condition when generating vector code (if-conversion):

    for (int i=0; i<1024; i++){
      if (c[i] < (float) 0.0)
        a[i] = a[i]*b[i] + d[i];
    }

  becomes (Altivec):

    vector bool int rCmp;
    vector float r0 = {0.,0.,0.,0.};
    vector float rA, rB, rC, rD, rS, rT, rThen, rElse;
    for (int i=0; i<1024; i+=4){
      // load rA, rB, rC, and rD
      rCmp  = vec_cmplt(rC, r0);
      rT    = rA*rB + rD;
      rThen = vec_and(rT, rCmp);
      rElse = vec_andc(rA, rCmp);
      rS    = vec_or(rThen, rElse);
      // store rS
    }

  Worked example with rC = (2, -1, 1, -2): rCmp = (False, True, False, True); rThen keeps the newly computed values in the lanes where the condition is true, rElse keeps the old values of a[] in the other lanes, and rS merges the two.

• Speedups will depend on the values in c[]. The compiler tends to be conservative, as the condition may be protecting against segmentation faults.
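For reference, the same if-conversion written with Intel SSE intrinsics (a sketch under the same assumptions as the tutorial's SSE examples: 16-byte-aligned global float arrays a, b, c, d; this is an illustration, not compiler output):

    #include <xmmintrin.h>

    for (int i = 0; i < 1024; i += 4) {
        __m128 rA = _mm_load_ps(&a[i]);
        __m128 rB = _mm_load_ps(&b[i]);
        __m128 rC = _mm_load_ps(&c[i]);
        __m128 rD = _mm_load_ps(&d[i]);
        __m128 rMask = _mm_cmplt_ps(rC, _mm_setzero_ps());  /* c[i] < 0 ? all-ones : 0 */
        __m128 rT = _mm_add_ps(_mm_mul_ps(rA, rB), rD);     /* a*b + d              */
        __m128 rS = _mm_or_ps(_mm_and_ps(rMask, rT),        /* then-part            */
                              _mm_andnot_ps(rMask, rA));    /* else-part (old a)    */
        _mm_store_ps(&a[i], rS);
    }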

Compiler Directives
• The compiler vectorizes many loops, but many more can be vectorized if the appropriate directives are used.

Compiler Hints for Intel ICC              Semantics
#pragma ivdep                             ignore assumed data dependences
#pragma vector always                     override efficiency heuristics
#pragma novector                          disable vectorization
__restrict__                              assert exclusive access through pointer
__attribute__ ((aligned(int-val)))        request memory alignment
memalign(int-val, size)                   malloc aligned memory
__assume_aligned(exp, int-val)            assert alignment property


Compiler Hints for IBM XLC                Semantics
#pragma ibm independent_loop              ignore assumed data dependences
#pragma nosimd                            disable vectorization
__restrict__                              assert exclusive access through pointer
__attribute__ ((aligned(int-val)))        request memory alignment
memalign(int-val, size)                   malloc aligned memory
__alignx (int-val, exp)                   assert alignment property

(Outline: the tutorial now turns to vectorization with intrinsics.)

Accessing the SIMD units through intrinsics
• SSE can be accessed using intrinsics.
• Intrinsics are vendor/architecture specific.
• We will focus on the Intel vector intrinsics.
• Intrinsics are useful when:
  – the compiler fails to vectorize
  – the programmer thinks it is possible to generate better code than the one produced by the compiler

The Intel SSE intrinsics: header files
• You must use one of the following header files:
    #include <xmmintrin.h>   (for SSE)
    #include <emmintrin.h>   (for SSE2)
    #include <pmmintrin.h>   (for SSE3)
    #include <smmintrin.h>   (for SSE4)
• These include the prototypes of the intrinsics.


Intel SSE intrinsics: data types
• We will use the following data types:
    __m128    packed single precision (vector XMM register)
    __m128d   packed double precision (vector XMM register)
    __m128i   packed integer (vector XMM register)
• Example:
    #include <xmmintrin.h>
    int main ( ) {
      ...
      __m128 A, B, C;   /* three packed s.p. variables */
      ...
    }

Intel SSE intrinsic instructions
• Intrinsics operate on these types and have the format:
    _mm_instruction_suffix(…)
• The suffix can take many forms. Among them:
    ss    scalar single precision
    ps    packed (vector) single precision
    sd    scalar double precision
    pd    packed double precision
    si#   scalar integer (8, 16, 32, 64, 128 bits)
    su#   scalar unsigned integer (8, 16, 32, 64, 128 bits)
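To illustrate the naming convention (an illustration, not from the slides): _mm_add_ps adds four packed floats, while _mm_add_ss adds only the lowest lane of each operand and passes the upper three lanes of the first operand through:

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   /* lanes, low to high: 1 2 3 4 */
        __m128 b = _mm_set_ps(40.f, 30.f, 20.f, 10.f);

        float p[4], s[4];
        _mm_storeu_ps(p, _mm_add_ps(a, b));   /* 11 22 33 44 */
        _mm_storeu_ps(s, _mm_add_ss(a, b));   /* 11  2  3  4 */
        printf("%g %g %g %g\n", p[0], p[1], p[2], p[3]);
        printf("%g %g %g %g\n", s[0], s[1], s[2], s[3]);
        return 0;
    }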

Intel SSE intrinsics: instruction examples
• Load four 16-byte-aligned single precision values into a vector:
    float a[4] = {1.0, 2.0, 3.0, 4.0};   // a must be 16-byte aligned
    __m128 x = _mm_load_ps(a);
• Add two vectors containing four single precision values:
    __m128 a, b;
    __m128 c = _mm_add_ps(a, b);

Intrinsics (SSE) – the scalar loop
    for (int i = 0; i < n; i++) {
      c[i] = a[i]*b[i];
    }
  written with intrinsics:
    #include <xmmintrin.h>
    #define n 1024
    __attribute__((aligned(16))) float a[n], b[n], c[n];

    int main() {
      __m128 rA, rB, rC;
      for (int i = 0; i < n; i += 4) {
        rA = _mm_load_ps(&a[i]);
        rB = _mm_load_ps(&b[i]);
        rC = _mm_mul_ps(rA, rB);
        _mm_store_ps(&c[i], rC);
      }}


Intel SSE intrinsics: a complete example
• In vector notation, each intrinsic iteration computes c[i:i+3] = a[i:i+3] * b[i:i+3]:
    #include <xmmintrin.h>               // header file
    #define n 1024
    __attribute__((aligned(16))) float a[n], b[n], c[n];

    int main() {
      __m128 rA, rB, rC;                 // declare 3 vector registers
      for (int i = 0; i < n; i += 4) {
        rA = _mm_load_ps(&a[i]);
        rB = _mm_load_ps(&b[i]);
        rC = _mm_mul_ps(rA, rB);
        _mm_store_ps(&c[i], rC);
      }}

(A variant of the example uses n = 1000, which is not a multiple of 4; the leftover iterations must then be handled by a scalar epilogue loop.)

Node Splitting
    for (int i=0; i<LEN-1; i++){
      a[i] = b[i] + c[i];
      d[i] = (a[i] + a[i+1]) * (float) 0.5;
    }

  The next slides show how to write the node-split version of this loop directly with intrinsics.


Node Splitting with intrinsics
• S126_2 writes the node-split loop directly with SSE intrinsics; the reads of a[i+1] use unaligned loads.

S126 (Intel Nehalem): Loop was not vectorized. Existence of vector dependence. Scalar: 12.6; Vector: --; Speedup: --
S126_1 (Intel Nehalem): Loop was vectorized. Scalar: 13.2; Vector: 9.7; Speedup: 1.3
S126 (IBM Power 7): Loop was SIMD vectorized. Scalar: 3.8; Vector: 1.7; Speedup: 2.2
S126_1 (IBM Power 7): Loop was SIMD vectorized. Scalar: 5.1; Vector: 2.4; Speedup: 2.0

S126_2 (Intel Nehalem): Exec. Time intrinsics: 6.1; Speedup (versus vector code): 1.6
S126_2 (IBM Power 7): Exec. Time intrinsics: 1.6; Speedup (versus vector code): 1.5


Summary
• Microprocessor vector extensions can contribute to improved program performance, and the magnitude of this contribution is likely to increase in the future as vector lengths grow.
• Compilers are only partially successful at vectorizing.
• When the compiler fails, programmers can:
  – add compiler directives
  – apply loop transformations
• If after transforming the code the compiler still fails to vectorize (or the performance of the generated code is poor), the only option is to program the vector extensions directly using intrinsics or assembly language.

Data Dependences
• The correctness of many loop transformations, including vectorization, can be decided using dependences.
• A good introduction to the notion of dependence and its applications can be found in: D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence Graphs and Compiler Optimizations. POPL 1981.

Compiler Optimizations
• For a longer discussion see:
  – K. Kennedy and J. R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., 2002.
  – U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, Mass., 1988.
  – D. Padua and M. Wolfe. Advanced Compiler Optimizations for Supercomputers. Communications of the ACM, Vol. 29, No. 12, December 1986.
  – D. Bacon, S. Graham, and O. Sharp. Compiler Transformations for High-Performance Computing. ACM Computing Surveys, Vol. 26, No. 4, December 1994.

Algorithms
• W. Daniel Hillis and Guy L. Steele, Jr. Data Parallel Algorithms. Communications of the ACM, Vol. 29, No. 12 (December 1986), 1170-1183.
• Shyh-Ching Chen and D. J. Kuck. Time and Parallel Processor Bounds for Linear Recurrence Systems. IEEE Transactions on Computers, pp. 701-717, July 1975.


Thank you

Program Optimization Through Loop Vectorization
María Garzarán, Saeed Maleki, William Gropp, and David Padua
{garzaran,maleki,wgropp,padua}@illinois.edu
Department of Computer Science, University of Illinois at Urbana-Champaign

Questions?

Back-up Slides

Measuring execution time
    time1 = time();
    for (i=0; i<32000; i++)
      c[i] = a[i] + b[i];
    time2 = time();


Measuring execution times
• An outer loop that runs serially was added:
  – to increase the running time of the loop
• A call to a dummy() function, compiled separately, was added:
  – to avoid loop interchange or dead code elimination

    time1 = time();
    for (j=0; j<200000; j++){
      for (i=0; i<32000; i++)
        c[i] = a[i] + b[i];
      dummy();
    }
    time2 = time();
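A self-contained version of this harness (a sketch: the gettimeofday-based timer and the dummy() signature are assumptions consistent with, but not necessarily identical to, the tutorial's actual scripts):

    #include <stdio.h>
    #include <sys/time.h>

    #define N 32000
    #define REPS 200000

    float a[N], b[N], c[N];
    void dummy(float*, float*, float*);   /* defined in a separately compiled
                                             file so the compiler cannot see it */

    static double seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(void) {
        double t1 = seconds();
        for (int j = 0; j < REPS; j++) {
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];
            dummy(a, b, c);               /* defeats interchange / dead-code elim. */
        }
        double t2 = seconds();

        float ret = 0.0f;                 /* read the output so it is not dead */
        for (int i = 0; i < N; i++) ret += c[i];
        printf("Time %f, result %f\n", t2 - t1, ret);
        return 0;
    }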

Measuring execution times
• Access the elements of one output array and print the result:
  – to avoid dead code elimination

    time1 = time();
    for (j=0; j<200000; j++){
      for (i=0; i<32000; i++)
        c[i] = a[i] + b[i];
      dummy();
    }
    time2 = time();
    for (i=0; i<32000; i++)
      ret += c[i];
    printf("Time %f, result %f", (time2 - time1), ret);

Compiling
• Intel icc, scalar code:
    icc -O3 -no-vec dummy.o tsc.o -o runnovec
• Intel icc, vector code:
    icc -O3 -vec-report[n] -xSSE4.2 dummy.o tsc.o -o runvec
  [n] can be 0,1,2,3,4,5:
  – vec-report0: no report is generated
  – vec-report1: indicates the line numbers of the loops that were vectorized
  – vec-report2..5: a more detailed report that includes the loops that were not vectorized and the reason


Compiling
• IBM xlc flags:
    flags = -O3 -qaltivec -qhot -qarch=pwr7 -qtune=pwr7
            -qipa=malloc16 -qdebug=NSIMDCOST -qdebug=alwaysspec -qreport
• IBM xlc, scalar code:
    xlc -qnoenablevmx dummy.o tsc.o -o runnovec
• IBM xlc, vector code:
    xlc -qenablevmx dummy.o tsc.o -o runvec

Strip Mining
• This transformation improves locality and is usually combined with vectorization.

Strip Mining
    for (i=1; i<n; i++){
      S1;
      S2;
    }

  After strip mining and distributing inside each strip, S1 and S2 each execute on one strip at a time. strip_size is usually a small value (4, 8, 16 or 32).
• Plain loop distribution (without strip mining) would increase the cache miss ratio if array a[] is large; strip mining keeps the data of each strip in cache between the two statements.


Strip Mining
• Another example: a loop over int v[N] is strip mined so that each pass over a strip stays in cache.

What are microprocessor vector extensions?
• Instructions that execute on SIMD (Single Instruction Multiple Data) functional units.
(Figure: a 4-way SIMD addition computing x + y on four element pairs at once. Based on a slide provided by Markus Püschel.)

• They operate on short vectors of integers and floats (from 16-way 1-byte chars to 2-way doubles).
• Why are they here?
  – Useful: many applications (e.g., multimedia) feature the required fine-grain parallelism; code is potentially faster.
  – Doable: chip designers have enough transistors available; easy to implement.

Overview of Floating-Point Vector ISAs

Vendor    Name                  n-ways    Precision   Introduced with
Intel     SSE                   4-way     single      Pentium III
          SSE2                  + 2-way   double      Pentium 4
          SSE3                                        Pentium 4 (Prescott)
          SSSE3                                       Core Duo
          SSE4                                        Core2 Extreme (Penryn)
          AVX                   + 4-way   double      Sandy Bridge, 2011
Intel     IPF                   2-way     single      Itanium
AMD       3DNow!                2-way     single      K6
          Enhanced 3DNow!                             K7
          3DNow! Professional   + 4-way   single      AthlonXP
          AMD64                 + 2-way   double      Opteron
Motorola  AltiVec               4-way     single      MPC 7400 G4
IBM       VMX                   4-way     single      PowerPC 970 G5
          SPU                   + 2-way   double      Cell BE
          Double FPU            2-way     double      PowerPC 440 FP2

A similar architecture is found in game consoles (PS2, PS3) and GPUs (NVIDIA's GeForce).
(Based on a slide provided by Markus Püschel.)

Evolution of Intel Vector Instructions
• MMX (1996, Pentium): CPU-based MPEG decoding. Integers only, 64-bit divided into 2 x 32 to 8 x 8. Phased out with SSE4.
• SSE (1999, Pentium III): CPU-based 3D graphics. 4-way float operations, single precision. 8 new 128-bit registers, 100+ instructions.
• SSE2 (2001, Pentium 4): high-performance computing. Adds 2-way float ops, double precision; same registers as 4-way single precision. Integer SSE instructions make MMX obsolete.
• SSE3 (2004, Pentium 4E Prescott): scientific computing. New 2-way and 4-way vector instructions for complex arithmetic.
• SSSE3 (2006, Core Duo): minor advancement over SSE3.
• SSE4 (2007, Core2 Duo Penryn): modern codecs, cryptography. New integer instructions. Better support for unaligned data, super shuffle engine.
More details at http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

Run-Time Symbolic Resolution

    for (i=0; i<LEN; i++)
      a[i+t] = a[i] + b[i];

• If t > 0, there is a self true dependence and the loop cannot be vectorized (t=1: a[4] = a[3] + b[3], ...).
• If t <= 0, there is no dependence or only a self anti dependence, and the loop can be vectorized (t=-1: a[2] = a[3] + b[3]; a[3] = a[4] + b[4]; a[4] = a[5] + b[5]).
• The compiler can resolve the ambiguity at run time by testing t and branching to a vectorized or a scalar version.

Loop Vectorization – Example I

S113:
    for (i=0; i<LEN; i++) {
      a[i] = b[i] + c[i];
      d[i] = a[i] + (float) 1.0;
    }

S113_1 (distributed):
    for (i=0; i<LEN; i++)
      a[i] = b[i] + c[i];
    for (i=0; i<LEN; i++)
      d[i] = a[i] + (float) 1.0;

S113 (Intel Nehalem): Loop was vectorized. Scalar: 10.2; Vector: 6.3; Speedup: 1.6
S113_1 (Intel Nehalem): Fused loop was vectorized. Scalar: 10.2; Vector: 6.3; Speedup: 1.6
(The Intel compiler vectorized both versions, fusing S113_1 back into one loop.)


S113 (IBM Power 7): Loop was SIMD vectorized. Scalar: 3.1; Vector: 1.5; Speedup: 2.0
S113_1 (IBM Power 7): Loop was SIMD vectorized. Scalar: 3.7; Vector: 2.3; Speedup: 1.6

How do we access the SIMD units?
• Three choices:
  1. C code and a vectorizing compiler:
        for (i=0; i<32000; i++)
          c[i] = a[i] + b[i];
  2. Macros or vector intrinsics:
        void example(){
          __m128 rA, rB, rC;
          for (int i = 0; i < 32000; i+=4){
            rA = _mm_load_ps(&a[i]);
            rB = _mm_load_ps(&b[i]);
            rC = _mm_add_ps(rA, rB);
            _mm_store_ps(&c[i], rC);
          }}
  3. Assembly language.


Why should the compiler vectorize?
1. Easier
2. Portable across vendors and across generations of the same class of machines
   – Although compiler directives may differ across compilers
3. Better performance than programmer-generated vector code
   – The compiler applies other transformations such as loop unrolling and instruction scheduling:

    ..B8.5
      movaps a(,%rdx,4), %xmm0
      movaps 16+a(,%rdx,4), %xmm1
      addps  b(,%rdx,4), %xmm0
      addps  16+b(,%rdx,4), %xmm1
      movaps %xmm0, c(,%rdx,4)
      movaps %xmm1, 16+c(,%rdx,4)
      addq   $8, %rdx
      cmpq   $32000, %rdx
      jl     ..B8.5

Compilers make your codes (almost) machine independent.
But compilers fail:
- Programmers need to provide the necessary information
- Programmers need to transform the code

How well do compilers vectorize?
• Results for the Test Suite for Vectorizing Compilers by David Callahan, Jack Dongarra, and David Levine, using the IBM XLC compiler, version 11:

Loops: Total                         159
Loops: Auto vectorized                74
Loops: Not vectorized                 85
  Vectorizable by hand                60
    Classic transformation applied    35
    New transformation applied         6
    Manual vectorization              19
  Impossible to vectorize             25
    Non-unit stride access            16
    Data dependence                    5
    Other                              4

Percentages: vectorizable 84.3%; vectorized automatically 46.5%; classic transformation applied 22.0%; new transformation applied 3.8%; manual vector code 11.9%.


Terminology What are the speedups? Transformations Explanation Classic transformation • Loop Interchange • Speedups obtained by XLC compiler (source level) • Scalar Expansion • Scalar and Array Renaming • Node Splitting • Reduction Test Suite Collection Average Speed Up • Loop Peeling Automatically by XLC 1.73 • Loop Distribution • Run-Time Symbolic Resolution By adding classic transformations 3.48 • Speculating Conditional Statements By adding new transformations 3.64 New Transformation • Manually vectorized Matrix Transposition By adding manual vectorization 3.78 (Intrinsics) • Manually vectorized Prefix Sum Manual Transformation • Auto vectorization is inefficient (Intrinsics) • Vectorization of the transformed code is inefficient • No transformation found to enable auto vectorization


Why did the compilers fail to vectorize?

Issues                                  ICC   XLC
Vectorizable but not auto-vectorized    59    60
  Cyclic data dependence                 9     8
  Non-unit stride access                 6     7
  Conditional statement                  6     9
  Aliasing                               5     0
  Acyclic data dependence                4     0
  Reduction                              4     5
  Unsupported loop structure             4     9
  Loop interchange                       4     2
  Scalar expansion                       3     4
  Wrap-around variable                   1     4
  Other                                 13    12

XLC and ICC Comparison

Reasons                       XLC but not ICC   ICC but not XLC
Vectorized                          25                26
  Aliasing                           5                 0
  Acyclic data dependence            4                 0
  Unrolled loop                      3                 0
  Loop interchange                   2                 0
  Conditional statements             4                 7
  Unsupported loop structure         0                 5
  Wrap-around variable               0                 3
  Reduction                          1                 2
  Non-unit stride access             1                 2
  Other                              5                 6


Another example: matrix-matrix multiplication

S111
    void MMM(float** a, float** b, float** c) {
      for (int i = 0; i < LEN2; i++)
        for (int j = 0; j < LEN2; j++) {
          c[i][j] = (float)0.;
          for (int k = 0; k < LEN2; k++)
            c[i][j] += a[i][k] * b[k][j];
        }
    }

S111_1 (added compiler directives)
    void MMM(float** __restrict__ a, float** __restrict__ b,
             float** __restrict__ c) {
      for (int i = 0; i < LEN2; i++)
        for (int j = 0; j < LEN2; j++) {
          c[i][j] = (float)0.;
          for (int k = 0; k < LEN2; k++)
            c[i][j] += a[i][k] * b[k][j];
        }
    }
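The __restrict__ qualifiers are what matter in S111_1: without them the compiler must assume that a, b, and c may point to overlapping storage and therefore cannot prove the loop iterations independent. A minimal sketch of the same idea on one-dimensional arrays (the function name is ours; __restrict__ is the GCC/ICC/XLC spelling of C99's restrict):

    /* Without restrict, x[i] = y[i] + z[i] might overwrite data that a
       later iteration reads through y or z, so the compiler must be
       conservative. */
    void vadd(float* __restrict__ x,
              const float* __restrict__ y,
              const float* __restrict__ z, int n) {
        for (int i = 0; i < n; i++)
            x[i] = y[i] + z[i];   /* no overlap promised: safe to vectorize */
    }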


Another example: matrix-matrix multiplication

S111_3 (loop interchange)
    void MMM(float** __restrict__ a, float** __restrict__ b,
             float** __restrict__ c) {
      for (int i = 0; i < LEN2; i++)
        for (int j = 0; j < LEN2; j++)
          c[i][j] = (float)0.;
      for (int i = 0; i < LEN2; i++)
        for (int k = 0; k < LEN2; k++)
          for (int j = 0; j < LEN2; j++)
            c[i][j] += a[i][k] * b[k][j];
    }

Definition of Dependence

• A statement S is said to be data dependent on another statement T if
  – S accesses the same data being accessed by an earlier execution of T, and
  – S, T, or both write the data.


Data Dependence

Flow dependence (true dependence):
    S1: X = A + B
    S2: C = X + A      (S2 reads the value of X written by S1)

Dependences in Loops

• Dependences in loops are easy to understand if the loops are unrolled. The dependences are then between statement "instances":
    for (i=0; i<LEN; i++) …
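The slide illustrates flow (true) dependence; for completeness, here is a small straight-line sketch covering the three classic kinds, where S1 and S2 mirror the slide and S3 and S4 are our additions:

    void dependence_kinds(void) {
        float A = 1.0f, B = 2.0f, X, C;
        X = A + B;     /* S1: writes X */
        C = X + A;     /* S2: reads X written by S1 -> flow (true) dependence S1->S2 */
        A = C + 1.0f;  /* S3: overwrites A read by S2 -> anti dependence S2->S3 */
        X = B + 3.0f;  /* S4: rewrites X written by S1 -> output dependence S1->S4 */
        (void)X; (void)A;  /* suppress unused-value warnings */
    }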


Dependences in Loops

• Dependences in loops are easy to understand if the loops are unrolled; the dependences are then between statement instances.
• A slightly more complex example (shown unrolled on the slide):
    for (i=1; i<LEN; i++) …


Dependences in Loops

• A slightly more complex example (continued, unrolled on the slide):
    for (i=1; i<LEN; i++) …
• Even more complex examples follow the same idea: unroll the loop, then look at the dependences between statement instances.


Dependences in Loops

• Even more complex (shown unrolled on the slide):
    for (i=0; i<LEN; i++) …
• Two dimensional:
    for (i=1; i<LEN; i++) …


Dependences in Loops

• Two dimensional (shown unrolled on the slide):
    for (i=1; i<…; i++) …
• Another two dimensional loop, also shown unrolled on the slide.
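The two-dimensional loops are cut off in our copy of the slides, so here is a hypothetical nest with the same flavor, showing a dependence carried only by the outer loop:

    #define N 100
    static float A[N][N];

    /* A[i][j] reads A[i-1][j]: the value written in outer iteration i-1
       is consumed in iteration i, a dependence with distance vector (1,0).
       The i loop carries the dependence; iterations of the inner j loop
       are mutually independent, so the j loop can be vectorized. */
    void sweep(void) {
        for (int i = 1; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = A[i-1][j] + 1.0f;
    }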


Dependences in Loops

[Figure: chains of S1/S2 statement instances across iterations, with arcs labeled "=" and "<" for the dependence directions]

• The representation of these dependence graphs inside the compiler is a "collapsed" version of the graph, where the arcs are annotated with directions (or distances) to reduce ambiguities.
• In the collapsed version, each statement is represented by a node, and each ordered pair of variable accesses is represented by an arc.


Loop Vectorization

• When the loop has several statements, it is better to first strip-mine the loop and then distribute it.
• strip_size is usually a small value (4, 8, 16, 32).
• The slides compare three versions of the same loop, as sketched below: scalar code, scalar code with the loops distributed, and scalar code with the loops strip-mined and then distributed:
    for (i=0; i<LEN; i++) …
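A sketch of the strip-mine-then-distribute sequence described above, on a two-statement loop of our own invention (strip size 4 matches one SSE register of floats; LEN is assumed to be a multiple of the strip size):

    #define LEN   1024
    #define STRIP 4
    static float a[LEN], b[LEN], c[LEN], d[LEN];

    void strip_mine_and_distribute(void) {
        /* The outer loop walks strips; the two inner loops are the
           distributed statements.  Distributing inside a strip keeps the
           values produced by S1 close (in cache/registers) to their use
           in S2, unlike distributing over the whole iteration space. */
        for (int i = 0; i < LEN; i += STRIP) {
            for (int j = i; j < i + STRIP; j++)  /* S1: vectorizable */
                a[j] = b[j] + 1.0f;
            for (int j = i; j < i + STRIP; j++)  /* S2: vectorizable */
                c[j] = a[j] * d[j];
        }
    }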

Loop Vectorization

S112 (original), S112_1 (loops distributed), and S112_2 (loops strip-mined and distributed):
    for (i=0; i<LEN; i++) …

S112 and S112_2 – Intel Nehalem
Compiler report: Loop was vectorized
Exec. Time scalar code: 9.6
Exec. Time vector code: 5.5
Speedup: 1.7

S112_1 – Intel Nehalem
Compiler report: Fused loop was vectorized
Exec. Time scalar code: 9.6
Exec. Time vector code: 5.5
Speedup: 1.7

IBM Power 7 (compiler report for all three versions: Loop was SIMD vectorized)
    S112:   scalar 2.7, vector 1.2, speedup 2.1
    S112_1: scalar 2.7, vector 1.2, speedup 2.1
    S112_2: scalar 2.9, vector 1.4, speedup 2.07


Loop Vectorization

• Our observations …

Loop Vectorization – Example II

S114
    for (i=0; i<LEN; i++) …

S114 – Intel Nehalem
Compiler report: Loop was not vectorized. Existence of vector dependence
Exec. Time scalar code: 12.6
Exec. Time vector code: --
Speedup: --

S114 – IBM Power 7
Compiler report: Loop was SIMD vectorized
Exec. Time scalar code: 3.3
Exec. Time vector code: 1.8
Speedup: 1.8


Loop Vectorization – Example II (continued)

S114
• We have observed that ICC usually vectorizes only …
    for (i=0; i<LEN; i++) …

Loop Vectorization – Example III

• A loop can be partially vectorized:
    for (i=0; i<LEN; i++) …


Loop Vectorization – Example III (continued)

• A loop can be partially vectorized:
    for (int i=1; i<LEN; i++) {
      …   /* S1 */
      …   /* S2 */
      …   /* S3 */
    }
• S1 can be vectorized.
• S2 and S3 cannot be vectorized: a loop with a cycle in the dependence graph cannot be vectorized.
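The loop body is truncated in our copy, so the following hypothetical loop with the same structure shows the transformation: S1 has no cross-iteration dependence and is distributed into its own vectorizable loop, while the S2/S3 recurrence stays scalar.

    #define LEN 1024
    static float a[LEN], b[LEN], c[LEN], d[LEN], e[LEN];

    void partially_vectorize(void) {
        /* S1: iterations independent -> this loop can be vectorized */
        for (int i = 1; i < LEN; i++)
            a[i] = b[i] + c[i];

        /* S2/S3: d[i] uses e[i-1] and e[i] uses d[i]; the cycle in the
           dependence graph is loop-carried, so this loop stays scalar.
           Distribution is legal because S1 still executes before the
           uses of a[i] in S2. */
        for (int i = 1; i < LEN; i++) {
            d[i] = a[i] + e[i-1];   /* S2 */
            e[i] = d[i] + c[i];     /* S3 */
        }
    }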


Loop Vectorization – Example IV

S116 / S116_1
    for (int i=1; i<LEN; i++) …


Cycles in the DG – Example V

• The compiler needs to resolve a system of equations to determine if there are dependences.

Loop Interchanging

    for (j=0; j<…; j++) …
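As an example of such an equation: the accesses a[2*i] and a[2*i+1] can conflict only if 2*i1 = 2*i2 + 1 has an integer solution, which it does not, so there is no dependence. The classic GCD test below is a minimal sketch of this reasoning (the function names are ours, not compiler internals):

    #include <stdlib.h>   /* abs */

    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* GCD test: for the accesses a[x*i + k1] and a[y*i + k2], the equation
       x*i1 + k1 = y*i2 + k2 can have an integer solution only if
       gcd(x, y) divides k2 - k1.  Returns 1 if a dependence MAY exist. */
    int may_depend(int x, int k1, int y, int k2) {
        int g = gcd(abs(x), abs(y));
        if (g == 0) return k1 == k2;    /* both subscripts are constants */
        return (k2 - k1) % g == 0;
    }

    /* Example: a[2*i] vs a[2*i+1] -> may_depend(2, 0, 2, 1) == 0,
       because gcd(2,2) = 2 does not divide 1: no dependence. */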


Loop Interchanging

• The slides show a loop nest and its iteration-space diagram (iterations plotted on the i and j axes with their dependences):
    for (j=0; j<…; j++) …



Loop Interchanging

S128
    for (j=1; j<…; j++) …


Loop Interchanging

    for (j=0; j<…; j++) …
(iteration-space diagram on the slide)



Loop Interchanging

S129 / S129_1
    for (j=0; j<…; j++) …

S129 – Intel Nehalem
Compiler report: Permuted loop was vectorized
Exec. Time scalar code: 0.4
Exec. Time vector code: 0.1
Speedup: 4

S129_1 – Intel Nehalem
Compiler report: Loop was vectorized
Exec. Time scalar code: 0.4
Exec. Time vector code: 0.1
Speedup: 4

S129 – IBM Power 7
Compiler report: Loop interchanging applied. Loop was SIMD vectorized
Exec. Time scalar code: 0.5
Exec. Time vector code: 0.1
Speedup: 5

S129_1 – IBM Power 7
Compiler report: Loop was SIMD vectorized
Exec. Time scalar code: 0.5
Exec. Time vector code: 0.1
Speedup: 5
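A sketch of the transformation behind these numbers (array names and sizes are ours; the S129 source is truncated in our copy): interchanging the loops makes the innermost loop traverse consecutive memory in C's row-major layout, giving the vectorizer unit-stride accesses.

    #define N 1000
    static float A[N][N], B[N][N];

    /* Before: the inner i loop touches A[i][j] with a stride of N floats,
       a poor candidate for vectorization. */
    void before_interchange(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                A[i][j] = B[i][j] + 1.0f;
    }

    /* After interchange: the inner j loop is unit-stride and vectorizes. */
    void after_interchange(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = B[i][j] + 1.0f;
    }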

Intel SSE intrinsics – Instruction examples

• Add two vectors containing four single-precision values:
    __m128 a, b;
    __m128 c = _mm_add_ps(a, b);
• Multiply two vectors containing four floats:
    __m128 a, b;
    __m128 c = _mm_mul_ps(a, b);
• Add two vectors containing two doubles:
    __m128d x, y;
    __m128d z = _mm_add_pd(x, y);
• Add two vectors of 8 16-bit signed integers using saturation arithmetic (*):
    __m128i r, s;
    __m128i t = _mm_adds_epi16(r, s);
• Compare two vectors of 16 8-bit signed integers:
    __m128i a, b;
    __m128i c = _mm_cmpgt_epi8(a, b);

(*) In saturation arithmetic, all operations such as addition and multiplication are limited to a fixed range between a minimum and a maximum value. If the result of an operation is greater than the maximum, it is set ("clamped") to the maximum; if it is below the minimum, it is clamped to the minimum. [From Wikipedia]
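A self-contained program exercising the saturating add above (the demo values are ours; emmintrin.h provides the SSE2 integer intrinsics):

    #include <emmintrin.h>   /* SSE2: __m128i, _mm_adds_epi16, ... */
    #include <stdio.h>

    int main(void) {
        /* 8 signed 16-bit lanes; 32000 + 1000 would overflow a plain int16 */
        __m128i r = _mm_set1_epi16(32000);
        __m128i s = _mm_set1_epi16(1000);
        __m128i t = _mm_adds_epi16(r, s);    /* saturating add */

        short out[8];
        _mm_storeu_si128((__m128i*)out, t);  /* unaligned store of the vector */
        printf("%d\n", out[0]);              /* prints 32767: clamped to max */
        return 0;
    }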


Blue Waters

[Photo of the Blue Waters system. Source: Thom Dunning, Blue Waters Project]

Compiler work

• Illinois has a long tradition in vectorization.
• Most recent work: vectorization for Blue Waters.

Communications of the ACM, December 1986, Volume 29, Number 12

Compiler work (continued)

[Page from Communications of the ACM, December 1986, Volume 29, Number 12]

Conditional Statements-II

    for (int i = 0; i < LEN; ++i) {
      if (c[i] == 0)
        a[i] = CHEAP_FUNC(d[i]);
      else
        a[i] = EXPENSIVE_FUNC(b[i])*c[i] + CHEAP_FUNC(d[i]);
    }

• The programmer wrote the condition to reduce the amount of computation, knowing that the condition holds most of the time.
• Options:
  1) Remove the condition.
  2) Leave it as it is.


Conditional Statements-II

Removing the condition splits the loop in two:

    for (int i = 0; i < LEN; ++i)
      a[i] = CHEAP_FUNC(d[i]);
    #pragma novector
    for (int i = 0; i < LEN; ++i) {
      if (c[i] != 0)
        a[i] += EXPENSIVE_FUNC(b[i])*c[i];
    }

S138
    for (int i = 0; i < 1024; ++i) {
      if (c[i] == (float)0.0)
        a[i] = d[i]*(float)2.0;
      else
        a[i] = c[i]/e[i]/b[i]/d[i]/a[i] + d[i]*(float)2.0;
    }

S138_1
    for (int i = 0; i < 1024; ++i)
      a[i] = d[i]*(float)2.0;
    #pragma novector
    for (int i = 0; i < 1024; ++i) {
      if (c[i] != (float)0.0)
        a[i] += c[i]/e[i]/b[i]/d[i]/a[i];
    }

S138 – Intel Nehalem
Compiler report: Loop was vectorized
Exec. Time scalar code: 1.01
Exec. Time vector code: 1.09
Speedup: 0.9

S138_1 – Intel Nehalem
Compiler report: Loop 1 was vectorized; Loop 2 was not vectorized (#pragma novector used)
Exec. Time scalar code: 0.99
Exec. Time vector code: 0.97
Speedup: 1.0

Performance is input dependent.

Conditional Statements-II

On IBM Power 7, S138_1 uses #pragma nosimd instead of #pragma novector:

    for (int i = 0; i < 1024; ++i)
      a[i] = d[i]*(float)2.0;
    #pragma nosimd
    for (int i = 0; i < 1024; ++i) {
      if (c[i] != (float)0.0)
        a[i] += c[i]/e[i]/b[i]/d[i]/a[i];
    }

S138 – IBM Power 7
Compiler report: Loop was not SIMD vectorized because it is not profitable
Exec. Time scalar code: 2.1
Exec. Time vector code: --
Speedup: --

S138_1 – IBM Power 7
Compiler report: Loop was SIMD vectorized
Exec. Time scalar code: 2.06
Exec. Time vector code: 2.6
Speedup: 0.8
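When a compiler does vectorize a conditional like S138, it typically if-converts: both branches are computed for all lanes and a comparison mask selects the result per lane, which is why vectorization pays off only when the saved work outweighs the duplicated work. A hand-written sketch of that idea with SSE intrinsics, using a simplified body rather than the S138 division chain:

    #include <xmmintrin.h>

    #define LEN 1024
    __attribute__((aligned(16))) static float a[LEN], c[LEN], d[LEN];

    /* Vector if-conversion of:
         if (c[i] == 0.0f) a[i] = d[i]*2.0f; else a[i] = d[i]*2.0f + c[i];
       Both branches are evaluated; a mask picks the right value per lane. */
    void if_convert(void) {
        const __m128 zero = _mm_setzero_ps();
        const __m128 two  = _mm_set1_ps(2.0f);
        for (int i = 0; i < LEN; i += 4) {
            __m128 vc   = _mm_load_ps(&c[i]);
            __m128 vd2  = _mm_mul_ps(_mm_load_ps(&d[i]), two);
            __m128 els  = _mm_add_ps(vd2, vc);     /* "else" branch value  */
            __m128 mask = _mm_cmpeq_ps(vc, zero);  /* all-ones where c[i]==0 */
            /* result = (mask & then) | (~mask & else), with then == vd2 */
            __m128 res  = _mm_or_ps(_mm_and_ps(mask, vd2),
                                    _mm_andnot_ps(mask, els));
            _mm_store_ps(&a[i], res);
        }
    }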
