3/4/2013
Program Optimization Through Loop Vectorization
Materials for this tutorial can be found at: http://polaris.cs.uiuc.edu/~garzaran/pldi-polv.zip
María Garzarán, Saeed Maleki, William Gropp, and David Padua
Department of Computer Science, University of Illinois at Urbana-Champaign
Questions? Send an email to [email protected]
Topics covered in this tutorial
• What are the microprocessor vector extensions, or SIMD (Single Instruction, Multiple Data) units
• How to use them
  – Through the compiler, via automatic vectorization
    • Manual transformations that enable vectorization
    • Directives to guide the compiler
  – Through intrinsics
• The main focus is on vectorizing through the compiler:
  – Code is more readable
  – Code is portable

Outline
1. Intro
2. Data Dependences (Definition)
3. Overcoming limitations to SIMD-Vectorization
   – Data Dependences
   – Data Alignment
   – Aliasing
   – Non-unit strides
   – Conditional Statements
4. Vectorization with intrinsics
Simple Example
• Loop vectorization transforms a program so that the same operation is performed at the same time on several vector elements.

    for (i=0; i<LEN; i++)
        c[i] = a[i] + b[i];

  Scalar code loads one element at a time (ld r1, addr1); vector code loads several elements with one instruction (ldv vr1, addr1).

SIMD Vectorization
• The use of SIMD units can speed up the program.
• Intel SSE and IBM Altivec have 128-bit vector registers and functional units, holding:
  – 4 32-bit single precision floating point numbers
  – 8 16-bit integers or shorts
  – 16 8-bit bytes or chars
• Assuming a single ALU, these SIMD units can execute 4 single precision floating point operations (or 2 double precision operations) in the time it takes a scalar unit to do only one of these operations.

[Figure: a register file feeding a scalar unit and a vector unit; each 32-bit lane computes Z1 = X1 + Y1.]
Executing Our Simple Example

S000
    for (i=0; i<LEN; i++)
        c[i] = a[i] + b[i];

    Intel Nehalem: Exec. Time scalar code: 6.1; vector code: 3.2.
    IBM Power 7:   Exec. Time scalar code: 2.1; vector code: 1.0.

How do we access the SIMD units?
• Three choices:

1. C code and a vectorizing compiler:

    for (int i = 0; i < LEN; i++)
        c[i] = a[i] + b[i];

2. Macros or vector intrinsics:

    void example(){
        __m128 rA, rB, rC;
        for (int i = 0; i < LEN; i+=4){
            rA = _mm_load_ps(&a[i]);
            rB = _mm_load_ps(&b[i]);
            rC = _mm_add_ps(rA, rB);
            _mm_store_ps(&c[i], rC);
        }
    }

3. Assembly language:

    ..B8.5
        movaps    a(,%rdx,4), %xmm0
        addps     b(,%rdx,4), %xmm0
        movaps    %xmm0, c(,%rdx,4)
        addq      $4, %rdx
        cmpq      $rdi, %rdx
        jl        ..B8.5

Why should the compiler vectorize?
1. Easier.
2. Portable across vendors and machines
   – although compiler directives differ across compilers.
3. Better performance of the compiler generated code
   – the compiler applies other transformations as well.

Compilers make your codes (almost) machine independent. But compilers fail:
 – programmers need to provide the necessary information;
 – programmers need to transform the code.

How well do compilers vectorize?

    Compiler          XLC     ICC     GCC
    Loops: total      159     159     159
    Vectorized         74      75      32
    Not vectorized     85      84     127
    Average speedup  1.73    1.85    1.30

    Loops vectorized by XLC but not by ICC: 25
    Loops vectorized by ICC but not by XLC: 26
How much programmer intervention?
• Next, three examples to illustrate what the programmer may need to do:
  – Add compiler directives
  – Transform the code
  – Program using vector intrinsics
• By adding manual vectorization, the average speedup was 3.78 (versus the 1.73 obtained by the XLC compiler).

Experimental results
• The tutorial shows results for two different platforms with their compilers (Intel Nehalem with ICC, IBM Power 7 with XLC). For each example we show:
  – the report generated by the compiler;
  – the execution time on each platform.
• The examples use single precision floating point numbers.

Compiler directives

S1111: the loop reads and writes through pointer parameters, so the compiler cannot prove that the arrays do not overlap:

    void test(float* A, float* B, float* C, float* D, float* E)
    {
        for (int i = 0; i < LEN; i++)
            ...   /* loop body truncated in this copy */
    }

S1111 with __restrict__: declaring the pointers __restrict__ tells the compiler that they do not alias:

    void test(float* __restrict__ A, float* __restrict__ B,
              float* __restrict__ C, float* __restrict__ D,
              float* __restrict__ E)
    {
        for (int i = 0; i < LEN; i++)
            ...
    }

    Intel Nehalem: without __restrict__, "Loop was not vectorized" (scalar 5.6); with __restrict__, "Loop was vectorized" (scalar 5.6, vector 2.2, speedup 2.5).
    IBM Power 7: without __restrict__, "Loop was not vectorized" (scalar 2.3); with __restrict__, "Loop was vectorized" (scalar 1.6, vector 0.6, speedup 2.7).

Loop Transformations

S136 traverses a two-dimensional array with a non-unit stride in the inner loop; S136_1 and S136_2 interchange the loops so that the inner loop runs over consecutive elements (the same example reappears in the Non-unit Stride section).

    Intel Nehalem: S136, "Loop was not vectorized. Vectorization possible but seems inefficient" (scalar 3.7). S136_1 and S136_2, "Permuted loop was vectorized" (scalar 1.6, vector 0.6, speedup 2.6).

Intrinsics (SSE)

    #define n 1024
    __attribute__((aligned(16))) float a[n], b[n], c[n];
    ...
    for (int i = 0; i < n; i += 4) {
        __m128 rA = _mm_load_ps(&a[i]);
        __m128 rB = _mm_load_ps(&b[i]);
        __m128 rC = _mm_add_ps(rA, rB);
        _mm_store_ps(&c[i], rC);
    }

Intrinsics (Altivec)

    #define n 1024
    __attribute__((aligned(16))) float a[n], b[n], c[n];
    ...
    for (int i = 0; i < n; i += 4) {
        vector float vA = vec_ld(0, &a[i]);
        vector float vB = vec_ld(0, &b[i]);
        vector float vC = vec_add(vA, vB);
        vec_st(vC, 0, &c[i]);
    }

(Only the declarations survive in this copy; both loop bodies are restored from the standard SSE/Altivec forms of the same vector add.)

Data dependences
• The notion of dependence is the foundation of the process of vectorization.
• It is used to build a calculus of program transformations that can be applied manually by the programmer or automatically by a compiler.

Definition of Dependence
• A statement S is said to be data dependent on a statement T if:
  – T executes before S in the original sequential/scalar program;
  – S and T access the same data item;
  – at least one of the accesses is a write.

Data Dependence
• Flow dependence (true dependence): S2 reads what S1 wrote.
      S1: X = A + B
      S2: C = X + A
• Anti dependence: S2 overwrites what S1 read.
      S1: A = X + B
      S2: X = C + D
• Output dependence: S2 overwrites what S1 wrote.
      S1: X = A + B
      S2: X = C + D

• Dependences indicate an execution order that must be honored.
• Executing statements in the order of the dependences guarantees correct results.
• Statements not dependent on each other can be reordered, executed in parallel, or coalesced into a vector operation.

Dependences in Loops (I)
• Dependences in loops are easy to understand if the loops are unrolled. The dependences are then between statement "executions" (instances).

    for (i=0; i<LEN; i++){
      S1: a[i] = b[i] + 1;
      S2: c[i] = a[i] + 2;
    }

Unrolled:
    i=0:  S1: a[0] = b[0] + 1;   S2: c[0] = a[0] + 2
    i=1:  S1: a[1] = b[1] + 1;   S2: c[1] = a[1] + 2
    i=2:  S1: a[2] = b[2] + 1;   S2: c[2] = a[2] + 2

Each instance of S2 depends on the instance of S1 from the same iteration: for the whole loop, this is a loop-independent dependence S1 → S2.

Note: for the dependences shown here, we assume that arrays do not overlap in memory (no aliasing). Compilers must know that there is no aliasing in order to vectorize.
Dependences in Loops (II)
• The dependence can also be carried across iterations. Assuming the canonical body (the loop body is truncated in this copy):

    for (i=1; i<LEN; i++){
      S1: a[i] = b[i] + 1;
      S2: c[i] = a[i-1] + 2;
    }

Unrolled:
    i=1:  S1: a[1] = b[1] + 1;   S2: c[1] = a[0] + 2
    i=2:  S1: a[2] = b[2] + 1;   S2: c[2] = a[1] + 2
    i=3:  S1: a[3] = b[3] + 1;   S2: c[3] = a[2] + 2

Each instance of S2 reads the value written by S1 in the previous iteration: for the whole loop, this is a loop-carried dependence S1 → S2.

Dependences in Loops (III)
• A scalar assigned inside the loop creates both kinds of dependence:

    for (i=0; i<LEN; i++){
      S1: a = b[i] + 1;
      S2: c[i] = a + 2;
    }

Unrolled:
    i=0:  S1: a = b[0] + 1;   S2: c[0] = a + 2
    i=1:  S1: a = b[1] + 1;   S2: c[1] = a + 2
    i=2:  S1: a = b[2] + 1;   S2: c[2] = a + 2

There is a loop-independent flow dependence from S1 to S2 in each iteration, and loop-carried dependences on the scalar a itself: each S1 overwrites the value read by the S2 of the same iteration and written by the S1 of the previous one.

Dependences in Loops (IV)
• Doubly nested loops are analyzed the same way: a dependence can be carried by the outer loop, carried by the inner loop, or be loop independent. [The concrete a[i][j] example and its iteration-space diagrams did not survive extraction.]

Data dependences and vectorization
• Loop dependences guide vectorization.
• Main idea: a statement inside a loop which is not in a cycle of the dependence graph can be vectorized.
• When cycles are present, vectorization can be achieved by:
  – distributing the loop,
  – removing dependences,
  – freezing loops,
  – changing the algorithm.

Freezing Loops
• When the dependence is carried only by the outer loop, the outer loop can be "frozen" (executed serially) while the inner loop is vectorized.

Changing the algorithm
• When there is a recurrence (e.g., S1: a[0] = b[0]; followed by a loop in which each a[i] depends on a[i-1]), it is necessary to change the algorithm in order to vectorize.
Stripmining
• Stripmining is a simple transformation: a loop of n iterations is split into an outer loop over strips and an inner loop over the (typically vector-length) iterations of one strip.
• Stripmining is often used when vectorizing.

    /* original */               /* strip-mined, strip size q */
    for (i=0; i<n; i++)          for (is=0; is<n; is+=q)
        a[i] = b[i] + 1;             for (i=is; i<is+q; i++)
                                         a[i] = b[i] + 1;

Loop Vectorization
• Loop vectorization is not always a legal and profitable transformation.
• The compiler needs to:
  – compute the dependences, by solving a system of (integer) equations (with constraints) or by demonstrating that there is no solution to the system of equations;
  – remove cycles in the dependence graph;
  – determine data alignment;
  – determine that vectorization is profitable.

• When vectorizing a loop with several statements, the compiler needs to strip-mine the loop and then apply loop distribution, so that each statement is executed for a whole strip before the next statement.

Dependence Graphs and Compiler Vectorization
• No dependences: vectorized as in the simple example.
• Acyclic graphs:
  – All dependences are forward: vectorized by simply replacing the statements with vector operations.
  – Some backward dependences: the statements must first be reordered so that all dependences are forward.
• Cycles in the dependence graph: require loop distribution, other transformations, or partial vectorization.

Acyclic Dependence Graphs: Forward Dependences

S113 (the S2 body is restored from the pattern of S114 below):
    for (i=0; i<LEN; i++){
      S1: a[i] = b[i] + c[i];
      S2: d[i] = a[i] + (float)1.0;
    }

The dependence S1 → S2 stays within each iteration (forward), so both statements can be vectorized.

    Intel Nehalem: "Loop was vectorized" (scalar 10.2, vector 6.3, speedup 1.6).
    IBM Power 7: "Loop was SIMD vectorized" (scalar 3.1, vector 1.5, speedup 2.0).

Acyclic Dependence Graphs: Backward Dependences (I)

S114
    for (i=0; i<LEN; i++){
      S1: a[i] = b[i] + c[i];
      S2: d[i] = a[i+1] + (float)1.0;
    }

Unrolled:
    i=0:  S1: a[0] = b[0] + c[0];   S2: d[0] = a[1] + 1
    i=1:  S1: a[1] = b[1] + c[1];   S2: d[1] = a[2] + 1

S2 reads a[i+1] before S1 overwrites it in the next iteration: a backward dependence. This loop cannot be vectorized as it is; reordering the statements (S114_1) makes the dependence forward.

    Intel Nehalem: S114, "Loop was not vectorized. Existence of vector dependence" (scalar 12.6). S114_1, "Loop was vectorized" (scalar 10.7, vector 6.2, speedup 1.72; speedup versus the non-reordered code: 2.03).
    IBM Power 7: the XLC compiler generated the same code in both cases — "Loop was SIMD vectorized" (scalar 3.3, vector 1.8, speedup 1.8).

Acyclic Dependence Graphs: Backward Dependences (II)

S214
    for (int i = 1; i < LEN; i++) {
      S1: a[i] = d[i-1] + sqrt(c[i]);
      S2: d[i] = b[i] + sqrt(e[i]);
    }

S1 reads d[i-1], which S2 wrote in the previous iteration: again a backward dependence, so the loop cannot be vectorized as it is. Reordering the statements (S214_1) enables vectorization.

    Intel Nehalem: S214, "Loop was not vectorized. Existence of vector dependence" (scalar 7.6). S214_1, "Loop was vectorized" (scalar 7.6, vector 3.8, speedup 2.0).

Cycles in the DG (I)

S115: S1 and S2 form a cycle in the dependence graph (the loop body did not survive in this copy). The loop cannot be vectorized by simply reordering the statements; S215 is a transformed version.

    Intel Nehalem: "Loop was not vectorized. Existence of vector dependence" (scalar 12.1).
    IBM Power 7: "Loop was SIMD vectorized" (scalar 3.1, vector 2.2, speedup 1.4).

Cycles in the DG (II)
• A loop can be partially vectorized.

S116: of the three statements in the loop, S1 can be vectorized, while S2 and S3 cannot (as they are).

    Intel Nehalem: "Loop was partially vectorized" (scalar 14.7, vector 18.1, speedup 0.8 — the vector code is slower).
    IBM Power 7: "Loop was not SIMD vectorized because a data dependence prevents SIMD vectorization" (scalar 13.5).

Cycles in the DG (III)
S117 and S118: further examples of cycles (the loop bodies did not survive in this copy).

Cycles in the DG (IV)
• A self true-dependence with distance 1 cannot be vectorized:

    a[i] = a[i-1] + b[i];    →  a[1]=a[0]+b[1], a[2]=a[1]+b[2], ...

• S119 has a self true-dependence with distance 4:

    for (int i=4; i<LEN; i++)
      S1: a[i] = a[i-4] + b[i];

    i=4:  a[4] = a[0] + b[4]
    i=5:  a[5] = a[1] + b[5]
    i=6:  a[6] = a[2] + b[6]
    i=7:  a[7] = a[3] + b[7]
    i=8:  a[8] = a[4] + b[8]

Yes, it can be vectorized, because the dependence distance is 4, which is the number of iterations that the SIMD unit can execute simultaneously.

    Intel Nehalem: "Loop was vectorized" (scalar 8.4, vector 3.9, speedup 2.1).
    IBM Power 7: "Loop was SIMD vectorized" (scalar 6.6, vector 1.8, speedup 3.7).

Cycles in the DG (V)

    for (int i = 0; i < LEN-1; i++) {
      for (int j = 0; j < LEN; j++)
        S1: a[i+1][j] = a[i][j] + b;
    }

Can this loop be vectorized? The dependences occur in the outermost loop:

    i=0:  a[1][j] = a[0][j] + b    (j = 0, 1, 2, ...)
    i=1:  a[2][j] = a[1][j] + b

  – the outer loop runs serially;
  – the inner loop can be vectorized.

Cycles in the DG (VI)
• Cycles can also appear because the compiler does not know whether there are dependences. With a subscripted subscript such as a[r[i]], is there a value of i for which two iterations access the same element? The compiler is conservative.
• The compiler only vectorizes when it can prove that the transformation is safe.

S122 and S123 use a subscript array r[i]. Does the compiler use the information that r[i] = i to compute data dependences? It does not: one version was not vectorized ("Existence of vector dependence"), and the other was only partially vectorized with essentially no gain (scalar 5.8, vector 5.7, speedup 1.01 on Intel Nehalem).

Loop Transformations
• Loop distribution (loop fission)
• Reordering statements
• Node splitting
• Scalar expansion
• Loop peeling
• Loop fusion
• Loop unrolling
• Loop interchanging

Compiler Directives (I)
• When the compiler does not vectorize automatically due to (assumed) dependences, the programmer can inform the compiler that it is safe to vectorize:

    #pragma ivdep                  (ICC compiler)
    #pragma ibm independent_loop   (XLC compiler)

• Example (S124; S124_1 and S124_2 add the directives):

    for (int i=val; i<LEN; i++)
      a[i] = a[i+k] + b[i];

  This loop can be vectorized when k < -3 or k >= 0. The programmer knows that k >= 0, but the compiler does not.
• #pragma ibm independent_loop needs AIX OS (we ran the experiments on Linux).

    IBM Power 7: "Loop was not vectorized because a data dependence prevents SIMD vectorization" (scalar 2.2).

Compiler Directives (II)
• Vector code can run slower than scalar code. When the programmer knows this, vectorization can be disabled:

    #pragma novector   (ICC compiler)
    #pragma nosimd     (XLC compiler)

• Example: S116 above was partially vectorized with speedup 0.8 (scalar 14.7, vector 18.1); #pragma novector avoids the slowdown.

Loop Distribution
• Also called loop fission: statements in a dependence cycle are separated from the vectorizable statements by splitting the loop.

S126 / S126_1:
    IBM Power 7: S126, "Loop was not SIMD vectorized" (scalar 1.3). S126_1, "Loop 1 was SIMD vectorized; loop 2 was not" (scalar 1.14, vector 1.0, speedup 1.14).

Reordering Statements
• Reordering the two statements of S114 turns the backward dependence into a forward one; on Power 7 the XLC compiler generated the same code in both cases (scalar 3.3, vector 1.8, speedup 1.8).

Node Splitting
• A statement involved in a dependence cycle is split in two, using a temporary array, so that the cycle is broken.

    Intel Nehalem: original loop, "Loop was not vectorized. Existence of vector dependence" (scalar 12.6); after node splitting, "Loop was vectorized" (scalar 13.2, vector 9.7, speedup 1.3).
    IBM Power 7: both versions "Loop was SIMD vectorized" (scalar 3.8, vector 1.7, speedup 2.2; and scalar 5.1, vector 2.4, speedup 2.0).

Scalar Expansion
• A scalar temporary assigned inside the loop is expanded into an array, removing the dependences the scalar creates.

S139 / S139_1:
    Intel Nehalem: both versions "Loop was vectorized" (scalar 0.7, vector 0.4, speedup 1.5).
    IBM Power 7: both versions "Loop was SIMD vectorized" (scalar 0.28, vector 0.14, speedup 2.0).

Loop Peeling
• Remove the first (or last) iteration(s) of the loop into separate code outside the loop.
• It is always legal, provided that no additional iterations are introduced.
• When the trip count of the loop is not constant, the peeled loop has to be protected with additional runtime tests.
• This transformation is useful to enforce a particular initial memory alignment on array references prior to loop vectorization.
    /* original */               /* peeled */
    for (i=0; i<N; i++)          if (N>=1) A[0] = B[0] + C[0];
        A[i] = B[i] + C[i];      for (i=1; i<N; i++)
                                     A[i] = B[i] + C[i];

Loop Peeling – Example

    S127                          S127_1                         S127_2
    for (int i=0; i<LEN; i++)     a[0] = a[0] + a[0];            a[0] = a[0] + a[0];
      a[i] = a[i] + a[0];         for (int i=1; i<LEN; i++)      float t = a[0];
                                    a[i] = a[i] + a[0];          for (int i=1; i<LEN; i++)
                                                                   a[i] = a[i] + t;

In S127, iteration 0 writes a[0], which every later iteration reads (a[0]=a[0]+a[0]; a[1]=a[1]+a[0]; a[2]=a[2]+a[0]; ...): a self true-dependence, which is not vectorized. After peeling the first iteration, there are no dependences and the loop can be vectorized.

    Intel Nehalem: S127, "Loop was not vectorized. Existence of vector dependence" (scalar 6.7). S127_1, "Loop was vectorized" (scalar 6.6, vector 1.2, speedup 5.2).
    IBM Power 7: S127 and S127_1, "Loop was not SIMD vectorized" (scalar 2.4). S127_2, with a[0] copied into the scalar t, "Loop was vectorized" (scalar 1.58, vector 0.62, speedup 2.54).

Loop Interchanging
• This transformation switches the positions of one loop that is tightly nested within another loop.

    S228                               S228_1
    for (j=1; j<LEN; j++)              for (i=1; i<LEN; i++)
      for (i=1; i<LEN; i++)              for (j=1; j<LEN; j++)
        A[i][j] = A[i-1][j] + 1;           A[i][j] = A[i-1][j] + 1;

In S228 the inner loop cannot be vectorized because of the self-dependence on i. Loop interchange is legal here, and after it there are no dependences in the inner loop:

    i=3:  j=1: A[3][1] = A[2][1] + 1
          j=2: A[3][2] = A[2][2] + 1
          j=3: A[3][3] = A[2][3] + 1

    Intel Nehalem: S228, "Loop was not vectorized" (scalar 2.3). S228_1, "Loop was vectorized" (scalar 0.6, vector 0.2, speedup 3).

Reductions
• A reduction is an operation, such as addition, which is applied to the elements of an array to produce a result of a lesser rank. Compilers recognize the common patterns and vectorize them. Canonical forms (the bodies are truncated in this copy):

    S131 (sum reduction)            S132 (max reduction)
    sum = 0;                        x = a[0];
    for (int i=0; i<LEN; i++)       for (int i=0; i<LEN; i++)
      sum += a[i];                    if (a[i] > x) x = a[i];

Induction variables
• An induction variable is a variable that can be expressed as a function of the loop iteration variable.
• S133 updates a scalar s in every iteration, creating a loop-carried dependence; S133_1 rewrites s as a function of i, which removes the dependence and lets both compilers vectorize. (The execution times for this pair are garbled in this copy.)
• Coding style matters: S134 and S134_1 are equivalent codes written in different styles, but the compilers treat them differently.

    Intel Nehalem: S134, "Loop was not vectorized" (scalar 5.5). S134_1, "Loop was vectorized" (scalar 6.1, vector 3.2, speedup 1.8).
    IBM Power 7: both versions "Loop was SIMD vectorized" (scalar 2.2, vector 1.0, speedup 2.2).

Data Alignment
• Vector loads/stores move 128 consecutive bits to or from a vector register.
• Data addresses need to be 16-byte (128-bit) aligned to be loaded/stored efficiently:
  – Intel platforms support both aligned and unaligned loads/stores;
  – IBM platforms do not support unaligned loads/stores.

Why data alignment may improve efficiency
• A vector load/store from aligned data requires one memory access.
• A vector load/store from unaligned data requires multiple memory accesses and some shift operations (e.g., reading 4 bytes starting at address 1 touches two aligned words).

• To know if a pointer is 16-byte aligned, the last digit of the pointer address in hex must be 0.
• Note that if &b[0] is 16-byte aligned, and b is a single precision array, then &b[4] is also 16-byte aligned:

    __attribute__ ((aligned(16))) float B[1024];
    int main(){
      printf("%p, %p\n", &B[0], &B[4]);
    }
    Output: 0x7fff1e9d8580, 0x7fff1e9d8590

• In many cases, the compiler cannot statically know the alignment of the address in a pointer.
• The compiler assumes that the base address of the pointer is 16-byte aligned and adds a run-time check for it; if the run-time check fails, it uses another version of the code (which may be scalar).
• Manual 16-byte alignment can be achieved by forcing the base address to be a multiple of 16:

    __attribute__ ((aligned(16))) float b[N];
    float* a = (float*) memalign(16, N*sizeof(float));

• When the pointer is passed to a function, the compiler should be made aware of where the 16-byte aligned address of the array starts:
void func1(float *a, float *b, float *c) { __assume_aligned(a, 16); __assume_aligned(b, 16); __assume_aligned(c, 16); for int (i=0; i 151 38 3/4/2013 Data Alignment - Example Alignment in a struct struct st{ float A[N] __attribute__((aligned(16))); float B[N] __attribute__((aligned(16))); char A; float C[N] __attribute__((aligned(16))); int B[64]; float C; void test1(){ void test2(){ int D[64]; __m128 rA, rB, rC; __m128 rA, rB, rC; for (int i = 0; i < N; i+=4){ for (int i = 0; i < N; i+=4){ }; rA = _mm_load_ps(&A[i]); rA = _mm_loadu_ps(&A[i]); rB = _mm_load_ps(&B[i]); rB = _mm_loadu_ps(&B[i]); int main(){ rC = _mm_add_ps(rA,rB); rC = _mm_add_ps(rA,rB); _mm_store_ps(&C[i], rC); _mm_storeu_ps(&C[i], rC); st s1; }} }} printf("%p, %p, %p, %p\n", &s1.A, s1.B, &s1.C, s1.D);} void test3(){ __m128 rA, rB, rC; Nanosecond per iteration Output: for (int i= 1; i < N-3; i+=4){ 0x7fffe6765f00, 0x7fffe6765f04, 0x7fffe6766004, 0x7fffe6766008 rA = _mm_loadu_ps(&A[i]); Core 2 Duo Intel i7 Power 7 rB = _mm_loadu_ps(&B[i]); rC = _mm_add_ps(rA,rB); Aligned 0.577 0.580 0.156 _mm_storeu_ps(&C[i], rC); • Arrays B and D are not 16-bytes aligned (see the }} Aligned (unaligned ld) 0.689 0.581 0.241 address) Unaligned 2.176 0.629 0.243 153 154 Alignment in a struct Outline struct st{ char A; int B[64] __attribute__ ((aligned(16))); 1. Intro float C; int D[64] __attribute__ ((aligned(16))); 2. Data Dependences (Definition) }; 3. Overcoming limitations to SIMD-Vectorization int main(){ – Data Dependences st s1; printf("%p, %p, %p, %p\n", &s1.A, s1.B, &s1.C, s1.D);} – Data Alignment – Aliasing Output: 0x7fff1e9d8580, 0x7fff1e9d8590, 0x7fff1e9d8690, 0x7fff1e9d86a0 – Non-unit strides – Conditional Statements • Arrays A and B are aligned to 16-byes (notice the 0 in the 4 least significant bits of the address) 4. Vectorization withintrinsics • Compiler automatically does padding 155 156 39 3/4/2013 Aliasing Aliasing • Can the compiler vectorize this loop? • Can the compiler vectorize this loop? 
float* a = &b[1]; … void func1(float *a,float *b, float *c){ void func1(float *a,float *b, float *c) for (int i = 0; i < LEN; i++) { { a[i] = b[i] + c[i]; for (int i = 0; i < LEN; i++) } a[i] = b[i] + c[i]; } b[1]= b[0] + c[0] b[2] = b[1] + c[1] 157 158 Aliasing Aliasing • Can the compiler vectorize this loop? • To vectorize, the compiler needs to guarantee that the pointers are not aliased. float* a = &b[1]; • When the compiler does not know if two pointer are … alias, it still vectorizes, but needs to add up-to run- void func1(float *a,float *b, float *c) time checks, where n is the number of pointers { When the number of pointers is large, the compiler for (int i = 0; i < LEN; i++) may decide to not vectorize a[i] = b[i] + c[i]; } a and b are aliasing void func1(float *a, float *b, float *c){ for (int i=0; i 159 160 40 3/4/2013 Aliasing Aliasing •Two solutions can be used to avoid the run-time 1. Static and Global arrays __attribute__ ((aligned(16))) float a[LEN]; checks __attribute__ ((aligned(16))) float b[LEN]; 1. static and global arrays __attribute__ ((aligned(16))) float c[LEN]; 2. __restrict__ attribute void func1(){ for (int i=0; i int main() { … func1(); } 161 162 Aliasing – Multidimensional Aliasing arrays 1. __restrict__ keyword • Example with 2D arrays: pointer-to-pointer declaration. void func1(float* __restrict__ a,float* __restrict__ b, float* __restrict__ c) { void func1(float** __restrict__ a,float** __assume_aligned(a, 16); __restrict__ b, float** __restrict__ c) { __assume_aligned(b, 16); for (int i=0; i 163 164 41 3/4/2013 Aliasing – Multidimensional arrays Aliasing – Multidimensional arrays • Example with 2D arrays: pointer-to-pointer declaration. • Example with 2D arrays: pointer-to-pointer declaration. 
Aliasing – Multidimensional arrays

• Example with 2D arrays: pointer-to-pointer declaration.

void func1(float** __restrict__ a, float** __restrict__ b,
           float** __restrict__ c) {
  for (int i = 0; i < LEN; i++)
    for (int j = 0; j < LEN; j++)
      a[i][j] = b[i][j] + c[i][j];
}

• Here __restrict__ only qualifies the arrays of row pointers; the rows themselves (a[i], b[i], c[i]) could still overlap.
• The Intel ICC compiler, version 11.1, will vectorize this code. Previous versions of the Intel compiler, or compilers from other vendors such as IBM XLC, will not vectorize it.

• Three solutions when __restrict__ does not enable vectorization:

1. Static and global declaration

__attribute__ ((aligned(16))) float a[N][N];
void t(){
  ... a[i][j] ...
}
int main() { ... t(); }

2. Linearize the arrays and use the __restrict__ keyword

void t(float* __restrict__ A){
  // access to element A[i][j] is now A[i*128+j]
  ...
}

3. Use compiler directives:
   #pragma ivdep    (Intel ICC)
   #pragma disjoint (IBM XLC)

void func1(float **a, float **b, float **c) {
  for (int i = 0; i < LEN; i++) {
    #pragma ivdep
    for (int j = 0; j < LEN; j++)
      a[i][j] = b[i][j] + c[i][j];
}}

Non-unit Stride – Example I

• Array of a struct:

typedef struct{int x, y, z;} point;
point pt[LEN];

for (int i = 0; i < LEN; i++)
  pt[i].y *= scale;
• Memory layout, array of structs:

  point pt[N]:  x0 y0 z0 | x1 y1 z1 | x2 y2 z2 | x3 y3 z3
                pt[0]      pt[1]      pt[2]      pt[3]

  vector register (what I need):  y0 y1 y2 y3

The y fields are non-adjacent in memory, so filling a vector register with them requires gathering elements at stride 3.

S135 (array of structs):

typedef struct{int x, y, z;} point;
point pt[LEN];
for (int i = 0; i < LEN; i++)
  pt[i].y *= scale;

S135_1 (one array per field):

int ptx[LEN], pty[LEN], ptz[LEN];
for (int i = 0; i < LEN; i++)
  pty[i] *= scale;

IBM Power 7 — S135: Loop was not SIMD vectorized because it is not profitable to vectorize; scalar 2.0; vector --; speedup --.  S135_1: Loop was SIMD vectorized; scalar 1.8; vector 1.5; speedup 1.2.

Non-unit Stride – Example II

• S136 traverses a two-dimensional array with a non-unit stride in the inner loop; S136_1 and S136_2 interchange the loops so that the inner loop accesses consecutive elements. (The exact codes are truncated in the extraction.)

Intel Nehalem — S136: Loop was not vectorized: vectorization possible but seems inefficient; scalar 3.7; vector --; speedup --.  S136_1 and S136_2: Permuted loop was vectorized; scalar 1.6; vector 0.6; speedup 2.6.
IBM Power 7 — S136: Loop was not SIMD vectorized; scalar 2.0; vector --; speedup --.  S136_1: Loop interchanging applied, loop was SIMD vectorized; scalar 0.4; vector 0.2; speedup 2.0.  S136_2: Loop interchanging applied, loop was SIMD vectorized; scalar 0.4; vector 0.16; speedup 2.7.
Conditional Statements – I

• Loops with conditions need #pragma vector always:
  – since the compiler does not know whether vectorization will be profitable, and
  – the condition may be there to prevent an exception.

#pragma vector always
for (int i = 0; i < LEN; i++){
  if (c[i] < (float) 0.0)
    a[i] = a[i] * b[i] + d[i];
}

S137 (plain loop) vs. S137_1 (with #pragma vector always; on Power 7, S137_1 is compiled with -qdebug=alwaysspec):

Intel Nehalem — S137: Loop was not vectorized: condition may protect exception; scalar 10.4; vector --; speedup --.  S137_1: Loop was vectorized; scalar 10.4; vector 5.0; speedup 2.0.
IBM Power 7 — S137 and S137_1: Loop was SIMD vectorized; scalar 4.0; vector 1.5; speedup 2.5.

Conditional Statements

• The compiler removes if conditions when generating vector code:

for (int i = 0; i < 1024; i++){
  if (c[i] < (float) 0.0)
    a[i] = a[i]*b[i] + d[i];
}

vector bool char rCmp;
vector float r0 = {0., 0., 0., 0.};
vector float rA, rB, rC, rD, rS, rT, rThen, rElse;
for (int i = 0; i < 1024; i += 4){
  // load rA, rB, rC, and rD
  rCmp  = vec_cmplt(rC, r0);
  rT    = rA*rB + rD;
  rThen = vec_and(rT, rCmp);
  rElse = vec_andc(rA, rCmp);
  rS    = vec_or(rThen, rElse);
  // store rS
}

Worked lane-by-lane example for one vector:

  rC    =    2     -1      1     -2
  rCmp  = False   True   False  True
  rThen =    0     3.2     0     3.2
  rElse =   1.0     0     1.0     0
  rS    =   1.0    3.2    1.0    3.2

• Speedups will depend on the values in c[i]. The compiler tends to be conservative, as the condition may be preventing segmentation faults.

Compiler Directives

• The compiler vectorizes many loops, but many more can be vectorized if the appropriate directives are used.

Compiler Hints for Intel ICC            Semantics
#pragma ivdep                           ignore assumed data dependences
#pragma vector always                   override efficiency heuristics
#pragma novector                        disable vectorization
__restrict__                            assert exclusive access through pointer
__attribute__ ((aligned(int-val)))      request memory alignment
memalign(int-val, size)                 malloc aligned memory
__assume_aligned(exp, int-val)          assert alignment property
Compiler Hints for IBM XLC              Semantics
#pragma ibm independent_loop            ignore assumed data dependences
#pragma nosimd                          disable vectorization
__restrict__                            assert exclusive access through pointer
__attribute__ ((aligned(int-val)))      request memory alignment
memalign(int-val, size)                 malloc aligned memory
__alignx(int-val, exp)                  assert alignment property

4. Vectorization with intrinsics

Accessing the SIMD units through intrinsics

• SSE can be accessed using intrinsics.
• Intrinsics are vendor/architecture specific.
• You must use the corresponding header file, e.g. #include <xmmintrin.h> for SSE (emmintrin.h for SSE2, pmmintrin.h for SSE3, smmintrin.h for SSE4).

Intel SSE intrinsics: data types

• We will use the following data types:
  __m128   packed single precision (vector XMM register)
  __m128d  packed double precision (vector XMM register)

• Intrinsics operate on these types and have the format _mm_instruction_suffix(…).
• The suffix can take many forms.
Among them:
  ss  scalar single precision
  ps  packed (vector) single precision
  sd  scalar double precision
(The packed-integer type __m128i, also a vector XMM register, completes the data-type list.)

Instruction examples

• Load four 16-byte-aligned single precision values:

  __m128 x = _mm_load_ps(a);    // a must be 16-byte aligned

• Add two vectors containing four single precision values:

  __m128 a, b;
  __m128 c = _mm_add_ps(a, b);

A complete example — scalar version:

#define n 1024
int main() {
  for (i = 0; i < n; i++) {
    c[i] = a[i] * b[i];
  }
}

and the intrinsics version:

#include <xmmintrin.h>
#define n 1024
int main() {
  __m128 rA, rB, rC;
  for (i = 0; i < n; i += 4) {
    rA = _mm_load_ps(&a[i]);
    rB = _mm_load_ps(&b[i]);
    rC = _mm_mul_ps(rA, rB);
    _mm_store_ps(&c[i], rC);
}}

Node Splitting

• A loop whose dependence graph contains a cycle formed by an anti dependence and a flow dependence can be transformed by node splitting: the anti-dependent read is copied into a temporary statement, which breaks the cycle. S126 is the original loop, S126_1 the split version, and S126_2 a hand-vectorized intrinsics version. (The codes are truncated in the extraction.)

Performance:
Intel Nehalem — S126: Loop was not vectorized: existence of vector dependence; scalar 12.6; vector --; speedup --.  S126_1: Loop was vectorized; scalar 13.2; vector 9.7; speedup 1.3.  S126_2 (intrinsics): 6.1; speedup versus vector code 1.6.
IBM Power 7 — S126: Loop was SIMD vectorized; scalar 3.8; vector 1.7; speedup 2.2.  S126_1: Loop was SIMD vectorized; scalar 5.1; vector 2.4; speedup 2.0.  S126_2 (intrinsics): 1.6; speedup versus vector code 1.5.
Summary

• Microprocessor vector extensions can contribute to improved program performance, and the amount of this contribution is likely to increase in the future as vector lengths grow.
• Compilers are only partially successful at vectorizing.
• When the compiler fails, programmers can
  – add compiler directives, or
  – apply loop transformations by hand.
• If, after transforming the code, the compiler still fails to vectorize (or the performance of the generated code is poor), the only option is to program the vector extensions directly using intrinsics or assembly language.

Data Dependences

• The correctness of many loop transformations, including vectorization, can be decided using dependences.
• A good introduction to the notion of dependence and its applications can be found in: D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence Graphs and Compiler Optimizations. POPL 1981.

Compiler Optimizations

• For a longer discussion, see:
  – K. Kennedy and J. R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann, 2002.
  – U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, Mass., 1988.
  – D. Padua and M. Wolfe. Advanced Compiler Optimizations for Supercomputers. Communications of the ACM, Volume 29, Number 12, December 1986.
  – D. Bacon, S. Graham, and O. Sharp. Compiler Transformations for High-Performance Computing. ACM Computing Surveys, Vol. 26, No. 4, December 1994.

Algorithms

• W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Commun. ACM 29, 12 (December 1986), 1170-1183.
• Shyh-Ching Chen and D. J. Kuck. Time and Parallel Processor Bounds for Linear Recurrence Systems. IEEE Transactions on Computers, pp. 701-717, July 1975.

Thank you.  Questions?
María Garzarán, Saeed Maleki, William Gropp, and David Padua
{garzaran,maleki,wgropp,padua}@illinois.edu
Department of Computer Science, University of Illinois at Urbana-Champaign

Back-up Slides

Measuring execution time

time1 = time();
for (i=0; i<32000; i++)
  c[i] = a[i] + b[i];
time2 = time();

• An outer loop that runs (serially) is added, to increase the running time of the loop.
• A dummy() function, compiled separately, is called in each outer iteration, to avoid loop interchange or dead code elimination.
• The elements of one output array are accessed and the result printed, also to avoid dead code elimination:

time1 = time();
for (j=0; j<200000; j++){
  for (i=0; i<32000; i++)
    c[i] = a[i] + b[i];
  dummy();
}
time2 = time();
for (j=0; j<32000; j++)
  ret += a[j];
printf("Time %f, result %f", (time2 - time1), ret);
Compiling

• Intel icc scalar code:
    icc -O3 -no-vec dummy.o tsc.o -o runnovec
• Intel icc vector code:
    icc -O3 -vec-report[n] -xSSE4.2 dummy.o tsc.o -o runvec

[n] can be 0, 1, 2, 3, 4, or 5:
  – vec-report0: no report is generated
  – vec-report1: indicates the line numbers of the loops that were vectorized
  – vec-report2 .. 5: a more detailed report that also includes the loops that were not vectorized and the reason

• IBM xlc flags:
    -O3 -qaltivec -qhot -qarch=pwr7 -qtune=pwr7 -qipa=malloc16
    -qdebug=NSIMDCOST -qdebug=alwaysspec -qreport
• IBM xlc scalar code:
    xlc -qnoenablevmx dummy.o tsc.o -o runnovec
• IBM xlc vector code:
    xlc -qenablevmx dummy.o tsc.o -o runvec

Strip Mining

• This transformation improves locality and is usually combined with vectorization.
• strip_size is usually a small value (4, 8, 16 or 32).
• Distributing the loop without strip mining it first would increase the cache miss ratio if array a[] is large; strip mining keeps each strip's data in cache across the distributed loops. (The slides' example codes are truncated in the extraction.)

What are microprocessor vector extensions?

• Instructions that execute on SIMD (Single Instruction Multiple Data) functional units: a 4-way SIMD addition applies x + y to four pairs of elements of int v[N] at once.
• They operate on short vectors of integers and floats (from 16-way 1-byte chars to 2-way doubles).
• Why are they here?
  – Useful: many applications (e.g., multimedia) feature the required fine-grain parallelism, so code is potentially faster.
  – Doable: chip designers have enough transistors available, and the units are easy to implement.

Overview: Floating-Point Vector ISAs

  Vendor    Name                  n-ways    Precision   Introduced with
  Intel     SSE                   4-way     single      Pentium III
            SSE2                  + 2-way   double      Pentium 4
            SSE3                                        Pentium 4 (Prescott)
            SSSE3                                       Core Duo
            SSE4                                        Core2 Extreme (Penryn)
            AVX                   + 4-way   double      Sandy Bridge, 2011
  Intel     IPF                   2-way     single      Itanium
  AMD       3DNow!                2-way     single      K6
            Enhanced 3DNow!                             K7
            3DNow! Professional   + 4-way   single      AthlonXP
            AMD64                 + 2-way   double      Opteron
  Motorola  AltiVec               4-way     single      MPC 7400 G4
  IBM       VMX                   4-way     single      PowerPC 970 G5
            SPU                   + 2-way   double      Cell BE
            Double FPU            2-way     double      PowerPC 440 FP2

A similar architecture is found in game consoles (PS2, PS3) and GPUs (NVIDIA's GeForce). (Based on slides provided by Markus Püschel.)

Evolution of Intel Vector Instructions

• MMX (1996, Pentium): CPU-based MPEG decoding. Integers only; 64 bits divided into 2 x 32 to 8 x 8. Phased out with SSE4.
• SSE (1999, Pentium III): CPU-based 3D graphics. 4-way single precision float operations; 8 new 128-bit registers; 100+ instructions.
• SSE2 (2001, Pentium 4): high-performance computing. Adds 2-way double-precision float ops, using the same registers as 4-way single precision. Integer SSE instructions make MMX obsolete.
• SSE3 (2004, Pentium 4E Prescott): scientific computing.
  New 2-way and 4-way vector instructions for complex arithmetic.
• SSSE3 (2006, Core Duo): minor advancement over SSE3.
• SSE4 (2007, Core2 Duo Penryn): modern codecs, cryptography. New integer instructions; better support for unaligned data; super shuffle engine.

More details at http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

Run-Time Symbolic Resolution

for (i=0; i<LEN; i++)
  a[i+t] = a[i] + b[i];

• If t > 0, there is a self true (flow) dependence — e.g. for t = 1: a[4]=a[3]+b[3], then a[5]=a[4]+b[4] — and the loop cannot be vectorized.
• If t <= 0 — e.g. for t = -1: a[2]=a[3]+b[3], then a[3]=a[4]+b[4] — there is only an anti dependence, and the loop can be vectorized.
Since t is not known at compile time, the compiler can test it at run time and choose between a vectorized and a scalar version.

Loop Vectorization – Example I

S113 / S113_1 (codes truncated in the extraction; S113_1 is a two-loop variant that ICC re-fuses):

Intel Nehalem — S113: Loop was vectorized; S113_1: Fused loop was vectorized; both: scalar 10.2, vector 6.3, speedup 1.6.
IBM Power 7 — S113 and S113_1: Loop was SIMD vectorized; scalar 3.1 and 3.7; vector 1.5 and 2.3; speedups 2.0 and 1.6.
How do we access the SIMD units?

• Three choices:

1. C code and a vectorizing compiler:

   for (i=0; i<32000; i++)
     c[i] = a[i] + b[i];

2. Macros or vector intrinsics:

   void example(){
     __m128 rA, rB, rC;
     for (int i = 0; i < 32000; i+=4){
       rA = _mm_load_ps(&a[i]);
       rB = _mm_load_ps(&b[i]);
       rC = _mm_add_ps(rA, rB);
       _mm_store_ps(&c[i], rC);
   }}

3. Assembly language:

   ..B8.5
     movaps    a(,%rdx,4), %xmm0
     movaps    16+a(,%rdx,4), %xmm1
     addps     b(,%rdx,4), %xmm0
     addps     16+b(,%rdx,4), %xmm1
     movaps    %xmm0, c(,%rdx,4)
     movaps    %xmm1, 16+c(,%rdx,4)
     addq      $8, %rdx
     cmpq      $32000, %rdx
     jl        ..B8.5

Why should the compiler vectorize?

1. Easier.
2. Portable across vendors and across generations of the same class of machines, although compiler directives may differ across compilers.
3. Better performance than programmer-generated vector code: the compiler applies other transformations, such as loop unrolling and instruction scheduling.

Compilers make your codes (almost) machine independent. But compilers fail:
  – Programmers need to provide the necessary information.
  – Programmers need to transform the code.

How well do compilers vectorize?

• Results for the Test Suite for Vectorizing Compilers by David Callahan, Jack Dongarra, and David Levine, with the IBM XLC compiler, version 11:

  Loops                               Percentage
  Vectorizable                        84.3%
   - Vectorized                       46.5%
   - Classic transformation applied   22.0%
   - New transformation applied        3.8%
   - Manual vector code               11.9%

Of 159 loops in total, 74 were auto-vectorized and 85 were not. Of those 85, 60 were vectorizable by hand (35 with a classic transformation, 6 with a new transformation, 19 by manual vectorization) and 25 were impossible to vectorize (16 non-unit stride access, 5 data dependence, 4 other).
Terminology

• Classic transformations (applied at the source level): loop interchange, scalar expansion, scalar and array renaming, node splitting, reduction, loop peeling, loop distribution, run-time symbolic resolution, speculating conditional statements.
• New transformations: manually vectorized matrix transposition (intrinsics), manually vectorized prefix sum (intrinsics).
• Manual transformation, used when auto vectorization is inefficient, when vectorization of the transformed code is inefficient, or when no transformation was found to enable auto vectorization.

What are the speedups?

• Speedups obtained by the XLC compiler on the test suite collection:

  Automatically by XLC                 1.73
  By adding classic transformations    3.48
  By adding new transformations        3.64
  By adding manual vectorization       3.78

Why did the compilers fail to vectorize?

  Issues                            ICC   XLC
  Vectorizable but not automatic     59    60
  Cyclic data dependence              9     8
  Non-unit stride access              6     7
  Conditional statement               6     9
  Aliasing                            5     0
  Acyclic data dependence             4     0
  Reduction                           4     5
  Unsupported loop structure          4     9
  Loop interchange                    4     2
  Wrap around                         1     4
  Scalar expansion                    3     4
  Other                              13    12

XLC and ICC comparison

  Reasons                      XLC but not ICC   ICC but not XLC
  Vectorized                         25                26
  Aliasing                            5                 0
  Acyclic data dependence             4                 0
  Conditional statements              4                 7
  Unrolled loop                       3                 0
  Loop interchange                    2                 0
  Unsupported loop structure          0                 5
  Wrap around variable                0                 3
  Reduction                           1                 2
  Non-unit stride access              1                 2
  Other                               5                 6

Another example: matrix-matrix multiplication

S111
void MMM(float** a, float** b, float** c) {
  for (int i = 0; i < LEN2; i++)
    for (int j = 0; j < LEN2; j++)
      for (int k = 0; k < LEN2; k++)
        c[i][j] += a[i][k] * b[k][j];
}

S111_1 (compiler directives added)
void MMM(float** __restrict__ a, float** __restrict__ b,
         float** __restrict__ c) {
  ... (same loop nest)
}

Definition of Dependence

• A statement S is said to be data dependent on another statement T if
  – S accesses the same data being accessed by an earlier execution of T, and
  – S, T, or both write the data.

S111_3 (loop interchange)
void MMM(float** __restrict__ a, float** __restrict__ b,
         float** __restrict__ c) {
  for (int i = 0; i < LEN2; i++)
    for (int j = 0; j < LEN2; j++)
      c[i][j] = (float)0.;
  for (int i = 0; i < LEN2; i++)
    for (int k = 0; k < LEN2; k++)
      for (int j = 0; j < LEN2; j++)
        c[i][j] += a[i][k] * b[k][j];
}

Data Dependence

Flow dependence (true dependence):
  S1: X = A+B
  S2: C = X+A

Dependences in Loops

• Dependences in loops are easy to understand if the loops are unrolled: the dependences are then between statement "instances".

for (i=0; i<LEN; i++){ ... }     (loop bodies truncated in the extraction)

The slides work through a series of examples — a simple loop, a slightly more complex one, an even more complex one, and two two-dimensional loops — unrolling each to expose the dependences between statement instances.

• The representation of these dependence graphs inside the compiler is a "collapsed" version of the graph, where the arcs are annotated with directions (or distances) to reduce ambiguities.
[The slides picture rows of S1 and S2 instance nodes, one pair per iteration, with arcs labeled "=" (loop-independent) or "<" (carried to a later iteration).]

• In the collapsed version, each statement is represented by a node, and each ordered pair of variable accesses is represented by an arc.

Loop Vectorization

• When the loop has several statements, it is better to first strip-mine the loop and then distribute it. strip_size is usually a small value (4, 8, 16, 32).
• The slides compare four versions of a two-statement loop: the scalar code, the code with the loops distributed, the code with the loops strip-mined, and the code strip-mined and then distributed. (The codes are truncated in the extraction.)

S112 / S112_1 / S112_2 performance:

Intel Nehalem — S112: Loop was vectorized; S112_1: Fused loop was vectorized; both: scalar 9.6, vector 5.5, speedup 1.7.
IBM Power 7 — S112, S112_1, and S112_2: Loop was SIMD vectorized; scalar 2.7 / 2.7 / 2.9; vector 1.2 / 1.2 / 1.4; speedups 2.1 / 2.1 / 2.07.

Loop Vectorization – Example II

• Our observations for S114 (code truncated in the extraction):

Intel Nehalem — Loop was not vectorized: existence of vector dependence; scalar 12.6; vector --; speedup --.
IBM Power 7 — Loop was SIMD vectorized; scalar 3.3; vector 1.8; speedup 1.8.
• We have observed that ICC usually vectorizes only part of such loops: a loop can be partially vectorized.

Loop Vectorization – Examples III and IV

for (int i = 1; i < LEN; i++){
  S1;      // S1 can be vectorized
  S2;      // S2 and S3 cannot be vectorized: they form a cycle in the
  S3;      //   dependence graph, and a loop with a cycle in the
}          //   dependence graph cannot be vectorized
           // (statement bodies truncated in the extraction)

The compiler distributes the loop, putting S1 in a vectorized loop and S2 and S3 in a sequential one; S116 and S116_1 illustrate this (codes and measurements truncated in the extraction).

Cycles in the DG – Example V

• The compiler needs to resolve a system of equations to determine if there are dependences.

Loop Interchanging

for (j=0; j<LEN; j++)
  for (i=...; i<LEN; i++)
    ...

[The slides show the two-dimensional iteration space with dependence arrows before and after interchanging the i and j loops; S128 and S129 are the example loops, with bodies truncated in the extraction.]

S129 / S129_1 performance:

Intel Nehalem — S129: Permuted loop was vectorized; S129_1: Loop was vectorized; both: scalar 0.4; vector 0.1; speedup 4.
IBM Power 7 — Loop interchanging applied, loop was SIMD vectorized; scalar 0.5; vector 0.1; speedup 5.
Intel SSE intrinsics: instruction examples II

• Add two vectors containing four single precision values:
  __m128 a, b;
  __m128 c = _mm_add_ps(a, b);

• Multiply two vectors containing four floats:
  __m128 a, b;
  __m128 c = _mm_mul_ps(a, b);

• Add two vectors containing two doubles:
  __m128d x, y;
  __m128d z = _mm_add_pd(x, y);

Instruction examples III

• Add two vectors of 8 16-bit signed integers using saturation arithmetic (*):
  __m128i r, s;
  __m128i t = _mm_adds_epi16(r, s);

• Compare two vectors of 16 8-bit signed integers:
  __m128i a, b;
  __m128i c = _mm_cmpgt_epi8(a, b);

(*) In saturation arithmetic, all operations such as addition and multiplication are limited to a fixed range between a minimum and a maximum value. If the result of an operation is greater than the maximum, it is set ("clamped") to the maximum, while if it is below the minimum, it is clamped to the minimum. [From Wikipedia]

Compiler work

• Illinois has a long tradition in vectorization (see Communications of the ACM, December 1986, Volume 29, Number 12).
• Most recent work: vectorization for Blue Waters. (Source: Thom Dunning, Blue Waters Project.)

Conditional Statements – II

for (int i = 0; i < LEN; ++i) {
  if (c[i] == 0)
    a[i] = CHEAP_FUNC(d[i]);
  else
    a[i] = EXPENSIVE_FUNC(b[i])*c[i] + CHEAP_FUNC(d[i]);
}

• The programmer puts in the condition to reduce the amount of computation, knowing that the condition is true many times. Options:
  1) Remove the condition.
  2) Leave it as it is.
S138 (condition kept):

for (int i = 0; i < LEN; ++i) {
  if (c[i] == (float)0.0)
    a[i] = d[i]*(float)2.0;
  else
    a[i] = c[i]/e[i]/b[i]/d[i]/a[i] + d[i]*(float)2.0;
}

S138_1 (the cheap statement runs unconditionally; only the expensive one stays guarded, with vectorization of that loop disabled):

for (int i = 0; i < LEN; ++i)
  a[i] = d[i]*(float)2.0;
#pragma novector                 // #pragma nosimd with IBM XLC
for (int i = 0; i < LEN; ++i) {
  if (c[i] != (float)0.0)
    a[i] += c[i]/e[i]/b[i]/d[i]/a[i];
}

Intel Nehalem — S138: Loop was vectorized; scalar 1.01; vector 1.09; speedup 0.9.  S138_1: Loop 1 was vectorized; loop 2 was not (#pragma novector used); scalar 0.99; vector 0.97; speedup 1.0.  Performance is input dependent.
IBM Power 7 — S138: Loop was not SIMD vectorized because it is not profitable; scalar 2.1; vector --; speedup --.  S138_1: Loop was SIMD vectorized; scalar 2.06; vector 2.6; speedup 0.8.