Spring 20102010

Loop Optimizations

Instruction SchedulingScheduling 5 Outline

• Scheduling for loops • • Interaction with • Hardware vs. • IdInduct ion Variablble Recogn ition • loop invariant code motion

Saman Amarasinghe 2 6.035 ©MIT Fall 1998 Scheduling Loop s

•Loop bodies are small • But, lot of time is spend in loops due to large number ooff iterations • Need better ways to schedule loops

Saman Amarasinghe 3 6.035 ©MIT Fall 1998 Loop Examp le

• Machine – One load/store unit • load 2 cycles • store 2 cycles – Two arithmetic units • add 2 cycles • branch 2 cycles • multiply 3 cycles – Both units are ppipelinedipelined (initiate one op each cycle) • Source Code for i = 1 to N A[i] = A[i] * b

Saman Amarasinghe 4 6.035 ©MIT Fall 1998 Loop Examp le

• Source Code for i = 1 to N base A[i] = A[i] * b • Assembly Code off set loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop

Saman Amarasinghe 5 6.035 ©MIT Fall 1998 Looppp Example mov d=7 2 • Assembly Code imul d=5 loop: 3 mov (%rdi,%rax), %r10 mov d=2 imul %r11, %r10 0 mov %r10, (%rdi,%rax) sub $4, %rax sub d=2 jz loop 2 • Schedule (9 cycles per iteration) jz d=0 mov mov mov mov imul bge imul bge imul sub sub

Saman Amarasinghe 6 6.035 ©MIT Fall 1998 5 Outline

• Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IdInduct ion Variablble Recogn ition • loop invariant code motion

Saman Amarasinghe 7 6.035 ©MIT Fall 1998 Loop Unrolling

• Unroll the loop bod y few times •Pros: – Create a much larger basic blockblock for thethe body – Eliminate few loop bounds checks • Cons: – Much larger program – Setup code ((## of iterat ions < unro llll factor) – beginning and end of the schedule can still have unusedld slots

Saman Amarasinghe 8 6.035 ©MIT Fall 1998 loop: Loop Example mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop

Saman Amarasinghe 9 6.035 ©MIT Fall 1998 loop: Loop Example mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop

Saman Amarasinghe 10 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 loop: Looppp Example 3 mov (%rdi,%rax), %r10 mov d=9 imul %r11, %r10 0 mov %r10, (%rdi,%rax) sub d=9 2 sub $4, %rax mov d=7 mov (%rdi,%rax), %r10 2 imul %r11, %r10 mul d=5 mov %r10, (%rdi,%rax) 3 sub $4, %rax mov d=2 jz loop 0 sub d=2 2 • Schedule (8 cycles per iteration) jz d=0 mov mov mov mov mov mov mov mov imul imul bge imul imul bge imul imul sub sub sub sub

Saman Amarasinghe 11 6.035 ©MIT Fall 1998 Loop Unrolling

• Rename registers – Use different registers in different iterations

Saman Amarasinghe 12 6.035 ©MIT Fall 1998 mov d=14 2 Looppp Example mul d=12 loop: 3 mov (%rdi,%rax), %r10 mov d=9 imul %r11, %r10 0 sub d=9 mov %10%r10, (%rdi,% %rax) 2 sub $4, %rax mov d=7 mov (%rdi,%rax), %r10 2 mul d=5 imul %r11, %r10 3 mov %r10, (%rdi,%rax) mov d=2 0 sub $4, %rax sub d2d=2 jz loop 2 jz d=0

Saman Amarasinghe 13 6.035 ©MIT Fall 1998 mov d=14 2 Looppp Example mul d=12 loop: 3 mov (%rdi,%rax), %r10 mov d=9 imul %r11, %r10 0 sub d=9 mov %10%r10, (%rdi,% %rax) 2 sub $4, %rax mov d=7 mov (%rdi,%rax), %rcx 2 mul d=5 imul %r11, %rcx 3 mov %rcx, (%rdi,%rax) mov d=2 0 sub $4, %rax sub d2d=2 jz loop 2 jz d=0

Saman Amarasinghe 14 6.035 ©MIT Fall 1998 Loop Unrolling

• Rename registers – Use different registers in different iterations

• Eliminate unnecessary dependencies – again, useuse more registers toto eliminateeliminate truetrue, antianti andand output dependencies – eliminate dependentdependent -chains of calculations when possible

Saman Amarasinghe 15 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 loop: Looppp Example 3 mov (%rdi,%rax), %r10 mov d=9 0 imul %r11, %r10 sub d=9 mov %r%10(%10, (%rdi,%rax) 2 mov d=7 sub $4, %rax 2 mov (%rdi,%rax), %rcx mul d=5 imul %r11, %rcx 3 mov d=2 mov %rcx, (%rdi,%rax) 0 sub $4, %rax sub d=2 2 jz loop jz d=0

Saman Amarasinghe 16 6.035 ©MIT Fall 1998 mov d=5 2 mul d=3 loop: Looppp Example 3 mov (%rdi,%rax), %r10 mov d=0 0 imul %r11, %r10 sub d=0 mov %r%10(%10, (%rdi,%rax) 2 mov d=7 sub $8, %rax 2 mov (%rdi,%rbx), %rcx mul d=5 imul %r11, %rcx 3 mov d=2 mov %rcx, (%rdi,%rbx) 0 sub $8, %rbx sub d=2 2 jz loop jz d=0

Saman Amarasinghe 17 6.035 ©MIT Fall 1998 mov d=5 2 mul d=3 loop: Looppp Example 3 mov (%rdi,%rax), %r10 mov d=0 imul %r11, %r10 0 mov %r10, (%rdi,%rax) sub d=0 2 sub $8, %rax mov d=7 mov (%rdi,%rbx), %rcx 2 imul %r11, %rcx mul d=5 mov %rcx, (%rdi,%rbx) 3 sub $8, %rbx mov d=2 jz loop 0 sub d=2 •S Schedule h d l(4 .5 5 cyc les per itera tion 2 jz d=0 mov mov mov mov mov mov mov mov imul imul jz imul imul jz imul imul sub sub sub sub

Saman Amarasinghe 18 6.035 ©MIT Fall 1998 5 Outline

• Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • loop invariant code motion • Recognition

Saman Amarasinghe 19 6.035 ©MIT Fall 1998 Software Pipelining

•Try to overla p multi ple iterations so that the slots will be filled • Find the steadysteady -state windowwindow so that: – all the instructions of the loop body is executed – but from ddifferentifferent iterations

Saman Amarasinghe 20 6.035 ©MIT Fall 1998 Loop Examp le

• Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop • Schedule mov mov mov mov mul jz mul jz mul sub sub

Saman Amarasinghe 21 6.035 ©MIT Fall 1998 Looppp Example

• Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop • Schedule mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6 mov mov1 mov2 mov mov3 mov1 mov4 mov2 ld5 mov3 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul mul1 mul2 mul3 mul4 sub sub1 sub2 sub3 sub sub1 sub2 sub3

Saman Amarasinghe 22 6.035 ©MIT Fall 1998 Looppp Example

• Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop • Schedule (2 cycles per iteration) mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6 mov mov1 mov2 mov mov3 mov1 mov4 mov2 ld5 mov3 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul mul1 mul2 mul3 mul4 sub sub1 sub2 sub3 sub sub1 sub2 sub3

Saman Amarasinghe 23 6.035 ©MIT Fall 1998 Looppp Example • 4 iterations are overlapped %r11 mov4 mov2 – value of don’t change mov1 mov4 mul3 jz1 – 4 regs for (%rdi,%rax) jz mul3 – each addr. incremented by 4*4 mul2 sub2 %r10 – 4 regs to keep value sub1

– Same registers can be reused after 4 of these blocks loop: mov (%rdi,%rax), %r10 g,generate code for 4 blocks, imul %r11, %r10 otherwise need to move mov %r10, (%rdi,%rax) sub $4, %rax jz loop

Saman Amarasinghe 24 6.035 ©MIT Fall 1998 Software Pip elining

• Optimal use of resources • NdNeed a lot of reg isters – Values in multiple iterations need to be kept • Issues in dependencies – Executing a store instruction in an iteration before branch instruction is executed for a previous iteration (writing when it should not have) – Loads and stores are issued out-of-order (need to figure-out dependencies before doing this) • Code generation issues – Generate pre-amble and post-amble code – Multiple blocks so no register copy is needed

Saman Amarasinghe 25 6.035 ©MIT Fall 1998 5 Outline

• Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IdInduct ion Variablble Recogn ition • loop invariant code motion

Saman Amarasinghe 26 6.035 ©MIT Fall 1998 Register Allocation andId Ins truc tition Sc he du liling • If register allocation is before – restricts the choices for scheduling

Saman Amarasinghe 27 6.035 ©MIT Fall 1998 Example

1: mov 4(%rbp), %rax 2: a dddd %rax, %rbx 3: mov 8(%rbp), %rax 4: add %rax, %rcx

Saman Amarasinghe 28 6.035 ©MIT Fall 1998 Exampple

1: mov 4(%rbp), %rax 1 2dd2: add %%b%rax, %rbx 3 3: mov 8(%rbp), %rax 1 1 4: add %rax, %rcx 2

1 1 3 3

4

Saman Amarasinghe 29 6.035 ©MIT Fall 1998 Exampple

1: mov 4(%rbp), %rax 1 2dd2: add %%b%rax, %rbx 3 3: mov 8(%rbp), %rax 1 1 4: add %rax, %rcx 2

1 1 3 3

ALUop 2 4 4 MEM 1 1 3 MEM 2 1 3

Saman Amarasinghe 30 6.035 ©MIT Fall 1998 Exampple

1: mov 4(%rbp), %rax 1 2dd2: add %rax,%b %rbx 3 3: mov 8(%rbp), %rax 1 1 4: add %rax, %rcx 2

1 1 Anti-dependence 3 How about a different register? 3

4

Saman Amarasinghe 31 6.035 ©MIT Fall 1998 Exampple

1: mov 4(%rbp), %rax 1 2dd2: add %%b%rax, %rbx 3 3: mov 8(%rbp), %r10 4: add %r10, %rcx 2

Anti-dependence 3 How about a different register? 3

4

Saman Amarasinghe 32 6.035 ©MIT Fall 1998 Exampple

1: mov 4(%rbp), %rax 1 2dd2: add %%b%rax, %rbx 3 3: mov 8(%rbp), %r10 4: add %r10, %rcx 2

3 3

ALUop 2 4 4 MEM 1 1 3 MEM 2 1 3

Saman Amarasinghe 33 6.035 ©MIT Fall 1998 Register Allocation andId Ins truc tition Sc he du liling • If register allocation is before instruction scheduling – restricts the choices for scheduling

Saman Amarasinghe 34 6.035 ©MIT Fall 1998 Register Allocation andId Instructition Scheduliling • If register allocation is before instruction scheduling – restricts the choices for scheduling

• If instruction schedulingscheduling bbeforeefore registerregister allocation – Register allocation may spillspill registers – Will change the carefully done schedule!!!

Saman Amarasinghe 35 6.035 ©MIT Fall 1998 5 Outline

• Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IdInduct ion Variablble Recogn ition • loop invariant code motion

Saman Amarasinghe 36 6.035 ©MIT Fall 1998 Superscalar: Where have all the trans is tors gone? • Out of order execution – If an instruction stalls, go beyond that and start executing non-dependent instructions –Pros: • Hardware scheduling • Tolerates unpredictable latencies – Cons: • Instruction window is small

Saman Amarasinghe 37 6.035 ©MIT Fall 1998 Superscalar: Where have all the trans is tors gone? •Register renaming – If there is an anti or output dependency of a register that stalls the pipeline, use a different hardware register –Pros: • Avoids anti and output dependencies – Cons: • Cannot do more complex transformations to eliminate dependencies

Saman Amarasinghe 38 6.035 ©MIT Fall 1998 Hardware vs. Compiler

• In a superscalar, hardware and compiler scheduling can work hand-in-hand • Hardware can reduce the burden when not predictable by the compiler • Compiler can still greatly enhance the performance – Large instruction window for scheduling – Many program transformations that increase parallelism • Comp ililer is even more critica l when no har dware support – VLIW machines (Itanium, DSPs)DSPs)

Saman Amarasinghe 39 6.035 ©MIT Fall 1998 5 Outline

• Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IdInduction Var iabblle Recognition • loop invariant code motion

Saman Amarasinghe 40 6.035 ©MIT Fall 1998 Induction Variables

•Example i = 200 for j = 1 to 100 a(i) = 0 i = i - 1

Saman Amarasinghe 41 6.035 ©MIT Fall 1998 Induction Variables

•Example i = 200 for j = 1 to 100 a(i) = 0 i = i - 1

Basic Induction variable: J =1= 1, 2, 3 , 4, …..

Index Variable i in a(i): I = 200, 199, 198, 197….

Saman Amarasinghe 42 6.035 ©MIT Fall 1998 Induction Variables

• Example i = 200 ffororj=1t j = 1 too1 10000 a(i) = 0 i = i - 1 Basic Induction variable: J =1= 1, 2, 3 , 4, …..

Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J

Saman Amarasinghe 43 6.035 ©MIT Fall 1998 Induction Variables

• Example i = 200 ffororj=1t j = 1 too1 10000 a(201 - j) = 0 i = i - 1 Basic Induction variable: J =1= 1, 2, 3 , 4, …..

Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J

Saman Amarasinghe 44 6.035 ©MIT Fall 1998 Induction Variables

• Example ffororj=1t j = 1 too1 10000 a(201 - j) = 0

Basic Induction variable: J =1= 1, 2, 3 , 4, …..

Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J

Saman Amarasinghe 45 6.035 ©MIT Fall 1998 What are induction variables?

• x is an induction variable of a loop L if – variable changes its value every iteration of the loop – the value is a function of number of iterations of the loop

• In this function is normally a linear function – Example: for loop index variable j, function c*j + d

Saman Amarasinghe 46 6.035 ©MIT Fall 1998 What can we do with induction varibiablles ?

• Use themthem ttoo performperform strength reductionreduction

• Get rid of them

Saman Amarasinghe 47 6.035 ©MIT Fall 1998 Classification of induction variables

• Basic induction variables – Explicitly modified by the same constant amount once during each iteration of the loop – Example: loop index variable

• Dependent induction variables – Can bebe expressedexpressed inin thethe form: axa*x + b wherewhere a andand be are loop invariant and x is an induction variable – Example: 202 - 22j*j

Saman Amarasinghe 48 6.035 ©MIT Fall 1998 Classification of induction variables

• Class of induction variables: All induction variables with same basic variable in their linear equations

• Basis ofof a class: the basicbasic variable that determines that class

Saman Amarasinghe 49 6.035 ©MIT Fall 1998 Finding Basic Induction Variables

• Look inside loop nodes • Find variables whose only modification is of the fformorm j = j+dj + d where d is a looploop constant

Saman Amarasinghe 50 6.035 ©MIT Fall 1998 Finding Dep endent Induction Variables

• Find all the basic induction variables

• Search variable k with a single assignment in the loop

• Variable assignments of the form k = e op j or k = -j where j is an induction variable and e is loop invariant

Saman Amarasinghe 51 6.035 ©MIT Fall 1998 Finding Dep endent Induction Variables

• Example for i = 1 to 100 j = i*c k = j+1

Saman Amarasinghe 52 6.035 ©MIT Fall 1998 A special case t = 202 for j = 1 to 100 t = t - 2 a(j) = t t = t - 2 b(j) = t

Saman Amarasinghe 53 6.035 ©MIT Fall 1998 A special case

u1 = 200 t = 202 u2 = 202 for j = 1 to 100 for j = 1 to 100 t = t - 2 u1 = u1 - 4 a(j) = t a(j) = u1 t = t - 2 u2 = u2 - 4 b(j) = t b(j) = uu22

Saman Amarasinghe 54 6.035 ©MIT Fall 1998 5 Outline

• Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IdInduct ion Variablble Recogn ition • Loop invariant code motion

Saman Amarasinghe 55 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a comp utation p roduces the same value in every loop iteration, move it out of the loop

Saman Amarasinghe 56 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a comp utation p roduces the same value in every loop iteration, move it out of the loop for i = 1 to N x = x+1x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x

Saman Amarasinghe 57 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a comp utation p roduces the same value in every loop iteration, move it out of the loop for i = 1 to N x = x+1x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x

Saman Amarasinghe 58 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a computation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x+1x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x

Saman Amarasinghe 59 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a comp utation p roduces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x+1x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x

Saman Amarasinghe 60 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a comp utation p roduces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x+1x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x

Saman Amarasinghe 61 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a computation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x+1x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x

Saman Amarasinghe 62 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a computation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x+1x + 1 t2 = t1 + 10*i + x for j = 1 to N a( i,j) = t1 + 10*i + j + x

Saman Amarasinghe 63 6.035 ©MIT Fall 1998 Loop Invariant Code Motion

• If a comp utation p roduces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x+1x + 1 t2 = t1 + 10*i + x for j = 1 to N a( i,j) = t2 + j

Saman Amarasinghe 64 6.035 ©MIT Fall 1998 MIT OpenCourseWare http://ocw.mit.edu

6.035 Computer Language Engineering Spring 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.