6.035 Lecture 14, Loop Optimizations: Instruction Scheduling

Spring 20102010 Loop Optimizations Instruction SchedulingScheduling 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IdInduct ion Variablble Recogn ition • loop invariant code motion Saman Amarasinghe 2 6.035 ©MIT Fall 1998 Scheduling Loop s •Loop bodies are small • But, lot of time is spend in loops due to large number ooff iterations • Need better ways to schedule loops Saman Amarasinghe 3 6.035 ©MIT Fall 1998 Loop Examp le • Machine – One load/store unit • load 2 cycles • store 2 cycles – Two arithmetic units • add 2 cycles • branch 2 cycles • multiply 3 cycles – Both units are ppipelinedipelined (initiate one op each cycle) • Source Code for i = 1 to N A[i] = A[i] * b Saman Amarasinghe 4 6.035 ©MIT Fall 1998 Loop Examp le • Source Code for i = 1 to N base A[i] = A[i] * b • Assembly Code off set loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop Saman Amarasinghe 5 6.035 ©MIT Fall 1998 Looppp Example mov d=7 2 • Assembly Code imul d=5 loop: 3 mov (%rdi,%rax), %r10 mov d=2 imul %r11, %r10 0 mov %r10, (%rdi,%rax) sub $4, %rax sub d=2 jz loop 2 • Schedule (9 cycles per iteration) jz d=0 mov mov mov mov imul bge imul bge imul sub sub Saman Amarasinghe 6 6.035 ©MIT Fall 1998 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IdInduct ion Variablble Recogn ition • loop invariant code motion Saman Amarasinghe 7 6.035 ©MIT Fall 1998 Loop Unrolling • Unroll the loop bod y few times •Pros: – Create a much larger basic blockblock for thethe body – Eliminate few loop bounds checks • Cons: – Much larger program – Setup code (#(# of iterat ions < unro llll factor) – beginning and end of the schedule can still have unusedld slots Saman Amarasinghe 8 6.035 ©MIT Fall 1998 loop: Loop Example mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop Saman Amarasinghe 9 6.035 ©MIT Fall 1998 loop: Loop Example mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop Saman Amarasinghe 10 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 loop: Looppp Example 3 mov (%rdi,%rax), %r10 mov d=9 imul %r11, %r10 0 mov %r10, (%rdi,%rax) sub d=9 2 sub $4, %rax mov d=7 mov (%rdi,%rax), %r10 2 imul %r11, %r10 mul d=5 mov %r10, (%rdi,%rax) 3 sub $4, %rax mov d=2 jz loop 0 sub d=2 2 • Schedule (8 cycles per iteration) jz d=0 mov mov mov mov mov mov mov mov imul imul bge imul imul bge imul imul sub sub sub sub Saman Amarasinghe 11 6.035 ©MIT Fall 1998 Loop Unrolling • Rename registers – Use different registers in different iterations Saman Amarasinghe 12 6.035 ©MIT Fall 1998 mov d=14 2 Looppp Example mul d=12 loop: 3 mov (%rdi,%rax), %r10 mov d=9 imul %r11, %r10 0 sub d=9 mov %10%r10, (%rdi,% %rax) 2 sub $4, %rax mov d=7 mov (%rdi,%rax), %r10 2 mul d=5 imul %r11, %r10 3 mov %r10, (%rdi,%rax) mov d=2 0 sub $4, %rax sub d2d=2 jz loop 2 jz d=0 Saman Amarasinghe 13 6.035 ©MIT Fall 1998 mov d=14 2 Looppp Example mul d=12 loop: 3 mov (%rdi,%rax), %r10 mov d=9 imul %r11, %r10 0 sub d=9 mov %10%r10, (%rdi,% %rax) 2 sub $4, %rax mov d=7 mov (%rdi,%rax), %rcx 2 mul d=5 imul %r11, %rcx 3 mov %rcx, (%rdi,%rax) mov d=2 0 sub $4, %rax sub d2d=2 jz loop 2 jz d=0 Saman Amarasinghe 14 6.035 ©MIT Fall 1998 Loop Unrolling • Rename registers – Use different registers in different iterations • Eliminate unnecessary dependencies – again, useuse more registers toto eliminateeliminate truetrue, antianti andand output dependencies – eliminate dependentdependent -chains of calculations when possible Saman Amarasinghe 15 6.035 ©MIT Fall 1998 mov d=14 2 mul d=12 loop: Looppp Example 3 mov (%rdi,%rax), %r10 mov d=9 0 imul %r11, %r10 sub d=9 mov %r%10(%10, (%rdi,%rax) 2 mov d=7 sub $4, %rax 2 mov (%rdi,%rax), %rcx mul d=5 imul %r11, %rcx 3 mov d=2 mov %rcx, (%rdi,%rax) 0 sub $4, %rax sub d=2 2 jz loop jz d=0 Saman Amarasinghe 16 6.035 ©MIT Fall 1998 mov d=5 2 mul d=3 loop: Looppp Example 3 mov (%rdi,%rax), %r10 mov d=0 0 imul %r11, %r10 sub d=0 mov %r%10(%10, (%rdi,%rax) 2 mov d=7 sub $8, %rax 2 mov (%rdi,%rbx), %rcx mul d=5 imul %r11, %rcx 3 mov d=2 mov %rcx, (%rdi,%rbx) 0 sub $8, %rbx sub d=2 2 jz loop jz d=0 Saman Amarasinghe 17 6.035 ©MIT Fall 1998 mov d=5 2 mul d=3 loop: Looppp Example 3 mov (%rdi,%rax), %r10 mov d=0 imul %r11, %r10 0 mov %r10, (%rdi,%rax) sub d=0 2 sub $8, %rax mov d=7 mov (%rdi,%rbx), %rcx 2 imul %r11, %rcx mul d=5 mov %rcx, (%rdi,%rbx) 3 sub $8, %rbx mov d=2 jz loop 0 sub d=2 •S Schedule h d l(4 .5 5 cyc les per itera tion 2 jz d=0 mov mov mov mov mov mov mov mov imul imul jz imul imul jz imul imul sub sub sub sub Saman Amarasinghe 18 6.035 ©MIT Fall 1998 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • loop invariant code motion • Induction Variable Recognition Saman Amarasinghe 19 6.035 ©MIT Fall 1998 Software Pipelining •Try to overla p multi ple iterations so that the slots will be filled • Find the steadysteady -state windowwindow so that: – all the instructions of the loop body is executed – but from ddifferentifferent iterations Saman Amarasinghe 20 6.035 ©MIT Fall 1998 Loop Examp le • Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop • Schedule mov mov mov mov mul jz mul jz mul sub sub Saman Amarasinghe 21 6.035 ©MIT Fall 1998 Looppp Example • Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop • Schedule mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6 mov mov1 mov2 mov mov3 mov1 mov4 mov2 ld5 mov3 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul mul1 mul2 mul3 mul4 sub sub1 sub2 sub3 sub sub1 sub2 sub3 Saman Amarasinghe 22 6.035 ©MIT Fall 1998 Looppp Example • Assembly Code loop: mov (%rdi,%rax), %r10 imul %r11, %r10 mov %r10, (%rdi,%rax) sub $4, %rax jz loop • Schedule (2 cycles per iteration) mov mov1 mov2 mov mov3 mov1 mov4 mov2 mov5 mov3 mov6 mov mov1 mov2 mov mov3 mov1 mov4 mov2 ld5 mov3 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul5 mul mul1 mul2 jz mul3 jz1 mul4 jz2 mul mul1 mul2 mul3 mul4 sub sub1 sub2 sub3 sub sub1 sub2 sub3 Saman Amarasinghe 23 6.035 ©MIT Fall 1998 Looppp Example • 4 iterations are overlapped %r11 mov4 mov2 – value of don’t change mov1 mov4 mul3 jz1 – 4 regs for (%rdi,%rax) jz mul3 – each addr. incremented by 4*4 mul2 sub2 %r10 – 4 regs to keep value sub1 – Same registers can be reused after 4 of these blocks loop: mov (%rdi,%rax), %r10 g,generate code for 4 blocks, imul %r11, %r10 otherwise need to move mov %r10, (%rdi,%rax) sub $4, %rax jz loop Saman Amarasinghe 24 6.035 ©MIT Fall 1998 Software Pip elining • Optimal use of resources • NdNeed a lot of reg isters – Values in multiple iterations need to be kept • Issues in dependencies – Executing a store instruction in an iteration before branch instruction is executed for a previous iteration (writing when it should not have) – Loads and stores are issued out-of-order (need to figure-out dependencies before doing this) • Code generation issues – Generate pre-amble and post-amble code – Multiple blocks so no register copy is needed Saman Amarasinghe 25 6.035 ©MIT Fall 1998 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs. Compiler • IdInduct ion Variablble Recogn ition • loop invariant code motion Saman Amarasinghe 26 6.035 ©MIT Fall 1998 Register Allocation andId Ins truc tition Sc he du liling • If register allocation is before instruction scheduling – restricts the choices for scheduling Saman Amarasinghe 27 6.035 ©MIT Fall 1998 Example 1: mov 4(%rbp), %rax 2: a dddd %rax, %rbx 3: mov 8(%rbp), %rax 4: add %rax, %rcx Saman Amarasinghe 28 6.035 ©MIT Fall 1998 Exampple 1: mov 4(%rbp), %rax 1 2dd2: add %%b%rax, %rbx 3 3: mov 8(%rbp), %rax 1 1 4: add %rax, %rcx 2 1 1 3 3 4 Saman Amarasinghe 29 6.035 ©MIT Fall 1998 Exampple 1: mov 4(%rbp), %rax 1 2dd2: add %%b%rax, %rbx 3 3: mov 8(%rbp), %rax 1 1 4: add %rax, %rcx 2 1 1 3 3 ALUop 2 4 4 MEM 1 1 3 MEM 2 1 3 Saman Amarasinghe 30 6.035 ©MIT Fall 1998 Exampple 1: mov 4(%rbp), %rax 1 2dd2: add %rax,%b %rbx 3 3: mov 8(%rbp), %rax 1 1 4: add %rax, %rcx 2 1 1 Anti-dependence 3 How about a different register? 3 4 Saman Amarasinghe 31 6.035 ©MIT Fall 1998 Exampple 1: mov 4(%rbp), %rax 1 2dd2: add %%b%rax, %rbx 3 3: mov 8(%rbp), %r10 4: add %r10, %rcx 2 Anti-dependence 3 How about a different register? 3 4 Saman Amarasinghe 32 6.035 ©MIT Fall 1998 Exampple 1: mov 4(%rbp), %rax 1 2dd2: add %%b%rax, %rbx 3 3: mov 8(%rbp), %r10 4: add %r10, %rcx 2 3 3 ALUop 2 4 4 MEM 1 1 3 MEM 2 1 3 Saman Amarasinghe 33 6.035 ©MIT Fall 1998 Register Allocation andId Ins truc tition Sc he du liling • If register allocation is before instruction scheduling – restricts the choices for scheduling Saman Amarasinghe 34 6.035 ©MIT Fall 1998 Register Allocation andId Instructition Scheduliling • If register allocation is before instruction scheduling – restricts the choices for scheduling • If instruction schedulingscheduling bbeforeefore registerregister allocation – Register allocation may spillspill registers – Will change the carefully done schedule!!! Saman Amarasinghe 35 6.035 ©MIT Fall 1998 5 Outline • Scheduling for loops • Loop unrolling • Software pipelining • Interaction with register allocation • Hardware vs.

6.035 Lecture 14, Loop Optimizations: Instruction Scheduling

Integrating Program Optimizations and Transformations with the Scheduling of Instruction Level Parallelism*

Equality Saturation: a New Approach to Optimization

Scalable Conditional Induction Variables (CIV) Analysis

Induction Variable Analysis with Delayed Abstractions1

Strength Reduction of Induction Variables and Pointer Analysis – Induction Variable Elimination

Computer Architecture: Static Instruction Scheduling

Generalizing Loop-Invariant Code Motion in a Real-World Compiler

Efficient Symbolic Analysis for Optimizing Compilers*

Static Instruction Scheduling for High Performance on Limited Hardware

Compiler-Based Code-Improvement Techniques

Lecture 1 Introduction

Compiler Optimisation 6 – Instruction Scheduling