CS415

Instruction Scheduling

These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

The General Task • At each point in the code, pick the values to keep in registers • Insert code to move values between registers & memory → No instruction reordering (leave that to scheduling) • Minimize inserted code — both dynamic & static measures • Make good use of any extra registers

Allocation versus assignment • Allocation is deciding which values to keep in registers • Assignment is choosing specific registers for each value The must perform both allocation & assignment

cs415, spring 16 2 Lecture 4 Top-down Versus Bottom-up Allocation

Top-down allocator • Work from external notion of what is important • Assign registers in priority order • Register assignment remains fixed for entire basic block • Save some registers for the values relegated to memory (feasible set F)

Bottom-up allocator • Work from detailed knowledge about problem instance • Incorporate knowledge of partial solution at each step • Register assignment may change across basic block (different register assignments for different parts of live range) • Save some registers for the values relegated to memory (feasible set F)

cs415, spring 16 3 Lecture 4 Bottom-up Allocator

The idea: • Focus on replacement rather than allocation • Keep values “used soon” in registers • Only parts of a live range may be assigned to a physical register ( ≠ top-down allocation’s “all-or-nothing” approach) Algorithm: • Start with empty register set • Load on demand • When no register is available, free one Replacement (heuristic): • Spill the value whose next use is farthest in the future

cs415, spring 16 4 Lecture 4 Our Example: Live Ranges

Ø Live Ranges

1 loadI 1028 ⇒ r1 // r1 2 load r1 ⇒ r2 // r1 r2 3 mult r1, r2 ⇒ r3 // r1 r2 r3 4 loadI 5 ⇒ r4 // r1 r2 r3 r4 5 sub r4, r2 ⇒ r5 // r1 r3 r5 6 loadI 8 ⇒ r6 // r1 r3 r5 r6 7 mult r5, r6 ⇒ r7 // r1 r3 r7 8 sub r7, r3 ⇒ r8 // r1 r8 9 store r8 ⇒ r1 //

NOTE: live sets on exit of each instruction

cs415, spring 16 5 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ r1 r1 2 load r1 ⇒ r2 r1 r2 3 mult r1, r2 ⇒ r3 r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3

Let’s generate code now! cs415, spring 16 6 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc write 1 loadI 1028 ⇒ ra r1 2 load r1 ⇒ r2 r1 r2 3 mult r1, r2 ⇒ r3 r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3

For written registers, use current register assignment.

cs415, spring 16 7 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ ra read r1 2 load ra ⇒ r2 r1 r2 3 mult r1, r2 ⇒ r3 r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3

For read registers, use previous register assignment.

cs415, spring 16 8 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ ra write r1 2 load ra ⇒ rb r1 r2 3 mult r1, r2 ⇒ r3 r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3

cs415, spring 16 9 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3

Insert spill code. cs415, spring 16 10 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 store* ra ⇒ 10 spill code r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3

Insert spill code. cs415, spring 16 11 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 store ra ⇒ 10 spill code r1 r2 r3 4 loadI 5 ⇒ ra r4 r2 r3 5 sub ra, rb ⇒ rb r4 r5 r3 6 loadI 8 ⇒ ra r6 r5 r3 7 mult rb, ra ⇒ rb r6 r7 r3 8 sub rb, rc ⇒ rb r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3

cs415, spring 16 12 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 store* ra ⇒ 10 spill code r1 r2 r3 4 loadI 5 ⇒ ra r4 r2 r3 5 sub ra, rb ⇒ rb r4 r5 r3 6 loadI 8 ⇒ ra r6 r5 r3 7 mult rb, ra ⇒ rb r6 r7 r3 8 sub rb, rc ⇒ rb r6 r8 r3 load* 10 ⇒ ra spill code r1 r8 r3 9 store r8 ⇒ r1 *NOT ILOC* r1 r8 r3 Insert spill code. cs415, spring 16 13 Lecture 4 An Example : Bottom-up

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 store* ra ⇒ 10 spill code r1 r2 r3 4 loadI 5 ⇒ ra r4 r2 r3 5 sub ra, rb ⇒ rb r4 r5 r3 6 loadI 8 ⇒ ra r6 r5 r3 7 mult rb, ra ⇒ rb r6 r7 r3 8 sub rb, rc ⇒ rb r6 r8 r3 load* 10 ⇒ ra spill code r1 r8 r3 9 store rb ⇒ ra r1 r8 r3 Done. cs415, spring 16 14 Lecture 4 Spilling revisited

Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc

1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 no spill code 4 loadI 5 ⇒ ra r4 r2 r3 5 sub ra, rb ⇒ rb r4 r5 r3 6 loadI 8 ⇒ ra r6 r5 r3 7 mult rb, ra ⇒ rb r6 r7 r3 8 sub rb, rc ⇒ rb r6 r8 r3 loadI 1028 ⇒ ra spill code r1 r8 r3 9 store rb ⇒ ra r1 r8 r3

Rematerialization: Re-computation is cheaper than store/load to memory cs415, spring 16 15 Lecture 4 An Example : Bottom-up

memory layout Ø Bottom up (3 registers) 0 . spill . 1 loadI 1028 ⇒ r1 // r1 . addresses 2 load r1 ⇒ r2 // r1 r2 1024 3 mult r1, r2 ⇒ r3 // r1 r2 r3 4 loadI 5 ⇒ r4 // r1 r2 r3 r4 5 sub r4, r2 ⇒ r5 // r1 r3 r5 data 6 loadI 8 ⇒ r6 // r1 r3 r5 r6 addresses 7 mult r5, r6 ⇒ r7 // r1 r3 r7 8 sub r7, r3 ⇒ r8 // r1 r8 9 store r8 ⇒ r1 //

Note that this assumes that extra registers are not needed for save/ restore

cs415, spring 16 16 Lecture 4 Review: The Back End

IR Instruction IR Register IR Instruction Machine Selection Allocation Scheduling code

Errors Responsibilities • Translate IR into target machine code • Choose instructions to implement each IR operation • Decide which value to keep in registers • Ensure conformance with system interfaces

Automation has been less successful in the back end

cs415, spring 16 17 Lecture 4 Local Instruction Scheduling

Readings: EaC 12.1-12.3

Local: within single basic block Global: across entire procedure

cs415, spring 16 18 Lecture 4 Instruction Scheduling

Motivation • Instruction latency (pipelining) several cycles to complete instructions; instructions can be issued every cycle • Instruction-level parallelism (VLIW, superscalar) execute multiple instructions per cycle

Issues • Reorder instructions to reduce execution time • Static schedule – insert NOPs to preserve correctness • Dynamic schedule – hardware pipeline stalls • Preserve correctness • Interactions with other optimizations (register allocation!)

cs415, spring 16 19 Lecture 4 Instruction Scheduling

Motivation • Instruction latency (pipelining) several cycles to complete instructions; instructions can be issued every cycle • Instruction-level parallelism (VLIW, superscalar) execute multiple instructions per cycle

Issues • Reorder instructions to reduce execution time (or power requirements) • Static schedule – insert NOPs to preserve correctness • Dynamic schedule – hardware pipeline stalls • Preserve correctness • Interactions with other optimizations (register allocation!) • Note: code shape contains real, not virtual registers ==> register may be redefined cs415, spring 16 20 Lecture 4

Instruction Scheduling (Engineer’s View)

The Problem Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time The task The Concept • Produce correct code Machine description • Minimize wasted (idle) cycles • Operate efficiently slow fast Scheduler code code

cs415, spring 16 21 Lecture 4 Data Dependences (stmt./instr. level)

Dependences ⇒ defined on memory locations / registers

Statement/instruction b depends on statement/instruction a if there exists:

• true of flow dependence a writes a location/register that b later reads (RAW conflict)

• anti dependence a reads a location/register that b later writes (WAR conflict)

• output dependence a writes a location/register that b later writes (WAW conflict)

Dependences define ORDER CONSTRAINTS that need to be respected in order to generate correct code. true anti output a = = a a = = a a = a =

cs415, spring 16 22 Lecture 4 Instruction Scheduling (The Abstract View)

To capture properties of the code, build a dependence graph G • Nodes n ∈ G are operations with type(n) and delay(n)

• An edge e = (n1,n2) ∈ G iff n2 uses the result of n1

a: loadAI r0,@w ⇒ r1 a true b: add r1,r1 ⇒ r1 c: loadAI r0,@x ⇒ r2 b c anti d: mult r1,r2 ⇒ r1 e e: loadAI r0,@y ⇒ r3 d g f: mult r1,r3 ⇒ r1 f g: loadAI r0,@z r2 ⇒ h h: mult r1,r2 ⇒ r1 i: storeAI r1 r0,@w i ⇒

The Dependence Graph The Code (all output dependences are covered, i.e., are satisfied through other cs415, spring 16 23 Lecture 4 dependences) Instruction Scheduling (Definitions)

A correct schedule S maps each n∈ N into a non-negative integer representing its cycle number such that 1. S(n) ≥ 0, for all n ∈ N, obviously

2. If (n1,n2) ∈ E, S(n1 ) + delay(n1 ) ≤ S(n2 ) 3. For each type t, there are no more operations of type t in any cycle than the target machine can issue

The length of a schedule S, denoted L(S), is

L(S) = maxn ∈ N (S(n) + delay(n))

The goal is to find the shortest possible correct schedule.

S is time-optimal if L(S) ≤ L(S1 ), for all other schedules S1 A schedule might also be optimal in terms of registers, power, or space….

cs415, spring 16 24 Lecture 4 Instruction Scheduling (What’s so difficult?)

Critical Points • All operands must be available • Multiple operations can be ready • Moving operations can lengthen register lifetimes • Placing uses near definitions can shorten register lifetimes • Operands can have multiple predecessors Together, these issues make scheduling hard (NP-Complete)

Local scheduling is the simple case • Restricted to straight-line code (single basic block) • Consistent and predictable latencies

cs415, spring 16 25 Lecture 4 Instruction Scheduling

The big picture 1. Build a dependence graph, P 2. Compute a priority function over the nodes in P 3. Use list scheduling to construct a schedule, one cycle at a time (can only issue/schedule at most one instructions per cycle) a. Use a queue of operations that are ready b. At each cycle I. Choose a ready operation and schedule it II. Update the ready queue

Local list scheduling • The dominant algorithm for decades • A greedy, heuristic, local technique cs415, spring 16 26 Lecture 4 Local (Forward) List Scheduling

Cycle ← 1 Removal in priority order Ready ← leaves of P Active ← Ø while (Ready ∪ Active ≠ Ø) if (Ready ≠ Ø) then remove an op from Ready S(op) ← Cycle op has completed execution Active ← Active ∪ op Cycle ← Cycle + 1 for each op ∈ Active if (S(op) + delay(op) ≤ Cycle) then remove op from Active If successor’s operands are for each successor s of op in P ready, put it on Ready if (s is ready) then Ready ← Ready ∪ s

cs415, spring 16 27 Lecture 4 Scheduling Example

• Loads & stores may or may not block Operation Cycles > Non-blocking ⇒fill those issue slots load 3 • Branches typically have delay slots loadI 1 > Fill slots with operations unrelated to loadAI 3 branch condition evaluation store 3 • Branch Prediction may hide branch storeAI 3 latencies (hardware feature) add 1 mult 2 fadd 1 Build a simple local scheduler (basic fmult 2 block) shift 1 branch 0 to 8 - non-blocking loads & stores - different latencies load/store vs. arith. etc. operations - different heuristics - forward / backward scheduling cs415, spring 16 28 Lecture 4 Scheduling Example

1. Build the dependence graph

true 1 a: loadAI r0,@w ⇒ r1 a 4 b: add r1,r1 ⇒ r1 anti c: loadAI r0,@x 5 ⇒ r2 c 8 d: mult r1,r2 ⇒ r1 b 9 e: loadAI r0,@y ⇒ r3 d e 12 f: mult r1,r3 ⇒ r1 g f 13 g: loadAI r0,@z ⇒ r2 16 h: mult r1,r2 ⇒ r1 h 18 i: storeAI r1 r0,@w ⇒ i 21 The Code The Dependence Graph ⇒ 20 cycles cs415, spring 16 29 Lecture 4 Scheduling Example Operation Cycles load 3 loadI 1 Build the dependence graph loadAI 3 1. storeAI 3 2. Determine priorities: longest latency-weighted path add 1 mult 2 true a: loadAI r0,@w ⇒ r1 a b: add r1,r1 ⇒ r1 anti c: loadAI r0,@x ⇒ r2 c d: mult r1,r2 ⇒ r1 b e e: loadAI r0,@y ⇒ r3 10 d f: mult r1,r3 ⇒ r1 g 8 f g: loadAI r0,@z ⇒ r2 7 h: mult r1,r2 ⇒ r1 h 5 i: storeAI r1 r0,@w ⇒ i 3

The Code The Dependence Graph

cs415, spring 16 30 Lecture 4 Scheduling Example

1. Build the dependence graph 2. Determine priorities: longest latency-weighted path

true a: loadAI r0,@w r1 14 ⇒ a b: add r1,r1 ⇒ r1 anti c: loadAI r0,@x ⇒ r2 13 b c d: mult r1,r2 ⇒ r1 11 10 e e: loadAI r0,@y ⇒ r3 10 d 8 f: mult r1,r3 ⇒ r1 g f g: loadAI r0,@z ⇒ r2 7 5 h: mult r1,r2 ⇒ r1 h i: storeAI r1 r0,@w 3 ⇒ i

The Code The Dependence Graph

cs415, spring 16 31 Lecture 4 Scheduling Example

1. Build the dependence graph 2. Determine priorities: longest latency-weighted path

true a: loadAI r0,@w r1 14 ⇒ a b: add r1,r1 ⇒ r1 anti c: loadAI r0,@x ⇒ r2 13 b c d: mult r1,r2 ⇒ r1 11 10 e e: loadAI r0,@y ⇒ r3 10 d 8 f: mult r1,r3 ⇒ r1 g f g: loadAI r0,@z ⇒ r2 7 5 h: mult r1,r2 ⇒ r1 h i: storeAI r1 r0,@w 3 ⇒ i

The Code The Dependence Graph Note: Here we assume that operation has to finish to satisfy an anti dependence.

cs415, spring 16 32 Lecture 4 Scheduling Example

1. Build the dependence graph 2. Determine priorities: longest latency-weighted path 3. Perform list scheduling (forward) true a: loadAI r0,@w r1 14 ⇒ a b: add r1,r1 ⇒ r1 anti c: loadAI r0,@x ⇒ r2 13 b c d: mult r1,r2 ⇒ r1 11 10 e e: loadAI r0,@y ⇒ r3 10 d 8 f: mult r1,r3 ⇒ r1 g f g: loadAI r0,@z ⇒ r2 7 5 h: mult r1,r2 ⇒ r1 h i: storeAI r1 r0,@w 3 ⇒ i

The Code The Dependence Graph

cs415, spring 16 33 Lecture 4 Scheduling Example

1. Build the dependence graph 2. Determine priorities: longest latency-weighted path 3. Perform list scheduling (forward) 14 true

1 a: loadAI r0,@w ⇒ r 1 a 2 c: l o adA I r 0,@ x ⇒ r 2 anti 3 e: l o adA I r 0,@ y ⇒ r 3 13 b c 4 b: a d d r 1,r 1 ⇒ r 1 11 10 d: m ult r 1,r 2 r 1 e 5 ⇒ 10 d 8 7 g: loadAI r0,@z ⇒ r2 g 8 f: m ult r 1,r 3 ⇒ r 1 7 f 5 10 h: m ult r 1,r 2 ⇒ r 1 h 12 i: s t or e A I r 1 ⇒ r 0,@w 3 i 15 The Code ⇒ 14 The Dependence Graph cycles

cs415, spring 16 34 Lecture 4 More on Scheduling

Forward list scheduling Backward list scheduling • start with available ops • start with no successors • work forward • work backward • ready ⇒ all operands available • ready ⇒ latency covers operands

Different heuristics (forward) based on Precedence/Dependence Graph 1. Longest latency weighted path to root (⇒ critical path) 2. Highest latency instructions (⇒ more overlap) 3. Most immediate successors (⇒ create more candidates) 4. Most descendents (⇒ create more candidates) 5. …

Interactions with register allocation • perform dynamic register renaming (⇒ may require spill code) • move life ranges around (⇒ may remove or require spill code) • …

cs415, spring 16 35 Lecture 4