CS415 Compilers Instruction Scheduling

CS415 Compilers Instruction Scheduling These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University Register Allocation The General Task • At each point in the code, pick the values to keep in registers • Insert code to move values between registers & memory → No instruction reordering (leave that to scheduling) • Minimize inserted code — both dynamic & static measures • Make good use of any extra registers Allocation versus assignment • Allocation is deciding which values to keep in registers • Assignment is choosing specific registers for each value The compiler must perform both allocation & assignment cs415, spring 16 2 Lecture 4 Top-down Versus Bottom-up Allocation Top-down allocator • Work from external notion of what is important • Assign registers in priority order • Register assignment remains fixed for entire basic block • Save some registers for the values relegated to memory (feasible set F) Bottom-up allocator • Work from detailed knowledge about problem instance • Incorporate knowledge of partial solution at each step • Register assignment may change across basic block (different register assignments for different parts of live range) • Save some registers for the values relegated to memory (feasible set F) cs415, spring 16 3 Lecture 4 Bottom-up Allocator The idea: • Focus on replacement rather than allocation • Keep values “used soon” in registers • Only parts of a live range may be assigned to a physical register ( ≠ top-down allocation’s “all-or-nothing” approach) Algorithm: • Start with empty register set • Load on demand • When no register is available, free one Replacement (heuristic): • Spill the value whose next use is farthest in the future cs415, spring 16 4 Lecture 4 Our Example: Live Ranges Ø Live Ranges 1 loadI 1028 ⇒ r1 // r1 2 load r1 ⇒ r2 // r1 r2 3 mult r1, r2 ⇒ r3 // r1 r2 r3 4 loadI 5 ⇒ r4 // r1 r2 r3 r4 5 sub r4, r2 ⇒ r5 // r1 r3 r5 6 loadI 8 ⇒ r6 // r1 r3 r5 r6 7 mult r5, r6 ⇒ r7 // r1 r3 r7 8 sub r7, r3 ⇒ r8 // r1 r8 9 store r8 ⇒ r1 // NOTE: live sets on exit of each instruction cs415, spring 16 5 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ r1 r1 2 load r1 ⇒ r2 r1 r2 3 mult r1, r2 ⇒ r3 r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3 Let’s generate code now! cs415, spring 16 6 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc write 1 loadI 1028 ⇒ ra r1 2 load r1 ⇒ r2 r1 r2 3 mult r1, r2 ⇒ r3 r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3 For written registers, use current register assignment. cs415, spring 16 7 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ ra read r1 2 load ra ⇒ r2 r1 r2 3 mult r1, r2 ⇒ r3 r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3 For read registers, use previous register assignment. cs415, spring 16 8 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ ra write r1 2 load ra ⇒ rb r1 r2 3 mult r1, r2 ⇒ r3 r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3 cs415, spring 16 9 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3 Insert spill code. cs415, spring 16 10 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 store* ra ⇒ 10 spill code r1 r2 r3 4 loadI 5 ⇒ r4 r4 r2 r3 5 sub r4, r2 ⇒ r5 r4 r5 r3 6 loadI 8 ⇒ r6 r6 r5 r3 7 mult r5, r6 ⇒ r7 r6 r7 r3 8 sub r7, r3 ⇒ r8 r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3 Insert spill code. cs415, spring 16 11 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 store ra ⇒ 10 spill code r1 r2 r3 4 loadI 5 ⇒ ra r4 r2 r3 5 sub ra, rb ⇒ rb r4 r5 r3 6 loadI 8 ⇒ ra r6 r5 r3 7 mult rb, ra ⇒ rb r6 r7 r3 8 sub rb, rc ⇒ rb r6 r8 r3 9 store r8 ⇒ r1 r1 r8 r3 cs415, spring 16 12 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 store* ra ⇒ 10 spill code r1 r2 r3 4 loadI 5 ⇒ ra r4 r2 r3 5 sub ra, rb ⇒ rb r4 r5 r3 6 loadI 8 ⇒ ra r6 r5 r3 7 mult rb, ra ⇒ rb r6 r7 r3 8 sub rb, rc ⇒ rb r6 r8 r3 load* 10 ⇒ ra spill code r1 r8 r3 9 store r8 ⇒ r1 *NOT ILOC* r1 r8 r3 Insert spill code. cs415, spring 16 13 Lecture 4 An Example : Bottom-up Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 store* ra ⇒ 10 spill code r1 r2 r3 4 loadI 5 ⇒ ra r4 r2 r3 5 sub ra, rb ⇒ rb r4 r5 r3 6 loadI 8 ⇒ ra r6 r5 r3 7 mult rb, ra ⇒ rb r6 r7 r3 8 sub rb, rc ⇒ rb r6 r8 r3 load* 10 ⇒ ra spill code r1 r8 r3 9 store rb ⇒ ra r1 r8 r3 Done. cs415, spring 16 14 Lecture 4 Spilling revisited Ø Bottom up (3 physical registers: ra, rb, rc) register allocation and assignment(on exit) source code ra rb rc 1 loadI 1028 ⇒ ra r1 2 load ra ⇒ rb r1 r2 3 mult ra, rb ⇒ rc r1 r2 r3 no spill code 4 loadI 5 ⇒ ra r4 r2 r3 5 sub ra, rb ⇒ rb r4 r5 r3 6 loadI 8 ⇒ ra r6 r5 r3 7 mult rb, ra ⇒ rb r6 r7 r3 8 sub rb, rc ⇒ rb r6 r8 r3 loadI 1028 ⇒ ra spill code r1 r8 r3 9 store rb ⇒ ra r1 r8 r3 Rematerialization: Re-computation is cheaper than store/load to memory cs415, spring 16 15 Lecture 4 An Example : Bottom-up memory layout Ø Bottom up (3 registers) 0 . spill . 1 loadI 1028 ⇒ r1 // r1 . addresses 2 load r1 ⇒ r2 // r1 r2 1024 3 mult r1, r2 ⇒ r3 // r1 r2 r3 4 loadI 5 ⇒ r4 // r1 r2 r3 r4 5 sub r4, r2 ⇒ r5 // r1 r3 r5 data 6 loadI 8 ⇒ r6 // r1 r3 r5 r6 addresses 7 mult r5, r6 ⇒ r7 // r1 r3 r7 8 sub r7, r3 ⇒ r8 // r1 r8 9 store r8 ⇒ r1 // Note that this assumes that extra registers are not needed for save/ restore cs415, spring 16 16 Lecture 4 Review: The Back End IR Instruction IR Register IR Instruction Machine Selection Allocation Scheduling code Errors Responsibilities • Translate IR into target machine code • Choose instructions to implement each IR operation • Decide which value to keep in registers • Ensure conformance with system interfaces Automation has been less successful in the back end cs415, spring 16 17 Lecture 4 Local Instruction Scheduling Readings: EaC 12.1-12.3 Local: within single basic block Global: across entire procedure cs415, spring 16 18 Lecture 4 Instruction Scheduling Motivation • Instruction latency (pipelining) several cycles to complete instructions; instructions can be issued every cycle • Instruction-level parallelism (VLIW, superscalar) execute multiple instructions per cycle Issues • Reorder instructions to reduce execution time • Static schedule – insert NOPs to preserve correctness • Dynamic schedule – hardware pipeline stalls • Preserve correctness • Interactions with other optimizations (register allocation!) cs415, spring 16 19 Lecture 4 Instruction Scheduling Motivation • Instruction latency (pipelining) several cycles to complete instructions; instructions can be issued every cycle • Instruction-level parallelism (VLIW, superscalar) execute multiple instructions per cycle Issues • Reorder instructions to reduce execution time (or power requirements) • Static schedule – insert NOPs to preserve correctness • Dynamic schedule – hardware pipeline stalls • Preserve correctness • Interactions with other optimizations (register allocation!) • Note: code shape contains real, not virtual registers ==> register may be redefined cs415, spring 16 20 Lecture 4 Instruction Scheduling (Engineer’s View) The Problem Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time The task The Concept • Produce correct code Machine description • Minimize wasted (idle) cycles • Operate efficiently slow fast Scheduler code code cs415, spring 16 21 Lecture 4 Data Dependences (stmt./instr.

CS415 Compilers Instruction Scheduling

Integrating Program Optimizations and Transformations with the Scheduling of Instruction Level Parallelism*

Computer Architecture: Static Instruction Scheduling

Static Instruction Scheduling for High Performance on Limited Hardware

Lecture 1 Introduction

Compiler Optimisation 6 – Instruction Scheduling

Register Allocation and Instruction Scheduling in Unison

An Application of Constraint Programming to Superblock Instruction Scheduling

A Simple, Verified Validator for Software Pipelining

SSE Optimizations These Optimizations Leverage the Streaming SIMD Extension (SSE) Instruction Set of the CPU

Survey on Combinatorial Register Allocation and Instruction Scheduling ACM Computing Surveys

Learning Heuristics for the Superblock Instruction Scheduling Problem Tyrel Russell, Abid M

Cluster Assignment and Scheduling for Partitioned Register Set Machines