Computer Architecture: Static Instruction Scheduling
Prof. Onur Mutlu (edited by Seth), Carnegie Mellon University

Key Questions
- Q1. How do we find independent instructions to fetch/execute?
- Q2. How do we enable more compiler optimizations?
  (e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination, ...)
- Q3. How do we increase the instruction fetch rate?
  (i.e., have the ability to fetch more instructions per cycle)
- A: Enabling the compiler to optimize across a larger number of instructions that will be executed straight-line (without branches getting in the way) eases all of the above.

VLIW (Very Long Instruction Word)
- Simple hardware with multiple function units
  - Reduced hardware complexity
  - Little or no scheduling done in hardware, e.g., in-order
  - Hopefully, faster clock and less power
- Compiler required to group and schedule instructions (compare to OoO superscalar)
  - Predicated instructions to help with scheduling (trace scheduling, etc.)
  - More registers (for software pipelining, etc.)
- Example machines:
  - Multiflow, Cydra 5 (8-16 ops per VLIW)
  - IA-64 (3 ops per bundle)
  - TMS32xxxx (5+ ops per VLIW)
  - Crusoe (4 ops per VLIW)

Comparison: Superscalar vs. VLIW; CISC vs. RISC vs. VLIW
[Comparison tables from Mark Smotherman, "Understanding EPIC Architectures and Implementations"]

EPIC – Intel IA-64 Architecture
- Gets rid of lock-step execution of instructions within a VLIW instruction
- Idea: more ISA support for static scheduling and parallelization
  - Specify dependencies within and between VLIW instructions (explicitly parallel)
- + No lock-step execution
- + Static reordering of stores and loads + dynamic checking
- -- Hardware needs to perform dependency checking (albeit aided by software)
- -- Other disadvantages of VLIW still exist
- Huck et al., "Introducing the IA-64 Architecture," IEEE Micro, Sep/Oct 2000.

IA-64 Instructions
- IA-64 "bundle" (~EPIC instruction)
  - Total of 128 bits
  - Contains three IA-64 instructions
  - Template bits in each bundle specify dependencies within a bundle
- IA-64 instruction
  - Fixed length, 41 bits
  - Contains three 7-bit register specifiers
  - Contains a 6-bit field for specifying one of the 64 one-bit predicate registers
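To make the bundle format above concrete, here is a minimal C sketch that unpacks the fields just listed from a 128-bit bundle. The 5-bit template width, the bit positions and field order within each 41-bit slot, and every name in the code are assumptions made for illustration; the real IA-64 encoding in the Intel manuals differs in detail per instruction type.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the bundle layout described above: 128 bits holding a dependence
 * template plus three 41-bit instruction slots.  The 5-bit template width,
 * the field order within a slot, and all names are illustrative assumptions,
 * not the exact encoding from the Intel manuals. */
typedef struct {
    uint64_t lo, hi;              /* raw 128 bundle bits, split into two halves */
} ia64_bundle_t;

typedef struct {
    unsigned qp;                  /* 6-bit predicate register specifier */
    unsigned r1, r2, r3;          /* three 7-bit register specifiers    */
    unsigned opcode_etc;          /* remaining 14 bits of the 41        */
} ia64_slot_t;

/* Extract bits [pos, pos+len) from the 128-bit bundle (len <= 63 here). */
static uint64_t get_bits(ia64_bundle_t b, unsigned pos, unsigned len)
{
    uint64_t v;
    if (pos >= 64)
        v = b.hi >> (pos - 64);
    else if (pos + len <= 64)
        v = b.lo >> pos;
    else                          /* field straddles the two 64-bit halves */
        v = (b.lo >> pos) | (b.hi << (64 - pos));
    return v & ((1ULL << len) - 1);
}

int main(void)
{
    ia64_bundle_t b = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };

    unsigned tmpl = (unsigned)get_bits(b, 0, 5);   /* dependence template bits */
    printf("template = %u\n", tmpl);

    for (int s = 0; s < 3; s++) {                  /* three 41-bit slots */
        unsigned base = 5 + 41 * s;
        ia64_slot_t slot = {
            .qp         = (unsigned)get_bits(b, base,      6),
            .r1         = (unsigned)get_bits(b, base + 6,  7),
            .r2         = (unsigned)get_bits(b, base + 13, 7),
            .r3         = (unsigned)get_bits(b, base + 20, 7),
            .opcode_etc = (unsigned)get_bits(b, base + 27, 14),
        };
        printf("slot %d: qp=p%u r1=r%u r2=r%u r3=r%u op=0x%x\n",
               s, slot.qp, slot.r1, slot.r2, slot.r3, slot.opcode_etc);
    }
    return 0;
}
```

The arithmetic works out because three 41-bit slots plus a 5-bit template use exactly 3 * 41 + 5 = 128 bits.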
IA-64 Instruction Bundles and Groups
- Groups of instructions: can be executed safely in parallel
  - Marked by "stop bits"
- Bundles are for packaging
- Groups can span multiple bundles
  - Alleviates recompilation need somewhat

VLIW: Finding Independent Operations
- Within a basic block, there is limited instruction-level parallelism
- To find multiple instructions to be executed in parallel, the compiler needs to consider multiple basic blocks
- Problem: moving an instruction above a branch is unsafe because the instruction is not guaranteed to be executed
- Idea: enlarge blocks at compile time by finding the frequently executed paths
  - Trace scheduling
  - Superblock scheduling
  - Hyperblock scheduling
  - Software pipelining
- It is all about the compiler and how it schedules the instructions to maximize parallelism

List Scheduling (for one basic block)
- Assign a priority to each instruction
- Initialize the ready list that holds all ready instructions
  - Ready = data ready and can be scheduled
- Choose one ready instruction I from the ready list with the highest priority
  - Possibly using tie-breaking heuristics
- Insert I into the schedule
  - Making sure resource constraints are satisfied
- Add the instructions whose precedence constraints are now satisfied to the ready list

Data Precedence Graph
[Figure: example data precedence graph over operations i1-i16, with edge latencies of 2 and 4 cycles.]

Instruction Prioritization Heuristics
- Number of descendants in the precedence graph
- Maximum latency from the root node of the precedence graph
- Length of operation latency
- Ranking of paths based on importance
- Combination of the above

VLIW List Scheduling
- Assign priorities
- Compute the data ready list (DRL): all operations whose predecessors have been scheduled
- Select from the DRL in priority order while checking resource constraints
- Add newly ready operations to the DRL and repeat for the next VLIW instruction
[Figure: 4-wide VLIW scheduling example; as operations are issued, the data ready list evolves through {1}, {2,3,4,5,6}, {2,7,8,9}, {10,11,12}, {13}.]
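Below is a minimal C sketch of the list-scheduling loop just outlined, for one basic block on a machine that issues up to ISSUE_WIDTH operations per VLIW instruction. The tiny hard-coded precedence graph, the critical-path priorities, the unit-latency assumption, and the fact that issue width is the only resource constraint are all simplifications invented for this sketch; a real compiler builds the data precedence graph from its IR and applies the heuristics listed above.

```c
#include <stdio.h>

#define N_OPS       6   /* operations in the basic block (toy example)        */
#define ISSUE_WIDTH 2   /* ops per VLIW instruction (the resource constraint) */

/* Toy data precedence graph: pred[i][j] != 0 means op j must complete before
 * op i may issue.  The DAG and priorities are made-up illustrative values;
 * unit latencies are assumed. */
static const int pred[N_OPS][N_OPS] = {
    /* op0 */ {0,0,0,0,0,0},
    /* op1 */ {0,0,0,0,0,0},
    /* op2 */ {1,1,0,0,0,0},   /* op2 needs op0, op1 */
    /* op3 */ {0,1,0,0,0,0},   /* op3 needs op1      */
    /* op4 */ {0,0,1,1,0,0},   /* op4 needs op2, op3 */
    /* op5 */ {0,0,0,0,1,0},   /* op5 needs op4      */
};
/* Priority: length of the longest path to the sink (critical-path heuristic). */
static const int priority[N_OPS] = {4, 4, 3, 3, 2, 1};

static int picked_already(const int *picked, int n, int op)
{
    for (int k = 0; k < n; k++)
        if (picked[k] == op) return 1;
    return 0;
}

int main(void)
{
    int done_flag[N_OPS] = {0};   /* op was scheduled in an earlier cycle */
    int done = 0, cycle = 0;

    while (done < N_OPS) {
        int picked[ISSUE_WIDTH], n_picked = 0;

        /* Fill one VLIW instruction: repeatedly take the highest-priority op
         * whose predecessors all completed in earlier cycles (the DRL). */
        while (n_picked < ISSUE_WIDTH) {
            int best = -1;
            for (int i = 0; i < N_OPS; i++) {
                if (done_flag[i] || picked_already(picked, n_picked, i))
                    continue;
                int ready = 1;
                for (int j = 0; j < N_OPS; j++)
                    if (pred[i][j] && !done_flag[j]) ready = 0;
                if (ready && (best < 0 || priority[i] > priority[best]))
                    best = i;
            }
            if (best < 0) break;          /* nothing else ready this cycle */
            picked[n_picked++] = best;
        }

        /* Emit the VLIW instruction and retire the picked operations. */
        printf("cycle %d:", cycle);
        for (int k = 0; k < n_picked; k++) printf(" op%d", picked[k]);
        printf("\n");
        for (int k = 0; k < n_picked; k++) done_flag[picked[k]] = 1;
        done += n_picked;
        cycle++;
    }
    return 0;
}
```

On this toy graph the loop emits {op0, op1}, {op2, op3}, {op4}, {op5}: four VLIW instructions for six operations.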
Extending the Scheduling Domain
- A basic block is too small to get any real parallelism
- How do we extend the basic block? (Why do we have basic blocks in the first place?)
- Loops
  - Loop unrolling
  - Software pipelining
- Non-loops
  - Will almost always involve some speculation
  - And, thus, profiling may be very important

Safety and Legality in Code Motion
- Two characteristics of speculative code motion:
  - Safety: whether or not spurious exceptions may occur
  - Legality: whether or not the result will always be correct
- Four possible types of code motion (moving an instruction from below a branch to above it):
  (a) safe and legal: "r1 = r2 & r3" is moved up; the off-path code redefines r1 ("r1 = ...") before using it
  (b) illegal: "r1 = r2 & r3" is moved up; the off-path code still reads the old value of r1 ("r4 = r1")
  (c) unsafe: "r1 = load A" is moved up; the off-path code redefines r1, but the speculative load may fault
  (d) unsafe and illegal: "r1 = load A" is moved up; the off-path code still reads the old value of r1, and the load may fault

Code Movement Constraints
- Downward: when moving an operation from a BB to one of its destination BBs,
  - all the other destination basic blocks should still be able to use the result of the operation
  - the other source BBs of the destination BB should not be disturbed
- Upward: when moving an operation from a BB to its source BBs,
  - register values required by the other destination BBs must not be destroyed
  - the movement must not cause new exceptions

Trace Scheduling
- Trace: a frequently executed path in the control-flow graph (has multiple side entrances and multiple side exits)
- Idea: find independent operations within a trace to pack into VLIW instructions
  - Traces determined via profiling
  - Compiler adds fix-up code for correctness (if a side entrance or side exit of a trace is exercised at runtime, the corresponding fix-up code is executed)

Trace Scheduling Idea
- There may be conditional branches out of the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances)
- These control-flow transitions are ignored during trace scheduling
- After scheduling, fix-up/bookkeeping code is inserted to ensure the correct execution of off-trace code
- Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Transactions on Computers, 1981.

Trace Scheduling (III)-(IV)
[Figures: a trace of Instr 1-Instr 5 before and after scheduling.] What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?

Trace Scheduling (V)-(VI)
[Figures: a trace of Instr 1-Instr 5 before and after scheduling.] What bookkeeping is required when Instr 5 is moved above the side entrance in the trace?

Trace Scheduling Fixup Code Issues
- Sometimes instructions need to be copied more than once to ensure correctness on all paths
[Figure: an original trace A-B-C-D-E with a side entrance from X into B and a side exit from D to Y; after the trace is scheduled, compensation copies such as A', B', C' and B'', D'', E'' are required on the off-trace paths.]

Trace Scheduling Overview
- Trace selection
  - Select a seed block (the highest-frequency basic block)
  - Extend the trace (along the highest-frequency edges)
    - forward (successor of the last block of the trace)
    - backward (predecessor of the first block of the trace)
  - Don't cross a loop back edge
  - Bound max_trace_length heuristically
  (A minimal C sketch of this selection step appears at the end of this section.)
- Trace scheduling
  - Build a data precedence graph for the whole trace
  - Perform list scheduling and allocate registers
  - Add compensation code to maintain semantic correctness
- Speculative code motion (upward)
  - Move an instruction above a branch if it is safe

Trace Scheduling Example (I)-(IV)
[Figures: trace scheduling example over a loop with basic blocks B1-B7. The branch at the end of B1 goes to B2 or B3 with profiled frequencies 990 and 10; the branch at the end of B4 goes to B5 or B6 with frequencies 800 and 200. The hot trace through B1, B2, B4, B5, B7 contains: fdiv f1, f2, f3; fadd f4, f1, f5; beq r1, $0; ld r2, 0(r3); add r2, r2, 4; beq r2, $0; fsub f2, f2, f6; st.d f2, 0(r8); add r3, r3, 4; add r8, r8, 4 (the off-trace blocks are B3: ld r2, 4(r3) and B6: fsub f2, f3, f7; st.d f2, 0(r8)). Before scheduling, the fadd stalls 9 cycles waiting for the fdiv result. Scheduling the trace moves the fadd down and the ld/fsub up across the side exits (noted as legal because r2 and f2 are not live out on those paths), reducing the stalls first to 1 and then to 0; split/compensation copies of the moved instructions (e.g., fadd f4, f1, f5) are added on the side exits to B3 and B6.]
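As referenced in the overview above, here is a minimal C sketch of the trace-selection step: pick the highest-frequency block as the seed, then grow the trace forward along the most frequent successor edges, stopping at a loop back edge or a length bound. The CFG data structure, field names, and profile counts are assumptions invented for this sketch (the counts loosely echo the 990/10 and 800/200 splits in the example above); a real trace scheduler would also grow the trace backward from the seed and then run list scheduling and compensation-code insertion over it.

```c
#include <stdio.h>

#define MAX_BLOCKS 7
#define MAX_SUCCS  2
#define MAX_TRACE  7

/* Hypothetical profiled CFG for one loop: every basic block has up to two
 * successors with profiled edge counts.  All numbers are made up. */
typedef struct {
    int succ[MAX_SUCCS];   /* successor block ids, -1 if absent         */
    int freq[MAX_SUCCS];   /* profiled execution count of each edge     */
    int exec_count;        /* profiled execution count of the block     */
    int is_loop_header;    /* entering it again would cross a back edge */
} block_t;

static const block_t cfg[MAX_BLOCKS] = {
    /* B0 */ {{ 1, -1}, {1000,   0}, 1000, 1},  /* loop header         */
    /* B1 */ {{ 2,  3}, { 990,  10}, 1000, 0},  /* biased branch       */
    /* B2 */ {{ 4, -1}, { 990,   0},  990, 0},
    /* B3 */ {{ 4, -1}, {  10,   0},   10, 0},  /* cold side path      */
    /* B4 */ {{ 5,  6}, { 800, 200}, 1000, 0},  /* biased branch       */
    /* B5 */ {{ 0, -1}, { 800,   0},  800, 0},  /* back edge to header */
    /* B6 */ {{ 0, -1}, { 200,   0},  200, 0},  /* back edge to header */
};

int main(void)
{
    /* Seed: the highest-frequency basic block (only one trace is built here). */
    int seed = 0;
    for (int b = 1; b < MAX_BLOCKS; b++)
        if (cfg[b].exec_count > cfg[seed].exec_count)
            seed = b;

    int trace[MAX_TRACE];
    int len = 0;
    trace[len++] = seed;

    /* Grow the trace forward along the most frequent successor edge. */
    int cur = seed;
    while (len < MAX_TRACE) {
        int best = -1, best_freq = 0;
        for (int s = 0; s < MAX_SUCCS; s++) {
            int t = cfg[cur].succ[s];
            if (t >= 0 && cfg[cur].freq[s] > best_freq) {
                best = t;
                best_freq = cfg[cur].freq[s];
            }
        }
        if (best < 0) break;                     /* no successor at all       */
        if (cfg[best].is_loop_header) break;     /* don't cross the back edge */
        trace[len++] = best;
        cur = best;
    }

    printf("selected trace:");
    for (int i = 0; i < len; i++)
        printf(" B%d", trace[i]);
    printf("\n");
    return 0;
}
```

On this toy profile the sketch selects the trace B0-B1-B2-B4-B5 and stops when B5's most frequent successor is the loop header.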