Instruction Scheduling: Introduction
Todd C. Mowry
CS745: Optimizing Compilers

Optimization: What’s the Point? (A Quick Review)

Machine-Independent Optimizations:
• e.g., constant propagation & folding, redundancy elimination, dead-code elimination, etc.
• Goal: eliminate work

Machine-Dependent Optimizations:
• register allocation
  • Goal: reduce cost of accessing data
• instruction scheduling
  • Goal: ???
• …
CS745: Instruction Scheduling -2- Todd C. Mowry
The Goal of Instruction Scheduling

Assume that the remaining instructions are all essential
• (otherwise, earlier passes would have eliminated them)

How can we perform this fixed amount of work in less time?
• Answer: execute the instructions in parallel

      Before:              After:
      a = 1 + x;           a = 1 + x;  b = 2 + y;  c = 3 + z;
Time  b = 2 + y;
      c = 3 + z;

Hardware Support for Parallel Execution

Three forms of parallelism are found in modern machines:
• Pipelining                }
• Superscalar Processing    }  Instruction Scheduling
• Multiprocessing           }  Automatic Parallelization (covered later in class)
Pipelining

Basic idea:
• break each instruction into stages that can be overlapped

Example: simple 5-stage pipeline from early RISC machines
  IF = Instruction Fetch
  RF = Decode & Register Fetch
  EX = Execute on ALU
  ME = Memory Access
  WB = Write Back to Register File

Pipelining Illustration

  IF RF EX ME WB
     IF RF EX ME WB
        IF RF EX ME WB
           IF RF EX ME WB
              IF RF EX ME WB
  ------------------------> Time
Pipelining Illustration

  IF RF EX ME WB
     IF RF EX ME WB
        IF RF EX ME WB
           IF RF EX ME WB
              IF RF EX ME WB
  ------------------------> Time

In a given cycle, each instruction is in a different stage.

Beyond 5-Stage Pipelines: How to Support Even More Parallelism

Should we simply make pipelines deeper and deeper?
• registers between pipeline stages have fixed overheads
  • hence diminishing returns with more stages (Amdahl’s Law)
  • the value of a pipe stage is unclear if it takes less time than an integer add

However, most consumers think “performance = clock rate”
• perceived need for higher clock rates -> deeper pipelines
• e.g., the Pentium 4 processor has a 20-stage pipeline
Beyond Pipelining: “Superscalar” Processing

Basic Idea:
• multiple (independent) instructions can proceed simultaneously through the same pipeline stages

Requires additional hardware
• example: the “Execute” stage
  • hardware for a scalar pipeline: 1 ALU
  • hardware for a 2-way superscalar: 2 ALUs

Superscalar Pipeline Illustration

Original (scalar) pipeline:
• only one instruction in a given pipe stage at a given time

Superscalar pipeline:
• multiple instructions in the same pipe stage at the same time
The Ideal Scheduling Outcome

Before: the instructions execute one after another, taking N cycles.
After: all of the instructions execute in parallel, taking 1 cycle.

What prevents us from achieving this ideal?

Limitations Upon Scheduling

1. Hardware Resources
2. Data Dependences
3. Control Dependences
Limitation #1: Hardware Resources

Processors have finite resources, and there are often constraints on how these resources can be used.

Examples:
• Finite issue width
• Limited functional units (FUs) per given instruction type
• Limited pipelining within a given functional unit (FU)

Finite Issue Width

Prior to superscalar processing:
• processors only “issued” one instruction per cycle

Even with superscalar processing:
• there is a limit on the total # of instructions issued per cycle

Issue Width = infinite: 1 cycle        Issue Width = 4: ≥ N/4 cycles
Limited FUs per Instruction Type

e.g., a 4-way superscalar might only be able to issue up to 2 integer, 1 memory, and 1 floating-point insts per cycle

[Figure: original code vs. an unconstrained schedule vs. a more realistic schedule, with Integer/Memory/Floating-Point columns; the limited integer FUs become the bottleneck, leaving empty slots in the other pipelines]

Limited Pipelining within a Functional Unit

e.g., only 1 new floating-point division can be issued once every 2 cycles

[Figure: with limited pipelining, back-to-back divisions must be spaced apart, stretching the schedule and leaving empty slots relative to the unconstrained case]
Limitations Upon Scheduling

1. Hardware Resources
2. Data Dependences
3. Control Dependences

Limitation #2: Data Dependences

If we read or write a data location “too early”, the program may behave incorrectly.
(Assume that initially, x = 0.)

Read-after-Write (“True” dependence):      x = 1;  y = x;
Write-after-Write (“Output” dependence):   x = 1;  x = 2;
Write-after-Read (“Anti” dependence):      y = x;  x = 1;

• True dependences are fundamental (no simple fix).
• Output and anti dependences can potentially be fixed through renaming.
Why Data Dependences are Challenging

    x = a[i];
    *p = 1;
    y = *q;
    *r = z;

Which of these instructions can be reordered?
• ambiguous data dependences are very common in practice
• difficult to resolve, despite fancy pointer analysis

Given Ambiguous Data Dependences, What Can the Compiler Do?

Conservative approach: don’t reorder instructions
• ensures correct execution
• but may suffer poor performance

Aggressive approach?
• is there a way to safely reorder instructions?
Hardware Limitations Revisited: Multi-cycle Execution Latencies

Simple instructions often “execute” in one cycle
• (as observed by other instructions in the pipeline)
• e.g., integer addition

More complex instructions may require multiple cycles
• e.g., integer division, square-root
• cache misses!

These latencies, when combined with data dependences, can result in non-trivial critical path lengths through code.

Limitations Upon Scheduling

1. Hardware Resources
2. Data Dependences
3. Control Dependences
Limitation #3: Control Dependences

What do we do when we reach a conditional branch?
• impractical to schedule for all possible paths
• choose a “frequently-executed” path?
  • choosing an “expected” path may be difficult
• choose multiple paths?
  • recovery costs can be non-trivial if you are wrong

Scheduling Constraints: Summary

Hardware Resources
• finite set of FUs with instruction type, bandwidth, and latency constraints
• cache hierarchy also has many constraints

Data Dependences
• can’t consume a result before it is produced
• ambiguous dependences create many challenges

Control Dependences
Hardware- vs. Compiler-Based Scheduling

The hardware can also attempt to reschedule instructions (on-the-fly) to improve performance.

What advantages/disadvantages would hardware have (vs. the compiler) when trying to reason about:
• Hardware Resources
• Data Dependences
• Control Dependences

Which is better:
• doing more of the scheduling work in the compiler?
• doing more of the scheduling work in the hardware?

Spectrum of Hardware Support for Scheduling

Compiler-Centric <-----------------------> Hardware-Centric
• VLIW (Very Long Instruction Word), e.g.: Itanium
• In-Order Superscalar, e.g.: Original Pentium
• Out-of-Order Superscalar, e.g.: Pentium 4
VLIW Processors

Motivation:
• if the hardware spends zero (or almost zero) time thinking about scheduling, it can run faster

Philosophy:
• give full control over scheduling to the compiler

Implementation:
• expose control over all FUs directly to software via a “very long instruction word”

Compiling for VLIW

Predicting Execution Latencies:
• easy for most functional units (latency is fixed)
• but what about memory references?

Data Dependences:
• in “pure” VLIW, the hardware does not check for them
• the compiler takes them into account to produce safe code

Example #1:                        Example #2:
    while (p != NULL) {                a = b + 1;
        if (test(p->val))              c = a - d;
            q->next = p->left;         e = c / 3;
        p = p->next;                   f = g - e;
    }
“VLIW” Today

Intel/HP Itanium2:
• 128-bit bundle: | Inst 2 | Inst 1 | Inst 0 | Template |
• hardware checks for data dependences through memory
• the compiler can do a good job with register dependences

Transmeta Crusoe 5400:
• runtime software dynamically generates VLIW code

Spectrum of Hardware Support for Scheduling

Compiler-Centric <-----------------------> Hardware-Centric
• VLIW
• In-Order Superscalar
In-Order Superscalar Processors

In contrast with VLIW:
• hardware does full data dependence checking
• hence, no need to encode NOPs for empty slots

Once an instruction cannot be issued, no instructions after it will be issued.

Bottom Line:
• hardware matches the code to the available resources; recompilation is not necessary for correctness
• the compiler’s role is still important
  • for performance, not correctness!

Spectrum of Hardware Support for Scheduling

Compiler-Centric <-----------------------> Hardware-Centric
• VLIW
• In-Order Superscalar
• Out-of-Order Superscalar
Out-of-Order Superscalar Processors

Motivation:
• when an instruction is stuck, perhaps there are subsequent instructions that can be executed

    x = *p;     suffers expensive cache miss
    y = x + 1;  stuck waiting on true dependence
    z = a + 2;  } these do not need to wait
    b = c / 3;  }

Sounds great! But how does this complicate the hardware?

Out-of-Order Superscalar Processors: Hardware Overview

• PC -> Instruction Cache, guided by a Branch Predictor
• Reorder Buffer: fetch & graduate in-order, issue out-of-order
    0x10: x = *p;     issue (cache miss)
    0x14: y = x + 1;  can’t issue
    0x18: z = a + 2;  issue (out-of-order)
    0x1c: b = c / 3;  issue (out-of-order)
• the complexity of checking dependences increases exponentially with issue width!
Compiler- vs. Hardware-Centric Scheduling: Bottom Line

High-end processors will probably remain out-of-order
• moving instructions small distances is probably useless
• BUT, moving instructions large distances may still help

Cheap, power-efficient processors may be in-order/VLIW
• instruction scheduling may have a large impact

Scheduling Roadmap

    x = a + b            x = a + b            x = a + b
    …                    …                    …
    y = c + d            y = c + d            y = c + d

List Scheduling:         Trace Scheduling:    Software Pipelining:
• within a basic block   • across basic blocks  • across loop iterations