Instruction Scheduling: Introduction
Todd C. Mowry
CS745: Optimizing Compilers

Optimization: What’s the Point? (A Quick Review)
Machine-Independent Optimizations:
• e.g., constant propagation & folding, redundancy elimination, dead-code elimination, etc.
• Goal: eliminate work
Machine-Dependent Optimizations:
• e.g., register allocation
  • Goal: reduce cost of accessing data
• instruction scheduling
  • Goal: ???
• …


The Goal of Instruction Scheduling
• Assume that the remaining instructions are all essential
  • (otherwise, earlier passes would have eliminated them)
• How can we perform this fixed amount of work in less time?
  • Answer: execute the instructions in parallel
[Figure: the sequence a = 1 + x; b = 2 + y; c = 3 + z; executed one instruction per cycle vs. all three in the same cycle]

Hardware Support for Parallel Execution
• Three forms of parallelism are found in modern machines:
  • Pipelining               }
  • Superscalar Processing   }  exploited by Instruction Scheduling
  • Multiprocessing          }  exploited by Automatic Parallelization (covered later in class)


Pipelining
Basic idea:
• break instructions into stages that can be overlapped
Example: simple 5-stage pipeline from early RISC machines
• IF = Instruction Fetch
• RF = Decode & Register Fetch
• EX = Execute on ALU
• ME = Memory Access
• WB = Write Back to Register File
[Figure: one instruction flowing through the IF, RF, EX, ME, WB stages over time]

Pipelining Illustration
[Figure: several instructions overlapped in the pipeline, each one stage behind the previous one]


Pipelining Illustration
[Figure: six instructions in flight, staggered through the IF, RF, EX, ME, WB stages]
• In a given cycle, each instruction is in a different stage

Beyond 5-Stage Pipelines: How to Support Even More Parallelism
• Should we simply make pipelines deeper and deeper?
  • registers between pipeline stages have fixed overheads
  • hence diminishing returns with more stages (Amdahl’s Law)
  • the value of a pipe stage is unclear if it is shorter than the time for an integer add
• However, most consumers think “performance = clock rate”
  • perceived need for higher clock rates -> deeper pipelines
  • e.g., the Pentium 4 processor has a 20-stage pipeline
[Figure: pipeline stages separated by pipe registers]


Beyond Pipelining: “Superscalar” Processing
• Basic Idea:
  • multiple (independent) instructions can proceed simultaneously through the same pipeline stages
• Requires additional hardware
  • example: the “Execute” stage needs one ALU per instruction issued
[Figure: abstract EX-stage hardware; the scalar pipeline has 1 ALU computing r1+r2, while the 2-way superscalar has 2 ALUs computing r1+r2 and r3+r4 in the same cycle]

Superscalar Pipeline Illustration
• Original (scalar) pipeline:
  • only one instruction in a given pipe stage at a given time
• Superscalar pipeline:
  • multiple instructions in the same pipe stage at the same time
[Figure: scalar pipeline with one instruction per stage vs. 2-way superscalar pipeline with two instructions per stage]


The Ideal Scheduling Outcome
[Figure: Before = the instructions execute serially over N cycles; After = they all execute in 1 cycle]
• What prevents us from achieving this ideal?

Limitations Upon Scheduling
1. Hardware Resources
2. Data Dependences
3. Control Dependences


Limitation #1: Hardware Resources
• Processors have finite resources, and there are often constraints on how these resources can be used.
• Examples:
  • Finite issue width
  • Limited functional units (FUs) per given instruction type
  • Limited pipelining within a given functional unit (FU)

Finite Issue Width
• Prior to superscalar processing:
  • processors only “issued” one instruction per cycle
• Even with superscalar processing:
  • there is a limit on the total # of instructions issued per cycle
[Figure: with an infinite issue width the code finishes in 1 cycle; with an issue width of 4 it takes >= N/4 cycles]


Limited FUs per Instruction Type
• e.g., a 4-way superscalar might only be able to issue up to 2 integer, 1 memory, and 1 floating-point instructions per cycle
[Figure: the original code repacked into Int/Mem/FP issue slots over time; one instruction type becomes the bottleneck, leaving empty slots in the other columns]

Limited Pipelining within a Functional Unit
• e.g., only 1 new floating-point division can begin once every 2 cycles
[Figure: an unconstrained schedule vs. a more realistic schedule with limited pipelining, where back-to-back divides must be spaced out and leave empty slots]


Limitations Upon Scheduling
1. Hardware Resources
2. Data Dependences
3. Control Dependences

Limitation #2: Data Dependences
• If we read or write a data location “too early”, the program may behave incorrectly.
• (Assume that initially, x = 0.)
  • Read-after-Write (“True” dependence):    x = 1;  …  y = x;
    fundamental (no simple fix)
  • Write-after-Write (“Output” dependence): x = 1;  …  x = 2;
    can potentially fix through renaming
  • Write-after-Read (“Anti” dependence):    y = x;  …  x = 1;
    can potentially fix through renaming
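To make the renaming idea concrete, here is a minimal C sketch (the names x1, x2, and y2 are invented for illustration, not code from the course): giving the second definition of x its own name removes the anti and output dependences, leaving only the true dependence.

    #include <stdio.h>

    int main(void) {
        /* Before renaming: one variable x carries all three dependences.   */
        int x = 1;       /* write x                                          */
        int y = x;       /* read x: true (RAW) dependence on the write above */
        x = 2;           /* write x: anti (WAR) with "y = x;" and output     */
                         /* (WAW) with "x = 1;" -- these block reordering    */

        /* After renaming: the second definition gets its own name (x2), so */
        /* only the RAW dependence (x1 -> y2) remains, and "x2 = 2;" can be */
        /* moved freely with respect to the first two statements.           */
        int x1 = 1;
        int y2 = x1;
        int x2 = 2;

        printf("%d %d %d %d\n", x, y, y2, x2);  /* keep the values live     */
        return 0;
    }

A compiler (or the hardware’s register renaming) performs the same transformation at the level of machine registers.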


Why Data Dependences are Challenging
    x = a[i];
    *p = 1;
    y = *q;
    *r = z;
• Which of these instructions can be reordered?
• Ambiguous data dependences are very common in practice
  • difficult to resolve, despite fancy pointer analysis

Given Ambiguous Data Dependences, What Can the Compiler Do?
• Conservative approach: don’t reorder instructions
  • ensures correct execution
  • but may suffer poor performance
• Aggressive approach?
  • is there a way to safely reorder instructions?
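One concrete way a programmer can help resolve ambiguous memory dependences is C99 restrict. The sketch below is a hedged illustration (the function scale and its arguments are invented, not from the slides): the qualifier promises the compiler that dst and src never overlap, so the stores through dst cannot conflict with the loads through src, and the compiler may reorder or overlap them freely.

    /* Without restrict, the compiler must assume dst[i] might alias src[i+1],
     * so each store could feed a later load and the loop stays serialized.   */
    void scale(float *dst, const float *src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }

    /* With restrict, the programmer asserts that dst and src do not overlap,
     * removing the ambiguous memory dependences and letting the compiler
     * schedule (and even vectorize) the loads and stores aggressively.       */
    void scale_restrict(float *restrict dst, const float *restrict src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }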


Hardware Limitations Revisited: Multi-cycle Execution Latencies
• Simple instructions often “execute” in one cycle
  • (as observed by other instructions in the pipeline)
  • e.g., integer addition
• More complex instructions may require multiple cycles
  • e.g., integer division, square-root
  • cache misses!
• These latencies, when combined with data dependences, can result in non-trivial critical path lengths through the code

Limitations Upon Scheduling
1. Hardware Resources
2. Data Dependences
3. Control Dependences
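As a hedged illustration of how one multi-cycle latency stretches the critical path (the latency numbers and variable names below are assumptions made up for this example, not from the slides):

    /* Assume, hypothetically: add = 1 cycle, divide = 20 cycles.            */
    double critical_path(double a, double b, double c, double e, double f) {
        double t1 = a + b;      /* 1 cycle                                    */
        double t2 = t1 / c;     /* 20 cycles, must wait for t1 (RAW)          */
        double t3 = t2 + 1.0;   /* 1 cycle, must wait for t2 (RAW)            */
        /* The chain t1 -> t2 -> t3 forms a ~22-cycle critical path.          */

        double u1 = e + f;      /* independent work: a good schedule places   */
        double u2 = u1 + 2.0;   /* these under the divide's latency instead   */
                                /* of letting the pipeline sit idle           */
        return t3 + u2;
    }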


Limitation #3: Control Dependences
• What do we do when we reach a conditional branch?
  • impractical to schedule for all possible paths
  • choose a “frequently-executed” path?
  • choosing an “expected” path may be difficult
  • choose multiple paths?
  • recovery costs can be non-trivial if you are wrong

Scheduling Constraints: Summary
• Hardware Resources
  • finite set of FUs with instruction type, bandwidth, and latency constraints
  • cache hierarchy also has many constraints
• Data Dependences
  • can’t consume a result before it is produced
  • ambiguous dependences create many challenges
• Control Dependences
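To see why control dependences constrain the scheduler, consider this small hedged C sketch (the function safe_div is invented for illustration): the divide cannot simply be hoisted above the branch that guards it.

    /* The divide is control dependent on the test of d.                     */
    int safe_div(int n, int d) {
        int r = 0;
        if (d != 0)
            r = n / d;      /* hoisting this above the branch would be       */
                            /* speculation: if d == 0, the "scheduled" code   */
                            /* divides by zero on a path the original program */
                            /* never executed                                 */
        return r;
    }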


Hardware- vs. Compiler-Based Scheduling
• The hardware can also attempt to reschedule instructions (on-the-fly) to improve performance
• What advantages/disadvantages would hardware have (vs. the compiler) when trying to reason about:
  • Hardware Resources
  • Data Dependences
  • Control Dependences
• Which is better:
  • doing more of the scheduling work in the compiler?
  • doing more of the scheduling work in the hardware?

Spectrum of Hardware Support for Scheduling
Compiler-Centric  <----------------------------------------->  Hardware-Centric
  VLIW (Very Long Instruction Word)   e.g.: Itanium
  In-Order Superscalar                e.g.: Original Pentium
  Out-of-Order Superscalar            e.g.: Pentium 4


VLIW Processors
• Motivation:
  • if the hardware spends zero (or almost zero) time thinking about scheduling, it can run faster
• Philosophy:
  • give full control over scheduling to the compiler
• Implementation:
  • expose control over all FUs directly to software via a “very long instruction word”
[Figure: each long instruction word packs one Int, one Mem, and one FP operation per cycle]

Compiling for VLIW
• Predicting Execution Latencies:
  • easy for most functional units (latency is fixed)
  • but what about memory references?
• Data Dependences:
  • in “pure” VLIW, the hardware does not check for them
  • the compiler takes them into account to produce safe code
• Example #1 (memory latencies are hard to predict):
      while (p != NULL) {
        if (test(p->val))
          q->next = p->left;
        p = p->next;
      }
• Example #2 (a chain of data dependences):
      a = b + 1;
      c = a - d;
      e = c / 3;
      f = g - e;


“VLIW” Today
• Intel/HP Itanium 2:
  • hardware checks for data dependences through memory
  • compiler can do a good job with register dependences
  • instructions are grouped into 128-bit bundles: Inst 2 | Inst 1 | Inst 0 | Template
• Transmeta Crusoe 5400:
  • runtime software dynamically generates VLIW code

Spectrum of Hardware Support for Scheduling
Compiler-Centric  <----------------------------------------->  Hardware-Centric
  VLIW      In-Order Superscalar      Out-of-Order Superscalar


Spectrum of Hardware Support for Scheduling
Compiler-Centric  <----------------------------------------->  Hardware-Centric
  VLIW      In-Order Superscalar      Out-of-Order Superscalar

In-Order Superscalar Processors
In contrast with VLIW:
• hardware does full data dependence checking
  • hence, no need to encode NOPs for empty slots
• Once an instruction cannot be issued, no instructions after it will be issued.
[Figure: Int/Mem/FP issue slots over time; stalled instructions leave empty slots]
Bottom Line:
• hardware matches code to available resources; recompilation is not necessary for correctness
• compiler’s role is still important
  • for performance, not correctness!
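Because an in-order superscalar stalls at the first instruction that cannot issue, static scheduling still pays off. Here is a hedged C sketch (invented function and variable names; it assumes the load takes several cycles) of the kind of reordering a compiler might do:

    /* Unscheduled: the add uses the loaded value immediately, so an         */
    /* in-order pipeline stalls for the full load latency.                   */
    int sum_then_scale_naive(const int *p, int a, int b) {
        int x = *p;         /* load: several cycles before x is ready        */
        int y = x + 1;      /* stalls here waiting for x                     */
        int z = a * b;      /* independent, but issued only after the stall  */
        return y + z;
    }

    /* Scheduled: independent work is moved into the load's shadow, so the   */
    /* in-order pipeline has something useful to issue while the load        */
    /* completes. An out-of-order core would find this reordering itself.    */
    int sum_then_scale_scheduled(const int *p, int a, int b) {
        int x = *p;         /* start the load early                          */
        int z = a * b;      /* fill the load delay with independent work     */
        int y = x + 1;      /* by now x is (more likely) ready               */
        return y + z;
    }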


Out-of-Order Superscalar Processors
• Motivation:
  • when an instruction is stuck, perhaps there are subsequent instructions that can be executed
        x = *p;       suffers expensive cache miss
        y = x + 1;    stuck waiting on true dependence
        z = a + 2;    } these do not need to wait
        b = c / 3;    }
• Sounds great! But how does this complicate the hardware?

Out-of-Order Superscalar Processors: Hardware Overview
[Figure: the PC feeds an Instruction Cache and Branch Predictor; fetched instructions enter a Reorder Buffer]
• Complexity of checking dependences increases exponentially with issue width!
        0x10: x = *p;       issue (cache miss)
        0x14: y = x + 1;    can’t issue
        0x18: z = a + 2;    issue (out-of-order)
        0x1c: b = c / 3;    issue (out-of-order)
• fetch & graduate in-order, issue out-of-order


Compiler- vs. Hardware-Centric Scheduling: Bottom Line
Compiler-Centric  <----------------------------------------->  Hardware-Centric
  VLIW      In-Order Superscalar      Out-of-Order Superscalar
• High-end processors will probably remain out-of-order
  • moving instructions small distances is probably useless
  • BUT, moving instructions large distances may still help
• Cheap, power-efficient processors may be in-order/VLIW
  • instruction scheduling may have a large impact

Scheduling Roadmap
[Figure: the pair x = a + b … y = c + d shown three times: scheduled within a basic block, across basic blocks, and across loop iterations]
• List Scheduling: within a basic block
• Trace Scheduling: across basic blocks
• Software Pipelining: across loop iterations
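As a preview of list scheduling, here is a minimal hedged sketch in C (the 4-instruction DAG, the latencies, and the single-issue assumption are all invented for this illustration; it is a sketch, not the algorithm as it will be presented later): each cycle, among the instructions whose predecessors have completed, issue the ready instruction with the longest remaining latency path.

    #include <stdio.h>

    #define N 4  /* number of instructions in the example basic block */

    /* Dependence DAG: dep[i][j] = 1 means instruction j must finish before i. */
    static const int dep[N][N] = {
        /* 0: t = a / b */ {0, 0, 0, 0},
        /* 1: u = t + 1 */ {1, 0, 0, 0},
        /* 2: v = c + d */ {0, 0, 0, 0},
        /* 3: w = v + u */ {0, 1, 1, 0},
    };
    static const int latency[N] = {10, 1, 1, 1};   /* assumed latencies */

    /* Priority = length of the longest latency path starting at i. */
    static int path_len(int i) {
        int best = latency[i];
        for (int j = 0; j < N; j++)
            if (dep[j][i] && latency[i] + path_len(j) > best)
                best = latency[i] + path_len(j);
        return best;
    }

    int main(void) {
        int finish[N] = {0};   /* completion cycle per instruction; 0 = unscheduled */
        int scheduled = 0, cycle = 1;

        while (scheduled < N) {
            int pick = -1;
            for (int i = 0; i < N; i++) {
                if (finish[i]) continue;                 /* already scheduled */
                int ready = 1;
                for (int j = 0; j < N; j++)              /* all predecessors done? */
                    if (dep[i][j] && (!finish[j] || finish[j] > cycle - 1))
                        ready = 0;
                if (ready && (pick < 0 || path_len(i) > path_len(pick)))
                    pick = i;                            /* highest-priority ready op */
            }
            if (pick >= 0) {                             /* issue one op per cycle */
                finish[pick] = cycle + latency[pick] - 1;
                printf("cycle %2d: issue instruction %d\n", cycle, pick);
                scheduled++;
            }
            cycle++;
        }
        return 0;
    }

On this example DAG, the long-latency divide is issued first, the independent add fills part of its shadow, and the dependent operations follow once their inputs are ready.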

