Instruction Scheduling: Introduction
Todd C. Mowry
CS745: Optimizing Compilers

Optimization: What’s the Point? (A Quick Review)
Machine-Independent Optimizations:
• e.g., constant propagation & folding, redundancy elimination, dead-code elimination, etc.
• Goal: eliminate work
Machine-Dependent Optimizations:
• e.g., register allocation
  • Goal: reduce cost of accessing data
• instruction scheduling
  • Goal: ???
• …


The Goal of Instruction Scheduling
• Assume that the remaining instructions are all essential
  • (otherwise, earlier passes would have eliminated them)
• How can we perform this fixed amount of work in less time?
  • Answer: execute the instructions in parallel
[Figure: the sequence a = 1 + x; b = 2 + y; c = 3 + z; executed one instruction per cycle vs. all three in the same cycle]

Hardware Support for Parallel Execution
• Three forms of parallelism are found in modern machines:
  • Pipelining               }
  • Superscalar Processing   }  exploited by Instruction Scheduling
  • Multiprocessing          }  exploited by Automatic Parallelization (covered later in class)


Pipelining
Basic idea:
• break instructions into stages that can be overlapped
Example: simple 5-stage pipeline from early RISC machines
• IF = Instruction Fetch
• RF = Decode & Register Fetch
• EX = Execute on ALU
• ME = Memory Access
• WB = Write Back to Register File
[Figure: one instruction flowing through the IF, RF, EX, ME, WB stages over time]

Pipelining Illustration
[Figure: several instructions overlapped in the pipeline, each one stage behind the previous one]


Pipelining Illustration
[Figure: six instructions in flight, staggered through the IF, RF, EX, ME, WB stages]
• In a given cycle, each instruction is in a different stage

Beyond 5-Stage Pipelines: How to Support Even More Parallelism
• Should we simply make pipelines deeper and deeper?
  • registers between pipeline stages have fixed overheads
  • hence diminishing returns with more stages (Amdahl’s Law)
  • the value of a pipe stage is unclear if it is shorter than the time for an integer add
• However, most consumers think “performance = clock rate”
  • perceived need for higher clock rates -> deeper pipelines
  • e.g., the Pentium 4 processor has a 20-stage pipeline
[Figure: pipeline stages separated by pipe registers]


Beyond Pipelining: “Superscalar” Processing
• Basic Idea:
  • multiple (independent) instructions can proceed simultaneously through the same pipeline stages
• Requires additional hardware
  • example: the “Execute” stage needs one ALU per instruction issued
[Figure: abstract EX-stage hardware; the scalar pipeline has 1 ALU computing r1+r2, while the 2-way superscalar has 2 ALUs computing r1+r2 and r3+r4 in the same cycle]

Superscalar Pipeline Illustration
• Original (scalar) pipeline:
  • only one instruction in a given pipe stage at a given time
• Superscalar pipeline:
  • multiple instructions in the same pipe stage at the same time
[Figure: scalar pipeline with one instruction per stage vs. 2-way superscalar pipeline with two instructions per stage]


The Ideal Scheduling Outcome
[Figure: Before = the instructions execute serially over N cycles; After = they all execute in 1 cycle]
• What prevents us from achieving this ideal?

Limitations Upon Scheduling
1. Hardware Resources
2. Data Dependences
3. Control Dependences


Limitation #1: Hardware Resources
• Processors have finite resources, and there are often constraints on how these resources can be used.
• Examples:
  • Finite issue width
  • Limited functional units (FUs) per given instruction type
  • Limited pipelining within a given functional unit (FU)

Finite Issue Width
• Prior to superscalar processing:
  • processors only “issued” one instruction per cycle
• Even with superscalar processing:
  • there is a limit on the total # of instructions issued per cycle
[Figure: with an infinite issue width the code finishes in 1 cycle; with an issue width of 4 it takes >= N/4 cycles]


Limited FUs per Instruction Type
• e.g., a 4-way superscalar might only be able to issue up to 2 integer, 1 memory, and 1 floating-point instructions per cycle
[Figure: the original code repacked into Int/Mem/FP issue slots over time; one instruction type becomes the bottleneck, leaving empty slots in the other columns]

Limited Pipelining within a Functional Unit
• e.g., only 1 new floating-point division can begin once every 2 cycles
[Figure: an unconstrained schedule vs. a more realistic schedule with limited pipelining, where back-to-back divides must be spaced out and leave empty slots]


Limitations Upon Scheduling
1. Hardware Resources
2. Data Dependences
3. Control Dependences

Limitation #2: Data Dependences
• If we read or write a data location “too early”, the program may behave incorrectly.
• (Assume that initially, x = 0.)
  • Read-after-Write (“True” dependence):    x = 1;  …  y = x;
    fundamental (no simple fix)
  • Write-after-Write (“Output” dependence): x = 1;  …  x = 2;
    can potentially fix through renaming
  • Write-after-Read (“Anti” dependence):    y = x;  …  x = 1;
    can potentially fix through renaming
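To make the renaming idea concrete, here is a minimal C sketch (the names x1, x2, and y2 are invented for illustration, not code from the course): giving the second definition of x its own name removes the anti and output dependences, leaving only the true dependence.

    #include <stdio.h>

    int main(void) {
        /* Before renaming: one variable x carries all three dependences.   */
        int x = 1;       /* write x                                          */
        int y = x;       /* read x: true (RAW) dependence on the write above */
        x = 2;           /* write x: anti (WAR) with "y = x;" and output     */
                         /* (WAW) with "x = 1;" -- these block reordering    */

        /* After renaming: the second definition gets its own name (x2), so */
        /* only the RAW dependence (x1 -> y2) remains, and "x2 = 2;" can be */
        /* moved freely with respect to the first two statements.           */
        int x1 = 1;
        int y2 = x1;
        int x2 = 2;

        printf("%d %d %d %d\n", x, y, y2, x2);  /* keep the values live     */
        return 0;
    }

A compiler (or the hardware’s register renaming) performs the same transformation at the level of machine registers.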


Why Data Dependences are Challenging
    x = a[i];
    *p = 1;
    y = *q;
    *r = z;
• Which of these instructions can be reordered?
• Ambiguous data dependences are very common in practice
  • difficult to resolve, despite fancy pointer analysis

Given Ambiguous Data Dependences, What Can the Compiler Do?
• Conservative approach: don’t reorder instructions
  • ensures correct execution
  • but may suffer poor performance
• Aggressive approach?
  • is there a way to safely reorder instructions?
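One concrete way a programmer can help resolve ambiguous memory dependences is C99 restrict. The sketch below is a hedged illustration (the function scale and its arguments are invented, not from the slides): the qualifier promises the compiler that dst and src never overlap, so the stores through dst cannot conflict with the loads through src, and the compiler may reorder or overlap them freely.

    /* Without restrict, the compiler must assume dst[i] might alias src[i+1],
     * so each store could feed a later load and the loop stays serialized.   */
    void scale(float *dst, const float *src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }

    /* With restrict, the programmer asserts that dst and src do not overlap,
     * removing the ambiguous memory dependences and letting the compiler
     * schedule (and even vectorize) the loads and stores aggressively.       */
    void scale_restrict(float *restrict dst, const float *restrict src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }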


Hardware Limitations Revisited: Multi-cycle Execution Latencies
• Simple instructions often “execute” in one cycle
  • (as observed by other instructions in the pipeline)
  • e.g., integer addition
• More complex instructions may require multiple cycles
  • e.g., integer division, square-root
  • cache misses!
• These latencies, when combined with data dependences, can result in non-trivial critical path lengths through the code

Limitations Upon Scheduling
1. Hardware Resources
2. Data Dependences
3. Control Dependences
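As a hedged illustration of how one multi-cycle latency stretches the critical path (the latency numbers and variable names below are assumptions made up for this example, not from the slides):

    /* Assume, hypothetically: add = 1 cycle, divide = 20 cycles.            */
    double critical_path(double a, double b, double c, double e, double f) {
        double t1 = a + b;      /* 1 cycle                                    */
        double t2 = t1 / c;     /* 20 cycles, must wait for t1 (RAW)          */
        double t3 = t2 + 1.0;   /* 1 cycle, must wait for t2 (RAW)            */
        /* The chain t1 -> t2 -> t3 forms a ~22-cycle critical path.          */

        double u1 = e + f;      /* independent work: a good schedule places   */
        double u2 = u1 + 2.0;   /* these under the divide's latency instead   */
                                /* of letting the pipeline sit idle           */
        return t3 + u2;
    }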


Limitation #3: Control Dependences
• What do we do when we reach a conditional branch?
  • impractical to schedule for all possible paths
  • choose a “frequently-executed” path?
  • choosing an “expected” path may be difficult
  • choose multiple paths?
  • recovery costs can be non-trivial if you are wrong

Scheduling Constraints: Summary
• Hardware Resources
  • finite set of FUs with instruction type, bandwidth, and latency constraints
  • cache hierarchy also has many constraints
• Data Dependences
  • can’t consume a result before it is produced
  • ambiguous dependences create many challenges
• Control Dependences
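To see why control dependences constrain the scheduler, consider this small hedged C sketch (the function safe_div is invented for illustration): the divide cannot simply be hoisted above the branch that guards it.

    /* The divide is control dependent on the test of d.                     */
    int safe_div(int n, int d) {
        int r = 0;
        if (d != 0)
            r = n / d;      /* hoisting this above the branch would be       */
                            /* speculation: if d == 0, the "scheduled" code   */
                            /* divides by zero on a path the original program */
                            /* never executed                                 */
        return r;
    }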


Hardware- vs. Compiler-Based Scheduling
• The hardware can also attempt to reschedule instructions (on-the-fly) to improve performance
• What advantages/disadvantages would hardware have (vs. the compiler) when trying to reason about:
  • Hardware Resources
  • Data Dependences
  • Control Dependences
• Which is better:
  • doing more of the scheduling work in the compiler?
  • doing more of the scheduling work in the hardware?

Spectrum of Hardware Support for Scheduling
Compiler-Centric  <----------------------------------------->  Hardware-Centric
  VLIW (Very Long Instruction Word)   e.g.: Itanium
  In-Order Superscalar                e.g.: Original Pentium
  Out-of-Order Superscalar            e.g.: Pentium 4


VLIW Processors
• Motivation:
  • if the hardware spends zero (or almost zero) time thinking about scheduling, it can run faster
• Philosophy:
  • give full control over scheduling to the compiler
• Implementation:
  • expose control over all FUs directly to software via a “very long instruction word”
[Figure: each long instruction word packs one Int, one Mem, and one FP operation per cycle]

Compiling for VLIW
• Predicting Execution Latencies:
  • easy for most functional units (latency is fixed)
  • but what about memory references?
• Data Dependences:
  • in “pure” VLIW, the hardware does not check for them
  • the compiler takes them into account to produce safe code
• Example #1 (memory latencies are hard to predict):
      while (p != NULL) {
        if (test(p->val))
          q->next = p->left;
        p = p->next;
      }
• Example #2 (a chain of data dependences):
      a = b + 1;
      c = a - d;
      e = c / 3;
      f = g - e;


“VLIW” Today
• Intel/HP Itanium 2:
  • hardware checks for data dependences through memory
  • compiler can do a good job with register dependences
  • instructions are grouped into 128-bit bundles: Inst 2 | Inst 1 | Inst 0 | Template
• Transmeta Crusoe 5400:
  • runtime software dynamically generates VLIW code

Spectrum of Hardware Support for Scheduling
Compiler-Centric  <----------------------------------------->  Hardware-Centric
  VLIW      In-Order Superscalar      Out-of-Order Superscalar


Spectrum of Hardware Support for Scheduling
Compiler-Centric  <----------------------------------------->  Hardware-Centric
  VLIW      In-Order Superscalar      Out-of-Order Superscalar

In-Order Superscalar Processors
In contrast with VLIW:
• hardware does full data dependence checking
  • hence, no need to encode NOPs for empty slots
• Once an instruction cannot be issued, no instructions after it will be issued.
[Figure: Int/Mem/FP issue slots over time; stalled instructions leave empty slots]
Bottom Line:
• hardware matches code to available resources; recompilation is not necessary for correctness
• compiler’s role is still important
  • for performance, not correctness!
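Because an in-order superscalar stalls at the first instruction that cannot issue, static scheduling still pays off. Here is a hedged C sketch (invented function and variable names; it assumes the load takes several cycles) of the kind of reordering a compiler might do:

    /* Unscheduled: the add uses the loaded value immediately, so an         */
    /* in-order pipeline stalls for the full load latency.                   */
    int sum_then_scale_naive(const int *p, int a, int b) {
        int x = *p;         /* load: several cycles before x is ready        */
        int y = x + 1;      /* stalls here waiting for x                     */
        int z = a * b;      /* independent, but issued only after the stall  */
        return y + z;
    }

    /* Scheduled: independent work is moved into the load's shadow, so the   */
    /* in-order pipeline has something useful to issue while the load        */
    /* completes. An out-of-order core would find this reordering itself.    */
    int sum_then_scale_scheduled(const int *p, int a, int b) {
        int x = *p;         /* start the load early                          */
        int z = a * b;      /* fill the load delay with independent work     */
        int y = x + 1;      /* by now x is (more likely) ready               */
        return y + z;
    }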


Out-of-Order Superscalar Processors
• Motivation:
  • when an instruction is stuck, perhaps there are subsequent instructions that can be executed
        x = *p;       suffers expensive cache miss
        y = x + 1;    stuck waiting on true dependence
        z = a + 2;    } these do not need to wait
        b = c / 3;    }
• Sounds great! But how does this complicate the hardware?

Out-of-Order Superscalar Processors: Hardware Overview
[Figure: the PC feeds an Instruction Cache and Branch Predictor; fetched instructions enter a Reorder Buffer]
• Complexity of checking dependences increases exponentially with issue width!
        0x10: x = *p;       issue (cache miss)
        0x14: y = x + 1;    can’t issue
        0x18: z = a + 2;    issue (out-of-order)
        0x1c: b = c / 3;    issue (out-of-order)
• fetch & graduate in-order, issue out-of-order


Compiler- vs. Hardware-Centric Scheduling: Bottom Line
Compiler-Centric  <----------------------------------------->  Hardware-Centric
  VLIW      In-Order Superscalar      Out-of-Order Superscalar
• High-end processors will probably remain out-of-order
  • moving instructions small distances is probably useless
  • BUT, moving instructions large distances may still help
• Cheap, power-efficient processors may be in-order/VLIW
  • instruction scheduling may have a large impact

Scheduling Roadmap
[Figure: the pair x = a + b … y = c + d shown three times: scheduled within a basic block, across basic blocks, and across loop iterations]
• List Scheduling: within a basic block
• Trace Scheduling: across basic blocks
• Software Pipelining: across loop iterations
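As a preview of list scheduling, here is a minimal hedged sketch in C (the 4-instruction DAG, the latencies, and the single-issue assumption are all invented for this illustration; it is a sketch, not the algorithm as it will be presented later): each cycle, among the instructions whose predecessors have completed, issue the ready instruction with the longest remaining latency path.

    #include <stdio.h>

    #define N 4  /* number of instructions in the example basic block */

    /* Dependence DAG: dep[i][j] = 1 means instruction j must finish before i. */
    static const int dep[N][N] = {
        /* 0: t = a / b */ {0, 0, 0, 0},
        /* 1: u = t + 1 */ {1, 0, 0, 0},
        /* 2: v = c + d */ {0, 0, 0, 0},
        /* 3: w = v + u */ {0, 1, 1, 0},
    };
    static const int latency[N] = {10, 1, 1, 1};   /* assumed latencies */

    /* Priority = length of the longest latency path starting at i. */
    static int path_len(int i) {
        int best = latency[i];
        for (int j = 0; j < N; j++)
            if (dep[j][i] && latency[i] + path_len(j) > best)
                best = latency[i] + path_len(j);
        return best;
    }

    int main(void) {
        int finish[N] = {0};   /* completion cycle per instruction; 0 = unscheduled */
        int scheduled = 0, cycle = 1;

        while (scheduled < N) {
            int pick = -1;
            for (int i = 0; i < N; i++) {
                if (finish[i]) continue;                 /* already scheduled */
                int ready = 1;
                for (int j = 0; j < N; j++)              /* all predecessors done? */
                    if (dep[i][j] && (!finish[j] || finish[j] > cycle - 1))
                        ready = 0;
                if (ready && (pick < 0 || path_len(i) > path_len(pick)))
                    pick = i;                            /* highest-priority ready op */
            }
            if (pick >= 0) {                             /* issue one op per cycle */
                finish[pick] = cycle + latency[pick] - 1;
                printf("cycle %2d: issue instruction %d\n", cycle, pick);
                scheduled++;
            }
            cycle++;
        }
        return 0;
    }

On this example DAG, the long-latency divide is issued first, the independent add fills part of its shadow, and the dependent operations follow once their inputs are ready.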

