Dynamic Scheduling (OOO) Via Tomasulo's Approach

Lecture 16: Instruction Level Parallelism -- Dynamic Scheduling (OOO) via Tomasulo’s Approach CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan [email protected] www.secs.oakland.edu/~yan Topics for Instruction Level Parallelism § ILP Introduction, Compiler Techniques and Branch Prediction – 3.1, 3.2, 3.3 § Dynamic Scheduling (OOO) – 3.4, 3.5 and C.5, C.6 and C.7 (FP pipeline and scoreboard) § Hardware Speculation and Static Superscalar/VLIW – 3.6, 3.7 § Dynamic Scheduling, Multiple Issue and Speculation – 3.8, 3.9 § ILP Limitations and SMT – 3.10, 3.11, 3.12 2 Acknowledge and Copyright § Slides adapted from – UC Berkeley course “Computer Science 252: Graduate Computer Architecture” of David E. Culler Copyright(C) 2005 UCB – UC Berkeley course Computer Science 252, Graduate Computer Architecture Spring 2012 of John Kubiatowicz Copyright(C) 2012 UCB – Computer Science 152: Computer Architecture and Engineering, Spring 2016 by Dr. George Michelogiannakis from UC Berkeley § https://passlab.github.io/CSE564/copyrightack.html 3 Complex Pipelining: Motivation § Why would we want more than our in-order pipeline? Physical Physical Address Inst. Address Data Decode PC Cache D E + M Cache W Physical Memory Controller Physical Address Address Physical Address Main Memory (DRAM) 4 Complex Pipelining: Motivation Pipelining becomes complex when we want high performance in the presence of: § Long latency or partially pipelined floating-point units – Not all instructions are floating point or integer § Memory systems with variable access time – For example cache misses § Multiple arithmetic and memory units 5 Floating Point Representation § IEEE standard 754 Value = (-1)s * 1.mantissa * 2(exp-127) Exponent = 0 has special meaning 6 Floating-Point Unit (FPU) § Much more hardware than an integer unit – A simple FPU takes 150,000 gates. Verification complex. Some exceptions specific to floating point. – Integer FU to the order of thousands § Common to have several FPU’s – Some integer, some floating point § Common to have different types of FPU’s: Fadd, Fmul, Fdiv, … § An FPU may be pipelined, partially pipelined or not pipelined § To operate several FPU’s concurrently the FP register file needs to have more read and write ports 7 Unpipelined FP EXE Stage § FP takes loops to compute § Much longer clock period Single-cycle FPU is a bad idea 8 Latency and Interval § Latency – The number of intervening cycles between an instruction that produces a result and an instruction that uses the result. – Usually the number of stages after EX that an instruction produces a result » ALU Integer 0, Load latency 1 § Initiation or repeat interval – the number of cycles that must elapse between issuing two operations of a given type à structural hazards 9 Pipelined FP EXE § Increased stall for RAW hazards 10 Breaking Our Assumption of Integer Pipeline § The divide unit is not fully pipelined – structural hazards can occur » need to be detected and stall incurred. § The instructions have varying running times – the number of register writes required in a cycle can be > 1 § Instructions no longer reach WB in order – Write after write (WAW) hazards are possible » Note that write after read (WAR) hazards are not possible, since the register reads always occur in ID. § Instructions can complete in a different order than they were issued (out-of-order complete) – causing problems with exceptions § Longer latency of operations – stalls for RAW hazards will be more frequent. 11 Hazards and Forwarding for Longer- Latency Pipeline § H 12 Stalls of FP Operations § SPEC89 FP § Latency average § FP add, subtract, or convert – 1.7 cycles, or 56% of the latency (3 cycles). § Multiplies and divides – 2.8 and 14.2, respectively, or 46% and 59% of the corresponding latency. § Structural hazards for divides are rare – since the divide frequency is low. 13 Stalls per FP Operation § The total number of stalls per instruction – ranges from 0.65 for su2cor to 1.21 for doduc, with an average of 0.87. – FP result stalls dominate in all cases, with an average of 0.71 stalls per instruction, or 82% of the stalled cycles. 14 Problems Arising From Writes § If we issue one instruction per cycle, how can we avoid structural hazards at the writeback stage and out-of-order writeback issues? § WAW Hazards WAW Hazards 15 Complex In-Order Pipeline Inst. Data Decode GPRs PC Mem D X1 + X2 Mem X3 W § Delay writeback so all operations have same latency to W stage FPRs X1 X2 FAdd X3 W – Write ports never oversubscribed (one inst. in & one inst. out every cycle) – Stall pipeline on long latency operations, e.g., divides, X2 FMul X3 cache misses – Handle exceptions in-order at commit point Unpipeline How to prevent increased writeback latency FDiv X2 d divider X3 from slowing down single cycle integer Commit opera:ons? Bypassing Point 16 Floating-Point ISA § Interaction between floating-point datapath and integer datapath is determined by ISA § RISC-V ISA – separate register files for FP and Integer instructions » the only interaction is via a set of move/convert instructions (some ISA’s don’t even permit this) – separate load/store for FPR’s and GPR’s (general purpose registers) but both use GPR’s for address calculation – FP compares write integer registers, then use integer branch 17 Realistic Memory Systems Common approaches to improving memory performance: § Caches - single cycle except in case of a miss =>stall § Banked memory - multiple memory accesses => bank conflicts § split-phase memory operations (separate memory request from response), many in flight => out-of-order responses Latency of access to the main memory is usually much greater than one cycle and oHen unpredictable Solving this problem is a central issue in computer architecture 18 Multiple-Cycles MEM Stage § MIPS R4000 § IF: First half of instruction fetch; PC selection actually happens here, together with initiation of instruction cache access. § IS: Second half of instruction fetch, complete instruction cache access. § RF: Instruction decode and register fetch, hazard checking, and instruction cache hit detection. § EX: Execution, which includes effective address calculation, ALU operation, and branch-target computation and condition evaluation. § DF: Data fetch, first half of data cache access. § DS: Second half of data fetch, completion of data cache access. § TC: Tag check, to determine whether the data cache access hit. § WB: Write-back for loads and register-register operations. 19 2-Cycles Load Delay § 2 20 3-Cycle Branch Delay when Taken 21 Dynamic Scheduling § Data Hazards § Control Hazards 22 Types of Data Hazards Consider execuLng a sequence of rk <= ri op rj type of instrucLons Data-dependence r3 <= r1 op r2 Read-aHer-Write r5 <= r3 op r4 (RAW) hazard AnL-dependence r3 <= r1 op r2 Write-aHer-Read r1 <= r4 op r5 (WAR) hazard Output-dependence r3 <= r1 op r2 Write-aHer-Write r3 <= r6 op r7 (WAW) hazard 23 Register vs. Memory Dependence Data hazards due to register operands can be determined at the decode stage, but data hazards due to memory operands can be determined only after computing the effective address Store: M[r1 + disp1] <= r2 ! Load: r3 <= M[r4 + disp2]! ! Does (r1 + disp1) = (r4 + disp2) ? 24 Data Hazards: An Example I1 FDIV.D f6, f6, f4 I2 FLD f2, 45(x3) I3 FMUL.D f0, f2, f4 I4 FDIV.D f8, f6, f2 I5 FSUB.D f10, f0, f6 I6 FADD.D f6, f8, f2 RAW Hazards WAR Hazards WAW Hazards 25 Instruction Scheduling I1 FDIV.D f6, f6, f4 I1 I2 FLD f2, 45(x3) I FMULT.D f0, f2, f4 3 I2 I4 FDIV.D f8, f6, f2 I3 I5 FSUB.D f10, f0, f6 I FADD.D f6, f8, f2 6 I4 Valid orderings: I5 in-order I1 I2 I3 I4 I5 I6 out-of-order I 2 I1 I3 I4 I5 I6 I6 out-of-order I1 I2 I3 I5 I4 I6 26 Out-of-order Completion In-order Issue Latency I1 FDIV.D f6, f6, f4 4 I2 FLD f2, 45(x3) 1 I3 FMULT.D f0, f2, f4 3 I4 FDIV.D f8, f6, f2 4 I5 FSUB.D f10, f0, f6 1 I6 FADD.D f6, f8, f2 1 in-order comp 1 2 1 2 3 4 3 5 4 6 5 6 out-of-order comp 1 2 2 3 1 4 3 5 5 4 6 6 Underlines are completes 27 Dynamic Scheduling § Rearrange order of instructions to reduce stalls while maintaining data flow – Minimize RAW Hazards – Minimize WAW and WAR hazards via Register Renaming – Between registers and memory hazards § Advantages: – Compiler doesn’t need to have knowledge of microarchitecture – Handles cases where dependencies are unknown at compile time § Disadvantage: – Substantial increase in hardware complexity – Complicates exceptions 28 Dynamic Scheduling § Dynamic scheduling implies: – Out-of-order execution – Out-of-order completion § Creates more possibility for WAR and WAW hazards § Scoreboard: C.6 – CDC6600 in 1963 § Tomasulo’s Approach – Tracks when operands are available – Introduces register renaming in hardware » Minimizes WAW and WAR hazards 29 Register Renaming § Example: DIV.D F0,F2,F4 ADD.D F6,F0,F8 Anti-dependence on F8 S.D F6,0(R1) SUB.D F8,F10,F14 Output dependence on F6 MUL.D F6,F10,F8 30 Register Renaming § Example: DIV.D F0,F2,F4 ADD.D F6,F0,F8 DIV.D F0,F2,F4 S.D F6,0(R1) ADD.D F6,F0,F8 SUB.D F8,F10,F14 S.D S,0(R1) MUL.D F6,F10,F8 SUB.D T,F10,F14 MUL.D T,F10,T § Now only RAW hazards remain, which can be strictly ordered 31 Tomasulo Algorithm § For IBM 360/91 about 3 years after CDC 6600 (1966) § Goal: High Performance without special compilers § Differences between IBM 360 & CDC 6600 ISA – IBM has only 2 register specifiers/instr vs.

Load more