CSE 586 Computer Architecture
Lecture 4
Jean-Loup Baer
http://www.cs.washington.edu/education/courses/586/00sp

Highlights from last week

• ILP: where can the compiler optimize
  – Loop unrolling and software pipelining
  – (we’ll see predication today)
• ILP: Dynamic scheduling in a single issue machine
  – Scoreboard
  – Tomasulo’s algorithm

CSE 586 Spring 00

Highlights from last week (c’ed) – Scoreboard

• The scoreboard keeps a record of all data dependencies
• The scoreboard keeps a record of all functional unit occupancies
• The scoreboard decides if an instruction can be issued
• The scoreboard decides if an instruction can store its result
• Implementation-wise, the scoreboard keeps track of which registers are used as sources and destinations and which functional units use them

Highlights from last week (c’ed) – Tomasulo’s algorithm

• Decentralized control
• Use of reservation stations to buffer and/or rename registers (hence gets rid of WAW and WAR hazards)
• Results – and their names – are broadcast to reservation stations
• Instructions are issued in order but can be dispatched, executed and completed out-of-order
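The tag-based renaming in the Tomasulo bullets can be sketched in a few lines. This is a toy model under assumed names (an `RS` class with `vj`/`qj`-style fields, an `issue` and a `broadcast` helper), not the full algorithm: there is no functional-unit timing and no load/store handling.

```python
# Toy sketch: the register status table names the reservation station that
# will produce each register, so a later writer simply overwrites the name,
# which is how WAW and WAR hazards disappear.

class RS:
    def __init__(self, name):
        self.name = name          # tag broadcast on the result bus
        self.vj = self.vk = None  # operand values once available
        self.qj = self.qk = None  # producing-station tags still awaited

def issue(rs, regs, status, src1, src2, dest):
    for field, reg in (("j", src1), ("k", src2)):
        if status.get(reg):                   # operand still being computed
            setattr(rs, "q" + field, status[reg])
        else:                                 # operand ready in register file
            setattr(rs, "v" + field, regs[reg])
    status[dest] = rs.name                    # rename: dest now owned by rs

def broadcast(tag, value, stations, regs, status):
    # Common data bus: every waiting station (and the register file) snoops.
    for rs in stations:
        if rs.qj == tag: rs.vj, rs.qj = value, None
        if rs.qk == tag: rs.vk, rs.qk = value, None
    for reg, owner in list(status.items()):
        if owner == tag:
            regs[reg] = value
            del status[reg]
```

A second instruction that reads a pending result simply issues with a tag in place of a value and wakes up on the broadcast.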


Highlights from last week (c’ed)

• Register renaming: avoids WAW and WAR hazards
• Is performed at “decode” time to rename the result register
• Two basic implementation schemes
  – Have a separate physical register file
  – Use of reorder buffer and reservation stations (cf. Tomasulo algorithm extended implementation)
  – Often a mix of the two (cf. description of MIPS 10000 in Smith and Sohi)

Multiple Issue Alternatives

• Superscalar (hardware detects conflicts)
  – Statically scheduled (in order dispatch and hence execution; cf. DEC Alpha 21164)
  – Dynamically scheduled (in order issue, out of order dispatch and execution; cf. MIPS 10000, IBM Power PC 620 and Intel Pentium Pro)
• VLIW – EPIC (Explicitly Parallel Instruction Computing)
  – Compiler generates “bundles” of instructions that can be executed concurrently (cf. Intel Merced)
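A minimal sketch of the first scheme (separate physical register file). The structure is assumed for illustration: a map table from architectural to physical registers and a counter standing in for the free list. Note that sources are looked up before the destination is remapped.

```python
# Hedged sketch of rename-at-decode with a separate physical register file:
# each architectural destination gets a fresh physical register, so two
# writes to R1 (WAW) or a late read of R1 (WAR) can no longer conflict.

def rename(instrs, num_arch_regs=32):
    """instrs: list of (dest, src1, src2) architectural register numbers."""
    map_table = {r: r for r in range(num_arch_regs)}  # arch -> physical
    next_phys = num_arch_regs                         # free-list stand-in
    renamed = []
    for dest, s1, s2 in instrs:
        p1, p2 = map_table[s1], map_table[s2]         # read sources first
        map_table[dest] = next_phys                   # then allocate dest
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed

# Two writes to R1: after renaming they target distinct physical registers,
# while the true (RAW) dependence through R1 is preserved via physical reg 32.
out = rename([(1, 2, 3), (4, 1, 5), (1, 6, 7)])
```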

Multiple Issue for Static/Dynamic Scheduling

• Issue in order
  – Otherwise bookkeeping too complex (the old “data flow” machines could issue any ready instruction in the whole program)
  – Check for structural hazards; if any, stall
• Dispatch for static scheduling
  – Check for data dependencies; stall adequately
  – Can take forwarding into account
• Dispatch for dynamic scheduling
  – Dispatch out of order (reservation stations, instruction window)
  – Requires possibility of dispatching concurrently dependent instructions (otherwise little benefit over static sched.)

Impact of Multiple Issue on IF

• IF: Need to fetch more than 1 instruction at a time
  – Simpler if instructions are of fixed length
  – In fact need to fetch as many instructions as the issue stage can handle in one cycle (otherwise the issue stage will stall)
• Simpler if restricted not to overlap I-cache lines
  – But with branch prediction and superblocks, this is not realistic, hence the introduction of (instruction) fetch buffers
  – Always attempt to keep at least as many instructions in the fetch buffer as can be issued in the next cycle (BTB’s help for that)
  – For example, have an 8-wide instruction buffer for a machine that can issue 4


Stalls at the IF Stage

• Instruction buffer is full
  – Most likely there are stalls in the stages downstream
• Branch misprediction
• Instructions are stored in several I-cache lines
  – In one cycle one I-cache line can be brought into the fetch buffer
  – A basic block might start in the middle (or end) of an I-cache line
  – Requires several cache lines to fill the buffer
  – The ID (issue-dispatch) stage will stall if not enough instructions in the fetch buffer
• Instruction cache miss

Sample of Current Micros

• Two instruction issue: Alpha 21064, Sparc 2, Pentium, Cyrix
• Three instruction issue: Pentium Pro (but 5 uops from IF/ID to EX; AMD has 4 uops)
• Four instruction issue: Alpha 21164, Alpha 21264, Power PC 620, Sun UltraSparc, HP PA-8000, MIPS R10000
• Many papers written in 1995-98 predicted 16-way issue by 2000. We are still at 4!


The Decode Stage (simple case: dual issue and static scheduling)

• ID = Issue + Dispatch
• Look for conflicts between the (say) 2 instructions
  – If one integer unit and one f-p unit, only check for structural hazard, i.e. the two instructions need the same f-u (easy to check with opcodes)
  – Slight difficulty for integer ops that are f-p load/store/move (potential multiple accesses to f-p register file; solution: provide additional ports)
  – RAW dependencies resolved as in single pipelines
  – Note that the load delay (assume 1 cycle) can now delay up to 3 instructions, i.e., 3 issue slots are lost

Decode in Simple Multiple Issue Case

• If instructions i and i+1 are fetched together and:
  – Instruction i stalls, instruction i+1 will stall
  – Instruction i is dispatched but instruction i+1 stalls (e.g., because of a structural hazard = need the same f-u), instruction i+2 will not advance to the issue stage. It will have to wait till both i and i+1 have been dispatched

Alpha 21064

[Figure: front-end pipeline S0–S3 (Fetch, Swap, Decode, Issue) common to all instructions, feeding the FP, load-store, and integer pipes. Once past the 4th stage, no stalls. In blue, a subset of the 38 bypasses (forwarding paths). Branch prediction happens during “swap”; structural and data hazards are checked during issue. Alpha 21064: 2-way issue; Alpha 21164: 4-way issue (more pipes).]

• IF – S0: Access I-cache
  – Prefetcher fetches 2 instructions (8 bytes) at a time
• Swap stage – S1:
  – Prefetcher contains branch prediction logic tested at this stage: 4-entry return stack; 1 bit/instruction in the I-cache + static prediction BTFNT
  – Initial decode yields 0, 1 or 2 instruction potential issue; align instructions depending on the functional unit they are headed for
• End of decode: S2
  – Check for WAW and WAR (my guess)


Alpha 21064 (c’ed)

• Instruction Issue: S3
  – Check for RAW; forwarding etc.
• Conditions for 2 instruction issue (S2 and S3)
  – The first instruction must be able to issue (in order execution)
  – Load/store can issue with an operate except stores cannot issue with an operate of different format (share the same result bus)
  – An integer op. can issue with a f-p op.
  – A branch can issue with a load/store/operate (but not with stores of the same format)

Alpha 21164

• Main differences with 21064 (besides caches)
  – Up to 4 instructions issued/cycle
  – Two integer units; two f-p units (one add, one multiply; divide can be concurrent with add)
  – Slightly different execution pipe organizations
• Still common trunk of 4 stages
  – S0: Access I-cache. The instructions are predecoded (determination of whether the instruction is a branch – used in S1 – and of the pipeline executing the instruction – used in S2)
  – S1: Branch prediction (2-bit saturating counters in the I-cache associated with each instruction). Buffer 4 instructions for the next stage


Alpha 21164 (c’ed)

• Still common trunk of 4 stages (c’ed)
  – S2: Slot-swap instructions so that they are headed for the right pipeline. If functional unit conflicts, stall previous stages until all four are gone
  – S3: Check for WAW hazards. Read integer register file. Stall if results are not ready.

Pentium

• Recall dual integer pipeline and a f-p pipeline
• Decode 2 consecutive instructions I1 and I2. If both are intended for the integer pipes, issue both iff
  – I1 and I2 are “simple” instructions (no …)
  – I1 is not a jump instruction
  – No WAR and WAW hazard between I1 and I2 (I1 precedes I2)
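The pairing test above can be written down directly. The instruction representation (dicts with `simple`/`jump`/`dest`/`srcs` fields) is an assumption for illustration; only the three conditions listed on the slide are checked.

```python
# Sketch of the Pentium dual-pipe pairing test as described above.
# Register names are illustrative.

def can_pair(i1, i2):
    """Each instruction: dict with 'simple' (bool), 'jump' (bool),
    'dest' (register name) and 'srcs' (list of register names)."""
    if not (i1["simple"] and i2["simple"]):
        return False                      # both must be simple
    if i1["jump"]:
        return False                      # I1 may not be a jump
    if i2["dest"] in i1["srcs"]:
        return False                      # WAR: I2 writes what I1 reads
    if i1["dest"] == i2["dest"]:
        return False                      # WAW between I1 and I2
    return True
```

When `can_pair` fails, only I1 issues and I2 retries with its successor the next cycle.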

CSE 586 Spring 00 17 CSE 586 Spring 00 18 The Decode Stage (dynamic scheduling) Stalls in Decode (issue/dispatch) Stage

• Decode means: • There can be several instructions ready to be dispatched in – Dispatch to either same cycle to same functional unit • A FIFO queue associated with each functional unit (not done any • There might not be enough bus/ports to forward values to more) all the reservation stations that need them in the same cycle • A centralized instruction window common to all functional units (Pentium Pro and Pentium III -- I think) • Reservation stations associated with functional units (MIPS 10000, AMD K5, IBM Power PC 620) – Rename registers (if supported by architecture) – Set up entry at tail of reorder buffer (if supported by architecture) – Issue operands, when ready, to functional unit


The Execute Stage

• Use of forwarding in the case of static scheduling
• Use of broadcast bus and reservation stations for dynamic scheduling
• We’ll talk at length about memory operations (load-store) when we study memory hierarchies

The Commit Step (in-order completion)

• Recall: need of a mechanism (reorder buffer) to:
  – “Complete” instructions in order. This commits the instruction. Since this is a multiple issue machine, it should be able to commit (retire) several instructions per cycle
  – Know when an instruction has completed non-speculatively, i.e., what to do with branches
  – Know whether the result of an instruction is correct, i.e., what to do with exceptions
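The in-order commit requirement can be sketched as follows; the ROB entry format and the commit width of 4 are illustrative assumptions.

```python
# Illustrative in-order retirement from a reorder buffer (ROB): only
# completed instructions at the head may commit, up to `width` per cycle.
from collections import deque

def commit(rob, width=4):
    """rob: deque of dicts with 'done' (bool) and 'result'. Returns the
    results committed this cycle, in program order."""
    retired = []
    while rob and rob[0]["done"] and len(retired) < width:
        retired.append(rob.popleft()["result"])
    return retired

rob = deque([{"done": True, "result": "i0"},
             {"done": True, "result": "i1"},
             {"done": False, "result": "i2"},   # still executing: blocks i3
             {"done": True, "result": "i3"}])
# commits i0 and i1 only; i3 must wait even though it finished early
```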


Power PC 620 (see figure in book)

• Issue stage. Up to 4 instructions issued/cycle except if structural hazards such as:
  – No reservation station available
  – No rename register available
  – Reorder buffer is full
  – Two operations (e.g., forwarding operands) for the same unit. Only one write port/set of reservation stations for each unit.
  – Miscellaneous structural hazards, e.g., too many concurrent reads to the register file
• First 3 due to “the program”, last 2 to the implementation

Pentium Pro

• Fetch-Decode unit
  – Transforms instructions into micro-operations (uops) and stores them in a global reservation table (instruction window). Does register renaming (RAT = register alias table)
• Dispatch (aka issue) unit
  – Issues uops to functional units that execute them and temporarily store the results (the reservation table is 5-ported, hence 5 uops can be issued concurrently)
• Retire unit
  – Commits the instructions in order (up to 3 commits/cycle)

The 3 units of the Pentium Pro are “independent” and communicate through the instruction pool

[Figure: Fetch/Decode unit, Dispatch/Execute unit, and Retire unit all connected to a shared instruction pool]

Impact on Branch Prediction and Completion

• When a conditional branch is decoded:
  – Save the current physical-logical mapping
  – Predict and proceed
• When branch is ready to commit (head of buffer)
  – If prediction correct, discard the saved mapping
  – If prediction incorrect
    • Flush all instructions following the mispredicted branch in the reorder buffer
    • Restore the mapping as it was before the branch as per the saved map
• Note that there have been proposals to execute both sides of a branch using register shadows – limited to one extra set of registers
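A sketch of the save/restore-the-mapping steps above, assuming a simple dict-based rename map and one checkpoint per in-flight branch (real designs bound the number of checkpoints).

```python
# Checkpoint the logical->physical map at each predicted branch; on a
# mispredict, discard the younger map and reinstate the checkpoint.

class Renamer:
    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self.checkpoints = []            # one saved map per in-flight branch

    def on_branch(self):
        self.checkpoints.append(dict(self.mapping))  # save current mapping

    def on_commit_correct(self):
        self.checkpoints.pop(0)          # oldest branch resolved: discard

    def on_mispredict(self):
        # restore the map as it was before the mispredicted branch
        self.mapping = self.checkpoints.pop(0)
        self.checkpoints.clear()         # younger checkpoints are wrong-path

r = Renamer({"R1": 10})
r.on_branch()                # predict and proceed
r.mapping["R1"] = 11         # wrong-path instruction renamed R1
r.on_mispredict()            # flush: R1 maps to 10 again
```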


Exceptions

• Instructions carry their exception status
• When an instruction is ready to commit
  – No exception: proceed normally
  – Exception:
    • Flush (as in mispredicted branch)
    • Restore mapping (more difficult than with branches because the mapping is not saved at every instruction; this method can also be used for branches)

Limits to Hardware-based ILP

• Inherent lack of parallelism in programs
  – Partial remedy: loop unrolling and other compiler optimizations
  – Branch prediction to allow earlier issue and dispatch
• Complexity in hardware
  – Needs large bandwidth for instruction fetch (might need to fetch from more than one I-cache line in one cycle)
  – Requires large register bandwidth (multiported register files)
  – Forwarding/broadcast requires “long wires” (long wires are slow) as soon as there are many units


Limits to Hardware-based ILP (c’ed)

• Difficulties specific to the implementation
  – More possibilities of structural hazards (need to encode some priorities in case of conflict in resource allocations)
  – Parallel search in reservation stations, reorder buffer etc.
  – Additional state savings for branches (mappings), more complex updating of BPT’s and BTB’s
  – Keeping precise exceptions is more complex

A (naïve) Primer on VLIW - EPIC

• Disclaimer: Some of the next few slides are taken (and slightly edited) from an Intel-HP presentation (see Outline for a reference)
• VLIW is a direct descendant of horizontal microprogramming
  – Two commercially unsuccessful machines: Multiflow and Cydrome
• Compiler generates instructions that can execute together
  – Instructions executed in order and assumed to have a fixed latency
• Difficulties occur with:
  – Branch prediction -> Use of predication
  – Pointer-based computations -> Use cache hints and speculative loads
  – Unpredictable latencies (e.g., cache misses)

IA-64: Explicitly Parallel Architecture

[Figure: a 128-bit bundle = Instruction 2 | Instruction 1 | Instruction 0 (41 bits each) + Template (5 bits); example template: Memory (M), Memory (M), Integer (I) = (MMI)]

• IA-64 template specifies
  – The type of operation for each instruction: M=Memory, F=Floating-point, I=Integer, L=Long Immediate, B=Branch
    • MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB
  – Intra-bundle relationship (M / MI or MI / I)
  – Inter-bundle relationship
• Most common combinations covered by templates
  – Headroom for additional templates
• Simplifies hardware requirements
• Scales compatibly to future generations

IA-64 Architecture: Explicit Parallelism

[Figure: original source code is compiled into parallel machine code; the compiler and the hardware’s multiple functional units share the scheduling work]

• IA-64 compiler views a wider scope: more efficient use of execution resources
• Fundamental design philosophy enables new levels of headroom; basis for increased parallelism
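The bundle layout can be decoded with a little bit arithmetic. The placement below (template in the low 5 bits, slot 0 next) matches common descriptions of the IA-64 format but is stated here as an assumption; the point is simply that 3 × 41 + 5 = 128.

```python
# Hedged sketch of pulling apart a 128-bit IA-64 bundle as the slide
# describes it: a 5-bit template plus three 41-bit instruction slots.

SLOT_MASK = (1 << 41) - 1

def split_bundle(bundle):
    """bundle: 128-bit integer. Returns (template, [slot0, slot1, slot2])."""
    template = bundle & 0x1F                  # low 5 bits
    slots = [(bundle >> (5 + 41 * i)) & SLOT_MASK for i in range(3)]
    return template, slots

def make_bundle(template, slots):
    b = template & 0x1F
    for i, s in enumerate(slots):
        b |= (s & SLOT_MASK) << (5 + 41 * i)
    return b

tmpl, slots = split_bundle(make_bundle(0b10001, [1, 2, 3]))
```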

Merced/Itanium implementation (?)

• Can execute 2 bundles (6 instructions) per cycle
• 10 stage pipeline
• 4 integer units (2 of them can handle load-store), 2 f-p units and 3 branch units
• Issue in order, execute in order but can complete out of order. Uses a (restricted) register scoreboard technique to resolve dependencies.

Merced/Itanium implementation (?)

• Predication reduces the number of branches and the number of mispredicts
• Nonetheless: sophisticated branch prediction
  – Compiler hints: the BPR instruction provides “easy” to predict branch addresses; reduces the number of entries in the BTB
  – Two-level hardware prediction SAs(4,2): 512-entry local history table, 4-way set-associative, indexing 128 PHTs – one per set – each with 16 entries of 2-bit saturating counters. Number of bubbles on a predicted taken branch: 2 or 3
  – And a 64-entry BTB (only 1 bubble)
  – Mispredicted branch penalty: 9 cycles
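A rough, scaled-down model of the two-level scheme: a per-branch local history register indexes a table of 2-bit saturating counters. The set-associative organization and the real table sizes of the SAs(4,2) predictor are omitted; sizes here are toy values.

```python
# Toy two-level local-history predictor in the spirit of SAs(4,2).

class TwoLevel:
    def __init__(self, hist_bits=4):
        self.hist_bits = hist_bits
        self.history = {}                      # per-branch local history
        self.pht = [2] * (1 << hist_bits)      # 2-bit counters, weakly taken

    def predict(self, pc):
        return self.pht[self.history.get(pc, 0)] >= 2

    def update(self, pc, taken):
        h = self.history.get(pc, 0)
        c = self.pht[h]
        self.pht[h] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.hist_bits) - 1
        self.history[pc] = ((h << 1) | int(taken)) & mask

bp = TwoLevel()
for _ in range(8):                  # branch alternates T, N, T, N ...
    bp.update(0x40, True); bp.update(0x40, False)
# after warm-up the alternating pattern is learned from the local history
```

A pure 2-bit counter would mispredict an alternating branch half the time; the history level is what lets it learn the pattern.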


Merced/Itanium implementation (?)

• There are “instruction queues” between the fetch unit and the execution units. Therefore branch bubbles can often be absorbed because of long latencies (and stalls) in the execute stages

IA-64 for High Performance

• The number of branches in large server apps overwhelms traditional processors
  – IA-64 predication removes branches, avoids mispredicts
• Environments with a large number of users require high performance
  – IA-64 uses speculation to reduce the impact of memory latency
  – 64-bit addressing enables systems with very large virtual and physical memory

CSE 586 Spring 00 35 CSE 586 Spring 00 36 Middle Tier Application Needs IA-64’s Large Register File Floating-Point Branch Predicate • Mid-tier applications (ERP, etc.) have diverse code requirements Integer Registers Registers Registers Registers – Integer code with many small loops 63 0 81 0 63 0 bit 0 0 0.0 BR0 – Significant call / return requirements (C++, Java) GR0 GR0 PR0 1 GR1 GR1 • IA-64’s unique register model supports these various BR7 PR1 requirements GR31 GR31 PR15 – Large register file provides significant resources for optimized GR32 GR32 performance PR16 – Rotating registers enables efficient loop execution PR63 GR127 GR127 – Register stack to handle call-intensive code NaT 32 Static 32 Static 16 Static

96 Stacked, Rotating 96 Rotating 48 Rotating

IA-64 resources enable optimization for a Large number of registers enables variety of application requirements flexibility and performance CSE 586 Spring 00 37 CSE 586 Spring 00 38

Software Pipelining via Rotating Registers

• Software pipelining improves performance by overlapping the execution of different software loops – execute more loops in the same amount of time
[Figure: sequential loop execution (loops A–D one after another) vs. software-pipelined loop execution (loops A–D overlapped in time)]
• Traditional architectures need complex software loop unrolling for pipelining
  – Results in code expansion --> increases cache misses --> reduces performance
• IA-64 utilizes rotating registers to achieve software pipelining
  – Avoids code expansion --> reduces cache misses --> higher performance

Traditional Register Models

[Figure: procedure A calls procedure B; the procedures share the register file, spilling to memory]
• Procedure A calls procedure B
• Procedures must share space in registers
• Performance penalty due to register save / restore
• Traditional register stacks eliminate the need for save / restore by reserving fixed blocks in registers
• However, fixed blocks waste resources
• I think that the “traditional register stack” model they refer to is the “register windows” model; IA-64 significantly improves upon this
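The rotating-register idea reduces to modular renaming. The sketch below uses IA-64-like numbers (rotating region starting at r32, 96 rotating registers) but simplifies the direction of rotation and ignores the register stack.

```python
# Minimal sketch of rotating-register renaming: a logical rotating
# register maps to physical (base + rrb) mod size, and the rotation base
# moves each iteration, so iteration i's "r32" is a different physical
# register from iteration i+1's -- no unrolling needed.

def phys_reg(logical, rrb, first=32, size=96):
    """Map a logical register in the rotating region [first, first+size)."""
    return first + (logical - first + rrb) % size

writes = []
rrb = 0
for iteration in range(3):
    writes.append(phys_reg(32, rrb))  # each iteration writes logical "r32"
    rrb = (rrb + 1) % 96              # rotate at the loop-closing branch
```

Because each iteration's writes land in fresh physical registers, values produced by overlapped iterations do not clobber each other, which is exactly what unrolling buys on a conventional register file, but without the code expansion.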


IA-64 Register Stack

[Figure: traditional register stacks reserve fixed-size blocks per procedure; the IA-64 register stack reserves variable-size blocks]
• Traditional stacks eliminate the need for save / restore by reserving fixed blocks in registers, but fixed blocks waste resources
• IA-64 is able to reserve variable block sizes
  – No wasted resources
• IA-64 combines high performance and high efficiency

IA-64 Floating-Point Architecture (82-bit floating-point numbers)

[Figure: 128-entry FP register file with multiple read ports and multiple write ports feeding several FMAC units (A * X + B)]
• 128 registers
  – Allows parallel execution of multiple floating-point operations
• Simultaneous Multiply-Accumulate (FMAC)
  – 3-input, 1-output operation: a * b + c = d
  – Shorter latency than independent multiply and add
  – Greater internal precision and single rounding error
• Resourced for scientific analysis and 3D graphics

IA-64: Next Generation Architecture

| Feature | Function | Benefit |
| Explicit parallelism: compiler / hardware synergy | Executes more instructions in the same amount of time | Maximizes headroom for the future |
| Register model: large register file, rotating registers, register stack engine | Able to optimize for scalar and object-oriented applications | World-class performance for complex applications |
| Floating-point architecture: extended precision, 128 registers, FMAC, SIMD | High performance 3D graphics and scientific analysis | Enables more complex scientific analysis calculations; faster digital content creation and rendering |
| Multimedia architecture: parallel arithmetic, parallel shift, data arrangement instructions | Improves calculation throughput for multimedia data | Efficient delivery of rich Web content |
| Memory management: 64-bit addressing, speculation, memory hierarchy control | Manages large amounts of memory, efficiently organizes data from / to memory | Increased architecture & system scalability |
| Compatibility: full binary compatibility with existing IA-32 instructions in hardware, PA-RISC through software translation | Existing software runs seamlessly | Preserves investment in existing software |

Predication Basic Idea

• Associate a Boolean condition (predicate) with the issue, execution, or commit of an instruction
  – The stage in which to test the predicate is an implementation choice
• If the predicate is true, the result of the instruction is kept
• If the predicate is false, the instruction is nullified
• Distinction between
  – Partial predication: only a few opcodes can be predicated
  – Full predication: every instruction is predicated

Predication Benefits

• Allows compiler to overlap the execution of independent control constructs w/o code explosion
• Allows compiler to reduce the frequency of branch instructions and, consequently, of branch mispredictions
• Reduces the number of branches to be tested in a given cycle
• Reduces the number of multiple execution paths and associated hardware costs (copies of register maps etc.)
• Allows code movement in superblocks

Predication Costs

• Increased fetch utilization
• Increased register consumption
• If predication is tested at commit time, increased functional-unit utilization
• With code movement, increased complexity of exception handling
  – For example, insert extra instructions for exception checking


Flavors of Predication Implementation

• Has its roots in vector machines like the CRAY-1
  – Creation of vector masks to control vector operations on an element-per-element basis
• Often (partial) predication is limited to conditional moves as, e.g., in the Alpha, MIPS 10000, Power PC, SPARC and the Pentium Pro
• Predication to nullify the next instruction as in HP PA-RISC
• Full predication: every instruction predicated as in IA-64
  – The guarded execution model, where a special instruction controls the conditional execution of several of the subsequent instructions, is similar

Partial Predication: Conditional Moves

• CMOV R1, R2, R3
  – Move R2 to R1 if R3 = 0
• Main compiler use: If (cond) S1 (with result in Rres)
  – (1) Compute result of S1 in Rs1
  – (2) Compute condition in Rcond
  – (3) CMOV Rres, Rs1, Rcond
• Increases register pressure (Rcond is a general register)
• No need (in this example) for branch prediction
• Very useful if the condition can be computed ahead of or, e.g., in parallel with, the result.
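The three-step recipe above, written out. `cmov` here follows the slide's semantics (move when the condition register is zero); materializing the condition as a subtraction is an illustrative choice that lets it be computed in parallel with the result.

```python
# If-conversion of "if (a == b) res = a + 1;" via a conditional move:
# no branch, hence nothing to predict.

def cmov(dest, src, cond):
    """CMOV dest, src, cond: keep src when cond == 0, else keep dest."""
    return src if cond == 0 else dest

def if_convert(a, b, res):
    rs1 = a + 1                    # (1) compute the result of S1 anyway
    rcond = a - b                  # (2) compute the condition (0 iff a == b)
    return cmov(res, rs1, rcond)   # (3) commit it only if the condition holds
```

The branch-free version always executes three instructions, but each is unconditional, so the schedule is fixed regardless of the data.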


• Select dest, src1, src2,cond • Define predicates with instructions of the form:

– Corresponds to C-like --- dest = ( (cond) ? src1 : src2) Pred_ Pout1 , Pout2,, src1, src2 (Pin) where – Note the destination register is always assigned a value – Pout1 and Pout2 are assigned values according to the comparison – Use in the Multiflow (first commercial VLIW machine) between src1 and src2 and the cmp “opcode” • Nullify – The predicate types are most often U (unconditional) and U its complement, and OR and OR – Any register-register instruction can nullify the next instruction, – The predicate define instruction can itself be predicated with the thus making it conditional value of Pin • There are definite rules for that, e.g., if Pin = 0, U and U are set to 0 independently of the result of the comparison and the OR predicates are not modified.
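A toy executable reading of those rules, covering the U pair and one OR-type predicate. The cmp "opcode" is fixed to equality and the two-target form is simplified to returned values; this is a sketch of the stated rules, not the full predicate-type taxonomy.

```python
# Toy semantics for the predicate-define instruction described above:
# with Pin = 0 the U pair is cleared and the OR predicate is untouched.

def pred_define(src1, src2, pin, p_or):
    """Returns (p_u, p_u_bar, p_or) for an equality comparison."""
    cmp_true = (src1 == src2)
    if pin == 0:
        return 0, 0, p_or          # U and U-bar forced to 0; OR unchanged
    p_u, p_u_bar = int(cmp_true), int(not cmp_true)
    if cmp_true:
        p_or = 1                   # OR type: only ever set, never cleared
    return p_u, p_u_bar, p_or
```

The OR type is what makes "join" points cheap: several paths can each try to set the same predicate without clobbering one another.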


Levels of Parallelism within a Single …

• ILP: smallest grain of parallelism
  – Resources are not that well utilized (far from ideal CPI)
  – Stalls on operations with long latencies (division, cache miss)
• Multiprogramming: several applications (or large sections of applications) running concurrently
  – O.S. directed activity
  – Change of application requires a context-switch (e.g., on a page fault)
• Multithreading
  – Main goal: tolerate latency of long operations without paying the price of a full context-switch

If-conversion

[Figure: if → then / else → join control-flow graph]
• The if condition will set p1 to U
• The then will be executed predicated on p1 (U)
• The else will be executed predicated on p1-bar (U-bar)
• The “join” will in general be predicated on some form of OR predicate


Multithreading

• The processor supports several instruction streams running “concurrently”
• Each instruction stream has its own context (process state)
  – Registers
  – PC, status register, special control registers etc.
• The multiple streams are multiplexed by the hardware on a set of common functional units

Fine Grain Multithreading

• Conceptually, at every cycle a new instruction stream dispatches an instruction to the functional units
• If enough instruction streams are present, long latencies can be hidden
  – For example, if 32 streams can dispatch an instruction, latencies of 32 cycles could be tolerated
• For a single application, requires highly sophisticated compiler technology
  – Discover many threads in a single application
• Basic idea behind Tera’s MTA
  – Burton Smith’s third such machine (he started in the late 70’s)
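A back-of-the-envelope model of the 32-streams/32-cycles claim: round-robin issue with a fixed latency per instruction. Everything here (one instruction per cycle, uniform latency, the function name) is a deliberate simplification for illustration.

```python
# With enough streams issuing round-robin, a long-latency operation
# finishes before its stream comes around again, so no cycle is wasted.

def cycles_lost(num_streams, latency, instrs_per_stream):
    """Round-robin issue, one instruction/cycle, fixed latency. Counts
    cycles where the selected stream must stall because its previous
    instruction is still in flight."""
    ready_at = [0] * num_streams          # cycle when each stream may issue
    remaining = [instrs_per_stream] * num_streams
    cycle = stalls = 0
    while any(remaining):
        s = cycle % num_streams           # round-robin selection
        if remaining[s]:
            if ready_at[s] > cycle:
                stalls += 1               # stream not ready: bubble
            else:
                remaining[s] -= 1
                ready_at[s] = cycle + latency
        cycle += 1
    return stalls
```

With 32 streams and a 32-cycle latency every stream is ready again exactly when its turn comes back; with only 4 streams the same latency produces bubbles.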

CSE 586 Spring 00 53 CSE 586 Spring 00 54 Tera’s MTA Tera’s MTA (c’ed)

• Each processor can execute • Since several streams belong to the same application, – 16 applications in parallel (multiprogramming) synchronization is very important (will be discussed alter – 128 streams in the quarter) • At every clock cycle, processor selects a ready stream and • Needs instructions – and compiler support – to allocate, issues an instruction from that stream activate, and deallocate streams • An instruction is a “LIW”: Memory, arithmetic, control • Compiler support: loop level parallelism and software • Several instructions of the same stream can be in flight pipelining simultaneously (ILP) • Hardware support: dynamic allocation of streams • Instructions of different streams can be in flight (depending on mix of applications etc.) simultaneously (multithreading)

CSE 586 Spring 00 55 CSE 586 Spring 00 56

Coarse Grain Multithreading

• Switch threads (contexts) only at certain events
  – Changing thread context takes a few (10–20?) cycles
  – Used for long memory latency operations, e.g., access to a remote memory in a shared-memory multiprocessor (100’s of cycles)
• Of course, context-switches occur when there are exceptions such as page faults
• Many fewer contexts needed than in fine-grain multithreading

Simultaneous Multithreading (SMT)

• Combines the advantages of ILP and fine grain multithreading
[Figure: issue-slot diagrams for ILP vs. SMT. With ILP alone, vertical waste (entirely empty issue cycles) is of the order of 60% of overall waste. With SMT, horizontal waste is still present but not as much, and vertical waste does not necessarily all disappear as this figure implies.]


SMT (a UW invention)

• Needs one context per thread
  – But fewer threads needed than in fine grain multithreading
• Can issue simultaneously from distinct threads in the same cycle
• Can share resources
  – For example: physical registers for renaming, caches, BPT etc.
• Future generation Alpha based on SMT
  – Dynamic scheduling

SMT (c’ed)

• Compared with an ILP superscalar of same issue width
  – Requires 5% more real estate
  – Slightly more complex to design (thread scheduling, identifying threads that raise exceptions etc.)
• Drawback (common to all wide-issue processors): centralized design
• Benefits
  – Increases throughput of applications running concurrently
  – No partitioning of many resources (in contrast with chip multiprocessors)

Trace Caches

• Filling up the instruction buffer of wide issue processors is a challenge (even more so in SMT)
• Instead of fetching from the I-cache, fetch from a trace cache
• The trace cache is a complementary instruction cache that stores sequences of instructions organized in dynamic program execution order
• Implemented in the forthcoming Intel Willamette (thanks Luni for the pointer) and some Sun Sparc architectures.
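The organization can be sketched as a lookup keyed by starting PC plus predicted branch outcomes; real trace caches bound trace length and handle partial matches, which this toy version omits.

```python
# Sketch of the trace-cache idea: index by (start PC, branch outcomes) and
# get back a block of instructions already in dynamic -- not static --
# program order, spanning taken branches in one access.

class TraceCache:
    def __init__(self):
        self.traces = {}

    def fill(self, start_pc, outcomes, instrs):
        # outcomes: tuple of taken/not-taken bits for branches in the trace
        self.traces[(start_pc, outcomes)] = instrs

    def fetch(self, start_pc, outcomes):
        # hit: one access delivers what a conventional I-cache would need
        # several non-contiguous line fetches to supply
        return self.traces.get((start_pc, outcomes))

tc = TraceCache()
tc.fill(0x100, (True,), ["ld", "add", "beq(T)", "mul"])
```

A miss (or a different predicted outcome) falls back to the ordinary I-cache while a new trace is built.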
