Multiple Issue, Speculative Execution and Recovery


ILP: Multiple Issue, Speculative Execution and Recovery
Keywords: multiple issue, speculative execution, precise interrupts, reorder buffer

Multiple instruction issue
• Recap: Tomasulo's algorithm handles RAW, WAR and WAW hazards via dynamic scheduling
• Three basic mechanisms:
  – Execute only when data are ready – handles RAW
  – Buffer register values in reservation stations when an instruction is issued – handles WAR
  – Register destination field – handles WAW

Single-issue Tomasulo
• The mechanisms above support a model where:
  – Instructions are issued one at a time, in program order
    » Into reservation stations
  – Multiple instructions can begin execution in the same cycle, out of program order
    » Provided their operands and independent functional units are available
  – Instructions can complete out of order

Support for multiple issue
• The Tomasulo algorithm has several phases:
  – Fetch into the instruction queue
  – Decode and issue to reservation stations
  – Execute
  – Write results
• If any of these phases supports only one instruction per cycle, multiple issue cannot yield CPI < 1

Support for multiple issue (cont.)
• Our Tomasulo design so far:
  – Execution supports multiple instructions
  – Write results supports multiple instructions
    » Provided there are multiple functional units and CDBs
• Extensions needed:
  – Fetching multiple instructions per cycle
  – Issuing multiple instructions per cycle

Fetching multiple instructions
• Must be able to fill the instruction queue at a higher rate
  – To sustain N instructions per cycle, must fetch at least N instructions per cycle
• Must look ahead in the program-order stream and fetch from multiple PCs

Decoding multiple operations
• Once multiple instructions are in the fetch queue:
  – Must be able to decode and issue more than one in a single cycle
  – Such a group is an "issue packet"

Issue logic
• First, check whether reservation stations are available
  – For each instruction that may be issued
  – As before, but now the check is more complex
    » Must account not only for reservation stations already in use, but also for those that will be occupied by other instructions in the issue packet
• Second, resolve dependences within the issue packet
  – In addition to dependences with instructions already in flight
  – Keep in mind these checks must preserve program order within the issue packet
• The checks must be done in parallel, while preserving program order
• Possible approaches:
  – Hardware that checks all possible combinations within a single cycle
  – Pipelining the issue stage

Simplifications
• If floating-point and integer registers are separate:
  – Issue logic can be simplified by looking at one FP and one INT instruction at a time
  – But some programs contain only INT instructions
• General case: support issue of an arbitrary mix of INT and FP instructions
  – And loads, stores

Tomasulo: decoupling
• Fetch, issue and execution are decoupled
  – The number of instructions fetched in cycle k does not imply the same number issued in cycle k+1
  – The number of instructions issued does not imply the same number executed
• Once the fetch/issue stages are improved, execution works just as before

Example
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,Loop
• 1 integer ALU, 1 FP ALU, 1 address adder, 1 memory unit, 2 CDBs
• Multiple reservation stations
• Issue packet: any two instructions (except branches, which are single-issue)
• Branch-predicted instructions can be fetched and issued, but not executed

Cycle #0
• L.D and ADD.D fetched into the instruction queue

Cycle #1
• L.D and ADD.D issued
  – What checks need to be made?
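To make the issue-stage checks concrete, here is a minimal sketch of dual-issue logic in Python. The names (`issue_packet`, `free_stations`, and the entry fields) are illustrative, not from the slides: it allocates reservation stations in program order, resolves intra-packet dependences against earlier members of the same packet, and falls back to the register result status for producers already in flight.

```python
def issue_packet(packet, free_stations, reg_status):
    """packet: list of (op, dest, srcs) in program order.
    free_stations: free reservation-station names, one per packet slot.
    reg_status: register -> producing RS tag, or absent if value is ready.
    Returns the reservation-station entries created this cycle."""
    issued = []
    packet_writers = {}  # registers written by earlier instructions in this packet
    for i, (op, dest, srcs) in enumerate(packet):
        if i >= len(free_stations):   # structural hazard: no station left,
            break                     # stop here to preserve program order
        rs = free_stations[i]
        entry = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
        for q, v, src in (("Qj", "Vj", srcs[0]), ("Qk", "Vk", srcs[1])):
            if src in packet_writers:              # intra-packet RAW dependence
                entry[q] = packet_writers[src]
            elif reg_status.get(src) is not None:  # RAW with an in-flight producer
                entry[q] = reg_status[src]
            else:
                entry[v] = src                     # operand read from register file
        packet_writers[dest] = rs   # later packet members must wait on this station
        reg_status[dest] = rs       # register result status now names this station
        issued.append((rs, entry))
    return issued
```

For the loop above, issuing L.D and ADD.D together in cycle 1 sets the ADD.D station's Qj to the load's station (Load1), which matches the cycle-1 reservation-station state in the walkthrough.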
  – What are the Vj/Vk and Qj/Qk fields of the ADD.D reservation station set to?
• S.D and DADDIU brought into the instruction queue

Tomasulo Example: Cycle 1

Instruction status        Issue  Exec complete  Write result
L.D    F0,0(R1)             1
ADD.D  F4,F0,F2             1

Load buffers:  Name   Busy  Address
               Load1  Yes   0+R1

Reservation stations:  Name  Busy  Op   Vj  Vk  Qj     Qk
                       Add1  Yes   Add      F2  Load1

Register result status (clock 1):  F0 -> Load1, F4 -> Add1

Cycle #2
• S.D and DADDIU issued
• BNE brought into the instruction queue
  – Assumption: single issue of branches

Tomasulo Example: Cycle 2

Instruction status        Issue  Exec complete  Write result
L.D    F0,0(R1)             1
ADD.D  F4,F0,F2             1
S.D    F4,0(R1)             2
DADDIU R1,R1,#-8            2

Load/store buffers:  Name    Busy  Address
                     Load1   Yes   0+R1
                     Store1  Yes   0+R1

Reservation stations:  Name  Busy  Op   Vj  Vk  Qj     Qk
                       Add1  Yes   Add      F2  Load1
                       Int   Yes   DAD  R1  -8

Register result status (clock 2):  F0 -> Load1, F4 -> Add1

Cycle #3
• BNE issued
• L.D and DADDIU access memory / execute
• L.D and ADD.D (second iteration) brought into the instruction queue
  – Under what assumption?

Cycle #4
• L.D and ADD.D issued
  – Can they execute?
• L.D and DADDIU write results
  – Who needs their results?
• S.D and DADDIU brought into the instruction queue

Cycle #5
• BNE issued
• BNE executes
  – What happens after the branch is resolved?
• L.D and ADD.D brought into the instruction queue

Limitations/challenges
• Multiple issue relies on fetching a stream of instructions ahead of time
  – Branch/target prediction is very important
• What to do with instructions that depend on prediction results?
  – They may or may not be valid
• We can always fetch and issue them
  – No harm is done until memory or register contents are written
  – But just fetching/issuing is not enough

Dealing with branches
• Predicting branches lets us:
  – Fetch the appropriate instructions into the queue
  – Issue instructions into reservation stations
  – If the prediction is incorrect:
    » Flush the queue
    » Flush the reservation-station entries
• If the prediction is correct:
  – Some overlap will come from being able to fetch/issue new instructions
  – If the branch outcome takes long to compute, the reservation stations will fill up

Limitations/challenges (cont.)
• The single- and multi-issue Tomasulo schemes so far allow out-of-order completion
  – This makes precise exceptions difficult to support
• Approach to handling this and the previous issue: speculative execution

Multiple-issue Tomasulo
• Extends fetch/issue to handle multiple instructions per cycle
• Branches pose limitations:
  – Prediction lets the processor fetch ahead
  – It can also issue
  – But it cannot execute or complete until it is known whether the prediction is correct

Control dependencies
• Every instruction is control dependent on some set of branches

  if p1 then S1;
  if p2 then S2;

• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1
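The fetch/issue-but-don't-commit discipline for predicted branches can be sketched with a toy front-end model. The class and method names here are invented for illustration: instructions past an unresolved branch may be fetched and moved into stations, but are discarded wholesale when the prediction turns out wrong.

```python
class SpeculativeFrontEnd:
    """Toy model: fetch ahead of an unresolved branch, flush on mispredict."""

    def __init__(self):
        self.instr_queue = []     # instruction queue (fetched, not yet issued)
        self.spec_stations = []   # station entries filled past the branch

    def fetch_past_branch(self, predicted_path):
        # Fetch down the predicted path before the branch resolves.
        self.instr_queue.extend(predicted_path)

    def issue(self):
        # Speculated instructions may occupy reservation stations...
        while self.instr_queue:
            self.spec_stations.append(self.instr_queue.pop(0))
        # ...but must not write registers or memory yet.

    def resolve_branch(self, prediction_correct):
        if not prediction_correct:
            # Misprediction: flush the queue and the speculative station entries.
            self.instr_queue.clear()
            self.spec_stations.clear()
        survivors = list(self.spec_stations)
        self.spec_stations.clear()
        return survivors  # instructions now safe to execute and complete
```

Note that this captures only the flush; recovering precise state for results that already completed is exactly what the reorder buffer, introduced below, adds.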
• Control dependencies must be preserved to preserve program order

Speculative execution
• Let instructions that follow a predicted branch actually execute and complete
  – Speculate: "gamble" that a prediction is correct, then verify the prediction
  – Buffer all instructions that complete until they are safe to alter registers/memory
    » And commit their results in program order
• Hardware enhancement: reorder buffers (ROBs)

Example
      ADD  R1, R2, R3
      BNEZ R1, Foo
      MULT R4, R4, R5
Foo:  ADD  R4, R4, R6
• Assume R4=1, R5=10, R6=20, branch predicted taken, 5 cycles to resolve
• Correct results: R4=21 (taken), R4=30 (not taken)
• The processor speculatively issues and executes the instructions on the predicted-taken path (ADD, BNEZ, and the ADD at Foo) -> R4 = 1+20 = 21
  – Correct if the branch is taken; what if it is not taken?
• If the branch is not taken:
  – Cannot simply roll back and execute MULT and ADD: that would yield R4 = 21*10+20 = 230
  – Cannot simply execute MULT and skip ADD: that would yield R4 = 21*10 = 210
• If the branch is not taken, two things must happen:
  1) ADD's result must be discarded, and
  2) MULT and ADD must be executed in order (roll back to sequential execution from MULT)

Supporting discards/roll-back
• Do not let results go immediately to the register file
  – Stage them in a buffer
• Once the branch condition is checked:
  – If the prediction was correct, copy from the buffer to the register file
  – If incorrect, clear the buffer entries and roll back

Reorder buffer: key ideas
• Add another buffer to the design: the reorder buffer (ROB)
• Break the completion of an instruction into two stages:
  – Write result
    » From the functional units to the ROB
  – Commit result
    » From the ROB to registers and memory

Entries
• ROB entry fields:
  – Busy
  – Instruction opcode
  – Destination (register, memory)
  – Value
• New reservation-station entry field:
  – "Dest" (index pointing to a ROB entry)
  – Also, Qj and Qk now point to ROB entries (not to reservation stations)
• Register result status modifications:
  – The index ("Reorder") now also points to a ROB entry (not to a reservation station)

Execution stages
• (fetch)
• 1.
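The two-stage completion the ROB enables — out-of-order write result, strictly in-order commit — can be sketched as follows. This is a simplified Python model under assumed field names (the slides list Busy, opcode, destination and value per entry; only destination, value and a ready flag are modeled here):

```python
class ReorderBuffer:
    """Minimal ROB model: results arrive in any order, retire in program order."""

    def __init__(self, size):
        self.entries = []   # FIFO, head = oldest instruction in program order
        self.size = size

    def allocate(self, dest):
        # At issue: reserve an entry at the tail; a full ROB stalls issue.
        assert len(self.entries) < self.size, "ROB full: stall issue"
        entry = {"dest": dest, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        # Stage 1 (write result): a functional unit delivers its value to the
        # ROB, possibly out of program order.
        entry["value"] = value
        entry["ready"] = True

    def commit(self, regfile):
        # Stage 2 (commit): retire from the head only, so registers/memory
        # are updated strictly in program order (precise state).
        committed = []
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.pop(0)
            regfile[e["dest"]] = e["value"]
            committed.append(e["dest"])
        return committed
```

A younger instruction may finish first and sit ready in the buffer, but nothing reaches the register file until every older entry commits; on a misprediction, the wrong-path entries would simply be cleared from the buffer instead of committed.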