CS/ECE 3330 Computer Architecture

Total Page:16

File Type:pdf, Size:1020Kb

CS/ECE 3330 Computer Architecture CS/ECE 3330 Computer Architecture Chapter 4 The Processor Introduction n CPU performance factors § Instruction count – Determined by ISA and compiler § CPI and Cycle time – Determined by CPU hardware n We will examine two MIPS implementations § A simplified version § A more realistic pipelined version n Simple subset, shows most aspects § Memory reference: lw, sw § Arithmetic/logical: add, sub, and, or, slt § Control transfer: beq, j 1 COMP 140 – Summer 2014 Instruction Execution n PC → instruction memory, fetch instruction n Register numbers → register file, read registers n Depending on instruction class § Use ALU to calculate – Arithmetic result – Memory address for load/store – Branch target address § Access data memory for load/store § PC ← target address or PC + 4 2 COMP 140 – Summer 2014 CPU Overview 3 COMP 140 – Summer 2014 Multiplexers n Can’t just join wires together § Use multiplexers 4 COMP 140 – Summer 2014 Control 5 COMP 140 – Summer 2014 Logic Design Basics n Information encoded in binary § Low voltage = 0, High voltage = 1 § One wire per bit § Multi-bit data encoded on multi-wire buses n Combinational element § Operate on data § Output is a function of input n State (sequential) elements § Store information 6 COMP 140 – Summer 2014 Combinational Elements AND-gate Adder Y = A & B Y = A + B A A Y + Y B B Multiplexer Arithmetic/Logic Unit Y = S ? I1 : I0 Y = F(A, B) A I0 M u Y ALU Y I1 x B S F 7 COMP 140 – Summer 2014 Sequential Elements n Register: stores data in a circuit § Uses a clock signal to determine when to update the stored value § Edge-triggered: update when Clk changes from 0 to 1 Clk D Q D Clk Q 8 COMP 140 – Summer 2014 Sequential Elements n Register with write control § Only updates on clock edge when write control input is 1 § Used when stored value is required later Clk D Q Write Write Clk D Q 9 COMP 140 – Summer 2014 Clocking Methodology n Combinational logic transforms data during clock cycles § Between clock edges § Input from state elements, output to state element § Longest delay determines clock period 10 COMP 140 – Summer 2014 Building a Datapath n Datapath § Elements that process data and addresses in the CPU – Registers, ALUs, mux’s, memories, … n We will build a MIPS datapath incrementally § Refining the overview design 11 COMP 140 – Summer 2014 Instruction Fetch Increment by 4 for next 32-bit instruction register 12 COMP 140 – Summer 2014 Recall: MIPS R-format Instructions op rs rt rd shamt funct 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits n Example: add $t0, $s1, $s2 n Instruction fields § op: operation code (opcode) § rs: first source register number § rt: second source register number § rd: destination register number § shamt: shift amount (00000 for now) § funct: function code (extends opcode) 13 COMP 140 – Summer 2014 R-Format Instructions n Read two register operands n Perform arithmetic/logical operation n Write register result 14 COMP 140 – Summer 2014 Recall: MIPS I-format Instructions op rs rt constant or address 6 bits 5 bits 5 bits 16 bits n Examples: § lw $t0, 32($s3) § beq rs, rt, L1 n Immediate arithmetic and load/store instructions § rt: destination or source register number § Constant: –215 to +215 – 1 § Address: offset added to base address in rs 15 COMP 140 – Summer 2014 Load/Store Instructions n Read register operands n Calculate address using 16-bit offset § Use ALU, but sign-extend offset n Load: Read memory and update register n Store: Write register value to memory 16 COMP 140 – Summer 2014 Branch Instructions n Read register operands n Compare operands § Use ALU, subtract and check Zero output n Calculate target address § Sign-extend displacement § Shift left 2 places (word displacement) § Add to PC + 4 – Already calculated by instruction fetch 17 COMP 140 – Summer 2014 Branch Instructions Just re-routes wires Sign-bit wire replicated 18 COMP 140 – Summer 2014 Composing the Elements n First-cut data path does an instruction in one clock cycle § Each datapath element can only do one function at a time § Hence, we need separate instruction and data memories n Use multiplexers where alternate data sources are used for different instructions 19 COMP 140 – Summer 2014 R-Type/Load/Store Datapath 20 COMP 140 – Summer 2014 Full Datapath 21 COMP 140 – Summer 2014 ALU Control n ALU used for § Load/Store: F = add § Branch: F = subtract § R-type: F depends on funct field ALU control Function 0000 AND 0001 OR 0010 add 0110 subtract 0111 set-on-less-than 1100 NOR 22 COMP 140 – Summer 2014 ALU Control n Assume 2-bit ALUOp derived from opcode § Combinational logic derives ALU control opcode ALUOp Operation funct ALU function ALU control lw 00 load word XXXXXX add 0010 sw 00 store word XXXXXX add 0010 beq 01 branch equal XXXXXX subtract 0110 R-type 10 add 100000 add 0010 subtract 100010 subtract 0110 AND 100100 AND 0000 OR 100101 OR 0001 set-on-less-than 101010 set-on-less-than 0111 23 COMP 140 – Summer 2014 The Main Control Unit n Control signals derived from instruction R-type 0 rs rt rd shamt funct 31:26 25:21 20:16 15:11 10:6 5:0 Load/ 35 or 43 rs rt address Store 31:26 25:21 20:16 15:0 Branch 4 rs rt address 31:26 25:21 20:16 15:0 opcode always read, write for sign-extend read except R-type and add for load and load 24 COMP 140 – Summer 2014 Datapath With Control 25 COMP 140 – Summer 2014 op rs rt rd sham funct R-Type Instruction 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits rd rs rt add $t1,$t2,$t3 26 COMP 140 – Summer 2014 op rs rt constant or address Load Instruction 6 bits 5 bits 5 bits 16 bits rt rs lw $t1,offset($t2) 27 COMP 140 – Summer 2014 op rs rt constant or address Branch-on-Equal 6 bits 5 bits 5 bits 16 bits rs rt beq $t1,$t2,offset 28 COMP 140 – Summer 2014 Implementing Jumps Jump 2 address 31:26 25:0 n Jump uses word address n Update PC with concatenation of § Top 4 bits of old PC § 26-bit jump address § 00 n Need an extra control signal decoded from opcode 29 COMP 140 – Summer 2014 Datapath With Jumps Added 30 COMP 140 – Summer 2014 Performance Issues n Longest delay determines clock period § Critical path: load instruction § Instruction memory → register file → ALU → data memory → register file n Not feasible to vary period for different instructions n Violates design principle § Making the common case fast n We will improve performance by pipelining 31 COMP 140 – Summer 2014 Pipelining Analogy n Pipelined laundry: overlapping execution § Parallelism improves performance n Four loads: n Speedup = 8hrs/3.5hrs = 2.3 n Non-stop: (n=# loads) n Speedup = 2hrs*n/(0.5hrs*n + 1.5hrs) ≈ 4 (# of stages) 32 COMP 140 – Summer 2014 MIPS Pipeline n Five stages, one step per stage § IF: Instruction fetch from memory § ID: Instruction decode & register read § EX: Execute operation or calculate address § MEM: Access memory operand § WB: Write result back to register Instr 1 IF ID EX MEM WB Instr 2 IF ID EX MEM WB Instr 3 IF ID EX MEM WB Time à 33 COMP 140 – Summer 2014 Pipeline Performance n Assume time for stages is § 100ps for register read or write § 200ps for other stages n Compare pipelined datapath with single-cycle datapath Instr Instr fetch Register ALU op Memory Register Total time read access write lw 200ps 100 ps 200ps 200ps 100 ps 800ps sw 200ps 100 ps 200ps 200ps 700ps R-format 200ps 100 ps 200ps 100 ps 600ps beq 200ps 100 ps 200ps 500ps 34 COMP 140 – Summer 2014 Pipeline Performance Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) 35 COMP 140 – Summer 2014 Pipeline Speedup n If all stages are balanced § i.e., all take the same time § Time between instructionspipelined = Time between instructionsnonpipelined Number of stages n If not balanced, speedup is lower n Speedup due to increased throughput § Latency (time for each instruction) does not decrease 36 COMP 140 – Summer 2014 Pipelining and ISA Design n MIPS ISA designed for pipelining § All instructions are 32-bits – Easier to fetch and decode in one cycle – c.f. x86: 1- to 17-byte instructions § Few and regular instruction formats – Can decode and read registers in one step § Load/store addressing – Can calculate address in 3rd stage, access memory in 4th stage § Alignment of memory operands – Memory access takes only one cycle 37 COMP 140 – Summer 2014 Hazards n Situations that prevent starting the next instruction in the next cycle n Structure hazards § A required resource is busy n Data hazard § Need to wait for previous instruction to complete its data read/write n Control hazard § Deciding on control action depends on previous instruction 38 COMP 140 – Summer 2014 Structure Hazards n Conflict for use of a resource n In MIPS pipeline with a single memory § Load/store requires data access § Instruction fetch would have to stall for that cycle – Would cause a pipeline “bubble” n Hence, pipelined datapaths require separate instruction/data memories § Or separate instruction/data caches 39 COMP 140 – Summer 2014 Data Hazards n An instruction depends on completion of data access by a previous instruction § add $s0, $t0, $t1 sub $t2, $s0, $t3 add $s0, $t0, $t1 IF ID EX MEM WB sub $t2, $s0, $t3 IF ID ID ID EX MEM WB 40 COMP 140 – Summer 2014 Data Hazards n Think about which pipeline stage generates or uses a value add $s0, $t0, $t1 IF ID EX MEM WB sub $t2, $s0, $t3 IF ID EX MEM WB lw $s0, 20($t1) IF ID EX MEM WB sw $t3, 12($t0) IF ID EX MEM WB 41 COMP 140 – Summer 2014 Forwarding (aka Bypassing) n Use result when it is computed § Don’t wait for it to be stored in a register § Requires extra connections in the datapath 42 COMP 140 – Summer 2014 Load-Use Data Hazard n Can’t always avoid stalls by forwarding § If
Recommended publications
  • IEEE Paper Template in A4
    Vasantha.N.S. et al, International Journal of Computer Science and Mobile Computing, Vol.6 Issue.6, June- 2017, pg. 302-306 Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320–088X IMPACT FACTOR: 6.017 IJCSMC, Vol. 6, Issue. 6, June 2017, pg.302 – 306 Enhancing Performance in Multiple Execution Unit Architecture using Tomasulo Algorithm Vasantha.N.S.1, Meghana Kulkarni2 ¹Department of VLSI and Embedded Systems, Centre for PG Studies, VTU Belagavi, India ²Associate Professor Department of VLSI and Embedded Systems, Centre for PG Studies, VTU Belagavi, India 1 [email protected] ; 2 [email protected] Abstract— Tomasulo’s algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution, designed to efficiently utilize multiple execution units. It was developed by Robert Tomasulo at IBM. The major innovations of Tomasulo’s algorithm include register renaming in hardware. It also uses the concept of reservation stations for all execution units. A common data bus (CDB) on which computed values broadcast to all reservation stations that may need them is also present. The algorithm allows for improved parallel execution of instructions that would otherwise stall under the use of other earlier algorithms such as scoreboarding. Keywords— Reservation Station, Register renaming, common data bus, multiple execution unit, register file I. INTRODUCTION The instructions in any program may be executed in any of the 2 ways namely sequential order and the other is the data flow order. The sequential order is the one in which the instructions are executed one after the other but in reality this flow is very rare in programs.
    [Show full text]
  • Computer Science 246 Computer Architecture Spring 2010 Harvard University
    Computer Science 246 Computer Architecture Spring 2010 Harvard University Instructor: Prof. David Brooks [email protected] Dynamic Branch Prediction, Speculation, and Multiple Issue Computer Science 246 David Brooks Lecture Outline • Tomasulo’s Algorithm Review (3.1-3.3) • Pointer-Based Renaming (MIPS R10000) • Dynamic Branch Prediction (3.4) • Other Front-end Optimizations (3.5) – Branch Target Buffers/Return Address Stack Computer Science 246 David Brooks Tomasulo Review • Reservation Stations – Distribute RAW hazard detection – Renaming eliminates WAW hazards – Buffering values in Reservation Stations removes WARs – Tag match in CDB requires many associative compares • Common Data Bus – Achilles heal of Tomasulo – Multiple writebacks (multiple CDBs) expensive • Load/Store reordering – Load address compared with store address in store buffer Computer Science 246 David Brooks Tomasulo Organization From Mem FP Op FP Registers Queue Load Buffers Load1 Load2 Load3 Load4 Load5 Store Load6 Buffers Add1 Add2 Mult1 Add3 Mult2 Reservation To Mem Stations FP adders FP multipliers Common Data Bus (CDB) Tomasulo Review 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 LD F0, 0(R1) Iss M1 M2 M3 M4 M5 M6 M7 M8 Wb MUL F4, F0, F2 Iss Iss Iss Iss Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb SD 0(R1), F0 Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2 M3 Wb SUBI R1, R1, 8 Iss Ex Wb BNEZ R1, Loop Iss Ex Wb LD F0, 0(R1) Iss Iss Iss Iss M Wb MUL F4, F0, F2 Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb SD 0(R1), F0 Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2
    [Show full text]
  • Dynamic Scheduling
    Dynamically-Scheduled Machines ! • In a Statically-Scheduled machine, the compiler schedules all instructions to avoid data-hazards: the ID unit may require that instructions can issue together without hazards, otherwise the ID unit inserts stalls until the hazards clear! • This section will deal with Dynamically-Scheduled machines, where hardware- based techniques are used to detect are remove avoidable data-hazards automatically, to allow ‘out-of-order’ execution, and improve performance! • Dynamic scheduling used in the Pentium III and 4, the AMD Athlon, the MIPS R10000, the SUN Ultra-SPARC III; the IBM Power chips, the IBM/Motorola PowerPC, the HP Alpha 21264, the Intel Dual-Core and Quad-Core processors! • In contrast, static multiple-issue with compiler-based scheduling is used in the Intel IA-64 Itanium architectures! • In 2007, the dual-core and quad-core Intel processors use the Pentium 5 family of dynamically-scheduled processors. The Itanium has had a hard time gaining market share.! Ch. 6, Advanced Pipelining-DYNAMIC, slide 1! © Ted Szymanski! Dynamic Scheduling - the Idea ! (class text - pg 168, 171, 5th ed.)" !DIVD ! !F0, F2, F4! ADDD ! !F10, F0, F8 !- data hazard, stall issue for 23 cc! SUBD ! !F12, F8, F14 !- SUBD inherits 23 cc of stalls! • ADDD depends on DIVD; in a static scheduled machine, the ID unit detects the hazard and causes the basic pipeline to stall for 23 cc! • The SUBD instruction cannot execute because the pipeline has stalled, even though SUBD does not logically depend upon either previous instruction! • suppose the machine architecture was re-organized to let the SUBD and subsequent instructions “bypass” the previous stalled instruction (the ADDD) and proceed with its execution -> we would allow “out-of-order” execution! • however, out-of-order execution would allow out-of-order completion, which may allow RW (Read-Write) and WW (Write-Write) data hazards ! • a RW and WW hazard occurs when the reads/writes complete in the wrong order, destroying the data.
    [Show full text]
  • United States Patent (19) 11 Patent Number: 5,680,565 Glew Et Al
    USOO568.0565A United States Patent (19) 11 Patent Number: 5,680,565 Glew et al. 45 Date of Patent: Oct. 21, 1997 (54) METHOD AND APPARATUS FOR Diefendorff, "Organization of the Motorola 88110 Super PERFORMING PAGE TABLE WALKS INA scalar RISC Microprocessor.” IEEE Micro, Apr. 1996, pp. MCROPROCESSOR CAPABLE OF 40-63. PROCESSING SPECULATIVE Yeager, Kenneth C. "The MIPS R10000 Superscalar Micro INSTRUCTIONS processor.” IEEE Micro, Apr. 1996, pp. 28-40 Apr. 1996. 75 Inventors: Andy Glew, Hillsboro; Glenn Hinton; Smotherman et al. "Instruction Scheduling for the Motorola Haitham Akkary, both of Portland, all 88.110" Microarchitecture 1993 International Symposium, of Oreg. pp. 257-262. Circello et al., “The Motorola 68060 Microprocessor.” 73) Assignee: Intel Corporation, Santa Clara, Calif. COMPCON IEEE Comp. Soc. Int'l Conf., Spring 1993, pp. 73-78. 21 Appl. No.: 176,363 "Superscalar Microprocessor Design" by Mike Johnson, Advanced Micro Devices, Prentice Hall, 1991. 22 Filed: Dec. 30, 1993 Popescu, et al., "The Metaflow Architecture.” IEEE Micro, (51) Int. C. G06F 12/12 pp. 10-13 and 63–73, Jun. 1991. 52 U.S. Cl. ........................... 395/415: 395/800; 395/383 Primary Examiner-David K. Moore 58) Field of Search .................................. 395/400, 375, Assistant Examiner-Kevin Verbrugge 395/800, 414, 415, 421.03 Attorney, Agent, or Firm-Blakely, Sokoloff, Taylor & 56 References Cited Zafiman U.S. PATENT DOCUMENTS 57 ABSTRACT 5,136,697 8/1992 Johnson .................................. 395/586 A page table walk is performed in response to a data 5,226,126 7/1993 McFarland et al. 395/.394 translation lookaside buffer miss based on a speculative 5,230,068 7/1993 Van Dyke et al.
    [Show full text]
  • Identifying Bottlenecks in a Multithreaded Superscalar
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by KITopen Identifying Bottlenecks in a Multithreaded Sup erscalar Micropro cessor Ulrich Sigmund and Theo Ungerer VIONA DevelopmentGmbH Karlstr D Karlsruhe Germany University of Karlsruhe Dept of Computer Design and Fault Tolerance D Karlsruhe Germany Abstract This pap er presents a multithreaded sup erscalar pro cessor that p ermits several threads to issue instructions to the execution units of a wide sup erscalar pro cessor in a single cycle Instructions can simul taneously b e issued from up to threads with a total issue bandwidth of instructions p er cycle Our results show that the threaded issue pro cessor reaches a throughput of instructions p er cycle Intro duction Current micropro cessors utilize instructionlevel parallelism by a deep pro cessor pip eline and by the sup erscalar technique that issues up to four instructions p er cycle from a single thread VLSItechnology will allow future generations of micropro cessors to exploit instructionlevel parallelism up to instructions p er cycle or more However the instructionlevel parallelism found in a conventional instruction stream is limited The solution is the additional utilization of more coarsegrained parallelism The main approaches are the multipro cessor chip and the multithreaded pro ces pro cessors on a sor The multipro cessor chip integrates two or more complete single chip Therefore every unit of a pro cessor is duplicated and used indep en dently of its copies on the chip
    [Show full text]
  • Advanced Processor Designs
    Advanced processor designs We’ve only scratched the surface of CPU design. Today we’ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. —The Motorola PowerPC, used in Apple computers and many embedded systems, is a good example of state-of-the-art RISC technologies. —The Intel Itanium is a more radical design intended for the higher-end systems market. April 9, 2003 ©2001-2003 Howard Huang 1 General ways to make CPUs faster You can improve the chip manufacturing technology. — Smaller CPUs can run at a higher clock rates, since electrons have to travel a shorter distance. Newer chips use a “0.13µm process,” and this will soon shrink down to 0.09µm. — Using different materials, such as copper instead of aluminum, can improve conductivity and also make things faster. You can also improve your CPU design, like we’ve been doing in CS232. — One of the most important ideas is parallel computation—executing several parts of a program at the same time. — Being able to execute more instructions at a time results in a higher instruction throughput, and a faster execution time. April 9, 2003 Advanced processor designs 2 Pipelining is parallelism We’ve already seen parallelism in detail! A pipelined processor executes many instructions at the same time. All modern processors use pipelining, because most programs will execute faster on a pipelined CPU without any programmer intervention. Today we’ll discuss some more advanced techniques to help you get the most out of your pipeline. April 9, 2003 Advanced processor designs 3 Motivation for some advanced ideas Our pipelined datapath only supports integer operations, and we assumed the ALU had about a 2ns delay.
    [Show full text]
  • Computer Architecture Out-Of-Order Execution
    Computer Architecture Out-of-order Execution By Yoav Etsion With acknowledgement to Dan Tsafrir, Avi Mendelson, Lihu Rappoport, and Adi Yoaz 1 Computer Architecture 2013– Out-of-Order Execution The need for speed: Superscalar • Remember our goal: minimize CPU Time CPU Time = duration of clock cycle × CPI × IC • So far we have learned that in order to Minimize clock cycle ⇒ add more pipe stages Minimize CPI ⇒ utilize pipeline Minimize IC ⇒ change/improve the architecture • Why not make the pipeline deeper and deeper? Beyond some point, adding more pipe stages doesn’t help, because Control/data hazards increase, and become costlier • (Recall that in a pipelined CPU, CPI=1 only w/o hazards) • So what can we do next? Reduce the CPI by utilizing ILP (instruction level parallelism) We will need to duplicate HW for this purpose… 2 Computer Architecture 2013– Out-of-Order Execution A simple superscalar CPU • Duplicates the pipeline to accommodate ILP (IPC > 1) ILP=instruction-level parallelism • Note that duplicating HW in just one pipe stage doesn’t help e.g., when having 2 ALUs, the bottleneck moves to other stages IF ID EXE MEM WB • Conclusion: Getting IPC > 1 requires to fetch/decode/exe/retire >1 instruction per clock: IF ID EXE MEM WB 3 Computer Architecture 2013– Out-of-Order Execution Example: Pentium Processor • Pentium fetches & decodes 2 instructions per cycle • Before register file read, decide on pairing Can the two instructions be executed in parallel? (yes/no) u-pipe IF ID v-pipe • Pairing decision is based… On data
    [Show full text]
  • CSE 586 Computer Architecture Lecture 4
    Highlights from last week CSE 586 Computer Architecture • ILP: where can the compiler optimize – Loop unrolling and software pipelining – Speculative execution (we’ll see predication today) Lecture 4 • ILP: Dynamic scheduling in a single issue machine – Scoreboard Jean-Loup Baer – Tomasulo’s algorithm http://www.cs.washington.edu/education/courses/586/00sp CSE 586 Spring 00 1 CSE 586 Spring 00 2 Highlights from last week (c’ed) -- Highlights from last week (c’ed) – Scoreboard Tomasulo’s algorithm • The scoreboard keeps a record of all data dependencies • Decentralized control • The scoreboard keeps a record of all functional unit • Use of reservation stations to buffer and/or rename occupancies registers (hence gets rid of WAW and WAR hazards) • The scoreboard decides if an instruction can be issued • Results –and their names– are broadcast to reservations • The scoreboard decides if an instruction can store its result stations and register file • Implementation-wise, scoreboard keeps track of which • Instructions are issued in order but can be dispatched, registers are used as sources and destinations and which executed and completed out-of-order functional units use them CSE 586 Spring 00 3 CSE 586 Spring 00 4 Highlights from last week (c’ed) Multiple Issue Alternatives • Register renaming: avoids WAW and WAR hazards • Superscalar (hardware detects conflicts) • Is performed at “decode” time to rename the result register – Statically scheduled (in order dispatch and hence execution; cf DEC Alpha 21164) • Two basic implementation schemes – Dynamically scheduled (in order issue, out of order dispatch and – Have a separate physical register file execution; cf MIPS 10000, IBM Power PC 620 and Intel Pentium – Use of reorder buffer and reservation stations (cf.
    [Show full text]
  • Superscalar Instruction Dispatch with Exception Handling
    Superscalar Instruction Dispatch With Exception Handling Design and Testing Report Joseph McMahan December 2013 Contents 1 Design 3 1.1 Overview . ............. ............. 3 1.1.1 The Tomasulo Algorithm . ............. 4 1.1.2 Two-Phase Cloc"ing .......... ............. 4 1.1.3 #esolving Data Hazards . ............. ' 1.1.4 (oftware Interface............ ............. + 1.2 High ,evel Design ............. ............. 10 1.3 ,ow ,evel Design ............. ............. 12 1.3.1 Buses . ............. ............. 12 1.3.2 .unctional /nits ............. ............ 13 1.3.3 Completion File ............. ............. 14 1.3.4 Bus !0cles . ............. ............. 17 1.3.' *nstruction Deco&e an& Dispatch . ............. 18 2 Testing 23 2.1 Tests . ............. ............. 23 2.1.1 ,oa& Data2 Ad& ............. ............. 23 2.1.2 #esource Hazard3 Onl0 One (tation Available ......... 24 2.1.3 Multipl0 . ............. ............. 26 2.1.4 #esource Hazard3 Two Instructions o) (ame Type . 26 2.1.' Data Hazar&3 #A5........... ............. 29 2.1.4 Data Hazar&3 5A5 .......... ............. 31 2.1.+ Data Hazar&3 5AR . ............. 31 2.1.1 Memor0 Hazar&3 5AR . ............. 35 2.1.6 #esource Hazard3 No (tations Available ............ 3' 2.1.10 (tall on .ull Completion File . ............ 31 2.1.11 Han&le 89ception............ ............. 40 1 3 Code 43 3.1 Mo&ules . ............. ............. 43 3.1.1 rstation.v . ............. ............. 43 3.1.2 reg.v . ............. ............. 49 3.1.3 fetch.v . ............. ............. 53 3.1.4 store.v . ............. ............. '6 3.1.' mem-unit.v . ............. ............. 63 3.1.4 ID.v . ............. ............. 70 3.1.+ alu.v . ............. ............. 84 3.1.1 alu-unit.v . ............. ............. 14 3.1.6 mult.v . ............. ............. 90 3.1.10 mp-unit.v . ............. ............. 92 3.1.11 cf.v .
    [Show full text]
  • MIPS Architecture with Tomasulo Algorithm [12]
    CALIFORNIA STATE UNIVERSITY NORTHRIDGE TOMASULO ARCHITECTURE BASED MIPS PROCESSOR A graduate project submitted in partial fulfilment for the requirement Of the degree of Master of Science In Electrical Engineering By Sameer S Pandit May 2014 The graduate project of Sameer Pandit is approved by: ______________________________________ ____________ Dr. Ali Amini, Ph.D. Date _______________________________________ ____________ Dr. Shahnam Mirzaei, Ph.D. Date _______________________________________ ____________ Dr. Ramin Roosta, Ph.D., Chair Date California State University Northridge ii ACKNOWLEDGEMENT I would like to express my gratitude towards all the members of the project committee and would like to thank them for their continuous support and mentoring me for every step that I have taken towards the completion of this project. Dr. Roosta, for his guidance, Dr. Shahnam for his ideas, and Dr. Ali Amini for his utmost support. I would also like to thank my family and friends for their love, care and support through all the tough times during my graduation. iii Table of Contents SIGNATURE PAGE .......................................................................................................... ii ACKNOWLEDGEMENT ................................................................................................. iii LIST OF FIGURES .......................................................................................................... vii ABSTRACT .......................................................................................................................
    [Show full text]
  • In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (Tlbs)
    IEEE TRANSACTIONS ON COMPUTERS, VOL. 55, NO. 5, MAY 2006 1 In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (TLBs) Aamer Jaleel, Student Member, IEEE, and Bruce Jacob, Member, IEEE Abstract—The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructions. In doing so, the CPU ends up flushing many instructions that have been brought in to the reorder buffer. In particular, these instructions may have reached a very deep stage in the pipeline—representing significant work that is wasted. In addition, an overhead of several cycles and wastage of energy (per exception detected) can be expected in refetching and reexecuting the instructions flushed. This paper concentrates on improving the performance of precisely handling software managed translation look-aside buffer (TLB) interrupts, one of the most frequently occurring interrupts. The paper presents a novel method of in-lining the interrupt handler within the reorder buffer. Since the first level interrupt-handlers of TLBs are usually small, they could potentially fit in the reorder buffer along with the user-level code already there. In doing so, the instructions that would otherwise be flushed from the pipe need not be refetched and reexecuted. Additionally, it allows for instructions independent of the exceptional instruction to continue to execute in parallel with the handler code. By in-lining the TLB interrupt handler, this provides lock-up free TLBs.
    [Show full text]
  • Improving the Precise Interrupt Mechanism of Software- Managed TLB Miss Handlers
    Improving the Precise Interrupt Mechanism of Software- Managed TLB Miss Handlers Aamer Jaleel and Bruce Jacob Electrical & Computer Engineering University of Maryland at College Park {ajaleel,blj}@eng.umd.edu Abstract. The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructions. In doing so, the CPU ends up flushing many instructions that have been brought in to the reorder buffer. In par- ticular, many of these instructions have reached a very deep stage in the pipeline - representing significant work that is wasted. In addition, an overhead of several cycles can be expected in re-fetching and re-executing these instructions. This paper concentrates on improving the performance of precisely handling software managed translation lookaside buffer (TLB) interrupts, one of the most frequently occurring interrupts. This paper presents a novel method of in-lining the interrupt handler within the reorder buffer. Since the first level interrupt-handlers of TLBs are usually small, they could potentially fit in the reorder buffer along with the user-level code already there. In doing so, the instructions that would otherwise be flushed from the pipe need not be re-fetched and re-executed. Additionally, it allows for instructions independent of the exceptional instruction to continue to execute in parallel with the handler code. We simulate two different schemes of in-lining the interrupt on a pro- cessor with a 4-way out-of-order core similar to the Alpha 21264.
    [Show full text]