Design and Implementation of an Asynchronous Version of the MIPS R3000 Microprocessor

Total Pages: 16

File Type: pdf, Size: 1020 KB

Rochester Institute of Technology, RIT Scholar Works, Theses, 1-1994

Design and implementation of an asynchronous version of the MIPS R3000 microprocessor

Kevin Johnson

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation: Johnson, Kevin, "Design and implementation of an asynchronous version of the MIPS R3000 microprocessor" (1994). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].

DESIGN AND IMPLEMENTATION OF AN ASYNCHRONOUS VERSION OF THE MIPS R3000 MICROPROCESSOR

by Kevin Johnson

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE in Computer Engineering

Approved by:
Graduate Advisor - George A. Brown, Professor
Roy S. Czernikowski, Professor and Department Head
Tony H. Chang, Professor

Department of Computer Engineering, College of Engineering
Rochester Institute of Technology, Rochester, New York
January 1994

THESIS RELEASE PERMISSION FORM
ROCHESTER INSTITUTE OF TECHNOLOGY, COLLEGE OF ENGINEERING
Title: Design and Implementation of an Asynchronous Version of the MIPS R3000 Microprocessor.
I, Kevin Johnson, hereby deny permission to the Wallace Memorial Library of RIT to reproduce my thesis in whole or in part.
Date:

ABSTRACT

The purpose of this thesis is to show some of the advantages of an asynchronous implementation of a major synchronous structure. Three people were involved with the design of an asynchronous microprocessor modeled after the MIPS R3000 microprocessor. This microprocessor implemented all of the MIPS reduced instruction set while eliminating the need for synchronous clocking throughout the chip. Paul Fanelli modeled the asynchronous processor using VHDL (VHSIC Hardware Description Language). Kevin Johnson created circuit-level designs and layouts of the arithmetic logic unit and supporting hardware. Scott Siers created circuit-level designs and layouts of the instruction fetch, write back, memory, and instruction decode stages.

Table Of Contents

Abstract
Table of Contents
List of Figures
List of Tables
Glossary
1. Introduction
  1.1 MIPS R3000
  1.2 Pipeline
  1.3 ALU Stage
2. Registers
  2.1 Data Dependencies
  2.2 Register Design
3. Arithmetic Logic Unit
  3.1 TTL Model
  3.2 Data Path
  3.3 Dual Purpose
  3.4 Conditional Statements
  3.5 Opcode Interfacing
4. Shifter Design
5. Multiplier/Divider in the Asynchronous Microprocessor
6. Bus Control
  6.1 Inputs
  6.2 Outputs
7. Exception Creation
8. Handshaking
  8.1 Completion Signals
  8.2 Handshaking Controller Circuit
9. Results
10. Conclusions
Bibliography

List Of Figures

Figure 1.1 Pipeline Stages
Figure 1.2 Top Level Design of the ALU Stage
Figure 2.1 Simple Method For Adding Three Numbers
Figure 2.2 Possible Paths for the Valid Bits
Figure 2.3 Program Showing Data Dependency
Figure 2.4 Petri Net for the Data Dependency Problem
Figure 2.5 Two Phase Clocking Scheme
Figure 2.6 Bus Utilization Scheme
Figure 2.7 Example of a Load Word Left Instruction
Figure 2.8 Logic Diagram of One Register Bit
Figure 2.9 Layout of a One-Bit Register
Figure 2.10 Logic Diagram of One 32-Bit Register
Figure 3.1 Schematic of the Modified ALU
Figure 3.2 Equations for Intermediate Carry Signals
Figure 3.3 Look Ahead Circuit
Figure 3.4 Schematic of a 32-bit Adder Using 4-bit Look Ahead Carry Adders
Figure 3.5 Sign Extension Circuit
Figure 3.6 Condition Analyzer
Figure 3.7 Condition Bit Generator
Figure 3.8 Instruction Interface Circuit
Figure 3.9 Bit Coding of the R3000 Instruction Set
Figure 4.1 Arithmetic and Logical Shifts
Figure 4.2 Basic Shifting Cell
Figure 4.3 Simple Connection For Shifting Cells
Figure 4.4 Circuit Diagram of a 3-1 Multiplexor
Figure 4.5 Layout of a 3-1 Multiplexor
Figure 4.6 Layout of a 2-1 Multiplexor
Figure 4.6 Example of a Driving Circuit
Figure 4.7 Analog Simulation of the Driving Circuit
Figure 6.1 Input Bus Control Block
Figure 6.2 A-Bus Selector
Figure 6.3 B-Bus Selector
Figure 6.4 Bus Selector Decoder
Figure 6.5 Output Selector
Figure 7.1 Example of Overflow Conditions
Figure 7.2 Schematic of the Overflow Logic
Figure 7.3 Layout of the Overflow Logic
Figure 8.1 DCVSL NOR Gate with Completion Signal
Figure 8.2 Paths of DCVSL Outputs
Figure 8.3 Comparison of DCVSL and Complementary CMOS Logic
Figure 8.4 Layout Comparison of DCVSL versus Complementary CMOS
Figure 8.5 Handshaking Controller Circuit
Figure 9.1 Simulation of Unsigned Addition
Figure 9.2 Simulation of a Branch Not Equal to Zero Instruction
Figure 9.3 Simulation of a Signed Instruction Causing an Exception
Figure 9.4 Simulation of a Shift Instruction
Figure 9.5 Simulation of a Move From HI Instruction

List of Tables

Table 2.1 Incorrect Scheduling of Dependent Instructions
Table 9.1 Average Times of Various Operations

Glossary

ALU: Arithmetic Logic Unit.
ALU stage: Part of the pipeline where all arithmetic operations take place.
Asynchronous: Actions do not occur at specific time periods or when another action is scheduled to take place.
CISC: Complex instruction set computer.
CMOS: Complementary metal-oxide-semiconductor.
DCVSL: Differential Cascode Voltage Switch Logic.
FPU: Floating point unit.
HCC: Handshaking Controller Circuit.
HI: 32-bit register used to hold the high 32 bits in a multiply operation or the quotient in a divide operation.
ID stage: Instruction decode stage of the pipeline, where the instruction is decoded into signals.
IF stage: Instruction fetch stage of the pipeline, where the instruction is obtained from memory.
LOW: 32-bit register used to hold the low 32 bits in a multiply operation or the remainder in a divide operation.
LWL: Load word left.
LWR: Load word right.
MEM stage: Memory stage of the pipeline, where all memory operations are performed.
MIPS: The company that designed the MIPS microprocessors, now a division of Silicon Graphics.
MIPS: Millions of instructions per second.
NMOS: N-type metal-oxide-semiconductor.
NOP: Instruction that causes nothing to happen.
PC: Program counter.
PCL: Signal name for the latched output of the program counter.
R3000: A version of the MIPS architecture used in a particular 32-bit microprocessor.
RISC: Reduced instruction set computer.
RIT: Rochester Institute of Technology.
Synchronous: Actions occur according to a deterministic schedule, e.g., at specific phases of a clock pulse.
TTL: Transistor-transistor logic.
VHDL: VHSIC Hardware Description Language.
VHSIC: Very High Speed Integrated Circuit.
VLSI: Very Large Scale Integration.
WB stage: Write back stage of the pipeline, where the data is written into the registers.

1. Introduction

The purpose of this project is to show some of the advantages of the asynchronous design methodology. The design should be large enough that the speed gained from asynchronous operation outweighs the speed lost to the extra circuitry. A thirty-two-bit microprocessor was chosen for this project.

The asynchronous design methodology eliminates the need for clocking circuits by adding handshaking circuitry. Clock lines in a microprocessor must be routed to every part of the chip, and these lines cause a problem in the operation of the chip called clock skew. The clock lines, on which the clock signal travels, spread out in many directions from the clock driver. The signal is supposed to change the state of many circuit elements throughout the chip simultaneously. Since these lines are not the same length, the clock pulses on one side of the chip may or may not be in phase with the pulses on the other side. To make sure that circuit elements on both sides of the chip change state on the same clock pulse, the period of the clock is extended, which guarantees that the clock pulse has had time to travel throughout the chip.

The clock lines and the gates driven by the clock create a large capacitive load. This capacitance must be overcome for the clock signal to change state in a reasonable amount of time, which forces the clock driver to be large. The capacitance on the line must be charged for the voltage on the line to reach a true logical-one value, and discharged for the value to return to the logical-zero state. The time needed to charge and discharge this large capacitance limits the maximum possible clock frequency. If the clock frequency is faster than the charging and discharging times allow, the clock pulses will never reach their true values but will keep fluctuating around some mid-level voltage.

Within a large clocked circuit, the clock must be slow enough that the slowest part of the circuit can complete its task before the beginning of the next clock cycle. In large designs, this minimum time between clock pulses becomes a problem. In a microprocessor, for example, this minimum time is the time it takes the slowest instruction to execute, which can be a long period because operations such as multiplication take longer than operations such as shifting.
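The trade-off described above can be illustrated with a small back-of-the-envelope sketch (Python; the per-operation delays, skew margin, and instruction mix below are made-up numbers, not figures from the thesis): a synchronous clock must be sized for the slowest operation plus a skew margin, while an asynchronous datapath can, on average, finish as soon as each individual operation completes, at the cost of handshaking overhead.

    # Illustrative only: made-up per-operation delays (ns) and instruction mix.
    op_delay_ns = {"shift": 8.0, "logic": 6.0, "add": 12.0, "multiply": 40.0}
    skew_margin_ns = 3.0   # extra time so one clock edge reaches the whole chip

    # Synchronous: every operation is charged one worst-case clock period.
    clock_period_ns = max(op_delay_ns.values()) + skew_margin_ns

    # Asynchronous: each operation signals completion after its own delay.
    mix = {"shift": 0.25, "logic": 0.20, "add": 0.45, "multiply": 0.10}
    async_avg_ns = sum(mix[op] * op_delay_ns[op] for op in mix)

    print(f"synchronous: {clock_period_ns:.1f} ns per operation (worst case for all)")
    print(f"asynchronous: {async_avg_ns:.1f} ns per operation on average (plus handshaking)")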
Recommended publications
  • Pipelining and Associated Timing Issues
    PIPELINING AND ASSOCIATED TIMING ISSUES

    Introduction: While studying sequential circuits, we learned about latches and flip-flops. While latches form the heart of a flip-flop, we have explored the use of flip-flops in applications like counters, shift registers, sequence detectors, sequence generators, and the design of finite state machines. Another important application of latches and flip-flops is in pipelining combinational/algebraic operations.

    To understand what pipelining is, consider the following example. Take a simple calculation with three operations to be performed: 1. add a and b to get (a + b), 2. take the magnitude of (a + b), and 3. evaluate log |(a + b)|. Each operation consumes a finite period of time; assume the three operations take 40 nsec, 35 nsec, and 60 nsec respectively. The process can be represented pictorially as in Fig. 1.

    Consider a situation where we need to carry this out for a set of 100 such pairs. Done one pair at a time, it would take a total of 100 * 135 = 13,500 nsec. We can reduce this time by recognizing that the whole process is sequential. Let the values to be evaluated be a1 to a100 and the corresponding values to be added be b1 to b100. We first evaluate (a1 + b1); then, while |(a1 + b1)| is being computed, the unit that evaluates the sum is dormant, so we can use it to evaluate (a2 + b2), giving us both |(a1 + b1)| and (a2 + b2) at the end of the next evaluation period.
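The timing claim in this excerpt can be checked with a short sketch (Python; it assumes ideal pipeline registers and ignores their setup time): unpipelined, every pair pays the full 135 nsec, while a 3-stage pipeline is paced by its slowest stage (60 nsec) once it is full.

    # Stage delays from the excerpt (nsec): add, magnitude, log.
    stages = [40, 35, 60]
    n_pairs = 100

    # Unpipelined: each pair finishes all three stages before the next starts.
    unpipelined = n_pairs * sum(stages)                    # 100 * 135 = 13,500 nsec

    # Pipelined: one result per period after the pipeline fills; period = slowest stage.
    period = max(stages)                                   # 60 nsec
    pipelined = (n_pairs + len(stages) - 1) * period       # (100 + 2) * 60 = 6,120 nsec

    print(unpipelined, pipelined)                          # 13500 6120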
  • 2.5 Classification of Parallel Computers
    2.5 Classification of Parallel Computers

    2.5.1 Granularity

    In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.

    • fine level (same operations with different data)
      ◦ vector processors
      ◦ instruction level parallelism
      ◦ fine-grain parallelism:
        – relatively small amounts of computational work are done between communication events
        – low computation to communication ratio
        – facilitates load balancing
        – implies high communication overhead and less opportunity for performance enhancement
        – if granularity is too fine, the overhead required for communication and synchronization between tasks can take longer than the computation itself
    • operation level (different operations simultaneously)
    • problem level (independent subtasks)
      ◦ coarse-grain parallelism:
        – relatively large amounts of computational work are done between communication/synchronization events
        – high computation to communication ratio
        – implies more opportunity for performance increase
        – harder to load balance efficiently

    2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. the Cray-1)

    With N elements in the pipeline and L clock cycles of work per element, the calculation takes L + N cycles; without a pipeline it takes L * N cycles (checked in the sketch after this excerpt). Example of good code for pipelining:

        do i = 1, k
          z(i) = x(i) + y(i)
        end do

    Vector processors provide fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but recursion, for example, is hard to optimise for vector processors. Example: Intel MMX, a simple vector processor.
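A quick check of the L + N versus L * N claim above (a sketch; the values of L and N are arbitrary examples):

    # L clock cycles of work per element, N elements to process.
    def pipeline_cycles(L, N):
        return L + N        # one new element completes per cycle once the pipe is full

    def unpipelined_cycles(L, N):
        return L * N        # each element waits for the previous one to finish entirely

    print(pipeline_cycles(L=5, N=8), unpipelined_cycles(L=5, N=8))   # 13 40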
  • Microprocessor Architecture
    EECE416 Microcomputer Fundamentals: Microprocessor Architecture. Dr. Charles Kim, Howard University.

    Computer Architecture: a computer system is a CPU (with PC, registers, and status register) plus memory. The ALU (arithmetic logic unit) is built around the binary full adder, and the microprocessor bus connects the parts.

    Architecture by CPU + memory organization:
    • Princeton (or von Neumann) architecture: memory contains both instructions and data.
    • Harvard architecture: separate data memory and instruction memory; higher performance, better for DSP, higher memory bandwidth.

    Princeton architecture operation:
    Step (A): the address of the instruction to be executed next is applied to memory.
    Step (B): the controller "decodes" the instruction.
    Step (C): following completion of the instruction, the controller provides the address, to the memory unit, at which the data result generated by the operation will be stored.

    Internal memory ("registers"): external memory access is very slow; internal registers provide quicker retrieval and storage.

    Architecture by instructions and their execution:
    • CISC (Complex Instruction Set Computer): a variety of instructions for complex tasks; instructions of varying length.
    • RISC (Reduced Instruction Set Computer): fewer and simpler instructions; high-performance microprocessors; pipelined instruction execution (several instructions are executed in parallel).

    CISC: the architecture of choice prior to the mid-1980s (IBM 390, Motorola 680x0, Intel 80x86). A basic fetch-execute sequence supports a large number of complex instructions, with complex decoding procedures and a complex control unit; one instruction achieves a complex task.
  • SPIM S20: a MIPS R2000 Simulator∗
    SPIM S20: A MIPS R2000 Simulator ("1/25th the performance at none of the cost")

    James R. Larus, [email protected], Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, WI 53706, USA. 608-262-9519. Copyright (c) 1990-1997 by James R. Larus. (This document may be copied without royalties, so long as this copyright notice remains on it.)

    1 SPIM

    SPIM S20 is a simulator that runs programs for the MIPS R2000/R3000 RISC computers. SPIM can read and immediately execute files containing assembly language. SPIM is a self-contained system for running these programs and contains a debugger and an interface to a few operating system services. The architecture of the MIPS computers is simple and regular, which makes it easy to learn and understand. The processor contains 32 general-purpose 32-bit registers and a well-designed instruction set that make it a propitious target for generating code in a compiler.

    However, the obvious question is: why use a simulator when many people have workstations that contain a hardware, and hence significantly faster, implementation of this computer? One reason is that these workstations are not generally available. Another reason is that these machines will not persist for many years because of the rapid progress leading to new and faster computers. Unfortunately, the trend is to make computers faster by executing several instructions concurrently, which makes their architecture more difficult to understand and program. The MIPS architecture may be the epitome of a simple, clean RISC machine. In addition, simulators can provide a better environment for low-level programming than an actual machine because they can detect more errors and provide more features than an actual computer.
  • Instruction Pipelining (1 of 7)
    Chapter 5: A Closer Look at Instruction Set Architectures

    Objectives
    • Understand the factors involved in instruction set architecture design.
    • Gain familiarity with memory addressing modes.
    • Understand the concepts of instruction-level pipelining and its effect upon execution performance.

    5.1 Introduction
    • This chapter builds upon the ideas in Chapter 4.
    • We present a detailed look at different instruction formats, operand types, and memory access methods.
    • We will see the interrelation between machine organization and instruction formats.
    • This leads to a deeper understanding of computer architecture in general.

    5.2 Instruction Formats
    • Instruction sets are differentiated by the following:
      – Number of bits per instruction.
      – Stack-based or register-based.
      – Number of explicit operands per instruction.
      – Operand location.
      – Types of operations.
      – Type and size of operands.
    • Instruction set architectures are measured according to:
      – Main memory space occupied by a program.
      – Instruction complexity.
      – Instruction length (in bits).
      – Total number of instructions in the instruction set.
    • In designing an instruction set, consideration is given to:
      – Instruction length: whether short, long, or variable.
      – Number of operands.
      – Number of addressable registers.
      – Memory organization: whether byte- or word-addressable.
      – Addressing modes: choose any or all of direct, indirect, or indexed.
    • Byte ordering, or endianness, is another major architectural consideration. A two-byte integer may be stored so that the least significant byte is followed by the most significant byte, or vice versa. In little endian machines, the least significant byte is followed by the most significant byte, as illustrated in the sketch below.
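The little-endian versus big-endian point can be made concrete with Python's struct module (an illustration added here, not part of the slides):

    import struct

    value = 0x1234                       # a two-byte integer
    little = struct.pack("<H", value)    # least significant byte first
    big = struct.pack(">H", value)       # most significant byte first

    print(little.hex())                  # 3412 -> 0x34 stored first, then 0x12
    print(big.hex())                     # 1234 -> 0x12 stored first, then 0x34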
  • Review of Computer Architecture
    Basic Computer Architecture. CSCE 496/896: Embedded Systems, Witawas Srisa-an. Review of Computer Architecture. Credit: most of the slides were made by Prof. Wayne Wolf, the author of the textbook; some modifications were made for clarity. Assumes some background from CSCE 430 or equivalent.

    von Neumann architecture: memory holds data and instructions. The central processing unit (CPU) fetches instructions from memory. The separation of CPU and memory distinguishes a programmable computer. CPU registers help out: program counter (PC), instruction register (IR), general-purpose registers, etc. [Diagram: von Neumann architecture with a memory unit, input and output units, and a CPU containing the control unit and ALU. CPU + memory example: the PC holds address 200, memory returns the instruction ADD r5,r1,r3, which is loaded into the IR; a toy version appears in the sketch after this excerpt.]

    Recalling pipelining: what is a potential problem with the von Neumann architecture?

    Harvard architecture: [Diagram: separate data memory and program memory, each with its own address and data connections to the CPU.]

    von Neumann vs. Harvard: Harvard can't use self-modifying code; Harvard allows two simultaneous memory fetches. Most DSPs (e.g. Blackfin from ADI) use the Harvard architecture for streaming data: greater memory bandwidth, different memory bit depths between instruction and data, more predictable bandwidth. Today's processors: Harvard or von Neumann?

    RISC vs. CISC: a complex instruction set computer (CISC) has many addressing modes and many operations; a reduced instruction set computer (RISC) is load/store with pipelinable instructions. Instruction set characteristics: fixed vs. variable length, addressing modes, number of operands, types of operands. Tensilica Xtensa: RISC based, variable length, but not CISC.

    Programming model: the registers visible to the programmer; some registers are not visible (IR). Multiple implementations: successful architectures have several implementations with varying clock speeds, different bus widths, different cache sizes, associativities, and configurations, local memory, etc.
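The fetch/decode/execute sequence and the ADD r5,r1,r3 example in these slides can be mimicked with a toy von Neumann machine (a sketch with a hypothetical instruction encoding; the register values are made up):

    # Toy von Neumann machine: a single memory holds both instructions and data.
    memory = {
        200: ("ADD", 5, 1, 3),   # instruction at address 200: r5 <- r1 + r3
        300: 7,                  # data lives in the same memory
    }
    regs = {1: 10, 3: 32, 5: 0}
    pc = 200

    ir = memory[pc]              # fetch: the instruction register receives memory[PC]
    op, rd, rs, rt = ir          # decode
    if op == "ADD":              # execute
        regs[rd] = regs[rs] + regs[rt]
    pc += 4                      # point at the next word-aligned instruction

    print(regs[5])               # 42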
  • MIPS IV Instruction Set
    MIPS IV Instruction Set, Revision 3.2, September 1995. Charles Price, MIPS Technologies, Inc. All Rights Reserved.

    RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure of the technical data contained in this document by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 52.227-7013 and/or in similar or successor clauses in the FAR, or in the DOD or NASA FAR Supplement. Unpublished rights reserved under the Copyright Laws of the United States. Contractor/manufacturer is MIPS Technologies, Inc., 2011 N. Shoreline Blvd., Mountain View, CA 94039-7311.

    R2000, R3000, R6000, R4000, R4400, R4200, R8000, R4300 and R10000 are trademarks of MIPS Technologies, Inc. MIPS and R3000 are registered trademarks of MIPS Technologies, Inc.

    The information in this document is preliminary and subject to change without notice. MIPS Technologies, Inc. (MTI) reserves the right to change any portion of the product described herein to improve function or design. MTI does not assume liability arising out of the application or use of any product or circuit described herein.

    Information on MIPS products is available electronically:
    (a) Through the World Wide Web. Point your WWW client to: http://www.mips.com
    (b) Through ftp from the internet site "sgigate.sgi.com". Login as "ftp" or "anonymous" and then cd to the directory "pub/doc".
    (c) Through an automated FAX service: inside the USA toll free: (800) 446-6477 (800-IGO-MIPS); outside the USA: (415) 688-4321 (call from a FAX machine). MIPS Technologies, Inc.
  • MIPS Architecture
    Spring 2011, Prof. Hyesoon Kim. MIPS Architecture.

    • MIPS (Microprocessor without Interlocked Pipeline Stages)
    • MIPS Computer Systems Inc.; developed from Stanford
    • MIPS architecture usages:
      – 1990's: R2000, R3000, R4000 (alongside the Motorola 68000 family)
      – PlayStation, PlayStation 2, Sony PSP handheld, Nintendo 64 console
      – Android
      – Shift to SOC
      http://en.wikipedia.org/wiki/MIPS_architecture
    • MIPS R4000 CPU core
      – Floating point and vector floating point co-processors
      – 3D-CG extended instruction sets
      – Graphics: 3D curved surfaces and other 3D functionality; hardware clipping, compressed texture handling
      – R4300 (embedded version): Nintendo 64
      http://www.digitaltrends.com/gaming/sony-announces-playstation-portable-specs/
    • Not yet out: Google TV, an Android-based software service that lets users switch between their TV content and Web applications such as Netflix and Amazon Video on Demand
      – Google TV: search capabilities
      – High stream data? Internet accesses?
      – Multi-threading, SMP design
      – High graphics processors
      – Several CODECs: hardware vs. software
      – Displaying the frame buffer, e.g. 1080p resolution: 1920 (H) x 1080 (V), color depth 4 bytes/pixel; 4 * 1920 * 1080 ~= 8.3 MB; 8.3 MB * 60 Hz = 498 MB/sec (see the sketch after this list)
    • Started from 32-bit, later 64-bit
    • microMIPS: 16-bit compression version (similar to ARM Thumb)
    • SIMD additions, 64-bit floating point
    • User Defined Instructions (UDIs), coprocessors
    • All self-modified code
    • Allow unaligned accesses
      http://www.spiritus-temporis.com/mips-architecture/
    • 32 64-bit general purpose registers (GPRs)
    • A pair of special-purpose registers to hold the results of integer multiply, divide, and multiply-accumulate operations (HI and LO)
      – HI: multiply and divide register, higher result
      – LO: multiply and divide register, lower result
    • A special-purpose program counter (PC)
    • A MIPS64 processor always produces a 64-bit result
    • 32 floating point registers (FPRs)
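A quick check of the frame-buffer arithmetic quoted in the slide (a sketch; it simply reproduces the 1080p numbers above):

    width, height = 1920, 1080        # 1080p resolution
    bytes_per_pixel = 4               # color depth from the slide
    refresh_hz = 60

    frame_bytes = width * height * bytes_per_pixel        # 8,294,400 bytes
    frame_mb = frame_bytes / 1e6                           # ~8.3 MB per frame
    bandwidth_mb_per_s = frame_mb * refresh_hz             # ~498 MB/sec

    print(round(frame_mb, 1), round(bandwidth_mb_per_s))   # 8.3 498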
  • Computer Organization EECC 550
    Computer Organization EECC 550

    Week 1: Introduction: Modern Computer Design Levels, Components, Technology Trends, Register Transfer Notation (RTN). [Chapters 1, 2] Instruction Set Architecture (ISA) Characteristics and Classifications: CISC vs. RISC. [Chapter 2]
    Week 2: MIPS: An Example RISC ISA. Syntax, Instruction Formats, Addressing Modes, Encoding & Examples. [Chapter 2] Central Processor Unit (CPU) & Computer System Performance Measures. [Chapter 4]
    Week 3: CPU Organization: Datapath & Control Unit Design. [Chapter 5]
    Week 4: MIPS Single Cycle Datapath & Control Unit Design. MIPS Multicycle Datapath and Finite State Machine Control Unit Design.
    Week 5: Microprogrammed Control Unit Design. [Chapter 5] Microprogramming Project.
    Week 6: Midterm Review and Midterm Exam.
    Week 7: CPU Pipelining. [Chapter 6] The Memory Hierarchy: Cache Design & Performance. [Chapter 7]
    Week 8: The Memory Hierarchy: Main & Virtual Memory. [Chapter 7]
    Week 9: Input/Output Organization & System Performance Evaluation. [Chapter 8]
    Week 10: Computer Arithmetic & ALU Design. [Chapter 3] If time permits.
    Week 11: Final Exam.

    (EECC550 - Shaaban, Lec #1, Winter 2005, 11-29-2005)

    Computing System History/Trends + Instruction Set Architecture (ISA) Fundamentals
    • Computing Element Choices: Computing Element Programmability; Spatial vs. Temporal Computing; Main Processor Types/Applications
    • General Purpose Processor Generations
    • The Von Neumann Computer Model
    • CPU Organization (Design)
    • Recent Trends in Computer Design/Performance
    • Hierarchy
  • CS152: Computer Systems Architecture Pipelining
    CS152: Computer Systems Architecture. Pipelining. Sang-Woo Jun, Winter 2021. A large amount of material adapted from MIT 6.004 "Computation Structures", Morgan Kaufmann "Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition", and CS 152 slides by Isaac Scherson.

    Eight great ideas
    ❑ Design for Moore's Law
    ❑ Use abstraction to simplify design
    ❑ Make the common case fast
    ❑ Performance via parallelism
    ❑ Performance via pipelining
    ❑ Performance via prediction
    ❑ Hierarchy of memories
    ❑ Dependability via redundancy

    But before we start... Performance Measures
    ❑ Two metrics when designing a system:
      1. Latency: the delay from when an input enters the system until its associated output is produced
      2. Throughput: the rate at which inputs or outputs are processed
    ❑ The metric to prioritize depends on the application
      o Embedded system for airbag deployment? Latency
      o General-purpose processor? Throughput

    Performance of Combinational Circuits
    ❑ For combinational logic: latency = tPD, throughput = 1/tPD
      [Diagram: X feeds F(X) and G(X), which feed H(X) to produce Y; while H works, F and G are not doing work, just holding output data. Is this an efficient way of using hardware?] (Source: MIT 6.004 2019 L12)

    Pipelined Circuits
    ❑ Pipelining by adding registers to hold F and G's output
      o Now F & G can be working on input Xi+1 while H is performing computation on Xi: a 2-stage pipeline!
      o For input X during clock cycle j, the corresponding output is emitted during clock j+2, assuming ideal registers.
      [Diagram: the same circuit with stage latencies of 15 (F), 20 (G), and 25 (H).] (Source: MIT 6.004 2019 L12)

                              Latency                 Throughput
      Unpipelined             20 + 25 = 45            1/45
      2-stage pipelined       25 + 25 = 50 (Worse!)   1/25 (Better!)
      (Source: MIT 6.004 2019 L12; see the sketch after this excerpt)

    Pipeline conventions
    ❑ Definition: a well-formed K-stage pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output.
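The latency/throughput table above can be reproduced with a short sketch (Python; it assumes ideal registers, exactly as the slides do, with F, G, and H delays of 15, 20, and 25):

    # Stage delays: F (15) and G (20) run in parallel, both feeding H (25).
    f, g, h = 15, 20, 25

    # Combinational (unpipelined): critical path is the slower of F/G, then H.
    comb_latency = max(f, g) + h               # 20 + 25 = 45
    comb_throughput = 1 / comb_latency         # one result every 45 time units

    # 2-stage pipeline: a register after F and G; the clock period is the slowest stage.
    period = max(max(f, g), h)                 # 25
    pipe_latency = 2 * period                  # 25 + 25 = 50 (worse)
    pipe_throughput = 1 / period               # 1/25 (better)

    print(comb_latency, pipe_latency)          # 45 50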
  • IDT79R4600 and IDT79R4700 RISC Processor Hardware User's Manual
    IDT79R4600™ and IDT79R4700™ RISC Processor Hardware User's Manual, Revision 2.0, April 1995. Integrated Device Technology, Inc.

    Integrated Device Technology, Inc. reserves the right to make changes to its products or specifications at any time, without notice, in order to improve design or performance and to supply the best possible product. IDT does not assume any responsibility for use of any circuitry described other than the circuitry embodied in an IDT product. IDT makes no representations that circuitry described herein is free from patent infringement or other rights of third parties which may result from its use. No license is granted by implication or otherwise under any patent, patent rights, or other rights of Integrated Device Technology, Inc.

    LIFE SUPPORT POLICY: Integrated Device Technology's products are not authorized for use as critical components in life support devices or systems unless a specific written agreement pertaining to such intended use is executed between the manufacturer and an officer of IDT.
    1. Life support devices or systems are devices or systems that (a) are intended for surgical implant into the body, or (b) support or sustain life, and whose failure to perform, when properly used in accordance with instructions for use provided in the labeling, can be reasonably expected to result in a significant injury to the user.
    2. A critical component is any component of a life support device or system whose failure to perform can be reasonably expected to cause the failure of the life support device or system, or to affect its safety or effectiveness.
  • Threading, SIMD and MIMD in the Multicore Context: The UltraSPARC T2
    Overview
    ● (note: Tute 02 this Weds - handouts)
    ● Flynn's Taxonomy
    ● multicore architecture concepts
      ■ hardware threading
      ■ SIMD vs MIMD in the multicore context
    ● T2: design features for multicore
      ■ system on a chip
      ■ execution: (in-order) pipeline, instruction latency
      ■ thread scheduling
      ■ caches: associativity, coherence, prefetch
      ■ memory system: crossbar, memory controller
      ■ intermission
      ■ speculation; power savings
      ■ OpenSPARC
    ● T2 performance (why the T2 is designed as it is)
    ● the Rock processor (slides by Andrew Over; ref: Tremblay, IEEE Micro 2009)

    SIMD and MIMD in the Multicore Context
    Flynn's Taxonomy:
                            Single Instruction    Multiple Instruction
      Single Data           SISD                  MISD
      Multiple Data         SIMD                  MIMD
    ● for SIMD, the control unit and processor state (registers) can be shared
    ● however, SIMD is limited to data parallelism (through multiple ALUs)
      ■ algorithms need a regular structure, e.g. dense linear algebra, graphics
      ■ SSE2, Altivec, Cell SPE (128-bit registers); e.g. 4×32-bit add (see the sketch after this excerpt):
          Rx: x3 x2 x1 x0
          Ry: y3 y2 y1 y0   (+)
          Rz: z3 z2 z1 z0   (zi = xi + yi)
      ■ design requires massive effort; requires support from a commodity environment
      ■ massive parallelism (e.g. nVidia GPGPU) but memory is still a bottleneck
    ● multicore (CMT) is MIMD; hardware threading can be regarded as MIMD
      ■ higher hardware costs; also, larger shared resources (caches, TLBs) are needed ⇒ less parallelism than for SIMD

    Hardware (Multi)threading
    ● recall concurrent execution on a single CPU: switching between threads (or processes) requires the saving (in memory) of thread state (register values)
      ■ motivation: utilize the CPU better when a thread is stalled for I/O (6300 Lect O1, p9-10)
      ■ what are the costs? do the same for smaller stalls? (e.g.

    The UltraSPARC T2: System on a Chip
    ● OpenSparc Slide Cast Ch 5: p79-81, 89
    ● aggressively multicore: 8 cores, each with 8-way hardware threading (64 virtual CPUs)

    (COMP8320 Lecture 2: Multicore Architecture and the T2, 2011)
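The 4×32-bit SIMD add sketched in the excerpt (zi = xi + yi across four packed lanes) amounts to one elementwise addition; a scalar Python sketch of the same operation (the lane values are arbitrary):

    # One 128-bit SIMD register viewed as four 32-bit lanes; the add is per lane.
    rx = [4, 3, 2, 1]            # lanes x3, x2, x1, x0
    ry = [40, 30, 20, 10]        # lanes y3, y2, y1, y0

    rz = [(x + y) & 0xFFFFFFFF for x, y in zip(rx, ry)]   # zi = xi + yi, 32-bit wraparound

    print(rz)                    # [44, 33, 22, 11]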