Instruction Execution and Pipelining

Instruction Execution

• Simple fetch-decode-execute cycle:
  1. Get address of next instruction from PC
  2. Fetch next instruction into IR
  3. Change PC
  4. Determine instruction type (add, shift, …)
  5. If instruction has an operand in memory, fetch it into a register
  6. Execute instruction, storing the result
  7. Go to step 1

Indirect Cycle

• Instruction execution may require memory access to fetch operands
• Memory fetch may use indirect addressing
  – Actual address to be fetched is in memory
  – Requires more memory accesses
  – Can be thought of as an additional instruction subcycle

Basic Instruction Cycle States

• Depends on CPU design; a typical set of states is shown below
[Figure: basic instruction cycle state diagram]

Data Flow (Instruction Fetch)

• Fetch (step 2 above)
  – PC contains address of next instruction
  – Address moved to MAR
  – Address placed on address bus
  – Control unit requests memory read
  – Result placed on data bus, copied to MBR, then to IR
  – Meanwhile PC incremented by 1 instruction word
    • Typically byte addressable, so PC incremented by 4 for a 32-bit ISA
[Figure: instruction-fetch data flow, shown for a 3-bus architecture; the buses are combined in a 1-bus architecture]

Data Flow (Data Fetch)

• IR is examined
• If indirect addressing is used, the indirect cycle is performed:
  – Rightmost bits of MBR transferred to MAR
  – Control unit requests memory read
  – Result (address of operand) moved to MBR
[Figure: indirect-cycle data flow diagram]

Data Flow (Execute)

• May take many forms
• Depends on instruction being executed
• May include
  – Memory read/write
  – Input/output
  – Register transfers
  – ALU operations

Data Path [1]

• Data path cycle: two operands go from registers through the ALU, and the result is stored (the faster the better)
• Data path cycle time: time to accomplish the above
• Width of the data path is a major determining factor in performance

Data Path [2]

[Figure: data path diagram]

RISC and CISC Based Processors

• RISC (Reduced Instruction Set Computer) vs. CISC (Complex Instruction Set Computer)
  – "War" started in the late 1970s
  – RISC:
    • Simple interpret step, so each instruction is faster
    • Issues instructions more quickly
  – Basic concept: if RISC needs 5 instructions to do what 1 CISC instruction does, but 1 CISC instruction takes 10 times longer than 1 RISC instruction, then RISC is faster

RISC vs. CISC Debate

• CISC & RISC both improved during the 1980s
  – CISC driven primarily by technology
  – RISC driven by technology & architecture
• RISC has basically won
  – Improved by over 50% / year
  – Took advantage of new technology
  – Incorporated new architectural features (e.g., instruction-level parallelism (ILP), superscalar concepts)
• Compilers have become a critical component of performance

More on RISC & CISC

• RISC-CISC debate largely irrelevant now
  – Concepts are similar
  – Modern RISC has a "complicated" ISA
  – CISC leader (Intel x86 family) now has a large RISC component
• Heavy reliance on compiler technology

Prefetching

Prefetch

• Fetch accesses main memory
• Execution usually does not access main memory
• Can fetch the next instruction during execution of the current instruction
• Called instruction prefetch

Improved Performance

• But performance is not doubled:
  – Fetch is usually shorter than the complete execution of one instruction
  – Any jump or branch means that the prefetched instructions are not the required instructions
• Prefetch more than one instruction?
  – If fetch takes longer than execute, then cache (will discuss later) saves the day!

Prefetching, Waiting & Branching

[Figure: timing of prefetching, waiting, and branching]

Pipelining

• Add more stages to improve performance
• For example, 6 stages:
  – Fetch instruction (FI) (memory access)
  – Decode instruction (DI)
  – Calculate operands (CO)
  – Fetch operands (FO) (possible memory access)
  – Execute instruction (EI)
  – Write operand (result) (WO) (possible memory access)
• Overlap these stages

Timing of Pipeline

[Figure: timing diagram of the 6-stage pipeline]
• Again, cache saves the day!

Important Points

• An instruction pipeline passes instructions through a series of stages, each performing some part of the instruction.
• If the speeds of two processors, one with a pipeline and one without, are the same, the pipelined architecture will not improve the overall time required to execute one instruction.
• It does, however, give the pipelined architecture a higher throughput (number of instructions processed per second).

Execution Time

• Assume a pipelined instruction processor has 4 stages, and the maximum times required in the stages are 10, 11, 10, and 12 nanoseconds, respectively.
• How long will the processor take to perform each stage? 12 nanoseconds (the slowest stage sets the clock period)
• How long does it take for one instruction to complete execution? 4 × 12 = 48 nanoseconds
• How long will it take for ten instructions to complete execution? (48 + 9 × 12) nanoseconds = 156 nanoseconds, i.e., 15.6 nanoseconds/instruction
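
The Execution Time arithmetic above generalizes: the slowest stage sets a common clock, the first instruction takes (number of stages × clock) to fill the pipeline, and every later instruction completes one clock apart. A minimal Python sketch of that calculation (an illustration added here, not part of the original slides):

```python
# Ideal (stall-free) pipeline timing, following the Execution Time example.
# Assumption: every stage runs at the clock of the slowest stage.

def pipeline_time_ns(stage_times, n_instructions):
    """Total time for n_instructions on an ideal pipeline, in ns."""
    clock = max(stage_times)                 # slowest stage sets the clock
    fill = len(stage_times) * clock          # time for the first instruction
    drain = (n_instructions - 1) * clock     # one completion per clock after
    return fill + drain

stages = [10, 11, 10, 12]                    # ns, from the example above
print(pipeline_time_ns(stages, 1))           # 48 ns for one instruction
print(pipeline_time_ns(stages, 10))          # 156 ns for ten instructions
print(pipeline_time_ns(stages, 10) / 10)     # 15.6 ns per instruction
```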

4-Stage Pipeline

• Another example (with interstage buffers):
  – Fetch instruction (F) (memory access)
  – Decode instruction and fetch operands (D)
  – Execute instruction (E) (possible memory access)
  – Write operand (result) (W)

Some Potential Causes of Pipeline "Stalls"

• What might cause the pipeline to "stall"?
  – One stage takes too long, for example:
    • Different execution times (data hazard)
    • Cache miss (control hazard)
    • Memory unit busy (structural hazard)
  – Branch to a new location (control hazard)
  – Call a subroutine (control hazard)

Long Stages – Execution [1]

• For a variety of reasons, some instructions take longer to execute than others (e.g., add vs. divide) ⇒ stall
[Figure: pipeline timing diagram over 9 clock cycles; instructions I1–I5 pass through F/D/E/W stages, and I2's long execute stage stalls the instructions behind it for 2 cycles – a data hazard example]

Long Stages – Execution [2]

• Solution:
  – Subdivide the long execute stage into multiple stages
  – Provide multiple execution hardware units so that multiple instructions can be in the execute stage at the same time
  – Produces one instruction completion every cycle after the pipeline is full

Long Stages – Cache Miss

• Delay in getting the next instruction due to a cache miss (will discuss later)
• Decode, Execute & Write stages become idle – stall (control hazard example)

Long Stages – Memory Access Conflict

• Two different instructions require access to memory at the same time (structural hazard)
• Load X(R1),R2 - place contents of memory location X+R1 into R2
[Figure: I2 is writing to R2 when I3 wants to write to a register, so the pipeline stalls]

Conditional Branch in a Pipeline

[Figure: instruction 3 is a conditional branch to instruction 15; the instructions fetched behind it must be discarded – a control hazard]

Six Stage Instruction Pipeline

[Figure: six-stage instruction pipeline flowchart; note the unconditional branch logic]
• Important architectural point: when is the PC changed?

Data Dependencies in a Pipeline

r1 = r2 + r3
r4 = r1 - r3

• Pipeline stalls until r1 has been written (RAW hazard)

Alternative Pipeline Depiction

[Figure: the same r1/r4 example shown in an alternative pipeline diagram]

Example of Avoiding Stalls

[Figure: instruction sequence (a) rearranged into sequence (b)]
• Stalls eliminated by rearranging (a) to (b)

RAW Hazard: Forwarding

• Hardware optimization to avoid a stall
• Allows the ALU to reference the result in the next instruction
• Example:
  – Instruction K: C ← add A B
  – Instruction K+1: D ← subtract E C
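
To make the hazard concrete, here is a small Python sketch (ours, not from the slides) that detects the RAW dependence in the example above; the (dest, src1, src2) instruction format is an assumption for illustration only:

```python
# RAW-hazard check for two back-to-back instructions.
# An instruction is modeled as a (dest, src1, src2) tuple of register
# names; this format is assumed for illustration, not taken from the deck.

def raw_hazard(producer, consumer):
    """True if consumer reads a register that producer writes."""
    dest, _, _ = producer
    _, src1, src2 = consumer
    return dest in (src1, src2)

k  = ("C", "A", "B")    # instruction K:   C <- add A B
k1 = ("D", "E", "C")    # instruction K+1: D <- subtract E C

if raw_hazard(k, k1):
    # Without forwarding, the pipeline stalls until C is written back;
    # with forwarding, K's ALU output is routed straight to K+1's ALU input.
    print("RAW hazard: K+1 reads", k[0], "before write-back")
```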

Unconditional Branches: Instruction Queue & Prefetching

[Figure: fetch unit with an instruction queue prefetching past an unconditional branch]

Conditional Branches: Delayed Branch

[Figure: delayed-branch example]
• With more stages, more instructions need to be reordered.

Conditional Branches: Dynamic Prediction

[Figure: dynamic branch prediction example]

Pipelining Speedup Factors [1]

[Figure: speedup factors, without branches]
• Never get a speedup better than the number of stages in the pipeline

Pipelining Speedup Factors [2]

[Figure: speedup factors, with branches]
• Never get a speedup better than the number of instructions executed without a branch

Classic 5-Stage RISC Pipeline

• Instruction Fetch (IF) (memory access)
• Instruction Decode & Register Fetch (ID)
• Execute / Effective Address (load-store) (EX)
• Memory Access (MEM) (possible memory access)
• Write-back (to register) (WB)

Another 5-Stage Pipeline

• Instruction Fetch (IF) (memory access)
• Instruction Decode (ID)
• Operand Fetch (OF) (possible memory access)
• Execute Instruction (EX)
• Write-back (WB) (possible memory access)

Achieving Maximum Speed

• A program must be written to accommodate the instruction pipeline
• To minimize stalls, for example:
  – Avoid introducing unnecessary branches
  – Delay references to result register(s)

More About Pipelines

• Although hardware that uses an instruction pipeline will not run at full speed unless programs are written to accommodate the pipeline, a programmer can choose to ignore pipelining and assume the hardware will automatically increase speed whenever possible.
• Modern compilers are typically able to move instructions around to optimize pipeline execution for RISC architectures.

Advanced Pipeline Topics

• Structural hazards (e.g., resource conflicts)
• Data hazards (e.g., RAW, WAR, WAW)
• Control hazards (e.g., branch prediction, cache miss)
• Dynamic scheduling
• Out-of-order issue
• Out-of-order execution
• Speculative execution
• Register renaming
• …

Parallelizations

Superscalar Architecture [1]

• Superscalar architecture (parallelism inside a single processor – instruction-level parallelism), for example:
  – 2 pipelines with 2 ALUs
  – 2 instructions must not conflict over a resource (e.g., register, memory access unit) and must not depend on each other's results (a dual-issue check is sketched below)
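
The pairing rule on the Superscalar slide can be written down directly. A rough Python sketch (an added illustration; the (unit, dest, srcs) format and the fixed unit assignment are assumptions, not the slides' model):

```python
# Dual-issue check for a 2-pipeline superscalar machine, per the rule
# above: two instructions may issue together only if they do not compete
# for a resource and the second does not depend on the first's result.

def can_dual_issue(first, second):
    unit1, dest1, _ = first
    unit2, _, srcs2 = second
    no_resource_conflict = unit1 != unit2    # e.g., separate ALUs
    no_raw_dependence = dest1 not in srcs2   # second must not need dest1
    return no_resource_conflict and no_raw_dependence

add = ("alu0", "r1", ("r2", "r3"))   # r1 = r2 + r3
sub = ("alu1", "r4", ("r1", "r3"))   # r4 = r1 - r3 (reads r1)
mul = ("alu1", "r5", ("r2", "r6"))   # independent of add

print(can_dual_issue(add, sub))      # False: sub depends on add's result
print(can_dual_issue(add, mul))      # True: different ALUs, no dependence
```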