Pipelining and Hazards
Instructor: Steven Ho


Great Idea #4: Parallelism

Software:
• Parallel Requests – assigned to a computer, e.g. search "Garcia"
• Parallel Threads – assigned to a core, e.g. lookup, ads
• Parallel Instructions – >1 instruction at one time, e.g. 5 pipelined instructions
• Parallel Data – >1 data item at one time, e.g. add of 4 pairs of words: A0+B0, A1+B1, A2+B2, A3+B3
• Hardware descriptions – all gates functioning in parallel at the same time

Hardware: warehouse-scale computer / smart phone > computer (memory, input/output) > core (instruction units, functional units) > cache memory, logic gates

Leverage parallelism to achieve high performance.

7/12/2017 CS61C Su18 - Lecture 13 2

Review of Last Lecture

• Implementing the controller for your datapath
  – Take the decoded signals from the instruction and generate the control signals
• Pipelining improves performance by exploiting Instruction-Level Parallelism
  – 5-stage pipeline for RISC-V: IF, ID, EX, MEM, WB
  – Executes multiple instructions in parallel
  – Each instruction has the same latency
  – What can go wrong???
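The review's claim that the 5-stage pipeline executes multiple instructions in parallel can be sketched with a toy model (illustrative Python, not part of the lecture): at cycle c, the instruction issued at cycle c - s sits in stage s, assuming no stalls.

```python
# Minimal sketch of a 5-stage pipeline with no stalls: at cycle c,
# the instruction issued at cycle c - s occupies stage s.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_snapshot(instructions, cycle):
    """Return {stage: instruction} for one clock cycle (no stalls)."""
    snapshot = {}
    for s, stage in enumerate(STAGES):
        i = cycle - s                      # index of the instruction in that stage
        if 0 <= i < len(instructions):
            snapshot[stage] = instructions[i]
    return snapshot

prog = ["add t0, t1, t2", "or t3, t4, t5", "slt t6, t0, t3",
        "sw t0, 4(t3)", "lw t0, 8(t3)"]
print(pipeline_snapshot(prog, 4))
# cycle 4: lw in IF, sw in ID, slt in EX, or in MEM, add in WB
```

At cycle 4 every stage is busy with a different instruction, which is the steady state the pipeline diagrams picture.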
Agenda

• RISC-V Pipeline
• Hazards
  – Structural
  – Data (R-type instructions, loads)
  – Control
• Superscalar processors

Recap: Pipelining with RISC-V

instruction sequence:
    add t0, t1, t2
    or  t3, t4, t5
    sll t6, t0, t3

                        Single Cycle                 Pipelining
Timing                  tstep = 100 ... 200 ps       tcycle = 200 ps
                        (register access only        (all cycles same
                        100 ps)                      length)
Instruction time        tinstruction = tcycle        tinstruction = 1000 ps
                        = 800 ps
Clock rate, fs          1/800 ps = 1.25 GHz          1/200 ps = 5 GHz
Relative speed          1x                           4x

RISC-V Pipeline

[Diagram: resource use of each instruction over time. The sequence add t0, t1, t2; or t3, t4, t5; slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3); addi t2, t2, 1 flows through the pipeline, each instruction using a particular resource in a particular time slot; tinstruction = 1000 ps, tcycle = 200 ps.]

Single-Cycle RISC-V RV32I Datapath

[Figure: single-cycle RV32I datapath – IMEM, register file (AddrA/AddrB/AddrD, DataA/DataB/DataD), immediate generator, branch comparator, ALU, DMEM, and writeback mux; control signals PCSel, ImmSel, RegWEn, BrUn, BrEq, BrLT, BSel, ASel, ALUSel, MemRW, WBSel.]

Pipelining the RISC-V RV32I Datapath

The datapath is cut into five stages: Instruction Fetch (F), Instruction Decode/Register Read (D), ALU Execute (X), Memory Access (M), and Write Back (W).

Pipelined RISC-V RV32I Datapath

• Recalculate PC+4 in the M stage to avoid sending both PC and PC+4 down the pipeline
[Figure: pipelined datapath with pipeline registers pcF, pcD, pcX, pcM; rs1X, rs2X, immX; aluX, aluM; instD.]
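The single-cycle vs. pipelined numbers in the recap table above can be reproduced with a short calculation (a sketch; the exact split of per-stage latencies is an assumption consistent with the slide's "register access only 100 ps, tstep = 100...200 ps"):

```python
# Assumed per-stage latencies in ps: register accesses ~100 ps,
# the other steps ~200 ps (the slide gives tstep = 100 ... 200 ps).
stage_ps = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Single cycle: one instruction per clock, so the clock period is the
# sum of all stage latencies.
single_cycle_ps = sum(stage_ps.values())        # 800 ps

# Pipelined: the clock period must fit the slowest stage, and one
# instruction spends five cycles in flight.
pipelined_cycle_ps = max(stage_ps.values())     # 200 ps
pipelined_instr_ps = 5 * pipelined_cycle_ps     # 1000 ps

single_ghz = 1000 / single_cycle_ps             # 1/800 ps = 1.25 GHz
pipelined_ghz = 1000 / pipelined_cycle_ps       # 1/200 ps = 5 GHz
speedup = single_cycle_ps / pipelined_cycle_ps  # 4x throughput
print(single_cycle_ps, pipelined_instr_ps, single_ghz, pipelined_ghz, speedup)
```

Note that pipelining raises an individual instruction's latency (800 ps to 1000 ps) while quadrupling steady-state throughput.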
• Must pipeline the instruction along with its data (instD, instX, instM, instW), so that control operates correctly in each stage

Each Stage Operates on a Different Instruction

[Figure: the pipelined datapath with five instructions in flight at once – lw t0, 8(t3), sw t0, 4(t3), slt t6, t0, t3, or t3, t4, t5, add t0, t1, t2 – one per stage. Pipeline registers separate the stages and hold the data for each instruction in flight.]

Pipelined Control

• Control signals are derived from the instruction
  – As in the single-cycle implementation
  – The information is stored in pipeline registers for use by later stages

Graphical Pipeline Representation

• RegFile: right half is read, left half is write
[Diagram: Load, Add, Store, Sub, Or each pass through I$, Reg, ALU, D$, Reg in successive clock cycles.]

Question: Which of the following signals (buses or control signals) for RISC-V does NOT need to be passed into the EX pipeline stage for a beq instruction? (beq t0, t1, Label)
(A) BrUn   (B) MemWr   (C) RegWr   (D) WBSel

Hazards Ahead!

Pipelining Hazards

A hazard is a situation that prevents starting the next instruction in the next clock cycle:
1) Structural hazard – A required resource is busy (e.g.
needed in multiple stages)
2) Data hazard
   – Data dependency between instructions
   – Need to wait for the previous instruction to complete its data write
3) Control hazard
   – Flow of execution depends on the previous instruction

Structural Hazard

• Problem: two or more instructions in the pipeline compete for access to a single physical resource
• Solution 1: instructions take it in turns to use the resource; some instructions have to stall
• Solution 2: add more hardware to the machine
• A structural hazard can always be solved by adding more hardware

1. Structural Hazards

• Conflict for use of a resource
[Diagram: Load and Instr 1–4 in flight; can we read and write the registers simultaneously?]

Regfile Structural Hazards

• Each instruction:
  – can read up to two operands in the decode stage
  – can write one value in the writeback stage
• Avoid the structural hazard by having separate "ports": two independent read ports and one independent write port
• Three accesses per cycle can then happen simultaneously

• Two alternate solutions:
  1) Build the RegFile with independent read and write ports (what you will do in the project; good for single-stage)
  2) Double Pumping: split RegFile access in two! Prepare to write during the 1st half of each clock cycle, write on the falling edge, and read during the 2nd half
• Will save us a cycle later...
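Double pumping can be modeled in a few lines (a behavioral sketch, not the actual circuit): the write lands in the first half of the cycle and reads happen in the second half, so a read in the same cycle as a write already sees the new value.

```python
# Behavioral sketch of a double-pumped register file: within one clock
# cycle, the write (1st half, falling edge) completes before the reads
# (2nd half), so same-cycle read-after-write returns the new value.
class DoublePumpedRegFile:
    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, write_addr, write_data, read_addrs, wen=True):
        if wen and write_addr != 0:              # x0 is hardwired to zero
            self.regs[write_addr] = write_data   # 1st half: write
        return [self.regs[a] for a in read_addrs]  # 2nd half: read

rf = DoublePumpedRegFile()
rf.cycle(5, 42, [])             # one instruction's WB writes x5...
print(rf.cycle(6, 7, [5, 6]))   # ...next cycle, a WB to x6 overlaps ID reads of x5, x6
# -> [42, 7]: the same-cycle write to x6 is visible to the read
```

This is why an instruction whose decode stage lines up with an older instruction's writeback needs no stall.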
• Possible because RegFile access is VERY fast (takes less than half the time of the ALU stage)
• Conclusion: reading and writing the registers during the same clock cycle is okay

Memory Structural Hazards

• Conflict for use of a resource: trying to read (and maybe write) the same memory twice in the same clock cycle
[Diagram: the instruction fetch of one instruction and the data access of an earlier load/store land in the same cycle.]

Instruction and Data Caches

[Figure: processor (control, datapath, PC, registers, ALU) connected to memory (DRAM) through a separate instruction cache and data cache.]
• Caches: small and fast "buffer" memories

Structural Hazards – Summary

• Conflict for use of a resource
• In a RISC-V pipeline with a single memory:
  – A load/store requires data access
  – Without separate memories, instruction fetch would have to stall for that cycle; all other operations in the pipeline would have to wait
• Pipelined datapaths require separate instruction/data memories, or separate instruction/data caches
• RISC ISAs (including RISC-V) are designed to avoid structural hazards, e.g. at most one memory access per instruction

Administrivia

• Proj2-2 due 7/13, HW3/4 due 7/16 (2-2 autograder being run)
• Guerilla Session tonight! 4–6pm
• HW0-2 grades should now be accurate on glookup
• Project 3 released tomorrow night!
• Supplementary review sessions starting – first one this Sat. (7/14) 12–2pm, Cory 540AB

2. Data Hazards (1/2)

• Consider the following sequence of instructions:
    add t0, t1, t2
    sub t4, t0, t3
    and t5, t0, t6
    or  t7, t0, t8
    xor t9, t0, t10
• t0 is stored during WB but read during ID
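A quick script makes the hazard in this sequence concrete (illustrative Python, not from the lecture; it assumes double pumping, so an instruction whose ID overlaps the writer's WB — three instructions later — is safe, and only the two instructions in between hazard):

```python
# Find read-after-write hazards: a later instruction reads a register
# that an earlier instruction writes, before that write reaches the
# register file. With double pumping, only distances 1 and 2 hazard.
def raw_hazards(instrs, dist=3):
    """instrs: list of (name, dest, sources). Returns hazardous pairs."""
    pairs = []
    for i, (_, dest, _) in enumerate(instrs):
        for j in range(i + 1, min(i + dist, len(instrs))):
            if dest in instrs[j][2]:
                pairs.append((instrs[i][0], instrs[j][0]))
    return pairs

seq = [("add", "t0", ("t1", "t2")),
       ("sub", "t4", ("t0", "t3")),
       ("and", "t5", ("t0", "t6")),
       ("or",  "t7", ("t0", "t8")),
       ("xor", "t9", ("t0", "t10"))]
print(raw_hazards(seq))
# add's t0 is read by sub and and inside the hazard window
```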
2. Data Hazards (2/2)

• Data flowing backward in time in the pipeline diagram indicates a hazard
[Diagram: add t0, t1, t2 produces t0 in WB; the following instructions read t0 in ID before it is written back; the instruction whose ID lines up with add's WB is a hazard only if there is no double pumping.]

Solution 1: Stalling

• Problem: an instruction depends on the result of the previous instruction
    add t0, t1, t2
    sub t4, t0, t3
• Bubble: effectively a NOP – the affected pipeline stages do "nothing"

Stalls and Performance

• Stalls reduce performance, but stalls are required to get correct results
• The compiler can arrange code to avoid hazards and stalls
  – Requires knowledge of the pipeline structure

Data Hazard Solution: Forwarding

• Forward a result as soon as it is available
  – OK that it's not stored in the RegFile yet
• An arithmetic result is available at the end of EX
[Diagram: the result of add t0, t1, t2 is forwarded from later pipeline stages to the EX stage of sub, and, or, and xor.]
• Forwarding: grab the operand from a pipeline stage, rather than from the register file

Forwarding (aka Bypassing)

• Use a result when it is computed
  – Don't wait for it to be stored in a register
  – Requires extra connections in the datapath

Detect Need for Forwarding (example)

Compare the destination registers of older instructions in the X, M, and W pipeline stages with the source registers of the new instruction in the decode stage.
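The comparison described above can be sketched directly (illustrative Python, not the actual hazard-detection hardware; x0 is excluded because it is hardwired to zero, and the youngest in-flight result takes priority):

```python
# Decide, for each source register of the instruction in decode, where
# its value must come from: the X-stage result, the M-stage result, or
# the register file. The youngest matching destination wins.
def forward_sources(decode_srcs, x_rd, m_rd):
    origin = {}
    for rs in decode_srcs:
        if rs == x_rd and rs != "x0":      # newest result: EX/MEM register
            origin[rs] = "forward from X"
        elif rs == m_rd and rs != "x0":    # older result: MEM/WB register
            origin[rs] = "forward from M"
        else:
            origin[rs] = "regfile"
    return origin

# sub t4, t0, t3 is in decode while add t0, t1, t2 is in X:
print(forward_sources(["t0", "t3"], x_rd="t0", m_rd=None))
# -> {'t0': 'forward from X', 't3': 'regfile'}
```

The real hardware makes the same comparisons with register numbers and uses the result to steer muxes at the ALU inputs.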