Design of Digital Circuits Lecture 20: SIMD Processors

Prof. Onur Mutlu
ETH Zurich
Spring 2018
11 May 2018

New Course: Bachelor’s Seminar in Comp Arch
• Fall 2018
• 2 credit units
• Rigorous seminar on fundamental and cutting-edge topics in computer architecture
• Critical presentation, review, and discussion of seminal works in computer architecture
  ◦ We will cover many ideas & issues, analyze their tradeoffs, perform critical thinking and brainstorming
• Participation, presentation, report and review writing
• Stay tuned for more information

For the Curious: New Rowhammer Attack
• Another Rowhammer-based attack disclosed yesterday

Last Week’s Attack
• Using an integrated GPU in a mobile system to remotely escalate privilege via the WebGL interface

More to Come …
• Onur Mutlu, "The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser," Invited Paper in Proceedings of the Design, Automation, and Test in Europe Conference (DATE), Lausanne, Switzerland, March 2017.
  https://people.inf.ethz.ch/omutlu/pub/rowhammer-and-other-memory-issues_date17.pdf

Agenda for Today & Next Few Lectures
• Single-cycle Microarchitectures
• Multi-cycle and Microprogrammed Microarchitectures
• Pipelining
• Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
• Out-of-Order Execution
• Other Execution Paradigms

Readings for Today
• Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro 1996.
• Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008.

Other Approaches to Concurrency (or Instruction Level Parallelism)

Approaches to (Instruction-Level) Concurrency
• Pipelining
• Out-of-order execution
• Dataflow (at the ISA level)
• Superscalar Execution
• VLIW
• Fine-Grained Multithreading
• SIMD Processing (Vector and array processors, GPUs)
• Decoupled Access Execute
• Systolic Arrays

SIMD Processing: Exploiting Regular (Data) Parallelism

Flynn’s Taxonomy of Computers
• Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
• SISD: Single instruction operates on single data element
• SIMD: Single instruction operates on multiple data elements
  ◦ Array processor
  ◦ Vector processor
• MISD: Multiple instructions operate on single data element
  ◦ Closest form: systolic array processor, streaming processor
• MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  ◦ Multiprocessor
  ◦ Multithreaded processor

Data Parallelism
• Concurrency arises from performing the same operation on different pieces of data
  ◦ Single instruction multiple data (SIMD)
  ◦ E.g., dot product of two vectors
• Contrast with data flow
  ◦ Concurrency arises from executing different operations in parallel (in a data driven manner)
• Contrast with thread (“control”) parallelism
  ◦ Concurrency arises from executing different threads of control in parallel
• SIMD exploits operation-level parallelism on different data
  ◦ Same operation concurrently applied to different pieces of data
  ◦ A form of ILP where instruction happens to be the same across data

SIMD Processing
• Single instruction operates on multiple data elements
  ◦ In time or in space
• Multiple processing elements
• Time-space duality
  ◦ Array processor: Instruction operates on multiple data elements at the same time using different spaces
  ◦ Vector processor: Instruction operates on multiple data elements in consecutive time steps using the same space

Array vs. Vector Processors

Instruction stream:
    LD  VR ← A[3:0]
    ADD VR ← VR, 1
    MUL VR ← VR, 2
    ST  A[3:0] ← VR

ARRAY PROCESSOR (time runs down, space runs across)
    LD0  LD1  LD2  LD3
    AD0  AD1  AD2  AD3
    MU0  MU1  MU2  MU3
    ST0  ST1  ST2  ST3
  Same op @ same time (each row); different ops @ same space (each column)

VECTOR PROCESSOR (time runs down, space runs across)
    LD0
    LD1  AD0
    LD2  AD1  MU0
    LD3  AD2  MU1  ST0
         AD3  MU2  ST1
              MU3  ST2
                   ST3
  Different ops @ time (each row); same op @ space (each column)

SIMD Array Processing vs. VLIW
• VLIW: Multiple independent operations packed together by the compiler
• Array processor: Single operation on multiple (different) data elements

Vector Processors (I)
• A vector is a one-dimensional array of numbers
• Many scientific/commercial programs use vectors
    for (i = 0; i<=49; i++)
        C[i] = (A[i] + B[i]) / 2
• A vector processor is one whose instructions operate on vectors rather than scalar (single data) values
• Basic requirements
  ◦ Need to load/store vectors → vector registers (contain vectors)
  ◦ Need to operate on vectors of different lengths → vector length register (VLEN)
  ◦ Elements of a vector might be stored apart from each other in memory → vector stride register (VSTR)
    ▪ Stride: distance in memory between two elements of a vector

Vector Processors (II)
• A vector instruction performs an operation on each element in consecutive cycles
  ◦ Vector functional units are pipelined
  ◦ Each pipeline stage operates on a different data element
• Vector instructions allow deeper pipelines
  ◦ No intra-vector dependencies → no hardware interlocking needed within a vector
  ◦ No control flow within a vector
  ◦ Known stride allows easy address calculation for all vector elements
    ▪ Enables prefetching of vectors into registers/cache/memory

Vector Processor Advantages
+ No dependencies within a vector
  ◦ Pipelining & parallelization work really well
  ◦ Can have very deep pipelines, no dependencies!
+ Each instruction generates a lot of work
  ◦ Reduces instruction fetch bandwidth requirements
+ Highly regular memory access pattern
+ No need to explicitly code loops
  ◦ Fewer branches in the instruction sequence

Vector Processor Disadvantages
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
   -- How about searching for a key in a linked list?
Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.

Vector Processor Limitations
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks

Vector Processing in More Depth

Vector Registers
• Each vector data register holds N M-bit values
• Vector control registers: VLEN, VSTR, VMASK
• Maximum VLEN can be N
  ◦ Maximum number of elements stored in a vector register
• Vector Mask Register (VMASK)
  ◦ Indicates which elements of vector to operate on
  ◦ Set by vector test instructions
    ▪ e.g., VMASK[i] = (Vk[i] == 0)
[Figure: vector registers V0 and V1, each M bits wide, holding elements V0,0 … V0,N-1 and V1,0 … V1,N-1]

Vector Functional Units
• Use a deep pipeline to execute element operations → fast clock cycle
• Control of deep pipeline is simple because elements in vector are independent
[Figure: six-stage multiply pipeline computing V1 * V2 → V3]
Slide credit: Krste Asanovic

Vector Machine Organization (CRAY-1)
• CRAY-1
• Russell, “The CRAY-1 computer system,” CACM 1978.
• Scalar and vector modes
• 8 64-element vector registers
• 64 bits per element
• 16 memory banks
• 8 64-bit scalar registers
• 8 24-bit address registers

CRAY X-MP-28 @ ETH (CAB, E Floor)

CRAY X-MP System Organization
[Figure: CRAY X-MP system organization]
Cray Research Inc., “The CRAY X-MP Series of Computer Systems,” 1985

CRAY X-MP Design Detail

CRAY X-MP single- and multiprocessor systems are designed to offer users outstanding performance on large-scale, compute-intensive and I/O-bound jobs.

[Table: Model, Number of CPUs, Memory size (millions of 64-bit words), Number of banks, for models CRAY X-MP/416, X-MP/48, X-MP/216, X-MP/28, X-MP/24, X-MP/18, X-MP/14, X-MP/12, X-MP/11; cell values not recoverable]

CRAY X-MP mainframes consist of six (X-MP/1), eight (X-MP/2) or twelve (X-MP/4) vertical columns arranged in an arc. Power supplies and cooling are clustered around the base and extend outward. A description of the major system components and their functions follows.

CPU computation section
Within the computation section of each CPU are operating registers, functional units and an instruction control network: hardware elements that cooperate in executing sequences of instructions. The instruction control network makes all decisions related to instruction issue as well as coordinating the three types of processing within each CPU: vector, scalar and address. Each of the processing modes has its associated registers and functional units.

The block diagram of a CRAY X-MP/4 (opposite page) illustrates the relationship of the registers to the functional units, instruction buffers, I/O channel control registers, interprocessor communications section and memory. For multiple-processor CRAY X-MP models, the interprocessor communications section coordinates processing between CPUs, and central memory is shared.

Registers
The basic set of programmable registers is composed of:
• Eight 24-bit address (A) registers
• Sixty-four 24-bit intermediate address (B) registers
• Eight 64-bit scalar (S) registers
• Sixty-four 64-bit scalar-save (T) registers
• Eight 64-element (4096-bit) vector (V) registers with 64 bits per element

The 24-bit A registers are generally used for addressing and counting operations. Associated with them are 64 B registers, also 24 bits wide. Since the transfer between an A and a B register takes only one clock period, the B registers assume the role of data cache, storing information for fast access without tying up the A registers for relatively long periods.

Cray Research Inc., “The CRAY X-MP Series of Computer Systems,” 1985

CRAY X-MP CPU Functional Units

… shared registers for interprocessor communication and synchronization. Each cluster of shared registers consists of eight 24-bit shared address (SB) registers, eight 64-bit shared scalar (ST) registers and …

… cause it to switch from user to monitor mode. Additionally, each processor in a cluster can asynchronously perform scalar or vector operations dictated by user …

… instruction parcels, twice the capacity of the CRAY-1 instruction buffer. The instruction buffers of each CPU are loaded from memory at the burst rate of eight words per clock period.

Cray Research Inc., “The CRAY X-MP Series of Computer Systems,” 1985
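
Worked sketch: the Vector Processors (I) loop with VLEN, VSTR and VMASK

The loop C[i] = (A[i] + B[i]) / 2 from the Vector Processors (I) slide can be sketched as a small software model of the vector registers and the VLEN, VSTR and VMASK control registers. This is only an illustration under stated assumptions: the struct layout, the function names (vec_config, vld, vadd, vdivs, vst, vtest_zero), the 64-bit integer element type, and measuring the stride in elements are hypothetical choices, not the CRAY-1 instruction set or code from the lecture. The inner for loops stand in for what pipelined vector functional units would do element by element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VLEN  64 /* N: elements per vector register (64 on the CRAY-1) */
#define NUM_VREGS 8  /* eight vector registers, as on the CRAY-1 */

/* Hypothetical software model of the vector state named on the slides. */
typedef struct {
    int64_t v[NUM_VREGS][MAX_VLEN]; /* vector data registers V0..V7 */
    size_t  vlen;                   /* VLEN: elements each instruction touches */
    size_t  vstr;                   /* VSTR: stride between elements, in elements */
    uint8_t vmask[MAX_VLEN];        /* VMASK: 1 = operate on element i */
} vec_state;

/* Set VLEN and VSTR, and enable every element in VMASK. */
void vec_config(vec_state *s, size_t vlen, size_t vstr) {
    assert(vlen <= MAX_VLEN); /* maximum VLEN is N */
    s->vlen = vlen;
    s->vstr = vstr;
    for (size_t i = 0; i < MAX_VLEN; i++) s->vmask[i] = 1;
}

/* Strided vector load: Vd[i] = mem[i * VSTR] for each enabled element. */
void vld(vec_state *s, int vd, const int64_t *mem) {
    for (size_t i = 0; i < s->vlen; i++)
        if (s->vmask[i]) s->v[vd][i] = mem[i * s->vstr];
}

/* Vector add: Vd[i] = Va[i] + Vb[i] for each enabled element. */
void vadd(vec_state *s, int vd, int va, int vb) {
    for (size_t i = 0; i < s->vlen; i++)
        if (s->vmask[i]) s->v[vd][i] = s->v[va][i] + s->v[vb][i];
}

/* Vector divide by scalar: Vd[i] = Va[i] / scalar for each enabled element. */
void vdivs(vec_state *s, int vd, int va, int64_t scalar) {
    for (size_t i = 0; i < s->vlen; i++)
        if (s->vmask[i]) s->v[vd][i] = s->v[va][i] / scalar;
}

/* Strided vector store: mem[i * VSTR] = Vs[i] for each enabled element. */
void vst(const vec_state *s, int64_t *mem, int vs) {
    for (size_t i = 0; i < s->vlen; i++)
        if (s->vmask[i]) mem[i * s->vstr] = s->v[vs][i];
}

/* Vector test, as in the slide's example: VMASK[i] = (Vk[i] == 0). */
void vtest_zero(vec_state *s, int vk) {
    for (size_t i = 0; i < s->vlen; i++)
        s->vmask[i] = (s->v[vk][i] == 0);
}

/* The slide's loop: for (i = 0; i <= 49; i++) C[i] = (A[i] + B[i]) / 2 */
void average50(const int64_t *A, const int64_t *B, int64_t *C) {
    vec_state s;
    vec_config(&s, 50, 1); /* VLEN = 50, VSTR = 1 */
    vld(&s, 0, A);         /* V0 = A        */
    vld(&s, 1, B);         /* V1 = B        */
    vadd(&s, 2, 0, 1);     /* V2 = V0 + V1  */
    vdivs(&s, 3, 2, 2);    /* V3 = V2 / 2   */
    vst(&s, C, 3);         /* C  = V3       */
}
```

In this model, average50 replaces the scalar loop's 50 iterations of load, add, divide, store and branch with one configuration step and five vector operations, with no branches in the instruction sequence. That is the "each instruction generates a lot of work" point from the Vector Processor Advantages slide.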