Dynamic Vectorization in the E2 Dynamic Multicore Architecture
To appear in the proceedings of HEART 2010


Andrew Putnam, Aaron Smith, Doug Burger
Microsoft Research

ABSTRACT

Previous research has shown that Explicit Data Graph Execution (EDGE) instruction set architectures (ISA) allow for power efficient performance scaling. In this paper we describe the preliminary design of a new dynamic multicore processor called E2 that utilizes an EDGE ISA to allow for the dynamic composition of physical cores into logical processors. We provide details of E2's support for dynamic reconfigurability and show how the EDGE ISA facilitates out-of-order vector execution.

Categories and Subject Descriptors

C.1.2 [Computer Systems Organization]: Multiple Data Stream Architectures - single-instruction-stream, multiple-data-stream processors (SIMD), array and vector processors; C.1.3 [Computer Systems Organization]: Other Architecture Styles - adaptable architectures, data-flow architectures

General Terms

Design, Performance

Keywords

Explicit Data Graph Execution (EDGE)

1. INTRODUCTION

Chip designers have long relied on dynamic voltage and frequency scaling (DVFS) to trade off power for performance. However, voltage scaling no longer works as processors approach the minimum threshold voltage (Vmin), because frequency scaling at Vmin reduces both power and performance linearly, achieving no reduction in energy. Power and performance trade-offs are thus left to either the microarchitecture or the system software.

When designing an architecture with little (if any) DVFS, designers must choose how to spend the silicon resources. Hill and Marty [6] described four ways that designers could use these resources: (1) many small, low performance, power efficient cores, (2) few large, power inefficient, high performance cores, (3) a heterogeneous mix of both small and large cores, and (4) a dynamic architecture capable of combining or splitting cores to adapt to a given workload. Of these alternatives, the highest performance and most energy efficient design is the dynamic architecture. Hill and Marty characterized what such a dynamic processor could do but did not describe the details of such an architecture.

TFlex [9] is one proposed architecture that demonstrated a large dynamic range of power and performance by combining power efficient, lightweight processor cores into larger, more powerful cores through the use of an Explicit Data Graph Execution (EDGE) instruction set architecture (ISA). TFlex is dynamically configurable to provide the same performance and energy efficiency as a small embedded processor or to provide the higher performance of an out-of-order superscalar on single-threaded applications.

Motivated by these promising results, we are currently designing a new dynamic architecture called E2 that utilizes an EDGE ISA to achieve high performance power efficiently [3]. The EDGE model divides a program into blocks of instructions that execute atomically. Blocks consist of a sequence of dataflow instructions that explicitly encode relationships between producer-consumer instructions, rather than communicating through registers as done in a conventional ISA. These explicit encodings are used to route operands to private reservation stations (called operand buffers) for each instruction. Registers and memory are only used for handling less-frequent inter-block communication.

Prior dynamic architectures [7, 9] have demonstrated the ability to take advantage of task and thread-level parallelism, but handling data-level parallelism requires dividing data into independent sets and using thread-level parallelism. In this paper we focus on efficiently exploiting data-level parallelism, even without threading, and present our preliminary vector unit design for E2. Unlike previous in-order vector machines, E2 allows for out-of-order execution of both vectors and scalars.

The E2 instruction set and execution model support three new capabilities that enable efficient vectorization across a broad range of codes. First, by slicing up the statically programmed issue window into vector lanes, highly concurrent, out-of-order issue of mixed scalar and vector operations can be achieved with lower energy overhead than scalar mode. Second, the statically allocated reservation stations permit the issue window to be treated as a vector register file, with wide fetches to memory and limited copying between a vector load and the vector operations. Third, the atomic block-based model in E2 permits refreshing of vector (and scalar) instruction blocks mapped to reservation stations, enabling repeated vector operations to issue with no fetch or decode energy overhead after the first loop iteration. Taken together, these optimizations will reduce the energy associated with finding and executing many sizes of vectors across a wide range of codes.
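The EDGE execution model described in the introduction, in which each instruction names its consumers and operands are routed directly to per-instruction operand buffers, can be illustrated with a toy interpreter. This is only a minimal sketch under stated assumptions, not E2's actual encoding: the `Instr` class, the `(index, slot)` target format, the two-operand restriction, and the `run_block` function are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass, field
import operator

# Hypothetical opcode table for this sketch (not E2's real instruction set).
OPS = {"add": operator.add, "mul": operator.mul}

@dataclass
class Instr:
    op: str        # opcode name, a key of OPS
    targets: list  # consumers, as (instruction index, operand slot) pairs
    operands: dict = field(default_factory=dict)  # operand buffer: slot -> value

def run_block(instrs, inputs):
    """Execute one block in dataflow order.

    `inputs` is a list of (instruction index, slot, value) triples standing in
    for values read from registers at block entry. Returns {index: result}
    for instructions without targets, standing in for the block's outputs.
    """
    ready = []

    def deliver(idx, slot, value):
        buf = instrs[idx].operands
        buf[slot] = value
        if len(buf) == 2:  # both operands have arrived: the instruction can fire
            ready.append(idx)

    for idx, slot, value in inputs:
        deliver(idx, slot, value)

    outputs = {}
    while ready:
        idx = ready.pop()  # any ready instruction may issue, in any order
        ins = instrs[idx]
        result = OPS[ins.op](ins.operands[0], ins.operands[1])
        if not ins.targets:
            outputs[idx] = result
        for t_idx, t_slot in ins.targets:
            deliver(t_idx, t_slot, result)  # route the result to each consumer
    return outputs

# (a + b) * (c + d): instructions 0 and 1 feed the two slots of instruction 2.
block = [
    Instr("add", [(2, 0)]),
    Instr("add", [(2, 1)]),
    Instr("mul", []),
]
result = run_block(block, [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
```

No instruction in the sketch reads or writes a shared register to pass a value within the block; the only ordering constraint is operand arrival, which is what lets ready instructions issue out of order.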
Figure 1: E2 microarchitecture block diagram. In vector mode, each core is composed of four independent vector lanes, each with a 32-instruction window, two 64-bit operand buffers, an ALU for both integer and floating point operations, and 16 registers. In scalar mode, the ALUs in lanes 3 and 4 are powered down, and the instruction windows, operand buffers, and registers are made available to the other two lanes.

2. THE E2 ARCHITECTURE

E2 is a tiled architecture that consists of low power, high performance, decentralized processing cores connected by an on-chip network. This design provides E2 with the benefits of other tiled architectures - namely simplicity, scalability, and fault tolerance. Figure 1 shows the basic architecture of an E2 processor containing 32 cores, and a block diagram of the internal structure of one physical core.

A core contains N lanes (in this paper we choose four), with each lane consisting of a 64-bit ALU and one bank of the instruction window, operand buffers, and register file. ALUs support both integer and floating point operations, as well as fine-grained SIMD execution (eight 8-bit, four 16-bit, or two 32-bit integer operations per cycle, or two single-precision floating point calculations per cycle). This innovation of breaking the window into lanes allows for high vector throughput with little additional hardware complexity.

E2's EDGE ISA restricts blocks in several ways to simplify the hardware that maps blocks to the execution substrate and detects when blocks are finished executing. Blocks are variable-size: they contain between 4 and 128 instructions and may execute at most 32 loads and stores. The hardware relies on the compiler to break programs into blocks of dataflow instructions and assign load and store identifiers to enforce sequential memory semantics [12]. To improve performance, the compiler uses predication to form large blocks filled with useful instructions. To simplify commit, the architecture relies on the compiler to ensure that a single branch is produced from every block, and to encode the register writes and the set of store identifiers used.

E2 cores operate in two execution modes: scalar mode and vector mode. In scalar mode, any instruction can send operands to any other instruction in the block, and all but two of the ALUs are turned off to conserve power. In vector mode, all N ALUs are turned on, but instructions can only send operands to instructions in the same vector lane. The mode is determined on a per-block basis from a bit in the block header. This allows each core to adapt quickly to different application phases on a block-by-block basis.

2.1 Composing Cores

One key characteristic that distinguishes E2 from other processors is the ability to dynamically adapt the architecture for a given workload by composing and decomposing cores. Rather than fixing the size and number of cores at design time, one or more physical cores can be merged together at runtime to form larger, more powerful logical cores. For example, serial portions of a workload can be handled by composing every physical core into one large logical processor that performs like an aggressive superscalar. Or, when ample thread-level parallelism is available, the same large logical processor can be split so each physical processor can work independently and execute instruction blocks from independent threads. Merging cores together is called composing cores, while splitting cores is called decomposing cores.

Logical cores interleave accesses to registers and memory among the physical cores to give the logical core the combined computational resources of all the composed physical cores. For example, a logical core composed of two physical cores uses an additional bit of the address to choose between the two physical caches, effectively doubling the L1 cache capacity. The register files are similarly interleaved, but since only 64 registers are supported by the ISA, the additional register file capacity is power gated to reduce power consumption.

Each instruction block is mapped to a single physical processor. When composed, the architecture uses additional cores to execute speculative instruction blocks.
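The address interleaving used by composed cores, where extra address bits choose among the physical L1 caches, might look roughly like the following. This is a hedged sketch rather than E2's documented mapping: the function name `home_core`, the 64-byte line size, and the choice of the bits directly above the cache-line offset are all assumptions made for illustration, since the text does not say which address bit is used.

```python
def home_core(address, num_cores, line_bytes=64):
    """Pick which physical core's L1 cache holds `address` in a logical core.

    Assumes num_cores is a power of two and that the selecting bits sit
    directly above the cache-line offset. Both are assumptions of this
    sketch, not details given in the text.
    """
    if num_cores < 1 or num_cores & (num_cores - 1):
        raise ValueError("num_cores must be a power of two")
    line_index = address // line_bytes    # drop the within-line offset bits
    return line_index & (num_cores - 1)   # low bits of the line index pick the core

# Two composed cores: consecutive cache lines alternate between the two caches.
two_core = [home_core(a, 2) for a in (0, 64, 128, 192)]
# A single, uncomposed core needs no selecting bit.
one_core = [home_core(a, 1) for a in (0, 64, 128, 192)]
```

With two composed cores, one extra bit selects the cache, so each physical cache sees half of the lines and the logical core has twice the L1 capacity, matching the description above. Selecting on line-granularity bits is one plausible design choice because it keeps every byte of a cache line in the same physical cache.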