On the Exploration of the DRISC Architecture
Chapter 1

Introduction

Contents
1.1 Background
1.2 Multithreading
1.3 Multiple cores
1.4 What's next
1.5 Case study
1.6 What have we done
1.7 Overview and organization

1.1 Background

Moore's law, first stated in 1965 [M+65] but later widely accepted as "the number of transistors on integrated circuits doubles approximately every two years" (fig. 1.1), has been steering the development of the semiconductor industry, governing the capabilities of electronic devices such as storage capacity and processing speed. In the field of microprocessors, this is well exemplified by Intel's "Tick-Tock" paradigm (fig. 1.2).

Before the dawn of this millennium, the increased density of transistors was chiefly devoted to more powerful uniprocessors. "Powerful" refers on one hand to techniques such as out-of-order (OoO) execution, branch prediction, speculation and multiple issue, which accelerate the execution of sequential code; on the other hand it refers to raising the clock frequency, e.g. by extending the number of pipeline stages. The former suffers from hardware complexity and insufficient ILP, while both incur serious power-dissipation and thermal problems. The resultant memory wall [WM95] (memory access latency growing relative to the shrinking execution time of a single instruction) and performance wall [AHKB00] (performance benefits diminishing relative to growing hardware cost and power consumption) mandate rethinking architecture design for sustainable, efficient performance improvement. One answer has proven to be multithreading, and later multiple cores, although binary compatibility no longer brings performance scalability as easily as it used to.

Figure 1.1: The number of on-chip transistors scaling from 1971 to mid-2013. (A refined figure based on http://en.wikipedia.org/wiki/File:Transistor_Count_and_Moore's_Law_-_2011.svg.)

Figure 1.2: The evolution of Intel's processors: shrunk process technology in the first "tick" year, followed by a "tock" year with an updated micro-architecture. (A refined figure based on http://commons.wikimedia.org/wiki/File:IntelProcessorRoadmap.svg.)

1.2 Multithreading

As stated in [BH95], independent streams of instructions, interwoven on a single processor, fill its otherwise idle cycles and thus boost its performance; multithreading, which exploits the higher-level Thread-Level Parallelism (TLP), opens a new way to achieve overall performance by interleaving threads to hide latency.
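To make the latency-hiding argument concrete, here is a minimal back-of-the-envelope model (an illustration for this chapter, not a design or result from this thesis): assume each thread alternates R cycles of useful work with L cycles stalled on memory; with n interleaved threads, pipeline utilization is roughly min(1, n·R/(R+L)). The C sketch below tabulates this for hypothetical values of R and L.

```c
#include <stdio.h>

/* Back-of-the-envelope model of latency hiding by interleaving (an
 * illustration, not a design from this thesis). Assume each thread
 * alternates R cycles of useful work with L cycles stalled on memory.
 * With n interleaved threads, pipeline utilization is roughly
 * min(1, n*R/(R+L)): once enough threads exist, every stall cycle is
 * covered by another thread's work. */
static double utilization(int n_threads, double run, double stall)
{
    double u = n_threads * run / (run + stall);
    return u > 1.0 ? 1.0 : u;
}

int main(void)
{
    const double R = 10.0; /* hypothetical cycles of work per burst     */
    const double L = 90.0; /* hypothetical cycles stalled per memory op */
    for (int n = 1; n <= 16; n *= 2)
        printf("%2d threads -> utilization %4.2f\n", n, utilization(n, R, L));
    return 0;
}
```

With these hypothetical numbers, utilization rises linearly with the thread count until roughly (R+L)/R = 10 threads saturate the pipeline, which is precisely why the machines cited below keep tens to hundreds of hardware contexts resident.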
The study of TLP dates back to the late 1950s, e.g. [Bem57], and the first commercial multi-threaded system was the heterogeneous element processor (HEP) [Smi82], released in 1978. However, such architectures began to thrive only in the late 1990s, as ILP gains in OoO processors stalled. Despite diverse accomplishments, two important keys to multithreading are how and when to perform thread interleaving, and these also categorize multi-threaded systems into different classes.

As the execution of a thread requires its own context, consisting of a Program Counter (PC), state registers and stack frames, interleaving threads implies a context switch, namely context preservation and recovery. If such a switch is conducted entirely by software, this is known as software multithreading. Software multithreading usually bears a higher switch overhead because it relies on accessing memory for context values, so it is only suitable for infrequent switches. Conversely, if each thread has its own in-processor storage for its context, this is known as Hardware Multithreading (HMT), which usually facilitates more frequent switching at the cost of additional hardware investment (hence the design trade-off). For instance, the HEP supports 50 hardware threads equipped with 2048 registers, while its descendants, the Tera MTA [ACC+90, AKK+95] and the later Cray Thunderstorm processor [KV11], maintain 128 hardware contexts backed by 128 register sets.

Decisions on when to start thread switches generally fall into three types (a minimal scheduling sketch in C follows the list):

1. Fine-grained or interleaved multithreading. One instruction is fetched from a thread in one cycle, and execution switches to another thread in the next cycle to ensure fairness among threads. The exemplary architecture of this type is the so-called barrel processor, where each thread is guaranteed to run one instruction every N cycles given N hardware contexts; at the same time, the execution speed of a single thread is always roughly 1/N of its original. This was later improved in [AKK+95] by scheduling only ready threads, for better flexibility.

2. Coarse-grained or blocked multithreading. A thread emits instructions until a point is reached that triggers a switch. Such points can be either static, e.g. tags or switch instructions inserted by compilers, or every branch or memory-load instruction, or dynamic, such as cache misses, traps or interrupts. Unlike interleaved multithreading, different threads often have different time slices for execution.

3. Simultaneous Multithreading (SMT). It resembles the cycle-by-cycle switching of fine-grained multithreading but features multiple issue, i.e. several instructions from different threads are issued to the pipeline in one cycle. This is usually built on top of superscalar processors to fill pipeline slots left empty by dependencies, and thus captures both inter-thread TLP and intra-thread ILP. The issue width of SMT processors is always moderate, i.e. 2 or 4 ways, and cannot be very large because of area costs: the issue-logic area grows as the square of the width [AHKB00, PBB+02] and that of the register file scales cubically [BDA01], and the subsequent power consumption moves the processor towards even lower performance efficiency.
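As promised above, the following C sketch makes the first policy concrete, including the [AKK+95] refinement of skipping non-ready threads: a round-robin pointer advances over the hardware contexts and each cycle issues one instruction from the next ready context. The context layout, thread count and stall pattern are hypothetical and serve only to make the policy visible; this is not the scheduler of any machine discussed here.

```c
#include <stdio.h>
#include <stdbool.h>

#define N_CTX 4 /* hypothetical number of hardware contexts */

struct context {
    int pc;    /* per-thread program counter                 */
    int stall; /* cycles remaining until the thread is ready */
};

int main(void)
{
    /* Hypothetical initial stall pattern; all threads start at pc 0. */
    struct context ctx[N_CTX] = {{0, 2}, {0, 3}, {0, 4}, {0, 5}};
    int next = 0; /* round-robin pointer */

    for (int cycle = 0; cycle < 8; cycle++) {
        /* Stalled threads get one cycle closer to readiness. */
        for (int i = 0; i < N_CTX; i++)
            if (ctx[i].stall > 0)
                ctx[i].stall--;

        /* Issue from the next ready context, skipping stalled ones. */
        bool issued = false;
        for (int k = 0; k < N_CTX && !issued; k++) {
            int i = (next + k) % N_CTX;
            if (ctx[i].stall == 0) {
                printf("cycle %d: issue thread %d (pc=%d)\n",
                       cycle, i, ctx[i].pc++);
                next = (i + 1) % N_CTX; /* keep rotating for fairness */
                issued = true;
            }
        }
        if (!issued)
            printf("cycle %d: bubble, all threads stalled\n", cycle);
    }
    return 0;
}
```

A pure barrel processor would omit the readiness test and issue strictly in turn, inserting a bubble whenever the selected thread is stalled; skipping to the next ready context trades that strict 1/N fairness for higher pipeline utilization.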
1.3 Multiple cores

In 2000, IBM announced the Power4, the first commercial multi-core microprocessor, which integrated two processor cores onto a single die. It kicked off a new era of investing transistors in additional CPUs (cores) inside the same chip, and other vendors soon followed by moving to multi-core instead of launching ever more aggressive uni-core processors. However, what a "core" looks like manifests disparate design principles from different vendors.

One extreme can be found in the evolving Oracle T-Series and IBM Power-Series Chip Multiprocessors (CMPs), which integrate 8∼16 "fat" cores. Each core is enhanced with SMT as well as other special function units (e.g. the SPU in Oracle cores and the vector unit in Power cores), emphasizing chip-level parallelism but retaining complex hardware circuits for ILP and assuring better performance of a single instruction stream. For example, the Oracle SPARC T5 [Ora13] claims 30% higher single-thread performance than the previous generation. This is by and large regarded as latency-oriented.

The other extreme is utilizing only simpler or "thin" cores. Each core supports massive numbers of hardware threads with trivial scheduling overheads in order to hide latency simply by thread interleaving, even to the point that hierarchical caches are not necessary. An early trial of this is the Niagara T1 [LLS06], which packs 8 cores of 32 threads; it featured fine-grained multithreading rather than radical ILP techniques or a sizable amount of cache. The development of the Graphics Processing Unit (GPU) is much more aggressive. Taking NVidia's Kepler GK110 [Nvi12] as an example, it includes up to 15 streaming multiprocessors (SMX) primarily composed of integer and floating-point arithmetic logic; each SMX allows the concurrent issue and execution of 128 parallel threads and supports up to 2048 threads. Consequently the GPU possesses a superior advantage over its CMP rivals in terms of system throughput when solving problems exposing abundant parallelism, but this achievement comes at the expense of single-thread performance [GK10]. It is also less general-purpose in contrast with CMPs.

The deficiencies of the above two extremes motivate an alternative design: the "fused" microprocessor, packaging "fat" CPU cores and a GPU together. This is currently promoted by AMD's APU and Intel's Haswell for the purpose of smooth synergy between several x86 cores and an on-chip GPU, yet it needs further improvements to perfect the coordination and remove the interconnection limitations [DAF11]. In fact,