Hybrid Threading Processor (HTP)

Total pages: 16 | File type: PDF | Size: 1020 KB

Hybrid Threaded Processing for Sparse Data Kernels
Tony Brewer, Chief Architect, Advanced Computing Solutions
May 9, 2018

Distribution Statement "A": Approved for Public Release, Distribution Unlimited.

"This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government."

©2016 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications are subject to change without notice. All information is provided on an "AS IS" basis without warranties of any kind. Statements regarding products, including regarding their features, availability, functionality, or compatibility, are provided for informational purposes only and do not modify the warranty, if any, applicable to any product. Drawings may not be to scale. Micron, the Micron logo, and all other Micron trademarks are the property of Micron Technology, Inc. All other trademarks are the property of their respective owners.

The Challenge

- Sparse data sets that greatly exceed a processor's cache size are a challenge for most systems
  - Processors are typically optimized for a high cache hit rate (>90%)
- A low cache hit rate results in idle cores
  - Memory accesses are cache-line sized (64B)
- Sparse data sets result in memory accesses where the majority of the accessed data is not used
Hybrid Threading Processor (HTP)

- RISC-V ISA (RV64G)
  - Extensions for thread and message management
- High thread count barrel processor
  - Similar to Cray's MTA architecture
  - One instruction per thread per scheduling interval (avoids register hazard checking)
- Event-driven processor
  - Pause for memory response
  - Pause for thread join
  - Pause for message reception
- Efficient memory usage
  - Memory access sizes of 8, 16, 32, or 64B
  - Software-managed coherency
  - Small cache per thread
  - Atomics performed at memory
- User space only
  - Host processor provides system support
- Standard GCC compiler
  - Runtime provides access to the new instructions

New RISC-V Instructions

- Thread management: Thread Create, Return, Join
- Message management: Message Send, Broadcast, Receive, Listen
- Non-cached loads and stores: integer and float

System Architecture

[Block diagram: banks of DRAM and storage class memory (SCM) sit behind memory controllers with SRAM caches and atomic-operation units on the Memory Controller (MC) chiplet; a network on chip (NOC) connects them to the Hybrid Threading Processors on the Compute Near Memory (CNM) chiplet.]
Modeled Configurations

- Focused on two primary configurations:
  - 1x1 CNM configuration
    - 1 CNM chiplet
    - 2 MC chiplets
    - 4 GDDR6 memories
  - 2x2 CNM configuration
    - 4 CNM chiplets
    - 8 MC chiplets
    - 16 GDDR6 memories

[Diagrams: the 1x1 configuration places one CNM chiplet between two MC chiplets on an interposer, each MC chiplet driving GDDR6 DRAM; the 2x2 configuration tiles four CNM chiplets and eight MC chiplets, with HTP clusters joined by a NOC and CPI links between chiplets.]

Performance and Power

- Simulator models functionality and performance
  - Clocked simulation model
  - Models functionality
  - Models data paths, arbitration, and queueing
- Power estimation methodology
  - Break the design into major components: NOC, HTP, MC, GDDR6
  - Identify all RAM structures, ALUs, I/O, long signal runs (NOC), etc.
  - Determine power through foundry power-estimation tools
  - Determine application activity factors through simulation
  - Determine power for each chiplet and total solution power
Graph Spectral Clustering
GRAPH COMMUNITY DETECTION USING SPECTRAL METHODS

- Community detection using spectral methods uses linear algebra to compute eigenvalues for the adjacency matrix associated with a graph. The lowest eigenvalues can be used to partition the graph.
- Sparse data structures store the graph vertices, edges, and properties.
- Profile on an x86 system:

    Overhead  Symbol
    13.82%    [.] svd_ATxb
    13.36%    [.] svd_ATxb2
    10.70%    [.] svd_Axb
    10.01%    [.] substruct
     9.93%    [.] svd_Axb2
     6.59%    [.] _IO_vfscanf

Graph Spectral Clustering
SENSITIVITY ANALYSIS

- Sensitivity analysis to determine the optimal configuration
- [Charts: million edges per second vs. edges per vertex (0-200), one sweeping the memory access size (8B, 16B, 32B, 64B) and one sweeping the total thread count (128, 256, 512, 1024).]
- Other parameters
  - Clock rate
  - Cores per HTP processor

Graph Spectral Clustering
THREAD STATE MONITORING

- Provides insight into the run-time dynamics of the application
- [Chart: 1x1 configuration at 1.2 GHz over a ~70 ms run; thread count (0-900) vs. simulation time (ns), broken down by thread state: Idle, RdyToRun, PausedMem, PausedEvent.]
Graph Spectral Clustering
COMPARISON TO REFERENCE PLATFORMS

- Haswell (8 threads, 1 socket): 13.5 sec, 140 Watts, 1890 Joules (1.0x)
- Nvidia K80 (host + 1 GPU): 5.0 sec¹, 340 Watts, 1703 Joules (1.07x)
- Nvidia DGX-1 (host + P100, 1 P100): 3.95 sec¹ (note: system was at a cloud provider; no power info)
- NOC 1x1 config (1.2 GHz, simulated): 2.90 sec, 23.8 Watts, 69 Joules (27.4x)
- NOC 2x2 config (1.2 GHz, simulated): 0.814 sec, 90.9 Watts, 74 Joules (24.7x)

[Block diagrams of the simulated 1x1 and 2x2 systems: HTP clusters joined by NoC edge and hub routers, with CIPI chiplet-to-chiplet links, MC chiplets, and GDDR6 DRAM.]
Notes: 1.
Recommended publications
  • Antikernel: a Decentralized Secure Hardware-Software Operating System
  • Mali-400 MP: a Scalable GPU for Mobile Devices
  • Nyami: a Synthesizable GPU Architectural Model for General-Purpose and Graphics-Specific Workloads
  • Dynamic Task Scheduling and Binding for Many-Core Systems Through Stream Rewriting
  • An Overview of MIPS Multi-Threading White Paper
  • C for a Tiny System: Implementing C for a Tiny System and Making the Architecture More Suitable for C
  • Antikernel: a Decentralized Secure Hardware-Software Operating System Architecture
  • On the Exploration of the DRISC Architecture
  • Programming the Cray XMT at PNNL
  • Abstract PRET Machines
  • Concurrent Programming CLASS NOTES
  • Understanding SPARC Processor Performance