Mali-400 MP: a Scalable GPU for Mobile Devices
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
GPU Developments 2018
GPU Developments 2018 2018 GPU Developments 2018 © Copyright Jon Peddie Research 2019. All rights reserved. Reproduction in whole or in part is prohibited without written permission from Jon Peddie Research. This report is the property of Jon Peddie Research (JPR) and made available to a restricted number of clients only upon these terms and conditions. Agreement not to copy or disclose. This report and all future reports or other materials provided by JPR pursuant to this subscription (collectively, “Reports”) are protected by: (i) federal copyright, pursuant to the Copyright Act of 1976; and (ii) the nondisclosure provisions set forth immediately following. License, exclusive use, and agreement not to disclose. Reports are the trade secret property exclusively of JPR and are made available to a restricted number of clients, for their exclusive use and only upon the following terms and conditions. JPR grants site-wide license to read and utilize the information in the Reports, exclusively to the initial subscriber to the Reports, its subsidiaries, divisions, and employees (collectively, “Subscriber”). The Reports shall, at all times, be treated by Subscriber as proprietary and confidential documents, for internal use only. Subscriber agrees that it will not reproduce for or share any of the material in the Reports (“Material”) with any entity or individual other than Subscriber (“Shared Third Party”) (collectively, “Share” or “Sharing”), without the advance written permission of JPR. Subscriber shall be liable for any breach of this agreement and shall be subject to cancellation of its subscription to Reports. Without limiting this liability, Subscriber shall be liable for any damages suffered by JPR as a result of any Sharing of any Material, without advance written permission of JPR. -
Antikernel: a Decentralized Secure Hardware-Software Operating
Antikernel A Decentralized Secure Hardware-Software Operating System Andrew Zonenberg (@azonenberg) Senior Security Consultant, IOActive Bülent Yener Professor, Rensselaer Polytechnic Institute This work is based on Zonenberg’s 2015 doctoral dissertation, advised by Yener. IOActive, Inc. Copyright ©2016. All Rights Reserved. Kernel mode = full access to all state • What OS code needs this level of access? – Memory manager only needs heap metadata – Scheduler only needs run queue – Drivers only need their peripheral – Nothing needs access to state of user-mode apps • No single subsystem that needs access to all state • Any code with ring 0 privs is incompatible with LRP! IOActive, Inc. Copyright ©2016. All Rights Reserved. Monolithic kernel, microkernel, … Huge Huge attack surface Small Better 0 code size - Ring Can we get here? None Monolithic Micro ??? IOActive, Inc. Copyright ©2016. All Rights Reserved. Exokernel (MIT, 1995) • OS abstractions can often hurt performance – You don’t need a full FS to store temporary data on disk • Split protection / segmentation from abstraction Word proc FS Cache Disk driver DHCP server IOActive, Inc. Copyright ©2016. All Rights Reserved. Exokernel (MIT, 1995) • OS does very little: – Divide resources into blocks (CPU time quanta, RAM pages…) – Provide controlled access to them IOActive, Inc. Copyright ©2016. All Rights Reserved. But wait, there’s more… • By removing non-security abstractions from the kernel, we shrink the TCB and thus the attack surface! IOActive, Inc. Copyright ©2016. All Rights Reserved. So what does the kernel have to do? • Well, obviously a few things… – Share CPU time between multiple processes – Allow processes to talk to hardware/drivers – Allow processes to talk to each other – Page-level RAM allocation/access control IOActive, Inc. -
Intel Desktop Board DG41MJ Product Guide
Intel® Desktop Board DG41MJ Product Guide Order Number: E59138-001 Revision History Revision Revision History Date -001 First release of the Intel® Desktop Board DG41MJ Product Guide February 2009 If an FCC declaration of conformity marking is present on the board, the following statement applies: FCC Declaration of Conformity This device complies with Part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation. For questions related to the EMC performance of this product, contact: Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124 1-800-628-8686 This equipment has been tested and found to comply with the limits for a Class B digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference in a residential installation. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instructions, may cause harmful interference to radio communications. However, there is no guarantee that interference will not occur in a particular installation. If this equipment does cause harmful interference to radio or television reception, which can be determined by turning the equipment off and on, the user is encouraged to try to correct the interference by one or more of the following measures: • Reorient or relocate the receiving antenna. • Increase the separation between the equipment and the receiver. • Connect the equipment to an outlet on a circuit other than the one to which the receiver is connected. -
GMAI: a GPU Memory Allocation Inspection Tool for Understanding and Exploiting the Internals of GPU Resource Allocation in Critical Systems
The final publication is available at ACM via http://dx.doi.org/ 10.1145/3391896 GMAI: A GPU Memory Allocation Inspection Tool for Understanding and Exploiting the Internals of GPU Resource Allocation in Critical Systems ALEJANDRO J. CALDERÓN, Universitat Politècnica de Catalunya & Ikerlan Technology Research Centre LEONIDAS KOSMIDIS, Barcelona Supercomputing Center (BSC) CARLOS F. NICOLÁS, Ikerlan Technology Research Centre FRANCISCO J. CAZORLA, Barcelona Supercomputing Center (BSC) PEIO ONAINDIA, Ikerlan Technology Research Centre Critical real-time systems require strict resource provisioning in terms of memory and timing. The constant need for higher performance in these systems has led industry to recently include GPUs. However, GPU software ecosystems are by their nature closed source, forcing system engineers to consider them as black boxes, complicating resource provisioning. In this work we reverse engineer the internal operations of the GPU system software to increase the understanding of their observed behaviour and how resources are internally managed. We present our methodology which is incorporated in GMAI (GPU Memory Allocation Inspector), a tool which allows system engineers to accurately determine the exact amount of resources required by their critical systems, avoiding underprovisioning. We first apply our methodology on a wide range of GPU hardware from different vendors showing itsgeneralityin obtaining the properties of the GPU memory allocators. Next, we demonstrate the benefits of such knowledge in resource provisioning of two case studies from the automotive domain, where the actual memory consumption is up to 5.6× more than the memory requested by the application. ACM Reference Format: Alejandro J. Calderón, Leonidas Kosmidis, Carlos F. Nicolás, Francisco J. -
Nyami: a Synthesizable GPU Architectural Model for General-Purpose and Graphics-Specific Workloads
Nyami: A Synthesizable GPU Architectural Model for General-Purpose and Graphics-Specific Workloads Jeff Bush Philip Dexter†, Timothy N. Miller†, and Aaron Carpenter⇤ San Jose, California †Dept. of Computer Science [email protected] ⇤ Dept. of Electrical & Computer Engineering Binghamton University {pdexter1, millerti, carpente}@binghamton.edu Abstract tempt to bridge this gap, Intel developed Larrabee (now called Graphics processing units (GPUs) continue to grow in pop- Xeon Phi) [25]. Larrabee is architected around small in-order ularity for general-purpose, highly parallel, high-throughput cores with wide vector ALUs to facilitate graphics rendering and systems. This has forced GPU vendors to increase their fo- multi-threading to hide instruction latencies. The use of small, cus on general purpose workloads, sometimes at the expense simple processor cores allows many cores to be packed onto a of the graphics-specific workloads. Using GPUs for general- single die and into a limited power envelope. purpose computation is a departure from the driving forces be- Although GPUs were originally designed to render images for hind programmable GPUs that were focused on a narrow subset visual displays, today they are used frequently for more general- of graphics rendering operations. Rather than focus on purely purpose applications. However, they must still efficiently per- graphics-related or general-purpose use, we have designed and form what would be considered traditional graphics tasks (i.e. modeled an architecture that optimizes for both simultaneously rendering images onto a screen). GPUs optimized for general- to efficiently handle all GPU workloads. purpose computing may downplay graphics-specific optimiza- In this paper, we present Nyami, a co-optimized GPU archi- tions, even going so far as to offload them to software. -
Intel® Desktop Board DG965RY Product Guide
Intel® Desktop Board DG965RY Product Guide Order Number: D46818-003 Revision History Revision Revision History Date -001 First release of the Intel® Desktop Board DG965RY Product Guide June 2006 -002 Second release of the Intel® Desktop Board DG965RY Product Guide September 2006 -003 Added operating system support December 2006 If an FCC declaration of conformity marking is present on the board, the following statement applies: FCC Declaration of Conformity This device complies with Part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation. For questions related to the EMC performance of this product, contact: Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124 1-800-628-8686 This equipment has been tested and found to comply with the limits for a Class B digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference in a residential installation. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instructions, may cause harmful interference to radio communications. However, there is no guarantee that interference will not occur in a particular installation. If this equipment does cause harmful interference to radio or television reception, which can be determined by turning the equipment off and on, the user is encouraged to try to correct the interference by one or more of the following measures: • Reorient or relocate the receiving antenna. -
Intel® Desktop Board DG41TX Product Guide
Intel® Desktop Board DG41TX Product Guide Order Number: E88309-001 Revision History Revision Revision History Date ® -001 First release of the Intel Desktop Board DG41TX Product Guide February 2010 Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Intel Desktop Board DG41TX may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained from Intel Corporation by going to the World Wide Web site at: http://intel.com/ or by calling 1-800-548-4725. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others. -
The Bifrost GPU Architecture and the ARM Mali-G71 GPU
The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP . ARM licenses Soft IP cores (amongst other things) to our Silicon Partners . They then make chips and sell to OEMs, who sell consumer devices . “ARM doesn’t make chips”… . We provide all the RTL, integration testbenches, memories lists, reference floorplans, example synthesis scripts, sometimes models, sometimes FPGA images, sometimes with implementation advice, always with memory system requirements/recommendations . Consequently silicon area, power, frequencies, performance, benchmark scores can therefore vary quite a bit in real silicon… 2 © ARM2016 ARM Mali: The world’s #1 shipping GPU 750M 65 27 Mali-based new Mali GPUs shipped 140 Total licenses in in 2015 Total licenses licensees FY15 Mali is in: Mali graphics based IC shipments (units) 750m ~75% ~50% ~40% 550m of of of 400m DTVs… tablets… smartphones 150m <50m 2011 2012 2013 2014 2015 3 © ARM2016 ARM Mali graphics processor generations Mali-G71 BIFROST … GPU Unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL MIDGARD Mali-T600 GPU series Mali-T700 GPU series Mali-T800 GPU series Unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan Presented at HotChips 2015 Mali-200 Mali-300 Mali-400 Mali-450 Mali-470 UTGARD GPU GPU GPU GPU GPU Separate shader cores, SIMD ISA, OpenGL ES 2.x 4 © ARM2016 Bifrost features . Leverages Mali’s scalable architecture . Scalable to 32 shader cores 20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* . -
AI Acceleration – Image Processing As a Use Case
AI acceleration – image processing as a use case Masoud Daneshtalab Associate Professor at MDH www.idt.mdh.se/~md/ Outline • Dark Silicon • Heterogeneouse Computing • Approximation – Deep Neural Networks 2 The glory of Moore’s law Intel 4004 Intel Core i7 980X 2300 transistors 1.17B transistors 740 kHz clock 3.33 GHz clock 10um process 32nm process 10.8 usec/inst 73.4 psec/inst Every 2 Years • Double the number of transistors • Build higher performance general-purpose processors ⁻ Make the transistors available to masses ⁻ Increase performance (1.8×↑) ⁻ Lower the cost of computing (1.8×↓) Semiconductor trends ITRS roadmap for SoC Design Complexity Trens ITRS: International Technology Roadmap for Semiconductors Expected number of processing elements into a System-on-Chip (SoC). 4 Semiconductor trends ITRS roadmap for SoC Design Complexity Trens 5 Semiconductor trends Network-on-Chip (NoC) –based Multi/Many-core Systems Founder: Andreas Olofsson Sponsored by Ericsson AB 6 Dark Silicon Era The catch is powering exponentially increasing number of transistors without melting the chip down. 10 10 Chip Transistor Count 2,200,000,000 109 Chip Power 108 107 106 105 2300 104 103 130 W 102 101 0.5 W 100 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 If you cannot power them, why bother making them? 7 Dark Silicon Era The catch is powering exponentially increasing number of transistors without melting the chip down. 10 10 Chip Transistor Count 2,200,000,000 109 Chip Power 108 107 Dark Silicon 106 105 (Utilization Wall) 2300 104 Fraction of transistors that need to be 103 powered off at all times due to power 130 W 2 constraints. -
Dynamic Task Scheduling and Binding for Many-Core Systems Through Stream Rewriting
Dynamic Task Scheduling and Binding for Many-Core Systems through Stream Rewriting Dissertation zur Erlangung des akademischen Grades Doktor-Ingenieur (Dr.-Ing.) der Fakultät für Informatik und Elektrotechnik der Universität Rostock vorgelegt von Lars Middendorf, geb. am 21.09.1982 in Iserlohn aus Rostock Rostock, 03.12.2014 Gutachter Prof. Dr.-Ing. habil. Christian Haubelt Lehrstuhl "Eingebettete Systeme" Institut für Angewandte Mikroelektronik und Datentechnik Universität Rostock Prof. Dr.-Ing. habil. Heidrun Schumann Lehrstuhl Computergraphik Institut für Informatik Universität Rostock Prof. Dr.-Ing. Michael Hübner Lehrstuhl für Eingebettete Systeme der Informationstechnik Fakultät für Elektrotechnik und Informationstechnik Ruhr-Universität Bochum Datum der Abgabe: 03.12.2014 Datum der Verteidigung: 05.03.2015 Acknowledgements First of all, I would like to thank my supervisor Prof. Dr. Christian Haubelt for his guidance during the years, the scientific assistance to write this thesis, and the chance to research on a very individual topic. In addition, I thank my colleagues for interesting discussions and a pleasant working environment. Finally, I would like to thank my family for supporting and understanding me. Contents 1 INTRODUCTION........................................................................................................................1 1.1 STREAM REWRITING ...................................................................................................................5 1.2 RELATED WORK .........................................................................................................................7 -
An Overview of MIPS Multi-Threading White Paper
Public Imagination Technologies An Overview of MIPS Multi-Threading White Paper Copyright © Imagination Technologies Limited. All Rights Reserved. This document is Public. This publication contains proprietary information which is subject to change without notice and is supplied ‘as is’, without any warranty of any kind. Filename : Overview_of_MIPS_Multi_Threading.docx Version : 1.0.3 Issue Date : 19 Dec 2016 Author : Imagination Technologies 1 Revision 1.0.3 Imagination Technologies Public Contents 1. Motivations for Multi-threading ................................................................................................. 3 2. Performance Gains from Multi-threading ................................................................................. 4 3. Types of Multi-threading ............................................................................................................ 4 3.1. Coarse-Grained MT ............................................................................................................ 4 3.2. Fine-Grained MT ................................................................................................................ 5 3.3. Simultaneous MT ................................................................................................................ 6 4. MIPS Multi-threading .................................................................................................................. 6 5. R6 Definition of MT: Virtual Processors .................................................................................. -
C for a Tiny System Implementing C for a Tiny System and Making the Architecture More Suitable for C
Abstract We have implemented support for Padauk microcontrollers, tiny 8-Bit devices with 60 B to 256 B of RAM, in the Small Device C Compiler (SDCC), showing that the use of (mostly) standard C to program such minimal devices is feasible. We report on our experience and on the difficulties in supporting the hardware multithreading present on some of these devices. To make the devices a better target for C, we propose various enhancements of the architecture, and empirically evaluated their impact on code size. arXiv:2010.04633v1 [cs.PL] 9 Oct 2020 1 C for a tiny system Implementing C for a tiny system and making the architecture more suitable for C Philipp Klaus Krause, Nicolas Lesser October 12, 2020 1 The architecture Padauk microcontrollers use a Harvard architecture with an OTP or Flash pro- gram memory and an 8 bit wide RAM data memory. There also is a third address space for input / output registers. These three memories are accessed using separate instructions. Figure 1 shows the 4 architecture variants, which are commonly called pdk13, pdk14, pdk15 and pdk16 (these names are different from the internal names found in files from the manufacturer-provided IDE) by the width of their program memory. Each instruction is exactly one word in program memory. Most instructions execute in a single cycle, the few excep- tions take 2 cycles. Most instructions use either implicit addressing or direct addressing; the latter usually use the accumulator and one memory operand and write their result into the accumulator or memory. On the pdk13, pdk14 and pdk15, the bit set, reset and test instructions, which use direct addressing, can only access the lower half of the data address space.