General Purpose Computing on Low-Power Embedded Gpus: Has It Come of Age?

Total Page:16

File Type:pdf, Size:1020Kb

General Purpose Computing on Low-Power Embedded Gpus: Has It Come of Age? General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age? Arian Maghazeh Unmesh D. Bordoloi Petru Eles Zebo Peng Department of Computer and Information Science, Linköpings Universitet, Sweden E-mail: {arian.maghazeh, unmesh.bordoloi, petru.eles, zebo.peng}@liu.se Abstract—In this paper we evaluate the promise held by low- Our contributions: We believe that the arrival of OpenCL- power GPUs for non-graphic workloads that arise in embedded enabled GPUs gives us an opportunity to usher in the era systems. Towards this, we map and implement 5 benchmarks, of heterogeneous computation for embedded devices as well. that find utility in very different application domains, to an However, unless GPGPU on low-power embedded platforms embedded GPU. Our results show that apart from accelerated can be unleashed, heterogeneous computation will remain performance, embedded GPUs are promising also because of their energy efficiency which is an important design goal for debilitated. The question has remained open whether (and what battery-driven mobile devices. We show that adopting the same kind of) non-graphics workloads may benefit from embedded optimization strategies as those used for programming high-end GPUs given that powerful multi-core CPUs on the same chip GPUs might lead to worse performance on embedded GPUs. This are already a promising choice. Towards this, we mapped and is due to restricted features of embedded GPUs, such as, limited implemented 5 benchmarks — Rijndael algorithm, bitcount, or no user-defined memory, small instruction-set, limited number of registers, among others. We propose techniques to overcome genetic programming, convolution and pattern matching — such challenges, e.g., by distributing the workload between GPUs to an embedded GPU from Vivante. These algorithms are and multi-core CPUs, similar to the spirit of heterogeneous deployed in numerous potential applications including automo- computation. tive (bitcount, convolution, security), radar or flight tracking I. INTRODUCTION Over the span of the last decade, researchers have re- (pattern matching) and image processing/augmented reality engineered sequential algorithms from a wide spectrum of ap- (convolution). Our choice was also driven by a desire to inves- plications to harness the parallelism offered by GPUs (Graphics tigate whether computationally heavy optimization algorithms Processing Units) and demonstrated tremendous performance (genetic programming), typically not suitable on embedded benefits [14]. In fact, brisk successes from early endeavors platforms, may become feasible with the advent of low power meant that GPGPU (General Purpose computing on Graphics GPUs. Our results show that embedded GPUs are indeed, Processing Units) was established as a specialization on its sometimes, attractive alternatives for non-graphics computation own. It is not uncommon, nowadays, to find commercial but, at the same time, intelligent tradeoffs/choices must be GPGPU applications in electronic design, scientific computing considered for each application between its GPU, multi-core and defense, among others. As GPGPU research has matured, and hybrid (heterogeneous) implementations. To the best of the emerging trend for future is “heterogeneous computation”, our knowledge, ours is the first paper to implement such non- where multi-cores, GPUs and other units are utilized syner- graphic workloads on embedded GPUs and compare against gistically to accelerate computationally expensive problems. sequential and multi-core implementations. Heterogeneous computation is well-poised to become the de- In the last 10 years, several hundred papers have been facto computational paradigm in the realm of servers and published on GPGPU but they were almost exclusively con- desktops [3]. cerned about high-end GPUs where maximizing the speedup Yet, in embedded devices, the role of GPUs has been has been the sole overarching aim. Our experiments reveal limited, until recently. Over the last 18 months, programmable that adopting the same optimization strategies as those used GPUs have penetrated embedded platforms providing graphics for high-performance GPU computing might lead to worse software designers with a powerful programmable engine. performance on embedded GPUs. This is due to restricted Typically, such embedded GPUs are programmed with features of embedded GPUs like limited or no user-defined OpenGL, Renderscript and other similar tools that require memory, limited size of cache, among several others. Moreover, prior experience in graphics programming and, in some cases, embedded GPUs share the same physical memory with the even expertise on the underlying graphics pipeline hardware CPUs, which implies higher contention for memory, imposing [14]. However, today, the stage seems to be set for a change new demands on programmers. We discuss techniques, such as several industrial players, either, already have, or are on as distribution of the workload between GPUs and multi-core the verge of releasing embedded platforms with low-power CPUs, similar to the spirit of heterogeneous computation to GPUs that are programmable by OpenCL. OpenCL is a overcome some of the challenges. framework for coding that enables seamless programming II. RELATED WORK across heterogeneous units including multi-cores, GPUs and other processors. Vivante’s embedded graphics cores [17], Major strides have been made in GPGPU [14] programming ARM’s Mali [12] graphics processors, the StemCell Processor over the years. Almost all threads of work, however, singularly [18] from ZiiLabs (Creative) are some examples of low-power focused on improving performance without discussing other embedded GPUs targeted for mobile devices. concerns that arise specifically in embedded GPUs. Hence, the 978-1-4799-0103-6/13/$31.00 ©2013 IEEE 1 existing body of work does not provide any insight into oppor- some benchmarks have a high ratio of compute-to-memory tunities and challenges of embedded GPGPU programming. Of access while others have a low ratio of compute-to-memory late, few attempts have been made to bridge this gap, however, access. Second, we believe that the benchmarks should cover almost all of them evaluated their work on high-end GPUs. a wide application range. As mentioned in Section I, the 5 In a recent paper, Gupta and Owens [4], discussed strategies benchmarks that we implemented may be utilized in automotive for memory optimizations for a speech recognition application. software, radar/flight tracking, image processing, augmented However, they did not discuss the impact of their algorithm reality and optimization tools. We hope that our work will spur on power or energy consumption. In fact, they evaluated their the embedded systems community to explore other potential performance results on a 9400M Nvidia GPU which is not applications on embedded GPUs. With these methodological targeted towards low-power devices such as hand-held smart choices in mind, we chose 5 benchmarks. The characteristics phones. Yet, we refer the interested reader to their paper for for each benchmark is individually discussed below. an excellent discussion on limitations that arise out of (i) on- We chose the Rijndael algorithm [2] that is specified in the chip memory sharing between GPU and CPU and (ii) limited Advanced Encryption Standard. Data security is increasingly or no L2 cache. These two characteristics are among several becoming more important as mobile and hand-held devices restrictions in embedded GPUs that we target in this paper. are beginning to host e-commerce activities. In fact, it is also We also note that Mu et al. [13] implemented the bench- gaining significance in vehicular systems as the internet con- marks from High Performance Embedded Computing Chal- tinues to penetrate the automotive applications. Moreover, the lenge Benchmark from MIT [15] on a GPU. However, their characteristics of the version we chose to implement inherently paper suffers from two significant drawbacks. First and fore- makes it an excellent candidate to stress and test the memory most, they did not study power or energy consumption which is bandwidth of the GPU because the algorithm relies heavily crucial for embedded devices. In fact, all the experiments were on look-up tables. It essentially has a low compute-to-memory performed on the power-hungry Fermi machine from Nvidia. access ratio. Second, their reported speedups do not include the overheads Second, we selected the bitcount application from the Au- of data transfer between the CPU and GPU. tomotive and Industrial Control suite of MiBench [5]. The bit- There has been some prior work [1], [6], [11], [16] related count is an important benchmark because low-level embedded to modeling power and energy consumption on GPUs. Again, control applications often require bit manipulation and basic this thread of work has focused on high performance GPU math functionalities. It counts the total number of bits set computing and desktop/server workloads, shedding no light on to 1 in an array of integers. Thus, it needs to sum several their suitability for low-power applications. integers, thereby stressing the integer datapath of the GPU. The Researchers from Nokia published results [10] that they bit count algorithm is also interesting because it involves two obtained from OpenCL programming of an embedded GPU. typical GPGPU algorithmic paradigms (i) “synchronization” Unfortunately, this work was limited to an image processing across the threads in the GPU and (ii) “reduction” in
Recommended publications
  • GPU Developments 2018
    GPU Developments 2018 2018 GPU Developments 2018 © Copyright Jon Peddie Research 2019. All rights reserved. Reproduction in whole or in part is prohibited without written permission from Jon Peddie Research. This report is the property of Jon Peddie Research (JPR) and made available to a restricted number of clients only upon these terms and conditions. Agreement not to copy or disclose. This report and all future reports or other materials provided by JPR pursuant to this subscription (collectively, “Reports”) are protected by: (i) federal copyright, pursuant to the Copyright Act of 1976; and (ii) the nondisclosure provisions set forth immediately following. License, exclusive use, and agreement not to disclose. Reports are the trade secret property exclusively of JPR and are made available to a restricted number of clients, for their exclusive use and only upon the following terms and conditions. JPR grants site-wide license to read and utilize the information in the Reports, exclusively to the initial subscriber to the Reports, its subsidiaries, divisions, and employees (collectively, “Subscriber”). The Reports shall, at all times, be treated by Subscriber as proprietary and confidential documents, for internal use only. Subscriber agrees that it will not reproduce for or share any of the material in the Reports (“Material”) with any entity or individual other than Subscriber (“Shared Third Party”) (collectively, “Share” or “Sharing”), without the advance written permission of JPR. Subscriber shall be liable for any breach of this agreement and shall be subject to cancellation of its subscription to Reports. Without limiting this liability, Subscriber shall be liable for any damages suffered by JPR as a result of any Sharing of any Material, without advance written permission of JPR.
    [Show full text]
  • Mali-400 MP: a Scalable GPU for Mobile Devices
    Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson Director, Graphics Research, ARM Outline . ARM and Mobile Graphics . Design Constraints for Mobile GPUs . Mali Architecture Overview . Multicore Scaling in Mali-400 MP . Results 2 About ARM . World’s leading supplier of semiconductor IP . Processor Architectures and Implementations . Related IP: buses, caches, debug & trace, physical IP . Software tools and infrastructure . Business Model . License fees . Per-chip royalties . Graphics at ARM . Acquired Falanx in 2006 . ARM Mali is now the world’s most widely licensed GPU family 3 Challenges for Mobile GPUs . Size . Power . Memory Bandwidth 4 More Challenges . Graphics is going into “anything that has a screen” . Mobile . Navigation . Set Top Box/DTV . Automotive . Video telephony . Cameras . Printers . Huge range of form factors, screen sizes, power budgets, and performance requirements . In some applications, a huge difference between peak and average performance requirements 5 Solution: Scalability . Address a wide variety of performance points and applications with a single IP and a single software stack. Need static scalability to adapt to different peak requirements in different platforms / markets . Need dynamic scalability to reduce power when peak performance isn’t needed 6 Options for Scalability . Fine-grained: Multiple pipes, wide SIMD, etc . Proven approach, efficient and effective . But, adding pipes / lanes is invasive . Hard for IP licensees to do on their own . And, hard to partition to provide dynamic scalability . Coarse-grained: Multicore . Easy for licensees to select desired performance . Putting cores on separate power islands allows dynamic scaling 7 Mali 400-MP Top Level Architecture Asynch Mali-400 MP Top-Level APB Geometry Pixel Processor Pixel Processor Pixel Processor Pixel Processor Processor #1 #2 #3 #4 CLKs MaliMMUs RESETs IRQs IDLEs MaliL2 AXI .
    [Show full text]
  • Intel Desktop Board DG41MJ Product Guide
    Intel® Desktop Board DG41MJ Product Guide Order Number: E59138-001 Revision History Revision Revision History Date -001 First release of the Intel® Desktop Board DG41MJ Product Guide February 2009 If an FCC declaration of conformity marking is present on the board, the following statement applies: FCC Declaration of Conformity This device complies with Part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation. For questions related to the EMC performance of this product, contact: Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124 1-800-628-8686 This equipment has been tested and found to comply with the limits for a Class B digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference in a residential installation. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instructions, may cause harmful interference to radio communications. However, there is no guarantee that interference will not occur in a particular installation. If this equipment does cause harmful interference to radio or television reception, which can be determined by turning the equipment off and on, the user is encouraged to try to correct the interference by one or more of the following measures: • Reorient or relocate the receiving antenna. • Increase the separation between the equipment and the receiver. • Connect the equipment to an outlet on a circuit other than the one to which the receiver is connected.
    [Show full text]
  • GMAI: a GPU Memory Allocation Inspection Tool for Understanding and Exploiting the Internals of GPU Resource Allocation in Critical Systems
    The final publication is available at ACM via http://dx.doi.org/ 10.1145/3391896 GMAI: A GPU Memory Allocation Inspection Tool for Understanding and Exploiting the Internals of GPU Resource Allocation in Critical Systems ALEJANDRO J. CALDERÓN, Universitat Politècnica de Catalunya & Ikerlan Technology Research Centre LEONIDAS KOSMIDIS, Barcelona Supercomputing Center (BSC) CARLOS F. NICOLÁS, Ikerlan Technology Research Centre FRANCISCO J. CAZORLA, Barcelona Supercomputing Center (BSC) PEIO ONAINDIA, Ikerlan Technology Research Centre Critical real-time systems require strict resource provisioning in terms of memory and timing. The constant need for higher performance in these systems has led industry to recently include GPUs. However, GPU software ecosystems are by their nature closed source, forcing system engineers to consider them as black boxes, complicating resource provisioning. In this work we reverse engineer the internal operations of the GPU system software to increase the understanding of their observed behaviour and how resources are internally managed. We present our methodology which is incorporated in GMAI (GPU Memory Allocation Inspector), a tool which allows system engineers to accurately determine the exact amount of resources required by their critical systems, avoiding underprovisioning. We first apply our methodology on a wide range of GPU hardware from different vendors showing itsgeneralityin obtaining the properties of the GPU memory allocators. Next, we demonstrate the benefits of such knowledge in resource provisioning of two case studies from the automotive domain, where the actual memory consumption is up to 5.6× more than the memory requested by the application. ACM Reference Format: Alejandro J. Calderón, Leonidas Kosmidis, Carlos F. Nicolás, Francisco J.
    [Show full text]
  • CES 2016 Exhibitor Listing As of 1/19/16
    CES 2016 Exhibitor Listing as of 1/19/16 Name Booth * Cosmopolitan Vdara Hospitality Suites 1 Esource Technology Co., Ltd. 26724 10 Vins 80642 12 Labs 73846 1Byone Products Inc. 21953 2 the Max Asia Pacific Ltd. 72163 2017 Exhibit Space Selection 81259 3 Legged Thing Ltd 12045 360fly 10417 360-G GmbH 81250 360Heros Inc 26417 3D Fuel 73113 3D Printlife 72323 3D Sound Labs 80442 3D Systems 72721 3D Vision Technologies Limited 6718 3DiVi Company 81532 3Dprintler.com 80655 3DRudder 81631 3Iware Co.,Ltd. 45005 3M 31411 3rd Dimension Industrial 3D Printing 73108 4DCulture Inc. 58005 4DDynamics 35483 4iiii Innovations, Inc. 73623 5V - All In One HC 81151 6SensorLabs BT31 Page 1 of 135 6sensorlabs / Nima 81339 7 Medical 81040 8 Locations Co., Ltd. 70572 8A Inc. 82831 A&A Merchandising Inc. 70567 A&D Medical 73939 A+E Networks Aria 36, Aria 53 AAC Technologies Holdings Inc. Suite 2910 AAMP Global 2809 Aaron Design 82839 Aaudio Imports Suite 30-116 AAUXX 73757 Abalta Technologies Suite 2460 ABC Trading Solution 74939 Abeeway 80463 Absolare USA LLC Suite 29-131 Absolue Creations Suite 30-312 Acadia Technology Inc. 20365 Acapella Audio Arts Suite 30-215 Accedo Palazzo 50707 Accele Electronics 1110 Accell 20322 Accenture Toscana 3804 Accugraphic Sales 82423 Accuphase Laboratory Suite 29-139 ACE CAD Enterprise Co., Ltd 55023 Ace Computers/Ace Digital Home 20318 ACE Marketing Inc. 59025 ACE Marketing Inc. 31622 ACECAD Digital Corp./Hongteli, DBA Solidtek 31814 USA Acelink Technology Co., Ltd. Suite 2660 Acen Co.,Ltd. 44015 Page 2 of 135 Acesonic USA 22039 A-Champs 74967 ACIGI, Fujiiryoki USA/Dr.
    [Show full text]
  • Vivante Corporation Introduction
    World’s Smallest OpenGL ES 2.0 Silicon Introduction to Vivante Graphics Processor IP Khronos DevU - December 2009 2009 Silicon Proven in the Marketplace Vivante GPUs to ship in tier one consumer products four key market segments Smart Phone Digital Picture & Frame Mobile MID Gaming Netbook Home Embedded Entertainment 2009 World’s Smallest OpenGL ES 2.0 GPU Technology Leader in GPU IP Delivers 2x Advantages • Most efficient solution • 2x Performance per mm2 • Support 1080HD and higher • 30+ Licensees • Silicon proven in multiple applications 2009 smaller faster cooler 30+ Licensees Vivante Marketplace GC2000 in 65LP 500 MHz 100 Mtri/s 1.0 Gpix/s 18 GFLOPS GC400 in 65LP 2.5 mm2 silicon 49 mW active 13 Mtri/s 125 Mpix/s CAMERA ∗ MOBILE PHONE ∗ NAVIGATION ∗ PRINTER ∗ AUTOMOTIVE ∗ NETBOOK ∗ MID ∗ SET-TOP BOX ∗ HDTV ∗ MOBILE GAMING DTV Digital Smartphone MID Blu-ray Picture Camera GPS Netbook Frame Set-top box 2009 Vivante Differentiation ScalarMorphicTM Architecture GC GPU Core Multiple dimensions of AHB AXI scalability for optimal Host Interface cost/area/power balance Memory Controller Unified shaders Scalable multicore design 3-D Pipeline Texture Hyper-threading virtually Engine Engine 3-D RenderingEngine eliminates latency effects Engine Simple integration Ultra-threaded Unified Shader Ultra-threadedUltra-threaded Unified Unified Shader Shader Low bus bandwidth Graphics Up to 256 threads per Shader Pipeline Pixel Produces a small area, Front Engine high performance End 2-D Pipeline 2-D Drawing and Scaling Engine 2009 Complete Graphics
    [Show full text]
  • The Openvx™ Specification
    The OpenVX™ Specification Version 1.0.1 Document Revision: r31169 Generated on Wed May 13 2015 08:41:43 Khronos Vision Working Group Editor: Susheel Gautam Editor: Erik Rainey Copyright ©2014 The Khronos Group Inc. i Copyright ©2014 The Khronos Group Inc. All Rights Reserved. This specification is protected by copyright laws and contains material proprietary to the Khronos Group, Inc. It or any components may not be reproduced, republished, distributed, transmitted, displayed, broadcast or otherwise exploited in any manner without the express prior written permission of Khronos Group. You may use this specifica- tion for implementing the functionality therein, without altering or removing any trademark, copyright or other notice from the specification, but the receipt or possession of this specification does not convey any rights to reproduce, disclose, or distribute its contents, or to manufacture, use, or sell anything that it may describe, in whole or in part. Khronos Group grants express permission to any current Promoter, Contributor or Adopter member of Khronos to copy and redistribute UNMODIFIED versions of this specification in any fashion, provided that NO CHARGE is made for the specification and the latest available update of the specification for any version of the API is used whenever possible. Such distributed specification may be re-formatted AS LONG AS the contents of the specifi- cation are not changed in any way. The specification may be incorporated into a product that is sold as long as such product includes significant independent work developed by the seller. A link to the current version of this specification on the Khronos Group web-site should be included whenever possible with specification distributions.
    [Show full text]
  • Intel® Desktop Board DG965RY Product Guide
    Intel® Desktop Board DG965RY Product Guide Order Number: D46818-003 Revision History Revision Revision History Date -001 First release of the Intel® Desktop Board DG965RY Product Guide June 2006 -002 Second release of the Intel® Desktop Board DG965RY Product Guide September 2006 -003 Added operating system support December 2006 If an FCC declaration of conformity marking is present on the board, the following statement applies: FCC Declaration of Conformity This device complies with Part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation. For questions related to the EMC performance of this product, contact: Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124 1-800-628-8686 This equipment has been tested and found to comply with the limits for a Class B digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference in a residential installation. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instructions, may cause harmful interference to radio communications. However, there is no guarantee that interference will not occur in a particular installation. If this equipment does cause harmful interference to radio or television reception, which can be determined by turning the equipment off and on, the user is encouraged to try to correct the interference by one or more of the following measures: • Reorient or relocate the receiving antenna.
    [Show full text]
  • The Openvx™ Specification
    The OpenVX™ Specification Version 1.2 Document Revision: dba1aa3 Generated on Wed Oct 11 2017 20:00:10 Khronos Vision Working Group Editor: Stephen Ramm Copyright ©2016-2017 The Khronos Group Inc. i Copyright ©2016-2017 The Khronos Group Inc. All Rights Reserved. This specification is protected by copyright laws and contains material proprietary to the Khronos Group, Inc. It or any components may not be reproduced, republished, distributed, transmitted, displayed, broadcast or otherwise exploited in any manner without the express prior written permission of Khronos Group. You may use this specifica- tion for implementing the functionality therein, without altering or removing any trademark, copyright or other notice from the specification, but the receipt or possession of this specification does not convey any rights to reproduce, disclose, or distribute its contents, or to manufacture, use, or sell anything that it may describe, in whole or in part. Khronos Group grants express permission to any current Promoter, Contributor or Adopter member of Khronos to copy and redistribute UNMODIFIED versions of this specification in any fashion, provided that NO CHARGE is made for the specification and the latest available update of the specification for any version of the API is used whenever possible. Such distributed specification may be re-formatted AS LONG AS the contents of the specifi- cation are not changed in any way. The specification may be incorporated into a product that is sold as long as such product includes significant independent work developed by the seller. A link to the current version of this specification on the Khronos Group web-site should be included whenever possible with specification distributions.
    [Show full text]
  • Intel® Desktop Board DG41TX Product Guide
    Intel® Desktop Board DG41TX Product Guide Order Number: E88309-001 Revision History Revision Revision History Date ® -001 First release of the Intel Desktop Board DG41TX Product Guide February 2010 Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Intel Desktop Board DG41TX may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained from Intel Corporation by going to the World Wide Web site at: http://intel.com/ or by calling 1-800-548-4725. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others.
    [Show full text]
  • The Bifrost GPU Architecture and the ARM Mali-G71 GPU
    The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP . ARM licenses Soft IP cores (amongst other things) to our Silicon Partners . They then make chips and sell to OEMs, who sell consumer devices . “ARM doesn’t make chips”… . We provide all the RTL, integration testbenches, memories lists, reference floorplans, example synthesis scripts, sometimes models, sometimes FPGA images, sometimes with implementation advice, always with memory system requirements/recommendations . Consequently silicon area, power, frequencies, performance, benchmark scores can therefore vary quite a bit in real silicon… 2 © ARM2016 ARM Mali: The world’s #1 shipping GPU 750M 65 27 Mali-based new Mali GPUs shipped 140 Total licenses in in 2015 Total licenses licensees FY15 Mali is in: Mali graphics based IC shipments (units) 750m ~75% ~50% ~40% 550m of of of 400m DTVs… tablets… smartphones 150m <50m 2011 2012 2013 2014 2015 3 © ARM2016 ARM Mali graphics processor generations Mali-G71 BIFROST … GPU Unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL MIDGARD Mali-T600 GPU series Mali-T700 GPU series Mali-T800 GPU series Unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan Presented at HotChips 2015 Mali-200 Mali-300 Mali-400 Mali-450 Mali-470 UTGARD GPU GPU GPU GPU GPU Separate shader cores, SIMD ISA, OpenGL ES 2.x 4 © ARM2016 Bifrost features . Leverages Mali’s scalable architecture . Scalable to 32 shader cores 20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* .
    [Show full text]
  • AI Acceleration – Image Processing As a Use Case
    AI acceleration – image processing as a use case Masoud Daneshtalab Associate Professor at MDH www.idt.mdh.se/~md/ Outline • Dark Silicon • Heterogeneouse Computing • Approximation – Deep Neural Networks 2 The glory of Moore’s law Intel 4004 Intel Core i7 980X 2300 transistors 1.17B transistors 740 kHz clock 3.33 GHz clock 10um process 32nm process 10.8 usec/inst 73.4 psec/inst Every 2 Years • Double the number of transistors • Build higher performance general-purpose processors ⁻ Make the transistors available to masses ⁻ Increase performance (1.8×↑) ⁻ Lower the cost of computing (1.8×↓) Semiconductor trends ITRS roadmap for SoC Design Complexity Trens ITRS: International Technology Roadmap for Semiconductors Expected number of processing elements into a System-on-Chip (SoC). 4 Semiconductor trends ITRS roadmap for SoC Design Complexity Trens 5 Semiconductor trends Network-on-Chip (NoC) –based Multi/Many-core Systems Founder: Andreas Olofsson Sponsored by Ericsson AB 6 Dark Silicon Era The catch is powering exponentially increasing number of transistors without melting the chip down. 10 10 Chip Transistor Count 2,200,000,000 109 Chip Power 108 107 106 105 2300 104 103 130 W 102 101 0.5 W 100 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 If you cannot power them, why bother making them? 7 Dark Silicon Era The catch is powering exponentially increasing number of transistors without melting the chip down. 10 10 Chip Transistor Count 2,200,000,000 109 Chip Power 108 107 Dark Silicon 106 105 (Utilization Wall) 2300 104 Fraction of transistors that need to be 103 powered off at all times due to power 130 W 2 constraints.
    [Show full text]