Artificial Intelligence

Artificial Intelligence: The Next Technology Frontier

EQUITY RESEARCH, INDUSTRY UPDATE, June 3, 2021
TECHNOLOGY/SEMICONDUCTORS & COMPONENTS
Oppenheimer & Co Inc., 85 Broad Street, New York, NY 10004. Tel: 800-221-5588, Fax: 212-667-8229
Analysts: Rick Schafer (720-554-1119), Wei Mok (212-667-8387), Andrew Hummel, CFA (312-360-5946)
Disseminated: June 3, 2021 23:45 EDT; Produced: June 3, 2021 23:36 EDT
For analyst certification and important disclosures, see the Disclosure Appendix.

SUMMARY

Artificial Intelligence, once the stuff of science fiction, has arrived. Interest is high and adoption increasing from supercomputers to smartphones. Investors have taken note and rewarded early leaders like NVIDIA. Advances in semiconductors and software have enabled sophisticated neural networks, further accelerating AI development. Models continue to grow in size and sophistication, delivering transformative breakthroughs in image recognition, natural language processing, and recommendation systems. We see AI as a leading catalyst for Industry 4.0, a disruptive technology with broad societal/economic benefits. In this report, we explore key concepts underpinning the evolution of AI from a hardware and software perspective. We consulted more than a dozen leading public and private companies working on the latest AI platforms. We see a large and rapidly expanding AI accelerator opportunity. We estimate AI hardware platform TAM at $105B by 2025, a 34% CAGR.

KEY POINTS

■ AI/ML/DL: Artificial Intelligence enables machines to simulate human intelligence. Machine Learning (ML) is one of the most prevalent AI techniques, where data-trained models allow machines to make informed predictions. Within ML, Deep Learning (DL) uses Artificial Neural Networks to replicate the compute capabilities of biological neurons. DL is showing promise in AI research, providing machines the ability to self-learn.

■ Drivers: We highlight three factors driving the latest DL breakthroughs: 1) Rapid Data Growth: global data is expected to reach 180ZB by 2025 (25% CAGR), necessitating AI to process this data and create meaningful inferences; 2) Advanced Processors: the decline of Moore’s Law and shift to heterogeneous computing have sparked specialized AI silicon development, providing critical performance gains; 3) Neural Networks: DL performance scales with exponential data and neural network model growth.

■ Hardware/Software: As Moore’s Law sunsets, we see diminishing performance gains from transistor shrinkage. Semiconductor engineers are increasingly focused on architectural improvements. The market is seeing a growing trend toward heterogeneous computing, where multiple processors (GPUs, ASICs, FPGAs, DPUs, CPUs) work together to improve performance. Software is critical to accelerated AI performance and seeing corresponding incremental investment.

■ Applications/Markets: AI workloads are classified as Training or Inference. Training is the creation of an AI model through repetitive data processing/learning. Training is compute-intensive, requiring the most advanced AI hardware/software. Generally located in hyperscale DC, we estimate training TAM at $21B by 2025. Inference utilizes a trained model to predict results from a dataset. We see inference increasingly moving to edge devices, improving speed/cost. Led by Smartphones/PCs/IoT/Robotics/Auto, we see an $84B Edge market by 2025.

■ Competitive Backdrop: NVIDIA is the clear AI leader, with dominant training share (~99%) and growing inference (~20%). Being nimble is key, as competitors must adapt quickly to a rapidly changing market. Hyperscalers are developing in-house AI solutions for custom/proprietary workloads, where merchant silicon is not available. Traditional semi vendors are consolidating to strengthen Cloud/Edge AI offerings. AI has also inspired a wave of semiconductor startups.

Contents

Artificial Intelligence: The Next General-Purpose Technology
  AI: The Next General-Purpose Technology
  The Industrial Revolution and Industry 4.0
Artificial Intelligence
  Background and AI Classification
  Artificial Intelligence Fundamentals
    Machine Learning
    Training
    Inference
    Deep Learning
    Artificial Neural Networks
  AI Applications
    Image Processing
    Natural Language Processing
    Recommendation Systems
  Case Study: History of AI Cycles
Moore’s Law and the Implications on Semiconductor Industry
  Moore’s Law: Industry Guide to Innovation in the Last Half Century
    Dennard Scaling
  A New Compute Paradigm: Emergence of AI Specialized Silicon
AI Hardware: CPU, GPU, ASIC, FPGA, DPU
  AI Silicon: It Starts with the Hardware
    CPU: x86 and ARM
    GPU
    ASIC
    FPGA
    DPU
  Heterogenous Computing: All Chips Play a Role
AI/ML Software, Frameworks/Libraries; Software 2.0
  Programming Languages
  Deep Learning Frameworks and Libraries
    TensorFlow
  Plateau of Clock Speeds and the Megahertz Myth
  Measuring Performance with FLOPS and TOPS
  Benchmarking AI Training/Inference Results
    ResNet-50
    Single Shot Detection (SSD)
    Neural Machine Translation (NMT)
    Transformer
    NLP (BERT)
    Deep Learning Recommendation Model (DLRM)
    Mini-Go
AI Accelerators in Datacenters
  Enterprise Servers
  Cloud Computing
  Hyperscalers
  Datacenter AI Startups
AI Accelerators at the Edge
  Edge Infrastructure: Cloud and Telco
  Robotics: Rise of Machines
  Autonomous Vehicles: New Age of Transportation
  Endpoint Devices: PCs, Smartphones, Internet of Things
    PCs and Smartphones
    Embedded
    Silicon IP, Custom Silicon
Leading Public Companies Developing AI Silicon
  Achronix
  AMD
  Broadcom
  Intel
  Marvell
  NVIDIA
  NXP
  Qualcomm
  Xilinx
Leading Startup Companies Developing AI Silicon
  Blaize Semi
  Cerebras Systems
  EdgeCortix
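
The market figures quoted in the summary (a $105B AI hardware TAM by 2025 at a 34% CAGR, split into $21B training and $84B edge) can be sanity-checked with the standard compound-growth formula. The sketch below is illustrative only: the function names are hypothetical, and the five-year horizon (a 2020 base year) is an assumption, since the excerpt does not state the base year behind the 34% figure.

```python
# Minimal sketch of compound annual growth rate (CAGR) arithmetic.
# Figures in $B are taken from the report summary; ASSUMED_YEARS is an
# assumption (2020 base year), not something the excerpt states.

TRAINING_TAM_2025 = 21.0   # $B, from the report
EDGE_TAM_2025 = 84.0       # $B, from the report
TOTAL_TAM_2025 = 105.0     # $B, from the report
REPORTED_CAGR = 0.34       # 34%, from the report
ASSUMED_YEARS = 5          # assumption: growth measured from 2020 to 2025


def implied_start_value(end_value, cagr, years):
    """Value that grows to end_value after `years` at a constant rate `cagr`."""
    return end_value / (1.0 + cagr) ** years


def compound_growth_rate(start_value, end_value, years):
    """Constant annual growth rate taking start_value to end_value in `years`."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    base = implied_start_value(TOTAL_TAM_2025, REPORTED_CAGR, ASSUMED_YEARS)
    print(f"Training + edge TAM: ${TRAINING_TAM_2025 + EDGE_TAM_2025:.0f}B")
    print(f"Implied base-year TAM under the assumed horizon: ${base:.1f}B")
```

Under a different base year the implied starting TAM changes accordingly, so the result is best read as a consistency check rather than a figure from the report.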
Recommended publications
  • Personal Computing, the Notebook Battery Crisis, and Postindustrial Systems Thinking
    Running head: Eisler | Exploding the Black Box
    Exploding the Black Box: Personal Computing, the Notebook Battery Crisis, and Postindustrial Systems Thinking
    Matthew N. Eisler
    Matthew N. Eisler studies the relationship between ideology, material practices, and the social relations of contemporary science and engineering at the intersection of energy, environmental, and industrial policy. His first book, Overpotential: Fuel Cells, Futurism, and the Making of a Power Panacea was published by Rutgers University Press in 2012. He has been affiliated with Western University, the Center for Nanotechnology in Society at the University of California at Santa Barbara, the Chemical Heritage Foundation, and the Department of Engineering and Society at the University of Virginia. He is currently a Visiting Assistant Professor in the Department of Integrated Science and Technology at James Madison University. He thanks Jack K. Brown, W. Bernard Carlson, Paul E. Ceruzzi, Michael D. Gordin, Barbara Hahn, Cyrus C.M. Mody, Hannah S. Rogers, and three anonymous referees for their incisive and constructive criticism of earlier drafts of this article.
    Abstract: Historians of science and technology have generally ignored the role of power sources in the development of consumer electronics. In this they have followed the predilections of historical actors. Research, development, and manufacturing of batteries has historically occurred at a social and intellectual distance from the research, development, and manufacturing of the devices they power. Nevertheless, power source technoscience should properly be understood as an allied yet estranged field of electronics.
  • DEEP-HybridDataCloud: Assessment of Available Technologies for Supporting Accelerators and HPC, Initial Design and Implementation Plan
    DEEP-HybridDataCloud
    ASSESSMENT OF AVAILABLE TECHNOLOGIES FOR SUPPORTING ACCELERATORS AND HPC, INITIAL DESIGN AND IMPLEMENTATION PLAN
    DELIVERABLE: D4.1
    Document identifier: DEEP-JRA1-D4.1
    Date: 29/04/2018
    Activity: WP4
    Lead partner: IISAS
    Status: FINAL
    Dissemination level: PUBLIC
    Permalink: http://hdl.handle.net/10261/164313
    Abstract: This document describes the state of the art of technologies for supporting bare-metal, accelerators and HPC in cloud and proposes an initial implementation plan. Available technologies will be analyzed from different points of views: stand-alone use, integration with cloud middleware, support for accelerators and HPC platforms. Based on results of these analyses, an initial implementation plan will be proposed containing information on what features should be developed and what components should be improved in the next period of the project.
    DEEP-HybridDataCloud – 777435
    Copyright Notice: Copyright © Members of the DEEP-HybridDataCloud Collaboration, 2017-2020.
    Delivery Slip
                   Name                     Partner/Activity     Date
      From         Viet Tran                IISAS / JRA1         25/04/2018
      Reviewed by  Marcin Plociennik        PSNC                 20/04/2018
                   Cristina Duma Aiftimiei  INFN                 25/04/2018
                   Zdeněk Šustr             CESNET               25/04/2018
      Approved by  Steering Committee                            30/04/2018
    Document Log
      Issue  Date        Comment                                                           Author/Partner
      TOC    17/01/2018  Table of Contents                                                 Viet Tran / IISAS
      0.01   06/02/2018  Writing assignment                                                Viet Tran / IISAS
      0.99   10/04/2018  Partner contributions                                             WP members
      1.0    19/04/2018  Version for first review                                          Viet Tran / IISAS
      1.1    22/04/2018  Updated version according to recommendations from first review   Viet Tran / IISAS
      2.0    24/04/2018  Version for second review                                         Viet Tran / IISAS
      2.1    27/04/2018  Updated version according to recommendations from second review  Viet Tran / IISAS
      3.0    29/04/2018  Final version                                                     Viet Tran / IISAS
    Table of Contents
      Executive Summary
  • RISC-V Core Out-Clocks Apple, SiFive; Available as IP
    RISC-V core out-clocks Apple, SiFive; available as IP
    November 05, 2020 // By Peter Clarke
    Micro Magic Inc. (Sunnyvale, Calif.) has functioning silicon of its 5GHz-capable 64bit RISC-V processor and is offering the design as intellectual property. The Micro Magic processor out-clocks the Apple A14 Bionic, the processor at the heart of the iPhone 12 and one of the first processors on 5nm silicon. It also goes faster than a quad-core U84 CPU that SiFive states can operate at up to 2.6GHz clock frequency when implemented in a 7nm process. Micro Magic has a history that goes back to Sun Microsystems and beyond (see EDA company claims world’s fastest 64bit RISC-V core). It is reportedly one of Silicon Valley’s well-kept secrets and a go-to resource for design teams trying to remove bottlenecks in their datapath designs. Andy Huang, an independent contractor who supports Micro Magic for marketing and business development functions, contacted eeNews Europe and demonstrated the processor running EEMBC CoreMark benchmarks over a Facetime connection. Huang was founder and CEO of ACAD Corp., the developer of the Finesim simulator, one of the first and fastest of parallel SPICE simulators. ACAD was acquired by Magma Design Automation in 2006 before Magma itself was acquired by Synopsys in 2012. Huang declined to say which foundry had manufactured silicon for Micro Magic or in what manufacturing process it had been implemented. Huang said that the processor is made in a FinFET process and was manufactured using a multiproject wafer (MPW) run.
  • Accenture AI Inferencing in Action
    POINT OF VIEW (POV): PUT YOUR AI SOLUTION ON STEROIDS
    MATCH GPU PERFORMANCE AT HALF THE COST FOR AI INFERENCE WORKLOADS
    Proven CPU-based solution from Accenture and Intel boosts the performance and lowers the cost of AI inferencing by enabling an easy-to-deploy, scalable, and cost-efficient architecture.
    AI INFERENCING: THE NEXT CRITICAL STEP AFTER AI ALGORITHM TRAINING
    Artificial Intelligence (AI) solutions include three main functions: identifying and preparing data, training an artificial intelligence algorithm, and using the algorithm for inferring new outcomes. Each function requires different compute resources and deployment architecture. The choices of infrastructure components and technologies significantly impact the performance and costs associated with deploying an end-to-end AI solution. Data scientists and machine learning (ML) engineers spend significant time devising the right architecture for all stages of the AI pipeline.
    [Figure: AI pipeline stages: DATA PREPARATION; MODEL TRAINING AND OPTIMIZATION; SCORING AND INFERENCE]
    Once an AI computer/algorithm has been trained through traditional or deep learning techniques, it can deliver value by interpreting data (i.e., inferring). Through inference, an AI algorithm can analyze data to:
      • Differentiate between various items
      • Identify trends and patterns that can be leveraged during decision-making
      • Reveal opportunities and possible solutions
      • Recognize voices, faces, images, etc.
    As we look to the future, AI inference will become increasingly important to businesses operating in all segments, from health care to financial services to aerospace. And as the reliance on AI inference continues to grow, so does the importance of choosing the right AI infrastructure to support it.
    Sidebar: Revealing hidden possibilities: Accenture AIP,
  • GPU Developments 2018
    GPU Developments 2018
    © Copyright Jon Peddie Research 2019. All rights reserved. Reproduction in whole or in part is prohibited without written permission from Jon Peddie Research. This report is the property of Jon Peddie Research (JPR) and made available to a restricted number of clients only upon these terms and conditions.
    Agreement not to copy or disclose. This report and all future reports or other materials provided by JPR pursuant to this subscription (collectively, “Reports”) are protected by: (i) federal copyright, pursuant to the Copyright Act of 1976; and (ii) the nondisclosure provisions set forth immediately following.
    License, exclusive use, and agreement not to disclose. Reports are the trade secret property exclusively of JPR and are made available to a restricted number of clients, for their exclusive use and only upon the following terms and conditions. JPR grants site-wide license to read and utilize the information in the Reports, exclusively to the initial subscriber to the Reports, its subsidiaries, divisions, and employees (collectively, “Subscriber”). The Reports shall, at all times, be treated by Subscriber as proprietary and confidential documents, for internal use only. Subscriber agrees that it will not reproduce for or share any of the material in the Reports (“Material”) with any entity or individual other than Subscriber (“Shared Third Party”) (collectively, “Share” or “Sharing”), without the advance written permission of JPR. Subscriber shall be liable for any breach of this agreement and shall be subject to cancellation of its subscription to Reports. Without limiting this liability, Subscriber shall be liable for any damages suffered by JPR as a result of any Sharing of any Material, without advance written permission of JPR.
  • Persistent Memory for Artificial Intelligence
    Slide 1: Persistent Memory for Artificial Intelligence. Bill Gervasi, Principal Systems Architect. Santa Clara, CA, August 2018.
    Slide 2: Demand Outpacing Capacity. In-Memory Computing; Artificial Intelligence; Machine Learning; Deep Learning. [Chart: Memory Demand vs. DRAM Capacity]
    Slide 3: Driving New Capacity Models. Non-volatile memories. Industry successfully snuggling large memories to the processors… …but we can do oh! so much more. [Chart: Memory Demand vs. DRAM Capacity]
    Slide 4: My Three Talks at FMS. NVDIMM Analysis; Memory Class Storage; Artificial Intelligence.
    Slide 5: History of Architectures. Let’s go back in time…
    Slide 6: Historical Trends in Computing. Edge Computing; Co-Processing; Power Failure; Data Loss.
    Slide 7: Some Moments in History. Central Processing: shared processor, dumb terminals. Distributed Processing: processor per user, peer-to-peer networks.
    Slide 8: Some Moments in History. Central Processing vs. Distributed Processing. “Native Signal Processing”; Hercules graphics; Main CPU drivers; Sound Blaster audio; Cheap analog I/O; Rockwell modem; Ethernet DSP; Tightly-coupled coprocessing.
    Slide 9: The Lone Survivor… Integrated graphics; Graphics add-in cards. …survived the NSP war.
    Slide 10: Some Moments in History. Central Processing: phone providers controlled all data processing. Distributed Processing: phone apps provide local services. Edge computing reduces latency.
    Slide 11: When the Playing
  • Wire-Aware Architecture and Dataflow for CNN Accelerators
    Wire-Aware Architecture and Dataflow for CNN Accelerators
    Sumanth Gudaparthi, Surya Narayanan, Rajeev Balasubramonian, Edouard Giacomin, Hari Kambalasubramanyam, Pierre-Emmanuel Gaillardon (University of Utah, Salt Lake City, Utah)
    The 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), October 12–16, 2019, Columbus, OH, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3352460.3358316
    ABSTRACT: In spite of several recent advancements, data movement in modern CNN accelerators remains a significant bottleneck. Architectures like Eyeriss implement large scratchpads within individual processing elements, while architectures like TPU v1 implement large systolic arrays and large monolithic caches. Several data movements in these prior works are therefore across long wires, and account for much of the energy consumption. In this work, we design a new wire-aware CNN accelerator, WAX, that employs a deep and distributed memory hierarchy, thus enabling data movement over short wires in the common case. An array of computational units, each with a small set of registers, is placed adjacent to a subarray
    1 INTRODUCTION: Several neural network accelerators have emerged in recent years, e.g., [9, 11, 12, 28, 38, 39]. Many of these accelerators expend significant energy fetching operands from various levels of the memory hierarchy. For example, the Eyeriss architecture and its row-stationary dataflow require non-trivial storage for scratchpads and registers per processing element (PE) to maximize reuse [11].
  • AI Accelerator Latencies in Hybrid Vehicular Simulation
    AI Accelerator Latencies in Hybrid Vehicular Simulation
    Jussi Hanhirova (jussi.hanhirova@aalto.fi), Matias Hyyppä (juho.hyyppa@aalto.fi), Anton Debner (anton.debner@aalto.fi), Vesa Hirvisalo (vesa.hirvisalo@aalto.fi); Aalto University, Espoo, Finland
    Abstract: We study the use of accelerators for vehicular AI (Artificial Intelligence) applications. Managing the computation is complex as vehicular AI applications call for high performance computations in a real-time distributed environment, in which low and predictable latencies are essential. We have used the CARLA simulator together with machine learning based on CNNs (Convolutional Neural Networks) in our research. In this paper, we present the latency behavior with GPU acceleration for CNN processing. Our experimentation is motivated by using the simulator to find the corner cases that are demanding for the accelerated CNN processing.
    Author Keywords: Computation acceleration; GPU; deep learning
    ACM Classification Keywords: D.4.8 [Performance]: Simulation; I.2.9 [Robotics]: Autonomous vehicles; I.3.7 [Three-Dimensional Graphics and Realism]: Virtual reality
    Introduction: In this paper, we address the usage of accelerators in vehicular AI (Artificial Intelligence) systems and in the simulators that are needed in the development of such systems. The recent development of AI system is enabling many new applications including autonomous driving of motor vehicles on public roads. Many of such systems process sensor data related to environment perception in real-time, because they trigger actions which have latency requirements.
    Sidebar: Convolutional Neural Networks (CNN) are a specific class of neural networks that are often used in deep form
    Continued text: software [5] together with deep learning based inference on TensorFlow [6]. Our measurements show the basic viability of the hybrid simulation approach, but they also underline
  • An Area Efficient Real- and Complex-Valued Multiply-Accumulate SIMD Unit for Digital Signal Processors
    An Area Efficient Real- and Complex-Valued Multiply-Accumulate SIMD Unit for Digital Signal Processors
    Lukas Gerlach, Guillermo Payá-Vayá, and Holger Blume
    Cluster of Excellence Hearing4all, Institute of Microelectronic Systems, Leibniz Universität Hannover, Appelstr. 4, 30167 Hannover, Germany. Email: {gerlach, guipava, blume}@ims.uni-hannover.de
    Abstract: This paper explores a real- and complex-valued multiply-accumulate (MAC) functional unit for digital signal processors. MAC units with single-instruction-multiple-data (SIMD) support are often used to increase the processing performance in modern signal processing processors. Compared to a real-valued SIMD-MAC units, the proposed unit uses the same multipliers to also support complex-valued SIMD-MAC and butterfly operations. The area overhead for the complex mode is small. Complex-valued operations speed up signal processing algorithms and make the execution more efficient in terms of power consumption. As a case study, a fast Fourier transform (FFT) is implemented for a VLIW-processor with a complex-valued SIMD butterfly extension. The proposed functional unit is quantitatively evaluated in terms of performance, silicon area, and power consumption.
    Introduction: In the signal processing field, the fast Fourier transform (FFT) is one of the mostly used transformations, which greatly pushes the performance requirements. The data parallelism inherent in the FFT processing allows operating with many independent MAC operations simultaneously. Therefore, a performance increment can be achieved by MAC units with SIMD mechanisms, but many instructions are still needed to operate the real- and imaginary parts of complex numbers separately. The use of single instructions in DSPs, executing operations with complex numbers, can lead to a significant performance gain in many signal processing algorithms. A SIMD-MAC unit that can handle both complex and
  • Software-Defined Hardware Provides the Key to High-Performance Data Acceleration
    Software-Defined Hardware Provides the Key to High-Performance Data Acceleration (WP019)
    November 13, 2019, White Paper
    Executive Summary: Across a wide range of industries, data acceleration is the key to building efficient, smart systems. Traditional general-purpose processors are falling short in their ability to support the performance and latency constraints that users have. A number of accelerator technologies have appeared to fill the gap that are based on custom silicon, graphics processors or dynamically reconfigurable hardware, but the key to their success is their ability to integrate into an environment where high throughput, low latency and ease of development are paramount requirements. A board-level platform developed jointly by Achronix and BittWare has been optimized for these applications, providing developers with a rapid path to deployment for high-throughput data acceleration.
    A Growing Demand for Distributed Acceleration: There is a massive thirst for performance to power a diverse range of applications in both cloud and edge computing. To satisfy this demand, operators of data centers, network hubs and edge-computing sites are turning to the technology of customized accelerators. Accelerators are a practical response to the challenges faced by users with a need for high-performance computing platforms who can no longer count on traditional general-purpose CPUs, such as those in the Intel Xeon family, to support the growth in demand for data throughput. The core of the problem with the general-purpose CPU is that Moore's Law continues to double the number of available transistors per square millimeter approximately every two years but no longer allows for growth in clock speeds.
  • An Optimized H.266/VVC Software Decoder on Mobile Platform
    An Optimized H.266/VVC Software Decoder On Mobile Platform
    Yiming Li, Shan Liu, Yu Chen, Yushan Zheng, Sijia Chen, Bin Zhu, Jian Lou
    Tencent Media Lab, Shenzhen, China and Palo Alto, CA, USA
    Abstract: As the successor of H.265/HEVC, the new versatile video coding standard (H.266/VVC) can provide up to 50% bitrate saving with the same subjective quality, at the cost of increased decoding complexity. To accelerate the application of the new coding standard, a real-time H.266/VVC software decoder that can support various platforms is implemented, where SIMD technologies, parallelism optimization, and the acceleration strategies based on the characteristics of each coding tool are applied. As the mobile devices have become an essential carrier for video services nowadays, the mentioned optimization efforts are not only implemented for the x86 platform, but more importantly utilized to highly optimize the decoding performance on the ARM platform in this work. The experimental results show that when running on the Apple A14 SoC (iPhone 12pro), the average single-thread decoding speed of the present implementation
    Introduction (excerpt): standard. Therefore, it is essential to have an efficient and optimized software decoder implementation to support the emerging applications. In [3] [4], an independent VVC software decoder implemented by Tencent demonstrated real-time HD/UHD decoding capability on x86 platform. Considering that mobile devices have become an essential carrier and display tool for video services, extensive optimization efforts were made on top of the framework of [3] to achieve real-time HD/UHD decoding on the mobile platform. As a result, a uniform-designed software H.266/VVC decoder that can run real-time on different platforms and supports versatile functionalities such as screen content coding (SCC) is accomplished.
  • Efficient Management of Scratch-Pad Memories in Deep Learning Accelerators
    Efficient Management of Scratch-Pad Memories in Deep Learning Accelerators
    Subhankar Pal∗, Swagath Venkataramani†, Viji Srinivasan†, Kailash Gopalakrishnan†
    ∗University of Michigan, Ann Arbor, MI; †IBM TJ Watson Research Center, Yorktown Heights, NY
    Abstract: A prevalent challenge for Deep Learning (DL) accelerators is how they are programmed to sustain utilization without impacting end-user productivity. Little prior effort has been devoted to the effective management of their on-chip Scratch-Pad Memory (SPM) across the DL operations of a Deep Neural Network (DNN). This is especially critical due to trends in complex network topologies and the emergence of eager execution. This work demonstrates that there exists up to a 5.2× performance gap in DL inference to be bridged using SPM management, on a set of image, object and language networks. We propose OnSRAM, a novel SPM management framework
    TABLE I: PERFORMANCE IMPROVEMENT USING INTER-NODE SPM MANAGEMENT.
      Incep Incep Res Multi- Alex VGG Goog SSD Res Mobile Squee tion- tion- Net- Head Geo Net 16 LeNet 300 NeXt NetV1 zeNet PTB v3 v4 50 Attn Mean
      1 SPM   1.04 1.19 1.94 1.64 1.58 1.75 1.31 3.86 5.17 2.84 1.02 1.06 1.76
      1-Step  1.04 1.03 1.01 1.10 1.11 1.33 1.18 1.40 2.84 1.57 1.01 1.02 1.24
    Introduction (excerpt): speedups of 12 ConvNets, LSTM and Transformer DNNs [18], [19], [21], [26]–[33] compared to the case when there is no SPM management, i.e.
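
The Gerlach et al. excerpt in the list above describes a MAC unit that reuses real-valued multipliers for complex multiply-accumulate and FFT butterfly operations. The arithmetic behind that idea can be sketched in a few lines: a complex MAC decomposes into four real multiplications plus additions, and a radix-2 butterfly is one complex multiply followed by an add and a subtract. This is a minimal illustration of the math only, not the paper's hardware datapath; the function names and the (re, im) tuple representation are hypothetical choices made for the example.

```python
# Illustrative sketch: complex multiply-accumulate built from real multiplies.
# Complex numbers are (re, im) tuples so every real multiplication is visible.
# Function names are hypothetical and not taken from the paper.


def complex_mac(acc, a, b):
    """Return acc + a * b using four real multiplications."""
    acc_re, acc_im = acc
    a_re, a_im = a
    b_re, b_im = b
    # (a_re + j*a_im) * (b_re + j*b_im)
    #   = (a_re*b_re - a_im*b_im) + j*(a_re*b_im + a_im*b_re)
    re = acc_re + a_re * b_re - a_im * b_im
    im = acc_im + a_re * b_im + a_im * b_re
    return (re, im)


def radix2_butterfly(x0, x1, w):
    """Radix-2 butterfly: returns (x0 + w*x1, x0 - w*x1) for twiddle factor w."""
    t = complex_mac((0.0, 0.0), w, x1)  # t = w * x1
    upper = (x0[0] + t[0], x0[1] + t[1])
    lower = (x0[0] - t[0], x0[1] - t[1])
    return upper, lower


if __name__ == "__main__":
    print(complex_mac((0.5, 0.5), (1.0, 2.0), (3.0, -1.0)))
    print(radix2_butterfly((1.0, 0.0), (0.0, 1.0), (0.0, -1.0)))
```

Writing the product out this way shows why a complex mode can share the multipliers of a real-valued SIMD unit: the four real products are independent of one another, so they map naturally onto parallel multiplier lanes.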