Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications


ACCEPTED BY IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, 2020
arXiv:2007.05657v2 [cs.AR] 28 Apr 2021

Mostafa Rahimi Azghadi, Senior Member, IEEE, Corey Lammie, Student Member, IEEE, Jason K. Eshraghian, Member, IEEE, Melika Payvand, Member, IEEE, Elisa Donati, Member, IEEE, Bernabé Linares-Barranco, Fellow, IEEE, and Giacomo Indiveri, Senior Member, IEEE

M. Rahimi Azghadi and Corey Lammie are with the College of Science and Engineering, James Cook University, QLD 4811, Australia.
J. K. Eshraghian is with the Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122, USA.
M. Payvand, E. Donati, and G. Indiveri are with the Institute of Neuroinformatics, University and ETH Zurich, Switzerland.
B. Linares-Barranco is with the Instituto de Microelectrónica de Sevilla IMSE-CNM (CSIC and Universidad de Sevilla), Sevilla, Spain.
© This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Abstract: The advent of dedicated Deep Learning (DL) accelerators and neuromorphic processors has brought on new opportunities for applying both Deep and Spiking Neural Network (SNN) algorithms to healthcare and biomedical applications at the edge. This can facilitate the advancement of medical Internet of Things (IoT) systems and Point of Care (PoC) devices. In this paper, we provide a tutorial describing how various technologies including emerging memristive devices, Field Programmable Gate Arrays (FPGAs), and Complementary Metal Oxide Semiconductor (CMOS) can be used to develop efficient DL accelerators to solve a wide variety of diagnostic, pattern recognition, and signal processing problems in healthcare. Furthermore, we explore how spiking neuromorphic processors can complement their DL counterparts for processing biomedical signals. The tutorial is augmented with case studies of the vast literature on neural network and neuromorphic hardware as applied to the healthcare domain. We benchmark various hardware platforms by performing a sensor fusion signal processing task combining electromyography (EMG) signals with computer vision. Comparisons are made between dedicated neuromorphic processors and embedded AI accelerators in terms of inference latency and energy. Finally, we provide our analysis of the field and share a perspective on the advantages, disadvantages, challenges, and opportunities that various accelerators and neuromorphic processors introduce to healthcare and biomedical domains.

Index Terms: Spiking Neural Networks, Deep Neural Networks, Neuromorphic Hardware, CMOS, Memristor, FPGA, RRAM, Healthcare, Medical IoT, Point-of-Care

I. INTRODUCTION

Artificial intelligence is uniquely poised to cope with the growing demands of the universal healthcare system [1]. The healthcare industry is projected to reach over 10 trillion dollars by 2022, and the associated workload on medical practitioners is expected to grow concurrently [2]. As the reliability of Deep Learning (DL) improves, it has pervaded various facets of healthcare from monitoring [3], [4], to prediction [5], diagnosis [6], treatment [7], and prognosis [8]. Fig. 1(a) shows how data collected from the patient, which may be a combination of bio-samples, medical images, temperature, movement, etc., can be processed using a smart DL system that monitors the patient for anomalies and/or to predict diseases. DL systems can be used to recommend treatment options and prognosis, which further affect monitoring and prediction in a closed-loop scenario.

The capacity of Artificial Intelligence (AI) to meet or exceed the performance of human experts in medical-data analysis [9], [10], [11] can, in part, be attributed to the continued improvement of high-performance computing platforms such as Graphics Processing Units (GPUs) [12] and customized Machine Learning (ML) hardware [13]. These can now process and learn from a large amount of multi-modal heterogeneous general and medical data [14]. This was not readily achievable a decade ago.

While the field of DL has been growing at an astonishing rate in terms of performance, network size, and training run time, the development of dedicated hardware to process DL algorithms is struggling to keep up. Concretely, the compute loads of DL have doubled every 3.4 months since 2012. Moore’s Law targets the doubling of compute power every 18-24 months, and appears to be slowing down [15]. The progress in hardware accelerator development currently relies on advances by a handful of technology companies, most notably Nvidia and its GPUs [16], [17] and Google and its Tensor Processing Units (TPUs) [13], in addition to new startups and research groups developing Application Specific Integrated Circuits (ASICs) for DL training and acceleration.

While there are significant advances in tailoring deep network models and algorithms for various healthcare and biomedical applications [18], most computationally expensive deep networks are trained on either GPUs or in data centers [12], [19]. The latter typically requires access to cloud computing services which is not only costly and comes with high power demands, but also compromises data privacy. This is distinct to the effective deployment of DL at the edge on an increasing number of medical IoT devices [20] and PoC systems [21], as illustrated in Fig. 1(b). Edge learning and inference enables the option to move processing away from the cloud. This is critical for highly sensitive medical data and offline operation. Edge-based processing must combine compactness, low-power, and rapid (high throughput) at a low-cost, to make smart health monitoring viable and affordable for integration into human life [22].

Fig. 1. A depiction of (a) the usage of DL in a smart healthcare setting, which typically involves monitoring, prediction, diagnosis, treatment, and prognosis. The various parts of the DL-based healthcare system can run on (b) the three levels of the IoT, i.e. edge devices, edge nodes, and the cloud. However, for healthcare IoT and PoC processing, edge learning and inference is preferred.

Specialized embedded DL accelerators, such as the Nvidia Jetson and Xavier series [23], and the Movidius Neural Compute Stick [24], [25], have shown the promise of edge computing. More recently, the Nvidia Clara Embedded was released as a healthcare-specific edge accelerator. This is a computing platform for edge-enabled AI on the Internet of Medical Things (IoMT). However, embedded devices remain relatively power hungry and costly, and many state-of-the-art algorithms far exceed the memory bandwidth of resource-constrained devices. They are not yet ideal learning/inference engines for ambient-assisted precision medicine systems. There is a need for innovative systems which can satisfy the stringent requirements of healthcare edge devices to be made affordable to the community at large scales.

To that end, in this paper we focus on the use of three various hardware technologies to develop dedicated deep network accelerators which will be discussed from a biomedical and healthcare application point-of-view. The three technologies that we cover here are CMOS, memristors, and Field Programmable Gate Arrays (FPGAs). It is worth noting that, while our focus targets edge inference engines in the biomedical domain, the techniques and hardware advantages discussed here are likely to be useful for efficient offline deep network learning, or online on-chip learning. Herein, the term DL ‘accelerator’ is used to refer to a device that is able to perform DL inference and potentially training.

[...] resource-intensive. This will justify the need for application specific hardware platforms. After that, we discuss recent hardware advances which have led to improvements in training and inference efficiency. These improvements ultimately guide us to viable edge inference engine options.

After reviewing the literature on these DL accelerators, we quantify the performance of various algorithms on different types of DL processors. The results allow us to draw a perspective on the potential future of spike-based neuromorphic processors in the biomedical signal processing domain. Based on our analysis and perspective, we conjecture that, for edge processing, neuromorphic computing and Spiking Neural Networks (SNNs) [26] will likely complement DL inference engines, either through signaling anomalies in the data or acting as ‘intelligent always-on watchdogs’ which continuously monitor the data being recorded, but only activate further processing stages if and when necessary.

We expect this tutorial, review and perspective to provide guidance on the history and future of DL accelerators, and the potential they hold for advancing healthcare. Our contributions are summarized as follows:

• Our paper is the first to discuss the use of three different emerging and established hardware technologies for facilitating DL acceleration, with a focus on biomedical applications.
• We provide tutorial sections on how one may implement a typical biomedical task on FPGAs or simulate it for deployment on memristive crossbars.
• Our paper is the first to discuss how event-based neuromorphic processors can complement DL accelerators for biomedical signal processing.
• We provide