A Review on Software Architectures for Heterogeneous Platforms
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
On Heterogeneous Compute and Memory Systems
ON HETEROGENEOUS COMPUTE AND MEMORY SYSTEMS by Jason Lowe-Power A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Sciences) at the UNIVERSITY OF WISCONSIN–MADISON 2017 Date of final oral examination: 05/31/2017 The dissertation is approved by the following members of the Final Oral Committee: Mark D. Hill, Professor, Computer Sciences Dan Negrut, Professor, Mechanical Engineering Jignesh M. Patel, Professor, Computer Sciences Karthikeyan Sankaralingam, Associate Professor, Computer Sciences David A. Wood, Professor, Computer Sciences © Copyright by Jason Lowe-Power 2017 All Rights Reserved i Acknowledgments I would like to acknowledge all of the people who helped me along the way to completing this dissertation. First, I would like to thank my advisors, Mark Hill and David Wood. Often, when students have multiple advisors they find there is high “synchronization overhead” between the advisors. However, Mark and David complement each other well. Mark is a high-level thinker, focusing on the structure of the argument and distilling ideas to their essentials; David loves diving into the details of microarchitectural mechanisms. Although ever busy, at least one of Mark or David were available to meet with me, and they always took the time to help when I needed it. Together, Mark and David taught me how to be a researcher, and they have given me a great foundation to build my career. I thank my committee members. Jignesh Patel for his collaborations, and for the fact that each time I walked out of his office after talking to him, I felt a unique excitement about my research. -
Hardware Architecture
Hardware Architecture Components Computing Infrastructure Components Servers Clients LAN & WLAN Internet Connectivity Computation Software Storage Backup Integration is the Key ! Security Data Network Management Computer Today’s Computer Computer Model: Von Neumann Architecture Computer Model Input: keyboard, mouse, scanner, punch cards Processing: CPU executes the computer program Output: monitor, printer, fax machine Storage: hard drive, optical media, diskettes, magnetic tape Von Neumann architecture - Wiki Article (15 min YouTube Video) Components Computer Components Components Computer Components CPU Memory Hard Disk Mother Board CD/DVD Drives Adaptors Power Supply Display Keyboard Mouse Network Interface I/O ports CPU CPU CPU – Central Processing Unit (Microprocessor) consists of three parts: Control Unit • Execute programs/instructions: the machine language • Move data from one memory location to another • Communicate between other parts of a PC Arithmetic Logic Unit • Arithmetic operations: add, subtract, multiply, divide • Logic operations: and, or, xor • Floating point operations: real number manipulation Registers CPU Processor Architecture See How the CPU Works In One Lesson (20 min YouTube Video) CPU CPU CPU speed is influenced by several factors: Chip Manufacturing Technology: nm (2002: 130 nm, 2004: 90nm, 2006: 65 nm, 2008: 45nm, 2010:32nm, Latest is 22nm) Clock speed: Gigahertz (Typical : 2 – 3 GHz, Maximum 5.5 GHz) Front Side Bus: MHz (Typical: 1333MHz , 1666MHz) Word size : 32-bit or 64-bit word sizes Cache: Level 1 (64 KB per core), Level 2 (256 KB per core) caches on die. Now Level 3 (2 MB to 8 MB shared) cache also on die Instruction set size: X86 (CISC), RISC Microarchitecture: CPU Internal Architecture (Ivy Bridge, Haswell) Single Core/Multi Core Multi Threading Hyper Threading vs. -
Efficient Processing of Deep Neural Networks
Efficient Processing of Deep Neural Networks Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel Emer Massachusetts Institute of Technology Reference: V. Sze, Y.-H.Chen, T.-J. Yang, J. S. Emer, ”Efficient Processing of Deep Neural Networks,” Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2020 For book updates, sign up for mailing list at http://mailman.mit.edu/mailman/listinfo/eems-news June 15, 2020 Abstract This book provides a structured treatment of the key principles and techniques for enabling efficient process- ing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the- art accuracy on many AI tasks, it comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics—such as energy-efficiency, throughput, and latency—without sacrificing accuracy or increasing hardware costs are critical to enabling the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design to improve energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as formalization and organization of key concepts from contemporary work that provide insights that may spark new ideas. 1 Contents Preface 9 I Understanding Deep Neural Networks 13 1 Introduction 14 1.1 Background on Deep Neural Networks . -
Architectural Adaptation for Application-Specific Locality
Architectural Adaptation for Application-Specific Locality Optimizations y z Xingbin Zhang Ali Dasdan Martin Schulz Rajesh K. Gupta Andrew A. Chien Department of Computer Science yInstitut f¨ur Informatik University of Illinois at Urbana-Champaign Technische Universit¨at M¨unchen g fzhang,dasdan,achien @cs.uiuc.edu [email protected] zInformation and Computer Science, University of California at Irvine [email protected] Abstract without repartitioning hardware and software functionality and reimplementing the co-processing hardware. This re- We propose a machine architecture that integrates pro- targetability problem is an obstacle toward exploiting pro- grammable logic into key components of the system with grammable logic for general purpose computing. the goal of customizing architectural mechanisms and poli- cies to match an application. This approach presents We propose a machine architecture that integrates pro- an improvement over traditional approach of exploiting grammable logic into key components of the system with programmable logic as a separate co-processor by pre- the goal of customizing architectural mechanisms and poli- serving machine usability through software and over tra- cies to match an application. We base our design on the ditional computer architecture by providing application- premise that communication is already critical and getting specific hardware assists. We present two case studies of increasingly so [17], and flexible interconnects can be used architectural customization to enhance latency tolerance to replace static wires at competitive performance [6, 9, 20]. and efficiently utilize network bisection on multiproces- Our approach presents an improvement over co-processing sors for sparse matrix computations. We demonstrate that by preserving machine usability through software and over application-specific hardware assists and policies can pro- traditional computer architecture by providing application- vide substantial improvements in performance on a per ap- specific hardware assists. -
Heterogeneous Cpu+Gpu Computing
HETEROGENEOUS CPU+GPU COMPUTING Ana Lucia Varbanescu – University of Amsterdam [email protected] Significant contributions by: Stijn Heldens (U Twente), Jie Shen (NUDT, China), Heterogeneous platforms • Systems combining main processors and accelerators • e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC • Everywhere from supercomputers to mobile devices Heterogeneous platforms • Host-accelerator hardware model Accelerator FPGAs Accelerator PCIe / Shared memory ... Host MICs Accelerator GPUs Accelerator CPUs Our focus today … • A heterogeneous platform = CPU + GPU • Most solutions work for other/multiple accelerators • An application workload = an application + its input dataset • Workload partitioning = workload distribution among the processing units of a heterogeneous system Few cores Thousands of Cores 5 Generic multi-core CPU 6 Programming models • Pthreads + intrinsics • TBB – Thread building blocks • Threading library • OpenCL • To be discussed … • OpenMP • Traditional parallel library • High-level, pragma-based • Cilk • Simple divide-and-conquer model abstractionincreasesLevel of 7 A GPU Architecture Offloading model Kernel Host code 9 Programming models • CUDA • NVIDIA proprietary • OpenCL • Open standard, functionally portable across multi-cores • OpenACC • High-level, pragma-based • Different libraries, programming models, and DSLs for different domains Level of abstractionincreasesLevel of CPU vs. GPU 10 ALU ALU CPU Control Low latency, high Throughput: ~ALU 500 GFLOPsALU flexibility. Bandwidth: ~ 60 GB/s Excellent for -
Development of a Predictable Hardware Architecture Template and Integration Into an Automated System Design Flow
Development Secure of Reprogramming a Predictable Hardware of Architecturea Network Template Connected and Device Integration into an Automated System Design Flow Securing programmable logic controllers MARCUS MIKULCAK MUSSIE TESFAYE KTH Information and Communication Technology Masters’ Degree Project Second level, 30.0 HEC Stockholm, Sweden June 2013 TRITA-ICT-EX-2013:138Degree project in Communication Systems Second level, 30.0 HEC Stockholm, Sweden Secure Reprogramming of a Network Connected Device KTH ROYAL INSTITUTE OF TECHNOLOGY Securing programmable logic controllers SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY ELECTRONIC SYSTEMS MUSSIE TESFAYE Development of a Predictable Hardware Architecture TemplateKTH Information and Communication Technology and Integration into an Automated System Design Flow Master of Science Thesis in System-on-Chip Design Stockholm, June 2013 TRITA-ICT-EX-2013:138 Author: Examiner: Marcus Mikulcak Assoc. Prof. Ingo Sander Supervisor: Seyed Hosein Attarzadeh Niaki Degree project in Communication Systems Second level, 30.0 HEC Stockholm, Sweden Abstract The requirements of safety-critical real-time embedded systems pose unique challenges on their design process which cannot be fulfilled with traditional development methods. To ensure their correct timing and functionality, it has been suggested to move the design process to a higher abstraction level, which opens the possibility to utilize automated correct-by-design development flows from a functional specification of the system down to the level of Multiprocessor Systems-on-Chip. ForSyDe, an embedded system design methodology, presents a flow of this kind by basing system development on the theory of Models of Computation and side-effect-free processes, making it possible to separate the timing analysis of computation and communication of process networks. -
Computer Architectures an Overview
Computer Architectures An Overview PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 25 Feb 2012 22:35:32 UTC Contents Articles Microarchitecture 1 x86 7 PowerPC 23 IBM POWER 33 MIPS architecture 39 SPARC 57 ARM architecture 65 DEC Alpha 80 AlphaStation 92 AlphaServer 95 Very long instruction word 103 Instruction-level parallelism 107 Explicitly parallel instruction computing 108 References Article Sources and Contributors 111 Image Sources, Licenses and Contributors 113 Article Licenses License 114 Microarchitecture 1 Microarchitecture In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch), also called computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different microarchitectures.[1] Implementations might vary due to different goals of a given design or due to shifts in technology.[2] Computer architecture is the combination of microarchitecture and instruction set design. Relation to instruction set architecture The ISA is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the execution model, processor registers, address and data formats among other things. The Intel Core microarchitecture microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA. The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine, which may be everything from single gates and registers, to complete arithmetic logic units (ALU)s and even larger elements. -
State-Of-The-Art in Heterogeneous Computing
Scientific Programming 18 (2010) 1–33 1 DOI 10.3233/SPR-2009-0296 IOS Press State-of-the-art in heterogeneous computing Andre R. Brodtkorb a,∗, Christopher Dyken a, Trond R. Hagen a, Jon M. Hjelmervik a and Olaf O. Storaasli b a SINTEF ICT, Department of Applied Mathematics, Blindern, Oslo, Norway E-mails: {Andre.Brodtkorb, Christopher.Dyken, Trond.R.Hagen, Jon.M.Hjelmervik}@sintef.no b Oak Ridge National Laboratory, Future Technologies Group, Oak Ridge, TN, USA E-mail: [email protected] Abstract. Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing. Keywords: Power-efficient architectures, parallel computer architecture, stream or vector architectures, energy and power consumption, microprocessor performance 1. Introduction the speed of logic gates, making computers smaller and more power efficient. Noyce and Kilby indepen- The goal of this article is to provide an overview of dently invented the integrated circuit in 1958, leading node-level heterogeneous computing, including hard- to further reductions in power and space required for ware, software tools and state-of-the-art algorithms. -
Summarizing CPU and GPU Design Trends with Product Data
Summarizing CPU and GPU Design Trends with Product Data Yifan Sun, Nicolas Bohm Agostini, Shi Dong, and David Kaeli Northeastern University Email: fyifansun, agostini, shidong, [email protected] Abstract—Moore’s Law and Dennard Scaling have guided the products. Equipped with this data, we answer the following semiconductor industry for the past few decades. Recently, both questions: laws have faced validity challenges as transistor sizes approach • Are Moore’s Law and Dennard Scaling still valid? If so, the practical limits of physics. We are interested in testing the validity of these laws and reflect on the reasons responsible. In what are the factors that keep the laws valid? this work, we collect data of more than 4000 publicly-available • Do GPUs still have computing power advantages over CPU and GPU products. We find that transistor scaling remains CPUs? Is the computing capability gap between CPUs critical in keeping the laws valid. However, architectural solutions and GPUs getting larger? have become increasingly important and will play a larger role • What factors drive performance improvements in GPUs? in the future. We observe that GPUs consistently deliver higher performance than CPUs. GPU performance continues to rise II. METHODOLOGY because of increases in GPU frequency, improvements in the thermal design power (TDP), and growth in die size. But we We have collected data for all CPU and GPU products (to also see the ratio of GPU to CPU performance moving closer to our best knowledge) that have been released by Intel, AMD parity, thanks to new SIMD extensions on CPUs and increased (including the former ATI GPUs)1, and NVIDIA since January CPU core counts. -
Integrated Circuit Technology for Wireless Communications: an Overview Babak Daneshrad Integrated Circuits and Systems Laboratory UCLA Electrical Engineering Dept
Integrated Circuit Technology for Wireless Communications: An Overview Babak Daneshrad Integrated Circuits and Systems Laboratory UCLA Electrical Engineering Dept. email: [email protected] http://www.ee.ucla.edu/~babak Abstract: Figure 1 shows a block diagram of a typical wireless com- This paper provides a brief overview of present trends in the develop- munication system. Moreover it shows the partitioning of the ment of integrated circuit technology for applications in the wireless receiver into RF, IF, baseband analog and digital components. communications industry. Through advanced circuit, architectural and In this paper we will focus on two main classes of integrated processing technologies, ICs have helped bring about the wireless revo- circuit technologies. The first is ICs for analog and more lution by enabling highly sophisticated, low cost and portable end user importantly RF communications. In general the technologist as terminals. In this paper two broad categories of circuits are highlighted. The first is RF integrated circuits and the second is digital baseband well as the circuit designer’s challenge here is to first, make the processing circuits. In both areas, the paper presents the circuit design transistors operate at higher carrier frequencies, and second, to challenges and options presented to the designer. It also highlights the integrate as many of the components required in the receiver manner in which these technologies have helped advance wireless com- onto the IC. The second class of circuits that we will focus on munications. are digital baseband circuits. Where increased density and I. Introduction reduced power consumption are the key factors for optimiza- tion. -
Heterogeneous Computing in the Edge
Heterogeneous Computing in the Edge Authors: Charles Byers Associate Chief Technical Officer Industrial Internet Consortium [email protected] IIC Journal of Innovation - 1 - Heterogeneous Computing in the Edge INTRODUCTION Heterogeneous computing is the technique where different types of processors with different data path architectures are applied together to optimize the execution of specific computational workloads. Traditional CPUs are often inefficient for the types of computational workloads we will run on edge computing nodes. By adding additional types of processing resources like GPUs, TPUs, and FPGAs, system operation can be optimized. This technique is growing in popularity in cloud data centers, but is nascent in edge computing nodes. This paper will discuss some of the types of processors used in heterogenous computing, leading suppliers of these technologies, example edge use cases that benefit from each type, partitioning techniques to optimize its application, and hardware / software architectures to implement it in edge nodes. Edge computing is a technique through which the computational, storage, and networking functions of an IoT network are distributed to a layer or layers of edge nodes arranged between the bottom of the cloud and the top of IoT devices1. There are many tradeoffs to consider when deciding how to partition workloads between cloud data centers and edge computing nodes, and which processor data path architecture(s) are optimum at each layer for different applications. Figure 1 is an abstracted view of a cloud-edge network that employs heterogenous computing. A cloud data center hosts a number of types of computing resources, with a central interconnect. These computing resources consist of traditional Complex Instruction Set Computing / Reduced Instruction Set Computing (CISC/RISC) servers, but also include Graphics Processing Unit (GPU) accelerators, Tensor Processing Units (TPUs), and Field Programmable Gate Array (FPGA) farms and a few other processor types to help accelerate certain types of workloads. -
Hybrid Pipeline Hardware Architecture Based on Error Detection and Correction for AES
sensors Article Hybrid Pipeline Hardware Architecture Based on Error Detection and Correction for AES Ignacio Algredo-Badillo 1,† , Kelsey A. Ramírez-Gutiérrez 1,† , Luis Alberto Morales-Rosales 2,* , Daniel Pacheco Bautista 3,† and Claudia Feregrino-Uribe 4,† 1 CONACYT-Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla 72840, Mexico; [email protected] (I.A.-B.); [email protected] (K.A.R.-G.) 2 Faculty of Civil Engineering, CONACYT-Universidad Michoacana de San Nicolás de Hidalgo, Morelia 58000, Mexico 3 Departamento de Ingeniería en Computación, Universidad del Istmo, Campus Tehuantepec, Oaxaca 70760, Mexico; [email protected] 4 Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla 72840, Mexico; [email protected] * Correspondence: [email protected] † These authors contributed equally to this work. Abstract: Currently, cryptographic algorithms are widely applied to communications systems to guarantee data security. For instance, in an emerging automotive environment where connectivity is a core part of autonomous and connected cars, it is essential to guarantee secure communications both inside and outside the vehicle. The AES algorithm has been widely applied to protect communications in onboard networks and outside the vehicle. Hardware implementations use techniques such as iterative, parallel, unrolled, and pipeline architectures. Nevertheless, the use of AES does not Citation: Algredo-Badillo, I.; guarantee secure communication, because previous works have proved that implementations of secret Ramírez-Gutiérrez, K.A.; key cryptosystems, such as AES, in hardware are sensitive to differential fault analysis. Moreover, Morales-Rosales, L.A.; Pacheco it has been demonstrated that even a single fault during encryption or decryption could cause Bautista, D.; Feregrino-Uribe, C.