High-Performance Heterogeneous Computing with the Convey HC-1


University of South Carolina, Scholar Commons
Faculty Publications, Department of Computer Science and Engineering
2010

Publication Info: Published in Computing in Science and Engineering, ed. Volodymyr Kindratenko and Pedro Trancoso, Volume 12, Issue 6, 2010, pages 80-87. http://ieeexplore.ieee.org/servlet/opac?punumber=5992. © 2010 by the Institute of Electrical and Electronics Engineers (IEEE).

This article is brought to you by the Department of Computer Science and Engineering at Scholar Commons. It has been accepted for inclusion in Faculty Publications by an authorized administrator of Scholar Commons. For more information, please contact [email protected].

Novel Architectures
Editors: Volodymyr Kindratenko, [email protected]; Pedro Trancoso, [email protected]

High-Performance Heterogeneous Computing with the Convey HC-1
By Jason D. Bakos, University of South Carolina - Columbia ([email protected])

Unlike other socket-based reconfigurable coprocessors, the Convey HC-1 contains nearly 40 field-programmable gate arrays, scatter-gather memory modules, a high-capacity crossbar switch, and a fully coherent memory system.

At Supercomputing 2009, Convey Computer unveiled the HC-1, an all-in-one compute server containing a socket-based reconfigurable coprocessor board. The HC-1 is unique in several ways. Unlike in-socket coprocessors from Nallatech (www.nallatech.com/Intel-Xeon-FSB-Socket-Fillers/fsb-development-systems.html), DRC (www.drccomputer.com/drc/modules.html), and XtremeData (www.xtremedata.com/products/accelerators/in-socket-accelerator/xd2000i), all of which are confined to a socket-sized footprint, Convey uses a mezzanine connector to bring the front-side bus (FSB) interface to a large coprocessor board roughly the size of an ATX motherboard. This coprocessor board is housed in a one-unit (1U) chassis that's fused to the top of another 1U chassis containing the host motherboard.

In addition to the machine itself, Convey designed a selection of accelerator designs to use with it. Some of these implement soft-core floating-point vector processors, for which Convey has also developed C and Fortran compilers. Others, such as its Smith-Waterman sequence-alignment accelerator design, include an easy-to-use interface library. This makes the HC-1's FPGAs accessible to programmers who lack the expertise or patience to design their own FPGA-based coprocessors in a hardware description language. However, realizing that the HC-1 also appeals to customers who would like to do exactly that, Convey offers support and tools accordingly.

Here, I examine the HC-1, emphasizing its system architecture, performance, ease of programming, and flexibility.

System Overview

The HC-1's host consists of a dual-socket Intel server motherboard, an Intel 5400 memory-controller hub chipset, 24 Gbytes of RAM, a 1,066-MHz FSB, and a 2.13-GHz Intel Xeon 5138, a dual-core, low-voltage processor (the 65-nanometer Intel Core architecture released in 2006). Newer Intel Xeons based on the Nehalem or later architectures can't be used in an HC-1-like system until Convey completes the QuickPath Interconnect interface for its coprocessor board. The HC-1 host runs a 64-bit 2.6.18 Linux kernel with a modified virtual memory system to accommodate memory coherency for the coprocessor board.

Top-Level Design

Figure 1 shows the coprocessor board's design. There are four user-programmable Virtex-5 LX 330s, which Convey calls the application engines (AEs). Convey refers to a particular configuration of these FPGAs as a "personality." The four AEs each connect to eight memory controllers through a full crossbar. Each memory controller is implemented on its own FPGA and is connected to two Convey-designed scatter-gather dual inline memory modules (SG-DIMMs), each containing 64 banks and an integrated Stratix-2 FPGA. The AEs themselves are interconnected in a ring configuration with 668-Mbytes/s, full-duplex links for AE-to-AE communication. These links can be useful for multi-FPGA applications.

[Figure 1. The HC-1 coprocessor board. Four application engines (AE0-AE3, each a Virtex-5 LX 330) connect to eight memory controllers (MC0-MC7) through a full crossbar. Each memory controller is implemented on its own field-programmable gate array and serves two 1-Gbyte SG-DIMMs over 5-Gbytes/s links; each AE-to-MC link runs at 2.5 Gbytes/s, and the AEs and application engine hub are linked at 668 Mbytes/s, full duplex. The host board holds a dual-core 2.13-GHz Xeon 5138 with a 4-Mbyte L2 cache, an Intel 5400 northbridge, 24 Gbytes of RAM, and a 1,066-MHz FSB.]

Memory Interleave Modes

Each AE has a 2.5-Gbytes/s link to each memory controller, and each SG-DIMM has a 5-Gbytes/s link to its corresponding memory controller. As such, the AEs' effective memory bandwidth depends on their memory access pattern across the eight memory controllers and their two SG-DIMMs each. Each AE can achieve a theoretical peak bandwidth of 20 Gbytes/s when striding across eight different memory controllers, but this bandwidth drops if two other AEs attempt to read from the same set of SG-DIMMs, because this would saturate the 5-Gbytes/s DIMM-to-memory-controller links.

Because each memory address maps to only one SG-DIMM (and its corresponding memory controller), Convey's goal when designing its memory system was to maximize the likelihood that an arbitrary set of unique memory references would be uniformly distributed across all 16 SG-DIMMs and eight memory controllers. Convey provides two user-selectable memory-mapping modes to partition the coprocessor's virtual address space among the SG-DIMMs:

• Binary interleave, which maps bitfields of the memory address to a particular controller, DIMM, and bank; and
• 31-31 interleave, a modulo-31 mapping optimized for constant memory strides (stride lengths that are a power of two are guaranteed to hit all 16 SG-DIMMs for any sequence of 16 consecutive references).

The memory banks are divided into 32 groups of 32 banks each. In 31-31 interleave, one group isn't used, and one bank within each of the remaining groups isn't used, leaving 31 groups of 31 banks. Because the number of groups and of banks per group is a prime number, this reduces the likelihood of strides aliasing to the same SG-DIMM. Selecting the 31-31 interleave comes at a cost of approximately 1 Gbyte of addressable memory space (6 percent) and a 6 percent reduction in peak memory bandwidth.
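The payoff of the prime modulus is easy to see in a toy model. The following C sketch is illustrative only: it reduces each interleave mode to a simple block-index selector and ignores the real hardware's bitfield layout and group/bank hierarchy, which aren't fully documented here. It counts how many distinct memory units 16 consecutive references touch for a power-of-two stride.

```c
#include <stdio.h>

/* Toy model of the two HC-1 interleave modes (illustrative only; the
 * real hardware maps addresses onto 31 groups of 31 banks across 16
 * SG-DIMMs). We count how many distinct memory units a strided sweep
 * of REFS consecutive references touches. */

#define REFS  16           /* consecutive references, as in the article */
#define BLOCK 64           /* bytes per memory block (assumed)          */

static int distinct_units(unsigned long stride, int modulus) {
    int hit[31] = {0}, count = 0;
    for (unsigned long i = 0; i < REFS; i++) {
        unsigned long block = (i * stride) / BLOCK;
        int unit = (int)(block % modulus);      /* unit selector */
        if (!hit[unit]) { hit[unit] = 1; count++; }
    }
    return count;
}

int main(void) {
    unsigned long stride = 1024;   /* a power-of-two stride of 1,024 bytes */
    printf("binary interleave (mod 16): %d units touched\n",
           distinct_units(stride, 16));  /* every block index is a multiple
                                            of 16, so always unit 0 */
    printf("prime interleave  (mod 31): %d units touched\n",
           distinct_units(stride, 31));  /* 16 is coprime to 31, so all 16
                                            references spread out */
    return 0;
}
```

With a 1,024-byte stride, every block index is a multiple of 16, so the modulo-16 selector lands on the same unit all 16 times, while the modulo-31 selector spreads the same references across 16 distinct units. This aliasing is exactly what the 31-31 mode is designed to avoid.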
Coprocessor Memory Coherency

The coprocessor memory is cache coherent with the host memory and is implemented using the snoopy coherence mechanism built into the Intel FSB protocol. This essentially creates a common virtual address space that both the host and coprocessor share.

In the coherence protocol, both the host and the coprocessor possess copies of the global memory space. Each block of memory addresses in both the host memory and the coprocessor memory is marked as exclusive, shared, or invalid. A write by the host to an address block changes its status to exclusive and invalidates the block on the coprocessor (indicating that it's out of date). If one of the application engines on the coprocessor then reads from this block, an updated copy of the block's contents is sent to the coprocessor memory, and the memory block changes to shared in both the host and coprocessor memory. The
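As a way to visualize the protocol, here is a minimal sketch of the exclusive/shared/invalid transitions described above, with one copy of the block state kept on each side of the FSB. It models only the two transitions the column names (a host write, then an AE read); it is not Convey's or Intel's actual coherence logic, which lives in the FSB protocol and the AEH FPGA.

```c
#include <stdio.h>

/* Minimal sketch of the exclusive/shared/invalid block states the
 * article describes. Illustrative only -- the real mechanism is the
 * snoopy coherence built into the Intel FSB protocol. */

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

typedef struct {
    BlockState host;    /* state of the block in host memory        */
    BlockState coproc;  /* state of the block in coprocessor memory */
} BlockDir;

/* Host writes the block: the host copy becomes exclusive and the
 * coprocessor copy is invalidated (it's now out of date). */
static void host_write(BlockDir *b) {
    b->host   = EXCLUSIVE;
    b->coproc = INVALID;
}

/* An AE reads the block: updated contents are sent across the FSB to
 * coprocessor memory, and both copies become shared. */
static void ae_read(BlockDir *b) {
    if (b->coproc == INVALID)
        b->host = b->coproc = SHARED;  /* data transfer happens here */
}

static const char *name(BlockState s) {
    return (const char *[]){"invalid", "shared", "exclusive"}[s];
}

int main(void) {
    BlockDir b = { SHARED, SHARED };
    host_write(&b);
    printf("after host write: host=%s coproc=%s\n", name(b.host), name(b.coproc));
    ae_read(&b);
    printf("after AE read:    host=%s coproc=%s\n", name(b.host), name(b.coproc));
    return 0;
}
```

Because these transitions happen in hardware on every FSB transaction, a program never stages data explicitly; the same pointer is valid on both sides of the bus.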
coherence mechanism is transparent to the user and removes the need for the explicit direct memory access (DMA) transactions that coprocessors based on peripheral component interconnect (PCI) require.

Host Interface

The coprocessor board contains two non-user-programmable FPGAs that together form the application engine hub (AEH). One FPGA serves as the physical interface between the coprocessor board and the FSB; its logic monitors the FSB to maintain the snoopy memory-coherence protocol and manages the coprocessor memory's page table. This FPGA is actually mounted on the mezzanine connector.

The second AEH FPGA contains the scalar processor, a soft-core processor that implements the base Convey instruction set. The scalar processor is a substantial architecture, including a cache and features such as multiple-issue out-of-order execution, branch prediction, and register renaming. The scalar processor is connected to each AE via a point-to-point link and uses this link to dispatch to the AEs any instructions that aren't entirely implemented on the scalar processor. The host CPU can also use this mechanism to send parameters to and receive status information from the AEs.

Among the personalities Convey supplies are single- and double-precision vector personalities, a financial analytics personality, and a Smith-Waterman personality. The two vector personalities act as vector coprocessors for the scalar processor and are targets for Convey's vectorizing compiler.
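The division of labor between the scalar processor and the AEs can be pictured as a dispatch loop: base-ISA instructions retire on the AEH's soft core, while personality-specific instructions travel over the point-to-point links. The sketch below is hypothetical; the opcode split, the broadcast to all four AEs, and every name in it are invented for illustration and do not reflect Convey's actual microarchitecture or ISA encoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of the scalar processor's dispatch decision.
 * Base-ISA instructions execute locally on the AEH soft core, while
 * personality instructions are forwarded to the AEs over the
 * point-to-point links. All encodings here are invented. */

#define N_AES 4

typedef struct { uint8_t opcode; uint64_t operand; } Insn;

/* Assumed split: opcodes >= 0x80 belong to the loaded personality. */
static int is_personality_insn(const Insn *i) { return i->opcode >= 0x80; }

static void execute_scalar(const Insn *i) {
    printf("scalar: opcode %02x executed on AEH soft core\n", i->opcode);
}

static void dispatch_to_ae(int ae, const Insn *i) {
    printf("AE%d   : opcode %02x received over point-to-point link\n",
           ae, i->opcode);
}

int main(void) {
    Insn program[] = {
        { 0x12, 0 },       /* base Convey ISA: runs on scalar processor */
        { 0x9a, 0x1000 },  /* personality instruction: goes to the AEs  */
    };
    for (size_t k = 0; k < sizeof program / sizeof *program; k++) {
        if (is_personality_insn(&program[k]))
            for (int ae = 0; ae < N_AES; ae++)
                dispatch_to_ae(ae, &program[k]);
        else
            execute_scalar(&program[k]);
    }
    return 0;
}
```

The same dispatch path doubles as the host's control channel, which is how the host CPU sends parameters to and collects status from the AEs without a separate driver-managed mailbox.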
Recommended publications
• x86 Platform Coprocessor/PrPMC (PC on a PMC)
  • Convey Overview
  • On Heterogeneous Compute and Memory Systems
• Exploiting Free Silicon for Energy-Efficient Computing Directly in NAND Flash-Based Solid-State Storage Systems
• CUDA: What Is GPGPU?
• Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
• World's First High-Performance x86 with Integrated AI Coprocessor
• Introduction to CPU
• Heterogeneous CPU+GPU Computing
• State-of-the-Art in Heterogeneous Computing
  • AI Chips: What They Are and Why They Matter
  • Instruction Set Innovations for Convey's HC-1 Computer