Hardware Platforms for Embedded Computing Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 Universität Dortmund Importance of Energy Efficiency

Importance of Energy Efficiency
Efficient software design is needed; otherwise, the price for software flexibility cannot be paid.
© Hugo De Man (IMEC) Philips, 2007

Embedded vs. general-purpose processors
• Embedded processors may be optimized for a category of applications.
  – Customization may be narrow or broad.
• We may judge embedded processors using different metrics:
  – Code size.
  – Memory system performance.
  – Predictability.
• Disappearing distinction: embedded processors everywhere.

Microcontroller Architectures
• Von Neumann architecture: a single memory (addresses 0 to 2^n) holds both program and data and is connected to the CPU through one address bus and one data bus.
• Harvard architecture: the program memory (address bus, fetch bus) and the data memory (address bus, data bus) are separate and are connected to the CPU through separate buses.

RISC processors
• RISC generally means highly pipelinable, one instruction per cycle.
• Pipelines of embedded RISC processors have grown over time:
  – ARM7 has a 3-stage pipeline.
  – ARM9 has a 5-stage pipeline.
  – ARM11 has an 8-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]

ARM Cortex
Based on the ARMv7 architecture and the Thumb®-2 ISA.
– ARM Cortex-A series: applications CPUs focused on the execution of complex OS and user applications
  • First product: Cortex-A8
  • Executes ARM and Thumb-2 instructions
– ARM Cortex-R series: deeply embedded processors focused on real-time environments
  • First product: Cortex-R4(F)
  • Executes ARM and Thumb-2 instructions
– ARM Cortex-M series: microcontroller cores focused on very cost-sensitive, deterministic, interrupt-driven environments
  • First product: ARM Cortex-M3 (2 µA, 0.5 mW/MHz)
  • Executes Thumb-2 instructions

[Figure: Cortex-M3 "Processor".]

Central Core
• Harvard architecture: separate instruction and data buses enable parallel fetch and store.
• Advanced 3-stage pipeline: includes branch forwarding and speculation.
• Additional write-back via the bus matrix.

Microcontrollers
A single chip containing a CPU, memory (ROM, RAM), and I/O subsystems: timers, counters, analog interfaces, I/O interfaces.

A Microcontroller SoC example: STM32 Value line, 64K-128KBytes
• Core and operating conditions
  – ARM® Cortex™-M3 CPU
  – 1.25 DMIPS/MHz, up to 24 MHz
  – 2.0 V to 3.6 V range
  – -40 to +105 °C
• Rich connectivity
  – 8 communications peripherals
• Advanced analog
  – 12-bit ADC, 1.2 µs conversion time
  – Dual-channel 12-bit DAC
• Enhanced control
  – 16-bit motor control timer
  – 6x 16-bit PWM timers
• Packages: LQFP48, LQFP/BGA64, LQFP100
[Figure: STM32 Value line system diagram. Blocks shown: Cortex-M3 CPU (24 MHz), 64 kB to 128 kB flash memory with flash interface, 8 kB SRAM, 20 B backup data, JTAG/SW debug, nested vectored interrupt controller, SysTick timer, 7-channel DMA, bus matrix/arbiter, 1.8 V regulator, POR/PDR/PVD, crystal oscillators (32 kHz + 4 to 25 MHz), internal RC oscillators (40 kHz + 8 MHz), PLL, RTC/AWU, clock control, two peripheral buses with bridges (max 24 MHz), 37/51/80 I/Os, up to 16 external interrupts, 2 watchdogs (independent and window), 16-bit timers, CEC, USART/LIN with smartcard/IrDA and modem control, SPI, I2C, 16-channel 12-bit ADC, 2-channel 12-bit DAC, temperature sensor.]

DSP Applications
• Audio applications
• MPEG Audio
• Portable audio
• Digital cameras
• Wireless
• Cellular telephones
• Base station
• Networking
• Cable modems
• ADSL
• VDSL
Embedded computing needs lots of DSP capabilities.

DSP architectures
Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$
Computed incrementally: $\forall i,\ 0 \le i \le n-1:\quad y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$
Architecture example: data path of the ADSP210x, showing parallelism and dedicated registers.
• ALU (+, -, ..) with registers AX, AY, AF, AR.
• Multiplier (*) with registers MX, MY, MF; MX holds x[j-i], MY holds a[i], and MR accumulates y_{i-1}[j].
• Address generation unit (AGU) with address registers A0, A1, A2, supporting +,- updates.
Code for the ADSP210x:
    MR:=0; A1:=1; A2:=n-2;
    MX:=x[n-1]; MY:=a[0];
    for ( j:=1 to n)
      {MR:=MR+MX*MY;
       MY:=a[A1]; MX:=x[A2];
       A1++; A2--}

DSP - Features (1)
• Multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions (as shown)
• Heterogeneous registers (as shown)
• Separate address generation units (AGUs) (as in ADSP 210x)

Single Issue vs VLIW
[Figure: a single-issue CPU executes 1 instr/cycle, each instruction holding one op. For a 3-issue VLIW, the compiler packs ops (padding with nops) into wide instructions; the machine executes 1 instr/cycle, 3 ops/cycle.]
2/25/2016, Embedded Computer Architecture, H. Corporaal and B. Mesman

ARM Processors Families
Cortex-M4
• ARMv7E-M architecture
• Thumb-2 only
• DSP extensions
• Optional FPU (Cortex-M4F)
• Otherwise, same as Cortex-M3
• Implements the full Thumb-2 instruction set
  – Saturated math (e.g. QADD)
  – Packing and unpacking (e.g. UXTB)
  – Signed multiply (e.g. SMULTB)
  – SIMD (e.g. ADD8)
Cortex-M3: 60k* gates in total.
University Program Material, Copyright © ARM Ltd 2012

Binary Upwards Compatibility
[Figure: ARMv6-M architecture and ARMv7-M architecture.]

Cortex-M4 DSP instructions
Remember VLIW?

Multi-processors SoCs for Embedded Computing

Application pull
[Figure: application pull [IMEC]. Required compute efficiency (5 GOPS/W, 10 GOPS/W, 100 GOPS/W, 1 TOPS/W) versus year of introduction (2005 to 2015). Applications plotted include 802.11n, mobile base-band, H264 decoding and encoding, UWB A/V streaming, Si Xray, image, sign, gesture, emotion, and expression recognition, language dictation, auto personalization, fully recognition (security), adaptive route, collision avoidance, Gbit radio, structured encoding and decoding, gesture detection, HMI by motion, autonomous driving, 3D projected display, ubiquitous navigation, 3D ambient interaction, 3D TV, and 3D gaming.]

Power Bottleneck
[Figure: power trend, 1990 to 2020. Power consumption (dynamic power, gate-oxide leakage, sub-threshold leakage) and physical gate length [nm], with a possible trajectory for high-k dielectrics.]
[Figure: power density trend (Watts/cm²) from 250nm to 65nm, split into leakage power and dynamic power [STM ASIC].]

Multi-Core & Power
[Figure: large core versus small core. Small core: power = 1/4, performance = 1/2. A multi-core of four small cores (C1 to C4) shares a cache.]
Multi-core:
• Power efficient
• Better power and thermal management

µArchitecture Techniques
• Multi-threading: a single thread (ST) waits for memory; with multi-threading (MT1, MT2, MT3), other threads run during the waits, giving full HW utilization.
• Increase on-die memory: [Figure: cache as % of total area (0% to 100%) for 486, Pentium®, Pentium® III, Pentium® 4, and Pentium® M, from 1u to 65nm.] Improved performance, no impact on thermals & power delivery.
• Chip multi-processing: [Figure: relative performance versus die area and power for a multi-core (C1 to C4 plus cache) and a large single core.]

Integrated SoC Mobile
• High-speed SMP for "almost sequential" GP
• "Processor arrays" for domain-specific throughput computing (100x GOPS/W), ultra parallel…

H-SOC in 2013 – Apple A7
Used in iPad Air & iPhone 5s.

H-SOC in 2015/16
Tegra K1

Heterogeneous Computing in K1
Visual Analytics & Computational Photography

Accelerated (Heterogeneous) Embedded Computing

Hardware Execution Model
[Figure: a CPU with CPU memory next to a GPU with GPU memory; the GPU has Core 0 to Core 15, each with Lane 0 to Lane 15.]
• The GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor with multiple lanes but with no scalar processor.
• The CPU sends the whole "grid" over to the GPU, which distributes thread blocks among cores (each thread block executes on one core).
  – The programmer is unaware of the number of cores.

CPUs vs GPUs
[Figure: CPU and GPU block diagrams built from control, ALUs, cache, and DRAM.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign

CUDA Programmer's View of GPUs
A GPU contains multiple SIMD units. All of them can access global memory.

Simplified CUDA Programming Model
• Computation is performed by a very large number of independent small scalar threads (CUDA threads or microthreads) grouped into thread blocks.
    // C version of DAXPY loop.
    void daxpy(int n, double a, double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

    // CUDA version.
    // Piece run on GP-GPU: one CUDA thread per element.
    __global__ void daxpy(int n, double a, double *x, double *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];  // guard: the last block may be partly idle
    }

    // Piece run on host processor.
    int nblocks = (n+255)/256;  // 256 CUDA threads/block
    daxpy<<<nblocks,256>>>(n, 2.0, x, y);

Thread Hierarchy in CUDA
• A grid contains thread blocks.
• A thread block contains threads.

Sharing memory
• Mobile GPUs share memory with the CPU.
• This is also converging for general computing: "Heterogeneous System Architecture".

Energy Efficiency Again
What if the workload is not friendly to multiprocessors (MP) or to the GPU (MP+GPU)?
Efficient software design is needed; otherwise, the price for software flexibility cannot be paid.
© Hugo De Man (IMEC) Philips, 2007

FPGA
Reconfigurable computing: a computer architecture combining some of the flexibility of software with the high performance of hardware by processing with very flexible high-speed computing fabrics like field-programmable gate arrays (FPGAs).
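
Returning to the DAXPY example from the simplified CUDA programming model: the index arithmetic of the grid can be checked without a GPU. The following is a minimal sketch in plain C, not CUDA. It visits every (block, thread) pair sequentially and applies the same index computation and bounds guard as the kernel. The names `daxpy_thread`, `daxpy_grid`, and `THREADS_PER_BLOCK` are hypothetical and chosen for illustration only; on a real GPU the iterations run in parallel and in no fixed order.

```c
#include <assert.h>

#define THREADS_PER_BLOCK 256  /* same block size as in the slide's launch */

/* Work of one emulated CUDA thread: same index math and guard as the kernel. */
void daxpy_thread(int block_idx, int block_dim, int thread_idx,
                  int n, double a, const double *x, double *y)
{
    int i = block_idx * block_dim + thread_idx;
    if (i < n)              /* threads past the end of the arrays do nothing */
        y[i] = a * x[i] + y[i];
}

/* Sequential stand-in for the launch daxpy<<<nblocks,256>>>(n, a, x, y).
   Returns the number of thread blocks in the emulated grid. */
int daxpy_grid(int n, double a, const double *x, double *y)
{
    assert(n >= 0);
    /* Round up so that a partial last block still covers the tail. */
    int nblocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    for (int b = 0; b < nblocks; b++)
        for (int t = 0; t < THREADS_PER_BLOCK; t++)
            daxpy_thread(b, THREADS_PER_BLOCK, t, n, a, x, y);
    return nblocks;
}
```

For n = 1000, the rounding yields 4 blocks, that is 1024 emulated threads, of which the last 24 fail the `i < n` test and leave memory untouched. This is why the kernel needs the guard whenever n is not a multiple of the block size.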