Generation of Custom Run-Time Reconfigurable Hardware for Transparent Binary Acceleration


FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Generation of Custom Run-time Reconfigurable Hardware for Transparent Binary Acceleration

Nuno Miguel Cardanha Paulino

Programa Doutoral em Engenharia Electrotécnica e de Computadores (PDEEC)

Supervisor: João Canas Ferreira (Assistant Professor)
Co-supervisor: João M. P. Cardoso (Associate Professor)

June 2016

© Nuno Miguel Cardanha Paulino, 2016

Abstract

With the increase of application complexity and amount of data, the required computational power increases in tandem. Technology improvements have allowed for the increase in clock frequencies of all kinds of processing architectures. But exploring new architectures and computing paradigms beyond the simple single-issue in-order processor is equally important for increasing performance, by properly exploiting the data-parallelism of demanding tasks. For instance: superscalar processors, which discover instruction parallelism at runtime; Very Long Instruction Word processors, which rely on compile-time parallelism discovery and multiple issue units; and multi-core approaches, which are based on thread-level parallelism exploited at the software level.

For embedded applications, depending on the performance requirements or resource constraints, and if the application is composed of well-defined tasks, a more application-specific system may be appropriate, i.e., developing an Application-Specific Integrated Circuit (ASIC). Custom logic delivers the best power/performance ratio, but this solution requires advanced hardware expertise, implies long development times, and, above all, very high manufacturing costs.

This work designed and evaluated a transparent binary acceleration approach, targeting Field Programmable Gate Array (FPGA) devices, which relies on instruction traces to automatically generate specialized accelerator instances. A custom accelerator, capable of executing a set of previously detected loop traces, is coupled to a host MicroBlaze processor. The traces are detected via simulation of the target binary. The approach does not require the application source code to be modified, which ensures transparency for the application developer. No custom compilers are necessary, and the binary code does not need to be altered either offline or at runtime.

The accelerators contain per-instance reconfiguration capabilities, which allow for the reuse of computing units between accelerated loops without sacrificing the benefits of circuit specialization. To increase the achievable performance, the accelerator is capable of performing two concurrent memory accesses to the MicroBlaze's data memory. The repetitive nature of the loop traces is exploited via loop-pipelining, which maximizes the achievable acceleration. By supporting single-precision floating-point operations via fully-pipelined units, the accelerator is capable of executing realistic data-oriented loops. Finally, the use of Dynamic Partial Reconfiguration (DPR) allows for significant area savings when instantiating accelerators with numerous configurations, and also ensures circuit specialization per configuration.

Several fully functional systems were implemented, using commercial FPGAs, to validate the design iterations of the accelerator. An initial design relied on translating Control and Dataflow Graph representations of the traces into a multi-row array of Functional Units. For 15 benchmarks, the geometric mean speedup was 2.08×.
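The loop traces are found by simulating the target binary and looking for frequently repeating instruction sequences. As a minimal sketch of the underlying idea, assuming a decoded instruction stream in which a taken backward branch closes one loop iteration (the data layout and all names below are illustrative, not the thesis toolchain):

```python
from collections import Counter

def detect_loop_traces(trace, min_iterations=100):
    """Find frequently executed loops in a simulated instruction trace.

    `trace` is an iterable of (pc, taken_branch_target) pairs produced by
    simulating the target ELF; `taken_branch_target` is None for non-branch
    instructions and untaken branches. A taken branch to a lower address
    closes one iteration of a candidate loop.
    """
    iteration_counts = Counter()
    for pc, target in trace:
        if target is not None and target < pc:
            # (target, pc) spans one straight-line execution of the loop body
            iteration_counts[(target, pc)] += 1
    # Only loops that iterate often enough are worth offloading
    return {span: n for span, n in iteration_counts.items()
            if n >= min_iterations}
```

Each surviving address span would then be lifted into a Control and Dataflow Graph representation and handed to the accelerator generation flow.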
A second implementation augmented the accelerator with shared access to the entire local data memory of the MicroBlaze. Arbitrary addresses can be accessed without the need for address-generation hardware. Exploiting data-parallelism allows for targeting larger, more realistic traces. The geometric mean speedup for 37 benchmarks was 2.35×.

The most efficient implementation supports floating-point operations and relies on loop pipelining. The developed tools generate an accelerator instance by modulo-scheduling each trace at the minimum possible Initiation Interval. The geometric mean speedup for a set of 24 benchmarks is 5.61×, and the accelerator requires only 1.12× the FPGA slices required by the MicroBlaze. Finally, resorting to DPR, an accelerator with 10 configurations requires only a third of the Lookup Tables relative to an equivalent accelerator without this capability.

To summarize, the approach is capable of expediently generating accelerator-augmented embedded systems which achieve considerable performance increases while incurring a low resource cost, and without requiring manual hardware design.
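The key quantity in the loop-pipelining step is the Initiation Interval (II): the number of cycles between the starts of consecutive loop iterations. A minimal sketch of how the minimum II can be lower-bounded before modulo scheduling, using the standard resource and recurrence constraints (function and parameter names are assumptions for illustration, not the actual tools):

```python
import math

def minimum_initiation_interval(ops, units_per_type, recurrences):
    """Lower-bound the II for modulo-scheduling one loop trace.

    ops:            operation types in the loop body, e.g. ["add", "add", "mul"]
    units_per_type: functional units available per type, e.g. {"add": 2, "mul": 1}
    recurrences:    (total_latency, iteration_distance) for every dependence
                    cycle in the dataflow graph
    """
    # Resource constraint: each unit can start one operation every II cycles,
    # so a type with k operations and n units forces II >= ceil(k / n).
    res_mii = max(
        math.ceil(sum(1 for op in ops if op == t) / n)
        for t, n in units_per_type.items()
    )
    # Recurrence constraint: a dependence cycle of latency L carried across
    # D iterations forces II >= ceil(L / D).
    rec_mii = max((math.ceil(lat / dist) for lat, dist in recurrences),
                  default=1)
    return max(res_mii, rec_mii)

# Example: 4 adds and 2 multiplies on 2 adders and 1 multiplier, plus an
# accumulation chain of latency 3 carried across 1 iteration.
print(minimum_initiation_interval(["add"] * 4 + ["mul"] * 2,
                                  {"add": 2, "mul": 1},
                                  [(3, 1)]))  # -> 3
```

A modulo scheduler would attempt to place the trace at this minimum II and retry at II + 1 upon failure, which is the usual iterative strategy.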
Sumário

(Portuguese abstract, translated.) With the increase in application complexity, the required computational power also increases. Technological improvements have allowed clock frequencies to rise for all kinds of computing architectures. However, exploring new architectures is equally important for performance improvements, by efficiently exploiting the data-parallelism of demanding tasks. For example: superscalar processors, which discover instruction-level parallelism during execution; VLIW (Very Long Instruction Word) processors, which depend on parallelism exploited at compile time and on several parallel units; and multi-core technologies, which rely on thread-level parallelism exploited through software.

For embedded applications, depending on the performance requirements or resource constraints, and if the application is composed of well-defined tasks, a more specific system may be more appropriate, i.e., designing an ASIC (Application Specific Integrated Circuit). Dedicated logic benefits from the best performance per watt, but this solution requires hardware design experience, suffers from long development times, and incurs high manufacturing costs.

This work developed and evaluated a transparent binary acceleration approach, targeted specifically at FPGAs. The approach relies on frequent sequences of executed instructions (i.e., traces) to automatically generate specialized accelerators. An accelerator, capable of executing a set of previously detected traces, complements a MicroBlaze, which acts as the main processor. The traces are detected by simulating the application, and represent execution cycles (i.e., loops). No source-code modifications are necessary, which increases the transparency of the approach for the programmer. No specialized compiler is needed, and the binary code undergoes no post-compilation modifications.

The accelerator contains specialized reconfiguration logic, which allows computing units to be reused between accelerated loops without sacrificing specialization. To maximize performance, the accelerator can perform two parallel accesses to the processor's data memory. The use of loop-pipelining maximizes the acceleration, and the support for floating-point operations allows the execution of realistic embedded tasks. Finally, the use of Dynamic Partial Reconfiguration (DPR) significantly reduces the area needed to support several configurations, and ensures the specialization of the hardware for each configuration.

Several fully functional systems were implemented to validate the accelerators, using commercial FPGAs. A first implementation translates dataflow graph representations (i.e., Control and Dataflow Graphs) of the loops into several rows of interconnected functional units. For 15 benchmarks, the geometric mean speedup was 2.08×. A second implementation adds memory access support to the accelerator, which can directly access the entire local data memory of the MicroBlaze, with accesses to arbitrary addresses supported. Exploiting data-parallelism allows more realistic loops to be accelerated. For 37 benchmarks, the geometric mean speedup was 2.35×. The most efficient implementation supports floating-point operations and uses loop-pipelining. A scheduler generates an accelerator instance by modulo-scheduling each trace at the minimum Initiation Interval. The geometric mean speedup is 5.61× for 24 applications, and the accelerators require 1.12× the number of FPGA slices of a MicroBlaze. Finally, with the use of DPR, an accelerator with 10 configurations needs only a third of the Lookup Tables relative to an equivalent accelerator without this capability.

In conclusion, the approach makes it possible to quickly generate systems with specialized accelerators that considerably increase performance, at a reduced cost in terms of resources, while also avoiding the need for manual hardware design.

Acknowledgments

This thesis is the result of four years of work that I was fortunate enough to be able to carry out with focus and nearly undivided attention due mostly, if not totally, to the support of my parents, who helped keep my mind off other, more time-consuming, and infinitely less productive matters. The individual moments where someone lent me their support, and the particular people who took the time to care and ask about my work, are too numerous to list. A thank you to the friends that shared a seat in the lab where I sit as I type this. It would go without saying, but I'd also like to thank my supervisor, João Canas Ferreira, for his guidance and insight, which helped make sure I didn't stray off into the distance in a random direction. Also, a thank you to my co-supervisor, João Manuel Paiva Cardoso, for many fruitful discussions, and another