An Integrated Multiprocessor for Matrix Algorithms


Warren Marwood B.Sc., B.E.

A thesis submitted to the Department of Electrical and Electronic Engineering, the University of Adelaide, to meet the requirements for award of the degree of Doctor of Philosophy.

June 1994

Contents

Abstract vi
Statement of Originality viii
Acknowledgements ix
Publications x
List of Figures xiv
List of Tables xxi

Chapter 1: The Evolution of Computers 1
1.1 Scalar Architectures 1
1.2 Vector Architectures 5
1.3 Parallel Architectures 6
1.3.1 Ring Architectures 8
1.3.2 The Two-dimensional Mesh 9
1.3.3 The Three-dimensional Mesh 10
1.3.4 The Hypercube 11
1.4 Massively Parallel Computers: a Summary 12
1.5 The MATRISC Processor 14
1.5.1 Architecture 14
1.5.2 Performance/Algorithms 15
1.6 Summary 15

Chapter 2: A Review of Systolic Processors 17
2.1 The Inner-product-step Processor 19
2.2 A Systolic Cell for Polynomial and DFT Evaluation 23
2.3 The Engagement Processor 25
2.4 Algorithms 26
2.4.1 Convolution 27
2.4.2 Finite Impulse Response (FIR) Filters 27
2.4.3 Discrete Fourier Transform 28
2.5 Heterogeneous Systolic Arrays 29
2.5.1 LU Decomposition 29
2.5.2 Solution of Triangular Linear Systems 30
2.5.3 A Systolic Array for Adaptive Beamforming 33
2.6 Systolic Arrays for Data Formatting 34
2.7 Bit-Level Systolic Arrays 35
2.7.1 Beamforming at the Bit-level 37
2.7.2 Optimisation for Binary Quantised Data 40
2.8 Configurable Systolic Architectures 41
2.8.1 The Configurable Highly Parallel (CHiP) Computer 42
2.8.2 The Programmable Systolic Chip (PSC) 43
2.9 Mapping Algorithms to Architectures 44
2.9.1 Software 44
2.10 Systolic Processors/Coprocessors 46
2.11 Summary 49

Chapter 3: Systolic Ring Processors 52
3.1 A Systolic Multiplier 52
3.1.1 A Multiplier Model 54
3.1.2 A Systolic Ring Multiplier 59
3.2 Parallel Digit Multiplication 61
3.2.1 Digit-Serial Integer Multiplication 62
3.2.2 Digit-Serial Floating Point Multiplication 66
3.3 Coalescence of Systolic Ring Multipliers 68
3.4 A Ring Accumulator 69
3.4.1 Floating Point Addition/Accumulation 69
3.4.2 A Simplified Algorithm 71
3.5 Coalescence of Ring Accumulators 77
3.6 A Systolic Ring Floating Point Multiplier/Accumulator 78
3.6.1 Coalescence of Ring Multiply/Accumulators 82
3.7 Discussion 82

Chapter 4: The MATRISC Processor 84
4.1 A Lattice Model 85
4.1.1 Implications for Matrix Processors 90
4.1.2 Analysis of a Rectangular Systolic Array Processor 91
4.1.3 Optimisation of a Square Systolic Array 93
4.1.4 Bandwidth Considerations 101
4.2 A Generalised Matrix Address Generator 103
4.2.1 The Address Generator Architecture 104
4.2.2 An Implementation 108
4.2.3 Area and Time Considerations 109
4.2.4 Matrix Addressing Examples 109
4.2.5 Address Generation in Higher Dimensions 112
4.3 The MATRISC Architecture 114
4.4 Discussion 118

Chapter 5: MATRISC Matrix Algorithms 120
5.1 Matrix Primitives 121
5.1.1 Matrix Multiplication 121
5.1.2 Element-wise Operations 122
5.1.3 Matrix Transposition/Permutation 122
5.2 Matrix Algorithms 122
5.2.1 FIR Filtering, Convolution and Correlation 122
5.2.2 QR Factorisation 127
5.2.3 The Discrete Fourier Transform 129
5.3 A MATRISC Performance Model and its Validation 130
5.3.1 Matrix Multiplication 131
5.3.2 The FIR Algorithm 135
5.3.3 A BLAS Matrix Primitive 137
5.4 A MATRISC Processor and its Performance 138
5.4.1 Matrix Multiplication 139
5.4.2 FIR Filters, Convolution and Correlation 142
5.4.3 The BLAS SGEMM() Subroutine 143
5.4.4 The QR Factorisation 144
5.5 A Constant Bandwidth Array 144

Chapter 6: The Fourier and Hartley Transforms 147
6.1 The Discrete Fourier Transform 147
6.2 Linear Mappings Between One and Two Dimensions 149
6.2.1 Relatively Prime 150
6.2.2 Common Factor 150
6.3 Two-dimensional Mappings for a One-dimensional DFT 150
6.3.1 The Prime Factor Case 151
6.3.2 The Common Factor Case 158
6.4 p-dimensional Mappings and the One-dimensional DFT 160
6.4.1 Particular Implementations 162
6.4.2 A Recursive Implementation 164
6.5 Performance 165
6.5.1 Addition/Subtraction 166
6.5.2 Hadamard or Schur (Elementwise) Multiplication 167
6.5.3 Matrix Multiplication 168
6.5.4 The Prime Factor Algorithm 170
6.5.5 The Common-Factor Algorithm 173
6.6 The Discrete Hartley Transform 177
6.7 Two-dimensional Mappings for a One-dimensional DHT 178
6.7.1 The Prime Factor Case 179
6.7.2 The Common Factor Case 181
6.8 Performance 183
6.8.1 The Prime-Factor Algorithm 183
6.8.2 The Common-Factor Algorithm 183
6.8.3 Comparison 184
6.9 A Multi-dimensional Transform Example 184
6.9.1 The Multi-dimensional Prime Factor cos Transform 184
6.10 Summary 192

Chapter 7: Implementation Studies in Si and GaAs 193
7.1 Silicon: The SCalable Array Processor (SCAP) 193
7.1.1 System Architecture 194
7.1.2 Hardware 194
7.1.3 The Data Formatter Chip 195
7.1.4 The Processing Element Chip 199
7.1.5 Software 203
7.2 Gallium Arsenide 206
7.2.1 Gallium Arsenide Technology 208
7.2.2 Detailed Circuit Design and Simulation 215
7.2.3 Test Equipment and Procedures 225
7.3 Interconnection Technology 228
7.3.1 Multi-chip Modules 228
7.3.2 Silicon Hybrids 229

Chapter 8: Summary, Future Trends and Conclusion 231
8.1 Summary 231
8.1.1 Implementation 232
8.1.2 Software 233
8.2 Current and Future Trends 233
8.2.1 The Matrix Product 234
8.3 MATRISC Multiprocessors: Teraflops Machines 235
8.4 Conclusion 236

Bibliography 237

Abstract

Current trends in supercomputer architectures indicate that massive parallelism will provide the means by which computers will achieve the fastest possible performance for arbitrary problems. For the particular case of signal processing, a study of the computationally intensive algorithms revealed their dependence on matrix operations in general, and the O(n³) matrix product in particular.
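The dependence of signal-processing kernels on the matrix product can be made concrete with a small sketch. This is our own illustration, not code from the thesis: an FIR filter (a convolution) is recast as a matrix-vector product with a Toeplitz matrix, which is the kind of restructuring that lets a matrix engine absorb such workloads.

```python
# Illustration (not from the thesis): an FIR filter computed two ways,
# directly and as a Toeplitz matrix-vector product.

def fir_direct(h, x):
    """Direct convolution: y[n] = sum_k h[k] * x[n-k]."""
    ny = len(x) + len(h) - 1
    y = [0.0] * ny
    for n in range(ny):
        for k in range(len(h)):
            if 0 <= n - k < len(x):
                y[n] += h[k] * x[n - k]
    return y

def fir_as_matrix(h, x):
    """Same filter expressed as y = H x, where H[i][j] = h[i-j]."""
    ny = len(x) + len(h) - 1
    H = [[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(len(x))]
         for i in range(ny)]
    return [sum(H[i][j] * x[j] for j in range(len(x))) for i in range(ny)]

h = [0.5, 0.25, 0.25]   # example filter taps (ours)
x = [1.0, 2.0, 3.0, 4.0]  # example input (ours)
assert fir_direct(h, x) == fir_as_matrix(h, x)
```

The matrix form trades redundant storage for a single dense primitive; batching many input vectors into a matrix turns the filter into exactly the O(n³) matrix product the abstract highlights.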
This thesis proposes a computer architecture for these algorithms. The architectural philosophy is to add hardware support for a limited set of matrix operations to the conventional Reduced or Complex Instruction Set Computer (RISC or CISC) architectures which are commonly used at multiprocessor nodes. The support for the matrix data type is provided with systolic array techniques. This philosophy parallels and extends that of the RISC computer architecture proposed in the 1980s, and as a consequence the new processor architecture proposed in this thesis is referred to as a MATrix Reduced Instruction Set Computer (MATRISC). The architecture is shown to offer outstanding performance for the class of problems which are expressible in terms of matrix algebra. The concepts of massive parallelism are applicable to arrays of MATRISC processors, each of which appears as a conventional machine in terms of both hardware and software. Tasks are partitioned into sub-tasks expressible in matrix form. This form embeds, or hides, a high level of parallelism within the matrix operators. The work in this thesis is devoted to the architecture, implementation and performance of a MATRISC processing node. Specific advantages of the MATRISC architecture include:

1. the provision of orders of magnitude improvement in the peak computational performance of a multiprocessor processing node;
2. a simple object-oriented coding paradigm which follows traditional problem formulation processes and conventional von Neumann coding techniques;
3. a design method which controls complexity and allows the use of arbitrary numbers of processing elements in the implementation of MATRISC processors.

The restricted number of efficiently implemented matrix primitives provided in the MATRISC processor can be used to implement all of the higher-order matrix operators found in matrix algebra.
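The idea of composing higher-order operators from a restricted primitive set can be sketched as follows. This is our own minimal illustration of the philosophy, not the thesis's software; the class and method names are hypothetical, and a real MATRISC node would dispatch the primitives to the systolic hardware rather than compute them in software.

```python
# Sketch (ours, hypothetical names): a node exposes only a few matrix
# primitives; higher-order operators are composed from them in ordinary
# object-oriented, von Neumann-style code.

class Matrix:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    # --- the restricted primitive set the hardware would accelerate ---
    def matmul(self, other):
        """Matrix multiplication primitive."""
        n, m, p = len(self.rows), len(other.rows), len(other.rows[0])
        return Matrix([[sum(self.rows[i][k] * other.rows[k][j]
                            for k in range(m)) for j in range(p)]
                       for i in range(n)])

    def ewise(self, other, op):
        """Element-wise operation primitive (e.g. Hadamard product)."""
        return Matrix([[op(a, b) for a, b in zip(ra, rb)]
                       for ra, rb in zip(self.rows, other.rows)])

    def transpose(self):
        """Transposition/permutation primitive."""
        return Matrix([list(col) for col in zip(*self.rows)])

# --- a higher-order operator built only from the primitives ---
def gram(a):
    """Gram matrix A^T A, expressed as transpose followed by matmul."""
    return a.transpose().matmul(a)

A = Matrix([[1, 2], [3, 4]])
assert gram(A).rows == [[10, 14], [14, 20]]
assert A.ewise(A, lambda x, y: x * y).rows == [[1, 4], [9, 16]]
```

The calling code stays sequential and conventional; the parallelism is hidden inside each primitive, which is the embedding of parallelism in the matrix operators that the abstract describes.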
As in the RISC philosophy, resources are concentrated on the primary set of operations or instructions which incur the highest penalties in execution. Also reflected from the RISC development is the integration of the software and hardware to produce a complete system. The novelty in the thesis is found in three major areas, and in the whole to which these areas contribute:

1. the design of low complexity and low transistor count arithmetic units which perform floating point computations with variable precision and dynamic range, and which are designed to optimise the execution time of matrix operations;
2. the design of an address generator and systolic interface which extends the domain of application for the systolic computation engine;
3. the integration of the hardware into conventional software compilers.

Simulation results for the MATRISC processor are provided which give performance estimates for systems which can be implemented in current technologies. These technologies include both Gallium Arsenide and Silicon. In addition, a description of a concept demonstrator is provided which has been implemented in a 1.2 micron CMOS process. This concept demonstrator has been installed in a SUN SPARCstation 1. The simulator code is verified by comparing simulation predictions with measured performance of the concept demonstrator. The simulator parameters are then modified to describe a typical system which is being implemented in current technologies. These simulation results show that nodal processing rates which exceed 5 gigaflops are achievable at single processor nodes which use current memory technologies in non-interleaved structures.
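A rough sense of where multi-gigaflop nodal rates come from can be given by back-of-envelope arithmetic. This is our own calculation, not the thesis's performance model: each systolic multiply-accumulate cell contributes one multiply and one add per cycle, so peak rate scales as twice the cell count times the clock frequency. The array size and clock below are illustrative assumptions.

```python
# Back-of-envelope peak rate for a systolic array of multiply-accumulate
# cells (our arithmetic, not the thesis model): each cell performs one
# multiply and one add per clock cycle, i.e. 2 flops/cycle.

def peak_flops(n_cells, f_clock_hz):
    """Peak flop rate of n_cells MAC units clocked at f_clock_hz."""
    return 2 * n_cells * f_clock_hz

# Hypothetical example: a 16 x 16 array clocked at 10 MHz.
rate = peak_flops(16 * 16, 10e6)
assert rate == 5.12e9  # just over 5 gigaflops
```

Sustained rates depend on keeping the array fed, which is why the thesis devotes attention to address generation and memory bandwidth; the figure above is a peak, not a sustained, rate.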