SIMD Computing: An Introduction


C. J. C. Schauble
September 12, 1995

High Performance Scientific Computing
University of Colorado at Boulder

© Copyright 1995 by the HPSC Group of the University of Colorado

The following are members of the HPSC Group of the Department of Computer Science at the University of Colorado at Boulder:

  Lloyd D. Fosdick
  Elizabeth R. Jessup
  Carolyn J. C. Schauble
  Gitta O. Domik

Contents

1 General architecture
  1.1 The Connection Machine CM-2
    1.1.1 Characteristics
    1.1.2 Performance
  1.2 The MasPar MP-2
    1.2.1 Characteristics
    1.2.2 Performance
2 Programming issues
  2.1 Architectural organization considerations
    2.1.1 Homes
  2.2 CM Fortran, MPF, and Fortran 90
    2.2.1 Arrays
    2.2.2 Array sections
    2.2.3 Alternate DO loops
    2.2.4 WHERE statements
    2.2.5 FORALL statements
  2.3 Built-in functions for CM Fortran and Fortran 90
    2.3.1 Intrinsic functions
    2.3.2 Masks
    2.3.3 Special functions
  2.4 Compiler directives
    2.4.1 CM Fortran LAYOUT
    2.4.2 MasPar MPF MAP
    2.4.3 CM Fortran ALIGN
    2.4.4 CM Fortran COMMON
    2.4.5 MasPar MPF ONDPU
    2.4.6 MasPar MPF ONFE
3 Acknowledgements
References

Trademark Notice

DECstation, ULTRIX, and VAX are trademarks of Digital Equipment Corporation.
Goodyear MPP is a trademark of Goodyear Rubber and Tire Company, Inc.
ICL DAP is a trademark of International Computers Limited.
MasPar Fortran, MasPar MP-1, MasPar MP-2, MasPar Programming Environment, MPF, MPL, MPPE, and X-net are trademarks of MasPar Computer Corporation.
X-Window System is a trademark of The Massachusetts Institute of Technology.
MATLAB is a trademark of The MathWorks, Inc.
IDL is a registered trademark of Research Systems, Inc.
Symbolics is a trademark of Symbolics, Inc.
C*, CM, CM-1, CM-2, CM-5, CM Fortran, Connection Machine, DataVault, *Lisp, Paris, and Slicewise are trademarks of Thinking Machines Corporation.
UNIX is a trademark of UNIX Systems Laboratories, Inc.

SIMD Computing: An Introduction *†‡

C. J. C. Schauble
September 12, 1995

According to the Flynn computer classification system [Flynn 72], a SIMD computer is a Single-Instruction, Multiple-Data machine. In other words, all the processors of a SIMD multiprocessor execute the same instruction at the same time, but each executes that instruction with different data.

The computers we discuss in this tutorial are SIMD machines with distributed memories (DM-SIMD). They are sometimes referred to as processor arrays or as massively-parallel computers.

This tutorial is divided into two main parts. In the first section, we discuss the general architecture of SIMD multiprocessors.
Then we consider how these general features are embodied in two particular SIMD machines: the Thinking Machines CM-2 and the MasPar MP-2. In the second section, we look into programming issues for SIMD multiprocessors, both architectural and language-oriented. In particular, we describe useful features of Fortran 90 and CM Fortran. For detailed information on how to log in and program specific SIMD computers such as the CM-2 and the MasPar MP-1, refer to the documents in the /pub/HPSC directory at the cs.colorado.edu anonymous ftp site.

* This work has been partially supported by the National Center for Atmospheric Research (NCAR) and utilized the TMC CM-2 at NCAR in Boulder, CO. NCAR is supported by the National Science Foundation.
† This work has been partially supported by the National Center for Supercomputing Applications under the grants TRA930330N and TRA930331N, and utilized the Connection Machine Model-2 (CM-2) at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.
‡ This work has been supported by the National Science Foundation under an Educational Infrastructure grant, CDA-9017953. It has been produced by the HPSC Group, Department of Computer Science, University of Colorado, Boulder, CO 80309. Please direct comments or queries to Elizabeth Jessup at this address or e-mail [email protected].

1 General architecture

Each of the processors in a distributed-memory SIMD machine has its own local memory to store the data it needs. Also, each processor is connected to other processors in the computer and may send or receive data to or from any of them. In many respects, these computers are similar to distributed-memory MIMD (multiple instruction, multiple data) multiprocessors.

As stated above, the term SIMD implies that the same instruction is executed on multiple data. Hence the distinguishing feature of a SIMD machine is that all the processors act in concert. Each processor performs the same instruction at the same time as all the other processors, but each processor uses its own local data for this execution.

The array of processors is usually connected to the outside world by a sequential computer or workstation. The user accesses the processor array through this front end or host machine.

Using a SIMD computer for scientific computing means that many elements of an array can be computed simultaneously.[1] Unlike vector processors, the computation of these elements is not pipelined with different portions of neighboring elements being worked on at the same time. Instead, large groups of elements go through the same computation in parallel.

[1] See the tutorial on vector computing [Schauble 95] for more information on vector processors.

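The array syntax of CM Fortran and Fortran 90, described in section 2, expresses exactly this style of computation. The short program below is not part of the tutorial itself; it is a minimal, generic Fortran 90 sketch with made-up names and sizes, showing a single statement that operates on every element of an array at once. On a DM-SIMD machine, each processor would apply that statement to the elements held in its own local memory; on a workstation, the compiler simply generates a loop.

    program simd_sketch
      ! Illustrative only: whole-array operations stand in for
      ! "one instruction applied to many data elements at once".
      implicit none
      integer, parameter :: n = 1024
      real, dimension(n) :: x, y, z
      integer :: i

      x = (/ (real(i), i = 1, n) /)  ! fill x with 1.0, 2.0, ..., real(n)
      y = 2.0                        ! broadcast the scalar 2.0 to every element
      z = x + y                      ! one statement; conceptually all n additions
                                     !   happen at once, one per processor
      print *, z(1), z(n)            ! prints 3.0 and 1026.0
    end program simd_sketch

Note that the source says nothing about loops or about which processor holds which element; that mapping is left to the compiler and the hardware, and the compiler directives listed under section 2.4 of the contents are concerned with controlling it.
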
In the following, we discuss the architectural features of SIMD multiprocessors, concentrating on two computers in this class: the Connection Machine CM-2 by Thinking Machines Corporation and the MasPar MP-2 by MasPar Computer Corporation. Similar computers include the Digital Equipment Corporation MPP series (technically the same as the MasPar machines), the Goodyear MPP, and the ICL DAP.

1.1 The Connection Machine CM-2

The CM-2 Connection Machine is a SIMD supercomputer manufactured by Thinking Machines Corporation (TMC). Data parallel programming is the natural paradigm for this machine, allowing each processor to handle one data element or set of data elements at a time.

The initial concept of the machine was set forth in a Ph.D. dissertation by W. Daniel Hillis [Hillis 85]. The first commercial version of this computer was called the CM-1 and was manufactured in 1986. It contained up to 65,536 (or 64K) processors capable of executing the same instruction concurrently. As shown in figure 1, sixteen one-bit processors with 4K bits of memory apiece are on one chip of the machine. These chips are arranged in a hypercube[2] pattern. Thus the machine was available in units of 2^d processors, where d = 12 through 16.

[2] See the tutorial on MIMD computing [Jessup 95] for more information on a hypercube.

One of the original purposes of the computer was artificial intelligence; the eventual goal was a thinking machine. Each processor is only a one-bit processor. The idea was to provide one processor per pixel for image processing, one processor per transistor for VLSI simulation, or one processor per concept for semantic networks. The first high-level language implemented for the machine was *Lisp, a parallel extension of Lisp. The design of portions of the *Lisp language is discussed in the Hillis dissertation.

As the first version of this supercomputer came onto the market, TMC discovered that there was also significant interest and money for supercomputers that could be used for numerical and scientific computing. Hence a faster version of the machine was produced in 1987; named the CM-2, it was the first of the CM-200 series of computers. It included floating-point hardware and a faster clock, and it increased the memory to 64K bits per processor. These models emphasized the use of data-parallel programming. Both C* and CM Fortran were available on this machine in addition to *Lisp.

Announced in November 1991, a more recent machine is the CM-5. This is a MIMD machine that embodies many of the earlier Connection Machine concepts with more powerful processors, routing techniques, and I/O.

The following subsections discuss the characteristics and the performance of the CM-2. For further information, see the Connection Machine CM-200

[Figure 1: A representative blowup of one of the 64^2 processor chips in a Thinking Machines CM-1 or CM-2; the labeled blocks in the figure are the sixteen processors, their memory, and the router.]

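As a quick check, the hardware figures quoted above fit together as follows; the totals are simple arithmetic derived from those figures rather than numbers taken from the tutorial:

    65,536 processors / 16 processors per chip = 4,096 chips = 64^2 chips
    configurations of 2^d processors, d = 12, ..., 16:
        4,096, 8,192, 16,384, 32,768, or 65,536 processors
    CM-1 memory: 4K bits = 512 bytes per processor, 32 Mbytes in a full machine
    CM-2 memory: 64K bits = 8 Kbytes per processor, 512 Mbytes in a full machine

The 64^2 in the figure caption is the same count: a full 64K-processor machine is built from 4,096 of these sixteen-processor chips.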