Parallel Processing, Part 6

Part VI   Implementation Aspects
Fall 2010

About This Presentation
This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
First edition released Spring 2005; revised Fall 2008; revised Fall 2010 (very limited update).

VI Implementation Aspects
Study real parallel machines in the MIMD and SIMD classes:
• Examine parallel computers of historical significance
• Learn from modern systems and implementation ideas
• Bracket our knowledge with history and forecasts

Topics in This Part
Chapter 21  Shared-Memory MIMD Machines
Chapter 22  Message-Passing MIMD Machines
Chapter 23  Data-Parallel SIMD Machines
Chapter 24  Past, Present, and Future

21 Shared-Memory MIMD Machines
Learn about practical shared-variable parallel architectures:
• Contrast centralized and distributed shared memories
• Examine research prototypes and production machines

Topics in This Chapter
21.1  Variations in Shared Memory
21.2  MIN-Based BBN Butterfly
21.3  Vector-Parallel Cray Y-MP
21.4  Latency-Tolerant Tera MTA
21.5  CC-NUMA Stanford DASH
21.6  SCI-Based Sequent NUMA-Q

21.1 Variations in Shared Memory
Fig. 21.1  Classification of shared-memory hardware architectures and example systems that will be studied in the rest of this chapter:
• Central main memory, single copy of modifiable data: UMA (BBN Butterfly, Cray Y-MP)
• Central main memory, multiple copies of modifiable data: CC-UMA
• Distributed main memory, single copy of modifiable data: NUMA (Tera MTA)
• Distributed main memory, multiple copies of modifiable data: COMA; CC-NUMA (Stanford DASH, Sequent NUMA-Q)

C.mmp: A Multiprocessor of Historical Significance
Fig. 21.2  Organization of the C.mmp multiprocessor: 16 processors, each with its own cache, address map, and bus interface (BI), connected to 16 memory modules through a 16 × 16 crossbar, with a shared clock and bus control.

Shared Memory Consistency Models
Varying latencies make each processor's view of the memory different.

Sequential consistency (strictest and most intuitive) mandates that the interleaving of reads and writes be the same from the viewpoint of all processors. This provides the illusion of a FCFS (first-come, first-served) single-port memory.

Processor consistency (laxer) mandates only that writes be observed in the same order by all processors. This allows reads to overtake writes, providing better performance due to out-of-order execution.

Weak consistency separates ordinary memory accesses from synch accesses that require memory to become consistent. Ordinary read and write accesses can proceed as long as there is no pending synch access, but the latter must wait for all preceding accesses to complete.

Release consistency is similar to weak consistency, but recognizes two synch accesses, called "acquire" and "release", that sandwich protected shared accesses. Ordinary read/write accesses can proceed only when there is no pending acquire access from the same processor, and a release access must wait for all reads and writes to be completed.
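These models map directly onto the memory-ordering options of modern threading libraries. The following sketch is not from the slides or the textbook; it is a minimal C++11 illustration that contrasts the default sequentially consistent ordering with the acquire/release "sandwich" described above (the variable and function names are made up).

```cpp
// Minimal sketch: release-consistency-style synchronization in C++11 terms.
// The producer writes shared data, then performs a "release"; the consumer
// performs an "acquire" and is then guaranteed to see the earlier write.
#include <atomic>
#include <cassert>
#include <thread>

int shared_data = 0;                       // ordinary (protected) shared access
std::atomic<bool> flag{false};             // synchronization variable

void producer() {
    shared_data = 42;                              // ordinary write
    flag.store(true, std::memory_order_release);   // release: prior accesses complete first
}

void consumer() {
    while (!flag.load(std::memory_order_acquire))  // acquire: later accesses must wait
        ;                                          // spin until the release becomes visible
    assert(shared_data == 42);                     // guaranteed under acquire/release
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    // Using the default std::memory_order_seq_cst on both operations instead
    // would give the single global interleaving of sequential consistency.
    return 0;
}
```

Weak consistency corresponds roughly to placing a full fence (std::atomic_thread_fence with std::memory_order_seq_cst) at every synchronization point, rather than distinguishing acquire from release.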
21.2 MIN-Based BBN Butterfly
Fig. 21.3  Structure of a processing node in the BBN Butterfly: an MC 68000 processor, a processor node controller, a memory manager, a switch interface, EPROM, and 1 MB of memory, with a daughter-board connection for memory expansion (3 MB) and a connection to I/O boards.
Fig. 21.4  A small 16-node version of the multistage interconnection network of the BBN Butterfly, built from 4 × 4 switches.

21.3 Vector-Parallel Cray Y-MP
Fig. 21.5  Key elements of the Cray Y-MP processor: eight 64-element, 64-bit vector registers (V0–V7); vector integer units (add, shift, logic, weight/parity); floating-point units (add, multiply, reciprocal approximation); scalar integer units (add, shift, logic, weight/parity); T and S registers (eight 64-bit registers each); inter-processor communication; central memory; and input/output. Address registers, address function units, instruction buffers, and control are not shown.

Cray Y-MP's Interconnection Network
Fig. 21.6  The processor-to-memory interconnection network of the Cray Y-MP: the eight processors P0–P7 reach 256 interleaved memory banks through stages of 1 × 8, 4 × 4, and 8 × 8 switches organized into sections and subsections, with consecutive bank numbers spread across different sections (banks 0, 4, 8, ..., 28; 1, 5, 9, ..., 29; and so on up to bank 255).

21.4 Latency-Tolerant Tera MTA
Fig. 21.7  The instruction execution pipelines of the Tera MTA: instruction fetch, issue pool, and retry pool feeding the internal pipeline, with memory accesses traveling through the interconnection network to memory.

21.5 CC-NUMA Stanford DASH
Fig. 21.8  The architecture of Stanford DASH: processor clusters connected by separate request and reply meshes through wormhole routers; within a cluster, each processor has instruction and data caches backed by a level-2 cache, and the cluster holds a portion of main memory along with the directory and network interface.

21.6 SCI-Based Sequent NUMA-Q
Fig. 21.9  The physical placement of Sequent's quad components on a rackmount baseboard (not to scale): four Pentium Pro (P6) processors and memory on a 533 MB/s local bus (64 bits, 66 MHz); the IQ-Link board with its 1 GB/s interquad SCI link; an Orion PCI bridge (OPB) to 4 × 33 MB/s PCI buses; a 100 MB/s Fibre Channel connection (PCI/FC); and a PCI/LAN connection to the public LAN.
Fig. 21.10  The architecture of Sequent NUMA-Q 2000: quads joined into an IQ-Plus (SCI ring) through their IQ-Link boards, with peripheral bays (SCSI through bridges), a console, and a private Ethernet.

Details of the IQ-Link in Sequent NUMA-Q
Fig. 21.11  Block diagram of the IQ-Link board: bus-side and network-side snooping tags, local directory and remote tags, a bus interface controller, a directory controller, an interconnect controller, and 32 MB of remote cache data.
Fig. 21.12  Block diagram of IQ-Link's interconnect controller: incoming SCI traffic passes through a stripper and an elastic buffer into request and response receive queues, outgoing traffic leaves through request and response send queues, and assemble/disassemble logic plus a bypass FIFO sit between the two sides.
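Both DASH and NUMA-Q keep caches coherent with directories rather than bus snooping alone: a directory entry records which nodes hold a copy of each memory block, so invalidations go only where they are needed. The toy C++ sketch below is illustrative only; it follows no specific machine's protocol, and all names and the node count are made up. It shows the core bookkeeping of a full-bit-vector directory for a read miss and a write miss.

```cpp
// Toy full-bit-vector directory for one memory block (illustrative sketch only;
// not the DASH or NUMA-Q protocol). Tracks which nodes cache the block and
// serializes read/write misses the way a home-node directory controller would.
#include <bitset>
#include <cstdio>

constexpr int kNodes = 8;                 // hypothetical number of nodes

struct DirectoryEntry {
    enum State { Uncached, Shared, Exclusive } state = Uncached;
    std::bitset<kNodes> sharers;          // presence bit per node
    int owner = -1;                       // valid only in Exclusive state

    // A node suffers a read miss: add it as a sharer (after any writeback).
    void readMiss(int node) {
        if (state == Exclusive) {
            std::printf("fetch block back from owner %d\n", owner);
            sharers.set(owner);           // previous owner keeps a shared copy
            owner = -1;
        }
        sharers.set(node);
        state = Shared;
    }

    // A node suffers a write miss: invalidate all other copies, grant ownership.
    void writeMiss(int node) {
        for (int n = 0; n < kNodes; ++n)
            if (sharers.test(n) && n != node)
                std::printf("send invalidation to node %d\n", n);
        if (state == Exclusive && owner != node)
            std::printf("fetch and invalidate owner %d\n", owner);
        sharers.reset();
        sharers.set(node);
        owner = node;
        state = Exclusive;
    }
};

int main() {
    DirectoryEntry dir;
    dir.readMiss(2);     // node 2 reads: block becomes Shared {2}
    dir.readMiss(5);     // node 5 reads: Shared {2, 5}
    dir.writeMiss(5);    // node 5 writes: invalidate node 2, Exclusive with owner 5
    return 0;
}
```

NUMA-Q's IQ-Link implements SCI's distributed, linked-list sharing directories rather than a centralized bit vector, but the bookkeeping role is the same.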
22 Message-Passing MIMD Machines
Learn about practical message-passing parallel architectures:
• Study mechanisms that support message passing
• Examine research prototypes and production machines

Topics in This Chapter
22.1  Mechanisms for Message Passing
22.2  Reliable Bus-Based Tandem NonStop
22.3  Hypercube-Based nCUBE3
22.4  Fat-Tree-Based Connection Machine 5
22.5  Omega-Network-Based IBM SP2
22.6  Commodity-Driven Berkeley NOW

22.1 Mechanisms for Message Passing
1. Shared-medium networks (buses, LANs; e.g., Ethernet, token-based)
2. Router-based networks (aka direct networks; e.g., Cray T3E, Chaos)
3. Switch-based networks (crossbars, MINs; e.g., Myrinet, IBM SP2)

Fig. 22.2  Example 4 × 4 (crosspoint) and 2 × 2 switches used as building blocks for larger networks; the 2 × 2 switch has four settings: through (straight), crossed (exchange), lower broadcast, and upper broadcast.

Router-Based Networks
Fig. 22.1  The structure of a generic router: input and output channels, each with a link controller (LC) and a message queue (Q); injection and ejection channels to and from the local node; a switch connecting input queues to output queues; and a routing/arbitration unit that sets the switch.

Case Studies of Message-Passing Machines
Fig. 22.3  Classification of message-passing hardware architectures and example systems that will be studied in this chapter:
• Coarse-grain, shared-medium network (bus): Tandem NonStop
• Medium-grain, shared-medium network (LAN): Berkeley NOW
• Medium-grain, router-based network: nCUBE3
• Medium-grain, switch-based network: TMC CM-5, IBM SP2

22.2 Reliable Bus-Based Tandem NonStop
Fig. 22.4  One section of the Tandem NonStop Cyclone system: four processor-and-memory modules on the Dynabus, with multiple I/O connections per processor leading to shared I/O controllers.

High-Level Structure of the Tandem NonStop
Fig. 22.5  Four four-processor sections interconnected by Dynabus+.
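To make the router mechanism of Fig. 22.1 more concrete, here is a toy, single-router software model. It is an illustrative sketch only: it follows no particular machine, the names are invented, and it assumes a 2D mesh with dimension-order (X-then-Y) routing, neither of which the figure specifies; link controllers, flow control, and the links between routers are omitted.

```cpp
// Toy model of the generic router of Fig. 22.1 (illustrative sketch only).
// Ports: 0 = +X, 1 = -X, 2 = +Y, 3 = -Y, 4 = injection/ejection (local node).
#include <array>
#include <cstdio>
#include <queue>

struct Message { int dest_x, dest_y; };

class Router {
public:
    Router(int x, int y) : x_(x), y_(y) {}

    // Injection channel: the local node hands a message to the router.
    void inject(const Message& m) { input_[4].push(m); }

    // One routing/arbitration step: every nonempty input queue forwards
    // its head message to the output queue chosen by the routing function.
    void step() {
        for (auto& q : input_) {
            if (q.empty()) continue;
            Message m = q.front(); q.pop();
            output_[route(m)].push(m);
        }
    }

    // Ejection channel: deliver messages addressed to this node.
    void drainEjection() {
        while (!output_[4].empty()) {
            output_[4].pop();
            std::printf("message delivered at node (%d,%d)\n", x_, y_);
        }
    }

private:
    // Dimension-order routing: correct X first, then Y, else eject locally.
    int route(const Message& m) const {
        if (m.dest_x > x_) return 0;   // +X
        if (m.dest_x < x_) return 1;   // -X
        if (m.dest_y > y_) return 2;   // +Y
        if (m.dest_y < y_) return 3;   // -Y
        return 4;                      // ejection: message has arrived
    }

    int x_, y_;
    std::array<std::queue<Message>, 5> input_;   // 4 input channels + injection
    std::array<std::queue<Message>, 5> output_;  // 4 output channels + ejection
};

int main() {
    Router r(1, 1);
    r.inject({1, 1});   // destined for this very node
    r.step();
    r.drainEjection();  // prints one delivery
    return 0;
}
```

A full network model would connect each router's output queues to its neighbors' input queues and add arbitration when several inputs compete for the same output, which is exactly the role of the routing/arbitration unit in Fig. 22.1.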