Approaches to Parallel Computing

K. Cooper
Department of Mathematics, Washington State University
2019

Paradigms

Concept
• Many hands make light work.
• Set several processors to work on separate aspects of a problem.
• Simulation: One program, different data, no communication.
• Master-Slave: One program, sends small tasks to many subprocesses. Communication only to/from master process.
• Multiple Instruction Streams: Separate programs running on many processors. Communication among processes via messages.

Single Instruction, Multiple Data
• Single CPU with many ALUs
• ID step fills registers for each ALU
• EX step does computation simultaneously on all ALUs

SIMD Pipeline
• After instructions are decoded, the same operations can be executed on a vector array of numbers.

Disadvantages
• Specialized architecture
• Slower to fill ALU registers: a bottleneck
• Many ALUs idle during EX

Single Instruction, Multiple Thread
• Many CPUs
• Main program spins many threads for one instruction
• Examples
  ◦ Python parallel package: uses several cores of a CPU
  ◦ CUDA computing: uses hundreds or thousands of cores of a GPU
• Conjecture: This is only efficient with many, many cores.

MIMD

Multiple Instruction, Multiple Data
• Many CPUs
• Asynchronous
• Redundant work
• Much more versatile than most SIMD

Shared Memory
• E.g. quad-core CPU
• Bus-based
  ◦ Limited bandwidth on the front-side bus (FSB)
  ◦ Scales poorly
• Switch-based
  ◦ Expensive
  ◦ Still does not scale well: communication bottleneck

Distributed Memory
• Each node adds memory to system.
• Maybe no single node sees entire problem.
• E.g. Beowulf cluster
• Each CPU requires its own dedicated memory
  ◦ Could be separate sectors in single RAM ...
  ◦ Could be separate machines
• Communication becomes a roadblock

Message Passing
• Typically, each instruction stream starts identically
• Each processor starts with same code
• Processes perform different tasks based on rank
• I/O to processes is performed through messages

Interconnection Network
• Front-side bus
• InfiniBand
• Ethernet (slow)

SPMD

Single Program, Multiple Data
• You write one (1) program ...
• ... that program runs on every processor
• Instances
  ◦ Processes perform tasks based on conditions and messages
  ◦ Processes have different inputs, outputs

Nomenclature
• Node: A computer connected to a head machine by some means
• Interconnect: The means of connecting the nodes
• Core: A single processor on one of the CPUs of a node
• Processor: Usually means a core
• Process: A program that runs on a processor. Possibly (but not desirably) many processes per processor.

Summary
• When CPUs were expensive: Pipelines
• As chips became denser: SIMD
• As CPUs become commodities: MIMD
• As GPUs become dense: GPU

Goal
• Hope to show that we can modify programs easily to take advantage of modern processors
• Getting speedups is more problematic

Resources
• Solitary: Two CPUs, four cores each, 8 GB RAM. Runs prime1 on 6 cores in 0.038 seconds on OpenMPI.
• Cluster: Five nodes, one CPU per node, six cores per CPU, 8 GB RAM per node. Runs prime1 on 6 cores in 0.041 seconds on MPICH2.
• Labs: 20 to 32 nodes, two to eight cores per node.
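
The SPMD idea from the slides (one program, every instance runs it, behavior and data selected by rank) can be illustrated with a small sketch. This is not the prime1 program mentioned under Resources, whose source is not shown here; it is a hypothetical prime-counting example. The names is_prime, spmd_body, and run_spmd are made up for illustration, and Python threads stand in for the separate processes that an MPI launcher would start, so the sketch shows the structure of an SPMD program rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division; adequate for a small illustration."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def spmd_body(rank, size, limit):
    """The one program that every instance runs.

    Each instance tests the candidates rank, rank + size, rank + 2*size, ...
    below `limit`, so instances share code but see different data.
    """
    return sum(1 for n in range(rank, limit, size) if is_prime(n))


def run_spmd(size, limit):
    """Start `size` instances and combine their partial counts.

    Assumption: threads stand in for MPI processes, and the final sum
    plays the role of a reduction performed on the head process.
    """
    with ThreadPoolExecutor(max_workers=size) as pool:
        partial = list(pool.map(lambda r: spmd_body(r, size, limit), range(size)))
    return partial, sum(partial)


if __name__ == "__main__":
    counts, total = run_spmd(size=6, limit=10_000)
    print("per-rank counts:", counts)
    print("primes below 10000:", total)
```

With six instances and this cyclic split, the instances with rank 0, 2, 3, and 4 only ever see multiples of 2 or 3, so nearly all primes are found by ranks 1 and 5. That uneven distribution is a small example of why getting speedups is harder than getting a program to run in parallel.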