The Global Array Programming Model for High Performance Scientific Computing

SIAM News, August/September 1995

J. Nieplocha, R. J. Harrison, and R. J. Littlefield, Pacific Northwest Laboratory

Motivated by the characteristics of current parallel architectures, we have developed an approach to the programming of scalable scientific applications that combines some of the best features of the message-passing and shared-memory programming models. Two assumptions permeate our work. The first is that most high performance parallel computers have, and will continue to have, physically distributed memories with non-uniform memory access (NUMA) timing characteristics. NUMA machines work best with application programs that have a high degree of locality in their memory reference patterns. The second assumption is that extra programming effort is, and will continue to be, required to construct such applications. Thus, a recurring theme in our work is the development of techniques and tools that minimize the extra effort required to construct application programs with explicit control of locality.

There are significant tradeoffs among the important considerations of portability, efficiency, and ease of coding. The message-passing programming model is widely used because of its portability. Some applications, however, are too complex to be coded in a message-passing mode if care is to be taken to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. Other, more recent parallel programming models, represented by such languages and facilities as HPF [1], SISAL [2], PCN [3], Fortran-M [4], Linda [5], and shared virtual memory, address these problems in different ways and to varying degrees. None of these models represents an ideal solution.

Global Arrays (GAs), the approach described here, lead to both simple coding and efficient execution for a class of applications that appears to be fairly common. The key concept of the GA model is that it provides a portable interface through which each process in a MIMD parallel program can independently, asynchronously, and efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. In this respect, it is similar to the shared-memory programming model. In addition, however, the GA model acknowledges that more time is required to access remote data than local data, and it allows data locality to be explicitly specified and used. In these respects, it is similar to message passing.

NUMA Architecture

The concept of NUMA is important even to the performance of modern sequential personal computers or workstations. On a standard RISC workstation, for instance, good performance of the processors results from algorithms and compilers that optimize usage of the memory hierarchy. The memory hierarchy is formed by registers, on-chip cache, off-chip cache, main memory, and virtual memory (see Figure 1). If the programmer ignores this structure and constantly flushes the cache or, even worse, thrashes the virtual memory, performance will be seriously degraded.

[Figure 1: The memory hierarchy of a typical NUMA architecture, from registers through on-chip cache, off-chip cache, and main memory to virtual memory, with speed decreasing and capacity increasing down the hierarchy.]

The classic solution to this problem is to access data in blocks small enough to fit in the cache and then ensure that the algorithm makes sufficient use of the encached data to justify the costs of moving the data.
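As a purely illustrative example of this blocking idea (the article itself gives no code), the Fortran sketch below accumulates a matrix product tile by tile so that each tile is reused many times while it sits in cache; the tile size nb is a machine-dependent tuning parameter, and the value used here is only a placeholder.

    ! Blocked (tiled) matrix multiplication: C = C + A*B.
    ! Each nb-by-nb tile of A, B, and C is reused repeatedly while it is
    ! resident in cache, which is the "sufficient use of the encached
    ! data" that justifies the cost of moving it up the memory hierarchy.
    subroutine blocked_mm(n, a, b, c)
      implicit none
      integer, intent(in) :: n
      real(kind(1.0d0)), intent(in)    :: a(n, n), b(n, n)
      real(kind(1.0d0)), intent(inout) :: c(n, n)
      integer, parameter :: nb = 64   ! placeholder tile size; tune per machine
      integer :: ii, jj, kk, i, j, k

      do jj = 1, n, nb
        do kk = 1, n, nb
          do ii = 1, n, nb
            ! Work entirely within one tile before moving on.
            do j = jj, min(jj + nb - 1, n)
              do k = kk, min(kk + nb - 1, n)
                do i = ii, min(ii + nb - 1, n)
                  c(i, j) = c(i, j) + a(i, k) * b(k, j)
                end do
              end do
            end do
          end do
        end do
      end do
    end subroutine blocked_mm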
To the NUMA hierarchy of sequential computers, parallel computers add at least one extra layer: remote memory. Access to remote memory on distributed-memory machines is accomplished through message passing. Message passing, in addition to the required cooperation between sender and receiver that makes this programming paradigm difficult to use, introduces degradation of latency and bandwidth in the accessing of remote, as opposed to local, memory. Scalable shared-memory machines, i.e., architecturally distributed-memory machines with hardware support for shared-memory operations (for example, the KSR-2 or the Convex Exemplar), allow access to remote memory in the same fashion as to local memory. However, this uniform mechanism for accessing local and remote memory should be seen only as a programming convenience: on both shared- and distributed-memory computers, the latency and bandwidth for accessing remote memory are significantly larger than for local memory and therefore must be incorporated into performance models.

If we think about programming of MIMD parallel computers (either shared- or distributed-memory) in terms of NUMA, then parallel computation differs from sequential computation only in terms of concurrency. By focusing on NUMA, we not only have a framework in which to reason about the performance of our parallel algorithms (i.e., memory latency, bandwidth, data and reference locality), we also conceptually unite sequential and parallel computation.

Global Array Model

The GA programming model is motivated by the NUMA characteristics of current parallel architectures. By removing the unnecessary processor interactions required to access remote data in the message-passing paradigm, the GA model greatly simplifies parallel programming and is similar in this respect to the shared-memory programming model. However, the GA model also acknowledges that it is more time consuming to access remote data than local data (i.e., remote memory is yet another layer of NUMA), and it allows data locality to be explicitly specified and used. Advantages of the GA model over a shared-memory programming model include its explicit distinction between local and remote memory and the availability of two distinct mechanisms for accessing local and remote data. Global arrays, instead of hiding the NUMA characteristics, expose them to the programmer and make it possible to write more efficient and scalable parallel programs.

The current GA programming model can be characterized as follows (a small usage sketch appears after the list):

- MIMD parallelism is provided via a multiprocess approach, in which all non-GA data, file descriptors, and so on are replicated or unique to each process.
- Processes can communicate with each other by creating and accessing GA distributed matrices, as well as (if desired) by conventional message passing.
- Matrices are physically distributed block-wise, either regularly or as the Cartesian product of irregular distributions on each axis.
- Each process can independently and asynchronously access any two-dimensional patch of a GA distributed matrix, without requiring cooperation from the application code in any other process.
- Several types of access are supported, including "get," "put," "accumulate" (floating-point sum-reduction), and "get and increment" (integer). This list can be extended as needed.
- Each process is assumed to have fast access to some portion of each distributed matrix, and slower access to the remainder. These speed differences define the data as being local or remote, respectively. However, the numeric difference between local and remote memory access times is unspecified.
- Each process can determine which portion of each distributed matrix is stored locally. Every element of a distributed matrix is guaranteed to be local to exactly one process.
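To make the model concrete, here is a minimal sketch using the Fortran interface of the GA toolkit as described in the early GA papers (ga_create, ga_distribution, ga_get, ga_acc, ga_read_inc). The argument lists and the MT_DBL/MT_INT type constants are taken from later toolkit releases and may not match the 1995 interface exactly, so treat this as an illustration of the programming model rather than a definitive API reference.

    ! Illustrative GA usage: create a distributed matrix, query the
    ! locally held patch, and access patches one-sidedly with
    ! "get", "accumulate", and "get and increment".
    program ga_sketch
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      integer, parameter :: ld = 100
      integer :: g_a, g_counter, me, next
      integer :: ilo, ihi, jlo, jhi
      double precision :: buf(ld, ld), alpha
      logical :: ok

      call ga_initialize()
      me = ga_nodeid()

      ! Create a 1000 x 1000 double-precision global array; with chunk
      ! arguments of -1 the library chooses a block-wise distribution.
      ok = ga_create(MT_DBL, 1000, 1000, 'A', -1, -1, g_a)
      call ga_zero(g_a)

      ! Which patch of the global array does this process own?
      call ga_distribution(g_a, me, ilo, ihi, jlo, jhi)

      ! One-sided "get": fetch a patch (local or remote) into a private
      ! buffer, without any cooperation from the owning process.
      call ga_get(g_a, 1, ld, 1, ld, buf, ld)

      ! "Accumulate": floating-point sum-reduction into a (possibly
      ! overlapping) patch, A(1:ld,1:ld) = A(1:ld,1:ld) + alpha*buf.
      alpha = 1.0d0
      call ga_acc(g_a, 1, ld, 1, ld, buf, ld, alpha)

      ! "Get and increment" on a one-element integer array gives a
      ! shared counter, e.g. for dynamic task distribution.
      ok = ga_create(MT_INT, 1, 1, 'counter', 1, 1, g_counter)
      call ga_zero(g_counter)
      next = ga_read_inc(g_counter, 1, 1, 1)

      call ga_sync()
      ok = ga_destroy(g_a)
      ok = ga_destroy(g_counter)
      call ga_terminate()
    end program ga_sketch

Because every process executes the same program and all accesses are one-sided, no receive calls or polling loops are needed; ga_sync is the only cooperative operation in the sketch.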
This model differs from other common models as follows. Unlike HPF, it allows task-parallel access to distributed matrices, including reduction into overlapping patches. Unlike Linda [5], it efficiently provides for sum-reduction and access to overlapping patches. Unlike shared-virtual-memory software facilities, the GA paradigm requires explicit library calls to access data, but it avoids the overhead associated with the maintenance of memory coherence and the handling of virtual page faults. The GA implementation guarantees that all of the required data for a patch can be transferred at the same time. Unlike active messages [6], the GA model does not incorporate the concept of interprocessor cooperation and can thus be implemented efficiently [7] even on shared-memory systems. Finally, unlike some other strategies based on polling, task duration is relatively unimportant in programs that use GAs, which simplifies coding and makes it possible for GA programs to exploit standard library codes without modification.

Global Array Toolkit

This GA interface has been designed in the light of emerging standards. In particular, High Performance Fortran (HPF) will certainly provide the basis for future standards definition for distributed arrays in Fortran. The operations that provide the basic functionality (create, fetch, store, accumulate, gather, scatter, and data-parallel operations) can all be expressed as single statements in Fortran-90 array notation with the data-distribution directives of HPF. The GA model is, however, more general than that of HPF, which currently precludes the use of such operations in MIMD task-parallel code.

Supported Operations

Each GA operation may be categorized as either an implementation-dependent primitive operation or an operation that has been constructed in an implementation-independent fashion from primitive operations.
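As an illustration of that layering (a sketch under assumptions, not the toolkit's actual source), an implementation-independent operation such as scaling a global array can be written entirely in terms of the primitives introduced above: query the locally owned patch with ga_distribution, move it through ga_get and ga_put, and synchronize. The subroutine name and the fixed-size local buffer below are hypothetical; a real implementation would operate on the local patch in place rather than copying it.

    ! Hypothetical layered operation: scale every element of a 2-D global
    ! array by a factor, each process touching only the patch it owns.
    subroutine scale_ga(g_a, factor)
      implicit none
#include "global.fh"
      integer, parameter :: ld = 512          ! assumed upper bound on patch size
      integer :: g_a, me, ilo, ihi, jlo, jhi, m, n, i, j
      double precision :: factor, buf(ld, ld)

      me = ga_nodeid()

      ! Primitive: find out which patch is stored locally on this process.
      call ga_distribution(g_a, me, ilo, ihi, jlo, jhi)
      m = ihi - ilo + 1
      n = jhi - jlo + 1

      ! Primitive: fetch the (local) patch into a private buffer.
      call ga_get(g_a, ilo, ihi, jlo, jhi, buf, ld)

      do j = 1, n
        do i = 1, m
          buf(i, j) = factor * buf(i, j)
        end do
      end do

      ! Primitive: write the scaled patch back.
      call ga_put(g_a, ilo, ihi, jlo, jhi, buf, ld)

      ! Make the result visible to all processes before anyone reads it.
      call ga_sync()
    end subroutine scale_ga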