Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters

Total Pages: 16

File Type: PDF, Size: 1020 KB

Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters
Presented at GTC '15 by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Accelerator Era
• Accelerators are becoming common in high-end system architectures (Top 100, Nov 2014: 28% use accelerators, 57% use NVIDIA GPUs)
• An increasing number of workloads are being ported to take advantage of NVIDIA GPUs
• As they scale to large GPU clusters with high compute density, synchronization and communication overheads grow, and so does the penalty
• Critical to minimize these overheads to achieve maximum performance

Parallel Programming Models Overview
[Figure: three abstract machine models with processes P1-P3 and their memories — Shared Memory Model (DSM), Distributed Memory Model (MPI, Message Passing Interface), and Partitioned Global Address Space (PGAS: Global Arrays, UPC, Chapel, X10, CAF, ...)]
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems, e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Each model has strengths and drawbacks and suits different problems or applications

Outline
• Overview of PGAS models (UPC and OpenSHMEM)
• Limitations in PGAS models for GPU computing
• Proposed Designs and Alternatives
• Performance Evaluation
• Exploiting GPUDirect RDMA

Partitioned Global Address Space (PGAS) Models
• PGAS models are an attractive alternative to traditional message passing:
  - Simple shared memory abstractions
  - Lightweight one-sided communication
  - Flexible synchronization
• Different approaches to PGAS:
  - Libraries: OpenSHMEM, Global Arrays
  - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), Chapel, X10

OpenSHMEM
• SHMEM implementations: Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in the API across versions, for example:

                     SGI SHMEM       Quadrics SHMEM   Cray SHMEM
    Initialization   start_pes(0)    shmem_init       start_pes
    Process ID       _my_pe          my_pe            shmem_my_pe

• These differences made application codes non-portable
• OpenSHMEM is an effort to address this: "A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard." – OpenSHMEM Specification v1.0, by the University of Houston and Oak Ridge National Lab; SGI SHMEM is the baseline

OpenSHMEM Memory Model
• Defines symmetric data objects that are globally addressable
  - Allocated using a collective shmalloc routine
  - Same type, size and offset address at all processes/processing elements (PEs)
  - The address of a remote object can be calculated based on the information of the local object
• Example from the slide (the arguments to shmem_int_get are dst, src, count, pe):

    int main (int c, char *v[]) {
        int *b;
        start_pes(0);
        b = (int *) shmalloc (sizeof(int));   /* symmetric object, same offset on PE 0 and PE 1 */
        shmem_int_get (b, b, 1, 1);           /* (dst, src, count, pe) */
    }
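For reference, a minimal self-contained sketch that expands this fragment into a runnable program is shown below; the barriers, the PE-count check and the final print are additions for illustration, assuming an OpenSHMEM 1.0 style implementation (build with the implementation's compiler wrapper, e.g. oshcc in the reference implementation).

    #include <stdio.h>
    #include <shmem.h>

    int main (int c, char *v[])
    {
        int *b;

        start_pes(0);                        /* initialize the OpenSHMEM library   */
        int me   = _my_pe();                 /* this PE's rank                     */
        int npes = _num_pes();               /* total number of PEs                */

        b = (int *) shmalloc(sizeof(int));   /* symmetric allocation on every PE   */
        *b = me;
        shmem_barrier_all();                 /* make sure every PE has written *b  */

        if (npes > 1)
            shmem_int_get(b, b, 1, 1);       /* one-sided get of PE 1's b: (dst, src, count, pe) */
        shmem_barrier_all();

        printf("PE %d of %d now holds b = %d\n", me, npes, *b);
        return 0;
    }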
Compiler-based: Unified Parallel C
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
  - Introduction to UPC and Language Specification, 1999
  - UPC Language Specifications, v1.0, Feb 2001
  - UPC Language Specifications, v1.1.1, Sep 2004
  - UPC Language Specifications, v1.2, 2005
  - UPC Language Specifications, v1.3, in progress (draft available)
• UPC Consortium
  - Academic institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland, ...
  - Government institutions: ARSC, IDA, LBNL, SNL, US DOE, ...
  - Commercial institutions: HP, Cray, Intrepid Technology, IBM, ...
• Supported by several UPC compilers
  - Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
  - Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, ...

UPC Memory Model
[Figure: Threads 0-3; the global shared space holds A1[0]..A1[3], one element with affinity to each thread, while each thread's private space holds its own copy of y]
• Global Shared Space: can be accessed by all the threads
• Private Space: holds all the normal variables; can only be accessed by the local thread
• Example from the slide:

    shared int A1[THREADS];   /* shared variable   */
    int main() {
        int y;                /* private variable  */
        A1[0] = 0;            /* local access      */
        A1[1] = 1;            /* remote access     */
    }
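A complete, compilable variant of this example is sketched below; the per-thread write and the reduction on thread 0 are additions for illustration, and it should build with any of the UPC compilers listed above (e.g., Berkeley UPC's upcc).

    #include <upc.h>
    #include <stdio.h>

    shared int A1[THREADS];          /* one element with affinity to each thread */

    int main(void)
    {
        int y = MYTHREAD;            /* private variable, one copy per thread     */

        A1[MYTHREAD] = y;            /* local access: element with affinity to me */
        upc_barrier;                 /* make all writes to A1 visible             */

        if (MYTHREAD == 0) {
            int sum = 0;
            for (int t = 0; t < THREADS; t++)
                sum += A1[t];        /* remote accesses to other threads' elements */
            printf("Sum of thread ids across %d threads = %d\n", THREADS, sum);
        }
        upc_barrier;
        return 0;
    }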
MPI+PGAS for Exascale Architectures and Applications
• Gaining attention in efforts towards Exascale computing
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) model: MPI across address spaces, PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models
• Re-writing complete applications can be a huge effort; port critical kernels to the desired model instead

Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on their communication characteristics
• Benefits: best of the Distributed Computing Model and best of the Shared Memory Computing Model
• Exascale Roadmap: "Hybrid programming is a practical way to program exascale systems"
[Figure: an HPC application whose kernels 1..N are individually mapped to MPI or PGAS]

MVAPICH2-X for Hybrid MPI + PGAS Applications
[Figure: MPI, OpenSHMEM, UPC, CAF and hybrid (MPI + PGAS) applications issue OpenSHMEM, UPC, CAF and MPI calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM and CAF, available since MVAPICH2-X 1.9 (09/07/2012): http://mvapich.cse.ohio-state.edu
• Feature highlights:
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC, MPI(+OpenMP) + CAF
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant, CAF 2015 standard compliant
  - Scalable inter-node and intra-node communication: point-to-point and collectives
• Effort underway for support on NVIDIA GPU clusters

Limitations of PGAS Models for GPU Computing
• PGAS memory models do not support disjoint memory address spaces, which is the case with GPU clusters
• OpenSHMEM case: the existing OpenSHMEM model with CUDA stages GPU-to-GPU data movement through the host, e.g.:

    /* PE 0 */
    host_buf = shmalloc (...);
    cudaMemcpy (host_buf, dev_buf, size, ...);
    shmem_putmem (host_buf, host_buf, size, pe);
    shmem_barrier (...);

    /* PE 1 */
    host_buf = shmalloc (...);
    shmem_barrier (...);
    cudaMemcpy (dev_buf, host_buf, size, ...);

• The copies severely limit performance
• The synchronization negates the benefits of one-sided communication
• Similar limitations exist in UPC

Global Address Space with Host and Device Memory
[Figure: each PE contributes both host memory and device memory, each split into a private region and a shared region; the shared regions form one shared space on host memory and one on device memory]
• Extended APIs: heap_on_device / heap_on_host, a way to indicate the location of the heap, e.g.:

    heap_on_device();
    /* allocated on device */
    dev_buf = shmalloc (sizeof(int));

    heap_on_host();
    /* allocated on host */
    host_buf = shmalloc (sizeof(int));

• A similar approach can be used for dynamically allocated memory in UPC

CUDA-aware OpenSHMEM and UPC Runtimes
• After device memory becomes part of the global shared space:
  - Accessible through standard UPC/OpenSHMEM communication APIs
  - Data movement transparently handled by the runtime
  - Preserves one-sided semantics at the application level
• Efficient designs to handle communication:
  - Inter-node transfers use host-staged transfers with pipelining
  - Intra-node transfers use CUDA IPC
  - Possibility to take advantage of GPUDirect RDMA (GDR)
• Goal: enabling high-performance one-sided communication semantics with GPU devices

shmem_putmem Inter-node Communication
[Plots: latency (usec) vs. message size (bytes) for small and large messages]
• Small messages benefit from selective CUDA registration: 22% improvement for 4-byte messages
• Large messages benefit from pipelined overlap: 28% improvement for 4-MByte messages
• S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, "Extending OpenSHMEM for GPU Computing," Int'l Parallel and Distributed Processing Symposium (IPDPS '13)

shmem_putmem Intra-node Communication
[Plots: latency (usec) vs. message size (bytes) for small and large messages]
• Using CUDA IPC for intra-node communication
• Small messages: 3.4X improvement for 4-byte messages
• Large messages: 5X improvement for 4-MByte messages
• Setup: based on MVAPICH2-X 2.0b + extensions; Intel Westmere-EP node with 8 cores, 2 NVIDIA Tesla K20c GPUs, Mellanox QDR HCA, CUDA 6.0RC1

Application Kernel Evaluation: Stencil2D
[Plots: total execution time (msec) vs. problem size per GPU on 192 GPUs, and vs. number of GPUs (48, 96, 192) for a 4Kx4K problem per GPU]
• Modified the SHOC Stencil2D kernel to use OpenSHMEM for cluster-level parallelism
• The enhanced version shows 65% improvement on 192 GPUs with a 4Kx4K problem size per GPU
• Using OpenSHMEM for GPU-GPU communication allows the runtime ...
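As an illustration of the cluster-level parallelism described above, the sketch below shows a hypothetical halo exchange for a 1D-decomposed 2D stencil written against the proposed heap_on_device() extension; it is not the actual SHOC Stencil2D modification, and the CUDA kernel launch is elided.

    #include <shmem.h>

    #define NX 4096                 /* interior rows per PE (4Kx4K problem per GPU) */
    #define NY 4096                 /* columns                                      */

    int main (void)
    {
        start_pes(0);
        int me   = _my_pe();
        int npes = _num_pes();

        heap_on_device();           /* proposed extension: place the symmetric heap on the GPU */
        double *grid = (double *) shmalloc((size_t)(NX + 2) * NY * sizeof(double));

        int up   = (me > 0)        ? me - 1 : -1;
        int down = (me < npes - 1) ? me + 1 : -1;

        /* ... launch a CUDA kernel that updates rows 1..NX of grid ... */

        /* Push boundary rows straight from device memory into the neighbours'
           halo rows; the CUDA-aware runtime handles staging, IPC or GDR.      */
        if (up != -1)
            shmem_putmem(grid + (size_t)(NX + 1) * NY,  /* upper neighbour's bottom halo */
                         grid + (size_t)1 * NY,         /* my first interior row         */
                         NY * sizeof(double), up);
        if (down != -1)
            shmem_putmem(grid + (size_t)0 * NY,         /* lower neighbour's top halo    */
                         grid + (size_t)NX * NY,        /* my last interior row          */
                         NY * sizeof(double), down);

        shmem_barrier_all();        /* halos are now in place for the next iteration */
        return 0;
    }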
Recommended publications
  • Mellanox ScalableSHMEM: Support the OpenSHMEM Parallel Programming Language over InfiniBand
  • Scalable and Distributed Deep Learning (DL): Co-Design MPI Runtimes and DL Frameworks
  • Parallel Programming
  • Introduction to Parallel Programming
  • Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
  • OpenSHMEM: An Open Standard for SHMEM Implementations
  • Parallel Computer Architecture
  • BCL: A Cross-Platform Distributed Data Structures Library
  • Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
  • Parallel Programming Using OpenMP, Feb 2014
  • Automatic Handling of Global Variables for Multi-Threaded MPI Programs
  • An Overview of the PARADIGM Compiler for Distributed-Memory Multicomputers