Overview of the MVAPICH Project: Latest Status and Future Roadmap


MVAPICH2 User Group (MUG) Meeting
by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand (high-performance interconnect: <1 usec latency, >100 Gbps bandwidth) is very popular in HPC clusters
• Accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip) are becoming common in high-end systems
• Pushing the envelope for Exascale computing
• Example systems: Tianhe-2 (1), Titan (2), Stampede (8), Tianhe-1A (24)

Parallel Programming Models Overview
• Shared Memory Model (e.g. SHMEM, DSM): processes interact through a shared memory
• Distributed Memory Model (e.g. MPI, the Message Passing Interface): each process has its own memory
• Partitioned Global Address Space (PGAS) (e.g. Global Arrays, UPC, Chapel, X10, CAF, …): logical shared memory over distributed memories
• Programming models provide abstract machine models
• Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually receiving importance

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Middleware: communication library or runtime for programming models – point-to-point communication (two-sided & one-sided), collective communication, synchronization & locks, I/O & file systems, fault tolerance
• Networking technologies (InfiniBand, 10/40GigE, Aries, OmniPath), multi/many-core architectures, accelerators (NVIDIA and MIC)
• Co-design opportunities and challenges exist across all of these layers

Designing (MPI+X) at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely minimal memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …)
• Balancing intra-node and inter-node communication for next-generation multi-core (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Support for MPI-3 RMA model (a one-sided communication sketch follows below)
• Support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Virtualization support
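The "Support for MPI-3 RMA model" bullet above refers to one-sided communication, where a process reads or writes a peer's exposed memory without a matching receive. The following is a minimal sketch of that model, not taken from the slides: it assumes at least two MPI ranks, and the buffer and variable names are illustrative.

```c
/*
 * Minimal MPI-3 RMA sketch (assumes >= 2 ranks; names are illustrative).
 * Each rank exposes one int as an RMA window; rank 0 writes into rank 1's
 * window with MPI_Put, and fences delimit the access epoch.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *win_buf;
    MPI_Win win;
    /* Allocate one int per rank and expose it as an RMA window. */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = -1;

    MPI_Win_fence(0, win);              /* open access epoch */
    if (rank == 0) {
        int value = 42;
        /* One-sided write into rank 1's window; rank 1 posts no receive. */
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0 /* displacement */,
                1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);              /* close epoch: data is now visible */

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", *win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

MPI_Win_fence is the simplest of the MPI-3 synchronization modes; passive-target locking (MPI_Win_lock/MPI_Win_unlock) operates on the same window objects.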
The MVAPICH2 Software Family
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, RDMA over Converged Enhanced Ethernet (RoCE), and virtualized clusters
  – MVAPICH (MPI-1), available since 2002
  – MVAPICH2 (MPI-2.2, MPI-3.0 and MPI-3.1), available since 2004
  – MVAPICH2-X (Advanced MPI + PGAS), available since 2012
  – Support for GPGPUs (MVAPICH2-GDR), available since 2014
  – Support for MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
• http://mvapich.cse.ohio-state.edu

MVAPICH Project Timeline
[Timeline chart showing the introduction of MVAPICH (Oct-02, now EOL), OMB, MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC and MVAPICH2-Virt between Oct-02 and Jul-15]

MVAPICH/MVAPICH2 Release Timeline and Downloads
[Chart of cumulative downloads (0 to 350,000) from Sep-04 to Mar-15, annotated with releases: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2 2.1, MV2-Virt 2.1rc2, MV2-GDR 2.1rc2, MV2 2.2a, MV2-X 2.2a]

The MVAPICH2 Software Family (Cont.)
• Empowering many TOP500 clusters (July '15 ranking)
  – 8th ranked 519,640-core cluster (Stampede) at TACC
  – 11th ranked 185,344-core cluster (Pleiades) at NASA
  – 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
  – and many others
• Used by more than 2,450 organizations in 76 countries
• More than 282,000 downloads from the OSU site directly
• Available with software stacks of many IB, HSE, and server vendors including Linux distros (RedHat and SuSE)
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (8th in Jun '15, 462,462 cores, 5.168 PFlops)

Usage Guidelines for the MVAPICH2 Software Family
Requirements → Library to use
• MPI with IB, iWARP and RoCE → MVAPICH2
• Advanced MPI, PGAS and MPI+PGAS with IB and RoCE → MVAPICH2-X
• MPI with IB & GPU → MVAPICH2-GDR
• MPI with IB & MIC → MVAPICH2-MIC
• HPC Cloud with MPI & IB → MVAPICH2-Virt

Strong Procedure for Design, Development and Release
• Research is done to explore new designs
• Designs are first presented in conference/journal publications
• Best-performing designs are incorporated into the codebase
• Rigorous Q&A procedure before making a release
  – Exhaustive unit testing
  – Various test procedures on a diverse range of platforms and interconnects
  – Performance tuning
  – Applications-based evaluation
  – Evaluation on large-scale systems
• Even alpha and beta versions go through the above testing

MVAPICH2 2.2a
• Released on 08/18/2015
• Major features and enhancements
  – Based on MPICH-3.1.4
  – Support for backing on-demand UD CM information with shared memory for minimizing memory footprint
  – Dynamic identification of maximum read/atomic operations supported by HCA
  – Enabling support for intra-node communications in RoCE mode without shared memory
  – Updated to hwloc 1.11.0
  – Updated to sm_20 kernel optimizations for MPI datatypes
  – Automatic detection and tuning for 24-core Haswell architecture
  – Enhanced startup performance
    • Support for PMI-2 based startup with SLURM
  – Checkpoint-Restart support with DMTCP (Distributed MultiThreaded CheckPointing)
  – Enhanced communication performance for small/medium message sizes
  – Support for linking with Intel Trace Analyzer and Collector

One-way Latency: MPI over IB with MVAPICH2
[Small- and large-message latency charts for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR and ConnectX-4-EDR; small-message latencies of roughly 1.05-1.26 us (chart labels: 1.26, 1.19, 1.15, 1.05). A ping-pong measurement sketch follows below.]
• TrueScale-QDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
• ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
• ConnectIB-DualFDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
• ConnectX-4-EDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, back-to-back
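One-way latencies like those above are typically obtained from a ping-pong test: half of the averaged round-trip time between two ranks. Below is a minimal sketch in the spirit of the OSU micro-benchmarks, not the actual osu_latency source; the message size, warm-up count and iteration count are illustrative choices, and it assumes exactly two participating ranks.

```c
/*
 * Ping-pong one-way latency sketch (illustrative, not the OSU benchmark code).
 * Rank 0 sends to rank 1 and waits for the echo; the one-way latency is half
 * the average round-trip time after a warm-up phase.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 8          /* small-message size in bytes (assumption) */
#define SKIP     100        /* warm-up iterations excluded from timing  */
#define ITERS    10000      /* timed iterations (assumption)            */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char sbuf[MSG_SIZE], rbuf[MSG_SIZE];
    memset(sbuf, 'a', MSG_SIZE);

    double start = 0.0;
    for (int i = 0; i < SKIP + ITERS; i++) {
        if (i == SKIP)
            start = MPI_Wtime();        /* start timing after warm-up */

        if (rank == 0) {
            MPI_Send(sbuf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(rbuf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(rbuf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(sbuf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double total = MPI_Wtime() - start;
        /* one-way latency = half the average round trip, in microseconds */
        printf("%d-byte one-way latency: %.2f us\n",
               MSG_SIZE, (total / ITERS / 2.0) * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and launched on two ranks across two nodes (for example, mpirun_rsh -np 2 host1 host2 ./latency, where the host names and binary name are placeholders), this reports numbers comparable in spirit to the charts above; actual values depend on the HCA and library tuning.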
Bandwidth: MPI over IB with MVAPICH2
[Unidirectional and bidirectional bandwidth charts for message sizes from 4 bytes to 1 MB on TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR and ConnectX-4-EDR; peak unidirectional bandwidths labelled 3387, 6356, 11497 and 12465 MB/s, and peak bidirectional bandwidths labelled 6308, 12161, 22909 and 24353 MB/s]
• Same platforms as the latency results above (2.8 GHz deca-core IvyBridge Intel, PCI Gen3; ConnectX-4-EDR back-to-back, others with IB switch)

MVAPICH2 Two-Sided Intra-Node Performance
(Shared memory and kernel-based zero-copy support (LiMIC and CMA))
[Latency and bandwidth charts on Intel Ivy-bridge with the latest MVAPICH2 2.2a: small-message latency of 0.18 us intra-socket and 0.45 us inter-socket; peak bandwidth of 14,250 MB/s intra-socket and 13,749 MB/s inter-socket, comparing CMA, shared-memory (Shmem) and LiMIC channels]

MVAPICH2-X for Advanced MPI, Hybrid MPI + PGAS Applications
[Diagram: MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications issue OpenSHMEM, CAF, UPC and MPI calls into a unified MVAPICH2-X runtime over InfiniBand, RoCE and iWARP]
• Current model – separate runtimes for OpenSHMEM/UPC/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 onwards! (a hybrid MPI + OpenSHMEM sketch appears at the end of this section)
  – http://mvapich.cse.ohio-state.edu

Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
  [Diagram: an HPC application whose kernels use MPI (Kernel 1, Kernel 3) or PGAS + MPI (Kernel 2, Kernel N)]
• Benefits:
  – Best of the distributed-memory computing model
  – Best of the shared-memory computing model
• Exascale Roadmap*:
  – "Hybrid Programming is a practical way to program exascale systems"
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

MVAPICH2-X 2.2a
• Released on 08/18/2015
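The hybrid MPI + PGAS model and the unified MVAPICH2-X runtime described in the slides above can be pictured with a sketch that mixes MPI and OpenSHMEM calls in one program. This is an illustrative example, not taken from the slides: it assumes an OpenSHMEM 1.2-style API (shmem_init, shmem_malloc, shmem_finalize), that PE numbering matches MPI ranks, and at least two processes; initialization order and interoperability details depend on the implementation.

```c
/*
 * Hybrid MPI + OpenSHMEM sketch (illustrative assumptions noted in comments).
 * A PGAS phase does a one-sided put into a neighbor's symmetric variable,
 * then an MPI phase reduces the delivered values across all processes.
 */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();   /* assumption: both models share one unified runtime */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric heap allocation: the same remote address is valid on every PE. */
    int *remote_val = (int *) shmem_malloc(sizeof(int));
    *remote_val = 0;
    shmem_barrier_all();

    /* PGAS phase: one-sided put of my rank into the next PE's symmetric memory
     * (assumes OpenSHMEM PE numbering matches the MPI rank). */
    int me = rank;
    shmem_int_put(remote_val, &me, 1, (rank + 1) % size);
    shmem_barrier_all();

    /* MPI phase: collective communication among the same processes. */
    int sum = 0;
    MPI_Allreduce(remote_val, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of values delivered via shmem_int_put = %d\n", sum);

    shmem_free(remote_val);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```

With a unified runtime, both families of calls progress through the same network resources, which is exactly the deadlock and resource-consumption issue the MVAPICH2-X slide contrasts with running two separate runtimes.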