MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters


H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, D. K. Panda
Network-Based Computing Laboratory, The Ohio State University
ISC 2011, Hamburg

Outline
• Introduction
• Problem Statement
• Our Solution: MVAPICH2-GPU
• Design Considerations
• Performance Evaluation
• Conclusion & Future Work

InfiniBand Clusters in the Top500
• The percentage share of InfiniBand is steadily increasing
• 41% of systems in the TOP500 use InfiniBand (June '11)
• 61% of systems in the TOP100 use InfiniBand (June '11)

Growth in GPGPUs
• GPGPUs are gaining significance on clusters for data-centric applications:
  – Word occurrence, sparse integer occurrence
  – K-means clustering, linear regression
• GPGPUs + InfiniBand are gaining momentum for large clusters
  – #2 (Tianhe-1A), #4 (Nebulae) and #5 (Tsubame) petascale systems
• GPGPU programming
  – CUDA or OpenCL + MPI
  – Dr. Sumit Gupta briefed industry users at the NVIDIA meeting yesterday on programmability advances on GPUs
• Big issues: performance of data movement
  – Latency
  – Bandwidth
  – Overlap

Data Movement in GPU Clusters
[Figure: data path GPU memory → main memory → PCI-E → IB adapter → IB network → IB adapter → PCI-E → main memory → GPU memory]
• Data movement in InfiniBand clusters with GPUs
  – CUDA: device memory → main memory (at the source process)
  – MPI: source process → destination process
  – CUDA: main memory → device memory (at the destination process)
• GPU and InfiniBand require separate memory registration

GPU Direct
[Figure: host memory registration without and with GPU Direct]
• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices can register the same host memory:
  – The GPU and the network adapter can access the same buffer (see the sketch below)
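A minimal sketch of that shared registration follows; it is an illustration only, not MVAPICH2's internal code. The staging-buffer size, the helper name register_staging, and the pre-created protection domain pd are assumptions, and error handling plus the rest of the verbs setup are omitted.

    /* Illustration: pin one host staging buffer for both the CUDA driver and
     * the InfiniBand HCA, which is the sharing that GPU Direct (1.0) enables.
     * "pd" is an already-created ibv_pd; error checks are elided. */
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>
    #include <stdlib.h>

    #define STAGING_SZ (4 * 1024 * 1024)   /* 4 MB staging buffer (assumption) */

    struct ibv_mr *register_staging(struct ibv_pd *pd, void **buf_out)
    {
        void *buf = malloc(STAGING_SZ);

        /* Page-lock the buffer so the GPU's DMA engine can use it with
         * cudaMemcpyAsync(). */
        cudaHostRegister(buf, STAGING_SZ, cudaHostRegisterPortable);

        /* Register the very same pages with the HCA for RDMA. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, STAGING_SZ,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        *buf_out = buf;
        return mr;
    }

Without GPU Direct the two drivers cannot share the same pinned pages, so the library needs a second host buffer and an extra copy between them, which is exactly the overhead the evaluation section quantifies.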
Problem Statement
• Data movement from/to GPGPUs is
  – a performance bottleneck
  – a drain on programmer productivity
• Hard to optimize at the application level
  – CUDA and MPI expertise is required for an efficient implementation
  – Latency characteristics are hardware dependent
  – Hard to support and optimize collectives
  – Hard to support advanced features like one-sided communication

MVAPICH2-GPU: Design Goals
• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer that automatically provides optimizations inside the MPI library without user tuning
• Works both with and without GPU Direct

Sample Code – Without MPI Integration
• Naïve implementation with MPI and CUDA

At the sender:
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

At the receiver:
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

• High productivity but poor performance

Sample Code – User-Optimized Code
• Pipelining at the user level with non-blocking MPI and CUDA interfaces
• The code is repeated at the receiver side
• Good performance but poor productivity

At the sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_buf + j * block_sz, s_device + j * block_sz, ...);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(...);
            if (j > 0) MPI_Test(...);
        }
        MPI_Isend(s_buf + j * block_sz, block_sz, MPI_CHAR, 1, 1, ...);
    }
    MPI_Waitall(...);

Sample Code – MVAPICH2-GPU
• MVAPICH2-GPU provides standard MPI interfaces for GPU memory

At the sender:
    MPI_Send(s_device, size, ...);   // s_device is a data buffer in GPU memory

At the receiver:
    MPI_Recv(r_device, size, ...);   // r_device is a data buffer in GPU memory

• High productivity and high performance!

Design Considerations
• Memory detection
  – CUDA 4.0 introduces Unified Virtual Addressing (UVA)
  – The MPI library can differentiate between device memory and host memory without any hints from the user (see the sketch below)
• Overlap of CUDA copies and RDMA transfers
  – Pipeline DMA of data from the GPU with InfiniBand RDMA
  – Allow the DMAs of individual data chunks to progress independently
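Since the design considerations above rely on UVA-based memory detection, here is a minimal sketch of how a library might classify a user buffer. It is not MVAPICH2's actual code: the helper name is made up, and the memoryType field reflects the CUDA 4.x API (newer toolkits expose this as attr.type).

    /* Sketch: classify a buffer as device or host memory under CUDA 4.0 UVA. */
    #include <cuda_runtime.h>

    static int is_device_pointer(const void *buf)
    {
        struct cudaPointerAttributes attr;

        if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
            /* Plain host pointers that CUDA has never seen can make this call
             * fail on CUDA 4.x; clear the sticky error and treat the buffer
             * as host memory. */
            cudaGetLastError();
            return 0;
        }
        return attr.memoryType == cudaMemoryTypeDevice;
    }

With such a check at the entry of MPI_Send/MPI_Recv, device buffers can be routed into the pipelined path while host buffers take the ordinary one, which is what lets the interface stay unchanged.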
Pipelined Design
[Figure: MPI_Send/MPI_Recv pipeline between GPU device memory and host main memory on both sides, with GPU Direct: RTS/CTS handshake, cudaMemcpyAsync() and cudaStreamQuery() at the sender, RDMA Write & Finish message on the wire, cudaMemcpyAsync() and cudaStreamQuery() at the receiver]
• Data is divided into chunks
• CUDA copies are pipelined with RDMA transfers
• If the system does not have GPU Direct, an extra copy is required

Pipeline Design (Cont.)
• The chunk size depends on the CUDA copy cost and the RDMA latency over the network
• Automatic tuning of the chunk size
  – CUDA copy and RDMA latencies are detected during installation
  – The chunk size can be stored in the configuration file (mvapich.conf)
• User transparent, to deliver the best performance (a toy probe in this spirit is sketched below)
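The automatic tuning can be illustrated with a toy installation-time probe; this is a sketch under assumptions that are not in the slides (a hypothetical candidate list, a 90%-of-peak-bandwidth cut-off, and a plain printout instead of writing mvapich.conf). It times pinned device-to-host copies and reports the smallest chunk that is still fast enough to keep the copy stage and the RDMA stage of the pipeline balanced.

    /* Toy probe: pick the smallest chunk size whose device-to-host copy
     * bandwidth is within 90% of the measured peak (heuristic is an assumption). */
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define NCAND 5
    #define REPS  50

    int main(void)
    {
        const size_t cand[NCAND] = { 64 << 10, 128 << 10, 256 << 10,
                                     512 << 10, 1 << 20 };
        void *dev, *host;
        cudaMalloc(&dev, cand[NCAND - 1]);
        cudaMallocHost(&host, cand[NCAND - 1]);   /* pinned, as the pipeline uses */

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        double bw[NCAND], peak = 0.0;
        for (int i = 0; i < NCAND; i++) {
            float ms;
            cudaEventRecord(t0, 0);
            for (int r = 0; r < REPS; r++)
                cudaMemcpyAsync(host, dev, cand[i], cudaMemcpyDeviceToHost, 0);
            cudaEventRecord(t1, 0);
            cudaEventSynchronize(t1);
            cudaEventElapsedTime(&ms, t0, t1);
            bw[i] = (double)REPS * cand[i] / (ms * 1e6);   /* GB/s */
            if (bw[i] > peak) peak = bw[i];
        }

        for (int i = 0; i < NCAND; i++) {
            if (bw[i] >= 0.9 * peak) {   /* smallest "good enough" chunk */
                printf("suggested chunk size: %zu bytes\n", cand[i]);
                break;
            }
        }

        cudaFreeHost(host);
        cudaFree(dev);
        return 0;
    }

A real tuner would also measure the RDMA latency for the same sizes, as the slide notes, and pick the chunk where the two pipeline stages take comparable time.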
Performance Evaluation
• Experimental environment
  – NVIDIA Tesla C2050 GPUs
  – Mellanox QDR InfiniBand HCA MT26428
  – Intel Westmere processors with 12 GB main memory
  – MVAPICH2 1.6, CUDA Toolkit 4.0, GPUDirect
• OSU Micro-Benchmarks
  – The source and destination addresses are in GPU device memory

Ping-Pong Latency
[Figure: latency (us) vs. message size (32 KB – 4 MB), with and without GPU Direct, comparing Memcpy+Send, MemcpyAsync+Isend, and MVAPICH2-GPU]
• With GPU Direct
  – 45% improvement compared to Memcpy+Send (4 MB messages)
  – 24% improvement compared to MemcpyAsync+Isend (4 MB messages)
• Without GPU Direct
  – 38% improvement compared to Memcpy+Send (4 MB messages)
  – 33% improvement compared to MemcpyAsync+Isend (4 MB messages)

One-Sided Communication
[Figure: MPI_Put and MPI_Get latency (us) vs. message size (32 KB – 4 MB), with and without GPU Direct, comparing Memcpy+Put/Get and MVAPICH2-GPU]
• 45% improvement compared to Memcpy+Put (with GPU Direct)
• 39% improvement compared to Memcpy+Put (without GPU Direct)
• Similar improvement for the Get operation
• Major improvement in programmability
  – One-sided communication is not feasible without combined MPI+CUDA support

Collective Communication: Gather and Scatter
[Figure: MPI_Gather and MPI_Scatter time (ms) vs. message size (32 KB – 4 MB), with and without GPU Direct, comparing Memcpy+Gather/Scatter and MVAPICH2-GPU]
• With GPU Direct
  – 37% improvement for MPI_Gather (1 MB) and 32% improvement for MPI_Scatter (4 MB)
• Without GPU Direct
  – 33% improvement for MPI_Gather (1 MB) and 23% improvement for MPI_Scatter (4 MB)

Collective Communication: Alltoall
[Figure: MPI_Alltoall time (ms) vs. message size (32 KB – 4 MB), with and without GPU Direct, comparing Memcpy+Alltoall and MVAPICH2-GPU]
• With GPU Direct: 30% improvement for 4 MB messages
• Without GPU Direct: 26% improvement for 4 MB messages

Comparison with Another MPI Stack: Latency
[Figure: small-message (1 B – 8 KB) and large-message (32 KB – 2 MB) latency (us), MVAPICH2-GPU vs. OpenMPI-CUDA]
• Open MPI implementation by Rolf vandeVaart (NVIDIA)
  – https://bitbucket.org/rolfv/ompi-trunk-cuda-3
• Small-message latency around 10% better (1 B – 512 B)
• Large-message latency around 40% better (4 MB)

Comparison with Another MPI Stack: Bandwidth
[Figure: bandwidth (MBytes/s) vs. message size (1 B – 4 MB), MVAPICH2-GPU vs. OpenMPI-CUDA]
• Bandwidth for large messages (512 KB and above) improved by almost a factor of three

Conclusion & Future Work
• GPUs are key for data-centric applications
• However, data movement is the key bottleneck in current-generation clusters with GPUs
• Asynchronous CUDA copies and RDMA provide an opportunity for latency hiding
• Optimizations at the application level reduce productivity and achieve sub-optimal performance
• The MPI library can handle this efficiently for the user
• MVAPICH2-GPU shows substantial improvements for MPI operations
  – Will be available in a future MVAPICH2 release
• Future work includes collective-specific optimizations and application-level studies with MVAPICH2-GPU
• Upcoming paper at Cluster 2011: "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2", H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda

Thank You!
{wangh, potluri, luom, singhas, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/