A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures
Total Page:16
File Type:pdf, Size:1020Kb
Load more
										Recommended publications
									
								- 
												  MVAPICH2 2.2 User GuideMVAPICH2 2.2 User Guide MVAPICH TEAM NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY http://mvapich.cse.ohio-state.edu Copyright (c) 2001-2016 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: October 19, 2017 Contents 1 Overview of the MVAPICH Project1 2 How to use this User Guide?1 3 MVAPICH2 2.2 Features2 4 Installation Instructions 13 4.1 Building from a tarball .................................... 13 4.2 Obtaining and Building the Source from SVN repository .................. 13 4.3 Selecting a Process Manager................................. 14 4.3.1 Customizing Commands Used by mpirun rsh..................... 15 4.3.2 Using SLURM..................................... 15 4.3.3 Using SLURM with support for PMI Extensions ................... 15 4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3......... 16 4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3.................. 19 4.6 Configuring a build for Shared-Memory-CH3........................ 20 4.7 Configuring a build for OFA-IB-Nemesis .......................... 20 4.8 Configuring a build for Intel TrueScale (PSM-CH3)..................... 21 4.9 Configuring a build for Intel Omni-Path (PSM2-CH3).................... 22 4.10 Configuring a build for TCP/IP-Nemesis........................... 23 4.11 Configuring a build for TCP/IP-CH3............................. 24 4.12 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary) . 24 4.13 Configuring a build for Shared-Memory-Nemesis...................... 25 5 Basic Usage Instructions 26 5.1 Compile Applications..................................... 26 5.2 Run Applications....................................... 26 5.2.1 Run using mpirun rsh ...............................
- 
												  Scalable and High Performance MPI Design for Very LargeSCALABLE AND HIGH-PERFORMANCE MPI DESIGN FOR VERY LARGE INFINIBAND CLUSTERS DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University By Sayantan Sur, B. Tech ***** The Ohio State University 2007 Dissertation Committee: Approved by Prof. D. K. Panda, Adviser Prof. P. Sadayappan Adviser Prof. S. Parthasarathy Graduate Program in Computer Science and Engineering c Copyright by Sayantan Sur 2007 ABSTRACT In the past decade, rapid advances have taken place in the field of computer and network design enabling us to connect thousands of computers together to form high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model to write applications for these clusters. There are a vast array of scientific applications which use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling higher and higher. Thus, in order to enable the best performance to these scientific applications, it is very critical for the design of the MPI libraries be extremely scalable and high-performance. InfiniBand is a cluster interconnect which is based on open-standards and gaining rapid acceptance. This dissertation presents novel designs based on the new features offered by InfiniBand, in order to design scalable and high-performance MPI libraries for large-scale clusters with tens-of-thousands of nodes. Methods developed in this dissertation have been applied towards reduction in overall resource consumption, increased overlap of computa- tion and communication, improved performance of collective operations and finally designing application-level benchmarks to make efficient use of modern networking technology.
- 
												  Map, Reduce and Mapreduce, the Skeleton Way$View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Procedia Computer Science ProcediaProcedia Computer Computer Science Science 00 1 (2010)(2012) 1–9 2095–2103 www.elsevier.com/locate/procedia International Conference on Computational Science, ICCS 2010 Map, Reduce and MapReduce, the skeleton way$ D. Buonoa, M. Daneluttoa, S. Lamettia aDept. of Computer Science, Univ. of Pisa, Italy Abstract Composition of Map and Reduce algorithmic skeletons have been widely studied at the end of the last century and it has demonstrated effective on a wide class of problems. We recall the theoretical results motivating the intro- duction of these skeletons, then we discuss an experiment implementing three algorithmic skeletons, a map,areduce and an optimized composition of a map followed by a reduce skeleton (map+reduce). The map+reduce skeleton implemented computes the same kind of problems computed by Google MapReduce, but the data flow through the skeleton is streamed rather than relying on already distributed (and possibly quite large) data items. We discuss the implementation of the three skeletons on top of ProActive/GCM in the MareMare prototype and we present some experimental obtained on a COTS cluster. ⃝c 2012 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Keywords: Algorithmic skeletons, map, reduce, list theory, homomorphisms 1. Introduction Algorithmic skeletons have been around since Murray Cole’s PhD thesis [1]. An algorithmic skeleton is a paramet- ric, reusable, efficient and composable programming construct that exploits a well know, efficient parallelism exploita- tion pattern. A skeleton programmer may build parallel applications simply picking up a (composition of) skeleton from the skeleton library available, providing the required parameters and running some kind of compiler+run time library tool specific of the target architecture at hand [2].
- 
												  MVAPICH2 2.3 User GuideMVAPICH2 2.3 User Guide MVAPICH Team Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University http://mvapich.cse.ohio-state.edu Copyright (c) 2001-2018 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: February 19, 2018 Contents 1 Overview of the MVAPICH Project1 2 How to use this User Guide?1 3 MVAPICH2 2.3 Features2 4 Installation Instructions 14 4.1 Building from a tarball................................... 14 4.2 Obtaining and Building the Source from SVN repository................ 14 4.3 Selecting a Process Manager................................ 15 4.3.1 Customizing Commands Used by mpirun rsh.................. 16 4.3.2 Using SLURM................................... 16 4.3.3 Using SLURM with support for PMI Extensions................ 16 4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3...... 17 4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3............... 20 4.6 Configuring a build to support running jobs across multiple InfiniBand subnets... 21 4.7 Configuring a build for Shared-Memory-CH3...................... 21 4.8 Configuring a build for OFA-IB-Nemesis......................... 21 4.9 Configuring a build for Intel TrueScale (PSM-CH3)................... 22 4.10 Configuring a build for Intel Omni-Path (PSM2-CH3)................. 23 4.11 Configuring a build for TCP/IP-Nemesis......................... 24 4.12 Configuring a build for TCP/IP-CH3........................... 25 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)... 26 4.14 Configuring a build for Shared-Memory-Nemesis.................... 26 4.15 Configuration and Installation with Singularity....................
- 
												  Automatic Handling of Global Variables for Multi-Threaded MPI ProgramsAutomatic Handling of Global Variables for Multi-threaded MPI Programs Gengbin Zheng, Stas Negara, Celso L. Mendes, Laxmikant V. Kale´ Eduardo R. Rodrigues Department of Computer Science Institute of Informatics University of Illinois at Urbana-Champaign Federal University of Rio Grande do Sul Urbana, IL 61801, USA Porto Alegre, Brazil fgzheng,snegara2,cmendes,[email protected] [email protected] Abstract— necessary to privatize global and static variables to ensure Hybrid programming models, such as MPI combined with the thread-safety of an application. threads, are one of the most efficient ways to write parallel ap- In high-performance computing, one way to exploit multi- plications for current machines comprising multi-socket/multi- core nodes and an interconnection network. Global variables in core platforms is to adopt hybrid programming models that legacy MPI applications, however, present a challenge because combine MPI with threads, where different programming they may be accessed by multiple MPI threads simultaneously. models are used to address issues such as multiple threads Thus, transforming legacy MPI applications to be thread-safe per core, decreasing amount of memory per core, load in order to exploit multi-core architectures requires proper imbalance, etc. When porting legacy MPI applications to handling of global variables. In this paper, we present three approaches to eliminate these hybrid models that involve multiple threads, thread- global variables to ensure thread-safety for an MPI program. safety of the applications needs to be addressed again. These approaches include: (a) a compiler-based refactoring Global variables cause no problem with traditional MPI technique, using a Photran-based tool as an example, which implementations, since each process image contains a sep- automates the source-to-source transformation for programs arate instance of the variable.
- 
												  Exascale Computing Project -- SoftwareExascale Computing Project -- Software Paul Messina, ECP Director Stephen Lee, ECP Deputy Director ASCAC Meeting, Arlington, VA Crystal City Marriott April 19, 2017 www.ExascaleProject.org ECP scope and goals Develop applications Partner with vendors to tackle a broad to develop computer spectrum of mission Support architectures that critical problems national security support exascale of unprecedented applications complexity Develop a software Train a next-generation Contribute to the stack that is both workforce of economic exascale-capable and computational competitiveness usable on industrial & scientists, engineers, of the nation academic scale and computer systems, in collaboration scientists with vendors 2 Exascale Computing Project, www.exascaleproject.org ECP has formulated a holistic approach that uses co- design and integration to achieve capable exascale Application Software Hardware Exascale Development Technology Technology Systems Science and Scalable and Hardware Integrated mission productive technology exascale applications software elements supercomputers Correctness Visualization Data Analysis Applicationsstack Co-Design Programming models, Math libraries and development environment, Tools Frameworks and runtimes System Software, resource Workflows Resilience management threading, Data Memory scheduling, monitoring, and management and Burst control I/O and file buffer system Node OS, runtimes Hardware interface ECP’s work encompasses applications, system software, hardware technologies and architectures, and workforce
- 
												  From Sequential to Parallel Programming with PatternsFrom sequential to parallel programming with patterns Inverted CERN School of Computing 2018 Plácido Fernández [email protected] 07 Mar 2018 Table of Contents Introduction Sequential vs Parallel patterns Patterns Data parallel vs streaming patterns Control patterns (sequential and parallel Streaming parallel patterns Composing parallel patterns Using the patterns Real world use case GrPPI with NUMA Conclusions Acknowledgements and references 07 Mar 2018 3 Table of Contents Introduction Sequential vs Parallel patterns Patterns Data parallel vs streaming patterns Control patterns (sequential and parallel Streaming parallel patterns Composing parallel patterns Using the patterns Real world use case GrPPI with NUMA Conclusions Acknowledgements and references 07 Mar 2018 4 Source: Herb Sutter in Dr. Dobb’s Journal Parallel hardware history I Before ~2005, processor manufacturers increased clock speed. I Then we hit the power and memory wall, which limits the frequency at which processors can run (without melting). I Manufacturers response was to continue improving their chips performance by adding more parallelism. To fully exploit modern processors capabilities, programmers need to put effort into the source code. 07 Mar 2018 6 Parallel Hardware Processors are naturally parallel: Programmers have plenty of options: I Caches I Multi-threading I Different processing units (floating I SIMD / Vectorization (Single point, arithmetic logic...) Instruction Multiple Data) I Integrated GPU I Heterogeneous hardware 07 Mar 2018 7 Source: top500.org Source: Intel Source: CERN Parallel hardware Xeon Xeon 5100 Xeon 5500 Sandy Bridge Haswell Broadwell Skylake Year 2005 2006 2009 2012 2015 2016 2017 Cores 1 2 4 8 18 24 28 Threads 2 2 8 16 36 48 56 SIMD Width 128 128 128 256 256 256 512 Figure: Intel Xeon processors evolution 0Source: ark.intel.com 07 Mar 2018 11 Parallel considerations When designing parallel algorithms where are interested in speedup.
- 
												  MVAPICH USER GROUP MEETING August 26, 2020APPLICATION AND MICROBENCHMARK STUDY USING MVAPICH2 and MVAPICH2-GDR ON SDSC COMET AND EXPANSE MVAPICH USER GROUP MEETING August 26, 2020 Mahidhar Tatineni SDSC San Diego Supercomputer Center NSF Award 1928224 Overview • History of InfiniBand based clusters at SDSC • Hardware summaries for Comet and Expanse • Application and Software Library stack on SDSC systems • Containerized approach – Singularity on SDSC systems • Benchmark results on Comet, Expanse development system - OSU Benchmarks, NEURON, RAxML, TensorFlow, QE, and BEAST are presented. InfiniBand and MVAPICH2 on SDSC Systems Gordon (NSF) Trestles (NSF) 2012-2017 COMET (NSF) Expanse (NSF) 2011-2014 GordonS 2015-Current Production Fall 2020 (Simons Foundation) 2017- 2020 • 324 nodes, 10,368 cores • 1024 nodes, 16,384 cores • 1944 compute, 72 GPU, and • 728 compute, 52 GPU, and • 4-socket AMD Magny-Cours • 2-socket Intel Sandy Bridge 4 large memory nodes. 4 large memory nodes. • QDR InfiniBand • Dual Rail QDR InfiniBand • 2-socket Intel Haswell • 2-socket AMD EPYC 7742, • Fat Tree topology • 3-D Torus topology • FDR InfiniBand HDR100 InfiniBand • MVAPICH2 • 300TB of SSD storage - via • Fat Tree topology • GPU nodes with 4 V100 iSCSI over RDMA (iSER) • MVAPICH2, MVAPICH2-X, GPUs + NVLINK • MVAPICH2 (1.9, 2.1) with 3-D MVAPICH2-GDR • HDR200 Switches, Fat torus support • Leverage SRIOV for Virtual Tree topology with 3:1 Clusters oversubscription • MVAPICH2, MVAPICH2-X, MVAPICH2-GDR Comet: System Characteristics • Total peak flops ~2.76 PF • Hybrid fat-tree topology • Dell primary
- 
												  Opencl: a Hands-On IntroductionOpenCL: A Hands-on Introduction Tim Mattson Alice Koniges Intel Corp. Berkeley Lab/NERSC Simon McIntosh-Smith University of Bristol Acknowledgements: In addition to Tim, Alice and Simon … Tom Deakin (Bristol) and Ben Gaster (Qualcomm) contributed to this content. Agenda Lectures Exercises An Introduction to OpenCL Logging in and running the Vadd program Understanding Host programs Chaining Vadd kernels together Kernel programs The D = A + B + C problem Writing Kernel Programs Matrix Multiplication Lunch Working with the OpenCL memory model Several ways to Optimize matrix multiplication High Performance OpenCL Matrix multiplication optimization contest The OpenCL Zoo Run your OpenCL programs on a variety of systems. Closing Comments Course materials In addition to these slides, C++ API header files, a set of exercises, and solutions, we provide: OpenCL C 1.2 Reference Card OpenCL C++ 1.2 Reference Card These cards will help you keep track of the API as you do the exercises: https://www.khronos.org/files/ opencl-1-2-quick-reference-card.pdf The v1.2 spec is also very readable and recommended to have on-hand: https://www.khronos.org/registry/ cl/specs/opencl-1.2.pdf AN INTRODUCTION TO OPENCL Industry Standards for Programming Heterogeneous Platforms GPUs CPUs Emerging Increasingly general Multiple cores driving Intersection purpose data-parallel performance increases computing Graphics Multi- Heterogeneous APIs and processor Computing Shading programming – Languages e.g. OpenMP OpenCL – Open Computing Language Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors The origins of OpenCL ARM AMD Merged, needed Nokia commonality IBM ATI across products Sony Wrote a rough draft Qualcomm GPU vendor – straw man API Imagination wants to steal TI NVIDIA market share from CPU + many more Khronos Compute CPU vendor – group formed wants to steal Intel market share from GPU Was tired of recoding for many core, GPUs.
- 
												  Infiniband and 10-Gigabit Ethernet for DummiesThe MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing Presentation at Mellanox Theatre (SC ‘18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Drivers of Modern HPC Cluster Architectures High Performance Interconnects - Accelerators / Coprocessors InfiniBand high compute density, high Multi-core Processors <1usec latency, 100Gbps Bandwidth> performance/watt SSD, NVMe-SSD, NVRAM >1 TFlop DP on a chip • Multi-core/many-core technologies • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) • Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc. Summit Sunway TaihuLight Sierra K - Computer Network Based Computing Laboratory Mellanox Theatre SC’18 2 Parallel Programming Models Overview P1 P2 P3 P1 P2 P3 P1 P2 P3 Logical shared memory Memory Memory Memory Shared Memory Memory Memory Memory Shared Memory Model Distributed Memory Model Partitioned Global Address Space (PGAS) SHMEM, DSM MPI (Message Passing Interface) Global Arrays, UPC, Chapel, X10, CAF, … • Programming models provide abstract machine models • Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • PGAS models and Hybrid MPI+PGAS models are gradually receiving importance Network Based Computing Laboratory Mellanox Theatre SC’18 3 Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges Application Kernels/Applications Co-Design Middleware Opportunities and Programming Models Challenges MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, across Various OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
- 
												  MVAPICH2 2.2Rc2 User GuideMVAPICH2 2.2rc2 User Guide MVAPICH TEAM NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY http://mvapich.cse.ohio-state.edu Copyright (c) 2001-2016 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: August 8, 2016 Contents 1 Overview of the MVAPICH Project1 2 How to use this User Guide?1 3 MVAPICH2 2.2rc2 Features2 4 Installation Instructions 13 4.1 Building from a tarball .................................... 13 4.2 Obtaining and Building the Source from SVN repository .................. 13 4.3 Selecting a Process Manager................................. 14 4.3.1 Customizing Commands Used by mpirun rsh..................... 15 4.3.2 Using SLURM..................................... 15 4.3.3 Using SLURM with support for PMI Extensions ................... 15 4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3......... 16 4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3.................. 19 4.6 Configuring a build for Shared-Memory-CH3........................ 20 4.7 Configuring a build for OFA-IB-Nemesis .......................... 20 4.8 Configuring a build for Intel TrueScale (PSM-CH3)..................... 21 4.9 Configuring a build for Intel Omni-Path (PSM2-CH3).................... 22 4.10 Configuring a build for TCP/IP-Nemesis........................... 23 4.11 Configuring a build for TCP/IP-CH3............................. 24 4.12 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary) . 24 4.13 Configuring a build for Shared-Memory-Nemesis...................... 25 5 Basic Usage Instructions 26 5.1 Compile Applications..................................... 26 5.2 Run Applications....................................... 26 5.2.1 Run using mpirun rsh ...............................
- 
												  Mapreduce BasicsChapter 2 MapReduce Basics The only feasible approach to tackling large-data problems today is to divide and conquer, a fundamental concept in computer science that is introduced very early in typical undergraduate curricula. The basic idea is to partition a large problem into smaller sub-problems. To the extent that the sub-problems are independent [5], they can be tackled in parallel by different workers| threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. Intermediate results from each individual worker are then combined to yield the final output.1 The general principles behind divide-and-conquer algorithms are broadly applicable to a wide range of problems in many different application domains. However, the details of their implementations are varied and complex. For example, the following are just some of the issues that need to be addressed: • How do we break up a large problem into smaller tasks? More specifi- cally, how do we decompose the problem so that the smaller tasks can be executed in parallel? • How do we assign tasks to workers distributed across a potentially large number of machines (while keeping in mind that some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)? • How do we ensure that the workers get the data they need? • How do we coordinate synchronization among the different workers? • How do we share partial results from one worker that is needed by an- other? • How do we accomplish all of the above in the face of software errors and hardware faults? 1We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but they are far from being sufficiently mature to solve real world problems.