A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures

Total Page:16

File Type:pdf, Size:1020Kb

A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University By Wei Jiang, B.S., M.S. Graduate Program in Department of Computer Science and Engineering The Ohio State University 2012 Dissertation Committee: Gagan Agrawal, Advisor P. Sadayappan Radu Teodorescu c Copyright by Wei Jiang 2012 Abstract Parallel computing environments are ubiquitous nowadays, including traditional CPU clusters and the emergence of GPU clusters and CPU-GPU clusters because of their per- formance, cost and energy efficiency. With this trend, an important research issue is to effectively utilize the massive computing power in these architectures to accelerate data- intensive applications arising from commercial and scientific domains. Map-reduce and its Hadoop implementation have become popular for its high programming productivity but exhibits non-trivial performance losses for many classes of data-intensive applications. Also, there is no general map-reduce-like support up to date for programming heteroge- neous systems like a CPU-GPU cluster. Besides, it is widely accepted that the existing fault tolerant techniques for high-end systems will not be feasible in the exascale era and novel solutions are clearly needed. Our overall goal is to solve these programmability and performance issues by providing a map-reduce-like API with better performance efficiency as well as efficient fault tolerance support, targeting data-intensive applications and various new emerging parallel architec- tures. We believe that a map-reduce-like API can ease the programming difficulty in these parallel architectures, and more importantly improve the performance efficiency of paral- lelizing these data-intensive applications. Also, the use of a high-level programming model can greatly simplify fault-tolerance support, resulting in low overhead checkpointing and recovery. ii We performed a comparative study showing that the map-reduce processing style could cause significant overheads for a set of data mining applications. Based on the observation, we developed a map-reduce system with an alternate API (MATE) using a user-declared reduction-object to be able to further improve the performance of map-reduce programs in multi-core environments. To address the limitation in MATE that the reduction object must fit in memory, we extended the MATE system to support the reduction object of arbitrary sizes in distributed environments and apply it to a set of graph mining applications, obtaining better performance than the original graph mining library based on map-reduce. We then supported the generalized reduction API in a CPU-GPU cluster with the ability of automatic data distribution among CPUs and GPUs to achieve the best-possible hetero- geneous execution of iterative applications. Finally, in our recent work, we extended the generalized reduction model with supporting low overhead fault tolerance for MPI pro- grams in our FT-MATE system. Especially, we are able to deal with CPU/GPU failures in a cluster with low overhead checkpointing, and restart the computations from a different number of nodes. Through this work, we would like to provide useful insights for designing and implementing efficient fault tolerance solutions for the exascale systems in the future. iii This is dedicated to the ones I love: my parents and my younger sister. iv Acknowledgments It would not have been possible to write this doctoral dissertation without the help and support of the kind people around me, to only some of whom it is possible to give particular mention here. First of all, I would like to thank my advisor Prof. Gagan Agrawal. This disserta- tion would not have been possible without the help, support and patience of Prof. Gagan Agrawal, not to mention his advice and unsurpassed knowledge of parallel and distributed computing in computer science. He has taught me, both consciously and unconsciously, how good computer systems and software should be done. I appreciate all his contributions of time, ideas, and funding to make my Ph.D. life productive and stimulating. The joy, passion and enthusiasm he has for his research were contagious and motivational for me, even during tough times in the Ph.D. pursuit. I am also indebted to him for the excellent example he has provided as a successful computer scientist and professor. Second, I would like to acknowledge the financial, academic and technical support of the Ohio State University, particularly in the award of a University Fellowship that pro- vided the necessary financial support for my first year. The library facilities and computer facilities of the University, as well as the Computer Science and Engineering Department, have been indispensable. I also thank the Department of Computer Science and Engineer- ing for their support and assistance since the start of my graduate associate work in 2008, v especially the head of department, Prof. Xiaodong Zhang. My work was also supported by the National Science Foundation. Also, I would like to thank my colleagues in my lab. The members of the Data-Intensive and High Performance Computing Research Group have contributed immensely to my per- sonal and professional time at Ohio State. The group has been a source of friendships as well as good advice and collaboration. I am especially grateful for the fun group of original members who stuck it out in graduate school with me all the way. My time at Ohio State was made enjoyable in large part due to the many friends and groups that became part of my life. I am thankful for the pleasant days spent with my roommates and hangout buddies with our memorable time in the city of Columbus, for my academic friends who have helped me a lot in improving my programming and paper writing skills, and for many other people and memories of trips into mountains, rivers and lakes. Special thanks are given to my family for their personal support and great patience at all times. My parents and my younger sister have given me their unequivocal support throughout, as always, for which my mere expression of thanks likewise does not suffice. Last, but by no means least, I thank my friends in America, China and elsewhere for their support and encouragement throughout, some of whom have already been named. vi Vita October29th,1985 ..........................Born-Tianmen, China 2007 .......................................B.S.Software Engineering, Nankai University, Tianjin, China 2011 .......................................M.S.Computer Science and Engineering, The Ohio State University Publications Research Publications Wei Jiang and Gagan Agrawal. “MATE-CG: A MapReduce-Like Framework for Accel- erating Data-Intensive Computations on Heterogeneous Clusters”. In Proceedings of the 26th IEEE International Symposium on Parallel and Distributed Processing, May 2012. Yan Liu, Wei Jiang, Shuangshuang Jin, Mark Rice, and Yousu Chen. “Distributing Power Grid State Estimation on HPC Clusters: A System Architecture Prototype”. In Proceed- ings of the 13th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing, May 2012. Yi Wang, Wei Jiang and Gagan Agrawal. “SciMATE: A Novel MapReduce-Like Frame- work for Multiple Scientific Data Formats”. In Proceedings of the 12th IEEE/ACM Inter- national Symposium on Cluster, Cloud and Grid Computing, pages 443–450, May 2012. Vignesh Ravi, Michela Becchi, Wei Jiang, Gagan Agrawal and Srimat Chakradhar. “Schedul- ing Concurrent Applications on a Cluster of CPU-GPU Nodes”. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 140– 147, May 2012. Wei Jiang and Gagan Agrawal. “Ex-MATE: Data-Intensive Computing with Large Reduc- tion Objects and Its Application to Graph Mining”. In Proceedings of the 11th IEEE/ACM vii International Symposium on Cluster, Cloud and Grid Computing, pages 475–484, May 2011. Wei Jiang, Vignesh T. Ravi, and Gagan Agrawal. “A Map-Reduce System with an Alter- nate API for Multi-core Environments”. In Proceedings of the 10th IEEE/ACM Interna- tional Symposium on Cluster, Cloud and Grid Computing, pages 84–93, May 2010. Tekin Bicer, Wei Jiang, and Gagan Agrawal. “Supporting Fault Tolerance in a Data- Intensive Computing Middleware”. In Proceedings of the 24th IEEE International Sympo- sium on Parallel and Distributed Processing, pages 1–12, Apr. 2010. Wei Jiang, Vignesh T. Ravi, and Gagan Agrawal. “Comparing Map-Reduce and FREERIDE for Data-Intensive Applications”. In Proceedings of the 2009 IEEE International Confer- ence on Cluster Computing, pages 1–10. Aug. 2009. Fields of Study Major Field: Computer Science and Engineering viii Table of Contents Page Abstract......................................... ii Dedication........................................ iv Acknowledgments.................................... v Vita ........................................... vii ListofTables ...................................... xii ListofFigures...................................... xiii 1. Introduction.................................... 1 1.1 High-Level APIs: MapReduce and Generalized Reduction . ....... 3 1.2 ParallelArchitectures:CPUsandGPUs . ... 5 1.3 DissertationContributions . .. 7 1.4 RelatedWork................................ 15 1.5 Outline ................................... 22 2. Comparing Map-Reduce and FREERIDE for Date-Intensive Applications
Recommended publications
  • MVAPICH2 2.2 User Guide
    MVAPICH2 2.2 User Guide MVAPICH TEAM NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY http://mvapich.cse.ohio-state.edu Copyright (c) 2001-2016 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: October 19, 2017 Contents 1 Overview of the MVAPICH Project1 2 How to use this User Guide?1 3 MVAPICH2 2.2 Features2 4 Installation Instructions 13 4.1 Building from a tarball .................................... 13 4.2 Obtaining and Building the Source from SVN repository .................. 13 4.3 Selecting a Process Manager................................. 14 4.3.1 Customizing Commands Used by mpirun rsh..................... 15 4.3.2 Using SLURM..................................... 15 4.3.3 Using SLURM with support for PMI Extensions ................... 15 4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3......... 16 4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3.................. 19 4.6 Configuring a build for Shared-Memory-CH3........................ 20 4.7 Configuring a build for OFA-IB-Nemesis .......................... 20 4.8 Configuring a build for Intel TrueScale (PSM-CH3)..................... 21 4.9 Configuring a build for Intel Omni-Path (PSM2-CH3).................... 22 4.10 Configuring a build for TCP/IP-Nemesis........................... 23 4.11 Configuring a build for TCP/IP-CH3............................. 24 4.12 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary) . 24 4.13 Configuring a build for Shared-Memory-Nemesis...................... 25 5 Basic Usage Instructions 26 5.1 Compile Applications..................................... 26 5.2 Run Applications....................................... 26 5.2.1 Run using mpirun rsh ...............................
    [Show full text]
  • Scalable and High Performance MPI Design for Very Large
    SCALABLE AND HIGH-PERFORMANCE MPI DESIGN FOR VERY LARGE INFINIBAND CLUSTERS DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University By Sayantan Sur, B. Tech ***** The Ohio State University 2007 Dissertation Committee: Approved by Prof. D. K. Panda, Adviser Prof. P. Sadayappan Adviser Prof. S. Parthasarathy Graduate Program in Computer Science and Engineering c Copyright by Sayantan Sur 2007 ABSTRACT In the past decade, rapid advances have taken place in the field of computer and network design enabling us to connect thousands of computers together to form high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model to write applications for these clusters. There are a vast array of scientific applications which use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling higher and higher. Thus, in order to enable the best performance to these scientific applications, it is very critical for the design of the MPI libraries be extremely scalable and high-performance. InfiniBand is a cluster interconnect which is based on open-standards and gaining rapid acceptance. This dissertation presents novel designs based on the new features offered by InfiniBand, in order to design scalable and high-performance MPI libraries for large-scale clusters with tens-of-thousands of nodes. Methods developed in this dissertation have been applied towards reduction in overall resource consumption, increased overlap of computa- tion and communication, improved performance of collective operations and finally designing application-level benchmarks to make efficient use of modern networking technology.
    [Show full text]
  • Map, Reduce and Mapreduce, the Skeleton Way$
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Procedia Computer Science ProcediaProcedia Computer Computer Science Science 00 1 (2010)(2012) 1–9 2095–2103 www.elsevier.com/locate/procedia International Conference on Computational Science, ICCS 2010 Map, Reduce and MapReduce, the skeleton way$ D. Buonoa, M. Daneluttoa, S. Lamettia aDept. of Computer Science, Univ. of Pisa, Italy Abstract Composition of Map and Reduce algorithmic skeletons have been widely studied at the end of the last century and it has demonstrated effective on a wide class of problems. We recall the theoretical results motivating the intro- duction of these skeletons, then we discuss an experiment implementing three algorithmic skeletons, a map,areduce and an optimized composition of a map followed by a reduce skeleton (map+reduce). The map+reduce skeleton implemented computes the same kind of problems computed by Google MapReduce, but the data flow through the skeleton is streamed rather than relying on already distributed (and possibly quite large) data items. We discuss the implementation of the three skeletons on top of ProActive/GCM in the MareMare prototype and we present some experimental obtained on a COTS cluster. ⃝c 2012 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Keywords: Algorithmic skeletons, map, reduce, list theory, homomorphisms 1. Introduction Algorithmic skeletons have been around since Murray Cole’s PhD thesis [1]. An algorithmic skeleton is a paramet- ric, reusable, efficient and composable programming construct that exploits a well know, efficient parallelism exploita- tion pattern. A skeleton programmer may build parallel applications simply picking up a (composition of) skeleton from the skeleton library available, providing the required parameters and running some kind of compiler+run time library tool specific of the target architecture at hand [2].
    [Show full text]
  • MVAPICH2 2.3 User Guide
    MVAPICH2 2.3 User Guide MVAPICH Team Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University http://mvapich.cse.ohio-state.edu Copyright (c) 2001-2018 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: February 19, 2018 Contents 1 Overview of the MVAPICH Project1 2 How to use this User Guide?1 3 MVAPICH2 2.3 Features2 4 Installation Instructions 14 4.1 Building from a tarball................................... 14 4.2 Obtaining and Building the Source from SVN repository................ 14 4.3 Selecting a Process Manager................................ 15 4.3.1 Customizing Commands Used by mpirun rsh.................. 16 4.3.2 Using SLURM................................... 16 4.3.3 Using SLURM with support for PMI Extensions................ 16 4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3...... 17 4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3............... 20 4.6 Configuring a build to support running jobs across multiple InfiniBand subnets... 21 4.7 Configuring a build for Shared-Memory-CH3...................... 21 4.8 Configuring a build for OFA-IB-Nemesis......................... 21 4.9 Configuring a build for Intel TrueScale (PSM-CH3)................... 22 4.10 Configuring a build for Intel Omni-Path (PSM2-CH3)................. 23 4.11 Configuring a build for TCP/IP-Nemesis......................... 24 4.12 Configuring a build for TCP/IP-CH3........................... 25 4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)... 26 4.14 Configuring a build for Shared-Memory-Nemesis.................... 26 4.15 Configuration and Installation with Singularity....................
    [Show full text]
  • Automatic Handling of Global Variables for Multi-Threaded MPI Programs
    Automatic Handling of Global Variables for Multi-threaded MPI Programs Gengbin Zheng, Stas Negara, Celso L. Mendes, Laxmikant V. Kale´ Eduardo R. Rodrigues Department of Computer Science Institute of Informatics University of Illinois at Urbana-Champaign Federal University of Rio Grande do Sul Urbana, IL 61801, USA Porto Alegre, Brazil fgzheng,snegara2,cmendes,[email protected] [email protected] Abstract— necessary to privatize global and static variables to ensure Hybrid programming models, such as MPI combined with the thread-safety of an application. threads, are one of the most efficient ways to write parallel ap- In high-performance computing, one way to exploit multi- plications for current machines comprising multi-socket/multi- core nodes and an interconnection network. Global variables in core platforms is to adopt hybrid programming models that legacy MPI applications, however, present a challenge because combine MPI with threads, where different programming they may be accessed by multiple MPI threads simultaneously. models are used to address issues such as multiple threads Thus, transforming legacy MPI applications to be thread-safe per core, decreasing amount of memory per core, load in order to exploit multi-core architectures requires proper imbalance, etc. When porting legacy MPI applications to handling of global variables. In this paper, we present three approaches to eliminate these hybrid models that involve multiple threads, thread- global variables to ensure thread-safety for an MPI program. safety of the applications needs to be addressed again. These approaches include: (a) a compiler-based refactoring Global variables cause no problem with traditional MPI technique, using a Photran-based tool as an example, which implementations, since each process image contains a sep- automates the source-to-source transformation for programs arate instance of the variable.
    [Show full text]
  • Exascale Computing Project -- Software
    Exascale Computing Project -- Software Paul Messina, ECP Director Stephen Lee, ECP Deputy Director ASCAC Meeting, Arlington, VA Crystal City Marriott April 19, 2017 www.ExascaleProject.org ECP scope and goals Develop applications Partner with vendors to tackle a broad to develop computer spectrum of mission Support architectures that critical problems national security support exascale of unprecedented applications complexity Develop a software Train a next-generation Contribute to the stack that is both workforce of economic exascale-capable and computational competitiveness usable on industrial & scientists, engineers, of the nation academic scale and computer systems, in collaboration scientists with vendors 2 Exascale Computing Project, www.exascaleproject.org ECP has formulated a holistic approach that uses co- design and integration to achieve capable exascale Application Software Hardware Exascale Development Technology Technology Systems Science and Scalable and Hardware Integrated mission productive technology exascale applications software elements supercomputers Correctness Visualization Data Analysis Applicationsstack Co-Design Programming models, Math libraries and development environment, Tools Frameworks and runtimes System Software, resource Workflows Resilience management threading, Data Memory scheduling, monitoring, and management and Burst control I/O and file buffer system Node OS, runtimes Hardware interface ECP’s work encompasses applications, system software, hardware technologies and architectures, and workforce
    [Show full text]
  • From Sequential to Parallel Programming with Patterns
    From sequential to parallel programming with patterns Inverted CERN School of Computing 2018 Plácido Fernández [email protected] 07 Mar 2018 Table of Contents Introduction Sequential vs Parallel patterns Patterns Data parallel vs streaming patterns Control patterns (sequential and parallel Streaming parallel patterns Composing parallel patterns Using the patterns Real world use case GrPPI with NUMA Conclusions Acknowledgements and references 07 Mar 2018 3 Table of Contents Introduction Sequential vs Parallel patterns Patterns Data parallel vs streaming patterns Control patterns (sequential and parallel Streaming parallel patterns Composing parallel patterns Using the patterns Real world use case GrPPI with NUMA Conclusions Acknowledgements and references 07 Mar 2018 4 Source: Herb Sutter in Dr. Dobb’s Journal Parallel hardware history I Before ~2005, processor manufacturers increased clock speed. I Then we hit the power and memory wall, which limits the frequency at which processors can run (without melting). I Manufacturers response was to continue improving their chips performance by adding more parallelism. To fully exploit modern processors capabilities, programmers need to put effort into the source code. 07 Mar 2018 6 Parallel Hardware Processors are naturally parallel: Programmers have plenty of options: I Caches I Multi-threading I Different processing units (floating I SIMD / Vectorization (Single point, arithmetic logic...) Instruction Multiple Data) I Integrated GPU I Heterogeneous hardware 07 Mar 2018 7 Source: top500.org Source: Intel Source: CERN Parallel hardware Xeon Xeon 5100 Xeon 5500 Sandy Bridge Haswell Broadwell Skylake Year 2005 2006 2009 2012 2015 2016 2017 Cores 1 2 4 8 18 24 28 Threads 2 2 8 16 36 48 56 SIMD Width 128 128 128 256 256 256 512 Figure: Intel Xeon processors evolution 0Source: ark.intel.com 07 Mar 2018 11 Parallel considerations When designing parallel algorithms where are interested in speedup.
    [Show full text]
  • MVAPICH USER GROUP MEETING August 26, 2020
    APPLICATION AND MICROBENCHMARK STUDY USING MVAPICH2 and MVAPICH2-GDR ON SDSC COMET AND EXPANSE MVAPICH USER GROUP MEETING August 26, 2020 Mahidhar Tatineni SDSC San Diego Supercomputer Center NSF Award 1928224 Overview • History of InfiniBand based clusters at SDSC • Hardware summaries for Comet and Expanse • Application and Software Library stack on SDSC systems • Containerized approach – Singularity on SDSC systems • Benchmark results on Comet, Expanse development system - OSU Benchmarks, NEURON, RAxML, TensorFlow, QE, and BEAST are presented. InfiniBand and MVAPICH2 on SDSC Systems Gordon (NSF) Trestles (NSF) 2012-2017 COMET (NSF) Expanse (NSF) 2011-2014 GordonS 2015-Current Production Fall 2020 (Simons Foundation) 2017- 2020 • 324 nodes, 10,368 cores • 1024 nodes, 16,384 cores • 1944 compute, 72 GPU, and • 728 compute, 52 GPU, and • 4-socket AMD Magny-Cours • 2-socket Intel Sandy Bridge 4 large memory nodes. 4 large memory nodes. • QDR InfiniBand • Dual Rail QDR InfiniBand • 2-socket Intel Haswell • 2-socket AMD EPYC 7742, • Fat Tree topology • 3-D Torus topology • FDR InfiniBand HDR100 InfiniBand • MVAPICH2 • 300TB of SSD storage - via • Fat Tree topology • GPU nodes with 4 V100 iSCSI over RDMA (iSER) • MVAPICH2, MVAPICH2-X, GPUs + NVLINK • MVAPICH2 (1.9, 2.1) with 3-D MVAPICH2-GDR • HDR200 Switches, Fat torus support • Leverage SRIOV for Virtual Tree topology with 3:1 Clusters oversubscription • MVAPICH2, MVAPICH2-X, MVAPICH2-GDR Comet: System Characteristics • Total peak flops ~2.76 PF • Hybrid fat-tree topology • Dell primary
    [Show full text]
  • Opencl: a Hands-On Introduction
    OpenCL: A Hands-on Introduction Tim Mattson Alice Koniges Intel Corp. Berkeley Lab/NERSC Simon McIntosh-Smith University of Bristol Acknowledgements: In addition to Tim, Alice and Simon … Tom Deakin (Bristol) and Ben Gaster (Qualcomm) contributed to this content. Agenda Lectures Exercises An Introduction to OpenCL Logging in and running the Vadd program Understanding Host programs Chaining Vadd kernels together Kernel programs The D = A + B + C problem Writing Kernel Programs Matrix Multiplication Lunch Working with the OpenCL memory model Several ways to Optimize matrix multiplication High Performance OpenCL Matrix multiplication optimization contest The OpenCL Zoo Run your OpenCL programs on a variety of systems. Closing Comments Course materials In addition to these slides, C++ API header files, a set of exercises, and solutions, we provide: OpenCL C 1.2 Reference Card OpenCL C++ 1.2 Reference Card These cards will help you keep track of the API as you do the exercises: https://www.khronos.org/files/ opencl-1-2-quick-reference-card.pdf The v1.2 spec is also very readable and recommended to have on-hand: https://www.khronos.org/registry/ cl/specs/opencl-1.2.pdf AN INTRODUCTION TO OPENCL Industry Standards for Programming Heterogeneous Platforms GPUs CPUs Emerging Increasingly general Multiple cores driving Intersection purpose data-parallel performance increases computing Graphics Multi- Heterogeneous APIs and processor Computing Shading programming – Languages e.g. OpenMP OpenCL – Open Computing Language Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors The origins of OpenCL ARM AMD Merged, needed Nokia commonality IBM ATI across products Sony Wrote a rough draft Qualcomm GPU vendor – straw man API Imagination wants to steal TI NVIDIA market share from CPU + many more Khronos Compute CPU vendor – group formed wants to steal Intel market share from GPU Was tired of recoding for many core, GPUs.
    [Show full text]
  • Infiniband and 10-Gigabit Ethernet for Dummies
    The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing Presentation at Mellanox Theatre (SC ‘18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Drivers of Modern HPC Cluster Architectures High Performance Interconnects - Accelerators / Coprocessors InfiniBand high compute density, high Multi-core Processors <1usec latency, 100Gbps Bandwidth> performance/watt SSD, NVMe-SSD, NVRAM >1 TFlop DP on a chip • Multi-core/many-core technologies • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) • Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc. Summit Sunway TaihuLight Sierra K - Computer Network Based Computing Laboratory Mellanox Theatre SC’18 2 Parallel Programming Models Overview P1 P2 P3 P1 P2 P3 P1 P2 P3 Logical shared memory Memory Memory Memory Shared Memory Memory Memory Memory Shared Memory Model Distributed Memory Model Partitioned Global Address Space (PGAS) SHMEM, DSM MPI (Message Passing Interface) Global Arrays, UPC, Chapel, X10, CAF, … • Programming models provide abstract machine models • Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • PGAS models and Hybrid MPI+PGAS models are gradually receiving importance Network Based Computing Laboratory Mellanox Theatre SC’18 3 Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges Application Kernels/Applications Co-Design Middleware Opportunities and Programming Models Challenges MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, across Various OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
    [Show full text]
  • MVAPICH2 2.2Rc2 User Guide
    MVAPICH2 2.2rc2 User Guide MVAPICH TEAM NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY http://mvapich.cse.ohio-state.edu Copyright (c) 2001-2016 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: August 8, 2016 Contents 1 Overview of the MVAPICH Project1 2 How to use this User Guide?1 3 MVAPICH2 2.2rc2 Features2 4 Installation Instructions 13 4.1 Building from a tarball .................................... 13 4.2 Obtaining and Building the Source from SVN repository .................. 13 4.3 Selecting a Process Manager................................. 14 4.3.1 Customizing Commands Used by mpirun rsh..................... 15 4.3.2 Using SLURM..................................... 15 4.3.3 Using SLURM with support for PMI Extensions ................... 15 4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3......... 16 4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3.................. 19 4.6 Configuring a build for Shared-Memory-CH3........................ 20 4.7 Configuring a build for OFA-IB-Nemesis .......................... 20 4.8 Configuring a build for Intel TrueScale (PSM-CH3)..................... 21 4.9 Configuring a build for Intel Omni-Path (PSM2-CH3).................... 22 4.10 Configuring a build for TCP/IP-Nemesis........................... 23 4.11 Configuring a build for TCP/IP-CH3............................. 24 4.12 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary) . 24 4.13 Configuring a build for Shared-Memory-Nemesis...................... 25 5 Basic Usage Instructions 26 5.1 Compile Applications..................................... 26 5.2 Run Applications....................................... 26 5.2.1 Run using mpirun rsh ...............................
    [Show full text]
  • Mapreduce Basics
    Chapter 2 MapReduce Basics The only feasible approach to tackling large-data problems today is to divide and conquer, a fundamental concept in computer science that is introduced very early in typical undergraduate curricula. The basic idea is to partition a large problem into smaller sub-problems. To the extent that the sub-problems are independent [5], they can be tackled in parallel by different workers| threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. Intermediate results from each individual worker are then combined to yield the final output.1 The general principles behind divide-and-conquer algorithms are broadly applicable to a wide range of problems in many different application domains. However, the details of their implementations are varied and complex. For example, the following are just some of the issues that need to be addressed: • How do we break up a large problem into smaller tasks? More specifi- cally, how do we decompose the problem so that the smaller tasks can be executed in parallel? • How do we assign tasks to workers distributed across a potentially large number of machines (while keeping in mind that some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)? • How do we ensure that the workers get the data they need? • How do we coordinate synchronization among the different workers? • How do we share partial results from one worker that is needed by an- other? • How do we accomplish all of the above in the face of software errors and hardware faults? 1We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but they are far from being sufficiently mature to solve real world problems.
    [Show full text]