InfiniBand and 10-Gigabit Ethernet for Dummies


The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
Presentation at OSU Booth (SC '19) by Hari Subramoni, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

Drivers of Modern HPC Cluster Architectures
[Figure: building blocks of modern clusters - High Performance Interconnects such as InfiniBand (<1usec latency, 200Gbps bandwidth); Multi-core Processors (>1 TFlop DP on a chip); Accelerators / Coprocessors (high compute density, high performance/watt); SSD, NVMe-SSD, NVRAM. Pictured systems: Summit, Sierra, Sunway TaihuLight, K-Computer]
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

Parallel Programming Models Overview
[Figure: three models, each with processes P1-P3 - the Shared Memory Model (SHMEM, DSM) over a single shared memory, the Distributed Memory Model (MPI) over per-process memories, and the Partitioned Global Address Space model (PGAS: Global Arrays, UPC, Chapel, X10, CAF, ...) over a logical shared memory]
• Programming models provide abstract machine models
• Models can be mapped on different types of systems - e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually receiving importance (a minimal message-passing sketch follows this list)
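To make the distributed-memory column of the figure concrete, here is a minimal, hypothetical MPI example in C; it is not from the slides, just an illustration of explicit message passing between two ranks.

```c
/* Minimal sketch of the message-passing model: each rank has private
 * memory, so data moves only through explicit send/receive calls.
 * Build with an MPI compiler wrapper such as mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Rank 0 owns the data; rank 1 can only see it via a message. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

In a PGAS model the same exchange would instead be a one-sided put or get into a logically shared address space, which is why the slide groups UPC, Global Arrays, and OpenSHMEM separately from MPI.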
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Figure: software stack with co-design opportunities and challenges across various layers]
• Application Kernels/Applications
• Middleware
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication Library or Runtime for Programming Models: point-to-point communication, collective communication, synchronization and locks, I/O and file systems, energy-awareness, and fault tolerance - all delivered with performance, scalability, and fault-resilience
• Networking Technologies (InfiniBand, 40/100GigE, Aries, and OmniPath), Multi/Many-core Architectures, and Accelerators (GPU and FPGA)

MPI+X Programming Model: Broad Challenges at Exascale
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
– Scalable job start-up
• Scalable collective communication
– Offload
– Non-blocking
– Topology-aware
• Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and FPGAs
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + UPC++, MPI + OpenSHMEM, CAF, ...) - a small MPI+OpenMP sketch follows this list
• Virtualization
• Energy-awareness
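The "X" in MPI+X is often OpenMP. The following hypothetical sketch (not from the slides) shows the division of labor the list above implies: OpenMP threads handle intra-node parallelism while MPI handles inter-node communication. It assumes an MPI library built with MPI_THREAD_FUNNELED (or higher) support.

```c
/* Hybrid MPI+OpenMP sketch: threads within a rank, messages between
 * ranks. Compile with, e.g., mpicc -fopenmp. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, local_sum = 0, global_sum = 0;

    /* Funneled threading: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Intra-node parallelism: OpenMP threads reduce into local_sum. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000; i++)
        local_sum += i;

    /* Inter-node parallelism: MPI combines the per-rank results. */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", global_sum);
    MPI_Finalize();
    return 0;
}
```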
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 614,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '18 ranking):
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 15th, 570,020 cores (Nurion) in South Korea, and many others
– Available with software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
– Partner in the TACC Frontera system
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
[Figure: timeline of MVAPICH/MVAPICH2 releases - from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, MVAPICH2-MIC, and OSU INAM - plotted against cumulative downloads (0 to 600,000)]

Architecture of MVAPICH2 Software Family
• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD*, UD, DC, UMAD, ODP; modern features: SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM; modern features: NVLink, CAPI*, Optane* (* upcoming)

MVAPICH2 Software Family (requirements and the matching library):
• MPI with IB, iWARP, Omni-Path, and RoCE - MVAPICH2
• Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE - MVAPICH2-X
• MPI with IB, RoCE & GPU and support for deep learning - MVAPICH2-GDR
• HPC cloud with MPI & IB - MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP and RoCE - MVAPICH2-EA
• MPI energy monitoring tool - OEMT
• InfiniBand network analysis and monitoring - OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance - OMB

MVAPICH2 Distributions
• MVAPICH2 – Basic MPI support for IB, Omni-Path, iWARP and RoCE
• MVAPICH2-X – Advanced MPI features and support for INAM – MPI, PGAS and Hybrid MPI+PGAS support for IB
• MVAPICH2-Virt – Optimized for HPC clouds with IB and SR-IOV virtualization – Support for OpenStack, Docker, and Singularity
• OSU Micro-Benchmarks (OMB) – MPI (including CUDA-aware MPI), OpenSHMEM and UPC
• OSU INAM – InfiniBand Network Analysis and Monitoring Tool

Startup Performance on TACC Frontera
[Chart: MPI_Init time in milliseconds versus number of processes (56 to 57,344) on Frontera, comparing Intel MPI 2019 (4.5s at full scale) with MVAPICH2 (3.9s)]
• MPI_Init takes 3.9 seconds on 57,344 processes on 1,024 nodes
• Intel MPI 2019's MPI_Init takes 195 seconds on 229,376 processes on 4,096 nodes, while MVAPICH2 takes 31 seconds
• All numbers reported with 56 processes per node
• New designs available in MVAPICH2-2.3.2

One-way Latency: MPI over IB with MVAPICH2
[Charts: small-message and large-message one-way latency for six adapters; small-message latencies cluster between about 1.0 and 1.19 usec. A ping-pong sketch of how such numbers are measured follows the testbed list below.]
Testbed configurations:
• TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
• ConnectIB-DualFDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
• ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
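Latency numbers like these are typically measured with a ping-pong loop in the style of the OSU micro-benchmarks. The following is an illustrative sketch under assumed iteration counts, not the actual OMB code: one-way latency is half the averaged round-trip time of a small message between two ranks.

```c
/* Ping-pong latency sketch: rank 0 sends, rank 1 echoes, and the
 * one-way latency is half of the averaged round-trip time. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000

int main(int argc, char **argv)
{
    int rank;
    char buf[8] = {0};      /* small message, as in the latency plot */
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency = round trip / 2, in usec */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));
    MPI_Finalize();
    return 0;
}
```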
Bandwidth: MPI over IB with MVAPICH2
[Charts: unidirectional and bidirectional bandwidth for the same six adapters; unidirectional bandwidth peaks at 24,532 MB/s and bidirectional at 48,027 MB/s with ConnectX-6 HDR. Testbed configurations are identical to those listed for the latency results above.]
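Bandwidth is measured differently from latency: a window of non-blocking sends keeps many messages in flight so the link, rather than per-message latency, limits the rate. The sketch below is illustrative only (message size, window, and iteration counts are assumptions, not the OMB settings).

```c
/* Unidirectional bandwidth sketch: rank 0 posts a window of
 * non-blocking sends, rank 1 a window of receives, then a short ack
 * closes each iteration. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)   /* 1 MB messages */
#define WINDOW   64
#define ITERS    20

int main(int argc, char **argv)
{
    int rank;
    char *buf = calloc(1, MSG_SIZE);   /* sketch: allocation assumed OK */
    char ack;
    MPI_Request reqs[WINDOW];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double mb = (double)MSG_SIZE * WINDOW * ITERS / 1e6;
        printf("bandwidth: %.0f MB/s\n", mb / (t1 - t0));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```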
Recommended publications
  • Fact Sheet: The Next Generation of Computing Has Arrived
    News Fact Sheet
    The Next Generation of Computing Has Arrived: Performance to Power Amazing Experiences
    The new 5th Generation Intel® Core™ processor family is Intel's latest wave of 14nm processors, delivering improved system and graphics performance, more natural and immersive user experiences, and enabling longer battery life compared to previous generations. The release of the 5th Generation Intel Core technology includes 14 new processors for consumers and businesses, including 10 new 15W processors with Intel® HD Graphics and four new 28W products with Intel® Iris™ Graphics. The 5th Generation Intel Core processor is purpose-built for the next generation of compute devices offering a thinner, lighter and more efficient experience across diverse form factors, including traditional notebooks, 2 in 1s, Ultrabooks™, Chromebooks, all-in-one desktop PCs and mini PCs. With the 5th Generation Intel Core processor availability, the "Broadwell" microarchitecture is expected to be the fastest mobile transition in company history to offer consumers a broad selection and availability of devices.
    Intel also started shipping its next generation 14nm processor for tablets, codenamed "Cherry Trail", to device manufacturers. The new system on a chip (SoC) offers 64-bit computing, improved graphics with Intel® Generation 8-LP graphics, great performance and battery life for mainstream tablets. The platform offers world-class modem capabilities with LTE-Advanced on Intel® XMM™ 726x platform, which supports Cat-6 speeds and carrier aggregation. Customers will introduce new products based on this platform starting in the first half of this year.
    Key benefits of the 5th Generation Intel Core family and the next-generation 14nm processor for tablets include: Powerful Performance...
  • Intel CEO Remarks: Pat Gelsinger, Q2'21 Earnings Webcast, July 22
    Intel CEO Remarks: Pat Gelsinger, Q2'21 Earnings Webcast, July 22, 2021
    Good afternoon, everyone. Thanks for joining our second-quarter earnings call. It's a thrilling time for both the semiconductor industry and for Intel. We're seeing unprecedented demand as the digitization of everything is accelerated by the superpowers of AI, pervasive connectivity, cloud-to-edge infrastructure and increasingly ubiquitous compute. Our depth and breadth of software, silicon and platforms, and packaging and process, combined with our at-scale manufacturing, uniquely positions Intel to capitalize on this vast growth opportunity.
    Our Q2 results, which exceeded our top and bottom line expectations, reflect the strength of the industry, the demand for our products, as well as the superb execution of our factory network. As I've said before, we are only in the early innings of what is likely to be a decade of sustained growth across the industry. Our momentum is building as once again we beat expectations and raise our full-year revenue and EPS guidance.
    Since laying out our IDM 2.0 strategy in March, we feel increasingly confident that we're moving the company forward toward our goal of delivering leadership products in every category in which we compete. While we have work to do, we are making strides to renew our execution machine: 7nm is progressing very well. We've launched new innovative products, established Intel Foundry Services, and made operational and organizational changes to lay the foundation needed to win in the next phase of our company's great history. Here at Intel, we're proud of our past, pragmatic about the work ahead, but, most importantly, confident in our future.
  • MVAPICH2 2.2 User Guide
    MVAPICH2 2.2 User Guide
    MVAPICH Team, Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
    http://mvapich.cse.ohio-state.edu
    Copyright (c) 2001-2016 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: October 19, 2017
    Contents:
    1 Overview of the MVAPICH Project
    2 How to use this User Guide?
    3 MVAPICH2 2.2 Features
    4 Installation Instructions
    4.1 Building from a tarball
    4.2 Obtaining and Building the Source from SVN repository
    4.3 Selecting a Process Manager
    4.3.1 Customizing Commands Used by mpirun_rsh
    4.3.2 Using SLURM
    4.3.3 Using SLURM with support for PMI Extensions
    4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3
    4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3
    4.6 Configuring a build for Shared-Memory-CH3
    4.7 Configuring a build for OFA-IB-Nemesis
    4.8 Configuring a build for Intel TrueScale (PSM-CH3)
    4.9 Configuring a build for Intel Omni-Path (PSM2-CH3)
    4.10 Configuring a build for TCP/IP-Nemesis
    4.11 Configuring a build for TCP/IP-CH3
    4.12 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)
    4.13 Configuring a build for Shared-Memory-Nemesis
    5 Basic Usage Instructions
    5.1 Compile Applications
    5.2 Run Applications
    5.2.1 Run using mpirun_rsh
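As a hedged illustration of the compile-and-run flow that sections 5.1-5.2 of the guide describe, here is a complete program plus typical commands. The host names are hypothetical, and mpicc and mpirun_rsh are assumed to be MVAPICH2's wrappers on $PATH.

```c
/* Minimal MPI program to go with the guide's usage sections. Typical
 * build and launch (host names are placeholders):
 *
 *   mpicc -o hello hello.c
 *   mpirun_rsh -np 4 node01 node01 node02 node02 ./hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```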
  • Scalable and High Performance MPI Design for Very Large InfiniBand Clusters
    SCALABLE AND HIGH-PERFORMANCE MPI DESIGN FOR VERY LARGE INFINIBAND CLUSTERS
    DISSERTATION - Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
    By Sayantan Sur, B. Tech, The Ohio State University, 2007
    Dissertation Committee: Prof. D. K. Panda (Adviser), Prof. P. Sadayappan, Prof. S. Parthasarathy
    Graduate Program in Computer Science and Engineering
    © Copyright by Sayantan Sur, 2007
    ABSTRACT
    In the past decade, rapid advances have taken place in the field of computer and network design enabling us to connect thousands of computers together to form high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model to write applications for these clusters. There are a vast array of scientific applications which use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling higher and higher. Thus, in order to enable the best performance for these scientific applications, it is very critical for the design of the MPI libraries to be extremely scalable and high-performance. InfiniBand is a cluster interconnect which is based on open standards and gaining rapid acceptance. This dissertation presents novel designs based on the new features offered by InfiniBand, in order to design scalable and high-performance MPI libraries for large-scale clusters with tens-of-thousands of nodes. Methods developed in this dissertation have been applied towards reduction in overall resource consumption, increased overlap of computation and communication, improved performance of collective operations and finally designing application-level benchmarks to make efficient use of modern networking technology.
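One theme of the abstract, increased overlap of computation and communication, can be sketched with standard non-blocking MPI. This is an illustrative fragment under assumed buffer names and sizes, not code from the dissertation.

```c
/* Overlap sketch: post a non-blocking exchange, compute on data that
 * is not in flight, then wait. `interior` is assumed to hold 2*N
 * doubles: the first N are being sent, the second N are safe to use. */
#include <mpi.h>

#define N 4096

void halo_exchange_and_compute(double *halo, double *interior,
                               int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post the exchange first so the network (e.g., InfiniBand RDMA)
     * can make progress while we compute. */
    MPI_Irecv(halo,     N, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(interior, N, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* Work that does not touch the buffers in flight. */
    for (int i = N; i < 2 * N; i++)
        interior[i] *= 0.5;

    /* Only block once the overlapping work is done. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```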
  • MVAPICH2 2.3 User Guide
    MVAPICH2 2.3 User Guide
    MVAPICH Team, Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
    http://mvapich.cse.ohio-state.edu
    Copyright (c) 2001-2018 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved. Last revised: February 19, 2018
    Contents:
    1 Overview of the MVAPICH Project
    2 How to use this User Guide?
    3 MVAPICH2 2.3 Features
    4 Installation Instructions
    4.1 Building from a tarball
    4.2 Obtaining and Building the Source from SVN repository
    4.3 Selecting a Process Manager
    4.3.1 Customizing Commands Used by mpirun_rsh
    4.3.2 Using SLURM
    4.3.3 Using SLURM with support for PMI Extensions
    4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3
    4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3
    4.6 Configuring a build to support running jobs across multiple InfiniBand subnets
    4.7 Configuring a build for Shared-Memory-CH3
    4.8 Configuring a build for OFA-IB-Nemesis
    4.9 Configuring a build for Intel TrueScale (PSM-CH3)
    4.10 Configuring a build for Intel Omni-Path (PSM2-CH3)
    4.11 Configuring a build for TCP/IP-Nemesis
    4.12 Configuring a build for TCP/IP-CH3
    4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)
    4.14 Configuring a build for Shared-Memory-Nemesis
    4.15 Configuration and Installation with Singularity
  • The Component Architecture of Open MPI: Enabling Third-Party Collective Algorithms
    THE COMPONENT ARCHITECTURE OF OPEN MPI: ENABLING THIRD-PARTY COLLECTIVE ALGORITHMS
    Jeffrey M. Squyres and Andrew Lumsdaine
    Open Systems Laboratory, Indiana University, Bloomington, Indiana, USA
    [email protected] [email protected]
    Abstract: As large-scale clusters become more distributed and heterogeneous, significant research interest has emerged in optimizing MPI collective operations because of the performance gains that can be realized. However, researchers wishing to develop new algorithms for MPI collective operations are typically faced with significant design, implementation, and logistical challenges. To address a number of needs in the MPI research community, Open MPI has been developed, a new MPI-2 implementation centered around a lightweight component architecture that provides a set of component frameworks for realizing collective algorithms, point-to-point communication, and other aspects of MPI implementations. In this paper, we focus on the collective algorithm component framework. The "coll" framework provides tools for researchers to easily design, implement, and experiment with new collective algorithms in the context of a production-quality MPI. Performance results with basic collective operations demonstrate that the component architecture of Open MPI does not introduce any performance penalty.
    Keywords: MPI implementation, Parallel computing, Component architecture, Collective algorithms, High performance
    1. Introduction
    Although the performance of the MPI collective operations [6, 17] can be a large factor in the overall run-time of a parallel application, their optimization has not necessarily been a focus in some MPI implementations until recently.
    (This work was supported by a grant from the Lilly Endowment and by National Science Foundation grant 0116050.)
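The core idea - collective algorithms packaged as interchangeable components behind a uniform interface, selectable at run time - can be sketched in plain C. The names below (coll_component_t, linear_bcast, coll_select) are hypothetical and deliberately simplified; this is not Open MPI's actual "coll" framework API.

```c
/* Simplified sketch of a component framework for collectives: each
 * algorithm is an entry in a table of function pointers, so a
 * third-party implementation can be dropped in without touching the
 * MPI core. */
#include <mpi.h>
#include <string.h>

typedef struct coll_component {
    const char *name;
    /* Each collective is a replaceable entry point. */
    int (*bcast)(void *buf, int count, MPI_Datatype dtype,
                 int root, MPI_Comm comm);
} coll_component_t;

/* A trivial "component": linear broadcast built on point-to-point. */
static int linear_bcast(void *buf, int count, MPI_Datatype dtype,
                        int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == root) {
        for (int p = 0; p < size; p++)
            if (p != root)
                MPI_Send(buf, count, dtype, p, 0, comm);
    } else {
        MPI_Recv(buf, count, dtype, root, 0, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}

static coll_component_t components[] = {
    { "linear", linear_bcast },
    /* a third-party tree or pipeline algorithm would register here */
};

/* Select a component by name, defaulting to the first one. */
coll_component_t *coll_select(const char *name)
{
    for (size_t i = 0; i < sizeof components / sizeof *components; i++)
        if (strcmp(components[i].name, name) == 0)
            return &components[i];
    return &components[0];
}
```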
  • Automatic Handling of Global Variables for Multi-Threaded MPI Programs
    Automatic Handling of Global Variables for Multi-threaded MPI Programs
    Gengbin Zheng, Stas Negara, Celso L. Mendes, Laxmikant V. Kalé (Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; gzheng, snegara2, cmendes, [email protected]) and Eduardo R. Rodrigues (Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil; [email protected])
    Abstract: Hybrid programming models, such as MPI combined with threads, are one of the most efficient ways to write parallel applications for current machines comprising multi-socket/multi-core nodes and an interconnection network. Global variables in legacy MPI applications, however, present a challenge because they may be accessed by multiple MPI threads simultaneously. Thus, transforming legacy MPI applications to be thread-safe in order to exploit multi-core architectures requires proper handling of global variables. In this paper, we present three approaches to eliminate these global variables to ensure thread-safety for an MPI program. These approaches include: (a) a compiler-based refactoring technique, using a Photran-based tool as an example, which automates the source-to-source transformation for programs...
    ...necessary to privatize global and static variables to ensure the thread-safety of an application. In high-performance computing, one way to exploit multi-core platforms is to adopt hybrid programming models that combine MPI with threads, where different programming models are used to address issues such as multiple threads per core, decreasing amount of memory per core, load imbalance, etc. When porting legacy MPI applications to hybrid models that involve multiple threads, thread-safety of the applications needs to be addressed again. Global variables cause no problem with traditional MPI implementations, since each process image contains a separate instance of the variable.
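For a sense of what "privatizing" a global means in practice, here is a hand-written sketch of the transformation the paper's approaches automate, using C11 thread-local storage. The variable and function names are hypothetical.

```c
/* Before: one shared instance of a global -- a data race if several
 * threads, each driving an MPI rank, update it concurrently:
 *
 *   int iteration_count = 0;
 *
 * After privatization: one instance per thread, so legacy code that
 * reads and writes the global keeps working unchanged. */
_Thread_local int iteration_count = 0;

void step(void)
{
    /* Safe: each thread increments its own copy. */
    iteration_count++;
}
```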
  • Intel 2019 Year Book
    YEARBOOK 2019: POWERING THE FUTURE
    Our 2019 yearbook invites you to look back and reflect on a memorable year for Intel. 2019 kicked off with the announcement of our new chief executive, Bob Swan. It was followed by a stream of notable news: product announcements, technology breakthroughs, new customers and partnerships, and important moves to evolve Intel's culture as the company entered its sixth decade.
    It's a privilege to tell the Intel story in all its complexity and humanity. Looking through these pages, the breadth and depth of what we've achieved in 12 months is substantial, as is the strong foundation we've built for even greater impact in the future.
    I hope you enjoy this colorful look at what's possible when more than 100,000 individuals from every corner of the globe unite to change the world – through technologies that make a positive difference to our customers, to society, and to people's lives. — Claire Dixon, Chief Communications Officer
    TABLE OF CONTENTS: New CEO. Evolving culture. Expanded ambitions. · More data. More storage. More processing. · Innovation for the PC user experience · Self-driving cars hit the road · AI unlocks the power of data · Helping customers push boundaries · More supply to meet strong demand · Next-gen hardware and software to unlock it · Tech's future: Inventing and investing · Reinforcing the nature of Moore's Law · Building for the smarter future
    NEW CEO. EVOLVING CULTURE. EXPANDED AMBITIONS. 2019 was an important year in Intel's transformation, with a new chief executive officer, ambitious business priorities, an aspirational culture evolution, and a farewell to Focal.
  • Exascale Computing Project -- Software
    Exascale Computing Project -- Software
    Paul Messina, ECP Director; Stephen Lee, ECP Deputy Director
    ASCAC Meeting, Crystal City Marriott, Arlington, VA, April 19, 2017
    www.ExascaleProject.org
    ECP scope and goals:
    • Develop applications to tackle a broad spectrum of mission critical problems of unprecedented complexity
    • Partner with vendors to develop computer architectures that support exascale applications
    • Support national security
    • Develop a software stack that is both exascale-capable and usable on industrial & academic scale systems, in collaboration with vendors
    • Train a next-generation workforce of computational scientists, engineers, and computer scientists
    • Contribute to the economic competitiveness of the nation
    ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale:
    [Diagram: four integrated elements - Application Development (science and mission applications), Software Technology (scalable and productive software stack: programming models, development environment and runtimes, tools, math libraries and frameworks, correctness, visualization, data analysis, system software for resource management, threading, scheduling, monitoring and control, memory and burst buffer, data management, I/O and file system, node OS and runtimes, resilience, workflows, hardware interface), Hardware Technology (hardware technology elements), and Exascale Systems (integrated exascale supercomputers)]
    ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development.
  • DARC: Design and Evaluation of an I/O Controller for Data Protection
    DARC: Design and Evaluation of an I/O Controller for Data Protection
    Markos Fountoulakis*, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas*
    Institute of Computer Science (ICS), Foundation for Research and Technology - Hellas (FORTH), N. Plastira 100, Vassilika Vouton, Heraklion, GR-70013, Greece
    {mfundul,maraz,flouris,bilas}@ics.forth.gr
    ABSTRACT
    Lately, with increasing disk capacities, there is increased concern about protection from data errors, beyond masking of device failures. In this paper, we present a prototype I/O stack for storage controllers that encompasses two data protection features: (a) persistent checksums to protect data at-rest from silent errors and (b) block-level versioning to allow protection from user errors. Although these techniques have been previously used either at the device level (checksums) or at the host (versioning), in this work we implement these features in the storage controller, which allows us to use any type of storage devices as well as any type of host I/O stack. The main challenge in our approach is to deal with persistent metadata in the controller I/O path.
    Keywords: virtualization, versioning, data protection, error detection, I/O performance
    1. INTRODUCTION
    In the last three decades there has been a dramatic increase of storage density in magnetic and optical media, which has led to significantly lower cost per capacity unit. Today, most I/O stacks already encompass RAID features for protecting from device failures. However, recent studies of large magnetic disk populations indicate that there is also a significant failure rate of hardware and software components in the I/O path, many of which are silent until...
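As a rough illustration of feature (a), persistent checksums, the C sketch below shows the read-path check: a checksum stored with each block on write is recomputed on read to catch silent corruption at rest. The checksum choice and function names are assumptions for illustration, not DARC's actual design.

```c
/* Per-block checksum sketch: compute on write, store alongside the
 * block, recompute and compare on read. */
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Simple Adler-32-style checksum; a real controller would likely use
 * CRC32C or similar, often with hardware support. */
static uint32_t block_checksum(const uint8_t *block, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + block[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

/* Read path: detect silent errors before returning data to the host.
 * Returns nonzero if the block matches its stored checksum. */
int verify_block(const uint8_t *block, uint32_t stored_csum)
{
    return block_checksum(block, BLOCK_SIZE) == stored_csum;
}
```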
  • MVAPICH USER GROUP MEETING August 26, 2020
    APPLICATION AND MICROBENCHMARK STUDY USING MVAPICH2 AND MVAPICH2-GDR ON SDSC COMET AND EXPANSE
    MVAPICH User Group Meeting, August 26, 2020
    Mahidhar Tatineni, SDSC (San Diego Supercomputer Center), NSF Award 1928224
    Overview
    • History of InfiniBand based clusters at SDSC
    • Hardware summaries for Comet and Expanse
    • Application and software library stack on SDSC systems
    • Containerized approach – Singularity on SDSC systems
    • Benchmark results on Comet and the Expanse development system - OSU Benchmarks, NEURON, RAxML, TensorFlow, QE, and BEAST are presented
    InfiniBand and MVAPICH2 on SDSC Systems
    • Trestles (NSF), 2011-2014: 324 nodes, 10,368 cores; 4-socket AMD Magny-Cours; QDR InfiniBand; fat tree topology; MVAPICH2
    • Gordon (NSF), 2012-2017, continued as GordonS (Simons Foundation), 2017-2020: 1024 nodes, 16,384 cores; 2-socket Intel Sandy Bridge; dual-rail QDR InfiniBand; 3-D torus topology; 300TB of SSD storage via iSCSI over RDMA (iSER); MVAPICH2 (1.9, 2.1) with 3-D torus support
    • Comet (NSF), 2015-current: 1944 compute, 72 GPU, and 4 large memory nodes; 2-socket Intel Haswell; FDR InfiniBand; fat tree topology; MVAPICH2, MVAPICH2-X, MVAPICH2-GDR; leverages SR-IOV for virtual clusters
    • Expanse (NSF), production Fall 2020: 728 compute, 52 GPU, and 4 large memory nodes; 2-socket AMD EPYC 7742; HDR100 InfiniBand; GPU nodes with 4 V100 GPUs + NVLink; HDR200 switches, fat tree topology with 3:1 oversubscription; MVAPICH2, MVAPICH2-X, MVAPICH2-GDR
    Comet: System Characteristics
    • Total peak flops ~2.76 PF
    • Hybrid fat-tree topology
    • Dell primary...
  • InfiniBand and 10-Gigabit Ethernet for Dummies
    The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
    Presentation at Mellanox Theatre (SC '18) by Dhabaleswar K. (DK) Panda, The Ohio State University
    E-mail: [email protected]
    http://www.cse.ohio-state.edu/~panda
    Drivers of Modern HPC Cluster Architectures
    [Figure: High Performance Interconnects such as InfiniBand (<1usec latency, 100Gbps bandwidth); Multi-core Processors (>1 TFlop DP on a chip); Accelerators / Coprocessors (high compute density, high performance/watt); SSD, NVMe-SSD, NVRAM. Pictured systems: Summit, Sunway TaihuLight, Sierra, K-Computer]
    • Multi-core/many-core technologies
    • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
    • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
    • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
    • Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
    Parallel Programming Models Overview
    [Figure: Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI), and Partitioned Global Address Space model (PGAS: Global Arrays, UPC, Chapel, X10, CAF, ...)]
    • Programming models provide abstract machine models
    • Models can be mapped on different types of systems - e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
    • PGAS models and Hybrid MPI+PGAS models are gradually receiving importance
    Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
    • Application Kernels/Applications
    • Middleware - co-design opportunities and challenges across various layers
    • Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.