Networks for High-Performance Computing


Louis Jenkins
University of Rochester

Abstract— In high-performance computing (HPC), low-latency and high-bandwidth networks are required to maintain high performance in a distributed computing environment. According to the Top500, the two most widely used networks in the top 10 supercomputers are Cray's Aries network interconnect and Mellanox's Infiniband network interconnect, which take radically different approaches to their design. This survey explores the design of each network interconnect, the ways they provide reliability and fault tolerance, and their support for remote direct memory access (RDMA), and analyzes the performance of an Infiniband cluster and a Cray-XC50 supercomputer. Finally, the near future of high-performance interconnects, obtained from an interview with Steve Scott, Chief Technology Officer (CTO) of Cray Inc., a Hewlett Packard Enterprise company, is also examined.

I. INTRODUCTION

To address the ever increasing need for computational power, computations have become more and more decentralized over time. The technological advancements from uni-processor to multi-processor systems, from single-socket to multi-socket NUMA domains, and from shared memory to distributed memory are all testaments to this fact. Decentralizing computations has many advantages, such as increasing the number of processing elements (PEs) available for balancing the computational workload, but it does not come without its challenges. The lower down the memory hierarchy you go, the higher the latency, and sometimes the difference between one part of the hierarchy and another is as much as an order of magnitude. While accesses to memory that is local to the PE can vary considerably, as in the case of the L1, L2, and L3 caches and main memory (DRAM), they usually fall within the span of nanoseconds, whereas accesses to memory that is remote to the PE can take microseconds to milliseconds. Computations that span multiple PEs require communication over some type of network, and in the context of high-performance computing, the larger the bandwidth and the smaller the latency, the better.

Supercomputers and compute clusters primarily tend to use one of two high-performance network interconnects: Infiniband[1], [2] from Mellanox, and Aries[3], successor of Gemini[4], [5], from Cray. Commodity clusters also require high-performance networks, although Infiniband may be excessive for their computational needs, and so they may opt for RDMA over Converged Ethernet (RoCE)[6], which is common in big data centers. These high-performance interconnects require their own unique design and implementation that minimizes latency, maximizes bandwidth, and ensures reliability of data sent between PEs on the network. The interconnects implement Remote Direct Memory Access (RDMA), which allows memory on a remote PE to be accessed directly through the Network Interface Controller (NIC) without the intervention of the CPU, or even the GPU. RDMA allows high-bandwidth and low-latency remote memory operations, both incoming and outgoing, to avoid saturating the CPU, freeing it up to perform more computation. RDMA for GPUs, such as NVIDIA's GPUDirect, allows not just host-to-remote-device but even device-to-remote-device transfers to be performed without any unnecessary copying to and from the CPU, whereas normally a device-to-device transfer would require a device-to-host, host-to-host, and then a host-to-device copy. As well, RDMA can be used to atomically update remote memory such that the update appears to happen all-at-once or not-at-all, which opens up many potential applications.
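To make the one-sided RDMA operations described above more concrete, the following is a minimal sketch of posting an RDMA write through libibverbs, the user-space verbs library commonly used to program Infiniband and RoCE NICs. It is illustrative only and is not taken from the survey: it assumes a reliable-connected queue pair has already been established and that the peer's buffer address and remote key (rkey) were exchanged out of band, and the function name and parameters are invented for the example.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a one-sided RDMA write: the local NIC writes 'len' bytes from
 * 'local_buf' (which must lie inside 'local_mr') into the peer's memory
 * at 'remote_addr', with no CPU involvement on the remote side. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;    /* one-sided write */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;    /* generate a local completion */
    wr.wr.rdma.remote_addr = remote_addr; /* peer virtual address */
    wr.wr.rdma.rkey        = rkey;        /* peer memory region key */

    /* Completion is later reaped from the send CQ with ibv_poll_cq(). */
    return ibv_post_send(qp, &wr, &bad_wr);
}

Because the transfer is handled entirely by the NIC, the initiating CPU only posts the work request and later polls for its completion, which is exactly what frees it for other computation.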
In section 2, the properties of a high-performance network are explored in detail; in section 3, the design and implementation of Infiniband is examined; and in section 4, the design and implementation of Gemini and its successor Aries are looked at in detail. In section 5, performance benchmark results from the Ohio State University (OSU) benchmarks are compared between a Cray-XC50 supercomputer and an Infiniband compute cluster, and in section 6, we take a look at the near future of high-performance computing, courtesy of Steve Scott, CTO of Cray Inc., a Hewlett Packard Enterprise company. Finally, section 7 concludes.

II. PROPERTIES OF HIGH-PERFORMANCE NETWORKS

There are many characteristics and properties that make networks suitable for high-performance computing. The structure of the network, that is, its topology, as well as the properties relating to fault tolerance and reliability, and the types of communication supported between PEs in the same network, are all important. A supercomputer or compute cluster lacking in any of these will be lacking as a whole, as each individual part plays a role in extracting as much performance as possible from the underlying hardware.

A. Network Topology

The topology of a network describes the overall makeup and layout of switches, nodes, and links, and the ways they are connected. This layout is crucial for maximizing bandwidth, minimizing latency, ensuring reliability, and enabling fault tolerance in a network. There are many types of topological structures in a network, but only those most commonly used in the top supercomputers and compute clusters will be considered; all others are outside the scope of this work.

The Fat Tree is a tree data structure where branches are larger, or "fatter", the farther up the tree they are. The 'thickness' of a branch is an analog for the amount of bandwidth available to that node; the fatter the branches, the larger the bandwidth. The fat tree, in the context of networking, has all non-leaf nodes in the tree as switches, with leaf nodes being PEs, as demonstrated in Figure 1[7]. The Fat Tree is extremely cost effective, as adding a new PE to the network can be as simple as adding a logarithmic number of switches in the worst case. Communication from one PE to another is handled by single source shortest path (SSSP) routing, which sends packets up and down the tree from the source PE to the target PE.

Fig. 1. Fat Tree Network Topology

The Torus topology, with a 2D Torus network shown in Figure 2[7], can be generalized to N dimensions, where the PEs are placed on a coordinate system. For example, in a 3D Torus network topology, each PE connects to six different PEs: one in each of the ±x, ±y, and ±z directions. The Torus offers a lot of diversity in routing, as there is a wide variety of ways to send data from a source PE to a target PE, and hence increased bandwidth, but this comes at the downside of a higher cost and a much more difficult setup.

Fig. 2. 2D Torus Network Topology
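As an illustration of the coordinate arithmetic behind the Torus topology described above, the following sketch enumerates the six neighbors of a PE in a 3D torus with wrap-around links. The dimensions, the rank mapping, and the function names are assumptions made for the example and do not come from any particular interconnect's software stack.

#include <stdio.h>

#define DX 4  /* illustrative torus dimensions */
#define DY 4
#define DZ 4

/* Map (x, y, z) coordinates to a linear PE rank (lexicographic order). */
static int rank_of(int x, int y, int z) {
    return (z * DY + y) * DX + x;
}

/* Fill neighbors[6] with the ranks reachable in the +/-x, +/-y, +/-z directions. */
static void torus_neighbors(int x, int y, int z, int neighbors[6]) {
    neighbors[0] = rank_of((x + 1) % DX, y, z);       /* +x */
    neighbors[1] = rank_of((x - 1 + DX) % DX, y, z);  /* -x, wraps around */
    neighbors[2] = rank_of(x, (y + 1) % DY, z);       /* +y */
    neighbors[3] = rank_of(x, (y - 1 + DY) % DY, z);  /* -y, wraps around */
    neighbors[4] = rank_of(x, y, (z + 1) % DZ);       /* +z */
    neighbors[5] = rank_of(x, y, (z - 1 + DZ) % DZ);  /* -z, wraps around */
}

int main(void) {
    int n[6];
    torus_neighbors(0, 0, 0, n);
    printf("Neighbors of PE 0: %d %d %d %d %d %d\n",
           n[0], n[1], n[2], n[3], n[4], n[5]);
    return 0;
}

Because every index is taken modulo the dimension size, a PE on a "face" of the torus wraps around to the opposite face, which is what distinguishes a torus from a plain mesh and gives it its routing diversity.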
The Dragonfly topology is a hybrid, in that PEs are grouped together, where at least one PE in a group is linked to some other PE in some other group. The topology used inside of a group is implementation dependent, but usually a Fat Tree, Torus, or Mesh topology is used, as shown in Figure 3.

Fig. 3. Dragonfly Network Topology

B. Fault Tolerance and Reliability

In a network, reliability is an important feature for ensuring consistency, and when done well, it can serve to extract even greater performance. A packet in an unreliable network may be lost, may have to be retransmitted at the cost of additional time, or, at absolute worst, could result in the acceptance of corrupted data that puts the application in jeopardy. In striving to balance reliability and performance, each high-performance network must make some rather unique design decisions and trade-offs that differ from how these concerns are handled in a run-of-the-mill network. One way that packets are protected is via a Cyclic Redundancy Check (CRC), which adds a fixed-width checksum to the original payload, where the checksum is calculated in a way that is sensitive to the position and value of each individual bit. Error Correcting Codes (ECC) are also used on the link itself to add an additional layer of reliability and recoverability on an error.

When a PE goes down in a network, network traffic must be routed around it, as it may no longer be able to properly receive, send, or even forward packets that route through it.

TABLE I
ATOMIC MEMORY OPERATORS
(Entries give supported operand widths in bits; N/A indicates the operation is not supported.)

Atomic Operation   Gemini   uGNI    RoCE   Infiniband
Add (Integer)      64       32,64   N/A    N/A
Add (Float)        N/A      32      N/A    N/A
Fetch-Add          64       32,64   64     64
And                64       32,64   N/A    N/A
Fetch-And          64       32,64   64     64
Or                 64       32,64   N/A    N/A
Fetch-Or           64       32,64   64     64
Min (Integer)      64       32,64   N/A    N/A
Min (Float)        N/A      32      N/A    N/A
Max (Integer)      64       32,64   N/A    N/A
Max (Float)        N/A      32      N/A    N/A
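As a concrete instance of the remote atomics summarized in Table I, the following is a hedged sketch of posting a 64-bit Fetch-Add through libibverbs, one of the operations Infiniband supports in hardware. It assumes an established reliable-connected queue pair, a remote region registered with IBV_ACCESS_REMOTE_ATOMIC, and an out-of-band exchange of the remote address and rkey; the function name and parameters are invented for the example.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a 64-bit remote Fetch-Add: the NIC atomically adds 'add_value' to the
 * 8-byte word at 'remote_addr' on the peer and returns the previous value
 * into 'result_buf' (which must lie inside 'result_mr'). */
int post_fetch_add64(struct ibv_qp *qp, struct ibv_mr *result_mr,
                     uint64_t *result_buf, uint64_t remote_addr,
                     uint32_t rkey, uint64_t add_value)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result_buf,
        .length = sizeof(uint64_t),
        .lkey   = result_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr; /* must be 8-byte aligned */
    wr.wr.atomic.compare_add = add_value;   /* the addend for fetch-add */
    wr.wr.atomic.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}

To the rest of the network, the update appears all-or-nothing, matching the atomicity property of RDMA described in the introduction.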