Present and Future Supercomputer Architectures

Present and Future Supercomputer Architectures
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
Hong Kong, China, 13-15 December 2004

A Growth Factor of a Billion in Performance in a Career

[Chart: peak performance from 1 Flop/s to 1 PFlop/s across the scalar, super-scalar, vector, and parallel eras, marking EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and IBM BG/L, against a trend of 2x transistors per chip every 1.5 years.]

  1941                      1 Flop/s
  1945                    100 Flop/s
  1949                  1,000 Flop/s  (1 KFlop/s)
  1951                 10,000 Flop/s
  1961                100,000 Flop/s
  1964              1,000,000 Flop/s  (1 MFlop/s)
  1968             10,000,000 Flop/s
  1975            100,000,000 Flop/s
  1987          1,000,000,000 Flop/s  (1 GFlop/s)
  1992         10,000,000,000 Flop/s
  1993        100,000,000,000 Flop/s
  1997      1,000,000,000,000 Flop/s  (1 TFlop/s)
  2000     10,000,000,000,000 Flop/s
  2003     35,000,000,000,000 Flop/s  (35 TFlop/s)

The TOP500 List

- Compiled by H. Meuer, H. Simon, E. Strohmaier, and J. Dongarra: a listing of the 500 most powerful computers in the world.
- Yardstick: Rmax from the LINPACK benchmark, solving a dense linear system Ax = b (the TPP performance rate).
- Updated twice a year: at the SC'xy conference in the United States in November and at the meeting in Mannheim, Germany, in June.
- All data available from www.top500.org.

What Is a Supercomputer?

- A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
- Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and the simulation of nuclear weapons), to name a few.
- Over the last 10 years the range of the TOP500 has increased faster than Moore's Law:
  - 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
  - 2004: #1 = 70 TFlop/s, #500 = 850 GFlop/s
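The TOP500 yardstick described above, Rmax, is the rate achieved when solving a dense linear system Ax = b with the LINPACK benchmark. A minimal sketch of that measurement in Python/NumPy is shown below; it uses the standard LINPACK operation count of 2/3 n^3 + 2 n^2 flops, and the matrix size and the use of numpy.linalg.solve are illustrative assumptions rather than the actual HPL benchmark code.

```python
import time
import numpy as np

def linpack_gflops(n=2000, seed=0):
    """Time a dense solve of Ax = b and report the rate in Gflop/s.

    Uses the standard LINPACK operation count 2/3*n^3 + 2*n^2.
    Illustrative sketch only, not the HPL benchmark itself.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)          # LU factorization plus triangular solves
    elapsed = time.perf_counter() - t0

    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / elapsed / 1e9       # Gflop/s

if __name__ == "__main__":
    print(f"Measured rate: {linpack_gflops():.2f} Gflop/s")
```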
TOP500 Performance, November 2004

[Chart: total (SUM), #1, and #500 performance of the TOP500, 1993-2004, on a logarithmic scale from 100 MFlop/s to 1 PFlop/s. In 1993 the list summed to 1.167 TFlop/s, with the Fujitsu 'NWT' at NAL as #1 at 59.7 GFlop/s and 0.4 GFlop/s at #500; later #1 systems include the Intel ASCI Red at Sandia, the IBM ASCI White at LLNL, and the NEC Earth Simulator. By November 2004 the list summed to 1.127 PFlop/s, with the IBM BlueGene/L at #1 with 70.72 TFlop/s and 850 GFlop/s at #500. A "My Laptop" reference line is also marked.]

Vibrant Field for High Performance Computers

- Current systems: Cray X1, SGI Altix, IBM Regatta, IBM Blue Gene/L, IBM eServer, Sun, HP, Bull NovaScale, Fujitsu PrimePower, Hitachi SR11000, NEC SX-7, Apple, Dawning, Lenovo.
- Coming soon: Cray Red Storm, Cray BlackWidow, NEC SX-8, Galactic Computing.

Architecture/Systems Continuum

From tightly coupled to loosely coupled:

- Custom processor with custom interconnect (Cray X1, NEC SX-7, IBM Regatta, IBM Blue Gene/L): best performance for codes that are not "cache friendly", good communication performance, the simplest programming model, and the most expensive systems.
- Commodity processor with custom interconnect (SGI Altix with Intel Itanium 2, Cray Red Storm with AMD Opteron): good communication performance and good scalability.
- Commodity processor with commodity interconnect (clusters of Pentium, Itanium, Opteron, or Alpha processors with GigE, Infiniband, Myrinet, or Quadrics; NEC TX7, IBM eServer, Dawning): best price/performance for codes that work well with caches and are latency tolerant, but a more complex programming model.

[Chart: share of TOP500 systems in the custom, hybrid, and commodity classes, June 1993 through December 2004, on a 0-100% scale.]

TOP500 Performance by Manufacturer (11/04)

IBM 49%, HP 21%, all others 14%, SGI 7%, NEC 4%, with Fujitsu, Sun, Cray, Hitachi, and Intel each contributing roughly 2% or less.

Commodity Processors

- Intel Pentium Nocona, 3.6 GHz: peak = 7.2 Gflop/s, Linpack 100 = 1.8 Gflop/s, Linpack 1000 = 3.1 Gflop/s
- AMD Opteron, 2.2 GHz: peak = 4.4 Gflop/s, Linpack 100 = 1.3 Gflop/s, Linpack 1000 = 3.1 Gflop/s
- Intel Itanium 2, 1.5 GHz: peak = 6 Gflop/s, Linpack 100 = 1.7 Gflop/s, Linpack 1000 = 5.4 Gflop/s
- Others: HP PA-RISC, Sun UltraSPARC IV, HP Alpha EV68 (1.25 GHz, 2.5 Gflop/s peak), MIPS R16000

Commodity Interconnects

Gigabit Ethernet, SCI, QsNet, Myrinet, and Infiniband, with bus, torus, fat-tree, and Clos switch topologies. Approximate per-node cost and MPI performance:

  Interconnect       Topology   NIC      Switch/node  Total/node   MPI latency (us) / 1-way (MB/s) / bi-dir (MB/s)
  Gigabit Ethernet   Bus        $50      $50          $100         30 / 100 / 150
  SCI                Torus      $1,600   $0           $1,600       5 / 300 / 400
  QsNetII (R)        Fat tree   $1,200   $1,700       $2,900       3 / 880 / 900
  QsNetII (E)        Fat tree   $1,000   $700         $1,700       3 / 880 / 900
  Myrinet (D card)   Clos       $595     $400         $995         6.5 / 240 / 480
  Myrinet (E card)   Clos       $995     $400         $1,395       6 / 450 / 900
  Infiniband 4x      Fat tree   $1,000   $400         $1,400       6 / 820 / 790
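The latency and bandwidth columns of the interconnect table combine naturally into the usual first-order alpha-beta estimate of message transfer time, t(m) = latency + m / bandwidth. The sketch below applies that model to a few of the table's entries; the model and the message sizes chosen are illustrative additions, not something taken from the slides.

```python
# Alpha-beta model: time to send an m-byte message = latency + m / bandwidth.
# Latency (us) and one-way bandwidth (MB/s) are taken from the table above;
# the model itself is a standard first-order approximation.

INTERCONNECTS = {
    "Gigabit Ethernet": {"latency_us": 30.0, "bandwidth_mbs": 100.0},
    "SCI":              {"latency_us": 5.0,  "bandwidth_mbs": 300.0},
    "QsNetII (E)":      {"latency_us": 3.0,  "bandwidth_mbs": 880.0},
    "Myrinet (E card)": {"latency_us": 6.0,  "bandwidth_mbs": 450.0},
    "Infiniband 4x":    {"latency_us": 6.0,  "bandwidth_mbs": 820.0},
}

def transfer_time_us(name: str, message_bytes: int) -> float:
    """Estimated one-way transfer time in microseconds for one message."""
    net = INTERCONNECTS[name]
    seconds_per_byte = 1.0 / (net["bandwidth_mbs"] * 1e6)
    return net["latency_us"] + message_bytes * seconds_per_byte * 1e6

if __name__ == "__main__":
    # Small messages are latency-bound, large messages are bandwidth-bound,
    # so the best interconnect can change with message size.
    for size in (64, 4 * 1024, 1024 * 1024):
        best = min(INTERCONNECTS, key=lambda n: transfer_time_us(n, size))
        print(f"{size:>8} B: best = {best:16s} "
              f"({transfer_time_us(best, size):8.1f} us)")
```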
The 24th TOP500 List: The TOP10

  Rank  Rmax [TF/s]  Manufacturer        Computer                                  Installation site                        Country  Year  #Proc
  1     70.72        IBM                 BlueGene/L beta-system                    DOE/IBM                                  USA      2004  32768
  2     51.87        SGI                 Columbia (Altix, Infiniband)              NASA Ames                                USA      2004  10160
  3     35.86        NEC                 Earth-Simulator                           Earth Simulator Center                   Japan    2002  5120
  4     20.53        IBM                 MareNostrum (BladeCenter JS20, Myrinet)   Barcelona Supercomputing Center          Spain    2004  3564
  5     19.94        California Digital  Thunder (Itanium2, Quadrics)              Lawrence Livermore National Laboratory   USA      2004  4096
  6     13.88        HP                  ASCI Q (AlphaServer SC, Quadrics)         Los Alamos National Laboratory           USA      2002  8192
  7     12.25        Self-made           System X (Apple XServe, Infiniband)       Virginia Tech                            USA      2004  2200
  8     11.68        IBM/LLNL            BlueGene/L DD1 (500 MHz)                  Lawrence Livermore National Laboratory   USA      2004  8192
  9     10.31        IBM                 pSeries 655                               Naval Oceanographic Office               USA      2004  2944
  10    9.82         Dell                Tungsten (PowerEdge, Myrinet)             NCSA                                     USA      2003  2500

399 systems exceed 1 TFlop/s, 294 machines are clusters, and the top 10 average roughly 8,000 processors each.

IBM BlueGene/L System Packaging

The full system is built up from a two-processor chip to 131,072 processors in 64 racks:

  Chip:          2 processors                                       2.8/5.6 GFlop/s    4 MB embedded cache
  Compute card:  2 chips (2x1x1), 4 processors                      5.6/11.2 GFlop/s   1 GB DDR
  Node card:     16 compute cards (32 chips, 4x4x2), 64 processors  90/180 GFlop/s     16 GB DDR
  Rack:          32 node boards (8x8x16), 2048 processors           2.9/5.7 TFlop/s    0.5 TB DDR
  Full system:   64 racks (64x32x32), 131,072 processors            180/360 TFlop/s    32 TB DDR

"Fastest Computer": the 700 MHz BG/L beta system with 32K processors in 16 racks. Peak: 91.7 TFlop/s; LINPACK: 70.7 TFlop/s, or 77% of peak.

BlueGene/L Interconnection Networks

Three-dimensional torus
- Interconnects all compute nodes (65,536)
- Virtual cut-through hardware routing
- 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
- 1 us latency between nearest neighbors, 5 us to the farthest node; 4 us for one hop with MPI, 10 us to the farthest
- Communications backbone for computations
- 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth

Global tree
- Interconnects all compute and I/O nodes (1024)
- One-to-all broadcast functionality and reduction operations
- 2.8 Gb/s of bandwidth per link; one-way tree-traversal latency of 2.5 us
- ~23 TB/s total binary-tree bandwidth (64K machine)

Ethernet
- Incorporated into every node ASIC, active in the I/O nodes (1:64)
- Carries all external communication (file I/O, control, user interaction, etc.)

Low-latency global barrier and interrupt network
- Round-trip latency of 1.3 us

In addition, a separate control network is provided.

NASA Ames: SGI Altix Columbia, a 10,240-Processor System

- Architecture: hybrid technical server cluster, built by SGI from Altix systems; deployed today.
- Node: 1.5 GHz Itanium 2 processors with dual FPUs per processor, 512 processors per node.
- System: 20 Altix NUMA systems at 512 processors each = 10,240 processors; roughly 320 cabinets in all (an estimated 16 per node). Peak: 61.4 TFlop/s; LINPACK: 52 TFlop/s.
- Interconnect: FastNumaFlex (custom hypercube) within a node, Infiniband between nodes.
- Pluses: large and powerful DSM nodes.
- Potential problems (gotchas): power consumption of about 100 kW per node (2 MW total).

Performance Projection

[Chart: the TOP500 SUM, N=1, and N=500 trend lines from 1993 extrapolated out to 2015 on a scale from 100 MFlop/s to 1 EFlop/s, with BlueGene/L, the DARPA HPCS program, and a "My Laptop" line marked.]

Power Efficiency

[Chart: Watts per Gflop for the top 20 systems, based on processor power rating only (smaller is better), on a 0-120 W/Gflop scale. Systems shown include the BlueGene/L DD1 prototype, DD2 prototype, and DD2 beta systems (0.5-0.7 GHz PowerPC 440), SGI Altix (1.5 GHz Itanium 2, Voltaire Infiniband), the Earth Simulator, eServer BladeCenter JS20+ (2.2 GHz PowerPC 970, Myrinet), Itanium2 Tiger4 (1.4 GHz, Quadrics), ASCI Q (1.25 GHz AlphaServer SC45), the dual 2.3 GHz Apple XServe cluster (Mellanox Infiniband 4X / Cisco GigE), eServer pSeries 655 and 690 (Power4+), PowerEdge 1750 and other Xeon clusters with Myrinet or Quadrics, the RIKEN Super Combined Cluster, Integrity rx2600 (1.5 GHz Itanium 2, Quadrics), the Dawning 4000A and other Opteron/Myrinet clusters, the MCR Linux Cluster, and ASCI White (375 MHz Power3).]

Top500 in Asia

[Chart: number of TOP500 machines installed in Asia, June 1993 through 2004, broken down by Japan, China, South Korea, India, and others, on a scale of 0-120 systems.]
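The BlueGene/L packaging numbers above multiply out neatly, which makes them easy to sanity-check. The sketch below is a back-of-the-envelope check in Python: the per-chip peak of 2.8/5.6 GFlop/s and the 70.72 TFlop/s Linpack figure come from the slides, while the script itself (and its rounding against the quoted 180/360 TFlop/s and 91.7 TFlop/s values) is an illustrative addition.

```python
# Sanity check of the BlueGene/L packaging hierarchy described above.
# From the slides: each chip holds 2 processors and peaks at 2.8 GFlop/s in
# coprocessor mode, 5.6 GFlop/s when both cores compute (700 MHz PowerPC 440).

CHIPS_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_CARD = 16
NODE_CARDS_PER_RACK = 32
RACKS_FULL_SYSTEM = 64
PROCS_PER_CHIP = 2
CHIP_PEAK_GF = (2.8, 5.6)      # (coprocessor mode, virtual-node mode)

chips_per_rack = (CHIPS_PER_COMPUTE_CARD * COMPUTE_CARDS_PER_NODE_CARD *
                  NODE_CARDS_PER_RACK)
chips_full = chips_per_rack * RACKS_FULL_SYSTEM
procs_full = chips_full * PROCS_PER_CHIP

print(f"Full system: {procs_full} processors")                       # 131072
print("Full-system peak: "
      f"{chips_full * CHIP_PEAK_GF[0] / 1000:.0f}/"
      f"{chips_full * CHIP_PEAK_GF[1] / 1000:.0f} TFlop/s")           # ~184/367, quoted as 180/360

# The November 2004 #1 entry was the 16-rack, 32K-processor beta system.
procs_16 = 16 * chips_per_rack * PROCS_PER_CHIP
peak_16_tf = procs_16 * 2.8 / 1000                                    # 2.8 GFlop/s per processor
print(f"16-rack peak: {peak_16_tf:.2f} TFlop/s")                      # 91.75, quoted as 91.7
print(f"Linpack efficiency: {70.72 / peak_16_tf:.0%}")                # ~77%
```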
17 Chinese Sites on the TOP500

  Rank  Installation site                               Manufacturer  Computer                                   Area      Year  Rmax [GFlop/s]  Procs
  17    Shanghai Supercomputer Center                   Dawning       Dawning 4000A, Opteron 2.2 GHz, Myrinet    Research  2004  8061            2560
  38    Chinese Academy of Sciences                     Lenovo        DeepComp 6800, Itanium2 1.3 GHz, QsNet     Academic  2003  4193            1024
  61    Institute of Scientific Computing/Nankai Univ.  IBM           xSeries Xeon 3.06 GHz, Myrinet             Academic  2004  3231            768
  132   Petroleum Company (D)                           IBM           BladeCenter Xeon 3.06 GHz, Gig-Ethernet    Industry  2004  1923            512
  184   Geoscience (A)                                  IBM           BladeCenter Xeon 3.06 GHz, Gig-Ethernet    Industry  2004  1547            412
  209   University of Shanghai                          HP            DL360G3 Xeon 3.06 GHz, Infiniband          Academic  2004  1401            348
  225   Academy of Mathematics and System Science       Lenovo        DeepComp 1800, P4 Xeon 2 GHz, Myrinet      Academic  2002  1297            512
  229   Digital China Ltd.
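Stepping back to the list as a whole: the performance-projection chart above simply extends the TOP500 trend. Using only figures that appear in this document (1993: #1 = 59.7 GFlop/s and a list total of 1.167 TFlop/s; November 2004: #1 = 70.72 TFlop/s and a total of 1.127 PFlop/s), the sketch below computes the implied annual growth factor and compares it with Moore's-law doubling every 1.5 years; the constant-growth assumption and the projected dates it prints are illustrative, not numbers from the talk.

```python
import math

# TOP500 data points taken from the slides above.
N1_1993_GF, N1_2004_GF = 59.7, 70_720.0          # #1 system, GFlop/s
SUM_1993_GF, SUM_2004_GF = 1_167.0, 1_127_000.0  # list total, GFlop/s

def annual_growth(start, end, years):
    """Constant annual growth factor implied by two data points."""
    return (end / start) ** (1.0 / years)

def years_until(current, target, growth):
    """Years of constant growth needed to reach a target value."""
    return math.log(target / current) / math.log(growth)

g_n1 = annual_growth(N1_1993_GF, N1_2004_GF, 11)
g_sum = annual_growth(SUM_1993_GF, SUM_2004_GF, 11)

# Moore's law: 2x per 1.5 years, i.e. 2**(1/1.5) ~= 1.59x per year.
print(f"#1 growth:  {g_n1:.2f}x per year (Moore's law ~1.59x)")
print(f"SUM growth: {g_sum:.2f}x per year")
print(f"#1 reaches 1 PFlop/s around {2004 + years_until(N1_2004_GF, 1e6, g_n1):.0f}")
print(f"#1 reaches 1 EFlop/s around {2004 + years_until(N1_2004_GF, 1e9, g_n1):.0f}")
```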
Recommended publications
  • The Reliability Wall for Exascale Supercomputing (Xuejun Yang, Zhiyuan Wang, Jingling Xue, and Yun Zhou)
  • Status on the Serverization of UNICOS (UNICOS/mk Status)
  • LLNL Computation Directorate Annual Report (2014)
  • Biomolecular Simulation Data Management in Heterogeneous Environments
  • UNICOS Installation Guide for CRAY J90 Series, SG-5271 9.0.2
  • Simulation for All: The Politics of Supercomputing in Stuttgart (David Gugerli and Ricky Wichum)
  • Measuring Power Consumption on IBM Blue Gene/P
  • The Gemini Network
  • Trends in HPC Architectures and Parallel Programming
  • Blue Gene®/Q Overview and Update
  • TMA4280—Introduction to Supercomputing
  • Introduction to the Practical Aspects of Programming for IBM Blue Gene/P