Performance Engineering on Cray XC40 with Xeon Phi


Performance Engineering for Legacy Codes on a Cray XC40 with Intel Xeon Phi (KNL)
Matthias Noack ([email protected]), Florian Wende, Thomas Steinke, Alexander Reinefeld
Zuse Institute Berlin
2017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC'17

North-German Supercomputing Alliance (HLRN)
• Applications on the HLRN-III: many pure MPI codes, some MPI+OpenMP, some vectorized
• "Konrad" (Berlin, ZIB): 1872 Xeon nodes, 44,928 cores
• "Gottfried" (Hanover, LUIS): 1680 Xeon nodes, 40,320 cores, plus 64 SMP servers with 256/512 GB
• TDS (Berlin, ZIB): 16 KNC nodes (until July 2016), 80 KNL nodes (since July 2016), DataWarp nodes
• The two sites are connected by a 10 Gbps link (243 km linear distance)

Supercomputers at ZIB (timeline chart; axes: cores and peak performance)
• 1984: Cray 1M – 1 core, 160 MFlops
• 1987: Cray X-MP – 2 cores, 471 MFlops
• 1994: Cray T3D – 192 DEC Alpha cores, 38 GFlops
• 1997: Cray T3E – 256 DEC Alpha cores, 486 GFlops
• 2002 (HLRN-I): IBM p690 – 384 IBM Power4 cores, 2.5 TFlops
• 2008 (HLRN-II): SGI ICE/XE – 10,240 Intel Harpertown/Nehalem cores, 150 TFlops
• 2013 (HLRN-III): Cray XC30/XC40 – 44,928 Intel Ivy Bridge/Haswell cores, 1.3 PFlops
The chart is shown a second time with a cost/power comparison: one of the early Cray systems (about 200 kWatt, about 10 M€) versus a single Intel Xeon Phi 7290 (KNL, 2016) with 72 cores (288 threads), 3 TFLOPS, 245 Watt, 6662 € (6/2017).

Research Center for Many-Core HPC at ZIB – Intel Parallel Compute Center (IPCC)
• Objective: many-core high-performance computing, covering applications and research on modernization, OpenMP/MPI, programming models, scalability, and runtime libraries
• Applications: GLAT (atomistic thermodynamics), VASP (electronic structure), BQCD (high-energy physics), HEOM (photo-active processes), BOSS (time series analysis, phase 2), PALM (fluid dynamics, phase 2), PHOENIX3D (astrophysics, phase 2)
• Challenges: adapting data structures to enable SIMD, vectorizing complex code structures, transition to hybrid MPI + OpenMP, (offload with Intel LEO vs. OpenMP 4.x)

Intel Xeon Phi (Knights Landing) – Architecture
• Knights Landing (KNL): Intel's second Many Integrated Core (MIC) architecture 1)
• Self-booting CPU, optionally with integrated Omni-Path fabric
• 64+ cores (based on the Intel Atom Silvermont architecture, x86-64), 4-way hardware threading
• 512-bit SIMD vector processing (AVX-512)
• On-chip Multi-Channel DRAM (MCDRAM): 16 GiB
• DDR4 main memory: up to 384 GiB
1) A. Sodani et al., "Knights Landing: Second-Generation Intel Xeon Phi Product", IEEE Micro, vol. 36, no. 2, 2016
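The SIMD challenges listed for the IPCC applications (adapting data structures, vectorizing complex loops) come down to giving the compiler contiguous data and clean loops for the 512-bit vector units described above. The following is a minimal sketch of that idea and not code from the talk: a structure-of-arrays layout plus an OpenMP simd loop; all type and function names are illustrative.

```cpp
// Minimal sketch (illustrative names): structure-of-arrays layout plus an
// OpenMP simd loop, the kind of transformation "adapting data structures
// for SIMD" refers to. Each field is contiguous, so one AVX-512 load fills
// a full 8-wide double vector (512 bit / 64 bit = 8 lanes).
#include <cstddef>
#include <vector>

struct Particles {
    std::vector<double> x, v, f;   // position, velocity, force
    explicit Particles(std::size_t n) : x(n), v(n), f(n) {}
};

void integrate(Particles& p, double dt) {
    const std::size_t n = p.x.size();
    // Ask the compiler to vectorize; on KNL this maps to AVX-512 FMAs.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        p.v[i] += dt * p.f[i];   // fused multiply-add
        p.x[i] += dt * p.v[i];
    }
}

int main() {
    Particles p(1 << 20);
    integrate(p, 1.0e-3);
    return 0;
}
```

An array-of-structures layout (one struct per particle) would force gather/scatter accesses instead of unit-stride loads, which is exactly the kind of data-structure change the IPCC modernization work addresses.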
Intel Xeon Phi (Knights Landing) – Architecture: KNL CPU
• Up to 36 active tiles, connected via a 2D mesh interconnect
• 8 MCDRAM devices, each attached through an EDC controller
• 2 DDR memory controllers (DDR MC) with 3 DDR4 channels each
• PCIe 3 and DMI interfaces

Tile
• 2 out-of-order cores, 2 VPUs per core (AVX-512)
• 1 MiB shared L2 cache
• Caching/Home Agent (CHA): interface to the 2D mesh interconnect; distributed tag directory (MESIF cache coherency protocol)

DDR4 memory (up to 384 GiB)
• 2 memory controllers, 6 DDR4 channels in total

MCDRAM (16 GiB)
• 8 on-chip units, 2 GiB each

Memory modes
• Flat mode: DDR4 and MCDRAM in the same physical address space
• Cache mode: MCDRAM acts as a direct-mapped cache in front of DDR4
• Hybrid mode: 8 or 4 GiB of MCDRAM in cache mode, the remaining 8 or 12 GiB in flat mode
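In flat mode the 16 GiB of MCDRAM is exposed as a separate NUMA node, so an application can either be bound to it as a whole (for example with numactl --membind) or place selected hot arrays there explicitly. The sketch below illustrates the explicit route through the memkind library's hbwmalloc interface, which is commonly installed on KNL systems; it is an illustration of the flat-mode behaviour described above, not code from the talk.

```cpp
// Minimal sketch (assumes the memkind package and its hbwmalloc.h header
// are available, as is typical on KNL systems). In flat mode hbw_malloc()
// places the allocation in MCDRAM; if no high-bandwidth memory is found,
// the code falls back to an ordinary DDR4 allocation.
#include <hbwmalloc.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t n = std::size_t(1) << 24;            // 16 Mi doubles = 128 MiB
    const bool have_hbw = (hbw_check_available() == 0);    // 0 means HBM present

    double* a = static_cast<double*>(
        have_hbw ? hbw_malloc(n * sizeof(double))           // MCDRAM (flat mode)
                 : std::malloc(n * sizeof(double)));        // DDR4 fallback
    if (!a) return 1;

    for (std::size_t i = 0; i < n; ++i) a[i] = 1.0;          // touch the pages
    std::printf("allocated %zu doubles in %s\n", n, have_hbw ? "MCDRAM" : "DDR4");

    have_hbw ? hbw_free(a) : std::free(a);
    return 0;
}
```

In cache mode no source changes are needed at all, which is why cache mode is usually the first configuration tried for legacy codes.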
KNL in the Roofline Model (Samuel Williams, 2008)
• The roofline plots attainable FLOPS [GFLOPS] against arithmetic intensity [FLOP/Byte], with ceilings for peak FLOPS, MCDRAM bandwidth, and DRAM bandwidth:

  Attainable FLOPS = min( Peak FLOPS, Memory Bandwidth × Arithmetic Intensity )

• Left of the ridge point a kernel is memory bound, right of it compute bound.
• Ridge point = Peak FLOPS / Memory Bandwidth = minimal arithmetic intensity necessary to reach peak FLOPS. For the HLRN nodes:
  - KNL, DRAM: 22.7 FLOP/Byte
  - KNL, MCDRAM: 5.3 FLOP/Byte
  - Haswell: 7.1 FLOP/Byte
• Numbers used: Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s DDR, 490 GB/s MCDRAM; Xeon E5-2680v3: 480 GFLOPS, 68 GiB/s
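The ridge points quoted above follow directly from dividing peak FLOPS by memory bandwidth. The short program below, a sketch with illustrative names and the bandwidth figures taken exactly as quoted, reproduces them and evaluates the roofline formula for a given arithmetic intensity.

```cpp
// Minimal sketch: reproduce the ridge points quoted above from the slide's
// peak-FLOPS and bandwidth figures. Names are illustrative; bandwidths are
// used exactly as quoted (GiB/s vs. GB/s not harmonized).
#include <algorithm>
#include <cstdio>

// Roofline: attainable FLOPS = min(peak FLOPS, bandwidth * arithmetic intensity)
double attainable_gflops(double peak_gflops, double bw_gb_s, double ai_flop_per_byte) {
    return std::min(peak_gflops, bw_gb_s * ai_flop_per_byte);
}

int main() {
    struct Machine { const char* name; double peak_gflops; double bw_gb_s; };
    const Machine machines[] = {
        {"KNL 7250 / DDR4",    2611.2, 115.2},
        {"KNL 7250 / MCDRAM",  2611.2, 490.0},
        {"Haswell E5-2680v3",   480.0,  68.0},
    };
    for (const auto& m : machines) {
        const double ridge = m.peak_gflops / m.bw_gb_s;   // minimal AI for peak
        std::printf("%-20s ridge: %5.1f FLOP/Byte, attainable at AI=0.1: %6.1f GFLOPS\n",
                    m.name, ridge,
                    attainable_gflops(m.peak_gflops, m.bw_gb_s, 0.1));
    }
    return 0;
}
```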
Refining the Ceilings - FLOPS
Realistic peak FLOPS
• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 cores × 8 SIMD lanes × 2 VPUs × 2 (FMA) = 3046.4 GFLOPS
• The AVX frequency is only 1.2 GHz and might throttle down further under heavy load
  ⇒ actual peak: 2611.2 GFLOPS
Add more FLOPS ceilings
• Without instruction-level parallelism (ILP), i.e. without dual VPUs and FMA: 1.2 GHz × 68 cores × 8 SIMD lanes = 652.8 GFLOPS (25 %)
• Without ILP and without SIMD: 1.2 GHz × 68 cores = 84.6 GFLOPS (3.2 %)
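The ceiling arithmetic above only pays off if the inner loops of a code actually expose 8-wide SIMD, fused multiply-adds, and enough independent operations to keep both VPUs of a core busy. The dot product below is a hedged illustration of this and not code from the talk: two independent accumulator chains supply the instruction-level parallelism, the multiply-add maps to FMA, and the OpenMP simd reduction exposes the vector width; the unroll factor and all names are illustrative.

```cpp
// Minimal sketch (illustrative names): a dot product written so the compiler
// can use the three factors from the ceiling calculation above, namely 8-wide
// AVX-512 SIMD, fused multiply-add, and independent accumulator chains (ILP).
#include <cstddef>
#include <cstdio>
#include <vector>

double dot(const double* a, const double* b, std::size_t n) {
    const std::size_t n2 = n - (n % 2);   // even part of the range
    double s0 = 0.0, s1 = 0.0;            // two independent chains -> ILP
    #pragma omp simd reduction(+ : s0, s1)
    for (std::size_t i = 0; i < n2; i += 2) {
        s0 += a[i]     * b[i];            // maps to an FMA on AVX-512
        s1 += a[i + 1] * b[i + 1];
    }
    if (n % 2) s0 += a[n - 1] * b[n - 1]; // handle a possible last element
    return s0 + s1;
}

int main() {
    const std::size_t n = std::size_t(1) << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    std::printf("dot = %.1f (expected %.1f)\n",
                dot(a.data(), b.data(), n), 2.0 * static_cast<double>(n));
    return 0;
}
```

With an OpenMP-capable compiler (for example the Intel compiler with -qopenmp and -xMIC-AVX512 on KNL) the pragma turns the loop into AVX-512 FMAs; without it the pragma is ignored and the code still runs correctly, just nearer to the lower ceilings.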