The Cray X1 First in a Series of Extreme Performance Systems

Total pages: 16

File type: PDF, size: 1020 KB

The Cray X1: First in a series of extreme performance systems
Steve Scott, Cray X1 Chief Architect, [email protected]
Cray X1 Overview, Copyright 2002-3, Cray Inc.

Cray Supercomputer Evolution and Roadmap
[Roadmap figure: the high-bandwidth vector line (Cray-1, X-MP, Y-MP, C90, J90, T90, SV1, SV1ex; 1-32 processors) and the microprocessor-based MPP line (T3D, T3E; 1000's of processors, UNICOS/mk) converge in the X1 and X1e, followed by BlackWidow (BW+, BW++) and a DARPA HPCS high-bandwidth MPP aimed at sustained petaflops.]

Architecture

Cray X1
• From the Cray PVP line:
  – Powerful single processors
  – Very high memory bandwidth
  – Non-unit stride computation
  – Special ISA features
  – Modernized the ISA and microarchitecture
• From the Cray T3E line:
  – Distributed shared memory
  – High-bandwidth scalable network
  – Optimized communication and synchronization features
  – Improved via custom processors

Cray X1 Instruction Set Architecture
• New ISA
  – Much larger register set (32x64 vector, 64+64 scalar)
  – All operations performed under mask
  – 64- and 32-bit memory and IEEE arithmetic
  – Integrated synchronization features
• Advantages of a vector ISA
  – Compiler provides useful dependence information to hardware
  – Very high single-processor performance with low complexity: ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)
  – Localized computation on the processor chip: large register state with very regular access patterns; registers and functional units grouped into local clusters (pipes) ⇒ excellent fit with future IC technology
  – Latency tolerance and pipelining to memory/network ⇒ very well suited for scalable systems
  – Easy extensibility to the next-generation implementation

Not your father's vector machine
• New instruction set
• New system architecture
• New processor microarchitecture
• "Classic" vector machines were programmed differently
  – Classic vector: optimize for loop length with little regard for locality
  – Scalable micro: optimize for locality with little regard for loop length
• The Cray X1 is programmed like a parallel micro-based machine
  – Rewards locality: register, cache, local memory, remote memory
  – Decoupled microarchitecture performs well on short loop nests
  – (it does, however, require vectorizable code)

Cray X1 Node
[Node diagram: 16 processor chips (P) with caches ($) connect through 16 M chips to 16 local memory banks (mem), plus two I/O connections.]
• Four multistream processors (MSPs), each 12.8 Gflops
• High-bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node

NUMA Scalable up to 1024 Nodes
[Figure: nodes joined by the interconnection network.]
• 32 parallel networks for bandwidth
• Global shared memory across the machine

Network for Small Systems
[Figure: nodes 0-3, each with M chips M0-M15, connected directly across network slices 0-15.]
• Up to 16 CPUs can connect directly via the M chip ports

Scalable Network Building Block
[Figure: one 32-processor "stack" of eight nodes (node 0 through node 7); the M0-M15 chips of each node feed router chips that form 16 network slices arranged as a hypercube / 2D torus network.]
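The per-processor and per-node figures above can be checked against the ops/sec identity from the ISA slide. The sketch below is illustrative only: the 800 MHz vector clock and the two-pipe, two-flops-per-cycle SSP figures are assumptions drawn from published X1 descriptions, while the 12.8 Gflops per MSP and four MSPs per node come from the slides.

    /* Peak-rate sketch for one Cray X1 MSP, following
     * ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr).
     * The 800 MHz clock and 2 pipes x 2 flops/cycle per SSP are
     * assumptions from published X1 descriptions, not from this deck. */
    #include <stdio.h>

    int main(void) {
        double clock_hz      = 800e6;  /* assumed vector clock */
        int    ssps_per_msp  = 4;      /* four single-streaming processors per MSP */
        int    pipes_per_ssp = 2;      /* assumed vector pipes per SSP */
        int    flops_per_pipe = 2;     /* assumed multiply + add per cycle */

        double msp_gflops  = clock_hz * ssps_per_msp * pipes_per_ssp * flops_per_pipe / 1e9;
        double node_gflops = 4 * msp_gflops;   /* four MSPs per node, from the node slide */

        printf("MSP peak : %.1f Gflops (deck quotes 12.8)\n", msp_gflops);
        printf("Node peak: %.1f Gflops\n", node_gflops);
        return 0;
    }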
Multicabinet Configuration
• Scales as a 2D torus of "stacks"
  – ~third dimension within a stack (32 links between stacks)
  – Two stacks per cabinet
[Bird's-eye view: 8 cabinets, 512 processors, 6.5 Tflops.]

A 40 Tflop Configuration
• 50 cabinets, 3200 processors
• Bisection = 2 TB/s (4 TB/s global bandwidth)

Designed for Scalability (T3E Heritage)
• Distributed shared memory (DSM) architecture
  – Low-latency load/store access to the entire machine (tens of TBs)
• Decoupled vector memory architecture for latency tolerance
  – Thousands of outstanding references, flexible addressing
• Very high performance network
  – High bandwidth, fine-grained transfers
  – Same router as the Origin 3000, but 32 parallel copies of the network
• Architectural features for scalability
  – Remote address translation
  – Global coherence protocol optimized for distributed memory
  – Fast synchronization
• Parallel I/O scales with system size

Decoupled Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
  – Scalar and vector loads issued early
  – Store addresses computed and saved for later use
  – Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
  – Scalar unit goes on to issue the next loop before the current loop has completed
  – Full scalar register renaming through 8-deep shadow registers
  – Vector loads renamed through load buffers
  – Special sync operations keep the pipeline full, even across barriers
• This is key to making the system perform well on short vector-length (short-VL) code

Maintaining Decoupling Past Synchronization Points
• Msyncs ensure ordering among the processing units of an MSP; lsyncs ensure ordering within a processing unit
[Figure: the four processing units P0-P3 each issue a vector store (St Vx), an Msync, then a load (Ld); a single unit is also shown issuing a vector store, an Lsync V,S, then a scalar load (Ld S).]
• Want to protect against hazards, but not drain the memory pipeline
• Vector store addresses are computed early (before the data is available); they
  – run past the scalar Dcache to invalidate any matching entries
  – are sent out to the Ecache
  – provide ordering with later scalar loads or loads from other P chips

Address Translation
• Virtual address (VA): bits 63..62 select the memory region (useg, kseg, kphys); bits 61..48 must be zero (MBZ); bits 47..16 hold the virtual page number and page offset, with the split point ranging from bit 16 (64 KB pages) to bit 32 (4 GB pages)
• Physical address (PA): bits 47..46 select the physical address space (main memory, MMR, I/O); bits 45..36 identify the node; bits 35..0 are the offset within the node; maximum of 64 TB of physical memory
• High translation bandwidth: scalar plus four vector translations per cycle per P chip
• Source translation using 256-entry TLBs with support for multiple page sizes
• Remote (hierarchical) translation:
  – allows each node to manage its own memory (eases memory management)
  – the TLB only needs to hold translations for one node ⇒ scales
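To make the address layout concrete, here is a minimal decode sketch following the field boundaries given on the Address Translation slide. The type and function names are illustrative, not part of any Cray API, and a 64 KB page (16-bit offset) is assumed for the VA split.

    /* Decode Cray X1-style addresses per the field boundaries on the
     * Address Translation slide. Names are illustrative only; a 64 KB
     * page (16-bit offset) is assumed for the virtual-address split. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { unsigned region; uint64_t vpn; uint64_t offset; } va_fields;
    typedef struct { unsigned space; unsigned node; uint64_t offset; } pa_fields;

    static va_fields decode_va(uint64_t va) {
        va_fields f;
        f.region = (unsigned)(va >> 62);           /* bits 63..62: useg/kseg/kphys */
        f.vpn    = (va >> 16) & 0xFFFFFFFFull;     /* bits 47..16: virtual page #  */
        f.offset = va & 0xFFFFull;                 /* bits 15..0 : page offset     */
        return f;
    }

    static pa_fields decode_pa(uint64_t pa) {
        pa_fields f;
        f.space  = (unsigned)((pa >> 46) & 0x3);   /* bits 47..46: mem / MMR / I/O */
        f.node   = (unsigned)((pa >> 36) & 0x3FF); /* bits 45..36: node (up to 1024) */
        f.offset = pa & 0xFFFFFFFFFull;            /* bits 35..0 : offset in node  */
        return f;
    }

    int main(void) {
        va_fields v = decode_va((2ull << 62) | (0x12345ull << 16) | 0xABCDull);
        pa_fields p = decode_pa((1ull << 46) | (5ull << 36) | 0x123456ull);
        printf("VA: region %u, vpn 0x%llx, offset 0x%llx\n",
               v.region, (unsigned long long)v.vpn, (unsigned long long)v.offset);
        printf("PA: space %u, node %u, offset 0x%llx\n",
               p.space, p.node, (unsigned long long)p.offset);
        return 0;
    }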
Cache Coherence
• Global coherence, but only cache memory from the local node (8-32 GB)
  – Supports SMP-style codes up to 4 MSPs (4-way sharing)
  – References outside this domain are converted to non-allocate
  – Scalable codes use explicit communication anyway
  – Keeps the directory entry and protocol simple
• Explicit cache allocation control
  – Per-instruction hint for vector references
  – Per-page hint for scalar references
  – Use non-allocating references for explicit communication or to avoid cache pollution
• Coherence directory stored on the M chips (rather than in DRAM)
  – Low latency and very high bandwidth to support vectors
  – Typical CC system: 1 directory update per processor per 100-200 ns
  – Cray X1: 3.2 directory updates per MSP per ns (a factor of several hundred higher)

System Software

Cray X1 I/O Subsystem
[I/O diagram: each X1 node's I chips connect over SPC channels (2 x 1.2 GB/s) to IOCAs on network I/O boards; 800 MB/s PCI-X buses carry PCI-X cards providing 200 MB/s FC-AL links to RS200 disk arrays, plus Gigabit Ethernet, HIPPI, and IP over fiber through the CNS.]

Cray X1 UNICOS/mp
• A UNIX kernel executes on each Application node (somewhat like the Chorus™ microkernel on UNICOS/mk); provides SSI, scaling and resiliency
• Single System Image UNICOS/mp across a 256-CPU Cray X1
• Global resource manager (Psched) schedules applications onto Cray X1 nodes
• Commands launch applications as on the T3E (mpprun)
• System Service nodes provide file services, process management and other basic UNIX functionality (like the /mk servers); user commands execute on the System Service nodes

Programming Models
• Traditional shared memory applications
  – OpenMP, pthreads
  – 4 MSPs
  – Single-node memory (8-32 GB)
  – Very high memory bandwidth
  – No data locality issues
• Distributed memory applications (a minimal sketch appears after the hardware photos below)
  – MPI, shmem(), UPC, Co-array Fortran
  – Rewards the optimizations used for cluster/NUMA micro-based machines: work and data decomposition, cache blocking
  – But less worry about communication/computation ratio, strides and bandwidth: multiple GB/s of network bandwidth between nodes, with scatter/gather and large-stride support

Mechanical Design

Packaging: node board, lower side
[Board photo labels: field-replaceable memory daughter cards, spray-cooling caps, CPU MCM (8 chips), network PCB interconnect, 17" x 22" board, air-cooling manifold.]

Cray X1 Node Module
[Photo]

Node Module (power converter side)
[Photo]

Cray X1 Chassis
[Photo]

64-Processor Cray X1 System: ~820 Gflops
[Photo]

256-Processor Cray X1 System: ~3.3 Tflops
[Photo]
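As a companion to the Programming Models slide above, here is a minimal sketch of the distributed-memory style it describes: decompose the work across ranks, keep the inner loop vectorizable, and combine results with a collective. It uses standard MPI and is not taken from the original presentation; within a node the same data could instead be shared via OpenMP or pthreads, per the slide.

    /* Minimal distributed-memory sketch: block-decompose a global sum
     * across MPI ranks, compute locally, combine with one collective.
     * Standard MPI; not taken from the original presentation. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Block-decompose the range 1..N across ranks. */
        const long N = 1000000;
        long lo = rank * (N / size) + 1;
        long hi = (rank == size - 1) ? N : lo + (N / size) - 1;

        double local = 0.0, global = 0.0;
        for (long i = lo; i <= hi; i++)   /* simple, vectorizable inner loop */
            local += (double)i;

        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum(1..%ld) = %.0f\n", N, global);

        MPI_Finalize();
        return 0;
    }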
Cray X1 Summary
• Extreme system capability
  – Tens of TFLOPS in a Single System Image (SSI) MPP
  – Efficient operation due to a balanced system design
  – Focus on sustained performance on the most challenging problems
• Achieved via:
  – Very powerful single processors
    • High instruction-level parallelism
    • High-bandwidth memory system
  – Best-in-the-world scalability
    • Latency tolerance via decoupled vector processors
    • Very high-performance, tightly integrated network
    • Scalable address translation and communication protocols
    • Robust, highly scalable operating system (UNICOS/mp)
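As a quick consistency check (an illustrative calculation, not from the deck), the system peak figures quoted throughout the presentation follow from multiplying the processor count by 12.8 Gflops per MSP; treating the quoted "processors" as MSPs is an assumption consistent with the node slide.

    /* Cross-check of the peak figures quoted in the deck:
     * system peak = processors (MSPs) x 12.8 Gflops.
     * Treating "processors" as MSPs is an assumption. */
    #include <stdio.h>

    int main(void) {
        const double gflops_per_msp = 12.8;
        const int configs[] = { 64, 256, 512, 3200 };  /* processor counts quoted in the deck */
        for (int i = 0; i < 4; i++) {
            double tflops = configs[i] * gflops_per_msp / 1000.0;
            printf("%4d processors -> %5.1f Tflops peak\n", configs[i], tflops);
        }
        return 0;  /* deck quotes ~0.82, ~3.3, 6.5 and 40 Tflops respectively */
    }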
Recommended publications
  • New CSC Computing Resources
    Atte Sillanpää and Nino Runeberg, CSC – IT Center for Science Ltd. Presents CSC's services, the new Kajaani data centre, Finland's new supercomputers Sisu (Cray XC30) and Taito (HP cluster), and the CSC resources available to researchers.
  • Petaflops for the People
    A spotlight on the DOE Advanced Scientific Computing Research (ASCR) facilities NERSC, OLCF and ALCF, including Berkeley Lab climate simulations on a 25-kilometer grid that captured hurricanes, typhoons and extreme waves missed at roughly 100-kilometer resolution (published in Geophysical Research Letters).
  • Technical Manual for a Coupled Sea-Ice/Ocean Circulation Model (Version 3)
    OCS Study MMS 2009-062 by Katherine S. Hedström, Arctic Region Supercomputing Center, University of Alaska Fairbanks (2009). Documents a ROMS-based coupled sea-ice/ocean circulation model, funded by the Minerals Management Service.
  • Through the Years… When Did It All Begin?
    A retrospective on NERSC's early systems: the CDC 6600 (serial number 1, launched in 1963, in service from 1974), the CDC 7600 (1975) and the Cray-1 (1978), including T.J. Watson's 1963 memo on the IBM/CDC rivalry.
  • The Gemini Network
    Cray Inc.'s Gemini interconnect document (Rev 1.1, 2010); the excerpt shown consists of the proprietary-information, limited-rights and trademark front matter.
  • Understanding the Cray X1 System
    NAS Technical Report NAS-04-014 (December 2004) by Samson Cheung, Halcyon Systems Inc. / NASA Advanced Supercomputing Division. Offers porting hints for the Cray X1, compares its basic performance with other NASA Ames platforms, and uses Laplacian solvers (MPI, OpenMP) and a NAS Parallel Benchmarks code to illustrate compiler features and optimization.
  • TMA4280—Introduction to Supercomputing
    Course introduction from NTNU IMF (January 2018) covering computational science and engineering, parallel performance improvement, and an overview of supercomputing systems.
  • Performance Evaluation of the Cray X1 Distributed Shared Memory Architecture
    Tom Dunigan, Jeffrey Vetter and Pat Worley, Oak Ridge National Laboratory. Overviews the X1 node and interconnect architecture and reports microbenchmark and application results (including CCSM climate studies) showing marked improvement over contemporary systems.
  • R00456--FM Getting up to Speed
    Getting Up to Speed: The Future of Supercomputing, edited by Susan L. Graham, Marc Snir and Cynthia A. Patterson; Committee on the Future of Supercomputing, National Research Council, The National Academies Press (2004).
  • Titan Introduction and Timeline Bronson Messer
    Presentation by Bronson Messer (Scientific Computing Group, Oak Ridge Leadership Computing Facility) on the Titan timeline, the 1,000x increase in system performance since 2004 across the Cray X1 and XT3/XT4/XT5 systems, and the end of clock-rate scaling.
  • Taking the Lead in HPC
    Cray's SuperComputing 2004 corporate update (Cray X1 and Cray XD1), covering sales strategy, purpose-built systems, future directions, company history, and the successful 2002-2003 introduction of the X1.
  • Site Report.1
    "Reinventing the Supercomputer Center at NERSC" by Horst D. Simon: a site report on NERSC's Cray systems (a T3E-900, four J90SEs and a C90, with a combined capacity of 495 Gflops) and its goal of teraflops-scale computing and petabyte-scale storage by the end of the decade.