Overview of High-Performance Computing Simulation


Overview of High-Performance Computing
CS 594, Spring 2003, Lecture 1
Jack Dongarra, Computer Science Department, University of Tennessee

Simulation: The Third Pillar of Science
Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build a system.
Limitations:
• Too difficult -- build large wind tunnels.
• Too expensive -- build a throw-away passenger jet.
• Too slow -- wait for climate or galactic evolution.
• Too dangerous -- weapons, drug design, climate experimentation.
Computational science paradigm:
3) Use high-performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.

Computational Science Definition
Computational science is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems. Computational science fuses three distinct elements:
• numerical algorithms and modeling and simulation software developed to solve science (e.g., biological, physical, and social), engineering, and humanities problems;
• advanced system hardware, software, networking, and data management components developed through computer and information science to solve computationally demanding problems;
• the computing infrastructure that supports both science and engineering problem solving and developmental computer and information science.

Some Particularly Challenging Computations
Science
• Global climate modeling
• Astrophysical modeling
• Biology: genomics, protein folding, drug design
• Computational chemistry
• Computational materials science and nanoscience
Engineering
• Crash simulation
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics (airplane design)
• Combustion (engine design)
Business
• Financial and economic modeling
• Transaction processing, web services, and search engines
Defense
• Nuclear weapons -- test by simulation
• Cryptography

Complex Systems Engineering (Source: Walt Brooks, NASA)
R&D team (Grand Challenge driven): Ames Research Center, Glenn Research Center, Langley Research Center. Engineering team (operations driven): Johnson Space Center, Marshall Space Flight Center, industry partners.
• Analysis and visualization
• Computation management: AeroDB, ILab
• Next-generation codes and algorithms: OVERFLOW (Honorable Mention, NASA Software of the Year), INS3D (NASA Software of the Year), CART3D (NASA Software of the Year)
• Supercomputers, storage, and networks
• Modeling environment (experts and tools): compilers, scaling and porting, parallelization tools
• Applications: STS-107 analysis, turbopump analysis

Why Turn to Simulation?
When the problem is too
• complex
• large / small
• expensive
• dangerous
to do any other way.

Economic Impact of HPC
Airlines:
• System-wide logistics optimization systems on parallel systems.
• Savings: approx. $100 million per airline per year.
Automotive design:
• Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity, and aerodynamics.
• One company has a 500+ CPU parallel system.
• Savings: approx. $1 billion per company per year.
Semiconductor industry:
• Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation.
• Savings: approx. $1 billion per company per year.
Securities industry:
• Savings: approx. $15 billion per year for U.S. home mortgages.

Pretty Pictures [image slide]
Why Turn to Simulation? (continued)
• Titov's tsunami simulation (global model)
• Climate / weather modeling
• Data-intensive problems (data mining, oil reservoir simulation)
• Problems with large length and time scales (cosmology)

Cost (Economic Loss) to Evacuate 1 Mile of Coastline: $1M
• We now over-warn by a factor of 3.
• Average over-warning is 200 miles of coastline, or $200M per event.
A 24-hour forecast at fine grid spacing demands a complete, stable environment (hardware and software):
• 100 TF to stay a factor of 10 ahead of the weather
• Streaming observations
• Massive storage and metadata query
• Fast networking
• Visualization
• Data mining for feature detection

Units of High-Performance Computing
1 Mflop/s   1 Megaflop/s   10^6 Flop/sec
1 Gflop/s   1 Gigaflop/s   10^9 Flop/sec
1 Tflop/s   1 Teraflop/s   10^12 Flop/sec
1 Pflop/s   1 Petaflop/s   10^15 Flop/sec
1 MB        1 Megabyte     10^6 Bytes
1 GB        1 Gigabyte     10^9 Bytes
1 TB        1 Terabyte     10^12 Bytes
1 PB        1 Petabyte     10^15 Bytes

High-Performance Computing Today
In the past decade, the world has experienced one of the most exciting periods in computer development. Microprocessors have become smaller, denser, and more powerful. The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference for attacking some of the most important problems of science and engineering.

Technology Trends: Microprocessor Capacity
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months -- now called "Moore's Law" (2x transistors per chip every 1.5 years). Microprocessors have become smaller, denser, and more powerful. And it is not just processors: bandwidth, storage, etc. follow similar trends.

Eniac and My Laptop
                         Eniac        My Laptop
Year                     1945         2002
Devices                  18,000       6,000,000,000
Weight (kg)              27,200       0.9
Size (m^3)               68           0.0028
Power (watts)            20,000       60
Cost (1999 dollars)      4,630,000    1,000
Memory (bytes)           ~200         1,073,741,824
Performance (FP/sec)     800          5,000,000,000

No Exponential is Forever, But Perhaps We Can Delay It Forever
Today's processors:
Processor                        Year of introduction   Transistors
4004                             1971                   2,250
8008                             1972                   2,500
8080                             1974                   5,000
8086                             1978                   29,000
286                              1982                   120,000
Intel386 processor               1985                   275,000
Intel486 processor               1989                   1,180,000
Intel Pentium processor          1993                   3,100,000
Intel Pentium II processor       1997                   7,500,000
Intel Pentium III processor      1999                   24,000,000
Intel Pentium 4 processor        2000                   42,000,000
Intel Itanium processor          2002                   220,000,000
Intel Itanium 2 processor        2003                   410,000,000

Some equivalences for the microprocessors of today:
• Voltage level: a flashlight (~1 volt)
• Current level: an oven (~250 amps)
• Power level: a light bulb (~100 watts)
• Area: a postage stamp (~1 square inch)

Moore's "Law"
Something doubles every 18-24 months. That something was originally the number of transistors, but it is also often taken to be performance. Moore's Law is an exponential, and exponentials cannot last forever; however, Moore's Law has held remarkably true for ~30 years. Strictly speaking it is an empiricism rather than a law (not a derogatory comment).

Percentage of Peak
A rule of thumb that often applies: a contemporary RISC processor, for a spectrum of applications, delivers (i.e., sustains) about 10% of peak performance. There are exceptions to this rule, in both directions. Why such low efficiency? There are two primary reasons behind the disappointing percentage of peak (a small numeric sketch follows this list):
• IPC (in)efficiency
• Memory (in)efficiency
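To make the percentage-of-peak idea concrete, here is a minimal sketch (my own illustration, not from the lecture) that computes a theoretical peak rate and the sustained fraction for a hypothetical 1 GHz processor issuing 4 floating-point operations per cycle; the operation count and elapsed time are made-up placeholders.

    #include <stdio.h>

    /* Minimal percentage-of-peak calculation with hypothetical numbers. */
    int main(void)
    {
        double clock_hz        = 1.0e9;  /* 1 GHz processor (assumed)          */
        double flops_per_cycle = 4.0;    /* e.g. 4-way FP issue (assumed)      */
        double peak            = clock_hz * flops_per_cycle;  /* theoretical peak */

        double work_flops  = 2.0e9;      /* FP operations performed (assumed)  */
        double elapsed_sec = 5.0;        /* measured wall-clock time (assumed) */
        double sustained   = work_flops / elapsed_sec;

        printf("Peak:            %.2f Gflop/s\n", peak / 1.0e9);
        printf("Sustained:       %.2f Gflop/s\n", sustained / 1.0e9);
        printf("Percent of peak: %.1f%%\n", 100.0 * sustained / peak);
        return 0;
    }

With these placeholder numbers the program reports 4 Gflop/s peak, 0.4 Gflop/s sustained, i.e., the 10% of peak quoted above.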
IPC
Today the theoretical IPC (instructions per cycle) is 4 in most contemporary RISC processors (6 in the Itanium). Detailed analysis for a spectrum of applications indicates that the average achieved IPC is 1.2-1.4. We are leaving ~75% of the possible performance on the table.

Why Fast Machines Run Slow
• Latency: waiting for access to memory or other parts of the system.
• Overhead: extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform.
• Starvation: not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
• Contention: delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.

Extra Transistors
With the increasing number of transistors per chip from reduced design rules, do we:
• Add more functional units? Little gain, owing to poor IPC for today's codes, compilers, and ISAs.
• Add more cache? This generally helps but does not solve the problem.
• Add more processors? This helps somewhat -- and hurts somewhat.

Processor vs. Memory Speed
In 1986:
• processor cycle time ~120 nanoseconds
• DRAM access time ~140 nanoseconds
» roughly a 1:1 ratio
In 1996:
• processor cycle time ~4 nanoseconds
• DRAM access time ~60 nanoseconds
» roughly a 20:1 ratio
In 2002:
• processor cycle time ~0.6 nanosecond
• DRAM access time ~50 nanoseconds
» roughly a 100:1 ratio

Latency in a Single System
[Chart: CPU clock period (ns), memory system access time, and the CPU-to-memory ratio, 1997-2009 -- the widening gap is "THE WALL".]
Typical latencies for today's technology:
Hierarchy          Processor clocks
Register           1
L1 cache           2-3
L2 cache           6-12
L3 cache           14-40
Near memory        100-300
Far memory         300-900
Remote memory      O(10^3)
Message-passing    O(10^3)-O(10^4)

Memory Hierarchy
Most programs have a high degree of locality in their accesses:
• spatial locality: accessing things nearby previous accesses
• temporal locality: reusing an item that was previously accessed
The memory hierarchy tries to exploit locality (see the short C sketch at the end of this section): on-chip registers and cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (disk/tape), with speeds of roughly 1 ns, 10 ns, 100 ns, 10 ms, and 10 s respectively.

Memory Bandwidth
To provide bandwidth to the processor, the bus either needs to be faster or wider. Busses are limited to perhaps 400-800 MHz. Links are faster:
• single-ended: 0.5-1 GT/s
• differential: 2.5-5.0 GT/s (future)
• increased link frequencies increase error rates
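The locality discussion above can be made concrete with loop ordering. The sketch below (my own illustration, not code from the lecture) sums a matrix twice: once in row-major order (stride-1 accesses, good spatial locality, whole cache lines used) and once in column-major order (large stride, poor locality). The matrix size and timing method are arbitrary choices; on most cache-based systems the first loop runs noticeably faster even though it does identical arithmetic.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 4096   /* illustrative matrix dimension (128 MB of doubles) */

    /* Stride-1 traversal: consecutive elements, cache-friendly. */
    static double sum_row_major(const double *a)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i * N + j];
        return s;
    }

    /* Stride-N traversal: touches one element per cache line before moving on. */
    static double sum_col_major(const double *a)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i * N + j];
        return s;
    }

    int main(void)
    {
        double *a = malloc((size_t)N * N * sizeof *a);
        if (!a) return 1;
        for (size_t k = 0; k < (size_t)N * N; k++)
            a[k] = 1.0;

        clock_t t0 = clock();
        double s1 = sum_row_major(a);
        clock_t t1 = clock();
        double s2 = sum_col_major(a);
        clock_t t2 = clock();

        printf("row-major: sum=%.0f  %.3f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("col-major: sum=%.0f  %.3f s\n", s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(a);
        return 0;
    }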
Recommended publications
  • UNICOS® Installation Guide for CRAY J90™ Series SG-5271 9.0.2
    UNICOS® Installation Guide for CRAY J90™ Series SG-5271 9.0.2. Cray Research, Inc. Copyright © 1996 Cray Research, Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Research, Inc. Portions of this product may still be in development. The existence of those portions still in development is not a commitment of actual release or support by Cray Research, Inc. Cray Research, Inc. assumes no liability for any damages resulting from attempts to use any functionality or documentation not officially released and supported. If it is released, the final form and the time of official release and start of support is at the discretion of Cray Research, Inc. Autotasking, CF77, CRAY, Cray Ada, CRAY Y-MP, CRAY-1, HSX, SSD, UniChem, UNICOS, and X-MP EA are federally registered trademarks and CCI, CF90, CFT, CFT2, CFT77, COS, Cray Animation Theater, CRAY C90, CRAY C90D, Cray C++ Compiling System, CrayDoc, CRAY EL, CRAY J90, Cray NQS, Cray/REELlibrarian, CraySoft, CRAY T90, CRAY T3D, CrayTutor, CRAY X-MP, CRAY XMS, CRAY-2, CRInform, CRI/TurboKiva, CSIM, CVT, Delivering the power ..., DGauss, Docview, EMDS, HEXAR, IOS, LibSci, MPP Apprentice, ND Series Network Disk Array, Network Queuing Environment, Network Queuing Tools, OLNET, RQS, SEGLDR, SMARTE, SUPERCLUSTER, SUPERLINK, Trusted UNICOS, and UNICOS MAX are trademarks of Cray Research, Inc. Anaconda is a trademark of Archive Technology, Inc. EMASS and ER90 are trademarks of EMASS, Inc. EXABYTE is a trademark of EXABYTE Corporation. GL and OpenGL are trademarks of Silicon Graphics, Inc.
  • Cray Research Software Report
    Cray Research Software Report. Irene M. Qualters, Cray Research, Inc., 655F Lone Oak Drive, Eagan, Minnesota 55121. ABSTRACT: This paper describes the Cray Research Software Division status as of Spring 1995 and gives directions for future hardware and software architectures.
    1 Introduction. This report covers recent Supercomputer experiences with Cray Research products and lays out architectural directions for the future. It summarizes early customer experience with the latest CRAY T90 and CRAY J90 systems, outlines directions over the next five years, and gives specific plans for '95 deliveries.
    2 Customer Status. Cray Research enjoyed record volumes in 1994, expanding its installed base by 20% to more than 600 systems. To accomplish this, we shipped 40% more systems than in 1993 (our previous record year). This trend will continue, with similar percentage increases in 1995, as we expand into new application areas, including finance, multimedia, and "real time." In the face of this volume, software reliability metrics show consistent improvements. While total incoming problem reports showed a modest decrease in 1994, our focus on MTTI (mean time to interrupt) for our largest systems yielded a doubling in reliability by year end.
    Single-CPU speedups that can be anticipated based on code performance on CRAY C90 systems:
    CRAY C90 speed        Speedup on CRAY T90s
    Under 100 MFLOPS      1.4x
    200 to 400 MFLOPS     1.6x
    Over 600 MFLOPS       1.75x
    The price/performance of CRAY T90 systems shows substantial improvements. For example, LINPACK CRAY T90 price/performance is 3.7 times better than on CRAY C90 systems. CRAY T94 single-CPU ratios to CRAY C90 speeds:
    • LINPACK 1000 x 1000 -> 1.75x
    • NAS Parallel Benchmarks (Class A) -> 1.48 to 1.67x
    • Perfect Benchmarks -> 1.3 to 1.7x
  • The Gemini Network
    The Gemini Network Rev 1.1 Cray Inc. © 2010 Cray Inc. All Rights Reserved. Unpublished Proprietary Information. This unpublished work is protected by trade secret, copyright and other laws. Except as permitted by contract or express written permission of Cray Inc., no part of this work or its content may be used, reproduced or disclosed in any form. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. Autotasking, Cray, Cray Channels, Cray Y-MP, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, CCI, CCMT, CF77, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Ada, Cray Animation Theater, Cray APP, Cray Apprentice2, Cray C90, Cray C90D, Cray C++ Compiling System, Cray CF90, Cray EL, Cray Fortran Compiler, Cray J90, Cray J90se, Cray J916, Cray J932, Cray MTA, Cray MTA-2, Cray MTX, Cray NQS, Cray Research, Cray SeaStar, Cray SeaStar2, Cray SeaStar2+, Cray SHMEM, Cray S-MP, Cray SSD-T90, Cray SuperCluster, Cray SV1, Cray SV1ex, Cray SX-5, Cray SX-6, Cray T90, Cray T916, Cray T932, Cray T3D, Cray T3D MC, Cray T3D MCA, Cray T3D SC, Cray T3E, Cray Threadstorm, Cray UNICOS, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray X-MP, Cray XMS, Cray XMT, Cray XR1, Cray XT, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray Y-MP EL, Cray-1, Cray-2, Cray-3, CrayDoc, CrayLink, Cray-MP, CrayPacs, CrayPat, CrayPort, Cray/REELlibrarian, CraySoft, CrayTutor, CRInform, CRI/TurboKiva, CSIM, CVT, Delivering the power…, Dgauss, Docview, EMDS, GigaRing, HEXAR, HSX, IOS, ISP/Superlink, LibSci, MPP Apprentice, ND Series Network Disk Array, Network Queuing Environment, Network Queuing Tools, OLNET, RapidArray, RQS, SEGLDR, SMARTE, SSD, SUPERLINK, System Maintenance and Remote Testing Environment, Trusted UNICOS, TurboKiva, UNICOS MAX, UNICOS/lc, and UNICOS/mp are trademarks of Cray Inc.
  • Performance Evaluation of the Cray X1 Distributed Shared Memory Architecture
    Performance Evaluation of the Cray X1 Distributed Shared Memory Architecture. Tom Dunigan, Jeffrey Vetter, Pat Worley, Oak Ridge National Laboratory.
    Highlights: Motivation -- current application requirements exceed contemporary computing capabilities; the Cray X1 offered a 'new' system balance. Cray X1 architecture overview -- node architecture, distributed shared memory interconnect, programmer's view. Performance evaluation -- microbenchmarks pinpoint differences across architectures; several applications show marked improvement.
    ORNL is focused on diverse, grand challenge scientific applications: SciDAC Genomes to Life, astrophysics, nanophase materials, SciDAC climate, SciDAC fusion, SciDAC chemistry. Application characteristics vary dramatically!
    Climate case study: CCSM simulation resource projections. Science drivers: regional detail / comprehensive model. [Chart: machine (Tflops) and data (Tbytes) requirements grow from roughly 3 Tflops / 1 Tbyte to hundreds of Tflops and Tbytes as dynamic vegetation, interactive stratospheric chemistry, biogeochemistry, eddy-resolving ocean, cloud-resolving atmosphere, and tropospheric chemistry are added.]
    CCSM coupled model resolution configurations, 2002/2003 vs. 2008/2009: Atmosphere 230 km L26 vs. 30 km L96; Land 50 km vs. 5 km; Ocean 100 km L40 vs. 10 km L80; Sea ice 100 km vs. 10 km; Model years/day 8 vs. 8; National resource (dedicated TF) 3 vs. 750; Storage (TB/century) 1 vs. 250. The blue line represents the total national resource dedicated to CCSM simulations and its expected future growth to meet demands of increased model complexity; the red line shows the data volume generated for each century simulated. At 2002-3 scientific complexity, a century simulation required 12.5 days.
    Engaged in technical assessment of diverse architectures for our applications: Cray X1; IBM SP3, p655, p690; Intel Itanium, Xeon; SGI Altix; IBM POWER5; FPGAs; IBM Federation. Planned assessments: Cray X1e, Cray X2, Cray Red Storm, IBM BlueGene/L, optical processors, processors-in-memory, multithreading, array processors, etc.
  • CRAY T90 Series IEEE Floating Point Migration Issues and Solutions
    CRAY T90 Series IEEE Floating Point Migration Issues and Solutions. Philip G. Garnatz, Cray Research, Inc., Eagan, Minnesota, U.S.A. ABSTRACT: Migration to the new CRAY T90 series with IEEE floating-point arithmetic presents a new challenge to applications programmers. More precision in the mantissa and less range in the exponent will likely raise some numerical differences issues. A step-by-step process will be presented to show how to isolate these numerical differences. Data files will need to be transferred and converted from a Cray format PVP system to and from a CRAY T90 series system with IEEE. New options to the assign command allow for transparent reading and writing of files from the other types of system.
    Introduction. This paper is intended for the user services personnel and help desk staff who will be asked questions by programmers and users who will be moving code to the CRAY T90 series IEEE from another Cray PVP architecture. The issues and solutions presented here are intended to be a guide to the most frequent problems that programmers may encounter.
    What is the numerical model? New CRAY T90 series IEEE programmers might notice different answers when running on the IEEE machine than on a system with traditional Cray format floating-point arithmetic. The IEEE format has the following characteristics: [Figure 2: Cray format floating-point layout -- exponent sign, exponent, mantissa sign, mantissa] ... and other computational entities such as workstations and graphic displays. Other benefits are as follows:
    • Greater precision. An IEEE floating-point number provides approximately 16 decimal digits of precision; this is about one and a half digits more precise than Cray format numbers.
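    To make the precision comparison tangible, here is a short sketch (my addition, not part of the excerpted paper) that prints the IEEE 754 double-precision parameters exposed by the standard C header <float.h>; the 53-bit mantissa is where the "approximately 16 decimal digits" figure comes from.

        #include <stdio.h>
        #include <float.h>

        /* Print IEEE double-precision characteristics visible from standard C. */
        int main(void)
        {
            printf("mantissa bits      : %d\n", DBL_MANT_DIG);   /* 53 for IEEE double   */
            printf("decimal digits     : %d\n", DBL_DIG);        /* 15 guaranteed digits */
            printf("machine epsilon    : %g\n", DBL_EPSILON);    /* ~2.22e-16            */
            printf("max exponent (10^) : %d\n", DBL_MAX_10_EXP); /* 308                  */
            printf("min exponent (10^) : %d\n", DBL_MIN_10_EXP); /* -307                 */
            return 0;
        }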
  • Recent Supercomputing Development in Japan
    Supercomputing in Japan. Yoshio Oyanagi, Dean, Faculty of Information Science, Kogakuin University.
    Generations:
    • Primordial Ages (1970's): Cray-1, 75APU, IAP
    • 1st Generation (1H of 1980's): Cyber 205, XMP, S810, VP200, SX-2
    • 2nd Generation (2H of 1980's): YMP, ETA-10, S820, VP2600, SX-3, nCUBE, CM-1
    • 3rd Generation (1H of 1990's): C90, T3D, Cray-3, S3800, VPP500, SX-4, SP-1/2, CM-5, KSR2 (HPC ventures went out)
    • 4th Generation (2H of 1990's): T90, T3E, SV1, SP-3, Starfire, VPP300/700/5000, SX-5, SR2201/8000, ASCI (Red, Blue)
    • 5th Generation (1H of 2000's): ASCI, TeraGrid, BlueGene/L, X1, Origin, Power4/5, ES, SX-6/7/8, PP HPC2500, SR11000, ...
    Primordial Ages (1970's): 1974 DAP, BSP and HEP started; 1975 ILLIAC IV becomes operational; 1976 Cray-1 delivered to LANL, 80 MHz, 160 MF; 1976 FPS AP-120B delivered; 1977 FACOM 230-75 APU, 22 MF; 1978 HITAC M-180 IAP; 1978 PAX project started (Hoshino and Kawai); 1979 HEP operational as a single processor; 1979 HITAC M-200H IAP, 48 MF; 1982 NEC ACOS-1000 IAP, 28 MF; 1982 HITAC M280H IAP, 67 MF.
    Characteristics of Japanese SC's: 1. Manufactured by main-frame vendors with semiconductor facilities (not ventures). 2. Vector processors are attached to mainframes. 3. HITAC IAP: a) memory-to-memory, b) summation, inner product and 1st-order recurrence can be vectorized, c) vectorization of loops with IF's (M280). 4. No high-performance parallel machines.
    1st Generation (1H of 1980's): 1981 FPS-164 (64 bits); 1981 CDC Cyber 205, 400 MF; 1982 Cray XMP-2, Steve Chen, 630 MF; 1982 Cosmic Cube in Caltech, Alliant FX/8 delivered, HEP installed; 1983 HITAC S-810/20, 630 MF; 1983 FACOM VP-200, 570 MF; 1983 Encore, Sequent and TMC founded, ETA spun off from CDC.
    1st Generation (1H of 1980's, continued): 1984 Multiflow founded; 1984 Cray XMP-4, 1260 MF; 1984 PAX-64J completed (Tsukuba); 1985 NEC SX-2, 1300 MF; 1985 FPS-264; 1985 Convex C1; 1985 Cray-2, 1952 MF; 1985 Intel iPSC/1, T414, NCUBE/1, Stellar, Ardent...; 1985 FACOM VP-400, 1140 MF; 1986 CM-1 shipped, FPS T-series (max 1 TF!!).
    Characteristics of Japanese SC in the 1st G.
  • System Programmer Reference (Cray SV1™ Series)
    ® System Programmer Reference (Cray SV1™ Series) 108-0245-003 Cray Proprietary (c) Cray Inc. All Rights Reserved. Unpublished Proprietary Information. This unpublished work is protected by trade secret, copyright, and other laws. Except as permitted by contract or express written permission of Cray Inc., no part of this work or its content may be used, reproduced, or disclosed in any form. U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE: The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. Autotasking, CF77, Cray, Cray Ada, Cray Channels, Cray Chips, CraySoft, Cray Y-MP, Cray-1, CRInform, CRI/TurboKiva, HSX, LibSci, MPP Apprentice, SSD, SuperCluster, UNICOS, UNICOS/mk, and X-MP EA are federally registered trademarks and Because no workstation is an island, CCI, CCMT, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Animation Theater, Cray APP, Cray C90, Cray C90D, Cray CF90, Cray C++ Compiling System, CrayDoc, Cray EL, CrayLink,
  • Appendix G Vector Processors
    G.1 Why Vector Processors? G-2; G.2 Basic Vector Architecture G-4; G.3 Two Real-World Issues: Vector Length and Stride G-16; G.4 Enhancing Vector Performance G-23; G.5 Effectiveness of Compiler Vectorization G-32; G.6 Putting It All Together: Performance of Vector Processors G-34; G.7 Fallacies and Pitfalls G-40; G.8 Concluding Remarks G-42; G.9 Historical Perspective and References G-43; Exercises G-49.
    G Vector Processors. Revised by Krste Asanovic, Department of Electrical Engineering and Computer Science, MIT.
    "I'm certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors. One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It's always best to come second when you can look at the mistakes the pioneers made." -- Seymour Cray, public lecture at Lawrence Livermore Laboratories on the introduction of the Cray-1 (1976). © 2003 Elsevier Science (USA). All rights reserved.
    G.1 Why Vector Processors? In Chapters 3 and 4 we saw how we could significantly increase the performance of a processor by issuing multiple instructions per clock cycle and by more deeply pipelining the execution units to allow greater exploitation of instruction-level parallelism. (This appendix assumes that you have read Chapters 3 and 4 completely; in addition, the discussion on vector memory systems assumes that you have read Chapter 5.) Unfortunately, we also saw that there are serious difficulties in exploiting ever larger degrees of ILP.
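    As a minimal sketch of the kind of loop a vector processor (or a vectorizing compiler) thrives on, here is the classic DAXPY kernel; this is my own illustration of the idea, not code from the appendix. Each iteration is independent, so the whole loop can be executed as vector instructions rather than one scalar instruction at a time.

        #include <stddef.h>

        /* DAXPY: y = a*x + y.  No loop-carried dependences, so a vector ISA
         * (or an auto-vectorizing compiler) can process many elements per
         * instruction -- the canonical vectorizable loop. */
        void daxpy(size_t n, double a, const double *x, double *y)
        {
            for (size_t i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }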
  • Cray Supercomputers Past, Present, and Future
    Cray Supercomputers Past, Present, and Future. Hewdy Pena Mercedes, Ryan Toukatly. Advanced Comp. Arch. 0306-722, November 2011.
    Cray Companies:
    • Cray Research, Inc. (CRI), 1972. Seymour Cray.
    • Cray Computer Corporation (CCC), 1989. Spin-off. Bankrupt in 1995.
    • Cray Research, Inc. bought by Silicon Graphics, Inc. (SGI) in 1996.
    • Cray Inc. formed when Tera Computer Company (pioneer in multi-threading technology) bought Cray Research, Inc. in 2000 from SGI.
    Seymour Cray:
    • Joined Engineering Research Associates (ERA) in 1950 and helped create the ERA 1103 (1953), also known as the UNIVAC 1103.
    • Joined the Control Data Corporation (CDC) in 1960 and collaborated in the design of the CDC 6600 and 7600.
    • Formed Cray Research Inc. in 1972 when CDC ran into financial difficulties.
    • First product was the Cray-1 supercomputer: faster than all other computers at the time; the first system was sold within a month for US$8.8 million; not the first system to use a vector processor, but the first to operate on data in registers instead of memory.
    Vector Processor:
    • A CPU that implements an instruction set that operates on one-dimensional arrays of data called vectors.
    • Appeared in the 1970s and formed the basis of most supercomputers through the 80s and 90s.
    • In the 60s the Solomon project of Westinghouse wanted to increase math performance by using a large number of simple math co-processors under the control of a single master CPU.
    • The University of Illinois used the principle on the ILLIAC IV.
  • NAS Parallel Benchmarks Results 3-95
    NAS Parallel Benchmarks Results 3-95. Report NAS-95-011, April 1995. Subhash Saini [1] and David H. Bailey [2], Numerical Aerodynamic Simulation Facility, NASA Ames Research Center, Mail Stop T27A-1, Moffett Field, CA 94035-1000, USA. E-mail: saini@nas.nasa.gov
    Abstract: The NAS Parallel Benchmarks (NPB) were developed in 1991 at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a "pencil and paper" fashion, i.e., the complete details of the problem are given in a NAS technical document. Except for a few restrictions, benchmark implementors are free to select the language constructs and implementation techniques best suited for a particular system. In this paper, we present new NPB performance results for the following systems: (a) Parallel-Vector Processors: CRAY C90, CRAY T90, and Fujitsu VPP500; (b) Highly Parallel Processors: CRAY T3D, IBM SP2-WN (Wide Nodes), and IBM SP2-TN2 (Thin Nodes 2); (c) Symmetric Multiprocessors: Convex Exemplar SPP1000, CRAY J90, DEC Alpha Server 8400 5/300, and SGI Power Challenge XL (75 MHz). We also present sustained performance per dollar for Class B LU, SP and BT benchmarks. We also mention future NAS plans for the NPB.
    [1] Subhash Saini is an employee of Computer Sciences Corporation. This work was funded through NASA contract NAS 2-12961. [2] David H. Bailey is an employee of NASA Ames Research Center.
    1. Introduction: The Numerical Aerodynamic Simulation (NAS) Program, located at NASA Ames Research Center, is a pathfinder in high-performance computing for NASA and is
  • Cray System Software Features for Cray X1 System ABSTRACT
    Cray System Software Features for Cray X1 System. Don Mason, Cray, Inc., 1340 Mendota Heights Road, Mendota Heights, MN 55120, [email protected]
    ABSTRACT: This paper presents an overview of the basic functionality of the Cray X1 system software. This includes operating system features, programming development tools, and support for programming models.
    Introduction: This paper is in four sections: the first section outlines the Cray software roadmap for the Cray X1 system and its follow-on products; the second section presents a building-block diagram of the Cray X1 system software components, organized by programming models, programming environments, and operating systems, then describes and details these categories; the third section lists functionality planned for upcoming system software releases; the final section lists currently available documentation, highlighting manuals of interest to current Cray T90, Cray SV1, or Cray T3E customers.
    Cray Software Roadmap: Cray's roadmap for platforms and operating systems is shown here (Figure 1: Cray Software Roadmap). The Cray X1 system is the first scalable vector processor system that combines the characteristics of a high-bandwidth vector machine like the Cray SV1 system with the scalability of a true MPP system like the Cray T3E system. The Cray X1e system follows the Cray X1 system, and the Cray X1e system is followed by the code-named Black Widow series systems. This new family of systems shares a common instruction set architecture. Paralleling the evolution of the Cray hardware, the Cray operating system software for the Cray X1 system builds upon technology developed in UNICOS and UNICOS/mk. The Cray X1 system operating system, UNICOS/mp, draws in particular on the architecture of UNICOS/mk by distributing functionality between nodes of the Cray X1 system in a manner analogous to the distribution across processing elements on a Cray T3E system.
  • A Comparison of Application Performance Across Cray Product Lines CUG San Jose, May 1997
    A Comparison of Application Performance Across Cray Product Lines. CUG San Jose, May 1997. R. Kent Koeninger, Software Division of Cray Research, a Silicon Graphics Company, 655F Lone Oak Drive, Eagan MN 55121, [email protected], www.cray.com
    ABSTRACT: This paper will compare standard benchmark and specific application performance across the CRAY T90, CRAY J90, CRAY T3E, and Origin 2000 product lines. KEYWORDS: Application performance, benchmarks, CRAY T90, CRAY J90, CRAY T3E, Origin 2000, LINPACK, NAS Parallel Benchmarks, Streams, STAR-CD, Gaussian94, PAM-CRASH.
    Introduction: The current product offerings from Cray Research are the CRAY T90, CRAY J90se, CRAY T3E 900, and Cray Origin 2000 systems. Each has advantages that are application dependent. This paper will give comparisons that help classify which applications are best suited for which products. Most performance comparisons in this paper are measured in Origin 2000 (O2K) processor equivalents: the run time on a single O2K processor divided by the run time on the platform in question. With this technique, larger numbers indicate better performance. The examples will show that there is no one product best suited for all applications. Some run better with the extremely high memory bandwidth of the CRAY T90 systems, some run better with the high scalability and excellent interprocessor latency of the CRAY T3E systems, some run better with the large caches and SMP scalability of the Origin systems, and some run better with the good vector price-performance throughput of CRAY J90 systems. By looking at which applications run best on which platforms, one can get a feeling for which platform might best suit other applications.
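    A small sketch of how such a normalization might be computed, under the reading that larger numbers mean better performance (the paper's own definition is the authority here); the run times below are made-up placeholders.

        #include <stdio.h>

        /* Origin 2000 (O2K) "processor equivalents": a platform that takes
         * half the time of one O2K CPU scores 2.0 (larger is better). */
        static double o2k_equivalents(double o2k_1cpu_seconds, double platform_seconds)
        {
            return o2k_1cpu_seconds / platform_seconds;
        }

        int main(void)
        {
            double t_o2k_1cpu = 1200.0;  /* hypothetical run time on one O2K CPU        */
            double t_platform = 300.0;   /* hypothetical run time on the system under test */
            printf("O2K processor equivalents: %.1f\n",
                   o2k_equivalents(t_o2k_1cpu, t_platform));
            return 0;
        }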