Intel Paragon XP/S Overview: Distributed-Memory MIMD


[Photo: Intel MP Paragon XP/S 150 at Oak Ridge National Labs]

Intel Paragon XP/S Overview
■ Distributed-memory MIMD multicomputer
■ 2D array of nodes, performing both OS functionality and user computation
  ● Main memory physically distributed among nodes (16–64 MB per node)
  ● Each node contains two Intel i860 XP processors: an application processor for the user's program and a message processor for inter-node communication
■ Balanced design: speed and memory capacity matched to the interconnection network, storage facilities, etc.
  ● Interconnect bandwidth scales with the number of nodes
  ● Efficient even with thousands of processors

Paragon XP/S Nodes
■ Network Interface Controller (NIC)
  ● Connects the node to its PMRC
  ● Parity-checked, full-duplex, with error checking
■ Message processor
  ● Intel i860 XP processor
  ● Handles all details of sending and receiving a message between nodes, including protocols, packetization, etc.
  ● Supports global operations including broadcast, synchronization, sum, min, and, or, etc. (illustrated in the sketch after the Paragon slides)
■ Application processor
  ● Intel i860 XP processor (42 MIPS, 50 MHz clock) to execute user programs
■ 16–64 MB of memory per node

Paragon XP/S Node Interconnection
■ 2D mesh chosen after extensive analytical studies and simulation
■ Paragon Mesh Routing Chip (PMRC / iMRC) routes traffic in the mesh
  ● 0.75 µm, triple-metal CMOS
  ● Routes traffic in four directions and to and from the attached node at > 200 MB/s
  ● 40 ns to make routing decisions and close the appropriate switches
  ● Transfers are parity checked, the router is pipelined, and routing is deadlock-free
■ Backplane is an active backplane of router chips rather than a mass of cables

Paragon XP/S Usage
■ OS is based on UNIX and provides distributed system services and full UNIX to every node
  ● System is divided into partitions: some for I/O, some for system services, the rest for user applications
■ Applications can run on an arbitrary number of nodes without change
  ● Run on a larger number of nodes to process larger data sets or to achieve the required performance
■ Users have client/server access and can submit jobs over a network or log in directly to any node
■ Comprehensive resource control utilities for allocating, tracking, and controlling system resources

Paragon XP/S Programming
■ MIMD architecture, but supports various programming models: SPMD, SIMD, MIMD, shared memory, vector shared memory
■ CASE tools including:
  ● Optimizing compilers for FORTRAN, C, C++, Ada, and Data Parallel FORTRAN
  ● Interactive Debugger
  ● Parallelization tools: FORGE, CAST
  ● Intel's ProSolver library of equation solvers
  ● Intel's Performance Visualization System (PVS)
  ● Performance Analysis Tools (PAT)
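The message-processor global operations listed above (broadcast, synchronization, sum, min, and, or) are the collective operations of a message-passing SPMD program. The Paragon's native interface was Intel's NX library; the sketch below is a minimal illustration in modern MPI rather than the Paragon's actual API, with arbitrary example values, showing a broadcast, two reductions, and a barrier.

```c
/* SPMD sketch of the global operations a Paragon-style message processor
 * handles: broadcast, barrier synchronization, and reductions (sum, min).
 * Expressed in modern MPI for illustration; the Paragon itself used NX.
 * Build (assuming an MPI installation): mpicc global_ops.c -o global_ops
 * Run: mpirun -np 4 ./global_ops
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Broadcast: node 0 distributes a parameter to every node. */
    double step = (rank == 0) ? 0.01 : 0.0;   /* arbitrary example value */
    MPI_Bcast(&step, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each node computes a local partial result (a trivial one here). */
    double local = step * rank;

    /* Global sum and global min, analogous to the Paragon's sum/min ops. */
    double sum, min;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&local, &min, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    /* Barrier: global synchronization point. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("nprocs=%d sum=%g min=%g\n", nprocs, sum, min);

    MPI_Finalize();
    return 0;
}
```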
Thinking Machines CM-5 Overview
■ Distributed-memory MIMD multicomputer
  ● SIMD or MIMD operation
■ Processing nodes are supervised by a control processor, which runs UNIX
  ● Control processor broadcasts blocks of instructions to the processing nodes and initiates execution
  ● SIMD operation: nodes are closely synchronized, blocks broadcast as needed
  ● MIMD operation: nodes fetch instructions independently and synchronize only as required by the algorithm
■ Nodes may be divided into partitions
  ● One control processor, called the partition manager, per partition
  ● Partitions may exchange data

Thinking Machines CM-5 Overview (cont.)
■ Other control processors, called I/O Control Processors, manage the system's I/O devices
  ● Scale to achieve the necessary I/O capacity
  ● DataVaults provide storage
■ Control processors in general
  ● Schedule user tasks, allocate resources, service I/O requests, handle accounting, security, etc.
  ● May execute some code
  ● No arithmetic accelerators, but additional I/O connections
  ● In a small system, one control processor may play a number of roles
  ● In a large system, control processors are often dedicated to particular tasks (partition manager, I/O control processor, etc.)

[Photo: Thinking Machines CM-5 at NCSA]

CM-5 Nodes
■ Processing nodes
  ● SPARC CPU (22 MIPS)
  ● 8–32 MB of memory
  ● Network interface
  ● (Optional) 4 pipelined vector processing units
    - Each can perform up to 32 million double-precision floating-point operations per second
    - Including divide and square root
■ Fully configured CM-5 would have
  ● 16,384 processing nodes
  ● 512 GB of memory
  ● Theoretical peak performance of 2 teraflops (arithmetic check after these slides)

CM-5 Networks
■ Control Network
  ● Tightly coupled communication services
  ● Optimized for fast response and low latency
  ● Functions: synchronizing processing nodes, broadcasts, reductions, parallel prefix operations
■ Data Network
  ● 4-ary hypertree, optimized for high bandwidth
  ● Functions: point-to-point communication for tens of thousands of items simultaneously
  ● Responsible for eventual delivery of accepted messages
■ Network Interface connects nodes or control processors to the Control or Data Network (memory-mapped control unit)

Tree Networks (Reference Material)
■ Binary tree
  ● 2^k – 1 nodes arranged into a complete binary tree of depth k–1
  ● Diameter is 2(k–1)
  ● Bisection width is 1
■ Hypertree
  ● Low diameter of a binary tree plus improved bisection width
  ● Hypertree of degree k and depth d
    - From the "front", looks like a k-ary tree of height d
    - From the "side", looks like an upside-down binary tree of height d
    - Join both views to get the complete network
  ● 4-ary hypertree of depth d (worked example after these slides)
    - 4^d leaves and 2^d(2^(d+1) – 1) nodes
    - Diameter is 2d
    - Bisection width is 2^(d+1)

CM-5 Usage
■ Runs Cmost, an enhanced version of SunOS
■ A user task sees a control processor acting as a Partition Manager (PM), a set of processing nodes, and inter-processor communication facilities
  ● The user task is a standard UNIX process running on the PM, and one on each of the processing nodes
  ● The CPU scheduler schedules the user task on all processors simultaneously
■ User tasks can read and write directly to the Control Network and Data Network
  ● Control Network has hardware for broadcast, reduction, parallel prefix operations, and barrier synchronization
  ● Data Network provides reliable, deadlock-free point-to-point communication

IBM SP2 Overview
■ Distributed-memory MIMD multicomputer
■ Scalable POWERparallel 1 (SP1)
  ● Development started February 1992, delivered to users in April 1993
■ Scalable POWERparallel 2 (SP2)
  ● 120-node systems delivered 1994
    - 4–128 nodes: RS/6000 workstation with POWER2 processor, 66.7 MHz, 267 MFLOPS
    - POWER2 used in RS/6000 workstations, gives compatibility with existing software
  ● 1997 version (NUMA):
    - High Node (SMP node, 16 nodes max): 2–8 PowerPC 604, 112 MHz, 224 MFLOPS, 64 MB–2 GB memory
    - Wide Node (128 nodes max): 1 P2SC (POWER2 Super Chip: the 8-chip POWER2 chip set implemented on a single chip), 135 MHz, 640 MFLOPS, 64 MB–2 GB memory
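Two sets of figures quoted above can be verified with short calculations: the counts for a 4-ary hypertree (the topology of the CM-5 Data Network) and the memory and peak performance of a fully configured CM-5. A minimal worked check, using only the formulas and per-node numbers from the slides; depth d = 3 is an arbitrary illustration:

```latex
% Worked check of the tree-network and fully-configured CM-5 figures above.
\documentclass{article}
\usepackage{amsmath}
\begin{document}

\paragraph{4-ary hypertree of depth $d$.}
\[
  \text{leaves} = 4^{d}, \qquad
  \text{nodes} = 2^{d}\,(2^{d+1}-1), \qquad
  \text{diameter} = 2d, \qquad
  \text{bisection width} = 2^{d+1}.
\]
For example, at depth $d=3$: $4^{3}=64$ leaves,
$2^{3}(2^{4}-1)=8\cdot 15=120$ nodes, diameter $6$, bisection width $16$.

\paragraph{Fully configured CM-5.}
\[
  16{,}384 \text{ nodes} \times 32\ \text{MB} = 512\ \text{GB of memory},
\]
\[
  16{,}384 \text{ nodes} \times 4\ \text{vector units} \times 32\ \text{MFLOPS}
  \approx 2.1\ \text{TFLOPS peak}.
\]

\end{document}
```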
[Photo: IBM SP2 at Oak Ridge National Labs]

IBM SP2 Overview (cont.)
■ RS/6000 as system console
■ SP2 runs various combinations of serial, parallel, interactive, and batch jobs
  ● Partition between job types can be changed
  ● High nodes — interactive nodes for code development and job submission
  ● Thin nodes — compute nodes
  ● Wide nodes — configured as servers, with extra memory, storage devices, etc.
■ A system "frame" contains 16 thin-processor nodes or 8 wide-processor nodes
  ● Includes redundant power supplies; nodes are hot swappable within the frame
  ● Includes a high-performance switch for low-latency, high-bandwidth communication

IBM SP2 Processors
■ POWER2 processor
  ● Various versions from 20 to 62.5 MHz
  ● RISC processor, load-store architecture
    - Floating-point multiply-add instruction with a latency of 2 cycles, pipelined so a new one can be initiated each cycle
    - Conditional branch to decrement and test a "count register" (without fixed-point-unit involvement), good for loop closings (see the loop sketch at the end of this section)
■ POWER2 processor chip set
  ● 8 semi-custom chips: Instruction Cache Unit, four Data Cache Units, Fixed-Point Unit (FXU), Floating-Point Unit (FPU), and Storage Control Unit
    - 2 execution units per FXU and FPU
    - Can execute 6 instructions per cycle: 2 FXU, 2 FPU, branch, condition register
    - Options: 4-word memory bus with 128 KB data cache, or 8-word bus with 256 KB

IBM SP2 Interconnection Network
■ General
  ● Multistage High Performance Switch (HPS) network, with extra stages added to keep bandwidth to each processor constant
  ● Message delivery
    - PIO for short messages, with low latency and minimal message overhead
    - DMA for long messages
  ● Multi-user support — hardware protection between partitions and users, guaranteed fairness of message delivery
■ Routing
  ● Packet switched = each packet may take a different route
  ● Cut-through = if the output is free, starts sending without buffering first
  ● Wormhole routing = buffers on a subpacket basis if buffering is necessary

IBM SP2 AIX Parallel Environment
■ Parallel Operating Environment — based on AIX, includes a Desktop interface
  ● Partition Manager to allocate nodes, copy tasks to nodes, invoke tasks, etc.
  ● Program Marker Array — (online) squares graphically represent program tasks
  ● System Status Array — (offline) squares show percent of CPU utilization
■ Parallel Message Passing Library
■ Visualization Tool — view online and offline performance

nCUBE Overview
■ Distributed-memory MIMD multicomputer (with hardware to make it look like a shared-memory multiprocessor)
■ History
  ● nCUBE 1 — 1985
  ● nCUBE 2 — 1989
    - 34 GFLOPS, scalable
    - ?–8192 processors
  ● nCUBE 3 — 1995
    - 1–6.5 TFLOPS, 65 TB memory, 24 TB/s hypercube interconnect, 1024 3 GB/s I/O channels, scalable
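The POWER2 figures above fit together as follows: two floating-point units, each completing one pipelined multiply-add (2 flops) per cycle, give 4 flops per cycle, so a 66.7 MHz SP2 node peaks at roughly 267 MFLOPS, the figure quoted in the SP2 overview. The sketch below shows the kind of loop this design targets: each iteration is a single multiply-add, and the loop-closing branch maps onto the count-register branch mentioned in the processor slide. It is an illustrative example only; the function and array names are not from the slides.

```c
/* daxpy-style loop: y[i] += a * x[i].
 * On POWER2 each iteration body is one fused multiply-add (2 flops), and the
 * compiler can close the loop with the branch-on-count ("count register")
 * instruction, so loop control needs no fixed-point-unit work.
 * Peak check: 2 FPUs x 2 flops per multiply-add x 66.7 MHz = ~267 MFLOPS,
 * the per-node figure quoted for the 66.7 MHz SP2 above.
 * Names (daxpy, N) are illustrative, not taken from the original slides.
 */
#include <stdio.h>

#define N 1000

static void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];     /* one multiply-add per iteration */
}

int main(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; i++) {
        x[i] = (double)i;
        y[i] = 1.0;
    }

    daxpy(N, 0.5, x, y);

    printf("y[10] = %g\n", y[10]);   /* expect 1.0 + 0.5 * 10 = 6.0 */
    return 0;
}
```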