CS 594, Spring 2003
Lecture 1: Overview of High-Performance Computing
Jack Dongarra
Computer Science Department
University of Tennessee

Simulation: The Third Pillar of Science

‹ Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build the system.
‹ Limitations:
  ¾ Too difficult -- build large wind tunnels.
  ¾ Too expensive -- build a throw-away passenger jet.
  ¾ Too slow -- wait for climate or galactic evolution.
  ¾ Too dangerous -- weapons, drug design, climate experimentation.
‹ Computational science paradigm:
  3) Use high-performance computer systems to simulate the phenomenon
     » Based on known physical laws and efficient numerical methods.

Computational Science Definition

Computational science is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems. Computational science fuses three distinct elements:
¾ numerical algorithms and modeling and simulation software developed to solve science (e.g., biological, physical, and social), engineering, and humanities problems;
¾ advanced system hardware, software, networking, and data management components developed through computer and information science to solve computationally demanding problems;
¾ the computing infrastructure that supports both science and engineering problem solving and developmental computer and information science.

Some Particularly Challenging Computations

‹ Science
  ¾ Global climate modeling
  ¾ Astrophysical modeling
  ¾ Biology: genomics; protein folding; drug design
  ¾ Computational chemistry
  ¾ Computational material sciences and nanosciences
‹ Engineering
  ¾ Crash simulation
  ¾ Semiconductor design
  ¾ Earthquake and structural modeling
  ¾ Computational fluid dynamics (airplane design)
  ¾ Combustion (engine design)
‹ Business
  ¾ Financial and economic modeling
  ¾ Transaction processing, web services and search engines
‹ Defense
  ¾ Nuclear weapons -- test by simulations
  ¾ Cryptography

Why Turn to Simulation?

‹ When the problem is too . . .
  ¾ Complex
  ¾ Large / small
  ¾ Expensive
  ¾ Dangerous
‹ . . . to do any other way.

Complex Systems Engineering

‹ R&D Team (Grand Challenge driven): Ames Research Center, Glenn Research Center, Langley Research Center.
‹ Engineering Team (operations driven): Johnson Space Center, Marshall Space Flight Center, industry partners.
‹ Supporting infrastructure: analysis and visualization; Grand Challenges; computation management (AeroDB, ILab); next-generation codes and algorithms; modeling environment (experts and tools); compilers, scaling and porting, parallelization tools; storage and networks.
‹ Flagship codes: OVERFLOW (Honorable Mention, NASA Software of the Year), INS3D (NASA Software of the Year; turbopump analysis), CART3D (NASA Software of the Year; STS-107 analysis).


Source: Walt Brooks, NASA

Economic Impact of HPC

‹ Airlines:
  ¾ System-wide logistics optimization systems on parallel systems.
  ¾ Savings: approx. $100 million per airline per year.
‹ Automotive design:
  ¾ Major automotive companies use large systems (500+ CPUs) for:
    » CAD-CAM, crash testing, structural integrity and aerodynamics.
    » One company has a 500+ CPU parallel system.
  ¾ Savings: approx. $1 billion per company per year.
‹ Semiconductor industry:
  ¾ Semiconductor firms use large systems (500+ CPUs) for:
    » device electronics simulation and logic validation.
  ¾ Savings: approx. $1 billion per company per year.
‹ Securities industry:
  ¾ Savings: approx. $15 billion per year for U.S. home mortgages.

Pretty Pictures


Why Turn to Simulation?

‹ Climate / weather modeling
‹ Data intensive problems (data mining, oil reservoir simulation)
‹ Problems with large length and time scales (cosmology)

Titov's Tsunami Simulation

‹ Global tsunami model (animation: tsunami-nw10.mov)


Cost (Economic Loss) to Evacuate 1 Mile of Coastline: $1M

‹ We now over-warn by a factor of 3.
‹ Average over-warning is 200 miles of coastline, or $200M per event.

24 Hour Forecast at Fine Grid Spacing

‹ This problem demands a complete, STABLE environment (hardware and software):
  ¾ 100 TF to stay a factor of 10 ahead of the weather
  ¾ Streaming observations
  ¾ Massive storage and metadata query
  ¾ Fast networking
  ¾ Visualization
  ¾ Data mining for feature detection


Units of High Performance Computing

  1 Mflop/s   1 Megaflop/s   10^6  Flop/sec
  1 Gflop/s   1 Gigaflop/s   10^9  Flop/sec
  1 Tflop/s   1 Teraflop/s   10^12 Flop/sec
  1 Pflop/s   1 Petaflop/s   10^15 Flop/sec

  1 MB        1 Megabyte     10^6  Bytes
  1 GB        1 Gigabyte     10^9  Bytes
  1 TB        1 Terabyte     10^12 Bytes
  1 PB        1 Petabyte     10^15 Bytes

High-Performance Computing Today

‹ In the past decade, the world has experienced one of the most exciting periods in computer development.
‹ Microprocessors have become smaller, denser, and more powerful.
‹ The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.


Technology Trends: Microprocessor Capacity

‹ Moore's Law: 2X transistors/chip every 1.5 years.
‹ Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
‹ Microprocessors have become smaller, denser, and more powerful.
‹ Not just processors: bandwidth, storage, etc.

Eniac and My Laptop

                         Eniac        My Laptop
  Year                   1945         2002
  Devices                18,000       6,000,000,000
  Weight (kg)            27,200       0.9
  Size (m^3)             68           0.0028
  Power (watts)          20,000       60
  Cost (1999 dollars)    4,630,000    1,000
  Memory (bytes)         ~200         1,073,741,824
  Performance (FP/sec)   800          5,000,000,000

No Exponential is Forever, But Perhaps We Can Delay it Forever

  Processor                  Year of Introduction   Transistors
  4004                       1971                   2,250
  8008                       1972                   2,500
  8080                       1974                   5,000
  8086                       1978                   29,000
  286                        1982                   120,000
  Intel386 processor         1985                   275,000
  Intel486 processor         1989                   1,180,000
  Intel Pentium processor    1993                   3,100,000
  Intel Pentium II           1997                   7,500,000
  Intel Pentium III          1999                   24,000,000
  Intel Pentium 4            2000                   42,000,000
  Intel Itanium              2002                   220,000,000
  Intel Itanium 2            2003                   410,000,000

Today's Processors

‹ Some equivalences for the microprocessors of today:
  ¾ Voltage level
    » A flashlight (~1 volt)
  ¾ Current level
    » An oven (~250 amps)
  ¾ Power level
    » A light bulb (~100 watts)
  ¾ Area
    » A postage stamp (~1 square inch)

Moore's "Law"

‹ Something doubles every 18-24 months.
‹ That something was originally the number of transistors.
‹ It is also commonly applied to performance.
‹ Moore's Law is an exponential.
  ¾ Exponentials can not last forever.
    » However, Moore's Law has held remarkably true for ~30 years.
‹ BTW: it is really an empiricism rather than a law (not a derogatory comment).

Percentage of Peak

‹ A rule of thumb that often applies: a contemporary RISC processor, for a spectrum of applications, delivers (i.e., sustains) 10% of peak performance.
‹ There are exceptions to this rule, in both directions.
‹ Why such low efficiency? There are two primary reasons behind the disappointing percentage of peak:
  ¾ IPC (in)efficiency
  ¾ Memory (in)efficiency
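As a rough sanity check on the 18-24 month figure, the short Python sketch below backs the average doubling period out of two endpoints of the Intel transistor-count table two slides back (the 4004 and the Itanium 2); the calculation is illustrative only.

    # Back out the average doubling period from two rows of the Intel
    # transistor-count table (4004 in 1971, Itanium 2 in 2003).
    import math

    t0, n0 = 1971, 2250          # Intel 4004
    t1, n1 = 2003, 410_000_000   # Intel Itanium 2

    doublings = math.log2(n1 / n0)
    period_months = (t1 - t0) * 12 / doublings

    print(f"{doublings:.1f} doublings in {t1 - t0} years")
    print(f"average doubling period: {period_months:.1f} months")
    # -> about 22 months, consistent with the 18-24 month rule of thumb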

IPC

‹ Today the theoretical IPC (instructions per cycle) is 4 in most contemporary RISC processors (6 in Itanium).
‹ Detailed analysis for a spectrum of applications indicates that the average achieved IPC is 1.2-1.4.
‹ We are leaving ~75% of the possible performance on the table…

Why Fast Machines Run Slow

‹ Latency
  ¾ Waiting for access to memory or other parts of the system.
‹ Overhead
  ¾ Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform.
‹ Starvation
  ¾ Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
‹ Contention
  ¾ Delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.


Extra Transistors

‹ With the increasing number of transistors per chip from reduced design rules, do we:
  ¾ Add more functional units?
    » Little gain, owing to poor IPC for today's codes, compilers and ISAs.
  ¾ Add more cache?
    » This generally helps but does not solve the problem.
  ¾ Add more processors?
    » This helps somewhat.
    » This hurts somewhat.

Processor vs. Memory Speed

‹ In 1986
  ¾ processor cycle time ~120 nanoseconds
  ¾ DRAM access time ~140 nanoseconds
    » 1:1 ratio
‹ In 1996
  ¾ processor cycle time ~4 nanoseconds
  ¾ DRAM access time ~60 nanoseconds
    » 20:1 ratio
‹ In 2002
  ¾ processor cycle time ~0.6 nanoseconds
  ¾ DRAM access time ~50 nanoseconds
    » 100:1 ratio

Latency in a Single System

[Chart: memory access time vs. CPU clock period, 1997-2009. The memory system access time stays roughly flat while the CPU clock period shrinks, so the ratio of memory access time to CPU clock period keeps climbing -- "THE WALL".]

Memory Hierarchy

‹ Typical latencies for today's technology:

  Hierarchy level    Processor clocks
  Register           1
  L1 cache           2-3
  L2 cache           6-12
  L3 cache           14-40
  Near memory        100-300
  Far memory         300-900
  Remote memory      O(10^3)
  Message-passing    O(10^3)-O(10^4)
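To see why the table matters, here is a minimal average-memory-access-time sketch in Python; the latencies come from the table above, while the hit-rate distribution is an invented example, not a measurement.

    # Average access cost (in processor clocks) for a made-up mix of hits
    # across the hierarchy levels listed above.
    def average_access_time(latencies, fractions):
        # fractions[i]: share of accesses satisfied at level i (sums to 1)
        return sum(l * f for l, f in zip(latencies, fractions))

    latency  = [1, 3, 12, 200]            # register, L1, L2, near memory (clocks)
    fraction = [0.40, 0.45, 0.10, 0.05]   # assumed access distribution

    print(f"{average_access_time(latency, fraction):.1f} clocks on average")
    # Even the 5% of accesses that reach memory contribute 10 of the ~13
    # clocks -- this is "THE WALL" in miniature.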

Memory Hierarchy

‹ Most programs have a high degree of locality in their accesses
  ¾ spatial locality: accessing things nearby previous accesses
  ¾ temporal locality: reusing an item that was previously accessed
‹ The memory hierarchy tries to exploit locality

  Level                                     Typical speed   Typical size
  processor (on-chip registers and cache)   1 ns            B
  second-level cache (SRAM)                 10 ns           KB
  main memory (DRAM)                        100 ns          MB
  secondary storage (disk)                  10 ms           GB
  tertiary storage (disk/tape)              10 sec          TB

Memory Bandwidth

‹ To provide bandwidth to the processor, the bus needs to be either faster or wider
‹ Busses are limited to perhaps 400-800 MHz
‹ Links are faster
  ¾ Single-ended: 0.5-1 GT/s
  ¾ Differential: 2.5-5.0 GT/s (future)
  ¾ Increased link frequencies increase error rates, requiring coding and redundancy, thus increasing power and die size without helping bandwidth
‹ Making things wider requires pin-out (Si real estate) and power
  ¾ Both power and pin-out are serious issues
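A back-of-the-envelope version of the "faster or wider" tradeoff; the 400-800 MHz bus speeds are from the slide, while the 64-bit width is an assumed example.

    # Peak bus bandwidth = width x clock rate.
    def peak_bandwidth_bytes(width_bytes, clock_hz):
        return width_bytes * clock_hz

    for mhz in (400, 800):
        gb_per_s = peak_bandwidth_bytes(8, mhz * 1e6) / 1e9
        print(f"64-bit bus at {mhz} MHz: {gb_per_s:.1f} GB/s")
    # Doubling either the clock or the width doubles the peak; "wider"
    # costs pins and power, "faster" costs error rate and power.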

What's Needed for Memory?

‹ This is a physics problem
  ¾ "Money can buy bandwidth, but latency is forever!"
‹ There are possibilities being studied
  ¾ Generally they involve putting "processing" at the memory
    » Reduces the latency and increases the bandwidth
‹ Fundamental architectural research is lacking
  ¾ Government funding is a necessity

Processor in Memory (PIM)

[Diagram: a PIM chip combining stacked memory arrays, sense amps, decode, and node logic on one die.]

‹ PIM merges logic with memory
  ¾ Wide ALUs next to the row buffer
  ¾ Optimized for memory throughput, not ALU utilization
‹ PIM has the potential of riding Moore's law while
  ¾ greatly increasing effective memory bandwidth,
  ¾ providing many more concurrent execution threads,
  ¾ reducing latency,
  ¾ reducing power, and
  ¾ increasing overall system efficiency
‹ It may also simplify programming and system design

Internet -- 4th Revolution in Telecommunications

‹ Telephone, radio, television
‹ Growth in the Internet outstrips the others
‹ Exponential growth since 1985
‹ Traffic doubles every 100 days

[Chart: growth of Internet hosts and domain names, Sept. 1969 - Sept. 2002, reaching roughly 200,000,000 hosts.]

The Web Phenomenon

‹ 90-93: Web invented
‹ U of Illinois Mosaic released March 94: ~0.1% of traffic
‹ September 93: ~1% of traffic, w/200 sites
‹ June 94: ~10% of traffic, w/2,000 sites
‹ Today: 60% of traffic, w/2,000,000 sites
‹ Every organization, company, school

Internet On Everything

‹ Wireless communication
‹ Web tablets
‹ Embedded computers in things, all tied together
  ¾ Books, furniture, milk cartons, etc.
‹ Smart appliances
  ¾ Refrigerator, scale, etc.

Peer to Peer Computing

‹ Peer-to-peer is a style of networking in which a group of computers communicate directly with each other.
‹ Home computer in the utility room, next to the water heater and furnace.
‹ BitTorrent
  ¾ http://en.wikipedia.org/wiki/Bittorrent


SETI@home: Global Distributed Computing

‹ Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
‹ When their computer is idle or being wasted, this software will download a 300 kilobyte chunk of data for analysis. It performs about 3 Tflops for each client in 15 hours.
‹ The results of this analysis are sent back to the SETI team, combined with those of thousands of other participants.
‹ Today a number of companies are trying this for profit.

SETI@home

‹ Running on 500,000 PCs, ~1000 CPU years per day
  ¾ 485,821 CPU years so far
‹ Sophisticated data & signal processing analysis
‹ Distributes datasets from the Arecibo radio telescope
‹ Largest distributed computation project in existence
  ¾ Averaging 40 Tflop/s


‹ Google query attributes
  ¾ 150M queries/day (2,000/second)
  ¾ 100 countries
  ¾ 8.0B documents in the index
‹ Data centers
  ¾ 100,000 Linux systems in data centers around the world
    » 15 TFlop/s and 1000 TB total capability
    » 40-80 1U/2U servers per cabinet
    » 100 MB Ethernet switches per cabinet, with gigabit Ethernet uplink
  ¾ growth from 4,000 systems (June 2000)
    » 18M queries then
‹ Performance and operation
  ¾ simple reissue of failed commands to new servers
  ¾ no performance debugging
    » problems are not reproducible
‹ The ranking computation is an eigenvalue problem, Ax = λx, with n = 8x10^9
  ¾ The matrix is the transition probability matrix of the Markov chain; Ax = x
  ¾ Forward links are referred to in the rows; back links are referred to in the columns
  ¾ (see: MathWorks Cleve's Corner)

Source: Monika Henzinger, Google & Cleve Moler

Next Generation Web

‹ To treat CPU cycles and software like commodities.
‹ Enable the coordinated use of geographically distributed resources -- in the absence of central control and existing trust relationships.
‹ Computing power is produced much like utilities such as power and water are produced for consumers.
‹ Users will have access to "power" on demand.
‹ This is one of our efforts at UT.
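To make the eigenvalue formulation on the Google slide above concrete, here is a toy power-iteration sketch in Python. The 4-page link matrix and the 0.85 damping factor are invented for illustration (the real problem has n = 8x10^9); this is not Google's actual code.

    import numpy as np

    # A[i, j] = probability of following a link from page j to page i
    # (each column sums to 1, so A is the transition matrix of a Markov chain).
    A = np.array([[0.0, 0.5, 0.0, 0.0],
                  [0.5, 0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 1.0, 0.0]])

    n, d = 4, 0.85                 # damping keeps the iteration well-behaved
    x = np.full(n, 1.0 / n)        # start from the uniform distribution
    for _ in range(100):           # power iteration: x <- d*A*x + (1-d)/n
        x = d * (A @ x) + (1 - d) / n

    print("ranking vector x with Mx = x:", np.round(x, 3))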

Where Has This Performance Improvement Come From?

‹ Technology?
‹ Organization?
‹ Instruction set architecture?
‹ Software?
‹ Some combination of all of the above?

Impact of Device Shrinkage

‹ What happens when the feature size (transistor size) shrinks by a factor of x?
‹ Clock rate goes up by x, because wires are shorter
  ¾ actually less than x, because of power consumption
‹ Transistors per unit area go up by x^2
‹ Die size also tends to increase
  ¾ typically by another factor of ~x
‹ Raw computing power of the chip goes up by ~x^4 !
  ¾ of which x^3 is devoted either to parallelism or locality
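Applying those scaling rules to one hypothetical shrink of x = 1.4 (an illustrative value, not one from the slide):

    x = 1.4  # hypothetical feature-size shrink factor
    print(f"clock rate:           ~{x:.1f}x (less in practice, because of power)")
    print(f"transistors per area: ~{x**2:.1f}x")
    print(f"raw computing power:  ~{x**4:.1f}x, of which ~{x**3:.1f}x goes to parallelism or locality")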

How Fast Can a Serial Computer Be?

‹ Consider a 1 Tflop/s sequential machine with 1 TB of memory:
  ¾ data must travel some distance, r, to get from memory to the CPU
  ¾ to get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3x10^8 m/s
  ¾ so r < c/10^12 = 0.3 mm
‹ Now put 1 TB of storage in a 0.3 mm^2 area
  ¾ each word occupies about 3 square Angstroms, the size of a small atom

Processor-Memory Problem

‹ Processors issue instructions roughly every nanosecond.
‹ DRAM can be accessed roughly every 100 nanoseconds (!).
‹ DRAM cannot keep processors busy! And the gap is growing:
  ¾ processors are getting faster by 60% per year
  ¾ DRAM is getting faster by 7% per year
    » (SDRAM and EDO RAM might help, but not enough)
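Written out, the speed-of-light bound from the serial-computer slide above is (using only the numbers on the slide):

    \[
      r \;<\; \frac{c}{10^{12}\,\mathrm{s^{-1}}}
        \;=\; \frac{3\times 10^{8}\ \mathrm{m/s}}{10^{12}\ \mathrm{s^{-1}}}
        \;=\; 3\times 10^{-4}\ \mathrm{m}
        \;=\; 0.3\ \mathrm{mm}.
    \]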

Processor-DRAM Memory Gap

[Chart, 1980-2004: "Moore's Law" -- CPU performance grows ~60%/yr (2X/1.5yr) while DRAM performance grows ~9%/yr (2X/10 yrs), so the processor-memory performance gap grows about 50% per year.]

Why Parallel Computing?

‹ Desire to solve bigger, more realistic application problems.
‹ Fundamental limits are being approached.
‹ More cost effective solution.

Principles of Parallel Computing

‹ Parallelism and Amdahl's Law
‹ Granularity
‹ Locality
‹ Load balance
‹ Coordination and synchronization
‹ Performance modeling

All of these things make parallel programming even harder than sequential programming.

"Automatic" Parallelism in Modern Machines

‹ Bit level parallelism
  ¾ within floating point operations, etc.
‹ Instruction level parallelism (ILP)
  ¾ multiple instructions execute per clock cycle
‹ Memory system parallelism
  ¾ overlap of memory operations with computation
‹ OS parallelism
  ¾ multiple jobs run in parallel on commodity SMPs

There are limits to all of these -- for very high performance, the user needs to identify, schedule and coordinate parallel tasks.

Finding Enough Parallelism

‹ Suppose only part of an application seems parallel.
‹ Amdahl's law:
  ¾ let fs be the fraction of work done sequentially; (1 - fs) is the fraction that is parallelizable
  ¾ let N = number of processors
‹ Even if the parallel part speeds up perfectly, performance may be limited by the sequential part.

Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

  tN = (fp/N + fs) t1      effect of multiple processors on run time
  S  = 1/(fs + fp/N)       effect of multiple processors on speedup

where:
  fs = serial fraction of code
  fp = parallel fraction of code = 1 - fs
  N  = number of processors
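A minimal sketch of the speedup expression above, with a few serial fractions evaluated at N = 250 processors (the same processor count used in the illustration on the next slide):

    def amdahl_speedup(fs, N):
        fp = 1.0 - fs               # parallel fraction
        return 1.0 / (fs + fp / N)

    for fs in (0.001, 0.01, 0.1):
        s = amdahl_speedup(fs, 250)
        print(f"fs = {fs:5.3f}: S(N=250) = {s:6.1f}   (limit as N grows: {1/fs:.0f})")
    # Even 1% serial content caps the speedup near 100, no matter how
    # many processors are added.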

Overhead of Parallelism

‹ Given enough parallel work, this is the biggest barrier to getting the desired speedup.
‹ Parallelism overheads include:
  ¾ cost of starting a thread or process
  ¾ cost of communicating shared data
  ¾ cost of synchronizing
  ¾ extra (redundant) computation
‹ Each of these can be in the range of milliseconds (= millions of flops) on some systems.
‹ Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.

Illustration of Amdahl's Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Chart: speedup vs. number of processors (up to 250) for fp = 1.000, 0.999, 0.990, and 0.900; only the fp = 1.000 curve scales linearly.]

Load Imbalance

‹ Load imbalance is the time that some processors in the system are idle due to
  ¾ insufficient parallelism (during that phase)
  ¾ unequal size tasks
‹ Examples of the latter
  ¾ adapting to "interesting parts of a domain"
  ¾ tree-structured computations
  ¾ fundamentally unstructured problems
‹ The algorithm needs to balance the load

Locality and Parallelism

[Diagram: conventional storage hierarchy -- each processor has its own cache, L2 cache, and L3 cache in front of its own memory, with potential interconnects between the nodes.]

‹ Large memories are slow; fast memories are small.
‹ Storage hierarchies are large and fast on average.
‹ Parallel processors, collectively, have large, fast caches.
  ¾ the slow accesses to "remote" data we call "communication"
‹ The algorithm should do most work on local data.

Performance Trends Revisited (Microprocessor Organization)

[Chart: transistors per chip, 1970-2000 (i4004, i8080, i8086, i80286, i80386, r3010, r4000, r4400), rising from ~10^3 to ~10^8, alongside the organizational techniques that used them:]
• Bit level parallelism
• Pipelining
• Caches
• Instruction level parallelism
• Out-of-order execution
• Speculation
• . . .

What is Ahead?

‹ Greater instruction level parallelism?
‹ Bigger caches?
‹ Multiple processors per chip?
‹ Complete systems on a chip? (Portable systems)
‹ High performance LAN, interface, and interconnect

High Performance Computers

‹ ~20 years ago
  ¾ 1x10^6 floating point ops/sec (Mflop/s)
    » Scalar based
‹ ~10 years ago
  ¾ 1x10^9 floating point ops/sec (Gflop/s)
    » Vector & shared memory computing, bandwidth aware
    » Block partitioned, latency tolerant
‹ ~Today
  ¾ 1x10^12 floating point ops/sec (Tflop/s)
    » Highly parallel, distributed processing, message passing, network based
    » Data decomposition, communication/computation
‹ ~5 years away
  ¾ 1x10^15 floating point ops/sec (Pflop/s)
    » Many more levels of memory hierarchy; combination of grids & HPC
    » More adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes

Directions

‹ Move toward shared memory
  ¾ SMPs and distributed shared memory
  ¾ Shared address space with deep memory hierarchy
‹ Clustering of shared memory machines for scalability
‹ Efficiency of message passing and data parallel programming
  ¾ Helped by standards efforts such as MPI and HPF

What is a Supercomputer?

‹ A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
‹ Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security -- in particular cryptanalysis and the simulation of nuclear weapons -- to name a few.

Top 500 Computers

‹ Listing of the 500 most powerful computers in the world.
‹ Yardstick: Rmax from the LINPACK benchmark
  ¾ solve Ax = b, dense problem, TPP performance
‹ Updated twice a year:
  ¾ SC'xy in the States in November
  ¾ Meeting in Germany in June
‹ Over the last 10 years the performance range of the Top500 has increased faster than Moore's Law:
  ¾ 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
  ¾ 2004: #1 = 70 TFlop/s, #500 = 850 GFlop/s
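The Rmax yardstick comes from timing a dense solve of Ax = b. The sketch below shows the idea with NumPy on a small matrix -- a rough illustration of the flop-rate bookkeeping, not the official LINPACK/HPL benchmark.

    import time
    import numpy as np

    n = 2000
    A = np.random.rand(n, n)
    b = np.random.rand(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0

    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # standard operation count
    print(f"n = {n}: {elapsed:.3f} s, ~{flops / elapsed / 1e9:.2f} Gflop/s")
    print("scaled residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))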

24th List: The TOP10

  Rank  Manufacturer  Computer                                  Rmax [TF/s]  Installation Site                       Country  Year  #Proc
  1     IBM           BlueGene/L β-System                       70.72        DOE/IBM                                 USA      2004  32768
  2     SGI           Columbia (Altix, Infiniband)              51.87        NASA Ames                               USA      2004  10160
  3     NEC           Earth-Simulator                           35.86        Earth Simulator Center                  Japan    2002  5120
  4     IBM           MareNostrum (BladeCenter JS20, Myrinet)   20.53        Barcelona Supercomputer Center          Spain    2004  3564
  5     CCD           Thunder (Itanium2, Quadrics)              19.94        Lawrence Livermore National Laboratory  USA      2004  4096
  6     HP            ASCI Q (AlphaServer SC, Quadrics)         13.88        Los Alamos National Laboratory          USA      2002  8192
  7     Self Made     X (Apple XServe, Infiniband)              12.25        Virginia Tech                           USA      2004  2200
  8     IBM/LLNL      BlueGene/L DD1 (500 MHz)                  11.68        Lawrence Livermore National Laboratory  USA      2004  8192
  9     IBM           pSeries 655                               10.31        Naval Oceanographic Office              USA      2004  2944
  10    Dell          Tungsten (PowerEdge, Myrinet)             9.82         NCSA                                    USA      2003  2500

399 systems > 1 TFlop/s; 294 machines are clusters; the top 10 average 8K processors.

Performance Development

[Chart: Top500 performance, 1993-2004. The SUM line grows from 1.167 TF/s to 1.127 PF/s and the N=1 line from 59.7 GF/s to 70.72 TF/s, passing through the Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), ASCI White (LLNL), the Earth Simulator, and BlueGene/L; the N=500 line reaches 850 GF/s, and "My Laptop" is plotted for reference.]

Performance Projection

[Charts: extrapolating the SUM, N=1, and N=500 trend lines from 1993 out to 2015 on a 100 Mflop/s - 1 Eflop/s scale, with BlueGene/L, the DARPA HPCS program, and "My Laptop" marked.]

Customer Segments / Systems

[Chart: number of Top500 systems by customer segment, 1993-2004 -- government, classified, academic, research, industry, vendor, others.]

Manufacturers / Systems

[Chart: number of Top500 systems by manufacturer, 1993-2004 -- IBM, HP, SGI, Sun, Intel, TMC, Fujitsu, NEC, Hitachi, others.]

Processor Types

[Chart: Top500 systems by processor type, 1993-2004 -- SIMD and vector designs giving way to scalar processors (Alpha, Power, HP, intel, MIPS, Sparc).]

Interconnects / Systems

[Chart: Top500 systems by interconnect, 1993-2004 -- crossbar, SP Switch, Cray interconnect, Myrinet, Gigabit Ethernet, Quadrics, Infiniband, N/A, others.]

Top500 Performance by Manufacturer (11/04)

[Pie chart: IBM 49%, HP 21%, others 14%, SGI 7%, NEC 4%, Fujitsu 2%, Sun 2%, Cray 1%, Hitachi ~0%, Intel ~0%.]

Clusters (NOW) / Systems

[Chart: number of cluster systems in the Top500, 1997-2004, by family -- Sun Fire, NOW-Pentium, NOW-Alpha, NOW-Sun, NOW-AMD, Dell cluster, HP cluster, HP AlphaServer, IBM cluster, others.]

Performance Numbers on RISC Processors

  Processor           Clock (MHz)   Linpack n=100   Linpack n=1000   Peak (Mflop/s)
  Intel P4            2540          1190  (23%)     2355  (46%)      5080
  Intel/HP Itanium 2  1000          1102  (27%)     3534  (88%)      4000
  Compaq Alpha        1000           824  (41%)     1542  (77%)      2000
  AMD Athlon          1200           558  (23%)      998  (42%)      2400
  HP PA                550           468  (21%)     1583  (71%)      2200
  IBM Power 3          375           424  (28%)     1208  (80%)      1500
  Intel P3             933           234  (25%)      514  (55%)       933
  PowerPC G4           533           231  (22%)      478  (45%)      1066
  SUN Ultra 80         450           208  (23%)      607  (67%)       900
  SGI Origin 2K        300           173  (29%)      553  (92%)       600
  Cray T90             454           705  (39%)     1603  (89%)      1800
  Cray C90             238           387  (41%)      902  (95%)       952
  Cray Y-MP            166           161  (48%)      324  (97%)       333
  Cray X-MP            118           121  (51%)      218  (93%)       235
  Cray J-90            100           106  (53%)      190  (95%)       200
  Cray 1                80            27  (17%)      110  (69%)       160

Top500 Conclusions

‹ Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
‹ MPPs continue to account for more than half of all installed high-performance computers worldwide.
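The efficiency percentages in the table are simply sustained rate divided by peak; for example, recomputing a few of the Linpack n=100 entries:

    def percent_of_peak(sustained, peak):
        return 100.0 * sustained / peak

    for name, n100, peak in [("Intel P4", 1190, 5080),
                             ("Compaq Alpha", 824, 2000),
                             ("Cray 1", 27, 160)]:
        print(f"{name:12s}: {percent_of_peak(n100, peak):3.0f}% of peak")
    # -> 23%, 41%, and 17%, matching the table.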

High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:
‹ COTS PC nodes
  ¾ Pentium, Alpha, PowerPC, SMP
‹ COTS LAN/SAN interconnect
  ¾ Ethernet, Myrinet, Giganet, ATM
‹ Open source Unix
  ¾ Linux, BSD
‹ Message passing computing
  ¾ MPI, PVM
  ¾ HPF

Advantages:
‹ Best price-performance
‹ Low entry-level cost
‹ Just-in-place configuration
‹ Vendor invulnerable
‹ Scalable
‹ Rapid technology tracking

Enabled by PC hardware, networks and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message passing libraries. However, it is much more of a contact sport.

http://clusters.top500.org
‹ Peak performance
‹ Interconnection
‹ Benchmark results to follow in the coming months

Distributed and Parallel Systems

[Spectrum: systems range from distributed, heterogeneous collections of machines to massively parallel, homogeneous machines.]

  Distributed systems (heterogeneous)       Massively parallel systems (homogeneous)
  ‹ Gather (unused) resources               ‹ Bounded set of resources
  ‹ Steal cycles                            ‹ Apps grow to consume all cycles
  ‹ System SW manages resources             ‹ Application manages resources
  ‹ System SW adds value                    ‹ System SW gets in the way
  ‹ 10% - 20% overhead is OK                ‹ 5% overhead is maximum
  ‹ Resources drive applications            ‹ Apps drive purchase of equipment
  ‹ Time to completion is not critical      ‹ Real-time constraints
  ‹ Time-shared                             ‹ Space-shared

Do they make any sense?

Performance Improvements for Scientific Computing Problems

[Chart: overall speed-up factor, 1970-1995, rising from 1 to roughly 10,000.]

Derived from Computational Methods

[Chart: speed-up factor (1 to ~10,000) vs. year, 1970-1995, as the method of choice advanced from sparse GE to Gauss-Seidel, SOR, conjugate gradient, and multi-grid.]

Different Architectures

‹ Parallel computing: single systems with many processors working on the same problem
‹ Distributed computing: many systems loosely coupled by a scheduler to work on related problems
‹ Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems

Types of Parallel Computers

‹ The simplest and most useful way to classify modern parallel computers is by their memory model:
  ¾ shared memory
  ¾ distributed memory

Shared vs. Distributed Memory

‹ Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
  [Diagram: several processors (P) on a common bus to one memory.]
‹ Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: IBM SP, clusters)
  [Diagram: processor-memory pairs (P/M) connected by a network.]

Shared Memory: UMA vs. NUMA

‹ Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors. (Ex: Sun E10000)
‹ Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Ex: SGI Origin)
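On a distributed-memory machine, exchanging data means explicit message passing. Here is a minimal sketch using mpi4py, assuming mpi4py and an MPI implementation are installed (run with something like "mpirun -np 2 python demo.py"); it is not tied to any particular system named above.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = {"payload": list(range(10))}
        comm.send(data, dest=1, tag=11)      # explicit send on one processor ...
    elif rank == 1:
        data = comm.recv(source=0, tag=11)   # ... matching receive on the other
        print("rank 1 received:", data)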

Distributed Memory: MPPs vs. Clusters

‹ Processor-memory nodes are connected by some type of interconnect network.
  ¾ Massively Parallel Processor (MPP): tightly integrated, single system image.
  ¾ Cluster: individual computers connected by software.
  [Diagram: CPU+MEM nodes joined by an interconnect network.]

Processors, Memory, & Networks

‹ Both shared and distributed memory systems have:
  1. processors: now generally commodity RISC processors
  2. memory: now generally commodity DRAM
  3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
‹ We will now begin to describe these pieces in detail, starting with definitions of terms.

Processor-Related Terms

Clock period (cp): the minimum time interval between successive actions in the processor. Fixed, depends on the design of the processor. Measured in nanoseconds (~1-5 for the fastest processors). Inverse of frequency (MHz).
Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
Register: a small, extremely fast location for storing data or instructions in the processor.
Functional unit: a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
Pipeline: technique enabling multiple instructions to be overlapped in execution.
Superscalar: multiple instructions are possible per clock period.
Flops: floating point operations per second.
Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so the processor can execute more instructions more rapidly.
TLB: Translation-Lookaside Buffer; keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).

Memory-Related Terms

SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.
DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper).
Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.

Interconnect-Related Terms

‹ Latency: how long does it take to start sending a "message"? Measured in microseconds.
  (Also used for processors: how long does it take to output the results of some operation, such as a floating point add or divide, which are pipelined?)
‹ Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec.
‹ Topology: the manner in which the nodes are connected.
  ¾ The best choice would be a fully connected network (every processor connected to every other). Unfeasible for cost and scaling reasons.
  ¾ Instead, processors are arranged in some variation of a grid, torus, or hypercube.
  [Diagrams: 3-d hypercube, 2-d mesh, 2-d torus.]
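Latency and bandwidth combine in the usual first-order cost model, time = latency + message size / bandwidth. A small sketch with assumed example values (10 microseconds latency, 100 MB/s bandwidth -- not measurements of any particular interconnect):

    def message_time(nbytes, latency_s=10e-6, bandwidth_bytes_per_s=100e6):
        return latency_s + nbytes / bandwidth_bytes_per_s

    for nbytes in (8, 1000, 1_000_000):
        print(f"{nbytes:>9d} bytes: {message_time(nbytes) * 1e6:10.1f} microseconds")
    # Small messages are dominated by latency, large ones by bandwidth --
    # one reason parallel codes try to send fewer, larger messages.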

Highly Parallel Supercomputing: Where Are We?

‹ Performance:
  ¾ Sustained performance has dramatically increased during the last year.
  ¾ On most applications, sustained performance per dollar now exceeds that of conventional supercomputers. But...
  ¾ Conventional systems are still faster on some applications.
‹ Languages and compilers:
  ¾ Standardized, portable, high-level languages such as HPF, PVM and MPI are available. But...
  ¾ Initial HPF releases are not very efficient.
  ¾ Message passing programming is tedious and hard to debug.
  ¾ Programming difficulty remains a major obstacle to usage by mainstream scientists.
‹ Operating systems:
  ¾ Robustness and reliability are improving.
  ¾ New system management tools improve system utilization. But...
  ¾ Reliability is still not as good as on conventional systems.
‹ I/O subsystems:
  ¾ New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance. But...
  ¾ I/O remains a bottleneck on some systems.

The Importance of Standards - Software

‹ Writing programs for MPPs is hard ...
‹ But ... it is a one-off effort if written in a standard language.
‹ Past lack of parallel programming standards ...
  ¾ ... has restricted uptake of the technology (to "enthusiasts")
  ¾ ... reduced portability (over a range of current architectures and between future generations)
‹ Now standards exist (PVM, MPI & HPF), which ...
  ¾ ... allow users & manufacturers to protect their software investment
  ¾ ... encourage growth of a "third party" parallel software industry & parallel versions of widely used codes

The Importance of Standards - Hardware

‹ Processors
  ¾ commodity RISC processors
‹ Interconnects
  ¾ high bandwidth, low latency communications protocol
  ¾ no de-facto standard yet (ATM, Fibre Channel, HPPI, FDDI)
‹ Growing demand for a total solution:
  ¾ robust hardware + usable software
‹ HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops


The Future of HPC

‹ The expense of being different is being replaced by the economics of being the same.
‹ HPC needs to lose its "special purpose" tag.
‹ It still has to bring about the promise of scalable general purpose computing ...
‹ ... but it is dangerous to ignore this technology.
‹ Final success will come when MPP technology is embedded in desktop computing.
‹ Yesterday's HPC is today's mainframe is tomorrow's workstation.

Achieving TeraFlops

‹ In 1991: 1 Gflop/s
‹ A 1000-fold increase since then, from:
  ¾ Architecture
    » exploiting parallelism
  ¾ Processor, communication, memory
    » Moore's Law
  ¾ Algorithm improvements
    » block-partitioned algorithms


Future: Petaflops (10^15 fl pt ops/s)

‹ Today ≈ 10^15 flops for our workstations
‹ A Pflop for 1 second ≈ a typical workstation computing for 1 year.
‹ From an algorithmic standpoint:
  ¾ concurrency
  ¾ data locality
  ¾ latency & synchronization
  ¾ floating point accuracy
  ¾ dynamic redistribution of workload
  ¾ new languages and constructs
  ¾ role of numerical libraries
  ¾ algorithm adaptation to hardware failure

A Petaflops Computer System

‹ 1 Pflop/s sustained computing
‹ Between 10,000 and 1,000,000 processors
‹ Between 10 TB and 1 PB of main memory
‹ Commensurate I/O bandwidth, mass store, etc.
‹ If built today, it would cost $40 B and consume 1 TWatt.
‹ May be feasible and "affordable" by the year 2010

