CS 594, Spring 2003
Lecture 1: Overview of High-Performance Computing
Jack Dongarra
Computer Science Department
University of Tennessee

Simulation: The Third Pillar of Science

‹ Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build the system.
‹ Limitations:
  ¾ Too difficult -- build large wind tunnels.
  ¾ Too expensive -- build a throw-away passenger jet.
  ¾ Too slow -- wait for climate or galactic evolution.
  ¾ Too dangerous -- weapons, drug design, climate experimentation.
‹ Computational science paradigm:
  3) Use high-performance computer systems to simulate the phenomenon
     » Based on known physical laws and efficient numerical methods.

Computational Science Definition

Computational science is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems. Computational science fuses three distinct elements:
¾ numerical algorithms and modeling and simulation software developed to solve science (e.g., biological, physical, and social), engineering, and humanities problems;
¾ advanced system hardware, software, networking, and data management components developed through computer and information science to solve computationally demanding problems;
¾ the computing infrastructure that supports both science and engineering problem solving and developmental computer and information science.

Some Particularly Challenging Computations

‹ Science
  ¾ Global climate modeling
  ¾ Astrophysical modeling
  ¾ Biology: genomics; protein folding; drug design
  ¾ Computational chemistry
  ¾ Computational material sciences and nanosciences
‹ Engineering
  ¾ Crash simulation
  ¾ Semiconductor design
  ¾ Earthquake and structural modeling
  ¾ Computational fluid dynamics (airplane design)
  ¾ Combustion (engine design)
‹ Business
  ¾ Financial and economic modeling
  ¾ Transaction processing, web services and search engines
‹ Defense
  ¾ Nuclear weapons -- test by simulations
  ¾ Cryptography

Why Turn to Simulation?

‹ When the problem is too . . .
  ¾ Complex
  ¾ Large / small
  ¾ Expensive
  ¾ Dangerous
‹ . . . to do any other way.

Complex Systems Engineering

‹ R&D Team (Grand Challenge driven): Ames Research Center, Glenn Research Center, Langley Research Center.
‹ Engineering Team (operations driven): Johnson Space Center, Marshall Space Flight Center, industry partners.
‹ Supporting infrastructure: analysis and visualization; Grand Challenges; computation management (AeroDB, ILab); next-generation codes and algorithms; modeling environment (experts and tools); compilers, scaling and porting, parallelization tools; storage and networks.
‹ Flagship codes: OVERFLOW (Honorable Mention, NASA Software of the Year), INS3D (NASA Software of the Year; turbopump analysis), CART3D (NASA Software of the Year; STS-107 analysis).


Source: Walt Brooks, NASA

Economic Impact of HPC

‹ Airlines:
  ¾ System-wide logistics optimization systems on parallel systems.
  ¾ Savings: approx. $100 million per airline per year.
‹ Automotive design:
  ¾ Major automotive companies use large systems (500+ CPUs) for:
    » CAD-CAM, crash testing, structural integrity and aerodynamics.
    » One company has a 500+ CPU parallel system.
  ¾ Savings: approx. $1 billion per company per year.
‹ Semiconductor industry:
  ¾ Semiconductor firms use large systems (500+ CPUs) for:
    » device electronics simulation and logic validation.
  ¾ Savings: approx. $1 billion per company per year.
‹ Securities industry:
  ¾ Savings: approx. $15 billion per year for U.S. home mortgages.

Pretty Pictures


Why Turn to Simulation?

‹ Climate / weather modeling
‹ Data intensive problems (data mining, oil reservoir simulation)
‹ Problems with large length and time scales (cosmology)

Titov's Tsunami Simulation

‹ Global tsunami model (animation: tsunami-nw10.mov)


Cost (Economic Loss) to Evacuate 1 Mile of Coastline: $1M

‹ We now over-warn by a factor of 3.
‹ Average over-warning is 200 miles of coastline, or $200M per event.

24 Hour Forecast at Fine Grid Spacing

‹ This problem demands a complete, STABLE environment (hardware and software):
  ¾ 100 TF to stay a factor of 10 ahead of the weather
  ¾ Streaming observations
  ¾ Massive storage and metadata query
  ¾ Fast networking
  ¾ Visualization
  ¾ Data mining for feature detection


Units of High Performance Computing

  1 Mflop/s   1 Megaflop/s   10^6  Flop/sec
  1 Gflop/s   1 Gigaflop/s   10^9  Flop/sec
  1 Tflop/s   1 Teraflop/s   10^12 Flop/sec
  1 Pflop/s   1 Petaflop/s   10^15 Flop/sec

  1 MB        1 Megabyte     10^6  Bytes
  1 GB        1 Gigabyte     10^9  Bytes
  1 TB        1 Terabyte     10^12 Bytes
  1 PB        1 Petabyte     10^15 Bytes

High-Performance Computing Today

‹ In the past decade, the world has experienced one of the most exciting periods in computer development.
‹ Microprocessors have become smaller, denser, and more powerful.
‹ The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.


Technology Trends: Microprocessor Capacity

‹ Moore's Law: 2X transistors/chip every 1.5 years.
‹ Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
‹ Microprocessors have become smaller, denser, and more powerful.
‹ Not just processors: bandwidth, storage, etc.

Eniac and My Laptop

                         Eniac        My Laptop
  Year                   1945         2002
  Devices                18,000       6,000,000,000
  Weight (kg)            27,200       0.9
  Size (m^3)             68           0.0028
  Power (watts)          20,000       60
  Cost (1999 dollars)    4,630,000    1,000
  Memory (bytes)         ~200         1,073,741,824
  Performance (FP/sec)   800          5,000,000,000

No Exponential is Forever, But Perhaps We Can Delay it Forever

  Processor                  Year of Introduction   Transistors
  4004                       1971                   2,250
  8008                       1972                   2,500
  8080                       1974                   5,000
  8086                       1978                   29,000
  286                        1982                   120,000
  Intel386 processor         1985                   275,000
  Intel486 processor         1989                   1,180,000
  Intel Pentium processor    1993                   3,100,000
  Intel Pentium II           1997                   7,500,000
  Intel Pentium III          1999                   24,000,000
  Intel Pentium 4            2000                   42,000,000
  Intel Itanium              2002                   220,000,000
  Intel Itanium 2            2003                   410,000,000

Today's Processors

‹ Some equivalences for the microprocessors of today:
  ¾ Voltage level
    » A flashlight (~1 volt)
  ¾ Current level
    » An oven (~250 amps)
  ¾ Power level
    » A light bulb (~100 watts)
  ¾ Area
    » A postage stamp (~1 square inch)

Moore's "Law"

‹ Something doubles every 18-24 months.
‹ That something was originally the number of transistors.
‹ It is also commonly applied to performance.
‹ Moore's Law is an exponential.
  ¾ Exponentials can not last forever.
    » However, Moore's Law has held remarkably true for ~30 years.
‹ BTW: it is really an empiricism rather than a law (not a derogatory comment).

Percentage of Peak

‹ A rule of thumb that often applies: a contemporary RISC processor, for a spectrum of applications, delivers (i.e., sustains) 10% of peak performance.
‹ There are exceptions to this rule, in both directions.
‹ Why such low efficiency? There are two primary reasons behind the disappointing percentage of peak:
  ¾ IPC (in)efficiency
  ¾ Memory (in)efficiency
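As a rough sanity check on the 18-24 month figure, the short Python sketch below backs the average doubling period out of two endpoints of the Intel transistor-count table two slides back (the 4004 and the Itanium 2); the calculation is illustrative only.

    # Back out the average doubling period from two rows of the Intel
    # transistor-count table (4004 in 1971, Itanium 2 in 2003).
    import math

    t0, n0 = 1971, 2250          # Intel 4004
    t1, n1 = 2003, 410_000_000   # Intel Itanium 2

    doublings = math.log2(n1 / n0)
    period_months = (t1 - t0) * 12 / doublings

    print(f"{doublings:.1f} doublings in {t1 - t0} years")
    print(f"average doubling period: {period_months:.1f} months")
    # -> about 22 months, consistent with the 18-24 month rule of thumb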

IPC

‹ Today the theoretical IPC (instructions per cycle) is 4 in most contemporary RISC processors (6 in Itanium).
‹ Detailed analysis for a spectrum of applications indicates that the average achieved IPC is 1.2-1.4.
‹ We are leaving ~75% of the possible performance on the table…

Why Fast Machines Run Slow

‹ Latency
  ¾ Waiting for access to memory or other parts of the system.
‹ Overhead
  ¾ Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform.
‹ Starvation
  ¾ Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
‹ Contention
  ¾ Delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.


Extra Transistors

‹ With the increasing number of transistors per chip from reduced design rules, do we:
  ¾ Add more functional units?
    » Little gain, owing to poor IPC for today's codes, compilers and ISAs.
  ¾ Add more cache?
    » This generally helps but does not solve the problem.
  ¾ Add more processors?
    » This helps somewhat.
    » This hurts somewhat.

Processor vs. Memory Speed

‹ In 1986
  ¾ processor cycle time ~120 nanoseconds
  ¾ DRAM access time ~140 nanoseconds
    » 1:1 ratio
‹ In 1996
  ¾ processor cycle time ~4 nanoseconds
  ¾ DRAM access time ~60 nanoseconds
    » 20:1 ratio
‹ In 2002
  ¾ processor cycle time ~0.6 nanoseconds
  ¾ DRAM access time ~50 nanoseconds
    » 100:1 ratio

Latency in a Single System

[Chart: memory access time vs. CPU clock period, 1997-2009. The memory system access time stays roughly flat while the CPU clock period shrinks, so the ratio of memory access time to CPU clock period keeps climbing -- "THE WALL".]

Memory Hierarchy

‹ Typical latencies for today's technology:

  Hierarchy level    Processor clocks
  Register           1
  L1 cache           2-3
  L2 cache           6-12
  L3 cache           14-40
  Near memory        100-300
  Far memory         300-900
  Remote memory      O(10^3)
  Message-passing    O(10^3)-O(10^4)
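To see why the table matters, here is a minimal average-memory-access-time sketch in Python; the latencies come from the table above, while the hit-rate distribution is an invented example, not a measurement.

    # Average access cost (in processor clocks) for a made-up mix of hits
    # across the hierarchy levels listed above.
    def average_access_time(latencies, fractions):
        # fractions[i]: share of accesses satisfied at level i (sums to 1)
        return sum(l * f for l, f in zip(latencies, fractions))

    latency  = [1, 3, 12, 200]            # register, L1, L2, near memory (clocks)
    fraction = [0.40, 0.45, 0.10, 0.05]   # assumed access distribution

    print(f"{average_access_time(latency, fraction):.1f} clocks on average")
    # Even the 5% of accesses that reach memory contribute 10 of the ~13
    # clocks -- this is "THE WALL" in miniature.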

Memory Hierarchy

‹ Most programs have a high degree of locality in their accesses
  ¾ spatial locality: accessing things nearby previous accesses
  ¾ temporal locality: reusing an item that was previously accessed
‹ The memory hierarchy tries to exploit locality

  Level                                     Typical speed   Typical size
  processor (on-chip registers and cache)   1 ns            B
  second-level cache (SRAM)                 10 ns           KB
  main memory (DRAM)                        100 ns          MB
  secondary storage (disk)                  10 ms           GB
  tertiary storage (disk/tape)              10 sec          TB

Memory Bandwidth

‹ To provide bandwidth to the processor, the bus needs to be either faster or wider
‹ Busses are limited to perhaps 400-800 MHz
‹ Links are faster
  ¾ Single-ended: 0.5-1 GT/s
  ¾ Differential: 2.5-5.0 GT/s (future)
  ¾ Increased link frequencies increase error rates, requiring coding and redundancy, thus increasing power and die size without helping bandwidth
‹ Making things wider requires pin-out (Si real estate) and power
  ¾ Both power and pin-out are serious issues
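A back-of-the-envelope version of the "faster or wider" tradeoff; the 400-800 MHz bus speeds are from the slide, while the 64-bit width is an assumed example.

    # Peak bus bandwidth = width x clock rate.
    def peak_bandwidth_bytes(width_bytes, clock_hz):
        return width_bytes * clock_hz

    for mhz in (400, 800):
        gb_per_s = peak_bandwidth_bytes(8, mhz * 1e6) / 1e9
        print(f"64-bit bus at {mhz} MHz: {gb_per_s:.1f} GB/s")
    # Doubling either the clock or the width doubles the peak; "wider"
    # costs pins and power, "faster" costs error rate and power.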

What's Needed for Memory?

‹ This is a physics problem
  ¾ "Money can buy bandwidth, but latency is forever!"
‹ There are possibilities being studied
  ¾ Generally they involve putting "processing" at the memory
    » Reduces the latency and increases the bandwidth
‹ Fundamental architectural research is lacking
  ¾ Government funding is a necessity

Processor in Memory (PIM)

[Diagram: a PIM chip combining stacked memory arrays, sense amps, decode, and node logic on one die.]

‹ PIM merges logic with memory
  ¾ Wide ALUs next to the row buffer
  ¾ Optimized for memory throughput, not ALU utilization
‹ PIM has the potential of riding Moore's law while
  ¾ greatly increasing effective memory bandwidth,
  ¾ providing many more concurrent execution threads,
  ¾ reducing latency,
  ¾ reducing power, and
  ¾ increasing overall system efficiency
‹ It may also simplify programming and system design

Internet -- 4th Revolution in Telecommunications

‹ Telephone, radio, television
‹ Growth in the Internet outstrips the others
‹ Exponential growth since 1985
‹ Traffic doubles every 100 days

[Chart: growth of Internet hosts and domain names, Sept. 1969 - Sept. 2002, reaching roughly 200,000,000 hosts.]

The Web Phenomenon

‹ 90-93: Web invented
‹ U of Illinois Mosaic released March 94: ~0.1% of traffic
‹ September 93: ~1% of traffic, w/200 sites
‹ June 94: ~10% of traffic, w/2,000 sites
‹ Today: 60% of traffic, w/2,000,000 sites
‹ Every organization, company, school

Internet On Everything

‹ Wireless communication
‹ Web tablets
‹ Embedded computers in things, all tied together
  ¾ Books, furniture, milk cartons, etc.
‹ Smart appliances
  ¾ Refrigerator, scale, etc.

Peer to Peer Computing

‹ Peer-to-peer is a style of networking in which a group of computers communicate directly with each other.
‹ Home computer in the utility room, next to the water heater and furnace.
‹ BitTorrent
  ¾ http://en.wikipedia.org/wiki/Bittorrent


SETI@home: Global Distributed Computing

‹ Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
‹ When their computer is idle or being wasted, this software will download a 300 kilobyte chunk of data for analysis. It performs about 3 Tflops for each client in 15 hours.
‹ The results of this analysis are sent back to the SETI team, combined with those of thousands of other participants.
‹ Today a number of companies are trying this for profit.

SETI@home

‹ Running on 500,000 PCs, ~1000 CPU years per day
  ¾ 485,821 CPU years so far
‹ Sophisticated data & signal processing analysis
‹ Distributes datasets from the Arecibo radio telescope
‹ Largest distributed computation project in existence
  ¾ Averaging 40 Tflop/s


‹ Google query attributes
  ¾ 150M queries/day (2,000/second)
  ¾ 100 countries
  ¾ 8.0B documents in the index
‹ Data centers
  ¾ 100,000 Linux systems in data centers around the world
    » 15 TFlop/s and 1000 TB total capability
    » 40-80 1U/2U servers per cabinet
    » 100 MB Ethernet switches per cabinet, with gigabit Ethernet uplink
  ¾ growth from 4,000 systems (June 2000)
    » 18M queries then
‹ Performance and operation
  ¾ simple reissue of failed commands to new servers
  ¾ no performance debugging
    » problems are not reproducible
‹ The ranking computation is an eigenvalue problem, Ax = λx, with n = 8x10^9
  ¾ The matrix is the transition probability matrix of the Markov chain; Ax = x
  ¾ Forward links are referred to in the rows; back links are referred to in the columns
  ¾ (see: MathWorks Cleve's Corner)

Source: Monika Henzinger, Google & Cleve Moler

Next Generation Web

‹ To treat CPU cycles and software like commodities.
‹ Enable the coordinated use of geographically distributed resources -- in the absence of central control and existing trust relationships.
‹ Computing power is produced much like utilities such as power and water are produced for consumers.
‹ Users will have access to "power" on demand.
‹ This is one of our efforts at UT.
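To make the eigenvalue formulation on the Google slide above concrete, here is a toy power-iteration sketch in Python. The 4-page link matrix and the 0.85 damping factor are invented for illustration (the real problem has n = 8x10^9); this is not Google's actual code.

    import numpy as np

    # A[i, j] = probability of following a link from page j to page i
    # (each column sums to 1, so A is the transition matrix of a Markov chain).
    A = np.array([[0.0, 0.5, 0.0, 0.0],
                  [0.5, 0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 1.0, 0.0]])

    n, d = 4, 0.85                 # damping keeps the iteration well-behaved
    x = np.full(n, 1.0 / n)        # start from the uniform distribution
    for _ in range(100):           # power iteration: x <- d*A*x + (1-d)/n
        x = d * (A @ x) + (1 - d) / n

    print("ranking vector x with Mx = x:", np.round(x, 3))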

Where Has This Performance Improvement Come From?

‹ Technology?
‹ Organization?
‹ Instruction set architecture?
‹ Software?
‹ Some combination of all of the above?

Impact of Device Shrinkage

‹ What happens when the feature size (transistor size) shrinks by a factor of x?
‹ Clock rate goes up by x, because wires are shorter
  ¾ actually less than x, because of power consumption
‹ Transistors per unit area go up by x^2
‹ Die size also tends to increase
  ¾ typically by another factor of ~x
‹ Raw computing power of the chip goes up by ~x^4 !
  ¾ of which x^3 is devoted either to parallelism or locality
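Applying those scaling rules to one hypothetical shrink of x = 1.4 (an illustrative value, not one from the slide):

    x = 1.4  # hypothetical feature-size shrink factor
    print(f"clock rate:           ~{x:.1f}x (less in practice, because of power)")
    print(f"transistors per area: ~{x**2:.1f}x")
    print(f"raw computing power:  ~{x**4:.1f}x, of which ~{x**3:.1f}x goes to parallelism or locality")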

How Fast Can a Serial Computer Be?

‹ Consider a 1 Tflop/s sequential machine with 1 TB of memory:
  ¾ data must travel some distance, r, to get from memory to the CPU
  ¾ to get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3x10^8 m/s
  ¾ so r < c/10^12 = 0.3 mm
‹ Now put 1 TB of storage in a 0.3 mm^2 area
  ¾ each word occupies about 3 square Angstroms, the size of a small atom

Processor-Memory Problem

‹ Processors issue instructions roughly every nanosecond.
‹ DRAM can be accessed roughly every 100 nanoseconds (!).
‹ DRAM cannot keep processors busy! And the gap is growing:
  ¾ processors are getting faster by 60% per year
  ¾ DRAM is getting faster by 7% per year
    » (SDRAM and EDO RAM might help, but not enough)
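Written out, the speed-of-light bound from the serial-computer slide above is (using only the numbers on the slide):

    \[
      r \;<\; \frac{c}{10^{12}\,\mathrm{s^{-1}}}
        \;=\; \frac{3\times 10^{8}\ \mathrm{m/s}}{10^{12}\ \mathrm{s^{-1}}}
        \;=\; 3\times 10^{-4}\ \mathrm{m}
        \;=\; 0.3\ \mathrm{mm}.
    \]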

Processor-DRAM Memory Gap

[Chart, 1980-2004: "Moore's Law" -- CPU performance grows ~60%/yr (2X/1.5yr) while DRAM performance grows ~9%/yr (2X/10 yrs), so the processor-memory performance gap grows about 50% per year.]

Why Parallel Computing?

‹ Desire to solve bigger, more realistic application problems.
‹ Fundamental limits are being approached.
‹ More cost effective solution.

Principles of Parallel Computing

‹ Parallelism and Amdahl's Law
‹ Granularity
‹ Locality
‹ Load balance
‹ Coordination and synchronization
‹ Performance modeling

All of these things make parallel programming even harder than sequential programming.

"Automatic" Parallelism in Modern Machines

‹ Bit level parallelism
  ¾ within floating point operations, etc.
‹ Instruction level parallelism (ILP)
  ¾ multiple instructions execute per clock cycle
‹ Memory system parallelism
  ¾ overlap of memory operations with computation
‹ OS parallelism
  ¾ multiple jobs run in parallel on commodity SMPs

There are limits to all of these -- for very high performance, the user needs to identify, schedule and coordinate parallel tasks.

Finding Enough Parallelism

‹ Suppose only part of an application seems parallel.
‹ Amdahl's law:
  ¾ let fs be the fraction of work done sequentially; (1 - fs) is the fraction that is parallelizable
  ¾ let N = number of processors
‹ Even if the parallel part speeds up perfectly, performance may be limited by the sequential part.

Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

  tN = (fp/N + fs) t1      effect of multiple processors on run time
  S  = 1/(fs + fp/N)       effect of multiple processors on speedup

where:
  fs = serial fraction of code
  fp = parallel fraction of code = 1 - fs
  N  = number of processors
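A minimal sketch of the speedup expression above, with a few serial fractions evaluated at N = 250 processors (the same processor count used in the illustration on the next slide):

    def amdahl_speedup(fs, N):
        fp = 1.0 - fs               # parallel fraction
        return 1.0 / (fs + fp / N)

    for fs in (0.001, 0.01, 0.1):
        s = amdahl_speedup(fs, 250)
        print(f"fs = {fs:5.3f}: S(N=250) = {s:6.1f}   (limit as N grows: {1/fs:.0f})")
    # Even 1% serial content caps the speedup near 100, no matter how
    # many processors are added.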

Overhead of Parallelism

‹ Given enough parallel work, this is the biggest barrier to getting the desired speedup.
‹ Parallelism overheads include:
  ¾ cost of starting a thread or process
  ¾ cost of communicating shared data
  ¾ cost of synchronizing
  ¾ extra (redundant) computation
‹ Each of these can be in the range of milliseconds (= millions of flops) on some systems.
‹ Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.

Illustration of Amdahl's Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Chart: speedup vs. number of processors (up to 250) for fp = 1.000, 0.999, 0.990, and 0.900; only the fp = 1.000 curve scales linearly.]

Load Imbalance

‹ Load imbalance is the time that some processors in the system are idle due to
  ¾ insufficient parallelism (during that phase)
  ¾ unequal size tasks
‹ Examples of the latter
  ¾ adapting to "interesting parts of a domain"
  ¾ tree-structured computations
  ¾ fundamentally unstructured problems
‹ The algorithm needs to balance the load

Locality and Parallelism

[Diagram: conventional storage hierarchy -- each processor has its own cache, L2 cache, and L3 cache in front of its own memory, with potential interconnects between the nodes.]

‹ Large memories are slow; fast memories are small.
‹ Storage hierarchies are large and fast on average.
‹ Parallel processors, collectively, have large, fast caches.
  ¾ the slow accesses to "remote" data we call "communication"
‹ The algorithm should do most work on local data.

Performance Trends Revisited (Microprocessor Organization)

[Chart: transistors per chip, 1970-2000 (i4004, i8080, i8086, i80286, i80386, r3010, r4000, r4400), rising from ~10^3 to ~10^8, alongside the organizational techniques that used them:]
• Bit level parallelism
• Pipelining
• Caches
• Instruction level parallelism
• Out-of-order execution
• Speculation
• . . .

What is Ahead?

‹ Greater instruction level parallelism?
‹ Bigger caches?
‹ Multiple processors per chip?
‹ Complete systems on a chip? (Portable systems)
‹ High performance LAN, interface, and interconnect

High Performance Computers

‹ ~20 years ago
  ¾ 1x10^6 floating point ops/sec (Mflop/s)
    » Scalar based
‹ ~10 years ago
  ¾ 1x10^9 floating point ops/sec (Gflop/s)
    » Vector & shared memory computing, bandwidth aware
    » Block partitioned, latency tolerant
‹ ~Today
  ¾ 1x10^12 floating point ops/sec (Tflop/s)
    » Highly parallel, distributed processing, message passing, network based
    » Data decomposition, communication/computation
‹ ~5 years away
  ¾ 1x10^15 floating point ops/sec (Pflop/s)
    » Many more levels of memory hierarchy; combination of grids & HPC
    » More adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes

Directions

‹ Move toward shared memory
  ¾ SMPs and distributed shared memory
  ¾ Shared address space with deep memory hierarchy
‹ Clustering of shared memory machines for scalability
‹ Efficiency of message passing and data parallel programming
  ¾ Helped by standards efforts such as MPI and HPF

What is a Supercomputer?

‹ A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
‹ Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security -- in particular cryptanalysis and the simulation of nuclear weapons -- to name a few.

Top 500 Computers

‹ Listing of the 500 most powerful computers in the world.
‹ Yardstick: Rmax from the LINPACK benchmark
  ¾ solve Ax = b, dense problem, TPP performance
‹ Updated twice a year:
  ¾ SC'xy in the States in November
  ¾ Meeting in Germany in June
‹ Over the last 10 years the performance range of the Top500 has increased faster than Moore's Law:
  ¾ 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
  ¾ 2004: #1 = 70 TFlop/s, #500 = 850 GFlop/s
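The Rmax yardstick comes from timing a dense solve of Ax = b. The sketch below shows the idea with NumPy on a small matrix -- a rough illustration of the flop-rate bookkeeping, not the official LINPACK/HPL benchmark.

    import time
    import numpy as np

    n = 2000
    A = np.random.rand(n, n)
    b = np.random.rand(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0

    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # standard operation count
    print(f"n = {n}: {elapsed:.3f} s, ~{flops / elapsed / 1e9:.2f} Gflop/s")
    print("scaled residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))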

24th List: The TOP10

  Rank  Manufacturer  Computer                                  Rmax [TF/s]  Installation Site                       Country  Year  #Proc
  1     IBM           BlueGene/L β-System                       70.72        DOE/IBM                                 USA      2004  32768
  2     SGI           Columbia (Altix, Infiniband)              51.87        NASA Ames                               USA      2004  10160
  3     NEC           Earth-Simulator                           35.86        Earth Simulator Center                  Japan    2002  5120
  4     IBM           MareNostrum (BladeCenter JS20, Myrinet)   20.53        Barcelona Supercomputer Center          Spain    2004  3564
  5     CCD           Thunder (Itanium2, Quadrics)              19.94        Lawrence Livermore National Laboratory  USA      2004  4096
  6     HP            ASCI Q (AlphaServer SC, Quadrics)         13.88        Los Alamos National Laboratory          USA      2002  8192
  7     Self Made     X (Apple XServe, Infiniband)              12.25        Virginia Tech                           USA      2004  2200
  8     IBM/LLNL      BlueGene/L DD1 (500 MHz)                  11.68        Lawrence Livermore National Laboratory  USA      2004  8192
  9     IBM           pSeries 655                               10.31        Naval Oceanographic Office              USA      2004  2944
  10    Dell          Tungsten (PowerEdge, Myrinet)             9.82         NCSA                                    USA      2003  2500

399 systems > 1 TFlop/s; 294 machines are clusters; the top 10 average 8K processors.

Performance Development

[Chart: Top500 performance, 1993-2004. The SUM line grows from 1.167 TF/s to 1.127 PF/s and the N=1 line from 59.7 GF/s to 70.72 TF/s, passing through the Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), ASCI White (LLNL), the Earth Simulator, and BlueGene/L; the N=500 line reaches 850 GF/s, and "My Laptop" is plotted for reference.]

Performance Projection

[Charts: extrapolating the SUM, N=1, and N=500 trend lines from 1993 out to 2015 on a 100 Mflop/s - 1 Eflop/s scale, with BlueGene/L, the DARPA HPCS program, and "My Laptop" marked.]

Customer Segments / Systems

[Chart: number of Top500 systems by customer segment, 1993-2004 -- government, classified, academic, research, industry, vendor, others.]

Manufacturers / Systems

[Chart: number of Top500 systems by manufacturer, 1993-2004 -- IBM, HP, SGI, Sun, Intel, TMC, Fujitsu, NEC, Hitachi, others.]

Processor Types

[Chart: Top500 systems by processor type, 1993-2004 -- SIMD and vector designs giving way to scalar processors (Alpha, Power, HP, intel, MIPS, Sparc).]

Interconnects / Systems

[Chart: Top500 systems by interconnect, 1993-2004 -- crossbar, SP Switch, Cray interconnect, Myrinet, Gigabit Ethernet, Quadrics, Infiniband, N/A, others.]

Top500 Performance by Manufacturer (11/04)

[Pie chart: IBM 49%, HP 21%, others 14%, SGI 7%, NEC 4%, Fujitsu 2%, Sun 2%, Cray 1%, Hitachi ~0%, Intel ~0%.]

Clusters (NOW) / Systems

[Chart: number of cluster systems in the Top500, 1997-2004, by family -- Sun Fire, NOW-Pentium, NOW-Alpha, NOW-Sun, NOW-AMD, Dell cluster, HP cluster, HP AlphaServer, IBM cluster, others.]

Performance Numbers on RISC Processors

  Processor           Clock (MHz)   Linpack n=100   Linpack n=1000   Peak (Mflop/s)
  Intel P4            2540          1190  (23%)     2355  (46%)      5080
  Intel/HP Itanium 2  1000          1102  (27%)     3534  (88%)      4000
  Compaq Alpha        1000           824  (41%)     1542  (77%)      2000
  AMD Athlon          1200           558  (23%)      998  (42%)      2400
  HP PA                550           468  (21%)     1583  (71%)      2200
  IBM Power 3          375           424  (28%)     1208  (80%)      1500
  Intel P3             933           234  (25%)      514  (55%)       933
  PowerPC G4           533           231  (22%)      478  (45%)      1066
  SUN Ultra 80         450           208  (23%)      607  (67%)       900
  SGI Origin 2K        300           173  (29%)      553  (92%)       600
  Cray T90             454           705  (39%)     1603  (89%)      1800
  Cray C90             238           387  (41%)      902  (95%)       952
  Cray Y-MP            166           161  (48%)      324  (97%)       333
  Cray X-MP            118           121  (51%)      218  (93%)       235
  Cray J-90            100           106  (53%)      190  (95%)       200
  Cray 1                80            27  (17%)      110  (69%)       160

Top500 Conclusions

‹ Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
‹ MPPs continue to account for more than half of all installed high-performance computers worldwide.
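The efficiency percentages in the table are simply sustained rate divided by peak; for example, recomputing a few of the Linpack n=100 entries:

    def percent_of_peak(sustained, peak):
        return 100.0 * sustained / peak

    for name, n100, peak in [("Intel P4", 1190, 5080),
                             ("Compaq Alpha", 824, 2000),
                             ("Cray 1", 27, 160)]:
        print(f"{name:12s}: {percent_of_peak(n100, peak):3.0f}% of peak")
    # -> 23%, 41%, and 17%, matching the table.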

High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:
‹ COTS PC nodes
  ¾ Pentium, Alpha, PowerPC, SMP
‹ COTS LAN/SAN interconnect
  ¾ Ethernet, Myrinet, Giganet, ATM
‹ Open source Unix
  ¾ Linux, BSD
‹ Message passing computing
  ¾ MPI, PVM
  ¾ HPF

Advantages:
‹ Best price-performance
‹ Low entry-level cost
‹ Just-in-place configuration
‹ Vendor invulnerable
‹ Scalable
‹ Rapid technology tracking

Enabled by PC hardware, networks and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message passing libraries. However, it is much more of a contact sport.

http://clusters.top500.org
‹ Peak performance
‹ Interconnection
‹ Benchmark results to follow in the coming months

Distributed and Parallel Systems

[Spectrum: systems range from distributed, heterogeneous collections of machines to massively parallel, homogeneous machines.]

  Distributed systems (heterogeneous)       Massively parallel systems (homogeneous)
  ‹ Gather (unused) resources               ‹ Bounded set of resources
  ‹ Steal cycles                            ‹ Apps grow to consume all cycles
  ‹ System SW manages resources             ‹ Application manages resources
  ‹ System SW adds value                    ‹ System SW gets in the way
  ‹ 10% - 20% overhead is OK                ‹ 5% overhead is maximum
  ‹ Resources drive applications            ‹ Apps drive purchase of equipment
  ‹ Time to completion is not critical      ‹ Real-time constraints
  ‹ Time-shared                             ‹ Space-shared

Do they make any sense?

Performance Improvements for Scientific Computing Problems

[Chart: overall speed-up factor, 1970-1995, rising from 1 to roughly 10,000.]

Derived from Computational Methods

[Chart: speed-up factor (1 to ~10,000) vs. year, 1970-1995, as the method of choice advanced from sparse GE to Gauss-Seidel, SOR, conjugate gradient, and multi-grid.]

Different Architectures

‹ Parallel computing: single systems with many processors working on the same problem
‹ Distributed computing: many systems loosely coupled by a scheduler to work on related problems
‹ Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems

Types of Parallel Computers

‹ The simplest and most useful way to classify modern parallel computers is by their memory model:
  ¾ shared memory
  ¾ distributed memory

Shared vs. Distributed Memory

‹ Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
  [Diagram: several processors (P) on a common bus to one memory.]
‹ Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: IBM SP, clusters)
  [Diagram: processor-memory pairs (P/M) connected by a network.]

Shared Memory: UMA vs. NUMA

‹ Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors. (Ex: Sun E10000)
‹ Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Ex: SGI Origin)
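On a distributed-memory machine, exchanging data means explicit message passing. Here is a minimal sketch using mpi4py, assuming mpi4py and an MPI implementation are installed (run with something like "mpirun -np 2 python demo.py"); it is not tied to any particular system named above.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = {"payload": list(range(10))}
        comm.send(data, dest=1, tag=11)      # explicit send on one processor ...
    elif rank == 1:
        data = comm.recv(source=0, tag=11)   # ... matching receive on the other
        print("rank 1 received:", data)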

Distributed Memory: MPPs vs. Clusters

‹ Processor-memory nodes are connected by some type of interconnect network.
  ¾ Massively Parallel Processor (MPP): tightly integrated, single system image.
  ¾ Cluster: individual computers connected by software.
  [Diagram: CPU+MEM nodes joined by an interconnect network.]

Processors, Memory, & Networks

‹ Both shared and distributed memory systems have:
  1. processors: now generally commodity RISC processors
  2. memory: now generally commodity DRAM
  3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
‹ We will now begin to describe these pieces in detail, starting with definitions of terms.

Processor-Related Terms

Clock period (cp): the minimum time interval between successive actions in the processor. Fixed, depends on the design of the processor. Measured in nanoseconds (~1-5 for the fastest processors). Inverse of frequency (MHz).
Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
Register: a small, extremely fast location for storing data or instructions in the processor.
Functional unit: a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
Pipeline: technique enabling multiple instructions to be overlapped in execution.
Superscalar: multiple instructions are possible per clock period.
Flops: floating point operations per second.
Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so the processor can execute more instructions more rapidly.
TLB: Translation-Lookaside Buffer; keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).

Memory-Related Terms

SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.
DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper).
Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.

Interconnect-Related Terms

‹ Latency: how long does it take to start sending a "message"? Measured in microseconds.
  (Also used for processors: how long does it take to output the results of some operation, such as a floating point add or divide, which are pipelined?)
‹ Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec.
‹ Topology: the manner in which the nodes are connected.
  ¾ The best choice would be a fully connected network (every processor connected to every other). Unfeasible for cost and scaling reasons.
  ¾ Instead, processors are arranged in some variation of a grid, torus, or hypercube.
  [Diagrams: 3-d hypercube, 2-d mesh, 2-d torus.]
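Latency and bandwidth combine in the usual first-order cost model, time = latency + message size / bandwidth. A small sketch with assumed example values (10 microseconds latency, 100 MB/s bandwidth -- not measurements of any particular interconnect):

    def message_time(nbytes, latency_s=10e-6, bandwidth_bytes_per_s=100e6):
        return latency_s + nbytes / bandwidth_bytes_per_s

    for nbytes in (8, 1000, 1_000_000):
        print(f"{nbytes:>9d} bytes: {message_time(nbytes) * 1e6:10.1f} microseconds")
    # Small messages are dominated by latency, large ones by bandwidth --
    # one reason parallel codes try to send fewer, larger messages.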

Highly Parallel Supercomputing: Where Are We?

‹ Performance:
  ¾ Sustained performance has dramatically increased during the last year.
  ¾ On most applications, sustained performance per dollar now exceeds that of conventional supercomputers. But...
  ¾ Conventional systems are still faster on some applications.
‹ Languages and compilers:
  ¾ Standardized, portable, high-level languages such as HPF, PVM and MPI are available. But...
  ¾ Initial HPF releases are not very efficient.
  ¾ Message passing programming is tedious and hard to debug.
  ¾ Programming difficulty remains a major obstacle to usage by mainstream scientists.
‹ Operating systems:
  ¾ Robustness and reliability are improving.
  ¾ New system management tools improve system utilization. But...
  ¾ Reliability is still not as good as on conventional systems.
‹ I/O subsystems:
  ¾ New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance. But...
  ¾ I/O remains a bottleneck on some systems.

The Importance of Standards - Software

‹ Writing programs for MPPs is hard ...
‹ But ... it is a one-off effort if written in a standard language.
‹ Past lack of parallel programming standards ...
  ¾ ... has restricted uptake of the technology (to "enthusiasts")
  ¾ ... reduced portability (over a range of current architectures and between future generations)
‹ Now standards exist (PVM, MPI & HPF), which ...
  ¾ ... allow users & manufacturers to protect their software investment
  ¾ ... encourage growth of a "third party" parallel software industry & parallel versions of widely used codes

The Importance of Standards - Hardware

‹ Processors
  ¾ commodity RISC processors
‹ Interconnects
  ¾ high bandwidth, low latency communications protocol
  ¾ no de-facto standard yet (ATM, Fibre Channel, HPPI, FDDI)
‹ Growing demand for a total solution:
  ¾ robust hardware + usable software
‹ HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops


The Future of HPC

‹ The expense of being different is being replaced by the economics of being the same.
‹ HPC needs to lose its "special purpose" tag.
‹ It still has to bring about the promise of scalable general purpose computing ...
‹ ... but it is dangerous to ignore this technology.
‹ Final success will come when MPP technology is embedded in desktop computing.
‹ Yesterday's HPC is today's mainframe is tomorrow's workstation.

Achieving TeraFlops

‹ In 1991: 1 Gflop/s
‹ A 1000-fold increase since then, from:
  ¾ Architecture
    » exploiting parallelism
  ¾ Processor, communication, memory
    » Moore's Law
  ¾ Algorithm improvements
    » block-partitioned algorithms


Future: Petaflops (10^15 fl pt ops/s)

‹ Today ≈ 10^15 flops for our workstations
‹ A Pflop for 1 second ≈ a typical workstation computing for 1 year.
‹ From an algorithmic standpoint:
  ¾ concurrency
  ¾ data locality
  ¾ latency & synchronization
  ¾ floating point accuracy
  ¾ dynamic redistribution of workload
  ¾ new languages and constructs
  ¾ role of numerical libraries
  ¾ algorithm adaptation to hardware failure

A Petaflops Computer System

‹ 1 Pflop/s sustained computing
‹ Between 10,000 and 1,000,000 processors
‹ Between 10 TB and 1 PB of main memory
‹ Commensurate I/O bandwidth, mass store, etc.
‹ If built today, it would cost $40 B and consume 1 TWatt.
‹ May be feasible and "affordable" by the year 2010

