CS 594 Spring 2007 Lecture 4: Overview of High-Performance Computing

Jack Dongarra Computer Science Department University of Tennessee

1

Top 500 Computers

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK benchmark (TPP performance): solve Ax = b, dense problem

Updated twice a year: at the SC'xy conference in the States in November, and at the meeting in Germany in June.

2

What is a Supercomputer?

X A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.

X Over the last 14 years the range of the Top500 has increased faster than Moore's Law:
 ¾ 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
 ¾ 2007: #1 = 280 TFlop/s, #500 = 2.73 TFlop/s
X Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.

Architecture/Systems Continuum

Tightly Coupled (Custom):
X Custom processor with custom interconnect
 ¾ Cray X1
 ¾ NEC SX-8
 ¾ IBM Regatta
 ¾ IBM Blue Gene/L
X Best processor performance for codes that are not "cache friendly"
X Good communication performance
X Simpler programming model
X Most expensive

Hybrid:
X Commodity processor with custom interconnect
 ¾ SGI Altix
  » Intel Itanium 2
 ¾ Cray XT3
  » AMD Opteron
X Good communication performance
X Good scalability

Loosely Coupled (Commod):
X Commodity processor with commodity interconnect
 ¾ Clusters
  » Pentium, Itanium, Opteron, Alpha
  » GigE, Infiniband, Myrinet, Quadrics
 ¾ NEC TX7
 ¾ IBM eServer
 ¾ Dawning
X Best price/performance (for codes that work well with caches and are latency tolerant)
X More complex programming model

[Chart: share of Top500 systems that are Custom, Hybrid, and Commod, Jun-93 through Dec-04]

Architectures / Systems

[Chart: number of Top500 systems by architecture class (SIMD, Single Proc., SMP, Const., Cluster, MPP), 1993-2006]

5

Supercomputing Changes Over Time 500 Fastest Systems Over the Past 14 Years

[Chart: performance of the 500 fastest systems, 1993-2006, on a log scale from 100 Mflop/s to 1 Pflop/s. Lines: the SUM of the 500 fastest computers (3.54 PF/s in 2006); the fastest computer (59.7 GF/s Fujitsu 'NWT' in 1993, later Intel ASCI Red, IBM ASCI White at 1.167 TF/s, NEC Earth Simulator, and IBM BlueGene/L at 280.6 TF/s); the computer at position 500 (0.4 GF/s in 1993, 2.74 TF/s in 2006); and "My Laptop". The #500 system trails the #1 system by roughly 6-8 years.]

A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.

28th List: The TOP10

Rank  Manufacturer  Computer                             Rmax [TF/s]  Installation Site                      Country  #Proc    Year  Arch
 1    IBM           BlueGene/L, eServer Blue Gene          280.6      DOE/NNSA/LLNL                          USA      131,072  2005  Custom
 2    Sandia/Cray   Cray XT3                               101.4      NNSA/Sandia                            USA       26,544  2006  Hybrid
 3    IBM           BGW, eServer Blue Gene                  91.29     IBM Thomas Watson                      USA       40,960  2005  Custom
 4    IBM           ASC Purple, eServer pSeries p575        75.76     DOE/NNSA/LLNL                          USA       12,208  2005  Custom
 5    IBM           MareNostrum, JS21 Cluster, Myrinet      62.63     Barcelona Supercomputer Center         Spain     12,240  2006  Commod
 6    Dell          Thunderbird, PowerEdge 1850, IB         53.00     NNSA/Sandia                            USA        9,024  2005  Commod
 7    Bull          Tera-10, NovaScale 5160, Quadrics       52.84     CEA                                    France     9,968  2006  Commod
 8    SGI           Columbia, Altix, Infiniband             51.87     NASA Ames                              USA       10,160  2004  Hybrid
 9    NEC/Sun       Tsubame, Fire x4600, ClearSpeed, IB     47.38     GSIC / Tokyo Institute of Technology   Japan     11,088  2006  Commod
10    Cray          Jaguar, Cray XT3                        43.48     ORNL                                   USA       10,424  2006  Hybrid

IBM BlueGene/L #1

X 131,072 processors (64 racks, 64x32x32)
X Total of 18 BlueGene systems, all in the Top100
X 1.6 MWatts (about 1600 homes); 43,000 ops/s/person

Packaging hierarchy (peak, memory):
 ¾ Chip (2 processors): 2.8/5.6 GF/s, 4 MB (cache)
 ¾ Compute Card (2 chips, 2x1x1 = 4 processors): 5.6/11.2 GF/s, 1 GB DDR
 ¾ Node Board (32 chips, 4x4x2; 16 compute cards = 64 processors): 90/180 GF/s, 16 GB DDR
 ¾ Rack (32 node boards, 8x8x16 = 2048 processors): 2.9/5.7 TF/s, 0.5 TB DDR
 ¾ Full system (64 racks, 64x32x32 = 131,072 processors): 180/360 TF/s, 32 TB DDR

X The compute node ASICs include all networking and processor functionality. Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).
X "Fastest Computer": BG/L at 700 MHz, 131K processors, 64 racks. Peak: 367 Tflop/s. Linpack: 281 Tflop/s, 77% of peak (run of about 13K seconds, roughly 3.6 hours, with n = 1.8M).

Performance Projection

[Chart: projected Top500 performance, 1993-2015, extrapolating the SUM, N=1, and N=500 trend lines on a log scale from 100 Mflop/s to 1 Eflop/s; the annotations mark lags of roughly 6-8 and 8-10 years between the lines.]

Customer Segments / Performance

[Chart: share of Top500 performance by customer segment (Government, Classified, Vendor, Academic, Industry, Research), 0-100%, 1993-2006]

10

Processor Types

[Chart: number of Top500 systems by processor type (SIMD, Vector, other scalar, AMD, Sparc, MIPS, Intel, HP, Power, Alpha), 1993-2006]

11

Processors Used in Each of the 500 Systems

92% of systems use Intel (51%), AMD (22%), or IBM (19%) processors:
 ¾ Intel EM64T 22%
 ¾ Intel IA-32 22%
 ¾ Intel IA-64 7%
 ¾ AMD x86_64 22%
 ¾ IBM Power 19%
 ¾ HP PA-RISC 4%
 ¾ Sun Sparc 1%
 ¾ NEC 1%
 ¾ HP Alpha 1%
 ¾ Cray 1%

Interconnects / Systems

[Chart: number of Top500 systems by interconnect (Others, Cray Interconnect, SP Switch, Crossbar, Quadrics, Infiniband (78), Myrinet (79), Gigabit Ethernet (211), N/A), 1993-2006]

GigE + Infiniband + Myrinet = 74%

Processors per System - Nov 2006

[Histogram: number of systems (0-200) by processors per system, in bins 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4k-8k, 8k-16k, 16k-32k, 32k-64k, 64k-128k]

KFlop/s per Capita (Flops/Pop), Based on the November 2005 Top500

[Bar chart, 0-6000 KFlop/s per capita by country, including the United States, New Zealand, Israel, the United Kingdom, Japan, Germany, the Netherlands, Sweden, Switzerland, Australia, Canada, France, Spain, Taiwan, Korea (South), Italy, Saudi Arabia, Mexico, Russia, China, Brazil, and India]

Hint: Peter Jackson had something to do with this: WETA Digital (Lord of the Rings). It has nothing to do with the 47.2 million sheep in NZ.

Environmental Burden of PC CPUs

Source: Cool Chips & Micro 32

Power Consumption of World's CPUs

Year    Power (MW)   # CPUs (millions)
1992        180             87
1994        392            128
1996        959            189
1998      2,349            279
2000      5,752            412
2002     14,083            607
2004     34,485            896
2006     87,439          1,321

17

Power is an Industry Wide Problem

X Google facilities
 ¾ leveraging hydroelectric power
  » old aluminum plants
 ¾ >500,000 servers worldwide
("Hiding in Plain Sight, Google Seeks More Power", by John Markoff, The New York Times, June 14, 2006)

New Google plant in The Dalles, Oregon (from the NYT, June 14, 2006)

And Now We Want Petascale …

High-Speed Train: 10 Megawatts.   Conventional Power Plant: 300 Megawatts.

X What is a conventional petascale machine? ¾ Many high-speed bullet trains … a significant start to a conventional power plant. ¾ “Hiding in Plain Sight, Google Seeks More Power,” The New York Times, June 14, 2006.

19

Top Three Reasons for “Eliminating” Global Climate Warming in the Machine Room

3. HPC Contributes to Global Climate Warming
 ¾ "I worry that we, as HPC experts in global climate modeling, are contributing to the very thing that we are trying to avoid: the generation of greenhouse gases."
2. Electrical Power Costs $$$
 ¾ Japanese Earth Simulator
  » Power & Cooling: 12 MW/year → $9.6 million/year?
 ¾ Lawrence Livermore National Laboratory
  » Power & Cooling of HPC: $14 million/year
  » Power-up of ASC Purple → "Panic" call from the local electrical company.
1. Reliability & Availability Impact Productivity
 ¾ California: State of Electrical Emergencies (July 24-25, 2006)
  » 50,538 MW: a load not expected to be reached until 2010!

Reliability & Availability of HPC

System          CPUs      Reliability & Availability
ASCI Q          8,192     MTBI: 6.5 hrs. 114 unplanned outages/month. HW outage sources: storage, CPU, memory.
ASCI White      8,192     MTBF: 5 hrs. (2001) and 40 hrs. (2003). HW outage sources: storage, CPU, 3rd-party HW.
NERSC Seaborg   6,656     MTBI: 14 days. MTTR: 3.3 hrs. SW is the main outage source. Availability: 98.74%.
PSC Lemieux     3,016     MTBI: 9.7 hrs. Availability: 98.33%.
Google         ~15,000    20 reboots/day; 2-3% of machines replaced/year. HW outage sources: storage, memory. Availability: ~100%.

MTBI: mean time between interrupts; MTBF: mean time between failures; MTTR: mean time to restore. Source: Daniel A. Reed, RENCI

Fuel Efficiency: GFlops/Watt

[Bar chart: GFlops/Watt, roughly 0.1 to 0.9, for each of the top 20 systems; the x-axis labels give each system and its processor clock rate]

Top 20 systems. Based on processor power rating only (3, >100, >800).

Top500 Conclusions

X Microprocessor-based systems have brought a major change in accessibility and affordability.
X MPPs continue to account for more than half of all installed high-performance computers worldwide.

23

With All the Hype on the PS3, We Became Interested

X The PlayStation 3's CPU is based on a "Cell" processor
X Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing element: SPU + DMA engine)
 ¾ An SPE is a self-contained vector processor which acts independently from the others
  » 4-way SIMD floating point units capable of a total of 25.6 Gflop/s per SPE @ 3.2 GHz
 ¾ 204.8 Gflop/s peak!
 ¾ The catch is that this is for 32-bit floating point (single precision, SP)
 ¾ 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
  » Divide the SP peak by 14: a factor of 2 because of DP and 7 because of latency issues
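The peak figures above can be checked with a little arithmetic (a sketch; the assumption that each SIMD lane retires a fused multiply-add, i.e. 2 flops per cycle, is mine and is not stated on the slide):

  4 lanes x 2 flops x 3.2 GHz  = 25.6 Gflop/s per SPE
  8 SPEs x 25.6 Gflop/s        = 204.8 Gflop/s single-precision peak
  204.8 / 14                   ≈ 14.6 Gflop/s double precision, total for all 8 SPEs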

24

Increase Clock Rate & Transistor Density; Lower Voltage

We have seen increasing number of gates on a chip and increasing clock speed.

Heat is becoming an unmanageable problem: Intel processors now exceed 100 Watts.

We will not see the dramatic increases in clock speeds in the future.

However, the number of gates on a chip will continue to increase.

Intel will double the processing power on a per-watt basis.

Intel Prediction of Microprocessor Frequency (ca. 2001)

[Chart: clock frequency in MHz (log scale, 0.1 to 10,000) vs. year (1970-2010) for the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, and P6 processors; frequency doubles every 2 years]

Adapted from a presentation by S. Borkar, Intel

26

Intel Prediction of Microprocessor Power Consumption (ca. 2001)

[Chart: power in Watts (log scale, 0.1 to 100) vs. year (1971-2000) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, and Pentium processors]

Adapted from a presentation by S. Borkar, Intel

27

Moore's Law for Power (P ∝ V²f)

[Chart: chip maximum power in watts/cm² (log scale, 1 to 1000) vs. process generation (1.5μ down to 0.07μ, roughly 1985-2001): I486 at ~2 watts, Pentium at ~14 watts, Pentium II and Pentium III at ~35 watts each, later chips at ~75 and ~130 watts (Itanium); the curve has surpassed a heating plate (~30 watts) and is not too long from reaching a nuclear reactor]

Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, MICRO32 and Transmeta

[Chart: operations per second for serial code vs. additional operations per second available to concurrent code. One axis shows a single core scaling from 3 GHz to 6, 12, and 24 GHz: the free lunch for traditional software (it just runs twice as fast every 18 months with no change to the code). The other axis shows 3 GHz parts with 1, 2, 4, and 8 cores: no free lunch for traditional software (without highly concurrent software it won't get any faster), since the additional operations per second are available only if the code can take advantage of concurrency.]

What is Multicore?

X A single chip
X Multiple distinct processing engines
X Multiple, independent threads of control (i.e., program counters: MIMD)

30

Integration is Efficient

X Discrete chips: bandwidth ~2 GB/s, latency ~60 ns
X Multicore: bandwidth > 20 GB/s, latency < 3 ns

Parallelism and interconnect efficiency enable harnessing the power of n: n cores can yield an n-fold increase in performance.

Power Cost of Frequency

X Power ∝ Voltage² x Frequency (V²F)
X Frequency ∝ Voltage
X Therefore Power ∝ Frequency³
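A small worked consequence (a sketch based only on the proportionalities above): trading frequency for cores saves power at a constant work rate.

  Power ∝ V²F and V ∝ F  ⇒  Power ∝ F³
  One core at frequency F:        performance ∝ F,      power ∝ F³
  Two cores at frequency 0.75 F:  performance ∝ 1.5 F,  power ∝ 2 x (0.75 F)³ ≈ 0.84 F³

So two slower cores can deliver 1.5x the throughput for roughly 16% less power, which is the power argument behind multicore.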

32


Interconnect Options

34

Many Changes

X Many changes in our hardware over the past 30 years
 ¾ Superscalar, Vector, Distributed Memory, Shared Memory, Multicore, …

[Chart: Top500 Systems/Architectures, 1993-2006: Const., Cluster, MPP, SMP, SIMD, Single Proc.]

X Today’s memory hierarchy is much more complicated.

35

Distributed and Parallel Systems

[Diagram: a spectrum of systems ranging from distributed, heterogeneous systems (e.g., SETI@home, Grid-based computing, clusters) to massively parallel, homogeneous systems]

Distributed systems:
X Gather (unused) resources
X Steal cycles
X System SW manages resources
X System SW adds value
X 10%-20% overhead is OK
X Resources drive applications
X Time to completion is not critical
X Time-shared

Massively parallel systems:
X Bounded set of resources
X Apps grow to consume all cycles
X Application manages resources
X System SW gets in the way
X 5% overhead is maximum
X Apps drive purchase of equipment
X Real-time constraints
X Space-shared

Virtual Environments

[The slide shows a wall of raw floating-point values from a simulation output]

Do they make any sense?

37

38

The Power of Optimal Algorithms

X Advances in algorithmic efficiency rival advances in hardware architecture
X Consider Poisson's equation ∇²u = f on a cube of size N = n³

Year   Method        Reference                 Storage   Flops
1947   GE (banded)   Von Neumann & Goldstine   n^5       n^7
1950   Optimal SOR   Young                     n^3       n^4 log n
1971   CG            Reid                      n^3       n^3.5 log n
1984   Full MG       Brandt                    n^3       n^3

X If n = 64, this implies an overall reduction in flops of ~16 million

Algorithms and Moore's Law

X This advance took place over a span of about 36 years, or 24 doubling times for Moore's Law
X 2^24 ≈ 16 million ⇒ the same as the factor from the algorithms alone!

[Chart: relative speedup vs. year for Moore's Law and for the successive algorithms]

Different Architectures

X Parallel computing: single systems with many processors working on the same problem
X Distributed computing: many systems loosely coupled to work on related problems
X Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems

41

Types of Parallel Computers

X The simplest and most useful way to classify modern parallel computers is by their memory model: ¾ shared memory ¾ distributed memory

42

Shared vs. Distributed Memory

X Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
 [Diagram: processors P connected by a BUS to a single Memory]
X Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)
 [Diagram: processor-memory (P-M) pairs connected by a Network]

43

Shared Memory: UMA vs. NUMA

X Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors (SMPs). (Ex: Sun E10000)
X Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Ex: SGI Origin)

44

Distributed Memory: MPPs vs. Clusters

X Processor-memory nodes are connected by some type of interconnect network
 ¾ Massively Parallel Processor (MPP): tightly integrated, single system image
 ¾ Cluster: individual computers connected by software

[Diagram: CPU + MEM nodes connected by an interconnect network]

Processors, Memory, & Networks

X Both shared and distributed memory systems have:
 1. processors: now generally commodity processors
 2. memory: now generally commodity DRAM
 3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

46

Interconnect-Related Terms

X Latency: How long does it take to start sending a "message"? Measured in microseconds. (Also in processors: How long does it take to output results of some operations, such as floating point add, divide etc., which are pipelined?) X Bandwidth: What data rate can be sustained once the message is started? Measured in Mbytes/sec.

47

Interconnect-Related Terms

Topology: the manner in which the nodes are connected. ¾ Best choice would be a fully connected network (every processor to every other). Unfeasible for cost and scaling reasons. ¾ Instead, processors are arranged in some variation of a grid, torus, or hypercube.

3-d hypercube 2-d mesh 2-d torus

48

Standard Uniprocessor Memory Hierarchy

X Intel Pentium 4 processor, 2 GHz (P7 Prescott, socket 478)
X On chip:
 ¾ 8 Kbytes of 4-way assoc. L1 instruction cache with 32-byte lines
 ¾ 8 Kbytes of 4-way assoc. L1 data cache with 32-byte lines
 ¾ 256 Kbytes of 8-way assoc. L2 cache with 32-byte lines
 ¾ 400 MB/s bus speed
 ¾ SSE2 provides a peak of 4 Gflop/s

[Diagram: Level-1 Cache, Level-2 Cache, Bus, System Memory]

X Each flop requires 3 words of data.
X At 4 Gflop/s that needs 12 GW/s of bandwidth, but the bus has only 0.5 GW/s.
X So if driven from memory, the processor runs 24 times off of the peak rate!
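The 24x figure follows from a simple balance calculation (a sketch using the slide's numbers, counting bandwidth in words per second):

  required bandwidth   = 4 Gflop/s x 3 words/flop = 12 GW/s
  available bandwidth  ≈ 0.5 GW/s
  slowdown             = 12 / 0.5 = 24x below peak when operands stream from memory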

Locality and Parallelism

[Diagram: conventional storage hierarchy: each Proc with its Cache, L2 Cache, L3 Cache, and Memory, connected by interconnects with potential for remote access]

X Large memories are slow, fast memories are small
X Storage hierarchies are large and fast on average
X Parallel processors, collectively, have large, fast caches ($)
 ¾ the slow accesses to "remote" data we call "communication"
X Algorithms should do most work on local data

51

Amdahl’s Law

Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

t_N = (f_p/N + f_s) t_1          (effect of multiple processors on run time)

S = 1 / (f_s + f_p/N)            (effect of multiple processors on speedup)

where f_s = serial fraction of code, f_p = parallel fraction of code = 1 - f_s, and N = number of processors

52

Amdahl's Law: Theoretical Maximum Speedup of Parallel Execution

X speedup = 1/(P/N + S)
 » P = parallel code fraction, S = serial code fraction, N = number of processors
 ¾ Example: image processing
  » 30 minutes of preparation (serial)
  » one minute to scan a region
  » 30 minutes of cleanup (serial)

X Speedup increases with the number of processors, but it is restricted by the serial portion (see the sketch below).
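A minimal C sketch of Amdahl's Law applied to the image-processing example; the number of regions (100) is a hypothetical value, since the slide does not give one.

/* Amdahl's Law for the image-processing example: 30 + 30 minutes of serial
   prep/cleanup plus one minute per region, with the scans run in parallel.
   The region count is an assumed value chosen for illustration. */
#include <stdio.h>

int main(void) {
    const double serial  = 30.0 + 30.0;     /* serial fraction: prep + cleanup (minutes) */
    const double regions = 100.0;           /* assumed number of 1-minute scans          */
    const double t1 = serial + regions;     /* time on one processor                     */

    for (int N = 1; N <= 1024; N *= 4) {
        double tN = serial + regions / N;   /* only the scans speed up                   */
        printf("N = %4d   time = %6.1f min   speedup = %.2f\n", N, tN, t1 / tN);
    }
    /* Speedup approaches t1/serial = 160/60 = 2.67 no matter how large N gets. */
    return 0;
}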

53

Illustration of Amdahl's Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Chart: speedup vs. number of processors (0-250) for f_p = 1.000, 0.999, 0.990, and 0.900; "What's going on here?": the curves flatten quickly once f_p is even slightly below 1]

Amdahl's Law vs. Reality

Amdahl's Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.

[Chart: speedup vs. number of processors (0-250) for f_p = 0.99, comparing the Amdahl's Law prediction with reality; the measured curve falls increasingly below the prediction]

Overhead of Parallelism

X Given enough parallel work, this is the biggest barrier to getting desired speedup
X Parallelism overheads include:
 ¾ cost of starting a thread or process
 ¾ cost of communicating shared data
 ¾ cost of synchronizing
 ¾ extra (redundant) computation
X Each of these can be in the range of milliseconds (= millions of flops) on some systems
X Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

56

Locality and Parallelism

[Diagram: conventional storage hierarchy: each Proc with its Cache, L2 Cache, L3 Cache, and Memory, connected by interconnects with potential for remote access]

X Large memories are slow, fast memories are small
X Storage hierarchies are large and fast on average
X Parallel processors, collectively, have large, fast caches ($)
 ¾ the slow accesses to "remote" data we call "communication"
X Algorithms should do most work on local data

Load Imbalance

X Load imbalance is the time that some processors in the system are idle due to
 ¾ insufficient parallelism (during that phase)
 ¾ unequal size tasks
X Examples of the latter
 ¾ adapting to "interesting parts of a domain"
 ¾ tree-structured computations
 ¾ fundamentally unstructured problems
X Algorithm needs to balance load

58

29 What is Ahead?

X Greater instruction level parallelism? X Bigger caches? X Multiple processors per chip? X Complete systems on a chip? (Portable Systems)

X High performance LAN, Interface, and Interconnect

59

Directions

X Move toward shared memory
 ¾ SMPs and Distributed Shared Memory
 ¾ Shared address space with a deep memory hierarchy
X Clustering of shared memory machines for scalability
X Efficiency of message passing and data parallel programming
 ¾ Helped by standards efforts such as MPI and HPF

60

Question:

X Suppose we want to compute, using four-digit decimal arithmetic:
 ¾ S = 1.000 + 1.000 x 10^4 - 1.000 x 10^4
 ¾ What's the answer?
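The same effect is easy to reproduce in binary floating point (a sketch, not the four-digit decimal arithmetic of the question): with finite precision, addition is not associative.

/* Rounding makes floating-point addition non-associative: adding 1 to a
   number that is too large to "feel" it loses the 1 entirely. */
#include <stdio.h>

int main(void) {
    float one = 1.0f;
    float big = 1.0e8f;                  /* 1 ulp at 1e8 is 8 in single precision */

    float left  = (one + big) - big;     /* the 1 is rounded away: prints 0 */
    float right = one + (big - big);     /* big - big is exact:     prints 1 */

    printf("(1 + 1e8) - 1e8 = %g\n", left);
    printf("1 + (1e8 - 1e8) = %g\n", right);
    return 0;
}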

61

Defining Floating Point Arithmetic

X Representable numbers
 ¾ Scientific notation: +/- d.d…d x r^exp
 ¾ sign bit +/-
 ¾ radix r (usually 2 or 10, sometimes 16)
 ¾ significand d.d…d (how many base-r digits d?)
 ¾ exponent exp (range?)
 ¾ others?
X Operations:
 ¾ arithmetic: +, -, x, /, …
  » how to round the result to fit in the format
 ¾ comparison (<, =, >)
 ¾ conversion between different formats
  » short to long FP numbers, FP to integer
 ¾ exception handling
  » what to do for 0/0, 2*largest_number, etc.
 ¾ binary/decimal conversion
  » for I/O, when the radix is not 10

31 IEEE Floating Point Arithmetic Standard 754 (1985) - Normalized Numbers

X Normalized Nonzero Representable Numbers: +-1.d…d x 2^exp
 ¾ Macheps = machine epsilon = 2^(-#significand bits) = relative error in each operation; the smallest number ε such that fl(1 + ε) > 1
 ¾ OV = overflow threshold = largest number
 ¾ UN = underflow threshold = smallest number

Format            #bits   #significand bits   macheps              #exponent bits   exponent range
Single              32        23+1            2^-24  (~10^-7)            8          2^-126 to 2^127    (~10^±38)
Double              64        52+1            2^-53  (~10^-16)          11          2^-1022 to 2^1023  (~10^±308)
Double Extended   >=80       >=64            <=2^-64 (~10^-19)        >=15          2^-16382 to 2^16383 (~10^±4932)
                  (80 bits on all Intel machines)

X +- Zero: +-, significand and exponent all zero
 ¾ Why bother with -0? (discussed later)
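For reference, the single- and double-precision parameters in the table can be queried from <float.h> (a sketch; note the C macros use the "gap between 1 and the next representable number" convention, which is twice the rounding-error macheps defined above).

/* Print machine epsilon and the overflow/underflow thresholds (OV, UN). */
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("single: eps = %e  OV = %e  UN = %e\n",
           (double)FLT_EPSILON, (double)FLT_MAX, (double)FLT_MIN);
    printf("double: eps = %e  OV = %e  UN = %e\n",
           DBL_EPSILON, DBL_MAX, DBL_MIN);
    return 0;
}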

IEEE Floating Point Arithmetic Standard 754 - “Denorms”

X Denormalized Numbers: +-0.d…d x 2^min_exp
 ¾ sign bit, nonzero significand, minimum exponent
 ¾ fills in the gap between UN and 0
X Underflow Exception
 ¾ occurs when the exact nonzero result is less than the underflow threshold UN
 ¾ Ex: UN/3
 ¾ return a denorm, or zero

64

32 IEEE Floating Point Arithmetic Standard 754 - +- Infinity

X +- Infinity: sign bit, zero significand, maximum exponent
X Overflow Exception
 ¾ occurs when the exact finite result is too large to represent accurately
 ¾ Ex: 2*OV
 ¾ return +- infinity
X Divide-by-zero Exception
 ¾ return +- infinity = 1/+-0
 ¾ sign of zero important!
X Also return +- infinity for
 ¾ 3+infinity, 2*infinity, infinity*infinity
 ¾ the result is exact, not an exception!

65

IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)

X NAN: sign bit, nonzero significand, maximum exponent
X Invalid Exception
 ¾ occurs when the exact result is not a well-defined real number
 ¾ 0/0
 ¾ sqrt(-1)
 ¾ infinity-infinity, infinity/infinity, 0*infinity
 ¾ NAN + 3
 ¾ NAN > 3?
 ¾ return a NAN in all these cases
X Two kinds of NANs
 ¾ Quiet: propagates without raising an exception
 ¾ Signaling: generates an exception when touched
  » good for detecting uninitialized data

66

33 Error Analysis

X Basic error formula
 ¾ fl(a op b) = (a op b)*(1 + d), where
  » op is one of +, -, *, /
  » |d| <= macheps
  » assuming no overflow, underflow, or divide by zero
X Example: adding 4 numbers
 ¾ fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
     = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3)
       + x3*(1+d2)*(1+d3) + x4*(1+d3)
     = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
   where each |ei| <~ 3*macheps
 ¾ we get the exact sum of slightly changed summands xi*(1+ei)
 ¾ Backward Error Analysis: an algorithm is called numerically stable if it gives the exact result for slightly changed inputs
 ¾ Numerical stability is an algorithm design goal

Backward error

X Approximate solution is exact solution to modified problem. X How large a modification to original problem is required to give result actually obtained? X How much data error in initial input would be required to explain all the error in computed results? X Approximate solution is good if it is exact solution to “nearby” problem.

[Diagram: the approximate function f' maps the input x to f'(x), while the true f maps x to f(x); x' is the nearby input with f(x') = f'(x). The backward error is the distance from x to x'; the forward error is the distance from f(x) to f'(x).]

34 Sensitivity and Conditioning X Problem is insensitive or well conditioned if relative change in input causes commensurate relative change in solution. X Problem is sensitive or ill-conditioned, if relative change in solution can be much larger than that in input data.

Cond = |Relative change in solution| / |Relative change in input data| = |[f(x’) – f(x)]/f(x)| / |(x’ – x)/x|

X Problem is sensitive, or ill-conditioned, if cond >> 1.

X When the function f is evaluated for approximate input x' = x + h instead of the true input value x:
X Absolute error = f(x + h) - f(x) ≈ h f'(x)
X Relative error = [f(x + h) - f(x)] / f(x) ≈ h f'(x) / f(x)

Sensitivity: 2 Examples cos(π/2) and 2-d System of Equations

X Consider the problem of computing the cosine function for arguments near π/2.
X Let x ≈ π/2 and let h be a small perturbation to x. Then:
   absolute error = cos(x+h) - cos(x) ≈ -h sin(x) ≈ -h
   relative error ≈ -h tan(x) ≈ ∞
  (in general: absolute error = f(x+h) - f(x) ≈ h f'(x); relative error = [f(x+h) - f(x)]/f(x) ≈ h f'(x)/f(x))
X So a small change in x near π/2 causes a large relative change in cos(x), regardless of the method used.
X cos(1.57079) = 0.63267949 x 10^-5
X cos(1.57078) = 1.63267949 x 10^-5
X The relative change in output is a quarter million times greater than the relative change in input.
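A quick check of the quarter-million figure (a sketch; it simply recomputes the two cosine values above and forms the ratio of relative changes):

/* Condition estimate for cos(x) near pi/2, using the two arguments on the slide. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double x1 = 1.57079, x2 = 1.57078;
    double rel_out = (cos(x2) - cos(x1)) / cos(x1);   /* relative change in solution */
    double rel_in  = (x2 - x1) / x1;                  /* relative change in input    */
    printf("cos(%.5f) = %.8e\n", x1, cos(x1));
    printf("cos(%.5f) = %.8e\n", x2, cos(x2));
    printf("condition estimate = %.0f\n", fabs(rel_out / rel_in));   /* about 2.5e5 */
    return 0;
}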

Sensitivity: 2 Examples, cos(π/2) and a 2-d System of Equations

X Second example: a 2-d system of equations
   a*x1 + b*x2 = f
   c*x1 + d*x2 = g
(the cosine example from the previous slide is repeated alongside it: cos(1.57079) = 0.63267949 x 10^-5, cos(1.57078) = 1.63267949 x 10^-5; the relative change in output is a quarter million times greater than the relative change in input)

Example: Polynomial Evaluation Using Horner’s Rule

X Horner's rule to evaluate p = Σ_{k=0}^{n} c_k * x^k
 ¾ p = c_n; for k = n-1 down to 0: p = x*p + c_k
X Numerically stable
X Apply to (x-2)^9 = x^9 - 18*x^8 + … - 512, written in its expanded form
X Evaluated around x = 2

function HornerPoly(c, n, x)
begin
  p := c[n];
  for k := n-1 downto 0 do
    p := p*x + c[k];
  end { for }
  HornerPoly := p;
end { HornerPoly }

36 Example: polynomial evaluation (continued)

X (x-2)^9 = x^9 - 18*x^8 + … - 512
X We can compute error bounds using
 ¾ fl(a op b) = (a op b)*(1+d)
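A small C sketch of the experiment the slide describes: evaluating (x-2)^9 near x = 2 from its expanded coefficients with Horner's rule and comparing with the factored form. The expanded form loses almost all significant digits near the root.

/* Coefficients of (x-2)^9 = sum_{k=0}^{9} c[k] * x^k */
#include <stdio.h>
#include <math.h>

static const double c[10] = { -512.0, 2304.0, -4608.0, 5376.0, -4032.0,
                               2016.0,  -672.0,  144.0,   -18.0,    1.0 };

static double horner(double x) {
    double p = c[9];
    for (int k = 8; k >= 0; k--)
        p = p * x + c[k];                 /* p = p*x + c[k], as in HornerPoly above */
    return p;
}

int main(void) {
    for (double x = 1.99; x <= 2.011; x += 0.002)
        printf("x = %6.3f   expanded = % .3e   (x-2)^9 = % .3e\n",
               x, horner(x), pow(x - 2.0, 9));
    return 0;
}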

73

Exception Handling

X What happens when the "exact value" is not a real number, or is too small or too large to represent accurately?
X 5 exceptions:
 ¾ Overflow: exact result > OV, too large to represent
 ¾ Underflow: exact result nonzero and < UN, too small to represent
 ¾ Divide-by-zero: nonzero/0
 ¾ Invalid: 0/0, sqrt(-1), …
 ¾ Inexact: you made a rounding error (very common!)
X Possible responses
 ¾ Stop with an error message (unfriendly, not the default)
 ¾ Keep computing (the default, but how?)

74

37 Summary of Values Representable in IEEE FP

Value                         Sign   Exponent           Significand
+- Zero                       +-     0…0                0……0
Normalized nonzero numbers    +-     not 0 or all 1s    anything
Denormalized numbers          +-     0…0                nonzero
+- Infinity                   +-     1…1                0……0
NANs                          +-     1…1                nonzero

 ¾ Signaling and quiet NANs
 ¾ Many systems have only quiet

75

Assuming x and y are non-negative

a = max(x, y),   b = min(x, y)

z = a * sqrt(1 + (b/a)²)   if a > 0
z = 0                      if a = 0

76

38 This Week’s Assignment

77

Hazards of Parallel and Heterogeneous Computing

X What new bugs arise in parallel floating point programs?
X Ex 1: Nonrepeatability
 ¾ makes debugging hard!
X Ex 2: Different exception handling
 ¾ can cause programs to hang
X Ex 3: Different rounding (even on IEEE FP machines)
 ¾ can cause hanging, or wrong results with no warning
X See www.netlib.org/lapack/lawns/lawn112.ps
X IBM RS6K and Java

Types of Parallel Computers

X The simplest and most useful way to classify modern parallel computers is by their memory model: ¾ shared memory ¾ distributed memory

79

Standard Uniprocessor Memory Hierarchy

X Intel Pentium 4 processor, 2 GHz (P7 Prescott, socket 478)
X On chip:
 ¾ 8 Kbytes of 4-way assoc. L1 instruction cache with 32-byte lines
 ¾ 8 Kbytes of 4-way assoc. L1 data cache with 32-byte lines
 ¾ 256 Kbytes of 8-way assoc. L2 cache with 32-byte lines
 ¾ 400 MB/s bus speed
 ¾ SSE2 provides a peak of 4 Gflop/s

[Diagram: Level-1 Cache, Level-2 Cache, Bus, System Memory]

80

Shared Memory / Local Memory

X Usually think in terms of the hardware X What about a software model? X How about something that works like cache? X Logically shared memory

81

Parallel Programming Models

X Control
 ¾ how is parallelism created
 ¾ what orderings exist between operations
 ¾ how do different threads of control synchronize
X Naming
 ¾ what data is private vs. shared
 ¾ how logically shared data is accessed or communicated
X Set of operations
 ¾ what are the basic operations
 ¾ what operations are considered to be atomic
X Cost
 ¾ how do we account for the cost of each of the above

Trivial Example: compute the sum Σ_{i=0}^{n-1} f(A[i])

X Parallel decomposition:
 ¾ each evaluation and each partial sum is a task
X Assign n/p numbers to each of p procs
 ¾ each computes independent "private" results and a partial sum
 ¾ one (or all) collects the p partial sums and computes the global sum

=> Classes of data
X Logically shared
 ¾ the original n numbers, the global sum
X Logically private
 ¾ the individual function evaluations
 ¾ what about the individual partial sums?

Programming Model 1

X Shared Address Space
 ¾ program consists of a collection of threads of control,
 ¾ each with a set of private variables
  » e.g., local variables on the stack
 ¾ collectively with a set of shared variables
  » e.g., static variables, shared common blocks, global heap
 ¾ threads communicate implicitly by writing and reading shared variables
 ¾ threads coordinate explicitly by synchronization operations on shared variables
  » writing and reading flags
  » locks, semaphores
X Like concurrent programming on a uniprocessor

[Diagram: processors P … P with shared variables x, y, A and private variables i, res in each thread]

Model 1

X A shared memory machine
X Processors all connected to a large shared memory
X "Local" memory is not (usually) part of the hardware
 ¾ Sun, DEC, Intel "SMPs" (symmetric multiprocessors) in Millennium; SGI Origin
X Cost: much cheaper to access data in cache than in main memory

[Diagram: P1, P2, …, Pn, each with a cache ($), connected by a network to memory]

X Machine model 1a: A Shared Address Space Machine
 ¾ replace caches by local memories (in the abstract machine model)
 ¾ this affects the cost model: repeatedly accessed data should be copied
 ¾ Cray T3E

Shared Memory code for computing a sum

Thread 1                              Thread 2

[s = 0 initially]                     [s = 0 initially]
local_s1 = 0                          local_s2 = 0
for i = 0, n/2-1                      for i = n/2, n-1
  local_s1 = local_s1 + f(A[i])         local_s2 = local_s2 + f(A[i])
s = s + local_s1                      s = s + local_s2

What could go wrong?

86

43 Pitfall and solution via synchronization

° Pitfall in computing a global sum s = local_s1 + local_s2:

Thread 1 (initially s = 0)                 Thread 2 (initially s = 0)
load s       [from mem to reg]             load s       [from mem to reg; initially 0]
s = s+local_s1   [= local_s1, in reg]      s = s+local_s2   [= local_s2, in reg]
store s      [from reg to mem]             store s      [from reg to mem]
(time runs downward)

° Instructions from different threads can be interleaved arbitrarily
° What can the final result s stored in memory be?
° Race Condition
° Possible solution: Mutual Exclusion with Locks

Thread 1             Thread 2
lock                 lock
load s               load s
s = s+local_s1       s = s+local_s2
store s              store s
unlock               unlock

° Locks must be atomic (execute completely without interruption)
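A runnable C/pthreads sketch of the two-thread sum with the lock fix; the array contents, f(), and sizes are placeholders, not part of the slide.

/* Two-thread shared-memory sum with a mutex protecting the global update. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];
static double s = 0.0;                        /* logically shared global sum   */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return x * x; }   /* stand-in for the real f()     */

struct range { int lo, hi; };

static void *partial_sum(void *arg) {
    struct range *r = arg;
    double local = 0.0;                       /* logically private partial sum */
    for (int i = r->lo; i < r->hi; i++)
        local += f(A[i]);
    pthread_mutex_lock(&s_lock);              /* without the lock: race on s   */
    s += local;
    pthread_mutex_unlock(&s_lock);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1.0 / (i + 1);
    struct range r1 = { 0, N / 2 }, r2 = { N / 2, N };
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial_sum, &r1);
    pthread_create(&t2, NULL, partial_sum, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %f\n", s);
    return 0;
}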

Programming Model 2

X Message Passing
 ¾ program consists of a collection of named processes
  » thread of control plus local address space
  » local variables, static variables, common blocks, heap
 ¾ processes communicate by explicit data transfers
  » matching pair of send & receive by source and destination process
 ¾ coordination is implicit in every communication event
 ¾ logically shared data is partitioned over local processes
X Like distributed programming
° Program with standard libraries: MPI, PVM

[Diagram: processes P0 … Pn, each with its own A, i, res, s; "send P0,X" on one process matches "recv Pn,Y" on another]

Model 2

X A distributed memory machine
 ¾ Cray T3E, IBM SP2, Clusters
X Processors all connected to their own memory (and caches)
 ¾ cannot directly access another processor's memory
X Each "node" has a network interface (NI)
 ¾ all communication and synchronization done through this

[Diagram: nodes (P + NI + memory) connected by an interconnect]

Computing s = x(1)+x(2) on each processor

° First possible solution:

Processor 1 [xlocal = x(1)]          Processor 2 [xlocal = x(2)]
send xlocal, proc2                   receive xremote, proc1
receive xremote, proc2               send xlocal, proc1
s = xlocal + xremote                 s = xlocal + xremote

° Second possible solution: what could go wrong?

Processor 1 [xlocal = x(1)]          Processor 2 [xlocal = x(2)]
send xlocal, proc2                   send xlocal, proc1
receive xremote, proc2               receive xremote, proc1
s = xlocal + xremote                 s = xlocal + xremote

° What if send/receive act like the telephone system? The post office?
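A sketch of the exchange in MPI. MPI_Sendrecv pairs each send with its receive inside the library, so the code cannot deadlock regardless of whether sends behave like the telephone system (synchronous) or the post office (buffered).

/* Two-process exchange and sum, s = x(1) + x(2) on each processor. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double xlocal = (rank == 0) ? 1.0 : 2.0;   /* x(1) on proc 0, x(2) on proc 1 */
    double xremote;
    int other = 1 - rank;                      /* assumes exactly 2 processes    */

    MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                 &xremote, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("proc %d: s = %f\n", rank, xlocal + xremote);
    MPI_Finalize();
    return 0;
}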

Programming Model 3

X Data Parallel
 ¾ single sequential thread of control consisting of parallel operations
 ¾ parallel operations applied to all (or a defined subset) of a data structure
 ¾ communication is implicit in parallel operators and "shifted" data structures
 ¾ elegant and easy to understand and reason about
 ¾ not all problems fit this model
X Like marching in a regiment
X Example: A = array of all data; fA = f(A); s = sum(fA)
° Think of Matlab

Model 3

X Vector Computing
 ¾ one instruction executed across all the data in a pipelined fashion
 ¾ parallel operations applied to all (or a defined subset) of a data structure
 ¾ communication is implicit in parallel operators and "shifted" data structures
 ¾ elegant and easy to understand and reason about
 ¾ not all problems fit this model
X Like marching in a regiment
X Example: A = array of all data; fA = f(A); s = sum(fA)
° Think of Matlab

Model 3

X A SIMD (Single Instruction Multiple Data) machine
X A large number of small processors
X A single "control processor" issues each instruction
 ¾ each processor executes the same instruction
 ¾ some processors may be turned off on any instruction

[Diagram: a control processor driving nodes (P + NI + memory) connected by an interconnect]

X Machines not popular (CM2), but the programming model is
 ¾ implemented by mapping n-fold parallelism to p processors
 ¾ mostly done in the compilers (HPF = High Performance Fortran)

Model 4

X Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
X CLUMP = Cluster of SMPs
X Shared memory within one SMP, message passing outside
X Clusters, ASCI Red (Intel), …
X Programming model?
 ¾ Treat the machine as "flat", always using message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
 ¾ Expose two layers: shared memory (OpenMP) and message passing (MPI): higher performance, but ugly to program

Programming Model 5

X Bulk Synchronous Processing (BSP) - L. Valiant
X Used within the message passing or shared memory models as a programming convention
X Phases separated by global barriers
 ¾ Compute phases: all operate on local data (in distributed memory)
  » or read access to global data (in shared memory)
 ¾ Communication phases: all participate in rearrangement or reduction of global data
X Generally all doing the "same thing" in a phase
 ¾ all do f, but may all do different things within f
X Simplicity of data parallelism without its restrictions

Summary so far

X Historically, each parallel machine was unique, along with its programming model and programming language
X You had to throw away your software and start over with each new kind of machine - ugh
X Now we distinguish the programming model from the underlying machine, so we can write portably correct code that runs on many machines
 ¾ MPI is now the most portable option, but it can be tedious
X Writing portably fast code requires tuning for the architecture
 ¾ The algorithm design challenge is to make this process easy
 ¾ Example: picking a blocksize, not rewriting the whole algorithm

48