CS 594 Spring 2007 Lecture 4: Overview of High-Performance Computing
Jack Dongarra Computer Science Department University of Tennessee
1
Top 500 Computers
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (MPP TPP performance; Ax = b, dense problem)
- Updated twice a year:
  - SC'xy meeting in the States in November
  - Meeting in Germany in June
2
What is a Supercomputer?
X A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
X Over the last 14 years the range of the Top500 has increased faster than Moore's Law:
¾ 1993: #1 = 59.7 GFlop/s; #500 = 422 MFlop/s
¾ 2007: #1 = 280 TFlop/s; #500 = 2.73 TFlop/s
Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.
Architecture/Systems Continuum (Tightly Coupled → Loosely Coupled)
X Custom processor with custom interconnect
¾ Cray X1, NEC SX-8, IBM Regatta, IBM Blue Gene/L
¾ Best processor performance for codes that are not "cache friendly"; good communication performance; simpler programming model; most expensive
X Commodity processor with custom interconnect (Hybrid)
¾ SGI Altix (Intel Itanium 2), Cray XT3 (AMD Opteron)
¾ Good communication performance; good scalability
X Commodity processor with commodity interconnect
¾ Clusters (Pentium, Itanium, Opteron, Alpha with GigE, Infiniband, Myrinet, or Quadrics), NEC TX7, IBM eServer, Dawning
¾ Best price/performance (for codes that work well with caches and are latency tolerant); more complex programming model
[Chart: share of Top500 systems in the Custom, Hybrid, and Commodity classes, Jun 1993 to Dec 2004, 0-100%]
Architectures / Systems
[Chart: number of Top500 systems (0-500) by architecture class (SIMD, single processor, SMP, constellation, cluster, MPP), 1993-2006]
5
Supercomputing Changes Over Time: 500 Fastest Systems Over the Past 14 Years
[Chart, log scale 100 Mflop/s to 1 Pflop/s, 1993-2006:
¾ SUM of the 500 fastest computers: 1.167 TF/s (1993) to 3.54 PF/s (2006)
¾ The fastest computer: 59.7 GF/s (Fujitsu 'NWT', 1993) to 280.6 TF/s (IBM BlueGene/L, 2006), with Intel ASCI Red, IBM ASCI White, and the NEC Earth Simulator as intervening #1 systems
¾ The computer at position 500: 0.4 GF/s (1993) to 2.74 TF/s (2006), trailing the #1 system by about 6-8 years
¾ "My Laptop" marked for reference in the Gflop/s range]
A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
28th List: The TOP10
Rank  Manufacturer  Computer                              Rmax [TF/s]  Installation Site                     Country  Year  #Proc    Arch
1     IBM           BlueGene/L, eServer Blue Gene         280.6        DOE/NNSA/LLNL                         USA      2005  131,072  Custom
2     Sandia/Cray   Red Storm, Cray XT3                   101.4        NNSA/Sandia                           USA      2006  26,544   Hybrid
3     IBM           BGW, eServer Blue Gene                91.29        IBM Thomas Watson                     USA      2005  40,960   Custom
4     IBM           ASC Purple, eServer pSeries p575      75.76        DOE/NNSA/LLNL                         USA      2005  12,208   Custom
5     IBM           MareNostrum, JS21 Cluster, Myrinet    62.63        Barcelona Supercomputer Center        Spain    2006  12,240   Commod
6     Dell          Thunderbird, PowerEdge 1850, IB       53.00        NNSA/Sandia                           USA      2005  9,024    Commod
7     Bull          Tera-10, NovaScale 5160, Quadrics     52.84        CEA                                   France   2006  9,968    Commod
8     SGI           Columbia, Altix, Infiniband           51.87        NASA Ames                             USA      2004  10,160   Hybrid
9     NEC/Sun       Tsubame, Fire x4600, ClearSpeed, IB   47.38        GSIC / Tokyo Institute of Technology  Japan    2006  11,088   Commod
10    Cray          Jaguar, Cray XT3                      43.48        ORNL                                  USA      2006  10,424   Hybrid
IBM BlueGene/L: #1
X 131,072 processors; 1.6 MWatts (the power of about 1600 homes); 43,000 ops/s/person; a total of 18 systems, all in the Top100
X Build-up hierarchy (peak figures given for the two operating modes):
¾ Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache
¾ Compute Card (2 chips, 2x1x1; 4 processors): 5.6/11.2 GF/s, 1 GB DDR
¾ Node Board (32 chips, 4x4x2; 16 compute cards, 64 processors): 90/180 GF/s, 16 GB DDR
¾ Rack (32 node boards, 8x8x16; 2048 processors): 2.9/5.7 TF/s, 0.5 TB DDR
¾ Full system (64 racks, 64x32x32; 131,072 processors): 180/360 TF/s, 32 TB DDR
X The BlueGene/L compute ASICs include all networking and processor functionality. Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores at 700 MHz (note that L1 cache coherence is not maintained between these cores).
X "Fastest Computer": 131K processors, 64 racks. Peak: 367 Tflop/s. Linpack: 281 Tflop/s, 77% of peak (n = 1.8M; about 13K sec, roughly 3.6 hours).
Performance Projection
[Chart: Top500 performance projection, 1993-2015, log scale 100 Mflop/s to 1 Eflop/s, with SUM, N=1, and N=500 trend lines; N=1 trails the SUM by about 6-8 years, and N=500 trails N=1 by about 8-10 years]
Customer Segments / Performance
[Chart: share of Top500 performance by customer segment (government, classified, vendor, academic, industry, research), 1993-2006, 0-100%]
10
Processor Types
[Chart: number of Top500 systems (0-500) by processor type (SIMD, vector, other scalar, AMD, Sparc, MIPS, Intel, HP, Power, Alpha), 1993-2006]
11
Processors Used in Each of the 500 Systems
92% of systems use Intel, AMD, or IBM processors: Intel 51% (EM64T 22%, IA-32 22%, IA-64 7%), AMD x86_64 22%, IBM Power 19%, HP PA-RISC 4%, and 1% each for Sun Sparc, NEC, HP Alpha, and Cray.
Interconnects / Systems
[Chart: number of Top500 systems (0-500) by interconnect family (others, Cray interconnect, SP Switch, crossbar, Quadrics, Infiniband (78), Myrinet (79), Gigabit Ethernet (211), N/A), 1993-2006]
GigE + Infiniband + Myrinet = 74%
Processors per System - Nov 2006
[Histogram: number of systems (0-200) by processor-count bin: 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4K-8K, 8K-16K, 16K-32K, 32K-64K, 64K-128K]
KFlop/s per Capita (Flops/Pop), Based on the November 2005 Top500
[Bar chart, 0-6000 KFlop/s per capita, for roughly two dozen countries including the United States, Japan, Germany, the United Kingdom, Australia, the Netherlands, and New Zealand. New Zealand ranks surprisingly high. Hint: Peter Jackson had something to do with this: WETA Digital (Lord of the Rings). It has nothing to do with the 47.2 million sheep in NZ.]
Environmental Burden of PC CPUs
Source: Cool Chips & Micro 32
Power Consumption of World's CPUs
Year    Power (MW)    # CPUs (millions)
1992         180               87
1994         392              128
1996         959              189
1998       2,349              279
2000       5,752              412
2002      14,083              607
2004      34,485              896
2006      87,439            1,321
17
Power is an Industry Wide Problem
X Google facilities
¾ leveraging hydroelectric power (sites of old aluminum plants)
¾ >500,000 servers worldwide
"Hiding in Plain Sight, Google Seeks More Power," by John Markoff, The New York Times, June 14, 2006
[Photo: new Google plant in The Dalles, Oregon, from the NYT, June 14, 2006]
And Now We Want Petascale …
High-speed train: ~10 Megawatts. Conventional power plant: ~300 Megawatts.
X What is a conventional petascale machine? In power terms: many high-speed bullet trains, a significant start toward a conventional power plant.
¾ "Hiding in Plain Sight, Google Seeks More Power," The New York Times, June 14, 2006.
19
Top Three Reasons for “Eliminating” Global Climate Warming in the Machine Room
3. HPC Contributes to Global Climate Warming
¾ "I worry that we, as HPC experts in global climate modeling, are contributing to the very thing that we are trying to avoid: the generation of greenhouse gases."
2. Electrical Power Costs $$$
¾ Japanese Earth Simulator
» Power & cooling: 12 MW → $9.6 million/year?
¾ Lawrence Livermore National Laboratory
» Power & cooling of HPC: $14 million/year
» Powering up ASC Purple → "panic" call from the local electrical company
1. Reliability & Availability Impact Productivity
¾ California: state of electrical emergency (July 24-25, 2006)
» 50,538 MW: a load not expected to be reached until 2010!
Reliability & Availability of HPC
(System, CPUs, Reliability & Availability)
ASCI Q 8,192 MTBI: 6.5 hrs. 114 unplanned outages/month. ¾ HW outage sources: storage, CPU, memory.
ASCI White 8,192 MTBF: 5 hrs. (2001) and 40 hrs. (2003). ¾ HW outage sources: storage, CPU, 3rd-party HW.
NERSC Seaborg 6,656 MTBI: 14 days. MTTR: 3.3 hrs. ¾ SW is the main outage source. Availability: 98.74%.
PSC Lemieux 3,016 MTBI: 9.7 hrs. Availability: 98.33%.
Google ~15,000 20 reboots/day; 2-3% machines replaced/year. ¾ HW outage sources: storage, memory. Availability: ~100%.
MTBI: mean time between interrupts; MTBF: mean time between failures; MTTR: mean time to restore. Source: Daniel A. Reed, RENCI
Fuel Efficiency: GFlops/Watt
[Chart: GFlops/Watt for the top 20 systems, 0 to 0.9; the BlueGene/L and BlueGene eServer systems lead, with Cray X1/XT3, SGI Altix, IBM Power, Itanium, and Opteron/Xeon-based systems trailing]
Top 20 systems; based on processor power rating only.
Top500 Conclusions
X Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
X MPPs continue to account for more than half of all installed high-performance computers worldwide.
23
With All the Hype on the PS3, We Became Interested
X The PlayStation 3's CPU is based on a "Cell" processor
X Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit: SPU + DMA engine)
¾ An SPE is a self-contained vector processor which acts independently from the others.
» 4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz
¾ 204.8 Gflop/s peak!
¾ The catch is that this is for 32-bit floating point (single precision, SP)
¾ And 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
» Divide the SP peak by 14: a factor of 2 because of DP and 7 because of latency issues
24
Lower Clock Rate & Voltage, Increase Transistor Density
We have seen increasing numbers of gates on a chip and increasing clock speed.
Heat is becoming an unmanageable problem; Intel processors exceed 100 Watts.
We will not see the dramatic increases in clock speeds in the future.
However, the number of gates on a chip will continue to increase.
Intel Yonah will double the processing power on a per-watt basis.
Intel Prediction of Microprocessor Frequency (ca. 2001)
[Chart, log scale 0.1 to 10,000 MHz, 1970-2010: frequency doubles every 2 years; points for the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium processor, and P6]
Adopted from a presentation by S. Borkar, Intel
26
Intel Prediction of Microprocessor Power Consumption (ca. 2001)
[Chart, log scale 0.1 to 100 Watts, 1971-2000: power for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, and P6]
Adopted from a presentation by S. Borkar, Intel
27
Moore's Law for Power (P ∝ V²f)
[Chart: chip maximum power in watts/cm², log scale 1-1000, vs. feature size (1.5μ down to 0.07μ) and year (1985-2001): I386 (1 watt), I486 (2 watts), Pentium (14 watts), Pentium Pro (30 watts), Pentium II (35 watts), Pentium III (35 watts), Pentium 4 (75 watts), Itanium (130 watts). Power density surpassed that of a heating plate and is not too long from reaching that of a nuclear reactor.]
Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, MICRO32 and Transmeta
No Free Lunch For Traditional Software
(Without highly concurrent software it won't get any faster!)
[Chart: operations per second over time. One line, the "free lunch," runs through 3 GHz, 6 GHz, 12 GHz, and 24 GHz on 1 core: serial code just runs twice as fast every 18 months with no change to the code. The other lines, 3 GHz at 1, 2, 4, and 8 cores, show the additional operations per second available only if the code can take advantage of concurrency.]
What is Multicore?
X Single chip X Multiple distinct processing engines X Multiple, independent threads of control (or program counters: MIMD)
30
Integration is Efficient
X Discrete chips: bandwidth ~2 GB/s, latency ~60 ns
X Multicore: bandwidth > 20 GB/s, latency < 3 ns
Parallelism and interconnect efficiency enable harnessing the power of n: n cores can yield an n-fold increase in performance.
Power Cost of Frequency
X Power ∝ Voltage² × Frequency (V²F)
X Frequency ∝ Voltage
X Therefore Power ∝ Frequency³
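The cubic relationship can be checked with a small sketch (the unit constant `k` and the frequencies are made-up illustrative numbers, not measurements): halving the frequency while doubling the core count keeps the ideal throughput constant but cuts power to a quarter.

```python
def power(freq_ghz, cores, k=1.0):
    """Power model P = k * f^3 per core (from P ∝ V^2·f with V ∝ f)."""
    return cores * k * freq_ghz ** 3

def throughput(freq_ghz, cores):
    """Idealized throughput: operations scale with f * cores."""
    return freq_ghz * cores

one_fast = power(2.0, 1)   # one core at 2 GHz: 8 units of power
two_slow = power(1.0, 2)   # two cores at 1 GHz: 2 units of power

assert throughput(2.0, 1) == throughput(1.0, 2)  # same ideal throughput
print(one_fast / two_slow)  # 4.0: the fast single core burns 4x the power
```

This is the arithmetic behind the multicore turn described on the surrounding slides: more, slower cores are far cheaper in watts than one faster core.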
32
Interconnect Options
34
Many Changes
X Many changes in our hardware over the past 30 years: superscalar, vector, distributed memory, shared memory, cluster, multicore, …
[Chart: Top500 systems/architectures (0-500), 1993-2006: constellation, cluster, MPP, SMP, SIMD, single processor]
X Today’s memory hierarchy is much more complicated.
35
Distributed and Parallel Systems
[Spectrum diagram from loosely coupled, heterogeneous distributed systems (cycle-scavenging and Grid-style computing) to tightly coupled, homogeneous massively parallel systems (clusters with special interconnects, TFlop/s machines)]
Distributed systems: gather (unused) resources; steal cycles; system SW manages resources; system SW adds value; 10%-20% overhead is OK; resources drive applications; time to completion is not critical; time-shared.
Massively parallel systems: bounded set of resources; apps grow to consume all cycles; application manages resources; system SW gets in the way; 5% overhead is the maximum; apps drive purchase of equipment; real-time constraints; space-shared.
Virtual Environments
[The slide shows several screenfuls of raw floating-point simulation output (0.32E-08, 0.00E+00, 0.38E-06, …) as a wall of numbers]
Do they make any sense?
37
38
The Power of Optimal Algorithms
X Advances in algorithmic efficiency rival advances in hardware architecture
X Consider Poisson's equation ∇²u = f on a cube of size N = n³:

Year  Method       Reference                Storage   Flops
1947  GE (banded)  Von Neumann & Goldstine  n⁵        n⁷
1950  Optimal SOR  Young                    n³        n⁴ log n
1971  CG           Reid                     n³        n³·⁵ log n
1984  Full MG      Brandt                   n³        n³

X If n = 64, this implies an overall reduction in flops of ~16 million
Algorithms and Moore's Law X This advance took place over a span of about 36 years, or 24 doubling times for Moore's Law X 2²⁴ ≈ 16 million ⇒ the same as the factor from algorithms alone!
[Chart: relative speedup vs. year for the four methods]
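The matching 16-million factors can be verified with one line of arithmetic each (a sketch of the claim above, using the table's flop counts for banded GE and full multigrid):

```python
n = 64
algorithm_gain = n ** 7 / n ** 3   # banded GE (n^7) vs. full multigrid (n^3)
moore_gain = 2 ** 24               # 24 doubling times over ~36 years

print(algorithm_gain)  # 16777216.0
print(moore_gain)      # 16777216: the same ~16 million factor
```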
Different Architectures
X Parallel computing: single systems with many processors working on same problem X Distributed computing: many systems loosely coupled to work on related problems X Grid Computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems
41
Types of Parallel Computers
X The simplest and most useful way to classify modern parallel computers is by their memory model: ¾ shared memory ¾ distributed memory
42
Shared vs. Distributed Memory
[Diagram: processors P connected by a bus to one memory]
Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
[Diagram: processor/memory pairs (P over M) connected by a network]
Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: Cray T3E, IBM SP, clusters)
43
Shared Memory: UMA vs. NUMA
[Diagram: processors on a single bus to memory]
Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors. (Sun E10000)
[Diagram: two bus/memory groups joined by a network]
Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (SGI Origin)
44
Distributed Memory: MPPs vs. Clusters
X Processors-memory nodes are connected by some type of interconnect network ¾ Massively Parallel Processor (MPP): tightly integrated, single system image. ¾ Cluster: individual computers connected by s/w
[Diagram: CPU + MEM nodes connected by an interconnect network]
Processors, Memory, & Networks
X Both shared and distributed memory systems have: 1. processors: now generally commodity processors 2. memory: now generally commodity DRAM 3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
46
Interconnect-Related Terms
X Latency: How long does it take to start sending a "message"? Measured in microseconds. (Also in processors: How long does it take to output results of some operations, such as floating point add, divide etc., which are pipelined?) X Bandwidth: What data rate can be sustained once the message is started? Measured in Mbytes/sec.
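The two quantities combine into the usual first-order cost model for a message, T(n) = latency + n/bandwidth. A sketch with made-up numbers (5 µs latency, 100 Mbytes/sec), not measurements of any particular network:

```python
def transfer_time_us(nbytes, latency_us=5.0, bandwidth_mbs=100.0):
    """Time to send nbytes: startup latency plus nbytes/bandwidth.
    bandwidth_mbs is in Mbytes/sec, which is the same as bytes per microsecond."""
    return latency_us + nbytes / bandwidth_mbs

# For small messages latency dominates; for large ones bandwidth does.
print(transfer_time_us(100))         # 6.0 us, mostly latency
print(transfer_time_us(10_000_000))  # 100005.0 us, mostly bandwidth
```

This is why parallel codes batch small messages into large ones: a hundred 100-byte sends pay the 5 µs startup a hundred times.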
47
Interconnect-Related Terms
Topology: the manner in which the nodes are connected. ¾ The best choice would be a fully connected network (every processor to every other), but this is infeasible for cost and scaling reasons. ¾ Instead, processors are arranged in some variation of a grid, torus, or hypercube.
[Diagrams: 3-d hypercube, 2-d mesh, 2-d torus]
48
Standard Uniprocessor Memory Hierarchy
X Intel Pentium 4 processor (2 GHz, P7 Prescott, Socket 478)
X On chip:
¾ 8 Kbytes of 4-way assoc. L1 instruction cache with 32-byte lines
¾ 8 Kbytes of 4-way assoc. L1 data cache with 32-byte lines
¾ 256 Kbytes of 8-way assoc. L2 cache, 32-byte lines
¾ 400 MB/s bus speed
¾ SSE2 provides a peak of 4 Gflop/s
[Diagram: processor, Level-1 cache, Level-2 cache, bus, system memory]
X Each flop requires 3 words of data. At 4 Gflop/s that needs 12 GW/s of bandwidth, but the bus has only 0.5 GW/s. So if driven from memory, the processor runs 24 times off of its peak rate!!
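The 24x figure follows from a simple balance calculation, a sketch using the numbers on the slide:

```python
peak_flops = 4e9          # SSE2 peak, flop/s
words_per_flop = 3        # each flop needs 3 words of data
bus_words_per_s = 0.5e9   # bus delivers 0.5 GW/s

needed_bw = peak_flops * words_per_flop          # 12 GW/s required
sustainable = bus_words_per_s / words_per_flop   # flop rate the bus can feed
slowdown = peak_flops / sustainable

print(needed_bw)   # 12 GW/s required, but the bus supplies 0.5 GW/s
print(slowdown)    # ~24: memory-bound code runs ~24x below peak
```

This kind of "machine balance" arithmetic is the motivation for the cache-blocking techniques that dense linear algebra libraries rely on.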
Locality and Parallelism
[Diagram: conventional storage hierarchy: each processor with its own cache, L2 cache, and L3 cache over memory, joined by potential interconnects]
X Large memories are slow; fast memories are small
X Storage hierarchies are large and fast on average
X Parallel processors, collectively, have large, fast caches ($)
¾ the slow accesses to "remote" data we call "communication"
X Algorithms should do most work on local data
Amdahl’s Law
Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:
tN = (fp/N + fs)t1 Effect of multiple processors on run time
S = 1/(fs + fp/N) Effect of multiple processors on speedup
Where: fs = serial fraction of code fp = parallel fraction of code = 1 - fs N = number of processors
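The speedup expression above is easy to tabulate. A sketch: even a 1% serial fraction caps the achievable speedup at 1/fs = 100, no matter how many processors are added.

```python
def amdahl_speedup(fp, n):
    """Amdahl's Law: S = 1 / (fs + fp/N), with fs = 1 - fp."""
    fs = 1.0 - fp
    return 1.0 / (fs + fp / n)

# fp = 0.99: serial fraction of just 1%
for n in (10, 100, 1000, 10**6):
    print(n, round(amdahl_speedup(0.99, n), 1))
# As N grows, S approaches 1/fs = 100 and no further.
```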
52
Amdahl's Law - Theoretical Maximum Speedup of Parallel Execution
X speedup = 1/(P/N + S)
» P = parallel code fraction, S = serial code fraction, N = number of processors
¾ Example: image processing
» 30 minutes of preparation (serial)
» One minute to scan a region (parallelizable across regions)
» 30 minutes of cleanup (serial)
X Speedup is restricted by the serial portion: scanning speeds up with a greater number of cores, but the 60 minutes of serial work remains no matter how many processors are used.
53
Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
[Chart: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, 0.900; only fp = 1.000 scales linearly, and the other curves flatten quickly. What's going on here?]
Amdahl's Law vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communication. In reality, communication (and I/O) will result in a further degradation of performance.
[Chart: speedup vs. number of processors (0-250) for fp = 0.99: the Amdahl's Law curve vs. measured reality, which falls well below it]
Overhead of Parallelism
X Given enough parallel work, this is the biggest barrier to getting desired speedup
X Parallelism overheads include:
¾ cost of starting a thread or process
¾ cost of communicating shared data
¾ cost of synchronizing
¾ extra (redundant) computation
X Each of these can be in the range of milliseconds (= millions of flops) on some systems
X Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
56
Load Imbalance
X Load imbalance is the time that some processors in the system are idle due to ¾ insufficient parallelism (during that phase) ¾ unequal size tasks X Examples of the latter ¾ adapting to “interesting parts of a domain” ¾ tree-structured computations ¾ fundamentally unstructured problems X Algorithm needs to balance load
58
What is Ahead?
X Greater instruction level parallelism? X Bigger caches? X Multiple processors per chip? X Complete systems on a chip? (Portable Systems)
X High performance LAN, Interface, and Interconnect
59
Directions
X Move toward shared memory ¾ SMPs and Distributed Shared Memory ¾ Shared address space w/deep memory hierarchy X Clustering of shared memory machines for scalability X Efficiency of message passing and data parallel programming ¾ Helped by standards efforts such as MPI and HPF
60
Question:
X Suppose we want to compute, using four-digit decimal arithmetic:
¾ S = 1.000 + 1.000×10⁴ − 1.000×10⁴
¾ What's the answer?
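The answer depends on evaluation order. Python's `decimal` module can mimic four-digit arithmetic (a sketch; IEEE hardware is binary rather than decimal, but the rounding effect is the same):

```python
from decimal import Decimal, getcontext

getcontext().prec = 4  # four significant decimal digits

one, big = Decimal("1.000"), Decimal("1.000E+4")

left_to_right = (one + big) - big   # 1.000E4 + 1.000 rounds back to 1.000E4
reordered     = one + (big - big)   # exact cancellation happens first

print(left_to_right == 0)  # True: the 1.000 was lost to rounding
print(reordered)           # 1.000
```

So floating point addition is not associative: left to right the sum is 0, while the mathematically equivalent reordering gives 1.000.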
61
Defining Floating Point Arithmetic
X Representable numbers ¾ Scientific notation: +/- d.d…d x rexp ¾ sign bit +/- ¾ radix r (usually 2 or 10, sometimes 16) ¾ significand d.d…d (how many base-r digits d?) ¾ exponent exp (range?) ¾ others? X Operations: ¾ arithmetic: +,-,x,/,... » how to round result to fit in format ¾ comparison (<, =, >) ¾ conversion between different formats » short to long FP numbers, FP to integer ¾ exception handling » what to do for 0/0, 2*largest_number, etc. ¾ binary/decimal conversion
» for I/O, when the radix is not 10
IEEE Floating Point Arithmetic Standard 754 (1985) - Normalized Numbers
X Normalized nonzero representable numbers: ±1.d…d × 2^exp
¾ Macheps = machine epsilon = 2^(-#significand bits) = relative error in each operation; the smallest number ε such that fl(1 + ε) > 1
¾ OV = overflow threshold = largest number
¾ UN = underflow threshold = smallest number

Format           # bits  #significand bits  macheps            #exponent bits  exponent range
Single           32      23+1               2⁻²⁴ (~10⁻⁷)       8               2⁻¹²⁶ to 2¹²⁷ (~10^±38)
Double           64      52+1               2⁻⁵³ (~10⁻¹⁶)      11              2⁻¹⁰²² to 2¹⁰²³ (~10^±308)
Double Extended  >=80    >=64               <=2⁻⁶⁴ (~10⁻¹⁹)    >=15            2⁻¹⁶³⁸² to 2¹⁶³⁸³ (~10^±4932)
(Double Extended is 80 bits on all Intel machines)

X ± Zero: sign bit ±, significand and exponent all zero
¾ Why bother with −0? Later.
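Macheps for the double format can be measured directly with the definition above. Note that the loop below converges to 2⁻⁵², the spacing of doubles near 1.0, while the table's 2⁻⁵³ is the corresponding relative rounding error (half a unit in the last place); both conventions are in use. A sketch:

```python
eps = 1.0
while 1.0 + eps / 2 > 1.0:   # halve until 1 + eps/2 rounds back to 1
    eps /= 2

print(eps)                    # 2.220446049250313e-16, i.e. 2**-52

import sys
print(eps == sys.float_info.epsilon)  # True: same value from the C library
```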
IEEE Floating Point Arithmetic Standard 754 - “Denorms”
X Denormalized Numbers: ±0.d…d × 2^min_exp
¾ sign bit, nonzero significand, minimum exponent
¾ Fills in the gap between UN and 0
X Underflow Exception
¾ occurs when an exact nonzero result is less than the underflow threshold UN
¾ Ex: UN/3
¾ return a denorm, or zero
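The UN/3 case can be observed in ordinary Python doubles (a sketch; `sys.float_info.min` is UN, the smallest normalized number, and 5e-324 = 2⁻¹⁰⁷⁴ is the smallest positive denorm):

```python
import sys

UN = sys.float_info.min   # 2.2250738585072014e-308, smallest normal double
tiny = UN / 3             # exact result < UN: gradual underflow to a denorm

print(tiny > 0.0)         # True: a denorm is returned, not zero
print(tiny < UN)          # True: it lies in the gap between 0 and UN

smallest = 5e-324         # smallest positive denorm
print(smallest / 2)       # 0.0: below the denorms lies only zero
```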
64
IEEE Floating Point Arithmetic Standard 754 - ± Infinity
X +- Infinity: Sign bit, zero significand, maximum exponent X Overflow Exception ¾occurs when exact finite result too large to represent accurately ¾Ex: 2*OV ¾return +- infinity X Divide by zero Exception ¾return +- infinity = 1/+-0 ¾sign of zero important! X Also return +- infinity for ¾3+infinity, 2*infinity, infinity*infinity ¾Result is exact, not an exception!
65
IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)
X NAN: Sign bit, nonzero significand, maximum exponent X Invalid Exception ¾occurs when exact result not a well-defined real number ¾0/0 ¾sqrt(-1) ¾infinity-infinity, infinity/infinity, 0*infinity ¾NAN + 3 ¾NAN > 3? ¾Return a NAN in all these cases X Two kinds of NANs ¾Quiet - propagates without raising an exception ¾Signaling - generate an exception when touched » good for detecting uninitialized data
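Both the infinity and NaN rules can be exercised from Python (a sketch; note that Python traps float division by a literal zero with an exception instead of returning infinity, so the division-by-zero case below uses -inf as the divisor):

```python
import math

big = 1e308
print(big * 10)             # inf: overflow returns +infinity
print(1.0 / float("-inf"))  # -0.0: the sign of zero is preserved
print(math.inf - math.inf)  # nan: invalid operation

nan = float("nan")
print(nan == nan)           # False: a NaN compares unequal even to itself
print(math.isnan(nan + 3))  # True: NaNs propagate through arithmetic
```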
66
Error Analysis
X Basic error formula ¾fl(a op b) = (a op b)*(1 + d) where » op one of +,-,*,/ » |d| <= macheps » assuming no overflow, underflow, or divide by zero X Example: adding 4 numbers
¾fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
= x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3) +
x3*(1+d2)*(1+d3) + x4*(1+d3) = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
where each |ei| <~ 3*macheps
¾get the exact sum of slightly changed summands xi*(1+ei)
¾Backward Error Analysis - an algorithm is called numerically stable if it gives the exact result for slightly changed inputs
¾Numerical Stability is an algorithm design goal
Backward error
X Approximate solution is exact solution to modified problem. X How large a modification to original problem is required to give result actually obtained? X How much data error in initial input would be required to explain all the error in computed results? X Approximate solution is good if it is exact solution to “nearby” problem.
[Diagram: the true map f takes input x to f(x); the computed map f' takes x to f'(x). The forward error is the distance from f(x) to f'(x) in the output space; the backward error is the distance from x to the perturbed input x' for which f(x') equals the computed result f'(x).]
Sensitivity and Conditioning
X A problem is insensitive, or well-conditioned, if a relative change in the input causes a commensurate relative change in the solution.
X A problem is sensitive, or ill-conditioned, if the relative change in the solution can be much larger than that in the input data.
Cond = |Relative change in solution| / |Relative change in input data| = |[f(x’) – f(x)]/f(x)| / |(x’ – x)/x|
X Problem is sensitive, or ill-conditioned, if cond >> 1.
X When the function f is evaluated for approximate input x' = x + h instead of the true input value x:
X Absolute error = f(x + h) − f(x) ≈ h f'(x)
X Relative error = [f(x + h) − f(x)] / f(x) ≈ h f'(x) / f(x)
Sensitivity: 2 Examples cos(π/2) and 2-d System of Equations
X Consider the problem of computing the cosine function for arguments near π/2.
X Let x ≈ π/2 and let h be a small perturbation to x. Then:
absolute error = cos(x+h) − cos(x) ≈ −h sin(x) ≈ −h
relative error ≈ −h tan(x) ≈ ∞
X So a small change in x near π/2 causes a large relative change in cos(x), regardless of the method used:
X cos(1.57079) = 0.63267949 × 10⁻⁵
X cos(1.57078) = 1.63267949 × 10⁻⁵
X The relative change in the output is a quarter million times greater than the relative change in the input.
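The quarter-million factor can be reproduced directly (a sketch; `cond` is the condition number estimate from the definition two slides back):

```python
import math

x, h = 1.57079, -0.00001        # perturbing x by h gives 1.57078
fx, fxh = math.cos(x), math.cos(x + h)

rel_out = (fxh - fx) / fx       # relative change in the output
rel_in = h / x                  # relative change in the input
cond = abs(rel_out / rel_in)

print(fx)     # ~6.33e-06
print(fxh)    # ~1.63e-05
print(cond)   # ~2.5e+05: a quarter million, as on the slide
```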
Sensitivity: 2 Examples: cos(π/2) and a 2-d System of Equations
X The second example is the 2×2 linear system:
a·x1 + b·x2 = f
c·x1 + d·x2 = g
X As before for the first example: a small change in x near π/2 causes a large relative change in cos(x) regardless of the method used. cos(1.57079) = 0.63267949 × 10⁻⁵ and cos(1.57078) = 1.63267949 × 10⁻⁵, so the relative change in the output is a quarter million times greater than the relative change in the input.
Example: Polynomial Evaluation Using Horner’s Rule
X Horner's rule to evaluate p(x) = Σₖ cₖ xᵏ:
¾ p = cₙ; for k = n−1 down to 0, p = x·p + cₖ
X Numerically stable
X Apply it to (x−2)⁹ = x⁹ − 18x⁸ + … − 512, written in Horner (nested) form
X Evaluated around x = 2

begin
  p := c[n];
  for k := n-1 to 0 by -1 do
    p := p*x + c[k]
  end { for }
  HornerPoly := p;
end { HornerPoly }
Example: polynomial evaluation (continued)
X(x-2)9 = x9 - 18*x8 + … - 512 XWe can compute error bounds using ¾fl(a op b)=(a op b)*(1+d)
73
Exception Handling
X What happens when the “exact value” is not a real number, or too small or too large to represent accurately? X 5 Exceptions: ¾Overflow - exact result > OV, too large to represent ¾Underflow - exact result nonzero and < UN, too small to represent ¾Divide-by-zero -nonzero/0 ¾Invalid - 0/0, sqrt(-1), … ¾Inexact - you made a rounding error (very common!) X Possible responses ¾Stop with error message (unfriendly, not default) ¾Keep computing (default, but how?)
74
Summary of Values Representable in IEEE FP
(fields: sign | exponent | significand)
X ± Zero:                      ± | 0…0             | 0……0
X Normalized nonzero numbers:  ± | not 0 or all 1s | anything
X Denormalized numbers:        ± | 0…0             | nonzero
X ± Infinity:                  ± | 1…1             | 0……0
X NaNs:                        ± | 1…1             | nonzero
¾ Signaling and quiet
¾ Many systems have only quiet
75
Assuming x and y are non-negative
a = max(x, y),  b = min(x, y)

z = ⎧ a·√(1 + (b/a)²),  a > 0
    ⎩ 0,                a = 0
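The formula is the standard trick for computing √(x² + y²) without overflow when x or y is large: since b/a ≤ 1, squaring it cannot overflow, whereas squaring x or y directly can. A sketch (the library's `math.hypot` does the same job; `safe_hypot` is an illustrative name, and x, y are assumed non-negative as the slide states):

```python
import math

def safe_hypot(x, y):
    """sqrt(x*x + y*y) via z = a*sqrt(1 + (b/a)^2), a = max, b = min."""
    a, b = max(x, y), min(x, y)
    if a == 0.0:
        return 0.0
    return a * math.sqrt(1.0 + (b / a) ** 2)

x, y = 3e200, 4e200           # squaring either would overflow a double
print(math.isinf(x * x))      # True: the naive route overflows to inf
print(safe_hypot(x, y))       # ~5e+200, computed without overflow
print(safe_hypot(3.0, 4.0))   # 5.0
```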
76
38 This Week’s Assignment
77
Hazards of Parallel and Heterogeneous Computing
XWhat new bugs arise in parallel floating point programs? XEx 1: Nonrepeatability ¾Makes debugging hard! XEx 2: Different exception handling ¾Can cause programs to hang XEx 3: Different rounding (even on IEEE FP machines) ¾Can cause hanging, or wrong results with no warning XSee www.netlib.org/lapack/lawns/lawn112.ps
X IBM RS6K and Java
Types of Parallel Computers
X The simplest and most useful way to classify modern parallel computers is by their memory model: ¾ shared memory ¾ distributed memory
79
Standard Uniprocessor Memory Hierarchy
X Intel Pentium 4 processor (2 GHz, P7 Prescott, Socket 478)
X On chip:
¾ 8 Kbytes of 4-way assoc. L1 instruction cache with 32-byte lines
¾ 8 Kbytes of 4-way assoc. L1 data cache with 32-byte lines
¾ 256 Kbytes of 8-way assoc. L2 cache, 32-byte lines
¾ 400 MB/s bus speed
¾ SSE2 provides a peak of 4 Gflop/s
[Diagram: processor, Level-1 cache, Level-2 cache, bus, system memory]
80
Shared Memory / Local Memory
X Usually think in terms of the hardware X What about a software model? X How about something that works like cache? X Logically shared memory
81
Parallel Programming Models
XControl ¾how is parallelism created ¾what orderings exist between operations ¾how do different threads of control synchronize XNaming ¾what data is private vs. shared ¾how logically shared data is accessed or communicated XSet of operations ¾what are the basic operations ¾what operations are considered to be atomic XCost ¾how do we account for the cost of each of the above
Trivial Example: compute Σ_{i=0}^{n−1} f(A[i])
X Parallel Decomposition:
  ¾ Each evaluation and each partial sum is a task
X Assign n/p numbers to each of p procs
  ¾ each computes independent "private" results and partial sum
  ¾ one (or all) collects the p partial sums and computes the global sum

=> Classes of Data
X Logically Shared
  ¾ the original n numbers, the global sum
X Logically Private
  ¾ the individual function evaluations
  ¾ what about the individual partial sums?
Programming Model 1
X Shared Address Space
  ¾ program consists of a collection of threads of control,
  ¾ each with a set of private variables
    » e.g., local variables on the stack
  ¾ collectively with a set of shared variables
    » e.g., static variables, shared common blocks, global heap
  ¾ threads communicate implicitly by writing and reading shared variables
  ¾ threads coordinate explicitly by synchronization operations on shared variables
    » writing and reading flags
    » locks, semaphores
X Like concurrent programming on a uniprocessor

[Figure: shared variables A, x, y; each processor P has private i and res]
Model 1
X A shared memory machine
X Processors all connected to a large shared memory
X "Local" memory is not (usually) part of the hardware
  ¾ Sun, DEC, Intel "SMPs" (Symmetric Multiprocessors) in Millennium; SGI Origin
X Cost: much cheaper to access data in cache than in main memory

[Figure: processors P1 … Pn, each with a cache ($), connected through a network to shared memory]

X Machine model 1a: A Shared Address Space Machine
  ¾ replace caches by local memories (in abstract machine model)
  ¾ this affects the cost model -- repeatedly accessed data should be copied
  ¾ Cray T3E
Shared Memory code for computing a sum
Thread 1                            Thread 2

[s = 0 initially]                   [s = 0 initially]
local_s1 = 0                        local_s2 = 0
for i = 0, n/2-1                    for i = n/2, n-1
  local_s1 = local_s1 + f(A[i])       local_s2 = local_s2 + f(A[i])
s = s + local_s1                    s = s + local_s2
What could go wrong?
Pitfall and solution via synchronization
° Pitfall in computing a global sum s = local_s1 + local_s2
Thread 1 (initially s=0)               Thread 2 (initially s=0)
load s [from mem to reg]
                                       load s [from mem to reg; initially 0]
s = s+local_s1 [=local_s1, in reg]
                                       s = s+local_s2 [=local_s2, in reg]
store s [from reg to mem]
                                       store s [from reg to mem]
(time flows downward)
° Instructions from different threads can be interleaved arbitrarily
° What can the final result s stored in memory be?
° Race Condition
° Possible solution: Mutual Exclusion with Locks

  Thread 1            Thread 2
  lock                lock
  load s              load s
  s = s+local_s1      s = s+local_s2
  store s             store s
  unlock              unlock

° Locks must be atomic (execute completely without interruption)
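The lock-protected update above can be sketched with Python threads (the names `f`, `partial_sum`, and `s_lock` are mine; note Python's global interpreter lock masks many real races, but the explicit lock is still what makes the read-modify-write of `s` safe in principle):

```python
import threading

def f(x):
    return x * x

A = list(range(100))
s = 0
s_lock = threading.Lock()

def partial_sum(lo, hi):
    global s
    local_s = 0
    for i in range(lo, hi):
        local_s += f(A[i])
    # Critical section: the load/add/store of s must not
    # interleave with the other thread's update.
    with s_lock:
        s = s + local_s

t1 = threading.Thread(target=partial_sum, args=(0, 50))
t2 = threading.Thread(target=partial_sum, args=(50, 100))
t1.start(); t2.start()
t1.join(); t2.join()
print(s)  # 328350 = sum of i**2 for i = 0..99
```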
Programming Model 2
X Message Passing
  ¾ program consists of a collection of named processes
    » thread of control plus local address space
    » local variables, static variables, common blocks, heap
  ¾ processes communicate by explicit data transfers
    » matching pair of send & receive by source and dest. proc.
  ¾ coordination is implicit in every communication event
  ¾ logically shared data is partitioned over local processes
X Like distributed programming
° Program with standard libraries: MPI, PVM

[Figure: processes P0 … Pn, each with private A, i, res; P0 does "send P0,X", Pn does "recv Pn,Y"]
Model 2
X A distributed memory machine
  ¾ Cray T3E, IBM SP2, Clusters
X Processors all connected to own memory (and caches)
  ¾ cannot directly access another processor's memory
X Each "node" has a network interface (NI)
  ¾ all communication and synchronization done through this

[Figure: nodes (P1, memory, NI) … (Pn, memory, NI) connected by an interconnect]
Computing s = x(1)+x(2) on each processor
° First possible solution
Processor 1 [xlocal = x(1)]        Processor 2 [xlocal = x(2)]
send xlocal, proc2                 receive xremote, proc1
receive xremote, proc2             send xlocal, proc1
s = xlocal + xremote               s = xlocal + xremote
° Second possible solution - what could go wrong?
Processor 1 [xlocal = x(1)]        Processor 2 [xlocal = x(2)]
send xlocal, proc2                 send xlocal, proc1
receive xremote, proc2             receive xremote, proc1
s = xlocal + xremote               s = xlocal + xremote
° What if send/receive act like the telephone system? The post office?
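A small sketch of the send-first variant, emulating the two processors with Python threads and `queue.Queue` channels (all names here are mine). Buffered channels give "post office" semantics: a send deposits the message and returns immediately, so both processors complete. With "telephone" (synchronous rendezvous) sends, both would block waiting for a matching receive and the program would hang.

```python
import threading
import queue

# One buffered channel per destination ("post office" semantics).
chan = {1: queue.Queue(), 2: queue.Queue()}
results = {}

def processor(me, other, xlocal):
    # Second variant from the slide: both send first, then receive.
    chan[other].put(xlocal)          # send xlocal to the other proc
    xremote = chan[me].get()         # receive the other proc's value
    results[me] = xlocal + xremote   # s = xlocal + xremote

p1 = threading.Thread(target=processor, args=(1, 2, 1.0))  # holds x(1)
p2 = threading.Thread(target=processor, args=(2, 1, 2.0))  # holds x(2)
p1.start(); p2.start()
p1.join(); p2.join()
print(results)  # both computed s = x(1) + x(2) = 3.0
```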
Programming Model 3
X Data Parallel
  ¾ Single sequential thread of control consisting of parallel operations
  ¾ Parallel operations applied to all (or defined subset) of a data structure
  ¾ Communication is implicit in parallel operators and "shifted" data structures
  ¾ Elegant and easy to understand and reason about
  ¾ Not all problems fit this model
X Like marching in a regiment
° Think of Matlab

[Figure: A = array of all data; fA = f(A); s = sum(fA)]
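The A / fA / sum(fA) picture maps directly onto whole-array operations, e.g. in Python with NumPy as a stand-in for the Matlab style the slide mentions (assuming NumPy is available):

```python
import numpy as np

A = np.arange(1, 101, dtype=np.float64)  # A = array of all data

# One logical thread of control; each operation acts on the
# whole array "in parallel".
fA = A * A          # fA = f(A), applied elementwise
s = fA.sum()        # global reduction

print(s)  # 338350.0 = sum of i**2 for i = 1..100
```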
Model 3
X Vector Computing
  ¾ One instruction executed across all the data in a pipelined fashion
  ¾ Parallel operations applied to all (or defined subset) of a data structure
  ¾ Communication is implicit in parallel operators and "shifted" data structures
  ¾ Elegant and easy to understand and reason about
  ¾ Not all problems fit this model
X Like marching in a regiment
° Think of Matlab

[Figure: A = array of all data; fA = f(A); s = sum(fA)]
Model 3
X An SIMD (Single Instruction Multiple Data) machine
X A large number of small processors
X A single "control processor" issues each instruction
  ¾ each processor executes the same instruction
  ¾ some processors may be turned off on any instruction

[Figure: a control processor driving nodes (P1, memory, NI) … (Pn, memory, NI) on an interconnect]

X Machines not popular (CM2), but programming model is
  ¾ implemented by mapping n-fold parallelism to p processors
  ¾ mostly done in the compilers (HPF = High Performance Fortran)
Model 4
X Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
X CLUMP = Cluster of SMPs
X Shared memory within one SMP, message passing outside
X Clusters, ASCI Red (Intel), ...
X Programming model?
  ¾ Treat machine as "flat", always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
  ¾ Expose two layers: shared memory (OpenMP) and message passing (MPI) -- higher performance, but ugly to program
Programming Model 5
X Bulk Synchronous Processing (BSP) – L. Valiant
X Used within the message passing or shared memory models as a programming convention
X Phases separated by global barriers
  ¾ Compute phases: all operate on local data (in distributed memory)
    » or read access to global data (in shared memory)
  ¾ Communication phases: all participate in rearrangement or reduction of global data
X Generally all doing the "same thing" in a phase
  ¾ all do f, but may all do different things within f
X Simplicity of data parallelism without restrictions
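The barrier-separated phases can be sketched with Python threads (all names are mine): each superstep computes on local data, then everyone meets at a global barrier before a reduction phase.

```python
import threading

NTHREADS = 4
barrier = threading.Barrier(NTHREADS)
data = list(range(40))
partials = [0] * NTHREADS
total = [0]

def worker(tid):
    lo, hi = tid * 10, (tid + 1) * 10
    # Compute phase: all threads operate on local data.
    partials[tid] = sum(x * x for x in data[lo:hi])
    # Global barrier separates the phases.
    barrier.wait()
    # Communication/reduction phase: one thread combines results.
    if tid == 0:
        total[0] = sum(partials)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NTHREADS)]
for t in threads: t.start()
for t in threads: t.join()
print(total[0])  # 20540 = sum of i**2 for i = 0..39
```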
Summary so far
X Historically, each parallel machine was unique, along with its programming model and programming language
X You had to throw away your software and start over with each new kind of machine - ugh
X Now we distinguish the programming model from the underlying machine, so we can write portably correct code that runs on many machines
  ¾ MPI now the most portable option, but can be tedious
X Writing portably fast code requires tuning for the architecture
  ¾ Algorithm design challenge is to make this process easy
  ¾ Example: picking a blocksize, not rewriting the whole algorithm
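The blocksize point can be illustrated with a cache-blocked matrix multiply in Python (a toy sketch, not a tuned kernel; the function name and `BS` parameter are mine): tuning for a machine means choosing `BS`, while the algorithm itself is unchanged.

```python
def blocked_matmul(A, B, n, BS):
    """C = A*B for n x n matrices (lists of lists), in BS x BS blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, BS):
        for kk in range(0, n, BS):
            for jj in range(0, n, BS):
                # One block of work: for a well-chosen BS these
                # submatrices fit in cache and get reused.
                for i in range(ii, min(ii + BS, n)):
                    for k in range(kk, min(kk + BS, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + BS, n)):
                            C[i][j] += a * B[k][j]
    return C

# Tuning = picking BS; the algorithm is untouched.
n = 4
I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
M = [[float(i * n + j) for j in range(n)] for i in range(n)]
print(blocked_matmul(I, M, n, BS=2) == M)  # True
```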