Architectural Trends and Programming Model Strategies for Large-Scale Machines

Katherine Yelick

U.C. Berkeley and Lawrence Berkeley National Lab

http://titanium.cs.berkeley.edu http://upc.lbl.gov

Architecture Trends

• What happened to Moore's Law?
• Power density and system power
• Multicore trend
• Game processors and GPUs
• What this means to you

Moore's Law is Alive and Well

Moore's Law: 2X transistors per chip every 1.5 years. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months, and microprocessors have become smaller, denser, and more powerful ever since. (Slide source: Jack Dongarra)

But Clock Scaling Bonanza Has Ended

• Processor designers are forced to go "multicore":
  • Heat density: a faster clock means hotter chips
  • More cores at lower clock rates burn less power
• Declining benefits of "hidden" Instruction Level Parallelism (ILP)
  • The last generation of single-core chips was probably over-engineered
  • Lots of logic and power went into finding ILP, but it wasn't in the applications
• Parallelism can also be used for redundancy
  • The IBM Cell processor has 8 small cores; a blade system with all 8 sells for $20K, whereas a PS3 is about $600 and only uses 7

Clock Scaling Hits Power Density Wall

Scaling clock speed (business as usual) will not work.

[Figure: power density (W/cm²) of microprocessors from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, plotted from 1970 to 2010 and heading toward the levels of a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]

Revolution is Happening Now

• Chip density continues to increase ~2x every 2 years
• Clock speed does not
• The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter), and Stanford (Olukotun, Hammond)

Petaflop with ~1M Cores Common by 2008

[Figure: Top500 performance projection (SUM, #1, and #500) from 1993 to 2014, with a 1 PFlop system in 2008, roughly 1 Eflop/s by 2015, and the #1 and #500 curves separated by 6-8 years. Data from top500.org.]

(Slide source: Horst Simon, LBNL)

NERSC 2005 Projections for Computer Room Power (System + Cooling)

Power is a system-level problem, not just a chip-level problem.

[Figure: projected NERSC computer-room power in MegaWatts (0-70) from 2001 to 2017, broken down by system (PDSF, HPSS, N3, N3E, Jacquard, Bassi, NGF, N5a, N5b, N6, N7, N8, Misc) plus cooling.]

Concurrency for Low Power

• Highly concurrent systems are more power efficient
  • Dynamic power is proportional to V²fC
  • Increasing frequency (f) also increases supply voltage (V): a more-than-linear effect
  • Increasing the number of cores increases capacitance (C), but that is only a linear effect
• Hidden concurrency burns power
  • Speculation, dynamic dependence checking, etc.
• Push parallelism discovery to software (compilers and application programmers) to save power (see the worked numbers below)
• Challenge: Can you double the concurrency in your algorithms every 18 months?
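The arithmetic behind this argument, using the same numbers as the "Parallelism Saves Power" extra slide at the end of the deck and assuming the supply voltage can scale down with frequency, looks like this:

```latex
\[
P_{\mathrm{dyn}} = C\,V^{2} f
\]
\[
\text{4 lanes at } V/2,\ f/2:\quad
P = (4C)\left(\frac{V}{2}\right)^{2}\frac{f}{2} = \frac{1}{2}\,C\,V^{2} f,
\qquad
\text{throughput} = 4\cdot\frac{f}{2} = 2f
\]
\]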

Cell Processor (PS3) on Scientific Kernels

• Very hard to program: explicit control over memory hierarchy

Game Processors Outpace Moore's Law

• Traditionally too specialized (no scatter, no inter-core communication, no double-precision floating point), but the trend is toward generalization
• Still have control over memory hierarchy/partitioning

What Does this Mean to You?

• The world of computing is going parallel
  • The number of cores will double every 12-24 months
  • Petaflop (1M processor) machines will be common by 2015
• More parallel programmers to hire ☺
• Climate codes must have more parallelism
  • Need for a fundamental rewrite and new algorithms
  • Added challenge of combining with adaptive algorithms
• A new programming model or language is likely ☺
  • Can HPC benefit from investments in parallel languages and tools?
  • One programming model for all?

Performance on Current Machines (Oliker, Wehner, Mirin, Parks, and Worley)

[Figure: percent of theoretical peak (roughly 3%-11%) achieved on Power3, Itanium2, the Earth Simulator, X1, and X1E at concurrencies P=256, P=336, and P=672.]

• Current state-of-the-art systems attain around 5% of peak at the highest available concurrencies
• Note: the current algorithm uses OpenMP when possible to increase parallelism
• Unless we can do better, the peak performance of the system must be 10-20x the sustained requirement (see the arithmetic below)
• Limitations (from separate studies of scientific kernels)
  • Not just memory bandwidth: also latency, compiler code generation, ...
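The 10-20x headroom figure is just the reciprocal of the sustained efficiency seen above (roughly 5-10% of peak):

```latex
\[
\text{peak required} = \frac{\text{sustained requirement}}{\text{sustained fraction of peak}},
\qquad
\frac{1}{0.10} = 10\times,
\qquad
\frac{1}{0.05} = 20\times
\]
```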

Strawman 1km Climate Computer (Oliker, Shalf, Wehner)

• Cloud-system-resolving global atmospheric model
  • 0.015° x 0.02° horizontally with 100 vertical levels
• Caveat: a back-of-the-envelope calculation, with many assumptions about scaling the current model
• To achieve simulation time 1000x faster than real time:
  • ~10 Petaflops sustained performance requirement
  • ~100 Terabytes total memory
  • ~2 million horizontal subdomains
  • ~10 vertical domains
  • ~20 million processors at 500 Mflop/s each sustained, inclusive of communication costs
  • 5 MB memory per processor
  • ~20,000 nearest-neighbor send-receive pairs per subdomain per simulated hour, of ~10KB each
(The per-processor numbers follow from the totals; see the sketch below.)
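A quick check of the arithmetic, echoing the slide's figures (illustrative C only; the constants are the slide's round numbers):

```c
/* Back-of-the-envelope check of the strawman 1 km climate machine.
 * Numbers are taken directly from the slide; this is illustrative only. */
#include <stdio.h>

int main(void) {
    double sustained_flops  = 10e15;   /* ~10 Petaflops sustained          */
    double total_memory     = 100e12;  /* ~100 Terabytes total memory      */
    double horiz_subdomains = 2e6;     /* ~2 million horizontal subdomains */
    double vert_domains     = 10;      /* ~10 vertical domains             */

    double processors = horiz_subdomains * vert_domains;   /* ~20 million */
    printf("processors:           %.0f million\n", processors / 1e6);
    printf("flops per processor:  %.0f Mflop/s sustained\n",
           sustained_flops / processors / 1e6);
    printf("memory per processor: %.0f MB\n",
           total_memory / processors / 1e6);
    return 0;
}
```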

A Programming Model Approach: Partitioned Global Address Space (PGAS) Languages

What, Why, and How

Parallel Programming Models

• Easy parallel software is still an unsolved problem!
• Goals:
  • Make parallel machines easier to use
  • Ease algorithm experiments
  • Make the most of your machine (network and memory)
  • Enable compiler optimizations
• Partitioned Global Address Space (PGAS) languages
  • Global address space like threads (programmability)
  • One-sided communication
  • Currently static (SPMD) parallelism, like MPI
  • Local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space

• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global

• By default:
  • Object heaps are shared
  • Program stacks are private

[Figure: the global address space spanning processors p0, p1, ..., pn; each processor has private data (x, y on program stacks) and a partition of the shared heap, with pointers l and g referring to local and remote shared data respectively.]

• 3 current languages: UPC, CAF, and Titanium
  • All three use an SPMD execution model
  • Emphasis in this talk is on UPC and Titanium (which is based on Java)
• 3 emerging languages: X10, Fortress, and Chapel
(A minimal UPC illustration of the model follows below.)
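A minimal UPC sketch of this model (illustrative, not code from the slides): data declared `shared` lives in the global address space with a defined affinity, ordinary variables stay private to each thread, and any thread can read data another thread wrote without a matching receive.

```c
/* Minimal UPC sketch of the PGAS model (illustrative only). */
#include <upc.h>
#include <stdio.h>

shared int sh[THREADS];   /* one element with affinity to each thread (shared) */
int x;                    /* ordinary variable: private to each thread         */

int main(void) {
    x = MYTHREAD;                    /* private write                        */
    sh[MYTHREAD] = 10 * MYTHREAD;    /* write to my part of the shared array */
    upc_barrier;

    /* Any thread may read data another thread wrote, with no matching receive. */
    if (MYTHREAD == 0)
        printf("thread %d sees sh[%d] = %d\n",
               MYTHREAD, THREADS - 1, sh[THREADS - 1]);
    return 0;
}
```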

PGAS Languages for Hybrid Parallelism

• PGAS languages are a good fit to shared memory machines
  • The global address space is implemented as reads/writes (loads/stores)
  • The current UPC and Titanium implementations use threads
  • Working on System V shared memory for UPC
• PGAS languages are also a good fit to networks
  • Good match to modern network hardware
  • Decoupling data transfer from synchronization can improve bandwidth

PGAS Languages on Clusters: One-Sided vs. Two-Sided Communication

[Figure: message formats. A two-sided message carries a message ID plus the data payload and must be matched by the host CPU; a one-sided put message carries the destination address plus the data payload and can be deposited directly into memory by the network interface.]

• A one-sided put/get message can be handled directly by a network interface with RDMA support
  • Avoids interrupting the CPU or storing data from the CPU (preposts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put
  • Matching can be offloaded to the network interface in networks like Quadrics
  • But the match tables must be downloaded to the interface (from the host)
(A minimal one-sided sketch follows below.)
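To make the contrast concrete, here is a minimal UPC-style sketch of a one-sided put (a sketch, not code from the talk; the buffer layout and `send_block` helper are hypothetical). The initiator names the remote memory itself, so no receive is posted on the target; a two-sided MPI transfer of the same data would need an MPI_Send matched by an MPI_Recv on the target.

```c
/* One-sided put in UPC: the initiator names the remote location directly.
 * Illustrative sketch; the talk's measurements use GASNet underneath.     */
#include <upc.h>

#define N 1024
shared [N] double buf[THREADS][N];   /* one block of N doubles per thread */

void send_block(const double *local, int target) {
    /* Copy N doubles from private memory into the block with affinity to
     * 'target'. No receive call runs on the target; with RDMA-capable
     * hardware the data lands in its memory without interrupting its CPU. */
    upc_memput(&buf[target][0], local, N * sizeof(double));
}
```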

(Joint work with Dan Bonachea)

One-Sided vs. Two-Sided: Practice

[Figure: flood bandwidth (MB/s) vs. message size (10 bytes to 1,000,000 bytes) for GASNet non-blocking put and MPI on the NERSC Jacquard machine (Opteron processors), with an inset of relative bandwidth (GASNet/MPI, up to ~2.4x); up is good.]

• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• The half-power point (N½) differs by an order of magnitude
• This is not a criticism of the implementation!

(Joint work with Paul Hargrove and Dan Bonachea)

GASNet: Portability and High Performance

8-Byte Roundtrip Latency

[Figure: 8-byte roundtrip latency in microseconds (down is good) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

GASNet is better for latency across machines.

(Joint work with the UPC group; GASNet design by Dan Bonachea)

GASNet: Portability and High Performance (Flood Bandwidth for 2 MB Messages)

[Figure: flood bandwidth for 2 MB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

GASNet bandwidth is at least as high as (comparable to) MPI for large messages.

(Joint work with the UPC group; GASNet design by Dan Bonachea)

GASNet: Portability and High Performance

Flood Bandwidth for 4 KB Messages

[Figure: flood bandwidth for 4 KB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

GASNet excels at mid-range message sizes, which is important for overlap.

(Joint work with the UPC group; GASNet design by Dan Bonachea)

Communication Strategies for 3D FFT

• Three approaches:
  • Chunk (all rows with the same destination):
    • Wait for the 2nd-dimension FFTs to finish
    • Minimizes the number of messages
  • Slab (all rows in a single plane with the same destination):
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  • Pencil (a single row):
    • Send each row as it completes
    • Maximizes overlap and matches the natural layout
(A sketch of the pencil approach follows below.)
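A minimal sketch of the pencil strategy, shown in UPC with hypothetical helper names (`fft_1d_row`, `dest_thread`, `dest_slot`, `recv_buf`); the measured implementation uses non-blocking GASNet puts rather than blocking `upc_memput`, but the idea is the same: push each row to its destination as soon as its 1-D FFT completes.

```c
/* Pencil-style exchange for a 3D FFT transpose: send each row as soon as
 * it is computed, overlapping communication with the remaining row FFTs.
 * Illustrative UPC sketch; helper names and layout are placeholders.      */
#include <upc.h>

#define NX 256   /* rows handled by this thread (example size) */
#define NY 256   /* points per row (example size)              */

shared [NX*NY] double recv_buf[THREADS][NX][NY];  /* hypothetical landing area */

extern void fft_1d_row(double *row, int n);  /* assumed local 1-D FFT           */
extern int  dest_thread(int row);            /* assumed transpose: target thread */
extern int  dest_slot(int row);              /* assumed transpose: target slot   */

void pencil_exchange(double rows[NX][NY]) {
    for (int i = 0; i < NX; i++) {
        fft_1d_row(rows[i], NY);                              /* one pencil   */
        upc_memput(&recv_buf[dest_thread(i)][dest_slot(i)][0],
                   rows[i], NY * sizeof(double));             /* ship it now  */
    }
    upc_barrier;   /* all pencils delivered before the next FFT dimension */
}
```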

(Joint work with Chris Bell, Rajesh Nishtala, and Dan Bonachea)

NAS FT Variants Performance Summary

[Figure: MFlop/s per thread for the NAS FT benchmark versions (Best NAS Fortran/MPI with FFTW, Best MPI (always slabs), and Best UPC (always pencils)) on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512; an annotation marks 0.5 Tflop/s.]

• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap

Making PGAS Real: Applications and Portability

Coding Challenges: Block-Structured AMR

• Adaptive Mesh Refinement (AMR) is challenging
  • Irregular data accesses and control flow at boundaries
  • A mixed global/local view is useful

Titanium AMR benchmark available

(AMR Titanium work by Tong Wen and Philip Colella)

Language Support Helps Productivity

• C++/Fortran/MPI AMR
  • Chombo package from LBNL
  • Bulk-synchronous communication: pack boundary data between processors
  • All optimizations done by the programmer
• Titanium AMR
  • Entirely in Titanium
  • Finer-grained communication
  • No explicit pack/unpack code; automated in the runtime system (see the sketch below)
• General approach
  • The language allows programmer optimizations
  • The compiler/runtime does some automatically

[Figure: lines of code for Titanium vs. C++/Fortran/MPI (Chombo), broken down by component (AMRElliptic, AMRTools, Util, Grid, AMR, Array), on a 0-30,000 line scale; the Titanium version is substantially smaller.]
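The "no pack/unpack" point can be illustrated with a small ghost-cell update. The benchmark itself is written in Titanium, but the same global-address-space mechanism is shown here as a hedged UPC-style sketch with hypothetical names (`u`, `exchange_ghosts`, `NLOC`): remote boundary values are read directly, with no packing, matching receive, or unpacking.

```c
/* Ghost-cell update without explicit pack/unpack: remote elements are read
 * directly through the global address space. Hypothetical UPC sketch; the
 * Titanium AMR code uses the analogous mechanism on Titanium arrays.       */
#include <upc.h>

#define NLOC 1000                       /* interior cells per thread (example) */
shared [NLOC] double u[THREADS][NLOC];  /* distributed 1-D grid                */

double left_ghost, right_ghost;         /* private ghost values */

void exchange_ghosts(void) {
    int left  = (MYTHREAD + THREADS - 1) % THREADS;
    int right = (MYTHREAD + 1) % THREADS;

    upc_barrier;                         /* neighbors have finished writing u */
    left_ghost  = u[left][NLOC - 1];     /* direct remote read: no pack, no   */
    right_ghost = u[right][0];           /* matching receive, no unpack       */
}
```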

(Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su)

Conclusions

• Future hardware performance improvements
  • Mostly from concurrency, not clock speed
  • New commodity hardware coming (sometime) from games/GPUs
• PGAS languages
  • Good fit for shared and distributed memory
  • Control over locality and (for better or worse) SPMD
  • Available for download
    • Berkeley UPC compiler: http://upc.lbl.gov
    • Titanium compiler: http://titanium.cs.berkeley.edu
• Implications for climate modeling
  • Need to rethink algorithms to identify more parallelism
  • Need to match the needs of future (efficient) algorithms

Extra Slides

Parallelism Saves Power

• Exploit explicit parallelism to reduce power:

  Power = C · V² · F            Performance = Cores · F
  Power = 2C · V² · F           Performance = 2 · Cores · F
  Power = 4C · (V²/4) · (F/2)   Performance = 4 · Lanes · (F/2)
        = (C · V² · F)/2                    = 2 · Cores · F

• Using additional cores
  – Allows a reduction in frequency and supply voltage without decreasing the (original) performance
  – The supply-voltage reduction leads to large power savings
• Additional benefits
  – Small, simple cores have more predictable performance

Where is Fortran?

Source: Tim O’Reilly
