Architectural Trends and Programming Model Strategies for Large-Scale Machines

Katherine Yelick

U.C. Berkeley and Lawrence Berkeley National Lab

http://titanium.cs.berkeley.edu http://upc.lbl.gov

Architecture Trends

• What happened to Moore's Law?
• Power density and system power
• Multicore trend
• Game processors and GPUs
• What this means to you

Moore's Law is Alive and Well

Moore's Law: 2X transistors per chip every 1.5 years. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months, and microprocessors have become smaller, denser, and more powerful ever since. (Slide source: Jack Dongarra)

But Clock Scaling Bonanza Has Ended

• Processor designers are forced to go "multicore":
  • Heat density: a faster clock means hotter chips
  • More cores at lower clock rates burn less power
• Declining benefits of "hidden" Instruction Level Parallelism (ILP)
  • The last generation of single-core chips was probably over-engineered
  • Lots of logic and power went into finding ILP, but it wasn't in the applications
• Parallelism can also be used for redundancy
  • The IBM Cell processor has 8 small cores; a blade system with all 8 sells for $20K, whereas a PS3 is about $600 and only uses 7

Clock Scaling Hits Power Density Wall

Scaling clock speed (business as usual) will not work.

[Figure: power density (W/cm²) of microprocessors from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, plotted from 1970 to 2010 and heading toward the levels of a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]

Revolution is Happening Now

• Chip density continues to increase ~2x every 2 years
• Clock speed does not
• The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter), and Stanford (Olukotun, Hammond)

Petaflop with ~1M Cores Common by 2008

[Figure: Top500 performance projection (SUM, #1, and #500) from 1993 to 2014, with a 1 PFlop system in 2008, roughly 1 Eflop/s by 2015, and the #1 and #500 curves separated by 6-8 years. Data from top500.org.]

(Slide source: Horst Simon, LBNL)

NERSC 2005 Projections for Computer Room Power (System + Cooling)

Power is a system-level problem, not just a chip-level problem.

[Figure: projected NERSC computer-room power in MegaWatts (0-70) from 2001 to 2017, broken down by system (PDSF, HPSS, N3, N3E, Jacquard, Bassi, NGF, N5a, N5b, N6, N7, N8, Misc) plus cooling.]

Concurrency for Low Power

• Highly concurrent systems are more power efficient
  • Dynamic power is proportional to V²fC
  • Increasing frequency (f) also increases supply voltage (V): a more-than-linear effect
  • Increasing the number of cores increases capacitance (C), but that is only a linear effect
• Hidden concurrency burns power
  • Speculation, dynamic dependence checking, etc.
• Push parallelism discovery to software (compilers and application programmers) to save power (see the worked numbers below)
• Challenge: Can you double the concurrency in your algorithms every 18 months?
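The arithmetic behind this argument, using the same numbers as the "Parallelism Saves Power" extra slide at the end of the deck and assuming the supply voltage can scale down with frequency, looks like this:

```latex
\[
P_{\mathrm{dyn}} = C\,V^{2} f
\]
\[
\text{4 lanes at } V/2,\ f/2:\quad
P = (4C)\left(\frac{V}{2}\right)^{2}\frac{f}{2} = \frac{1}{2}\,C\,V^{2} f,
\qquad
\text{throughput} = 4\cdot\frac{f}{2} = 2f
\]
\]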

Cell Processor (PS3) on Scientific Kernels

• Very hard to program: explicit control over memory hierarchy

Game Processors Outpace Moore's Law

• Traditionally too specialized (no scatter, no inter-core communication, no double-precision floating point), but the trend is toward generalization
• Still have control over memory hierarchy/partitioning

What Does this Mean to You?

• The world of computing is going parallel
  • The number of cores will double every 12-24 months
  • Petaflop (1M processor) machines will be common by 2015
• More parallel programmers to hire ☺
• Climate codes must have more parallelism
  • Need for a fundamental rewrite and new algorithms
  • Added challenge of combining with adaptive algorithms
• A new programming model or language is likely ☺
  • Can HPC benefit from investments in parallel languages and tools?
  • One programming model for all?

Performance on Current Machines (Oliker, Wehner, Mirin, Parks, and Worley)

[Figure: percent of theoretical peak (roughly 3%-11%) achieved on Power3, Itanium2, the Earth Simulator, X1, and X1E at concurrencies P=256, P=336, and P=672.]

• Current state-of-the-art systems attain around 5% of peak at the highest available concurrencies
• Note: the current algorithm uses OpenMP when possible to increase parallelism
• Unless we can do better, the peak performance of the system must be 10-20x the sustained requirement (see the arithmetic below)
• Limitations (from separate studies of scientific kernels)
  • Not just memory bandwidth: also latency, compiler code generation, ...
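The 10-20x headroom figure is just the reciprocal of the sustained efficiency seen above (roughly 5-10% of peak):

```latex
\[
\text{peak required} = \frac{\text{sustained requirement}}{\text{sustained fraction of peak}},
\qquad
\frac{1}{0.10} = 10\times,
\qquad
\frac{1}{0.05} = 20\times
\]
```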

Strawman 1km Climate Computer (Oliker, Shalf, Wehner)

• Cloud-system-resolving global atmospheric model
  • 0.015° x 0.02° horizontally with 100 vertical levels
• Caveat: a back-of-the-envelope calculation, with many assumptions about scaling the current model
• To achieve simulation time 1000x faster than real time:
  • ~10 Petaflops sustained performance requirement
  • ~100 Terabytes total memory
  • ~2 million horizontal subdomains
  • ~10 vertical domains
  • ~20 million processors at 500 Mflop/s each sustained, inclusive of communication costs
  • 5 MB memory per processor
  • ~20,000 nearest-neighbor send-receive pairs per subdomain per simulated hour, of ~10KB each
(The per-processor numbers follow from the totals; see the sketch below.)
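A quick check of the arithmetic, echoing the slide's figures (illustrative C only; the constants are the slide's round numbers):

```c
/* Back-of-the-envelope check of the strawman 1 km climate machine.
 * Numbers are taken directly from the slide; this is illustrative only. */
#include <stdio.h>

int main(void) {
    double sustained_flops  = 10e15;   /* ~10 Petaflops sustained          */
    double total_memory     = 100e12;  /* ~100 Terabytes total memory      */
    double horiz_subdomains = 2e6;     /* ~2 million horizontal subdomains */
    double vert_domains     = 10;      /* ~10 vertical domains             */

    double processors = horiz_subdomains * vert_domains;   /* ~20 million */
    printf("processors:           %.0f million\n", processors / 1e6);
    printf("flops per processor:  %.0f Mflop/s sustained\n",
           sustained_flops / processors / 1e6);
    printf("memory per processor: %.0f MB\n",
           total_memory / processors / 1e6);
    return 0;
}
```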

A Programming Model Approach: Partitioned Global Address Space (PGAS) Languages

What, Why, and How

Parallel Programming Models

• Easy parallel software is still an unsolved problem!
• Goals:
  • Make parallel machines easier to use
  • Ease algorithm experiments
  • Make the most of your machine (network and memory)
  • Enable compiler optimizations
• Partitioned Global Address Space (PGAS) languages
  • Global address space like threads (programmability)
  • One-sided communication
  • Currently static (SPMD) parallelism, like MPI
  • Local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space

• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global

• By default:
  • Object heaps are shared
  • Program stacks are private

[Figure: the global address space spanning processors p0, p1, ..., pn; each processor has private data (x, y on program stacks) and a partition of the shared heap, with pointers l and g referring to local and remote shared data respectively.]

• 3 current languages: UPC, CAF, and Titanium
  • All three use an SPMD execution model
  • Emphasis in this talk is on UPC and Titanium (which is based on Java)
• 3 emerging languages: X10, Fortress, and Chapel
(A minimal UPC illustration of the model follows below.)
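A minimal UPC sketch of this model (illustrative, not code from the slides): data declared `shared` lives in the global address space with a defined affinity, ordinary variables stay private to each thread, and any thread can read data another thread wrote without a matching receive.

```c
/* Minimal UPC sketch of the PGAS model (illustrative only). */
#include <upc.h>
#include <stdio.h>

shared int sh[THREADS];   /* one element with affinity to each thread (shared) */
int x;                    /* ordinary variable: private to each thread         */

int main(void) {
    x = MYTHREAD;                    /* private write                        */
    sh[MYTHREAD] = 10 * MYTHREAD;    /* write to my part of the shared array */
    upc_barrier;

    /* Any thread may read data another thread wrote, with no matching receive. */
    if (MYTHREAD == 0)
        printf("thread %d sees sh[%d] = %d\n",
               MYTHREAD, THREADS - 1, sh[THREADS - 1]);
    return 0;
}
```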

PGAS Languages for Hybrid Parallelism

• PGAS languages are a good fit to shared memory machines
  • The global address space is implemented as reads/writes (loads/stores)
  • The current UPC and Titanium implementations use threads
  • Working on System V shared memory for UPC
• PGAS languages are also a good fit to networks
  • Good match to modern network hardware
  • Decoupling data transfer from synchronization can improve bandwidth

PGAS Languages on Clusters: One-Sided vs. Two-Sided Communication

[Figure: message formats. A two-sided message carries a message ID plus the data payload and must be matched by the host CPU; a one-sided put message carries the destination address plus the data payload and can be deposited directly into memory by the network interface.]

• A one-sided put/get message can be handled directly by a network interface with RDMA support
  • Avoids interrupting the CPU or storing data from the CPU (preposts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put
  • Matching can be offloaded to the network interface in networks like Quadrics
  • But the match tables must be downloaded to the interface (from the host)
(A minimal one-sided sketch follows below.)
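To make the contrast concrete, here is a minimal UPC-style sketch of a one-sided put (a sketch, not code from the talk; the buffer layout and `send_block` helper are hypothetical). The initiator names the remote memory itself, so no receive is posted on the target; a two-sided MPI transfer of the same data would need an MPI_Send matched by an MPI_Recv on the target.

```c
/* One-sided put in UPC: the initiator names the remote location directly.
 * Illustrative sketch; the talk's measurements use GASNet underneath.     */
#include <upc.h>

#define N 1024
shared [N] double buf[THREADS][N];   /* one block of N doubles per thread */

void send_block(const double *local, int target) {
    /* Copy N doubles from private memory into the block with affinity to
     * 'target'. No receive call runs on the target; with RDMA-capable
     * hardware the data lands in its memory without interrupting its CPU. */
    upc_memput(&buf[target][0], local, N * sizeof(double));
}
```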

(Joint work with Dan Bonachea)

One-Sided vs. Two-Sided: Practice

[Figure: flood bandwidth (MB/s) vs. message size (10 bytes to 1,000,000 bytes) for GASNet non-blocking put and MPI on the NERSC Jacquard machine (Opteron processors), with an inset of relative bandwidth (GASNet/MPI, up to ~2.4x); up is good.]

• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• The half-power point (N½) differs by an order of magnitude
• This is not a criticism of the implementation!

(Joint work with Paul Hargrove and Dan Bonachea)

GASNet: Portability and High Performance

8-Byte Roundtrip Latency

[Figure: 8-byte roundtrip latency in microseconds (down is good) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

GASNet is better for latency across machines.

(Joint work with the UPC group; GASNet design by Dan Bonachea)

GASNet: Portability and High Performance (Flood Bandwidth for 2 MB Messages)

[Figure: flood bandwidth for 2 MB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

GASNet bandwidth is at least as high as (comparable to) MPI for large messages.

(Joint work with the UPC group; GASNet design by Dan Bonachea)

GASNet: Portability and High Performance

Flood Bandwidth for 4 KB Messages

[Figure: flood bandwidth for 4 KB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]

GASNet excels at mid-range message sizes, which is important for overlap.

(Joint work with the UPC group; GASNet design by Dan Bonachea)

Communication Strategies for 3D FFT

• Three approaches:
  • Chunk (all rows with the same destination):
    • Wait for the 2nd-dimension FFTs to finish
    • Minimizes the number of messages
  • Slab (all rows in a single plane with the same destination):
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  • Pencil (a single row):
    • Send each row as it completes
    • Maximizes overlap and matches the natural layout
(A sketch of the pencil approach follows below.)
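A minimal sketch of the pencil strategy, shown in UPC with hypothetical helper names (`fft_1d_row`, `dest_thread`, `dest_slot`, `recv_buf`); the measured implementation uses non-blocking GASNet puts rather than blocking `upc_memput`, but the idea is the same: push each row to its destination as soon as its 1-D FFT completes.

```c
/* Pencil-style exchange for a 3D FFT transpose: send each row as soon as
 * it is computed, overlapping communication with the remaining row FFTs.
 * Illustrative UPC sketch; helper names and layout are placeholders.      */
#include <upc.h>

#define NX 256   /* rows handled by this thread (example size) */
#define NY 256   /* points per row (example size)              */

shared [NX*NY] double recv_buf[THREADS][NX][NY];  /* hypothetical landing area */

extern void fft_1d_row(double *row, int n);  /* assumed local 1-D FFT           */
extern int  dest_thread(int row);            /* assumed transpose: target thread */
extern int  dest_slot(int row);              /* assumed transpose: target slot   */

void pencil_exchange(double rows[NX][NY]) {
    for (int i = 0; i < NX; i++) {
        fft_1d_row(rows[i], NY);                              /* one pencil   */
        upc_memput(&recv_buf[dest_thread(i)][dest_slot(i)][0],
                   rows[i], NY * sizeof(double));             /* ship it now  */
    }
    upc_barrier;   /* all pencils delivered before the next FFT dimension */
}
```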

(Joint work with Chris Bell, Rajesh Nishtala, and Dan Bonachea)

NAS FT Variants Performance Summary

[Figure: MFlop/s per thread for the NAS FT benchmark versions (Best NAS Fortran/MPI with FFTW, Best MPI (always slabs), and Best UPC (always pencils)) on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512; an annotation marks 0.5 Tflop/s.]

• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap

Making PGAS Real: Applications and Portability

Coding Challenges: Block-Structured AMR

• Adaptive Mesh Refinement (AMR) is challenging
  • Irregular data accesses and control flow at boundaries
  • A mixed global/local view is useful

Titanium AMR benchmark available

(AMR Titanium work by Tong Wen and Philip Colella)

Language Support Helps Productivity

• C++/Fortran/MPI AMR
  • Chombo package from LBNL
  • Bulk-synchronous communication: pack boundary data between processors
  • All optimizations done by the programmer
• Titanium AMR
  • Entirely in Titanium
  • Finer-grained communication
  • No explicit pack/unpack code; automated in the runtime system (see the sketch below)
• General approach
  • The language allows programmer optimizations
  • The compiler/runtime does some automatically

[Figure: lines of code for Titanium vs. C++/Fortran/MPI (Chombo), broken down by component (AMRElliptic, AMRTools, Util, Grid, AMR, Array), on a 0-30,000 line scale; the Titanium version is substantially smaller.]
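The "no pack/unpack" point can be illustrated with a small ghost-cell update. The benchmark itself is written in Titanium, but the same global-address-space mechanism is shown here as a hedged UPC-style sketch with hypothetical names (`u`, `exchange_ghosts`, `NLOC`): remote boundary values are read directly, with no packing, matching receive, or unpacking.

```c
/* Ghost-cell update without explicit pack/unpack: remote elements are read
 * directly through the global address space. Hypothetical UPC sketch; the
 * Titanium AMR code uses the analogous mechanism on Titanium arrays.       */
#include <upc.h>

#define NLOC 1000                       /* interior cells per thread (example) */
shared [NLOC] double u[THREADS][NLOC];  /* distributed 1-D grid                */

double left_ghost, right_ghost;         /* private ghost values */

void exchange_ghosts(void) {
    int left  = (MYTHREAD + THREADS - 1) % THREADS;
    int right = (MYTHREAD + 1) % THREADS;

    upc_barrier;                         /* neighbors have finished writing u */
    left_ghost  = u[left][NLOC - 1];     /* direct remote read: no pack, no   */
    right_ghost = u[right][0];           /* matching receive, no unpack       */
}
```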

(Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su)

Conclusions

• Future hardware performance improvements
  • Mostly from concurrency, not clock speed
  • New commodity hardware coming (sometime) from games/GPUs
• PGAS languages
  • Good fit for shared and distributed memory
  • Control over locality and (for better or worse) SPMD
  • Available for download
    • Berkeley UPC compiler: http://upc.lbl.gov
    • Titanium compiler: http://titanium.cs.berkeley.edu
• Implications for climate modeling
  • Need to rethink algorithms to identify more parallelism
  • Need to match the needs of future (efficient) algorithms

Extra Slides

Parallelism Saves Power

• Exploit explicit parallelism to reduce power:

  Power = C · V² · F            Performance = Cores · F
  Power = 2C · V² · F           Performance = 2 · Cores · F
  Power = 4C · (V²/4) · (F/2)   Performance = 4 · Lanes · (F/2)
        = (C · V² · F)/2                    = 2 · Cores · F

• Using additional cores
  – Allows a reduction in frequency and supply voltage without decreasing the (original) performance
  – The supply-voltage reduction leads to large power savings
• Additional benefits
  – Small, simple cores have more predictable performance

Where is Fortran?

Source: Tim O’Reilly
