
Architectural Trends and Programming Model Strategies for Large-Scale Machines
Katherine Yelick
U.C. Berkeley and Lawrence Berkeley National Laboratory
http://titanium.cs.berkeley.edu
http://upc.lbl.gov

Architecture Trends
• What happened to Moore's Law?
• Power density and system power
• Multicore trend
• Game processors and GPUs
• What this means to you

Moore's Law is Alive and Well
• 2X transistors/chip every 1.5 years, called "Moore's Law"
• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months
• Microprocessors have become smaller, denser, and more powerful
Slide source: Jack Dongarra

But Clock Scaling Bonanza Has Ended
• Processor designers forced to go "multicore":
  • Heat density: a faster clock means hotter chips; more cores with lower clock rates burn less power
  • Declining benefits of "hidden" Instruction Level Parallelism (ILP)
    • Last generation of single-core chips probably over-engineered
    • Lots of logic/power spent finding ILP, but it wasn't in the apps
  • Yield problems
    • Parallelism can also be used for redundancy
    • The IBM Cell processor has 8 small cores; a blade system using all 8 sells for $20K, whereas a PS3 is about $600 and uses only 7

Clock Scaling Hits Power Density Wall
• Scaling clock speed (business as usual) will not work
[Chart: power density (W/cm²) of Intel processors from the 4004 through the Pentium, 1970-2010, on track to exceed a hot plate, a nuclear reactor, a rocket nozzle, and ultimately the sun's surface. Source: Patrick Gelsinger, Intel]

Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
• Clock speed is not
• Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Petaflop with ~1M Cores by 2008, Common by 2015?
[Chart: Top500 performance projection (SUM, #1, #500), data from top500.org, 1993-2014, y-axis up to 1 Eflop/s: a 1 Pflop/s system appears in 2008; the #1 and #500 trend lines are separated by roughly 6-8 years.]
Slide source: Horst Simon, LBNL

NERSC 2005 Projections for Computer Room Power (System + Cooling)
[Chart: projected megawatts, 2001-2017, for NERSC systems (N3, N3E, Jacquard, Bassi, PDSF, HPSS, NGF, N5a, N5b, N6, N7, N8, Misc) plus cooling.]
• Power is a system-level problem, not just a chip-level problem

Concurrency for Low Power
• Highly concurrent systems are more power efficient
  • Dynamic power is proportional to V²fC
  • Increasing frequency (f) also increases supply voltage (V): a more than linear effect
  • Increasing the number of cores increases capacitance (C), but has only a linear effect
• Hidden concurrency burns power
  • Speculation, dynamic dependence checking, etc.
• Push parallelism discovery to software (compilers and application programmers) to save power
• Challenge: Can you double the concurrency in your algorithms every 18 months?
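To make the V²fC argument concrete, here is a rough worked comparison (my illustration, not from the slides; it assumes supply voltage scales roughly linearly with frequency, which the slide states only qualitatively):

\[ P_{\mathrm{dyn}} \propto V^2 f C, \qquad V \approx \alpha f \]
\[ \text{one core at } 2f:\quad P \propto (2\alpha f)^2 (2f)\,C = 8\,\alpha^2 f^3 C \]
\[ \text{two cores at } f:\quad P \propto 2\,(\alpha f)^2 f\,C = 2\,\alpha^2 f^3 C \]

Under this simple model, doubling the core count instead of doubling the clock delivers the same nominal throughput for roughly one quarter of the dynamic power.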
Cell Processor (PS3) on Scientific Kernels
• Very hard to program: explicit control over the memory hierarchy

Game Processors Outpace Moore's Law
[Chart: game-processor speedup over time, outpacing the Moore's Law trend.]
• Traditionally too specialized (no scatter, no inter-core communication, no double-precision floating point), but the trend is to generalize
• Still have control over memory hierarchy/partitioning

What Does this Mean to You?
• The world of computing is going parallel
  • Number of cores will double every 12-24 months
  • Petaflop (1M processor) machines common by 2015
• More parallel programmers to hire ☺
• Climate codes must have more parallelism
  • Need for fundamental rewrites and new algorithms
  • Added challenge of combining with adaptive algorithms
• New programming model or language likely ☺
  • Can HPC benefit from investments in parallel languages and tools?
  • One programming model for all?

Performance on Current Machines
Oliker, Wehner, Mirin, Parks and Worley
[Chart: percent of theoretical peak (roughly 3% to 11%) for Power3, Itanium2, Earth Simulator, X1, and X1E at concurrencies P=256, P=336, and P=672.]
• Current state-of-the-art systems attain around 5% of peak at the highest available concurrencies
• Note: the current algorithm uses OpenMP where possible to increase parallelism
• Unless we can do better, peak performance of the system must be 10-20x the sustained requirement
• Limitations (from separate studies of scientific kernels)
  • Not just memory bandwidth: latency, compiler code generation, ...

Strawman 1km Climate Computer
Oliker, Shalf, Wehner
• Cloud-system-resolving global atmospheric model
  • 0.015° x 0.02° horizontally, with 100 vertical levels
• Caveat: a back-of-the-envelope calculation, with many assumptions about scaling the current model
• To achieve simulation 1000x faster than real time:
  • ~10 Petaflops sustained performance requirement
  • ~100 Terabytes total memory
  • ~2 million horizontal subdomains
  • ~10 vertical domains
  • ~20 million processors at 500 Mflops each sustained, inclusive of communication costs
  • 5 MB memory per processor
  • ~20,000 nearest-neighbor send-receive pairs per subdomain per simulated hour, of ~10KB each
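The strawman figures above are mutually consistent; restating the arithmetic using only the slide's own numbers:

\[ \frac{10\ \mathrm{Pflop/s\ sustained}}{500\ \mathrm{Mflop/s\ per\ processor}} = \frac{10^{16}}{5\times 10^{8}} = 2\times 10^{7} \approx 20\ \text{million processors} \]
\[ 2\times 10^{6}\ \text{horizontal subdomains} \times 10\ \text{vertical domains} = 2\times 10^{7}\ \text{subdomains, one per processor} \]
\[ \frac{100\ \mathrm{TB\ total\ memory}}{2\times 10^{7}\ \text{processors}} = 5\ \mathrm{MB\ per\ processor} \]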
A Programming Model Approach: Partitioned Global Address Space (PGAS) Languages
What, Why, and How

Parallel Programming Models
• Easy parallel software is still an unsolved problem!
• Goals:
  • Make parallel machines easier to use
  • Ease algorithm experiments
  • Make the most of your machine (network and memory)
  • Enable compiler optimizations
• Partitioned Global Address Space (PGAS) Languages
  • Global address space like threads (programmability)
  • One-sided communication
  • Currently static (SPMD) parallelism, like MPI
  • Local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global
[Figure: the global address space is partitioned among processes p0 ... pn; each partition holds shared data (x: 1, x: 5, x: 7, y: 0, ...), while each process also has private variables, a private program stack (l), and global pointers (g) that may reference any partition. By default, object heaps are shared and program stacks are private.]
• 3 current languages: UPC, CAF, and Titanium
  • All three use an SPMD execution model
  • Emphasis in this talk on UPC and Titanium (based on Java)
• 3 emerging languages: X10, Fortress, and Chapel

PGAS Languages for Hybrid Parallelism
• PGAS languages are a good fit to shared memory machines
  • Global address space implemented as reads/writes
  • Current UPC and Titanium implementations use threads
  • Working on System V shared memory for UPC
• PGAS languages are a good fit to distributed memory machines and networks
  • Good match to modern network hardware
  • Decoupling data transfer from synchronization can improve bandwidth

PGAS Languages on Clusters: One-Sided vs Two-Sided Communication
[Figure: a two-sided message carries a message id plus the data payload and must be processed by the host CPU; a one-sided put message carries the destination address plus the data payload and can be deposited directly into memory by the network interface.]
• A one-sided put/get message can be handled directly by a network interface with RDMA support
  • Avoids interrupting the CPU or storing data from the CPU (preposts)
• A two-sided message must be matched with a receive to identify the memory address at which to put the data
  • Offloaded to the network interface in networks like Quadrics
  • Need to download match tables to the interface (from the host)
Joint work with Dan Bonachea

One-Sided vs. Two-Sided: Practice
[Chart: flood bandwidth (MB/s, up is good) vs. message size for GASNet nonblocking put and MPI on the NERSC Jacquard machine (Opteron processors); an inset shows the relative bandwidth (GASNet/MPI) ranging from about 1.0 to 2.4.]
• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N½) differs by one order of magnitude
• This is not a criticism of the implementation!
Joint work with Paul Hargrove and Dan Bonachea
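To make the PGAS model and one-sided access above concrete, here is a minimal UPC sketch (my own illustration, not taken from the talk; compile and launch with a UPC toolchain such as Berkeley upcc/upcrun). A blocked shared array is written by each thread through an affinity-based loop, and thread 0 then reads a remote element with an ordinary expression, which the runtime implements as a one-sided get on a cluster.

```c
/* Minimal UPC sketch of the PGAS model (illustrative, not from the talk). */
#include <upc_relaxed.h>
#include <stdio.h>

#define BLOCK 256

/* Blocked layout: one block of 256 doubles has affinity to each thread. */
shared [BLOCK] double x[BLOCK * THREADS];

int main(void) {
    int i;

    /* The affinity expression &x[i] makes each thread iterate only over
     * the elements in its own partition: purely local writes. */
    upc_forall (i = 0; i < BLOCK * THREADS; i++; &x[i]) {
        x[i] = (double)MYTHREAD;
    }

    upc_barrier;

    if (MYTHREAD == 0) {
        /* A plain read of an element owned by the last thread; on a
         * distributed-memory machine this becomes a one-sided get. */
        printf("element owned by thread %d holds %g\n",
               THREADS - 1, x[BLOCK * THREADS - 1]);
    }
    return 0;
}
```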
GASNet: Portability and High-Performance
[Chart: 8-byte roundtrip latency (usec, down is good) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; latencies range from about 4.5 to 24.2 usec, with GASNet lower than MPI on every platform.]
• GASNet better for latency across machines
Joint work with the UPC Group; GASNet design by Dan Bonachea

GASNet: Portability and High-Performance
[Chart: flood bandwidth for 2MB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on the same six platforms; bars are labeled with absolute bandwidth in MB/s.]
• GASNet at least as high (comparable) for large messages
Joint work with the UPC Group; GASNet design by Dan Bonachea

GASNet: Portability and High-Performance
[Chart: flood bandwidth for 4KB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on the same six platforms.]
• GASNet excels at mid-range sizes: important for overlap
Joint work with the UPC Group; GASNet design by Dan Bonachea
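Since the mid-range bandwidth advantage matters mainly because it leaves room for overlap, here is a hedged UPC-level sketch (my own illustration, not from the talk) of the usual pattern: each thread performs a one-sided bulk put into a neighbor's partition, then uses the split-phase barrier (upc_notify/upc_wait) so that independent local work proceeds while synchronization with the other threads completes. This is one simple way of decoupling data transfer from synchronization, as advocated earlier; asynchronous-put APIs vary by UPC implementation and are not shown.

```c
/* Illustrative UPC sketch: overlap local work with synchronization
 * after a one-sided bulk put (my example, not from the talk). */
#include <upc_relaxed.h>

#define N 1024   /* doubles per block (8 KB: a "mid-range" message size) */

shared [N] double inbox[N * THREADS];   /* one block per destination thread */
double local_out[N];                    /* data produced locally */
double scratch[N];                      /* independent local work */

int main(void) {
    int i;
    int right = (MYTHREAD + 1) % THREADS;

    for (i = 0; i < N; i++)
        local_out[i] = MYTHREAD + 0.001 * i;   /* produce something to send */

    /* One-sided bulk put into the right neighbor's partition. */
    upc_memput(&inbox[right * N], local_out, N * sizeof(double));

    upc_notify;                  /* signal arrival at the barrier ...          */
    for (i = 0; i < N; i++)      /* ... while doing independent local work     */
        scratch[i] = scratch[i] * 0.5 + 1.0;
    upc_wait;                    /* all threads have notified: inbox is now
                                    safe to read */
    return 0;
}
```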