HPC Achievement and Impact – 2009: A Personal Perspective
Invited Keynote presentation to ISC 2009 – Hamburg

HPC Achievement and Impact – 2009: a personal perspective
Dr. Prof. Thomas Sterling (with Maciej Brodowicz & Chirag Dekate)
Arnaud & Edwards Professor, Department of Computer Science
Adjunct Faculty, Department of Electrical and Computer Engineering
Louisiana State University
Visiting Associate, California Institute of Technology
Distinguished Visiting Scientist, Oak Ridge National Laboratory
CSRI Fellow, Sandia National Laboratory
June 24, 2009

HPC Year in Review
• A continuing tradition at ISC (6th year, and still going at it)
• Previous years' themes:
  – 2004: "Constructive Continuity"
  – 2005: "High Density Computing"
  – 2006: "Multicore to Petaflops"
  – 2007: "Multicore: the Next Moore's Law"
  – 2008: "Run-Up to Petaflops"
• This year's theme: "Year 1 A.P. (After Petaflops)"
• As always, a personal perspective – how I've seen it
  – Highlights – the big picture
    • But not all the nitty details, sorry
  – Necessarily biased, but not intentionally so
  – Iron-oriented, but software too
  – Trends and implications for the future
• And a continuing predictor: the Canonical HEC Computer
  – Based on average and leading predictors

Trends in Highlight
• Year 1 after Petaflops (1 A.P.)
• Applying Petaflops Roadrunner & Jaguar to computational challenges
• Deploying Petaflops systems around the world, starting with Jugene
• Programming multicore to save Moore's Law
  – Quad core dominates mainstream processor architectures
  – TBB, Cilk, Concert, & ParalleX
• Harnessing GPUs for a quantum step in performance
  – NVIDIA Tesla
  – ClearSpeed
• Emerging new applications in science and informatics
• Commodity clusters ubiquitous
  – Linux, MPI, Xeon, & Ethernet dominant
  – InfiniBand increasing interconnect market share
• MPPs dominate the high end with lower power, higher density
• Clock rates near flat in the 2–3 GHz range, with some outliers
• Preparing for exascale hardware and software

ADVANCEMENTS IN PROCESSOR ARCHITECTURES

AMD Istanbul
• Six cores per die
• 904 million transistors on 346 mm² (45 nm SOI)
• Support for x8 ECC memory
• HyperTransport 3.0 at 2.4 GHz
• HT Assist to minimize cache probe traffic (4-socket systems and above)
• Remote power management interface (APML)
• 75 W power envelope
• Operating frequency up to 2.6 GHz

Intel Dunnington
• Up to six cores per die (based on Core 2)
• 1,900 million transistors on 504 mm² (45 nm)
• 16 MB shared L3, 3×3 MB unified L2, 96 KB L1D
• 1066 MT/s FSB
• Power dissipation: 50–130 W (depending on core count and clock)
• Operating frequency up to 2.66 GHz
• mPGA604 socket

Intel Nehalem (Core i7)
• 2, 4, 6, or 8 cores per die
• 731 million transistors on 265 mm² in the quad-core version (45 nm)
• Per core: 2–3 MB shared L3, 256 KB L2, 32+32 KB L1
• Triple-channel integrated DDR3 memory controller (up to 25.6 GB/s bandwidth)
• QuickPath Interconnect up to 6.4 GT/s
• Hyper-Threading (2 threads per core)
• Turbo Boost Technology
• Second-level branch predictor and TLB

IBM PowerXCell 8i
• 1 PPE + 8 SPEs
• 250 million transistors on 212 mm² (65 nm SOI)
• SPE modifications:
  – 12.8 double-precision GFLOPS peak
  – Fully pipelined, dual-issue DP FPU with (reduced) 9-cycle latency
  – Improved IEEE FP compliance (denormals, NaNs)
  – ISA modifications (DP compare)
• PPE: no major changes
• Revamped memory controller:
  – Support for up to 4 DDR2 DIMMs at 800 MHz
  – Preserves max. 25.6 GB/s bandwidth
  – I/O pin count increased to 837
• Clock speed 3.2 GHz
• Max. power dissipation 92 W

Cilk++
• Cilk++ is a simple set of extensions for C++ and a powerful runtime system for multicore applications
  – Work-queue-based task model with a work-stealing scheduler
    • One queue per core; if a queue becomes empty, the core 'steals' work from a neighbor
  – The runtime system enables an application to run on an arbitrary number of cores
  – 3 new keywords, implemented using a precompiler:
    • cilk_spawn: spawn a parallel task
    • cilk_sync: synchronize with all running tasks
    • cilk_for: execute a loop body in parallel
  – Special C++ data types allow for lock-free and race-free collective operations on shared data: cilk::hyper_ptr<>
• Cilk++ toolset: compiler, debugger, and race detector
• Available on most 32/64-bit Linux systems and Windows
(Charles Leiserson; src: Cilk Arts)
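To make the three keywords concrete, a minimal sketch of a Cilk++ program follows. It is illustrative only (not taken from the slides) and assumes the Cilk Arts Cilk++ SDK, which supplies the <cilk.h> header and the cilk_main entry point used in place of main().

    // Illustrative Cilk++ sketch (assumption: Cilk Arts Cilk++ SDK toolchain).
    #include <cilk.h>
    #include <cstdio>

    // cilk_spawn forks the call as a task the work-stealing scheduler may run
    // on another core; cilk_sync waits for all children spawned in this scope.
    long fib(long n) {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);  // may run in parallel with the next line
        long y = fib(n - 2);             // continues in the current strand
        cilk_sync;                       // join before using x
        return x + y;
    }

    int cilk_main(int argc, char* argv[]) {  // Cilk++ entry point (per the SDK)
        long a[100];
        cilk_for (int i = 0; i < 100; ++i)   // loop iterations execute in parallel
            a[i] = fib(i % 20);
        std::printf("a[19] = %ld, fib(30) = %ld\n", a[19], fib(30));
        return 0;
    }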
Intel TBB
• C++ library implementing task-based parallelism for multicore systems
  – Relies on the programmer to express explicit parallelism
  – Work-queue-based task model with a work-stealing scheduler, but it is possible to implement a customized scheduler
  – Extends concepts of the C++ Standard Template Library (STL):
    • Generic algorithms: parallel_for, parallel_reduce, pipeline, spawn_and_wait_for_all, etc.
    • Generic data structures: concurrent_hash_map, concurrent_vector, concurrent_queue, etc.
    • Concurrent memory allocators, platform-independent atomic operations
  – Excellent integration with the C++ language (lambdas, atomics, memory consistency model, rvalue references, concepts)
• Available on most 32/64-bit Linux systems, Windows, and Mac OS X, but easily portable; GPL'd code
(src: Intel, rtime.com)
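As an illustration of the STL-style generic algorithms listed above, the following sketch uses parallel_for and parallel_reduce over a blocked_range with C++ lambdas (the lambda overloads that appeared around TBB 2.2); it is not part of the original slides.

    // Illustrative TBB sketch: parallel loop and parallel reduction.
    #include <cstdio>
    #include <vector>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/parallel_reduce.h>

    int main() {
        std::vector<double> v(1 << 20);

        // parallel_for: the work-stealing scheduler splits the range into
        // chunks and hands them to worker threads.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    v[i] = 0.5 * static_cast<double>(i);
            });

        // parallel_reduce: each subrange computes a partial sum; the second
        // lambda combines partial results.
        double sum = tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, v.size()), 0.0,
            [&](const tbb::blocked_range<size_t>& r, double acc) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    acc += v[i];
                return acc;
            },
            [](double a, double b) { return a + b; });

        std::printf("sum = %g\n", sum);
        return 0;
    }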
Sun Microsystems Rock
• 16 cores per die, arranged in 4-core clusters
• 32 threads plus 32 scout threads
• 64-bit SPARC V9 instruction set
• 396 mm² die on 65 nm process
• 2.3 GHz clock
• Each core cluster has:
  – 32 KB instruction cache + 8 KB predecoded state
  – Two 32 KB data caches with pseudo-random replacement
  – Two fully pipelined FPUs
• 2 MB 4-bank L2 cache
• 4 memory interface units support 128 requests in flight
• No out-of-order execution
• Scout threads speculatively prefetch code and data during cache misses
• Large instruction windows
• Hardware support for transactional memory (chkpt and commit instructions)
• Dissipates approximately 10 W per core

AMD FireStream 9270
• Native support for double-precision floating point
• 260 mm² die on 55 nm CMOS
• 10 SIMD cores, each with:
  – 80 32-bit stream processing units
  – Its own control logic
  – 4 dedicated texture units and L1 cache
  – Communication with other cores via a 16 KB global data share
• Supports a total of 16,384 shader threads
• 64 AA resolve units, 64 Z/stencil units, and 40 texture units
• Peak performance: 1.2 SP TFLOPS, 240 DP GFLOPS
• 750 MHz clock
• 115.2 GB/s memory bandwidth
• 160 W per board typical, <220 W peak
• Up to 2 GB GDDR5 SDRAM

NVIDIA Tesla T10
• Native support for double-precision floating point
• 1,400 million transistors on 470 mm² die (55 nm) for the G200b revision
• 240 shader cores, 80 texture units, 32 ROPs
• Clock up to 1.44 GHz
• 933 SP GFLOPS, 78 DP GFLOPS peak @ 1.3 GHz
• 512-bit GDDR3 memory interface at 800 MHz
• 102 GB/s memory bandwidth per GPU
• <200 W per single-processor board (160 W typical)
• Products:
  – C1060 accelerator board
  – S1070 quad-GPU system

OpenCL: The Open Standard for Heterogeneous Parallel Programming
• OpenCL (Open Computing Language): a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors
• C-based cross-platform programming interface
  – Subset of ISO C99 with language extensions
  – Well-defined numerical accuracy: IEEE 754 rounding behavior with specified maximum error
  – Online or offline compilation and build of compute kernel executables
• OpenCL platform model: 1 host + 1 or more compute devices
• Platform layer API
  – A hardware abstraction layer over diverse computational resources
  – Query, select, and initialize compute devices
  – Create compute contexts and work-queues
• Runtime API
  – Execute compute kernels
  – Manage scheduling, compute, and memory resources
• Memory model
  – Shared memory model with relaxed consistency
  – Multiple distinct address spaces; address spaces can be collapsed depending on the device's memory system
  – Address spaces: private – private to a work-item; local – local to a work-group; global – accessible by all work-items in all work-groups; constant – read-only global space
  (A minimal host-side sketch of this platform-layer/runtime flow appears at the end of this section.)
(src: Wikipedia, Khronos)

IBM BladeCenter
• QS22 blade:
  – Includes two 3.2 GHz PowerXCell 8i processors
  – 460 SP GFLOPS or 217 DP GFLOPS peak
  – Supports up to 32 GB memory
  – Dual GigE and optional dual-port DDR 4x IB HCA
• BladeCenter H chassis
  – 14 blades in 9U
  – Up to 6.4 SP TFLOPS or 3.0 DP TFLOPS
• Standard 42U rack
  – 56 blades
  – 25.8 SP TFLOPS or 12.18 DP TFLOPS peak
• Top 4 positions on the Green500 with >500 MFLOPS/W

40G Networking for Highest System Utilization – JuRoPA and HPC-FF
• Mellanox end-to-end 40 Gb/s supercomputer connectivity
• Network adaptation: ensures highest efficiency
• Self recovery: ensures highest reliability
• Scalability: the solution for peta/exaflops systems
• On-demand resources: allocation per demand
• Green HPC: lowering system power consumption
• 274.8 TFLOPS at 91.6% efficiency

PETAFLOPS SYSTEMS

Roadrunner
• First supercomputer to reach sustained 1 PFLOPS performance and current #1 on the TOP500
• First hybrid supercomputer (Cell + Opteron)
• System information:
  – 296 racks on a 5,200 sq. ft. footprint
  – 18 connected units with 180 nodes each
  – 1.46 PFLOPS peak, 1.1 PFLOPS Rmax
  – 6,480 AMD Opteron processors
  – 12,960 IBM PowerXCell 8i processors
  – 101 TB memory
  – 216 System x3755 I/O nodes
  – 26 288-port ISR2012 InfiniBand 4x DDR switches
  – 2.5 MW power consumption
• Tri-blade compute node:
  – Two QS22 blades (Cell) and one LS21 blade (Opteron)
  – Cell blades
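The host-side sketch referenced in the OpenCL slide above follows. It walks the platform-layer and runtime steps listed there (query a device, create a context and command queue, build a kernel from source, and launch it). It is illustrative, not from the slides: the kernel name "scale" and the buffer size are invented for the example, and error handling is collapsed into one helper.

    // Minimal OpenCL 1.0 host-side sketch of the platform-layer/runtime flow.
    #include <CL/cl.h>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    static void check(cl_int err, const char* what) {
        if (err != CL_SUCCESS) {
            std::fprintf(stderr, "%s failed (%d)\n", what, err);
            std::exit(1);
        }
    }

    // OpenCL C kernel source (subset of ISO C99 with extensions), built at runtime.
    static const char* kSource =
        "__kernel void scale(__global float* x, float a) {\n"
        "    size_t i = get_global_id(0);\n"
        "    x[i] *= a;\n"
        "}\n";

    int main() {
        cl_int err;

        // Platform layer: query and select a platform and a compute device.
        cl_platform_id platform;
        check(clGetPlatformIDs(1, &platform, NULL), "clGetPlatformIDs");
        cl_device_id device;
        check(clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL),
              "clGetDeviceIDs");

        // Create a compute context and a work-queue (command queue) on the device.
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        check(err, "clCreateContext");
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
        check(err, "clCreateCommandQueue");

        // Online compilation and build of the compute kernel executable.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, NULL, &err);
        check(err, "clCreateProgramWithSource");
        check(clBuildProgram(prog, 1, &device, NULL, NULL, NULL), "clBuildProgram");
        cl_kernel kernel = clCreateKernel(prog, "scale", &err);
        check(err, "clCreateKernel");

        // Global-memory buffer initialized from host memory.
        std::vector<float> host(1024, 1.0f);
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    host.size() * sizeof(float), host.data(), &err);
        check(err, "clCreateBuffer");

        // Runtime API: set arguments and launch one work-item per element.
        float a = 2.0f;
        check(clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf), "clSetKernelArg 0");
        check(clSetKernelArg(kernel, 1, sizeof(float), &a), "clSetKernelArg 1");
        size_t global = host.size();
        check(clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                     0, NULL, NULL), "clEnqueueNDRangeKernel");

        // Blocking read of the result, then release all objects.
        check(clEnqueueReadBuffer(queue, buf, CL_TRUE, 0,
                                  host.size() * sizeof(float), host.data(),
                                  0, NULL, NULL), "clEnqueueReadBuffer");
        std::printf("host[0] = %f\n", host[0]);

        clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
        clReleaseCommandQueue(queue); clReleaseContext(ctx);
        return 0;
    }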