Invited keynote presentation at ISC 2009 – Hamburg

HPC Achievement and Impact – 2009: A Personal Perspective

Prof. Thomas Sterling (with Maciej Brodowicz & Chirag Dekate)
Arnaud & Edwards Professor, Department of Computer Science
Adjunct Faculty, Department of Electrical and Computer Engineering
Louisiana State University

Visiting Associate, California Institute of Technology
Distinguished Visiting Scientist, Oak Ridge National Laboratory
CSRI Fellow, Sandia National Laboratories
June 24, 2009

HPC Year in Review

• A continuing tradition at ISC (6th year, and still going at it)
• Previous years' themes:
  – 2004: "Constructive Continuity"
  – 2005: "High Density Computing"
  – 2006: "Multicore to Petaflops"
  – 2007: "Multicore: the Next Moore's Law"
  – 2008: "Run-Up to Petaflops"
• This year's theme: "Year 1 A.P. (after Petaflops)"
• As always, a personal perspective – how I've seen it
  – Highlights – the big picture
    • But not all the nitty details, sorry
  – Necessarily biased, but not intentionally so
  – Iron-oriented, but software too
  – Trends and implications for the future
• And a continuing predictor: the Canonical HEC Computer
  – Based on average and leading predictors

Trends in Highlight

• Year 1 after Petaflops (1 A.P.)
• Applying Petaflops Roadrunner & Jaguar to computational challenges
• Deploying Petaflops systems around the world, starting with Jugene
• Programming multicore to save Moore's Law
  – Quad core dominates mainstream processor architectures
  – TBB, Cilk, Concert, & ParalleX
• Harnessing GPUs for a quantum step in performance
  – NVIDIA Tesla
  – ClearSpeed
• Emerging new applications in science and informatics
• Commodity clusters ubiquitous
  – Linux, MPI, Xeon, & Ethernet dominant
  – InfiniBand increasing interconnect market share
• MPPs dominate the high end with lower power, higher density
• Clock rates near flat in the 2–3 GHz range with some outliers
• Preparing for exascale hardware and software

ADVANCEMENTS IN PROCESSOR ARCHITECTURES

AMD Istanbul

• Six cores per die
• 904 million transistors on 346 mm² (45 nm SOI)
• Support for x8 ECC memory
• HyperTransport 3.0 at 2.4 GHz
• HT Assist to minimize cache probe traffic (4-socket systems and above)
• Remote power management interface (APML)
• 75 W power envelope
• Operating frequency up to 2.6 GHz

Intel Dunnington

• Up to six cores per die (based on Core 2)
• 1,900 million transistors on 504 mm² (45 nm)
• 16 MB shared L3, 3×3 MB unified L2, 96 KB L1D
• 1066 MT/s FSB
• Power dissipation: 50–130 W (depending on core count and clock)
• Operating frequency up to 2.66 GHz
• mPGA604 socket

Intel Nehalem (Core i7)

• 2, 4, 6 or 8 cores per die
• 731 million transistors on 265 mm² in quad-core version (45 nm)
• Per core: 2–3 MB shared L3, 256 KB L2, 32+32 KB L1
• Triple-channel integrated DDR3 memory controller (up to 25.6 GB/s bandwidth)
• QuickPath Interconnect up to 6.4 GT/s
• Hyperthreading (2 threads per core)
• Turbo Boost Technology
• Second-level branch predictor and TLB

IBM PowerXCell 8i

• 1 PPE + 8 SPEs
• 250 million transistors on 212 mm² (65 nm SOI)
• SPE modifications:
  – 12.8 double-precision GFLOPS peak
  – Fully pipelined, dual-issue DP FPU with (reduced) 9-cycle latency
  – Improved IEEE FP compliance (denormals, NaNs)
  – ISA modifications (DP compare)
• PPE: no major changes
• Revamped memory controller:
  – Support for up to 4 DDR2 DIMMs at 800 MHz
  – Preserves max. 25.6 GB/s bandwidth
  – I/O pin count increased to 837
• Clock speed 3.2 GHz
• Max. power dissipation 92 W

Cilk++

• Cilk++ is a simple set of extensions for C++ and a powerful runtime system for multicore applications
  – Work-queue based task model with a work-stealing scheduler
    • One queue per core; if a queue becomes empty, the core 'steals' work from a neighbor
  – The runtime system enables an application to run on an arbitrary number of cores
  – 3 new keywords, implemented using a precompiler (see the sketch below):
    • cilk_spawn: spawn a parallel task
    • cilk_sync: synchronize with all running tasks
    • cilk_for: execute loop body in parallel
  – Special C++ data types allow for lock-free and race-free collective operations on shared data: cilk::hyper_ptr<>
• Cilk++ toolset: compiler, debugger and race detector
• Available on most 32/64-bit Linux systems and Windows
(Charles Leiserson)
src: Cilk Arts
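To make the keyword list above concrete, here is a minimal sketch in the Cilk++ dialect of C++. It is an illustration only; the exact headers and program entry conventions of the Cilk Arts toolchain may differ from what is assumed here.

```cpp
#include <cstdio>

// Recursive fork-join parallelism: each cilk_spawn creates a task that an
// idle core may steal from this core's work queue.
long fib(int n)
{
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  // child task, may run on another core
    long y = fib(n - 2);             // parent keeps working meanwhile
    cilk_sync;                       // wait for all spawned children
    return x + y;
}

int main()
{
    // Data-parallel loop: iterations are divided among however many cores
    // the runtime system finds at launch time.
    cilk_for (int i = 0; i < 16; ++i) {
        std::printf("fib(%d) = %ld\n", i, fib(i));
    }
    return 0;
}
```

The same binary scales from one core to many because the mapping of tasks to cores is decided at run time by the work-stealing scheduler, not at compile time.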

Intel TBB

• C++ library implementing task-based parallelism for multicore systems
  – Relies on the programmer to express explicit parallelism
  – Work-queue based task model with a work-stealing scheduler, but it's possible to implement a customized scheduler
  – Extending concepts of the C++ Standard Template Library (STL)
    • Generic algorithms (see the sketch below): parallel_for, parallel_reduce, pipeline, spawn_and_wait_for_all, etc.
    • Generic data structures: concurrent_hash_map, concurrent_vector, concurrent_queue, etc.
    • Concurrent memory allocators, platform-independent atomic operations
  – Excellent integration with the C++ language (lambdas, atomics, memory consistency model, rvalue references, concepts)
• Available on most 32/64-bit Linux systems, Windows, Mac OS X, but easily portable; GPL'd code
src: Intel, rtime.com
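A minimal sketch of the generic-algorithm style listed above, assuming a TBB release with the lambda-friendly overloads (older releases also require an explicit tbb::task_scheduler_init object):

```cpp
#include <cstdio>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/parallel_reduce.h>

int main()
{
    std::vector<double> v(1000000, 1.0);

    // parallel_for: the range is recursively split into chunks that the
    // work-stealing scheduler hands out to worker threads.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= 2.0;
        });

    // parallel_reduce: each chunk computes a partial sum; partial results
    // are combined pairwise with the join functor (here, addition).
    double sum = tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, v.size()), 0.0,
        [&](const tbb::blocked_range<size_t>& r, double acc) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                acc += v[i];
            return acc;
        },
        [](double a, double b) { return a + b; });

    std::printf("sum = %f\n", sum);
    return 0;
}
```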

Sun Microsystems Rock

• 16 cores per die, arranged in 4-core clusters
• 32 threads plus 32 scout threads
• 64-bit SPARC V9 instruction set
• 396 mm² die on 65 nm process
• 2.3 GHz clock
• Each core cluster has:
  – 32 KB instruction cache + 8 KB predecoded state
  – Two 32 KB data caches w/ pseudo-random replacement
  – Two fully-pipelined FPUs
• 2 MB 4-bank L2 cache
• 4 memory interface units support 128 requests in flight
• No out-of-order execution
• Scout threads speculatively prefetch code and data during cache misses
• Large instruction windows
• Hardware support for transactional memory (chkpt and commit instructions)
• Dissipates approximately 10 W per core

AMD FireStream 9270

• Native support for double-precision floating point
• 260 mm² die on 55 nm CMOS
• 10 SIMD cores, each:
  – With 80 32-bit stream processing units
  – Has its own control logic
  – Has 4 dedicated texture units and L1 cache
  – Communicates with other cores via 16 KB global data share
• Supports a total of 16,384 shader threads
• 64 AA resolve units, 64 Z/stencil units and 40 texture units
• Peak performance: 1.2 SP TFLOPS, 240 DP GFLOPS
• 750 MHz clock
• 115.2 GB/s memory bandwidth
• 160 W per board typical, <220 W peak
• Up to 2 GB GDDR5 SDRAM

NVIDIA Tesla T10

• Native support for double-precision floating point
• 1,400 million transistors on 470 mm² die (55 nm) for G200b revision
• 240 shader cores, 80 texture units, 32 ROPs
• Clock up to 1.44 GHz
• 933 SP GFLOPS, 78 DP GFLOPS peak @ 1.3 GHz
• 512-bit GDDR3 memory interface at 800 MHz
• 102 GB/s memory bandwidth per GPU
• <200 W per single-processor board (160 W typical)
• Products:
  – C1060 accelerator board
  – S1070 quad-GPU system

OpenCL: The Open Standard for Heterogeneous Parallel Programming

• OpenCL (Open Computing Language): a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors
• C-based cross-platform programming interface
  – Subset of ISO C99 with language extensions
  – Well-defined numerical accuracy: IEEE 754 rounding behavior with specified maximum error
  – Online or offline compilation and build of compute kernel executables
• OpenCL Platform Model: 1 host + 1 or more compute devices
• Platform Layer API
  – A hardware abstraction layer over diverse computational resources
  – Query, select and initialize compute devices
  – Create compute contexts and work-queues
• Runtime API
  – Execute compute kernels
  – Manage scheduling, compute, and memory resources
• Memory model
  – Shared memory model with relaxed consistency
  – Multiple distinct address spaces; address spaces can be collapsed depending on the device's memory system
  – Address spaces: private (private to a work-item), local (local to a work-group), global (accessible by all work-items in all work-groups), constant (read-only global space)
src: Wikipedia, Khronos
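A minimal OpenCL 1.x host-side sketch showing the platform layer (device query, context and command queue) followed by the runtime layer (online kernel build and an ND-range launch). Error checking is omitted for brevity.

```cpp
#include <cstdio>
#include <vector>
#include <CL/cl.h>

// A trivial kernel, compiled online from source by the OpenCL runtime.
static const char* kSrc =
    "__kernel void scale(__global float* x, float a) {"
    "    size_t i = get_global_id(0);"
    "    x[i] *= a;"
    "}";

int main()
{
    // Platform layer: discover a device, create a context and a work-queue.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    // Runtime layer: build the kernel and launch it over an ND-range.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

    std::vector<float> host(1024, 1.0f);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                host.size() * sizeof(float), host.data(), NULL);
    float a = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &a);

    size_t global = host.size();  // one work-item per array element
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0,
                        host.size() * sizeof(float), host.data(), 0, NULL, NULL);
    std::printf("host[0] = %f\n", host[0]);

    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```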

IBM BladeCenter

• QS22 blade:
  – Includes two 3.2 GHz PowerXCell 8i processors
  – 460 SP GFLOPS or 217 DP GFLOPS peak
  – Supports up to 32 GB memory
  – Dual GigE and optional dual-port DDR 4x IB HCA
• BladeCenter H chassis
  – 14 blades in 9U
  – Up to 6.4 SP TFLOPS or 3.0 DP TFLOPS
• Standard 42U rack
  – 56 blades
  – 25.8 SP TFLOPS or 12.18 DP TFLOPS peak
• Top 4 positions on the Green500 with >500 MFLOPS/W

40G Networking for Highest System Utilization – JuRoPA and HPC-FF: Mellanox End-to-End 40Gb/s Connectivity

• Network Adaptation: ensures highest efficiency

• Self Recovery: ensures highest reliability

• Scalability: the solution for Peta/Exa systems

• On-demand resources: allocation per demand
• Green HPC: lowering system power consumption

274.8 TFLOPS at 91.6% efficiency

Petaflops Systems

Roadrunner

• First supercomputer to reach sustained 1 PFLOPS performance and current #1 on the TOP500
• First hybrid supercomputer (Cell + Opteron)
• System information:
  – 296 racks on a 5,200 sq. ft. footprint
  – 18 connected units with 180 nodes each

  – 1.46 PFLOPS peak, 1.1 PFLOPS Rmax
  – 6,480 AMD Opteron processors
  – 12,960 IBM PowerXCell 8i processors
  – 101 TB memory
  – 216 System x3755 I/O nodes
  – 26 288-port ISR2012 InfiniBand 4x DDR switches
  – 2.5 MW power consumption
• Tri-blade compute node:
  – Two QS22 blades (Cell) and one LS21 blade (Opteron)
  – Cell blades host four 3.2 GHz processors with an aggregate peak of 435.2 DP GFLOPS
  – Opteron blade contains 2 dual-core 1.8 GHz Opteron 2210s delivering a peak of 14.4 GFLOPS
  – Every Opteron core and Cell processor uses 4 GB memory (32 GB per node)
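A back-of-envelope check of the tri-blade figures quoted above, assuming the commonly cited 12.8 DP GFLOPS per SPE and roughly 6.4 DP GFLOPS for the PPE:

$$
\begin{aligned}
\text{PowerXCell 8i:}\quad & 8 \times 12.8 + 6.4 \approx 108.8\ \text{DP GFLOPS}\\
\text{Two QS22 blades:}\quad & 4 \times 108.8 = 435.2\ \text{DP GFLOPS}\\
\text{LS21 blade:}\quad & 2\ \text{sockets} \times 2\ \text{cores} \times 1.8\ \text{GHz} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} = 14.4\ \text{GFLOPS}
\end{aligned}
$$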

ORNL Jaguar (Cray XT5)

• Second sustained PFLOPS machine (#2 on the TOP500)
• System overview:
  – Cray XT4 and XT5 nodes over shared network (SION)
  – 284 cabinets in 5,800 sq. ft. of floor space

  – 1.38 PFLOPS peak, 1.06 PFLOPS Rmax (XT5 only)
  – 26,604 compute nodes with over 181,000 cores
  – 362 TB memory
  – Cray SeaStar2+ interconnect configured as a 3-D torus
  – 374 TB/s interconnect bandwidth (XT5)
  – 10 PB of RAID6 storage with 240 GB/s and 44 GB/s bandwidth respectively from 214 XT5 and 116 XT4 I/O nodes
  – Power: 6.95 MW, liquid cooling (XT5)
• XT5 node specifications:
  – Two quad-core 2356 Barcelona Opterons at 2.3 GHz
  – Peak performance of 73.6 GFLOPS
  – 16 GB ECC DDR2-800 SDRAM
  – 6-port SeaStar2+ ASIC, 9.6 GB/s per port
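The quoted node peak follows directly from the Barcelona core's 4 double-precision FLOPs per clock:

$$
2\ \text{sockets} \times 4\ \text{cores} \times 2.3\ \text{GHz} \times 4\ \tfrac{\text{FLOP}}{\text{cycle}} = 73.6\ \text{GFLOPS per XT5 node}
$$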

Petascale Systems Enabling Science

• ORNL-led team wins the Gordon Bell Prize for peak performance using Jaguar
  – "New Algorithm to Enable 400+ TFlop/s Sustained Performance in Simulations of Disorder Effects in High-Tc Superconductors"
  – Achieved 1.352 petaflops on ORNL's Cray XT Jaguar supercomputer with a simulation of superconductors, or materials that conduct electricity without resistance
  – By modifying the algorithms and software design of its DCA++ code to maximize speed without sacrificing accuracy, the team was able to boost performance tenfold
  – The team's simulation made efficient use of 150,000 of Jaguar's 180,000-plus processing cores to explore electrical conductance
• Jaguar is helping researchers to understand the power of radio waves in heating and controlling the plasma in ITER, a prototype fusion reactor
  – The team, led by Yong Xiao and Zhihong Lin of UCI, used 93 percent of the NCCS's flagship supercomputer Jaguar, a Cray XT4, with the classic fusion code GTC (Gyrokinetic Toroidal Code)
  – One run can produce 60 terabytes of data; the GTC code ran with a total of 28,000 cores smoothly for two days
(Chart: peak performance of Gordon Bell Prize winning applications)

src: Wikipedia, ORNL, HPCwire

Petascale Systems Enabling Science

• Researchers performed the largest simulation ever of the dark matter cloud holding our galaxy together
  – Researchers hope to understand how stars more than 10 times the mass of our sun die; such deaths are the dominant source of elements in the universe
  – The team has used Jaguar to simulate the supernova up to about 100 milliseconds after the shock wave begins, and they hope to reach half a second
• Another astrophysics project on Jaguar is led by Piero Madau at the University of California, Santa Cruz
  – Will use 5 million processor-hours to explore the invisible halo of dark matter that surrounds the Milky Way
  – Madau's simulations, the largest ever performed for the Milky Way, will divide the halo into 30 billion parcels of dark matter and will simulate their evolution over 13 billion years
• Jackie Chen of Sandia National Laboratories is using 30 million processor-hours on Jaguar to simulate the combustion process of alternative fuels, like biofuel and ethanol

src: ORNL, HPCwire, Wired

FZJ JUGENE (IBM BG/P)

• Most powerful BG/P system (#3 on the TOP500)
• System information:
  – 72 racks with a 130 m² footprint
  – 73,728 compute nodes with 294,912 processors

  – 1.0 PFLOPS peak, 825.5 TFLOPS Rmax
  – 144 TB memory
  – Multiple interconnects: 3-D torus, scalable collective network and fast barrier network
  – 600 I/O nodes
  – 1 PB of storage at 16 GB/s
  – Power: 2.5 MW (35 kW per rack)
• Node specification:
  – 4 PowerPC 450 cores at 850 MHz
  – Dual DP FPUs per core
  – 2 GB memory
  – 32-bit mode operation

INITIATIVES AROUND THE WORLD

Germany: FZJ JuRoPA (Bull Cluster)

• #10 in the current TOP500 list (2nd machine from Germany in the top 10)
• System information:
  – Bull cluster, 3,288 compute nodes:
    • 2 Intel Xeon X5570 (Nehalem-EP) quad-core processors (2.93 GHz), SMT (Simultaneous Multithreading)
    • 24 GB memory (DDR3, 1066 MHz)
    • IB QDR HCA (via QNEM Network Express Module)
  – 26,304 cores total
  – 80,806 GB memory
  – 308.28 TFLOPS peak and 274.8 TFLOPS Rmax
  – Sun Blade 6048 system
  – InfiniBand QDR with non-blocking fat tree topology
  – Sun Data Center Switch 648

Saudi Arabia: KAUST (Shaheen)

• Fastest machine in the Middle East, #14 on the TOP500 list
• System information:
  – 65,536 cores
  – 222.82 TFLOPS Rpeak, 185.17 TFLOPS Rmax
  – 504 kW
  – 16 racks of Blue Gene/P
• Future:
  – To turn its facility into a petascale system, potentially hosting 500 Blue Gene/P racks
  – Once petascale size is reached, the Shaheen facility plans to move toward exaflop scale

United Kingdom

• ECMWF IBM Power6 575
  – Processor: IBM Power6, 4.7 GHz
  – 8,320 cores per cluster
  – Rmax: 115.9 TFLOPS, Rpeak: 156.42 TFLOPS
  – InfiniBand interconnect

  – 32 cores per node, 64 GB per node
  – 4 MB L2 cache per core, 32 MB L3 cache shared per two cores
  – QLogic InfiniBand switch
  – 248 compute nodes, 12 I/O nodes
• 25th and 26th on the current TOP500 list
• Main application area: weather and climate research
(One of the two IBM POWER5+ Cluster 1600 systems installed at ECMWF)

France: GENCI-CINES & IDRIS

• Jade: SGI Altix ICE 8200EX
  – Ranked 20th on the TOP500
  – Xeon quad-core 3.0 GHz
  – 49,152 GB total memory
  – Rmax: 128.4 TFLOPS, Rpeak: 146.7 TFLOPS
  – InfiniBand interconnect
  – Intel EM64T Xeon E54xx (Harpertown) 3000 MHz (12 GFLOPS), quad core
  – 12,288 total cores
• IDRIS IBM BG/P
  – Ranked 24th on the TOP500
  – Rmax: 116 TFLOPS, Rpeak: 139.3 TFLOPS

Earth Simulator 2 (NEC)

• Most powerful Japanese supercomputer (#22 on the TOP500)
• National project, developed by:
  – National Space Development Agency of Japan (NASDA)
  – Japan Atomic Energy Research Institute (JAERI)
  – Japan Marine Science and Technology Center (JAMSTEC)
• System information:
  – 80 cabinets housed in the Earth Simulator Building (50 m × 65 m × 17 m)
  – 160 SX-9/E processing nodes (PNs) with 1,280 processors

  – 131 TFLOPS peak, 122.4 TFLOPS Rmax
  – 20 TB memory
  – Two-tier fat tree interconnect, 10 TB/s bandwidth
  – 500 TB data storage system (job staging)
  – 1.5 PB RAID6 storage server reachable via SAN
  – Max. power consumption: 3,000 kVA
• Node specification:
  – Eight 3.2 GHz CPUs, each with:
    • 4-way superscalar unit
    • Vector unit with 72 registers, 256 elements each
    • 8 sets of six types of vector pipelines (multiply, add, logic, divide, mask, load/store)
    • Peak performance of 102.4 GFLOPS
    • Typical power dissipation of 240 W
    • Single-chip implementation on 417 mm², 65 nm CMOS
  – 819 GFLOPS peak per node
  – 128 GB shared memory per node

Top Machine in China

• Dawning 5000A @ Shanghai Supercomputing Center – largest supercomputer in China
• Also the largest Windows HPC Server 2008 based cluster on the TOP500 list
• System characteristics:
  – 180.6 TFLOPS (Rmax), 233.5 TFLOPS (Rpeak)
  – 7,680 1.9 GHz AMD Opteron quad-core processors (30,720 total cores)
  – Total system memory: 122.88 TB
  – InfiniBand interconnect network
  – 700 kW
  – Cost 200 million yuan (29 million U.S. dollars)
• Target applications:
  – Weather forecasting
  – Construction of seabed tunnels
  – Environmental protection
  – Large passenger aircraft production
  – Earthquake predictions
(Dawning 5000A at its inaugural ceremony)

• 2010: Petaflops supercomputer in China
• Dawning 5000L series
  – Possible CPU: Loongson 3 (Godson), 16 cores and shared 8 MB L2 cache
  – 64-bit processor developed at the Institute of Computing Technology, CAS
  – Chief architect: Professor Weiwu Hu
  – Currently Linux based

Src: Top500, Wikipedia, Chinese Academy of Sciences

Loongson (Godson)

• Loongson 1
  – Pure 32-bit CPU running at a clock speed of 266 MHz
  – Targeted towards embedded designs such as cash registers
• Loongson 2 (Godson 2)
  – 64-bit, running at 500 MHz to 1 GHz, the latest Godson 2F being produced at 1.2 GHz
  – Released to market in early 2008
  – KD-50-I is the first Chinese-built supercomputer to utilize domestic Chinese CPUs, with a total of more than 330 Loongson-2F CPUs
  – The cost was less than RMB 800,000 (approximately USD $120,000, EUR €80,000)
• Loongson 3
  – 65 nm, clock speed between 1.0 and 1.2 GHz
  – 4 CPU cores (10 W) first and 8 cores later (20 W), rumored 16-core version
  – Expected to debut in 2010
  – First version of the chip will only support DDR2 DRAM, and will not have SMT support or a built-in network interface

src: Wikipedia, www.most.gov.cn

In Recognition of: Leadership and Achievement

Key HPC Awards in 2008

• Steve Wallach – Cray Award: for contribution to high-performance computing through design of innovative vector and parallel computing systems, notably the Convex mini-supercomputer series, a distinguished industrial career and acts of public service.
• Bill Gropp – Fernbach Award: for outstanding contributions to the development of domain decomposition algorithms, scalable tools for the parallel numerical solution of PDEs, and the dominant HPC communications interface.

The ACM - IEEE CS Ken Kennedy Award

• The first presentation of this award will be in November 2009
• Awarded annually during the ACM/IEEE SC Conference
• Recognizes substantial contributions to programmability and productivity in computing and substantial community service or mentoring contributions
• The award includes a $5,000 honorarium

Gordon Bell Prize

• The ACM Gordon Bell Prize for Peak Performance was awarded to the ORNL & Cray team, led by Thomas Schulthess:
  – Gonzalo Alvarez, Michael S. Summers, Don E. Maxwell, Markus Eisenbach, Jeremy S. Meredith, Thomas A. Maier, Paul R. Kent, Eduardo D'Azevedo and Thomas C. Schulthess (all of Oak Ridge National Laboratory), Jeffrey M. Larkin and John M. Levesque (both of Cray, Inc.)
• "New Algorithm to Enable 400+ TFlop/s Sustained Performance in Simulations of Disorder Effects in High-Tc Superconductors"
• Achieved 1.352 PFLOPS while simulating materials, known as superconductors, that have potential applications for power transmission; superconducting magnets have been used extensively in magnetic resonance imaging and magnetic levitation transportation systems
• What are the size effects and scaling laws of fracture of disordered materials?
• What are the signatures of approach to failure?
• What is the relation between toughness and crack surface roughness?
Src: ORNL, ACM

Gordon Bell Special Prize

• Gordon Bell Special Prize:
  – The ACM Gordon Bell Prize, in a special recognition for algorithmic innovation, was presented to Lin-Wang Wang, Byounghak Lee, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier and David Bailey of Lawrence Berkeley National Laboratory for their work in "Linear Scaling Divide-and-Conquer Electronic Structure Calculations for Thousand Atom Nanostructures."

(Figure: A test run of LS3DF, which took one hour on 17,000 processors of Franklin, performed electronic structure calculations for a 3500-atom ZnTeO alloy. Isosurface plots (yellow) show the electron wavefunction squares for the bottom of the conduction band (left) and the top of the oxygen-induced band (right). The small grey dots are Zn atoms, the blue dots are Te atoms, and the red dots are oxygen atoms.)
(Lin-Wang Wang of Berkeley Lab's Computational Research Division)
Src: ACM, LBL

Exascale – The FINAL Frontier

"Exascale" Strategic Challenges

• Sustained performance – 1000× today's best in show
• Parallelism
  – 100s of millions of cores
  – Multi-billion-way task concurrency
• Reliability – single-point-failure MTBF of seconds to minutes
• Power – ~100 megawatts
• Programmability
• Cost

Z-barrier

• Supercomputing will never achieve Zettaflops
  – Limit: 32 Exaflops to 128 Exaflops
• Call 64 Exaflops the "Sterling Point"
• Challenge to the creativity of the community
  – Using Boolean logic and binary encoding
• Factors:
  – Speed of light
  – Atomic granularity
  – Boltzmann's constant
  – Overhead work

Exelerating towards 2020 (12 A.P.)

• DARPA Exascale Studies
  – Last year: technology
  – This year: system software & resiliency
• DOE Exascale Studies
  – >8 workshops on application domains and systems
• NSF Exascale Point Design Study
  – First in-depth study of one possible system concept
  – Both hardware architecture & system software
  – Transitional and transformative programming models

DARPA Exascale Technology Study Projections

(Chart: projected performance in GFLOPS, log scale from 10^4 to 10^10, versus time from 2004 to 2020, for Heavyweight and Lightweight node designs against the Exascale level; annotation: "But not at 20 MW!")

Courtesy of Peter Kogge, UND

Power Efficiency

(Chart: Rmax GFLOPS per watt, log scale, versus time from 1992 to 2020; the Exascale goal point marks 1 EFLOPS @ 20 MW. Series: Historical, Top 10, Top System Trend Line, Exascale Goal, Aggressive Strawman, Light Node Simplistic, Heavy Node Simplistic, Light Node Fully Scaled, Heavy Node Fully Scaled.)
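For reference, the Exascale goal marked on the chart corresponds to an efficiency of

$$
\frac{10^{18}\ \text{FLOPS}}{20\ \text{MW}} = \frac{10^{18}\ \text{FLOPS}}{2 \times 10^{7}\ \text{W}} = 50\ \text{GFLOPS/W},
$$

roughly two orders of magnitude beyond the most efficient 2009 systems (the Cell-based Green500 leaders at just over 0.5 GFLOPS/W quoted earlier).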

Courtesy of Peter Kogge, UND

Exascale Concurrency

(Chart: total concurrency, log scale spanning thousand-way, million-way and billion-way concurrency, versus time from 1996 to 2020. Series: Top 10, Top System, Top 1 Trend, Historical, Exa Strawman, Evolutionary Light Node, Evolutionary Heavy Node.)

Courtesy of Peter Kogge, UND

1 EFlop/s "Clean Sheet of Paper" Strawman

Sizing done by “balancing” power budgets with achievable capabilities

• 4 FPUs + RegFiles per core (= 6 GF @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s)
• 213 MB of L1 I&D; 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s)
• 384 nodes / rack
• 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
• 166 MILLION cores
• 680 MILLION FPUs
• 3.6 PB memory = 0.0036 bytes/flop
• 68 MW with aggressive assumptions
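The strawman numbers above compose as a simple back-of-envelope product (rounding as in the original study):

$$
\begin{aligned}
\text{Core:}\quad & 4\ \text{FPUs} \times 1.5\ \text{GHz} = 6\ \text{GFLOPS}\\
\text{Chip:}\quad & 742 \times 6\ \text{GF} \approx 4.5\ \text{TF/s}\\
\text{Group:}\quad & 12 \times 4.5\ \text{TF/s} \approx 54\ \text{TF/s}\\
\text{Rack:}\quad & 32 \times 54\ \text{TF/s} \approx 1.7\ \text{PF/s}\\
\text{System:}\quad & 583 \times 1.7\ \text{PF/s} \approx 1\ \text{EF/s}\\
\text{Cores:}\quad & 583 \times 384 \times 742 \approx 166\ \text{million}\\
\text{Memory:}\quad & 583 \times 384 \times 16\ \text{GB} \approx 3.6\ \text{PB} \;\Rightarrow\; 3.6\ \text{PB} / 1\ \text{EF/s} \approx 0.0036\ \text{bytes/FLOP}
\end{aligned}
$$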

Largely due to Bill Dally

Courtesy of Peter Kogge, UND

Emerging New World Model

• Global Address Space
  – Direct access to system-wide memory
• Multiple-threaded execution
  – Work-queue model for high utilization
• Message-driven
  – Advances on Active Messages
• Lightweight synchronization
  – Breaking the barrier
• Runtime systems
  – User level, in support of dynamic adaptive resource management
• Self-aware system management
  – Power control
  – Fault tolerance
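A toy C++ sketch (not ParalleX or any real runtime; the Locality type and its send() method are hypothetical) illustrating two of the ideas above: message-driven execution, where work is shipped as a parcel to the locality that owns the data, and lightweight synchronization, where the consumer waits on individual futures instead of a global barrier.

```cpp
#include <cstdio>
#include <future>
#include <utility>
#include <vector>

// Hypothetical "locality": owns one slice of a logically global array and
// runs any parcel of work sent to it (here via std::async for simplicity;
// a real runtime would use persistent workers fed from a work queue).
struct Locality {
    std::vector<double> data;

    explicit Locality(std::size_t n) : data(n, 1.0) {}

    // Message-driven execution: the work moves to the data, not vice versa.
    template <typename F>
    auto send(F work) -> std::future<decltype(work(data))> {
        return std::async(std::launch::async, std::move(work), std::ref(data));
    }
};

int main()
{
    Locality loc0(1 << 20), loc1(1 << 20);

    // Two parcels dispatched to the localities that own the operands.
    auto partial = [](std::vector<double>& d) {
        double s = 0.0;
        for (double x : d) s += x;
        return s;
    };
    std::future<double> f0 = loc0.send(partial);
    std::future<double> f1 = loc1.send(partial);

    // Lightweight synchronization: block only on the two values we need,
    // rather than on a barrier across all workers.
    std::printf("global sum = %f\n", f0.get() + f1.get());
    return 0;
}
```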

Possible New Exascale Computing Model

Closing Remarks

The Canonical HPC System – 2009

• Commodity cluster
  – Dominant class of HEC system
  – But for top-end performance, MPPs
• Intel Xeon E54xx Harpertown
  – By far the most deployed machines
  – But: IBM PowerPC 450 has slightly more deployed cores
• Quad core
  – 8,192 cores
  – 2,048 sockets
• HP systems integrator
  – Top vendor
  – IBM a major second, and also dominates in overall performance
• 45.2 TFLOPS Rmax performance
  – Average performance across the TOP500 list
  – #91 on the TOP500 list
• 8 TB main memory
  – 1 GB per processor core
• InfiniBand interconnect
  – Ethernet for system administration and maintenance
• MPICH2
  – OpenMP gaining in interest to address multicore
  – MPI-3 forum has begun its work
• Linux
  – #1 OS
• Power consumption: 384 kW
• Industry owned and run

HPC in 2008 – 2009: Summary Remarks

• Year 1 after Petaflops (1 A.P.)
• Applying Roadrunner & Jaguar to the world's biggest computational challenges at >1 Petaflops
• Getting ready to deploy Petaflops systems around the world
• Programming multicore to save Moore's Law
• Harnessing GPUs for a quantum step in performance
• Preparing for exascale hardware and software
• Emerging new applications in science and informatics
