SC08 Workshop on the Path to Exascale: An Overview of Exascale Architecture Challenges

Thomas Sterling (with Peter Kogge, UND)
Arnaud and Edwards Professor of Computer Science, Louisiana State University

Visiting Associate, California Institute of Technology
Distinguished Visiting Scientist, Oak Ridge National Laboratory

November 16, 2008

Exascale

• Exascale = 1,000X the capability of Petascale
• Exascale != Exaflops, but:
  – Exascale at the data center size => Exaflops
  – Exascale at the "rack" size => Petaflops for departmental systems
  – Exascale embedded => Teraflops in a cube
• It took us 14+ years to get here:
  – 1st Petaflops workshop: 1994
  – Through NSF studies, HTMT, HPCS …
  – To get us to Peta now
• Study questions:
  – Can we ride silicon to Exa?
  – What will such systems look like?
  – Where are the challenges?
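
A minimal sketch of what "1,000X at each form factor" means in raw flop rates, using only the mapping given above:

```python
# Illustrative sketch: "exascale" = ~1,000x today's petascale capability at
# each form factor, so only the data-center class is an exaflops machine.
PFLOPS = 1e15

targets = {
    "data center": 1000 * PFLOPS,    # exaflops
    "departmental rack": PFLOPS,     # petaflops
    "embedded cube": PFLOPS / 1000,  # teraflops in a cube
}

for form_factor, flops in targets.items():
    print(f"{form_factor:>18}: {flops:.0e} flop/s")
```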

Courtesy of Peter Kogge, UND

Multi-Core – the next "Moore's Law"

[Chart: Top500-style projections (SUM, N=1, N=500) from 1993 to 2043 on a log scale from 100 Mflops to 10 Zflops, spanning the "…Microprocessor Era" into the "Multicore/Heterogeneous Era…". Courtesy of Thomas Sterling.]

Top Level View – the 6 P's

• Performance – Sustained flops, but informatics too
• Parallelism – At unprecedented scales, requiring low overhead
• Power – Perhaps the #1 constraint
• Price – Practical upper bounds, more severe for products
• Persistence – By this we mean continued operation: reliability, availability, time to repair, …
• Programming – Multicore and heterogeneous already a major challenge

The Exascale Study Group

Name             Affiliation     Name              Affiliation
Keren Bergman    Columbia        Steve Keckler     UT-Austin
Shekhar Borkar   Intel           Dean Klein        Micron
Dan Campbell     GTRI            Peter Kogge       Notre Dame
Bill Carlson     IDA             Bob Lucas         USC/ISI
Bill Dally       Stanford        Mark Richards     Georgia Tech
Monty Denneau    IBM             Al Scarpelli      AFRL
Paul Franzon     NCSU            Steve Scott       Cray
Bill Harrod      DARPA           Allan Snavely     SDSC
Kerry Hill       AFRL            Thomas Sterling   LSU
Jon Hiller       STA             Stan Williams     HP
Sherman Karp     STA             Kathy Yelick      UC-Berkeley

11 academic, 6 non-academic, 5 "government", plus special domain experts
10+ study meetings over the 2nd half of 2007

Courtesy of Peter Kogge, UND

Exaflops Today?

• Performance
  – 1,000,000,000,000,000,000 floating point operations per second, sustained
  – Or ten thousand Tiger Stadiums of laptops
• Parallelism
  – > a billion scalar cores or > a million GPGPUs
  – 100,000+ racks
• Power
  – 5 GVA
• Price
  – $25 billion
• Persistence
  – MTBF of minutes
  – You can't checkpoint
• Programming
  – Like building the pyramids: you spend your life doing one and then it buries you
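
A back-of-envelope sketch of how those figures relate, assuming roughly 1 Gflop/s sustained per scalar core, 10,000 cores and 50 kW per rack, 100 cores per socket, and a 10^6-hour per-socket MTBF (all illustrative assumptions, not study results):

```python
# Back-of-envelope sketch; the per-core rate, rack density, rack power,
# cores per socket, and socket MTBF below are illustrative assumptions.
target_flops = 1e18                      # sustained flop/s

flops_per_core = 1e9                     # ~1 Gflop/s sustained per core (assumed)
cores = target_flops / flops_per_core    # ~1 billion cores

cores_per_rack = 1e4                     # assumed packaging density
racks = cores / cores_per_rack           # ~100,000 racks

watts_per_rack = 50e3                    # ~50 kW delivered per rack (assumed)
print(f"cores : {cores:.0e}")
print(f"racks : {racks:.0e}")
print(f"power : {racks * watts_per_rack / 1e9:.0f} GW (~GVA)")

sockets = cores / 100                    # assumed 100 cores per socket
socket_mtbf_h = 1e6                      # assumed per-socket MTBF (~114 years)
print(f"MTBF  : {socket_mtbf_h / sockets * 60:.0f} minutes")
```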

Architectures Considered

• Evolutionary strawmen
  – "Heavyweight" strawman based on commodity-derived microprocessors
  – "Lightweight" strawman based on custom microprocessors
• Aggressive strawman
  – "Clean sheet of paper" silicon

Courtesy of Peter Kogge, UND

A Modern HPC System

[Pie charts of where a modern HPC system's resources go:
 – Silicon area distribution: memory 86%, random 8%, processors 3%, routers 3%
 – Power distribution: processors 56%, routers 33%, memory 9%, random 2%
 – Board area distribution: white space 50%, processors 24%, memory 10%, routers 8%, random 8%]

Courtesy of Peter Kogge, UND

Evolutionary Scaling Assumptions

• Applications will demand the same DRAM/flops ratio as today
• Ignore any changes needed in disk capacity
• Processor die size will remain constant
• Continued reduction in device area => multi-core chips

• Vdd and max power dissipation will flatten as forecast
  – Thus clock rates are limited as before
• On a per-core basis, microarchitecture will improve from 2 flops/cycle to 4 in 2008, and 8 in 2015
• Max # of sockets per board will double roughly every 5 years
• Max # of boards per rack will increase once, by 33%
• Max power per rack will double every 3 years
• Allow growth in system configuration by 50 racks each year
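
A toy projection under the assumptions above; the 2008 starting point (clock rate, cores, sockets, boards, racks) and the core-doubling rate are illustrative placeholders, not the study's actual parameters:

```python
# Toy peak-flops projection under the evolutionary scaling assumptions above.
# All starting values and growth rates are illustrative placeholders.
def peak_flops(year, base_year=2008):
    t = year - base_year
    clock_hz = 3e9                              # clock rate assumed flat
    flops_per_cycle = 4 if year < 2015 else 8   # microarchitecture improvement
    cores_per_socket = 4 * 2 ** (t / 2)         # core count doubles ~every 2 years
    sockets_per_board = 4 * 2 ** (t / 5)        # sockets/board double every 5 years
    boards_per_rack = 16 * (1.33 if year >= 2015 else 1.0)  # one-time +33%
    racks = 100 + 50 * t                        # +50 racks per year
    return (clock_hz * flops_per_cycle * cores_per_socket *
            sockets_per_board * boards_per_rack * racks)

for y in (2008, 2012, 2016, 2020):
    print(y, f"{peak_flops(y):.2e} flop/s peak")
```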

Courtesy of Peter Kogge, UND

Possible System Power Models

• Simplistic: a highly optimistic model
  – Max power per die grows as per the ITRS
  – Power for memory grows only linearly with the # of chips
    • Power per memory chip remains constant
  – Power for routers and common logic remains constant
    • Regardless of the obvious need to increase bandwidth
  – True if the energy per bit moved/accessed decreases as fast as "flops per second" increases
• Fully Scaled: a pessimistic model
  – Same as Simplistic, except memory & router power grow with peak flops per chip
  – True if the energy per bit moved/accessed remains constant
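
A sketch contrasting the two models; the starting values and growth rates below are placeholders chosen only to show why the Fully Scaled model explodes when per-bit energy does not improve:

```python
# Sketch of the two system power models above (all numbers are placeholders).
# Simplistic: memory power ~ #chips, router power constant.
# Fully scaled: memory and router power grow with peak flops per socket.
def system_power_mw(year, model, base_year=2008):
    t = year - base_year
    peak_growth = 2 ** (t / 1.5)              # assumed peak-flops-per-socket growth
    sockets = 25_000 * 2 ** (t / 3)           # assumed socket-count growth
    mem_chips_per_socket = 16

    socket_w = 150 * min(2 ** (t / 6), 2.0)   # die power grows per ITRS, then flattens
    if model == "simplistic":
        mem_w = 0.5 * mem_chips_per_socket    # constant power per memory chip
        router_w = 30.0                       # constant router/common logic power
    else:  # "fully_scaled"
        mem_w = 0.5 * mem_chips_per_socket * peak_growth
        router_w = 30.0 * peak_growth
    return sockets * (socket_w + mem_w + router_w) / 1e6

for model in ("simplistic", "fully_scaled"):
    print(model, [round(system_power_mw(y, model)) for y in (2008, 2014, 2020)])
```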

Courtesy of Peter Kogge, UND

With All This – What Do We Have to Look Forward To?

[Chart: projected performance (GFlops, log scale from 1.E+03 to 1.E+10) versus year, 1/1/00 through 1/1/20, with reference lines at Peta flops and Exa flops. Series: Top 10 Rmax, Rmax Leading Edge, Rpeak Leading Edge, Evolutionary Heavy Fully Scaled ("Full"), and Evolutionary Heavy Simplistically Scaled ("Simplistic").]

Courtesy of Peter Kogge, UND

And At What Power Level?

[Chart: projected system power (MW, log scale from 1 to 1000) versus year, 2005 through 2020. On top of this we have to add cooling, power conditioning, and secondary memory.]

Courtesy of Peter Kogge, UND

And How Much Parallelism Must be Handled by the Program?

[Chart: total concurrency (log scale, 1.E+00 to 1.E+09) versus year, 1/1/72 through 1/1/20, with guide lines at 1,000, 1 million, and 1 billion operations per cycle. Series: historical Top 10, Top System, Top 1 trend, and heavy node projections.]

Courtesy of Peter Kogge, UND

A "Light Weight" Node Alternative

2 nodes per "compute card." Each node:
• A low-power compute chip
• Some memory chips
• "Nothing else"

The compute chip:
• 2 simple dual-issue cores, each with dual FPUs
• Memory controller
• Large eDRAM L3
• 3D message interface
• Collective interface
• All at a sub-GHz clock

System architecture:
• Multiple identical boards per rack
• Each board holds multiple compute cards
• "Nothing else"

"Packaging the Blue Gene/L supercomputer," IBM J. R&D, March/May 2005
"Blue Gene/L compute chip: Synthesis, timing, and physical design," IBM J. R&D, March/May 2005

Courtesy of Peter Kogge, UND

With Similar Assumptions:

[Chart: projected performance (GFlops, log scale from 1.E+03 to 1.E+10) versus year, 1/1/00 through 1/1/20, with reference lines at Peta flops and Exa flops and the Exascale Goal. Series: Top 10 Rmax, Rmax Leading Edge, Rpeak Leading Edge, Evolutionary Light Fully Scaled ("Full"), and Evolutionary Light Simplistically Scaled ("Simplistic"). Correlates well with BG/P numbers.]

Courtesy of Peter Kogge, UND

And At What Power Level?

[Chart: projected system power (MW, 0 to 150) versus year, 2008 through 2020, for the lightweight node alternative.]

Courtesy of Peter Kogge, UND

And How Much Parallelism Must be Derived from the Program?

[Chart: total concurrency (log scale, 1.E+00 to 1.E+10) versus year, 1/1/72 through 1/1/20, with guide lines at 1,000, 1 million, and 1 billion operations per cycle. Series: historical Top 10, Top System, Top 1 trend, and light node simplistically scaled.]

Courtesy of Peter Kogge, UND

1 EFlop/s "Clean Sheet of Paper" Strawman

Sizing done by “balancing” power budgets with achievable capabilities

• 4 FPUs + register files per core (= 6 GF @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s)
• 213 MB of L1 I&D; 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s); 384 nodes/rack
• 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
• 166 MILLION cores
• 680 MILLION FPUs
• 3.6 PB = 0.0036 bytes/flops
• 68 MW with aggressive assumptions

Largely due to Bill Dally
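
A quick arithmetic roll-up of the strawman above (the 16 GB/node figure is taken as decimal gigabytes):

```python
# Arithmetic roll-up of the aggressive strawman sizing on the slide above.
GHZ             = 1.5e9
fpus_per_core   = 4
cores_per_chip  = 742
nodes_per_group = 12
groups_per_rack = 32
racks           = 583
dram_per_node_b = 16e9                                  # 16 GB per node

chip_flops  = fpus_per_core * GHZ * cores_per_chip      # ~4.5 Tflop/s
group_flops = chip_flops * nodes_per_group              # ~54 Tflop/s
rack_flops  = group_flops * groups_per_rack             # ~1.7 Pflop/s
sys_flops   = rack_flops * racks                        # ~1 Eflop/s

nodes  = nodes_per_group * groups_per_rack * racks      # 223,872 nodes
cores  = cores_per_chip * nodes                         # ~166 million
memory = dram_per_node_b * nodes                        # ~3.6 PB

print(f"system peak : {sys_flops:.2e} flop/s")
print(f"cores       : {cores / 1e6:.0f} million")
print(f"memory      : {memory / 1e15:.1f} PB "
      f"= {memory / sys_flops:.4f} bytes per flop")
```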

Courtesy of Peter Kogge, UND

A Single Node (No Router)

[Figure: (a) quilt packaging, (b) through-via chip stack; pie chart of node power: FPU 21%, leakage 28%, register file 11%, off-chip 13%, with the remainder split among cache access, on-chip transport, and DRAM access (20%, 6%, 1%).]

Node characteristics:
• 742 cores; 4 FPUs/core
• 16 GB DRAM
• 290 Watts
• 1.08 TF peak @ 1.5 GHz
• ~3000 flops per cycle

Courtesy of Peter Kogge, UND

1 Eflops Data Center Power Distribution

[Pie charts for the 1 Eflops data-center strawman: (a) overall system power, split among processor, memory (DRAM), interconnect/network, and disk; (b) memory power; (c) interconnect power, including router and L1/L2/L3/DRAM access.]

• 12 nodes per group
• 32 groups per rack
• 583 racks
• 1 EFlops / 3.6 PB
• 166 million cores
• 67 MWatts

Courtesy of Peter Kogge, UND

Data Center Performance Projections

[Chart: data center performance projections (GFlops, log scale from 1.E+04 to 1.E+10) versus year, 1/1/04 through 1/1/20; series: Exascale, Lightweight, Heavyweight. But not at 20 MW!]

Courtesy of Peter Kogge, UND

Power Efficiency

[Chart: power efficiency (Rmax Gflops per watt, log scale from 1.E-03 to 1.E+02) versus year, 1/1/92 through 1/1/20, with the 1 Eflops @ 20 MW goal marked. Series: historical Top 10, Top System trend line, Exascale goal, aggressive strawman, light node simplistic, heavy node simplistic, light node fully scaled, heavy node fully scaled.]
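
The goal line on this chart is a two-line calculation, using the 20 MW target and the strawman's 68 MW estimate from the earlier slide:

```python
# Energy-efficiency gap implied by the chart above.
exaflops   = 1e18
goal_w     = 20e6                       # 1 Eflop/s in a 20 MW envelope
strawman_w = 68e6                       # aggressive strawman estimate

print(f"goal     : {exaflops / goal_w / 1e9:.0f} Gflop/s per watt")      # 50
print(f"strawman : {exaflops / strawman_w / 1e9:.1f} Gflop/s per watt")  # ~14.7
```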

Courtesy of Peter Kogge, UND

Data Center Total Concurrency

[Chart: data center total concurrency (log scale, 1.E+03 to 1.E+10) versus year, 1/1/96 through 1/1/20, with guide lines at thousand-way, million-way, and billion-way concurrency. Series: historical Top 10, Top System, Top 1 trend, the Exa strawman, and the evolutionary light and heavy node projections.]
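
Where billion-way comes from, before any latency hiding or overprovisioning (so a lower bound):

```python
# Operations that must be in flight every cycle to sustain 1 Eflop/s
# at the strawman's 1.5 GHz clock.
peak_flops = 1e18
clock_hz   = 1.5e9

print(f"~{peak_flops / clock_hz:.1e} operations per cycle")  # ~6.7e8: billion-way
```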

Courtesy of Peter Kogge, UND

Two Strategies for Exaflops

• Structure
  – Low-overhead mechanisms for parallel execution management
  – Heterogeneous support for:
    • In-memory computing
    • High-data-reuse streaming
  – Rebalance of resources to emphasize precious technology elements
• Execution model
  – Expose and exploit diversity of parallelism granularity, form, and scale
  – Intrinsic latency hiding
  – Dynamic adaptive resource management

Real-World Directed Graph Applications

• Molecular dynamics
  – Helps us understand the structure of proteins
  – Simulations can involve large numbers of atoms (100,000+)
  – n-body problem
• Astrophysics
  – Evolution of galaxies
  – n-body problem
• Pharmaceutical modeling
  – Predicting the behavior of drugs before synthesizing and testing them
  – Shortens drug development times
  – Analyzing the interactions among molecules used to manufacture pharmaceuticals
• Data mining
  – Financial risk assessment
    • Black-Scholes PDEs
• And many more…

[Image: galaxy simulation on a 512-processor partition of the IBM p690 supercomputer; ~20 TB of data produced by the simulation; 10 billion particles. Source: http://www.mpa-garching.mpg.de/]

Courtesy of Chirag Dekate, LSU

Dynamic Graph Processing For Informatics

• Run-time scheduling of directed graphs
  – Sub-domains of graphs:
    • Search-space problems
    • Planning
  – Data-driven:
    • Progress towards the global objective of the application
    • Local value
    • Global status of the problem
  – Evaluation value across the leaf nodes of the current stage of execution in the graph
  – Inverse inheritance from children:
    • Captures feedback from the children of the current node
    • Recursive search
• Resource contention
  – Local and global resources such as bandwidth and memory are key constraining factors
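
A hypothetical sketch of the idea, not the project's actual scheduler: frontier nodes are expanded in order of an inherited estimate, and leaf evaluations feed back ("inverse inheritance") to their ancestors. Node names and the scoring rule are invented for illustration.

```python
# Hypothetical sketch: data-driven expansion of a directed graph, with leaf
# values propagated back to ancestors ("inverse inheritance").
import heapq

def schedule(root, expand, evaluate, budget=100):
    """Expand the most promising frontier node until the budget runs out."""
    parent = {root: None}
    best = {root: 0.0}
    frontier = [(-best[root], 0, root)]       # max-heap via negated score
    counter = 1                               # tie-breaker so nodes never compare
    while frontier and budget > 0:
        _, _, node = heapq.heappop(frontier)
        budget -= 1
        children = expand(node)
        if not children:                      # leaf: evaluate, then feed back
            value = evaluate(node)
            best[node] = value
            up = parent[node]
            while up is not None and value > best[up]:
                best[up], up = value, parent[up]    # inverse inheritance
            continue
        for child in children:
            parent[child] = node
            best[child] = best[node]          # inherit the current estimate
            heapq.heappush(frontier, (-best[child], counter, child))
            counter += 1
    return best

# Tiny usage example on a two-level graph with string node names.
graph = {"root": ["a", "b"], "a": [], "b": []}
print(schedule("root", graph.get, lambda leaf: len(leaf), budget=10))
```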

Courtesy of Chirag Dekate, LSU

Chess Game Engine

• The game of chess has several attributes that make it the core of the experiments to be carried out
  – The goal is NOT to build the best chess engine
• The key aim is to help design and develop the data-directed semantics of the execution model and the associated scheduling policies, to improve the scalability of graph problems
• The minimax algorithm with alpha-beta pruning will be tested as the first workload (a minimal sketch follows)

Images from: http://www.pressibus.org/ataxx/autre/minimax
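
A minimal sketch of that first workload in negamax form; the move generator, move application, and evaluation function are stubs to be supplied by the engine, and the toy usage at the end is only there to show the call shape.

```python
# Minimal negamax formulation of minimax with alpha-beta pruning.
def alphabeta(state, depth, alpha, beta, moves, apply_move, evaluate):
    """Return the best achievable score for the side to move."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    best = float("-inf")
    for move in legal:
        child = apply_move(state, move)
        score = -alphabeta(child, depth - 1, -beta, -alpha,
                           moves, apply_move, evaluate)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:       # prune: the opponent will avoid this branch
            break
    return best

# Toy usage: a "game" whose state is a list of numbers, a move removes one
# number, and the evaluation is the sum left on the board.
state0 = [3, 1, 4]
moves = lambda s: list(range(len(s)))
apply_move = lambda s, i: s[:i] + s[i + 1:]
evaluate = lambda s: sum(s)
print(alphabeta(state0, 2, float("-inf"), float("inf"),
                moves, apply_move, evaluate))
```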

Courtesy of Chirag Dekate, LSU

ParalleX Execution Model

(1) Hardware node – guarantees efficient local compound atomic operations
(2) Main memory – within a locality, all thread processing elements can access it
(3) Heterogeneous accelerators – FPGA, ClearSpeed, etc.
(4) Processes – spanning multiple localities; processes can overlap the nodes of their physical domain with other processes
(5) PX thread – the primary executing abstraction; ephemeral
(6) Parcels – cause the initiation of remote actions such as thread creation, suspension, and termination; enable asynchronous operation of the parallel system
(7) Local control objects (LCOs) – lightweight synchronization entities, e.g. simple mutexes, semaphore support, dataflow, futures; a dataflow-template-based LCO can instantiate a function as a single thread
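
A toy illustration of two of these components, not an actual ParalleX runtime: a parcel that spawns an ephemeral thread at a remote locality, and a future-style LCO that suspends the consumer until the value arrives. Class and method names are invented for this sketch.

```python
# Toy sketch of parcel-driven remote thread creation and a future-style LCO.
import queue
import threading

class FutureLCO:
    """Lightweight synchronization object: consumers block until it is set."""
    def __init__(self):
        self._event = threading.Event()
        self._value = None
    def set(self, value):
        self._value = value
        self._event.set()
    def get(self):
        self._event.wait()
        return self._value

class Locality:
    """A 'locality' that turns each incoming parcel into an ephemeral thread."""
    def __init__(self):
        self.inbox = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()
    def send_parcel(self, action, *args):
        self.inbox.put((action, args))          # parcel = action + payload
    def _serve(self):
        while True:
            action, args = self.inbox.get()
            threading.Thread(target=action, args=args).start()

if __name__ == "__main__":
    remote = Locality()
    result = FutureLCO()
    remote.send_parcel(lambda lco, x: lco.set(x * x), result, 21)
    print(result.get())        # 441: the consumer resumes when the LCO fires
```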

Courtesy of Chirag Dekate, LSU

ParalleX Execution Model

(A) Percolation – offloading of system and core algorithmic tasks to accelerators
(B) Pending thread queue
(C) Suspended thread queue
(D) Terminator function
(F) Suspended thread
(G) Process creation that provides the context for, and starts, an initial thread; "Main" is the global process that establishes the nodes for a given …
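
A hypothetical sketch of the thread-state bookkeeping implied by (B)–(D): pending threads wait for a worker, suspended threads wait on an event, and finished threads are handed to a terminator. Everything here is invented for illustration.

```python
# Hypothetical sketch of pending/suspended thread queues and a terminator.
from collections import deque

class ThreadScheduler:
    def __init__(self):
        self.pending = deque()        # (B) runnable, awaiting a core
        self.suspended = {}           # (C) blocked, keyed by awaited event
        self.retired = []             # (D) handed to the terminator

    def spawn(self, name):
        self.pending.append(name)

    def run_one(self):
        if self.pending:
            self.retired.append(self.pending.popleft())

    def suspend(self, name, event):
        self.suspended.setdefault(event, []).append(name)

    def signal(self, event):
        # Event fired: move all waiters back to the pending queue.
        for name in self.suspended.pop(event, []):
            self.pending.append(name)

sched = ThreadScheduler()
sched.spawn("t1"); sched.suspend("t2", "data_ready")
sched.signal("data_ready")           # t2 becomes runnable again
sched.run_one(); sched.run_one()
print(sched.retired)                 # ['t1', 't2']
```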

Courtesy of Chirag Dekate, LSU

Directed Graph Processing Using ParalleX

Courtesy of Chirag Dekate, LSU

X-Caliber Module

[Block diagram of the X-Caliber module at three levels. PIM group / memory-IO level: node pairs linked by bit junctions and fat-tree switches with fat-tree controllers, local parcel buses with stage-2 routers (three 2x2 routers), a shared functional unit, a timing/configuration/master power controller, a shared external memory interface, and a shared port. PIM element level: thread manager, active structure buffer, wide ALU, register file, parcel handler, memory manager, and an embedded DRAM macro (1 of 32), with power/ground rails and a global register file. Chip level.]

Conclusions

• Developing Exascale systems will be really tough
  – In any time frame, for any of the 3 classes
• An evolutionary progression is at best 2020-ish
  – With limited memory
• 4 key challenge areas:
  – Power
  – Concurrency
  – Memory capacity
  – Resiliency
• A new execution model is required
  – Unify heterogeneous multicore systems
  – Programmability
  – Address the challenges of heterogeneous multicore computing
• New architectures are essential to address overhead, latency, and starvation
  – Memory accelerators are a critical form of multicore heterogeneity
    • Respond to variation in temporal locality
    • Address the memory wall
