Living at the Top of the Top500: Myopia from Being at the Bleeding Edge

Bronson Messer

Oak Ridge Leadership Computing Facility & Theoretical Astrophysics Group, Oak Ridge National Laboratory

Department of Physics & Astronomy, University of Tennessee

Outline

• Statements made without proof
• OLCF's Center for Accelerated Application Readiness
• Speculations on task-based approaches for multiphysics applications in astrophysics (e.g. blowing up stars)


Riffing on Hank's fable...


(Slide shows the first page of the paper being riffed on:)

The Effects of Moore's Law and Slacking on Large Computations
Chris Gottbrath, Jeremy Bailin, Casey Meakin, Todd Thompson, J.J. Charfman
Steward Observatory, University of Arizona
arXiv:astro-ph/9912202v1, 9 Dec 1999

Abstract: We show that, in the context of Moore's Law, overall productivity can be increased for large enough computations by "slacking" or waiting for some period of time before purchasing a computer and beginning the calculation.

Figure 1: According to Moore's Law, the computational power available at a particular price doubles every 18 months. Therefore it is conceivable that, for sufficiently large numerical calculations and fixed budgets, computing power will improve quickly enough that the calculation will finish faster if we wait until the available computing power is sufficiently better and then start the calculation. This is illustrated in the plot above. Work is measured in units of whatever the current machine can accomplish in one month, and time is measured in months.

Footnote: "This paper took 2 days to write …"
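To make the fable quantitative, here is a short derivation consistent with the paper's setup (mine, not quoted from it). Let a job need L months of work on today's machine, and let machine speed double every 18 months while we slack for w months before starting:

```latex
T(w) = w + L\,2^{-w/18}, \qquad
\frac{dT}{dw} = 1 - \frac{\ln 2}{18}\,L\,2^{-w/18} = 0
\;\Rightarrow\;
w^{*} = 18\,\log_2\!\left(\frac{L\ln 2}{18}\right).
```

Slacking only pays (w* > 0) for jobs needing more than 18/ln 2 ≈ 26 current-machine months of work; shorter jobs should just start now.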

Some realities

• The future is now: if you go from Franklin to Hopper at the same size, you lose.

NERSC-6 Grace "Hopper" (Cray XE6)
– Performance: 1.2 PF peak, 1.05 PF HPL (#5)
– Processor: AMD MagnyCours, 2.1 GHz, 12-core, 8.4 GFLOPs/core; 24 cores/node; 32-64 GB DDR3-1333 per node
– System: Gemini interconnect (3D torus); 6,392 nodes; 153,408 total cores
– I/O: 2 PB disk space; 70 GB/s peak I/O bandwidth

Franklin (Cray XT4)
– 9,572 compute nodes; 38,288 compute cores
– One quad-core 2.3 GHz AMD Opteron processor (Budapest) per node; 4 processor cores per node
– 8 GB of memory per node; 78 TB of aggregate memory; 1.8 GB memory/core for applications
– Light-weight Cray Linux operating system; no runtime dynamic, shared-object libs
– /scratch disk default quota of 750 GB
– PGI, Cray, Pathscale, GNU compilers
– "Use Franklin for all your computing jobs, except those that need a full Linux operating system."
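One way to read the slide's numbers (my arithmetic, not on the slide): both chips retire 4 double-precision flops per cycle per core, so at a fixed core count the per-core peak actually drops in the move.

```latex
\text{Hopper (MagnyCours):}\; 2.1~\text{GHz} \times 4~\tfrac{\text{flops}}{\text{cycle}} = 8.4~\text{GF/core (matches the slide)},
\qquad
\text{Franklin (Budapest):}\; 2.3~\text{GHz} \times 4 = 9.2~\text{GF/core},
\qquad
1 - \tfrac{8.4}{9.2} \approx 9\%.
```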

Some realities

• If you use primarily IBM platforms, you have a bit longer.
  – scp+make on Blue Waters will likely give you a speedup.
  – BG/P --> BG/Q brings an increased clock, and you probably aren't engaging the Double Hummer now anyway.


Some realities

• It doesn't matter whether you are going to use GPU-based machines or not:
  – GPUs [CUDA, OpenCL, directives]
  – FPUs on Power [xlf, etc.]
  – Cell [SPE]
  – SSE/AVX; MIC (Knights Ferry, Knights Corner) [?]

• Exposing the maximum amount of node-level parallelism and increasing data locality are the only ways to get performance from any of these things.


ORNL's "Titan" System Goals

• Similar number of cabinets, cabinet design, and cooling as Jaguar
• Operating system upgrade of today's Linux operating system
• Gemini interconnect
  – 3-D torus
  – Globally addressable memory
  – Advanced synchronization features
• AMD Opteron 6200 processor (Interlagos)
• New accelerated node design using NVIDIA multi-core accelerators
• 10-20 PF peak performance
  – Performance based on available funds
• Larger memory – more than 2x more memory per node


Cray XK6 Compute Node

Characteristics:
• AMD Opteron 6200 Interlagos 16-core processor
• Tesla X2090 @ 665 GF
• Host memory: 16 or 32 GB, 1600 MHz DDR3
• Tesla X2090 memory: 6 GB GDDR5 capacity
• Gemini high-speed interconnect (X, Y, Z links); HT3; PCIe Gen2
• Upgradeable to NVIDIA's Kepler many-core processor

Slide courtesy of Cray, Inc.


OLCF-3 Applications Requirements

Developed by surveying the science community.

Science Driver Survey
• Developed in consultation with 50+ leading scientists in many domains
• Key questions:
  – Science driver: What science will be pursued on this system, and how is it different (in fidelity/quality/predictability and/or productivity/throughput) from the current system?
  – Science impact: What are the science goals, and does OLCF-3 enable them? What might the impact be if this improved science result occurs? Who cares, and why?
  – Science timeliness: What does it matter if this result is delivered in the 2012 timeframe? If this result is delivered in the 2010 timeframe, what does it matter as opposed to coming 5 years later (or never at all)? What other programs, agencies, stakeholders, and/or facilities are dependent upon the timely delivery of this result, and why?

OLCF Application Requirements Document
• Elicited, analyzed, and validated using a new comprehensive requirements questionnaire
• Covers project overview, science motivation and impact, application models, algorithms, parallelization strategy, software, development process, SQA, V&V, usage workflow, and performance
• Results, analysis, and conclusions documented in the 2009 OLCF application requirements document


OLCF-3 Applications Analyzed

Science outcomes were elicited from a broad range of applications.

Science area (application codes): science target
• Astrophysics (Chimera, GenASiS): Core-collapse supernovae simulation; validation against observations of neutrino signatures, gravitational waves, and photon spectra
• Astrophysics (MPA-FT, MAESTRO): Full-star Type Ia supernovae simulations of thermonuclear runaway with realistic subgrid models
• Bioenergy (LAMMPS, GROMACS): Cellulosic ethanol: dynamics of microbial enzyme action on biomass
• Biology (LAMMPS): Systems biology; genomic structure
• Chemistry (CP2K, CPMD): Interfacial chemistry
• Chemistry (GAMESS): Atmospheric aerosol chemistry; fuels from lignocellulosic materials
• Combustion (S3D): Combustion flame front stability and propagation in power and propulsion engines
• Combustion (RAPTOR): Internal combustion design in power and propulsion engines: bridging the gap between device- and lab-scale combustion
• Energy storage (MADNESS): Electrochemical processes at the interfaces; ionic diffusion during charge-discharge cycles


OLCF-3 Applications Analyzed

Science outcomes were elicited from a broad range of applications.

• Fusion (GTC): Energetic particle turbulence and transport in ITER
• Fusion (GTS): Electron dynamics and magnetic perturbation (finite-beta) effects in a global code environment for realistic tokamak transport; improved understanding of confinement physics in tokamak experiments; address issues such as the formations of plasma critical gradients and transport barriers
• Fusion (XGC1): First-principles gyrokinetic particle simulation of multiscale electromagnetic turbulence in whole-volume ITER plasmas with realistic diverted geometry
• Fusion (AORSA, CQL3D): Tokamak plasma heating and control; finite ion-orbit-width effects to realistically assess RF wave resonant ion interactions
• Fusion (FSP): Global gyrokinetic studies of core turbulence encompassing local & nonlocal phenomena and electromagnetic electron dynamics; MHD scaling to realistic Reynolds numbers
• Fusion (GYRO, TGYRO): Predictive simulations of transport iterated to bring the plasma into steady-state power balance; radial transport balances power input; multi-scale integrated electromagnetic turbulence in realistic ITER geometry
• Geoscience (PFLOTRAN): Stability and viability of large-scale CO2 sequestration; predictive contaminant ground water transport


OLCF-3 Applications Analyzed

Science outcomes were elicited from a broad range of applications.

• Nanoscience (DCA++): Effect of impurity configurations on pairing and the high-T superconducting gap; high-T superconducting transition temperature materials dependence in cuprates; magnetic/superconducting phase diagrams including effects of disorder
• Nanoscience (WL-LSMS): To what extent do thermodynamics and kinetics of magnetic transition and chemical reactions differ between nano and bulk? What is the role of material disorder, statistics, and fluctuations in nanoscale materials and systems?
• Nanoscience (OMEN): Electron-lattice interactions and energy loss in full nanoscale transistors
• Nanoscience (LS3DF): Full device simulation of a nanostructure solar cell
• Nuclear energy (Denovo): Predicting, with UQ, the behavior of existing and novel nuclear fuels and reactors in transient and nominal operation
• Nuclear energy (UNIQ): Reduce uncertainties and biases in reactor design calculations by replacing existing multi-level homogenization techniques with more direct solution methods
• Nuclear physics (NUCCOR): Limits of nuclear stability, static and transport properties of nucleonic matter
• Nuclear physics (MFDn): Predict half-lives, mass and kinetic energy distribution of fission fragments, and fission cross sections
• QCD (MILC, Chroma): Achieving high precision in determining the fundamental parameters of the Standard Model (masses and mixing strengths of quarks)
• Turbulence (DNS): Stratified and unstratified turbulent mixing at simultaneous high Reynolds and Schmidt numbers
• Turbulence (Hybrid): Nonlinear turbulence phenomena in multi-physics settings


Evaluation Criteria for Selection of Six Representative Applications

Task: description
• Science: Science results, impact, timeliness; alignment with DOE and U.S. science mission (CD-0); broad coverage of science domains
• Implementation (models, algorithms, languages, software): Broad coverage of relevant programming models, environments, implementations; broad coverage of relevant algorithms and data structures (motifs); broad coverage of scientific library requirements
• User community (current and anticipated): Broad institutional and developer/user involvement; good representation of current and anticipated INCITE workload
• Preparation for steady-state ("INCITE ready") operations: Mix of low ("straightforward") and high ("hard") risk porting and readiness requirements; availability of OLCF liaison with adequate skills/experience match to application; availability of key code development personnel to engage in and guide readiness activities


Center for Accelerated Application Readiness

• WL-LSMS: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems. (Biofuels image: an atomistic model of cellulose (blue) surrounded by lignin molecules comprising a total of 3.3 million atoms.)
• LAMMPS: Biofuels and biology at the atomistic scale.
• CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
• S3D: Understanding turbulent combustion through direct numerical simulation with complex chemistry.
• Denovo: Discrete ordinates radiation transport calculations that can be used in a variety of nuclear energy and technology applications.
• PFLOTRAN: Stability and viability of large-scale CO2 sequestration; predictive containment groundwater transport.

CAAR apps will form the vanguard of "day-one" science on OLCF-3, but additional science teams will be granted friendly-user access as well (cf. our Petascale Early Science Period). Call for proposals will be forthcoming this summer.


CAAR Application Summary

CAM-HOMME
• Spectral finite element method
• High leverage in physics packages
• Scalable dynamical core of choice for future CCSM
• Hard rating: Low compute intensity and high data movement in physics kernels

S3D
• DNS of combustion processes for specific fuels
• Compressible Navier-Stokes flow solver for the full mass, momentum, energy, and species conservation equations, with structured grid, written in F90
• Moderate rating: Complex rate equations, thermodynamics, and transport properties modules; no compute libraries used

LAMMPS
• Critical to development of alternative energy sources, including second-generation cellulosic ethanol
• Easily broken up into components available to other MD codes
• Broad open-community MD code owned by a DOE national laboratory: large user and developer groups
• Moderate rating: Data non-locality due to calculation of long-range Coulomb force (common to all MD codes) – these changes will be made available as a library


CAAR Application Summary (continued)

gWL-LSMS
• Enables first-principles studies of magnetic materials with broad relevance to DOE energy mission
• Uses a workhorse approach (F77/90, C++, MPI) common to many applications
• Straightforward rating: Main kernel based on dense linear algebra of complex numbers (LAPACK, CULA, MAGMA)

Denovo
• Key application for neutron transport and power distribution prediction in nuclear reactor cores
• Moderate rating: Huge potential for exploiting untapped concurrency along the "energy dimension" helps the port, while heavy use of C++ and advanced programming models will tax the GPU software and tool environment

PFLOTRAN
• Full-featured finite element application with both structured and unstructured versions, written in F90
• PETSc solver technology used extensively
• Hard rating: Data non-locality caused by implicit nonlinear PDE solutions with indirect addressing, and data movement caused by AMR (via SAMRAI)
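To make the "straightforward" library-based path concrete, here is a minimal sketch (mine, not from the CAAR work) of what a first GPU port of a gWL-LSMS-style kernel can look like: hand the complex double-precision matrix multiply to cuBLAS and leave the rest of the code alone. The function name and square-matrix assumption are illustrative only.

```cuda
// Minimal sketch: offload a complex double-precision GEMM to the GPU
// with cuBLAS. Error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

// C = A*B, all n x n, column-major (LAPACK convention)
void zgemm_gpu(int n, const cuDoubleComplex* hA,
               const cuDoubleComplex* hB, cuDoubleComplex* hC)
{
    size_t bytes = sizeof(cuDoubleComplex) * n * n;
    cuDoubleComplex *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    // Host -> device copies; in production these are overlapped or the
    // matrices are kept resident on the device between calls.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

CULA and MAGMA offer comparable ZGEMM routines, which is why the slides treat them as interchangeable options; keeping operands resident on the device between calls is what makes the approach pay.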


CAAR Algorithmic Coverage

(Matrix of the six codes (S3D, CAM, LSMS, LAMMPS, Denovo, PFLOTRAN) against algorithmic motifs: structured grids, unstructured grids, dense linear algebra, sparse linear algebra, FFT, particles, and Monte Carlo; PFLOTRAN's structured-grid entry is via AMR.)

• Selected applications represented the bulk of use for 6 INCITE allocations totaling 212M cpu-hours (2009)
  – Represented 35% of 2009 INCITE allocations
  – 23% of 2010 INCITE allocations (in cpu-hours)


App | Science area | Algorithm(s) | Grid type | Math libraries | Communication libraries | Compiler(s) | Programming language(s)

• CAM-HOMME: climate; spectral finite elements, dense & sparse linear algebra, particles; structured; Trilinos; MPI; PGI, Lahey, IBM; F90
• LAMMPS: biology/materials; molecular dynamics, FFT, particles; N/A; FFTW; MPI; GNU, PGI, IBM, Intel; C++
• S3D: combustion; Navier-Stokes, finite diff, dense & sparse linear algebra, particles; structured; none; MPI; PGI; F77, F90
• Denovo: nuclear energy; wavefront sweep, GMRES; structured; Trilinos, LAPACK, SuperLU, Metis; MPI; GNU, PGI, Cray, Intel; Fortran, C++, Python
• WL-LSMS: nanoscience; density functional theory, Monte Carlo; N/A; LAPACK (ZGEMM, ZGTRF, ZGTRS); MPI; PGI, GNU; F77, F90, C, C++
• PFLOTRAN: geoscience; Richards' equation coupled to transport and chemistry, finite-volume hydrodynamics; AMR; BLAS, PETSc, SAMRAI; MPI; PGI, GNU; F90

• Algorithm and implementation coverage extends applicability well beyond the science domains immediately represented
• Much of the development work will also be pushed out to broader communities (e.g., in use of ChemKin)


Tactics

• Comprehensive team assigned to each app
  – OLCF – application lead
  – Cray – engineer
  – NVIDIA – developer
  – Other: other application developers, local tool/library developers
• Particular plan-of-attack different for each app
  – WL-LSMS – dependent on accelerated ZGEMM
  – CAM-HOMME – pervasive and widespread custom acceleration required
• Multiple acceleration methods explored
  – WL-LSMS – CULA, MAGMA, custom ZGEMM
  – CAM-HOMME – CUDA, PGI directives
• Two-fold aim
  – Maximum acceleration for model problem
  – Determination of optimal, reproducible acceleration path for other applications
• Constant monitoring of progress
  – Status of each app discussed weekly


Application Teams

Each team pairs an OLCF application lead with science & application partners plus Cray and NVIDIA/tools engineers:

• S3D: Ramanan Sankaran (OLCF); Ray Grout (NREL); John Levesque (Cray); Gregory Ruetsch (NVIDIA)
• WL-LSMS: Markus Eisenbach (OLCF); Yang Wang (PSC), Aurelian Rusanu (ORNL/UTK); Jeff Larkin, Adrian Tate (Cray); Massimiliano Fatica, Peng Wang (NVIDIA)
• CAM-HOMME: Matt Norman, Jim Hack, Kate Evans, Rick Archibald, Oscar Hernandez (ORNL); Mark Taylor (SNL); John Dennis, JF Lamarque (NCAR); Jim Rosinski (NOAA); Ilene Carpenter (NREL); Jeff Larkin (Cray); Paulius Micikevicius (NVIDIA)
• LAMMPS: Arnold Tharrington (OLCF); Steve Plimpton, Paul Crozier (SNL); Mike Brown (ORNL); Axel Kohlmeyer (Temple); Sarah Anderson (Cray); Peng Wang, Scott Le Grande (NVIDIA); Mike Brown (OLCF)
• Denovo: Wayne Joubert (OLCF); Tom Evans, Chris Baker (ORNL); Kevin Thomas (Cray); Cyril Zeller, John Roberts (NVIDIA)
• PFLOTRAN: Rebecca Hartmann-Baker (ORNL); Bobby Philip; Peter Lichtner (LANL); Nathan Wichmann (Cray); Peng Wang (NVIDIA)


Complications

• All of the chosen apps are under constant development
  – Groups have, in many cases, already begun to explore GPU acceleration "on their own."
• Production-level tools, compilers, libraries, etc. are just beginning to become available
  – Multiple paths are available, with multifarious trade-offs:
    • ease-of-use
    • (potential) portability
    • performance


What Are We Trying First?

• WL-LSMS: primarily library-based
• S3D: directives and CUDA
• LAMMPS: CUDA
• CAM-SE: CUDA Fortran & directives
• Denovo: CUDA
• PFLOTRAN: directives


Hierarchical Parallelism

• MPI parallelism between nodes (or PGAS)
• On-node, SMP-like parallelism via threads (or subcommunicators, or…)
• Vector parallelism
  – SSE/AVX on CPUs
  – GPU threaded parallelism

• Exposure of unrealized parallelism is essential to exploit all near-future architectures.
• Uncovering unrealized parallelism and improving data locality improves the performance of even CPU-only code.
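A toy sketch (mine, not from the talk) showing all three levels at once: MPI ranks across nodes, a grid of CUDA thread blocks standing in for on-node parallelism, and per-thread element updates as the vector level. The axpy kernel is a stand-in for real physics.

```cuda
// Toy illustration of hierarchical parallelism: MPI between nodes,
// CUDA thread blocks (SMP-like) and threads (vector-like) within a node.
// Compile the .cu with nvcc and link against the system's MPI.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void axpy(int n, double a, const double* x, double* y)
{
    // Each GPU thread handles one element: the "vector" level.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);                 // Level 1: MPI between nodes
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int nlocal = 1 << 20;             // this rank's share of the grid
    double *x, *y;
    cudaMalloc(&x, nlocal * sizeof(double));
    cudaMalloc(&y, nlocal * sizeof(double));
    cudaMemset(x, 0, nlocal * sizeof(double));   // zero-fill stands in for
    cudaMemset(y, 0, nlocal * sizeof(double));   // real initial data

    // Levels 2-3: blocks of threads on the node's accelerator
    int threads = 256;
    int blocks  = (nlocal + threads - 1) / threads;
    axpy<<<blocks, threads>>>(nlocal, 2.0, x, y);
    cudaDeviceSynchronize();

    // Halo exchanges or reductions between ranks would go here via MPI.
    printf("rank %d of %d done\n", rank, nranks);

    cudaFree(x); cudaFree(y);
    MPI_Finalize();
    return 0;
}
```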


Some Lessons Learned

• Exposure of unrealized parallelism is essential.
  – Figuring out where is often straightforward
  – Making changes to exploit it is hard work (made easier by better tools)
  – Developers can quickly learn, e.g., CUDA and put it to effective use
  – A directives-based approach offers a straightforward path to portable performance
• For those codes that already make effective use of scientific libraries, the possibility of continued use is important.
  – HW-aware choices
  – Help (or, at least, no hindrance) to overlapping computation with device communication
• Ensuring that changes are communicated back and remain in the production "trunk" is every bit as important as we initially thought.
  – Other development work taking place on all CAAR codes could quickly make acceleration changes obsolete/broken otherwise
• How much effort is this demanding?
  – All 6 CAAR teams have converged (independently) to 2 ± 0.5 FTE-years

PANEL REPORT: NUCLEAR ASTROPHYSICS

Three-Dimensional Turbulent Convection in Massive Stars

The art of modeling stars is intimately entwined with a diverse range of astrophysical problems, from stellar population studies and galactic chemical evolution modeling, to the mechanisms underlying supernovae and gamma-ray burst explosions. These diverse topics are bound together, as almost all of this work relies on the ability to accurately model an individual star. Understanding an individual star is deeply connected to the degree to which the turbulent motions that take place deep in the stellar interior are understood. Currently, the exponential growth in computing technology, which is expressed concisely by Moore's law, is beginning a new era of sophistication in the ability to model plasma dynamics within stars. The numerical experiments made possible by the availability of ever-larger pools of computing resources are providing stronger constraints on researchers' theories. These numerical experiments are also inspiring breakthroughs in understanding stellar mixing and combustion processes, the evolution and structure of stars, and the contribution of these stars to the chemical evolution of the universe.

(Figure: The turbulent convective flow in a core-collapse supernova progenitor star is shown for a large angular domain (120° × 120°), three-dimensional simulation (Meakin 2008). The mass fraction of 32S is visualized to provide a sense of the mixing topology and the complex, multiscale, turbulent nature of the flow field. Material with a high mass fraction of 32S is being entrained into the turbulent oxygen burning shell from the underlying silicon- and sulfur-rich core of a 23 solar mass star. This illustrates a mixing process that is not included in the standard treatment used to model stars today but that has a significant impact on the resulting structure of the massive star at the time of core collapse. Source: Meakin (2008). Image courtesy of Casey Meakin (University of Arizona).)

Developing three-dimensional supernova progenitor models involves a large range in both spatial and temporal scales. While the end state of a massive star depends upon the entire prior evolution of the star since formation, a three-dimensional simulation that begins at core silicon burning and is evolved up to core collapse would provide a significantly improved level of confidence about the state of the iron core at collapse, including the rotational state and the convectively induced perturbations. Such a three-dimensional stellar model would be used directly as an input to core-collapse supernova simulations. Spatially, capturing global asymmetries will require simulating a volume that extends to the outer edge of the carbon burning zone (Meakin and Arnett 2006). Thus, successively larger shells surrounding the core, silicon, oxygen, neon, and carbon burning will need to be included. Angular …

Roadmap (Stellar Evolution: The Sun and Other Stars), along a capability axis from 10x Tera to Exa flop-year sustained:
• 1st-gen 3D AGB model including convective red giant envelope
• 3D AGB model with convective envelope and resolved turbulent boundary layer mixing between active burning layers and envelope
• 3D AGB model including the effects of rotation on the global circulation and mass mixing properties
• 1st-gen 2D supernova progenitor including all dynamically active layers
• 3D supernova progenitor including all dynamically active layers
• 3D supernova progenitors from the end of silicon core burning up to core collapse
• 3D supernova progenitors including iron core and overlying silicon burning shell up to core collapse, including the effects of rotation
• Global circulation solar model with tachocline and surface granulation
• Global solar circulation model with resolved, turbulent tachocline

Source: Scientific Grand Challenges: Forefront Questions in Nuclear Science and the Role of Computing at the Extreme Scale, p. 56.


… these data and managing and analyzing them after they are written are as important to producing meaningful science through supernova simulation as is any algorithmic or implementation improvement that might be made for the computational step itself.

Four-Dimensional Core-Collapse Supernovae Simulations

Decades of core-collapse supernovae simulations have led to a challenging proposition for the eventual description of the explosion mechanism and the associated phenomenology. It is now known that three-dimensional hydrodynamics in general relativistic gravity must be coupled to nuclear kinetics capable of accurate estimation of energy release and compositional changes, and to spectral neutrino transport of all flavors of neutrinos (requiring a fourth, neutrino energy, dimension) to reliably determine the nature of the explosion mechanism. Moreover, the inclusion of neutrino masses and oscillations, as well as magnetic fields, both of which may be important effects, makes the problem even more difficult. All of these effects, familiar in some sense from the earlier life of the star, operate on very short (millisecond) timescales and at extremes in density (as high as three to four times nuclear matter density) and neutron richness (the ratio of protons to neutrons in the matter can be several times smaller than what is accessible in terrestrial experiments). The rich interplay of all these physical phenomena in core-collapse supernovae ultimately leads to the realization that only through simulation will scientists be able to fully understand these massive stellar explosions. The relative complexity of the germane physics also means the requisite physical fidelity for these simulations will only be realized at the extreme scale.

(Figure: Rendering of the matter entropy in the core of a massive star at approximately 100 ms following core bounce. Multidimensional fluid effects, coupled to heating and cooling via neutrino emission and absorption, have served to form large, asymmetric structures in the flow. Much of this complicated flow pattern occurs between the surface of the nascent neutron star at the center of the star and the position of the original supernova shock (demarcated by the faint blue circle near the edge of the figure). These first four-dimensional simulations have already consumed tens of millions of central processing unit hours on leadership computing platforms and will require tens of millions of additional central processing unit hours to follow the evolution to the requisite one second of physical time. Source: Messer et al. (2009). Image courtesy of Bronson Messer (Oak Ridge National Laboratory).)

Roadmap (Core-Collapse Supernovae), along a capability axis from 10x Tera to Exa flop-year sustained:
• Multi-energy neutrino transport and coherent neutrino flavor mixing
• Multi-energy, multi-angle neutrino transport
• 150-species nuclear framework
• Large (precision) nuclear network
• Full quantum kinetics

Source: Scientific Grand Challenges: Forefront Questions in Nuclear Science and the Role of Computing at the Extreme Scale, p. 65.

Thermonuclear Supernovae

Roadmap, along a capability axis from 10x Tera to Exa flop-year sustained:
• 3-D whole-star simulations with resolution to capture turbulent burning dynamics and convection in the stellar core
• 3-D whole-star simulations with resolution sufficient to capture initiation of a detonation
• 3-D whole-star simulations with nuclear kinetics and resolution to treat turbulent nuclear burning
• 3-D whole-star simulations capturing all crucial scales with detailed nuclear kinetics

Stellar astrophysics provides a target-rich environment for these architectures

• Large number of DOF at each grid point
• Lots of opportunities to hide latency via multiphysics
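As a hypothetical illustration of that latency hiding (my sketch, with invented module names): while one physics module's state is in flight over PCIe, an independent module's kernel can run in another CUDA stream.

```cuda
// Sketch: hiding PCIe transfer latency behind an independent physics
// kernel using two CUDA streams. "hydro" and "burn" are stand-in modules.
#include <cuda_runtime.h>

__global__ void hydro_update(double* u, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += 0.1 * u[i];      // stand-in for a flux update
}

__global__ void burn_update(double* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.999;           // stand-in for a local nuclear network
}

// d_u, d_x: device arrays; h_u: pinned host buffer (cudaMallocHost),
// which is required for the copy to actually overlap with compute.
void step(double* d_u, double* h_u, double* d_x, int n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int threads = 256, blocks = (n + threads - 1) / threads;

    // Stream 0: ship the hydro state to the host (for I/O, say)...
    cudaMemcpyAsync(h_u, d_u, n * sizeof(double),
                    cudaMemcpyDeviceToHost, s0);

    // ...while stream 1 advances the independent, per-zone burning.
    burn_update<<<blocks, threads, 0, s1>>>(d_x, n);

    // The next hydro substep queues behind its own copy in stream 0.
    hydro_update<<<blocks, threads, 0, s0>>>(d_u, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```

The same pattern generalizes: the more independent physics operators a multiphysics code exposes, the more transfer and kernel launches can be overlapped.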


Strong scaling with improved local physical fidelity is good, but not the whole answer

• Many problems (e.g. Type Ia SNe) are woefully underresolved
• Diminishing bytes/FLOP will limit spatial resolution (distributed memory)
• AMR will become even more essential
  – Data locality becomes a problem
• Task-based AMR systems
  – cf. Uintah, MADNESS

Summary

• We are not at the advent of exascale-like architectures; we are in medias res.
• Tools, compilers, etc. are becoming available to help make the transition.
• The specific details of the platforms matter much less than the overarching theme of hierarchical parallelism.
• Multiphysics simulations have unrealized parallelism to tap.
  – Applications relying on, e.g., solution of large linear systems could also benefit from a task-based approach.
