Living at the Top of the Top500: Myopia from Being at the Bleeding Edge
Bronson Messer
Oak Ridge Leadership Computing Facility & Theoretical Astrophysics Group Oak Ridge National Laboratory
Department of Physics & Astronomy University of Tennessee
Friday, July 1, 2011

Outline
• Statements made without proof
• OLCF's Center for Accelerated Application Readiness
• Speculations on task-based approaches for multiphysics applications in astrophysics (e.g. blowing up stars)
Riffing on Hank's fable...
[Slide shows the first page of the paper:]

The Effects of Moore's Law and Slacking on Large Computations
Chris Gottbrath, Jeremy Bailin, Casey Meakin, Todd Thompson, J.J. Charfman
Steward Observatory, University of Arizona
arXiv:astro-ph/9912202v1 9 Dec 1999

Abstract: We show that, in the context of Moore's Law, overall productivity can be increased for large enough computations by 'slacking' or waiting for some period of time before purchasing a computer and beginning the calculation. According to Moore's Law, the computational power available at a particular price doubles every 18 months. Therefore it is conceivable that for sufficiently large numerical calculations and fixed budgets, computing power will improve quickly enough that the calculation will finish faster if we wait until the available computing power is sufficiently better and start the calculation then. This is illustrated in the above plot (Figure 1, not reproduced here). Work is measured in units of whatever a current machine can accomplish in one month and time is measured in months. This paper took 2 days to write.
Some realities
• The future is now: if you go from franklin to hopper at the same size, you lose.

NERSC-6 Grace "Hopper" - Cray XE6
 – Performance: 1.2 PF peak, 1.05 PF HPL (#5)
 – Processor: AMD MagnyCours 2.1 GHz 12-core, 8.4 GFLOPs/core, 24 cores/node
 – 32-64 GB DDR3-1333 per node
 – Gemini interconnect (3D torus)
 – 6,392 nodes, 153,408 total cores
 – I/O: 2 PB disk space, 70 GB/s peak I/O bandwidth

Franklin - Cray XT4
 – 38,288 compute cores, 9,572 compute nodes
 – One quad-core AMD 2.3 GHz Opteron processor (Budapest) per node; 4 processor cores per node
 – 8 GB of memory per node; 78 TB of aggregate memory; 1.8 GB memory/core for applications
 – Light-weight Cray Linux operating system; no runtime dynamic, shared-object libs
 – /scratch disk default quota of 750 GB
 – PGI, Cray, Pathscale, GNU compilers
 – "Use Franklin for all your computing jobs, except those that need a full Linux operating system."
Some realities
• If you use primarily IBM platforms, you have a bit longer.
 – scp+make on Blue Waters will likely give you a speedup.
 – BG/P --> BG/Q brings an increased clock, and you probably aren't engaging the Double Hummer now anyway.
Some realities
• It doesn't matter if you are going to use GPU-based machines or not
 – GPUs [CUDA, OpenCL, directives]
 – FPUs on Power [xlf, etc.]
 – Cell [SPE]
 – SSE/AVX; MIC (Knights Ferry, Knights Corner) [?]
• Exposing the maximum amount of node-level parallelism and increasing data locality are the only ways to get performance from any of these things
ORNL's "Titan" System Goals
• Similar number of cabinets, cabinet design, and cooling as Jaguar
• Operating system upgrade of today's Linux operating system
• Gemini interconnect
 – 3-D torus
 – Globally addressable memory
 – Advanced synchronization features
• AMD Opteron 6200 processor (Interlagos)
• New accelerated node design using NVIDIA multi-core accelerators
• 10-20 PF peak performance
 – Performance based on available funds
• Larger memory - more than 2x more memory per node than Jaguar
Cray XK6 Compute Node
Characteristics:
• AMD Opteron 6200 Interlagos 16-core processor
• Tesla X2090 @ 665 GF, with 6 GB GDDR5 memory capacity
• Host memory: 16 or 32 GB, 1600 MHz DDR3
• Gemini high-speed interconnect (X, Y, Z torus links)
• HT3 and PCIe Gen2 connectivity
• Upgradeable to NVIDIA's Kepler many-core processor
Slide courtesy of Cray, Inc.
OLCF-3 Applications Requirements
Developed by surveying science community
• Elicited, analyzed, and validated using a new comprehensive requirements questionnaire (OLCF Application Requirements Document)
 – Project overview, science motivation and impact, application models, algorithms, parallelization strategy, software, development process, SQA, V&V, usage workflow, performance
• Results, analysis, and conclusions documented in 2009 OLCF application requirements document

Science Driver Survey
• Developed in consultation with 50+ leading scientists in many domains
• Key questions:
 – Science driver: What science will be pursued on this system, and how is it different (in fidelity/quality/predictability and/or productivity/throughput) from the current system?
 – Science impact: What are the science goals, and does OLCF-3 enable them? What might the impact be if this improved science result occurs? Who cares, and why?
 – Science timeliness: What does it matter if this result is delivered in the 2012 timeframe? If this result is delivered in the 2010 timeframe, what does it matter as opposed to coming 5 years later (or never at all)? What other programs, agencies, stakeholders, and/or facilities are dependent upon the timely delivery of this result, and why?
OLCF-3 Applications Analyzed
Science outcomes were elicited from a broad range of applications.

Application area | Science target | Application codes
Astrophysics | Core-collapse supernovae simulation; validation against observations of neutrino signatures, gravitational waves, and photon spectra | Chimera, GenASiS
Astrophysics | Full-star type Ia supernovae simulations of thermonuclear runaway with realistic subgrid models | MAESTRO, MPA-FT
Bioenergy | Cellulosic ethanol: dynamics of microbial enzyme action on biomass | GROMACS, LAMMPS
Biology | Systems biology; genomic structure | LAMMPS
Chemistry | Interfacial chemistry | CP2K, CPMD
Chemistry | Atmospheric aerosol chemistry; fuels from lignocellulosic materials | GAMESS
Combustion | Combustion flame front stability and propagation in power and propulsion engines | S3D
Combustion | Internal combustion design in power and propulsion engines: bridge the gap between device- and lab-scale combustion | RAPTOR
Energy storage | Electrochemical processes at the interfaces; ionic diffusion during charge-discharge cycles | MADNESS
OLCF-3 Applications Analyzed
Science outcomes were elicited from a broad range of applications.

Application area | Science target | Application codes
Fusion | Energetic particle turbulence and transport in ITER | GTC
Fusion | Improved understanding of confinement physics in tokamak experiments; address issues such as the formation of plasma critical gradients and transport barriers; electron dynamics and magnetic perturbation (finite-beta) effects in a global code environment for realistic tokamak transport | GTS
Fusion | First-principles gyrokinetic particle simulation of multiscale electromagnetic turbulence in whole-volume ITER plasmas with realistic diverted geometry | XGC1
Fusion | Tokamak plasma heating and control; finite ion-orbit-width effects to realistically assess RF wave resonant ion interactions | AORSA, CQL3D
Fusion | MHD scaling to realistic Reynolds numbers; global gyrokinetic studies of core turbulence encompassing local & nonlocal phenomena and electromagnetic electron dynamics | FSP
Fusion | Predictive simulations of transport iterated to bring the plasma into steady-state power balance; radial transport balances power input; multi-scale integrated electromagnetic turbulence in realistic ITER geometry | GYRO, TGYRO
Geoscience | Predictive contaminant ground water transport; stability and viability of large-scale CO2 sequestration | PFLOTRAN
OLCF-3 Applications Analyzed
Science outcomes were elicited from a broad range of applications.

Application area | Science target | Application codes
Nanoscience | Electron-lattice interactions and energy loss in full nanoscale transistors | OMEN
Nanoscience | Full device simulation of a nanostructure solar cell | LS3DF
Nanoscience | Magnetic/superconducting phase diagrams including effects of disorder; effect of impurity configurations on pairing and the high-Tc superconducting gap; high-Tc superconducting transition temperature dependence in cuprates | DCA++
Nanoscience | To what extent do thermodynamics and kinetics of magnetic transition and chemical reactions differ between nano and bulk? What is the role of material disorder, statistics, and fluctuations in nanoscale materials and systems? | WL-LSMS
Nuclear energy | Predicting, with UQ, the behavior of existing and novel nuclear fuels and reactors in transient and nominal operation | Denovo
Nuclear energy | Reduce uncertainties and biases in reactor design calculations by replacing existing multi-level homogenization techniques with more direct solution methods | UNIQ
Nuclear physics | Limits of nuclear stability; static and transport properties of nucleonic matter | NUCCOR
Nuclear physics | Predict half-lives, mass and kinetic energy distribution of fission fragments, and fission cross sections | MFDn
QCD | Achieving high precision in determining the fundamental parameters of the Standard Model (masses and mixing strengths of quarks) | MILC, Chroma
Turbulence | Stratified and unstratified turbulent mixing at simultaneous high Reynolds and Schmidt numbers | DNS
Turbulence | Nonlinear turbulence phenomena in multi-physics settings | Hybrid
Evaluation Criteria for Selection of Six Representative Applications

Task | Description
Science | Science results, impact, timeliness; alignment with DOE and U.S. science mission (CD-0); broad coverage of science domains
Implementation (models, algorithms, software) | Broad coverage of relevant programming models, environments, languages, implementations; broad coverage of relevant algorithms and data structures (motifs); broad coverage of scientific library requirements
User community (current and anticipated) | Broad institutional and developer/user involvement; good representation of current and anticipated INCITE workload
Preparation for steady-state ("INCITE ready") operations | Mix of low ("straightforward") and high ("hard") risk porting and readiness requirements; availability of OLCF liaison with adequate skills/experience match to application; availability of key code development personnel to engage in and guide readiness activities
Center for Accelerated Application Readiness

• WL-LSMS: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
• LAMMPS: Biofuels - an atomistic model of cellulose (blue) surrounded by lignin molecules comprising a total of 3.3 million atoms.
• CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
• S3D: Understanding turbulent combustion through direct numerical simulation with complex chemistry.
• Denovo: Discrete ordinates radiation transport calculations that can be used in a variety of nuclear energy and technology applications.
• PFLOTRAN: Stability and viability of large-scale CO2 sequestration; predictive containment groundwater transport.

CAAR apps will form the vanguard of 'day-one' science on OLCF-3, but additional science teams will be granted friendly-user access as well (cf. our Petascale Early Science Period). Call for proposals will be forthcoming this summer.
CAAR Application Summary

CAM-HOMME
• Spectral finite element method
• High leverage in physics packages
• Scalable dynamical core of choice for future CCSM
• Hard rating: low compute intensity and high data movement in physics kernels

S3D
• DNS of combustion processes for specific fuels
• Compressible Navier-Stokes flow solver for the full mass, momentum, energy, and species conservation equations with structured grid, written in F90
• Moderate rating: complex rate equations, thermodynamics, and transport properties modules; no compute libraries used

LAMMPS
• Critical to development of alternative energy sources, including second-generation cellulosic ethanol
• Easily broken up into components available to other MD codes
• Broad open-community MD code owned by a DOE national laboratory: large user and developer groups
• Moderate rating: data non-locality due to calculation of long-range Coulomb force (common to all MD codes) - these changes will be made available as a library
CAAR Application Summary (continued)

gWL-LSMS
• Enables first-principles studies of magnetic materials with broad relevance to DOE energy mission
• Uses a workhorse approach (F77/90, C++, MPI) common to many applications
• Straightforward rating: main kernel based on dense linear algebra of complex numbers (LAPACK, CULA, MAGMA)

Denovo
• Key application for neutron transport and power distribution prediction in nuclear reactor cores
• Moderate rating: huge potential for exploiting untapped concurrency along the "energy dimension" helps the port, while heavy use of C++ and advanced programming models will tax the GPU software and tool environment

PFLOTRAN
• Full-featured finite element application with both structured and unstructured versions, written in F90
• PETSc solver technology used extensively
• Hard rating: data non-locality caused by implicit nonlinear PDE solutions with indirect addressing, and data movement caused by AMR (via SAMRAI)
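Because gWL-LSMS's hot spot is essentially one dense complex matrix multiply, the library swap is the natural port. As a sketch of my own (simplified: a real ZGEMM also takes transpose flags and leading dimensions), this is the operation an accelerated CULA/MAGMA/custom kernel must reproduce:

```c
#include <complex.h>
#include <stddef.h>

/* Simplified reference for the ZGEMM operation C = alpha*A*B + beta*C on
 * column-major n x n complex matrices.  A code whose hot spot is this one
 * call can swap in a GPU implementation without touching the physics. */
void zgemm_ref(size_t n, double complex alpha, const double complex *A,
               const double complex *B, double complex beta,
               double complex *C) {
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            double complex acc = 0.0;
            for (size_t k = 0; k < n; ++k)
                acc += A[i + k * n] * B[k + j * n];  /* column-major */
            C[i + j * n] = alpha * acc + beta * C[i + j * n];
        }
    }
}
```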
CAAR Algorithmic Coverage
• Selected applications represented the bulk of use for 6 INCITE allocations totaling 212M cpu-hours (2009)
 – Represented 35% of 2009 INCITE allocations
 – 23% of 2010 INCITE allocations (in cpu-hours)

Code | Motifs covered (structured grids, unstructured grids, dense linear algebra, sparse linear algebra, FFT, particles, Monte Carlo)
S3D | structured grids, dense & sparse linear algebra, particles
CAM | structured grids, dense & sparse linear algebra, particles
LSMS | dense linear algebra, Monte Carlo
LAMMPS | FFT, particles
Denovo | structured grids, dense & sparse linear algebra
PFLOTRAN | structured grids (AMR), sparse linear algebra
App | Science area | Algorithm(s) | Grid type | Math libraries | Communication libraries | Language(s) | Compiler(s) supported
CAM-HOMME | climate | spectral finite elements, dense & sparse linear algebra, particles | structured | Trilinos | MPI | F90 | PGI, Lahey, IBM
LAMMPS | biology/materials | molecular dynamics, FFT, particles | N/A | FFTW | MPI | C++ | GNU, PGI, IBM, Intel
S3D | combustion | Navier-Stokes, finite diff, dense & sparse linear algebra, particles | structured | None | MPI | F77, F90 | PGI
Denovo | nuclear energy | wavefront sweep, GMRES | structured | Trilinos, LAPACK, SuperLU, Metis | MPI | Fortran, C++, Python | GNU, PGI, Cray, Intel
WL-LSMS | nanoscience | density functional theory, Monte Carlo | N/A | LAPACK (ZGEMM, ZGTRF, ZGTRS) | MPI | F77, F90, C, C++ | PGI, GNU
PFLOTRAN | geoscience | Richards' equation coupled to transport and chemistry, finite-volume hydrodynamics | AMR | BLAS, PETSc, SAMRAI | MPI | F90 | PGI, GNU

• Algorithm and implementation coverage extends applicability well beyond the science domains immediately represented
• Much of the development work will also be pushed out to broader communities (e.g., in use of ChemKin)
Tactics
• Comprehensive team assigned to each app
 – OLCF - application lead
 – Cray - engineer
 – NVIDIA - developer
 – Other: other application developers, local tool/library developers
• Particular plan-of-attack different for each app
 – WL-LSMS - dependent on accelerated ZGEMM
 – CAM-HOMME - pervasive and widespread custom acceleration required
• Multiple acceleration methods explored
 – WL-LSMS - CULA, MAGMA, custom ZGEMM
 – CAM-HOMME - CUDA, PGI directives
• Two-fold aim
 – Maximum acceleration for model problem
 – Determination of optimal, reproducible acceleration path for other applications
• Constant monitoring of progress
 – Status of each app discussed weekly
Application Teams
• S3D: Ray Grout (NREL); Ramanan Sankaran (OLCF); John Levesque (Cray); Gregory Ruetsch (NVIDIA)
• WL-LSMS: Markus Eisenbach, Yang Wang (PSC), Aurelian Rusanu (ORNL/UTK); Jeff Larkin and Adrian Tate (Cray); Massimiliano Fatica and Peng Wang (NVIDIA)
• CAM-HOMME: Jim Hack, Matt Norman, Kate Evans, Rick Archibald (ORNL), Mark Taylor (SNL), John Dennis and J.-F. Lamarque (NCAR), Jim Rosinski (NOAA), Ilene Carpenter (NREL); Jeff Larkin (Cray); Paulius Micikevicius (NVIDIA); Oscar Hernandez (ORNL, tools)
• LAMMPS: Steve Plimpton and Paul Crozier (SNL), Mike Brown (ORNL), Axel Kohlmeyer (Temple); Mike Brown and Arnold Tharrington (OLCF); Sarah Anderson (Cray); Peng Wang, Scott Le Grand, and Cyril Zeller (NVIDIA)
• Denovo: Tom Evans and Chris Baker (ORNL); Wayne Joubert (OLCF); Kevin Thomas (Cray); John Roberts (NVIDIA)
• PFLOTRAN: Peter Lichtner (LANL), Bobby Philip; Rebecca Hartmann-Baker (ORNL); Nathan Wichmann (Cray); Peng Wang (NVIDIA)
Complications
• All of the chosen apps are under constant development
 – Groups have, in many cases, already begun to explore GPU acceleration "on their own."
• Production-level tools, compilers, libraries, etc. are just beginning to become available
 – Multiple paths are available, with multifarious trade-offs:
  • ease-of-use
  • (potential) portability
  • performance
What Are We Trying First?
• WL-LSMS: primarily library-based
• S3D: directives and CUDA
• LAMMPS: CUDA
• CAM-SE: CUDA Fortran & directives
• Denovo: CUDA
• PFLOTRAN: directives
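The directives path being tried for S3D and PFLOTRAN can be sketched in a few lines. This is an illustrative OpenACC-style loop of my own, not code from any CAAR app: annotate the loop nest and let the compiler generate the accelerator code; a compiler without accelerator support simply ignores the pragma.

```c
#include <stddef.h>

/* Directive-annotated axpy: y = a*x + y.  With an accelerating compiler
 * the pragma offloads the loop to the GPU; otherwise it is ignored and
 * the loop runs on the CPU - the portability appeal of this path. */
void axpy(size_t n, double a, const double *x, double *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```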
Hierarchical Parallelism
• MPI parallelism between nodes (or PGAS)
• On-node, SMP-like parallelism via threads (or subcommunicators, or…)
• Vector parallelism
 – SSE/AVX on CPUs
 – GPU threaded parallelism
• Exposure of unrealized parallelism is essential to exploit all near-future architectures.
• Uncovering unrealized parallelism and improving data locality improves the performance of even CPU-only code.
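The levels above compose naturally in a single kernel. A minimal sketch of my own (hypothetical function name; the MPI layer is indicated only in a comment): the domain is split across nodes by MPI, the node-local loop is split across threads with an OpenMP directive, and the unit-stride body is left for the compiler to vectorize with SSE/AVX.

```c
#include <stddef.h>

/* Node-local piece of a global dot product.  Across nodes the partial
 * sums would be combined with MPI_Allreduce (distributed level, omitted
 * here); the directive splits the loop over on-node threads (SMP level);
 * the unit-stride body lets the compiler emit SSE/AVX code (vector
 * level).  The pragma is ignored harmlessly if OpenMP is disabled. */
double node_local_dot(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```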
Some Lessons Learned
• Exposure of unrealized parallelism is essential.
 – Figuring out where is often straightforward
 – Making changes to exploit it is hard work (made easier by better tools)
 – Developers can quickly learn, e.g., CUDA and put it to effective use
 – A directives-based approach offers a straightforward path to portable performance
• For those codes that already make effective use of scientific libraries, the possibility of continued use is important.
 – HW-aware choices
 – Help (or, at least, no hindrance) to overlapping computation with device communication
• Ensuring that changes are communicated back and remain in the production "trunk" is every bit as important as we initially thought.
 – Other development work taking place on all CAAR codes could quickly make acceleration changes obsolete/broken otherwise
• How much effort is this demanding?
 – All 6 CAAR teams have converged (independently) to 2 ± 0.5 FTE-years

PANEL REPORT: NUCLEAR ASTROPHYSICS
Three-Dimensional Turbulent Convection in Massive Stars
The art of modeling stars is intimately entwined with a diverse range of astrophysical problems, from stellar population studies and galactic chemical evolution modeling to the mechanisms underlying supernova and gamma-ray burst explosions. These diverse topics are bound together, as almost all of this work relies on the ability to accurately model an individual star. Understanding an individual star is deeply connected to the degree to which the turbulent motions that take place deep in the stellar interior are understood. Currently, the exponential growth in computing technology, which is expressed concisely by Moore’s law, is beginning a new era of sophistication in the ability to model plasma dynamics within stars. The numerical experiments made possible by the availability of ever-larger pools of computing resources are providing stronger constraints on researchers’ theories. These numerical experiments are also inspiring breakthroughs in understanding stellar mixing and combustion processes, the evolution and structure of stars, and the contribution of these stars to the chemical evolution of the universe.

[Figure] The turbulent convective flow in a core-collapse supernova progenitor star is shown for a large angular domain (120° × 120°), three-dimensional simulation (Meakin 2008). The mass fraction of 32S is visualized to provide a sense of the mixing topology and the complex, multiscale, turbulent nature of the flow field. Material with a high mass fraction of 32S is being entrained into the turbulent oxygen-burning shell from the underlying silicon- and sulfur-rich core of a 23 solar mass star. This illustrates a mixing process that is not included in the standard treatment used to model stars today but that has a significant impact on the resulting structure of the massive star at the time of core collapse. Source: Meakin (2008). Image courtesy of Casey Meakin (University of Arizona).

Developing three-dimensional supernova progenitor models involves a large range in both spatial and temporal scales. While the end state of a massive star depends upon the entire prior evolution of the star since formation, a three-dimensional simulation that begins at core silicon burning and is evolved up to core collapse would provide a significantly improved level of confidence about the state of the iron core at collapse, including the rotational state and the convectively induced perturbations. Such a three-dimensional stellar model would be used directly as an input to core-collapse supernova simulations. Spatially, capturing global asymmetries will require simulating a volume that extends to the outer edge of the carbon-burning convection zone (Meakin and Arnett 2006). Thus, in successively larger shells surrounding the core, silicon, oxygen, neon, and carbon burning will need to be included.

[Roadmap figure: Stellar Evolution: The Sun and Other Stars — science milestones vs. sustained computing, 10x Tera → 100x Tera → Peta → 10x Peta → 100x Peta → Exa flop-years sustained]
– 1st-gen 2D supernova progenitor including all dynamically active layers
– 1st-gen 3D AGB model including convective red giant envelope
– 3D AGB model with convective envelope and resolved turbulent boundary layer mixing between active burning layers and envelope
– 3D AGB model including the effects of rotation on the global circulation and mass mixing properties
– 3D supernova progenitors from the end of silicon core burning up to core collapse
– 3D supernova progenitor including all dynamically active layers
– 3D supernova progenitors including iron core and overlying silicon burning shell up to core collapse, including the effects of rotation
– Global circulation solar model with tachocline and surface granulation
– Global circulation solar model with resolved, turbulent tachocline

Scientific Grand Challenges: Forefront Questions in Nuclear Science and the Role of Computing at the Extreme Scale, p. 56
Writing these data and managing and analyzing them after they are written are as important to producing meaningful science through supernova simulation as is any algorithmic or implementation improvement that might be made for the computational step itself.
Four-Dimensional Core-Collapse Supernovae Simulations

Decades of core-collapse supernovae simulations have led to a challenging proposition for the eventual description of the explosion mechanism and the associated phenomenology. It is now known that three-dimensional hydrodynamics in general relativistic gravity must be coupled to nuclear kinetics capable of accurate estimation of energy release and compositional changes, and to spectral neutrino transport of all flavors of neutrinos (requiring a fourth, neutrino energy, dimension) to reliably determine the nature of the explosion mechanism. Moreover, the inclusion of neutrino masses and oscillations, as well as magnetic fields, both of which may be important effects, makes the problem even more difficult. All of these effects, familiar in some sense from the earlier life of the star, operate on very short (millisecond) timescales and at extremes in density (as high as three to four times nuclear matter density) and neutron richness (the ratio of protons to neutrons in the matter can be several times smaller than what is accessible in terrestrial experiments). The rich interplay of all these physical phenomena in core-collapse supernovae ultimately leads to the realization that only through simulation will scientists be able to fully understand these massive stellar explosions. The relative complexity of the germane physics also means the requisite physical fidelity for these simulations will only be realized at the extreme scale.

[Figure] Rendering of the matter entropy in the core of a massive star at approximately 100 ms following core bounce. Multidimensional fluid effects, coupled to heating and cooling via neutrino emission and absorption, have served to form large, asymmetric structures in the flow. Much of this complicated flow pattern occurs between the surface of the nascent neutron star at the center of the star and the position of the original supernova shock (demarcated by the faint blue circle near the edge of the figure). These first four-dimensional simulations have already consumed tens of millions of central processing unit hours on leadership computing platforms and will require tens of millions of additional central processing unit hours to follow the evolution to the requisite one second of physical time. Source: Messer et al. (2009). Image courtesy of Bronson Messer (Oak Ridge National Laboratory).

[Roadmap figure: Core-Collapse Supernovae — science milestones vs. sustained computing, 10x Tera → 100x Tera → Peta → 10x Peta → 100x Peta → Exa flop-years sustained]
– Multi-energy neutrino transport and coherent neutrino flavor mixing
– Multi-energy, multi-angle neutrino transport
– 150-species nuclear framework
– Large (precision) nuclear network
– Full quantum kinetics
Thermonuclear Supernovae
[Roadmap figure: science milestones vs. sustained computing, 10x Tera → 100x Tera → Peta → 10x Peta → 100x Peta → Exa flop-years sustained]
– 3-D whole-star simulations with resolution to capture turbulent burning dynamics and convection in the stellar core
– 3-D whole-star simulations with resolution sufficient to capture initiation of a detonation
– 3-D whole-star simulations with nuclear kinetics and resolution to treat turbulent nuclear burning
– 3-D whole-star simulations capturing all crucial scales with detailed nuclear kinetics
Stellar Astrophysics provides a target-rich environment for these architectures
• Large number of DOF at each grid point
• Lots of opportunities to hide latency via multiphysics
Summary
• Strong scaling with improved local physical fidelity is good, but not the whole answer.
• Many problems (e.g. Type Ia SNe) are woefully underresolved
• Diminishing bytes/FLOP will limit spatial resolution (distributed memory)
• AMR will become even more essential
– Data locality becomes a problem
• Task-based AMR systems
– cf. Uintah, MADNESS
• We are not at the advent of exascale-like architectures; we are in medias res.
• Tools, compilers, etc. are becoming available to help make the transition.
• The specific details of the platforms matter much less than the overarching theme of hierarchical parallelism.
• Multiphysics simulations have unrealized parallelism to tap.
– Applications relying on, e.g., the solution of large linear systems could also benefit from a task-based approach.