Computational Science at the Argonne Leadership Computing Facility
Nichols A. Romero
Assistant Computational Scientist
Argonne Leadership Computing Facility
Argonne National Laboratory

Mission need for Leadership Computing

Leadership computing capability is required for scientists to tackle the highest-resolution, multi-scale/multi-physics simulations of greatest interest and impact to both science and the nation. For scientific grand challenges, the Leadership Computing Facilities (LCFs) provide capability computing that is 10-100X greater than other computational centers. The LCF focus is on big jobs that use a substantial fraction of the system's resources. Leadership Computing research is mission critical to inform policy decisions and advance innovation in far-reaching topics such as:
• energy assurance
• ecological sustainability
• scientific discovery
• global security

"We will respond to the threat of climate change, knowing that the failure to do so would betray our children and future generations." – President Obama, 1/21/2013

What is the Leadership Computing Facility?

• Two centers/two architectures to address diverse and growing computational needs of the scientific community.
• Highly competitive user allocation programs (INCITE, ALCC).
• Projects receive computational resources typically 10-100x greater than generally available.
• LCF centers partner with users to enable science & engineering breakthroughs.

Integrated Services Model

§ Science project management
§ Assistance with proposals, planning, reporting
§ Collaboration on algorithms and development
§ ALCF point of coordination
§ Startup assistance
§ User administration assistance
§ Job management services
§ Technical support

ALCF Services

§ Workshops and seminars
§ Customized training programs
§ On-line content and user guides
§ Educational and outreach programs
§ Reporting and promotion
§ Performance engineering
§ Application tuning
§ Data analytics
§ Data management services

ALCF Computing Resources
§ Mira – BG/Q system
  – 49,152 nodes / 786,432 cores
  – 786 TB of memory
  – Peak flop rate: 10 petaFLOP/s (see the note below)
  – 3,145,728 hardware threads
§ Cetus (debug) – BG/Q system
  – 1,024 nodes / 16,384 cores
  – 16 TB of memory
  – Peak flop rate: 210 teraFLOP/s
§ Vesta (T&D) – BG/Q system
  – 2,048 nodes / 32,768 cores
  – 32 TB of memory
  – Peak flop rate: 420 teraFLOP/s
§ Tukey – NVIDIA system
  – 96 nodes, each with two 2-GHz, 8-core x86 processors / 200 NVIDIA Tesla GPUs
  – 6.4 TB x86 memory / 1.2 TB GPU memory
  – Peak flop rate: 220 teraFLOP/s
§ Storage
  – Scratch: 28.8 PB raw capacity, 240 GB/s bandwidth (GPFS)
  – Home: 1.8 PB raw capacity, 45 GB/s bandwidth (GPFS)
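Note: a quick back-of-the-envelope check of the Mira peak rate (not from the slides; it assumes the standard BG/Q figures of a 1.6 GHz clock and one 4-wide fused multiply-add QPX unit per core, i.e. 8 flops per cycle per core):

\[
16\ \tfrac{\text{cores}}{\text{node}} \times 8\ \tfrac{\text{flops}}{\text{cycle}} \times 1.6\ \text{GHz} \approx 204.8\ \tfrac{\text{GF/s}}{\text{node}}, \qquad 49{,}152\ \text{nodes} \times 204.8\ \tfrac{\text{GF/s}}{\text{node}} \approx 10\ \text{PF/s}.
\]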

Blue Gene/Q
Packaging hierarchy (recovered from the system diagram):
1. Chip: 16+2 cores
2. Single-chip module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 48 racks, 10 PF/s

Programming and Running on BG/Q
§ High-level programming languages: Fortran, C, C++
§ MPI
§ Threads: OpenMP, Pthreads
§ Run modes: combinations of
  – MPI ranks/node = {1, 2, 4, …, 64}
  – Threads/node = {1, 2, 4, …, 64}
§ Blue Gene-specific features (a hybrid MPI/OpenMP + QPX sketch follows after this list):
  – QPX intrinsics: vec_ld, vec_add, vec_madd, …
  – L2P prefetcher API
  – Transactional memory or speculative execution
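To make the run-mode and QPX bullets concrete, here is a minimal hybrid MPI + OpenMP sketch with a QPX inner loop. It is not taken from the slides: it assumes the IBM XL compiler wrappers on BG/Q (e.g. mpixlc_r with OpenMP enabled), which provide the vector4double type and the vec_splats/vec_ld/vec_madd/vec_st intrinsics as built-ins, and it assumes 32-byte-aligned arrays whose length is a multiple of 4.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4096                 /* multiple of 4 so every QPX load/store is full */

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* 32-byte alignment is assumed by the QPX loads/stores below */
    double *x, *y;
    posix_memalign((void **)&x, 32, N * sizeof(double));
    posix_memalign((void **)&y, 32, N * sizeof(double));
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double a = 3.0;
    vector4double va = vec_splats(a);      /* broadcast a into all 4 QPX slots */

    /* y = a*x + y: each iteration handles 4 doubles with one fused multiply-add;
       OpenMP splits the iterations across the threads of this MPI rank */
    #pragma omp parallel for
    for (long i = 0; i < N; i += 4) {
        vector4double vx = vec_ld(0L, &x[i]);     /* load 4 doubles of x */
        vector4double vy = vec_ld(0L, &y[i]);     /* load 4 doubles of y */
        vec_st(vec_madd(va, vx, vy), 0L, &y[i]);  /* store a*x + y       */
    }

    if (rank == 0)
        printf("ranks=%d  threads/rank=%d  y[0]=%.1f\n",
               nranks, omp_get_max_threads(), y[0]);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}
```

The ranks-per-node and threads-per-rank combination is then chosen at job launch (on ALCF BG/Q systems via the Cobalt/runjob launch options and OMP_NUM_THREADS), subject to the 64 hardware threads available on each node.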

BG/Q messaging software stack (recovered from the layer diagram):
– Application
– High-level APIs: Global Arrays, Charm++, MPI, OpenMP, GASNet
– PAMI (Parallel Active Messaging Interface)
– SPI (System Programming Interface) for messaging

Allocation Programs at the LCFs

INCITE (60% of resources)
– Mission: high-risk, high-payoff science that requires LCF-scale resources
– Call: 1x/year (closes June)
– Duration: 1-3 years, yearly renewal
– Typical size: 30-40 projects, 50M-500M core-hours/yr
– Review process: scientific peer review and computational readiness
– Managed by: INCITE management committee (ALCF & OLCF)
– Readiness: high

ALCC (30% of resources)
– Mission: high-risk, high-payoff science aligned with the DOE mission
– Call: 1x/year (closes February)
– Duration: 1 year
– Typical size: 5-10 projects, 10M-300+M core-hours/yr
– Review process: scientific peer review and computational readiness
– Managed by: DOE Office of Science
– Readiness: medium to high

Director's Discretionary (10% of resources)
– Mission: strategic LCF goals
– Call: rolling
– Duration: 3 months, 6 months, or 1 year
– Typical size: 100s of projects, 0.5M-10M core-hours
– Review process: strategic impact and feasibility
– Managed by: LCF management
– Readiness: low to high

Availability: open to all scientific researchers and organizations; 5 billion core-hours available in 2014
Capability: > 131,072 cores (16.7% of Mira)

Diversity of INCITE science

• Simulating a flow of healthy (red) and diseased (blue) blood cells with a Dissipative Particle Dynamics method. – George Karniadakis, Brown University
• Provide new insights into the dynamics of turbulent combustion processes in internal-combustion engines. – Jacqueline Chen and Joseph Oefelein, Sandia National Laboratories

• Demonstration of high-fidelity capture of airfoil boundary layer, an example of how this modeling capability can transform product development. – Umesh Paliath, GE Global Research
• Calculating an improved probabilistic seismic hazard forecast for California. – Thomas Jordan, University of Southern California

• Modeling charge carriers in metals and semiconductors to understand the nature of these ubiquitous electronic devices. – Richard Needs, University of Cambridge, UK
• High-fidelity simulation of complex suspension flow for practical rheometry. – William George, National Institute of Standards and Technology

Other INCITE research topics
• Glimpse into dark matter
• Global climate
• Membrane channels
• Nano-devices
• Supernovae ignition
• Regional earthquakes
• Protein folding
• Batteries
• Protein structure
• Carbon sequestration
• Chemical catalyst design
• Solar cells
• Creation of biofuels
• Turbulent flow
• Combustion
• Reactor design
• Replicating enzyme functions
• Propulsor systems
• Algorithm development
• Nuclear structure

Sample of Community Codes – and ALCF Application Expertise
FLASH – Astrophysics
AVBP – Combustion
MILC, CPS – Lattice QCD
GTC – Fusion
Nek5000 – CFD
IRHMD – Plasma physics
Rosetta – Protein structure
CPMD – Electronic structure
ANGFMC – Nuclear structure
CCSM3 – Climate
Qbox – Electronic structure
HOMME – Climate
LAMMPS – Molecular dynamics
WRF – Climate
NWChem – Quantum chemistry
Gromacs – Molecular dynamics
GAMESS – Quantum chemistry
Enzo – Astrophysics
MADNESS – Electronic structure
Falkon – Computer science/HTC
NAMD – Molecular dynamics
GPAW – Electronic structure
QMCPACK – Electronic structure
OpenFOAM – CFD
MPQC – Quantum chemistry
FHI-AIMS – Electronic structure
UNIC – Neutron transport
HACC – Astrophysics

INCITE – 3-year award, 40 M/yr, Materials Science
Stress Corrosion Cracking in Nuclear Reactors
Priya Vashishta, University of Southern California

Scientific Goals
• Study stress corrosion cracking (SCC) of materials widely used in nuclear reactors:
  1. nickel-based alloys
  2. silica glass containers

Approach
• Classical molecular dynamics
• In-house code
• Reax force field
• MD simulations: (Left) embrittlement of nickel due to impurities; (Right) silica nanoindentation in the presence of water

Progress
• CY2009: effect of impurities (sulfur) on the fracture modes of nickel (Discretionary Allocation)
• CY2011–CY2013: silica nanoindentation by shock-induced nanobubble collapse (INCITE)

Thermal-Hydraulic Modeling: Cross-Verification, Validation, and Co-Design
Paul Fischer, Argonne National Laboratory

Science and Accomplishments
• Study the use of high-fidelity, advanced thermal hydraulics codes for the simulation of nuclear reactor systems
• Nek5000 simulations were performed to determine the required resolution in a 37-pin rod bundle to resolve turbulence and spatially varying wall shear stress in large eddy simulations
• The highest-resolution calculations of this flow to date showed that additional resolution is needed to resolve spatially varying wall-shear stress
• Exascale algorithms showed a 3x speedup for a broad range of parameter values

Key Impact
• Improve and enhance predictive capabilities for nuclear reactor systems
• Improve understanding of thermal hydraulics physics in nuclear reactor configurations

ALCF Contributions
• Nek5000 is a well-established code that has been tuned for BG/P and BG/Q machines in collaboration with ALCF staff over the past few years.

Image: Nek5000 mesh and streamwise velocity for a 37-pin rod bundle corresponding to the experiment of Krauss and Meyer.

Scientific Support is Collaboration

The ALCF is staffed with a team of computational scientists who are experts in their domains, scalable algorithms, and performance engineering.
§ Provide a "jump-start" in the use of ALCF resources
§ Align the availability of ALCF resources with the needs of the project team
§ Collaborate to maximize the value that ALCF can bring to our projects
§ Connect the needs of the scientific community with current and future hardware

Two categories of collaboration and contribution to teams using the ALCF:

Tactical/Collaborative
• Short-term, fast solutions
• Compiling, debugging, system use
• Targeted problem resolution
• Resolve a specific hard problem, like restructuring I/O
• Long-term collaborations
• In-depth work on a code that may extend over a long period of time
• Constrained by staff

Strategic
• Training
• Postdocs, students, community
• Understand HPC needs for different communities
• Plan for future needs
• Help planning new facilities
• Advise/participate in long-term code development paths

Summary

§ LCFs are a computational resource open to everyone:
  – Must be Open Science (publishable in the open literature)
  – The INCITE program targets high-impact science requiring capability calculations on >100,000 cores
  – The ALCC program is also available for science aligned with DOE's mission. Computational readiness is medium-high.
  – Discretionary allocations are a great way to test and develop your code
§ Strong expertise in atomistic simulation methods: classical molecular dynamics, density functional theory, quantum Monte Carlo
§ LCFs will have >100 PF in the 2018 time frame
  – Concurrency per node will go up by at least a factor of 10X
§ Questions: [email protected]
§ Additional info: www.alcf.anl.gov

Acknowledgements

Most of the slides were borrowed from Paul Messina, Scott Parker, and Katherine Riley.

The Argonne Leadership Computing Facility at Argonne National Laboratory is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.