Computational Science at the Argonne Leadership Computing Facility Nichols A. Romero Assistant Computa onal Scien st Argonne Leadership Compu ng Facility Argonne Na onal Laboratory Mission need for Leadership Computing
Leadership computing capability is required for scientists to tackle the highest- resolution, multi-scale/multi-physics simulations of greatest interest and impact to both science and the nation. For scientific grand challenges, the Leadership Computing Facilities (LCFs) provide capability computing that is 10-100X greater than other computational centers. LCF focus is on big jobs that use a substantial fraction of the systems resources. Leadership Computing research is mission critical to inform policy decisions and advance innovation in far reaching topics such as: • energy assurance • ecological sustainability “We will respond to the • scientific discovery threat of climate change, • global security knowing that the failure to do so would betray our children and future genera ons.” – President Obama 1/21/2013 What is the Leadership Computing Facility?
• Two centers/two architectures to address • Projects receive computational resources diverse and growing computational needs of typically 10-100x greater than generally the scientific community. available. • Highly competitive user allocation programs • LCF centers partner with users to enable (INCITE, ALCC). science & engineering breakthroughs. Integrated Services Model
§ Science project management § Startup assistance § Assistance with proposals, planning, § User administration assistance reporting § Job management services § Collaboration on algorithms and development § Technical support § ALCF point of coordination
ALCF Services
§ Workshops and seminars § Performance engineering § Customized training programs § Application tuning § On-line content and user guides § Data analytics § Educational and outreach programs § Visualization § Reporting and promotion § Data management services ALCF Computing Resources § Mira – BG/Q system – 49,152 nodes / 786,432 cores – 786 TB of memory – Peak flop rate: 10 petaFLOP/s – 3,145,728 hardware threads § Cetus (debug) - BG/Q system – 1,024 nodes / 16,384 cores – 16 TB of memory – Peak flop rate: 210 teraFLOP/s § Vesta (T&D) - BG/Q system – 2,048 nodes / 32,768 cores – 32 TB of memory – Peak flop rate: 420 teraFLOP/s § Tukey – Nvidia system – 96 nodes, each with two 2-GHz, 8-core x86 cores/ 200 NVIDIA Tesla GPUs – 6.4 TB x86 memory / 1.2 TB GPU memory – Peak flop rate: 220 teraFLOP/s § Storage – Scratch: 28.8 PB raw capacity, 240 GB/s BW (GPFS) – Home: 1.8 PB raw capacity, 45 GB/s BW (GPFS) Blue Gene/Q 4. Node Card: 3. Compute card: 32 Compute Cards, One chip module, Op cal Modules, Link Chips; 5D Torus 16 GB DDR3 Memory, Heat Spreader for H2O Cooling 2. Single Chip Module
1. Chip: 16+2 cores
5b. IO drawer: 7. System: 8 IO cards w/16 GB 48 racks, 10 PF/s 8 PCIe Gen2 x8 slots 3D I/O torus
5a. Midplane: 16 Node Cards
6. Rack: 2 Midplanes Programming and Running on BG/Q § High-level programming languages: Fortran, C, C++ § MPI § Threads: OpenMP, PTHREADS § Run modes: combina ons of – MPI ranks/node = {1,2,4,…,64} – Threads/node = {1,2,4,…,64} § Blue Gene specific features: – QPX intrinsics: vec_ld, vec_add, vec_madd, …. – L2P prefetcher API – Transac onal memory or specula ve execu on
Applica on
Global Arrays Charm++ High-level APIs MPI OpenMP GASNet PAMI – Parallel Ac ve Messaging Interface SPI (System Programming Interface) for messaging Allocation Programs at the LCFs 60% 30% 10% Director’s INCITE ALCC Discre onary High-risk, high-payoff science High-risk, high-payoff science Strategic LCF goals Mission that requires LCF-scale aligned with DOE mission resources 1x/year – (Closes June) 1x/year – (Closes February) Rolling Call Dura on 1-3 years, yearly renewal 1 year 3m,6m,1 year 30 - 40 50M - 500M 10M – 300+M 100s of .5M – 10M 5 - 10 projects Typical Size projects core-hours/yr. core-hours/yr. projects core-hours
Review Scien fic Computa onal Scien fic Computa onal Strategic impact and Process Peer-Review Readiness Peer-Review Readiness feasibility INCITE management DOE Office of Science LCF management Managed By commi ee (ALCF & OLCF) Readiness High Medium to High Low to High Open to all scientific researchers and organizations Availability 5 billion core hours available in 2014 Capability > 131,072 cores (16.7% of Mira) Diversity of INCITE science
Simula ng a flow of healthy (red) and Provide new insights into the dynamics diseased (blue) blood cells with a of turbulent combus on processes in Dissipa ve Par cle Dynamics method. internal-combus on engines. - George Karniadakis, Brown University -Jacqueline Chen and Joseph Oefelein, Sandia Na onal Laboratories
Demonstra on of high-fidelity capture of Calcula ng an improved probabilis c airfoil boundary layer, an example of how seismic hazard forecast for California. this modeling capability can transform product development. - Thomas Jordan, - Umesh Paliath, GE Global Research University of Southern California
Modeling charge carriers in metals and High-fidelity simula on of complex semiconductors to understand the nature of suspension flow for prac cal rheometry. these ubiquitous electronic devices. - Richard Needs, - William George, Na onal Ins tute of University of Cambridge, UK Standards and Technology Other INCITE research topics • Glimpse into dark ma er • Global climate • Membrane channels • Nano-devices • Supernovae igni on • Regional earthquakes • Protein folding • Ba eries • Protein structure • Carbon sequestra on • Chemical catalyst design • Solar cells • Crea on of biofuels • Turbulent flow • Combus on • Reactor design • Replica ng enzyme func ons • Propulsor systems • Algorithm development • Nuclear structure Sample of Community Codes -- and ALCF Application Expertise Applica on Field Applica on Field FLASH Astrophysics AVBP Combus on MILC,CPS La ce QCD GTC Fusion Nek5000 CFD IRHMD Plasma Physics Rose a Protein structure CPMD Electronic Structure ANGFMC Nuclear structure CCSM3 Climate Qbox Electronic structure HOMME Climate LAMMPS Molecular dynamics WRF Climate NWChem Quantum chemistry Gromacs Molecular dynamics GAMESS Quantum chemistry Enzo Astrophysics MADNESS Electronic structure Falkon Computer science/HTC NAMD Molecular dynamics GPAW Electronic structure QMCPACK Electronic structure OpenFOAM CFD MPQC Quantum chemistry FHI-AIMS Electronic structure UNIC Neutron Transport HACC Astrophysics INCITE – 3 year award Stress Corrosion Cracking in 40 M/yr - Materials Science Nuclear Reactors Priya Vashishta, University of Southern California Scientific Goals Approach Progress • Study stress corrosion • Classical molecular • CY2009 affect of cracking (SCC) of materials dynamics impuri es (sulfur) on the widely-used in nuclear fracture modes of nickel reactors: • In-house code (Discre onary Alloca on) 1. nickel-based alloys • Reax Force-field • CY2011 – CY2013 silica 2. silica glass containers nanoindena on by shock- • MD simula ons: induced nanobubble collapse (Le ) Embri lement of nickel due to impuri es (Right) Silica nanoindenta on in the presence of water (INCITE) Thermal-Hydraulic Modeling: Cross- Verification, Validation, and Co-Design Paul Fischer, Argonne National Laboratory Science and Accomplishments Key Impact ALCF Contributions • Study the use of high-fidelity, • Improve and enhance • Nek5000 is a well-established advanced thermal hydraulics predic ve capabili es for code that has been tuned for BG/ codes for the simula on of nuclear reactor systems P and BG/Q machines in nuclear reactor systems • Improve understanding of collabora on with ALCF staff over • Nek5000 simula ons were thermal hydraulics physics in the past few years. performed to determine the nuclear reactor configura ons required resolu on in a 37- pin rod bundle to resolve Image: Nek5000 mesh turbulence and spa ally and streamwise varying wall shear stress in velocity for 37-pin rod large eddy simula ons bundle corresponding • Highest resolu on to the experiment of calcula ons of this flow to Krauss and Meyer. date showed that addi onal resolu on is needed to resolve spa ally varying wall- shear stress • Exascale algorithms showed 3x speedup for a broad range of parameter values
12 Scientific Support is Collaboration
The ALCF is staffed with a team of computa onal scien sts, expert in their domain, scalable algorithms and performance engineering. § Provide a "jump-start" in the use of ALCF resources § Align the availability of ALCF resources with the needs of the project team § Collaborate to maximize the value that ALCF can bring to our projects § Connect the needs of the scien fic community with future and current hardware
Two categories of collabora on and contribu on to teams using the ALCF: Tac cal/Collabora ve Strategic • Short term, fast solu ons • Training • Compiling, Debugging, System Use • Postdocs, students, community • Targeted problem resolu on • Understand HPC needs for different • Resolve a specific hard problem like communi es restructuring I/O • Plan for future needs • Long term collabora ons • Help planning new facili es • In depth work on a code that be over a • Advise/Par cipate in long term code long period of me development paths • Constrained by staff Summary
§ LCFs are a computa onal resource open to everyone: – Must be Open Science (publishable in the open literature) – INCITE program targets high-impact science requiring capability calcula ons >100,000 cores – ALCC programs is also available for science aligned to DOE’s mission. Computa onal readiness is medium-high. – Discre onary alloca ons are a great way to test and develop your code § Strong exper se in atomis c simula on methods: classical molecular dynamics, density func onal theory, quantum Monte Carlo § LCFs will have >100 pF in the 2018 me frame – Concurrency per node will go up by at least a factor of 10X § Ques ons: [email protected] § Addi onal info: www.alcf.anl.gov
Acknowledgements
Most of the slides borrowed from Paul Messina, Sco Parker, and Katherine Riley.
Argonne Leadership Compu ng Facility at Argonne Na onal Laboratory is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.