Accelerating computational science and engineering with leadership computing

Jack C. Wells, Director of Science, Oak Ridge Leadership Computing Facility
NVIDIA Theatre @ SC13

Big Problems Require Big Solutions
• Climate Change
• Energy
• Healthcare
• Competitiveness

What is the Leadership Computing Facility (LCF)?
• Collaborative DOE Office of Science program at ORNL and ANL
• Mission: Provide the computational and data resources required to solve the most challenging problems.
• 2 centers / 2 architectures to address the diverse and growing computational needs of the scientific community
• Highly competitive user allocation programs (INCITE, ALCC)
• Projects receive 10x to 100x more resource than at other generally available centers
• LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts)

Titan System (Cray XK7), #2 on the Top500
Peak Performance: 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
Compute Nodes: 18,688
LINPACK Performance: 17.59 PF
Power: 8.2 MW
System Memory: 710 TB total
Interconnect: Gemini high-speed interconnect, 3D torus
Storage: Lustre filesystem, 32 PB
Archive: High-Performance Storage System (HPSS), 29 PB
I/O Nodes: 512 service and I/O nodes
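The headline figures are internally consistent: the GPU and CPU partitions sum to the quoted peak, and the LINPACK run reached about 65% of it.

```latex
24.5\ \mathrm{PF}_{\mathrm{GPU}} + 2.6\ \mathrm{PF}_{\mathrm{CPU}} = 27.1\ \mathrm{PF}_{\mathrm{peak}},
\qquad
\frac{17.59\ \mathrm{PF}_{\mathrm{LINPACK}}}{27.1\ \mathrm{PF}_{\mathrm{peak}}} \approx 0.65
```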

High-impact science at OLCF: Four of six SC13 Gordon Bell finalists used Titan
• Peter Staar (ETH Zurich), high-temperature superconductivity: "Taking a Quantum Leap in Time to Solution for Simulations of High-Tc Superconductors", Titan (15.4 PF)
• Massimo Bernaschi (IAC-CNR, Rome), biofluidic systems: "20 Petaflops Simulation of Protein Suspensions in Crowding Conditions", Titan (20 PF)
• Michael Bussmann (HZDR, Dresden), plasma physics: "Radiative Signatures of the Relativistic Kelvin-Helmholtz Instability", Titan (7.2 PF)
• Salman Habib (Argonne), cosmology: "HACC: Extreme Scaling and Performance Across Diverse Architectures", Sequoia (13.9 PF) and Titan

Science challenges for LCF in the next decade

Combustion Science: Increase efficiency by 25%-50% and lower emissions from internal combustion engines using advanced fuels and low-temperature combustion.

Climate Change Science: Understand the dynamic ecological and chemical evolution of the climate system, with uncertainty quantification of impacts.

Fusion Energy: Develop predictive understanding of plasma properties, dynamics, and interactions with surrounding materials.

Biomass to Biofuels: Enhance the understanding and production of biofuels for transportation and other bio-products from biomass.

Optimized Accelerator Designs: Optimize designs of the next generations of accelerators. Detailed models are needed to provide efficient designs of new light sources.

Solar Energy: Improve photovoltaic efficiency and lower cost for organic and inorganic materials.

Solar energy
Key science challenges: Improve photovoltaic efficiency and lower cost for organic and inorganic materials. A photovoltaic material poses difficult challenges in the prediction of morphology, excited-state phenomena, transport, and materials aging.

Figure: Coarse-grained MD simulation of phase separation of a 1:1 weight ratio P3HT/PCBM mixture into donor (white) and acceptor (blue) domains.

Science enabled by LCF capabilities

2013-2016:
• Understand growth, interface structure, and stability of the heterogeneous polymer blends necessary for efficient solar conversion
• Simulations of structure, carrier transport, and defect states in nanomaterials
• Describe excited-state phenomena in homogeneous systems

2016-2020:
• Enable computational screening of materials for desired excited-state and charge-transport properties
• Systems-level, multiphysics simulations of practical photovoltaic devices
• Uncertainty quantification for critical integrated materials properties

LAMMPS Early Science Project: Towards Rational Design of Efficient Organic Photovoltaic Materials
Jan-Michael Carrillo, ORNL; Mike Brown, ORNL

Science Objectives and Impact
• Organic photovoltaic (OPV) solar cells are promising renewable energy sources: low cost, high flexibility, and light weight
• Bulk-heterojunction (BHJ) active-layer morphology and domain size are critical for improving performance

Figure: P3HT (electron donor) and PCBM (electron acceptor); coarse-grained MD simulation of phase separation of a 1:1 weight ratio P3HT/PCBM mixture into donor (white) and acceptor (blue) domains.
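To make the GPU offload concrete, here is a minimal CUDA sketch of the kind of pairwise force computation at the heart of such a coarse-grained MD step. It is illustrative only, not LAMMPS source: LAMMPS uses neighbor lists and spatial decomposition rather than this all-pairs loop, and its GPU package can also evaluate forces in mixed precision, which is where the larger speedups cited below come from.

```cuda
// Hypothetical sketch of a coarse-grained MD force kernel (illustrative
// only, not LAMMPS source). One thread accumulates the Lennard-Jones
// force on one bead from all other beads within the cutoff.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lj_forces(const double3 *pos, double3 *frc, int n,
                          double eps, double sigma, double rcut2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double3 f = make_double3(0.0, 0.0, 0.0);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        double dx = pos[i].x - pos[j].x;
        double dy = pos[i].y - pos[j].y;
        double dz = pos[i].z - pos[j].z;
        double r2 = dx * dx + dy * dy + dz * dz;
        if (r2 > rcut2) continue;                 // apply cutoff
        double sr2 = sigma * sigma / r2;
        double sr6 = sr2 * sr2 * sr2;
        // 12-6 Lennard-Jones: F(r)/r = 24*eps*(2*(s/r)^12 - (s/r)^6)/r^2
        double fr = 24.0 * eps * sr6 * (2.0 * sr6 - 1.0) / r2;
        f.x += fr * dx; f.y += fr * dy; f.z += fr * dz;
    }
    frc[i] = f;
}

int main() {
    const int n = 1024;
    double3 *pos, *frc;
    cudaMallocManaged(&pos, n * sizeof(double3));
    cudaMallocManaged(&frc, n * sizeof(double3));
    for (int i = 0; i < n; ++i)                   // beads on a line, 1.1*sigma apart
        pos[i] = make_double3(1.1 * i, 0.0, 0.0);
    lj_forces<<<(n + 127) / 128, 128>>>(pos, frc, n, 1.0, 1.0, 6.25);
    cudaDeviceSynchronize();
    printf("f[0].x = %f\n", frc[0].x);
    cudaFree(pos); cudaFree(frc);
    return 0;
}
```

The per-bead independence of the outer loop is what lets the computation map naturally onto thousands of GPU threads.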

Titan Simulation: LAMMPS
• Portability: builds with CUDA or OpenCL
• Speedups on Titan (GPU+CPU vs. CPU-only): 2x to 15x (mixed precision), depending upon model and simulation
• Speedup of 2.5-3x for the OPV simulation used here

Preliminary Science Results
• Titan simulations are 27x larger and 10x longer: converged P3HT:PCBM separation in 400 ns of CGMD time
• Prediction: increasing polymer chain length will decrease the size of the electron-donor domains
• Prediction: the PCBM (fullerene) loading parameter has an increasing, then decreasing impact on P3HT domain size

Biomass to biofuels
Key science challenges: Enhance the understanding and production of biofuels from biomass for transportation and other bio-products. The main challenge to overcome is the recalcitrance of biomass (cellulosic materials) to hydrolysis.

Figure: Lignin interacting with crystalline cellulose.

Science enabled by increasing LCF capabilities

2013-2016:
• Atomic-detail dynamical models of biomass systems of several million atoms, permitting detailed analysis of interactions
• Simulations of pretreatment effects on multi-component biomass systems to understand bottlenecks in bioconversion

2016-2020:
• Understand the dynamics of enzymatic reactions on biomass by simulating interactions between microbial systems and cellulosic biomass
• Design superior enzymes for the conversion of biomass

INCITE Program: Boosting Bioenergy and Overcoming Recalcitrance
Jeremy Smith, Oak Ridge National Laboratory; 23M Titan core-hours

Science Objectives and Impact
• Optimize the biomass pretreatment process by understanding lignin-cellulose interactions on a molecular level
• Overcome biomass recalcitrance caused by lignin and the tightly ordered structure of cellulose
• Improve the efficiency of the biofuel production process and make ethanol less costly

Figure: Interaction between cellulose fibril (blue) and lignin (pink and green) molecules. Visualization by M. Matheson (ORNL).

Science Results
• Published a paper in Biomacromolecules in August 2013
• Discovered that amorphous cellulose is easier to break down because it associates less with lignin
• The phenomenon is not a result of direct interaction between lignin and cellulose, but is a water-mediated effect

Application Performance
• 2012: Used GROMACS to monitor interactions of 3 million atoms, including crystalline and non-crystalline cellulose, lignin, and water
• 2013: Now running an accelerated GROMACS that can take advantage of Titan's GPUs, making the application 10 times bigger and running much longer; current simulations monitor 30 million atoms

ALCC Program: Non-Icing Surfaces for Cold Climate Wind Turbines
Masako Yamada, GE Global Research; 40M Titan core-hours

Science Objectives and Impact (molecular dynamics simulations)
• Understand the microscopic mechanism of water droplets freezing on surfaces
• Determine the efficacy of non-icing surfaces at different operation temperatures

Figure: The location of ice nucleation varies depending on temperature and contact angle. Visualization by M. Matheson (ORNL).

Science Results
Replicated GE's experimental results (hydrophilic vs. hydrophobic):
• Hydrophobic surfaces delay the onset of nucleation
• The delay is less pronounced at lower temperatures

Performance Achievements
• 5x speed-up from GPU acceleration
• A further 40x speed-up from a new interaction potential for water

Center for Accelerated Application Readiness (CAAR)

• Focused effort to prepare applications for accelerated architectures
• Goals:
  – Work with code teams to develop and implement strategies for exposing hierarchical parallelism in our users' applications (see the CUDA sketch after this list)
  – Maintain code portability across modern architectures
  – Learn from and share our results
• Selected six applications from different science domains and algorithmic motifs
• Application Teams:
  – OLCF application lead
  – Cray engineer
  – NVIDIA developer
  – Others: local tool & library developers, other computational scientists
• Single early science problem targeted for each app
• Explore multiple approaches for each app:
  – Determine maximum acceleration
  – Determine a reproducible path for other applications
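A minimal CUDA sketch of what exposing hierarchical parallelism means in practice; the kernel, names, and sizes are invented for illustration and are not from any CAAR application. MPI ranks across nodes form the outermost level (not shown); within the GPU, coarse-grained work (here, grid cells) maps to thread blocks and fine-grained work (the unknowns within a cell) maps to threads that cooperate through shared memory.

```cuda
// Hypothetical sketch of hierarchical parallelism on one GPU node.
// Coarse level: one thread block per grid cell.
// Fine level: one thread per unknown within the cell.
#include <cstdio>
#include <cuda_runtime.h>

#define DOF_PER_CELL 128  // unknowns per cell; one thread each

__global__ void cell_norms(const double *u, double *norm_per_cell) {
    __shared__ double partial[DOF_PER_CELL];
    int cell = blockIdx.x;                 // coarse-grained index
    int dof  = threadIdx.x;                // fine-grained index
    double v = u[cell * DOF_PER_CELL + dof];
    partial[dof] = v * v;
    __syncthreads();
    // Tree reduction within the block, using shared memory.
    for (int s = DOF_PER_CELL / 2; s > 0; s >>= 1) {
        if (dof < s) partial[dof] += partial[dof + s];
        __syncthreads();
    }
    if (dof == 0) norm_per_cell[cell] = sqrt(partial[0]);
}

int main() {
    const int ncells = 4096;
    const int n = ncells * DOF_PER_CELL;
    double *u, *norms;
    cudaMallocManaged(&u, n * sizeof(double));
    cudaMallocManaged(&norms, ncells * sizeof(double));
    for (int i = 0; i < n; ++i) u[i] = 1.0;

    cell_norms<<<ncells, DOF_PER_CELL>>>(u, norms);  // one block per cell
    cudaDeviceSynchronize();
    printf("norm of cell 0 = %f\n", norms[0]);       // expect sqrt(128)
    cudaFree(u); cudaFree(norms);
    return 0;
}
```

Restructuring a code so that both levels are exposed, rather than a single flat loop, is typically the bulk of the porting effort described later in this talk.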

Early Science Challenges for Titan

WL-LSMS: Illuminating the role of material disorder, statistics, and fluctuations in nanoscale materials and systems.

LAMMPS: A molecular dynamics simulation of organic polymers for applications in organic photovoltaic heterojunctions, de-wetting phenomena, and biosensor applications.

CAM-SE: Answering questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.

S3D: Understanding turbulent combustion through direct numerical simulation with complex chemistry.

Denovo: Discrete ordinates radiation transport calculations that can be used in a variety of nuclear energy and technology applications.

NRDF: Radiation transport, important in astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging, computed on AMR grids.

Figure (NRDF): Evolution of the solution and AMR grid at t = 0.50, 0.75, 1.0, 1.25 for an implicit AMR equilibrium radiation diffusion calculation; refinement patch boundaries are superimposed on a pseudocolor plot of the solution.

Effectiveness of GPU Acceleration

Application | Domain | Cray XK7 vs. Cray XE6 performance ratio*
CAAR codes:
LAMMPS | Molecular dynamics | 7.4
S3D | Turbulent combustion | 2.2
Denovo | 3D neutron transport for nuclear reactors | 3.8
WL-LSMS | Statistical mechanics of magnetic materials | 3.8
Community codes:
AWP-ODC | Seismology | 2.1
DCA++ | Condensed matter physics | 4.4
QMCPACK | Electronic structure | 2.0
RMG (DFT; real-space, multigrid) | Electronic structure | 2.0
XGC1 | Plasma physics for fusion energy R&D | 1.8

Titan: Cray XK7 (Kepler GPU plus AMD 16-core Opteron CPU). Cray XE6: 2x AMD 16-core Opteron CPUs.
*Performance depends strongly on the specific problem size chosen.

WL-LSMS Magnetic Materials: Simulating nickel atoms pushes double-digit petaflops
Marcus Eisenbach, ORNL

Science Objectives and Impact
• Enhance the understanding of the microscopic behavior of magnetic materials
• Enable the simulation of new magnetic materials: better, cheaper, more abundant materials
• Model development on Titan will enable investigation on smaller computers

Figure: Researchers using Titan are studying the behavior of magnetic systems by simulating nickel atoms as they reach their Curie temperature, the threshold between order (right) and disorder (left).

Titan Simulation: WL-LSMS
• More than an 8x speedup on Titan compared to Jaguar (Cray XT5): from 1.84 PF to 14.5 PF
• Wang-Landau sampling allows for calculations at realistic temperatures (see the update rule below)

Preliminary Science Results
• Titan was necessary to calculate nickel's Curie temperature, a more complex calculation than iron
• Calculated a 50 percent larger phase space
• Four times faster on Titan than on a comparable CPU-only system (i.e., a Cray XE6)
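For context, WL-LSMS builds on the Wang-Landau method, which estimates the density of states g(E) directly rather than sampling at one fixed temperature. In its generic textbook form (not any OLCF-specific detail), the update rule is:

```latex
P(E_a \to E_b) = \min\!\left(1,\ \frac{g(E_a)}{g(E_b)}\right),
\qquad
\ln g(E_b) \leftarrow \ln g(E_b) + \ln f
```

The modification factor f is reduced (e.g., f → √f) each time the energy histogram becomes sufficiently flat; once g(E) has converged, thermodynamic quantities at any realistic temperature, such as the Curie point, follow from the partition function Z(T) = Σ_E g(E) e^(−E/kT).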

Application Power Efficiency of the Cray XK7: WL-LSMS for CPU-only and Accelerated Computing

Power consumption traces for identical WL-LSMS runs with 1024 Fe atoms on 18,561 Titan nodes (99% of Titan)

• Runtime is 8.6x faster for the accelerated code
• Energy consumed is 7.3x less (a consistency check follows below)
  – GPU-accelerated code consumed 3,500 kWh
  – CPU-only code consumed 25,700 kWh
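The two headline ratios are mutually consistent; since energy is average power times runtime:

```latex
\frac{E_{\mathrm{CPU}}}{E_{\mathrm{GPU}}} = \frac{25{,}700\ \mathrm{kWh}}{3{,}500\ \mathrm{kWh}} \approx 7.3,
\qquad
\frac{\bar{P}_{\mathrm{GPU}}}{\bar{P}_{\mathrm{CPU}}} = \frac{8.6}{7.3} \approx 1.2
```

That is, the accelerated run draws roughly 20% more average power but finishes 8.6x sooner, which is exactly where the 7.3x energy saving comes from.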

All Codes Will Need Rework at Scale!
• Up to 1-2 person-years are required to port each code from Jaguar to Titan
  – Takes work, but it is an unavoidable step on the way to exascale regardless of the type of processors; it comes from the required level of parallelism on the node
  – Also pays off for other systems: the ported codes often run significantly faster CPU-only (Denovo 2x, CAM-SE >1.7x)
• We estimate that possibly 70-80% of developer time is spent in code restructuring, regardless of whether OpenMP, CUDA, OpenCL, or OpenACC is used
• Each code team must make its own choice of OpenMP vs. CUDA vs. OpenCL vs. OpenACC based on its specific case; the conclusion may differ for each code (see the sketch after this list)
• Our users and their sponsors must plan for this work.
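As a hypothetical illustration of that choice (the loop, names, and sizes are invented, not from an OLCF code), here is the same offload written both ways: the OpenACC directive is shown in comments, the CUDA version in full.

```cuda
// Hypothetical illustration of the porting choice (invented loop).
// The OpenACC route expresses the same offload as a directive on the
// original loop:
//
//   #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
//   for (int i = 0; i < n; ++i) y[i] += a * x[i];
//
// The CUDA route makes the thread decomposition explicit:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void axpy(const double *x, double *y, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    cudaMallocManaged(&x, n * sizeof(double));
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }
    axpy<<<(n + 255) / 256, 256>>>(x, y, 3.0, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);   // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```

Either way, the restructuring around the loop, making it truly parallel and keeping data resident on the device, is the 70-80% of the effort that both paths share.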

More Lessons Learned

• Science codes are under active development; porting to GPU can mean pursuing a "moving target," which is challenging to manage
• Heterogeneous architectures can make previously infeasible or inefficient models and implementations viable
• More available FLOPS on the node should lead us to think of the new science opportunities enabled, e.g., more degrees of freedom per grid cell
• We may need to look to new ideas to get the additional ~30x thread parallelism that may be needed for exascale, e.g., parallelism in time, uncertainty quantification, design of experiments

Three primary ways for access to LCF: distribution of allocable hours (5.8 billion core-hours in CY2014)

• INCITE (60%): Leadership-class computing. INCITE seeks computationally intensive, large-scale research and/or development projects with the potential to significantly advance key areas in science and engineering.
• ASCR Leadership Computing Challenge (up to 30%): DOE/SC capability computing.
• Director's Discretionary (10%)

2014 INCITE award statistics
• A Request for Information helped attract new projects
• Call closed June 28, 2013
• Total requests: ~14 billion core-hours
• Awards: 5.8 billion core-hours for CY 2014
• 59 projects awarded, of which 21 are renewals

Figure: PIs by affiliation (awards).

Acceptance rates
• 36% of nonrenewal submittals
• 91% of renewals

Contact information: Julia C. White, INCITE Manager, [email protected]

Conclusions

• Leadership computing is for the critically important problems that need the most powerful compute and data infrastructure
• Accelerated, hybrid-multicore computing solutions are performing well on real, complex scientific applications
  – But you must work to expose the parallelism in your codes
  – This refactoring of codes is largely common to all massively parallel architectures
• OLCF resources are available to industry, academia, and labs through open, peer-reviewed allocation mechanisms

Acknowledgements

OLCF-3 CAAR Team: Bronson Messer, Wayne Joubert, Mike Brown, Matt Norman, Markus Eisenbach, Ramanan Sankaran
OLCF-3 Vendor Partners: Cray, AMD, NVIDIA, CAPS, Allinea
OLCF Users: Jeremy Smith (UT/ORNL), Masako Yamada (GE)
Mike Matheson (ORNL) for visualizations

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Questions?
Contact us:
http://olcf.ornl.gov
http://jobs.ornl.gov
[email protected]
