Accelerating Research and Development Using the Titan Supercomputer
Fernanda Foertter, HPC User Support, ORNL
ORNL is managed by UT-Battelle for the US Department of Energy

What is the Leadership Computing Facility (LCF)?
• Collaborative DOE Office of Science program at ORNL and ANL
• Mission: Provide the computational and data resources required to solve the most challenging problems
• 2 centers / 2 architectures to address the diverse and growing computational needs of the scientific community
• Highly competitive user allocation programs (INCITE, ALCC)
• Projects receive 10x to 100x more resource than at other generally available centers
• LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts)

The OLCF has delivered five systems and six upgrades to our users since 2004
• Increased our system capability by 10,000x
• Strong partnerships with computer designers and architects
• Worked with users to scale codes by 10,000x
• Science delivered through strong user partnerships to scale codes and algorithms
• System timeline:
– 2004: Phoenix X1 (doubled in size; upgraded to X1e in 2005)
– 2005: Jaguar XT3 (dual-core upgrade)
– 2007: Jaguar XT4 (quad-core upgrade)
– 2008: Jaguar XT5 (6-core upgrade)
– 2012: Titan XK7 (GPU upgrade)

Science breakthroughs at the OLCF: selected science and engineering advances over the period 2003-2013
• Astrophysicists discover supernova shock-wave instability, Astrophys. J. (2003); 116 citations
• Researchers solved the 2D Hubbard model and presented evidence that it predicts HTSC behavior, Phys. Rev. Lett. (2005); 105 citations as of 3/2014
• First-principles flame simulation provides crucial information to guide design of fuel-efficient clean engines, Proc. Combust. Inst. (2007); 78 citations as of 3/2014
• Largest simulation of a galaxy's worth of dark matter showed for the first time the fractal-like appearance of dark matter substructures, Nature (2008); 326 citations as of 3/2014
• World's first continuous simulation of 21,000 years of Earth's climate history, Science (2009); 254 citations as of 3/2014
• Biomass as a viable, sustainable feedstock for hydrogen production for fuel cells, Nano Letters (2011) and J. Phys. Chem. Lett. (2010); 71 and 74 citations, respectively
• Demonstrated that three-body forces are necessary to describe the long lifetime of 14C, Phys. Rev. Lett. (2011); 28 citations as of 3/2014
• Calculation of the number of bound nuclei in nature, Nature (2012); 36 citations as of 3/2014
• Global warming preceded by increasing carbon dioxide concentrations during the last deglaciation, Nature (2012); 64 citations as of 3/2014
• MD simulations show the selectivity filter of a trans-membrane ion channel is sterically locked open by hidden water molecules, Nature (2013)

No more free lunch
• Herb Sutter, Dr. Dobb's Journal: http://www.gotw.ca/publications/concurrency-ddj.htm

Power is THE problem
• Power consumption of the 2.3 PF (peak) Jaguar: 7 megawatts, equivalent to that of a small city (5,000 homes)

Using traditional CPUs is not economically feasible
• A 20 PF+ system built from traditional CPUs would draw roughly 30 megawatts (30,000 homes); Jaguar runs at roughly 3 MW per peak petaflop, so even a generational improvement in CPU efficiency leaves the power bill untenable

Why GPUs? Hierarchical parallelism
High performance and power efficiency on the path to exascale:
• Hierarchical parallelism improves scalability of applications
• Expose more parallelism through code refactoring and source code directives
– Doubles performance of many codes
• Heterogeneous multicore processor architecture: use the right type of processor for each task
– CPU: optimized for sequential multitasking
– GPU accelerator: optimized for many simultaneous tasks; 10x the performance per socket; 5x more energy-efficient systems
• Data locality: keep data near the processing
– GPU has high bandwidth to local memory for rapid access
– GPU has a large internal cache
• Explicit data management: explicitly manage data movement between CPU and GPU memories (see the sketch below)
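To make these two ideas concrete, here is a minimal CUDA sketch; it is not drawn from any Titan application, and the kernel, array size, and data are illustrative assumptions. It shows the hierarchical grid-of-thread-blocks decomposition and the explicit movement of data between CPU and GPU memories.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one element of x.
// The grid/block decomposition is the hierarchical parallelism
// described above: many blocks of many threads each.
__global__ void scale(double *x, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;                  // placeholder problem size
    size_t bytes = n * sizeof(double);
    double *h = (double *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0;

    // Explicit data management: allocate device memory and copy the
    // data once, then keep it resident for repeated kernel launches.
    double *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                        // threads per block
    int blocks = (n + threads - 1) / threads; // blocks per grid
    scale<<<blocks, threads>>>(d, 2.0, n);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);              // expect 2.0
    cudaFree(d);
    free(h);
    return 0;
}
```

Keeping the array resident on the device between launches is what the data-locality bullet above refers to: the host-device copies are the expensive part, so they should happen as rarely as possible.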
Titan by the numbers
• #2 on the Top500 list
• 8.2 megawatts
• 27 PFlops (peak)
• 17.59 PFlops (Linpack)

Roadmap to Exascale
Our science requires that we advance computational capability 1000x over the next decade:
• 2012 – Titan (hybrid GPU/CPU): 27 PF, 600 TB DRAM
• 2017 – OLCF-4: 100-250 PF, 4,000 TB memory
• 2022 – OLCF-5: 1 EF, 20 MW
What are the challenges? Chief among them: holding system power near 20 MW, and resilience (on the order of 6 days between interrupts).

Requirements gathering
DOE/SC and the LCFs support a diverse user community:
• Science benefits and impact of future systems are examined on an ongoing basis
• LCF staff have been actively engaged in community assessments of future computational needs and solutions
• Computational science roadmaps are developed in collaboration with leading domain scientists
• Detailed performance analyses are conducted for applications to understand future architectural bottlenecks
• Analysis of INCITE, ALCC, Early Science, and Center for Accelerated Application Readiness (CAAR) project history and trends
Science categories and represented research areas:
• Life Sciences – Bioinformatics, Biophysics, Biology, Medical Science, Neuroscience, Proteomics, Systems Biology
• Chemistry – Chemistry, Physical Chemistry
• Computer Science – Computer Science
• Earth Science – Climate, Geosciences
• Engineering – Aerodynamics, Bioenergy, Combustion, Turbulence
• Fusion Energy – Fusion Plasma Physics
• Materials – Materials Science, Nanoelectronics, Nanomechanics, Nanophotonics, Nanoscience
• Nuclear Energy – Nuclear Fission, Nuclear Fuel Cycle
• Physics – Accelerator Physics, Astrophysics, Atomic/Molecular Physics, Condensed Matter Physics, High Energy Physics, Lattice Gauge Theory, Nuclear Physics, Solar/Space Physics

Requirements process
• Surveys are a "lagging indicator" that tend to tell us what problems the users are seeing now, not what they expect to see in the future
• https://www.olcf.ornl.gov/wp-content/uploads/2013/01/OLCF_Requirements_TM_2013_Final.pdf

OLCF User Requirements Survey – key findings
• Memory bandwidth was reported as the greatest need
• Local memory capacity was not a driver for most users, perhaps in recognition of cost trends
• 76% of users said there is still a moderate to large amount of parallelism to extract in their code, but...
• 85% of users rated the difficulty of extracting that parallelism as moderate to difficult; it often requires application refactoring
– Highlights training needs and community-based efforts for application readiness
Rankings from OLCF users (1 = not important, 5 = very important):
• Memory Bandwidth – 4.4
• Flops – 4.0
• Interconnect Bandwidth – 3.9
• Archival Storage Capacity – 3.8
• Interconnect Latency – 3.7
• Disk Bandwidth – 3.7
• WAN Network Bandwidth – 3.7
• Memory Latency – 3.5
• Local Storage Capacity – 3.5
• Memory Capacity – 3.2
• Mean Time to Interrupt – 3.0
• Disk Latency – 2.9

Center for Accelerated Application Readiness (CAAR)
• We created CAAR as part of the Titan project to help prepare applications for accelerated architectures
• Goals:
– Work with code teams to develop and implement strategies for exposing hierarchical parallelism in our users' applications
– Maintain code portability across modern architectures
– Learn from and share our results
• We selected six applications from across different science domains and algorithmic motifs

CAAR Plan
• Comprehensive team assigned to each app:
– OLCF application lead
– Cray engineer
– NVIDIA developer
– Others: other application developers, local tool/library developers, computational scientists
• Single early-science problem targeted for each app
– Success on this problem is the ultimate metric for success
• Particular plan of attack differs for each app:
– WL-LSMS – dependent on accelerated ZGEMM (see the sketch after this list)
– CAM-SE – pervasive and widespread custom acceleration required
• Multiple acceleration methods explored:
– WL-LSMS – CULA, MAGMA, custom ZGEMM
– CAM-SE – CUDA, directives
• Two-fold aim:
– Maximum acceleration for the model problem
– Determination of an optimal, reproducible acceleration path for other applications
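As a concrete illustration of the WL-LSMS dependency above: the team explored CULA, MAGMA, and a custom ZGEMM, while the sketch below uses cuBLAS's zgemm as a stand-in. It is not OLCF code; the matrix dimension and the zeroed contents are placeholder assumptions, kept trivial so the program runs end to end.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 512;  // hypothetical matrix dimension
    size_t bytes = (size_t)n * n * sizeof(cuDoubleComplex);

    // Allocate device matrices and zero them (placeholder data).
    cuDoubleComplex *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemset(dA, 0, bytes);
    cudaMemset(dB, 0, bytes);
    cudaMemset(dC, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C, double-complex, on the GPU.
    cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaDeviceSynchronize();
    printf("ZGEMM done\n");

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Build with something like nvcc zgemm_sketch.cu -lcublas. The slide's point is that WL-LSMS's acceleration path hinged on this single dense complex matrix-matrix multiply, which is why offloading just that routine was a viable strategy.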
Early Science Challenges for Titan
• WL-LSMS – Illuminating the role of material disorder, statistics, and fluctuations in nanoscale materials and systems
• LAMMPS – Molecular dynamics simulation of organic polymers for applications in organic photovoltaic heterojunctions, de-wetting phenomena, and biosensors
• CAM-SE – Answering questions about specific climate change adaptation and mitigation scenarios; realistically representing features like precipitation patterns/statistics and tropical storms
• S3D – Understanding turbulent combustion through direct numerical simulation with complex chemistry
• Denovo – Discrete ordinates radiation transport calculations that can be used in a variety of nuclear energy and technology applications
• NRDF – Radiation transport, important in astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging, computed on AMR grids

Effectiveness of GPU acceleration?
OLCF-3 Early Science codes – performance on Titan, Cray XK7 vs. Cray XE6 (performance ratio*):
• LAMMPS (molecular dynamics) – 7.4
• S3D (turbulent combustion) – 2.2
• Denovo (3D neutron transport for nuclear reactors) – 3.8
• WL-LSMS (statistical mechanics of magnetic materials) – 3.8
Titan: Cray XK7 (Kepler GPU plus AMD 16-core Opteron CPU); Cray XE6: 2x AMD 16-core Opteron CPUs.
*Performance depends strongly on the specific problem size chosen.

Additional applications from community efforts
Current performance measurements on Titan, Cray XK7 vs. Cray XE6 (performance ratio*):
• AWP-ODC (seismology) – 2.1
• DCA++ (condensed matter physics) – 4.4
• QMCPACK (electronic structure) – 2.0
• RMG (DFT – real-space, multigrid; electronic structure) – 2.0
• XGC1 (plasma physics for fusion energy R&D) – 1.8
Titan: Cray XK7 (Kepler GPU plus AMD 16-core Opteron CPU); Cray XE6: 2x AMD 16-core Opteron CPUs.
*Performance depends strongly on the specific problem size chosen.
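The ratios above come from whole-application measurements on the two machines. As a hedged illustration of one ingredient of such measurements, the sketch below times a single GPU kernel with CUDA events; the kernel, sizes, and workload are placeholders, not OLCF code or methodology.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;  // placeholder workload
}

int main() {
    const int n = 1 << 24;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // CUDA events bracket the kernel so the elapsed time reflects
    // device execution rather than host-side launch overhead alone.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Running the same problem on the CPU-only system and dividing the two wall-clock times yields a ratio of the kind tabulated above, which is why the footnote matters: the ratio moves with the chosen problem size.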