GPU Clusters for HPC

Bill Kramer, Director, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

National Center for Supercomputing Applications: 30 years of leadership

• NCSA
  • R&D unit of the University of Illinois at Urbana-Champaign
  • One of the original five NSF-funded supercomputing centers
  • Mission: provide state-of-the-art computing capabilities (hardware, software, HPC expertise) to the nation's scientists and engineers
• The numbers
  • Approximately 200 staff (160+ technical/professional staff)
  • Approximately 15 graduate students (plus the new SPIN program), 15 undergraduate students
  • Two major facilities (NCSA Building, NPCF)
  • Operating NSF's most powerful computing system: Blue Waters
  • Managing NSF's national cyberinfrastructure: XSEDE

Source: Thom Dunning

Petascale Computing Facility: Home to Blue Waters

• Blue Waters
  • 13 PF, 1,500 TB, 300 PB
  • >1 PF sustained on real applications: NAMD, MILC, WRF, PPM, NWChem, etc.

• Modern data center
  • 90,000+ ft² total
  • 30,000 ft² raised floor
  • 20,000 ft² machine room gallery
• Energy efficiency
  • LEED certified Gold
  • Power Utilization Efficiency (PUE) = 1.1–1.2 (defined below)
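For reference, PUE (styled here as Power Utilization Efficiency, more commonly Power Usage Effectiveness) is the ratio of total facility power to the power delivered to the computing equipment, so the quoted range means only 10-20% of the facility's power goes to cooling and distribution overhead:

```latex
\mathrm{PUE} = \frac{P_{\text{total facility}}}{P_{\text{IT equipment}}},
\qquad
\mathrm{PUE} = 1.1\text{--}1.2 \;\Longrightarrow\; \text{overhead} = \mathrm{PUE} - 1 = 10\text{--}20\%
```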

Source: Thom Dunning

Data Intensive Computing

• Personalized medicine with Mayo Clinic
• LSST, DES

Source: Thom Dunning

NCSA's Industrial Partners

Source: Thom Dunning

NCSA, NVIDIA and GPUs

• NCSA and NVIDIA have been partners for over a decade, building expertise, experience and technology
• The efforts were at first exploratory and small-scale, but have now blossomed into providing the largest GPU production resource in US academic cyberinfrastructure
• Today, we are focused on helping world-class science and engineering teams decrease their time to insight for some of the world's most important and challenging computational and data-analysis problems

Original Blue Waters Goals

• Deploy a computing system capable of sustaining one petaflops or more for a broad range of applications
  • The Cray system achieves this goal using well-defined metrics
• Enable the science teams to take full advantage of the sustained-petascale computing system
  • The Blue Waters team has established strong partnerships with the science teams, helping them improve the performance and scalability of their applications
• Enhance the operation and use of the sustained-petascale system
  • The Blue Waters team is developing tools, libraries and other system software to aid operation of the system and to help scientists and engineers make effective use of it
• Provide a world-class computing environment for the petascale computing system
  • The NPCF is a modern, energy-efficient data center with a rich WAN environment (100-400 Gbps) and data archive (>300 PB)
• Exploit advances in innovative computing technology
  • The proposal anticipated the rise of heterogeneous computing and planned to help the computational community transition to new modes of computational and data-driven science and engineering

Blue Waters Computing System

[System diagram]
• Aggregate memory: 1.6 PB
• IB switch: >1 TB/s to online storage (Sonexion: 26 usable PB)
• Near-line storage (Spectra Logic): 300 usable PB
• External servers on a 10/40/100 Gb Ethernet switch
• Link bandwidths: 100 GB/s and 120+ Gb/s
• WAN: 100-300 Gbps

Details of Blue Waters

Computation by Discipline on Blue Waters

[Pie chart: Actual Usage by Discipline]
• Particle Physics: 25.9%
• Biology: 23.6%
• Astronomy and Astrophysics: 17.8%
• Atmospheric and Climate Sciences: 10.4%
• Chemistry: 6.5%
• Materials Science: 5.1%
• Fluid Systems: 3.3%
• Earth Sciences: 2.0%
• Geophysics: 1.3%
• Nuclear Physics: 0.7%
• Computer Science: 0.5%
• Engineering: 0.05%
• Mechanical and Dynamic Systems: 0.03%
• Humanities: 0.0002%
• STEM Education and Social Sciences: the remaining ~2.8%

XK7 Usage by NSF PRAC Teams - A Behavior Experiment - First Year

• An observed experiment: teams self-select which type of node is most useful
• First year of usage

[Chart: XK7 node usage by PRAC team, ordered by increasing allocation size. Early XK7 adopters were dominated by molecular dynamics codes (Amber, NAMD/VMD, Gromacs) and lattice QCD (MILC and Chroma).]

Production Computational Science with XK nodes

• The Computational Microscope
• PI: Klaus Schulten
• Simulated the flexibility of the ribosome-trigger factor complex at full length and obtained a better starting configuration of the trigger factor model (simulated to 80 ns)
• A 100 ns simulation of the cylindrical HIV 'capsule' of CA proteins revealed that it is stabilized by hydrophobic interactions between CA hexamers; maturation involves detailed remodeling rather than disassembly/re-assembly of the CA lattice, as had been proposed
• A 200 ns simulation of a CA pentamer surrounded by CA hexamers suggested that the interfaces in hexamer-hexamer and hexamer-pentamer pairings involve different patterns of interactions
• Simulated the photosynthetic membrane of a chromatophore in the bacterium Rps. photometricum for 20 ns; a simulation of a few hundred nanoseconds will be needed

Images from Klaus Schulten and John Stone, University of Illinois at Urbana-Champaign

Production Computational Science with XK nodes

• Lattice QCD on Blue Waters
• PI: Robert Sugar, University of California, Santa Barbara
• The USQCD Collaboration, which consists of nearly all of the high-energy and nuclear physicists in the United States working on the numerical study of quantum chromodynamics (QCD), will use Blue Waters to study the theory of the strong interactions of sub-atomic physics, including simulations at the physical masses of the up and down quarks, the two lightest of the six quarks that are the fundamental constituents of strongly interacting matter

Production Computational Science with XK nodes

• Hierarchical sampling for assessing pathways and free energies of RNA catalysis, ligand binding, and conformational change
• PI: Thomas Cheatham, University of Utah
• Attempting to decipher the full landscape of RNA structure and function
• Challenging because:
  • modeling RNA requires capturing the flexibility and the subtle balance between charge, stacking and other molecular interactions
  • the structure of RNA is highly sensitive to its surroundings, and RNA can adopt multiple functionally relevant conformations
• Goal: fully map out the conformational, energetic and chemical landscape of RNA
• "Essentially we are able to push enhanced sampling methodologies for molecular dynamics simulation, specifically replica-exchange, to complete convergence for conformational ensembles (which hasn't really been investigated previously) and perform work that normally would take 6 months to years in weeks. This is critically important for validating and assessing the force fields for nucleic acids." - Cheatham
• The replica-exchange acceptance rule he refers to is sketched below

Images courtesy of T. Cheatham
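As background (a generic statement of the method, not specific to Cheatham's runs): in temperature replica-exchange MD, copies of the system run at a ladder of temperatures, and neighboring replicas periodically attempt to swap configurations with a Metropolis test, letting low-temperature replicas escape local minima:

```latex
P_{\mathrm{acc}}(i \leftrightarrow j)
  = \min\!\left\{1,\; \exp\!\left[(\beta_i - \beta_j)(E_i - E_j)\right]\right\},
\qquad \beta = \frac{1}{k_B T}
```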

Most Recent Computational Use of XK nodes

[Bar chart: Teams with both XE and XK usage, July 1, 2014 to Sept 30, 2014. Y-axis: Node*Hours, 0 to 9,000,000; series: Total Node*hrs, XK Node Hrs, XE Node Hrs. Teams shown (by PI): Chemla, Sugar, Aksimentiev, Lazebnik, Voth, Kasson, Thomas, Bernholc, Jongeneel, Jordan, Schulten, Ott, Cheatham, Woosley, Makri, Hirata, Aluru, Elghobashi, Woodward, Karimabadi, Shapiro, Lusk, Tajkhorshid, Tomko, Glotzer, Yeung, Beltran, Mori, Pande, Fields.]


Evolving XK7 Use on BW - Major Advance in Understanding of Collisionless Plasmas Enabled through Petascale Kinetic Simulations

• PI: Homayoun Karimabadi, University of California, San Diego
• Major results to date:
  • Global fully kinetic simulations of magnetic reconnection
  • First large-scale 3D simulations of decaying collisionless plasma turbulence
  • 3D global hybrid simulations addressing coupling between shock physics and magnetosheath turbulence

• Fully kinetic simulation (all species kinetic; code: VPIC): up to ~10^10 cells, up to ~4x10^12 particles, ~120 TB of memory, ~10^7 CPU-hours
• Large-scale hybrid kinetic simulation (kinetic ions + fluid electrons; codes: H3D, HYPERES): up to ~1.7x10^10 cells, up to ~2x10^12 particles, ~130 TB of memory, up to ~500,000 cores

Slide courtesy of H. Karimabadi

Evolving XK7 Use on BW - Petascale Particle-in-Cell Simulations of Kinetic Effects in Plasmas

• PI: Warren Mori; presenter: Frank Tsung
• Use six parallel particle-in-cell (PIC) codes to investigate four key science areas:
  • Can fast ignition be used to develop inertial fusion energy?
  • What is the source of the most energetic particles in the cosmos?
  • Can plasma-based acceleration be the basis of new compact accelerators for use at the energy frontier, in medicine, in probing materials, and in novel light sources?
  • What processes trigger substorms in the magnetotail?
• Evaluating new particle-in-cell (PIC) algorithms on the GPU and comparing to the standard implementation (see the sketch below)
• Electromagnetic case: 2-1/2D EM benchmark with a 2048x2048 grid, 150,994,944 particles (36 particles/cell); optimal block size = 128, optimal tile size = 16x16; single precision; Fermi M2090 GPU
• First result: OSIRIS sustained 2 PF on BW
• The complex interaction could not be understood without the simulations performed on BW

Image and information courtesy of Warren Mori and Frank Tsung
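To make the benchmark concrete, below is a minimal CUDA sketch of the scatter (charge-deposition) phase of a 2D PIC step, with one thread per particle and the benchmark's 128-thread block size. It is an illustrative toy under stated assumptions, not the UCLA PIC codes or OSIRIS: the demo grid size, the random particle initialization, and the use of plain global atomicAdd (instead of the tile-ordered, shared-memory accumulation the new GPU algorithms use) are choices made here for brevity.

```cuda
// Toy 2-D cloud-in-cell (CIC) charge deposition: the scatter phase of a
// PIC time step. One thread per particle, 128-thread blocks as in the
// M2090 benchmark above. Production GPU PIC codes keep particles ordered
// by cell tiles and accumulate in shared memory to reduce atomic traffic.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Demo-sized grid; the benchmark on the slide used 2048x2048 cells with
// 36 particles/cell (150,994,944 particles).
constexpr int NX = 256, NY = 256, PPC = 36;

__global__ void deposit(const float* px, const float* py, float* rho, int np)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;

    float x = px[i], y = py[i];              // position in grid units
    int ix = min((int)x, NX - 1);            // cell index, guarded against
    int iy = min((int)y, NY - 1);            // float rounding at the edge
    float fx = x - ix, fy = y - iy;          // fractional offsets in cell

    // Bilinear weights onto the four surrounding grid nodes; the weights
    // sum to 1, so total deposited charge equals the particle count.
    // atomicAdd resolves write collisions between particles in a cell.
    atomicAdd(&rho[ iy      * (NX + 1) + ix    ], (1.f - fx) * (1.f - fy));
    atomicAdd(&rho[ iy      * (NX + 1) + ix + 1],  fx        * (1.f - fy));
    atomicAdd(&rho[(iy + 1) * (NX + 1) + ix    ], (1.f - fx) *  fy);
    atomicAdd(&rho[(iy + 1) * (NX + 1) + ix + 1],  fx        *  fy);
}

int main()
{
    const int np = NX * NY * PPC;            // 36 particles per cell
    const int nn = (NX + 1) * (NY + 1);      // node-centered charge array

    float *px = (float*)malloc(np * sizeof(float));
    float *py = (float*)malloc(np * sizeof(float));
    for (int i = 0; i < np; ++i) {           // uniform random positions
        px[i] = (float)(rand() / (RAND_MAX + 1.0) * NX);
        py[i] = (float)(rand() / (RAND_MAX + 1.0) * NY);
    }

    float *d_px, *d_py, *d_rho;
    cudaMalloc(&d_px, np * sizeof(float));
    cudaMalloc(&d_py, np * sizeof(float));
    cudaMalloc(&d_rho, nn * sizeof(float));
    cudaMemcpy(d_px, px, np * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_py, py, np * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_rho, 0, nn * sizeof(float));

    const int block = 128;                   // benchmark's optimal block size
    deposit<<<(np + block - 1) / block, block>>>(d_px, d_py, d_rho, np);

    float *rho = (float*)malloc(nn * sizeof(float));
    cudaMemcpy(rho, d_rho, nn * sizeof(float), cudaMemcpyDeviceToHost);
    double total = 0.0;                      // should be close to np
    for (int i = 0; i < nn; ++i) total += rho[i];
    printf("deposited %.0f of %d particle charges\n", total, np);

    cudaFree(d_px); cudaFree(d_py); cudaFree(d_rho);
    free(px); free(py); free(rho);
    return 0;
}
```

In the tile-based algorithms the slide is benchmarking, a tile (here 16x16 cells) is small enough that a thread block can accumulate its tile's charge in shared memory and flush it to global memory once, replacing most global atomics with much cheaper shared-memory ones, at the cost of periodically re-sorting particles by tile.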

Evolving XK7 Use on BW - Comparison of 1D and 3D CyberShake Models for the Los Angeles Region

[Maps: BBP-1D vs CVM-S4.26 hazard models, with callouts 1-4]

1. Lower near-fault intensities due to 3D scattering
2. Much higher intensities in near-fault basins
3. Higher intensities in the Los Angeles basins
4. Lower intensities in hard-rock areas

Slide courtesy of T. Jordan - SCEC

XK7 for Visualization on Blue Waters

• Many visualization utilities rely on the OpenGL API for hardware-accelerated rendering
  • Unsupported by the default XK7 system software
• Enabling NVIDIA's OpenGL required that we:
  • Change the operating mode of the XK7 GPU firmware
  • Develop a custom X11 stack
  • Work with Cray to acquire an alternate driver package from NVIDIA
• Blue Waters was the first Cray system to offer this functionality, which has now been distributed to other systems

Impact: VMD

• Molecular dynamics analysis and visualization tool used by "The Computational Microscope" science team (PI Klaus Schulten)
• 10x to 50x rendering speedup in VMD
• Interactive-rate visualization
• Drastic reduction in the time required to fine-tune parameters for production visualization

Impact of Integrated System Reduces Data Movement

• Computational fluid dynamics volume renderer used by the "Petascale Simulation of Turbulent Stellar Hydrodynamics" science team (PI Paul R. Woodward)

Visualization created on Blue Waters:
• 10,560³-grid inertial confinement fusion (ICF) calculation (26 TB)
• 13,688 frames at 2048x1080 pixels
• 711-frame stereo movie (2 views) at 4096x2160 pixels
• Total rendering time: 24 hours
• Estimated time just to ship the data to the team's remote site, where they had previously been doing visualization (no rendering): 15 days
• 20-30x improvement in time to insight (rough check below)
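A rough consistency check on those figures (assuming the 15-day estimate reflects sustained end-to-end transfer throughput, an assumption of mine): shipping 26 TB in 15 days implies only about 160 Mb/s of effective bandwidth, and rendering in place turns a 15-day transfer plus remote rendering into a 24-hour render, consistent with the quoted 20-30x:

```latex
r_{\text{eff}} = \frac{26\ \text{TB} \times 8\ \text{bits/byte}}{15 \times 86{,}400\ \text{s}}
  \approx 1.6 \times 10^{8}\ \text{bit/s} \approx 160\ \text{Mb/s},
\qquad
\frac{15\ \text{days}}{24\ \text{hours}} = 15\times
```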

Summary

• UIUC, NCSA and NVIDIA have had a very strong partnership for some time
• NCSA has helped move GPU computing into the mainstream for several discipline areas
  • Molecular dynamics, particle physics, seismic, ...
• NCSA is leading innovation in the use of GPUs for grand challenges
• Blue Waters has unique capabilities for computation and data analysis
• There is still much work to do to make GPU processing a standard way of doing real computational science and modeling for all disciplines

Backup / Other Slides

Science areas, teams, codes, and algorithm classes used (from: structured grids, unstructured grids, dense matrix, sparse matrix, N-body, Monte Carlo, FFT, PIC, significant I/O):

• Climate and Weather (3 teams): CESM, GCRM, CM1/WRF, HOMME
• Plasmas/Magnetosphere (2 teams): H3D(M), VPIC, OSIRIS, Magtail/UPIC
• Stellar Atmospheres and Supernovae (5 teams): PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS
• Cosmology (2 teams): Enzo, pGADGET
• Combustion/Turbulence (2 teams): PSDNS, DISTUF
• General Relativity (2 teams): Cactus, Harm3D, LazEV
• Molecular Dynamics (4 teams): AMBER, Gromacs, NAMD, LAMMPS
• Quantum Chemistry (2 teams): SIAL, GAMESS, NWChem
• Material Science (3 teams): NEMOS, OMEN, GW, QMCPACK
• Earthquakes/Seismology (2 teams): AWP-ODC, HERCULES, PLSQR, SPECFEM3D
• Quantum Chromodynamics (1 team): Chroma, MILC, USQCD
• Social Networks (1 team): EPISIMDEMICS
• Evolution (1 team): Eve
• Engineering/System of Systems (1 team): GRIPS, Revisit
• Computer Science (1 team)

Blue Waters Symposium

• May 12-15, after the first year of full service
• https://bluewaters.ncsa.illinois.edu/symposium-2014-schedule
• About 180 people attended, over 120 from outside Illinois
• 54 individual science talks

Climate - courtesy of Don Wuebbles

Petascale Simulations of Complex Biological Behavior in Fluctuating Environments

• Project PI: Ilias Tagkopoulos, University of California, Davis
• Simulated 128,000 organisms
• Previous best was 200 (on a Blue Gene system)

Image and Information courtesy of Ilias Tagkopoulos

Selected Highlights

• PI: Keith Bisset, Network Dynamics and Simulation Science Laboratory, Virginia Tech
• Simulated 280 million people (the US population) for 120 days on 352,000 cores (11,000 nodes) of Blue Waters
• The simulation took 12 seconds
• Estimated that the world population would take 6-10 minutes per scenario (a linear-scaling check follows below)
• Emphasized that a realistic assessment of disease threat would require many such runs

Image and Information courtesy of K Bisset
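A back-of-envelope check on that extrapolation, assuming roughly linear scaling in population size and a 2014 world population of about 7.2 billion (both assumptions mine, not from the slide):

```latex
\frac{7.2 \times 10^{9}}{2.8 \times 10^{8}} \times 12\ \text{s}
  \approx 26 \times 12\ \text{s} \approx 5\ \text{minutes}
```

which sits just below the quoted 6-10 minutes; the gap is consistent with worse-than-linear communication costs at larger scale.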

P.K. Yeung - DNS Turbulence - Topology

8,192³ grid points - about 0.5 trillion

Slide courtesy of P. K. Yeung

Inference Spiral of System Science (PI T. Jordan)

Jordan et al. (2010)

• As models become more complex and new data bring in more information, we require ever-increasing computational resources

Slide courtesy of T. Jordan - SCEC

CyberShake Time-to-Solution Comparison - Los Angeles Region Hazard Models (1144 sites)

| CyberShake Application Metrics (hours) | 2008 (Mercury, normalized) | 2009 (Ranger, normalized) | 2013 (Blue Waters / Stampede) | 2014 (Blue Waters) |
|---|---|---|---|---|
| Application Core Hours | 19,488,000 (CPU) | 16,130,400 (CPU) | 12,200,000 (CPU) | 15,800,000 (CPU+GPU) |
| Application Makespan | 70,165 | 6,191 | 1,467 | 342 |

4.2x quicker time to insight
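The makespan row gives the quoted speedup directly (the 4.2x presumably comes from unrounded values); the same arithmetic against the 2008 baseline shows the longer-term gain:

```latex
\frac{1{,}467\ \text{h}}{342\ \text{h}} \approx 4.3,
\qquad
\frac{70{,}165\ \text{h}}{342\ \text{h}} \approx 205
```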

| Metric | 2013 (Study 13.4) | 2014 (Study 14.2) |
|---|---|---|
| Simultaneous processors | 21,100 (CPU) | 46,720 (CPU) + 160 (GPU) |
| Concurrent workflows | 5.8 | 26.2 |
| Job failure rate | 2.6% | 1.3% |
| Data transferred | 57 TB | 12 TB |

Slide courtesy of T. Jordan - SCEC

OTHER FUN DATA

Q1 2014 XE Scale

• 128,000 integer cores

Q1 2014 XK Scale
