Accelerating Computational Science Symposium 2012

Hosted by:
The Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory
The National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign
The Swiss National Supercomputing Centre at ETH Zurich

Scientists Assess a New Supercomputing Approach and Judge It a Win

Graphics processing units speed up research codes on hybrid computing platforms

Can scientists and engineers benefit from extreme-scale supercomputers that use application-code accelerators called GPUs (graphics processing units)? Comparing GPU accelerators with today’s fastest central processing units, early results from diverse areas of research show 1.5- to 3-fold speedups for most codes, which in turn increase the realism of simulations and decrease time to results. With the availability of new, higher-performance GPUs later this year, such as the Kepler GPU chip to be installed in the 20-petaflop Titan, application speedups are expected to be even more substantial.

An innovation disrupts the marketplace when it confers an undeniable advantage over earlier technologies. Developing a disruptive innovation—such as the telephone that so long ago replaced the telegraph—can be a high-risk, high-rewards venture. Early in a product’s life, it may be unclear whether customers will derive enough gain from a new technology to justify its developers’ pain. Then unprecedented success compels a community to embrace the innovation. With supercomputers, which must balance performance and energy consumption, hybrid architectures incorporate high-performance, energy-efficient scientific-application-code accelerators called graphics processing units (GPUs) as well as traditional central processing units (CPUs). For scientists who must beat competitors to discoveries in a publish-or-perish world or engineers who must invent and improve products for the global marketplace, the ability to more fully exploit massive parallelism is proving advantageous enough for GPU/CPU hybrids to displace CPU-only architectures.

The Accelerating Computational Science Symposium 2012 (ACSS 2012), held March 28–30 in Washington, D.C., brought together about 100 experts in science, engineering, and computing from around the world to share transformational advances enabled by hybrid supercomputers in diverse domains including chemistry, combustion, biology, nuclear fusion and fission, and seismology. Participants also discussed the new breadth and scope of research possible as the performance of petascale systems, which execute quadrillions of calculations per second, continues to increase. At the start of the symposium, neither the scientific community nor the government officials charged with providing supercomputing facilities to meet the needs of researchers knew whether hybrid computing architectures conferred incontestable benefits to the scientists and engineers using leadership-class computing systems to conduct digital experiments.

[Photo: Jack Wells, ORNL]

To understand the impact hybrid architectures are making on science, attendees heard from two dozen researchers who use a combination of traditional and accelerated processors to speed discoveries in fields from seismology to space and aid innovations such as next-generation catalysts, drugs, materials, engines, and reactors. The overwhelming consensus: GPUs accelerate a broad range of computationally intensive applications exploring the natural world, from subatomic particles to the vast cosmos, and the engineered world, from turbines to advanced fuels.

“The symposium [was] motivated by society’s great need for advances in energy technologies and by the demonstrated achievements and tremendous potential for computational science and engineering,” said meeting program chair Jack Wells, director of science at the Oak Ridge Leadership Computing Facility (OLCF), which co-hosted ACSS 2012 with the National Center for Supercomputing Applications (NCSA) and the Swiss National Supercomputing Centre (CSCS). Supercomputer vendor Cray Inc. and GPU inventor NVIDIA helped sponsor the event. “Computational science on extreme-scale hybrid computing architectures will advance research and development in this decade, increase our understanding of the natural world, accelerate innovation, and as a result, increase economic opportunity,” Wells said.

[Photo: Monte Rosa, the Swiss National Supercomputing Centre’s 402-teraflop XE6 Cray supercomputer, uses only CPUs. The entire computational-science community benefits from efforts to optimize CPU performance. Image credit: Michele De Lorenzi/CSCS]

Concurring was James Hack, director of the National Center for Computational Sciences, which houses the U.S. Department of Energy’s (DOE’s) Jaguar, the National Science Foundation’s (NSF’s) Kraken, and the National Oceanic and Atmospheric Administration’s (NOAA’s) Gaea supercomputers. “As a community we’re really at a point where we need to embrace this disruptive approach to supercomputing,” he said. “This meeting was an opportunity to catalyze a broad discussion of how extreme-scale hybrid architectures are already accelerating progress in computationally focused science research.” At its conclusion, attendees had a better idea of where the community stood in exploiting hybrid architectures.

[Photo: Tödi, the Swiss National Supercomputing Centre’s 137-teraflop XK6 Cray supercomputer, is the first GPU/CPU hybrid supercomputing system. Image credit: Michele De Lorenzi/CSCS]

The ACSS 2012 researchers presented results obtained with four high-performance systems employing different computing strategies. First, CSCS’s 402-teraflop XE6 Cray supercomputer, called Monte Rosa, at the Swiss Federal Institute of Technology in Zurich (ETH Zurich) uses two modern 16-core CPUs from AMD, named Interlagos, per compute node. Second, a two-cabinet XK6 supercomputer called Tödi, the first hybrid system from Cray with Interlagos CPUs and NVIDIA Fermi GPUs, has been in operation at CSCS since fall 2011. Third, NCSA’s Blue Waters is a hybrid Cray system under construction at the University of Illinois at Urbana-Champaign that gives more weight to CPUs than GPUs. It will have only the Interlagos CPUs in 235 XE6 cabinets and both CPUs and the NVIDIA Fermi GPUs in 30 XK6 cabinets. A testbed of 48 XE6 cabinets has been operational since mid-March. Fourth, the OLCF’s 200-cabinet Jaguar supercomputer, America’s fastest, is undergoing an upgrade at its Oak Ridge National Laboratory (ORNL) home that will transform it into a mostly hybrid computational giant called Titan. So far the upgrade has converted the Cray XT5 system, an all-CPU model with an older, less-capable version of the AMD processor, into a Cray XK6 system with Interlagos CPUs, twice the memory of its predecessor, and a more capable network. After the cabinets have been equipped with NVIDIA’s newest Kepler GPUs by the end of 2012, the peak performance will be at least 20 petaflops. In the meantime, ten of Jaguar’s cabinets have already gone hybrid to create a testbed system called TitanDev.

ALCC 2012 Workshop Report

Compute Node Architectures of Select Cray Supercomputers

                                XT5   XE6   XK6 (w/o GPU)   XK6   Next-Gen XK
AMD Istanbul CPU (6 cores)       2     -          -          -         -
AMD Interlagos CPU (16 cores)    -     2          1          1         1
NVIDIA Fermi GPU                 -     -          -          1         -
NVIDIA Kepler GPU                -     -          -          -         1

[Photo: The National Center for Supercomputing Applications’ Blue Waters, under construction at the University of Illinois at Urbana-Champaign, is a hybrid Cray system that gives more weight to CPUs than GPUs. Image courtesy of the National Center for Supercomputing Applications and the Board of Trustees of the University of Illinois]

At the start of ACSS 2012, managers of extreme-scale supercomputing centers and program managers from DOE’s Office of Science wanted to know if researchers would rather use an all-CPU XE6 or a hybrid GPU/CPU XK6. The results presented at the meeting laid that question to rest. Most scientific application codes showed a 1.5- to 3-fold speedup on the GPU-equipped XK6 compared to the CPU-only XE6, with evidence of a 6-fold speedup for one code. The greater computing speed makes possible simulations of greater complexity and realism and quicker solutions to computationally demanding problems. “The Fermi [GPU from NVIDIA] delivers higher performance than Interlagos [CPU from AMD],” said meeting co-chair Thomas Schulthess, who is head of CSCS and a professor at ETH Zurich. “And with Fermi we’re talking about a GPU that is 1 or 2 years old compared to Interlagos, which is a very new processor that just came out.”

The scientific results presented at the symposium were way ahead of what the high-performance computing community had expected, Schulthess said. Scientists from diverse fields of study were able to demonstrate impressive results using hybrid architectures.

Added Schulthess: “Given the applications that we’ve seen and the benchmarks that were representative of real science runs, I haven’t seen the case where the GPU doesn’t provide a performance increase. Really, it makes no sense not to have the GPU.”

[Photo: Thomas Schulthess, ETH Zurich and CSCS]

“The XK architecture is looking like it will be a big success,” Schulthess opined. “It’s going to be interesting to see in the next few years if there’s going to be a small avalanche, or is a big avalanche coming that’s really going to revolutionize computational science. I think for these reasons we really want to invest in hybrid computing from an application developer point of view.”

Work remains to determine if the payoff for other applications and algorithms is great enough to develop and port codes to hybrid, multicore systems. “What we’re watching is a bunch of pioneers, people who have taken the first steps in embracing and adopting what amounts to an architectural sea change,” Hack said. “Compared to the past when we made major architectural changes, this one is different because the vendors are part of the solution. The partnering that’s going on in terms of allowing early adaptation or adoption of this new architecture includes working with the vendors, something I hope will continue. The kind of work that’s going on now is identifying where the holes are and where the opportunities are for making progress.”

NCSA Deputy Director Rob Pennington, a meeting co-chair, added that the partnership between domain and computational scientists is also crucial to providing whole solutions. “You’re solving the problems now that are going to save your successors a lot of work,” he said.

While some scientific application codes have been fully developed to run on GPUs, others need additional attention from computational scientists, computer scientists, software developers, and vendors combining their talents.

The success of the ACSS 2012 scientists helps decision makers for supercomputing centers weigh whether it makes more sense to fill empty slots with the fastest CPUs available or embrace a hybrid system employing both CPUs, which excel at serial processing, and GPUs, which excel at parallel processing.

Caffeinated codes, transformational results

The symposium highlighted advances in combustion, fusion energy, seismology, nuclear physics, biology, medicine, chemistry, and more made possible through GPU computing.

[Photo: Researchers use the S3D combustion code to investigate factors underlying the clean burning of fuels in next-generation engines. Shown are a lifted ethylene-air jet flame computed from direct numerical simulation (DNS) and tracer particle trajectories. The simulation explored 22 species in a reduced mechanism, used 1.3 billion grid points, and had a jet Reynolds number of 10,000, which indicates the simulation encompassed a great range of scales. Sandia mechanical engineers Chun Sang Yoo and Jacqueline Chen performed the DNS; Hongfeng Yu of Sandia and Ray Grout of the National Renewable Energy Laboratory performed volume rendering. Image credit: Sandia National Laboratories]

Combustion research. Combustion scientist Jacqueline Chen of Sandia National Laboratories, the first speaker, gave an overview of direct numerical simulations of turbulence-chemistry interactions. Combustion of fossil fuels accounts for 83 percent of U.S. energy use, with transportation alone accounting for two-thirds of petroleum use and one-quarter of carbon dioxide emissions. America aims to reduce petroleum usage 25 percent by 2020 and greenhouse gas emissions 80 percent by 2050. Meeting these goals will require a new generation of combustion systems that can achieve high efficiency and low emissions using diverse fuels—from conventional fuels to new fuels obtained from heavy hydrocarbons (oil from oil sands and oil shale as well as liquid fuels derived from coal) and renewable fuels (ethanol and biodiesel).

The combustion-engine community has identified combustion of dilute fuels at lower temperatures and higher pressures than achieved in today’s engines as a promising way to increase fuel efficiency and decrease pollutant emissions. Chen detailed homogeneous charge compression ignition (HCCI) as a technology of great promise, but its control remains a challenge. “HCCI is really a chemical engine,” Chen said. Whereas the initiation of a spark or fuel-injection timing controls combustion in conventional engines, HCCI combustion starts when a piston compresses a charge to the point of ignition. Because combustion occurs by volumetric autoignition, thermal or chemical-species stratification is required to control how fast fuel burns. Too rapid a burning rate can lead to engine knock.

Hundreds of molecules with different chemical and physical characteristics have been proposed as alternative biofuels. How can engineers assess which are worth pursuing to optimize engine designs like HCCI? Chen’s detailed combustion simulations on petascale supercomputers provide fundamental physical insights into the nuances of microscale turbulent mixing and chemical reactions in engine-relevant conditions. This knowledge is used to develop and assess predictive engineering models to facilitate an optimal combined engine-fuel system. Supercomputers calculate the ability of turbulent mixing to optimize thermal and chemical stratification so that spontaneous combustion through ignition reactions occurs in proper phase with the piston motion. Chen is working with Ray Grout (National Renewable Energy Laboratory), John Levesque (Cray), Ramanan Sankaran (ORNL), and Cliff Woolley (NVIDIA) on revising the S3D scientific application code to make use of Titan’s multicore hybrid architecture to enable simulations of greater chemical kinetic complexity. With Jaguar, scientists could simulate hydrogen and small hydrocarbon fuels. Using Titan they aspire to simulate the larger hydrocarbons of practical transportation fuels such as heptane, octane, and biodiesel.

Grout followed Chen with a technical talk on efforts to “hybridize” the S3D legacy code. The message passing interface (MPI), OpenMP (Open Multi-Processing), and open accelerator (OpenACC) models are community-developed, standards-based programming models for parallel computers that may be used separately or in concert

to enhance performance and portability between different and evolving computer architectures. “Movement of outer loops to the highest level in the code facilitates hybrid MPI-OpenMP usage that is necessary for performance on the machines coming on line now and allows an elegant path to GPU-enabled computing using OpenACC,” Grout said. “Setting up our algorithms this way, we get to keep a maintainable code yet still have acceptable performance across the range of machines we anticipate.”

The approach works: the hybridized version of the code performed twice as fast as the original code, with each running on the now-replaced Cray XT5 system. And running this new, improved code on today’s state-of-the-art technologies, the GPU-accelerated XK6 hardware outperforms the dual-CPU XE6 by a factor of 1.4. Grout said such performance levels should bring a target HCCI problem incorporating a realistic heptane or octane fuel and nearly 2 billion grid points into the realm of feasibility under the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, the major means by which the scientific community gains access to the petascale computing resources of the OLCF.

[Photo: The goal of ITER is to get as close as possible to achieving an ignited and burning, or self-sustaining, plasma in a tokamak that produces 500 megawatts of power for 400 seconds, getting 10 times more energy out than will be put in to ignite the plasma. Image credit: Jamison Daniel, ORNL]

Nuclear fusion energy. Bill Tang and Stephane Ethier of Princeton Plasma Physics Laboratory next spoke of extreme-scale computing as an increasingly vital tool for accelerating progress in developing nuclear fusion for clean, abundant energy. A fusion reactor magnetically confines deuterium and tritium fuel and releases energy when the two hydrogen isotopes fuse to form helium and a fast neutron—with a total energy multiplication of 450 times per reaction. The process depends on keeping the plasma hot, but turbulence causes heat losses. It is the balance between such losses and the fusion reaction heating rates that will ultimately determine the size and cost of a reactor. Understanding the associated complex mechanisms requires extreme-scale computing to simulate conditions in ITER, a multibillion-dollar experimental reactor being built in France with the collaboration of seven governments representing more than half the world’s population. Ultimately achieving commercial fusion energy depends on success at a dramatic next step—burning plasma in ITER to generate 500 MW of power for 400 seconds, getting ten times more energy out than will be put in. Current devices generate 10 MW and put out about as much energy as they consume.

The rapid advances in supercomputing power, combined with the emergence of effective new algorithms and computational methodologies, help enable corresponding increases in the physics fidelity and the performance of the scientific codes used to model complex physical systems. “If properly validated against experimental measurements and verified with mathematical tests and computational benchmarks, these codes can provide reliable predictive capability for the behavior of fusion-energy-relevant high-temperature plasmas,” Tang said.

The fusion community has made excellent progress in developing advanced codes for which computer run time and problem size scale well with the number of processors on massively parallel supercomputers. For example, scaling is excellent for GTC, a global particle-in-cell kinetic code modeling the motion of charged particles and their interactions with the surrounding electromagnetic field. Several publications demonstrate that faster machines enable new physics insights with GTC, which was the first fusion code to deliver production simulations at petaflops in 2009. Showing data from China’s Tianhe-1A, currently the world’s second-fastest supercomputer, and America’s TitanDev, Ethier said GTC on GPUs gives a 3-fold speedup over CPUs alone in some cases.

“We talked a lot about the GPU on the Cray, but don’t forget there’s also a brand new network and powerful CPUs in the XK6 platform,” Ethier said of TitanDev. “Things that we couldn’t do before, now we can do with the new network. So it’s a full machine that we can optimize.”

Seismology. Today petascale computing is used to create detailed “shake movies” such as M8, a 2010 simulation of a hypothetical magnitude-8 quake roiling Southern California that employed nearly all of Jaguar’s approximately 220,000 processing cores and sustained 300 teraflops to run the anelastic wave propagation code AWP-ODC. With GPUs taking computing closer to exaflop performance levels—calculation speeds a thousand times faster than petaflops—Jeroen Tromp of Princeton University and Olaf Schenk of the University of Lugano said large-scale seismic inverse modeling and imaging are possible that will allow researchers to work backward after a big shake to determine in three dimensions where it originated and what phenomena shaped it. While earthquake prediction is not possible, inverse modeling of virtual quakes may help engineers assess potential hazards and build structures to withstand them.

[Photo: Princeton’s Jeroen Tromp used seismic inverse modeling and imaging to analyze 190 European quakes to reveal seismic hotspots (red indicates fast seismic shear wave speeds whereas blue indicates slow wave speeds). Imaging of the upper mantle revealed volcanism in the Czech Republic, a “hole” under Bulgaria, and Italy’s counterclockwise rotation over the past 6 million years. An allocation of 739 million core hours on Titan would facilitate a global inversion using data from 5,000 earthquakes, allowing unprecedented understanding of forces that agitate Earth. Image courtesy of Jeroen Tromp]

Much as medical tomography uses waves to image sections of the human body, seismic imaging uses waves generated by earthquakes to reveal the structure of the solid Earth. Numerically solving the equations that govern motion of diverse geological solids challenges even the most advanced high-performance computing systems. Tromp used inversion to analyze 190 European quakes and revealed seismic hotspots in the upper mantle. The imaging detailed the subduction of Africa, volcanism in the Czech Republic, a “hole” under Bulgaria, and Italy’s counterclockwise rotation over the past 6 million years.

“Our ultimate goal is to move toward adjoint tomography of the entire planet,” Tromp said. Getting started on this ambitious goal means analysis of a minimum of 250 earthquakes worldwide, requiring 14 million CPU core hours. But completing the task will require analysis of 5,000 earthquakes worldwide, using an estimated 739 million core hours. This is a huge amount of computing for today’s systems—equivalent to running on Jaguar nonstop for 6 months. However, with GPU performance acceleration and Titan on the way, Tromp is working to make such an ambitious campaign a reality.

“These seismic inverse problems are known as PDE [partial differential equation]-constrained optimization problems and are significantly more difficult to solve than the PDE-forward problems,” Schenk said. With the spectral element wave propagation code SPECFEM, one global-scale simulation of the forward problem might take 10,000 processor hours, but the inverse problem would take 300 million processor hours, he said. Optimization on tens of thousands of processor cores offers challenges, but solving them is worth the effort in terms of what the computational power would provide the Earth science community. With GPUs on Tödi and TitanDev, Schenk showed much greater performance for the spectral-element wave propagation solver SPECFEM over that of CPU-only machines—approximately four times faster than an XK6 with no accelerator (which has one CPU per node) and two times faster than the Monte Rosa XE6 (which has two CPUs per node). TitanDev and Tödi have NVIDIA’s Fermi GPUs, but Titan will have NVIDIA’s even faster next-generation Kepler GPUs.

[Photo: Researchers use the Denovo application code to model and simulate radiation transport in nuclear reactors for nuclear forensics, radiation detection, and analysis of fuel cores and radiation shielding. Without GPUs, using the 3.3-petaflop Jaguar system, Denovo can simulate a fuel rod through one cycle of use in a reactor core employing 60 hours of wall clock time. With GPU acceleration in the 20-petaflop Titan system, this simulation would take only 13 hours. Image credit: ORNL]

Nuclear fission energy. Speaking next were ORNL’s Doug Kothe and Tom Evans, who use modeling and simulation to improve the safety and performance of nuclear power plants, which supply 20 percent of America’s electricity today and are projected to supply 25 percent by 2030. Kothe and Evans are partners in the multi-institutional Consortium for Advanced Simulation of Light Water Reactors (CASL), DOE’s first Energy Innovation Hub, established in 2010 and headquartered at ORNL. CASL researchers model and simulate reactors to address three critical performance areas for nuclear power plants: reducing capital and operating costs per unit of energy by enabling power uprates and lifetime extension for existing plants and by increasing the rated powers and lifetimes of next-generation plants, reducing nuclear waste by enabling higher fuel burnup, and enhancing nuclear safety by predicting component performance through the onset of failure.

Evans leads development of the Denovo code, which charts radiation transport in a reactor. “To match the fidelity of existing 2D industry calculations with consistent 3D codes, we need to solve a minimum of 10,000 times more unknowns than we have achieved to this point,” he said. Denovo’s most challenging calculation, an algorithm that sweeps through the virtual reactor to track the location of particles, was consuming 80 to 95 percent of the code’s run time on ORNL’s 3.3-petaflop Jaguar, which uses only CPUs. Using hybrid TitanDev to pass this calculation off to GPUs, researchers saw Denovo’s speed increase 3.5-fold. “Getting a factor of 3 to 4 from GPUs means that a 30- to 40-petaflop machine could allow fully consistent, 3D neutronics simulations that could be used to address CASL challenge problems,” Evans said.

Fundamental physics. What is the nature of the nuclear force that binds protons and neutrons into stable nuclei and rare isotopes? What is the origin of simple patterns in complex nuclei? What is the origin of elements in the cosmos? These are a few of the unanswered physics questions under investigation by ORNL’s David Dean, who reported on the present status and future prospects for computing nuclei, which comprise 99.9 percent of baryonic matter in the universe and are the fuel that burns in stars. A comprehensive description of nuclei and their reactions requires extensive computational capabilities. Dean described applications of nuclear coupled-cluster techniques for descriptions of isotopic chains (recent work focused on oxygen and calcium). Today scientists use 150,000 cores to compute one nucleus. Recent work by Dean collaborators Hai Ah Nam and Gaute Hagen indicated that GPUs may triple the speed of the calculation. Dean also detailed applications of nuclear density functional theory, which today use all of Jaguar to simulate 20,000 nuclei in 12 million configurations in an effort to determine the limits of existence of nuclei. “In both cases we are benefiting from the application of petascale computing and will benefit from hybrid architectures in the future,” Dean said.

Scientists theorize that the four fundamental forces of nature—electromagnetism, gravity, the strong force (which holds an atom’s nucleus together), and the weak force (which is responsible for radioactive decay)—are related. Balint Joo of Jefferson Lab described investigations in the field of quantum chromodynamics, which aims to understand the strong interaction in terms of quarks and gluons. He used the TitanDev testbed to show strong scaling to 768 GPUs (more than three-quarters of TitanDev). He and NVIDIA’s Mike Clark compared

the performance of GPU-based solvers from the QUDA library with their CPU-based counterparts and found a 3- to 10-fold speed boost, depending on problem size. The calculation had to solve 679 million unknowns and did so at a speed of 100 teraflops. With Titan, work to better understand the strong force may enable scientists to compute the properties of never-before-seen exotic mesons prior to their planned production in GlueX, a flagship nuclear physics experiment that will run around 2015 in Jefferson Lab’s accelerator after an upgrade to double the energy of its electron beam from 6 billion to 12 billion electron volts.

Climate science. Hack, who heads the Oak Ridge Climate Change Science Institute, introduced a discussion of the challenges in accelerating coupled models such as the state-of-the-art Community Earth System Model used by more than 300 climate scientists. This megamodel couples component models of atmosphere, land, ocean, and ice to reflect their complex interactions. For example, Phil Jones of Los Alamos National Laboratory and a large team of collaborators couple high-resolution global ocean models to high-resolution atmosphere models (25-kilometer-wide grid cells for atmosphere and 10-kilometer resolution for ocean) to more accurately simulate the climate system and its extremes by including ocean eddies, tropical storms, and other fine-scale processes. “The current state is 2 to 3 simulated years per CPU day,” he said. “A speedup of 3 to 10 times permits real science using limited ensembles and longer time integrations.”

Matt Norman from ORNL spoke of porting an important atmospheric code to hybrid platforms. “By porting to the Cray XK6 [1 Fermi GPU and 1 Interlagos CPU per node], we achieved a 1.5-fold speedup over Cray XE6 nodes for the Community Atmosphere Model–Spectral Element code using 900 nodes on each machine,” he said. “This currently ported portion of the code, or subroutine, accounts for 80 percent of the run time for our target science problems, and most of the remaining run time will be ported to GPUs soon.”

Similarly, Tom Henderson from NOAA shared his group’s achievement of a 5-fold speedup of an atmospheric dynamical core used for numerical weather prediction (NWP). “Since atmospheric cores used for NWP and climate prediction can be very similar, this result is strong evidence that the desired 3- to 10-fold speedup can be achieved for climate codes,” Henderson said.

Math libraries. To address the complex challenges of hybrid environments, optimized software solutions will have to fully exploit both CPUs and GPUs. The OLCF’s Jack Wells introduced a panel discussion on math libraries aimed at accomplishing this feat. Jack Dongarra of the University of Tennessee–Knoxville talked about the Matrix Algebra on GPU and Multicore Architectures, or MAGMA, project to develop a dense linear algebra library for hybrid architectures. ORNL’s Chris Baker next discussed Trilinos efforts to develop parallel solver algorithms and libraries within an object-oriented software framework for large-scale multiphysics applications. Finally, Ujval Kapasi of NVIDIA gave an overview of CUDA, NVIDIA’s parallel computing platform and programming model that can send C, C++, and Fortran code straight to GPUs with no assembly language required. The sequential part of a workload still runs on the CPU, but parallel processes are sent to the GPU to accelerate the computation. CUDA has found use in applications from simulating air traffic to visualizing molecular motions.
The panel made apparent the vital role of early partnerships between technology developers and scientific-application users in optimizing libraries that will eventually empower the entire computational science community.

[Photo: Accelerated computing may benefit simulations of ocean eddies, tropical storms, and other fine-scale processes. Ocean simulations that include ecosystems are important for determining how much carbon oceans can sequester in current and future climates. Shown in green is chlorophyll, a measure of biological activity that can be compared with observational data obtained with satellites. Image credit: Matthew Maltrud, Los Alamos National Laboratory]

ALCC 2012 Workshop Report

Medicine and biology. The second day’s lectures kicked off with Jeremy Smith of the University of Tennessee–Knoxville and ORNL speaking on supercomputing in biology and medicine. He described large-scale docking calculations used in drug discovery, a process in which thousands of compounds are screened to identify hundreds of candidates that are then winnowed to dozens of effective agents. A few go on to preclinical development, the stage at which most drugs fail, and fewer still make it to become the subjects of expensive and time-consuming clinical trials. Supercomputers can speed analysis and lower costs. High-throughput screening of drug libraries can quickly identify promising candidates and generate copious initial data about how drugs bind to their protein targets, decreasing the likelihood that the drugs will later prove toxic due to undesired binding or cross-reactivity. “If drug candidates are going to fail, you want them to fail fast, fail cheap,” Smith said. In one day a petaflop computer can analyze 10 million drugs binding to a target, Smith said, whereas a 10-petaflop machine can analyze a million drugs interacting with 100 targets. A future exaflop machine, in comparison, could in one day assess the interactions of 10 million drugs with 1,000 different targets—one representative from every class of proteins in the human body.

Smith also highlighted molecular dynamics simulations of cellulosic ethanol systems under investigation at DOE’s Bioenergy Science Center. The goal is to better understand steps necessary to generate “grassoline” from cellulose, a complex carbohydrate made of sugar units that forms cell walls in stalks, trunks, stems, and leaves. Hydrolysis by enzymes breaks cellulose into glucose, and fermentation turns the glucose into alcohol. Petascale simulations revealed new knowledge about hydrolysis, an expensive step that must be optimized to make cellulosic ethanol commercially viable. Smith shared a capability-class, 24-million-atom simulation that detailed enzymatic binding to lignocellulose that had been pretreated with heat and dilute acid. With a more powerful 10-petaflop supercomputer, he said, a 300-million-atom simulation could shed light on how specialized microbes break down biomass. An even more powerful exaflop supercomputer could, in principle, simulate the 10 billion atoms in a living cell, although considerable technical barriers to such an achievement still remain.

“Over the last decade molecular dynamics simulation has evolved from a severely limited esoteric method into a cornerstone of many fields, in particular structural biology, where it is now just as established as NMR [nuclear magnetic resonance] or X-ray crystallography,” said Erik Lindahl of Stockholm University, who next discussed scaling the GROMACS code for execution on hybrid, accelerated architectures. “A central challenge for the entire community is that we frequently achieve scaling by going to increasingly larger model systems, reaching millions or even hundreds of millions of particles. While this has resulted in some impressive showcases, the problem for structural biology is that most molecules of biological interest (for example, membrane proteins) only require maybe 250,000 atoms.” The biggest challenge for the future is rather that scientists would like to reach timescales many orders of magnitude larger than what is possible today. He discussed using Copernicus, an open framework that parallelizes two aspects of distributed computing, to run molecular dynamics on high-performance computers with fast interconnects. Copernicus treats each problem like an ensemble in which free energy is calculated for thousands of molecular shapes and the computation converges on the conformation with the lowest energy. The result is larger-scale simulations than would be possible with distributed computing projects, such as Folding@Home, and reduced time to solution. In 46 hours a supercomputer can determine the 3D structure of a small protein from scratch, given its 2D amino-acid sequence; GPU-accelerated nodes will reduce this time by a factor of 3 to 4, Lindahl said.

Jim Phillips of NCSA at the University of Illinois at Urbana-Champaign shared his work on scalable molecular dynamics using the NAMD code, which has 51,000 users worldwide to attest to its wide, deep impact. Projects driving the need for scalable molecular dynamics include investigations of ribosome dysfunctions leading to neurodegenerative disease and infection processes of the viruses that lead to polio and AIDS. His talk detailed how supercomputers probe the behavior of biosensors, whole cells, blood coagulants, integrin [a receptor mediating attachment between a cell and surrounding tissues], and membrane transporters and included a status update on scaling and acceleration of NAMD and Charm++, an adaptive, parallel run-time system. Acceleration with GPUs on Japan’s Tsubame supercomputer allowed a large simulation of a 20-million-atom photosynthetic membrane patch. Cray’s Gemini interconnect doubled the usable nodes for strong scaling (i.e., problems of a fixed size, in which increasing the number of processors shortens the time to solution) in a more moderately sized simulation of 92,000-atom apolipoprotein A1 on a Blue Waters prototype. On a 100-million-atom benchmark molecular dynamics calculation run on 768 nodes, GPU-accelerated XK6 hardware sped the calculation above and beyond what was possible with CPUs—a 2.6-fold acceleration over the XK6 without acceleration and a 1.4-fold boost over the dual-CPU XE6. These results show it would be more advantageous to fill a second slot of a compute node with a GPU than another CPU.
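The closing claim about node slots can be checked with the two reported NAMD ratios. A small sketch (treating the unaccelerated, single-CPU XK6 node as the common baseline is an assumption of this sketch):

```python
# Reported NAMD results on the 100-million-atom benchmark (768 nodes):
# the GPU-accelerated XK6 ran 2.6x faster than the XK6 without its GPU,
# and 1.4x faster than the dual-CPU XE6.
gpu_vs_cpu_only = 2.6   # XK6 + GPU vs. single-CPU XK6
gpu_vs_dual_cpu = 1.4   # XK6 + GPU vs. dual-CPU XE6

# Implied gain from filling the second slot with a CPU instead:
second_cpu_gain = gpu_vs_cpu_only / gpu_vs_dual_cpu  # ~1.86x over baseline
print(f"second CPU: {second_cpu_gain:.2f}x; GPU in same slot: {gpu_vs_cpu_only:.2f}x")
```

In other words, the GPU in the second slot delivers 2.6x over the baseline versus roughly 1.86x for a second CPU, which is exactly the reported 1.4-fold advantage.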

How Effective Are GPUs on Scalable Applications?
Code Performance Measurements on TitanDev
XK6 (with GPU) vs. XE6 application performance ratio (TitanDev : Monte Rosa)

Application | Domain                                      | Ratio
S3D         | Turbulent combustion                        | 1.4
Denovo      | 3D neutron transport for nuclear reactors   | 3.3
LAMMPS      | High-performance molecular dynamics         | 3.2
WL-LSMS     | Statistical mechanics of magnetic materials | 1.6
CAM-SE      | Community atmosphere model                  | 1.5
NAMD        | High-performance molecular dynamics         | 1.4
Chroma      | High-energy nuclear physics                 | 6.1
QMCPACK     | Electronic structure of materials           | 3.0
SPECFEM-3D  | Seismology                                  | 2.5
GTC         | Plasma physics for fusion energy            | 1.6
CP2K        | Chemical physics                            | 1.5

Cray XK6: Fermi GPU plus Interlagos CPU | Cray XE6: dual Interlagos and no GPU

Chemical and materials sciences. Christopher Mundy of Pacific Northwest National Laboratory discussed chemistry at interfaces, which has far-reaching implications for fuel cells, biological pumps, chemical catalysts, and chemical reactions in the atmosphere and on surfaces. He and his team performed research using first-principles interaction potentials in conjunction with statistical mechanics methods on leadership-class computers through the INCITE program. He performed the investigations under the auspices of two DOE programs that aim to elucidate complex chemical phenomena in heterogeneous environments: the Basic Energy Sciences’ Condensed Phase and Interfacial Science program, which improves the molecular understanding of processes at interfaces, and the Energy Frontiers Research Center on Molecular Electrocatalysis, which supports studies of the conversion of electrical energy into chemical bonds and vice versa. His team used the CP2K computer code—nicknamed “the Swiss army knife of molecular simulation” because of its wide range of tools—to explore the role of electrons in explaining the behavior of molecules and materials. Parallel computations of free-energy profiles of individual chemical reactions used ensemble methods to statistically explore important configurations that give rise to novel phenomena. Mundy described a recent controversy—whether the surface of liquid water is acidic or basic. His work, using high-quality interaction potentials and based in quantum mechanics, suggests that the hydroxide anion is only slightly stabilized at the interface and that no strong propensity exists for the surface to be either acidic or basic. As water is necessary for life on Earth, a deeper understanding of it could have a profound impact.

Joost VandeVondele of ETH Zurich spoke of nanoscale simulations using large-scale and hybrid computing with CP2K, which aims to describe at an atomic level the chemical and electronic processes in complex systems. The work may aid understanding of novel materials, precisely tuned chemical reactions, and nanoscale processes that are crucial to addressing challenges in energy, environment, and health. “The next-generation hardware will accelerate this research,” he said. “However, this will require that domain scientists, in collaboration with system experts, design novel and powerful algorithms that are carefully implemented to exploit the strengths of the new hardware.” He presented recent efforts by the CP2K team to
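For a rough single-number summary of the TitanDev measurements in the chart above, one can take the geometric mean of the printed ratios, the conventional average for speedup figures (the summary statistic is ours, not one reported at the symposium):

```python
import math

# XK6-with-GPU vs. XE6 performance ratios as printed in the TitanDev chart
ratios = {
    "S3D": 1.4, "Denovo": 3.3, "LAMMPS": 3.2, "WL-LSMS": 1.6,
    "CAM-SE": 1.5, "NAMD": 1.4, "Chroma": 6.1, "QMCPACK": 3.0,
    "SPECFEM-3D": 2.5, "GTC": 1.6, "CP2K": 1.5,
}

# Geometric mean: the appropriate average for multiplicative ratios
gmean = math.exp(sum(math.log(r) for r in ratios.values()) / len(ratios))
print(f"geometric-mean speedup across {len(ratios)} codes: {gmean:.2f}x")
```

The result lands near the low end of the 1.5- to 3-fold range quoted in the report’s introduction, because the handful of large outliers (Chroma, Denovo) pull the arithmetic mean up more than the geometric mean.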

simulate systems of ten thousand to millions of atoms and increase the accuracy for condensed-phase system simulations. He discussed early results using accelerators in a massively parallel context, showing GPU-accelerated XK6 performance more than doubled compared to CPUs alone in the XE6.

David Ceperley of the University of Illinois at Urbana-Champaign, who highlighted simulations of condensed-matter systems, began his talk with a Paul Dirac quote from 1929: “The general theory of quantum mechanics is now almost complete. The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.” Today’s supercomputers make these equations solvable, Ceperley said. “Recently, we have developed methods to couple the simulation of the quantum electron degrees of freedom of a microscopic system with the ionic degrees of freedom without invoking the approximations involved in density functional theory or semi-empirical potentials,” he said. “This allows an unprecedented description of all condensed-matter systems.” He spoke of simulations providing new insights into dense hydrogen, important for understanding the giant planets, and water, the key to many biophysical processes. He also discussed supercomputers speeding the design of materials with specific properties. Computation for each compound was a doctoral thesis 15 years ago, he said. Now, supercomputers can search 10,000 compounds to find the optimal material. “There’s plenty of work at the petascale and exascale,” Ceperley said.

Jeongnim Kim of ORNL is the principal author of QMCPACK, a code widely used by physical scientists for large-scale simulations of molecules, solids, and nanostructures. The code uses a technique called Quantum Monte Carlo (QMC) that has enabled accurate, many-body predictions of electronic structures. “The resources at DOE leadership computing facilities have allowed QMC to seek fundamental understanding of materials properties, from atoms to liquids, at unprecedented accuracy and time-to-solution,” she said. “We have developed QMCPACK to enable QMC studies of the electronic structure of realistic systems, while fully taking advantage of such theoretical development and ever-growing computing capabilities and capacity.” She reviewed progress in QMC algorithms and their implementations in QMCPACK, including accelerations using GPUs. In recent work she simulated 256 electrons in a 64-atom piece of graphite. GPUs on TitanDev sped the calculation about 3.8-fold compared to the CPU-only portion of the Jaguar XK6 and about 3.0-fold over Monte Rosa’s dual-CPU XE6 nodes. Kim said a hybrid supercomputer of 10 to 100 petaflops could support the Materials Genome Initiative, which aims to accelerate understanding of the fundamentals of materials science, providing practical information that entrepreneurs and innovators can use to develop new products and processes.

Hybrid architecture is the foundation of ORNL’s 200-cabinet Titan supercomputer, which will be a groundbreaking new tool for scientists to leverage the massive power of hybrid supercomputing for transformational research and discovery. Image credit: Andy Sproles, ORNL

To 20 petaflops and beyond

Arthur Bland, director of Jaguar’s upgrade and transformation project to create Titan, gave the meeting attendees an overview of that project, which by the end of 2012 will deliver an XK6 system with 600 terabytes of memory, Cray’s new high-performance Gemini network, and 18,688 compute nodes each containing one 16-core AMD Opteron CPU. While TitanDev now has Fermi GPUs, in fall 2012 these will be removed to allow most of the compute nodes in TitanDev and Jaguar to be upgraded with NVIDIA’s Kepler GPUs. Notably, the project plan is for 4,096 XK6 compute nodes to remain without accelerators. Titan users will have access to the Spider file system with more than 10 petabytes of storage capacity and an initial data bandwidth of 240 gigabytes per second and a planned upgrade of up to 1 terabyte per second. Bland announced that during the week of March 26, the DOE Office of Science approved the build-out of Titan to at least 20 petaflops.

Pennington followed with an update on Blue Waters, which will have more than 25,000 compute nodes. All XE6 nodes are expected to be in place by mid-year, and all XK components should be in place by the end of the year. Blue Waters will have more than 235 XE6 cabinets and more than 30 XK6 cabinets equipped with NVIDIA’s Kepler GPUs. It will have 1.5 petabytes of memory, and its disk storage will exceed 25 petabytes, with near-line storage that can go to 500 petabytes.

The major means of accessing Titan will be through the INCITE program, which awarded 1.7 billion core hours in 2012 and will allocate nearly 5 billion core hours in 2013 on the 10-petaflop IBM BG/Q Mira supercomputer at Argonne National Laboratory and Titan at ORNL. The Blue Waters system is accessed through the NSF’s Petascale Computing Resource Allocations process. —by Dawn Levy

Contact information:

Jack Wells
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
Phone: 865-241-2853
Email: [email protected]
www.olcf.ornl.gov

Thomas Schulthess
Swiss National Supercomputing Centre (CSCS)
Swiss Federal Institute of Technology in Zurich (ETH Zurich)
Phone: +41 91 610 82 01
Email: [email protected]
http://www.cscs.ch

Rob Pennington
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Phone: 217-244-1052
Email: [email protected]
http://www.ncsa.illinois.edu/

