S&TR March 2017

Application developers and their industry partners are working to achieve both performance and cross-platform portability as they ready science applications for the arrival of Livermore’s next flagship supercomputer.


A Center of Excellence Prepares for Sierra

In November 2014, then Secretary of Energy Ernest Moniz announced a partnership involving IBM, NVIDIA, and Mellanox to design and deliver high-performance computing (HPC) systems for Lawrence Livermore and Oak Ridge national laboratories. (See S&TR, March 2015, pp. 11–15.) The Livermore system, Sierra, will be the latest in a series of leading-edge Advanced Simulation and Computing (ASC) Program supercomputers, whose predecessors include Sequoia, BlueGene/L, Purple, White, and Blue Pacific. As such, Sierra will be expected to help solve the most demanding computational challenges faced by the National Nuclear Security Administration’s (NNSA’s) ASC Program in furthering its mission.


At peak speeds of up to 150 petaflops (a petaflop is 10^15 floating-point operations per second), Sierra is projected to provide at least four to six times the performance of Sequoia, Livermore’s current flagship supercomputer. Consuming “only” 11 megawatts, Sierra will also be about five times more power efficient. The system will achieve this combination of power and efficiency through a heterogeneous architecture that pairs two types of processors—IBM Power9 central processing units (CPUs) and NVIDIA Volta graphics processing units (GPUs)—so that programs can shift from one processor type to the other based on computational requirements. (CPUs are the traditional number-crunchers of computing. GPUs, originally developed for graphics-intensive applications such as computer games, are now being incorporated into supercomputers to improve speed and reduce energy usage.) Powerful hybrid computing units known as nodes will each contain multiple CPUs and GPUs connected by an NVLink network that transfers data between components at high speeds. In addition to CPU and GPU memory, the complex node architecture incorporates a generous amount of nonvolatile random-access memory, providing memory capacity for many operations historically relegated to far slower disk-based storage.

These hardware features—heterogeneous processing elements, fast networking, and use of different memory types—anticipate trends in HPC that are expected to continue in subsequent generations of systems. However, these innovations also represent a seismic architectural shift that poses a significant challenge for both scientific application developers and the researchers whose work depends on those applications running smoothly. “The introduction of GPUs as accelerators into the production ASC environment at Livermore, starting with the delivery of Sierra, will be disruptive to our applications,” admits Rob Neely. “However, Livermore chose the GPU accelerator path only after concluding, first, that performance-portable solutions would be available in that timeframe and, second, that the use of GPU accelerators would likely be prominent in future systems.”

To run efficiently on Sequoia, applications must be modified to achieve a level of task division and coordination well beyond what previous systems demanded. Building on the experience gained through Sequoia, and integrating Sierra’s ability to effectively use GPUs, Livermore HPC experts are hoping that application developers—and their applications—will be better positioned to adapt to whatever hardware features future generations of supercomputers have to offer.
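A brief sketch can make this shift between processor types concrete. The example below is illustrative only and is not drawn from any Livermore application; the function, array sizes, and threshold are hypothetical. It uses the OpenMP programming model, mentioned later in this article, to offload a loop to the GPU only when the problem is large enough to justify moving data over the CPU–GPU link; smaller problems stay on the CPU.

```cpp
// Illustrative sketch only (hypothetical function and threshold, not from any
// Livermore code): a loop that runs on the GPU for large problems and on the
// CPU for small ones, using OpenMP 4.5 target offload.
#include <cstddef>
#include <vector>

constexpr std::size_t kGpuThreshold = 100000;  // assumed crossover point

// Adds two arrays element by element.
void add_vectors(const double* a, const double* b, double* c, std::size_t n) {
  // The if(target: ...) clause disables offload when the problem is too small
  // to amortize the cost of moving data across the CPU-GPU link.
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n], b[0:n]) map(from: c[0:n]) if(target: n > kGpuThreshold)
  for (std::size_t i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];
  }
}

int main() {
  std::vector<double> a(1 << 20, 1.0), b(1 << 20, 2.0), c(1 << 20);
  add_vectors(a.data(), b.data(), c.data(), a.size());  // offloaded if a GPU is available
  return 0;
}
```

Built with a compiler that supports OpenMP offloading, the loop runs on an attached GPU; without offload support, it simply falls back to the CPU, which is part of what makes directive-based approaches attractive for portability.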

The Department of Energy has contracted with an IBM-led partnership to develop and deliver advanced supercomputing systems to Lawrence Livermore and Oak Ridge national laboratories beginning this year. These powerful systems are designed to maximize speed and minimize energy consumption to provide cost-effective modeling, simulation, and big data analytics. The primary mission for the Livermore machine, Sierra, will be to run computationally demanding calculations to assess the state of the nation’s nuclear stockpile. The accompanying diagram summarizes the planned system:

Compute node: central processing units (CPUs) and graphics processing units (GPUs) linked by a CPU–GPU interconnect; solid-state drive: 800 gigabytes (GB); high-bandwidth coherent shared memory: 512 GB
Compute system: memory: 2.1–2.7 petabytes (PB); speed: 120–150 petaflops; power: 11 megawatts
File system: usable storage: 120 PB; bandwidth: 1.0 terabytes/second


Diagram comparing a current homogeneous node (memory, CPU, cores; lower complexity) with one of Sierra’s heterogeneous nodes (memory, CPUs, GPUs; higher complexity). Compared to today’s relatively simpler nodes, cutting-edge Sierra will feature nodes combining several types of processing units, such as central processing units (CPUs) and graphics processing units (GPUs). This advancement offers greater parallelism—completing tasks in parallel rather than serially—for faster results and energy savings. Preparations are underway to enable Livermore’s highly sophisticated computing codes to run efficiently on Sierra and take full advantage of its leaps in performance.

A Platform for Engagement

IBM will begin delivery of Sierra in late 2017, and the machine will assume its full ASC role by late 2018. Recognizing the complexity of the task before them, Livermore physicists, computer scientists, and their industry partners began preparing applications for Sierra shortly after the contract was awarded to the IBM partnership. The vehicle for these preparations is a nonrecurring engineering (NRE) contract, a companion to the master contract for building Sierra that is structured to accelerate the new system’s development and enhance its utility. “Livermore contracts for new machines always devote a separate sum to enhance development and make the system even better,” explains Neely. “Because we buy machines so far in advance, we can work with the vendors to enhance some features and capabilities.” For example, the Sierra NRE calls for investigations into GPU reliability and advanced networking capabilities. The contract also calls for the creation of a Center of Excellence (COE)—a first for a Livermore NRE—to foster more intensive and sustained engagement than usual among domain scientists, application developers, and vendor hardware and software experts.

The COE provides Livermore application teams with direct access to vendor expertise and troubleshooting as codes are optimized for the new architecture. (See the box “A ‘Hit Squad’ for Troubleshooting.”) Such engagement will help ensure that scientists can start running their applications as soon as possible on Sierra—so that the transition sparks discovery rather than being a hindrance. In turn, the interactions give IBM and NVIDIA deeper insight into how real-world scientific applications run on their new hardware. Neely observes that the new COE also dovetails nicely with Livermore’s philosophy of codesign, that is, working closely with vendors to create first-of-their-kind computing systems, a practice stretching back to the Laboratory’s founding in 1952.

The COE features two major thrusts—one for ASC applications and another for non-ASC applications. The ASC component, which is funded by the multilaboratory ASC Program and led by Neely, mirrors what other national laboratories awaiting delivery of new HPC architectures over the coming years are doing. The non-ASC component, however, was pioneered by Lawrence Livermore, specifically the Multiprogrammatic and Institutional Computing (M&IC) Program, which helps make powerful HPC resources available to all areas of the Laboratory. One way the M&IC Program achieves this goal is by investing in smaller-scale versions of new ASC supercomputers, such as Vulcan, a quarter-sized version of Sequoia. Sierra will also have a scaled-down counterpart slated for delivery in 2018. To prepare codes to run on this new system in the same manner as the COE is doing with ASC codes, institutional funding was allocated in 2015 to establish a new component of the COE, named the Institutional COE, or iCOE.

To qualify for COE or iCOE support, applications were evaluated according to mission needs, structural diversity, and, for iCOE codes, topical breadth. Bert Still, iCOE’s lead, says, “We chose to fund preparation of a strategically important subset of codes that touches on every discipline at the Laboratory.” Applications selected by iCOE include those that help scientists understand earthquakes, refine laser experiments, or test machine learning methods.


For instance, the machine learning application chosen (see S&TR, June 2016, pp. 16–19) uses data analytics techniques considered particularly likely to benefit from some of Sierra’s architectural features because of their memory-intensive nature. Applications under COE purview make up only a small fraction of the several hundred Livermore-developed codes presently in use. The others will be readied for Sierra later, in a phased fashion, so that lessons learned by the COE teams will likely save significant time and effort. Still says, “At the end of the effort, we will have some codes ready to run, but just as importantly, we will have captured and spread knowledge about how to prepare our codes.” Although some applications do not necessarily need to be run at the highest available computational level to fulfill mission needs, their developers will still need to familiarize themselves with GPU technology, because current trends suggest this approach will soon trickle down to the workhorse Linux cluster machines relied on by the majority of users at the Laboratory.

Performance and Portability

Nearly every major domain of physics in which the massive ASC codes are used has received some COE attention, and nine of ten iCOE applications are already in various stages of preparation. A range of onsite and offsite activities has been organized to bolster these efforts, such as training classes, talks, workshops focused on hardware and application topics, and hackathons. At multiday hackathons held at the IBM T. J. Watson Research Center in New York, Livermore and IBM personnel have focused on applications, leveraging experience gained with early versions of Sierra compilers. (Compilers are specialized pieces of software that convert programmer-created source code into the machine-friendly binary code that a computer understands.) With all the relevant experts gathered together in one place, hackathon groups were able to quickly identify and resolve a number of compatibility issues between compilers and applications.

In addition to such special events, a group of 14 IBM and NVIDIA personnel collaborate with Livermore application teams through the COE on an ongoing basis, with several of them at the Livermore campus full or part time. Neely notes that the COE’s founders considered it crucial for vendors to be equal partners in projects, rather than simply consultants, and that the vendors’ work align with and advance their own research in addition to Livermore’s. For instance, Livermore and IBM computer scientists plan to jointly author journal articles and are currently organizing a special journal issue on application preparedness to share what they have learned with the broader HPC and application development communities.

Application teams are also sharing ideas and experiences with their counterparts at the other four major Department of Energy HPC facilities. All are set to receive new leadership-class HPC systems in the next few years and are making preparations through their own COEs. Last April, technical experts from the facilities’ COEs and their vendors attended a joint workshop on application preparation.
Participants demonstrated a strong shared interest in developing portable programming strategies—methods that ensure their applications continue to run well not just on multiple advanced architectures but also on the more standard cluster technologies.

Livermore’s Center of Excellence (COE), jointly funded by the Laboratory and the multilaboratory Advanced Simulation and Computing (ASC) Program, aims to accelerate application preparation for Sierra through close collaboration among experts in hardware and software, as shown in this workflow illustration. The illustration’s workflow runs from inputs (including code developers, applications and libraries, expertise in algorithms, onsite vendor expertise, vendor application and early-delivery platform access, platform expertise, and a steering committee) through COE activities such as hands-on collaborations and customized training, to outcomes including codes tuned for the Sierra platform, rapid feedback to vendors on early issues, portable and vendor-specific solutions, improved software tools and environment (compilers, programming models, resilience tools, and more), and publications and early science.


Neely, who chaired the meeting, says, “Most of the teams in attendance need to run and support users on multiple systems, across either NNSA laboratories or the Leadership Computing Facilities of the Office of Science. This joint workshop was designed to examine the excellent work being done in these COEs and bring the discussion up a level, to learn how we, as a community, can achieve the best of both worlds—performance and portability.” The consensus at the meeting was that the COE is indeed the right vehicle for exploring solutions to the daunting challenges they all face. Neely adds, “It may be too early to declare victory, but the COEs are working well.”

A Smashing Development

Preparing an application for a new HPC system is an iterative process requiring a thorough understanding of the application and the new architecture. For instance, Sierra will feature more memory and a more complex memory hierarchy in the nodes than does Sequoia. Making full use of Sierra’s features requires developers to carefully manage where data are stored across the memory hierarchy and to develop algorithms with a far greater degree of parallelism—the organization of tasks such that they can be performed in parallel rather than serially. Hitting performance targets on Sierra calls for 10 to 100 times more parallelism in applications than is present today.
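One common way to manage data placement in such a hierarchy is to keep data resident in GPU memory across many kernel launches and copy results back only when the host needs them. The sketch below is illustrative only, with hypothetical names rather than code from any Livermore application; it uses OpenMP data directives to park a field on the device for the duration of a simulation loop.

```cpp
// Illustrative sketch only (hypothetical names, not Livermore code): keep a
// field resident in GPU memory across many time steps with OpenMP data
// directives, so that data cross the CPU-GPU link only twice.
#include <cstddef>
#include <vector>

// One step of a stand-in calculation; the map clause finds the field already
// present on the device, so no per-step transfer occurs.
void time_step(double* data, std::size_t n) {
  #pragma omp target teams distribute parallel for map(tofrom: data[0:n])
  for (std::size_t i = 0; i < n; ++i) {
    data[i] *= 0.99;  // placeholder for real physics
  }
}

void run_simulation(std::vector<double>& field, int steps) {
  double* data = field.data();
  const std::size_t n = field.size();

  // Allocate device memory and copy the field to the GPU once, up front.
  #pragma omp target enter data map(to: data[0:n])

  for (int s = 0; s < steps; ++s) {
    time_step(data, n);  // each kernel reuses the resident device copy
  }

  // Copy the results back to the host and release the device allocation.
  #pragma omp target exit data map(from: data[0:n])
}

int main() {
  std::vector<double> field(1 << 20, 1.0);
  run_simulation(field, 100);
  return 0;
}
```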

A “Hit Squad” for Troubleshooting

In preparing Livermore’s scientific applications to run on Sierra, everyone involved contributes something unique. Application developers understand the needs of their users and the performance characteristics of their applications. Vendors have in-depth understanding of the complex architecture of the coming machine, the systems software that will run on it, and the tradeoffs made during design and development. In-house experts on Livermore’s Advanced Architecture and Portability Specialists (AAPS) team provide yet another essential contribution—robust expertise in developing and optimizing applications for next-generation architectures.

“The AAPS team pulls together exceptional talent in several key areas,” says AAPS lead Erik Draeger. “We have computational scientists with years of hands-on experience in achieving extreme scalability and performance in scientific applications. Together, we have multiple Gordon Bell prizes. Our computer scientists have deep technical knowledge in the latest programming languages and emerging programming models. This breadth gives the team a clear sense of what is possible in theory and where things can get tricky in practice.”

AAPS team members act as a troubleshooting “hit squad,” according to Institutional Center of Excellence (iCOE) lead Bert Still. At any given time, team members are working with multiple application teams to provide hands-on support and advice on code preparation tailored to the unique characteristics of the code involved. Naturally, says Draeger, the biggest challenges arise with codes poorly suited for future architectures. Nevertheless, rewriting an application from scratch is rarely an option because of its size (sometimes millions of lines of code) and the effort already expended to create and enhance that code (sometimes years or even decades). In such instances, the AAPS team uses tools such as mini-apps (compact, self-contained proxies for real applications) to help determine precisely how much code needs to be rewritten and how to do so as efficiently as possible. (See S&TR, October 2013, pp. 12–13.)

The team’s mandate also includes documenting and transferring knowledge about common challenges and successful solutions, particularly regarding strategies for portability—the ability of an application to run efficiently on different architectures with minimal modification. The biggest portability success thus far with Sierra may be developing the copy-hiding application interface (CHAI), an effort led by developer Peter Robinson, with AAPS team support. Draeger explains, “CHAI is an abstraction that allows programmers to easily write code that minimizes data movement and runs well on a variety of architectures. CHAI solves some of the key problems of writing code for both performance and portability.”

The AAPS team is currently funded as part of Livermore’s COE for Sierra, but Draeger hopes the team will endure well beyond the new system’s arrival. He says, “High-performance computing is not likely to get any easier in the coming years, and we need to work together and share our experiences as much as possible if we’re going to continue to be leaders in the field. AAPS is one important mechanism to achieve that goal.”

Ian Karlin (left), a member of Livermore’s Advanced Architecture and Portability Specialists team, works with Tony DeGroot, part of the ParaDyn application team. (Photograph by Randy Wong.)
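The copy-hiding idea Draeger describes can be illustrated with a small, hypothetical sketch. The class below does not reproduce CHAI’s actual interface; it simply shows the principle: an array remembers which memory space holds its current data and copies between spaces only when the execution space changes, so the application programmer never writes explicit transfers.

```cpp
// Hypothetical sketch of the copy-hiding idea; this is NOT the CHAI API.
// The array tracks which memory space holds current data and copies only
// when the execution space changes.
#include <cstddef>
#include <cstring>
#include <iostream>

enum class Space { Host, Device };

class ManagedArray {
 public:
  explicit ManagedArray(std::size_t n)
      : n_(n), host_(new double[n]), device_(new double[n]), valid_(Space::Host) {}
  ~ManagedArray() { delete[] host_; delete[] device_; }
  ManagedArray(const ManagedArray&) = delete;             // keep the sketch simple
  ManagedArray& operator=(const ManagedArray&) = delete;

  // Called by the execution framework before running a loop in `space`.
  // Any needed copy happens here, hidden from the application programmer.
  double* prepare(Space space) {
    if (space != valid_) {
      if (space == Space::Device)
        std::memcpy(device_, host_, n_ * sizeof(double));  // host -> "device"
      else
        std::memcpy(host_, device_, n_ * sizeof(double));  // "device" -> host
      valid_ = space;
    }
    return space == Space::Device ? device_ : host_;
  }

  std::size_t size() const { return n_; }

 private:
  std::size_t n_;
  double* host_;
  double* device_;   // in a real GPU code this would be device memory
  Space valid_;      // which space holds the up-to-date copy
};

int main() {
  ManagedArray a(8);
  double* h = a.prepare(Space::Host);              // no copy: host data is current
  for (std::size_t i = 0; i < a.size(); ++i) h[i] = static_cast<double>(i);

  double* d = a.prepare(Space::Device);            // one hidden host-to-device copy
  for (std::size_t i = 0; i < a.size(); ++i) d[i] *= 2.0;  // stand-in for a GPU kernel

  std::cout << a.prepare(Space::Host)[7] << "\n";  // hidden copy back; prints 14
  return 0;
}
```

In a production library the copy in prepare() would be triggered automatically when a loop is launched through a portability layer, which is what keeps the transfers hidden from application code.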


Adam Bertsch (left) and Pythagoras Watson examine components of the Sierra early-access machine, which is being used to help prepare codes to run on the full version of Sierra as soon as it comes online. (Photograph by Randy Wong.)

Developers must also identify the tasks best handled by the GPUs—which are optimized to efficiently perform the same operation across a large batch of data—and those to be delegated to the more general-purpose CPUs. At the same time, programmers will try to balance Sierra-specific optimizations with those designed to improve performance on a range of HPC architectures.

The iCOE team preparing the application ParaDyn readily acknowledges its challenges. ParaDyn is a parallel version of the versatile DYNA3D finite-element analysis software developed at Livermore beginning in the 1970s to model the deformation and failure of solid structures under impact. (See S&TR, May 1998, pp. 12–19.) The application performs well on today’s HPC architectures but is proving difficult to prepare for Sierra. “We were fortunate in the past because our coding style worked well on Cray and today’s HPC machines, including Sequoia,” says Tony DeGroot. “Previously, we didn’t have to make as many changes as other teams did to convert from machine to machine, but now we have to work harder to convert than some other teams do.”

ParaDyn simulates the deformation and failure of solid structures under impact, such as the hypervelocity impact through two plates shown here. The ParaDyn team is presently exploring how to maintain the strong performance that the application achieves on CPUs while optimizing some aspects for Sierra’s GPUs.

As with most teams, the ParaDyn crew began by assessing how the data move in the application and how work is divided up among its algorithms—the self-contained units that perform specific calculations or data-processing tasks. Edward Zywicz explains, “Unlike most physics codes, which have tens of millions of similar or identical chunks of a problem that can be processed in the same way, we have a large number of dissimilar groups. GPUs work best with large numbers of similar items. So the nature of our problems dictates some performance challenges.” The ParaDyn team found the primary bottleneck to be data movement between GPUs and their high-speed memory. Consequently, the team is reorganizing and consolidating data and will need the help of the ROSE preprocessor (see S&TR, October/November 2009, pp. 12–13) to reduce the need for data movement. This strategy saves time and energy, the most precious resources in computing.
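The consolidation strategy described above can be sketched in outline. The element type, field, and update below are hypothetical and are not taken from ParaDyn; they simply show the general pattern of grouping dissimilar items into homogeneous batches so that each batch becomes a single large, uniform loop, the shape of work that GPUs handle best and that minimizes scattered data movement.

```cpp
// Illustrative sketch only (hypothetical element type and update, not ParaDyn
// code): consolidate a mixed collection of elements into one contiguous batch
// per kind so that each batch can be processed by a single uniform loop.
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Element {
  int kind;      // e.g., element formulation or material model
  double state;  // stand-in for the element's data
};

// Group a mixed list of elements into contiguous, homogeneous batches.
std::unordered_map<int, std::vector<Element>>
batch_by_kind(const std::vector<Element>& mixed) {
  std::unordered_map<int, std::vector<Element>> batches;
  for (const Element& e : mixed) batches[e.kind].push_back(e);
  return batches;
}

// Each batch follows a single code path over contiguous data, the loop shape
// that maps well onto a GPU (for example, via an OpenMP target directive).
void process_batches(std::unordered_map<int, std::vector<Element>>& batches) {
  for (auto& entry : batches) {
    std::vector<Element>& batch = entry.second;
    for (std::size_t i = 0; i < batch.size(); ++i) {
      batch[i].state *= 1.01;  // placeholder for a per-kind element update
    }
  }
}

int main() {
  std::vector<Element> mixed = {{0, 1.0}, {1, 2.0}, {0, 3.0}, {2, 4.0}};
  auto batches = batch_by_kind(mixed);
  process_batches(batches);
  return 0;
}
```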
Pinpointing an application’s performance-limiting features and testing the solutions are far more difficult when the hardware and systems software involved are still under development, as with Sierra. Hybrid CPU–GPU clusters at Livermore are a good stand-in but require far less parallelism to achieve good performance than does the more architecturally complex Sierra. Another crucial difference is that the hybrid clusters require the explicit management of data movement between CPUs and GPUs, which Sierra will not. DeGroot says, “These hybrid clusters help us identify bottlenecks and learn about our code and its performance, but they do not represent exactly how the code will perform on Sierra.” To overcome this shortcoming, Livermore in late 2016 acquired two small-scale “early-access” versions of Sierra—one for ASC computing and another for other work. (These are separate from the permanent counterpart to Sierra being built by the M&IC Program.) These realistic stand-ins feature CPUs, GPUs, and networking components only one generation behind those of Sierra.

Heartening Results

Cardioid, a sophisticated application that simulates the electrophysiology of the human heart, is also being prepared for Sierra through iCOE.


Developed to run on Sequoia by Laboratory scientists and colleagues at the IBM T. J. Watson Research Center, the powerful application can, say, model the heart’s response to a drug or simulate a particular health condition. (See S&TR, September 2012, pp. 22–25.) David Richards, one of Cardioid’s developers, is leading the readiness effort. He explains, “Cardioid was highly optimized for Sequoia. We took advantage of features in Sequoia’s CPU, memory system, and architecture to maximize performance. We are now trying to implement a more portable and maintainable design.” What Cardioid may lose in performance will be more than made up for in improved features for researchers. Richards says, “Sierra will be six times more powerful in terms of raw speed than Sequoia. Even if we trade a little performance for flexibility, we will still be able to run more simulations, and run them faster, on Sierra.” Potential collaborators have already expressed great interest in running immense, highly computationally demanding ensembles of Cardioid simulations, indicating the pent-up demand for a machine such as Sierra that can run more simulations simultaneously.

Developed by Lawrence Livermore and IBM to run on Sierra’s predecessor, Sequoia, the highly scalable and complex Cardioid code replicates the heart’s electrical system—the current that causes the heart to beat and pump blood through the body. Sierra is expected to enable Cardioid users to run larger ensembles of high-resolution simulations than was ever possible on Sequoia.

Cardioid has two major components developed specifically for Sequoia—a mechanical solver and a high-performance electrophysiology code. With help from iCOE, the Cardioid team intends to improve the portability and maintainability of the mechanics solver using components created and maintained at the Laboratory, such as the open-source MFEM finite element library. The team is also re-architecting the electrophysiology code for greater ease of use. Currently, the computational biologists and other researchers who use the code but lack advanced programming skills struggle to make complex modifications to the model. However, the new system will enable users to input their models and data in a familiar fashion. The system will then automatically translate that input into the format of OpenMP—a popular programming model that enhances the portability of parallel applications—and then produce a version of the code optimized to run on Sierra.

Code preparation is already netting promising results. When Cardioid team member Robert Blake optimized a representative chunk of the code for GPUs using OpenMP at a hackathon, he achieved 60 percent of peak speed for 70 percent of the code—a strong result. The Cardioid team also identified important issues with OpenMP that they are working to resolve in collaboration with the model’s standards committee.

Looking Ahead

Thanks to the close and coordinated efforts of COE teams, vendors, users, and developers, a mission-relevant selection of Livermore’s science applications will be ready when Sierra and its M&IC counterpart come online. All involved are fully confident that their preparatory efforts will enable scientists to make the most of the machines right from the start. Still says, “To further scientific discovery, we want to maximize use of the machines and minimize downtime for the five or so years of their operational lifespan.”

Given the rapid evolution and obsolescence of computing hardware, the Laboratory is already planning for the generation of HPC systems to follow Sierra. Expected sometime around 2022, these systems will likely constitute another major leap in performance, up to exascale levels (at least 10^18 flops), or 10 times the speed of Sierra. “We have a two-pronged approach for exascale,” notes Neely, who is also involved in exascale planning. “We will attempt to carry forward as long as possible with existing codes. At the same time, we are exploring what we would do if untethered from existing designs to take advantage of the new architecture.” (See S&TR, September 2016, pp. 4–11.)

By increasing application portability, Livermore’s twin COEs aim to extend the lifetime of many of Lawrence Livermore’s immense, complex, and mission-relevant applications into the exascale era ahead.

—Rose Hansen

Key Words: Advanced Architecture and Portability Specialists (AAPS), Advanced Simulation and Computing (ASC) Program, Cardioid, Center of Excellence (COE), central processing unit (CPU), copy-hiding application interface (CHAI), graphics processing unit (GPU), high-performance computing (HPC), IBM, Institutional Center of Excellence (iCOE), Linux cluster, Multiprogrammatic and Institutional Computing (M&IC) Program, node, nonvolatile random-access memory, NVIDIA, OpenMP, ParaDyn, petaflop, ROSE, Sequoia, Sierra, T. J. Watson Research Center.

For further information contact Rob Neely (925) 423-4243 ([email protected]).
