PRODUCTION TEAM

LLNL Associate Director for Computation Dona L. Crawford

Deputy Associate Directors James Brase, Trish Damkroger, John Grosh, and Michel McCoy

Scientific Editors John Westlund and Ming Jiang

Art Director Amy Henke

Production Editor Deanna Willis

Writers Andrea Baron, Rose Hansen, Caryn Meissner, Linda Null, Michelle Rubin, and Deanna Willis

Proofreader Rose Hansen

Photographer Lee Baker

3D Designer Ryan Chen

Print Production Charlie Arteago, Jr., and Monarch Print Copy and Design Solutions

LLNL-TR-668095. Prepared by LLNL under Contract DE-AC52-07NA27344.

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

CONTENTS

Message from the Associate Director
An Award-Winning Organization
CORAL Contract Awarded and Nonrecurring Engineering Begins
Preparing Codes for a Technology Transition
Flux: A Framework for Resource Management
Improved Performance Data Visualization for Extreme-Scale Systems
Machine Learning Strengthens Performance Predictions
Planning HPC Resources for the Institution
Enhancing Data-Intensive Computing at Livermore
Interweaving Timelines to Save Time
Remapping Algorithm Boosts BLAST Simulations
Catching Bugs with the Automated Testing System
NIF Deploys New Advanced Radiographic Capability
Leveraging Data-Intensive Computing for Sleuthing Seismic Signals
Managing Application Portability for Next-Generation Platforms
Training Tomorrow’s Cybersecurity Specialists
New Capabilities for Information Technology Service Management
Appendices
Publications
Industrial Collaborators
National Laboratory Collaborators

MESSAGE FROM THE ASSOCIATE DIRECTOR

For more than six decades, Lawrence Livermore National Laboratory (LLNL) has pioneered the computational capabilities required to advance science and address the nation’s toughest and most pressing national security challenges. This effort, dating to the Laboratory’s earliest years, has been punctuated by the acquisition of leading computer systems and their application to an ever-broadening spectrum of scientific and technological problems. In 2014, this tradition continued as the Computation Directorate provided the computing architectures, system software, productivity tools, algorithmic innovations, and application codes necessary to fulfill mission-critical national security objectives for the U.S. Departments of Energy (DOE), Homeland Security, and Defense, as well as other federal and state agencies.

As Livermore scientists continue to push the boundaries of what is scientifically possible, they require computational simulations that are more precise, that cover longer periods of time, and that visualize higher fidelity, more complex systems. This requires that Computation advance the state of the art in both high performance computing (HPC) and the supporting computer science.

Larger and more complex simulations require world-class platforms. To sustain the nation’s nuclear weapons deterrent without the need for full-scale underground nuclear tests, we are preparing for another major advance in HPC that will make our physics and engineering simulation models more predictive. The Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) is making this advance possible. As a result of CORAL, three DOE systems—one for each laboratory—are being acquired. LLNL’s system acquisition contract for Sierra was signed in November 2014 with an expected delivery date late in CY17. Sierra is a next-generation supercomputer and Computation’s next capability platform for the National Nuclear Security Administration and the Advanced Simulation and Computing Program. At the moment, Sierra only exists as a proposal within contract language. A significant amount of research, development, design, and testing must still take place to move Sierra from proposal to reality. LLNL and Oak Ridge National Laboratory have partnered with the hardware vendors IBM, NVIDIA, and Mellanox to form several working groups responsible for co-designing the new architecture. Co-design is an essential part of developing next-generation architectures well suited to key DOE applications. By incorporating the expertise of the vendors’ hardware architects, the system software developers, and the DOE laboratory experts—including domain scientists, computer scientists, and applied mathematicians—the resulting system will be well equipped to handle the most demanding computational problems.

In addition to the design and development of the platform, other preparations are underway to ensure that Sierra can hit the computer room floor running. An important aspect of a successful launch is having applications that are capable of utilizing such an immense system. We are adapting our existing large applications using incremental improvements such as fine-grained threading, use of accelerators, and scaling to millions of nodes using the message passing interface—with the 20-petaflop Blue Gene/Q machine providing a virtual laboratory for these explorations.

As these codes evolve, our teams also must ensure that they remain stable, fast, and accurate. For more than a decade, Livermore has been developing the Automated Testing System, which runs approximately 4,000 tests nightly across Livermore Computing’s (LC’s) Linux and IBM Blue Gene systems. These tests generate diagnostic information used to ensure that Livermore’s applications are ready for future systems and challenges.

As large-scale systems continue to explode in parallelism, simulation codes must find additional ways to exploit the increasingly complex design. This can be especially difficult for simulations that must replicate phenomena that evolve, change, and propagate over time. Time is inherently sequential, but thanks to novel work by researchers at Livermore, Memorial University, and Belgium’s Katholieke Universiteit Leuven, existing large-scale codes are becoming capable of parallelizing in time. By using a new multilevel algorithm called XBraid, some applications can now solve problems up to 10 times faster. As its name suggests, XBraid allows simulations to “braid” together multiple timelines, eliminating the need to solve them sequentially. These braided solutions are solved much more coarsely, and the results are fed back into the algorithm until they converge on a solution that matches the expected results from traditional sequential algorithms, within defined tolerances.

Taking advantage of LC’s valuable resources in an orderly manner is no small task. With a sizable number of complex applications, Livermore requires robust means to manage the competing demands for a platform’s time, network capacity, power consumption, and software licenses as resources. To address this issue, we have begun the development of Flux, a new scheduling solution that allows for improved management of LC resources. Computing jobs that must increase or decrease in size or that need to migrate to new resources due to failing components are paradigms that are beyond most schedulers; all but a few fail to operate smoothly with the number of systems LC maintains. Flux allows LC to effectively handle these requirements and provide users a level of access to systems that few other computing centers in the world can offer.

Computation encourages LLNL scientists to leverage our HPC expertise to advance a broad range of science critical to the nation, from elucidating the structure of proteins to optimally operating a smart electric grid to understanding the factors behind population migration in times of extreme disruption. Allocating time on some of our most powerful systems is an important investment the Laboratory makes in our science and technology. One of the ways Computation supports LLNL projects and programs is through the annual Computing Grand Challenge program. Now in its ninth year, the Grand Challenge program awarded more than 15 million CPU-hours per week to projects that address compelling, large-scale problems, push the envelope of capability computing, and advance science. This year marked the first time that institutional demand for cycles on the 5-petaflop Vulcan system exceeded the available allocation, an encouraging indication of the growing interest in advancing science by pushing the state of the art in HPC. Only seven countries in the world possess more computing resources in total than LLNL makes available for unclassified computing.

Traditionally, Livermore has focused on delivering computing at the tera-, peta-, and future exa-scales. However, new paradigms of computing are emerging that are not limited by the number of calculations that can be done but instead by the amount of data that can be moved through the system. So-called “big data” problems are becoming more common and are increasingly critical to LLNL’s national security missions. To accelerate solutions to these challenges, LC is providing Livermore scientists with platforms on which to perform big data research as well as developing methods for bringing traditional big data tools to our existing HPC systems. These tools allow users to make use of Hadoop, a common big data framework, while taking advantage of LC’s parallel file systems and large number of nodes. Work has also commenced on efforts that will push the performance of Hadoop even further. For example, 800 gigabytes of flash memory has been installed on each of the Catalyst system’s 304 compute nodes. Memory and storage in the system allow Catalyst, a first-of-a-kind supercomputer, to appear as a 300-terabyte memory system. In this configuration, our scientists reduced the TeraSort runtime, a big data benchmark, to a little more than 230 seconds—much shorter than traditional Hadoop. Also this year, Catalyst was made available to industry collaborators through Livermore’s High Performance Computing Innovation Center to further test big data technologies, architectures, and applications.

Computation also provides expertise to other mission-critical aspects of LLNL, such as the major software development for the National Ignition Facility (NIF). Among the team’s many achievements this year was the addition of the Advanced Radiographic Capability (ARC). ARC allows NIF to produce short bursts of intense x-rays and image NIF targets during shots. The resulting images provide a “movie” with tens of trillionths of a second resolution. Incorporating this functionality required 18 months of development, followed by the deployment of software that touched nearly every aspect of the facility’s control systems. This work deepens the paths of inquiry for NIF by providing additional data.

All of these endeavors rely on people, and LLNL has some of the best in the world. We are deeply committed to preparing the next generation of experts to meet future challenges. Computation welcomed its third-largest scholar class ever, hosting 131 students and faculty—76 graduate students, 49 undergraduates, 1 high school student, and 5 faculty members—from 89 universities and 8 countries. Specific specialties within the scholar program have been emerging over the past several years—the Cyber Defenders and the Co-Design Summer School. Now in its fifth year, the Cyber Defenders program matched each of the 21 computer science and engineering students with an LLNL mentor and assigned them a real-world technical project wherein they apply technologies, develop solutions to computer-security-related problems of national interest, and explore new technologies that can be applied to computer security. The 2014 inaugural Co-Design Summer School brought nine graduate students from all over the world to LLNL to consider some of the most challenging problems at the boundary between designing application codes and developing high performance computers.

The foundation of our success in deploying advanced computing capabilities, creative technologies, and innovative software solutions lies with the people of Computation. I am grateful to our sponsors for their continued support and, most of all, to the Computation staff who, year after year, demonstrate an extraordinary combination of technical expertise and dedication to the nation. The breadth of skill and creativity they represent are what sets Computation apart as a world-renowned organization.

Dona Crawford
Associate Director, Computation

AN AWARD-WINNING ORGANIZATION

The stories in this annual report present a cross-section of Computation’s accomplishments in research, high performance computing (HPC), software applications, and information technology and security. In addition to the projects highlighted in the report, several Computation personnel and projects received prestigious external recognition in 2014. Selected accomplishments are featured in this section.

Sequoia Tops Graph 500 List of Big Data Supercomputers
Sequoia ranked No. 1 on the Graph 500 ranking in November, after completing the largest problem scale ever attempted—scale 41—with a performance of 23.751 teraTEPS (trillions of traversed edges per second). The team employed a technique developed by IBM.

The biannual Graph 500 list looks at graph-based data problems, a foundation of most analytics work, and the ability of systems to process and solve complex problems. Sequoia’s performance reflects its ability to push the boundaries of data-intensive computing, an accomplishment that is critical to Lawrence Livermore National Laboratory’s (LLNL’s) national security missions. In addition to achieving the top Graph 500 ranking, LLNL computer scientists also demonstrated scalable Graph 500 performance on small clusters and even a single node. To achieve these results, Livermore computational researchers combined innovative research in graph algorithms and data-intensive runtime systems.

Funding Innovative Cancer Research
Livermore computational scientist Amanda Randles received a Director’s Early Independence Award from the National Institutes of Health (NIH) to pursue research that will help predict likely sites for cancer to metastasize, a method that combines personalized massively parallel computational models and experimental approaches.

The NIH Common Fund award provides funding to encourage exceptional young scientists to pursue high-risk, high-reward independent research in biomedical and behavioral science. Randles will receive about $2.5 million over five years. The goal of the project is to develop a method to simulate flow of realistic levels of cells through the circulatory system, thereby gaining insight into mechanisms that underlie disease progression and localization. “Building a detailed, realistic model of human blood flow is a formidable mathematical and computational challenge requiring large-scale fluid models as well as explicit models of suspended bodies like red blood cells,” says Randles. “This will require high-resolution modeling of cells in the blood stream, and necessitate significant computational advances.”

Early Career Award for Scaling Simulation Codes
Computer scientist Todd Gamblin was selected by the U.S. Department of Energy’s (DOE’s) Office of Science Early Career Research Program to receive up to $2.5 million in funding over five years for a project to accelerate the adaptation of scientific simulation codes to increasingly powerful supercomputers, a process that currently can take up to six months for complex applications. Increasingly complex machine architectures and applications are making this process even slower.

Figure a. Robin Goldstone, Dona Crawford, and Maya Gokhale with the certificate for No. 1 on the Graph 500 at SC14.
Figure b. Amanda Randles.
Figure c. Todd Gamblin.

Figure d. Dean Williams leads the award-winning, multi-institutional Ultrascale Visualization Climate Data Analysis Tools team.

Figure e. Leading the effort to conserve energy and reduce costs by consolidating data centers at LLNL are (from left) Marriann Silveira, Anna Maria Bailey, and Andy Ashbaugh.

Figure f. Abhinav Bhatele

Gamblin’s research is particularly important as the HPC community prepares to ramp up computing speeds from petascale (quadrillions of floating point operations per second) to exascale (as much as 1,000 times faster). The dynamically changing behavior of modern simulation codes makes existing techniques for modeling their performance difficult to apply. Under a project entitled “Statistical Methods for Exascale Performance Modeling,” Gamblin proposes to develop statistical models of applications that can represent adaptive, data-dependent code behavior in a manner that can be scaled up for more powerful computing systems. In addition, the project will develop techniques to reduce the complexity of application models so that application developers understand them.

Climate Analysis Tool Wins National Tech Transfer Award
The multi-institutional team responsible for developing, refining, and distributing the Ultrascale Visualization Climate Data Analysis Tools (UV-CDAT) was chosen as the 2015 recipient of the Federal Laboratory Consortium for Technology Transfer’s (FLC’s) Interagency Partnership Award. UV-CDAT is a powerful tool set that aids climate researchers in solving complex data analysis and visualization challenges by integrating more than 70 disparate scientific software packages and libraries.

This award, one of the FLC’s highest honors, recognizes the efforts of laboratory employees from at least two different government agencies who have collaboratively accomplished outstanding work in the process of transferring a technology. Institutions recognized by the FLC for the award include Lawrence Livermore, Lawrence Berkeley, Los Alamos, and Oak Ridge national laboratories; Goddard Space Flight Center; and the National Oceanic and Atmospheric Administration’s Earth System Research Laboratory.

UV-CDAT also won a regional FLC award in 2014 for Outstanding Partnership. The team is led by Dean Williams.

Consolidating Data Centers Effort Earns Award from DOE
The Laboratory’s effort to conserve energy and reduce costs by consolidating data centers at LLNL received a Sustainability Award from DOE. The program, which began in 2011 and uses LLNL’s HPC Strategic Facility Plan as a guide, has so far closed 26 data centers, representing 26,000 square feet of space and resulting in an annual savings of $305,000 in energy bills and $43,000 in maintenance costs. Leading the effort are Anna Maria Bailey and Marriann Silveira of Computation and Andy Ashbaugh of Operations and Business Services.

“Our institutional HPC needs will grow in the future and it is important for the Laboratory to have a Data Center Sustainability Plan to ensure we manage our computing resources in an efficient and cost effective way,” says Bailey, LLNL’s HPC facilities manager.

Young Achiever in Scalable Computing
Computer scientist Abhinav Bhatele was recognized for his HPC research by the IEEE Technical Committee on Scalable Computing. The Young Achievers in Scalable Computing Award acknowledges individuals within five years of receiving their Ph.D. degree who have made outstanding, influential, and potentially long-lasting contributions in the field of scalable computing.

Bhatele’s work focuses on how researchers can use present and future HPC architectures with the highest possible efficiency. A key part of this work involves optimizing communication on different supercomputers, which is one of the primary bottlenecks for scalability. Bhatele has helped show that network topology—the path the data takes between nodes or other components—is a key factor in application performance. The topology-aware algorithms for task mapping and load balancing that he has developed have demonstrated significant performance improvements for applications such as OpenAtom, pF3D, and Qbox. This research could ultimately benefit machine architects designing new networks, computer scientists developing new algorithms, and even customers trying to decide which machine to buy based on their parallel workloads.

Recognition from Top News Service
The partnership that produced the first-of-a-kind Catalyst supercomputer was selected for an HPCWire “Best HPC Collaboration Between Government and Industry” award by readers and editors of the publication.

Computation Deputy Associate Director Trish Damkroger was named one of HPCWire’s “People to Watch” in 2014. The 16 people selected by HPCWire this year are among the most talented and outstanding individuals in the HPC community. Recipients are selected from a pool of potential candidates in academia, government, and industrial end-user and vendor communities. Damkroger was also the General Chair of the HPC industry’s international Supercomputing Conference, SC14, which took place in New Orleans in November.

CORAL CONTRACT AWARDED AND NONRECURRING ENGINEERING BEGINS

Preparatory work commences to bring the next advanced technology system, Sierra, to Livermore

On November 14, 2014, Secretary of Energy Ernest Moniz announced that a partnership involving IBM, NVIDIA, and Mellanox was chosen to design and develop systems for Lawrence Livermore (LLNL) and Oak Ridge (ORNL) national laboratories. The LLNL system, Sierra, will be the next advanced technology system sited at LLNL in the Advanced Simulation and Computing (ASC) program’s system line that has included Blue Pacific, White, Purple, BlueGene/L, and Sequoia. As the next advanced technology system, Sierra will be expected to address the most demanding computing problems that the ASC Program and its mission face. To achieve this goal, the system must provide the largest capability available to ASC applications and incorporate novel technologies that foreshadow the future directions of the Department of Energy’s (DOE’s) large-scale systems on the path to exascale computing.

The partnership’s design for Sierra uses IBM Power architecture processors connected by NVLink to NVIDIA Volta graphics processing units (GPUs). NVLink is an interconnect bus that provides higher performance than the traditional Peripheral Component Interconnect Express for attaching hardware devices in a computer, allowing coherent direct access to GPU and CPU memory. The machine will be connected with a Mellanox InfiniBand network using a fat-tree topology—a versatile network design that can be tailored to work efficiently with the bandwidth available. Sierra is expected to be at least seven times more powerful than LLNL’s current advanced technology system, Sequoia.

Sierra is part of the CORAL procurement, a first-of-its-kind collaboration between ORNL, Argonne, and LLNL that culminated in three pre-exascale high performance computing (HPC) systems to be delivered in the 2017 timeframe. The CORAL competitive bid process began in January 2014 with the release of a single request for proposals. In March, the proposals were evaluated by a collaborative team of more than 100 experts from the three CORAL laboratories, and two different options were chosen for the three laboratories. CORAL was established by DOE to leverage supercomputing investments, to streamline procurement processes, and to reduce the costs to develop supercomputers.

“Our collaborative goal was to choose two systems that, as a set, offer the best overall value to DOE. We need diversity of technologies and vendors, as well as systems that will provide value to the DOE laboratories,” says Bronis de Supinski, chief technology officer for Livermore Computing (LLNL’s supercomputing center). “Diversity helps to offset risk and ensure that future systems will continue to meet our evolving needs.”

The Argonne and ORNL systems will help meet the future mission needs of the Advanced Scientific Computing Research program within the DOE’s Office of Science, while Sierra will serve the mission needs of the ASC Program within the National Nuclear Security Administration. The ORNL system, called Summit, will have the same architecture as Sierra, which demonstrates the synergies between the missions of the two parts of DOE.

Now that the contracts have been awarded to the IBM partnership, Nonrecurring Engineering (NRE) work to maximize the impact and utility of the resulting LLNL and ORNL systems has begun. NRE includes nonrecurring expenses paid to the vendors for design and engineering milestones specific to Sierra.

BRONIS R. DE SUPINSKI, BRONIS@LLNL.GOV


The U.S. Department of Energy awarded IBM contracts valued at $325 million to develop and deliver the world’s most advanced “data centric” supercomputing systems at Lawrence Livermore and Oak Ridge national laboratories to advance innovation and discovery in science, engineering, and national defense. These OpenPOWER-based systems minimize data motion and energy consumption, providing cost-effective modeling, simulation, and big data analytics.

This separate contract precedes the “build” contract to provide accelerated or modified development to enhance usability or effectiveness of the final system. The NRE contract provides significant benefit by creating a Center of Excellence (CoE) that will foster interaction between laboratory domain scientists and vendor experts as actual applications are ported and optimized for the new architecture. The NRE contract will also support exploration of motherboard design and cooling concepts; GPU reliability, performance, and open-source compiler infrastructure; and advanced systems diagnostics and scheduling, along with advanced networking capabilities.

Several working groups that bring together the three laboratories and the IBM partnership have been formed to ensure the future Sierra and Summit systems meet DOE requirements. These working groups are now hubs of activity, addressing the programming environment, node design, and various other topics that will ensure the usability and performance of the final systems. The CoE functions effectively as one of the working groups. Working group discussions often turn into a co-design process to meet goals. Co-design draws on the combined expertise of vendor experts, including hardware architects and system software developers, and laboratory experts, such as domain scientists, computer scientists, and applied mathematicians—working together to make informed decisions about hardware and software components. Activities have also begun for the build contract, the first milestones of which are nearing completion.

A small, early-access system scheduled for delivery in 2016 will have an earlier generation of the IBM Power processor architecture, NVIDIA Pascal GPUs, and a version of NVLink. This early access system will support interactions on several critical topics, such as development of an effective compiler infrastructure. “It will be a complete precursor system, so we can explore the capabilities and begin to deploy some early software systems on the machine,” says de Supinski.

PREPARING CODES FOR A TECHNOLOGY TRANSITION

Exploring viable paths to optimize application performance today and on next-generation platforms

The computational codes and algorithms that run on the capability-class computers at Lawrence Livermore National Laboratory (LLNL) span a wide range of physical scales and are useful not only for gaining scientific insight but also as testbeds for exploring new approaches for evolving challenges, including massive (million-way) concurrency, an increased need for fault and power management, and performance bottlenecks. To ensure that Livermore’s major applications and codes capture an increasing amount of tomorrow’s high performance computing systems’ peak performance, significant research must be done today to study the impact of and plan mitigation strategies for future computing challenges.

LLNL computer scientist Ian Karlin, who is conducting research in this area, explains, “If we have a complete understanding of how our applications perform today and if we identify and address the pain points and performance bottlenecks, we can extrapolate those improvements into performance gains that are portable across systems and can be sustained in future supercomputing architectures.”

Unfortunately, there is no one-size-fits-all solution to improving code performance; each code package behaves differently on different computing platforms and must be considered individually. As computing architectures evolve, the job of determining exactly what factor is limiting a code’s performance gets more challenging. “Our original assumption was that most of our codes are performance limited by memory bandwidth, but we have found multiple examples where that assumption does not hold true,” says Karlin.

After surveying the contents of the high-energy-density physics code HYDRA, Karlin and computational physicist Steve Langer observed that none of HYDRA’s code packages were reaching the bandwidth limit on LLNL’s current roster of supercomputers. Nor were the packages limited by floating-point operations (FLOPs). Instead, the culprit for a few of the code packages was integer instructions, which perform tasks such as data movement or value testing (If X = Y, then Z). For example, HYDRA’s diffusion package was executing integer instructions at more than 50% of peak on the IBM Blue Gene/Q Sequoia supercomputer, and two packages were executing at more than 75% of integer peak.

Codes experiencing performance limitations tend to have irregular data access patterns, the Livermore researchers found. For instance, unstructured mesh codes and algorithms often require that a processor perform more integer operations to find and retrieve data than in other types of codes. Indeed, comparisons showed that some of the code packages were hitting an integer operation limit well before they hit a bandwidth limit.

Other packages were staying well below the bandwidth, FLOP, and integer instruction limits, yet they were experiencing performance on Sequoia that was relatively worse than expected compared to how they were performing on non-specialized commodity Linux clusters. For instance, only considering bandwidth and FLOPs, the HYDRA code would be expected to perform 1.5 to 3 times worse per node on Sequoia than on a Linux cluster, but the difference was actually closer to 5 times worse. Karlin and Langer concluded that something other than bandwidth and FLOPs is limiting performance. They suspected that many codes are also being hampered by memory latency—the time delay between when data is requested by the processor and when it begins to arrive. Latency is largely driven by the distance between the processors and memory and thus is difficult to improve. However, a physics code can be written or modified to minimize the number of times the same piece of data needs to be retrieved from memory, which could reduce the impact of a latency problem. “Because FLOPs are getting cheaper relative to data motion, we need to look more closely at algorithms that perform more floating-point operations relative to data motion,” says Karlin.

This three-dimensional chart shows the performance of HYDRA’s code packages on the Sequoia supercomputer. Some packages are reaching significant percentages of integer instruction per cycle (IPC) peak; however, none are close to dynamic random-access memory bandwidth (DRAM BW) and floating point (FP) IPC peak.

IAN KARLIN, KARLIN1@LLNL.GOV

To this end, Karlin leads the computer science team working on SHOCX, a Laboratory Directed Research and Development project that explores the suitability of higher-order finite element methods for future computing platforms. The computer science team works with mathematicians and physicists to tune performance and incorporate algorithmic method changes into Livermore’s BLAST code. Their collaboration has resulted in transitioning to partial finite element assembly algorithms, as well as new iterative linear solvers that reduce data motion and should scale better on next-generation platforms. Overall, these changes, along with computer science optimizations, reduced data motion by more than 10 times and sped up the code by more than 4 times. According to Karlin, some of the algorithmic changes the team implemented have created additional room for improvement, which they will continue to explore.

Karlin and his colleagues’ findings are already informing the development of Livermore’s next advanced technology system, Sierra, and co-design efforts in general. It will take close coordination between hardware and software vendors and application scientists to create systems that balance both newly uncovered and traditionally considered performance factors. Karlin observes, “We need to take a more nuanced view. The tradeoffs involved are not just memory versus FLOPs. Both matter, but what we really need is everything in balance. We are capable of taking these other balance points into account when we build new machines.”
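The kind of limiter analysis described above can be sketched in a few lines: gather per-package counter measurements, normalize each against the machine’s peak, and flag the resource closest to saturation. The package names, peak values, and measurements in this sketch are invented placeholders, not actual HYDRA or Sequoia data.

# Minimal sketch of a per-package limiter analysis under assumed peaks.
# All numbers and package names below are illustrative only.
PEAKS = {"dram_bw_gbs": 30.0, "fp_ipc": 2.0, "int_ipc": 2.0}

measured = {
    "diffusion": {"dram_bw_gbs": 6.0,  "fp_ipc": 0.3, "int_ipc": 1.1},
    "hydro":     {"dram_bw_gbs": 12.0, "fp_ipc": 0.6, "int_ipc": 0.5},
    "transport": {"dram_bw_gbs": 4.0,  "fp_ipc": 0.2, "int_ipc": 1.6},
}

def limiter(sample):
    """Return (resource closest to its peak, all fractions of peak)."""
    fractions = {k: sample[k] / PEAKS[k] for k in PEAKS}
    return max(fractions, key=fractions.get), fractions

for pkg, sample in measured.items():
    name, fractions = limiter(sample)
    pretty = ", ".join(f"{k}={v:.0%}" for k, v in fractions.items())
    print(f"{pkg:10s} most constrained by {name} ({pretty})")

In this toy version, a package executing a high fraction of integer-instruction peak while sitting far below bandwidth and floating-point peaks would be reported as integer-limited, mirroring the diagnosis made for some HYDRA packages.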

FLUX: A FRAMEWORK FOR RESOURCE MANAGEMENT

Developing a toolset for solving data center bottlenecks

Large computer centers that house dozens of large-scale systems with unique capabilities must have a method for efficiently scheduling their resources for use. In the case of Livermore Computing (LC), those resources include extremely large Linux clusters, such as the 46,656-core, 970-teraflop Zin, as well as myriad smaller support systems for generating, visualizing, analyzing, and storing data that is critical to fulfilling Lawrence Livermore National Laboratory’s national security missions. LC developers have a long history of developing state-of-the-art software—including SLURM and its predecessors—that allows users to run and manage their simulation codes across multiple clusters. However, current resource and job management approaches cannot keep up with the challenges of increasing system scales and interplays, such as those that occur between compute clusters and file systems.

Flux is a next-generation resource and job management framework, currently in development, that will expand the scheduler’s view beyond the single dimension of “nodes.” Instead of simply developing a replacement for SLURM and Moab, the team behind Flux is creating a framework that will enable new resource types, schedulers, and framework services to be deployed as data centers continue to evolve.

A resource manager tracks and monitors the hardware deployed in the data center, then arbitrates access as customers submit work they would like to run. The job scheduling algorithms must not only determine when and where resources will be available that meet the user-specified requirements, they must also implement an allocation policy. Job placement in both space and time is critical to achieving efficient execution and getting the most work done for the time, power, and money spent. Flux addresses this issue by making smarter placement decisions and by offering greater flexibility and more opportunity for adaptation than current resource management software. These solutions will help scientific researchers and computing users more effectively harness the power of LC capabilities. For example, with a holistic view of the data center’s input/output (I/O) bandwidth capability and utilization, Flux will be able to avoid the “perfect storm” of I/O operations that can occur when a naïve scheduler places I/O-intensive work without regard to I/O availability.

In Flux, each job is a complete instance of the framework, meaning the individual task can support parallel tools, monitoring, and even launch sub-jobs that are, like fractals, smaller images of the parent job. Because each job is a full Flux instance, users can customize Flux for use within their jobs. For example, a user desiring to launch many small, high-throughput jobs could submit a large, long-running parent job, and inside it load a specialized scheduler that is streamlined for high throughput. Panning outward in scale, schedulers operating at a larger granularity can move resources between child jobs as bottlenecks occur and employ pluggable schedulers for resource types that do not exist today.

Flux, a resource and job management framework currently in development at Livermore, offers a hierarchical, multilevel approach that will allow users to dynamically schedule computing resources with methods tailored to specific workloads.

“We are providing more capable resource management through hierarchical, multi-level management and scheduling schemes,” says Becky Springmeyer, LC deputy division leader and Flux project leader. “Users will benefit from pluggable schedulers with deeper knowledge of network, I/O, and power interconnections, and the ability to dynamically shape running work. One of the challenges we face in designing Flux is making sure its framework is general and extensible enough to support resources and use cases that are only now emerging in research. Our team includes researchers in power-aware and I/O-aware scheduling.”

BECKY SPRINGMEYER, SPRINGMEYER1@LLNL.GOV

Flux is being designed with input from system developers, computer science researchers, and end users, as well as external organizations that operate large computer centers. In addition, Livermore’s co-design efforts with code-development tools, such as STAT, Spindle, and TotalView, will provide a highly scalable and composable code-development environment for LC users. Users will be able to pick and choose the programming tools they need and seamlessly use them together under Flux’s framework. For example, users of the Kull code could scalably launch the application with Spindle and debug it using TotalView or STAT if necessary.

Flux is open-source software that will be available to high performance computing centers around the world. Already, developers outside of LLNL have shown interest in Livermore’s prototyping efforts and have contributed ideas and feedback via the Flux collaboration space on GitHub. Flux developers are working with the University of Delaware to develop the I/O-aware scheduling component of Flux, and the team plans to expand its research collaborations with other academic institutions for elements such as elastic resource and job management.

Flux testing on Livermore’s next generation of commodity technology systems will commence when the first system arrives in 2016.
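The nesting idea described above—each job is itself a full instance that can carve out child instances with their own scheduling policies—can be illustrated with a toy model. This sketch is not the Flux API or its actual scheduler; the class, method names, and policies are invented purely to show the hierarchy.

# Toy model of hierarchical scheduling: an instance owns nodes and can
# either run jobs directly or hand a subset of nodes to a child instance
# that schedules them with a different (pluggable) policy.
class Instance:
    def __init__(self, name, nodes, policy="fifo"):
        self.name = name
        self.free = list(nodes)   # nodes this instance may allocate
        self.policy = policy      # placeholder for a pluggable scheduler
        self.children = []

    def spawn_child(self, name, n_nodes, policy):
        """Hand n_nodes to a nested instance that schedules them its own way."""
        granted, self.free = self.free[:n_nodes], self.free[n_nodes:]
        child = Instance(name, granted, policy)
        self.children.append(child)
        return child

    def run(self, job, n_nodes):
        """Allocate nodes to a job if available; return the placement."""
        if len(self.free) < n_nodes:
            return None
        alloc, self.free = self.free[:n_nodes], self.free[n_nodes:]
        return {"instance": self.name, "job": job, "nodes": alloc,
                "policy": self.policy}

# A long-running parent job becomes an instance, and a throughput-oriented
# child instance is nested inside it for many small jobs.
parent = Instance("batch-job-1", [f"node{i}" for i in range(8)])
child = parent.spawn_child("uq-ensemble", 4, policy="high-throughput")
print(parent.run("big-simulation", 4))
print(child.run("small-task-001", 1))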

IMPROVED PERFORMANCE DATA VISUALIZATION FOR EXTREME-SCALE SYSTEMS

Livermore research blends performance analysis with information visualization

Software debugging techniques, performance analysis, and optimization methods are becoming increasingly important high performance computing (HPC) tools for next-generation architectures. Such innovations are necessary to diagnose and remedy problems that slow a system’s operation and reduce its utilization. Martin Schulz, a computer scientist at Livermore’s Center for Applied Scientific Computing, is an HPC expert whose knowledge of performance visualization and analysis is leading to advances in both research areas.

Exascale machines will deploy millions of processing elements or cores and are expected to run a quintillion floating-point operations per second. Extreme-scale computing will be beneficial for simulating complicated physical, chemical, and biological processes that cannot be realized on today’s most powerful machines. However, the increasingly complex applications and architectures required for exascale systems will also make performance analysis more difficult. One of the key challenges in assessing performance is transforming raw data into a user-friendly, intuitive format. Visualization of performance data can play a significant role in this task.

Together with colleagues at Livermore, Schulz is leading Performance Analysis and Visualization at Exascale (PAVE) efforts for providing higher fidelity visualizations of performance data across hardware, software, and communication domains. “We have developed a new paradigm of projecting and visualizing performance data obtained from one domain onto other domains for faster, more intuitive analysis of applications,” says Schulz. Called the HAC (hardware, applications, communication) model, this framework provides a more comprehensive approach to data acquisition and analysis. “HAC accounts for the interplay between on-node performance, domain decomposition, and an application’s intrinsic communication pattern,” he says. “By taking data from each of these domains and projecting it to, visualizing it on, and correlating it with the other domains, we obtain valuable information regarding how parallel application codes behave.”

The Performance Analysis and Visualization at Exascale project provides a new paradigm for projecting and visualizing performance data from the hardware, applications, and communication (HAC) domains of a system. BoxFish is a visualization tool that represents this approach.

Seeing the benefits of exascale computing and encouraging its development, the Department of Energy’s Office of Advanced Scientific Computing Research within the Office of Science started the X-Stack Program. This program promotes basic research that advances HPC capabilities in the area of programming models and tools and addresses key software challenges to building an exascale software stack. Led by Schulz, the Performance Insight for Programmers and Exascale Runtimes (PIPER) project is a multi-institutional effort that also includes Pacific Northwest National Laboratory, Rice University, and the Universities of Maryland, Utah, and Wisconsin. The project goal is to develop new techniques for measuring, analyzing, attributing, and presenting performance data on an exascale software stack. Schulz says, “Performance tools are essential to enable users to optimize application and system code and must include intuitive summaries and visualizations to help identify performance problems.” PIPER will collect and analyze data at different levels of the exascale software stack. The output of the performance analysis is either displayed to users using visualization tools, such as Boxfish (based on the HAC model), or it is fed back to the exascale runtime stack for online optimization.

The software architecture of PIPER illustrates how performance data will be collected and analyzed at different levels of the exascale software stack. The output of the performance analysis is either displayed intuitively to users using visualization tools, or it is fed back to the exascale runtime for online optimization.

MARTIN SCHULZ, SCHULZM@LLNL.GOV

Additional efforts underway address techniques to optimize memory access for improved system performance. “Currently, memory systems—which provide a computer system with the means to read and write stored information—are falling behind processor development,” says Schulz. “This discrepancy has implications for a system’s overall performance and power efficiency.” Schulz is involved in creating MemAxes, a tool that visualizes and aims at improving communication between cores on a node and the associated memory system(s). Using sampling techniques that aggregate fine-grained data over time slices, users can extract performance information from data sources. Improved visualizations of this data help identify communication problems and speed resolution of the issue.

Ravel, another tool under development, will help to visualize execution traces on large-scale parallel programs. Execution traces record a history of process events and interprocess messages in a parallel application. Trace visualization allows users to browse this event history and search for insights into the observed performance behavior. In contrast to existing tools used for this purpose, Ravel first extracts a logical event timeline. Individual events can then be plotted in a display that enables users to depict sources, patterns, and the evolution of delays in an execution.

According to Schulz, the goal of this extensive work is to create an integrated framework for displaying and visualizing performance information that is critical to optimizing a system’s ability to run code. To meet this objective, experts in the visualization and performance analysis communities must come together in a meeting of the minds. Toward this end, Schulz and colleagues have taken steps to integrate the two communities by encouraging activities and conferences in which experts in each field can collaborate with one another. “We are bringing together the visualization and performance analysis communities to develop HPC tools that enable our users to more easily and accurately decipher raw performance data, and with that capability, improve performance utilization,” says Schulz. “Within the supercomputing paradigm, this joint work between performance analysis and information visualization is becoming its own academic subfield.”
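As a rough illustration of the HAC-style projection described above, the sketch below takes a hardware-domain measurement (cache misses collected per MPI rank) and attaches it to the application domain (the mesh block each rank owns), so hot spots can be viewed where the physics lives. The ranks, mesh extents, counter values, and threshold are invented for illustration; tools such as Boxfish and MemAxes operate on measured data and real topology information.

# Minimal sketch of projecting hardware-domain data onto the application
# domain. All values below are hypothetical.

# Hardware domain: a counter sampled per MPI rank.
cache_misses = {0: 1.2e6, 1: 9.8e6, 2: 1.1e6, 3: 1.0e7}

# Application domain: the mesh sub-block each rank owns, as (x0, x1, y0, y1).
mesh_blocks = {0: (0, 5, 0, 5), 1: (5, 10, 0, 5),
               2: (0, 5, 5, 10), 3: (5, 10, 5, 10)}

def project_to_application(counter_by_rank, blocks):
    """Attach each rank's counter value to the mesh region it computes."""
    return [{"extent": blocks[r], "value": v}
            for r, v in counter_by_rank.items()]

# The projected records could be rendered as a heat map over the mesh;
# here we simply flag regions above an arbitrary threshold.
for rec in project_to_application(cache_misses, mesh_blocks):
    flag = "HOT" if rec["value"] > 5e6 else "ok"
    print(f"mesh region x={rec['extent'][0:2]} y={rec['extent'][2:4]}: "
          f"{rec['value']:.1e} misses [{flag}]")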

MACHINE LEARNING STRENGTHENS PERFORMANCE PREDICTIONS

Using machine learning to model and characterize the performance of adaptive applications

Unlike typical computer programs, in which only a single stream of instructions executes at a time, supercomputer programs require coordinating potentially millions of tasks simultaneously. The multiphysics simulations that run on Lawrence Livermore National Laboratory’s (LLNL’s) high performance machines divide models of the physical world into smaller chunks and distribute them across millions of processors. Each processor calculates the behavior of one of these small pieces, frequently sharing its results with the other processors through an ultra-low latency network. If the processors do not work efficiently, time in the simulated world slows to a crawl. LLNL scientists spend many painstaking hours ensuring that simulations can run as fast as possible, and the difficulty of this task is growing as supercomputers become more complex.

One particular challenge is writing instructions that will execute efficiently on specific hardware. However, a more vexing problem is that the performance of many applications is data-dependent—that is, the particular piece of the model being simulated on a given processor may affect its performance. Adaptation is key to many of the numerical algorithms used on high performance computers, without which many simulations would not be sufficiently accurate or feasible. The task of pinpointing the source of performance problems is complicated by the complexity of the hardware, thus the number of cases that must be optimized quickly becomes impossible to handle. Porting production codes to new platforms is already time consuming, and today’s turnaround of four to six months may become years on next-generation machines.

Funded through the Department of Energy’s Early Career Research program, computer scientist Todd Gamblin is leading a project designed to accelerate the adaptation of existing scientific simulation codes to perform on next-generation exascale systems. Gamblin’s project leverages machine learning to develop algorithms that can predict the performance of application codes even when they adapt based on their inputs. Rather than requiring scientists or engineers to tune their codes manually with explicit instructions, codes tuned using machine learning could “learn” from past executions. “There is a desperate need for insightful modeling techniques that can simplify analyzing and tuning high performance computing (HPC) applications,” says Gamblin.

Over the course of the five-year project, which began in 2014, Gamblin and his team will develop statistical models of HPC applications that can represent data-dependent and adaptive code behavior. Such models will help predict the performance of realistic applications and identify where performance problems originate in a system. In addition, the team will develop techniques to reduce the complexity of application models. “We are treating performance analysis as a data science problem,” says Gamblin. “The ability to accurately measure, analyze, and visualize performance data will allow us to increase throughput of the entire computation center to do more complex science simulations that are important to the Laboratory’s missions.”

The statistical models under development will use both supervised and unsupervised machine-learning techniques to determine performance predictors and better characterize data at run times. Supervised learning methods, which learn from past examples of good performance, will be used to predict the performance of algorithms for single sets of inputs. Unsupervised models, on the other hand, look for structure in large data sets. “We will use unsupervised learning techniques such as correlation analysis and clustering to identify features that predict data-dependent performance of applications,” says Gamblin. “Combined, these models will be able to predict the performance of full-system, multiscale, adaptive applications.” Additional techniques will be developed for comparing the performance of different codes to one another.

To ensure the team’s methodology—called Modeling for Exascale Applications and Data—will work for full production applications, it will be initially run using proxy apps (smaller, more portable applications that mimic larger codes of interest). As an example, the LULESH proxy app models a very small but computationally intensive part of the Laboratory’s sophisticated ALE3D multiphysics code. While ALE3D uses many physics models that interface with a unified mesh and physics data model, LULESH uses a much simplified, smaller mesh. “Currently, there is no way to validate proxy apps to production codes and little or no work has been done on modeling the truly dynamic, data-dependent algorithms that dominate the workloads of integrated applications,” says Gamblin. “By using our models to validate proxy applications against codes, we will ensure that proxy apps can be trusted as representative of production code performance.”

The Modeling for Exascale Applications and Data architecture starts with a proxy application, and then builds models to predict its performance. These models are analyzed in parallel to produce a data-dependent performance model, which can be used to understand problems in the code.

Performance data mapped onto the material mesh from the LULESH hydrodynamics application shows memory latency is highest on regions shared by multiple concurrent tasks.

TODD GAMBLIN, TGAMBLIN@LLNL.GOV

Gamblin will receive up to $2.5 million in funding over the next five years for this project. He says, “The idea is for these models to provide simulation developers with insights that allow them to quickly optimize the performance of their code, ensuring that applications can take full advantage of the performance of future exascale machines.” Work such as this is imperative for the Laboratory to continue performing the cutting-edge scientific simulations it needs to understand the complex scientific processes that underpin its mission objectives. Thanks to work like Gamblin’s, Livermore will be ready to take scientific exploration to the next step when exascale systems come online.

PLANNING HPC RESOURCES FOR THE INSTITUTION

Preparing Livermore for the next generation of HPC systems

Livermore offers two main tracks of high performance computing (HPC) to staff researchers. The first, advanced technology (AT) systems, are at the leading edge of HPC technology and run the most demanding simulations that push the boundaries of computational science. These huge simulations leverage the opportunity to run in parallel on millions of processor cores and offer potentially game-changing science—as evidenced by the ground-breaking work into lithium-ion batteries that John Pask and his colleagues have done to revolutionize consumer electronics and to pave the way to the elimination of fossil fuels. The second track of HPC technology, commodity technology (CT) systems, are, as the name suggests, built from commodity components that are more cost effective, easier to use, and satisfy a critical "bread and butter" HPC need at the Laboratory. While the AT systems are normally split among a few dozen users who time-share a large portion of the system at once, the CT systems may have hundreds of concurrent users who are each running multiple smaller parallel jobs to satisfy their daily research needs. The limited lifespan of both tracks of HPC resources demands ongoing institutional planning throughout the technology lifecycle.

The Lab's Multiprogrammatic and Institutional Computing (M&IC) Program brings these tailored, cost-effective HPC services to Livermore's programs and scientists. The M&IC strategy for procuring new HPC systems is to leverage the Advanced Simulation and Computing (ASC) procurement cycle for both AT and CT systems. Such procurements allow M&IC to support the Laboratory Directed Research and Development program, the Laboratory's discovery-focused Grand Challenge program, and the directorate strategic capability banks that keep the Laboratory's disciplines strong and help retain a top-tier scientific workforce. In particular, the M&IC Institutional Computing Executive Group (ICEG), the governing body of Livermore's institutional computing, plans for future HPC needs and documents investment recommendations for consideration by the Lab's Senior Management Team. The ICEG plays a crucial role in recommending both appropriate architectures and the level of investment for future institutional HPC procurements.

Yet hardware planning is only half of the challenge. "HPC institutional planning is not just about bringing in new hardware," explains Brian Carnes, M&IC's program director. "We must also proactively prepare our HPC applications to make effective use of next-generation systems." HPC vendors design future technologies mainly to respond to the communication, data management, data analysis, and gaming needs of the public; however, given Livermore's advanced science and technology goals, the Laboratory's expectations and demands for HPC do not always align with the rapidly evolving mainstream architectures. To bridge this gap, Livermore code developers must partner with vendors in co-design efforts and adapt their applications to run on next-generation architectures, a difficult and time-consuming process that is eased by planning and preparing early in anticipation of the challenges.

The increased emphasis on application preparation stems from past experience. "When Vulcan arrived in 2013, we discovered that many of our institutional codes were not quite ready to take full advantage of the new advanced technology hardware," says Carnes. "That resulted in a preparedness latency we're hoping to avoid when the 2017 AT systems arrive." To help prevent this latency, the ICEG has asked the institution to participate in the ASC Center of Excellence (CoE) effort. The CoE effort brings in vendor experts to work with Livermore code developers to proactively prepare the Laboratory's HPC applications so that these applications will effectively utilize the new AT system as quickly as possible once the technology arrives. As part of this effort, the ICEG has been compiling an institutional strategic code portfolio to identify which code efforts should be targeted first for institutional funding.

Institutional HPC plays a significant role in Livermore's scientific and recruiting activities. "We want to keep our workforce engaged in leading-edge discovery for the benefit of the nation and the Lab," says Carnes. "By continuing to provide AT systems to meet institutional needs, disciplines can keep their workforce at the forefront of technology, and stay competitive for external computational opportunities."

Through strong and consistent investments, M&IC's computing resources have been used across the Laboratory to push the limits of computing and its application to simulation science and data analytics. All of Livermore's programs and disciplines have been bolstered by access to world-class HPC systems. Through these efforts, Livermore has become a preeminent laboratory in computational science, brought about ground-breaking scientific insights, and facilitated the recruitment and retention of Livermore's leading physical and computer scientists.

Improving lithium-ion batteries from first principles. This visualization shows the quantum molecular dynamics of the solid–electrolyte interphase layer. Visualization by Liam Krauss.

Next-generation application development will be accelerated through a "Center of Excellence" vendor partnership that pairs vendor applications expertise with LLNL staff and application (code) developers. The partnership provides hands-on collaborations, on-site vendor expertise, customized training, early delivery platform access, and application bottleneck analysis, coordinated by a coordinator and steering committee across technical areas such as algorithms, compilers, programming models, resilience, and tools. The outcomes include rapid feedback to vendors on early issues, publications and early science, and LLNL codes tuned for next-generation platforms. Virtually all elements of the existing CoE are expandable: a broader skill mix, cross-directorate collaborations, and greater leverage of shared solutions mean bigger is better.

BRIAN CARNES, CARNES1@LLNL.GOV

ENHANCING DATA-INTENSIVE COMPUTING AT LIVERMORE

New platforms improve big data computing on Livermore’s high performance computers

Increasingly large datasets and the growing diversity of data drive the need for more capable data-intensive computing platforms. At Livermore, this concern takes on additional significance since the Laboratory's work uses big data to pursue a safer, more secure world for tomorrow. The Laboratory's high performance computing (HPC) capabilities offer exceptional opportunities for data analysis and processing. However, the specialized nature of Livermore Computing's (LC's) hardware presents challenges to traditional approaches to data-intensive computing.

"LC's HPC systems have been tailored to run scientific simulations very well. Unfortunately, this is not the optimal architecture for many data-intensive computing applications," explains Robin Goldstone, a member of LC's Advanced Technologies Office. Goldstone and her team have been exploring solutions that can bring LC's expertise to bear on the Laboratory's growing demand for big data computing platforms. "We recognized that we needed to take a look at Hadoop, a solution that has been requested by numerous customers. We set out to see how we could tweak our traditional HPC systems to meet the needs of these big data customers."

The Hadoop ecosystem—which includes MapReduce, HBase, and newer frameworks such as Spark and Storm—has gained widespread adoption in part due to its relatively modest computing hardware requirements. Clusters of inexpensive commodity servers with local hard drives can run Hadoop effectively since the software has been designed from the ground up to tolerate failure. In contrast, HPC applications typically do not tolerate failure, which causes HPC systems to demand more expensive hardware and complex recovery mechanisms that will achieve resilience. For these reasons, HPC systems are typically dismissed as being "overkill" for frameworks like Hadoop.

However, since LC already has these HPC systems deployed, the question becomes whether such systems can efficiently run Hadoop in place of an entirely separate set of commodity-class resources. To answer this question, LC purchased a small, generic Hadoop cluster in order to gain experience in deploying and managing such a system. This cluster, named Bigfoot, allowed Goldstone's team to evaluate the operational impact of supporting this platform while simultaneously providing a place to conduct trials between commodity and HPC systems.

The HPC-centric approach involved the development of a software package, named Magpie, which allows Hadoop and similar data analytics frameworks to run on LC's HPC systems. Magpie accomplishes this task by instantiating the framework within the context of a batch job—rather than on a persistent, dedicated cluster—and by reading and writing from the parallel HPC file system instead of local disk drives.

With both Bigfoot and Magpie in hand, Goldstone's team assessed whether Magpie could replace the need for dedicated Hadoop clusters at Livermore. Using the de facto Hadoop benchmark, TeraSort, the team ran a sort function on the Bigfoot cluster and then on an equivalent number of nodes on one of LC's HPC systems. The team additionally formulated several TeraSort configuration options and Magpie tunables to test the two systems. The results showed that the benchmark performed at best 50% slower on the HPC cluster than on Bigfoot when using an equivalent node count.

The team performed two additional experiments with more encouraging results. In the first test, the TeraSort benchmark ran on the HPC cluster using double the number of nodes as on Bigfoot. This time, the HPC cluster won the comparison, achieving a 33% reduction in runtime over Bigfoot. While this success might not appear to be a fair test, it demonstrates the "surge" capability that LC offers users—with thousands of cluster nodes already deployed, LC can quickly accommodate a customer's need to scale up their analysis. To do the same on a dedicated Hadoop cluster would require months of lead time to purchase and deploy additional hardware.

In the second experiment, the team employed Catalyst, a new LC HPC system equipped with non-volatile random access memory, more commonly known as "flash storage." Each Catalyst compute node contains 800 GB of high-performance Peripheral Component Interconnect Express-attached flash storage, which Magpie can use in place of Lustre for storing Hadoop intermediate data files. It is this intermediate input/output (I/O) that puts the most strain on the Lustre file system, so the Magpie developer conceived that a modest amount of fast local storage could significantly improve I/O performance. The team's testing validated this theory, achieving a 2× performance improvement when running a TeraSort across 295 of Catalyst's nodes that use the local flash storage.

"Reducing data motion is not just a big data issue," says Goldstone, elaborating on the value of these outcomes. "Our HPC simulation customers are also feeling the pain of moving data, and we see architectures like Catalyst and the future Sierra system as the path forward. The work we have done illustrates the synergy between big data and HPC, and puts LC in a leadership position to meet the needs of both camps going forward."

Magpie enables running Hadoop-style applications alongside traditional message passing interface (MPI) applications on the specialized high performance computing (HPC) systems at Livermore.

Bigfoot vs. Magpie: results of the 1-terabyte TeraSort benchmark tests (8 nodes, 128-MB and 256-MB block sizes) that ran on Bigfoot and the HPC cluster, comparing native Hadoop with HDFS over Lustre, HDFS over Lustre with UDA, the Intel Lustre plugin, and HDFS over Lustre on 2× the nodes. The various tests sought to determine which configuration achieved the most efficient approach to conducting big data analysis on an HPC system.

ROBIN GOLDSTONE, GOLDSTONE1@LLNL.GOV
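Magpie itself is distributed as batch-job scripts, which are not reproduced in this report. The sketch below only illustrates the general pattern described above, standing a per-job Hadoop instance up inside a batch allocation and pointing its scratch space at node-local flash or a parallel file system path; all paths, file names, and settings are invented for illustration.

```python
# Illustrative sketch (not Magpie's actual scripts): configure a per-job Hadoop
# instance whose intermediate data live on node-local flash, then run TeraSort
# against directories on the shared parallel file system. Paths are hypothetical.
import os
import subprocess
import tempfile

def write_core_site(conf_dir: str, scratch_dir: str) -> None:
    """Write a minimal core-site.xml pointing Hadoop's scratch space at scratch_dir."""
    xml = f"""<?xml version="1.0"?>
<configuration>
  <property><name>hadoop.tmp.dir</name><value>{scratch_dir}</value></property>
  <property><name>fs.defaultFS</name><value>file:///</value></property>
</configuration>"""
    with open(os.path.join(conf_dir, "core-site.xml"), "w") as f:
        f.write(xml)

conf_dir = tempfile.mkdtemp(prefix="job-hadoop-conf-")
write_core_site(conf_dir, "/l/ssd/hadoop-scratch")      # hypothetical flash mount
env = dict(os.environ, HADOOP_CONF_DIR=conf_dir)

# Run the standard TeraSort example over shared-file-system directories
# (hypothetical paths); the batch allocation's nodes do the work.
examples_jar = "hadoop-mapreduce-examples.jar"          # assumed location/name
subprocess.run(["hadoop", "jar", examples_jar, "terasort",
                "/p/lustre/terasort-in", "/p/lustre/terasort-out"],
               env=env, check=True)
```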

INTERWEAVING TIMELINES TO SAVE TIME

A new parallel-in-time method dramatically decreases solution time for various simulations

Computer simulations in numerous areas rely on the solution of time-dependent problems, from the evolution of a supernova to the propagation of seismic waves to the metabolism of pharmaceuticals in the human body. Generally, solutions to these problems are reached by addressing one time step after another, in order. Unfortunately, the sequential nature of traditional time stepping has left the programmers who maintain and enhance these codes and the scientists employing them with a shared dilemma: how to coax more efficiency from such codes during a pivotal transitional period in computer architecture.

For decades, these codes have benefited from a steady increase in computer chip speeds with each new generation of computer system. So long as the chips continued to get faster, the time it took to compute each individual time step was reduced. Spatial refinements made to a code to increase accuracy often must be balanced by temporal refinements, but chip speed increases enabled more time steps to be calculated without increasing the total compute time. Chip speeds have plateaued in recent years, though, meaning that any further increases in the number of time steps will simply increase the total compute time.

"Now, to take advantage of bigger machines and see speed-ups, all parts of these codes need to be able to take advantage of parallelism," explains Lawrence Livermore computational mathematician Jacob Schroder. This includes time, one of the final frontiers in algorithm parallelization. Schroder and a team of Livermore scientists, in collaboration with researchers at Memorial University and Belgium's Katholieke Universiteit Leuven, have developed a method for solving all of the time steps simultaneously, with the help of a new multilevel algorithm called XBraid and the massively parallel processing capabilities of today's high performance computing (HPC) systems. The approach has already been shown to dramatically decrease solution time for various simulations, some by as much as tenfold.

While they are not the first researchers to explore the idea of solving time steps out of order, their method offers advantages over most other methods. In particular, XBraid obviates the need for a complete code rewrite, which for a complex simulation code could be an enormous time investment. Schroder observes, "One of the real novelties of our approach is that it's largely nonintrusive. Our goal is for users to come to us with an existing sequential time-stepping application they've been working on for maybe 10 or 20 years. Then, all they have to do is wrap it with some code using a conceptually simple interface."

For this effort, the XBraid team applied their extensive experience in developing scalable multigrid spatial solvers to the time dimension. Multigrid approaches solve systems at various levels of granularity. "We're combining multiple timelines of differing accuracies to get the solution significantly faster. The key point to the algorithm is that by 'braiding' together the timelines, you don't have to solve any individual timeline sequentially," explains Schroder. The solver begins with a "guess" of the solution and then uses an algorithm to generate an improved guess. This procedure repeats until the iterations converge to the solution. By solving coarser (and less computationally expensive) versions of the same problem and feeding those solutions back into the finer scale version of the problem, the iterative process is accelerated. Since the computational cost is proportional to the number of equations, a large problem can be solved in the same time as a smaller one simply by increasing the number of processors working on the calculation. Importantly, the solutions from XBraid and sequential time stepping are identical, up to a user-defined tolerance.

The benefits of this parallel-in-time approach are twofold. By incorporating XBraid into parallel codes solving for large numbers of time steps, researchers eliminate a performance bottleneck and reach a solution more quickly. They can also make better use of today's and tomorrow's HPC systems. "Exascale computers will be even more massively parallel than Sequoia," says Schroder. "We need our algorithms to be able to make the best use of that level of parallelism."

In the year and a half since the project's genesis, the XBraid team has developed foundational theory for the methodology with support from the Department of Energy's Office of Science and developed proof-of-concept demonstrations and software using funding from Livermore's Laboratory Directed Research and Development program. Both linear and nonlinear parabolic problems, their first target area, have demonstrated excellent results, with XBraid solving the problems up to 10 times faster than before. XBraid also shows promise for fluid dynamics calculations. For instance, a project to develop a next-generation helicopter design code in collaboration with the Department of Defense, while still in early stages, has demonstrated an eightfold speedup with XBraid.

In 2015, the team will continue to expand and improve XBraid and the types of problems that it can help solve. This will include validating the parallel-in-time method on hyperbolic problems such as wave propagation, a challenging problem type for multigrid methods. They also hope to expand awareness and usage of the open-source XBraid software in the broader scientific community. "Parallel in time is a timely topic right now," Schroder says, "as people grapple with big changes in computer architecture."

Visualization of the XBraid method for the advection equation u_t = u_x, which describes a sine wave moving from left to right. A random initial guess across the entire space–time domain is iterated on until it converges to the solution. This is in stark contrast to sequential time stepping, which would serially propagate the sine wave from one time step to the next.

In a heat equation example for a 129 × 129 × 16,385 space–time grid (run time versus number of cores, comparing sequential time stepping with multigrid-in-time), there is a crossover point beyond which significant speedups are realized through parallel time stepping. The parallel-in-time approach uses more memory and requires more operations than sequential time stepping and thus is slower if only a few processors are available. However, due to its increased parallelism, as more processors become available, it is able to use them fully.

JACOB SCHRODER, SCHRODER2@LLNL.GOV

REMAPPING ALGORITHM BOOSTS BLAST SIMULATIONS

New approach enables BLAST to simulate larger time steps in a way that preserves accuracy

Shock hydrodynamic simulations, used to understand the behavior of materials at very high pressures, provide essential information for Lawrence Livermore stockpile stewardship work and complement and support laser experiments performed at Livermore's National Ignition Facility (NIF). However, high compression rates and hydrodynamic instabilities make accurate modeling challenging. Computational mathematician Tzanio Kolev explains, "Implosions at NIF generate enormous pressure. This produces shock waves generating and interacting with each other, and dealing with these discontinuities computationally is difficult. Typically these are modeled by an Arbitrary Lagrangian–Eulerian (ALE) approach, where during the Lagrangian phase the mesh evolves with the simulation. One of the challenges with this method is that, if we're not careful, the mesh will intersect and tangle. Also, we have multiple materials to represent. Modeling the physical discontinuities at shock fronts and material interfaces is challenging mathematically."

Achieving higher-quality simulations of shock hydrodynamics problems requires the development of more advanced numerical algorithms. To that end, a team of computer scientists, mathematicians, and physicists from the Computation and Weapons and Complex Integration (WCI) organizations, led by Kolev and WCI's Rob Rieben, created a new high-order ALE framework as an alternative to the standard low-order ALE solution methods. The framework is based on a high-order finite element approach, which uses simple element equations over many small curved geometric subdomains to approximate a more complex equation over a large domain. "Higher-order elements have more vertices, or control points, positioned around them, allowing us to curve the element boundaries and the geometry inside," says Kolev. "This helps us more accurately follow the material flow."

The new numerical methods were implemented in the Livermore-developed BLAST research code. Over the past few years, the team has successfully demonstrated that the high-order ALE framework—and overarching BLAST code—can produce efficient, accurate, and robust simulations of a variety of challenging shock hydrodynamics problems, but they have also identified opportunities for improvement. In 2014, the research expanded to include performance analysis and improvement, led by Ian Karlin, and a Laboratory Directed Research and Development (LDRD) program strategic initiative (SI), led by WCI's Bert Still. Under the new SI, the team has incorporated a new high-order Discontinuous Galerkin (DG) remapping algorithm into the ALE framework. This enables BLAST to simulate larger time steps and to more accurately address mesh elements containing multiple materials.

An inertial confinement fusion problem uses fourth-order finite elements to simulate the interactions of multiple materials.

Sometimes mesh elements are unable to conform to a function as well as is desired, particularly for functions with steep gradients. For those portions of the calculation, the new DG algorithm "stops time" and institutes a remap phase, during which the function stays the same while the mesh evolves. Once it has been translated to a more appropriate mesh, the calculation continues from the point where it left off. Notes Kolev, "With high-order ALE, we can push the Lagrangian phase much farther than with low-order ALE codes, but finite elements thin out and time steps become small eventually. With our remap approach, we can run with much larger time steps in a way that preserves accuracy."

When the mesh changes, it can result in multiple materials within the same element, which can produce mathematical difficulties. The team has solved the problem by representing each material as a high-order function for purposes of remapping. "With our newest work, we are able to capture a mix of materials at a very detailed level," notes Kolev. Overall, the remapping algorithm has demonstrated excellent parallel scalability, geometric flexibility, and accuracy on a variety of model problems and single- and multiple-material ALE simulations. One of the most demanding calculations to date was a three-dimensional BLAST simulation involving three materials performed on 16,000 cores of the Vulcan supercomputer.

The team has now begun to apply high-order solution methods to other types of physics, beginning with radiation diffusion. "We're gradually extending it to more and more pieces of what a realistic multiphysics simulation would require," says Kolev.

Through their resolution of some long-standing numerical challenges in shock hydrodynamics simulation, the algorithms developed through this research support the Laboratory's national and energy security missions, but they also benefit research on parallel performance and next-generation computer architectures. Higher-order methods have greater FLOP-to-byte ratios, meaning that more time is spent on floating-point operations relative to memory transfer, an important characteristic of numeric algorithms for extreme-scale computing. "On top of their mathematical benefits, higher order lets us increase the arithmetic intensity," Kolev explains. "We can dial in how much we do with the data within each processor, each element, and each integration point. In fact, higher-order methods can often run in the same amount of time as lower order due to the increased computational efficiency." With LDRD funding, Computation and WCI researchers are characterizing and optimizing the performance of BLAST's high-order methods on different high performance computing systems.

High-order methods remain an area of opportunity for the researchers. Notes Kolev, "At higher orders, unexpected things can happen. A lot of questions people thought were settled suddenly became very interesting."

BLAST strong scaling on Vulcan for a 2D Lagrangian Sedov problem on 131,072 zones, comparing Q1, Q2, Q4, and Q8 finite elements with an SGH code having the same run time. Because the communication overhead for high-order methods is not as significant as it is for low-order algorithms, it is possible to achieve excellent strong scaling results all the way down to the limit of a single computational zone, or element, per processor.

TZANIO KOLEV, KOLEV1@LLNL.GOV
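BLAST's remap is a high-order DG algorithm on curved three-dimensional meshes; those details are not given in this report. The sketch below shows only the underlying idea of a conservative remap phase in the simplest possible setting: a one-dimensional, piecewise-constant field transferred from an old mesh to a new one by integrating over cell overlaps.

```python
# Minimal 1D illustration of a conservative "remap phase": the field is held
# fixed while the mesh changes, and each new cell receives the overlap-weighted
# average of the old cells it covers. BLAST's actual remap is high-order DG in 3D.
import numpy as np

def remap_piecewise_constant(old_edges, old_vals, new_edges):
    new_vals = np.zeros(len(new_edges) - 1)
    for j in range(len(new_edges) - 1):
        left, right = new_edges[j], new_edges[j + 1]
        total = 0.0
        for i in range(len(old_edges) - 1):
            overlap = max(0.0, min(right, old_edges[i + 1]) - max(left, old_edges[i]))
            total += old_vals[i] * overlap          # integral over the overlap
        new_vals[j] = total / (right - left)        # cell average on the new mesh
    return new_vals

old_edges = np.linspace(0.0, 1.0, 11)               # uniform "Lagrangian" mesh
old_vals = np.where(old_edges[:-1] < 0.5, 2.0, 1.0) # piecewise-constant field
new_edges = np.sort(np.r_[0.0, np.random.default_rng(1).random(9), 1.0])  # distorted mesh
new_vals = remap_piecewise_constant(old_edges, old_vals, new_edges)

# Conservation check: the total "mass" is unchanged by the remap.
mass_old = np.sum(old_vals * np.diff(old_edges))
mass_new = np.sum(new_vals * np.diff(new_edges))
print(f"mass before {mass_old:.12f}, after {mass_new:.12f}")
```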

CATCHING BUGS WITH THE AUTOMATED TESTING SYSTEM

Automated system provides streamlined testing and performance analysis

Software developers use regression testing to help uncover new bugs in an application after enhancements, patches, or other changes have been made. It is an essential but repetitive activity, as it entails rerunning a representative set of test cases each time the code is updated to ensure that the code continues to run the same way and produce the same answers as before. For a large code that is updated many times a day by a sizeable team of computer scientists, such as some at Lawrence Livermore, regression testing would be impossible to keep up with if done in a traditional fashion.

Shawn Dawson, a computational scientist working with the Weapons and Complex Integration (WCI) Directorate, notes, "Many of our programmatic codes at Livermore are too massive, dynamic, and complex to develop with manual testing. We simply had to automate the process." Dawson is in charge of the maintenance and enhancement of one such solution: the Automated Testing System (ATS), a Livermore-developed, scalable, Python-based tool for automating the running of tests for an application. ATS currently operates on Livermore's Linux clusters and IBM Blue Gene machines and can be used to execute thousands of tests with a single command.

By helping projects to meet their milestones and maintain code integrity, ATS serves as a fundamental enabler for important projects in WCI and the Physical and Life Sciences Directorate. Tens of thousands of tests a day are run at Livermore with the help of ATS. Moreover, these project teams are able to make more efficient use of resources by adopting the common testing framework, rather than each project developing their own ad hoc solutions.

It was dissatisfaction with commercial testing products that spurred the development of ATS a decade ago. Dawson, who is both a developer for and a customer of the tool, notes, "Commercial codes didn't suit the needs of the codes we develop. We run on unique systems, not simply desktop PCs or Macs. Also, those codes don't have the flexibility to extend testing or compare data." With ATS, Dawson and his colleagues have been able to introduce various extensions into the tool, such as—most recently—a capability for testing and comparing image files.

Not only do Livermore developers test the code before committing to each proposed change, they also run multiple larger nightly test suites on several computer platforms to test the correctness and performance of the code with different configurations and compilers. ATS's customer codes average about 4,000 tests each night, and each test produces multiple plots, resulting in a substantial amount of data. ATS organizes the output of these nightly runs into a text-based list and emails it to the project team for review the next morning. It also creates a graphical web-based report for the team to compare the curves side by side and quickly spot bugs introduced to their code. Scientists can easily post-process the testing results for deeper analysis.

Dawson's team strives to enhance the tool on an ongoing basis, for instance by incorporating user feedback. "Testing is a complex issue. We can't ever test 100% of the code, as the lines depend on the type of data input, but we are testing the large majority of our use cases. When a user finds an area we missed, we incorporate it into the test suite." Customer needs also drive the development of new features for the tool. In 2015, the ATS team will begin archiving metrics related to each test run, a much-requested capability. This requires connecting ATS to a Laboratory-developed database called Tracker. With this enhancement, users will be able to run queries and generate reports on the data.

While ATS was designed for regression testing, it also can be used to look at code performance over time and across platforms. "The tool is really useful for comparing one machine to another. I use it frequently to compare compiler releases," says Dawson. He anticipates employing ATS for acceptance testing for the new compilers and libraries during the integration of Livermore's next advanced technology system, Sierra. "ATS was designed to be easily extendable to new machines," he adds. "As we move to the next advanced technology system, ATS will be ported to the machine, and we will provide support for it."

ATS will also support code preparation for Sierra and other future machines. ATS user Patrick Brantley notes, "ATS is a very flexible framework that allows us to deliver a more reliable simulation code to our users. As we contemplate potentially significant changes to our simulation code to enable its use on advanced computing architectures, we will continue to rely on ATS to help us maintain the integrity of our code."

The author wishes to acknowledge the team of computer scientists working on codes in WCI who have developed and enhanced ATS.

An extension to the Automated Testing System (ATS) enables the comparison of a baseline image and a current image using the distributed, parallel visualization and graphical analysis tool, VisIt. Differences in the images are reported as a diff map and error metrics (total pixels, non-background pixels, number and percentage of differing pixels, and average difference). While the two images appear nearly identical, VisIt—through ATS—is able to detect subtle image differences and report these to the developers.

SHAWN DAWSON, DAWSON6@LLNL.GOV
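ATS drives these image comparisons through VisIt on real plot files; that interface is not shown here. The fragment below is a simplified stand-in that computes the same kind of pixel-difference metrics reported in the figure above for two in-memory image arrays; thresholds and field names are illustrative.

```python
# Simplified stand-in for the image comparison described above: given baseline
# and current images (as arrays), report pixel-difference metrics similar to
# those ATS surfaces through VisIt. Thresholds and field names are illustrative.
import numpy as np

def diff_report(baseline: np.ndarray, current: np.ndarray, tol: float = 1e-3) -> dict:
    """Compare two grayscale images in [0, 1] and summarize their differences."""
    if baseline.shape != current.shape:
        raise ValueError("images must have the same shape")
    diff = np.abs(current.astype(float) - baseline.astype(float))
    non_background = np.count_nonzero((baseline > 0) | (current > 0))
    different = np.count_nonzero(diff > tol)
    return {
        "total_pixels": baseline.size,
        "non_background": non_background,
        "different": different,
        "pct_diff_pixels": 100.0 * different / max(non_background, 1),
        "avg_diff": float(diff[diff > tol].mean()) if different else 0.0,
    }

rng = np.random.default_rng(0)
baseline = rng.random((300, 300))
current = baseline + rng.normal(0.0, 0.002, baseline.shape)   # subtle perturbation
report = diff_report(baseline, current)
print(report)
failed = report["pct_diff_pixels"] > 1.0    # a test harness would flag this run as failed
```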

NIF DEPLOYS NEW ADVANCED RADIOGRAPHIC CAPABILITY

Thoughtful controls design and testing enables smooth integration for new laser capability

Less than two-millionths of a second elapse from the moment the initial laser burst is created to the completion of a typical high-energy-density science experiment at Livermore's National Ignition Facility (NIF). Obtaining precise information about the physical processes occurring in the target during this brief span has necessitated the development of a new generation of ultrafast, ultrahigh-resolution diagnostic devices such as the Advanced Radiographic Capability (ARC), which is currently being deployed on NIF. When complete, ARC will be used to produce short bursts of intense x-rays to image NIF targets as well as to enable new experiments in fusion ignition and high-energy-density stewardship science.

ARC is being integrated on an existing set of four NIF beam lines—known as a quad—to facilitate reuse of the existing NIF main amplification system. The modified NIF quad has been designed to rapidly and autonomously switch between ARC (short pulse) and NIF (long pulse) operations based on experimental requirements. In ARC mode, it will use a split-beam configuration, propagating two short-pulse beams for each NIF aperture. Staggering the arrival of the eight ARC beamlets onto special back-lighter targets will enable creation of an x-ray "movie" with tens of trillionths of a second resolution.

Incorporating ARC functionality into NIF has been a substantial undertaking, requiring major controls development and other system updates. A team of software engineers, led by Gordon Brunton, worked for 18 months to develop and deploy the system and software controls enhancements necessary for ARC operations. Integration had to be planned carefully to minimize disruption to ongoing NIF shot operations. Brunton notes, "One of the reasons ARC was so challenging was that all of the pieces of every control system software release were interrelated. Failure to complete one piece on time could have jeopardized the whole schedule. But due to the work of the whole team, we were able to meet every milestone."

By staggering the arrival of the Advanced Radiographic Capability's beamlets onto back-lighter targets, x-ray images resembling a movie strip (upper image) will be acquired, capturing the dense core of fuel in a National Ignition Facility target (lower image) as it is being compressed.

ARC increases the amount of complex laser equipment that must be monitored and coordinated for a successful NIF experiment. In fact, the number of control points for electronic, optical, and mechanical devices on the modified NIF quad grew by 70%. "It felt like building another new beam line from scratch," says Brunton. Throughout the system design phase and in close coordination with the system experts, the team sought opportunities for leveraging existing types of control hardware. This reuse strategy significantly reduced the deployment schedule, cost, and risk. "Writing software for a new control type takes months, whereas implementing control points for an existing type only takes days," Brunton adds. "Overall, the reuse saved us many years of effort."

Operationally qualifying the ARC system requires verification of many aspects of the short pulse. To allow these verifications to be performed and continuously monitored on each ARC shot, a comprehensive suite of short-pulse diagnostics has been integrated into the system, several of which were developed specifically for ARC. Due to the specialized equipment, the new and modified software deployed—representing about 15% of the new control points—primarily related to diagnostics.

All NIF experiments are performed with the support of the experiment automation system. This suite of software applications manages the full life cycle of a laser shot, from shot goal acquisition to shot data archiving. The software is based on a data-driven workflow engine within a state machine model. The model breaks a shot cycle into numerous operational phases, and each phase is populated with workflow nodes that perform well-defined, reusable, automated activities (for instance, pulse shaping). Shot goals are used to autonomously reconfigure the system based on experimental needs. It is this data-driven flexibility that has facilitated the integration of ARC into the workflow automation with minimal software framework modifications.

Due to the scale of the ARC system modification, Brunton's team knew that comprehensive off-line qualification would be critical for minimizing risk to NIF operations. For testing, every control point in NIF was given an emulated equivalent that closely mirrored the behavior of the real hardware device. During off-line integration and formal quality assurance, they used these capabilities extensively to complete qualification of most software modifications over multiple phases. This strategy greatly reduced the on-line NIF facility time required for qualification of the modifications. The final major release phase concluded with shot automation capability deployment in the summer of 2014.

An extensive series of commissioning shots is scheduled on ARC over the next several months, and the system is expected to begin experimental operations later in the year. Brunton's team, which will continue to support ARC controls through commissioning and operations, is pleased to see the ARC effort at last approaching completion. Brunton remarks, "Even though significant progress has been made on the fusion ignition challenge, ARC remains an important diagnostic capability. It will help us understand more about inertial confinement fusion experiments by providing improved data on important parameters such as compression, symmetry, and fuel mix. In addition, it opens up opportunities for further experimentation, allowing us to look deeper into dense, novel materials that current diagnostic methods can't see into."

Ultra-short laser pulse diagnostics such as Frequency Resolved Optical Gating, shown here, necessitated substantial software development.

GORDON BRUNTON, BRUNTON2@LLNL.GOV
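The experiment automation software itself is not shown in this report. The sketch below only illustrates the general pattern named above: a shot cycle broken into phases, each populated with reusable workflow nodes, with the set of nodes chosen from the shot goals rather than hard-coded. All class, phase, and goal names are invented.

```python
# Conceptual sketch of a data-driven, phase-based workflow like the pattern
# described above: phases contain reusable nodes, and the nodes are selected
# from shot goals. Names and goals are invented for illustration only.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class WorkflowNode:
    name: str
    action: Callable[[dict], None]       # a well-defined, reusable activity

@dataclass
class Phase:
    name: str
    nodes: List[WorkflowNode] = field(default_factory=list)

def build_shot_cycle(goals: Dict[str, object]) -> List[Phase]:
    """Populate phases from shot goals (data-driven reconfiguration)."""
    setup = Phase("setup", [WorkflowNode("pulse_shaping",
                                         lambda g: print("shape pulse:", g["pulse"]))])
    if goals.get("mode") == "short_pulse":           # e.g., a short-pulse style shot
        setup.nodes.append(WorkflowNode("configure_split_beam",
                                        lambda g: print("split-beam configuration")))
    fire = Phase("fire", [WorkflowNode("fire_lasers", lambda g: print("fire"))])
    archive = Phase("archive", [WorkflowNode("archive_data",
                                             lambda g: print("archive results"))])
    return [setup, fire, archive]

goals = {"mode": "short_pulse", "pulse": "10 ps"}
for phase in build_shot_cycle(goals):                # simple state machine: run phases in order
    for node in phase.nodes:
        node.action(goals)
```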

LLNL COMPUTATION 2014 ANNUAL REPORT LEVERAGING DATA-INTENSIVE COMPUTING FOR SLEUTHING SEISMIC SIGNALS Developing core technologies to change the way geophysical monitoring systems work and data is processed

Since the 1960s, scientists at Lawrence Livermore National Laboratory have been developing methods to locate, identify, and distinguish nuclear explosions from earthquakes and other types of seismic activity. Today, high performance computers and new data-intensive computing techniques are improving our country's ability to monitor nuclear explosions and allowing researchers to better define seismically active faults.

In the last decade, the ability to monitor international seismic events has grown more computationally challenging. The data collected from seismology sensors deployed around the world have been increasing at exponential rates, and the advent of ultra-cheap sensors promises to maintain this trend. With ubiquitous sensors comes the need to store and process enormous data volumes.

"Livermore and other organizations tasked with tracking seismicity around the world now must monitor, analyze, and respond to data that is one million times more voluminous than what we were dealing with 10 years ago," explains Stan Ruppert, who leads a team of computer scientists tasked with developing data analytic capabilities to address the problem. "Today we are handling millions of signal sources, hundreds of thousands of recording sensors, and billions of derived parameters and measurements. The growing data volume is overwhelming the community's traditional processing techniques."

One of the most significant problems associated with combining data from different sensors is organizing the metadata that come with them. Often, the data collected are of unreliable quality, updates are unpredictable, key values are inconsistent, and data are provided in inconsistent formats. In addition, duplicate and conflicting metadata must be coalesced into a single logical entity. Currently, these issues are solved by manual data fixes or complex data manipulations buried in application code—methods that are unsustainable given the substantial increase in incoming sensor data and the accompanying exponential growth in metadata.

A team of computer scientists supports Livermore’s Geophysical Monitoring Program, including (from left) Douglas Knapp, Douglas Dodge, Stan Ruppert, Jessie Gaylord, and Steven Magana-Zook.

To address this issue, the Livermore team is designing a new metadata processing architecture that easily handles data provenance for incoming data in any format, isolates downstream applications from ingestion and transformation details, and preserves the complete provenance of the data. The proposed architecture leverages NoSQL data warehousing techniques developed for large-scale text processing and analysis. The new metadata processing design will provide a highly scalable and maintainable solution that also improves the quality and availability of critical sensor metadata.

Big data techniques are also being used to help understand new seismic events by comparing the waves generated to "templates" of past underground nuclear tests and other seismicity. A wide variety of nuclear proliferation/detonation detection capabilities rely on recognizing the specific signature of an event in a large volume of streaming real-time or archived data. If templates of the target signatures can be defined, then automated template matching (e.g., correlation, pattern recognition) offers a potential way to detect, locate, and identify the source of interest, even when the signal is buried in a noisy background.

An important step for utilizing big data has been the use of distributed-data platforms, such as Hadoop and Storm. A partnership between Livermore's Geophysical Monitoring Program (GMP) team and Livermore Computing (LC) has helped demonstrate the potential of data-intensive computing to solve earth science problems.

To better understand the distribution and behavior of correlated seismic events, the team cross-correlated a 50-terabyte global dataset consisting of more than 300 million seismograms. Run on a conventional distributed cluster, the computation took 42 days to complete. In anticipation of processing much larger datasets, the team re-architected the system to run as a series of MapReduce jobs on a custom LC Hadoop cluster, dubbed "Bigfoot." The results were impressive—a 19× performance increase on a 1-terabyte test dataset. Given the results, the team expects to be able to rerun the entire 50-terabyte dataset in 2 days versus the original 42 days.

"We think Hadoop will lead to a viable signal correlation architecture for processing streaming seismic data in the next few years," says Doug Dodge, who co-authored a paper detailing the results on Bigfoot. "This will dramatically improve our ability to conduct research on massive seismic datasets."

Fundamental algorithmic transformations were required to achieve this performance increase. No single data-intensive tool (e.g., MapReduce) will solve every analytic workflow, so the LC and GMP teams are investigating other software techniques and hybrid hardware, such as Livermore's Catalyst system. One prototype is helping to ensure the quality, reliability, and accuracy of seismic data archived by Incorporated Research Institutions for Seismology (IRIS), the world's largest seismic data distributor. In partnership with IRIS, a representative sample of global seismic data (45 terabytes) derived from the IRIS repository is being analyzed with LLNL tools from the distributed computing ecosystem and with techniques from digital signal processing, statistics, and machine learning to facilitate the extraction of quality data by the world-wide seismology community.

These innovative seismic-monitoring technologies should ultimately extend to other data-rich domains also important to Livermore's national security missions, including applied electromagnetics, streaming signals, and the electric grid.

Processing time and total input/output by implementation (traditional cluster versus Hadoop), broken down by workflow stage (fetch metadata, get candidates, nearest neighbors, and correlation) and by read/write volume. Using Livermore Computing's Bigfoot (Hadoop) cluster, scientists achieved a 19× performance increase and better scaling on a 1-terabyte dataset of seismic events.

STAN RUPPERT, RUPPERT1@LLNL.GOV
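The production correlation pipeline runs as MapReduce jobs over hundreds of millions of seismograms and is not reproduced here. The snippet below shows only the core operation inside such a workflow, matching one record against a template by normalized cross-correlation, on synthetic data.

```python
# Core operation inside a correlation workflow: sliding normalized
# cross-correlation of a template waveform against a longer record.
# Data are synthetic; a production system distributes this over huge archives.
import numpy as np

def normalized_cross_correlation(record: np.ndarray, template: np.ndarray) -> np.ndarray:
    m = len(template)
    t = (template - template.mean()) / (template.std() * m)
    out = np.empty(len(record) - m + 1)
    for i in range(len(out)):
        window = record[i:i + m]
        std = window.std()
        out[i] = 0.0 if std == 0 else np.dot(t, (window - window.mean()) / std)
    return out          # values in [-1, 1]; peaks mark template-like events

rng = np.random.default_rng(3)
template = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 200)) * np.hanning(200)
record = rng.normal(0, 0.5, 5000)
record[3100:3300] += template                      # bury the "event" in noise
cc = normalized_cross_correlation(record, template)
print("best match at sample", int(np.argmax(cc)), "correlation", round(float(cc.max()), 2))
```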

MANAGING APPLICATION PORTABILITY FOR NEXT-GENERATION PLATFORMS

RAJA simplifies parallel software portability

Over the past two decades, simulation code performance for the Advanced Simulation and Computing (ASC) Program has improved dramatically as central processing unit (CPU) clock rates increased and computer architectures remained fairly stable. Application developers have focused primarily on scaling to ever-larger node counts as computing platforms grew in size. Now, advanced technology (AT) system node architectures are becoming more complex and diverse as hardware vendors strive to deliver performance gains within constraints such as power usage. New technologies, such as multilevel memory systems and many-core and heterogeneous processor configurations, are challenging application developers to expose massive amounts of fine-grained on-node parallelism, which can take different forms depending on the underlying hardware.

The challenges are particularly acute for ASC multiphysics codes, which are essential tools for Livermore's nuclear stockpile stewardship mission. A typical large integrated physics code contains millions of lines of source code and tens of thousands of loops, in which a wide range of complex numerical operations are performed. Variations in hardware and parallel programming models make it increasingly difficult to achieve high performance without disruptive platform-specific changes to application software. The challenge is to transform codes once so they can run efficiently on every platform by exploiting different parallelization strategies and programming models.

The potential of RAJA is shown in a runtime comparison of LULESH–RAJA wall time on TLCC2 relative to the baseline LULESH–OpenMP version: different RAJA variants of LULESH (RAJA OpenMP, RAJA CilkPlus, and RAJA OpenMP lock-free) run on a single Tri-Lab Linux Capacity Cluster (TLCC2) node at 1 to 16 threads. The blue curve shows OpenMP overheads at lower thread counts due to compiler issues when RAJA and OpenMP are combined. However, the purple curve shows that the overheads can be overcome and substantial performance gains achieved by invoking complex execution patterns via RAJA, such as lock-free task scheduling. Note: a single-line source code change is required to switch between the RAJA variants.

To address this challenge, computer scientists Richard Hornung and Jeff Keasler are developing RAJA, a programming approach that insulates applications from platform-specific hardware and parallel programming model concerns. RAJA is designed to integrate with existing codes and provide a development model for new codes to be portable from inception. Basic insertion of RAJA enables a code to run on different platforms. Then, architecture-specific tunings can be pursued within the RAJA layer without substantial application code disruption.

The fundamental conceptual abstraction in RAJA is an inner loop, where the overwhelming majority of computational work in most physics codes occurs. The main idea promoted by RAJA is a separation of loop bodies from their iteration patterns. This allows encapsulation of various execution issues including, but not limited to, single instruction multiple data vector parallelism, multithreading and graphics processing unit (GPU) device dispatch, asynchronous task parallelism, fine-grained rollback for resilience, and data permutations to increase locality and/or reduce synchronization dependencies.

"We performed an initial assessment of RAJA to determine whether it can support the programming styles used in Livermore's ASC codes and simplify access to fine-grained parallelism," says Hornung. Hydrodynamics packages in ARES and KULL were used to evaluate basic RAJA usage. LULESH, a proxy for the Lagrange hydrodynamics algorithm in ALE3D, was used to demonstrate RAJA flexibility and more advanced concepts (see figure).

Hornung explains, "These codes use very different software constructs, mesh types, and methodologies for looping over mesh-based data. Parts of our evaluation used the RAJA reference implementation while others required specialization of RAJA concepts for specific code situations. Such customization is a fundamental design goal of RAJA."

Nearly all loops in the ARES Lagrangian hydro package were converted to RAJA. Nominal runs on the Blue Gene/Q platform (Livermore's current AT system, Sequoia) yielded a 50% speedup by introducing four-way OpenMP inner loop threading. KULL performance was studied on a loop-by-loop basis; some loops saw no benefit while others saw close to perfect 4× speedup. Such speedups are due to using hardware threads, which is not done typically in ASC's MPI-only codes since running multiple MPI tasks per core is generally not possible due to memory limitations. Thus, these gains are a pure performance win over the status quo. In contrast, similar loop threading on Livermore's Intel-based machines showed no performance benefit. This was expected due to the relatively high performance of sequential execution on standard CPUs compared to fine-grained OpenMP threading. However, the study showed that RAJA allows the option to choose an appropriate execution for each platform and propagate that choice easily through a large source code base at compile time.

This work proved key aspects of RAJA. It is sufficiently flexible to support Livermore production codes; it can enhance readability and maintainability, as opposed to other programming models, such as directive-based models, that can "clutter" code; and it can enable easy exploration of different programming models and data layouts (for example, switching between OpenMP and CUDA and reordering loops and data is simply a matter of recompiling a code).

Based on these investigations, several avenues of future work will be pursued: continued RAJA development and exploration of portability issues related to future AT platforms, such as Trinity (Intel many-integrated core) and Sierra (GPU); compiler features and optimization issues required for RAJA will be worked with vendors through Trinity and Sierra "centers of excellence" and other co-design activities; and several Livermore production codes are actively moving toward adopting RAJA.

The authors wish to acknowledge members of the ARES and KULL code teams who helped with the RAJA evaluations.

RICHARD HORNUNG, HORNUNG1@LLNL.GOV
JEFF KEASLER, KEASLER1@LLNL.GOV
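RAJA itself is a C++ template library, and none of its actual interface is reproduced here. The Python fragment below is only a language-neutral illustration of the separation described above: the loop body is written once, and the execution policy is chosen independently of it. The function and policy names are invented.

```python
# Not RAJA (which is C++); a language-neutral illustration of separating a loop
# body from its iteration pattern so the execution policy can be swapped without
# touching the body. Real RAJA policies map to OpenMP, CUDA, and so on; Python
# threads here only demonstrate the dispatch pattern, not a performance gain.
import math
from concurrent.futures import ThreadPoolExecutor

def forall_sequential(indices, body):
    for i in indices:
        body(i)

def forall_threaded(indices, body, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(body, indices))

# The "loop body" is written once, independent of how it is dispatched.
n = 1_000
x = [0.0] * n
def body(i):
    x[i] = math.sin(i) * math.cos(i)     # stand-in for a physics kernel

forall_sequential(range(n), body)        # one execution policy...
forall_threaded(range(n), body)          # ...or another, with the body unchanged
```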

TRAINING TOMORROW'S CYBERSECURITY SPECIALISTS

Cyber Defenders program emphasizes teamwork and innovation

Lawrence Livermore National Laboratory's (LLNL's) Cyber Defenders program is a unique summer student experience focused on training the next generation of computer security professionals. Cybersecurity is one of the fastest growing technology sectors, the popularity of which is largely founded on an increasing need to combat illegal network hacking and data theft and to protect government and critical infrastructure resources—problems that affect government agencies, corporations, and individual citizens alike.

The Cyber Defenders program is led by computer scientists Celeste Matarazzo and Evi Dube. Several years ago, Matarazzo was attending a cybersecurity conference when she was struck with a realization: the field of cybersecurity needs new thinking—more diversity of thought—a concept that is very familiar to LLNL with its foundation of multidisciplinary science and technology.

"As a nation, we can't be secure without a diverse set of problem solvers to counter the cyber threat," says Matarazzo. No stranger to educational outreach, she realized that a fun yet practical introduction to cybersecurity might encourage more students to pursue careers in the discipline. In 2009, with Laboratory support, she established a Cyber Defenders summer internship program to provide hands-on training to potential future cybersecurity experts.

Six years later, Cyber Defenders is flourishing in its original intention to attract a varied and talented group of participants. The 2014 program included 3 academic faculty members and 27 student interns (12 undergraduates, 6 Master's degree students, 6 Ph.D. candidates, 1 law student, 1 high school teacher, and 1 high school student) from a candidate pool of more than 500 applicants. While Cyber Defenders is a technical internship, it also includes students from social science disciplines.

Four cyber defenders team up for a computer penetration testing challenge—one of the program’s many competitions designed to encourage collaboration and provide engaging, real- world projects.

Charles Carrington Scott, a student from Hampton University, presented his work at LLNL's annual student poster symposium. Scott's research focused on NS-3, an open-source network simulator that is driving the Laboratory's realistic web traffic model.

Cyber Defenders interns develop skills in areas such as intrusion detection and prevention, network monitoring and analysis algorithms, anomaly detection, and machine learning. Working on real-world projects is paramount, Matarazzo says of the program's curriculum. "The overall topic of cybersecurity is exciting and is more than compliance and procedures," she says, "so to engage these bright, energetic cyber defenders, the activities need to capture their imagination and certainly not be boring."

Working closely with LLNL mentors, students complete a research project and share their results through a presentation and a poster session. They also attend lectures and seminars, participate in individual and group exercises, and explore new technologies that can be applied to computer security. "Our goals for the program are to excite people about the range of activities available in cybersecurity, to ensure that they leave the program with really strong skills, and to build a pipeline of skilled candidates for jobs at Livermore, at other national laboratories, and in government services," says Matarazzo. "We want to give students the depth and hands-on training they might not get in a standard computer science or engineering program."

A differentiator for the program is the inclusion of the full spectrum of cyber issues. It introduces participants to role-playing and debate exercises and offers a one-day workshop on the legislative, policy, and privacy topics of cybersecurity, taught by LLNL staff and visiting professors.

One of the more popular activities is a summer-long penetration testing competition in which participants use Raspberry Pis (credit-card-size computers) to solve challenge problems in topics such as protocol processing and web security. Another highlight is a one-week cyber defense competition, called Tracer Fire, which is held in collaboration with colleagues from Sandia National Laboratories. Aided by sugar, caffeine, and camaraderie, teams race to complete a series of computer security challenges that test skills such as securing servers from attackers, analyzing malware, and designing robust and secure networks. Just as cybersecurity professionals must continually adapt to remain a step ahead of cyber attackers, the teams must adjust their strategies in response to simulated events and evolving information. What makes the event unique, Matarazzo notes, is its multi-institutional nature: in 2014, interns and technical staff from Sandia's New Mexico campus helped develop the scenarios and led the competition, while interns from LLNL and Sandia California teamed with visitors from Charleston High School to compete.

The Cyber Defenders program will also be key to a new Cybersecurity Workforce Pipeline Consortium. In January 2014, the Department of Energy announced a five-year grant to support an alliance of 13 Historically Black Colleges and Universities, Livermore and Sandia, and a K–12 school district, working together to meet the growing need for trained cybersecurity professionals. LLNL will participate in a blend of research collaborations, mentoring, teaching, curriculum development, and recruitment of students, and Cyber Defenders will provide internship opportunities.

Whether Cyber Defender alumni go on to careers at Livermore or other national labs, as some already have, or they choose other career paths, Matarazzo hopes that they will keep the connections they have made through the internship program. "These connections could prove invaluable as a career resource," she says.

CELESTE MATARAZZO, MATARAZZO1@LLNL.GOV
EVI DUBE, DUBE1@LLNL.GOV

NEW CAPABILITIES FOR INFORMATION TECHNOLOGY SERVICE MANAGEMENT

ServiceNow will offer Livermore staff a cloud-based solution to request services or support

Livermore's information technology (IT) teams are responsible for providing IT services and solutions that enable employee productivity and, as a result, further the overall national security mission of Lawrence Livermore National Laboratory (LLNL). The teams are committed to delivering highly available, well-managed IT services that strike a balance between agility, modernity, and compliance.

For an enterprise as large as LLNL, it is essential to have an information technology service management (ITSM) product that structures and automates the flow of work and service delivery, offers excellence in execution, and ensures operational efficiencies and cost effectiveness. Livermore's current ITSM product, which has been in use since July 2009, was due for a significant upgrade. Given the level of effort required to upgrade the product, an evaluation team was tasked with assessing the current product upgrade option and comparing it to other market-leading products, including ServiceNow.

After determining a set of requirements from Livermore customers, end users, and developers, both products were evaluated and rated during vendor presentations, hands-on assessments, and feedback from customers who were already using the products. Based on the reviews, ServiceNow was selected for Livermore to adopt as a pilot product. The pilot objectives included validating the assessment criteria, ensuring LLNL security requirements could be met with a cloud solution, and giving Livermore's IT team hands-on experience with the tool, on which they could later capitalize during product migration.

Although ServiceNow is currently used by 13 Department of Energy (DOE) sites, including DOE Headquarters, LLNL will be the first National Nuclear Security Administration site to implement it.

ServiceNow has a government cloud architecture that includes full two-site redundancy (two sites within the U.S.) and high availability. In addition, ServiceNow has an active community of users and extensive online documentation. ServiceNow's "Share" site allows developers to exchange custom applications and modules.

As a cloud-based tool, ServiceNow's maintenance tasks, such as server and software patching, are the responsibility of the vendor rather than Livermore's Data Center and ITSM tools team. The time saved in IT staff effort can then be allocated to other tasks, such as designing and developing new services and improving existing tools. Leveraging cloud solutions such as ServiceNow allows Livermore's IT organization to be more responsive and flexible in addressing current IT infrastructure needs.

[Figure: ServiceNow tiered architecture. Top tier: ServiceNow applications, partner applications, and custom applications. Platform services: search, email, web services, import/export, templates, workflow, approvals, reports, social, mobile, user interface/content management, shared data, forms and lists, scripting, and role-based security. Cloud infrastructure: advanced high availability, multi-instance architecture, upgrades, certifications, and scalability and performance.]

The ServiceNow Software-as-a-Service solution provides a tiered architecture designed with best-in-class cloud infrastructure and common ready-to-use platform services that become the building blocks of applications developed by ServiceNow, its partners, and its customers. With this model, Livermore can take full advantage of the applications that come out of the box, as well as those developed by partners and customers willing to share their applications. Equally important, the model provides the platform to develop custom applications that meet Livermore’s unique requirements for service automation.

REA SIMPSON (SIMPSON6@LLNL.GOV)
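As an illustration of how a custom application can build on these platform services, the minimal sketch below creates a record through ServiceNow’s standard REST Table API. It is illustrative only: the instance URL, table name, field names, and credentials are hypothetical placeholders, not Livermore’s actual configuration.

# Minimal sketch: creating a record in a hypothetical custom table via the
# ServiceNow REST Table API. Instance, table, and credentials are placeholders.
import requests

INSTANCE = "https://example-instance.service-now.com"  # hypothetical instance
TABLE = "u_hpc_account_request"                        # hypothetical custom table

def create_request(short_description, requested_for, user, password):
    """POST a new record to the Table API and return the created record."""
    response = requests.post(
        f"{INSTANCE}/api/now/table/{TABLE}",
        auth=(user, password),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        json={"short_description": short_description, "requested_for": requested_for},
        timeout=30,
    )
    response.raise_for_status()
    # The Table API wraps the created record in a "result" object.
    return response.json()["result"]

if __name__ == "__main__":
    record = create_request("New HPC account", "jdoe", "api.user", "api.password")
    print(record["sys_id"])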

The foundation of LLNL’s ServiceNow implementation will be a comprehensive service catalog that will enable users to easily select the products and services they need. Once a request is submitted, users can track its status as it moves through the fulfillment steps. With the service catalog providing a single point of presence, users get the benefit of a one-stop shop, leading to an improved customer experience.

The migration to ServiceNow will occur in phases over the next two years. FY15 efforts will include the implementation of the service catalog, migration of desktop support functions from the old product to ServiceNow, and automation of several request processes. The transition to ServiceNow will be completed in FY16.
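The request tracking described above could also be surfaced programmatically. The hedged sketch below polls a request’s state through the same REST Table API, assuming ServiceNow’s out-of-the-box sc_request table and its request_state field; the instance URL, request number, and credentials are hypothetical placeholders.

# Minimal sketch: checking the status of a service catalog request through the
# ServiceNow REST Table API (assumes the standard sc_request table).
import requests

INSTANCE = "https://example-instance.service-now.com"  # hypothetical instance

def request_status(request_number, user, password):
    """Return the request_state of a catalog request, or None if not found."""
    response = requests.get(
        f"{INSTANCE}/api/now/table/sc_request",
        auth=(user, password),
        headers={"Accept": "application/json"},
        params={
            "sysparm_query": f"number={request_number}",
            "sysparm_fields": "number,request_state,opened_at",
            "sysparm_limit": 1,
        },
        timeout=30,
    )
    response.raise_for_status()
    results = response.json()["result"]
    return results[0]["request_state"] if results else None

if __name__ == "__main__":
    print(request_status("REQ0010001", "api.user", "api.password"))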

APPENDICES

Academic Collaborators

UNIVERSITY FACULTY ACTIVITY TYPE TOPIC FUNDING SOURCE LLNL CONTACT

Allen University Abdollah Rabieh Joint research Malware intelligence harvesting for greater cyber defense NNSA Matt Myrick

Arizona State University Stephen Johnston Joint research Peptide microarrays DHS Tom Slezak

Boston University Jonathan Appavoo Collaboration Exascale operating systems and runtime ASCR Maya Gokhale

California Polytechnic State University, San Luis Obispo Ignatios Vakalis Joint research Cybersecurity research; joint proposals SMS Celeste Matarazzo

Carnegie Mellon University Christos Faloutsos Joint research Mining large, time-evolving data for cyber domains; joint proposals — Celeste Matarazzo

Carnegie Mellon University Franz Franchetti Joint research Performance optimization of fast Fourier transform on Blue Gene/Q ASC Martin Schulz

Carnegie Mellon University Franz Franchetti Joint research Optimization of numerical libraries for Blue Gene/Q ASC Martin Schulz

Carnegie Mellon University Christos Faloutsos Subcontract Network summarization and fraud detection LDRD Brian Gallagher

Chalmers University of Technology Sally McKee Collaboration Leveraging OpenAnalysis for alias analysis within ROSE ASC Dan Quinlan


Clark Atlanta University Roy George Joint research Malware intelligence harvesting for greater cyber defense NNSA Matt Myrick

Clark Atlanta University Peter Molnar Joint research Cybersecurity research; joint proposals — Celeste Matarazzo

Colorado State University Stephen Guzik Subcontract A node-level programming model framework for exascale computing LDRD Chunhua Liao

Colorado State University Michelle Strout and Sanjay Rajopadhye Collaboration Program analysis ASCR Dan Quinlan

Cornell University Ken Birman Joint research Evaluation of scalable cloud computing technologies for use in Department of Energy systems and applications ASCR Greg Bronevetsky

Darmstadt University of Technology Christian Bischof Joint research OpenMP performance tools ASC Martin Schulz

Dresden University of Technology Wolfgang Nagel Joint research Improved analysis of message-passing interface traces and performance measurement infrastructures ASC Martin Schulz

Dresden University of Technology Wolfgang Nagel and Andreas Knüpfer Joint research Semantic debugging of message-passing interface applications; trace-based performance analysis ASC Bronis de Supinski

Duke University Joshua Vogelstein Joint research Brain network similarity LDRD Brian Gallagher

ETH Zürich Alexander Grayver Collaboration Solvers for adaptive high-order geo-electromagnetic modeling ASC/FRIC Tzanio Kolev

ETH Zürich Thorsten Hoefler Joint research Message-passing interface forum and advanced message-passing interface usage, performance and power modeling ASC Martin Schulz


Georgetown University Heidi Wachs Joint research Cybersecurity research; joint proposals — Celeste Matarazzo

Georgia Institute of Technology Raheem Beyah Joint research Cybersecurity research; joint proposals — Celeste Matarazzo

Georgia Institute of Technology Polo Chau Joint research Dynamic networks LDRD Brian Gallagher

Georgia Institute of Technology Richard Fujimoto Subcontract Research in reverse computation LDRD David Jefferson

Georgia Institute of Technology Ling Liu Collaboration Graph algorithms Programmatic; CAPS David Buttler

Georgia Institute of Technology George Riley Collaboration Research in null message synchronization in the context of the ns-3 network simulator DOD Army Research David Jefferson

Georgia Institute of Technology Jarek Rossignac Joint research Compact streamable mesh formats — Peter Lindstrom

Georgia Institute of Technology Richard Vuduc Subcontract Compiler support for reverse computation ASCR Dan Quinlan

Idaho State University Yunrong Zhu Collaboration Multigrid solvers theory ASC/FRIC Tzanio Kolev

Imperial College Paul Kelly and José Gabriel de Figueiredo Coutinho Collaboration Field-programmable gate arrays research ASCR Dan Quinlan

Indiana University Jeremiah Willcock Joint research Binary analysis ASCR Dan Quinlan


Indiana University Andrew Lumsdaine and Udayanga Wikramasinghe Subcontract Efficient message-passing interface runtimes; scalable data analysis ASCR Greg Bronevetsky

Johns Hopkins University Allan Boyles Collaboration Seismoacoustic modeling for defense-related efforts DOE Shawn Larsen

Kansas State University Bill Hsu Collaboration Dynamic time topic models Programmatic; CAPS David Buttler

Kyushu University Koji Inoue Joint research Energy and power-aware high performance computing ASC, ASCR Martin Schulz

Louisiana State University Lu Peng, Lide Duan, and Sui Chen Subcontract Characterizing the propagation of soft faults through numeric applications ASCR Greg Bronevetsky

Ludwig Maximilians University of Munich Dieter Kranzlmüller Joint research Message-passing interface tool infrastructure and performance analysis, power- and energy-aware high performance computing ASC, ASCR Martin Schulz

Naval Medical Research Center Vish Mokashi Joint research Microbial forensics DTRA Tom Slezak

Norfolk State University Aftab Ahmad and Jonathan Graham Joint research Cybersecurity research; joint proposals SMS Celeste Matarazzo

North Carolina Agricultural and Technical State University Gerry Dozier Joint research Cybersecurity research; joint proposals — Celeste Matarazzo

North Carolina Agricultural and Technical State University Gerry Dozier Joint research Malware intelligence harvesting for greater cyber defense NNSA Matt Myrick

Ohio State University Umit Catalyurek Joint research Task mapping of parallel applications using Chizu LDRD Abhinav Bhatele

Ohio State University P. Sadayappan and Christophe Alias Collaboration Optimizing compiler program analysis ASCR Dan Quinlan


Pennsylvania State University Jinchao Xu Collaboration Multigrid solvers theory ASC/FRIC Tzanio Kolev

Pennsylvania State University Jinchao Xu and James Brannick Subcontract Multigrid methods for systems of partial differential equations ASCR Robert Falgout

Pennsylvania State University Ludmil Zikatanov Subcontract Multilevel methods and upscaling techniques for elasticity problems LDRD Panayot Vassilevski

Polytechnic University of Puerto Rico Alfredo Cruz Joint research Cybersecurity research; joint proposals — Celeste Matarazzo

Portland State University Jay Gopalakrishnan Collaboration Discontinuous Petrov–Galerkin methods ASC/FRIC Tzanio Kolev

Purdue University Saurabh Bagchi Joint research Anomaly detection and tracking in high performance computing ASC Martin Schulz

Purdue University Saurabh Bagchi Subcontract Root cause analysis of faults in parallel systems ASCR Greg Bronevetsky

Purdue University Zhiqiang Cai Summer faculty A posteriori error estimates for partial differential equations ASC Robert Falgout

Purdue University Jennifer Neville Joint research Hypothesis tests for dynamic networks LDRD Brian Gallagher

Purdue University Mithuna Thottethodi Joint research Optimized node mapping techniques ASC Martin Schulz

Purdue University Mithuna Thottethodi Joint research Routing-aware task mapping LDRD Abhinav Bhatele


Queens University of Belfast Dimitrios Nikolopoulos Joint research Power optimization for hybrid codes, epidemiology simulation ASC Martin Schulz

Rensselaer Polytechnic Institute Mark Shephard and Onkar Sahni Joint research FASTMath: Frameworks, Algorithms, and Scalable Technologies for Mathematics ASCR SciDAC Lori Diachin

Rice University John Mellor-Crummey Joint research Performance analysis, standardization for OpenMP ASC, ASCR Martin Schulz

Rice University John Mellor-Crummey Joint research Sustained Performance, Energy, and Resilience Institute ASCR SciDAC Bronis de Supinski

Rice University John Mellor-Crummey, Keith Cooper, and Vivek Sarkar Collaboration Use of ROSE for compiler optimizations ASCR Dan Quinlan

Rice University Vivek Sarkar, Jisheng Zhao, Vincent Cave, and Michael Burke Joint research Development of a static single assignment-based dataflow compiler framework for ROSE ASCR Greg Bronevetsky

Royal Institute of Technology, Sweden Heinz-Otto Kreiss Consultant Adaptive methods for partial differential equations ASCR Base Anders Petersson

Rutgers University Tina Eliassi-Rad Subcontract Cyber situational awareness through host and network analysis LDRD Celeste Matarazzo

Rutgers University Tina Eliassi-Rad Subcontract Dynamic networks/network inference LDRD Brian Gallagher

Rutgers University Manish Parashar Joint research ExACT Co-Design Center DOE Timo Bremer

RWTH Aachen University Matthias Müller Joint research Message-passing interface correctness checking ASC Martin Schulz


RWTH Aachen University Felix Wolf Joint research I/O and network interference on torus networks ASC Abhinav Bhatele

RWTH Aachen University Felix Wolf Joint research Parallel performance analysis ASC, ASCR Martin Schulz

RWTH Aachen University Felix Wolf and Matthias Müller Joint research OpenMP, message-passing interface performance analysis tools ASC Bronis de Supinski

Southern Methodist University Dan Reynolds Joint research/subcontract FASTMath: Frameworks, Algorithms, and Scalable Technologies for Mathematics ASCR SciDAC Lori Diachin

Southern Methodist University Dan Reynolds Subcontract New time integration methods and support for multiscale solution methods in the LLNL SUNDIALS software suite ASCR SciDAC Carol Woodward

Southern Methodist University Dan Reynolds Subcontract Time integration and solvers for materials simulations NNSA Carol Woodward

Stanford University Juan Alonso and Gianluca Iaccarino Subcontract New modes of laser lethality LDRD Kambiz Salari

Stanford University Sanjiva Lele Subcontract Development of a nonequilibrium wall model for the compressible flow solver CharLESX DOE Kambiz Salari

Stanford University Olav Lindtjorn Collaboration Reverse-time seismic imaging for hydrocarbon exploration CRADA Shawn Larsen

Stanford University Parvis Moin ASC Predictive Science Academic Alliance Program Center Center for Predictive Simulations of Multiphysics Flow Phenomena with Application to Integrated Hypersonic Systems ASC Dick

Technical University of Denmark Sven Karlsson Joint research Scalable debugging ASC Martin Schulz and Dong Ahn


Technical University of Dortmund Dmitri Kuzmin Collaboration Monotonicity for high-order remap LDRD Tzanio Kolev

Technical University of Munich Arndt Bode Joint research Exascale computing ASC Martin Schulz

Technical University of Vienna Markus Schordan Collaboration Compiler construction ASCR Dan Quinlan

Texas A&M University Nancy Amato Joint research Load balance optimizations ASC Martin Schulz

Texas A&M University Nancy Amato Collaboration, Lawrence Scholar Program Novel mechanisms to understand and improve load balance in message-passing interface applications ASCR, ASC Bronis de Supinski

Texas A&M University Nancy Amato Collaboration, Lawrence Scholar Program Parallel graph algorithms UCOP Maya Gokhale

Texas A&M University Yalchin Efendiev Joint research Scalable uncertainty quantification concepts for multiscale problems at extreme scale ASCR Panayot Vassilevski

Texas A&M University Jean-Luc Guermond and Boian Popov Collaboration Finite elements for shock hydrodynamics LDRD Tzanio Kolev

Texas A&M University Bjarne Stroustrup and Lawrence Rauchwerger Joint research Compiler construction and parallel optimizations ASCR Dan Quinlan

Tufts University Scott MacLachlan Joint research Parallel multigrid in time ASCR Robert Falgout

UC Berkeley Doug Dreger Collaboration Earthquake hazard IGPP Shawn Larsen


UC Davis Matt Bishop and Sean Peisert Joint research Cybersecurity research, joint proposals, cyber defenders — Celeste Matarazzo

UC Davis Soheil Ghiasi Collaboration Machine learning Joint UC Davis/LLNL proposal Maya Gokhale

UC Davis Dipak Ghosal Joint research Interjob interference and performance variability on dragonfly networks LDRD Abhinav Bhatele

UC Davis François Gygi Collaboration General Qbox development, new algorithms — Erik Draeger

UC Davis Bernd Hamann Joint research Analysis and visualization of performance data UC Fee Timo Bremer

UC Davis Bernd Hamann Joint research Performance analysis and visualization ASC Martin Schulz

UC Davis Sean Peisert Subcontract Network resilience LDRD Celeste Matarazzo

UC Davis Zhendong Su Subcontract ROSE support for rewrapping macro calls ASCR Dan Quinlan

UC Riverside Michalis Faloutsos Joint research Cybersecurity research; joint proposals — Celeste Matarazzo

UC San Diego Randy Bank Subcontract Solvers for large sparse systems of linear equations ASCR Robert Falgout

UC San Diego Laura Carrington Collaboration Data-intensive architectures UC Fee Maya Gokhale


UC San Diego Laura Carrington and Scott Baden Joint research Performance modeling ASC Martin Schulz

UC San Diego Erik Gartzke and Jon Lindsey Joint research Cybersecurity research; joint proposals — Celeste Matarazzo

UC San Diego Steve Swanson Collaboration Persistent memory emulator — Maya Gokhale

UC San Diego, Scripps Institution of Oceanography Julie McClean Collaboration Ultra-high-resolution coupled climate simulations BER Art Mirin

UC Santa Cruz Steve Kang Collaboration Persistent memory devices UC Fee Maya Gokhale

UC Santa Cruz Carlos Maltzahn Collaboration, Lawrence Scholar Program Semantic file systems LDRD Maya Gokhale

United States Army Medical Research Unit, Kenya John Waitumbi Joint research Pathogen diagnostics — Tom Slezak

University of Arizona David Lowenthal Joint research Power-aware computing for message-passing interface programs; scalable performance models ASCR, ASC Bronis de Supinski

University of Arizona David Lowenthal Joint research Power optimization and modeling ASC, ASCR Martin Schulz

University of British Columbia Carl Olivier-Gooch Subcontract Aggressive mesh optimization ASCR SciDAC Lori Diachin

University of British Columbia Carl Olivier-Gooch Subcontract FASTMath: Frameworks, Algorithms, and Scalable Technologies for Mathematics/Mesquite ASCR SciDAC/ASCR Base Lori Diachin


University of Cologne Martin Lanser Collaboration Investigation of interpolation in algebraic multigrid for elasticity DOE ASCR Ulrike Yang

University of Colorado Ken Jansen Joint research FASTMath: Frameworks, Algorithms, and Scalable Technologies for Mathematics ASCR SciDAC Lori Diachin

University of Colorado Tom Manteuffel Joint research Solution methods for transport problems ASC Peter Brown

University of Colorado Steve McCormick and Tom Manteuffel Subcontract Adaptive algebraic multigrid for graph mining problems LDRD Van Emden Henson

University of Colorado Steve McCormick, Tom Manteuffel, John Ruge, and Marian Brezina Subcontract Error estimators for uncertainty quantification, adaptive mesh refinement, solvers for stochastic partial differential equations, parallel adaptive algebraic multigrid/smoothed aggregation, and parallel solution of systems of partial differential equations ASC Robert Falgout

University of Delaware John Cavazos Subcontract ROSE compiler project ASCR Dan Quinlan and Chunhua Liao

University of Delaware Michela Taufer Joint research Massively scalable I/O-aware job scheduling ASC Dong Ahn

University of Illinois at Chicago Tanya Berger-Wolf Joint research Network inference LDRD Brian Gallagher

University of Illinois at Urbana-Champaign William Gropp Joint research Message-passing interface, hybrid programming models ASCR, ASC Bronis de Supinski

University of Illinois at Urbana-Champaign William Gropp Joint research Optimization for algebraic multigrid ASC Martin Schulz

University of Illinois at Urbana-Champaign Laxmikant Kale Subcontract Scalable topology-aware task embedding (STATE) LDRD Abhinav Bhatele


University of Illinois and IBM William Gropp and Kirk Jordan Collaboration Modeling algebraic multigrid performance on multicore architectures ASCR Ulrike Yang

University of Karlsruhe Wolfgang Karl Joint research Hardware transactional memory ASC Martin Schulz

University of Maryland Jeffrey Hollingsworth Joint research Autotuning and tool infrastructures ASCR Martin Schulz

University of Maryland Jeffrey Hollingsworth Joint research Sustained Performance, Energy, and Resilience Institute ASCR SciDAC Bronis de Supinski

University of Nevada, Reno John Louie Collaboration Seismic modeling in the basin and range region DOE Shawn Larsen

University of New Mexico Dorian Arnold Joint research Tool infrastructures ASC Greg Lee

University of North Carolina Robert Fowler Joint research Sustained Performance, Energy, and Resilience Institute ASCR SciDAC Bronis de Supinski

University of North Carolina Jan Prins Joint research OpenMP task scheduling ASC Martin Schulz

University of San Francisco Jeff Buckwalter Joint research Performance modeling ASC Martin Schulz

University of Southern California Robert Lucas and Jacqueline Chame Joint research Sustained Performance, Energy, and Resilience Institute ASCR SciDAC Bronis de Supinski

University of Tennessee Jack Dongarra Joint research Empirical tuning ASCR Dan Quinlan


University of Tennessee Jack Dongarra Joint research Sustained Performance, Energy, and Resilience Institute ASCR SciDAC Bronis de Supinski

University of Texas, Medical Branch Yuriy Fofanov Joint research Genomic algorithms DTRA Tom Slezak

University of Texas, San Antonio Shirley Moore Joint research Sustained Performance, Energy, and Resilience Institute ASCR SciDAC Bronis de Supinski

University of Turabo Jeffrey Duffany Joint research Cybersecurity research: analysis and defense of large-scale smart meter networks; joint proposals SMS Celeste Matarazzo

University of Utah Ganesh Gopalakrishnan Subcontract Compiler analysis of message-passing interface applications ASCR Greg Bronevetsky

University of Utah Ganesh Gopalakrishnan Collaboration Message-passing interface optimizations ASCR Dan Quinlan

University of Utah Ganesh Gopalakrishnan Joint research Identification and targeted elimination of non-determinism ASC Dong Ahn

University of Utah Ganesh Gopalakrishnan and Mary Hall Joint research Semantic debugging of message-passing interface applications, and Sustained Performance, Energy, and Resilience Institute ASCR, ASC Bronis de Supinski

University of Utah Mike Kirby Joint research Analysis of vector fields DOE Timo Bremer

University of Utah Valerio Pascucci Joint research Performance analysis and visualization ASC, ASCR Martin Schulz

University of Utah Valerio Pascucci Subcontract Performance analysis and visualization DOE Timo Bremer


University of Waterloo Hans de Sterck Subcontract Numerical methods for large-scale data factorization LDRD Van Emden Henson

University of Wisconsin Karu Sankaralingam Joint research Fault tolerant computing models for high performance computing ASC Martin Schulz

University of Wisconsin Bart Miller and Ben Liblit Joint research Performance tools and tool infrastructures ASCR, ASC Martin Schulz

Utah State University Renée Bryce and Steena Monteiro Joint research Statistical modeling of data-driven applications ASCR, Lawrence Scholar Program Greg Bronevetsky

Virginia Institute of Technology Kirk Cameron Joint research Power-aware computing for hybrid systems ASCR, ASC Bronis de Supinski

Virginia Institute of Technology Wu-chun Feng Joint research Hybrid computing programming models, power-aware computing ASCR, ASC Bronis de Supinski

Virginia Institute of Technology Madhav Marathe Joint research Epidemiology simulation at scale ASC Martin Schulz

Virginia Institute of Technology Madhav Marathe Joint research Mathematical and computational foundations of network sciences SMS Celeste Matarazzo

Voorhees College Tim Kentopp Joint research Malware intelligence harvesting for greater cyber defense NNSA Matt Myrick

PUBLICATIONS

Journal Articles and Conference Proceedings

Abdel-Gawad, A., M. Thottethodi, and A. Bhatele, “RAHTM: Routing-Algorithm Aware Hierarchical Task Mapping,” International Conference for High Performance Computing, Networking, Storage and Analysis, (IEEE Press), (LLNL-CONF-653568).
Addair, T. G., D. A. Dodge, W. R. Walter, and S. D. Ruppert, “Large-Scale Seismic Signal Analysis with Hadoop,” Comput. Geosci.-UK 66, 145–154.
Adler, J. H., P. S. Vassilevski, “Error Analysis for Constrained First-Order System Least-Squares Finite-Element Methods,” SIAM J. Sci. Comput. 36 (3), A1071–A1088.
Ahmed, M. F., et al., “Target Material Collection for High Energy Imaging Diagnostic,” SPIE Optics + Photonics 2014, 9211, 92110F (LLNL-CONF-657947).
Almendro, V., et al., “Inference of Tumor Evolution during Chemotherapy by Computational Modeling and In Situ Analysis of Genetic and Phenotypic Cellular Diversity,” Rep. 6 (3), 514–527.
Amendt, P., et al., “Low-Adiabat Rugby Hohlraum Experiments on the National Ignition Facility: Comparison with High-Flux Modeling and the Potential for Gas-Wall Interpenetration,” Phys. Plasmas 21 (11).
Ayuso de Dios, B., A. T. Barker, and P. S. Vassilevski, “A Combined Preconditioning Strategy for Nonsymmetric Systems,” SIAM J. Sci. Comput. 36, A2533–A2556.
Bailey, P. E., et al., “Adaptive Configuration Selection for Power-Constrained Heterogeneous Systems,” 43rd International Conference on Parallel Processing (ICPP-14), (LLNL-CONF-662222).
Barker, A. T., “A Minimal Communication Approach to Parallel Time Integration,” Int. J. Comput. Math. 91, 601–615.
Barker, A. T. and S. C. Brenner, “A Mixed Finite Element Method for the Stokes Equations Based on a Weakly Over-Penalized Symmetric Interior Penalty Approach,” J. Sci. Comput. 58, 290–307.
Be, N. A., et al., “Microbial Profiling of Combat Wound Infection through Detection Microarray and Next-Generation Sequencing,” J. Clin. Microbiol. 52 (7), 2583–2594.
Bergmann, J. C., et al., “Discovery of Two Novel Beta-Glucosidases from an Amazon Soil Metagenomic Library,” FEMS Microbiol. Lett. 351 (2), 147–155.
Bhatele, A., et al., “Optimizing the Performance of Parallel Applications on a 5D Torus via Task Mapping,” IEEE International Conference on High Performance Computing, (IEEE Computer Society), (LLNL-CONF-655465).
Bhatia, H., V. Pascucci, and P. T. Bremer, “The Natural Helmholtz–Hodge Decomposition for Open-Boundary Flow Analysis,” IEEE T. Vis. Comput. Gr. 20 (11), 1566–1578.
Bhatia, H., V. Pascucci, R. M. Kirby, and P. T. Bremer, “Extracting Features from Time-Dependent Vector Fields Using Internal Reference Frames,” Comput. Graph. Forum 33 (3), 21–30.
Breslow, A. D., et al., “Enabling Fair Pricing on High Performance Computer Systems with Node Sharing,” Sci. Program. 22 (2), 59–74.
Carver, J. C. and T. Epperly, “Software Engineering for Computational Science and Engineering: Introduction,” Comput. Sci. Eng. 16 (3), 6–9.

Chen, M. J., et al., “An Efficient Bayesian Inversion of a Geothermal Prospect Using a Multivariate Adaptive Regression Spline Method,” Appl. Energy 136, 619–627.
Cinquini, L., et al., “The Earth System Grid Federation: An Open Infrastructure for Access to Distributed Geospatial Data,” Future Gener. Comp. Sys. 36, 400–417.
Clark, D. S., et al., “Survey of Pulse Shape Options for a Revised Plastic Ablator Ignition Design,” Phys. Plasmas 21 (11).
Connors, J. M., J. W. Banks, J. A. Hittinger, and C. S. Woodward, “Quantification of Errors for Operator-Split Advection-Diffusion Calculations,” Comput. Method. Appl. M. 272, 181–197 (LLNL-JRNL-566435).
Crawford, D. L., “Life at the Leading Edge: Supercomputing @ Lawrence Livermore,” International Conference for High Performance Computing, Networking, Storage and Analysis, (IEEE Press), (LLNL-PRES-664317).
Crawford, D. L., “Supercomputing,” IEA Combustion 2014, (LLNL-PRES-655329).
Devault, A. M., et al., “Ancient Pathogen DNA in Archaeological Samples Detected with a Microbial Detection Array,” Sci. Rep. 4.
Dittrich, T. R., et al., “Design of a High-Foot High-Adiabat ICF Capsule for the National Ignition Facility,” Phys. Rev. Lett. 112 (5).
Dobrev, V. A., T. V. Kolev, and R. N. Rieben, “High Order Curvilinear Finite Elements for Elastic–Plastic Lagrangian Dynamics,” J. Comput. Phys. 257, 1062–1080.
Dorf, M. A., et al., “Progress with the COGENT Edge Kinetic Code: Implementing the Fokker–Planck Collision Operator,” Contrib. Plasm. Phys. 54 (4–6), 517–523.
Falgout, R. D. and J. B. Schroder, “Non-Galerkin Coarse Grids for Algebraic Multigrid,” SIAM J. Sci. Comput. 36 (3), C309–C334 (LLNL-JRNL-645325).
Fattebert, J. L., M. E. Wickett, and P. E. A. Turchi, “Phase-Field Modeling of Coring During Solidification of Au-Ni Alloy Using Quaternions and CALPHAD Input,” Acta Mater. 62, 89–104 (LLNL-JRNL-636233).
Gan, Y. J., et al., “A Comprehensive Evaluation of Various Sensitivity Analysis Methods: A Case Study with a Hydrological Model,” Environ. Modeling Softw. 51, 269–285.
Gardner, S. N. and T. Slezak, “Simulate_PCR for Amplicon Prediction and Annotation from Multiplex, Degenerate Primers and Probes,” BMC Bioinformatics 15.
Geng, J., I. M. Navon, and X. Chen, “Nonparametric Calibration of the Local Volatility Surface for European Options Using a Second-Order Tikhonov Regularization,” Quant. Financ. 14 (1), 73–85.
Giménez, A., et al., “Dissecting On-Node Memory Access Performance: A Semantic Approach,” International Conference for High Performance Computing, Networking, Storage and Analysis, (IEEE Press), (LLNL-CONF-658626).
Gladden, J. M., et al., “Discovery and Characterization of Ionic Liquid-Tolerant Thermophilic Cellulases from a Switchgrass-Adapted Microbial Community,” Biotechnol. Biofuels 7.
Greenwald, M., et al., “20 Years of Research on the Alcator C-Mod Tokamak,” Phys. Plasmas 21 (11).
Gyulassy, A., et al., “Stability of Dissipation Elements: A Case Study in Combustion,” Comput. Graph. Forum 33 (3), 51–60.
Harvey, W., et al., “A Collaborative Visual Analytics Suite for Protein Folding Research,” J. Mol. Graph. Model. 53, 59–71.
Hilbrich, T., et al., “Memory Usage Optimizations for Online Event Analysis,” Exascale Applications and Software Conference (EACS 2014), (LLNL-CONF-652119).
Hurricane, O. A., “Fuel Gain Exceeding Unity in an Inertially Confined Fusion Implosion,” Nature 506 (7488), 343.

Hurricane, O. A., et al., “The High-Foot Implosion Campaign on the National Ignition Facility,” Phys. Plasmas 21 (5).
Iftekharuddin, K. M., A. A. S. Awwal, S. S. Young, and G. M. Abdulla, “Optical and Hybrid Imaging and Processing for Big Data Problems,” Opt. Eng. 53 (1).
Isaacs, K. E., et al., “Combing the Communication Hairball: Visualizing Parallel Execution Traces Using Logical Time,” IEEE T. Vis. Comput. Gr. 20 (12), 2349–2358 (LLNL-JRNL-657418).
Isaacs, K. E., et al., “State of the Art of Performance Visualization,” Eurographics Conference of Visualization, (LLNL-CONF-652873).
Islam, T., S. Bagchi, and R. Eigenmann, “Reliable and Efficient Distributed Checkpointing System for Grid Environments,” Journal Grid Computing 12 (4), 593–613 (LLNL-JRNL-649440).
Jain, N., et al., “Maximizing Throughput on a Dragonfly Network,” International Conference for High Performance Computing, Networking, Storage and Analysis, (IEEE Press), (LLNL-CONF-653557).
Jiang, M., et al., “Multi-Threaded Streamline Tracing for Data-Intensive Architectures,” IEEE Symposium on Large Data Analysis and Visualization, (LLNL-CONF-645076).
Kamath, C., et al., “Density of Additively Manufactured, 316L SS Parts Using Laser Powder-Bed Fusion at Powers Up to 400 W,” Int. J. Adv. Manuf. Tech. 74 (1–4), 65–78.
Kegelmeyer, L. M., et al., “Optimizing Blocker Usage on NIF Using Image Analysis and Machine Learning,” ICALEPCS2013: Proceedings of the 14th International Conference on Accelerator and Large Experimental Physics Control Systems, (LLNL-CONF-644265).
Kim, H., C. Correa, and N. Max, “Automatic Registration of LiDAR and Optical Imagery Using Depth Map Stereo,” IEEE ICCP, 205–212 (LLNL-PROC-652419).
Kim, H., J. J. Thiagarajan, and P. T. Bremer, “Image Segmentation Using Consensus from Hierarchical Segmentation Ensembles,” IEEE ICIP, (LLNL-PROC-652239).
King, W. E., et al., “Observation of Keyhole-Mode Laser Melting in Laser Powder-Bed Fusion Additive Manufacturing,” J. Mater. Process. Tech. 214 (12), 2915–2925.
Laguna, I., et al., “Evaluating User-Level Fault Tolerance for MPI Applications,” EuroMPI 2014, (LLNL-CONF-656877).
Laney, D., et al., “Assessing the Effects of Data Compression in Simulations Using Physically Motivated Metrics,” Sci. Program. 22 (2), 141–155 (LLNL-CONF-642236).
Langer, S. H., A. Bhatele, and C. H. Still, “pF3D Simulations of Laser-Plasma Interactions in National Ignition Facility Experiments,” Comput. Sci. Eng. 16 (6), 42–50 (LLNL-JRNL-648736).
Levesque, G. A., et al., “Analyses on the Effect of Hot Spot Density on Material Consumption Rate,” 18th Joint International Conference of the APS Topical-Grp on Shock Compress of Condensed Matter/24th International Conference of the International Association for the Advancement of High Pressure Science and Technology, (IOP Publishing Ltd), 500, 152010.
Liao, Z. M., et al., “Damage Modeling and Statistical Analysis of Optics Damage Performance in MJ-Class Laser Systems,” Opt. Express 22 (23), 28845–28856.
Lindstrom, P., “Fixed-Rate Compressed Floating-Point Arrays,” IEEE T. Vis. Comput. Gr. 20 (12), 2674–2683 (LLNL-JRNL-652576).
Lindstrom, P. and D. Rajan, “Optimal Hierarchical Layouts for Cache-Oblivious Search Trees,” IEEE International Conference on Data Engineering 2014, (IEEE), 616–627 (LLNL-CONF-641294).
Liu, S., B. Wang, P. T. Bremer, and V. Pascucci, “Distortion-Guided Structure-Driven Interactive Exploration of High-Dimensional Data,” Comput. Graph. Forum 33 (3), 101–110.
Luffel, M., T. Gurung, P. Lindstrom, and J. Rossignac, “Grouper: A Compact, Streamable Triangle Mesh Data Structure,” IEEE T. Vis. Comput. Gr. 20 (1), 84–98 (LLNL-JRNL-503653).
MacKinnon, A. J., et al., “High-Density Carbon Ablator Experiments on the National Ignition Facility,” Phys. Plasmas 21 (5), (LLNL-JRNL-647814).
Marathe, A., et al., “Exploiting Redundancy for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2,” 23rd IEEE International Symposium High Performance Distributed Computing (HPDC 2014), (LLNL-CONF-652718).

Masters, N., et al., “Evaluation of Observed Blast Loading Effects on NIF X-Ray Diagnostic Collimators,” 20th Topical Conference on High-Temperature Plasma Diagnostics, (Rev. Sci. Instrum.), 85 (11), 11D628 (LLNL-CONF-655173).
Maurer, E. P., et al., “An Enhanced Archive Facilitating Climate Impacts and Adaptation Analysis,” B. Am. Meteorol. Soc. 95 (7), 1011.
McCandless, K. P., et al., “The Role of Data Driven Models in Optimizing the Operation of the National Ignition Facility,” ICALEPCS, (JACoW), 1426 (LLNL-CONF-644538).
McCarthy, C. M., et al., “Visualizing the Five-Dimensional Torus Network of the IBM Blue Gene/Q,” 1st Workshop on Visual Performance Analysis, (LLNL-CONF-661000).
Mohan, A., T. Hacker, G. P. Rodgers, and T. Islam, “Batchsubmit: A High-Volume Batch Submission System for Earthquake Engineering Simulation,” Concurr. Comp.-Pract. E. 26 (13), 2240–2252.
Mohror, K., A. Moody, G. Bronevetsky, and B. R. de Supinski, “Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System,” IEEE T. Parall. Distrib. Sys. 25 (9), 2255–2263 (LLNL-JRNL-564721).
Moody, J. D., et al., “Progress in Hohlraum Physics for the National Ignition Facility,” Phys. Plasmas 21 (5).
Osei-Kuffuor, D. and J. L. Fattebert, “Accurate and Scalable O(N) Algorithm for First-Principles Molecular-Dynamics Computations on Large Parallel Computers,” Phys. Rev. Lett. 112 (4) (LLNL-JRNL-641855).
Osei-Kuffuor, D. and J. L. Fattebert, “A Scalable O(N) Algorithm for Large-Scale Parallel First-Principles Molecular Dynamics Simulations,” SIAM J. Sci. Comput. 36 (4), C353–C375 (LLNL-JRNL-649416).
Osei-Kuffuor, D., R. M. Maxwell, and C. S. Woodward, “Improved Numerical Solvers for Implicit Coupling of Subsurface and Overland Flow,” Adv. Water Res. 74, 185–195 (LLNL-JRNL-654773).
Ott, T., M. Bonitz, L. G. Stanton, and M. S. Murillo, “Coupling Strength in Coulomb and Yukawa One-Component Plasmas,” Phys. Plasmas 21 (11).
Paradzik, M., et al., “Association of Kaposi’s Sarcoma-Associated Herpesvirus (KSHV) with Bladder Cancer in Croatian Patients,” Tumor Biol. 35 (1), 567–572.
Pavlu, V., M. Schordan, and A. Krall, “Computation of Alias Sets from Shape Graphs for Comparison of Shape Analysis Precision,” IET Softw. 8 (3), 120–133.
Pearce, O., et al., “Load Balancing N-Body Simulations with Highly Non-Uniform Density,” ACM International Conference on Supercomputing (ICS), (ACM New York, NY, USA), 113–122 (LLNL-CONF-648577).
Pearce, R., M. Gokhale, and N. Amato, “Faster Parallel Traversal of Scale Free Graphs at Extreme Scale with Vertex Delegates,” International Conference for High Performance Computing, Networking, Storage, and Analysis, (IEEE Press), (LLNL-CONF-658291).
Petersson, N. A. and B. Sjogreen, “Super-Grid Modeling of the Elastic Wave Equation in Semi-Bounded Domains,” Commun. Comput. Phys. 16 (4), 913–955.
Pickering, B. P., et al., “Directive-Based GPU Programming for Computational Fluid Dynamics,” 52nd Aerospace Sciences Meeting (American Institute of Aeronautics and Astronautics).
Rosenstierne, M. W., et al., “The Microbial Detection Array for Detection of Emerging Viruses in Clinical Samples: A Useful Panmicrobial Diagnostic Tool,” PLoS ONE 9 (6).
Santer, B. D., et al., “Volcanic Contribution to Decadal Changes in Tropospheric Temperature,” Nat. Geosci. 7 (3), 185–189.
Sato, K., et al., “A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers,” IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2014), (IEEE), 21–30 (LLNL-CONF-645876).
Sato, K., et al., “FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery,” International Conference on Parallel and Distributed Processing Symposium 2014 (IPDPS2014), (IEEE Computer Society), 1225–1234 (LLNL-CONF-645209).
Schleife, A., et al., “Quantum Dynamics Simulation of Electrons in Materials on High-Performance Computers,” Comput. Sci. Eng. 16 (5), 54–60.

Schone, R., et al., “Tools and Methods for Measuring and Tuning the Energy Efficiency of HPC Systems,” Sci. Program. 22 (4), 273–283.
Scogland, T. R. W., W. Feng, B. Rountree, and B. R. de Supinski, “CoreTSAR: Core Task-Size Adapting Runtime,” IEEE T. Parall. Distr., 99 (1).
Scogland, T. R. W., B. Rountree, W. Feng, and B. R. de Supinski, “CoreTSAR: Adaptive Worksharing for Heterogeneous Systems,” The International Supercomputing Conference (ISC’14), (LLNL-CONF-662205).
Scogland, T. R., et al., “A Power-Measurement Methodology for Large-Scale, High-Performance Computing,” ACM/SPEC International Conference on Performance Engineering (ICPE).
Simakov, A. N., et al., “Optimized Beryllium Target Design for Indirectly Driven Inertial Confinement Fusion Experiments on the National Ignition Facility,” Phys. Plasmas 21 (2).
Sjogreen, B. and N. A. Petersson, “Source Estimation by Full Wave Form Inversion,” J. Sci. Comput. 59 (1), 247–276.
Sjogreen, B., H. C. Yee, and M. Vinokur, “On High Order Finite-Difference Metric Discretizations Satisfying GCL on Moving and Deforming Grids,” J. Comput. Phys. 265, 211–220.
Smalyuk, V. A., et al., “Measurements of an Ablator-Gas Atomic Mix in Indirectly Driven Implosions at the National Ignition Facility,” Phys. Rev. Lett. 112 (2).
Smith, S., et al., “Online Transient Stability Analysis Using High Performance Computing,” 2014 IEEE PES Innovative Smart Grid Technologies Conference (ISGT), (IEEE, Piscataway, NJ, USA), 1–5.
Thissen, J. B., et al., “Analysis of Sensitivity and Rapid Hybridization of a Multiplexed Microbial Detection Microarray,” J. Virol. Meth. 201, 73–78 (LLNL-JRNL-594212).
Town, R. P. J., et al., “Dynamic Symmetry of Indirectly Driven Inertial Confinement Fusion Capsules on the National Ignition Facility,” Phys. Plasmas 21 (5).
Vassilevski, P. S. and U. Villa, “A Mixed Formulation for the Brinkman Problem,” SIAM J. Numer. Anal. 52 (1), 258–281 (LLNL-JRNL-563632).
Vassilevski, P. S. and U. M. Yang, “Reducing Communication in Algebraic Multigrid Using Additive Variants,” Numer. Linear Algebr. 21 (2), 275–296 (LLNL-JRNL-637872).
Vassilevski, P. S. and L. T. Zikatanov, “Commuting Projections on Graphs,” Numer. Linear Algebr. 21 (3), 297–315.
Velsko, S. P., J. Osburn, and J. Allen, “Forensic Interpretation of Molecular Variation on Networks of Disease Transmission and Genetic Inheritance,” Electrophoresis 35 (21–22), 3117–3124 (LLNL-JRNL-655754).
Wolfe, D. N., et al., “Perspective Piece: Current Strategic Thinking for the Development of a Trivalent Alphavirus Vaccine for Human Use,” Am. J. Trop. Med. Hyg. 91 (3), 442–450.
Xie, J., X. Chen, and Y. Lai, “Decyling Numbers of Strong Product Graphs Involving Paths, Circuits, Stars, or Complete Graphs,” Int. J. Appl. Math. Stat 52 (8), 83–90 (LLNL-JRNL-658728).
Xie, S. C., et al., “Interactions Between Cumulus Convection and Its Environment as Revealed by the MC3E Sounding Array,” J. Geophys. Res.-Atmos. 119 (20), 11784–11808.
Yang, X. J., X. Chen, C. R. Carrigan, and A. L. Ramirez, “Uncertainty Quantification of CO2 Saturation Estimated from Electrical Resistance Tomography Data at the Cranfield Site,” Int. J. Greenh. Gas Con. 27, 59–68.
Yeom, J., et al., “Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters,” IEEE International Parallel and Distributed Processing Symposium, (IEEE Computer Society), (LLNL-CONF-648533).
Yi, S. A., et al., “Hydrodynamic Instabilities in Beryllium Targets for the National Ignition Facility,” Phys. Plasmas 21 (9).
Yin, L., et al., “Stimulated Scattering in Laser Driven Fusion and High Energy Density Physics Experiments,” Phys. Plasmas 21 (9), 092707.
Zhao, C. F., et al., “Quantifying Uncertainties of Cloud Microphysical Property Retrievals with a Perturbation Method,” J. Geophys. Res.-Atmos. 119 (9), 5375–5385.

Industrial Collaborators

COMPANY TOPIC LLNL CONTACT(S)

Adaptive Computing Enterprises, Inc. Moab workload manager Don Lipari

Affymetrix Microarray evaluation Tom Slezak and Shea Gardner

AMD DesignForward networking project Edgar León

AMD FastForward processor and memory project Robin Goldstone and Kathryn Mohror

AMD FastForward 2 node and memory projects John May

AMD Power and energy Barry Rountree

AMD Two-level memory, processing in memory Maya Gokhale, Scott Lloyd, and Brian Van Essen

AOSense, Inc. Gravity gradiometry in the detection of anomalous mass distribution in vehicles Vijay Sonnad

Applied Micro and Red Hat RHEL7 on ARM64 Trent D’Hooge

Argo Navis Automated cache performance analysis and optimization in Open|SpeedShop Kathryn Mohror and Barry Rountree

Arista Networks Low-latency networks Matt Leininger

ARM Processing in memory Maya Gokhale and Scott Lloyd

ARM and Cray FastForward 2 node project Bronis de Supinski

Battelle Terrorism risk assessments Amy Waters and Lisa Belk

Broadcom and Cray FastForward 2 node project Bronis de Supinski


Catalyst Exploring system software and applications algorithms Matt Leininger

Cisco Systems, Dell, DataDirect Networks, Intel, NetApp, Mellanox Technologies, QLogic, Red Hat, Oracle, and Supermicro Hyperion collaboration Matt Leininger

Commissariat à l’Énergie Atomique Resource management software Don Lipari, Jim Garlick, Mark Grondona, and Dong Ahn

Cray Scalable capacity clusters Matt Leininger and Trent D’Hooge

Cray Exploring the Chapel programming language using LULESH Abhinav Bhatele

Cray FastForward 2 Bronis de Supinski

Cray DesignForward 1 networking project Robin Goldstone

Cray DesignForward 2 Bronis de Supinski

Cymer Extreme ultraviolet simulation and analysis Fred Streitz

Dell Computers Scalable capacity clusters Matt Leininger and Trent D’Hooge

Électricité de France Aeroacoustics Bill Henshaw

EOG Resources Seismic processing Shawn Larsen

ExxonMobil Cooperative Research and Development Agreement: computational mathematics John Grosh, Lori Diachin, Timo Bremer, and Martin Schulz

GAMS Solvers Barry Rountree


IBM Advanced memory concepts FastForward 1 and 2 projects Bronis de Supinski

IBM CORAL interactions Bronis de Supinski

IBM Deep Computing Solutions Fred Streitz and Doug East

IBM Blue Gene/Q common development tools interface co-design Dong Ahn

IBM Evaluating the performance of algebraic multigrid on multicore architectures Ulrike Yang

IBM High performance storage system Jerry Shoopman

IBM Scalable systems, multiple areas Bronis de Supinski

IBM Tool interface for OpenMP Martin Schulz and Ignacio Laguna

IBM Flash storage systems Maya Gokhale and Roger Pearce

IBM DesignForward 2 memory research Robin Goldstone

IBM FastForward 2 advanced memory concepts Maya Gokhale

IBM DesignForward networking project Kathryn Mohror

IBM Research Improvements to CPLEX optimization software geared toward use cases in the electrical grid Deepak Rajan

IBM Research Operating systems Maya Gokhale and Scott Lloyd

IBM Research and Optit SRL Parallel decomposition schemes for solving large-scale electrical-grid stochastic optimization problems Deepak Rajan


IBM Research and Knight Capital Group Task scheduling with setup times Deepak Rajan

IBM Research and North Carolina State University Predictive performance anomaly prevention for virtualized cloud systems Deepak Rajan

IBM Research, HP Labs, Knight Capital Group, and Bank of America Scheduling heterogeneous jobs in MapReduce environments Deepak Rajan

IBM Research, Knight Capital Group, StonyBrook University, and Unscramble LLC Mining large time-evolving graphs for proximity queries Deepak Rajan

ICFO, Barcelona (Institute of Photonics) Novel microarray reader Tom Slezak

Incorporated Research Institutes of Seismology (IRIS) LLNL hosts the Auxiliary Data Center for IRIS serving 200+ foreign and 200+ domestic collaborator institutions Stan Ruppert

InfiniBand Trade Association InfiniBand specifications body Pam Hamilton

Intel CRADA for system software research and development Kim Cupps, Pam Hamilton, Chris Morrone, Martin Schulz, and Barry Rountree

Intel DesignForward networking project Matt Leininger

Intel FastForward processor project Matt Leininger

Intel FastForward I/O project Mark Gary

Intel FastForward 2 node project Matt Leininger

Intel Many integrated core programming environment Greg Lee

Intel Power-limited high performance computing Barry Rountree and Martin Schulz

Intel Research and development for I/O systems Mark Gary, Robin Goldstone, and Ned Bass

Intel Simulations and Visualization of Interconnection Networks Abhinav Bhatele


Intel and Cray High performance architecture for data analytics Matt Leininger, Robin Goldstone, and Trent D’Hooge

ION Geophysical Corporation Oil exploration Shawn Larsen

JTC1/SC22/WG21−The C++ Standards Committee− ISOCPP C++ standards committee member Mike Kumbera

John Deere Design Debbie May

John Deere Sensors Debbie May

Juelich Research Center/Juelich Supercomputing Center Tools for performance analysis at scale Bernd Mohr

Krell Institute/Argo Navis Technologies Open|SpeedShop development and support and the component-based tool framework Martin Schulz

Laboratory for Laser Energetics and Commissariat à l’Énergie Atomique Miro and virtual beamline modeling and simulation codes Kathleen McCandless

Life Technologies Targeted microbial DNA amplification to enhance sequencing Tom Slezak

Lightning Bolt Solutions, Ohio State University, and UC Berkeley Improved models for electrical-grid optimization problems Deepak Rajan

Mellanox Long haul InfiniBand Trent D’Hooge

Mellanox Hadoop with InfiniBand remote direct memory access Al Chu and Robin Goldstone

Mellanox CORAL interactions Bronis de Supinski

Micron Processing in memory Maya Gokhale and Scott Lloyd

MITRE Corporation Subsurface modeling Shawn Larsen


National Instruments Object-oriented applications of LabVIEW on big physics data Mike Flegel

NetApp High performance I/O systems Marc Stearman and Mark Gary

NSTec Instrument calibration techniques Steve Glenn

NVIDIA FastForward processor project Bronis de Supinski

NVIDIA FastForward 2 node research Robin Goldstone

NVIDIA CORAL interactions Bronis de Supinski

NVIDIA DesignForward 1 Bronis de Supinski

OpenFabrics Alliance, Mellanox, and Intel OpenFabrics enterprise distribution Matt Leininger

OpenPower Power CPU processors Bronis de Supinski

OpenSFS Lustre file system development and deployment Terri Quinn and Chris Morrone

OpenZFS ZFS file system development Brian Behlendorf

OpenWorks Valgrind memory tool and threading tool development John Gyllenhaal

OSIsoft Management and visualization of phasor measurement unit data Ghaleb Abdulla

Pacific Gas and Electric Department of Energy Office of Electricity project on supply chain cyber security Dan Quinlan

ParaTools Development and support of TAU performance analysis tool Chris Chambreau

PTC Windchill Al Churby


Red Hat Operating systems Mark Grondona and Jim Foraker

Red Hat, Appro, Intel, and AMD Hardware performance counters Barry Rountree

Robert Bosch LLC Computational fluid dynamics research using petascale systems John Grosh

Rogue Wave Software TotalView parallel debugger scalability and enhanced memory tools Dong Ahn and Scott Futral

Rogue Wave Software TotalView enhanced debugging for C++ applications Matt Wolfe

Samplify Data compression Peter Lindstrom

San Diego Gas & Electric, Southern California Edison, Pacific Gas and Electric, and the California Public Utilities Commission California Energy Systems for the 21st Century (CES-21) John Grosh, Jamie Van Randwyk, and others

SchedMD SLURM resource management software Kim Cupps and Don Lipari

STFC Daresbury National lab collaborations with industry Fred Streitz

Tennessee Valley Authority and Applied Communication Sciences Robust adaptive topology control for Advanced Research Projects Agency-Energy project Deepak Rajan

TidalScale Bioinformatics applications on virtual large-memory nodes Alexander Ames

National Laboratory Collaborators

LABORATORY TOPIC LLNL CONTACT(S)

Argonne National Laboratory CESAR Co-Design Center: tools and performance Martin Schulz

Argonne National Laboratory Exascale operating systems: power scheduling Maya Gokhale, Brian Van Essen, Edgar León, Martin Schulz, and Barry Rountree

Argonne National Laboratory Simulation technologies for multiphysics simulations Carol Woodward

Argonne National Laboratory and Oak Ridge National Laboratory CORAL procurement Bronis de Supinski

Argonne National Laboratory, Lawrence Berkeley National Laboratory, Oak Ridge National Laboratory SUPER: Institute for Sustained Performance, Energy, and Resilience Bronis de Supinski and Daniel Quinlan

Argonne National Laboratory, Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, Oak Ridge National Laboratory, and Sandia National Laboratories SDAV: Scalable Data Management, Analysis, and Visualization Eric Brugger

Argonne National Laboratory, Los Alamos National Laboratory, Lawrence Berkeley National Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and Sandia National Laboratories FastForward 2 Bronis de Supinski

Argonne National Laboratory, Lawrence Berkeley National Laboratory, Sandia National Laboratories; Rensselaer Polytechnic Institute FASTMath: Frameworks, Algorithms and Scalable Technologies for Mathematics Lori Diachin

Argonne National Laboratory, Brookhaven National Laboratory, Idaho National Laboratory, Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, Pacific Northwest National Laboratory, Sandia National Laboratories Energy Facility Contractors Group Darrel Whitney and others

Atomic Weapons Establishment Mesh generation co-development Katie Lewis, Walt Nissen, Jack Middleton, and Cecilia Castillo

Commissariat à l’Énergie Atomique, Sandia National Laboratories, and Los Alamos National Laboratory NNSA/CEA computer science collaborations Kim Cupps, Matt Leininger, Trent D’Hooge, Chris Morrone, Rob Neely, Edgar León, Rob Falgout, Walt Nissen, Katie Lewis, Eric Brugger, Bert Still, and Ian Karlin

Los Alamos National Laboratory ExMatEx Co-Design Center: tools and performance Martin Schulz

Los Alamos National Laboratory Monte Carlo N-Particle transport code Lila Chase

Los Alamos National Laboratory and Sandia National Laboratories Open|SpeedShop and component-based tool framework Martin Schulz


Los Alamos National Laboratory and Sandia National Laboratories Tri-laboratory common computing environment tools Martin Schulz

Los Alamos National Laboratory and Sandia National Laboratories TotalView enhanced debugging for C++ applications —

Los Alamos National Laboratory and Sandia National Laboratories Tri-Lab Software —

Los Alamos National Laboratory, Sandia National Laboratories, and Oak Ridge National Laboratory ExMatEx Co-Design Center: materials in extreme environments Milo Dorr

Los Alamos National Laboratory, Lawrence Berkeley National Laboratory, Oak Ridge National Laboratory, and Sandia National Laboratories High Performance Storage System Jerry Shoopman

Los Alamos National Laboratory, Sandia National Laboratories, Oak Ridge National Laboratory, and Pacific Northwest National Laboratory DOE’s Secret National Security Information (SNSI) Network: Architecture, Cyber Security, and Operations Stan Ruppert, John Heck, and Laura Long

Los Alamos National Laboratory, Sandia National Laboratories, Oak Ridge National Laboratory, and Pacific Northwest National Laboratory DOE’s Secret Restricted Data (SRD) Information Network: Architecture, Cyber Security, and Operations Paul Masi

Lawrence Berkeley National Laboratory Extreme resilient discretizations Jeff Hittinger

Lawrence Berkeley National Laboratory High-order methods for kinetic simulation of plasmas Milo Dorr and Jeff Hittinger

Lawrence Berkeley National Laboratory Modeling throughput on dragonfly networks Abhinav Bhatele

Lawrence Berkeley National Laboratory and Sandia National Laboratories ExaCT Co-Design Center: Center for Exascale Simulation of Combustion in Turbulence Rob Falgout and Ulrike Yang

Oak Ridge National Laboratory Cybersecurity for energy delivery systems Dan Quinlan

Oak Ridge National Laboratory ExMatEx Co-Design Center: modeling Martin Schulz

Oak Ridge National Laboratory Solvers for implicit climate simulations Carol Woodward and Aaron Lott


Oak Ridge National Laboratory and Lawrence Berkeley National Laboratory SAMRAI performance capabilities Brian Gunney

Pacific Northwest National Laboratory CONRAD: Contested Operations Reporting and Defense network Bill Orvis

Pacific Northwest National Laboratory Cybersecurity for energy delivery systems Dan Quinlan

Pacific Northwest National Laboratory Performance analysis for the X-Stack Martin Schulz

Pacific Northwest National Laboratory and Los Alamos National Laboratory Terrorism risk assessments Amy Waters and Lisa Belk

Pacific Northwest National Laboratory Exascale operating systems/runtime Maya Gokhale, Brian Van Essen, and Edgar León

Sandia National Laboratories Cybersecurity for energy delivery systems Jamie Van Randwyk

Sandia National Laboratories ExMatEx Co-Design Center: Structural Simulation Toolkit Martin Schulz

Sandia National Laboratories Multiscale Climate SciDAC: CAM-SE Aaron Lott

Sandia National Laboratories Multiscale Climate SciDAC: Trilinos Aaron Lott

Sandia National Laboratories Exascale operating systems/runtime Maya Gokhale and Scott Lloyd

Sandia National Laboratories, Commissariat à l’Énergie Atomique, Université Catholique de Louvain, UC Berkeley, Old Dominion University, Royal Military College of Canada, Icon Technology and Process Consulting Ltd., and CD-adapco International Meshing Roundtable Conference Committee Katie Lewis
