White Paper
Government High Performance Computing

Intel® Technologies Advance Supercomputing at National Laboratories for U.S. Security

Penguin Clusters with Intel® Omni-Path Architecture and Intel® Xeon® Processors Deliver Enhanced Performance and Scalability for the Department of Energy's Workhorse, the Commodity Technology Systems 1

Table of Contents
Executive Summary
Protecting America—Safely
Stockpile Stewardship Through Computational Science
The Commodity Technology Systems
Scalable Computing at the Tri-labs
Building Scalable Systems with Intel® Omni-Path Architecture and Intel® Architecture
Driving Next-Generation Fabric Performance and Reliability
48-Port Switch Silicon Enables Simple, Large, and Fast Clusters
Performance and Reliability for HPC
Intel OPA in the Tri-labs—and Beyond
Penguin Computing Designs for Density
Summary

Executive Summary
America's nuclear stockpile of weapons is aging, yet the weapons must be certified as effective without physical testing. That is the challenge facing the Department of Energy's National Nuclear Security Administration (NNSA). Without underground testing, scientists must test, validate, and certify the nuclear stockpile using a combination of focused above-ground experiments, computational science, and simulations that run on large computing clusters. The Commodity Technology Systems 1 (CTS-1) deployment has over 15 Penguin Computing, Inc. clusters consisting of 83 Tundra* Scalable Units (SU). These clusters incorporate Intel® Xeon® processors and many other Intel® technologies, and Asetek* DCLC direct chip cooling technologies. All clusters began deployment at the NNSA's tri-labs in early 2016, providing the computational power for scientists to carry out their missions. At the center of these powerful clusters is Intel® Omni-Path Architecture (Intel® OPA), connecting the large numbers of nodes together. This paper looks at the need for these systems and why Intel OPA was chosen as a foundational component.

The workloads on the CTS-1 systems include simulations related to the National Ignition Facility program.

Protecting America—Safely
The U.S. Department of Energy (DOE) is responsible for maintaining, certifying, and developing America's nuclear weapon stockpile without relying on live, underground nuclear testing, which the U.S. ended in 1992. Predicting the behavior of a nuclear weapon detonation—for weapons both very old and new—requires understanding the behavior of a very complex system that can change over years and decades of material storage. Instead of integrated underground nuclear tests, DOE scientists must now rely on a combination of focused above-ground tests and high performance computing to certify the stockpile and assure its continued reliability and safety. High performance computing (HPC) helps simulate effects on the ultimate performance of the weapons, which can no longer be tested, and it helps interpret the results of focused experiments, which in turn provide validation for the simulation tools. These same simulation tools and HPC clusters are also used in support of other Department of Defense missions, such as modeling Navy railguns and the blast effects of conventional weapons on military assets.

Stockpile Stewardship Through Computational Science
Under the DOE's National Nuclear Security Administration (NNSA) and the NNSA's Stockpile Stewardship Program (SSP), simulations are run on a range of supercomputing resources that are funded by the NNSA's Advanced Simulation and Computing (ASC) program and deployed at several national laboratories. Three of these laboratories (the tri-labs) are Lawrence Livermore National Laboratory (LLNL) in California, Los Alamos National Laboratory (LANL) in New Mexico, and Sandia National Laboratories (SNL), also in New Mexico and California.

The Commodity Technology Systems (CTS-1), the most recent of NNSA's tri-lab commodity cluster procurements, are the daily workhorses for the NNSA's SSP. These clusters are built by Penguin Computing on Intel® architecture, including the Intel® Xeon® processor E5 v4 product family and Intel® Omni-Path Architecture (Intel® OPA) fabric. (CTS is an evolution of the earlier Tri-Lab Capacity Computing (TLCC) clusters deployed in 2007 and 2011.) CTS systems provide tens to hundreds of millions of core-hours each year to tri-labs' scientists.

Central to the computational science carried out largely on the CTS-1 clusters are uncertainty quantification studies across a wide range of parameters to assess the efficacy and safety of the stockpile in production and operation. These studies run several types of codes developed at the national laboratories, from multi-physics and materials science codes to others focused on a particular area, such as hydrodynamics, radiation transport, and electronic structure. The process of uncertainty quantification requires iterating hundreds to thousands, and even tens of thousands, of runs of the same simulation with slightly different parameters, since many of the parameters are not precisely known, to determine overall uncertainties. Hundreds of scientists run their parametric studies, requiring massive amounts of computational core-hours throughout the year. Through these studies, scientists can better understand the characteristics and properties of the physics and focus efforts on the particular models identified as the largest contributors to uncertainties, in an effort to reduce those uncertainties.

The Commodity Technology Systems
The work done on the CTS-1 systems is a stepping stone to larger simulations. The CTS-1 parametric studies help scientists fine-tune these larger simulations, which are run on massive supercomputers such as Sierra, being sited at LLNL, and Trinity at LANL.

Still, the CTS-1 systems are themselves very powerful HPC clusters, able to deliver as much as 3.2 petaFLOPS. Since the first installation of CTS-1 in April of 2016, another 14 clusters of various sizes have been installed across all of the tri-labs. They range in size from 32 nodes to a few thousand nodes. All the CTS-1 systems have been built by Penguin Computing, headquartered in Fremont, California, U.S.A., using Intel Xeon processors (formerly codenamed Broadwell) and Intel OPA fabric for HPC. Many of the CTS-1 systems have taken top-ranked positions in Top500.org's list of the fastest supercomputers in the world. As of June 2017, Livermore has Jade at #45 with 2.6 petaFLOPS, Quartz at #46 with 2.6 petaFLOPS, and Topaz at #224 with 0.8 petaFLOPS; Los Alamos has Grizzly at #77 with 1.5 petaFLOPS, Fire at #105 with 1.1 petaFLOPS, and Ice at #106 with 1.1 petaFLOPS; and Sandia has Cayenne at #97 with 1.2 petaFLOPS, Serrano at #98 with 1.2 petaFLOPS, and Dark Ghost at #223 with 0.8 petaFLOPS.

Scalable Computing at the Tri-labs
In addition to the mission of the NNSA, the tri-labs have over the years supported other laboratory and collaborative projects with industry and academia that require HPC. Projects vary in size, and thus the tri-labs' scientists require a range of computing capacity from project to project. Beginning with the TLCC systems, the tri-labs defined HPC clusters to be sized in Scalable Units (SU), providing a prescribed computing building block from which any size of system could be configured and procured. Besides the SU, TLCC and CTS-1 systems share a common theme: being built from commodity components (servers, storage, memory, and fabric) that keep costs down while providing high levels of computing performance for the daily workloads these systems need to run. Using the SU building-block concept and commodity components, the tri-labs can define their scalable, high-performance computing needs for each project and then quickly deploy clusters to meet those needs at reasonable cost.
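To make the parametric uncertainty-quantification studies and the Scalable Unit sizing described above more concrete, the following is a minimal back-of-the-envelope sketch in Python. Only the 192-node SU size comes from this paper; the cores per node, ensemble size, node counts, and runtimes are illustrative assumptions, not published CTS-1 workload figures.

```python
# Illustrative sizing of a parametric uncertainty-quantification (UQ) campaign
# on an SU-based commodity cluster. All numeric inputs below are hypothetical
# examples except the 192-node Scalable Unit size cited in this paper.

NODES_PER_SU = 192        # CTS-1 Scalable Unit size (from this paper)
CORES_PER_NODE = 36       # assumption: dual-socket Intel Xeon E5 v4 node

def campaign_core_hours(members, nodes_per_member, hours_per_member):
    """Total core-hours consumed by one parameter-sweep campaign."""
    return members * nodes_per_member * CORES_PER_NODE * hours_per_member

def concurrent_members(cluster_sus, nodes_per_member):
    """How many ensemble members can run side by side on a cluster of SUs."""
    return (cluster_sus * NODES_PER_SU) // nodes_per_member

if __name__ == "__main__":
    # Example: a 10,000-member sweep, each member using 4 nodes for 12 hours,
    # scheduled on a 6-SU (1,152-node) cluster.
    members, nodes, hours = 10_000, 4, 12
    print(f"campaign size: {campaign_core_hours(members, nodes, hours):,} core-hours")
    print(f"members running concurrently on 6 SUs: {concurrent_members(6, nodes)}")
```

Under these assumed values, a single sweep works out to roughly 17 million core-hours, which is in line with the paper's statement that CTS systems deliver tens to hundreds of millions of core-hours to tri-labs' scientists each year.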

Building Scalable Systems with Intel® Omni-Path Architecture and Intel® Architecture
Scalable computing is a concept that Intel has used to help design and develop HPC systems for decades. Today, Intel® Scalable System Framework (Intel® SSF) integrates the components of scalable computing—compute, memory/storage, fabric, and software—with the Intel® Xeon® Scalable processors and the Many Integrated Core (MIC) Intel® Xeon Phi™ processor, high-density Intel® Optane™ memory, Intel® HPC Orchestrator software for HPC clusters, and the next-generation interconnect, the Intel Omni-Path Architecture HPC fabric (see the sidebar "Intel Approach to Scalable Computing for HPC"). A key part of balancing the performance of the framework is the Intel OPA fabric.

Driving Next-Generation Fabric Performance and Reliability
HPC fabric technologies have been evolving for decades. InfiniBand* architecture has dominated the HPC space for almost 20 years. But it was not originally designed for HPC, which largely uses the Message Passing Interface (MPI) for high-performance parallel computing. Intel Omni-Path Architecture is a next-generation fabric—complete with host interface, switches, and cabling—that is specifically designed for HPC workloads. Intel OPA is designed to deliver high message injection rates and low latency while providing full bisectional bandwidth across very large clusters. Compared to the latest evolution of 100 Gbps InfiniBand EDR, the 38 HPC systems on the June 2017 Top500 list with Intel OPA deliver 1.7 times the petaFLOPS¹ of EDR, for a total of 67.1 petaFLOPS, which is ten percent of the total Top500 Rmax performance for June 2017.

48-Port Switch Silicon Enables Simple, Large, and Fast Clusters
Core to the fabric architecture is 48-port radix switch silicon developed by Intel. This large port count enables HPC architects to design large clusters that are less complex, with fewer switches, yet still able to deliver full bisectional bandwidth across more nodes than current alternative technologies allow. More ports and fewer switches let designers create networks with fewer devices to manage, which can reduce the cost of large clusters. Intel OPA has recently enabled many large, unique HPC configurations to be built.

Performance and Reliability for HPC
Intel OPA has been shown to outpace InfiniBand EDR in Intel testing, with lower latencies and higher message rates. Summarized results² show that:

• Equal bandwidth²—Both Intel OPA and InfiniBand EDR are capable of delivering nearly the full wire rate of 100 Gbps.
• Dramatically higher message rate²—Intel OPA has a 64 percent higher message rate than InfiniBand EDR without message coalescing at the MPI level. This is a true hardware message rate test without message coalescing in software.
• Much lower MPI latency²—Intel OPA latency is 11 percent lower than InfiniBand EDR, including a switch hop for both Intel OPA and InfiniBand EDR.
• Lower Natural Order Ring latency²—Intel OPA has lower latency than InfiniBand EDR with 16 fully subscribed nodes using 32 MPI ranks per node.
• Significantly lower Random Order Ring latency²—Intel OPA shows significantly lower latency at 16 fully subscribed nodes using 32 MPI ranks per node.

To deliver a next-generation HPC fabric that is enabling some of today's and tomorrow's fastest supercomputers, Intel OPA's designers focused on the nuances of intelligent data movement—reducing latency while ensuring packet and fabric integrity and scalability at speed. A key innovation is Intel OPA's link layer, which is designed to deliver low and deterministic latency, penalty-free packet integrity, and resilient operation in the event of a lane failure. Additionally, the Intel OPA architecture is built to intelligently determine the best method of data transport available to it—setting up an RDMA channel using the host adapter, or using the fast and efficient resources of the Intel Xeon processor to transfer data. The objective is to feed the application without delaying the flow³.

Intel OPA in the Tri-labs—and Beyond
Prior to Intel OPA, Intel offered the Intel® True Scale adapters for InfiniBand, which included technologies that were later incorporated into the Intel OPA host fabric interface. The TLCC clusters incorporated Intel True Scale adapter and switch technologies, so the tri-labs had early experience with the performance and phenomenal scalability of those technologies. According to Matt Leininger, Deputy for Advanced Technology Projects at LLNL, the tri-labs sought that continuing scalability, and, along with benchmarking of Intel OPA and consideration of technical and scheduling risks, Intel OPA was put at the forefront of choices for the fabric in the CTS-1 procurements.

With the new CTS clusters, Leininger expects to see a two- to four-fold performance improvement over the TLCC2 systems, running multiple jobs faster than the previous commodity clusters⁴.

The initial wave of 15 CTS-1 cluster installations across the tri-labs was among the first Intel OPA deployments in 2016. Others include the Bridges supercomputer at Pittsburgh Supercomputing Center, a 1.3 petaFLOPS system; Stampede2 at the Texas Advanced Computing Center, a 1.5 petaFLOPS system; TX-Green at the Massachusetts Institute of Technology, a 1.7 petaFLOPS machine; and Oakforest-PACS at the Joint Center for Advanced HPC in Japan, a 25 petaFLOPS machine. Stampede2, TX-Green, and Oakforest-PACS are all based on Intel® Xeon Phi™ processors. The massive Aurora supercomputer project at Argonne National Laboratory (ANL) will also run on Intel OPA.
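The latency and message-rate comparisons summarized above come from Intel's published fabric benchmarks (see footnote 2). For readers who want a feel for how a small-message MPI latency figure is typically measured, here is a minimal ping-pong sketch using mpi4py. It is an illustrative micro-benchmark under assumed settings (two ranks, 1-byte messages), not the benchmark code behind the cited results.

```python
# Minimal MPI ping-pong latency sketch (illustrative only).
# Assumed launch: mpirun -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if comm.Get_size() < 2:
    raise SystemExit("run with at least 2 MPI ranks")

ITERS = 10_000
buf = np.zeros(1, dtype=np.uint8)   # 1-byte message, typical for latency tests

comm.Barrier()
start = MPI.Wtime()
for _ in range(ITERS):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)    # ping
        comm.Recv(buf, source=1, tag=0)  # pong
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Half the average round-trip time is the usual one-way latency estimate.
    print(f"estimated one-way latency: {elapsed / (2 * ITERS) * 1e6:.2f} us")
```

Published fabric comparisons rely on standardized benchmark suites rather than ad hoc scripts like this one; the sketch is only meant to show what a small-message latency number actually measures.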

Intel Approach to Scalable Computing for HPC
Intel's architectural framework for scalable HPC comprises innovative technologies across four pillars that form the foundation of Intel's approach. Intel Omni-Path Architecture fabric is only one of the four pillars. Intel Xeon processors and Intel Xeon Phi processors (compute), Intel® Optane™ technology built on 3D XPoint™ (memory and storage), and Intel® HPC Orchestrator (software) comprise the other pillars.

Scalable Compute
Intel Xeon processors continue to lead the pack in Top500.org's chart of the 500 fastest supercomputers in the world, with 463 systems based on Intel technology in the June 2017 list. The number two system uses Intel® Xeon® processors, as do 14 others in the top 20. And more HPC solutions are being built with the Intel® Xeon Phi™ processor for highly parallel operations—with up to 3+ TFLOPS of double-precision and 6+ TFLOPS of single-precision peak theoretical performance for a single socket. Thirteen of the top 500 are built on the Intel Xeon Phi processor 7200 series.

Revolutionary Memory and Storage
Intel Optane technology is a disruptive technology that will revolutionize memory and storage. Built on 3D XPoint and integrating Intel memory and storage controllers, Intel interconnect IP, and Intel® software, Intel Optane technology enables media that offers the non-volatility of SSDs, performance between that of DRAM and standard NAND flash SSDs, and enhanced density that today's devices cannot obtain. The new technology will help create opportunities to build faster, more scalable clusters.

Validated Software Stack
While Intel is creating breakthrough hardware for HPC, it is also helping the software community take advantage of HPC more easily. Intel joined with more than 30 other companies to found the OpenHPC Collaborative Project, a community-led organization focused on developing a comprehensive and cohesive open source HPC software stack. With many users and system makers creating and maintaining their own stacks, there is duplicated work and inefficiency across the ecosystem. OpenHPC helps minimize these issues and improve efficiency across the ecosystem. Adding to the OpenHPC resources, Intel brings its own commercial tools, such as Intel® MPI Library, Intel® Parallel Studio XE 2017 Cluster Edition, and several others, into Intel® HPC Orchestrator, a suite of tools and resources for building performant and efficient HPC clusters.

Penguin Computing Designs for Density
With the early TLCC systems, an SU's 162 nodes were configured in traditional racks. With CTS-1, the size of the SU increased to 192 nodes, while the density of the racks changed considerably with Penguin Computing's Tundra ES system configuration. Tundra ES is built on the Open Compute Project (OCP) rack standard, which defines rack spaces that are a little wider and a little shorter, allowing more hardware in a rack. Penguin estimates it can increase density for HPC systems by as much as 50 percent, according to Sid Mair, Senior Vice President of the company's Federal Systems Division. And with the Intel OPA 48-port radix switch silicon, Penguin can connect 1,000 nodes on two layers of fabric instead of using a more complex configuration of smaller switches for the same node density. That means Penguin can deliver CTS-1 systems that offer greater performance and consume less physical space.

OCP and Penguin's Tundra ES also incorporate a 12-volt power system that disaggregates the power supplies from the equipment in the rack. Power shelves convert building power to 12-volt rack power, which is then distributed to all the shelves. A central power conversion and distribution architecture makes it much more efficient to manage and control consumption, which in turn affects the demand for cooling. To accommodate Tundra's 12-volt power rails, Intel designed a 12-volt version of its Intel OPA switches for the CTS-1 builds. These same switches would later be incorporated into Tundra ES clusters used in the Penguin Computing on Demand (POD) cloud HPC service.

Summary
The CTS-1 clusters at the tri-labs, built by Penguin Computing with Intel Omni-Path Architecture, are the daily workhorses that NNSA scientists use to help certify America's nuclear stockpile. Intel OPA is helping to advance the performance and scalability of these HPC clusters. Intel OPA has been proven in the tri-labs and across other installations since it was first introduced in early 2016, taking more positions on the Top500 list and delivering greater petaFLOPS than the alternative 100 Gbps network technology.

Intel's further developments around Intel OPA will enhance the fabric and the processors used in HPC. For example, versions of Intel Xeon processors and Intel Xeon Phi processors will integrate the fabric interface into the chips, eliminating the need for a PCIe card and the latency an extra hop induces. Intel's commitment to HPC is manifested in its continuing developments to help move HPC to the exascale computing era.

The tri-labs' scientists and the CTS-1 clusters are imperative to certifying and protecting America's nuclear stockpile, as well as to other nuclear and non-nuclear national security concerns. Using computational science on advanced HPC clusters, America can also help keep the world safe by eliminating physical testing and experimentation. HPC helps enable and fulfill the missions of the National Nuclear Security Administration's Stockpile Stewardship Program.
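The density section above notes that the 48-port Intel OPA switch silicon lets Penguin connect roughly 1,000 nodes with only two layers of fabric. A short sketch of the arithmetic behind that kind of claim, assuming a classic two-level, full-bisection fat-tree (an illustrative model, not Penguin's exact topology):

```python
# Two-level (leaf/spine) fat-tree capacity from switch radix at full bisection:
# each leaf switch uses half its ports for nodes and half for uplinks, and at
# most `radix` leaf switches can each be wired once to every spine switch,
# so the node count tops out at radix * radix / 2.

def max_nodes_two_tier(radix: int) -> int:
    """Maximum node count of a two-level full-bisection fat-tree of radix-k switches."""
    leaf_switches = radix
    nodes_per_leaf = radix // 2
    return leaf_switches * nodes_per_leaf

for radix in (36, 48):   # 36-port is typical of EDR switch silicon; 48-port is Intel OPA
    print(f"radix {radix}: up to {max_nodes_two_tier(radix)} nodes in two switch tiers")

# radix 36 -> 648 nodes, radix 48 -> 1152 nodes: consistent with reaching the
# ~1,000-node scale on just two layers of 48-port switches.
```

Real deployments add director switches, oversubscription choices, and spare ports, but this radix-squared scaling is why a larger switch ASIC reduces switch count and cabling for the same number of nodes.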

Learn More
You may find the following resources useful:
• Partner company: www.penguincomputing.com
• Other useful resource: Intel Scalable System Framework


1. www.top500.org/lists/2017/6
2. Full test results and configuration are shown at http://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-performance-overview.html.
3. For a full description of Intel® OPA, refer to this white paper: https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/new-high-performance-fabric-hpc-paper.html.
4. https://computation.llnl.gov/computers/commodity-clusters

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to intel.com/performance.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Penguin Computing, Tundra, and Penguin on Demand are trademarks of Penguin Computing.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at www.intel.com/hpc.

Copyright © 2017 Intel Corporation. All rights reserved. Intel, Intel Xeon, Optane, Phi, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. 0917/RJMJ/RJ/PDF Please Recycle 336351-001US