Scaling the Power Wall: A Path to Exascale
Oreste Villa, Daniel R. Johnson, Mike O’Connor, Evgeny Bolotin, David Nellans, Justin Luitjens, Nikolai Sakharnykh, Peng Wang, Paulius Micikevicius, Anthony Scudiero, Stephen W. Keckler and William J. Dally
Email: ovilla, djohnson, moconnor, ebolotin, dnellans, jluitjens, nsakharnykh, penwang, pauliusm, ascudiero, skeckler, [email protected]

Abstract—Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020, and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications, demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signaling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.

I. INTRODUCTION

Modern scientific discoveries in areas such as genetics, drug discovery, and particle physics rely heavily on large-scale simulation and modeling. The demand for expanding computational capability drives the creation of increasingly higher-performance supercomputers. A continuation of historical scaling would predict that exascale supercomputers (i.e. able to sustain 1 ExaFlops = 10^18 Flops) will start to appear by the year 2020 [1]. One of the main challenges in achieving this goal is power consumption, which a range of HPC system operators have suggested be limited to 20 MW for a full exascale-capable system to mitigate cost of ownership and new power delivery infrastructure costs [2]. The current generation of energy-efficient supercomputers relies heavily on general-purpose Graphics Processing Units (GPUs) to scale up their floating point throughput [3]. GPUs have a large energy efficiency advantage with respect to conventional CPUs because they are designed to exploit high levels of data and thread level parallelism for performance rather than extracting instruction-level parallelism from a small number of threads [4]. For highly parallel workloads, GPUs can be up to 10× more energy efficient than CPUs within the same area and power budget [5].

Exascale designers are facing a “power wall” when trying to design future systems capable of sustaining 50 GFlops per Watt (GF/W) on dense codes such as Linpack [6]. The current Titan supercomputer at Oak Ridge National Laboratory consists of 18,688 nodes, each containing one NVIDIA Tesla K20X GPU. On the Linpack benchmark, this system achieves 17.59 PetaFlops while requiring 8.2 MW of power – delivering 2.14 GF/W [7]. To enable exascale systems in the future, energy efficiency must improve by a factor of over 20× to reach the 50 GF/W target. Achieving such a large improvement in energy efficiency will require many contributions beyond process technology scaling [8]. Process technology improvements alone will only provide approximately 4.3× of the required improvement in energy efficiency. An additional 1.9× can be extracted from circuit improvements, enabling operation at lower VDD. A further 2.5× contribution to energy efficiency must come from additional architectural and circuit techniques, as well as a system-level design and interconnect that allow for energy-efficient scale-up of node-level performance.
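For concreteness, these figures compose as a back-of-the-envelope check using only the numbers quoted above: Titan's delivered Linpack efficiency, the resulting gap to the 50 GF/W target, and the product of the projected process, circuit, and architecture contributions.

\[
\frac{17.59\ \text{PFlops}}{8.2\ \text{MW}} \approx 2.14\ \text{GF/W},
\qquad
\frac{50\ \text{GF/W}}{2.14\ \text{GF/W}} \approx 23\times,
\qquad
4.3 \times 1.9 \times 2.5 \approx 20.4\times .
\]

The three projected contributions therefore multiply to roughly the required improvement of over 20×.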
At NVIDIA Research, we are developing an integrated set of technologies spanning architecture, circuits, and co-designed HPC applications targeting these performance and energy goals. While we recognize that resilience and programming systems are also important to exascale, this paper focuses solely on the power challenges. Because no single silver bullet will enable exascale-level performance and energy efficiency, it is critical for designers to understand the opportunities for improvement provided by each area alongside the predicted process technology scaling. This paper aims to quantify the contributions of different technologies on the road to exascale and project the performance and energy consumption of a 2020 exascale system on a suite of DoE HPC proxy applications [9], [10]. We expect that exascale applications, like the proxy-apps that represent them, will exhibit highly varied execution profiles when compared to Linpack. Consequently, exascale systems must be able to efficiently handle these variations and not be over-optimized for dense matrix multiplication. This paper builds upon the earlier Echelon project by expanding upon and extending these research concepts with quantitative analysis of application-level power and performance through a variety of future technology nodes [11]. This paper makes the following contributions:

• We summarize and explain the behavior of a suite of DoE Proxy applications on a modern GPU-based supercomputer.
• We propose an energy-efficient node and system architecture, combining new architectural and circuit innovations.
• We compare the power and performance of our proposed design to currently available GPUs in a 28nm process technology and quantify the contributions of the architecture and circuit-level features as technology scales down to 7nm.
• We estimate application-level power/performance for exascale systems by combining node-level simulation with a scaling model derived from scalability measurements taken on modern supercomputers.

To our knowledge, no prior work has delivered a single comprehensive overview of the future applications, challenges, and features to provide exascale-level performance and power estimates for future HPC systems.

This paper is organized as follows. Section II describes the applications used in this study. Section III presents the overall system architecture. Section IV presents the methodology used in our studies. Section V presents the power and performance results at the node level, compares them with a modern GPU, and discusses scaling at the full system level. Section VI presents related work focusing on architecture, circuit features, and technology scaling studies. Finally, Section VII concludes the paper.

II. SCIENTIFIC APPLICATIONS

While ultra-dense workloads, such as Linpack, are often the benchmark by which HPC systems are measured, real supercomputing applications are rarely as regular. As a result, machines that are over-optimized for these dense workloads tend to have poor real-world utilization when other applications are ported to them. To encourage the design and development of balanced high-performance supercomputers, the DoE has funded the development of proxy applications (proxy-apps) for workloads they view as important to scientific computing in the next 10 years [9], [10]. These proxy-apps are designed to be simplified, but representative, versions of these workloads suitable for public-sector research and development. To ensure maximum portability, the proxy-apps consist of Fortran/C/C++ code using MPI for inter-node communication. Parallelism within the node is typically exposed with threading libraries such as POSIX threads or OpenMP pragma directives. For this work, NVIDIA helped develop GPU implementations of the proxy-apps using numerically equivalent CUDA kernels or OpenACC compiler directives injected into the original code.

The parallelization scheme divides the 3D domain into sub-boxes which are assigned to MPI ranks. After receiving the force and position information of the neighboring sub-boxes from the previous time-step, each MPI rank can compute forces and positions at the current time-step for the atoms assigned to it. Because of the short-range potential calculation and the relatively short cutoff radius, larger sub-boxes result in more computation and less communication. The computation of forces, which can take up to 95% of the total time, can be separated into internal-to-internal and external-to-internal atom interactions. This approach enables overlap of inter-node communication with the internal-to-internal force calculation if the problem size is sufficiently large.
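The sketch below illustrates one common way to structure such a force step so that the exchange of boundary-atom data with neighboring sub-boxes proceeds while the internal-to-internal kernel runs. It is a minimal illustration of the overlap pattern described above, not code from the proxy-app itself; the SubBox layout, kernel names, buffer sizes, and one-dimensional neighbor exchange are assumptions made for brevity.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Owned-atom and halo data for one sub-box; coordinates are packed xyz triples.
    struct SubBox {
        double *pos;       // 3*n_local owned-atom coordinates (device memory)
        double *force;     // 3*n_local force accumulators (device memory)
        double *halo_pos;  // coordinates received from neighbors (device memory)
        int n_local;
    };

    // Illustrative kernels: pairwise forces among owned atoms, and forces exerted
    // on owned atoms by halo atoms within the cutoff radius (bodies elided).
    __global__ void internal_forces(SubBox box, double cutoff) { /* pairwise loop elided */ }
    __global__ void external_forces(SubBox box, double cutoff) { /* owned-vs-halo loop elided */ }

    void force_step(SubBox box, double *send_buf, double *recv_buf, int halo_len,
                    int nbr_prev, int nbr_next, double cutoff,
                    cudaStream_t stream, MPI_Comm comm)
    {
        MPI_Request req[4];

        // 1. Post non-blocking exchanges of boundary-atom positions with the two
        //    neighboring sub-boxes (one dimension shown for brevity).
        MPI_Irecv(recv_buf,            halo_len, MPI_DOUBLE, nbr_prev, 0, comm, &req[0]);
        MPI_Irecv(recv_buf + halo_len, halo_len, MPI_DOUBLE, nbr_next, 1, comm, &req[1]);
        MPI_Isend(send_buf,            halo_len, MPI_DOUBLE, nbr_next, 0, comm, &req[2]);
        MPI_Isend(send_buf + halo_len, halo_len, MPI_DOUBLE, nbr_prev, 1, comm, &req[3]);

        // 2. Overlap: internal-to-internal forces need no remote data, so this
        //    kernel runs while the messages above are in flight.
        int threads = 256;
        int blocks  = (box.n_local + threads - 1) / threads;
        internal_forces<<<blocks, threads, 0, stream>>>(box, cutoff);

        // 3. Once the halo has arrived, stage it to the GPU and add the
        //    external-to-internal contribution (same stream, so it orders after the copy).
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        cudaMemcpyAsync(box.halo_pos, recv_buf, 2 * halo_len * sizeof(double),
                        cudaMemcpyHostToDevice, stream);
        external_forces<<<blocks, threads, 0, stream>>>(box, cutoff);
        cudaStreamSynchronize(stream);
    }

Because the internal-to-internal work grows with the volume of a sub-box while the exchanged halo grows only with its surface, larger sub-boxes give the kernel in step 2 more time to hide the communication posted in step 1.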
B. LULESH

LULESH represents a typical hydrodynamics modeling application, which describes the motion of materials relative to each other when subject to forces [13]. LULESH is a simplified application implemented to solve only a simple Sedov blast problem with analytic answers, but it represents the numerical algorithms, data motion, and programming style typical of production codes. LULESH approximates the hydrodynamics equations discretely by partitioning the spatial problem domain into a collection of volumetric elements defined by a mesh. A node on the mesh is a point where mesh lines intersect. LULESH uses a regular mesh, but to retain unstructured-data properties, the data structures make use of indirection arrays to index into the mesh.

At scale, the parallelization consists of assigning a subset of elements to each MPI rank. Because the amount of data communicated among MPI ranks is proportional to the number of surface elements while computation is proportional to the volume of assigned elements,