Tibidabo✩: Making the Case for an ARM-Based HPC System

Nikola Rajovic (a,b,∗), Alejandro Rico (a,b), Nikola Puzovic (a), Chris Adeniyi-Jones (c), Alex Ramirez (a,b)

(a) Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
(b) Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Spain
(c) ARM Ltd., Cambridge, United Kingdom

Abstract

It is widely accepted that future HPC systems will be limited by their power consumption. Current HPC systems are built from commodity server processors, designed over years to achieve maximum performance, with energy efficiency being an afterthought. In this paper we advocate a different approach: building HPC systems from low-power embedded and mobile technology parts, designed over time for maximum energy efficiency, which now show promise for competitive performance.

We introduce the architecture of Tibidabo, the first large-scale HPC cluster built from ARM multicore chips, and provide a detailed performance and energy efficiency evaluation. We present the lessons learned for the design and improvement in energy efficiency of future HPC systems based on such low-power cores. Based on our experience with the prototype, we perform simulations to show that a theoretical cluster of 16-core ARM Cortex-A15 chips would increase the energy efficiency of our cluster by 8.7x, reaching an energy efficiency of 1046 MFLOPS/W.

Keywords: high-performance computing, embedded processors, mobile processors, low power, cortex-a9, cortex-a15, energy efficiency

✩ Tibidabo is a mountain overlooking Barcelona.
∗ Corresponding author.
Email addresses: [email protected] (Nikola Rajovic), [email protected] (Alejandro Rico), [email protected] (Nikola Puzovic), [email protected] (Chris Adeniyi-Jones), [email protected] (Alex Ramirez)

Preprint of the article accepted for publication in Future Generation Computer Systems, Elsevier

1. Introduction

In High Performance Computing (HPC), there is a continued need for higher computational performance. Scientific grand challenges, e.g., in engineering, geophysics, bioinformatics, and other compute-intensive application domains, require increasing amounts of compute power. On the other hand, energy is increasingly becoming one of the most expensive resources, and it contributes substantially to the total cost of running a large supercomputing facility. In some cases, the total energy cost over a few years of operation can exceed the cost of acquiring the hardware infrastructure [1, 2, 3].

This trend is not limited to HPC systems; it also holds true for data centres in general. Energy efficiency is already a primary concern for the design of any computer system, and it is unanimously recognized that reaching the next milestone in supercomputers' performance, e.g., one EFLOPS (exaFLOPS, 10^18 floating-point operations per second), will be strongly constrained by power. The energy efficiency of a system will define the maximum achievable performance.

In this paper, we take a first step towards HPC systems developed from low-power solutions used in embedded and mobile devices. However, using CPUs from this domain is a challenge: these devices are designed neither to exploit high ILP nor for high memory bandwidth. Most embedded CPUs lack a vector floating-point unit, and their software ecosystem is not tuned for HPC. What makes them particularly interesting is their size and power characteristics, which allow for higher packaging density and lower cost.

In the following three subsections we further motivate our proposal from several important aspects.

1.1. Road to Exascale

To illustrate our point about the need for low-power processors, let us reverse engineer a theoretical exaflop supercomputer that has a power budget of 20 MW [4]. We will build our system using cores of 16 GFLOPS (8 ops/cycle @ 2 GHz), assuming that single-thread performance will not improve much beyond the performance that we observe today. An exaflop machine will require 62.5 million such cores, independently of how they are packaged together (multicore density, sockets per node). We also assume that only 30-40% of the total power will actually be spent on the cores, with the rest going to power supply overhead, interconnect, storage, and memory. That leaves a power budget of 6 MW to 8 MW for 62.5 million cores, i.e., 0.10 W to 0.13 W per core. Current high-performance processors integrating this type of core require tens of watts at 2 GHz. ARM processors, however, designed for the embedded mobile market, consume less than 0.9 W at that frequency [5], and are thus worth exploring: even though they do not yet provide a sufficient level of performance, they have a promising roadmap ahead.
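As a quick sanity check of these figures, the following minimal Python sketch restates the assumptions from the text (20 MW budget, 16 GFLOPS per core, 30-40% of power available to the cores) and reproduces the per-core power budget; all constants come from this section, not from any measurement:

```python
# Back-of-the-envelope sizing of the theoretical 20 MW exaflop machine
# described in Section 1.1. All constants restate assumptions in the text.
EXAFLOP = 1e18        # target: 10^18 floating-point operations per second
CORE_FLOPS = 16e9     # 16 GFLOPS per core (8 ops/cycle @ 2 GHz)
POWER_BUDGET_W = 20e6 # 20 MW total system power budget

cores = EXAFLOP / CORE_FLOPS  # cores needed, regardless of packaging
for core_share in (0.30, 0.40):  # fraction of total power spent on cores
    watts_per_core = POWER_BUDGET_W * core_share / cores
    print(f"{cores / 1e6:.1f} M cores, {watts_per_core:.3f} W per core "
          f"at a {core_share:.0%} core power share")
# -> 62.5 M cores, 0.096 W per core at a 30% core power share
# -> 62.5 M cores, 0.128 W per core at a 40% core power share
```

Rounded, this gives the 0.10 W to 0.13 W per-core budget quoted above.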
1.2. ARM Processors

There is already a significant trend towards using ARM processors in data servers and cloud computing environments [6, 7, 8, 9, 10]. Those workloads are constrained by the I/O and memory subsystems, not by CPU performance. Recently, ARM processors have also taken significant steps towards increased double-precision floating-point performance, making them competitive with state-of-the-art server processors.

Previous generations of ARM application processors did not feature a floating-point unit capable of supporting the throughputs and latencies required for HPC.¹ The ARM Cortex-A9 has an optional VFPv3 floating-point unit [11] and/or a NEON single-instruction multiple-data (SIMD) floating-point unit [12]. The VFPv3 unit is pipelined and is capable of executing one double-precision ADD operation per cycle, or one MUL/FMA (fused multiply-accumulate) every two cycles. The NEON unit is a SIMD unit that supports only integer and single-precision floating-point operands, which makes it unattractive for HPC. With one double-precision floating-point arithmetic instruction per cycle (VFPv3), a 1 GHz Cortex-A9 thus provides a peak of 1 GFLOPS. The more recent ARM Cortex-A15 [13] has a fully-pipelined double-precision floating-point unit, delivering 2 GFLOPS at 1 GHz (one FMA every cycle). The new ARMv8 instruction set, which is being implemented in next-generation ARM cores, namely the Cortex-A50 series [14], features a 64-bit address space and adds double-precision support to the NEON SIMD ISA, allowing for 4 operations per cycle per unit and leading to 4 GFLOPS at 1 GHz.

¹ Cortex-A8, the processor generation prior to Cortex-A9, has a non-pipelined floating-point unit. In the best case it can deliver one floating-point ADD every ~10 cycles; MUL and MAC have lower throughputs.
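The peak figures above all follow from one relation: peak GFLOPS per core equals double-precision FLOPs issued per cycle times clock frequency in GHz. A small Python sketch, with per-cycle issue rates restating the text (these are nominal peaks, not measured numbers):

```python
# Nominal per-core peak throughput implied by Section 1.2:
# peak GFLOPS = DP FLOPs issued per cycle x clock frequency (GHz).
def peak_gflops(flops_per_cycle: int, ghz: float) -> float:
    return flops_per_cycle * ghz

dp_flops_per_cycle = {
    "Cortex-A9 (VFPv3)": 1,              # one DP ADD per cycle
    "Cortex-A15 (pipelined FMA)": 2,     # one FMA per cycle = 2 FLOPs
    "ARMv8 (NEON with DP)": 4,           # 4 ops per cycle per unit
}
for name, fpc in dp_flops_per_cycle.items():
    print(f"{name}: {peak_gflops(fpc, 1.0):.0f} GFLOPS at 1 GHz")
# -> 1, 2, and 4 GFLOPS, matching the figures quoted in the text
```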
1.3. Bell's Law

Our approach to an HPC system is novel because we argue for the use of mobile cores. We consider the improvements expected in mobile SoCs in the near future that would make them real candidates for HPC. As Bell's law states [15], a new computer class is usually based on lower-cost components, which continue to evolve at a roughly constant price but with increasing performance from Moore's law. This trend holds today: the class of computing systems on the rise in HPC is that of systems with large numbers of closely-coupled small cores (BlueGene/Q and Xeon Phi systems). From the architectural point of view, our proposal fits into this computing class, and it has the potential for performance growth given the size and evolution of the mobile market.

1.4. Contributions

In this paper, we present Tibidabo, an experimental HPC cluster that we built using NVIDIA Tegra2 chips, each featuring a performance-optimized dual-core ARM Cortex-A9 processor. We use the PCIe support in Tegra2 to connect a 1 GbE NIC, and build a tree interconnect with 48-port 1 GbE switches.

We do not intend our first prototype to achieve an energy efficiency competitive with today's leaders. The purpose of this prototype is to serve as a proof of concept demonstrating that building such energy-efficient clusters with mobile processors is possible, and to learn from the experience. On the software side, the goal is to deploy an HPC-ready software stack for ARM-based systems, and to serve as an early application development and tuning vehicle.

Detailed analysis of the performance and power distribution points to a major problem when building HPC systems from low-power parts: the system integration glue takes more power than the microprocessor cores themselves. The main building block of our cluster, the Q7 board, is designed with embedded and mobile software development in mind, and is not particularly optimized for energy-efficient operation. Nevertheless, the energy efficiency of our cluster is 120 MFLOPS/W, still competitive with Intel Xeon X5660 and AMD Opteron 6128 based clusters,² but much lower than what could be anticipated from the performance and power figures of the Cortex-A9 processor.

² In the November 2012 edition of the Green500 list these systems are ranked 395th and 396th, respectively.

We use our performance analysis to model and simulate a potential HPC cluster built from ARM Cortex-A9 and Cortex-A15 chips with higher multicore density (number of cores per chip) and higher-bandwidth interconnects, and conclude that such a system would deliver competitive energy efficiency. The work presented here and the lessons we learned are a first step towards such a system, to be built with the next generation of ARM cores implementing the ARMv8 architecture.

The contributions of this paper are:

• The design of the first HPC ARM-based cluster architecture, with a complete performance evaluation, energy efficiency evaluation, and comparison with state-of-the-art high-performance architectures.

• A power distribution estimation of our ARM cluster.

• Model-based performance and energy-efficiency projections of a theoretical HPC cluster with a higher multicore density and higher-performance ARM cores (a quick consistency check of the headline projection follows this list).
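The headline projection can be cross-checked from the two numbers quoted in this paper: the measured 120 MFLOPS/W of the prototype and the simulated 8.7x efficiency gain from the abstract. A one-line Python sketch:

```python
# Consistency check of the projected efficiency quoted in the abstract.
measured_mflops_per_w = 120  # Tibidabo prototype efficiency (this section)
simulated_gain = 8.7         # simulated Cortex-A15 cluster improvement
print(f"{measured_mflops_per_w * simulated_gain:.0f} MFLOPS/W")
# -> 1044 MFLOPS/W; the paper reports 1046 MFLOPS/W, the small gap
#    being rounding in the 120 MFLOPS/W figure.
```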