Energy Consumption and Performance Analysis Between ARM and Intel *

Energy Consumption and Performance analysis between ARM and Intel * Ricardo Klein Lorenzoni1, Edson L. Padoin1,2, Manuel Binelo1, Philippe O. A. Navaux2 1Department of Exact Sciences and Engineering Regional University of Northwest of Rio Grande do Sul (UNIJUI) – Ijui, RS – Brazil 2Institute of Informatics Federal University of Rio Grande do Sul (UFRGS) – Porto Alegre, RS – Brazil ricardo.lorenzoni, padoin, binelo @unijui.edu.br, [email protected] { } Abstract One approach to overcome power constraints is using MP- SoCs with ARM processors. The Mont-Blanc Project is Power demand and energy consumption is a primary one of the first to introduce this idea [7]. constraint for the HPC community. In response, researchers Today, the main objective of the HPC community is aim to build alternatives to reduce power demand with- the reduction of power consumption while maintaining or out reducing computing power. One approach to overcome even improving the performance of supercomputers. One of power constraints is using low power processors. However, the community approaches to increase performance with- these ARM processors have low power demand and lower out incurring the increase in energy consumption is the use performance. In this paper we have compared the energy of low-power processors, such as the ARM architecture. efficiency between a MPSoC with ARM processor and a Newer processors this architecture have floating-point pro- conventional Intel XEON processor. We have used different cessing units, and possess a frequency of the reasonable benchmarks of NAS Parallel Benchmark (NPB) to analyze clock, making it even more plausible the use of this HPC. execution time and total energy consumption. The prelimi- In this context, this paper question the feasibility to use nary results show the ARM processors are up to 20.56 times low power processors in HPC. To address this question we lower performance. Moreover, their power demand are up compare the performance and energy consumption of a MP- to 27.65 times lower for all benchmark tested. In this way, SoCs with ARM processors with a Intel XEON using NAS using ARM processor, we can reduce the total energy con- benchmark. The rest of the paper is structured as follows. In sumption in up to 1.34 times. Section 2, we present related works on energy consumption and performance on ARM platforms. In Section 3, we dis- cuss the methodology and the testbed evaluation. Section 4 describes the tests results obtained. Section 5 outlines our 1. Introduction conclusions, contributions and future work perspectives. Power demand and energy consumption is a primary constraint for the HPC (High Performance Comput- 2. Related Work ing)community. According to DARPA report, the power To build the next generation of supercomputers the HPC 20 limit for future exascale systems is MW [5]. However, community have the power demand as a central question. the current topfive system, the Sunway TaihuLight, de- Many papers have evaluated the runtime and power de- 15.4 93 mand MW to compute PFlops. In response to high mand tradeof with different benchmark and real applica- power demand, researchers aim to build alternatives to re- tions. Among them, different approaches have been used in duce power demand without reducing computing power. the care of power consumption on the HPC. In this context, power hungry CPUs are leaving space * This research has received funding from the EU H2020 Programme to new technologies in supercomputers design. Embedded and from MCTI/RNP-Brazil under the HPC4E Project, grant agree- ment number 689772. Work developed on the context of the associated processors, such as ARM are a possible alternative over international laboratory between UFRGS and Universite´ de Grenoble CPUs that are the norm to supercomputers. Once, recent - LICIA. ARMv7 ISA have native support to single-precision FP and WSPPD 2016, 14th Workshop on Parallel and Distributed Processing [13] double-precision. Also, they have support for SIMD in- Cubietruck Server structions with the NEON unit [1]. Emily Blem, Jaikrish- Processor A7 Xeon nan Menon, and Karthikeyan Sankaralingam [3] reviewed Manufacturer AllWinnerTech Intel the RISC vs. CISC debate considering the contemporary Processor Model A20 E5-2640v2 ARM and x86 processors running modern workloads to un- Processor Technology (nm) 40 22 derstand the role of ISA on performance, power and energy. Clock Frequency (GHz) 1 2 Jack Dongarra and Piotr Luszczek [4] present a land- Cores/Processor (#) 2 8 scape to analyze ARM. The authors make a comparison Memory (GB) 2 32 of energy efficiency with ARM and several Intel CPUs. Cache L1/Core (KB) 64 512 Their results point that ARM has better efficiency (about Cache L2 (KB) 256 2048 4 GFlops/W). Valero et al. [9] present similar results. The Cache L3 Shared (MB) 0 20 Cortex A9 have an efficiency of 4 GFlops/W and the future Cortex A15 have 8 GFlops/W. These values depict bet- Table 1. Detailed configuration. ter efficiency than INTEL and IBM processors. Rafael Aroca and Luiz Gonçalves [2] evaluated the the Ubuntu version 14.04 with kernel version 3.16.0-70 and power efficiency of several ARM and x86 devices while gcc version 4.8.4 running typical server and number crunching tasks. They conclude that ARM processors have good perfor- 3.2. Benchmark and Workload mance to build servers and clusters, specially when considering the performance per watt and that processing To measure the runtime and energy consumption clusters based on ARM processors are feasible to de- we used NAS Parallel Benchmark (NPB). We run crease power usage of several server applications. BT, FT, LU, MG and IS benchmarks sequentially and par- Nikola Rajovic et al. [8] deploy and evaluate the first allelly aim to analyze different workloads. In parallel ver- cluster for HPC with ARM mobile processors. The results sion OMP was used with 2 threads in the Cubietruck and show that the ARM Cortex-A9 is up to ten times slower than with 2, 4 and 8 threads in the Server. an intel i7 processor, but achieves a competitive energy effi- We are measuring the power demand only of the proces- ciency compared to multicore systems in the Green500 list. sor, once this component determines the greater constraint Dominik GoDdeke¨ et al. [6] employed a prototype ARM- of electrical power on the supplier. based cluster to evaluate the performance trade-off between time and energy solving real-world problems. Different to selected related works, our work analize the 4. Results feasibility to use low power processors in HPC. In this we In this section the results will be presented. First we an- have used NPB benchmarks to compare the runtime and en- alyze the scalability of the devices with the parallel bench- ergy consumption of a MPSoC with ARM processor A20 marks and then the power consumption thereof. model with a Intel processor XEON model. 4.1. Scalability Analysis 3. Experimental Method Scalability is one of the big challenges in HPC systems This section describes the methodology used in this and is related to problems of all levels in the system. In the study. We present the execution environment, and then dis- context of this work, the goal is to analyze scalability of an cuss the Benchmark and Workload methodology. ARM processor and an Intel processor, making a runtime comparison between both. 3.1. Execution Environment Figure 1 presents the speedup achieved in the equip- ments with all benchmarks used. In the Server running the The execution environment is composed of two plat- sequential version and paralleled with 2, 4 and 8 threads, forms. The first is a MPSoC Cubietruck equipped with and in the Cubietruck running in sequential version and with ARM processor and the second is a server with Intel proces- 2 threads, according the total amount of cores. sor. Table 1 shows the main characteristics of each equip- On the Server with 2 threads was obtained the small ment. speedup, an average of 1.18. For 4 threads the average was The operating system on the MPSoC was provided by 2.26. When using 8 threads, the speedup obtained were 3.50 manufacturer, a modified version of GNU/Linux distribu- and 4.87 for MG and IS respectively, as can be seen in tion with kernel version 3.4.106 compiles the application the Figure 1(a). On the Cubietruck was obtained an aver- using gcc version 4.9.2. On Server, the operating system is age scalability of 1.52. The smallest speedup was 1.06 for WSPPD 2016, 14th Workshop on Parallel and Distributed Processing [14] 5 5 4.5 4.5 4 4 3.5 3.5 3 3 2.5 2.5 2 2 1.5 1.5 1 1 0.5 0.5 0 0 BT FT LU IS MG BT FT LU IS MG Sequential 4 Threads 2 Threads 8 Threads Sequential 2 Threads SpeedUp SpeedUp (a) Server (b) Cubietruck Figure 1. Speedup of each Equipment FT and the highest was 1.79 for LU, as shown in the Fig- 4.2. Total Energy Consumption Analysis ure 1(b). It is interesting to note that, on Intel processor used in Taking into account that the energy consumption is one the Server (Figure 1(a)) has a very small speedup when 2 of the current limitations of the HPC systems, in this sec- threads were used if compared with Cubietruck. Featuring tion we analyze the total energy spent by devices to com- a gain of speedup considerably with 4 and 8 threads. For ex- plete the execution of each benchmarks. ample, for MG benchmark with 2 threads the speedup was The Figure 3(a) shows the total energy consumption for of only 1.0039.

Energy Consumption and Performance Analysis Between ARM and Intel *

Accelerating HPL Using the Intel Xeon Phi 7120P Coprocessors

Intel Cirrascale and Petrobras Case Study

Power Measurement Tutorial for the Green500 List

Towards Better Performance Per Watt in Virtual Environments on Asymmetric Single-ISA Multi-Core Systems

Comparing the Power and Performance of Intel's SCC to State

NVIDIA Ampere GA102 GPU Architecture Whitepaper

Using Intel Processors for DSP Applications

Quantitative Analysis Modern Processor Design: Fundamentals of Superscalar Processors

High Performance Embedded Computing in Space

Low-Power High Performance Computing

PAKCK: Performance and Power Analysis of Key Computational Kernels on Cpus and Gpus

EPYC: Designed for Effective Performance