Low-Power High Performance

Panagiotis Kritikakos

August 16, 2011

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2011

Abstract

The emerging development of computer systems for HPC requires a change in processor architecture. New design approaches and technologies need to be embraced by the HPC community to make Exascale systems feasible within the next two decades, as well as to reduce the CO2 emissions of supercomputers and scientific clusters, leading to greener computing. Power is listed as one of the most important issues and constraints for future Exascale systems. In this project we build a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as the Intel Atom and ARM (Marvell 88F6281), against a commodity Intel Xeon CPU of the kind found in standard HPC and data-centre clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort.

Contents

1 Introduction
  1.1 Report organisation

2 Background
  2.1 RISC versus CISC
  2.2 HPC Architectures
    2.2.1 System architectures
    2.2.2 Memory architectures
  2.3 Power issues in modern HPC systems
  2.4 Energy and application efficiency

3 Literature review
  3.1 Green500
  3.2 Supercomputing in Small Spaces (SSS)
  3.3 The AppleTV Cluster
  3.4 Sony Playstation 3 Cluster
  3.5 Microsoft XBox Cluster
  3.6 IBM BlueGene/Q
  3.7 Less Watts
  3.8 Energy-efficient cooling
    3.8.1 Green Revolution Cooling
    3.8.2 Google Data Centres
    3.8.3 Nordic Research
  3.9 Exascale

4 Technology review
  4.1 Low-power Architectures
    4.1.1 ARM
    4.1.2 Atom
    4.1.3 PowerPC and Power
    4.1.4 MIPS

5 Benchmarking, power measurement and experimentation
  5.1 Benchmark suites
    5.1.1 HPCC Benchmark Suite
    5.1.2 NPB Benchmark Suite
    5.1.3 SPEC Benchmarks
    5.1.4 EEMBC Benchmarks
  5.2 Benchmarks
    5.2.1 HPL
    5.2.2 STREAM
    5.2.3 CoreMark
  5.3 Power measurement
    5.3.1 Metrics
    5.3.2 Measuring unit power
    5.3.3 The measurement procedure
  5.4 Experiments design and execution
  5.5 Validation and reproducibility

6 Cluster design and deployment
  6.1 Architecture support
    6.1.1 Hardware considerations
    6.1.2 Software considerations
    6.1.3 Soft Float vs Hard Float
  6.2 Fortran
  6.3 C/C++
  6.4 ...
  6.5 Hardware decisions
  6.6 Software decisions
  6.7 Networking
  6.8 Porting
    6.8.1 Fortran to C
    6.8.2 Binary incompatibility
    6.8.3 Scripts developed

7 Results and analysis
  7.1 Thermal Design Power
  7.2 Idle readings
  7.3 Benchmark results
    7.3.1 Serial performance: CoreMark
    7.3.2 Parallel performance: HPL
    7.3.3 Memory performance: STREAM
    7.3.4 HDD and SSD power consumption

8 Future work

9 Conclusions

A CoreMark results

B HPL results

C STREAM results

D Shell Scripts
  D.1 add_node.sh
  D.2 status.sh
  D.3 armrun.sh
  D.4 watt_log.sh
  D.5 fortran2c.sh

E Benchmark outputs samples
  E.1 CoreMark output sample
  E.2 HPL output sample
  E.3 STREAM output sample

F Project evaluation
  F.1 Goals
  F.2 Work plan
  F.3 Risks
  F.4 Changes

G Final Project Proposal
  G.1 Content
  G.2 The work to be undertaken
    G.2.1 Deliverables
  G.3 Tasks
  G.4 Additional information / Knowledge required

List of Tables

6.1 Cluster nodes hardware specifications.
6.2 Cluster nodes software specifications.
6.3 Network configuration.

7.1 Maximum TDP per ...
7.2 Average system power consumption on idle.
7.3 CoreMark results with 1 million iterations.
7.4 HPL problem sizes.
7.5 HPL problem sizes.
7.6 STREAM results for 500MB array size.

A.1 CoreMark results for various iterations.

B.1 HPL problem sizes.
B.2 HPL problem sizes.
B.3 HPL results for N=500.

C.1 STREAM results for an array size of 500MB.

List of Figures

2.1 Single Instruction Single Data (Reproduced from Blaise Barney, LLNL).
2.2 Single Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).
2.3 Multiple Instruction Single Data (Reproduced from Blaise Barney, LLNL).
2.4 Multiple Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).
2.5 Distributed memory architecture (Reproduced from Blaise Barney, LLNL).
2.6 Shared Memory UMA architecture (Reproduced from Blaise Barney, LLNL).
2.7 Shared Memory NUMA architecture (Reproduced from Blaise Barney, LLNL).
2.8 Hybrid Distributed-Shared Memory architecture (Reproduced from Blaise Barney, LLNL).
2.9 Moore's law for power consumption (Reproduced from Wu-chun Feng, LANL).

3.1 GRCooling four-rack CarnotJet™ system at Midas Networks (source GRCooling).
3.2 Google data-centre in Finland, next to the Gulf of Finland (source Google).
3.3 NATO ammunition depot at Rennesøy, Norway (source Green Mountain Data Centre AS).
3.4 Projected power demand of a supercomputer (M. Kogge).

4.1 OpenRD board SoC with ARM (Marvell 88F6281) (Cantanko).
4.2 Intel D525 Board with Intel Atom dual-core.
4.3 IBM's BlueGene/Q 16-core compute node (Timothy Prickett Morgan, The Register).
4.4 Pipelined MIPS, showing the five stages: instruction fetch, instruction decode, execute, memory access and write back (Wikimedia Commons).
4.5 Motherboard with Loongson 2G processor (Wikimedia Commons).

5.1 Power measurement setup.

6.1 The seven-node cluster that was built as part of this project.
6.2 Cluster connectivity.

7.1 Power readings over time.
7.2 CoreMark results for 1 million iterations.
7.3 CoreMark results for 1 thousand iterations.
7.4 CoreMark results for 2 million iterations.
7.5 CoreMark results for 1 million iterations utilising 1 thread per core.
7.6 CoreMark performance for 1, 2, 4, 6 and 8 cores per system.
7.7 CoreMark performance speedup per system.
7.8 CoreMark performance on Intel Xeon.
7.9 Power consumption over time while executing CoreMark.
7.10 HPL results for large problem size, calculated with ACT's script.
7.11 HPL results for problem size 80% of the system memory.
7.12 HPL results for N=500.
7.13 HPL total power consumption for N equal to 80% of memory.
7.14 HPL total power consumption for N calculated with ACT's script.
7.15 HPL total power consumption for N=7296.
7.16 HPL total power consumption for N=500.
7.17 Power consumption over time while executing HPL.
7.18 STREAM results for 500MB array size.
7.19 STREAM results for 3GB array size.
7.20 Power consumption over time while executing STREAM.
7.21 Power consumption with 3.5" HDD and 2.5" SSD.

Listings

2.1 Assembly on RISC
2.2 Assembly on CISC

Acknowledgements

I would like to thank my supervisors Mr Sean McGeever and Dr Lorna Smith. Their guidance and help throughout the project were of great value and contributed greatly to its successful completion.

Chapter 1

Introduction

With the continuous evolution of computer systems, power is becoming an ever greater constraint on modern systems, especially those targeted at supercomputing and HPC in general. The demand for continuously increasing performance requires additional processors per board, where electrical power and heat become limits. This is discussed in detail in DARPA's Exascale Computing study [1]. Over the last few years there has been increasing interest in the use of GPUs in HPC, as they offer far greater FLOP-per-Watt performance than standard CPUs. Designing power-limited systems can have a negative effect on the delivered application performance, due to less powerful processors and designs that are not appropriate for the required tasks, and as a consequence reduces the scope and effectiveness of such systems. For the upcoming Exascale systems this is going to be a major issue. New design approaches need to be considered, exploiting low-power architectures and technologies that can deliver acceptable performance for HPC and other scientific applications at reasonable and acceptable power levels.

The Green500 [6] list argues that the goal of high-performance systems over the past decades has been to increase performance relative to price. Increasing the performance, and the speedup as a consequence, does not necessarily mean that the system is efficient. SSS reports that "from the early 1990s to the early 2000s, the performance of our n-body code for galaxy formation improved by 2000-fold, but the performance per watt only improved 300-fold and the performance per square foot only 65-fold. Clearly, we have been building less and less efficient supercomputers, thus resulting in the construction of massive data-centers, and even, entirely new buildings (and hence, leading to an extraordinarily high total cost of ownership). Perhaps a more insidious problem to the above inefficiency is that the reliability (and usability) of these systems continues to decrease as traditional supercomputers continue to follow Moore's Law for Power Consumption." [8] [9].

Up to now, chip vendors have been following Moore's law [8]. When more than one core is incorporated within the same chip, the clock speed per core is decreased. This is not an issue, as two cores with a reduced clock speed give better performance

than a single chip with a relatively higher clock speed. Decreasing the clock speed decreases the electrical power needed, as well as the corresponding heat produced within the chip. This approach is followed in most modern multi-core chips.

The idea behind low-power HPC stands on the same ground. A significant number of low-power, low-electricity-consumption chips and systems can be clustered together. This could deliver the performance required by HPC and other scientific applications in an efficient manner, in terms of both application performance and energy consumption. Putting together nodes with low-power chips will not solve the problem right away. As these architectures are not widely used in the HPC field, the required tools, mainly compilers and libraries, might not be available or supported. An effort may be required to port these to the new architectures. Even with the tools in place, the codes themselves may require porting and optimisation in order to exploit the underlying hardware.

From a management perspective, every megawatt of reduced power consumption means savings of $1M per year for large supercomputers, as the IESP Roadmap reports [2]. The IESP Roadmap also reports that high-end servers (which are also used to build HPC clusters) were estimated to consume 2% of North American power as of 2006. The same report mentions that IDC (International Data Corporation) estimates that HPC systems will be the largest fraction of the high-end server market. This means that the impact of the electrical power required by such systems needs to be reduced [2].

In this project we designed and built a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as the Intel Atom and ARM, against a commodity Intel Xeon CPU of the kind found in standard HPC and data-centre clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort.

1.1 Report organisation

This dissertation is organised in three main groups of chapters. The first group includes chapters 2 to 4, presenting background material and the literature and technology reviews. Chapter 5 forms a group of its own, discussing the benchmark suites and benchmarks considered and used, the power measurement techniques and methods, and the experimental approach followed throughout the project. The third group includes chapters 6 to 9, discussing the design and deployment of our hybrid low-power cluster, the results and analysis of the experiments that were conducted, suggestions for future work and, finally, the conclusions of the project.

Chapter 2

Background

In this chapter we compare RISC and CISC systems, present the system and memory architectures that can be found in HPC, and explain what each one means. In addition, we make a case for the power issues in modern HPC systems and discuss how energy efficiency relates to application efficiency.

2.1 RISC versus CISC

The majority of modern commodity processors, which are also used within the field of HPC, implement the CISC (Complex Instruction Set Computing) architecture. However, the need for energy efficiency, lower cost, multiple cores and scaling is leading to a simplification of the underlying architectures, requiring hardware vendors to develop energy-efficient, high-performance RISC (Reduced Instruction Set Computing) processors. RISC emphasises a simple instruction set made of highly optimised, single-clock instructions and a large number of general-purpose registers; this is a better match to integrated circuit technology than complex instruction sets [3] [4]. Complex operations can be synthesised by the compiler, minimising the need for additional transistors. That places the emphasis on software, with the transistors saved being used as memory registers. For instance, in assembly language, multiplying two variables and storing the result in the first variable (i.e. a=a*b) on a RISC system would look like the following (assuming 2:3 and 5:2 are memory locations).

Listing 2.1: Assembly on RISC

LOAD  A, 2:3
LOAD  B, 5:2
PROD  A, B
STORE 2:3, A

Each operation - LOAD, PROD, STORE - executes in one clock cycle, so the four instructions above take four clock cycles to perform the whole operation. Due to the simplicity of the operations, though, the processor will complete the task relatively quickly.

CISC takes the view that hardware is always faster than software and that a multi-clock, complex instruction set, adding transistors to the processor, can deliver better performance. It also minimises the number of assembly lines: complex work is carried out by individual instructions executed directly in hardware, as opposed to RISC processors, where each clock cycle executes a single simple instruction. CISC therefore places the emphasis on hardware, implementing additional transistors to support complex instructions. Within a CISC system, the multiplication example above would require a single line of assembly code.

Listing 2.2: Assembly on CISC

MULT 2:3, 5:2

In this case, the system must support an additional instruction, MULT. This complex instruction carries out the whole multiplication directly in hardware, without the need to specify LOAD and STORE instructions explicitly. However, because of its complexity it may take several clock cycles to complete, so the overall execution time is approximately the same as on RISC. For large codes with intensive computation, running on supercomputers that number thousands of cores, the additional transistors needed to handle complex instructions can create power and heat issues and add substantially to the energy demands of the systems themselves.

Modern RISC processors have become more complex than the early versions. They implement additional, more complex instructions and can execute two instructions per clock cycle. However, when comparing modern RISC with modern CISC processors, the differences in complexity and architectural design still exist, resulting in differences in both performance and energy consumption.

2.2 HPC Architectures

In this section we present the different architectures in terms of systems and memory. Both RISC and CISC processors can be found in any of the architectures discussed below.

2.2.1 System architectures

High Performance Computing and parallel architectures were first classified by Michael J. Flynn. Flynn's taxonomy¹ defines four classifications of architectures based on the instruction and data streams. These classifications are:

¹ IEEE Trans. Comput., vol. C-21, no. 9, pp. 948-60, Sept. 1972

• SISD - Single Instruction Single Data
• SIMD - Single Instruction Multiple Data
• MISD - Multiple Instruction Single Data
• MIMD - Multiple Instruction Multiple Data

Single Instruction Single Data: This classification defines a serial system that does not provide any form of parallelism for either of the streams (instruction and data). A single instruction stream is executed on a single clock cycle, and a single data stream is used as input to an instruction on a single clock cycle. Systems that belong to this group are old mainframes and standard single-core personal computers.

Figure 2.1: Single Instruction Single Data (Reproduced from Blaise Barney, LLNL).

Single Instruction Multiple Data: This classification defines a type of parallel processing where each processor executes the same instruction on a different stream of data on every clock cycle. Each instruction is issued by the front-end, and each processor can communicate with any other processor but has access only to its own memory. Array and vector processors, as well as GPUs, belong to this group.

Multiple Instruction Single Data: This classification defines the most uncommon parallel architecture, where multiple processors execute different instructions on the same data stream on every clock cycle. This architecture can be used for fault tolerance, where different systems working on the same data stream must report the same results.

Multiple Instruction Multiple Data: This classification defines the most common parallel architecture used today. Modern multi-core desktops and laptops fall within this category. Each processor executes a different instruction stream on a different data stream on every clock cycle.

2.2.2 Memory architectures

There are two main memory architectures that can be found within HPC systems: distributed memory and shared memory. An MIMD system can be built with either memory architecture.

Figure 2.2: Single Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).

Figure 2.3: Multiple Instruction Single Data (Reproduced from Blaise Barney, LLNL).

Figure 2.4: Multiple Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).

Distributed memory: In this architecture, each processor has its own local memory, apart from caches, and each processor is connected to every other processor via an interconnect. This requires the processors to communicate via the message-passing programming model. This memory architecture enables the development of Massively Parallel Processing (MPP) systems; examples of such systems include the Cray XT6 and the IBM BlueGene. Each processor acts as an individual system, running its own copy of the operating system. The total memory size can be increased by adding more processors and, in theory, can grow to any size. However, performance and scalability rely on an appropriate interconnect, and the approach introduces system management overhead.
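To make the message-passing model concrete, the following minimal C sketch shows one process sending a single value to another with MPI. It is only an illustration using our own naming choices, not code taken from the benchmarks or scripts used in this project.

/* Minimal sketch of the message-passing model: process 0 sends one
   double to process 1.  Compile with an MPI wrapper, e.g. mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;                        /* data lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);         /* explicit copy into rank 1's memory  */
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}

Each process owns only its local copy of value; no data is visible to another process unless it is sent explicitly, which is exactly the property that distinguishes distributed memory from shared memory.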

Figure 2.5: Distributed memory architecture (Reproduced from Blaise Barney, LLNL).

Shared memory: In this architecture, each processor has access to a global shared memory. Communication between the processors takes place via writes to and reads from memory, using the shared-variable programming model. The most common architecture of this type is Symmetric Multi-Processing (SMP), which can be divided into two variants: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). A UMA system is a single SMP machine in which every processor has equal access to the global memory, while a NUMA system is made by physically linking two or more SMP systems so that each can directly access the memory of the others, with accesses to remote memory taking longer than accesses to local memory. The processors do not require message-passing but an appropriate shared-memory programming model. Example systems include IBM and Sun HPC servers and any multi-processor PC or commodity server. The system appears as a single machine to the external user and runs a single copy of the operating system. Scaling the number of processors within a single system is not trivial, as memory access becomes a bottleneck.

Hybrid Distributed-Shared Memory: This could be characterised as the most common memory architecture used in supercomputers and other clusters today. It employs both distributed and shared memory and is usually made up of multiple SMP (shared-memory) nodes joined by an interconnect: within a node memory is shared, while each node has direct access only to its own memory and must send explicit messages to the other nodes in order to communicate.
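In practice the hybrid model is programmed by combining the two models above: message-passing between nodes and a shared-memory approach such as OpenMP within each node. The fragment below is a minimal illustrative sketch, with the array, its size and the reduction chosen purely for demonstration; it is not part of the project's codes.

/* Minimal sketch of hybrid programming: MPI between nodes, OpenMP threads
   within each node.  Compile with an MPI wrapper and OpenMP enabled,
   e.g. mpicc -fopenmp. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double work[N];
    int rank, i;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared-memory parallelism inside the node */
    #pragma omp parallel for reduction(+:local)
    for (i = 0; i < N; i++) {
        work[i] = (double)(rank + i);
        local += work[i];
    }

    /* message-passing between the nodes */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum: %e (threads per node: %d)\n",
               total, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}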

Figure 2.6: Shared Memory UMA architecture (Reproduced from Blaise Barney, LLNL).

Figure 2.7: Shared Memory NUMA architecture (Reproduced from Blaise Barney, LLNL).

Figure 2.8: Hybrid Distributed-Shared Memory architecture (Reproduced from Blaise Barney, LLNL).

2.3 Power issues in modern HPC systems

Modern HPC systems and clusters are usually built from commodity multi-core systems. Connecting such systems with a fast interconnect can create supercomputers and provide the platforms that scientists and other HPC users need. The increase in speed is mainly achieved by increasing the number of cores within each system, while lowering the clock frequency of each core, and by increasing the number of systems in each cluster. The main issue with the CPU technology used today is that it is designed without power efficiency in mind, solely following Moore's law for theoretical performance. While this has worked for Petascale systems that use such processors, it is a challenge for the design, build and deployment of supercomputers that need to achieve Exascale performance.

In order to address, and to some extent bypass, the power issues with the current technology, the use of GPUs is increasing, as they offer better flop-per-watt performance. Physicists, among others, suggest that Moore's law will gradually cease to hold true around 2020 [3]. That introduces the need for a new technology and design in CPUs, as supercomputers will no longer be able to rely on Moore's law to increase their theoretical peak performance. Alan Gara of IBM says that "the biggest part (of energy savings) comes from the use of a new processor optimised for energy efficiency rather than thread performance". He continues that in order to achieve that, "a different approach needs to be followed for building supercomputers, and that is the use of scalable, energy-efficient processors". More experts have addressed the power issues in a similar manner. Pete Beckman of ANL argues that "the issue of electrical power and the shift to multiple cores will dramatically change the architecture and programming of these systems". Bill Dally, chief scientist at NVIDIA, states that "an Exascale system needs to be constructed in such a way that it can run in a machine room with a total power budget not higher than what supercomputers use today". This can be achieved by improving the energy efficiency of the computing resources, closing the gap and reaching Exascale computing at acceptable levels.

The CPU is not the only significant power consumer in modern systems: memory, communications and storage add greatly to the overall power consumption of a system. Memory transistors are charged every single time a specific memory cell needs to be accessed. On commodity systems, memory chips are independent components, separate from the processor (main memory, as opposed to on-chip caches). This increases the power cost, as an additional memory interface is needed, along with the communication between the memory and the processor. Embedded devices follow the System-on-Chip (SoC) concept, where all the components are part of the same module, reducing distances and interfaces, and hence power.

Communication between nodes, rather than between the components of a single node, requires power as well. The longer the distance between systems, the more power is needed to drive the signals between them. Optical and serial links are already used to make communication faster and more efficient, which partly addresses the power issues.


Figure 2.9: Moore's law for power consumption (Reproduced from Wu-chun Feng, LANL).

On the other hand, the larger the systems become, the more communication they need. It is important to keep the distance between the individual nodes as short as possible. Decreasing the size of each node and keeping the extremes of a cluster close together could significantly reduce power needs and costs.

Commodity storage devices, such as hard disk drives, are the most common within HPC clusters, due to their simplicity, easy maintainability and relatively low cost. The target is to get a faster interconnect between the nodes and the storage pools, rather than to replace the storage devices themselves. High I/O is not very common in HPC, but it is very common in specific science fields that use HPC resources, such as astronomy, biology and the geosciences, which tend to work with enormous data-sets. Such data-intensive use-cases will increase the storage demands in terms of capacity, performance and power. HDDs of smaller physical size and SSDs (Solid State Disks) are becoming more common in data-intensive research and applications.

2.4 Energy and application efficiency

The driving force behind building new systems until very recently, and still for most vendors, has been to achieve the highest clock speed possible, following Moore's law. However, it has been suggested that around 2020 Moore's law will gradually cease to hold and a replacement technology will need to be found. Transistors will be so small that quantum theory or atomic

physics will take over and electrons will leak out of the wires [5]. Even with today's systems, Moore's law does not guarantee application efficiency and, of course, does not imply energy efficiency as the overall clock speed increases. On the contrary, application efficiency follows May's law². May's law states that software efficiency halves every 18 months, compensating for Moore's law. The main reason behind this is that every new generation of hardware introduces new, complex hardware optimisations that must be handled by the compiler, and compilers come up against an efficiency barrier. These two issues, especially that of energy efficiency, can be considered the biggest constraints on the design and development of acceptable Exascale systems in terms of performance, efficiency, consumption and cost.

To address this issue, HPC vendors and institutes have started using GP-GPUs (General Purpose Graphics Processing Units) within supercomputers, to achieve high performance without adding extra high-power commodity processors, leading to hybrid supercomputers. The fastest supercomputer in the world today is a RISC system, the K computer of the RIKEN Advanced Institute for Computational Science (AICS) in Japan, which uses SPARC64 processors and delivers a performance of 8.62 petaflops (a petaflop is equivalent to 1,000 trillion calculations per second). This system consumes 9.89 megawatts. The second fastest supercomputer, the Tianhe-1A of the National Supercomputing Center in Tianjin, China, is a hybrid machine which achieves 2.56 petaflops and consumes 4.04 megawatts. This is achieved by combining commodity CPUs, Intel Xeon, with NVIDIA GPUs. These numbers clearly show the difference that GPUs can make in terms of power consumption for large systems.

GPUs are able to execute specially ported code in much less time than standard CPUs, mainly due to their large number of cores and their design simplicity, delivering better performance-per-Watt. While overall a GPU can cost more in terms of power draw, it performs the operations very quickly, so that over the length of a run it overcomes the cost and proves to be both more energy and application efficient when compared to standard CPUs. In addition, it takes the processing load off the processor, reducing the energy demands on the standard CPU. Low-power processors and low-power clusters follow the same concept by using a large number of cores with the simplicity of reduced instruction sets. We can also hypothesise, based on the increased use of GPUs and the porting of applications to these platforms, that in the future the programming models for GPUs will spread even further and GPUs will become easier to program. In that case, the standard CPU could play the role of data distributor to the GPUs, with low-power CPUs being the most suitable candidates for such a task, as they would not need to undertake computationally intensive jobs.

From a power consumption perspective, the systems mentioned earlier consume 9.89 and 4.04 megawatts, for the K computer and Tianhe-1A respectively. The K computer is listed in 6th position on the Green500 list. The most power-efficient supercomputer, the IBM BlueGene/Q Prototype 2 hosted at NNSA/SC, consumes 40.95 kW and achieves 2097.19 MFLOPs per Watt. It is listed in 110th position on the TOP500 list, delivering 85.9 TFLOPs when executing the Linpack benchmark.
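To put these efficiency figures on a common footing, the MFLOPs-per-Watt value quoted for the BlueGene/Q prototype follows directly from its Linpack performance and its measured power draw:

85.9 TFLOPs ≈ 85,880,000 MFLOPs and 40.95 kW = 40,950 W, so 85,880,000 MFLOPs / 40,950 W ≈ 2097 MFLOPs per Watt.

The same ratio, sustained performance divided by measured power, is the performance-per-Watt figure used for comparisons throughout this dissertation.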

² May's Law and Parallel Software - http://www.linux-mag.com/id/8422/

Chapter 3

Literature review

In this chapter we look at projects related to low-power computing that have built and benchmarked low-power clusters.

3.1 Green500

The Green500 list is a re-ordering of the well-known TOP500 list, ranking the most energy-efficient supercomputers. The Green500 raises awareness about power consumption, promotes alternative total-cost-of-ownership performance metrics, and seeks to ensure that supercomputers only simulate climate change and do not create it [6]. The Green500 was started in April 2005 by Dr Wu-chun Feng at the IEEE IPDPS Workshop on High-Performance, Power-Aware Computing.

3.2 Supercomputing in Small Spaces (SSS)

The SSS project was started in 2001 by Wu-chun Feng, Michael S. Warren and Eric H. Wiegle, aiming at low-power architectural approaches and power-aware, software-based approaches. In 2002 the SSS project deployed the Green Destiny cluster, a 240-node system consuming 3.2 kW, placing it at #393 on the TOP500 list at the time. The SSS project has been making the case that traditional supercomputers need to stop following Moore's law for power consumption. Modern systems have been becoming less and less efficient, following May's law, which states that efficiency drops by half every 18 months. The project argues this with the fact that "from the early 1990s to the early 2000s, the performance of our n-body code for galaxy formation improved by 2000-fold, but the performance per watt only improved 300-fold and the performance per square foot only 65-fold" [9].

3.3 The AppleTV Cluster

A research team at the Ludwig-Maximilians University in Munich has built and experimented with a low-power ARM cluster made of AppleTV devices, the AppleTV Cluster. They also evaluated another ARM-based system, a BeagleBoard xM [28]. The team used CoreMark, High Performance Linpack, Membench and STREAM to measure the CPU (serial and parallel) and memory performance of each system. The CoreMark benchmark scored 1920 and 2316 iterations per second on the BeagleBoard xM and the AppleTV respectively. On the HPL benchmark, the systems achieved 22.6 and 57.5 MFLOPs in single precision. In double precision they achieved 29.3 and 40.8 MFLOPs for the BeagleBoard xM and the AppleTV respectively. The support for NEON acceleration (128-bit registers) on the BeagleBoard allowed it to achieve 33.8 MFLOPs in single precision mode.

In terms of memory performance, the team reports copying rates of 481.1 and 749.8 MB/s for the BeagleBoard xM and the AppleTV respectively. The researchers state that a modern Intel Core i7 CPU with 800MHz DDR2 RAM (the same frequency and technology as in the ARM systems used) can deliver more than ten times the reported bandwidth [28].

The power consumption of the AppleTV cluster, which achieves an overall system performance of 160.4 MFLOPs, is 10 Watts for the whole cluster when executing the HPL benchmark and 4 Watts when idle. That results in 16 MFLOPs per Watt when fully executing the benchmark.

3.4 Sony Playstation 3 Cluster

Researchers at North Carolina State University have built a Sony PS3 cluster³. The Sony PS3 uses an eight-core Cell Broadband Engine processor at 3.2 GHz and 256MB of XDR RAM, suitable for SMP and MPI programming. The 9-node cluster ran a PowerPC version of Fedora. The cluster achieved a total of 218 GFLOPs and 25.6 GB/s memory bandwidth. The researchers do not state any power consumption measurements. However, the power consumption of Sony PS3 consoles varies from 76 Watts up to 200 Watts in normal use, and the consoles are provided with a 380 Watt power supply. The processor feature size varies from a 90nm Cell CPU down to a 45nm Cell.

³ Sony PS3 Cluster - http://moss.csc.ncsu.edu/~mueller/cluster/ps3/

3.5 Microsoft XBox Cluster

Another research team, at the University of Houston, has built a low-cost computer cluster with unmodified XBox game consoles⁴. The Microsoft XBox comes with an Intel Celeron/P3 733 MHz processor and 64MB of DDR RAM. The 4-node cluster achieved a total of 1.4 GFLOPs when executing High Performance Linpack on GNU/Linux, consuming between 96 and 130 Watts. That gives a range of 10.7 to 14.58 MFLOPs per Watt. The cluster supported MPI and the Intel C++ and Fortran compilers.

⁴ Microsoft XBox Cluster - http://www.bgfax.com/xbox/home.html

3.6 IBM BlueGene/Q

In terms of high-end supercomputing projects, the IBM BlueGene/Q prototype machines aim at designing and building an energy-efficient supercomputer based on embedded processors. On the latest Green500 list (June 2011), the BlueGene/Q Prototype 2 is listed as the most energy-efficient system, achieving a total of 85880 GFLOPs overall performance. That translates to 2097.19 MFLOPs per Watt, as it consumes 40.95 kW. The second most energy-efficient entry belongs to the BlueGene/Q Prototype 1, achieving 1684.20 MFLOPs per Watt. The BlueGene/Q is not yet available on the market.

3.7 Less Watts

Rising concerns over power efficiency, the desire to cut power costs and the need to reduce overall CO2 emissions have pushed software vendors to look into saving power at the software level. The Open Source Technology Center of Intel Corporation has established an open source project, LessWatts.org, that aims to save power with Linux on Intel platforms. The project focuses on end users, developers and operating system vendors, delivering the components and tools needed to reduce the energy required by the Linux operating system⁶. The project targets desktops, laptops and commodity servers and achieves power savings by enabling, or disabling, specific extensions in the Linux kernel.

⁶ Less Watts: Saving Power with Linux - http://www.lesswatts.org/

3.8 Energy-efficient cooling

Apart from the considerations and research into reducing the overall energy of a system by using energy-efficient processors, there has also been research into, and solutions produced for, reducing the cooling needs of clusters and data-centres, which require huge amounts of power in total, including both the power needed for the systems and that needed for the cooling infrastructure. The main driving force behind such methods is the growing cost of keeping large systems and clusters at the correct temperature. HPC clusters require sophisticated and effective cooling infrastructure as well; such infrastructure might use more energy than the computing systems themselves. These new cooling systems do not solve the issues of heating within the processor, the efficiency of a system or its scalability to perform beyond a petaflop. However, they introduce an environmentally friendly cooling infrastructure, cutting maintenance costs and the overall energy demands of large clusters, similar to those of supercomputers.

3.8.1 Green Revolution Cooling

Green Revolution Cooling is a US-based company that offers cooling solutions for data-centres. They use a fluid submersion technology, GreenDEF™, that reduces the cooling energy used by clusters by 90-95% and the server power usage by 10-20% [19]. While these figures are interesting for commodity servers, and even more so for cooling systems, such approaches do not target the power efficiency of the processor architecture or the basic power needs of the systems. These solutions can be used with existing or future HPC clusters in order to achieve an overall low-power, environmentally friendly infrastructure.

Figure 3.1: GRCooling four-rack CarnotJet™ system at Midas Networks (source GRCooling).

3.8.2 Google Data Centres

Google has been investing in smart, innovative and efficient designs for their large data-centres, which are used to provide web services to millions of users. Two of their data-centres in Europe, one in Belgium and one in Finland, do not use any air conditioning or

chiller systems; instead they cool the systems using natural resources, such as the ambient air temperature and water. In Belgium, the average air temperature is lower than the temperature cooling systems typically provide to data-centres, so it can be used to cool the systems. Moreover, as the data-centre is close to an industrial canal, the canal water is purified and used to cool the systems. In Finland, the facility is built next to the Gulf of Finland, enabling the low temperature of the sea water to be used to cool the data-centre [20].

Figure 3.2: Google data-centre in Finland, next to the Gulf of Finland (source Google).

3.8.3 Nordic Research

Institutions, as well as industry, in Scandinavia and Iceland are investigating green, energy-efficient solutions to support large HPC and data-centre infrastructure at the lowest cost and with reduced CO2 emissions. To achieve this, projects aim to exploit abandoned mines (the Lefdal Mine Project) [19], a retired NATO ammunition depot in mountain halls (Green Mountain Data Centre AS) [22], as well as the design of new data-centres in remote mountain locations close to hydro-electric power plants, for natural cooling and green energy resources [23].

A new agreement has been signed between DCSC (Denmark), UNINETT Sigma (Norway), SNIC (Sweden) and the University of Iceland for the Nordic Supercomputer to operate in Iceland later in 2011. Iceland was chosen as its climate offers suitable natural resources for cooling such a computing infrastructure, and it produces 70% of its electricity from hydro power, 29.9% from geothermal and only 0.1% from fossil fuels [24].

Figure 3.3: NATO ammunition depot at Rennesøy, Norway (source Green Mountain Data Centre AS).

3.9 Exascale

The increasing number of computationally intensive problems and applications, such as weather prediction, nuclear simulation or the analysis of space data, has created the need for new computing facilities targeting Exascale performance. The IESP defines Exascale as "a system that is taken to mean that one or more key attributes of the system has 1,000 times the value of what an attribute of a Petascale system of 2010 has". Building Exascale systems with current technological trends would require huge amounts of energy, among other things such as storage rooms and cooling, to keep them running. Wilfried Verachtert, high-performance computing project manager at the Belgian research institute IMEC, argues that "the power demand for an Exascale computer made using today's technology would keep 14 nuclear reactors running. There are a few very hard problems we have to face in building an Exascale computer. Energy is number one. Right now we need 7,000MW for Exascale performance. We want to get that down to 50MW, and that is still higher than we want."

There are two main approaches being investigated for the design and build of Exascale systems, the Low-power, Architectural Approach and the Project-Aware, Software-based Approach [10], both still at the prototype level.

• Low-power, Architectural Approach: This is the approach we have chosen to work on in this project. Low-power, energy-efficient processors replace the standard commodity, high-power processors used in HPC clusters up to now. Using energy-efficient processors would enable system engineers to build larger systems, with larger numbers of processors, in order to achieve Exascale performance at acceptable power levels. IBM's BlueGene/Q Prototype 2 is right now the most energy-efficient, low-power supercomputer for its size, using low-power PowerPC processors [10]. The same architectural approach can be followed for other parts of the hardware: energy-efficient storage devices, efficient high-bandwidth networking and appropriate power supplies can decrease the total footprint of each system.

• Project-Aware, Software-based Approach: It is suggested by many systems researchers that the low-power architectural approach sacrifices too much performance, at an unacceptable level for HPC applications. A more architecture-independent approach is therefore suggested. This involves the use of high-power CPUs that support dynamic voltage and frequency scaling, which allows the design and programming of algorithms that conserve power by scaling the processor voltage and frequency up and down as needed by the application [10]; a short illustrative sketch of such an interface is given after this list.
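As a rough illustration of how the software-based approach can steer processor frequency, the sketch below uses the Linux cpufreq sysfs interface. The exact paths, supported frequencies and available governors depend on the kernel and the hardware, and the "userspace" governor must be active, so this is an assumed example rather than a recipe used in this project.

/* Illustrative sketch: request a lower CPU frequency through the Linux
   cpufreq sysfs interface.  Assumes root access and the "userspace"
   governor; paths and valid frequency values are hardware dependent. */
#include <stdio.h>

static int set_cpu0_khz(long khz)
{
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
    if (!f)
        return -1;              /* wrong governor, or insufficient permissions */
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}

int main(void)
{
    /* drop cpu0 to 800 MHz during a communication-bound phase (example value) */
    if (set_cpu0_khz(800000) != 0)
        perror("could not set frequency");
    return 0;
}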

The approach chosen for this project is the low-power architectural approach, as it enables the design and building of reliable, efficient HPC systems of any size and does not require any significant change to existing parallel algorithms and code. In specific designs and use-cases, a hybrid approach (a combination of both approaches) might be the golden mean between acceptable performance, power consumption, efficiency and reliability. Figure 3.4 presents the projected power demands of supercomputers from 2006 up to 2020. Given that the graph was compiled in 2010 and that its 2011 predictions match the current TOP500 systems, we can trust the predicted power demands, allowing for some deviation, as time passes. This justifies the need for supercomputers that are energy efficient.

Figure 3.4: Projected power demand of a supercomputer (M. Kogge)

Chapter 4

Technology review

In this chapter we examine the most developed and most likely low-power processor candidates for HPC.

4.1 Low-power Architectures

The low-power processor is not a new trend in the processor business. It is, however, a new necessity in modern computer systems, especially supercomputers. Energy-efficient processors have been used for many years in embedded systems as well as in consumer electronic devices, and systems used in the HPC field have long used low-power RISC processors such as Sun's SPARC and IBM's PowerPC. In this section we look into the most likely low-power processor candidates for future supercomputing systems.

4.1.1 ARM

ARM processors are widely used in many portable consumer devices, such as mobile phones and handheld organisers, as well as in networking equipment and other embedded devices such as the AppleTV. Modern ARM cores, such as the Cortex-A8 (single core, ranging from 600MHz to 1.2GHz), the Cortex-A9 (single-core, dual-core and quad-core versions with clock speeds up to 2GHz) and the upcoming Cortex-A15 (dual-core and quad-core versions, ranging from 1GHz to 2.5GHz), are 32-bit processors using 16 registers and are designed under the Harvard memory model, where the processor has two separate memories, one for instructions and one for data. This allows two simultaneous memory fetches. As ARM cores are RISC cores, they implement the simple load/store model.

The latest ARM processor in production, and available in existing systems, is the ARM Cortex-A9, using the ARMv7 architecture, which is ARM's first generation superscalar architecture. It is the highest performance ARM processor, designed around an advanced, high-efficiency, dynamic-length, multi-issue, out-of-order, speculating 8-stage superscalar pipeline. The Cortex-A9 delivers high levels of performance and power efficiency with the functionality required for leading-edge products across a broad range of systems [9] [10]. The Cortex-A9 comes in both multi-core (MPCore) and single-core versions, making it a promising alternative for low-power HPC clusters. What ARM cores lack is a 64-bit address space, as they support only 32-bit addressing. The recent Cortex-A9 comes with an optional NEON media and floating-point processing engine, aiming to deliver higher performance for the most intensive applications, such as video encoding [11].

The Cortex-A8 uses the ARMv7 architecture as well, but implements a 13-stage integer pipeline and a 10-stage NEON pipeline. The NEON support is used for accelerating multimedia as well as signal-processing applications. The default support for NEON in the Cortex-A8 comes from the fact that this processor is mainly designed for embedded devices. However, NEON technology can be used as an accelerator for processing multiple data elements with a single instruction; this enables the ARM to operate on four multiply-accumulates via dual-issue instructions to two pipelines [11]. NEON supports 64-bit and 128-bit registers and can operate on both integer and floating-point data.

Commercial server manufacturers are already shipping low-power servers with ARM cores. A number of different low-cost, low-power ARM boxes and development boards are available on the market as well, such as the OpenRD, DreamPlug, PandaBoard and BeagleBoard. Moreover, NVIDIA has announced the Denver project, which aims to build custom CPU cores based on the ARM architecture alongside its GPUs, targeting both personal computers and supercomputers [9].

Figure 4.1: OpenRD board SoC with ARM (Marvell 88F6281) (Cantanko).
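As a rough illustration of the kind of SIMD operation that the NEON engine discussed above provides, the following C fragment uses the standard arm_neon.h intrinsics to multiply-accumulate four single-precision values at a time. The function and array names are our own; this is only a sketch, not code used in the project.

/* Illustrative sketch of a NEON multiply-accumulate over four floats at a
   time, using the arm_neon.h intrinsics (requires an ARMv7 compiler with
   NEON enabled, e.g. gcc -mfpu=neon). */
#include <arm_neon.h>

void saxpy_neon(float *y, const float *x, float a, int n)
{
    float32x4_t va = vdupq_n_f32(a);            /* broadcast the scalar */
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);      /* load 4 floats        */
        float32x4_t vy = vld1q_f32(y + i);
        vy = vmlaq_f32(vy, va, vx);             /* vy += a * vx         */
        vst1q_f32(y + i, vy);                   /* store 4 results      */
    }
    for (; i < n; i++)                          /* scalar remainder     */
        y[i] += a * x[i];
}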

20 4.1.2 Atom

Atom is Intel's low-power processor, aimed at laptops and at low-cost, low-power servers and desktops, with clock speeds ranging from 800MHz to 2.13GHz. It supports both 32-bit and 64-bit registers and, being an x86-based architecture, is one of the most suitable alternative candidates to standard high-power processors so far. Server vendors already ship systems with Atom chips which, due to their low price, can be very appealing for prototype low-power systems that do not require software alterations. Each instruction loaded into the CPU is translated into a micro-operation performing a memory load and a store operation on each ALU, extending the traditional RISC design and allowing the processor to perform multiple tasks per clock cycle. The processor has a 16-stage pipeline, where each pipeline stage is broken down into three parts: decoding, dispatching and cache access [11].

The Intel Atom processor has two ALUs and two FPUs. The first ALU handles shift operations while the second handles jumps. The FPUs are used for arithmetic operations, including integer ones: the first FPU is used for addition only, while the second handles single-instruction multiple-data (SIMD) operations and operations that involve multiplication and division. Basic operations can be executed and completed within a single clock cycle, while the processor can use up to 31 clock cycles for more complex instructions, such as floating-point division. The newest models support Hyper-Threading technology, allowing parallel execution of two threads per core and providing virtually four cores on the system [11].

Figure 4.2: Intel D525 Board with Intel Atom dual-core.

21 4.1.3 PowerPC and Power

PowerPC is one of the oldest low-power RISC processor families used in the HPC field and is still used in one of the world's fastest supercomputers, IBM's BlueGene/P. PowerPC processors are also available in standard commercial servers for general-purpose computing, not just for HPC, and they support both 32-bit and 64-bit operation. PowerPC processors are, however, found in IBM's own systems, making them an expensive solution for low-budget projects and institutes.

The latest BlueGene/Q uses one of the latest Power processors, the A2. The PowerPC A2 is described as massively multicore and multi-threaded, with 64-bit support. Its clock speed ranges from 1.4GHz to 2.3GHz. Being a massively multicore processor, it can support up to 16 cores per chip with 4-way multi-threading, allowing simultaneous multithreading of up to 64 threads per processor [18]. Each chip has integrated memory and I/O controllers.

Figure 4.3: IBM's BlueGene/Q 16-core compute node (Timothy Prickett Morgan, The Register).

Due to its low power consumption and its flexibility, the design of the A2 is used in the PowerEN (Power Edge of Network) processor, which is a hybrid between a networking processor and a standard server processor. This type of processor is also known as a wire-speed processor, merging characteristics of network processors, such as low-power cores, accelerators, integrated network and memory I/O, smaller memory line sizes and low total power, with characteristics of standard processors, such as full ISA cores, support for standard programming models, operating systems, hypervisors and full virtualisation. Wire-speed processors are used in applications in the areas of network processing, intelligent I/O devices and streaming. The architectural consideration of power efficiency drops the power consumption to below 50% of the initial power consumption. The high number of hardware threads is able to deliver better throughput power-performance when compared to a standard CPU, but with poorer single-thread performance. Power is also minimised by operating at the lowest voltage necessary to function at a specific frequency [17].

22 4.1.4 MIPS

MIPS is a RISC processor that is widely used in consumer devices, most notably the Sony Playstation (PSX) and the Sony Playstation Portable (PSP). Being a low-power processor, its design is based on RISC principles, with all instructions completing in one cycle. It supports both 32-bit and 64-bit registers and implements the von Neumann memory architecture.


Figure 4.4: Pipelined MIPS, showing the five stages: instruction fetch, instruction decode, execute, memory access and write back (Wikimedia Commons).

Being of RISC design, MIPS uses a fixed-length, regularly encoded instruction set built around the load/store model, which is a fundamental concept of the RISC architecture. The arithmetic and logic operations in the MIPS design use 3-operand instructions, enabling compilers to optimise the formulation of complex expressions, branch/jump options and delayed jump instructions. Floating-point registers are supported in both 32-bit and 64-bit widths, in the same way as the general-purpose registers. Superscalar implementations are made easier by the absence of integer condition codes. MIPS offers flexible, high-performance caches and memory management with well-defined cache control options. The 64-bit floating-point registers and the pairing of two single 32-bit floating-point operations improve overall performance and speed up specific tasks by enabling SIMD [31] [32] [33] [34].

MIPS Technologies licenses its architecture designs to third parties so that they can design and build their own MIPS-based processors. The Chinese Academy of Sciences has designed the MIPS-based Loongson processor, and Chinese institutes have started designing

23 and building MIPS chips for their next generation supercomputers [8]. China’s Institute of Computing Technology (ICT) has licensed the MIPS32 and MIPS64 architectures from MIPS Technologies [35].

Figure 4.5: Motherboard with Loongson 2G processor (Wikimedia Commons).

Looking at the market, commercial MIPS products do not target the server or general-purpose computing markets, making it almost impossible to identify appropriate off-the-shelf systems for designing and building a MIPS low-power HPC cluster with the software support needed for HPC codes.

Chapter 5

Benchmarking, power measurement and experimentation

In this chapter we give a brief description of the benchmarking suites we considered and of the benchmarks we finally ran, together with the power measurement and experimentation methods used.

5.1 Benchmark suites

5.1.1 HPCC Benchmark Suite

The HPCC suite consists of seven low-level benchmarks, reporting performance for floating-point operations, memory bandwidth, and communication latency and bandwidth. The most common benchmark for measuring floating-point performance is Linpack, widely used for measuring the peak performance of supercomputer systems. While all of the benchmarks are written in C, Linpack builds upon the BLAS library, which is written in Fortran. Compiling the benchmarks successfully on the ARM architecture therefore requires the use of a C implementation of BLAS, such as the GNU version. The HPCC benchmarks, while easy to compile and execute, do not represent a complete HPC or scientific application. They are useful for identifying the performance of a system at a low level, but do not represent the performance of a system as a whole when executing a complete HPC application [14]. The HPCC benchmarks are free of cost.

5.1.2 NPB Benchmark Suite

The NAS Parallel Benchmarks (NPB) are developed by the NASA Advanced Supercomputing (NAS) Division. This benchmarking suite provides benchmarks for MPI, OpenMP, High Performance Fortran and Java, as well as serial versions of the parallel codes. The suite provides 11 benchmarks, the majority developed in Fortran, with only

4 benchmarks written in C. Most of the benchmarks are low-level, targeting specific system operations such as floating-point operations per second, memory bandwidth and I/O performance. Examples of full applications are provided as well, for acquiring more accurate results on the performance of high-performance systems [15]. The NAS Parallel Benchmarks are free of cost.

5.1.3 SPEC Benchmarks

The Standard Performance Evaluation Corporation (SPEC) provides a large variety of benchmarks, both kernel and application benchmarks, for many different systems, including MPI and OpenMP versions. The suites of interest to the HPC community are the SPEC CPU, MPI, OMP and Power benchmarks. The majority of the benchmarks represent HPC and scientific applications, allowing the overall performance of a system to be measured.

The CPU benchmarks are designed to provide performance measurements that can be used to compare computationally intensive workloads on different computer systems. The suite provides CPU-intensive codes, stressing a system's processor, memory subsystem and compiler. It provides 29 codes, of which 25 are available in C/C++ and 6 in Fortran. The MPI benchmarks are used for evaluating MPI-parallel, floating-point, compute-intensive performance across a wide range of cluster and SMP hardware. The suite provides 18 codes, of which 12 are developed in C/C++ and 6 in Fortran. The OMP benchmarks are used for evaluating floating-point, compute-intensive performance on SMP hardware using OpenMP applications. The suite provides 11 benchmarks, with only 2 of the codes available in C and 9 in Fortran.

The Power benchmark is one of the first industry-standard benchmarks used to measure the power consumption of servers and clusters in the same way as is done for performance. While this allows power measurements, it does not allow the performance of an HPC or other scientific application to be observed, as it uses Java server-based codes to evaluate the system's power consumption.

5.1.4 EEMBC Benchmarks

The EEMBC (Embedded Microprocessor Benchmark Consortium) provides a wide range of benchmarks for embedded devices such as those used in networking, digital media, automotive, industrial, consumer and office equipment products. Some of the benchmarks are free of cost and open source, while others are given under licence, academic or commercial. The benchmark suites provide codes for measuring single-core and multi-core performance, power consumption, telecom/networking performance and floating-point performance, as well as various codes for different classes of consumer electronic devices.

5.2 Benchmarks

In this section we describe the benchmarks we used to evaluate the systems in this project. These benchmarks do not represent full HPC codes, but they are established and well-defined benchmarks used widely for reporting the performance of computing systems. Full HPC codes tend to take a long time to execute, which proved to be a constraint for the project in terms of the available time. That is an additional reason behind the decision to run simpler kernel benchmarks, where the data sets can be defined by the user.

5.2.1 HPL

We use the High-Performance Linpack to measure the performance in flops of each different system. HPL solves a random dense linear system in double precision arithmetic, either on a single system or on distributed-memory systems. The algorithm used in this code uses "a 2D block-cyclic data distribution - Right-looking variant of the LU factorisation with row partial pivoting featuring multiple look-ahead depths - Recursive panel factorisation with pivot search and column broadcast combined - Various virtual panel broadcast topologies - bandwidth reducing swap-broadcast algorithm - backward substitution with look-ahead of depth 1" [16] [17]. The results outline how long it takes to solve the linear system and how many Mflops or Gflops are achieved during the computation. HPL is part of the HPCC Benchmark suite.
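For reference, the Gflops figure HPL reports follows directly from the approximate operation count of the LU solve for a problem of size N and the measured wall-clock time t in seconds:

\[
\mathrm{GFLOPS} \approx \frac{\tfrac{2}{3}N^{3} + O(N^{2})}{t \times 10^{9}}
\]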

5.2.2 STREAM

The STREAM benchmark is a synthetic benchmark that measures memory bandwidth and the computation rate for simple vector kernels [12]. The benchmark tests four memory functions: copy, scale, add and triad. It reports the bandwidth in MB/s as well as the average, minimum and maximum time taken to complete each of the operations. STREAM is part of the HPCC Benchmark suite. It can be executed either in serial or in multi-threaded mode.
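For illustration, the four kernels perform operations of the following form on large double-precision arrays; this is a sketch of the operations, not STREAM's own source, and the array length is a placeholder:

    #include <stddef.h>

    #define STREAM_N 2000000   /* illustrative array length, not STREAM's default */

    static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

    /* Sketch of the four STREAM kernels; STREAM itself times each loop
     * separately and reports the sustained bandwidth in MB/s. */
    void stream_kernels(double scalar)
    {
        size_t j;

        for (j = 0; j < STREAM_N; j++) c[j] = a[j];                  /* Copy  */
        for (j = 0; j < STREAM_N; j++) b[j] = scalar * c[j];         /* Scale */
        for (j = 0; j < STREAM_N; j++) c[j] = a[j] + b[j];           /* Add   */
        for (j = 0; j < STREAM_N; j++) a[j] = b[j] + scalar * c[j];  /* Triad */
    }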

5.2.3 CoreMark

The CoreMark benchmark is developed by the Embedded Microprocessor Benchmark Consortium. It is a generic, simple benchmark targeted at the functionality of a single processing core within a system. It uses a mixture of read/write, integer and control operations, including matrix manipulation, linked list manipulation, state machine operations and Cyclic Redundancy Checks, an operation commonly used in embedded systems. The benchmark reports how many iterations are performed in total and per second, plus the total execution time and total processor ticks. It can be executed either in serial or in multi-threaded mode, enabling hyper-threaded cores to be evaluated more effectively. CoreMark does not represent a real application; it stresses the processor's pipeline operations, memory access (including caches) and integer operations [26].
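As a rough illustration of the kind of integer and CRC work CoreMark exercises (this is not CoreMark's own code, and the reflected CRC-16 polynomial below is just an example choice):

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative reflected CRC-16 over a buffer, similar in spirit to the
     * CRC operations CoreMark uses to validate its list, matrix and state
     * machine workloads.  The polynomial 0xA001 is an example choice. */
    uint16_t crc16(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0;
        size_t i;
        int bit;

        for (i = 0; i < len; i++) {
            crc ^= data[i];
            for (bit = 0; bit < 8; bit++) {
                if (crc & 1)
                    crc = (uint16_t)((crc >> 1) ^ 0xA001);
                else
                    crc = (uint16_t)(crc >> 1);
            }
        }
        return crc;
    }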

5.3 Power measurement

Power measurement techniques vary and can be applied at many different points in the system. Power consumption can be measured between the power supply and the electrical socket, between the motherboard (or another hardware part of the system) and the power supply, as well as between individual parts of the system. Initially, we want to measure the system as a whole. That will let us know which systems can be bought "off-the-shelf" on the best performance-per-Watt basis.

For our experiments we adopt the technique used by the Green500 to measure the power consumption of a system: a power meter is placed between the power supply's AC input of a selected unit and a socket connected to the external power supply system. That allows us to measure the power consumption of the system as a whole. The power meter reports the power consumption of the system at any time and in any state, whether idle or running a specific code. By logging data at specific times, we can identify the power consumption at any moment required.

An alternative method of measuring the same form of power consumption is to use sensor-enabled software tools installed within the operating system. That has as a prerequisite that the hardware provides the needed sensors. Of the systems we used, the high-power Intel Xeon systems provided the necessary sensors and software, allowing us to use software tools on the host system to measure the power consumption. The low-power systems do not provide sensor support, preventing us from using software tools to gather their power consumption. Because of this, we used external power meters on all of the systems in order to treat all readings equally, using the same method, and make the comparison fairer.

Power measurement can also be performed on individual components of the system. That would allow us to measure specifically how much power each processor consumes without being affected by any other part of the system. With this method, we could also measure the power requirements and consumption of different parts of the system, such as the processor and the memory. While this is of great interest, and perhaps one of the best ways to qualify and quantify in detail where power goes and how it is used by each component, due to time constraints we could not invest the time and effort in this method for this project.

5.3.1 Metrics

In this project we use the same metric as the Green500 list, the "performance-per-watt" (PPW) metric used to rank the energy efficiency of supercomputers. The metric is defined by the following equation:

\[
\mathrm{PPW} = \frac{\mathrm{Performance}}{\mathrm{Power}} \tag{5.1}
\]

Performance in equation (5.1) is defined as the maximal performance reported by the corresponding benchmark: GFLOPS (Giga FLoating-point OPerations per Second) for High Performance Linpack, MB/s (Mega Bytes per Second) for STREAM and Iter/s (Iterations per Second) for CoreMark. Power in equation (5.1) is defined as the average system power consumption, in Watts, during the execution of each benchmark for the given problem size.
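As a worked example, using the CoreMark figures reported later in Table 7.3, the Intel Xeon system sustains 6636.40 iterations per second at an average draw of 119 Watts, giving:

\[
\mathrm{PPW}_{\mathrm{Xeon}} = \frac{6636.40\ \mathrm{Iter/s}}{119\ \mathrm{W}} \approx 55.76\ \mathrm{Iter/s\ per\ Watt}
\]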

5.3.2 Measuring unit power

The power measurements were performed using the Watts up? PRO ES and the CREATE ComfortLINE power meters. The meter is placed between the power supply's AC input of the machine to be monitored and the socket connected to the external power supply infrastructure, and reports the Watts consumed at any time. The power meter is provided with a USB interface and software that allow us to record the data we need on an external system and study it at any desired time. This methodology reflects the technique followed to submit performance results to the Green500 list [12]. The basic set-up is illustrated by figure 5.1.

Figure 5.1: Power measurement setup.

5.3.3 The measurement procedure

The measurement procedure consists of nine simple steps, similar to those described in the Green500 List Power Measurement Tutorial [15].

7 Watts up? - http://www.wattsupmeters.com/
8 CREATE - The Energy Education Experts - http://www.create.org.uk/

1. Connect the power meter between the electricity socket and the physical machine.
2. Power on the meter (if required).
3. Power on the physical machine.
4. Start the power usage logger.
5. Initialise and execute the benchmark.
6. Start recording of power consumption.
7. Finish recording of power consumption.
8. Record the reported benchmark performance.
9. Load power usage data and calculate average and PPW.

With the physical machine and the power meter connected and both running, we initialise the execution of the benchmark and then start recording the power consumption data of the system. We use a problem size large enough to keep the fastest system busy enough to provide a reliable recording of power usage during the execution time. That gives the other systems even longer execution times, allowing us to gather accurate power consumption data for every system we examine. For each benchmark the problem size can vary depending on hardware limitations (e.g. memory size, storage).

5.4 Experiments design and execution

Experimentation is the process of defining a series of experiments, or tests, that are conducted in order to discover something about a particular process or system. In other words, "experiments are used to study the performance of processes and systems" [25]. The performance of a system, though, depends on variables and factors, both controllable and uncontrollable.

The accuracy of an experiment, meaning the success of the measurement and the observation, depends on these controllable and uncontrollable variables and factors, as they can affect the results. These variables can vary under different conditions and environments. For instance, the execution of unnecessary applications while conducting the experiments is a controllable variable that can negatively affect the experimental results. The operating system CPU scheduling algorithm, on the other hand, is not a controllable variable and can vary within the same operating system when executed on a different architecture; this plays a major role in the differentiation of the results from system to system. Likewise, the architecture of the CPU itself is an uncontrollable variable that will affect the results. The controllable factors for this project have been identified as below:

• Execution of non-operating-system-specific applications and processes.
• Installation of unnecessary software packages, as that can result in additional power consumption for unneeded services.
• Multiple uses of a system by different users.

These factors have been eliminated in order to get more representative and unaffected results. The uncontrolled factors have been identified as below:

• Operating System scheduling algorithms.
• Operating System services/applications.
• Underlying hardware architecture and implementation.
• Network noise and delay.

From this list, the only factor that is partially controlled is the network noise and delay. We use private IPs with NAT, which prevents the machines from being contacted from outside the private network unless they issue an external call. Keeping the systems to be measured off a public network eliminates the noise and delay introduced on the physical wire by other devices connected to that network. Finally, the technical phase of experimentation has been separated into seven stages:

• Designing the experiments.
• Planning the experiments.
• Conducting the experiments.
• Collecting the data.
• Generating data sets.
• Generating graphs and tables.
• Results analysis.

5.5 Validation and reproducibility

The validation of each benchmark run is confirmed either by the validation tests each benchmark provides, like the residual tests in HPL for instance, or by being accepted for publication, which was the case for the CoreMark results published on the CoreMark website9. The STREAM benchmark also states at the end of each run whether it validates or not. Since all of the experiments with all of the benchmarks validated, we can claim accuracy and correctness for the results we present below. Reproducibility is confirmed by executing each benchmark four times with the same options and definitions. The average of all of the runs is taken and presented in the results

9 CoreMark scores - http://www.coremark.org/benchmark/index.php?pg=benchmark

that follow in this section. The power readings were taken once every second during the execution of each benchmark. The average is then calculated to give the average power consumption of each system when running a specific benchmark.
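The averaging step itself is straightforward. The sketch below assumes one power sample per second, written as a single value per line to a file named power.log; the file name and format are placeholders rather than the actual output of the scripts in Appendix D.

    #include <stdio.h>

    /* Average the per-second power samples (one value in Watts per line) and
     * report the total energy in watt-seconds.  "power.log" and the
     * one-value-per-line format are assumptions made for this sketch. */
    int main(void)
    {
        FILE *fp = fopen("power.log", "r");
        double watts, sum = 0.0;
        long samples = 0;

        if (fp == NULL) {
            perror("power.log");
            return 1;
        }
        while (fscanf(fp, "%lf", &watts) == 1) {
            sum += watts;
            samples++;
        }
        fclose(fp);

        if (samples > 0) {
            printf("average power: %.2f W over %ld s\n", sum / samples, samples);
            printf("total energy : %.2f Ws\n", sum);  /* one sample per second */
        }
        return 0;
    }

Because the samples are one second apart, the sum of the samples is also the total energy in watt-seconds, which corresponds to the total consumption figures quoted in Chapter 7.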

Chapter 6

Cluster design and deployment

In this chapter I discuss the hardware and software specifications of the hybrid cluster I have designed and built as part of this project. I discuss the issues I encountered and how I solved them.

6.1 Architecture support

6.1.1 Hardware considerations

To evaluate the performance of low-power processors effectively we need a suitable infrastructure that enables us to run the same experiments across a number of different systems, both low-power and high-power, in order to perform a comparison on equal terms. Identifying systems identical in every aspect apart from the CPU is realistically not feasible within the time and budget of this project. Therefore, the experiments are designed in such a way that they measure equal software metrics, and for the analysis of the results we take into consideration any important differences in the hardware that can affect the interpretation of the results.

The project experiments with different architectures: standard x86 [9] (i.e. Intel Xeon), RISC-style x86 [10] (i.e. Intel Atom) and ARM [11] (Marvell Sheeva 88F6281). These form a modern comparison of CISC (Complex Instruction Set Computing) versus RISC (Reduced Instruction Set Computing) designs for HPC use. Each of these architectures, though, uses a different register width (32/64-bit). For instance, both x86 processors support 64-bit registers while ARM supports only 32-bit registers. That may prove to be an issue for scientific codes from a software performance perspective, as the same code may behave and perform differently when compiled on a 32-bit and a 64-bit system.

While the registers (processor registers, data registers, address registers etc.) are one of the main differences between architectures, identical systems are very hard to build when using chips of different architectures, because the other parts of the hardware differ as well. The boards will be different, memory chips and sizes may differ, and networking support can differ too (e.g. Megabit versus Gigabit support). Also, different hard disk types, such as HDD versus SSD, will affect the total power consumption of a system.

6.1.2 Software considerations

Moving from the architectural differences to the software level, some tool-chains (libraries, compilers, etc.) are not identical for every architecture. For instance, the official GNU GCC ARM tool-chain is at version 4.0.3 while the standard x86 version is at 4.5.2. We solved this by using the binary distributions that come by default with the Linux distributions of specific vendors, such as Red Hat in our case, which ships GCC 4.1.2 with its operating system on any supported architecture. The source code can also be used to compile the needed tools, but that proves to be a time-consuming, and sometimes non-trivial, task. It might be the only way, though, of installing a specific version of a tool-chain when there is no binary compiled for the needed architecture.

The compiled Linux distributions available for ARM, such as Debian GNU/Linux and Fedora, are compiled for the ARMv5 architecture, which is older than the architecture the latest ARM processors are based on, ARMv7. Other distributions, such as Slackware Linux, are compiled for the even older ARMv4 architecture. Using an operating system, compilers, tools and libraries compiled for an older architecture does not take advantage of the additional capabilities of the newer architecture's instruction set. A simple example is the comparison between x86 and x86_64 systems: a standard x86-compiled operating system running on x86_64 hardware would not take advantage of the larger virtual and physical address spaces, preventing applications and codes from using larger data sets.

Intel Atom, on the other hand, does not have any issues with compiler, tool and software support. Being an x86 based architecture, it supports and can handle any x86 package that is available for commodity high-power hardware, which is widely used nowadays in scientific clusters and supercomputers.

6.1.3 Soft Float vs Hard Float

Soft float uses an FPU (Floating Point Unit) emulator at the software level, while hard float uses the hardware FPU. As described earlier, most modern ARM processors come with FPU support. However, in order to provide full FPU support, the required tools and libraries need to be recompiled from scratch. Dependency packages would need to be recompiled as well, which can include low-level libraries such as the C library. The supported Linux distributions, compilers, tools and libraries that target the ARMv5 architecture use soft float, as ARMv5 does not come with hardware FPU support. Therefore, they are unable to take advantage of the processor's FPU and the additional NEON SIMD instructions. It is reported that recompiling the whole operating system from scratch with hard-float support can increase performance by up to 300% [27]. At present there is no distribution fully available that takes advantage of the hardware FPU, and recompiling the largest part of a distribution from scratch is beyond the scope of this project.

6.2 Fortran

The GNU ARM tool-chain provides C and C++ compilers but not a Fortran compiler. That is a limitation in itself, as it means Fortran code cannot be compiled and run widely on the ARM architecture. That can be a restricting factor for many scientists and HPC system designers at this moment, as a great number of HPC and scientific applications are written in Fortran. Specific Linux distributions, such as Debian GNU/Linux and Fedora, provide their own compiled GCC packages, including Fortran support. On a non-supported system, porting Fortran code to C can be time-consuming. A way to do this is to use Netlib's f2c [22] library, which can port Fortran code to C automatically. Even when the whole code is ported to C successfully, additional work may be needed to link the MPI or OpenMP calls correctly within the C version. What is more, the f2c tool supports only Fortran 77 codes. As part of this project, we created a basic script to automate the process of converting and compiling the original Fortran 77 code to C. Other proprietary and open-source compilers, such as G95, PathScale and PGI, do not yet provide Fortran, or other, compilers for the ARM architecture.

6.3 C/C++

The C/C++ support of the ARM architecture is entirely acceptable and at the same level as on the other architectures. However, we used the GNU C/C++ compiler and did not investigate any proprietary compilers. Compiler suites that are common in HPC, such as PathScale and PGI, do not support the ARM architecture. Both MPI and OpenMP are supported on all the architectures we used, without any need for additional software libraries or porting of the existing codes.

6.4 Java

The Java runtime environment is supported on the ARM architecture by the official Oracle Java for embedded systems, targeting ARMv5 (soft float), ARMv6 (hard float) and ARMv7 (hard float). It lacks, though, the Java compiler; that would require developing and compiling the application on a system of another architecture that provides the Java compiler and then executing the resulting binary on the ARM system.

6.5 Hardware decisions

In order to evaluate the systems, the design of the cluster reflects that of a hybrid cluster, interconnecting systems of different architectures. Our cluster consists of the following machines.

Processor                      Memory              Storage   NIC     Status
Intel Xeon 8-core (E5507)      16GB DDR3-1.3GHz    SATA      1GigE   Front-end / Gateway
Intel Xeon 8-core (E5507)      16GB DDR3-1.3GHz    SATA      1GigE   Compute-node 1
Intel Xeon 8-core (E5507)      16GB DDR3-1.3GHz    SATA      1GigE   Compute-node 2
Intel Xeon 8-core (E5507)      16GB DDR3-1.3GHz    SATA      1GigE   Compute-node 3
Intel Atom 2-core (D525)       4GB DDR2-800MHz     SATA      1GigE   Compute-node 4
Intel Atom 2-core (D525)       4GB DDR2-800MHz     SATA      1GigE   Compute-node 5
ARM (Marvell 88F6281) 1-core   512MB DDR2-800MHz   NAND      1GigE   Compute-node 6
ARM (Marvell 88F6281) 1-core   512MB DDR2-800MHz   NAND      1GigE   Compute-node 7

Table 6.1: Cluster nodes hardware specifications

The cluster provides access to 34 cores, 57GB of RAM and 3.6TB of storage. All of the systems, both the gateway and the compute-nodes, are connected to a single network. The gateway has a public and a private IP and each compute-node a private IP. That enables all the nodes to communicate with each other, while the gateway allows them to access the external public network and the Internet if needed.

6.6 Software decisions

The software details of each system are outlined in the table that follows.

System      OS         C/C++/Fortran   MPI            OpenMP      Java
Front-end   SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node1       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node2       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node3       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node4       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node5       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node6       Fedora 8   GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6 (embedded)
Node7       Fedora 8   GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6 (embedded)

Table 6.2: Cluster nodes software specifications

The x86 based systems run Scientific Linux 5.5 x86_64 with the latest supported GNU Compiler Collection, which provides C, C++ and Fortran compilers. We installed the latest MPICH2 version to enable programming with the message-passing model. Regarding OpenMP support, GCC provides shared-variable programming using OpenMP directives by specifying the correct flag at compile time. For Java, we deployed Oracle's SDK, which provides both the compiler and the runtime environment.

Regarding the ARM systems, there are some differences. The operating system installed is Fedora 8, which belongs to the same family as Scientific Linux, being a Red Hat related project, but this specific version is older. Deploying a more recent operating system is possible, but due to project time limitations we used the pre-installed operating system together with its compilers and libraries. However, the GCC version is the same across all systems. MPI and OpenMP are supported by MPICH2 and GCC respectively. In relation to Java, Oracle provides an official version of the Java Runtime Environment for embedded devices, and that is the version we run on the ARM architecture. It lacks, though, the Java compiler, allowing only the execution of pre-compiled Java applications.

The batch system used to connect the front-end and the nodes is Torque, which is based on OpenPBS10. Torque provides both the server and client sides of the batch system as well as its own scheduler, which is, though, not very flexible. We did not face any issues installing and configuring the batch system across the different architectures and systems.

10 PBS Works - Enabling On-Demand Computing - http://www.pbsworks.com/

Figure 6.1: The seven-node cluster that was built as part of this project.

6.7 Networking

In terms of network connectivity, the front-end acts as a gateway to the public network and the Internet; therefore it has a public IP which can be used to access it remotely as a login-node. As the front-end needs to communicate with the nodes as well, it uses a second interface with a private IP within the network 192.168.1.0/24. Each of the compute-nodes uses a private IP on a single NAT (Network Address Translation) interface. That allows each node to communicate with every other node in the cluster as well as the front-end, which is used as a gateway when communication with the public network is needed.

Hostname   IP               Status
lhpc0      129.215.175.13   Gateway
lhpc0      192.168.1.1      Front-end
lhpc1      192.168.1.2      compute-node
lhpc2      192.168.1.3      compute-node
lhpc3      192.168.1.4      compute-node
lhpc4      192.168.1.5      compute-node
lhpc5      192.168.1.6      compute-node
lhpc6      192.168.1.7      compute-node
lhpc7      192.168.1.8      compute-node

Table 6.3: Network configuration

The physical connectivity between the systems is illustrated by the figure below.

Figure 6.2: Cluster connectivity

6.8 Porting

The main reason for porting an application is the incompatibility between the architecture the application was initially developed for and the target architecture. As already mentioned in this report, the ARM architecture does not widely support a Fortran compiler. As a result, either specific Linux distributions must be used or Fortran code must be ported to C or C++ in order to run it successfully on ARM. It is not part of this project to investigate the extent to which this can be done, either for the benchmarks used or for any other HPC or scientific application. The Intel Atom processor, being of x86 architecture, supports all the widely used HPC and scientific tools and codes. That means no porting needs to be done for any benchmark or code one wishes to run on such a platform. Thus, Atom systems can be used to build low-power clusters for HPC with Fortran support. Hybrid clusters (i.e. consisting of Atom and other low-power machines) can be deployed as well. That would require the appropriate configuration of the batch system into different queues, reflecting the configuration of each group of systems. For instance, there could be a Fortran-supported queue and a generic queue for C/C++. Queues that group together systems of the same architecture can be created as well, in the same way as is already done with GPU queues and standard CPU queues on clusters and supercomputers.

6.8.1 Fortran to C

While investigating the issue of Fortran support on ARM, I came across a possible workaround for platforms that do not support Fortran: the f2c tool (i.e. Fortran-to-C) from the Netlib repository, which can convert Fortran code to C. There are two main issues with this tool. Firstly, f2c is developed to convert only Fortran 77 code to C. Secondly, and more relevant to HPC and scientific codes, calls to the MPI and OpenMP libraries might not be converted successfully, so the converted C code may fail to compile even when linked correctly with the MPI and OpenMP C libraries. The f2c tool was used, for instance, to port the LAPACK library to C, and it has also influenced the development of the GNU g77 compiler, which uses a modified version of the f2c runtime libraries. We think that with more effort and a closer study of f2c, it could be used to convert HPC codes directly from Fortran 77 to C.

6.8.2 Binary incompatibility

Another issue with hybrid systems made of different architectures is the binary incompatibility of compiled code. A code compiled on an x86 system will not, in most cases, be able to execute on the ARM architecture, and vice versa, unless it is a very basic one without system calls that relate to the underlying system architecture. This is a barrier for the design and deployment of hybrid clusters like the one we built for this project.

This architecture incompatibility requires the existence and availability of login-nodes for each architecture in order for users to be able to compile their applications for the target platform. In addition, each architecture should provide its own batch system. However, in order to eliminate the need for additional systems and the added complexity of additional schedulers and queues, a single login-node can be used with specific scripts to enable code compilation for the different architectures. This single front-end, as well as the scheduler (be it the same machine or another), can have different queues for each architecture, allowing users to submit jobs to the desired platform each time without conflicts and faulty runs due to binary incompatibility.

6.8.3 Scripts developed

To ease and automate the deployment and management of the cluster, as well as the power readings, we developed a few shell scripts as part of this project. The source of the scripts can be found in Appendix D.

• add_node.sh: Adds a new node to the batch system. It copies all the necessary files to the targeted system, starts the required services, mounts the filesystems and attaches the node to the batch pool. Usage: ./add_node.sh [architecture]
• status.sh: Reports on the status of each node, i.e. whether the batch services are running or not. Usage: ./status.sh
• armrun.sh: Can be used to execute any command remotely on the ARM systems from the x86 login-node. In particular, it can be used to compile ARM-targeted code from the x86 login-node without requiring a login to an ARM system. Usage: ./armrun
• watt_log.sh: Captures power usage on PowerEdge servers with IPMI sensor support. It logs the readings in a defined file, from which the average can be calculated as well. Usage: ./watt_log.sh

Chapter 7

Results and analysis

In this section we present and analyse the results gathered during the experimentation process on the hybrid cluster we built during this project. We start by discussing the quoted Thermal Design Power and the idle power consumption of each system, and then go into more detail for each benchmark individually.

7.1 Thermal Design Power

Each processor vendor defines a maximum Thermal Design Power (TDP). This is the maximum power the processor's cooling system is required to dissipate and therefore the maximum power a processor is expected to use. It is expressed in Watts. Below we present the values as given by the vendors of each of the processors we used.

Processor               GHz    TDP                  Per core
Intel Xeon, 4-core      2.27   80 Watt              20 Watt
Intel Atom, 2-core      1.80   13 Watt              6.5 Watt
ARM (Marvell 88F6281)   1.2    0.87 Watt (870 mW)   0.87 Watt

Table 7.1: Maximum TDP per processor.

The Intel Xeon system uses two quad-core processors, each with a TDP of 80 Watts, giving a maximum total of 160 Watts per system. These first values already give us a clear idea of the power consumption of each system. Dividing the TDP of the processor by the number of cores, we get 20 Watts for each Intel Xeon core, 6.5 Watts for each Intel Atom core and just 870 mW for the ARM (Marvell 88F6281). From this we can clearly see the difference between commodity server processors, low-power server processors and purely embedded processors. The cooling mechanism within each system is scaled down or up according to the scope of the system and the design of the processor.

42 7.2 Idle readings

In order to identify the power consumption of a system when it is idle (i.e. not processing), we gathered the power consumption rate without running any special software or any of the benchmarks. We measured each system for 48 hours, giving a concrete indication of how much power each system consumes in idle mode. The results are listed below.

Processor               Watt
Intel Xeon, 8-core      118 Watt
Intel Atom, 2-core      44 Watt
ARM (Marvell 88F6281)   8 Watt

Table 7.2: Average system power consumption on idle.


Figure 7.1: Power readings over time.

In figure 7.1 we can see that each system tends to use relatively more power when it boots, then stabilises and keeps a constant power consumption rate over time when not executing any special software. These results therefore reflect the power consumption of each system while running its respective operating system after a fresh installation, with the only additional service running being the batch system we installed. We can also observe that the Intel Xeon system tends to increase its power usage slightly, by 1 Watt (from 118 to 119), approximately every 20 seconds, most probably due to a specific operating system service or procedure. The results confirm the TDP values given by each manufacturer, as the systems with the lowest TDP values are also those that consume the least power when idle.

7.3 Benchmark results

In this section we present and discuss the results of each benchmark individually, across the various architectures and platforms on which they were executed.

7.3.1 Serial performance: CoreMark

Table 7.3 shows the results of the CoreMark benchmark. As the power consumption of the CPU drops, its efficiency increases: the Intel Xeon system performs 55.76 iterations per Watt consumed, Intel Atom 65.9 iterations per Watt and ARM 206.63 iterations per Watt. In terms of power efficiency, ARM is ahead of the other two candidates. The trade-off comes in the total execution time, as ARM, being single-core, takes 3.5 and 1.5 times longer to complete the iterations than Intel Xeon and Intel Atom respectively. Intel Atom, while it consumes less than half the power of Intel Xeon, achieves a performance-per-Watt (PPW) that does not differ greatly from that of Intel Xeon, while taking 2.3 times longer to complete the operations.

Processor               Iterations/s   Total time (s)   Usage      PPW (Iter/s per Watt)
Intel Xeon              6636.40        150.68           119 Watt   55.76
Intel Atom              2969.70        336.73           45 Watt    65.9
ARM (Marvell 88F6281)   1859.67        537.72           9 Watt     206.63

Table 7.3: CoreMark results with 1 million iterations.

Calculating the total power consumption for performing the same number of iterations, ARM (Marvell 88F6281) proves to be the most power efficient, with Intel Atom following and Intel Xeon consuming the maximum amount of power. The systems consume in total 17,930, 16,152 and 4,839 watt-seconds for Intel Xeon, Intel Atom and ARM respectively.
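These totals are simply the average power draw multiplied by the execution time, i.e. energy in watt-seconds; for the Intel Xeon system, for example:

\[
E_{\mathrm{Xeon}} = 119\ \mathrm{W} \times 150.68\ \mathrm{s} \approx 17{,}931\ \mathrm{Ws}
\]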

44 150.0 7000

112.5 5250

Iterations per second

75.0 3500 Watts per second per Watts

37.5 1750

0 0 Intel Xeon Intel Atom ARM WattsWatts Iterations

Figure 7.2: CoreMark results for 1 million iterations.

The same differences in performance, and in power consumption, are observed with both smaller and larger numbers of iterations, as presented in figures 7.3 and 7.4 respectively. We also observe that the number of iterations per second remains approximately the same regardless of the total number of iterations; the total execution time increases proportionally as the total number of iterations increases. The differences between the various systems in execution time stay near the same values, with power consumption also staying at the same levels. The results that bring the ARM system ahead of the other two candidates, in terms of performance per Watt, can be explained by the simplicity of the CoreMark benchmark, which targets integer operations.

45 150.0 7000

112.5 5250

Iterations per second

75.0 3500 Watts per second per Watts

37.5 1750

0 0 Intel Xeon Intel Atom ARM WattsWatts Iterations

Figure 7.3: CoreMark results for 1 thousand iterations.


Figure 7.4: CoreMark results for 2 million iterations.

The results presented so far show the performance of a single core per system. The Intel Xeon system, though, has 8 cores and the Intel Atom 4 logical cores (2 cores with Hyper-Threading on each core). The results with all threads enabled on each system are illustrated by figure 7.5.


Figure 7.5: CoreMark results for 1 million iterations utilising 1 thread per core.

We can observe that the performance increases almost proportionally for Intel Xeon and Intel Atom, achieving in total 51516.21 and 9076.67 iterations per second, giving 432.90 and 201.7 iterations per Watt respectively. With these results, the ARM processor is ahead of Intel Atom by 4.93 iterations per Watt, and 261.65 iterations per Watt behind Intel Xeon, which has a significantly higher clock speed, 2.27GHz versus 1.2GHz. With these considerations in mind, as well as the fact that ARM does not support 64-bit registers, we could argue that there is plenty of room for development and progress for the ARM microprocessor, as we can also see from its current developments with multi-core support and NEON acceleration.

CoreMark is not based on, and does not represent, any real application, but it allows us to draw some conclusions specifically about the performance of a single core and the CPU itself. The presented results show clearly that the CPU with the highest clock speed, and architectural complexity, achieves the highest performance, being able to perform a larger number of iterations per second in a shorter total execution time. In our experiments Intel Xeon, which achieves the best performance, also uses the highest amount of power, both instantaneously and over the whole execution, to perform the total number of iterations. Based on the figures and results presented earlier in this section, ARM is the most efficient processor on a performance-per-Watt basis, handling integer operations very efficiently.

Looking solely at iterations per second, figures 7.6 and 7.7 show how each system performs, in terms of iterations and speedup, for the serial version as well as for 2, 4, 6 and 8 threads. Intel Xeon scales well, while the speedup of Intel Atom and ARM levels off, as Amdahl's law would suggest, once we use more threads than the physical number of cores. Thus, the ARM system, being a single-core machine, performs any task, serial or multi-threaded, serially.
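Speedup here is the usual ratio of single-thread to p-thread execution time, which Amdahl's law bounds for a code whose parallelisable fraction is f:

\[
S(p) = \frac{T(1)}{T(p)} \le \frac{1}{(1-f) + f/p}
\]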


Figure 7.6: CoreMark performance for 1, 2, 4, 6 and 8 cores per system.

48 8

6

4 SPEEDUP

2

0 1 2 4 6 8

Intel Xeon Intel Atom ARM

Figure 7.7: CoreMark performance speedup per system.

The same rule applies to Intel Xeon. Figure 7.8 shows that the Intel Xeon system hits a performance wall once more threads are allocated than the actual number of cores on the system.


Figure 7.8: CoreMark performance on Intel Xeon.

In figure 7.9 we can see the power changes over time while benchmarking each system with CoreMark. As can be clearly seen, the power usage throughout the execution of the benchmark on each system is stable. The low-power systems do not raise their power consumption as much as the high-power Intel Xeon system. An explanation for this can be the fact that, in order to keep the load balanced between the processors, the system utilises more than a single core even when executing a single thread, thus requiring more power. The Intel Xeon system increases its power usage by 5.88%, Intel Atom by 0.8% and ARM by 12.5%.


Figure 7.9: Power consumption over time while executing CoreMark.

7.3.2 Parallel performance: HPL

For HPL, we used four different approaches to identify a suitable problem size for each system. The first is the rule of thumb suggested by the HPL developers, giving a problem size that uses nearly 80% of the total system memory11. The second uses an automated script provided by Advanced Clustering Technologies, Inc. that calculates the ideal problem size based on the information given for the target system12. The third uses the ideal problem size of the smallest machine for all of the systems. The fourth uses a very small problem size to identify differences in performance depending on problem size, as problem sizes that do not fit in the physical memory of the system need to make use of swap memory, with a drop-off in performance. All the problem sizes are presented in table 7.4.

11 http://www.netlib.org/benchmark/hpl/faqs.html#pbsize
12 http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html

Processor                                         Problem size   Block size   Method
Intel Xeon                                        13107          128          HPL
Intel Atom                                        3276           128          HPL
ARM (Marvell 88F6281)                             409            128          HPL
Intel Xeon                                        41344          128          ACT
Intel Atom                                        20608          128          ACT
ARM (Marvell 88F6281)                             7296           128          ACT
Intel Xeon / Intel Atom / ARM (Marvell 88F6281)   7296           128          Equal size
Intel Xeon / Intel Atom / ARM (Marvell 88F6281)   500            32           Small size

Table 7.4: HPL problem sizes.
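For reference, a minimal sketch of the 80%-of-memory rule of thumb (an N x N matrix of doubles at 8 bytes each should occupy roughly 80% of physical memory) is shown below. It is only an approximation of the procedures actually used for Table 7.4, with the memory sizes taken from Table 6.1.

    #include <stdio.h>
    #include <math.h>

    /* Rough HPL problem-size estimate: an N x N matrix of doubles (8 bytes
     * each) should fill about 80% of physical memory, so N ~ sqrt(0.8*M/8). */
    static long hpl_problem_size(double mem_bytes)
    {
        return (long)sqrt(0.8 * mem_bytes / 8.0);
    }

    int main(void)
    {
        printf("Xeon, 16GB : N ~ %ld\n", hpl_problem_size(16.0 * 1024 * 1024 * 1024));
        printf("Atom,  4GB : N ~ %ld\n", hpl_problem_size(4.0 * 1024 * 1024 * 1024));
        printf("ARM, 512MB : N ~ %ld\n", hpl_problem_size(512.0 * 1024 * 1024));
        return 0;
    }

Compiled with -lm, the estimates it prints land in the same region as the ACT-derived sizes in Table 7.4.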

Table 7.5 presents the results of the HPL benchmark for all the different problem sizes we used. Figures 7.10 and 7.11 present the results of the benchmark for the problem sizes defined by the ACT script and the HPL rule of thumb respectively.

Processor               GFLOPs   Usage      PPW (MFLOPs/Watt)   Problem size
Intel Xeon              1.22     197 Watt   6.1                 13107
Intel Atom              4.28     55 Watt    77.81               3276
ARM (Marvell 88F6281)   1.11     9 Watt     123.3               409
Intel Xeon              1.21     197 Watt   6.1                 41344
Intel Atom              3.48     55 Watt    63.27               20608
ARM (Marvell 88F6281)   1.10     9 Watt     122                 7296
Intel Xeon              1.21     197 Watt   6.14                7296
Intel Atom              4.15     55 Watt    75.45               7296
ARM (Marvell 88F6281)   1.10     9 Watt     122.2               7296
Intel Xeon              7.18     197 Watt   36.44               500
Intel Atom              5.46     55 Watt    99.27               500
ARM (Marvell 88F6281)   1.13     9 Watt     125.5               500

Table 7.5: HPL results for each problem size.

51 200 5.00

150 3.75 GFLOpssecond per

100 2.50 Watts per second per Watts

50 1.25

0 0 Intel Xeon Intel Atom ARM Watts GFLOPs

Figure 7.10: HPL results for large problem size, calculated with ACT’s script.


Figure 7.11: HPL results for problem size 80% of the system memory.

The GFLOPs rate as well as the power consumption remains at the same level for both the Intel Xeon and ARM systems, while Intel Atom improves its overall performance by 800 MFLOPs and 14.54 MFLOPs per Watt when using a problem size equal to 80% of the memory. We also experimented with a smaller problem size, N = 500, which allows the systems to achieve higher performance. These results are illustrated in figure 7.12.


Figure 7.12: HPL results for N=500.

This experiment shows us that Intel Xeon is capable of achieving relatively high performance for small problem sizes; Intel Atom increases its performance by approximately 2 GFLOPs and ARM (Marvell 88F6281) by 20 MFLOPs, staying within the same levels of performance as for large problem sizes. Despite the increase in performance for both Intel Xeon and Intel Atom, ARM achieves the best performance-per-Watt, with 152.2 MFLOPs per Watt against 96.5 MFLOPs per Watt for Intel Atom and 37.51 MFLOPs per Watt for Intel Xeon. These results should not surprise us: in the Green500 list, the first entry belongs to the BlueGene/Q Prototype 2, which is ranked as the 110th fastest supercomputer in the TOP500 list, meaning that the fastest supercomputer is not necessarily the most power efficient, and vice versa.

For the reported performance, we must not underestimate the fact that the installed operating systems (including tools, compilers and libraries) on the ARM machines, as well as the processor design and implementation by Marvell that we used, do not support a hardware FPU and use soft float, i.e. an FPU at the software level. That prevents the systems from using NEON SIMD acceleration, from which the executed benchmarks could benefit. As there is increased interest from both the desktop/laptop and the HPC communities in exploiting low-power chips, we expect that hardware FPU support will be available in the near future, enabling applications to achieve higher performance and take full advantage of the underlying hardware. It is reported that NEON SIMD acceleration can increase HPL performance by 60% [28]. ARM states that NEON technology can accelerate multimedia and signal processing algorithms by at least 3x on ARMv7 [29] [30].

While the performance in GFLOPs is of course important and interesting, we must not leave aside the total execution time and the total power consumption a system needs in order to solve a problem of a given size. The CoreMark results showed that the ARM system achieves the best performance in terms of performance-per-Watt as well as overall power consumption for integer operations. The HPL results differ from this and are less clear, depending on the problem size. In figure 7.13 we can see the total power used by each system when solving a problem whose size is equal to 80% of the total main system memory. This experiment clearly shows that the larger the memory, the larger the problem required, which then leads to more power usage. We see in this experiment that Intel Xeon uses 236,250 watt-seconds and takes 1199.24 seconds to complete, Intel Atom uses 3004 watt-seconds and takes 54.62 seconds, while ARM uses 36.63 watt-seconds in total and takes 4.07 seconds, for N equal to 13107, 3276 and 409 for Intel Xeon, Intel Atom and ARM respectively.


Figure 7.13: HPL total power consumption for N equal to 80% of memory.

The great difference in problem sizes does not allow us to draw specific conclusions, either for the achieved performance in GFLOPs or for the power usage. Figure 7.14 presents the total power consumption for a problem size calculated with the ACT script, that is for N equal to 41344 for Intel Xeon, 20608 for Intel Atom and 7296 for ARM (Marvell 88F6281). That results in 7,625,270 watt-seconds over 38,707.10 seconds for Intel Xeon, 919,756.75 watt-seconds over 16,722.85 seconds for Intel Atom and 211,988 watt-seconds over 23,554.23 seconds for ARM. We can see here that as the problem size increases for each node, both the total power usage and the execution time increase, as expected. In this experiment we see that while Intel Atom is able to solve the linear problem faster, the ARM system is still ahead when comparing performance-per-Watt.


Figure 7.14: HPL total power consumption for N calculated with ACT’s script.

In order to quantify the total power consumption in a better way, we performed another experiment with a problem size N equal to 7296 on all systems. Figure 7.15 presents the power consumption of each system. We can see that this problem size is solved relatively quickly on Intel Xeon and Intel Atom, taking 41,934 watt-seconds and 213.95 seconds for Intel Xeon and 34,266.1 watt-seconds and 623.02 seconds for Intel Atom. The ARM system uses in total 211,988 watt-seconds and takes 23,554.23 seconds to solve the problem. That brings it to the bottom of power efficiency for this given problem, due to the lack of a floating-point unit at the hardware level.

In order to quantify the differences in performance, we draw the same graph for the problem size N equal to 500. The problem size is rather too small to draw concrete conclusions about the performance of the Intel Xeon and Intel Atom systems, as they both solve the problem within a second, using 197 and 52 watt-seconds respectively, while the ARM system takes 7.37 seconds, consuming 66.33 watt-seconds in total, 130.67 watt-seconds less than Intel Xeon and 11.33 watt-seconds more than Intel Atom. The results are illustrated by figure 7.16.

55 300000 30000

225000 22500

Execution time (sec)

150000 15000

75000 7500 Total power consumption (Watt) consumption power Total

0 0 Intel Xeon Intel Atom ARM Watts Execution time

Figure 7.15: HPL total power consumption for N=7296.


Figure 7.16: HPL total power consumption for N=500.

All the results clearly show that the ARM system lags in floating-point performance, although it is competitive in terms of performance-per-Watt for small floating-point problem sizes. As we have mentioned, the ARM system we used, the OpenRD Client with the Marvell Sheeva 88F6281, does not implement the FPU in hardware nor provide NEON acceleration for SIMD processing. The underlying compilers and libraries perform the floating-point operations at the software level, and that is a performance drawback for the system. Intel Atom is very competitive when compared to the high-power Intel Xeon, as it can achieve reasonably high performance with relatively low power consumption.

The graph in figure 7.17 shows the power consumption over time for each system when executing the HPL benchmark. The low-power systems reach the peak of their power consumption, and keep a stable rate, only a few seconds after the benchmark starts executing. On the other hand, as with the CoreMark benchmark, the high-power Intel Xeon system takes approximately 10 to 15 seconds to reach its peak power consumption, which then stays stable during the execution of the HPL benchmark. This supports the Green500 power measurement tutorial's suggestion to start recording the actual power consumption 10 seconds after the benchmark has been initialised. The Intel Xeon system raises its power consumption by 56%, Intel Atom by 17.9% and ARM by 14.28%. It is important to note here that the build-up of power consumption for real applications, and different types of applications, might differ from that of the HPL benchmark, or any other benchmark.


Figure 7.17: Power consumption over time while executing HPL.

7.3.3 Memory performance: STREAM

Processor               Function   Rate (MB/s)   Avg. time (s)   Usage
Intel Xeon              Copy       3612.4793     0.0978          118 Watt
                        Scale      3642.3530     0.0968
                        Add        3960.9033     0.1334
                        Triad      4009.4806     0.1319
Intel Atom              Copy       2851.0365     0.1236          44 Watt
                        Scale      2282.0852     0.1543
                        Add        3033.9793     0.1742
                        Triad      2237.8844     0.2361
ARM (Marvell 88F6281)   Copy       777.8065      0.4029          8 Watt
                        Scale      190.8710      1.6398
                        Add        173.9241      2.6886
                        Triad      113.8851      4.0880

Table 7.6: STREAM results for 500MB array size.

As an overall observation, we see that the power consumption does not increase at all when performing intensive memory operations with the STREAM benchmark and a small array size. ARM proves to be more efficient in terms of performance-per-Watt, as it copies 97.2MB per Watt consumed, against 54.8MB and 64.79MB for Intel Atom and Intel Xeon respectively; that is, 1.7x and 1.5x more efficient in terms of performance for the power actually used. These results reflect the performance of the system when using the maximum memory that could be handled by the OpenRD Client system, 512MB of physical memory in total. They are presented in figure 7.18.

We executed an additional experiment on the Intel Atom and Intel Xeon boxes with a larger array size, 3GB, nearly the maximum size that can be handled by the Intel Atom system, which has 4096MB of physical memory available. This experiment showed a differentiation in the power consumption of each system, increasing the usage by 4 Watts on each system, to 122 and 48 Watts on Intel Xeon and Intel Atom respectively. The increase by the same amount of power on both systems may reflect the similarities they share, both being of x86 architecture. The performance results with the 3GB array size are presented in figure 7.19. We can see that the performance of both systems is kept at the same levels, with Intel Xeon slightly increasing its performance and power efficiency on the Copy and Scale functions, from 3612MB/s to 3627MB/s and from 3642MB/s to 3670MB/s respectively, compared with the smaller array size. The Add and Triad functions decrease slightly with the larger array size, from 3960MB/s to 3943MB/s and from 4009MB/s to 3991MB/s. These differences are so small that they fall within the area of statistical error and standard deviation.

The performance differences between the various memory subsystems can be explained by the bandwidth interface and the frequency of each system. The Intel Xeon system uses the highest-bandwidth interface and the highest data-rate frequency (DDR3 at 1333MHz) compared with the other two systems (DDR2 at 800MHz). Looking more closely at the low-power systems, both of them use equal-bandwidth interfaces and data-rate frequencies. The large bandwidth advantage of the Intel Atom system lies in the fact that its memory subsystem is made up of two chips of 2GB each, while the ARM system uses four chips of 128MB each. That makes the Intel Atom system capable of fitting the whole array (500MB) into a single chip, requiring less data movement. Nevertheless, the ARM system keeps a higher performance-per-Watt than the Intel Atom system.


Figure 7.18: STREAM results for 500MB array size.

59 130.0 4000

97.5 3000 Bandwidth per second per Bandwidth

65.0 2000 Watts per second per Watts 32.5 1000

0 0 Copy Scale Add Triad Copy Scale Add Triad Intel Xeon E5507 Intel Atom D525 Watts MBs

Figure 7.19: STREAM results for 3GB array size.

Figure 7.20 confirms that the power consumption remains equally stable, second by second, for each of the different array sizes we used to stress the memory subsystem of each system. With the larger 3GB array, the Intel Xeon system increases its power consumption by 3.38% and the Intel Atom by 9.1%.

60 150.0

112.5

75.0 POWER

37.5

0 TIME

Intel Xeon Intel Atom ARM

Figure 7.20: Power consumption over time while executing STREAM.

7.3.4 HDD and SSD power consumption

We mentioned earlier in this work that altering components within the targeted systems could affect performance, either increasing or decreasing it. The component that is easiest to test is the storage device. By default, the Intel Xeon and Intel Atom machines come with commodity SATA hard disk drives (with a SCSI interface for Intel Xeon). We replaced the hard disk drive on one of the Intel Atom machines with a SATA solid-state drive. An SSD does not involve spinning platters and thus avoids the power required to spin them.

During the experiments we performed, various power consumptions were observed and we could not identify a specific pattern, apart from the general observation that the SSD decreases the overall power consumption of the system. When idle, the system with the SSD uses 6 Watts less than the system with the HDD. On the CoreMark experiments, the SSD system consumes 3 Watts less. On STREAM, the SSD system consumes 4 Watts less, while when executing HPL the difference is 10 Watts, giving a total power draw of 58 Watts. These results are illustrated by figure 7.21.

61 60

45

30 Watts per second per Watts

15

0 Idle CoreMark STREAM HPL HDD SSD

Figure 7.21: Power consumption with 3.5" HDD and 2.5" SSD.

HDDs of smaller physical size, for instance 2.5" instead of the standard 3.5", may decrease the power consumption as well. As we did not have one of these disks available, we could not confirm this hypothesis. Previous research suggests that as the physical size of the disk decreases, its power consumption decreases as well, improving the power efficiency of the whole system [36] [37].

These differences in power not only reduce the costs, and improve the scalability in terms of power, of such systems, but also allow the deployment of extra nodes consuming the power saved by the more efficient components. For instance, the maximum difference between the HDD and SSD systems is 10 Watts, which is enough for an additional ARM system, which consumes at most 9 Watts. At larger scale, saving power on a single component can allow the deployment of additional compute-nodes drawing the power that would otherwise be wasted by a less power-efficient component on each system. While CoreMark, HPL and STREAM do not perform intensive I/O operations, they allow us to measure the standalone power consumption of the SSD and compare it against that of the HDD.

Chapter 8

Future work

Future work in this field could investigate a number of different possibilities, as outlined below:

• Real HPC/scientific applications: Real HPC and scientific applications could be executed on the existing cluster and their results could be used for analysis and comparison against the results presented in this dissertation.
• Modern ARM: The cluster could be extended by deploying more modern ARM systems, such as the Cortex-A9 and the upcoming Cortex-A15, which support hardware FPUs and multiple cores.
• Intensive I/O: Additional I/O-intensive benchmarks and applications could be executed to identify the power consumption of such applications, rather than of applications and codes that do not make heavy use of I/O operations.
• Detailed power measurements: More detailed power measurements could be performed by measuring each system component individually and quantifying how and where exactly power is used.
• CPUs vs. GPUs: Comparison between the performance and performance-per-Watt of low-power CPUs and GPUs.
• Parallelism: Extend the existing cluster by adding a significant number of low-power nodes to exploit more parallelism.

Chapter 9

Conclusions

This dissertation has achieved its goals: it has researched the current trends and technologies in low-power systems and techniques for High Performance Computing infrastructures and has reported the related work in the field. We have also designed and successfully built a hybrid seven-node cluster consisting of three different systems, Intel Xeon, Intel Atom and ARM (Marvell 88F6281), providing access to 34 cores, 57GB of RAM and 3.6TB of storage, and have described the issues faced and how they were solved. The cluster environment supports programming in both the message-passing and shared-variable models; MPI, OpenMP and Java threads are supported on all of the platforms. We have measured and analysed the performance, power consumption and power efficiency of each system in the cluster.

Judging by the market and the development of HPC systems, low-power processors will start being one of the default choices in the very near future. The energy demands of large systems will require a shift to processors and systems that consider energy by design. Consumer electronics devices are becoming more and more powerful, as they need to execute computationally intensive applications, and are still designed with energy efficiency in mind.

To qualify and quantify the computational performance and efficiency as well as the power efficiency of each system, we ran three main benchmarks (CoreMark, High Performance Linpack and STREAM) in order to quantify the performance of each system on a performance-per-Watt basis for integer operations, floating-point operations and memory bandwidth. On CoreMark, the serial integer benchmark, the ARM system achieves the best performance-per-Watt, with 206.63 iterations per Watt against 55.76 and 56.03 iterations per Watt for Intel Xeon and Intel Atom respectively on a single thread, and 432.90 and 171.25 when utilising every available thread. This allows us to conclude that the ARM processor is very competitive and can achieve a very high score on integer operations, performing better than the Intel Atom, which is a dual-core system with hyper-threading support, providing access to four logical cores. The ARM system does not provide a hardware FPU due to its ARMv5 architecture.

The ARM system therefore lacks performance on floating-point operations, as the HPL results show: it achieves at most 1.37 GFLOPs, compared with 7.39 GFLOPs for Intel Xeon and 6.08 GFLOPs for Intel Atom. In terms of power consumption, while ARM achieves the best performance-per-Watt, 152.2 MFLOPs per Watt versus 37.51 and 96.50 for Intel Xeon and Intel Atom respectively, it takes much longer to solve large problems. This introduces a high overhead in total energy consumption, with the consequence that it uses more energy in total than either Intel Xeon or Intel Atom.

In terms of memory performance, for small sizes the power consumption remains at minimal levels. Larger data sizes, above 2GB, increase the consumption on Intel Xeon and Intel Atom by 4 Watts. The ARM system is able to handle only small data sizes, up to 512MB. Intel Xeon achieves the highest memory bandwidth as it uses DDR3 at 1.3GHz, while Intel Atom and ARM use DDR2 at 800MHz. Intel Atom also scores higher than ARM, as it can hold the largest data set the ARM system can handle within a single memory chip, unlike ARM, which uses four individual memory chips.

Individual components affect system performance as well. We have observed that SSD storage can reduce the power consumption by 3 to 10 Watts compared to a standard 3.5" HDD at 7200rpm. Other components, such as a different memory subsystem, interconnect or power supply, could also affect system performance. Due to time as well as budget constraints, we did not experiment with different components in each of these subsystems.

In terms of porting and software support, all of the tested platforms support C, C++, Fortran and Java. ARM does not support the Java compiler but only the Java Runtime Environment. Intel Atom, being an x86-based architecture (despite its RISC-like internal design), is fully compatible with any x86 system currently in use. ARM does not provide the same binary compatibility with existing systems, due to architectural differences, requiring the targeted code to be recompiled. What is more, ARMv5 is not capable of performing floating-point operations at the hardware level and has to use soft float instead. The latest architecture, ARMv7, provides hardware FPU functionality as well as SIMD acceleration. To take advantage of the hardware FPU and SIMD acceleration, changes need to be made at the software level as well: Linux distributions, or the needed compilers and libraries with all their dependencies, need to be recompiled for the ARMv7 architecture in order to support hardware FPUs and improve the overall system performance.

The emerging interest of the HPC community in exploiting low-power architectures and new designs to support the design and development of Exascale systems efficiently and reliably, in combination with the market developments in consumer devices from desktops to mobile phones, ensures that the functionality and performance of low-power processors will increase to levels acceptable for HPC and scientific applications. The end of Moore's law introduces an extra need for the development of such systems.
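The performance-per-Watt figures quoted in this chapter are simple ratios of measured throughput to measured power draw. As a worked illustration, using the ARM values from Tables A.1 and B.3:

\[
\mathrm{PPW}_{\text{CoreMark}} = \frac{1859.67\ \text{iterations/s}}{9\ \mathrm{W}} \approx 206.63\ \text{iterations per Watt},
\qquad
\mathrm{PPW}_{\text{HPL}} = \frac{1370\ \text{MFLOPs}}{9\ \mathrm{W}} \approx 152.2\ \text{MFLOPs per Watt}.
\]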

Appendix A

CoreMark results

Processor     Iterations  Iterations/Sec  Total time (sec)  Threads  Consumption  PPW
Intel Xeon    100000      6617.25         15.11             1        119 Watt     55.60
Intel Atom    100000      2954.12         33.85             1        54 Watt      54.70
ARM           100000      1859.70         53.77             1        9 Watt       206.63
Intel Xeon    1000000     6636.40         150.68            1        119 Watt     55.76
Intel Atom    1000000     2969.70         336.73            1        53 Watt      56.03
ARM           1000000     1859.67         537.72            1        9 Watt       206.63
Intel Xeon    2000000     6610.49         302.54            1        126 Watt     54.46
Intel Atom    2000000     2953.23         677.22            1        54 Watt      54.68
ARM           2000000     1861.36         1074.48           1        9 Watt       206.81
Intel Xeon    1000000     51516.21        155.89            8        119 Watt     432.90
Intel Atom    1000000     9076.67         440.69            4        53 Watt      171.25
ARM           1000000     1859.67         537.72            1        9 Watt       206.63

Table A.1: CoreMark results for various iterations.

Appendix B

HPL results

Processor                Problem size  Block size  Method
Intel Xeon               13107         128         HPL
Intel Atom               3276          128         HPL
ARM (Marvell 88F6281)    409           128         HPL
Intel Xeon               41344         128         ACT
Intel Atom               20608         128         ACT
ARM (Marvell 88F6281)    7296          128         ACT
Intel Xeon               7296          128         Equal size
Intel Atom               7296          128         Equal size
ARM (Marvell 88F6281)    7296          128         Equal size
Intel Xeon               500           32          Small size
Intel Atom               500           32          Small size
ARM (Marvell 88F6281)    500           32          Small size

Table B.1: HPL problem sizes.

Processor                GFLOPs  Usage     PPW            Problem size
Intel Xeon               1.22    197 Watt  6.1 MFLOPs     13107
Intel Atom               4.28    55 Watt   77.81 MFLOPs   3276
ARM (Marvell 88F6281)    1.11    9 Watt    123.3 MFLOPs   409
Intel Xeon               1.21    197 Watt  6.1 MFLOPs     41344
Intel Atom               3.48    55 Watt   63.27 MFLOPs   20608
ARM (Marvell 88F6281)    1.10    9 Watt    122 MFLOPs     7296
Intel Xeon               1.21    197 Watt  6.14 MFLOPs    7296
Intel Atom               4.15    55 Watt   75.45 MFLOPs   7296
ARM (Marvell 88F6281)    1.10    9 Watt    122.2 MFLOPs   7296
Intel Xeon               7.18    197 Watt  36.44 MFLOPs   500
Intel Atom               5.46    55 Watt   99.27 MFLOPs   500
ARM (Marvell 88F6281)    1.13    9 Watt    125.5 MFLOPs   500

Table B.2: HPL results for the problem sizes listed in Table B.1.

Processor     GFLOPs  Usage     PPW
Intel Xeon    7.39    197 Watt  37.51 MFLOPs
Intel Atom    6.08    63 Watt   96.50 MFLOPs
ARM           1.37    9 Watt    152.2 MFLOPs

Table B.3: HPL results for N=500.

Appendix C

STREAM results

Processor                Size   Function  Rate (MB/s)  Avg. time  Usage
Intel Xeon               500MB  Copy      3612.4793    0.0978     118 Watt
                                Scale     3642.3530    0.0968
                                Add       3960.9033    0.1334
                                Triad     4009.4806    0.1319
Intel Atom               500MB  Copy      2851.0365    0.1236     52 Watt
                                Scale     2282.0852    0.1543
                                Add       3033.9793    0.1742
                                Triad     2237.8844    0.2361
ARM (Marvell 88F6281)    500MB  Copy      777.8065     0.4029     8 Watt
                                Scale     190.8710     1.6398
                                Add       173.9241     2.6886
                                Triad     113.8851     4.0880
Intel Xeon               3GB    Copy      3627.5380    0.5886     122 Watt
                                Scale     3670.4334    0.5816
                                Add       3943.3052    0.8120
                                Triad     3991.4984    0.8022
Intel Atom               3GB    Copy      2875.2246    0.7422     56 Watt
                                Scale     2275.0291    0.9379
                                Add       3035.8659    1.0544
                                Triad     2269.4263    1.4103

Table C.1: STREAM results for array sizes of 500MB and 3GB.

Appendix D

Shell Scripts

D.1 add_node.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

NODE=$1
ARCH=$2

# Set up passwordless SSH access to the new node
ssh root@${NODE} 'mkdir /root/.ssh; chmod 700 /root/.ssh'
scp /root/.ssh/id_dsa.pub root@${NODE}:.ssh/authorized_keys

# ARM nodes use their own fstab
if [ "${ARCH}" == "ARM" ] || [ "${ARCH}" == "arm" ]; then
  scp fstab.arm root@${NODE}:/etc/fstab
else
  scp fstab root@${NODE}:/etc/fstab
fi

# Distribute common configuration and the Torque MOM files
scp hosts root@${NODE}:/etc/hosts
scp profile root@${NODE}:/etc/profile
scp mom_priv.config root@${NODE}:/var/spool/torque/mom_priv/config
scp pbs_mom root@${NODE}:/etc/init.d/.

# Mount the shared filesystems and start the Torque MOM
ssh root@${NODE} 'mount /home'
ssh root@${NODE} 'mkdir /usr/local/mpich2-1.3.2p1'
ssh root@${NODE} 'mount /usr/local/mpich2-1.3.2p1'
ssh root@${NODE} '/sbin/chkconfig --add pbs_mom \
  && /sbin/chkconfig --level 234 pbs_mom on'
ssh root@${NODE} '/sbin/service pbs_mom start'

qterm -t quick
pbs_server
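A typical invocation would be as follows (the node names are hypothetical; the second argument is only significant for ARM nodes, any other value falls through to the default x86 configuration):

  ./add_node.sh lhpc7 arm
  ./add_node.sh lhpc3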

D.2 status.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

for i in `cat nodes.txt`
do
  ssh root@$i 'hostname; service pbs_mom status'
done

D.3 armrun.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

ARMHOST=lhpc6
ARGUMENTS=$*
ssh ${ARMHOST} ${ARGUMENTS}
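The script simply forwards its arguments over SSH to the ARM node (lhpc6), so any command can be launched on it from the front-end, for example:

  ./armrun.sh hostname
  ./armrun.sh uname -m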

D.4 watt_log.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

option=$1
logfile=$2
code=$3

getAvg(){
  totalwatts=`cat ${logfile} | \
    awk '{total = total + $1}END{print total}'`

  elements=`cat ${logfile} | wc -l`
  avgwatts=`echo "${totalwatts} / ${elements}" | bc`

  printf "\n\n Average watts: ${avgwatts}\n\n"
}

if [ "${option}" == "average" ]; then
  getAvg
  exit 0
fi

if [ $# -lt 3 ] || [ $# -gt 3 ]; then
  echo " Specify logfile and code"
  exit 1
fi

if [ -e ${logfile} ]; then rm -f ${logfile}; fi

codeis=`ps aux | grep ${code} | grep -v grep | wc -l`
while [ ${codeis} -gt 0 ]; do

  sudo /usr/sbin/ipmi-sensors | grep -w "System Level" | \
    awk {'print $5'} | awk ' sub("\\.*0+$","") ' >> ${logfile}
  sleep 1
  codeis=`ps aux | grep ${code} | grep -v grep | wc -l`
done

getAvg
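A sketch of how the script is driven (the option keyword "log", the log-file name and the benchmark process name are illustrative; given three arguments, any first argument other than "average" starts the sampling loop):

  # sample the IPMI power reading once per second while xhpl is running
  ./watt_log.sh log hpl_watts.log xhpl
  # afterwards, print the average of the logged readings
  ./watt_log.sh average hpl_watts.log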

D.5 fortran2c.sh

#!/bin/bash

fortranFile=$1
fileName=`echo $1 | sed 's/\(.*\)\..*/\1/'`
f2c $fortranFile
gcc ${fileName}.c -o $fileName -lf2c
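For example (the source file name is hypothetical), converting and compiling a Fortran code in one step:

  ./fortran2c.sh stream.f
  # produces stream.c via f2c and an executable named stream linked against libf2c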

Appendix E

Benchmark outputs samples

E.1 CoreMark output sample

The output that follows is a sample output from an Intel Xeon system when executing CoreMark with 100000 iterations and a single thread.

2K performance run parameters for .
CoreMark Size     : 666
Total ticks       : 15112
Total time (secs) : 15.112000
Iterations/Sec    : 6617.257808
Iterations        : 100000
Compiler version  : GCC4.1.2 20080704 (Red Hat 4.1.2-50)
Compiler flags    : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location   : Please put data memory location here
                    (e.g. code in flash, data on heap etc)
seedcrc           : 0xe9f5
[0]crclist        : 0xe714
[0]crcmatrix      : 0x1fd7
[0]crcstate       : 0x8e3a
[0]crcfinal       : 0xd340
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 6617.257808 / GCC4.1.2 20080704 (Red Hat 4.1.2-50) -O2 -DPERFORMANCE_RUN=1 -lrt / Heap

E.2 HPL output sample

The output that follows is a sample output from an Intel Atom system when executing HPL with problem size N=407.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :     407
NB     :     128
PMAP   : Row-major process mapping
P      :       1
Q      :       1
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4        3274   128     2     2              54.62          4.287e-01
----------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0051440 ...... PASSED
============================================================================

Finished 1 tests with the following results:
    1 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

E.3 STREAM output sample

The output that follows is a sample output from an Intel Atom system when executing STREAM with array size 441.7MB.

------STREAM version $Revision: 5.9 $ ------This system uses 8 bytes per DOUBLE PRECISION word. ------Array size = 19300000, Offset = 0 Total memory required = 441.7 MB. Each test is run 10 times, but only the *best* time for each is used. ------Printing one line per active thread.... ------Your clock granularity/precision appears to be 2 microseconds. Each test below will take on the order of 1194623 microseconds. (= 597311 clock ticks) Increase the size of the arrays if this shows that you are not getting at least 20 clock ticks per test. ------WARNING -- The above is only a rough guideline. For best results, please be sure you know the precision of your system timer. ------Function Rate (MB/s) Avg time Min time Max time Copy: 777.8065 0.4029 0.3970 0.4318 Scale: 190.8710 1.6398 1.6178 1.6900 Add: 173.9241 2.6886 2.6632 2.7319 Triad: 113.8851 4.0880 4.0673 4.1260 ------Solution Validates ------

Appendix F

Project evaluation

F.1 Goals

The project has achieved the following goals, as set out in the project proposal and as presented within this dissertation:

• Report on low-power architectures targeted for HPC systems.
• Report on related work done in the field of low-power HPC.
• Report on the analysis and specification of requirements for the low-power HPC project.
• Report on the constraints of the available architectures on their use in HPC.
• Functional low-power seven-node cluster targeted for HPC applications.
• A specific set of benchmarks that can run across all chosen architectures.
• Final MSc dissertation.

The final project proposal can be found in Appendix G.

F.2 Work plan

The schedule presented in the Project Preparation report has been followed and we have met the deadlines it describes. Slight changes were made and the schedule had to be adjusted as the project progressed; the changes applied to the time scales of certain tasks.

F.3 Risks

During the project preparation, the following risks were identified:

• Risk 1: Unavailable architectures.
• Risk 2: Unavailable tool-chains.
• Risk 3: More time required to build cluster/port code than to run benchmarks.
• Risk 4: Unavailability of identical tools/underlying platform.
• Risk 5: Architectural differences.

In the end these risks did not affect the project, and any that occurred were mitigated as described in the project preparation report. However, we came across two further risks that had not been initially identified:

• Risk 6: Service outage and support by the University's related groups.
• Risk 7: Absence due to summer holidays.

The first affected the project for a week and slowed down the experimentation process, as the cluster could not be accessed remotely due to network issues that were later solved. During this outage the cluster had to be accessed physically to conduct experiments and gather results. The second did not cause any issues to the project itself, although without the absence more experiments and benchmarks could perhaps have been designed and executed.

F.4 Changes

The most important change was the decision over which benchmarks to run. We left aside the SPEC benchmarks as they would require long execution times, something that could not be afforded for the project. We also left out the NAS Parallel Benchmarks, as they proved somewhat complicated to execute in a similar manner on all three architectures and would also take rather long to finish execution and gather the needed results. For that reason we finally decided to proceed with the CoreMark benchmark to measure the serial performance of a core, High-Performance Linpack to measure the parallel performance of a system and STREAM to measure the memory bandwidth of a system. These three benchmarks have been a good choice as they are widely used and accepted in the HPC field, are configurable, are easy to run and can complete their execution in a relatively short time, enabling us to design a number of different experiments for qualifying and quantifying our results.

Appendix G

Final Project Proposal

G.1 Content

The main scope of the project is to investigate, measure and compare the performance of low-power CPUs versus standard commodity 32/64-bit x86 CPUs when executing selected High-Performance Computing applications. Performance factors to be investigated include the computational performance along with the power consumption and porting effort of standard HPC codes across to the low-power architectures. Using 32/64-bit x86 as the baseline, a number of different low-power CPUs will be investigated and compared, such as ARM, Intel Atom and PowerPC. The performance, in terms of cost and efficiency, of the various architectures will be measured by using well-known and established benchmarking suites. Due to the differences in the architectures and the available supported compilers, a set of appropriate benchmarks will need to be identified. Fortran compilers are not available on the ARM platform, therefore a number of C or C++ codes will need to be identified that represent either HPC applications or parts of HPC operations, in order to put the systems under stress.

G.2 The work to be undertaken

G.2.1 Deliverables

• Report on low-power architectures targeted for HPC systems.
• Report on related work done in the field of low-power HPC.
• Report on the analysis and specification of requirements for the low-power HPC project.
• Report on the constraints of the available architectures on their use in HPC, e.g., 32-bit only, toolchain availability, existing code ports.

• Functional low-power cluster, between 6 and 12 nodes, targeted for HPC applications.
• A specific set of codes that can run across all chosen architectures.
• Final MSc dissertation.
• Project presentation.

G.3 Tasks

• Survey of available and possible low-power architectures for HPC use.
• Survey of existing work done in the low-power HPC field.
• Deployment of a low-power HPC cluster.
• Identification of an appropriate set of benchmarks to run on all architectures, running of the experiments and analysis of the results.
• Writing of the dissertation reflecting the work undertaken and the outcomes of the project.

G.4 Additional information / Knowledge required

• Programming knowledge and skills are assumed, as the benchmark codes might require porting.
• Systems engineering knowledge to build up, configure and deploy the low-power cluster.
• Understanding of different methods/techniques of power measurement for computer systems.
• Presentation skills for writing a good dissertation and presenting the results of the project in front of an audience.

Bibliography

[1] P. M. Kogge et al., "Exascale Computing Study: Technology Challenges in Achieving Exascale Systems", DARPA Information Processing Techniques Office, Washington, DC, pp. 278, September 28, 2008.
[2] J. Dongarra et al., "International Exascale Software Project: Roadmap 1.1", http://www.Exascale.org/mediawiki/images/2/20/IESP-roadmap.pdf, February 2011.
[3] D. A. Patterson and D. R. Ditzel, "The Case for the Reduced Instruction Set Computer", ACM SIGARCH News, 8:6, 25-33, Oct. 1980.
[4] D. W. Clark and W. D. Strecker, "Comments on 'The Case for the Reduced Instruction Set Computer'", ibid, 34-38, Oct. 1980.
[5] Michio Kaku, "The Physics of the Future", 2010.
[6] S. Sharma, Chung-Hsing Hsu and Wu-chun Feng, "Making a Case for a Green500 List", 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Workshop on High-Performance, Power-Aware Computing (HP-PAC), April 2006.
[7] W. Feng, M. Warren, E. Weigle, "Honey, I Shrunk the Beowulf", In the Proceedings of the 2002 International Conference on Parallel Processing, August 2002.
[8] Wu-chun Feng, "The Importance of Being Low Power in High Performance Computing", CTWatch Quarterly, Volume 1, Number 3, Page 12, August 2005.
[9] NVIDIA Press, http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&releasejsp=release_157&xhtml=true&prid=705184, Accessed 13 May 2011.
[10] HPC Wire, http://www.hpcwire.com/hpcwire/2011-03-07/china_makes_its_own_supercomputing_cores.html, Accessed 13 May 2011.
[11] Katie Roberts-Hoffman, Pawankumar Hedge, "ARM (Marvell 88F6281) vs. Intel Atom: Architectural and Benchmark Comparisons", EE6304 Computer Architecture course project, University of Texas at Dallas, 2009.

[12] R. Ge, X. Feng, H. Pyla, K. Cameron, W. Feng, "Power Measurement Tutorial for the Green500 List", The Green500 List: Environmentally Responsible Supercomputing, June 27, 2007.
[13] J. J. Dongarra, "The LINPACK benchmark: an explanation", In the Proceedings of the 1st International Conference on Supercomputing, Springer-Verlag, New York, NY, USA, 1988.
[14] Piotr R. Luszczek et al., "The HPC Challenge (HPCC) benchmark suite", In the Proceedings of SC '06, the 2006 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2006.
[15] D. Weeratunga et al., "The NAS Parallel Benchmarks", NAS Technical Report RNR-94-007, NASA Ames Research Center, Moffett Field, CA, March 1994.
[16] Cathy May et al., "The PowerPC Architecture: A Specification for A New Family of RISC Processors", Morgan Kaufmann Publishers, 1994.
[17] Charles Johnson et al., "A Wire-Speed Power Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads", IEEE International Solid-State Circuits Conference, White paper, 2010.
[18] D. M. Tullsen, S. J. Eggers, H. M. Levy, "Simultaneous multithreading: maximizing on-chip parallelism", ISCA '95, pp. 392-403, June 22, 1995.
[19] Green Revolution Cooling, http://www.grcooling.com, Accessed 2 June 2011.
[20] http://www.google.com/corporate/datacenter/index.html, Accessed 2 June 2011.
[21] Sindre Kvalheim, "Lefdal Mine Project", META magazine, Number 2: 2011, pp. 14-15, Notur II Project, February 2011.
[22] Knut Molaug, "Green Mountain Data Centre AS", META magazine, Number 2: 2011, pp. 16-17, Notur II Project, February 2011.
[23] Bjørn Rønning, "Rjukan Mountain Hall - RMH", META magazine, Number 2: 2011, pp. 18-19, Notur II Project, February 2011.
[24] Jacko Koster, "A Nordic Supercomputer in Iceland", META magazine, Number 2: 2011, p. 13, Notur II Project, February 2011.
[25] Douglas Montgomery, "Design and Analysis of Experiments", John Wiley & Sons, sixth edition, 2004.
[26] CoreMark, an EEMBC Benchmark, http://www.coremark.org, Accessed 12 May 2011.
[27] "Genesi's Hard Float optimizations speed up Linux performance up to 300% on ARM Laptops", http://armdevices.net/2011/06/21/genesis-hard-float-optimizations-speeds-up-linux-performance-up-to-300-on-arm-laptops/, Accessed 21 June 2011.

[28] K. Furlinger, C. Klausecker, D. Kranzlmuller, "The AppleTV-Cluster: Towards Energy Efficient Parallel Computing on Consumer Electronic Devices", Whitepaper, Ludwig-Maximilians-Universitat, April 2011.
[29] NEON Technology, http://www.arm.com/products/processors/technologies/neon.php, Accessed 21 June 2011.
[30] ARM, "ARM NEON support in the ARM compiler", White Paper, September 2008.
[31] MIPS Technologies, "MIPS64 Architecture for Programmers Volume I: Introduction to the MIPS64", v3.02.
[32] MIPS Technologies, "MIPS64 Architecture for Programmers Volume I-B: Introduction to the microMIPS64", v3.02.
[33] MIPS Technologies, "MIPS64 Architecture for Programmers Volume II: The MIPS64 Instruction Set", v3.02.
[34] MIPS Technologies, "MIPS Architecture For Programmers Volume III: The MIPS64 and microMIPS64 Privileged Resource Architecture", v3.12.
[35] MIPS Technologies, "China's Institute of Computing Technology Licenses Industry-Standard MIPS Architectures", http://www.mips.com/news-events/newsroom/release-archive-2009/6_15_09.dot, Accessed 21 June 2011.
[36] Young-Jin Kim, Kwon-Taek Kwon, Jihong Kim, "Energy-efficient disk replacement and file placement techniques for mobile systems with hard disks", In the Proceedings of the 2007 ACM Symposium on Applied Computing, New York, NY, USA, 2007.
[37] Young-Jin Kim, Kwon-Taek Kwon, Jihong Kim, "Energy-efficient file placement techniques for heterogeneous mobile storage systems", In the Proceedings of EMSOFT '06, the 6th ACM & IEEE International Conference on Embedded Software, ACM, New York, NY, USA, 2006.
