Low-Power High Performance Computing
Michael Holliday

August 22, 2012

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2012

Abstract

There are immense challenges in building an exascale machine, the biggest of which is power. The designs of new HPC systems are likely to be radically different from those in use today. Making use of new architectures aimed at reducing power consumption, while still delivering high performance up to and beyond a speed of one exaflop, might bring about greener computing. This project makes use of systems already built on low-power processors, including the Intel Atom and ARM A9, and compares them against the Intel Westmere Xeon processor when scaled up to higher numbers of cores.

Contents

1 Introduction
  1.1 Report Organisation
2 Background
  2.1 Why Power is an Issue in HPC
  2.2 The Exascale Problem
  2.3 Average Use
  2.4 Defence Advanced Research Projects Agency Report
  2.5 ARM
  2.6 Measures of Energy Efficiency
3 Literature Review
  3.1 Top 500, Green 500 & Graph 500
  3.2 Low-Power High Performance Computing
    3.2.1 The Cluster
    3.2.2 Results and Conclusions
  3.3 SuperMUC
  3.4 ARM Servers
    3.4.1 Calxeda EnergyCore™ & EnergyCard™
    3.4.2 The Boston Viridis Project
    3.4.3 HP Project Moonshot
  3.5 The European Exascale Projects
    3.5.1 Mont Blanc - Barcelona Supercomputing Centre
    3.5.2 CRESTA - EPCC
4 Technology Review
  4.1 Intel Xeon
  4.2 GP-GPUs & Accelerators
  4.3 Intel Discovery & Intel Xeon Phi (MIC)
  4.4 IBM Blue Gene & PowerPC
  4.5 Intel Atom
  4.6 ARM Cortex A9 & A15
    4.6.1 Nvidia Tegra 3
  4.7 Networks
5 Software Review
  5.1 Operating Systems
  5.2 Libraries
    5.2.1 BLAS & CBLAS
    5.2.2 LAPACK
  5.3 Batch System and Scheduler
    5.3.1 Sun (Oracle) Grid Engine
    5.3.2 Torque & Maui
  5.4 Compilers
  5.5 Message Passing Interface
    5.5.1 MPICH2
    5.5.2 OpenMPI
6 Benchmarks
  6.1 HPCC
    6.1.1 High Performance Linpack
    6.1.2 Bandwidth and Latency
    6.1.3 Other Benchmarks in the HPCC Benchmark Suite
  6.2 LMBench
  6.3 Coremark
  6.4 Validation of Results
7 Project Preparation: Hardware Decisions and Cluster Building Week
  7.1 Cluster Building Week & Lessons Learned
  7.2 ARM: Board Comparison
    7.2.1 Cstick Cotton Candy
  7.3 Changes to the ARM Hardware Available
    7.3.1 Raspberry Pi
    7.3.2 Pandaboard ES
    7.3.3 Seco Qseven - Quadmo 747-X/T30
  7.4 Xeon: Edinburgh Compute and Data Facility
  7.5 Atom: Edinburgh Data Intensive Machine 1
  7.6 Power Measurement
    7.6.1 ARM
    7.6.2 ECDF
    7.6.3 Edim1
8 Hardware Setup, Benchmark Compilation and Cluster Implementation
  8.1 Edinburgh Compute and Data Facility
    8.1.1 Problems Encountered with ECDF
  8.2 Edinburgh Data Intensive Machine 1
    8.2.1 Problems Encountered with EDIM1
  8.3 Raspberry Pi
  8.4 Pandaboard ES
  8.5 Qseven / Pandaboard / Atom Cluster
    8.5.1 Compute Nodes
9 Results
  9.1 Idle Power
    9.1.1 How Idle Power Changes with Errors
  9.2 CoreMark
  9.3 HPL
    9.3.1 Intel Xeon vs Intel Atom
    9.3.2 GCC vs Intel
    9.3.3 Gigabit Ethernet vs Infiniband
    9.3.4 Scaling
  9.4 Stream
  9.5 LMBench
10 Future Work
11 Conclusions
12 Project Evaluation
  12.1 Goals
  12.2 Work Plan
  12.3 Risks
  12.4 Changes
A HPL Results
B Coremark Results
C Final Project Proposal
  C.1 Content
  C.2 The Work to be Undertaken
    C.2.1 Deliverables
    C.2.2 Tasks
    C.2.3 Additional Information/Knowledge Required
D Section of Makefile for Eddie from HPCC
E Benchmark Sample Outputs
  E.1 Sample Output from Stream
  E.2 Sample Output from Coremark
  E.3 Sample Output from HPL
  E.4 Sample Output from LMBench
F Submission Script from Eddie
  F.1 run.sge
  F.2 nodes.c

List of Tables

3.1 Section from top500.org, November 2011 [1]
3.2 Section from green500.org, November 2011 [2]
6.1 Other HPCC Benchmarks
7.1 Comparison of ARM Boards
9.1 Idle Power Comparison
9.2 Stream Results
9.3 LMBench: Basic system parameters
9.4 LMBench: Processor, Processes
9.5 LMBench: Basic integer operations
9.6 LMBench: Basic uint64 operations
9.7 LMBench: Basic float operations
9.8 LMBench: Basic double operations
9.9 LMBench: Context switching
9.10 LMBench: Local Communication latencies
9.11 LMBench: Remote Communication latencies
9.12 LMBench: File & VM system latencies
9.13 LMBench: Local Communication bandwidths
9.14 LMBench: Memory latencies
A.1 Results from HPL Benchmark using GCC and Infiniband
A.2 Results from HPL Benchmark using GCC and Ethernet
A.3 Results from HPL Benchmark using Intel and Infiniband
A.4 Results from HPL Benchmark using Intel and Ethernet
B.1 Coremark Results for 100000 Iterations
B.2 Coremark Results for 1000000 Iterations
B.3 Coremark Results for 1000000 Iterations
B.4 Coremark Results for 30000000 Iterations
B.5 Coremark Results for 15000000 Iterations

List of Figures

2.1 Average Daily Power Usage on Gigabit Ethernet Nodes
3.1 The SuperMUC system [3]
3.2 Viridis
4.1 Blue Gene/Q at Edinburgh University (www.ph.ed.ac.uk)
4.2 ARM A9 MP Cores (www.arm.com)
5.1 MPI Family Tree (David Henty, EPCC)
7.1 Cotton Candy (www.fxitech.com)
7.2 Raspberry Pi Board (Raspberry Pi Foundation)
7.3 Pandaboard (Pandaboard.org)
7.4 Seco Qseven (www.secoqseven.com)
7.5 Eddie at ECDF
7.6 Power Measurement Setup on the ARM Cluster
7.7 Watts Up Power Meter used on the ARM Cluster
7.8 Power Measurement Setup on Eddie
7.9 Power Logging on Edim1
7.10 Power Measurement Setup on Edim1
8.1 Replacement Power Measurement Setup on Edim1
8.2 ARM Cluster
8.3 Atom Node on ARM Cluster
9.1 Comparison of Idle Power on Xeon and Atom Nodes
9.2 Idle Power while Node had Error
9.3 Iterations per Second for Three Optimisation Levels
9.4 Iterations per Watt for Three Optimisation Levels
9.5 Iterations per Second for Different Processors in 2011 and 2012
9.6 Iterations per Watt for Different Processors in 2011 and 2012
9.7 Comparison of the Runtime as the Number of Cores is Scaled using Different Processors
9.8 Comparison of the Performance as the Number of Cores is Scaled using Different Processors
9.9 Comparison of the Flop/Watt for different compilers using