Low-Power High Performance Computing

Michael Holliday

August 22, 2012

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2012

Abstract

There are immense challenges in building an exascale machine, the biggest of which is power. The designs of new HPC systems are likely to be radically different from those in use today. Making use of new architectures aimed at reducing power consumption, while still delivering high performance up to and beyond a speed of one exaflop, might bring about greener computing. This project makes use of systems already using low power processors, including the Intel Atom and the ARM Cortex A9, and compares them against the Intel Westmere Xeon processor when scaled up to higher numbers of cores.

Contents

1 Introduction 1 1.1 Report Organisation...... 2

2 Background 3 2.1 Why Power is an Issue in HPC...... 3 2.2 The Exascale Problem...... 4 2.3 Average use...... 4 2.4 Defence Advanced Research Projects Agency Report...... 5 2.5 ARM...... 6 2.6 Measures of Energy Efficiency...... 6

3 Literature Review 8 3.1 Top 500, Green 500 & Graph 500...... 8 3.2 Low-Power High Performance Computing...... 9 3.2.1 The Cluster...... 10 3.2.2 Results and Conclusions...... 10 3.3 SuperMUC...... 11 3.4 ARM Servers...... 12 3.4.1 Calxeda EnergyCoreTM & EnergyCardTM ...... 12 3.4.2 The Boston Viridis Project...... 12 3.4.3 HP Project Moonshot...... 13 3.5 The European Exascale Projects...... 13 3.5.1 Mont Blanc - Barcelona Computing Centre...... 13 3.5.2 CRESTA - EPCC...... 14

4 Technology Review 15 4.1 Intel Xeon...... 15 4.2 GP-GPUs & Accelerators...... 15 4.3 Intel Discovery & Intel Xeon Phi (MIC)...... 16 4.4 IBM Blue Gene & PowerPC...... 17 4.5 Intel Atom...... 18 4.6 ARM Cortex A9 & A15...... 18 4.6.1 Nvidia Tegra 3...... 19 4.7 Networks...... 19

5 Software Review 21 5.1 Operating Systems...... 21 5.2 Libraries...... 21 5.2.1 BLAS & CBLAS...... 22 5.2.2 Lapack...... 22 5.3 Batch System and Scheduler...... 22 5.3.1 Sun (Oracle) Grid Engine...... 22 5.3.2 Torque & Maui...... 23 5.4 Compilers...... 23 5.5 Message Passing Interface...... 24 5.5.1 MPICH2...... 25 5.5.2 OpenMPI...... 25

6 Benchmarks 26 6.1 HPCC...... 26 6.1.1 High Performance Linpack...... 26 6.1.2 Bandwidth and Latency...... 26 6.1.3 Other Benchmarks in the HPCC Benchmark Suite...... 27 6.2 LMBench...... 27 6.3 Coremark...... 28 6.4 Validation of Results...... 28

7 Project Preparation: Hardware Decisions and Cluster Building Week 29 7.1 Cluster Building Week & Lessons Learned...... 29 7.2 ARM: Board Comparison...... 30 7.2.1 Cstick Cotton Candy...... 31 7.3 Changes to the ARM Hardware Available...... 32 7.3.1 Raspberry Pi...... 32 7.3.2 Pandaboard ES...... 33 7.3.3 Seco Qseven - Quadmo 747-X/T30...... 34 7.4 Xeon: Edinburgh Compute and Data Facility...... 34 7.5 Atom: Edinburgh Data Intensive Machine 1...... 35 7.6 Power Measurement...... 35 7.6.1 ARM...... 35 7.6.2 ECDF...... 36 7.6.3 Edim1...... 37

8 Hardware Setup, Benchmark Compilation and Cluster Implementation 40 8.1 Edinburgh Compute and Data Facility...... 40 8.1.1 Problems Encountered with ECDF...... 41 8.2 Edinburgh Data Intensive Machine 1...... 41 8.2.1 Problems Encountered with EDIM1...... 42 8.3 Raspberry Pi...... 44 8.4 Pandaboard ES...... 44 8.5 Qseven / Pandaboard / Atom Cluster...... 45

8.5.1 Compute Nodes...... 46

9 Results 48 9.1 Idle Power...... 48 9.1.1 How Idle Power changes with errors...... 50 9.2 CoreMark...... 51 9.3 HPL...... 54 9.3.1 Intel Xeon vs Intel Atom...... 54 9.3.2 GCC vs Intel...... 55 9.3.3 Gigabit Ethernet vs Infiniband...... 56 9.3.4 Scaling...... 64 9.4 Stream...... 68 9.5 LMBench...... 69

10 Future Work 73

11 Conclusions 74

12 Project Evaluation 76 12.1 Goals...... 76 12.2 Work Plan...... 77 12.3 Risks...... 77 12.4 Changes...... 77

A HPL Results 78

B Coremark Results 81

C Final Project Proposal 83 C.1 Content...... 83 C.2 The work to be undertaken...... 83 C.2.1 Deliverables...... 83 C.2.2 Tasks...... 83 C.2.3 Additional Information/Knowledge Required...... 84

D Section of Makefile for Eddie from HPCC 85

E Benchmark Sample Outputs 86 E.1 Sample Output from Stream...... 86 E.2 Sample Output from Coremark...... 87 E.3 Sample Output from HPL...... 88 E.4 Sample Output From LMBench...... 90

F Submission Script from Eddie 92 F.1 run.sge...... 92 F.2 nodes.c...... 93

List of Tables

3.1 Section from top500.org, November 2011 [1]...... 9 3.2 Section from green500.org, November 2011 [2]...... 9

6.1 Other HPCC Benchmarks...... 27

7.1 Comparison of ARM Boards...... 31

9.1 Idle Power Comparison...... 49 9.2 Stream Results...... 68 9.3 LMBench: Basic system parameters...... 69 9.4 LMBench: Processor, Processes...... 69 9.5 LMBench: Basic integer operations...... 69 9.6 LMBench: Basic uint64 operations...... 70 9.7 LMBench: Basic float operations...... 70 9.8 LMBench: Basic double operations...... 70 9.9 LMBench: Context switching...... 71 9.10 LMBench: Local Communication latencies...... 71 9.11 LMBench: Remote Communication latencies...... 71 9.12 LMBench: File & VM system latencies...... 72 9.13 LMBench: Local Communication bandwidths...... 72 9.14 LMBench: Memory latencies...... 72

A.1 Results from HPL Benchmark using GCC and Infiniband...... 78 A.2 Results from HPL Benchmark using GCC and Ethernet...... 79 A.3 Results from HPL Benchmark using Intel and Infiniband...... 80 A.4 Results from HPL Benchmark using Intel and Ethernet...... 80

B.1 Coremark Results for 100000 Iterations...... 81 B.2 Coremark Results for 1000000 Iterations...... 81 B.3 Coremark Results for 1000000 Iterations...... 81 B.4 Coremark Results for 30000000 Iterations...... 82 B.5 Coremark Results for 15000000 Iterations...... 82

List of Figures

2.1 Average Daily Power Usage on Gigabit Ethernet Nodes...... 5

3.1 The SuperMUC system [3]...... 11 3.2 Viridis...... 13

4.1 Blue Gene/Q at Edinburgh University (www.ph.ed.ac.uk)...... 18 4.2 ARM A9 MP Cores (www.arm.com)...... 19

5.1 MPI Family Tree (David Henty EPCC)...... 25

7.1 Cotton Candy (www.fxitech.com)...... 32 7.2 Raspberry Pi Board (Raspberry Pi Foundation)...... 33 7.3 Pandaboard (Pandaboard.org)...... 33 7.4 Seco Qseven (www.secoqseven.com)...... 34 7.5 Eddie at ECDF...... 34 7.6 Power Measurement Setup on the ARM Cluster...... 35 7.7 Watts Up Power Meter used on the ARM Cluster...... 36 7.8 Power Measurement Setup on Eddie...... 37 7.9 Power Logging on Edim1...... 38 7.10 Power Measurement Setup on Edim1...... 39

8.1 Replacement Power Measurement Setup on Edim1...... 43 8.2 ARM Cluster...... 45 8.3 Atom Node on ARM Cluster...... 47

9.1 Comparison of Idle Power on Xeon and Atom Nodes...... 49 9.2 Idle Power while node had an error...... 51 9.3 Iterations Per Second for three optimisation levels...... 52 9.4 Iterations Per Watt for three optimisation levels...... 52 9.5 Iterations Per Second for Different Processors in 2011 and 2012.... 53 9.6 Iterations Per Watt for Different Processors in 2011 and 2012...... 54 9.7 Comparison of the runtime as the number of cores is scaled using different processors...... 55 9.8 Comparison of the performance as the number of cores is scaled using different processors...... 56 9.9 Comparison of the Flop/Watt for different compilers using Infiniband. 57

9.10 Comparison of the average power for different compilers using Infiniband 57 9.11 Comparison of the Total Power for different compilers using Infiniband 58 9.12 Comparison of the Runtime for different compilers using Infiniband.. 58 9.13 Comparison of the Performance for different compilers using Infiniband 59 9.14 Comparison of the Flop/Watt for different compilers using Ethernet.. 59 9.15 Comparison of the average power for different compilers using Ethernet 60 9.16 Comparison of the Total Power for different compilers using Ethernet. 60 9.17 Comparison of the Runtime for different compilers using Ethernet... 61 9.18 Comparison of the Performance for different compilers using Ethernet. 61 9.19 Comparison of the Average Power for different networks...... 62 9.20 Comparison of the Total Power for different networks...... 63 9.21 Comparison of the Runtime for different networks...... 63 9.22 Comparison of the Performance for different networks...... 64 9.23 Comparison of the Flop/Watt for different networks using the GCC Compiler...... 65 9.24 Comparison of the Runtime for different networks using the Intel Compiler...... 65 9.25 Comparison of the Performance for different networks using the Intel Compiler...... 66 9.26 Comparison of the Total Power for different networks using the Intel Compiler...... 66 9.27 Comparison of the Average Power for different networks using the Intel Compiler...... 67 9.28 Comparison of the Flop/Watt for different networks using the Intel Compiler...... 67

Listings

7.1 Sample Script for Processing Power Measurements...... 36 7.2 Sample Power Output File from Eddie...... 37 D.1 Section of Make File for HPCC from Eddie...... 85 E.1 Sample Output from Stream Benchmark...... 86 E.2 Sample Output from Coremark Benchmark...... 87 E.3 Sample Output from HPL Benchmark...... 88 E.4 Sample Output from LMBench Benchmark...... 90 F.1 Submission Script from Eddie...... 92 F.2 Nodes Program from Eddie...... 93

Acknowledgements

I would like to thank everyone who has made the project possible: firstly, my supervisors, Mr Sean McGeever and Dr Lorna Smith. Thanks also to Dr Orlando Richards and Mr Gareth Francis for all of their help.

Chapter 1

Introduction

There are significant challenges facing the world in successfully building an exascale computer. In 2007 the Defense Advanced Research Projects Agency (DARPA) commissioned a report to investigate the challenges in achieving a speed of one exaflop. The DARPA report concluded that the four key issues were power, memory and storage, application scalability and resiliency.

At the time of the DARPA report the most power efficient technology delivered approximately 450 × 10^6 floating point operations per second (Flops) per watt. Ignoring the additional power required for cooling, which adds 20-100%, when scaled this would need 2.2 gigawatts per exaflop. As of June 2012, according to the Green 500, today's technology can deliver 2097 Megaflops per watt. The improvement by a factor of four is a step in the right direction, but it would still require half a gigawatt per exaflop, and the added power required for cooling could take the total power required to one gigawatt; the arithmetic is sketched below. The Longannet power station in Fife has a maximum output of 2.4 GW [4]. In order to power an exascale system today, it would need a dedicated power station roughly half the size of Longannet.

DARPA has set out a goal to achieve 50 gigaflops per watt within 8 years of the report. In response two bodies were set up, the International Exascale Software Project (IESP) in the USA and the European Exascale Software Initiative (EESI). Both have the aim of overcoming the issues raised by DARPA in order to build a machine with a speed of one exaflop. The most pressing of the four challenges raised by DARPA is that of power; the aim is to improve efficiency by a factor of one hundred by 2015. The IESP are clear that power is the biggest component of the total cost of ownership of a system. The IESP roadmap suggested that "Every megawatt of reduced power consumption translates to a savings of $1M/year" [5]. Delivering such improvements will require collaboration between computational scientists, engineers and mathematicians to improve the efficiency of both the hardware and the algorithms.
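The figures above follow directly from dividing the target performance by the achievable efficiency (a back-of-the-envelope check, not a calculation taken from the DARPA report itself):

P_{exa} = \frac{10^{18}\,\mathrm{Flops}}{450 \times 10^{6}\,\mathrm{Flops/W}} \approx 2.2\,\mathrm{GW}
\qquad
P_{exa} = \frac{10^{18}\,\mathrm{Flops}}{2097 \times 10^{6}\,\mathrm{Flops/W}} \approx 0.48\,\mathrm{GW}

Adding 20-100% for cooling to the second figure gives roughly 0.6-1 GW, which is the one-gigawatt total quoted above.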

Building an exascale machine will be a substantial and difficult challenge requiring new and innovative ideas. Current technology trends may need to be re-evaluated or simply left behind to allow for the development of greener and more efficient computers.

Over the last forty years gains in processor speeds have been due to increases in the number of transistors. Moore's Law states that "The number of transistors on a chip will double approximately every two years" [6]. Another, lesser known, law is May's Law [7], which states that the efficiency of software approximately halves in the same period. The combination of May's and Moore's Laws has led to increased computing speeds in systems that are less efficient.

Transistors are rapidly approaching the size of a single atom, beyond which they are unlikely to decrease in size. This physical limit means that increasing the number of transistors in the same area cannot continue. In April 2005 Gordon Moore, after whom the law is named, stated that:

"In terms of size [of transistors] you can see that we're approaching the size of atoms which is a fundamental barrier, but it'll be two or three generations before we get that far - but that's as far out as we've ever been able to see. We have another 10 to 20 years before we reach a fundamental limit. By then they'll be able to make bigger chips and have transistor budgets in the billions."

The end of Moore's Law marks a crossroads for computing. Without a new technology, processor clock speeds will not get any faster. Increases in speed will only be achieved through greater processor counts, which in turn require greater power, more cooling and better scaling parallel algorithms. Even if Moore's Law continued indefinitely, it would still be infeasible to continue building faster computer systems: unless the effects of May's Law are reversed, the resources required to run such systems would be too high, particularly in terms of power usage.

In this project the power usage of low power processors was compared against a current commodity processor, the Intel Xeon, as the number of cores in use is scaled. The processors were also tested in serial to compare their individual performance.

1.1 Report Organisation

The report is set out in three sections. The first three chapters cover the background material and similar projects, and give an introduction to justify the project. The next chapters cover the benchmarks, hardware and systems used during the project; this section also covers the changes to the hardware available, the systems used to monitor power consumption and any problems that arose in using the systems. Finally, the results of the comparisons are presented and analysed, and conclusions and ideas for future work are covered to allow the work to be built on and further research to be carried out.

Chapter 2

Background

This chapter will look at why power is an issue for High Performance Computing and how far power usage needs to be reduced. There are several options being considered, one of which is the use of ARM processors. The chapter will give an introduction to ARM and why its processors are fast becoming the processor of choice for future machines.

2.1 Why Power is an Issue in HPC

In the past decade the issues in building a large HPC system have changed dramatically. During the 1990s the major problem was achieving parallelism. The programming models were still new and had not been standardised, and there was a large range of architectures, each with its own software and tools. The standardisation of programming models reduced the porting effort required. As HPC systems became larger and faster a new issue arose: the cost of running such systems became a concern, as the running costs could add between 20 and 100% to the cost of the system. In the past ten years the speed of the fastest system in the world has increased from teraflops to petaflops. The next step is a speed of one exaflop; however, this presents a new problem. The power required to build and run an exascale system using today's technology would be too great. The need for these systems in applications covering all areas of science and engineering requires momentous changes in HPC, starting with the power.

2.2 The Exascale Problem

The research into the exascale problem was originally carried out as part of the HPC Ecosystems 'The Exascale Problem' essay. It is shown here for completeness.

Applications in medicine, genomic research and weather prediction, alongside many others, will require exascale, and in future zettascale, machines. The DARPA report concluded that the four key issues in building an exascale system were power, memory and storage, application scalability and resiliency. DARPA defines an exascale system to mean "that one or more key attributes of the system has 1,000 times the value of what an attribute of a 'Petascale' system of 2010 will have" [8].

There are many issues standing in the way of building an exascale machine. While there are many techniques to reduce power consumption and improve efficiency, current architectures cannot be used to build exascale machines. Co-design of new hardware, software and languages is vital to produce systems capable of speeds of one exaflop.

When comparing the speeds of the world's fastest computers over the last 20 years according to top500.org, it can be seen that if the trend were to continue an exascale machine would likely be built around 2018. DARPA has the aim of building an exascale machine three years earlier, by 2015. Power is the biggest issue for HPC systems, data centres and laptop computers alike. Tackling the power issue for HPC will have benefits for all computing, but they can only be achieved with collaboration from engineers, mathematicians and scientists from all backgrounds.

2.3 Average use

The power consumption of eight nodes on Eddie, a system hosted at the Edinburgh Compute and Data Facility, was monitored continuously for a full week. Each node of Eddie contains two six-core Intel Xeon processors. Monitoring the power usage across a whole week took into account the lower number of users overnight and at the weekend. The results for the Ethernet nodes are shown in figure 2.1; the nodes using Infiniband were found to have similar results.

The average power consumption of a single node across the week was about 200 watts. Over a year a single node on Eddie will therefore consume roughly 6.3 gigajoules of energy, and given that there are 156 identical nodes on Eddie, these nodes together will consume around 984 gigajoules. Eddie is a relatively small machine in comparison to the systems in the highest positions of the Top 500. As systems move towards exascale the number of nodes will increase dramatically; power consumption cannot be sustained at this level.
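As a sanity check on these figures (an illustrative calculation, not taken from the measurement logs), multiplying the average power draw by the number of seconds in a year gives the annual energy of a single node:

E_{node} = 200\,\mathrm{W} \times 3.15 \times 10^{7}\,\mathrm{s} \approx 6.3 \times 10^{9}\,\mathrm{J} = 6.3\,\mathrm{GJ} \approx 1750\,\mathrm{kWh}

Scaling by the 156 identical nodes gives roughly 980 GJ, or about 270 MWh, per year for this part of the machine alone.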

Figure 2.1: Average Daily Power Usage on Gigabit Ethernet Nodes

While the nodes of Eddie are by no means the most energy efficient, they are representative of the nodes in use on many HPC systems today.

2.4 Defence Advanced Research Projects Agency Report

The 2007 DARPA report "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems" made clear the problems facing the world in building an exascale machine. By far the biggest of the four challenges raised was "The Energy and Power Challenge" [8]. The report authors were unable to make use of current technologies to design a system with a power consumption low enough to make building an exascale system feasible. However, the study did conclude that the base power, rather than the power required to transport data, would be the most likely source of power savings. Power was nevertheless only one of four challenges identified by DARPA, with memory, concurrency and resiliency also needing to be addressed to make an exascale system achievable.

2.5 ARM

Advanced RISC Machines (ARM) was formed as a spin-off from the Acorn Computer Group. ARM Holdings [9] was founded in 1990 and 20 billion ARM-based chips have been shipped to date. ARM's business is designing and licensing Intellectual Property to manufacturers of chips.

The ARM processor is well known in the mobile technology world. Its low power consumption makes it an ideal processor for devices with limited battery life, such as smart phones. In HPC, ARM processors are rarely used compared to those produced by IBM, AMD and Intel, the reason being the lack of performance an ARM core can provide. However, with much of the focus now turning to the power consumption of HPC systems, many believe that ARM processors, with their high performance-to-power ratio, will become commonplace in future.

With over 250 companies licensed to produce ARM-based chips, there is a wide variety of chips to choose from. Today the Tegra processor produced by Nvidia is probably the best known: the Tegra 3 will be released in 2012, following the Tegra 2, which is already in use in several devices.

The Barcelona Supercomputing Centre is set to build the world's first ARM-based supercomputer using the Tegra 3. There will no doubt be huge interest in the results of that system when deciding whether ARM processors will be used to build further HPC systems.

2.6 Measures of Energy Efficiency

The Green 500 ranks supercomputers based on the flops/watt measure. To calculate the flops/watt, the performance from the High Performance Linpack benchmark is divided by the total power usage, as shown in equation 2.1. While the measure may be useful to compare systems directly, it does not take into account how much energy is required for purposes other than running the system.

flops/watt = \frac{TotalPerformance}{TotalPower} \qquad (2.1)

The Power Usage Effectiveness (PUE) has become a measure of how much extra energy is used by the entire infrastructure compared to the HPC system alone. The equation used to calculate the PUE is shown in equation 2.2.

PUE = \frac{TotalEnergy}{SystemEnergy} \qquad (2.2)
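As a worked example (with purely hypothetical numbers, not measurements from any of the systems discussed here), a facility drawing 1.5 MW in total while its HPC system alone draws 1.2 MW would have

PUE = \frac{1.5\,\mathrm{MW}}{1.2\,\mathrm{MW}} = 1.25

i.e. an extra 25% of the system's own power draw is spent on cooling, lighting and other infrastructure.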

A system with a PUE of 1 would use no power other than that needed to run the system; in other words, no energy would be required to run the building, cooling systems or networks. It is doubtful that any system will ever have a PUE of 1, as some power will always be required for infrastructure such as lighting. It is, however, possible to reduce the PUE to a level much closer to 1. As in all businesses and homes, energy use can be reduced: new buildings and systems are usually more efficient than older ones, and so should be reviewed, upgraded and replaced much like the systems they house.

The PUE is not a useful measure for comparing two different systems. It only shows how much power is required to run the building as a proportion of the power needed to run the system alone. The measure is dependent on the system in question, which means that two systems with the same PUE could be drastically different in terms of power use. There are critics of the PUE metric, suggesting that it has caused a focus on the 'right side of the decimal', causing many to ignore the power consumption of the system itself.

A new measure was suggested at the International Supercomputing Conference 2012. The new measure takes into account energy reuse and on-site production. The Net Zero Efficiency (NZE) aims to look at sustainability rather than energy efficiency. Not only will reducing the power consumption of HPC systems reduce the running costs, but looking at the source of the power may also have positive impacts: greener sources of power such as wind, hydro or solar often have much lower financial and environmental costs.

There are other factors that some feel should be taken into account. The carbon footprint has become a major issue for companies all over the world, and many HPC and data centres are supplied with power from coal-fired power stations. According to Nicolas Dube [10], "a 20MW data center running on coal equates to 175200 tonnes of CO2 a year, while the equivalent system using hydro power equates to only 876 tons". Many feel that alongside the performance and total power measures, systems should also be compared on their environmental impact through their carbon footprint. These new measures change the target from one to zero, and take into account the full picture of HPC systems' power consumption and environmental effects.
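The quoted coal figure is consistent with a simple estimate, assuming a carbon intensity of roughly one tonne of CO2 per MWh for coal-fired generation (an assumed round figure, not a value taken from [10]):

20\,\mathrm{MW} \times 8760\,\mathrm{h/year} = 175200\,\mathrm{MWh/year} \;\Rightarrow\; \approx 175200\,\mathrm{tonnes\ of\ CO_2\ per\ year}

The hydro figure implies an intensity of around 5 kg of CO2 per MWh, broadly in line with typical lifecycle estimates for hydroelectric generation.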

Chapter 3

Literature Review

This chapter will look at previous work carried out by the MSc Student Panagiotis Kritikakos last year. Other projects looking into the development of low power systems will also be reviewed.

3.1 Top 500, Green 500 & Graph 500

Since 1993 the Top 500 [1] list has been compiled twice a year. The list ranks the top 500 HPC systems in order of the performance achieved when running the High Performance Linpack benchmark. The Top 500 is not the first such list; it is a continuation of the list produced since 1986 by Hans Meuer. Until recent years the power consumption of the machines on the Top 500 list was not relevant, but as we move towards exascale, power has become one of the biggest challenges, leading to the creation of the Green 500 list.

The Green 500 [2] ranks the top 500 supercomputers in order of their energy efficiency. The list was started in 2005 to increase awareness of other metrics, including performance per watt, with the first list published in 2007. It was often the case that the Green 500 reversed the Top 500 list, with the highest performing systems being less energy efficient than those further down the list. The lists published in June 2012 marked a turning point, with the IBM Blue Gene/Q at Lawrence Livermore National Laboratory taking the number one position in both the Top 500 and the Green 500. The Blue Gene/Q is by far the most energy efficient system, taking all of the top twenty places in the Green 500.

Another notable system in the Green 500 is the Intel Discovery system. Discovery is a small prototype machine making use of the Intel Xeon Phi processor. The system has made it up to 150th on the Top 500, but has also made it to 21st on the Green 500. The high ranking for a machine with fewer than 10000 cores suggests that as these systems are scaled up they will move up both lists alongside the Blue Gene/Q.

Rank  Site   Computer     Cores    Rmax (TFlops)  Rpeak (TFlops)  Power (kW)
1     Japan  K computer   705024   10510.00       11280.38        12659.9
2     China  Tianhe-1A    186368   2566.00        4701.00         4040.0
3     USA    Jaguar       224162   1759.00        2331.00         6950.0
19    UK     HECToR       90112    660.24         829.03          unknown
29    IBM    Blue Gene/Q  32768    339.83         419.43          170.2
64    IBM    Blue Gene/Q  16384    172.49         209.72          85.1
65    IBM    Blue Gene/Q  16384    172.49         209.72          85.1

Table 3.1: Section from top500.org, November 2011 [1]

Rank  Site   Computer     Total Power (kW)  MFLOPS/W
1     IBM    Blue Gene/Q  85.12             2026.48
2     IBM    Blue Gene/Q  85.12             2026.48
3     IBM    Blue Gene/Q  170.25            1996.09
32    Japan  K Computer   12659.89          830.18
43    China  Tianhe-1A    1155.90           668.1
95    UK     Hector       1824.93           391.79
149   USA    Jaguar       6950.00           253.09

Table 3.2: Section from green500.org, November 2011 [2]

The Graph 500 [11] is the newest list to be published. The Graph 500 was created to reflect the move towards data intensive computing and was first announced at the International Supercomputing Conference (ISC) in June 2010. The first list followed six months later at SC2010.

The top three computers from the Top 500 and Green 500 lists of November 2011, along with a selection of other notable systems, are shown in tables 3.1 & 3.2. When comparing the two tables it can be seen that the computers with the highest performance are far less energy efficient than those further down the Top 500. As of June 2012, the IBM Blue Gene/Q makes up the entire top twenty of the Green 500. The innovative use of large numbers of simple cores, combined with new water cooling systems and network topologies, has given the Blue Gene both high performance and reduced power consumption.

3.2 Low-Power High Performance Computing

In 2011 Panagiotis Kritikakos completed the Project Low-Power High Performance Computing [12]. Kritikakos considered three main factors: computational performance and efficiency, power efficiency and porting effort. The project used the Intel Xeon as a baseline representing commodity processors in use at the time. The Xeon was compared to an Intel Atom and an ARM Cortex A8 (Marvell 88F6281 [13]).

9 3.2.1 The Cluster

Kritikakos successfully built a heterogeneous cluster with seven compute nodes and a frontend node. The frontend and compute nodes one, two and three each contained an eight-core Intel Xeon processor. Nodes four and five contained a dual-core Intel Atom, with the last two nodes utilising a single-core ARM processor.

Kritikakos chose to run Scientific Linux on his nodes. Scientific Linux is not supported on the ARM architecture, so in order to keep the systems as similar as possible he installed an OS in the same family as Scientific Linux that is supported on ARM, Fedora Core Linux. All of the nodes had the same version of the GCC compilers; the only major difference was the lack of a Fortran compiler on the ARM system. The inability to compile Fortran codes on ARM was a major problem, as Fortran is used in many HPC applications, so the f2c source-to-source compiler from Netlib was used to compile Fortran codes.

3.2.2 Results and Conclusions

The report produced by the project publishes a wide range of results from the CoreMark, Stream and HPL benchmarks. It is clear, however, that there are major mistakes in the published results; in several cases the results are out by an order of magnitude, for example the HPL results suggest that the performance of the Atom is four times that of the Xeon.

Power measurements were taken while running several different benchmarks on each of the systems. The benchmarks chosen were the High Performance Linpack (HPL) and Stream benchmarks from the HPCC benchmark suite, alongside the Coremark benchmark.

Measurement of the idle power of the nodes gave a clear indication of the baseline power consumption of each system. It is apparent that the power consumption of an ARM processor is a factor of fifteen lower than the Xeon, and a factor of five lower than the Atom. The idle power gives a baseline comparison of the power efficiency of different systems, but a low power processor must also deliver performance.

The benchmarks were chosen to test different parts of the systems. Using several benchmarks allowed the power usage to be attributed to specific parts of each system. Comparing the results showed which systems used the most power and hence which parts need to be targeted in future to reduce power further. Overall the ARM processor used the least power, but also had the worst performance. The Intel Xeon had by far the best performance but consequently used far more power. The Atom is a compromise between power and performance.

While the results seem conclusive, it must be taken into account that the number of cores on each processor was different: the Xeon has eight cores compared to the ARM's one.

In future years ARM is developing dual-, quad- and octo-core processors, which may make its performance more comparable.

The effect of different types of hard drive was also investigated. Using both a solid state drive and a hard disk drive, it was found that the power consumption on all benchmarks was lower when using the solid state drive.

3.3 SuperMUC

Figure 3.1: The SuperMUC system [3]

The Technical University of Munich hosts the SuperMUC machine, currently ranked 4th in the Top 500 [3] and 39th in the Green 500. The university has designed not only the hardware used to build the system, but also the infrastructure, software and applications, to minimise power consumption.

The most obvious efficiency comes from the hardware used to build SuperMUC. Exploiting new cooling technologies, minimising power losses within the system and re-using the waste heat produced have reduced the power consumption of the system as a whole.

The first innovation is the use of direct water cooling. Most systems today use some form of air cooling, which is a factor of 20 [3] less effective than water at dissipating the heat produced. Each of the compute racks is 90% cooled by warm water. The lower operating temperatures, combined with the removal of fans, have decreased the power consumption by approximately 10%.

The heat removed from the system is then reused across the university. Using the warmed water for building heating, the swimming pool and underfloor heating were all considered, but while all three were viable options too much of the heat would be lost due to leakage.

Instead the warm water is used for 'concrete core activation'. The water is pumped through the building, warming the concrete; when heat is needed to keep the building warm, the concrete releases it over the course of several hours, removing the need for conventional heating sources.

The entire building has been designed to promote efficient use of the system. Each floor of the building is devoted to a single purpose, with systems likely to communicate placed close together, reducing the latency of communication. To reduce power consumption further, nodes not in use are placed into a deep sleep mode, minimising the idle power consumption.

The co-design of the system, infrastructure and surrounding area has resulted in a system that reduces the costs of the university as a whole. By looking at the bigger picture, instead of focusing on the system itself, larger gains can be made in reducing power consumption.

3.4 ARM Servers

In the past year a number of ARM servers have been developed. These servers were built to be low-power HPC systems. Three current ARM servers are reviewed in the following sections. Dell have also announced that they will release an ARM-based server, the Dell Copper, in the future.

3.4.1 Calxeda EnergyCoreTM & EnergyCardTM

The EnergyCore architecture was released in November 2011, with the aim of dramatically cutting power requirements. EnergyCore uses the Marvell Armada XP, a quad-core ARM processor. The ARM processor is combined with a management engine and fabric switch to create the EnergyCore System on Chip (SoC). A separate processor performs real-time management of energy consumption, from a single node up to an entire cluster, and the management system has the ability to switch numerous power domains on or off.

3.4.2 The Boston Viridis Project

A collaboration between Boston Ltd and Calxeda resulted in a "self contained, highly extensible multi node cluster" called the Boston Viridis Project [14]. The cluster is based on an ultra-low-power approach to parallel computing.

Each node utilises a number of Calxeda EnergyCoreTM Systems on Chip. Each self-contained micro system incorporates a quad-core ARM Cortex processor alongside an energy management engine, I/O controllers and fabric switches. Four SoCs are combined into a cluster node, sharing memory, disk connections and interconnects.

Figure 3.2: Viridis

3.4.3 HP Project Moonshot

HP Project Moonshot is a server platform that makes use of extreme low-energy processors. The project was created to meet the increasing demand for IT servers across all areas of business, as businesses must now scale their technology to levels that match the surge of on-line media, social networking and e-business. Current technologies cannot scale fast or far enough due to the cost, space and energy required to run them. New servers must scale to new heights to meet today's needs, but must do so within today's space and energy constraints. Project Moonshot provides "the same computing power in about one-tenth of the space and power of comparable server". Such improvements would allow businesses to rapidly expand their computing capabilities, reducing their costs, without major investment in infrastructure.

3.5 The European Exascale Projects

3.5.1 Mont Blanc - Barcelona Computing Centre

The Barcelona Supercomputing Centre has been looking at the feasibility of using mobile processors to build the HPC systems of the future. The low-power nature of mobile processors could have a significant effect on the power consumption of such systems, leading towards the development of exascale machines.

A prototype cluster of 256 nodes has been built, with each node consisting of a single Nvidia Tegra 2 SoC. The Tegra 2 is a dual-core processor based on the ARM Cortex A9 architecture.

The ARM system was compared to a power-optimised Intel i7 laptop. The STREAM, SPEC, Dhrystone and HPL benchmark results were compared between the two systems. The centre concluded that while the i7 had better performance in all of the benchmarks, the low power consumption of the ARM processors made them a good option for future HPC systems. When the differences in clock rate were factored into the performance of the systems, the i7 only performed 3.2 times faster. This relatively small gain in performance was concluded to be outweighed by the extra complexity of the i7. [15]

3.5.2 CRESTA - EPCC

The CRESTA project is looking at the software co-design required to successfully use an exascale system. CRESTA has two strands: applications and systemware. The applications strand is looking at a set of key applications and how co-design can enable them to be used at exascale; the applications chosen reflect the wide range of areas that make use of HPC systems. The systemware strand is looking at the tools required to support applications, including libraries, compilers and algorithms.

Low-power systems cannot be achieved through technology improvements alone. New hardware must be designed in collaboration with the algorithms, libraries and software that run on it, in order to use it efficiently and maximise power savings.

Chapter 4

Technology Review

Current and future technologies, including the Atom, the ARM A9, the IBM Blue Gene and the networks used for communication, will be reviewed in this chapter.

4.1 Intel Xeon

The Intel Xeon [16] was originally created in 1998 and has since been re-released on several occasions. The Xeon is aimed at the server and HPC market, and over the generations the number of cores available has increased from one to ten. The Xeon has long been used in High Performance Computing; it is a mature and stable platform that uses the extremely familiar x86 (and x86-64) architecture. The frequency of the Xeon has increased significantly, from 400 MHz in 1998 up to 4.4 GHz in recent years.

The Xeon currently faces competition in the HPC market from AMD Opteron and PowerPC processors. It is possible that, with the announcement of the Xeon Phi, more systems will make use of Xeon processors, as power consumption could be reduced significantly. One of the biggest manufacturers of HPC systems, Cray, recently reported that they will be using the Xeon Phi in future systems.

4.2 GP-GPUs & Accelerators

The research into GP-GPUs was originally part of the HPC Ecosystems 'The Exascale Problem' essay. It is shown here for completeness.

The use of accelerators has been suggested as a way to increase the Flops/Watt rate of systems. It can be seen in table 3.1 that for the Tianhe-1A, which makes use of GPUs, the power usage is lower compared to similar machines that do not use GPUs.

GP-GPUs were originally designed for use within computer games systems such as the Playstation 3. They are used to perform large numbers of calculations in parallel and are designed to maximise the area of the processor devoted to computation. Removing functionality that is not useful for HPC has allowed the peak performance to increase dramatically compared to a conventional CPU.

In order to be used at their maximum performance, GPUs need very fast access to memory. GPUs make use of GDRAM memory, which has a much higher bandwidth than standard memory; the high bandwidth is required to keep all of the cores supplied with data.

While GPUs use less power than CPUs, they cannot be used to build exascale machines on their own. Each GPU must be paired with a CPU to control memory copies. The GPU is used for the computationally expensive parts of a code, while the CPU takes care of the less expensive parts. Programming using GPUs is complex: the programmer must manage memory transfers between the GPU and CPU, while targeting the code to make use of the GPUs.

Nvidia are currently the market leaders in the production of GPUs for HPC, with AMD close behind. Nvidia have become market leaders because they have a well developed language, CUDA, for making use of their GPUs. CUDA is specifically targeted at Nvidia GPUs, making them the most likely choice. Other languages such as OpenCL will allow for more choice in the GPU market.

4.3 Intel Discovery & Intel Xeon Phi (MIC)

The Intel Xeon Phi was officially announced at ISC 2012 in Hamburg, although there were several test devices in use before that date. In May 2010 Intel released a prototype board for the Many Integrated Core (MIC) architecture, the Knights Ferry.

The Xeon Phi is designed to complement a CPU, but unlike GPUs does not need to be paired with one. The many cores of a Phi are based on the same architecture as the Xeon, allowing developers to easily port their codes. Unlike GP-GPUs, the same programming models are used on both the Xeon and the Xeon Phi. The use of the same models reduces the time needed to port codes, as there is only one version to be maintained and there is no need to learn new languages such as OpenACC or CUDA.

The first commercial product, currently named Knights Corner, is to be released in late 2012, probably around SC 2012. The Texas Advanced Computing Centre has already announced that its next system, "Stampede", will use Knights Corner devices.

While little is known about the exact performance of the Xeon Phi, the position of the Discovery system in both the Top 500 and the Green 500 suggests that it will be

comparable with the Blue Gene/Q. The Discovery cluster built by Intel using the Xeon E5 and the Xeon Phi has a total of only 9800 cores. Despite the relatively small number of cores it ranked 150th in the Top 500, suggesting that as bigger systems make use of the Phi it will move up rapidly. The system has also proved to be extremely efficient: it ranked 21st in the Green 500, the highest rank for a non Blue Gene/Q system.

4.4 IBM Blue Gene & PowerPC

The IBM Blue Gene makes use of large numbers of low-power processors. The Blue Gene/Q has achieved 16324 teraflops, but has also achieved a much higher rate of Mflops/Watt than its closest rivals. As of June 2012 the Blue Gene/Q occupies the entire top twenty positions of the Green 500 [2].

The Blue Gene/Q uses PowerPC processors. The PowerPC architecture has its roots in the early 1980s, making it one of the oldest processor families used in HPC. The Blue Gene/Q uses the PowerPC A2, with 16 cores per node.

The large numbers of cores used in Blue Gene systems require a network that allows the system to exploit the full performance of all of the processors. The system has a 5D torus network for communication, which includes a collective network to support functions such as the MPI all-to-all.

The Blue Gene/Q follows the success of the Blue Gene/L and Blue Gene/P. All three Blue Gene generations have changed the face of HPC systems in the past decade. The Blue Gene has led the way in improving the efficiency of computing, leading to a Blue Gene/Q, Sequoia, becoming the first system to take the top position on both the Green 500 and the Top 500 at the same time.

Sequoia, based at Lawrence Livermore National Laboratory, is not only the biggest Blue Gene/Q in the world, it is the biggest system in the world, having taken the number one position in the Top 500 from the K computer in Japan. Sequoia is the first system to use over a million cores, with a total of 1572864, more than twice the number of cores in the K computer.

Figure 4.1: Blue Gene/Q at Edinburgh University (www.ph.ed.ac.uk)

4.5 Intel Atom

The Intel Atom processor was designed with low-power computing in mind, aimed at the netbook and embedded markets. The Atom currently has a maximum of two cores, although a quad-core Atom may soon be available. The Atom is based on the x86 architecture, making it a particularly attractive low-power processor. Its main competitors are the ARM processors, which are more power efficient and already have a quad-core version available, although the performance of the ARM is still behind the Atom. The clock speed of the Atom ranges from 600 MHz up to 2.13 GHz, much slower than the Xeon. The lower clock speed is one of the major factors in reducing the power consumption, despite the reduction in performance.

4.6 ARM Cortex A9 & A15

The ARM Cortex A9 [17] is the latest processor available from ARM. The layout of the A9 is shown in figure 4.2. The A9 has a clock speed of 800 MHz up to a maximum of 2 GHz. The ARM is slower than either the Xeon or the Atom in terms of top speed, but it is not far behind. As with all of the processors designed by ARM, the A9 is far more power efficient than the other processors reviewed here; the efficiency comes from ARM's background in producing mobile processors. The A9 is available with 1, 2 or 4 cores.

The ARM A9 has been implemented by a wide range of companies. Implementations of the A9 include the Samsung Exynos, the Apple A5 and the Sony Playstation Vita.

18 Figure 4.2: ARM A9 MP Cores (www.arm.com)

The most well known and proven implementation is the Nvidia Tegra. The Tegra 2 was released in 2011, with the Tegra 3 to be released soon.

The next generation of the ARM Cortex family, the A15, has already been announced and will soon become available in a variety of forms. The clock speed of the quad-core A15 is up to 2.5 GHz, more than the Atom. If the A15 proves to be as efficient as its predecessors, it could become a real competitor in the HPC market as an alternative to the current processors.

4.6.1 Nvidia Tegra 3

The Tegra 3 is a quad-core ARM-based processor developed by Nvidia, to be used in mobile devices including the Google Nexus 7 tablet. The Tegra 3 contains a quad-core ARM A9, but also a fifth core dedicated to power management. The Tegra 3 combines the low-power ARM CPU with a low-power Nvidia GeForce GPU to increase performance.

4.7 Networks

Data must be moved between the different nodes of an HPC system via a network. There are many different networks used in HPC systems, including Ethernet, Infiniband and Gemini. The Ethernet network is by far the most common, used everywhere from HPC systems to home networks. An Ethernet network is likely to be the cheapest option, but will also have the worst

performance of the three listed above. Gemini is proprietary hardware owned by Cray, while Infiniband is an open standard with hardware produced by vendors such as Mellanox and Intel. Both networks have higher performance than an Ethernet network, but cost more. The systems used by this project make use of Ethernet and Infiniband networks; as part of the project the performance per watt of both networks will be compared using the Eddie system.

Chapter 5

Software Review

In this chapter the software and libraries used will be reviewed. Each library will be assessed in terms of its history, its advantages and disadvantages, and its performance.

5.1 Operating Systems

The project makes use of two large systems, Eddie and EDIM1, alongside a cluster built as part of the project. Unlike the project undertaken last year, there was no choice of which OS to use on Eddie and EDIM1. The best comparison would be made using systems that were identical in every respect other than the variable being compared, but as Eddie and EDIM1 are run by other groups, the OS was already installed and could not be changed. Eddie uses Red Hat Linux, while EDIM1 uses Rocks Linux. The difference in OS will change the results of the benchmarks, due to differences in the background processes that run. These differences must be taken into account when making the comparison between the three systems.

5.2 Libraries

Libraries are sets of object files containing functions, subroutines or classes. They promote code reuse through clean, well behaved interfaces. Libraries are likely to be optimised and may even be portable between different systems, but there is a trade-off to be made: the more portable a library is, the less optimised it becomes for a specific system, reducing performance.

There are two main types of library, static and dynamic. A static library is copied into the developer's program, leading to large executables. The use of static libraries can give

better performance than dynamic libraries, but wastes space as each executable has its own copy of the library.

5.2.1 BLAS & CBLAS

The Basic Linear Algebra Subprograms (BLAS) library provides routines for vector and matrix operations. There are three levels of operations: vector-vector, matrix-vector and matrix-matrix. The BLAS library is defined using the Fortran language; the CBLAS library provides interfaces for use in C programs.
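As a brief illustration of a level-3 BLAS call (a minimal sketch, assuming a CBLAS implementation such as the reference Netlib CBLAS or ATLAS is installed; the required link flags vary between implementations), a matrix-matrix multiply can be invoked from C as follows:

/* Minimal CBLAS example: C = alpha*A*B + beta*C for small 2x2 matrices. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = {1.0, 2.0,
                   3.0, 4.0};
    double B[4] = {5.0, 6.0,
                   7.0, 8.0};
    double C[4] = {0.0};

    /* Row-major storage, no transposes: M = N = K = 2, alpha = 1, beta = 0 */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}

The same cblas_dgemm routine, applied to very large matrices, dominates the runtime of the HPL benchmark discussed later.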

5.2.2 Lapack

The Lapack library provides routines for solving systems of simultaneous equations and for matrix factorisations. It provides routines for both real and complex data types in single and double precision. There are many vendor-specific versions of the Lapack library that have been optimised to increase performance.

The Lapack library is used heavily, along with the BLAS library, in the High Performance Linpack benchmark. The choice of which version of the libraries to use will have a massive impact on the performance obtained; in order to compare two systems directly it is important to take into account any differences in the library.
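For illustration (a minimal sketch, assuming the LAPACKE C interface to Lapack is available; on systems that only provide the Fortran interface the routine dgesv_ would be called directly instead), solving a small linear system Ax = b looks like this:

/* Minimal LAPACKE example: solve the 2x2 system A*x = b using dgesv, which
 * performs an LU factorisation with partial pivoting - the same kind of
 * operation HPL times at much larger scale. */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    double A[4] = {2.0, 1.0,
                   1.0, 3.0};      /* row-major 2x2 matrix */
    double b[2] = {3.0, 5.0};      /* right-hand side, overwritten with x */
    lapack_int ipiv[2];            /* pivot indices from the LU factorisation */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
    if (info != 0) {
        fprintf(stderr, "dgesv failed: info = %d\n", (int)info);
        return 1;
    }
    printf("x = (%g, %g)\n", b[0], b[1]);
    return 0;
}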

5.3 Batch System and Scheduler

The batch system of a cluster is used to submit jobs to run on the system. The batch system is responsible for allocating resources to jobs and for scheduling them. The scheduler uses the parameters passed to it from the submission script to create a schedule for the jobs. The schedule should maximise throughput while not disadvantaging any job unfairly. The scheduler uses the maximum runtime and the resources requested to decide when and where a job will run; smaller jobs may be scheduled sooner, as they may be able to run in gaps created when the larger jobs are scheduled.

5.3.1 Sun (Oracle) Grid Engine

The Sun Grid Engine is used extensively on HPC systems. Originally developed by Sun Microsystems, it is now owned by Oracle. An open source version, the Open Grid Scheduler, has been developed since 2001, and in December 2010 Oracle officially put the Open Grid Scheduler project in charge of maintaining the open source version.

The Grid Engine has been proven over the past decade to work on systems of all sizes, and is available on most Linux distributions.

5.3.2 Torque & Maui

Torque is an open source portable batch system. It can be downloaded and modified according to the licence from adaptivecomputing.com. Torque is widely used and has a strong developer and support community.

The Torque PBS can be used in combination with the Moab workload manager; the addition of Moab improves utilisation through better scheduling. The predecessor to Moab was the Maui scheduler. Torque and Moab have been proven to work well on large systems of up to thousands of cores. The communication model used in Torque continues to be improved, and can currently handle jobs that span hundreds of processors.

5.4 Compilers

All benchmarks, libraries and programs must be compiled into an executable, and there are a considerable number of compilers available. The performance of a compiler will affect the performance of the program being compiled. There are compilers produced by vendors such as Intel that are optimised for a specific system; however, these are proprietary software which comes at a cost. The most widely used compilers are the GNU Compiler Collection, including GCC, GFortran and G++.

GNU

The GNU Compiler Collection is a compiler with frontends used to compile code in C, Fortran and Java, among other languages. GCC was originally created as the compiler for the GNU operating system. Despite the GNU OS not being widely used, the compiler has been adopted as the default compiler for most Linux distributions. The compilers are open source and 100% free according to the definitions set out by the Free Software Foundation. The GNU Compiler Collection has been ported to the majority of architectures, including ARM, Intel and Power, although it is not specifically optimised for any one of them. The GNU Compiler Collection also supports MPI through the mpicc, mpicxx and mpif90 compiler wrappers.

23 Intel

Intel provide a compiler collection for use on Intel-based systems such as Eddie, optimised for the x86 architecture. The optimised binaries produced should give better performance than those from a general-purpose compiler, as they will make use of all the features available.

Portland Group

The Portland Group produces compilers for use across a large number of systems. The compilers are not optimised for any particular system; they can produce performance either better or worse than the GNU Compiler Collection. As part of an earlier project, optimisations were carried out on a simplified Molecular Dynamics code. Before any changes were made to the code, it was compiled with both GCC and the Portland Group compiler (PGCC); the choice of compiler alone changed the performance of the code by a factor of six. The choice of compiler will therefore significantly affect not only the performance, but also the amount of power used.

5.5 Message Passing Interface

The MPI Application Programming Interface is a specification with many different implementations. Before the specification was defined, message passing was provided by several different, competing libraries. Most vendors and academic groups developed their own library; each library was optimised for different systems and used its own communication model. The huge differences between the libraries made application development an extremely hard task, as different libraries often required the communication models to be encapsulated in modules.

In an attempt to standardise the communication model, the MPI standard was created. The standard has been implemented in numerous different libraries, the best known being MPICH2 and OpenMPI. The first standard was defined by the MPI Forum in 1994, with a subsequent revision released in 1995. The specification was extremely large, as it needed to contain support for all of the different models used by the earlier libraries. The specification left a lot of scope for implementation choices, still allowing optimised versions to be created for specific hardware and systems.

The second version of the standard was created in 1997, and a draft of version 3 has just been released. The second standard added more functionality, including single sided communications. The third standard is likely to include non-blocking collectives.

Figure 5.1: MPI Family Tree (David Henty, EPCC)

5.5.1 MPICH2

MPICH2 was until recently almost the default MPI library in use. The library fully supports the MPI 2 standard, and was completely rewritten from MPICH 1. The maturity of MPICH has given it good support for generic clusters, and it can easily be ported to new hardware. This ease of porting has led to many vendor-specific implementations being based on MPICH.

5.5.2 OpenMPI

OpenMPI is an open source implementation of the MPI 2 standard. It began as a group project that merged four of the first implementations of MPI. The project has now grown and has a large number of partners, including ARM and Mellanox. OpenMPI is growing in popularity, and with more partners joining the project it will continue to grow. Each of the partners brings their own expertise, and so the level of support and performance continues to increase. It is likely that the four original MPI implementations will eventually be replaced entirely by OpenMPI.

Chapter 6

Benchmarks

6.1 HPCC

The High Performance Computing Challenge (HPCC) benchmark suite was created after researchers from the DARPA High Productivity Computing Systems programme began to reassess the metrics used to define the performance of HPC systems [18].

The HPC Challenge benchmark suite consists of seven different benchmarks, each designed to test a different part of a system. They were chosen to cover both the spatial and temporal locality space. While the benchmarks provided in HPCC were available before the suite was created, it is the verification and reporting tools that make the results from the suite more consistent and therefore allow better comparisons between systems.

6.1.1 High Performance Linpack

The High Performance Linpack benchmark is the standard benchmark used to measure the performance of an HPC system. It is used by both the Top 500 [1] and the Green 500 [2] to compile their rankings. The HPL benchmark is based on the original Linpack benchmark, measuring performance by solving a system of linear equations using LU factorisation.

6.1.2 Bandwidth and Latency

A set of tests measures the bandwidth and latency of the system using different communication patterns. The two patterns are a single pair of processes, and all processes communicating in parallel in a ring.

Ping-pong communication is used to test the latency and bandwidth between several different pairs of processes. The results of all of the pairs are compared and the highest latency and lowest bandwidth are reported. It is not practical to run the ping-pong test on every pair of processes, so only some of the pairs are tested. The processes are then arranged into a ring, and each process sends and receives messages with both of its neighbours. The ring is ordered in two ways: first the processes are placed in order of rank in MPI_COMM_WORLD, and second a randomly chosen order is used; the random test is repeated ten times and the geometric mean returned. The benchmark returns several important results: maximal ping-pong latency, minimal ping-pong bandwidth, the average latency of randomly ordered rings, the bandwidth per process in the naturally ordered ring, and the average bandwidth per process of randomly ordered rings.

6.1.3 Other Benchmarks in the HPCC Benchmark Suite

Benchmark                   Details
Stream                      Measures the sustainable memory bandwidth of a system along with the computational rate of four operations: copy, scale, add and triad.
Fast Fourier Transform      Completes a one dimensional discrete Fourier Transform, measuring the floating point execution rate.
DGEMM                       Measures the floating point rate of execution of matrix-matrix multiplication.
Parallel Matrix Transpose   Measures the bandwidth achieved when exchanging the large messages needed to transpose a matrix; useful for testing the total communication capacity of a system.
Random Access               Measures the rate of updates to memory locations picked at random.

Table 6.1: Other HPCC Benchmarks

6.2 LMBench

The LMBench benchmark suite was developed by Larry McVoy and Carl Staelin to measure a wide variety of metrics including bandwidth, latency and performance. It was intended to run on as many systems as possible without modification. Each of the benchmarks was chosen to be part of the suite to “represent an aspect of system performance which has been crucial to an applications performance”. [19] [20]

6.3 Coremark

The Coremark benchmark [21] produces a single number to allow a direct comparison of processor performance. CoreMark was developed by the EEMBC specifically to test the functionality of a processor.

6.4 Validation of Results

The results of all of the benchmarks must be validated in order to minimise the risk of a result being a random event, since an event or process occurring during runtime may affect the results produced by the system. Each benchmark will therefore be run three times and the results compared; if the results are similar they will be taken to be correct. Different measurements will be taken and logged during runtime. It would not be logical to take an average of the results, as the same operation could not be applied to all of the measurements taken; instead a single representative result will be reported.
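
A minimal sketch of this procedure is shown below; the benchmark name and log file layout are illustrative assumptions rather than the exact scripts used in the project.

# Run a benchmark three times, keeping a separate log of each run so the
# results can be compared afterwards.
for run in 1 2 3; do
    ./benchmark > results_run${run}.log 2>&1    # hypothetical benchmark binary
done
grep -H "result" results_run*.log               # inspect the three results side by side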

Chapter 7

Project Preparation: Hardware Decisions and Cluster Building Week

7.1 Cluster building Week & Lessons Learned

During project preparation the Innovative Learning Week cluster building challenge took place. Working in a group of four, the challenge was to build a cluster using the prototype nodes from EDIM1. The cluster was built using the cluster management tool Rocks, which installs the Sun Grid Engine and MPI across the frontend and all of the compute nodes using PXE booting. Most of the first day was spent installing and re-installing the Rocks cluster package. The first install failed to properly initialise the network, and as a result none of the compute nodes could be recognised. A corrupted disk and some sloppy hard drive partitioning caused the second and third attempts to be largely unsuccessful. Finally, Rocks was installed and all of the compute nodes were connected and communicating. It was clear that building a cluster, even using Rocks, was not simple, and these problems changed the workplan of the project to allow more time to build the ARM cluster. The second day focused on fixing an MPI bug which resulted in jobs being run on a single node. The first attempt involved removing the existing version of the library and installing an older version using the Red Hat Package Manager. This was done on each node using the "rocks run host" command, but some artefacts from the previous library caused errors when compiling. Instead of installing the older library on each compute node individually, a new Rocks distribution was then created and installed on each of the nodes. This was successful on all but one of the compute nodes, and it was quite a hard task to diagnose that this last node had failed to receive the update. Once the problem was identified, the command to install the new distribution was given to that specific node, and at this point everything was functional. The Sun Grid Engine was modified to include the head node

when distributing jobs. This brought the total core count from 16 to 18. Getting the High Performance Linpack benchmark to run did not cause any problems, but the results were not very impressive, giving a performance of less than 1 Gflop when run using the provided HPL.dat file. It was suggested that this was because the problem size was too small, and when Linpack was run on a much bigger problem the results improved to nearly 3 Gflops. A power meter was attached to the cluster; since only one was available, only the total power could be measured. When running Linpack a clear jump in power could be seen as the cluster moved from being idle to computation. The flop per watt rate was calculated, showing that the system was not efficient and would not have made it into the Green 500. The experience of building a cluster, even when using Rocks, was an interesting one. Many unexpected issues came up, many lessons were learned, and as a consequence the workplan for the project was adjusted to allow more time to build the cluster.

7.2 ARM: Board Comparison

There are an extremely large number of ARM based processors available, and for each processor there are many boards available. It was important to compare and contrast as many options as possible to make the best choice of hardware for the project. The most obvious difference between the boards is the processor, which varies in architecture and in the number of cores. Kritikakos made use of the OpenRD board, which contained a single core Marvell 88F6281 based on the ARMv5 architecture. This project builds on Kritikakos' work by using newer multi core ARM chips to compare the power consumption as the number of cores is scaled. In order to make a comparison between the different systems, the memory and network capabilities must be taken into account. In the small selection of boards compared, the memory provided varied by a factor of sixteen. In terms of networks there are boards using 10/100 Ethernet and gigabit Ethernet, as well as one board that uses USB. The difference in the speed of communication will change the performance of the systems, skewing the results. Finally, the boards must be available within the time-scales of the project. Several boards were due to come onto the market in the coming months; while some of these may have been a better choice, they had to be discounted as they would not arrive in time. One of these was the Nvidia Carma board, as its release date was unknown. A comparison of the boards considered is given in table 7.1.

Board           Processor        Cores  Memory       Network
OpenRd          Marvell Sheeva   1      512Mb DDR2   Gigabit Ethernet
Panda                            2      1024Mb DDR2  10/100 Ethernet
Cotton Candy    ARM A9           4      1024Mb DRAM  USB 2.0
Snowball        Nova             2      1024Mb DDR1  10/100 Ethernet
Samsung Origen                   2      1024Mb DDR3
Raspberry Pi                     1      256Mb SDRAM  10/100 Ethernet
Swift                            2      1024Mb DDR2  Gigabit Ethernet
Quadmo MX6                       4      4096Mb DDR3  Gigabit Ethernet
Quadmo Omap4                     4      1024Mb DDR2  10/100 Ethernet
Quadmo T30                       4      2048Mb DDR3  Gigabit Ethernet
Trim Slice      Tegra 2          2      1024Mb DDR2  Gigabit Ethernet

Table 7.1: Comparison of ARM Boards

After a lengthy comparison of the ARM boards' memory and networks, and taking into account the cost of each board, a decision on which board to use for the project was reached. Several boards were discounted due to their release dates, including all three Quadmo boards. Several others had similar networks and memories, which left only the price to compare; as a consequence several more boards were removed, including the Trim Slice. In the end the decision was taken to buy the Cstick Cotton Candy: despite not having an Ethernet connection, its quad core processor made it the best choice. As the Cotton Candy was not available at the time, a back up was chosen from the boards that were available, to minimise the risks to the project should the Cotton Candy be delayed.

7.2.1 Cstick Cotton Candy

The Cstick Cotton Candy is a computer the size of a memory stick. It contains a quad core ARM chip alongside a GPU, and was designed to be a companion system to other devices such as smart phones, tablets and PCs for accessing cloud services. The Cotton Candy has several operating systems available, including Android. Network connections are made through the wireless connection or over USB.

Figure 7.1: Cotton Candy (www.fxitech.com)

7.3 Changes to the ARM Hardware Available

In the weeks that followed a new board was announced, the Seco Qseven, which was a much better option. The Qseven is identical to the Nvidia Carma board that was discounted earlier in the project. The unknown release date of the Qseven significantly increased the risk that the hardware would be unavailable, and in the weeks that followed the Qsevens were delayed on several occasions. During the time spent waiting for the Qsevens to arrive a Raspberry Pi became available, and soon after a Pandaboard. Time was spent working on both boards, getting benchmarks to compile and run in readiness for the Qsevens arriving. When it became clear there was a real risk of the Qsevens not arriving, the decision was taken to allocate the ARM hardware that was available between the three projects that were to make use of it. This project was allocated a single Pandaboard, to allow preparation for the Qsevens. Two more Raspberry Pi boards arrived, but all three were allocated elsewhere due to the hardware requirements of the other projects. As there was a substantial risk of not being able to do any scaling on ARM, there was a suggestion of buying more Pandaboards solely to allow scaling on an ARM system. Due to the time constraints of the project, combined with the numerous delays in the delivery of ARM hardware, the decision was taken to change the aims of the project to focus entirely on a comparison of the Intel Xeon and Intel Atom processors.

7.3.1 Raspberry Pi

The Raspberry Pi board is a new ARM based board, designed with the intention of being a low priced board to help get more school children into programming. The idea was born in the University of Cambridge, after a group of colleagues became concerned about the programming ability of students arriving to study Computer Science.

Figure 7.2: Raspberry Pi Board (Raspberry Pi Foundation)

Figure 7.3: Pandaboard (Pandaboard.org)

The board is the size of a credit card and, using the RCA video jack or HDMI port, can be plugged into almost any display device, making it an ideal platform for young children to start programming. The Raspberry Pi makes use of a Broadcom System on Chip based on the ARM11 classic processor family. While the ARM11 does not have the performance of the newer ARM processors, it is extremely energy efficient. The board also has a GPU and 256MB of RAM, and is powered over micro USB.

7.3.2 Pandaboard ES

The Pandaboard ES is a low power board based on the OMAP 4460 SoC. The OMAP4 System on Chip contains two ARM A9 cores; the ARM A9 is the latest ARM processor to be released, although it will soon be followed by the quad core A15. The board is about 12cm square and has HDMI, serial and mini USB connections, and a 10/100 network connection allows the boards to be connected to a network. Each Pandaboard has 1GB of DDR2 memory, four times the amount of the Raspberry Pi.

Figure 7.4: Seco Qseven (www.secoqseven.com)

7.3.3 Seco Qseven - Quadmo 747-X/T30

The Seco Q7, shown in figure 7.4, is a self contained board; it is named the Qseven as it is 7cm square. The Quadmo 747-X/T30 version contains the Nvidia Tegra 3 processor, with four ARM A9 cores, combined with Gigabit Ethernet and 2GB of DDR3 memory, giving it greater capabilities than the Cotton Candy, Pandaboard or Raspberry Pi. The Seco Qseven was due to be released in June 2012 alongside the Nvidia Carma board, but both have been delayed on several occasions, and as of August 2012 a release date has still not been confirmed.

7.4 Xeon: Edinburgh Compute and Data Facility

The compute component of the Edinburgh Compute and Data Facility (ECDF), also known as Eddie, is a computer system based at the University of Edinburgh. Eddie is used by researchers across the university to run large programs in both serial and parallel.

Figure 7.5: Eddie at ECDF

Eddie, shown in Figure 7.5, consists of 286 nodes containing two different types of processor. The first 130 nodes contain two quad core Intel Westmere processors, while the remaining 156 contain two six core Intel Westmere processors. There are two networks connecting the nodes: gigabit Ethernet is used as the basic network on both phases of Eddie, and phase two also has an Infiniband network used for high bandwidth, low latency communication.

7.5 Atom: Edinburgh Data Intensive Machine 1

The Edinburgh Data Intensive Machine (EDIM1) has been developed by EPCC and the School of Informatics at the University of Edinburgh. EDIM1 consists of one hundred and twenty nodes, each containing a dual core Intel Atom processor. The system is divided into six groups of twenty nodes, each of which has its own power distribution unit (PDU).

7.6 Power Measurement

7.6.1 ARM

A “Watts Up? .net” power meter, shown in figure 7.7, is used to log the power consumption of the ARM cluster. The Watts Up USB logger software was installed on a Windows machine; it allows the power consumption to be logged and viewed in either table or graph form. The setup is shown in figure 7.6.

Figure 7.6: Power Measurement Setup on the ARM Cluster

Figure 7.7: Watts Up Power Meter used on the ARM Cluster

7.6.2 ECDF

Power measurement was set up to poll the power usage of each node at intervals of one minute. The granularity of the power measurements caused some issues, as all of the benchmarks needed to run for at least several minutes, preferably half an hour, in order for the power measurements to be viable. The power measurements for each of the eight nodes available to the project are written to a file every minute. The setup of the system is shown in figure 7.8. The needed data is then collected through a number of scripts, which select the required data based on some parameters before manipulating it into the needed format. An example script is shown in listing 7.1.

#!/bin/bash
# Extract the power readings for one node (passed as $1, e.g. eddie400)
# from a week of daily power logs.
grep $1 power.2012-07-12 > 40012
grep $1 power.2012-07-13 > 40013
grep $1 power.2012-07-14 > 40014
grep $1 power.2012-07-15 > 40015
grep $1 power.2012-07-16 > 40016
grep $1 power.2012-07-17 > 40017
grep $1 power.2012-07-18 > 40018
# Keep only the power value (third field) from each day's readings.
awk '{print $3}' < 40012 > 40012p
awk '{print $3}' < 40013 > 40013p
awk '{print $3}' < 40014 > 40014p
awk '{print $3}' < 40015 > 40015p
awk '{print $3}' < 40016 > 40016p
awk '{print $3}' < 40017 > 40017p
awk '{print $3}' < 40018 > 40018p
# One column of timestamps followed by one column of readings per day.
paste timesnew 40012p 40013p 40014p 40015p 40016p 40017p 40018p > $1week

Listing 7.1: Sample Script for Processing Power Measurements

Figure 7.8: Power Measurement Setup on Eddie

2012-07-12-12:38 eddie465 220
2012-07-12-12:38 eddie466 170
2012-07-12-12:38 eddie467 170
2012-07-12-12:38 eddie468 210
2012-07-12-12:38 eddie400 230
2012-07-12-12:38 eddie404 230
2012-07-12-12:38 eddie402 220
2012-07-12-12:38 eddie403 220
2012-07-12-12:39 eddie465 220
2012-07-12-12:39 eddie466 170
2012-07-12-12:39 eddie467 170
2012-07-12-12:39 eddie468 210
2012-07-12-12:39 eddie400 260
2012-07-12-12:39 eddie404 250
2012-07-12-12:39 eddie402 220
2012-07-12-12:39 eddie403 250

Listing 7.2: Sample Power Output File from Eddie

Using bash scripts, along with the times at which a job started and finished, the power consumption of a job can be found. As each node's power consumption is reported separately, the total power consumption of a job is calculated as the sum of the figures for all nodes used.
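
A minimal sketch of that summation is shown below; the node names and time window are hypothetical values chosen only to illustrate the log format of listing 7.2.

# Sum the logged power (third field) for the nodes used by a job, restricted to
# the job's start and end timestamps; the timestamp format sorts lexicographically.
start="2012-07-12-12:38"; end="2012-07-12-12:39"
awk -v s="$start" -v e="$end" \
    '$1 >= s && $1 <= e && ($2 == "eddie400" || $2 == "eddie404") { total += $3 }
     END { print total " watt-minutes" }' power.2012-07-12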

7.6.3 Edim1

The project has been allocated twenty nodes on the Edim1 machine. The machine is set up with six groups of twenty nodes, each group supplied by a single PDU. The power supplied by the PDU is logged using an internet based logger. The logger polls the level of current supplied. Using the current and the supply voltage the power

supplied can be calculated. In order to verify that the calculation was accurate, the result was checked against a single node from EDIM1. Using the Watts Up power meter the power of a single node was found to be around 60 watts, and the supply voltage was found to be 237 volts. The logger showed the current to be 4.7 amps while seventeen of the twenty nodes were operational, giving a total power of 1114 watts and a power per node of around 65 watts. The error can be taken into account to give a more accurate reading. The output of the internet based logging system is shown in figure 7.9, alongside a diagram showing the setup in figure 7.10.
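
The calculation itself is a single multiplication and division, sketched below with the figures quoted above.

# Power from the PDU log: supply voltage x logged current, then per operational node.
awk 'BEGIN { v = 237; i = 4.7; nodes = 17;
             printf "total %.0f W, per node %.1f W\n", v*i, v*i/nodes }'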

Figure 7.9: Power Logging on Edim1

Figure 7.10: Power Measurement Setup on Edim1

Chapter 8

Hardware Setup, Benchmark Compilation and Cluster Implementation

8.1 Edinburgh Compute and Data Facility

Each node of Eddie uses two Intel Westmere Xeon processors, based on the x86 architecture. This architecture is extremely well supported by compilers, libraries and applications. ECDF has a large number of modules and libraries pre-installed ready for use by users of the system; each of the modules must be loaded at the start of a session. Compilation of the Stream, LMBench and CoreMark benchmarks did not encounter any problems: using the makefiles provided, the benchmarks could easily be compiled and run on the frontend nodes. However, compilation of the HPCC and HPL benchmarks was not as simple. Both required changes to the makefile to set the paths to the libraries used, the major ones being the MPI and linear algebra libraries. Locating the libraries was not an easy task, as numerous versions of each library are installed across the machine. The MPI library was located without any major difficulty, but the CBLAS library was not found; in order to rectify the problem the CBLAS library was installed in the home directory and the path modified to reflect its location. A section of the makefile is shown in the appendix. In order to take power measurements the benchmarks needed to be run on the back end nodes of the system, and each job had to be submitted through the Grid Engine using a job script. The project had access to two queues set up through SGE, one using the Gigabit Ethernet network and the other the Infiniband network. The queues each had access to four nodes, giving access to a total of forty eight Intel Xeon cores.

In addition to the queues, the project was given a priority setting to allow jobs to run with guaranteed priority.
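
As an illustration of how jobs were submitted through the Grid Engine, a minimal job script is sketched below. The queue name, parallel environment name and module name are assumptions for this sketch only, not the actual names configured on Eddie.

#!/bin/bash
# Hypothetical SGE job script requesting 24 slots on an (assumed) Ethernet queue.
#$ -cwd
#$ -N hpl_eth_24
#$ -q ethernet.q                 # assumed queue name
#$ -pe mpi 24                    # assumed parallel environment name
#$ -l h_rt=02:00:00
module load openmpi              # assumed module name
mpirun -np $NSLOTS ./xhpl        # NSLOTS is set by SGE to the granted slot count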

8.1.1 Problems Encountered with ECDF

There were several problems encountered in using the ECDF system during the project. The biggest was getting jobs to run on the system: there are a large number of users, many of whom run large and long jobs, and as the project did not have exclusive access to the eight nodes assigned to it, many jobs needed to wait until resources were available. Jobs could wait from a few hours to a few days, limiting the ability to run large numbers of benchmarks. During the project it was discovered that the waiting times of many jobs were longer than should be expected. The group that runs the ECDF system later discovered that there was a problem with the Grid Engine: it was not allocating resources correctly, which caused many jobs to wait for excessive amounts of time. The more resources a job needed, the longer it appeared to wait, hampering the project's ability to run benchmarks at higher numbers of cores. During the project two of the eight nodes went down for several days. The loss of two nodes had a large effect on the project, as one node was lost on each of the two queues, removing all ability to run benchmarks on more than 36 cores. In the last few weeks of the project many of the issues in running jobs on Eddie improved significantly; jobs on both queues began to run in much shorter periods of time, although this was not guaranteed, and the time required to run and verify a set of scaling benchmarks was still several days. Due to the short amount of time left in the project, only a limited amount of investigation into anomalous results could be carried out.

8.2 Edinburgh Data Intensive Machine 1

The project was given exclusive access to twenty dual core Atom nodes on the EDIM1 machine, and also had access to a frontend node. The machine was built with data intensive computing in mind and so had no tools such as a batch system or MPI installed; however, the system did have ssh keys set up to allow secure shell logins without the need to use passwords. The machine used Rocks Linux, which had also been used in the cluster building challenge week. The experience of using Rocks allowed efficient installation of the libraries across all of the nodes: the Rocks run host and iterate commands allowed the same command to be run across all of the compute nodes without the need to leave the frontend.
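
Where those Rocks commands are not available, the same effect can be achieved with a plain ssh loop; the sketch below is a generic equivalent under the assumption of passwordless ssh and Rocks-style node names, not the Rocks command itself.

# Run one command on every compute node from the frontend over passwordless ssh.
for n in $(seq 0 19); do
    ssh compute-0-$n "tar xzf /share/libs.tar.gz -C /opt"   # example command only
done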

8.2.1 Problems Encountered with EDIM1

The issues encountered with EDIM1 centred around the power measurement used on the system. The system logs the electrical current used by the entire group of twenty nodes each minute, and the power is then calculated as the product of the current and the voltage; as the voltage is not logged it was assumed to be constant. To verify that the calculation gave an accurate result, the power consumption of a single node was measured directly using the power meter that was to be used for the ARM cluster. The results showed that the idle power calculation was a good approximation to the real power of a single node. However, when running benchmarks it became clear that there was a problem: the power consumption did not change when any of the benchmarks were run. The other PDUs on the system were checked to ensure that the correct PDU was being used, which was found to be the case. The benchmarks were then run on the single node to get more accurate readings, and the problem was soon revealed to be the accuracy of the logs. The logging system measures the current to an accuracy of a tenth of an amp, but when running the CoreMark benchmark it was found that the current changed by only milliamps, while the voltage changed by several volts during runtime. In order to get more accurate power measurements, and so allow scaling up to higher numbers of cores, it was decided to use a Watts Up power meter identical to the one used on the ARM cluster. The power meter was placed between four of the nodes on EDIM1 and the PDU that supplies all of the nodes with power; there was a limit of four nodes as this was the biggest multi-way adapter that could be found at short notice. The new setup is shown in figure 8.1. The meter was attached to node compute-0-0 through the USB interface to allow logging to take place, and the Linux Watts Up utilities were used to log the power from the command line. Compute-0-2 went down due to technical issues, so it was not attached to the power meter. The node was one of five to fail on EDIM1 during the project, reducing the number of cores that could be used for scaling by ten.

Figure 8.1: Replacement Power Measurement Setup on Edim1

8.3 Raspberry Pi

A single Raspberry Pi board was installed with the provided Debian Linux distribution. The distribution only had the default libraries and compilers available, so in order to run all of the benchmarks several libraries and compilers needed to be located and installed for the ARM based computer. One of the major problems Kritikakos faced was the lack of support, in terms of libraries and compilers, for ARM based systems; by far the biggest issue was the lack of a Fortran compiler. Since Kritikakos finished his project there has been growing support for ARM based systems, as they are seen as a possible technology in the development of exascale systems, and in the past few months GNU have released the gfortran compiler for ARM based systems. The experience from compiling HPL on Eddie was of great benefit, and several errors that had arisen on Eddie were quickly solved. The only major problem that arose was linking to the MPI libraries: on Eddie the dynamic library was used, but only the static libraries were available on the Raspberry Pi. Linking only to the mpich2.a file did not allow the benchmark to compile; after researching the problem it was found that the MPL.a library also needed to be included. Unfortunately, no benchmarks were run on the Raspberry Pi before it was allocated to another project.
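
A sketch of the kind of link line involved is shown below. It assumes the usual MPICH2 static library names (libmpich.a and libmpl.a); the install prefix, object files and BLAS libraries are illustrative assumptions rather than the exact ones used on the Raspberry Pi.

# Link the HPL objects against the static MPICH2 libraries; -lmpl must follow
# -lmpich because libmpich.a depends on symbols provided by libmpl.a.
MPIDIR=$HOME/mpich2
gcc -o xhpl *.o \
    -L$MPIDIR/lib -lmpich -lmpl \
    -L$HOME/CBLAS/lib -lcblas -lblas \
    -lgfortran -lm -lpthread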

8.4 Pandaboard ES

The project planned to use the Pandaboard to allow benchmarking of the system, at least in serial. It was believed that all of the cables necessary to start the Pandaboard had been collected. The Linux image was burned onto the memory card using the commands shown in the tutorial on the Pandaboard website. Having connected the Pandaboard up, it was switched on for the first time; the status lights flashed to show that it was starting up, but no display was shown. The two students involved in projects making use of Pandaboards both attempted to get the board to start up, and both agreed that a serial cable was needed in order to see the display. Having gained access to a serial to USB converter, the display could be seen using the minicom terminal, and the setup for Ubuntu then asked for input to select a language. Both students attempted various solutions to the problem, including different terminals, writing directly to the serial port and using a Windows machine, but were unable to send input to the board. Eventually, through some luck and by directly changing the root password in the system files, access was gained to the board by skipping the setup.

As a consequence of skipping the setup, several settings, including the network configuration, the installation of the included packages and the host name, needed to be configured manually. The first task was to install the compilers, including gcc, g++ and gfortran; these were all available as Debian packages, allowing them to be installed using the apt-get command. In order to control which libraries were installed on the Pandaboard, the same tar files that had been used on Eddie, EDIM1 and the Atom node were used again. Having installed these libraries several times already, many of the issues could be avoided.
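
Installing the compilers on a Debian or Ubuntu based board reduces to a single package manager command; the sketch below shows the form used, with the package names as they appear in the Debian archive.

# Install the C, C++ and Fortran compilers from the Debian/Ubuntu repositories.
sudo apt-get update
sudo apt-get install -y gcc g++ gfortran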

8.5 Qseven / Pandaboard / Atom Cluster

A small cluster was built consisting of an Intel Core 2 frontend, an Intel Atom node from EDIM1 and a Pandaboard. The first step was to set up the frontend. It was decided to use Ubuntu Server 12.04: Ubuntu Server is command line based, so there is no graphical environment to affect the performance of the cluster. Installing Ubuntu was straightforward other than some issues with the network configuration; it turned out that the wrong Ethernet port was being used to connect to the EPCC network. The next step was to install the compilers for C, C++ and Fortran. The GNU C compiler GCC was already installed on Ubuntu Server, and the experience gained with the Raspberry Pi allowed the Fortran and C++ compilers to be installed using the Debian package manager through the apt-get command. The next step was to install the libraries needed by the benchmarks; having installed and used these libraries on Eddie and the Raspberry Pi, installation was simple.

Figure 8.2: ARM Cluster

In order to submit jobs to the compute nodes a batch system was needed, and there are several different batch systems that could be used. The system used in the cluster building week and on Eddie is the Grid Engine made by Oracle, better known as the Sun Grid Engine (SGE). In the project completed by Kritikakos last year, he made use of the Torque batch system. As Torque is open source and had been proven to work with ARM hardware during Kritikakos' project, the decision was taken to use Torque. The Torque batch system has an inbuilt scheduler, but it is extremely basic, so it was decided to install Maui, a scheduler known to work well with Torque. The installation of the latest version of Torque was not a simple task: almost all of the documentation available online was based on an older version, so the tutorials had to be adapted to the latest version, in which several changes had been made. Despite a few minor mistakes, eventually the Torque server was up and running. When looking into installing Maui, it was found that it had been replaced by Moab, which was not available to download for free; the last version of Maui was still available from the clusterresources.com website. Using the available tutorials, attempts were made to install the Maui scheduler. Unfortunately, errors appeared during compilation relating to variables being declared in two places with two different data types: the variables were declared in both Maui and Torque, once as int and once as unsigned. Several other developers had reported the same issue, and the simplest suggestion was to change the data types in Maui from int to unsigned. After making the change in the code, Maui compiled with no other errors. Unfortunately, despite all of the processes appearing to be running correctly, the Maui scheduler did not allow jobs to run. After more research the decision was taken to remove both Torque and Maui and to reinstall an older version of Torque; Maui was not reinstalled, as the project could make use of the more basic scheduler provided with Torque. Installation of the older version of Torque was much simpler and quicker than the new version. In order to allow the compute nodes to connect to the internet through the frontend, the iptables rules were changed so that the frontend would forward packets to and from the other nodes.
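
A minimal sketch of that forwarding setup is given below. The interface names are assumptions (eth0 facing the EPCC network, eth1 facing the cluster switch), and the rules are the standard NAT and forwarding combination rather than the exact rules used on the frontend.

# Enable IP forwarding and masquerade cluster traffic out of the external interface.
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
sudo iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
sudo iptables -A FORWARD -i eth0 -o eth1 -m state --state ESTABLISHED,RELATED -j ACCEPT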

8.5.1 Compute Nodes

The Atom node is a prototype node from the EDIM1 system, containing a dual core Intel Atom. As the node came from EDIM1, it initially attempted to PXE boot Linux from the master node, so the first step was to install Ubuntu Server on it from scratch.

Figure 8.3: Atom Node on ARM Cluster

During setup the node was connected to an eight port gigabit Ethernet switch. The IP addresses could be set up manually, as all of the nodes to be connected to the cluster were known. The gateway was set to be the frontend, and the ping command was used to test that the node could connect to the internet. To test that the network was working correctly, the master node was used to ping the IP address of the Atom node, before pinging the master node in the opposite direction. As both pings were successful, the host list was updated on both nodes before setting up ssh connections with RSA keys. Finally, the Torque system needed to be set up. Originally the newest version was installed, matching the version installed on the master node, and the Atom node could be seen by the Torque server on the master node. When the decision was taken to change the version of Torque, it was also changed on the node. Once the Pandaboard was up and running, its network configuration, Torque and the libraries were installed without a problem. Had more nodes been available, the setup would have followed a similar set of steps to the previous two nodes; once one node is set up, it would be possible to clone the system for use on other identical nodes, leaving only minor changes to be made.
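
The per-node network and ssh setup follows the usual Ubuntu Server pattern, sketched below. The addresses, interface name and user name are illustrative assumptions, not the values used on the cluster.

# /etc/network/interfaces entry for a compute node with a static address,
# using the frontend as the default gateway (addresses are examples only):
#   auto eth0
#   iface eth0 inet static
#       address 192.168.1.11
#       netmask 255.255.255.0
#       gateway 192.168.1.1
# Then check connectivity and set up passwordless ssh with RSA keys.
ping -c 3 192.168.1.1
ssh-keygen -t rsa
ssh-copy-id user@192.168.1.1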

Chapter 9

Results

9.1 Idle Power

In order to take account of the hardware differences between the nodes of Eddie and EDIM1, the idle power must be measured. The idle power is the power consumption when the node is not in use, measured while no processing is taking place other than that required by the operating system. The results are shown in figure 9.1: the Xeon node in use on Eddie uses almost twice the power of the Atom node. In order to take a reading from the Xeon node, a sleep job was originally submitted; while the sleep function uses minimal power, it did not give an accurate measure of the idle power. During the project it became apparent, using the qhost command, that the load on node eddie466 was at zero for a considerable amount of time, showing that no processing was taking place. The power measurement was continuous for the length of the project and continued to record the power of the node despite no jobs running; when the power logs were checked, the results showed that the idle power was 100 watts. The same measurements were taken using the single Atom node that was built into the cluster constructed for the project. Using the Watts Up .net power meter the idle power was measured for an hour, giving a result of around 50 watts. The results are shown in table 9.1, where they are compared to the results from Kritikakos' project in 2011. The power consumption of the Atom nodes has increased between the two projects; the nodes used in this project were part of the EDIM1 system, and each contains several hard drives used for data intensive research, which may account for the increase. While as a whole the Xeon node uses far more power than the Atom node, it is important to notice that the difference in power consumption does not reflect the difference in the number of cores each contains.

Figure 9.1: Comparison of Idle Power on Xeon and Atom Nodes (power in watts against time over one hour)

Year                 2011              2012              2011              2012
Processor            E5507             E5645             Atom              Atom
Cores                4                 6                 2                 2
Sockets              2                 2                 1                 1
Speed                2.26 GHz          2.4 GHz           1.80 GHz          1.80 GHz
Memory               16GB DDR3         24GB DDR3         4GB DDR2          4GB DDR2
Network              Gigabit Ethernet  Gigabit Ethernet  Gigabit Ethernet  Gigabit Ethernet
Idle Power           118 Watts         100 Watts         44 Watts          53 Watts
Idle Power per Core  14.75             8.3               22                26.5

Table 9.1: Idle Power Comparison

The expected result would be that the Xeon node, containing six times more cores than the Atom, should use around six times the power. However, the Xeon node uses less than twice the power, suggesting that the Xeon is a much more efficient processor. These findings were also shown in the project completed by Kritikakos in 2011, where the Xeon used only just over twice the power of the Atom board. Another notable result is that, despite the speed and number of cores increasing, the power consumption of the Xeon node has decreased by 18 watts; when the idle power per core is compared, it is seen to be roughly half the 2011 result.

9.1.1 How Idle Power changes with errors

During the project there were issues in taking power measurements on EDIM1. While the Watts Up meter was attached to the system, one of the nodes developed an error. The error was noticed because the idle power had risen from around 150 watts to 180 watts, and at its highest the apparent idle power reached 240 watts. A ten minute sample was taken and plotted, as shown in figure 9.2; to allow a comparison to be made, a second line has been plotted at 150 watts. It is believed that the error was caused by MPI processes failing to be cleaned up correctly, although it was not possible to confirm this as the node refused ssh connections. The single node used almost twice the idle power compared to normal operation. As systems get bigger they contain more nodes, and failures like this could increase the power usage of an HPC system by a tenth, should nodes fail at the same rate of one in twenty.

Figure 9.2: Idle power while the node had an error (power in watts over a ten minute sample, with a reference line at 150 watts)

9.2 CoreMark

CoreMark was run on both Eddie and EDIM1. The benchmark allows the number of iterations and the compiler flags to be set by the user. To get a full comparison the benchmark was compiled using the O1, O2 and O3 optimisation flags; there are further optimisation levels, but they were not used. The number of iterations was increased so that the runtime of CoreMark was about an hour, which was needed in order to get viable power measurements; to achieve this the number of iterations was increased to 30000000, with a second run of half that size used to reduce the running time on EDIM1. In the project completed last year, three different sizes of CoreMark were used, and to allow a comparison to be made between the results of the two projects these tests were repeated. Last year a parallel version was also run; it was decided not to repeat this, to allow the limited time to be focused on other benchmarks. The results for both the Xeon and the Atom are shown in figures 9.3 and 9.4 for each of the optimisation levels. The results for all of the sizes run were extremely similar. The effect of the optimisation levels is an increase in performance with each level, as would be expected; the optimisation appears to increase performance more on the Xeon than on the Atom.

Figure 9.3: Iterations Per Second for three optimisation levels (Xeon and Atom, O1, O2 and O3)

Figure 9.4: Iterations Per Watt for three optimisation levels (Xeon and Atom, O1, O2 and O3)

Figure 9.5: Iterations Per Second for Different Processors in 2011 and 2012 (Xeon, Atom and ARM)

Figures 9.5 and 9.6 show a comparison of the results of the CoreMark benchmark as run in 2011 and 2012. The Xeon shows an improvement on last year, which is expected as it is a newer version of the processor. The Atom, however, shows a decrease in performance despite the hardware being identical. There are many possible reasons that could explain the drop, but the most likely is that, while the processors were identical, the operating system was not, and there is no way to tell what difference the OS may have made to the results. When looking at the performance per watt, the results for the Xeon and Atom reflect the trends shown in figure 9.5. It is clear that, despite the improvement made by the Xeon, it still achieves only a fraction of the performance per watt shown by the single core ARM processor in 2011.

Figure 9.6: Iterations Per Watt for Different Processors in 2011 and 2012 (Xeon, Atom and ARM)

9.3 HPL

The HPL benchmark is used to compare the power usage and performance as the number of cores is increased. The runtime, performance, total power, average power and flop per watt rate were logged for analysis. It was not possible to complete full scaling runs of the benchmark; instead only a subset was completed, enough to show the general trend. Due to issues with the power logging on the EDIM1 system, only the time and performance could be logged there. The runs using the Intel compiler and the Ethernet network only reached 12 cores: this limited scaling was due to time constraints, as the larger jobs required several nodes which were not allocated in time for analysis.

9.3.1 Intel Xeon vs Intel Atom

HPL was run on both the Eddie and EDIM1 systems, with the network, compilers and process grids identical between the systems. The comparison of the runtimes is shown in figure 9.7 and the performance comparison in figure 9.8. The runtime for the Atom is much longer than that of the Xeon, which is expected as the clock speed of the Atom is lower. The lower performance of the Atom is in part due to the lower clock speed, but there is another factor: Eddie makes use of 4 nodes each containing 12 cores, while EDIM1

uses 20 nodes each with only 2 cores. The physical layout of cores on EDIM1 therefore requires more internode communication than on Eddie, and since internode communication takes longer than intranode communication, the more internode communication that takes place the more the performance is reduced. The performance of the Xeon shows continued improvement as the number of cores is scaled from 1 to 48. The Atom performance is shown up to 24 cores, but does not show the same rate of improvement. In order to make a more complete comparison of the Atom and Xeon as the number of cores is scaled, the systems used would need to have the same physical layout, and the differences in operating system and libraries would also have to be taken into account. The project made use of two systems run by other groups within the University of Edinburgh; both were already running and in use before the start of the project, so there was no possibility of changing the hardware layout or software of these systems.

Figure 9.7: Comparison of the runtime as the number of cores is scaled using different processors (time in seconds against number of cores)

9.3.2 GCC vs Intel

The benchmark was run using both the Intel and GCC compilers, and the comparison was made on both the Infiniband and Ethernet networks. For the Ethernet network, the comparison of runtime is shown in figure 9.17, the performance in figure 9.18, the average and total power in figures 9.15 and 9.16, and the flop per watt rate in figure 9.14. The equivalent comparisons for Infiniband are shown in figures 9.12, 9.13, 9.10, 9.11 and 9.9.

Figure 9.8: Comparison of the performance as the number of cores is scaled using different processors (Gflops against number of cores)

All of the plots show the same behaviour when a single node is used, suggesting that the compiler used to build the MPI library has no effect when only intranode communication takes place. There are notable differences when multiple nodes are used: the Intel compiler produces better performance, due to the lower runtimes, and it also follows that its average power use is higher than that of the GCC runs, while the lower runtimes mean that the total power when using the Intel compiler is lower. The performance of the Intel and GCC compilers is shown in figure 9.18. The performance is identical when a single node is used, as there is no internode communication; all communication is within the node, so the difference in the MPI library has no effect. As the number of cores is increased to make use of multiple nodes, the MPI library is used for communication, so the performance is affected.

9.3.3 Gigabit Ethernet vs Infiniband

The average power used per minute for each run is shown in figure 9.19 for the GCC compiler and in figure 9.27 for the Intel compiler.

Figure 9.9: Comparison of the Flop/Watt for different compilers using Infiniband

Figure 9.10: Comparison of the average power for different compilers using Infiniband

Figure 9.11: Comparison of the Total Power for different compilers using Infiniband

Figure 9.12: Comparison of the Runtime for different compilers using Infiniband

Figure 9.13: Comparison of the Performance for different compilers using Infiniband

Figure 9.14: Comparison of the Flop/Watt for different compilers using Ethernet

Figure 9.15: Comparison of the average power for different compilers using Ethernet

Figure 9.16: Comparison of the Total Power for different compilers using Ethernet

Figure 9.17: Comparison of the Runtime for different compilers using Ethernet

Figure 9.18: Comparison of the Performance for different compilers using Ethernet

Figure 9.19: Comparison of the Average Power for different networks

When only a single node is used, the Ethernet run uses more power than the Infiniband run; the greater power correlates with the performance graphs shown in figures 9.22 and 9.25. As the number of cores is increased beyond a single node, the average power per minute increases, with the Ethernet network using more power than the Infiniband. The total power is calculated as the sum of the power measurements taken every minute. Infiniband uses more power on a single node, as the benchmark took longer to run; the runtimes are shown in figure 9.21. Infiniband also uses more power in total when multiple nodes are in use, despite using less power each minute. The performance of the Infiniband and Ethernet networks is shown in figure 9.22. When using a single node the Ethernet runs produce better performance than the Infiniband runs, but when internode communication is involved the Infiniband overtakes the Ethernet at every point except when 48 cores are used. The performance gains for both networks show clear steps for each node: the Infiniband performance improves dramatically for whole nodes but shows smaller improvements when partial nodes are used, while the Ethernet performance shows the opposite pattern. The step pattern is due to the relative amounts of internode and intranode communication; as the number of cores approaches a whole number of nodes, internode communication becomes more dominant than intranode communication.

Figure 9.20: Comparison of the Total Power for different networks

Figure 9.21: Comparison of the Runtime for different networks

Figure 9.22: Comparison of the Performance for different networks

The Ethernet MPI produces better performance when only one node is in use, suggesting that it is better at intranode communication than the Infiniband MPI. Using whole nodes produces communication patterns dominated by internode communication; when a partial node is used this dominance is reduced, and it is at its smallest when a single core is used on the node. The change in the communication pattern causes the performance to change differently for each of the networks. The results using 48 cores appear to be anomalous: the performance of the two networks is reversed, with no obvious reason. Both tests used the same parameters, so the difference must be caused by the network or the hardware; had more time been available, further investigations would have been carried out to determine the cause. The flop per watt rate was calculated using the average power and the performance. As reflected in the other plots, the Infiniband had a lower rate than the Ethernet when only a single node was in use. As more nodes were used the Infiniband, which has better internode communication, performed considerably better than the Ethernet network. It was again noticeable that the results for 48 cores appear to be anomalous.

9.3.4 Scaling

The project has compared the performance of different networks, processors and compilers up to large numbers of cores. Running the HPL benchmark on larger numbers of cores allows another comparison to be made: how well each option scales in terms of performance and power.

Figure 9.23: Comparison of the Flop/Watt for different networks using the GCC Compiler

Figure 9.24: Comparison of the Runtime for different networks using the Intel Compiler

Figure 9.25: Comparison of the Performance for different networks using the Intel Compiler

Figure 9.26: Comparison of the Total Power for different networks using the Intel Compiler

Figure 9.27: Comparison of the Average Power for different networks using the Intel Compiler

Figure 9.28: Comparison of the Flop/Watt for different networks using the Intel Compiler

The Xeon processor scales better than the Atom, the Intel compiler better than GCC, and Infiniband better than Ethernet. These results are to be expected, although it is clear that the result using 48 cores is anomalous in all comparisons. When looking at the power consumption, it can be seen that as the number of cores increases the power drawn each minute increases, but due to the shorter runtimes the total power consumed over the run decreases. It is therefore better to run a code on a larger number of cores than a smaller one, as it will use less power in total. It is also clear that while choosing the best network and compiler will decrease the runtime, they will increase the total power used.
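
The trade off between higher instantaneous power and shorter runtime can be made concrete with a small calculation; the numbers below are purely hypothetical and are only meant to illustrate why the faster run can consume less power in total.

# Compare two hypothetical runs: total consumption = average power x runtime.
awk 'BEGIN {
    # run A: fewer cores, lower power draw, longer runtime
    print "run A:", 300 * 120, "watt-minutes";   # 300 W for 120 minutes
    # run B: more cores, higher power draw, much shorter runtime
    print "run B:", 800 * 30,  "watt-minutes";   # 800 W for 30 minutes
}'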

9.4 Stream

Processor  Size    Function  Rate (MB/s)  Avg. time
eddie403   500 MB  copy      7818.2200    0.0413
                   scale     7515.7899    0.0429
                   add       8239.3724    0.0587
                   triad     7605.2656    0.0635
Atom1      500 MB  copy      2085.1105    0.1531
                   scale     1916.6238    0.1679
                   add       2483.9630    0.2021
                   triad     1892.6728    0.2554
eddie466   500 MB  copy      7818.2200    0.0413
                   scale     7515.7899    0.0429
                   add       8239.3724    0.0587
                   triad     7605.2656    0.0635

Table 9.2: Stream Results

The STREAM benchmark measures the memory performance of the system, and was run on both Eddie and EDIM1. It was not possible to take power measurements for the STREAM benchmark, as the resolution of the power logging on both systems was too coarse. It can be seen from the results in table 9.2 that the Xeon outperforms the Atom on all four operations by approximately a factor of four. The performance of the memory will have a major impact on the overall performance of the system, in both runtime and power consumption.

68 Basic system parameters Host OS Description Mhz tlb cache mem scal pages line par load bytes eddie403 Linux 2.6.18- eddie403 2397 1 compute-0 Linux 2.6.37. x86_64- 1595 64 1.0000 1 Linux-gnu Table 9.3: LMBench: Basic system parameters

Processor, Processes - times in microseconds - smaller is better
Host       OS             MHz   Null  Null  Stat  Open  Slct  Sig   Sig   Fork  Exec  Sh
                                Call  I/O         Clos  TCP   Inst  Hndl  Proc  Proc  Proc
eddie403   Linux 2.6.18-  2397  0.27  0.31  1.92  4.59  3.28  0.34  1.29  161.  485.  1686
compute-0  Linux 2.6.37.  1595  0.38  0.75  5.92  11.8  7.41  0.74  3.59  565.  1642  4614

Table 9.4: LMBench: Processor, Processes

9.5 LMBench

The basic integer operations, shown in table 9.5, show that the Atom has worse performance for all operations; all of the operations except bit are up to a factor of four slower in runtime. When using 64 bit integers the Atom also performs poorly, as seen in table 9.6, with the division and mod operations having particularly long running times. Like the integer operations, the float and double operations, shown in tables 9.7 and 9.8, perform badly on the Atom in comparison to the Xeon. The Atom's performance on integer, float and double operations helps to explain why the Atom has poor performance in the HPL benchmark.

Basic integer operations - times in nanoseconds - smaller is better
Host       OS             Description       intgr   intgr   intgr   intgr  intgr
                                            bit     add     mul     div    mod
eddie403   Linux 2.6.18-  eddie403          0.6300  0.2100  0.1200  10.0   9.3900
compute-0  Linux 2.6.37.  x86_64-linux-gnu  0.6400  0.4100  0.4100  39.6   39.0

Table 9.5: LMBench: Basic integer operations

69 Basic uint64 operations - times in nanoseconds - smaller is better Host OS Description int64 int64 int64 int64 int64 bit add mul div mod eddie403 Linux 2.6.18- eddie403 0.420 0.1500 18.1 19. compute-0 Linux 2.6.37. x86_64- 0.640 0.9100 95.6 95.6 linux-gnu Table 9.6: LMBench: Basic uint64 operations

Basic float operations - times in nanoseconds - smaller is better
Host       OS             Description       float   float   float   float
                                            add     mul     div     bogo
eddie403   Linux 2.6.18-  eddie403          1.2500  2.5200  6.2000  7.7800
compute-0  Linux 2.6.37.  x86_64-linux-gnu  3.1400  2.5300  20.7    27.

Table 9.7: LMBench: Basic float operations

Basic double operations - times in nanoseconds - smaller is better
Host       OS             Description       double  double  double  double
                                            add     mul     div     bogo
eddie403   Linux 2.6.18-  eddie403          1.2500  2.1000  9.5400  9.1800
compute-0  Linux 2.6.37.  x86_64-linux-gnu  3.1400  3.1600  38.9    46.7

Table 9.8: LMBench: Basic double operations

Context switching - times in microseconds - smaller is better
Host       OS             2p/0K   2p/16K  2p/64K  8p/16K  8p/64K  16p/16K  16p/64K
                          ctxsw   ctxsw   ctxsw   ctxsw   ctxsw   ctxsw    ctxsw
eddie403   Linux 2.6.18-  2.0900  4.8000  1.8000  3.9200  6.9600  4.56000  8.92000
compute-0  Linux 2.6.37.  6.5400  10.3    7.5500  11.4    16.3    15.0     18.6

Table 9.9: LMBench: Context switching

*Local* Communication latencies in microseconds - smaller is better
Host       OS             2p/0K  Pipe   AF Unix  UDP    RPC/UDP  TCP   RPC/TCP  TCP Conn
                          ctxsw
eddie403   Linux 2.6.18-  2.090  9.220  13.9     13.32  25.9     23.8  30.1     30.
compute-0  Linux 2.6.37.  6.540  19.5   29.2     51.8   71.4     66.0  86.2     146.

Table 9.10: LMBench: Local Communication latencies

*Remote* Communication latencies in microseconds - smaller is better Host OS Description UDP RPC TCP RCP TCP / / Conn UDP TCP eddie403 Linux 2.6.18- eddie403 1 ATOM1 Linux 3.2.0-2 i686-pc- 1596 16 128 1.0000 1 linux-gnu Table 9.11: LMBench: Remote Communication latencies

71 File & VM system latencies in microseconds - smaller is better 0k File 10k File Create Delete Create Delete eddie403 Linux 2.6.18- Description 9.5138 7.5240 22.2 15.3 Mmap Prot Page 100fd La- Fault Fault selct eddie403 tency 2997.0 0.384 0.98880 1.810 0k File 10k File Create Delete Create Delete compute-0 Linux 2.6.37. Description 38.3 30.7 204.9 54.6 Mmap Prot Page 100fd La- Fault Fault selct Edim1 tency 36.1K 1.412 4.140

Table 9.12: LMBench: File & VM system latencies

*Local* Communication bandwidths in MB/s - bigger is better

Host       OS             Pipe  AF Unix  TCP   File    Mmap    Bcopy   Bcopy   Mem   Mem
                                               ReRead  Reread  (libc)  (hand)  Read  Write
eddie403   Linux 2.6.18-  1412  3455     1817  6159.2  6396.8  3937.9  3705.2  5186  6586
compute-0  Linux 2.6.37.  543   1029     457.  1188.6  3025.4  1031.8  1028.3  2496  1231.

Table 9.13: LMBench: Local Communication bandwidths

Memory latencies in nanoseconds - smaller is better

Host       OS             Description  MHz   L1 $    L2 $    Main Mem  Rand Mem
eddie403   Linux 2.6.18-  eddie403     2397  1.6690  4.2190  30.4      113.3
compute-0  Linux 2.6.37.  x86_64       1595  1.9030  9.5670  36.7      253.4

Table 9.14: LMBench: Memory latencies

Chapter 10

Future Work

• Scaling on ARM: build a cluster of ARM cores and investigate scaling on an ARM cluster.
• Scaling with Accelerators: accelerators have been proven to reduce power consumption on HPC systems; investigate how they scale.
• Atom Scaling: complete power measurements on the Atom system.
• Power of Optimisation: look into the effects of tuning on power consumption and performance.
• Xeon Phi vs GPUs: compare the performance per watt of the new Xeon Phi against current GPUs.
• BlueGene/Q: compare the performance of the BlueGene/Q with conventional systems.
• Real World Applications: look at the power consumption of real world applications such as molecular dynamics.

Chapter 11

Conclusions

The dissertation has changed from the objectives set out at the beginning of the project. It has researched the changes that are currently occurring in HPC in order to reduce power consumption. Several new technologies, such as the Xeon Phi, have been introduced, alongside reviews of the metrics used to compare HPC systems. An ARM cluster was designed and set up, although it was never used: as with much new hardware, delays meant the ARM boards did not arrive in time. A comparison of power consumption as the number of cores was scaled has been completed, looking at differences in processors, compilers and networks.

There need to be innovations in all areas of HPC in order to reduce the power consumption of HPC systems to a level at which an exascale system can be built. Innovations such as the Intel Xeon Phi or ARM based servers will take several years to reach the mainstream, and the time to develop, test and deploy new technologies will almost certainly be increased by unexpected delays. As part of this project the Seco Qseven board was going to be used. The Qseven is a brand new technology, with several projects, such as Mont Blanc, planning to use the boards. As of September 2012 the boards have been delayed by three months and no release date has yet been confirmed. The delays to the Qsevens show just how long it can take for a new technology to reach the mainstream.

There will need to be numerous advances in technology if an exascale system is to be built successfully, and as a consequence it will not be until late in this decade, if not the next, that the required technology will be available. Once the technology is available, it will still take time for developers to learn to use it. The Xeon Phi does not require developers to learn a new parallel language, but they still need to learn how to get performance from the processor. Once developers have learned to use the technology, old codes must be ported. Porting can take a significant amount of time, as the codes may not be suited to the new technology, and each code may have to be substantially rewritten. Each innovation will require substantial work to develop, test and use in HPC systems.

The time it will take for new energy efficient systems to replace older systems, and so allow exascale systems to be built, should not be underestimated.

The results of the CoreMark benchmark make it clear that the Intel Xeon and Atom both have a long way to go to match the performance per watt of the ARM processor used last year. Both are still a long way ahead in performance alone, but with quad core ARM processors released this year, the gap could be reduced significantly. It was unfortunate that the ARM hardware was not available for use by the project, as it would have been interesting to see how much the performance of the ARM had increased.

Software and algorithms have a major part to play in reducing the power used by HPC systems, as they do for all computer systems. The use of the correct libraries and compilers will increase the performance of many codes and applications. While optimisation may make code less portable, the extra work in porting codes can have a positive effect in the long run: lower power usage and better performance will allow more computation to be completed overall.

It is clear that while hardware has the biggest part to play in reducing power usage, the software, algorithms and libraries must also be used efficiently. The use of the correct libraries and compilers can significantly increase the performance per watt. But the HPC community must be clear: the changes required will not be easy, nor will they be quick. Steps taken today will take years to reach the mainstream, and there are no guarantees of how successful they will be. It is very possible that an exascale system will not be built by 2018, but steps have been taken towards it. Power consumption must continue to be reduced beyond exascale systems, helping the environment, reducing running costs and moving towards the next generation of HPC systems: zettascale.

Chapter 12

Project Evaluation

This chapter will review the project, looking at the original proposal, changes made to the proposal, the risks and the work plan.

12.1 Goals

The project has changed from the original proposal, due to a large number of problems that arose. The objectives set out at the start of the project are shown below.

1. Report on low power architectures that may be used in HPC
2. Report on the related work done in low power HPC
3. Report on the requirements of systems used in the low power HPC project
4. Report on available networks for HPC
5. Functional ARM cluster of between 20 and 40 cores
6. Final MSc Dissertation
7. Project Presentation

It was not possible to achieve all of the objectives of the project. Objective five was not completed as the ARM hardware was unavailable; all of the other objectives have been met. In addition to the goals above, several new objectives were added:

1. Report on different compilers available on HPC systems
2. Report on different MPI libraries available
3. Report on scaling when using different processors, compilers and networks

12.2 Work Plan

The original work plan was rewritten at an early stage of the project. As more research was completed it became clear that more time would be required to build the ARM cluster. Over the three months of the project the work plan was constantly reviewed due to the changing goals of the project. As the ARM cluster was not used, and more time was required to benchmark both Eddie and EDIM1, the work plan was changed to accommodate these requirements. It was not possible to document all changes to the work plan due to the frequency of the changes.

12.3 Risks

The risks identified at the beginning of the project were related to each of the systems planned to be used: Eddie, EDIM1 and the ARM cluster. All of the risks identified occurred and needed to be mitigated, and several of them occurred on more than one occasion. The biggest risk was that the ARM hardware would not arrive in time. This risk was mitigated several times with replacement hardware, before the deliverables requiring ARM hardware were finally removed from the project. The risks related to Eddie and EDIM1 also became an issue, as there were problems running jobs on both systems. Working with the groups responsible for the systems and changing my objectives mitigated these issues.

12.4 Changes

The project has changed substantially from its original objectives. Changes were made to the systems, benchmarks and objectives of the project in order to mitigate the numerous issues that arose. The HPCC benchmark was not run due to time constraints in running jobs on the systems, the ARM system was removed completely, and more comparisons were made using the HPL benchmark to reduce the risk of unexpected complications. The project has not been a simple one, but being able to adjust it to fit around the issues has allowed a successful comparison of two systems to be made.

Appendix A

HPL Results
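The Flop/Watt column in the tables below appears to be derived from the measured performance and the average power draw: the HPL performance in GFlop/s is converted to Flop/s and divided by the average power in watts. The short C sketch below reproduces the value in the first row of table A.1; the function and variable names are illustrative only.

#include <stdio.h>

/* Flop/Watt = (performance in GFlop/s * 1e9) / (average power in W).
 * The figures below are taken from the first row of table A.1
 * (1 core, GCC, Infiniband): 1.189 GFlop/s at an average of 139.73 W. */
static double flops_per_watt(double gflops, double avg_power_watts)
{
    return (gflops * 1.0e9) / avg_power_watts;
}

int main(void)
{
    /* Prints 8509267.87, matching the Flop/Watt entry in the table. */
    printf("Flop/Watt = %.2f\n", flops_per_watt(1.189, 139.73));
    return 0;
}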

Cores  Compiler  Network     Performance (GFlop/s)  Time (s)  Total Power  Average Power (W)  Flop/Watt
1      GCC       Infiniband  1.189e+00              8764.37   20960        139.73             8509267.87
2      GCC       Infiniband  2.409e+00              4325.30   11180        149.07             16160193.2
3      GCC       Infiniband  3.178e+00              3278.08   8910         159.11             19973603.17
4      GCC       Infiniband  3.455e+00              3015.16   9180         180                19194444.44
5      GCC       Infiniband  4.103e+00              2539.13   7920         180                22794444.44
6      GCC       Infiniband  4.085e+00              2550.09   7920         180                22694444.44
7      GCC       Infiniband  5.538e+00              1881.11   5760         180                30766666.67
8      GCC       Infiniband  5.471e+00              1904.27   6640         207.5              26366265.06
9      GCC       Infiniband  5.751e+00              1811.38   6690         215.81             26648440.76
10     GCC       Infiniband  6.838e+00              1523.57   5790         222.69             30706363.11
11     GCC       Infiniband  1.062e+01              981.13    3860         214.44             49524342.47
12     GCC       Infiniband  7.963e+00              1308.21   5170         224.78             35425749.62
16     GCC       Infiniband  1.232e+01              845.79    7320         385.26             31978404.19
24     GCC       Infiniband  2.385e+01              436.81    6840         427.5              55789473.68
32     GCC       Infiniband  2.491e+01              418.13    5810         528.18             47161952.36
36     GCC       Infiniband  3.020e+01              344.98    5670         515.45             58589581.92
48     GCC       Infiniband  3.254e+01              320.11    8050         805                40422360.25

Table A.1: Results from HPL Benchmark using GCC and Infiniband

Cores  Compiler  Network   Performance (GFlop/s)  Time (s)  Total Power  Average Power (W)  Flop/Watt
1      GCC       Ethernet  1.216e+00              8565.47   18590        130                9353846.15
2      GCC       Ethernet  3.021e+00              3447.89   8710         150.17             20117200.51
3      GCC       Ethernet  4.043e+00              2576.59   6950         161.63             25013920.68
4      GCC       Ethernet  4.784e+00              2177.70   6430         178.61             26784614.52
5      GCC       Ethernet  6.431e+00              1619.82   4850         179.63             35801369.48
6      GCC       Ethernet  6.277e+00              1659.64   5340         197.78             31737283.85
7      GCC       Ethernet  8.508e+00              1224.40   3920         196                43408163.27
8      GCC       Ethernet  7.302e+00              1426.62   4780         207.83             35134484.92
9      GCC       Ethernet  6.416e+00              1623.63   5620         208.15             30823925.05
10     GCC       Ethernet  7.822e+00              1331.76   4800         218.18             35851132.093
11     GCC       Ethernet  1.051e+01              991.19    3500         218.75             48045714.29
12     GCC       Ethernet  8.204e+00              1269.89   4800         228.57             35892724.33
16     GCC       Ethernet  1.216e+01              856.59    5650         403.57             30131080.11
24     GCC       Ethernet  1.610e+01              646.90    3820         629.09             25592522.53
32     GCC       Ethernet  2.372e+01              439.13    4886         658.57             36017431.71
36     GCC       Ethernet  2.495e+01              417.61    4840         691.43             36084636.19
42     GCC       Ethernet  3.071e+01              339.27    5170         861.67             35640094.24
48     GCC       Ethernet  3.658e+01              284.82    3630         907.5              40308539.94

Table A.2: Results from HPL Benchmark using GCC and Ethernet

Cores  Compiler  Network     Performance (GFlop/s)  Time (s)  Total Power  Average Power (W)  Flop/Watt
1      Intel     Infiniband  1.180e+00              8824.96   21980        147.52             7998915.4
2      Intel     Infiniband  2.413e+00              4317.59   11390        158.19             15253808.71
3      Intel     Infiniband  3.169e+00              3286.96   9050         161.61             19608935.09
4      Intel     Infiniband  3.452e+00              3017.65   8620         169.02             20423618.51
5      Intel     Infiniband  4.064e+00              2563.59   7650         177.91             22843010.51
6      Intel     Infiniband  4.086e+00              2549.75   7680         178.6              22877939.53
7      Intel     Infiniband  5.433e+00              1917.39   6530         197.88             27456033.96
8      Intel     Infiniband  5.480e+00              1901.19   6630         207.19             26449152.95
9      Intel     Infiniband  5.929e+00              1757.19   6520         217.33             27281093.27
10     Intel     Infiniband  6.754e+00              1542.54   5770         221.92             30434390.77
11     Intel     Infiniband  1.084e+01              961.00    3790         222.94             48622947.88
12     Intel     Infiniband  8.043e+00              1295.17   5150         234.09             34358580.03
16     Intel     Infiniband  1.221e+01              853.09    5770         384.67             31741492.71
24     Intel     Infiniband  1.614e+01              645.46    5430         452.5              35668508.29
32     Intel     Infiniband  2.490e+01              418.38    5180         647.5              38455598.46
36     Intel     Infiniband  2.462e+01              423.21    6030         753.75             32663349.92
42     Intel     Infiniband  3.166e+01              329.03    5200         866.67             36530628.73
48     Intel     Infiniband  3.850e+01              270.58    4650         930                41397849.46

Table A.3: Results from HPL Benchmark using Intel and Infiniband

Cores  Compiler  Network   Performance (GFlop/s)  Time (s)  Total Power  Average Power (W)  Flop/Watt
1      Intel     Ethernet  1.191e+00              8744.98   19080        129.8              9175654.85
2      Intel     Ethernet  2.982e+00              3493.62   8810         149.32             19970533.08
3      Intel     Ethernet  3.993e+00              2609.12   7050         160.23             24920426.89
4      Intel     Ethernet  4.806e+00              2167.42   6560         177.3              27106598.98
5      Intel     Ethernet  5.957e+00              1748.73   5240         174.67             34104310.99
6      Intel     Ethernet  6.256e+00              1665.18   5670         195.52             31996726.68
7      Intel     Ethernet  8.423e+00              1236.74   4030         191.9              43892652.42
8      Intel     Ethernet  7.323e+00              1422.50   4920         205                35721951.22
9      Intel     Ethernet  6.411e+00              1625.07   5690         203.21             31548644.26
10     Intel     Ethernet  7.832e+00              1330.12   4910         213.48             36687277.5
11     Intel     Ethernet  1.044e+01              997.66    3620         212.94             49027895.18
12     Intel     Ethernet  8.231e+00              1265.69   5030         228.64             35999825.05
16     Intel     Ethernet  1.209e+01              861.84    5450         389.29             31056880.73
18     Intel     Ethernet  1.226e+01              849.49    5700         407.14             30112280.7
24     Intel     Ethernet  1.634e+01              637.48    4470         406.36             40210290.83

Table A.4: Results from HPL Benchmark using Intel and Ethernet

Appendix B

Coremark Results
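The PPW (performance per watt) column in the tables below is the CoreMark iteration rate divided by the measured power consumption, and the Total Time column follows from the iteration count and rate. The short C sketch below checks two rows of table B.1 (Xeon at O3 and Atom at O2); the values are copied from the table, not measured.

#include <stdio.h>

/* Two derived columns in the CoreMark tables:
 *   Total Time (s) = Iterations / (Iterations/Sec)
 *   PPW            = (Iterations/Sec) / Consumption (W)
 * Figures from table B.1: Xeon at O3, 8580.74 iterations/s at 130 W;
 * Atom at O2, 2678.09 iterations/s at 52 W; 100000 iterations. */
int main(void)
{
    const double iterations = 100000.0;

    printf("Xeon O3: time %.2f s, PPW %.2f\n",
           iterations / 8580.74, 8580.74 / 130.0);  /* 11.65 s, 66.01 */
    printf("Atom O2: time %.2f s, PPW %.2f\n",
           iterations / 2678.09, 2678.09 / 52.0);   /* 37.34 s, 51.50 */
    return 0;
}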

Processor   Flags  Iterations  Iterations/Sec  Total Time (Sec)  Consumption (W)  PPW
Intel Xeon  O1     100000      7039.28         14.21             130              54.15
Intel Atom  O1     100000      2413.48         41.43             52               46.40
Intel Xeon  O2     100000      7807.01         12.08             130              60.05
Intel Atom  O2     100000      2678.09         37.34             52               51.50
Intel Xeon  O3     100000      8580.74         11.65             130              66.01
Intel Atom  O3     100000      3100.29         332.25            52               59.62

Table B.1: Coremark Results for 100000 Iterations

Processor   Flags  Iterations  Iterations/Sec  Total Time (Sec)  Consumption (W)  PPW
Intel Xeon  O1     1000000     7079.70         141.25            130              54.45
Intel Atom  O1     1000000     2413.52         414.33            52               46.41
Intel Xeon  O2     1000000     7881.28         126.88            130              60.62
Intel Atom  O2     1000000     2678.40         373.35            52               51.50
Intel Xeon  O3     1000000     8663.25         115.43            130              66.64
Intel Atom  O3     1000000     3100.51         322.52            52               59.63

Table B.2: Coremark Results for 1000000 Iterations

Processor   Flags  Iterations  Iterations/Sec  Total Time (Sec)  Consumption (W)  PPW
Intel Xeon  O1     2000000     7093.20         281.69            130              54.56
Intel Atom  O1     2000000     2412.68         828.95            52               46.39
Intel Xeon  O2     2000000     7874.95         253.97            130              60.57
Intel Atom  O2     2000000     2676.03         747.37            52               51.46
Intel Xeon  O3     2000000     8647.15         231.29            130              66.51
Intel Atom  O3     2000000     3100.27         645.10            52               59.62

Table B.3: Coremark Results for 2000000 Iterations

Processor   Flags  Iterations  Iterations/Sec  Total Time (Sec)  Consumption (W)  PPW
Intel Xeon  O1     30000000    7100.68         4224.95           130              54.62
Intel Atom  O1     30000000    2415.76         12418.46          52               46.46
Intel Xeon  O2     30000000    7840.15         3826.46           130              60.30
Intel Atom  O2     30000000    2676.08         11210.08          52               51.46
Intel Xeon  O3     30000000    8638.30         3472.906          130              66.44
Intel Atom  O3     30000000    3105.11         9661.49           52               59.71

Table B.4: Coremark Results for 30000000 Iterations

Processor   Flags  Iterations  Iterations/Sec  Total Time (Sec)  Consumption (W)  PPW
Intel Xeon  O1     15000000    7102.37         2111.97           130              54.63
Intel Atom  O1     15000000    2416.46         6207.42           52               46.47
Intel Xeon  O2     15000000    7872.35         1905.40           130              60.55
Intel Atom  O2     15000000    2674.69         5608.134          52               51.43
Intel Xeon  O3     15000000    8662.42         1731.62           130              66.64
Intel Atom  O3     15000000    3105.38         4830.33           52               60.29

Table B.5: Coremark Results for 15000000 Iterations

Appendix C

Final Project Proposal

C.1 Content

The main aim of the project is to compare the power consumption of low power processors against that of the commodity processors used in high performance systems today. The power consumption will be compared in terms of performance and networks when running selected high performance benchmarks. The low power systems will be compared using the Intel Xeon processors in the ECDF system as a baseline, representative of systems in use today.

C.2 The work to be undertaken

C.2.1 Deliverables

• Report on low power architectures that may be used in HPC
• Report on the related work done in low power HPC
• Report on the requirements of systems used in the low power HPC project
• Report on available networks for HPC
• Functional ARM cluster of between 20 and 40 cores
• Final MSc Dissertation
• Project Presentation

C.2.2 Tasks

• Deployment of low power ARM cluster

• Gain access to nodes of EDIM1
• Gain access to nodes of ECDF
• Identification of suitable benchmarks for all systems
• Porting of benchmarks to all systems
• Running and gathering results for all benchmarks on each system
• Writing of dissertation report covering all aspects of the project

C.2.3 Additional Information/Knowledge Required

• Knowledge of programming in Fortran and C in order to port benchmarks
• Knowledge of cluster deployment and maintenance
• Presentation skills
• Report writing skills

Appendix D

Section of Makefile for Eddie from HPCC

# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = ../../..
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
#MPdir        = /usr/local/mpi
MPdir        = /exports/applications/apps/SL5/MPI/openmpi/ethernet/gcc/1.4.1-gcc_4.1.2
#MPdir        = /exports/applications/apps/SL5/MPI/mpich2/gcc/64/1.0.6p1/smpd
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /exports/home/s0791373/CBLAS/lib
LAinc        =
LAlib        = /usr/lib64/libblas.a /usr/lib64/liblapack.a $(LAdir)/cblas_LINUX.a
#$(LAdir)/libcblas.a $(LAdir)/libatlas.a

Listing D.1: Section of Make File for HPCC from Eddie

Appendix E

Benchmark Sample Outputs

E.1 Sample Output from Stream

Sat Jul 14 15:03:12 BST 2012
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 20000000, Offset = 0
Total memory required = 457.8 MB.
Each test is run 10 times, but only the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 34057 microseconds.
   (= 34057 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:          7816.1256      0.0413       0.0409       0.0414
Scale:         7535.1149      0.0429       0.0425       0.0430
Add:           8325.2322      0.0582       0.0577       0.0584
Triad:         7602.0206      0.0636       0.0631       0.0637
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
Sat Jul 14 15:03:15 BST 2012
Listing E.1: Sample Output from Stream Benchmark

E.2 Sample Output from Coremark

--------------------------------------------------
Running Coremark program
--------------------------------------------------

2K performance run parameters for coremark.
CoreMark Size     : 666
Total ticks       : 3808689
Total time (secs) : 3808.689000
Iterations/Sec    : 7876.726086
Iterations        : 30000000
Compiler version  : GCC4.1.2 20080704 (Red Hat 4.1.2-50)
Compiler flags    : -O2 -lrt
Memory location   : Please put data memory location here
                    (e.g. code in flash, data on heap etc)
seedcrc           : 0xe9f5
[0]crclist        : 0xe714
[0]crcmatrix      : 0x1fd7
[0]crcstate       : 0x8e3a
[0]crcfinal       : 0xcc42
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 7876.726086 / GCC4.1.2 20080704 (Red Hat 4.1.2-50) -O2 -lrt / Heap

--------------------
Finished Coremark program
--------------------
Listing E.2: Sample Output from Coremark Benchmark

E.3 Sample Output from HPL

--------------------------------------------------
Running nodes program
--------------------------------------------------

Hello world from processor eddie403, rank 0 out of 12 processors
Hello world from processor eddie403, rank 1 out of 12 processors
Hello world from processor eddie403, rank 3 out of 12 processors
Hello world from processor eddie403, rank 4 out of 12 processors
Hello world from processor eddie403, rank 5 out of 12 processors
Hello world from processor eddie403, rank 9 out of 12 processors
Hello world from processor eddie403, rank 10 out of 12 processors
Hello world from processor eddie403, rank 7 out of 12 processors
Hello world from processor eddie403, rank 6 out of 12 processors
Hello world from processor eddie403, rank 8 out of 12 processors
Hello world from processor eddie403, rank 11 out of 12 processors
Hello world from processor eddie403, rank 2 out of 12 processors

Sat Jul 14 08:16:17 BST 2012

--------------------------------------------------
Running HPL program
--------------------------------------------------

================================================================================
HPLinpack 2.0  --  High-Performance Linpack benchmark  --  September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 25000
NB     : 128
PMAP   : Row-major process mapping
P      : 1
Q      : 1
PFACT  : Right
NBMIN  : 4
NDIV   : 2
RFACT  : Crout
BCAST  : 1ringM
DEPTH  : 1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( ||x||_oo * ||A||_oo + ||b||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       25000   128     1     1            8771.00              1.188e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0047479 ...... PASSED
================================================================================

Finished 1 tests with the following results:
            1 tests completed and passed residual checks,
            0 tests completed and failed residual checks,
            0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

real    147m34.724s
user    147m29.850s
sys     0m3.623s

--------------------
Finished HPL program
--------------------

Sat Jul 15 10:43:51 BST 2012

Listing E.3: Sample Output from HPL Benchmark

E.4 Sample Output From LMBench

LMBENCH 3.0 SUMMARY
------------------------------------
(Alpha software, do not distribute)

Basic system parameters
------------------------------------------------------------------------------
Host        OS             Description        Mhz   tlb    cache line  mem  scal
                                                    pages  bytes       par  load
frontend0   Linux 2.6.18-  x86_64-linux-gnu   2398                          1

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host        OS             Mhz   null  null  stat  open  slct  sig   sig   fork  exec  sh
                                 call  I/O         clos  TCP   inst  hndl  proc  proc  proc
frontend0   Linux 2.6.18-  2398  0.25  0.31  1.70  3.34  3.26  0.34  1.28  137.  536.  1893

Basic integer operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host        OS             intgr   intgr   intgr   intgr  intgr
                           bit     add     mul     div    mod
frontend0   Linux 2.6.18-  0.5300  0.2100  0.1200  10.0   9.4700

Basic uint64 operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host        OS             int64  int64   int64  int64  int64
                           bit    add     mul    div    mod
frontend0   Linux 2.6.18-  0.420  0.1300         18.1   19.2

Basic float operations - times in nanoseconds - smaller is better
-----------------------------------------------------------------
Host        OS             float   float   float   float
                           add     mul     div     bogo
frontend0   Linux 2.6.18-  1.2400  1.6700  6.2000  5.8400

Basic double operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host        OS             double  double  double  double
                           add     mul     div     bogo
frontend0   Linux 2.6.18-  1.2400  2.0900  9.5400  9.1800

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host        OS             2p/0K   2p/16K  2p/64K  8p/16K  8p/64K  16p/16K  16p/64K
                           ctxsw   ctxsw   ctxsw   ctxsw   ctxsw   ctxsw    ctxsw
frontend0   Linux 2.6.18-  1.2500  0.0700  0.7100  1.5100  7.3300  2.67000  7.97000

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host        OS             2p/0K  Pipe   AF    UDP   RPC/  TCP   RPC/  TCP
                           ctxsw         UNIX        UDP         TCP   conn
frontend0   Linux 2.6.18-  1.250  4.314  7.02  13.6  15.8  13.0  15.8  37.

*Remote* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host        OS             UDP   RPC/UDP  TCP   RPC/TCP  TCP conn
frontend0   Linux 2.6.18-

File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host        OS             0K File  0K File  10K File  10K File  Mmap     Prot   Page     100fd
                           Create   Delete   Create    Delete    Latency  Fault  Fault    selct
frontend0   Linux 2.6.18-  10.8     8.3597   22.8      16.2      1454.0   0.372  0.96490  1.801

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host        OS             Pipe  AF    TCP   File    Mmap    Bcopy   Bcopy   Mem   Mem
                                 UNIX        reread  reread  (libc)  (hand)  read  write
frontend0   Linux 2.6.18-  1565  4556  1809  3284.6  6594.8  4212.2  3975.5  4924  6827.

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host        OS             Mhz   L1 $    L2 $    Main mem  Rand mem  Guesses
frontend0   Linux 2.6.18-  2398  1.6700  4.2650  30.1      101.4

Listing E.4: Sample Output from LMBench Benchmark

Appendix F

Submission Script from Eddie

F.1 run.sge

#!/bin/sh
#$ -N ibhpl48fullmem
#$ -cwd
#$ -pe openib_smp12_qdr 48
#$ -R y
#$ -l h_rt=07:00:00
#$ -P physics_epcc_msc_s0791373
#$ -q ecdf@@epcc_msc_power_ib

#$ -M [email protected]
#$ -m be

echo "--------------------------------------------------"
echo "Running nodes program"
echo "--------------------------------------------------"
echo

. /etc/profile.d/modules.sh
module load openmpi/infiniband/gcc/latest

mpirun -n $NSLOTS ./nodes

sed '6 c\101504 Ns' HPL.dat > test
cp test HPL.dat
sed '11 c\6 Ps' HPL.dat > test
cp test HPL.dat
sed '12 c\8 Qs' HPL.dat > test
cp test HPL.dat

date
echo "Running HPL program"

(time mpirun -n 48 ./xhpl) 2>&1

echo "Finished HPL program"

date
Listing F.1: Submission Script from eddie

F.2 nodes.c

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d"
           " out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}
Listing F.2: Nodes Program from eddie

Bibliography

[1] Top500.org. Top500 lists. http://www.top500.org. Accessed February 1 2012.
[2] Green500.org. Green500 lists. http://www.green500.org. Accessed February 1 2012.
[3] Dr. Matthias Brehm. Energy efficient hpc centers – at what cost?, energy efficient hpc systems: concepts, procurement and installation. In International Supercomputing Conference 2012. Leibniz Supercomputing Centre, 2012.
[4] spenergywholesale.com. Longannet power station. http://www.spenergywholesale.com/pages/longannet_power_station.asp. Accessed August 9 2012.
[5] Various. The international exascale software project roadmap. Technical report, IESP, 2010.
[6] Intel. Moores law. http://download.intel.com/museum/Moores_Law/Printed_Materials/Moores_Law_2pg.pdf. Accessed August 9 2012.
[7] Douglas Eadline. Mays law and parallel software. http://www.linux-mag.com/id/8422/. Accessed February 1 2012.
[8] Various. Exascale computing study: technology challenges in achieving exascale systems. Technical report, DARPA, 2008.
[9] http://www.arm.com/.
[10] Nicolas Dube. True sustainability, the path to a net-zero datacenter: energy, carbon, water. In International Supercomputing Conference 2012. HP, 2012.
[11] Graph500.org. Graph500 lists. http://www.graph500.org. Accessed February 1 2012.
[12] Panagiotis Kritikakos. Low power high performance computing. Master's thesis, University Of Edinburgh, 2011.
[13] Marvell Technology Group Ltd. Marvell 88f6281 soc with sheeva technology. http://www.marvell.com/embedded-processors/kirkwood/assets/88F6281-004_ver1.pdf. Accessed May 29 2012.

[14] Boston Ltd. Boston viridis project. http://download.boston.co.uk/downloads/9/3/2/932c4ecb-692a-47a9-937d-a94bd0f3df1b/viridis.pdf. Accessed May 31 2012.
[15] Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, and Alex Ramirez. The low-power architecture approach towards exascale computing. In Proceedings of the second workshop on Scalable algorithms for large-scale systems, ScalA '11, pages 1–2, New York, NY, USA, 2011. ACM.
[16] Pawel Gepner, Michal F. Kowalik, David L. Fraser, and Kazimierz Wackowski. Early performance evaluation of new six-core intel xeon 5600 family processors for hpc. In Parallel and Distributed Computing (ISPDC), 2010 Ninth International Symposium on, pages 117–124, july 2010.
[17] ARM Holdings. The arm cortex-a9 processors white paper. http://www.arm.com/files/pdf/armcortexa-9processors.pdf. Accessed May 30 2012.
[18] P. Luszczek, J.J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas, J. Kepner, J. McCalpin, D. Bailey, and D. Takahashi. Introduction to the hpc challenge benchmark suite. 2005.
[19] Carl Staelin. lmbench: an extensible micro-benchmark suite. Software: Practice and Experience, 35(11):1079–1105, 2005.
[20] L. McVoy and C. Staelin. lmbench: Portable tools for performance analysis. In Proceedings of the 1996 annual conference on USENIX Annual Technical Conference, pages 23–23. Usenix Association, 1996.
[21] Coremark.org. Coremark benchmark. http://www.coremark.org/. Accessed August 21 2012.
[22] Energy aware computing. http://www.inf.ed.ac.uk/teaching/courses/eac. Accessed July 1 2011.
[23] University of Edinburgh. Ecdf. http://www.ecdf.ed.ac.uk. Accessed August 1 2012.
[24] HP. Hp project moonshot. http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA3-9839ENW.pdf. Accessed August 1 2012.
[25] Paul Martin, Malcolm Atkinson, Mark Parsons, Adam Carter, and Gareth Francis. Edim1 progress report. Technical report, 12/2011 2011.
[26] Margaret Martonosi and Stefanos Kaxiras. Computer Architecture Techniques for Power-Efficiency. Morgan and ClayPool, 2008.
[27] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach, Fourth Edition. Morgan Kaufmann, 2007.
[28] John Hennessy and David Patterson. Computer Organization and Design, Third Edition. Morgan Kaufmann, 2007.

[29] Professor Arthur Trew. The exascale challenge. HPC EcoSystems, EPCC, University of Edinburgh, 2012.
[30] Professor Arthur Trew. The exascale solution? HPC EcoSystems, EPCC, University of Edinburgh, 2012.
[31] Dr Alan Gray. Gpu architecture. HPC Architectures, EPCC, University of Edinburgh, 2011.
[32] L.E. Jonsson and W.R. Magro. Comparative performance of infiniband architecture and gigabit ethernet interconnects on Intel(R) Itanium(R) 2 microarchitecture-based clusters. Intel Americas.
[33] http://content.dell.com/us/en/enterprise/d/campaigns/project-copper.aspx.
[34] S. Alam, R. Barrett, M. Bast, M.R. Fahey, J. Kuehn, C. McCurdy, J. Rogers, P. Roth, R. Sankaran, J.S. Vetter, P. Worley, and W. Yu. Early evaluation of ibm bluegene/p. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1–12, nov. 2008.
