Low-Power High Performance

Panagiotis Kritikakos

August 16, 2011

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2011

Abstract

The emerging development of computer systems for HPC requires a change in processor architecture. New design approaches and technologies need to be embraced by the HPC community to make Exascale systems feasible within the next two decades, as well as to reduce the CO2 emissions of supercomputers and scientific clusters, leading to greener computing. Power is listed as one of the most important issues and constraints for future Exascale systems. In this project we build a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as the Intel Atom and ARM (Marvell 88F6281), against a commodity Intel Xeon CPU of the kind found in standard HPC and data-centre clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort.

Contents

1 Introduction
  1.1 Report organisation

2 Background
  2.1 RISC versus CISC
  2.2 HPC Architectures
    2.2.1 System architectures
    2.2.2 Memory architectures
  2.3 Power issues in modern HPC systems
  2.4 Energy and application efficiency

3 Literature review
  3.1 Green500
  3.2 Supercomputing in Small Spaces (SSS)
  3.3 The AppleTV Cluster
  3.4 Sony Playstation 3 Cluster
  3.5 Microsoft XBox Cluster
  3.6 IBM BlueGene/Q
  3.7 Less Watts
  3.8 Energy-efficient cooling
    3.8.1 Green Revolution Cooling
    3.8.2 Google Data Centres
    3.8.3 Nordic Research
  3.9 Exascale

4 Technology review
  4.1 Low-power Architectures
    4.1.1 ARM
    4.1.2 Atom
    4.1.3 PowerPC and Power
    4.1.4 MIPS

5 Benchmarking, power measurement and experimentation
  5.1 Benchmark suites
    5.1.1 HPCC Benchmark Suite
    5.1.2 NPB Benchmark Suite
    5.1.3 SPEC Benchmarks
    5.1.4 EEMBC Benchmarks
  5.2 Benchmarks
    5.2.1 HPL
    5.2.2 STREAM
    5.2.3 CoreMark
  5.3 Power measurement
    5.3.1 Metrics
    5.3.2 Measuring unit power
    5.3.3 The measurement procedure
  5.4 Experiments design and execution
  5.5 Validation and reproducibility

6 Cluster design and deployment
  6.1 Architecture support
    6.1.1 Hardware considerations
    6.1.2 Software considerations
    6.1.3 Soft Float vs Hard Float
  6.2 Fortran
  6.3 C/C++
  6.4 ...
  6.5 Hardware decisions
  6.6 Software decisions
  6.7 Networking
  6.8 Porting
    6.8.1 Fortran to C
    6.8.2 Binary incompatibility
    6.8.3 Scripts developed

7 Results and analysis
  7.1 Thermal Design Power
  7.2 Idle readings
  7.3 Benchmark results
    7.3.1 Serial performance: CoreMark
    7.3.2 Parallel performance: HPL
    7.3.3 Memory performance: STREAM
    7.3.4 HDD and SSD power consumption

8 Future work

9 Conclusions

A CoreMark results

B HPL results

C STREAM results

D Shell Scripts
  D.1 add_node.sh
  D.2 status.sh
  D.3 armrun.sh
  D.4 watt_log.sh
  D.5 fortran2c.sh

E Benchmark outputs samples
  E.1 CoreMark output sample
  E.2 HPL output sample
  E.3 STREAM output sample

F Project evaluation
  F.1 Goals
  F.2 Work plan
  F.3 Risks
  F.4 Changes

G Final Project Proposal
  G.1 Content
  G.2 The work to be undertaken
    G.2.1 Deliverables
  G.3 Tasks
  G.4 Additional information / Knowledge required

List of Tables

6.1 Cluster nodes hardware specifications.
6.2 Cluster nodes software specifications.
6.3 Network configuration.

7.1 Maximum TDP per ...
7.2 Average system power consumption on idle.
7.3 CoreMark results with 1 million iterations.
7.4 HPL problem sizes.
7.5 HPL problem sizes.
7.6 STREAM results for 500MB array size.

A.1 CoreMark results for various iterations.

B.1 HPL problem sizes.
B.2 HPL problem sizes.
B.3 HPL results for N=500.

C.1 STREAM results for an array size of 500MB.

List of Figures

2.1 Single Instruction Single Data (Reproduced from Blaise Barney, LLNL).
2.2 Single Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).
2.3 Multiple Instruction Single Data (Reproduced from Blaise Barney, LLNL).
2.4 Multiple Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).
2.5 Distributed memory architecture (Reproduced from Blaise Barney, LLNL).
2.6 Shared Memory UMA architecture (Reproduced from Blaise Barney, LLNL).
2.7 Shared Memory NUMA architecture (Reproduced from Blaise Barney, LLNL).
2.8 Hybrid Distributed-Shared Memory architecture (Reproduced from Blaise Barney, LLNL).
2.9 Moore's law for power consumption (Reproduced from Wu-chun Feng, LANL).

3.1 GRCooling four-rack CarnotJet™ system at Midas Networks (source GRCooling).
3.2 Google data-centre in Finland, next to the Gulf of Finland (source Google).
3.3 NATO ammunition depot at Rennesøy, Norway (source Green Mountain Data Centre AS).
3.4 Projected power demand of a supercomputer (M. Kogge).

4.1 OpenRD board SoC with ARM (Marvell 88F6281) (Cantanko).
4.2 Intel D525 Board with Intel Atom dual-core.
4.3 IBM's BlueGene/Q 16-core compute node (Timothy Prickett Morgan, The Register).
4.4 Pipelined MIPS, showing the five stages: instruction fetch, instruction decode, execute, memory access and write back (Wikimedia Commons).
4.5 Motherboard with Loongson 2G processor (Wikimedia Commons).

5.1 Power measurement setup.

6.1 The seven-node cluster that was built as part of this project.
6.2 Cluster connectivity.

7.1 Power readings over time.
7.2 CoreMark results for 1 million iterations.
7.3 CoreMark results for 1 thousand iterations.
7.4 CoreMark results for 2 million iterations.
7.5 CoreMark results for 1 million iterations utilising 1 thread per core.
7.6 CoreMark performance for 1, 2, 4, 6 and 8 cores per system.
7.7 CoreMark performance speedup per system.
7.8 CoreMark performance on Intel Xeon.
7.9 Power consumption over time while executing CoreMark.
7.10 HPL results for large problem size, calculated with ACT's script.
7.11 HPL results for problem size 80% of the system memory.
7.12 HPL results for N=500.
7.13 HPL total power consumption for N equal to 80% of memory.
7.14 HPL total power consumption for N calculated with ACT's script.
7.15 HPL total power consumption for N=7296.
7.16 HPL total power consumption for N=500.
7.17 Power consumption over time while executing HPL.
7.18 STREAM results for 500MB array size.
7.19 STREAM results for 3GB array size.
7.20 Power consumption over time while executing STREAM.
7.21 Power consumption with 3.5" HDD and 2.5" SSD.

Listings

2.1 Assembly on RISC
2.2 Assembly on CISC

Acknowledgements

I would like to thank my supervisors Mr Sean McGeever and Dr Lorna Smith. Their guidance and help throughout the project were of great value and contributed greatly to its successful completion.

Chapter 1

Introduction

With the continuous evolution of computer systems, power is becoming an ever greater constraint on modern systems, especially those targeted at supercomputing and HPC in general. The demand for continuously increasing performance requires additional processors per board, where electrical power and heat become limits. This is discussed in detail in DARPA's Exascale Computing study [1]. Over the last few years there has been increasing interest in the use of GPUs in HPC, as they offer far greater FLOP-per-Watt performance than standard CPUs. Designing power-limited systems can have a negative effect on the delivered application performance, due to less powerful processors and designs that are not appropriate for the required tasks, and as a consequence reduces the scope and effectiveness of such systems. For the upcoming Exascale systems this is going to be a major issue. New design approaches need to be considered, exploiting low-power architectures and technologies that can deliver acceptable performance for HPC and other scientific applications at reasonable and acceptable power levels.

The Green500 [6] list argues that the goal of high-performance systems over the past decades has been to increase performance relative to price. Increasing the performance, and the speedup as a consequence, does not necessarily mean that the system is efficient. SSS reports that "from the early 1990s to the early 2000s, the performance of our n-body code for galaxy formation improved by 2000-fold, but the performance per watt only improved 300-fold and the performance per square foot only 65-fold. Clearly, we have been building less and less efficient supercomputers, thus resulting in the construction of massive data-centers, and even, entirely new buildings (and hence, leading to an extraordinarily high total cost of ownership). Perhaps a more insidious problem to the above inefficiency is that the reliability (and usability) of these systems continues to decrease as traditional supercomputers continue to follow Moore's Law for Power Consumption." [8] [9].

Up to now, chip vendors have been following Moore's law [8]. When more than one core is incorporated within the same chip, the clock speed per core is decreased. This is not an issue, as two cores with a reduced clock speed give better performance

than a single chip with a relatively higher clock speed. Decreasing the clock speed decreases the electrical power needed, as well as the corresponding heat produced within the chip. This approach is followed in most modern multi-core chips.

The idea behind low-power HPC stands on the same ground. A significant number of low-power, low-electricity-consumption chips and systems can be clustered together. This could deliver the performance required by HPC and other scientific applications in an efficient manner, in terms of both application performance and energy consumption. Putting together nodes with low-power chips will not solve the problem right away. As these architectures are not widely used in the HPC field, the required tools, mainly compilers and libraries, might not be available or supported. An effort may be required to port these to the new architectures. Even with the tools in place, the codes themselves may require porting and optimisation in order to exploit the underlying hardware.

From a management perspective, every megawatt of reduced power consumption means savings of $1M per year for large supercomputers, as the IESP Roadmap reports [2]. The IESP Roadmap also reports that high-end servers (which are also used to build HPC clusters) were estimated to consume 2% of North American power as of 2006. The same report mentions that IDC (International Data Corporation) estimates that HPC systems will be the largest fraction of the high-end server market. This means that the impact of the electrical power required by such systems needs to be reduced [2].

In this project we designed and built a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as the Intel Atom and ARM, against a commodity Intel Xeon CPU of the kind found in standard HPC and data-centre clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort.

1.1 Report organisation

This dissertation is organised in three main groups of chapters. The first group includes chapters 2 to 4, presenting background material and the literature and technology reviews. Chapter 5 forms a group of its own, discussing the benchmark suites and benchmarks considered and used, the power measurement techniques and methods, and the experimental approach followed throughout the project. The third group includes chapters 6 to 9, discussing the design and deployment of our hybrid low-power cluster, the results and analysis of the experiments that were conducted, suggestions for future work and, finally, the conclusions of the project.

Chapter 2

Background

In this chapter we compare RISC and CISC systems, present the system and memory architectures that can be found in HPC, and explain what each one means. In addition, we make a case for the power issues in modern HPC systems and discuss how energy efficiency relates to application efficiency.

2.1 RISC versus CISC

The majority of modern commodity processors, which are also used within the field of HPC, implement the CISC (Complex Instruction Set Computing) architecture. However, the need for energy efficiency, lower cost, multiple cores and scaling is leading to a simplification of the underlying architectures, requiring hardware vendors to develop energy-efficient, high-performance RISC (Reduced Instruction Set Computing) processors. RISC emphasises a simple instruction set made of highly optimised, single-clock instructions and a large number of general-purpose registers; this is a better match to integrated circuit technology than complex instruction sets [3] [4]. Complex operations can be synthesised by the compiler, minimising the need for additional transistors. That places the emphasis on software, with the transistors saved being used as memory registers. For instance, in assembly language, multiplying two variables and storing the result in the first variable (i.e. a=a*b) on a RISC system would look like the following (assuming 2:3 and 5:2 are memory locations).

Listing 2.1: Assembly on RISC

LOAD  A, 2:3
LOAD  B, 5:2
PROD  A, B
STORE 2:3, A

Each operation - LOAD, PROD, STORE - executes in one clock cycle, so the four instructions above take four clock cycles to perform the whole operation. Due to the simplicity of the operations, though, the processor will complete the task relatively quickly.

CISC takes the view that hardware is always faster than software and that a multi-clock, complex instruction set, adding transistors to the processor, can deliver better performance. It also minimises the number of assembly lines: complex work is carried out by individual instructions executed directly in hardware, as opposed to RISC processors, where each clock cycle executes a single simple instruction. CISC therefore places the emphasis on hardware, implementing additional transistors to support complex instructions. Within a CISC system, the multiplication example above would require a single line of assembly code.

Listing 2.2: Assembly on CISC

MULT 2:3, 5:2

In this case, the system must support an additional instruction, MULT. This complex instruction carries out the whole multiplication directly in hardware, without the need to specify LOAD and STORE instructions explicitly. However, because of its complexity it may take several clock cycles to complete, so the overall execution time is approximately the same as on RISC. For large codes with intensive computation, running on supercomputers that number thousands of cores, the additional transistors needed to handle complex instructions can create power and heat issues and add substantially to the energy demands of the systems themselves.

Modern RISC processors have become more complex than the early versions. They implement additional, more complex instructions and can execute two instructions per clock cycle. However, when comparing modern RISC with modern CISC processors, the differences in complexity and architectural design still exist, resulting in differences in both performance and energy consumption.

2.2 HPC Architectures

In this section we present the different architectures in terms of systems and memory. Both RISC and CISC processors can be found in any of the architectures discussed below.

2.2.1 System architectures

High Performance Computing and parallel architectures were first classified by Michael J. Flynn. Flynn's taxonomy¹ defines four classifications of architectures based on the instruction and data streams. These classifications are:

¹ IEEE Trans. Comput., vol. C-21, no. 9, pp. 948-60, Sept. 1972

• SISD - Single Instruction Single Data
• SIMD - Single Instruction Multiple Data
• MISD - Multiple Instruction Single Data
• MIMD - Multiple Instruction Multiple Data

Single Instruction Single Data: This classification defines a serial system that does not provide any form of parallelism for either of the streams (instruction and data). A single instruction stream is executed on a single clock cycle, and a single data stream is used as input to an instruction on a single clock cycle. Systems that belong to this group are old mainframes and standard single-core personal computers.

Figure 2.1: Single Instruction Single Data (Reproduced from Blaise Barney, LLNL).

Single Instruction Multiple Data: This classification defines a type of parallel processing where each processor executes the same instruction on a different stream of data on every clock cycle. Each instruction is issued by the front-end, and each processor can communicate with any other processor but has access only to its own memory. Array and vector processors, as well as GPUs, belong to this group.

Multiple Instruction Single Data: This classification defines the most uncommon parallel architecture, where multiple processors execute different instructions on the same data stream on every clock cycle. This architecture can be used for fault tolerance, where different systems working on the same data stream must report the same results.

Multiple Instruction Multiple Data: This classification defines the most common parallel architecture used today. Modern multi-core desktops and laptops fall within this category. Each processor executes a different instruction stream on a different data stream on every clock cycle.

2.2.2 Memory architectures

There are two main memory architectures that can be found within HPC systems: distributed memory and shared memory. An MIMD system can be built with either memory architecture.

Figure 2.2: Single Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).

Figure 2.3: Multiple Instruction Single Data (Reproduced from Blaise Barney, LLNL).

Figure 2.4: Multiple Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).

Distributed memory: In this architecture, each processor has its own local memory, apart from caches, and each processor is connected to every other processor via an interconnect. This requires the processors to communicate via the message-passing programming model. This memory architecture enables the development of Massively Parallel Processing (MPP) systems; examples of such systems include the Cray XT6 and the IBM BlueGene. Each processor acts as an individual system, running its own copy of the operating system. The total memory size can be increased by adding more processors and, in theory, can grow to any size. However, performance and scalability rely on an appropriate interconnect, and the approach introduces system management overhead.
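To make the message-passing model concrete, the following minimal C sketch shows one process sending a single value to another with MPI. It is only an illustration using our own naming choices, not code taken from the benchmarks or scripts used in this project.

/* Minimal sketch of the message-passing model: process 0 sends one
   double to process 1.  Compile with an MPI wrapper, e.g. mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;                        /* data lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);         /* explicit copy into rank 1's memory  */
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}

Each process owns only its local copy of value; no data is visible to another process unless it is sent explicitly, which is exactly the property that distinguishes distributed memory from shared memory.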

Figure 2.5: Distributed memory architecture (Reproduced from Blaise Barney, LLNL).

Shared memory: In this architecture, each processor has access to a global shared memory. Communication between the processors takes place via writes to and reads from memory, using the shared-variable programming model. The most common architecture of this type is Symmetric Multi-Processing (SMP), which can be divided into two variants: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). A UMA system is a single SMP machine in which every processor has equal access to the global memory, while a NUMA system is made by physically linking two or more SMP systems so that each can directly access the memory of the others, with accesses to remote memory taking longer than accesses to local memory. The processors do not require message-passing but an appropriate shared-memory programming model. Example systems include IBM and Sun HPC servers and any multi-processor PC or commodity server. The system appears as a single machine to the external user and runs a single copy of the operating system. Scaling the number of processors within a single system is not trivial, as memory access becomes a bottleneck.

Hybrid Distributed-Shared Memory: This could be characterised as the most common memory architecture used in supercomputers and other clusters today. It employs both distributed and shared memory and is usually made up of multiple SMP (shared-memory) nodes joined by an interconnect: within a node memory is shared, while each node has direct access only to its own memory and must send explicit messages to the other nodes in order to communicate.
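In practice the hybrid model is programmed by combining the two models above: message-passing between nodes and a shared-memory approach such as OpenMP within each node. The fragment below is a minimal illustrative sketch, with the array, its size and the reduction chosen purely for demonstration; it is not part of the project's codes.

/* Minimal sketch of hybrid programming: MPI between nodes, OpenMP threads
   within each node.  Compile with an MPI wrapper and OpenMP enabled,
   e.g. mpicc -fopenmp. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double work[N];
    int rank, i;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared-memory parallelism inside the node */
    #pragma omp parallel for reduction(+:local)
    for (i = 0; i < N; i++) {
        work[i] = (double)(rank + i);
        local += work[i];
    }

    /* message-passing between the nodes */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum: %e (threads per node: %d)\n",
               total, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}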

Figure 2.6: Shared Memory UMA architecture (Reproduced from Blaise Barney, LLNL).

Figure 2.7: Shared Memory NUMA architecture (Reproduced from Blaise Barney, LLNL).

Figure 2.8: Hybrid Distributed-Shared Memory architecture (Reproduced from Blaise Barney, LLNL).

2.3 Power issues in modern HPC systems

Modern HPC systems and clusters are usually built from commodity multi-core systems. Connecting such systems with a fast interconnect can create supercomputers and provide the platforms that scientists and other HPC users need. The increase in speed is mainly achieved by increasing the number of cores within each system, while lowering the clock frequency of each core, and by increasing the number of systems in each cluster. The main issue with the CPU technology used today is that it is designed without power efficiency in mind, solely following Moore's law for theoretical performance. While this has worked for Petascale systems that use such processors, it is a challenge for the design, build and deployment of supercomputers that need to achieve Exascale performance.

In order to address, and to some extent bypass, the power issues with the current technology, the use of GPUs is increasing, as they offer better flop-per-watt performance. Physicists, among others, suggest that Moore's law will gradually cease to hold true around 2020 [3]. That introduces the need for a new technology and design in CPUs, as supercomputers will no longer be able to rely on Moore's law to increase their theoretical peak performance. Alan Gara of IBM says that "the biggest part (of energy savings) comes from the use of a new processor optimised for energy efficiency rather than thread performance". He continues that in order to achieve that, "a different approach needs to be followed for building supercomputers, and that is the use of scalable, energy-efficient processors". More experts have addressed the power issues in a similar manner. Pete Beckman of ANL argues that "the issue of electrical power and the shift to multiple cores will dramatically change the architecture and programming of these systems". Bill Dally, chief scientist at NVIDIA, states that "an Exascale system needs to be constructed in such a way that it can run in a machine room with a total power budget not higher than what supercomputers use today". This can be achieved by improving the energy efficiency of the computing resources, closing the gap and reaching Exascale computing at acceptable levels.

The CPU is not the only significant power consumer in modern systems: memory, communications and storage add greatly to the overall power consumption of a system. Memory transistors are charged every single time a specific memory cell needs to be accessed. On commodity systems, memory chips are independent components, separate from the processor (main memory, as opposed to on-chip caches). This increases the power cost, as an additional memory interface is needed, along with the communication between the memory and the processor. Embedded devices follow the System-on-Chip (SoC) concept, where all the components are part of the same module, reducing distances and interfaces, and hence power.

Communication between nodes, rather than between the components of a single node, requires power as well. The longer the distance between systems, the more power is needed to drive the signals between them. Optical and serial links are already used to make communication faster and more efficient, which partly addresses the power issues.


Figure 2.9: Moore's law for power consumption (Reproduced from Wu-chun Feng, LANL).

On the other hand, the larger the systems become, the more communication they need. It is important to keep the distance between the individual nodes as short as possible. Decreasing the size of each node and keeping the extremes of a cluster close together could significantly reduce power needs and costs.

Commodity storage devices, such as hard disk drives, are the most common within HPC clusters, due to their simplicity, easy maintainability and relatively low cost. The target is to get a faster interconnect between the nodes and the storage pools, rather than to replace the storage devices themselves. High I/O is not very common in HPC, but it is very common in specific science fields that use HPC resources, such as astronomy, biology and the geosciences, which tend to work with enormous data-sets. Such data-intensive use-cases will increase the storage demands in terms of capacity, performance and power. HDDs of smaller physical size and SSDs (Solid State Disks) are becoming more common in data-intensive research and applications.

2.4 Energy and application efficiency

The driving force behind building new systems until very recently, and still for most vendors, has been to achieve the highest clock speed possible, following Moore's law. However, it has been suggested that around 2020 Moore's law will gradually cease to hold and a replacement technology will need to be found. Transistors will be so small that quantum theory or atomic

physics will take over and electrons will leak out of the wires [5]. Even with today's systems, Moore's law does not guarantee application efficiency and, of course, does not imply energy efficiency as the overall clock speed increases. On the contrary, application efficiency follows May's law². May's law states that software efficiency halves every 18 months, compensating for Moore's law. The main reason behind this is that every new generation of hardware introduces new, complex hardware optimisations that must be handled by the compiler, and compilers come up against an efficiency barrier. These two issues, especially that of energy efficiency, can be considered the biggest constraints on the design and development of acceptable Exascale systems in terms of performance, efficiency, consumption and cost.

To address this issue, HPC vendors and institutes have started using GP-GPUs (General Purpose Graphics Processing Units) within supercomputers, to achieve high performance without adding extra high-power commodity processors, leading to hybrid supercomputers. The fastest supercomputer in the world today is a RISC system, the K computer of the RIKEN Advanced Institute for Computational Science (AICS) in Japan, which uses SPARC64 processors and delivers a performance of 8.62 petaflops (a petaflop is equivalent to 1,000 trillion calculations per second). This system consumes 9.89 megawatts. The second fastest supercomputer, the Tianhe-1A of the National Supercomputing Center in Tianjin, China, is a hybrid machine which achieves 2.56 petaflops and consumes 4.04 megawatts. This is achieved by combining commodity CPUs, Intel Xeon, with NVIDIA GPUs. These numbers clearly show the difference that GPUs can make in terms of power consumption for large systems.

GPUs are able to execute specially ported code in much less time than standard CPUs, mainly due to their large number of cores and their design simplicity, delivering better performance-per-Watt. While overall a GPU can cost more in terms of power draw, it performs the operations very quickly, so that over the length of a run it overcomes the cost and proves to be both more energy and application efficient when compared to standard CPUs. In addition, it takes the processing load off the processor, reducing the energy demands on the standard CPU. Low-power processors and low-power clusters follow the same concept by using a large number of cores with the simplicity of reduced instruction sets. We can also hypothesise, based on the increased use of GPUs and the porting of applications to these platforms, that in the future the programming models for GPUs will spread even further and GPUs will become easier to program. In that case, the standard CPU could play the role of data distributor to the GPUs, with low-power CPUs being the most suitable candidates for such a task, as they would not need to undertake computationally intensive jobs.

From a power consumption perspective, the systems mentioned earlier consume 9.89 and 4.04 megawatts, for the K computer and Tianhe-1A respectively. The K computer is listed in 6th position on the Green500 list. The most power-efficient supercomputer, the IBM BlueGene/Q Prototype 2 hosted at NNSA/SC, consumes 40.95 kW and achieves 2097.19 MFLOPs per Watt. It is listed in 110th position on the TOP500 list, delivering 85.9 TFLOPs when executing the Linpack benchmark.
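To put these efficiency figures on a common footing, the MFLOPs-per-Watt value quoted for the BlueGene/Q prototype follows directly from its Linpack performance and its measured power draw:

85.9 TFLOPs ≈ 85,880,000 MFLOPs and 40.95 kW = 40,950 W, so 85,880,000 MFLOPs / 40,950 W ≈ 2097 MFLOPs per Watt.

The same ratio, sustained performance divided by measured power, is the performance-per-Watt figure used for comparisons throughout this dissertation.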

² May's Law and Parallel Software - http://www.linux-mag.com/id/8422/

Chapter 3

Literature review

In this chapter we look at projects related to low-power computing that have built and benchmarked low-power clusters.

3.1 Green500

The Green500 list is a re-ordering of the well-known TOP500 list, ranking the most energy-efficient supercomputers. The Green500 raises awareness about power consumption, promotes alternative total-cost-of-ownership performance metrics, and seeks to ensure that supercomputers only simulate climate change and do not create it [6]. The Green500 was started in April 2005 by Dr Wu-chun Feng at the IEEE IPDPS Workshop on High-Performance, Power-Aware Computing.

3.2 Supercomputing in Small Spaces (SSS)

The SSS project was started in 2001 by Wu-chun Feng, Michael S. Warren and Eric H. Wiegle, aiming at low-power architectural approaches and power-aware, software-based approaches. In 2002 the SSS project deployed the Green Destiny cluster, a 240-node system consuming 3.2 kW, placing it at #393 on the TOP500 list at the time. The SSS project has been making the case that traditional supercomputers need to stop following Moore's law for power consumption. Modern systems have been becoming less and less efficient, following May's law, which states that efficiency drops by half every 18 months. The project argues this with the fact that "from the early 1990s to the early 2000s, the performance of our n-body code for galaxy formation improved by 2000-fold, but the performance per watt only improved 300-fold and the performance per square foot only 65-fold" [9].

3.3 The AppleTV Cluster

A research team at the Ludwig-Maximilians University in Munich has built and experimented with a low-power ARM cluster made of AppleTV devices, the AppleTV Cluster. They also evaluated another ARM-based system, a BeagleBoard xM [28]. The team used CoreMark, High Performance Linpack, Membench and STREAM to measure the CPU (serial and parallel) and memory performance of each system. The CoreMark benchmark scored 1920 and 2316 iterations per second on the BeagleBoard xM and the AppleTV respectively. On the HPL benchmark, the systems achieved 22.6 and 57.5 MFLOPs in single precision. In double precision they achieved 29.3 and 40.8 MFLOPs for the BeagleBoard xM and the AppleTV respectively. The support for NEON acceleration (128-bit registers) on the BeagleBoard allowed it to achieve 33.8 MFLOPs in single precision mode.

In terms of memory performance, the team reports copying rates of 481.1 and 749.8 MB/s for the BeagleBoard xM and the AppleTV respectively. The researchers state that a modern Intel Core i7 CPU with 800MHz DDR2 RAM (the same frequency and technology as in the ARM systems used) can deliver more than ten times the reported bandwidth [28].

The power consumption of the AppleTV cluster, which achieves an overall system performance of 160.4 MFLOPs, is 10 Watts for the whole cluster when executing the HPL benchmark and 4 Watts when idle. That results in 16 MFLOPs per Watt when fully executing the benchmark.

3.4 Sony Playstation 3 Cluster

Researchers at North Carolina State University have built a Sony PS3 cluster³. The Sony PS3 uses an eight-core Cell Broadband Engine processor at 3.2 GHz and 256MB of XDR RAM, suitable for SMP and MPI programming. The 9-node cluster ran a PowerPC version of Fedora. The cluster achieved a total of 218 GFLOPs and 25.6 GB/s memory bandwidth. The researchers do not state any power consumption measurements. However, the power consumption of Sony PS3 consoles varies from 76 Watts up to 200 Watts in normal use, and the consoles are provided with a 380 Watt power supply. The processor feature size varies from a 90nm Cell CPU down to a 45nm Cell.

³ Sony PS3 Cluster - http://moss.csc.ncsu.edu/~mueller/cluster/ps3/

3.5 Microsoft XBox Cluster

Another research team, at the University of Houston, has built a low-cost computer cluster with unmodified XBox game consoles⁴. The Microsoft XBox comes with an Intel Celeron/P3 733 MHz processor and 64MB of DDR RAM. The 4-node cluster achieved a total of 1.4 GFLOPs when executing High Performance Linpack on GNU/Linux, consuming between 96 and 130 Watts. That gives a range of 10.7 to 14.58 MFLOPs per Watt. The cluster supported MPI and the Intel C++ and Fortran compilers.

⁴ Microsoft XBox Cluster - http://www.bgfax.com/xbox/home.html

3.6 IBM BlueGene/Q

In terms of high-end supercomputing projects, the IBM BlueGene/Q prototype machines aim at designing and building an energy-efficient supercomputer based on embedded processors. On the latest Green500 list (June 2011), the BlueGene/Q Prototype 2 is listed as the most energy-efficient system, achieving a total of 85880 GFLOPs overall performance. That translates to 2097.19 MFLOPs per Watt, as it consumes 40.95 kW. The second most energy-efficient entry belongs to the BlueGene/Q Prototype 1, achieving 1684.20 MFLOPs per Watt. The BlueGene/Q is not yet available on the market.

3.7 Less Watts

Rising concerns over power efficiency, the desire to cut power costs and the need to reduce overall CO2 emissions have pushed software vendors to look into saving power at the software level. The Open Source Technology Center of Intel Corporation has established an open source project, LessWatts.org, that aims to save power with Linux on Intel platforms. The project focuses on end users, developers and operating system vendors, delivering the components and tools needed to reduce the energy required by the Linux operating system⁶. The project targets desktops, laptops and commodity servers and achieves power savings by enabling, or disabling, specific extensions in the Linux kernel.

⁶ Less Watts: Saving Power with Linux - http://www.lesswatts.org/

3.8 Energy-efficient cooling

Apart from the considerations and research into reducing the overall energy of a system by using energy-efficient processors, there has also been research into, and solutions produced for, reducing the cooling needs of clusters and data-centres, which require huge amounts of power in total, including both the power needed for the systems and that needed for the cooling infrastructure. The main driving force behind such methods is the growing cost of keeping large systems and clusters at the correct temperature. HPC clusters require sophisticated and effective cooling infrastructure as well; such infrastructure might use more energy than the computing systems themselves. These new cooling systems do not solve the issues of heating within the processor, the efficiency of a system or its scalability to perform beyond a petaflop. However, they introduce an environmentally friendly cooling infrastructure, cutting maintenance costs and the overall energy demands of large clusters, similar to those of supercomputers.

3.8.1 Green Revolution Cooling

Green Revolution Cooling is a US-based company that offers cooling solutions for data-centres. They use a fluid submersion technology, GreenDEF™, that reduces the cooling energy used by clusters by 90-95% and the server power usage by 10-20% [19]. While these figures are interesting for commodity servers, and even more so for cooling systems, such approaches do not target the power efficiency of the processor architecture or the basic power needs of the systems. These solutions can be used with existing or future HPC clusters in order to achieve an overall low-power, environmentally friendly infrastructure.

Figure 3.1: GRCooling four-rack CarnotJet™ system at Midas Networks (source GRCooling).

3.8.2 Google Data Centres

Google has been investing in smart, innovative and efficient designs for their large data-centres, which are used to provide web services to millions of users. Two of their data-centres in Europe, one in Belgium and one in Finland, do not use any air conditioning or

chiller systems; instead they cool the systems using natural resources, such as the ambient air temperature and water. In Belgium, the average air temperature is lower than the temperature cooling systems typically provide to data-centres, so it can be used to cool the systems. Moreover, as the data-centre is close to an industrial canal, the canal water is purified and used to cool the systems. In Finland, the facility is built next to the Gulf of Finland, enabling the low temperature of the sea water to be used to cool the data-centre [20].

Figure 3.2: Google data-centre in Finland, next to the Gulf of Finland (source Google).

3.8.3 Nordic Research

Institutions, as well as industry, in Scandinavia and Iceland are investigating green, energy-efficient solutions to support large HPC and data-centre infrastructure at the lowest cost and with reduced CO2 emissions. To achieve this, projects aim to exploit abandoned mines (the Lefdal Mine Project) [19], a retired NATO ammunition depot in mountain halls (Green Mountain Data Centre AS) [22], as well as the design of new data-centres in remote mountain locations close to hydro-electric power plants, for natural cooling and green energy resources [23].

A new agreement has been signed between DCSC (Denmark), UNINETT Sigma (Norway), SNIC (Sweden) and the University of Iceland for the Nordic Supercomputer to operate in Iceland later in 2011. Iceland was chosen as its climate offers suitable natural resources for cooling such a computing infrastructure, and it produces 70% of its electricity from hydro power, 29.9% from geothermal and only 0.1% from fossil fuels [24].

Figure 3.3: NATO ammunition depot at Rennesøy, Norway (source Green Mountain Data Centre AS).

3.9 Exascale

The increasing number of computationally intensive problems and applications, such as weather prediction, nuclear simulation or the analysis of space data, has created the need for new computing facilities targeting Exascale performance. The IESP defines Exascale as "a system that is taken to mean that one or more key attributes of the system has 1,000 times the value of what an attribute of a Petascale system of 2010 has". Building Exascale systems with current technological trends would require huge amounts of energy, among other things such as storage rooms and cooling, to keep them running. Wilfried Verachtert, high-performance computing project manager at the Belgian research institute IMEC, argues that "the power demand for an Exascale computer made using today's technology would keep 14 nuclear reactors running. There are a few very hard problems we have to face in building an Exascale computer. Energy is number one. Right now we need 7,000MW for Exascale performance. We want to get that down to 50MW, and that is still higher than we want."

There are two main approaches being investigated for the design and build of Exascale systems, the Low-power, Architectural Approach and the Project-Aware, Software-based Approach [10], both still at the prototype level.

• Low-power, Architectural Approach: This is the approach we have chosen to work on in this project. Low-power, energy-efficient processors replace the standard commodity, high-power processors used in HPC clusters up to now. Using energy-efficient processors would enable system engineers to build larger systems, with larger numbers of processors, in order to achieve Exascale performance at acceptable power levels. IBM's BlueGene/Q Prototype 2 is right now the most energy-efficient, low-power supercomputer for its size, using low-power PowerPC processors [10]. The same architectural approach can be followed for other parts of the hardware: energy-efficient storage devices, efficient high-bandwidth networking and appropriate power supplies can decrease the total footprint of each system.

• Project-Aware, Software-based Approach: It is suggested by many systems researchers that the low-power architectural approach sacrifices too much performance, at an unacceptable level for HPC applications. A more architecture-independent approach is therefore suggested. This involves the use of high-power CPUs that support dynamic voltage and frequency scaling, which allows the design and programming of algorithms that conserve power by scaling the processor voltage and frequency up and down as needed by the application [10]; a short illustrative sketch of such an interface is given after this list.
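As a rough illustration of how the software-based approach can steer processor frequency, the sketch below uses the Linux cpufreq sysfs interface. The exact paths, supported frequencies and available governors depend on the kernel and the hardware, and the "userspace" governor must be active, so this is an assumed example rather than a recipe used in this project.

/* Illustrative sketch: request a lower CPU frequency through the Linux
   cpufreq sysfs interface.  Assumes root access and the "userspace"
   governor; paths and valid frequency values are hardware dependent. */
#include <stdio.h>

static int set_cpu0_khz(long khz)
{
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
    if (!f)
        return -1;              /* wrong governor, or insufficient permissions */
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}

int main(void)
{
    /* drop cpu0 to 800 MHz during a communication-bound phase (example value) */
    if (set_cpu0_khz(800000) != 0)
        perror("could not set frequency");
    return 0;
}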

The approach chosen for this project is the low-power architectural approach, as it enables the design and building of reliable, efficient HPC systems of any size and does not require any significant change to existing parallel algorithms and code. In specific designs and use-cases, a hybrid approach (a combination of both approaches) might be the golden mean between acceptable performance, power consumption, efficiency and reliability. Figure 3.4 presents the projected power demands of supercomputers from 2006 up to 2020. Given that the graph was compiled in 2010 and that its 2011 predictions match the current TOP500 systems, we can trust the predicted power demands, allowing for some deviation, as time passes. This justifies the need for supercomputers that are energy efficient.

Figure 3.4: Projected power demand of a supercomputer (M. Kogge)

Chapter 4

Technology review

In this chapter we examine the most developed and most likely low-power processor candidates for HPC.

4.1 Low-power Architectures

The low-power processor is not a new trend in the processor business. It is, however, a new necessity in modern computer systems, especially supercomputers. Energy-efficient processors have been used for many years in embedded systems as well as in consumer electronic devices, and systems used in the HPC field have long used low-power RISC processors such as Sun's SPARC and IBM's PowerPC. In this section we look into the most likely low-power processor candidates for future supercomputing systems.

4.1.1 ARM

ARM processors are widely used in many portable consumer devices, such as mobile phones and handheld organisers, as well as in networking equipment and other embedded devices such as the AppleTV. Modern ARM cores, such as the Cortex-A8 (single core, ranging from 600MHz to 1.2GHz), the Cortex-A9 (single-core, dual-core and quad-core versions with clock speeds up to 2GHz) and the upcoming Cortex-A15 (dual-core and quad-core versions, ranging from 1GHz to 2.5GHz), are 32-bit processors using 16 registers and are designed under the Harvard memory model, where the processor has two separate memories, one for instructions and one for data. This allows two simultaneous memory fetches. As ARM cores are RISC cores, they implement the simple load/store model.

The latest ARM processor in production, and available in existing systems, is the ARM Cortex-A9, using the ARMv7 architecture, which is ARM's first generation superscalar architecture. It is the highest performance ARM processor, designed around an advanced, high-efficiency, dynamic-length, multi-issue, out-of-order, speculating 8-stage superscalar pipeline. The Cortex-A9 delivers high levels of performance and power efficiency with the functionality required for leading-edge products across a broad range of systems [9] [10]. The Cortex-A9 comes in both multi-core (MPCore) and single-core versions, making it a promising alternative for low-power HPC clusters. What ARM cores lack is a 64-bit address space, as they support only 32-bit addressing. The recent Cortex-A9 comes with an optional NEON media and floating-point processing engine, aiming to deliver higher performance for the most intensive applications, such as video encoding [11].

The Cortex-A8 uses the ARMv7 architecture as well, but implements a 13-stage integer pipeline and a 10-stage NEON pipeline. The NEON support is used for accelerating multimedia as well as signal-processing applications. The default support for NEON in the Cortex-A8 comes from the fact that this processor is mainly designed for embedded devices. However, NEON technology can be used as an accelerator for processing multiple data elements with a single instruction; this enables the ARM to operate on four multiply-accumulates via dual-issue instructions to two pipelines [11]. NEON supports 64-bit and 128-bit registers and can operate on both integer and floating-point data.

Commercial server manufacturers are already shipping low-power servers with ARM cores. A number of different low-cost, low-power ARM boxes and development boards are available on the market as well, such as the OpenRD, DreamPlug, PandaBoard and BeagleBoard. Moreover, NVIDIA has announced the Denver project, which aims to build custom CPU cores based on the ARM architecture alongside its GPUs, targeting both personal computers and supercomputers [9].

Figure 4.1: OpenRD board SoC with ARM (Marvell 88F6281) (Cantanko).
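As a rough illustration of the kind of SIMD operation that the NEON engine discussed above provides, the following C fragment uses the standard arm_neon.h intrinsics to multiply-accumulate four single-precision values at a time. The function and array names are our own; this is only a sketch, not code used in the project.

/* Illustrative sketch of a NEON multiply-accumulate over four floats at a
   time, using the arm_neon.h intrinsics (requires an ARMv7 compiler with
   NEON enabled, e.g. gcc -mfpu=neon). */
#include <arm_neon.h>

void saxpy_neon(float *y, const float *x, float a, int n)
{
    float32x4_t va = vdupq_n_f32(a);            /* broadcast the scalar */
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);      /* load 4 floats        */
        float32x4_t vy = vld1q_f32(y + i);
        vy = vmlaq_f32(vy, va, vx);             /* vy += a * vx         */
        vst1q_f32(y + i, vy);                   /* store 4 results      */
    }
    for (; i < n; i++)                          /* scalar remainder     */
        y[i] += a * x[i];
}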

20 4.1.2 Atom

Atom is Intel's low-power processor, aimed at laptops and at low-cost, low-power servers and desktops, with clock speeds ranging from 800MHz to 2.13GHz. It supports both 32-bit and 64-bit registers and, being an x86-based architecture, is one of the most suitable alternative candidates to standard high-power processors so far. Server vendors already ship systems with Atom chips which, due to their low price, can be very appealing for prototype low-power systems that do not require software alterations. Each instruction loaded into the CPU is translated into a micro-operation performing a memory load and a store operation on each ALU, extending the traditional RISC design and allowing the processor to perform multiple tasks per clock cycle. The processor has a 16-stage pipeline, where each pipeline stage is broken down into three parts: decoding, dispatching and cache access [11].

The Intel Atom processor has two ALUs and two FPUs. The first ALU handles shift operations while the second handles jumps. The FPUs are used for arithmetic operations, including integer ones: the first FPU is used for addition only, while the second handles single-instruction multiple-data (SIMD) operations and operations that involve multiplication and division. Basic operations can be executed and completed within a single clock cycle, while the processor can use up to 31 clock cycles for more complex instructions, such as floating-point division. The newest models support Hyper-Threading technology, allowing parallel execution of two threads per core and providing virtually four cores on the system [11].

Figure 4.2: Intel D525 Board with Intel Atom dual-core.

21 4.1.3 PowerPC and Power

PowerPC is one of the oldest low-power RISC processor families used in the HPC field and is still used in one of the world's fastest supercomputers, IBM's BlueGene/P. PowerPC processors are also available in standard commercial servers for general-purpose computing, not just for HPC, and they support both 32-bit and 64-bit operation. PowerPC processors are, however, found in IBM's own systems, making them an expensive solution for low-budget projects and institutes.

The latest BlueGene/Q uses one of the latest Power processors, the A2. The PowerPC A2 is described as massively multicore and multi-threaded, with 64-bit support. Its clock speed ranges from 1.4GHz to 2.3GHz. Being a massively multicore processor, it can support up to 16 cores per chip with 4-way multi-threading, allowing simultaneous multithreading of up to 64 threads per processor [18]. Each chip has integrated memory and I/O controllers.

Figure 4.3: IBM's BlueGene/Q 16-core compute node (Timothy Prickett Morgan, The Register).

Due to its low power consumption and its flexibility, the design of the A2 is used in the PowerEN (Power Edge of Network) processor, which is a hybrid between a networking processor and a standard server processor. This type of processor is also known as a wire-speed processor, merging characteristics of network processors, such as low-power cores, accelerators, integrated network and memory I/O, smaller memory line sizes and low total power, with characteristics of standard processors, such as full ISA cores, support for standard programming models, operating systems, hypervisors and full virtualisation. Wire-speed processors are used in applications in the areas of network processing, intelligent I/O devices and streaming. The architectural consideration of power efficiency drops the power consumption to below 50% of the initial power consumption. The high number of hardware threads is able to deliver better throughput power-performance when compared to a standard CPU, but with poorer single-thread performance. Power is also minimised by operating at the lowest voltage necessary to function at a specific frequency [17].

22 4.1.4 MIPS

MIPS is a RISC processor that is widely used in consumer devices, most notably the Sony Playstation (PSX) and the Sony Playstation Portable (PSP). Being a low-power processor, its design is based on RISC principles, with all instructions completing in one cycle. It supports both 32-bit and 64-bit registers and implements the von Neumann memory architecture.


Figure 4.4: Pipelined MIPS, showing the five stages: instruction fetch, instruction decode, execute, memory access and write back (Wikimedia Commons).

Being of RISC design, MIPS uses a fixed-length, regularly encoded instruction set built around the load/store model, which is a fundamental concept of the RISC architecture. The arithmetic and logic operations in the MIPS design use 3-operand instructions, enabling compilers to optimise the formulation of complex expressions, branch/jump options and delayed jump instructions. Floating-point registers are supported in both 32-bit and 64-bit widths, in the same way as the general-purpose registers. Superscalar implementations are made easier by the absence of integer condition codes. MIPS offers flexible, high-performance caches and memory management with well-defined cache control options. The 64-bit floating-point registers and the pairing of two single 32-bit floating-point operations improve overall performance and speed up specific tasks by enabling SIMD [31] [32] [33] [34].

MIPS Technologies licenses its architecture designs to third parties so that they can design and build their own MIPS-based processors. The Chinese Academy of Sciences has designed the MIPS-based Loongson processor, and Chinese institutes have started designing

23 and building MIPS chips for their next generation supercomputers [8]. China’s Institute of Computing Technology (ICT) has licensed the MIPS32 and MIPS64 architectures from MIPS Technologies [35].

Figure 4.5: Motherboard with Loongson 2G processor (Wikimedia Commons).

Looking at the market, commercial MIPS products do not target the server or general-purpose computing markets, making it almost impossible to identify appropriate off-the-shelf systems for designing and building a MIPS low-power HPC cluster with the software support needed for HPC codes.

Chapter 5

Benchmarking, power measurement and experimentation

In this chapter we give a brief description of the benchmarking suites we considered and of the benchmarks we finally ran, together with the power measurement and experimentation methods used.

5.1 Benchmark suites

5.1.1 HPCC Benchmark Suite

The HPCC suite consists of seven low-level benchmarks, reporting performance for floating-point operations, memory bandwidth, and communication latency and bandwidth. The most common benchmark for measuring floating-point performance is Linpack, widely used for measuring the peak performance of supercomputer systems. While all of the benchmarks are written in C, Linpack builds upon the BLAS library, which is written in Fortran. Compiling the benchmarks successfully on the ARM architecture therefore requires the use of a C implementation of BLAS, such as the GNU version. The HPCC benchmarks, while easy to compile and execute, do not represent a complete HPC or scientific application. They are useful for identifying the performance of a system at a low level, but do not represent the performance of a system as a whole when executing a complete HPC application [14]. The HPCC benchmarks are free of cost.

5.1.2 NPB Benchmark Suite

The NAS Parallel Benchmarks (NPB) are developed by the NASA Advanced Supercomputing (NAS) Division. This benchmarking suite provides benchmarks for MPI, OpenMP, High Performance Fortran and Java, as well as serial versions of the parallel codes. The suite provides 11 benchmarks, the majority developed in Fortran, with only

4 benchmarks written in C. Most of the benchmarks are low-level, targeting specific system operations such as floating-point operations per second, memory bandwidth and I/O performance. Examples of full applications are provided as well, for acquiring more accurate results on the performance of high-performance systems [15]. The NAS Parallel Benchmarks are free of cost.

5.1.3 SPEC Benchmarks

The Standard Performance Evaluation Corporation (SPEC) provides a large variety of benchmarks, both kernel and application benchmarks, for many different systems, including MPI and OpenMP versions. The suites of interest to the HPC community are the SPEC CPU, MPI, OMP and Power benchmarks. The majority of the benchmarks represent HPC and scientific applications, allowing the overall performance of a system to be measured.

The CPU benchmarks are designed to provide performance measurements that can be used to compare computationally intensive workloads on different computer systems. The suite provides CPU-intensive codes, stressing a system's processor, memory subsystem and compiler. It provides 29 codes, of which 25 are available in C/C++ and 6 in Fortran. The MPI benchmarks are used for evaluating MPI-parallel, floating-point, compute-intensive performance across a wide range of cluster and SMP hardware. The suite provides 18 codes, of which 12 are developed in C/C++ and 6 in Fortran. The OMP benchmarks are used for evaluating floating-point, compute-intensive performance on SMP hardware using OpenMP applications. The suite provides 11 benchmarks, with only 2 of the codes available in C and 9 in Fortran.

The Power benchmark is one of the first industry-standard benchmarks used to measure the power consumption of servers and clusters in the same way as is done for performance. While this allows power measurements, it does not allow the performance of an HPC or other scientific application to be observed, as it uses Java server-based codes to evaluate the system's power consumption.

5.1.4 EEMBC Benchmarks

The EEMBC (Embedded Microprocessor Benchmark Consortium) provides a wide range of benchmarks for embedded devices such as those used in networking, digital media, automotive, industrial, consumer and office equipment products. Some of the benchmarks are free of cost and open source, while others are given under licence, academic or commercial. The benchmark suites provide codes for measuring single-core and multi-core performance, power consumption, telecom/networking performance and floating-point performance, as well as various codes for different classes of consumer electronic devices.

5.2 Benchmarks

In this section we describe the benchmarks we used to evaluate the systems in this project. These benchmarks do not represent full HPC codes, but they are established and well-defined benchmarks used widely for reporting the performance of computing systems. Full HPC codes tend to take a long time to execute, which proved to be a constraint for the project in terms of the available time. That is an additional reason behind the decision to run simpler kernel benchmarks, where the data sets can be defined by the user.

5.2.1 HPL

We use the High-Performance Linpack to measure the performance in flops of each different system. HPL solves a random dense linear system in double precision arithmetic, either on a single system or on distributed-memory systems. The algorithm used in this code uses "a 2D block-cyclic data distribution - Right-looking variant of the LU factorisation with row partial pivoting featuring multiple look-ahead depths - Recursive panel factorisation with pivot search and column broadcast combined - Various virtual panel broadcast topologies - bandwidth reducing swap-broadcast algorithm - backward substitution with look-ahead of depth 1" [16] [17]. The results outline how long it takes to solve the linear system and how many Mflops or Gflops are achieved during the computation. HPL is part of the HPCC Benchmark suite.
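For reference, the Gflops figure HPL reports follows directly from the approximate operation count of the LU solve for a problem of size N and the measured wall-clock time t in seconds:

\[
\mathrm{GFLOPS} \approx \frac{\tfrac{2}{3}N^{3} + O(N^{2})}{t \times 10^{9}}
\]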

5.2.2 STREAM

The STREAM benchmark is a synthetic benchmark that measures memory bandwidth and the computation rate for simple vector kernels [12]. The benchmark tests four memory functions: copy, scale, add and triad. It reports the bandwidth in MB/s as well as the average, minimum and maximum time taken to complete each of the operations. STREAM is part of the HPCC Benchmark suite. It can be executed either in serial or in multi-threaded mode.
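For illustration, the four kernels perform operations of the following form on large double-precision arrays; this is a sketch of the operations, not STREAM's own source, and the array length is a placeholder:

    #include <stddef.h>

    #define STREAM_N 2000000   /* illustrative array length, not STREAM's default */

    static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

    /* Sketch of the four STREAM kernels; STREAM itself times each loop
     * separately and reports the sustained bandwidth in MB/s. */
    void stream_kernels(double scalar)
    {
        size_t j;

        for (j = 0; j < STREAM_N; j++) c[j] = a[j];                  /* Copy  */
        for (j = 0; j < STREAM_N; j++) b[j] = scalar * c[j];         /* Scale */
        for (j = 0; j < STREAM_N; j++) c[j] = a[j] + b[j];           /* Add   */
        for (j = 0; j < STREAM_N; j++) a[j] = b[j] + scalar * c[j];  /* Triad */
    }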

5.2.3 CoreMark

The CoreMark benchmark is developed by the Embedded Microprocessor Benchmark Consortium. It is a generic, simple benchmark targeted at the functionality of a single processing core within a system. It uses a mixture of read/write, integer and control operations, including matrix manipulation, linked list manipulation, state machine operations and Cyclic Redundancy Checks, an operation commonly used in embedded systems. The benchmark reports how many iterations are performed in total and per second, plus the total execution time and total processor ticks. It can be executed either in serial or in multi-threaded mode, enabling hyper-threaded cores to be evaluated more effectively. CoreMark does not represent a real application; it stresses the processor's pipeline operations, memory access (including caches) and integer operations [26].
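As a rough illustration of the kind of integer and CRC work CoreMark exercises (this is not CoreMark's own code, and the reflected CRC-16 polynomial below is just an example choice):

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative reflected CRC-16 over a buffer, similar in spirit to the
     * CRC operations CoreMark uses to validate its list, matrix and state
     * machine workloads.  The polynomial 0xA001 is an example choice. */
    uint16_t crc16(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0;
        size_t i;
        int bit;

        for (i = 0; i < len; i++) {
            crc ^= data[i];
            for (bit = 0; bit < 8; bit++) {
                if (crc & 1)
                    crc = (uint16_t)((crc >> 1) ^ 0xA001);
                else
                    crc = (uint16_t)(crc >> 1);
            }
        }
        return crc;
    }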

5.3 Power measurement

Power measurement techniques vary and can be applied at many different points in the system. Power consumption can be measured between the power supply and the electrical socket, between the motherboard (or another hardware part of the system) and the power supply, as well as between individual parts of the system. Initially, we want to measure the system as a whole. That will let us know which systems can be bought "off-the-shelf" on the best performance-per-Watt basis.

For our experiments we adopt the technique used by the Green500 to measure the power consumption of a system: a power meter is placed between the power supply's AC input of a selected unit and a socket connected to the external power supply system. That allows us to measure the power consumption of the system as a whole. The power meter reports the power consumption of the system at any time and in any state, whether idle or running a specific code. By logging data at specific times, we can identify the power consumption at any moment required.

An alternative method of measuring the same form of power consumption is to use sensor-enabled software tools installed within the operating system. That has as a prerequisite that the hardware provides the needed sensors. Of the systems we used, the high-power Intel Xeon systems provided the necessary sensors and software, allowing us to use software tools on the host system to measure the power consumption. The low-power systems do not provide sensor support, preventing us from using software tools to gather their power consumption. Because of this, we used external power meters on all of the systems in order to treat all readings equally, using the same method, and make the comparison fairer.

Power measurement can also be performed on individual components of the system. That would allow us to measure specifically how much power each processor consumes without being affected by any other part of the system. With this method, we could also measure the power requirements and consumption of different parts of the system, such as the processor and the memory. While this is of great interest, and perhaps one of the best ways to qualify and quantify in detail where power goes and how it is used by each component, due to time constraints we could not invest the time and effort in this method for this project.

5.3.1 Metrics

In this project we use the same metric as the Green500 list, the "performance-per-watt" (PPW) metric used to rank the energy efficiency of supercomputers. The metric is defined by the following equation:

\[
\mathrm{PPW} = \frac{\mathrm{Performance}}{\mathrm{Power}} \tag{5.1}
\]

Performance in equation (5.1) is defined as the maximal performance reported by the corresponding benchmark: GFLOPS (Giga FLoating-point OPerations per Second) for High Performance Linpack, MB/s (Mega Bytes per Second) for STREAM and Iter/s (Iterations per Second) for CoreMark. Power in equation (5.1) is defined as the average system power consumption, in Watts, during the execution of each benchmark for the given problem size.
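As a worked example, using the CoreMark figures reported later in Table 7.3, the Intel Xeon system sustains 6636.40 iterations per second at an average draw of 119 Watts, giving:

\[
\mathrm{PPW}_{\mathrm{Xeon}} = \frac{6636.40\ \mathrm{Iter/s}}{119\ \mathrm{W}} \approx 55.76\ \mathrm{Iter/s\ per\ Watt}
\]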

5.3.2 Measuring unit power

The power measurements were performed using the Watts up? PRO ES and the CREATE ComfortLINE power meters. The meter is placed between the power supply's AC input of the machine to be monitored and the socket connected to the external power supply infrastructure, and reports the Watts consumed at any time. The power meter is provided with a USB interface and software that allow us to record the data we need on an external system and study it at any desired time. This methodology reflects the technique followed to submit performance results to the Green500 list [12]. The basic set-up is illustrated by figure 5.1.

Figure 5.1: Power measurement setup.

5.3.3 The measurement procedure

The measurement procedure consists of nine simple steps, similar to those described in the Green500 List Power Measurement Tutorial [15].

7 Watts up? - http://www.wattsupmeters.com/
8 CREATE - The Energy Education Experts - http://www.create.org.uk/

1. Connect the power meter between the electricity socket and the physical machine.
2. Power on the meter (if required).
3. Power on the physical machine.
4. Start the power usage logger.
5. Initialise and execute the benchmark.
6. Start recording of power consumption.
7. Finish recording of power consumption.
8. Record the reported benchmark performance.
9. Load power usage data and calculate average and PPW.

With the physical machine and the power meter connected and both running, we initialise the execution of the benchmark and then start recording the power consumption data of the system. We use a problem size large enough to keep the fastest system busy enough to provide a reliable recording of power usage during the execution time. That gives the other systems even longer execution times, allowing us to gather accurate power consumption data for every system we examine. For each benchmark the problem size can vary depending on hardware limitations (e.g. memory size, storage).

5.4 Experiments design and execution

Experimentation is the process of defining a series of experiments, or tests, that are conducted in order to discover something about a particular process or system. In other words, "experiments are used to study the performance of processes and systems" [25]. The performance of a system, though, depends on variables and factors, both controllable and uncontrollable.

The accuracy of an experiment, meaning the success of the measurement and the observation, depends on these controllable and uncontrollable variables and factors, as they can affect the results. These variables can vary under different conditions and environments. For instance, the execution of unnecessary applications while conducting the experiments is a controllable variable that can negatively affect the experimental results. The operating system CPU scheduling algorithm, on the other hand, is not a controllable variable and can vary within the same operating system when executed on a different architecture; this plays a major role in the differentiation of the results from system to system. Likewise, the architecture of the CPU itself is an uncontrollable variable that will affect the results. The controllable factors for this project have been identified as below:

• Execution of non-operating-system-specific applications and processes.
• Installation of unnecessary software packages, as that can result in additional power consumption for unneeded services.
• Multiple uses of a system by different users.

These factors have been eliminated in order to get more representative and unaffected results. The uncontrolled factors have been identified as below:

• Operating System scheduling algorithms.
• Operating System services/applications.
• Underlying hardware architecture and implementation.
• Network noise and delay.

From this list, the only factor that is partially controlled is the network noise and delay. We use private IPs with NAT, which prevents the machines from being contacted from outside the private network unless they issue an external call. Keeping the systems to be measured off a public network eliminates the noise and delay introduced on the physical wire by other devices connected to that network. Finally, the technical phase of experimentation has been separated into seven stages:

• Designing the experiments.
• Planning the experiments.
• Conducting the experiments.
• Collecting the data.
• Generating data sets.
• Generating graphs and tables.
• Results analysis.

5.5 Validation and reproducibility

The validation of each benchmark run is confirmed either by the validation tests each benchmark provides, like the residual tests in HPL for instance, or by being accepted for publication, which was the case for the CoreMark results published on the CoreMark website9. The STREAM benchmark also states at the end of each run whether it validates or not. Since all of the experiments with all of the benchmarks validated, we can claim accuracy and correctness for the results we present below. Reproducibility is confirmed by executing each benchmark four times with the same options and definitions. The average of all of the runs is taken and presented in the results

9 CoreMark scores - http://www.coremark.org/benchmark/index.php?pg=benchmark

that follow in this section. The power readings were taken once every second during the execution of each benchmark. The average is then calculated to give the average power consumption of each system when running a specific benchmark.
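The averaging step itself is straightforward. The sketch below assumes one power sample per second, written as a single value per line to a file named power.log; the file name and format are placeholders rather than the actual output of the scripts in Appendix D.

    #include <stdio.h>

    /* Average the per-second power samples (one value in Watts per line) and
     * report the total energy in watt-seconds.  "power.log" and the
     * one-value-per-line format are assumptions made for this sketch. */
    int main(void)
    {
        FILE *fp = fopen("power.log", "r");
        double watts, sum = 0.0;
        long samples = 0;

        if (fp == NULL) {
            perror("power.log");
            return 1;
        }
        while (fscanf(fp, "%lf", &watts) == 1) {
            sum += watts;
            samples++;
        }
        fclose(fp);

        if (samples > 0) {
            printf("average power: %.2f W over %ld s\n", sum / samples, samples);
            printf("total energy : %.2f Ws\n", sum);  /* one sample per second */
        }
        return 0;
    }

Because the samples are one second apart, the sum of the samples is also the total energy in watt-seconds, which corresponds to the total consumption figures quoted in Chapter 7.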

Chapter 6

Cluster design and deployment

In this chapter I discuss the hardware and software specifications of the hybrid cluster I have designed and built as part of this project. I discuss the issues I encountered and how I solved them.

6.1 Architecture support

6.1.1 Hardware considerations

To evaluate the performance of low-power processors effectively we need a suitable infrastructure that enables us to run the same experiments across a number of different systems, both low-power and high-power, in order to perform a comparison on equal terms. Identifying systems identical in every aspect apart from the CPU is realistically not feasible within the time and budget of this project. Therefore, the experiments are designed in such a way that they measure equal software metrics, and for the analysis of the results we take into consideration any important differences in the hardware that can affect the interpretation of the results.

The project experiments with different architectures: standard x86 [9] (i.e. Intel Xeon), RISC-style x86 [10] (i.e. Intel Atom) and ARM [11] (Marvell Sheeva 88F6281). These form a modern comparison of CISC (Complex Instruction Set Computing) versus RISC (Reduced Instruction Set Computing) designs for HPC use. Each of these architectures, though, uses a different register width (32/64-bit). For instance, both x86 processors support 64-bit registers while ARM supports only 32-bit registers. That may prove to be an issue for scientific codes from a software performance perspective, as the same code may behave and perform differently when compiled on a 32-bit and a 64-bit system.

While the registers (processor registers, data registers, address registers etc.) are one of the main differences between architectures, identical systems are very hard to build when using chips of different architectures, because the other parts of the hardware differ as well. The boards will be different, memory chips and sizes may differ, and networking support can differ too (e.g. Megabit versus Gigabit support). Also, different hard disk types, such as HDD versus SSD, will affect the total power consumption of a system.

6.1.2 Software considerations

Moving from the architectural differences to the software level, some tool-chains (libraries, compilers, etc.) are not identical for every architecture. For instance, the official GNU GCC ARM tool-chain is at version 4.0.3 while the standard x86 version is at 4.5.2. We solved this by using the binary distributions that come by default with the Linux distributions of specific vendors, such as Red Hat in our case, which ships GCC 4.1.2 with its operating system on any supported architecture. The source code can also be used to compile the needed tools, but that proves to be a time-consuming, and sometimes non-trivial, task. It might be the only way, though, of installing a specific version of a tool-chain when there is no binary compiled for the needed architecture.

The compiled Linux distributions available for ARM, such as Debian GNU/Linux and Fedora, are compiled for the ARMv5 architecture, which is older than the architecture the latest ARM processors are based on, ARMv7. Other distributions, such as Slackware Linux, are compiled for the even older ARMv4 architecture. Using an operating system, compilers, tools and libraries compiled for an older architecture does not take advantage of the additional capabilities of the newer architecture's instruction set. A simple example is the comparison between x86 and x86_64 systems: a standard x86-compiled operating system running on x86_64 hardware would not take advantage of the larger virtual and physical address spaces, preventing applications and codes from using larger data sets.

Intel Atom, on the other hand, does not have any issues with compiler, tool and software support. Being an x86 based architecture, it supports and can handle any x86 package that is available for commodity high-power hardware, which is widely used nowadays in scientific clusters and supercomputers.

6.1.3 Soft Float vs Hard Float

Soft float uses an FPU (Floating Point Unit) emulator at the software level, while hard float uses the hardware FPU. As described earlier, most modern ARM processors come with FPU support. However, in order to provide full FPU support, the required tools and libraries need to be recompiled from scratch. Dependency packages would need to be recompiled as well, which can include low-level libraries such as the C library. The supported Linux distributions, compilers, tools and libraries that target the ARMv5 architecture use soft float, as ARMv5 does not come with hardware FPU support. Therefore, they are unable to take advantage of the processor's FPU and the additional NEON SIMD instructions. It is reported that recompiling the whole operating system from scratch with hard-float support can increase performance by up to 300% [27]. At present there is no distribution fully available that takes advantage of the hardware FPU, and recompiling the largest part of a distribution from scratch is beyond the scope of this project.

6.2 Fortran

The GNU ARM tool-chain provides C and C++ compilers but not a Fortran compiler. That is a limitation in itself, as it means Fortran code cannot be compiled and run widely on the ARM architecture. That can be a restricting factor for many scientists and HPC system designers at this moment, as a great number of HPC and scientific applications are written in Fortran. Specific Linux distributions, such as Debian GNU/Linux and Fedora, provide their own compiled GCC packages, including Fortran support. On a non-supported system, porting Fortran code to C can be time-consuming. A way to do this is to use Netlib's f2c [22] library, which can port Fortran code to C automatically. Even when the whole code is ported to C successfully, additional work may be needed to link the MPI or OpenMP calls correctly within the C version. What is more, the f2c tool supports only Fortran 77 codes. As part of this project, we created a basic script to automate the process of converting and compiling the original Fortran 77 code to C. Other proprietary and open-source compilers, such as G95, PathScale and PGI, do not yet provide Fortran, or other, compilers for the ARM architecture.

6.3 C/C++

The C/C++ support of the ARM architecture is entirely acceptable and at the same level as on the other architectures. However, we used the GNU C/C++ compiler and did not investigate any proprietary compilers. Compiler suites that are common in HPC, such as PathScale and PGI, do not support the ARM architecture. Both MPI and OpenMP are supported on all the architectures we used, without any need for additional software libraries or porting of the existing codes.

6.4 Java

The Java runtime environment is supported on the ARM architecture by the official Oracle Java for embedded systems, targeting ARMv5 (soft float), ARMv6 (hard float) and ARMv7 (hard float). It lacks, though, the Java compiler; that would require developing and compiling the application on a system of another architecture that provides the Java compiler and then executing the resulting binary on the ARM system.

6.5 Hardware decisions

In order to evaluate the systems, the design of the cluster reflects that of a hybrid cluster, interconnecting systems of different architectures. Our cluster consists of the following machines.

Processor                      Memory              Storage   NIC     Status
Intel Xeon 8-core (E5507)      16GB DDR3-1.3GHz    SATA      1GigE   Front-end / Gateway
Intel Xeon 8-core (E5507)      16GB DDR3-1.3GHz    SATA      1GigE   Compute-node 1
Intel Xeon 8-core (E5507)      16GB DDR3-1.3GHz    SATA      1GigE   Compute-node 2
Intel Xeon 8-core (E5507)      16GB DDR3-1.3GHz    SATA      1GigE   Compute-node 3
Intel Atom 2-core (D525)       4GB DDR2-800MHz     SATA      1GigE   Compute-node 4
Intel Atom 2-core (D525)       4GB DDR2-800MHz     SATA      1GigE   Compute-node 5
ARM (Marvell 88F6281) 1-core   512MB DDR2-800MHz   NAND      1GigE   Compute-node 6
ARM (Marvell 88F6281) 1-core   512MB DDR2-800MHz   NAND      1GigE   Compute-node 7

Table 6.1: Cluster nodes hardware specifications

The cluster provides access to 34 cores, 57GB of RAM and 3.6TB of storage. All of the systems, both the gateway and the compute-nodes, are connected to a single network. The gateway has a public and a private IP and each compute-node a private IP. That enables all the nodes to communicate with each other, while the gateway allows them to access the external public network and the Internet if needed.

6.6 Software decisions

The software details of each system are outlined in the table that follows.

System      OS         C/C++/Fortran   MPI            OpenMP      Java
Front-end   SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node1       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node2       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node3       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node4       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node5       SL 5.5     GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6
Node6       Fedora 8   GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6 (embedded)
Node7       Fedora 8   GCC-4.1.2       MPICH2-1.3.2   GCC-4.1.2   Java 1.6 (embedded)

Table 6.2: Cluster nodes software specifications

The x86 based systems run Scientific Linux 5.5 x86_64 with the latest supported GNU Compiler Collection, which provides C, C++ and Fortran compilers. We installed the latest MPICH2 version to enable programming with the message-passing model. Regarding OpenMP support, GCC provides shared-variable programming using OpenMP directives by specifying the correct flag at compile time. For Java, we deployed Oracle's SDK, which provides both the compiler and the runtime environment.

Regarding the ARM systems, there are some differences. The operating system installed is Fedora 8, which belongs to the same family as Scientific Linux, being a Red Hat related project, but this specific version is older. Deploying a more recent operating system is possible, but due to project time limitations we used the pre-installed operating system together with its compilers and libraries. However, the GCC version is the same across all systems. MPI and OpenMP are supported by MPICH2 and GCC respectively. In relation to Java, Oracle provides an official version of the Java Runtime Environment for embedded devices, and that is the version we run on the ARM architecture. It lacks, though, the Java compiler, allowing only the execution of pre-compiled Java applications.

The batch system used to connect the front-end and the nodes is Torque, which is based on OpenPBS10. Torque provides both the server and client sides of the batch system as well as its own scheduler, which is, though, not very flexible. We did not face any issues installing and configuring the batch system across the different architectures and systems.

10 PBS Works - Enabling On-Demand Computing - http://www.pbsworks.com/

Figure 6.1: The seven-node cluster that was built as part of this project.

6.7 Networking

In terms of network connectivity, the front-end acts as a gateway to the public network and the Internet; therefore it has a public IP which can be used to access it remotely as a login-node. As the front-end needs to communicate with the nodes as well, it uses a second interface with a private IP within the network 192.168.1.0/24. Each of the compute-nodes uses a private IP on a single NAT (Network Address Translation) interface. That allows each node to communicate with every other node in the cluster as well as the front-end, which is used as a gateway when communication with the public network is needed.

Hostname   IP               Status
lhpc0      129.215.175.13   Gateway
lhpc0      192.168.1.1      Front-end
lhpc1      192.168.1.2      compute-node
lhpc2      192.168.1.3      compute-node
lhpc3      192.168.1.4      compute-node
lhpc4      192.168.1.5      compute-node
lhpc5      192.168.1.6      compute-node
lhpc6      192.168.1.7      compute-node
lhpc7      192.168.1.8      compute-node

Table 6.3: Network configuration

The physical connectivity between the systems is illustrated by the figure below.

Figure 6.2: Cluster connectivity

6.8 Porting

The main reason for porting an application is the incompatibility between the architecture the application was initially developed for and the target architecture. As already mentioned in this report, the ARM architecture does not widely support a Fortran compiler. As a result, either specific Linux distributions must be used or Fortran code must be ported to C or C++ in order to run it successfully on ARM. It is not part of this project to investigate the extent to which this can be done, either for the benchmarks used or for any other HPC or scientific application. The Intel Atom processor, being of x86 architecture, supports all the widely used HPC and scientific tools and codes. That means no porting needs to be done for any benchmark or code one wishes to run on such a platform. Thus, Atom systems can be used to build low-power clusters for HPC with Fortran support. Hybrid clusters (i.e. consisting of Atom and other low-power machines) can be deployed as well. That would require the appropriate configuration of the batch system into different queues, reflecting the configuration of each group of systems. For instance, there could be a Fortran-supported queue and a generic queue for C/C++. Queues that group together systems of the same architecture can be created as well, in the same way as is already done with GPU queues and standard CPU queues on clusters and supercomputers.

6.8.1 Fortran to C

While investigating the issue of Fortran support on ARM, I came across a possible workaround for platforms that do not support Fortran: the f2c tool (i.e. Fortran-to-C) from the Netlib repository, which can convert Fortran code to C. There are two main issues with this tool. Firstly, f2c is developed to convert only Fortran 77 code to C. Secondly, and more relevant to HPC and scientific codes, calls to the MPI and OpenMP libraries might not be converted successfully, so the converted C code may fail to compile even when linked correctly with the MPI and OpenMP C libraries. The f2c tool was used, for instance, to port the LAPACK library to C, and it has also influenced the development of the GNU g77 compiler, which uses a modified version of the f2c runtime libraries. We think that with more effort and a closer study of f2c, it could be used to convert HPC codes directly from Fortran 77 to C.

6.8.2 Binary incompatibility

Another issue with hybrid systems made of different architectures is the binary incompatibility of compiled code. A code compiled on an x86 system will not, in most cases, be able to execute on the ARM architecture, and vice versa, unless it is a very basic one without system calls that relate to the underlying system architecture. This is a barrier for the design and deployment of hybrid clusters like the one we built for this project.

This architecture incompatibility requires the existence and availability of login-nodes for each architecture in order for users to be able to compile their applications for the target platform. In addition, each architecture should provide its own batch system. However, in order to eliminate the need for additional systems and the added complexity of additional schedulers and queues, a single login-node can be used with specific scripts to enable code compilation for the different architectures. This single front-end, as well as the scheduler (be it the same machine or another), can have different queues for each architecture, allowing users to submit jobs to the desired platform each time without conflicts and faulty runs due to binary incompatibility.

6.8.3 Scripts developed

To ease and automate the deployment and management of the cluster, as well as the power readings, we developed a few shell scripts as part of this project. The source of the scripts can be found in Appendix D.

• add_node.sh: Adds a new node to the batch system. It copies all the necessary files to the targeted system, starts the required services, mounts the filesystems and attaches the node to the batch pool. Usage: ./add_node.sh [architecture]
• status.sh: Reports on the status of each node, i.e. whether the batch services are running or not. Usage: ./status.sh
• armrun.sh: Can be used to execute any command remotely on the ARM systems from the x86 login-node. In particular, it can be used to compile ARM-targeted code from the x86 login-node without requiring a login to an ARM system. Usage: ./armrun
• watt_log.sh: Captures power usage on PowerEdge servers with IPMI sensor support. It logs the readings in a defined file, from which the average can be calculated as well. Usage: ./watt_log.sh

Chapter 7

Results and analysis

In this section we present and analyse the results gathered during the experimentation process on the hybrid cluster we built during this project. We start by discussing the quoted Thermal Design Power and the idle power consumption of each system, and then go into more detail for each benchmark individually.

7.1 Thermal Design Power

Each processor vendor defines a maximum Thermal Design Power (TDP). This is the maximum power the processor's cooling system is required to dissipate and therefore the maximum power a processor is expected to use. It is expressed in Watts. Below we present the values as given by the vendors of each of the processors we used.

Processor               GHz    TDP                  Per core
Intel Xeon, 4-core      2.27   80 Watt              20 Watt
Intel Atom, 2-core      1.80   13 Watt              6.5 Watt
ARM (Marvell 88F6281)   1.2    0.87 Watt (870 mW)   0.87 Watt

Table 7.1: Maximum TDP per processor.

The Intel Xeon system uses two quad-core processors, each with a TDP of 80 Watts, giving a maximum total of 160 Watts per system. These first values already give us a clear idea of the power consumption of each system. Dividing the TDP of the processor by the number of cores, we get 20 Watts for each Intel Xeon core, 6.5 Watts for each Intel Atom core and just 870 mW for the ARM (Marvell 88F6281). From this we can clearly see the difference between commodity server processors, low-power server processors and purely embedded processors. The cooling mechanism within each system is scaled down or up according to the scope of the system and the design of the processor.

42 7.2 Idle readings

In order to identify the power consumption of a system when it is idle (i.e. not processing), we gathered the power consumption rate without running any special software or any of the benchmarks. We measured each system for 48 hours, giving a concrete indication of how much power each system consumes in idle mode. The results are listed below.

Processor               Watt
Intel Xeon, 8-core      118 Watt
Intel Atom, 2-core      44 Watt
ARM (Marvell 88F6281)   8 Watt

Table 7.2: Average system power consumption on idle.


Figure 7.1: Power readings over time.

In figure 7.1 we can see that each system tends to use relatively more power when it boots, then stabilises and keeps a constant power consumption rate over time when not executing any special software. These results therefore reflect the power consumption of each system while running its respective operating system after a fresh installation, with the only additional service running being the batch system we installed. We can also observe that the Intel Xeon system tends to increase its power usage slightly, by 1 Watt (from 118 to 119), approximately every 20 seconds, most probably due to a specific operating system service or procedure. The results confirm the TDP values given by each manufacturer, as the systems with the lowest TDP values are also those that consume the least power when idle.

7.3 Benchmark results

In this section we present and discuss the results of each benchmark individually, across the various architectures and platforms on which they were executed.

7.3.1 Serial performance: CoreMark

Table 7.3 shows the results of the CoreMark benchmark. As the power consumption of the CPU drops, its efficiency increases: the Intel Xeon system performs 55.76 iterations per Watt consumed, Intel Atom 65.9 iterations per Watt and ARM 206.63 iterations per Watt. In terms of power efficiency, ARM is ahead of the other two candidates. The trade-off comes in the total execution time, as ARM, being single-core, takes 3.5 and 1.5 times longer to complete the iterations than Intel Xeon and Intel Atom respectively. Intel Atom, while it consumes less than half the power of Intel Xeon, achieves a performance-per-Watt (PPW) that does not differ greatly from that of Intel Xeon, while taking 2.3 times longer to complete the operations.

Processor               Iterations/s   Total time (s)   Usage      PPW (Iter/s per Watt)
Intel Xeon              6636.40        150.68           119 Watt   55.76
Intel Atom              2969.70        336.73           45 Watt    65.9
ARM (Marvell 88F6281)   1859.67        537.72           9 Watt     206.63

Table 7.3: CoreMark results with 1 million iterations.

Calculating the total power consumption for performing the same number of iterations, ARM (Marvell 88F6281) proves to be the most power efficient, with Intel Atom following and Intel Xeon consuming the maximum amount of power. The systems consume in total 17,930, 16,152 and 4,839 watt-seconds for Intel Xeon, Intel Atom and ARM respectively.
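These totals are simply the average power draw multiplied by the execution time, i.e. energy in watt-seconds; for the Intel Xeon system, for example:

\[
E_{\mathrm{Xeon}} = 119\ \mathrm{W} \times 150.68\ \mathrm{s} \approx 17{,}931\ \mathrm{Ws}
\]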

44 150.0 7000

112.5 5250

Iterations per second

75.0 3500 Watts per second per Watts

37.5 1750

0 0 Intel Xeon Intel Atom ARM WattsWatts Iterations

Figure 7.2: CoreMark results for 1 million iterations.

The same differences in performance, and in power consumption, are observed with both smaller and larger numbers of iterations, as presented in figures 7.3 and 7.4 respectively. We also observe that the number of iterations per second remains approximately the same regardless of the total number of iterations; the total execution time increases proportionally as the total number of iterations increases. The differences between the various systems in execution time stay near the same values, with power consumption also staying at the same levels. The results that bring the ARM system ahead of the other two candidates, in terms of performance per Watt, can be explained by the simplicity of the CoreMark benchmark, which targets integer operations.

45 150.0 7000

112.5 5250

Iterations per second

75.0 3500 Watts per second per Watts

37.5 1750

0 0 Intel Xeon Intel Atom ARM WattsWatts Iterations

Figure 7.3: CoreMark results for 1 thousand iterations.


Figure 7.4: CoreMark results for 2 million iterations.

The results presented so far show the performance of a single core per system. The Intel Xeon system, though, has 8 cores and the Intel Atom 4 logical cores (2 cores with Hyper-Threading on each core). The results with all threads enabled on each system are illustrated by figure 7.5.


Figure 7.5: CoreMark results for 1 million iterations utilising 1 thread per core.

We can observe that the performance increases almost proportionally for Intel Xeon and Intel Atom, achieving in total 51516.21 and 9076.67 iterations per second, giving 432.90 and 201.7 iterations per Watt respectively. With these results, the ARM processor is ahead of Intel Atom by 4.93 iterations per Watt, and 261.65 iterations per Watt behind Intel Xeon, which has a significantly higher clock speed, 2.27GHz versus 1.2GHz. With these considerations in mind, as well as the fact that ARM does not support 64-bit registers, we could argue that there is plenty of room for development and progress for the ARM microprocessor, as we can also see from its current developments with multi-core support and NEON acceleration.

CoreMark is not based on, and does not represent, any real application, but it allows us to draw some conclusions specifically about the performance of a single core and the CPU itself. The presented results show clearly that the CPU with the highest clock speed, and architectural complexity, achieves the highest performance, being able to perform a larger number of iterations per second in a shorter total execution time. In our experiments Intel Xeon, which achieves the best performance, also uses the highest amount of power, both instantaneously and over the whole execution, to perform the total number of iterations. Based on the figures and results presented earlier in this section, ARM is the most efficient processor on a performance-per-Watt basis, handling integer operations very efficiently.

Looking solely at iterations per second, figures 7.6 and 7.7 show how each system performs, in terms of iterations and speedup, for the serial version as well as for 2, 4, 6 and 8 threads. Intel Xeon scales well, while the speedup of Intel Atom and ARM levels off, as Amdahl's law would suggest, once we use more threads than the physical number of cores. Thus, the ARM system, being a single-core machine, performs any task, serial or multi-threaded, serially.
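Speedup here is the usual ratio of single-thread to p-thread execution time, which Amdahl's law bounds for a code whose parallelisable fraction is f:

\[
S(p) = \frac{T(1)}{T(p)} \le \frac{1}{(1-f) + f/p}
\]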


Figure 7.6: CoreMark performance for 1, 2, 4, 6 and 8 cores per system.

48 8

6

4 SPEEDUP

2

0 1 2 4 6 8

Intel Xeon Intel Atom ARM

Figure 7.7: CoreMark performance speedup per system.

The same rule applies to Intel Xeon. Figure 7.8 shows that the Intel Xeon system hits a performance wall once more threads are allocated than the actual number of cores on the system.


Figure 7.8: CoreMark performance on Intel Xeon.

In figure 7.9 we can see the power changes over time while benchmarking each system with CoreMark. As can be clearly seen, the power usage throughout the execution of the benchmark on each system is stable. The low-power systems do not raise their power consumption as much as the high-power Intel Xeon system. An explanation for this can be the fact that, in order to keep the load balanced between the processors, the system utilises more than a single core even when executing a single thread, thus requiring more power. The Intel Xeon system increases its power usage by 5.88%, Intel Atom by 0.8% and ARM by 12.5%.


Figure 7.9: Power consumption over time while executing CoreMark.

7.3.2 Parallel performance: HPL

For HPL, we used four different approaches to identify a suitable problem size for each system. The first is the rule of thumb suggested by the HPL developers, giving a problem size that uses nearly 80% of the total system memory11. The second uses an automated script provided by Advanced Clustering Technologies, Inc. that calculates the ideal problem size based on the information given for the target system12. The third uses the ideal problem size of the smallest machine for all of the systems. The fourth uses a very small problem size to identify differences in performance depending on problem size, as problem sizes that do not fit in the physical memory of the system need to make use of swap memory, with a drop-off in performance. All the problem sizes are presented in table 7.4.

11 http://www.netlib.org/benchmark/hpl/faqs.html#pbsize
12 http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html

Processor                                         Problem size   Block size   Method
Intel Xeon                                        13107          128          HPL
Intel Atom                                        3276           128          HPL
ARM (Marvell 88F6281)                             409            128          HPL
Intel Xeon                                        41344          128          ACT
Intel Atom                                        20608          128          ACT
ARM (Marvell 88F6281)                             7296           128          ACT
Intel Xeon / Intel Atom / ARM (Marvell 88F6281)   7296           128          Equal size
Intel Xeon / Intel Atom / ARM (Marvell 88F6281)   500            32           Small size

Table 7.4: HPL problem sizes.
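For reference, a minimal sketch of the 80%-of-memory rule of thumb (an N x N matrix of doubles at 8 bytes each should occupy roughly 80% of physical memory) is shown below. It is only an approximation of the procedures actually used for Table 7.4, with the memory sizes taken from Table 6.1.

    #include <stdio.h>
    #include <math.h>

    /* Rough HPL problem-size estimate: an N x N matrix of doubles (8 bytes
     * each) should fill about 80% of physical memory, so N ~ sqrt(0.8*M/8). */
    static long hpl_problem_size(double mem_bytes)
    {
        return (long)sqrt(0.8 * mem_bytes / 8.0);
    }

    int main(void)
    {
        printf("Xeon, 16GB : N ~ %ld\n", hpl_problem_size(16.0 * 1024 * 1024 * 1024));
        printf("Atom,  4GB : N ~ %ld\n", hpl_problem_size(4.0 * 1024 * 1024 * 1024));
        printf("ARM, 512MB : N ~ %ld\n", hpl_problem_size(512.0 * 1024 * 1024));
        return 0;
    }

Compiled with -lm, the estimates it prints land in the same region as the ACT-derived sizes in Table 7.4.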

Table 7.5 presents the results of the HPL benchmark for all the different problem sizes we used. Figures 7.10 and 7.11 present the results of the benchmark for the problem sizes defined by the ACT script and the HPL rule of thumb respectively.

Processor               GFLOPs   Usage      PPW (MFLOPs/Watt)   Problem size
Intel Xeon              1.22     197 Watt   6.1                 13107
Intel Atom              4.28     55 Watt    77.81               3276
ARM (Marvell 88F6281)   1.11     9 Watt     123.3               409
Intel Xeon              1.21     197 Watt   6.1                 41344
Intel Atom              3.48     55 Watt    63.27               20608
ARM (Marvell 88F6281)   1.10     9 Watt     122                 7296
Intel Xeon              1.21     197 Watt   6.14                7296
Intel Atom              4.15     55 Watt    75.45               7296
ARM (Marvell 88F6281)   1.10     9 Watt     122.2               7296
Intel Xeon              7.18     197 Watt   36.44               500
Intel Atom              5.46     55 Watt    99.27               500
ARM (Marvell 88F6281)   1.13     9 Watt     125.5               500

Table 7.5: HPL results for each problem size.

51 200 5.00

150 3.75 GFLOpssecond per

100 2.50 Watts per second per Watts

50 1.25

0 0 Intel Xeon Intel Atom ARM Watts GFLOPs

Figure 7.10: HPL results for large problem size, calculated with ACT’s script.


Figure 7.11: HPL results for problem size 80% of the system memory.

The GFLOPs rate as well as the power consumption remains at the same level for both the Intel Xeon and ARM systems, while Intel Atom improves its overall performance by 800 MFLOPs and 14.54 MFLOPs per Watt when using a problem size equal to 80% of the memory. We also experimented with a smaller problem size, N = 500, which allows the systems to achieve higher performance. These results are illustrated in figure 7.12.


Figure 7.12: HPL results for N=500.

This experiment shows us that Intel Xeon is capable of achieving relatively high performance for small problem sizes; Intel Atom increases its performance by approximately 2 GFLOPs and ARM (Marvell 88F6281) by 20 MFLOPs, staying within the same levels of performance as for large problem sizes. Despite the increase in performance for both Intel Xeon and Intel Atom, ARM achieves the best performance-per-Watt, with 152.2 MFLOPs per Watt against 96.5 MFLOPs per Watt for Intel Atom and 37.51 MFLOPs per Watt for Intel Xeon. These results should not surprise us: in the Green500 list, the first entry belongs to the BlueGene/Q Prototype 2, which is ranked as the 110th fastest supercomputer in the TOP500 list, meaning that the fastest supercomputer is not necessarily the most power efficient, and vice versa.

For the reported performance, we must not underestimate the fact that the installed operating systems (including tools, compilers and libraries) on the ARM machines, as well as the processor design and implementation by Marvell that we used, do not support a hardware FPU and use soft float, i.e. an FPU at the software level. That prevents the systems from using NEON SIMD acceleration, from which the executed benchmarks could benefit. As there is increased interest from both the desktop/laptop and the HPC communities in exploiting low-power chips, we expect that hardware FPU support will be available in the near future, enabling applications to achieve higher performance and take full advantage of the underlying hardware. It is reported that NEON SIMD acceleration can increase HPL performance by 60% [28]. ARM states that NEON technology can accelerate multimedia and signal processing algorithms by at least 3x on ARMv7 [29] [30].

While the performance in GFLOPs is of course important and interesting, we must not leave aside the total execution time and the total power consumption a system needs in order to solve a problem of a given size. The CoreMark results showed that the ARM system achieves the best performance in terms of performance-per-Watt as well as overall power consumption for integer operations. The HPL results differ from this and are less clear, depending on the problem size. In figure 7.13 we can see the total power used by each system when solving a problem whose size is equal to 80% of the total main system memory. This experiment clearly shows that the larger the memory, the larger the problem required, which then leads to more power usage. We see in this experiment that Intel Xeon uses 236,250 watt-seconds and takes 1199.24 seconds to complete, Intel Atom uses 3004 watt-seconds and takes 54.62 seconds, while ARM uses 36.63 watt-seconds in total and takes 4.07 seconds, for N equal to 13107, 3276 and 409 for Intel Xeon, Intel Atom and ARM respectively.


Figure 7.13: HPL total power consumption for N equal to 80% of memory.

The great difference in problem sizes does not allow us to draw specific conclusions, either for the achieved performance in GFLOPs or for the power usage. Figure 7.14 presents the total power consumption for a problem size calculated with the ACT script, that is for N equal to 41344 for Intel Xeon, 20608 for Intel Atom and 7296 for ARM (Marvell 88F6281). That results in 7,625,270 watt-seconds over 38,707.10 seconds for Intel Xeon, 919,756.75 watt-seconds over 16,722.85 seconds for Intel Atom and 211,988 watt-seconds over 23,554.23 seconds for ARM. We can see here that as the problem size increases for each node, both the total power usage and the execution time increase, as expected. In this experiment we see that while Intel Atom is able to solve the linear problem faster, the ARM system is still ahead when comparing performance-per-Watt.


Figure 7.14: HPL total power consumption for N calculated with ACT’s script.

In order to quantify the total power consumption in a better way, we performed another experiment with a problem size N equal to 7296 on all systems. Figure 7.15 presents the power consumption of each system. We can see that this problem size is solved relatively quickly on Intel Xeon and Intel Atom, taking 41,934 watt-seconds and 213.95 seconds for Intel Xeon and 34,266.1 watt-seconds and 623.02 seconds for Intel Atom. The ARM system uses in total 211,988 watt-seconds and takes 23,554.23 seconds to solve the problem. That brings it to the bottom of power efficiency for this given problem, due to the lack of a floating-point unit at the hardware level.

In order to quantify the differences in performance, we draw the same graph for the problem size N equal to 500. The problem size is rather too small to draw concrete conclusions about the performance of the Intel Xeon and Intel Atom systems, as they both solve the problem within a second, using 197 and 52 watt-seconds respectively, while the ARM system takes 7.37 seconds, consuming 66.33 watt-seconds in total, 130.67 watt-seconds less than Intel Xeon and 11.33 watt-seconds more than Intel Atom. The results are illustrated by figure 7.16.

55 300000 30000

225000 22500

Execution time (sec)

150000 15000

75000 7500 Total power consumption (Watt) consumption power Total

0 0 Intel Xeon Intel Atom ARM Watts Execution time

Figure 7.15: HPL total power consumption for N=7296.


Figure 7.16: HPL total power consumption for N=500.

All the results clearly show that the ARM system lags in floating-point performance, although it is competitive in terms of performance-per-Watt for small floating-point problem sizes. As we have mentioned, the ARM system we used, the OpenRD Client with the Marvell Sheeva 88F6281, does not implement the FPU in hardware nor provide NEON acceleration for SIMD processing. The underlying compilers and libraries perform the floating-point operations at the software level, and that is a performance drawback for the system. Intel Atom is very competitive when compared to the high-power Intel Xeon, as it can achieve reasonably high performance with relatively low power consumption.

The graph in figure 7.17 shows the power consumption over time for each system when executing the HPL benchmark. The low-power systems reach the peak of their power consumption, and keep a stable rate, only a few seconds after the benchmark starts executing. On the other hand, as with the CoreMark benchmark, the high-power Intel Xeon system takes approximately 10 to 15 seconds to reach its peak power consumption, which then stays stable during the execution of the HPL benchmark. This supports the Green500 power measurement tutorial's suggestion to start recording the actual power consumption 10 seconds after the benchmark has been initialised. The Intel Xeon system raises its power consumption by 56%, Intel Atom by 17.9% and ARM by 14.28%. It is important to note here that the build-up of power consumption for real applications, and different types of applications, might differ from that of the HPL benchmark, or any other benchmark.


Figure 7.17: Power consumption over time while executing HPL.

7.3.3 Memory performance: STREAM

Processor               Function   Rate (MB/s)   Avg. time (s)   Usage
Intel Xeon              Copy       3612.4793     0.0978          118 Watt
                        Scale      3642.3530     0.0968
                        Add        3960.9033     0.1334
                        Triad      4009.4806     0.1319
Intel Atom              Copy       2851.0365     0.1236          44 Watt
                        Scale      2282.0852     0.1543
                        Add        3033.9793     0.1742
                        Triad      2237.8844     0.2361
ARM (Marvell 88F6281)   Copy       777.8065      0.4029          8 Watt
                        Scale      190.8710      1.6398
                        Add        173.9241      2.6886
                        Triad      113.8851      4.0880

Table 7.6: STREAM results for 500MB array size.

As an overall observation, we see that the power consumption does not increase at all when performing intensive memory operations with the STREAM benchmark and a small array size. ARM proves to be more efficient in terms of performance-per-Watt, as it copies 97.2MB per Watt consumed, against 54.8MB and 64.79MB for Intel Atom and Intel Xeon respectively; that is, 1.7x and 1.5x more efficient in terms of performance for the power actually used. These results reflect the performance of the system when using the maximum memory that could be handled by the OpenRD Client system, 512MB of physical memory in total. They are presented in figure 7.18.

We executed an additional experiment on the Intel Atom and Intel Xeon boxes with a larger array size, 3GB, nearly the maximum size that can be handled by the Intel Atom system, which has 4096MB of physical memory available. This experiment showed a differentiation in the power consumption of each system, increasing the usage by 4 Watts on each system, to 122 and 48 Watts on Intel Xeon and Intel Atom respectively. The increase by the same amount of power on both systems may reflect the similarities they share, both being of x86 architecture. The performance results with the 3GB array size are presented in figure 7.19. We can see that the performance of both systems is kept at the same levels, with Intel Xeon slightly increasing its performance and power efficiency on the Copy and Scale functions, from 3612MB/s to 3627MB/s and from 3642MB/s to 3670MB/s respectively, compared with the smaller array size. The Add and Triad functions decrease slightly with the larger array size, from 3960MB/s to 3943MB/s and from 4009MB/s to 3991MB/s. These differences are so small that they fall within the area of statistical error and standard deviation.

The performance differences between the various memory subsystems can be explained by the bandwidth interface and the frequency of each system. The Intel Xeon system uses the highest-bandwidth interface and the highest data-rate frequency (DDR3 at 1333MHz) compared with the other two systems (DDR2 at 800MHz). Looking more closely at the low-power systems, both of them use equal-bandwidth interfaces and data-rate frequencies. The large bandwidth advantage of the Intel Atom system lies in the fact that its memory subsystem is made up of two chips of 2GB each, while the ARM system uses four chips of 128MB each. That makes the Intel Atom system capable of fitting the whole array (500MB) into a single chip, requiring less data movement. Nevertheless, the ARM system keeps a higher performance-per-Watt than the Intel Atom system.


Figure 7.18: STREAM results for 500MB array size.

59 130.0 4000

97.5 3000 Bandwidth per second per Bandwidth

65.0 2000 Watts per second per Watts 32.5 1000

0 0 Copy Scale Add Triad Copy Scale Add Triad Intel Xeon E5507 Intel Atom D525 Watts MBs

Figure 7.19: STREAM results for 3GB array size.

Figure 7.20 confirms that the power consumption remains equally stable, second by second, for each of the different array sizes we used to stress the memory subsystem of each system. With the larger 3GB array, the Intel Xeon system increases its power consumption by 3.38% and the Intel Atom by 9.1%.

60 150.0

112.5

75.0 POWER

37.5

0 TIME

Intel Xeon Intel Atom ARM

Figure 7.20: Power consumption over time while executing STREAM.

7.3.4 HDD and SSD power consumption

We mentioned earlier in this work that altering components within the targeted systems could affect performance, either increasing or decreasing it. The component that is easiest to test is the storage device. By default, the Intel Xeon and Intel Atom machines come with commodity SATA hard disk drives (with a SCSI interface for Intel Xeon). We replaced the hard disk drive on one of the Intel Atom machines with a SATA solid-state drive. An SSD does not involve spinning platters and thus avoids the power required to spin them.

During the experiments we performed, various power consumptions were observed and we could not identify a specific pattern, apart from the general observation that the SSD decreases the overall power consumption of the system. When idle, the system with the SSD uses 6 Watts less than the system with the HDD. On the CoreMark experiments, the SSD system consumes 3 Watts less. On STREAM, the SSD system consumes 4 Watts less, while when executing HPL the difference is 10 Watts, giving a total power draw of 58 Watts. These results are illustrated by figure 7.21.

61 60

45

30 Watts per second per Watts

15

0 Idle CoreMark STREAM HPL HDD SSD

Figure 7.21: Power consumption with 3.5" HDD and 2.5" SSD.

HDDs of smaller physical size, for instance 2.5" instead of the standard 3.5", may decrease the power consumption as well. As we did not have one of these disks available, we could not confirm this hypothesis. Previous research suggests that as the physical size of the disk decreases, its power consumption decreases as well, improving the power efficiency of the whole system [36] [37].

These differences in power not only reduce the costs, and improve the scalability in terms of power, of such systems, but also allow the deployment of extra nodes consuming the power saved by the more efficient components. For instance, the maximum difference between the HDD and SSD systems is 10 Watts, which is enough for an additional ARM system, which consumes at most 9 Watts. At larger scale, saving power on a single component can allow the deployment of additional compute-nodes drawing the power that would otherwise be wasted by a less power-efficient component on each system. While CoreMark, HPL and STREAM do not perform intensive I/O operations, they allow us to measure the standalone power consumption of the SSD and compare it against that of the HDD.

Chapter 8

Future work

Future work in this field could investigate a number of different possibilities, as outlined below:

• Real HPC/scientific applications: Real HPC and scientific applications could be executed on the existing cluster and their results could be used for analysis and comparison against the results presented in this dissertation.
• Modern ARM: The cluster could be extended by deploying more modern ARM systems, such as the Cortex-A9 and the upcoming Cortex-A15, which support hardware FPUs and multiple cores.
• Intensive I/O: Additional I/O-intensive benchmarks and applications could be executed to identify the power consumption of such applications, rather than of applications and codes that do not make heavy use of I/O operations.
• Detailed power measurements: More detailed power measurements could be performed by measuring each system component individually and quantifying how and where exactly power is used.
• CPUs vs. GPUs: Comparison between the performance and performance-per-Watt of low-power CPUs and GPUs.
• Parallelism: Extend the existing cluster by adding a significant number of low-power nodes to exploit more parallelism.

Chapter 9

Conclusions

This dissertation has achieved its goals: it has researched the current trends and technologies in low-power systems and techniques for High Performance Computing infrastructures and has reported the related work in the field. We have also designed and successfully built a hybrid seven-node cluster consisting of three different systems, Intel Xeon, Intel Atom and ARM (Marvell 88F6281), providing access to 34 cores, 57GB of RAM and 3.6TB of storage, and have described the issues faced and how they were solved. The cluster environment supports programming in both the message-passing and shared-variable models; MPI, OpenMP and Java threads are supported on all of the platforms. We have measured and analysed the performance, power consumption and power efficiency of each system in the cluster.

Judging by the market and the development of HPC systems, low-power processors will start being one of the default choices in the very near future. The energy demands of large systems will require a shift to processors and systems that consider energy by design. Consumer electronics devices are becoming more and more powerful, as they need to execute computationally intensive applications, and are still designed with energy efficiency in mind.

To qualify and quantify the computational performance and efficiency as well as the power efficiency of each system, we ran three main benchmarks (CoreMark, High Performance Linpack and STREAM) in order to quantify the performance of each system on a performance-per-Watt basis for integer operations, floating-point operations and memory bandwidth. On CoreMark, the serial integer benchmark, the ARM system achieves the best performance-per-Watt, with 206.63 iterations per Watt against 55.76 and 56.03 iterations per Watt for Intel Xeon and Intel Atom respectively on a single thread, and 432.90 and 171.25 when utilising every available thread. This allows us to conclude that the ARM processor is very competitive and can achieve a very high score on integer operations, performing better than the Intel Atom, which is a dual-core system with hyper-threading support, providing access to four logical cores. The ARM system does not provide a hardware FPU due to its ARMv5 architecture.

The ARM system therefore lacks performance on floating-point operations, as the HPL results show: it achieves at most 1.37 GFLOPs, compared with 7.39 GFLOPs for Intel Xeon and 6.08 GFLOPs for Intel Atom. In terms of power consumption, while ARM achieves the best performance-per-Watt, 152.2 MFLOPs per Watt versus 37.51 and 96.50 for Intel Xeon and Intel Atom respectively, it takes much longer to solve large problems. This introduces a high overhead in total energy consumption, with the consequence that it uses more energy in total than either Intel Xeon or Intel Atom.

In terms of memory performance, for small sizes the power consumption remains at minimal levels. Larger data sizes, above 2GB, increase the consumption on Intel Xeon and Intel Atom by 4 Watts. The ARM system is able to handle only small data sizes, up to 512MB. Intel Xeon achieves the highest memory bandwidth as it uses DDR3 at 1.3GHz, while Intel Atom and ARM use DDR2 at 800MHz. Intel Atom also scores higher than ARM, as it can hold the largest data set the ARM system can handle within a single memory chip, unlike ARM, which uses four individual memory chips.

Individual components affect system performance as well. We have observed that SSD storage can reduce the power consumption by 3 to 10 Watts compared to a standard 3.5" HDD at 7200rpm. Other components, such as a different memory subsystem, interconnect or power supply, could also affect system performance. Due to time as well as budget constraints, we did not experiment with different components in each of these subsystems.

In terms of porting and software support, all of the tested platforms support C, C++, Fortran and Java. ARM does not support the Java compiler but only the Java Runtime Environment. Intel Atom, being an x86-based architecture (despite its RISC-like internal design), is fully compatible with any x86 system currently in use. ARM does not provide the same binary compatibility with existing systems, due to architectural differences, requiring the targeted code to be recompiled. What is more, ARMv5 is not capable of performing floating-point operations at the hardware level and has to use soft float instead. The latest architecture, ARMv7, provides hardware FPU functionality as well as SIMD acceleration. To take advantage of the hardware FPU and SIMD acceleration, changes need to be made at the software level as well: Linux distributions, or the needed compilers and libraries with all their dependencies, need to be recompiled for the ARMv7 architecture in order to support hardware FPUs and improve the overall system performance.

The emerging interest of the HPC community in exploiting low-power architectures and new designs to support the design and development of Exascale systems efficiently and reliably, in combination with the market developments in consumer devices from desktops to mobile phones, ensures that the functionality and performance of low-power processors will increase to levels acceptable for HPC and scientific applications. The end of Moore's law introduces an extra need for the development of such systems.
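The performance-per-Watt figures quoted in this chapter are simple ratios of measured throughput to measured power draw. As a worked illustration, using the ARM values from Tables A.1 and B.3:

\[
\mathrm{PPW}_{\text{CoreMark}} = \frac{1859.67\ \text{iterations/s}}{9\ \mathrm{W}} \approx 206.63\ \text{iterations per Watt},
\qquad
\mathrm{PPW}_{\text{HPL}} = \frac{1370\ \text{MFLOPs}}{9\ \mathrm{W}} \approx 152.2\ \text{MFLOPs per Watt}.
\]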

Appendix A

CoreMark results

Processor     Iterations  Iterations/Sec  Total time (sec)  Threads  Consumption  PPW
Intel Xeon    100000      6617.25         15.11             1        119 Watt     55.60
Intel Atom    100000      2954.12         33.85             1        54 Watt      54.70
ARM           100000      1859.70         53.77             1        9 Watt       206.63
Intel Xeon    1000000     6636.40         150.68            1        119 Watt     55.76
Intel Atom    1000000     2969.70         336.73            1        53 Watt      56.03
ARM           1000000     1859.67         537.72            1        9 Watt       206.63
Intel Xeon    2000000     6610.49         302.54            1        126 Watt     54.46
Intel Atom    2000000     2953.23         677.22            1        54 Watt      54.68
ARM           2000000     1861.36         1074.48           1        9 Watt       206.81
Intel Xeon    1000000     51516.21        155.89            8        119 Watt     432.90
Intel Atom    1000000     9076.67         440.69            4        53 Watt      171.25
ARM           1000000     1859.67         537.72            1        9 Watt       206.63

Table A.1: CoreMark results for various iterations.

Appendix B

HPL results

Processor                Problem size  Block size  Method
Intel Xeon               13107         128         HPL
Intel Atom               3276          128         HPL
ARM (Marvell 88F6281)    409           128         HPL
Intel Xeon               41344         128         ACT
Intel Atom               20608         128         ACT
ARM (Marvell 88F6281)    7296          128         ACT
Intel Xeon               7296          128         Equal size
Intel Atom               7296          128         Equal size
ARM (Marvell 88F6281)    7296          128         Equal size
Intel Xeon               500           32          Small size
Intel Atom               500           32          Small size
ARM (Marvell 88F6281)    500           32          Small size

Table B.1: HPL problem sizes.

Processor                GFLOPs  Usage     PPW            Problem size
Intel Xeon               1.22    197 Watt  6.1 MFLOPs     13107
Intel Atom               4.28    55 Watt   77.81 MFLOPs   3276
ARM (Marvell 88F6281)    1.11    9 Watt    123.3 MFLOPs   409
Intel Xeon               1.21    197 Watt  6.1 MFLOPs     41344
Intel Atom               3.48    55 Watt   63.27 MFLOPs   20608
ARM (Marvell 88F6281)    1.10    9 Watt    122 MFLOPs     7296
Intel Xeon               1.21    197 Watt  6.14 MFLOPs    7296
Intel Atom               4.15    55 Watt   75.45 MFLOPs   7296
ARM (Marvell 88F6281)    1.10    9 Watt    122.2 MFLOPs   7296
Intel Xeon               7.18    197 Watt  36.44 MFLOPs   500
Intel Atom               5.46    55 Watt   99.27 MFLOPs   500
ARM (Marvell 88F6281)    1.13    9 Watt    125.5 MFLOPs   500

Table B.2: HPL results for the problem sizes listed in Table B.1.

Processor     GFLOPs  Usage     PPW
Intel Xeon    7.39    197 Watt  37.51 MFLOPs
Intel Atom    6.08    63 Watt   96.50 MFLOPs
ARM           1.37    9 Watt    152.2 MFLOPs

Table B.3: HPL results for N=500.

Appendix C

STREAM results

Processor                Size   Function  Rate (MB/s)  Avg. time  Usage
Intel Xeon               500MB  Copy      3612.4793    0.0978     118 Watt
                                Scale     3642.3530    0.0968
                                Add       3960.9033    0.1334
                                Triad     4009.4806    0.1319
Intel Atom               500MB  Copy      2851.0365    0.1236     52 Watt
                                Scale     2282.0852    0.1543
                                Add       3033.9793    0.1742
                                Triad     2237.8844    0.2361
ARM (Marvell 88F6281)    500MB  Copy      777.8065     0.4029     8 Watt
                                Scale     190.8710     1.6398
                                Add       173.9241     2.6886
                                Triad     113.8851     4.0880
Intel Xeon               3GB    Copy      3627.5380    0.5886     122 Watt
                                Scale     3670.4334    0.5816
                                Add       3943.3052    0.8120
                                Triad     3991.4984    0.8022
Intel Atom               3GB    Copy      2875.2246    0.7422     56 Watt
                                Scale     2275.0291    0.9379
                                Add       3035.8659    1.0544
                                Triad     2269.4263    1.4103

Table C.1: STREAM results for array sizes of 500MB and 3GB.

Appendix D

Shell Scripts

D.1 add_node.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

NODE=$1
ARCH=$2

# Set up passwordless SSH access to the new node
ssh root@${NODE} 'mkdir /root/.ssh; chmod 700 /root/.ssh'
scp /root/.ssh/id_dsa.pub root@${NODE}:.ssh/authorized_keys

# ARM nodes use their own fstab
if [ "${ARCH}" == "ARM" ] || [ "${ARCH}" == "arm" ]; then
  scp fstab.arm root@${NODE}:/etc/fstab
else
  scp fstab root@${NODE}:/etc/fstab
fi

# Distribute common configuration and the Torque MOM files
scp hosts root@${NODE}:/etc/hosts
scp profile root@${NODE}:/etc/profile
scp mom_priv.config root@${NODE}:/var/spool/torque/mom_priv/config
scp pbs_mom root@${NODE}:/etc/init.d/.

# Mount the shared filesystems and start the Torque MOM
ssh root@${NODE} 'mount /home'
ssh root@${NODE} 'mkdir /usr/local/mpich2-1.3.2p1'
ssh root@${NODE} 'mount /usr/local/mpich2-1.3.2p1'
ssh root@${NODE} '/sbin/chkconfig --add pbs_mom \
  && /sbin/chkconfig --level 234 pbs_mom on'
ssh root@${NODE} '/sbin/service pbs_mom start'

qterm -t quick
pbs_server
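A typical invocation would be as follows (the node names are hypothetical; the second argument is only significant for ARM nodes, any other value falls through to the default x86 configuration):

  ./add_node.sh lhpc7 arm
  ./add_node.sh lhpc3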

D.2 status.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

for i in `cat nodes.txt`
do
  ssh root@$i 'hostname; service pbs_mom status'
done

D.3 armrun.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

ARMHOST=lhpc6
ARGUMENTS=$*
ssh ${ARMHOST} ${ARGUMENTS}
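The script simply forwards its arguments over SSH to the ARM node (lhpc6), so any command can be launched on it from the front-end, for example:

  ./armrun.sh hostname
  ./armrun.sh uname -m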

D.4 watt_log.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

option=$1
logfile=$2
code=$3

getAvg(){
  totalwatts=`cat ${logfile} | \
    awk '{total = total + $1}END{print total}'`

  elements=`cat ${logfile} | wc -l`
  avgwatts=`echo "${totalwatts} / ${elements}" | bc`

  printf "\n\n Average watts: ${avgwatts}\n\n"
}

if [ "${option}" == "average" ]; then
  getAvg
  exit 0
fi

if [ $# -lt 3 ] || [ $# -gt 3 ]; then
  echo " Specify logfile and code"
  exit 1
fi

if [ -e ${logfile} ]; then rm -f ${logfile}; fi

codeis=`ps aux | grep ${code} | grep -v grep | wc -l`
while [ ${codeis} -gt 0 ]; do

  sudo /usr/sbin/ipmi-sensors | grep -w "System Level" | \
    awk {'print $5'} | awk ' sub("\\.*0+$","") ' >> ${logfile}
  sleep 1
  codeis=`ps aux | grep ${code} | grep -v grep | wc -l`
done

getAvg
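A sketch of how the script is driven (the option keyword "log", the log-file name and the benchmark process name are illustrative; given three arguments, any first argument other than "average" starts the sampling loop):

  # sample the IPMI power reading once per second while xhpl is running
  ./watt_log.sh log hpl_watts.log xhpl
  # afterwards, print the average of the logged readings
  ./watt_log.sh average hpl_watts.log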

D.5 fortran2c.sh

#!/bin/bash

fortranFile=$1
fileName=`echo $1 | sed 's/\(.*\)\..*/\1/'`
f2c $fortranFile
gcc ${fileName}.c -o $fileName -lf2c
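For example (the source file name is hypothetical), converting and compiling a Fortran code in one step:

  ./fortran2c.sh stream.f
  # produces stream.c via f2c and an executable named stream linked against libf2c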

Appendix E

Benchmark outputs samples

E.1 CoreMark output sample

The output that follows is a sample output from an Intel Xeon system when executing CoreMark with 100000 iterations and a single thread.

2K performance run parameters for .
CoreMark Size     : 666
Total ticks       : 15112
Total time (secs) : 15.112000
Iterations/Sec    : 6617.257808
Iterations        : 100000
Compiler version  : GCC4.1.2 20080704 (Red Hat 4.1.2-50)
Compiler flags    : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location   : Please put data memory location here
                    (e.g. code in flash, data on heap etc)
seedcrc           : 0xe9f5
[0]crclist        : 0xe714
[0]crcmatrix      : 0x1fd7
[0]crcstate       : 0x8e3a
[0]crcfinal       : 0xd340
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 6617.257808 / GCC4.1.2 20080704 (Red Hat 4.1.2-50) -O2 -DPERFORMANCE_RUN=1 -lrt / Heap

E.2 HPL output sample

The output that follows is a sample output from an Intel Atom system when executing HPL with problem size N=407.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :     407
NB     :     128
PMAP   : Row-major process mapping
P      :       1
Q      :       1
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4        3274   128     2     2              54.62          4.287e-01
----------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0051440 ...... PASSED
============================================================================

Finished 1 tests with the following results:
    1 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

E.3 STREAM output sample

The output that follows is a sample output from an Intel Atom system when executing STREAM with array size 441.7MB.

------STREAM version $Revision: 5.9 $ ------This system uses 8 bytes per DOUBLE PRECISION word. ------Array size = 19300000, Offset = 0 Total memory required = 441.7 MB. Each test is run 10 times, but only the *best* time for each is used. ------Printing one line per active thread.... ------Your clock granularity/precision appears to be 2 microseconds. Each test below will take on the order of 1194623 microseconds. (= 597311 clock ticks) Increase the size of the arrays if this shows that you are not getting at least 20 clock ticks per test. ------WARNING -- The above is only a rough guideline. For best results, please be sure you know the precision of your system timer. ------Function Rate (MB/s) Avg time Min time Max time Copy: 777.8065 0.4029 0.3970 0.4318 Scale: 190.8710 1.6398 1.6178 1.6900 Add: 173.9241 2.6886 2.6632 2.7319 Triad: 113.8851 4.0880 4.0673 4.1260 ------Solution Validates ------

Appendix F

Project evaluation

F.1 Goals

The project has achieved the following goals, as set out in the project proposal and as presented within this dissertation:

• Report on low-power architectures targeted for HPC systems.
• Report on related work done in the field of low-power HPC.
• Report on the analysis and specification of requirements for the low-power HPC project.
• Report on the constraints of the available architectures on their use in HPC.
• Functional low-power seven-node cluster targeted for HPC applications.
• A specific set of benchmarks that can run across all chosen architectures.
• Final MSc dissertation.

The final project proposal can be found in Appendix G.

F.2 Work plan

The schedule presented in the Project Preparation report has been followed and we have met the deadlines it describes. Slight changes were made and the schedule had to be adjusted as the project progressed; the changes applied to the time scales of certain tasks.

F.3 Risks

During the project preparation, the following risks were identified:

• Risk 1: Unavailable architectures.
• Risk 2: Unavailable tool-chains.
• Risk 3: More time required to build cluster/port code than to run benchmarks.
• Risk 4: Unavailability of identical tools/underlying platform.
• Risk 5: Architectural differences.

In the end these risks did not affect the project, and any that occurred were mitigated as described in the project preparation report. However, we came across two further risks that had not been initially identified:

• Risk 6: Service outage and support by the University's related groups.
• Risk 7: Absence due to summer holidays.

The first affected the project for a week and slowed down the experimentation process, as the cluster could not be accessed remotely due to network issues that were later solved. During this outage the cluster had to be accessed physically to conduct experiments and gather results. The second did not cause any issues to the project itself, although without the absence more experiments and benchmarks could perhaps have been designed and executed.

F.4 Changes

The most important change was the decision over which benchmarks to run. We left aside the SPEC benchmarks as they would require long execution times, something that could not be afforded for the project. We also left out the NAS Parallel Benchmarks, as they proved somewhat complicated to execute in a similar manner on all three architectures and would also take rather long to finish execution and gather the needed results. For that reason we finally decided to proceed with the CoreMark benchmark to measure the serial performance of a core, High-Performance Linpack to measure the parallel performance of a system and STREAM to measure the memory bandwidth of a system. These three benchmarks have been a good choice as they are widely used and accepted in the HPC field, are configurable, are easy to run and can complete their execution in a relatively short time, enabling us to design a number of different experiments for qualifying and quantifying our results.

Appendix G

Final Project Proposal

G.1 Content

The main scope of the project is to investigate, measure and compare the performance of low-power CPUs versus standard commodity 32/64-bit x86 CPUs when executing selected High-Performance Computing applications. Performance factors to be investigated include the computational performance along with the power consumption and porting effort of standard HPC codes across to the low-power architectures. Using 32/64-bit x86 as the baseline, a number of different low-power CPUs will be investigated and compared, such as ARM, Intel Atom and PowerPC. The performance, in terms of cost and efficiency, of the various architectures will be measured by using well-known and established benchmarking suites. Due to the differences in the architectures and the available supported compilers, a set of appropriate benchmarks will need to be identified. Fortran compilers are not available on the ARM platform, therefore a number of C or C++ codes will need to be identified that represent either HPC applications or parts of HPC operations, in order to put the systems under stress.

G.2 The work to be undertaken

G.2.1 Deliverables

• Report on low-power architectures targeted for HPC systems.
• Report on related work done in the field of low-power HPC.
• Report on the analysis and specification of requirements for the low-power HPC project.
• Report on the constraints of the available architectures on their use in HPC, e.g., 32-bit only, toolchain availability, existing code ports.

• Functional low-power cluster, between 6 and 12 nodes, targeted for HPC applications.
• A specific set of codes that can run across all chosen architectures.
• Final MSc dissertation.
• Project presentation.

G.3 Tasks

• Survey of available and possible low-power architectures for HPC use.
• Survey of existing work done in the low-power HPC field.
• Deployment of a low-power HPC cluster.
• Identification of an appropriate set of benchmarks to run on all architectures, running of the experiments and analysis of the results.
• Writing of the dissertation reflecting the work undertaken and the outcomes of the project.

G.4 Additional information / Knowledge required

• Programming knowledge and skills are assumed, as the benchmark codes might require porting.
• Systems engineering knowledge to build up, configure and deploy the low-power cluster.
• Understanding of different methods/techniques of power measurement for computer systems.
• Presentation skills for writing a good dissertation and presenting the results of the project in front of an audience.

Bibliography

[1] P. M. Kogge et al., "Exascale Computing Study: Technology Challenges in Achieving Exascale Systems", DARPA Information Processing Techniques Office, Washington, DC, pp. 278, September 28, 2008.
[2] J. Dongarra et al., "International Exascale Software Project: Roadmap 1.1", http://www.Exascale.org/mediawiki/images/2/20/IESP-roadmap.pdf, February 2011.
[3] D. A. Patterson and D. R. Ditzel, "The Case for the Reduced Instruction Set Computer", ACM SIGARCH News, 8:6, 25-33, Oct. 1980.
[4] D. W. Clark and W. D. Strecker, "Comments on 'The Case for the Reduced Instruction Set Computer'", ibid, 34-38, Oct. 1980.
[5] Michio Kaku, "The Physics of the Future", 2010.
[6] S. Sharma, Chung-Hsing Hsu and Wu-chun Feng, "Making a Case for a Green500 List", 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Workshop on High-Performance, Power-Aware Computing (HP-PAC), April 2006.
[7] W. Feng, M. Warren, E. Weigle, "Honey, I Shrunk the Beowulf", In the Proceedings of the 2002 International Conference on Parallel Processing, August 2002.
[8] Wu-chun Feng, "The Importance of Being Low Power in High Performance Computing", CTWatch Quarterly, Volume 1, Number 3, Page 12, August 2005.
[9] NVIDIA Press, http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&releasejsp=release_157&xhtml=true&prid=705184, Accessed 13 May 2011.
[10] HPC Wire, http://www.hpcwire.com/hpcwire/2011-03-07/china_makes_its_own_supercomputing_cores.html, Accessed 13 May 2011.
[11] Katie Roberts-Hoffman, Pawankumar Hedge, "ARM (Marvell 88F6281) vs. Intel Atom: Architectural and Benchmark Comparisons", EE6304 Computer Architecture course project, University of Texas at Dallas, 2009.

[12] R. Ge, X. Feng, H. Pyla, K. Cameron, W. Feng, "Power Measurement Tutorial for the Green500 List", The Green500 List: Environmentally Responsible Supercomputing, June 27, 2007.
[13] J. J. Dongarra, "The LINPACK benchmark: an explanation", In the Proceedings of the 1st International Conference on Supercomputing, Springer-Verlag, New York, NY, USA, 1988.
[14] Piotr R. Luszczek et al., "The HPC Challenge (HPCC) benchmark suite", In the Proceedings of SC '06, the 2006 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2006.
[15] D. Weeratunga et al., "The NAS Parallel Benchmarks", NAS Technical Report RNR-94-007, NASA Ames Research Center, Moffett Field, CA, March 1994.
[16] Cathy May et al., "The PowerPC Architecture: A Specification for A New Family of RISC Processors", Morgan Kaufmann Publishers, 1994.
[17] Charles Johnson et al., "A Wire-Speed Power Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads", IEEE International Solid-State Circuits Conference, White paper, 2010.
[18] D. M. Tullsen, S. J. Eggers, H. M. Levy, "Simultaneous multithreading: maximizing on-chip parallelism", ISCA '95, pp. 392-403, June 22, 1995.
[19] Green Revolution Cooling, http://www.grcooling.com, Accessed 2 June 2011.
[20] http://www.google.com/corporate/datacenter/index.html, Accessed 2 June 2011.
[21] Sindre Kvalheim, "Lefdal Mine Project", META magazine, Number 2: 2011, pp. 14-15, Notur II Project, February 2011.
[22] Knut Molaug, "Green Mountain Data Centre AS", META magazine, Number 2: 2011, pp. 16-17, Notur II Project, February 2011.
[23] Bjørn Rønning, "Rjukan Mountain Hall - RMH", META magazine, Number 2: 2011, pp. 18-19, Notur II Project, February 2011.
[24] Jacko Koster, "A Nordic Supercomputer in Iceland", META magazine, Number 2: 2011, p. 13, Notur II Project, February 2011.
[25] Douglas Montgomery, "Design and Analysis of Experiments", John Wiley & Sons, sixth edition, 2004.
[26] CoreMark, an EEMBC Benchmark, http://www.coremark.org, Accessed 12 May 2011.
[27] "Genesi's Hard Float optimizations speed up Linux performance up to 300% on ARM Laptops", http://armdevices.net/2011/06/21/genesis-hard-float-optimizations-speeds-up-linux-performance-up-to-300-on-arm-laptops/, Accessed 21 June 2011.

[28] K. Furlinger, C. Klausecker, D. Kranzlmuller, "The AppleTV-Cluster: Towards Energy Efficient Parallel Computing on Consumer Electronic Devices", Whitepaper, Ludwig-Maximilians-Universitat, April 2011.
[29] NEON Technology, http://www.arm.com/products/processors/technologies/neon.php, Accessed 21 June 2011.
[30] ARM, "ARM NEON support in the ARM compiler", White Paper, September 2008.
[31] MIPS Technologies, "MIPS64 Architecture for Programmers Volume I: Introduction to the MIPS64", v3.02.
[32] MIPS Technologies, "MIPS64 Architecture for Programmers Volume I-B: Introduction to the microMIPS64", v3.02.
[33] MIPS Technologies, "MIPS64 Architecture for Programmers Volume II: The MIPS64 Instruction Set", v3.02.
[34] MIPS Technologies, "MIPS Architecture For Programmers Volume III: The MIPS64 and microMIPS64 Privileged Resource Architecture", v3.12.
[35] MIPS Technologies, "China's Institute of Computing Technology Licenses Industry-Standard MIPS Architectures", http://www.mips.com/news-events/newsroom/release-archive-2009/6_15_09.dot, Accessed 21 June 2011.
[36] Young-Jin Kim, Kwon-Taek Kwon, Jihong Kim, "Energy-efficient disk replacement and file placement techniques for mobile systems with hard disks", In the Proceedings of the 2007 ACM Symposium on Applied Computing, New York, NY, USA, 2007.
[37] Young-Jin Kim, Kwon-Taek Kwon, Jihong Kim, "Energy-efficient file placement techniques for heterogeneous mobile storage systems", In the Proceedings of EMSOFT '06, the 6th ACM & IEEE International Conference on Embedded Software, ACM, New York, NY, USA, 2006.
