Exploring performance of Phi co-processor

Mateusz Iwo Dubaniowski

August 21, 2015

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2015

Abstract

The project aims to explore the performance of the Intel Xeon Phi co-processor. We use various parallelisation and vectorisation methods to port an LU decomposition library to the co-processor. The popularity of accelerators and co-processors is growing due to their good energy efficiency characteristics and the large potential for further performance improvements. These two factors make co-processors well suited to drive innovation in high performance computing forwards, towards the next goal of Exascale-level computing. In response to increasing demand, Intel has delivered a co-processor architecture designed to fit the requirements of the HPC community, the Intel MIC architecture, of which the most prominent example is Intel Xeon Phi. The co-processor follows the many-core principle: it provides a large number of slower cores supplemented with vector processing units, thus forcing a high level of parallelisation upon its users.

LU factorisation is an operation on matrices used in many fields to solve systems of linear equations, invert matrices, and calculate matrix determinants. In this project we port an LU factorisation algorithm, which uses the Gaussian elimination method to perform the decomposition, to the Intel Xeon Phi co-processor. We use various parallelisation techniques including Intel LEO, OpenMP 4.0 pragmas, Intel Cilk array notation, and the ivdep pragma. Furthermore, we examine the effect of data transfer to the co-processor on the overall execution time.

The results obtained show that the best level of performance on Xeon Phi is achieved with the use of Intel Cilk array notation to vectorise the code and OpenMP 4.0 to parallelise it. Intel Cilk array notation, on average across the sparse and dense benchmark matrices, results in a speed-up of 27 times over the single-threaded performance of the host processor. The peak speed-up achieved with this method across the attempted benchmarks is 49 times the performance of a single thread of the host processor.

Contents

Chapter 1 Introduction ...... 1

1.1 Obstacles and diversions from the original plan ...... 4

1.2 Structure of the dissertation ...... 4

Chapter 2 Co-processors and accelerators in HPC ...... 6

2.1 Importance of energy efficiency in HPC ...... 6

2.2 Co-processors and the move to Exascale ...... 8

2.3 Intel Xeon Phi and other accelerators ...... 9

2.4 Related work ...... 10

Chapter 3 Intel MIC architecture...... 13

3.1 Architecture of Intel MICs ...... 13

3.2 Xeon Phi in EPCC and Hartree ...... 16

3.3 Xeon Phi programming tools ...... 17

3.4 Intel Xeon – host node...... 18

3.5 Knights Landing – the future of Xeon Phi ...... 18

Chapter 4 LU factorization – current implementation ...... 20

4.1 What is LU factorization? ...... 20

4.2 Applications of LU factorization ...... 21

4.3 Initial algorithm ...... 21

4.4 Matrix data structure...... 22

Chapter 5 Optimisation and parallelisation methods ...... 24

5.1 Intel “ivdep” pragma and compiler auto-vectorisation ...... 24

5.2 OpenMP 4.0 ...... 25

5.3 Intel Cilk array notation ...... 26

5.4 Offload models ...... 26

Chapter 6 Implementation of the solution...... 29

6.1 Initial profiling ...... 29

6.2 Parallelising the code ...... 30

6.2.1 Hotspots analysis ...... 31

6.3 Offloading the code ...... 32

6.4 Hinting vectorisation with ivdep ...... 33

6.4.1 Hotspots for further vectorisation ...... 34

6.5 Ensuring vectorisation with Intel Cilk and OpenMP ...... 34

6.5.1 Intel Cilk array notation ...... 35

6.5.2 OpenMP simd pragma ...... 35

Chapter 7 Benchmarking the solution ...... 37

7.1 Matrix format ...... 37

7.2 University of Florida sparse matrix collection ...... 38

7.3 Dense benchmarks ...... 39

7.4 Summary of benchmarks’ characteristics ...... 39

Chapter 8 Analysis of performance of Xeon Phi ...... 41

8.1 Collection of results ...... 41

8.2 Validation of the results ...... 42

8.3 Overview of results...... 43

8.4 Speed-up with different optimisation options...... 45

8.5 Native speed-up on Intel Xeon and on Intel Xeon Phi ...... 47

8.6 Offloading overhead ...... 49

8.7 Speed-up on the host with different optimisation options ...... 51

8.8 Running NICSLU on the host ...... 53

Chapter 9 Summary and conclusions ...... 54

9.1 Future work ...... 56

Bibliography ...... 57

List of Tables

Table 2-1: Overview of available co-processors and accelerators by vendor ...... 10

Table 3-1: Overview of versions of Intel Xeon Phi available ...... 16

Table 5-1: Outline of scheduling options available in OpenMP 4.0 [31] ...... 25

Table 5-2: Intel Cilk array notation example ...... 26

Table 6-1: gprof profile of running the LU factorization algorithm with ranmat4500 input on 4 host threads ...... 30

Table 6-2: Intel Cilk array notation use in lup_od_omp function ...... 35

Table 6-3: OpenMP simd pragma usage in lup_od_omp ...... 35

Table 7-1: Characteristics of benchmark matrices ...... 40

Table 8-1: Execution times (in seconds) of running benchmarks offloaded to Xeon Phi with different parallelisation methods ...... 44

Table 8-2: Speed-up values summary against single-threaded host execution time ...... 46

Table 8-3: Code snippets explaining performance difference between simd pragma and Intel Cilk array notation ...... 47

Table 8-4: Overview of performance improvements due to optimisation methods on the host – Intel Xeon ...... 52

Table 8-5: Results of running NICSLU on the host ...... 53

List of Figures

Figure 1-1: Performance of Top500 list systems over the past 11 years [2] ...... 2

Figure 2-1: Increasing share of co-processors/accelerators in the systems from Top500 list over the past 4 years...... 9

Figure 3-1: Simple outline of a single Intel MIC core ...... 14

Figure 3-2: More detailed outline of Intel MIC core ...... 15

Figure 3-3: Hartree’s Intel Xeon Phi racks ...... 17

Figure 4-1: Matrix data structure implementation ...... 23

Figure 5-1: Intel Xeon Phi execution models with Intel Xeon as the host processor [33] ...... 28

Figure 5-2: Intel Xeon Phi software stack [23] ...... 28

Figure 6-1: VTune threads utilisation in lup_od_omp function pre and post optimisation ...... 32

Figure 7-1: Sample matrix and its Matrix Market Exchange format representation ...... 38

Figure 8-1: Speed-up of benchmarks with different parallelisation techniques used .....45

Figure 8-2: Speed-up of Xeon Phi with varying number of threads (speed-up=1, when n=8) ...... 48

Figure 8-3: Offloading and execution times as percentages of the total execution time for OpenMP 4.0 simd pragma ...... 49

Figure 8-4: Offloading and execution times as percentages of the total execution time for Intel Cilk array notation ...... 50

Figure 8-5: Speed-up of benchmarks including offload time with different parallelisation techniques used ...... 51

Figure 8-6: Speed-up on the host – Intel Xeon – with different optimisation methods and benchmarks ...... 52

Acknowledgements

Writing this master thesis would not have been possible if not for the many remarkable people I have met during this year. With deep gratitude, I would like to thank my supervisor Adrian Jackson for his guidance, encouragement and thoughtful comments that made writing this work a real adventure. For motivation and support, I would like to thank my personal tutor Mark Bull. Further, I am grateful for being given the chance to take part in the HPC conference in Frankfurt, and I would like to thank all the people I encountered there, for it exposed me to a variety of ideas that have driven this work.

Furthermore, I would like to thank everyone at STFC – Hartree Centre for their help and continuing support with accessing Xeon Phis in Hartree.

And, of course, thanks to Pat for some proof-reading, submitting, and general support.

This dissertation makes use of the template provided by Dr. David Henty, which is based on the template created by Prof. Charles Duncan.

Chapter 1

Introduction

Co-processors and accelerators are fast becoming a hot topic in the high performance computing community. Their unique ability to achieve a high level of performance relative to their energy consumption attracts a significant amount of attention, and the amount of research devoted to designing, and subsequently utilising, these devices in the most efficient fashion is growing. Moreover, the performance of co-processors has not yet been fully exploited. There is no single formula for how to achieve the best performance on accelerators. In fact, there are widely varying opinions on how accelerators should be designed, and how they should be used, to achieve their goal of improving performance, both in terms of time and energy consumption.

The other aspect attracting growing attention to co-processors and alternative ways of boosting performance is the "power wall" of multiprocessors. Multi-core systems are beginning to experience a decrease in the rate of performance improvements. [1] These effects can be seen in Figure 1-1, which shows that the rate of performance growth of Top500 systems has dropped over the last 3 years. The rate of growth has dropped for the top of the list, the bottom of the list, and the sum of all systems on the Top500 list, and a large majority of Top500 systems primarily use multicore processors. A similar issue was experienced with single-core systems a decade ago, when frequency could no longer be increased due to energy consumption and cooling limitations. Co-processors provide an alternative way to achieve high performance through highly parallel, many-core architectures, while attaining a good level of energy efficiency per unit of computation. Moreover, many systems combine traditional CPU processors and co-processors in heterogeneous implementations.

The usage of co-processors is becoming increasingly prominent, with more and more systems employing accelerators to boost their performance. On the most recent Top500 list, released in July 2015, a list of the 500 best performing supercomputers in the world, accelerators are used in 88 of the 500 systems. This is an increase from 75 systems using accelerators or co-processors in November 2014, when the previous Top500 list was released. [2] This suggests that co-processors are being adopted by more and more systems in order to achieve the best performance. The two best performing systems on the July 2015 Top500 list both use accelerators to achieve their performance.


Figure 1-1: Performance of Top500 list systems over the past 11 years [2]

Similarly, the most energy efficient systems use accelerators or co-processors to achieve their energy consumption performance level. On the Green500 list released in June 2015, the 32 most energy efficient systems all use co-processors or accelerators. This is an increase of almost 40% over the November 2014 release of the Green500 list, where the top 23 systems employed accelerators or co-processors. Furthermore, in the November 2014 release of the Green500 list, a system broke the barrier of 5 GFLOPS/Watt for the first time; that system employed an accelerator to achieve this level of energy efficiency. In the June 2015 release of the Green500, the most energy efficient system, RIKEN's Shoubu, achieves more than 7 GFLOPS/Watt. [3]

Although the 32 most energy efficient systems use accelerators, only 88 systems on the whole of the June 2015 Green500 list utilise them. While these numbers are growing, the overall proportion of systems using accelerators is relatively small. Developers and scientists are often disinclined to use co-processors or

1 GFLOPS – GigaFLOPS – one billion floating point operations per second

heterogeneous architectures. The process of developing for, or porting software to, accelerators is complicated and not standardised. Additionally, these operations carry the risk of conflicting with optimisations and developments for regular CPU-only architectures.

The OpenMP 4.0 standard aims to unify the programming model of co-processors and heterogeneous systems by providing a standard set of pragmas for programming these devices. Furthermore, OpenMP pragmas can easily be switched off when compiling the code for a CPU-only system. Similarly, Intel provides utilities such as Intel Cilk to aid programming of its Intel MIC architecture devices, while AMD and other vendors support OpenACC, which, like OpenMP, is a set of pragmas designed to help with programming heterogeneous systems. Furthermore, Intel's new generation of Intel Xeon Phi co-processors will be self-bootable and compatible with Intel Xeon's ISA, so that the execution and programming of heterogeneous systems will become simpler.
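As a brief illustration (a hypothetical loop, not code from this project), an OpenMP 4.0 target region marks work for execution on an attached device, with map clauses describing the data transfers; if the pragmas are ignored, the same source still compiles and runs on a CPU-only system:

```c
/* OpenMP 4.0 device constructs: the map clauses describe which data is
 * copied to the co-processor and back, and the parallel for distributes
 * the loop across the device threads. Without OpenMP support the
 * pragmas are ignored and the loop simply runs on the host. */
void saxpy(float *y, const float *x, float a, int n)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```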

Intel Xeon Phi is Intel’s co-processor following the Intel MIC architecture. It is designed on a principle of many-core architectures, where a large number of simple cores are used to enable parallel execution of applications. It is in contrast to multi-core processors, where each chip aims to contain a lower number of more sophisticated cores. The motivations behind this approach are based on improved floating point instructions performance through a wider issue of these instructions, better memory utilisation of such systems, and lower energy consumption due to reduced clock frequency. Intel Xeon Phi co-processor is aimed primarily at scientific and research community, and is meant to meet the needs of these industries. This is in contrary to GPU3 systems, which often stem from multimedia and gaming industries, and thus might have different priorities. Consequently, Intel provides a wide range of tools for optimisation and development of scientific code with Intel Xeon Phi. These include the Intel MKL – a math operations library, or Intel Parallel Studio XE – a software development tool.

LU factorisation, or decomposition, is a mathematical operation on matrices which is widely used in solving systems of linear equations and in deriving the inverse or determinant of a matrix. These operations have wide industrial applications, among others in electronic design automation and signal processing. Since implementations of LU decomposition tend to contain many loops and operations on rows of matrices, LU factorisation is highly parallelisable and can potentially be vectorised.

The purpose of this project was to explore the performance of the Intel Xeon Phi co-processor. We aimed to analyse how the performance of Intel Xeon Phi varies when different parallelisation and vectorisation techniques are employed. This was

2 ISA – instruction set architecture

3 GPU – graphics processing unit

achieved by porting an LU factorisation library to Intel Xeon Phi using various models of execution.

1.1 Obstacles and diversions from the original plan

During the course of the project, we ran into several obstacles which we had considered during the project planning stage. Initially, we aimed to port the NICSLU library [4] to Intel Xeon Phi in order to explore the performance of the latter. NICSLU is a library that efficiently implements LU factorisation, aimed at electronic design automation applications.

However, during the course of the project it emerged that the original NICSLU code, implemented in C, is not compatible with the Intel Xeon Phi hardware. The original NICSLU code was written for the Intel386 architecture, which allows unaligned data accesses, whereas the K1OM architecture implemented by Xeon Phi does not allow improperly aligned accesses. The workspace part of NICSLU's application memory, used for performing operations on matrix arrays, is implemented with the use of unaligned accesses. It therefore became impossible to allocate the workspace memory on Xeon Phi in the same fashion as had been done originally.

The unaligned accesses to the workspace in NICSLU affect many different parts of the library, and changing the allocation method of the workspace proved to cause too many legacy inconsistencies throughout the library's code. Since the NICSLU code base is considerably large, we decided to move to a different LU factorisation implementation. The LU decomposition code used in this project is a combination of two implementations adapted by us for this project through the introduction of OpenMP directives. [5] [6] This allowed us to continue the project without major interruptions. The move to a different code base had been anticipated in the risk analysis at the project preparation stage, so we had a contingency plan in place. [7]

1.2 Structure of the dissertation

In this report, we present the work carried out throughout the dissertation project. From this point onwards the report is structured in the following manner.

Initially, in Chapter 2, we introduce the concept of co-processors and accelerators. We consider these in detail in the context of high performance computing. We mention energy efficiency, the move to Exascale, Intel Xeon Phi and compare Xeon Phi to other accelerators.

Subsequently, in Chapter 3, we describe the Intel MIC architecture, explain the systems we used throughout the project, and outline the future of Xeon Phi, Knights Landing.

Then, in Chapter 4, we introduce LU factorisation. We describe its applications and the algorithm used to perform the computation. Moreover, we explain the data structure used to store matrices.

In Chapter 5, we introduce the methods used to parallelise and vectorise the solution, such as OpenMP pragmas and Intel Cilk array notation.

In Chapter 6, we describe the implementation of the solution and how we went about parallelising the code.

In Chapter 7, we outline the benchmarks used to compare different parallelisation and vectorisation methods.

In Chapter 8, we present and analyse the results obtained after running the benchmarks on Xeon Phi with different parallelisation and vectorisation options. Furthermore, we outline the speed-up and show the impact of the offloading on the execution time.

Finally, in Chapter 9, we present conclusions and summary. We introduce ideas for future work that could be completed if the project continued beyond the prescribed timeframe.

Chapter 2

Co-processors and accelerators in HPC

Co-processors and accelerators are thought of as the future of high performance computing. Their prevalence is growing, and with recent advancements in technology they have become a very important avenue for improving the performance of HPC systems. This should come as little surprise given the current state of the supercomputing industry: the limits on single thread performance force researchers and industrial institutions to focus on developing highly parallel systems. Co-processors meet this demand by providing substantial parallelism to achieve high levels of performance. This chapter establishes the position of this dissertation within the wider context of the field of high performance computing.

We examine the motivations and background behind investigating co-processors, and Intel Xeon Phi in particular. We outline the aspects of energy efficiency and how accelerators contribute to the move to Exascale computing. The move to Exascale is widely considered to be the next step in the evolution of HPC systems. Moreover, we describe recent advancements in co-processors and how they shape the landscape of high performance computing. We present the major features of accelerators that contribute to their growing popularity, and a selection of accelerators and co-processors widely used in modern systems. Finally, we perform a brief review of similar and related works on exploring the performance of accelerators.

2.1 Importance of energy efficiency in HPC

The continued progress in the performance of HPC systems is undoubtedly predicated on their energy efficiency. The ability to perform operations while consuming an amount of energy acceptable to the operators is central to high performance computing. Systems constantly need to meet the demand for higher speed and more operations performed every second, and this goal naturally stands in contradiction to lower energy consumption. The ability to sustain the current trend of increasing performance while keeping energy consumption reasonable is the "holy grail" of high performance computing.

The appearance of the "power wall" over the past decade has forced research to focus on parallelisation rather than on further increases in frequency or decreases in the size of chips. Chip designers were no longer able to provide faster computation solely through increases in frequency, because the energy consumed by, and required to cool, such chips was

4 Exascale – computing systems able to achieve a performance of at least one exaFLOPS, that is, at least 10^18 floating point operations per second.

becoming unsustainable. This has resulted in the emergence of parallel architectures in all fields of science and at all levels of computer architecture. In the June 2015 issue of the Top500 list, over 96% of systems use chip multiprocessors with more than 6 cores. [2]

Currently, however, the rate of performance improvements in new multiprocessor developments is decreasing. This suggests that we are approaching the point where multiprocessor technologies hit the "power wall". The benefits of placing additional cores on a single chip will become outweighed by the large energy consumption of these chips and by memory bottlenecks. Consequently, further improvements in chip multiprocessors will gradually become more and more difficult to sustain. This especially applies to large-scale systems such as those used in high performance computing: these systems consist of many multiprocessors, and so each processor within such a system needs to be as energy efficient as possible.

The importance of energy consumption in HPC systems can be seen in various places. For instance, almost a third of the energy consumed by the University of Edinburgh is used by ARCHER, the UK's national supercomputer, which is maintained by the University. [8] Consequently, various clever and sustainable cooling techniques are used throughout HPC systems to ensure that energy efficiency remains high.

The above examples show that developments in the field of HPC depend largely on decreasing the energy consumption of HPC systems. The large amount of energy that supercomputers use needs to be reduced in order for supercomputers to be sustainable in the future. Traditional multiprocessors are seen as no longer being able to keep up with both demands: better performance and better energy efficiency.

Co-processors, however, present good power efficiency characteristics and at the same time have a high theoretical peak performance. As a result, these chips have a low energy consumption per unit of computation, expressed as a high FLOPS per Watt. What is more, co-processor performance has not been fully explored and optimised, and there is still plenty of room for further research. Co-processors can therefore be used to deliver the highly desired energy efficiency to HPC systems.

However, co-processors are not ideal for all purposes. Applications which are not highly parallelisable and do not scale well will not benefit from accelerators such as Intel's Xeon Phi. Similarly, if very good single thread performance is required, the co-processor might not be able to provide it; Xeon Phi's single thread performance is below currently accepted standards. Furthermore, many accelerators do not have sufficient functionality to run all programs (for instance, GPUs cannot run standard operating systems or business applications). Similarly, the in-order instruction execution inherent in Xeon Phi is an obstacle to using the co-processor with some

5 FLOPS – floating point operations per second

applications that rely on out-of-order execution for their performance. Consequently, there exist areas of computing which might not benefit from the use of accelerators.

2.2 Co-processors and the move to Exascale

Energy consumption is one of the major issues in the move to Exascale. The move to Exascale is widely considered to be the next step, the future of high performance computing. As such, the future of HPC is highly dependent on the ability to decrease energy consumption of processors. This is required so that an exascale computer, once commissioned, will be economically sustainable to maintain.

The need for improved energy efficiency requires systems to use much less energy while at the same time providing room for performance enhancements. As mentioned in the previous section, these characteristics are similar to those of co-processors. Co-processors have a high energy efficiency per operation, since their operating frequency is not as high as that of regular processors, yet they have good peak performance due to their high level of parallelism. As a result, they have a very good FLOPS/Watt ratio. Consequently, the ability to utilise the energy efficiency potential of co-processors is conditional upon developing algorithms and programming methodologies that can take advantage of their performance.

Research into the efficient utilisation of co-processors is needed in order to find methods that enable the users of these systems to program them efficiently. This is especially relevant since future exascale systems will likely contain many co-processors. There are countless methods and libraries provided for programming co-processors, including OpenMP 4.0 pragmas, OpenACC, CUDA, Intel Cilk, and Intel LEO. However, no single standard has been established that is widely accepted and recognised as the method of choice for programming accelerators. Therefore, it is valuable to explore the space of these parallelisation methods with regard to various applications, in order to understand which methods provide the best performance for those applications.

6 Intel LEO – Intel Language Extensions for Offload


Figure 2-1: Increasing share of co-processors/accelerators in the systems from Top500 list over the past 4 years

Four of the top ten supercomputers in the world according to the Top500 list use co-processors to achieve their high level of performance. In Figure 2-1, we can also see the growth in the share of co-processors on the Top500 list. Although systems utilising accelerators are still in the minority, we clearly notice a trend of substantial growth in their share among the top HPC systems: the share has increased by a factor of 1.7 since June 2013. Therefore, it becomes necessary to understand well the programming models these co-processors use. A better understanding would allow us to make better use of HPC systems, which are increasingly based on co-processors. The top two systems on the Top500 list both use co-processors to boost their computational power. [2] Given the rate of growth of the number of accelerators on the Top500 list, it is likely that the first Exascale computer will use co-processors to reach the Exascale performance level.

2.3 Intel Xeon Phi and other accelerators

There are various different implementations of accelerators and co-processors, and more and more HPC systems make use of them to accelerate their performance. Vendors produce different architectures suited to accelerating different applications.

Historically, accelerators, or as some prefer to call them, co-processors, have been used to deal with a particular application for which they were specialised. The architecture of accelerators was developed in a way that enabled them to deal with one particular task, so that the main processor did not have to execute it. Consequently, the main CPU saves the time it would otherwise spend executing code for which it was not optimised. Accelerators were primarily developed for multimedia applications, such as GPUs for graphics processing or audio cards for audio signal processing. The potential of having additional computational power in the system has, however, since been discovered by the HPC community, especially considering the rise of parallelisation throughout modern systems. Since many scientific applications exhibit parallelism, having highly parallel SIMD co-processors available allows scientists to offload the computation of these parts of the code.

GPUs often consist of huge arrays of simple processing units capable of performing simple instructions. The code is propagated (streamed) through these units to arrive, at the end, at the final result. The code executed by each unit is called, in CUDA terminology, a "kernel". GPUs are often optimised for floating point instructions. Due to an architecture that maximises the area of the chip devoted to computation, GPUs are able to perform many hundreds of floating point instructions in lockstep in a stream. This allows GPUs to perform many floating point operations in parallel, while sacrificing branch and control instructions. As a result, GPUs can achieve a very high level of floating point performance, which is often desired not only in graphics processing but also in many scientific applications.

Currently, offloading to a GPU or a dedicated accelerator is becoming an increasingly widespread procedure across various applications of HPC. The most popular accelerators used in modern HPC systems from the Top500 list are shown in Table 2-1, with their respective vendors.

Table 2-1: Overview of available co-processors and accelerators by vendor

Vendor    Co-processor or accelerator
AMD       FirePro S9150, S9050
Intel     Xeon Phi (Knights Corner)
NVIDIA    K20, K40, K80

We can see in Table 2-1 the main vendors supplying accelerators: Intel, NVIDIA, and AMD. They provide their own systems and methods for programming these accelerators, such as Intel Cilk and LEO from Intel, or NVIDIA's CUDA. However, there is also a wide range of open environments available, with OpenMP, OpenACC, and OpenCL being the three main examples. These accelerators are constantly being developed into newer, higher performing generations. In this project, we focus on the Intel Xeon Phi co-processor; we present the architecture of the system in Chapter 3, where we also show the future enhancements to be included in the new version of the accelerator, due to be released in the autumn of 2015.

2.4 Related work

There is significant ongoing activity in researching the performance of co-processors and accelerators. The exploration of accelerator performance is considered to influence the design of future HPC systems, which tend to make increasing use of accelerators. Consequently, there is a growing need for the exploration of their

7 SIMD – Single Instruction Multiple Data

performance. Research activity in the field of accelerators focuses both on establishing the most beneficial architectures for accelerators and on finding the most efficient methods of implementing solutions on them. Furthermore, there is ongoing research into the applications that could benefit from being offloaded, and into localising the parts of applications that are suitable candidates for offloading. Research into different aspects of accelerators has also featured in many EPCC dissertations over the past years. [9] [10]

In this project, we attempt to port an LU factorization application to Xeon Phi and subsequently explore its performance using various parallelisation, vectorisation, and offloading methods. Similar research has been conducted in porting climate dynamics [11], astrophysics [12], and LU factorization algorithms [13] [14] [15]. These works primarily aim to explore the performance of Xeon Phi by optimising code for it, and they compare the performance on Xeon Phi to that of other accelerators in order to understand which accelerators offer the best performance for the respective applications.

There is plenty of research carried out into exploring kernels that could be offloaded to accelerators and GPUs. [16] [17] [18] These works focus on kernels that can be offloaded and ported to co-processors, and on exploring their performance. There exists a set of problems which are considered to be highly parallelisable and of significant importance to exploring the performance of co-processors. These problems, called the Berkeley dwarfs, are often used in benchmarking co-processors. [17] The Berkeley dwarfs consist of 13 problem categories considered to be of significant importance to the high performance computing community. These problems are highly scalable and can be expressed as computational kernels; consequently, they are often used to benchmark HPC solutions. LU factorization is one of the Berkeley dwarfs. [17]

GPUs are often used to implement highly parallel problems. Their ability to exploit large amounts of data-level parallelism is used in various applications that exhibit such parallelism. To utilise this parallelism, the process of mapping applications to the GPU hardware is extremely important. One example of this is the work on targetDP [17]. This work highlights the benefits of abstracting hardware issues away from the programmer, so that programmers do not spend excessive time developing porting patterns instead of solving the actual problem at hand. The emergence of many different GPUs and accelerators has made this problem even more prevalent, since different systems require different porting methods. Through porting a complex fluid lattice Boltzmann application, Ludwig, to GPUs, it is possible to demonstrate performance improvements from using GPUs in the computation process. The targetDP abstraction layer targets the data-parallel hardware of different platforms in order to enable applications to take better advantage of accelerators by abstracting memory, task, thread, and instruction level parallelism. The abstraction simplifies the process of porting the code and provides good performance for the application. [19] Such methods of abstracting parallelism emerge as a response to increasingly complex underlying GPU hardware.

Similar issues exist with porting code to Xeon Phi. The placement of threads on the cores of Xeon Phi becomes crucial to exploiting its performance. A case study of porting the CP2K code to Xeon Phi by Iain Bethune and Fiona Reid explored some of the issues emerging in code ported to Xeon Phi. [20] [21] Porting the code to Xeon Phi proved to be relatively easy when the source is already parallelised. However, the work showed that efficient placement of threads on cores is important to ensure good performance, and that finding enough parallelism to fill the threads sufficiently is a significant issue. Similarly, it confirmed that complex functions and calculations perform worse on Xeon Phi than on modern host CPU nodes. Overall, the ported CP2K without additional optimisations ran 4 times slower on Xeon Phi than on 16 cores of an Intel Xeon E5-2670 host node. After further optimisations, CP2K achieved a similar level of performance to the host node.

Finally, there is ongoing work comparing the various methods available for optimising and parallelising code on accelerators and GPUs. A comparison of various methods for porting code to GPUs on the XK7 was performed by Berry, Schuchart and Henschel of Indiana University and Oak Ridge National Laboratory. [22] They focus on analysing the benefits and issues of different methods when porting a molecular dynamics library to the NVIDIA Kepler K20 GPU, comparing CUDA, OpenACC, and OpenMP+MPI implementations. In the molecular dynamics application, the use of OpenACC on the GPU resulted in a speed-up factor of 13 over 32 OpenMP threads run on an AMD Interlagos 6276 CPU.

The above examples of work undertaken in the field of accelerators and GPUs show that there is significant potential in utilising these devices. We can see performance improvements due to their introduction, and HPC systems can clearly benefit from them. Furthermore, we notice that accelerators, including Xeon Phi, are not yet fully optimised and their optimal programming models are far from defined. Therefore, ongoing research in this area is important to better understand the nature of programming such devices and what can be achieved with them. Moreover, not all applications benefit equally from the use of accelerators, and research is being carried out to determine the scope of potential beneficiaries of co-processor hardware.

Chapter 3

Intel MIC architecture

The architecture of Xeon Phi is described in this chapter. We outline the co-processor’s programming models. Furthermore, different versions of Xeon Phi are discussed, and the future releases of the co-processor are mentioned. The developments of Xeon Phi are presented in order to relate current results to future versions of Xeon Phi. Moreover, we summarise some of the tools used with Xeon Phi to aid with porting the solution to the hardware.

3.1 Architecture of Intel MICs

In Chapter 2, we discussed the general architectures of co-processors and accelerators, and explained the ideas guiding their development. In this section, we focus specifically on the Intel MIC architecture. Intel MIC stands for Intel Many Integrated Core architecture. It is a new generation of many-core systems introduced by Intel to compete with accelerators and GPUs in the field of high performance computing. These systems introduce the idea of combining many older, but fully functional, cores connected by a ring network to accelerate computations.

Intel Xeon Phi, which is an example of the Intel MIC architecture, consists of up to 61 cores. Each core is supplemented with a wide vector processing unit (VPU), whose main purpose is to enable vectorisation on a large scale. It carries over the idea from GPUs, where significant parallelism is introduced through SIMD instructions. Consequently, the VPUs in Intel MICs are designed to handle the data-level parallelism achieved through vectorisation. The vector processing units on Intel MIC are 512 bits wide, which corresponds to performing up to 16 single precision or 8 double precision operations on a VPU simultaneously. Each VPU is supplemented with 32 512-bit vector registers. The large number of available VPUs and vector registers needs to be utilised in order to achieve efficiency on Intel Xeon Phi; thorough vectorisation is critical to achieving a high level of performance on the co-processor.

Similarly, higher-level parallelisation is desired too, since the individual scalar cores of Xeon Phi are not implemented with a state of the art architecture. In fact, the scalar unit design in Xeon Phi is taken from the P54C (Pentium) architectural design, which does not run at an up-to-date clock frequency. As a result, it cannot achieve the speed of modern processor cores in terms of single-threaded computation. It therefore becomes necessary to utilise the parallelism available between cores to achieve performance on Xeon Phi; single core, single-threaded execution on Xeon Phi is not competitive in speed with modern CPUs.

Each core of Intel MIC supports multithreading: up to four threads can run per core at any time. In the Intel MIC architecture it is impossible to issue instructions from the same thread back-to-back into the same functional unit. Therefore, to fully utilise the scalar units of Intel MIC, at least two threads need to run per core, and to increase the chances of the functional units being fully utilised we should aim for four threads per core.

Each core is supplemented with two levels of cache. The caches are fully coherent, and the level 2 cache of each core is connected in a ring interconnect with the L2 caches of the other cores of the Xeon Phi. The level 1 cache is divided into two 32-kilobyte parts, an instruction cache and a data cache, with a 3-cycle access latency, while the level 2 cache is unified and stores 512 kilobytes of data per core. The level 2 caches are joined by a bidirectional ring interconnect to form a large common level 2 cache accessible by all the cores, with latencies from 11 cycles upwards.

Intel MIC communicates with the host node over the PCI Express (PCIe) bus. The most common programming model involves work being offloaded from the host node to Xeon Phi, with data transferred across the PCIe bus to the co-processor and back.
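As an illustration of this offload model (a minimal sketch using Intel LEO; the function and variable names are hypothetical rather than taken from the project code), a region can be marked for execution on the co-processor, with clauses controlling which arrays are copied across the PCIe bus:

```c
/* Intel LEO (Language Extensions for Offload) sketch: the in/out
 * clauses describe the data transferred over PCIe to the co-processor
 * and back. If no co-processor is available, the Intel compiler can
 * fall back to executing the region on the host. */
void scale_on_mic(double *a, double *b, int n)
{
    #pragma offload target(mic:0) in(b : length(n)) out(a : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
    }
}
```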

The memory available to Xeon Phi is up to 16GB of GDDR5 memory, distributed over 16 memory channels.

In Figure 3-1 [23], we can see an outline of a single Intel MIC core, showing the scalar unit and the vector unit present in each core. The figure also shows how the caches are structured and connected to the ring interconnect, which runs across all the cores.

Figure 3-1: Simple outline of a single Intel MIC core

In Figure 3-2 [24], we can see a more detailed outline of a single Intel MIC core. It shows the four threads that can issue instructions, the 512-bit VPU, the instruction and data caches, as well as the dual issue of certain scalar instructions.

Figure 3-2: More detailed outline of Intel MIC core

The architecture of Intel MIC is innovative and unique in the way it aims to reduce the size of the computational segment of the die, thus making space for memory and bringing memory closer to the computation. We can notice in Figure 3-2 that the computational logic specific to the architecture takes up less than 2% of the total area of the die.

This is an upcoming trend in the field of high performance computing, and in computing in general. As advancements in computer architecture have progressed, the ALUs often occupy less than 15% of the chip [25]; most of the area is devoted to addressing efficient memory access. By reducing the significance of each computational unit to the overall system and distributing them across a larger area, we gain the possibility of introducing more very fast access memories, such as L1 and L2 caches. These, in turn, help to speed up highly parallelised applications. This follows, and extends to the field of computer architecture, the well-established principle, responsible among others for the MapReduce framework, which states that it is cheaper to bring computation to data than data to computation.

8 ALU – arithmetic and logic unit

3.2 Xeon Phi in EPCC and Hartree

Throughout the project we used the Intel Xeon Phi nodes at EPCC, a system called hydra, as well as those at Hartree. Hydra has two Intel MIC cards available, both connected to one host node with 2 sockets of 8-core Intel Xeon processors. Hartree, on the other hand, has 42 Intel Xeon Phis available, each connected to its own host; the Hartree host nodes each consist of 2 sockets of 12-core Intel Xeon processors.

Both the Hartree and EPCC systems use the same version of the Intel MIC architecture. Both employ the 60-core version of the Intel Xeon Phi co-processor, the Intel Xeon Phi 5110P from the 5100 series. This version of Xeon Phi has a base operating frequency of 1.053 GHz and can service up to 8GB of memory over 16 channels with a bandwidth of 320 GB/s. Intel Xeon Phi 5110P is manufactured in 22nm technology, and the total amount of L2 cache available to the co-processor is 30MB. In Table 3-1, we can see the details of the different series of Intel Xeon Phi currently available on the market, including the 5100 series, which is available at EPCC and Hartree and was used in this project.

Although the versions of Intel Xeon Phi at EPCC and Hartree are the same, their toolchains vary slightly, with different compiler versions installed on the two systems. EPCC has Intel C compiler version 15.0.2, which was used to compile the code for Xeon Phi on hydra. Hartree's version of the compiler is 15.0.1, and it was used to compile the programs on Hartree.

Table 3-1: Overview of versions of Intel Xeon Phi available

Xeon Phi version   No. of cores   Base frequency   Maximum memory size   Maximum memory bandwidth   TDP
3100 series        57             1.1 GHz          6 GB                  240 GB/s                   300 W
5100 series        60             1.053 GHz        8 GB                  320-352 GB/s               225-245 W
7100 series        61             1.238 GHz        16 GB                 352 GB/s                   270-300 W

In Figure 3-3, we can see the Phase-2 Wonder iDataPlex system, to which the Xeon Phi nodes are connected, at the Hartree Centre in Daresbury Laboratory.

9 TDP – thermal design power is the average power, in watts, the processor dissipates when operating at a base frequency with all cores active under an Intel-defined, high-complexity workload.


Figure 3-3: Hartree’s Intel Xeon Phi racks

3.3 Xeon Phi programming tools

There exists a large array of tools aiming to help with programming specifically for many-core systems such as Intel MIC. These tools help to understand how well the cores are utilised and which parts of the program can undergo further optimisation in order to maximise vectorisation or parallelisation. Since efficient parallelisation and vectorisation are key requirements for good performance on Intel Xeon Phi, these tools are especially beneficial. The tools aimed at the Intel MIC architecture used throughout this project include:

- Intel VTune Amplifier – helps with parallelisation and the utilisation of cores
- Intel Vectorization Advisor – helps with vectorisation and improving its efficiency
- Allinea DDT – a debugger for multi-core and multi-threaded systems

3.4 Intel Xeon – host node

Since we compare the results to the performance of Intel Xeon processors, in this section we present the details of the Intel Xeon processor used in this project as the host node for Xeon Phi. The host node we used for benchmarking and comparison with Intel Xeon Phi is the host node of hydra. This is an Intel Xeon system with 2 sockets, each with 8 cores and one thread per core, giving a 16-thread system. Therefore, the maximum number of threads we used on the host node was 16.

The Intel Xeon on hydra has 32K each of data and instruction level 1 cache, 256K of level 2 cache, and 20480K of level 3 cache. It is the Intel Xeon E5-2650 version of the processor. The cores run at a frequency of 2 GHz, in comparison to Intel Xeon Phi's frequency of 1.053 GHz.

The above shows the different characteristics of the Intel Xeon Phi and Intel Xeon processors. We note the large difference in operating frequency, in the number of threads per core available on each architecture, and in the total number of cores. These specifications also outline the basic differences between multi-core and many-core processors: the former is represented by the Intel Xeon on the hydra host node, while the latter is represented by Intel Xeon Phi.

3.5 Knights Landing – the future of Xeon Phi

The version of Intel Xeon Phi on which this project was completed is codenamed Knights Corner (KNC). Towards the end of 2015, a new version of the Intel MIC architecture, and consequently a new version of Intel Xeon Phi, codenamed Knights Landing (KNL), is due to be released. [26] The new version of Xeon Phi will introduce significant developments, which will extend the concept of the many-core architecture to a larger set of problems. This shows that the MIC architecture is being adopted as the way to proceed with increasing the performance of processors and computing systems in general.

KNL, in contrast to KNC, will be self-bootable and compatible with Intel x86 executables. This means that the new version of Xeon Phi will be able to execute the same machine code as Intel Xeon processors, and potentially both can use the same compilers. This development comes, among other reasons, from the observation that optimising for Xeon Phi has been shown to improve performance on the host node as well, something we also experienced throughout the course of this project: optimising code for execution on Xeon Phi often resulted in faster execution on Intel Xeon too. Furthermore, the ability of KNL to self-boot means that it will be able to run without being connected to a host. It will be self-standing and will not require a host to run, so the costs associated with offloading could potentially disappear. Systems including only Xeon Phi processors will also become possible once the need for a host is removed.

Moreover, KNL will have up to 72 cores, up from 61, which was the maximum for KNC. It will introduce the ability to schedule instructions from the same thread back-to-back, so that the utilisation of multithreading will improve. Each core will have two vector processing units instead of one. Furthermore, there is a planned change of interconnect from the current ring to a 2D mesh. KNL will also contain high performance, high-speed memory on the chip, which furthers the principle of bringing data and computation closer together. These are the most significant changes implemented in the new version of the Intel MIC architecture.

In principle, the above changes mean that KNL will not be a co-processor but a standalone many-core processor that can efficiently run parallelised code on its own. It will still have much slower single-threaded performance than regular CPUs, but it will be possible for Xeon Phi to run as the main processor. This signifies a large step in the development of many-core systems and will have a noteworthy impact on the HPC community. The design of systems and the guiding principles for programming them will need to shift in order to utilise many-core systems efficiently, and the need for parallelism will be re-emphasised. Similarly, we might begin to see commercial and home user processors shift towards architectures more closely resembling many-core systems.

Chapter 4

LU factorization – current implementation

In this chapter we explain what lower upper (LU) factorization is, what its main applications are, and why it is significant. Furthermore, we describe the Gaussian elimination method, the algorithm used throughout this project to perform LU factorization. Finally, we explain the data structure used to represent the matrices in memory throughout the computation. LU factorization is interchangeably called LU decomposition in the literature and throughout this report.

4.1 What is LU factorization?

LU factorization is a method of decomposing a single matrix into two matrices. In doing so, we express the original matrix as a product of two triangular matrices: an upper triangular matrix and a lower triangular matrix. This means that the upper triangular matrix has only zero entries below the main diagonal, while the lower triangular matrix has only zero entries above it.

Below is the mathematical representation and an example of LU factorization.

$$A = LU$$

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}$$

In the above equations, we can also see examples of the lower triangular and upper triangular matrices.

In order to ensure that the matrix can be factorized, it might be necessary to reorder its rows. In this case the decomposition is called LU factorization with partial pivoting: we rearrange the rows of the matrix in such a way that the factorization can proceed. The LU factorization then has the following mathematical representation:

$$PA = LU$$

where P is the pivoting (permutation) matrix, which ensures that the rows are ordered such that the LU factorization exists. In this form, an LU decomposition exists for any square matrix.

4.2 Applications of LU factorization

Lower upper decomposition is widely used in algorithms and mathematics. It is used in various optimisation problems because it is an efficient method of solving linear equations: having derived the LU factorization, it is easy to subsequently solve a linear system based on that decomposition. Similarly, LU factorization is often used to derive the inverse of a matrix quickly, especially for larger matrices, and the computation of a matrix determinant can be sped up with its use. These matrix operations gain significant performance improvements for larger matrices from the use of the LU factorization products. There also exist many algorithms which specialise in performing LU decomposition on particular kinds of matrices, or on matrices derived from particular problems. There are plenty of industrial applications in which LU factorization is used; a small subset of these is presented in this section.
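As a concrete illustration of the linear-solver use case (standard textbook reasoning rather than anything specific to this project's code), once $PA = LU$ is available, a system $Ax = b$ reduces to two triangular solves:

$$Ax = b \;\Longrightarrow\; LUx = Pb: \quad \text{solve } Ly = Pb \text{ by forward substitution, then } Ux = y \text{ by back substitution.}$$

Each triangular solve costs on the order of $n^2$ operations, compared with $n^3$ for the factorisation itself, so the same L and U factors can be reused cheaply for many right-hand sides.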

LU decomposition is used in EDA to aid with the placement and routing of components on circuit boards. Since components and the distances between them are often modelled with matrices and linear algebra, using LU factorization is often beneficial. Moreover, LU decomposition is used to simulate such circuits in order to predict their behaviour before the chips are produced. The manufacturing process of chips is very costly, and computer simulations help to minimise these costs by optimising the testing and design stages. LU factorization forms a significant part of this process.

Furthermore, LU factorization is used in climate simulations, machine learning, and virtually any application that involves linear algebra in its computation. For example, it is used in computing the singular value decomposition, which in turn is widely used in signal and data processing. As a result, the computation of LU decomposition is widely used in many fields and applicable to many real-world problems.

4.3 Initial algorithm

In this section, we describe the initial structure and outline of the algorithm. In the next section, we mention the data structures used in the algorithm to represent matrices.

Initially, the implemented algorithm was a simple version of the LU factorization algorithm using the Gaussian elimination method to factorize matrices. Gaussian elimination uses basic matrix transformations to remove non-zero entries from the relevant part of the matrix. Elementary row transformations are performed until the matrix is transformed into an upper triangular matrix. At the same time, the algorithm creates a unit lower

10 EDA – electronic design automation

triangular matrix. This method of LU factorization is called the Doolittle method. [27] The obtained triangular matrices are the result of the LU factorization. [28]

The operations required to perform Gaussian elimination consist of multiple nested loops and use an additional function for matrix multiplication in order to pivot the matrix. An "if statement" ensures numerical stability by stopping the algorithm from performing further calculations on critically small numbers. The general core idea behind the algorithm has not changed throughout the implementation; we mainly introduced optimisations, refactoring, and pragmas in order to observe how the performance changes as a result of these modifications combined with offloading to Xeon Phi. These changes constitute the exploration space of Xeon Phi performance in the context of this project.
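The following is a minimal sketch of Doolittle-style Gaussian elimination with partial pivoting, included only to make the shape of the computation concrete. It is not the project's code (which, as noted above, pivots via a matrix multiplication routine rather than by swapping row pointers), and the TINY threshold is an assumed stand-in for the numerical-stability check:

```c
#include <math.h>

#define TINY 1.0e-12   /* assumed stability threshold, illustrative only */

/* In-place LU factorisation with partial pivoting of an n x n matrix a.
 * On exit, the upper triangle of a holds U and the entries below the
 * diagonal hold the multipliers of the unit lower triangular L; p
 * records the row ordering. Returns 0 on success, -1 if a pivot is
 * critically small. */
int lu_factorise(double **a, int *p, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = i;

    for (int k = 0; k < n; k++) {
        /* Partial pivoting: pick the largest entry in column k. */
        int piv = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i][k]) > fabs(a[piv][k]))
                piv = i;
        if (fabs(a[piv][k]) < TINY)
            return -1;                      /* critically small pivot */
        if (piv != k) {
            double *trow = a[k]; a[k] = a[piv]; a[piv] = trow;
            int tidx = p[k]; p[k] = p[piv]; p[piv] = tidx;
        }

        /* Eliminate the entries below the diagonal in column k. */
        for (int i = k + 1; i < n; i++) {
            a[i][k] /= a[k][k];
            for (int j = k + 1; j < n; j++)
                a[i][j] -= a[i][k] * a[k][j];
        }
    }
    return 0;
}
```

Loops of this shape, iterating over the rows and columns of the matrix, are the kind of code that the later chapters parallelise across threads and vectorise along the innermost loop.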

4.4 Matrix data structure

This section explains the data structure used to store matrices in the application implementing LU factorization. Matrices are implemented as two-dimensional arrays. In order to ensure spatial locality for efficient cache accesses, and to improve the ability to vectorise, the arrays are constructed in two layers.

First, a single large one-dimensional array is created, which contains the whole matrix. Then an array of pointers to the first element of each row within that large array is created. This is shown in Figure 4-1: the whole two-dimensional array is assigned, with one malloc call, to a[0], and a one-dimensional array of size n is then filled with pointers into a[0] at intervals corresponding to the number of columns.

We therefore end up with an array of pointers to the relevant parts of the single large array that contains the actual matrix. The orange arrows in Figure 4-1 represent the malloc calls for each variable, while the blue arrows show where each pointer points.

11 Unit lower triangular matrix – a lower triangular matrix with “1”s on the diagonal


Figure 4-1: Matrix data structure implementation

The data structure presented in Figure 4-1 results in the matrix being stored contiguously, as one long row, in memory. Therefore, spatial locality is emphasised and preserved. Consequently, vectorisation is easy to detect and implement.

Chapter 5

Optimisation and parallelisation methods

In this chapter we describe the techniques attempted to optimise the solution for the Intel Xeon Phi co-processor. These include techniques used to improve the speed of execution of the program in general, as well as specific features of the programming languages that improved the performance. Additionally, several methods of parallelising the code were attempted, including pragmas and auto-vectorisation performed by the compiler.

We discuss in this chapter the methods we attempted and analysed throughout the project in order to improve the execution of the program. The variations between the different methods enable us to explore the performance of Intel Xeon Phi. The methods attempted throughout this project include OpenMP 4.0 pragmas, Intel Cilk array notation, and the ivdep vectorisation pragma.

Furthermore, we discuss the compiler options and associated optimisations which were attempted in the process of improving the performance of executing the code on Xeon Phi.

In the next chapter, we show how techniques described in this chapter are applied to LU factorization code in order to explore the performance of Xeon Phi.

5.1 Intel “ivdep” pragma and compiler auto-vectorisation

The Intel ivdep pragma is a non-binding pragma used in the Intel compiler to aid the process of auto-vectorisation. It prevents the compiler from treating assumed dependencies as proven dependencies [29]. Thus, it helps the compiler to vectorise the code.

Assumed dependencies arise from variables that are independent of the loop index. Because such variables do not depend on the loop index, dependencies on them could exist between loop iterations, and the compiler conservatively assumes that they do. The ivdep pragma can be used to state that such "assumed" dependencies are not in fact dependencies at all, so that the loop can be safely vectorised by the compiler. This aids the compiler's auto-vectorisation. The compiler is more inclined to vectorise loops preceded by the ivdep pragma; however, there are other kinds of dependencies and obstacles to vectorisation which the ivdep pragma cannot overcome.
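For illustration, a hedged sketch of the kind of loop in question is shown below; the offset m is a hypothetical runtime value, so the compiler must assume that x[i] and x[i + m] might overlap between iterations unless told otherwise:

/* Hypothetical example: because m is only known at run time, the compiler
 * assumes a dependency between x[i] and x[i + m] across iterations and may
 * refuse to vectorise. The ivdep pragma discards this assumed dependency. */
void shift_add(double *x, int n, int m)
{
#pragma ivdep
    for (int i = 0; i < n; i++)
        x[i] = x[i + m] + 1.0;
}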

The ivdep pragma is available in the Intel compilers; gcc also provides its own version of the pragma.

5.2 OpenMP 4.0

OpenMP [30] is a standard used across industry and academia to aid parallelising code with the use of pragmas. It is an API for shared memory programming using the C/C++ or Fortran programming languages. It provides a set of pragmas, directives and library routines, which are used to specify parallelism in applications. The standard OpenMP pragmas include parallelisation directives, which distribute work among threads in multithreaded systems. These can be used to distribute different tasks to different processors, but equally they allow the iterations of a single for-loop to be distributed among threads. This is done with the parallel for pragma presented below:

#pragma omp parallel for

Additionally, OpenMP provides a set of schedulers for the distribution of loop iterations between threads. The schedules allow the workload to be distributed appropriately, depending on the application and the work performed by each particular iteration. As a result, the scheduling options help to balance the workload between the threads of the system. This is an important feature of OpenMP, which we exploit to optimise the performance of the code on Xeon Phi. The different scheduling options available in OpenMP and their descriptions are presented in Table 5-1.

Table 5-1: Outline of scheduling options available in OpenMP 4.0 [31]

Scheduling option clause    Description

static      Divide the loop into equal-sized chunks, or as equal as possible where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, the chunk size is loop_count/number_of_threads.

dynamic     Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1.

guided      Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size is approximately loop_count/number_of_threads.

auto        The decision regarding scheduling is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to threads in the team.

However, in the context of many-core systems, such as Xeon Phi, it is important to utilise the vector processing units available on the chip. Therefore, in OpenMP 4.0, there is a simd pragma which aids the compiler with vectorising the code marked with the pragma.

The simd pragma informs the compiler that the loop it annotates can be safely vectorised and does not contain any dependencies.

OpenMP is an efficient method of parallelising code in a shared memory context across many threads. It also includes mechanisms for aiding vectorisation. OpenMP is relatively simple to use and is available across multiple platforms. Moreover, it is not bound to a specific implementation or compiler; most compilers include their own implementation of OpenMP. Furthermore, on the Intel MIC architecture OpenMP allows up to 4 threads per core to be populated, which helps Xeon Phi to run at its peak efficiency.
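A small sketch of how these OpenMP 4.0 features can be combined is shown below; the row-scaling kernel and its names are purely illustrative:

/* Illustrative sketch: the outer loop is distributed across threads with a
 * scheduling clause, while the inner loop is marked for vectorisation. */
void scale_rows(double **a, const double *s, int n)
{
#pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
#pragma omp simd
        for (int j = 0; j < n; j++)
            a[i][j] *= s[i];
    }
}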

5.3 Intel Cilk array notation

Intel Cilk array notation [32] is Intel's proprietary extension to the C language, which allows for better expression of SIMD parallelism. It extends the array syntax so that a programmer can explicitly state vectorisation while writing the code. In Table 5-2, we present how a for-loop can be replaced with Intel Cilk array notation.

Table 5-2: Intel Cilk array notation example

for loop                          Intel Cilk array notation equivalent
for(i=0; i …

The ability to explicitly state vectorisation is beneficial on the Xeon Phi architecture, which includes many vector processing units. These VPUs can potentially be utilised better when the explicit vectorisation representation provided by Intel Cilk array notation is used.
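As a concrete sketch of the kind of transformation Table 5-2 describes, the element-wise addition below (chosen purely for illustration, and requiring the Intel compiler) contrasts a scalar loop with its array-notation form:

/* Scalar loop. */
void add_loop(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Intel Cilk array-notation equivalent: a[start:length] denotes an array
 * section, so the whole operation is stated as one vector expression. */
void add_array_notation(double *a, const double *b, const double *c, int n)
{
    a[0:n] = b[0:n] + c[0:n];
}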

5.4 Offload models

There are several methods of running a program on a Xeon Phi. The program can be offloaded or run natively, and different parts of the code, from individual blocks to whole functions, can be offloaded. The choice of offload model has a significant influence on the performance of the solution. Data can also be transferred separately between the host and Xeon Phi through a PCI Express bus using dedicated Intel LEO or other pragmas. The choice of offloading model is therefore a substantial factor in optimising the performance of code on Xeon Phi. Using two or more Xeon Phis simultaneously could also be attempted in order to achieve better performance.

In Figure 5-1, we see the execution models available on Xeon Phi, with their most popular use cases. The code can be offloaded during execution to Xeon Phi with the use of specialised pragmas such as Intel LEO or OpenMP 4.0 pragmas. Such a model is beneficial where there are phases, or blocks, of highly parallelisable code. In such a case, offloading the most computationally intensive and parallelisable parts of the code can result in improved performance. This model is similar to what has long been used with GPUs and graphics processing, where graphics code was offloaded to the GPU to be processed there. In the case of Xeon Phi, offloading is done with the use of Intel LEO pragmas such as the following two, which offload a code block and data respectively:

#pragma offload target(mic)
{
    […] /* code block */
}

#pragma offload_transfer target(mic:0) in(a : length(N) alloc_if(1) free_if(0))  /* data offload */

This approach allows the code to be distributed between the host processor and Xeon Phi. However, it potentially introduces overheads associated with data transfer between the host and Xeon Phi. As a result, in applications where such transfers occur often, the offloading approach might not be the most beneficial.

Another approach is to execute the code symmetrically. The symmetric approach is one where the cores available on Xeon Phi are treated in the same way as the cores available on the host. The host then executes the program treating these cores as additional available processors. This approach allows for more flexibility, and gives a choice of how much workload is devoted to the host and how much to the Xeon Phi cores. However, there is still overhead due to communication between the host and the Xeon Phi. Moreover, balancing the workload appropriately between Xeon Phi and the host nodes can be troublesome. The difficulty in distributing the workload correctly might mean that the underlying application does not perform adequately.

Finally, for highly parallel applications, the full workload might be executed on the Xeon Phi node, thus limiting the communication overheads. However, as we see in Figure 5-1, such an approach works well only with applications that are highly parallelisable as a whole, since it relies on high utilisation of all the cores at all times. The performance of an individual core of Xeon Phi is significantly lower than that of a single core of the host. Consequently, we lose much of the performance if we do not fully exploit parallelisation in the native mode.

In Figure 5-2, we can see the software stack of Intel Xeon Phi. The communication between Intel Xeon Phi and the host node is clearly marked, and we see the different models for native and offload execution. Symmetric execution is similar in terms of overheads to offload execution. From the figure, we see that the communication between a Xeon Phi and a host happens through a PCI Express bus, and is serviced by a layer of supporting and communication libraries. This PCI Express connection is where all the communication between the Xeon Phi and the host happens, both in the symmetric mode of execution and when the offload to Intel MIC mode is chosen. Therefore, the PCI Express link is the bottleneck of data-transfer-intensive applications, and can bound the execution time. This happens in cases where data is constantly being offloaded or sent between Xeon Phi and the host.


Figure 5-1: Intel Xeon Phi execution models with Intel Xeon as the host processor [33]

Figure 5-2: Intel Xeon Phi software stack [23]

Chapter 6

Implementation of the solution

In this chapter, we outline how the solution was ported to Xeon Phi and describe the changes made to the code in order to optimise it and increase the performance. The optimised performance across different optimisation methods allows us to compare these methods and in turn explore the overall performance of Xeon Phi.

In order to improve the performance and port the code to Xeon Phi, we began by localising the bottlenecks of the program. To analyse the code and find the hotspots of the calculation, we used a profiler, which enabled us to see where the code would benefit from additional optimisation. Subsequently, we refactored these parts of the code and, with the use of compiler flags, aimed to achieve the best single-thread performance. Finally, throughout the implementation, we applied the methods described in the previous chapter and compared their performance on the LU factorization problem on a Xeon Phi co-processor.

In the following chapter, we present and describe benchmarks used to measure the performance of the solution on Xeon Phi co-processor.

6.1 Initial profiling

To improve the performance and pinpoint the parts of the code that could benefit from parallelisation, we employed a profiler to find the most computationally expensive parts of the code. These are the parts that bounded the solution and were the bottleneck of the execution of the program. They subsequently became the focus of our parallelisation efforts.

We used the Intel VTune [34] and gprof [35] profiling tools to find the hotspots and the most time-consuming sections of the code. From the profiling data it emerged that the most computationally intensive parts of the code are the functions mat_mul and lup_od_omp. The profile was initially obtained from gprof after running the code on 4 host threads with some OpenMP pragmas already introduced; it can be seen in Table 6-1. With no parallelisation at all, the matrix multiplication function dominated completely, so we introduced early parallelisation pragmas in the mat_mul function in order to obtain more representative profiling results, as presented in Table 6-1. The overall runtime of this program for the profiler was 70 seconds, so the profiler had enough time to collect statistics on the function calls.

Table 6-1: gprof profile of running the LU factorization algorithm with ranmat4500 input on 4 host threads

% time    Function name
84.34     mat_mul
15.55     lup_od_omp
0.05      __intel_memset
0.03      main
0.02      mat_pivot
0.01      read_arr

From Table 6-1, we can clearly see that mat_mul and lup_od_omp, the functions responsible for matrix multiplication and the LU factorization algorithm itself, were the most computationally intensive. This is in line with what we had expected of an LU factorization algorithm.

Having shown that mat_mul and lup_od_omp are in fact responsible for most of the execution time and are the bottleneck of the execution, we focused our efforts on maximising the performance of these functions and applied the optimisation methods and techniques primarily to these two functions.

6.2 Parallelising the code

Having located the bottlenecks of the code, we aim to improve the performance of these parts of the code. To do so, we introduce parallelisation pragmas and refactor the code so that parallel regions can be larger and contain more work, thereby maximising the utilisation of the available parallelism. In this section, we present examples of the transformations we performed in order to initially parallelise the code.

Initially, we start with the most time-consuming part of the code according to the obtained profile, namely the matrix multiplication operation. Matrix multiplication is known to be easily parallelisable: it consists of several nested loops with independent iterations that traverse two two-dimensional arrays, which correspond to the two matrices. Therefore, we can introduce an OpenMP pragma collapsing the two outermost loops, in order to parallelise these loops as shown below.

#pragma omp parallel for private(i, j, k) collapse(2)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        …

The sample above shows how an OpenMP pragma is introduced to parallelise the code. The pragma allows the compiler to distribute the loop iterations among the available threads, and the collapse clause combines the two loops, which in this case are independent, into a single parallelised iteration space.

Moreover, OpenMP scheduling clauses can be used to further improve the load balancing of the distributed loop iterations. In the above case of matrix multiplication, the size and complexity of the computation does not change between iterations. Since all iterations have similar complexity, the default static scheduling of n/num_of_threads loop iterations per core works well, due to cache hits in the local caches of the Xeon Phi cores. In other parts of the code, where the complexity varies more between iterations, it is beneficial to use other scheduling options, which ensure load balance between the threads. A sketch of the resulting multiplication kernel is shown below.
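Putting the pieces together, the parallelised multiplication kernel might look roughly as follows; this is a sketch consistent with the pragma shown above, and assumes the result matrix has been zero-initialised, so it need not match the project's mat_mul exactly:

/* Sketch of a parallelised matrix multiplication: the two independent outer
 * loops are collapsed and distributed across threads; static scheduling is
 * used because every (i, j) iteration performs the same amount of work.
 * Assumes c has been zero-initialised beforehand. */
void mat_mul(double **c, double **a, double **b, int n)
{
    int i, j, k;
#pragma omp parallel for private(i, j, k) collapse(2) schedule(static)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}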

We followed a similar process with other functions, focusing especially on lup_od_omp, where we also refactored the code in order to ensure it could be parallelised. The changes also ensured that the number of parallel regions was kept to a minimum, to limit the overheads associated with starting, closing and synchronising parallel regions.

Furthermore, the process of refactoring the code allows the compiler to apply auto-vectorisation techniques more efficiently. For example, explicitly putting one array operation within one loop helps the compiler to see that this section of the code can be auto-vectorised. We investigated different options to see which one gave the most efficient results, with the use of vectorisation reports and direct measurement of the performance.

6.2.1 Hotspots analysis

Having completed the parallelisation of the algorithm, we analysed the performance again to localise any further hotspots and bottlenecks in the program. This was done with VTune, a tool that helps to profile and navigate the hotspots within a program. VTune is especially helpful for navigating the hotspots of a program offloaded to Xeon Phi. It helped us locate parts of the code that could be further parallelised, either by introducing extra parallelisation pragmas or by collapsing several loops to increase the thread utilisation on Xeon Phi.

Hotspot analysis using VTune was of particular use when choosing the correct loop scheduling for the "parallel for" pragma. The tool informed us whether the load balancing between the threads was adequate. Similarly, it helped us to find loops that were not optimally parallelised and could be collapsed together to improve the utilisation of threads.

Figure 6-1 shows how we used VTune to aid with the optimisation of thread utilisation. It presents graphs of thread utilisation before and after the optimisation of the lup_od_omp function. We can clearly see that the level of thread utilisation improves. The ability to visualise how many threads are working throughout the duration of the program is of immense help when optimising it. We notice in Figure 6-1 that the average utilisation of Xeon Phi threads increases after optimisation: the majority of the execution time post-optimisation is spent utilising more than 200 threads.

Figure 6-1: VTune threads utilisation in lup_od_omp function pre and post optimisation

Moreover, it is also important to note that the size of the benchmark and its density are both important factors in the utilisation of threads. If the benchmark is small, it might not generate enough operations to distribute among the large number of threads available on Xeon Phi. Consequently, it might not utilise the threads as efficiently as a larger benchmark would with the same code.

Nevertheless, the analysis presented in Figure 6-1 does not take into account how well the vector processing units (VPUs) are utilised. The graph in Figure 6-1 shows only the utilisation of the up to 4 threads per core available on Xeon Phi; it does not include the utilisation of the vector processing units.

6.3 Offloading the code

Offloading the code was the approach we took when working with Xeon Phi. The conventional operating model for Xeon Phi is to offload the most computationally intensive parts of the code to a Xeon Phi node and execute them there. In this project, we considered offloading at several points throughout the code. We attempted offloading individual for-loops as well as larger parts of the code, including whole functions.

Due to the large size of some benchmarks, it became important that matrices are offloaded only once, so as to avoid the costly process of repeatedly offloading these large matrices over the PCIe bus. There exist pragmas dedicated to offloading code to Xeon Phi from the host processor. Intel LEO – Intel Language Extensions for Offload [36] – is one such set of pragmas. Alternatives include OpenMP 4.0 pragmas [30], OpenACC [37], and HAM (Heterogeneous Active Messages for Efficient Offloading) [38]. All these methods have their merits; however, we focused mainly on Intel LEO, Intel's proprietary method. Intel LEO is the method preferred by Intel and is claimed to achieve the best results when offloading to Xeon Phi, being designed specifically for this purpose.
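One way of expressing this "transfer once, reuse many times" pattern with Intel LEO is sketched below; the buffer and function names are illustrative, and the project's actual code combines the transfer with a single offload region, as shown later in this section:

void factorise_resident(double *A_ptr, long n)
{
    /* Allocate a buffer on the co-processor, copy the matrix in once,
     * and keep the buffer alive after the pragma completes. */
#pragma offload_transfer target(mic:0) \
        in(A_ptr : length(n*n) alloc_if(1) free_if(0))

    /* Subsequent offloads reuse the resident buffer without re-transferring it. */
#pragma offload target(mic:0) \
        nocopy(A_ptr : length(n*n) alloc_if(0) free_if(0))
    {
        /* ... work on the matrix on the co-processor ... */
    }

    /* Copy the result back and release the buffer on the co-processor. */
#pragma offload_transfer target(mic:0) \
        out(A_ptr : length(n*n) alloc_if(0) free_if(1))
}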

In this project we attempted offloading at different levels of granularity, and finally decided to create one offload region. Functions called within this region are compiled for the MIC architecture by marking them with the __attribute__((target(mic))) attribute.

The offload itself was performed with the use of the following Intel LEO pragma call:

#pragma offload target(mic) in(n) inout(A_ptr : length(n*n) alloc_if(1)) nocopy(end_time) nocopy(start_time)

This is the main offloading pragma used in the code. Within the offloaded section, we call the function which executes the LU decomposition algorithm, and the matrix multiplication required for pivoting.
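A rough sketch of how this offload region and the MIC-compiled functions fit together is given below; the function signatures are assumptions made for illustration, although the function names follow those reported by the profiler:

/* Functions called inside the offload region must also be compiled for the
 * MIC architecture; signatures here are illustrative. */
__attribute__((target(mic))) void lup_od_omp(double **a, int n);
__attribute__((target(mic))) void mat_mul(double **c, double **a, double **b, int n);

void run_factorisation(double *A_ptr, int n)
{
    double start_time, end_time;

#pragma offload target(mic) in(n) inout(A_ptr : length(n*n) alloc_if(1)) \
        nocopy(end_time) nocopy(start_time)
    {
        /* On the co-processor: rebuild the row pointers over A_ptr and call
         * lup_od_omp (which in turn uses mat_mul for pivoting). Timing on the
         * co-processor can use start_time and end_time, hence the nocopy clauses. */
    }
}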

6.4 Hinting vectorisation with ivdep

Having parallelised, optimised, and offloaded the code to the MIC, we attempted various methods of improving the utilisation of the vector processing units and of helping the compiler with auto-vectorisation. One such method is the introduction of the ivdep pragma to the code. This pragma helps the compiler to vectorise the code: the compiler is assured that certain kinds of dependencies do not occur in the loop that follows the pragma. Therefore, it is used as a way of notifying the compiler that it is safe to vectorise that code.

The pragma was used in key loops throughout the code to hint the compiler in the direction of vectorisation. The loops in the matrix multiplication function are independent and can easily be transformed into vector expressions. However, they often rely on variables defined outside the loop, and so the compiler might not always vectorise them. Similarly, the mat_pivot function makes use of the pragma to aid the vectorisation of loops in its code. An example is shown below:

#pragma ivdep
for (k = 0; k < n; k++) {
    p[i][k] = p[max_j][k];
}

We can see that the ivdep pragma helps to vectorise such a loop, which otherwise might not be vectorised due to the compiler's uncertainty about data dependencies. The data might overwrite itself if an overlap between the pointers exists. In this case, the pointers are partially defined outside of the scope of the loop by the i and max_j variables. Therefore, the existence of an overlap might not be determinable by the compiler at compile time. As a result, the compiler would safely assume that the dependency exists and might not vectorise the loop. The ivdep pragma prevents that from happening and helps the compiler to vectorise the code.

6.4.1 Hotspots for further vectorisation

To analyse and evaluate the code and find places which might require further vectorisation, we used vectorisation reports together with VTune. VTune enabled us, as previously mentioned, to navigate through the hotspots of the code and to find the parts that required further optimisation. Having found these parts, we looked for loops within them that could be vectorised. Vectorisation reports provided by the Intel compiler were then used to check whether all loops which could be vectorised were in fact vectorised. This enabled us to analyse the code and see how much more vectorisation we could achieve through additional vectorisation pragmas. Moreover, this process allowed us to explore and compare the performance of Xeon Phi with different programming models, parallelisation levels, and sets of pragmas.

6.5 Ensuring vectorisation with Intel Cilk and OpenMP simd

In this section, we describe the use of Intel Cilk array notation and OpenMP 4.0's simd pragma to ensure that the code is vectorised by the compiler. These methods work in a similar manner, by explicitly showing the compiler that there are no dependencies in a loop or array operation, so that the compiler can safely vectorise the statement. Efficient vectorisation is a key concern when programming for Xeon Phi, because the co-processor contains many 512-bit wide vector processing units. These vector processing units, when utilised, provide a significant improvement in performance by exploiting data parallelism.

Programs benefit from these operations if they are written in such a way as to allow the vectorisation to be performed. In the case of LU factorization there are many operations on arrays that can be vectorised. There are some dependencies between the loop levels; however, there are not many dependencies between loop iterations. Below, an example of a loop that can be vectorised is shown:

for (j = k + 1; j < n; j++)
    a[i][j] -= aik * a[k][j];

We can see that, in the above loop, the outermost array index depends only on the loop iterator. Therefore, in this case the accesses to the variable a could be vectorised. However, the compiler is reluctant to do so without a dedicated pragma, due to its limited knowledge of whether i and k might overlap. We performed a similar operation and analysis for all the loops of the code. Likewise, we changed the ivdep pragmas to their Intel Cilk array notation or OpenMP simd pragma equivalents. This left us with optimally parallelised and vectorised code using each of the two respective methods.

6.5.1 Intel Cilk array notation

To use Intel Cilk array notation, we replaced the innermost loops with Intel Cilk array notation. An example of how Intel Cilk array notation was used in the lup_od_omp function is presented in Table 6-2.

Table 6-2: Intel Cilk array notation use in lup_od_omp function

Original for loop:
    for(j = k + 1; j < n; j++)
        a[i][j] -= aik * a[k][j];

Intel Cilk array notation:
    a[i][k+1:n-(k+1)] -= aik * a[k][k+1:n-(k+1)];

The loop statement is transformed into a single-line statement, which explicitly denotes that this part of the code can be vectorised. A similar process was followed throughout the whole code: the outermost indices of array accesses were replaced with Intel Cilk array notation, and the respective loops were removed.

The use of Intel Cilk array notation has allowed us to ensure that the compiler uses vectorisation to the fullest, and takes advantage of any opportunity to vectorise statements.

6.5.2 OpenMP simd pragma

OpenMP pragmas differ from Intel Cilk array notation in that they do not transform the loop's syntax. Instead, they are used to mark up the loops which can be legally vectorised. The usage of the OpenMP simd pragma on a piece of code similar to the one from Table 6-2 is presented in Table 6-3.

Table 6-3: OpenMP simd pragma usage in lup_od_omp

Original for loop:
    for(j = k + 1; j < n; j++)
        a[i][j] -= aik * a[k][j];

OpenMP simd pragma:
    #pragma omp simd
    for(j = k + 1; j < n; j++)
        a[i][j] -= aik * a[k][j];

In Table 6-3, we can see that the only change to the code required to introduce the OpenMP simd pragma is the introduction of the pragma itself. The pragma notifies the compiler that the following loop can be vectorised. This process was repeated throughout the code to ensure that all loops which can be vectorised are marked up with the pragma. Similarly, introducing the pragma at various levels of nested loops allowed us to determine the optimal points for the introduction of simd pragmas.

The optimal points for the simd pragmas tended to be the loops whose index controlled the largest number of outermost array indices in the statements in the loop body. For example, in the mat_mul function, the simd pragma was introduced on the middle loop, because the middle loop's index controls two out of the three statements in the body of the loops. This proved to be the most efficient place to insert the simd pragma; a sketch is shown below.
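For illustration, a sketch of what placing the simd pragma on the middle loop of a triple-nested multiplication might look like is shown below; the exact loop ordering and body of the project's mat_mul are assumptions here, and the result matrix is assumed to be zero-initialised:

/* Sketch: the simd pragma on the middle (j) loop vectorises across j, which
 * controls two of the three array accesses in the body (c[i][j] and b[k][j]).
 * Iterations over j are independent, so vectorising across them is safe. */
void mat_mul_simd(double **c, double **a, double **b, int n)
{
    for (int i = 0; i < n; i++) {
#pragma omp simd
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
    }
}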

Similarly to Intel Cilk array notation, the introduction of simd pragmas allowed us to maximise the vectorisation of the code, due to the restrictive nature of OpenMP's simd pragma. The code marked with the pragma is vectorised by the compiler regardless of any potential dependencies that the compiler might detect in the code. This allowed us, through the use of pragmas, to ensure that the utilisation of the vector processing units is maximised as far as possible.

Chapter 7

Benchmarking the solution

In this chapter, we describe the process of obtaining the benchmarks which were then used to establish the performance of the code. In order to benchmark the LU factorization algorithm, it is necessary to obtain a set of benchmark matrices that can be used in the LU factorization process. A wide range of sizes and densities of these matrices gives us a broad view of how the performance of Xeon Phi and of the algorithms is affected by various kinds of benchmark matrices. Furthermore, a wide range of matrices ensures that the exploration of the performance of Xeon Phi is not narrowed to a limited number of cases.

To obtain a wide spectrum of sizes and densities, we used several kinds of benchmarks. We obtained part of the benchmarks from the University of Florida sparse matrix collection [39]. Additionally, to obtain dense matrices, we created a small program to generate dense matrices of different sizes. These dense matrices could subsequently be used to explore the performance of the hardware.

In this chapter, we present the selection of benchmarks and discuss their features. We provide an explanation of the format of the benchmarks, a short overview of the University of Florida sparse matrix collection, and the process of obtaining the dense benchmarks. Finally, we present a table of characteristics of the benchmarks used in the project.

7.1 Matrix format

The benchmarks used in this project are stored and prepared in the Matrix Market Exchange format [40]. The format is simple and is often used to store and present sparse matrices. A Matrix Market Exchange file contains a line specifying the dimensions of the matrix, followed by a number of lines corresponding to the non-zero entries of the matrix. Each entry consists of three numbers: the first two specify the coordinates within the matrix of the value given by the last number on the line. We can see an example of such a matrix in Figure 7-1 [40].
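As an illustration, a minimal, hypothetical file in this format might look as follows (the values are invented; files from the collection typically also begin with a %%MatrixMarket banner line and may contain % comment lines):

%%MatrixMarket matrix coordinate real general
% hypothetical 3 x 3 matrix with 4 non-zero entries
3 3 4
1 1 2.5
2 3 -1.0
3 1 4.0
3 3 7.25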


Figure 7-1: Sample matrix and its Matrix Market Exchange format representation

To enable matrices in Matrix Market format to be read by the program, a dedicated module responsible for reading the matrices from file was created. The module reads a Matrix Market format file and populates a matrix, which can then be processed by the core body of the program. This module is in the file readmatrix.c.

7.2 University of Florida sparse matrix collection

The University of Florida sparse matrix collection [39] has been developed by Dr. Tim Davis and includes various matrices. The matrices relate to a wide range of problems, such as route optimisation, circuit design, and mathematics.

The matrices selected from the collection for this project are primarily used in electronic design automation optimisation. These matrices simulate sample representative circuits and they are meant to correspond to certain categories of circuits. The matrices are used to simulate routing of these circuits on a circuit board.

The matrices from the University of Florida’s sparse matrix collection used to evaluate the performance of the LU factorization algorithm in this project are presented below:

• add32.mtx
• circuit_1.mtx
• circuit_2.mtx
• coupled.mtx
• init_adder1.mtx
• meg4.mtx
• rajat01.mtx
• rajat03.mtx

7.3 Dense benchmarks

To obtain dense benchmarks, we created a program which uses a pseudo-random number generator to populate matrices. The program creates and populates the matrices, and subsequently writes them to a file adhering to the Matrix Market Exchange format. The file containing the C code used to generate the dense matrices is called populate.c.

Dense matrices generated with this code are completely filled with non-zero numbers. These are generated using the pseudo-random number generator available in the C standard library. The numbers populating the dense matrices are all positive and limited in magnitude to between 0 and 100. Matrices of various sizes were created in order to reflect the wide range of problems in which LU factorization might be used. When deciding on the sizes of these matrices, we accounted for the fact that only larger matrices are able to take full advantage of the vast number of threads available on Xeon Phi. Similarly, we accounted for the fact that beyond a certain size the execution of the algorithm becomes too time-consuming to measure repeatedly within the scope of the project. These factors influenced our selection of dense matrix sizes.
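A minimal sketch of what such a generator might look like is shown below; the actual populate.c may differ in details such as the exact header line written or how the matrix size is chosen:

#include <stdio.h>
#include <stdlib.h>

/* Sketch of a dense benchmark generator: fills an n x n matrix with positive
 * pseudo-random values below 100 and writes every entry in Matrix Market
 * coordinate form (row, column, value). */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000;
    FILE *f = fopen("ranmat.mtx", "w");
    if (f == NULL)
        return 1;

    fprintf(f, "%d %d %d\n", n, n, n * n);            /* dimension line */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            fprintf(f, "%d %d %f\n", i, j, 100.0 * rand() / RAND_MAX);

    fclose(f);
    return 0;
}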

The matrices created using the populate.c program are the following:

• ranmat1000.mtx
• ranmat4500.mtx
• ranmat5000.mtx
• ranmat5500.mtx
• ranmat6000.mtx

7.4 Summary of benchmarks’ characteristics

In Table 7-1, we present a summary of the benchmark matrices used in evaluating the solution and exploring the performance of Intel Xeon Phi. The table lists the name of each matrix, its dimensions, and its number of non-zero elements. These features all influence the execution time of the LU decomposition algorithm.


Table 7-1: Characteristics of benchmark matrices

Matrix name       Dimensions       No. of non-zero elements
add32.mtx         4960 x 4960      19,848
circuit_1.mtx     2624 x 2624      35,823
circuit_2.mtx     4510 x 4510      21,199
coupled.mtx       11341 x 11341    97,193
init_adder1.mtx   1813 x 1813      11,156
meg4.mtx          5860 x 5860      25,258
rajat01.mtx       6833 x 6833      43,250
rajat03.mtx       7602 x 7602      32,653
ranmat1000.mtx    1000 x 1000      1,000,000
ranmat4500.mtx    4500 x 4500      20,250,000
ranmat5000.mtx    5000 x 5000      25,000,000
ranmat5500.mtx    5500 x 5500      30,250,000
ranmat6000.mtx    6000 x 6000      36,000,000

From Table 7-1, we see that a wide range of matrix sizes and densities is used in evaluating the performance of Xeon Phi throughout this project. We can also see the clear division between the sparse and dense matrices. The first eight matrices are clearly sparse, while the latter five are dense. These matrices are used as benchmarks to explore performance of Xeon Phi. The results of running the program with different optimisation methods, and these matrices as inputs, are presented in the next chapter.

Chapter 8

Analysis of performance of Xeon Phi

We present the results obtained after running the algorithm on Xeon Phi with the different methods presented in the previous chapters, which result in various levels of parallelism. We analyse how these methods compare with each other and what their performance is on different kinds of benchmarks. Furthermore, we show how the speed-up on Xeon Phi varies with the use of these methods. Moreover, we compare the results of running the program offloaded to Xeon Phi with running it natively on the host processor.

These results show how we have explored performance of Xeon Phi. We can see which methods work the best on Xeon Phi, and how well the programs scale on the hardware. Furthermore, we can notice at which point it becomes beneficial to switch to Xeon Phi rather than executing the program on the local host.

In addition, we provide a brief analysis of the time required for sections of the code to be offloaded to Xeon Phi. The time required to offload data to Xeon Phi can be a major overhead and a bottleneck in memory-intensive programs. In the case of LU factorization this applies to the larger matrices, which need to be transferred from the host to Xeon Phi.

The findings presented in this chapter help to establish how Xeon Phi can be exploited to improve speed of computation. LU factorization is a common pattern in many scientific computations involving matrices. Similarly to LU factorisation, computations on matrices often can be parallelised to a certain degree. Therefore, the results presented in this chapter reflect Xeon Phi’s usability in applications involving operations on matrices. Furthermore, these results show what approach is the most beneficial, in terms of performance gain, when porting an application to Xeon Phi. This could guide future attempts to port code to Xeon Phi.

8.1 Collection of results

To collect results, we measured the time taken by the relevant section of the program to run. The aim was to accurately measure the execution time and, based on it, to calculate the speed-ups of the program under various modifications and in different conditions. This process is described in this section, and allowed us to arrive at the results presented in this chapter.

In order to measure the execution time, we have used OpenMP’s omp_get_wtime() function, which returns the wall time. We inserted calls to this function at the beginning and at the end of the region we aimed to time. Then, the former value obtained was

subtracted from the latter in order to arrive at the execution time of the given block of code. This method allowed us to obtain accurate timings of the program and did not affect its overall execution.
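In code, the measurement therefore reduces to a pattern of the following form (a sketch; in the project the timed region encloses the factorisation itself):

#include <omp.h>
#include <stdio.h>

/* Sketch of the timing pattern: take the wall-clock time before and after the
 * region of interest and report the difference as the execution time. */
void time_region(void (*region)(void))
{
    double start_time = omp_get_wtime();
    region();                                   /* e.g. the offloaded LU factorisation */
    double end_time = omp_get_wtime();
    printf("Execution time: %f s\n", end_time - start_time);
}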

We measured the execution time on its own, that is, the time after the data had been offloaded to Intel Xeon Phi. Likewise, we measured the time taken by this part of the program to execute including the offload to Xeon Phi. This approach allowed us to see and measure the influence of the offload on the overall performance of the program. This method was selected because transfer of the data often becomes a bottleneck when working with Xeon Phi.

Finally, in order to ensure that the data collected was representative, we measured the execution time several times for each benchmark in each condition. The results presented in this report are therefore based on running each configuration of the application and parallelisation method three times. The three data points collected were then checked for outliers. Outliers were considered to be points that are more than 10% away from the average of the three points. If the initially collected data contained outliers, the collection process was repeated for that benchmark. If the data did not contain any outliers, that is, all data points for each benchmark and configuration were within 10% of the average for this benchmark and configuration, the data was accepted.

The above process ensured that the data was representative and that there were no outliers. The running time of the program could vary due to other users or processes running on Xeon Phi; following the process described in the previous paragraph avoids this. In the following sections, we show the results obtained in line with the process described.

Speed-up, in the context of this project, is defined by the following formula:

$$\text{Speed-up} = \frac{T_1}{T_n}$$

In the above formula, T1 is the execution time on one thread, while Tn is the execution time on n threads. The speed-up therefore corresponds to the factor by which the execution time decreases as the number of threads increases. It is a primary metric used to evaluate the performance of parallel program execution.

8.2 Validation of the results

In order to ensure that the results were still valid after the changes made to the initial algorithm, we performed a validation of the results. To check that the computation was correct, we compared the results obtained from the algorithm with results obtained using a different algorithm, which implemented a basic Gaussian elimination. However, only the smaller benchmark matrices were evaluated against this algorithm, because of time constraints: since the other algorithm was largely single-threaded and not optimised, the larger benchmark matrices took a very long time to execute with it. A subset of smaller benchmarks was therefore used to ensure that the algorithm, after the changes, was producing correct results.

The results from the two algorithms were compared, and if they remained sufficiently similar, within a 5% error boundary, the result of the modified algorithm was deemed correct. This validation was used as a proxy for the overall correctness of the algorithm: it was inferred that if the results were correct for the smaller matrices, the algorithm also behaves correctly for the larger matrices.
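The acceptance test can be sketched as an element-wise comparison within the 5% tolerance, as below; the function name and the handling of near-zero reference entries are assumptions for illustration:

#include <math.h>

/* Returns 1 if every entry of the optimised result lies within 5% of the
 * corresponding entry of the reference result, and 0 otherwise. Entries that
 * are (nearly) zero in the reference are compared with a small absolute
 * tolerance instead. */
int results_match(double **result, double **reference, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double ref = reference[i][j];
            double tol = (fabs(ref) > 1e-12) ? 0.05 * fabs(ref) : 1e-9;
            if (fabs(result[i][j] - ref) > tol)
                return 0;
        }
    }
    return 1;
}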

8.3 Overview of results

In this section, we present the results of running the program with various levels of optimisation on Intel Xeon Phi. The results presented in Table 8-1 include only the time required to run the benchmark on Xeon Phi, without the time taken to offload data to Xeon Phi. The results were obtained by executing the program on Xeon Phi compiled with Intel C Compiler version 15.0.1; the results on the host were obtained with Intel C Compiler version 15.0.2 [41]. The results presented in Table 8-1 were compiled with optimisation level "-O3". All the runs on Xeon Phi used 236 threads, while on the host the program was executed with 1, 4, and 16 threads respectively. 236 threads is the optimal number on the version of Xeon Phi available: it has 60 cores, of which one is reserved to handle OS and scheduling requests, leaving 59 cores available for computation that can be load balanced well without causing stalls. Therefore, we use 236 threads, which corresponds to 59 cores at 4 threads per core.

Table 8-1: Execution times (in seconds) of running benchmarks offloaded to Xeon Phi with different parallelisation methods

                 Host                                   Xeon Phi n=236
Matrix name      n=1         n=4        n=16            ivdep pragma   simd pragma   Intel Cilk array notation
add32            562.92      156.77     52.25           300.16         48.99         21.76
circuit_1        76.52       22.10      6.78            36.84          7.40          4.48
circuit_2        539.55      161.80     54.77           252.12         34.32         13.33
coupled          10477.47    3478.02    979.79          6206.00        548.95        389.74
init_adder1      29.13       7.78       2.92            8.51           1.98          1.43
meg4             964.99      282.94     82.29           382.02         75.67         29.24
rajat01          2237.87     736.06     217.74          886.15         117.87        45.58
rajat03          2611.22     801.68     226.15          1345.14        158.68        72.07
ranmat1000       3.78        1.21       0.66            0.76           0.48          0.56
ranmat4500       478.27      139.40     64.24           259.68         40.19         20.04
ranmat5000       613.51      178.32     75.59           361.24         54.14         26.72
ranmat5500       877.40      258.70     104.74          477.65         72.75         35.71
ranmat6000       1057.81     307.60     140.93          618.82         93.88         45.08

From Table 8-1, we see that the performance of all benchmarks eventually improves after offloading to Xeon Phi. All benchmarks show an improvement when moving from one host thread to any level of optimisation on Xeon Phi. Moreover, with the OpenMP simd pragma or Intel Cilk array notation, the execution is faster even than running the code on 16 threads on the host processor, which is the best performance we achieved on the host. The host execution times are also presented in Table 8-1, from which we see that 16 threads give the lowest runtimes on the host.

We explored various scheduling options, including dynamic and static with various chunk sizes from 1 to n. It emerged that static scheduling with equally distributed chunks gives the best performance, and it was used to obtain the above results. This is due to the regularity of the matrices: the workload is equally distributed across the matrices' areas. Therefore, static scheduling provides good workload balancing and the best efficiency.

The above results show that introduction of optimisations, and offloading the code to Xeon Phi are beneficial for these benchmarks. The “Host n=1” column results of Table 8-1 are used as the reference point for calculating the speed-up with different methods. In the following sections we look at how performance improves with particular optimisations, and what is the speed-up achieved with these methods.

8.4 Speed-up with different optimisation options

From the table presented in the previous section, we notice that the introduction of parallelisation and the ivdep pragma improved the performance of the code. In Figure 8-1, we show the speed-up of the offloaded code with reference to the same code run on one thread of the host processor. In the figure, we also present the speed-up achieved by running the code on 16 threads on the host without introducing these optimisations. We can see the speed-ups with the different methods of parallelisation and vectorisation implemented.

[Chart: speed-up (0–50) of each benchmark for the ivdep pragma, simd pragma, Intel Cilk array notation, and the host with n=16]

Figure 8-1: Speed-up of benchmarks with different parallelisation techniques used

From Figure 8-1, we can see that a good amount of speed-up is achieved when offloading the code using Intel Cilk array notation or the simd pragma. The speed-up reaches up to 49, signifying performance potentially 49 times better when the code is offloaded to Xeon Phi than on a single thread of the Intel Xeon host. The average speed-up across benchmarks due to the introduction of Intel Cilk array notation and offload to Xeon Phi is 27.0. We can see from the figure that the kind of benchmark has a significant influence on the speed-up. Dense benchmarks have similar speed-up values, except for the very small benchmark. The average speed-up achieved by the dense matrix benchmarks is 20.3. There is more variation among the sparse benchmarks, and the average speed-up for sparse matrices is somewhat higher, at 31.1.

Similarly, for the OpenMP 4.0 simd pragma, the speed-up due to the introduction of the pragma and the offload of the program to Xeon Phi reaches up to 19 times the single-threaded performance of the Intel Xeon host. On average across all benchmarks, the speed-up is 13.4; that is, parallelising, vectorising, and offloading the program to Xeon Phi improves the execution time on average by 13.4 times over the single-threaded Intel Xeon execution time. Dense matrix benchmarks achieve an average speed-up of 10.9, while sparse matrix benchmarks, as with Intel Cilk array notation, achieve a speed-up almost 1.5 times better, with the average for this type of benchmark being 14.9.

For the ivdep pragma, the average speed-up is much lower than for the two options mentioned above, reaching 2.33. The maximum speed-up achieved due to the introduction of the ivdep pragma is 5.00, with the ranmat1000 matrix benchmark. The average for dense matrix benchmarks in the case of the ivdep pragma is 2.42, which is much closer to the overall average than in the case of OpenMP or Intel Cilk array notation. Finally, the sparse matrices on average attain a speed-up of 2.27; again, this is closer to the overall average than in the case of OpenMP or Intel Cilk array notation. This shows that the ivdep pragma behaves more uniformly across dense and sparse matrices, while OpenMP and Intel Cilk array notation both perform better, in terms of speed-up, with sparse matrices.

The above results are briefly summarised in Table 8-2.

Table 8-2: Speed-up values summary against single-threaded host execution time

Parallelisation method       Overall average    Dense matrices average    Sparse matrices average    Maximum    Minimum
OpenMP + ivdep pragma        2.33               2.42                      2.27                       5.00       1.69
OpenMP 4.0 simd pragma       13.4               10.9                      14.9                       19.1       7.82
Intel Cilk array notation    27.0               20.3                      31.1                       49.1       6.71

From Table 8-2, we also notice that the best minimum speed-up is achieved by OpenMP 4.0 with the simd pragma. This suggests that for smaller benchmarks such as ranmat1000 the simd pragma is more efficient than Intel Cilk array notation, which otherwise seems to be the best option overall for speeding up execution of the program on Xeon Phi.

We can see from the results presented in this section that the speed-up varies significantly between the different methods of parallelisation and vectorisation. This happens because of the different principles behind the methods. Firstly, the ivdep pragma is not as effective at vectorising the code as the other methods, because it is not restrictive: placing the pragma in the code does not guarantee that the loop it precedes will be vectorised. As a result, several loops in the code that could be vectorised are not vectorised, despite the ivdep pragma being placed around

them. This happens because the compiler still conservatively treats any potential dependencies it cannot disprove as actual dependencies. Consequently, scalar code is executed and the vector processing units available on the Xeon Phi are not utilised. Since the scalar performance of Xeon Phi threads is relatively poor, the speed-up achieved is not as good as it could have been had the vectorisation happened. This explains the lower speed-up of the ivdep pragma in comparison to the other methods.

Explaining the difference between the two fastest vectorisation methods is more subtle. Both the simd pragma and Intel Cilk array notation enforce vectorisation: the simd pragma ensures that the loop following it is vectorised, and Intel Cilk array notation likewise ensures that the marked code is vectorised. It is therefore harder to explain why such a significant difference between the two methods appears. However, a basic analysis of smaller snippets of the code has shown that the performance variation arises because, with the simd pragma, the compiler cannot always predict the exact vector width. Since the pragma is placed outside the loop, if the loop is nested and its indices depend on the outer loop, the compiler's vectorisation more often results in remainder and peel loops. This is avoided with Intel Cilk array notation, where the width of the vector can be predicted. Moreover, this results in better cache allocation with Cilk array notation, where the data can be loaded into cache in a way that supports subsequent loads into the vector registers of the Xeon Phi. An example of code exposing this difference is shown in Table 8-3.

Table 8-3: Code snippets explaining performance difference between simd pragma and Intel Cilk array notation

Intel Cilk array notation:
    c[i][0:n] += a[i][k] * b[k][0:n];

Simd pragma:
    #pragma omp simd
    for (k=0; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];

From the code snippets in Table 8-3, we see that when using Cilk array notation we specify the length of the vector explicitly, and the compiler and runtime know that it will not change throughout the execution of that block of code. On the other hand, with the simd pragma, the values of k and n could potentially change throughout the execution of the loop. Therefore, the compiler might employ a suboptimal division of the arrays into vector registers, resulting in a remainder or peel loop being created. Furthermore, additional instructions are introduced in the code to check whether k and n have changed. Intel Cilk array notation is less likely to encounter this issue, and consequently achieves better performance.

8.5 Native speed-up on Intel Xeon and on Intel Xeon Phi

To measure how the solution scales on Intel Xeon Phi in comparison to the host, we ran the program with different numbers of threads. Measuring the performance for different numbers of threads, both on the Xeon Phi co-processor and on the Intel Xeon host processor, allows us to explore how scalable the solution is. This can help to predict how the performance would change with an increase in the number of cores on Xeon Phi. Since the next version of Xeon Phi is going to have an increased number of cores, such an analysis can be useful for predicting the performance of the program on the new generation of Xeon Phis.

In Figure 8-2, we can see the average speed-up of Xeon Phi across all the benchmarks with a varying number of threads. Because of the size of some benchmarks and the fact that the single-thread performance of Xeon Phi is poor, we calculated the speed-up with reference to running the benchmarks with 8 threads on Xeon Phi. From the graph presented in Figure 8-2, we see that the speed-up increases as we increase the number of threads. However, it does not scale perfectly: the speed-up grows roughly linearly from one data point to the next, while the number of threads roughly doubles at each step. This shows that the speed-up achieved on Xeon Phi for this algorithm is far from the ideal speed-up, in which the speed-up increases by the same factor as the number of threads. Nevertheless, we can determine from the graph that the performance would keep increasing if the number of threads increased beyond 236. This is beneficial, since we would like to be able to profit from running the algorithm on devices with a larger number of threads in the future.

Furthermore, the speed-up depends on the size of matrices used as benchmarks and their density. The larger and denser the benchmark matrix is, the more performance we gain when increasing the number of threads.

[Chart: average speed-up on Xeon Phi (0–4) for 8, 16, 32, 60, 100, and 236 threads]

Figure 8-2: Speed-up of Xeon Phi with varying number of threads (speed-up=1, when n=8)

8.6 Offloading overhead

In this section, we focus on the time required for offloading, exploring this aspect of Xeon Phi's performance. Since the model we used for programming Xeon Phi involves offloading the data to Xeon Phi in order to perform the computation there, it becomes necessary to analyse the consequences of offloading for the performance of the program. We analysed offloading by comparing the execution time of the program including the offload of the data to Xeon Phi with the execution time measured after the data had already been offloaded. This enabled us to state the offload time as a percentage of the execution time on Intel Xeon Phi.

Subsequently, we could analyse the influence of offloading on the performance of Xeon Phi, and see whether it is beneficial to offload. The data obtained shows whether the increase in performance due to offloading overcomes the overhead of data transfer to and from Xeon Phi caused by the offloading. It also shows at what matrix size it becomes beneficial to use Xeon Phi.

In Figure 8-3, we can see the time taken by offloading the data to Xeon Phi as a percentage of the total execution time, that is, the execution time including the offloading time. The total time was measured outside of the offloading region so as to capture the overall execution time with offloading. The data shown in Figure 8-3 is based on the execution with the OpenMP 4.0 simd pragma. Similarly, in Figure 8-4, we present the corresponding result for Intel Cilk array notation.

[Chart: offload time and execution time as percentages (0–100%) of the total execution time, per benchmark, for the OpenMP 4.0 simd pragma]

Figure 8-3: Offloading and execution times as a percentages of the total execution time for OpenMP 4.0 simd pragma

[Chart: offload time and execution time as percentages (0–100%) of the total execution time, per benchmark, for Intel Cilk array notation]

Figure 8-4: Offloading and execution times as a percentages of the total execution time for Intel Cilk array notation

From the figures presented in this section, we can see that for the majority of matrices the offloading time does not influence the overall execution time significantly. Consistently with what we would expect, as the matrix dimensions increase, the share of offloading in the total execution time decreases. Furthermore, we notice that even with the fastest parallelisation methods, the total time for large matrices is dominated by computation rather than by offloading. This is because the offload is performed only once and the data is kept on Xeon Phi once offloaded, instead of being sent backwards and forwards throughout the execution of the program. Finally, we notice that the distribution of computation and offloading time does not vary significantly between the different parallelisation and vectorisation methods, which is what we would expect.

We can conclude that it is beneficial to use offloading when dealing with sufficiently large matrices. From the figures, matrices with dimensions just above 2000 by 2000 entries already show very little influence of the offloading time on the overall execution time. To illustrate this, we also present Figure 8-5, which shows the speed-up of the total execution time, that is, the execution time including offload, using the three parallelisation methods described in Chapter 5.

[Chart: speed-up (y-axis, 0 to 50) per benchmark for the ivdep pragma, the simd pragma, Intel Cilk array notation, and the host with n=16.]

Figure 8-5: Speed-up of benchmarks including offload time with different parallelisation techniques used

From Figure 8-5, we see that the speed-up with offload is still significant, and it is beneficial to offload the most computationally intensive operations. The offload time does not dominate the overall execution time, so we still observe a significant speed-up despite the offloading time adding to the overall execution time. When we compare Figure 8-1 and Figure 8-5, we see that the difference is marginal for most of the larger matrices, although it is still noticeable. Consequently, although performance could be improved if the offload were faster, we would still recommend offloading the most computationally intensive parts of LU factorisation to Xeon Phi to increase the overall performance of the application.

8.7 Speed-up on the host with different optimisation options

In this section, we present how the optimisation methods influence the performance of the code on the host processor – Intel Xeon. The program, with the optimisations described in the previous chapters introduced, was executed on the host. The speed-up results are presented in Figure 8-6.

[Chart: speed-up (y-axis, 0 to 70) per benchmark on the host for no optimisations (n=16), the ivdep pragma, the simd pragma, and Intel Cilk array notation.]

Figure 8-6: Speed-up on the host – Intel Xeon – with different optimisation methods and benchmarks

From Figure 8-6, we notice that the variability between the ivdep pragma and the simd pragma is much greater on Intel Xeon than on Xeon Phi, although both methods give a similar range of performance improvements. This shows that vectorisation has a smaller influence on overall performance on the Intel Xeon processor than on Xeon Phi. Having narrower vector units, the host does not benefit as much from the introduction of vectorisation pragmas. The improvement that does occur comes from the pragmas allowing the compiler to schedule memory accesses more efficiently; as a result, the loops run faster because less time is lost to cache misses. In Table 8-4, we present the average speed-up for sparse and dense benchmarks on Intel Xeon when using the optimisation methods, together with the maximum speed-up achieved.

Table 8-4: Overview of performance improvements due to optimisation methods on the host – Intel Xeon

                  No optimisations,   ivdep     simd      Intel Cilk
                  n=16                pragma    pragma    array notation
Dense average     7.42                10.3      9.73      16.7
Sparse average    10.8                13.0      15.1      40.8
Maximum           11.7                17.7      21.3      65.0

Moreover, we see a significant improvement in performance when using Intel Cilk array notation on the host. This comes from better scheduling of loops on threads and from the more efficient memory access patterns that the array notation makes available to the compiler. It is important to note that the optimisations introduced to improve the performance of Intel Xeon Phi have improved the performance of the host too. This shows that optimising and porting the code to Xeon Phi is also beneficial when trying to optimise the code on the host: optimising for Xeon Phi will in the majority of cases result in better performance on the host processor as well. This is a strong argument for attempting to port code to Xeon Phi. Even if performance is not improved by porting the code to Xeon Phi itself, we would still benefit from the improvements on the host.
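To make the three optimisation methods compared above concrete, the sketch below shows the inner elimination update of row i at pivot step k written in each style (illustrative only: the row-major matrix A, its size n, and the loop indices are placeholders following the layout described in Chapter 4, not the project’s exact source):

/* (a) ivdep: hint to the compiler that the loop carries no vector dependences. */
#pragma ivdep
for (int j = k + 1; j < n; j++)
    A[i*n + j] -= A[i*n + k] * A[k*n + j];

/* (b) OpenMP 4.0: request vectorisation with the simd construct. */
#pragma omp simd
for (int j = k + 1; j < n; j++)
    A[i*n + j] -= A[i*n + k] * A[k*n + j];

/* (c) Intel Cilk array notation: the whole update as one array section
 *     of length n-(k+1), starting at column k+1. */
A[i*n + (k+1) : n-(k+1)] -= A[i*n + k] * A[k*n + (k+1) : n-(k+1)];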

8.8 Running NICSLU on the host

In Table 8-5, we present the results of running NICSLU on the host processor using sparse matrices as benchmark inputs.

Table 8-5: Results of running NICSLU on the host

Benchmark matrix    Host, n=16
add32               4.33
circuit_1           5.62
circuit_2           2.65
coupled             16.0
init_adder1         2.65
rajat01             8.06

From Table 8-5, we can see that NICSLU performs substantially better on the host than the library we have used throughout the project. This is largely due to the specific design of NICSLU, which is aimed at performing sparse LU factorisation of electronic design automation matrices efficiently. It would be beneficial and interesting to port this code to Xeon Phi in the future in order to analyse NICSLU’s performance on Intel Xeon Phi and whether it outperforms the host when running the library.

Chapter 9

Summary and conclusions

In conclusion, the project succeeded in analysing and exploring the performance of Xeon Phi. Various methods of parallelisation and vectorisation were attempted, and the performance of LU factorisation with these methods was compared. The benefits of offloading parts of the code to Xeon Phi were examined and the speed-up was calculated.

The move towards accelerators and co-processors is becoming more evident. However, the approaches to programming these devices are not uniform. This project aimed to explore different methods of porting code, designed to solve a particular problem, to an accelerator. The application code was ported to Intel Xeon Phi, since Xeon Phi is aimed mainly at scientific and computational problems. The novel architecture of Intel Xeon Phi made the analysis of its performance noteworthy.

The LU factorisation code was successfully ported to Intel Xeon Phi. The primary model used was offloading the key computational functions, containing the loops that were subsequently parallelised and vectorised on Xeon Phi. We used OpenMP 4.0 pragmas, ivdep pragmas, Intel Cilk, and Intel LEO pragmas to offload and parallelise the code on Intel Xeon Phi. These attempts were successful and yielded a speed-up of execution of up to a factor of 49.

The difference in instruction set architectures between Intel Xeon and Intel Xeon Phi caused us to move to a different code base. Despite the move, we managed to obtain a significant improvement in the performance of the new library through the introduction of parallelisation.

We compared the parallelisation frameworks on several benchmarks of both dense and sparse matrices. This has shown that the performance of the algorithm, and of the optimisations applied when offloading the code to Xeon Phi, can fluctuate depending on the shape and size of each benchmark. Dense benchmark matrices showed a more uniform speed-up across various sizes, while sparse benchmark matrices showed more varied performance depending on the size of each benchmark matrix and its number of non-zero elements.

Furthermore, we confirmed our initial expectation that vectorisation is a very important factor in exploiting the performance of Intel Xeon Phi. The pragmas which enabled a high level of vectorisation produced larger performance improvements from offloading the code to the co-processor than those which did not fully utilise it.

The ability to collapse several loops using OpenMP pragma clauses has also been shown to improve performance, and suggests that this feature of OpenMP is beneficial when porting the code to Xeon Phi, as illustrated below.
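A minimal sketch of the collapse clause (illustrative only; the array C and the bounds n and m are placeholders, not code from the project):

/* collapse(2) fuses the two nested loops into a single iteration space,
 * so n*m iterations, rather than only n, are shared among the threads;
 * this matters on Xeon Phi, where well over two hundred threads need work. */
#pragma omp parallel for collapse(2)
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        C[i*m + j] = 2.0 * C[i*m + j];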

Overall, the best performance for optimising and porting the code to Xeon Phi was achieved with the use of Intel Cilk array notation together with Intel LEO pragmas for offloading. The array notation allowed the compiler to fully explore and take advantage of the parallelism present within the code. Intel Cilk code offloaded to Xeon Phi achieved a speed-up of up to 49 when compared with the single-threaded performance of the Intel Xeon processor, with an average speed-up across the benchmarks of 27. This shows that the use of Intel Cilk array notation and offloading the code to Xeon Phi can substantially improve the speed of LU decomposition.

However, Intel Cilk is a proprietary Intel technology. Although Intel Cilk Plus, including array notation, is available to some extent in versions of gcc beyond 4.7 [42], it is the closed-source Intel compilers that support this method best. Similarly, Intel Cilk code is less portable than cross-platform methods such as OpenMP. Therefore, the performance improvements obtained from OpenMP’s simd pragma were also interesting. We attempted to bring the performance of the OpenMP simd pragma as close as possible to that of Intel Cilk array notation. Nevertheless, some difference remains between the two, with the simd pragma achieving an average speed-up of 19 times over execution on one thread of Intel Xeon.

The project has shown that offloading to Xeon Phi can often improve the performance of the code substantially. At the same time, it is possible to perform other operations on the host or to execute heterogeneously on both the host and Xeon Phi to achieve even better performance.

Moreover, we examined the effect of offloading large matrices on the overall execution of the code on Xeon Phi. Since we used a scheme in which the data is offloaded only once, the influence of data transfer to and from the co-processor was marginal for sufficiently large matrices. The data transfer was not the bottleneck of the execution on Xeon Phi.

Finally, we noted that the improvements introduced to enhance performance on Xeon Phi improved performance on the host as well. Better parallelisation and vectorisation on Xeon Phi turned out to benefit vectorisation and parallelisation on the host, effectively speeding up execution on the host too.

The dissertation project has succeeded in exploring the performance of Xeon Phi and in showing that LU decomposition can benefit from offloading to the co-processor. We have shown that accelerators can be used to improve the performance of certain applications, and we have analysed different methods that allow for these improvements. We can see that in the future the move to many-core architectures, such as the new version of Intel Xeon Phi, Knights Landing, can result in improved performance if the algorithms are adequately adapted. Moreover, we have shown that the algorithms can be adapted using various techniques, some of which, such as OpenMP, do not require expert knowledge of the underlying hardware. We believe that the use of co-processors will keep increasing in the field of high performance computing, and that the use of various parallelisation methods to utilise co-processors efficiently will become inevitable.

9.1 Future work

There is a wide range of aspects in which further work on this project could be attempted. The options considered most promising for future work include:

• Porting the original NICSLU code to Xeon Phi by changing the library’s memory usage patterns, and so fixing the unaligned access issues in the current version. This would require contacting the original creators of the library and working with them to port the code. Porting the code to KNL, which would allow unaligned accesses, and examining the library’s performance there could also be attempted.

• Exploring the performance of Xeon Phi with a wider range of libraries, such as the gamma index used in nuclear medicine and radiology. This application is also heavily based on matrix operations, so it could potentially benefit from vectorisation and parallelisation.

• Investigating hybrid execution on Intel Xeon Phi and the host to further examine the performance of the co-processor in such a setup. The program could be run asynchronously on both the host and Xeon Phi (see the sketch after this list). This would enable us to potentially maximise the computing power available and thus increase performance. In a hybrid approach both host and accelerator perform the same calculations in parallel; the host processor’s cores are treated in the same way as the co-processor’s cores and execute threads of the application, but, being more efficient, their workload would have to be adjusted appropriately.

• Examining the energy consumption and efficiency of the co-processor in order to compare it against the efficiency of the host. This would allow us to see whether there are benefits to offloading other than the increased speed of execution of the code.

• Comparing the performance of Xeon Phi using cross-platform OpenMP pragmas with the performance of other accelerators, such as NVIDIA’s Kepler or AMD’s ATI Radeon. This would offer a wider view of the spectrum of accelerators and would show how beneficial Xeon Phi is. The energy performance of these accelerators and co-processors could also be analysed.

• Finally, attempting other methods of porting the code to Intel Xeon Phi, such as HAM (Heterogeneous Active Messages), OpenACC, or MPI, in order to explore their impact on performance when porting the code to Xeon Phi.
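A minimal sketch of how such a hybrid, asynchronous split might look with Intel LEO’s signal/wait clauses (the function names work_on_phi and work_on_host, and the way the work is split, are assumptions for illustration, not part of this project):

/* The co-processor and the host each process their own share of the work;
 * the offload returns immediately because of signal(), so the host can run
 * concurrently, and offload_wait joins the two before the results are used. */
__attribute__((target(mic))) void work_on_phi(double *A, int rows, int n);
void work_on_host(double *A, int rows, int n);

void hybrid_step(double *A_phi, int rows_phi,
                 double *A_host, int rows_host, int n)
{
    char done;                                   /* tag identifying the async offload */

    #pragma offload target(mic:0) signal(&done) \
            inout(A_phi : length(rows_phi * n))
    work_on_phi(A_phi, rows_phi, n);             /* runs asynchronously on the card   */

    work_on_host(A_host, rows_host, n);          /* host processes its share meanwhile */

    #pragma offload_wait target(mic:0) wait(&done)   /* join before reusing A_phi */
}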

Bibliography

[1] J. Hruska, “The death of CPU scaling: From one core to many — and why we’re still stuck,” ExtremeTech, 1 February 2012. [Online]. Available: http://www.extremetech.com/computing/116561-the-death-of-cpu-scaling-from-one-core-to-many-and-why-were-still-stuck/2. [Accessed 1 August 2015].
[2] TOP500, “China’s Tianhe-2 Supercomputer Maintains Top Spot on List of World’s Top Supercomputers,” TOP500, 13 July 2015. [Online]. Available: http://www.top500.org/blog/lists/2015/06/press-release/. [Accessed 10 August 2015].
[3] The Green500, “The Green500 List - June 2015,” CompuGreen, 1 August 2015. [Online]. Available: http://www.green500.org/news/green500-list-june-2015?q=lists/green201506. [Accessed 1 August 2015].
[4] X. Chen, Y. Wang and H. Yang, “NICSLU: An Adaptive Sparse Matrix Solver for Parallel Circuit Simulation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 2, pp. 261-274, 2013.
[5] RosettaCode.org, “LU decomposition,” RosettaCode.org, 9 January 2015. [Online]. Available: http://rosettacode.org/wiki/LU_decomposition. [Accessed 30 June 2015].
[6] P. Wallstedt, “Compact LU Factorization,” The University of Utah, [Online]. Available: http://www.sci.utah.edu/~wallstedt/LU.htm. [Accessed 30 June 2015].
[7] M. I. Dubaniowski, “Project Preparation Report,” The University of Edinburgh, Edinburgh, 2015.
[8] R. C. Johnson, “Archer Supercomputer Is Fastest in UK,” EE Times, 11 April 2014. [Online]. Available: http://www.eetimes.com/document.asp?doc_id=1321896. [Accessed 3 August 2015].
[9] E.-I. Farsarakis, “Energy Efficiency: Benefits and limitations of modern HPC architectures,” EPCC, The University of Edinburgh, Edinburgh, 2014.
[10] S. Tsang, “Characterising Bipartite Graph Matching Algorithms on GPUs,” EPCC, The University of Edinburgh, Edinburgh, 2014.
[11] C. S. Pelissier, “Climate Dynamics on the Intel Xeon Phi,” in NASA@SC13, Denver, 2013.
[12] Institute for Computational Cosmology, “Institute of Advanced Research Computing: Intel Parallel Computing Center,” Durham University, 27 May 2015. [Online]. Available: https://www.dur.ac.uk/iarc/intel/. [Accessed 30 July 2015].
[13] A. Haidar, P. Luszczek, S. Tomov and J. Dongarra, “Heterogenous Acceleration for Linear Algebra in Multi- Environments,” Lecture Notes in Computer Science, vol. 8969, pp. 31-42, 2014.

[14] P. Sao, X. Liu, R. Vuduc and X. Li, “A Sparse Direct Solver for Distributed Memory Xeon Phi-accelerated Systems,” in 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, 2015.
[15] A. S. Belsare, “Sparse LU Factorization for Large Circuit Matrices on Heterogenous Parallel Computing Platforms,” Texas A&M University, College Station, 2014.
[16] J. Fang, A. L. Varbanescu, H. Sips, L. Zhang, Y. Che and C. Xu, “Benchmarking Intel Xeon Phi to Guide Kernel Design,” Delft University of Technology, Delft, 2013.
[17] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf and S. Williams et al., “The landscape of parallel computing research: A view from Berkeley,” Citeseer, Berkeley, 2006.
[18] M. Bordawekar and R. Baskaran, “Optimizing sparse matrix-vector multiplication on GPUs,” IBM, 2009.
[19] A. Gray, “Performance Portability,” in GPU Computing Seminar, Sheffield, 2015.
[20] I. Bethune and F. Reid, “Evaluating CP2K on Exascale Hardware: Intel Xeon Phi,” PRACE, Edinburgh, 2014.
[21] I. Bethune and F. Reid, “Optimising CP2K for the Intel Xeon Phi,” PRACE, Edinburgh, 2014.
[22] D. K. Berry, J. Schuchart and R. Henschel, “Experiences Porting a Molecular Dynamics Code to GPUs on a Cray XK7,” in Cray User Group, Napa Valley, 2013.
[23] S. U. Thiagarajan, C. Congdon, S. Naik and L. Q. Nguyen, “Intel Xeon Phi Coprocessor Developer's Quick Start Guide v. 1.5,” Intel Corporation, Santa Clara, 2013.
[24] G. Chrysos, “Intel Xeon Phi Coprocessor - the Architecture,” Intel Corporation, 12 November 2012. [Online]. Available: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner. [Accessed 25 August 2015].
[25] F. Zarrinfar, “Optimizing Embedded Memory for the Latest ASIC and SOC Designs,” Chip Design, 10 June 2013. [Online]. Available: http://chipdesignmag.com/display.php?articleId=5260. [Accessed 9 August 2015].
[26] T. P. Morgan, “More Knights Landing Xeon Phi Secrets Unveiled,” The Platform, 25 March 2015. [Online]. Available: http://www.theplatform.net/2015/03/25/more-knights-landing-xeon-phi-secrets-unveiled/. [Accessed 9 August 2015].
[27] M. Barraza-Rios, “Gaussian Elimination & LU Decomposition,” The University of Texas at El Paso, El Paso, TX, 2011.
[28] J. Mahaffy, “Gauss Elimination and LU Decomposition,” Pennsylvania State University, 1997. [Online]. Available: http://www.personal.psu.edu/jhm/f90/201.html. [Accessed 9 August 2015].
[29] Intel Developer Zone, “ivdep,” Intel Corporation, 2015. [Online]. Available: https://software.intel.com/en-us/node/524501. [Accessed 30 July 2015].

[30] OpenMP Architecture Review Board, “OpenMP Application Program Interface,” OpenMP Architecture Review Board, 2013.
[31] R. W. Green, “OpenMP* Loop Scheduling,” Intel Corporation, 29 August 2014. [Online]. Available: https://software.intel.com/en-us/articles/openmp-loop-scheduling. [Accessed 9 August 2015].
[32] A. D. Robison, “SIMD Parallelism using Array Notation,” Intel Developer Zone, 3 September 2010. [Online]. Available: https://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/?wapkw=array+notation. [Accessed 30 July 2015].
[33] P. Kennedy, “Intel Xeon Phi 5110P Coprocessor – Many Integrated Core Unleashed,” Serve The Home (STH), 13 November 2012. [Online]. Available: http://www.servethehome.com/introducing-intel-xeon-phi-5110p-coprocessor- -integrated-core-unleased/. [Accessed 16 August 2015].
[34] S. Cepeda, “Optimization and Performance Tuning for Intel Xeon Phi,” Intel Corporation, 12 November 2012. [Online]. Available: https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding. [Accessed 31 July 2015].
[35] J. Fenlason, “GNU gprof,” Free Software Foundation, Inc., November 2008. [Online]. Available: https://sourceware.org/binutils/docs/gprof/. [Accessed 30 July 2015].
[36] K. Davis, “Data transfer of an “array of pointers” using the Intel Language Extensions for Offload (LEO) for the Intel Xeon Phi coprocessor,” Intel Corporation, 22 August 2014. [Online]. Available: https://software.intel.com/en-us/articles/xeon-phi-coprocessor-data-transfer-array-of-pointers-using-language-extensions-for-offload. [Accessed 30 July 2015].
[37] OpenACC-standard.org, “The OpenACC Application Programming Interface,” OpenACC-standard.org, 2013.
[38] M. Noack, “HAM - Heterogenous Active Messages for Efficient Offloading on the Intel Xeon Phi,” The Zuse Institute Berlin (ZIB), Berlin, 2014.
[39] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1:1-1:25, 2011.
[40] R. Boisvert, R. Pozo and K. Remington, “The Matrix Market Exchange Formats: Initial Design,” National Institute of Standards and Technology, Gaithersburg, 1996.
[41] Intel Developer Zone, “Intel® C++ Compiler in Intel Parallel Studio XE,” Intel Corporation, [Online]. Available: https://software.intel.com/en-us/c-compilers/ipsxe. [Accessed 30 July 2015].
[42] B. Tannenbaum, “Cilk Plus Array Notation for C Accepted into GCC Mainline,” Intel Corporation, 5 June 2013. [Online]. Available: https://software.intel.com/en-us/articles/cilk-plus-array-notation-for-c-accepted-into-gcc-mainline. [Accessed 30 July 2015].
