Exploring performance of Phi co-processor

Mateusz Iwo Dubaniowski

August 21, 2015

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2015

Abstract

The project aims to explore the performance of the Intel Xeon Phi co-processor. We use various parallelisation and vectorisation methods to port an LU decomposition library to the co-processor. The popularity of accelerators and co-processors is growing due to their good energy efficiency characteristics and the large potential for further performance improvements. These two factors make co-processors well suited to drive innovation in high performance computing forwards, towards the next goal of Exascale-level computing. In response to increasing demand, Intel has delivered a co-processor architecture designed to fit the requirements of the HPC community, the Intel MIC architecture, of which the most prominent example is Intel Xeon Phi. The co-processor follows the many-core principle: it provides a large number of slower cores supplemented with vector processing units, thus forcing a high level of parallelisation upon its users.

LU factorisation is an operation on matrices used in many fields to solve systems of linear equations, invert matrices, and calculate matrix determinants. In this project we port an LU factorisation algorithm, which uses the Gaussian elimination method to perform the decomposition, to the Intel Xeon Phi co-processor. We use various parallelisation techniques including Intel LEO, OpenMP 4.0 pragmas, Intel Cilk array notation, and the ivdep pragma. Furthermore, we examine the effect of data transfer to the co-processor on the overall execution time.

The results obtained show that the best level of performance on Xeon Phi is achieved with the use of Intel Cilk array notation to vectorise the code and OpenMP 4.0 to parallelise it. Intel Cilk array notation, on average across the sparse and dense benchmark matrices, results in a speed-up of 27 times over the single-threaded performance of the host processor. The peak speed-up achieved with this method across the attempted benchmarks is 49 times the performance of a single thread of the host processor.

Contents

Chapter 1 Introduction ...... 1

1.1 Obstacles and diversions from the original plan ...... 4

1.2 Structure of the dissertation ...... 4

Chapter 2 Co-processors and accelerators in HPC ...... 6

2.1 Importance of energy efficiency in HPC ...... 6

2.2 Co-processors and the move to Exascale ...... 8

2.3 Intel Xeon Phi and other accelerators ...... 9

2.4 Related work ...... 10

Chapter 3 Intel MIC architecture...... 13

3.1 Architecture of Intel MICs ...... 13

3.2 Xeon Phi in EPCC and Hartree ...... 16

3.3 Xeon Phi programming tools ...... 17

3.4 Intel Xeon – host node...... 18

3.5 Knights Landing – the future of Xeon Phi ...... 18

Chapter 4 LU factorization – current implementation ...... 20

4.1 What is LU factorization? ...... 20

4.2 Applications of LU factorization ...... 21

4.3 Initial algorithm ...... 21

4.4 Matrix data structure...... 22

Chapter 5 Optimisation and parallelisation methods ...... 24

5.1 Intel “ivdep” pragma and compiler auto-vectorisation ...... 24

5.2 OpenMP 4.0 ...... 25

5.3 Intel Cilk array notation ...... 26

5.4 Offload models ...... 26

Chapter 6 Implementation of the solution...... 29

6.1 Initial profiling ...... 29

6.2 Parallelising the code ...... 30

6.2.1 Hotspots analysis ...... 31

6.3 Offloading the code ...... 32

6.4 Hinting vectorisation with ivdep ...... 33

6.4.1 Hotspots for further vectorisation ...... 34

6.5 Ensuring vectorisation with Intel Cilk and OpenMP ...... 34

6.5.1 Intel Cilk array notation ...... 35

6.5.2 OpenMP simd pragma ...... 35

Chapter 7 Benchmarking the solution ...... 37

7.1 Matrix format ...... 37

7.2 University of Florida sparse matrix collection ...... 38

7.3 Dense benchmarks ...... 39

7.4 Summary of benchmarks’ characteristics ...... 39

Chapter 8 Analysis of performance of Xeon Phi ...... 41

8.1 Collection of results ...... 41

8.2 Validation of the results ...... 42

8.3 Overview of results...... 43

8.4 Speed-up with different optimisation options...... 45

8.5 Native speed-up on Intel Xeon and on Intel Xeon Phi ...... 47

8.6 Offloading overhead ...... 49

8.7 Speed-up on the host with different optimisation options ...... 51

8.8 Running NICSLU on the host ...... 53

Chapter 9 Summary and conclusions ...... 54

9.1 Future work ...... 56

Bibliography ...... 57

List of Tables

Table 2-1: Overview of available co-processors and accelerators by vendor ...... 10

Table 3-1: Overview of versions of Intel Xeon Phi available ...... 16

Table 5-1: Outline of scheduling options available in OpenMP 4.0 [31] ...... 25

Table 5-2: Intel Cilk array notation example ...... 26

Table 6-1: gprof profile of running the LU factorization algorithm with ranmat4500 input on 4 host threads ...... 30

Table 6-2: Intel Cilk array notation use in lup_od_omp function ...... 35

Table 6-3: OpenMP simd pragma usage in lup_od_omp ...... 35

Table 7-1: Characteristics of benchmark matrices ...... 40

Table 8-1: Execution times (in seconds) of running benchmarks offloaded to Xeon Phi with different parallelisation methods ...... 44

Table 8-2: Speed-up values summary against single-threaded host execution time ...... 46

Table 8-3: Code snippets explaining performance difference between simd pragma and Intel Cilk array notation ...... 47

Table 8-4: Overview of performance improvements due to optimisation methods on the host – Intel Xeon ...... 52

Table 8-5: Results of running NICSLU on the host ...... 53

List of Figures

Figure 1-1: Performance of Top500 list systems over the past 11 years [2] ...... 2

Figure 2-1: Increasing share of co-processors/accelerators in the systems from Top500 list over the past 4 years...... 9

Figure 3-1: Simple outline of a single Intel MIC core ...... 14

Figure 3-2: More detailed outline of Intel MIC core ...... 15

Figure 3-3: Hartree’s Intel Xeon Phi racks ...... 17

Figure 4-1: Matrix data structure implementation ...... 23

Figure 5-1: Intel Xeon Phi execution models with Intel Xeon as the host processor [33] ...... 28

Figure 5-2: Intel Xeon Phi software stack [23] ...... 28

Figure 6-1: VTune threads utilisation in lup_od_omp function pre and post optimisation ...... 32

Figure 7-1: Sample matrix and its Matrix Market Exchange format representation ...... 38

Figure 8-1: Speed-up of benchmarks with different parallelisation techniques used .....45

Figure 8-2: Speed-up of Xeon Phi with varying number of threads (speed-up=1, when n=8) ...... 48

Figure 8-3: Offloading and execution times as percentages of the total execution time for OpenMP 4.0 simd pragma ...... 49

Figure 8-4: Offloading and execution times as percentages of the total execution time for Intel Cilk array notation ...... 50

Figure 8-5: Speed-up of benchmarks including offload time with different parallelisation techniques used ...... 51

Figure 8-6: Speed-up on the host – Intel Xeon – with different optimisation methods and benchmarks ...... 52

Acknowledgements

Writing this master thesis would not have been possible if not for the many remarkable people I have met during this year. With deep gratitude, I would like to thank my supervisor Adrian Jackson for his guidance, encouragement and thoughtful comments that made writing this work a real adventure. For motivation and support, I would like to thank my personal tutor Mark Bull. Further, I am grateful for being given the chance to take part in the HPC conference in Frankfurt, and I would like to thank all the people I encountered there, for it exposed me to a variety of ideas that have driven this work.

Furthermore, I would like to thank everyone at STFC – Hartree Centre for their help and continuing support with accessing Xeon Phis in Hartree.

And, of course, thanks to Pat for some proof-reading, submitting, and general support.

This dissertation makes use of the template provided by Dr. David Henty, which is based on the template created by Prof. Charles Duncan.

Chapter 1

Introduction

Co-processors and accelerators are fast becoming a hot topic in the high performance computing community. Their unique ability to achieve a high level of performance relative to their energy consumption attracts a significant amount of attention, and the amount of research devoted to designing, and subsequently utilising, these devices in the most efficient fashion is growing. Moreover, the performance of co-processors has not yet been fully exploited. There is no single formula for how to achieve the best performance on accelerators. In fact, there are widely varying opinions on how accelerators should be designed, and how they should be used, to achieve their goal of improving performance, both in terms of time and energy consumption.

The other aspect attracting growing attention to co-processors and alternative ways of boosting performance is the "power wall" of multiprocessors. Multi-core systems are beginning to experience a decrease in the rate of performance improvements. [1] These effects can be seen in Figure 1-1, which shows that the rate of performance growth of Top500 systems has dropped over the last 3 years. The rate of growth has dropped for the top of the list, the bottom of the list, and the sum of all systems on the Top500 list, and a large majority of Top500 systems primarily use multicore processors. A similar issue was experienced with single-core systems a decade ago, when frequency could no longer be increased due to energy consumption and cooling limitations. Co-processors provide an alternative way to achieve high performance through highly parallel, many-core architectures, while attaining a good level of energy efficiency per unit of computation. Moreover, many systems combine traditional CPU processors and co-processors in heterogeneous implementations.

The usage of co-processors is becoming increasingly prominent, with more and more systems employing accelerators to boost their performance. On the most recent Top500 list, released in July 2015, a list of the 500 best performing supercomputers in the world, accelerators are used in 88 of the 500 systems. This is an increase from 75 systems using accelerators or co-processors in November 2014, when the previous Top500 list was released. [2] This suggests that co-processors are being adopted by more and more systems in order to achieve the best performance. The two best performing systems on the July 2015 Top500 list both use accelerators to achieve their performance.


Figure 1-1: Performance of Top500 list systems over the past 11 years [2]

Similarly, the most energy efficient systems use accelerators or co-processors to achieve their energy consumption performance level. On the Green500 list released in June 2015, the 32 most energy efficient systems all use co-processors or accelerators. This is an increase of almost 40% over the November 2014 release of the Green500 list, where the top 23 systems employed accelerators or co-processors. Furthermore, in the November 2014 release of the Green500 list, a system broke the barrier of 5 GFLOPS/Watt for the first time; that system employed an accelerator to achieve this level of energy efficiency. In the June 2015 release of the Green500, the most energy efficient system, RIKEN's Shoubu, achieves more than 7 GFLOPS/Watt. [3]

Although the 32 most energy efficient systems use accelerators, only 88 systems on the whole of the June 2015 Green500 list utilise them. While these numbers are growing, the overall proportion of systems using accelerators is relatively small. Developers and scientists are often disinclined to use co-processors or

1 GFLOPS – GigaFLOPS – one billion floating point operations per second

heterogeneous architectures. The process of developing for, or porting software to, accelerators is complicated and not standardised. Additionally, these operations carry the risk of conflicting with optimisations and developments for regular CPU-only architectures.

The OpenMP 4.0 standard aims to unify the programming model of co-processors and heterogeneous systems by providing a standard set of pragmas for programming these devices. Furthermore, OpenMP pragmas can easily be switched off when compiling the code for a CPU-only system. Similarly, Intel provides utilities such as Intel Cilk to aid programming of its Intel MIC architecture devices, while AMD and other vendors support OpenACC, which, like OpenMP, is a set of pragmas designed to help with programming heterogeneous systems. Furthermore, Intel's new generation of Intel Xeon Phi co-processors will be self-bootable and compatible with Intel Xeon's ISA, so that the execution and programming of heterogeneous systems will become simpler.
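As a brief illustration (a hypothetical loop, not code from this project), an OpenMP 4.0 target region marks work for execution on an attached device, with map clauses describing the data transfers; if the pragmas are ignored, the same source still compiles and runs on a CPU-only system:

```c
/* OpenMP 4.0 device constructs: the map clauses describe which data is
 * copied to the co-processor and back, and the parallel for distributes
 * the loop across the device threads. Without OpenMP support the
 * pragmas are ignored and the loop simply runs on the host. */
void saxpy(float *y, const float *x, float a, int n)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```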

Intel Xeon Phi is Intel’s co-processor following the Intel MIC architecture. It is designed on a principle of many-core architectures, where a large number of simple cores are used to enable parallel execution of applications. It is in contrast to multi-core processors, where each chip aims to contain a lower number of more sophisticated cores. The motivations behind this approach are based on improved floating point instructions performance through a wider issue of these instructions, better memory utilisation of such systems, and lower energy consumption due to reduced clock frequency. Intel Xeon Phi co-processor is aimed primarily at scientific and research community, and is meant to meet the needs of these industries. This is in contrary to GPU3 systems, which often stem from multimedia and gaming industries, and thus might have different priorities. Consequently, Intel provides a wide range of tools for optimisation and development of scientific code with Intel Xeon Phi. These include the Intel MKL – a math operations library, or Intel Parallel Studio XE – a software development tool.

LU factorisation, or decomposition, is a mathematical operation on matrices which is widely used in solving systems of linear equations and in deriving the inverse or determinant of a matrix. These operations have wide industrial applications, among others in electronic design automation and signal processing. Since implementations of LU decomposition tend to contain many loops and operations on rows of matrices, LU factorisation is highly parallelisable and can potentially be vectorised.

The purpose of this project was to explore the performance of the Intel Xeon Phi co-processor. We aimed to analyse how the performance of Intel Xeon Phi varies when different parallelisation and vectorisation techniques are employed. This was

2 ISA – instruction set architecture

3 GPU – graphics processing unit

achieved by porting an LU factorisation library to Intel Xeon Phi using various models of execution.

1.1 Obstacles and diversions from the original plan

During the course of the project, we ran into several obstacles which we had considered during the project planning stage. Initially, we aimed to port the NICSLU library [4] to Intel Xeon Phi in order to explore the performance of the latter. NICSLU is a library that efficiently implements LU factorisation, aimed at electronic design automation applications.

However, during the course of the project it emerged that the original NICSLU code, implemented in C, is not compatible with the Intel Xeon Phi hardware. The original NICSLU code was written for the Intel386 architecture, which allows unaligned data accesses, whereas the K1OM architecture implemented by Xeon Phi does not allow improperly aligned accesses. The workspace part of NICSLU's application memory, used for performing operations on matrix arrays, is implemented with the use of unaligned accesses. It therefore became impossible to allocate the workspace memory on Xeon Phi in the same fashion as had been done originally.

The unaligned accesses to the workspace in NICSLU affect many different parts of the library, and changing the allocation method of the workspace proved to cause too many legacy inconsistencies throughout the library's code. Since the NICSLU code base is considerably large, we decided to move to a different LU factorisation implementation. The LU decomposition code used in this project is a combination of two implementations adapted by us for this project through the introduction of OpenMP directives. [5] [6] This allowed us to continue the project without major interruptions. The move to a different code base had been anticipated in the risk analysis at the project preparation stage, so we had a contingency plan in place. [7]

1.2 Structure of the dissertation

In this report, we present the work carried out throughout the dissertation project. From this point onwards the report is structured in the following manner.

Initially, in Chapter 2, we introduce the concept of co-processors and accelerators. We consider these in detail in the context of high performance computing. We mention energy efficiency, the move to Exascale, Intel Xeon Phi and compare Xeon Phi to other accelerators.

Subsequently, in Chapter 3, we describe the Intel MIC architecture, explain the systems we used throughout the project, and outline the future of Xeon Phi, Knights Landing.

Then, in Chapter 4, we introduce LU factorisation. We describe its applications and the algorithm used to perform the computation. Moreover, we explain the data structure used to store matrices.

In Chapter 5, we introduce the methods used to parallelise and vectorise the solution, such as OpenMP pragmas and Intel Cilk array notation.

In Chapter 6, we describe the implementation of the solution and how we went about parallelising the code.

In Chapter 7, we outline the benchmarks used to compare different parallelisation and vectorisation methods.

In Chapter 8, we present and analyse the results obtained after running the benchmarks on Xeon Phi with different parallelisation and vectorisation options. Furthermore, we outline the speed-up and show the impact of the offloading on the execution time.

Finally, in Chapter 9, we present conclusions and summary. We introduce ideas for future work that could be completed if the project continued beyond the prescribed timeframe.

Chapter 2

Co-processors and accelerators in HPC

Co-processors and accelerators are thought of as the future of high performance computing. Their prevalence is growing, and with recent advancements in technology they have become a very important avenue for improving the performance of HPC systems. This should come as little surprise given the current state of the supercomputing industry: the limits on single thread performance force researchers and industrial institutions to focus on developing highly parallel systems. Co-processors meet this demand by providing substantial parallelism to achieve high levels of performance. This chapter establishes the position of this dissertation within the wider context of the field of high performance computing.

We examine the motivations and background behind investigating co-processors, and Intel Xeon Phi in particular. We outline the aspects of energy efficiency and how accelerators contribute to the move to Exascale computing. The move to Exascale is widely considered to be the next step in the evolution of HPC systems. Moreover, we describe recent advancements in co-processors and how they shape the landscape of high performance computing. We present the major features of accelerators that contribute to their growing popularity, and a selection of accelerators and co-processors widely used in modern systems. Finally, we perform a brief review of similar and related works on exploring the performance of accelerators.

2.1 Importance of energy efficiency in HPC

The continued progress in the performance of HPC systems is undoubtedly predicated on their energy efficiency. The ability to perform operations while consuming an amount of energy acceptable to the operators is central to high performance computing. Systems constantly need to meet the demand for higher speed and more operations performed every second, and this goal naturally stands in contradiction to lower energy consumption. The ability to sustain the current trend of increasing performance while keeping energy consumption reasonable is the "holy grail" of high performance computing.

The appearance of the "power wall" over the past decade has forced research to focus on parallelisation rather than on further increases in frequency or decreases in the size of chips. Chip designers were no longer able to provide faster computation solely through increases in frequency, because the energy consumed by, and required to cool, such chips was

4 Exascale – computing systems able to achieve a performance of at least one exaFLOPS, that is, at least 10^18 floating point operations per second.

becoming unsustainable. This has resulted in the emergence of parallel architectures in all fields of science and at all levels of computer architecture. In the June 2015 issue of the Top500 list, over 96% of systems use chip multiprocessors with more than 6 cores. [2]

Currently, however, the rate of performance improvements in new multiprocessor developments is decreasing. This suggests that we are approaching the point where multiprocessor technologies hit the "power wall". The benefits of placing additional cores on a single chip will become outweighed by the large energy consumption of these chips and by memory bottlenecks. Consequently, further improvements in chip multiprocessors will gradually become more and more difficult to sustain. This especially applies to large-scale systems such as those used in high performance computing: these systems consist of many multiprocessors, and so each processor within such a system needs to be as energy efficient as possible.

The importance of energy consumption in HPC systems can be seen in various places. For instance, almost a third of the energy consumed by the University of Edinburgh is used by ARCHER, the UK's national supercomputer, which is maintained by the University. [8] Consequently, various clever and sustainable cooling techniques are used throughout HPC systems to ensure that energy efficiency remains high.

The above examples show that developments in the field of HPC depend largely on decreasing the energy consumption of HPC systems. The large amount of energy that supercomputers use needs to be reduced in order for supercomputers to be sustainable in the future. Traditional multiprocessors are seen as no longer being able to keep up with both demands: better performance and better energy efficiency.

Co-processors, however, present good power efficiency characteristics and at the same time have a high theoretical peak performance. As a result, these chips have a low energy consumption per unit of computation, expressed as a high FLOPS per Watt. What is more, co-processor performance has not been fully explored and optimised, and there is still plenty of room for further research. Co-processors can therefore be used to deliver the highly desired energy efficiency to HPC systems.

However, co-processors are not ideal for all purposes. Applications which are not highly parallelisable and do not scale well will not benefit from accelerators such as Intel's Xeon Phi. Similarly, if very good single thread performance is required, the co-processor might not be able to provide it; Xeon Phi's single thread performance is below currently accepted standards. Furthermore, many accelerators do not have sufficient functionality to run all programs (for instance, GPUs cannot run standard operating systems or business applications). Similarly, the in-order instruction execution inherent in Xeon Phi is an obstacle to using the co-processor with some

5 FLOPS – floating point operations per second

applications that rely on out-of-order execution for their performance. Consequently, there exist areas of computing which might not benefit from the use of accelerators.

2.2 Co-processors and the move to Exascale

Energy consumption is one of the major issues in the move to Exascale. The move to Exascale is widely considered to be the next step, the future of high performance computing. As such, the future of HPC is highly dependent on the ability to decrease energy consumption of processors. This is required so that an exascale computer, once commissioned, will be economically sustainable to maintain.

The need for improved energy efficiency requires systems to use much less energy while at the same time providing room for performance enhancements. As mentioned in the previous section, these characteristics are similar to those of co-processors. Co-processors have a high energy efficiency per operation, since their operating frequency is not as high as that of regular processors, yet they have good peak performance due to their high level of parallelism. As a result, they have a very good FLOPS/Watt ratio. Consequently, the ability to utilise the energy efficiency potential of co-processors is conditional upon developing algorithms and programming methodologies that can take advantage of their performance.

Research into the efficient utilisation of co-processors is needed in order to find methods that enable the users of these systems to program them efficiently. This is especially relevant since future exascale systems will likely contain many co-processors. There are countless methods and libraries provided for programming co-processors, including OpenMP 4.0 pragmas, OpenACC, CUDA, Intel Cilk, and Intel LEO. However, no single standard has been established that is widely accepted and recognised as the method of choice for programming accelerators. Therefore, it is valuable to explore the space of these parallelisation methods with regard to various applications, in order to understand which methods provide the best performance for those applications.

6 Intel LEO – Intel Language Extensions for Offload


Figure 2-1: Increasing share of co-processors/accelerators in the systems from Top500 list over the past 4 years

Four of the top ten supercomputers in the world according to the Top500 list use co-processors to achieve their high level of performance. In Figure 2-1, we can also see the growth in the share of co-processors on the Top500 list. Although systems utilising accelerators are still in the minority, we clearly notice a trend of substantial growth in their share among the top HPC systems: the share has increased by a factor of 1.7 since June 2013. Therefore, it becomes necessary to understand well the programming models these co-processors use. A better understanding would allow us to make better use of HPC systems, which are increasingly based on co-processors. The top two systems on the Top500 list both use co-processors to boost their computational power. [2] Given the rate of growth of the number of accelerators on the Top500 list, it is likely that the first Exascale computer will use co-processors to reach the Exascale performance level.

2.3 Intel Xeon Phi and other accelerators

There are various different implementations of accelerators and co-processors, and more and more HPC systems make use of them to accelerate their performance. Vendors produce different architectures suited to accelerating different applications.

Historically, accelerators, or as some prefer to call them, co-processors, have been used to deal with a particular application for which they were specialised. The architecture of accelerators was developed in a way that enabled them to deal with one particular task, so that the main processor did not have to execute it. Consequently, the main CPU saves the time it would otherwise spend executing code for which it was not optimised. Accelerators were primarily developed for multimedia applications, such as GPUs for graphics processing or audio cards for audio signal processing. The potential of having additional computational power in the system has, however, since been discovered by the HPC community, especially considering the rise of parallelisation throughout modern systems. Since many scientific applications exhibit parallelism, having highly parallel SIMD co-processors available allows scientists to offload the computation of these parts of the code.

GPUs often consist of huge arrays of simple processing units capable of performing simple instructions. The code is propagated (streamed) through these units to arrive, at the end, at the final result. The code executed by each unit is called, in CUDA terminology, a "kernel". GPUs are often optimised for floating point instructions. Due to an architecture that maximises the area of the chip devoted to computation, GPUs are able to perform many hundreds of floating point instructions in lockstep in a stream. This allows GPUs to perform many floating point operations in parallel, while sacrificing branch and control instructions. As a result, GPUs can achieve a very high level of floating point performance, which is often desired not only in graphics processing but also in many scientific applications.

Currently, offloading to a GPU or a dedicated accelerator is becoming an increasingly widespread procedure across various applications of HPC. The most popular accelerators used in modern HPC systems from the Top500 list are shown in Table 2-1, with their respective vendors.

Table 2-1: Overview of available co-processors and accelerators by vendor

Vendor    Co-processor or accelerator
AMD       FirePro S9150, S9050
Intel     Xeon Phi (Knights Corner)
NVIDIA    K20, K40, K80

We can see in Table 2-1 the main vendors supplying accelerators: Intel, NVIDIA, and AMD. They provide their own systems and methods for programming these accelerators, such as Intel Cilk and LEO from Intel, or NVIDIA's CUDA. However, there is also a wide range of open environments available, with OpenMP, OpenACC, and OpenCL being the three main examples. These accelerators are constantly being developed into newer, higher performing generations. In this project, we focus on the Intel Xeon Phi co-processor; we present the architecture of the system in Chapter 3, where we also show the future enhancements to be included in the new version of the accelerator, due to be released in the autumn of 2015.

2.4 Related work

There is significant ongoing activity in researching the performance of co-processors and accelerators. The exploration of accelerator performance is considered to influence the design of future HPC systems, which tend to make increasing use of accelerators. Consequently, there is a growing need for the exploration of their

7 SIMD – Single Instruction Multiple Data

performance. Research activity in the field of accelerators focuses both on establishing the most beneficial architectures for accelerators and on finding the most efficient methods of implementing solutions on them. Furthermore, there is ongoing research into the applications that could benefit from being offloaded, and into localising the parts of applications that are suitable candidates for offloading. Research into different aspects of accelerators has also featured in many EPCC dissertations over the past years. [9] [10]

In this project, we attempt to port an LU factorization application to Xeon Phi and subsequently explore its performance using various parallelisation, vectorisation, and offloading methods. Similar research has been conducted in porting climate dynamics [11], astrophysics [12], and LU factorization algorithms [13] [14] [15]. These works primarily aim to explore the performance of Xeon Phi by optimising code for it, and they compare the performance on Xeon Phi to that of other accelerators in order to understand which accelerators offer the best performance for the respective applications.

There is plenty of research carried out into exploring kernels that could be offloaded to accelerators and GPUs. [16] [17] [18] These works focus on kernels that can be offloaded and ported to co-processors, and on exploring their performance. There exists a set of problems which are considered to be highly parallelisable and of significant importance to exploring the performance of co-processors. These problems, called the Berkeley dwarfs, are often used in benchmarking co-processors. [17] The Berkeley dwarfs consist of 13 problem categories considered to be of significant importance to the high performance computing community. These problems are highly scalable and can be expressed as computational kernels; consequently, they are often used to benchmark HPC solutions. LU factorization is one of the Berkeley dwarfs. [17]

GPUs are often used to implement highly parallel problems. Their ability to exploit large amounts of data-level parallelism is used in various applications that exhibit such parallelism. To utilise this parallelism, the process of mapping applications to the GPU hardware is extremely important. One example of this is the work on targetDP [17]. This work highlights the benefits of abstracting hardware issues away from the programmer, so that programmers do not spend excessive time developing porting patterns instead of solving the actual problem at hand. The emergence of many different GPUs and accelerators has made this problem even more prevalent, since different systems require different porting methods. Through porting a complex fluid lattice Boltzmann application, Ludwig, to GPUs, it is possible to demonstrate performance improvements from using GPUs in the computation process. The targetDP abstraction layer targets the data-parallel hardware of different platforms in order to enable applications to take better advantage of accelerators by abstracting memory, task, thread, and instruction level parallelism. The abstraction simplifies the process of porting the code and provides good performance for the application. [19] Such methods of abstracting parallelism emerge as a response to increasingly complex underlying GPU hardware.

Similar issues exist with porting code to Xeon Phi. The placement of threads on the cores of Xeon Phi becomes crucial to exploiting its performance. A case study of porting the CP2K code to Xeon Phi by Iain Bethune and Fiona Reid explored some of the issues emerging in code ported to Xeon Phi. [20] [21] Porting the code to Xeon Phi proved to be relatively easy when the source is already parallelised. However, the work showed that efficient placement of threads on cores is important to ensure good performance, and that finding enough parallelism to fill the threads sufficiently is a significant issue. Similarly, it confirmed that complex functions and calculations perform worse on Xeon Phi than on modern host CPU nodes. Overall, the ported CP2K without additional optimisations ran 4 times slower on Xeon Phi than on 16 cores of an Intel Xeon E5-2670 host node. After further optimisations, CP2K achieved a similar level of performance to the host node.

Finally, there is ongoing work comparing the various methods available for optimising and parallelising code on accelerators and GPUs. A comparison of various methods for porting code to GPUs on the XK7 was performed by Berry, Schuchart and Henschel of Indiana University and Oak Ridge National Laboratory. [22] They focus on analysing the benefits and issues of different methods when porting a molecular dynamics library to the NVIDIA Kepler K20 GPU, comparing CUDA, OpenACC, and OpenMP+MPI implementations. In the molecular dynamics application, the use of OpenACC on the GPU resulted in a speed-up factor of 13 over 32 OpenMP threads run on an AMD Interlagos 6276 CPU.

The above examples of work undertaken in the field of accelerators and GPUs show that there is significant potential in utilising these devices. We can see performance improvements due to their introduction, and HPC systems can clearly benefit from them. Furthermore, we notice that accelerators, including Xeon Phi, are not yet fully optimised and their optimal programming models are far from defined. Therefore, ongoing research in this area is important to better understand the nature of programming such devices and what can be achieved with them. Moreover, not all applications benefit equally from the use of accelerators, and research is being carried out to determine the scope of potential beneficiaries of co-processor hardware.

Chapter 3

Intel MIC architecture

The architecture of Xeon Phi is described in this chapter. We outline the co-processor’s programming models. Furthermore, different versions of Xeon Phi are discussed, and the future releases of the co-processor are mentioned. The developments of Xeon Phi are presented in order to relate current results to future versions of Xeon Phi. Moreover, we summarise some of the tools used with Xeon Phi to aid with porting the solution to the hardware.

3.1 Architecture of Intel MICs

In Chapter 2, we discussed the general architectures of co-processors and accelerators, and explained the ideas guiding their development. In this section, we focus specifically on the Intel MIC architecture. Intel MIC stands for Intel Many Integrated Core architecture. It is a new generation of many-core systems introduced by Intel to compete with accelerators and GPUs in the field of high performance computing. These systems introduce the idea of combining many older, but fully functional, cores connected by a ring network to accelerate computations.

Intel Xeon Phi, which is an example of the Intel MIC architecture, consists of up to 61 cores. Each core is supplemented with a wide vector processing unit (VPU), whose main purpose is to enable vectorisation on a large scale. It carries over the idea from GPUs, where significant parallelism is introduced through SIMD instructions. Consequently, the VPUs in Intel MICs are designed to handle the data-level parallelism achieved through vectorisation. The vector processing units on Intel MIC are 512 bits wide, which corresponds to performing up to 16 single precision or 8 double precision operations on a VPU simultaneously. Each VPU is supplemented with 32 512-bit vector registers. The large number of available VPUs and vector registers needs to be utilised in order to achieve efficiency on Intel Xeon Phi; thorough vectorisation is critical to achieving a high level of performance on the co-processor.

Similarly, higher-level parallelisation is desired too, since the individual scalar cores of Xeon Phi are not implemented with a state of the art architecture. In fact, the scalar unit design in Xeon Phi is taken from the P54C (Pentium) architectural design, which does not run at an up-to-date clock frequency. As a result, it cannot achieve the speed of modern processor cores in terms of single-threaded computation. It therefore becomes necessary to utilise the parallelism available between cores to achieve performance on Xeon Phi; single core, single-threaded execution on Xeon Phi is not competitive in speed with modern CPUs.

Each core of Intel MIC supports multithreading: up to four threads can run per core at any time. In the Intel MIC architecture it is impossible to issue instructions from the same thread back-to-back into the same functional unit. Therefore, to fully utilise the scalar units of Intel MIC, at least two threads need to run per core, and to increase the chances of the functional units being fully utilised we should aim for four threads per core.

Each core is supplemented with two levels of cache. The caches are fully coherent, and the level 2 cache of each core is connected in a ring interconnect with the L2 caches of the other cores of the Xeon Phi. The level 1 cache is divided into two 32-kilobyte parts, an instruction cache and a data cache, with a 3-cycle access latency, while the level 2 cache is unified and stores 512 kilobytes of data per core. The level 2 caches are joined by a bidirectional ring interconnect to form a large common level 2 cache accessible by all the cores, with latencies from 11 cycles upwards.

Intel MIC communicates with the host node over the PCI Express (PCIe) bus. The most common programming model involves work being offloaded from the host node to Xeon Phi, with data transferred across the PCIe bus to the co-processor and back.
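As an illustration of this offload model (a minimal sketch using Intel LEO; the function and variable names are hypothetical rather than taken from the project code), a region can be marked for execution on the co-processor, with clauses controlling which arrays are copied across the PCIe bus:

```c
/* Intel LEO (Language Extensions for Offload) sketch: the in/out
 * clauses describe the data transferred over PCIe to the co-processor
 * and back. If no co-processor is available, the Intel compiler can
 * fall back to executing the region on the host. */
void scale_on_mic(double *a, double *b, int n)
{
    #pragma offload target(mic:0) in(b : length(n)) out(a : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
    }
}
```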

The memory available to Xeon Phi is up to 16GB of GDDR5 memory, distributed over 16 memory channels.

In Figure 3-1 [23], we can see an outline of a single Intel MIC core, showing the scalar unit and the vector unit present in each core. The figure also shows how the caches are structured and connected to the ring interconnect, which runs across all the cores.

Figure 3-1: Simple outline of a single Intel MIC core

In Figure 3-2 [24], we can see a more detailed outline of a single Intel MIC core. It shows the four threads that can issue instructions, the 512-bit VPU, the instruction and data caches, as well as the dual issue of certain scalar instructions.

Figure 3-2: More detailed outline of Intel MIC core

The architecture of Intel MIC is innovative and unique in the way it aims to reduce the size of the computational segment of the die, thus making space for memory and bringing memory closer to the computation. We can notice in Figure 3-2 that the computational logic specific to the architecture takes up less than 2% of the total area of the die.

This is an upcoming trend in the field of high performance computing, and in computing in general. As advancements in computer architecture have progressed, the ALUs often occupy less than 15% of the chip [25]; most of the area is devoted to addressing efficient memory access. By reducing the significance of each computational unit to the overall system and distributing them across a larger area, we gain the possibility of introducing more very fast access memories, such as L1 and L2 caches. These, in turn, help to speed up highly parallelised applications. This follows, and extends to the field of computer architecture, the well-established principle, responsible among others for the MapReduce framework, which states that it is cheaper to bring computation to data than data to computation.

8 ALU – arithmetic and logic unit

3.2 Xeon Phi in EPCC and Hartree

Throughout the project we used the Intel Xeon Phi nodes at EPCC, a system called hydra, as well as those at Hartree. Hydra has two Intel MIC cards available, both connected to one host node with 2 sockets of 8-core Intel Xeon processors. Hartree, on the other hand, has 42 Intel Xeon Phis available, each connected to its own host; the Hartree host nodes each consist of 2 sockets of 12-core Intel Xeon processors.

Both the Hartree and EPCC systems use the same version of the Intel MIC architecture. Both employ the 60-core version of the Intel Xeon Phi co-processor, the Intel Xeon Phi 5110P from the 5100 series. This version of Xeon Phi has a base operating frequency of 1.053 GHz and can service up to 8GB of memory over 16 channels with a bandwidth of 320 GB/s. Intel Xeon Phi 5110P is manufactured in 22nm technology, and the total amount of L2 cache available to the co-processor is 30MB. In Table 3-1, we can see the details of the different series of Intel Xeon Phi currently available on the market, including the 5100 series, which is available at EPCC and Hartree and was used in this project.

Although the versions of Intel Xeon Phi at EPCC and Hartree are the same, their toolchains vary slightly, with different compiler versions installed on the two systems. EPCC has Intel C compiler version 15.0.2, which was used to compile the code for Xeon Phi on hydra. Hartree's version of the compiler is 15.0.1, and it was used to compile the programs on Hartree.

Table 3-1: Overview of versions of Intel Xeon Phi available

Xeon Phi version   No. of cores   Base frequency   Maximum memory size   Maximum memory bandwidth   TDP
3100 series        57             1.1 GHz          6 GB                  240 GB/s                   300 W
5100 series        60             1.053 GHz        8 GB                  320-352 GB/s               225-245 W
7100 series        61             1.238 GHz        16 GB                 352 GB/s                   270-300 W

In Figure 3-3, we can see the Phase-2 Wonder iDataPlex system, to which the Xeon Phi nodes are connected, at the Hartree Centre in Daresbury Laboratory.

9 TDP – thermal design power is the average power, in watts, the processor dissipates when operating at a base frequency with all cores active under an Intel-defined, high-complexity workload.


Figure 3-3: Hartree’s Intel Xeon Phi racks

3.3 Xeon Phi programming tools

There exists a large array of tools aiming to help with programming specifically for many-core systems such as Intel MIC. These tools help to understand how well the cores are utilised and which parts of the program can undergo further optimisation in order to maximise vectorisation or parallelisation. Since efficient parallelisation and vectorisation are key requirements for good performance on Intel Xeon Phi, these tools are especially beneficial. The tools aimed at the Intel MIC architecture used throughout this project include:

- Intel VTune Amplifier – helps with parallelisation and the utilisation of cores
- Intel Vectorization Advisor – helps with vectorisation and improving its efficiency
- Allinea DDT – a debugger for multi-core and multi-threaded systems

3.4 Intel Xeon – host node

Since we compare the results to the performance of Intel Xeon processors, in this section we present the details of the Intel Xeon processor used in this project as the host node for Xeon Phi. The host node we used for benchmarking and comparison with Intel Xeon Phi is the host node of hydra. This is an Intel Xeon system with 2 sockets, each with 8 cores and one thread per core, giving a 16-thread system. Therefore, the maximum number of threads we used on the host node was 16.

The Intel Xeon on hydra has 32K each of data and instruction level 1 cache, 256K of level 2 cache, and 20480K of level 3 cache. It is the Intel Xeon E5-2650 version of the processor. The cores run at a frequency of 2 GHz, in comparison to Intel Xeon Phi's frequency of 1.053 GHz.

The above shows the different characteristics of the Intel Xeon Phi and Intel Xeon processors. We note the large difference in operating frequency, in the number of threads per core available on each architecture, and in the total number of cores. These specifications also outline the basic differences between multi-core and many-core processors: the former is represented by the Intel Xeon on the hydra host node, while the latter is represented by Intel Xeon Phi.

3.5 Knights Landing – the future of Xeon Phi

The version of Intel Xeon Phi on which this project was completed is codenamed Knights Corner (KNC). Towards the end of 2015, a new version of the Intel MIC architecture, and consequently a new version of Intel Xeon Phi, codenamed Knights Landing (KNL), is due to be released. [26] The new version of Xeon Phi will introduce significant developments, which will extend the concept of the many-core architecture to a larger set of problems. This shows that the MIC architecture is being adopted as the way to proceed with increasing the performance of processors and computing systems in general.

KNL, in contrast to KNC, will be self-bootable and compatible with Intel x86 executables. This means that the new version of Xeon Phi will be able to execute the same machine code as Intel Xeon processors, and potentially both can use the same compilers. This development comes, among other reasons, from the observation that optimising for Xeon Phi has been shown to improve performance on the host node as well, something we also experienced throughout the course of this project: optimising code for execution on Xeon Phi often resulted in faster execution on Intel Xeon too. Furthermore, the ability of KNL to self-boot means that it will be able to run without being connected to a host. It will be self-standing and will not require a host to run, so the costs associated with offloading could potentially disappear. Systems including only Xeon Phi processors will also become possible once the need for a host is removed.

Moreover, KNL will have up to 72 cores, up from 61, which was the maximum for KNC. It will introduce the ability to schedule instructions from the same thread back-to-back, so that the utilisation of multithreading will improve. Each core will have two vector processing units instead of one. Furthermore, there is a planned change of interconnect from the current ring to a 2D mesh. KNL will also contain high performance, high-speed memory on the chip, which furthers the principle of bringing data and computation closer together. These are the most significant changes implemented in the new version of the Intel MIC architecture.

In principle, the above changes mean that KNL will not be a co-processor but a standalone many-core processor that can efficiently run parallelised code on its own. It will still have much slower single-threaded performance than regular CPUs, but it will be possible for Xeon Phi to run as the main processor. This signifies a large step in the development of many-core systems and will have a noteworthy impact on the HPC community. The design of systems and the guiding principles for programming them will need to shift in order to utilise many-core systems efficiently, and the need for parallelism will be re-emphasised. Similarly, we might begin to see commercial and home user processors shift towards architectures more closely resembling many-core systems.

Chapter 4

LU factorization – current implementation

In this chapter we explain what lower upper (LU) factorization is, what its main applications are, and why it is significant. Furthermore, we describe the Gaussian elimination method, the algorithm used throughout this project to perform LU factorization. Finally, we explain the data structure used to represent the matrices in memory throughout the computation. LU factorization is interchangeably called LU decomposition in the literature and throughout this report.

4.1 What is LU factorization?

LU factorization is a method of decomposing a single matrix into two matrices. In doing so, we express the original matrix as a product of two triangular matrices: an upper triangular matrix and a lower triangular matrix. This means that the upper triangular matrix has only zero entries below the main diagonal, while the lower triangular matrix has only zero entries above it.

Below is the mathematical representation and an example of LU factorization.

$$A = LU$$

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}$$

In the above equations, we can also see examples of the lower triangular and upper triangular matrices.

In order to ensure that the matrix can be factorized, it might be necessary to reorder its rows. In this case the decomposition is called LU factorization with partial pivoting: we rearrange the rows of the matrix in such a way that the factorization can proceed. The LU factorization then has the following mathematical representation:

$$PA = LU$$

where P is the pivoting (permutation) matrix, which ensures that the rows are ordered such that the LU factorization exists. In this form, an LU decomposition exists for any square matrix.

4.2 Applications of LU factorization

Lower upper decomposition is widely used in algorithms and mathematics. It is used in various optimisation problems because it is an efficient method of solving linear equations: having derived the LU factorization, it is easy to subsequently solve a linear system based on that decomposition. Similarly, LU factorization is often used to derive the inverse of a matrix quickly, especially for larger matrices, and the computation of a matrix determinant can be sped up with its use. These matrix operations gain significant performance improvements for larger matrices from the use of the LU factorization products. There also exist many algorithms which specialise in performing LU decomposition on particular kinds of matrices, or on matrices derived from particular problems. There are plenty of industrial applications in which LU factorization is used; a small subset of these is presented in this section.
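As a concrete illustration of the linear-solver use case (standard textbook reasoning rather than anything specific to this project's code), once $PA = LU$ is available, a system $Ax = b$ reduces to two triangular solves:

$$Ax = b \;\Longrightarrow\; LUx = Pb: \quad \text{solve } Ly = Pb \text{ by forward substitution, then } Ux = y \text{ by back substitution.}$$

Each triangular solve costs on the order of $n^2$ operations, compared with $n^3$ for the factorisation itself, so the same L and U factors can be reused cheaply for many right-hand sides.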

LU decomposition is used in EDA to aid with the placement and routing of components on circuit boards. Since components and the distances between them are often modelled with matrices and linear algebra, using LU factorization is often beneficial. Moreover, LU decomposition is used to simulate such circuits in order to predict their behaviour before the chips are produced. The manufacturing process of chips is very costly, and computer simulations help to minimise these costs by optimising the testing and design stages. LU factorization forms a significant part of this process.

Furthermore, LU factorization is used in climate simulations, machine learning, and virtually any application that involves linear algebra in its computation. For example, it is used in computing the singular value decomposition, which in turn is widely used in signal and data processing. As a result, the computation of LU decomposition is widely used in many fields and applicable to many real-world problems.

4.3 Initial algorithm

In this section, we describe the initial structure and outline of the algorithm. In the next section, we mention the data structures used in the algorithm to represent matrices.

Initially, the implemented algorithm was a simple version of the LU factorization algorithm using the Gaussian elimination method to factorize matrices. Gaussian elimination uses basic matrix transformations to remove non-zero entries from the relevant part of the matrix. Elementary row transformations are performed until the matrix is transformed into an upper triangular matrix. At the same time, the algorithm creates a unit lower

10 EDA – electronic design automation

triangular matrix. This method of LU factorization is called the Doolittle method. [27] The obtained triangular matrices are the result of the LU factorization. [28]

The operations required to perform Gaussian elimination consist of multiple nested loops and use an additional function for matrix multiplication in order to pivot the matrix. An "if statement" ensures numerical stability by stopping the algorithm from performing further calculations on critically small numbers. The general core idea behind the algorithm has not changed throughout the implementation; we mainly introduced optimisations, refactoring, and pragmas in order to observe how the performance changes as a result of these modifications combined with offloading to Xeon Phi. These changes constitute the exploration space of Xeon Phi performance in the context of this project.
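The following is a minimal sketch of Doolittle-style Gaussian elimination with partial pivoting, included only to make the shape of the computation concrete. It is not the project's code (which, as noted above, pivots via a matrix multiplication routine rather than by swapping row pointers), and the TINY threshold is an assumed stand-in for the numerical-stability check:

```c
#include <math.h>

#define TINY 1.0e-12   /* assumed stability threshold, illustrative only */

/* In-place LU factorisation with partial pivoting of an n x n matrix a.
 * On exit, the upper triangle of a holds U and the entries below the
 * diagonal hold the multipliers of the unit lower triangular L; p
 * records the row ordering. Returns 0 on success, -1 if a pivot is
 * critically small. */
int lu_factorise(double **a, int *p, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = i;

    for (int k = 0; k < n; k++) {
        /* Partial pivoting: pick the largest entry in column k. */
        int piv = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i][k]) > fabs(a[piv][k]))
                piv = i;
        if (fabs(a[piv][k]) < TINY)
            return -1;                      /* critically small pivot */
        if (piv != k) {
            double *trow = a[k]; a[k] = a[piv]; a[piv] = trow;
            int tidx = p[k]; p[k] = p[piv]; p[piv] = tidx;
        }

        /* Eliminate the entries below the diagonal in column k. */
        for (int i = k + 1; i < n; i++) {
            a[i][k] /= a[k][k];
            for (int j = k + 1; j < n; j++)
                a[i][j] -= a[i][k] * a[k][j];
        }
    }
    return 0;
}
```

Loops of this shape, iterating over the rows and columns of the matrix, are the kind of code that the later chapters parallelise across threads and vectorise along the innermost loop.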

4.4 Matrix data structure

This section explains the data structure used to store matrices in the application implementing LU factorization. Matrices are implemented as two-dimensional arrays. In order to ensure spatial locality for efficient cache accesses, and to improve the ability to vectorise, the arrays are constructed in two layers.

First, a single large one-dimensional array is created, which contains the whole matrix. Then an array of pointers to the first element of each row within that large array is created. This is shown in Figure 4-1: the whole two-dimensional array is assigned, with one malloc call, to a[0], and a one-dimensional array of size n is then filled with pointers into a[0] at intervals corresponding to the number of columns.

We therefore end up with an array of pointers to the relevant parts of the single large array that contains the actual matrix. The orange arrows in Figure 4-1 represent the malloc calls for each variable, while the blue arrows show where each pointer points.

11 Unit lower triangular matrix – a lower triangular matrix with “1”s on the diagonal


Figure 4-1: Matrix data structure implementation

The data structure presented in Figure 4-1 results in the matrix being stored contiguously, as one long row, in memory. Therefore, spatial locality is emphasised and preserved. Consequently, vectorisation is easy to detect and implement.

Chapter 5

Optimisation and parallelisation methods

In this chapter we describe the techniques attempted to optimise the solution for the Intel Xeon Phi co-processor. These include techniques used to improve the speed of execution of the program in general, as well as specific features of the programming languages that improved the performance. Additionally, several methods of parallelising the code were attempted, including pragmas and auto-vectorisation performed by the compiler.

We discuss in this chapter the methods we attempted and analysed throughout the project in order to improve the execution of the program. The variations between the different methods enable us to explore the performance of Intel Xeon Phi. The methods attempted throughout this project include OpenMP 4.0 pragmas, Intel Cilk array notation, and the ivdep vectorisation pragma.

Furthermore, we discuss the compiler options and associated optimisations which were attempted in the process of improving the performance of executing the code on Xeon Phi.

In the next chapter, we show how techniques described in this chapter are applied to LU factorization code in order to explore the performance of Xeon Phi.

5.1 Intel “ivdep” pragma and compiler auto-vectorisation

The Intel ivdep pragma is a non-binding pragma used in the Intel compiler to aid the process of auto-vectorisation. It prevents the compiler from treating assumed dependencies as proven dependencies [29]. Thus, it helps the compiler to vectorise the code.

Assumed dependencies arise from variables that are independent of the loop index. Because such variables do not depend on the loop index, dependencies on them could exist between loop iterations, and the compiler conservatively assumes that they do. The ivdep pragma can be used to state that such "assumed" dependencies are not in fact dependencies at all, so that the loop can be safely vectorised by the compiler. This aids the compiler's auto-vectorisation. The compiler is more inclined to vectorise loops preceded by the ivdep pragma; however, there are other kinds of dependencies and obstacles to vectorisation which the ivdep pragma cannot overcome.
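For illustration, a hedged sketch of the kind of loop in question is shown below; the offset m is a hypothetical runtime value, so the compiler must assume that x[i] and x[i + m] might overlap between iterations unless told otherwise:

/* Hypothetical example: because m is only known at run time, the compiler
 * assumes a dependency between x[i] and x[i + m] across iterations and may
 * refuse to vectorise. The ivdep pragma discards this assumed dependency. */
void shift_add(double *x, int n, int m)
{
#pragma ivdep
    for (int i = 0; i < n; i++)
        x[i] = x[i + m] + 1.0;
}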

The ivdep pragma is available in the Intel compilers; gcc also provides its own version of the pragma.

5.2 OpenMP 4.0

OpenMP [30] is a standard used across industry and academia to aid parallelising code with the use of pragmas. It is an API for shared memory programming using the C/C++ or Fortran programming languages. It provides a set of pragmas, directives and library routines, which are used to specify parallelism in applications. The standard OpenMP pragmas include parallelisation directives, which distribute work among threads in multithreaded systems. These can be used to distribute different tasks to different processors, but equally they allow the iterations of a single for-loop to be distributed among threads. This is done with the parallel for pragma presented below:

#pragma omp parallel for

Additionally, OpenMP provides a set of schedulers for the distribution of loop iterations between threads. The schedules allow the workload to be distributed appropriately, depending on the application and the work performed by each particular iteration. As a result, the scheduling options help to balance the workload between the threads of the system. This is an important feature of OpenMP, which we exploit to optimise the performance of the code on Xeon Phi. The different scheduling options available in OpenMP and their descriptions are presented in Table 5-1.

Table 5-1: Outline of scheduling options available in OpenMP 4.0 [31]

Scheduling option clause    Description

static      Divide the loop into equal-sized chunks, or as equal as possible where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, the chunk size is loop_count/number_of_threads.

dynamic     Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1.

guided      Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size is approximately loop_count/number_of_threads.

auto        The decision regarding scheduling is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to threads in the team.

However, in the context of many-core systems, such as Xeon Phi, it is important to utilise the vector processing units available on the chip. Therefore, in OpenMP 4.0, there is a simd pragma which aids the compiler with vectorising the code marked with the pragma.

The simd pragma informs the compiler that the loop it annotates can be safely vectorised and does not contain any dependencies.

OpenMP is an efficient method of parallelising code in a shared memory context across many threads. It also includes mechanisms for aiding vectorisation. OpenMP is relatively simple to use and is available across multiple platforms. Moreover, it is not bound to a specific implementation or compiler; most compilers include their own implementation of OpenMP. Furthermore, on the Intel MIC architecture OpenMP allows up to 4 threads per core to be populated, which helps Xeon Phi to run at its peak efficiency.
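A small sketch of how these OpenMP 4.0 features can be combined is shown below; the row-scaling kernel and its names are purely illustrative:

/* Illustrative sketch: the outer loop is distributed across threads with a
 * scheduling clause, while the inner loop is marked for vectorisation. */
void scale_rows(double **a, const double *s, int n)
{
#pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
#pragma omp simd
        for (int j = 0; j < n; j++)
            a[i][j] *= s[i];
    }
}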

5.3 Intel Cilk array notation

Intel Cilk array notation [32] is Intel's proprietary extension to the C language, which allows for better expression of SIMD parallelism. It extends the array syntax so that a programmer can explicitly state vectorisation while writing the code. In Table 5-2, we present how a for-loop can be replaced with Intel Cilk array notation.

Table 5-2: Intel Cilk array notation example

for loop                          Intel Cilk array notation equivalent
for(i=0; i …

The ability to explicitly state vectorisation is beneficial on the Xeon Phi architecture, which includes many vector processing units. These VPUs can potentially be utilised better when the explicit vectorisation representation provided by Intel Cilk array notation is used.
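As a concrete sketch of the kind of transformation Table 5-2 describes, the element-wise addition below (chosen purely for illustration, and requiring the Intel compiler) contrasts a scalar loop with its array-notation form:

/* Scalar loop. */
void add_loop(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Intel Cilk array-notation equivalent: a[start:length] denotes an array
 * section, so the whole operation is stated as one vector expression. */
void add_array_notation(double *a, const double *b, const double *c, int n)
{
    a[0:n] = b[0:n] + c[0:n];
}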

5.4 Offload models

There are several methods of running a program on a Xeon Phi. The program can be offloaded or run natively, and different parts of the code, from individual blocks to whole functions, can be offloaded. The choice of offload model has a significant influence on the performance of the solution. Data can also be transferred separately between the host and Xeon Phi through a PCI Express bus using dedicated Intel LEO or other pragmas. The choice of offloading model is therefore a substantial factor in optimising the performance of code on Xeon Phi. Using two or more Xeon Phis simultaneously could also be attempted in order to achieve better performance.

In Figure 5-1, we see the execution models available on Xeon Phi, with their most popular use cases. The code can be offloaded during execution to Xeon Phi with the use of specialised pragmas such as Intel LEO or OpenMP 4.0 pragmas. Such a model is beneficial where there are phases, or blocks, of highly parallelisable code. In such a case, offloading the most computationally intensive and parallelisable parts of the code can result in improved performance. This model is similar to what has long been used with GPUs and graphics processing, where graphics code was offloaded to the GPU to be processed there. In the case of Xeon Phi, offloading is done with the use of Intel LEO pragmas such as the following two, which offload a code block and data respectively:

#pragma offload target(mic)
{
    […] /* code block */
}

#pragma offload_transfer target(mic:0) in(a : length(N) alloc_if(1) free_if(0))  /* data offload */

This approach allows the code to be distributed between the host processor and Xeon Phi. However, it potentially introduces overheads associated with data transfer between the host and Xeon Phi. As a result, in applications where such transfers occur often, the offloading approach might not be the most beneficial.

Another approach is to execute the code symmetrically. The symmetric approach is one where the cores available on Xeon Phi are treated in the same way as the cores available on the host. The host then executes the program treating these cores as additional available processors. This approach allows for more flexibility, and gives a choice of how much workload is devoted to the host and how much to the Xeon Phi cores. However, there is still overhead due to communication between the host and the Xeon Phi. Moreover, balancing the workload appropriately between Xeon Phi and the host nodes can be troublesome. The difficulty in distributing the workload correctly might mean that the underlying application does not perform adequately.

Finally, for highly parallel applications, the full workload might be executed on the Xeon Phi node, thus limiting the communication overheads. However, as we see in Figure 5-1, such an approach works well only with applications that are highly parallelisable as a whole, since it relies on high utilisation of all the cores at all times. The performance of an individual core of Xeon Phi is significantly lower than that of a single core of the host. Consequently, we lose much of the performance if we do not fully exploit parallelisation in the native mode.

In Figure 5-2, we can see the software stack of Intel Xeon Phi. The communication between Intel Xeon Phi and the host node is clearly marked, and we see the different models for native and offload execution. Symmetric execution is similar in terms of overheads to offload execution. From the figure, we see that the communication between a Xeon Phi and a host happens through a PCI Express bus, and is serviced by a layer of supporting and communication libraries. This PCI Express connection is where all the communication between the Xeon Phi and the host happens, both in the symmetric mode of execution and when the offload to Intel MIC mode is chosen. Therefore, the PCI Express link is the bottleneck of data-transfer-intensive applications, and can bound the execution time. This happens in cases where data is constantly being offloaded or sent between Xeon Phi and the host.


Figure 5-1: Intel Xeon Phi execution models with Intel Xeon as the host processor [33]

Figure 5-2: Intel Xeon Phi software stack [23]

Chapter 6

Implementation of the solution

In this chapter, we outline how the solution was ported to Xeon Phi and describe the changes made to the code in order to optimise it and increase the performance. The optimised performance across different optimisation methods allows us to compare these methods and in turn explore the overall performance of Xeon Phi.

In order to improve the performance and port the code to Xeon Phi, we began by localising the bottlenecks of the program. To analyse the code and find the hotspots of the calculation, we used a profiler, which enabled us to see where the code would benefit from additional optimisation. Subsequently, we refactored these parts of the code and, with the use of compiler flags, aimed to achieve the best single-thread performance. Finally, throughout the implementation, we applied the methods described in the previous chapter and compared their performance on the LU factorization problem on a Xeon Phi co-processor.

In the following chapter, we present and describe benchmarks used to measure the performance of the solution on Xeon Phi co-processor.

6.1 Initial profiling

To improve the performance and pinpoint the parts of the code that could benefit from parallelisation, we employed a profiler to find the most computationally expensive parts of the code. These are the parts that bounded the solution and were the bottleneck of the execution of the program. They subsequently became the focus of our parallelisation efforts.

We used the Intel VTune [34] and gprof [35] profiling tools to find the hotspots and the most time-consuming sections of the code. From the profiling data it emerged that the most computationally intensive parts of the code are the functions mat_mul and lup_od_omp. The profile was initially obtained from gprof after running the code on 4 host threads with some OpenMP pragmas already introduced; it can be seen in Table 6-1. With no parallelisation at all, the matrix multiplication function dominated completely, so we introduced early parallelisation pragmas in the mat_mul function in order to obtain more representative profiling results, as presented in Table 6-1. The overall runtime of this program for the profiler was 70 seconds, so the profiler had enough time to collect statistics on the function calls.

Table 6-1: gprof profile of running the LU factorization algorithm with ranmat4500 input on 4 host threads

% time    Function name
84.34     mat_mul
15.55     lup_od_omp
0.05      __intel_memset
0.03      main
0.02      mat_pivot
0.01      read_arr

From Table 6-1, we can clearly see that mat_mul and lup_od_omp, the functions responsible for matrix multiplication and the LU factorization algorithm itself, were the most computationally intensive. This is in line with what we had expected of an LU factorization algorithm.

Having shown that mat_mul and lup_od_omp are in fact responsible for most of the execution time and are the bottleneck of the execution, we focused our efforts on maximising the performance of these functions and applied the optimisation methods and techniques primarily to these two functions.

6.2 Parallelising the code

Having located the bottlenecks of the code, we aim to improve the performance of these parts of the code. To do so, we introduce parallelisation pragmas and refactor the code so that parallel regions can be larger and contain more work, thereby maximising the utilisation of the available parallelism. In this section, we present examples of the transformations we performed in order to initially parallelise the code.

Initially, we start with the most time-consuming part of the code according to the obtained profile, namely the matrix multiplication operation. Matrix multiplication is known to be easily parallelisable: it consists of several nested loops with independent iterations that traverse two two-dimensional arrays, which correspond to the two matrices. Therefore, we can introduce an OpenMP pragma collapsing the two outermost loops, in order to parallelise these loops as shown below.

#pragma omp parallel for private(i, j, k) collapse(2)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        …

The sample above shows how an OpenMP pragma is introduced to parallelise the code. The pragma allows the compiler to distribute the loop iterations among the available threads, and the collapse clause combines the two loops, which in this case are independent, into a single parallelised iteration space.

Moreover, OpenMP scheduling clauses can be used to further improve the load balancing of the distributed loop iterations. In the above case of matrix multiplication, the size and complexity of the computation does not change between iterations. Since all iterations have similar complexity, the default static scheduling of n/num_of_threads loop iterations per core works well, due to cache hits in the local caches of the Xeon Phi cores. In other parts of the code, where the complexity varies more between iterations, it is beneficial to use other scheduling options, which ensure load balance between the threads. A sketch of the resulting multiplication kernel is shown below.
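Putting the pieces together, the parallelised multiplication kernel might look roughly as follows; this is a sketch consistent with the pragma shown above, and assumes the result matrix has been zero-initialised, so it need not match the project's mat_mul exactly:

/* Sketch of a parallelised matrix multiplication: the two independent outer
 * loops are collapsed and distributed across threads; static scheduling is
 * used because every (i, j) iteration performs the same amount of work.
 * Assumes c has been zero-initialised beforehand. */
void mat_mul(double **c, double **a, double **b, int n)
{
    int i, j, k;
#pragma omp parallel for private(i, j, k) collapse(2) schedule(static)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}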

We followed a similar process with other functions, focusing especially on lup_od_omp, where we also refactored the code in order to ensure it could be parallelised. The changes also ensured that the number of parallel regions was kept to a minimum, to limit the overheads associated with starting, closing and synchronising parallel regions.

Furthermore, the process of refactoring the code allows the compiler to apply auto-vectorisation techniques more efficiently. For example, explicitly putting one array operation within one loop helps the compiler to see that this section of the code can be auto-vectorised. We investigated different options to see which one gave the most efficient results, with the use of vectorisation reports and direct measurement of the performance.

6.2.1 Hotspots analysis

Having completed the parallelisation of the algorithm, we analysed the performance again to localise any further hotspots and bottlenecks in the program. This was done with VTune, a tool that helps to profile and navigate the hotspots within a program. VTune is especially helpful for navigating the hotspots of a program offloaded to Xeon Phi. It helped us locate parts of the code that could be further parallelised, either by introducing extra parallelisation pragmas or by collapsing several loops to increase the thread utilisation on Xeon Phi.

Hotspot analysis using VTune was of particular use when choosing the correct loop scheduling for the "parallel for" pragma. The tool informed us whether the load balancing between the threads was adequate. Similarly, it helped us to find loops that were not optimally parallelised and could be collapsed together to improve the utilisation of threads.

Figure 6-1 shows how we used VTune to aid with the optimisation of thread utilisation. It presents graphs of thread utilisation before and after the optimisation of the lup_od_omp function. We can clearly see that the level of thread utilisation improves. The ability to visualise how many threads are working throughout the duration of the program is of immense help when optimising it. We notice in Figure 6-1 that the average utilisation of Xeon Phi threads increases after optimisation: the majority of the execution time post-optimisation is spent utilising more than 200 threads.

Figure 6-1: VTune threads utilisation in lup_od_omp function pre and post optimisation

Moreover, it is also important to note that the size of the benchmark and its density are both important factors in the utilisation of threads. If the benchmark is small, it might not generate enough operations to distribute among the large number of threads available on Xeon Phi. Consequently, it might not utilise the threads as efficiently as a larger benchmark would with the same code.

Nevertheless, the analysis presented in Figure 6-1 does not take into account how well the vector processing units (VPUs) are utilised. The graph in Figure 6-1 shows only the utilisation of the up to 4 threads per core available on Xeon Phi; it does not include the utilisation of the vector processing units.

6.3 Offloading the code

Offloading the code was the approach we took when working with Xeon Phi. The conventional operating model for Xeon Phi is to offload the most computationally intensive parts of the code to a Xeon Phi node and execute them there. In this project, we considered offloading at several points throughout the code. We attempted offloading individual for-loops as well as larger parts of the code, including whole functions.

Due to the large size of some benchmarks, it became important that matrices are offloaded only once, so as to avoid the costly process of repeatedly offloading these large matrices over the PCIe bus. There exist pragmas dedicated to offloading code to Xeon Phi from the host processor. Intel LEO – Intel Language Extensions for Offload [36] – is one such set of pragmas. Alternatives include OpenMP 4.0 pragmas [30], OpenACC [37], and HAM (Heterogeneous Active Messages for Efficient Offloading) [38]. All these methods have their merits; however, we focused mainly on Intel LEO, Intel's proprietary method. Intel LEO is the method preferred by Intel and is claimed to achieve the best results when offloading to Xeon Phi, being designed specifically for this purpose.
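One way of expressing this "transfer once, reuse many times" pattern with Intel LEO is sketched below; the buffer and function names are illustrative, and the project's actual code combines the transfer with a single offload region, as shown later in this section:

void factorise_resident(double *A_ptr, long n)
{
    /* Allocate a buffer on the co-processor, copy the matrix in once,
     * and keep the buffer alive after the pragma completes. */
#pragma offload_transfer target(mic:0) \
        in(A_ptr : length(n*n) alloc_if(1) free_if(0))

    /* Subsequent offloads reuse the resident buffer without re-transferring it. */
#pragma offload target(mic:0) \
        nocopy(A_ptr : length(n*n) alloc_if(0) free_if(0))
    {
        /* ... work on the matrix on the co-processor ... */
    }

    /* Copy the result back and release the buffer on the co-processor. */
#pragma offload_transfer target(mic:0) \
        out(A_ptr : length(n*n) alloc_if(0) free_if(1))
}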

In this project we attempted offloading at different levels of granularity, and finally decided to create one offload region. Functions called within this region are compiled for the MIC architecture by marking them with the __attribute__((target(mic))) attribute.

The offload itself was performed with the use of the following Intel LEO pragma call:

#pragma offload target(mic) in(n) inout(A_ptr : length(n*n) alloc_if(1)) nocopy(end_time) nocopy(start_time)

This is the main offloading pragma used in the code. Within the offloaded section, we call the function which executes the LU decomposition algorithm, and the matrix multiplication required for pivoting.
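A rough sketch of how this offload region and the MIC-compiled functions fit together is given below; the function signatures are assumptions made for illustration, although the function names follow those reported by the profiler:

/* Functions called inside the offload region must also be compiled for the
 * MIC architecture; signatures here are illustrative. */
__attribute__((target(mic))) void lup_od_omp(double **a, int n);
__attribute__((target(mic))) void mat_mul(double **c, double **a, double **b, int n);

void run_factorisation(double *A_ptr, int n)
{
    double start_time, end_time;

#pragma offload target(mic) in(n) inout(A_ptr : length(n*n) alloc_if(1)) \
        nocopy(end_time) nocopy(start_time)
    {
        /* On the co-processor: rebuild the row pointers over A_ptr and call
         * lup_od_omp (which in turn uses mat_mul for pivoting). Timing on the
         * co-processor can use start_time and end_time, hence the nocopy clauses. */
    }
}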

6.4 Hinting vectorisation with ivdep

Having parallelised, optimised, and offloaded the code to the MIC, we attempted various methods of improving the utilisation of the vector processing units and of helping the compiler with auto-vectorisation. One such method is the introduction of the ivdep pragma to the code. This pragma helps the compiler to vectorise the code: the compiler is assured that certain kinds of dependencies do not occur in the loop that follows the pragma. Therefore, it is used as a way of notifying the compiler that it is safe to vectorise that code.

The pragma was used in key loops throughout the code to hint the compiler in the direction of vectorisation. The loops in the matrix multiplication function are independent and can easily be transformed into vector expressions. However, they often rely on variables defined outside the loop, and so the compiler might not always vectorise them. Similarly, the mat_pivot function makes use of the pragma to aid the vectorisation of loops in its code. An example is shown below:

#pragma ivdep
for (k = 0; k < n; k++) {
    p[i][k] = p[max_j][k];
}

We can see that the ivdep pragma helps to vectorise such a loop, which otherwise might not be vectorised due to the compiler's uncertainty about data dependencies. The data might overwrite itself if an overlap between the pointers exists. In this case, the pointers are partially defined outside of the scope of the loop by the i and max_j variables. Therefore, the existence of an overlap might not be determinable by the compiler at compile time. As a result, the compiler would safely assume that the dependency exists and might not vectorise the loop. The ivdep pragma prevents that from happening and helps the compiler to vectorise the code.

6.4.1 Hotspots for further vectorisation

To analyse and evaluate the code and find places which might require further vectorisation, we used vectorisation reports together with VTune. VTune enabled us, as previously mentioned, to navigate through the hotspots of the code and to find the parts that required further optimisation. Having found these parts, we looked for loops within them that could be vectorised. Vectorisation reports provided by the Intel compiler were then used to check whether all loops which could be vectorised were in fact vectorised. This enabled us to analyse the code and see how much more vectorisation we could achieve through additional vectorisation pragmas. Moreover, this process allowed us to explore and compare the performance of Xeon Phi with different programming models, parallelisation levels, and sets of pragmas.

6.5 Ensuring vectorisation with Intel Cilk and OpenMP simd

In this section, we describe the use of Intel Cilk array notation and OpenMP 4.0's simd pragma to ensure that the code is vectorised by the compiler. These methods work in a similar manner, by explicitly showing the compiler that there are no dependencies in a loop or array operation, so that the compiler can safely vectorise the statement. Efficient vectorisation is a key concern when programming for Xeon Phi, because the co-processor contains many 512-bit wide vector processing units. These vector processing units, when utilised, provide a significant improvement in performance by exploiting data parallelism.

Programs benefit from these operations if they are written in such a way as to allow the vectorisation to be performed. In the case of LU factorization there are many operations on arrays that can be vectorised. There are some dependencies between the loop levels; however, there are not many dependencies between loop iterations. Below, an example of a loop that can be vectorised is shown:

for (j = k + 1; j < n; j++)
    a[i][j] -= aik * a[k][j];

We can see that, in the above loop, the outermost array index depends only on the loop iterator. Therefore, in this case the accesses to the variable a could be vectorised. However, the compiler is reluctant to do so without a dedicated pragma, due to its limited knowledge of whether i and k might overlap. We performed a similar operation and analysis for all the loops of the code. Likewise, we changed the ivdep pragmas to their Intel Cilk array notation or OpenMP simd pragma equivalents. This left us with optimally parallelised and vectorised code using each of the two respective methods.

6.5.1 Intel Cilk array notation

To use Intel Cilk array notation, we replaced the innermost loops with Intel Cilk array notation. An example of how Intel Cilk array notation was used in the lup_od_omp function is presented in Table 6-2.

Table 6-2: Intel Cilk array notation use in lup_od_omp function

Original for loop:
    for(j = k + 1; j < n; j++)
        a[i][j] -= aik * a[k][j];

Intel Cilk array notation:
    a[i][k+1:n-(k+1)] -= aik * a[k][k+1:n-(k+1)];

The loop statement is transformed into a single-line statement, which explicitly denotes that this part of the code can be vectorised. A similar process was followed throughout the whole code: the outermost indices of array accesses were replaced with Intel Cilk array notation, and the respective loops were removed.

The use of Intel Cilk array notation has allowed us to ensure that the compiler uses vectorisation to the fullest, and takes advantage of any opportunity to vectorise statements.

6.5.2 OpenMP simd pragma

OpenMP pragmas differ from Intel Cilk array notation in that they do not transform the loop's syntax. Instead, they are used to mark up the loops which can be legally vectorised. The usage of the OpenMP simd pragma on a piece of code similar to the one from Table 6-2 is presented in Table 6-3.

Table 6-3: OpenMP simd pragma usage in lup_od_omp

Original for loop:
    for(j = k + 1; j < n; j++)
        a[i][j] -= aik * a[k][j];

OpenMP simd pragma:
    #pragma omp simd
    for(j = k + 1; j < n; j++)
        a[i][j] -= aik * a[k][j];

In Table 6-3, we can see that the only change to the code required to introduce the OpenMP simd pragma is the introduction of the pragma itself. The pragma notifies the compiler that the following loop can be vectorised. This process was repeated throughout the code to ensure that all loops which can be vectorised are marked up with the pragma. Similarly, introducing the pragma at various levels of nested loops allowed us to determine the optimal points for the introduction of simd pragmas.

The optimal points for the simd pragmas tended to be the loops whose index controlled the largest number of outermost array indices in the statements in the loop body. For example, in the mat_mul function, the simd pragma was introduced on the middle loop, because the middle loop's index controls two out of the three statements in the body of the loops. This proved to be the most efficient place to insert the simd pragma; a sketch is shown below.
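For illustration, a sketch of what placing the simd pragma on the middle loop of a triple-nested multiplication might look like is shown below; the exact loop ordering and body of the project's mat_mul are assumptions here, and the result matrix is assumed to be zero-initialised:

/* Sketch: the simd pragma on the middle (j) loop vectorises across j, which
 * controls two of the three array accesses in the body (c[i][j] and b[k][j]).
 * Iterations over j are independent, so vectorising across them is safe. */
void mat_mul_simd(double **c, double **a, double **b, int n)
{
    for (int i = 0; i < n; i++) {
#pragma omp simd
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
    }
}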

Similarly to Intel Cilk array notation, the introduction of simd pragmas allowed us to maximise the vectorisation of the code, due to the restrictive nature of OpenMP's simd pragma. The code marked with the pragma is vectorised by the compiler regardless of any potential dependencies that the compiler might detect in the code. This allowed us, through the use of pragmas, to ensure that the utilisation of the vector processing units is maximised as far as possible.

Chapter 7

Benchmarking the solution

In this chapter, we describe the process of obtaining the benchmarks which were then used to establish the performance of the code. In order to benchmark the LU factorization algorithm, it is necessary to obtain a set of benchmark matrices that can be used in the LU factorization process. A wide range of sizes and densities of these matrices gives us a broad view of how the performance of Xeon Phi and of the algorithms is affected by various kinds of benchmark matrices. Furthermore, a wide range of matrices ensures that the exploration of the performance of Xeon Phi is not narrowed to a limited number of cases.

To obtain a wide spectrum of sizes and densities, we used several kinds of benchmarks. We obtained part of the benchmarks from the University of Florida sparse matrix collection [39]. Additionally, to obtain dense matrices, we created a small program to generate dense matrices of different sizes. These dense matrices could subsequently be used to explore the performance of the hardware.

In this chapter, we present the selection of benchmarks and discuss their features. We provide an explanation of the format of the benchmarks, a short overview of the University of Florida sparse matrix collection, and the process of obtaining the dense benchmarks. Finally, we present a table of characteristics of the benchmarks used in the project.

7.1 Matrix format

The benchmarks used in this project are stored and prepared in the Matrix Market Exchange format [40]. The format is simple and is often used to store and present sparse matrices. A Matrix Market Exchange file contains a line specifying the dimensions of the matrix, followed by a number of lines corresponding to the non-zero entries of the matrix. Each entry consists of three numbers: the first two specify the coordinates within the matrix of the value given by the last number on the line. We can see an example of such a matrix in Figure 7-1 [40].
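As an illustration, a minimal, hypothetical file in this format might look as follows (the values are invented; files from the collection typically also begin with a %%MatrixMarket banner line and may contain % comment lines):

%%MatrixMarket matrix coordinate real general
% hypothetical 3 x 3 matrix with 4 non-zero entries
3 3 4
1 1 2.5
2 3 -1.0
3 1 4.0
3 3 7.25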


Figure 7-1: Sample matrix and its Matrix Market Exchange format representation

To enable matrices in Matrix Market format to be read by the program, a dedicated module responsible for reading the matrices from file was created. The module reads a Matrix Market format file and populates a matrix, which can then be processed by the core body of the program. This module is in the file readmatrix.c.

7.2 University of Florida sparse matrix collection

The University of Florida sparse matrix collection [39] has been developed by Dr. Tim Davis and includes various matrices. The matrices relate to a wide range of problems, such as route optimisation, circuit design, and mathematics.

The matrices selected from the collection for this project are primarily used in electronic design automation optimisation. These matrices simulate sample representative circuits and they are meant to correspond to certain categories of circuits. The matrices are used to simulate routing of these circuits on a circuit board.

The matrices from the University of Florida’s sparse matrix collection used to evaluate the performance of the LU factorization algorithm in this project are presented below:

• add32.mtx
• circuit_1.mtx
• circuit_2.mtx
• coupled.mtx
• init_adder1.mtx
• meg4.mtx
• rajat01.mtx
• rajat03.mtx

7.3 Dense benchmarks

To obtain dense benchmarks, we created a program which uses a pseudo-random number generator to populate matrices. The program creates and populates the matrices, and subsequently writes them to a file adhering to the Matrix Market Exchange format. The file containing the C code used to generate the dense matrices is called populate.c.

Dense matrices generated with this code are completely filled with non-zero numbers. These are generated using the pseudo-random number generator available in the C standard library. The numbers populating the dense matrices are all positive and limited in magnitude to between 0 and 100. Matrices of various sizes were created in order to reflect the wide range of problems in which LU factorization might be used. When deciding on the sizes of these matrices, we accounted for the fact that only larger matrices are able to take full advantage of the vast number of threads available on Xeon Phi. Similarly, we accounted for the fact that beyond a certain size the execution of the algorithm becomes too time-consuming to measure repeatedly within the scope of the project. These factors influenced our selection of dense matrix sizes.
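A minimal sketch of what such a generator might look like is shown below; the actual populate.c may differ in details such as the exact header line written or how the matrix size is chosen:

#include <stdio.h>
#include <stdlib.h>

/* Sketch of a dense benchmark generator: fills an n x n matrix with positive
 * pseudo-random values below 100 and writes every entry in Matrix Market
 * coordinate form (row, column, value). */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000;
    FILE *f = fopen("ranmat.mtx", "w");
    if (f == NULL)
        return 1;

    fprintf(f, "%d %d %d\n", n, n, n * n);            /* dimension line */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            fprintf(f, "%d %d %f\n", i, j, 100.0 * rand() / RAND_MAX);

    fclose(f);
    return 0;
}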

The matrices created using the populate.c program are the following:

• ranmat1000.mtx
• ranmat4500.mtx
• ranmat5000.mtx
• ranmat5500.mtx
• ranmat6000.mtx

7.4 Summary of benchmarks’ characteristics

In Table 7-1, we present a summary of the benchmark matrices used in evaluating the solution and exploring the performance of Intel Xeon Phi. The table lists the name of each matrix, its dimensions, and its number of non-zero elements. These features all influence the execution time of the LU decomposition algorithm.


Table 7-1: Characteristics of benchmark matrices

Matrix name       Dimensions       No. of non-zero elements
add32.mtx         4960 x 4960      19,848
circuit_1.mtx     2624 x 2624      35,823
circuit_2.mtx     4510 x 4510      21,199
coupled.mtx       11341 x 11341    97,193
init_adder1.mtx   1813 x 1813      11,156
meg4.mtx          5860 x 5860      25,258
rajat01.mtx       6833 x 6833      43,250
rajat03.mtx       7602 x 7602      32,653
ranmat1000.mtx    1000 x 1000      1,000,000
ranmat4500.mtx    4500 x 4500      20,250,000
ranmat5000.mtx    5000 x 5000      25,000,000
ranmat5500.mtx    5500 x 5500      30,250,000
ranmat6000.mtx    6000 x 6000      36,000,000

From Table 7-1, we see that a wide range of matrix sizes and densities is used in evaluating the performance of Xeon Phi throughout this project. We can also see the clear division between the sparse and dense matrices. The first eight matrices are clearly sparse, while the latter five are dense. These matrices are used as benchmarks to explore performance of Xeon Phi. The results of running the program with different optimisation methods, and these matrices as inputs, are presented in the next chapter.

Chapter 8

Analysis of performance of Xeon Phi

We present the results obtained after running the algorithm on Xeon Phi with the different methods presented in the previous chapters, which result in various levels of parallelism. We analyse how these methods compare with each other and what their performance is on different kinds of benchmarks. Furthermore, we show how the speed-up on Xeon Phi varies with the use of these methods. Moreover, we compare the results of running the program offloaded to Xeon Phi with running it natively on the host processor.

These results show how we have explored performance of Xeon Phi. We can see which methods work the best on Xeon Phi, and how well the programs scale on the hardware. Furthermore, we can notice at which point it becomes beneficial to switch to Xeon Phi rather than executing the program on the local host.

In addition, we provide a brief analysis of the time required for sections of the code to be offloaded to Xeon Phi. The time required to offload data to Xeon Phi can be a major overhead and a bottleneck in memory-intensive programs. In the case of LU factorization this applies to the larger matrices, which need to be transferred from the host to Xeon Phi.

The findings presented in this chapter help to establish how Xeon Phi can be exploited to improve speed of computation. LU factorization is a common pattern in many scientific computations involving matrices. Similarly to LU factorisation, computations on matrices often can be parallelised to a certain degree. Therefore, the results presented in this chapter reflect Xeon Phi’s usability in applications involving operations on matrices. Furthermore, these results show what approach is the most beneficial, in terms of performance gain, when porting an application to Xeon Phi. This could guide future attempts to port code to Xeon Phi.

8.1 Collection of results

To collect results, we measured the time taken by the relevant section of the program to run. The aim was to accurately measure the execution time and, based on it, to calculate the speed-ups of the program under various modifications and in different conditions. This process is described in this section, and allowed us to arrive at the results presented in this chapter.

In order to measure the execution time, we have used OpenMP’s omp_get_wtime() function, which returns the wall time. We inserted calls to this function at the beginning and at the end of the region we aimed to time. Then, the former value obtained was

subtracted from the latter in order to arrive at the execution time of the given block of code. This method allowed us to obtain accurate timings of the program and did not affect its overall execution.
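In code, the measurement therefore reduces to a pattern of the following form (a sketch; in the project the timed region encloses the factorisation itself):

#include <omp.h>
#include <stdio.h>

/* Sketch of the timing pattern: take the wall-clock time before and after the
 * region of interest and report the difference as the execution time. */
void time_region(void (*region)(void))
{
    double start_time = omp_get_wtime();
    region();                                   /* e.g. the offloaded LU factorisation */
    double end_time = omp_get_wtime();
    printf("Execution time: %f s\n", end_time - start_time);
}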

We measured the execution time on its own, that is, the time after the data had been offloaded to Intel Xeon Phi. Likewise, we measured the time taken by this part of the program to execute including the offload to Xeon Phi. This approach allowed us to see and measure the influence of the offload on the overall performance of the program. This method was selected because transfer of the data often becomes a bottleneck when working with Xeon Phi.

Finally, in order to ensure that the data collected was representative, we measured the execution time several times for each benchmark in each condition. The results presented in this report are therefore based on running each configuration of the application and parallelisation method three times. The three data points collected were then checked for outliers. Outliers were considered to be points that are more than 10% away from the average of the three points. If the initially collected data contained outliers, the collection process was repeated for that benchmark. If the data did not contain any outliers, that is, all data points for each benchmark and configuration were within 10% of the average for this benchmark and configuration, the data was accepted.

The above process ensured that the data was representative and that there were no outliers. The running time of the program could vary due to other users or processes running on Xeon Phi; following the process described in the previous paragraph avoids this. In the following sections, we show the results obtained in line with the process described.

Speed-up, in the context of this project, is defined by the following formula:

$$\text{Speed-up} = \frac{T_1}{T_n}$$

In the above formula, T1 is the execution time on one thread, while Tn is the execution time on n threads. The speed-up therefore corresponds to the factor by which the execution time decreases as the number of threads increases. It is a primary metric used to evaluate the performance of parallel program execution.

8.2 Validation of the results

In order to ensure that the results were still valid after the changes made to the initial algorithm, we performed a validation of the results. To check that the computation was correct, we compared the results obtained from the algorithm with results obtained using a different algorithm, which implemented a basic Gaussian elimination. However, only the smaller benchmark matrices were evaluated against this algorithm, because of time constraints: since the other algorithm was largely single-threaded and not optimised, the larger benchmark matrices took a very long time to execute with it. A subset of smaller benchmarks was therefore used to ensure that the algorithm, after the changes, was producing correct results.

The results from the two algorithms were compared, and if they remained sufficiently similar, within a 5% error boundary, the result of the modified algorithm was deemed correct. This validation was used as a proxy for the overall correctness of the algorithm: it was inferred that if the results were correct for the smaller matrices, the algorithm also behaves correctly for the larger matrices.
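The acceptance test can be sketched as an element-wise comparison within the 5% tolerance, as below; the function name and the handling of near-zero reference entries are assumptions for illustration:

#include <math.h>

/* Returns 1 if every entry of the optimised result lies within 5% of the
 * corresponding entry of the reference result, and 0 otherwise. Entries that
 * are (nearly) zero in the reference are compared with a small absolute
 * tolerance instead. */
int results_match(double **result, double **reference, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double ref = reference[i][j];
            double tol = (fabs(ref) > 1e-12) ? 0.05 * fabs(ref) : 1e-9;
            if (fabs(result[i][j] - ref) > tol)
                return 0;
        }
    }
    return 1;
}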

8.3 Overview of results

In this section, we present the results of running the program with various levels of optimisation on Intel Xeon Phi. The results presented in Table 8-1 include only the time required to run the benchmark on Xeon Phi, without the time taken to offload data to Xeon Phi. The results were obtained by executing the program on Xeon Phi compiled with Intel C Compiler version 15.0.1; the results on the host were obtained with Intel C Compiler version 15.0.2 [41]. The results presented in Table 8-1 were compiled with optimisation level "-O3". All the runs on Xeon Phi used 236 threads, while on the host the program was executed with 1, 4, and 16 threads respectively. 236 threads is the optimal number on the version of Xeon Phi available: it has 60 cores, of which one is reserved to handle OS and scheduling requests, leaving 59 cores available for computation that can be load balanced well without causing stalls. Therefore, we use 236 threads, which corresponds to 59 cores at 4 threads per core.

Table 8-1: Execution times (in seconds) of running benchmarks offloaded to Xeon Phi with different parallelisation methods

                 Host                                   Xeon Phi n=236
Matrix name      n=1         n=4        n=16            ivdep pragma   simd pragma   Intel Cilk array notation
add32            562.92      156.77     52.25           300.16         48.99         21.76
circuit_1        76.52       22.10      6.78            36.84          7.40          4.48
circuit_2        539.55      161.80     54.77           252.12         34.32         13.33
coupled          10477.47    3478.02    979.79          6206.00        548.95        389.74
init_adder1      29.13       7.78       2.92            8.51           1.98          1.43
meg4             964.99      282.94     82.29           382.02         75.67         29.24
rajat01          2237.87     736.06     217.74          886.15         117.87        45.58
rajat03          2611.22     801.68     226.15          1345.14        158.68        72.07
ranmat1000       3.78        1.21       0.66            0.76           0.48          0.56
ranmat4500       478.27      139.40     64.24           259.68         40.19         20.04
ranmat5000       613.51      178.32     75.59           361.24         54.14         26.72
ranmat5500       877.40      258.70     104.74          477.65         72.75         35.71
ranmat6000       1057.81     307.60     140.93          618.82         93.88         45.08

From Table 8-1, we see that the performance of all benchmarks eventually improves after offloading to Xeon Phi. All benchmarks show an improvement when moving from one host thread to any level of optimisation on Xeon Phi. Moreover, with the OpenMP simd pragma or Intel Cilk array notation, the execution is faster even than running the code on 16 threads on the host processor, which is the best performance we achieved on the host. The host execution times are also presented in Table 8-1, from which we see that 16 threads give the lowest runtimes on the host.

We explored various scheduling options, including dynamic and static with various chunk sizes from 1 to n. It emerged that static scheduling with equally distributed chunks gives the best performance, and it was used to obtain the above results. This is due to the regularity of the matrices: the workload is equally distributed across the matrices' areas. Therefore, static scheduling provides good workload balancing and the best efficiency.

The above results show that introduction of optimisations, and offloading the code to Xeon Phi are beneficial for these benchmarks. The “Host n=1” column results of Table 8-1 are used as the reference point for calculating the speed-up with different methods. In the following sections we look at how performance improves with particular optimisations, and what is the speed-up achieved with these methods.

8.4 Speed-up with different optimisation options

From the table presented in the previous section, we notice that the introduction of parallelisation and the ivdep pragma improved the performance of the code. In Figure 8-1, we show the speed-up of the offloaded code with reference to the same code run on one thread of the host processor. In the figure, we also present the speed-up achieved by running the code on 16 threads on the host without introducing these optimisations. We can see the speed-ups with the different methods of parallelisation and vectorisation implemented.

[Chart: speed-up (0–50) of each benchmark for the ivdep pragma, simd pragma, Intel Cilk array notation, and the host with n=16]

Figure 8-1: Speed-up of benchmarks with different parallelisation techniques used

From Figure 8-1, we can see that a good amount of speed-up is achieved when offloading the code using Intel Cilk array notation or the simd pragma. The speed-up reaches up to 49, signifying performance potentially 49 times better when the code is offloaded to Xeon Phi than on a single thread of the Intel Xeon host. The average speed-up across benchmarks due to the introduction of Intel Cilk array notation and offload to Xeon Phi is 27.0. We can see from the figure that the kind of benchmark has a significant influence on the speed-up. Dense benchmarks have similar speed-up values, except for the very small benchmark. The average speed-up achieved by the dense matrix benchmarks is 20.3. There is more variation among the sparse benchmarks, and the average speed-up for sparse matrices is somewhat higher, at 31.1.

Similarly, for the OpenMP 4.0 simd pragma, the speed-up due to the introduction of the pragma and the offload of the program to Xeon Phi reaches up to 19 times the single-threaded performance of the Intel Xeon host. On average across all benchmarks, the speed-up is 13.4; that is, parallelising, vectorising, and offloading the program to Xeon Phi improves the execution time on average by 13.4 times over the single-threaded Intel Xeon execution time. Dense matrix benchmarks achieve an average speed-up of 10.9, while sparse matrix benchmarks, as with Intel Cilk array notation, achieve a speed-up almost 1.5 times better, with the average for this type of benchmark being 14.9.

For the ivdep pragma, the average speed-up is much lower than for the two options mentioned above, reaching 2.33. The maximum speed-up achieved due to the introduction of the ivdep pragma is 5.00, with the ranmat1000 matrix benchmark. The average for dense matrix benchmarks in the case of the ivdep pragma is 2.42, which is much closer to the overall average than in the case of OpenMP or Intel Cilk array notation. Finally, the sparse matrices on average attain a speed-up of 2.27; again, this is closer to the overall average than in the case of OpenMP or Intel Cilk array notation. This shows that the ivdep pragma behaves more uniformly across dense and sparse matrices, while OpenMP and Intel Cilk array notation both perform better, in terms of speed-up, with sparse matrices.

The above results are briefly summarised in Table 8-2.

Table 8-2: Speed-up values summary against single-threaded host execution time

Parallelisation method       Overall average    Dense matrices average    Sparse matrices average    Maximum    Minimum
OpenMP + ivdep pragma        2.33               2.42                      2.27                       5.00       1.69
OpenMP 4.0 simd pragma       13.4               10.9                      14.9                       19.1       7.82
Intel Cilk array notation    27.0               20.3                      31.1                       49.1       6.71

From Table 8-2, we also notice that the best minimum speed-up is achieved by OpenMP 4.0 with the simd pragma. This suggests that for smaller benchmarks such as ranmat1000 the simd pragma is more efficient than Intel Cilk array notation, which otherwise seems to be the best option overall for speeding up execution of the program on Xeon Phi.

We can see from the results presented in this section that the speed-up varies significantly between the different methods of parallelisation and vectorisation. This happens because of the different principles behind the methods. Firstly, the ivdep pragma is not as effective at vectorising the code as the other methods, because it is not restrictive: placing the pragma in the code does not guarantee that the loop it precedes will be vectorised. As a result, several loops in the code that could be vectorised are not vectorised, despite the ivdep pragma being placed around

them. This happens because the compiler still conservatively treats any potential dependencies it cannot disprove as actual dependencies. Consequently, scalar code is executed and the vector processing units available on the Xeon Phi are not utilised. Since the scalar performance of Xeon Phi threads is relatively poor, the speed-up achieved is not as good as it could have been had the vectorisation happened. This explains the lower speed-up of the ivdep pragma in comparison to the other methods.

Explaining the difference between the two fastest vectorisation methods is more subtle. Both the simd pragma and Intel Cilk array notation enforce vectorisation: the simd pragma ensures that the loop following it is vectorised, and Intel Cilk array notation likewise ensures that the marked code is vectorised. It is therefore harder to explain why such a significant difference between the two methods appears. However, a basic analysis of smaller snippets of the code has shown that the performance variation arises because, with the simd pragma, the compiler cannot always predict the exact vector width. Since the pragma is placed outside the loop, if the loop is nested and its indices depend on the outer loop, the compiler's vectorisation more often results in remainder and peel loops. This is avoided with Intel Cilk array notation, where the width of the vector can be predicted. Moreover, this results in better cache allocation with Cilk array notation, where the data can be loaded into cache in a way that supports subsequent loads into the vector registers of the Xeon Phi. An example of code exposing this difference is shown in Table 8-3.

Table 8-3: Code snippets explaining performance difference between simd pragma and Intel Cilk array notation

Intel Cilk array notation:
    c[i][0:n] += a[i][k] * b[k][0:n];

Simd pragma:
    #pragma omp simd
    for (k=0; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];

From the code snippets in Table 8-3, we see that when using Cilk array notation we specify the length of the vector explicitly, and the compiler and runtime know that it will not change throughout the execution of that block of code. On the other hand, with the simd pragma, the values of k and n could potentially change throughout the execution of the loop. Therefore, the compiler might employ a suboptimal division of the arrays into vector registers, resulting in a remainder or peel loop being created. Furthermore, additional instructions are introduced in the code to check whether k and n have changed. Intel Cilk array notation is less likely to encounter this issue, and consequently achieves better performance.

8.5 Native speed-up on Intel Xeon and on Intel Xeon Phi

To measure how the solution scales on Intel Xeon Phi in comparison to the host, we ran the program with different numbers of threads. Measuring the performance for different numbers of threads, both on the Xeon Phi co-processor and on the Intel Xeon host processor, allows us to explore how scalable the solution is. This can help to predict how the performance would change with an increase in the number of cores on Xeon Phi. Since the next version of Xeon Phi is going to have an increased number of cores, such an analysis can be useful for predicting the performance of the program on the new generation of Xeon Phis.

In Figure 8-2, we can see the average speed-up of Xeon Phi across all the benchmarks with a varying number of threads. Because of the size of some benchmarks and the fact that the single-thread performance of Xeon Phi is poor, we calculated the speed-up with reference to running the benchmarks with 8 threads on Xeon Phi. From the graph presented in Figure 8-2, we see that the speed-up increases as we increase the number of threads. However, it does not scale perfectly: the speed-up grows roughly linearly from one data point to the next, while the number of threads roughly doubles at each step. This shows that the speed-up achieved on Xeon Phi for this algorithm is far from the ideal speed-up, in which the speed-up increases by the same factor as the number of threads. Nevertheless, we can determine from the graph that the performance would keep increasing if the number of threads increased beyond 236. This is beneficial, since we would like to be able to profit from running the algorithm on devices with a larger number of threads in the future.

Furthermore, the speed-up depends on the size of matrices used as benchmarks and their density. The larger and denser the benchmark matrix is, the more performance we gain when increasing the number of threads.

[Chart: average speed-up on Xeon Phi (0–4) for 8, 16, 32, 60, 100, and 236 threads]

Figure 8-2: Speed-up of Xeon Phi with varying number of threads (speed-up=1, when n=8)

8.6 Offloading overhead

In this section, we focus on the time required for offloading, exploring this aspect of Xeon Phi's performance. Since the model we used for programming Xeon Phi involves offloading the data to Xeon Phi in order to perform the computation there, it becomes necessary to analyse the consequences of offloading for the performance of the program. We analysed offloading by comparing the execution time of the program including the offload of the data to Xeon Phi with the execution time measured after the data had already been offloaded. This enabled us to state the offload time as a percentage of the execution time on Intel Xeon Phi.

Subsequently, we could analyse the influence of offloading on the performance of Xeon Phi, and see whether it is beneficial to offload. The data obtained shows whether the increase in performance due to offloading overcomes the overhead of data transfer to and from Xeon Phi caused by the offloading. It also shows at what matrix size it becomes beneficial to use Xeon Phi.

In Figure 8-3, we can see the time taken by offloading the data to Xeon Phi as a percentage of the total execution time, that is, the execution time including the offloading time. The total time was measured outside of the offloading region so as to capture the overall execution time with offloading. The data shown in Figure 8-3 is based on the execution with the OpenMP 4.0 simd pragma. Similarly, in Figure 8-4, we present the corresponding result for Intel Cilk array notation.

[Chart: offload time and execution time as percentages (0–100%) of the total execution time, per benchmark, for the OpenMP 4.0 simd pragma]

Figure 8-3: Offloading and execution times as a percentages of the total execution time for OpenMP 4.0 simd pragma

[Chart: offload time and execution time as percentages (0–100%) of the total execution time, per benchmark, for Intel Cilk array notation]

Figure 8-4: Offloading and execution times as a percentages of the total execution time for Intel Cilk array notation

From the figures presented in this section, we can see that for the majority of matrices the offloading time does not influence the overall execution time significantly. Consistently with what we would expect, as the matrix dimensions increase, the share of offloading in the total execution time decreases. Furthermore, we notice that even with the fastest parallelisation methods, the total time for large matrices is dominated by computation rather than by offloading. This is because the offload is performed only once and the data is kept on Xeon Phi once offloaded, instead of being sent backwards and forwards throughout the execution of the program. Finally, we notice that the distribution of computation and offloading time does not vary significantly between the different parallelisation and vectorisation methods, which is what we would expect.

We can conclude that it is beneficial to use offloading when dealing with sufficiently large matrices. From the figures, matrices with dimensions just above 2000 by 2000 entries already show very little influence of the offloading time on the overall execution time. To illustrate this, we also present Figure 8-5, which shows the speed-up of the total execution time, that is, the execution time including offload, using the three parallelisation methods described in Chapter 5.

[Chart: speed-up (y-axis, 0 to 50) per benchmark for the ivdep pragma, the simd pragma, Intel Cilk array notation, and the host with n=16.]

Figure 8-5: Speed-up of benchmarks including offload time with different parallelisation techniques used

From Figure 8-5, we see that the speed-up with offload is still significant, and it is beneficial to offload the most computationally intensive operations. The offload time does not dominate the overall execution time, so we still observe a significant speed-up despite the offloading time adding to the overall execution time. When we compare Figure 8-1 and Figure 8-5, we see that the difference is marginal for most of the larger matrices, although it is still noticeable. Consequently, although performance could be improved if the offload were faster, we would still recommend offloading the most computationally intensive parts of LU factorisation to Xeon Phi to increase the overall performance of the application.

8.7 Speed-up on the host with different optimisation options

In this section, we present how the optimisation methods influence the performance of the code on the host processor – Intel Xeon. The program, with the optimisations described in the previous chapters introduced, was executed on the host. The speed-up results are presented in Figure 8-6.

[Chart: speed-up (y-axis, 0 to 70) per benchmark on the host for no optimisations (n=16), the ivdep pragma, the simd pragma, and Intel Cilk array notation.]

Figure 8-6: Speed-up on the host – Intel Xeon – with different optimisation methods and benchmarks

From Figure 8-6, we notice that the variability between the ivdep pragma and the simd pragma is much greater on Intel Xeon than on Xeon Phi, although both methods give a similar range of performance improvements. This shows that vectorisation has a smaller influence on overall performance on the Intel Xeon processor than on Xeon Phi. Having narrower vector units, the host does not benefit as much from the introduction of vectorisation pragmas. The improvement that does occur comes from the pragmas allowing the compiler to schedule memory accesses more efficiently; as a result, the loops run faster because less time is lost to cache misses. In Table 8-4, we present the average speed-up for sparse and dense benchmarks on Intel Xeon when using the optimisation methods, together with the maximum speed-up achieved.

Table 8-4: Overview of performance improvements due to optimisation methods on the host – Intel Xeon

                  No optimisations,   ivdep     simd      Intel Cilk
                  n=16                pragma    pragma    array notation
Dense average     7.42                10.3      9.73      16.7
Sparse average    10.8                13.0      15.1      40.8
Maximum           11.7                17.7      21.3      65.0

Moreover, we see a significant improvement in performance when using Intel Cilk array notation on the host. This comes from better scheduling of loops on threads and from the more efficient memory access patterns that the array notation makes available to the compiler. It is important to note that the optimisations introduced to improve the performance of Intel Xeon Phi have improved the performance of the host too. This shows that optimising and porting the code to Xeon Phi is also beneficial when trying to optimise the code on the host: optimising for Xeon Phi will in the majority of cases result in better performance on the host processor as well. This is a strong argument for attempting to port code to Xeon Phi. Even if performance is not improved by porting the code to Xeon Phi itself, we would still benefit from the improvements on the host.
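To make the three optimisation methods compared above concrete, the sketch below shows the inner elimination update of row i at pivot step k written in each style (illustrative only: the row-major matrix A, its size n, and the loop indices are placeholders following the layout described in Chapter 4, not the project’s exact source):

/* (a) ivdep: hint to the compiler that the loop carries no vector dependences. */
#pragma ivdep
for (int j = k + 1; j < n; j++)
    A[i*n + j] -= A[i*n + k] * A[k*n + j];

/* (b) OpenMP 4.0: request vectorisation with the simd construct. */
#pragma omp simd
for (int j = k + 1; j < n; j++)
    A[i*n + j] -= A[i*n + k] * A[k*n + j];

/* (c) Intel Cilk array notation: the whole update as one array section
 *     of length n-(k+1), starting at column k+1. */
A[i*n + (k+1) : n-(k+1)] -= A[i*n + k] * A[k*n + (k+1) : n-(k+1)];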

8.8 Running NICSLU on the host

In Table 8-5, we present the results of running NICSLU on the host processor using sparse matrices as benchmark inputs.

Table 8-5: Results of running NICSLU on the host

Benchmark matrix    Host, n=16
add32               4.33
circuit_1           5.62
circuit_2           2.65
coupled             16.0
init_adder1         2.65
rajat01             8.06

From Table 8-5, we can see that NICSLU performs substantially better on the host than the library we have used throughout the project. This is largely due to the specific design of NICSLU, which is aimed at performing sparse LU factorisation of electronic design automation matrices efficiently. It would be beneficial and interesting to port this code to Xeon Phi in the future in order to analyse NICSLU’s performance on Intel Xeon Phi and whether it outperforms the host when running the library.

Chapter 9

Summary and conclusions

In conclusion, the project succeeded in analysing and exploring the performance of Xeon Phi. Various methods of parallelisation and vectorisation were attempted, and the performance of LU factorisation with these methods was compared. The benefits of offloading parts of the code to Xeon Phi were examined and the speed-up was calculated.

The move towards accelerators and co-processors is becoming more evident. However, the approaches to programming these devices are not uniform. This project aimed to explore different methods of porting code, designed to solve a particular problem, to an accelerator. The application code was ported to Intel Xeon Phi, since Xeon Phi is aimed mainly at scientific and computational problems. The novel architecture of Intel Xeon Phi made the analysis of its performance noteworthy.

The LU factorisation code was successfully ported to Intel Xeon Phi. The primary model used was offloading the key computational functions, containing the loops that were subsequently parallelised and vectorised on Xeon Phi. We used OpenMP 4.0 pragmas, ivdep pragmas, Intel Cilk, and Intel LEO pragmas to offload and parallelise the code on Intel Xeon Phi. These attempts were successful and yielded a speed-up of execution of up to a factor of 49.

The difference in instruction set architectures between Intel Xeon and Intel Xeon Phi caused us to move to a different code base. Despite the move, we managed to obtain a significant improvement in the performance of the new library through the introduction of parallelisation.

We compared the parallelisation frameworks on several benchmarks of both dense and sparse matrices. This has shown that the performance of the algorithm, and of the optimisations applied when offloading the code to Xeon Phi, can fluctuate depending on the shape and size of each benchmark. Dense benchmark matrices showed a more uniform speed-up across various sizes, while sparse benchmark matrices showed more varied performance depending on the size of each benchmark matrix and its number of non-zero elements.

Furthermore, we confirmed our initial expectation that vectorisation is a very important factor in exploiting the performance of Intel Xeon Phi. The pragmas which enabled a high level of vectorisation produced larger performance improvements from offloading the code to the co-processor than those which did not fully utilise it.

The ability to collapse several loops using OpenMP pragma clauses has also been shown to improve performance, and suggests that this feature of OpenMP is beneficial when porting the code to Xeon Phi, as illustrated below.
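A minimal sketch of the collapse clause (illustrative only; the array C and the bounds n and m are placeholders, not code from the project):

/* collapse(2) fuses the two nested loops into a single iteration space,
 * so n*m iterations, rather than only n, are shared among the threads;
 * this matters on Xeon Phi, where well over two hundred threads need work. */
#pragma omp parallel for collapse(2)
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        C[i*m + j] = 2.0 * C[i*m + j];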

Overall, the best performance for optimising and porting the code to Xeon Phi was achieved with the use of Intel Cilk array notation together with Intel LEO pragmas for offloading. The array notation allowed the compiler to fully explore and take advantage of the parallelism present within the code. Intel Cilk code offloaded to Xeon Phi achieved a speed-up of up to 49 when compared with the single-threaded performance of the Intel Xeon processor, with an average speed-up across the benchmarks of 27. This shows that the use of Intel Cilk array notation and offloading the code to Xeon Phi can substantially improve the speed of LU decomposition.

However, Intel Cilk is a proprietary Intel technology. Although Intel Cilk Plus, including array notation, is available to some extent in versions of gcc beyond 4.7 [42], it is the closed-source Intel compilers that support this method best. Similarly, Intel Cilk code is less portable than cross-platform methods such as OpenMP. Therefore, the performance improvements obtained from OpenMP’s simd pragma were also interesting. We attempted to bring the performance of the OpenMP simd pragma as close as possible to that of Intel Cilk array notation. Nevertheless, some difference remains between the two, with the simd pragma achieving an average speed-up of 19 times over execution on one thread of Intel Xeon.

The project has shown that offloading to Xeon Phi can often improve the performance of the code substantially. At the same time, it is possible to perform other operations on the host or to execute heterogeneously on both the host and Xeon Phi to achieve even better performance.

Moreover, we examined the effect of offloading large matrices on the overall execution of the code on Xeon Phi. Since we used a scheme in which the data is offloaded only once, the influence of data transfer to and from the co-processor was marginal for sufficiently large matrices. The data transfer was not the bottleneck of the execution on Xeon Phi.

Finally, we noted that the improvements introduced to enhance performance on Xeon Phi improved performance on the host as well. Better parallelisation and vectorisation on Xeon Phi turned out to benefit vectorisation and parallelisation on the host, effectively speeding up execution on the host too.

The dissertation project has succeeded in exploring the performance of Xeon Phi and in showing that LU decomposition can benefit from offloading to the co-processor. We have shown that accelerators can be used to improve the performance of certain applications, and we have analysed different methods that allow for these improvements. We can see that in the future the move to many-core architectures, such as the new version of Intel Xeon Phi, Knights Landing, can result in improved performance if the algorithms are adequately adapted. Moreover, we have shown that the algorithms can be adapted using various techniques, some of which, such as OpenMP, do not require expert knowledge of the underlying hardware. We believe that the use of co-processors will keep increasing in the field of high performance computing, and that the use of various parallelisation methods to utilise co-processors efficiently will become inevitable.

9.1 Future work

There is a wide range of aspects in which further work on this project could be attempted. The options considered most promising for future work include:

• Porting the original NICSLU code to Xeon Phi by changing the library’s memory usage patterns, and so fixing the unaligned access issues in the current version. This would require contacting the original creators of the library and working with them to port the code. Porting the code to KNL, which would allow unaligned accesses, and examining the library’s performance there could also be attempted.

• Exploring the performance of Xeon Phi with a wider range of libraries, such as the gamma index used in nuclear medicine and radiology. This application is also heavily based on matrix operations, so it could potentially benefit from vectorisation and parallelisation.

• Investigating hybrid execution on Intel Xeon Phi and the host to further examine the performance of the co-processor in such a setup. The program could be run asynchronously on both the host and Xeon Phi (see the sketch after this list). This would enable us to potentially maximise the computing power available and thus increase performance. In a hybrid approach both host and accelerator perform the same calculations in parallel; the host processor’s cores are treated in the same way as the co-processor’s cores and execute threads of the application, but, being more efficient, their workload would have to be adjusted appropriately.

• Examining the energy consumption and efficiency of the co-processor in order to compare it against the efficiency of the host. This would allow us to see whether there are benefits to offloading other than the increased speed of execution of the code.

• Comparing the performance of Xeon Phi using cross-platform OpenMP pragmas with the performance of other accelerators, such as NVIDIA’s Kepler or AMD’s ATI Radeon. This would offer a wider view of the spectrum of accelerators and would show how beneficial Xeon Phi is. The energy performance of these accelerators and co-processors could also be analysed.

• Finally, attempting other methods of porting the code to Intel Xeon Phi, such as HAM (Heterogeneous Active Messages), OpenACC, or MPI, in order to explore their impact on performance when porting the code to Xeon Phi.
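A minimal sketch of how such a hybrid, asynchronous split might look with Intel LEO’s signal/wait clauses (the function names work_on_phi and work_on_host, and the way the work is split, are assumptions for illustration, not part of this project):

/* The co-processor and the host each process their own share of the work;
 * the offload returns immediately because of signal(), so the host can run
 * concurrently, and offload_wait joins the two before the results are used. */
__attribute__((target(mic))) void work_on_phi(double *A, int rows, int n);
void work_on_host(double *A, int rows, int n);

void hybrid_step(double *A_phi, int rows_phi,
                 double *A_host, int rows_host, int n)
{
    char done;                                   /* tag identifying the async offload */

    #pragma offload target(mic:0) signal(&done) \
            inout(A_phi : length(rows_phi * n))
    work_on_phi(A_phi, rows_phi, n);             /* runs asynchronously on the card   */

    work_on_host(A_host, rows_host, n);          /* host processes its share meanwhile */

    #pragma offload_wait target(mic:0) wait(&done)   /* join before reusing A_phi */
}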

Bibliography

[1] J. Hruska, “The death of CPU scaling: From one core to many — and why we’re still stuck,” ExtremeTech, 1 February 2012. [Online]. Available: http://www.extremetech.com/computing/116561-the-death-of-cpu-scaling-from-one-core-to-many-and-why-were-still-stuck/2. [Accessed 1 August 2015].
[2] TOP500, “China’s Tianhe-2 Supercomputer Maintains Top Spot on List of World’s Top Supercomputers,” TOP500, 13 July 2015. [Online]. Available: http://www.top500.org/blog/lists/2015/06/press-release/. [Accessed 10 August 2015].
[3] The Green500, “The Green500 List - June 2015,” CompuGreen, 1 August 2015. [Online]. Available: http://www.green500.org/news/green500-list-june-2015?q=lists/green201506. [Accessed 1 August 2015].
[4] X. Chen, Y. Wang and H. Yang, “NICSLU: An Adaptive Sparse Matrix Solver for Parallel Circuit Simulation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 2, pp. 261-274, 2013.
[5] RosettaCode.org, “LU decomposition,” RosettaCode.org, 9 January 2015. [Online]. Available: http://rosettacode.org/wiki/LU_decomposition. [Accessed 30 June 2015].
[6] P. Wallstedt, “Compact LU Factorization,” The University of Utah, [Online]. Available: http://www.sci.utah.edu/~wallstedt/LU.htm. [Accessed 30 June 2015].
[7] M. I. Dubaniowski, “Project Preparation Report,” The University of Edinburgh, Edinburgh, 2015.
[8] R. C. Johnson, “Archer Supercomputer Is Fastest in UK,” EE Times, 11 April 2014. [Online]. Available: http://www.eetimes.com/document.asp?doc_id=1321896. [Accessed 3 August 2015].
[9] E.-I. Farsarakis, “Energy Efficiency: Benefits and limitations of modern HPC architectures,” EPCC, The University of Edinburgh, Edinburgh, 2014.
[10] S. Tsang, “Characterising Bipartite Graph Matching Algorithms on GPUs,” EPCC, The University of Edinburgh, Edinburgh, 2014.
[11] C. S. Pelissier, “Climate Dynamics on the Intel Xeon Phi,” in NASA@SC13, Denver, 2013.
[12] Institute for Computational Cosmology, “Institute of Advanced Research Computing: Intel Parallel Computing Center,” Durham University, 27 May 2015. [Online]. Available: https://www.dur.ac.uk/iarc/intel/. [Accessed 30 July 2015].
[13] A. Haidar, P. Luszczek, S. Tomov and J. Dongarra, “Heterogenous Acceleration for Linear Algebra in Multi- Environments,” Lecture Notes in Computer Science, vol. 8969, pp. 31-42, 2014.

[14] P. Sao, X. Liu, R. Vuduc and X. Li, “A Sparse Direct Solver for Distributed Memory Xeon Phi-accelerated Systems,” in 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, 2015.
[15] A. S. Belsare, “Sparse LU Factorization for Large Circuit Matrices on Heterogenous Parallel Computing Platforms,” Texas A&M University, College Station, 2014.
[16] J. Fang, A. L. Varbanescu, H. Sips, L. Zhang, Y. Che and C. Xu, “Benchmarking Intel Xeon Phi to Guide Kernel Design,” Delft University of Technology, Delft, 2013.
[17] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf and S. Williams et al., “The landscape of parallel computing research: A view from Berkeley,” Citeseer, Berkeley, 2006.
[18] M. Bordawekar and R. Baskaran, “Optimizing sparse matrix-vector multiplication on GPUs,” IBM, 2009.
[19] A. Gray, “Performance Portability,” in GPU Computing Seminar, Sheffield, 2015.
[20] I. Bethune and F. Reid, “Evaluating CP2K on Exascale Hardware: Intel Xeon Phi,” PRACE, Edinburgh, 2014.
[21] I. Bethune and F. Reid, “Optimising CP2K for the Intel Xeon Phi,” PRACE, Edinburgh, 2014.
[22] D. K. Berry, J. Schuchart and R. Henschel, “Experiences Porting a Molecular Dynamics Code to GPUs on a Cray XK7,” in Cray User Group, Napa Valley, 2013.
[23] S. U. Thiagarajan, C. Congdon, S. Naik and L. Q. Nguyen, “Intel Xeon Phi Coprocessor Developer's Quick Start Guide v. 1.5,” Intel Corporation, Santa Clara, 2013.
[24] G. Chrysos, “Intel Xeon Phi Coprocessor - the Architecture,” Intel Corporation, 12 November 2012. [Online]. Available: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner. [Accessed 25 August 2015].
[25] F. Zarrinfar, “Optimizing Embedded Memory for the Latest ASIC and SOC Designs,” Chip Design, 10 June 2013. [Online]. Available: http://chipdesignmag.com/display.php?articleId=5260. [Accessed 9 August 2015].
[26] T. P. Morgan, “More Knights Landing Xeon Phi Secrets Unveiled,” The Platform, 25 March 2015. [Online]. Available: http://www.theplatform.net/2015/03/25/more-knights-landing-xeon-phi-secrets-unveiled/. [Accessed 9 August 2015].
[27] M. Barraza-Rios, “Gaussian Elimination & LU Decomposition,” The University of Texas at El Paso, El Paso, TX, 2011.
[28] J. Mahaffy, “Gauss Elimination and LU Decomposition,” Pennsylvania State University, 1997. [Online]. Available: http://www.personal.psu.edu/jhm/f90/201.html. [Accessed 9 August 2015].
[29] Intel Developer Zone, “ivdep,” Intel Corporation, 2015. [Online]. Available: https://software.intel.com/en-us/node/524501. [Accessed 30 July 2015].

[30] OpenMP Architecture Review Board, “OpenMP Application Program Interface,” OpenMP Architecture Review Board, 2013.
[31] R. W. Green, “OpenMP* Loop Scheduling,” Intel Corporation, 29 August 2014. [Online]. Available: https://software.intel.com/en-us/articles/openmp-loop-scheduling. [Accessed 9 August 2015].
[32] A. D. Robison, “SIMD Parallelism using Array Notation,” Intel Developer Zone, 3 September 2010. [Online]. Available: https://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/?wapkw=array+notation. [Accessed 30 July 2015].
[33] P. Kennedy, “Intel Xeon Phi 5110P Coprocessor – Many Integrated Core Unleashed,” Serve The Home (STH), 13 November 2012. [Online]. Available: http://www.servethehome.com/introducing-intel-xeon-phi-5110p-coprocessor- -integrated-core-unleased/. [Accessed 16 August 2015].
[34] S. Cepeda, “Optimization and Performance Tuning for Intel Xeon Phi,” Intel Corporation, 12 November 2012. [Online]. Available: https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding. [Accessed 31 July 2015].
[35] J. Fenlason, “GNU gprof,” Free Software Foundation, Inc., November 2008. [Online]. Available: https://sourceware.org/binutils/docs/gprof/. [Accessed 30 July 2015].
[36] K. Davis, “Data transfer of an “array of pointers” using the Intel Language Extensions for Offload (LEO) for the Intel Xeon Phi coprocessor,” Intel Corporation, 22 August 2014. [Online]. Available: https://software.intel.com/en-us/articles/xeon-phi-coprocessor-data-transfer-array-of-pointers-using-language-extensions-for-offload. [Accessed 30 July 2015].
[37] OpenACC-standard.org, “The OpenACC Application Programming Interface,” OpenACC-standard.org, 2013.
[38] M. Noack, “HAM - Heterogenous Active Messages for Efficient Offloading on the Intel Xeon Phi,” The Zuse Institute Berlin (ZIB), Berlin, 2014.
[39] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1:1-1:25, 2011.
[40] R. Boisvert, R. Pozo and K. Remington, “The Matrix Market Exchange Formats: Initial Design,” National Institute of Standards and Technology, Gaithersburg, 1996.
[41] Intel Developer Zone, “Intel® C++ Compiler in Intel Parallel Studio XE,” Intel Corporation, [Online]. Available: https://software.intel.com/en-us/c-compilers/ipsxe. [Accessed 30 July 2015].
[42] B. Tannenbaum, “Cilk Plus Array Notation for C Accepted into GCC Mainline,” Intel Corporation, 5 June 2013. [Online]. Available: https://software.intel.com/en-us/articles/cilk-plus-array-notation-for-c-accepted-into-gcc-mainline. [Accessed 30 July 2015].
