Programming for the Xeon Phi

Michael Florian Hava, RISC Software GmbH

The Road to Xeon Phi

Intel Pentium (1993–1995)

• The Pentium was the first superscalar x86
  – No out-of-order execution!
  – Predates all SIMD extensions

• 1994: P54C
  – 75–100 MHz
  – The core design is reused in several research projects, including the Xeon Phi architecture!

Tera-Scale Computing (2006–)

• Research project to design a TFLOP CPU

• 2007: Teraflops Research Chip ("Polaris")
  – 80 cores (96-bit VLIW)
  – 1 TFLOP @ 63 W

• 2009: Single-chip Cloud Computer
  – 48 cores (x86)
  – No cache coherence

Larrabee (2009)

• A fully programmable GPGPU based on x86
  – Software renderer for OpenGL, DirectX, …

• 32–48 cores
  – 4-way Hyper-Threading
  – Cache coherence
  – 512-bit vector registers [LRBni]

• Planned product release: 2009–2010

Many Integrated Core (2010–)

• 2010: Knights Ferry (prototype)
  – 32 cores @ 1.2 GHz
  – 4-way Hyper-Threading
  – 512-bit vector registers [???]
  – 0.75 TFLOPS @ single precision

• 2012: Knights Corner [Xeon Phi]
  – 57–61 cores @ 1.0–1.2 GHz
  – 4-way Hyper-Threading
  – 512-bit vector registers [KNC]
  – 1 TFLOP @ double precision

Xeon Phi

• Calculating peak Xeon Phi FLOPS:
  – FLOPS = #cores × clock rate (GHz) × SIMD vector width × fp-ops per instruction
  – An FMA counts as 2 floating-point operations (and takes 1 cycle)

• SIMD vector width:
  – Single precision: 16 elements
  – Double precision: 8 elements

57 × 1.1 × 16 × 2 → ~2 TFLOPS (single precision)
57 × 1.1 × 8 × 2 → ~1 TFLOPS (double precision)
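The same arithmetic as a minimal C++ sketch; the 57-core, 1.1 GHz figures match the entry-level Knights Corner numbers above, other SKUs differ:

#include <cstdio>

int main() {
    const int    cores     = 57;  //entry-level Knights Corner (assumption, varies per SKU)
    const double clock_ghz = 1.1; //clock rate in GHz (assumption, varies per SKU)
    const int    fma_ops   = 2;   //FMA = multiply + add in one cycle
    const int    sp_width  = 16;  //512-bit registers / 32-bit floats
    const int    dp_width  = 8;   //512-bit registers / 64-bit doubles

    std::printf("peak single: %.1f GFLOPS\n", cores * clock_ghz * sp_width * fma_ops);
    std::printf("peak double: %.1f GFLOPS\n", cores * clock_ghz * dp_width * fma_ops);
}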

The Future

• 2015: Knights Landing
  – Up to 72 Airmont cores
  – 4-way Hyper-Threading
  – 512-bit vector registers [AVX-512]
    • Support for existing x86 extensions
  – 3 TFLOPS @ double precision
  – Available both as co-processor and as standalone CPU

Programming for the Xeon Phi

Supported Technologies

[Diagram: the identical technology stack is available on both the host CPU and the MIC – Tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL. An executable runs natively on either side; the two are connected via PCIe, over which offloading takes place.]

Execution Models

• Native execution
  – Copy the cross-compiled executable (e.g. built with the Intel compiler's -mmic switch) to the Phi
  – Local Linux-based OS
    • Almost completely independent from the host system

• Offloading
  – Implicit/automatic
  – Explicit/manual

• Message passing (MPI)
  – Phi as cluster or node

Offloading

• Similar to the execution model of GPGPUs

• Technologies
  – OpenCL
  – Intel Offload Extension
  – OpenMP 4.0

• The Xeon Phi reserves one core in offload mode

Memory Models

• Explicit/distinct
  – Identical to the GPGPU model
  – Memory has to be copied
  – Limited to primitive types

• Implicit/virtual shared
  – Simulates a shared memory model
  – Supports complex data structures
  – Only available in C/C++ (see the sketch below)
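A minimal sketch of the implicit model, assuming the Intel compiler's proprietary offload extension (the _Cilk_shared/_Cilk_offload keywords); sum_data and the data layout are hypothetical:

//variables and functions marked _Cilk_shared live at the same virtual
//address on host and coprocessor; the runtime synchronizes the pages
_Cilk_shared int data[1000];

_Cilk_shared int sum_data() {
    int sum = 0;
    for(int i = 0; i < 1000; ++i)
        sum += data[i];
    return sum;
}

int main() {
    for(int i = 0; i < 1000; ++i)
        data[i] = i;                    //written on the host
    int sum = _Cilk_offload sum_data(); //executed on the Xeon Phi
    return sum == 999 * 1000 / 2 ? 0 : 1;
}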

OpenMP 4.0


• Released in July 2013

• Introduces several new concepts
  – Accelerators
  – SIMD
  – Thread teams

• Partially abandons the pure shared memory model
  – Accelerators contain local memory
  – Explicit memory & computation offloading (see the sketch below)
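A minimal sketch of how the three concepts combine; vector_add, a, b, c, and n are hypothetical, the combined construct is standard OpenMP 4.0:

void vector_add(const float * a, const float * b, float * c, int n) {
    //offload to the accelerator, spawn a league of thread teams,
    //distribute the iterations among them, and vectorize the loop
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp teams distribute parallel for simd
    for(int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}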

Tagging for Offload

• Types, functions, variables, etc. that should be available on the accelerator have to be tagged

#pragma omp declare target
//contains the last error of calculate
extern int last_error;

struct result_t { float sum, avg, min, max; };

result_t calculate(const float * array, int count);
#pragma omp end declare target

Offloading Computations and Data

• Offloading of computation and data is controlled via pragmas

float * array = new float[N];
std::generate_n(array, N, rand); //initialize array

//create device data context and start computation
result_t result;
#pragma omp target map(to: array[0:N], N)\
                   map(from: result, last_error)
result = calculate(array, N);

print(result, last_error);
delete[] array;

"Explicit" Device Data Context Management

#pragma omp target data map(to: N)
{ //device data context exists for this scope
    float * array = new float[N];
    std::generate_n(array, N, rand); //initialize array

    #pragma omp target update to(array[0:N])

    result_t result;
    #pragma omp target map(from: result)
    result = calculate(array, N);

    #pragma omp target update from(last_error)
    print(result, last_error);
    delete[] array;
}

Intel OpenMP Extensions (KMP)

Environment Variables (Xeon Phi)

• The Xeon Phi environment uses a custom prefix
  – MIC_ENV_PREFIX=##MIC##
  – Variables with this prefix are copied to the Xeon Phi at offload (with the prefix stripped)

• Examples
  – ##MIC##_OMP_NUM_THREADS
  – ##MIC##_KMP_DETERMINISTIC_REDUCTION
  – ##MIC##_KMP_AFFINITY

Deterministic OpenMP Reductions

• OpenMP does not specify the order in which the partial sums are combined!
  – Since floating-point addition is not associative, results are not reproducible across runs!

std::vector<float> arr = get_input();
float sum = 0;
#pragma omp parallel for reduction(+: sum)
for(std::size_t i = 0; i < arr.size(); ++i)
    sum += arr[i];

• KMP_DETERMINISTIC_REDUCTION=1
  – Ensures reproducible results (given the same number of threads)
  – Slight performance impact

Thread Affinity

• The distribution of OpenMP threads across cores is implementation-defined

• KMP_AFFINITY allows control of the distribution across sockets, CPUs, cores, and hyper-threads (see the probe below)
  – compact: place threads on the closest possible cores
  – scatter: distribute threads evenly among all cores
  – balanced: distribute threads evenly among all cores, but keep "close" threads on cores that are as close as possible
    • only available on the Xeon Phi
    • recommended mode
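To verify where the runtime actually placed the threads, a small probe along these lines can help; a minimal sketch assuming GNU/Linux, where sched_getcpu() reports the hardware thread the caller currently runs on:

#include <cstdio>
#include <omp.h>
#include <sched.h> //sched_getcpu, GNU/Linux only

int main() {
    //run with e.g. KMP_AFFINITY=compact, scatter, or balanced and compare
    #pragma omp parallel
    {
        std::printf("OpenMP thread %2d runs on hardware thread %2d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
}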

Thread Affinity on a Multicore CPU

System: 4 cores, 2 hyper-threads per core, 4 OpenMP threads

           Core0      Core1      Core2      Core3
           HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1
compact    0    1     2    3     -    -     -    -
scatter    0    -     1    -     2    -     3    -

Thread Affinity on a Dual-Socket System

System: 2 sockets, 4 cores per socket, 2 hyper-threads per core, 8 OpenMP threads

           Socket0                                     Socket1
           Core0      Core1      Core2      Core3      Core0      Core1      Core2      Core3
           HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1   HT0  HT1
compact    0    1     2    3     4    5     6    7     -    -     -    -     -    -     -    -
scatter    0    -     4    -     2    -     6    -     1    -     5    -     3    -     7    -

Thread Affinity on the Xeon Phi

Phi: 6 cores, 4 hyper-threads per core, 12 OpenMP threads (A = 10, B = 11)

           Core0            Core1            Core2            Core3            Core4            Core5
           HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3  HT0 HT1 HT2 HT3
compact    0   1   2   3    4   5   6   7    8   9   A   B    -   -   -   -    -   -   -   -    -   -   -   -
scatter    0   6   -   -    1   7   -   -    2   8   -   -    3   9   -   -    4   A   -   -    5   B   -   -
balanced   0   1   -   -    2   3   -   -    4   5   -   -    6   7   -   -    8   9   -   -    A   B   -   -

Thank You!
(Photo: Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn; Valais, Switzerland)

www.risc-software.at
